Teaching Wiki — Evidence-Based Pedagogy

This wiki captures what the research actually says about effective teaching — particularly in computational and project-based courses. It is built around a core problem: student evaluations are a noisy, biased proxy for teaching quality, and better instruments are needed.

Core Tension

Conditions that produce the best learning outcomes consistently receive the worst evaluations in the short term. This is not opinion — it is a documented and replicated finding (Bjork, 1994; Deslauriers et al., 2019).

📊 What Predicts Effectiveness
Hattie's meta-analysis findings. What actually moves the needle.

Desirable Difficulties
Why harder conditions produce better retention despite worse evaluations.

🔬 Concept Inventories
Diagnostic instruments for computational courses. How to build them.

🧠 Measuring Mental Models
Techniques that reveal whether students built real understanding.

🤖 The AI Problem
How AI usage disrupts traditional assessment and what to do about it.

🔇 The Silence-Confusion Gap
Why students don't ask questions when confused — and what actually works.

🛠️ Technique Library
Ready-to-run collaborative techniques with exact protocols and failure modes.

⚙️ Course Systems
What to set up before semester starts so techniques run without friction.

🔩 Activity Compliance Design
Why activities collapse into project time and how to structurally prevent it.

📡 Engagement Systems
Whole-class systems that shift the default from passive to active — with specific protocols.

🏗️ Project-Based Learning
What the evidence actually says — including where it fails — and how to scaffold it properly.

🧭 Picking Your Systems
How to choose and sequence systems without overloading yourself or your students.

Last updated: April 2026 · Add entries by describing what you've learned
Provenance Warning on Hattie

Hattie's effect sizes are averages across many heterogeneous studies. The label "feedback" or "teacher clarity" aggregates dozens of different interventions with different protocols. The effect sizes in the table below are real but the labels do not tell you what to do. Use the primary studies linked to each factor, not the meta-analytic label, when designing instruction.

Effect Sizes by Factor — With Exact Study Protocols

Feedback quality (d = 0.73)
Hattie & Timperley (2007) reviewed 196 studies. The highest-effect feedback was task-level and process-level — e.g. "your loop logic is correct but you are not resetting the accumulator between calls" — delivered within the same session. Feedback on self-regulation ("you stopped too early") was moderate. Praise ("good job") had near-zero effect on learning outcomes. The critical variable was specificity and proximity to the error, not warmth or encouragement.

Teacher clarity (d = 0.75)
Hattie's aggregated effect size. The claim that clarity means explaining "why" rather than just demonstrating a procedure is supported by research on pedagogical content knowledge (Shulman, 1986 — teachers with deeper subject knowledge produce better student transfer) but the specific Rowan et al. (2004) attribution used previously is imprecise — their work covers instructional quality broadly. Provenance note: treat as Hattie aggregate + Shulman's PCK framework rather than a single study finding.

Cognitive activation (d = 0.60)
Hattie's aggregated effect size across studies coding for higher-order questioning. Lipowsky and colleagues have published on instructional quality in mathematics (e.g. Lipowsky et al., 2009, ZDM) but the specific 0.6 SD effect on transfer attributed to a single Lipowsky study previously was imprecise — this effect size is Hattie's aggregate. Provenance note: the finding that cognitive activation effects appear on transfer tests more than recall tests is real but comes from the aggregate, not a single study. Cite Hattie (2009) for the effect size.

Teacher-student relationship (d = 0.52)
Cornelius-White (2007) meta-analysis of 119 studies. Operationalized as student-reported perceived support AND perceived high expectations simultaneously — not warmth alone. The "warm demander" concept appears in education literature from at least Kleinfeld (1975) and Irvine & Fraser (1998) — not solely Ware (2006) as cited previously. Ware's work is relevant but the concept has an earlier lineage.

Enthusiasm (d = 0.43)
Hattie's aggregated effect size. The specific claim that enthusiasm effects are mediated by engagement and matter more for aversive content is a reasonable inference from motivation research generally but was attributed previously to "Patrick et al. (2000)" with more specificity than is warranted. Provenance note: treat enthusiasm effect as Hattie aggregate; the mediation claim is inferential.

Affability / likeability (d ~ 0.20)
Marsh (1987) and subsequent work: student likeability ratings of instructors correlate strongly with satisfaction surveys (r~0.6) but weakly with objective learning outcomes (r~0.1–0.2). The correlation with outcomes largely disappears when course difficulty and prior student ability are controlled. Likeability predicts whether students recommend the course, not whether they learn from it.

The Relationship Factor — What "Warm Demander" Actually Means

The "warm demander" concept appears in education literature from at least Kleinfeld (1975) and was developed further by Irvine & Fraser (1998). It describes teachers who hold students to high standards while explicitly communicating belief in their ability. The operationalization: teachers returned work with detailed corrections rather than just grades, used re-do standards, and named their confidence in the student explicitly. Provenance note: Ware (2006) is one relevant source but not the origin of the concept — earlier citations are more accurate if making a formal claim.

Implication for Computational Courses

Returning a failing notebook with "see me" is not warm demanding. Returning it with "your indexing logic breaks here for this specific reason — fix this, resubmit by Thursday, you can do this" is. The specific correction plus the explicit confidence signal is the mechanism.

On Modeling Expert Uncertainty

Chi's research on tutoring (multiple papers 1989–1996) documents that expert tutors elicit more self-explanation from students and are more responsive to student errors than novice tutors. The specific claim that expert tutors explicitly model uncertainty and reason aloud is an inference from this body of work rather than a direct measured finding of a single study. The more precise claim supported by Chi's work is that tutors who respond to student errors with probing questions rather than direct correction produce better learning outcomes. Provenance note: the "modeling uncertainty" framing is a reasonable pedagogical inference, not a direct experimental finding to cite.

Primary sources: Hattie & Timperley (2007) Review of Educational Research · Cornelius-White (2007) Review of Educational Research · Kleinfeld (1975) and Irvine & Fraser (1998) for warm demander concept · Shulman (1986) for pedagogical content knowledge · Chi et al. multiple publications 1989–1996 for tutoring research

The Core Problem

Students consistently mis-predict what will help them learn. They rate easier conditions as more effective even when performance data shows the opposite. This is not a matter of opinion — it has been measured directly in controlled studies where students rate their learning in one condition and are then tested in both. The subjective experience of learning is a systematically poor guide to actual learning.

Storage Strength vs Retrieval Strength — The Bjork Framework

Bjork & Bjork (1992) distinguish between storage strength (depth of encoding in long-term memory) and retrieval strength (ease of access right now). These are dissociable and during learning they are inversely correlated. Conditions that make retrieval feel easy during practice boost retrieval strength without building storage strength. The implication: feeling like you know it and having durably encoded it are independent, and easy conditions maximize the former at the cost of the latter.

DD-01 — Generation Effect

Exact Protocol (Slamecka & Graf, 1978 — the founding study)

Participants saw word pairs like "hot — ?" and either read the complete pair ("hot — cold") or generated the second word from a rule ("hot — c___"). On a later free recall test, generated words were recalled at significantly higher rates than read words. Effect has been replicated hundreds of times across materials ranging from word pairs to mathematical procedures to code.

Computational course application: Students who write a function from a specification learn it more durably than students who read a completed function, even when reading time equals writing time. The generation attempt is the mechanism — it is not about time on task.

Important constraint: The generation effect requires that students have enough prior knowledge to attempt generation meaningfully. If they have zero relevant knowledge, generation produces confusion rather than encoding. The sweet spot is tasks that are within reach but require effortful retrieval of partial knowledge.

DD-02 — Spacing Effect

Exact Protocol (Cepeda et al., 2006 — large-scale review)

Cepeda et al. reviewed 254 studies with 14,000+ participants. The specific finding: for a retention interval of 1 week, an inter-study gap of 1 day produced better recall than a gap of 0 days (massed). For a retention interval of 1 month, an optimal gap of approximately 1 week produced the best recall. The optimal spacing gap scales with the desired retention interval — roughly 10–20% of the time until test.
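The 10–20% heuristic is concrete enough to script when planning a syllabus. A minimal sketch, assuming only the Cepeda et al. heuristic; the function name and the 15% midpoint are illustrative choices, not values from the paper:

    def review_gap_days(days_until_test, fraction=0.15):
        """Gap between first exposure and revisit, per the Cepeda et al.
        heuristic: roughly 10-20% of the desired retention interval."""
        return max(1, round(days_until_test * fraction))

    # A concept that must survive until a final exam 70 days out should be
    # revisited roughly 10 days after first exposure:
    print(review_gap_days(70))  # -> 10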

Computational course application: A concept introduced in Module 2 should appear as a building block in Module 4 and as a prerequisite check in Module 6, not be exhaustively practiced in Module 2 and never returned to. The returns on revisiting are largest when the gap is long enough that retrieval requires effort.

What spacing is not: Spacing is not "covering the topic again" in the same way. It is requiring retrieval of the concept in a new context after a gap sufficient to produce some forgetting. Re-explaining is not spacing. Requiring application after a gap is.

DD-03 — Interleaving

Exact Protocol (Rohrer & Taylor, 2007)

Students practiced mathematics problems in either blocked order (all problems of type A, then all of type B, then all of type C) or interleaved order (A, B, C, A, B, C). During practice, blocked students performed better. On a test one week later, interleaved students outperformed blocked students by a substantial margin. The effect is specifically on discrimination — knowing which procedure to apply when — not on executing any individual procedure.

Computational course application: A problem set that mixes data cleaning, model selection, and evaluation problems forces students to identify what kind of problem they are solving before they solve it. A problem set that groups all data cleaning together, then all model selection, removes that identification step — which is exactly the step that matters in real projects.

Why students hate it: Interleaved practice produces more errors during practice and feels less productive. Students and instructors both tend to interpret more errors as evidence the approach is not working. The errors are the mechanism. Discriminating between problem types is what is being practiced.
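Converting blocked problem banks into an interleaved set is mechanical, so the extra authoring cost is near zero. A minimal sketch, assuming problems are already grouped by type; the round-robin ordering is one reasonable way to realize the Rohrer & Taylor condition, not their exact materials:

    from itertools import zip_longest

    def interleave(*banks):
        """Round-robin merge of per-type problem banks: A, B, C, A, B, C, ..."""
        return [p for rnd in zip_longest(*banks) for p in rnd if p is not None]

    cleaning = ["clean-1", "clean-2"]
    modeling = ["model-1", "model-2"]
    evaluation = ["eval-1", "eval-2"]

    print(interleave(cleaning, modeling, evaluation))
    # ['clean-1', 'model-1', 'eval-1', 'clean-2', 'model-2', 'eval-2']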

DD-04 — Testing Effect (Retrieval Practice)

Exact Protocol (Roediger & Karpicke, 2006 — the key study)

Students read a prose passage about a scientific topic. Group 1: read it once. Group 2: read it four times in the same session. Group 3: read it once, then took three free-recall tests (writing everything they could remember, no feedback). Five minutes later, Group 2 recalled most. One week later, Group 3 recalled 50% more than Group 2. Group 1 recalled least at both delays.

The mechanism: Each retrieval attempt strengthens and reorganizes the memory trace in ways that re-reading does not. Retrieval also surfaces gaps — students discover what they cannot recall, which is itself diagnostic. The test does not need to be graded or formal. The act of attempted retrieval is what produces the effect.

Exact implementation for a coding module: At the start of a session, before reviewing prior material, ask students to write down everything they remember about the previous concept and try to reconstruct a function without notes. Three minutes, not graded, not collected. This alone outperforms spending those three minutes reviewing the prior material.

The Deslauriers Study — The Cleanest "Active Learning" Test

Exact Protocol (Deslauriers et al., 2019, PNAS)

Two sections of the same introductory physics course at a large university. One section received traditional lecture from an experienced, highly-rated instructor. The other received "active learning" — specifically: students were given pre-class reading, class time was spent on peer instruction (students answered clicker questions, discussed with neighbors, re-answered). Same content, same assessments. Active learning section scored significantly higher on a post-class test. Crucially: students in the active learning section rated their experience as less effective and reported feeling they learned less. The active learning condition used was specifically peer instruction with clicker questions and structured neighbor discussion — not "active learning" generically.

For Computational Courses

Write-the-code-yourself maps onto the generation effect. Spaced problem sets map onto spacing. Mixed problem types map onto interleaving. Start-of-session cold recall maps onto the testing effect. These are four distinct mechanisms with distinct protocols — not one thing called "active learning."

Primary sources: Bjork & Bjork (1992) · Slamecka & Graf (1978) Journal of Experimental Psychology · Cepeda et al. (2006) Psychological Bulletin · Rohrer & Taylor (2007) European Journal of Cognitive Psychology · Roediger & Karpicke (2006) Psychological Science · Deslauriers et al. (2019) PNAS

The Core Finding on Validity

Exact Finding (Uttl et al., 2017)

Uttl, White & Gonzalez conducted a meta-analysis of 97 studies examining the correlation between student evaluation of teaching (SET) scores and objective learning outcomes (exam performance, blind-graded assessments). The mean correlation was r = 0.00 to 0.10 — effectively zero — after correcting for the small sample sizes that inflated earlier estimates. Earlier studies showing moderate correlations (r~0.4) used small samples where the correlation was noise. The large-sample studies consistently show near-zero correlation. SETs measure satisfaction, not learning.
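The small-sample inflation is easy to see by simulation. This is an illustrative sketch, not an analysis from the paper: it draws two independent variables (true r = 0) and shows how large the observed correlation typically is at different sample sizes.

    import numpy as np

    rng = np.random.default_rng(0)

    def mean_spurious_r(n, trials=2000):
        """Mean |r| between two independent variables over repeated samples."""
        rs = [abs(np.corrcoef(rng.standard_normal(n),
                              rng.standard_normal(n))[0, 1])
              for _ in range(trials)]
        return float(np.mean(rs))

    for n in (15, 50, 500):
        print(f"n = {n:3}: mean |r| under true r = 0 is {mean_spurious_r(n):.3f}")
    # Small-n studies routinely see |r| of 0.2-0.4 by chance alone;
    # at n = 500 the same statistic collapses toward zero.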

Known Confounds — With Specific Evidence

CONF-01
Ease Bias — Exact Finding

Clayson (2009) reviewed studies across business, science, and humanities courses. Higher expected grades reliably predicted higher SET scores independent of actual learning. Carrell & West (2010) used a natural experiment at the US Air Force Academy where students were randomly assigned to instructors: instructors who produced better performance on follow-on courses received lower contemporary evaluations. The easy-grader / higher-evaluation effect was approximately 0.2 SD per letter grade of grade inflation.

CONF-02
Selection Bias in Optional Surveys

Porter et al. (2004) compared voluntary versus mandatory response conditions for course evaluations. Voluntary response rates under 50% produced mean scores 0.3–0.5 points higher (on a 5-point scale) than mandatory response conditions for the same courses. The students most likely to respond voluntarily are those with strong opinions — extreme satisfaction or extreme dissatisfaction. Students in the middle, who are the majority, are systematically underrepresented in low-response surveys.

CONF-03
Contrast Effects — Exact Mechanism

Students do not evaluate in absolute terms — they evaluate relative to their most recent comparison point. A rigorous module following an easier one will receive lower ratings than the same module taught in isolation or following an equally rigorous one. This is a well-documented anchoring effect in judgment research (Kahneman & Miller, 1986 norm theory) applied to course evaluation. The implication: your evaluation scores are partially a function of what your co-instructor teaches, not only what you teach.

CONF-04
Retrospective Framing

Temporal distance from a difficult experience systematically changes its evaluation — people reconstruct meaning rather than recall feeling. Wilson & Ross (2001, PSPB) studied autobiographical memory reconstruction, showing people reframe past experiences to support current self-views. The application to course evaluations is a reasonable inference from this work rather than a direct finding about SET timing. Provenance note: the claim that temporal distance improves evaluation of hard courses is plausible and consistent with the memory literature, but a direct study of SET scores versus time-since-course is not being cited here. Flag as inference if making a formal claim.

What Evaluations Are Actually Useful For

Despite low validity for measuring learning, SETs do reliably measure a few things worth knowing: whether students felt they understood the material (not whether they did), whether they felt the instructor was organized, and whether they felt respected. These are not nothing — perceived organization and perceived respect predict engagement, which is a precursor to learning even if not equivalent to it. The mistake is using satisfaction as a proxy for learning, not measuring satisfaction at all.

Better Survey Questions — The Design Principle

Evaluative questions produce sentiment. Behavioral and diagnostic questions produce information. The distinction comes from survey methodology research on self-report validity — asking what people did is more reliable than asking how they felt about something.

❌ "Was this module too fast?" — produces comparison-based sentiment

✅ "Describe a moment where you felt lost and what you did next." — produces behavior data

❌ "Did you feel supported?" — produces a global judgment

✅ "What would you need to complete this independently that you don't currently have?" — produces a specific gap

❌ "Rate the instructor's clarity 1–5." — produces a number with no diagnostic value

✅ "In one sentence, explain the main idea of this module in your own words." — is itself a comprehension check

Primary sources: Uttl, White & Gonzalez (2017) Studies in Educational Evaluation · Clayson (2009) Journal of Marketing Education · Carrell & West (2010) Journal of Political Economy · Porter et al. (2004) Research in Higher Education · Wilson & Ross (2001) Personality and Social Psychology Bulletin

Near Transfer vs Far Transfer — Why the Distinction Matters

Barnett & Ceci (2002) reviewed the transfer literature and found that near transfer (same surface features, same underlying structure) and far transfer (different surface features, same underlying structure) are empirically separable — a student can show near transfer without far transfer. Most in-class checks and many project rubrics measure near transfer only. Genuine mental models support far transfer. Surface pattern-matching does not. The practical implication: if your project prompts use the same data structure and the same problem framing as your module examples, you are measuring near transfer and will overestimate how much students have actually learned.

The Illusion of Explanatory Depth — Exact Finding

Rozenblit & Keil (2002) — Exact Protocol

Participants rated their understanding of how everyday devices work (toilets, zippers, cylinder locks) on a scale of 1–7. They then attempted to write detailed mechanistic explanations. After writing, they re-rated their understanding. Ratings dropped substantially after the writing attempt — participants discovered they understood far less than they believed. The effect was specific to mechanistic explanations and did not occur for factual knowledge or procedural knowledge. The implication for teaching: students who nod along during a clear explanation are not withholding confusion. They genuinely do not yet know they are confused — because the illusion only breaks when they attempt to explain or apply.

Techniques That Reveal Real Understanding — With Protocols

MM-01
Two-Stage Retrieval Check — Protocol

Immediately after explaining a concept, students write a one-sentence explanation in their own words, no notes, not graded. The theoretical basis is the generation effect (Slamecka & Graf, 1978) — producing an output requires constructing a representation rather than recognizing one — combined with the write-to-learn literature showing that writing-to-explain (as distinct from expressive or reflective writing) engages elaborative processing. Provenance note: Klein (1999) studied expressive writing for emotional processing, not paraphrase-for-comprehension — that was a mis-citation in earlier versions. The more accurate citations for writing-as-elaboration are Langer & Applebee (1987) and the broader writing-across-the-curriculum research. Students who produce an incoherent paraphrase have revealed a model gap that "did everyone understand?" would have missed entirely.

MM-02
Prediction Tasks — Protocol

Chi & VanLehn (1991) showed that self-explanation and prediction during worked examples produced significantly better transfer than reading the same examples passively. The specific protocol: before revealing output or a result, ask students to write a prediction and a one-sentence justification. The justification is the diagnostic — students who predict correctly for the wrong reason are revealed, whereas a correct prediction alone is not informative. In code contexts: show a function, ask students to predict output for a specific input and say why before running it.
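One hypothetical probe in this style, for Python specifically; the mutable-default-argument pitfall is an illustrative target of my own choosing, not an example from Chi & VanLehn:

    def append_to(element, target=[]):
        # The default list is created once, at function definition time,
        # and shared across calls that omit the argument.
        target.append(element)
        return target

    # Before running: predict the output of each call and justify it.
    print(append_to(1))  # a "fresh default each call" model predicts [1] -- correct here
    print(append_to(2))  # the same model predicts [2]; the actual output is [1, 2]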

MM-03
Targeted Wrong Answer Probing — Protocol

Based on Mazur's peer instruction method (Mazur, 1997): present a conceptual question with multiple choice answers where distractors correspond to specific known misconceptions. Students vote individually (anonymously via clicker or raised hands with eyes closed). If 30–70% choose the wrong answer, students discuss with a neighbor for 2 minutes, then re-vote. This range is the target — below 30% wrong means the question is too easy; above 70% wrong means students don't have enough knowledge to benefit from peer discussion. The re-vote data tells you whether peer discussion resolved the misconception or whether instructor intervention is needed.
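Because the 30–70% band is a decision rule, it can be scripted against a clicker export. A minimal sketch of the decision logic, assuming you already have vote counts; the function name and the returned strings are illustrative:

    def peer_instruction_next_step(n_correct, n_total):
        """Decide what follows the first vote under Mazur's protocol."""
        frac_wrong = 1 - n_correct / n_total
        if frac_wrong < 0.30:
            return "question too easy: move on or escalate difficulty"
        if frac_wrong > 0.70:
            return "too little knowledge for peer discussion: re-teach first"
        return "discuss with a neighbor for 2 minutes, then re-vote"

    print(peer_instruction_next_step(n_correct=14, n_total=30))
    # 16/30 wrong (~53%) falls inside the 30-70% band -> discuss and re-vote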

MM-04
Muddiest Point — Protocol

Angelo & Cross (1993) classroom assessment technique. Exact implementation: in the last two minutes of a session, students write anonymously on a card: "What was the muddiest point in today's session?" Instructor reads before next session and opens by addressing the most common muddiest point. The key distinction: muddiest point reveals confusion about framing and explanation (instructor problem) versus confusion about the underlying concept (requires more instruction). These require different responses — the first means re-explain differently; the second means more examples or practice.

MM-05
Anonymous Polling — What the Research Shows

Stowell & Nelson (2007) compared hand-raising versus anonymous electronic response in the same classes. Anonymous conditions produced response rates 3–4x higher and showed substantially more honest distribution of answers — including more admissions of uncertainty. The mechanism is social risk reduction, not the technology. The same effect can be achieved with anonymous paper cards. The technology (iClicker, Poll Everywhere, Mentimeter) matters only insofar as it delivers anonymity quickly. If students can see each other's responses in real time, the anonymity benefit is partially lost.

Design Implication for Projects

Barnett & Ceci's near/far transfer taxonomy suggests a direct test for your project design: take a project prompt and ask whether a student who only pattern-matched to module examples could complete it. If yes, redesign the prompt to require structural transfer — same underlying concept, different domain, different data type, or different framing. The additional design cost is small. The diagnostic value is large.

Primary sources: Barnett & Ceci (2002) Psychological Bulletin · Rozenblit & Keil (2002) Cognitive Science · Chi & VanLehn (1991) Journal of the Learning Sciences · Mazur (1997) Peer Instruction: A User's Manual · Angelo & Cross (1993) Classroom Assessment Techniques · Stowell & Nelson (2007) Teaching of Psychology

Origin

The canonical example is the Force Concept Inventory in physics (Hestenes et al., 1992). It revealed that students who could solve Newton's law equations still held Aristotelian mental models of force. Procedural fluency and conceptual understanding are dissociable — the same is true in computation.

Provenance Note — Read First

The misconception clusters below are a synthesis, not a single citable taxonomy. The MC-01 through MC-06 labels are organizational — they do not refer to a named framework in the literature. Each cluster has different levels of empirical support, noted inline. Where source literature exists it is cited; where it does not, that absence is flagged explicitly. Do not cite these clusters as if they reference a single source. Trace back to the underlying papers for any formal claim.

How They Work

A concept inventory item presents a scenario — typically a short piece of code or a problem setup — and offers multiple choice answers where the distractors correspond to known, documented misconceptions, not random wrong answers. The diagnostic power is entirely in distractor design. A student who answers incorrectly reveals which wrong mental model they hold, not just that they are wrong.

Misconception Clusters — With Source Provenance

MC-01
Variable and State — Container vs Reference Model

Students hold a "container model" — variables as named boxes holding values — that breaks under mutability, scope, and pass-by-reference. Empirical support: strong. This is one of the most studied misconceptions in CS education. Juha Sorva's doctoral work (2012, Aalto University) on "notional machines" documents this extensively. Also appears in the broader "programming misconceptions" literature at SIGCSE and ICER conferences. The specific framing of container vs reference model is well established.

    x = [1, 2, 3]
    y = x
    y.append(4)
    # What is x now?
    # (A) [1, 2, 3]      <- misconception: copy/container model
    # (B) [1, 2, 3, 4]   <- correct: reference model
    # (C) Error          <- misconception: immutability assumption
MC-02
Loop and Iteration — Simultaneity Model

Students model loops as operating simultaneously across all values rather than sequentially. Empirical support: moderate. The simultaneity misconception and "loop as filter" error appear in CS education research, including work by Qian & Lehman (2017) on novice programming errors and Sleeman et al. on BASIC programming misconceptions. Less formally validated for Python specifically — extrapolation from earlier language research is reasonable but should be noted.

    total = 0
    for i in [1, 2, 3]:
        total = total + i
    # What is total at the end of iteration 2?
    # (A) 6  <- misconception: simultaneous model
    # (B) 3  <- correct: sequential accumulation
    # (C) 2  <- misconception: i replaces total
MC-03
Abstraction Layers — Implementation Leakage

Students conflate what a function does with how it does it, preventing black-box reasoning. Empirical support: moderate, indirect. Not as directly studied as MC-01. Related to the broader "levels of abstraction" literature in computing education and to work on expert vs novice problem solving in CS (Soloway & Ehrlich, 1984). The "implementation leakage" framing is a reasonable inference from that literature, not a directly validated construct. Treat as theoretically grounded but empirically thinner.

Probe: "Without looking at the source code, what can you say with certainty about what sort([3,1,2]) will return?" Correct mental model: input-output contract only. Common misconception: students hesitate because they don't know the sorting algorithm used.
MC-04
Probability and Randomness

Gambler's fallacy, representativeness heuristic, frequency vs probability confusion. Empirical support: strongest of all six clusters. These misconceptions originate in Kahneman & Tversky's heuristics and biases program (1970s–1980s) and have been replicated extensively in general populations and students. The specific application to data science interpretation (p-values, model accuracy, confidence intervals) is more recent and less formally inventoried — but the underlying cognitive mechanisms are extremely well established. This is the cluster most ready to be turned into validated inventory items.

A model is 80% accurate. You run it on 5 samples and it gets 3 right. Should you be concerned? Common misconception: yes, because 3/5 = 60% ≠ 80%. Correct model: variance over 5 samples is enormous; 3/5 is well within expected range.
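The probe's correct answer is a two-line binomial calculation, which is worth showing students explicitly. A minimal sketch using scipy.stats; the numbers follow directly from the 80%-accurate, 5-sample setup above:

    from scipy.stats import binom

    # P(exactly 3 of 5 correct) when true accuracy is 0.8
    print(binom.pmf(3, n=5, p=0.8))  # ~0.205

    # P(3 or fewer correct) -- the "concerning" outcome or worse
    print(binom.cdf(3, n=5, p=0.8))  # ~0.263

    # An 80%-accurate model scores 3/5 or worse about 26% of the time,
    # so the observation is unremarkable rather than alarming.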
MC-05
Debugging Causality — Local Causality Model

Students look for bugs near the symptom rather than tracing execution flow backward. Empirical support: weak to moderate — mostly practitioner observation. McCauley et al. (2008, Computer Science Education) is a real and citable review of debugging behavior research. "Murphy et al. (2008)" cited alongside it in earlier drafts could not be verified as a distinct paper on debugging — remove that citation and rely on McCauley et al. alone for this cluster. The "silent failure blindness" variant specific to data pipelines is largely extrapolated from practitioner experience. Flag this cluster when citing formally.

Error on line 47. The bug is actually on line 12. Probe: "What is the first thing you check and why?" Correct model: trace execution path from error backward. Common misconception: change line 47 and re-run.
MC-06
Data Structure Mental Models — Pandas/Numpy Specific

Positional vs label-based indexing confusion, copy vs view semantics, broadcasting assumptions. Empirical support: weakest of the six — largely extrapolated. General data structure misconceptions appear in CS education research, but pandas and numpy specific misconceptions are not yet well inventoried in the formal literature as of early 2026. This cluster is inferred from general principles (positional vs label model) and practitioner observation. It is probably accurate and worth probing, but should be treated as hypothesis-generating rather than empirically validated. Developing validated items here would itself be a novel contribution.

    df1 = pd.DataFrame({'a': [1, 2, 3]}, index=[0, 1, 2])
    df2 = pd.DataFrame({'b': [4, 5, 6]}, index=[2, 1, 0])
    result = df1['a'] + df2['b']
    # What is result[0]?
    # Probes the index-alignment mental model
    # (label-based vs positional assumption)

The Cross-Cutting Pattern

Each misconception above shares a common structure: it is a locally coherent model that works in a restricted domain and breaks when the domain expands. The container model works for primitives. The simultaneous loop model matches mathematical notation. The local causality model works for simple scripts. This means simply presenting the correct model often fails — students already have a model that explains their prior experience. Effective remediation requires generating a case where the existing model makes a wrong prediction the student can observe directly. That cognitive conflict has to be self-generated, not just described.

Building Your Own Inventory

Formal concept inventories for computational data science are underdeveloped compared to physics and biology. MC-06 in particular represents an open research contribution opportunity — validated pandas/numpy misconception items do not yet exist in the published literature. The development process:

CI Development Protocol

1. Elicit misconceptions — open-ended questions on homework and exams; interviews with struggling students; think-aloud problem solving sessions.

2. Categorize errors — group incorrect answers by the underlying wrong model they reveal, not just by the correct answer they missed.

3. Build distractors — each wrong answer option should correspond to a specific documented misconception, making incorrect answers maximally informative.

4. Validate — administer before and after instruction; interview students who chose each distractor to verify it corresponds to the intended misconception. A tally sketch for this step follows the list.

5. Iterate — a good inventory takes 2–3 cycles of administration and revision.
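For step 4, the first artifact you need is the distribution of distractor choices per item. A minimal sketch, assuming a hypothetical responses.csv with columns student, item, and choice; the file layout is illustrative, not a standard format:

    import pandas as pd

    # Hypothetical export: one row per student per inventory item
    responses = pd.read_csv("responses.csv")  # columns: student, item, choice

    # Fraction of students choosing each option, per item
    tally = (responses.groupby("item")["choice"]
                      .value_counts(normalize=True)
                      .rename("fraction")
                      .reset_index())
    print(tally)
    # A distractor nobody picks is dead weight; one that dominates an item
    # names the misconception to target during instruction.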

Key Literature for This Section

  • Sorva, J. (2012). Visual Program Simulation in Introductory Programming Education. Doctoral dissertation, Aalto University. — Primary source on notional machines and the container/reference variable misconception.
  • Qian, Y., & Lehman, J. (2017). Students' misconceptions and errors in programming. International Journal of Computer Science Education in Schools, 1(3). — Systematic review of novice programming errors including loop misconceptions.
  • McCauley, R., et al. (2008). Debugging: Finding, fixing and flailing. Computer Science Education, 18(2), 93–116. — Review of debugging behavior research; source for MC-05 claims.
  • Tversky, A., & Kahneman, D. (1974). Judgment under uncertainty: Heuristics and biases. Science, 185(4157), 1124–1131. — Foundational source for MC-04 probability misconceptions.
  • Hestenes, D., Wells, M., & Swackhamer, G. (1992). Force concept inventory. The Physics Teacher, 30(3), 141–158. — Model for concept inventory design methodology.
Common Mistake

Most instructors use probes only as end-of-instruction comprehension checks. The research suggests the more powerful use is probing before instruction — activating prior knowledge and surfacing the specific misconception you are about to address so students notice the conflict when it appears. The first probe and the last probe are doing fundamentally different work.

The Four-Phase Sequence

PS-01
Phase 1 — Activation Probe (before instruction)

Targets the mental model students bring in, not the one you want them to leave with. Should be answerable from intuition alone. Goal: get students to commit to a model — ideally a confident wrong answer — so instruction creates genuine conflict rather than just adding information. Empirical basis: the pretesting effect (Kornell et al., 2009) — students learn more from instruction when they have attempted the problem first, even when the attempt is wrong.

PS-02
Phase 2 — Instruction with Named Conflict

Not "here is the correct model" but "here is where the model you probably have predicts X and the actual result is Y." The conflict must be named explicitly. Students who are not told their prior model is wrong often assimilate new information into it rather than replacing it. This is conceptual change teaching (Posner et al., 1982).

PS-03
Phase 3 — Transfer Probe (immediately after instruction)

Same conceptual model, different surface context. Distinguishes students who updated their model from students who just learned the correct answer to the specific case used in instruction. If you taught with example A, probe with example B that requires the same reasoning but looks different.

PS-04
Phase 4 — Spaced Retrieval Probe (next session)

One probe question at the start of the next module, explicitly returning to prior material. Slightly harder than the original transfer probe — a case where the misconception produces a plausible-looking wrong answer rather than an obviously wrong one. The testing effect (Roediger & Karpicke, 2006) is strongest with spaced retrieval. Most instructors skip this phase entirely.

On Interleaving Across Concepts

If a module addresses multiple misconception clusters, do not exhaust one completely before moving to the next. Interleaved probes (A, B, A, B) outperform blocked probes (A, A, B, B) on transfer assessments because interleaving forces students to identify which concept applies rather than applying the most recent one by default. It feels less organized and produces better learning.

When a Probe Reveals Widespread Misconception

The default response — re-explaining the same thing more slowly — is not what the research supports. If a formative probe shows most students still hold the wrong model, the intervention is more conflict, not more explanation. In a coding context this is tractable: have students execute code and observe the output themselves. Self-generated cognitive conflict is more persuasive than instructor-described conflict because it cannot be attributed to the instructor being wrong.


Worked Example 1 — Unsupervised Learning (k-means clustering)

Target misconception: students believe clustering finds "true" or "correct" groups that exist in the data objectively, rather than partitioning the space according to a chosen algorithm and a chosen k. Secondary misconception: k is determined by the data, not chosen by the analyst.

Phase 1 — Activation Probe

Before you teach anything about k-means

Show students a scatter plot of unlabeled 2D data with no obvious cluster structure. Ask: "How many groups are in this dataset? How would you find out?"

Most students will either say "I can see 3" (imposing visual structure) or "the algorithm will find the right number." Both reveal the objective-groups misconception. Let them commit to an answer before moving on. Do not correct them yet.

Then immediately show the same data clustered into k=2, k=4, and k=7 with equally low within-cluster variance. Ask: "Which of these is correct?" The visual conflict between three apparently valid clusterings of the same data is the activation event.
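The Phase 1 visual is cheap to generate. A minimal sketch, assuming scikit-learn and deliberately structureless uniform data; the seed and the k values are arbitrary choices:

    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(42)
    X = rng.uniform(0, 10, size=(300, 2))  # no true cluster structure by construction

    for k in (2, 4, 7):
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
        print(f"k = {k}: within-cluster variance per point = {km.inertia_ / len(X):.2f}")
    # Each k yields a tidy-looking partition with comparable fit. Plotting X
    # colored by km.labels_ for each k produces the three equally "valid"
    # clusterings used in the activation probe.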

Phase 2 — Instruction with Named Conflict

Name the misconception explicitly: "Most people's intuition is that there is a right answer — a true number of clusters the data is hiding. k-means does not find that. It finds the best partition into whatever k you chose. The choice of k is an analyst decision, not a data property."

Then teach the algorithm mechanically. The named conflict gives students something to anchor the new model to — they are replacing a specific wrong belief, not just accumulating information.

Phase 3 — Transfer Probe

Same concept, different surface

Give students a 1D dataset — ages of customers — and ask: "A colleague runs k-means with k=3 and gets silhouette score 0.68. Another runs k=5 and gets 0.71. Your manager asks which clustering is correct. What do you tell them?"

A student who updated their model says neither is "correct" — both are valid partitions under different k choices; the right k depends on the business question. A student who just learned the algorithm will try to pick the higher silhouette score as the correct answer, revealing the misconception persists.

Phase 4 — Spaced Retrieval Probe (open next session with this)

Harder — misconception produces plausible wrong answer

"You run k-means on patient gene expression data and get k=4 clusters. A clinician asks: 'Are these the four real subtypes of this cancer?' What do you say, and what would you need to answer their question more honestly?"

This probe is harder because the clinical framing makes the objective-groups interpretation feel appropriate. A robust model update handles it; surface learning does not. It also opens into a genuine scientific discussion about validation, which is the content you want for the next module.


Worked Example 2 — Euler's Method

Target misconception: students treat Euler's method as producing the true solution to a differential equation rather than a numerical approximation whose accuracy depends on step size. Secondary misconception: error accumulates randomly rather than systematically in a predictable direction.

Phase 1 — Activation Probe

Before teaching the algorithm

Give students the ODE dy/dx = y, y(0) = 1, whose true solution is e^x. Ask them to sketch what they think the solution looks like from x=0 to x=3. Most will draw a reasonable exponential curve.

Then ask: "If a computer approximates this curve by taking small steps — at each point, moving in the direction the equation says to go — what happens to the approximation as you take fewer, larger steps?"

Common answers: "it gets less smooth," "it gets more jagged," "it might go in the wrong direction." Almost no student will say unprompted that it will systematically underestimate the true solution for a convex function. That systematic directional bias is what you are about to teach.

Phase 2 — Instruction with Named Conflict

Show the Euler approximation alongside the true solution for h=1, h=0.5, h=0.1. Name the conflict: "Notice the approximation is always below the true solution. This is not random error — for a convex function, Euler's method systematically underestimates because it uses the slope at the beginning of each step, which is always less than the average slope across the step. The error is not noise. It has a direction, and that direction is predictable from the shape of the function."
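A minimal sketch of this demo, assuming NumPy only; it prints endpoint comparisons rather than plotting, but the systematic direction of the error is already visible in the numbers:

    import numpy as np

    def euler(f, y0, x_end, h):
        """Fixed-step Euler: y_{n+1} = y_n + h * f(x_n, y_n)."""
        xs = np.arange(0.0, x_end + h / 2, h)
        ys = np.empty_like(xs)
        ys[0] = y0
        for i in range(len(xs) - 1):
            ys[i + 1] = ys[i] + h * f(xs[i], ys[i])
        return xs, ys

    for h in (1.0, 0.5, 0.1):
        _, ys = euler(lambda x, y: y, y0=1.0, x_end=3.0, h=h)
        print(f"h = {h:>4}: Euler y(3) = {ys[-1]:7.3f}   true e^3 = {np.exp(3.0):.3f}")
    # Every step size lands below e^3 ~ 20.086. The error shrinks as h shrinks
    # but never changes sign -- the underestimate is systematic, not noise.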

Then teach the algorithm and the error analysis formally. Students now have a concrete visual and a named mechanism to attach the formalism to.

Phase 3 — Transfer Probe

Same concept, different function shape

Give students dy/dx = 1/y, y(0) = 1, whose true solution is √(2x + 1) — increasing but concave. Ask: "For this equation, will Euler's method overestimate or underestimate the true solution? Explain without running the code."

A student with an updated model reasons from the concavity: the slope at the beginning of each step is steeper than the average slope, so Euler overshoots — the approximation will be above the true solution. A student who memorized "Euler underestimates" from the first example will get this wrong. The transfer probe distinguishes the two.

Phase 4 — Spaced Retrieval Probe (open next session)

Harder — requires reasoning about error in an applied context

"You are modeling tumor growth with a logistic ODE using Euler's method with h=0.5. Near the inflection point of the logistic curve, what happens to the direction of Euler's error, and why does this matter for clinical predictions based on the model?"

This probe is harder because the logistic curve changes concavity at the inflection point — so the direction of Euler's error reverses mid-simulation. A student with a robust model can reason through this. It also connects numerical methods directly to the biomedical context, which is the bridge you want to the next module.


The Practical Constraint

Time Tradeoff

A full four-phase probe sequence takes significantly longer than content delivery alone. The research suggests this investment pays off in downstream retention and transfer, but it requires accepting that you will cover less content per session. This is a pedagogically defensible choice — covering fewer concepts with durable learning outperforms covering more concepts with shallow encoding. But it is a choice that should be made explicitly and communicated to students and colleagues, not absorbed silently as lost content.

Key Literature for This Section

  • Kornell, N., Hays, M.J., & Bjork, R.A. (2009). Unsuccessful retrieval attempts enhance subsequent learning. Journal of Experimental Psychology: Learning, Memory, and Cognition, 35(4), 989–998. — Primary source for the pretesting effect: failed retrieval attempts before instruction improve learning from subsequent instruction.
  • Posner, G.J., Strike, K.A., Hewson, P.W., & Gertzog, W.A. (1982). Accommodation of a scientific conception: Toward a theory of conceptual change. Science Education, 66(2), 211–227. — Foundational conceptual change theory. Conditions required for model replacement rather than assimilation.
  • Roediger, H.L., & Karpicke, J.D. (2006). Test-enhanced learning. Psychological Science, 17(3), 249–255. — Testing effect and spaced retrieval. Basis for Phase 4 design.
  • Taylor, K., & Rohrer, D. (2010). The effects of interleaved practice. Applied Cognitive Psychology, 24(6), 837–848. — Interleaving versus blocked practice on transfer performance.

Diagnostic vs Evaluative Feedback

Evaluative feedback tells you how students felt about something after it happened. Diagnostic feedback tells you what students can and cannot do, while there is still time to act. Most course evaluation infrastructure produces evaluative feedback. Diagnostic feedback has to be designed in deliberately.

Timing Matters

End-of-module surveys are retrospective. By the time a student fills one out, judgments have formed and the module is over. The highest-value feedback is caught during instruction, when intervention is still possible.

High-Signal Diagnostic Techniques

FT-01
Mid-Module Muddiest Point

In-class, anonymous, one question: "What is still unclear?" Two minutes. Distinguishes confusion about content from confusion about framing. That distinction determines whether you need to re-explain differently or just add another example.

FT-02
Behavioral Survey Questions

"Describe a moment where you felt lost and what you did next." Produces behavioral data rather than sentiment. The response "I used ChatGPT" is more useful than "this was too hard" — it tells you where the scaffolding broke down.

FT-03
Own-Words Paraphrase Check

Ask students to explain a concept in one sentence without notes, immediately after instruction. Not graded. Students who cannot paraphrase it did not build a usable mental model. This surfaces the illusion of explanatory depth before it becomes a problem on assessments.

FT-04
Peer Observation with Specific Frame

General peer observation produces general feedback ("looks good"). Give the observer a specific question: "I want to know whether students look confused and then reach for a shortcut rather than asking." Targeted observation produces targeted data.

FT-05
Structured Student Interview

Ask a high-signal student for elaboration on specific feedback. Frame as collaborative research into your own teaching, not as seeking reassurance. Ask behavioral questions: "What specifically felt unclear and what did you do when it did?" rather than "what didn't you like?"

Making Pedagogy Visible to Students

Students who understand why a difficulty is structured the way it is tolerate it differently than students who experience it as arbitrary. Research on metacognitive awareness shows that explaining the pedagogical rationale — before the hard thing, not after — increases engagement and reduces shortcuts.

Example Framing

"This module is harder than what you've done before. You're going to want shortcuts. I'm not giving them to you deliberately — the research on skill acquisition is unambiguous that generating your own solution, even imperfectly, produces retention that studying a worked example does not. The struggle is the mechanism."


Replacing "Does That Make Sense?"

"Does that make sense?" is one of the least diagnostic questions in teaching. It conflates three different things — whether students understood, whether they think they understood, and whether they are willing to admit they did not — and all three produce the same answer: silence or yes. The illusion of explanatory depth research (Rozenblit & Keil, 2002) explains why: students who genuinely do not understand also genuinely believe they do, immediately after a clear explanation. The question asks students to introspect on something they are systematically bad at introspecting on. Questions that require production rather than evaluation of understanding bypass this entirely.

Why It Fails

Wiliam (2011, Embedded Formative Assessment) argues that open-ended comprehension checks are structurally inferior to hinge questions — questions where different wrong answers reveal different misconceptions — because hinge questions require students to commit to a specific position rather than rate their own confidence. A student who does not understand cannot accurately rate their own non-understanding. A student who picks the wrong distractor has revealed exactly which model gap they have.

When You Want to Know Whether They Built the Right Model

CQ-01
"What would you expect to happen if we changed X?"

Forces prediction from the model rather than recognition of it. A student who understood can reason forward from the concept to a new case. A student who followed along passively cannot — they can only recognize correct answers when shown them, not generate predictions from first principles. Works particularly well in coding contexts: "what would happen to the output if I changed this line?"

CQ-02
"Give me an example of this that I haven't used."

Requires generative retrieval and near-transfer simultaneously. Students who only learned the instructor's examples cannot do this — they have pattern-matched to the specific case rather than abstracted the concept. The inability to generate a novel example is the clearest signal that a model has not been built. Collect written responses rather than verbal — verbal answers allow students to elaborate until something sounds right.

CQ-03
"Where would this break?"

Requires understanding of the concept's boundary conditions, which is a stronger test than knowing its central case. A student who can identify when k-means fails understands k-means more deeply than a student who can only describe what it does. Also directly relevant to data science practice — professionals spend more time diagnosing failures than building successes. Asking this question routinely trains the right professional habit.

When You Want to Know Where the Confusion Is

CQ-04
"What's the part that feels least solid right now?"

More specific and more answerable than "any questions?" — it assumes some uncertainty exists rather than asking students to volunteer that they are confused. The framing treats partial understanding as normal rather than exceptional, which reduces the social cost of honest response. Works better written and anonymous than asked verbally — verbal asking still exposes the student who answers.

CQ-05
"If you had to explain this to someone who missed today, what would be the hardest part to convey?"

Third-person framing reduces social cost — the student is describing a hypothetical difficulty rather than admitting their own. Also requires the student to construct an explanation, which surfaces gaps through the generation process rather than through self-report. The "hardest part to convey" framing specifically targets the boundary of their understanding rather than its center.

CQ-06
"What question do you wish I'd asked you just now?"

From Hattie & Timperley's feedback research — cited as particularly effective for surfacing gaps students can sense but have not articulated. Requires students to identify what they do not fully know rather than evaluate what they do know — a different and often more accessible cognitive operation. Students who answer this well have demonstrated metacognitive awareness of their own model gaps, which is itself a teaching goal.

When You Want Live Signal Without Cold-Calling

CQ-07
Finger Confidence Scale (1–3)

Students hold up 1, 2, or 3 fingers simultaneously on a count: 1 = lost, 2 = followed but couldn't explain it yet, 3 = could explain it right now. Simultaneous display prevents anchoring to what neighbors show. Fast, anonymous enough to be honest, gives you a distribution rather than a yes/no. A room of mostly 2s means the concept landed but needs consolidation. A room of mixed 1s and 3s means the class has split — you have a bimodal understanding problem, not a general confusion problem, and the response is different for each.

CQ-08
Hinge Question (Targeted Wrong Answers)

The highest-diagnostic version of any comprehension check. Present a multiple choice question where each wrong answer corresponds to a specific known misconception — not random wrong answers. Students commit to an answer before discussion. The distribution of wrong answers tells you not just that students are confused but which specific wrong model is most prevalent. Design the question so the correct answer requires the concept you just taught, and each distractor requires a specific prior misconception. Source: Wiliam (2011) — hinge questions are defined as questions where the answer determines what the instructor does next.

The General Principle

Replace questions that ask students to evaluate their own understanding with questions that require them to demonstrate it. Self-evaluation of understanding is unreliable immediately after instruction — the illusion of explanatory depth is strongest in that moment. Production tasks (predict, generate, identify failure modes) bypass the introspection problem entirely: the output tells you what self-report cannot.

Sources: Rozenblit & Keil (2002) — illusion of explanatory depth · Wiliam (2011) Embedded Formative Assessment — hinge questions defined and evaluated · Hattie & Timperley (2007) — "what question do you wish I'd asked" framing

The Loop

Students use AI → skip effortful practice → perform adequately on assessments → feel they didn't learn → blame the course. The instructor receives negative evaluations for rigorous modules and loses the downstream performance signal that would otherwise validate the approach.

Why This Is a Structural Problem, Not a Character Problem

Students who rely on AI and then complain they didn't learn the material are giving accurate self-reports. The short-term path of least resistance and the path that produces learning are fully decoupled in a way they were not before. Students are not lazy — they are rational actors in a broken incentive structure. The question is not how to stop them using AI but whether the things you are asking them to do are worth doing without AI in the first place.

The Real Question — From a 2024 Data Science Education Paper

A 2024 arXiv preprint on AI-resilient assessment design specifically for data science education (Colusso et al., 2024) made the sharpest version of this argument: tasks that once defined early-career data science work — writing boilerplate code, executing small-scale data analysis, drafting SQL queries — are now completed instantly by AI tools. These same modular tasks still dominate most data science assessments. The educational value of practicing skills industry has already automated is genuinely unclear. Students recognize this disconnect, which undermines motivation even when they engage honestly. The response is not to ban AI — it is to ask what competencies remain valuable and design assessments around those.


What a Data Scientist Actually Needs to Know — A Competency Framework for Assessment Design

Before evaluating whether an assessment is good, you need a model of what genuine data science competence looks like. The following framework is synthesized from the data science competency literature (including a 2024 systematic review of 38 studies, Bredillet et al.) and from industry competency analyses. It is organized by how resistant each competency is to AI substitution — which is the dimension most relevant to assessment design right now.

Tier 1 — Highly AI-Substitutable (Assess Carefully)

These are skills where current AI tools perform at or near the level of a competent junior data scientist. Assessments that primarily test these skills are vulnerable to AI completion without genuine student understanding.

Syntax and API fluency: Writing pandas operations, sklearn pipelines, matplotlib plots. AI produces this reliably and correctly. Testing whether a student can write df.groupby('col').mean() does not tell you whether they understand what groupby is doing or when to use it.

Boilerplate modeling pipelines: Train/test split, fit a model, evaluate accuracy. AI produces correct boilerplate for standard tasks. A student can submit a functioning notebook without understanding why the pipeline is structured that way.

Definition and recall: "What is overfitting?" "Define precision and recall." AI answers these accurately. Testing recall of definitions measures whether a student read the material, not whether they can apply the concept.

Standard data cleaning operations: Handling missing values with common strategies, encoding categoricals. These are well-documented patterns that AI executes reliably.

Implication for assessment: None of these should be the primary thing a checkpoint is testing. They can appear as scaffolding within a larger task — but if completing them correctly is sufficient to pass the assessment, AI completes the assessment.

Tier 2 — Partially AI-Substitutable (Assess With Augmentation)

These skills require judgment that AI can assist with but cannot fully replace. AI generates plausible-sounding responses; a student with genuine understanding can tell which of those responses are actually correct, while a student without understanding cannot.

Model selection reasoning: Choosing between model families for a specific problem and justifying the choice given data characteristics. AI can generate reasonable-sounding justifications, but they are often generic. A student with genuine understanding can evaluate whether an AI-generated justification is actually correct for their specific dataset.

Interpreting model outputs in context: Explaining what a coefficient, feature importance, or confusion matrix means for a specific domain problem. AI can describe what these are generically. Connecting them to the actual business or research question requires domain-situated reasoning.

Debugging non-obvious failures: Diagnosing why a model performs well on validation but poorly in deployment. AI can suggest common causes but cannot trace through the specific data pipeline and identify where the problem actually is.

Implication for assessment: These can be tested, but need augmentation — require students to apply reasoning to their specific dataset and specific results, not to a generic problem. AI cannot pattern-match to a student's actual data.

Tier 3 — Minimally AI-Substitutable (High-Value Assessment Targets)

These are the competencies where genuine understanding is difficult or impossible to fake with AI, and where the gap between a student who understands and one who does not is most visible.

Detecting silent failures: Identifying that a model is producing plausible-looking wrong answers — data leakage, label leakage, distribution shift between train and test, class imbalance artifacts. AI cannot detect these in a specific student's notebook because it cannot run the code and observe what actually happens. This requires both technical depth and developed intuition about where pipelines break quietly.

Causal reasoning vs correlation: Distinguishing what a model can validly claim from what it cannot. Understanding why a high-AUC model on historical data may be useless or harmful if deployed. This requires understanding the assumptions behind the method, not just the method.

Experimental design and validity: Designing a valid evaluation framework for a specific problem — choosing the right train/validation/test split strategy, identifying appropriate baselines, knowing when cross-validation is and is not appropriate. AI produces generic answers. Correct answers depend on the specific data structure and problem constraints.

Communicating uncertainty honestly: Expressing what a model can and cannot claim, what assumptions underlie it, and what would invalidate the conclusion — to a non-technical audience. AI produces fluent but often over-confident summaries. Genuine competence here requires knowing what you do not know.

Ethical and consequential reasoning: Identifying where a model's outputs could cause harm, who is affected, and what the failure modes are in deployment. This requires domain knowledge and the ability to reason from first principles about systems that do not yet exist.


Assessment Audit Tool — Evaluating an Existing Assessment

Use this to evaluate any existing checkpoint, project milestone, or exam question before deciding whether to keep it, modify it, or replace it.

The Five-Question Audit

Question 1: Can current AI tools complete this correctly without student input?
Test it yourself — paste the prompt into Claude or GPT-4 with no additional context. If the output would receive full or near-full marks, the assessment is in Tier 1. It needs either replacement or augmentation with a student-specific constraint.

Question 2: Does this require reasoning about the student's specific data or results?
Generic prompts ("explain overfitting") can be answered without any connection to the student's actual work. Specific prompts ("explain what your validation curve suggests about whether your model is overfitting, and what you would change") require the student to interpret their own outputs. Specificity is the primary driver of AI resistance.

Question 3: Would a student who understands the concept and a student who only ran the code produce distinguishably different responses?
If the answer is no — if surface compliance and genuine understanding produce identical outputs — the assessment cannot distinguish the two. This is the core validity failure. Fix: add a "why" or "what would change if X" component that requires model depth to answer correctly.

Question 4: Does this assess a Tier 3 competency a working data scientist actually needs?
Map each assessment item to the competency framework above. If the majority of marks are allocated to Tier 1 competencies, rebalance toward Tier 2 and Tier 3. The proportion of marks on Tier 3 competencies is a rough proxy for assessment quality in the AI era.

Question 5: Does this require a process artifact that cannot be retroactively generated?
A final notebook can be produced by AI. A series of commits showing how the analysis evolved, each with a written decision note explaining why a choice was made, cannot. Process evidence — staged submissions, decision logs, incremental commits — is the most robust AI-resistance mechanism because it requires the student to have been thinking throughout, not just at the end.

Redesign Patterns — Moving an Assessment Up the Tiers

AI-01
Add Specificity — Anchor to Student's Own Data

Replace generic prompts with prompts that require reasoning about the student's specific outputs. "Explain what train/test split means" → "Your model shows 94% training accuracy and 71% test accuracy on your dataset. Diagnose the most likely cause, identify where in your pipeline it originates, and propose two changes with different tradeoffs." AI can answer the generic question. Only a student who ran their code and understands the results can answer the specific one.
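
A sketch of the kind of student-specific output such a prompt anchors to (synthetic data; the unpruned decision tree is an assumed example, chosen because it reliably produces the train/test gap the question asks students to diagnose):

```python
# Produces a large train/test accuracy gap for students to diagnose.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, n_features=20, n_informative=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)  # unpruned: memorizes training data
print("train accuracy:", model.score(X_train, y_train))  # near 1.00
print("test accuracy:", model.score(X_test, y_test))     # noticeably lower
```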

AI-02
Add Process Evidence — Decision Logs

Require students to maintain a brief decision log throughout a project — for each major modeling decision, one sentence explaining what they chose, one sentence explaining what they considered and rejected, and one sentence explaining what would change their mind. This is low-overhead to write in real time and nearly impossible to reconstruct retroactively. It also functions as a formative assessment tool — you can see where students' reasoning breaks down before the final submission.
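
One low-friction way to collect the log (a sketch with hypothetical field names, not a prescribed format) is a three-field entry per decision, which also makes the logs easy to scan in aggregate:

```python
# Hypothetical decision-log entry: one record per major modeling decision.
from dataclasses import dataclass

@dataclass
class DecisionLogEntry:
    chose: str         # what was chosen, one sentence
    rejected: str      # what was considered and rejected, one sentence
    would_change: str  # what evidence would change the choice, one sentence

entry = DecisionLogEntry(
    chose="Median imputation for income, because the distribution is heavily skewed.",
    rejected="Rejected mean imputation: outliers pull the mean far above typical values.",
    would_change="Would switch to model-based imputation if missingness correlates with the target.",
)
```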

AI-03
Add a Break-It Component

After students complete a standard modeling task, add: "Now deliberately introduce one form of data leakage into your pipeline, verify that it inflates your metric, and explain why it inflates it and why it would fail in deployment." This requires understanding the mechanism well enough to break it intentionally — which is substantially harder than building it correctly. AI can describe data leakage generically but cannot execute it on the student's specific pipeline and explain the specific result.
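
A sketch of what a correct break-it submission might look like (synthetic data; the leakage mechanism shown, a post-outcome column joined into the features, is one assumed example among many):

```python
# Deliberate label leakage: add a feature derived from the target and
# verify that it inflates the test metric.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = (X[:, 0] + rng.normal(scale=2.0, size=1000) > 0).astype(int)  # noisy target

# Honest pipeline: features carry no information derived from the label.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
honest = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("honest test accuracy:", honest.score(X_te, y_te))

# Leaky pipeline: a noisy copy of the label sneaks in as a feature, the
# kind of column that appears when post-outcome data joins the table.
X_leak = np.column_stack([X, y + rng.normal(scale=0.1, size=1000)])
Xl_tr, Xl_te, yl_tr, yl_te = train_test_split(X_leak, y, random_state=0)
leaky = LogisticRegression(max_iter=1000).fit(Xl_tr, yl_tr)
print("leaky test accuracy:", leaky.score(Xl_te, yl_te))
# The leaky score is near-perfect, but the model fails in deployment,
# where the post-outcome column does not exist at prediction time.
```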

AI-04
Add a Failure Mode Analysis

For any model submission, require: "Describe two realistic scenarios in which this model would fail in deployment. For each scenario, explain what in the training setup made it vulnerable, and what you would need to change to address it." This targets Tier 3 directly — it requires causal reasoning about the relationship between modeling choices and real-world performance, which AI cannot do reliably for a specific student's model on a specific dataset.

AI-05
Live Walkthrough or Oral Component

The most AI-resistant assessment format is a brief live walkthrough — student shares screen, walks through their notebook, and answers 2–3 follow-up questions in real time. Research on conversational exams (arXiv, 2025) suggests this scales better than expected when structured as small-group sessions rather than individual oral exams — several students attend simultaneously, one presents while others observe, and follow-up questions are directed to any student. A student who did not do their own work cannot answer follow-up questions about implementation decisions they did not make.

The Honest Constraint

What "AI-Proofing" Cannot Do

A 2025 study (Kofinas, Tsay & Pike) found that experienced graders could not reliably distinguish student-written from AI-written responses even on assessments designed to be authentic. Complete AI-proofing is not achievable through design alone. The more productive frame is: design assessments that measure competencies worth measuring, where the act of completing the assessment produces learning regardless of whether AI is involved. If a student uses AI to complete a decision log but writes each entry honestly in response to their own results, some learning still happens. If a student uses AI to generate a final notebook without engaging with the data at all, no learning happens. The design goal is to make the former more likely than the latter — not to make the latter impossible.


Designing for Learning Even When AI Is Used

This is a different design goal from AI-proofing. It accepts that students will use AI and asks: what assessment structures ensure something is learned in the process regardless? The answer comes from three overlapping bodies of research — the FACT framework for data science specifically, the metacognitive scaffolding literature, and structured reflection research.

The FACT Framework — Built for Data Science With AI

Source: Frontiers in Education (2025) — environmental data science course

The FACT framework (Fundamental, Applied, Conceptual, critical Thinking) was developed and evaluated specifically for upper-level data science education to balance AI-assisted learning with genuine skill development. Each component maps to Bloom's Taxonomy levels and has a different AI policy:

F — Fundamental skills (no AI): Low-stakes in-class quizzes or exercises testing core syntax, statistical reasoning, and procedural knowledge without AI access. The rationale is "ventriloquizing" prevention — Jain & Samuel (2025) coined this term for students who replicate AI outputs without genuine internalization. F assessments verify that students have independent foundational capability that higher-order work depends on. If a student cannot answer basic questions about their own model without AI, the applied work is hollow.

A — Applied projects (AI permitted): Take-home projects where AI use is explicitly allowed and treated as a professional tool. Students are required to document what AI contributed and what decisions they made independently. The assessment rubric evaluates judgment, not just output — did the student critically evaluate AI suggestions, or accept them uncritically? AI-assisted projects that show evidence of critical evaluation score higher than clean-looking projects with no engagement trace.

C — Conceptual understanding (no AI): In-class or oral assessments verifying that students can explain why their methods work, not just that they can execute them. This is the transfer test — a student who genuinely understood the applied project can explain the conceptual basis; a student who used AI as a ventriloquist cannot. Conceptual assessments directly probe the gap the AI usage may have created.

T — Critical Thinking (AI as subject): Assessments that ask students to evaluate AI outputs critically — identify where an AI-generated analysis went wrong, what assumptions it violated, or what it would miss in a specific deployment context. This is the highest-order component and the one most relevant to actual data science practice, where the job increasingly involves evaluating and correcting AI outputs rather than generating them from scratch.

Why This Structure Works

F and C act as learning anchors — they create consequences for not building genuine understanding, because those components cannot be completed with AI. A and T leverage AI as a learning context rather than fighting it. The key finding from the Frontiers study: students who completed F and C components without AI showed significantly better transfer on novel problems than students assessed only through applied projects, even when the applied projects were well-designed. The combination is what drives the effect.

The Ventriloquizing Problem — And How Reflection Breaks It

The Mechanism of Surface AI Use

Jain & Samuel (2025) describe ventriloquizing as students who produce AI-generated outputs without genuine internalization — they are the voice, but the words are not theirs. The problem is not that AI was used but that no cognitive processing of the AI output occurred. The research on metacognitive scaffolding in programming education (Choi et al., 2023, cited in arXiv:2511.04144) shows that prompting reflection after AI-assisted tasks improved performance on both immediate and delayed tests compared to completing tasks without reflection. The reflection requirement is what converts AI use from passive ventriloquism into active learning.

The specific reflection structure that works is not "write about what you learned" — that is too open and produces generic responses. The structure that produces measurable metacognitive gains requires three specific moves:

1. What did AI suggest that you accepted, and why did you accept it? Forces evaluation rather than passive acceptance.

2. What did AI suggest that you rejected or modified, and why? Forces critical engagement. A student who accepted everything cannot answer this honestly.

3. What does the output tell you that you would not have known otherwise? Forces integration of AI output with prior knowledge rather than replacement of it.

These three questions take 5–10 minutes to answer honestly and cannot be completed without genuine engagement with both the AI output and the student's own understanding. They are also self-diagnostic — students who answer "I accepted everything because it looked right" have just told you and themselves that no learning occurred.

Structured Human-AI Interaction Patterns — What Research Shows

Source: ICAIR 2025 Conference — Human-AI Interaction and Metacognition

A 2025 conference study (ICAIR) compared two student clusters based on how they interacted with AI during problem-solving. Cluster 1 used exploratory, divergent prompting with low-level cognitive engagement early on — broad questions, accepting outputs, moving quickly. Cluster 2 used structured regulation behaviors — initiating with deep-level questions, making deliberate modifications, completing full self-regulated learning cycles (planning, monitoring, reflecting). Cluster 2 showed significantly higher gains in problem-solving skills, AI literacy, and metacognitive strategy use. The finding: it is not whether students use AI but how they use it that determines learning outcomes.

The implication for assessment design: rather than banning AI or permitting it without structure, design the assessment to require Cluster 2 behavior. This means building the prompting strategy into the assignment itself:

— Require students to write their first attempt at a problem independently before querying AI.

— Require them to document their initial query and AI's response.

— Require them to identify what in the response they tested, modified, or rejected.

— Require a final summary of what changed in their understanding between their initial attempt and their final solution.

This structure forces self-regulated learning regardless of whether the student would have done it naturally. It also produces a process trace that is diagnostically useful for you and genuinely difficult to fake retroactively.

The King's College Co-Design Study — Compliant vs Unrestricted AI Use

Source: Martin et al. (2025), Behavioral Sciences, King's College London

A randomized study at King's College London assigned postgraduate psychology students to two conditions: a "compliant" group that used AI with a structured scaffolding requirement (use AI to assist, then critically reflect on its outputs), and an "unrestricted" group with free rein to use AI however they chose. Teaching staff graded both groups blind to condition. Key findings: the compliant group produced work that showed both higher technical quality and higher critical AI literacy. The unrestricted group produced more polished-looking outputs but showed less evidence of critical engagement with AI suggestions. The scaffolding requirement — not the AI restriction — was the active ingredient.

The study also found that co-designing the assessment structure with students increased buy-in substantially. Students who understood why the reflection requirement existed were more likely to engage with it honestly. This maps directly onto SYS-01 (pedagogical contract) — explaining the mechanism upfront changes how students approach the task.

Practical Implementation — The AI-Integrated Assessment Template

A Usable Structure for Any Data Science Checkpoint

Combine the FACT framework with the reflection and process trace requirements into a single checkpoint structure:

Part 1 — Fundamental check (in-class, no AI, 10 min): 3 questions on the core concept assessed by this checkpoint. Ungraded or low-stakes. Purpose: verify independent baseline before AI-assisted work. This is the ventriloquizing diagnostic — if students cannot answer these without AI, the applied work will be hollow.

Part 2 — Applied component (take-home, AI permitted with documentation): The main project milestone. AI use is explicitly permitted and documented. Students submit a brief AI interaction log alongside their code: what they queried, what AI produced, what they accepted and why, what they rejected and why.

Part 3 — Conceptual bridge (written, no AI, submitted with Part 2): Two questions answered independently: (1) "Explain the most important modeling decision in your submission and the reasoning behind it." (2) "Describe one way your model could fail in deployment that your current evaluation would not detect." These target Tier 3 competencies and cannot be answered without genuine engagement with the applied component.

Part 4 — Reflection (5–10 min, written): The three-question AI reflection above. Graded for specificity, not length. A student who engaged honestly with AI produces a different response than a student who accepted everything uncritically — that difference is visible and scoreable.

Grading weight suggestion: Part 1: 10% (completion). Part 2: 50% (output quality + AI log). Part 3: 30% (conceptual depth). Part 4: 10% (reflection quality). The weight on Part 3 is deliberately high — it is the component that requires the most genuine understanding and cannot be outsourced.
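
As a sanity check on the arithmetic (the weights are from the suggestion above; the component scores are hypothetical 0-to-1 values):

```python
# Weighted checkpoint grade under the suggested split.
def checkpoint_grade(fundamental, applied, conceptual, reflection):
    return (0.10 * fundamental + 0.50 * applied
            + 0.30 * conceptual + 0.10 * reflection)

print(checkpoint_grade(1.0, 0.8, 0.6, 1.0))  # 0.78
```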

Key Literature — AI With Learning

Martin et al. (2025) Behavioral Sciences — King's College randomized study; scaffolded AI use outperforms unrestricted AI use on both quality and critical literacy. · Jain & Samuel (2025) — ventriloquizing concept; reflection requirements as the mechanism for converting AI use into learning. · ICAIR (2025) conference — structured regulation behavior (Cluster 2) vs exploratory AI use (Cluster 1); structured prompting produces metacognitive gains. · Frontiers in Education (2025) — FACT framework for data science; F and C components without AI improve transfer performance. · Choi et al. (2023) — prompting reflection after AI-assisted programming tasks improves both immediate and delayed test performance. · PMC metacognitive scaffolding review (2025) — AI tools used strategically as metacognitive scaffolds produce learning gains; surface-level use does not.

Key Literature — Assessment Design Generally

Colusso et al. (2024) arXiv:2512.10758 — AI-resilient assessment design for data science; interconnected problem framework. · Bredillet et al. (2024) — data science competency frameworks, seven domains. · Kofinas, Tsay & Pike (2025) — authenticity alone insufficient; graders cannot reliably detect AI writing. · HEPI (2026) — AI has exposed pre-existing assessment validity problems. · Beilock et al. (2004) — anxiety and working memory interference in assessment contexts.

Why "Any Questions?" Produces Silence

There are three distinct mechanisms, and they require different responses:

Mechanism 1: Dunning-Kruger at Low Competence

Students who are most confused are often least able to articulate what they don't understand. Confusion and metacognitive awareness are inversely correlated at low competence levels. They don't ask because they don't yet know what to ask.

Mechanism 2: Illusion of Explanatory Depth

Students hear a clear explanation, it feels coherent in the moment, and they genuinely believe they understood it. The gap surfaces only when they try to reproduce or apply the concept. Silence is not withholding — it is genuine (false) confidence.

Mechanism 3: Social Risk

Admitting confusion publicly is socially costly, especially in technical fields where competence is central to identity. Students will not reveal confusion in front of peers if doing so carries social risk. This is why anonymity dramatically changes response rates.

What Actually Works

SG-01
Replace Open Questions with Specific Probes

"Any questions?" is too open. "Is the following statement correct: [plausible misconception]?" is answerable even by confused students, because it requires recognition rather than generation, and it removes the social exposure of asking a question.

SG-02
Anonymous Real-Time Polling

Anonymous polling consistently produces more honest comprehension signals than public questioning. The mechanism is social risk reduction, not technology. Even low-tech anonymous written responses on index cards outperform public questioning.

SG-03
Think-Pair-Share Before Whole-Class Discussion

Asking students to discuss with a neighbor before sharing with the class reduces the social cost of being wrong publicly, because the answer becomes the pair's answer rather than the individual's. Confusion surfaces more readily.

How to Use This Section

Each technique is self-contained — you can run it without reading the rest of the wiki. The mechanism notes explain why it works so you can adapt it intelligently rather than follow it rigidly. Failure modes are included because most technique descriptions omit them, which is why techniques fail silently in practice.

Provenance Standard

Techniques here are either directly from empirical studies (sourced) or adapted from validated techniques with the adaptation noted. Techniques added from personal experience are flagged as practitioner-derived until a supporting study is identified.

TECH-01 — Delayed Pair Explain

Setup time: 0 min  ·  Session disruption: low  ·  Works best: after a new concept, mid-module

Exact Protocol

Step 1. Instructor explains a concept (2–5 min).

Step 2. Student A turns to Student B and explains the concept back in their own words. Student B listens without correcting (90 seconds).

Step 3. Instruction continues. Move on to the next concept or example.

Step 4. 10–20 minutes later — without warning — ask Student B to re-explain the same concept to Student A from memory. Student A can now correct or add to it (90 seconds).

Step 5. Optional: cold-call one pair to share with the class. Do not pre-announce who will be called — this raises the encoding stakes for everyone.

Mechanisms Exploited

Generation effect: Both students must produce an explanation rather than recognize one. The act of constructing the explanation in their own words requires building a representation, not retrieving a memorized phrase.

Spaced retrieval: The 10–20 minute gap between Step 2 and Step 4 is enough to reduce retrieval strength slightly, which means Step 4 is a genuine retrieval attempt rather than a repetition. This is the key difference from standard think-pair-share, where the pair discusses immediately and the gap is zero.

Peer instruction correction: Student A correcting Student B in Step 4 provides immediate feedback from a peer. Fantuzzo et al.'s Reciprocal Peer Tutoring work (1989, Journal of Educational Psychology) showed peer tutoring with structured role rotation produced learning gains comparable to teacher-delivered instruction. Provenance note: Fantuzzo's RPT protocol involves more structured reciprocal tutoring than the paired explanation described here — the citation supports the general principle of structured peer feedback but should not be read as a direct test of this specific protocol.

Encoding stakes: Knowing there is a chance of being cold-called raises encoding effort during original instruction. This is not punitive — it is a legitimate mechanism. The uncertainty of who will be called is sufficient; it does not need to be graded.

Known Failure Modes

Gap too short: If Step 4 follows Step 2 by less than 5 minutes, students are essentially repeating rather than retrieving. The delay is the mechanism — skipping it collapses this into standard think-pair-share.

Student B corrects during Step 2: The instruction to listen without correcting in Step 2 needs to be stated explicitly. If B corrects immediately, A's explanation attempt is interrupted before it completes, which undermines the generation effect for A.

No cold-call threat: If students know Step 5 will never actually happen, encoding stakes drop. Follow through at least occasionally — even once per course is enough to maintain the effect for the rest of the semester.

Pairs who already know each other well: Close friends tend to drift into discussion rather than structured explanation. Mixed or assigned pairs produce more reliable protocol adherence.

Adaptation for Computational Courses

In a coding context, Step 2 can be: Student A explains what a function does and why, not how. Student B then writes (not explains) a one-line docstring from memory 15 minutes later. The docstring is a behavioral output that cannot be faked with vague paraphrase — it requires a precise mental model.
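
A hypothetical instance of the docstring variant (the function is an assumed illustration, not from a cited study):

```python
# Student A explains what this does and why; Student B writes a one-line
# docstring from memory 15 minutes later.
def normalize(scores):
    lo, hi = min(scores), max(scores)
    return [(s - lo) / (hi - lo) for s in scores]

# A precise docstring requires the model, not a vague paraphrase, e.g.:
# "Rescale scores linearly so the minimum maps to 0 and the maximum to 1."
```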

Mechanism sources: Fantuzzo et al. (1989) Journal of Educational Psychology · Roediger & Karpicke (2006) for spaced retrieval component · Slamecka & Graf (1978) for generation component. Protocol design: practitioner-derived synthesis.

Optional Extension — Closing the Feedback Loop

The retrieval attempt in Steps 2 and 4 is sufficient for encoding. It is not sufficient for error correction or diagnostic signal. Add two minutes at the end of the protocol:

Step 6. Students write anonymously on an index card or digital form: either (a) the specific thing they got wrong in their second explanation, or (b) the one thing they still cannot explain confidently. Both options are equally valid — the goal is honest self-assessment, not performance.

Step 7. You read responses before the next session. Open the next session by addressing the two or three most common gaps directly — not by re-explaining the concept from scratch, but by targeting the specific wrong model that appeared most frequently.

Why after retrieval, not after instruction: Muddiest point works better after the retrieval attempt than after original instruction because retrieval surfaces what students cannot actually remember, whereas immediately after instruction students are still in the illusion of explanatory depth and do not yet know what they do not know. The retrieval attempt breaks the illusion first; the muddiest point captures what it reveals.

Do not grade Step 6. The principle that anonymous ungraded responses produce more honest self-assessment than graded responses is consistent with the social desirability literature in survey research (Tourangeau & Yan, 2007) and with Roediger's broader work on retrieval conditions. Provenance note: the specific "Butler & Roediger (2007)" citation used in earlier drafts could not be verified with confidence — treat the graded-vs-diagnostic distinction as well-supported in principle but do not cite that specific paper without verifying it. The graded artifact belongs at checkpoint level, not technique level.


TECH-02 — Peer Instruction with Targeted Wrong Answers

Setup time: requires pre-written question  ·  Session disruption: low  ·  Works best: after instruction, to test model formation

Exact Protocol (Mazur, 1997)

Step 1. Instructor poses a conceptual multiple-choice question. Answer choices include one correct answer and distractors that each correspond to a specific known misconception. Do not use "all of the above" or random wrong answers — distractor quality is everything.

Step 2. Students vote individually and anonymously (clicker, Poll Everywhere, or hands with eyes closed). Do not reveal results yet.

Step 3. Check the distribution privately.

— If >70% correct: question was too easy. Acknowledge, move on, make a harder question next time.

— If 30–70% correct: the productive range. Tell students the distribution without revealing the answer. Students discuss with a neighbor for 2 minutes.

— If <30% correct: students don't have enough knowledge to benefit from peer discussion. Re-teach before re-polling.

Step 4. Re-vote. Reveal results and correct answer. Explain why each distractor corresponds to a specific wrong model — naming the misconception is important, not just announcing the right answer.
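
The Step 3 branching rule is mechanical enough to encode. A sketch, assuming you tally the anonymous votes yourself (the function is hypothetical, not any polling tool's API):

```python
# Decide the next move from the first-vote distribution.
def peer_instruction_next_step(n_correct: int, n_total: int) -> str:
    frac = n_correct / n_total
    if frac > 0.70:
        return "too easy: acknowledge, move on, write a harder question next time"
    if frac >= 0.30:
        return "productive range: share distribution only, 2-minute pair discussion, re-vote"
    return "too little knowledge in the room: re-teach, then re-poll"

print(peer_instruction_next_step(14, 30))  # productive range
```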

Mechanisms Exploited

Productive confusion: The 30–70% range means enough students are wrong that discussion is necessary, but enough are right that correct models are present in the room for peers to encounter. Peer discussion in this range outperforms instructor explanation for the same content (Smith et al., 2009 — re-polling with new isomorphic questions showed retention of peer-discussed answers exceeded instructor-explained answers).

Misconception surfacing: Because distractors map to specific wrong models, the vote distribution tells you exactly which misconception is most prevalent — not just that students are wrong.

Anonymity: Individual anonymous vote before discussion removes the social pressure to agree with neighbors during discussion. Students enter discussion having already committed to a position, which makes the discussion more substantive.

Known Failure Modes

Weak distractors: If wrong answers are obviously wrong rather than plausibly wrong from a specific misconception, the technique collapses into a trivial comprehension check. The hardest part of this technique is writing good distractors. Budget time for it.

Revealing the answer before discussion: If you reveal the correct answer before the Step 3 discussion, students stop reasoning and start justifying. The answer must stay hidden until after the Step 4 re-vote.

Not naming the misconception: Announcing "B is correct" without explaining what mental model C and D represent leaves the wrong models intact. Students who chose C need to know specifically what was wrong about their model, not just that they were wrong.

Adaptation for Computational Courses

Show 5 lines of code. Ask: "What does this output?" with four options — one correct, three corresponding to known wrong models (container model, simultaneity model, off-by-one). The vote distribution tells you which mental model gap is most prevalent in the room right now. This is faster and more informative than asking "does everyone understand?"
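
A hypothetical question in this style (the snippet and the distractor-to-misconception mapping are illustrative assumptions):

```python
# What does this print?
total = 0
for i in range(3):
    total = total + i
print(total)

# A) 3   (correct: range(3) yields 0, 1, 2)
# B) 6   (off-by-one model: believes range(3) yields 1, 2, 3)
# C) 2   (container model: believes total holds only the most recent i)
# D) 0   (believes assignments made inside the loop vanish when it ends)
```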

Source: Mazur, E. (1997). Peer Instruction: A User's Manual. Prentice Hall. · Smith et al. (2009) Science — re-polling with isomorphic questions showed peer discussion produces durable retention.


TECH-03 — Cold Call Retrieval Opener

Setup time: 0 min  ·  Session disruption: none  ·  Works best: first 3 minutes of any session

Exact Protocol

Step 1. Before reviewing anything from last session, ask one question requiring retrieval of a specific concept from the previous meeting: "What was the key difference between X and Y that we established last time?" No notes, no slides up.

Step 2. Give 60 seconds of silent individual writing. Not optional — the writing attempt is the mechanism, not the answer.

Step 3. Cold-call one student (not a volunteer). Hear their answer. Ask a follow-up: "Why does that matter for what we're doing today?"

Step 4. Briefly correct or affirm, then connect explicitly to today's content. Total time: 3–4 minutes.

Mechanisms Exploited

Testing effect with spacing: The gap since last session is the spacing. The retrieval attempt before any review produces stronger encoding than the same amount of re-study (Roediger & Karpicke, 2006). The writing makes the attempt behavioral rather than passive — students cannot coast on a vague sense of recognition.

Encoding stakes: The possibility of being cold-called raises encoding effort during today's session prospectively. Students who know this happens every class encode more carefully during instruction because they know retrieval will be required next session.

Explicit connection: Asking "why does that matter for today" forces the student to construct a link between prior and new knowledge — this is the mechanism behind advance organizers (Ausubel, 1960) but generated by the student rather than provided by the instructor.

Known Failure Modes

Turning it into review: If you put up a slide or summarize before asking the retrieval question, students are recognizing rather than retrieving. The slides must stay down until after the retrieval attempt.

Only calling volunteers: If the cold-call becomes optional in practice, encoding stakes disappear. The technique requires genuine randomness — use a name randomizer or physical cards if necessary.

Not connecting to today: If Step 4 is just "correct — anyway, today we're doing X," the forward link is lost. The connection is not decorative — it is the advance organizer that scaffolds new encoding.

Source: Roediger & Karpicke (2006) for testing effect · Ausubel, D.P. (1960) Journal of Educational Psychology — advance organizer framework for forward linking.


TECH-04 — Predict-Observe-Explain (POE)

Setup time: low  ·  Session disruption: low  ·  Works best: before running code, before showing a result

Exact Protocol (White & Gunstone, 1992)

Step 1 — Predict: Before running code or revealing a result, students write a specific prediction: what will the output be, and why. Both parts are required — a prediction without a justification is not diagnostic.

Step 2 — Observe: Run the code. Students see the actual result.

Step 3 — Explain: Students whose prediction was wrong write an explanation of the discrepancy: "I predicted X because I believed Y. The actual result was Z, which means my model of Y was wrong in this way." Students whose prediction was correct write why they were right — this prevents lucky guesses from going unexamined.

Step 4: Share 2–3 discrepancy explanations with the class. The discrepancies are more informative than the correct predictions.
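
A hypothetical prediction prompt in this style (the snippet is an assumed example; list aliasing is a reliable discrepancy generator for novices):

```python
# Predict the output of the final line, and justify it, before running this cell.
a = [1, 2, 3]
b = a           # the prediction hinges on whether this copies the list or aliases it
b.append(4)
print(a)        # actual output: [1, 2, 3, 4], a surprise to holders of the copy model
```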

Mechanisms Exploited

Conceptual conflict generation: Students who predicted incorrectly have self-generated the cognitive conflict — it cannot be attributed to the instructor being confusing or the example being contrived. This is the strongest form of conflict for driving conceptual change (Posner et al., 1982).

Model externalization: Requiring a written justification forces the mental model into an explicit form where it can be examined. Vague intuitions that produce wrong predictions become visible as specific wrong beliefs that can be corrected.

Correct-prediction examination: Asking correct predictors to explain why they were right prevents the technique from becoming a sorting exercise. Some correct predictions come from correct models; others come from memorized rules that will fail on transfer problems. The explanation distinguishes them.

Known Failure Modes

Prediction without justification: "I think it will output 4" is not useful. "I think it will output 4 because the loop runs three times and adds i each time" is. The justification requirement must be enforced — it is the diagnostic component.

Moving on too fast after observation: Students who were wrong need time to write the discrepancy explanation before the instructor explains. If you explain immediately after revealing the result, students receive the correct model passively rather than constructing it from the conflict. Wait for the writing before explaining.

Only sharing correct predictions: The discrepancies are the value. Sharing only correct predictions turns this into a confidence-building exercise rather than a misconception-surfacing one.

Source: White, R., & Gunstone, R. (1992). Probing Understanding. Falmer Press. · Posner et al. (1982) for conceptual conflict mechanism.


Design Principle — Graded vs Diagnostic Artifacts

In-class techniques produce two kinds of artifacts: graded artifacts that assess whether students have achieved a standard, and diagnostic artifacts that tell you and the student where model gaps are. These serve different purposes and should not be collapsed. Grading a muddiest point or an anonymous retrieval attempt destroys the honesty that makes it diagnostic. Leaving a checkpoint ungraded destroys the accountability signal. The rule: diagnostic artifacts are anonymous and ungraded; graded artifacts are checkpoints at module boundaries where students know in advance they will be assessed.

Adding New Techniques

Describe what you tried in class — the exact steps you followed, what happened, and what you think the mechanism was. Include any failure modes you observed. A technique entry does not need a source to be added — mark it as practitioner-derived and add the source when you find it. The goal is to capture what you are actually doing, not only what has been formally validated.

Why This Matters

A technique introduced mid-semester without prior framing reads as arbitrary or punitive. The same technique introduced on day one as part of explicit course norms reads as purposeful. Student resistance to cold calls, retrieval openers, and anonymous surveys drops substantially when the pedagogy is explained upfront and the system is consistent. Consistency is the mechanism — students encode more carefully during instruction when they reliably know retrieval will follow.


SYS-01 — Establish the Pedagogical Contract on Day One

What to Say and Why

On day one, before any content, spend 10 minutes explaining how the course is structured and why. This is not a syllabus reading — it is an explicit statement of the pedagogical philosophy students are about to experience. Cover three things:

1. Why difficulty is intentional. Tell students directly: some sessions will feel harder than others. The harder sessions are not poorly designed — they are designed to produce struggle because struggle is what builds durable understanding. Cite Bjork if you want credibility: research consistently shows students learn more from conditions that feel harder, even when those conditions produce worse performance in the moment. Give them the language to interpret their own discomfort productively.

2. What the recurring structures are. Name every recurring technique you plan to use — cold call retrieval openers, anonymous muddiest point cards, paired explanations, peer instruction polls. Explain each in one sentence and say when it will happen. Students who know what is coming do not experience these as surprises or traps. Surprise is the enemy of buy-in.

3. The difference between diagnostic and graded. Explicitly tell students that anonymous in-class writing and muddiest point cards will never be graded or attributed to them — their only purpose is to help you adjust instruction. Graded artifacts happen at checkpoints, which are announced in advance. This contract makes anonymous responses more honest and reduces anxiety about in-class participation.

What the Research Says About Upfront Framing

There is empirical support for upfront difficulty framing from utility value research — Hulleman & Harackiewicz (2009, Journal of Educational Psychology) showed that helping students connect coursework to personal relevance reduced performance gaps and increased interest, particularly for students with low initial confidence. Explaining why difficulty is pedagogically intentional is a related but distinct intervention. Provenance note: "Canning et al. (2019) PNAS" cited in earlier drafts could not be verified with confidence as matching the specific protocol described. The general principle — that framing difficulty as expected and meaningful reduces anxiety — is consistent with the self-determination theory literature (Deci & Ryan) and Dweck's mindset work, but cite those instead of Canning et al. until the specific paper is verified.

Separately, Dweck's mindset research operationalizes this as teaching students that struggle signals learning rather than inability. The specific intervention that produced outcomes was not a one-time message but a recurring norm — instructors consistently reframing difficulty as expected and productive rather than exceptional and concerning.


SYS-02 — Anonymous Response Infrastructure

Set Up Before Semester — Choose One Method and Stick to It

Muddiest point, post-retrieval gap cards, and mid-module diagnostic surveys all require an anonymous response channel. The specific tool matters less than consistency — students who have to learn a new tool mid-semester disengage from the content. Set up one channel on day one and use it for everything anonymous throughout the course.

Option A — Physical index cards: Lowest friction. Hand out a stack at the start of every session. Collect at the end. No login, no device, no barrier. The limitation is you cannot aggregate responses quickly across a large class.

Option B — Google Form with a fixed URL: One form, one short URL written on the board, used every session. No account required for submission. Responses are aggregated automatically and searchable. The limitation is it requires devices and a moment of friction to navigate to.

Option C — Poll Everywhere or Mentimeter: Best for in-the-moment multiple choice probes (TECH-02 peer instruction). Less suited for open-ended muddiest point because text responses are visible to the class in real time, which undermines anonymity.

Recommendation for computational courses: Google Form for muddiest point and gap cards (open-ended, private); Poll Everywhere or iClicker for peer instruction polls (multiple choice, immediate). Two tools, two purposes, set up on day one.


SYS-03 — Retrieval Opener Cadence

Decision: Every Session or Weekly?

The testing effect is strongest with consistent spacing — the benefit compounds when retrieval happens on a predictable schedule. The question is whether to do a retrieval opener every session or once per week.

Every session: Maximum encoding benefit. Students encode during instruction knowing retrieval will happen next session. The downside is time cost — 3–4 minutes per session adds up. In a 75-minute class this is manageable. In a 50-minute class it competes with content.

Once per week (first session of the week): Reasonable compromise for shorter class periods. Announce the cadence on day one so students know Monday always starts with retrieval. The encoding stakes effect still operates as long as students know retrieval will happen — the specific day is less important than the consistency.

What not to do: Do not run retrieval openers sporadically or only when you remember. Irregular timing removes the encoding stakes effect because students cannot predict when retrieval will be required. Consistency is the mechanism — it has to be reliable to change how students encode during instruction.

Announce on day one: "Every Monday, the first 3 minutes of class is a cold retrieval question from the previous session. No notes, not graded, but I will cold-call one person. This is how the class works." Students who know this from week one adapt their encoding behavior from week one.


SYS-04 — Muddiest Point Cadence and Response Protocol

Decision: Every Session or Every Module?

Angelo & Cross (1993) recommend muddiest point after every session when first establishing the technique, then pulling back to key sessions once students trust that responses are read and acted on. The critical variable is whether students believe you use the responses — if two or three sessions pass without you visibly addressing muddiest points, response quality drops and eventually response rates drop.

The minimum viable protocol: Muddiest point after every session that introduces a new concept. Not after work sessions, review sessions, or project time — only after new conceptual content where model formation gaps are most likely.

The response protocol matters as much as the collection: Read responses the same evening. At the start of the next session, say explicitly: "Three people said they couldn't explain X. Here is what was missing from most explanations." Do not read responses and say nothing — this is the fastest way to kill response honesty. Students need to see that responses produce a visible change in instruction to believe the channel is real.

What to do when responses are vague: "Everything was clear" or "I understood it fine" as a muddiest point response means one of three things: the student is in the illusion of explanatory depth, the question felt too exposing to answer honestly, or the session genuinely landed well. You cannot distinguish these from the response alone. Use the retrieval opener next session to probe — if responses were genuinely "all clear," retrieval should confirm it.


SYS-05 — Checkpoint Assessment Structure

Where Graded Artifacts Belong in the System

The in-class techniques — retrieval openers, paired explanations, muddiest point, peer instruction — are all ungraded diagnostic infrastructure. The graded layer sits above them at module boundaries. The checkpoint is where you find out whether the accumulated in-class practice has produced durable models.

Checkpoint design principles from the literature:

1. Announce checkpoints on day one with their exact purpose. Students should know from week one that at the end of each module there is a graded assessment that requires far transfer — applying concepts in a context they have not seen before. This is not a surprise; it is the stated standard. Ambiguity about assessment standards produces anxiety; clarity about standards produces preparation.

2. Design for far transfer, not near transfer. Barnett & Ceci (2002) — if the checkpoint problem looks like the module examples, it measures pattern-matching, not model formation. The graded artifact should require the same underlying concept in a structurally novel context. This is harder to design but is the only checkpoint that tells you whether the in-class techniques worked.

3. Allow re-submission with correction memo. Requiring students to write what their model was, what was wrong with it, and what the correct model is forces the conceptual change process explicitly. This is consistent with the conceptual change literature (Posner et al., 1982) and with error correction research showing that retrieving and correcting wrong information produces stronger memory traces than ignoring errors. Provenance note: "Butler & Roediger (2007)" cited in earlier drafts for this claim could not be verified — cite Posner et al. and the general error-correction literature instead. This is not about leniency; it is about using the checkpoint as a learning event.

4. Time checkpoints after spacing, not immediately after instruction. A checkpoint administered immediately after a module measures retrieval strength. A checkpoint administered one week after a module measures storage strength — which is what you actually want to know. If your course structure allows, build a one-week gap between module completion and checkpoint submission.


SYS-06 — Paired Explanation Partner Assignment

Assign or Random? And How Often to Rotate?

Fantuzzo et al. (1989) found that assigned pairs with explicit role rotation outperformed self-selected pairs and random unassigned pairs on learning outcomes. The mechanism is that self-selected pairs optimize for comfort and random unassigned pairs produce inconsistent role-taking. Assigned pairs with clear roles (explainer and listener, rotating) produce the most consistent protocol adherence.

Practical setup on day one: Assign semester-long pairs or rotating monthly pairs. Announce the rotation schedule upfront. For a computational course with project groups, aligning explanation pairs with project partners has the advantage of building the working relationship early — but has the disadvantage that pairs who already work well together may skip the structured protocol and drift into discussion.

Rotation recommendation: Rotate pairs every 3–4 weeks. Long enough that pairs develop a working rhythm; short enough that students are exposed to different explanatory styles and do not become complacent with a familiar partner.

What to tell students on day one: "You will have a designated explanation partner for the first four weeks. Your partner rotates after that. During paired explanation exercises, one person explains and the other listens without correcting until they are asked to. I will tell you which role you are in."


Sample Day-One System Announcement

Template — Adapt to Your Voice

"Before we start the content, I want to explain how this course is structured because it will feel different from some courses you've taken.

Some sessions will feel harder than others. That is intentional. Research on how people learn is clear that the conditions that feel most productive — being given worked examples, reviewing material, getting explanations — produce weaker long-term retention than conditions that feel frustrating — writing code from scratch, being asked to recall something without notes, explaining a concept to a peer before you feel ready. I have designed the hard sessions deliberately. When something feels difficult, that is the learning happening, not a sign something is wrong.

Here is what will happen every class: we will start with a 3-minute retrieval question from last session. No notes, not graded, but I will cold-call someone. Partway through class you will explain a concept to your designated partner. At the end of class I will ask you to write anonymously what is still unclear. I read every card before the next session and we start next time by addressing what came up most. None of the anonymous writing is graded or attributed to you. The graded work happens at module checkpoints, which are announced in advance and are designed to require applying concepts in new contexts, not repeating examples from class.

Your explanation partner for the first four weeks is the person I will assign in a moment. You will rotate every four weeks."

Key Literature for This Section

Hulleman & Harackiewicz (2009) Journal of Educational Psychology — utility value framing reduces performance gaps · Dweck, C.S. (2006) Mindset — growth mindset as recurring norm · Deci & Ryan self-determination theory — autonomy and competence framing · Angelo & Cross (1993) — muddiest point cadence and response protocol · Fantuzzo et al. (1989) — structured peer tutoring with role rotation · Barnett & Ceci (2002) — far transfer checkpoint design · Posner et al. (1982) — correction memo as conceptual change mechanism. Note: "Canning et al. 2019 PNAS" and "Butler & Roediger 2007" cited in earlier drafts could not be verified — removed pending confirmation.

The Core Diagnostic

Before designing a fix, identify which failure mode you have. They look identical from the front of the room but have different causes and different solutions. An activity can fail because the accountability structure is missing, because it competes with higher-stakes work at the wrong moment, or because the social cost is higher than students will pay without explicit scaffolding. Most collapsed activities have all three problems simultaneously.

The Three Failure Modes

FM-01
Missing Accountability Structure — Task Substitution

Students replace the assigned activity with a more comfortable or immediately useful one. This is called task substitution and is rational behavior when non-compliance has no consequence. An activity that is optional in practice will be substituted whenever deadline pressure is present, regardless of what the instructor says. The diagnostic question: what happens to a student who does not do this activity? If the answer is nothing, the activity is structurally optional.

Source: Michaelsen, Knight & Fink (2004) Team-Based Learning — cross-team activities require both a deliverable that cannot be produced without the interaction and a visible reporting structure to achieve reliable compliance.

FM-02
Wrong Timing — Activity Competes With Higher-Stakes Work

An activity positioned during acute deadline pressure competes directly with the work students are most anxious about. From the student's perspective, explaining your model to another group when the project is due next week is a tax on project time, not a contribution to it — even when the explanation would genuinely help. Peer learning activities work better positioned as preparation for a checkpoint than as a diversion during project work.

FM-03
Unscaffolded Social Cost — Exposure Avoidance

Explaining your work to your own group is low stakes — they know your work and will not judge gaps harshly. Explaining to a different group means exposing uncertainty to near-strangers in a domain where competence is salient. Students who are most uncertain about their model will avoid cross-group explanation most strongly — which means the students who would benefit most are precisely those most likely to opt out. The activity self-selects for students who least need it.

Source: Cooperative vs collaborative task distinction — Laal & Ghodsi (2012) International Conference on Education and Educational Psychology. Cooperative tasks allow pooling; collaborative tasks require individual demonstration. Activities that are collaborative in intent but cooperative in execution will default to pooling without structural enforcement.


Design Principles for Compliance

Principle 1 — Every Activity Needs a Physical Artifact

An artifact is anything that cannot be produced without completing the activity and that gets collected or displayed. It does not need to be graded heavily — a 0/1 completion mark is sufficient. The artifact serves two functions: it makes non-compliance visible rather than invisible, and it makes the activity feel purposeful rather than procedural.

Examples of minimal artifacts: An index card handed to you at the end. A one-sentence written response in a shared doc. A vote or response in Poll Everywhere. A sticky note on the board. The content matters less than the fact that it exists and is collected.

What does not work as an artifact: Verbal participation without a record. "Did everyone do this?" as a check. Circulating to observe groups — students resume compliance while you are watching and substitute when you leave.

Principle 2 — Cross-Group Movement Must Be Structural, Not Requested

If students can stay in their existing groups and still technically comply, they will. Cross-group interaction requires a structure that makes staying put impossible or makes the required output unavailable from your own group.

The single-representative constraint: Send one person from each group to another group. The home group cannot run the activity among its remaining members, because the activity requires the person who left. The representative, physically seated with another group, cannot quietly revert to project time either. The activity and the project work are spatially separated, which enforces compliance without monitoring.

The information gap constraint: Design the activity so each group has information the other group needs to complete a task. Neither group can produce the output alone. This is the jigsaw method (Aronson, 1978) — interdependence is structural rather than requested.

Principle 3 — Reframe From Evaluation to Consultation

Cross-group explanation activities collapse partly because they feel like performance evaluations to students who are uncertain about their work. Reframing the same activity as receiving expert input rather than demonstrating competence reduces the social cost enough to change behavior.

Language matters specifically: "Explain your model to another group" positions the explainer as being assessed. "Act as a consultant to another group and identify their main risk" positions the student as the expert rather than the subject. The task is structurally identical but the social dynamic is inverted.

Source: Reframing effects on task engagement are documented in self-determination theory research (Deci & Ryan, 2000) — autonomy-supportive framing ("you are here to help them") produces higher engagement than evaluative framing ("show what you know"), particularly for students with lower confidence.

Principle 4 — Position Activities as Preparation, Not Interruption

An activity that competes with deadline-proximate project work will lose. The same activity positioned as preparation for an upcoming checkpoint reframes the time cost as investment rather than tax.

Exact framing: "Before the checkpoint next week, you need to be able to explain your model's overfitting risk to someone outside your group. This activity is practice for that — and the feedback you get will directly improve your checkpoint submission." This makes the activity instrumentally useful to the project rather than orthogonal to it.

Timing rule of thumb: Cross-group explanation activities work best 1–2 sessions before a checkpoint, not during the final session before a deadline. At the final session, deadline anxiety dominates everything and no activity will compete successfully.


Worked Example — The ML Model Consultancy Round

Redesign of the cross-group overfitting explanation activity that collapsed into project time. Addresses FM-01, FM-02, and FM-03 simultaneously.

Setup (announce at start of session, not mid-session)

"In 20 minutes we are doing a consultancy round. Each group will send one person — the person whose name I call — to a different group. That person is a consultant. Their job is to listen to the other group describe their model, identify the single biggest overfitting or underfitting risk they hear, and write it on this card. The card gets left with the group they visited and handed to me. The rest of your group keeps working."

Assign consultants by name rather than by group choice. Use a randomizer. Do not ask for volunteers.
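
A minimal sketch of such a randomizer, assuming a hypothetical roster (group labels and names are illustrative). Routing consultants around a shuffled cycle guarantees every group both sends and receives exactly one consultant, and nobody visits their own group:

```python
import random

# Hypothetical roster: group label -> member names. Illustrative only.
groups = {
    "A": ["Priya", "Marcus", "Lena", "Tom"],
    "B": ["Sofia", "Dmitri", "Alice", "Ben"],
    "C": ["Yuki", "Omar", "Grace", "Noah"],
}

def assign_consultants(groups, seed=None):
    """Pick one consultant per group and send them to the next group
    in a shuffled cycle, so no consultant visits their own group."""
    rng = random.Random(seed)
    order = list(groups)
    rng.shuffle(order)  # random cycle over the groups
    assignments = {}
    for i, group in enumerate(order):
        destination = order[(i + 1) % len(order)]
        consultant = rng.choice(groups[group])
        assignments[group] = (consultant, destination)
    return assignments

for group, (consultant, dest) in assign_consultants(groups, seed=7).items():
    print(f"Group {group}: {consultant} consults for group {dest}")
```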

The Activity (15 minutes)

Minutes 1–5: Each remaining group prepares a 3-sentence description of their model: what it does, what their current validation performance is, and what they think the main risk is. Written, not verbal — the writing forces precision and prevents the group from realizing mid-explanation that they cannot describe their own model clearly.

Minutes 5–12: Consultant arrives at the other group. The group reads their 3-sentence description aloud. Consultant asks one clarifying question only. Consultant writes their diagnosis on the card: "Based on what I heard, the main risk is X because Y."

Minutes 12–15: Consultants return to their own groups. Each group now holds the diagnosis card written by the consultant who visited them; that card is handed to the instructor.

Why This Works — Structural Analysis

FM-01 addressed: The card is the artifact. It cannot be produced without the cross-group interaction. It is collected. Non-compliance is visible — a missing card means the activity did not happen.

FM-02 addressed: The remaining group members keep working on the project during the activity. Project time is not lost — it is redistributed. One person leaves for 15 minutes; four keep working. The activity is framed as generating useful external feedback, not interrupting work.

FM-03 addressed: The consultant role inverts the social dynamic. The consultant is the expert diagnosing risk, not the student being evaluated. The group receiving the consultant is the one being assessed, not the consultant. Students who are uncertain about their own model are comfortable in the consultant role even when they would avoid the explanation role.

Single-representative constraint: The rest of the group cannot substitute because the consultant is gone. The physical separation enforces the activity without requiring constant monitoring.

Variations

Gallery walk version: Each group posts their 3-sentence model description on the wall. All students circulate for 10 minutes and leave one sticky note on each posting: either a diagnosis ("main risk: X") or a question ("what happens when Y?"). No cross-group conversation required — lower social cost, lower diagnostic value.

Pre-checkpoint version: Run the consultancy round in the session before a checkpoint. The consultant's card becomes part of the checkpoint submission — groups must address the identified risk in their final write-up. This makes the activity instrumentally necessary to the checkpoint, not optional to it.

Two-consultant version: Send two people from each group to two different groups simultaneously. The home group keeps working with its remaining members. This doubles cross-group exposure at the cost of leaving fewer people on the home project.


The General Rule

Design Test for Any Activity

Before running any in-class activity, answer these four questions:

1. What is the artifact, and can it be produced without completing the activity?

2. Is there a structural constraint that prevents students from staying in their existing group?

3. Does the framing position students as experts or consultants rather than as subjects being evaluated?

4. Is this timed at least one session before a deadline rather than during the final work session?

If any answer is no, redesign before running. A collapsed activity is harder to recover than a well-designed one is to run.

Key Literature

Michaelsen, Knight & Fink (2004) Team-Based Learning. Stylus. — Cross-team activity design; artifact and reporting structure requirements for compliance. · Aronson, E. (1978) The Jigsaw Classroom. Sage. — Structural interdependence as the mechanism for genuine collaboration. · Laal & Ghodsi (2012) — cooperative vs collaborative task distinction. · Deci & Ryan (2000) self-determination theory — autonomy-supportive vs evaluative framing effects on engagement.

Provenance Warning on "Active Learning"

Freeman et al. (2014, PNAS) is the largest meta-analysis of undergraduate STEM instruction — 225 studies, 0.47 SD performance advantage, students 1.5x more likely to fail under traditional lecture. This is a real and robust finding. But "active learning" in that meta-analysis is defined as any approach engaging students in higher-order thinking during class — it aggregates over 50 different specific interventions. The effect size tells you something is working. It does not tell you which intervention to use or why. The systems below are specific enough to implement and evaluate.


SYS-07 — Exit Ticket System

Setup: one-time tool configuration · Per-session cost: 3–5 minutes · Best for: closing every concept-heavy session with a diagnostic signal

What It Is and What the Research Shows

An exit ticket is a brief written response collected in the last 3–5 minutes of class, before students leave. It is not a quiz — it is a structured retrieval and reflection prompt. A seven-semester longitudinal study (Baker et al., 2024, College Teaching) found that iteratively refined exit tickets improved student performance on higher-stakes assignments over time, and that removing instructor-guided prompts progressively — starting with structured questions and shifting toward free writing — produced the strongest metacognitive development.

The mechanism is dual: for students, the exit ticket is a forced retrieval attempt that surfaces the illusion of explanatory depth before they leave the room. For you, it is a real-time diagnostic that tells you what to address at the start of next session rather than guessing.

Exact Protocol

Step 1. In the last 4 minutes of class, display one prompt — not multiple questions. One well-designed question outperforms three recall questions on both engagement and diagnostic value.

Step 2. Students respond anonymously in writing — index card, Google Form, or Poll Everywhere. Not graded. Not attributed.

Step 3. You read responses before next session. Open next session with: "Three people wrote X. Here is what was missing." Address the two most common gaps specifically.
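
When responses arrive through a Google Form, a rough word tally can speed up finding the two most common gaps before you read in detail. A minimal sketch, assuming the responses were exported as a CSV with one free-text column named "response" (the filename, column name, and stopword list are illustrative):

```python
import csv
from collections import Counter

# Illustrative stopword list -- extend as needed for your prompts.
STOPWORDS = {"the", "and", "that", "what", "how", "was", "for", "not",
             "but", "you", "are", "this", "with", "about", "still"}

def common_terms(path, column="response", top_n=10):
    """Rough frequency count of substantive words across exit tickets.
    A starting point for spotting shared confusion, not a replacement
    for actually reading the responses."""
    counts = Counter()
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            for word in row[column].lower().split():
                word = word.strip(".,?!()\"'")
                if len(word) > 2 and word not in STOPWORDS:
                    counts[word] += 1
    return counts.most_common(top_n)

# e.g. [("leakage", 9), ("validation", 7), ...] points at the two gaps
# to open the next session with.
print(common_terms("exit_tickets_week05.csv"))
```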

Prompt design matters more than the tool: The weakest prompts ask for recall ("what is k-means?"). The strongest ask for application or self-diagnosis:

— "Describe one thing from today you could explain to someone outside this course."

— "What would break in your current project if you misunderstood today's concept?"

— "What is still unclear, and what specifically is unclear about it?"

The third prompt is diagnostic. The second is transfer. The first is generative retrieval. Rotate among the three types over the semester.

Known Failure Modes

Not responding to what you collect. This is the fastest way to kill exit ticket honesty. If students see no evidence that responses affect instruction, response quality drops within two weeks. The response protocol — visibly addressing common gaps next session — is not optional. It is what makes the system work.

Grading exit tickets. Graded exit tickets shift student optimization from honest self-assessment to performance. They produce longer responses and less honest ones. Keep them ungraded; award participation credit at most.

Using the same prompt format every session. Students habituate to prompt formats quickly. A recall prompt every session produces diminishing engagement after week three. Rotating prompt types maintains the retrieval effort that produces the encoding benefit.


SYS-08 — Readiness Assurance System (Pre-Class Preparation Loop)

Setup: pre-class reading or resource assigned · Per-session cost: 10–15 minutes · Best for: modules where in-class time should be spent on application, not transmission

What It Is and What the Research Shows

From Team-Based Learning (Michaelsen, Knight & Fink, 2004): students complete pre-class preparation, then take a brief individual Readiness Assurance Test (iRAT) at the start of class — typically 5–8 multiple choice questions on the assigned material. The same test is then taken again as a team (tRAT). Class time is spent entirely on application problems that require the pre-class material.

The key research finding: ungraded iRATs in highly motivated students produce comparable learning outcomes to graded iRATs, with substantially lower stress and higher quality discussion (Rudio et al., 2021, PMC; multiple pharmacy education studies 2020–2023). The enforcement mechanism does not need to be a grade — it needs to be visible accountability to peers. Students who did not prepare cannot contribute to the team discussion, which creates social accountability without grade stakes.

A separate finding (Koh et al., 2019) showed graded iRATs did produce higher pre-class preparation rates in some populations — the evidence is mixed and context-dependent. For highly motivated students (graduate level, professional programs) ungraded works. For lower-motivation contexts, low-stakes grading may be necessary.

Exact Protocol — Lightweight Version for Computational Courses

Before class: Assign 20–30 minutes of preparation — a reading, a short video, or a notebook to run and observe. Frame it specifically: "After this, you should be able to answer: what is the difference between validation loss and test loss, and when does each matter?"

Start of class (5 min): 3–5 individual questions on the preparation material. Low-stakes — completion credit only, not graded for correctness. Students answer before any discussion.

Team phase (5 min): Same questions discussed in groups. Groups must reach consensus. One answer per group submitted.

Remaining class time: Spent entirely on application — a problem, a project checkpoint, a debugging exercise — that requires the pre-class material. If students did not prepare, they cannot contribute. The application problem is the enforcement mechanism, not the quiz grade.

Known Failure Modes

Pre-class material is too long or too vague. "Read chapter 4" produces inconsistent preparation. "Read pages 12–18 and be able to answer these two specific questions" produces consistent preparation. The specificity of the framing question determines preparation quality more than the length of the material.

Application problem does not require the preparation. If students can complete the in-class application without the pre-class reading, the preparation becomes optional in practice. The application problem must be designed to require the specific concept from the pre-class material — not adjacent to it.

iRAT questions test recall rather than readiness. "Define overfitting" tests whether students read. "A model has 97% training accuracy and 61% validation accuracy — what is the most likely cause and what would you check first?" tests whether they understood. The second question also functions as an activation probe for the application work that follows.
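
The second question also pairs naturally with a runnable pre-class artifact. A minimal sketch of such a notebook cell, assuming scikit-learn is available (the synthetic dataset and tree depths are illustrative): students run it, watch the train/validation gap open up for the unconstrained model, and arrive at class able to name overfitting as the cause.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data so the notebook is self-contained.
X, y = make_classification(n_samples=600, n_features=20, n_informative=5,
                           random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3,
                                                  random_state=0)

for depth in (None, 3):  # unconstrained vs regularized tree
    model = DecisionTreeClassifier(max_depth=depth, random_state=0)
    model.fit(X_train, y_train)
    print(f"max_depth={depth}: "
          f"train={model.score(X_train, y_train):.2f}, "
          f"val={model.score(X_val, y_val):.2f}")

# The unconstrained tree reaches ~1.00 training accuracy with visibly
# lower validation accuracy -- the exact pattern the iRAT question probes.
```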


SYS-09 — Low-Stakes Weekly Quiz System

Setup: quiz infrastructure (Canvas, Google Form) · Per-session cost: 5–10 minutes · Best for: courses where concept accumulation matters and students tend to defer studying

What It Is and What the Research Shows

A short quiz (3–5 questions) administered at a fixed point each week — either start of first session or end of last session. Low stakes: worth a small and fixed percentage of the grade, or completion-only credit. The mechanism is the testing effect (Roediger & Karpicke, 2006) applied at weekly intervals to force spaced retrieval across the course.

The specific finding from a medical education scoping review (Heeneman et al., 2025, PMC): detailed asynchronous feedback on low-stakes quizzes significantly improves exam performance — more than the quiz itself. The quiz surfaces gaps; the feedback closes them. A quiz without feedback is a measurement. A quiz with targeted feedback is a learning event.

Optional vs required: Lerchenfeldt et al. (2021) found that optional low-stakes quizzes produced comparable exam performance to required quizzes when students were told that quiz content predicted exam content. The enforcement mechanism shifted from grade stakes to self-interest. This is only reliable when students believe the signal — it requires you to actually use quiz content on exams and say so explicitly.

Exact Protocol

Design: 3–5 questions. Mix one recall question (verifying basic coverage), one application question (requiring use of the concept), and one transfer question (novel context). The transfer question is the most diagnostic — students who can only answer the recall question have not built a usable model.
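
One way to hold that mix constant is to tag a question bank by type and draw one question of each type per week. A minimal sketch (the bank contents and the seed-per-week scheme are illustrative assumptions, not part of the protocol):

```python
import random

# Hypothetical question bank tagged by type; contents are illustrative.
QUESTION_BANK = {
    "recall": [
        "State the difference between training loss and validation loss.",
    ],
    "application": [
        "Your validation loss rises after epoch 10 while training loss "
        "keeps falling. What is happening and what would you change?",
    ],
    "transfer": [
        "A colleague tunes hyperparameters against the test set and "
        "reports 92% test accuracy. Why is that number untrustworthy?",
    ],
}

def build_weekly_quiz(bank, week):
    """Draw one question of each type so every quiz mixes recall,
    application, and transfer."""
    rng = random.Random(week)  # reproducible draw per week
    return [rng.choice(bank[qtype])
            for qtype in ("recall", "application", "transfer")]

for q in build_weekly_quiz(QUESTION_BANK, week=5):
    print("-", q)
```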

Timing: Fixed, predictable, same slot every week. Announce on day one. The predictability is the mechanism — it changes how students encode during the preceding week, not just in the hour before the quiz.

Feedback protocol: Return with targeted feedback within 24 hours. Not "correct/incorrect" — a one-sentence explanation of what the wrong answer reveals about the model gap. This is the highest-value part of the system and takes 15–20 minutes of your time per quiz cycle once you have a template.

Grading: Completion credit (submitted = full credit) or low-stakes (5–10% of total grade). Do not grade for correctness if you want honest attempts — students who fear penalty guess strategically rather than revealing genuine model gaps.

Known Failure Modes

No feedback returned. Without feedback the quiz is pure measurement with no learning benefit beyond the retrieval attempt itself. The retrieval attempt has value — but feedback multiplies it. If you cannot commit to 24-hour feedback, reduce quiz frequency rather than skip feedback.

All recall questions. A quiz that only tests whether students read produces strategic reading — students scan for testable facts rather than building models. At least one application or transfer question per quiz is required to reward genuine understanding over coverage scanning.

Irregular timing. A quiz system that appears sporadically does not change encoding behavior during the week. Students prepare for a quiz the night before if it is predictable, and not at all if it is not. Fixed weekly timing is the mechanism, not the quiz content itself.


SYS-10 — The Engagement Gradient: Structuring Passive-to-Active Within a Session

The Problem With Purely Active Sessions

A session that is active from start to finish produces cognitive overload for students who do not yet have the schema to process new information while simultaneously producing output. The research on cognitive load (Sweller, 1988) is clear: worked examples and direct instruction are more efficient than discovery learning for students with low prior knowledge. Active processing produces its benefits at the edge of student competence — not before competence exists.

The practical implication: a session should have an explicit passive phase and an explicit active phase, in that order. Direct instruction or demonstration first — long enough to build the schema needed for active processing. Active processing second — applied immediately to the material just covered. The transition between the two phases is where most of the techniques in this wiki sit.

Recommended Session Architecture for Computational Courses

Minutes 0–3: Cold retrieval opener (TECH-03) — no slides, spaced retrieval from prior session.

Minutes 3–20: Direct instruction or live coding demonstration. Instructor talks. Students observe and take notes. This is the schema-building phase — do not interrupt it with activities.

Minutes 20–22: Prediction task (TECH-04, Phase 1 only) — students predict output of the next code block before it runs. 90 seconds, written.

Minutes 22–35: Students write code from scratch applying the demonstrated concept. Paired or individual. Instructor circulates.

Minutes 35–45: Peer instruction poll (TECH-02) — one targeted misconception question on the concept just practiced.

Minutes 45–55: Application problem or project work.

Minutes 55–58: Exit ticket (SYS-07) — one prompt, anonymous, collected.

This architecture ensures every session has a diagnostic opening, a schema-building phase, a generation event, a misconception check, and a diagnostic close. It is not prescriptive — compress or expand phases based on content — but the sequence of passive before active and diagnostic at open and close should be held consistently.

Key Literature

Freeman et al. (2014) PNAS — 225-study meta-analysis, 0.47 SD active learning advantage, 1.5x failure rate under lecture. Active learning defined as any higher-order engagement during class — not a single intervention. · Michaelsen, Knight & Fink (2004) Team-Based Learning — iRAT/tRAT readiness assurance system. · Baker et al. (2024) College Teaching — seven-semester exit ticket study, iterative refinement improves higher-stakes performance. · Heeneman et al. (2025) PMC scoping review — feedback on low-stakes quizzes significantly improves exam performance. · Sweller, J. (1988) Cognitive load theory — direct instruction before active processing for low-prior-knowledge learners. · Roediger & Karpicke (2006) — testing effect, basis for weekly quiz spacing.

Project-Based Learning — Honest Assessment of the Evidence Base

Meta-analyses of PBL report aggregate effect sizes of d = 0.65–0.85, which look impressive. The problem is that these aggregate over implementations ranging from highly scaffolded, milestone-driven projects to loosely structured open-ended discovery. The quality of scaffolding is the primary moderating variable — not the fact of using projects. A poorly scaffolded PBL course produces weaker outcomes than well-structured direct instruction. A well-scaffolded PBL course produces better transfer, higher-order thinking, and retention of applied skills than either.

The Central Tension — Kirschner et al. vs PBL Proponents

Kirschner, Sweller & Clark (2006, Educational Psychologist) argued that minimally guided instruction — including PBL — fails because it ignores working memory limitations. For novice learners, unguided discovery produces cognitive overload: students cannot simultaneously search for solutions and encode new schemas. Direct instruction with worked examples is more efficient for initial knowledge acquisition.

PBL proponents (Schmidt et al., 2007, same journal) correctly pointed out that Kirschner et al. conflated PBL with unguided discovery learning. Well-designed PBL is heavily scaffolded — it is not asking students to discover everything independently. The honest resolution: Kirschner et al. are right about unscaffolded PBL; they are wrong to apply that critique to scaffolded PBL. The distinction matters enormously for design.

The Moderator That Matters Most

A 2023 meta-analysis (Chen & Yang, Frontiers in Psychology, 66 studies) found PBL implementations of 9–18 weeks showed the strongest effects (SMD = 0.673). Implementations longer than 18 weeks showed substantially smaller effects (SMD = 0.359). The finding: PBL fatigue is real. Projects that run the full semester without structured phase transitions produce diminishing engagement and learning returns after approximately half a semester.

What PBL Does Well vs Poorly

PBL+
Where PBL Outperforms Direct Instruction

Transfer to novel problems, higher-order thinking (analysis, synthesis, evaluation on Bloom's taxonomy), motivation and sustained engagement, retention of applied skills over time, and collaborative problem-solving ability. A 2025 meta-analysis in biology higher education (42 studies, 5,247 students) found d = 0.847 specifically for higher-order thinking skills — the effect was larger for conceptual understanding than for factual recall.

PBL−
Where PBL Underperforms Without Scaffolding

Initial acquisition of foundational knowledge and procedural skills — especially for novice learners. An MDPI 2025 systematic review found that surface-level PBL implementations lacking scaffolding, structured reflection, and formative assessment produced weak and inconsistent outcomes. The failure modes: vague instructional guidance, lack of alignment between project work and core concepts, and insufficient structured check-ins allowing students to drift without detecting it.

The Five Conditions for Effective PBL

Kokotsaki et al. (2016) synthesized the PBL implementation literature and identified five conditions that consistently distinguish effective from ineffective implementations:

Condition 1 — Sufficient Prior Knowledge Before the Project Starts

Students cannot apply what they do not know. PBL works best as a vehicle for applying and extending knowledge, not as the primary mechanism for acquiring it. This maps directly onto the two-phase model: direct instruction or structured modules first to build the schema; project work second to apply and extend it. Starting a project before students have the foundational concepts produces the cognitive overload Kirschner et al. correctly identified.

Implementation check: Before assigning a project phase, ask — can every student in this class produce a working solution to a simplified version of this problem independently? If no, the foundational knowledge phase is not complete.

Condition 2 — Scaffolding That Decreases Over Time

Early project phases require high instructor guidance — specific prompts, constrained choices, worked examples of process (not solution). Later phases progressively reduce scaffolding as competence builds. This is the zone of proximal development (Vygotsky) applied at project scale. The failure mode in most PBL implementations is holding scaffolding constant — either too high throughout (which prevents the development of independence) or too low throughout (which produces the cognitive overload problem).

Practical structure: Module 1 project phase: constrained scope, specific deliverable format, instructor-provided data. Module 3 project phase: student-chosen scope, student-justified format, student-sourced data. The increase in autonomy should be deliberate and staged, not an accident of reduced instruction time.

Condition 3 — Structured Reflection Built Into Milestones

The 2025 MDPI systematic review identified structured reflection as one of the most consistently underimplemented components of PBL. Reflection is not asking "how did that go?" at the end — it is a structured prompt at each milestone that requires students to connect project decisions to course concepts explicitly. Without it, students can complete a project successfully without encoding why the approach worked, which prevents transfer.

Reflection prompt structure that works: (1) What decision did you make at this milestone and what were the alternatives? (2) Which course concept justified your choice? (3) What would have happened if you had made a different choice? Three questions, written, collected at each milestone. Not graded for correctness — graded for completion and specificity.

Condition 4 — Formative Assessment at Each Phase, Not Just Final Submission

The most common PBL failure mode in computational courses is that problems compound invisibly across phases. Students make a wrong architectural decision in phase 1, build on it in phase 2, and arrive at phase 3 with a fundamentally broken approach that cannot be fixed in the time remaining. Phase-level formative checkpoints — with specific feedback, not just a grade — catch compounding errors before they become unrecoverable.

Checkpoint design: Each milestone checkpoint should require students to demonstrate the specific concept introduced in the corresponding module, applied to their project. Not "submit your code" — "show that your model does not have data leakage and explain how you verified it." The concept-to-project mapping makes the checkpoint both a learning event and a project quality control.

Condition 5 — Individual Accountability Within Group Work

Group projects without individual accountability produce free-riding and uneven learning — students who do less work learn less, and the distribution of effort within groups is rarely what instructors assume. Michaelsen's TBL research shows that individual accountability mechanisms — individual components of group deliverables, individual oral defenses, randomly assigned spokesperson roles — produce more equitable learning outcomes without reducing the collaborative benefits of group work.

Minimal viable individual accountability: Each group member writes one paragraph of the milestone reflection independently, attributed to them individually. This requires every member to connect the project work to course concepts in their own words — the students who did less work reveal themselves through the quality of their reflection, not through peer rating or self-report.

How PBL Fits With the Rest of This Wiki

PBL is not a replacement for the session-level techniques documented elsewhere here — it is the container that gives those techniques their applied context. The probe sequencing section describes how to build conceptual understanding before students apply it. The engagement systems section describes how to structure sessions within a project phase. The activity compliance section describes how to prevent project work from substituting for other learning activities. PBL works best when those session-level systems are running inside it, not when it is treated as a standalone pedagogy that replaces them.

Key Literature

Kirschner, Sweller & Clark (2006) Educational Psychologist 41(2) — minimally guided instruction and cognitive load. The critique applies to unscaffolded PBL, not well-designed PBL. · Schmidt et al. (2007) same journal — rebuttal correctly distinguishing scaffolded PBL from discovery learning. · Chen & Yang (2023) Frontiers in Psychology — 66-study meta-analysis; 9–18 week sweet spot finding. · Kokotsaki et al. (2016) — five facilitating conditions synthesis. · MDPI systematic review (2025) — scaffolding and structured reflection as underimplemented components. · Alfieri et al. (2011) Journal of Educational Psychology — meta-analysis finding that discovery learning works when it includes feedback and worked examples; fails without them.

Picking Your Systems — The Overload Problem

Every system in this wiki imposes a cost on both you and your students. Exit tickets require reading and responding. Weekly quizzes require writing and returning feedback. Muddiest point requires reading before next session. Peer instruction requires pre-written questions. If you implement all of these simultaneously from week one, you will produce two things: student cognitive and compliance fatigue, and instructor burnout from the feedback volume. The research on implementation science is clear — full implementation of fewer systems produces better outcomes than surface-level implementation of many.

A Framework for Choosing — Three Variables

Variable 1 — What Is Your Highest-Value Diagnostic Need?

Every system in this wiki produces a different kind of signal. Before choosing a system, identify what you most need to know that you currently cannot find out:

If you need to know whether students built accurate mental models during instruction → Exit ticket or muddiest point. Low setup cost, immediate signal, works every session.

If you need to know whether students are doing preparation before class → Readiness assurance (SYS-08). Requires pre-class material design but produces immediate in-class signal.

If you need to know whether concepts are accumulating across weeks → Weekly quiz (SYS-09). Requires quiz infrastructure and 24-hour feedback commitment.

If you need to know which specific misconception is most prevalent right now → Peer instruction poll (TECH-02). Requires pre-written questions with good distractors — time-intensive to design, fast to run.

Start with the signal you most need. Add other systems only after the first one is running reliably.

Variable 2 — What Is Your Realistic Feedback Capacity?

Every diagnostic system only produces value if you close the loop. The systems that require the most feedback time are: weekly quiz with written feedback (15–20 min per quiz cycle), muddiest point read-and-respond (10–15 min per session), and exit ticket read-and-respond (10 min per session). The systems with the lowest feedback overhead are: peer instruction poll (feedback delivered live during class) and cold call retrieval opener (feedback delivered live, no prep required).

A realistic capacity estimate: For a single course, a sustainable feedback load is probably one written feedback cycle per week plus one live diagnostic per session. Anything beyond that requires either reducing course load elsewhere or automating part of the response (e.g. pre-written feedback for common wrong answers on quizzes).

The automation option: For weekly quizzes, writing three targeted feedback responses to the three most common wrong answers and sending them to the whole class is more efficient than individual feedback and nearly as effective. You are addressing the most prevalent misconceptions, which are also the ones most likely to appear on the next assessment.
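
A minimal sketch of that automation, assuming multiple-choice quizzes where the plausible wrong answers can be enumerated in advance (the answer keys and feedback sentences are illustrative):

```python
from collections import Counter

# Hypothetical mapping from a known wrong answer to a pre-written,
# model-level feedback sentence.
FEEDBACK = {
    "B": "Choosing B suggests you read validation loss as fit to the "
         "training data; it measures fit to held-out data.",
    "C": "Choosing C suggests you expect more epochs to always help; "
         "past the overfitting point they hurt validation performance.",
}

def class_feedback(wrong_answers, top_n=3):
    """Return pre-written feedback for the most common wrong answers,
    sent to the whole class in a single message."""
    lines = []
    for answer, count in Counter(wrong_answers).most_common(top_n):
        note = FEEDBACK.get(answer, "See the worked solution.")
        lines.append(f"{count} of you answered {answer}. {note}")
    return lines

# wrong_answers would come from this week's quiz submissions.
print("\n".join(class_feedback(["B", "B", "C", "B", "D", "C"])))
```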

Variable 3 — Where Are You in the Semester?

Systems have different optimal introduction points. Introducing too many systems simultaneously at the start of semester produces student compliance anxiety rather than engagement. Introducing systems mid-semester without prior framing reads as arbitrary or punitive.

Week 1: Establish the pedagogical contract (SYS-01). Name every system you plan to use. Set up the anonymous response infrastructure (SYS-02). Assign pairs (SYS-06). Start cold call retrieval opener (TECH-03) — it requires zero setup and establishes the retrieval norm from day one.

Weeks 1–3: Run exit tickets or muddiest point every session while you are building the norm that responses affect instruction. This is the investment phase — the payoff is student trust in the diagnostic channel.

Week 3 onwards: Add one system at a time once the prior one is running reliably. The readiness assurance system is a natural addition once students know what to expect from pre-class preparation. The weekly quiz is a natural addition once the exit ticket system is established.

Mid-project phases: Reduce session-level diagnostic systems during final project phases when deadline pressure dominates. Exit tickets still work but should shift to "what decision are you most uncertain about in your project right now?" rather than concept-level recall.

The Minimum Viable System Set

The Smallest Set That Covers the Most Important Needs

If you could only implement three systems from this wiki, the combination with the highest impact-to-cost ratio is:

1. Cold call retrieval opener (TECH-03) — zero setup, 3 minutes per session, establishes encoding stakes and spaced retrieval simultaneously. Run every session.

2. Exit ticket (SYS-07) — 4 minutes per session to collect, 10 minutes to read and prepare next-session response. Closes the diagnostic loop that the cold call opener opens. Run every concept-heavy session.

3. Milestone reflection prompts (PBL Condition 3) — if running a project-based course, the structured reflection at each milestone is the single highest-leverage intervention for connecting project work to course concepts. No additional session time required — built into existing checkpoint structure.

These three cover the encoding stakes mechanism, the diagnostic feedback loop, and the concept-to-application bridge. Everything else in this wiki adds specificity and depth to one of those three functions.

Signals That You Have Overloaded the System

Warning Signs — For You and For Students

Instructor warning signs: You are reading exit tickets but not responding to them in class. Quiz feedback is taking more than 30 minutes per cycle. You are dreading the start-of-session retrieval opener. You have forgotten which system you are supposed to run this week. Any of these means the system load exceeds your sustainable capacity — remove one system before the quality of all of them degrades.

Student warning signs: Exit ticket responses become progressively shorter and more generic over the semester. Muddiest point responses are consistently "nothing was unclear." Cold call answers are increasingly minimal. Students ask "is this graded?" about activities that have been running for weeks. These are compliance fatigue signals — students have habituated to the systems and are executing minimally. The response is to reduce frequency of one system and increase the consequence of another, not to add more systems.

The Implementation Science Principle

Fixsen et al. (2005) — Implementation Stages

Implementation science research (Fixsen et al., 2005, National Implementation Research Network) identifies four stages of implementation: exploration, installation, initial implementation, and full implementation. The key finding: most educational interventions fail not because the intervention is ineffective but because instructors attempt full implementation before completing initial implementation. A system that is partially implemented — run inconsistently, without closed feedback loops, without student buy-in from upfront framing — produces near-zero benefit and significant cost. One system fully implemented outperforms five systems partially implemented. This is the empirical basis for the minimum viable system set recommendation above.

Key Literature

Fixsen, D.L., et al. (2005). Implementation Research: A Synthesis of the Literature. University of South Florida — implementation stages framework. · Deslauriers et al. (2019) PNAS — active learning effects are largest when instructors are committed volunteers, not when required; suggests implementation quality matters more than intervention choice. · Kokotsaki et al. (2016) — scaffolding quality as primary moderator across PBL implementations.

Pace Variance in Coding Tasks — Why It Happens and Why It Matters

Pace variance in coding tasks is larger than in most other activity types because coding competence is not normally distributed in a mixed class — it is often bimodal or multimodal. Students with prior industry experience or strong CS backgrounds can be 5–10x faster than students encountering the concept for the first time. If you design for the median, you lose both ends simultaneously. The fast students disengage within minutes. The slow students feel surveilled and anxious, which is precisely the psychological condition that impairs working memory and further slows performance (Beilock et al., 2004 — math anxiety and working memory interference).

The Core Design Principle — Layered Tasks

The most well-supported structural solution is what the CS education literature calls task layering or scaffolded extension — designing every in-class coding exercise as three sequential tasks rather than one, where each layer requires genuine additional understanding rather than just more of the same work.

The Three-Layer Structure

Layer 1 — Core task (required): The minimum viable demonstration of the concept. Every student should be able to complete this within the intended time window if they have the prerequisite knowledge. Designed for the slowest third of the class. If students cannot complete Layer 1, that is a prerequisite gap signal, not a pace problem.

Layer 2 — Extension task (expected): Applies the same concept in a slightly more complex or different context. Requires transfer rather than repetition. Students who finish Layer 1 early move here without waiting for permission. Designed for the middle third.

Layer 3 — Challenge task (optional): Genuinely hard. Requires either combining the current concept with a prior concept or applying it to an edge case the instruction has not covered. Students who complete Layer 2 early move here. Designed for the fastest students. Critically — this layer should not be completable by pure speed. It should require conceptual depth that speed alone cannot compensate for.

What makes this work: No student is waiting. No student is falling behind publicly. Fast students are not rewarded with free time — they are rewarded with a harder problem. Slow students are not penalized for being slow — they have a clear, achievable target. The distribution of completion times becomes irrelevant because every student is working on the right problem for their current level.

Layer 3 Design — The Critical Constraint

Layer 3 fails if it is just "do more of Layer 2." Fast students will complete it quickly and disengage again. Layer 3 must require something qualitatively different — a conceptual connection that cannot be made by pattern-matching from the earlier layers.

Good Layer 3 patterns for computational courses:

— "Now break it intentionally in a specific way and explain exactly why it breaks." (Requires model depth, not just fluency)

— "Rewrite this to handle the edge case where [specific condition]. What assumption in your Layer 1 solution does this violate?" (Requires identifying hidden assumptions)

— "Your Layer 2 solution works but will fail on datasets larger than ~10,000 rows. Why, and what would you change?" (Requires reasoning about computational complexity without having been taught it explicitly)

— "A colleague shows you this solution to Layer 1 [show a subtly wrong solution]. It passes all the tests you wrote. What is wrong with it and how would you catch it?" (Requires mental model depth to detect a non-obvious bug)

Handling the Pace Gap During the Exercise

What to Do While Students Are Coding

Do not circulate to check on everyone. Circulating implies surveillance. Students who are struggling will minimize their window or switch to fake activity when you approach. Instead, make yourself available at a fixed location — students who need help come to you. This reframes help-seeking as a choice rather than an exposure.

Use fast finishers as a resource, deliberately. When a student finishes Layer 2, give them a specific role: "You're done with Layer 2 — before you move to Layer 3, write one sentence on a sticky note describing the most likely place someone would get stuck in Layer 1 and why." This converts early finishers from a management problem into a diagnostic asset. Their explanations of likely sticking points tell you where to focus your circulating time on the students who are still working.

Set a public time check, not a deadline. At the midpoint of the coding time, say "check where you are — if you're not through Layer 1, flag me." This surfaces students who are stuck without requiring them to self-identify publicly. The flag can be a physical sticky note on their monitor, a raised hand, or a one-word form response. The mechanism is low social cost.

When to Stop the Exercise

Do not wait for the slowest student to finish before moving on. This creates a perverse incentive — slow students learn that being slow holds the class and can feel embarrassed or pressured, which worsens the anxiety-working memory interference. Fast students disengage for extended periods.

Stop when approximately 70% of students have completed Layer 1. The remaining 30% are not behind — they will finish the exercise outside class or revisit it. The 70% threshold ensures enough of the class can participate in the debrief discussion meaningfully, while not penalizing students who work at a different pace with extended public waiting time.

Make the stopping norm explicit on day one: "I stop in-class coding exercises when most of the class has reached the core task. You will not always finish. That is expected and fine. The exercise is in the course materials and you should complete it before the next session." This removes the stigma from not finishing and sets an explicit expectation about out-of-class completion.

The Debrief — Where Most of the Learning Happens

Structured Debrief Protocol

The coding exercise itself is the generation event. The debrief is where the mental model gets consolidated and corrected. Skipping the debrief to give more coding time is a common mistake — it trades consolidation for coverage.

Step 1 (2 min): Ask one student to share their Layer 1 solution — not the fastest student, a middle student. Display it. Ask the class: "What does this get right? What would you do differently?"

Step 2 (2 min): Show one common error from what you observed during circulation. Do not attribute it to any student. "A pattern I saw was X. What is wrong with this and what mental model produces it?" This names the misconception at the class level without exposing individuals.

Step 3 (1 min): Ask a Layer 3 student to describe their approach — not their solution, their approach. What assumption did they have to question to get there? This makes the conceptual depth of Layer 3 visible to students who did not reach it, which seeds the model for next time.

Longer-Term: Using Pace Data

What Pace Variance Tells You Diagnostically

If the same students are consistently in the slowest third, that is a prerequisite gap problem or an anxiety problem — not a pace problem. The intervention is different for each. Prerequisite gap: offer office hours specifically framed as "foundational skills catch-up, not remediation." Anxiety: the Beilock et al. research suggests that brief expressive writing about anxiety before a coding task (10 minutes, not collected) reduces working memory interference and improves performance. This has been replicated in math and test-taking contexts and is plausible for coding contexts though not directly studied there.

If the fastest students are consistently finishing all three layers quickly, Layer 3 is not hard enough. The solution is not to make Layer 1 or 2 harder — it is to make Layer 3 genuinely require conceptual depth rather than fluency. The design question is: can this be completed by someone who is fast but shallow? If yes, redesign it.

Key Literature

Beilock, S.L., et al. (2004). On the causal mechanisms of stereotype threat. Journal of Experimental Social Psychology — anxiety and working memory interference, directly relevant to slow coders under time pressure. · Ramirez, G., & Beilock, S.L. (2011) Science — expressive writing about anxiety before performance tasks reduces anxiety-driven working memory interference. · The layered task design draws on differentiated instruction research (Tomlinson, 2001) and the zone of proximal development (Vygotsky, 1978) — each student works at the edge of their current competence rather than at a single class-level target. · CS education research on pace variance is sparse — this section draws more heavily on general educational psychology than domain-specific studies. Flag as practitioner-derived synthesis where specific protocols are concerned.

Foundational Works

  • Hattie, J. (2009). Visible Learning: A Synthesis of Over 800 Meta-Analyses Relating to Achievement. Routledge. — Meta-analytic effect sizes. Use with caution — labels aggregate heterogeneous interventions. Always trace to primary studies for specific protocols.
  • Bjork, R.A. & Bjork, E.L. (1992). A new theory of disuse and an old theory of stimulus fluctuation. In A. Healy, S. Kosslyn & R. Shiffrin (Eds.), From Learning Processes to Cognitive Processes. Erlbaum. — Storage strength vs retrieval strength. The theoretical foundation of desirable difficulties.
  • Hestenes, D., Wells, M., & Swackhamer, G. (1992). Force concept inventory. The Physics Teacher, 30(3), 141–158. — Model for concept inventory design. The FCI revealed that procedural fluency and conceptual understanding are dissociable.

On Teaching Effectiveness

  • Hattie, J., & Timperley, H. (2007). The power of feedback. Review of Educational Research, 77(1), 81–112. — 196-study review. Task-level and process-level feedback drives outcomes; praise has near-zero effect on learning.
  • Cornelius-White, J. (2007). Learner-centered teacher-student relationships are effective. Review of Educational Research, 77(1), 113–143. — 119-study meta-analysis. Relationship effect requires perceived support + high expectations simultaneously, not warmth alone.
  • Kleinfeld, J. (1975). Effective teachers of Eskimo and Indian students. School Review, 83(2), 301–344. — Early source for the "warm demander" concept. Ware (2006) and Irvine & Fraser (1998) are follow-on work.
  • Shulman, L.S. (1986). Those who understand: Knowledge growth in teaching. Educational Researcher, 15(2), 4–14. — Pedagogical content knowledge framework. Teachers who understand why a procedure works (not just how) produce better student outcomes. Source for teacher clarity claims.
  • Chi, M.T.H., et al. (multiple publications 1989–1996). See: Chi et al. (1989) Self-explanations. Cognitive Science; Chi & VanLehn (1991) Journal of the Learning Sciences. — Tutoring and self-explanation research. The claim about expert tutors modeling uncertainty is an inference from this body of work; the direct finding is that tutors who elicit student self-explanation produce better learning than tutors who explain directly.
  • Carrell, S.E., & West, J.E. (2010). Does professor quality matter? Journal of Political Economy, 118(3), 409–432. — Natural experiment at USAFA. Instructors producing better downstream performance received lower contemporary evaluations. One of the strongest causal designs on the SET-learning relationship.

On Desirable Difficulties

  • Slamecka, N.J., & Graf, P. (1978). The generation effect: Delineation of a phenomenon. Journal of Experimental Psychology: Human Learning and Memory, 4(6), 592–604. — Founding generation effect study. Generated items recalled at higher rates than read items.
  • Cepeda, N.J., et al. (2006). Distributed practice in verbal recall tasks. Psychological Bulletin, 132(3), 354–380. — 254 studies, 14,000+ participants. Optimal spacing gap is ~10–20% of desired retention interval.
  • Rohrer, D., & Taylor, K. (2007). The shuffling of mathematics problems improves learning. Instructional Science, 35(6), 481–498. — Interleaved vs blocked practice. Interleaving produces worse practice performance but substantially better transfer one week later.
  • Roediger, H.L., & Karpicke, J.D. (2006). Test-enhanced learning. Psychological Science, 17(3), 249–255. — Read once + 3 free-recall tests outperformed read 4 times by 50% at one-week retention. Retrieval is the mechanism.
  • Deslauriers, L., et al. (2019). Measuring actual learning versus feeling of learning. PNAS, 116(39), 19251–19257. — Peer instruction (clicker questions + neighbor discussion) outperformed expert lecture; students rated lecture as more effective.

On Student Evaluations

  • Uttl, B., White, C.A., & Gonzalez, D.W. (2017). Meta-analysis of faculty's teaching effectiveness. Studies in Educational Evaluation, 54, 22–42. — Corrected meta-analysis: SET scores correlate r≈0.00–0.10 with objective learning outcomes.
  • Clayson, D.E. (2009). Student evaluations of teaching: Are they related to what students learn? Journal of Marketing Education, 31(1), 16–30. — Ease bias across disciplines: expected grade predicts SET score independent of learning.
  • Porter, S.R., et al. (2004). Multiple surveys of students and survey fatigue. New Directions for Institutional Research, 121, 63–73. — Voluntary response bias: low response rates inflate mean scores by 0.3–0.5 points on 5-point scale.

On Mental Models and Assessment

  • Rozenblit, L., & Keil, F. (2002). The misunderstood limits of folk science. Cognitive Science, 26(5), 521–562. — Illusion of explanatory depth: ratings drop after attempting mechanistic explanation. Silence ≠ understanding.
  • Barnett, S.M., & Ceci, S.J. (2002). When and where do we apply what we learn? Psychological Bulletin, 128(4), 612–637. — Near vs far transfer taxonomy. Near and far transfer are empirically separable.
  • Chi, M.T.H., & VanLehn, K.A. (1991). The content of physics self-explanations. Journal of the Learning Sciences, 1(1), 69–105. — Self-explanation and prediction during examples produces better transfer than passive reading.
  • Mazur, E. (1997). Peer Instruction: A User's Manual. Prentice Hall. — Source for the 30–70% wrong answer target range for productive peer discussion.
  • Angelo, T.A., & Cross, K.P. (1993). Classroom Assessment Techniques. Jossey-Bass. — Source for muddiest point protocol and other low-cost diagnostic techniques.
  • Stowell, J.R., & Nelson, J.M. (2007). Benefits of electronic audience response systems on student participation. Teaching of Psychology, 34(4), 253–258. — Anonymous polling produces 3–4x higher honest response rates than hand-raising.
  • Kruger, J., & Dunning, D. (1999). Unskilled and unaware of it. Journal of Personality and Social Psychology, 77(6), 1121–1134. — Competence and metacognitive accuracy are inversely correlated at low ability.

On Probe Sequencing

  • Kornell, N., Hays, M.J., & Bjork, R.A. (2009). Unsuccessful retrieval attempts enhance subsequent learning. Journal of Experimental Psychology: LMC, 35(4), 989–998. — Pre-testing effect: failed attempts before instruction improve learning from instruction more than instruction alone.
  • Posner, G.J., et al. (1982). Accommodation of a scientific conception. Science Education, 66(2), 211–227. — Conceptual change requires intelligibility, plausibility, and fruitfulness of new model — and dissatisfaction with current model. Naming the conflict is not optional.
  • Taylor, K., & Rohrer, D. (2010). The effects of interleaved practice. Applied Cognitive Psychology, 24(6), 837–848. — Interleaved problem sets outperform blocked on discrimination and transfer.

On CS Education and Misconceptions

  • Sorva, J. (2012). Visual Program Simulation in Introductory Programming Education. Doctoral dissertation, Aalto University. — Notional machine framework; container vs reference model documented empirically.
  • Qian, Y., & Lehman, J. (2017). Students' misconceptions and errors in programming. International Journal of Computer Science Education in Schools, 1(3). — Systematic review of novice programming errors including loop and iteration misconceptions.
  • McCauley, R., et al. (2008). Debugging: Finding, fixing and flailing. Computer Science Education, 18(2), 93–116. — Review of debugging behavior research; basis for debugging causality misconception cluster.
  • Tversky, A., & Kahneman, D. (1974). Judgment under uncertainty: Heuristics and biases. Science, 185(4157), 1124–1131. — Foundational source for probability misconception cluster.
Add new references by describing the paper and what it contributes — include the exact protocol used where relevant.