Introduction: The Science Behind the Science of the Mind
Series Overview: This is Part 13 of our 14-part Cognitive Psychology Series. We turn our lens inward to examine the research methods that make cognitive psychology a rigorous science -- from experimental design and statistical reasoning to the paradigms that reveal how the mind processes information.
1. Memory Systems & Encoding -- sensory, working & long-term memory, consolidation
2. Attention & Focus -- selective, sustained, divided attention models
3. Perception & Interpretation -- sensory processing, Gestalt, visual perception
4. Problem-Solving & Creativity -- heuristics, biases, insight, decision-making
5. Language & Communication -- phonology, syntax, acquisition, Sapir-Whorf
6. Learning & Knowledge -- conditioning, schemas, skill acquisition, metacognition
7. Cognitive Neuroscience -- brain regions, neural networks, neuroplasticity
8. Cognitive Development -- Piaget, Vygotsky, aging & cognitive decline
9. Intelligence & Individual Differences -- IQ theories, multiple intelligences, cognitive styles
10. Emotion & Cognition -- emotion-thinking interaction, stress, motivation
11. Social Cognition -- theory of mind, attribution, stereotypes, groups
12. Applied Cognitive Psychology -- UX design, education, behavioral economics
13. Research Methods -- experimental design, statistics, reaction time (you are here)
14. Computational & AI Models -- ACT-R, SOAR, neural networks, predictive processing
How do we know what we know about the human mind? You cannot open a skull and watch thoughts form. You cannot measure an emotion the way you measure temperature. Cognitive processes are inherently invisible -- they must be inferred from observable behavior. This fundamental challenge is what makes research methodology in cognitive psychology both fascinating and demanding.
Every finding discussed in this series -- from Sperling's iconic memory to Kahneman's heuristics -- rests on carefully designed experiments, rigorous statistical analysis, and the creative use of behavioral paradigms that make the invisible visible. Understanding these methods is not just academic; it is the foundation of scientific literacy in an age when psychological claims flood popular media.
Key Insight: A research method is only as good as its ability to rule out alternative explanations. The hallmark of strong experimental design is internal validity -- the confidence that the independent variable, and nothing else, caused the observed changes in the dependent variable.
A Brief History of Experimental Psychology
The story of cognitive research methods begins in 1879, when Wilhelm Wundt established the first formal psychology laboratory at the University of Leipzig. Wundt used introspection -- trained self-observation of conscious experience -- as his primary method. While introspection was eventually criticized for its subjectivity, Wundt's insistence on controlled laboratory conditions established psychology as an empirical science.
Even earlier, Gustav Fechner (1860) developed psychophysics -- the first quantitative approach to studying the relationship between physical stimuli and psychological perception. His methods for measuring thresholds (the minimum detectable stimulus) remain in use today.
Franciscus Donders (1868) pioneered the use of reaction time as a window into cognitive processing, developing the subtraction method that allowed researchers to estimate the duration of specific mental operations. This innovation was revolutionary: for the first time, thought could be timed.
Historical Milestone
Wundt's Leipzig Laboratory (1879)
Wilhelm Wundt's laboratory at the University of Leipzig is considered the birthplace of experimental psychology. His research program focused on measuring the basic elements of conscious experience -- sensation, perception, and reaction time -- under controlled conditions. Wundt trained over 180 doctoral students, many of whom went on to establish psychology departments across Europe and North America, spreading the experimental approach worldwide.
Wundt's key methodological contribution was demonstrating that mental phenomena could be studied with the same rigor as physical phenomena. His insistence on systematic manipulation of variables, repeated measurements, and controlled conditions laid the groundwork for modern experimental cognitive psychology.
1. Experimental Design
Experimental design is the blueprint of a study -- the strategic plan for how participants will be assigned to conditions, what variables will be manipulated and measured, and how confounds will be controlled. The choice of design has profound implications for what conclusions can be drawn.
1.1 Between-Subjects Design
In a between-subjects (independent groups) design, each participant experiences only one level of the independent variable. Different groups of participants are compared.
Example: To test whether background music affects reading comprehension, Group A reads in silence, Group B reads with classical music, and Group C reads with pop music. Each participant is in only one condition.
| Advantage | Disadvantage |
| --- | --- |
| No carryover effects (practice, fatigue) | Requires more participants |
| Each condition is independent | Individual differences between groups may confound results |
| Suitable when conditions cannot be reversed | Reduced statistical power (more error variance) |
Key control technique: Random assignment -- participants are randomly allocated to conditions, ensuring that pre-existing differences (age, IQ, motivation) are distributed equally across groups. Without random assignment, you have a quasi-experiment, not a true experiment.
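Random allocation into equal-sized groups can be sketched in a few lines of Python, in the same style as the examples later in this section (the participant IDs and condition names are made up):

```python
import random

random.seed(7)  # fixed seed so the allocation is reproducible

participants = [f"P{i:02d}" for i in range(1, 13)]  # 12 hypothetical IDs
random.shuffle(participants)  # randomize order before allocation

conditions = ["silence", "classical", "pop"]
# Deal the shuffled participants round-robin into equal-sized groups
groups = {c: participants[i::3] for i, c in enumerate(conditions)}

for condition, members in groups.items():
    print(f"{condition:9s}: {members}")
```

Because the shuffle happens before any grouping, pre-existing differences are (in expectation) spread evenly across the three conditions.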
1.2 Within-Subjects (Repeated Measures) Design
In a within-subjects design, every participant experiences all levels of the independent variable. The same person serves as their own control.
Example: Each participant completes a Stroop task under three conditions: congruent (word "RED" in red ink), incongruent (word "RED" in blue ink), and neutral (a string of Xs in colored ink). Their reaction times are compared across all three conditions.
| Advantage | Disadvantage |
| --- | --- |
| Eliminates individual difference confounds | Order effects (practice, fatigue, carryover) |
| Requires fewer participants | Demand characteristics (participants guess the hypothesis) |
| Greater statistical power | Not suitable when exposure to one condition changes the participant |
Key control technique: Counterbalancing -- varying the order of conditions across participants (e.g., half do condition A then B; the other half do B then A) to cancel out order effects. A full Latin Square design systematically rotates all possible orderings.
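Condition orders can be generated programmatically. Below is a sketch of a common balanced Latin square construction for four conditions; note that full first-order balance (each condition preceding every other equally often) holds only for an even number of conditions:

```python
def balanced_latin_square(conditions):
    """Each row is one participant's condition order. Every condition
    appears once per ordinal position; with an even number of conditions,
    each condition also immediately precedes every other equally often."""
    n = len(conditions)
    square = []
    for row in range(n):
        indices = [row]
        for i in range(1, n):
            if i % 2 == 1:
                indices.append((row + (i + 1) // 2) % n)  # step forward
            else:
                indices.append((row - i // 2) % n)        # step backward
        square.append([conditions[j] for j in indices])
    return square

for order in balanced_latin_square(["A", "B", "C", "D"]):
    print(" -> ".join(order))
```

With n conditions this needs only n orderings, rather than the n! orderings full counterbalancing would require.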
1.3 Factorial & Mixed Designs
A factorial design examines two or more independent variables simultaneously, allowing researchers to detect interactions -- situations where the effect of one variable depends on the level of another.
Key Insight: Interactions are often more theoretically interesting than main effects. For example, the finding that raising arousal (say, with caffeine) tends to improve performance on simple tasks but impair it on complex tasks -- consistent with the Yerkes-Dodson law -- is an interaction between arousal and task difficulty: neither variable alone tells the full story.
Example of a 2 x 3 factorial design: Factor A = Encoding type (visual vs verbal), Factor B = Retention interval (1 hour, 1 day, 1 week). This yields 6 conditions and can reveal whether the decay rate differs for visual vs verbal memories.
A mixed design combines between-subjects and within-subjects factors. For instance, comparing older adults vs younger adults (between-subjects) on a memory task measured at multiple time points (within-subjects).
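The cells of the 2 x 3 example above can be enumerated mechanically with a Cartesian product; a short sketch:

```python
from itertools import product

encoding = ["visual", "verbal"]             # Factor A: 2 levels
interval = ["1 hour", "1 day", "1 week"]    # Factor B: 3 levels

cells = list(product(encoding, interval))   # all 6 cells of the 2 x 3 design
for i, (enc, ret) in enumerate(cells, 1):
    print(f"Condition {i}: {enc} encoding, tested after {ret}")
```

The same idea scales: a 2 x 2 x 3 design is just a product over three factor lists, yielding 12 cells.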
1.4 Quasi-Experimental Design
When true random assignment is impossible -- because the independent variable is a pre-existing characteristic (age, clinical diagnosis, handedness) or an event that cannot be ethically manipulated -- researchers use quasi-experimental designs.
Example: Comparing cognitive function in patients with Alzheimer's disease vs healthy controls. You cannot randomly assign people to have Alzheimer's, so groups differ in ways beyond just the variable of interest.
Critical Limitation: Quasi-experiments cannot establish causation because the lack of random assignment means confounding variables are not controlled. Alzheimer's patients may differ from controls in education, medication use, and general health -- any of which could explain group differences in cognition.
2. Variables & Hypothesis Testing
2.1 Independent, Dependent & Confounding Variables
Every experiment revolves around three types of variables:
| Variable Type | Definition | Example (Stroop Study) |
| --- | --- | --- |
| Independent Variable (IV) | The factor manipulated by the researcher | Congruency of word and ink color (congruent vs incongruent) |
| Dependent Variable (DV) | The outcome measured | Reaction time (ms) and error rate (%) |
| Confounding Variable | Uncontrolled factor that varies with the IV, threatening internal validity | Word frequency (if congruent words happen to be more common) |
Extraneous variables are any variables other than the IV that could affect the DV. They become confounds only when they systematically co-vary with the IV. Good experimental design uses randomization, counterbalancing, and standardization to prevent extraneous variables from becoming confounds.
2.2 Hypothesis Testing: Null vs Alternative
The logic of null hypothesis significance testing (NHST) -- the dominant statistical framework in psychology -- works by assuming the null hypothesis (H0: there is no effect) is true, then calculating the probability of obtaining data as extreme as what was observed.
| Concept | Definition | Typical Threshold |
| --- | --- | --- |
| Null Hypothesis (H0) | No effect; any differences are due to chance | -- |
| Alternative Hypothesis (H1) | There is a real effect of the IV on the DV | -- |
| p-value | Probability of obtaining data this extreme if H0 is true | p < .05 |
| Type I Error (False Positive) | Rejecting H0 when it is actually true | alpha = .05 (5% risk) |
| Type II Error (False Negative) | Failing to reject H0 when H1 is actually true | beta = .20 (20% risk) |
Common Misconception: A p-value of .03 does not mean there is a 3% probability the null hypothesis is true. It means: "If the null hypothesis were true, there would be a 3% probability of observing data this extreme or more extreme." The p-value is about the data given the hypothesis, not the hypothesis given the data. This distinction, often missed, leads to widespread misinterpretation of research findings.
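The correct reading can be made concrete with a simulation: build an "observed" experiment in which H0 is true by construction, then count how often chance alone produces a difference at least that extreme across thousands of replications. The population parameters and sample sizes below are arbitrary:

```python
import random

random.seed(1)

def mean(xs):
    return sum(xs) / len(xs)

# An "observed" experiment in which H0 is TRUE by construction:
# both groups come from the same population (mean 500 ms, SD 50 ms)
obs_a = [random.gauss(500, 50) for _ in range(30)]
obs_b = [random.gauss(500, 50) for _ in range(30)]
observed_diff = abs(mean(obs_a) - mean(obs_b))

# The p-value question: if H0 were true, how often would chance alone
# produce a group difference at least this extreme?
n_sims = 10_000
extreme = 0
for _ in range(n_sims):
    a = [random.gauss(500, 50) for _ in range(30)]
    b = [random.gauss(500, 50) for _ in range(30)]
    if abs(mean(a) - mean(b)) >= observed_diff:
        extreme += 1

p = extreme / n_sims
print(f"Observed difference: {observed_diff:.1f} ms")
print(f"Simulated p-value:   {p:.3f}")
# p answers "how likely is data this extreme GIVEN H0" --
# it is not the probability that H0 itself is true.
```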
2.3 Effect Size & Power Analysis
Statistical significance tells you whether an effect is likely real, but not whether it matters. That is the role of effect size -- a standardized measure of the magnitude of an effect.
| Effect Size Measure | Used With | Small | Medium | Large |
| --- | --- | --- | --- | --- |
| Cohen's d | Comparing two means | 0.2 | 0.5 | 0.8 |
| Eta-squared (η²) | ANOVA | .01 | .06 | .14 |
| Pearson's r | Correlation | .10 | .30 | .50 |
Statistical power is the probability of correctly detecting a real effect (1 - beta). A well-powered study typically aims for 80% power. Power depends on three factors: effect size (larger = easier to detect), sample size (larger = more power), and alpha level (more lenient = more power but more Type I errors).
# Power analysis and effect size calculation in Python
import math
import random
def cohens_d(group1, group2):
"""Calculate Cohen's d for two independent groups."""
n1, n2 = len(group1), len(group2)
mean1, mean2 = sum(group1) / n1, sum(group2) / n2
# Pooled standard deviation
var1 = sum((x - mean1) ** 2 for x in group1) / (n1 - 1)
var2 = sum((x - mean2) ** 2 for x in group2) / (n2 - 1)
pooled_sd = math.sqrt(((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2))
d = (mean1 - mean2) / pooled_sd
return d
def required_sample_size(effect_size, alpha=0.05, power=0.80):
"""
Approximate sample size per group for a two-sample t-test.
Uses the formula: n = (z_alpha + z_beta)^2 * 2 / d^2
"""
# Z-scores for common alpha and power levels
z_alpha = 1.96 if alpha == 0.05 else 2.576 # two-tailed
z_beta = 0.84 if power == 0.80 else 1.28 # power = .80 or .90
n_per_group = math.ceil((z_alpha + z_beta) ** 2 * 2 / effect_size ** 2)
return n_per_group
# Simulate a Stroop experiment
random.seed(42)
congruent_rts = [random.gauss(520, 80) for _ in range(50)] # ms
incongruent_rts = [random.gauss(620, 95) for _ in range(50)] # ms
d = cohens_d(incongruent_rts, congruent_rts)
print("=== Stroop Effect Simulation ===")
print(f"Congruent mean RT: {sum(congruent_rts)/len(congruent_rts):.1f} ms")
print(f"Incongruent mean RT: {sum(incongruent_rts)/len(incongruent_rts):.1f} ms")
print(f"Cohen's d: {abs(d):.3f} (large effect)")
# Power analysis for different effect sizes
print("\n=== Required Sample Sizes (alpha=.05, power=.80) ===")
for label, es in [("Small (d=0.2)", 0.2), ("Medium (d=0.5)", 0.5), ("Large (d=0.8)", 0.8)]:
n = required_sample_size(es)
print(f" {label}: {n} participants per group ({n*2} total)")
3. Statistical Analysis
Choosing the correct statistical test depends on the research design, the type of data, and the research question. Here is a practical guide to the most commonly used tests in cognitive psychology research.
3.1 t-Tests
The t-test compares means from two conditions to determine whether they differ significantly.
| Type | When to Use | Example |
| --- | --- | --- |
| Independent-samples t-test | Two different groups compared on the same DV | Comparing memory scores of young vs old adults |
| Paired-samples t-test | Same participants measured twice (within-subjects) | Comparing Stroop congruent vs incongruent RTs |
| One-sample t-test | Comparing a sample mean to a known value | Is this group's average IQ different from 100? |
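As an illustration, the independent-samples t statistic can be computed from first principles, in the same from-scratch style as the document's other code; the recall scores below are hypothetical:

```python
import math

def independent_t(group1, group2):
    """Student's t for two independent groups (equal-variance form)."""
    n1, n2 = len(group1), len(group2)
    m1, m2 = sum(group1) / n1, sum(group2) / n2
    v1 = sum((x - m1) ** 2 for x in group1) / (n1 - 1)
    v2 = sum((x - m2) ** 2 for x in group2) / (n2 - 1)
    # Pool the two sample variances, weighted by degrees of freedom
    pooled_var = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
    se = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    t = (m1 - m2) / se
    df = n1 + n2 - 2
    return t, df

young = [24, 22, 25, 27, 23, 26, 24, 25]   # hypothetical recall scores
old   = [19, 21, 18, 22, 20, 17, 21, 19]
t, df = independent_t(young, old)
print(f"t({df}) = {t:.2f}")
# For df = 14, |t| > 2.145 is significant at two-tailed alpha = .05
```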
3.2 Analysis of Variance (ANOVA)
When you have three or more conditions, multiple t-tests inflate your Type I error rate. ANOVA solves this by testing whether any group means differ significantly in a single omnibus test.
| ANOVA Type | Design | Example |
| --- | --- | --- |
| One-Way ANOVA | One IV with 3+ levels, between-subjects | Memory recall under silence, classical, or pop music |
| Repeated Measures ANOVA | One IV with 3+ levels, within-subjects | Reaction time at three SOA intervals for the same participants |
| Factorial ANOVA | Two or more IVs, tests main effects + interactions | Age (young/old) x Encoding type (visual/verbal) on recall |
| Mixed ANOVA | At least one between- and one within-subjects factor | Clinical group (between) x Time point (within) |
# Simulating a One-Way ANOVA: Effect of background music on memory recall
import random
def one_way_anova(groups):
"""
Perform a one-way ANOVA from scratch.
Returns F-statistic and effect size (eta-squared).
"""
k = len(groups)
all_data = [x for g in groups for x in g]
N = len(all_data)
grand_mean = sum(all_data) / N
# Sum of Squares Between (SSB)
ssb = sum(len(g) * (sum(g)/len(g) - grand_mean)**2 for g in groups)
# Sum of Squares Within (SSW)
ssw = sum(sum((x - sum(g)/len(g))**2 for x in g) for g in groups)
# Degrees of freedom
df_between = k - 1
df_within = N - k
# Mean squares
msb = ssb / df_between
msw = ssw / df_within
# F-statistic
f_stat = msb / msw
# Effect size (eta-squared)
eta_sq = ssb / (ssb + ssw)
return f_stat, df_between, df_within, eta_sq
# Simulate three conditions
random.seed(42)
silence = [random.gauss(15, 3) for _ in range(30)] # 15 words recalled
classical = [random.gauss(14, 3) for _ in range(30)] # 14 words recalled
pop_music = [random.gauss(11, 3.5) for _ in range(30)] # 11 words recalled
f, df1, df2, eta = one_way_anova([silence, classical, pop_music])
print("=== One-Way ANOVA: Background Music & Memory ===")
print(f"Silence: M = {sum(silence)/len(silence):.1f} words")
print(f"Classical: M = {sum(classical)/len(classical):.1f} words")
print(f"Pop Music: M = {sum(pop_music)/len(pop_music):.1f} words")
print(f"\nF({df1}, {df2}) = {f:.2f}")
print(f"eta-squared = {eta:.3f} ({'large' if eta > .14 else 'medium' if eta > .06 else 'small'} effect)")
print(f"{'Significant at p < .05' if f > 3.10 else 'Not significant'}")
3.3 Correlation & Regression
Correlation measures the strength and direction of the linear relationship between two variables. Regression goes further by modeling one variable as a function of one or more predictors.
| Method | What It Tests | Example |
| --- | --- | --- |
| Pearson's r | Linear relationship between two continuous variables | Correlation between working memory capacity and reading comprehension |
| Spearman's rho | Monotonic relationship (works with ordinal data or non-linear) | Rank-order correlation between confidence and accuracy |
| Simple Regression | Predicting DV from one IV | Predicting exam score from hours of sleep |
| Multiple Regression | Predicting DV from multiple IVs simultaneously | Predicting cognitive decline from age, education, and exercise |
| Chi-Square | Association between two categorical variables | Is encoding strategy (visual/verbal) related to participant gender? |
Correlation Does Not Imply Causation: A positive correlation between ice cream sales and drowning deaths does not mean ice cream causes drowning. Both are caused by a third variable: hot weather. In cognitive psychology, a correlation between screen time and attention problems could reflect reverse causation (people with attention difficulties seek more stimulation) or a shared cause (impulsivity).
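Pearson's r can also be computed from first principles; the working-memory and reading scores below are invented for illustration:

```python
import math

def pearson_r(x, y):
    """Pearson correlation: covariance scaled by the two SDs."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical data: working memory span vs reading comprehension score
wm_span = [3, 4, 4, 5, 5, 6, 6, 7]
reading = [55, 60, 58, 66, 70, 72, 69, 78]
r = pearson_r(wm_span, reading)
print(f"r = {r:.2f}")
```

Even a strong r like this one says nothing about direction of causation, exactly as the warning above explains.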
3.4 Choosing the Right Statistical Test
| Research Question | Data Type | Groups | Recommended Test |
| --- | --- | --- | --- |
| Difference between 2 independent groups | Continuous | Between | Independent t-test |
| Difference between 2 related measures | Continuous | Within | Paired t-test |
| Difference among 3+ independent groups | Continuous | Between | One-way ANOVA |
| Difference among 3+ related measures | Continuous | Within | Repeated measures ANOVA |
| Relationship between 2 continuous variables | Continuous | -- | Pearson correlation |
| Association between 2 categorical variables | Categorical | -- | Chi-square test |
| Predicting outcome from multiple predictors | Mixed | -- | Multiple regression |
4. Cognitive Experimental Paradigms
Cognitive psychologists have developed a remarkable toolkit of experimental paradigms -- standardized tasks that reliably tap specific cognitive processes. These paradigms are the workhorses of the field, used in thousands of studies across labs worldwide.
4.1 The Stroop Task
The Stroop task (Stroop, 1935) is arguably the most famous paradigm in cognitive psychology. Participants must name the ink color of printed words while ignoring the word itself. When the word and ink color conflict (e.g., the word "RED" printed in blue ink), reaction times increase dramatically -- the Stroop effect.
Classic Paradigm
The Stroop Effect -- Automatic vs Controlled Processing
John Ridley Stroop's 1935 dissertation revealed a fundamental truth about the mind: reading is automatic. We cannot help but read a word, even when explicitly instructed to ignore it. Naming the ink color of an incongruent word requires the controlled, effortful suppression of the automatic reading response -- a process that takes measurably longer.
The Stroop effect has proven remarkably robust: it has been replicated across languages, age groups, and cultures. It serves as a marker for cognitive control, executive function, and selective attention. Clinically, enlarged Stroop effects are observed in conditions like ADHD, schizophrenia, and frontal lobe damage -- making it a sensitive diagnostic tool.
Typical effect size: Incongruent trials are approximately 80-120 ms slower than congruent trials, with Cohen's d values typically exceeding 1.0 -- one of the largest and most reliable effects in all of psychology.
4.2 The Eriksen Flanker Task
The flanker task (Eriksen & Eriksen, 1974) measures the ability to suppress responses to irrelevant stimuli surrounding a target. Participants respond to a central stimulus (e.g., the direction of a central arrow) while ignoring flanking distractors.
Example stimuli:
- Congruent: > > > > > (all arrows point right) -- Fast, accurate
- Incongruent: < < > < < (flankers conflict with target) -- Slower, more errors
- Neutral: -- -- > -- -- (non-arrow flankers) -- Intermediate
The flanker effect demonstrates that selective attention has spatial limits -- nearby distractors are processed even when they are irrelevant, particularly when they are close to the target. This paradigm is central to theories of response competition and attentional filtering.
4.3 Go/No-Go, N-Back & Visual Search
| Paradigm | What It Measures | Task Description | Key Findings |
| --- | --- | --- | --- |
| Go/No-Go | Response inhibition | Respond to "go" stimuli, withhold response to "no-go" stimuli | No-go errors index impulsivity; used in ADHD research |
| N-Back | Working memory updating | Respond when current stimulus matches the one N items back | Performance drops sharply from 1-back to 3-back; strongly activates DLPFC |
| Visual Search | Attention: parallel vs serial processing | Find a target among distractors (e.g., red circle among blue circles) | Pop-out (feature search) is parallel; conjunction search is serial (Treisman) |
| Priming | Implicit memory, associative networks | Prior exposure to a stimulus facilitates processing of related stimuli | Semantic priming: "doctor" speeds recognition of "nurse" |
| Simon Task | Stimulus-response compatibility | Respond to stimulus identity, ignoring its spatial location | Faster when stimulus and response are on the same side (Simon effect) |
Case Study
The Simon Effect -- When Location Matters
In J.R. Simon's classic paradigm, participants press a left or right key based on stimulus identity (e.g., press left for a high tone, right for a low tone). Despite the irrelevance of the tone's spatial location, participants are faster when the stimulus and response are on the same side (compatible) than on opposite sides (incompatible).
The Simon effect reveals an automatic spatial stimulus-response mapping that persists even when participants are explicitly told to ignore location. Like the Stroop effect, it demonstrates the limits of controlled processing in overriding automatic tendencies. The Simon effect is typically 20-30 ms and has been used extensively in research on cognitive aging and bilingualism.
5. Reaction Time Studies
Reaction time (RT) is the primary dependent variable in cognitive psychology -- a millisecond-precise window into the speed of mental processing. The logic is simple but powerful: if manipulating a variable increases RT, that manipulation has added cognitive processing demands.
5.1 Donders' Subtraction Method
Franciscus Donders (1868) developed the subtraction method to estimate the duration of specific mental processes. He designed three types of reaction time tasks, each adding one additional cognitive operation:
| Task Type | Cognitive Demands | Example | Typical RT |
| --- | --- | --- | --- |
| A-Reaction (Simple) | Detection only | Press button when light appears | ~180 ms |
| B-Reaction (Choice) | Detection + Discrimination + Selection | Press left for red light, right for green | ~350 ms |
| C-Reaction (Go/No-Go) | Detection + Discrimination | Press button for red light, do nothing for green | ~265 ms |
By subtracting task durations, Donders estimated: Discrimination time = C - A = ~85 ms; Response selection time = B - C = ~85 ms. This elegant logic assumes that cognitive processes are additive and independent -- an assumption later challenged by Sternberg's additive factors method (1969).
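The subtraction logic is simple arithmetic over the (rounded) mean RTs in the table above:

```python
# Typical mean RTs for Donders' three task types (ms), from the table above
rt_simple = 180   # A-reaction: detection only
rt_go_nogo = 265  # C-reaction: detection + discrimination
rt_choice = 350   # B-reaction: detection + discrimination + selection

discrimination_time = rt_go_nogo - rt_simple       # C - A
response_selection_time = rt_choice - rt_go_nogo   # B - C

print(f"Discrimination time:     {discrimination_time} ms")
print(f"Response selection time: {response_selection_time} ms")
```

The arithmetic is trivial; the inferential leap -- that inserting a stage leaves the other stages unchanged ("pure insertion") -- is exactly the assumption Sternberg's additive factors method later challenged.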
5.2 Speed-Accuracy Tradeoff
One of the most fundamental constraints in human information processing is the speed-accuracy tradeoff (SAT): faster responses tend to be less accurate, and more accurate responses tend to be slower. Participants can shift their criterion along this continuum.
Methodological Implication: Because of the SAT, reporting only reaction time (or only accuracy) can be misleading. A manipulation that appears to speed up responses may actually be making participants less careful. Modern cognitive research reports both RT and error rate, and some studies use sophisticated models like the diffusion model (Ratcliff, 1978) to disentangle speed and accuracy into separate parameters: drift rate (information quality), boundary separation (caution), and non-decision time.
5.3 Hick's Law
Hick's Law (Hick, 1952; Hyman, 1953) states that choice reaction time increases logarithmically with the number of response alternatives:
RT = a + b * log2(n)
where n is the number of equally probable alternatives, a is the base RT, and b is the slope (about 150 ms per bit of information). This relationship shows that the human decision-making system processes information in bits, much like a digital system, supporting the information-processing metaphor central to cognitive psychology.
Practical application: Hick's Law directly influences UX design. Menus with fewer options lead to faster selection times. This is why simplified navigation (e.g., 5-7 main menu items) leads to better user experience than presenting 20+ options simultaneously.
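A short sketch of the formula; the intercept a and slope b below are illustrative placeholder values, not empirical constants:

```python
import math

def hicks_rt(n, a=200.0, b=150.0):
    """Predicted choice RT (ms) under Hick's law: RT = a + b * log2(n).
    a (base RT) and b (ms per bit) are illustrative values here."""
    return a + b * math.log2(n)

for n in (2, 4, 8, 16):
    print(f"{n:2d} alternatives -> predicted RT {hicks_rt(n):.0f} ms")
```

Note the logarithmic shape: doubling the number of menu options adds a constant increment to decision time, rather than doubling it.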
6. Neuroimaging Methods in Cognitive Research
Modern cognitive psychology increasingly integrates neuroimaging -- techniques that measure brain activity during cognitive tasks. Each method offers a different tradeoff between spatial resolution (where in the brain) and temporal resolution (when activity occurs).
| Method | Measures | Spatial Resolution | Temporal Resolution | Key Application |
| --- | --- | --- | --- | --- |
| fMRI | Blood oxygenation (BOLD signal) | ~1-2 mm (excellent) | ~1-2 seconds (poor) | Localizing cognitive functions to brain regions |
| EEG | Electrical activity (scalp electrodes) | ~5-10 cm (poor) | ~1 ms (excellent) | Event-related potentials (ERPs); timing of processing stages |
| MEG | Magnetic fields from neural activity | ~5 mm (good) | ~1 ms (excellent) | Combining spatial and temporal precision |
| PET | Metabolic activity (radioactive tracers) | ~4-8 mm (moderate) | ~30-60 seconds (poor) | Neurotransmitter receptor mapping |
| TMS | Causal role of brain areas (disruption) | ~1 cm (good) | ~10 ms (good) | Testing whether a brain region is necessary for a task |
| NIRS/fNIRS | Blood oxygenation (near-infrared light) | ~1-3 cm (moderate) | ~100 ms (moderate) | Portable neuroimaging; developmental studies |
Key Insight: The most informative cognitive neuroscience studies use converging evidence from multiple methods. fMRI tells you where processing occurs; EEG tells you when; TMS tells you whether that brain region is necessary. No single method provides a complete picture.
7. Replication Crisis & Open Science
7.1 Ecological Validity
Ecological validity refers to the degree to which experimental findings generalize to real-world settings. A perennial tension in cognitive psychology is between internal validity (controlled lab conditions) and ecological validity (real-world relevance).
Consider memory research: studying word list recall in a quiet lab has high internal validity but may tell us little about how memory operates when navigating a busy city, having a conversation, or studying for an exam while distracted by social media. Neisser (1976) famously criticized the field for studying memory "in a vacuum," calling for more ecologically valid research paradigms.
Modern responses to this challenge include experience sampling methods (ESM), virtual reality experiments, and large-scale online studies that sacrifice some control for greater ecological representativeness.
7.2 The Replication Crisis
In 2015, the Open Science Collaboration attempted to replicate 100 published psychology experiments. The results were sobering: while 97% of the original studies reported significant results, only 36% of replications yielded significant effects. Effect sizes in replications were, on average, half the magnitude of the originals.
Case Study
The Reproducibility Project: Psychology (2015)
Led by Brian Nosek and the Center for Open Science, 270 researchers across 50 labs attempted high-fidelity replications of 100 studies from three top psychology journals. Key findings:
- 97% of original studies had significant results (p < .05)
- Only 36% of replications achieved significance
- Mean effect size dropped from r = .403 to r = .197
- Cognitive psychology replicated better (~50%) than social psychology (~25%)
This did not mean most psychology findings are false, but it exposed systemic problems: publication bias (journals preferring significant results), small sample sizes, flexible data analysis (p-hacking), and insufficient emphasis on replication.
Several factors contributed to the crisis:
- Publication bias: Journals overwhelmingly publish positive results, creating a "file drawer problem" where null results are never shared
- p-hacking: Researchers (often unconsciously) make analysis decisions that nudge p below .05 and inflate the false-positive rate -- testing multiple DVs, removing outliers, adding covariates, or stopping data collection as soon as p < .05
- HARKing: Hypothesizing After Results are Known -- presenting post-hoc findings as if they were predicted a priori
- Underpowered studies: Many studies had too few participants to reliably detect the effects they claimed to find
7.3 The Open Science Movement
The replication crisis catalyzed a powerful reform movement. The open science movement promotes transparency, rigor, and reproducibility through concrete practices:
| Practice | Description | Impact |
| --- | --- | --- |
| Pre-registration | Publicly registering hypotheses, methods, and analysis plans before data collection | Prevents p-hacking and HARKing; distinguishes confirmatory from exploratory analyses |
| Open Data | Making raw data publicly available | Enables independent verification and re-analysis |
| Open Materials | Sharing stimuli, code, and experimental scripts | Facilitates exact replications and methodological improvements |
| Registered Reports | Journals peer-review and accept studies before data collection | Eliminates publication bias entirely; results cannot influence acceptance |
| Many Labs Projects | Large-scale collaborative replications across many laboratories | Provides definitive estimates of effect sizes and generalizability |
Meta-analysis is a statistical technique that synthesizes findings from multiple studies on the same topic, providing a more precise estimate of the true effect size than any single study can.
Key Insight: A single study is just one data point. Meta-analysis pools data across studies to estimate the true effect while accounting for sampling variability. For example, a meta-analysis of 100 Stroop studies provides a much more precise estimate of the Stroop effect's magnitude than any individual study. Meta-analyses can also test moderating variables: is the Stroop effect larger in older adults? In clinical populations? The answers emerge from patterns across studies.
Steps in conducting a meta-analysis:
1. Define the research question and inclusion criteria
2. Systematic literature search -- exhaustive, documented search of databases
3. Code studies -- extract effect sizes, sample sizes, and moderator variables
4. Compute weighted average effect size -- larger studies receive more weight
5. Test for heterogeneity -- are effect sizes consistent across studies?
6. Analyze moderators -- what factors explain variation in effect sizes?
7. Assess publication bias -- funnel plots and trim-and-fill analysis
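The weighted-average step can be sketched for Cohen's d using standard inverse-variance (fixed-effect) pooling; the study effect sizes and sample sizes below are invented:

```python
import math

# Hypothetical studies: (Cohen's d, total N)
studies = [(0.85, 40), (1.10, 25), (0.60, 120), (0.95, 60), (0.70, 200)]

def var_d(d, n_total):
    """Large-sample sampling variance of d, assuming two equal groups:
    var(d) ~= 4/N + d^2 / (2N)."""
    return 4 / n_total + d ** 2 / (2 * n_total)

# Inverse-variance weights: precise (large, low-variance) studies count more
weights = [1 / var_d(d, n) for d, n in studies]
pooled_d = sum(w * d for w, (d, _) in zip(weights, studies)) / sum(weights)
se_pooled = math.sqrt(1 / sum(weights))

print(f"Pooled effect size: d = {pooled_d:.2f}")
print(f"95% CI: [{pooled_d - 1.96 * se_pooled:.2f}, "
      f"{pooled_d + 1.96 * se_pooled:.2f}]")
```

A real meta-analysis would also test heterogeneity and, if effect sizes vary more than sampling error allows, switch to a random-effects model.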
Exercises & Self-Assessment
Exercise 1
Design Your Own Experiment
Design a between-subjects experiment to test whether handwriting versus typing lecture notes leads to better exam performance. Specify:
- Your independent variable and its levels
- Your dependent variable(s) and how you would measure them
- At least three potential confounding variables and how you would control each
- Your sample size and how you determined it (hint: use power analysis)
- Your statistical test and why it is appropriate
Challenge: Now redesign this as a within-subjects study. What changes? What new problems arise?
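For the power-analysis hint, a rough calculation needs nothing beyond Python's standard library. This sketch uses the normal approximation to the two-sample t-test, so it lands a participant or two below what exact power software reports; the d = 0.5 input is an assumed medium effect, not an empirical estimate for note-taking:

```python
from math import ceil
from statistics import NormalDist

def n_per_group(d, alpha=0.05, power=0.80):
    """Approximate per-group n for a two-sample t-test (normal approximation)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for two-tailed alpha
    z_beta = z.inv_cdf(power)           # quantile corresponding to desired power
    return ceil(2 * ((z_alpha + z_beta) / d) ** 2)

# Assuming handwriting vs. typing yields a medium effect (d = 0.5):
print(n_per_group(0.5))  # 63 per group; an exact t-test calculation gives ~64
```

The inverse-square relationship with d is the key lesson: halving the expected effect size quadruples the required sample, which is why studies chasing small effects with 12 participants per group are doomed from the start.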
Exercise 2
Spot the Flaws
Identify the methodological problems in each scenario:
- A researcher tests 20 DVs and reports the one that was significant at p = .04 without mentioning the others.
- A study with 12 participants per group reports a "significant" effect of meditation on attention (p = .048).
- A memory study compares psychology students (Group A) to engineering students (Group B), finding that Group A recalls more psychology terms.
- A researcher finds p = .06 and concludes "there was a trend toward significance," treating it as partial support for the hypothesis.
Answers: (1) p-hacking / multiple comparisons, (2) severely underpowered, (3) confound: prior knowledge, (4) misinterpretation of p-values; .06 is not significant, and "trends" are not evidence.
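Flaw (1) can be made vivid with one line of arithmetic: if the 20 dependent variables were independent, the chance of at least one "significant" result arising by luck alone is far higher than the nominal .05.

```python
# Family-wise false-positive rate for k independent tests at significance alpha
alpha, k = 0.05, 20
family_wise = 1 - (1 - alpha) ** k
print(f"{family_wise:.0%}")  # about 64% -- better than a coin flip of finding "something"
```

Real DVs are usually correlated, so the true inflation is somewhat lower, but the logic stands: without correction (e.g., Bonferroni) or pre-registration of the primary outcome, that lone p = .04 is uninterpretable.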
Exercise 3
DIY Stroop Experiment
Conduct a simple Stroop experiment with a friend:
- Create two lists: (A) Color words printed in matching ink, (B) Color words in mismatching ink
- Time how long it takes to name all the ink colors in each list
- Record the number of errors in each condition
- Calculate the Stroop effect (time difference between lists)
- Test at least 5 people and compute the average Stroop effect and its standard deviation
Discussion: Did you observe the expected Stroop interference? How much variability was there across participants? What might explain individual differences?
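Steps 4-5 can be computed with a short script. The timings below are made-up numbers for illustration; substitute your own measurements:

```python
from statistics import mean, stdev

# Hypothetical timings in seconds for 5 participants: (matching list, mismatching list)
timings = [(21.3, 29.8), (18.9, 26.1), (24.0, 35.5), (20.2, 27.4), (22.7, 31.0)]

# Stroop effect per participant = mismatching-ink time minus matching-ink time
effects = [mismatch - match for match, mismatch in timings]
print(f"Mean Stroop effect: {mean(effects):.1f} s, SD: {stdev(effects):.1f} s")
```

With data like these, every participant shows positive interference, but the sizes differ by several seconds, which is exactly the across-participant variability the discussion questions ask about.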
Exercise 4
Reflective Questions
- Explain why random assignment is essential for establishing causation. What happens without it?
- A study reports a "statistically significant" effect with p = .03 and Cohen's d = 0.1. Should we be excited? Why or why not?
- Why did cognitive psychology replicate better than social psychology in the Reproducibility Project? What methodological features might explain this?
- Design a study using Donders' subtraction method to estimate how long it takes to mentally rotate an object 90 degrees.
- What are the pros and cons of pre-registration? Could it stifle exploratory research?
Conclusion & Next Steps
In this penultimate chapter of our Cognitive Psychology Series, we have examined the scientific methods that underpin everything cognitive psychologists claim to know about the mind. Here are the key takeaways:
- Experimental design is the foundation of causal inference. Between-subjects, within-subjects, factorial, and quasi-experimental designs each have distinct strengths and limitations. Random assignment is essential for causation.
- Hypothesis testing is widely used but widely misunderstood. A p-value is not the probability that H0 is true. Effect size and power are at least as important as significance.
- Statistical tests should match the research design: t-tests for two conditions, ANOVA for three or more, correlation and regression for relationships, chi-square for categorical data.
- Cognitive paradigms like the Stroop, flanker, and n-back tasks are the workhorses of cognitive research, providing reliable windows into specific processes like attention, inhibition, and working memory.
- Reaction time is the gold standard DV in cognitive psychology, dating back to Donders' subtraction method. The speed-accuracy tradeoff and Hick's law reveal fundamental constraints of the information-processing system.
- Neuroimaging methods complement behavioral measures, each with different spatial and temporal resolution tradeoffs. Converging evidence across methods yields the strongest conclusions.
- The replication crisis exposed real problems in research practice, but the open science response -- pre-registration, open data, registered reports -- is making the field more rigorous and trustworthy.
Next in the Series
In Part 14: Computational & AI Models of Cognition, we reach the finale of our series by exploring how researchers build computational models of the mind -- from classic cognitive architectures like ACT-R and SOAR to modern neural networks, Bayesian inference, and predictive processing. We will also examine the fascinating question of how artificial intelligence relates to human cognition.
Continue the Series
Part 14: Computational & AI Models
Explore cognitive architectures (ACT-R, SOAR), neural networks, Bayesian models, predictive processing, and the future of computational cognitive science.
Part 1: Memory Systems & Encoding
Revisit the foundational concepts of memory systems, encoding mechanisms, and the experiments that revealed how memory works.
Part 12: Applied Cognitive Psychology
See how research methods translate into real-world applications in UX design, education, and behavioral economics.