
Cognitive Psychology Series Part 13: Research Methods

March 31, 2026 Wasil Zafar 42 min read

Discover the scientific toolkit behind cognitive psychology. From designing rigorous experiments and choosing the right statistical tests to understanding reaction time paradigms, neuroimaging, and the challenges of the replication crisis, learn how researchers uncover the workings of the human mind.

Table of Contents

  1. Experimental Design
  2. Variables & Hypothesis Testing
  3. Statistical Analysis
  4. Cognitive Experimental Paradigms
  5. Reaction Time Studies
  6. Neuroimaging Methods
  7. Replication Crisis & Open Science
  8. Exercises & Self-Assessment
  9. Research Methods Plan Generator
  10. Conclusion & Next Steps

Introduction: The Science Behind the Science of the Mind

Series Overview: This is Part 13 of our 14-part Cognitive Psychology Series. We turn our lens inward to examine the research methods that make cognitive psychology a rigorous science -- from experimental design and statistical reasoning to the paradigms that reveal how the mind processes information.

How do we know what we know about the human mind? You cannot open a skull and watch thoughts form. You cannot measure an emotion the way you measure temperature. Cognitive processes are inherently invisible -- they must be inferred from observable behavior. This fundamental challenge is what makes research methodology in cognitive psychology both fascinating and demanding.

Every finding discussed in this series -- from Sperling's iconic memory to Kahneman's heuristics -- rests on carefully designed experiments, rigorous statistical analysis, and the creative use of behavioral paradigms that make the invisible visible. Understanding these methods is not just academic; it is the foundation of scientific literacy in an age when psychological claims flood popular media.

Key Insight: A research method is only as good as its ability to rule out alternative explanations. The hallmark of strong experimental design is internal validity -- the confidence that the independent variable, and nothing else, caused the observed changes in the dependent variable.

A Brief History of Experimental Psychology

The story of cognitive research methods begins in 1879, when Wilhelm Wundt established the first formal psychology laboratory at the University of Leipzig. Wundt used introspection -- trained self-observation of conscious experience -- as his primary method. While introspection was eventually criticized for its subjectivity, Wundt's insistence on controlled laboratory conditions established psychology as an empirical science.

Even earlier, Gustav Fechner (1860) developed psychophysics -- the first quantitative approach to studying the relationship between physical stimuli and psychological perception. His methods for measuring thresholds (the minimum detectable stimulus) remain in use today.

Franciscus Donders (1868) pioneered the use of reaction time as a window into cognitive processing, developing the subtraction method that allowed researchers to estimate the duration of specific mental operations. This innovation was revolutionary: for the first time, thought could be timed.

Historical Milestone

Wundt's Leipzig Laboratory (1879)

Wilhelm Wundt's laboratory at the University of Leipzig is considered the birthplace of experimental psychology. His research program focused on measuring the basic elements of conscious experience -- sensation, perception, and reaction time -- under controlled conditions. Wundt trained over 180 doctoral students, many of whom went on to establish psychology departments across Europe and North America, spreading the experimental approach worldwide.

Wundt's key methodological contribution was demonstrating that mental phenomena could be studied with the same rigor as physical phenomena. His insistence on systematic manipulation of variables, repeated measurements, and controlled conditions laid the groundwork for modern experimental cognitive psychology.


1. Experimental Design

Experimental design is the blueprint of a study -- the strategic plan for how participants will be assigned to conditions, what variables will be manipulated and measured, and how confounds will be controlled. The choice of design has profound implications for what conclusions can be drawn.

1.1 Between-Subjects Design

In a between-subjects (independent groups) design, each participant experiences only one level of the independent variable. Different groups of participants are compared.

Example: To test whether background music affects reading comprehension, Group A reads in silence, Group B reads with classical music, and Group C reads with pop music. Each participant is in only one condition.

| Advantage | Disadvantage |
|---|---|
| No carryover effects (practice, fatigue) | Requires more participants |
| Each condition is independent | Individual differences between groups may confound results |
| Suitable when conditions cannot be reversed | Reduced statistical power (more error variance) |

Key control technique: Random assignment -- participants are randomly allocated to conditions, ensuring that pre-existing differences (age, IQ, motivation) are distributed equally across groups. Without random assignment, you have a quasi-experiment, not a true experiment.

1.2 Within-Subjects (Repeated Measures) Design

In a within-subjects design, every participant experiences all levels of the independent variable. The same person serves as their own control.

Example: Each participant completes a Stroop task under three conditions: congruent (word "RED" in red ink), incongruent (word "RED" in blue ink), and neutral (a string of Xs in colored ink). Their reaction times are compared across all three conditions.

| Advantage | Disadvantage |
|---|---|
| Eliminates individual difference confounds | Order effects (practice, fatigue, carryover) |
| Requires fewer participants | Demand characteristics (participants guess the hypothesis) |
| Greater statistical power | Not suitable when exposure to one condition changes the participant |

Key control technique: Counterbalancing -- varying the order of conditions across participants (e.g., half do condition A then B; the other half do B then A) to cancel out order effects. A full Latin Square design systematically rotates all possible orderings.
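
The counterbalancing schemes above are easy to sketch in code. The functions below are illustrative (the names are ours, not from any library): one enumerates every possible ordering for complete counterbalancing, the other builds a simple cyclic Latin square in which each condition occupies each ordinal position exactly once. Note that a cyclic square does not equate immediate carryover (each condition preceding each other condition equally often); that requires the more elaborate Williams design.

```python
from itertools import permutations

def full_counterbalance(conditions):
    """All possible orderings of the conditions (complete counterbalancing)."""
    return [list(p) for p in permutations(conditions)]

def latin_square(conditions):
    """A cyclic Latin square: each condition appears exactly once
    in each ordinal position across the rows."""
    n = len(conditions)
    return [[conditions[(row + col) % n] for col in range(n)] for row in range(n)]

orders = latin_square(["A", "B", "C"])
for i, order in enumerate(orders, 1):
    print(f"Participant group {i}: {' -> '.join(order)}")
```

With three conditions, complete counterbalancing needs 6 orderings (3!), while the Latin square gets by with 3 -- a practical saving that grows rapidly as conditions are added.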

1.3 Factorial & Mixed Designs

A factorial design examines two or more independent variables simultaneously, allowing researchers to detect interactions -- situations where the effect of one variable depends on the level of another.

Key Insight: Interactions are often more theoretically interesting than main effects. For example, the finding that caffeine improves performance on simple tasks but impairs performance on complex tasks (Yerkes-Dodson law) is an interaction between arousal and task difficulty -- neither variable alone tells the full story.

Example of a 2 x 3 factorial design: Factor A = Encoding type (visual vs verbal), Factor B = Retention interval (1 hour, 1 day, 1 week). This yields 6 conditions and can reveal whether the decay rate differs for visual vs verbal memories.

A mixed design combines between-subjects and within-subjects factors. For instance, comparing older adults vs younger adults (between-subjects) on a memory task measured at multiple time points (within-subjects).

1.4 Quasi-Experimental Design

When true random assignment is impossible -- because the independent variable is a pre-existing characteristic (age, clinical diagnosis, handedness) or an event that cannot be ethically manipulated -- researchers use quasi-experimental designs.

Example: Comparing cognitive function in patients with Alzheimer's disease vs healthy controls. You cannot randomly assign people to have Alzheimer's, so groups differ in ways beyond just the variable of interest.

Critical Limitation: Quasi-experiments cannot establish causation because the lack of random assignment means confounding variables are not controlled. Alzheimer's patients may differ from controls in education, medication use, and general health -- any of which could explain group differences in cognition.

2. Variables & Hypothesis Testing

2.1 Independent, Dependent & Confounding Variables

Every experiment revolves around three types of variables:

| Variable Type | Definition | Example (Stroop Study) |
|---|---|---|
| Independent Variable (IV) | The factor manipulated by the researcher | Congruency of word and ink color (congruent vs incongruent) |
| Dependent Variable (DV) | The outcome measured | Reaction time (ms) and error rate (%) |
| Confounding Variable | Uncontrolled factor that varies with the IV, threatening internal validity | Word frequency (if congruent words happen to be more common) |

Extraneous variables are any variables other than the IV that could affect the DV. They become confounds only when they systematically co-vary with the IV. Good experimental design uses randomization, counterbalancing, and standardization to prevent extraneous variables from becoming confounds.

2.2 Hypothesis Testing: Null vs Alternative

The logic of null hypothesis significance testing (NHST) -- the dominant statistical framework in psychology -- works by assuming the null hypothesis (H0: there is no effect) is true, then calculating the probability of obtaining data as extreme as what was observed.

| Concept | Definition | Typical Threshold |
|---|---|---|
| Null Hypothesis (H0) | No effect; any differences are due to chance | -- |
| Alternative Hypothesis (H1) | There is a real effect of the IV on the DV | -- |
| p-value | Probability of obtaining data this extreme if H0 is true | p < .05 |
| Type I Error (False Positive) | Rejecting H0 when it is actually true | alpha = .05 (5% risk) |
| Type II Error (False Negative) | Failing to reject H0 when H1 is actually true | beta = .20 (20% risk) |

Common Misconception: A p-value of .03 does not mean there is a 3% probability the null hypothesis is true. It means: "If the null hypothesis were true, there would be a 3% probability of observing data this extreme or more extreme." The p-value is about the data given the hypothesis, not the hypothesis given the data. This distinction, often missed, leads to widespread misinterpretation of research findings.
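
One way to internalize what the alpha level controls is simulation. In the sketch below (our own illustration; the `pooled_t` helper is written from scratch), both "groups" are drawn from the same population, so the null hypothesis is true by construction. Any significant result is therefore a Type I error, and such errors occur at roughly the alpha rate.

```python
import math
import random

def pooled_t(g1, g2):
    """Two-sample t statistic with pooled variance."""
    n1, n2 = len(g1), len(g2)
    m1, m2 = sum(g1) / n1, sum(g2) / n2
    v1 = sum((x - m1) ** 2 for x in g1) / (n1 - 1)
    v2 = sum((x - m2) ** 2 for x in g2) / (n2 - 1)
    pooled = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
    return (m1 - m2) / math.sqrt(pooled * (1 / n1 + 1 / n2))

random.seed(1)
n_sims, false_positives = 2000, 0
for _ in range(n_sims):
    # Both samples come from the SAME population: H0 is true by design
    a = [random.gauss(100, 15) for _ in range(30)]
    b = [random.gauss(100, 15) for _ in range(30)]
    if abs(pooled_t(a, b)) > 2.00:  # approx. two-tailed critical t, df = 58
        false_positives += 1

print(f"False-positive rate: {false_positives / n_sims:.3f} (expected ~ .05)")
```

Run the loop and roughly 5% of these null experiments come out "significant" -- exactly the Type I error rate that alpha = .05 promises, no more and no less.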

2.3 Effect Size & Power Analysis

Statistical significance tells you whether an effect is likely real, but not whether it matters. That is the role of effect size -- a standardized measure of the magnitude of an effect.

| Effect Size Measure | Used With | Small | Medium | Large |
|---|---|---|---|---|
| Cohen's d | Comparing two means | 0.2 | 0.5 | 0.8 |
| Eta-squared (η²) | ANOVA | .01 | .06 | .14 |
| Pearson's r | Correlation | .10 | .30 | .50 |

Statistical power is the probability of correctly detecting a real effect (1 - beta). A well-powered study typically aims for 80% power. Power depends on three factors: effect size (larger = easier to detect), sample size (larger = more power), and alpha level (more lenient = more power but more Type I errors).

# Power analysis and effect size calculation in Python
import math
import random

def cohens_d(group1, group2):
    """Calculate Cohen's d for two independent groups."""
    n1, n2 = len(group1), len(group2)
    mean1, mean2 = sum(group1) / n1, sum(group2) / n2

    # Pooled standard deviation
    var1 = sum((x - mean1) ** 2 for x in group1) / (n1 - 1)
    var2 = sum((x - mean2) ** 2 for x in group2) / (n2 - 1)
    pooled_sd = math.sqrt(((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2))

    d = (mean1 - mean2) / pooled_sd
    return d

def required_sample_size(effect_size, alpha=0.05, power=0.80):
    """
    Approximate sample size per group for a two-sample t-test.
    Uses the formula: n = (z_alpha + z_beta)^2 * 2 / d^2
    """
    # Z-scores for common alpha and power levels
    z_alpha = 1.96 if alpha == 0.05 else 2.576  # two-tailed
    z_beta = 0.84 if power == 0.80 else 1.28    # power = .80 or .90

    n_per_group = math.ceil((z_alpha + z_beta) ** 2 * 2 / effect_size ** 2)
    return n_per_group

# Simulate a Stroop experiment
random.seed(42)
congruent_rts = [random.gauss(520, 80) for _ in range(50)]    # ms
incongruent_rts = [random.gauss(620, 95) for _ in range(50)]  # ms

d = cohens_d(incongruent_rts, congruent_rts)
print(f"=== Stroop Effect Simulation ===")
print(f"Congruent mean RT: {sum(congruent_rts)/len(congruent_rts):.1f} ms")
print(f"Incongruent mean RT: {sum(incongruent_rts)/len(incongruent_rts):.1f} ms")
print(f"Cohen's d: {abs(d):.3f} (large effect)")

# Power analysis for different effect sizes
print(f"\n=== Required Sample Sizes (alpha=.05, power=.80) ===")
for label, es in [("Small (d=0.2)", 0.2), ("Medium (d=0.5)", 0.5), ("Large (d=0.8)", 0.8)]:
    n = required_sample_size(es)
    print(f"  {label}: {n} participants per group ({n*2} total)")

3. Statistical Analysis

Choosing the correct statistical test depends on the research design, the type of data, and the research question. Here is a practical guide to the most commonly used tests in cognitive psychology research.

3.1 t-Tests

The t-test compares means from two conditions to determine whether they differ significantly.

| Type | When to Use | Example |
|---|---|---|
| Independent-samples t-test | Two different groups compared on the same DV | Comparing memory scores of young vs old adults |
| Paired-samples t-test | Same participants measured twice (within-subjects) | Comparing Stroop congruent vs incongruent RTs |
| One-sample t-test | Comparing a sample mean to a known value | Is this group's average IQ different from 100? |

3.2 Analysis of Variance (ANOVA)

When you have three or more conditions, multiple t-tests inflate your Type I error rate. ANOVA solves this by testing whether any group means differ significantly in a single omnibus test.
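
This inflation is easy to quantify: with k independent tests, each at alpha = .05, the probability of at least one false positive is 1 - (1 - .05)^k. A quick sketch (three conditions already imply three pairwise t-tests):

```python
alpha = 0.05
for k in [1, 3, 6, 10]:
    # Probability of at least one Type I error across k independent tests
    familywise = 1 - (1 - alpha) ** k
    print(f"{k:2d} tests at alpha = .05 -> familywise error rate = {familywise:.3f}")
```

Even the three comparisons required by a three-group design push the familywise error rate above 14%, which is why the single omnibus F-test is preferred.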

| ANOVA Type | Design | Example |
|---|---|---|
| One-Way ANOVA | One IV with 3+ levels, between-subjects | Memory recall under silence, classical, or pop music |
| Repeated Measures ANOVA | One IV with 3+ levels, within-subjects | Reaction time at three SOA intervals for the same participants |
| Factorial ANOVA | Two or more IVs, tests main effects + interactions | Age (young/old) x Encoding type (visual/verbal) on recall |
| Mixed ANOVA | At least one between- and one within-subjects factor | Clinical group (between) x Time point (within) |

# Simulating a One-Way ANOVA: Effect of background music on memory recall
import random
import math

def one_way_anova(groups):
    """
    Perform a one-way ANOVA from scratch.
    Returns F-statistic and effect size (eta-squared).
    """
    k = len(groups)
    all_data = [x for g in groups for x in g]
    N = len(all_data)
    grand_mean = sum(all_data) / N

    # Sum of Squares Between (SSB)
    ssb = sum(len(g) * (sum(g)/len(g) - grand_mean)**2 for g in groups)

    # Sum of Squares Within (SSW)
    ssw = sum(sum((x - sum(g)/len(g))**2 for x in g) for g in groups)

    # Degrees of freedom
    df_between = k - 1
    df_within = N - k

    # Mean squares
    msb = ssb / df_between
    msw = ssw / df_within

    # F-statistic
    f_stat = msb / msw

    # Effect size (eta-squared)
    eta_sq = ssb / (ssb + ssw)

    return f_stat, df_between, df_within, eta_sq

# Simulate three conditions
random.seed(42)
silence = [random.gauss(15, 3) for _ in range(30)]      # 15 words recalled
classical = [random.gauss(14, 3) for _ in range(30)]     # 14 words recalled
pop_music = [random.gauss(11, 3.5) for _ in range(30)]   # 11 words recalled

f, df1, df2, eta = one_way_anova([silence, classical, pop_music])

print("=== One-Way ANOVA: Background Music & Memory ===")
print(f"Silence:   M = {sum(silence)/len(silence):.1f} words")
print(f"Classical: M = {sum(classical)/len(classical):.1f} words")
print(f"Pop Music: M = {sum(pop_music)/len(pop_music):.1f} words")
print(f"\nF({df1}, {df2}) = {f:.2f}")
print(f"eta-squared = {eta:.3f} ({'large' if eta > .14 else 'medium' if eta > .06 else 'small'} effect)")
print(f"{'Significant at p < .05' if f > 3.10 else 'Not significant'}")

3.3 Correlation & Regression

Correlation measures the strength and direction of the linear relationship between two variables. Regression goes further by modeling one variable as a function of one or more predictors.

| Method | What It Tests | Example |
|---|---|---|
| Pearson's r | Linear relationship between two continuous variables | Correlation between working memory capacity and reading comprehension |
| Spearman's rho | Monotonic relationship (works with ordinal data or non-linear) | Rank-order correlation between confidence and accuracy |
| Simple Regression | Predicting DV from one IV | Predicting exam score from hours of sleep |
| Multiple Regression | Predicting DV from multiple IVs simultaneously | Predicting cognitive decline from age, education, and exercise |
| Chi-Square | Association between two categorical variables | Is encoding strategy (visual/verbal) related to participant gender? |

Correlation Does Not Imply Causation: A positive correlation between ice cream sales and drowning deaths does not mean ice cream causes drowning. Both are caused by a third variable: hot weather. In cognitive psychology, a correlation between screen time and attention problems could reflect reverse causation (people with attention difficulties seek more stimulation) or a shared cause (impulsivity).
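
Pearson's r can be computed from raw scores in a few lines. The sketch below follows the standard formula; the working memory and reading comprehension data are invented purely for illustration.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient: covariance / (SD_x * SD_y)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical data: working memory span vs reading comprehension score
wm_span = [3, 4, 4, 5, 5, 6, 6, 7, 7, 8]
reading = [55, 58, 62, 60, 66, 71, 69, 75, 78, 80]

r = pearson_r(wm_span, reading)
print(f"r = {r:.3f}")
```

A strong positive r here would be consistent with the working-memory/comprehension link in the table above, but remember the warning: the correlation alone cannot say which variable drives the other.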

3.4 Choosing the Right Statistical Test

| Research Question | Data Type | Groups | Recommended Test |
|---|---|---|---|
| Difference between 2 independent groups | Continuous | Between | Independent t-test |
| Difference between 2 related measures | Continuous | Within | Paired t-test |
| Difference among 3+ independent groups | Continuous | Between | One-way ANOVA |
| Difference among 3+ related measures | Continuous | Within | Repeated measures ANOVA |
| Relationship between 2 continuous variables | Continuous | -- | Pearson correlation |
| Association between 2 categorical variables | Categorical | -- | Chi-square test |
| Predicting outcome from multiple predictors | Mixed | -- | Multiple regression |

4. Cognitive Experimental Paradigms

Cognitive psychologists have developed a remarkable toolkit of experimental paradigms -- standardized tasks that reliably tap specific cognitive processes. These paradigms are the workhorses of the field, used in thousands of studies across labs worldwide.

4.1 The Stroop Task

The Stroop task (Stroop, 1935) is arguably the most famous paradigm in cognitive psychology. Participants must name the ink color of printed words while ignoring the word itself. When the word and ink color conflict (e.g., the word "RED" printed in blue ink), reaction times increase dramatically -- the Stroop effect.

Classic Paradigm

The Stroop Effect -- Automatic vs Controlled Processing

John Ridley Stroop's 1935 dissertation revealed a fundamental truth about the mind: reading is automatic. We cannot help but read a word, even when explicitly instructed to ignore it. Naming the ink color of an incongruent word requires the controlled, effortful suppression of the automatic reading response -- a process that takes measurably longer.

The Stroop effect has proven remarkably robust: it has been replicated across languages, age groups, and cultures. It serves as a marker for cognitive control, executive function, and selective attention. Clinically, enlarged Stroop effects are observed in conditions like ADHD, schizophrenia, and frontal lobe damage -- making it a sensitive diagnostic tool.

Typical effect size: Incongruent trials are approximately 80-120 ms slower than congruent trials, with Cohen's d values typically exceeding 1.0 -- one of the largest and most reliable effects in all of psychology.


4.2 The Eriksen Flanker Task

The flanker task (Eriksen & Eriksen, 1974) measures the ability to suppress responses to irrelevant stimuli surrounding a target. Participants respond to a central stimulus (e.g., the direction of a central arrow) while ignoring flanking distractors.

Example stimuli:

  • Congruent: > > > > > (all arrows point right) -- Fast, accurate
  • Incongruent: < < > < < (flankers conflict with target) -- Slower, more errors
  • Neutral: -- -- > -- -- (non-arrow flankers) -- Intermediate

The flanker effect demonstrates that selective attention has spatial limits: irrelevant distractors are processed anyway, and their interference grows the closer they sit to the target. This paradigm is central to theories of response competition and attentional filtering.

4.3 Go/No-Go, N-Back & Visual Search

| Paradigm | What It Measures | Task Description | Key Findings |
|---|---|---|---|
| Go/No-Go | Response inhibition | Respond to "go" stimuli, withhold response to "no-go" stimuli | No-go errors index impulsivity; used in ADHD research |
| N-Back | Working memory updating | Respond when current stimulus matches the one N items back | Performance drops sharply from 1-back to 3-back; strongly activates DLPFC |
| Visual Search | Attention: parallel vs serial processing | Find a target among distractors (e.g., red circle among blue circles) | Pop-out (feature search) is parallel; conjunction search is serial (Treisman) |
| Priming | Implicit memory, associative networks | Prior exposure to a stimulus facilitates processing of related stimuli | Semantic priming: "doctor" speeds recognition of "nurse" |
| Simon Task | Stimulus-response compatibility | Respond to stimulus identity, ignoring its spatial location | Faster when stimulus and response are on the same side (Simon effect) |

Case Study

The Simon Effect -- When Location Matters

In J.R. Simon's classic paradigm, participants press a left or right key based on stimulus identity (e.g., press left for a high tone, right for a low tone). Despite the irrelevance of the tone's spatial location, participants are faster when the stimulus and response are on the same side (compatible) than on opposite sides (incompatible).

The Simon effect reveals an automatic spatial stimulus-response mapping that persists even when participants are explicitly told to ignore location. Like the Stroop effect, it demonstrates the limits of controlled processing in overriding automatic tendencies. The Simon effect is typically 20-30 ms and has been used extensively in research on cognitive aging and bilingualism.


5. Reaction Time Studies

Reaction time (RT) is the primary dependent variable in cognitive psychology -- a millisecond-precise window into the speed of mental processing. The logic is simple but powerful: if manipulating a variable increases RT, that manipulation has added cognitive processing demands.

5.1 Donders' Subtraction Method

Franciscus Donders (1868) developed the subtraction method to estimate the duration of specific mental processes. He designed three types of reaction time tasks, each adding one additional cognitive operation:

| Task Type | Cognitive Demands | Example | Typical RT |
|---|---|---|---|
| A-Reaction (Simple) | Detection only | Press button when light appears | ~180 ms |
| B-Reaction (Choice) | Detection + Discrimination + Selection | Press left for red light, right for green | ~350 ms |
| C-Reaction (Go/No-Go) | Detection + Discrimination | Press button for red light, do nothing for green | ~265 ms |

By subtracting task durations, Donders estimated: Discrimination time = C - A = ~85 ms; Response selection time = B - C = ~85 ms. This elegant logic assumes that cognitive processes are additive and independent -- an assumption later challenged by Sternberg's additive factors method (1969).
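
The subtraction logic is simple enough to express directly, using the illustrative RTs from the table above:

```python
# Mean RTs (ms) for Donders' three task types (illustrative values from the table)
rt_a = 180  # A-reaction: detection only
rt_b = 350  # B-reaction: detection + discrimination + response selection
rt_c = 265  # C-reaction: detection + discrimination

discrimination_ms = rt_c - rt_a  # what adding discrimination costs
selection_ms = rt_b - rt_c       # what adding response selection costs
print(f"Discrimination: ~{discrimination_ms} ms; Response selection: ~{selection_ms} ms")
```

The estimates only hold if each added operation leaves the others untouched -- precisely the "pure insertion" assumption that Sternberg's additive factors method was designed to test.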

5.2 Speed-Accuracy Tradeoff

One of the most fundamental constraints in human information processing is the speed-accuracy tradeoff (SAT): faster responses tend to be less accurate, and more accurate responses tend to be slower. Participants can shift their criterion along this continuum.

Methodological Implication: Because of the SAT, reporting only reaction time (or only accuracy) can be misleading. A manipulation that appears to speed up responses may actually be making participants less careful. Modern cognitive research reports both RT and error rate, and some studies use sophisticated models like the diffusion model (Ratcliff, 1978) to disentangle speed and accuracy into separate parameters: drift rate (information quality), boundary separation (caution), and non-decision time.

5.3 Hick's Law

Hick's Law (Hick, 1952; Hyman, 1953) states that choice reaction time increases logarithmically with the number of response alternatives:

RT = a + b * log2(n)

where n is the number of equally probable alternatives, a is the base RT, and b is the slope (about 150 ms per bit of information). This relationship shows that the human decision-making system processes information in bits, much like a digital system, supporting the information-processing metaphor central to cognitive psychology.
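
A minimal sketch of the prediction, using illustrative parameter values (a = 200 ms base RT; b = 150 ms per bit, matching the slope quoted above):

```python
import math

def hick_rt(n_alternatives, a=200.0, b=150.0):
    """Predicted choice RT (ms) under Hick's Law: RT = a + b * log2(n)."""
    return a + b * math.log2(n_alternatives)

for n in [2, 4, 8, 16]:
    print(f"{n:2d} alternatives -> predicted RT = {hick_rt(n):.0f} ms")
```

Each doubling of the number of alternatives adds a constant b milliseconds -- the logarithmic signature of the law.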

Practical application: Hick's Law directly influences UX design. Menus with fewer options lead to faster selection times. This is why simplified navigation (e.g., 5-7 main menu items) leads to better user experience than presenting 20+ options simultaneously.

6. Neuroimaging Methods in Cognitive Research

Modern cognitive psychology increasingly integrates neuroimaging -- techniques that measure brain activity during cognitive tasks. Each method offers a different tradeoff between spatial resolution (where in the brain) and temporal resolution (when activity occurs).

| Method | Measures | Spatial Resolution | Temporal Resolution | Key Application |
|---|---|---|---|---|
| fMRI | Blood oxygenation (BOLD signal) | ~1-2 mm (excellent) | ~1-2 seconds (poor) | Localizing cognitive functions to brain regions |
| EEG | Electrical activity (scalp electrodes) | ~5-10 cm (poor) | ~1 ms (excellent) | Event-related potentials (ERPs); timing of processing stages |
| MEG | Magnetic fields from neural activity | ~5 mm (good) | ~1 ms (excellent) | Combining spatial and temporal precision |
| PET | Metabolic activity (radioactive tracers) | ~4-8 mm (moderate) | ~30-60 seconds (poor) | Neurotransmitter receptor mapping |
| TMS | Causal role of brain areas (disruption) | ~1 cm (good) | ~10 ms (good) | Testing whether a brain region is necessary for a task |
| NIRS/fNIRS | Blood oxygenation (near-infrared light) | ~1-3 cm (moderate) | ~100 ms (moderate) | Portable neuroimaging; developmental studies |

Key Insight: The most informative cognitive neuroscience studies use converging evidence from multiple methods. fMRI tells you where processing occurs; EEG tells you when; TMS tells you whether that brain region is necessary. No single method provides a complete picture.

7. Replication Crisis & Open Science

7.1 Ecological Validity

Ecological validity refers to the degree to which experimental findings generalize to real-world settings. A perennial tension in cognitive psychology is between internal validity (controlled lab conditions) and ecological validity (real-world relevance).

Consider memory research: studying word list recall in a quiet lab has high internal validity but may tell us little about how memory operates when navigating a busy city, having a conversation, or studying for an exam while distracted by social media. Neisser (1976) famously criticized the field for studying memory "in a vacuum," calling for more ecologically valid research paradigms.

Modern responses to this challenge include experience sampling methods (ESM), virtual reality experiments, and large-scale online studies that sacrifice some control for greater ecological representativeness.

7.2 The Replication Crisis

In 2015, the Open Science Collaboration attempted to replicate 100 published psychology experiments. The results were sobering: while 97% of the original studies reported significant results, only 36% of replications yielded significant effects. Effect sizes in replications were, on average, half the magnitude of the originals.

Case Study

The Reproducibility Project: Psychology (2015)

Led by Brian Nosek and the Center for Open Science, 270 researchers across 50 labs attempted high-fidelity replications of 100 studies from three top psychology journals. Key findings:

  • 97% of original studies had significant results (p < .05)
  • Only 36% of replications achieved significance
  • Mean effect size dropped from r = .403 to r = .197
  • Cognitive psychology replicated better (~50%) than social psychology (~25%)

This did not mean most psychology findings are false, but it exposed systemic problems: publication bias (journals preferring significant results), small sample sizes, flexible data analysis (p-hacking), and insufficient emphasis on replication.


Several factors contributed to the crisis:

  • Publication bias: Journals overwhelmingly publish positive results, creating a "file drawer problem" where null results are never shared
  • p-hacking: Researchers (often unconsciously) make analysis decisions that nudge p-values below the significance threshold -- testing multiple DVs, removing outliers, adding covariates, or stopping data collection as soon as p < .05
  • HARKing: Hypothesizing After Results are Known -- presenting post-hoc findings as if they were predicted a priori
  • Underpowered studies: Many studies had too few participants to reliably detect the effects they claimed to find

7.3 The Open Science Movement

The replication crisis catalyzed a powerful reform movement. The open science movement promotes transparency, rigor, and reproducibility through concrete practices:

| Practice | Description | Impact |
|---|---|---|
| Pre-registration | Publicly registering hypotheses, methods, and analysis plans before data collection | Prevents p-hacking and HARKing; distinguishes confirmatory from exploratory analyses |
| Open Data | Making raw data publicly available | Enables independent verification and re-analysis |
| Open Materials | Sharing stimuli, code, and experimental scripts | Facilitates exact replications and methodological improvements |
| Registered Reports | Journals peer-review and accept studies before data collection | Eliminates publication bias entirely; results cannot influence acceptance |
| Many Labs Projects | Large-scale collaborative replications across many laboratories | Provides definitive estimates of effect sizes and generalizability |

7.4 Meta-Analysis

Meta-analysis is a statistical technique that synthesizes findings from multiple studies on the same topic, providing a more precise estimate of the true effect size than any single study can.

Key Insight: A single study is just one data point. Meta-analysis pools data across studies to estimate the true effect while accounting for sampling variability. For example, a meta-analysis of 100 Stroop studies provides a much more precise estimate of the Stroop effect's magnitude than any individual study. Meta-analyses can also test moderating variables: is the Stroop effect larger in older adults? In clinical populations? The answers emerge from patterns across studies.

Steps in conducting a meta-analysis:

  1. Define the research question and inclusion criteria
  2. Systematic literature search -- exhaustive, documented search of databases
  3. Code studies -- extract effect sizes, sample sizes, and moderator variables
  4. Compute weighted average effect size -- larger studies receive more weight
  5. Test for heterogeneity -- are effect sizes consistent across studies?
  6. Analyze moderators -- what factors explain variation in effect sizes?
  7. Assess publication bias -- funnel plots and trim-and-fill analysis
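
Step 4 can be sketched as a fixed-effect, inverse-variance weighted average. Everything below is illustrative: the study values are invented, and the variance formula for Cohen's d is the standard large-sample approximation (assuming equal group sizes).

```python
def fixed_effect_meta(effect_sizes, sample_sizes):
    """Inverse-variance weighted mean effect size (fixed-effect model).
    For Cohen's d with equal groups, the sampling variance is approximated
    by var(d) = (n1+n2)/(n1*n2) + d^2 / (2*(n1+n2))."""
    weights, weighted = [], []
    for d, n in zip(effect_sizes, sample_sizes):
        n1 = n2 = n / 2
        var_d = (n1 + n2) / (n1 * n2) + d ** 2 / (2 * (n1 + n2))
        w = 1 / var_d          # precise studies get more weight
        weights.append(w)
        weighted.append(w * d)
    return sum(weighted) / sum(weights)

# Hypothetical Stroop studies: (Cohen's d, total N)
studies = [(1.2, 40), (0.9, 120), (1.5, 24), (1.0, 200), (0.8, 60)]
pooled = fixed_effect_meta([d for d, _ in studies], [n for _, n in studies])
print(f"Pooled effect size: d = {pooled:.2f}")
```

Notice how the pooled estimate is pulled toward the large-N studies: the N = 24 study reporting d = 1.5 contributes little, which is exactly the weighting logic of step 4.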

Exercises & Self-Assessment

Exercise 1

Design Your Own Experiment

Design a between-subjects experiment to test whether handwriting versus typing lecture notes leads to better exam performance. Specify:

  1. Your independent variable and its levels
  2. Your dependent variable(s) and how you would measure them
  3. At least three potential confounding variables and how you would control each
  4. Your sample size and how you determined it (hint: use power analysis)
  5. Your statistical test and why it is appropriate

Challenge: Now redesign this as a within-subjects study. What changes? What new problems arise?

Exercise 2

Spot the Flaws

Identify the methodological problems in each scenario:

  1. A researcher tests 20 DVs and reports the one that was significant at p = .04 without mentioning the others.
  2. A study with 12 participants per group reports a "significant" effect of meditation on attention (p = .048).
  3. A memory study compares psychology students (Group A) to engineering students (Group B), finding that Group A recalls more psychology terms.
  4. A researcher finds p = .06 and concludes "there was a trend toward significance," treating it as partial support for the hypothesis.

Answers: (1) p-hacking / multiple comparisons, (2) severely underpowered, (3) confound: prior knowledge, (4) misinterpretation of p-values; .06 is not significant, and "trends" are not evidence.
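Scenario 1 can be made vivid by simulation. If all 20 DVs are pure noise, the chance that at least one test crosses p < .05 is 1 − 0.95²⁰ ≈ 64%. A minimal sketch, exploiting the fact that p-values are uniform on [0, 1] under the null hypothesis:

```python
import random

random.seed(42)

def at_least_one_hit(n_dvs=20, alpha=0.05):
    """Simulate one null study: every DV is noise, yet all are tested.
    Under H0, each p-value is uniform on [0, 1]."""
    return any(random.random() < alpha for _ in range(n_dvs))

trials = 10_000
false_positive_rate = sum(at_least_one_hit() for _ in range(trials)) / trials

# Theory predicts 1 - 0.95**20, roughly 0.64 -- far above the nominal 5%.
print(f"Family-wise false-positive rate: {false_positive_rate:.2f}")
```

This is why the researcher in scenario 1 who reports only the single "significant" DV is almost guaranteed to find something, even when nothing is there.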

Exercise 3

DIY Stroop Experiment

Conduct a simple Stroop experiment with a friend:

  1. Create two lists: (A) Color words printed in matching ink, (B) Color words in mismatching ink
  2. Time how long it takes to name all the ink colors in each list
  3. Record the number of errors in each condition
  4. Calculate the Stroop effect (time difference between lists)
  5. Test at least 5 people and compute the average Stroop effect and its standard deviation

Discussion: Did you observe the expected Stroop interference? How much variability was there across participants? What might explain individual differences?
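The arithmetic in steps 4 and 5 can be scripted once you have your timings. The values below are invented for illustration; substitute your own measurements:

```python
from statistics import mean, stdev

# Hypothetical naming times in seconds for 5 participants:
# (congruent list A, incongruent list B)
times = [(14.2, 21.5), (12.8, 19.0), (15.1, 24.3), (13.5, 20.1), (16.0, 22.8)]

# Per-participant Stroop effect: incongruent minus congruent time (step 4).
effects = [incongruent - congruent for congruent, incongruent in times]

# Average effect and between-participant variability (step 5).
print(f"Mean Stroop effect: {mean(effects):.2f} s")
print(f"SD across participants: {stdev(effects):.2f} s")
```

Computing the difference within each participant, rather than comparing group averages of the two lists, is itself a within-subjects design choice: it removes baseline speed differences between people from the estimate.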

Exercise 4

Reflective Questions

  1. Explain why random assignment is essential for establishing causation. What happens without it?
  2. A study reports a "statistically significant" effect with p = .03 and Cohen's d = 0.1. Should we be excited? Why or why not?
  3. Why did cognitive psychology replicate better than social psychology in the Reproducibility Project? What methodological features might explain this?
  4. Design a study using Donders' subtraction method to estimate how long it takes to mentally rotate an object 90 degrees.
  5. What are the pros and cons of pre-registration? Could it stifle exploratory research?

Research Methods Plan Generator

Design your cognitive psychology research plan.

Conclusion & Next Steps

In this penultimate chapter of our Cognitive Psychology Series, we have examined the scientific methods that underpin everything cognitive psychologists claim to know about the mind. Here are the key takeaways:

  • Experimental design is the foundation of causal inference. Between-subjects, within-subjects, factorial, and quasi-experimental designs each have distinct strengths and limitations. Random assignment is essential for causation.
  • Hypothesis testing is widely used but widely misunderstood. A p-value is not the probability that H0 is true. Effect size and power are at least as important as significance.
  • Statistical tests should match the research design: t-tests for two conditions, ANOVA for three or more, correlation and regression for relationships, chi-square for categorical data.
  • Cognitive paradigms like the Stroop, flanker, and n-back tasks are the workhorses of cognitive research, providing reliable windows into specific processes like attention, inhibition, and working memory.
  • Reaction time (RT) is the gold-standard dependent variable in cognitive psychology, dating back to Donders' subtraction method. The speed-accuracy tradeoff and Hick's law reveal fundamental constraints of the information-processing system.
  • Neuroimaging methods complement behavioral measures, each with different spatial and temporal resolution tradeoffs. Converging evidence across methods yields the strongest conclusions.
  • The replication crisis exposed real problems in research practice, but the open science response -- pre-registration, open data, registered reports -- is making the field more rigorous and trustworthy.

Next in the Series

In Part 14: Computational & AI Models of Cognition, we reach the finale of our series by exploring how researchers build computational models of the mind -- from classic cognitive architectures like ACT-R and SOAR to modern neural networks, Bayesian inference, and predictive processing. We will also examine the fascinating question of how artificial intelligence relates to human cognition.
