Analysis

Preprocessing

Pre-processing serves to exclude data that may not reflect the cognitive processes of interest and would therefore lower the signal-to-noise ratio. We recommend keeping pre-processing to a minimum, both to avoid systematic bias from excluding potentially meaningful datapoints and to maximise the number of observations. As there are many different rules of thumb, we suggest either minimising the amount of pre-processing (e.g., when the amount of data is limited) or conducting a robustness analysis, in which the results are verified with a set of different pre-processing methods (e.g., when a large amount of data is available; see Short et al. 2025).

Below are some recommended pre-processing options. Please note that they should ideally be specified in a pre-registration.

Excluding long responses
- When a time-out criterion was used (e.g., the word disappeared after 5000 ms and the response times of unanswered trials were therefore recorded as 5000 ms), the response times of these timed-out trials need to be set to Not Available (e.g., NA in R).
- When no time-out was used, find evidence-based criteria (e.g., previous studies with similar types of items) to remove unusually long reaction times.
- Note that for different groups (e.g., children), these criteria can differ considerably.
Excluding short responses
- Button presses within the first 200 ms are generally considered accidental responses, as this short amount of time does not allow the relevant cognitive processes to unfold. They should also be set to Not Available.
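The two trial-level exclusion rules above can be sketched as follows. This is a minimal illustration in plain Python (the same logic maps onto NA handling in R); the constant names, the example values, and the decision to keep a response at exactly 200 ms are assumptions made for the example, not part of the recommendation.

```python
from typing import Optional

TIMEOUT_MS = 5000  # assumed time-out used in the experiment
MIN_RT_MS = 200    # faster responses are treated as accidental

def clean_rt(rt_ms: Optional[float]) -> Optional[float]:
    """Return the RT, or None (the analogue of NA in R) if the trial is unusable."""
    if rt_ms is None:
        return None
    if rt_ms >= TIMEOUT_MS:  # timed-out trial recorded at the time-out value
        return None
    if rt_ms < MIN_RT_MS:    # too fast to reflect the relevant processing
        return None
    return rt_ms

# Hypothetical raw response times in milliseconds
raw_rts = [150, 480, 5000, 730, 199, 200, 1210]
cleaned = [clean_rt(rt) for rt in raw_rts]
print(cleaned)
```

Setting the values to None rather than deleting the rows keeps the trial structure intact, which matters if trial order or the previous-trial RT is later used as a covariate.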
Participants with chance-level performance should be excluded. The chance level is 50% when there are 50% words and 50% pseudowords. Note: If the percentage of words deviates from 50% (e.g., if it’s 70%), then participants who always provide a “word” response may have above-chance accuracy but still need to be removed.
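One possible way to operationalise this participant-level check is sketched below in plain Python. The one-sided binomial test against 50%, the alpha of .05, and the 95% response-bias cut-off are all assumptions chosen for illustration; the text above does not prescribe specific thresholds, and whatever criteria are used should be pre-registered.

```python
from math import comb

def p_at_least(k: int, n: int, p: float = 0.5) -> float:
    """Probability of k or more correct responses out of n when guessing."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

def flag_participant(is_word, said_word, alpha=0.05, max_bias=0.95):
    """Flag a participant who is at chance or who (almost) always gives one response."""
    n = len(is_word)
    n_correct = sum(truth == resp for truth, resp in zip(is_word, said_word))
    prop_word = sum(said_word) / n
    at_chance = p_at_least(n_correct, n) >= alpha  # not distinguishable from guessing
    one_key = prop_word >= max_bias or prop_word <= 1 - max_bias
    return at_chance or one_key

# Hypothetical experiment with 70% words and 30% pseudowords
is_word = [True] * 70 + [False] * 30

always_word = [True] * 100            # 70% correct, but uninformative
guesser = [i % 2 == 0 for i in range(100)]
attentive = list(is_word)
for i in range(5):                    # five errors
    attentive[i] = not attentive[i]

for name, resp in [("always_word", always_word), ("guesser", guesser), ("attentive", attentive)]:
    print(name, "-> exclude" if flag_participant(is_word, resp) else "-> keep")
```

The response-bias check is what catches the participant who always answers "word": their 70% accuracy passes the binomial test, so accuracy alone would not flag them.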
Individual participant data quality check
- For each participant, we can assess whether the response times for each trial correlate with the average response times for the same items across all other participants. The participant can be excluded if the correlation is low (e.g., r < 0.5).
- Item-level removal: While some studies remove items with particularly low accuracy, this may create a systematic bias. We thus recommend against it.
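The participant-level correlation check described above can be sketched as a leave-one-out computation. This is an illustration in plain Python with invented data; the data layout (a dictionary of participants, each mapping items to RTs) and all names are assumptions.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equally long lists of numbers."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def leave_one_out_correlations(rts):
    """Correlate each participant's item RTs with the mean RT of all other participants."""
    result = {}
    for pid, own in rts.items():
        own_rts, other_means = [], []
        for item, rt in own.items():
            others = [r[item] for p, r in rts.items() if p != pid and item in r]
            if others:
                own_rts.append(rt)
                other_means.append(sum(others) / len(others))
        result[pid] = pearson(own_rts, other_means)
    return result

# Hypothetical RTs in ms; p4 responds in a pattern unrelated to the other participants
rts = {
    "p1": {"cat": 500, "house": 560, "blorp": 700, "strund": 780},
    "p2": {"cat": 510, "house": 590, "blorp": 720, "strund": 800},
    "p3": {"cat": 480, "house": 550, "blorp": 690, "strund": 760},
    "p4": {"cat": 800, "house": 500, "blorp": 600, "strund": 520},
}
correlations = leave_one_out_correlations(rts)
to_exclude = [pid for pid, r in correlations.items() if r < 0.5]
print(correlations, to_exclude)
```

Each participant is left out of the item averages they are compared against, so a participant cannot inflate their own correlation.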

An important decision is whether one is interested in analysing response times or accuracy. In studies with adults, accuracy is generally high. For response time analyses, one excludes all trials with incorrect responses. However, if accuracy is low, this systematically excludes data for the more difficult items. If accuracy is high, one needs to be aware of ceiling effects when investigating accuracy.

Note that there are model-based analyses that allow response times and accuracy to be analysed simultaneously (e.g., drift-diffusion modelling; Ratcliff & McKoon, 2008).

Data transformation

Reaction time data generally has an exponentially modified Gaussian (Ex-Gaussian) or shifted log-normal distribution, which looks like a bell-shaped curve with a right skew. As most frequentist analysis methods assume a normal distribution, one can choose between transforming the data or using a model that assumes an Ex-Gaussian or shifted log-normal distribution (see HERE for illustrations and more details).

Possible transformations include:

Log-RT: In R-code logRT <- log(RT)
Inverse RT: invRT <- 1/RT. Note that this gives you a reading rate in Hertz (Hz; assuming that the RT is in seconds) rather than a duration, such that small numbers correspond to larger RTs (e.g., a 1-second RT transforms to 1 Hz, i.e., one word per second; see Gagl et al., 2022). Sometimes the inverse RT values are multiplied by -1 (i.e., -1/RT), so that, as with raw RTs, smaller values correspond to faster responses.
Z-transformations: For each participant, subtract the participant's mean RT from the trial RT and divide by the participant's standard deviation. This transformation removes between-participant variability in overall speed and spread, which is a problem when such variability is itself under investigation, but is advantageous if one wants to avoid the problems with over-additivity (Faust et al., 1999). Here, it is worth noting that after this transformation, explicitly modelling participant variance in overall speed (e.g., with by-participant random intercepts in linear mixed modelling) is no longer necessary.
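For comparison, the three transformations can be sketched side by side. The example is written in plain Python with invented RTs in seconds; the variable names are hypothetical, and the negative inverse is included only to show that it keeps the direction of the raw RTs.

```python
from math import log
from statistics import mean, stdev

# Hypothetical RTs in seconds for two participants
rts_s = {"p1": [0.5, 0.6, 0.8, 1.1], "p2": [0.9, 1.0, 1.4, 2.0]}

# Log-RT: compresses the long right tail
log_rt = {p: [log(rt) for rt in v] for p, v in rts_s.items()}

# Inverse RT: a rate in Hz, so larger values mean faster responses
inv_rt = {p: [1 / rt for rt in v] for p, v in rts_s.items()}

# Negative inverse RT: smaller values mean faster responses, as with raw RTs
neg_inv_rt = {p: [-1 / rt for rt in v] for p, v in rts_s.items()}

def z_transform(values):
    """Standardise one participant's RTs using their own mean and SD."""
    m, sd = mean(values), stdev(values)
    return [(v - m) / sd for v in values]

z_rt = {p: z_transform(v) for p, v in rts_s.items()}

for p in rts_s:
    print(p, [round(z, 2) for z in z_rt[p]])
```

After the z-transformation every participant has a mean of 0 and a standard deviation of 1, which is why differences in overall speed between participants are no longer visible in the data.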

As an alternative to data transformation, one can model the Ex-Gaussian or shifted log-normal distribution explicitly, using Bayesian methods (e.g., as implemented in the brms package in R). One could also use gamma generalized linear mixed models (GLMM), an alternative analysis method that allows for statistical inference without data transformation on non-normal distributions (see Lo & Andrews, 2015).

One can implement a multiverse analysis to examine how these analytic choices affect the results (e.g., see Heyman et al., 2025).

Statistics

Sanity check analysis

Some effects are well-established and can serve as a sanity check to ensure good data quality. We expect long items (as measured by the number of letters) to be responded to more slowly than short items (with fewer letters), and pseudowords more slowly than words; there should also be an interaction between these, with a more substantial length effect for pseudowords than words (e.g., Weekes, 1997). Alternatively, one can investigate the word frequency effect. If the main effects and interaction are not present, one should troubleshoot the following:

The data export process (e.g., is it possible that some columns in a spreadsheet were re-sorted while the others were not?)
The design of the experiment (are the reaction times recorded as the number of seconds/milliseconds between stimulus onset and response?)
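Before fitting any model, a quick descriptive look at the expected pattern can already reveal export or recording problems. The sketch below, in plain Python with invented RTs, computes the mean RT and the least-squares length slope separately for words and pseudowords. It is only a descriptive first pass under assumed names and data, not a substitute for the inferential analysis.

```python
from statistics import mean

def ols_slope(xs, ys):
    """Least-squares slope of ys on xs (here: ms per additional letter)."""
    mx, my = mean(xs), mean(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

# Hypothetical correct trials: (number of letters, lexicality, RT in ms)
trials = [
    (3, "word", 520), (5, "word", 540), (7, "word", 560),
    (3, "pseudoword", 600), (5, "pseudoword", 660), (7, "pseudoword", 720),
]

def summarise(lexicality):
    """Mean RT and length slope for one stimulus type."""
    rows = [(n, rt) for n, lex, rt in trials if lex == lexicality]
    lengths = [n for n, _ in rows]
    rts = [rt for _, rt in rows]
    return mean(rts), ols_slope(lengths, rts)

word_mean, word_slope = summarise("word")
pseudo_mean, pseudo_slope = summarise("pseudoword")

checks = {
    "pseudowords slower than words": pseudo_mean > word_mean,
    "length effect for words": word_slope > 0,
    "length effect for pseudowords": pseudo_slope > 0,
    "larger length effect for pseudowords": pseudo_slope > word_slope,
}
for name, ok in checks.items():
    print(f"{name}: {'as expected' if ok else 'check the data'}")
```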

Assessing the effect of interest

We recommend linear mixed-effects modelling at the trial level, with crossed random effects for item and participant (see the lme4 package for a frequentist and brms package for a Bayesian approach). See Baayen, Davidson & Bates (2008) or Meteyard & Davies (2020) for a general tutorial.
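Center all continuous variables (i.e., subtract the mean of each variable from each of its values, so that the average is 0). This means that the intercept of the model represents the predicted response at the average value of these predictors, and that lower-order effects in models with interactions are estimated at the average of the other predictors. We strongly advise against dichotomising continuous variables, as several problems can arise (e.g., loss of information and statistical power, unequal distribution of cases across the resulting groups).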
Contrast code dichotomous variables (e.g., see Schad et al., 2020 for a tutorial).
Fit the fixed effect specification in accordance with the hypothesis (e.g., if interested in the frequency-by-lexicality interaction: rt ~ freq * lex; R formula syntax)
Then, add covariates of no interest as fixed effects (frequency, trial order, previous trial RT, orthographic/phonological Levenshtein distance, Age of Acquisition, etc.). Note that the included variables should be motivated by theoretical considerations.
For a tutorial regarding the random effect specification, see Bates et al. (2018)
Use theoretical knowledge about the different predictors to decide whether the effect of continuous variables should be linear or not (e.g., see Kliegl et al., 2006)
emmeans analysis (e.g., see Documentation HERE) is an option for pairwise comparison of the levels of a fixed effect (or interaction) from a (g)lmer model
Central to this approach is a repeated measures design (i.e., multiple participants responding to the same items).
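The centring and contrast-coding steps listed above can be illustrated on a toy trial table. This sketch is in plain Python; the column names and the choice of a -0.5/+0.5 sum contrast are assumptions (one of several schemes covered in the contrast-coding tutorial cited above), and the model fitting itself would still be done with a mixed-model package.

```python
from statistics import mean

# Hypothetical trial table with one continuous and one two-level predictor
trials = [
    {"length": 3, "lex": "word"},
    {"length": 5, "lex": "pseudoword"},
    {"length": 6, "lex": "word"},
    {"length": 8, "lex": "pseudoword"},
]

# Centre the continuous predictor: subtract its mean so that the average is 0
mean_length = mean(t["length"] for t in trials)

# Sum contrast for the two-level factor (assumed coding: -0.5 / +0.5)
LEX_CODES = {"word": -0.5, "pseudoword": 0.5}

for t in trials:
    t["length_c"] = t["length"] - mean_length
    t["lex_c"] = LEX_CODES[t["lex"]]
    # Product term that corresponds to the length-by-lexicality interaction
    t["length_by_lex"] = t["length_c"] * t["lex_c"]

for t in trials:
    print(t)
```

With the -0.5/+0.5 coding, the coefficient for lexicality is the difference between pseudowords and words, and the effect of length is estimated averaged over both stimulus types rather than for one reference level.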

Important: The standards in the field are ever changing, so it is every researcher's responsibility to stay as up to date as possible.

Inference and reporting

Depending on your research question and preferences, you can either assess the significance of the effects of interest or quantify the size of these effects (including whether the estimate range includes the value of zero).

Frequentist significance testing: This is the most common approach.
- You can fit the model with the R package lme4 to estimate effect sizes and model fits. If needed, one can use an R package like lmerTest to provide p-values for each fixed effect and interaction in the model (find alternatives HERE).
- We recommend reporting the unstandardised model effect size estimate (slope), standard error of the estimate, and t- and p-values.
- For pre-registered a priori hypotheses, a commonly used alpha threshold for statistical significance is .05, but we note that lower cut-offs may be preferable (Benjamin et al., 2018), or that, alternatively, rather than relying on a single cut-off, researchers could justify their specific choice of alpha based on the expected outcomes of their decision (Lakens et al., 2018; Maier & Lakens, 2022). In the case of exploratory or post-hoc analyses, one should apply a correction for multiple comparisons (e.g., a familywise error correction such as the Bonferroni correction, see von der Malsburg & Angele, 2017; or a false discovery rate correction).
- Frequentist effect size estimation: To interpret the effect size and the precision of its estimate, one can report the effect size estimate and its 95% confidence interval (see Cumming, 2013). The package nlme provides 95% confidence intervals for main effects and interactions in mixed-effect models.
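As a small worked example of two quantities mentioned in this list, the sketch below computes an approximate 95% confidence interval from an estimate and its standard error, and a Bonferroni-adjusted alpha. It is written in plain Python with invented numbers. The interval is a Wald approximation based on the normal quantile 1.96, which is an assumption of this sketch; intervals reported by mixed-model packages (e.g., profile or bootstrap intervals) can differ.

```python
def wald_ci(estimate: float, se: float, z: float = 1.96):
    """Approximate 95% confidence interval: estimate plus or minus z standard errors."""
    return estimate - z * se, estimate + z * se

def bonferroni_alpha(alpha: float, n_tests: int) -> float:
    """Per-test alpha that keeps the familywise error rate at `alpha`."""
    return alpha / n_tests

# Hypothetical fixed effect: a 25 ms slope with a standard error of 8 ms
lower, upper = wald_ci(25.0, 8.0)
includes_zero = lower <= 0 <= upper
print(f"95% CI: [{lower:.2f}, {upper:.2f}] ms; includes zero: {includes_zero}")

# Four exploratory comparisons tested at a familywise alpha of .05
print(f"Per-test alpha: {bonferroni_alpha(0.05, 4)}")
```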
Bayesian statistical tests: Instead of frequentist p-values, one can rely on Bayes Factors for inference (e.g., Schmalz et al., 2023).
- In R, you can use the package BayesFactor to calculate Bayes Factors for specific effects in linear mixed-effect models (Rouder et al., 2026).
- Here, one compares a model with an effect or interaction of interest to one that excludes this particular effect or interaction. The Bayes Factor is a ratio that quantifies the extent to which the data is more compatible with the model in the numerator than the denominator.
- Large values support the model in the numerator and small values support the model in the denominator.
- Unless one has theoretical reasons for doing otherwise, we recommend using the default prior of the BayesFactor package. If a different prior is chosen, it should be clearly reported and justified.
- For more straightforward interpretability, we recommend placing the more complex model in the numerator, so that increasingly small values (< 1) correspond to evidence for the null model and increasingly large values (>1) correspond to evidence for the alternative model.
- We recommend a continuous interpretation of the Bayes Factor rather than a "trichotomisation". The Bayes Factor provides a measure of the extent to which you should update your belief in the respective direction.
- Bayesian effect size estimation: Unlike the frequentist approach, which relies on the observed data only, Bayesian effect size estimation considers prior knowledge, which can take the form of existing data. See Bürkner, 2017 for a tutorial.
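The continuous interpretation of the Bayes Factor can be made concrete with a short calculation: posterior odds are the prior odds multiplied by the Bayes Factor. The sketch below is in plain Python; the function name and the equal prior probability of 0.5 are assumptions for illustration, with the more complex model in the numerator as recommended above.

```python
def posterior_probability(bf10: float, prior_prob_h1: float = 0.5) -> float:
    """Posterior probability of the model in the numerator, given a Bayes Factor."""
    prior_odds = prior_prob_h1 / (1 - prior_prob_h1)
    posterior_odds = bf10 * prior_odds  # belief updating: odds are scaled by the BF
    return posterior_odds / (1 + posterior_odds)

# Hypothetical Bayes Factors: values above 1 favour the alternative, below 1 the null
for bf in (0.1, 1 / 3, 1.0, 3.0, 10.0):
    print(f"BF10 = {bf:5.2f} -> P(alternative | data) = {posterior_probability(bf):.2f}")
```

The output changes smoothly with the Bayes Factor, which illustrates why treating fixed cut-offs as category boundaries discards information.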

Additional possibilities for analysis

Monte-Carlo simulated experiments for a priori or post-hoc measurements: Existing large-scale datasets (i.e., lexicon projects like the British Lexicon Project) can be subsampled to run virtual experiments with specific sample sizes and stimulus materials in order to investigate effects (see Kuperman, 2015; Perry, 2024 for examples).
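The subsampling idea can be sketched with synthetic data standing in for a lexicon project. Everything in this plain-Python example is invented for illustration (the 30 ms frequency effect, the 60 ms item standard deviation, the seed, and the function names), and it resamples items only; a fuller version would also resample participants.

```python
import random
from statistics import mean, stdev

rng = random.Random(2024)

# Synthetic stand-in for item-level mean RTs (ms) from a megastudy
high_freq_items = [rng.gauss(570, 60) for _ in range(500)]
low_freq_items = [rng.gauss(600, 60) for _ in range(500)]

def virtual_experiment(n_items: int) -> float:
    """Draw n_items per condition without replacement and return the effect in ms."""
    high = rng.sample(high_freq_items, n_items)
    low = rng.sample(low_freq_items, n_items)
    return mean(low) - mean(high)

def simulate(n_items: int, n_sims: int = 1000):
    """Mean and spread of the effect across many virtual experiments."""
    effects = [virtual_experiment(n_items) for _ in range(n_sims)]
    return mean(effects), stdev(effects)

results = {n: simulate(n) for n in (10, 20, 40, 80)}
for n, (m, sd) in results.items():
    print(f"{n:>2} items per condition: mean effect {m:5.1f} ms, SD across experiments {sd:4.1f} ms")
```

The spread of the effect across virtual experiments shrinks as the number of items grows, which is the kind of information one can use to choose a stimulus set size before running a new study.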

Going beyond measuring an effect or phenomenon typically requires a different set of inference methods. Commonly, one would like either to predict future events based on current data (i.e., the prediction approach) or to explain the causal structure of a phenomenon/effect in order to understand reading better (see Hofman et al. 2021 for a perspective).

In explanation-focused (neuro-)cognitive models, one would use computational model simulations a priori to specify hypotheses. Comparing how well different models fit the behavioral data can lead to stronger inferences about computational models (e.g., Perry et al., 2007 or Gagl et al., 2025; see Norris, 2013 for a review).
- Another interesting case here is the drift diffusion model (Ratcliff et al., 2004), which allows reaction times and accuracy to be modelled at the same time. In addition, the model's parameter estimates can be interpreted as effects on assumed processes, such as the drift rate (i.e., reflecting evidence accumulation before a decision is implemented). Note that the model can cope with Ex-Gaussian response time distributions.
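To give an intuition for how a single mechanism can produce both response times and accuracy, the following is a minimal random-walk simulation of a diffusion process between two boundaries. It is a simplified sketch in plain Python, not the fitting procedure used in the cited work; all parameter values and names are assumptions chosen for illustration.

```python
import random

def simulate_trial(rng, drift, boundary=1.0, start=0.5, non_decision=0.3,
                   dt=0.001, noise=1.0, max_t=5.0):
    """Simulate one trial; return (correct, RT in seconds)."""
    x = start * boundary          # starting point between the two boundaries
    t = 0.0
    step_sd = noise * dt ** 0.5   # SD of the noise added in each time step
    while 0 < x < boundary and t < max_t:
        x += drift * dt + rng.gauss(0, step_sd)
        t += dt
    correct = x >= boundary       # upper boundary = correct response
    return correct, non_decision + t

def simulate_condition(drift, n_trials=500, seed=1):
    """Accuracy and mean RT for one drift rate."""
    rng = random.Random(seed)
    trials = [simulate_trial(rng, drift) for _ in range(n_trials)]
    accuracy = sum(c for c, _ in trials) / n_trials
    mean_rt = sum(rt for _, rt in trials) / n_trials
    return accuracy, mean_rt, trials

# Hypothetical easy (high drift) and difficult (low drift) conditions
easy_acc, easy_rt, easy_trials = simulate_condition(drift=2.0)
hard_acc, hard_rt, hard_trials = simulate_condition(drift=0.5)
print(f"easy: accuracy {easy_acc:.2f}, mean RT {easy_rt:.3f} s")
print(f"hard: accuracy {hard_acc:.2f}, mean RT {hard_rt:.3f} s")
```

Changing only the drift rate changes accuracy and the response time distribution together, which is why the model can be fitted to both measures at once.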
Prediction-focused learning models (e.g., machine learning methods) can be used for many things:
- Explanations based on architectural constraints (e.g., compare Linke et al., 2017 vs. Hannagan et al., 2014), through the implementation of different training regimes (Hannagan et al., 2022), or through the investigation of neuro-cognitive processes related to reading (Rajalingham et al., 2020; but note that model comparisons can be applied to draw inferences across prediction- and explanation-focused models, see Pauli et al., 2025).
- Investigating memory structure (e.g., Trautwein et al., 2018 or Gatti et al., 2023)
- Diagnostics (e.g., Schmidtke & Moro, 2020; Gagl & Gregorova, 2024; but see Ziegler et al., 2020 for an approach using an explanation-focused model)

Important: Share your data so these analyses can be conducted.