Today I give my favourite lecture of the year. Second year students today will be introduced to some of the controversies surrounding the use of p-values. Although I don’t go into great depth, I highlight that p-values—among the many problems associated with them—don’t provide answers to the type of questions we—as researchers—are asking.
You might think this is heresy, but if you take a few minutes to mull this over, you will realise it is right. Remember, the p-value gives us the probability of observing data as extreme as (or more extreme than) ours, assuming the null hypothesis is true. Now think of the types of questions you ask as a (budding or established) researcher. Here are some off the top of my head: “How many items can be stored in visual short-term memory?”; “Does bilingualism improve executive functioning?”; “Does brain training increase working memory capacity?”. Researchers’ questions lend themselves to a “how much?” approach, rather than a “how small is our p-value?” approach.
In the lecture, I recommend the reporting of effect sizes as a solid solution to the limited information the p-value can give us. An effect size is an objective measure of the magnitude of an observed effect, and helps us answer the “how much” type of question. Now, you are all familiar with many types of effect sizes, whether you have classified them as such or not. Here are some, for example:
These are all examples of unstandardised effect sizes, and they indicate the extent to which scores are different. Note, the p-value only tells us if this difference is “significant”, and is thus limited.
However, unstandardised effect sizes cannot easily be compared if the measured units differ. For example, how can you compare the effect of alcohol on mean response time with the effect of alcohol on the percentage of road deaths every year? If the units are chalk and cheese, the comparison can’t be made.
Standardised effect sizes can be compared, though. Put simply, standardised effect sizes are mathematically translated into the same unit of measurement. One very popular effect size for the comparison of two groups is Cohen’s d, which is calculated as the difference between group 1’s mean and group 2’s mean, divided by the pooled standard deviation.
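As a minimal sketch of the verbal definition above, here is Cohen’s d in Python. The function name and the response-time scores are made up for illustration, and the pooled standard deviation is assumed to be the usual version weighted by each group’s degrees of freedom.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(group1, group2):
    """Difference between two group means divided by the pooled SD."""
    n1, n2 = len(group1), len(group2)
    # Pool the two sample variances, weighting each by its degrees of freedom
    pooled_var = ((n1 - 1) * variance(group1) + (n2 - 1) * variance(group2)) / (n1 + n2 - 2)
    return (mean(group1) - mean(group2)) / sqrt(pooled_var)


# Hypothetical response times (ms); not data from any real study
alcohol = [560, 540, 575, 555, 570]
sober = [510, 495, 530, 520, 505]
print(f"d = {cohens_d(alcohol, sober):.2f}")
```

Because d is expressed in standard deviation units, it can be set alongside a d from a study that measured something else entirely.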
Research using factorial ANOVA designs (the main topic of the module I run and am lecturing on today) tends to report partial eta squared. Although not really recommended by the APA big-wigs, it is still the most common effect size in APA journals! Why? It’s easy to understand, it’s easy to generate in SPSS, and it’s easy to interpret. Let’s go through it.
Partial Eta Squared
Before we talk about PES (partial eta squared), let’s quickly recap what an ANOVA is doing. ANOVA stands for “analysis of variance”, and is typically used when you are comparing more than two means. It compares the variance due to your treatment (e.g. your manipulation) to the variance due to experimental error. The resulting statistic, F, is basically “treatment effects divided by experimental error” (a crass simplification, but it serves our purpose here), and so larger values of F reflect a larger amount of the variance in your data being explained by your treatment rather than by experimental error.
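To make the “treatment divided by error” idea concrete, here is a rough sketch of the F ratio for the simplest possible case: a one-way, between-subjects design. That design is an assumption made to keep the example short (it is not the repeated measures design discussed later), and the function name and scores are hypothetical.

```python
from statistics import mean


def one_way_f(groups):
    """F ratio for a one-way between-subjects ANOVA."""
    all_scores = [x for g in groups for x in g]
    grand_mean = mean(all_scores)
    # Treatment variability: how far each group mean sits from the grand mean
    ss_treatment = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Error variability: how far each score sits from its own group mean
    ss_error = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_treatment = len(groups) - 1
    df_error = len(all_scores) - len(groups)
    # F = mean square for treatment / mean square for error
    return (ss_treatment / df_treatment) / (ss_error / df_error)


# Hypothetical scores for three groups
groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
f_ratio = one_way_f(groups)
print(f"F = {f_ratio:.2f}")
```

The further apart the group means are relative to the spread within each group, the larger F becomes.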
Now, variability in your data can be assessed using the good old “sums of squares” (sums of squared deviations), hereafter SS. PES describes the proportion of variability associated with a particular effect (treatment) when the variance associated with all other effects has been controlled for. That is, PES homes in on the proportion of variance explained by your treatment effect independent of all other sources of variance in your data. The more variance that can be explained by your treatment effect, the more of an effect it is having on behaviour (or whatever it is that you are measuring); so, large PES means large effect sizes!
The equation for partial eta squared is given as:
PES = SS-effect / (SS-effect + SS-error)
This might look scary at first glance, but you can get all of this information from a typical SPSS output (in fact, SPSS can even calculate PES for you if you ask it nicely enough). SS-effect is the sum of squares from your effect of interest; this could be a main effect or an interaction. SS-error is the SS for the error term associated with this main effect or interaction. Let’s calculate some PESs from typical SPSS output:
We need to look in the “Type III Sum of Squares” column to get our SS. This is data from a fully repeated measures (i.e. within-subjects) experimental design. Not only does it say so above the output, but you could also have worked this out, as there is a separate error term for each main effect and interaction.
To work out PES for factor A, we recall that PES = SS-Effect / (SS-Effect + SS-Error), giving us PES = 648.10 / (648.10 + 20204.55) = 0.031. It worked! For factor B, PES = 2476.06 / (2476.06 + 13112.79) = 0.159, and for the interaction, PES = 516.05 / (516.05 + 14250.35) = 0.035. Very straightforward!
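If you would rather not do this by hand, the same arithmetic can be sketched in a few lines of Python. The function and dictionary names are my own; the SS values are the ones quoted above.

```python
def partial_eta_squared(ss_effect, ss_error):
    """Proportion of (effect + error) variability attributable to the effect."""
    return ss_effect / (ss_effect + ss_error)


# (SS-effect, SS-error) pairs quoted in the text above
effects = {
    "Factor A": (648.10, 20204.55),
    "Factor B": (2476.06, 13112.79),
    "A x B": (516.05, 14250.35),
}

for name, (ss_effect, ss_error) in effects.items():
    print(f"{name}: PES = {partial_eta_squared(ss_effect, ss_error):.3f}")
```

Note the brackets around the denominator: without them, the division would be carried out before the addition and give a nonsensical answer.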
Interpreting Partial Eta Squared
So, it’s straightforward to calculate, but how do we interpret PES? It’s the proportion of variance explained by an effect: a partial eta squared of 0.88 means 88% of the variability in your data can be explained by your treatment effect, when all other effects identified in the analysis have been removed from consideration. Large PES means large effect size.
The following are guidelines for interpreting the magnitude of PES:
- Small: >0.01
- Medium: >0.06
- Large: >0.14
So, factor B in the analysis above (PES = 0.159) had a large effect size, whereas factor A (PES = 0.031) and the interaction (PES = 0.035) had small effect sizes.
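The guidelines can be written as a small lookup, sketched below. The function name and the label for values at or below 0.01 are my own additions; the cut-offs are the ones listed above, and they are rules of thumb rather than hard boundaries.

```python
def pes_label(pes):
    """Map a partial eta squared value onto the guideline labels."""
    if pes > 0.14:
        return "large"
    if pes > 0.06:
        return "medium"
    if pes > 0.01:
        return "small"
    return "below the 'small' guideline"  # label assumed for illustration


for name, pes in [("Factor A", 0.031), ("Factor B", 0.159), ("A x B", 0.035)]:
    print(f"{name}: {pes} is {pes_label(pes)}")
```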
Advantages of Effect Sizes over p-values
One of the main advantages of reporting effect sizes together with (instead of? Gasp!) p-values is that they provide an answer to the “how much” question. Also, you will note that nowhere in the equations for effect sizes will you find a term that represents sample size. Effect sizes are calculated independently of sample size. Why is this important? P-values can become artificially low with large samples, even when the difference you are examining is pitifully small. For example, here is some SPSS output for 33 subjects:
Notice anything? That’s right! The p-value is now indistinguishable from zero (i.e. highly significant), but the effect size has remained the same! Thus, the ES is giving a more robust (i.e. independent of sample size) measure of the magnitude—and hence, importance—of an effect. The size of the effect of Factor A is very small, no matter how many subjects you throw at it.
Thus, you can have highly significant results that actually have low effect sizes, and you can have non-significant results with large effect sizes!
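Here is a rough sketch of why this happens. Because F is the effect mean square divided by the error mean square, it can be rewritten as F = (PES / (1 - PES)) × (df-error / df-effect). The code assumes a repeated measures main effect with the standard degrees of freedom, df-effect = levels - 1 and df-error = (levels - 1) × (subjects - 1); the number of levels and the sample sizes are hypothetical.

```python
def f_from_pes(pes, df_effect, df_error):
    """Recover F from partial eta squared and the two degrees of freedom."""
    return (pes / (1 - pes)) * (df_error / df_effect)


pes = 0.031   # the PES for factor A, held fixed
levels = 2    # hypothetical number of levels for the factor

for n_subjects in (33, 330, 3300):
    df_effect = levels - 1
    df_error = (levels - 1) * (n_subjects - 1)
    f_ratio = f_from_pes(pes, df_effect, df_error)
    print(f"n = {n_subjects}: PES = {pes}, F = {f_ratio:.2f}")
```

With PES held constant, F grows in step with the number of subjects, and a larger F on more error degrees of freedom means a smaller p-value. Getting the p-value itself needs the F distribution (for example `scipy.stats.f.sf`, if you have SciPy installed), which is left out here to keep the sketch dependency-free.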
There are some notes of caution with PES: care must be exercised when comparing PES across different studies with different designs, as they will have different error terms (see the denominator of the PES equation). This is partly the reason why PES isn’t the BEST effect size to use. However, due to its simplicity, it’s a good one to use to introduce the concepts.
So, the next time you report your results, remember to include an estimate of the effect sizes. You will be providing an answer to the question your readers likely have: “how much?”