Effect Sizes, FTW!

Today I give my favourite lecture of the year. Second year students today will be introduced to some of the controversies surrounding the use of p-values. Although I don’t go into great depth, I highlight that p-values—among the many problems associated with them—don’t provide answers to the type of questions we—as researchers—are asking.

You might think this is heresy, but if you take a few minutes to mull this over, you will realise it is right. Remember, the p-value gives us the probability of observing scores as extreme (or more so) as we have, assuming the null is true. Now think of the types of questions you ask as a (budding or established) researcher. Here are some off the top of my head: “How many items can be stored in visual short-term memory?”; “Does bilingualism improve executive functioning?”; “Does brain training increase working memory capacity?”.  Researchers’ questions lend themselves to a “how much?” approach, rather than a “how small is our p-value” approach.

In the lecture, I recommend the reporting of effect sizes as a solid solution to the limited information the p-value can give us. An effect size is an objective measure of the magnitude of an observed effect, and helps us answer the “how much” type of question. Now, you are all familiar with many types of effect sizes, whether you have classified them as such or not. Here are some, for example:


These are all examples of unstandardised effect sizes, and they indicate the extent to which scores are different. Note, the p-value only tells us if this difference is “significant”, and is thus limited.

However, unstandardised effect sizes cannot easily be compared if the measured units differ. For example, how can you compare the effect of alcohol on mean response time with the effect of alcohol on the percentage of road deaths every year? If the units are chalk and cheese, the comparison can’t be made.

Standardised effect sizes can be compared, though. Put simply, standardised effect sizes are mathematically translated into the same unit of measurement. One very popular effect size for the comparison of two groups is Cohen’s d, which is calculated as the difference between group 1’s mean and group 2’s mean, divided by the pooled standard deviation.
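
If you like to see things in code, here is a minimal Python sketch of Cohen’s d as described above. The function name cohens_d and the response times are made up purely for illustration, and the pooled standard deviation assumes the usual sample-variance (n - 1) weighting.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(group1, group2):
    """Difference between the group means divided by the pooled standard deviation."""
    n1, n2 = len(group1), len(group2)
    # variance() is the sample variance (n - 1 in the denominator)
    pooled_var = ((n1 - 1) * variance(group1) + (n2 - 1) * variance(group2)) / (n1 + n2 - 2)
    return (mean(group1) - mean(group2)) / sqrt(pooled_var)


# Hypothetical response times in ms, invented for this example
alcohol = [520, 540, 510, 560, 530]
sober = [480, 500, 470, 510, 490]

d = cohens_d(alcohol, sober)
print(round(d, 2))                         # alcohol vs. sober
print(round(cohens_d(sober, alcohol), 2))  # same size, opposite sign
```

Note that swapping the order of the groups only flips the sign of d; the magnitude stays the same.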

Research using factorial ANOVA designs (the main topic of the module I run and am lecturing on today) tends to report partial eta squared. Although not really recommended by the APA big-wigs, it is still the most common effect size in APA journals! Why? It’s easy to understand, it’s easy to generate in SPSS, and it’s easy to interpret. Let’s go through it.

Partial Eta Squared

Before we talk about PES (partial eta squared), let’s quickly recap what an ANOVA is doing. ANOVA stands for “analysis of variance”, and is used when you are comparing more than 2 means. It compares the variance due to your treatment (e.g. your manipulation) to the variance due to experimental error. The resulting statistic, F, is basically “treatment effects divided by experimental error” (a crass simplification, but it serves our purpose here), and so larger values of F reflect more of the variance in your data being explained by your treatment rather than by experimental error.
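
To make the “treatment divided by error” idea concrete, here is a rough Python sketch for the simplest possible case: a one-way between-subjects ANOVA. This is an assumption for illustration only (the SPSS example later in this post is a repeated measures design, which has separate error terms), and the function name and scores are invented.

```python
from statistics import mean


def one_way_anova(groups):
    """Return (ss_effect, ss_error, f) for a one-way between-subjects ANOVA."""
    all_scores = [x for g in groups for x in g]
    grand_mean = mean(all_scores)
    # Treatment: how far each group mean sits from the grand mean
    ss_effect = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Error: how far each score sits from its own group mean
    ss_error = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_effect = len(groups) - 1
    df_error = len(all_scores) - len(groups)
    # F is the ratio of the two mean squares (SS divided by its df)
    f = (ss_effect / df_effect) / (ss_error / df_error)
    return ss_effect, ss_error, f


# Hypothetical scores for three groups, invented for this example
groups = [[4, 5, 6], [6, 7, 8], [9, 10, 11]]
ss_effect, ss_error, f = one_way_anova(groups)
print(ss_effect, ss_error, f)
```

The group means here are far apart relative to the spread within each group, which is why F comes out large.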

Now, variability in your data can be assessed using the good old “sums of squares” (sum of squared deviations), hereafter SS. PES describes the proportion of variability associated with a particular effect (treatment) when the variance associated with all other effects has been controlled for. That is, PES homes in on the proportion of variance explained by your treatment effect independent of all other sources of variance in your data. The more variance that can be explained by your treatment effect, the more of an effect it is having on behaviour (or whatever it is that you are measuring); so, large PES means large effect sizes!

The equation for partial eta squared is given as:

PES = SS-effect / (SS-effect + SS-error)

This might look scary at first glance, but you can get all of this information from a typical SPSS output (in fact, SPSS can even calculate PES for you if you ask it nicely enough). SS-effect is the sum of squares from your effect of interest; this could be a main effect or an interaction. SS-error is the SS for the error term associated with this main effect or interaction. Let’s calculate some PESs from typical SPSS output:

[Image: SPSS output table]

We need to look in the “Type III Sum of Squares” column to get our SS. These data are from a fully repeated measures (i.e. within-subjects) experimental design. Not only does SPSS tell you this above the output, you could also have worked it out because there is a separate error term for each main effect and interaction.

To work out PES for Factor A, we recall that PES = SS-effect / (SS-effect + SS-error), giving us PES = 648.10 / (648.10 + 20204.55) = 0.031. It worked! For Factor B, PES = 2476.06 / (2476.06 + 13112.79) = 0.159, and for the interaction, PES = 516.05 / (516.05 + 14250.35) = 0.035. Very straightforward!
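
The same arithmetic can be checked in a few lines of Python. This is just a sketch: the function name partial_eta_squared and the dictionary layout are my own invention, while the SS values are the ones quoted above.

```python
def partial_eta_squared(ss_effect, ss_error):
    """PES = SS-effect / (SS-effect + SS-error)."""
    return ss_effect / (ss_effect + ss_error)


# (SS-effect, SS-error) pairs taken from the worked example above
ss_table = {
    "Factor A": (648.10, 20204.55),
    "Factor B": (2476.06, 13112.79),
    "A x B": (516.05, 14250.35),
}

pes = {name: round(partial_eta_squared(*ss), 3) for name, ss in ss_table.items()}
for name, value in pes.items():
    print(name, value)  # 0.031, 0.159 and 0.035 respectively
```

Note the parentheses in the denominator: each effect is divided by the sum of its own SS and its own error SS.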

Interpreting Partial Eta Squared

So, it’s straightforward to calculate, but how do we interpret PES? It’s the proportion of variance explained by an effect: a partial eta squared of 0.88 means 88% of the variability in your data can be explained by your treatment effect, when all other effects identified in the analysis have been removed from consideration. Large PES means large effect size.

The following are guidelines for interpreting the magnitude of PES:

  • Small:  >0.01
  • Medium: >0.06
  • Large: >0.14

So, Factor B in the analysis above had a large effect size, whereas Factor A and the interaction had small effect sizes.
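
If you want to apply these guidelines automatically, something like the following Python sketch would do it. The function name label_pes is hypothetical, and the “negligible” label for values at or below 0.01 is my own assumption, since the guidelines above don’t name that range.

```python
def label_pes(pes):
    """Label a partial eta squared value using the guideline cut-offs above."""
    if pes > 0.14:
        return "large"
    if pes > 0.06:
        return "medium"
    if pes > 0.01:
        return "small"
    return "negligible"  # assumed label; not part of the guidelines


# PES values from the worked example
for name, value in [("Factor A", 0.031), ("Factor B", 0.159), ("A x B", 0.035)]:
    print(name, label_pes(value))
```

Treat the labels as rough rules of thumb rather than hard boundaries.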

Advantages of Effect Sizes over p-values

One of the main advantages of reporting effect sizes together with (instead of? Gasp!) p-values is that they provide an answer to the “how much” question. Also, you will note that nowhere in the equations for effect sizes will you find a term that represents sample size. Effect sizes are calculated independently of sample size. Why is this important? P-values can become artificially low with large samples, even when the difference you are examining is pitifully small. For example, here is some SPSS output for 33 subjects:

[Image: SPSS output for 33 subjects]

Now, if I copy and paste the data files multiple times so that I have an artificial sample size of 1,056 subjects, I get the following output:

[Image: SPSS output for 1,056 subjects]

Notice anything? That’s right! The p-value is now indistinguishable from zero (i.e. highly significant), but the effect size has remained the same! Thus, the ES is giving a more robust (i.e. independent of sample size) measure of the magnitude, and hence importance, of an effect. The size of the effect of Factor A is very small, no matter how many subjects you throw at it.

Thus, you can have highly significant results that actually have low effect sizes, and you can have non-significant results with large effect sizes!
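
You can reproduce the copy-and-paste trick yourself. The Python sketch below is not the analysis shown in the SPSS output: it assumes a simple one-way between-subjects design (where eta squared and partial eta squared are the same thing, because there is only one effect), uses invented scores, and reports F rather than a p-value to stay within the standard library. A larger F with more error degrees of freedom corresponds to a smaller p-value.

```python
from statistics import mean


def anova_summary(groups):
    """Return (eta_squared, f) for a one-way between-subjects ANOVA."""
    all_scores = [x for g in groups for x in g]
    grand_mean = mean(all_scores)
    ss_effect = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_error = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_effect = len(groups) - 1
    df_error = len(all_scores) - len(groups)
    f = (ss_effect / df_effect) / (ss_error / df_error)
    eta_squared = ss_effect / (ss_effect + ss_error)
    return eta_squared, f


# Hypothetical scores for two groups with a small difference between them
original = [[10, 14, 9, 13, 11], [11, 15, 10, 13, 12]]
# Paste every score 32 times, mimicking the jump from 33 to 1,056 subjects
inflated = [g * 32 for g in original]

eta_small, f_small = anova_summary(original)
eta_big, f_big = anova_summary(inflated)
print(round(eta_small, 3), round(f_small, 2))
print(round(eta_big, 3), round(f_big, 2))
```

Both sums of squares grow by the same factor when the data are duplicated, so their ratio (the effect size) stays put, while F balloons because the error degrees of freedom keep growing.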

There are some notes of caution with PES: care must be exercised when comparing PES across different studies with different designs, as they will have different error terms (see the denominator of the PES equation). This is partly the reason why PES isn’t the BEST effect size to use. However, due to its simplicity, it’s a good one to use to introduce the concepts.

So, remember next time you are reporting your results to include an estimate of the effect sizes. You will be providing an answer to the questions your readers likely have: “how much”?


10 thoughts on “Effect Sizes, FTW!”

  1. kay says:

    this was exceptionally helpful but it is still unclear what is meant by positive/negative direction in terms of effect sizes: i am using two independent variables and running both pairwise analysis for each and two-way anova for the interaction

  2. Jim Grange says:

    Hi Kay,

    Thanks for your positive comment. Which effect size are you estimating?

    • kay says:

      no problem! thanks for the post.
      for example, in a pairwise comparison i have a large positive and a large negative PES for african–>american and american–>african cultures respectively but i am unsure how to interpret this.
      is it fair to say:
      the direction of the effect indicates that the african group average was significantly larger than the american group average
      can i be more specific than this, say the PES is 0.7

  3. Jim Grange says:

    I guess you’re using Cohen’s d? This effect size can be negative; it’s all relative to the sign of the difference between groups. For example, if Group A score higher than Group B, and I do the comparison of A vs. B, the effect size will be positive; if, however, I do the comparison B vs. A, the effect size will be negative (because B scored lower than A).

    Does that get at the issue at hand?

  4. Jim Grange says:

    I should have noted that you said you were using PES, apologies. This should never come out as negative in SPSS (if that’s what you’re using). Take PES as the size of the effect of some difference; it gives the absolute effect size and doesn’t care whether you are comparing A vs. B or B vs. A (for example).

  5. kay says:

    apologies, you are quite right!
    it is not that PES is negative (it is, in fact, 0.7) but that in my pairwise comparison i have a negative mean difference from american–>african and a positive mean difference from african–>american.
    is it thus fair to say:
    the direction of the effect indicates that the african group average was significantly larger than the american group average

    my instructions note: if the pairwise comparison gives a positive result, e.g. 12%, then the direction of the effect is that african group is significantly bigger than the american group.

    but what is meant by “bigger?”

    thanks so much, you are wonderful

  6. Jim Grange says:

    I’m not sure how to say this without sounding sarcastic! But here goes: “Bigger” means that your dependent variable (the thing you measured) was larger in the African group.

    Remember, though, that effect sizes do not allow you to conclude “significance”; that is, although there is a difference between the African and American group in terms of the mean difference (i.e. numerical), and the effect size of this difference is 0.7, this doesn’t speak to whether the difference is statistically significant. For this you need some form of inferential test; in cases where you are testing the difference between two groups, this is often the t-test (related or unrelated, depending on whether your independent variable was within-subjects or between-subjects, respectively).

    • Amy A says:

      great thank you! would you recommend a t test for independent samples to check the significance of this effect size?
      (i already know, based on 2 way anova with another IV and pairwise comparison that the result itself is very significant p<0.01)

  7. Jim Grange says:

    If you’ve already done pairwise comparisons then you don’t need to do the t-tests.

  8. Paul Power says:

    Hi, great blog. However, it has left me a bit confused with respect to the guidelines for interpreting the effect size: are you referring to Cohen’s d with these guidelines?

    I am at the moment processing a two way repeated measures design in SPSS, the two factors one having two levels the other seven. Using the partial eta squared for the factor with seven levels this is showing a P ETA of .9, i take it then this is a large effect size? and all things considered i can be confident in my sig values?

    Thanks Mr P Power
