As I mentioned in an earlier post, p-values in psychological research are often misunderstood. Ask students (and academics!) what the definition of the p-value is and you will likely get many different responses. To jog your memory, the definition of the p-value is the **probability of observing a test statistic at least as extreme as the one you have observed, assuming the null is true.** But, even with this definition in hand, many struggle to conceptualise what the p-value reflects. In this blog post, I take inspiration from a lovely paper I have recently read that advocates using computer simulation to understand the p-value.

Before we delve into the simulations, answer the following question. Assuming you know for sure that there is no difference between two groups that you have tested (i.e., you know for sure that the null is true, which of course you never know, but bear with me), which p-value do you think is **more** likely: p=.009, or p=.9?

If you are like most, you will likely believe that the p-value of .9 is much more likely; how could .009 be more likely, as this represents a significant finding, but I told you the null was true!?! Cue confusion.

But, the p-value behaves much differently under the null (i.e. when the null is true) than many people appreciate. To demonstrate this, I have conducted some computer simulations. Computer simulations are advantageous here because we can generate synthetic data from a known distribution; that is, we tell the computer what the mean and standard deviation of each experimental “group” is. This has the advantage over real life, because we never know the “true” mean of a population. This gives us superb control.

By getting the computer to simulate thousands of “experiments” where the participants are drawn from the same population (i.e. they behave very similarly, and hence produce null differences between groups), we can observe the behaviour of the p-value under the null.

**SIMULATIONS ASSUMING THE NULL IS TRUE**

In this first simulation, we will examine the behaviour of the p-value when the difference between two groups is null. I simulated 100,000 experiments, with 100 participants in each group. The mean of each group was fixed at 100, and the standard deviation was fixed at 20. For each participant, the computer set as their score a randomly selected number from a normal distribution with a mean of 100 and SD of 20. At the end of each experiment, the computer performs a t-test, and records the observed p-value. At the end, a histogram is plotted which shows the frequency of each p-value across the entire simulation.

Recall that if you thought a p-value of .9 is more likely than a p-value of .009, you would expect a higher frequency of p-values that are close to 1. Is this what we find? The code for the simulation (conducted in R) is below.

    nSims <- 100000       # number of simulated experiments
    p <- numeric(nSims)   # set up empty container for all simulated p-values

    for (i in 1:nSims) {  # for each simulated experiment
      # produce 100 simulated participants with mean=100 and SD=20
      x <- rnorm(n = 100, mean = 100, sd = 20)
      # produce 100 simulated participants with mean=100 and SD=20
      y <- rnorm(n = 100, mean = 100, sd = 20)
      z <- t.test(x, y)   # perform the t-test
      p[i] <- z$p.value   # get the p-value and store it
    }

    # now plot the histogram
    hist(p, main = "Histogram of p-values under the null",
         xlab = "Observed p-value")

This produces the following histogram:

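Contrary to what many expect, the p-value follows a uniform distribution. That is, and this is really the take-home message, **when the null is true, the p-value is a random variable uniformly distributed between zero and one; in everyday language, when the null is true, the p-value is equally likely to fall anywhere in that range.** Thus, a p-value close to .009 is just as likely as a p-value close to .9 (more precisely, any two intervals of equal width are equally likely to contain the p-value). This is also why a significance threshold of .05 yields a "significant" result in 5% of experiments when the null is true. Are you surprised? (I should add here that this only holds true for continuous dependent variables.)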
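One way to check the uniformity claim numerically, rather than by eye, is to count how often the simulated p-values land in intervals of equal width. The sketch below reruns a smaller version of the null simulation (10,000 experiments rather than 100,000, purely to keep it quick). The seed, the names `n_sims` and `p_null`, and the choice of intervals are my own illustrative assumptions.

```r
set.seed(123)    # arbitrary seed, only so the run is reproducible
n_sims <- 10000  # fewer experiments than the main simulation, for speed

# Each experiment: two groups of 100 drawn from the same population
p_null <- replicate(n_sims, {
  x <- rnorm(n = 100, mean = 100, sd = 20)
  y <- rnorm(n = 100, mean = 100, sd = 20)
  t.test(x, y)$p.value
})

# Proportion of p-values in two intervals of equal width (.02)
near_009 <- mean(p_null < 0.02)                  # interval containing .009
near_900 <- mean(p_null > 0.89 & p_null < 0.91)  # interval containing .9

# Proportion of "significant" results at the conventional threshold
false_pos <- mean(p_null < 0.05)

print(c(near_009 = near_009, near_900 = near_900, false_pos = false_pos))
```

If the p-value really is uniform under the null, the first two proportions should both be close to .02, and the last should be close to .05.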

**SIMULATIONS ASSUMING THE NULL IS NOT TRUE**

How does the behaviour of the p-value change when there IS a difference between groups? Make a prediction to yourself for the upcoming simulations before moving on.

A short think should lead you to expect that when there is a difference, observations of p-values at the lower end should increase in frequency. Does this hold true? For the following simulations, I increased the mean of the distribution for one group from 100 to 103 (a tiny increase!). Here is the amended code, and the resulting histogram.

    nSims <- 100000       # number of simulated experiments
    p <- numeric(nSims)   # set up empty container for all simulated p-values

    for (i in 1:nSims) {  # for each simulated experiment
      # produce 100 simulated participants with mean=103 and SD=20
      x <- rnorm(n = 100, mean = 103, sd = 20)
      # produce 100 simulated participants with mean=100 and SD=20
      y <- rnorm(n = 100, mean = 100, sd = 20)
      z <- t.test(x, y)   # perform the t-test
      p[i] <- z$p.value   # get the p-value and store it
    }

    # now plot the histogram
    hist(p, main = "Histogram of p-values (true group difference)",
         xlab = "Observed p-value")

We can see that, as expected, there is a greater occurrence of low p-values. Although I don’t want to go into the concept of power, we can see that low p-values become even more frequent when we boost the sample size of each simulated experiment to 500. This is because each experiment, due to its larger sample size, has a greater probability of detecting the true difference inherent in the data. Here is the new histogram with N=500 in each experiment:
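The effect of sample size can also be summarised as a single number: the proportion of simulated experiments with p < .05. The sketch below is a rough check, not part of the original simulations. It assumes that 500 means 500 participants per group, uses 5,000 experiments per condition for speed, and `sim_prop_sig` is a hypothetical helper name. For comparison it also calls `power.t.test()` from R's built-in stats package.

```r
set.seed(2024)  # arbitrary seed for reproducibility

# Hypothetical helper: proportion of experiments with p < alpha when the
# group means are 103 and 100 (SD = 20), with n participants per group
sim_prop_sig <- function(n, n_sims = 5000, alpha = 0.05) {
  p <- replicate(n_sims, {
    x <- rnorm(n = n, mean = 103, sd = 20)
    y <- rnorm(n = n, mean = 100, sd = 20)
    t.test(x, y)$p.value
  })
  mean(p < alpha)
}

sim_100 <- sim_prop_sig(n = 100)
sim_500 <- sim_prop_sig(n = 500)  # assumption: 500 per group

# Analytic power for the same design (two-sample, two-sided t-test)
calc_100 <- power.t.test(n = 100, delta = 3, sd = 20, sig.level = 0.05)$power
calc_500 <- power.t.test(n = 500, delta = 3, sd = 20, sig.level = 0.05)$power

print(round(c(sim_100 = sim_100, calc_100 = calc_100,
              sim_500 = sim_500, calc_500 = calc_500), 3))
```

The simulated and analytic values should roughly agree, and both should be clearly higher for the larger sample.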

**CONCLUSION**

Although this is a short post, I hope it serves to show once again that the p-value is misunderstood. In particular, I have focussed on how the p-value behaves when there is no true difference between groups: **any p-value is as likely as any other.** This is very surprising.

What I have learned from this is to think twice when I find significant effects; it (once again) stresses the HUGE importance of replicating experiments when you find significant results. Who knows, perhaps your experiment’s p<.001 was just a random draw from the uniform distribution of p-values under a null effect. Watch out!!
