# P-value Calculator

Use this **statistical significance calculator** to easily calculate the **p-value** and determine whether the difference between two proportions or means (independent groups) is statistically significant. It will also output the **Z-score or T-score** for the difference. Inferences about both absolute and relative difference (percentage change, percent effect) are supported. Detailed information about what a p-value is, how to interpret it, common misinterpretations and more can be found below.

### Quick navigation:

- Using the p-value calculator
- What is "p-value" and "significance level"
- P-value formula
- Why do we need a p-value?
- How to interpret a statistically significant result / low p-value
- Common misinterpretations of statistical significance
- One-tailed vs. two-tailed tests of significance
- P-value for relative difference

## Using the p-value calculator

This **statistical significance calculator** allows you to perform a post-hoc statistical evaluation of a set of data when the outcome of interest is the difference of two proportions (binomial data, e.g. conversion rate or event rate) or the difference of two means (continuous data, e.g. height, weight, speed, time, revenue, etc.). You can use a **Z-test** (recommended) or a **T-test** to calculate the observed significance level (p-value statistic). The Student's T-test is recommended mostly for very small sample sizes, e.g. n < 30.

If entering proportions data, you need to know the sample sizes of the two groups as well as the number or rate of events. You can enter that as a proportion (e.g. 0.10), percentage (e.g. 10%) or just the raw number of events (e.g. 50).

If entering means data, you need to simply copy/paste or type in the raw data, each observation separated by comma, space, new line or tab. Copy-pasting from a Google or Excel spreadsheet works fine.

The **p-value calculator will output**: p-value, significance level, T-score or Z-score, degrees of freedom, and the observed difference. For means data it will also output the sample sizes, means, and pooled standard error of the mean. The p-value is for a **one-sided hypothesis** (one-tailed test), allowing you to infer the direction of the effect (more on one vs. two-tailed tests).

**Warning:** You must have fixed the sample size / stopping time of your experiment in advance, otherwise you will be guilty of optional stopping (fishing for significance), which will inflate your type I error. Also, you should not use this calculator for comparisons of more than two means or proportions, or for comparisons of two groups based on more than one metric. If your experiment involves more than one treatment group or more than one outcome variable, you need a more advanced tool which corrects for multiple comparisons and multiple testing. This statistical calculator might help.
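To see why optional stopping matters, here is a minimal simulation sketch. It is not part of the calculator: the function name, the number of interim looks, and the batch size are all hypothetical choices, and it assumes normally distributed data with a known standard deviation of 1 and no true difference between the groups (an A/A test). It compares the false positive rate of a single fixed-sample test with that of stopping at the first of ten interim looks that crosses the one-sided 0.05 threshold.

```python
import random
from statistics import NormalDist


def simulate_type_i_error(n_sims=20000, looks=10, batch=100, alpha=0.05, seed=1):
    """Return (fixed_sample_rate, peeking_rate) of false positives under a true null.

    Assumption: both groups are N(0, 1), so the difference between the two
    batch sums of `batch` observations each is N(0, 2 * batch).
    """
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha)  # one-sided critical value
    fixed_hits = 0
    peeking_hits = 0
    for _ in range(n_sims):
        cum_diff = 0.0  # running (treatment sum - control sum)
        crossed = False
        z = 0.0
        for look in range(1, looks + 1):
            cum_diff += rng.gauss(0.0, (2 * batch) ** 0.5)
            n = look * batch  # observations per group so far
            # z = (difference of means) / sqrt(2 / n) = cum_diff / sqrt(2 * n)
            z = cum_diff / (2 * n) ** 0.5
            if z > z_crit:
                crossed = True  # a peeking analyst would have stopped here
        fixed_hits += z > z_crit  # only the final, pre-planned look counts
        peeking_hits += crossed
    return fixed_hits / n_sims, peeking_hits / n_sims


fixed_rate, peeking_rate = simulate_type_i_error()
print(f"fixed sample size: {fixed_rate:.3f}")
print(f"stopping at first significant look: {peeking_rate:.3f}")
```

Under these assumptions the fixed-sample rate should stay close to the nominal 0.05, while the rate with peeking comes out noticeably higher.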

## What is "p-value" and "significance level"

The p-value is a heavily used statistic that quantifies the **uncertainty** of a given measurement, usually as a part of an experiment, medical trial, as well as in observational studies. It is inseparable from inference through a **Null-Hypothesis Statistical Test (NHST)** in which we pose a null hypothesis reflecting the currently established theory or a model of the world we don't want to dismiss without solid evidence (the tested hypothesis), and an alternative hypothesis: an alternative model of the world. For example, the statistical null hypothesis could be that exposure to ultra-violet light for prolonged periods of time has positive or neutral effects regarding developing skin cancer, while the alternative hypothesis can be that it has a negative effect on development of skin cancer.

In this framework the p-value reflects the **probability of observing the result which was observed, or a more extreme one, assuming the null hypothesis is true**. In notation this is expressed as:

$$p(x_0) = \Pr(d(X) > d(x_0);\ H_0)$$

where $x_0$ is the observed data $(x_1, x_2, \ldots, x_n)$, $d$ is a special function (statistic, e.g. calculating a Z-score), and $X$ is a random sample $(X_1, X_2, \ldots, X_n)$ from the sampling distribution of the null hypothesis. This can be visualized in this way:

Therefore the p-value expresses the probability of committing a **type I error**: rejecting the null hypothesis if it is in fact true. See below for a full proper interpretation of the p-value statistic.

Another way to think of the p-value is as a more user-friendly expression of how many standard deviations away from the mean of the null distribution a given observation is. For example, in a one-tailed test of significance for a normally-distributed variable like the difference of two means, a result which is 1.6448 standard deviations away (1.6448σ) results in a p-value of 0.05.

The term **"statistical significance"** or **"significance level"** is often used in conjunction with the p-value, either to say that a result is "statistically significant", which has a specific meaning in statistical inference (see interpretation below), or to refer to the percentage representation of the level of significance: (1 - p-value), e.g. a p-value of 0.05 is equivalent to a significance level of 95% ((1 - 0.05) * 100).
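The conversion from a standard score to a one-tailed p-value, and from there to a significance level, can be sketched in a few lines. This is only an illustration using Python's standard library; the function names are hypothetical and not part of the calculator.

```python
from statistics import NormalDist


def one_tailed_p_value(z_score):
    """Upper-tail probability of a standard normal: Pr(Z > z_score)."""
    return 1.0 - NormalDist().cdf(z_score)


def significance_level_percent(p_value):
    """Percentage representation of the significance level: (1 - p) * 100."""
    return (1.0 - p_value) * 100.0


p = one_tailed_p_value(1.6448)
print(round(p, 4))                              # 0.05
print(round(significance_level_percent(p), 1))  # 95.0
```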

## P-value formula

There are different ways to arrive at a p-value depending on the assumption about the underlying distribution. In this significance calculator we support two such distributions: the Student's T-distribution and the normal Z-distribution (Gaussian).

In both cases you need to start by estimating the variance and standard deviation, then derive the standard error of the mean, after which a standard score is calculated using the formula ^{[2]}:

$$Z = \frac{\bar{X} - \mu_0}{\sigma_{\bar{x}}}$$

$\bar{X}$ (read "X bar") is the arithmetic mean of the population baseline or the control, $\mu_0$ is the observed mean / treatment group mean, while $\sigma_{\bar{x}}$ is the standard error of the mean (SEM, or standard deviation of the error of the mean).

When calculating a p-value using the **Z-distribution** the formula is $\Phi(Z)$ or $\Phi(-Z)$ for lower and upper-tailed tests, respectively. $\Phi$ is the standard normal cumulative distribution function.

When using the **T-distribution** the formula is $T_n(Z)$ or $T_n(-Z)$ for lower and upper-tailed tests, respectively. $T_n$ is the cumulative distribution function for a T-distribution with $n$ degrees of freedom.

The population standard deviation is often unknown and is thus estimated from the samples, usually from the pooled samples variance.
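As a rough sketch of the steps described in this section for means data, the snippet below estimates a pooled variance from two samples, derives the standard error of the difference, and converts the resulting score to one-tailed p-values using the normal distribution. The function name and sample data are hypothetical, and the score is computed as treatment minus control, which is a sign convention chosen for this example. This is not necessarily how the calculator is implemented.

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def z_test_means(control, treatment):
    """One-tailed Z-test for the difference of two independent means.

    Returns (z, p_upper, p_lower), where p_upper = Phi(-z) tests
    "treatment > control" and p_lower = Phi(z) tests "treatment < control".
    """
    n1, n2 = len(control), len(treatment)
    # Pooled sample variance, used in place of the unknown population variance
    pooled_var = ((n1 - 1) * variance(control) + (n2 - 1) * variance(treatment)) / (n1 + n2 - 2)
    # Standard error of the difference between the two sample means
    se = sqrt(pooled_var * (1 / n1 + 1 / n2))
    z = (mean(treatment) - mean(control)) / se
    std_normal = NormalDist()
    return z, std_normal.cdf(-z), std_normal.cdf(z)


z, p_upper, p_lower = z_test_means([1, 2, 3, 4, 5], [2, 3, 4, 5, 6])
print(f"z = {z:.3f}, upper-tailed p = {p_upper:.4f}, lower-tailed p = {p_lower:.4f}")
```

With samples this small the T-distribution would be the more appropriate choice. The normal CDF is used here only because it is available in the standard library.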

## Why do we need a p-value?

People need to share information about the evidential strength of data that can be easily understood and easily compared between experiments. The picture below represents, albeit imperfectly, the results of two simple experiments, each ending up with the control group at a 10% event rate and the treatment group at a 12% event rate.

However, it is obvious that the evidential input of the data is not the same, demonstrating that communicating just the observed proportions or their difference (effect size) is not enough to estimate and communicate the evidential strength of the experiment. In order to **fully describe the evidence and associated uncertainty**, several statistics need to be communicated, for example, the sample size, sample proportions and the shape of the error distribution. Their interaction is not trivial to understand, so communicating them separately makes it very difficult for one to grasp what information is present in the data. What would you infer if I told you the observed proportions are 10% and 12%, the sample sizes are 10,000 users each, and the error distribution is binomial?
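To make the question above concrete, here is a small sketch of a one-tailed two-proportion Z-test applied to the same 10% vs. 12% rates at two sample sizes: 10,000 users per group, as in the question, and a hypothetical 100 users per group chosen only for contrast. The function name is an assumption and the pooled-proportion standard error is one common choice, not necessarily the one the calculator uses.

```python
from math import sqrt
from statistics import NormalDist


def z_test_proportions(events_a, n_a, events_b, n_b):
    """One-tailed Z-test that group B has a higher event rate than group A.

    Returns (z, p_value), using a pooled proportion for the standard error.
    """
    p_a = events_a / n_a
    p_b = events_b / n_b
    pooled = (events_a + events_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    return z, NormalDist().cdf(-z)


# Same observed rates (10% vs. 12%), very different evidential strength
for n in (100, 10_000):
    z, p = z_test_proportions(round(0.10 * n), n, round(0.12 * n), n)
    print(f"n = {n:>6} per group: z = {z:.2f}, p-value = {p:.6f}")
```

With 100 users per group the p-value is far above 0.05, while with 10,000 users per group it is far below it, even though the observed difference is identical.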

Instead of communicating several statistics, a **single statistic** was developed that communicates all the necessary information in one piece: the **p-value**. It was first derived in the late 18th century by Pierre-Simon Laplace, when he observed data about a million births that showed an excess of boys, compared to girls. Using the calculation of significance, he argued that the effect was real, but unexplained at the time. We know this now to be true, and there are several explanations for the phenomenon coming from evolutionary biology. **Statistical significance calculations** were formally introduced in the early 20th century by Pearson and popularized by Sir Ronald Fisher in his work, most notably "The Design of Experiments" (1935) ^{[1]}.

## How to interpret a statistically significant result / low p-value

Saying that a result is **statistically significant** means that the p-value is below the evidential threshold decided for the test before it was conducted. For example, if observing something which would only happen 1 out of 20 times if the null hypothesis is true is considered sufficient evidence to reject the null hypothesis, the threshold will be 0.05. In such case, observing a p-value of 0.025 would mean that the result is statistically significant.

But what does that really mean? What inference can we make from seeing a result which was quite improbable if the null was true?

**Observing any given low p-value can mean one of three things ^{[4]}:**

1. There is a true effect from the tested treatment or intervention.
2. There is no true effect, but we happened to observe a rare outcome. The lower the p-value, the rarer (less likely, less probable) the outcome.
3. The statistical model is invalid (does not reflect reality).

Obviously, one can't simply jump to the first conclusion and claim it with one hundred percent certainty, as this would go against the whole idea of the p-value. In order to use the p-value as a part of a decision process you need to consider external factors that belong to the experimental design process, such as deciding on the significance threshold, sample size and power (power analysis), and the expected effect size, among other things.

If you are happy going forward with this much (or this little) uncertainty as is indicated by the p-value, then you have some quantifiable guarantees related to the effect and future performance of whatever you are testing.

## Common misinterpretations of statistical significance

There are several common misinterpretations of p-values and statistical significance and no calculator can save you from falling for them. The following errors are often committed when using statistical significance to make inferences:

### High p-value does not mean no effect

Treating a high p-value / low significance level as evidence, by itself, that there is no effect, no difference between the means, is a common mistake. However, it is trivial to demonstrate why it is a mistake. Take a simple experiment in which you measure only 2 (two) people or objects in the control and treatment groups. When you calculate the p-value for this test of significance you will find that it is not statistically significant. Does that mean that the treatment is ineffective? Of course not, since that claim has not been tested severely enough. Using a statistic such as severity can completely eliminate this error ^{[3]}.

A more detailed response would say that failure to observe a statistically significant result, given that the test has enough statistical power, can be used to argue for accepting the null hypothesis to the extent warranted by the power and minimum detectable effect for which it was calculated.
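The role of statistical power in this argument can be illustrated with a short sketch. It assumes a one-sided Z-test on the difference of two means with a known standard deviation of 1 in both groups and a hypothetical true effect of half a standard deviation. The function name and all parameter values are illustrative assumptions.

```python
from math import sqrt
from statistics import NormalDist


def power_one_sided_z(true_effect, n_per_group, sigma=1.0, alpha=0.05):
    """Probability of a statistically significant result when the true
    difference of means equals `true_effect` (known sigma, equal group sizes)."""
    std_normal = NormalDist()
    se = sigma * sqrt(2 / n_per_group)      # standard error of the difference
    z_crit = std_normal.inv_cdf(1 - alpha)  # one-sided critical value
    return 1 - std_normal.cdf(z_crit - true_effect / se)


for n in (2, 100):
    print(f"n = {n:>3} per group: power = {power_one_sided_z(0.5, n):.3f}")
```

Under these assumptions, two observations per group detect a real effect of this size only about 13% of the time, so a non-significant result says very little. With 100 per group the power exceeds 95%, and failing to reach significance becomes much more informative.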

### Statistical significance is not practical significance

A result may be highly statistically significant (e.g. p-value 0.0001) but it might still have no practical consequences due to a trivial effect. This often happens with overpowered designs, but it can also happen in a properly designed statistical test. This error can be avoided by always reporting the effect size and confidence intervals around it.

### Treating the significance level as likelihood for the observed effect

Observing a highly significant result, say a p-value of 0.01, does not mean that the observed difference is likely to be the true difference. In fact, that likelihood is much, much smaller. Remember that statistical significance has a strict meaning in the NHST framework. To make claims about a particular effect size, you can use confidence intervals or severity.

### Treating the p-value as likelihoods attached to hypotheses

For example, stating that a p-value of 0.02 means that there is 98% probability that the alternative hypothesis is true or that there is 2% probability that the null hypothesis is true. This is a logical error. You know that even if the null hypothesis is true, you will see p-values equal to or lower than 0.02 exactly 2% of the time, so you cannot use the fact that you have observed a low p-value to argue directly against the null hypothesis - further steps are required. Frequentist error-statistical methods do not allow one to attach probabilities to hypotheses ^{[3]} since doing so requires an exhaustive list of hypotheses and prior probabilities attached to them. This is decision-making and experimental design territory. Put in Bayesian terms, the p-value is not a posterior probability.
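The claim that a true null hypothesis produces p-values at or below 0.02 about 2% of the time can be checked with a quick simulation. The sketch below is a simplification: it draws the Z-score of each simulated test directly from a standard normal distribution, which is what a Z-test statistic follows when the null is exactly true. The function name, seed, and number of simulations are arbitrary choices.

```python
import random
from statistics import NormalDist


def share_of_p_values_at_or_below(threshold, n_sims=50000, seed=7):
    """Fraction of simulated null-true tests with a one-tailed p-value <= threshold."""
    rng = random.Random(seed)
    std_normal = NormalDist()
    hits = 0
    for _ in range(n_sims):
        z = rng.gauss(0.0, 1.0)       # test statistic when there is no true effect
        p_value = std_normal.cdf(-z)  # upper-tailed p-value
        hits += p_value <= threshold
    return hits / n_sims


print(share_of_p_values_at_or_below(0.02))
```

The printed share should be close to 0.02, even though every one of these "significant" results comes from a world where the null hypothesis is true.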

## One-tailed vs. two-tailed tests of significance

There are widespread misconceptions about one-tailed and two-tailed tests, often referred to as one-sided and two-sided hypotheses ^{[6]}. This is not surprising given that even the Wikipedia article on the topic gets it wrong by stating that one-sided tests are appropriate only if "the estimated value can depart from the reference value in just one direction". Consequently, people often prefer two-sided tests as they believe using one-tailed tests leads to bias, higher than nominal error rates (type I errors) and involves making more assumptions about the direction of the effect.

However, nothing could be further from the truth. In reality, there is **no practical or theoretical situation in which a two-tailed test is appropriate**. For it to be appropriate, the inference drawn or action taken needs to be the same regardless of the direction of the effect of interest and that is never the case. Therefore, this p-value calculator uses one-tailed z-scores and t-scores.

**"A two-sided hypothesis and a two-tailed test should be used only when we would act the same way, or draw the same conclusions, if we discover a statistically significant difference in any direction."** ^{[5]}. Doing otherwise leads to a host of issues, including uncontrolled type III errors (inferring the direction of an effect to be the same as the observed direction, when it is not).

On the other hand, **"A one-sided hypothesis and a one-tailed test should be used when we would act a certain way, or draw certain conclusions, if we discover a statistically significant difference in a particular direction, but not in the other direction."** ^{[5]} which describes all practical and scientific applications of p-values / tests of significance.

Not only do one-tailed tests answer the question you are actually asking when you properly formulate your hypotheses, but they are faster to execute, resulting in 20-60% faster tests/trials/experiments ^{[5]}. There is no good reason to use two-tailed tests as it means committing to 20-60% slower tests to answer a question you didn't ask and the answer to which would make no difference whatsoever. Further and more in-depth reading and explanations: One-tailed vs Two-tailed Tests of Significance and reference #5 below.
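One way to get a feel for the speed difference is to compare the sample size a fixed-sample Z-test needs under each approach. The sketch below uses the standard approximation in which the required sample size is proportional to the square of the sum of the critical value and the power quantile. It assumes the same significance threshold, power, and effect size for both tests, and the function name and default settings are hypothetical. It is not a reproduction of the calculations in reference [5].

```python
from statistics import NormalDist


def two_vs_one_tailed_sample_ratio(alpha=0.05, power=0.80):
    """Ratio of required sample sizes: two-tailed test / one-tailed test."""
    std_normal = NormalDist()
    z_power = std_normal.inv_cdf(power)
    one_tailed = (std_normal.inv_cdf(1 - alpha) + z_power) ** 2
    two_tailed = (std_normal.inv_cdf(1 - alpha / 2) + z_power) ** 2
    return two_tailed / one_tailed


ratio = two_vs_one_tailed_sample_ratio()
print(f"two-tailed test needs {100 * (ratio - 1):.0f}% more observations")
```

At a 0.05 threshold and 80% power the two-tailed test needs roughly 27% more observations under these assumptions. Other combinations of threshold and power give different figures.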

## P-value for relative difference

When comparing two independent groups and the variable of interest is the relative difference (a.k.a. relative change, percent change, percentage difference), as opposed to the absolute difference between the two means or proportions, the standard deviation of the variable is different, compelling different p-value calculations. This is due to the fact that in calculating relative difference we are performing an additional division by a random variable: the conversion rate of the control during the experiment, which adds more variance to the estimation and the resulting p-value is usually higher (the result will be less statistically significant).

In simulations I performed the difference in p-values was about 50% of nominal: a 0.05 p-value for absolute difference corresponded to probability of about 0.075 of observing the relative difference corresponding to the observed absolute difference. Therefore, if you are using p-values calculated for absolute difference when making an inference about percentage difference, you are likely reporting error rates which are only about two-thirds of the actual (0.05 versus roughly 0.075), thus significantly overstating the statistical significance of your results and underestimating the uncertainty attached to them.

With this p-value calculator you can avoid this mistake by simply indicating the inference you want to make.
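The effect of dividing by the control rate can be illustrated with a delta-method approximation of the standard error of the relative difference. This is an assumption made for the sake of the example: it is neither the simulation described above nor necessarily the method the calculator uses, and the function name and input numbers are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def p_values_absolute_vs_relative(events_a, n_a, events_b, n_b):
    """One-tailed p-values for the absolute and the relative difference (B vs. A).

    The relative difference is p_b / p_a - 1, with its standard error
    approximated via the delta method for a ratio of independent estimates.
    """
    p_a, p_b = events_a / n_a, events_b / n_b
    var_a = p_a * (1 - p_a) / n_a
    var_b = p_b * (1 - p_b) / n_b
    std_normal = NormalDist()

    # Absolute difference: variances of the two proportions simply add up
    z_abs = (p_b - p_a) / sqrt(var_a + var_b)

    # Relative difference: extra variance from dividing by the random p_a
    ratio = p_b / p_a
    se_rel = ratio * sqrt(var_a / p_a ** 2 + var_b / p_b ** 2)
    z_rel = (ratio - 1) / se_rel

    return std_normal.cdf(-z_abs), std_normal.cdf(-z_rel)


p_abs, p_rel = p_values_absolute_vs_relative(1000, 10_000, 1100, 10_000)
print(f"absolute difference: p = {p_abs:.4f}")
print(f"relative difference: p = {p_rel:.4f}")
```

For a positive observed lift, the p-value for the relative difference comes out higher than the one for the absolute difference, in line with the direction of the effect described above. The size of the gap depends on the inputs and on the approximation used.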

#### References

[1] Fisher R.A. (1935) – "The Design of Experiments", *Edinburgh: Oliver & Boyd*

[2] Mayo D.G., Spanos A. (2010) – "Error Statistics", in P. S. Bandyopadhyay & M. R. Forster (Eds.), Philosophy of Statistics, (7, 152–198). *Handbook of the Philosophy of Science*. The Netherlands: Elsevier.

[3] Mayo D.G., Spanos A. (2006) – "Severe Testing as a Basic Concept in a Neyman–Pearson Philosophy of Induction", *The British Journal for the Philosophy of Science*, 57:323-357

[4] Georgiev G.Z. (2017) "Statistical Significance in A/B Testing – a Complete Guide", [online] http://blog.analytics-toolkit.com/2017/statistical-significance-ab-testing-complete-guide/ (accessed Apr 27, 2018)

[5] Georgiev G.Z. (2017) "One-tailed vs Two-tailed Tests of Significance in A/B Testing", [online] http://blog.analytics-toolkit.com/2017/one-tailed-two-tailed-tests-significance-ab-testing/ (accessed Apr 27, 2018)

[6] Cho H.-C., Abe S. (2013) "Is two-tailed testing for directional research hypotheses tests legitimate?", *Journal of Business Research* 66:1261-1266

#### Cite this calculator & page

If you'd like to cite this online calculator resource and information as provided on the page, you can use the following citation:

Georgiev G.Z., *"P-value Calculator"*, [online] Available at: https://www.gigacalculator.com/calculators/p-value-significance-calculator.php URL [Accessed Date: 22 Jun, 2018].