# P-value Calculator

Use this **statistical significance calculator** to easily calculate the **p-value** and determine whether the difference between two proportions or means (independent groups) is statistically significant. It will also output the **Z-score or T-score** for the difference. Inferences about both absolute and relative difference (percentage change, percent effect) are supported. Below is a detailed explanation of what a p-value is and how to use and interpret it.

### Quick navigation:

- Using the p-value calculator
- What is "p-value" and "significance level"
- P-value formula
- Why do we need a p-value?
- How to interpret a statistically significant result / low p-value

## Using the p-value calculator

This **statistical significance calculator** allows you to perform a post-hoc statistical evaluation of a set of data when the outcome of interest is the difference of two proportions (binomial data, e.g. conversion rate or event rate) or the difference of two means (continuous data, e.g. height, weight, speed, time, revenue, etc.). You can use a **Z-test** (recommended) or a **T-test** to calculate the observed significance level (p-value statistic). The Student's T-test is recommended mostly for very small sample sizes, e.g. n < 30. To avoid the type I error inflation which might occur with unequal variances, the calculator automatically applies Welch's T-test instead of Student's T-test if the sample sizes differ significantly, or if one of them is less than 30 and the sampling ratio is different from one.

If entering proportions data, you need to know the sample sizes of the two groups as well as the number or rate of events. You can enter that as a proportion (e.g. 0.10), percentage (e.g. 10%) or just the raw number of events (e.g. 50).
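As a rough illustration of what happens with proportions input, the sketch below normalizes the three accepted entry styles and runs a one-sided Z-test for the difference of two proportions. The function names, the parsing rules (a trailing `%` means a percentage, a number below 1 is read as a proportion, anything else as an event count) and the use of a pooled standard error are assumptions made for this example, not the calculator's documented internals.

```python
from math import sqrt
from statistics import NormalDist


def to_proportion(value, n):
    """Normalize a user entry to a proportion in [0, 1].

    Hypothetical parsing rules (assumptions for this sketch):
    '10%' -> 0.10, a number below 1 is a proportion, anything else is an event count.
    """
    text = str(value).strip()
    if text.endswith("%"):
        return float(text[:-1]) / 100.0
    number = float(text)
    return number if number < 1 else number / n


def one_sided_z_test(p_control, n_control, p_treatment, n_treatment):
    """Return (z, p_value) for the alternative: treatment proportion > control proportion."""
    # Pooled proportion under the null hypothesis of no difference (an assumption).
    pooled = (p_control * n_control + p_treatment * n_treatment) / (n_control + n_treatment)
    se = sqrt(pooled * (1 - pooled) * (1 / n_control + 1 / n_treatment))
    z = (p_treatment - p_control) / se
    return z, 1 - NormalDist().cdf(z)


p_a = to_proportion("10%", 500)  # percentage entry
p_b = to_proportion(65, 500)     # raw count entry: 65 events out of 500
z, p_value = one_sided_z_test(p_a, 500, p_b, 500)
print(f"z = {z:.3f}, one-sided p-value = {p_value:.4f}")
```

Note that the "below 1" rule is ambiguous for an entry of exactly 1, which this sketch treats as a single event; a real input form would need to resolve that explicitly.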

If entering means data in the calculator, you need to simply copy/paste or type in the raw data, each observation separated by comma, space, new line or tab. Copy-pasting from a Google or Excel spreadsheet works fine.
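The raw-data workflow described above can be sketched as follows: split the pasted text on commas and whitespace, then compute a one-sided Z-test for the difference of means. The helper names and the unpooled standard error are assumptions for this example, and the six-observation samples are far too small for a Z-test in practice; they are used here only to keep the example short.

```python
import re
from math import sqrt
from statistics import NormalDist, mean, variance


def parse_observations(raw):
    """Split raw text on commas, spaces, tabs or new lines and convert to floats."""
    return [float(token) for token in re.split(r"[,\s]+", raw.strip()) if token]


def z_test_means(control, treatment):
    """One-sided Z-test for the alternative: mean(treatment) > mean(control).

    Uses the unpooled standard error sqrt(s1^2/n1 + s2^2/n2), which is an
    assumption of this sketch; the calculator may estimate it differently.
    """
    se = sqrt(variance(control) / len(control) + variance(treatment) / len(treatment))
    z = (mean(treatment) - mean(control)) / se
    return z, 1 - NormalDist().cdf(z)


# Mixed separators, as they might arrive from a spreadsheet paste.
control = parse_observations("10.1, 9.8\t10.4\n9.9 10.0, 10.3")
treatment = parse_observations("10.6 10.9,10.2\n10.8\t10.5, 11.0")
z, p_value = z_test_means(control, treatment)
print(f"n = {len(control)} and {len(treatment)}, z = {z:.3f}, p-value = {p_value:.6f}")
```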

The **p-value calculator will output**: p-value, significance level, T-score or Z-score, degrees of freedom, and the observed difference. For means data it will also output the sample sizes, means, and pooled standard error of the mean. The p-value is for a **one-sided hypothesis** (one-tailed test), allowing you to infer the direction of the effect (more on one vs. two-tailed tests).

**Warning:** You must have fixed the sample size / stopping time of your experiment in advance, otherwise you will be guilty of optional stopping (fishing for significance) which will inflate your type I error, rendering the statistical significance level unusable. Also, you should not use this significance calculator for comparisons of more than two means or proportions, or for comparisons of two groups based on more than one metric. If your experiment involves more than one treatment group or more than one outcome variable, you need a more advanced tool which corrects for multiple comparisons and multiple testing. This statistical calculator might help.
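To see why optional stopping matters, here is a small simulation sketch. It is simplified to a single sample from a standard normal distribution with known variance (an assumption, not the two-group setting of the calculator), and the number of looks, sample size and seed are arbitrary. Since the true effect is zero, every rejection is a false positive.

```python
import random
from math import sqrt
from statistics import NormalDist


def simulate(n_sims=1000, n_max=500, n_looks=10, alpha=0.05, seed=1):
    """Estimate the type I error with a fixed sample size and with peeking."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha)  # one-sided critical value
    # Equally spaced interim looks; the last look coincides with the planned end.
    look_points = {n_max * k // n_looks for k in range(1, n_looks + 1)}
    fixed_hits = peeking_hits = 0
    for _ in range(n_sims):
        total = 0.0
        rejected_at_any_look = False
        for n in range(1, n_max + 1):
            total += rng.gauss(0, 1)  # null hypothesis is true: mean 0, sd 1
            if n in look_points and total / sqrt(n) > z_crit:
                rejected_at_any_look = True
        fixed_hits += total / sqrt(n_max) > z_crit  # test only once, at the end
        peeking_hits += rejected_at_any_look        # stop at the first "significant" look
    return fixed_hits / n_sims, peeking_hits / n_sims


fixed_rate, peeking_rate = simulate()
print(f"fixed sample size: {fixed_rate:.3f}, with peeking: {peeking_rate:.3f}")
```

The fixed-sample rate should land near the nominal 0.05, while the peeking rate is noticeably higher, because each extra look is another chance to cross the threshold by chance alone.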

## What is "p-value" and "significance level"

The p-value is a heavily used statistic that quantifies the **uncertainty** of a given measurement, usually as part of an experiment or medical trial, but also in observational studies. By definition, it is inseparable from inference through a **Null-Hypothesis Statistical Test (NHST)**. In it we pose a null hypothesis reflecting the currently established theory or a model of the world we don't want to dismiss without solid evidence (the tested hypothesis), and an alternative hypothesis: an alternative model of the world. For example, the statistical null hypothesis could be that exposure to ultra-violet light for prolonged periods of time has positive or neutral effects regarding developing skin cancer, while the alternative hypothesis can be that it has a negative effect on development of skin cancer.

In this framework a p-value is defined as the **probability of observing the result which was observed, or a more extreme one, assuming the null hypothesis is true**. In notation this is expressed as:

$$p(x_0) = \Pr(d(X) > d(x_0); H_0)$$

where $x_0$ is the observed data $(x_1, x_2, \ldots, x_n)$, $d$ is a special function (statistic, e.g. calculating a Z-score), and $X$ is a random sample $(X_1, X_2, \ldots, X_n)$ from the sampling distribution of the null hypothesis. This equation is used in this p-value calculator and can be visualized as such:

Therefore the p-value expresses the probability of committing a **type I error**: rejecting the null hypothesis if it is in fact true. See below for a full proper interpretation of the p-value statistic.

Another way to think of the p-value is as a more user-friendly expression of how many standard deviations away from the value expected under the null hypothesis a given observation is. For example, in a one-tailed test of significance for a normally-distributed variable like the difference of two means, a result which is 1.6448 standard deviations away (1.6448σ) results in a p-value of 0.05.
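The correspondence between standard deviations and one-tailed p-values can be checked with Python's standard library. This is only a sketch; the variable names are chosen for this example.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

# One-tailed p-value for an observation 1.6448 standard deviations out.
z = 1.6448
p_one_tailed = 1 - standard_normal.cdf(z)

# The reverse lookup: how many standard deviations correspond to p = 0.05.
z_for_05 = standard_normal.inv_cdf(1 - 0.05)

print(f"p-value at z = {z}: {p_one_tailed:.4f}")
print(f"z for a one-tailed p-value of 0.05: {z_for_05:.4f}")
```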

The term **"statistical significance"** or **"significance level"** is often used in conjunction with the p-value, either to say that a result is "statistically significant", which has a specific meaning in statistical inference (see interpretation below), or to refer to the percentage representation of the level of significance: (1 - p value), e.g. a p-value of 0.05 is equivalent to a significance level of 95% ((1 - 0.05) * 100).

## P-value formula

There are different ways to arrive at a p-value depending on the assumption about the underlying distribution. This tool supports two such distributions: the Student's T-distribution and the normal Z-distribution (Gaussian).

In both cases you need to start the p-value calculation by estimating the variance and standard deviation, then derive the standard error of the mean, after which a standard score is calculated using the formula ^{[2]}:

$$Z = \frac{\bar{X} - \mu_0}{\sigma_{\bar{x}}}$$

where $\bar{X}$ (read "X bar") is the arithmetic mean of the population baseline or the control, $\mu_0$ is the observed mean / treatment group mean, while $\sigma_{\bar{x}}$ is the standard error of the mean (SEM, or standard deviation of the error of the mean).

When calculating a p-value using the **Z-distribution** the formula is $\Phi(Z)$ or $\Phi(-Z)$ for lower and upper-tailed tests, respectively. $\Phi$ is the standard normal cumulative distribution function.
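A minimal sketch of the Z-distribution case, assuming the standard score has the usual form of a difference divided by the standard error of the mean. The function and parameter names (`standard_score`, `x_bar`, `mu_0`, `sem`, `tail`) are introduced here for illustration only.

```python
from statistics import NormalDist


def standard_score(x_bar, mu_0, sem):
    """Standard score: the difference of the two means in units of the SEM."""
    return (x_bar - mu_0) / sem


def p_value_from_z(z, tail):
    """Phi(Z) for a lower-tailed test, Phi(-Z) for an upper-tailed test."""
    phi = NormalDist().cdf  # standard normal cumulative distribution function
    if tail == "lower":
        return phi(z)
    if tail == "upper":
        return phi(-z)
    raise ValueError("tail must be 'lower' or 'upper'")


z = standard_score(x_bar=10.5, mu_0=10.0, sem=0.25)
p_upper = p_value_from_z(z, "upper")
p_lower = p_value_from_z(z, "lower")
print(f"Z = {z:.2f}, upper-tailed p = {p_upper:.5f}, lower-tailed p = {p_lower:.5f}")
```

The two tails always sum to one, so a result that is strong evidence in one direction is, by construction, no evidence at all in the other.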

When using the **T-distribution** the formula is $T_n(Z)$ or $T_n(-Z)$ for lower and upper-tailed tests, respectively. $T_n$ is the cumulative distribution function for a T-distribution with $n$ degrees of freedom.
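Python's standard library has no T-distribution, so the sketch below approximates its cumulative distribution function by integrating the density with Simpson's rule. This is an illustrative approximation, not the calculator's implementation; in practice a statistics library such as SciPy (`scipy.stats.t.cdf`) would normally be used. The names `t_pdf` and `t_cdf` and the step count are choices made for this example.

```python
from math import exp, lgamma, pi, sqrt


def t_pdf(t, df):
    """Density of the T-distribution with df degrees of freedom."""
    coeff = exp(lgamma((df + 1) / 2) - lgamma(df / 2)) / sqrt(df * pi)
    return coeff * (1 + t * t / df) ** (-(df + 1) / 2)


def t_cdf(t, df, steps=2000):
    """CDF via Simpson's rule on [0, |t|], using the symmetry around zero."""
    if t == 0:
        return 0.5
    upper = abs(t)
    h = upper / steps  # steps must be even for Simpson's rule
    area = t_pdf(0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        area += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area *= h / 3
    return 0.5 + area if t > 0 else 0.5 - area


score, df = 1.8125, 10
p_lower = t_cdf(score, df)   # lower-tailed test
p_upper = t_cdf(-score, df)  # upper-tailed test
print(f"lower-tailed p = {p_lower:.4f}, upper-tailed p = {p_upper:.4f}")
```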

The population standard deviation is often unknown and is thus estimated from the samples, usually from the pooled samples variance. Knowing or estimating the standard deviation is a prerequisite for using a significance calculator.
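For reference, the textbook pooled variance estimate for two independent samples, and the standard error of the difference in means derived from it, can be written as follows. The symbols $s_1^2, s_2^2$ (sample variances) and $n_1, n_2$ (sample sizes) are introduced here for illustration; whether the calculator uses exactly this estimator in every case is an assumption.

$$s_p^2 = \frac{(n_1 - 1)\,s_1^2 + (n_2 - 1)\,s_2^2}{n_1 + n_2 - 2}, \qquad SE = s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}$$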

## Why do we need a p-value?

If you are in the sciences, it is often a requirement by scientific journals. If you apply it in business experiments (e.g. A/B testing) it is reported alongside confidence intervals and other estimates. However, what is the utility of the p-value, really?

First, let us define the problem the p-value is intended to solve. People need to share information about the evidential strength of data that can be easily understood and easily compared between experiments. The picture below represents, albeit imperfectly, the results of two simple experiments, each ending up with the control at a 10% event rate and the treatment group at a 12% event rate.

However, it is obvious that the evidential input of the data is not the same, demonstrating that communicating just the observed proportions or their difference (effect size) is not enough to estimate and communicate the evidential strength of the experiment. In order to **fully describe the evidence and associated uncertainty**, several statistics need to be communicated, for example, the sample size, sample proportions and the shape of the error distribution. Their interaction is not trivial to understand, so communicating them separately makes it very difficult for one to grasp what information is present in the data. What would you infer if I told you the observed proportions are 10% and 12%, the sample sizes are 10,000 users each, and the error distribution is binomial?
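To make the point concrete, the sketch below computes a one-sided p-value for the same observed rates (10% vs. 12%) at two sample sizes: the 10,000 users per group from the question above, and a hypothetical 100 users per group chosen only for contrast. The pooled-variance Z-test and the function name are assumptions of this example.

```python
from math import sqrt
from statistics import NormalDist


def one_sided_p_value(rate_control, rate_treatment, n_per_group):
    """One-sided Z-test p-value for two proportions with equal group sizes."""
    pooled = (rate_control + rate_treatment) / 2  # valid because groups are equal in size
    se = sqrt(pooled * (1 - pooled) * 2 / n_per_group)
    z = (rate_treatment - rate_control) / se
    return 1 - NormalDist().cdf(z)


p_small = one_sided_p_value(0.10, 0.12, 100)     # hypothetical small experiment
p_large = one_sided_p_value(0.10, 0.12, 10_000)  # sample size from the text
print(f"n = 100 per group:    p = {p_small:.4f}")
print(f"n = 10,000 per group: p = {p_large:.2e}")
```

The observed effect size is identical in both cases, yet the evidence it provides against the null hypothesis differs by several orders of magnitude.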

Instead of communicating several statistics, a **single statistic** was developed that communicates all the necessary information in one piece: the **p-value**. A p-value was first derived in the late 18th century by Pierre-Simon Laplace, when he observed data about a million births that showed an excess of boys, compared to girls. Using the calculation of significance, he argued that the effect was real, but unexplained at the time. We know this now to be true, and there are several explanations for the phenomenon coming from evolutionary biology. **Statistical significance calculations** were formally introduced in the early 20th century by Pearson and popularized by Sir Ronald Fisher in his works, most notably "The Design of Experiments" (1935) ^{[1]}.

## How to interpret a statistically significant result / low p-value

Saying that a result is **statistically significant** means that the p-value is below the evidential threshold decided for the test before it was conducted. For example, if observing something which would only happen 1 out of 20 times if the null hypothesis is true is considered sufficient evidence to reject the null hypothesis, the threshold will be 0.05. In such a case, observing a p-value of 0.025 would mean that the result is interpreted as statistically significant.

But what does that really mean? What inference can we make from seeing a result which was quite improbable if the null was true?

**Observing any given low p-value can mean one of three things ^{[3]}:**

1. There is a true effect from the tested treatment or intervention.
2. There is no true effect, but we happened to observe a rare outcome. The lower the p-value, the rarer (less likely, less probable) the outcome.
3. The statistical model is invalid (does not reflect reality).

Obviously, one can't simply jump to conclusion 1.) and claim it with one hundred percent certainty, as this would go against the whole idea of the p-value and statistical significance. In order to use the p-value as a part of a decision process you need to consider external factors which are part of the experimental design process: the significance threshold, the sample size and power (power analysis), and the expected effect size, among other things.

If you are happy going forward with this much (or this little) uncertainty as indicated by the p-value calculation, then you have some quantifiable guarantees related to the effect and future performance of whatever you are testing. For a deeper take on the p-value meaning and interpretation, including common misinterpretations, see: definition and interpretation of the p-value in statistics.

## P-value and significance for relative difference in means or proportions

When comparing two independent groups and the variable of interest is the relative difference (a.k.a. relative change, percent change, percentage difference), as opposed to the absolute difference between the two means or proportions, the standard deviation of the variable is different, compelling different p-value calculations ^{[5]}. This is due to the fact that in calculating relative difference we are performing an additional division by a random variable: the conversion rate of the control during the experiment. This adds more variance to the estimation, so the resulting p-value is usually higher (the result will be less statistically significant).

In simulations I performed the p-value for the relative difference was about 50% higher than the nominal one: a 0.05 p-value for absolute difference corresponded to a probability of about 0.075 of observing the relative difference corresponding to the observed absolute difference. Therefore, if you are using p-values calculated for absolute difference when making an inference about percentage difference, you are likely reporting error rates which are only about two-thirds of the actual ones, thus significantly overstating the statistical significance of your results and underestimating the uncertainty attached to them.
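One common way to approximate the extra variance introduced by dividing by the control rate is the delta method for a ratio of two independent proportions. The sketch below uses it to compare the two p-values on hypothetical counts. It is an illustration under that assumption, not necessarily the method used by the calculator or in ^{[5]}, and the size of the gap it produces depends on the inputs and need not match the simulation figure quoted above.

```python
from math import sqrt
from statistics import NormalDist


def p_values_abs_and_rel(x_c, n_c, x_t, n_t):
    """One-sided p-values for the absolute and the relative difference.

    x_c, x_t are event counts and n_c, n_t are sample sizes for the control
    and treatment groups (hypothetical names for this sketch).
    """
    p_c, p_t = x_c / n_c, x_t / n_t
    var_c = p_c * (1 - p_c) / n_c
    var_t = p_t * (1 - p_t) / n_t

    # Absolute difference: the variances of independent estimates add up.
    z_abs = (p_t - p_c) / sqrt(var_c + var_t)

    # Relative difference p_t / p_c - 1: delta-method variance of the ratio.
    ratio = p_t / p_c
    se_rel = sqrt(var_t / p_c**2 + ratio**2 * var_c / p_c**2)
    z_rel = (ratio - 1) / se_rel

    upper_tail = lambda z: 1 - NormalDist().cdf(z)
    return upper_tail(z_abs), upper_tail(z_rel)


p_abs, p_rel = p_values_abs_and_rel(x_c=1000, n_c=10_000, x_t=1070, n_t=10_000)
print(f"absolute difference: p = {p_abs:.4f}")
print(f"relative difference: p = {p_rel:.4f}")
```

For a positive observed lift the relative-difference p-value comes out larger than the absolute one, which is the direction of the effect described above.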

With this calculator you can avoid this mistake by simply indicating the inference you want to make.

#### References

[1] Fisher R.A. (1935) – "The Design of Experiments", *Edinburgh: Oliver & Boyd*

[2] Mayo D.G., Spanos A. (2010) – "Error Statistics", in P. S. Bandyopadhyay & M. R. Forster (Eds.), Philosophy of Statistics, (7, 152–198). *Handbook of the Philosophy of Science*. The Netherlands: Elsevier.

[3] Georgiev G.Z. (2017) "Statistical Significance in A/B Testing – a Complete Guide", [online] http://blog.analytics-toolkit.com/2017/statistical-significance-ab-testing-complete-guide/ (accessed Apr 27, 2018)

[4] Mayo D.G., Spanos A. (2006) – "Severe Testing as a Basic Concept in a Neyman–Pearson Philosophy of Induction", *British Society for the Philosophy of Science*, 57:323-357

[5] Georgiev G.Z. (2018) "Confidence Intervals & P-values for Percent Change / Relative Difference", [online] http://blog.analytics-toolkit.com/2018/confidence-intervals-p-values-percent-change-relative-difference/ (accessed May 20, 2018)

#### Cite this calculator & page

If you'd like to cite this online calculator resource and information as provided on the page, you can use the following citation:

Georgiev G.Z., *"P-value Calculator"*, [online] Available at: https://www.gigacalculator.com/calculators/p-value-significance-calculator.php URL [Accessed Date: 19 Jan, 2021].