# Power & Sample Size Calculator

Use this advanced **sample size calculator** to calculate the sample size required for a one-sample statistic, or for differences between two proportions or means (two independent samples). More than two groups are supported for binomial data. **Calculate power** given sample size, alpha, and the minimum detectable effect (MDE, minimum effect of interest).

### Quick navigation:

- Using the power & sample size calculator
- Why is computing sample size important?
- What is statistical power?
- Types of null and alternative hypotheses
- Absolute versus relative difference

## Using the power & sample size calculator

This calculator allows you to evaluate the properties of different statistical designs when planning an experiment (trial, test) utilizing a Null-Hypothesis Statistical Test to make inferences. This online tool can be used as a **sample size calculator** and as a **statistical power calculator**. Choosing between these two modes is the first choice you need to make in the interface. Usually you would calculate the sample size required given a particular power requirement, but in cases where you have a predetermined sample size you can instead calculate the power for a given effect size of interest.

It supports experiments in which you are gathering data on a **single sample** in order to compare it to a general population or known reference value (one-sample), as well as ones where you compare a control group to one or more treatment groups (**two-sample, k-sample**) in order to detect differences between them. For comparing more than one treatment group to a control group, the sample size calculations are based on Dunnett's correction. They are only approximately accurate, subject to the assumption of about equal effect size in all k groups, and support only equal sample sizes across all treatment groups and the control. Power calculations are not currently supported for more than one treatment group due to their complexity.

The outcome of interest can be the **absolute difference of two proportions** (binomial data, e.g. conversion rate or event rate), the **absolute difference of two means** (continuous data, e.g. height, weight, speed, time, revenue, etc.), or the **relative difference** between two proportions or two means (percent difference, percent change, etc.). You can also calculate power and sample size for the mean of just a single group. The calculator uses the Z-distribution (normal distribution).

If entering proportions data, you need to enter the proportion or rate of events according to the null hypothesis (worst-case scenario for a composite null) and the minimum effect of interest, which is called the minimum detectable effect (**MDE**) in power and sample size calculations. This should be the **difference you would not like to miss**, if it existed. You can enter them as a proportion (e.g. 0.10) or as a percentage (e.g. 10%). It is important to keep in mind that the MDE is always relative to the mean/proportion under H_{0} ± the superiority/non-inferiority or equivalence margin. Thus, if you have a baseline mean of **10**, a superiority alternative hypothesis with a superiority margin of **1**, and a minimum effect of interest relative to the baseline of 3, you need to enter an MDE of **2**, since the MDE plus the superiority margin will equal exactly 3. In this case the MDE is calculated relative to the baseline plus the superiority margin, as it is usually more intuitive to be interested in that value.
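As a rough illustration of how these inputs combine, here is a minimal sketch of a per-group sample size calculation for the absolute difference of two proportions. It assumes a one-sided z-test, two equal-sized independent groups, and the common unpooled-variance approximation; the function name and parameters are hypothetical and the calculator itself may use a different formula.

```python
from math import ceil
from statistics import NormalDist


def n_per_group_two_proportions(p_null, mde, alpha=0.05, power=0.80):
    """Approximate per-group sample size for a one-sided two-proportion z-test.

    p_null: proportion under the null hypothesis (e.g. the baseline rate)
    mde:    minimum detectable effect as an absolute difference
    """
    z = NormalDist()  # standard normal distribution
    z_alpha = z.inv_cdf(1 - alpha)  # one-sided critical value
    z_beta = z.inv_cdf(power)       # quantile matching the desired power
    p_alt = p_null + mde
    # Unpooled variance of the difference between two independent proportions
    variance = p_null * (1 - p_null) + p_alt * (1 - p_alt)
    return ceil((z_alpha + z_beta) ** 2 * variance / mde ** 2)


# Baseline rate of 10% and an absolute MDE of 2 percentage points
n = n_per_group_two_proportions(0.10, 0.02)
print(n, "per group,", 2 * n, "in total")
```

Halving the MDE roughly quadruples the required sample size, which is why the choice of MDE dominates the cost of the test.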

If entering means data, you need to specify the mean under the null hypothesis (worst-case scenario for a composite null) and the standard deviation of the data (known or estimated from a sample).
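For means, a comparable sketch under the same caveats (one-sided z-test, known or well-estimated standard deviation, equal group sizes) is shown below. The function and its `groups` parameter are illustrative assumptions, not the calculator's actual implementation.

```python
from math import ceil
from statistics import NormalDist


def n_for_means(mde, sd, alpha=0.05, power=0.80, groups=1):
    """Approximate sample size per group for a one-sided z-test on means.

    mde:    minimum detectable effect (absolute difference from the null value)
    sd:     standard deviation of the data (known or estimated)
    groups: 1 for a one-sample design, 2 for two independent samples
    """
    if groups not in (1, 2):
        raise ValueError("this sketch covers one- and two-sample designs only")
    z = NormalDist()
    z_sum = z.inv_cdf(1 - alpha) + z.inv_cdf(power)
    # The difference of two independent means has twice the variance of one mean
    return ceil(groups * (z_sum * sd / mde) ** 2)


print(n_for_means(mde=2, sd=10))            # one-sample design
print(n_for_means(mde=2, sd=10, groups=2))  # per group, two-sample design
```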

The calculator supports **superiority**, **non-inferiority** and **equivalence** alternative hypotheses. When the superiority or non-inferiority margin is zero, the test becomes a classical left-sided or right-sided hypothesis test. If the margin is larger than zero, it becomes a true superiority / non-inferiority design. The equivalence margin cannot be zero.
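The effect of a margin on the required sample size can be sketched as follows. The code assumes a one-sided two-sample z-test on means and treats the distance between the assumed true difference and the null boundary as the MDE; the function names and the sign conventions in the comments are assumptions made for illustration, and equivalence designs are left out.

```python
from math import ceil
from statistics import NormalDist


def distance_from_null(true_diff, margin, hypothesis):
    """Distance between the assumed true difference and the null boundary."""
    if hypothesis == "superiority":      # assumed null: difference <= margin
        return true_diff - margin
    if hypothesis == "non-inferiority":  # assumed null: difference <= -margin
        return true_diff + margin
    raise ValueError("unsupported hypothesis type in this sketch")


def n_per_group(true_diff, margin, sd, hypothesis, alpha=0.05, power=0.80):
    """Approximate per-group sample size for a one-sided two-sample z-test."""
    mde = distance_from_null(true_diff, margin, hypothesis)
    if mde <= 0:
        raise ValueError("the assumed true difference lies inside the null")
    z = NormalDist()
    z_sum = z.inv_cdf(1 - alpha) + z.inv_cdf(power)
    return ceil(2 * (z_sum * sd / mde) ** 2)


# An effect of 3 over the baseline with a superiority margin of 1 gives an MDE of 2
print(distance_from_null(3, 1, "superiority"))
print(n_per_group(3, 0, sd=10, hypothesis="superiority"))      # classical one-sided test
print(n_per_group(3, 1, sd=10, hypothesis="superiority"))      # strong superiority
print(n_per_group(0, 1, sd=10, hypothesis="non-inferiority"))  # no true difference
```

A positive superiority margin shrinks the distance to the null and raises the sample size, while a non-inferiority margin makes it possible to power a test even when no true difference is expected.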

The type I error rate, **α**, should always be provided, while power, calculated as **1 - β**, where β is the type II error rate, is only required when calculating sample size. The type I error rate is equivalent to the significance threshold if you are doing p-value calculations and corresponds to a confidence level of 1 - α if you are using confidence intervals.

The **sample size calculator will output** the sample size of the single group or of each group, as well as the total sample size required. If used to solve for power, it will output the power as a proportion and as a percentage.

## Why is computing sample size important?

Estimating the required sample size before running an experiment or conducting a trial that will be judged by a statistical test (tests of significance, confidence intervals, etc.) allows you to **understand the magnitude of the effect you can detect with a certain power, or the power for a given effect size of interest**. This is crucial information with regard to making the test cost-efficient. Having a proper sample size can even mean the difference between conducting the experiment and postponing it until you can afford a sample size that is large enough to give you a good probability of detecting an effect of practical significance.

For example, if a medical trial has low power, say less than 80% (β > 0.2) for a given minimum effect of interest, then it might be unethical to conduct it, as it has a low probability of rejecting the null hypothesis and establishing the effectiveness of the treatment. The same holds for experiments in physics, psychology, economics, marketing, conversion rate optimization, etc. Balancing the risks and rewards and assuring the cost-effectiveness of an experiment is a difficult task that requires juggling the interests of many stakeholders. It is beyond the scope of this article.

## What is statistical power?

Statistical power is the **probability of rejecting a false null hypothesis with a given level of statistical significance**, against a particular alternative hypothesis. Alternatively, it can be said to be the probability of detecting, with a given level of significance, a true effect of a certain magnitude. Power is closely related to the **type II error** rate, β, and it is always equal to (1 - β). In probability notation, the type II error rate for a given point alternative can be expressed as ^{[1]}:

**β(T_{α}; μ_{1}) = P(d(X) ≤ c_{α}; μ = μ_{1})**

It should be understood that the type II error rate is calculated at a given point, signified by the presence of a parameter for the function of beta. Similarly, such a parameter is present in the expression for power since POW = 1 - β ^{[1]}:

**POW(T_{α}; μ_{1}) = P(d(X) > c_{α}; μ = μ_{1})**

In the equations above **c_{α}** represents the critical value for rejecting the null (significance threshold), d(X) is a statistical function of the parameter of interest, usually a transformation to a standardized score, and μ_{1} is a specific value from the space of the alternative hypothesis.
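To make the notation concrete, the sketch below evaluates the power expression for a one-sample, one-sided z-test. It assumes d(X) is the standardized sample mean, (x̄ - μ_{0})·√n / σ, which is one common choice rather than the only one; the function name and arguments are illustrative.

```python
from math import sqrt
from statistics import NormalDist


def power_one_sample(mu_null, mu_alt, sd, n, alpha=0.05):
    """POW(T_alpha; mu_1) = P(d(X) > c_alpha; mu = mu_1) for a one-sided z-test."""
    z = NormalDist()
    c_alpha = z.inv_cdf(1 - alpha)  # critical value for rejecting the null
    # Under mu = mu_alt, d(X) is normal with this mean and unit variance
    shift = (mu_alt - mu_null) * sqrt(n) / sd
    return 1 - z.cdf(c_alpha - shift)


power = power_one_sample(mu_null=10, mu_alt=12, sd=10, n=155)
print(f"power = {power:.3f} ({power:.1%}), beta = {1 - power:.3f}")
```

Evaluating the same function over a grid of `mu_alt` values traces out the power function for the whole alternative space.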

One can also plot the whole power function, getting an estimate of the power for many different alternative hypotheses. Due to the S-shape of the function, power quickly rises to nearly 100% for larger effect sizes, while it decreases more gradually to zero for smaller effect sizes.

Statistical power is directly related to the significance threshold α: relaxing the threshold (a larger α) increases power, while a stricter threshold (a smaller α) decreases it. At the zero effect point for a simple superiority alternative hypothesis, power is exactly α, since there the probability of rejecting the null equals the type I error rate. At the same time, power is positively related to sample size, so increasing the sample size will increase the power for a given effect size, assuming all other parameters remain the same.
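Both relationships can be checked numerically. The following sketch assumes a one-sided two-sample z-test on means with equal group sizes; the function name and the example numbers are made up for illustration.

```python
from math import sqrt
from statistics import NormalDist


def power_two_sample(mde, sd, n_per_group, alpha=0.05):
    """Approximate power of a one-sided two-sample z-test on means."""
    z = NormalDist()
    standard_error = sd * sqrt(2 / n_per_group)
    return 1 - z.cdf(z.inv_cdf(1 - alpha) - mde / standard_error)


# Larger samples raise power for the same effect size and alpha
power_by_n = {n: power_two_sample(mde=2, sd=10, n_per_group=n)
              for n in (100, 200, 400)}

# A stricter (smaller) alpha lowers power for the same effect size and sample size
power_by_alpha = {a: power_two_sample(mde=2, sd=10, n_per_group=200, alpha=a)
                  for a in (0.10, 0.05, 0.01)}

for n, value in power_by_n.items():
    print(f"n per group = {n}: power = {value:.3f}")
for a, value in power_by_alpha.items():
    print(f"alpha = {a}: power = {value:.3f}")
```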

### Post-hoc power (Observed power)

Power calculations can be useful even after a test has been completed since failing to reject the null can be used as an argument for the null and against particular alternative hypotheses to the extent to which the test had power to reject them. This is more explicitly defined in the severe testing concept proposed by Mayo & Spanos (2006).

Computing observed power is only useful if there was no rejection of the null hypothesis and we are interested in estimating **how probative the test was towards the null**. It is absolutely **useless** to compute post-hoc power for a test which resulted in a statistically significant effect being found ^{[5]}. If the effect is significant, then the test had enough power to detect it. In fact, there is a one-to-one inverse relationship between observed power and the p-value, so you gain nothing from computing post-hoc power. For example, a test planned for α = 0.05 that passed with a p-value of just 0.0499 will have almost exactly 50% observed power (observed β ≈ 0.5).
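The link between the p-value and observed power can be reproduced with a few lines. This sketch assumes a one-sided z-test and defines observed power as the power computed with the observed effect plugged in as the true effect; the function name is hypothetical.

```python
from statistics import NormalDist


def observed_power(p_value, alpha=0.05):
    """Observed (post-hoc) power of a one-sided z-test, taking the
    observed effect as if it were the true effect."""
    z = NormalDist()
    z_observed = z.inv_cdf(1 - p_value)  # z-score implied by the p-value
    c_alpha = z.inv_cdf(1 - alpha)       # critical value for the chosen alpha
    return 1 - z.cdf(c_alpha - z_observed)


for p in (0.20, 0.05, 0.0499, 0.01):
    print(f"p-value = {p}: observed power = {observed_power(p):.4f}")
```

Because observed power is a monotone function of the p-value alone (for a fixed α), it carries no information beyond what the p-value already provides; a p-value right at the threshold always maps to an observed power of about one half.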

I strongly encourage using this power and sample size calculator to compute observed power in the former case, and strongly discourage it in the latter.

## Types of null and alternative hypotheses

When doing sample size calculations, it is important that you know what your null hypothesis is (H_{0}, the hypothesis being tested) and what the alternative hypothesis is (H_{1}). The test can reject the null or it can fail to reject the null. Strictly logically speaking it cannot lead to accepting the null or to accepting the alternative hypothesis. A null hypothesis can be a **point** one - hypothesizing that the true value is an exact point from the possible values, or a **composite** one: covering many possible values, usually from -∞ to some value or from some value to +∞. The alternative hypothesis can also be a point one or a composite one.

In a Neyman-Pearson framework of NHST (Null-Hypothesis Statistical Test) the alternative should exhaust all values that do not belong to the null, so it is usually composite. Possible combinations of null and alternative statistical hypotheses include superiority, non-inferiority, strong superiority (margin > 0), and equivalence.

All of these are supported in our calculator for power and sample size calculations.

Careful consideration is needed when **deciding on a non-inferiority margin, superiority margin or an equivalence margin**. Equivalence trials are sometimes used in clinical trials where a drug can be performing equally (within some bounds) to an existing drug but can still be preferred due to fewer or less severe side effects, cheaper manufacturing, or other benefits. However, non-inferiority designs are more common. Similar cases exist in disciplines such as conversion rate optimization ^{[2]} and other business applications where benefits not measured by the primary outcome of interest can influence the adoption of a given solution. For equivalence tests it is assumed that they will be evaluated using two one-sided t-tests (TOST) or z-tests, or confidence intervals.

You will note that our calculator does not support the schoolbook case of a point null and a point alternative, nor a point null and an alternative that covers all the remaining values. This is because such cases are non-existent in experimental practice ^{[3][4]}. The only two-sided calculation is for the equivalence alternative hypothesis; all other calculations are **one-sided (one-tailed)**.

## Absolute versus relative difference

When making power and sample size calculations it is important to know what kind of inference you are looking to make: about the absolute or about the relative difference, often called percent effect, percentage effect, relative change, percent lift, etc. Where the first is **μ_{1} - μ**, the second is **(μ_{1} - μ) / μ** or **(μ_{1} - μ) / μ × 100** (%). The division by μ is what adds more variance to such an estimate, since μ is just another variable with random error, therefore a test for relative difference will require larger sample size than a test for absolute difference. Consequently, if sample size is fixed, there will be less power for the relative change equivalent to any given absolute change.

For the above reason it is important to know and state beforehand if you are going to be interested in percentage change or if absolute change is of primary interest.
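The loss of power for relative differences can be illustrated with a delta-method approximation of the standard error of (x̄_{1} - x̄) / x̄. This is a sketch under several assumptions: a one-sided z-test, equal group sizes, the same standard deviation in both groups, a positive baseline mean, and a positive true lift. The delta method is used here for illustration and is not necessarily the approach taken by the calculator.

```python
from math import sqrt
from statistics import NormalDist


def power_absolute(mu0, mu1, sd, n, alpha=0.05):
    """Power for the absolute difference mu1 - mu0 (n observations per group)."""
    z = NormalDist()
    standard_error = sd * sqrt(2 / n)
    return 1 - z.cdf(z.inv_cdf(1 - alpha) - (mu1 - mu0) / standard_error)


def power_relative(mu0, mu1, sd, n, alpha=0.05):
    """Power for the relative difference (mu1 - mu0) / mu0 (n per group)."""
    z = NormalDist()
    ratio = mu1 / mu0
    # Delta-method standard error: the extra ratio**2 term comes from the
    # random error in the baseline mean that appears in the denominator
    standard_error = (sd / mu0) * sqrt((1 + ratio ** 2) / n)
    return 1 - z.cdf(z.inv_cdf(1 - alpha) - (ratio - 1) / standard_error)


# The same true effect, expressed as +1 (absolute) or +10% (relative)
print(f"absolute: {power_absolute(10, 11, sd=5, n=300):.3f}")
print(f"relative: {power_relative(10, 11, sd=5, n=300):.3f}")
```

With identical data and sample size, the relative-difference test shows lower power in this setup, in line with the reasoning above.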

#### References

[1] Mayo D.G., Spanos A. (2010) – "Error Statistics", in P. S. Bandyopadhyay & M. R. Forster (Eds.), Philosophy of Statistics, (7, 152–198). *Handbook of the Philosophy of Science*. The Netherlands: Elsevier.

[2] Georgiev G.Z. (2017) "The Case for Non-Inferiority A/B Tests", [online] http://blog.analytics-toolkit.com/2017/case-non-inferiority-designs-ab-testing/ (accessed May 7, 2018)

[3] Georgiev G.Z. (2017) "One-tailed vs Two-tailed Tests of Significance in A/B Testing", [online] http://blog.analytics-toolkit.com/2017/one-tailed-two-tailed-tests-significance-ab-testing/ (accessed May 7, 2018)

[4] Cho H.-C., Abe S. (2013) "Is two-tailed testing for directional research hypotheses tests legitimate?", *Journal of Business Research* 66:1261-1266

[5] Lakens D. (2014) "Observed power, and what to do if your editor asks for post-hoc power analyses" [online] http://daniellakens.blogspot.bg/2014/12/observed-power-and-what-to-do-if-your.html (accessed May 7, 2018)

#### Cite this calculator & page

If you'd like to cite this online calculator resource and information as provided on the page, you can use the following citation:

Georgiev G.Z., *"Sample Size Calculator"*, [online] Available at: https://www.gigacalculator.com/calculators/power-sample-size-calculator.php URL [Accessed Date: 10 Dec, 2019].