# Statistical Inference Formulas

Since I could not find a list of formulas anywhere, I thought I would create one.

### Notation

The notation used is fairly standard, but it is here for completeness.

| Notation | Meaning |
| --- | --- |
| $\mbox{df}$ | Degrees of freedom |
| $\mbox{SE}$ | Standard error |
| Confidence $\mbox{SE}$ | Confidence interval standard error |
| $n$ | Sample size |
| $s$ | Sample standard deviation |
| $s^2$ | Sample variance |
| $\bar{X}$ | Sample mean |
| $N$ | Population size |
| $\mu$ | Population mean |
| $\sigma$ | Population standard deviation |
| $\sigma^2$ | Population variance |
| $s_p^2$ | Pooled variance |
| $\alpha$ | The significance level for a hypothesis test (or the probability of a type I error) |

### Hypothesis Testing

- $H_0$ and $H_a$ are the null and alternative hypotheses, respectively.
- The p-value: the probability of observing a test statistic at least as extreme as the one computed, given that the null hypothesis is true.
  - Find the probability of a value less than or equal to the (absolute value of the) test statistic under the chosen null distribution, using its CDF. Subtract the result from 1, since we want the probability of a *more* extreme value.
  - For a one-sided (upper-tail) test, this is the p-value.
  - For a two-sided test, multiply by 2, since values in both tails count as more extreme.
- Rejection region: if the absolute value of the test statistic exceeds the critical value, reject the null hypothesis. The critical value is chosen so that, when $H_0$ is true, the test statistic lands in the rejection region with probability $\alpha$.
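The two-sided p-value recipe above can be sketched in plain Python for a z statistic, using the standard library's normal distribution (the function name is my own):

```python
from statistics import NormalDist

def z_p_value(z, two_sided=True):
    """p-value for a z test statistic under the standard normal null distribution."""
    # Upper-tail probability: 1 - CDF(|z|) is the chance of a more extreme value.
    upper_tail = 1 - NormalDist().cdf(abs(z))
    # A two-sided test counts extreme values in both tails, so multiply by 2.
    return 2 * upper_tail if two_sided else upper_tail

# z = 1.96 sits at the boundary of the usual 5% two-sided rejection region.
print(round(z_p_value(1.96), 3))  # → 0.05
```

The one-sided branch as written handles an upper-tail test; a lower-tail test would use the CDF directly.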

### Single Population

| Test | df | Test SE | Test statistic | Confidence SE | Confidence interval |
| --- | --- | --- | --- | --- | --- |
| $\mu$ when $\sigma$ known | - | $\mbox{SE} = \frac{\sigma}{\sqrt{n}}$ | $z = \frac{\bar{X} - \mu_0}{\mbox{SE}}$ | $\mbox{SE} = \frac{\sigma}{\sqrt{n}}$ | $\bar{X} \pm z(1 - \frac{\alpha}{2})\, \mbox{SE}$ |
| $\mu$ when $\sigma$ unknown | $\mbox{df} = n-1$ | $\mbox{SE} = \frac{s}{\sqrt{n}}$ | $t = \frac{\bar{X} - \mu_0}{\mbox{SE}}$ | $\mbox{SE} = \frac{s}{\sqrt{n}}$ | $\bar{X} \pm t_{n-1}(1 - \frac{\alpha}{2})\, \mbox{SE}$ |
| Population proportion $p$ | - | $\mbox{SE} = \sqrt{\frac{p_0 (1 - p_0)}{n}}$ | $z = \frac{\hat{p} - p_0}{\mbox{SE}}$ | $\mbox{SE} = \sqrt{\frac{\hat{p} (1 - \hat{p})}{n}}$ | $\hat{p} \pm z(1 - \frac{\alpha}{2})\, \mbox{SE}$ |

Here $\mu_0$ and $p_0$ are the hypothesized values under $H_0$.
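As a sketch of the proportion row (function name is my own; note the test SE uses $p_0$ while the confidence SE uses $\hat{p}$):

```python
from math import sqrt
from statistics import NormalDist

def proportion_test(p_hat, p0, n, alpha=0.05):
    """z test statistic and confidence interval for a population proportion."""
    se_test = sqrt(p0 * (1 - p0) / n)      # test SE uses the null value p0
    z = (p_hat - p0) / se_test
    se_ci = sqrt(p_hat * (1 - p_hat) / n)  # CI SE uses the estimate p_hat
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return z, (p_hat - z_crit * se_ci, p_hat + z_crit * se_ci)

# 60 successes in 100 trials, testing H0: p = 0.5
z, (lo, hi) = proportion_test(p_hat=0.6, p0=0.5, n=100)
```

Here `z` comes out to 2.0 and the 95% interval is roughly (0.504, 0.696).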

### Two Populations

| Test | df | Test SE | Test statistic | Confidence SE | Confidence interval |
| --- | --- | --- | --- | --- | --- |
| Two means, equal variances | $\mbox{df} = n_1 + n_2 - 2$ | $\mbox{SE} = \sqrt{\frac{(n_1-1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}}\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}$ | $t = \frac{(\bar{X}_1 - \bar{X}_2) - d_0}{\mbox{SE}}$ | $\mbox{SE} = \sqrt{\frac{(n_1-1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}}\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}$ | $(\bar{X}_1 - \bar{X}_2) \pm t_{\mbox{df}}(1 - \frac{\alpha}{2})\, \mbox{SE}$ |
| Two means, unequal variances | $\mbox{df} \approx \frac{\left(\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}\right)^2}{\frac{s_1^4}{n_1^2 (n_1 - 1)} + \frac{s_2^4}{n_2^2 (n_2 - 1)}}$ \* | $\mbox{SE} = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$ | $t = \frac{(\bar{X}_1 - \bar{X}_2) - d_0}{\mbox{SE}}$ | $\mbox{SE} = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$ | $(\bar{X}_1 - \bar{X}_2) \pm t_{\mbox{df}}(1 - \frac{\alpha}{2})\, \mbox{SE}$ |

Notes:

- \*: If the conservative $\mbox{df}$ is used instead, take the minimum of $n_1 - 1$ and $n_2 - 1$.
- $d_0$ is the hypothesized difference between the two means (often 0).
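The pooled two-sample statistic and the Welch-Satterthwaite df can be sketched in plain Python (function names are my own; `variance` uses the $n-1$ denominator):

```python
from math import sqrt
from statistics import mean, variance

def pooled_two_sample(x, y):
    """Equal-variance two-sample t statistic (d0 = 0) and its df."""
    n1, n2 = len(x), len(y)
    s1, s2 = variance(x), variance(y)  # sample variances, n - 1 denominator
    sp2 = ((n1 - 1) * s1 + (n2 - 1) * s2) / (n1 + n2 - 2)  # pooled variance
    se = sqrt(sp2) * sqrt(1 / n1 + 1 / n2)
    return (mean(x) - mean(y)) / se, n1 + n2 - 2

def welch_df(var1, n1, var2, n2):
    """Welch-Satterthwaite approximate df for the unequal-variance case."""
    num = (var1 / n1 + var2 / n2) ** 2
    den = var1**2 / (n1**2 * (n1 - 1)) + var2**2 / (n2**2 * (n2 - 1))
    return num / den

t, df = pooled_two_sample([1, 2, 3, 4], [2, 3, 4, 5])
```

When the two sample variances are equal, `welch_df` reduces to $n_1 + n_2 - 2$, matching the pooled df.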

### ANOVA

| Source of variance | $\mbox{df}$ | Sum of squares | Mean square | F statistic | p-value |
| --- | --- | --- | --- | --- | --- |
| Treatment | $I-1$ | $\mbox{SST} = \sum_{i=1}^I n_i (\bar{X}_i - \bar{X})^2$ | $\mbox{MST} = \frac{\mbox{SST}}{I-1}$ | $f = \frac{\mbox{MST}}{\mbox{MSE}}$ | \* |
| Error | $n-I$ | $\mbox{SSE} = \sum_{i=1}^I (n_i - 1) s^2_i$ \*\* | $\mbox{MSE} = \frac{\mbox{SSE}}{n-I}$ | | |
| Total | $n-1$ | $\mbox{SST} + \mbox{SSE}$ | | | |

Notes:

- \*: $f$ follows an F distribution with degrees of freedom $I-1$ and $n-I$. $I-1$ corresponds to the numerator and $n-I$ to the denominator.
- \*\*: Equivalently $(n-I)\, s_p^2$, where $s_p^2$ is the pooled variance of all $I$ groups.
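The ANOVA table quantities can be sketched from raw samples in plain Python (function name is my own):

```python
from statistics import mean, variance

def one_way_anova(groups):
    """Compute the one-way ANOVA table quantities from a list of samples."""
    I = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = mean(x for g in groups for x in g)
    # Treatment sum of squares: between-group variation, weighted by group size.
    sst = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Error sum of squares: pooled within-group variation.
    sse = sum((len(g) - 1) * variance(g) for g in groups)
    mst = sst / (I - 1)
    mse = sse / (n - I)
    return {"sst": sst, "sse": sse, "mst": mst, "mse": mse,
            "f": mst / mse, "df": (I - 1, n - I)}

res = one_way_anova([[1, 2, 3], [3, 4, 5], [5, 6, 7]])
```

For this toy data the group means are 2, 4, 6, giving SST = 24, SSE = 6, and $f = 12$ on (2, 6) df.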

### Contrasts and Linear Combinations of Group Means

- $\gamma = \sum_{i=1}^I c_i \mu_i$
- Estimated with $g = \sum_{i=1}^I c_i \bar{X}_i$
- If the coefficients sum to 0 ($\sum_{i=1}^I c_i = 0$), the linear combination is called a contrast.
- $\mbox{SE}(g) = s_p \sqrt{\sum_{i=1}^I \frac{c_i^2}{n_i}}$
- Use $s_p = \sqrt{\mbox{MSE}}$ from the ANOVA table, with the same degrees of freedom ($n-I$). Even when a contrast compares fewer than $I$ groups, all groups still contribute to the pooled variance; the groups not compared simply receive a coefficient of 0.
- Confidence interval: $g \pm t_{n-I}(1 - \frac{\alpha}{2}) \times \mbox{SE}(g)$
- Test statistic: $t = \frac{g - \gamma}{\mbox{SE}\left(g\right)}$, where $\gamma$ is the hypothesized value (often 0)
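The contrast formulas can be sketched in plain Python (function name is my own; `mse` is taken from an ANOVA table computed separately):

```python
from math import sqrt
from statistics import mean

def contrast(groups, coeffs, mse):
    """Estimate g, its standard error, and the t statistic for gamma = 0."""
    assert abs(sum(coeffs)) < 1e-12, "coefficients must sum to 0 for a contrast"
    g = sum(c * mean(grp) for c, grp in zip(coeffs, groups))
    # SE(g) = s_p * sqrt(sum c_i^2 / n_i), with s_p = sqrt(MSE)
    se = sqrt(mse) * sqrt(sum(c**2 / len(grp) for c, grp in zip(coeffs, groups)))
    return g, se, g / se

# Compare groups 1 and 2 (coefficient 0 drops group 3 from g, not from s_p).
g, se, t = contrast([[1, 2, 3], [3, 4, 5], [5, 6, 7]], [1, -1, 0], mse=1.0)
```

Here $g = 2 - 4 = -2$ and $\mbox{SE}(g) = \sqrt{1/3 + 1/3} \approx 0.816$.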

### Post-ANOVA Comparison Methods

- Confidence intervals are constructed as $\mbox{Estimate} \pm \mbox{Multiplier} \times \mbox{SE}\left(\mbox{Estimate}\right)$.
- Since a pairwise comparison is really a contrast with coefficients $1$ and $-1$ for the two groups and $0$ for all others, the $\mbox{MSE}$ from the ANOVA table is used for $s_p^2$.

| Method | Multiplier | Notes |
| --- | --- | --- |
| Least Significant Difference (LSD) | $t_{n-I}(1 - \frac{\alpha}{2})$ | No attempt to control the family-wise error rate. If the ANOVA F-test is run first and is significant, this is the F-protected LSD method. |
| Bonferroni | $t_{n-I}(1 - \frac{\alpha}{2k})$ | Used for $k$ planned pairwise comparisons. |
| Tukey-Kramer | $\frac{q_{I,n-I,1-\alpha}}{\sqrt{2}}$ | Used for all pairwise comparisons; more conservative than the two methods above; uses the Studentized range distribution for the multiplier. |
| Scheffé | $\sqrt{(I-1)F_{I-1,n-I}(1 - \alpha)}$ | Used for all contrasts; the most conservative method. |

### Statistical Models

- Degrees of freedom for the model: the number of parameters in the model that are free to vary.
- A reduced model is a model with fewer parameters (the full model with some parameters fixed).
- To test for lack of fit between two nested models:
  - $H_0$: the reduced model adequately explains the data. $H_a$: the full model is required to adequately explain the data.
  - Test statistic: $F = \frac{\frac{\text{ss}(error)_{red} - \text{ss}(error)_{full}}{\text{df}(error)_{red} - \text{df}(error)_{full}}}{\frac{\text{ss}(error)_{full}}{\text{df}(error)_{full}}}$
  - $F$ follows an F distribution with degrees of freedom $\text{df}(error)_{red} - \text{df}(error)_{full}$ (numerator) and $\text{df}(error)_{full}$ (denominator).
  - If the p-value of $F$ is less than the significance level $\alpha$, reject $H_0$: the full model is required to adequately explain the data.
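The F statistic itself is just arithmetic over the two error sums of squares; a quick sketch in Python (the input values below are hypothetical, and turning $F$ into a p-value still requires an F distribution from a table or a statistics library):

```python
def lack_of_fit_f(sse_red, df_red, sse_full, df_full):
    """F statistic comparing a reduced model to the full model it nests in."""
    # Improvement in fit per extra parameter spent...
    numerator = (sse_red - sse_full) / (df_red - df_full)
    # ...relative to the full model's error variance.
    denominator = sse_full / df_full
    return numerator / denominator  # compare to F(df_red - df_full, df_full)

# Hypothetical values: the full model removes 20 units of SSE at a cost of 2 df.
f = lack_of_fit_f(sse_red=100.0, df_red=18, sse_full=80.0, df_full=16)
```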

### Regression

#### Simple Linear Regression (Least Squares)

- $Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$
- Estimated with $\hat{Y}_i = b_0 + b_1 X_i$
- Sums of squares:
  - $\mbox{ss}_{XY} = \sum_{i=1}^n (X_i - \bar{X})(Y_i - \bar{Y})$
  - $\mbox{ss}_X = \sum_{i=1}^n (X_i - \bar{X})^2$
  - $\mbox{ss}_Y = \sum_{i=1}^n (Y_i - \bar{Y})^2$
- The line of best fit always passes through the point $(\bar{X}, \bar{Y})$.
- The slope estimate: $b_1 = \frac{\mbox{ss}_{XY}}{\mbox{ss}_X}$
- The intercept estimate: $b_0 = \bar{Y} - b_1 \bar{X}$
- $\mbox{ss}_e = \sum_{i=1}^n (Y_i - \hat{Y}_i)^2$ is the residual (error) sum of squares.
- $\mbox{ss}_r = \sum_{i=1}^n (\hat{Y}_i - \bar{Y})^2$ is the sum of squares explained by the model.
- $\mbox{ss}_t = \sum_{i=1}^n (Y_i - \bar{Y})^2$, the same as $\mbox{ss}_Y$ above; note $\mbox{ss}_t = \mbox{ss}_r + \mbox{ss}_e$.
- The coefficient of determination $R^2$:
  - The proportion of the total sum of squares explained by the model.
  - $R^2 = \frac{\mbox{ss}_r}{\mbox{ss}_t}$
  - Equivalently, $R^2 = 1 - \frac{\mbox{ss}_e}{\mbox{ss}_t}$
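The least squares formulas above can be sketched in plain Python (function name is my own):

```python
from statistics import mean

def simple_linear_regression(X, Y):
    """Least squares slope b1, intercept b0, and R^2."""
    xbar, ybar = mean(X), mean(Y)
    ss_xy = sum((x - xbar) * (y - ybar) for x, y in zip(X, Y))
    ss_x = sum((x - xbar) ** 2 for x in X)
    b1 = ss_xy / ss_x
    b0 = ybar - b1 * xbar  # forces the line through (xbar, ybar)
    ss_e = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(X, Y))
    ss_t = sum((y - ybar) ** 2 for y in Y)
    return b1, b0, 1 - ss_e / ss_t

b1, b0, r_squared = simple_linear_regression([1, 2, 3, 4], [2, 3, 5, 6])
```

For this toy data, $b_1 = 1.4$, $b_0 = 0.5$, and $R^2 = 0.98$.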

#### Multiple Regression

- $Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \dots + \beta_p X_{ip} + \varepsilon_i$
- Predicted with $\hat{Y}_i = b_0 + b_1 X_{i1} + \dots + b_p X_{ip}$, as per usual. Note the lack of $\varepsilon$ in the prediction equation.
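The coefficient estimates solve the normal equations $(X^T X)\, b = X^T Y$. A minimal sketch in plain Python for small systems (function names are my own; a real implementation would use a linear algebra library):

```python
def solve(A, b):
    """Gaussian elimination with partial pivoting for a small linear system."""
    n = len(A)
    M = [row[:] + [b_i] for row, b_i in zip(A, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            M[r] = [a - factor * c for a, c in zip(M[r], M[col])]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def multiple_regression(rows, y):
    """Fit b via the normal equations, prepending an intercept column of 1s."""
    X = [[1.0] + list(r) for r in rows]
    p = len(X[0])
    xtx = [[sum(row[i] * row[j] for row in X) for j in range(p)] for i in range(p)]
    xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(p)]
    return solve(xtx, xty)  # [b0, b1, ..., bp]

# Data generated from Y = 1 + 2*x1 + 3*x2 exactly, so the fit recovers (1, 2, 3).
b = multiple_regression([(0, 1), (1, 0), (2, 1), (3, 0)], [4, 3, 8, 7])
```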