# Statistical Inference Formulas

Since I could not find a list of formulas anywhere, I thought I would create one.

### Notation

The notation used is fairly standard, but it is here for completeness.

Notation | Meaning |
---|---|
\(\mbox{df}\) | Degrees of freedom |
\(\mbox{SE}\) | Standard error |
Confidence SE | Standard error used for the confidence interval |
\(n\) | Sample size |
\(s\) | Sample standard deviation |
\(s^2\) | Sample variance |
\(\bar{X}\) | Sample mean |
\(N\) | Population size |
\(\mu\) | Population mean |
\(\sigma\) | Population standard deviation |
\(\sigma^2\) | Population variance |
\(s_p^2\) | Pooled variance |
\(\alpha\) | The significance level for a hypothesis test (or the probability of a type I error) |

### Hypothesis Testing

- \(H_0\) and \(H_a\) are the null and alternative hypotheses, respectively
- The p-value: the probability of observing a test statistic at least as extreme as the one computed, given that the null hypothesis is true.
    - Find the probability of a value being less than or equal to the absolute value of the test statistic under the chosen distribution (using its CDF). Subtract that probability from 1, since we are looking for more extreme values.
    - For a one-sided test, this tail probability is the p-value.
    - For a two-sided test, multiply the tail probability by 2 (since values can be extreme on either side)

- Rejection region: if the absolute value of the test statistic exceeds the critical value, reject the null hypothesis. The critical value is chosen so that the probability of rejecting a true null hypothesis equals the significance level \(\alpha\)
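A minimal sketch of the p-value recipe above for a z statistic, using only Python's standard library (the standard normal CDF is computed from the error function; the function names are just for illustration):

```python
from math import erf, sqrt

def norm_cdf(x: float) -> float:
    """CDF of the standard normal distribution, via the error function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def p_value(z: float, two_sided: bool = True) -> float:
    """P-value for a z test statistic, following the steps above."""
    upper_tail = 1.0 - norm_cdf(abs(z))  # probability of a more extreme value, one side
    return 2.0 * upper_tail if two_sided else upper_tail

print(round(p_value(1.96), 3))  # → 0.05
```

At \(z = 1.96\) the two-sided p-value is about 0.05, matching the familiar 5% critical value.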

### Single Population

Test | df | Test SE | Test statistic | Confidence SE | Confidence interval |
---|---|---|---|---|---|
\(\mu\) when \(\sigma\) known | - | \(\mbox{SE} = \frac{\sigma}{\sqrt{n}}\) | \(z = \frac{\bar{X} - \mu_0}{\mbox{SE}}\) | \(\mbox{SE} = \frac{\sigma}{\sqrt{n}}\) | \(\bar{X} \pm z(1 - {\alpha \over 2}) \mbox{SE}\) |
\(\mu\) when \(\sigma\) unknown | \(\mbox{df} = n-1\) | \(\mbox{SE} = \frac{s}{\sqrt{n}}\) | \(t = \frac{\bar{X} - \mu_0}{\mbox{SE}}\) | \(\mbox{SE} = \frac{s}{\sqrt{n}}\) | \(\bar{X} \pm t_{n-1}(1 - \frac{\alpha}{2}) \mbox{SE}\) |
Population proportion \(p\) | - | \(\mbox{SE} = \sqrt{\frac{p_0 (1 - p_0)}{n}}\) | \(z = \frac{\hat{p} - p_0}{\mbox{SE}}\) | \(\mbox{SE} = \sqrt{\frac{\hat{p} (1 - \hat{p})}{n}}\) | \(\hat{p} \pm z(1 - {\alpha \over 2}) \mbox{SE}\) |

Here \(\mu_0\) and \(p_0\) are the hypothesized values under \(H_0\).
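A worked sketch of the proportion row (the counts are made up). Note the asymmetry: the test SE uses the hypothesized \(p_0\), while the confidence SE uses the estimate \(\hat{p}\):

```python
from math import erf, sqrt

def norm_cdf(x: float) -> float:
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

# Hypothetical data: 58 successes in 100 trials; H0: p = 0.5
n, successes, p0 = 100, 58, 0.5
p_hat = successes / n

se_test = sqrt(p0 * (1 - p0) / n)         # SE under H0, used for the test
z = (p_hat - p0) / se_test
p_value = 2 * (1 - norm_cdf(abs(z)))      # two-sided p-value

se_ci = sqrt(p_hat * (1 - p_hat) / n)     # SE from the estimate, used for the CI
z_crit = 1.96                             # z(1 - alpha/2) for alpha = 0.05
ci = (p_hat - z_crit * se_ci, p_hat + z_crit * se_ci)
```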

### Two Populations

Test | df | Test SE | Test statistic | Confidence SE | Confidence interval |
---|---|---|---|---|---|
Two means, equal variances | \(\mbox{df} = n_1 + n_2 - 2\) | \(\mbox{SE} = \sqrt{\frac{(n_1-1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}}\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}\) | \(t = \frac{(\bar{X}_1 - \bar{X}_2) - d_0}{\mbox{SE}}\) | \(\mbox{SE} = \sqrt{\frac{(n_1-1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}}\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}\) | \((\bar{X}_1 - \bar{X}_2) \pm t_{\mbox{df}}(1 - \frac{\alpha}{2}) \mbox{SE}\) |
Two means, unequal variances | \(\mbox{df} \approx \frac{({s_{1}^2 \over n_1} + {s_2^2 \over n_2})^2}{{s_1^4 \over n_1^2 (n_1 - 1)} + {s_2^4 \over n_2^2 (n_2 - 1)}}\) * | \(\mbox{SE} = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}\) | \(t = \frac{(\bar{X}_1 - \bar{X}_2) - d_0}{\mbox{SE}}\) | \(\mbox{SE} = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}\) | \((\bar{X}_1 - \bar{X}_2) \pm t_{\mbox{df}}(1 - \frac{\alpha}{2}) \mbox{SE}\) |

Notes:

- *: If the conservative \(\mbox{df}\) is used instead, take the minimum of \(n_1 - 1\) and \(n_2 - 1\)
- \(d_0\) is the hypothesized difference between the two means under \(H_0\) (often 0)
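The approximate (Welch-Satterthwaite) \(\mbox{df}\) formula is easy to get wrong by hand; a short sketch with made-up summary statistics:

```python
# Welch-Satterthwaite df and SE for two means with unequal variances.
# The summary statistics below are made up for illustration.
s1, n1 = 2.5, 12    # sample standard deviation and size, group 1
s2, n2 = 4.0, 15    # sample standard deviation and size, group 2

v1, v2 = s1**2 / n1, s2**2 / n2          # per-group variance of the mean
se = (v1 + v2) ** 0.5                    # SE for the difference in means
df = (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))
df_conservative = min(n1 - 1, n2 - 1)    # the simpler alternative from note (*)
```

The approximate df (here about 23.8) is typically well above the conservative choice (here 11), so the conservative rule gives wider intervals.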

### ANOVA

Source of Variance | \(\mbox{df}\) | Sum of squares | Mean square | F statistic | p-value |
---|---|---|---|---|---|
Treatment | \(I-1\) | \(\mbox{sst} = \sum_{i=1}^I n_i (\bar{X}_i - \bar{X})^2\) | \(\mbox{MST} = \frac{\mbox{sst}}{I-1}\) | \(f = \frac{\mbox{MST}}{\mbox{mse}}\) | * |
Error | \(n-I\) | \(\mbox{sse} = \sum_{i=1}^I (n_i - 1) s^2_i\) | \(\mbox{mse} = \frac{\mbox{sse}}{n-I}\) ** | | |
Total | \(n-1\) | \(\mbox{sst} + \mbox{sse}\) | | | |

Notes:

- *: \(f\) follows an F distribution with degrees of freedom equal to \(I-1\) and \(n-I\). \(I-1\) corresponds to the numerator and \(n-I\) the denominator.
- **: This is the pooled variance for all \(I\) groups
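The whole table can be computed from raw data in a few lines. A sketch with made-up data (three groups of three observations), not a full ANOVA routine:

```python
# One-way ANOVA sums of squares, following the table above.
groups = [
    [4.0, 5.0, 6.0],
    [7.0, 8.0, 9.0],
    [1.0, 2.0, 3.0],
]

I = len(groups)                              # number of groups
n = sum(len(g) for g in groups)              # total sample size
grand_mean = sum(sum(g) for g in groups) / n

# Treatment SS: weighted squared deviations of group means from the grand mean
sst = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
# Error SS: squared deviations of observations from their own group mean
sse = sum(sum((x - sum(g) / len(g)) ** 2 for x in g) for g in groups)

mst = sst / (I - 1)
mse = sse / (n - I)
f = mst / mse   # compare against an F distribution with (I-1, n-I) df
```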

### Contrasts and Linear Combinations of Group Means

- \(\gamma = \sum_{i=1}^I c_i \mu_i\)
- Estimated with \(g = \sum_{i=1}^I c_i \bar{X}_i\)
- The linear combination is a contrast if the coefficients sum to 0, i.e. \(\sum_{i=1}^I c_i = 0\)
- \(\mbox{SE}(g) = s_p \sqrt{\sum_{i=1}^I \frac{c_i^2}{n_i}}\)
- Use \(s_p = \sqrt{\mbox{mse}}\) from the ANOVA table, with the same degrees of freedom (\(n-I\)). This holds even when a contrast compares fewer than \(I\) groups, since all groups are still represented (the unused groups just get coefficients of 0)
- Confidence Interval: \(g \pm t_{n-I}(1 - {\alpha \over 2}) \times \mbox{SE}(g)\)
- Test Statistic: \(t = \frac{g - \gamma}{\mbox{SE}\left(g\right)}\)
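A small sketch of estimating a contrast and its standard error (all numbers are made up; \(s_p\) would come from the ANOVA table's \(\mbox{mse}\)):

```python
# Contrast g = X̄_1 - X̄_2 among three group means.
means = [5.0, 8.0, 2.0]   # group sample means, X̄_i
sizes = [3, 3, 3]         # group sizes, n_i
c = [1.0, -1.0, 0.0]      # contrast coefficients (sum to 0)
s_p = 1.0                 # sqrt(mse) from the ANOVA table

assert abs(sum(c)) < 1e-12            # verify this is a contrast

g = sum(ci * m for ci, m in zip(c, means))
se_g = s_p * sum(ci**2 / ni for ci, ni in zip(c, sizes)) ** 0.5
t = g / se_g   # test statistic for H0: gamma = 0, with n - I df
```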

### Post-ANOVA Comparison Methods

- Confidence intervals are constructed with \(\mbox{Estimate} \pm \mbox{Multiplier} \times \mbox{SE}\left(\mbox{Estimate}\right)\)
- Since a pairwise comparison is really a contrast with coefficients 1 and -1 for the two groups (and 0 for all others), the \(\mbox{mse}\) from the ANOVA table is used for \(s_p^2\).

Method | Multiplier | Notes |
---|---|---|
Least Significant Difference (LSD) | \(t_{n-I}(1 - {\alpha \over 2})\) | No attempt to control the family-wise error rate. If ANOVA is run first, this is the F-protected LSD method |
Tukey-Kramer | \(\frac{q_{I,n-I,1-\alpha}}{\sqrt{2}}\) | Used for all pairwise comparisons; more conservative than the LSD method; uses the Studentized Range distribution for the multiplier |
Bonferroni | \(t_{n-I}(1 - {\alpha \over 2k})\) | Used for \(k\) pairwise comparisons |
Scheffe | \(\sqrt{(I-1)F_{I-1,n-I}(1 - \alpha)}\) | Used for all contrasts; the most conservative method |

### Statistical Models

- Degrees of freedom for the model are the number of parameters in the model that vary.
- A reduced model is a model with fewer parameters.
- To determine lack of fit between two models:
- \(H_0\): The reduced model adequately explains the data. \(H_a\): The full model is required to adequately explain the data.
- Test statistic: \(F = \frac{\frac{\text{ss}(error)_{red} - \text{ss}(error)_{full}}{\text{df}(error)_{red} - \text{df}(error)_{full}}}{\frac{\text{ss}(error)_{full}}{\text{df}(error)_{full}}}\)
- \(F\) follows an F distribution with degrees of freedom equal to \(\text{df}(error)_{red} - \text{df}(error)_{full}\) (numerator) and \(\text{df}(error)_{full}\) (denominator)
- If the p-value of \(F\) is less than the significance level \(\alpha\), reject \(H_0\): the full model is required to adequately explain the data.
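A quick arithmetic sketch of the extra-sum-of-squares F statistic above (the error sums of squares and degrees of freedom are made up):

```python
# Comparing a reduced model against a full model.
sse_red, df_red = 120.0, 18    # reduced model: SS(error), df(error)
sse_full, df_full = 80.0, 16   # full model: SS(error), df(error)

# Numerator: SS gained per extra parameter; denominator: full-model mse.
f = ((sse_red - sse_full) / (df_red - df_full)) / (sse_full / df_full)
# Compare f against an F distribution with (df_red - df_full, df_full) df.
```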

### Regression

#### Simple Linear Regression and Least Squares

- \(Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i\)
- Estimated with \(\hat{Y}_i = b_0 + b_1 X_i\)
- Sums:
- \(\mbox{ss}_{XY} = \sum_{i=1}^n (X_i - \bar{X})(Y_i - \bar{Y})\)
- \(ss_Y = \sum_{i=1}^n (Y_i - \bar{Y})^2\)
- \(ss_X = \sum_{i=1}^n (X_i - \bar{X})^2\)

- The line of best fit must pass through the point \((\bar{X}, \bar{Y})\)
- The slope estimate: \(b_1 = \frac{\mbox{ss}_{XY}}{\mbox{ss}_X}\)
- The intercept estimate: \(b_0 = \bar{Y} - b_1 \bar{X}\)
- \(ss_e = \sum_{i=1}^n (Y_i - \hat{Y}_i)^2\), the residual (error) sum of squares.
- \(ss_r = \sum_{i=1}^n (\hat{Y}_i - \bar{Y})^2\), the sum of squares explained by the model.
- \(ss_t = \sum_{i=1}^n (Y_i - \bar{Y})^2\), the same as \(ss_Y\) above.
- The coefficient of determination \(R^2\)
- The proportion of the total sum of squares that is explained by the model
- \(R^2 = \frac{ss_r}{ss_t}\)
- It can also be rearranged as \(R^2 = 1 - \frac{ss_e}{ss_t}\)
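The least-squares formulas above, applied to made-up data:

```python
# Simple linear regression via the sums defined above.
X = [1.0, 2.0, 3.0, 4.0]
Y = [2.0, 4.1, 5.9, 8.0]

n = len(X)
x_bar = sum(X) / n
y_bar = sum(Y) / n

ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(X, Y))
ss_x = sum((x - x_bar) ** 2 for x in X)
ss_t = sum((y - y_bar) ** 2 for y in Y)   # same as ss_Y

b1 = ss_xy / ss_x          # slope estimate
b0 = y_bar - b1 * x_bar    # intercept; the line passes through (x̄, ȳ)

ss_e = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(X, Y))
r_squared = 1 - ss_e / ss_t
```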

#### Multiple Regression

- \(Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \dots + \beta_I X_{iI} + \varepsilon_i\)
- Estimated with \(\hat{Y}_i = b_0 + b_1 X_{i1} + \dots + b_I X_{iI}\) as usual. Note the lack of \(\varepsilon_i\): a fitted value is a prediction, so it carries no error term.