Statistics Study Guide

Kobi Abayomi

Mathematics & Computer Science

Seton Hall

1. Topics Covered

2. The Big Picture

2.1. The Model, The Parameters

2.2. The Random Variable

2.3. The Probability Distribution

2.4. The functions on Probability Distribution

2.5. The Sample

2.6. The Central Limit Theorem

2.7. The Normal distribution; and is closed to Linear Transforms

2.8. Inference: Hypothesis Testing

2.9. Inference: Confidence Intervals

2.10. Inference: Bayesian

2.11. Bayesian Computation

3. Bayesian Posterior Intervals

1. Topics Covered

This class is different every time I teach it: below is what we usually cover in the 1st semester of an introductory sequence.

This Semester we Covered:

Illustrating Data: Histograms, Contingency Tables.
Statistics: Sample Average, Sample Variance, Order Statistics.
Probability and Experiments: Set Operations, The Sample Space, Independence, Conditional Distributions, Expectation, Variance, Covariance, Correlation.
Counting Methods: The Counting Principle, Permutations and Combinations.
Discrete Distributions: Bernoulli, Binomial, Negative Binomial.
Continuous Distributions: Uniform, Normal.
Statistics and Sampling: The Central Limit Theorem, The Distribution of the Sample Mean, The Likelihood.
Inference: The Test Statistic, Hypothesis Testing, Confidence Intervals, Bayesian Inference.

2. The Big Picture

Statistics is about saying what you believe about how the world works, based on what you have observed of it. Statistics is the science of experimentation: what we can observe, and all that we can say about it, is the result of our assumptions about the Experiment or Model that generated it.
We always start off with a model: we’ve used the Bernoulli experiment to illustrate.

2.1. The Model, The Parameters

(1)

ℙ(\text{Success}) = p

This is the simplest possible experiment: only two outcomes, Success or Failure.
We can enumerate the sample space for this experiment

(2)

Ω = \{\text{Success}, \text{Failure}\}

2.2. The Random Variable

And then assign a Random Variable as a mapping from the Sample Space (Ω) to the real numbers

(3)

X : Ω \to ℝ

For example, an outcome on the sample space is written $ω \in Ω$; a particular one is $ω = \text{Success}$. We typically assign

(4)

X(ω = \text{Success}) = 1

and $X(ω) = 0$ otherwise.

2.3. The Probability Distribution

Now we can immediately write a probability distribution for X

ℙ(X = x) = \begin{cases} p & \text{if } x = 1 \\ 1 - p & \text{if } x = 0 \end{cases}

This is its probability mass function (density function for continuous random variables).
Then we can write its distribution function immediately

F(x) = ℙ(X \leq x) = \begin{cases} 0 & \text{if } x < 0 \\ 1 - p & \text{if } 0 \leq x < 1 \\ 1 & \text{if } x \geq 1 \end{cases}

And so on with Expectation, Variance, Covariance (if we have more than one random variable)... all functions on the probability mass function.
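As a small illustration of the mass function, the distribution function, and the functions built on them, here is a minimal Python sketch. The function names and the value p = 0.3 are assumptions chosen for the example, not part of the course material.

```python
def bernoulli_pmf(x, p):
    """P(X = x) for a Bernoulli(p) random variable."""
    if x == 1:
        return p
    if x == 0:
        return 1 - p
    return 0.0


def bernoulli_cdf(x, p):
    """F(x) = P(X <= x): a step function with jumps at 0 and 1."""
    if x < 0:
        return 0.0
    if x < 1:
        return 1 - p
    return 1.0


p = 0.3  # assumed value, for illustration only
support = (0, 1)

# Expectation and variance are averages over the mass function.
mean = sum(x * bernoulli_pmf(x, p) for x in support)
variance = sum((x - mean) ** 2 * bernoulli_pmf(x, p) for x in support)
```

Here `mean` works out to p and `variance` to p(1 - p), which is the pattern used for the estimator later on.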

2.4. The functions on Probability Distribution

We know that these functions have nice properties and that Variance and Covariance are basically just Expectations: discrete or continuous averages taken over the probability mass (or density) function.

(5)

\mathrm{Var}(X) = E\left[(X - E(X))^{2}\right]

(6)

\mathrm{Cov}(X, Y) = E(XY) - E(X) \cdot E(Y)

We know how to transform the random variable and generate a new distribution for the new random variable after transformation. We know that expectation is a linear function so that linearity is respected.

(7)

E (a X + b) = a E (X) + b

From this we get results on summation and integration ‘passing’ through expectations. For sums this holds with or without independence; independence is what additionally lets variances add.

(8)

E\left(\sum_{i=1}^{n} X_{i}\right) = \sum_{i=1}^{n} E(X_{i})

We use these results to investigate the properties of multiple, repeated, often supposed identical and independent outcomes from the experiment: the Sample.
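Equations (5) through (8) can be checked numerically on a small example. The sketch below uses a made-up joint mass function for two 0/1 random variables; the probabilities and the constants a, b are assumptions for illustration only.

```python
# Hypothetical joint pmf for two 0/1 random variables (X, Y).
# The numbers are made up for illustration and sum to 1.
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}


def expect(g):
    """E[g(X, Y)]: average g over the joint pmf."""
    return sum(g(x, y) * prob for (x, y), prob in joint.items())


ex = expect(lambda x, y: x)
ey = expect(lambda x, y: y)
var_x = expect(lambda x, y: (x - ex) ** 2)      # equation (5)
cov_xy = expect(lambda x, y: x * y) - ex * ey   # equation (6)

a, b = 2.0, 5.0
lhs = expect(lambda x, y: a * x + b)            # E(aX + b)
rhs = a * ex + b                                # a E(X) + b, equation (7)

sum_lhs = expect(lambda x, y: x + y)            # E(X + Y)
sum_rhs = ex + ey                               # E(X) + E(Y), equation (8)
```

In this made-up table X and Y are correlated (`cov_xy` is not zero), and the expectation of the sum still equals the sum of the expectations.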

2.5. The Sample

Suppose we take $X_{1}, \ldots, X_{n}$, observations from our Bernoulli Random Variable model. We devise an estimator for the probability of success that is just the average of the observed successes

(9)

\hat{p} = \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_{i}

We can see right away from the properties of the Expectation

(10)

E (\hat{p}) = p

and

(11)

\mathrm{Var}(\hat{p}) = \frac{p(1 - p)}{n}
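Results (10) and (11) can be verified exactly, without simulation, by using the fact that the number of successes in n independent Bernoulli trials is Binomial(n, p). The following is a minimal sketch; the helper names and the values n = 20, p = 0.3 are assumptions for illustration.

```python
from math import comb


def binom_pmf(k, n, p):
    """P(K = k), where K counts successes in n independent Bernoulli(p) trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def phat_moments(n, p):
    """Exact mean and variance of p_hat = K / n, summing over k = 0..n."""
    mean = sum((k / n) * binom_pmf(k, n, p) for k in range(n + 1))
    var = sum((k / n - mean) ** 2 * binom_pmf(k, n, p) for k in range(n + 1))
    return mean, var


n, p = 20, 0.3  # assumed values, for illustration only
mean, var = phat_moments(n, p)
```

Up to floating-point error, `mean` equals p and `var` equals p(1 - p)/n.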

2.6. The Central Limit Theorem

The CLT tells us that when we average independent draws, the average tends to a Normal distribution, no matter the initial distribution we draw from (so long as it has a finite variance). It is enough to assume that the initial distribution, or model, is stable, and that we take enough samples and then average them.

(12)

\bar{X}_{n} \sim N\left(E(X), \frac{\mathrm{Var}(X)}{n}\right) \quad \text{approximately, as } n \to \infty

The parameters (i.e. the particular values which give us a particular distribution) of this Normal distribution for the average are governed by the initial mean and variance of the underlying Random Variable (Experiment/Process).
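One way to see the CLT at work for the Bernoulli model is to compare the exact distribution function of the sample average with its Normal approximation as n grows. The sketch below is an illustration under assumed values (p = 0.3 and n = 10, 100, 1000); the function name is hypothetical.

```python
from math import comb
from statistics import NormalDist


def max_cdf_gap(n, p):
    """Largest gap between the exact CDF of p_hat and its Normal
    approximation N(p, p(1 - p)/n), checked at the attainable values k/n."""
    approx = NormalDist(mu=p, sigma=(p * (1 - p) / n) ** 0.5)
    exact_cdf, worst = 0.0, 0.0
    for k in range(n + 1):
        exact_cdf += comb(n, k) * p**k * (1 - p) ** (n - k)
        worst = max(worst, abs(exact_cdf - approx.cdf(k / n)))
    return worst


p = 0.3  # assumed value, for illustration only
gaps = {n: max_cdf_gap(n, p) for n in (10, 100, 1000)}
```

The gap shrinks as n increases, which is the sense in which the average "tends to" a Normal distribution.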

2.7. The Normal distribution; and is closed to Linear Transforms

Two nice properties of the Normal Distribution are that its quantiles are easy to remember and that it is closed under linear transforms.

Set

(13)

Z = \frac{T - E(T)}{[\mathrm{Var}(T)]^{1/2}}

and by the properties of the Expectation

E(Z) = 0

\mathrm{Var}(Z) = 1

and if T ∼ Normal then Z ∼ Normal. Let’s write it again, with T = p̂, which is approximately Normal by the CLT:

(14)

Z = \frac{\hat{p} - p}{\left[\frac{p(1 - p)}{n}\right]^{1/2}} \sim N(0, 1)

2.8. Inference: Hypothesis Testing

So, we have an easy mapping from the sampling distribution (14),(10),(11) of our estimator p̂ (9) to the Z-score which is scaled in terms of standard deviation units.

But remember: What we really want is to say something about the true state of the Experiment/Model/Nature/The World.

One way is to set up a choice between beliefs about the parameters of our model, and a way to choose one set of plausible values

(15)

H_{0} : p = p_{0} \quad \text{vs.} \quad H_{a} : p \neq p_{0}

given our tolerance for choosing the alternative hypothesis when it isn’t true

(16)

α = ℙ_{H_{0}} (H_{a})

It is reasonable that when our Z statistic is large in magnitude we should take that as evidence that we are not at our null hypothesis and reject that for our alternative hypothesis: we set the Z statistic up that way, as just the difference between what we observe and what we expect.

(17)

Z = \frac{\hat{p}_{obs} - p_{0}}{\sqrt{\frac{p_{0}(1 - p_{0})}{n}}}

Our ability to choose the alternative when in fact it is true we call the power of the test

(18)

1 - β \equiv \text{Power} = ℙ_{H_{a}}(H_{a}) = ℙ_{H_{a}}(\text{Observed Test Statistic is in Rejection Region})

The Rejection Region is, of course, set by our choice of α.
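The test statistic (17), the rejection region set by α, and the power (18) can be sketched in a few lines of Python. This is an illustration, not a definitive implementation: the function names and the default α = 0.05 are assumptions, and the power calculation relies on the Normal approximation for p̂.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # N(0, 1)


def z_statistic(p_hat_obs, p0, n):
    """Equation (17): observed minus expected, in standard-error units."""
    return (p_hat_obs - p0) / (p0 * (1 - p0) / n) ** 0.5


def rejects_null(p_hat_obs, p0, n, alpha=0.05):
    """Two-sided test: reject H0 when |Z| exceeds the alpha/2 critical value."""
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    return abs(z_statistic(p_hat_obs, p0, n)) > z_crit


def approx_power(p_true, p0, n, alpha=0.05):
    """Approximate P(reject H0) when the true parameter is p_true,
    using the Normal approximation for p_hat under p_true."""
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    se0 = (p0 * (1 - p0) / n) ** 0.5
    p_hat = NormalDist(mu=p_true, sigma=(p_true * (1 - p_true) / n) ** 0.5)
    lower, upper = p0 - z_crit * se0, p0 + z_crit * se0
    return p_hat.cdf(lower) + (1 - p_hat.cdf(upper))


z = z_statistic(0.6, 0.5, 100)           # about 2 standard errors from p0
decision = rejects_null(0.6, 0.5, 100)
size = approx_power(0.5, 0.5, 100)       # power at the null equals alpha
power = approx_power(0.6, 0.5, 100)
```

Evaluating the power at the null value returns α itself, which is a useful sanity check on the rejection region.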

2.9. Inference: Confidence Intervals

We can frame the hypothesis testing paradigm in terms of the values that we think are plausible for the parameter of interest, at a particular tolerance α. Rearranging (14) to isolate p, and plugging p̂ in for the unknown p in the standard error, yields

(19)

ℙ\left(p \in \hat{p} \pm Z_{α/2} \cdot \sqrt{\hat{p}(1 - \hat{p})/n}\right) \approx 1 - α

where $Z_{α/2}$ is the upper α/2 quantile of N(0, 1), which means that

\left[\hat{p} - Z_{α/2} \cdot \sqrt{\hat{p}(1 - \hat{p})/n}, \; \hat{p} + Z_{α/2} \cdot \sqrt{\hat{p}(1 - \hat{p})/n}\right]

is a 100(1 − α) percent Confidence Interval for our true parameter of interest p. This is to say that out of M total experiments, where each time we generate an estimate p̂ and its interval, about (1 − α) · M of the intervals should cover p.
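The repeated-experiments reading of a Confidence Interval can be checked by enumeration: for every possible number of successes, build the interval around the resulting estimate and add up the probability of those outcomes whose interval covers the true p. The sketch below centers the interval at the estimate and uses the estimate in the standard error (a common textbook choice, assumed here); the names and the values p = 0.3, n = 100 are illustrative.

```python
from math import comb
from statistics import NormalDist


def proportion_interval(p_hat, n, alpha=0.05):
    """p_hat +/- z_{alpha/2} * sqrt(p_hat (1 - p_hat) / n)."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    half_width = z * (p_hat * (1 - p_hat) / n) ** 0.5
    return p_hat - half_width, p_hat + half_width


def exact_coverage(p, n, alpha=0.05):
    """Probability, over all possible samples, that the interval covers p."""
    total = 0.0
    for k in range(n + 1):
        low, high = proportion_interval(k / n, n, alpha)
        if low <= p <= high:
            total += comb(n, k) * p**k * (1 - p) ** (n - k)
    return total


low, high = proportion_interval(0.3, 100)   # interval for an observed 0.3
coverage = exact_coverage(0.3, 100)         # long-run share of covering intervals
```

The coverage comes out near, though not exactly at, the nominal 1 − α, because the interval rests on the Normal approximation.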

2.10. Inference: Bayesian

Lastly, we can make statements about the parameter using Bayesian inference, if we are willing to explicitly state that the parameter of interest is random, i.e. we are not certain in our belief about it.

This allows us to quantify any prior beliefs about the parameter, look at data, and construct a set of posterior beliefs about the parameter.

In the Bayesian approach, the posterior for the parameter p, π(p|x), is a full PDF, or distribution. This distribution is the tool or method by which we conduct inference.

The Bayesian approach augments ‘frequentist’ procedures by including ‘prior’ information about the parameter of interest. In the ‘frequentist approach’ p is a constant, which we estimate, say, via the likelihood, $\mathrm{Lik}(x | p) = \prod_{i=1}^{n} f_{p}(x_{i})$. We derive estimates using the likelihood of the data, or the sampling distribution. A common estimate is $\hat{p} = \bar{x}$.

In the ‘Bayesian approach’ p is an instance of a random variable with a PDF, say π(p), and now we derive estimates using the additional randomness of π(p) via Bayes’ Equation:

π(p | x) = \frac{f(x | p) \cdot π(p)}{g(x)}

In this setup, for data $x = (x_{1}, \ldots, x_{n})$ with $X \sim f_{p}(x)$:

π(p) \equiv \text{the prior distribution for } p

f(x | p) \equiv \text{the ‘likelihood’, the probability of the data given } p

g(x) \equiv \text{the marginal distribution of } x_{1}, \ldots, x_{n}

π(p | x) \equiv \text{the posterior distribution for } p

We get the marginal distribution for the data, $g(x)$, by summing or integrating the parameter out:

g(x) = \begin{cases} \sum_{p} f(x | p) \, π(p) & \text{if } p \text{ is discrete} \\ \int_{-\infty}^{\infty} f(x | p) \, π(p) \, dp & \text{if } p \text{ is continuous} \end{cases}

2.11. Bayesian Computation

Example:

Let $X \sim \mathrm{Bin}(2, p)$, and $π(p = .1) = .6$, $π(p = .2) = .4$.

Then

f_{p}(x) = f(x | p) = \binom{2}{x} p^{x} (1 - p)^{2 - x}

And

g(x) = \sum_{p} f_{p}(x) \, π(p) = f(x | .1) \cdot π(.1) + f(x | .2) \cdot π(.2) = \binom{2}{x} .1^{x} .9^{2 - x} \cdot .6 + \binom{2}{x} .2^{x} .8^{2 - x} \cdot .4

π(.1 | x) = \frac{f(x | .1) \cdot π(.1)}{g(x)} = \frac{.1^{x} .9^{2 - x} \cdot .6}{.1^{x} .9^{2 - x} \cdot .6 + .2^{x} .8^{2 - x} \cdot .4}

and here,

π(.2 | x) = 1 - π(.1 | x)

Suppose we observe data $x_{obs} = 0$, then:

π(.1 | 0) = \frac{.1^{0} .9^{2} \cdot .6}{.1^{0} .9^{2} \cdot .6 + .2^{0} .8^{2} \cdot .4} = .6550

and

π(.2 | 0) = .3450
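The two-point prior calculation above can be reproduced with a short Python sketch of Bayes' equation for a discrete prior. The function names are assumptions for illustration; the prior values and the data come from the example.

```python
from math import comb


def binom_likelihood(x, n, p):
    """f(x | p) for X ~ Bin(n, p)."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)


def discrete_posterior(prior, x, n):
    """Bayes' equation for a prior given as {p: probability}."""
    joint = {p: binom_likelihood(x, n, p) * w for p, w in prior.items()}
    marginal = sum(joint.values())  # g(x), the marginal of the data
    return {p: j / marginal for p, j in joint.items()}


prior = {0.1: 0.6, 0.2: 0.4}
posterior = discrete_posterior(prior, x=0, n=2)
# posterior[0.1] is about .6550 and posterior[0.2] about .3450
```

Observing no successes shifts belief toward the smaller value of p, from a prior weight of .6 to a posterior weight of about .655.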

Now, with a full distribution for p, we can find ‘posterior’ estimates of parameters. Contrast this with the ‘frequentist’ approach, where we found point estimates and used the sampling distribution; in the Bayesian approach, the role of the sampling distribution in inference on the parameter is taken over by the Bayesian posterior distribution.

Example

Let $X \sim \mathrm{Bin}(2, p)$ and $π(p) = 1$, $0 \leq p \leq 1$; we observe data $x = 1$.

Thus

π (p | 1) = 6 p (1 - p)
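The step from the uniform prior to this posterior can be written out with Bayes' equation. This is a worked version of the calculation, using the same f, g and π as defined earlier:

```latex
\begin{aligned}
f(1 \mid p) &= \binom{2}{1} p (1 - p) = 2p(1 - p), \\
g(1) &= \int_{0}^{1} 2p(1 - p) \cdot 1 \, dp = 2\left(\tfrac{1}{2} - \tfrac{1}{3}\right) = \tfrac{1}{3}, \\
\pi(p \mid 1) &= \frac{f(1 \mid p) \cdot \pi(p)}{g(1)} = \frac{2p(1 - p)}{1/3} = 6p(1 - p).
\end{aligned}
```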

Thus

\hat{p} = E(p | x = 1) = \int_{0}^{1} p \cdot 6p(1 - p) \, dp = \frac{1}{2}

...one possible estimate for the parameter p.
Another possible estimate of p is $\hat{p}^{*}$, the posterior mode, found by setting the derivative of the posterior to zero:

\frac{d π(p | x)}{dp} = 6 - 12p = 0 \implies \hat{p}^{*} = 1/2
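Both estimates can also be found numerically, which is useful when the posterior has no convenient closed form. The sketch below uses a midpoint rule for the integral and a grid search for the mode; both are stand-in numerical methods chosen for illustration, and the step counts are arbitrary assumptions.

```python
def posterior_density(p):
    """pi(p | x = 1) = 6 p (1 - p) on [0, 1], from the example above."""
    return 6 * p * (1 - p)


def integrate(f, a, b, steps=10_000):
    """Midpoint-rule approximation to the integral of f over [a, b]."""
    h = (b - a) / steps
    return h * sum(f(a + (i + 0.5) * h) for i in range(steps))


total_mass = integrate(posterior_density, 0, 1)                  # close to 1
post_mean = integrate(lambda p: p * posterior_density(p), 0, 1)  # close to 1/2

# Posterior mode: the grid point where the density is largest.
grid = [i / 1000 for i in range(1001)]
post_mode = max(grid, key=posterior_density)
```

For this symmetric posterior the mean and the mode coincide at 1/2; in general they differ.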

3. Bayesian Posterior Intervals

In the frequentist approach the Confidence Interval is the interval, say I, such that

P(μ \in I \text{ in repeated experiments}) = 1 - α

In the Bayesian approach the Posterior Interval is the interval, I, such that, given the observed data x,

P(μ \in I \mid x) = 1 - α

Example

Let $X \sim \mathrm{Bin}(2, p)$ and $π(p) = 1$, $0 \leq p \leq 1$, and let the observed data be

x_{obs} = 0

The likelihood is then $(1 - p)^{2}$, so the normalized posterior is $π(p | 0) = 3(1 - p)^{2}$.

The 95% Bayes Interval is (a, b) such that

\int_{0}^{a} 3(1 - p)^{2} \, dp = .025

and

\int_{b}^{1} 3(1 - p)^{2} \, dp = .025

thus

a = .0084 \text{ and } b = .7076

Then $ℙ(p \in (.0084, .7076) \mid x_{obs} = 0) = .95$.
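The endpoints a and b are the .025 and .975 quantiles of the posterior, so they can be found by inverting the posterior distribution function. The sketch below does this by bisection; the function names and tolerance are assumptions for illustration.

```python
def posterior_cdf(p):
    """CDF of the posterior 3 (1 - p)^2 on [0, 1]: its integral from 0 to p."""
    return 1 - (1 - p) ** 3


def quantile(prob, lo=0.0, hi=1.0, tol=1e-10):
    """Invert the posterior CDF by bisection on [lo, hi]."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if posterior_cdf(mid) < prob:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


a = quantile(0.025)  # lower endpoint, about .0084
b = quantile(0.975)  # upper endpoint, about .7076
mass_inside = posterior_cdf(b) - posterior_cdf(a)
```

Because this posterior has a closed-form distribution function, the same endpoints also follow directly from a = 1 − .975^{1/3} and b = 1 − .025^{1/3}.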

Introductory Statistics with Calculus
