Mathematics & Computer Science

Seton Hall

This class is different every time I teach it: below is what we usually cover in the first semester of an introductory sequence.

- Illustrating Data: Histograms, Contingency Tables.
- Statistics: Sample Average, Sample Variance, Order Statistics.
- Probability and Experiments: Set Operations, The Sample Space, Independence, Conditional Distributions, Expectation, Variance, Covariance, Correlation.
- Counting Methods: The Counting Principle, Permutations and Combinations.
- Discrete Distributions: Bernoulli, Binomial, Negative Binomial.
- Continuous Distributions: Uniform, Normal.
- Statistics and Sampling: The Central Limit Theorem, The Distribution of the Sample Mean, The Likelihood.
- Inference: The Test Statistic, Hypothesis Testing, Confidence Intervals, Bayesian Inference.

Statistics is about saying something about how you believe the world works, based on
what you have observed of it. Statistics is the science of experimentation: what we can
observe, and all that we can say about it, is the result of our assumptions about the Experiment or Model
that generated it.

We always start off with a model: we’ve used the Bernoulli experiment to
illustrate.

(1)

The simplest possible experiment - only two outcomes, Success or Failure.

We can enumerate the sample space for this experiment

$$\Omega = \{\text{Success}, \text{Failure}\} \tag{2}$$

And then assign a Random Variable as a mapping from the Sample Space (Ω) to the real numbers

$$X : \Omega \to \mathbb{R} \tag{3}$$

For example, one particular outcome in the sample space is
ω = Success. We typically assign

$$X(\omega) = 1 \quad \text{if } \omega = \text{Success} \tag{4}$$

and 0 otherwise.

Now we can immediately write a probability distribution for X

This is its probability mass function (density function for continuous
random variables).

Then we can write its distribution function immediately

And so on with Expectation, Variance, Covariance (if we have more than one)...all
functions on the probability mass function.
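To make this concrete, here is a minimal Python sketch of the Bernoulli random variable, its probability mass function, its distribution function, and an expectation helper. The function names and the value p = 3/10 are illustrative assumptions, not part of the course material; exact fractions are used so the arithmetic can be checked by hand.

```python
from fractions import Fraction

# Sample space of the Bernoulli experiment.
SAMPLE_SPACE = ("Success", "Failure")

def X(outcome):
    """Random variable: map an outcome in the sample space to a real number."""
    return 1 if outcome == "Success" else 0

def pmf(x, p):
    """P(X = x) for a Bernoulli random variable with success probability p."""
    if x == 1:
        return p
    if x == 0:
        return 1 - p
    return 0

def cdf(x, p):
    """P(X <= x): add up the pmf over the support points not exceeding x."""
    return sum(pmf(k, p) for k in (0, 1) if k <= x)

def expectation(g, p):
    """E[g(X)]: a pmf-weighted average of g over the support {0, 1}."""
    return sum(g(k) * pmf(k, p) for k in (0, 1))

p = Fraction(3, 10)  # illustrative value (an assumption)
mean = expectation(lambda k: k, p)
variance = expectation(lambda k: (k - mean) ** 2, p)
print(mean, variance, cdf(0, p))
```

Note that the variance is computed with the same `expectation` helper as the mean: it is just the expectation of a different function of X.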

We know that these functions have nice properties and that Variance and Covariance are basically just Expectations: discrete or continuous averages of functions on the probability mass function.

(5)

(6)

We know how to transform the random variable and generate a new distribution for
the new random variable after transformation. We know that expectation is a linear operator, so it
passes through linear transformations of the random variable.

(7)
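Linearity of expectation, E[aX + b] = aE[X] + b, can be checked exactly for a Bernoulli random variable by building the distribution of the transformed variable and averaging over it. This is only a sketch: the constants a, b and the value of p are illustrative assumptions.

```python
from fractions import Fraction

def expect(g, pmf):
    """E[g(X)] for a discrete pmf given as a {value: probability} dict."""
    return sum(g(x) * w for x, w in pmf.items())

p = Fraction(1, 4)          # illustrative success probability (an assumption)
pmf_x = {0: 1 - p, 1: p}    # distribution of X
a, b = 3, 2                 # illustrative constants for Y = aX + b

# Distribution of the transformed random variable Y = aX + b.
pmf_y = {a * x + b: w for x, w in pmf_x.items()}

mean_x = expect(lambda x: x, pmf_x)
mean_y = expect(lambda y: y, pmf_y)
var_x = expect(lambda x: (x - mean_x) ** 2, pmf_x)
var_y = expect(lambda y: (y - mean_y) ** 2, pmf_y)

print(mean_y, a * mean_x + b)   # both sides of E[aX + b] = aE[X] + b
print(var_y, a ** 2 * var_x)    # Var(aX + b) = a^2 Var(X)
```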

From this we get results on summation and integration ‘passing’ through
expectations under independence

(8)

We use these results to investigate the properties of multiple, repeated outcomes from the
experiment, usually assumed to be independent and identically distributed: the Sample.

Suppose we take n observations, X₁, …, Xₙ,
from our Bernoulli Random Variable model. We devise an estimator for the probability of success that is just
the average of the observed successes

$$\hat{p} = \frac{1}{n}\sum_{i=1}^{n} X_i \tag{9}$$

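We can see right away from the properties of the Expectation

$$E[\hat{p}] = p \tag{10}$$

and, since the observations are independent,

$$\mathrm{Var}(\hat{p}) = \frac{p(1-p)}{n} \tag{11}$$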
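The mean and variance of the estimator, E[p̂] = p and Var(p̂) = p(1 − p)/n, can also be checked by simulation. The sketch below repeats the n-observation experiment many times; the values of p, n, the number of repetitions, and the seed are illustrative assumptions.

```python
import random
from statistics import fmean, pvariance

def p_hat(sample):
    """Estimator of the success probability: the average of the 0/1 outcomes."""
    return sum(sample) / len(sample)

def simulate_p_hats(p, n, reps, seed=0):
    """Repeat the n-observation Bernoulli experiment `reps` times, collecting p-hat."""
    rng = random.Random(seed)
    return [
        p_hat([1 if rng.random() < p else 0 for _ in range(n)])
        for _ in range(reps)
    ]

p, n = 0.3, 50  # illustrative values (assumptions)
estimates = simulate_p_hats(p, n, reps=10_000)
print("mean of p-hat:    ", fmean(estimates), "vs p =", p)
print("variance of p-hat:", pvariance(estimates), "vs p(1-p)/n =", p * (1 - p) / n)
```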

The CLT tells us that when we average things, the averages tend to a Normal distribution, no matter the initial distribution we draw from (provided it has finite variance). It is enough to assume that the initial distribution - or model - is stable, and that we take enough samples and then average them.

(12)

The parameters - i.e. the particular values which give us a particular distribution -
of this Normal distribution for the average are governed by the initial mean and variance of the underlying
Random Variable (Experiment/Process)
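A quick simulation shows the CLT at work on a distribution that looks nothing like a Normal. The sketch below averages draws from an Exponential distribution with mean 1 and variance 1 (an illustrative choice, not one used in the course), standardizes each average with the underlying mean and variance, and counts how often the result lands within 1.96 standard deviation units of zero.

```python
import math
import random
from statistics import fmean

def standardized_means(n, reps, seed=1):
    """Z-scores of sample means of n Exponential(1) draws (mean 1, variance 1)."""
    rng = random.Random(seed)
    mu, sigma = 1.0, 1.0
    z_values = []
    for _ in range(reps):
        x_bar = fmean(rng.expovariate(1.0) for _ in range(n))
        # The average has mean mu and standard deviation sigma / sqrt(n).
        z_values.append((x_bar - mu) / (sigma / math.sqrt(n)))
    return z_values

z_values = standardized_means(n=100, reps=5_000)
within = sum(abs(z) <= 1.96 for z in z_values) / len(z_values)
print("fraction of |Z| <= 1.96:", within)  # near 0.95 if the Normal approximation holds
```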

One of the nice properties of the Normal Distribution is that its quantiles are
easy to remember and that it is closed under linear transformations.

Set

(13)

and by the properties of the Expectation

and if T ∼ Normal then Z ∼ Normal. Let’s write it again:

(14)

So, we have an easy mapping from the sampling distribution (14),(10),(11) of our estimator p̂ (9) to the Z-score which is scaled in
terms of standard deviation units.

But remember: What we really want is to say something about the true state of the Experiment/Model/Nature/The World.

One way is to set up a choice between beliefs about the parameters of our model,
and a way to choose one set of plausible values

(15)

given our tolerance for choosing the alternative hypothesis when it
isn’t true

(16)

It is reasonable that when our Z statistic is large in magnitude we should take that
as evidence against our null hypothesis and reject it in favor of our alternative hypothesis: we set
the Z statistic up that way, as just the difference between what we observe and what we expect.

(17)

Our ability to choose the alternative when in fact it is true we call the power of the test

(18)

The Rejection Region is, of course, set by our choice of α.
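As a sketch of how these pieces fit together, the code below builds a two-sided Z test for a Bernoulli success probability and approximates its power with the Normal sampling distribution. The null value p0 = 0.5, the sample size, and the alternative values used in the printout are illustrative assumptions.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()  # mean 0, standard deviation 1

def z_statistic(p_hat, p0, n):
    """Observed minus expected under H0, in null standard deviation units."""
    return (p_hat - p0) / math.sqrt(p0 * (1 - p0) / n)

def reject_null(p_hat, p0, n, alpha=0.05):
    """Two-sided test: reject when |Z| exceeds the 1 - alpha/2 Normal quantile."""
    critical = STD_NORMAL.inv_cdf(1 - alpha / 2)
    return abs(z_statistic(p_hat, p0, n)) > critical

def power(p_true, p0, n, alpha=0.05):
    """P(reject H0) when the success probability is p_true (Normal approximation)."""
    critical = STD_NORMAL.inv_cdf(1 - alpha / 2)
    se0 = math.sqrt(p0 * (1 - p0) / n)
    # Rejection region for p-hat: below `lower` or above `upper`.
    lower, upper = p0 - critical * se0, p0 + critical * se0
    sampling = NormalDist(p_true, math.sqrt(p_true * (1 - p_true) / n))
    return sampling.cdf(lower) + (1 - sampling.cdf(upper))

print(reject_null(0.60, 0.5, 100))   # Z is about 2
print(reject_null(0.55, 0.5, 100))   # Z is about 1
print(power(0.6, 0.5, 100), power(0.7, 0.5, 100))
```

When the true value equals the null value, `power` returns α itself: the chance of choosing the alternative when it isn't true.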

We can frame the hypothesis testing paradigm in terms of the values that we
think are plausible for the parameter of interest, at a particular tolerance α. A little algebra
yields

(19)

which means that

is a 100(1 − α) percent Confidence Interval for our true parameter of interest p.
This is to say that out of M total experiments, where each time we generate an estimator p̂ and its interval, about (1 −
α) · M of the intervals should cover p.
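The coverage statement can be checked by simulation. The sketch below assumes the common Normal-approximation interval p̂ ± z·sqrt(p̂(1 − p̂)/n), which may differ in detail from the interval used in class, and counts how many of M simulated experiments produce an interval that covers the true p. The values of p, n, M, and the seed are illustrative assumptions.

```python
import math
import random
from statistics import NormalDist

def wald_interval(p_hat, n, alpha=0.05):
    """Normal-approximation interval for p, with the standard error estimated from p-hat."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width

def coverage(p, n, experiments, alpha=0.05, seed=2):
    """Fraction of simulated experiments whose interval covers the true p."""
    rng = random.Random(seed)
    covered = 0
    for _ in range(experiments):
        successes = sum(rng.random() < p for _ in range(n))
        low, high = wald_interval(successes / n, n, alpha)
        covered += low <= p <= high
    return covered / experiments

low, high = wald_interval(0.3, 200)
rate = coverage(p=0.3, n=200, experiments=2_000)
print(f"interval for p-hat = 0.3, n = 200: ({low:.4f}, {high:.4f})")
print("simulated coverage:", rate)  # should sit near 1 - alpha = 0.95
```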

Lastly, we can make statements about the estimators using Bayesian inference, if we are willing to explicitly state that the parameter of interest is random, i.e. we are not certain in our belief about it.

This allows us to quantify any prior beliefs about the parameter, look at data,
and construct a set of posterior beliefs about the parameter.

In the Bayesian approach, the posterior for θ, π(θ|x) is a full
PDF, or distribution. This distribution is the tool or method by which we conduct inference.

The Bayesian approach augments ‘frequentist’ procedures by
including ‘prior’ information about the parameter of interest. In the ‘frequentist
approach’ p is a constant, which we estimate, say, via the likelihood, $\mathrm{Lik}(x \mid p) = \prod_{i=1}^{n} f_p(x_i)$.
We derive estimates using the likelihood of the data, or the sampling distribution. A common
estimate is p̂ = x̄.

In the ‘Bayesian approach’ p is an instance of a random
variable with a PDF, say π(p), and now we derive estimates using the additional randomness of π(p) via
Bayes' Equation:

We get the marginal distribution for the data,

Then

And

So

and here,

Suppose we observe data x_obs = 0, then:

and

Now, with a full PDF for p, we can find ‘posterior’ estimates
of parameters. Contrast this with the ‘frequentist’ approach, where we found point estimates and
used the sampling distribution; in the Bayesian approach, the role the sampling distribution plays in inference on the
parameter is taken over by the posterior distribution.

Thus

Thus

...one possible estimate for the parameter p.
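Here is a minimal numerical sketch of the same pipeline: prior, likelihood, marginal, posterior, and then the posterior mean. It assumes a Uniform(0, 1) prior for p, which is an illustrative choice and not necessarily the prior used in class, together with a single Bernoulli observation x_obs = 0. Under that assumption the posterior is 2(1 − p) and its mean is 1/3, which the grid approximation should reproduce.

```python
def grid_posterior(x_obs, prior, num_points=1000):
    """Approximate the posterior density of p on a midpoint grid over (0, 1)."""
    width = 1.0 / num_points
    grid = [(i + 0.5) * width for i in range(num_points)]
    # Bernoulli likelihood of one observation: p if x_obs == 1, else 1 - p.
    unnormalised = [(p if x_obs == 1 else 1 - p) * prior(p) for p in grid]
    # Marginal for the data: the integral of likelihood times prior over p.
    marginal = sum(unnormalised) * width
    density = [u / marginal for u in unnormalised]
    return grid, density, width

def uniform_prior(p):
    """Assumed prior: every value of p in (0, 1) is equally plausible."""
    return 1.0

grid, density, width = grid_posterior(0, uniform_prior)
posterior_mean = sum(p * d for p, d in zip(grid, density)) * width
print("posterior mean:", posterior_mean)  # close to 1/3 under the assumed prior
```

Swapping in a different `prior` function changes the posterior without touching the rest of the code, which is the sense in which the prior quantifies our beliefs before seeing data.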

Another possible estimate is the posterior mode:

In the frequentist approach the Confidence Interval is the interval, say I, such that

In the Bayesian approach the corresponding interval (often called a credible interval) is the interval, I, such that

Let

and the observed data

The 95% Bayes Interval is (a, b) such that

and

thus
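As a worked sketch of a Bayes Interval, assume a Uniform(0, 1) prior and a single observation x_obs = 0 (both illustrative assumptions, not necessarily the setup used above). The likelihood is 1 − p, the marginal is 1/2, and so the posterior density is 2(1 − p) with distribution function 1 − (1 − p)². An equal-tailed 95% interval (a, b) puts posterior probability 0.025 below a and 0.025 above b, and in this case both endpoints come from inverting the distribution function directly.

```python
import math

def posterior_cdf(p):
    """CDF of the posterior density 2(1 - p) on [0, 1]."""
    return 1 - (1 - p) ** 2

def posterior_quantile(u):
    """Inverse of posterior_cdf: solve 1 - (1 - p)^2 = u for p."""
    return 1 - math.sqrt(1 - u)

alpha = 0.05
a = posterior_quantile(alpha / 2)       # 2.5% of posterior mass lies below a
b = posterior_quantile(1 - alpha / 2)   # 2.5% of posterior mass lies above b
posterior_mode = 0.0  # 2(1 - p) is decreasing on [0, 1], so it peaks at p = 0

print(f"95% Bayes Interval: ({a:.4f}, {b:.4f})")
print("posterior probability inside:", posterior_cdf(b) - posterior_cdf(a))
```

Unlike the frequentist interval, this is a direct probability statement about p given the observed data, under the assumed prior.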