Kobi Abayomi

Mathematics & Computer Science
Seton Hall


1. Topics Covered
2. The Big Picture
2.1. The Model, The Parameters
2.2. The Random Variable
2.3. The Probability Distribution
2.4. The functions on Probability Distribution
2.5. The Sample
2.6. The Central Limit Theorem
2.7. The Normal Distribution; Closure under Linear Transforms
2.8. Inference: Hypothesis Testing
2.9. Inference: Confidence Intervals
2.10. Inference: Bayesian
2.11. Bayesian Computation
3. Bayesian Posterior Intervals

1. Topics Covered

This class is different every time I teach it; below is what we usually cover in the first semester of an introductory sequence.

This semester we covered:

  • Illustrating Data: Histograms, Contingency Tables.
  • Statistics: Sample Average, Sample Variance, Order Statistics
  • Probability and Experiments: Set Operations, The Sample Space, Independence, Conditional
    Distributions, Expectation, Variance, Covariance, Correlation.
  • Counting Methods: The Counting Principle, Permutations and Combinations.
  • Discrete Distributions: Bernoulli, Binomial, Negative Binomial
  • Continuous Distributions: Uniform, Normal.
  • Statistics and Sampling: The Central Limit Theorem, The Distribution of the Sample
    Mean, The Likelihood.
  • Inference: The Test Statistic, Hypothesis Testing, Confidence Intervals, Bayesian Inference.

2. The Big Picture

Statistics is about using what you have observed of the world to say something about how you believe it works. It is the science of experimentation: what we can observe, and all that we can say about it, results from our assumptions about the Experiment or Model that generated it.
We always start with a model; we have used the Bernoulli experiment to illustrate.

2.1. The Model, The Parameters

$$P(\text{Success}) = p$$

The simplest possible experiment: only two outcomes, Success or Failure.
We can enumerate the sample space for this experiment:

$$\Omega = \{\text{Success}, \text{Failure}\}$$

2.2. The Random Variable

We then assign a Random Variable as a mapping from the Sample Space (Ω) to the real numbers:

$$X : \Omega \to \mathbb{R}$$

For example, an outcome or event on the sample space is $\omega \in \Omega$; a particular one is $\omega = \text{Success}$. We typically assign

$$X(\omega = \text{Success}) = 1$$

and 0 otherwise.

2.3. The Probability Distribution

Now we can immediately write a probability distribution for X:

$$X = x = \begin{cases} 1 & \text{w.p. } p \\ 0 & \text{w.p. } 1 - p \end{cases}$$

This is its probability mass function (density function for continuous random variables).
Then we can write its distribution function immediately:

$$F(x) = P(X \le x) = \begin{cases} 0 & x < 0 \\ 1 - p & 0 \le x < 1 \\ 1 & x \ge 1 \end{cases}$$

And so on with Expectation, Variance, and Covariance (if we have more than one random variable): all are functions on the probability mass function.

 2.4. The functions on Probability Distribution

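We know that these functions have nice properties and that Variance and Covariance are basically just Expectations: discrete or continuous averages of functions on the probability mass function.

$$\mathrm{Var}(X) = E\left[(X - E(X))^2\right], \qquad \mathrm{Cov}(X, Y) = E\left[(X - E(X))(Y - E(Y))\right]$$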

We know how to transform the random variable and generate a new distribution for the new random variable after transformation. We know that expectation is a linear function so that linearity is respected.

$$E(aX + b) = aE(X) + b$$

From this we get results on summation and integration ‘passing’ through expectations under independence


We use these results to investigate the properties of multiple, repeated, often supposed identical and independent outcomes from the experiment: the Sample.

2.5. The Sample

Suppose we take $X_1, \ldots, X_n$, observations from our Bernoulli Random Variable model. We devise an estimator for the probability of success that is just the average of the observed successes

$$\hat{p} = \bar{x} = \sum_{i=1}^{n} x_i / n$$

We can see right away from the properties of the Expectation

$$E(\hat{p}) = p$$

$$\mathrm{Var}(\hat{p}) = p(1 - p)/n$$
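A quick way to convince yourself of these two results is to simulate them. The sketch below is only an illustration: the function name `simulate_p_hat` and the values p = 0.3, n = 50 are assumptions chosen for the example, not something from the course. It draws many Bernoulli samples, computes the sample average of each, and compares the mean and variance of those averages with p and p(1 - p)/n.

```python
import random
import statistics

def simulate_p_hat(p, n, reps, seed=0):
    """Draw `reps` Bernoulli(p) samples of size n; return the sample average of each."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(reps):
        # Each observation is 1 (Success) with probability p, 0 otherwise.
        successes = sum(1 for _ in range(n) if rng.random() < p)
        estimates.append(successes / n)
    return estimates

# Illustrative values (assumed for this example).
p, n = 0.3, 50
p_hats = simulate_p_hat(p, n, reps=20000)

print("average of p_hat: ", statistics.fmean(p_hats), " theory:", p)
print("variance of p_hat:", statistics.pvariance(p_hats), " theory:", p * (1 - p) / n)
```

The simulated values will not match the theory exactly, but they should agree to a couple of decimal places and get closer as `reps` grows.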

2.6. The Central Limit Theorem

The CLT tells us that when we average things, the averages tend to a Normal distribution, no matter the initial distribution we draw from. It is enough to assume that the initial distribution (or model) is stable, that is, every draw comes from the same distribution with finite variance, and that we take enough samples and then average them.

$$\lim_{n \to \infty} \bar{X} \sim N\left(E(X), \mathrm{Var}(X)/n\right)$$

The parameters (i.e. the particular values which give us a particular distribution) of this Normal distribution for the average are governed by the initial mean and variance of the underlying Random Variable (Experiment/Process).
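The CLT can be seen at work with a small simulation. This is a rough sketch under assumed values (a lopsided Bernoulli with p = 0.1, and the helper name `fraction_within` is made up for the example): it standardizes each sample average using the mean and variance of the underlying model and reports how often the result lands within 1.96 of zero. For a Normal distribution that fraction is about 0.95, so the output should drift toward 0.95 as n grows.

```python
import math
import random

def fraction_within(p, n, reps, z=1.96, seed=0):
    """Fraction of standardized Bernoulli(p) sample averages that fall in [-z, z]."""
    rng = random.Random(seed)
    se = math.sqrt(p * (1 - p) / n)   # standard deviation of the average: sqrt(Var(X)/n)
    hits = 0
    for _ in range(reps):
        x_bar = sum(rng.random() < p for _ in range(n)) / n
        if abs((x_bar - p) / se) <= z:
            hits += 1
    return hits / reps

# The underlying distribution is far from Normal, yet the averages behave
# more and more like Normal draws as the sample size n increases.
for n in (5, 30, 400):
    print("n =", n, " fraction within 1.96 sd:", fraction_within(0.1, n, reps=4000))
```

For small n the averages can take only a handful of values, so the fraction can be well off 0.95; that is the "enough samples" caveat in action.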

2.7. The Normal Distribution; Closure under Linear Transforms

Two nice properties of the Normal distribution are that its quantiles are easy to remember and that it is closed under linear transforms. For a statistic $T$, we standardize:

$$Z = \frac{T - E(T)}{\sqrt{\mathrm{Var}(T)}}$$

and by the properties of the Expectation

$$E(Z) = 0$$
$$\mathrm{Var}(Z) = 1$$

and if T ∼ Normal then Z ∼ Normal. Let’s write it again:

$$Z = \frac{\hat{p} - p}{\left[p(1 - p)/n\right]^{1/2}} \sim N(0, 1)$$

2.8. Inference: Hypothesis Testing

So, we have an easy mapping from the sampling distribution of our estimator p̂ to the Z-score, which is scaled in standard deviation units.

But remember: What we really want is to say something about the true state of the Experiment/Model/Nature/The World.

One way is to set up a choice between beliefs about the parameters of our model, and a way to choose one set of plausible values

$$H_0 : p = p_0 \quad \text{vs} \quad H_a : p \neq p_0$$

given our tolerance for choosing the alternative hypothesis when it isn’t true:

$$\alpha = P(\text{choose } H_a \mid H_0 \text{ true})$$

It is reasonable that when our Z statistic is large in magnitude we should take that as evidence that we are not at our null hypothesis and reject that for our alternative hypothesis: we set the Z statistic up that way, as just the difference between what we observe and what we expect.

$$Z = \frac{\hat{p}_{obs} - p_0}{\sqrt{p_0(1 - p_0)/n}}$$

Our ability to choose the alternative when in fact it is true we call the power of the test:

$$1 - \beta \equiv \text{Power} = P(\text{choose } H_a \mid H_a \text{ true}) = P(\text{Observed Test Statistic is in Rejection Region} \mid H_a)$$

The Rejection Region is, of course, set by our choice of α.
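The pieces above (the Z statistic, the rejection region set by α, and the power) fit together as in the sketch below. It is a minimal illustration, assuming a two-sided test and the Normal approximation throughout; the function names `z_test` and `power`, and the numbers in the usage lines, are made up for the example.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()   # the N(0, 1) distribution

def z_test(p_hat_obs, p0, n, alpha=0.05):
    """Two-sided test of H0: p = p0 vs Ha: p != p0. Returns (Z, critical value, reject H0?)."""
    z = (p_hat_obs - p0) / math.sqrt(p0 * (1 - p0) / n)
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)   # rejection region is |Z| > z_crit
    return z, z_crit, abs(z) > z_crit

def power(p_true, p0, n, alpha=0.05):
    """Normal-approximation probability that Z lands in the rejection region when p = p_true."""
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    se_null = math.sqrt(p0 * (1 - p0) / n)          # spread assumed under H0
    se_true = math.sqrt(p_true * (1 - p_true) / n)  # actual spread of p_hat
    # Reject when p_hat falls above p0 + z_crit*se_null or below p0 - z_crit*se_null.
    upper = (p0 + z_crit * se_null - p_true) / se_true
    lower = (p0 - z_crit * se_null - p_true) / se_true
    return (1 - STD_NORMAL.cdf(upper)) + STD_NORMAL.cdf(lower)

# Hypothetical data: 60 successes in n = 100 trials, testing p0 = 0.5.
z, z_crit, reject = z_test(p_hat_obs=0.6, p0=0.5, n=100)
print(f"Z = {z:.3f}, critical value = {z_crit:.3f}, reject H0: {reject}")
print(f"power against p = 0.6: {power(0.6, 0.5, 100):.3f}")
```

Note that `power(p0, p0, n)` returns α itself: when the null is true, the chance of landing in the rejection region is exactly the tolerance we set.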

2.9. Inference: Confidence Intervals

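We can frame the hypothesis testing paradigm in terms of the values that we think are plausible for the parameter of interest, at a particular tolerance α. Writing $z_{\alpha/2}$ for the upper α/2 quantile of the N(0, 1) distribution, we start from

$$P\left(-z_{\alpha/2} \le \frac{\hat{p} - p}{\sqrt{p(1 - p)/n}} \le z_{\alpha/2}\right) \approx 1 - \alpha$$

which, after some algebra and with p̂ standing in for p in the standard error, means that

$$\hat{p} \pm z_{\alpha/2} \sqrt{\hat{p}(1 - \hat{p})/n}$$

is a 100(1 − α) percent Confidence Interval for our true parameter of interest p. This is to say that out of M total experiments, where each time we generate an estimator p̂, about (1 − α) · M of the resulting intervals should cover p.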
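The "M experiments" reading of a Confidence Interval can be checked directly by simulation. The sketch below assumes the usual Normal-approximation interval, p̂ plus or minus z times the square root of p̂(1 - p̂)/n, with z the upper α/2 quantile of N(0, 1); the function names and the values p = 0.3, n = 200, M = 5000 are assumptions for illustration only.

```python
import math
import random
from statistics import NormalDist

def normal_approx_interval(successes, n, alpha=0.05):
    """Interval p_hat +/- z * sqrt(p_hat (1 - p_hat) / n), z the upper alpha/2 Normal quantile."""
    p_hat = successes / n
    z = NormalDist().inv_cdf(1 - alpha / 2)
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width

def coverage(p, n, M, alpha=0.05, seed=0):
    """Run M experiments of n Bernoulli(p) trials; return the fraction of intervals that cover p."""
    rng = random.Random(seed)
    covered = 0
    for _ in range(M):
        successes = sum(rng.random() < p for _ in range(n))
        low, high = normal_approx_interval(successes, n, alpha)
        if low <= p <= high:
            covered += 1
    return covered / M

# With alpha = 0.05 we expect roughly 0.95 * M of the intervals to cover the true p.
print("fraction of intervals covering p:", coverage(p=0.3, n=200, M=5000))
```

The true p is fixed throughout; it is the interval that changes from experiment to experiment, which is exactly why the statement is about repeated experiments.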

2.10. Inference: Bayesian

Lastly, we can make statements about the estimators using Bayesian inference, if we are willing to explicitly state that the parameter of interest is random, i.e. we are not certain in our belief about it.

This allows us to quantify any prior beliefs about the parameter, look at data, and construct a set of posterior beliefs about the parameter.

In the Bayesian approach, the posterior for θ, π(θ|x) is a full PDF, or distribution. This distribution is the tool or method by which we conduct inference.

The Bayesian approach augments ‘frequentist’ procedures by including ‘prior’ information about the parameter of interest. In the ‘frequentist approach’ p is a constant, which we estimate, say, via the likelihood, $Lik(x \mid p) = \prod_{i=1}^{n} f_p(x_i)$. We derive estimates using the likelihood of the data, or the sampling distribution. A common estimate is p̂ = x̄.

In the ‘Bayesian approach’ p is an instance of a random variable with a PDF, say π(p), and now we derive estimates using the additional randomness of π(p) via Bayes' Equation:

$$\pi(p \mid x) = \frac{f(x \mid p) \cdot \pi(p)}{g(x)}$$

In this setup, for $x_1, \ldots, x_n = X \sim f_p(x)$:

  • $\pi(p)$: prior dist. for p
  • $f(x \mid p)$: likelihood, prob. of data given p
  • $g(x)$: marginal dist. of $x_1, \ldots, x_n$
  • $\pi(p \mid x)$: the posterior dist. for p

We get the marginal distribution for the data,

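$$g(x) = \begin{cases} \sum_{p} f(x \mid p)\,\pi(p) & \text{if } p \text{ is discrete} \\ \int f(x \mid p)\,\pi(p)\,dp & \text{if } p \text{ is cont.} \end{cases}$$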

 2.11. Bayesian Computation


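Let $X \sim Bin(2, p)$, with prior $\pi(p = .1) = .6$ and $\pi(p = .2) = .4$.

$$f_p(x) = f(x \mid p) = \binom{2}{x} p^x (1 - p)^{2 - x}$$

$$g(x) = \sum_{p} f_p(x)\,\pi(p) = f(x \mid .1) \cdot \pi(.1) + f(x \mid .2) \cdot \pi(.2) = \binom{2}{x} (.1)^x (.9)^{2 - x} \cdot .6 + \binom{2}{x} (.2)^x (.8)^{2 - x} \cdot .4$$

$$\pi(.1 \mid x) = \frac{f(x \mid .1) \cdot \pi(.1)}{g(x)} = \frac{(.1)^x (.9)^{2 - x} \cdot .6}{(.1)^x (.9)^{2 - x} \cdot .6 + (.2)^x (.8)^{2 - x} \cdot .4}$$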

and here,

$$\pi(.2 \mid x) = 1 - \pi(.1 \mid x)$$

Suppose we observe data $x_{obs} = 0$, then:

$$\pi(.1 \mid 0) = \frac{(.1)^0 (.9)^2 \cdot .6}{(.1)^0 (.9)^2 \cdot .6 + (.2)^0 (.8)^2 \cdot .4} = \frac{.486}{.742} = .6550$$

$$\pi(.2 \mid 0) = .3450$$
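The same arithmetic is easy to mechanize for any discrete prior. The following is a small sketch (the function names `binomial_pmf` and `discrete_posterior` are just labels chosen here): it multiplies the likelihood by the prior for each candidate p, sums to get the marginal g(x), and divides.

```python
from math import comb

def binomial_pmf(x, n, p):
    """f(x | p) for X ~ Bin(n, p)."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

def discrete_posterior(x, n, prior):
    """Posterior over a discrete prior, given as a dict {candidate p: prior probability}."""
    joint = {p: binomial_pmf(x, n, p) * weight for p, weight in prior.items()}
    g_x = sum(joint.values())   # marginal dist. of the data, g(x)
    return {p: value / g_x for p, value in joint.items()}

prior = {0.1: 0.6, 0.2: 0.4}    # the prior from the example above
for x in range(3):              # all possible outcomes of Bin(2, p)
    posterior = discrete_posterior(x, 2, prior)
    print("x =", x, {p: round(prob, 4) for p, prob in posterior.items()})
```

For x = 0 this reproduces the .6550 and .3450 computed by hand; the other rows show how observing more successes shifts belief toward p = .2.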

Now, with a full distribution for p, we can find ‘posterior’ estimates of parameters. Contrast this with the ‘frequentist’ approach, where we found point estimates and used the sampling distribution; in the Bayesian approach, the role the sampling distribution plays in inference on the parameter is taken over by the posterior distribution.


Let $X \sim Bin(2, p)$ and $\pi(p) = 1$, $0 \le p \le 1$; we observe data $x = 1$. Then

$$\pi(p \mid 1) = 6p(1 - p)$$

$$\hat{p} = E(p \mid 1) = \int_0^1 p \cdot 6p(1 - p)\,dp = \frac{1}{2}$$

...one possible estimate for the parameter p.
Another possible estimate of p is p̂, the posterior mode:

$$\hat{p} = \arg\max_{p} \pi(p \mid 1) = \frac{1}{2}$$
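When the integrals are not this friendly, the posterior can be approximated on a grid. The sketch below is one simple way to do it, assuming a uniform prior on [0, 1] and a midpoint rule for the integrals; the function name `grid_posterior` and the grid size are arbitrary choices for illustration. For X ~ Bin(2, p) with x = 1 it should recover a posterior mean and a posterior mode both close to 1/2.

```python
from math import comb

def grid_posterior(x, n, grid_size=10_000):
    """Posterior density of p on a midpoint grid, for X ~ Bin(n, p) and a uniform prior on [0, 1]."""
    width = 1.0 / grid_size
    grid = [(i + 0.5) * width for i in range(grid_size)]
    # likelihood times prior; the uniform prior density is 1 everywhere on [0, 1]
    unnormalized = [comb(n, x) * p**x * (1 - p)**(n - x) for p in grid]
    g_x = sum(unnormalized) * width          # marginal g(x), by the midpoint rule
    density = [u / g_x for u in unnormalized]
    return grid, density, width

grid, density, width = grid_posterior(x=1, n=2)

# Posterior mean: integrate p * pi(p | x) over the grid.
post_mean = sum(p * d for p, d in zip(grid, density)) * width
# Posterior mode: the grid point where the density is largest.
post_mode = grid[max(range(len(grid)), key=density.__getitem__)]

print("posterior mean:", round(post_mean, 4))
print("posterior mode:", round(post_mode, 4))
```

The mean and mode coincide here only because this posterior is symmetric about 1/2; in general they are different estimates.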


3. Bayesian Posterior Intervals

In the frequentist approach the Confidence Interval is the interval, say I, such that

$$P(\mu \in I \text{ in repeated experiments}) = 1 - \alpha$$

In the Bayesian approach the Posterior Interval is the interval, I, such that, under the posterior distribution given the observed data,

$$P(\mu \in I) = 1 - \alpha$$



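For example, let $X \sim Bin(2, p)$ and $\pi(p) = 1$, $0 \le p \le 1$, with the observed data

$$x_{obs} = 0$$

The 95% Bayes Interval is (a, b) such that

$$\int_0^a 3(1 - p)^2\,dp = .025$$

$$\int_b^1 3(1 - p)^2\,dp = .025$$

This gives a = .0084 and b = .7076. Then $P(p \in (.0084, .7076)) = .95$.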
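The endpoints can be checked with a few lines of code. Since the posterior density is 3(1 - p)^2, its distribution function is F(p) = 1 - (1 - p)^3, which can be inverted by hand. The sketch below uses that closed form (the function name `bayes_interval_x0` is just a label for this example) and applies only to this particular model, prior, and observation.

```python
def bayes_interval_x0(alpha=0.05):
    """Equal-tail posterior interval for p when X ~ Bin(2, p), uniform prior, x_obs = 0.

    The posterior density is 3(1 - p)^2, so its CDF is F(p) = 1 - (1 - p)^3,
    and solving F(p) = q gives p = 1 - (1 - q)^(1/3).
    """
    def quantile(q):
        return 1 - (1 - q) ** (1 / 3)

    # Put alpha/2 of the posterior probability in each tail.
    return quantile(alpha / 2), quantile(1 - alpha / 2)

a, b = bayes_interval_x0()
print(f"a = {a:.4f}, b = {b:.4f}")   # a = 0.0084, b = 0.7076
```

Unlike the frequentist interval, this one is a direct probability statement about p given the single data set we observed.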