Introduction to Machine Learning (CSCI-UA.473): Final Exam Solutions
New York University
Instructor: Sumit Chopra
December 21st, 2021
1 Probability and Basics (20 Points)
1. Let X and Y be discrete random variables. You are given their joint distribution p(X, Y). Let D denote the training set and θ denote the parameters of your model. Answer the following questions:

(a) [2 Points] Write the expression for the marginal distribution over X given p(X, Y).

[Sol:]
\[ p(X) = \sum_{y} p(X, y) \tag{1} \]
(b) [2 Points] Write the conditional distribution of X given Y, provided p(X, Y), p(X), and p(Y).

[Sol:]
\[ p(X \mid Y) = \frac{p(X, Y)}{p(Y)} \tag{2} \]
(c) [2 Points] Write the posterior distribution of Y given p(X | Y), p(Y), and p(X).

[Sol:]
\[ p(Y \mid X) = \frac{p(X, Y)}{p(X)} = \frac{p(X \mid Y) \cdot p(Y)}{p(X)} \tag{3} \]
(d) [2 Points] Write the expression for the posterior distribution of the parameters θ, given the prior p(θ) and the likelihood of the data.

[Sol:]
\[ p(\theta \mid \mathcal{D}) = \frac{p(\mathcal{D}, \theta)}{p(\mathcal{D})} = \frac{p(\mathcal{D} \mid \theta)\, p(\theta)}{p(\mathcal{D})} \tag{4} \]
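Equation (4) can be checked numerically on a discrete parameter grid. The coin-flip likelihood, the three candidate θ values, and the uniform prior below are illustrative assumptions, not part of the exam problem; the point is that the posterior is the prior reweighted by the likelihood and renormalized by the evidence p(D).

```python
# Minimal numeric sketch of Eq. (4): posterior over a discrete parameter grid.
thetas = [0.25, 0.5, 0.75]           # candidate parameter values (assumed)
prior = {t: 1.0 / 3 for t in thetas}  # p(theta): uniform prior (assumed)

# Data D: 3 heads out of 4 flips, so p(D | theta) is proportional to theta^3 (1 - theta)
def likelihood(t):
    return t**3 * (1 - t)

evidence = sum(likelihood(t) * prior[t] for t in thetas)  # p(D)
posterior = {t: likelihood(t) * prior[t] / evidence for t in thetas}

# The posterior is a proper distribution and concentrates on the theta
# that best explains the data.
assert abs(sum(posterior.values()) - 1.0) < 1e-9
assert posterior[0.75] > posterior[0.5] > posterior[0.25]
```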
2. [4 Points] Show that the Tanh function (tanh) and the Logistic Sigmoid function (σ) are related by tanh(a) = 2σ(2a) − 1.

[Sol:]
\[
2\sigma(2a) - 1
= \frac{2}{1 + e^{-2a}} - 1
= \frac{2}{1 + e^{-2a}} - \frac{1 + e^{-2a}}{1 + e^{-2a}}
= \frac{1 - e^{-2a}}{1 + e^{-2a}}
= \frac{e^{a} - e^{-a}}{e^{a} + e^{-a}}
= \tanh(a),
\]
where the second-to-last step multiplies the numerator and denominator by e^{a}.
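The identity is easy to sanity-check numerically; a quick sketch evaluating both sides at a handful of points:

```python
import math

# Numeric check of the identity tanh(a) = 2*sigma(2a) - 1.
def sigma(z):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-z))

for a in [-3.0, -0.5, 0.0, 0.5, 3.0]:
    assert abs(math.tanh(a) - (2 * sigma(2 * a) - 1)) < 1e-12
```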
3. [8 Points] Show that a general linear combination of Logistic Sigmoid functions of the form
\[ y(x, \mathbf{w}) = w_0 + \sum_{j=1}^{M} w_j \cdot \sigma\!\left(\frac{x - \mu_j}{s}\right) \]
is equivalent to a linear combination of tanh functions of the form
\[ y(x, \mathbf{u}) = u_0 + \sum_{j=1}^{M} u_j \cdot \tanh\!\left(\frac{x - \mu_j}{2s}\right) \]
and find the expression that relates the new parameters {u_1, …, u_M} to the original parameters {w_1, …, w_M}.

[Sol:]
If we take a_j = (x − μ_j)/(2s), we can rewrite the first equation as
\[
y(x, \mathbf{w})
= w_0 + \sum_{j=1}^{M} w_j\, \sigma(2 a_j)
= w_0 + \sum_{j=1}^{M} \frac{w_j}{2} \bigl( 2\sigma(2 a_j) - 1 + 1 \bigr)
= u_0 + \sum_{j=1}^{M} u_j \tanh(a_j),
\]
where u_j = w_j/2 for j = 1, …, M and u_0 = w_0 + \sum_{j=1}^{M} w_j/2.
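The reparameterisation above can be verified numerically: with u_j = w_j/2 and u_0 = w_0 + Σ_j w_j/2 the two expansions agree pointwise. The values of M, s, w, and μ below are arbitrary illustrative choices.

```python
import math
import random

# Numeric check that the sigmoid expansion equals the tanh expansion
# under u_j = w_j / 2 and u_0 = w_0 + sum_j w_j / 2.
random.seed(0)
M, s = 4, 0.7                                 # assumed basis count and width
w0 = 0.3
w = [random.uniform(-1, 1) for _ in range(M)]  # assumed weights
mu = [random.uniform(-2, 2) for _ in range(M)] # assumed centres

def sigma(z):
    return 1.0 / (1.0 + math.exp(-z))

u0 = w0 + sum(wj / 2 for wj in w)
u = [wj / 2 for wj in w]

for x in [-1.5, 0.0, 2.0]:
    y_sigma = w0 + sum(w[j] * sigma((x - mu[j]) / s) for j in range(M))
    y_tanh = u0 + sum(u[j] * math.tanh((x - mu[j]) / (2 * s)) for j in range(M))
    assert abs(y_sigma - y_tanh) < 1e-12
```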
2 Parametric Models (20 Points)
1. Let D = {(x_1, y_1), …, (x_N, y_N)} be the training set, where each training sample (x_i, y_i) is independently and identically distributed. The model is given by
\[ \hat{y}_i = \mathbf{w}^T \mathbf{x}_i + \epsilon_i, \]
where \epsilon_i \sim \mathcal{N}(0, \sigma^2). Answer the following questions:
(a) [4 Points] Let Y = [y_1, …, y_N] be a vector of all the labels and X = [x_1, …, x_N] be a matrix of all the inputs. Write the expression for the conditional likelihood p(Y | X).

[Sol:]
\[ Y \mid X \sim \mathcal{N}(\mathbf{w}^T X, \sigma^2) \]
\[ p(Y \mid X) = \prod_{i=1}^{N} \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(y_i - \mathbf{w}^T \mathbf{x}_i)^2}{2\sigma^2}} \]
(b) [8 Points] Assume that the prior distribution of the parameters θ is Gaussian: θ ∼ N(0, β²I). Show that computing the MAP estimate of the parameters is equivalent to minimizing a loss function composed of the mean squared error and an L2 regularizer.

[Sol:]
\[ \theta_{MAP} = \arg\max_{\theta}\, \bigl[\, p(Y \mid X, \theta) \cdot p(\theta) \,\bigr] \]
We have already derived the expression for p(Y | X, θ) in the previous part, and p(θ) is given as the Gaussian N(0, β²I):
\[
\theta_{MAP} = \arg\max_{\theta} \prod_{i=1}^{N} \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(y_i - \mathbf{w}^T \mathbf{x}_i)^2}{2\sigma^2}} \cdot \frac{1}{\sqrt{2\pi\beta^2}}\, e^{-\frac{\theta^T \theta}{2\beta^2}}
\]
Now we can convert the arg max into an arg min by taking the negative log, and simplify further by dropping the constants:
\[
\theta_{MAP} = \arg\min_{\theta}\; \sum_{i=1}^{N} \frac{(\mathbf{w}^T \mathbf{x}_i - y_i)^2}{2\sigma^2} + \frac{\theta^T \theta}{2\beta^2}
\]
Let λ = σ²/β² and multiply through by σ²; thus we have:
\[
\theta_{MAP} = \arg\min_{\theta}\; \frac{1}{2} \sum_{i=1}^{N} (\mathbf{w}^T \mathbf{x}_i - y_i)^2 + \frac{\lambda}{2}\, \theta^T \theta
\]
The first term is the (unnormalized) mean squared error and the second is an L2 regularizer.
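For a scalar weight, the final objective has the familiar ridge-regression closed form, which a few lines of Python can confirm is the minimizer. The data and λ below are illustrative assumptions.

```python
# 1-D sketch: the MAP objective (1/2) * sum_i (w*x_i - y_i)^2 + (lam/2) * w^2
# is minimized in closed form by w* = (sum_i x_i*y_i) / (sum_i x_i^2 + lam),
# i.e. ordinary ridge regression.
xs = [0.0, 1.0, 2.0, 3.0]   # assumed inputs
ys = [0.1, 0.9, 2.1, 2.9]   # assumed labels
lam = 0.5                   # lam = sigma^2 / beta^2 in the derivation above

def objective(w):
    data_term = 0.5 * sum((w * x - y) ** 2 for x, y in zip(xs, ys))
    return data_term + 0.5 * lam * w * w

w_star = sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

# w_star should beat nearby perturbations on the MAP objective.
for dw in (-0.01, 0.01):
    assert objective(w_star) < objective(w_star + dw)
```

Note that the regularizer pulls the solution toward zero: as lam grows, w_star shrinks, exactly the effect of the Gaussian prior centered at 0.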
2. [8 Points] In the following data set, A, B, C are input binary random variables, and y is a binary output whose value we want to predict:

A  B  C  y
0  0  1  0
0  1  0  0
1  1  0  0
0  0  1  1
1  1  1  1
1  0  0  1
1  1  0  1

How will the Naive Bayes Classifier predict y given the input A = 0, B = 0, C = 1?
[Sol:]
From the table above we have:
P(y = 0) = 3/7, P(y = 1) = 4/7
P(A = 0 | y = 0) = 2/3, P(B = 0 | y = 0) = 1/3, P(C = 1 | y = 0) = 1/3
P(A = 0 | y = 1) = 1/4, P(B = 0 | y = 1) = 1/2, P(C = 1 | y = 1) = 1/2
The predicted y maximizes P(A = 0 | y) P(B = 0 | y) P(C = 1 | y) P(y).
For y = 0 we have:
P(A = 0 | y = 0) P(B = 0 | y = 0) P(C = 1 | y = 0) P(y = 0) = (2/3)(1/3)(1/3)(3/7) ≈ 0.0317.
For y = 1 we have:
P(A = 0 | y = 1) P(B = 0 | y = 1) P(C = 1 | y = 1) P(y = 1) = (1/4)(1/2)(1/2)(4/7) ≈ 0.0357.
Thus the Naive Bayes Classifier will predict y = 1.
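The arithmetic above can be reproduced mechanically from the table. A short sketch using exact fractions, with the table hard-coded as rows of (A, B, C, y):

```python
from fractions import Fraction

# Recompute the Naive Bayes scores for the query A = 0, B = 0, C = 1.
data = [  # rows of (A, B, C, y) from the table above
    (0, 0, 1, 0), (0, 1, 0, 0), (1, 1, 0, 0),
    (0, 0, 1, 1), (1, 1, 1, 1), (1, 0, 0, 1), (1, 1, 0, 1),
]

def prob(pred, cond=lambda r: True):
    """Empirical P(pred | cond) estimated from the table."""
    rows = [r for r in data if cond(r)]
    return Fraction(sum(1 for r in rows if pred(r)), len(rows))

scores = {}
for c in (0, 1):
    score = prob(lambda r: r[3] == c)                        # P(y = c)
    score *= prob(lambda r: r[0] == 0, lambda r: r[3] == c)  # P(A = 0 | y = c)
    score *= prob(lambda r: r[1] == 0, lambda r: r[3] == c)  # P(B = 0 | y = c)
    score *= prob(lambda r: r[2] == 1, lambda r: r[3] == c)  # P(C = 1 | y = c)
    scores[c] = score

assert scores[0] == Fraction(2, 63)  # ~0.0317
assert scores[1] == Fraction(1, 28)  # ~0.0357
assert scores[1] > scores[0]         # Naive Bayes predicts y = 1
```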
3 Support Vector Machines (20 Points)
1. [10 Points] Kernel functions implicitly define a mapping function φ(·) that transforms an input instance x ∈ ℝ^d to a high-dimensional feature space Q by giving the form of a dot product in Q: K(x_i, x_j) = φ(x_i) · φ(x_j). Assume we use a kernel function of the form
\[ K(\mathbf{x}_i, \mathbf{x}_j) = \exp\!\left(-\tfrac{1}{2} \|\mathbf{x}_i - \mathbf{x}_j\|^2\right). \]
Thus we assume that there is some implicit unknown function φ(x) such that
\[ \phi(\mathbf{x}_i) \cdot \phi(\mathbf{x}_j) = K(\mathbf{x}_i, \mathbf{x}_j) = \exp\!\left(-\tfrac{1}{2} \|\mathbf{x}_i - \mathbf{x}_j\|^2\right). \]
Prove that for any two input instances x_i and x_j, the squared Euclidean distance of their corresponding points in the feature space Q is less than 2. That is, prove that ‖φ(x_i) − φ(x_j)‖² < 2.
[Sol:]
\[
\begin{aligned}
\|\phi(\mathbf{x}_i) - \phi(\mathbf{x}_j)\|^2
&= (\phi(\mathbf{x}_i) - \phi(\mathbf{x}_j)) \cdot (\phi(\mathbf{x}_i) - \phi(\mathbf{x}_j)) \\
&= \phi(\mathbf{x}_i) \cdot \phi(\mathbf{x}_i) + \phi(\mathbf{x}_j) \cdot \phi(\mathbf{x}_j) - 2\, \phi(\mathbf{x}_i) \cdot \phi(\mathbf{x}_j) \\
&= 2 - 2 \exp\!\left(-\tfrac{1}{2} \|\mathbf{x}_i - \mathbf{x}_j\|^2\right) < 2,
\end{aligned}
\]
since exp(−½‖x_i − x_j‖²) > 0.
2. [4 Points] Let M_SVM be an SVM model which you have trained using some training set. Consider a new point that is correctly classified and distant from the decision boundary. Why would the SVM's decision boundary be unaffected by this point, while the one learnt by logistic regression would be affected?
[Sol:]
This is because the hinge loss used by the SVM assigns zero loss to this distant point, so it has no effect on the learning of the decision boundary. In contrast, the negative log-likelihood loss optimized by logistic regression assigns a non-zero weight to every point, so the decision boundary learnt by logistic regression will be affected.
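The loss-function argument above can be made concrete by evaluating both losses at a large positive margin (margin here means y·f(x); the specific value 5 is an illustrative assumption):

```python
import math

# A distant, correctly classified point has a large positive margin.
# The hinge loss is exactly zero there (zero gradient, no influence on the
# SVM), while the logistic loss is small but strictly positive.
def hinge(margin):
    return max(0.0, 1.0 - margin)

def log_loss(margin):
    return math.log(1.0 + math.exp(-margin))

margin = 5.0                    # far on the correct side of the boundary
assert hinge(margin) == 0.0     # contributes nothing to the SVM objective
assert log_loss(margin) > 0.0   # still pulls on the logistic solution
```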
3. Answer the following True/False questions and explain your answer in no more than two lines.

(a) [3 Points] If we switch from a linear kernel to a higher-order polynomial kernel, the support vectors will not change.

[Sol:]
False. There is no guarantee that the support vectors would remain the same, because the transformed feature vectors implicitly generated by the polynomial kernel are non-linear functions of the original input vectors, and thus the support points for maximum-margin separation in the feature space can be quite different.
(b) [3 Points] The maximum-margin decision boundary that SVMs learn provides the best generalization error among all linear classifiers.

[Sol:]
False. The maximum-margin separating hyperplane is often a reasonable choice among all the possible separating hyperplanes, but there is no guarantee that this hyperplane will provide the best generalization error among all of them.