Midterm_stub
November 4, 2021
1 Section 1 (Regression)
[7]:
import gzip
from collections import defaultdict
import math
import scipy.optimize
import numpy
import string
import random
from sklearn import linear_model
[8]:
def parse(f):
    for l in gzip.open(f):
        yield eval(l)
[9]:
# Download data from below:
# https://cseweb.ucsd.edu/classes/fa21/cse258-b/files/
dataset = list(parse("trainRecipes.json.gz"))
[10]:
len(dataset)

[10]:
200000
[11]:
train = dataset[:150000]
valid = dataset[150000:175000]
test = dataset[175000:]
[12]:
dataset[0]

[12]:
{'name': 'sexy fried eggs for sunday brunch',
 'minutes': 10,
 'contributor_id': '14298494',
 'submitted': '2004-05-21',
 'steps': 'heat a ridged griddle pan\tlightly brush the tomato slices and bread
with some olive oil\tcook the tomato slices first , for at least 5 minutes\twhen
they are almost ready , toast the bread in the same pan until well
bar-marked\tin the meantime , pour a little olive oil into a small frying pan
and crack in the egg\tallow it to set for a minute or so and add the garlic and
chilli\tcook for a couple of minutes , spooning the hot oil over the egg until
cooked to your liking\tplace the griddled bread on a plate and quickly spoon the
tomatoes on top\tthrow the chives into the egg pan and splash in the balsamic
vinegar\tseason well , then slide the egg on to the tomatoes and drizzle the pan
juices on top\tserve immediately , with a good cup of tea !',
 'description': 'this is from silvana franco\'s book "family" which i love. i
made these for brunch yesterday and we loved them so much that we had them again
today!',
 'ingredients': ['plum tomato',
  'ciabatta',
  'olive oil',
  'egg',
  'garlic clove',
  'chili',
  'chives',
  'balsamic vinegar',
  'salt and pepper'],
 'recipe_id': '06432987'}
1.1 Question 1
[13]:
def feat1a(d):
    return [len(d['steps']), len(d['ingredients'])]

print(f"Feature vector for first training sample of 1a:\n{feat1a(dataset[0])}")

Feature vector for first training sample of 1a:
[743, 9]
[14]:
maxYear = -math.inf
minYear = math.inf
for elem in dataset:
    year = int(elem['submitted'][:4])
    if year > maxYear: maxYear = year
    if year < minYear: minYear = year

print(f"Year of newest submission: {maxYear}\nYear of oldest submission: {minYear}")

Year of newest submission: 2018
Year of oldest submission: 1999
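Equivalently, the year range can be found with Python's built-in min and max (a compact alternative sketch, not the original solution):

years = [int(elem['submitted'][:4]) for elem in dataset]
minYear, maxYear = min(years), max(years)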
[15]:
def feat1b(d):
    # One-hot encoding of the year, with maxYear (2018) as the all-zero reference
    year = int(d['submitted'][:4])
    numYears = maxYear - minYear
    yearFeat = [0] * numYears
    if year != maxYear:
        yearFeat[maxYear - year - 1] = 1
    # One-hot encoding of the month over 11 slots
    month = int(d['submitted'][5:7])
    monthFeat = [0] * 11
    monthFeat[month - 2] = 1
    return yearFeat + monthFeat

print(f"Feature vector for first training sample of 1b:\n{feat1b(dataset[0])}")

Feature vector for first training sample of 1b:
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,
0, 0, 0, 0]
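Note that monthFeat[month - 2] wraps around for January: month 1 maps to index -1, i.e. the last slot, so January and December collide. If that collision matters, a guard like the following would avoid it (a suggested fix, not part of the original solution):

# Treat January as the all-zero reference month instead of wrapping to index -1
if month != 1:
    monthFeat[month - 2] = 1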
[16]:
ingredientCount
=
{}
for
elem
in
dataset:
for
ingredient
in
elem[
'ingredients'
]:
if
ingredient
in
ingredientCount:
ingredientCount[ingredient]
+= 1
else
:
ingredientCount[ingredient]
= 1
ingredientCount
=
dict
(
sorted
(ingredientCount
.
items(), key
=
lambda
item:
␣
,
→
item[
1
], reverse
=
True
))
topFiftyIngredients
=
[]
for
key
in
ingredientCount
.
keys():
if
len
(topFiftyIngredients)
>= 50
:
break
else
:
topFiftyIngredients
.
append(key)
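The same top-50 list can be built more compactly with collections.Counter from the standard library (an equivalent sketch, not the original solution):

from collections import Counter

ingredientCounter = Counter(ingredient for elem in dataset
                            for ingredient in elem['ingredients'])
topFiftyIngredients = [ingredient for ingredient, _ in ingredientCounter.most_common(50)]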
[17]:
def feat1c(d):
    feat = [0] * 50
    ingredients = d['ingredients']
    index = 0
    for elem in topFiftyIngredients:
        if elem in ingredients:
            feat[index] = 1
        index += 1
    return feat

print(f"Feature vector for first training sample of 1c:\n{feat1c(dataset[0])}")

Feature vector for first training sample of 1c:
[0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
[18]:
def feat(d, a=True, b=True, c=True):
    # Hint: for Questions 1 and 2, might be useful to set up a function like this
    # which allows you to "select" which features are included
    feature = [1]
    if a: feature += feat1a(d)
    if b: feature += feat1b(d)
    if c: feature += feat1c(d)
    return feature
[19]:
def MSE(y, ypred):
    # Can use library if you prefer
    differences = [(yi - ypi) ** 2 for yi, ypi in zip(y, ypred)]
    return sum(differences) / len(differences)
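As the comment suggests, a library version could be used instead; sklearn's mean_squared_error should agree with the hand-rolled version above up to floating-point error (a quick sanity check):

from sklearn.metrics import mean_squared_error

assert abs(MSE([1, 2, 3], [1, 2, 5]) - mean_squared_error([1, 2, 3], [1, 2, 5])) < 1e-9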
[20]:
# Splitting dataset
# Dataset not sorted after date, thus splitting it in the trivial way
dataTrain
=
dataset[:
len
(dataset)
*3//4
]
dataValidation
=
dataset[
len
(dataset)
*3//4
:
len
(dataset)
*7//8
]
dataTest
=
dataset[
len
(dataset)
*7//8
:]
[21]:
def
train_model
(mod, a
=
True
, b
=
True
, c
=
True
):
# Hint: might be useful to write this function which extracts features and
#
computes the performance of a particular model on those features
XTrain
=
[feat(data, a, b, c)
for
data
in
dataTrain]
yTrain
=
[data[
'minutes'
]
for
data
in
dataTrain]
mod
.
fit(XTrain, yTrain)
[22]:
model1a
=
linear_model
.
LinearRegression()
train_model(model1a,
True
,
False
,
False
)
X1aTest
=
[feat(data,
True
,
False
,
False
)
for
data
in
dataTest]
y1aTest
=
[d[
'minutes'
]
for
d
in
dataTest]
y1aPred
=
model1a
.
predict(X1aTest)
print
(
f"MSE for 1a:
{
MSE(y1aTest, y1aPred)
}
"
)
MSE for 1a: 6169.549296366476
[23]:
model1b = linear_model.LinearRegression()
train_model(model1b, False, True, False)
X1bTest = [feat(data, False, True, False) for data in dataTest]
y1bTest = [d['minutes'] for d in dataTest]
y1bPred = model1b.predict(X1bTest)
print(f"MSE for 1b: {MSE(y1bTest, y1bPred)}")

MSE for 1b: 6396.644907898458
[24]:
model1c
=
linear_model
.
LinearRegression()
y1cPred
=
train_model(model1c,
False
,
False
,
True
)
X1cTest
=
[feat(data,
False
,
False
,
True
)
for
data
in
dataTest]
y1cTest
=
[d[
'minutes'
]
for
d
in
dataTest]
y1cPred
=
model1c
.
predict(X1cTest)
print
(
f"MSE for 1c:
{
MSE(y1cTest, y1cPred)
}
"
)
MSE for 1c: 6000.948439855985
1.2 Question 2
[25]:
modelAll
=
linear_model
.
LinearRegression()
yAllPred
=
train_model(modelAll,
True
,
True
,
True
)
XAllTest
=
[feat(data,
True
,
True
,
True
)
for
data
in
dataTest]
yAllTest
=
[d[
'minutes'
]
for
d
in
dataTest]
yAllPred
=
modelAll
.
predict(XAllTest)
print
(
f"MSE for model with all features:
{
MSE(yAllTest, yAllPred)
}
"
)
MSE for model with all features: 5861.087768668749
[26]:
model1bc
=
linear_model
.
LinearRegression()
y1bcPred
=
train_model(model1bc,
False
,
True
,
True
)
X1bcTest
=
[feat(data,
False
,
True
,
True
)
for
data
in
dataTest]
y1bcTest
=
[d[
'minutes'
]
for
d
in
dataTest]
y1bcPred
=
model1bc
.
predict(X1bcTest)
print
(
f"MSE for model without 1a:
{
MSE(y1bcTest, y1bcPred)
}
"
)
MSE for model without 1a: 5992.513432436371
[27]:
model1ac
=
linear_model
.
LinearRegression()
y1acPred
=
train_model(model1ac,
True
,
False
,
True
)
X1acTest
=
[feat(data,
True
,
False
,
True
)
for
data
in
dataTest]
y1acTest
=
[d[
'minutes'
]
for
d
in
dataTest]
y1acPred
=
model1ac
.
predict(X1acTest)
5
print
(
f"MSE for model without 1b:
{
MSE(y1acTest, y1acPred)
}
"
)
MSE for model without 1b: 5870.115061656081
[28]:
model1ab
=
linear_model
.
LinearRegression()
y1abPred
=
train_model(model1ab,
True
,
True
,
False
)
X1abTest
=
[feat(data,
True
,
True
,
False
)
for
data
in
dataTest]
y1abTest
=
[d[
'minutes'
]
for
d
in
dataTest]
y1abPred
=
model1ab
.
predict(X1abTest)
print
(
f"MSE for model without 1c:
{
MSE(y1abTest, y1abPred)
}
"
)
MSE for model without 1c: 6157.511552782132
1.2.1 Reasoning on result

Comparing the MSEs when excluding one feature set at a time, the most important feature set appears to be the one implemented in task 1c, since its exclusion leads to the largest increase in MSE (from 5861.09 with all features to 6157.51). Excluding the features from task 1a also leads to a fairly significant increase (to 5992.51), making it the second most important feature set. Excluding the features from task 1b only leads to a slight increase (to 5870.12). Still, with the amount of data we have, the best predictor is the one that includes all features.
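The four fits above can also be expressed as one small ablation loop reusing train_model, feat, and MSE (a sketch; the helper name ablation_mse is illustrative):

def ablation_mse(a, b, c):
    # Fit on dataTrain with the selected feature groups, score on dataTest
    mod = linear_model.LinearRegression()
    train_model(mod, a, b, c)
    XTest = [feat(d, a, b, c) for d in dataTest]
    yTest = [d['minutes'] for d in dataTest]
    return MSE(yTest, mod.predict(XTest))

for label, (a, b, c) in [('all features', (True, True, True)),
                         ('without 1a', (False, True, True)),
                         ('without 1b', (True, False, True)),
                         ('without 1c', (True, True, False))]:
    print(f"MSE {label}: {ablation_mse(a, b, c)}")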
1.3 Question 4
The problem is that, because the error is squared (hence Mean SQUARED Error), the prediction errors on the outliers with long cooking times are amplified by the square and therefore dominate the MSE. To optimize the MSE, the predictor will focus on the few recipes with long cooking times rather than on the large number of recipes with short cooking times, which is an obvious drawback.
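A toy illustration of the effect, with made-up numbers: one large outlier error dominates the mean of squared errors far more than the mean of absolute errors.

# Nine errors of 5 minutes and one outlier error of 500 minutes
errors = [5] * 9 + [500]
mse = sum(e ** 2 for e in errors) / len(errors)   # 25022.5 -- dominated by the outlier
mae = sum(abs(e) for e in errors) / len(errors)   #    54.5 -- far less sensitive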
There are several ways to design a better predictor for such a dataset. The simplest is to remove the outliers from the dataset; this, however, gives us no practical means of predicting outliers. Another way is to transform the output variable with a transformation that compresses large values more than small ones, such as the logarithm (see the sketch below); finding an appropriate transformation may be hard and may require careful tuning. A third way is to reframe the problem as a classification problem, e.g. predicting whether the cooking time is above 20 minutes; this, however, gives a very coarse prediction. A fourth and last way is to use an objective that is not as sensitive to outliers as the MSE, such as the MAE.
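A minimal sketch of the log-transform variant, reusing the pipeline above (numpy's log1p/expm1 are assumed for the transform; this is illustrative, not part of the original solution):

# Train on log(1 + minutes) so very long cooking times are compressed
XTrain = [feat(d) for d in dataTrain]
yTrainLog = [numpy.log1p(d['minutes']) for d in dataTrain]
modelLog = linear_model.LinearRegression()
modelLog.fit(XTrain, yTrainLog)

# Invert the transform to get predictions back in minutes
XTest = [feat(d) for d in dataTest]
yPredMinutes = numpy.expm1(modelLog.predict(XTest))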