Midterm_stub
November 4, 2021
1 Section 1 (Regression)
[7]:
import gzip
from collections import defaultdict
import math
import scipy.optimize
import numpy
import string
import random
from sklearn import linear_model
[8]:
def parse(f):
    for l in gzip.open(f):
        yield eval(l)
[9]:
# Download data from below:
# https://cseweb.ucsd.edu/classes/fa21/cse258-b/files/
dataset = list(parse("trainRecipes.json.gz"))
[10]:
len(dataset)

[10]:
200000
[11]:
train = dataset[:150000]
valid = dataset[150000:175000]
test = dataset[175000:]
[12]:
dataset[0]

[12]:
{'name': 'sexy fried eggs for sunday brunch',
 'minutes': 10,
 'contributor_id': '14298494',
 'submitted': '2004-05-21',
 'steps': 'heat a ridged griddle pan\tlightly brush the tomato slices and bread
with some olive oil\tcook the tomato slices first , for at least 5 minutes\twhen
they are almost ready , toast the bread in the same pan until well
bar-marked\tin the meantime , pour a little olive oil into a small frying pan
and crack in the egg\tallow it to set for a minute or so and add the garlic and
chilli\tcook for a couple of minutes , spooning the hot oil over the egg until
cooked to your liking\tplace the griddled bread on a plate and quickly spoon the
tomatoes on top\tthrow the chives into the egg pan and splash in the balsamic
vinegar\tseason well , then slide the egg on to the tomatoes and drizzle the pan
juices on top\tserve immediately , with a good cup of tea !',
 'description': 'this is from silvana franco\'s book "family" which i love. i
made these for brunch yesterday and we loved them so much that we had them again
today!',
 'ingredients': ['plum tomato',
  'ciabatta',
  'olive oil',
  'egg',
  'garlic clove',
  'chili',
  'chives',
  'balsamic vinegar',
  'salt and pepper'],
 'recipe_id': '06432987'}
1.1 Question 1
[13]:
def feat1a(d):
    return [len(d['steps']), len(d['ingredients'])]

print(f"Feature vector for first training sample of 1a:\n{feat1a(dataset[0])}")

Feature vector for first training sample of 1a:
[743, 9]
[14]:
maxYear = -math.inf
minYear = math.inf
for elem in dataset:
    year = int(elem['submitted'][:4])
    if year > maxYear: maxYear = year
    if year < minYear: minYear = year

print(f"Year of newest submission: {maxYear}\nYear of oldest submission: {minYear}")

Year of newest submission: 2018
Year of oldest submission: 1999
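Equivalently, the year range can be found with Python's built-in min and max (a compact alternative sketch, not the original solution):

years = [int(elem['submitted'][:4]) for elem in dataset]
minYear, maxYear = min(years), max(years)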
[15]:
def feat1b(d):
    # One-hot encoding of the year, with maxYear (2018) as the all-zero reference
    year = int(d['submitted'][:4])
    numYears = maxYear - minYear
    yearFeat = [0] * numYears
    if year != maxYear:
        yearFeat[maxYear - year - 1] = 1
    # One-hot encoding of the month over 11 slots
    month = int(d['submitted'][5:7])
    monthFeat = [0] * 11
    monthFeat[month - 2] = 1
    return yearFeat + monthFeat

print(f"Feature vector for first training sample of 1b:\n{feat1b(dataset[0])}")

Feature vector for first training sample of 1b:
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,
0, 0, 0, 0]
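Note that monthFeat[month - 2] wraps around for January: month 1 maps to index -1, i.e. the last slot, so January and December collide. If that collision matters, a guard like the following would avoid it (a suggested fix, not part of the original solution):

# Treat January as the all-zero reference month instead of wrapping to index -1
if month != 1:
    monthFeat[month - 2] = 1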
[16]:
ingredientCount
=
{}
for
elem
in
dataset:
for
ingredient
in
elem[
'ingredients'
]:
if
ingredient
in
ingredientCount:
ingredientCount[ingredient]
+= 1
else
:
ingredientCount[ingredient]
= 1
ingredientCount
=
dict
(
sorted
(ingredientCount
.
items(), key
=
lambda
item:
␣
,
→
item[
1
], reverse
=
True
))
topFiftyIngredients
=
[]
for
key
in
ingredientCount
.
keys():
if
len
(topFiftyIngredients)
>= 50
:
break
else
:
topFiftyIngredients
.
append(key)
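The same top-50 list can be built more compactly with collections.Counter from the standard library (an equivalent sketch, not the original solution):

from collections import Counter

ingredientCounter = Counter(ingredient for elem in dataset
                            for ingredient in elem['ingredients'])
topFiftyIngredients = [ingredient for ingredient, _ in ingredientCounter.most_common(50)]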
[17]:
def feat1c(d):
    feat = [0] * 50
    ingredients = d['ingredients']
    index = 0
    for elem in topFiftyIngredients:
        if elem in ingredients:
            feat[index] = 1
        index += 1
    return feat

print(f"Feature vector for first training sample of 1c:\n{feat1c(dataset[0])}")

Feature vector for first training sample of 1c:
[0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
[18]:
def feat(d, a=True, b=True, c=True):
    # Hint: for Questions 1 and 2, might be useful to set up a function like this
    # which allows you to "select" which features are included
    feature = [1]
    if a: feature += feat1a(d)
    if b: feature += feat1b(d)
    if c: feature += feat1c(d)
    return feature
[19]:
def MSE(y, ypred):
    # Can use library if you prefer
    differences = [(yi - ypi) ** 2 for yi, ypi in zip(y, ypred)]
    return sum(differences) / len(differences)
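As the comment suggests, a library version could be used instead; sklearn's mean_squared_error should agree with the hand-rolled version above up to floating-point error (a quick sanity check):

from sklearn.metrics import mean_squared_error

assert abs(MSE([1, 2, 3], [1, 2, 5]) - mean_squared_error([1, 2, 3], [1, 2, 5])) < 1e-9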
[20]:
# Splitting dataset
# Dataset not sorted after date, thus splitting it in the trivial way
dataTrain
=
dataset[:
len
(dataset)
*3//4
]
dataValidation
=
dataset[
len
(dataset)
*3//4
:
len
(dataset)
*7//8
]
dataTest
=
dataset[
len
(dataset)
*7//8
:]
[21]:
def
train_model
(mod, a
=
True
, b
=
True
, c
=
True
):
# Hint: might be useful to write this function which extracts features and
#
computes the performance of a particular model on those features
XTrain
=
[feat(data, a, b, c)
for
data
in
dataTrain]
yTrain
=
[data[
'minutes'
]
for
data
in
dataTrain]
mod
.
fit(XTrain, yTrain)
[22]:
model1a
=
linear_model
.
LinearRegression()
train_model(model1a,
True
,
False
,
False
)
X1aTest
=
[feat(data,
True
,
False
,
False
)
for
data
in
dataTest]
y1aTest
=
[d[
'minutes'
]
for
d
in
dataTest]
y1aPred
=
model1a
.
predict(X1aTest)
print
(
f"MSE for 1a:
{
MSE(y1aTest, y1aPred)
}
"
)
MSE for 1a: 6169.549296366476
[23]:
model1b = linear_model.LinearRegression()
train_model(model1b, False, True, False)
X1bTest = [feat(data, False, True, False) for data in dataTest]
y1bTest = [d['minutes'] for d in dataTest]
y1bPred = model1b.predict(X1bTest)
print(f"MSE for 1b: {MSE(y1bTest, y1bPred)}")

MSE for 1b: 6396.644907898458
[24]:
model1c
=
linear_model
.
LinearRegression()
y1cPred
=
train_model(model1c,
False
,
False
,
True
)
X1cTest
=
[feat(data,
False
,
False
,
True
)
for
data
in
dataTest]
y1cTest
=
[d[
'minutes'
]
for
d
in
dataTest]
y1cPred
=
model1c
.
predict(X1cTest)
print
(
f"MSE for 1c:
{
MSE(y1cTest, y1cPred)
}
"
)
MSE for 1c: 6000.948439855985
1.2 Question 2
[25]:
modelAll
=
linear_model
.
LinearRegression()
yAllPred
=
train_model(modelAll,
True
,
True
,
True
)
XAllTest
=
[feat(data,
True
,
True
,
True
)
for
data
in
dataTest]
yAllTest
=
[d[
'minutes'
]
for
d
in
dataTest]
yAllPred
=
modelAll
.
predict(XAllTest)
print
(
f"MSE for model with all features:
{
MSE(yAllTest, yAllPred)
}
"
)
MSE for model with all features: 5861.087768668749
[26]:
model1bc
=
linear_model
.
LinearRegression()
y1bcPred
=
train_model(model1bc,
False
,
True
,
True
)
X1bcTest
=
[feat(data,
False
,
True
,
True
)
for
data
in
dataTest]
y1bcTest
=
[d[
'minutes'
]
for
d
in
dataTest]
y1bcPred
=
model1bc
.
predict(X1bcTest)
print
(
f"MSE for model without 1a:
{
MSE(y1bcTest, y1bcPred)
}
"
)
MSE for model without 1a: 5992.513432436371
[27]:
model1ac
=
linear_model
.
LinearRegression()
y1acPred
=
train_model(model1ac,
True
,
False
,
True
)
X1acTest
=
[feat(data,
True
,
False
,
True
)
for
data
in
dataTest]
y1acTest
=
[d[
'minutes'
]
for
d
in
dataTest]
y1acPred
=
model1ac
.
predict(X1acTest)
5
print
(
f"MSE for model without 1b:
{
MSE(y1acTest, y1acPred)
}
"
)
MSE for model without 1b: 5870.115061656081
[28]:
model1ab
=
linear_model
.
LinearRegression()
y1abPred
=
train_model(model1ab,
True
,
True
,
False
)
X1abTest
=
[feat(data,
True
,
True
,
False
)
for
data
in
dataTest]
y1abTest
=
[d[
'minutes'
]
for
d
in
dataTest]
y1abPred
=
model1ab
.
predict(X1abTest)
print
(
f"MSE for model without 1c:
{
MSE(y1abTest, y1abPred)
}
"
)
MSE for model without 1c: 6157.511552782132
1.2.1 Reasoning on result

Comparing the MSEs when excluding one feature set at a time, the most important feature set appears to be the one implemented in task 1c, since its exclusion leads to the largest increase in MSE (from 5861.09 with all features to 6157.51). Excluding the features from task 1a also leads to a fairly significant increase (to 5992.51), making it the second most important feature set. Excluding the features from task 1b only leads to a slight increase (to 5870.12). Still, with the amount of data we have, the best predictor is the one that includes all features.
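The four fits above can also be expressed as one small ablation loop reusing train_model, feat, and MSE (a sketch; the helper name ablation_mse is illustrative):

def ablation_mse(a, b, c):
    # Fit on dataTrain with the selected feature groups, score on dataTest
    mod = linear_model.LinearRegression()
    train_model(mod, a, b, c)
    XTest = [feat(d, a, b, c) for d in dataTest]
    yTest = [d['minutes'] for d in dataTest]
    return MSE(yTest, mod.predict(XTest))

for label, (a, b, c) in [('all features', (True, True, True)),
                         ('without 1a', (False, True, True)),
                         ('without 1b', (True, False, True)),
                         ('without 1c', (True, True, False))]:
    print(f"MSE {label}: {ablation_mse(a, b, c)}")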
1.3 Question 4
The problem is that, because the error is squared (hence Mean SQUARED Error), the prediction errors on the outliers with long cooking times are amplified by the square and therefore dominate the MSE. To optimize the MSE, the predictor will focus on the few recipes with long cooking times rather than on the large number of recipes with short cooking times, which is an obvious drawback.
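A toy illustration of the effect, with made-up numbers: one large outlier error dominates the mean of squared errors far more than the mean of absolute errors.

# Nine errors of 5 minutes and one outlier error of 500 minutes
errors = [5] * 9 + [500]
mse = sum(e ** 2 for e in errors) / len(errors)   # 25022.5 -- dominated by the outlier
mae = sum(abs(e) for e in errors) / len(errors)   #    54.5 -- far less sensitive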
There are several ways to design a better predictor for such a dataset. The simplest is to remove the outliers from the dataset; this, however, gives us no practical means of predicting outliers. Another way is to transform the output variable with a transformation that compresses large values more than small ones, such as the logarithm (see the sketch below); finding an appropriate transformation may be hard and may require careful tuning. A third way is to reframe the problem as a classification problem, e.g. predicting whether the cooking time is above 20 minutes; this, however, gives a very coarse prediction. A fourth and last way is to use an objective that is not as sensitive to outliers as the MSE, such as the MAE.
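A minimal sketch of the log-transform variant, reusing the pipeline above (numpy's log1p/expm1 are assumed for the transform; this is illustrative, not part of the original solution):

# Train on log(1 + minutes) so very long cooking times are compressed
XTrain = [feat(d) for d in dataTrain]
yTrainLog = [numpy.log1p(d['minutes']) for d in dataTrain]
modelLog = linear_model.LinearRegression()
modelLog.fit(XTrain, yTrainLog)

# Invert the transform to get predictions back in minutes
XTest = [feat(d) for d in dataTest]
yPredMinutes = numpy.expm1(modelLog.predict(XTest))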