Model_Interpretability


School

New Jersey Institute Of Technology

Course

622

Subject

Business

Date

Apr 3, 2024

Uploaded by MinisterWaterHare33

17/03/2024, 21:14 Model_Interpretability.ipynb - Colaboratory https://colab.research.google.com/drive/1Q7Os1OJQha-d63CzcJw7ScUpbnM5bnnC?usp=sharing#scrollTo=g3sTFJp3G9-o 1/20

Copyright (c) 2024 Lakshmi Anchitha Panchaparvala

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

MIT License

link: https://github.com/anchitha1309/Dataset

This study utilizes a comprehensive real estate dataset to explore and predict housing prices. The dataset, comprising 81 features for each property, offers a detailed insight into various aspects of residential properties. Key attributes include building class, zoning classification, lot size, road access type, and property shape. It also encompasses a wide range of both numerical and categorical variables, from basic utilities to specific details like alley access and pool quality, though some features exhibit missing values.
The primary focus of the analysis is the 'SalePrice' variable, representing the sale price of each house, which is a critical indicator of market trends and property value. The dataset's richness in features provides a robust foundation for applying regression techniques to predict housing prices, offering valuable insights for potential homeowners, real estate agents, and market analysts. The objective is to understand how various factors influence property values and to develop accurate predictive models that can aid in decision-making processes in the real estate market.

Dataset

The dataset contains a mix of numerical and categorical variables. There are also some missing values, indicated by 'NaN', particularly in columns like "Alley" and "PoolQC". In total, there are 81 columns, indicating a wide range of features that describe each property, such as the type of dwelling, the quality and condition of various features, the year certain components were built or remodeled, and other characteristics related to the property and its surroundings. This dataset is typically used for regression tasks, particularly for predicting the sale price of houses based on their characteristics.

Dataset

1. Fit a linear model and interpret the regression coefficients
2. Fit a tree-based model and interpret the nodes
3. Use AutoML to find the best model
4. Run SHAP analysis on the models from steps 1, 2, and 3, interpret the SHAP values and compare them with the other model interpretability methods.

Interpret your models.
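As a rough illustration of what "interpret the regression coefficients" in step 1 means, the sketch below fits an ordinary least squares model with NumPy on a small synthetic dataset. The feature names (`area`, `quality`) and every generated number are assumptions made for illustration, not values from the housing data. Each fitted coefficient reads as the expected change in the target for a one-unit increase in that feature while the other feature is held fixed.

```python
import numpy as np

# Synthetic data (hypothetical, NOT the housing dataset).
rng = np.random.default_rng(0)
n = 200
area = rng.uniform(50, 250, size=n)        # hypothetical living area
quality = rng.integers(1, 11, size=n)      # hypothetical quality score, 1 to 10
noise = rng.normal(0, 5, size=n)

# Ground truth used to generate the target: 1.5 per unit of area, 10 per quality point.
price = 20.0 + 1.5 * area + 10.0 * quality + noise

# Design matrix with an explicit intercept column, solved by least squares.
X = np.column_stack([np.ones(n), area, quality])
coef, *_ = np.linalg.lstsq(X, price, rcond=None)
intercept, b_area, b_quality = coef

print(f"intercept: {intercept:.2f}")
print(f"area:      {b_area:.3f} (change in price per extra unit of area)")
print(f"quality:   {b_quality:.3f} (change in price per extra quality point)")
```

Because the two features are on very different scales, the raw coefficients are not directly comparable as importance scores; standardizing the features first is one common way to make them comparable.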
Importing Necessary Libraries

!pip install shap

Requirement already satisfied: shap in /usr/local/lib/python3.10/dist-packages (0.45.0)
Requirement already satisfied: numpy in /usr/local/lib/python3.10/dist-packages (from shap) (1.25.2)
Requirement already satisfied: scipy in /usr/local/lib/python3.10/dist-packages (from shap) (1.11.4)
Requirement already satisfied: scikit-learn in /usr/local/lib/python3.10/dist-packages (from shap) (1.2.2)
Requirement already satisfied: pandas in /usr/local/lib/python3.10/dist-packages (from shap) (1.5.3)
Requirement already satisfied: tqdm>=4.27.0 in /usr/local/lib/python3.10/dist-packages (from shap) (4.66.2)
Requirement already satisfied: packaging>20.9 in /usr/local/lib/python3.10/dist-packages (from shap) (24.0)
Requirement already satisfied: slicer==0.0.7 in /usr/local/lib/python3.10/dist-packages (from shap) (0.0.7)
Requirement already satisfied: numba in /usr/local/lib/python3.10/dist-packages (from shap) (0.58.1)
Requirement already satisfied: cloudpickle in /usr/local/lib/python3.10/dist-packages (from shap) (2.2.1)
Requirement already satisfied: llvmlite<0.42,>=0.41.0dev0 in /usr/local/lib/python3.10/dist-packages (from numba->shap)
Requirement already satisfied: python-dateutil>=2.8.1 in /usr/local/lib/python3.10/dist-packages (from pandas->shap) (2.
Requirement already satisfied: pytz>=2020.1 in /usr/local/lib/python3.10/dist-packages (from pandas->shap) (2023.4)
Requirement already satisfied: joblib>=1.1.1 in /usr/local/lib/python3.10/dist-packages (from scikit-learn->shap) (1.3.2

Requirement already satisfied: threadpoolctl>=2.0.0 in /usr/local/lib/python3.10/dist-packages (from scikit-learn->shap)
Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.10/dist-packages (from python-dateutil>=2.8.1->pandas-

import pandas as pd
import shap
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder, StandardScaler, OneHotEncoder
from sklearn_pandas import DataFrameMapper
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
import statsmodels.api as sm
import numpy as np
import matplotlib.pyplot as plt

Reading Data

df = pd.read_csv("https://raw.githubusercontent.com/anchitha1309/Dataset/main/train.csv")

Missing Values

Data Preprocessing

# view the first few rows
df.head()

   Id  MSSubClass MSZoning  LotFrontage  LotArea Street Alley LotShape LandContour Utilities  ...  PoolArea PoolQC  F
0   1          60       RL         65.0     8450   Pave   NaN      Reg         Lvl    AllPub  ...         0    NaN
1   2          20       RL         80.0     9600   Pave   NaN      Reg         Lvl    AllPub  ...         0    NaN
2   3          60       RL         68.0    11250   Pave   NaN      IR1         Lvl    AllPub  ...         0    NaN
3   4          70       RL         60.0     9550   Pave   NaN      IR1         Lvl    AllPub  ...         0    NaN
4   5          60       RL         84.0    14260   Pave   NaN      IR1         Lvl    AllPub  ...         0    NaN

5 rows × 81 columns

# percentage of missing values in each feature
df.isnull().sum().sort_values(ascending=False)*100/len(df)

PoolQC          99.520548
MiscFeature     96.301370
Alley           93.767123
Fence           80.753425
FireplaceQu     47.260274
...
ExterQual        0.000000
Exterior2nd      0.000000
Exterior1st      0.000000
RoofMatl         0.000000
SalePrice        0.000000
Length: 81, dtype: float64

# Drop the four columns with more than 60% missing values, and also MasVnrType
df = df.drop(['PoolQC', 'MiscFeature', 'Alley', 'Fence', 'MasVnrType'], axis=1)

# Collect the names of the columns that still contain missing values
missing_data_cols = df.isnull().sum()[df.isnull().sum() > 0].index.tolist()

# Display the first rows of all columns with missing values
df[missing_data_cols].head()
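The same selection can be derived from the missing-value shares instead of being typed out by hand. A minimal sketch, assuming a small toy frame (all of its values are hypothetical) and a 0.60 cutoff; the names `toy`, `threshold`, and `cols_to_drop` are illustrative and do not appear in the notebook:

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical values; only the column names echo the dataset.
toy = pd.DataFrame({
    "PoolQC": [np.nan, np.nan, np.nan, np.nan, "Gd"],
    "Alley": [np.nan, "Grvl", np.nan, np.nan, np.nan],
    "LotFrontage": [65.0, 80.0, np.nan, 60.0, 84.0],
    "SalePrice": [100000, 120000, 140000, 160000, 180000],
})

# isnull().mean() gives the fraction of missing entries per column.
missing_share = toy.isnull().mean()

# Keep the cutoff in one place so it is easy to change.
threshold = 0.60
cols_to_drop = missing_share[missing_share > threshold].index.tolist()

reduced = toy.drop(columns=cols_to_drop)
print(cols_to_drop)
print(reduced.columns.tolist())
```

Deriving the list this way keeps the comment and the code in agreement: any column dropped for a reason other than the cutoff has to be added explicitly, which makes the exception visible.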

   LotFrontage  MasVnrArea BsmtQual BsmtCond BsmtExposure BsmtFinType1 BsmtFinType2 Electrical FireplaceQu GarageTy
0         65.0       196.0       Gd       TA           No          GLQ          Unf      SBrkr         NaN     Attc
1         80.0         0.0       Gd       TA           Gd          ALQ          Unf      SBrkr          TA     Attc
2         68.0       162.0       Gd       TA           Mn          GLQ          Unf      SBrkr          TA     Attc
3         60.0         0.0       TA       Gd           No          ALQ          Unf      SBrkr          Gd     Detc
4         84.0       350.0       Gd       TA           Av          GLQ          Unf      SBrkr          TA     Attc

categorical_cols = df.select_dtypes(include='object').columns.tolist()
numeric_cols = df.select_dtypes(include=['int64','float64']).columns.tolist()

Correlation Analysis

# Correlation analysis
# numeric_only=True skips the object columns (df.corr() raises on them in pandas >= 2.0)
correlation_matrix = df.corr(numeric_only=True)

# Plotting the heatmap of the correlation matrix
plt.figure(figsize=(15, 12))
sns.heatmap(correlation_matrix, annot=False, cmap='coolwarm')
plt.title('Correlation Matrix of Variables')
plt.show()

# Displaying correlation values with SalePrice in descending order
correlation_with_saleprice = correlation_matrix['SalePrice'].sort_values(ascending=False)
correlation_with_saleprice

SalePrice       1.000000
OverallQual     0.790982
GrLivArea       0.708624
GarageCars      0.640409
GarageArea      0.623431
TotalBsmtSF     0.613581
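A ranking like the one above can be turned into a shortlist of candidate features by filtering on the absolute correlation. The sketch below shows the pattern on a toy frame; its values, the 0.5 cutoff, and the names `toy`, `corr_with_target`, and `strong` are assumptions for illustration. Pearson correlation only captures linear association, so a low value here does not rule a feature out.

```python
import pandas as pd

# Toy frame with hypothetical values; "Street" is a non-numeric column on purpose.
toy = pd.DataFrame({
    "OverallQual": [5, 6, 7, 8, 9, 4],
    "GrLivArea": [1000, 1500, 1400, 2000, 2600, 900],
    "Street": ["Pave", "Pave", "Grvl", "Pave", "Pave", "Grvl"],
    "SalePrice": [120, 160, 180, 230, 300, 100],
})

# Restrict to numeric columns before computing the Pearson correlation matrix.
numeric = toy.select_dtypes(include="number")

# Correlation of every numeric feature with the target, target itself removed.
corr_with_target = (
    numeric.corr()["SalePrice"]
    .drop("SalePrice")
    .sort_values(ascending=False)
)

# Shortlist features whose absolute correlation exceeds the (assumed) cutoff.
strong = corr_with_target[corr_with_target.abs() > 0.5].index.tolist()
print(corr_with_target)
print(strong)
```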

# Encode the categorical columns with a label encoder, one column at a time
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
for col in df.columns:
    if df[col].dtypes == 'object':
        # astype(str) turns NaN into the string 'nan', so missing categories get their own label
        df[col] = le.fit_transform(df[col].astype(str))

# check again the missing values
df.isnull().sum().sort_values(ascending=False)*100/len(df)

LotFrontage     17.739726
GarageYrBlt      5.547945
MasVnrArea       0.547945
Id               0.000000
BedroomAbvGr     0.000000
...
ExterCond        0.000000
ExterQual        0.000000
Exterior2nd      0.000000
Exterior1st      0.000000
SalePrice        0.000000
Length: 76, dtype: float64

# reset the index to a clean 0..n-1 range
df = df.reset_index(drop=True)

# print categorical columns with missing values
categorical_cols = df[missing_data_cols].select_dtypes(include='object').columns.tolist()
categorical_cols

[]

df.dtypes.value_counts()

int64      73
float64     3
dtype: int64

Imputing Missing Values

# Impute the missing values with a model-based (ML) imputer
# Imports needed by the imputation functions
from sklearn.preprocessing import LabelEncoder
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error
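The idea behind model-based imputation is to treat the column with gaps as a target: fit a model on the rows where it is observed, then predict it on the rows where it is missing. The sketch below shows only that pattern on a toy frame. It uses ordinary least squares from NumPy in place of a random forest to stay dependency-light, it assumes the predictor columns have no missing values, and the helper name `impute_with_regression` and all data values are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical values; LotFrontage has two gaps to fill.
toy = pd.DataFrame({
    "LotArea": [8000.0, 9600.0, 11000.0, 9500.0, 14000.0, 10000.0],
    "LotFrontage": [60.0, 72.0, np.nan, 71.25, 105.0, np.nan],
})


def impute_with_regression(frame, target, predictors):
    """Fill NaNs in `target` with least-squares predictions from `predictors`.

    Hypothetical helper: assumes the predictor columns are fully observed.
    """
    out = frame.copy()  # work on a copy so the caller's frame is untouched
    missing = out[target].isnull()
    if not missing.any():
        return out

    # Fit on rows where the target is observed (intercept column added by hand).
    X_known = np.column_stack(
        [np.ones(int((~missing).sum())), out.loc[~missing, predictors].to_numpy()]
    )
    y_known = out.loc[~missing, target].to_numpy()
    coef, *_ = np.linalg.lstsq(X_known, y_known, rcond=None)

    # Predict the target on rows where it is missing and write the values back.
    X_missing = np.column_stack(
        [np.ones(int(missing.sum())), out.loc[missing, predictors].to_numpy()]
    )
    out.loc[missing, target] = X_missing @ coef
    return out


filled = impute_with_regression(toy, "LotFrontage", ["LotArea"])
print(filled)
```

Holding out part of the observed rows and scoring the predictions on them, as with a train/test split and an R^2 score, gives a rough check on how trustworthy the filled-in values are.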

def impute_categorical_missing_data(passed_col):
    # NOTE: bool_cols is not defined in the code shown here; it must exist before this function is called.
    # Split the rows by whether the target column is observed; copy so the later assignment does not hit a view.
    df_null = df[df[passed_col].isnull()].copy()
    df_not_null = df[df[passed_col].notnull()].copy()

    X = df_not_null.drop(passed_col, axis=1)
    y = df_not_null[passed_col]

    other_missing_cols = [col for col in missing_data_cols if col != passed_col]

    # Encode any remaining non-numeric predictors
    label_encoder = LabelEncoder()
    for col in X.columns:
        if X[col].dtype == 'object' or X[col].dtype == 'category':
            X[col] = label_encoder.fit_transform(X[col])

    if passed_col in bool_cols:
        y = label_encoder.fit_transform(y)

    iterative_imputer = IterativeImputer(estimator=RandomForestRegressor(random_state=42), add_indicator=True)

    # Fill gaps in the other predictors, one column at a time.
    # Fitted on a single column, the imputer falls back to its initial (mean) imputation.
    for col in other_missing_cols:
        if X[col].isnull().sum() > 0:
            col_with_missing_values = X[col].values.reshape(-1, 1)
            imputed_values = iterative_imputer.fit_transform(col_with_missing_values)
            X[col] = imputed_values[:, 0]

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    # Despite the variable name, this is a regressor, and the score below is R^2, not classification accuracy.
    rf_classifier = RandomForestRegressor()
    rf_classifier.fit(X_train, y_train)
    y_pred = rf_classifier.predict(X_test)
    acc_score = r2_score(y_test, y_pred)
    print("The feature '"+ passed_col+ "' has been imputed with", round((acc_score * 100), 2), "accuracy\n")

    # Prepare the rows with a missing target in the same way
    X = df_null.drop(passed_col, axis=1)
    for col in X.columns:
        if X[col].dtype == 'object' or X[col].dtype == 'category':
            X[col] = label_encoder.fit_transform(X[col])

    for col in other_missing_cols:
        if X[col].isnull().sum() > 0:
            col_with_missing_values = X[col].values.reshape(-1, 1)
            imputed_values = iterative_imputer.fit_transform(col_with_missing_values)
            X[col] = imputed_values[:, 0]

    # Predict the missing target values and write them back
    if len(df_null) > 0:
        df_null[passed_col] = rf_classifier.predict(X)
        if passed_col in bool_cols:
            df_null[passed_col] = df_null[passed_col].map({0: False, 1: True})

    df_combined = pd.concat([df_not_null, df_null])
    return df_combined[passed_col]

def impute_continuous_missing_data(passed_col):
    df_null = df[df[passed_col].isnull()]
    df_not_null = df[df[passed_col].notnull()]

    X = df_not_null.drop(passed_col, axis=1)
    y = df_not_null[passed_col]

    other_missing_cols = [col for col in missing_data_cols if col != passed_col]

    label_encoder = LabelEncoder()
    for col in X.columns:
