ISYE 7406 HW4

Georgia Institute of Technology
ISyE 7406: Data Mining & Statistical Learning HW#4

INTRODUCTION

The goal of this homework is to better understand the statistical properties and computational challenges of local smoothing methods such as LOESS, Nadaraya-Watson (NW) kernel smoothing, and spline smoothing. For this purpose, we compute the empirical bias, empirical variance, and empirical mean squared error (MSE) based on m = 1000 Monte Carlo runs, where each run simulates a data set of n = 101 observations from the additive noise model

    Y_i = f(x_i) + ε_i,

with the famous Mexican hat function

    f(x) = (1 − x²) exp(−0.5 x²),   −2π ≤ x ≤ 2π,

and ε_1, …, ε_n independent and identically distributed (iid) N(0, 0.2²). This function is known to pose a variety of estimation challenges, and below we explore the difficulties inherent in estimating it.

EXPLORATORY DATA ANALYSIS

In the equidistant design, the x-values are systematically generated as 101 equally spaced points between −2π and 2π, a fixed design with a uniform spacing of 4π/100 ≈ 0.1256637 units between consecutive points. In the non-equidistant design, the x-values are again generated between −2π and 2π, but the spacings between them vary, so that x[2] − x[1] ≠ x[3] − x[2], and so on up to x[101].

Figure 1: Plot of the equidistant design
Figure 2: Plot of the non-equidistant design
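The original simulation was carried out in R; as an illustration only, the two designs and the Mexican hat function can be sketched in Python/NumPy (the seed and variable names here are hypothetical, not taken from the original code):

```python
import numpy as np

def mexican_hat(x):
    # Target regression function f(x) = (1 - x^2) * exp(-0.5 * x^2).
    return (1 - x**2) * np.exp(-0.5 * x**2)

n = 101
# Equidistant design: 101 points uniformly spaced on [-2*pi, 2*pi],
# giving a spacing of 4*pi/100 ~= 0.1256637 between consecutive points.
x_eq = np.linspace(-2 * np.pi, 2 * np.pi, n)

# Non-equidistant design: 101 points drawn uniformly at random on the
# same interval and sorted, so consecutive spacings differ.
rng = np.random.default_rng(1)  # seed is illustrative
x_neq = np.sort(rng.uniform(-2 * np.pi, 2 * np.pi, n))

# One simulated response vector: f(x) plus iid N(0, 0.2^2) noise.
y_eq = mexican_hat(x_eq) + rng.normal(0.0, 0.2, size=n)
```

In each Monte Carlo run only the noise is redrawn; the design points stay fixed across all 1000 runs.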
Two datasets were generated through a Monte Carlo simulation comprising 1000 runs for each smoothing model. The first dataset consisted of 101 equidistant points, while the second included 101 non-equidistant points randomly generated in R. In each run, the three local smoothing methods (LOESS, NW kernel smoothing, and spline smoothing) were applied to the datasets, and the resulting fitted values were recorded. The analysis involved computing and visualizing the empirical bias, empirical variance, and empirical MSE. These investigations assess the performance and statistical properties of the three smoothing methods on simulated datasets that pose a known estimation challenge due to the Mexican hat function.

METHODOLOGY

A Monte Carlo simulation of 1000 runs per smoothing model was applied to the two datasets. The initial model chosen was LOESS, a method that uses local smoothing to fit a polynomial surface based on one or more predictors. While cross-validation is typically performed to determine the optimal span, a span of 0.75 was pre-specified for this simulation. With leave-one-out or k-fold cross-validation, the fit could potentially be improved by selecting the span with the lowest root mean squared error of prediction (RMSEP).

A brief overview of the local smoothing methods used:

1. LOESS (Locally Weighted Scatterplot Smoothing): a non-parametric regression technique that combines linear regression and locally weighted smoothing to fit a smooth curve to a scatterplot. It estimates the value at each data point by fitting a weighted regression model to a local subset of the data, with the weights determined by a kernel function. The level of smoothing is controlled by the span parameter, which governs the size of the local subset.

2. Nadaraya-Watson (NW) kernel smoothing: a non-parametric regression technique that estimates the value at each data point as a weighted average of its neighbors, with the weights determined by a kernel function. The degree of smoothing is controlled by a bandwidth parameter, which determines the size of the neighborhood. NW kernel smoothing is computationally efficient, although, like other local averaging methods, its performance degrades in high dimensions.

3. Spline smoothing: a non-parametric regression technique that fits piecewise polynomials (typically cubics) joined at knots, with a roughness penalty controlled by a smoothing parameter. Spline smoothing can handle complex nonlinear relationships but requires more computation than the other two methods.

RESULTS AND FINDINGS

1. Equidistant design

The comparison of empirical bias values reveals a significant challenge at x = 0 relative to other x-values, primarily due to the broader range of response values at this point. In accordance with the bias-variance trade-off, smaller empirical bias values typically coincide with larger empirical variances, and vice versa. Notably, the LOESS estimator underperforms the other two local smoothing methods in terms of empirical bias and MSE, likely due to the choice of a relatively large span parameter (0.75), which can lead to a degree of over-smoothing.
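Concretely, the empirical quantities compared here can be computed from the fitted curves stored across the Monte Carlo runs. The sketch below is a Python/NumPy illustration rather than the original R code, and a hand-rolled Gaussian-kernel NW smoother stands in for the three smoothers:

```python
import numpy as np

def mexican_hat(x):
    return (1 - x**2) * np.exp(-0.5 * x**2)

def nw_smooth(x, y, h=0.2):
    # Nadaraya-Watson estimate with a Gaussian kernel of bandwidth h:
    # fhat(x_j) = sum_i K((x_j - x_i)/h) y_i / sum_i K((x_j - x_i)/h).
    u = (x[:, None] - x[None, :]) / h
    w = np.exp(-0.5 * u**2)  # kernel normalizing constants cancel in the ratio
    return (w * y).sum(axis=1) / w.sum(axis=1)

m, n, sigma = 1000, 101, 0.2
x = np.linspace(-2 * np.pi, 2 * np.pi, n)
f_true = mexican_hat(x)

rng = np.random.default_rng(0)  # seed is illustrative
fits = np.empty((m, n))
for j in range(m):
    y = f_true + rng.normal(0.0, sigma, size=n)
    fits[j] = nw_smooth(x, y)

f_bar = fits.mean(axis=0)                  # mean fitted curve
bias = f_bar - f_true                      # empirical bias at each x_i
var = fits.var(axis=0)                     # empirical variance
mse = ((fits - f_true) ** 2).mean(axis=0)  # empirical MSE
```

With the population variance (ddof=0), mse equals bias**2 + var exactly, which is a useful sanity check on the bookkeeping.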
Conversely, spline smoothing exhibits superior performance in terms of empirical MSE values, but this advantage may be attributed to its default tuning via generalized cross-validation. In practical scenarios, cross-validation is typically employed to fine-tune model parameters. Note that the comparison may not be entirely fair, since fixed tuning values were used for the other two local smoothing methods, potentially resulting in suboptimal performance due to insufficient parameter tuning. Plots of the equidistant design's fitted mean, empirical bias, empirical variance, and empirical MSE are shown below.

Figure 3: Fitted mean
Figure 4: Bias
Figure 5: Variance
Figure 6: MSE

The fitted values for the LOESS estimator with a span of 0.75, NW kernel smoothing using a Gaussian kernel with a bandwidth of 0.2, and spline smoothing are represented by the black, red, and blue lines, respectively.

2. Non-equidistant design

In the presented plots, we examine the empirical bias and MSE of the three smoothing methods (spline smoothing, kernel smoothing, and LOESS) applied to data from the Mexican hat function, covering both equidistant and non-equidistant x-values. A notable observation is that x = 0 stands out, exhibiting significantly larger empirical bias and MSE than other x-values; this highlights the inherent challenge these methods face in accurately estimating the function in that region. An inverse relationship between empirical bias and empirical variance is again evident across all three estimators. In the non-equidistant dataset, the spline smoothing method shows a slightly higher empirical bias than in the equidistant dataset, which may be attributed to over-smoothing caused by a relatively larger spar parameter. Conversely, the LOESS model in the non-equidistant setup displays generally smaller empirical bias and MSE than its equidistant counterpart. This improvement is likely linked to the use of a smaller LOESS span, which enhances the local fit and reduces both bias and MSE.
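The spline fits above come from R's smooth.spline, which by default tunes its penalty (the spar parameter) by generalized cross-validation. A rough Python analogue, offered as an illustration only, is SciPy's UnivariateSpline; note that its smoothing factor s is set by a manual heuristic here rather than chosen by GCV:

```python
import numpy as np
from scipy.interpolate import UnivariateSpline

def mexican_hat(x):
    return (1 - x**2) * np.exp(-0.5 * x**2)

n, sigma = 101, 0.2
x = np.linspace(-2 * np.pi, 2 * np.pi, n)
rng = np.random.default_rng(7)  # seed is illustrative
y = mexican_hat(x) + rng.normal(0.0, sigma, size=n)

# Heuristic smoothing factor: s ~= n * sigma^2, the expected total squared
# noise, so the spline absorbs roughly the noise-level residual.
# (R's smooth.spline would instead choose the penalty by GCV.)
spline = UnivariateSpline(x, y, s=n * sigma**2)
fitted = spline(x)
```

Comparing the mean squared distance of `fitted` to the true curve against that of the raw noisy `y` gives a quick check that the spline is actually smoothing rather than interpolating.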