University of Connecticut STAT5605 Project: The Analysis of Data-TRI Prediction on high-dimensional and multicollinear data

Contents
[Abstract]
Section 1: Introduction
Section 2: Data Description
Section 3: Methods and Models
Section 4: Analysis of Data
    Principal Component Analysis (PCA)
    Ridge Regression
Section 5: Model Comparison, Conclusion and Remarks
Section 6: Appendix
    Appendix A
    Appendix B
References

[Abstract]: This paper is based mainly on data provided by TRI, a quick service restaurant company. The goal of the research is to predict revenue so that the company can decide whether it is wise to open a new restaurant at a given location. We first look for a linear model to predict revenue. Because of the limitations of the linear model, several methods, namely PCA (principal component analysis), ridge regression and robust regression, are used to improve the performance of the original model. The main issue addressed in this paper is multicollinearity: in the original data, 37 continuous variables are highly collinear. PCA, ridge regression and robust regression each help in some ways. Finally, we produce predictions and use the known data to test the accuracy of our model.

[Key words] prediction, multicollinearity, high dimension, principal component analysis, robust regression, ridge regression, linear regression

Section 1: Introduction
With over 1,200 quick
This paper summarizes our analysis of data obtained for 60 Crusty Dough Pizza Company restaurants. We compared 16 pizza store characteristics to monthly profit in order to determine the best indicators of success. The results of this analysis may be used to determine the store services and attributes that have the most bearing on profitability.
The purpose of this case is to determine which key variables drive Crusty Dough Pizza Restaurant’s monthly profit and then forecast what the monthly profit would be for potential stores. Based on this information, we will be able to recommend to Crusty Dough Pizza Restaurant which stores they should open and which they should avoid. The group was provided data on 60 restaurants that included monthly profit, student population, advertising expenditures, parking spots, population within 20 miles, pizza varieties, and competitors within 15 miles. For the potential stores we were given all of this
I worked on a dataset of malignant and benign prostate tissue samples that provided insight into a gene expression signature for prostate cancer. The goal was to apply several prediction algorithms, including principal component regression, elastic nets, and partial least squares, to samples with array-based gene expression profiles in order to accurately predict prostate cancer. The analysis was based on factors that included disease state and genotype/phenotype
The algorithms used to determine predicted brain age considered gray matter density as a single variable, x. A support vector machine and a support vector regression machine, both of which are learning algorithms, were used. Both machines require two phases: a training phase using the variable x, and a test phase. In the training phase, baseline MRI scans of healthy subjects are used to produce a brain age prediction model. The prediction model is then applied to new data in the test phase to check that it works properly. The support vector machine was
This investigation examines three data sets in which X is time in years and Y is the value of a jetski. Time increases in steps of 0.5, so each table records a new value every six months. The investigation will determine which data set best suits the real-world scenario of jetski ownership, using a number of mathematical methods and strategies. Many types of graphs will be utilised throughout this experiment for maximal analysis of the data sets; for example, scatter plots will be drawn using the given data. Modifications are made to these graphs in order to create predictions, and regression line equations will be created, used and analysed. Regression line equations are trend lines on
Logistic regression, the usual approach in the past, was used to analyze stroke outcome data. Because of their potentially more powerful prediction performance, machine learning algorithms have been proposed as an alternative for analyzing large-scale multivariate data. Support vector machine (SVM) is one of the most popular machine learning methods used for recognition or classification.
For thal values 6 and 7, the next best predictor is ca; for thal value 3, the next best predictor is age.
Multiple regression is a standard statistical tool that regresses a single dependent variable (the response) on several independent variables (the predictors). The objective is to find a linear model that best predicts the dependent variable from the independent variables. To explain the data in the simplest way, redundant or unnecessary predictors should be removed. This elimination is needed for the following reasons. First, unnecessary predictors add noise to the estimation of the other quantities in which we are interested and, from a statistical point of view, cost degrees of freedom. Second, if the model is to be used for prediction, we can save time and/or money by not measuring redundant predictors. Finally, multicollinearity is caused by having too many variables trying to do the same job.
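One common way to spot predictors that are "trying to do the same job" is the variance inflation factor (VIF), which regresses each predictor on all the others and reports 1 / (1 - R^2). The sketch below is illustrative only and is not taken from the original analysis: the function name `vif`, the synthetic columns `x1`, `x2`, `x3`, and the use of NumPy are all assumptions.

```python
import numpy as np

def vif(X):
    """Return the variance inflation factor of each column of X.

    X is an (n_samples, n_predictors) array. Each column is regressed on
    the remaining columns plus an intercept, and VIF_j = 1 / (1 - R_j^2).
    """
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    factors = []
    for j in range(p):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])  # add intercept column
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        ss_res = float(resid @ resid)
        ss_tot = float(((y - y.mean()) ** 2).sum())
        r2 = 1.0 - ss_res / ss_tot
        factors.append(1.0 / (1.0 - r2))
    return factors

# Hypothetical data: x2 is almost a copy of x1, x3 is unrelated noise.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + 0.05 * rng.normal(size=200)
x3 = rng.normal(size=200)
vifs = vif(np.column_stack([x1, x2, x3]))
print([round(v, 1) for v in vifs])
```

A frequently quoted rule of thumb treats a VIF above roughly 10 as a sign of serious collinearity, although the threshold is a convention rather than a test. In this synthetic example the two near-duplicate columns should have large VIFs, and dropping one of them is the kind of elimination the paragraph describes.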
The final model was shown to have minimal error and was therefore a strong representation of the data. However, due to its continuous nature, the model failed to display the fluctuations in population
Principal component analysis (PCA) is one of the most widely used multivariate statistical techniques. PCA can be used to extract the important information from a data table that contains observations described by several dependent variables. PCA then expresses this information as a set of new orthogonal variables, called principal components (PCs). In addition, PCA can represent the pattern of similarity among the observations and among the variables by drawing them as points in maps.
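The construction of orthogonal principal components can be sketched with a singular value decomposition of the centered data table. This is a minimal illustration, not the procedure used in the text: the function name `pca`, the synthetic matrix `X`, and the choice of NumPy's SVD are assumptions.

```python
import numpy as np

def pca(X, n_components):
    """Project X onto its first n_components principal components.

    Returns (components, scores, explained_ratio), where the rows of
    `components` are orthonormal directions, `scores` are the coordinates
    of each observation on those directions, and `explained_ratio` is the
    share of total variance carried by each kept component.
    """
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                       # center each variable
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:n_components]
    scores = Xc @ components.T
    explained_ratio = (s ** 2) / (s ** 2).sum()
    return components, scores, explained_ratio[:n_components]

# Hypothetical data: two strongly related variables plus a small noise column.
rng = np.random.default_rng(0)
t = rng.normal(size=100)
X = np.column_stack([
    t,
    2 * t + 0.1 * rng.normal(size=100),
    0.1 * rng.normal(size=100),
])
components, scores, ratio = pca(X, 2)
print(ratio)
```

Plotting the two columns of `scores` against each other gives the kind of map of observations mentioned above. Whether the variables should also be scaled to unit variance before the decomposition depends on their units, and that choice is left out of this sketch.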
In order to remedy serial correlation, generalized least squares (GLS) is introduced. GLS is “a method of ridding an equation of pure first-order serial correlation and in the process restoring the minimum variance property to its estimation” (Studenmund, 2006). In this model, GLS was not necessarily required: the Durbin-Watson (DW) test was inconclusive, so the presence of serial correlation is uncertain.
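For reference, the Durbin-Watson statistic is computed from the regression residuals e_1, ..., e_n as the sum of squared successive differences divided by the sum of squared residuals. The sketch below shows only that calculation; the function name `durbin_watson` and the two residual series are made up for illustration and are not the residuals of the model discussed here.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic d = sum((e_t - e_{t-1})^2) / sum(e_t^2).

    Values near 2 suggest no first-order serial correlation, values toward 0
    suggest positive correlation, and values toward 4 suggest negative
    correlation.
    """
    e = list(residuals)
    if len(e) < 2:
        raise ValueError("need at least two residuals")
    numerator = sum((e[t] - e[t - 1]) ** 2 for t in range(1, len(e)))
    denominator = sum(v * v for v in e)
    return numerator / denominator

# Hypothetical residual series for illustration only.
alternating = [1, -1, 1, -1, 1, -1]           # sign flips every period
trending = [1.0, 0.9, 0.8, 0.7, 0.6, 0.5]     # slowly drifting residuals
d_alt = durbin_watson(alternating)
d_trend = durbin_watson(trending)
print(d_alt, d_trend)
```

The statistic alone does not settle the question. The test compares d with tabulated lower and upper bounds that depend on the sample size and the number of regressors, and a value falling between those bounds is what makes the result inconclusive. Those bounds are not computed in this sketch.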
In this paper, a novel variable selection technique with adaptive shrinkage stochastic search is used to understand the influences of the predictor variables. Variable selection is a challenging task, and several Bayesian techniques are available. A comparative review of the methods is given in O'Hara and Sillanpää (2009). This paper develops a technique that combines binary indicator model selection using stochastic search, introduced by George and McCulloch (1993), with an adaptive shrinkage method (Zou, 2006) through a Laplacian prior (Park & Casella, 2008) for a binomial logistic model. A similar approach has also been used by Lykou and Ntzoufras (2013); however, they used the Laplacian prior without any adaptive shrinkage for a Gaussian model. Ročková and George (2016) developed a hybrid Gaussian modelling approach that incorporates stochastic variable selection like the spike-and-slab (Ishwaran & Rao, 2005) and a penalized approach like the Lasso (Tibshirani, 1996). In addition, a non-Gaussian version based on the generalised linear model was developed by Tang et al. (2017). However, none of these methods considered a spatially correlated process in their modelling hierarchy.