R-CheatSheet.pdf

School: New Jersey Institute of Technology
Course: 636
Subject: Computer Science
Date: Feb 20, 2024
Type: pdf
Uploaded by ElderToadMaster889

The Fibonacci sequence 1, 1, 2, 3, 5, 8, 13, 21, ... starts with two 1s, and each term afterwards is the sum of its two predecessors: Fib(n) = Fib(n-1) + Fib(n-2), Fib(1) = 1, Fib(2) = 1. Please write a program to return the first number in the Fibonacci sequence which is larger than a given parameter, e.g. 100000.

The average GPA scored by a class is 4.91 and the standard deviation is 0.72. For a sample of 64 students, find the probability that the sample average is in the interval (4.7336, 5.0864).

# Generate x values from 11 to 50 with a step of 1
# Generate 40 random normal values for y
# Set up a PDF file for saving plots
# Set point character (pch) to 22 and color (col) to red for all plots
# Set up a 2x4 grid of plots
# Specify different plot types using the 'opts' vector
# Loop through the options and create plots
# Create a plot with the specified type
# Turn off the PDF device to save the file

Initialization: nExpr is set to 1000, representing the number of experiments, and tosses is set to 7, representing the number of coin tosses in each experiment.
Simulation Loop: The code uses a loop to simulate each experiment, generating a random sequence of "H" (heads) and "T" (tails) outcomes and counting the number of heads.
Histogram: The hist function creates a histogram of the results, with breaks specified to align with the possible values (0 to 7).
Binomial Distribution Overlay: The lines and points functions overlay the expected binomial distribution on the histogram. The expected values are calculated using the binomial probability mass function (dbinom).
Legend: A legend is added to the top-right corner of the plot to identify the overlay as the binomial distribution.

10.
# Candidate phone numbers in several formats
phones <- c("2197338965", "+1 219 733 8965", " 219 733 8965", "329-293-8753 ", "595 794 7569", "387 287 6718", "233.398.9187", "482 952 3315", "Work: 579-499-7527", "Home:543.355.3679")
# Optional country code, then 3-3-4 digits separated by an optional space, dot, or hyphen
pattern <- "\\b(?:\\+?\\d{1,3}[\\s.-]?)?(\\d{3}[\\s.-]?\\d{3}[\\s.-]?\\d{4})\\b"
# Return the elements of phones that contain a match
matched_numbers <- grep(pattern, phones, value = TRUE)

Let sex = c(1,1,1,1,1,1,2,2,2,2); graduate = c(1,0,1,0,1,0,0,0,0,1); score = c(9:1, NA). A data frame is constructed as zz = data.frame(sex, graduate, score). Give the results of these R commands:

A) table(zz[,"sex"])
The table function in R is used to tabulate the counts of elements in a vector.
1 2
6 4

B) apply(zz[-1, ], 2, max)
The apply function is used to apply a function along the margins of an array. Here max is applied to each column of zz with the first row dropped. The score column still contains the NA from row 10 and na.rm is not set, so its maximum is NA.
     sex graduate    score
       2        1       NA

C) zz[zz[,3]>7,]
The code zz[zz[,3]>7,] extracts the rows from the data frame zz where the values in the third column (score) are greater than 7. The comparison is NA for row 10, and an NA row index produces a row filled with NA.
   sex graduate score
1    1        1     9
2    1        0     8
NA  NA       NA    NA

D) which.max(zz$score)
The which.max function in R is used to find the index of the first occurrence of the maximum value in a vector. Missing values are ignored.
[1] 1

E) zz[order(zz["graduate"],zz["score"]),]
The order function in R is used to obtain the index permutation that would sort a given vector. When applied to multiple vectors, it sorts by the first and breaks ties with the following ones, so the rows of the data frame are sorted by graduate and then by score. By default (na.last = TRUE) missing values are placed last within their group.
   sex graduate score
9    2        0     1
8    2        0     2
7    2        0     3
6    1        0     4
4    1        0     6
2    1        0     8
5    1        1     5
3    1        1     7
1    1        1     9
10   2        1    NA

F) subset(zz, zz["sex"]==1)
The subset function in R is used to extract a subset of a data frame based on certain conditions.
  sex graduate score
1   1        1     9
2   1        0     8
3   1        1     7
4   1        0     6
5   1        1     5
6   1        0     4

G) tapply(zz$score, zz$graduate, mean, na.rm=T)
The tapply function in R is used to apply a function to subsets of a vector. In this case, the mean function is applied to the "score" column of the data frame zz, grouped by the values in the "graduate" column. For graduate = 0 the scores are 8, 6, 4, 3, 2, 1 (mean 4); for graduate = 1 they are 9, 7, 5, NA (mean 7 after removing the NA).
0 1
4 7

H) apply(zz[-10, ], 1, function(x){ sum(x) })
The apply function in R is used to apply a function to the rows or columns of a matrix or data frame. In this case, an anonymous function (a function created on-the-fly) is applied to the rows of the data frame zz, excluding the 10th row, and calculates the sum of each row. The result is named by row.
 1  2  3  4  5  6  7  8  9
11  9  9  7  7  5  5  4  3

Henry and Tony play a game. They toss a fair coin with Pr(x=H) = Pr(x=T) = 0.5. If it comes up Head (H), Henry wins; if Tail (T), Tony wins. They agree in advance that the first player who has won 3 rounds will collect the entire prize. However, the game is interrupted for some reason after 3 rounds, with 2 H and 1 T. If they were to continue playing, what is the probability that Henry will win the entire prize? And what is the probability that Tony will win the entire prize?

4. Central Limit Theorem
The Central Limit Theorem (CLT) is a statistical principle stating that, given a sufficiently large sample size, the distribution of the sample mean will be approximately normal, regardless of the shape of the original population distribution. It highlights the remarkable tendency of sample means to follow a normal distribution, facilitating more robust statistical inferences. Key conditions include random sampling, independence, and an adequately sized sample. The CLT is pivotal in hypothesis testing and confidence interval construction, offering a reliable framework for understanding and analyzing population parameters based on sample data.

5. The merge operation plays an important role in the merge sort algorithm. Suppose you have two sorted sequences S1 and S2; the merge operation will combine these two sequences into a single ordered sequence. Please write a function, Merge(S1, S2), which accepts two ordered vectors S1 and S2 as parameters. It will return a single ordered sequence. For example,
S1 = c(1,3,5,7);
S2 = c(2,4,6,10);
Merge(S1, S2) will return c(1,2,3,4,5,6,7,10)
Testing commands:
Merge(seq(1, 50, by=3), seq(2, 30, by=2))
The Merge function in R combines two ordered vectors into a single ordered sequence.

Which of the following matches regexp /a.[bc]+/
a) abc
b) abbbbbbbb
c) azc
d) abcbcbcbc
e) ac
f) asccbbbbcbcccc

rep(1:3, each=3)
[1] 1 1 1 2 2 2 3 3 3

seq(1, 10, by=2)
[1] 1 3 5 7 9

pnorm(0, mean=0, sd=1)
The pnorm function in R is used to compute the cumulative distribution function (CDF) of a normal distribution.
[1] 0.5

print((1:4)>2&(1:4)%%2==0);
[1] FALSE FALSE FALSE TRUE

qunif(0.4, min = 1, max = 6)
The qunif function in R is used to compute the quantile function (inverse cumulative distribution function) for a uniform distribution. Here the result is 1 + 0.4 * (6 - 1) = 3.
[1] 3
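For the Fibonacci question at the top of this sheet, one possible answer is sketched below. The function name first_fib_above is a hypothetical choice, not something given in the original, and the loop simply follows the recurrence Fib(n) = Fib(n-1) + Fib(n-2) until the current term exceeds the limit.

```r
# Hypothetical helper (name not from the original sheet):
# returns the first Fibonacci number strictly greater than `limit`.
first_fib_above <- function(limit) {
  prev <- 1  # Fib(1)
  curr <- 1  # Fib(2)
  while (curr <= limit) {
    nxt  <- prev + curr  # Fib(n) = Fib(n-1) + Fib(n-2)
    prev <- curr
    curr <- nxt
  }
  curr
}

print(first_fib_above(100000))  # [1] 121393
```

An iterative loop is used rather than recursion so that only the last two terms are kept in memory.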

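For the Merge(S1, S2) question above, a minimal sketch is given here. It assumes both inputs are numeric vectors already sorted in ascending order; the two-index walk is the standard merge step of merge sort, not necessarily the solution the course expected.

```r
# Sketch of the merge step, assuming S1 and S2 are sorted ascending.
Merge <- function(S1, S2) {
  n1 <- length(S1)
  n2 <- length(S2)
  out <- numeric(n1 + n2)
  i <- 1  # next unread position in S1
  j <- 1  # next unread position in S2
  k <- 1  # next free position in out
  while (i <= n1 && j <= n2) {
    if (S1[i] <= S2[j]) {
      out[k] <- S1[i]
      i <- i + 1
    } else {
      out[k] <- S2[j]
      j <- j + 1
    }
    k <- k + 1
  }
  # At most one input still has elements left; copy them over unchanged.
  if (i <= n1) out[k:(n1 + n2)] <- S1[i:n1]
  if (j <= n2) out[k:(n1 + n2)] <- S2[j:n2]
  out
}

print(Merge(c(1, 3, 5, 7), c(2, 4, 6, 10)))
print(Merge(seq(1, 50, by = 3), seq(2, 30, by = 2)))
```

Each element is read once, so the running time is proportional to length(S1) + length(S2).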
Explain the necessity of feature selection in modeling. Introduce a representative feature selection method in detail in terms of its advantages, disadvantages, and R functions/implementations/packages.

Feature selection is crucial to enhance model efficiency, accuracy, and interpretability by choosing the most relevant features and reducing dimensionality. Recursive Feature Elimination (RFE) is a representative method automating this process. RFE repeatedly fits a model, ranks the features by importance, and removes the least important ones, which can improve model performance. Advantages: RFE is flexible and compatible with various machine learning algorithms. Disadvantages: it is computationally intensive because the model is refitted at every step. In R, the `caret` package provides an implementation of RFE through the `rfe` function, aiding in automated and effective feature selection for enhanced modeling.

Please describe Support Vector Machine (SVM) and highlight the underlying mechanism under its appealing performance in practical use.

Support Vector Machine (SVM) is a powerful machine learning algorithm used for classification and regression. Its appeal lies in the pursuit of maximum margin, aiming to find the hyperplane that separates the classes with the largest possible margin. Using the kernel trick, SVM efficiently handles non-linear relationships by implicitly mapping data into higher-dimensional spaces. Because the decision boundary depends only on the support vectors, points far from the margin do not affect it, which contributes to robustness and computational efficiency. SVM's versatility, ability to generalize, and effectiveness in various applications, including image classification and bioinformatics, contribute to its widespread use in practical scenarios.

What is the commonality and difference between SVM and Logistic Regression?

Commonality: Both Support Vector Machine (SVM) and Logistic Regression are supervised machine learning algorithms used for classification tasks. They aim to find decision boundaries that separate classes in the feature space.
Difference: The key difference lies in their approach to decision boundaries. SVM seeks a hyperplane with maximum margin, emphasizing points near the decision boundary (support vectors). Logistic Regression, on the other hand, uses a logistic function to model the probability of class membership, providing probabilistic outputs. SVM is effective in high-dimensional spaces, while Logistic Regression is more interpretable and suited for probabilistic classification.

What are the main characteristics of tree-based classification algorithms? List at least 3 representative classifiers of this kind and briefly explain the differences among them.

Main Characteristics of Tree-Based Classification Algorithms:
1. Hierarchical Decision Making: Tree-based classifiers make decisions in a hierarchical, tree-like structure. Each internal node represents a decision based on a feature, and each leaf node corresponds to a class label.
2. Non-Linear Decision Boundaries: These algorithms can model complex, non-linear decision boundaries, making them suitable for a wide range of datasets with intricate patterns.
3. Feature Importance: Tree-based classifiers provide a measure of feature importance, indicating the contribution of each feature to the model's decision-making process.

Representative Classifiers:
1. Decision Trees (DT):
- Description: A basic tree structure where each internal node represents a decision based on a feature, and each leaf node corresponds to a class label.
- Difference: Prone to overfitting, but can be controlled using techniques like pruning.
2. Random Forest (RF):
- Description: An ensemble of decision trees, where multiple trees are trained independently and their predictions are averaged or voted upon.
- Difference: Reduces overfitting compared to individual decision trees and provides enhanced generalization.
3. Gradient Boosting Machines (GBM):
- Description: Builds decision trees sequentially, with each tree correcting the errors of the previous ones.
- Difference: Emphasizes correcting mistakes of preceding trees, leading to improved overall predictive accuracy.
- Difference: Known for its efficiency, speed, and enhanced performance, making it a popular choice in competitions.
These tree-based classifiers differ in their structures, strategies for reducing overfitting, and approaches to combining individual trees for improved predictive performance.

If X2 is removed, please draw a line in the figure that can separate the remaining samples X1, X3, and X4. What is (are) its support vector(s) and its corresponding margin?

Consider the following testing samples:
<1, 3>
<1, -1>
<-1, 2>
<-2, 1>
As we can see, these testing instances have 2 input variables, <X1, X2>, and class labels are unknown (either -1 or +1).
Assume that we have already trained a classifier Y = sgn(w1*X1 + w2*X2 + w0) and the estimated weight w = <w1, w2, w0> = <-1, 1, 1>.
(sgn(x) = -1, if x < 0; sgn(x) = 0, if x = 0; sgn(x) = +1, if x > 0)
(3) Please compute the distance between each testing point and the estimated separator (hyper-plane).

/a(ab)*a/ matches strings that start with 'a', followed by zero or more occurrences of 'ab', and end with 'a' (the whole string must match).
  String   Matches
1 abababa  FALSE
2 aaba     TRUE
3 aabbaa   FALSE
4 aba      FALSE
"abababa" fails because the leading 'a' must be followed directly by 'a' or by 'ab', not by 'b'.

/ab+c?/ matches strings that start with 'a', followed by one or more 'b's, and optionally followed by a 'c'.
  String   Matches
1 abc      TRUE
2 ac       FALSE
3 abbb     TRUE
4 bbc      FALSE
"ac" fails because at least one 'b' is required after the 'a'.

/a.[bc]+/ matches strings that start with 'a', followed by any character (except a newline), and then followed by one or more occurrences of 'b' or 'c'.
  String     Matches
1 abc        TRUE
2 azc        TRUE
3 abcbcbcbc  TRUE
4 ac         FALSE

Find the strings, if any, that match the pattern specified by the regular expression [Dd]onald( [Jj])? [Tt]rump in the following sentences:
Hi, Donald j trump. Nice to meet you. - True
i bet i can spell better than you and donald( trump combined - False
I met with Donald J Trump yesterday. - True
CNN reported that President Donald JTrump visited the university yesterday - False
Carve your own Donald Trumpkin in time for Halloween. - True (the pattern has no word boundary, so it matches "Donald Trump" inside "Donald Trumpkin")
Another man is also called Donald JjTrump - False
The regular expression [Dd]onald( [Jj])? [Tt]rump matches variations of the name "Donald Trump" in the given sentences. It allows an uppercase or lowercase first letter for "Donald" and for "Trump," with an optional middle initial "J" or "j" between them. The remaining letters must be lowercase, and each part must be separated by a single space.

glmnet.fit <- cv.glmnet(x=train.x, y=train.y, family="binomial", type.measure="auc", nfolds=5)
What does this code do? Explain each parameter in as much detail as possible.

glmnet.fit: This is the name of the object that will store the results of the cross-validated logistic regression model.
cv.glmnet: This is the function from the glmnet package for fitting regularized generalized linear models with k-fold cross-validation. It uses cross-validation to select the regularization parameter lambda.
x=train.x: This specifies the matrix of predictor variables (train.x) used in the logistic regression model. Each column corresponds to a different predictor variable.
y=train.y: This specifies the response variable (train.y) in the logistic regression model. It contains the binary outcome variable indicating the classes (0 or 1).
family="binomial": This specifies the family of the generalized linear model. In this case, it's set to "binomial" because logistic regression is being used for binary classification.
type.measure="auc": This indicates the type of measure used for model performance evaluation during cross-validation. In this case, "auc" refers to the area under the Receiver Operating Characteristic (ROC) curve, which is a common metric for binary classification models.
nfolds=5: This sets the number of folds for cross-validation. The dataset is divided into 5 subsets (folds); each fold is held out once for evaluation while the model is fitted on the other 4.
In summary, the code fits a logistic regression model with L1 (lasso) regularization, which is the default penalty because alpha is not specified, using the glmnet package. It performs cross-validation to select the best value of lambda and evaluates the model's performance using the area under the ROC curve. The results are stored in the glmnet.fit object for further analysis and interpretation.

What are the strengths and weaknesses of KNN?

Strengths of K-Nearest Neighbors (KNN): KNN is simple, versatile, and effective for various data types. It requires no training phase and adapts well to changes in data distribution, making it suitable for dynamic datasets. It handles non-linear relationships and performs well in low-dimensional spaces.
Weaknesses of KNN: KNN is computationally expensive at prediction time because the training data must be stored and searched for neighbors, which is resource-intensive for large datasets. Its performance is sensitive to irrelevant or redundant features. It is also susceptible to the curse of dimensionality, which degrades accuracy as the feature space grows, and it may struggle with imbalanced datasets.

Given the linear separator L1 as shown in the figure, what is (are) its support vector(s)? Please give the values of w = <w1, w2, w0> in the primal problem for L1. Calculate its margin.
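For the question about the trained classifier with w = <w1, w2, w0> = <-1, 1, 1>, the distance from a point to the separator can be sketched as |w1*X1 + w2*X2 + w0| / sqrt(w1^2 + w2^2). The variable names below are illustrative choices; the predicted label is included only because it falls out of the same score.

```r
# Weights taken from the question: w = <w1, w2, w0> = <-1, 1, 1>
w  <- c(-1, 1)  # (w1, w2)
w0 <- 1         # intercept

# Testing samples from the question, one row per point (X1, X2)
X <- rbind(c(1, 3), c(1, -1), c(-1, 2), c(-2, 1))

score    <- as.vector(X %*% w + w0)        # w1*X1 + w2*X2 + w0
label    <- sign(score)                    # sgn(.) as defined in the question
distance <- abs(score) / sqrt(sum(w^2))    # point-to-hyperplane distance

result <- data.frame(X1 = X[, 1], X2 = X[, 2], score, label, distance)
print(result)
```

The scores are 3, -1, 4, and 4, so the distances are 3/sqrt(2), 1/sqrt(2), 4/sqrt(2), and 4/sqrt(2), and only <1, -1> falls on the negative side.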

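The regular expression answers above can be checked in base R with grepl. This is a sketch under two assumptions: the three short patterns are treated as whole-string matches (hence the added ^ and $ anchors), while the name pattern is searched for anywhere in each sentence. The helper full_match is a hypothetical name.

```r
# Hypothetical helper: TRUE when the whole string matches the pattern.
full_match <- function(pattern, strings) {
  grepl(paste0("^", pattern, "$"), strings)
}

r1 <- full_match("a(ab)*a", c("abababa", "aaba", "aabbaa", "aba"))
r2 <- full_match("ab+c?",   c("abc", "ac", "abbb", "bbc"))
r3 <- full_match("a.[bc]+", c("abc", "abbbbbbbb", "azc", "abcbcbcbc",
                              "ac", "asccbbbbcbcccc"))

sentences <- c(
  "Hi, Donald j trump. Nice to meet you.",
  "i bet i can spell better than you and donald( trump combined",
  "I met with Donald J Trump yesterday.",
  "CNN reported that President Donald JTrump visited the university yesterday",
  "Carve your own Donald Trumpkin in time for Halloween.",
  "Another man is also called Donald JjTrump"
)
# Unanchored search: TRUE when the sentence contains a match anywhere.
r4 <- grepl("[Dd]onald( [Jj])? [Tt]rump", sentences)

print(r1)
print(r2)
print(r3)
print(r4)
```

Without the anchors, grepl reports a match whenever any substring fits, which is why "Donald Trumpkin" counts as a match for the name pattern.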