Glossary of relevant R functions
Creating training/validation/test splits
sample(vector/number, size) – samples a given number of elements from a vector or range (base R)
sample_frac(proportion) – samples a given proportion of a dataset (dplyr package)
- Can combine with anti_join from dplyr to obtain the complement of the sampled dataset
sample.split(target_variable, split_ratio) (caTools package) – returns a logical vector marking which rows fall in the training split, while preserving the ratio of labels for the target variable
createDataPartition(target_variable, number_of_partitions, training_proportion) (caret package) – creates training/test partitions with similar distributions of the target variable y
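A minimal split sketch using the built-in iris data (the dataset, seed and 70/30 proportion are illustrative; caret's actual argument names are times, p and list):

```r
library(caret)
set.seed(1)                                   # arbitrary seed for reproducibility
train_idx <- createDataPartition(iris$Species, times = 1, p = 0.7, list = FALSE)
train <- iris[train_idx, ]
test  <- iris[-train_idx, ]
# Base-R alternative: train_idx <- sample(nrow(iris), size = 0.7 * nrow(iris))
```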
EDA functions
hist(data, breaks) – plots a histogram of a vector of data
- The breaks argument allows you to specify the number of breaks/bins to use
par(mfrow = c(a, b)) – specifies the plotting layout in base R
- Will display a grid of plots with a rows by b columns
pairs(data) – plots a matrix of scatterplots
- Categorical and logical variables are converted to numeric factors, similar to data.matrix()
See also: ggplot2 introduction and quick examples
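A short sketch of these base-R EDA calls on the built-in mtcars data (the variable choices are illustrative):

```r
par(mfrow = c(1, 2))                    # 1-row-by-2-column grid of plots
hist(mtcars$mpg, breaks = 10)           # histogram with roughly 10 bins
hist(mtcars$hp, breaks = 10)
par(mfrow = c(1, 1))                    # reset the plotting layout
pairs(mtcars[, c("mpg", "wt", "hp")])   # scatterplot matrix of three variables
```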
Linear models and generalised linear models
lm(target_variable ~ predictors, data, subset, offset) – fits a linear regression using the specified predictors on the target variable
- offset specifies a term with a known, fixed coefficient – such as using population to scale the predicted value proportionally to population
- subset lets you specify row indexes to use for training, rather than manually subsetting the data
- Can call plot(model_object) to plot diagnostic plots
- Can call summary(model_object) to display a summary table of coefficients and p-values
glm(target_variable ~ predictors, family, data, offset, subset) – fits a GLM using the specified distributional family
- Can specify a custom link function to use instead of the canonical link function – see documentation
- Can call plot(glm_object) to plot diagnostic plots
- Can call summary(glm_object) to display a summary table of coefficients and p-values
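A minimal sketch of both model types on the built-in mtcars data (the formulas are illustrative; am is mtcars' binary transmission indicator):

```r
fit_lm <- lm(mpg ~ wt + hp, data = mtcars)                # linear regression
summary(fit_lm)                                           # coefficients and p-values
par(mfrow = c(2, 2)); plot(fit_lm); par(mfrow = c(1, 1))  # diagnostic plots

fit_glm <- glm(am ~ wt + hp, family = binomial, data = mtcars)  # logistic regression
summary(fit_glm)
```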
Fitting a k-nearest neighbours model
Using class package: i.e. first run install.packages("class") and library(class).
knn(train, test, cl, k, prob)
- train specifies the training dataset; test specifies the test dataset to predict on
- cl is a vector of the true classification labels for the training data
- k specifies the number of nearest neighbours
- Outputs a vector of predicted labels from the KNN model
- Setting prob = TRUE additionally attaches the proportion of votes for the winning class as the "prob" attribute
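A minimal sketch on the built-in iris data (the 100-row split, seed and k = 3 are arbitrary choices):

```r
library(class)
set.seed(1)
idx  <- sample(nrow(iris), 100)                  # 100 training rows
pred <- knn(train = iris[idx, 1:4], test = iris[-idx, 1:4],
            cl = iris$Species[idx], k = 3, prob = TRUE)
mean(pred == iris$Species[-idx])                 # test-set accuracy
head(attr(pred, "prob"))                         # winning-class vote proportions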
Subset selection: Best, forward and backward
Using leaps package: i.e. first run install.packages("leaps") and library(leaps).
regsubsets(Y_var ~ predictors, data, method) – performs best subset selection by default; can perform forward, backward or sequential-replacement selection by setting method to "forward", "backward" or "seqrep"
- summary(regsubsets_object) returns which variables are included at each model size
- The summary object contains components such as Mallow's Cp, BIC and adjusted R^2
- Can use coef(regsubsets_object, num_variables) to extract the coefficients for a given model size
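A small sketch of forward selection on the built-in mtcars data (dataset and method choice are illustrative):

```r
library(leaps)
fit <- regsubsets(mpg ~ ., data = mtcars, method = "forward")
s   <- summary(fit)          # which variables enter at each model size
which.min(s$bic)             # model size with the lowest BIC
coef(fit, which.min(s$bic))  # coefficients for that model size
```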
LOOCV and k-fold CV on GLM models
Using boot package: i.e. first run install.packages("boot") and library(boot).
cv.glm(data, glm_model_object, K) – performs cross-validation using the fitted glm object and the data
- By default performs LOOCV, but the K argument specifies the number of folds for k-fold CV
- Access the cross-validation errors via $delta on the cv.glm object. It returns two values – the raw cross-validated error and a bias-corrected version adjusting for not using LOOCV
- See documentation here: cv.glm function - RDocumentation
- This approach only works for GLM objects
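A minimal sketch on the built-in mtcars data (the Gaussian GLM and 10-fold choice are illustrative):

```r
library(boot)
fit  <- glm(mpg ~ wt + hp, data = mtcars)  # gaussian GLM, equivalent to lm here
cv10 <- cv.glm(mtcars, fit, K = 10)        # 10-fold CV; omit K for LOOCV
cv10$delta                                 # raw and bias-corrected CV error
```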
Alternative approach: Manually creating folds using caret package
createFolds(target_variable, k) – creates k folds, returned by default as a list of row indexes, with a roughly equal distribution of the target variable in each fold
- Could use these folds together with a loop to compute cross-validated errors manually, as sketched below
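A sketch of the manual-fold approach, assuming a simple linear model on mtcars and MSE as the error metric:

```r
library(caret)
set.seed(1)
folds <- createFolds(mtcars$mpg, k = 5)  # list of test-row indexes per fold
cv_mse <- sapply(folds, function(test_idx) {
  fit <- lm(mpg ~ wt + hp, data = mtcars[-test_idx, ])  # train on other folds
  mean((mtcars$mpg[test_idx] - predict(fit, mtcars[test_idx, ]))^2)
})
mean(cv_mse)                             # 5-fold cross-validated MSE
```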
Fitting ridge regression and lasso regression models
Using glmnet package: i.e. first run install.packages("glmnet") and library(glmnet).
model.matrix(target_variable ~ predictors, data)[, -1] – creates a model matrix of the predictors plus an intercept column; the [, -1] drops the intercept column
- Required when using glmnet to fit lasso and ridge regression models
glmnet(x_var, y_var, alpha, lambda)
- x_var is the matrix of predictors created using model.matrix
- alpha = 0 specifies ridge regression; alpha = 1 specifies lasso regression
- lambda allows you to specify a custom range of lambda values to search across
predict(glmnet_model, s, type, newx) – using predict with a glmnet model object allows you to specify s, the value of lambda
- type = "coefficients" returns the coefficients; otherwise, supply the newx argument to return predicted values
cv.glmnet(x_var, y_var, alpha, nfolds = 10) – fits either a ridge or lasso regression based on the value of alpha, while simultaneously performing k-fold CV with nfolds folds (LOOCV if nfolds equals the number of observations)
- Extract the lambda that minimises the cross-validated error using $lambda.min on the cv.glmnet object
- See documentation here: Cross-validation for glmnet — cv.glmnet • glmnet (stanford.edu)
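A minimal lasso sketch on the built-in mtcars data (alpha = 1 and the seed are illustrative choices):

```r
library(glmnet)
x <- model.matrix(mpg ~ ., data = mtcars)[, -1]      # predictor matrix, intercept dropped
y <- mtcars$mpg
set.seed(1)
cv_lasso <- cv.glmnet(x, y, alpha = 1, nfolds = 10)  # lasso with 10-fold CV
cv_lasso$lambda.min                                  # lambda minimising CV error
predict(cv_lasso, s = "lambda.min", type = "coefficients")
predict(cv_lasso, s = "lambda.min", newx = x[1:5, ]) # predictions for first 5 rows
```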
Fitting tree models
Using tree package: i.e. first run install.packages("tree") and library(tree).
tree(target_variable ~ predictors, data, subset) – fits a simple decision tree model using the specified predictors
- Can use the subset argument, similar to a linear model
- Can plot a graph of the fitted tree using plot(tree_model) followed by text(tree_model, pretty = 0) to add labels
Using the rpart & rpart.plot packages: i.e. first run install.packages(c("rpart", "rpart.plot")) then library(rpart) and library(rpart.plot).
rpart(Sales ~ ., data, subset) – similar to tree, but supports plotting with the rpart.plot function
rpart.plot(rpart_tree_model) – plots the rpart tree model in a readable format
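A minimal sketch using the built-in iris data:

```r
library(rpart)
library(rpart.plot)
fit <- rpart(Species ~ ., data = iris)  # classification tree on iris
rpart.plot(fit)                         # plot the fitted tree
```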
Cross-validating optimal decision tree size and pruning the tree
cv.tree(tree_model, K) – pass a fitted tree model into the function to perform cross-validation
- Can specify K, the number of folds to use for cross-validation
- Can access cv_tree_object$size, cv_tree_object$dev and cv_tree_object$k for vectors of tree size, corresponding deviance and alpha (the cost-complexity parameter for pruning), to find the optimal size based on the lowest deviance
prune.tree(tree_model, best, k) – creates a pruned tree from an already-fitted tree model, given either the desired number of terminal nodes or the cost-complexity parameter
- best refers to the number of terminal nodes; k refers to the cost-complexity parameter
- Only one of best or k needs to be specified
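A sketch of cross-validating and pruning a tree fitted on iris (the dataset and 5-fold choice are illustrative; note that cv.tree's fold argument is a capital K):

```r
library(tree)
fit <- tree(Species ~ ., data = iris)
set.seed(1)
cv_fit    <- cv.tree(fit, K = 5)                 # 5-fold CV over subtree sizes
best_size <- cv_fit$size[which.min(cv_fit$dev)]  # size with lowest deviance
pruned    <- prune.tree(fit, best = best_size)   # prune to that size
plot(pruned); text(pruned, pretty = 0)
```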
Fitting bagging and random forest models
Using randomForest package: i.e. first run install.packages("randomForest") and library(randomForest).
randomForest(target_variable ~ predictors, data, importance, mtry, subset) – fits either a random forest model or a bagged model, depending on the mtry argument
- mtry is the number of variables randomly sampled as candidates at each split. For a bagged model, mtry should equal the number of predictors in the data; for a random forest it can be any smaller value (by default sqrt(p) for classification and p/3 for regression, where p is the number of predictors)
importance(rf_model) – outputs variable-importance measures for the fitted rf_model, based on the permutation-based increase in MSE averaged across trees and the total decrease in node purity
varImpPlot(rf_model, sort) – plots a variable-importance plot based on the same two metrics
- sort specifies whether to sort variables by importance in descending order; TRUE by default
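A sketch contrasting bagging and a random forest on mtcars (the dataset and seed are illustrative):

```r
library(randomForest)
set.seed(1)
p   <- ncol(mtcars) - 1                           # number of predictors
bag <- randomForest(mpg ~ ., data = mtcars,
                    mtry = p, importance = TRUE)  # bagging: mtry = p
rf  <- randomForest(mpg ~ ., data = mtcars,
                    importance = TRUE)            # default mtry = p/3 for regression
importance(rf)                                    # %IncMSE and IncNodePurity
varImpPlot(rf, sort = TRUE)
```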
Fitting a gradient boosted model
Using gbm package: i.e. first run install.packages("gbm") and library(gbm).
gbm(target_variable ~ predictors, distribution, data, n.trees, interaction.depth, shrinkage, cv.folds) – fits a generalised gradient boosted regression model
- distribution refers to the distribution used for the loss function when performing splits (e.g. "gaussian" for regression, "bernoulli" for binary classification)
- n.trees refers to the total number of ensemble trees to fit
- interaction.depth specifies the number of splits in each tree – 1 gives single-split stumps, while a depth of 2 is typically used to incorporate interaction effects
- shrinkage specifies the learning rate used in the gradient boosting algorithm
- cv.folds specifies how many folds to use when performing cross-validation – supplying it instructs the gbm function to perform cross-validation
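A minimal sketch on mtcars (all tuning values are illustrative, and mtcars is small for boosting; gbm.perf extracts the CV-optimal tree count):

```r
library(gbm)
set.seed(1)
fit <- gbm(mpg ~ ., data = mtcars, distribution = "gaussian",
           n.trees = 500, interaction.depth = 2,
           shrinkage = 0.01, cv.folds = 5)
best_iter <- gbm.perf(fit, method = "cv")         # CV-optimal number of trees
predict(fit, mtcars[1:5, ], n.trees = best_iter)  # predictions at that size
```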
Fitting hierarchical clustering
hclust(dist(data), method) – performs hierarchical clustering
- Need to wrap the data in dist() to create a dissimilarity matrix from the data
- method specifies the linkage method to be used: can specify complete, average or single
- Can use plot(hclust_object) to plot the dendrogram
cutree(hclust_object, k, h) – cuts an hclust_object and returns a cluster label for each observation
- Can specify either k or h to cut the tree: k refers to the desired number of clusters, h refers to the height at which to cut the tree
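A minimal sketch on the built-in USArrests data (scaling and the four-cluster cut are illustrative):

```r
hc <- hclust(dist(scale(USArrests)), method = "complete")  # complete linkage
plot(hc)                                                   # dendrogram
clusters <- cutree(hc, k = 4)                              # four-cluster labels
table(clusters)
```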
Fitting a k-means model
kmeans(data, centers, nstart) – performs k-means clustering on the data, with the number of clusters specified by centers
- nstart specifies how many random initialisations to try; R runs k-means from each and keeps the solution with the lowest within-cluster variance
- Can access the total within-cluster sum of squares using $tot.withinss on the kmeans_model
- Can access the final cluster labels output by the k-means algorithm using $cluster on the kmeans_model
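A minimal sketch, again on scaled USArrests (centers = 4 and nstart = 20 are illustrative):

```r
set.seed(1)
km <- kmeans(scale(USArrests), centers = 4, nstart = 20)  # 20 random starts
km$tot.withinss                                           # total within-cluster SS
head(km$cluster)                                          # cluster labels
```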
Performing principal components analysis
prcomp(data, scale, center) – performs principal components analysis on the data
- scale specifies whether to scale variables to have standard deviation one
- center specifies whether to shift variables to have mean zero
- The $rotation object of the pca_model contains the loadings of each variable on each principal component
- $x contains the principal component scores, i.e. the coordinates of each observation along each principal component
- $sdev contains the standard deviation of each principal component – square it to obtain each component's variance, and hence calculate the proportion of total variance explained by each principal component
biplot(pr_object, scale = 0) – plots a biplot of the fitted pr_object: the observations on a scatterplot of the first two principal components, overlaid with the variable loadings
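A minimal PCA sketch on USArrests:

```r
pca <- prcomp(USArrests, scale = TRUE, center = TRUE)
pve <- pca$sdev^2 / sum(pca$sdev^2)  # proportion of variance explained
round(pve, 3)
biplot(pca, scale = 0)               # observations and loadings on PC1/PC2
```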