Glossary of relevant R functions

Creating training/validation/test splits

  • sample(vector/number, size) – Samples size elements at random from a vector, or from 1:n when given a single number n
    • Base R
  • sample_frac(data, proportion) – samples the given proportion of rows from a data frame
    • dplyr package
    • Can combine with anti_join from the dplyr package to recover the complement of the sampled rows (e.g. the test set)
  • sample.split(target_variable, split_ratio)
    • caTools package
    • Returns a logical vector (TRUE for rows in the training split) while preserving the ratio of labels of the target variable
  • createDataPartition(target_variable, number_of_partitions, training_proportion)
    • caret package
    • Creates training/test partitions with similar distributions of the target variable y (see the sketch after this list)
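
For example, a minimal sketch of a plain random split and a stratified split, assuming the built-in iris dataset (chosen purely for illustration):

  set.seed(1)

  # Base R: sample 70% of row indices for training
  train_idx <- sample(nrow(iris), size = floor(0.7 * nrow(iris)))
  train <- iris[train_idx, ]
  test  <- iris[-train_idx, ]

  # caret: stratified split preserving the class distribution of Species
  library(caret)
  part_idx <- createDataPartition(iris$Species, p = 0.7, list = FALSE)
  train2 <- iris[part_idx, ]
  test2  <- iris[-part_idx, ]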

EDA functions

  • hist(data, breaks)
    • Plots a histogram of a vector of data
    • The breaks argument lets you suggest the number of breaks/bins to use
  • par(mfrow = c(a,b))
    • Specifies plotting display in R
    • Will display a grid of plots a rows by b columns
  • pairs(data)
    • Plots a matrix of scatterplots
    • Categorical and logical variables are converted to numeric codes, much as data.matrix() does (see the sketch below)
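
A minimal EDA sketch, assuming the built-in iris dataset:

  par(mfrow = c(1, 2))                  # display plots in a 1 x 2 grid
  hist(iris$Sepal.Length, breaks = 20)  # histogram with roughly 20 bins
  hist(iris$Petal.Length, breaks = 20)
  par(mfrow = c(1, 1))                  # reset the plotting grid
  pairs(iris)                           # scatterplot matrix of all variables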

See also: ggplot2 introduction and quick examples

Linear models and generalised linear models

  • lm(target_variable ~ predictors, data, subset, offset)
    • Fits a simple linear regression using the specified predictors on the target variable
    • offset lets you include a term whose coefficient is known and fixed at 1, such as an exposure term used to scale the predicted value proportionally to population
    • subset lets you specify which rows to use when fitting, so you can pass row indices rather than subsetting the data manually
    • Can call plot(model_object) to plot diagnostic plots
    • Can call summary(model_object) to display summary table of coefficients and p-values
  • glm(target_variable ~ predictors, family, data, offset, subset)
    • Fits a glm model using a specified distributional family
    • Can specify a custom link function to use instead of the canonical link function (see the documentation)
    • Can call plot(glm_object) to plot diagnostic plots
    • Can call summary(glm_object) to display summary table of coefficients and p-values (see the sketch below)
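
A minimal sketch, assuming the built-in mtcars dataset (variable choices are illustrative only):

  fit_lm <- lm(mpg ~ wt + hp, data = mtcars, subset = 1:25)  # fit on the first 25 rows only
  summary(fit_lm)                                            # coefficient table with p-values
  par(mfrow = c(2, 2)); plot(fit_lm); par(mfrow = c(1, 1))   # diagnostic plots

  # Logistic regression via glm with the binomial family
  fit_glm <- glm(am ~ wt + hp, family = binomial, data = mtcars)
  summary(fit_glm)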

Fitting a k-nearest neighbours model

Using class package: i.e. first run install.packages("class") and library(class).

  • knn(train, test, cl, k, prob)
    • train specifies training dataset to use for KNN
    • test specifies test dataset to predict using KNN model
    • cl is a vector of the true class labels for the training data
    • k specifies the number of nearest neighbours
    • Outputs a factor of predicted labels for the test set
    • With prob = TRUE, the proportion of votes for the winning class is also returned as the "prob" attribute (see the sketch below)
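
A minimal sketch, assuming the built-in iris dataset:

  library(class)
  set.seed(1)
  train_idx <- sample(nrow(iris), 100)
  train_X <- iris[train_idx, 1:4]
  test_X  <- iris[-train_idx, 1:4]
  train_y <- iris$Species[train_idx]

  pred <- knn(train = train_X, test = test_X, cl = train_y, k = 5, prob = TRUE)
  table(pred, iris$Species[-train_idx])  # confusion matrix against the true labels
  head(attr(pred, "prob"))               # proportion of votes for the winning class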

Subset selection: Best, forward and backward

Using leaps package: i.e. first run install.packages("leaps") and library(leaps).

  • regsubsets(Y_var ~ predictors, Data, method)
    • method can be set to "exhaustive" (best subset, the default), "forward", "backward" or "seqrep" (sequential replacement)
    • summary(regsubsets_object) returns a list of variables used for each model size
    • The summary object has components such as Mallows' Cp ($cp), BIC ($bic) and adjusted R^2 ($adjr2)
    • Can use coef(regsubsets_object, num_variables) to extract coefficients for a given model size (see the sketch below)
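
A minimal sketch, assuming the built-in mtcars dataset:

  library(leaps)
  fit <- regsubsets(mpg ~ ., data = mtcars, method = "forward", nvmax = 10)
  fit_summary <- summary(fit)
  fit_summary$bic                      # BIC for each model size
  best_size <- which.min(fit_summary$bic)
  coef(fit, best_size)                 # coefficients of the best model by BIC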

LOOCV and k-fold CV on GLM models

Using boot package: i.e. first run install.packages("boot") and library(boot).

  • cv.glm(Data, glm_model_object, K) – performs cross-validation using the fitted glm object and data
    • By default performs LOOCV; use the K argument to specify the number of folds and perform k-fold CV instead
    • Access the cross-validation errors using $delta on the cv.glm object. Returns two values: the raw cross-validated error and a bias-corrected version that compensates for not using LOOCV
    • See the cv.glm documentation on RDocumentation
    • This approach only works for GLM objects (see the sketch below)
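
A minimal sketch, assuming the built-in mtcars dataset:

  library(boot)
  fit <- glm(mpg ~ wt + hp, data = mtcars)  # gaussian glm, equivalent to lm here
  cv_loocv  <- cv.glm(mtcars, fit)          # LOOCV by default
  cv_10fold <- cv.glm(mtcars, fit, K = 10)  # 10-fold CV
  cv_10fold$delta                           # raw and bias-corrected CV error estimates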

Alternative approach: Manually creating folds using caret package

  • createFolds(target_variable, k)
    • Creates k folds; by default returns a list of held-out row indices, with a roughly equal distribution of the target variable in each fold
    • Could use these folds with a loop to compute cross-validated errors manually, as in the sketch below
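
A minimal sketch of manual k-fold CV, assuming the built-in mtcars dataset:

  library(caret)
  set.seed(1)
  folds <- createFolds(mtcars$mpg, k = 5)  # list of 5 held-out index sets
  cv_errors <- sapply(folds, function(test_idx) {
    fit  <- lm(mpg ~ wt + hp, data = mtcars[-test_idx, ])
    pred <- predict(fit, newdata = mtcars[test_idx, ])
    mean((mtcars$mpg[test_idx] - pred)^2)  # MSE on the held-out fold
  })
  mean(cv_errors)                          # cross-validated MSE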

Fitting ridge regression and lasso regression models

Using glmnet package: i.e. first run install.packages("glmnet") and library(glmnet).

  • model.matrix(target_variable ~ predictors, Data)[, -1]
    • Creates a model matrix with the predictors and an intercept. Use [, -1] to drop the created intercept column
    • Required when using glmnet to fit lasso and ridge regression models
  • glmnet(x_var, y_var, alpha, lambda)
    • x_var is the matrix of predictors created using model.matrix
    • alpha = 0 specifies a ridge regression, alpha = 1 specifies a lasso regression
    • lambda allows you to specify a custom sequence of lambda values to search across
  • predict(glmnet_model, s, type, newx)
    • Using predict with a glmnet model object allows you to specify s, the value of lambda to predict at
    • type = "coefficients" returns the coefficients at lambda = s; to get predicted values instead, omit type and supply the new data via the newx argument
  • cv.glmnet(x_var, y_var, alpha, nfolds = 10)
    • Fits either a ridge regression or lasso regression based on the value of alpha
    • Performs k-fold cross-validation with nfolds folds (setting nfolds to the number of observations gives LOOCV)
    • Allows you to extract the lambda that minimises the cross-validated error using $lambda.min on the cv.glmnet object (see the sketch below)
    • See the cv.glmnet documentation on the glmnet site (stanford.edu)
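
A minimal sketch, assuming the built-in mtcars dataset:

  library(glmnet)
  x <- model.matrix(mpg ~ ., data = mtcars)[, -1]  # predictor matrix, intercept dropped
  y <- mtcars$mpg

  set.seed(1)
  cv_fit <- cv.glmnet(x, y, alpha = 1, nfolds = 10)  # 10-fold CV for the lasso
  best_lambda <- cv_fit$lambda.min                   # lambda minimising the CV error

  fit <- glmnet(x, y, alpha = 1)
  predict(fit, s = best_lambda, type = "coefficients")  # coefficients at best lambda
  head(predict(fit, s = best_lambda, newx = x))         # fitted values at best lambda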

Fitting tree models

Using tree package: i.e. first run install.packages("tree") and library(tree).

  • tree(target_variable ~ predictors, data, subset)
    • Fits a simple decision tree model using specified predictors
    • Can use subset argument, similar to a linear model
    • Can plot a graph of the fitted tree using:
      • plot(tree_model)
      • text(tree_model, pretty = 0)

Using the rpart & rpart.plot packages: i.e. first run install.packages(c("rpart", "rpart.plot")) then library(rpart) and library(rpart.plot).

  • rpart(target_variable ~ ., data, subset)
    • Similar to tree but allows plotting using the rpart.plot function
  • rpart.plot(rpart_tree_model)
    • Plots the rpart tree model as a clean, readable diagram (see the sketch below)
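
A minimal sketch, assuming the built-in iris dataset:

  library(tree)
  tree_fit <- tree(Species ~ ., data = iris)
  plot(tree_fit); text(tree_fit, pretty = 0)  # base plot of the fitted tree

  library(rpart); library(rpart.plot)
  rpart_fit <- rpart(Species ~ ., data = iris)
  rpart.plot(rpart_fit)                       # nicer rendering of the same tree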

Cross-validating optimal decision tree size and pruning the tree

  • cv.tree(tree_model, K)
    • Takes a fitted tree model and performs cross-validation to find the optimal tree size
    • Can specify K, the number of folds to use for cross-validation
    • The result has $size, $dev and $k: vectors of the tree size, the corresponding deviance, and alpha (the cost-complexity parameter for pruning). The optimal size or alpha is the one with the lowest deviance
  • prune.tree(tree_model, best, k)
    • Creates a pruned tree from an already fitted tree model, given either the desired number of terminal nodes or the cost-complexity parameter
    • best refers to the number of terminal nodes
    • k refers to the cost-complexity parameter
    • Only one of best or k needs to be specified (see the sketch below)
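
A minimal sketch, assuming the built-in iris dataset:

  library(tree)
  tree_fit <- tree(Species ~ ., data = iris)

  set.seed(1)
  cv_fit <- cv.tree(tree_fit, K = 5)               # 5-fold CV over subtree sizes
  best_size <- cv_fit$size[which.min(cv_fit$dev)]  # size with the lowest deviance

  pruned <- prune.tree(tree_fit, best = best_size)
  plot(pruned); text(pruned, pretty = 0)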

Fitting bagging and random forest models

Using randomForest package: i.e. first run install.packages("randomForest") and library(randomForest).

  • randomForest(target_variable ~ predictors, data, importance, mtry, subset)
    • Fits either a random forest model or a bagged model based on what is specified for the mtry argument
    • mtry is the number of predictors randomly sampled as split candidates at each node. For a bagged model, set mtry to the total number of predictors; for a random forest it can be any smaller value (by default sqrt(p) for classification and p/3 for regression, where p is the number of predictors)
  • importance(rf_model)
    • Outputs variable importance measures for the fitted rf_model: a permutation-based measure (mean increase in error when the variable is permuted) and the total decrease in node impurity
  • varImpPlot(rf_model, sort)
    • Plots a variable importance plot based on the same two metrics
    • sort specifies whether to sort variables by importance in descending order; defaults to TRUE (see the sketch below)
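
A minimal sketch, assuming the built-in mtcars dataset:

  library(randomForest)
  set.seed(1)
  p <- ncol(mtcars) - 1  # number of predictors

  bag_fit <- randomForest(mpg ~ ., data = mtcars, mtry = p, importance = TRUE)  # bagging
  rf_fit  <- randomForest(mpg ~ ., data = mtcars, importance = TRUE)            # random forest, default mtry

  importance(rf_fit)  # %IncMSE (permutation) and IncNodePurity measures
  varImpPlot(rf_fit, sort = TRUE)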

Fitting a gradient boosted model

Using gbm package: i.e. first run install.packages("gbm") and library(gbm).

  • gbm(target_variable ~ predictors, distribution, data, n.trees, interaction.depth, shrinkage, cv.folds)
    • Fits a generalised gradient boosted regression model
    • distribution refers to the distribution used for the loss function when performing splits using the GBM model
    • n.trees refers to the total number of ensemble trees to fit
    • interaction.depth specifies the maximum depth of each tree: a depth of 1 fits stumps (a single split, giving an additive model), while a depth of 2 is typically used to incorporate two-way interaction effects
    • shrinkage specifies the learning rate to be used in the gradient boosting algorithm
    • cv.folds specifies how many folds to use for cross-validation; supplying it instructs the gbm function to perform cross-validation while fitting (see the sketch below)
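
A minimal sketch, assuming the built-in mtcars dataset. Note that gbm.perf (not listed above) is used here to pick the CV-optimal number of trees:

  library(gbm)
  set.seed(1)
  boost_fit <- gbm(mpg ~ ., data = mtcars,
                   distribution = "gaussian",  # squared-error loss for regression
                   n.trees = 1000, interaction.depth = 2,
                   shrinkage = 0.01, cv.folds = 5)
  best_n <- gbm.perf(boost_fit, method = "cv")  # CV-optimal number of trees
  head(predict(boost_fit, newdata = mtcars, n.trees = best_n))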

Fitting hierarchical clustering

  • hclust(dist(data), method)
    • Wrap the data in the dist() function to create a dissimilarity matrix from the data
    • method specifies the linkage method to be used, e.g. "complete", "average" or "single"
    • Can use plot(hclust_object) to plot dendrogram
  • cutree(hclust_object, k, h)
    • Cuts an hclust_object and returns cluster labels corresponding to each observation (see the sketch below)
    • Can either specify k or h to cut the tree
      • k refers to the desired number of clusters
      • h refers to the height at which to cut the tree
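
A minimal sketch, assuming the numeric columns of the built-in iris dataset:

  hc <- hclust(dist(iris[, 1:4]), method = "complete")  # complete linkage
  plot(hc)                       # dendrogram
  clusters <- cutree(hc, k = 3)  # cut into 3 clusters
  table(clusters, iris$Species)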

Fitting a kmeans model

  • kmeans(data, centers, nstart)
    • Performs k-means clustering on the data, with the number of clusters given by centers
    • nstart specifies the number of random initialisations to try. R runs k-means once per start and keeps the solution with the lowest within-cluster variance
    • Can access the total within-cluster sum of squares using the $tot.withinss component of the kmeans_model
    • Can access the final cluster labels output by the k-means algorithm using the $cluster component of the kmeans_model (see the sketch below)
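
A minimal sketch, assuming the numeric columns of the built-in iris dataset:

  set.seed(1)
  km <- kmeans(iris[, 1:4], centers = 3, nstart = 20)
  km$tot.withinss                 # total within-cluster sum of squares
  table(km$cluster, iris$Species)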

Performing principal components analysis

  • prcomp(data, scale, center)
    • Performs principal components analysis on the data
    • scale specifies whether to scale variables to have standard deviation one
    • center specifies whether to shift variables to have mean of 0
    • $rotation object of pca_model contains the component loadings on each of the principal components
    • $x contains the principal component scores, i.e. the coordinates of each observation along each principal component direction
    • $sdev contains the standard deviation of each principal component. Squaring these values gives the variance of each component, from which you can calculate the proportion of total variance explained by each principal component
  • biplot(pr_object, scale = 0)
    • Plots a biplot of the fitted pr_object, showing the data points on a scatterplot of the first two principal components (see the sketch below)
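
A minimal sketch, assuming the built-in USArrests dataset:

  pca <- prcomp(USArrests, scale = TRUE, center = TRUE)
  pca$rotation                         # loadings for each principal component
  head(pca$x)                          # principal component scores
  pve <- pca$sdev^2 / sum(pca$sdev^2)  # proportion of variance explained
  pve
  biplot(pca, scale = 0)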