Glossary of relevant R functions
Creating training/validation/test splits
sample(vector/number, size)
- Samples a given number of elements from a vector, or from 1 to n when given a single number
- Base R
sample_frac(data, proportion)
- Samples a given proportion of rows from a dataset
- dplyr package
- Can combine with anti_join from the dplyr package to obtain the complementary set of rows
sample.split(target_variable, split_ratio)
- caTools package
- Returns a logical vector marking training rows, preserving the ratio of labels of the target variable
createDataPartition(target_variable, number_of_partitions, training_proportion)
- caret package
- Creates training/test partitions with similar distributions of the target variable y
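A minimal sketch of the four splitting approaches above, using the built-in iris data purely for illustration (the dataset and the 70/30 split are assumptions, not from the source):

```r
library(dplyr)
library(caTools)
library(caret)

set.seed(1)  # for reproducible sampling

# Base R: sample 100 row indexes for training, use the rest for testing
train_idx <- sample(nrow(iris), 100)
train <- iris[train_idx, ]
test  <- iris[-train_idx, ]

# dplyr: sample 70% of rows, then anti_join to recover the remaining rows
iris_id <- mutate(iris, row_id = row_number())  # key needed for anti_join
train_d <- sample_frac(iris_id, 0.7)
test_d  <- anti_join(iris_id, train_d, by = "row_id")

# caTools: logical vector that preserves the label ratio of Species
split   <- sample.split(iris$Species, SplitRatio = 0.7)
train_c <- iris[split, ]
test_c  <- iris[!split, ]

# caret: row indexes with a similar distribution of Species in each partition
in_train <- createDataPartition(iris$Species, times = 1, p = 0.7, list = FALSE)
train_k  <- iris[in_train, ]
test_k   <- iris[-in_train, ]
```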
EDA functions
hist(data, breaks)
- Plots a histogram of a vector of data
- Breaks argument allows you to specify the number of breaks/bins to use
par(mfrow = c(a, b))
- Specifies the base R plotting layout
- Displays subsequent plots in a grid of a rows by b columns
pairs(data)
- Plots a matrix of scatterplots
- Categorical and logical variables are converted to numeric codes, similar to data.matrix()
See also: ggplot2 introduction and quick examples
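A quick sketch of these EDA functions, using the built-in mtcars data for illustration:

```r
par(mfrow = c(1, 2))           # display plots in a 1-row-by-2-column grid
hist(mtcars$mpg, breaks = 10)  # histogram with roughly 10 bins
hist(mtcars$hp,  breaks = 10)
par(mfrow = c(1, 1))           # reset the plotting layout

pairs(mtcars[, c("mpg", "hp", "wt")])  # scatterplot matrix of three variables
```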
Linear models and generalised linear models
lm(target_variable ~ predictors, data, subset, offset)
- Fits a linear regression of the target variable on the specified predictors
- offset specifies a term with a known coefficient – such as using population to scale the predicted value proportionally to population
- subset lets you specify row indexes to train on, rather than manually subsetting the data
- Can call plot(model_object) to plot diagnostic plots
- Can call summary(model_object) to display a summary table of coefficients and p-values
glm(target_variable ~ predictors, family, data, offset, subset)
- Fits a GLM using the specified distributional family
- Can specify a custom link function to use instead of the canonical link function – see documentation
- Can call plot(glm_object) to plot diagnostic plots
- Can call summary(glm_object) to display a summary table of coefficients and p-values
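A minimal sketch of both model types, fitted on the built-in mtcars data for illustration (the chosen variables are assumptions):

```r
# Linear regression trained on rows 1-25 via the subset argument
fit_lm <- lm(mpg ~ hp + wt, data = mtcars, subset = 1:25)
summary(fit_lm)        # coefficient table and p-values
par(mfrow = c(2, 2))
plot(fit_lm)           # the four diagnostic plots
par(mfrow = c(1, 1))

# Logistic regression: a GLM with the binomial family (canonical logit link)
fit_glm <- glm(am ~ hp + wt, family = binomial, data = mtcars)
summary(fit_glm)
```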
Fitting a k-nearest neighbours model
Using the class package: i.e. first run install.packages("class") and library(class).
knn(train, test, cl, k, prob)
- train specifies the training dataset to use for KNN
- test specifies the test dataset to predict with the KNN model
- cl is a vector of the true classification labels for the training data
- k specifies the number of nearest neighbours
- Outputs a vector of predicted labels for the test observations
- Setting prob = TRUE additionally returns the proportion of votes for the winning class, stored as the "prob" attribute of the output
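A minimal sketch on the built-in iris data (an assumed example); note that knn expects numeric predictors:

```r
library(class)

set.seed(1)
train_idx    <- sample(nrow(iris), 100)
train_x      <- iris[train_idx, 1:4]     # numeric predictors only
test_x       <- iris[-train_idx, 1:4]
train_labels <- iris$Species[train_idx]  # cl: labels for the training rows

pred <- knn(train_x, test_x, cl = train_labels, k = 5, prob = TRUE)
table(pred, iris$Species[-train_idx])  # confusion matrix on the test set
head(attr(pred, "prob"))               # proportion of votes for the winning class
```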
Subset selection: Best, forward and backward
Using the leaps package: i.e. first run install.packages("leaps") and library(leaps).
regsubsets(Y_var ~ predictors, Data, method)
- Performs best subset selection by default (method = "exhaustive"); can perform forward, backward or sequential-replacement selection by setting method to "forward", "backward" or "seqrep"
- summary(regsubsets_object) returns the list of variables used for each model size
- The summary object has sub-objects such as Mallows' Cp, BIC and adjusted R^2
- Can use coef(regsubsets_object, num_variables) to extract coefficients for a given model size
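A minimal sketch on the built-in mtcars data (an assumed example):

```r
library(leaps)

# Best subset selection (method = "exhaustive" is the default)
fit_full <- regsubsets(mpg ~ ., data = mtcars, method = "exhaustive")
fit_sum  <- summary(fit_full)

fit_sum$cp                           # Mallows' Cp for each model size
best_size <- which.min(fit_sum$bic)  # model size with the lowest BIC
coef(fit_full, best_size)            # coefficients of that model
```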
LOOCV and k-fold CV on GLM models
Using the boot package: i.e. first run install.packages("boot") and library(boot).
cv.glm(Data, glm_model_object, K)
- Performs cross-validation using the fitted glm object and data
- By default performs LOOCV, but the K argument specifies the number of folds for k-fold CV instead
- Access cross-validation errors using $delta on the cv.glm object. Returns two values – the raw cross-validated error and a bias-corrected version that adjusts for not using LOOCV
- See documentation here: cv.glm function - RDocumentation
- This approach only works for GLM objects
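A minimal sketch on the built-in mtcars data (an assumed example):

```r
library(boot)

fit <- glm(mpg ~ hp + wt, data = mtcars)  # gaussian family = ordinary regression

cv_loocv <- cv.glm(mtcars, fit)           # LOOCV by default
cv_10    <- cv.glm(mtcars, fit, K = 10)   # 10-fold CV instead

cv_10$delta  # raw cross-validated error and the bias-corrected version
```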
Alternative approach: manually creating folds using the caret package
createFolds(target_variable, k)
- Creates k folds, by default returned as a list of row indexes, with a roughly equal distribution of the target variable in each fold
- Could use these folds along with a loop to compute cross-validated errors manually, as in the sketch below
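A sketch of the manual approach, again on mtcars (an assumed example):

```r
library(caret)

set.seed(1)
folds <- createFolds(mtcars$mpg, k = 5)  # list of held-out row indexes per fold

cv_errors <- sapply(folds, function(test_idx) {
  fit  <- lm(mpg ~ hp + wt, data = mtcars[-test_idx, ])  # train on other folds
  pred <- predict(fit, newdata = mtcars[test_idx, ])     # predict held-out fold
  mean((mtcars$mpg[test_idx] - pred)^2)                  # test MSE for the fold
})
mean(cv_errors)  # cross-validated error estimate
```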
Fitting ridge regression and lasso regression models
Using the glmnet package: i.e. first run install.packages("glmnet") and library(glmnet).
model.matrix(target_variable ~ predictors, Data)[, -1]
- Creates a model matrix with the predictors and an intercept. Use [, -1] to drop the created intercept column
- Required when using glmnet to fit lasso and ridge regression models
glmnet(x_var, y_var, alpha, lambda)
- x_var is the matrix of predictors created using model.matrix
- alpha = 0 specifies a ridge regression, alpha = 1 specifies a lasso regression
- lambda allows you to specify a custom range of lambda values to search across
predict(glmnet_model, s, type, newx)
- Using predict with a glmnet model object allows you to specify s, the value of lambda
- type = "coefficients" returns the coefficients at that lambda; to return predicted values instead, drop type and supply new data via the newx argument
cv.glmnet(x_var, y_var, alpha, nfolds = 10)
- Fits either a ridge regression or lasso regression based on the value of alpha
- Simultaneously performs k-fold CV (or LOOCV) using the nfolds argument, the number of folds
- Allows you to extract the lambda that minimises the cross-validated error using $lambda.min on the cv.glmnet object
- See documentation here: Cross-validation for glmnet — cv.glmnet • glmnet (stanford.edu)
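A minimal sketch fitting a cross-validated lasso on mtcars (an assumed example); set alpha = 0 for ridge instead:

```r
library(glmnet)

x <- model.matrix(mpg ~ ., data = mtcars)[, -1]  # drop the intercept column
y <- mtcars$mpg

set.seed(1)
cv_lasso    <- cv.glmnet(x, y, alpha = 1, nfolds = 10)  # lasso with 10-fold CV
best_lambda <- cv_lasso$lambda.min                      # lambda minimising CV error

predict(cv_lasso, s = best_lambda, type = "coefficients")  # coefficients at best lambda
pred <- predict(cv_lasso, s = best_lambda, newx = x)       # predictions need newx
```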
Fitting tree models
Using the tree package: i.e. first run install.packages("tree") and library(tree).
tree(target_variable ~ predictors, data, subset)
- Fits a simple decision tree model using specified predictors
- Can use subset argument, similar to a linear model
- Can plot a graph of the fitted tree using plot(tree_model) followed by text(tree_model, pretty = 0)
Using the rpart & rpart.plot packages: i.e. first run install.packages(c("rpart", "rpart.plot")) then library(rpart) and library(rpart.plot).
rpart(target_variable ~ ., data, subset)
- Similar to tree, but allows plotting with the rpart.plot function
rpart.plot(rpart_tree_model)
- Plots the rpart tree model as a clean, readable diagram
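A minimal sketch of both packages on the built-in iris data (an assumed example):

```r
library(tree)
library(rpart)
library(rpart.plot)

# tree: fit, then plot with base graphics
fit_tree <- tree(Species ~ ., data = iris)
plot(fit_tree)
text(fit_tree, pretty = 0)  # add the split labels

# rpart: fit, then plot with rpart.plot for a cleaner diagram
fit_rpart <- rpart(Species ~ ., data = iris)
rpart.plot(fit_rpart)
```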
Cross-validating optimal decision tree size and pruning the tree
cv.tree(tree_model, K)
- Input a fitted tree model into the function to perform cross-validation
- Can specify K, the number of folds to use for cross-validation
- Can access cv_tree_object$size, cv_tree_object$dev and cv_tree_object$k for vectors of tree size, corresponding deviance and alpha (the cost-complexity parameter for pruning), to find the optimal tree size based on lowest deviance
prune.tree(tree_model, best, k)
- Creates a new pruned tree from an already fitted tree model, given either the desired number of terminal nodes or the cost-complexity parameter
- best refers to the number of terminal nodes
- k refers to the cost-complexity parameter
- Only one of best or k needs to be specified
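A sketch of cross-validating and pruning a tree on iris (an assumed example):

```r
library(tree)

fit <- tree(Species ~ ., data = iris)

set.seed(1)
cv_fit    <- cv.tree(fit, K = 10)                # 10-fold CV across subtree sizes
best_size <- cv_fit$size[which.min(cv_fit$dev)]  # size with the lowest deviance

pruned <- prune.tree(fit, best = best_size)      # prune to that many terminal nodes
plot(pruned)
text(pruned, pretty = 0)
```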
Fitting bagging and random forest models
Using the randomForest package: i.e. first run install.packages("randomForest") and library(randomForest).
randomForest(target_variable ~ predictors, data, importance, mtry, subset)
- Fits either a random forest model or a bagged model based on what is specified for the mtry argument
- mtry refers to the number of variables randomly sampled as split candidates at each split. When fitting a bagged model, mtry should equal the number of predictors in the data; in a random forest model it can be any smaller value (by default sqrt(p) for classification and p/3 for regression, where p is the number of predictors)
importance(rf_model)
- Outputs a table of variable importance for the fitted rf_model, based on the average increase in prediction error when each variable is permuted, and the total decrease in node purity across splits
varImpPlot(rf_model, sort)
- Plots a variable importance plot based on the same two metrics
- sort specifies whether to sort variables by importance in descending order; TRUE by default
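A minimal sketch on mtcars, which has p = 10 predictors of mpg (an assumed example):

```r
library(randomForest)

set.seed(1)
# Bagging: mtry equals the number of predictors
bagged <- randomForest(mpg ~ ., data = mtcars, mtry = 10, importance = TRUE)
# Random forest: default mtry (p/3 for regression)
rf     <- randomForest(mpg ~ ., data = mtcars, importance = TRUE)

importance(rf)               # variable importance table
varImpPlot(rf, sort = TRUE)  # importance plot, sorted in descending order
```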
Fitting a gradient boosted model
Using the gbm package: i.e. first run install.packages("gbm") and library(gbm).
gbm(target_variable ~ predictors, distribution, data, n.trees, interaction.depth, shrinkage, cv.folds)
- Fits a generalised gradient boosted regression model
- distribution refers to the distribution used for the loss function when performing splits in the GBM model
- n.trees refers to the total number of trees to fit in the ensemble
- interaction.depth specifies the number of splits in each tree – 1 gives stumps with a single split, while a depth of 2 is typically used to incorporate interaction effects
- shrinkage specifies the learning rate to be used in the gradient boosting algorithm
- cv.folds specifies how many folds to use when performing cross-validation – use it to instruct the gbm function to perform cross-validation while fitting
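A minimal sketch of a boosted regression on mtcars (an assumed example; the tuning values are illustrative, not recommendations):

```r
library(gbm)

set.seed(1)
boost <- gbm(mpg ~ ., data = mtcars,
             distribution = "gaussian",  # squared-error loss for regression
             n.trees = 1000,             # total ensemble size
             interaction.depth = 2,      # allows two-way interaction effects
             shrinkage = 0.01,           # learning rate
             cv.folds = 5)               # also perform 5-fold CV while fitting

best_iter <- gbm.perf(boost, method = "cv")  # CV-chosen number of trees
pred <- predict(boost, newdata = mtcars, n.trees = best_iter)
```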
Fitting hierarchical clustering
hclust(dist(data), method)
- Need to wrap the data in the dist() function to create a dissimilarity matrix based on the data
- method specifies the linkage method to be used: can specify complete, average or single
- Can use plot(hclust_object) to plot the dendrogram
cutree(hclust_object, k, h)
- Cuts an hclust_object and returns cluster labels corresponding to each observation
- Can specify either k or h to cut the tree
- k refers to the desired number of clusters
- h refers to the height at which to cut the tree
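A minimal sketch on the scaled iris measurements (scaling is an assumption, added for illustration):

```r
x  <- scale(iris[, 1:4])                    # standardise before computing distances
hc <- hclust(dist(x), method = "complete")  # complete linkage
plot(hc)                                    # dendrogram

clusters_k <- cutree(hc, k = 3)  # cut into 3 clusters
clusters_h <- cutree(hc, h = 4)  # or cut at height 4
table(clusters_k, iris$Species)  # compare clusters with the true labels
```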
Fitting a kmeans model
kmeans(data, centers, nstart)
- Performs k-means clustering on the data, with the number of clusters specified by centers
- nstart specifies the number of random initial configurations to try; R runs k-means from each one and keeps the best solution, i.e. the one with the lowest within-cluster variance
- Can access the total within-cluster sum of squares using the $tot.withinss element of the kmeans_model
- Can access the final cluster labels output by the k-means algorithm using the $cluster element of kmeans_model
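A minimal sketch on the scaled iris measurements (an assumed example):

```r
set.seed(1)  # k-means depends on random initial assignments
km <- kmeans(scale(iris[, 1:4]), centers = 3, nstart = 25)

km$tot.withinss  # total within-cluster sum of squares
km$cluster       # final cluster label for each observation
```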
Performing principal components analysis
prcomp(data, scale, center)
- Performs principal components analysis on the data
- scale specifies whether to scale variables to have standard deviation one
- center specifies whether to shift variables to have mean zero
- The $rotation element of pca_model contains the loadings of each variable on each principal component
- $x contains the principal component scores, i.e. the coordinates of each observation along each principal component
- $sdev contains the standard deviation of each principal component – square it to obtain each component's variance, and hence calculate the proportion of total variance explained by each component
biplot(pr_object, scale = 0)
- Plots a biplot of the fitted pr_object: the observations on a scatterplot of the first two principal components, with the variable loadings overlaid
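A minimal sketch on mtcars (an assumed example):

```r
pca <- prcomp(mtcars, scale = TRUE, center = TRUE)

pca$rotation                         # loadings on each principal component
head(pca$x)                          # principal component scores
pve <- pca$sdev^2 / sum(pca$sdev^2)  # proportion of variance explained
pve

biplot(pca, scale = 0)               # observations on the first two PCs
```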