Machine Learning

Predictive modelling with trees, ensembles, SVMs, neural networks, and principled workflows for validation and feature selection

Machine learning in applied statistics is ultimately about prediction: building a model whose accuracy on unseen data is the criterion of success. This area covers the major supervised and unsupervised methods a biostatistician is likely to use, framed around the workflow that separates honest ML from overfit artefacts: train/test splits, cross-validation, tuning on validation folds, and evaluation on a held-out set.

Every tutorial uses tidymodels as the default framework, with caret shown for legacy projects and mlr3 referenced for advanced needs. The emphasis throughout is on reproducible pipelines, leakage-free preprocessing, and calibrated performance estimates.
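
As a minimal sketch of that workflow, the tidymodels pipeline below splits the data, tunes a random forest on cross-validation folds, and evaluates once on the held-out test set. The dataset (mtcars), engine (ranger), and grid size are illustrative assumptions, not recommendations from any particular tutorial.

```r
library(tidymodels)

set.seed(42)
split <- initial_split(mtcars, prop = 0.8)   # hold out a test set up front
train <- training(split)
folds <- vfold_cv(train, v = 5)              # 5-fold CV on the training data only

# Preprocessing lives in a recipe, so scaling is re-estimated inside each fold
# rather than leaking information from the held-out data
rec <- recipe(mpg ~ ., data = train) |>
  step_normalize(all_numeric_predictors())

rf_spec <- rand_forest(mtry = tune(), trees = 500) |>
  set_engine("ranger") |>                    # requires the ranger package
  set_mode("regression")

wf <- workflow() |>
  add_recipe(rec) |>
  add_model(rf_spec)

tuned <- tune_grid(wf, resamples = folds, grid = 10)  # tune on validation folds
best  <- select_best(tuned, metric = "rmse")

# One honest fit on the full training set, evaluated once on the test set
final <- finalize_workflow(wf, best) |> last_fit(split)
collect_metrics(final)
```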

Topics covered

  • The supervised learning framework: features, targets, loss functions, generalisation
  • Resampling: holdout, k-fold CV, repeated CV, stratified CV, nested CV for tuning (sketched in code after this list)
  • Tree methods: CART, conditional inference trees, pruning strategies
  • Ensembles: bagging, random forests, boosted trees (XGBoost, LightGBM)
  • Support vector machines: linear, polynomial, RBF kernels; soft-margin formulation
  • k-nearest neighbours and the curse of dimensionality
  • Neural networks with torch for R; feedforward, convolutional, recurrent architectures
  • Regularisation for generalisation: ridge, lasso, dropout, early stopping
  • Feature engineering with recipes: imputation, encoding, scaling, interactions, spline bases
  • Feature selection: filter, wrapper, and embedded approaches
  • Imbalanced classification: SMOTE, class weights, threshold tuning, calibration
  • Interpretability: variable importance, partial dependence, SHAP values, LIME
  • Unsupervised methods: clustering, anomaly detection, self-organising maps
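
All of the resampling schemes above can be declared with rsample before any model is fit. A hedged sketch, using the two_class_dat example data from the modeldata package purely for illustration:

```r
library(rsample)
data(two_class_dat, package = "modeldata")  # any data frame would work here

set.seed(1)
vfold_cv(two_class_dat, v = 10)                   # plain 10-fold CV
vfold_cv(two_class_dat, v = 10, repeats = 5)      # repeated CV
vfold_cv(two_class_dat, v = 10, strata = Class)   # stratified on the outcome
nested_cv(two_class_dat,                          # nested CV: inner folds tune,
          outside = vfold_cv(v = 5),              # outer folds assess
          inside  = vfold_cv(v = 3))
```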

The goal throughout is a pipeline a reader could defend in a peer review: properly validated, honestly tuned, and clearly interpreted.

Tutorials

TUTORIAL

A Complete mlr3 Workflow

End-to-end modelling with mlr3: tasks, learners, resamplings, and tuning

TUTORIAL

A Complete tidymodels Workflow

End-to-end modelling in tidymodels: recipes, parsnip, workflows, and tuning

TUTORIAL

Anomaly Detection

Unsupervised methods for flagging rare, unusual observations

TUTORIAL

Bagging

Bootstrap aggregation: variance reduction through averaging

TUTORIAL

Calibration Plots

Reliability diagrams: do predicted probabilities match observed frequencies?

TUTORIAL

CatBoost

Gradient boosting with native categorical handling and ordered target encoding

TUTORIAL

Class Weights for Imbalanced Data

Loss-based weighting as an alternative to resampling

TUTORIAL

Convolutional Neural Networks: Introduction

Spatial filters and pooling for image and signal learning

TUTORIAL

Cross-Validation

Estimating out-of-sample predictive performance honestly with k-fold, repeated, nested, and leave-one-out cross-validation

TUTORIAL

DBSCAN as an ML Tool

Density-based clustering and outlier detection in one step

TUTORIAL

Decision Trees (CART)

Recursive binary partitioning with Gini or variance splits

TUTORIAL

Dropout and Early Stopping

Two of the most effective neural-network regularisation techniques

TUTORIAL

Extremely Randomised Trees

Extra Trees: random split thresholds for faster, more heavily regularised ensembles

TUTORIAL

Feature Engineering

Transformations, interactions, and encodings to expose signal to models

TUTORIAL

Feature Selection

Filter, wrapper, and embedded strategies to reduce the feature space

TUTORIAL

Feedforward Networks in torch

Building and training deep feedforward networks in R with the torch package

TUTORIAL

Gradient Boosting

Stagewise additive modelling with gradient descent on loss

TUTORIAL

Isolation Forest

Tree-based unsupervised anomaly detection via path length

TUTORIAL

Isotonic Regression Calibration

Non-parametric monotone calibration via the pool-adjacent-violators algorithm

TUTORIAL

Kernel SVMs

Non-linear classification via the kernel trick with RBF and polynomial kernels

TUTORIAL

LIME Explanations

Local Interpretable Model-Agnostic Explanations via surrogate linear models

TUTORIAL

LightGBM

Histogram-based leaf-wise gradient boosting for large datasets

TUTORIAL

Linear Discriminant as ML

LDA as a classifier with shared Gaussian class-conditional covariance

TUTORIAL

Linear Support Vector Machines

Large-margin classifiers with soft-margin slack and the C parameter

TUTORIAL

Logistic Regression as ML

Linear classifier with log-loss and L1/L2 regularisation

TUTORIAL

Naive Bayes

Probabilistic classification under conditional independence

TUTORIAL

Nested Cross-Validation

Honest model evaluation with separate loops for tuning and assessment

TUTORIAL

Neural Networks: Introduction

From perceptrons to multi-layer networks: weights, layers, and activation functions

TUTORIAL

Partial Dependence Plots

Marginal effect of a feature on model predictions

TUTORIAL

Platt Scaling

Logistic recalibration: mapping raw scores to calibrated probabilities

TUTORIAL

Preprocessing with recipes

Leakage-free feature engineering in tidymodels with recipes

TUTORIAL

RNNs and LSTMs: Introduction

Recurrent networks for sequence modelling with gated memory

TUTORIAL

Random Forests

Bagging decision trees with random feature subsets for robust ensembles

TUTORIAL

Regularisation in ML

L2 ridge, L1 lasso, and elastic net for controlling model complexity

TUTORIAL

SHAP Values

Game-theoretic local feature attributions via Shapley values

TUTORIAL

SMOTE for Imbalanced Classes

Synthetic Minority Oversampling Technique for class-imbalanced classification

TUTORIAL

Stacking Ensembles

Combining heterogeneous base models via a meta-learner

TUTORIAL

Supervised Learning: Overview

Framework, loss functions, generalisation, and the bias-variance decomposition

TUTORIAL

The Bias-Variance Tradeoff

Decomposition of prediction error into structural and sampling components

TUTORIAL

Train-Test Splits

Holdout evaluation, stratification, and honest generalisation estimates

TUTORIAL

Transformers: Overview

Self-attention architectures powering modern NLP and beyond

TUTORIAL

Variable Importance

Impurity-based and permutation importance for tree ensembles and beyond

TUTORIAL

XGBoost

Regularised gradient boosting with sparsity awareness and early stopping

TUTORIAL

k-Means as an ML Tool

Using k-means for feature engineering, quantisation, and pre-segmentation

TUTORIAL

k-Nearest Neighbours Classification

Instance-based learning with majority voting among nearest neighbours

TUTORIAL

k-Nearest Neighbours Regression

Local averaging for non-parametric regression