Machine Learning
Machine learning in applied statistics is ultimately about prediction: building a model whose accuracy on unseen data is the criterion of success. This area covers the major supervised and unsupervised methods a biostatistician is likely to use, framed around the workflow that separates honest ML from overfit artefacts: train/test splits, cross-validation, tuning on validation folds, and evaluation on a held-out set.
tidymodels is the default framework throughout, with caret shown for legacy projects and mlr3 referenced for advanced needs. Emphasis is on reproducible pipelines, leakage-free preprocessing, and calibrated performance estimates.
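To make that workflow concrete, here is a minimal tidymodels sketch: split, preprocess, tune on cross-validation folds, and touch the test set exactly once at the end. The data frame `df` with a two-level factor outcome `y` is a hypothetical stand-in, and the lasso-penalised logistic regression is just one choice of model.

```r
# Minimal tidymodels workflow: split, preprocess, tune on folds, evaluate once.
library(tidymodels)

set.seed(42)
split <- initial_split(df, prop = 0.8, strata = y)   # stratified 80/20 holdout
train <- training(split)
folds <- vfold_cv(train, v = 10, strata = y)         # 10-fold CV on training data only

rec <- recipe(y ~ ., data = train) |>
  step_impute_median(all_numeric_predictors()) |>
  step_dummy(all_nominal_predictors()) |>
  step_normalize(all_numeric_predictors())

mod <- logistic_reg(penalty = tune(), mixture = 1) |>  # lasso-penalised logistic
  set_engine("glmnet")

wf <- workflow() |> add_recipe(rec) |> add_model(mod)

tuned <- tune_grid(wf, resamples = folds, grid = 20,
                   metrics = metric_set(roc_auc))
final <- finalize_workflow(wf, select_best(tuned, metric = "roc_auc"))

# The held-out test set is used exactly once, at the very end.
last_fit(final, split) |> collect_metrics()
```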
Topics covered
- The supervised learning framework: features, targets, loss functions, generalisation
- Resampling: holdout, k-fold CV, repeated CV, stratified CV, nested CV for tuning
- Tree methods: CART, conditional inference trees, pruning strategies
- Ensembles: bagging, random forests, boosted trees (XGBoost, LightGBM)
- Support vector machines: linear, polynomial, RBF kernels; soft-margin formulation
- k-nearest neighbours and the curse of dimensionality
- Neural networks with torch for R; feedforward, convolutional, recurrent architectures
- Regularisation for generalisation: ridge, lasso, dropout, early stopping
- Feature engineering with recipes: imputation, encoding, scaling, interactions, spline bases (see the sketch after this list)
- Feature selection: filter, wrapper, and embedded approaches
- Imbalanced classification: SMOTE, class weights, threshold tuning, calibration
- Interpretability: variable importance, partial dependence, SHAP values, LIME
- Unsupervised methods: clustering, anomaly detection, self-organising maps
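The recipes sketch promised above: every preprocessing parameter (imputation medians, dummy levels, normalisation means and SDs) is estimated on the training data alone and then reapplied unchanged to new data, which is what keeps the pipeline leak-free. `train` and `test` are hypothetical splits with outcome `y`.

```r
# Leak-free preprocessing: estimate parameters on train, apply to test.
library(recipes)

rec <- recipe(y ~ ., data = train) |>
  step_impute_median(all_numeric_predictors()) |>   # medians from train only
  step_dummy(all_nominal_predictors()) |>
  step_normalize(all_numeric_predictors())          # means/SDs from train only

prepped    <- prep(rec, training = train)
train_proc <- bake(prepped, new_data = NULL)   # the processed training data
test_proc  <- bake(prepped, new_data = test)   # same parameters, no re-estimation
```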
The goal throughout is a pipeline a reader could defend in a peer review: properly validated, honestly tuned, and clearly interpreted.
Tutorials
TUTORIAL
A Complete mlr3 Workflow
End-to-end modelling with mlr3: tasks, learners, resamplings, and tuning
TUTORIAL
A Complete tidymodels Workflow
End-to-end modelling in tidymodels: recipes, parsnip, workflows, and tuning
TUTORIAL
Anomaly Detection
Unsupervised methods for flagging rare, unusual observations
TUTORIAL
Bagging
Bootstrap aggregation: variance reduction through averaging
TUTORIAL
Calibration Plots
Reliability diagrams: do predicted probabilities match observed frequencies?
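A sketch of what such a plot computes, assuming hypothetical vectors `probs` (predicted probabilities) and `y01` (0/1 outcomes):

```r
# Reliability diagram by hand: bin predicted probabilities and compare the
# mean prediction with the observed event rate in each bin.
library(dplyr)
library(ggplot2)

tibble(prob = probs, y = y01) |>
  mutate(bin = cut(prob, breaks = seq(0, 1, by = 0.1), include.lowest = TRUE)) |>
  group_by(bin) |>
  summarise(predicted = mean(prob), observed = mean(y), n = n()) |>
  ggplot(aes(predicted, observed)) +
  geom_abline(linetype = "dashed") +   # the perfect-calibration line
  geom_point(aes(size = n)) +
  geom_line() +
  coord_equal(xlim = c(0, 1), ylim = c(0, 1))
```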
TUTORIAL
CatBoost
Gradient boosting with native categorical handling and ordered target encoding
TUTORIAL
Class Weights for Imbalanced Data
Loss-based weighting as an alternative to resampling
TUTORIAL
Convolutional Neural Networks: Introduction
Spatial filters and pooling for image and signal learning
TUTORIAL
Cross-Validation
Estimating out-of-sample predictive performance honestly with k-fold, repeated, nested, and leave-one-out cross-validation
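A minimal sketch of the estimation step, reusing the recipe `rec` from the preprocessing sketch above with a plain (untuned) logistic regression on hypothetical training data `train`:

```r
# 10-fold CV repeated 5 times: mean and standard error over 50 resamples.
library(tidymodels)

set.seed(1)
folds <- vfold_cv(train, v = 10, repeats = 5, strata = y)

wf_cv <- workflow(rec, logistic_reg())   # engine "glm", nothing to tune

fit_resamples(wf_cv, resamples = folds,
              metrics = metric_set(roc_auc, accuracy)) |>
  collect_metrics()
```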
TUTORIAL
DBSCAN as an ML Tool
Density-based clustering and outlier detection in one step
TUTORIAL
Decision Trees (CART)
Recursive binary partitioning with Gini or variance splits
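A brief rpart sketch of the grow-then-prune strategy, with the cost-complexity parameter picked from rpart's internal cross-validation table; `train` and `test` are hypothetical:

```r
# Grow a deep classification tree, then prune back to the best subtree.
library(rpart)

tree <- rpart(y ~ ., data = train, method = "class",
              control = rpart.control(cp = 0.001))       # grow generously
best_cp <- tree$cptable[which.min(tree$cptable[, "xerror"]), "CP"]
pruned  <- prune(tree, cp = best_cp)                     # prune by CV error
predict(pruned, newdata = test, type = "class")
```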
TUTORIAL
Dropout and Early Stopping
Two of the most effective neural-network regularisation techniques
TUTORIAL
Extremely Randomised Trees
Extra Trees: random thresholds for faster and more regularised ensembles
TUTORIAL
Feature Engineering
Transformations, interactions, and encodings to expose signal to models
TUTORIAL
Feature Selection
Filter, wrapper, and embedded strategies to reduce the feature space
TUTORIAL
Feedforward Networks in torch
Building and training deep feedforward networks in R with the torch package
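A minimal torch sketch, assuming a numeric predictor matrix `x` and response vector `y`; the training loop is the standard zero-grad / backward / step cycle:

```r
# A small feedforward regression network: two ReLU hidden layers, Adam, MSE.
library(torch)

x_t <- torch_tensor(as.matrix(x), dtype = torch_float())
y_t <- torch_tensor(matrix(y, ncol = 1), dtype = torch_float())

net <- nn_sequential(
  nn_linear(ncol(x), 32), nn_relu(),
  nn_linear(32, 16),      nn_relu(),
  nn_linear(16, 1)
)
opt <- optim_adam(net$parameters, lr = 0.01)

for (epoch in 1:200) {
  opt$zero_grad()
  loss <- nnf_mse_loss(net(x_t), y_t)   # forward pass + loss
  loss$backward()                       # backpropagate gradients
  opt$step()                            # update the weights
}
```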
TUTORIAL
Gradient Boosting
Stagewise additive modelling with gradient descent on loss
TUTORIAL
Isolation Forest
Tree-based unsupervised anomaly detection via path length
TUTORIAL
Isotonic Regression Calibration
Non-parametric monotone calibration via pool-adjacent-violators
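A base-R sketch using `isoreg`, whose fitting algorithm is pool-adjacent-violators; `scores` and `y01` are hypothetical held-out calibration vectors and `new_scores` the scores to recalibrate:

```r
# Isotonic recalibration: monotone stepwise fit of outcomes on raw scores.
ord <- order(scores)
iso <- isoreg(scores[ord], y01[ord])      # pool-adjacent-violators fit
calibrate <- approxfun(scores[ord], iso$yf, rule = 2, ties = mean)
calibrate(new_scores)                     # calibrated probabilities
```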
TUTORIAL
Kernel SVMs
Non-linear classification via the kernel trick with RBF and polynomial kernels
TUTORIAL
LIME Explanations
Local Interpretable Model-Agnostic Explanations via surrogate linear models
TUTORIAL
LightGBM
Histogram-based leaf-wise gradient boosting for large datasets
TUTORIAL
Linear Discriminant as ML
LDA as a classifier with shared Gaussian class-conditional covariance
TUTORIAL
Linear Support Vector Machines
Large-margin classifiers with soft-margin slack and the C parameter
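A sketch with e1071, where `cost` is the C parameter trading margin width against slack; the grid over C uses e1071's built-in cross-validated tuner. `train` and `test` with factor outcome `y` are hypothetical:

```r
# Linear soft-margin SVM: tune the C (cost) parameter by cross-validation.
library(e1071)

tuned <- tune.svm(y ~ ., data = train, kernel = "linear",
                  cost = 10^seq(-2, 2))   # grid over C
svm_lin <- tuned$best.model
predict(svm_lin, newdata = test)
```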
TUTORIAL
Logistic Regression as ML
Linear classifier with log-loss and L1/L2 regularisation
TUTORIAL
Naive Bayes
Probabilistic classification under conditional independence
TUTORIAL
Nested Cross-Validation
Honest model evaluation with separate loops for tuning and assessment
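A skeleton of the two loops with rsample's `nested_cv`, reusing the tunable workflow `wf` from the opening sketch: the inner folds choose the penalty, the outer fold only scores the chosen model. The outcome `y` and its positive-class probability column `.pred_yes` are hypothetical names:

```r
# Nested CV: tune on inner folds, assess on the outer fold, never mix the two.
library(tidymodels)

set.seed(7)
nested <- nested_cv(train, outside = vfold_cv(v = 5), inside = vfold_cv(v = 3))

outer_scores <- purrr::map2_dbl(
  nested$splits, nested$inner_resamples,
  function(outer_split, inner_folds) {
    tuned <- tune_grid(wf, resamples = inner_folds, grid = 10,
                       metrics = metric_set(roc_auc))
    best  <- finalize_workflow(wf, select_best(tuned, metric = "roc_auc"))
    fitted <- fit(best, data = analysis(outer_split))
    preds  <- predict(fitted, assessment(outer_split), type = "prob") |>
      dplyr::bind_cols(assessment(outer_split))
    roc_auc(preds, truth = y, .pred_yes)$.estimate
  }
)
mean(outer_scores)   # the honest performance estimate
```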
TUTORIAL
Neural Networks: Introduction
From perceptrons to multi-layer networks: weights, layers, and activation functions
TUTORIAL
Partial Dependence Plots
Marginal effect of a feature on model predictions
TUTORIAL
Platt Scaling
Logistic recalibration: mapping raw scores to calibrated probabilities
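A base-R sketch: the recalibration model is just a logistic regression of held-out outcomes on raw scores. `scores`, `y01`, and `new_scores` are hypothetical:

```r
# Platt scaling: logistic regression of 0/1 outcomes on raw classifier scores.
cal_fit <- glm(y01 ~ scores, family = binomial)
predict(cal_fit, newdata = data.frame(scores = new_scores), type = "response")
```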
TUTORIAL
Preprocessing with recipes
Leak-free feature engineering in tidymodels with recipes
TUTORIAL
RNNs and LSTMs: Introduction
Recurrent networks for sequence modelling with gated memory
TUTORIAL
Random Forests
Bagging decision trees with random feature subsets for robust ensembles
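A sketch with ranger, assuming hypothetical `train`/`test` with factor outcome `y`; the out-of-bag error estimate comes free with the forest:

```r
# Random forest: 500 trees, permutation importance, probability predictions.
library(ranger)

rf <- ranger(y ~ ., data = train, num.trees = 500,
             importance = "permutation", probability = TRUE)
rf$prediction.error                              # out-of-bag error estimate
sort(rf$variable.importance, decreasing = TRUE)  # permutation importances
predict(rf, data = test)$predictions             # class probabilities
```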
TUTORIAL
Regularisation in ML
L2 ridge, L1 lasso, and elastic net for controlling model complexity
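A glmnet sketch: `alpha` moves between ridge (0) and lasso (1), with elastic net in between, and `cv.glmnet` picks lambda by cross-validation. `x` (numeric predictor matrix), `y`, and `x_new` are hypothetical:

```r
# Regularised regression paths with cross-validated lambda selection.
library(glmnet)

cv_lasso <- cv.glmnet(x, y, alpha = 1)   # lasso: some coefficients go to zero
coef(cv_lasso, s = "lambda.1se")         # sparse coefficients, conservative lambda
cv_ridge <- cv.glmnet(x, y, alpha = 0)   # ridge: shrinks but never zeroes
predict(cv_lasso, newx = x_new, s = "lambda.min")
```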
TUTORIAL
SHAP Values
Game-theoretic local feature attributions via Shapley values
TUTORIAL
SMOTE for Imbalanced Classes
Synthetic Minority Oversampling Technique for class-imbalanced classification
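A sketch with the themis package, which supplies SMOTE as a recipe step so synthetic minority cases are generated within each resample fit rather than leaking across the split; `train` with factor outcome `y` is hypothetical:

```r
# SMOTE as a recipe step: applied at fit time, skipped on assessment data.
library(tidymodels)
library(themis)

rec_smote <- recipe(y ~ ., data = train) |>
  step_dummy(all_nominal_predictors()) |>   # SMOTE needs numeric predictors
  step_smote(y, over_ratio = 1)             # oversample minority to parity
```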
TUTORIAL
Stacking Ensembles
Combining heterogeneous base models via a meta-learner
TUTORIAL
Supervised Learning: Overview
Framework, loss functions, generalisation, and the bias-variance decomposition
TUTORIAL
The Bias-Variance Tradeoff
Decomposition of prediction error into structural and sampling components
TUTORIAL
Train-Test Splits
Holdout evaluation, stratification, and honest generalisation estimates
TUTORIAL
Transformers: Overview
Self-attention architectures underlying modern NLP and beyond
TUTORIAL
Variable Importance
Impurity-based and permutation importance for tree ensembles and beyond
TUTORIAL
XGBoost
Regularised gradient boosting with sparsity awareness and early stopping
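A parsnip sketch with the xgboost engine, assuming hypothetical training data `train` with factor outcome `y`; `stop_iter` with an internal validation fraction gives early stopping:

```r
# XGBoost via parsnip: cap boosting rounds when validation metric stalls.
library(tidymodels)

xgb <- boost_tree(trees = 1000, tree_depth = 4, learn_rate = 0.05,
                  stop_iter = 20) |>           # stop after 20 rounds w/o gain
  set_engine("xgboost", validation = 0.2) |>   # inner fraction for stopping
  set_mode("classification")

xgb_fit <- workflow() |>
  add_formula(y ~ .) |>
  add_model(xgb) |>
  fit(data = train)
```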
TUTORIAL
k-Means as an ML Tool
Using k-means for feature engineering, quantisation, and pre-segmentation
TUTORIAL
k-Nearest Neighbours Classification
Instance-based learning with majority voting among nearest neighbours
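A sketch with class::knn; because the method is distance-based, predictors are scaled using training-set parameters only. `train_x`, `test_x` (numeric matrices) and `train_y` (a factor) are hypothetical:

```r
# kNN classification with leakage-free scaling of the predictors.
library(class)

train_s <- scale(train_x)
test_s  <- scale(test_x,
                 center = attr(train_s, "scaled:center"),   # reuse training
                 scale  = attr(train_s, "scaled:scale"))    # parameters
knn(train_s, test_s, cl = train_y, k = 15)   # majority vote among 15 neighbours
```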
TUTORIAL
k-Nearest Neighbours Regression
Local averaging for non-parametric regression