Counting Reads with featureCounts
Bioinformatics
featurecounts
htseq
gene-counts
Assigning reads to genes or genomic features from a BAM file
Introduction
After alignment, reads must be summarised into per-gene (or per-feature) counts for differential expression analysis. featureCounts (in Rsubread) and HTSeq are the standard tools.
Prerequisites
BAM files, genome annotation (GTF).
Theory
Each read is assigned to a feature (gene, exon) it overlaps. Key settings:
- Strand-specificity:
strandSpecific = 0 | 1 | 2(unstranded / stranded / reverse-stranded). - Multi-mappers: usually counted once or distributed.
- Primary-only: recommended.
- Fractional vs. integer counting.
Assumptions
Matching annotation to genome reference.
R Implementation
library(Rsubread)
# Build counts from BAM + GTF
# counts <- featureCounts(files = c("s1.bam", "s2.bam"),
# annot.ext = "annotation.gtf",
# isGTFAnnotationFile = TRUE,
# GTF.featureType = "exon",
# GTF.attrType = "gene_id",
# strandSpecific = 2,
# isPairedEnd = TRUE)
# counts$counts: gene x sample matrixOutput & Results
Gene-by-sample counts matrix + feature lengths + summary of assigned vs ambiguous reads.
Interpretation
“featureCounts assigned 82 % of reads to genes; 8 % were ambiguous (multi-gene overlap) and 10 % were unmapped to features.”
Practical Tips
- Report assignment statistics; low assignment rates flag alignment or annotation issues.
- Strand-specificity must match library prep protocol; wrong setting halves counts.
- Use
primaryOnly = TRUEfor ChIP / ATAC-seq. - For transcript-level counts, use pseudoalignment tools.
SummarizedExperimentobjects bundle counts + metadata cleanly.