35 Reproducible Pipelines
35.1 Learning objectives
After completing this chapter, you will be able to:
- Explain why ad hoc scripts are insufficient for reproducible bibliometric analysis
- Define a targets pipeline with dependency tracking and selective re-execution
- Use renv to snapshot and restore R package environments
- Understand how Docker containers ensure cross-platform reproducibility
- Structure a bibliometric project for long-term reproducibility
35.3 Conceptual background
A bibliometric analysis is a multi-step process: fetch data from an API, clean and deduplicate records, compute indicators, build networks, fit models, and generate figures. Each step depends on earlier ones. When the upstream data changes (a new API query, an updated corpus), downstream results must be regenerated — but only the affected parts.
Ad hoc scripts (running scripts manually in sequence) fail at scale: they do not track dependencies, they re-run everything even when only one input changed, and they break when run on a different machine with different package versions.
targets is an R pipeline tool that solves the dependency problem. You define each analysis step as a “target”: a named R expression, usually a function call, that takes upstream targets as inputs and returns a value. targets builds a dependency graph from these definitions, caches each result, and on the next run executes only the targets whose code or upstream dependencies have changed. This is both faster (no unnecessary re-computation) and safer (no stale intermediate results).
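The skip-if-unchanged behaviour can be seen in a toy pipeline. The sketch below is illustrative only: the directory name, the target names (raw, by_year, n_years), and the numbers are made up, and it runs in a temporary directory rather than a real project. After editing only the summary step, tar_outdated() should flag that step and its downstream target, but not the raw data.

```r
library(targets)

# Work in a throwaway directory so the demo cannot touch a real project.
demo_dir <- file.path(tempdir(), "targets-demo")
unlink(demo_dir, recursive = TRUE)
dir.create(demo_dir, showWarnings = FALSE)
old_wd <- setwd(demo_dir)

# Toy three-step pipeline written to _targets.R (made-up numbers, no API call).
tar_script({
  list(
    tar_target(raw, data.frame(year = c(2020, 2020, 2020, 2021),
                               cites = c(1, 2, 9, 10))),
    tar_target(by_year, aggregate(cites ~ year, data = raw, FUN = mean)),
    tar_target(n_years, nrow(by_year))
  )
}, ask = FALSE)

tar_make()                      # first run builds all three targets
first_check <- tar_outdated()   # nothing should be outdated now

# Edit only the summary step (mean -> median) and rewrite _targets.R.
tar_script({
  list(
    tar_target(raw, data.frame(year = c(2020, 2020, 2020, 2021),
                               cites = c(1, 2, 9, 10))),
    tar_target(by_year, aggregate(cites ~ year, data = raw, FUN = median)),
    tar_target(n_years, nrow(by_year))
  )
}, ask = FALSE)

# raw is unchanged; by_year and its downstream target n_years are expected
# to be reported as outdated.
second_check <- tar_outdated()
print(second_check)

setwd(old_wd)
```

A subsequent tar_make() would rebuild only the flagged targets and reuse the cached value of raw.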
renv records the exact package versions used in your analysis in a renv.lock file. A collaborator (or your future self) can reinstall those versions into a project-specific library with renv::restore(). This removes most of the “works on my machine” problem for R packages, although renv does not itself install the R version or system libraries.
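A minimal sketch of the usual renv cycle follows. These calls are meant to be run interactively from the project root, one stage at a time, not sourced as a script; each acts on the current project.

```r
# 1. One-time setup: create a project-specific library and an initial renv.lock.
renv::init()

# 2. After installing or upgrading packages, record the versions now in use.
renv::snapshot()

# 3. On another machine (or much later), reinstall the recorded versions.
renv::restore()

# 4. At any time, check whether the library and the lockfile have drifted apart.
renv::status()
```

Stage 2 is the one to repeat whenever the set of packages changes; the updated renv.lock is then committed alongside the code.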
Docker takes reproducibility further by containerising the entire compute environment: operating system, R version, system libraries, and package versions. A Dockerfile defines the environment declaratively; anyone with Docker can rebuild it, and pinning the base image to a specific version tag keeps rebuilds consistent. Docker matters most for long-term reproducibility, since R versions and system libraries evolve.
These tools complement each other: targets manages the computation, renv manages the packages, and Docker manages the environment. Together, they make it far more likely that a bibliometric analysis can be re-run, with the same results, years after it was originally run.
35.4 Worked example
35.4.1 Defining a targets pipeline
The _targets.R file defines the pipeline. Here we show the structure for a typical bibliometric analysis.
# _targets.R: pipeline definition
library(targets)
# Packages attached before each target runs
tar_option_set(packages = c("tidyverse", "openalexR", "igraph"))
list(
  # Fetch a seeded random sample of works from one source
  tar_target(raw_works, {
    oa_fetch(
      entity = "works",
      primary_location.source.id = "S148561398",
      from_publication_date = "2020-01-01",
      to_publication_date = "2023-12-31",
      options = list(sample = 300, seed = 42)
    )
  }),
  # Drop records without a DOI, deduplicate, and derive the publication year
  tar_target(clean_works, {
    raw_works |>
      filter(!is.na(doi)) |>
      distinct(doi, .keep_all = TRUE) |>
      mutate(year = year(publication_date))
  }),
  # Number of works and mean citations per year
  tar_target(citation_stats, {
    clean_works |>
      group_by(year) |>
      summarise(
        n = n(),
        mean_cites = mean(cited_by_count),
        .groups = "drop"
      )
  }),
  # Trend plot of mean citations by year
  tar_target(fig_trends, {
    ggplot(citation_stats, aes(x = year, y = mean_cites)) +
      geom_line() + geom_point() +
      labs(x = "Year", y = "Mean citations")
  })
)
35.4.2 Running and inspecting the pipeline
# Run the pipeline
targets::tar_make()
# Visualise the dependency graph
targets::tar_visnetwork()
# Read a specific target
targets::tar_read(citation_stats)
35.4.3 Project structure
# Recommended project layout:
#
# my-bibliometric-project/
# ├── _targets.R # Pipeline definition
# ├── R/ # Helper functions
# │ ├── fetch.R
# │ ├── clean.R
# │ └── analyse.R
# ├── renv.lock # Package snapshot
# ├── Dockerfile # Environment definition
# ├── data/ # Cached data (Parquet)
# ├── output/ # Figures and tables
# └── report.qmd # Quarto report consuming targets
35.5 Diagnostics and interpretation
- Target status: tar_outdated() lists which targets need re-running. If everything is up to date, tar_make() skips every target and finishes almost immediately.
- Pipeline visualisation: tar_visnetwork() shows the dependency graph. Disconnected subgraphs indicate independent analysis streams.
- Cache management: targets stores cached results in _targets/. This directory can grow large. Use tar_prune() to remove cached targets that are no longer part of the pipeline.
- renv consistency: Run renv::status() to check whether your lockfile matches the installed packages.
35.7 Limitations and responsible use
- API rate limits: targets will re-run API calls when inputs change. Use file-based caching (Parquet) to avoid hitting API rate limits on every pipeline run.
- Docker learning curve: Docker adds complexity. For solo projects, renv alone may suffice. Docker becomes essential for team reproducibility and long-term archiving.
- Overhead for small projects: A 50-line analysis script does not need targets. Use pipelines when the analysis has 5+ interdependent steps or will be re-run multiple times.
- Reproducibility is not eternal: Even with Docker, operating system patches and hardware changes can affect results. Archive final outputs alongside the pipeline.
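The file-based caching mentioned in the first point can be sketched as follows. The helper name cached_fetch, the toy fetch function, and the data are hypothetical stand-ins for a real API call; the sketch assumes the arrow package is installed and only shows that the fetch runs once while later calls reuse the Parquet file.

```r
library(arrow)

# Hypothetical helper: call `fetch_fn` only if `path` does not exist yet,
# then return the path so that a file-tracking target can depend on it.
cached_fetch <- function(path, fetch_fn) {
  if (!file.exists(path)) {
    dir.create(dirname(path), recursive = TRUE, showWarnings = FALSE)
    write_parquet(fetch_fn(), path)
  }
  path
}

# Toy stand-in for an API call; counts how often it is invoked.
n_calls <- 0
toy_fetch <- function() {
  n_calls <<- n_calls + 1
  data.frame(doi = c("10.1/a", "10.1/b"), cited_by_count = c(4L, 9L))
}

cache_path <- file.path(tempdir(), "data", "raw_works.parquet")
unlink(cache_path)

cached_fetch(cache_path, toy_fetch)   # no file yet: fetches and writes it
cached_fetch(cache_path, toy_fetch)   # file exists: no second fetch
raw_works <- read_parquet(cache_path)
print(n_calls)
```

In a pipeline, a helper like this would typically sit in a target declared with format = "file", with a second target reading the Parquet file. Results from oa_fetch() can contain nested list-columns, which may need flattening before they are written to Parquet.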
35.9 Common pitfalls
- Not separating data retrieval from analysis. API calls in analysis targets re-fetch data on every change. Separate fetching into its own target with file-based caching.
- Forgetting to commit renv.lock. The lockfile is the reproducibility guarantee. Always version-control it.
- Committing the _targets/ directory to git. Cached results should not be committed (they can be regenerated). Add _targets/ to .gitignore.
- Writing monolithic targets. Each target should do one thing. Fine-grained targets enable selective re-execution.
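One way to keep targets fine-grained is to give each step its own small function, stored in R/, and let each target call exactly one of them. The sketch below is an assumption about how that could look: the function names and the toy data are illustrative, and the functions are exercised directly here rather than through targets.

```r
library(dplyr)

# One function per step; in a project these would live in R/clean.R and
# R/analyse.R (hypothetical file contents).
clean_works <- function(raw) {
  raw |>
    filter(!is.na(doi)) |>
    distinct(doi, .keep_all = TRUE)
}

summarise_citations <- function(works) {
  works |>
    group_by(year) |>
    summarise(n = n(), mean_cites = mean(cited_by_count), .groups = "drop")
}

# Toy input: one duplicated DOI and one missing DOI.
toy <- data.frame(
  doi = c("a", "a", NA, "b"),
  year = c(2020, 2020, 2021, 2021),
  cited_by_count = c(2, 2, 7, 6)
)

stats <- toy |> clean_works() |> summarise_citations()
print(stats)
```

In _targets.R, each function would then back a single target, for example tar_target(clean, clean_works(raw_works)), so that editing summarise_citations() invalidates only the summary target and whatever depends on it.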
35.10 Exercises
- Build a pipeline. Create a _targets.R file for a simple bibliometric analysis: fetch → clean → summarise → plot. Run it and verify that changing the summary function only re-runs downstream targets.
- renv snapshot. Initialise renv in a project, install a new package, and snapshot. Delete the package and restore from the lockfile.
- Branching. Use tar_target() with dynamic branching to run the same analysis across multiple journals in parallel.
35.11 Solutions
Solutions are provided in 2.11.
35.13 Session info
#> R version 4.4.1 (2024-06-14)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.4 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
#>
#> locale:
#> [1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8
#> [4] LC_COLLATE=C.UTF-8 LC_MONETARY=C.UTF-8 LC_MESSAGES=C.UTF-8
#> [7] LC_PAPER=C.UTF-8 LC_NAME=C LC_ADDRESS=C
#> [10] LC_TELEPHONE=C LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: UTC
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] uwot_0.2.4 Matrix_1.7-0
#> [3] word2vec_0.4.1 stm_1.3.8
#> [5] topicmodels_0.2-17 quanteda.textstats_0.97.2
#> [7] visNetwork_2.1.4 ggraph_2.2.2
#> [9] tidygraph_1.3.1 igraph_2.3.1
#> [11] quanteda_4.4 pdftools_3.9.0
#> [13] arrow_24.0.0 bibliometrix_5.4.0
#> [15] RefManageR_1.4.0 bib2df_1.1.2.0
#> [17] rcrossref_1.2.1 gt_1.3.0
#> [19] tidytext_0.4.3 glue_1.8.1
#> [21] openalexR_3.0.1 lubridate_1.9.5
#> [23] forcats_1.0.1 stringr_1.6.0
#> [25] dplyr_1.2.1 purrr_1.2.2
#> [27] readr_2.2.0 tidyr_1.3.2
#> [29] tibble_3.3.1 ggplot2_4.0.3
#> [31] tidyverse_2.0.0
#>
#> loaded via a namespace (and not attached):
#> [1] bibtex_0.5.2 RColorBrewer_1.1-3 rstudioapi_0.18.0
#> [4] jsonlite_2.0.0 magrittr_2.0.5 modeltools_0.2-24
#> [7] farver_2.1.2 rmarkdown_2.31 fs_2.1.0
#> [10] vctrs_0.7.3 memoise_2.0.1 askpass_1.2.1
#> [13] base64enc_0.1-6 htmltools_0.5.9 contentanalysis_1.0.0
#> [16] curl_7.1.0 broom_1.0.12 janeaustenr_1.0.0
#> [19] cellranger_1.1.0 sass_0.4.10 bslib_0.11.0
#> [22] htmlwidgets_1.6.4 tokenizers_0.3.0 plyr_1.8.9
#> [25] httr2_1.2.2 plotly_4.12.0 cachem_1.1.0
#> [28] dimensionsR_0.0.3 mime_0.13 lifecycle_1.0.5
#> [31] pkgconfig_2.0.3 R6_2.6.1 fastmap_1.2.0
#> [34] shiny_1.13.0 digest_0.6.39 patchwork_1.3.2
#> [37] shinycssloaders_1.1.0 rprojroot_2.1.1 RSpectra_0.16-2
#> [40] SnowballC_0.7.1 labeling_0.4.3 urltools_1.7.3.1
#> [43] timechange_0.4.0 mgcv_1.9-1 polyclip_1.10-7
#> [46] httr_1.4.8 compiler_4.4.1 here_1.0.2
#> [49] bit64_4.8.0 withr_3.0.2 S7_0.2.2
#> [52] backports_1.5.1 viridis_0.6.5 ggforce_0.5.0
#> [55] MASS_7.3-60.2 rappdirs_0.3.4 bibliometrixData_0.3.0
#> [58] tools_4.4.1 otel_0.2.0 stopwords_2.3
#> [61] zip_2.3.3 httpuv_1.6.17 rentrez_1.2.4
#> [64] nlme_3.1-164 promises_1.5.0 grid_4.4.1
#> [67] stringdist_0.9.17 reshape2_1.4.5 generics_0.1.4
#> [70] gtable_0.3.6 tzdb_0.5.0 rscopus_0.9.0
#> [73] ca_0.71.1 data.table_1.18.4 hms_1.1.4
#> [76] xml2_1.5.2 utf8_1.2.6 ggrepel_0.9.8
#> [79] pillar_1.11.1 nsyllable_1.0.1 vroom_1.7.1
#> [82] later_1.4.8 splines_4.4.1 tweenr_2.0.3
#> [85] brand.yml_0.1.0 lattice_0.22-6 FNN_1.1.4.1
#> [88] bit_4.6.0 tidyselect_1.2.1 tm_0.7-18
#> [91] miniUI_0.1.2 downlit_0.4.5 knitr_1.51
#> [94] gridExtra_2.3 NLP_0.3-2 bookdown_0.46
#> [97] stats4_4.4.1 crul_1.6.0 xfun_0.57
#> [100] graphlayouts_1.2.3 matrixStats_1.5.0 DT_0.34.0
#> [103] humaniformat_0.6.0 stringi_1.8.7 lazyeval_0.2.3
#> [106] qpdf_1.4.1 yaml_2.3.12 evaluate_1.0.5
#> [109] codetools_0.2-20 httpcode_0.3.0 cli_3.6.6
#> [112] xtable_1.8-8 jquerylib_0.1.4 dichromat_2.0-0.1
#> [115] Rcpp_1.1.1-1.1 readxl_1.4.5 triebeard_0.4.1
#> [118] XML_3.99-0.23 parallel_4.4.1 assertthat_0.2.1
#> [121] pubmedR_1.0.2 slam_0.1-55 viridisLite_0.4.3
#> [124] scales_1.4.0 crayon_1.5.3 openxlsx_4.2.8.1
#> [127] rlang_1.2.0 fastmatch_1.1-8