7 Parsing Native Exports

7.1 Learning objectives

After completing this chapter, you will be able to:

  • Parse BibTeX files using bib2df and RefManageR
  • Import Web of Science plain-text and Scopus CSV exports via bibliometrix
  • Standardize field names across formats into a unified schema
  • Diagnose common parsing failures and missing fields

7.2 Setup
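
The setup code is not shown in this section. A minimal sketch, assuming the chapter relies on the packages listed under "other attached packages" in the session info (tidyverse, bib2df, RefManageR, bibliometrix), could look like the following. The worked example also calls theme_sci() and palette_sci(), which are not defined in this chapter and are presumably book-level helpers; they are not reproduced here.

```r
# Assumed package list, taken from the session info at the end of the chapter.
pkgs <- c("tidyverse", "bib2df", "RefManageR", "bibliometrix")

# Check which packages are installed before attaching any of them.
missing_pkgs <- pkgs[!vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)]

if (length(missing_pkgs) > 0) {
  message("Install before continuing: ", paste(missing_pkgs, collapse = ", "))
} else {
  invisible(lapply(pkgs, library, character.only = TRUE))
}
```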

7.3 Conceptual background

Not all bibliometric data comes from an API. Researchers frequently begin with files exported from database search interfaces: BibTeX (.bib) from Google Scholar or Zotero, RIS (.ris) from Scopus or PubMed, plain text from Web of Science, or CSV from Scopus. Each format encodes the same information — authors, title, year, journal, DOI — in different structures.
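
To make the point about differing structures concrete, the sketch below rewrites the smith2023 record from the worked example by hand as RIS and splits it into tags and values using base R only. The tag choices (JO for the journal, DO for the DOI) are one common convention and an assumption here; databases differ in which tags they emit.

```r
# One record in RIS: a two-character tag, two spaces, a hyphen, then the value.
ris_text <- c(
  "TY  - JOUR",
  "AU  - Smith, John A.",
  "AU  - Doe, Jane B.",
  "TI  - Advances in Bibliometric Methods",
  "JO  - Journal of Informetrics",
  "PY  - 2023",
  "DO  - 10.1234/example.2023",
  "ER  - "
)

# Split each tagged line into its tag and its value.
tag_pattern <- "^([A-Z][A-Z0-9])  - ?(.*)$"
is_tagged <- grepl(tag_pattern, ris_text)
tags <- sub(tag_pattern, "\\1", ris_text[is_tagged])
values <- sub(tag_pattern, "\\2", ris_text[is_tagged])

# Repeated tags (here AU) collapse into one character vector per tag.
record <- split(values, tags)
record$AU
record$PY
```

Where BibTeX packs both authors into one field joined by "and", RIS repeats the AU tag, so the same two names arrive in a different shape and need different handling.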

The bibliometrix package (Aria and Cuccurullo 2017) provides convert2df(), which reads several export formats and returns a data frame whose columns follow Web of Science field tags (for example AU for authors, TI for title, and PY for year). For BibTeX specifically, bib2df and RefManageR are complementary: bib2df returns a tibble with one column per BibTeX field, while RefManageR returns a BibEntry object that can be converted to a data frame. The key challenge is not parsing per se but standardization: ensuring that author names, journal titles, and identifiers are represented consistently regardless of the source format.

7.4 Worked example

7.4.1 Parsing a BibTeX string

We construct a small BibTeX string inline and write it to a temporary file, because bib2df() reads from a file path. This demonstrates parsing without bundling external files.

bib_text <- '
@article{smith2023,
  author = {Smith, John A. and Doe, Jane B.},
  title = {Advances in Bibliometric Methods},
  journal = {Journal of Informetrics},
  year = {2023},
  volume = {17},
  pages = {101--115},
  doi = {10.1234/example.2023}
}
@article{chen2022,
  author = {Chen, Wei and Li, Xiao},
  title = {Citation Networks in Computer Science},
  journal = {Scientometrics},
  year = {2022},
  volume = {127},
  pages = {3201--3220},
  doi = {10.1234/example.2022}
}
'

tmp_bib <- tempfile(fileext = ".bib")
writeLines(bib_text, tmp_bib)

bib_df <- bib2df(tmp_bib)
glimpse(bib_df)
#> Rows: 2
#> Columns: 27
#> $ CATEGORY     <chr> "ARTICLE", "ARTICLE"
#> $ BIBTEXKEY    <chr> "smith2023", "chen2022"
#> $ ADDRESS      <chr> NA, NA
#> $ ANNOTE       <chr> NA, NA
#> $ AUTHOR       <list> <"Smith, John A.", "Doe, Jane B.">, <"Chen, Wei", "Li, Xi…
#> $ BOOKTITLE    <chr> NA, NA
#> $ CHAPTER      <chr> NA, NA
#> $ CROSSREF     <chr> NA, NA
#> $ EDITION      <chr> NA, NA
#> $ EDITOR       <list> NA, NA
#> $ HOWPUBLISHED <chr> NA, NA
#> $ INSTITUTION  <chr> NA, NA
#> $ JOURNAL      <chr> "Journal of Informetrics", "Scientometrics"
#> $ KEY          <chr> NA, NA
#> $ MONTH        <chr> NA, NA
#> $ NOTE         <chr> NA, NA
#> $ NUMBER       <chr> NA, NA
#> $ ORGANIZATION <chr> NA, NA
#> $ PAGES        <chr> "101--115", "3201--3220"
#> $ PUBLISHER    <chr> NA, NA
#> $ SCHOOL       <chr> NA, NA
#> $ SERIES       <chr> NA, NA
#> $ TITLE        <chr> "Advances in Bibliometric Methods", "Citation Networks in…
#> $ TYPE         <chr> NA, NA
#> $ VOLUME       <chr> "17", "127"
#> $ YEAR         <dbl> 2023, 2022
#> $ DOI          <chr> "10.1234/example.2023", "10.1234/example.2022"

7.4.2 Parsing with RefManageR

bib_rm <- ReadBib(tmp_bib)
as.data.frame(bib_rm) |> select(title, author, year, journal, doi)
#>                                           title                        author
#> smith2023      Advances in Bibliometric Methods John A. Smith and Jane B. Doe
#> chen2022  Citation Networks in Computer Science          Wei Chen and Xiao Li
#>           year                 journal                  doi
#> smith2023 2023 Journal of Informetrics 10.1234/example.2023
#> chen2022  2022          Scientometrics 10.1234/example.2022

7.4.3 Parsing WoS plain text with bibliometrix

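# Requires a WoS export file, which is not bundled with the book
# wos_df <- convert2df("savedrecs.txt", dbsource = "wos", format = "plaintext")
# glimpse(wos_df)
# A Scopus CSV export is read the same way (the file name is a placeholder):
# scopus_df <- convert2df("scopus.csv", dbsource = "scopus", format = "csv")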

7.4.4 Standardizing field names

standardize_bib <- function(df) {
  df |>
    transmute(
      title = TITLE,
      authors = AUTHOR,
      year = as.integer(YEAR),
      journal = JOURNAL,
      doi = DOI
    )
}

bib_df |> standardize_bib()
#> # A tibble: 2 × 5
#>   title                                 authors    year journal            doi  
#>   <chr>                                 <list>    <int> <chr>              <chr>
#> 1 Advances in Bibliometric Methods      <chr [2]>  2023 Journal of Inform… 10.1…
#> 2 Citation Networks in Computer Science <chr [2]>  2022 Scientometrics     10.1…
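
The two parsers above return authors in different shapes: bib2df gives a list of "Last, First" vectors, while the RefManageR data frame shows "First Last" names joined by " and ". A minimal sketch of bringing both to one representation follows. The helper to_last_first() is hypothetical and uses a deliberately naive rule (the last whitespace-separated token is the surname), so it will mishandle multi-word surnames and suffixes.

```r
# Authors as bib2df returns them: one character vector per record.
authors_bib <- list(c("Smith, John A.", "Doe, Jane B."), c("Chen, Wei", "Li, Xiao"))
# Authors as shown in the RefManageR data frame: one string per record.
authors_rm <- c("John A. Smith and Jane B. Doe", "Wei Chen and Xiao Li")

# Hypothetical helper: convert "First Last" to "Last, First".
# Assumes the final token is the surname, which is not always true.
to_last_first <- function(name) {
  name <- trimws(name)
  if (grepl(",", name, fixed = TRUE)) return(name)  # already "Last, First"
  parts <- strsplit(name, "\\s+")[[1]]
  if (length(parts) < 2) return(name)
  paste0(parts[length(parts)], ", ", paste(parts[-length(parts)], collapse = " "))
}

# Split on the BibTeX delimiter, then normalise each name.
authors_rm_std <- lapply(
  strsplit(authors_rm, " and ", fixed = TRUE),
  function(x) vapply(x, to_last_first, character(1), USE.NAMES = FALSE)
)
identical(authors_rm_std, authors_bib)

# Collapse the list column to one delimited string per record,
# which is easier to write to CSV than a list column.
authors_flat <- vapply(authors_bib, paste, character(1), collapse = "; ")
authors_flat
```

The semicolon delimiter is a choice made here for illustration; any delimiter that cannot occur inside a name would serve.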

7.4.5 Visualization

completeness <- bib_df |>
  summarise(across(everything(), ~ mean(!is.na(.)))) |>
  pivot_longer(everything(), names_to = "field", values_to = "present") |>
  filter(present > 0) |>
  mutate(field = fct_reorder(field, present))

ggplot(completeness, aes(x = present, y = field)) +
  geom_col(fill = palette_sci(1)) +
  scale_x_continuous(labels = scales::percent) +
  labs(x = "Completeness", y = NULL) +
  theme_sci()

[Figure: horizontal bar chart showing which metadata fields are present in the parsed BibTeX records.]

Figure 7.1: Field completeness in the parsed BibTeX records.

7.5 Diagnostics and interpretation

After parsing, always check:

  • Row count: Does the number of records match the export? Missing entries may indicate parsing errors.
  • Field completeness: Which fields are populated? Abstracts and keywords are often missing in BibTeX exports.
  • Author format: Names may appear as “Last, First” or “First Last” depending on the source. Standardize before analysis.
  • Encoding: Non-ASCII characters (diacritics, CJK) can corrupt during parsing. Ensure UTF-8 encoding.
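
Some of these checks can be bundled into a small function. The sketch below uses base R only; check_parse() and the toy data frame are hypothetical and assume a data frame of atomic columns (a list column such as AUTHOR from bib2df would need flattening first). The second author is deliberately built from Latin-1 bytes to imitate a file read with the wrong encoding.

```r
# Hypothetical helper: summarise post-parse checks for a data frame.
check_parse <- function(df, expected_n) {
  chr_cols <- df[vapply(df, is.character, logical(1))]
  list(
    # Does the number of parsed records match the export?
    rows_match = nrow(df) == expected_n,
    # Share of non-missing values per field.
    completeness = vapply(df, function(col) mean(!is.na(col)), numeric(1)),
    # Count of strings whose bytes cannot be valid UTF-8, the usual
    # signature of a Latin-1 file read as UTF-8.
    invalid_utf8 = vapply(
      chr_cols,
      function(col) sum(!is.na(col) & !validUTF8(col)),
      integer(1)
    )
  )
}

# "Müller" encoded as Latin-1 bytes, with no encoding declared.
bad_name <- rawToChar(as.raw(c(0x4d, 0xfc, 0x6c, 0x6c, 0x65, 0x72)))

parsed <- data.frame(
  title = c("Paper A", "Paper B", "Paper C"),
  author = c("Smith, John A.", bad_name, "Chen, Wei"),
  doi = c("10.1234/a", NA, "10.1234/c"),
  stringsAsFactors = FALSE
)

checks <- check_parse(parsed, expected_n = 3)
checks$rows_match
checks$completeness
checks$invalid_utf8
```

A non-zero invalid_utf8 count suggests re-reading the file with the correct source encoding rather than patching individual strings.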

7.6 Limitations and responsible use

  • Export files are snapshots: they reflect the database state at export time, so later changes to citation counts or indexing are not captured.
  • Field mapping between formats is imperfect, and some fields have no single standard tag. In RIS, for example, the abstract may appear under AB or N2 depending on the exporting database.
  • Parsing errors are often silent: always validate record counts and spot-check metadata against the source (Hicks et al. 2015).

7.8 Common pitfalls

  • Encoding mismatches: Exporting as Latin-1 and reading as UTF-8 (or vice versa) corrupts names and titles.
  • Truncated exports: Many databases cap exports at 500 or 1,000 records per file. Large corpora require multiple export batches.
  • Inconsistent author delimiters: BibTeX separates authors with “and”, RIS repeats the AU tag on a separate line for each author, and CSV exports typically use semicolons. Parsing must handle each.
  • Assuming format from extension: A .txt file may be WoS plain text or something else entirely. Check the first few lines.
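
Checking the first few lines can be automated. The sketch below is a hypothetical helper, sniff_format(), that looks at the first non-empty line of a file and guesses among three formats from their usual opening markers: "@" for BibTeX, "TY  - " for RIS, and "FN " for Web of Science plain text. It does not handle a byte-order mark or CSV headers, and an "unknown" result means the file needs a manual look.

```r
# Hypothetical helper: guess an export format from the first non-empty line.
sniff_format <- function(path, n = 5) {
  first <- readLines(path, n = n, warn = FALSE)
  first <- first[nzchar(trimws(first))]
  if (length(first) == 0) {
    return("empty")
  }
  head1 <- trimws(first[1])
  if (startsWith(head1, "@")) {
    "bibtex"
  } else if (grepl("^TY  - ", head1)) {
    "ris"
  } else if (grepl("^FN ", head1)) {
    "wos_plaintext"
  } else {
    "unknown"
  }
}

# Small temporary files to exercise each branch.
write_tmp <- function(lines) {
  path <- tempfile(fileext = ".txt")
  writeLines(lines, path)
  path
}

sniff_format(write_tmp(c("", "@article{smith2023,")))
sniff_format(write_tmp(c("TY  - JOUR", "AU  - Smith, John A.")))
sniff_format(write_tmp(c("FN Clarivate Analytics Web of Science", "VR 1.0")))
sniff_format(write_tmp(c("Authors,Title,Year")))
```

All four test files carry a .txt extension on purpose: the guess comes from the content, not the file name.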

7.10 Exercises

  1. Export and parse. Export 50 records from Google Scholar as BibTeX. Parse them with bib2df. How many fields are populated?

  2. Format comparison. Export the same set of records from Scopus as both CSV and RIS. Parse both and compare the fields available.

  3. Standardization function. Write a function that takes a data frame from convert2df() and returns a tibble with columns: doi, title, authors, year, journal, cited_by.

7.11 Solutions

Solutions are provided in 2.11.

7.12 Further reading

  • Aria and Cuccurullo (2017) — The bibliometrix package, including convert2df() for multi-format parsing.
  • Priem et al. (2022) — OpenAlex as an API-first alternative to file-based workflows.

7.13 Session info

#> R version 4.4.1 (2024-06-14)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.4 LTS
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
#> 
#> locale:
#>  [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8       
#>  [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8   
#>  [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C          
#> [10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   
#> 
#> time zone: UTC
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#>  [1] bibliometrix_5.4.0 RefManageR_1.4.0   bib2df_1.1.2.0     rcrossref_1.2.1   
#>  [5] gt_1.3.0           tidytext_0.4.3     glue_1.8.1         openalexR_3.0.1   
#>  [9] lubridate_1.9.5    forcats_1.0.1      stringr_1.6.0      dplyr_1.2.1       
#> [13] purrr_1.2.2        readr_2.2.0        tidyr_1.3.2        tibble_3.3.1      
#> [17] ggplot2_4.0.3      tidyverse_2.0.0   
#> 
#> loaded via a namespace (and not attached):
#>   [1] httr2_1.2.2            gridExtra_2.3          readxl_1.4.5          
#>   [4] rlang_1.2.0            magrittr_2.0.5         otel_0.2.0            
#>   [7] compiler_4.4.1         vctrs_0.7.3            httpcode_0.3.0        
#>  [10] pkgconfig_2.0.3        fastmap_1.2.0          backports_1.5.1       
#>  [13] labeling_0.4.3         ca_0.71.1              utf8_1.2.6            
#>  [16] promises_1.5.0         rmarkdown_2.31         tzdb_0.5.0            
#>  [19] xfun_0.57              cachem_1.1.0           jsonlite_2.0.0        
#>  [22] SnowballC_0.7.1        later_1.4.8            parallel_4.4.1        
#>  [25] stopwords_2.3          R6_2.6.1               bslib_0.11.0          
#>  [28] stringi_1.8.7          RColorBrewer_1.1-3     cellranger_1.1.0      
#>  [31] jquerylib_0.1.4        Rcpp_1.1.1-1.1         bookdown_0.46         
#>  [34] knitr_1.51             triebeard_0.4.1        base64enc_0.1-6       
#>  [37] rentrez_1.2.4          igraph_2.3.1           httpuv_1.6.17         
#>  [40] Matrix_1.7-0           timechange_0.4.0       tidyselect_1.2.1      
#>  [43] stringdist_0.9.17      pubmedR_1.0.2          rstudioapi_0.18.0     
#>  [46] dichromat_2.0-0.1      yaml_2.3.12            viridis_0.6.5         
#>  [49] codetools_0.2-20       miniUI_0.1.2           humaniformat_0.6.0    
#>  [52] curl_7.1.0             qpdf_1.4.1             lattice_0.22-6        
#>  [55] plyr_1.8.9             shiny_1.13.0           withr_3.0.2           
#>  [58] S7_0.2.2               askpass_1.2.1          evaluate_1.0.5        
#>  [61] zip_2.3.3              xml2_1.5.2             shinycssloaders_1.1.0 
#>  [64] pillar_1.11.1          janeaustenr_1.0.0      DT_0.34.0             
#>  [67] plotly_4.12.0          generics_0.1.4         rprojroot_2.1.1       
#>  [70] hms_1.1.4              scales_1.4.0           xtable_1.8-8          
#>  [73] contentanalysis_1.0.0  lazyeval_0.2.3         tools_4.4.1           
#>  [76] brand.yml_0.1.0        data.table_1.18.4      tokenizers_0.3.0      
#>  [79] openxlsx_4.2.8.1       pdftools_3.9.0         XML_3.99-0.23         
#>  [82] fs_2.1.0               visNetwork_2.1.4       grid_4.4.1            
#>  [85] bibtex_0.5.2           urltools_1.7.3.1       rscopus_0.9.0         
#>  [88] dimensionsR_0.0.3      bibliometrixData_0.3.0 cli_3.6.6             
#>  [91] rappdirs_0.3.4         viridisLite_0.4.3      downlit_0.4.5         
#>  [94] gtable_0.3.6           sass_0.4.10            digest_0.6.39         
#>  [97] ggrepel_0.9.8          crul_1.6.0             htmlwidgets_1.6.4     
#> [100] farver_2.1.2           memoise_2.0.1          htmltools_0.5.9       
#> [103] lifecycle_1.0.5        httr_1.4.8             here_1.0.2            
#> [106] mime_0.13