B SQL Primer for the OpenAlex Snapshot
B.1 When to use the snapshot
The OpenAlex API (Chapter 5) is ideal for moderate queries. For analyses requiring millions of records — full-database bibliographic coupling, global co-authorship networks, or exhaustive field delineation — the OpenAlex snapshot is more practical. The snapshot is a full database dump released monthly in compressed JSON-lines format.
B.2 Loading the snapshot
OpenAlex provides scripts to load the snapshot into PostgreSQL. After loading:
B.3 Common queries
B.3.2 Top institutions by output
SELECT i.display_name, COUNT(DISTINCT w.id) AS n_works
FROM openalex.works w
JOIN openalex.works_authorships wa ON w.id = wa.work_id
JOIN openalex.works_authorships_institutions wai
ON wa.work_id = wai.work_id AND wa.author_position = wai.author_position
JOIN openalex.institutions i ON wai.institution_id = i.id
WHERE w.publication_year = 2023
AND w.type = 'article'
GROUP BY i.display_name
ORDER BY n_works DESC
LIMIT 20;B.4 Connecting from R
library(DBI)
library(RPostgres)
con <- dbConnect(
Postgres(),
dbname = "openalex",
host = "localhost",
port = 5432,
user = "openalex"
)
top_journals <- dbGetQuery(con, "
SELECT s.display_name, COUNT(*) AS n_works
FROM openalex.works w
JOIN openalex.sources s ON w.primary_location_source_id = s.id
WHERE w.publication_year = 2023
GROUP BY s.display_name
ORDER BY n_works DESC
LIMIT 20
")
dbDisconnect(con)B.5 Performance tips
-
Index key columns:
work_id,author_id,institution_id,publication_year, andcited_by_countshould all be indexed. -
Partition by year: For time-series queries, partitioning the works table by
publication_yearspeeds up range scans substantially. - Materialised views: Pre-compute expensive joins (e.g., institution–work counts) as materialised views and refresh them after each snapshot update.
-
Use COPY for bulk loads: The
COPYcommand is orders of magnitude faster than row-by-rowINSERT.
This book was built by the bookdown R package.