---
title: "pubmedR"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{pubmedR: A brief example}
  %\VignetteEngine{knitr::rmarkdown}
  \usepackage[utf8]{inputenc}
---
## An R-package to gather bibliographic data from PubMed.
The goal of pubmedR is to gather metadata about publications, grants and
clinical trials from the PubMed database using NCBI REST APIs.
*https://github.com/massimoaria/pubmedR*
*Latest version: `r packageVersion("pubmedR")`, `r Sys.Date()`*
**by Massimo Aria**
Full Professor in Social Statistics
PhD in Computational Statistics
Laboratory and Research Group STAD Statistics, Technology, Data Analysis
Department of Economics and Statistics
University of Naples Federico II
email aria@unina.it
https://www.massimoaria.com
## Installation
You can install the development version of pubmedR from [GitHub](https://github.com) with:
install.packages("devtools")
devtools::install_github("massimoaria/pubmedR")
You can install the released version of pubmedR from [CRAN](https://CRAN.R-project.org) with:
install.packages("pubmedR")
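If the same script runs on machines where a package may or may not already be present, a small guard avoids reinstalling on every run. The helper below is a minimal sketch; `ensure_package()` is a hypothetical name used for illustration and is not part of pubmedR.

```r
# Hypothetical helper (not part of pubmedR): report whether a package is
# available and, only when explicitly requested, install it from CRAN.
ensure_package <- function(pkg, install = FALSE) {
  available <- requireNamespace(pkg, quietly = TRUE)
  if (!available && install) {
    utils::install.packages(pkg)
    available <- requireNamespace(pkg, quietly = TRUE)
  }
  available
}

# "stats" ships with R, so this check needs no network access.
ensure_package("stats")
```

Calling `ensure_package("pubmedR", install = TRUE)` would then install the released version only when it is missing.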
## Load the package
library(pubmedR)
## NCBI API key
By default, access to the NCBI API is free and does not strictly require an API key.
Without a key, NCBI limits users to 3 requests per second; registered users with an
API key are allowed up to 10 requests per second.
To obtain a key, register for a ***My NCBI account*** (*https://account.ncbi.nlm.nih.gov/*)
and generate one from the ***account settings page*** (*https://account.ncbi.nlm.nih.gov/settings/*).
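The two rate limits translate into a minimum pause between consecutive requests. The sketch below only illustrates that arithmetic; `min_request_interval()` is a hypothetical helper, and how pubmedR itself paces its requests is not shown here.

```r
# Hypothetical helper (not part of pubmedR): minimum pause, in seconds,
# between two requests so that the NCBI limits quoted above are respected.
min_request_interval <- function(api_key = NULL) {
  has_key <- !is.null(api_key) && nzchar(api_key)
  requests_per_second <- if (has_key) 10 else 3
  1 / requests_per_second
}

min_request_interval(NULL)            # about 0.33 seconds without a key
min_request_interval("your API key")  # 0.1 seconds with a key
```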
You can pass the key explicitly via the `api_key` argument of any function or,
preferably, set it once as an environment variable. pubmedR automatically
picks it up from `PUBMED_API_KEY` or `ENTREZ_KEY`:
# option 1: pass it explicitly
api_key <- "your API key"
# option 2: set it once per session (or in ~/.Renviron)
Sys.setenv(PUBMED_API_KEY = "your API key")
# no key
api_key <- NULL
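The lookup described above can be pictured as a simple fallback chain. This is a sketch under assumptions, not pubmedR's internal code: `resolve_api_key()` is a hypothetical name, and the precedence (explicit argument, then `PUBMED_API_KEY`, then `ENTREZ_KEY`) is assumed for illustration.

```r
# Hypothetical helper (not part of pubmedR): pick the first key that is set.
# Assumed precedence: explicit argument, PUBMED_API_KEY, then ENTREZ_KEY.
resolve_api_key <- function(api_key = NULL) {
  if (!is.null(api_key) && nzchar(api_key)) {
    return(api_key)
  }
  for (var in c("PUBMED_API_KEY", "ENTREZ_KEY")) {
    value <- Sys.getenv(var, unset = "")
    if (nzchar(value)) {
      return(value)
    }
  }
  NULL  # no key available: requests fall under the lower rate limit
}

key <- resolve_api_key()
is.null(key)  # TRUE when neither the argument nor the variables are set
```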
# A brief example
Imagine we want to download a metadata collection of journal articles that
(1) use bibliometric approaches, (2) were published between 2000 and 2020,
and (3) are written in English.
Since version 0.1.0, pubmedR offers **two equivalent workflows**:
- a **step-by-step workflow** (`pmQueryBuild` → `pmQueryTotalCount` →
`pmApiRequest` → `pmApi2df`), which gives you full control over each stage;
- a **one-step workflow** via `pmCollect()`, a convenience wrapper that chains
the whole pipeline in a single call and optionally enriches the result with
citation data.
## One-step workflow: `pmCollect()`
For most use cases, `pmCollect()` is the fastest way to go from a query to a
bibliometrix-ready data frame. It builds the query, checks the total count,
downloads the records, converts the XML into a data frame, and (optionally)
adds citation counts and cited references.
library(pubmedR)
M <- pmCollect(
  terms = "bibliometric*",
  fields = "Title/Abstract",
  language = "english",
  pub_type = "Journal Article",
  date_range = c("2000", "2020"),
  limit = 2000,
  api_key = NULL
)
# Query: (bibliometric*[Title/Abstract]) AND english[LA] AND
# Journal Article[PT] AND 2000:2020[DP]
#
# Total records found: 2921
# Records to download: 2000
You can also pass a raw PubMed query string directly:
M <- pmCollect(
  query = "bibliometric*[Title/Abstract] AND english[LA] AND 2023:2024[DP]",
  limit = 200
)
Set `enrich = TRUE` to add citation counts (`TC`) and cited references (`CR`)
via `pmEnrichCitations()`. Note that enrichment adds two API calls per record,
so it is best used on smaller collections:
M <- pmCollect(
  terms = "bibliometric*",
  date_range = c("2023", "2024"),
  limit = 50,
  enrich = TRUE
)
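Conceptually, the one-step call corresponds to chaining the four step-by-step functions. The function below is a rough sketch of that chain, not the actual source of `pmCollect()`: `collect_sketch()` is a hypothetical name, it omits the enrichment step, and it assumes that the filter arguments may be left as `NULL`.

```r
# Hypothetical sketch (not the real pmCollect()): chain the four pipeline
# stages using the argument names shown in this vignette.
collect_sketch <- function(terms, fields = "Title/Abstract", language = NULL,
                           pub_type = NULL, date_range = NULL,
                           limit = 2000, api_key = NULL) {
  # 1. compose the Entrez query string
  query <- pubmedR::pmQueryBuild(
    terms = terms, fields = fields, language = language,
    pub_type = pub_type, date_range = date_range
  )
  # 2. ask PubMed how many records match
  res <- pubmedR::pmQueryTotalCount(query = query, api_key = api_key)
  # 3. download no more than the requested limit
  n <- min(limit, res$total_count)
  D <- pubmedR::pmApiRequest(query = query, limit = n, api_key = api_key)
  # 4. convert the XML list into a bibliometrix-style data frame
  pubmedR::pmApi2df(D, format = "bibliometrix")
}
```

Defining the function needs neither the package nor a connection; calling it requires both.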
## Step-by-step workflow
If you prefer fine-grained control, for example to inspect the translated
query before downloading, you can run the pipeline one stage at a time.
### Step 1: Build the query
Instead of writing the Entrez query string by hand, you can use
`pmQueryBuild()` to compose it programmatically from parameters:
query <- pmQueryBuild(
  terms = "bibliometric*",
  fields = "Title/Abstract",
  language = "english",
  pub_type = "Journal Article",
  date_range = c("2000", "2020")
)
query
# [1] "(bibliometric*[Title/Abstract]) AND english[LA] AND
# Journal Article[PT] AND 2000:2020[DP]"
Of course, you can still write the query manually if you prefer:
query <- "bibliometric*[Title/Abstract] AND english[LA] AND Journal Article[PT] AND 2000:2020[DP]"
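To see how such a string is assembled from its parts, here is a base R sketch that reproduces the query shown above. `build_query_sketch()` is a hypothetical stand-in, not the implementation of `pmQueryBuild()`; joining several terms with `OR` is an assumption made for illustration.

```r
# Hypothetical stand-in (not pmQueryBuild()): assemble an Entrez query from
# search terms plus optional language, publication type and date filters.
build_query_sketch <- function(terms, fields = "Title/Abstract",
                               language = NULL, pub_type = NULL,
                               date_range = NULL) {
  # tag every term with the search field; several terms are joined with OR
  term_part <- paste0(
    "(", paste0(terms, "[", fields, "]", collapse = " OR "), ")"
  )
  parts <- c(
    term_part,
    if (!is.null(language)) paste0(language, "[LA]"),
    if (!is.null(pub_type)) paste0(pub_type, "[PT]"),
    if (!is.null(date_range)) {
      paste0(date_range[1], ":", date_range[2], "[DP]")
    }
  )
  # filters that were left as NULL are dropped by c() before joining
  paste(parts, collapse = " AND ")
}

build_query_sketch(
  terms = "bibliometric*",
  language = "english",
  pub_type = "Journal Article",
  date_range = c("2000", "2020")
)
```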
### Step 2: Check the effectiveness of the query
Use `pmQueryTotalCount()` to see how many records PubMed would return, along
with the automatically translated query:
res <- pmQueryTotalCount(query = query, api_key = api_key)
res$total_count
# [1] 2921
res$query_translation
# [1] "(bibliometric[Title/Abstract] OR bibliometrica[Title/Abstract] OR
# ... OR bibliometricstrade[Title/Abstract]) AND english[LA] AND
# Journal Article[PT] AND 2000[PDAT] : 2020[PDAT]"
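The count is a natural place to decide how much to download. The sketch below uses a mock result with the same `total_count` field so that it runs offline; `choose_limit()` and the cap of 2000 records are hypothetical choices, not pubmedR defaults.

```r
# Mock of the relevant part of a pmQueryTotalCount() result (offline example).
res_example <- list(total_count = 2921)

# Hypothetical helper: stop on an empty result, otherwise cap the download.
choose_limit <- function(total_count, max_records = 2000) {
  if (total_count == 0) {
    stop("The query matched no records; revise it before downloading.")
  }
  min(total_count, max_records)
}

limit <- choose_limit(res_example$total_count)
limit  # 2000: the cap applies because 2921 records match
```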
### Step 3: Download the collection of document metadata
You can now download the whole collection (or a subset, by lowering `limit`):
D <- pmApiRequest(query = query, limit = res$total_count, api_key = api_key)
# Documents 200 of 2921
# Documents 400 of 2921
# ...
# Documents 2921 of 2921
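The progress messages suggest that records arrive in batches of 200. Assuming that batch size, the paging can be pictured with the E-utilities `retstart` and `retmax` parameters. `batch_plan()` is a hypothetical helper that only computes offsets and sends no requests.

```r
# Hypothetical helper: compute the offset (retstart) and size (retmax) of
# each batch needed to page through `total` records.
batch_plan <- function(total, batch_size = 200) {
  stopifnot(total > 0, batch_size > 0)
  starts <- seq(0, total - 1, by = batch_size)
  data.frame(
    retstart = starts,
    retmax = pmin(batch_size, total - starts)
  )
}

plan <- batch_plan(2921)
nrow(plan)     # 15 batches
tail(plan, 1)  # the last batch starts at 2800 and holds 121 records
```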
`pmApiRequest()` returns a list with the following elements:
- **`data`** - the XML-structured list containing the bibliographic metadata
collection downloaded from PubMed.
- **`query`** - the original query submitted by the user.
- **`query_translation`** - the query as translated and executed by NCBI's
Automatic Term Mapping system.
- **`records_downloaded`** - number of records actually downloaded.
- **`total_count`** - total number of records matching the query.
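The last two elements make it easy to check whether a download covered the whole result set. The example below works on a mock list with the same element names, so it runs without a connection; `download_summary()` is a hypothetical helper.

```r
# Mock object with the element names listed above (values are illustrative).
D_example <- list(
  data = list(),
  query = "bibliometric*[Title/Abstract]",
  query_translation = "",
  records_downloaded = 50,
  total_count = 200
)

# Hypothetical helper: one-line summary of how complete a download is.
download_summary <- function(D) {
  share <- 100 * D$records_downloaded / D$total_count
  sprintf(
    "%d of %d records downloaded (%.1f%%)",
    as.integer(D$records_downloaded), as.integer(D$total_count), share
  )
}

download_summary(D_example)
```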
### Step 4: Convert the XML object into a data frame
Finally, transform the XML-structured object `D` into a data frame where rows
are documents and columns are field tags compatible with the
**bibliometrix R package**
(https://CRAN.R-project.org/package=bibliometrix, https://bibliometrix.org/).
M <- pmApi2df(D, format = "bibliometrix")
str(M)
# 'data.frame': 2918 obs. of 27 variables:
# $ AU : chr ...
# $ AF : chr ...
# $ TI : chr ...
# $ SO : chr ...
# $ LA : chr ...
# $ DT : chr ...
# $ DE : chr ...
# $ AB : chr ...
# $ C1 : chr ...
# $ TC : num ...
# $ PY : num ...
# $ DI : chr ...
# $ PMID : chr ...
# ...
Setting `format = "raw"` returns the data frame with all fields in their
native PubMed form instead of the bibliometrix-style field tags.
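Because the result is an ordinary data frame, base R is enough for a first look. The toy example below uses made-up rows and assumes the bibliometrix convention that multiple authors in `AU` are separated by semicolons.

```r
# Toy data frame with made-up rows; only the column names mirror the
# bibliometrix-style output described above.
M_example <- data.frame(
  AU = c("AUTHOR A;AUTHOR B", "AUTHOR A", "AUTHOR C;AUTHOR B"),
  PY = c(2018, 2020, 2020),
  stringsAsFactors = FALSE
)

# Documents per publication year
docs_per_year <- table(M_example$PY)

# Documents per author, assuming ";" separates the names in AU
authors <- unlist(strsplit(M_example$AU, ";", fixed = TRUE))
author_counts <- sort(table(authors), decreasing = TRUE)

docs_per_year
author_counts
```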
## Fetching records by PMID
If you already know which articles you want, you can bypass the query step and
download records directly by their PubMed identifiers with `pmFetchById()`:
pmids <- c("34813985", "34813456", "34812345")
D <- pmFetchById(pmids = pmids, api_key = api_key)
M <- pmApi2df(D)
The returned object follows the same structure as `pmApiRequest()`, so it can
be fed into `pmApi2df()` exactly the same way.
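PMID lists collected by hand or from spreadsheets often contain stray spaces, duplicates, or other identifiers. A light clean-up before fetching can be sketched as follows; `clean_pmids()` is a hypothetical helper, not a pubmedR function.

```r
# Hypothetical helper: trim whitespace, drop entries that are not purely
# numeric, and remove duplicates before passing PMIDs on.
clean_pmids <- function(pmids) {
  pmids <- trimws(as.character(pmids))
  valid <- grepl("^[0-9]+$", pmids)
  if (any(!valid)) {
    warning(sum(!valid), " entries dropped because they are not numeric PMIDs")
  }
  unique(pmids[valid])
}

# Returns the two distinct numeric IDs and warns about the PMC identifier.
clean_pmids(c(" 34813985", "34813456", "34813456", "PMC123"))
```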
## Citation enrichment
pubmedR exposes three helpers to retrieve citation information via NCBI's
ELink service (based on PubMed Central):
- `pmCitedBy(pmid)` - returns the PMIDs of articles citing the given article;
- `pmReferences(pmid)` - returns the PMIDs of articles referenced by the given article;
- `pmEnrichCitations(df)` - adds a `TC` (times cited) column and a `CR` (cited
references) column to a pubmedR data frame.
Example:
cites <- pmCitedBy(pmid = "25824007")
cites$count
cites$cited_by
refs <- pmReferences(pmid = "25824007")
refs$count
refs$references
# Add citation counts and references to an existing data frame
M_enriched <- pmEnrichCitations(M, api_key = api_key)
Note: citation data in PubMed comes from PubMed Central and is less
comprehensive than commercial databases such as Web of Science or Scopus.
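A times-cited column amounts to one cited-by lookup per PMID. The loop below is a simplified sketch of that idea, not the implementation of `pmEnrichCitations()`. `count_citations()` is a hypothetical name, the pause is an assumed throttle, and a stub replaces the network call so the example runs offline.

```r
# Hypothetical sketch: one cited-by lookup per PMID, pausing between calls.
# `fetch` defaults to pmCitedBy() but can be swapped for a stub.
count_citations <- function(pmids, fetch = pubmedR::pmCitedBy, pause = 0.34) {
  vapply(pmids, function(id) {
    Sys.sleep(pause)  # stay under the keyless limit of 3 requests per second
    as.numeric(fetch(pmid = id)$count)
  }, numeric(1))
}

# Offline stub that mimics the `count` and `cited_by` elements used above;
# the counts it returns are meaningless placeholders.
stub_fetch <- function(pmid) {
  list(count = nchar(pmid), cited_by = character(0))
}

tc <- count_citations(c("111", "22222"), fetch = stub_fetch, pause = 0)
tc
```

Passing the lookup function as an argument keeps the loop testable without touching the NCBI servers.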
## An overview of the collection using bibliometrix
Once you have a data frame `M`, you can use **bibliometrix** for descriptive
and network analyses
(https://CRAN.R-project.org/package=bibliometrix, https://bibliometrix.org/).
install.packages("bibliometrix")
library(bibliometrix)
results <- biblioAnalysis(M)
summary(results)
# Main Information about data
#
# Documents 2918
# Sources (Journals, Books, etc.) 1275
# Keywords Plus (ID) 2245
# Author's Keywords (DE) 4212
# Period 2000 - 2020
# ...