---
title: "pubmedR"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{pubmedR: A brief example}
  %\VignetteEngine{knitr::rmarkdown}
  \usepackage[utf8]{inputenc}
---
## An R package to gather bibliographic data from PubMed

&nbsp;

    The goal of pubmedR is to gather metadata about publications, grants and
    clinical trials from the PubMed database using NCBI REST APIs.



&nbsp;

*https://github.com/massimoaria/pubmedR*

*Latest version: `r packageVersion("pubmedR")`, `r Sys.Date()`*

&nbsp;

**by Massimo Aria**

Full Professor in Social Statistics

PhD in Computational Statistics

Laboratory and Research Group STAD Statistics, Technology, Data Analysis

Department of Economics and Statistics

University of Naples Federico II

email aria@unina.it

https://www.massimoaria.com

&nbsp;

## Installation

You can install the development version of pubmedR from [GitHub](https://github.com) with:

    install.packages("devtools")
    devtools::install_github("massimoaria/pubmedR")


You can install the released version of pubmedR from [CRAN](https://CRAN.R-project.org) with:

    install.packages("pubmedR")

&nbsp;

## Load the package


    library(pubmedR)

&nbsp;

## NCBI API key

By default, access to the NCBI API is free and does not strictly require an API key.
Without a key, NCBI limits users to 3 requests per second; registered users with an
API key are allowed up to 10 requests per second.

To obtain a key, register for a ***"My NCBI account"*** (*https://account.ncbi.nlm.nih.gov/*)
and generate one from the ***"Account Settings page"*** (*https://account.ncbi.nlm.nih.gov/settings/*).

You can pass the key explicitly via the `api_key` argument of any function, or -
preferably - set it once as an environment variable. pubmedR will automatically
pick it up from `PUBMED_API_KEY` or `ENTREZ_KEY`:

    # option 1: pass it explicitly
    api_key <- "your API key"

    # option 2: set it once per session (or in ~/.Renviron)
    Sys.setenv(PUBMED_API_KEY = "your API key")

    # no key
    api_key <- NULL
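
The lookup described above can be sketched in base R. This is only an illustration under stated assumptions: `resolve_api_key()` and `min_delay()` are hypothetical helpers, not pubmedR functions, and checking `PUBMED_API_KEY` before `ENTREZ_KEY` is an assumed order of precedence.

```r
# Hypothetical helper (not part of pubmedR): look up an API key in the
# environment variables named above, returning NULL when neither is set.
# The order of precedence is an assumption made for this sketch.
resolve_api_key <- function(vars = c("PUBMED_API_KEY", "ENTREZ_KEY")) {
  for (v in vars) {
    key <- Sys.getenv(v, unset = "")
    if (nzchar(key)) return(key)
  }
  NULL
}

# Minimum pause (in seconds) between requests implied by the limits quoted
# above: 3 requests per second without a key, 10 with one.
min_delay <- function(api_key = NULL) {
  if (is.null(api_key)) 1 / 3 else 1 / 10
}

api_key <- resolve_api_key()
min_delay(api_key)
```

Storing the key in `~/.Renviron` keeps it out of scripts that may be shared or committed to version control.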

&nbsp;

# A brief example

Imagine we want to download a metadata collection of journal articles that
(1) use bibliometric approaches, (2) were published between 2000 and 2020,
and (3) are written in English.

Since version 0.1.0, pubmedR offers **two equivalent workflows**:

- a **step-by-step workflow** (`pmQueryBuild` → `pmQueryTotalCount` →
  `pmApiRequest` → `pmApi2df`), which gives you full control over each stage;
- a **one-step workflow** via `pmCollect()`, a convenience wrapper that chains
  the whole pipeline in a single call and optionally enriches the result with
  citation data.

&nbsp;

## One-step workflow: `pmCollect()`

For most use cases, `pmCollect()` is the fastest way to go from a query to a
bibliometrix-ready data frame. It builds the query, checks the total count,
downloads the records, converts the XML into a data frame, and (optionally)
adds citation counts and cited references.

    library(pubmedR)

    M <- pmCollect(
      terms      = "bibliometric*",
      fields     = "Title/Abstract",
      language   = "english",
      pub_type   = "Journal Article",
      date_range = c("2000", "2020"),
      limit      = 2000,
      api_key    = NULL
    )

    # Query: (bibliometric*[Title/Abstract]) AND english[LA] AND
    #        Journal Article[PT] AND 2000:2020[DP]
    #
    # Total records found: 2921
    # Records to download: 2000

You can also pass a raw PubMed query string directly:

    M <- pmCollect(
      query   = "bibliometric*[Title/Abstract] AND english[LA] AND 2023:2024[DP]",
      limit   = 200
    )

Set `enrich = TRUE` to add citation counts (`TC`) and cited references (`CR`)
via `pmEnrichCitations()`. Note that enrichment adds two API calls per record,
so it is best used on smaller collections:

    M <- pmCollect(
      terms      = "bibliometric*",
      date_range = c("2023", "2024"),
      limit      = 50,
      enrich     = TRUE
    )
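
To get a feel for what enrichment costs, the arithmetic can be worked out from the figures quoted in this vignette (two API calls per record; 3 requests per second without a key, 10 with one). `enrich_cost()` is a hypothetical helper written for this sketch, and the result is only a lower bound because it ignores network latency and server response time.

```r
# Rough lower bound on the cost of enrich = TRUE, using only figures quoted
# in this vignette. Hypothetical helper, not part of pubmedR.
enrich_cost <- function(n_records, has_key = FALSE) {
  calls <- 2 * n_records             # two API calls per record
  rate  <- if (has_key) 10 else 3    # allowed requests per second
  list(calls = calls, min_seconds = calls / rate)
}

enrich_cost(50)                    # 100 calls, about 33 seconds at 3 requests/s
enrich_cost(2000, has_key = TRUE)  # 4000 calls, 400 seconds at 10 requests/s
```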

&nbsp;

## Step-by-step workflow

If you prefer fine-grained control - for example to inspect the translated
query before downloading - you can run the pipeline one stage at a time.

### Step 1: Build the query

Instead of writing the Entrez query string by hand, you can use
`pmQueryBuild()` to compose it programmatically from parameters:

    query <- pmQueryBuild(
      terms      = "bibliometric*",
      fields     = "Title/Abstract",
      language   = "english",
      pub_type   = "Journal Article",
      date_range = c("2000", "2020")
    )

    query
    # [1] "(bibliometric*[Title/Abstract]) AND english[LA] AND
    #      Journal Article[PT] AND 2000:2020[DP]"

Of course, you can still write the query manually if you prefer:

    query <- "bibliometric*[Title/Abstract] AND english[LA] AND Journal Article[PT] AND 2000:2020[DP]"
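
For readers curious how a string of this shape can be assembled, here is a minimal base R sketch. It is not the implementation of `pmQueryBuild()`: `build_query_sketch()` is a hypothetical function, and joining multiple terms with `OR` is an assumption made only for this illustration.

```r
# Illustrative sketch only: NOT the implementation of pmQueryBuild().
# It assembles a query string with the same shape as the one printed above.
build_query_sketch <- function(terms, fields = NULL, language = NULL,
                               pub_type = NULL, date_range = NULL) {
  # Attach the field tag to each term, e.g. bibliometric*[Title/Abstract]
  term_part <- if (is.null(fields)) terms else paste0(terms, "[", fields, "]")
  # Assumption: several terms are combined with OR inside one group
  parts <- paste0("(", paste(term_part, collapse = " OR "), ")")
  if (!is.null(language)) parts <- c(parts, paste0(language, "[LA]"))
  if (!is.null(pub_type)) parts <- c(parts, paste0(pub_type, "[PT]"))
  if (!is.null(date_range)) {
    parts <- c(parts, paste0(date_range[1], ":", date_range[2], "[DP]"))
  }
  paste(parts, collapse = " AND ")
}

build_query_sketch(
  terms      = "bibliometric*",
  fields     = "Title/Abstract",
  language   = "english",
  pub_type   = "Journal Article",
  date_range = c("2000", "2020")
)
```

Each filter is simply a value followed by a PubMed field tag (`[LA]` for language, `[PT]` for publication type, `[DP]` for publication date), joined with `AND`.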

&nbsp;

### Step 2: Check the effectiveness of the query

Use `pmQueryTotalCount()` to see how many records PubMed would return, along
with the automatically translated query:

    res <- pmQueryTotalCount(query = query, api_key = api_key)

    res$total_count
    # [1] 2921

    res$query_translation
    # [1] "(bibliometric[Title/Abstract] OR bibliometrica[Title/Abstract] OR
    #      ... OR bibliometricstrade[Title/Abstract]) AND english[LA] AND
    #      Journal Article[PT] AND 2000[PDAT] : 2020[PDAT]"
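
The total count can be used to decide how many records to request before starting a long download. The sketch below uses a stand-in `res` list with the two fields shown above so that it runs without a network call; `choose_limit()` and its `max_records` default are hypothetical.

```r
# `res` is a stand-in with the two fields shown above; in practice it would
# come from pmQueryTotalCount().
res <- list(total_count = 2921, query_translation = "...")

# Hypothetical helper: cap the download at `max_records`, and stop early when
# the query matches nothing.
choose_limit <- function(total_count, max_records = 2000) {
  if (total_count == 0) stop("The query returned no records; revise it.")
  min(total_count, max_records)
}

limit <- choose_limit(res$total_count)
limit
```

An unexpectedly large or small `total_count` is usually the first sign that the query needs another look, and `query_translation` shows how PubMed actually interpreted it.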

&nbsp;

### Step 3: Download the collection of document metadata

You can now download the whole collection (or a subset, by lowering `limit`):

    D <- pmApiRequest(query = query, limit = res$total_count, api_key = api_key)

    # Documents  200  of  2921
    # Documents  400  of  2921
    # ...
    # Documents  2921  of  2921

`pmApiRequest()` returns a list with the following elements:

- **`data`** - the XML-structured list containing the bibliographic metadata
  collection downloaded from PubMed.
- **`query`** - the original query submitted by the user.
- **`query_translation`** - the query as translated and executed by PubMed's
  Automatic Term Mapping system.
- **`records_downloaded`** - number of records actually downloaded.
- **`total_count`** - total number of records matching the query.
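
The two counters make it easy to check whether a download covered the whole result set. The following sketch works on a mock list that mirrors the element names above; `D_mock` and `download_coverage()` are illustrative names, not pubmedR objects.

```r
# Stand-in for the list returned by pmApiRequest(), using the element names
# listed above; `data` is left empty because only the counters matter here.
D_mock <- list(
  data               = list(),
  query              = "bibliometric*[Title/Abstract]",
  query_translation  = "...",
  records_downloaded = 2000,
  total_count        = 2921
)

# Hypothetical helper: summarise how much of the matching set was retrieved.
download_coverage <- function(D) {
  missing <- D$total_count - D$records_downloaded
  list(
    complete = missing == 0,
    missing  = missing,
    share    = D$records_downloaded / D$total_count
  )
}

coverage <- download_coverage(D_mock)
coverage$complete  # FALSE: 921 matching records were not downloaded
coverage$missing
```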

&nbsp;

### Step 4: Convert the XML object into a data frame

Finally, transform the XML-structured object `D` into a data frame where rows
are documents and columns are field tags compatible with the
**bibliometrix R package**
(https://CRAN.R-project.org/package=bibliometrix, https://bibliometrix.org/).

    M <- pmApi2df(D, format = "bibliometrix")

    str(M)
    # 'data.frame':   2918 obs. of  27 variables:
    #  $ AU    : chr  ...
    #  $ AF    : chr  ...
    #  $ TI    : chr  ...
    #  $ SO    : chr  ...
    #  $ LA    : chr  ...
    #  $ DT    : chr  ...
    #  $ DE    : chr  ...
    #  $ AB    : chr  ...
    #  $ C1    : chr  ...
    #  $ TC    : num  ...
    #  $ PY    : num  ...
    #  $ DI    : chr  ...
    #  $ PMID  : chr  ...
    #  ...

Setting `format = "raw"` returns the data frame with all fields in their
native PubMed form instead of the bibliometrix-style field tags.
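
Once the data frame is available, ordinary base R is enough for a first look. The sketch below uses a toy data frame with invented values so that it runs offline; the `;` separator for multi-valued fields such as `AU` is an assumption about the bibliometrix format and should be checked against your own data.

```r
# Toy data frame with a few bibliometrix-style columns (AU, PY, SO); the
# values are invented for illustration and do not come from PubMed.
M_toy <- data.frame(
  AU = c("ROSSI A;BIANCHI B", "VERDI C", "ROSSI A;VERDI C;NERI D"),
  PY = c(2017, 2019, 2019),
  SO = c("JOURNAL A", "JOURNAL B", "JOURNAL A"),
  stringsAsFactors = FALSE
)

# Documents per publication year
table(M_toy$PY)

# Split the multi-valued author field and count documents per author.
# The ";" separator is an assumption about the bibliometrix format.
authors <- unlist(strsplit(M_toy$AU, ";", fixed = TRUE))
sort(table(authors), decreasing = TRUE)
```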

&nbsp;

## Fetching records by PMID

If you already know which articles you want, you can bypass the query step and
download records directly by their PubMed identifiers with `pmFetchById()`:

    pmids <- c("34813985", "34813456", "34812345")
    D <- pmFetchById(pmids = pmids, api_key = api_key)
    M <- pmApi2df(D)

The returned object follows the same structure as `pmApiRequest()`, so it can
be fed into `pmApi2df()` exactly the same way.
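
PMID lists gathered from spreadsheets or reference managers often contain stray whitespace, duplicates, or other identifiers. A small cleaning step before the request can save a failed call; `clean_pmids()` is a hypothetical helper written for this sketch and expects a character vector.

```r
# Hypothetical helper: normalise a character vector of candidate PMIDs before
# passing it to pmFetchById(). It trims whitespace, drops entries that are not
# purely numeric (including NA and empty strings), and removes duplicates
# while keeping the original order.
clean_pmids <- function(x) {
  stopifnot(is.character(x))
  x <- trimws(x)
  x <- x[!is.na(x) & grepl("^[0-9]+$", x)]
  unique(x)
}

clean_pmids(c(" 34813985", "34813456", "34813985", "PMC123", NA, ""))
```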

&nbsp;

## Citation enrichment

pubmedR exposes three helpers to retrieve citation information via NCBI's
ELink service (based on PubMed Central):

- `pmCitedBy(pmid)` - returns the PMIDs of articles citing the given article;
- `pmReferences(pmid)` - returns the PMIDs of articles referenced by the given article;
- `pmEnrichCitations(df)` - adds a `TC` (times cited) column and a `CR` (cited
  references) column to a pubmedR data frame.

Example:

    cites <- pmCitedBy(pmid = "25824007")
    cites$count
    cites$cited_by

    refs <- pmReferences(pmid = "25824007")
    refs$count
    refs$references

    # Add citation counts and references to an existing data frame
    M_enriched <- pmEnrichCitations(M, api_key = api_key)

Note: citation data in PubMed comes from PubMed Central and is less
comprehensive than commercial databases such as Web of Science or Scopus.
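
When citation counts are needed for a handful of PMIDs rather than a whole data frame, a per-PMID lookup can be applied in a loop. The sketch below is a minimal pattern under stated assumptions: `fake_cited_by()` is a stub standing in for `pmCitedBy()` so that the code runs offline, its numbers are invented, and `count_citations()` is a hypothetical helper.

```r
# Stub standing in for pmCitedBy() so the sketch runs offline. It returns a
# list with `count` and `cited_by`, the two elements used in the example
# above; the values are invented.
fake_cited_by <- function(pmid) {
  n <- nchar(pmid) %% 3
  list(count = n, cited_by = character(n))
}

# Hypothetical helper: apply a per-PMID lookup to several PMIDs, pausing
# between calls and recording NA when a lookup fails.
count_citations <- function(pmids, lookup, delay = 0) {
  vapply(pmids, function(id) {
    Sys.sleep(delay)
    out <- tryCatch(lookup(id), error = function(e) NULL)
    if (is.null(out)) NA_real_ else as.numeric(out$count)
  }, numeric(1))
}

count_citations(c("25824007", "1234567"), lookup = fake_cited_by)
```

With the real function, `lookup` could be `function(id) pmCitedBy(pmid = id)`, and `delay` set to about `1 / 3` seconds to stay within the keyless request limit mentioned earlier.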

&nbsp;

## An overview of the collection using bibliometrix

Once you have a data frame `M`, you can use **bibliometrix** for descriptive
and network analyses
(https://CRAN.R-project.org/package=bibliometrix, https://bibliometrix.org/).

    install.packages("bibliometrix")
    library(bibliometrix)

    results <- biblioAnalysis(M)
    summary(results)

    # Main Information about data
    #
    #  Documents                             2918
    #  Sources (Journals, Books, etc.)       1275
    #  Keywords Plus (ID)                    2245
    #  Author's Keywords (DE)                4212
    #  Period                                2000 - 2020
    #  ...