
Extracting Affiliation Information From Pubmed Search String In R

I need some help extracting affiliation information from PubMed search strings in R. I have already successfully extracted affiliation information from a single PubMed ID XML, but

Solution 1:

Have you tried the pubmedR package? https://cran.rstudio.com/web/packages/pubmedR/index.html

library(pubmedR)
library(purrr)
library(tidyr)

my_query <- '(((("diabetes mellitus"[MeSH Major Topic]) AND ("english"[Language])) AND (("2020/01/01"[Date - Create] : "3000"[Date - Create]))) AND ("coronavirus"[MeSH Major Topic])'

my_request <- pmApiRequest(query = my_query,
                            limit = 5)

You can use the built-in function pmApi2df(), as in my_pm_df <- pmApi2df(my_request), but this will not provide affiliations for all authors.

You can use a combination of pluck() and map() from purrr to extract what you need into a tibble.

auth <- pluck(my_request, "data") %>% {
  tibble(
    # one PMID string per article
    pmid = map_chr(., pluck, "MedlineCitation", "PMID", "text"),
    # list-column holding the full nested AuthorList of each article
    author_list = map(., pluck, "MedlineCitation", "Article", "AuthorList")
  )
}

All author data is contained in that nested list, in the Author$AffiliationInfo list (note it is a list because one author can have multiple affiliations).
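
To illustrate how that nested list could be flattened so that every affiliation is kept, here is a minimal sketch. It runs on a small hand-made list, mock_author_list, which is a hypothetical and simplified stand-in for one article's AuthorList; the real object returned by pmApiRequest() may be laid out differently (for example, text may sit one level deeper), so the field paths would need adjusting.

```r
library(purrr)
library(tibble)
library(tidyr)

# Hypothetical, simplified stand-in for one article's AuthorList.
# The structure actually returned by pmApiRequest() may differ.
mock_author_list <- list(
  list(LastName = "Smith", ForeName = "Ann",
       AffiliationInfo = list(list(Affiliation = "Univ A"),
                              list(Affiliation = "Hospital B"))),
  list(LastName = "Lee", ForeName = "Bo",
       AffiliationInfo = list(list(Affiliation = "Institute C"))),
  list(LastName = "Diaz", ForeName = "Eva")  # no affiliation recorded
)

authors <- tibble(
  lastname = map_chr(mock_author_list,
                     function(a) pluck(a, "LastName", .default = NA_character_)),
  firstname = map_chr(mock_author_list,
                      function(a) pluck(a, "ForeName", .default = NA_character_)),
  # list-column: one character vector of affiliations per author
  affil = map(mock_author_list, function(a) {
    map_chr(a$AffiliationInfo,
            function(x) pluck(x, "Affiliation", .default = NA_character_))
  })
)

# One row per author/affiliation pair; keep_empty = TRUE retains authors
# that have no affiliation at all (their affil becomes NA).
authors_long <- unnest(authors, affil, keep_empty = TRUE)
```

Keeping affil as a list-column until the final unnest() means an author with two affiliations simply becomes two rows, rather than losing the second one.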

================================================= EDIT based on comments:

First construct your request URLs. Make sure you replace the value of the email parameter (MYEMAIL@MYDOMAIN.COM) with your own email address:

library(httr)
library(xml2)
library(dplyr)  # for bind_cols(), used in the last step

mypmids <- c("32946812", "32921748", "32921727", "32921708", "32911500", 
             "32894970", "32883566", "32880294", "32873658", "32856805",
             "32856803", "32820143", "32810084", "32809963", "32798472")

my_query <- paste0("https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id=",
                   mypmids,
                   "&retmode=xml&email=MYEMAIL@MYDOMAIN.COM")

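As an aside, efetch also accepts a comma-separated list of IDs, so several PMIDs can be requested with a single URL instead of one URL each. The sketch below only builds such a URL; batched_query is a name made up for this example, and the parsing steps in this answer assume one article per response, so they would have to loop over the PubmedArticle nodes if you went this route.

```r
mypmids <- c("32946812", "32921748", "32921727")

# Collapse the IDs into one comma-separated string so that a single
# efetch call covers all of them (batched_query is a hypothetical name).
batched_query <- paste0(
  "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id=",
  paste(mypmids, collapse = ","),
  "&retmode=xml&email=MYEMAIL@MYDOMAIN.COM"
)
```

This cuts the number of requests, which also makes the rate limit less of a concern.
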
I like to wrap my API requests in safely() to catch any errors, then use map() to loop through the my_query vector. Note that we call Sys.sleep() for 5 seconds after each request to stay within PubMed's rate limit. You can probably cut this down considerably: NCBI's E-utilities guidelines allow up to 3 requests per second without an API key, so check the current limits in the API documentation.

get_safely <- safely(GET)

my_req <- map(my_query, function(z) {
  print(z)
  req <- get_safely(url = z)
  Sys.sleep(5)
  return(req)
})
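
Before parsing, it can help to check which requests actually succeeded. The sketch below shows the shape of what safely() returns, using a made-up function fake_get() in place of GET() so that it runs without network access; with the real my_req list the same is.null(r$error) check would apply.

```r
library(purrr)

# Hypothetical stand-in for GET(): fails for one input so that both
# outcomes of safely() are visible without touching the network.
fake_get <- function(url) {
  if (grepl("bad", url)) stop("request failed: ", url)
  paste("response for", url)
}
fake_get_safely <- safely(fake_get)

demo_req <- map(c("ok-1", "bad-2", "ok-3"), fake_get_safely)

# safely() always returns list(result = ..., error = ...);
# error is NULL when the call succeeded.
succeeded <- map_lgl(demo_req, function(r) is.null(r$error))

# Keep only the results of the successful calls
good_results <- map(demo_req[succeeded], "result")
```

Note that safely() only catches R errors such as a failed connection. An HTTP error status still comes back as a normal response object, so checking httr::http_error() on each result may be worthwhile as well.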

Next we parse each response with content() inside read_xml(). Note that we are parsing the result element of each safely() output, not the output itself:

my_resp <- map(my_req, function(z) {
  read_xml(content(z$result,
                   as = "text",
                   encoding = "UTF-8"))
})

This can probably be cleaned up some, but it works. Coerce the AuthorList to a list and use a combination of map(), pluck() and unnest(). Note that a given author might have more than one affiliation, but I am only plucking the first one.

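my_pm_list <- map(my_resp, function(z) {
  # PubmedArticleSet > PubmedArticle > MedlineCitation
  my_xml <- xml_child(xml_child(z, 1), 1)
  # ".//" keeps the search relative to my_xml rather than the whole document
  pmid <- xml_text(xml_find_first(my_xml, ".//PMID"))
  authinfo <- as_list(xml_find_all(my_xml, ".//AuthorList"))
  return(list(pmid, authinfo))
})

# Keep only the list of authors (the first AuthorList) for each article
myauthinfo <- map(my_pm_list, function(z) {
  z[[2]][[1]]
})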

mytibble <- myauthinfo %>% {
  tibble(
    lastname = map_depth(., 2, pluck, "LastName", 1, .default = NA_character_),
    firstname = map_depth(., 2, pluck, "ForeName", 1, .default = NA_character_),
    affil = map_depth(., 2, pluck, "AffiliationInfo", "Affiliation", 1, .default = NA_character_)
  )
}

my_unnested_tibble <- mytibble %>%
  bind_cols(pmid = map_chr(my_pm_list, pluck, 1)) %>%
  unnest(c(lastname, firstname, affil))
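
The code above keeps only the first affiliation per author. If you need all of them, one possible approach is to query the XML directly with XPath instead of going through as_list(). The sketch below is an illustration only: sample_xml is a hand-written document in the general shape of an efetch response (not real PubMed data), and joining the affiliations with "; " is an arbitrary choice.

```r
library(xml2)
library(tibble)

# Hand-written sample shaped like an efetch response (not real data)
sample_xml <- read_xml("<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>00000001</PMID>
      <Article>
        <AuthorList>
          <Author>
            <LastName>Smith</LastName>
            <ForeName>Ann</ForeName>
            <AffiliationInfo><Affiliation>Univ A</Affiliation></AffiliationInfo>
            <AffiliationInfo><Affiliation>Hospital B</Affiliation></AffiliationInfo>
          </Author>
          <Author>
            <LastName>Lee</LastName>
            <ForeName>Bo</ForeName>
          </Author>
        </AuthorList>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>")

authors <- xml_find_all(sample_xml, ".//AuthorList/Author")

all_affils <- tibble(
  pmid = xml_text(xml_find_first(sample_xml, ".//MedlineCitation/PMID")),
  lastname = xml_text(xml_find_first(authors, "./LastName")),
  firstname = xml_text(xml_find_first(authors, "./ForeName")),
  # collapse every affiliation of an author into one string; NA if none
  affil = vapply(authors, function(a) {
    affs <- xml_text(xml_find_all(a, "./AffiliationInfo/Affiliation"))
    if (length(affs) == 0) NA_character_ else paste(affs, collapse = "; ")
  }, character(1))
)
```

Applied to the real data, the same body could be wrapped in a function and mapped over my_resp, giving one tibble per article.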
