Extracting Affiliation Information From Pubmed Search String In R
Solution 1:
Have you tried the pubmedR package? https://cran.rstudio.com/web/packages/pubmedR/index.html
library(pubmedR)
library(purrr)
library(tidyr)
my_query <- '(((("diabetes mellitus"[MeSH Major Topic]) AND ("english"[Language])) AND (("2020/01/01"[Date - Create] : "3000"[Date - Create]))) AND ("coronavirus"[MeSH Major Topic])'
my_request <- pmApiRequest(query = my_query,
                           limit = 5)
You can use the built-in function pmApi2df(), as in my_pm_df <- pmApi2df(my_request), but this will not provide affiliations for all authors.
You can use a combination of pluck() and map() from purrr to extract what you need into a tibble.
auth <- pluck(my_request, "data") %>% {
  tibble(
    pmid = map_chr(., pluck, "MedlineCitation", "PMID", "text"),
    author_list = map(., pluck, "MedlineCitation", "Article", "AuthorList")
  )
}
All author data is contained in that nested list, in the Author$AffiliationInfo list (note that it is a list because one author can have multiple affiliations).
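As a rough sketch of how the multiple affiliations could be collapsed per author: the author_list below is a hand-made stand-in, not real pubmedR output, so the element names and nesting are assumptions that may need adjusting against what pmApiRequest() actually returns. get_affils() is a hypothetical helper introduced only for this illustration.

```r
library(purrr)
library(tibble)

# Hand-made stand-in for one article's author list (ASSUMED shape;
# inspect the real object with str() before relying on these names).
author_list <- list(
  Author = list(
    LastName = "Smith", ForeName = "Ann",
    AffiliationInfo = list(Affiliation = "Dept A, Univ X"),
    AffiliationInfo = list(Affiliation = "Inst B, City Y")
  ),
  Author = list(LastName = "Lee", ForeName = "Bo")
)

# Hypothetical helper: join every affiliation of one author with "; ",
# or return NA when the author has none.
get_affils <- function(author) {
  infos <- author[names(author) == "AffiliationInfo"]
  affils <- map_chr(infos, pluck, "Affiliation", .default = NA_character_)
  if (length(affils) == 0) NA_character_ else paste(affils, collapse = "; ")
}

authors <- tibble(
  lastname  = unname(map_chr(author_list, pluck, "LastName", .default = NA_character_)),
  firstname = unname(map_chr(author_list, pluck, "ForeName", .default = NA_character_)),
  affil     = unname(map_chr(author_list, get_affils))
)
```

Joining with "; " keeps one row per author; if one row per affiliation is preferred, the affiliations could instead be kept as a list column and unnested.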
EDIT based on comments:
First construct your request URLs. Make sure you replace the value of the email parameter (MYEMAIL@MYDOMAIN.COM) with your own email address:
library(httr)
library(xml2)
library(dplyr)  # needed later for bind_cols()
mypmids <- c("32946812", "32921748", "32921727", "32921708", "32911500",
             "32894970", "32883566", "32880294", "32873658", "32856805",
             "32856803", "32820143", "32810084", "32809963", "32798472")
my_query <- paste0("https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id=",
                   mypmids,
                   "&retmode=xml&email=MYEMAIL@MYDOMAIN.COM")
I like to wrap my API requests in safely() to catch any errors. Then use map() to loop through the my_query vector. Note that we call Sys.sleep() for 5 seconds after each request to stay within PubMed's rate limit. That pause is conservative: NCBI's E-utilities guidelines allow up to 3 requests per second without an API key, so you can probably cut it down to a second or even less. Check the API documentation.
get_safely <- safely(GET)
my_req <- map(my_query, function(z) {
  print(z)
  req <- get_safely(url = z)
  Sys.sleep(5)
  return(req)
})
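To make the shape of a safely() return value concrete without hitting the network, here is a minimal sketch that wraps log() instead of GET(). The names safe_log, ok, bad, responses and succeeded are made up for the illustration.

```r
library(purrr)

# safely() returns a wrapped function that never throws; each call
# yields a list with two elements, `result` and `error`.
safe_log <- safely(log)

ok  <- safe_log(100, base = 10)   # succeeds: result is set, error is NULL
bad <- safe_log("not a number")   # fails: result is NULL, error is a condition

# Keep only the calls that succeeded before trying to parse them
responses <- list(ok, bad)
succeeded <- keep(responses, function(r) is.null(r$error))
```

The same filter could be applied to my_req so that a failed request does not break the parsing step, since a failed call has a NULL result.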
Next we parse each response with content() inside read_xml(). Note that we are parsing the result element of each safely() output:
my_resp <- map(my_req, function(z) {
  read_xml(content(z$result,
                   as = "text",
                   encoding = "UTF-8"))
})
This can probably be cleaned up some, but it works. Coerce the author information (the AuthorList node) to a list and use a combination of map(), pluck() and unnest(). Note that a given author might have more than one affiliation, but I am only plucking the first one.
# For each parsed response, pull the PMID and the AuthorList (as an R list)
my_pm_list <- map(my_resp, function(z) {
  my_xml <- xml_child(xml_child(z, 1), 1)
  pmid <- xml_text(xml_find_first(my_xml, "//PMID"))
  authinfo <- as_list(xml_find_all(my_xml, ".//AuthorList"))
  return(list(pmid, authinfo))
})
# Keep only the author list of each article
myauthinfo <- map(my_pm_list, function(z) {
  z[[2]][[1]]
})
# One list column per field; only the first affiliation of each author is kept
mytibble <- myauthinfo %>% {
  tibble(
    lastname = map_depth(., 2, pluck, "LastName", 1, .default = NA_character_),
    firstname = map_depth(., 2, pluck, "ForeName", 1, .default = NA_character_),
    affil = map_depth(., 2, pluck, "AffiliationInfo", "Affiliation", 1, .default = NA_character_)
  )
}
# Attach the PMIDs and expand to one row per author
my_unnested_tibble <- mytibble %>%
  bind_cols(pmid = map_chr(my_pm_list, pluck, 1)) %>%
  unnest(c(lastname, firstname, affil))
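If you want every affiliation rather than just the first, one possible alternative is to query the XML directly with XPath instead of converting it to a list. The sketch below runs on a small hand-written XML string that only imitates the efetch layout; the PMID, names and affiliations in it are placeholders, not real records.

```r
library(xml2)

# Hand-written sample that imitates the efetch layout (placeholder values)
xml_txt <- '<PubmedArticleSet><PubmedArticle><MedlineCitation>
  <PMID>00000000</PMID>
  <Article><AuthorList>
    <Author><LastName>Smith</LastName><ForeName>Ann</ForeName>
      <AffiliationInfo><Affiliation>Dept A, Univ X</Affiliation></AffiliationInfo>
      <AffiliationInfo><Affiliation>Inst B, City Y</Affiliation></AffiliationInfo>
    </Author>
    <Author><LastName>Lee</LastName><ForeName>Bo</ForeName></Author>
  </AuthorList></Article>
</MedlineCitation></PubmedArticle></PubmedArticleSet>'

doc <- read_xml(xml_txt)
pmid <- xml_text(xml_find_first(doc, ".//PMID"))
authors <- xml_find_all(doc, ".//AuthorList/Author")

# One row per author; all affiliations of an author are joined with "; "
affil_df <- do.call(rbind, lapply(seq_along(authors), function(i) {
  a <- authors[[i]]
  affils <- xml_text(xml_find_all(a, "./AffiliationInfo/Affiliation"))
  data.frame(
    pmid = pmid,
    lastname = xml_text(xml_find_first(a, "./LastName")),
    firstname = xml_text(xml_find_first(a, "./ForeName")),
    affil = if (length(affils) > 0) paste(affils, collapse = "; ") else NA_character_,
    stringsAsFactors = FALSE
  )
}))
```

Applied to the real data, doc would be one element of my_resp, and the per-article data frames could be stacked with bind_rows().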