Scraping Data From Tables On Multiple Web Pages In R (Football Players)
I'm working on a project for school where I need to collect the career statistics for individual NCAA football players. The data for each player is in this format. http://www.sport
Solution 1:
Here's how you can easily get all the data in all the tables on all the player pages...
First make a list of the URLs for all the players' pages...
require(RCurl); require(XML)
n <- length(letters)
# pre-allocate list to fill
links <- vector("list", length = n)
for(i in 1:n){
print(i) # keep track of what the function is up to
# get all html on each page of the a-z index pages
inx_page <- htmlParse(getURI(paste0("http://www.sports-reference.com/cfb/players/", letters[i], "-index.html")))
# scrape URLs for each player from each index page
lnk <- unname(xpathSApply(inx_page, "//a/@href"))
# skip first 63 and last 11 links as they are constant on each page
lnk <- lnk[-c(1:63, (length(lnk)-10):length(lnk))]
# only keep links that go to players (exclude schools)
lnk <- lnk[grep("players", lnk)]
# now we have a list of all the URLs to all the players on that index page
# but the URLs are incomplete, so let's complete them so we can use them from
# anywhere
links[[i]] <- paste0("http://www.sports-reference.com", lnk)
}
# unlist into a single character vector
links <- unlist(links)
Now we have a vector of some 67,000 URLs (that seems like a lot of players; can that be right?).
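One way to sanity-check that count is to drop duplicates and keep only links that look like player pages. The sketch below is a minimal illustration on a hand-made vector, not on the real scrape: `links_sample` is hypothetical, and the pattern `/cfb/players/<slug>-<number>.html` is an assumption based on the `-1.html` suffix used later in this answer.

```r
# Hypothetical sample standing in for the real `links` vector
links_sample <- c(
  "http://www.sports-reference.com/cfb/players/neli-aasa-1.html",
  "http://www.sports-reference.com/cfb/players/neli-aasa-1.html",  # duplicate
  "http://www.sports-reference.com/cfb/players/a-index.html",      # index page, not a player
  "http://www.sports-reference.com/cfb/players/jeff-aaron-1.html"
)

# Assumed shape of a player page URL: /cfb/players/<slug>-<number>.html
is_player <- grepl("/cfb/players/[^/]+-[0-9]+\\.html$", links_sample)

# Keep player pages only and drop repeated links
clean_links <- unique(links_sample[is_player])

length(links_sample)  # 4 links before cleaning
length(clean_links)   # 2 links after cleaning
```

If the cleaned vector is much shorter than the original, the extra entries were probably repeated or non-player links rather than additional players.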
Second, scrape all the tables at each URL to get their data, like so:
# Go to each URL in the list and scrape all the data from the tables
# this will take some time... don't interrupt it!
# start edit1 here - just so you can see what's changed
# pre-allocate list
all_tables <- vector("list", length = length(links))
for(i in 1:length(links)){
  print(i)
  # error handling - skips to next URL if it gets an error
  result <- try(
    all_tables[[i]] <- readHTMLTable(links[i], stringsAsFactors = FALSE))
  if(inherits(result, "try-error")) next
}
# end edit1 here
# Put player names in the list so we know who the data belong to
# extract names from the URLs to their stats page...
toMatch <- c("http://www.sports-reference.com/cfb/players/", "-1.html")
player_names <- unique(gsub(paste(toMatch, collapse = "|"), "", links))
# assign player names to list of tables
names(all_tables) <- player_names
The result looks like this (this is just a snippet of the output):
all_tables
$`neli-aasa`
$`neli-aasa`$defense
Year School Conf Class Pos Solo Ast Tot Loss Sk Int Yds Avg TD PD FR Yds TD FF
1 *2007 Utah MWC FR DL 2 1 3 0.0 0.0 0 0 0 0 0 0 0 0
2 *2010 Utah MWC SR DL 4 4 8 2.5 1.5 0 0 0 1 0 0 0 0
$`neli-aasa`$kick_ret
Year School Conf Class Pos Ret Yds Avg TD Ret Yds Avg TD
1 *2007 Utah MWC FR DL 0 0 0 0 0 0
2 *2010 Utah MWC SR DL 2 24 12.0 0 0 0 0
$`neli-aasa`$receiving
Year School Conf Class Pos Rec Yds Avg TD Att Yds Avg TD Plays Yds Avg TD
1 *2007 Utah MWC FR DL 1 41 41.0 0 0 0 0 1 41 41.0 0
2 *2010 Utah MWC SR DL 0 0 0 0 0 0 0 0 0
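The scraping loop that produced this list can also be wrapped in a function that uses tryCatch and names each element after its own URL, so names and results cannot drift apart. This is only a sketch under assumptions: `scrape_all`, its `reader` and `delay` parameters, and `fake_reader` are hypothetical names introduced here so the loop can be tried without network access. For a real run, `reader` could be something like `function(u) XML::readHTMLTable(u, stringsAsFactors = FALSE)`.

```r
# Scrape every URL with `reader`, leaving NULL where a page fails.
# `reader` is passed in so the loop can be tested offline.
scrape_all <- function(urls, reader, delay = 0) {
  out <- vector("list", length(urls))
  for (i in seq_along(urls)) {
    res <- tryCatch(reader(urls[i]), error = function(e) NULL)
    # assigning NULL would delete the element, so only store real results
    if (!is.null(res)) out[[i]] <- res
    Sys.sleep(delay)  # optional pause between requests
  }
  # name each element after its own URL, e.g. ".../neli-aasa-1.html" -> "neli-aasa"
  names(out) <- sub("-[0-9]+\\.html$", "", basename(urls))
  out
}

# Hypothetical reader that fails for one URL, standing in for a real fetch
fake_reader <- function(url) {
  if (grepl("broken", url)) stop("could not fetch page")
  list(passing = data.frame(Year = 2000, Yds = 100))
}

urls <- c("http://www.sports-reference.com/cfb/players/ok-player-1.html",
          "http://www.sports-reference.com/cfb/players/broken-player-1.html")
tabs <- scrape_all(urls, fake_reader)

names(tabs)         # "ok-player" "broken-player"
is.null(tabs[[2]])  # TRUE: the failed page is kept as an empty slot
```

Setting `delay` to a second or more is a reasonable courtesy to the site when running over tens of thousands of pages.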
Finally, let's say we just want to look at the passing tables...
# just show passing tables
passing <- lapply(all_tables, function(i) i$passing)
# but lots of NULL in here, and not a convenient format, so...
passing <- do.call(rbind, passing)
And we end up with a data frame that is ready for further analyses (also just a snippet)...
             Year             School Conf Class Pos Cmp Att  Pct  Yds Y/A AY/A TD Int  Rate
james-aaron  1978          Air Force  Ind        QB  28  56 50.0  316 5.6  3.6  1   3  92.6
jeff-aaron.1 2000 Alabama-Birmingham CUSA    JR  QB 100 182 54.9 1135 6.2  6.0  5   3 113.1
jeff-aaron.2 2001 Alabama-Birmingham CUSA    SR  QB  77 148 52.0  828 5.6  4.3  4   6  99.8
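Row names such as jeff-aaron.1 are awkward to filter on, so it can help to store the player name in its own column before stacking. The sketch below shows one way to do that in base R. `all_tables_demo` is a small hypothetical stand-in for `all_tables`, with made-up structure but values taken from the snippet above.

```r
# Hypothetical stand-in for `all_tables`; one player has no passing table
all_tables_demo <- list(
  `james-aaron` = list(passing = data.frame(Year = 1978, Cmp = 28, Att = 56)),
  `neli-aasa`   = list(defense = data.frame(Year = 2007, Solo = 2)),
  `jeff-aaron`  = list(passing = data.frame(Year = c(2000, 2001),
                                            Cmp  = c(100, 77),
                                            Att  = c(182, 148)))
)

# Pull out the passing table for each player (NULL where there is none)
passing_list <- lapply(all_tables_demo, function(p) p$passing)

# Drop players without a passing table
passing_list <- Filter(Negate(is.null), passing_list)

# Add a `player` column to each table, then stack them into one data frame
tagged <- Map(function(df, player) cbind(player = player, df),
              passing_list, names(passing_list))
passing_df <- do.call(rbind, tagged)
rownames(passing_df) <- NULL

passing_df
```

With the name in a column, a single player's seasons can be selected with something like `passing_df[passing_df$player == "jeff-aaron", ]`.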