In my previous post on data science, I defined data science, introduced a small collection of useful technologies for conducting it, and mentioned a few of its applications. I then discussed data acquisition via Quandl, an excellent source for financial and economic data that supports easy integration with R, Python, MATLAB, and Excel, among many other technologies.

Unfortunately, most of the time we will want to obtain data from sources that do not make the process quite so easy. In these instances, we will often be downloading structured data files such as comma-delimited files. If we only want one file, then this is not a big deal. However, the obvious question arises: what if we want more files? Say 10? 100? 1000? What if we have to download some file(s) every month? Every week? Daily? Do we really want to manually download each and every one of these files every single time? Absolutely not!

When conducting spend analysis, one generally has to acquire a large collection of disparate information and clean and aggregate it before digging deeper. As a contrived example, consider obtaining procurement card information from the Gosport Borough Council. Unfortunately, this process is not as straightforward as it seems: the files are spread over multiple webpages. Fortunately, this is just a minor deterrent, thanks in part to Hadley Wickham's excellent R package rvest. An excellent tutorial on using this package for webscraping can be found at R-Bloggers.

We begin by loading the package and identifying the aforementioned webpage.

# load the webscraping package
library(rvest)
# address of the webpage that lists the files
link <- ""

Next we extract the html contents of the page, determine how many pages there are, and construct the url to each.

# read and parse the html of the page
page <- read_html(link)
# identify number of pages with material for download
page_count <- length(html_nodes(page, ""))
# construct link to each page
link <- paste(link, "?p=", seq_len(page_count), sep="")
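
To make the url construction concrete, here is a minimal sketch that runs without touching the web. The address in base_url and the page count of 2 are placeholders chosen for illustration, not the council's actual values. Because the paste functions are vectorised, a single call yields one url per page number.

```r
# hypothetical base address, used only for illustration
base_url <- "https://example.org/downloads"
# assume the number of pages has already been determined (two here)
page_count <- 2
# paste0() is vectorised, so this yields one url per page number
page_links <- paste0(base_url, "?p=", seq_len(page_count))
print(page_links)
```

Using seq_len() rather than 1:page_count means a page count of zero gives an empty vector instead of the sequence 1, 0.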

Now we loop through each of the (two) pages, extract the url to each of the files, and download them to a temporary directory.

# loop through each page
for (i in seq_along(link)) {
  # read html
  page <- read_html(link[i])
  # extract the href of every link in the table of downloadable files
  hrefs <- html_nodes(page, "table.DataGrid.oDataGrid") %>% html_nodes("a") %>% html_attr("href")
  # identify the partial links of interest
  ind <- grep("full&servicetype=Attachment", hrefs)
  # subset to the links of interest
  files <- hrefs[ind]
  # construct full link of interest by adding base url
  files <- paste("", files, sep="")
  # download each file on given page and save in temporary destination
  for (j in seq_along(files)) {
    download.file(files[j], destfile = tempfile())
  }
}
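
The filtering step can be tried in isolation. The sketch below uses made-up href values as a stand-in for what html_attr("href") might return; only the search string comes from the code above. It also shows why seq_along() is a safer loop index than 1:length(x) when a page happens to contain no matching links.

```r
# hypothetical href values, standing in for the output of html_attr("href")
hrefs <- c(
  "/news/index.html",
  "/download?id=101&mode=full&servicetype=Attachment",
  "/contact",
  "/download?id=102&mode=full&servicetype=Attachment"
)
# fixed = TRUE treats the pattern as a literal string, not a regular expression
is_file <- grepl("full&servicetype=Attachment", hrefs, fixed = TRUE)
files <- hrefs[is_file]
print(files)

# seq_along() gives an empty sequence when nothing matched,
# whereas 1:length(x) would run through 1 and then 0
no_matches <- character(0)
iterations <- 0
for (j in seq_along(no_matches)) {
  iterations <- iterations + 1
}
print(iterations)
```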

By placing the commands

ptm <- proc.time()

and

proc.time() - ptm

before and after the code, respectively, we can even time how long the execution takes.

> proc.time() - ptm
   user  system elapsed 
   0.11    0.19    7.35
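
An alternative worth knowing is base R's system.time(), which wraps an expression and reports the same three figures. The sketch below is only an illustration: Sys.sleep() is a stand-in for the download loop so that it runs offline.

```r
# Sys.sleep() is a placeholder for the download loop so the sketch runs offline
timing <- system.time({
  Sys.sleep(0.2)
})
# "elapsed" is the wall-clock time in seconds
print(timing[["elapsed"]])
```

For a network-bound task like this one, elapsed time is the figure of interest: in the output above, user and system time together account for only 0.30 of the 7.35 seconds, and the remainder is spent waiting on the server.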

And... Voila! We have downloaded all 16 files in 7.35 seconds (on a Microsoft Surface Pro 3).

In this post I used webscraping as a means to an end: downloading structured data files. In my next post I'll use it as the primary tool for assimilating data from a website.

If you'd like accurate and efficient help with your company's spend, please contact a professional at Source One Management Services, LLC.

James Patounas
