In my previous post, Data Science, I defined data science, introduced a small collection of technologies useful for conducting it, and mentioned a few of its applications. It is worth emphasizing that when conducting data science we must never lose sight of the goal: arriving at actionable insights. That can only happen when open, effective communication occurs between the data scientist and subject matter experts or management. Hence, it is almost as important for a data scientist to convey their thoughts clearly and concisely as it is to conduct the analyses accurately and efficiently. Of course, all of that is moot if we don't have data to work with in the first place! We therefore begin our data science series with ways to cheaply acquire useful data.

Here I will consider the simplest case: structured data. That is, data that has already been aggregated and structured nicely. Simply put, imagine a table in a spreadsheet with clear headers and appropriately formatted numbers underneath them. A well-known source of structured data, for example, is Yahoo! Finance's stock market prices. Why should we care about structured data in the procurement industry? Because there are a lot of useful (and often free) datasets available on the internet, and application programming interfaces (APIs) often make acquiring structured data very easy.
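To give a sense of just how little code an API can require, here is a minimal sketch that pulls daily stock prices from Yahoo! Finance using the quantmod package. The ticker symbol is only an example; any other publicly listed symbol would work the same way.

# a minimal sketch: pull structured stock price data from Yahoo! Finance
# (the ticker "AAPL" is just an example)
require(quantmod)
prices <- getSymbols("AAPL", src = "yahoo", auto.assign = FALSE)
head(prices)   # daily open, high, low, close, volume, and adjusted close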

One very concrete example of useful structured data is that of indices. When developing a contract for raw materials that exhibit highly volatile pricing, many buyers tie the cost per unit to an industry-recognized index such as those found at Cottrill Research. Being able to easily access these indices allows one to efficiently perform comparative analysis, risk analysis, and forecasting. Another use of structured data is product assessment. For any company looking to procure a product, it would behoove them to understand the current state of the market. Consider, for instance, coffee. By accessing existing time series data, visualizing it, and computing elementary statistical metrics, we can develop a baseline level of knowledge about ground roast coffee and determine whether further in-depth analysis (which will likely cost us money) is worthwhile. Consider the average price of ground roast coffee across all cities in the United States. This information is available on Quandl via the BLS Consumer Price Index (CPI) survey.

We begin by obtaining the dataset with two simple commands.

require(Quandl)
avg_price_coffee <- Quandl("BLSI/APU0000717311")
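Before reshaping the data, it is worth a quick look at what comes back. For this series Quandl should return a data frame with a Date column and a Value column, which are the two fields we reference in the next step; the calls below are simply a sanity check.

# quick sanity check on the structure of the returned data frame
str(avg_price_coffee)
head(avg_price_coffee)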

We next convert our data to a zoo object, which makes it easier to work with, and subset it to include everything from January 1, 2013 through the day of compilation. This requires the zoo package, which also provides the rolling functions used below.

# put the data in an easy-to-manipulate format
require(zoo)
avg_price_coffee <- zoo(avg_price_coffee$Value, order.by = avg_price_coffee$Date)

# subset the data to January 2013 onward
avg_price_coffee <- window(avg_price_coffee, start = "2013-01-01")

As the data appears to be a bit volatile, we will smooth it. We elect to use a 3-month rolling mean and a 3-month rolling standard deviation to better assess the market trend. Assuming that two standard deviations represent a reasonable band, we apply the following commands:

# 3-month rolling mean (trend) and rolling standard deviation (used for the 2 SD band)
trend <- rollapply(avg_price_coffee, 3, mean)
bound <- rollapply(avg_price_coffee, 3, sd)

It remains to plot the data for easy visual inspection. We plot the original data in black, the smoothed line in red, and the deviation bands in blue. We also modify the axis to show every other month.

# plot the data with the trend and two-standard-deviation bands
plot(avg_price_coffee, lwd = 2, ylim = c(4, 6), xaxt = "n", xlab = "",
     ylab = "Average Price ($)",
     main = "Average Price for Coffee in U.S. city average,\n100%, ground roast, all sizes, per lb. (453.6 gm)")
lines(trend, col = 2, lwd = 2)
lines(trend + 2 * bound, col = 4, lwd = 1, lty = 2)
lines(trend - 2 * bound, col = 4, lwd = 1, lty = 2)

# label the x-axis with every other month
n <- length(avg_price_coffee)
axis.Date(1, at = index(avg_price_coffee)[seq(1, n, 2)],
          format = "%b %Y", las = 2)
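Incidentally, if you would rather write the chart straight to an image file for a report or presentation, base R's graphics devices let you wrap the same plotting calls; the file name and dimensions below are arbitrary choices.

# write the same chart to a PNG file (file name and size are arbitrary)
png("avg_price_coffee.png", width = 800, height = 500)
# ... repeat the plot(), lines(), and axis.Date() calls from above ...
dev.off()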

The result is the following graph:

[Figure: average price of ground roast coffee (black), 3-month rolling mean (red), and two-standard-deviation bands (blue, dashed).]
Upon inspection we note that, globally, there appears to be a downward trend in the market and that, in general, the variability over a 3-month period should be expected to be around $0.53, with extreme periods of up to about $1.17. It is interesting to note the apparent mean reversion toward $5.15 during 2014, with a significant spike around June. It would likely be worthwhile to extend the time frame under consideration and rerun this analysis.
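For readers who want to reproduce the rough figures quoted above, one plausible way to arrive at them from the objects already in memory is to summarize the width of the two-standard-deviation band; the exact values will depend on the data available at the time the code is run.

# typical and extreme widths of the two-standard-deviation band
mean(2 * bound, na.rm = TRUE)   # typical 3-month variability
max(2 * bound, na.rm = TRUE)    # width in the most volatile 3-month window
# to extend the time frame, change (or drop) the start date passed to window() and rerun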

If you’d like help understanding how this type of data manipulation can help you gain better control of your spending, please contact a professional at Source One Management Services, LLC.

James Patounas
