In my previous post, data science, I defined data science, introduced a small collection of technologies useful for conducting it, and mentioned a few of its applications. It is important to emphasize that when conducting data science, we must never lose sight of the fact that we are attempting to arrive at actionable insights. This can only be accomplished when open and effective communication occurs between the data scientist and subject matter experts or management. Hence, it is almost as important for a data scientist to convey their thoughts clearly and concisely as it is to conduct the analyses accurately and efficiently. Of course, all of that is null and void if we don't have data to work with in the first place! We therefore begin our data science series with ways to cheaply acquire useful data.
Here I will consider the simplest case: structured data. That is, data that has already been aggregated and structured nicely. Simply put, imagine a table in a spreadsheet with very clear headers and appropriately formatted numbers underneath them. For example, a very well-known source of structured data is Yahoo! Finance’s stock market prices. Why should we care about structured data in the procurement industry? Because there are a lot of useful (and often free) datasets available on the internet. Furthermore, application programming interfaces (APIs) often make acquiring useful structured data very easy.
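To give a sense of just how little code an API can require, here is a minimal sketch, my own illustration rather than part of the analysis below, that pulls Yahoo! Finance stock prices using the quantmod package; the ticker "AAPL" and start date are arbitrary choices.

require(quantmod)

# fetch daily prices for an arbitrary, illustrative ticker from Yahoo! Finance
# (getSymbols assigns the result to an object named after the ticker)
getSymbols("AAPL", src = "yahoo", from = "2013-01-01")
head(AAPL)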
One very concrete example of useful structured data is that of indices. When developing a contract for raw materials that exhibit highly volatile pricing, many buyers tie the cost per unit to an industry-recognized index such as those found at Cottrill Research. Being able to easily access these indices allows one to efficiently perform comparative analysis, risk analysis, and forecasting.

An alternative use of structured data is product assessment. For any company looking to procure a product, it would behoove them to understand the current state of the market. Consider, for instance, coffee. By accessing existing time series data, visualizing the data, and computing elementary statistical metrics, we can develop a baseline level of knowledge about ground roast coffee and determine whether further in-depth analysis (that will likely cost us money) is desirable. Consider the average price of ground roast coffee among all cities in the United States. This information is available via the BLS Consumer Price Index (CPI) Survey on Quandl.
We begin by obtaining the dataset with two simple commands.
require(Quandl)

# pull the BLS average-price series for ground roast coffee
avg_price_coffee <- Quandl("BLSI/APU0000717311")
We next convert our data to a zoo object to make it easier to work with, and subset it to include everything from January 1st, 2013 until the day of compilation.
require(zoo)

# put data in an easy-to-manipulate format
avg_price_coffee <- zoo(avg_price_coffee$Value, order.by = avg_price_coffee$Date)

# subset the data to January 2013 onward
avg_price_coffee <- window(avg_price_coffee, start = "2013-01-01")
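As a quick check that the conversion and subsetting behaved as expected, the following small sketch, using standard zoo accessors, confirms the class and date range of the series:

# verify we now have a zoo series starting no earlier than 2013-01-01
class(avg_price_coffee)
range(index(avg_price_coffee))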
As the data appears to be a bit volatile, we will smooth it. We elect to use a 3-month rolling mean and a 3-month rolling standard deviation to better assess the market trend. Assuming that two standard deviations represent a reasonable band, we apply the following commands:
# 3-month rolling mean (trend) and rolling standard deviation (bandwidth)
trend <- rollapply(avg_price_coffee, 3, mean)
bound <- rollapply(avg_price_coffee, 3, sd)
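One detail worth flagging: by default, zoo's rollapply centers each window on its observation, which is what the figures here use. If a trailing window were preferred, so that each point reflects only the current and prior months, the align argument can be set explicitly; a sketch:

# trailing 3-month window: each value summarizes the current and two prior months
trend_trailing <- rollapply(avg_price_coffee, 3, mean, align = "right")
bound_trailing <- rollapply(avg_price_coffee, 3, sd, align = "right")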
It remains to plot the data for easy visual inspection. We plot the original data in black, the smoothed line in red, and the deviation bands in blue. We also modify the axis to show every other month.
# plot the raw series in black
plot(avg_price_coffee, lwd = 2, ylim = c(4, 6), xaxt = "n", xlab = "",
     ylab = "Average Price ($)",
     main = "Average Price for Coffee in U.S. city average,\n100%, ground roast, all sizes, per lb. (453.6 gm)")

# overlay the smoothed trend in red and the two-standard-deviation bands in blue
lines(trend, col = 2, lwd = 2)
lines(trend + 2 * bound, col = 4, lwd = 1, lty = 2)
lines(trend - 2 * bound, col = 4, lwd = 1, lty = 2)

# label the x-axis with every other month
n <- length(avg_price_coffee)
axis.Date(1, at = index(avg_price_coffee)[seq(1, n, 2)], format = "%b %Y", las = 2)
The result is the following graph:
Upon inspection, we note that globally there appears to be a downward trend in the market and that, in general, the variability over a 3-month period should be expected to be around $0.53, with extreme periods of up to about $1.17. It is interesting to note the apparent mean reversion towards $5.15 during 2014, with a significant spike around June. It would likely be worthwhile to extend the time frame under consideration and rerun this analysis.
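As a starting point for that follow-up, here is a minimal, self-contained sketch of the rerun; the 2010-01-01 start date is an arbitrary choice for illustration:

require(Quandl)
require(zoo)

# re-pull the series and widen the window to 2010 onward (illustrative start date)
raw <- Quandl("BLSI/APU0000717311")
coffee_ext <- window(zoo(raw$Value, order.by = raw$Date), start = "2010-01-01")

# recompute the 3-month rolling statistics over the longer horizon
trend_ext <- rollapply(coffee_ext, 3, mean)
bound_ext <- rollapply(coffee_ext, 3, sd)

# typical and extreme two-standard-deviation bandwidths over the longer period
2 * mean(bound_ext)
2 * max(bound_ext)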
If you’d like help understanding how this type of data manipulation can help you gain better control of your spending, please contact a professional at Source One Management Services, LLC.