There are two ways of consuming reports for decision making. The first is a quick, high-level assessment that answers a well-posed question. These tend to be static reports automated on a recurring basis; an example is a weekly resource utilization report that answers how well resources were used over some time horizon. The second is an insights-driven analysis, where the objective is for the end-user to interact with the report and, in doing so, identify insights or generate new questions to answer. A reasonable way to think of this second use case is as R&D where the underlying asset is data. However, generating these reports involves a great many complexities that must be implemented and managed. Let us first consider the typical data pipeline.

Data Pipeline

A typical pipeline involves the following steps:
  • Integration
  • Preparation, Curation, & Enrichment
  • Exploration & Refinement
  • Analysis
  • Visualization
Every analytics implementation has a data pipeline. In some cases it may be reasonably simple; in others it may be extremely complex. The one certainty is that there will be data flowing in and out. What happens in between will differ significantly depending on the complexity intrinsic to your data, the manipulations your process must perform, and the availability of people, process, and technology to execute.
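The stages above can be sketched as a chain of functions. This is only an illustration of the flow; the function names and the trivial "analysis" are hypothetical, not part of any real framework.

```python
# Illustrative sketch of the pipeline stages above; all names are
# hypothetical, not a real framework's API.

def integrate(sources):
    """Integration: pull raw records from each source into one list."""
    return [record for source in sources for record in source]

def prepare(records):
    """Preparation & curation: drop empty records, normalize keys."""
    return [{k.lower(): v for k, v in r.items()} for r in records if r]

def explore(records):
    """Exploration & refinement: keep only the fields the analysis needs."""
    return [{"value": r["value"]} for r in records if "value" in r]

def analyze(records):
    """Analysis: a trivial stand-in, the mean of the retained values."""
    values = [r["value"] for r in records]
    return sum(values) / len(values) if values else 0.0

def visualize(result):
    """Visualization: stand-in for a charting step."""
    return f"mean value: {result:.2f}"

sources = [[{"Value": 10}, {}], [{"VALUE": 30}]]
print(visualize(analyze(explore(prepare(integrate(sources))))))  # mean value: 20.00
```

Even in this toy form, the shape is the same as a production pipeline: data flows in at one end, a presentation-ready artifact comes out the other, and everything in between depends on what your data requires.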

Data Considerations

The primary considerations for procuring analytics tools are the same as those for any procurement initiative: people, process, technology, and value. At a more technical level, it's important to account for what are known as the V's of data. 


How much data is being processed?

If it's small (i.e., roughly under 600,000 rows), Excel may meet your needs. If it's medium (i.e., it can be held in computer memory), a more technical solution like Python or R may be appropriate. If it's large (i.e., it can't be held in computer memory), a combination of a database and chunking with Python or R may be appropriate. If it's very large, you may need to consider Hadoop and MapReduce.
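The "chunking" approach for larger-than-memory data can be sketched in a few lines: read fixed-size batches and keep only a running aggregate, never the full dataset. This is a minimal sketch using Python's standard library; a real pipeline would typically use a database cursor or a library's built-in chunked reader.

```python
import csv
import io

# Hypothetical sketch: aggregate a CSV too large for memory by
# processing fixed-size chunks and keeping only a running total.
def chunked_sum(lines, column, chunk_size=2):
    reader = csv.DictReader(lines)
    total, chunk = 0.0, []
    for row in reader:
        chunk.append(row)
        if len(chunk) == chunk_size:
            total += sum(float(r[column]) for r in chunk)  # process one chunk
            chunk = []
    total += sum(float(r[column]) for r in chunk)          # final partial chunk
    return total

data = io.StringIO("amount\n10\n20\n30\n40\n50\n")
print(chunked_sum(data, "amount"))  # 150.0
```

The key property is that memory usage depends on the chunk size, not the file size.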

Many modern technologies distribute data access and computation in parallel across lower-cost hardware rather than continuing to optimize vertically. Further efforts have begun on optimizing against the hardware itself, such as SSDs. Keep in mind that if you have a small dataset, processing in parallel may actually slow down your analytics pipeline due to the overhead of the master/worker relationship that must be managed. Furthermore, this assumes that the underlying software you are using addresses parallelization in the first place.
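The master/worker pattern itself is simple to sketch: a coordinator splits the data into chunks and hands each to a worker. This is an illustrative sketch, not a distributed system; note that for data this small, the coordination overhead outweighs any speedup, which is exactly the caution above.

```python
from concurrent.futures import ThreadPoolExecutor

# Illustrative master/worker sketch: the coordinator splits the data
# into chunks and farms each chunk out to a worker. For a dataset this
# small, the coordination overhead would exceed any gain.
def parallel_sum(data, workers=4):
    size = max(1, len(data) // workers)
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(sum, chunks))

print(parallel_sum(list(range(100))))  # 4950
```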

The volume will have significant implications for your process due to limitations on read and write speeds. Pay specific attention to the tasks you will be performing most frequently, as not all technologies are created equal.


What types of data sources are being processed?

Data comes in many different formats. The three classes of data are structured, semi-structured, and unstructured. Example consumables include images, video, audio, numeric, spatial, and text data. Many deliverables are actually a combination of multiple consumables: a video, for example, may include visuals, audio, and text overlaid on top of one another. As such, it's very important to consider the types of data you will be handling and your organization's requirements for ingesting, manipulating, analyzing, and presenting it. For instance, an organization handling Geographic Information Systems (GIS) data will likely have very different primary considerations than a financial institution.


How frequently is the data received? How frequently is it consumed?

Data comes in two types that play a significant role in determining velocity: streaming and non-streaming. Conceptually, a streaming data source can be thought of as an infinite data source processed in small temporal chunks.

Velocity considerations impact our confidence in the accuracy of the data, how powerful (and therefore costly) our technology needs to be, and how complex our analytics use cases can be. For instance, if we need to process data instantly because we receive it at high frequency (e.g., stock market data), it might not be viable to have bi-directional machine communication that properly validates the correctness of the packets being exchanged. Alternatively, if we need to make quick decisions with the data (e.g., a credit card company screening for fraud with its models), we may need disparate analytic processes: an immediate feedback mechanism based on a standard model and a small time horizon, likely run in memory, and a separate system that allows us to build, backtest, and tune a model from a larger pool of data.
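The fast half of that two-tier setup can be sketched as a cheap in-memory check over a short rolling window: flag any value that deviates sharply from the recent mean, and leave the heavier model-building to the separate offline system. The window size and threshold here are illustrative assumptions, not a real fraud model.

```python
from collections import deque

# Hypothetical sketch of the immediate feedback tier: a rolling window
# over the most recent values, flagging anything far above the recent
# mean. Window size and threshold factor are illustrative.
def stream_flags(stream, window=3, factor=3.0):
    recent = deque(maxlen=window)   # small time horizon, held in memory
    flags = []
    for value in stream:
        if len(recent) == window:
            mean = sum(recent) / window
            if value > factor * mean:
                flags.append(value)  # cheap, instant decision
        recent.append(value)
    return flags

print(stream_flags([10, 12, 11, 9, 95, 10, 11]))  # [95]
```

Because the window is bounded, this check runs in constant memory no matter how long the stream is, which is what makes it viable at high velocity.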


What is the quality of the data being processed?

The quality of the data, and the steps one must apply to make it useful, are important technology and resourcing considerations. It is often estimated that 70-80% of a data professional's time is spent cleaning data, and there is definitely truth behind this assertion.
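What that cleaning work looks like in practice is mundane but essential: trimming whitespace, normalizing case, dropping incomplete rows, and removing duplicates. A minimal sketch, with illustrative field names:

```python
# Illustrative cleaning pass over messy records: trim whitespace,
# normalize case, drop rows missing required fields, drop duplicates.
# Field names ("name", "amount") are hypothetical.
def clean(records, required=("name", "amount")):
    seen, out = set(), []
    for r in records:
        r = {k: v.strip() if isinstance(v, str) else v for k, v in r.items()}
        if any(not r.get(field) for field in required):
            continue                      # drop incomplete rows
        r["name"] = r["name"].title()     # normalize case
        key = (r["name"], r["amount"])
        if key not in seen:               # drop duplicates
            seen.add(key)
            out.append(r)
    return out

raw = [{"name": " alice ", "amount": 5},
       {"name": "ALICE", "amount": 5},
       {"name": "", "amount": 7}]
print(clean(raw))  # [{'name': 'Alice', 'amount': 5}]
```

Every one of these rules tends to be discovered the hard way, one bad record at a time, which is why cleaning consumes so much of the pipeline.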


To what degree is the data interconnected?

Data with a clear primary key / foreign key relationship, or columns with a standard recurrent structure, is wonderful because you can easily connect it through joins, vlookups, appends, concatenation, or whatever flavor of connectivity functions you may be familiar with. However, most data is not this easily connected and requires some level of processing and/or entity resolution. Real value and insight typically doesn't come from structured data alone. Rather, it often comes from taking unstructured data and layering it on top of the structured data.
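The easy case, a clean primary key / foreign key relationship, amounts to a simple lookup and merge. A minimal sketch, with hypothetical table and column names:

```python
# Minimal sketch of joining two tables on a shared key: the clean
# primary key / foreign key case. Table and column names are
# illustrative.
def inner_join(left, right, key):
    index = {row[key]: row for row in right}   # build a lookup on the key
    return [{**l, **index[l[key]]} for l in left if l[key] in index]

customers = [{"id": 1, "name": "Ada"}, {"id": 2, "name": "Grace"}]
orders = [{"id": 1, "total": 40}]
print(inner_join(customers, orders, "id"))  # [{'id': 1, 'name': 'Ada', 'total': 40}]
```

When no such key exists, as with most unstructured data, the lookup step above becomes the hard part, and that is exactly where entity resolution comes in.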

Further Considerations

Now that we've properly considered the data that we are handling and what we need to do with it in order to generate insights and visualizations we must also consider the appropriate organizational philosophy.


Do we need backups of our data? Where and how often?

As much as we want everything to work all of the time, this is not a practical expectation. There will be disruptions to your process and technology; the real questions are when and to what extent. It is extraordinarily important to consider how, where, and with what technology the data and associated processes should be developed, version controlled, backed up, and audited. Furthermore, what are the down-time versus up-time requirements? What is the plan in the unlikely scenario of a catastrophic event? Above all, consider how issues will be managed and communicated back to the developers, IT, and end-users, among others.


What can we ethically do with the data, what must we divulge about it, and are our results potentially biased in some way?

This is a consideration that could have a huge PR impact. Ethics has now become a mainstream consideration given the problems that have surfaced in the media around Facebook, Microsoft, and other major tech companies. Suppose you have a machine learning model. Can you reasonably assume that your data scientists built it without any preconceived notions? That is unlikely if they all come from similar academic or cultural backgrounds. Now suppose there are no preconceived notions. Can you assume that the data itself is not the product of a suboptimal underlying societal or cultural norm that the model is reinforcing?


How long should we keep the data? Who should be able to access it? What are the security requirements around it?

Most IT teams would agree that it is very important to know what level of restrictions and security protocol exists around your data. Does the data get transferred server-to-server? Then there should definitely be some form of encryption and decryption protocol in place. What considerations have gone into administrative rights? Who may actually access the data at any given point? Should they be able to access it? These are important considerations that must be managed and continually updated, ideally from a secure, central location that offers easily deployed updates through some level of automation.

Of course, these days security is not limited to administrative and IT rights; there is also a data risk component. What is the shelf life of the data, and how long should it be kept? The more data your company holds that might require a user agreement or NDA, the more risk your company bears if that data is acquired by the wrong people. While unfortunate, it's imperative for every organization to consider the risk/reward surrounding its data, the shelf life of that data, and the importance of insurance surrounding it.


The first step in the journey to analytics technology adoption is identification. From there, it's all about executing against the above. Technology consideration, procurement, and adoption are decisions that should not be taken lightly. Many people think of an analytics solution solely from the perspective of operational efficiencies, visualization capabilities, and/or the insights to be gained from using it. While these are important considerations, they do not capture the full picture. There are inherent complexities in selecting analytics solutions, such as a Business Intelligence (BI) tool, for your organization. As such, all of the relevant stakeholders, including but not limited to IT, software engineering, analytics, operations, and legal, should be involved in the decision-making process. Feel free to review Source One's procurement technology advisory services roadmap or contact us directly if you need further guidance.

James Patounas
