DataOps Pillar: Instrument
By Simon Trewin
What is Instrumentation all about? It is easiest to define through a question.
'Have you ever been on a project that has spent 3 months trying to implement a solution that is not possible due to the quality, availability, timeliness of data, or the capacity of your infrastructure?'
It is interesting, because when you start a project you don't know what you don't know. You are full of enthusiasm about this great vision but you can easily overlook the obvious barriers.
An example of a failure at a bank comes to mind. Thankfully this was not my project, but it serves as a reminder when I follow the DataOps methodology that there is good reason for the discipline it brings.
In this instance, the vision for a new risk system was proposed. The goal: To reduce the processing time so front office risk was available at 7am. This would enable the traders to know their positions and risk without having to rely on spreadsheets, bringing massive competitive advantage. Teams started looking at processes, technology and infrastructure to handle the calculations and the flow of data. A new frontend application was designed, with new calculation engines and protocols were established and the project team held conversations with data sources. But everything was not as it seemed. After speaking to one of the sources, it was clear that the data required to deliver accurate results would not be available until 10am which deemed it worthless.
The cost to the project included the disruption, budget and opportunity which was lost not only from the project team, but the stakeholders.
Instrumenting your pipeline is about:
• Establishing data dependencies up front in the project.
• Understanding the parameters of the challenge before choosing a tool to help you.
• Defining and discussing the constraints around a solution before UAT.
• Avoiding costly dead ends that take years to unwind.
Instrumenting your data pipeline consists of collecting information about the pipeline either to support implementation of a change, or to manage the operations of the pipeline and process.
Collecting information how data gets from source into analytics is key to understanding the different dependencies that exist between data sources. Being able to write this down into a single plan empowers you to pre-empt potential bottlenecks, hold ups and to also establish the critical path.
Data Quality is a key to the pipeline. Can you trust the information that is being sent through? In general, the answer to this question is to code defensively – however the importance of accuracy in your end system will determine how defensive you need to be. Understanding the implications of incorrect data coming through is important. During operation it provides reassurance that the data flowing can be trusted. During implementation it can determine if the solution is viable and should commence.
The types and varieties of data has implications on your pipeline. Underneath these types there are different ways of transporting the data. Within each of these transports there are different formats that require different processing. Understanding the complexity of this is important because it has a large impact on the processing requirements. We have found that on some projects certain translations of data from one format to another has cost nearly 60% of the processing power. Fixing this enabled the project to be viable and deliver.
Understanding the Volume along the data pipeline is key to understanding how you can optimise it. The Kinaesis DataOps methodology enables you to document this and then use it to establish a workable design. A good target operating model and platform enables you to manage this proactively to avoid production support issues, maintenance and rework.
Velocity or throughput
Coupled with the volume of data is the velocity. The interesting thing about velocity is when multiplied by volume it generates pressure, understanding these pressure points enables you to navigate to a successful outcome. When implementing a project, you need to know the answer to this question at the beginning of the analysis phase to establish your infrastructure requirements. For daily operational use it is important to capacity manage the system and predict future demand.
The final instrument is value. All implementations of data pipelines require some cost benefit analysis. In many instances through my career I have had to persuade stakeholders of the value of delivering a use case. It is key to rank the value of your data items against the cost of implementation. Sometimes, when people understand the eventual cost, they will lower the need for the requirement in the first place. This is essential in moving to a collaborative analytics process which is key to your delivery effectiveness.
Instrumentation is as important in a well governed data pipeline as it is in any other production line or engineering process. Let us compare to a factory producing baked beans. The production line has many points of measurement for quality to make sure that the product being shipped is trusted and reliable otherwise the beans will not be delivered to the customer in a satisfactory manner. Learn to instrument your data pipelines and projects, enables reduced risk, improved efficiency and the capacity to deliver trusted solutions.