Articles by Simon Trewin
What is Instrumentation all about? It is easiest to define through a question.
'Have you ever been on a project that has spent 3 months trying to implement a solution that is not possible due to the quality, availability, timeliness of data, or the capacity of your infrastructure?'
It is interesting, because when you start a project you don't know what you don't know. You are full of enthusiasm about this great vision but you can easily overlook the obvious barriers.
An example of a failure at a bank comes to mind. Thankfully this was not my project, but it serves as a reminder when I follow the DataOps methodology that there is good reason for the discipline it brings.
In this instance, the vision for a new risk system was proposed. The goal: to reduce the processing time so that front office risk was available at 7am. This would enable the traders to know their positions and risk without having to rely on spreadsheets, bringing massive competitive advantage. Teams started looking at processes, technology and infrastructure to handle the calculations and the flow of data. A new frontend application was designed, new calculation engines and protocols were established, and the project team held conversations with data sources. But all was not as it seemed. After speaking to one of the sources, it became clear that the data required to deliver accurate results would not be available until 10am, which rendered the solution worthless.
The cost to the project included the disruption, the budget spent and the opportunity lost, borne not only by the project team but also by the stakeholders.
Instrumenting your pipeline is about:
• Establishing data dependencies up front in the project.
• Understanding the parameters of the challenge before choosing a tool to help you.
• Defining and discussing the constraints around a solution before UAT.
• Avoiding costly dead ends that take years to unwind.
Instrumenting your data pipeline consists of collecting information about the pipeline either to support implementation of a change, or to manage the operations of the pipeline and process.
Collecting information about how data gets from source into analytics is key to understanding the dependencies that exist between data sources. Being able to write this down in a single plan empowers you to pre-empt potential bottlenecks and hold-ups, and to establish the critical path.
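As a minimal sketch of what writing the dependencies down in a single plan can look like, the Python below models a pipeline as a dependency graph and works out the earliest time the final output can be ready, along with the chain of steps that determines it. The step names, availability times and durations are invented for illustration and loosely echo the bank example above; they are not taken from a real project.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each step maps to the steps it depends on.
DEPENDS_ON = {
    "trades_feed": [],
    "market_data_feed": [],
    "risk_calculation": ["trades_feed", "market_data_feed"],
    "risk_report": ["risk_calculation"],
}

# Invented figures: minutes after midnight at which each source arrives,
# and minutes of processing needed by each derived step.
AVAILABLE_AT = {"trades_feed": 5 * 60, "market_data_feed": 10 * 60}
DURATION = {"risk_calculation": 45, "risk_report": 15}


def earliest_finish(depends_on, available_at, duration):
    """Return the finish time of every step and the dependency that drives it."""
    finish, driver = {}, {}
    # static_order() yields each step only after all of its dependencies.
    for step in TopologicalSorter(depends_on).static_order():
        deps = depends_on[step]
        if not deps:
            finish[step] = available_at[step]
            driver[step] = None
            continue
        slowest = max(deps, key=lambda dep: finish[dep])
        finish[step] = finish[slowest] + duration[step]
        driver[step] = slowest
    return finish, driver


def critical_path(driver, target):
    """Walk back from the target through the slowest dependency at each step."""
    path = [target]
    while driver[path[-1]] is not None:
        path.append(driver[path[-1]])
    return list(reversed(path))


finish, driver = earliest_finish(DEPENDS_ON, AVAILABLE_AT, DURATION)
path = critical_path(driver, "risk_report")
deadline = 7 * 60  # the 7am target, in minutes after midnight

print("critical path:", " -> ".join(path))
print("report ready at minute", finish["risk_report"])
print("meets deadline:", finish["risk_report"] <= deadline)
```

With these invented figures the report cannot be ready before 11am, because the late market data feed sits on the critical path. That is the kind of finding worth having before any design work starts.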
Data quality is key to the pipeline. Can you trust the information that is being sent through? In general, the answer to this question is to code defensively; however, the importance of accuracy in your end system will determine how defensive you need to be. Understanding the implications of incorrect data coming through is important. During operation, measuring quality provides reassurance that the data flowing can be trusted. During implementation, it can determine whether the solution is viable and should commence.
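One way to picture coding defensively with a tunable level of strictness is sketched below in Python. The record fields (trade_id, notional), the validation rules and the reject-rate threshold are all assumptions made for the example; a real pipeline would take its rules from the accuracy its end system needs.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class QualityReport:
    """Running counts that act as the quality instrument for one batch."""
    total: int = 0
    rejected: int = 0

    @property
    def reject_rate(self) -> float:
        return self.rejected / self.total if self.total else 0.0


def validate(record: dict) -> Optional[str]:
    """Return the reason a record is rejected, or None if it passes."""
    if not record.get("trade_id"):
        return "missing trade_id"
    notional = record.get("notional")
    if isinstance(notional, bool) or not isinstance(notional, (int, float)):
        return "notional is not a number"
    if notional < 0:
        return "negative notional"
    return None


def filter_records(records, max_reject_rate):
    """Keep valid records and say whether the batch as a whole can be trusted."""
    report = QualityReport()
    accepted = []
    for record in records:
        report.total += 1
        if validate(record) is None:
            accepted.append(record)
        else:
            report.rejected += 1
    trusted = report.reject_rate <= max_reject_rate
    return accepted, report, trusted


# Hypothetical batch: one record in four is missing its notional.
batch = [
    {"trade_id": "T1", "notional": 1_000_000},
    {"trade_id": "T2", "notional": 250_000.0},
    {"trade_id": "T3", "notional": None},
    {"trade_id": "T4", "notional": 75_000},
]
accepted, report, trusted = filter_records(batch, max_reject_rate=0.10)
print(len(accepted), report.reject_rate, trusted)
```

The max_reject_rate parameter is where the "how defensive" decision lives: a regulatory report might tolerate no rejects at all, while an exploratory dashboard might accept a few percent.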
The types and varieties of data have implications for your pipeline. Underneath these types there are different ways of transporting the data, and within each of these transports there are different formats that require different processing. Understanding this complexity is important because it has a large impact on the processing requirements. We have found that on some projects certain translations of data from one format to another have cost nearly 60% of the processing power. Fixing this made the project viable and allowed it to deliver.
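A figure such as the share of processing spent on format translation only becomes visible if each stage is timed. The Python sketch below shows one simple way to collect that measurement; the stage names and the JSON sample data are hypothetical, and wall-clock time is used here as a stand-in for processing power.

```python
import json
import time
from collections import defaultdict
from contextlib import contextmanager

# Accumulated wall-clock seconds per pipeline stage.
stage_seconds = defaultdict(float)


@contextmanager
def timed(stage):
    """Add the time spent inside the with-block to the named stage."""
    start = time.perf_counter()
    try:
        yield
    finally:
        stage_seconds[stage] += time.perf_counter() - start


def share_of_time(totals):
    """Convert per-stage seconds into fractions of the overall total."""
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {stage: 0.0 for stage in totals}
    return {stage: seconds / grand_total for stage, seconds in totals.items()}


# Hypothetical workload: translate JSON text into records, then aggregate.
lines = ['{"id": %d, "value": %d}' % (i, i * 2) for i in range(1000)]

with timed("translate"):
    rows = [json.loads(line) for line in lines]

with timed("calculate"):
    total = sum(row["value"] for row in rows)

for stage, share in share_of_time(stage_seconds).items():
    print(f"{stage}: {share:.0%} of measured time")
```

The actual percentages depend on the machine and the data, so none are shown here. The point is that the split between translating and calculating is measured rather than guessed.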
Understanding the Volume along the data pipeline is key to understanding how you can optimise it. The Kinaesis DataOps methodology enables you to document this and then use it to establish a workable design. A good target operating model and platform enables you to manage this proactively to avoid production support issues, maintenance and rework.
Velocity or throughput
Coupled with the volume of data is the velocity. The interesting thing about velocity is that, when multiplied by volume, it generates pressure. Understanding these pressure points enables you to navigate to a successful outcome. When implementing a project, you need to know where these pressure points are at the beginning of the analysis phase to establish your infrastructure requirements. For daily operational use, it is important for managing the capacity of the system and predicting future demand.
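The idea of pressure as volume multiplied by velocity can be put into rough numbers. The Python sketch below assumes volume is measured in megabytes per delivery and velocity in deliveries per hour, and compares the result with an assumed capacity for each stage. The stage names and every figure are invented for illustration.

```python
def pressure(volume_mb_per_delivery, deliveries_per_hour):
    """Pressure on a stage in MB per hour: volume multiplied by velocity."""
    return volume_mb_per_delivery * deliveries_per_hour


# Hypothetical stages with invented volumes, velocities and capacities.
STAGES = {
    "ingest": {"volume_mb": 200, "per_hour": 12, "capacity_mb_per_hour": 5000},
    "enrich": {"volume_mb": 350, "per_hour": 12, "capacity_mb_per_hour": 4000},
    "publish": {"volume_mb": 50, "per_hour": 12, "capacity_mb_per_hour": 1000},
}


def pressure_points(stages):
    """Return the stages whose pressure exceeds capacity, with utilisation."""
    overloaded = {}
    for name, stage in stages.items():
        load = pressure(stage["volume_mb"], stage["per_hour"])
        utilisation = load / stage["capacity_mb_per_hour"]
        if utilisation > 1.0:
            overloaded[name] = utilisation
    return overloaded


for name, utilisation in pressure_points(STAGES).items():
    print(f"{name} is at {utilisation:.0%} of capacity")
```

Here only the enrich stage is over capacity, because the data grows in size there while the delivery rate stays the same. Re-running the same calculation with forecast volumes gives a simple basis for predicting future demand.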
The final instrument is value. All implementations of data pipelines require some cost benefit analysis. In many instances through my career I have had to persuade stakeholders of the value of delivering a use case. It is key to rank the value of your data items against the cost of implementation. Sometimes, when people understand the eventual cost, they will lower the need for the requirement in the first place. This is essential in moving to a collaborative analytics process which is key to your delivery effectiveness.
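Ranking value against cost can start as something very simple. The Python sketch below assumes each data item has been given a value score and a cost score on the same arbitrary scale by the stakeholders; the item names and scores are hypothetical.

```python
# Hypothetical data items with stakeholder-agreed scores (arbitrary units).
ITEMS = [
    {"name": "intraday positions", "value": 9, "cost": 3},
    {"name": "full trade history", "value": 8, "cost": 8},
    {"name": "counterparty ratings", "value": 4, "cost": 1},
]


def rank_by_value_for_cost(items):
    """Order items by value delivered per unit of cost, best first."""
    for item in items:
        if item["cost"] <= 0:
            raise ValueError(f"cost must be positive for {item['name']!r}")
    return sorted(items, key=lambda item: item["value"] / item["cost"], reverse=True)


for item in rank_by_value_for_cost(ITEMS):
    ratio = item["value"] / item["cost"]
    print(f"{item['name']}: {ratio:.1f} value per unit cost")
```

A list like this is a conversation starter rather than a decision. An item with high value but a poor ratio, such as the full trade history in this example, is the kind of requirement stakeholders may choose to scale back once they see the cost.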
Instrumentation is as important in a well governed data pipeline as it is in any other production line or engineering process. Compare it to a factory producing baked beans. The production line has many points of measurement for quality to make sure that the product being shipped is trusted and reliable; otherwise the beans will not reach the customer in a satisfactory condition. Learning to instrument your data pipelines and projects reduces risk, improves efficiency and builds the capacity to deliver trusted solutions.
Looking at the traditional lifecycle for a data development project, there are key constraints that drive all organisations into a waterfall model: data sourcing and hardware provision. Typically, it takes 6 months or more in most organisations to identify and collect data from upstream systems, and even longer to procure hardware. This forces the project into a waterfall approach, where users need to define exactly what they want to analyse 6 months before the capability to analyse it can be created. The critical path on the project plan is dictated by the time taken to procure machines of the correct size to house the data for the business to analyse, and by the time taken to schedule feeds from upstream systems.
One thing I have learnt over my years in the industry is that this is not how users work. Typically, they want to analyse some data to gain new insight, and they want to do it now, while the subject is a priority. In fact, the BCBS 239 requirements and the regulatory demands dictate that this should be how solutions work. With a slow waterfall approach this is simply not possible. Also, what if the new data needed for an analysis takes you beyond the capacity that you have set up, based on what you knew about requirements at the start of the project?
The upfront cost of a large data project includes hardware to provide the required capacity across 3-4 environments, such as Development, Test, Production and Backup. Costs also include the team to build the requirements, map the data and specify the architecture; an implementation team to build the models, integrate and present the data, and optimise for the hardware chosen; and finally a test team to validate that the results are accurate.
This conundrum presents considerable challenges to organisations. On the one hand, the solution offered by IT can only really work in a mechanical way, through scoping, specification, design and build; on the other, business leaders require agile ad-hoc analysis, rapid turnaround and the flexibility to change their minds. The resulting gap creates a divide between business and IT, which benefits neither party. Business users build their own ad-hoc data environments by saving down spreadsheets and datasets, whilst IT builds warehouse solutions that lack the agility to satisfy the needs of the user base.
As a solution, many organisations are now looking to big data technologies. Innovation labs are springing up to load lots of data into lakes to reduce the time to source. Hadoop clusters are being created to provide flexible processing capability, and advanced visual analytics are being used to pull the data together to produce rapid results.
To get this right, a number of frameworks need to be established to prevent the lake from turning into landfill:
• Strong governance driven by a well-defined operating model, business terminology, lineage and common understanding.
• A set of architectural principles defining the data processes, organisation and rules of engagement.
• A clear strategy and model for change control and quality control. This needs to enable rapid development whilst protecting the environment from the introduction of confusion, which is clearly observed in end user environments where many versions of the truth are allowed to exist and confidence in the underlying figures is low.
Kinaesis has implemented solutions to satisfy all of the above in a number of financial organisations. We have a model for building maturity within your data environment; this consists of an initial assessment followed by a set of recommendations and a roadmap for success. Following on from this, we have a considerable number of accelerators to help progress your maturity, including:
• Kinaesis Clarity Control - A control framework designed to turn your end user environments into a controlled, understood asset.
• Kinaesis Clarity Meta Data - Enables you to holistically visualise your lineage data and to make informed decisions on improving the quality and consistency of your analytics platform.
• Kinaesis Clarity Analytics - A cloud-hosted analytics environment that delivers a best practice solution, born out of years of experience and capability, providing analytics on the move to the key decision makers in the organisation.
In addition, and in combination with our partners, we can implement the latest in Dictionaries, Governance, MDM, Reference Data as well as advanced data architectures which will enable you to be at the forefront of the data revolution.
In conclusion, building data platforms can be expensive and high risk. To help reduce this risk there are a number of paths to success:
• Implement the project with best practice accelerators to keep on the correct track, reduce risk and improve time and cost to actionable insight.
• Implement the latest technologies to enable faster time to value and quicker iteration, making sure that you combine this with the latest control and governance structures.
• Use a prebuilt best practice cloud service to deliver the solution rapidly to users through any device anywhere.
Make sure that you combine this with the latest control and governance structures.