Articles by Simon Trewin
A former colleague and I were talking a few days ago and he mentioned that people don't want to buy DataOps. Having given this some thought, I can only conclude that I agree, in the same way that I don't want to buy running shoes but I do want to be fit and healthy. It is interesting to find so many people who are so comfortable with their problems that a solution seems less attractive.
The way to look at DataOps methodology and training is that it is an investment in yourself, or your organisation, that enables you to tackle the needs that you struggle to make progress on. The needs that might resonate more with you are machine learning and AI, digitisation, legacy migration, optimisation and cost reduction, improving your customer experience, and improving the speed of delivery and compliance of your reporting and analytics.
The DataOps methodology provides you with a recipe book of tools and techniques that, if followed, enable you to deliver the data-driven organisation.
One organisation that works with us found that the DataOps tools and techniques enabled them to deliver an important regulation in record time, rebuild the collaboration between stakeholders and form a template for future projects. For more information, please feel free to reach out to me.
What is the extensible platforms pillar within the DataOps methodology? The purpose of the platform within DataOps is to enable the agility within the methodology and to recognise the fact that data science is evolving rapidly. Due to the constant innovation around tools, hardware and solutions, what is cutting edge today could well be out of date tomorrow. What you need to know from your data today may only be the tip of the iceberg once you have productionised the solution and the next requirement could completely change the solution you have proposed. To address this issue, DataOps requires an evolving and extendable platform.
Extensibility of data platforms is delivered in a number of different ways through:
• Infrastructure patterns
• A DataOps Development Approach
• Architecture Patterns
• Data Patterns
Infrastructure Patterns
In most large organisations, data centres and infrastructure teams have many competing priorities, and delivery times for new hardware can be as much as 6-9 months. With data projects this can be the difference between running through agile iterations and implementing waterfall, where you collect requirements to size the hardware up front. To manage the risks, project teams either over-order hardware, creating massive redundancy, or under-order to keep costs down and then suffer large project delays. An example is a big data solution that needs a large number-crunching capability to process metrics: it stresses the system for a number of hours each day, after which the infrastructure sits idle until the next batch of data arrives. The cost to organisations of redundant hardware is significant. The developing answer is the cloud, where servers can be set up with data processes to generate results and then brought down again, reducing the redundancy significantly. Grids and internal clouds offer an on-premises option. To migrate and leverage this flexibility, organisations need to consider their strategy and approach for data migration: lift and shift would duplicate data, so incremental re-engineering makes more sense.
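To make the redundancy argument concrete, here is a back-of-the-envelope sketch in Python. Every figure in it (node count, busy hours, hourly rates) is a hypothetical placeholder rather than a measurement from any project, and it assumes that on-demand capacity costs more per hour than always-on hardware.

```python
from dataclasses import dataclass


@dataclass
class BatchWorkload:
    """Hypothetical daily batch: a cluster that is busy for a few hours, then idle."""
    nodes: int                 # servers needed at peak
    busy_hours_per_day: float  # hours per day the batch actually runs


def utilisation(workload: BatchWorkload) -> float:
    """Fraction of the day that always-on hardware does useful work."""
    return workload.busy_hours_per_day / 24.0


def daily_cost_always_on(workload: BatchWorkload, rate_per_node_hour: float) -> float:
    """Cost when the hardware is provisioned around the clock."""
    return workload.nodes * 24.0 * rate_per_node_hour


def daily_cost_on_demand(workload: BatchWorkload, rate_per_node_hour: float) -> float:
    """Cost when servers are started for the batch and shut down afterwards."""
    return workload.nodes * workload.busy_hours_per_day * rate_per_node_hour


# Assumed figures, for illustration only.
batch = BatchWorkload(nodes=20, busy_hours_per_day=4)
fixed = daily_cost_always_on(batch, rate_per_node_hour=1.00)
elastic = daily_cost_on_demand(batch, rate_per_node_hour=1.50)

print(f"Utilisation of always-on hardware: {utilisation(batch):.0%}")
print(f"Always-on: {fixed:.2f} per day, on-demand: {elastic:.2f} per day")
```

Even with a higher hourly rate, the on-demand option comes out cheaper in this sketch because the hardware is only paid for while the batch runs. With your own figures the balance may differ, which is the point of writing the numbers down.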
DataOps Development Approach
A DataOps development approach enables the integration of Data Science with Engineering leading to innovation reaching production quality levels more rapidly and at lower risk. Results with data projects are best when you can use tools and techniques directly on the data to prototype, profile, cleanse and build analytics on the fly. This agile approach requires you to build a bridge to the data engineers who can take the data science and turn it into a repeatable production quality process. The key to this is a DataOps development approach that builds operating models and patterns to promote analytics into production quickly and efficiently.
Architecture Patterns
One of the challenges in driving innovation and agility in data platforms forwards is the architecture of production quality data with traceability and reusable components. Too small and these components become a nightmare to join and use; too large and too much is hardcoded, hampering reuse. Often data in production will need to be shared with the data scientists. This is difficult because the production processes can break a poorly formed process, and poor documentation can lead to numbers being used out of context. Complexity exists where outputs from processes become inputs to other processes, and sometimes the reverse, creating a tangle of dependencies. The key to solving this is building out architecture patterns that enable reuse of common data in a governed way, but with the ability to enrich the data with business specific content within the architecture. Quality processes need to be embedded along the data path.
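One way to spot a tangle of dependencies early is to record which process consumes which output and let a topological sort reject any loop. The sketch below uses Python's standard `graphlib` module. The process names and the `build_order` helper are hypothetical and only illustrate the idea.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical processes mapped to the processes whose outputs they consume.
dependencies = {
    "positions": set(),
    "market_data": set(),
    "risk_calc": {"positions", "market_data"},
    "risk_report": {"risk_calc"},
}


def build_order(deps: dict[str, set[str]]) -> list[str]:
    """Return a valid run order, or raise ValueError if processes feed each other."""
    try:
        return list(TopologicalSorter(deps).static_order())
    except CycleError as err:
        # err.args[1] holds the nodes that form the cycle.
        raise ValueError(f"Circular dependency: {' -> '.join(err.args[1])}") from err


order = build_order(dependencies)
print(order)

# Feeding the report back into an upstream process creates a loop.
tangled = {**dependencies, "positions": {"risk_report"}}
try:
    build_order(tangled)
except ValueError as problem:
    print(problem)
```

Running a check like this whenever a new feed is registered keeps the dependency map honest, assuming the map itself is maintained alongside the processes.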
Data Patterns
The final challenge is to organise data within the system in logical patterns that allow it to be extended rapidly for individual use cases, while forming a structure from which to maintain governance and control. Historically, and with modern tools, analytical schemas enable slice and dice on known dimensions, which is great for known workloads. To deliver extensibility, DataOps requires a more flexible data pattern to generate either one-off analytics or to tailor analytics to individual use cases. The data pattern and organisation need to allow for trial and error, but with this comes a need for discipline. Metadata should be kept up to date and in line with the data itself. External or enrichment data needs to be integrated almost instantly and then either removed again or promoted into a production ready state. To do this you need patterns which allow for the federation of the data schemas.
The capabilities above combine to enable you to create an extensible platform as part of an overall DataOps approach. Marry this up with the other five pillars of DataOps and each new requirement should become an extension to your data organisation rather than a brand new system or capability.
By Simon Trewin
Are you amazed by how quickly business glossaries fill up and become hard to use? I have been involved with large complex organisations with numerous departments whose teams have tried to document their data and reports without proper guidance. Typically, the results I have witnessed are glossaries 10,000 lines long, with different grains of information being entered, technical terms being uploaded alongside business terms and everything at a consistent level. What is the right way to implement a model to fill out a glossary to make it useful in this circumstance?
Many organisations have tried to implement a directed approach through the CDO, leveraging budgets for BCBS 239 and other regulatory compliance initiatives to build out their data glossaries. Attempts have been made to create both federated models and centralised models for this initiative; however, I have yet to see an organisation succeed in building out a resource that truly adds value. Every implementation seems to be a tax on the workforce, who show it little enthusiasm, care or attention.
If you want to avoid falling into a perpetual circle of disappointment and wasted time, here are some tips that I have picked up in my years working with data:
• Understand the scope of your terms. It is likely that there will be many representations of Country, for instance Country of Risk, Country of Issue and so on. Understand which one you have. Ask yourself why the term you are entering exists: was it because a regulator referred to it in a report, or is it a core term?
• Make terminology add value. Make it useful in the applications that surface data, for example as context-sensitive help. If someone must keep seeing a bad term when they hover their mouse, they are more likely to fix it.
• Link it to technical terms. If a dictionary term does not represent something physical then it becomes a theoretical concept, which is good for providing food for debate for many years, but not very helpful to an organisation.
• Communicate using the terms. They should provide clarity of understanding through the organisation, but quite often they establish language barriers. Make sure that people can find the terms in the appropriate resource efficiently so that they can use modern search to enhance their learning.
• Build relationships between terms. Language requires context to enable it to be understood. Context is provided through relationships.
• Set out your structure, principles and rules before employing a glossary tool. Setting an organisation loose on glossary tools before setting them up correctly is a recipe for a lot of head scratching and wasted budget.
• Start small and test your model for the glossary before you try to document the whole world.
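Several of these tips can be tested on a small model before any glossary tool is bought. The following is a minimal sketch in Python. The `Term` fields, the column names and the definitions are all illustrative assumptions, not the schema of any particular product.

```python
from dataclasses import dataclass, field


@dataclass
class Term:
    """One glossary entry. All field names are illustrative assumptions."""
    name: str
    definition: str
    scope: str                                                 # e.g. "core" or "regulatory"
    physical_columns: list[str] = field(default_factory=list)  # links to technical terms
    related: dict[str, str] = field(default_factory=dict)      # other term -> relationship


def review(terms: list[Term]) -> list[str]:
    """Return readable findings for entries that break the tips above."""
    known = {t.name for t in terms}
    findings = []
    for t in terms:
        if not t.physical_columns:
            findings.append(f"{t.name}: not linked to anything physical")
        if not t.related:
            findings.append(f"{t.name}: no relationships, so no context")
        for other in t.related:
            if other not in known:
                findings.append(f"{t.name}: relates to unknown term '{other}'")
    return findings


# A deliberately small, made-up glossary to test the model.
glossary = [
    Term("Country", "A sovereign state.", "core", ["ref.country.iso_code"],
         {"Country of Risk": "is specialised by"}),
    Term("Country of Risk", "Country where the exposure ultimately sits.", "core",
         ["risk.exposure.country_of_risk"], {"Country": "is a kind of"}),
    Term("Country of Issue", "Country where an instrument was issued.", "regulatory"),
]

for finding in review(glossary):
    print(finding)
```

Here the third term is flagged twice, once for having no physical link and once for having no relationships. A handful of terms run through a check like this is usually enough to show whether the structure and rules hold up.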
I am not saying that this is easy, but following the rules above is likely to set you up for success.
By Simon Trewin
What is Instrumentation all about? It is easiest to define through a question.
'Have you ever been on a project that has spent 3 months trying to implement a solution that is not possible due to the quality, availability, timeliness of data, or the capacity of your infrastructure?'
It is interesting, because when you start a project you don't know what you don't know. You are full of enthusiasm about this great vision but you can easily overlook the obvious barriers.
An example of a failure at a bank comes to mind. Thankfully this was not my project, but it serves as a reminder when I follow the DataOps methodology that there is good reason for the discipline it brings.
In this instance, the vision for a new risk system was proposed. The goal was to reduce the processing time so that front office risk was available at 7am. This would enable the traders to know their positions and risk without having to rely on spreadsheets, bringing massive competitive advantage. Teams started looking at processes, technology and infrastructure to handle the calculations and the flow of data. A new frontend application was designed, new calculation engines and protocols were established, and the project team held conversations with data sources. But all was not as it seemed. After speaking to one of the sources, it became clear that the data required to deliver accurate results would not be available until 10am, which rendered the system worthless.
The cost of the project included the disruption, the budget and the lost opportunity, borne not only by the project team but also by the stakeholders.
Instrumenting your pipeline is about:
• Establishing data dependencies up front in the project.
• Understanding the parameters of the challenge before choosing a tool to help you.
• Defining and discussing the constraints around a solution before UAT.
• Avoiding costly dead ends that take years to unwind.
Instrumenting your data pipeline consists of collecting information about the pipeline either to support implementation of a change, or to manage the operations of the pipeline and process.
Collecting information on how data gets from source into analytics is key to understanding the different dependencies that exist between data sources. Being able to write this down in a single plan empowers you to pre-empt potential bottlenecks and hold-ups, and to establish the critical path.
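Writing the dependencies down can be as simple as the sketch below, which works out the earliest delivery time and the critical path for a small pipeline. The sources, steps, arrival times and durations are all hypothetical. They loosely mirror the bank example above, with one source arriving at 10am against a 7am target.

```python
from functools import lru_cache

# Hypothetical pipeline. Times are minutes after midnight.
# Sources have an arrival time; processing steps have a duration and upstream inputs.
source_arrival = {"trades": 5 * 60, "market_data": 6 * 60, "reference_data": 10 * 60}
steps = {
    "enrich": (20, ["trades", "reference_data"]),
    "risk_calc": (30, ["enrich", "market_data"]),
    "front_office_report": (10, ["risk_calc"]),
}


@lru_cache(maxsize=None)
def finish_time(node: str) -> int:
    """Earliest minute at which this node's output can be available."""
    if node in source_arrival:
        return source_arrival[node]
    duration, inputs = steps[node]
    return max(finish_time(i) for i in inputs) + duration


def critical_path(node: str) -> list[str]:
    """Chain of nodes that determines when `node` finishes."""
    path = [node]
    while node in steps:
        node = max(steps[node][1], key=finish_time)  # the latest input drives the schedule
        path.append(node)
    return path[::-1]


def clock(minutes: int) -> str:
    """Format minutes after midnight as HH:MM."""
    return f"{minutes // 60:02d}:{minutes % 60:02d}"


target = 7 * 60
done = finish_time("front_office_report")
print("Critical path:", " -> ".join(critical_path("front_office_report")))
print(f"Earliest delivery {clock(done)}, target {clock(target)}, viable: {done <= target}")
```

With these assumed figures the report cannot be ready before 11:00, and the critical path starts at the late reference data. That is a conversation to have with the source before any calculation engine is built.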
Data quality is key to the pipeline. Can you trust the information that is being sent through? In general, the answer to this question is to code defensively; however, the importance of accuracy in your end system will determine how defensive you need to be. Understanding the implications of incorrect data coming through is important. During operation it provides reassurance that the data flowing can be trusted. During implementation it can determine whether the solution is viable and should commence.
The types and varieties of data have implications for your pipeline. Underneath these types there are different ways of transporting the data. Within each of these transports there are different formats that require different processing. Understanding the complexity of this is important because it has a large impact on the processing requirements. We have found that on some projects certain translations of data from one format to another have cost nearly 60% of the processing power. Fixing this enabled the project to be viable and deliver.
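As a rough illustration of coding defensively, the sketch below checks each incoming record against a few named rules and then decides whether the feed as a whole can be trusted. The rule names, record fields and the 10% tolerance are assumptions made for the example. In practice the tolerance follows from how accurate the end system needs to be.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class Rule:
    """A named check applied to every record. Names here are illustrative."""
    name: str
    check: Callable[[dict[str, Any]], bool]


def _is_number(value: Any) -> bool:
    return isinstance(value, (int, float)) and not isinstance(value, bool)


RULES = [
    Rule("trade_id present", lambda r: bool(r.get("trade_id"))),
    Rule("notional is a number", lambda r: _is_number(r.get("notional"))),
    Rule("currency is a 3-letter code",
         lambda r: isinstance(r.get("currency"), str) and len(r["currency"]) == 3),
]


def validate(records, rules=RULES):
    """Split records into accepted rows and rejected rows with the failed rule names."""
    accepted, rejected = [], []
    for record in records:
        failed = [rule.name for rule in rules if not rule.check(record)]
        if failed:
            rejected.append((record, failed))
        else:
            accepted.append(record)
    return accepted, rejected


def is_trustworthy(accepted, rejected, max_reject_rate: float) -> bool:
    """The tolerance reflects how defensive the end system needs to be."""
    total = len(accepted) + len(rejected)
    return total > 0 and len(rejected) / total <= max_reject_rate


# Made-up feed with two deliberately bad records.
feed = [
    {"trade_id": "T1", "notional": 1_000_000, "currency": "GBP"},
    {"trade_id": "T2", "notional": "n/a", "currency": "USD"},
    {"trade_id": "", "notional": 250_000, "currency": "EURO"},
]
ok, bad = validate(feed)
for record, failed in bad:
    print(record.get("trade_id") or "<missing id>", "failed:", ", ".join(failed))
print("Feed trusted at 10% tolerance:", is_trustworthy(ok, bad, max_reject_rate=0.10))
```

Keeping the failed rule names with each rejected record gives the source system something specific to fix, rather than a bare count of bad rows.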
Understanding the Volume along the data pipeline is key to understanding how you can optimise it. The Kinaesis DataOps methodology enables you to document this and then use it to establish a workable design. A good target operating model and platform enables you to manage this proactively to avoid production support issues, maintenance and rework.
Velocity or throughput
Coupled with the volume of data is the velocity. The interesting thing about velocity is that, when multiplied by volume, it generates pressure. Understanding these pressure points enables you to navigate to a successful outcome. When implementing a project, you need to understand this pressure at the beginning of the analysis phase to establish your infrastructure requirements. For daily operational use it is important to manage the capacity of the system and predict future demand.
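The idea of pressure can be sketched in a few lines of Python. This example reads volume as the size of each batch and velocity as the number of batches per hour, which is one possible interpretation. The stage names, capacities and the 20% headroom threshold are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Stage:
    """One hop in the pipeline. All figures are hypothetical."""
    name: str
    batch_mb: float              # volume: megabytes per batch
    batches_per_hour: float      # velocity: how often a batch arrives
    capacity_mb_per_hour: float  # what the stage can process

    @property
    def pressure(self) -> float:
        """Volume multiplied by velocity, in MB per hour."""
        return self.batch_mb * self.batches_per_hour

    @property
    def headroom(self) -> float:
        """Spare capacity as a fraction; negative means the stage is overloaded."""
        return 1 - self.pressure / self.capacity_mb_per_hour


pipeline = [
    Stage("ingest", batch_mb=500, batches_per_hour=4, capacity_mb_per_hour=4000),
    Stage("transform", batch_mb=500, batches_per_hour=4, capacity_mb_per_hour=1600),
    Stage("publish", batch_mb=50, batches_per_hour=4, capacity_mb_per_hour=1000),
]

# Flag any stage with less than 20% spare capacity (an assumed threshold).
pressure_points = [s.name for s in pipeline if s.headroom < 0.2]
for s in pipeline:
    print(f"{s.name}: {s.pressure:.0f} MB/h, headroom {s.headroom:.0%}")
print("Pressure points:", pressure_points)
```

The same table, refreshed with measured figures, can support day-to-day capacity management as well as the initial sizing.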
The final instrument is value. All implementations of data pipelines require some cost benefit analysis. In many instances through my career I have had to persuade stakeholders of the value of delivering a use case. It is key to rank the value of your data items against the cost of implementation. Sometimes, when people understand the eventual cost, they will lower the need for the requirement in the first place. This is essential in moving to a collaborative analytics process which is key to your delivery effectiveness.
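Ranking value against cost does not need sophisticated tooling. The sketch below scores a few hypothetical data items, orders them by value per unit of cost and picks what fits within an assumed budget. The scores are placeholders that would in practice be agreed with stakeholders, and the greedy selection is a simple heuristic rather than an optimal allocation.

```python
from dataclasses import dataclass


@dataclass
class DataItem:
    """A requested data item with agreed scores (hypothetical units)."""
    name: str
    value: float  # estimated benefit
    cost: float   # estimated cost of implementation

    @property
    def ratio(self) -> float:
        return self.value / self.cost


def rank(items: list[DataItem]) -> list[DataItem]:
    """Highest value per unit of cost first."""
    return sorted(items, key=lambda item: item.ratio, reverse=True)


def select(items: list[DataItem], budget: float) -> list[DataItem]:
    """Greedily take ranked items while they still fit in the budget."""
    chosen, spent = [], 0.0
    for item in rank(items):
        if spent + item.cost <= budget:
            chosen.append(item)
            spent += item.cost
    return chosen


# Made-up backlog for illustration.
backlog = [
    DataItem("intraday positions", value=90, cost=30),
    DataItem("historical tick data", value=40, cost=80),
    DataItem("counterparty hierarchy", value=60, cost=15),
]

for item in rank(backlog):
    print(f"{item.name}: value {item.value}, cost {item.cost}, ratio {item.ratio:.1f}")
print("Within a budget of 50:", [item.name for item in select(backlog, budget=50)])
```

Showing stakeholders a ranked list like this tends to prompt the conversation described above, where an expensive, low-value item is quietly dropped.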
Instrumentation is as important in a well governed data pipeline as it is in any other production line or engineering process. Let us compare it to a factory producing baked beans. The production line has many points of measurement for quality, to make sure that the product being shipped is trusted and reliable; otherwise the beans will not be delivered to the customer in a satisfactory manner. Learning to instrument your data pipelines and projects reduces risk, improves efficiency and builds the capacity to deliver trusted solutions.
Looking at the traditional lifecycle for a data development project, there are key constraints that drive all organisations into a waterfall model: data sourcing and hardware provision. Typically, it takes around 6 months or more in most organisations to identify and collect data from upstream systems, and even longer to procure hardware. This forces the project into a waterfall approach, where users need to define exactly what they want to analyse 6 months before the capability to analyse it can be created. The critical path on the project plan is dictated by the time taken to procure machines of the correct size to house the data for the business to analyse, and by the time taken to schedule feeds from upstream systems.
One thing I have learnt over my years in the industry is that this is not how users work. Typically, they want to analyse some data to learn some new insight and they want to do it now, while the subject is a priority. In fact, the BCBS 239 requirements and the regulatory demands dictate that this should be how solutions work. When you have a slow waterfall approach this is simply not possible. Also, what if the new data needed for an analysis takes you beyond the capacity that you have set up, based on what you knew about requirements at the start of the project?
The upfront cost of a large data project includes hardware to provide the required capacity across 3-4 environments, such as Development, Test, Production and Backup. Costs also include the team to build the requirements, map the data and specify the architecture; an implementation team to build the models, integrate and present the data, and optimise for the hardware chosen; and finally a test team to validate that the results are accurate.
This conundrum presents considerable challenges to organisations. On the one hand, the solution offered by IT can only really work in a mechanical way, through scoping, specification, design and build; on the other, business leaders require agile ad-hoc analysis, rapid turnaround and the flexibility to change their minds. The resulting gap creates a divide between business and IT, which benefits neither party. Business users build their own ad-hoc data environments by saving down spreadsheets and datasets, whilst IT builds warehouse solutions that lack the agility to satisfy the needs of the user base. As a solution, many organisations are now looking to big data technologies. Innovation labs are springing up to load lots of data into lakes to reduce the time to source. Hadoop clusters are being created to provide flexible processing capability, and advanced visual analytics are being used to pull the data together to produce rapid results.
To get this right there are many frameworks that need to be established to prevent the lake from turning into landfill.
• Strong governance driven by a well-defined operating model, business terminology, lineage and common understanding.
• A set of architectural principles defining the data processes, organisation and rules of engagement.
• A clear strategy and model for change control and quality control. This needs to enable rapid development whilst protecting the environment from the introduction of confusion, which is clearly observed in end user environments where many versions of the truth are allowed to exist and confidence in underlying figures is low.
Kinaesis has implemented solutions to satisfy all of the above in a number of financial organisations. We have a model for building maturity within your data environment; this consists of an initial assessment followed by a set of recommendations and a roadmap for success. Following on from this, we have a considerable number of accelerators to help progress your maturity, including:
• Kinaesis Clarity Control - Control framework designed to advance your end user environments to a controlled understood asset.
• Kinaesis Clarity Meta Data - Enables you to holistically visualise your lineage data and to make informed decisions on improving the quality and consistency of your analytics platform.
• Kinaesis Clarity Analytics - A cloud hosted analytics environment to deliver a best practice solution born out of years of experience and capability delivering analytics on the move to the key decision makers in the organisation.
In addition, and in combination with our partners, we can implement the latest in Dictionaries, Governance, MDM, Reference Data as well as advanced data architectures which will enable you to be at the forefront of the data revolution.
In conclusion, building data platforms can be expensive and high risk. To help reduce this risk there are a number of paths to success.
• Implement the project with best practice accelerators to keep on the correct track, reduce risk and improve time and cost to actionable insight.
• Implement the latest technologies to enable faster time to value and quicker iteration, making sure that you combine this with the latest control and governance structures.
• Use a prebuilt best practice cloud service to deliver the solution rapidly to users through any device anywhere.
Make sure that you combine this with the latest control and governance structures.