Articles by Simon Trewin
Simon Trewin provides his input to DevOps online to explain why DataOps is essential to drive business value
Click here to read the article
Author: Simon Trewin, founder of DataOps Thinktank, founder of DataOps Academy, and author of The DataOps Revolution.
What is small data?
Small data is all of the information that you attach to your operational data or big data to make sense of it or to transform it into business insight. Typically, it is owned very close to where data is leveraged for operations or insight. It helps with cleansing, grouping, aggregating, filtering, and tagging, or it helps to drive a business process.
Why is it needed?
Small data is needed because the use cases for operations and data insight change rapidly. This cadence is generally too fast for enterprise IT to keep up with, and I would argue that it is something that they should not try to keep up with.
What are the organisational challenges for those wishing to be data driven and to incorporate ML and AI?
The challenge is that the truth in terms of data often only exists once data pipelines have passed through the small data filters and checkers. Therefore, the accuracy of machine learning and AI models is hindered by the fact that they do not have access to the truth. CDAO strategies are held back by a limited ability to leverage the right information.
Small data is also very easy to copy and to reuse making it hard to maintain and master. It exists within reporting systems and end user applications like Excel and is emailed around, linked, and reused. It is the source of reporting errors that can lead to regulatory fines, missed opportunities, and bad decisions. It can also lead to many versions of the truth preventing organisations from knowing the true state of things and preventing them from making decisions.
It often becomes complex, which makes it hard to unwind and builds up organisational inertia that makes it hard to move forward with a digital strategy. It needs to be incorporated into the overall data strategy for the organisation, but is often considered too hard and complicated to tackle.
What you need to do
The key to small data is to democratise it, incorporating data quality controls, and to master it to empower your employees. This needs to be done incrementally in a system that provides secure transparency through lineage, usage statistics, and links to business terminology. This system should provide an easy migration of assets to enterprise systems for the purpose of enabling the digital enterprise.
To deal with the complexity you need to automate the analysis of your existing estate so that you can group your small data by complexity, importance, risk and dependencies. You can then prioritise the actions to take to make improvements. At this stage you should be able to track improvements through time to see the changes made and the changes still required.
For some small data it is OK for it to remain as small data. However, for completeness it should be logged, monitored and kept up to date through time.
Small data is essential in any organisation to bridge the gap from operational data to business processes and knowledge. It moves quickly due to the nature of changing business requirements; it can quickly become complex and will introduce poor data quality through duplication and inconsistencies. As a CDO you need to incorporate it into your overall strategy if you truly want to deliver the data driven enterprise.
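As a rough illustration of the grouping and prioritisation step, the sketch below scores each small data asset and sorts the estate by that score. The 1 to 5 scales, the field names and the scoring formula are all assumptions made for the example, not part of any Kinaesis product or method.

```python
from dataclasses import dataclass


@dataclass
class SmallDataAsset:
    name: str
    complexity: int    # 1 (simple) to 5 (very complex), hypothetical scale
    importance: int    # 1 to 5, hypothetical scale
    risk: int          # 1 to 5, hypothetical scale
    dependents: int    # number of other assets or reports that read this one


def priority_score(asset: SmallDataAsset) -> int:
    """Illustrative score: important, risky, widely used assets come first."""
    return asset.importance * asset.risk + asset.dependents + asset.complexity


def prioritise(assets):
    """Return assets ordered from highest to lowest priority score."""
    return sorted(assets, key=priority_score, reverse=True)


# Made-up assets, used only to show the ordering.
estate = [
    SmallDataAsset("team_rota.xlsx", complexity=1, importance=1, risk=1, dependents=0),
    SmallDataAsset("fx_rates.xlsx", complexity=2, importance=5, risk=4, dependents=6),
    SmallDataAsset("region_mapping.csv", complexity=3, importance=4, risk=2, dependents=3),
]

for asset in prioritise(estate):
    print(asset.name, priority_score(asset))
```

Re-running the same scoring on later scans of the estate is one simple way to track how the ranking changes over time.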
Kinaesis is very pleased to have Edward Chu join their growing Innovation team. Edward will be focusing his efforts on Acutect, delivering his expertise to our advanced EUC solutions.
Edward is a frontend developer with a degree in Statistics and Financial Mathematics. He previously worked at a start-up and a global IT consultancy firm. His expertise is in the development of web applications although he has also had experience with backend development.
Great Bundle of DataOps Resources to start your DataOps journey!
We have bundled our DataOps Target Pillar course with our DataOps – Scaled Agile Framework (SAFe) video and included a free copy of the DataOps Revolution Book due out on the 6 August. All for the same price as our standard DataOps Target Pillar training course.
To get hold of this bundle follow this link
How do you leverage our DataOps six IMPACT pillars alongside your SAFe agile methodology?
We get asked this question a lot. Many organisations have rolled out a SAFe agile methodology for software development in their organisations. Where does our DataOps Six Pillars methodology fit into this?
This new DataOps academy video provides you with the knowledge you need to understand how to apply SAFe effectively to a data and analytics project by leveraging the Six IMPACT pillars of DataOps.
Click on this link to access this new course.
Having access to reliable data is key to being able to make informed decisions and provide the service levels to your customers that the digital world demands. The first realisation of this has manifested itself in the recognition of the Chief Data Officer role. It has driven executives to employ strategists and consultants to move their organisations forward. The strategy has been described accurately; the gap is that it is hard to drive this through the organisation and deliver the results that are now expected.
DataOps Revolution describes a methodology and approach that is proven to work. It connects the strategy with the people on the ground who must implement the changes required. It is based on real-life scenarios and describes the keys to transformational success. The DataOps approach is told in a narrative style to appeal to and resonate with as many people involved in the data revolution as possible. We hope that it leads you to improve your delivery and the value of your data and analytics projects, and to accelerate your data driven initiatives.
DataOps Revolution is available to pre-order here
Kinaesis’ mission statement is to enable organisations to make better decisions through leveraging data and analytics efficiently and effectively. Over the ten years of working with clients our aims have been to add value and deliver solutions.
During that time, our methodology and software have been developed to enhance our clients’ data capabilities through training, innovative tooling and services. Our amazing engineers are focused on providing innovative solutions to today’s data and analytics challenges.
Our goal in 2021 is to make data capabilities accessible to all. To achieve this, we have enhanced our service offerings and made large Research and Development investments in our knowledge and technical capabilities.
In February 2021 we launched the Kinaesis DataOps Academy platform which we will use to deliver content throughout 2021, specifically our proprietary 6 DataOps Pillars. There is a free introductory course to complete, and coming soon is the Target pillar to be closely followed by further pillar breakdowns. To help you to learn how to leverage these skills into your existing delivery frameworks, we are producing a course on how they integrate with agile methodologies.
Know Your Register (KYR)
In March 2021 we launched version 2 of Kinaesis KYR for the Investment Management community to help bring better data insights to their Sales and Marketing efforts. You can see demonstrations of the platform and its capabilities on our dedicated training site:
In May 2021 Kinaesis Acutect will be launched in its first release. Our aim is to surface “small data” from End User Computing tools (EUCs) and to empower users and IT to work more collaboratively around data solutions. We have plans to take this forward to new levels so that we can help financial organisations get on top of their EUC estates. Take a look at our product overview and continue to watch this space for more announcements.
In the middle of 2021 we are releasing a DataOps book to go hand in hand with our training. The purpose of this is to explain the benefits of following a data driven approach for your data projects. Here we explain in narrative style how the Kinaesis DataOps methodology can be put into practice.
It is going to be an exciting year at Kinaesis where we aim to accelerate your effectiveness with data and analytics through the application of DataOps tools and techniques.
We are really pleased to announce that our world class DataOps training is now online. Over the past few months we have been converting our content into consumable short videos, quizzes and certificates. The first instalment "introduction to DataOps" is FREE and provides an overview of DataOps and our proprietary 6 pillars methodology. In the coming months we will deliver more content on each of the 6 pillars in turn. In addition to "introduction to DataOps" we have provided a video of our recent webinar with DataKitchen: "Differentiation through DataOps in Financial Services".
If you are impatient for results then why not book yourself and a number of your colleagues a DataOps training course with an industry leading professional. Please send all enquiries via this link
Simon Trewin provides insight on the differentiation through DataOps for financial services with Chris Bergh from DataKitchen. For those who missed it please sign up and watch through this link
10 years ago Kinaesis was incorporated. Over the last 10 years we have been very lucky to work with excellent clients, partners and employees.
A big thank you to everyone who has been involved. We are looking forward to adding even more value over the coming years.
Wed, Feb 10th | 11 am ET / 4 pm GMT
When financial institutions use data more efficiently and innovatively they can deliver the product and customer experiences that differentiate them from the competition. Although most financial services companies collect and store tremendous amounts of data, new analytics are delivered through incredibly complex pipelines. Furthermore, governance and security are not optional. Existing and emerging regulations require that financial institutions that collect customer data manage it carefully. Balancing speed, quality, and governance is critical.
In this webinar, Simon Trewin of Kinaesis joins Chris Bergh of DataKitchen to discuss how DataOps enables financial institutions to move fast without breaking things. They’ll cover how DataOps enables organizations to:
- Increase collaboration and transparency;
- Deliver new analytics fast via virtual environments and continuous deployment;
- Increase the quality of data by automating testing across end-to-end pipelines; and
- Balance agility with compliance and control.
They’ll also share real-life examples of how financial service companies have successfully implemented DataOps as the foundation for a digital transformation.
To register for this webinar please use this link
One of the big challenges within organisations is building collaboration between IT and the business. This challenge has increased over the years as businesses have become more adept at using IT tools. For example, knowing your customer and being able to offer them the best product at the right time through advanced CRM requires clever analytics coordinated with good data. Enterprise IT has sped up tremendously over the past few years with the building blocks becoming quicker to integrate and extend. However, this is somewhat of a double-edged sword: the faster it speeds up, the greater the competition and therefore the speed with which solutions need to be implemented. What this leads to is friction between the relatively slow-moving world of IT process and the need for solutions and information in the business.
To address this friction there needs to be healthy leverage of technology in the business in collaboration with Enterprise IT. Through tools like Python, MS Office, Tableau and Qlik, the business is more empowered than ever to implement solutions. Many successful organisations leverage this ability to meet demands from regulation and to advance management information. Over time, these capabilities and solutions start to get more complicated due to the way in which they evolve. At some point this complexity reaches a critical stage and errors happen. This leads to regulatory fines and losses, and normally to a knee-jerk reaction that exacerbates the issue rather than improving it. A proactive solution to this problem is to have a healthy flow from the fast-moving environment in the business into enterprise solutions from IT.
To make this work, what needs to be recognised is that not all information or solutions in the business need to find their way into enterprise IT. The reason for this is that the scope of the data or solution may only exist for one user. For example, a business user may want to see a set of sales orders bucketed into categories based on value, e.g. 0-1,000 | 1,000-5,000 | 5,000-15,000 | 15,000-50,000. It may not be relevant for any other job function to know these categories. Has IT got time to manage these requirements? Is it prudent to spend your budget implementing them inside the data lake with the maintenance that goes with them? I would argue no. If no is your answer, then how do you manage this data that defines a particular business process? When working with clients I find that it is often this small data that is the barrier between IT and the business. Generally, this data is not understood or appreciated by the large processes in IT, but it is also the reason why the data is extracted from the data lake and manipulated in the business when it is used. It is often the reason that the business needs to create EUCs.
How does DataOps help? Firstly, DataOps recognises this data. Within the 6 pillars it discovers the data, categorises it, and architects it. The focus of DataOps is collaboration and extensibility, so the methodology identifies that the data items that undergo the most change need to be located as closely as possible to the change agent. Translating this into the example above, the small data needs to be organised, documented, and owned by the business through IT-enabled systems. This is achieved by defining the right metadata, managing the metadata and then governing the small data in a way that democratises control. In other words, give someone some rope, but make sure they use it to build a bridge between enterprise IT and the business. In short, Kinaesis Acutect and DataOps recognise and implement a methodology and approach that allows you to look after the small data, so that the big data looks after itself.
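To make the sales order example concrete, here is a minimal sketch of how such a bucket definition could be held as a small, business-owned table and applied to order values. The names are hypothetical, and because the ranges as written share their boundaries, the sketch assumes each upper bound is exclusive.

```python
from bisect import bisect_right

# Hypothetical "small data": bucket boundaries owned by one business user.
# Assumption: upper bounds are exclusive, so exactly 1,000 falls in "1,000-5,000".
BUCKET_EDGES = [0, 1_000, 5_000, 15_000, 50_000]
BUCKET_LABELS = ["0-1,000", "1,000-5,000", "5,000-15,000", "15,000-50,000"]


def bucket_for(order_value: float) -> str:
    """Return the label of the bucket that contains order_value."""
    if order_value < BUCKET_EDGES[0] or order_value >= BUCKET_EDGES[-1]:
        return "out of range"
    index = bisect_right(BUCKET_EDGES, order_value) - 1
    return BUCKET_LABELS[index]


def count_by_bucket(order_values):
    """Count orders per bucket, keeping the buckets in their defined order."""
    counts = {label: 0 for label in BUCKET_LABELS}
    counts["out of range"] = 0
    for value in order_values:
        counts[bucket_for(value)] += 1
    return counts


# Made-up order values, used only to show the shape of the result.
print(count_by_bucket([250, 4_200, 4_800, 60_000]))
```

Changing the categories means editing two short lists, which is the kind of change that can stay with the business user as long as the definition is documented and governed.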
This article follows on from the recent articles on DataOps by Kinaesis:
• Why you need DataOps?
• What is DataOps?
• How does DataOps make a difference?
• Get control of your glossaries to set yourself up for success
• Why DataOps should come before (and during) your Analytics and AI
End-user Computing (EUC), and in particular spreadsheet risk, has been well documented recently, with one of the most high-profile headlines concerning the test and trace system for COVID-19 in the UK. Other examples include issues where formulas were implemented incorrectly within a spreadsheet, in one example leading to billions of dollars in loss to JP Morgan due to the miscalculation of their VaR position. However, what is the other side of the coin? What is the lost opportunity of having your data and processes tied up inside many versions of a tool where there is no reconciliation, no data quality checks, and no mastering of information? Industry is moving rapidly towards a data and analytics arms race, where the winners and losers are being decided by the ability of organisations to leverage their time and energy efficiently and effectively. This is driven by advanced analytics that enable firms to:
- Market the right product to the right customer efficiently
- Manage risk and financials effectively, and
- Optimise departments to offer services faster and cheaper than their competitors
To get to the front or even the middle of the pack, you need to be sure of your data. You also need to have your data to hand and available to the analysts and data scientists to run their models. It is still a fact that 80% of a data scientist’s job is taken up with the consolidation, cleansing and conforming of data and training sets prior to being able to deploy their models. Given the escalating salaries of these resources, it is extremely expensive to have them performing work outside of their paid-for skillset. If the real view of data is hidden inside spreadsheets, then it may not even be possible to pull together a valid data set without months of preparation work.
It is not uncommon for hard-working users, in answer to business questions, to have formed a complex web of EUCs and manual processes over many years to resolve business challenges. In organisations where there has been heavy M&A activity, it is not uncommon to find that data and business rules tied up in EUCs are the only place where there is truth in the organisation. However, the truth of one EUC can quite often contradict that of a second EUC that is reporting on or analysing a different business problem. This leads to inconsistencies and eventually a lack of confidence in the results, which means that organisations are not able to capitalise on their information and insight.
One example that comes to mind is where an analyst had diligently and very carefully organised a set of EUCs by date over several years. The total number of snapshots saved was around 90, and they represented the reported truth for several key metrics that the business used. One request came in that required a trend of the KPIs to support business decisions. The task required the analyst to open each of the spreadsheets in turn, conform the data across all 90 instances and then extract the key metrics. The effort involved was around 3 weeks’ work. In this example, by the time the work had completed the business opportunity had passed. What isn’t clear is, could this knowledge have prevented a disaster or enabled the organisation to take advantage of an opportunity?
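The conform-and-extract task described above is mechanical once the snapshots are readable as data. The sketch below assumes each snapshot has already been loaded into a dictionary keyed by its date, and that the same metric appears under different column names over time. The alias table, metric names and values are invented for illustration.

```python
from datetime import date

# Hypothetical column aliases: different snapshots named the same metric differently.
COLUMN_ALIASES = {
    "revenue": "revenue",
    "Revenue (GBP)": "revenue",
    "total_rev": "revenue",
    "headcount": "headcount",
    "Head Count": "headcount",
}


def conform(snapshot: dict) -> dict:
    """Rename known columns to their canonical names and drop unknown ones."""
    return {
        COLUMN_ALIASES[name]: value
        for name, value in snapshot.items()
        if name in COLUMN_ALIASES
    }


def build_trend(snapshots: dict, metric: str) -> list:
    """Return (snapshot_date, value) pairs for one metric, oldest first.

    Snapshots that do not carry the metric are skipped rather than guessed.
    """
    trend = []
    for snapshot_date in sorted(snapshots):
        conformed = conform(snapshots[snapshot_date])
        if metric in conformed:
            trend.append((snapshot_date, conformed[metric]))
    return trend


# Illustrative values only, standing in for the contents of saved spreadsheets.
snapshots = {
    date(2020, 3, 31): {"Revenue (GBP)": 120.0, "Head Count": 14},
    date(2020, 1, 31): {"revenue": 100.0, "headcount": 12},
    date(2020, 2, 29): {"total_rev": 110.0},
}

print(build_trend(snapshots, "revenue"))
```

The alias table is itself small data: it captures the knowledge the analyst applied by hand, and once it is written down the trend can be rebuilt on demand.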
It is clear from the striking headlines that there are risks in having your data and processes in ungoverned EUC processes, as well as potential for regulatory fines. However, it is also as important for people to be aware of the limitations that are caused by a large EUC estate in the modern fast-moving world.
In recognition of our expertise in migrating EUC estates and simplifying complex business processes, Kinaesis have been awarded a discretionary R&D grant from Innovate UK. This will be used to develop a new platform, Kinaesis Acutect, which will accelerate the migration of legacy manual processes thus improving the compliance and competitiveness of the financial services sector.
Innovate UK Executive Chair Dr Ian Campbell said:
“In these difficult times we have seen the best of British business innovation. The pandemic is not just a health emergency but one that impacts society and the economy.
“Kinaesis, along with every initiative Innovate UK has supported through this fund, is an important step forward in driving sustainable economic development. Each one is also helping to realise the ambitions of hard-working people.”
Allan Eyears, Founder at Kinaesis says "This is a very exciting opportunity for us to develop a genuinely unique product to complement our existing consulting capabilities."
Simon Trewin, Founder, CEO at Kinaesis says "We feel very proud that Kinaesis has been chosen for this grant. It is great opportunity for us to leverage our DataOps skills and deliver huge value to our customers."
If you are interested in finding out more then please provide details by following this link
For those who missed the round table session with Mitratech, click here to request access to the video and a new Kinaesis case study.
We often get asked by our clients to differentiate the Kinaesis DataOps methodology from a standard Data Management methodology. Many organisations are implementing standard methodologies and are not seeing the benefits. The key differentiator to me is that DataOps is based around practical actions that over time add up to deliver results greater than the sum of the parts. If delivered correctly across the 6 pillars of DataOps, they help you to transform the organisation to be data driven. The key to this is the integration of people and process to deliver real business outcomes, avoiding paper exercises.
On my consulting travels I find that many industry data management methodologies lay out the theory around implementing a Data Dictionary, for example, and this is taken as a mandate to deliver a dictionary as a business outcome. Within DataOps a dictionary is not a business outcome; it is one of the deliverables of the methodology and an accelerator for delivering a business outcome. This is a subtle difference, but one which leads to the effort of the Data Dictionary being part of the business process and not an additional tax on strained budgets. Within the methodology it is produced as an asset within the project and for the future, to make subsequent projects easier.
Another difference is that in standard data management approaches the methods are quite prescribed and consistent across all different use cases. The nature of the DataOps methodology is that it fits the approach to the problem being solved. For example, some data management problems are highly model driven, like credit scoring, customer propensity, capital calculations, etc. Other problems can be more about reporting and analytics. Each of these requires a different focus, a different emphasis and sometimes a different operating model. The iterative approach gives you the freedom within the methodology to achieve this.
Many data management approaches prescribe an approach that tries to encapsulate all of the data within an organisation. This is a noble cause, but it is a large impediment to making progress. Firstly, in many cases we have found that only 15-20% of the legacy data is ever required to meet existing business cases. Secondly, we find that the shape of data is highly dependent on the use case being implemented, and because you do not know all the current and future use cases it is not pragmatic to do this. By measuring data usage through instrumentation and driving the work through use cases, the data management problems can be simplified to achievable outcomes in short periods that form the foundations of the data management strategy for a business area.
There are many other differentiators within the DataOps methodology however all of them start with the principles that anything you do needs to be implemented and pervasive. The methodology builds its strength from tying business benefit to the process and builds from this. The goal is to deliver value early and often and to leverage the benefits to build momentum to deliver more benefits over time. Once integrated into the operating model then the approach builds and transforms the culture from the ground up. This delivers great benefits to the organisation where many data management methodologies start with great promise and then struggle to gain support when the size and the complexity of the task start to become apparent.
If you would like to find out more about the Differences of DataOps then please do not hesitate to contact me.
In recent meetings with clients we have come across a lot of instances of the need to improve the art of capturing requirements and building the 'Solution Contract'. Typically, these projects are large data analytics and reporting projects where the data is spread across the organisation and needs to be pulled together and analysed in particular ways. A typical data science, data engineering task in today's data centric world.
The problem that people are describing is that they are quite often asked to solve problems that the data does not support, or the requirements process does not extract the true definition of what a solution should be.
We are asked “how can you improve the process of requirements negotiation using the DataOps methodology?”
Kinaesis breaks down its DataOps methodology into the 6 IMPACT pillars. These include:
• Meta Data
• (Extensible) Platforms
• (Collaborative) Analytics
The requirements process is predominantly within the Target Pillar with leverage of the Instrumentation and Meta Data pillars. The Target pillar starts with establishing the correct questions to ask within the requirements process. These questions recognise the need to establish not only output, but people, process and data. You should ask a series of questions to capture this for the immediate requirement, but within the context of an overall Vision.
The second step is then Instrumenting the data and Meta Data. It is important to capture these efficiently and effectively using tools and techniques, but also to run profiling of the data to match to the model and check feasibility. Through the results of this process you can then work Collaboratively with the sponsor and stakeholders to solve the data and process requirements. Using data prototyping methods to illustrate the solution further assists in communicating the agreed output and the identified wrinkles which helps to build the collaboration through shared vision.
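The profiling step can be sketched very simply: count how complete each required field is and flag the ones that look too sparse to support the requirement. The function names and the 10% threshold below are illustrative assumptions, not a prescribed part of the methodology.

```python
def profile_column(rows, column):
    """Summarise one column: row count, missing values and distinct values."""
    values = [row.get(column) for row in rows]
    present = [v for v in values if v not in (None, "")]
    return {
        "rows": len(values),
        "missing": len(values) - len(present),
        "distinct": len(set(present)),
    }


def feasibility_gaps(rows, required_columns, max_missing_ratio=0.1):
    """List required columns whose data looks too sparse to support the requirement.

    max_missing_ratio is an illustrative threshold, not a recommended value.
    """
    gaps = []
    for column in required_columns:
        stats = profile_column(rows, column)
        ratio = stats["missing"] / stats["rows"] if stats["rows"] else 1.0
        if ratio > max_missing_ratio:
            gaps.append((column, round(ratio, 2)))
    return gaps


# Made-up records standing in for a source extract.
rows = [
    {"customer_id": 1, "country": "UK"},
    {"customer_id": 2, "country": ""},
    {"customer_id": 3},
    {"customer_id": 4, "country": "FR"},
]

print(feasibility_gaps(rows, ["customer_id", "country"]))
```

A list of gaps like this gives the sponsor and stakeholders something specific to discuss before the solution contract is agreed.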
We find in our projects that following a structured approach to this part of the project yields results of building consensus, establishing gaps and building trust.
In one particular client engagement this improved the delivery velocity to the level that is the difference between success and failure. In another client engagement we were able to deliver very large and complex solutions within incredibly tight timescales.
The key point is that requirements definitions are there to build a shared contract that defines a solution that is achievable. You therefore need to include the DataOps analysis in the process to achieve the results that you want.
An ex-colleague and I were talking a few days ago and he mentioned that people don't want to buy DataOps. I have given this some thought and can only conclude that I agree, in the same way that I don't want to buy running shoes but I do want to be fit and healthy. It is interesting to find so many people who are so comfortable with their problems that a solution is less attractive.
The way to look at DataOps methodology and training is that it is an investment in yourself, or your organisation that enables you to tackle the needs that you struggle to make progress on. The needs that might resonate more with you are machine learning and AI, Digitisation, Legacy migration, Optimisation and cost reduction, improving your customer experience, and improving the speed to delivery and compliance of your Reporting and Analytics.
The DataOps methodology provides you with a recipe book of tools and techniques that if followed enables you to deliver the data driven organisation.
At one organisation that works with us, they found that working with the DataOps tools and techniques enabled them to deliver an important regulation in record time, rebuild the collaboration between stakeholders and form a template for future projects. For more information, please feel free to reach out to me.
What is the extensible platforms pillar within the DataOps methodology? The purpose of the platform within DataOps is to enable the agility within the methodology and to recognise the fact that data science is evolving rapidly. Due to the constant innovation around tools, hardware and solutions, what is cutting edge today could well be out of date tomorrow. What you need to know from your data today may only be the tip of the iceberg once you have productionised the solution and the next requirement could completely change the solution you have proposed. To address this issue, DataOps requires an evolving and extendable platform.
Extensibility of data platforms is delivered in a number of different ways through:
• Infrastructure patterns
• A DataOps Development Approach
• Architecture Patterns
• Data Patterns
Infrastructure Patterns
In most large organisations, data centres and infrastructure teams have many competing priorities, and delivery times for new hardware can be as much as 6-9 months. With data projects this can be the difference between running through agile iterations and implementing waterfall, where you collect requirements to size the hardware upfront. To manage the risks, project teams either over-order hardware, creating massive redundancy, or, to keep costs down, under-order and then suffer large project delays. An example of this is a big data solution that requires a large number-crunching capability to process metrics, which stresses the system for a number of hours each day, after which the infrastructure sits idle until the next batch of data arrives. The cost to organisations of redundant hardware is significant. The developing answer is the cloud, where servers can be set up with data processes to generate results and then brought down again, reducing the redundancy significantly. Grids and internal clouds offer an on-premises option. To migrate and leverage this flexibility, organisations need to consider their strategy and approach for data migration: lift and shift would duplicate data, so incremental re-engineering makes more sense.
DataOps Development Approach
A DataOps development approach enables the integration of Data Science with Engineering leading to innovation reaching production quality levels more rapidly and at lower risk. Results with data projects are best when you can use tools and techniques directly on the data to prototype, profile, cleanse and build analytics on the fly. This agile approach requires you to build a bridge to the data engineers who can take the data science and turn it into a repeatable production quality process. The key to this is a DataOps development approach that builds operating models and patterns to promote analytics into production quickly and efficiently.
Architecture Patterns
One of the challenges in driving innovation and agility in data platforms forwards is the architecture of production quality data with traceability and reusable components. Too small and these components become a nightmare to join and use, too large and too much is hardcoded hampering reuse. Often data in production will need to be shared with the data scientists. This is difficult because the production processes can break a poorly formed process, and poor documentation can lead to numbers being used out of context. Complexity exists where outputs from processes become inputs to other processes and sometimes in reverse creating a tangle of dependencies. The key to solving this is building out architecture patterns to enable reuse of common data in a governed way, but with the ability to enrich the data with business specific content within the architecture. Quality processes need to be embedded along the data path.
Data Patterns
The final challenge is to organise data within the system in logical patterns that allow it to be extended rapidly for individual use cases, but to form a structure from which to maintain governance and control. Historically and with modern tools, analytical schemas enable slice and dice on known dimensions which is great for known workloads. To deliver extensibility, DataOps requires a more flexible data pattern to generate either one off analytics or to tailor analytics to individual use cases. The data pattern and organisation needs to allow for trial and error but with this there is a need to have discipline. Meta data should be kept up to date and in line with the data itself. External or enrichment data needs to be integrated almost instantly and removed again, or promoted into a production ready state. To do this you need patterns which allow for the federation of the data schemas.
The capabilities above combine to enable you to create an extensible platform as part of an overall DataOps approach. Marry this up with the other 5 pillars of DataOps then each new requirement should become an extension to your data organisation rather than a brand new system or capability.
By Simon Trewin
Are you amazed by how quickly business glossaries fill up and become hard to use? I have been involved with large complex organisations with numerous departments whose teams have tried to document their data and reports without proper guidance. Typically, the results I have witnessed are glossaries 10,000 lines long, with different grains of information being entered, technical terms being uploaded alongside business terms and everything at a consistent level. What is the right way to implement a model to fill out a glossary to make it useful in this circumstance?
Many organisations have tried to implement a directed approach through the CDO, leveraging budgets for BCBS 239 and other regulatory compliance initiatives to build out their data glossaries. Attempts have been made to create both federated models and centralised models for this initiative; however, I have yet to see an organisation succeed in building out a resource that truly adds value. Every implementation seems to be a tax on the workforce, who show it little enthusiasm, care or attention.
If you want to avoid falling into a perpetual circle of disappointment and wasted time, here are some tips that I have picked up in my years working with data:
Understand the scope of your terms. It is likely that there will be many representations of Country, for instance Country of Risk, Country of Issue, etc. Understand which one you have. Ask yourself: why does the term that you are entering exist? Was it because a regulator referred to it in a report, or is it a core term?
Make terminology value add. Make it useful in the applications that surface data, e.g. as context sensitive help. If someone must keep seeing a bad term when they hover their mouse, they are more likely to fix it.
Link it to technical terms. If a dictionary term does not represent something physical then it becomes a theoretical concept, which is good for providing food for debate for many years, but not very helpful to an organisation.
Communicate using the terms. They should provide clarity of understanding through the organisation, but they quite often establish language barriers. Make sure that people can find the terms in the appropriate resource efficiently so that they can use modern search to enhance their learning.
Build relationships between terms. Language requires context to enable it to be understood. Context is provided through relationships.
Set out your structure and your principles and rules before employing a glossary tool. Setting an organisation loose on glossary tools before setting them up correctly is a recipe for a lot of head scratching and wasted budget.
Start small and test your model for the glossary before you try to document the whole world.
I am not saying that this is easy but following the rules above is likely to set you up for success.
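As a minimal sketch of how several of the tips above (scope, links to technical terms, relationships, starting small) could be tested on a pilot glossary before a tool is rolled out: the class, the field names, the definitions and the column names below are all hypothetical and purely illustrative, not a description of any particular glossary product.

```python
from dataclasses import dataclass, field


@dataclass
class GlossaryTerm:
    name: str          # business term, e.g. "Country of Risk"
    scope: str         # why the term exists, e.g. "core" or "regulatory"
    definition: str
    physical_columns: list[str] = field(default_factory=list)  # linked technical terms
    related: dict[str, str] = field(default_factory=dict)      # relationship -> other term


def review_glossary(terms: list[GlossaryTerm]) -> list[str]:
    """Return readable issues so a small pilot glossary can be checked early."""
    names = {t.name for t in terms}
    issues = []
    for t in terms:
        # A term with no physical link stays a theoretical concept.
        if not t.physical_columns:
            issues.append(f"{t.name}: not linked to any physical column")
        # Relationships give context, so they must point at real terms.
        for relation, target in t.related.items():
            if target not in names:
                issues.append(f"{t.name}: '{relation}' points at unknown term {target}")
    return issues


# Illustrative pilot entries only; definitions and column names are made up.
pilot = [
    GlossaryTerm("Country", "core", "A country identified by a standard code.",
                 physical_columns=["ref.country.iso_code"]),
    GlossaryTerm("Country of Risk", "regulatory",
                 "Country where the economic risk of a position sits.",
                 physical_columns=["risk.position.risk_country"],
                 related={"is a kind of": "Country"}),
    GlossaryTerm("Country of Issue", "core",
                 "Country in which a security was issued.",
                 related={"is a kind of": "Country"}),
]

for issue in review_glossary(pilot):
    print(issue)
```

Running a check like this on a handful of terms is one way to agree the structure and rules before anyone is asked to document thousands of lines.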
By Simon Trewin
What is Instrumentation all about? It is easiest to define through a question.
'Have you ever been on a project that has spent 3 months trying to implement a solution that is not possible due to the quality, availability, timeliness of data, or the capacity of your infrastructure?'
It is interesting, because when you start a project you don't know what you don't know. You are full of enthusiasm about this great vision but you can easily overlook the obvious barriers.
An example of a failure at a bank comes to mind. Thankfully this was not my project, but it serves as a reminder when I follow the DataOps methodology that there is good reason for the discipline it brings.
In this instance, the vision for a new risk system was proposed. The goal: to reduce the processing time so front office risk was available at 7am. This would enable the traders to know their positions and risk without having to rely on spreadsheets, bringing massive competitive advantage. Teams started looking at processes, technology and infrastructure to handle the calculations and the flow of data. A new frontend application was designed, new calculation engines and protocols were established, and the project team held conversations with data sources. But everything was not as it seemed. After speaking to one of the sources, it became clear that the data required to deliver accurate results would not be available until 10am, which rendered the solution worthless.
The cost to the project included the disruption, the budget and the opportunity that was lost, not only by the project team but also by the stakeholders.
Instrumenting your pipeline is about:
• Establishing data dependencies up front in the project.
• Understanding the parameters of the challenge before choosing a tool to help you.
• Defining and discussing the constraints around a solution before UAT.
• Avoiding costly dead ends that take years to unwind.
Instrumenting your data pipeline consists of collecting information about the pipeline either to support implementation of a change, or to manage the operations of the pipeline and process.
Collecting information on how data gets from source into analytics is key to understanding the different dependencies that exist between data sources. Being able to write this down into a single plan empowers you to pre-empt potential bottlenecks and hold ups, and to establish the critical path.
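To illustrate what writing the dependencies down into a single plan can look like, here is a minimal sketch that works out when a report can be ready and which chain of inputs forms the critical path. The source names, arrival times and run times are invented for the example, loosely echoing the 7am and 10am story above, and it assumes Python 3.9+ for the standard library `graphlib` module.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical plan: minutes after midnight at which each raw source lands.
source_ready = {"trades": 5 * 60, "market_data": 6 * 60, "reference_data": 10 * 60}

# Hypothetical processing steps: name -> (inputs, run time in minutes).
steps = {
    "positions": (["trades", "reference_data"], 20),
    "risk_calc": (["positions", "market_data"], 45),
    "front_office_report": (["risk_calc"], 10),
}


def earliest_finish(source_ready, steps):
    """Walk the dependency graph; return (finish time, blocking input) per node."""
    graph = {name: set(inputs) for name, (inputs, _) in steps.items()}
    finish = {name: (ready, None) for name, ready in source_ready.items()}
    for node in TopologicalSorter(graph).static_order():
        if node in finish:  # raw source, arrival time already known
            continue
        inputs, minutes = steps[node]
        # A step can only start once its slowest input has arrived.
        blocker = max(inputs, key=lambda name: finish[name][0])
        finish[node] = (finish[blocker][0] + minutes, blocker)
    return finish


def critical_path(finish, target):
    """Follow the blocking inputs back from the target to a raw source."""
    path = [target]
    while finish[path[-1]][1] is not None:
        path.append(finish[path[-1]][1])
    return list(reversed(path))


finish = earliest_finish(source_ready, steps)
deadline = 7 * 60  # 07:00
done, _ = finish["front_office_report"]
print(f"report ready at {done // 60:02d}:{done % 60:02d}, deadline met: {done <= deadline}")
print("critical path:", " -> ".join(critical_path(finish, "front_office_report")))
```

With these made-up numbers the late reference data sits on the critical path, which is exactly the kind of constraint worth discovering before any build work starts.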
Data Quality is a key to the pipeline. Can you trust the information that is being sent through? In general, the answer to this question is to code defensively – however the importance of accuracy in your end system will determine how defensive you need to be. Understanding the implications of incorrect data coming through is important. During operation it provides reassurance that the data flowing can be trusted. During implementation it can determine if the solution is viable and should commence.
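A small sketch of coding defensively, where the tolerance for bad records is a parameter that reflects how important accuracy is in the end system. The function name, field names, sample records and thresholds are assumptions made for illustration.

```python
def check_batch(rows, required_fields, max_reject_rate):
    """Split rows into accepted and rejected, then decide if the batch is trusted."""
    if not rows:
        return [], [], False  # assumption: an empty batch is treated as suspect
    accepted, rejected = [], []
    for row in rows:
        # A row is rejected if any required field is absent, None or empty.
        missing = [f for f in required_fields if row.get(f) in (None, "")]
        (rejected if missing else accepted).append(row)
    reject_rate = len(rejected) / len(rows)
    return accepted, rejected, reject_rate <= max_reject_rate


# Hypothetical trade records; the field names are illustrative only.
batch = [
    {"trade_id": "T1", "notional": 1_000_000, "currency": "GBP"},
    {"trade_id": "T2", "notional": 250_000, "currency": "USD"},
    {"trade_id": "T3", "notional": None, "currency": "EUR"},
    {"trade_id": "T4", "notional": 75_000, "currency": "USD"},
]
required = ["trade_id", "notional", "currency"]

# A regulatory report might tolerate almost no rejects; an exploratory view more.
_, rejected, strict_ok = check_batch(batch, required, max_reject_rate=0.01)
_, _, lenient_ok = check_batch(batch, required, max_reject_rate=0.30)
print(len(rejected), strict_ok, lenient_ok)
```

The same check serves both purposes described above: during operation it gates each batch, and during implementation the observed reject rate on sample data helps decide whether the solution is viable.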
The types and varieties of data have implications for your pipeline. Underneath these types there are different ways of transporting the data. Within each of these transports there are different formats that require different processing. Understanding the complexity of this is important because it has a large impact on the processing requirements. We have found that on some projects certain translations of data from one format to another have cost nearly 60% of the processing power. Fixing this enabled the project to be viable and deliver.
Understanding the Volume along the data pipeline is key to understanding how you can optimise it. The Kinaesis DataOps methodology enables you to document this and then use it to establish a workable design. A good target operating model and platform enables you to manage this proactively to avoid production support issues, maintenance and rework.
Velocity or throughput
Coupled with the volume of data is the velocity. The interesting thing about velocity is that, when multiplied by volume, it generates pressure. Understanding these pressure points enables you to navigate to a successful outcome. When implementing a project, you need to know where these pressure points are at the beginning of the analysis phase in order to establish your infrastructure requirements. For daily operational use, it is important to capacity manage the system and predict future demand.
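One possible way to make the idea of pressure concrete is sketched below. It assumes volume is measured as megabytes per message, velocity as messages per hour, and that each stage has a known hourly capacity; these units, the stage names, the figures and the 80% threshold are all assumptions for the example rather than part of the methodology.

```python
# Hypothetical stages: (name, MB per message, messages per hour, capacity in MB per hour)
stages = [
    ("ingest", 2.0, 3000, 10000),
    ("transform", 2.0, 3000, 5000),
    ("publish", 0.5, 3000, 4000),
]


def pressure_points(stages, threshold=0.8):
    """Return stages whose pressure (volume x velocity) nears or exceeds capacity."""
    hot = []
    for name, volume, velocity, capacity in stages:
        pressure = volume * velocity          # MB per hour flowing through the stage
        utilisation = pressure / capacity     # share of the stage's capacity in use
        if utilisation >= threshold:
            hot.append((name, round(utilisation, 2)))
    return hot


print(pressure_points(stages))
```

Re-running the same calculation with forecast volumes is a simple starting point for the capacity management and demand prediction mentioned above.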
The final instrument is value. All implementations of data pipelines require some cost benefit analysis. In many instances through my career I have had to persuade stakeholders of the value of delivering a use case. It is key to rank the value of your data items against the cost of implementation. Sometimes, when people understand the eventual cost, they will lower the need for the requirement in the first place. This is essential in moving to a collaborative analytics process which is key to your delivery effectiveness.
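Ranking value against cost can start as something very simple. The sketch below assumes each data item has been given a value score by stakeholders and an estimated implementation cost in days; the items, scores and the value-per-day ratio are illustrative assumptions, and a real cost benefit analysis would consider more than a single ratio.

```python
# Hypothetical data items with stakeholder value scores and cost estimates.
data_items = [
    {"name": "intraday positions", "value": 9, "cost_days": 30},
    {"name": "counterparty ratings", "value": 6, "cost_days": 5},
    {"name": "historical tick data", "value": 4, "cost_days": 40},
]


def rank_by_value_for_cost(items):
    """Order items by value delivered per day of implementation effort, best first."""
    return sorted(items, key=lambda item: item["value"] / item["cost_days"], reverse=True)


for item in rank_by_value_for_cost(data_items):
    ratio = item["value"] / item["cost_days"]
    print(f"{item['name']}: {ratio:.2f} value per day")
```

Putting the list in front of stakeholders supports the conversation described above, where seeing the eventual cost can change how strongly a requirement is needed.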
Instrumentation is as important in a well governed data pipeline as it is in any other production line or engineering process. Compare it to a factory producing baked beans. The production line has many points of measurement for quality to make sure that the product being shipped is trusted and reliable; otherwise the beans will not be delivered to the customer in a satisfactory manner. Learning to instrument your data pipelines and projects reduces risk, improves efficiency and gives you the capacity to deliver trusted solutions.
Looking at the traditional lifecycle for a data development project, there are key constraints that drive all organisations into a waterfall model. These are data sourcing and hardware provision. Typically, it takes around 6 months or more in most organisations to be able to identify and collect data from upstream systems, and even longer to procure hardware. This then forces the project into a waterfall approach, where users need to define exactly what they want to analyse 6 months before the capability to analyse it can be created. The critical path on the project plan is dictated by the time taken to procure machines of the correct size to house the data for the business to analyse and the time taken to schedule feeds from upstream systems. One thing I have learnt over my years in the industry is that this is not how users work. Typically, they want to analyse some data to learn some new insight and they want to do it now, while the subject is a priority. In fact, the BCBS 239 requirements and the regulatory demands dictate that this should be how solutions work. When you have a slow waterfall approach this is simply not possible. Also, what if the new data needed for an analysis takes you beyond the capacity that you have set up, based on what you knew about requirements at the start of the project?
The upfront cost of a large data project includes hardware to provide the required capacity across 3-4 environments, such as Development, Test, Production and Backup. Costs also include the team to build the requirements, map the data and specify the architecture; an implementation team to build the models, integrate and then present the data, and optimise for the hardware chosen; and finally, a test team to validate that the results are accurate.
This conundrum presents considerable challenges to organisations. On the one hand, the solution offered by IT can only really work in a mechanical way, through scoping, specification, design and build; on the other, business leaders require agile ad-hoc analysis, rapid turnaround and the flexibility to change their minds. The resulting gap creates a divide between business and IT, which benefits neither party. Business teams build their own ad-hoc data environments by saving down spreadsheets and datasets, whilst IT builds warehouse solutions that lack the agility to satisfy the needs of the user base. As a solution, many organisations are now looking to big data technologies. Innovation labs are springing up to load lots of data into lakes to reduce the time to source. Hadoop clusters are being created to provide flexible processing capability, and advanced visual analytics are being used to pull the data together to produce rapid results.
To get this right there are many frameworks that need to be established to prevent the lake from turning into landfill.
Strong governance driven by a well-defined operating model, business terminology, lineage and common understanding.
A set of architectural principles defining the data processes, organisation and rules of engagement.
A clear strategy and model for change control and quality control. This needs to enable rapid development, whilst protecting the environment from the kind of confusion clearly observed in end user environments, where many versions of the truth are allowed to exist and confidence in underlying figures is low.
Kinaesis has implemented solutions to satisfy all of the above in a number of financial organisations. We have a model for building maturity within your data environment; this consists of an initial assessment followed by a set of recommendations and a roadmap for success. Following on from this, we have a considerable number of accelerators to help progress your maturity, including:
• Kinaesis Clarity Control - Control framework designed to advance your end user environments to a controlled understood asset.
• Kinaesis Clarity Meta Data - Enables you to holistically visualise your lineage data and to make informed decisions on improving the quality and consistency of your analytics platform.
• Kinaesis Clarity Analytics - A cloud hosted analytics environment to deliver a best practice solution born out of years of experience and capability delivering analytics on the move to the key decision makers in the organisation.
In addition, and in combination with our partners, we can implement the latest in Dictionaries, Governance, MDM, Reference Data as well as advanced data architectures which will enable you to be at the forefront of the data revolution.
In conclusion, building data platforms can be expensive and high risk. To help reduce this risk there are a number of paths to success.
Implement the project with best practice accelerators to keep on the correct track, reduce risk and improve time and cost to actionable insight.
Implement the latest technologies to enable faster time to value and quicker iteration, making sure that you combine this with the latest control and governance structures.
Use a prebuilt best practice cloud service to deliver the solution rapidly to users through any device anywhere.