What is the extensible platforms pillar within the DataOps methodology? The purpose of the platform within DataOps is to enable the agility within the methodology and to recognise the fact that data science is evolving rapidly. Due to the constant innovation around tools, hardware and solutions, what is cutting edge today could well be out of date tomorrow. What you need to know from your data today may only be the tip of the iceberg once you have productionised the solution and the next requirement could completely change the solution you have proposed. To address this issue, DataOps requires an evolving and extendable platform.
Extensibility of data platforms is delivered in a number of different ways through:
• Infrastructure patterns
• A DataOps Development Approach
• Architecture Patterns
• Data Patterns
In most large organisations, data centres and infrastructure teams have many competing priorities and delivery times can be as much as 6-9 months for new hardware. With data projects this can be the difference between running through agile iterations or implementing waterfall where you collect requirements to size the hardware upfront. To manage the risks, project teams either over order hardware creating massive redundancy or to keep costs down, under order and then have large project delays. An example of this are Big data solutions requiring a large number crunching capability to process metrics which stresses the system for a number of hours each day, but after this the infrastructure sits idle until the next batch of data arrives. The cost to organisations of redundant hardware is significant. To address this the developing answer is the cloud where servers can be set up with data processes to generate results and then brought down again reducing the redundancy significantly. Grids and internal clouds offer an on premises option. To migrate and leverage this flexibility, organisations need to consider their strategy and approach for data migration where lift and shift would duplicate data therefore meaning incremental reengineering makes more sense.
DataOps Development Approach
A DataOps development approach enables the integration of Data Science with Engineering leading to innovation reaching production quality levels more rapidly and at lower risk. Results with data projects are best when you can use tools and techniques directly on the data to prototype, profile, cleanse and build analytics on the fly. This agile approach requires you to build a bridge to the data engineers who can take the data science and turn it into a repeatable production quality process. The key to this is a DataOps development approach that builds operating models and patterns to promote analytics into production quickly and efficiently.
One of the challenges in driving innovation and agility in data platforms forwards is the architecture of production quality data with traceability and reusable components. Too small and these components become a nightmare to join and use, too large and too much is hardcoded hampering reuse. Often data in production will need to be shared with the data scientists. This is difficult because the production processes can break a poorly formed process, and poor documentation can lead to numbers being used out of context. Complexity exists where outputs from processes become inputs to other processes and sometimes in reverse creating a tangle of dependencies. The key to solving this is building out architecture patterns to enable reuse of common data in a governed way, but with the ability to enrich the data with business specific content within the architecture. Quality processes need to be embedded along the data path.
The final challenge is to organise data within the system in logical patterns that allow it to be extended rapidly for individual use cases, but to form a structure from which to maintain governance and control. Historically and with modern tools, analytical schemas enable slice and dice on known dimensions which is great for known workloads. To deliver extensibility, DataOps requires a more flexible data pattern to generate either one off analytics or to tailor analytics to individual use cases. The data pattern and organisation needs to allow for trial and error but with this there is a need to have discipline. Meta data should be kept up to date and in line with the data itself. External or enrichment data needs to be integrated almost instantly and removed again, or promoted into a production ready state. To do this you need patterns which allow for the federation of the data schemas.
The capabilities above combine to enable you to create an extensible platform as part of an overall DataOps approach. Marry this up with the other 5 pillars of DataOps then each new requirement should become an extension to your data organisation rather than a brand new system or capability.
By Simon Trewin
Are you amazed by how quickly business glossaries fill up and become hard to use? I have been involved with large complex organisations with numerous departments whose teams have tried to document their data and reports without proper guidance. Typically, the results I have witnessed are glossaries 10,000 lines long, with different grains of information being entered, technical terms being uploaded alongside business terms and everything at a consistent level. What is the right way to implement a model to fill out a glossary to make it useful in this circumstance?
Many organisations have tried to implement a directed approach through the CDO leveraging budgets for BCBS 239 and other regulatory compliance initiatives to build out their data glossaries. Attempts have been made to create both federated models and centralised models for this initiative, however I have yet to see an organisation succeed in building out a resource that truly is value add. Every implementation seems to be a tax on the workforce who show it little enthusiasm, care and attention.
If you want to avoid falling into a perpetual circle of disappointment and wasted time, here are some tips that I have picked up in my years working with data:
Understand the scope of your terms. It is likely that there will be many representations of Country for instance, Country of Risk, Country of Issue, etc understand which one you have. Ask yourself: why does the term that you are entering exist, was it because a regulator referred to it in a report, or is it a core term?
Make terminology value add. Make it useful in the applications that surface data, i.e. context sensitive help. If someone must keep seeing a bad term when they hover their mouse they are more likely to fix it.
Link it to technical terms. If a dictionary term does not represent something physical then it becomes a theoretical concept, which is good for providing food for debate for many years, but not very helpful to an organisation.
Communicate using the terms, they should provide clarity of understanding through the organisation but they quite often establish language barriers. Make sure that people can find the terms in the appropriate resource efficiently so that they can use modern search to enhance their learning.
Build relationships between terms. Language requires context to enable it to be understood. Context is provided through relationships.
Set out your structure and your principles and rules before employing a glossary tool. Setting an organisation loose on glossary tools before setting them up correctly is a recipe for a lot of head scratching and wasted budget.
Start Small and test your model for the glossary before you try to document the whole world.
I am not saying that this is easy but following the rules above is likely to set you up for success.
Did you spot us in SD Times’ latest article on DataOps?
We are delighted our work on DataOps is being picked up on and we hope to continue adding value to our clients utilising our DataOps methodology and also to continue giving to the DataOps Community. You can find the feature here: https://sdtimes.com/data/a-guide-to-dataops-tools/