Articles by Benjamin Peterson
Not long ago, 'metadata' was a fairly rare word, representing something exotic and a bit geeky that generally wasn't considered essential to business.
Times have changed. Regulation has forced business to build up metadata. Vendors are emphasising the metadata management capabilities of their systems. The word 'metadata' almost sums up the post-BCBS 239 era of data management - the era in which enterprises are expected to be able to show their working, rather than just present numbers.
Customers are frequently asking for improved and a greater volume of metadata - looking to reduce costs and risk, please auditors and satisfy regulators.
The trouble with labels, though, is that they tend to hide the truth. 'Metadata' itself is a label and the more we discuss 'metadata' and how we'd like to have more of it, the more we start to wonder if 'metadata' actually means the same thing to everyone. In this article, I'd like to propose a strawman breakdown of what metadata actually consists of. That way, we'll have a concise, domain-appropriate definition to share when we refer to "global metadata" - good practice, to say the least!
So, when we gather and manage metadata, what do we gather and manage?
Terms: what data means
To become ‘information’ rather than just ‘data’, a number must be associated with some business meaning. Unfortunately, experience shows that simple words like 'arrears' or 'loan amount' do not, in fact, have a generally agreed business meaning, even within one enterprise. This is the reason why we have glossary systems; to keep track of business terms and to relate them to physical data. Managing terms and showing how physical data relates to business terms is an important aspect of metadata. Much has been invested and achieved in this area over the last few years. Nevertheless, compiling glossaries that really represent the business and that can practically be applied to physical data remains a complex and challenging affair.
Lineage: where data comes from
Lineage (not to be confused with provenance) is a description of how data is transformed, enriched and changed as it flows through the pipeline. It generally takes the form of a dependency graph. When I say 'the risk numbers submitted to the Fed flows through the following systems,' that's lineage. If it's fine-grained and correct, lineage is an incredibly valuable kind of metadata; it's also required, explicitly or implicitly, by many regulations.
Provenance: what data is made of
Provenance (not to be confused with lineage) is a description of where a particular set of data exiting the pipeline has come from: the filenames, software versions, manual adjustments and quality processes that are relevant to that particular physical batch of data. When I say 'the risk numbers submitted to the Fed in Q2 came from the following risk batches and reference data files,' that's provenance. Provenance is flat-out essential in many highly regulated areas, including stress testing, credit scoring models and many others.
Quality metrics: what data is like
Everyone has a data quality process. Not everyone can take the outputs and apply them to actual data delivery so that quality measures and profiling information are delivered alongside the data itself. It’s great that clued in businesses are starting to ask for this kind of metadata frequently. The other good news is that advances in DataOps approaches and in tooling are making it easier and easier to deliver.
Usage metadata: how data may be used
'Usage metadata' is not a very commonly used term. Yet it's a very important type of metadata, in terms of the money and risk that could be saved by applying it pervasively and getting it right. Usage metadata describes how data should be used. One example is the identification of golden sources and golden redistributors; that metadata tells us which data should be re-used as a mart and which data should not be depended upon. But another extremely important type of metadata to maintain is sizing and capacity information, without which new use cases may require painful trial and error before reaching production.
There are other kinds of metadata as well; one organisation might have complex ontology information that goes beyond what's normally meant by 'terms' and another may describe file permissions and timestamps as 'metadata'. In the list above, I've tried to outline the types of metadata that should be considered as part of any discussion of how to improve an enterprise data estate... and I've also tried to sneak in a quick explanation of how 'lineage' is different from 'provenance'. Of all life's pleasures, well defined terms are perhaps the greatest.
Control, in the sense of governance, repeatability and transparency is currently much stronger in software development than data delivery. Software delivery has recently been through the DevOps revolution - but even before DevOps became a buzzword, the best teams had already adopted strong version control, continuous integration, release management and other powerful techniques. During the last two decades, software delivery has moved from a cottage industry driven by individuals to a relatively well-automated process supported by standard tooling.
Data delivery brings many additional challenges. Software delivery, for instance, is done once and then shared between many users of the software; data must be delivered differently for each unique organisation. Data delivery involves the handling of a much greater bulk of data and the co-ordination of parts that are rarely under control of a single team; it’s a superset of software delivery, too, as it involves tracking the delivery of the software components that handle the data and relating their history to the data itself!
Given these challenges, it’s unsurprising that the tools and methodologies which have been in place for years in the software world are still relatively rare and underdeveloped in the data world.
Take version control, for example. Since time immemorial, version control has been ubiquitous in software development. Good version control permits tighter governance, fewer quality issues, greater transparency and thus greater collaboration and re-use.
You expect to be able to re-create the state of your business logic as it was at any point in the past.
You expect to be able to list and attribute all the changes that were made to it since then.
You expect to be able to branch and merge, leaving teams free to change their own branches until the powers that be unify the logic into a release.
On the data side, that's still the exception rather than the rule - but with the growing profile of DataOps, the demand is now there and increasingly the tooling is too. The will to change and reform the way data is delivered also seems to be there - perhaps driven by auditors and regulators who are increasingly interested in what's possible, rather than in what vendors have traditionally got away with in the past. We stand on the brink of great changes in the way data is delivered and it's going to get a fair bit more technical.
What isn't quite visible yet is a well-defined methodology, so as we start to incorporate proper governance and collaboration into our data pipeline, we face a choice of approaches. Here are a few of the considerations around version control, which of course is only a part of the Control pillar:
» Some vendors are already strong in this area and we have the option of leveraging their offerings - for example, Informatica Powercenter has had version control for some time and many Powercenter deployments are already run in a DevOps-like way.
» Some vendors offer a choice between visual and non-visual approaches - for example with some vendors you can stick to non-visual development and use most of the same techniques you might use in DevOps. If you want to take advantage of visual design features, however, you'll need to solve the problem of version and release control yourself.
» Some enterprises govern each software system in their pipeline separately, using whatever tools are provided by vendors and don't attempt a unified paper trail across the entire pipeline.
» Some enterprises that have a diverse vendor environment take a 'snapshot' approach to version and release control - freezing virtual environments in time to provide a snapshot of the pipeline that can be brought back to life to reproduce key regulatory outputs. This helps ensure compliance, but does little to streamline development and delivery.
It's no small matter to pick an approach when there are multiple vendors, each with varying levels of support for varying forms of governance, in your data estate. Yet the implications of the approach you choose are profound.
DevOps has helped to define a target and to demonstrate the benefits of good tooling and governance. To achieve that target, those who own or manage data pipelines need to consider widely different functions, from ETL to ML, with widely different vendor tooling. Navigating that complex set of functions and forming a policy that takes advantage of your investment in vendors while still covering your pipeline will require skill and experience.
Knowledge of DataOps and related methodologies, knowledge of data storage, distribution, profiling, quality and analytics functions, knowledge of regulatory and business needs and, above all, knowledge of the data itself, will be critical in making DataOps deliver.
By Benjamin Peterson
DataOps comes from a background of Agile, Lean and above all DevOps - so it's no surprise that it embodies a strong focus on governance, automation and collaborative delivery. The formulation of the Kinaesis® DataOps pillars isn't too different from others, although our interpretation reflects a background in financial sector data, rather than just software. However, I believe there's an extra pillar in DataOps that’s missing from the usual set.
Most versions of DataOps focus primarily on the idea of a supply chain, a data pipeline that leads to delivery via functions such as quality, analytics and lifecycle management. That's a good thing. However supply chains exist for a purpose - to support a target state, a business vision. The creation and management of that vision and the connecting of the target to actual data is just as important as the delivery of data itself.
The importance of setting a Target
DataOps work shouldn’t be driven entirely by the current state and the available data; it should support a well-defined target. That target consists of a business vision describing the experience the business needs to have.
On the software development side, it's been accepted for a long time that User Experience (UX) is an important and somewhat separate branch of software delivery. Even before the 'digital channels' trend, whole companies focused on designing and building a user friendly experience.
Delivering an experience is different from working to a spec because a close interaction with actual users is required. Whole new approaches to testing and ways of measuring success are needed. UX development includes important methodologies such as iterative refinement - a flavour of Agile which involves delivering a whole solution at a certain level of detail and then drilling down to refine certain aspects of the experience, as necessary. Eventually, UX has become a mature recognised branch of software development.
Delivery of data has much to learn from UX approaches. If users are expected to make decisions based on the data - via dashboards, analytics, ML or free-form discovery - then essentially you are providing a user experience, a visual workflow in which users will interact with the data to explore and to achieve a result. That's far closer to UX development than to a traditional functional requirements document.
Design and DataOps: a match made in heaven
To achieve results that are truly transformative for the business, those UX principles can be applied to data delivery. 'User journeys' can provide a way to record and express the actual workflow of users across time as they exploit data. Rapid prototyping can be used to evaluate and refine dashboard ideas. Requirements can, and should, be driven from the user's desktop experience, not allowed to flow down from IT. All these artefacts are developed in a way that not only contributes to the target, but allows pragmatic assessment of the required effort.
Most of all, work should start with a vision expressing how the business should ideally be using information. That vision can be elicited through a design exercise, whose aim is not to specify data flows and metadata (that comes later) but to show how the information in question should be used, how it should be adding value if legacy features and old habits were not in the way. Some would even say this vision does not have to be, strictly speaking, feasible; I'm not sure I'd go that far, but certainly the aim is to show the art of the possible, an aspirational target state against which subsequent changes to data delivery can be measured. Without that vision, DataOps only really optimises what’s already there - it can improve quality but it can't turn data into better information and deeper knowledge.
Sometimes, the remit of DataOps is just to improve quality, to reduce defects or to satisfy auditors and this in itself is often an interesting and substantial challenge. But when the aim is to transform and empower business, to improve decisions, to discover opportunity, we need our Target pillar: a design process, along with an early investment in developing a vision. That way, our data delivery can truly become an agent of transformation.
The core concepts behind DataOps drive everything that we do at Kinaesis.
Kinaesis DataOps is a powerful toolkit for creating data-driven business change, covering people, process and technology. We use it to manage the entire data pipeline, from requirements and vision to analytics and presentation.
We specialise in applying DataOps within the financial sector to achieve compliance, cost savings and revenue growth. Our approach to DataOps reflects the specific needs and challenges of the sector and our experience as long time practitioners and SMEs.
Where does Kinaesis DataOps come from?
Kinaesis DataOps is a new approach to delivering data, based on a methodology that has proved successful in software delivery. Our DataOps Pillars are more than a set of best practices; each of our pillars within DataOps blends lessons learnt in the software world with solutions appropriate for the more challenging world of data.
To truly maximise the value of DataOps requires more than knowledge of DevOps and tooling; it requires a deep familiarity with data management, analytics and sector specific needs. That is what drives our interpretation of DataOps into these Six Pillars:
Instrument: Instrument the data flow every step of the way using profiling, DQ and monitoring to create a clear view of data reliability and timeliness. Always present data quality information with actual data.
Metadata: Maintain clear business definitions and models; keep them connected to your data and up to date.
(Extensible) Platforms: Data needs to be leveraged and the business will always come up with new demands. Open standards, IT components that interoperate well, clear contracts and operating models are essential to ensure that new demands can always be met.
(Collaborative) Analytics: Make sure that data consumers can collaborate together along with IT and data owners.
Control: Meet quality, attestation and audit targets by applying proper version control and proper release management. Channel the output of instrumentation into strong exception handling and DQ processes.
Target: Make sure the data pipeline is driven by a business vision, not just by the data that happens to be available. Map out user journeys and visions to drive technical change.
Back in 2008, a group of tried and tested software development practices combined with some new tooling that improved environment and release management were bundled together as DevOps. DevOps is the principal inspiration behind DataOps and generated the requirement for version control, release management, environment management and documentation.
DataOps takes the emphasis on collaboration, quick release cycles and iterative refinement from Agile development methodology.
The focus on instrumentation comes from Lean; Lean methodologies propose that fully instrumenting a supply chain is necessary to optimise it. Our DataOps vision takes this a little further - not just the supply chain itself, but the data, needs to be instrumented.
Finally, from User Experience (UX) development comes a toolkit for generating vision and requirements by interaction and exploration with the business. This is essential if DataOps is to be more than just optimisation!
GDPR will force changes onto pre loan credit check processes. Benjamin Peterson, our Head of Data Privacy, takes you through what to expect and how to solve the problems this will create.
Some banking processes are more GDPR-sensitive than others. Pre-loan credit checks, that depend on modelling and analytics are very significant in GDPR terms. While consuming large amounts of personal data, they also involve profiling and automated decision making - two areas on which GDPR specifically focuses. Despite their importance, many have been assuming that these processes won’t be hugely impacted by GDPR. After all, credit checking is so fundamental to what a bank does - surely it’ll turn out that credit checks are a justified use of whatever personal data we happen to need?
Recent guidance from the Article 29 Working Party – the committee that spends time clarifying GDPR, section by section – has demolished that hope, imposing more discipline than expected. October’s guidance on profiling and automated decision-making does three things: adjusts some definitions, clarifies some principles and discusses some key edge cases. It’s surprising how tweaking a few terms can make credit checking and modelling seem far more difficult, in privacy terms.
Yet, in many ways, the new guidance throws banks a lifeline. First, though, let’s map out the problematic tweaks at a high level:
- Credit checking is not deemed ‘necessary in order to enter a contract’. Lenders had hoped that credit checks might be considered as such and thus justified in GDPR terms.
- Automated decision-making is prohibited by default. Lenders had hoped automated decision-making would not attract significant extra restrictions.
- Credit refusal can be deemed ‘of similar significance’ to a ‘legal effect’. Lenders had hoped credit decisions would not be given the same status as legal effects – due to the restrictions and customer rights that accompany them.
So, there are small tweaks that could prove hard work for data and risk owners. Banks will have to make sure that their credit checking and modelling processes stick to GDPR principles. Processes such as data minimisation and the various rights to challenge, correct and be informed will prove tricky to follow when other regulators need to audit historical models!
But we can protect ourselves. One thing we can do is avoid full automation; fully automated decision-making has stringent constraints but adding a manual review sidesteps them. We also need to stick close to the general GDPR principles. For example, data minimisation - this can mean controlling data lifecycle and scope by utilising clever desensitisation and anonymisation to satisfy audit and model management requirements. This will keep you on the right side of GDPR.
Additionally, the recent guidance contains a very interesting set of clarifications around processing justifications. The best kind is the subject’s consent. Establishing justification through necessity or unreasonable cost is complex and subjective; the subject’s consent is an unassailable justification. The recent guidance reinforces the power of the subject’s consent and tells banks how to make that consent more powerful still – by keeping subjects informed. The flip side is, of course, that the consent of an uninformed subject is not really consent at all and could lead to serious breaches.
So, well informed customers are an essential part of our solution for running credit checks and building models in the post-GDPR world. Fortunately, the Article 29 Working Party released detailed and sensible guidance on just how to keep them informed – here’s a high level summary:
· The bank should find a 'simple way' to tell the subject about each process in which their personal data will be involved.
· For each piece of personal information used, the subject should be told the characteristics, source and relevance of that information. Good metadata and lineage would make this task very easy.
· The bank need not provide exhaustive technical detail – it’s about creating a realistic understanding of the subject, not about exposing every detail of the bank’s logic.
· The guidance suggests using visualisations, standard icons and step by step views to create an easily understood summary of data usage and processes affecting the subject.
So, if you want your banking business to experience minimum impact from GDPR, one message is clear – you need to provide transparency to your customers, as well as your internal officers and auditors. Just as you provide various perspectives on your data flows to your various stakeholders, you’ll benefit from providing a simplified perspective to your customers. The metadata, lineage and quality information you’ve accumulated now has an extra use case: keeping your customers informed, so you are able to keep running the modelling and checking processes that you depend on.
Want more from our GDPR experts? Check out our GDPR solution packages here and see more of our regulatory compliance projects here. Or you can reach us on 020 7347 5666.
Just when we'd grown used to the idea that it matters how we handle our data, regulators have taken it to the next level. It’s not enough to have our own data management practises well-groomed – as we step onto the data privacy dance floor, we need to be intimately acquainted with our partner’s habits as well.
The GDPR’s strong words about data controllers and data processors make it clear that compliance is now a team effort, with financial institutions and their service providers expected to work together to meet the regulation’s goals. Financial institutions almost invariably have significant service provider relationships – from large banks, with their galaxy of data processing partners, to simple funds whose fund administrator is a single but crucially important partner in the personal data tango.
Fortunately, the GDPR does make it clear what it expects from data controller / data processor relationships. The Data Processing Agreement enshrines the data processor’s responsibilities to the data controller in some detail. Beyond that, both types of organization are held to the same standard and must support the same rights for the data subject. Our existing governance models, then, must be extended to cover:
• Our own internal data governance
• Our interfaces (technical and contractual) to our data processors
• Our data processor’s governance
The good news is that an effective data governance model can actually be extended quite naturally over this new dancefloor. For our internal data, we’d expect to already be identifying sensitive data (the GDPR gives us hints, rather than a fixed set of criteria, but it’s nothing we can’t manage). Identifying systems and processes that handle that data, and checking those systems for compatibility with GDPR. ‘Compatibility’ here is a concept that can be broken down into two areas: support for GDPR rights (such as the right to be forgotten), and support for GDPR principles (such as access control).
To sort out our data privacy social life, we could decide to form a governance model for partnerships analogous to the ones we apply to in-house systems. Just as we evaluate the maturity of a system, we can evaluate the GDPR maturity of a relationship with a data processor:
• Immature: A relationship that makes no specific provision for data management.
• Better: Formal, contractual coverage of data handling and privacy parameters. In-house metadata that describes the sensitivity, lifespan, and access rights of the data in question.
• Better still: A GDPR-compliant Data Processing Agreement.
• Bulletproof: A Data Processing Agreement, an independent DPO role with adequate visibility of the relationship on both sides, and metadata that covers both controller and provider.
Once we’ve enumerated relationships, evaluated their maturity, and put in place a change model that covers new relationships and contractual changes, the problem starts to look finite. That change model is imperative – in the future, will organizations even want to dance with a partner who doesn’t know the GDPR steps?
Read about the two different solutions to GDPR Kinaesis provide here: http://www.kinaesis.com/solution/017-new-practical-kinaesis-gdpr-solutions
New technologies and approaches in financial IT are always exciting – but often that excitement is tinged with reservations. Ground-breaking quantum leaps don’t always work out as intended, and don’t always add as much value as they initially promise. That’s why it’s always good to see an exciting new development that’s actually a common-sense application of existing principles and tools, to achieve a known – as opposed to theoretical – result.
DataOps is that kind of new development. It promises to deliver considerable benefits to everyone who owns, processes or exploits data, but it’s simply a natural extension of where we were going already.
DataOps is the unification, integration and automation of the different functions in the data pipeline. Just as DevOps, on the software side, integrated such unglamorous functions as test management and release management with the software development lifecycle, so DataOps integrates functions such as profiling, metadata documentation, provenance and packaging with the data production lifecycle. The result isn’t just savings in operational efficiency – it also makes sure that those quality-focused functions actually happen, which isn’t necessarily the case in every instance today.
In DataOps, technology tools are used to break down functional silos within the data pipeline and create a unified, service-oriented process. These tools don’t have to be particularly high-tech – a lot of us are still busy absorbing the last generation of high-tech tools after all. But they do have to be deployed in such a way as to create a set of well-defined services within your data organisation, services that can be stacked to produce a highly automated pipeline whose outputs include not just data but quality metrics, exceptions, metadata, and analytics.
It could be argued that DevOps came along and saved software development just at the time when we moved, rightly, from software systems to data sets as the unit of ownership and control. DataOps catches up with that trend and ensures that the ownership and management of data is front and center.
Under DataOps, software purchases in the data organisation happen not in the name of a specific function, but in order to implement the comprehensive set of services you specify as you plan your DataOps-based pipeline. This means of course that a service specification, covering far more than just production of as-is data, has to exist, and other artefacts have to exist with it, such as an operating model, quality tolerances, data owners… in other words, the things every organisation should in theory have already, but which get pushed to the back of the queue by each new business need or regulatory challenge. With DataOps, there’s finally a methodology for making sure those artefacts come into being, and for embedding an end-to-end production process that keeps them relevant.
In other words, with the advent of DataOps, we’re happy to see the community giving a name to what people like us have been doing for years!