DataOps Pillar: Metadata
Not long ago, 'metadata' was a fairly rare word, representing something exotic and a bit geeky that generally wasn't considered essential to business.
Times have changed. Regulation has forced business to build up metadata. Vendors are emphasising the metadata management capabilities of their systems. The word 'metadata' almost sums up the post-BCBS 239 era of data management - the era in which enterprises are expected to be able to show their working, rather than just present numbers.
Customers are frequently asking for improved and a greater volume of metadata - looking to reduce costs and risk, please auditors and satisfy regulators.
The trouble with labels, though, is that they tend to hide the truth. 'Metadata' itself is a label and the more we discuss 'metadata' and how we'd like to have more of it, the more we start to wonder if 'metadata' actually means the same thing to everyone. In this article, I'd like to propose a strawman breakdown of what metadata actually consists of. That way, we'll have a concise, domain-appropriate definition to share when we refer to "global metadata" - good practice, to say the least!
So, when we gather and manage metadata, what do we gather and manage?
Terms: what data means
To become ‘information’ rather than just ‘data’, a number must be associated with some business meaning. Unfortunately, experience shows that simple words like 'arrears' or 'loan amount' do not, in fact, have a generally agreed business meaning, even within one enterprise. This is the reason why we have glossary systems; to keep track of business terms and to relate them to physical data. Managing terms and showing how physical data relates to business terms is an important aspect of metadata. Much has been invested and achieved in this area over the last few years. Nevertheless, compiling glossaries that really represent the business and that can practically be applied to physical data remains a complex and challenging affair.
Lineage: where data comes from
Lineage (not to be confused with provenance) is a description of how data is transformed, enriched and changed as it flows through the pipeline. It generally takes the form of a dependency graph. When I say 'the risk numbers submitted to the Fed flows through the following systems,' that's lineage. If it's fine-grained and correct, lineage is an incredibly valuable kind of metadata; it's also required, explicitly or implicitly, by many regulations.
Provenance: what data is made of
Provenance (not to be confused with lineage) is a description of where a particular set of data exiting the pipeline has come from: the filenames, software versions, manual adjustments and quality processes that are relevant to that particular physical batch of data. When I say 'the risk numbers submitted to the Fed in Q2 came from the following risk batches and reference data files,' that's provenance. Provenance is flat-out essential in many highly regulated areas, including stress testing, credit scoring models and many others.
Quality metrics: what data is like
Everyone has a data quality process. Not everyone can take the outputs and apply them to actual data delivery so that quality measures and profiling information are delivered alongside the data itself. It’s great that clued in businesses are starting to ask for this kind of metadata frequently. The other good news is that advances in DataOps approaches and in tooling are making it easier and easier to deliver.
Usage metadata: how data may be used
'Usage metadata' is not a very commonly used term. Yet it's a very important type of metadata, in terms of the money and risk that could be saved by applying it pervasively and getting it right. Usage metadata describes how data should be used. One example is the identification of golden sources and golden redistributors; that metadata tells us which data should be re-used as a mart and which data should not be depended upon. But another extremely important type of metadata to maintain is sizing and capacity information, without which new use cases may require painful trial and error before reaching production.
There are other kinds of metadata as well; one organisation might have complex ontology information that goes beyond what's normally meant by 'terms' and another may describe file permissions and timestamps as 'metadata'. In the list above, I've tried to outline the types of metadata that should be considered as part of any discussion of how to improve an enterprise data estate... and I've also tried to sneak in a quick explanation of how 'lineage' is different from 'provenance'. Of all life's pleasures, well defined terms are perhaps the greatest.