Control, in the sense of governance, repeatability and transparency, is currently much stronger in software development than in data delivery. Software delivery has recently been through the DevOps revolution - but even before DevOps became a buzzword, the best teams had already adopted strong version control, continuous integration, release management and other powerful techniques. Over the last two decades, software delivery has moved from a cottage industry driven by individuals to a relatively well-automated process supported by standard tooling.
Data delivery brings many additional challenges. Software, for instance, is typically built once and then shared between many users; data must be delivered differently for each organisation. Data delivery involves handling a much greater bulk of data and co-ordinating parts that are rarely under the control of a single team. It’s a superset of software delivery, too, as it involves tracking the delivery of the software components that handle the data and relating their history to the data itself!
Given these challenges, it’s unsurprising that the tools and methodologies which have been in place for years in the software world are still relatively rare and underdeveloped in the data world.
Take version control, for example. Version control has been ubiquitous in software development for decades. Good version control permits tighter governance, fewer quality issues, greater transparency and thus greater collaboration and re-use. With it in place, you take certain things for granted:
You expect to be able to re-create the state of your business logic as it was at any point in the past.
You expect to be able to list and attribute all the changes that were made to it since then.
You expect to be able to branch and merge, leaving teams free to change their own branches until the powers that be unify the logic into a release.
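To make those three expectations concrete, here is a minimal sketch in Python of what they amount to when applied to a set of business rules. Everything in it is hypothetical and for illustration only: the `History` and `Commit` classes, the rule names and the integer timestamps are assumptions, not any vendor's API, and a real deployment would lean on an established version control system rather than hand-rolled code.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Commit:
    """One recorded change: who made it, when, and the full state after it."""
    author: str
    timestamp: int            # plain integers keep the sketch simple
    message: str
    state: Dict[str, str]     # rule name -> rule definition (hypothetical)
    parent: Optional["Commit"] = None


class History:
    """Toy in-memory history with named branches (illustrative only)."""

    def __init__(self) -> None:
        self.branches: Dict[str, Optional[Commit]] = {"main": None}

    def commit(self, branch: str, author: str, timestamp: int,
               message: str, changes: Dict[str, str]) -> None:
        parent = self.branches[branch]
        state = dict(parent.state) if parent is not None else {}
        state.update(changes)
        self.branches[branch] = Commit(author, timestamp, message, state, parent)

    def log(self, branch: str) -> List[Commit]:
        """List and attribute every change on a branch, newest first."""
        out: List[Commit] = []
        c = self.branches[branch]
        while c is not None:
            out.append(c)
            c = c.parent
        return out

    def state_at(self, branch: str, timestamp: int) -> Dict[str, str]:
        """Re-create the rules as they stood at a point in the past."""
        for c in self.log(branch):
            if c.timestamp <= timestamp:
                return dict(c.state)
        return {}

    def branch(self, source: str, name: str) -> None:
        """Start a new branch from the current tip of an existing one."""
        self.branches[name] = self.branches[source]

    def merge(self, source: str, target: str, author: str, timestamp: int) -> None:
        """Simplified merge: apply what the source changed since the common
        ancestor. Deletions are ignored and the source wins on conflicts."""
        target_ids = {id(c) for c in self.log(target)}
        base = next((c for c in self.log(source) if id(c) in target_ids), None)
        base_state = base.state if base is not None else {}
        tip = self.branches[source]
        tip_state = tip.state if tip is not None else {}
        changes = {k: v for k, v in tip_state.items() if base_state.get(k) != v}
        self.commit(target, author, timestamp, f"merge {source} into {target}", changes)


h = History()
h.commit("main", "alice", 100, "initial rule", {"vat_rate": "0.20"})
h.branch("main", "team-a")
h.commit("team-a", "bob", 200, "add rounding rule", {"rounding": "half-even"})
h.commit("main", "alice", 300, "update VAT", {"vat_rate": "0.21"})
h.merge("team-a", "main", "carol", 400)

current = h.state_at("main", 400)   # both teams' changes, unified
earlier = h.state_at("main", 150)   # the rules as they stood before either change
authors = [c.author for c in h.log("main")]
```

The merge here is deliberately naive. The point is only that history, attribution and branching are cheap once every change is recorded as an immutable step with an author and a parent.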
On the data side, that's still the exception rather than the rule - but with the growing profile of DataOps, the demand is now there and increasingly the tooling is too. The will to reform the way data is delivered also seems to be there - perhaps driven by auditors and regulators, who are increasingly interested in what's possible rather than in what vendors have traditionally got away with. We stand on the brink of great changes in the way data is delivered, and it's going to get a fair bit more technical.
What isn't quite visible yet is a well-defined methodology, so as we start to incorporate proper governance and collaboration into our data pipeline, we face a choice of approaches. Here are a few of the considerations around version control, which of course is only a part of the Control pillar:
» Some vendors are already strong in this area and we have the option of leveraging their offerings - for example, Informatica PowerCenter has had version control for some time and many PowerCenter deployments are already run in a DevOps-like way.
» Some vendors offer a choice between visual and non-visual approaches - for example, you can stick to non-visual development and use most of the same techniques you might use in DevOps. If you want to take advantage of visual design features, however, you'll need to solve the problem of version and release control yourself.
» Some enterprises govern each software system in their pipeline separately, using whatever tools the vendors provide, and don't attempt a unified paper trail across the entire pipeline.
» Some enterprises that have a diverse vendor environment take a 'snapshot' approach to version and release control - freezing virtual environments in time to provide a snapshot of the pipeline that can be brought back to life to reproduce key regulatory outputs. This helps ensure compliance, but does little to streamline development and delivery.
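As a rough illustration of the last of these approaches, a snapshot is only useful if you can later prove that the environment you restored is the one you froze. One way to do that, sketched below in Python, is to record a manifest of component versions and dataset digests alongside the snapshot. The function names, component names and file names are all hypothetical; this is a sketch of the idea, not a description of how any particular vendor or enterprise does it.

```python
import hashlib
import json
from typing import Dict


def build_manifest(label: str, components: Dict[str, str],
                   datasets: Dict[str, bytes]) -> dict:
    """Record what was frozen: tool versions plus a SHA-256 digest per dataset."""
    manifest = {
        "label": label,
        "components": dict(components),
        "datasets": {name: hashlib.sha256(blob).hexdigest()
                     for name, blob in datasets.items()},
    }
    # A single fingerprint over the canonical JSON makes comparison trivial.
    canonical = json.dumps(manifest, sort_keys=True).encode("utf-8")
    manifest["fingerprint"] = hashlib.sha256(canonical).hexdigest()
    return manifest


def matches(manifest: dict, components: Dict[str, str],
            datasets: Dict[str, bytes]) -> bool:
    """Check that a restored environment is identical to the frozen one."""
    rebuilt = build_manifest(manifest["label"], components, datasets)
    return rebuilt["fingerprint"] == manifest["fingerprint"]


# Hypothetical pipeline: two components and one input file.
components = {"etl_tool": "10.5", "scoring_model": "1.3.0"}
datasets = {"trades.csv": b"id,amount\n1,100\n"}

frozen = build_manifest("regulatory-report", components, datasets)

restored_ok = matches(frozen, components, datasets)
restored_changed = matches(frozen, components, {"trades.csv": b"id,amount\n1,101\n"})
```

A manifest like this gives an auditor something to check, but it shares the weakness noted above: it tells you that a frozen pipeline is intact, not how it changed between one snapshot and the next.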
It's no small matter to pick an approach when your data estate contains multiple vendors, each with a different level of support for different forms of governance. Yet the implications of the approach you choose are profound.
DevOps has helped to define a target and to demonstrate the benefits of good tooling and governance. To achieve that target, those who own or manage data pipelines need to consider widely different functions, from ETL to ML, with widely different vendor tooling. Navigating that complex set of functions and forming a policy that takes advantage of your investment in vendors while still covering your pipeline will require skill and experience.
Knowledge of DataOps and related methodologies, knowledge of data storage, distribution, profiling, quality and analytics functions, knowledge of regulatory and business needs and, above all, knowledge of the data itself, will be critical in making DataOps deliver.