Data vault: new weaponry in your data science toolkit

Bas Vlaming, 17 Mar 2021, 8 min read

Picnic is an online grocer, where we aim to bring grocery shopping and the traditional Milkman model into the 21st century. Easy and frictionless ordering, a focus on personal service, home delivery with electric vehicles, and an efficient, highly optimised just-in-time supply chain. In short, we are tech’s answer to groceries.

We do not just run an app and outsource the actual work to third-party companies. Instead, we also run a full supply chain and a delivery fleet with dedicated systems, and have the tech stack to support all this. We are a full-fledged technology company. For a Data Scientist, the challenge is to solve complex (business) problems and make relevant predictions, based on all the data that is tracked and produced. Thankfully, we have an amazing Data Engineering team in place. As detailed in this blog post by Iliana Iankoulova and its follow-up, a successful hybrid implementation of Data Vault and a Kimball-style dimensional model is in place. This allows for structure and reliability in the gazillions of data points that are being collected.

But first, let me introduce myself. My name is Bas Vlaming, and I am one of the lucky Data Scientists at Picnic who gets to interact with this treasure trove of data on a daily basis. We are currently a team of 7 full-time Data Scientists, and we like to think of ourselves as full-stack data scientists: we do data exploration, model and pipeline design, all the way to actually implementing our solutions in production. As a team we work on a wide range of topics and with a wide range of business teams. We work with the Distribution team to improve estimates of trip planning and the time we need to deliver groceries. We build and improve forecasting models, from high-level delivery forecasts to article-specific penetration rates. We help our Customer Success team to automatically categorise incoming messages and optimise their workflow. We work with the Payments team to combat fraud. We personalise the app experience. We build our own toolset and Python modules to support our workflow. And so on, and so forth.

In all these topics, and in Data Science in general, quality data is the magic dust that is required to build reliable models. Thankfully, our data is (almost always) readily available, cleanly structured, and in an immaculate state. As described in Iliana’s blog, the Data Engineering team has used Snowflake to implement (1) a Data Vault architecture on the backend, combined with (2) a Kimball-style dimensional model that exposes business concepts to the business users. In this blog post, I would like to show how the Data Engineering architecture and its Data Vault implementation enable the Data Science team to get the most out of our data. In particular, it allows for:

  1. Unlocking high-quality historical data — even for novel business definitions and concepts
  2. Time travelling to access historical states of truth
  3. Extending the dimensional model to a quasi-real-time system
  4. Quick access to and exploration of novel quantities

1. Unlocking high-quality historical data — even for novel business definitions and concepts

In most businesses, the concepts used in both the backend and on the business side change over time. Picnic is no exception. When we expanded from the Netherlands into Germany, we required a completely different setup with regard to how we define ‘delivery regions’. Some two years ago, we added the much-desired functionality to add products to an already existing order, which meant that the concept of an order had to be redesigned. When we started taking back bottles and recyclables, we needed to be able to handle deposits. These kinds of changes often have wide-ranging knock-on effects.

However, with a Data Vault architecture, we track “All the data, all the time”. This allows us to easily adapt to new business definitions: we ‘simply’ have to change how our atomic bits and pieces, our hubs and satellites, come together into new business concepts. Likewise, we can accommodate the corresponding changes in the backend systems. Sure, this requires modifications to the contracts that are made with the backend, work from both sides, and quite possibly additional hubs, satellites, and links. But in the end, we still have a set of logical and reliable building blocks and links between them in place, and clean recipes on how to distil the relevant business concepts out of it — both the new and old versions. These can then be exposed to the business, using a Kimball-style dimensional model.
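To make those building blocks a little more tangible, here is a deliberately tiny, hypothetical sketch in Python (the real implementation lives in Snowflake SQL; every table name, column, and value below is invented for this post): a hub of stable business keys, a satellite of timestamped attribute loads, a link between hubs, and a recipe that distils a current-state dimension out of them.

```python
# Hypothetical, highly simplified Data Vault fragments. A hub holds stable
# business keys, a satellite holds descriptive attributes (one row per load,
# never updated in place), and a link connects hubs.
hub_customer = [{"customer_id": "C1"}, {"customer_id": "C2"}]

sat_customer = [
    {"customer_id": "C1", "city": "Amsterdam", "load_ts": "2020-01-01"},
    {"customer_id": "C1", "city": "Utrecht",   "load_ts": "2020-06-01"},
    {"customer_id": "C2", "city": "Berlin",    "load_ts": "2020-03-01"},
]

link_order_customer = [
    {"order_id": "O1", "customer_id": "C1"},
    {"order_id": "O2", "customer_id": "C2"},
]

def current_customer_dim():
    """The 'recipe': distil a current-state customer dimension from the
    hub + satellite by keeping the most recent load per business key."""
    dim = {}
    for row in sorted(sat_customer, key=lambda r: r["load_ts"]):
        dim[row["customer_id"]] = row["city"]  # later loads overwrite earlier
    return dim

print(current_customer_dim())  # {'C1': 'Utrecht', 'C2': 'Berlin'}
```

When a business definition changes, it is this recipe that gets rewritten, not the underlying hubs and satellites, which is exactly why old data remains usable under new definitions.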

So far, this is an important but mostly generic advantage of our data architecture. For the Data Science team specifically, it brings additional value to the table. When new or changing business concepts come to the fore, we often want to build a machine learning model that involves these new definitions. In other architectures, building a historical data set might involve complex and error-prone queries that try to reconstruct these quantities in the past. But here, the Data Engineering team has done the heavy lifting through their Data Vault set-up, and the new concepts are directly accessible for past data too — even for periods before these concepts were defined in the first place!

Figure 1: A Data Scientist’s impression of Data Engineering work.

2. Time travelling to access historical states of truth

Say you are a Data Scientist in charge of flagging potential fraud cases. You want to build features and reconstruct what happened with historical payments. A dimensional model might give you some of these answers, but most likely not the complete picture. For many analysts and business users at Picnic (we have a 50% SQL adoption rate among our business users), it is probably interesting to know what has been bought, and that an order has been paid. But here, as in many cases, a Data Scientist has different requirements: we are not only interested in the final or current state, but we need to be able to reconstruct the full business reality at some historical point in time. For example, we might be interested in a customer’s number of outstanding unpaid orders at a given time; how many times a payment attempt has bounced; or what the outstanding debt was at the time of ordering. Such quantities are highly relevant to this specific Data Science use case, even if they are not necessarily of interest to anyone outside of this particular business problem.

Figure 2: Time travelling Data Scientists (picture thanks to Joe Roberts).

Data Vault to the rescue. DV makes this task quite straightforward: for any given point in time, we can easily find what state the hubs, satellites, and links were in. Moreover, we have a known recipe for recombining these into the business concepts we care about: these are simply the queries that are used to construct the dimensional model, as exposed to the business. By adapting these, we can straightforwardly rebuild the historical values of any relevant business quantity, and in this way build historical features for our models that really reflect the values they would have had at the time.
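In essence, such a point-in-time lookup boils down to "the most recent satellite load at or before the reference timestamp". A hedged sketch, with pure Python standing in for the actual Snowflake SQL and all names invented:

```python
def state_as_of(satellite, key, key_col, as_of):
    """Return the satellite row that was current for `key` at time `as_of`:
    the most recent load at or before that timestamp (None if none exist)."""
    rows = [r for r in satellite
            if r[key_col] == key and r["load_ts"] <= as_of]
    return max(rows, key=lambda r: r["load_ts"], default=None)

# An illustrative payment-status satellite: every state change is a new row.
sat_payment = [
    {"payment_id": "P1", "status": "PENDING", "load_ts": "2020-05-01"},
    {"payment_id": "P1", "status": "FAILED",  "load_ts": "2020-05-02"},
    {"payment_id": "P1", "status": "PAID",    "load_ts": "2020-05-10"},
]

# The payment status as a fraud model would have seen it on 2020-05-03:
print(state_as_of(sat_payment, "P1", "payment_id", "2020-05-03")["status"])
# FAILED
```

Because the satellite is append-only, the answer on 2020-05-03 stays "FAILED" forever, even though the payment was eventually settled — exactly the historical truth a training set needs.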

The possibility of time travelling also allows us to align our training and production queries as much as possible. The queries that generate features in production use the most recently available data. By using time travelling in Data Vault, we can use very similarly structured queries to generate historical features for training purposes. Ideally, all we have to do is change the reference point-in-time, and the historical truth (from the data perspective) rolls out. Keeping such queries aligned as much as possible reduces code and logic duplication, and also reduces the scope for mistakes. In contrast, in a different architecture, reproducing such a past historical state might involve complicated and error-prone queries full of filtering and join logic.

Of course, you could say: sure, but I could also just store these historical snapshots in my regular dimensional model. And of course, with 20/20 hindsight, maybe you could have — but this would require an almost prophetic ability on the Data Engineering side to predict what kinds of snapshots (and what kinds of business concepts and definitions) would become relevant in the future. It would also come with the unwanted side effect of complicating the data structure, making it harder for business users to use and interpret the data correctly. Our current setup gets the best of both worlds by maintaining a strict separation between a Data Vault approach on the backend side and a dimensional Kimball-style model that exposes the data to business users. This allows for traceability, accountability, and resilience on the backend side, and a clean and structured dimensional model on the front end.

3. Extending the dimensional model to a quasi-real-time system

Figure 3: Regrowing fresh, real-time data onto a reliable base. Brought to you by Picnic’s terrible-analogy department.

Data warehouse solutions and the structures they provide are great for historical analysis, but the data is not received in real time. Typically, the Data Vault pulls in data every few hours, and the dimensional model is refreshed once a day. However, the Data Vault architecture also allows for a pretty neat extension beyond the typical data warehouse use case, where we use real-time data to augment the available structured data. This lets us use the structure provided by the Data Warehouse (either the Data Vault backend or the Kimball dimensional front end) while keeping features as up-to-date as possible.

In those cases, we start out from the most recent available version of the relevant tables in the dimensional model. At prediction time, we construct an enriched version of the relevant features on demand; that is, we pull data from the backend ourselves. By using slightly modified versions of the queries that Data Engineering uses to build the Data Vault, and subsequently the dimensional model, we can construct our own little job-specific, real-time-augmented Data Warehouse. With a fully up-to-date version of the relevant tables, we can make our predictions as accurately as possible, while again ensuring that training and production use very similar queries to build features.
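A hedged sketch of the idea (event shapes, names, and timestamps are all made up; in reality this is SQL over Snowflake tables and backend data): the same "recipe" that builds the dimensional table from raw events is simply re-applied to last night's snapshot plus the events pulled from the backend since then.

```python
def build_order_status(events):
    """The dimensional-model recipe: latest known status per order."""
    status = {}
    for e in sorted(events, key=lambda e: e["ts"]):
        status[e["order_id"]] = e["status"]
    return status

# What made it into last night's Data Warehouse load:
nightly_events = [
    {"order_id": "O1", "status": "PLACED",    "ts": "2021-03-16T20:00"},
    {"order_id": "O2", "status": "DELIVERED", "ts": "2021-03-16T21:00"},
]
# What we pull directly from the backend at prediction time:
fresh_events = [
    {"order_id": "O1", "status": "DELIVERED", "ts": "2021-03-17T09:00"},
    {"order_id": "O3", "status": "PLACED",    "ts": "2021-03-17T09:30"},
]

# Same recipe, fresher input: a job-specific, real-time-augmented table.
print(build_order_status(nightly_events + fresh_events))
# {'O1': 'DELIVERED', 'O2': 'DELIVERED', 'O3': 'PLACED'}
```

Because one recipe serves both the nightly build and the on-demand augmentation, the quasi-real-time view stays consistent with the official dimensional model by construction.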

4. Quick access to and exploration of novel quantities

Developing and maintaining the tables in the dimensional model that is exposed to the business takes time and effort. A great amount of automation helps to lighten the burden, as detailed here. Still, in the end, only those concepts that are relevant to the business are exposed. However, as Data Scientists, we are often trailblazing from the data perspective — in particular, when exploring a new business problem, we often need to construct niche quantities that could be relevant to our model, but to little else. And maybe they turn out not to be relevant at all; we simply do not know a priori.

Figure 4: Blazing trails.

Exploring such novel quantities can be quite tricky in many other architectures. It might require Data Engineers or Backend Engineers to create new concepts or expose additional data, which takes a lot of valuable time and effort — and all that for a feature that might not even yield anything useful! Our Data Vault architecture directly unblocks us: we can simply build such quantities ourselves, and do our analysis and exploration without needing to bother anyone else. And if a certain feature is indeed relevant for a wider audience, it can always be exposed structurally in the dimensional model at a later stage.

Wrapping up…

In daily life, the dimensional model is still the go-to source of data for any quick analysis. It is clean, well-structured, reliable, and will contain the commonly used business concepts and quantities. But the choice to use the Data Vault philosophy adds a number of very useful tools to our Data Science arsenal:

  1. Unlocking high-quality historical data — even for novel business definitions and concepts
  2. Time travelling to access historical states of truth
  3. Extending the dimensional model to a quasi-real-time system
  4. Quick access to and exploration of novel quantities

This is absolutely crucial in a number of Data Science projects that we run in production. More generally, it is a multiplier on our productivity: being unblocked when we want to experiment with new features, and having direct access to historical data under changing definitions, is extremely helpful. The majority of our twenty-ish production Data Science jobs and services have directly touched various backend Data Vault structures at some stage, and several crucial ones still do so structurally whenever they run.

Want to join Bas Vlaming in finding solutions to interesting problems?