How we built our Lakeless Data Warehouse

Iliana Iankoulova, 31 Mar, 2021 (15 min read)

Welcome to the third part in a series of five blog posts about Data Engineering at Picnic. It is inspired by Guy Raz’s awesome podcasts “How I Built This” and the feedback I received after the publication of the previous articles. In this piece, we share some stories of our (un)expected path to building a data-driven company. Picnic has a Single Source of Truth Lakeless Data Warehouse where Data Engineers play a key role in the future of online groceries.

It is the most personal of the posts in this series and is as much a collection of short stories about how we started our Data Engineering journey, as it is about personal growth. These lessons learned are examples of what brought us to where we are now. It is not an exhaustive list and what worked for us might not be universally true, but I hope it is thought-provoking and gives a glimpse into a way of thinking centered around value creation with data.

When I started at Picnic years ago there were no systems built yet. No legacy, just a completely blank and humbling whiteboard! We hadn’t delivered a single order, but there was nevertheless a spirit of great ambition and confidence that we would figure it out. From day one, I was confronted with many challenges, only knowing where to start but not where the process would take us. What especially inspired me was the possibility of making a difference in the sustainability of the food supply chain. This gave me a sense of purpose and, in retrospect, I know it was essential to get through the many challenges.

My background was in Business Intelligence consulting and E-commerce Master Data Management Engineering. I had no experience in a startup environment, had never built a Data Warehouse from scratch, and could only imagine what it took to grow a world-class Data Engineering team. This blog post covers the things I wish I knew back then.

The learnings are organized around Picnic’s values: Think, Dare, Do!, and the elements of DataOps: People, Process, and Technology. The (#) behind a learning maps to a section number below.

1. Create a vision, communicate it, and get people behind it

Sometime before I joined Picnic, I met Daniel Gebler, Picnic’s CTO, at a birthday event. We talked about a mysterious new startup that he was part of and the exciting possibility of having data at the core of a tech company. That got me interested and within a couple of days, I sent him a few slides to share an idea for how this could be achieved by building a centralized Data Warehouse (DWH) that would power business decisions and advanced machine learning models. It was far from polished but was effective in communicating technology and architecture alternatives.

Soon afterwards, we came to a shared vision: creating value with analytics is only possible if we have high-quality data that the business trusts. This in turn can only be achieved if Data Engineering is a first-class citizen at the R&D table. To scale it sustainably, security and governance are as important as creating business value.

This sounds logical and simple, but following it in practice on a daily basis is anything but easy. Still, getting people across the business behind this vision got us through some very challenging moments of rapid growth. In times of difficult conversations, we always went back to the vision and used it to make decisions about people, process, and technology. Over the years, we have changed many things based on learnings and almost nothing remains from the original tech stack on the DataOps side. Still, the vision didn’t change and is as strong as ever.

2. Think and plan small, to get big

I was keen to start building the Data Warehouse as soon as we had data, which was a couple of months into operations, about five years ago. We decided to start by making a Google Sheet listing all the KPIs we wanted to track. Although it was a good exercise, it was neither practical nor successful in getting us going or in prioritizing projects. The list became overwhelming: it changed by the hour, and there was no chance of finding the development capacity to realize it all. At some point, we had a mind-boggling ~500 KPIs.

Amid this waterfall process to get all our wishes down in writing, we started a small project with one of the founders and an analyst to generate daily stats posted in Slack. Nothing fancy: # orders, % of completed orders, % on-time deliveries, # active customers, order rating. Still, I enjoyed this project very much, as it was hands-on. This is how the first tables in the DWH were born: from the need to power stats in Slack.
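To make the idea concrete, here is a minimal sketch of how such a handful of daily stats could be computed and rendered as a message. The `Order` record, its field names, the definition of an "active customer", and the message layout are all assumptions made for illustration; they are not the original tables or code.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Order:
    """Hypothetical order record; all field names are illustrative."""
    customer_id: str
    delivery_date: date
    is_completed: bool      # every ordered item was delivered
    is_on_time: bool        # delivered within the promised slot
    rating: Optional[int]   # None when the customer left no rating


def pct(part: int, whole: int) -> float:
    """Percentage that is safe for days without orders."""
    return 100.0 * part / whole if whole else 0.0


def daily_stats(orders: list[Order], day: date) -> dict[str, float]:
    """Aggregate the handful of KPIs for one delivery day."""
    todays = [o for o in orders if o.delivery_date == day]
    ratings = [o.rating for o in todays if o.rating is not None]
    return {
        "orders": len(todays),
        "pct_completed": pct(sum(o.is_completed for o in todays), len(todays)),
        "pct_on_time": pct(sum(o.is_on_time for o in todays), len(todays)),
        # Assumption: "active" means at least one order on that day.
        "active_customers": len({o.customer_id for o in todays}),
        "avg_rating": sum(ratings) / len(ratings) if ratings else 0.0,
    }


def format_message(day: date, stats: dict[str, float]) -> str:
    """Render the stats as the plain-text body of a chat message."""
    return "\n".join([
        f"Daily stats for {day.isoformat()}",
        f"# orders: {stats['orders']}",
        f"% completed orders: {stats['pct_completed']:.1f}",
        f"% on-time deliveries: {stats['pct_on_time']:.1f}",
        f"# active customers: {stats['active_customers']}",
        f"order rating: {stats['avg_rating']:.2f}",
    ])
```

Posting the resulting text to a chat channel is left out on purpose, since it depends on the integration in use.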

3. Go against the trend if you truly believe in it

When I started working on the Slack stats in mid-2015, I took a couple of weeks to deliberate on the structure of the data model. I looked around at what other companies were doing, went to a few meetups, and had a few beers with seasoned Data Engineering leads. Although most of the companies I talked with were using Kimball data modeling in one way or another, there was quite a big difference in the extent to which they canonically followed the approach of a Single Source of Truth.

Back then, Data Lakes were considered the modern approach, promising freedom and development velocity. One distinct feature of the Data Lake approach is that it gives flexibility downstream, sometimes as far down as the individual analyst. The pattern that resonated with me was that the further downstream the responsibility for structuring data is placed, the more disputed the results are, as everyone calculates metrics a bit differently. What’s more, the poorer the data quality, the less trust there is in the whole Data Warehouse: people simply do not use it.

It seemed that our vision of a high-quality DWH was incompatible with the Data Lake paradigm. We had to find another way. I leaned into my centralized DWH background and adapted development processes that matched our speed. It wasn’t always a popular choice, but after a few years of structurally collected data across the whole business, it proved its value.

Picnic’s vision is inspired by the “Information Capabilities Framework” (ICF) published by Gartner in 2014. Common capabilities in the organization, such as data integration and governance, are key for creating value from information assets. ICF is a strategic collaboration between business and tech.

4. Intentionally balance pragmatism and perfectionism

I am a perfectionist by nature: if you give me all the time in the world, I will spend it deliberating and perfecting the work I have already started. The golden mean rule, “perfect is the enemy of the good”, is especially relevant in a startup moving at the speed of light. I learned to deal with this by keeping a sharp focus on what needed to be done very well to stand the test of time and what could be done with a less polished solution. Instead of optimizing the quality of one single thing, I started taking joy in balancing those two aspects and the process to get there. Reflecting on this now, I think it is the most important skill I developed over the years.

To this point, when we got down to building the DWH, we chose to perfect the Data Model & APIs at the expense of making ETL processes robust from the start. For the first few months, the nightly jobs were running with a scheduled task on my Windows machine. Within a few months we deployed in AWS with Elastic Beanstalk and scheduled with CronJobs. We experimented with Airflow. Later, we fully automated with Terraform and Kubernetes and adopted Argo jobs for orchestration. Today, most of the DWH data model that was created in the first couple of years at Picnic is still in use and forms the foundation for the many aggregations built afterwards.

5. Do the research while getting your hands dirty

Always spend some time in research — reading, listening, and building on the learning of others. Nothing beats a good conversation with someone who has been there! People have unique perspectives and getting this input is especially valuable in the Data Engineering world, which is always squeezed between the business and the source systems.

In a startup, it is easy to get consumed by day-to-day needs and skip the theory/research, especially when tech is moving fast due to zero legacy. For Data Engineering, we recommend a few reads that have been especially helpful to us. There is so much wisdom in them that will help engineers avoid reinventing the wheel and set a strong analytics foundation.

6. Don’t compromise on Data Modeling and analytics API design

I can’t stress enough the importance of Data Modeling and API design. Spending time on those topics prevents many issues later on. We took the time to discuss naming conventions and standardization, affectionately calling it “baby naming”. It might feel counterintuitive to pay attention to such a “minor” thing, but it’s worth it to avoid confusion and consequent misuse of data on the part of DWH users. A taxonomy glossary and standardized naming patterns are very useful. At the same time, once an API is exposed to Business Analysts, it is extremely difficult to change further down the road.

For instance, in the stats-posting-to-Slack challenge, we started by modeling the basic dimensions, such as customer; article; delivery slot as the time window to receive the order; Picnic city hub from where the delivery originated; date with multiple role-playing dimensions; and the big decision to also create an order dimension that would hold the context properties, such as on-time and completeness classification. Now, years later, we still use the conformed dimensions and facts that were designed back then.
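As an illustration of what conformed dimensions and a role-playing date dimension look like in practice, here is a small sketch using an in-memory SQLite database. It covers only a subset of the dimensions mentioned above, and every table name, column name, and sample row is a hypothetical stand-in rather than the actual Picnic model. The point to notice is that `dim_date` is joined twice under different aliases, once as the order date and once as the delivery date.

```python
import sqlite3

# Hypothetical star schema: one fact table surrounded by shared dimensions.
SCHEMA = """
CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, iso_date TEXT NOT NULL);
CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, customer_id TEXT NOT NULL);
CREATE TABLE dim_hub (hub_key INTEGER PRIMARY KEY, hub_name TEXT NOT NULL);
CREATE TABLE dim_order (
    order_key INTEGER PRIMARY KEY,
    on_time_class TEXT NOT NULL,        -- e.g. 'on time' / 'late'
    completeness_class TEXT NOT NULL    -- e.g. 'complete' / 'incomplete'
);
CREATE TABLE fact_order (
    order_key INTEGER REFERENCES dim_order(order_key),
    customer_key INTEGER REFERENCES dim_customer(customer_key),
    hub_key INTEGER REFERENCES dim_hub(hub_key),
    order_date_key INTEGER REFERENCES dim_date(date_key),     -- role: order date
    delivery_date_key INTEGER REFERENCES dim_date(date_key),  -- role: delivery date
    order_value_eur REAL NOT NULL
);
"""

# The same date dimension plays two roles through two aliases.
ROLE_PLAYING_QUERY = """
SELECT placed.iso_date    AS order_date,
       delivered.iso_date AS delivery_date,
       COUNT(*)           AS orders,
       SUM(CASE WHEN o.on_time_class = 'on time' THEN 1 ELSE 0 END) AS on_time_orders
FROM fact_order AS f
JOIN dim_date  AS placed    ON placed.date_key = f.order_date_key
JOIN dim_date  AS delivered ON delivered.date_key = f.delivery_date_key
JOIN dim_order AS o         ON o.order_key = f.order_key
GROUP BY placed.iso_date, delivered.iso_date
ORDER BY placed.iso_date, delivered.iso_date
"""


def build_demo_warehouse() -> sqlite3.Connection:
    """Create the schema and load a few made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany("INSERT INTO dim_date VALUES (?, ?)",
                     [(20210301, "2021-03-01"), (20210302, "2021-03-02")])
    conn.executemany("INSERT INTO dim_customer VALUES (?, ?)", [(1, "c1"), (2, "c2")])
    conn.executemany("INSERT INTO dim_hub VALUES (?, ?)", [(1, "Hub A")])
    conn.executemany("INSERT INTO dim_order VALUES (?, ?, ?)",
                     [(1, "on time", "complete"),
                      (2, "late", "complete"),
                      (3, "on time", "incomplete")])
    conn.executemany("INSERT INTO fact_order VALUES (?, ?, ?, ?, ?, ?)",
                     [(1, 1, 1, 20210301, 20210302, 50.0),
                      (2, 2, 1, 20210301, 20210302, 30.0),
                      (3, 1, 1, 20210302, 20210302, 20.0)])
    return conn


if __name__ == "__main__":
    for row in build_demo_warehouse().execute(ROLE_PLAYING_QUERY):
        print(row)
```

Keeping the on-time and completeness classifications on the order dimension means every report filters on the same definitions instead of recomputing them.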

Excerpt from the chapter “Naming and defining business information model components”, Data Modeling for Quality, 2021, Graham Witt.

7. Bridge the gap between Master Data Management (MDM) and Data Warehousing (DWH)

I was closely involved at the beginning in setting up all the MDM systems around products, pricing, and promotions data. This was important for the future of Data Warehousing at Picnic, as I could see firsthand where data quality issues would arise and how challenging it would be to keep a good state. For example:

  • Categories of products changed all the time.
  • Flags representing the physical state of products in the supply chain were missing.
  • Nutritional information sourced from global systems was not always accurate.

In the early days, we had to provision for overwrite mechanisms to adjust large parts of the master data. As a Data Engineer, being aware of this is very helpful to make good DWH decisions. Usually, the MDM is quite distant from the DWH and rarely makes a design decision based on how data will be reported on in the DWH. Any steps to make those teams independent yet aligned will pay off in data quality improvements.

8. Set up mechanisms that enable lean flow in the Data Engineering development cycle

Building a good flow in any Data Engineering process benefits all stakeholders; in this area, lean manufacturing processes are a great inspiration. In our case, we focused on creating flow with mechanisms such as:

  • No single person is a bottleneck for production releases. This was me in the first couple of years and I am really glad that now I am not needed while the quality is as high as ever.
  • Enable Business Analysts to develop prototypes, and establish SQL as a common requirements language.
  • Give DataOps control and responsibility with independently deployed and scheduled jobs.
  • Minimize the tools, languages, and environments that need to be touched to get a feature implemented.
  • Ease peer review. This was exceptionally difficult with Pentaho, where the artifacts are XML files.

9. Document the data catalog as part of the development cycle

Documentation is hard, and quickly becomes outdated. To stay pragmatic and effective, include the creation/maintenance of the data catalog in the development of the feature.

Over the years we tried many tools. The two that have stuck are opening the code base in GitHub as a living document and heavily using the metadata comment fields of the objects directly in Snowflake.
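One way to keep such metadata comments in step with the code is to generate the `COMMENT ON` statements from the same place where a table is defined, and to refuse to release a table with missing descriptions. The sketch below is an assumption about how that could look, not a description of our tooling; the table, column names, and descriptions are made up, and the statements are only rendered as text, not executed.

```python
def sql_literal(text: str) -> str:
    """Quote a string as a SQL literal, doubling embedded single quotes."""
    return "'" + text.replace("'", "''") + "'"


def comment_statements(table: str, table_doc: str,
                       column_docs: dict[str, str]) -> list[str]:
    """Render COMMENT ON statements for a table and its columns.

    Raises ValueError when any description is empty, so that an
    undocumented object fails during development rather than later.
    """
    missing = [col for col, doc in column_docs.items() if not doc.strip()]
    if not table_doc.strip():
        missing.insert(0, table)
    if missing:
        raise ValueError(f"missing documentation for: {', '.join(missing)}")

    statements = [f"COMMENT ON TABLE {table} IS {sql_literal(table_doc)};"]
    statements += [
        f"COMMENT ON COLUMN {table}.{col} IS {sql_literal(doc)};"
        for col, doc in column_docs.items()
    ]
    return statements


if __name__ == "__main__":
    # Hypothetical table and descriptions, for illustration only.
    for stmt in comment_statements(
        "dim_hub",
        "One row per city hub",
        {"hub_key": "Surrogate key", "hub_name": "Hub's display name"},
    ):
        print(stmt)
```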

10. Communicate the difference between Online Analytical Processing (OLAP) and Online Transactional Processing (OLTP)

Operational databases are not a Data Warehouse, and a Data Warehouse cannot be an operational database. The two have different purposes. Although analytics data is used in operations, technical choices need to be made to decouple systems and achieve separation of concerns. Likewise, it is possible to query operational databases for analysis, but this puts the quality of service at risk.

Excerpt from the chapter “Storage and retrieval” p91, Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems, 2017, Martin Kleppmann

11. Start with a low-risk tech stack, learn, and then scale up with the best

We started with Postgres on RDS, a perfectly reasonable relational database that is very well documented. At the same time, we had little data volume to worry about. As we had to start quickly, I set up an instance and started playing with it.

Our ETL tool of choice was Pentaho, an open-source graphical tool that I had a lot of experience using. Also, a few of my Java engineering colleagues were skilled at extending it by writing custom components. We also knew how to deploy the whole pipeline to AWS and with this, all the technical choices were set.

We were able to quickly start with those tools and immediately create value for the business. Later, as the company grew and we learned what worked and what didn’t, we could make long-term choices with less time pressure and more information. We no longer use Pentaho, nor RDS Postgres for BI. However, they helped massively to get us to a scalable solution.

12. Migrate steadily, only as much as the business can handle

Trees are kept in good shape with annual pruning. As a rule of thumb, a maximum of 30% of the tree can be cut if needed to keep it healthy. Similarly, if the majority of effort has to go into tech initiatives for weeks on end, there is significant business risk. Addressing technical debt should be a steady activity, handled opportunistically. Following the Boy Scout rule of leaving the code better than you found it goes a long way toward avoiding big-bang technical debt projects. When such an intervention is needed, the most important thing is to keep communicating to achieve a healthy balance.

13. Do not be wary of stop-gap solutions but make sure to follow up

The first scale-up challenge at Picnic emerged about six months into the DWH’s existence. We had passed the point when the ETL processes ran from my local machine and were already executing directly in AWS. The data modeling seemed to have already given a lot of value, and we could track about 150 of the massive list of 500 KPIs. We had chosen a Data Visualization tool, Tableau, and were living the self-service analytics dream.

We had grown the team from one to two, had started collecting analytics events from the app, and already had many users accessing the Postgres DWH through Tableau. Storage was starting to become an issue though, and the ETL processes got slower and slower because of the increased data volume and the live Tableau queries. As a first tech migration, we replaced Postgres with AWS Redshift and the improvement was substantial. Starting with a familiar technology helped us deliver immediate business value and learn in very fast iterations. Being realistic, preparing from the start to migrate within a fixed time period, and pushing through all had to go together.

14. Invest in Data Vault automation that fits the tech stack

At the beginning of 2016, we started using Data Vault 2.0. It was the foundation for running a fully incremental DWH and decoupling analytics presentation from the rigorous collection of historical data. I was fascinated with the framework from the first time I tried it: the predictability, the structure, and the extensibility gave me confidence that it would be valuable to Picnic.

The only worry we had at the time was that there weren’t many successful implementations that were comparable to our situation. Still, we persevered and within a year built an automated loading framework that makes developing a new domain a matter of a few hours of implementation, without copy-pasting boilerplate code.
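To give a flavor of what metadata-driven loading means, here is a deliberately small sketch: a hub is described once as data, and both its hash key and its load statement are derived from that description instead of being written by hand. This is not our framework. The `HubSpec` fields, the naming pattern, the normalization rules, and the choice of SHA-256 are all assumptions made for illustration.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class HubSpec:
    """Hypothetical metadata describing one hub; names are illustrative."""
    name: str            # e.g. "customer" -> table hub_customer
    business_key: str    # staging column that holds the business key
    staging_table: str   # staging table that already carries the hash key


def hash_key(*business_keys: str) -> str:
    """Deterministic hash key from one or more business keys.

    Keys are trimmed and upper-cased, then joined with a delimiter, so the
    same business key always maps to the same hash and ("a", "b") does not
    collide with ("ab",). SHA-256 is an arbitrary choice for this sketch.
    """
    normalized = "||".join(k.strip().upper() for k in business_keys)
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def hub_load_sql(spec: HubSpec) -> str:
    """Render an insert-only load that skips keys the hub already has.

    Assumes the staging table holds one row per business key per batch.
    """
    hub = f"hub_{spec.name}"
    hk = f"{spec.name}_hk"
    return (
        f"INSERT INTO {hub} ({hk}, {spec.business_key}, load_ts, record_source)\n"
        f"SELECT s.{hk}, s.{spec.business_key}, s.load_ts, s.record_source\n"
        f"FROM {spec.staging_table} AS s\n"
        f"WHERE NOT EXISTS (\n"
        f"    SELECT 1 FROM {hub} AS h WHERE h.{hk} = s.{hk}\n"
        f");"
    )


if __name__ == "__main__":
    print(hub_load_sql(HubSpec("customer", "customer_id", "stg_customer")))
```

Adding a new hub then becomes a one-line `HubSpec`, which is the kind of saving that makes a new domain a matter of hours rather than days.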

For a primer on implementing Data Vault, a great resource is the UK Data Vault user group session “The things I wish I knew before I started my first Data Vault Project.” In that webinar, together with other experienced Data Vault professionals, we share key learnings. For a deeper understanding of how we use DV 2.0 for Data Science, I warmly recommend the blog post “Data vault: new weaponry in your data science toolkit” by my colleague Bas Vlaming.

Picnic’s Data Vault learnings presented at an expert panel session at the UK Data Vault user group session “The things I wish I knew before I started my first Data Vault Project.”

15. Store time in UTC in inner layers of the Data Warehouse

Dealing with time is difficult for analysis, as data sources can be mixed in their representation — some are UTC and some are local time. Requiring that everything is converted to UTC in the Data Vault layer really improves the reporting quality in the presentation layer. As a result of this very practical learning, there are fewer bugs because of daylight saving changes.
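A short example shows why this matters. On 31 October 2021, clocks in the Netherlands went back one hour, so the local time 02:30 occurred twice that night. The sketch below uses Python's standard `zoneinfo` module (it assumes a time zone database is available on the machine) with hypothetical helper names; it converts local timestamps to UTC on the way in and computes durations only on the UTC values.

```python
from datetime import datetime, timedelta, timezone
from zoneinfo import ZoneInfo

AMSTERDAM = ZoneInfo("Europe/Amsterdam")


def to_utc(local_naive: datetime, tz: ZoneInfo) -> datetime:
    """Interpret a naive local timestamp in `tz` and convert it to UTC.

    For ambiguous local times, `fold=0` means the first occurrence and
    `fold=1` the second one (after the clocks went back).
    """
    return local_naive.replace(tzinfo=tz).astimezone(timezone.utc)


def to_local(utc_dt: datetime, tz: ZoneInfo) -> datetime:
    """Convert a UTC timestamp back to local time for presentation."""
    return utc_dt.astimezone(tz)


def elapsed(start_local: datetime, end_local: datetime, tz: ZoneInfo) -> timedelta:
    """Duration between two local timestamps, computed on UTC values."""
    return to_utc(end_local, tz) - to_utc(start_local, tz)


if __name__ == "__main__":
    start = datetime(2021, 10, 31, 1, 30)          # still summer time
    end = datetime(2021, 10, 31, 2, 30, fold=1)    # second 02:30, winter time
    print("wall-clock difference:", end - start)
    print("real elapsed time:   ", elapsed(start, end, AMSTERDAM))
```

Subtracting the naive local values understates the interval by an hour, which is exactly the kind of daylight saving bug that disappears once the inner layers store UTC and only the presentation layer converts back with something like `to_local`.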

16. Ingrain API contract testing into the development culture

Two years into operations, we faced a big challenge with breaking ETL jobs. At some point, it had become common for us to need to fix something manually every night after the deployment of source systems so we would have fresh data in the DWH in the morning.

The cause was the rapidly changing back-end services, which were already reaching a point of major refactoring. The ETL processes were sourcing data directly from MongoDB collections that silently (and not so silently) started changing. It was no longer possible to align the different teams on Slack so that the schema that the DWH was depending on would not break.

This was a pivotal moment, where we felt we were losing control over the quality and completeness of the data in the DWH. Pretty scary! Our amazing back-end team stepped in and started implementing formal schema contracts via end-points especially built for ETL processes. This was a challenging process that took almost a year to complete. Ultimately it yielded a major improvement in stability and quality assurance in the DWH.
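The core idea of a contract test can be sketched in a few lines: the consumer writes down which fields and types it depends on, and any payload that no longer satisfies that list fails a test before it fails a nightly job. The contract below, its field names, and the checking function are hypothetical and far simpler than a real schema contract; they only illustrate the principle.

```python
from typing import Any

# Hypothetical contract for an order end-point: field name -> expected type.
ORDER_CONTRACT: dict[str, type] = {
    "order_id": str,
    "customer_id": str,
    "total_cents": int,
    "delivered": bool,
}


def contract_violations(record: dict[str, Any],
                        contract: dict[str, type]) -> list[str]:
    """Return a list of human-readable contract violations.

    Extra fields are allowed, because additive changes do not break the
    consumer. Missing fields and changed types are reported.
    """
    problems: list[str] = []
    for field, expected in contract.items():
        if field not in record:
            problems.append(f"missing field: {field}")
            continue
        value = record[field]
        # Exact type match, so that a bool is not accepted where an int is expected.
        if type(value) is not expected:
            problems.append(
                f"wrong type for {field}: expected {expected.__name__}, "
                f"got {type(value).__name__}"
            )
    return problems


if __name__ == "__main__":
    payload = {"order_id": "o-1", "customer_id": "c-1", "total_cents": "1999"}
    for problem in contract_violations(payload, ORDER_CONTRACT):
        print(problem)
```

Run against sample payloads in the producing team's test suite, a check like this turns a silent schema change into a failing build on the side that made the change.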

17. Finish what you start

There are so many ideas and cool things to work on in the context of a rapidly growing business. There are constantly new and shiny things that can distract you from wrapping up projects.

Make it a point to really close initiatives. This is not easy: sometimes it will take a while before it is possible to come back to a migration or an improvement, but eventually, it must be done. I have a very simple system for this: I create a reminder for myself to check in on something weeks or months in the future. When it pops up, often it is no longer relevant, but on a few occasions, a topic had been postponed so many times that it had to be escalated so we could properly close the initiative.

18. Build a diverse team that shares the vision

People are the most important element in a tech project. There are two things I learned over the past 10 years.

First, an effective team consists of like-minded individuals who share a vision and values. Even if technological preferences are different they can be reconciled when people are curious, respectful, and open-minded. If that is not the case, the team will spend a lot of precious energy in discussions without getting stuff done.

Second, complement skills as much as possible and embrace diversity. This might sound like a contradiction of the first point — how can a group of like-minded people be diverse?!

Here is an example. At Picnic, the first Data Engineers who joined all had consultancy experience. We focused on the customers of our product, the Business Analysts, and we strove for usability and quality in the DWH. We also shared our aspiration to enable the business rather than restrict advanced analysis by other teams. At the same time, we had very different personalities and areas of expertise. Later on, we could dedicate resources to setting junior colleagues up for success, and also step up our DevOps game with a very skilled Python Software Engineer. Currently, the team consists of nine engineers, representing seven nationalities, with two women as leads. All of these factors ultimately made the DWH product stronger. 🇫🇷 🇵🇹 🇲🇽 🇳🇱 🇮🇹 🇧🇬 🇨🇦

Takeaways

While some of these learnings are context-specific for Picnic Data Engineering, I believe many of them could be helpful in other companies. Our culture revolves around a Think.Dare.Do mindset, within the setting of a rapidly growing company. We had to quickly adapt to new challenges and constantly shift gears. What helped us throughout was a shared data vision across Picnic, a diverse engineering team with a growth mindset, and lean development flows.

In the next blog post we will take a deep dive into some major initiatives we implemented to improve the Data Engineering flow. We will also talk about why we chose our current tech stack and will do a walk-through of three big migration projects.
