Three big migration projects brought Picnic’s self-service Business Intelligence to new heights — speeding up the development cycle with lightweight tech at scale
Welcome to the fourth part in a series of five blog posts about Data Engineering at Picnic. The series has covered a lot of ground, introducing the state of the art of Picnic’s Lakeless Data Warehouse in the first blog post. In the second, we took a close look at the Data Engineer’s role in the future of groceries, and in the third, we shared some war stories of how we built the Data Warehouse (DWH) in the early days of Picnic.
Now it’s time to take a deep dive into our Business Intelligence scalability challenges over the years. This article reveals how we handled a number of common challenges that businesses face when they need to scale analytics from a few dozen users to hundreds.
At the beginning of 2018, the Data Engineering team was in a challenging situation. We had hit a few scalability roadblocks that nearly halted any new development for weeks. To tackle them, we radically changed our tech stack with three large-scale migration projects:
- Adopted Snowflake as a DWH platform. Moved away from AWS Redshift, Redshift Spectrum, and Athena.
- Adopted Python as the language for Extract Load Transform (ELT). Moved away from Hitachi Pentaho Data Integration (PDI) community open-source edition.
- Adopted Argo for job scheduling and orchestration. Migrated away from Airflow, Cron jobs, and K8s scheduling.
It was an ambitious plan, borderline impossible to achieve, but we were confident that we had the skills and motivation to see it through by living the words “The best way out is through.” Now, three years later, all three projects have turned out to be a great success. They were carefully selected to fulfill our vision for the Data Engineering discipline within Picnic. But it was a wild ride. It felt very much like we were replacing the engine and the cockpit of a plane in mid-air. As a team, we are proud of the extraordinary achievement of pulling it off.
The three-headed monster nemesis of the Data Engineer 🐲
Before we jump to the solutions, it is essential to understand what triggered the change. Here are some familiar challenges that every business faces at one point or another in its analytics journey:
- Under-performant DWH: slow queries, can’t fit all the data, and not resilient to heavy user queries.
- Sluggish ETL/ELT jobs: sequential runs, challenging to peer review, and failures under heavy load due to poor streaming support.
- Complex job orchestration tooling: scheduling tools unable to simultaneously meet our requirements for dependency management, on-demand triggering, and Kubernetes-compatible deployment.
We have always strived to push our technology stack to its limits, and at that time it was evident that it had a hard time keeping up with the pace of the business. To turn this around, treating the symptoms was not enough: we needed a sea change, rethinking our data platform to address the root causes. Let’s take a look at the Picnic-specific drivers.
Scaling from one to two countries: a challenge like no other
Picnic started operations in the Netherlands in April 2015 and quickly expanded to Germany in late 2017. The good news for the DWH was that we had considered this in the data model of the Data Vault (DV) and the Dimensional Kimball from the very beginning of Picnic. Still, there was a lot to adapt when it came to market-specific requirements. For example, we had to adjust to a new way of implementing our delivery areas. In the Netherlands, a postcode is very granular and comprises, on average, eight addresses, while in Germany, it contains thousands. From simple database joins on postcodes, we had to move to queries based on latitude and longitude coordinates for any geospatial analysis.
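To illustrate the shift, here is a minimal sketch of coordinate-based matching (the function and the Amsterdam/Utrecht coordinates are our own illustrative choices, not Picnic’s actual logic): instead of joining on a postcode string, every comparison becomes a distance computation between two points.

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometres between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))  # mean Earth radius ~6371 km

# Illustrative check: Amsterdam to Utrecht is roughly 35 km as the crow flies
d = haversine_km(52.37, 4.90, 52.09, 5.12)
print(f"distance: {d:.1f} km")
```

A query engine would apply the same formula to filter addresses within a delivery radius, which is far heavier than an equality join on postcodes and one reason the workload profile of the DWH changed.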
Besides that, the whole deployment setup needed a makeover. As a company, we decided that all countries would share the codebase while the country-specific deployments were entirely isolated. This included the DWH. We started building two separate DWHs side by side with the same structure. At first, managing all this data infrastructure was a full-time DBA job: deciding when to scale out to more nodes, rebuilding indices, monitoring queries, and lots of waiting in between.
Cross-country analysis (cross-cluster Redshift) was a must. Some of the systems, like ad campaigns, were global and it proved challenging to combine data from both clusters. After exploring bridges between two Redshift clusters, we ended up with the only sensible decision: we needed to duplicate all the data that was truly global and hence important for both markets.
The business grew rapidly; data grew even faster
Over five years, we grew from zero orders per day to tens of thousands, from zero customers to millions. Data volume grew even faster due to the introduction of Snowplow for collecting event-based analytics data from our apps and back-end services. At first, we loaded it in Redshift. At some point, after a few months of steadily scaling out our nodes, we couldn’t expand any further on the tier of EC2 compute instances that we were using. The options were to either upgrade the tier, which was very expensive, or switch to storage-intensive instances, which compromised analytical performance. At the time, we piloted the newly released Redshift Spectrum. The technology turned out to require more maintenance effort than we could spare.
Increased business complexity means increased complexity of analytical queries
The analysis of the existing data became more complex by the day. In the first years of Picnic, to keep good data governance and security, we only exposed DWH data to Business Analysts via Tableau. We had beautiful data sources in the Dimensional Kimball schema and were relentlessly working on putting out more aggregated data sources to keep the performance of the analyses manageable. This also kept it secure, due to Tableau’s access control configuration. Simultaneously, there was a straightforward data API for analysis that allowed us to keep the inner workings of the DWH transparent.
However, this was not enough to keep up with the rapidly expanding business, which became constrained by and dependent on an already over-stretched Data Engineering team. To solve this, we set up an initial pilot with Athena to access some of the larger data sources from the in-app analytics events via SQL. It served a handful of users and seemed to be working well at first, with manageable costs. However, it became apparent that this would turn into a maintenance nightmare as more people needed access. There was high demand for raw data and minimal tooling around data protection for controlled access to sensitive attributes. It required us to enforce security during extraction by rerouting personal data. This is far more challenging than masking data on read, as it makes the requirements phase heavier and introduces additional dependencies on the producer.
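The difference between the two approaches can be sketched in a few lines (a heavily simplified illustration with made-up attribute names, not Picnic’s actual pipeline): masking on extraction means sensitive values must be stripped or pseudonymized before the data ever lands, so the raw value is gone for every downstream consumer, whereas masking on read leaves the stored data intact and applies policy at query time.

```python
import hashlib

# Illustrative: attributes considered sensitive at extraction time
SENSITIVE = {"email", "phone"}

def mask_on_extract(record: dict) -> dict:
    """Pseudonymize sensitive attributes before loading; raw values never reach the DWH."""
    return {
        key: hashlib.sha256(value.encode()).hexdigest() if key in SENSITIVE else value
        for key, value in record.items()
    }

row = {"customer_id": 42, "email": "jane@example.com", "city": "Utrecht"}
masked = mask_on_extract(row)
print(masked["city"])  # non-sensitive attributes pass through unchanged
```

With masking on read, by contrast, `row` would be stored as-is and the hashing (or redaction) would be a view or access-policy concern, which is why it is so much lighter on the extraction side.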
Query load resulted in Redshift cluster failures
On top of everything else, we started experiencing regular Redshift cluster restarts. In the worst period, this happened at least once a week. Each restart introduced up to 15 minutes of downtime during which the DWH was not available. It was a big issue: no production data capture processes could run, and we had to restart them manually one by one.
The restarts were caused by heavy blending operations in Tableau that fired monstrous queries to the database. They could hit at any moment from any of the 50+ Tableau creators. We worked hard together with the AWS Redshift and the Tableau teams to get this resolved, but in the end, any solution seemed temporary.
Around the same time, Tableau introduced Hyper, which improved the speed of data extracts, and we decided to give it a try. The extracts worked well for highly aggregated data sources, but in most cases, they still took hours because of our data modeling choices. Instead of publishing narrow data sources, we had a large star schema with a fact table and dozens of context dimensions that contained more than 1000 columns after flattening. This was not the best use case for Tableau extracts.
ELT jobs from sequential to parallel
Besides all the issues with the DWH’s scalability, we also had challenges on the ELT front. All the ETL/ELT code was in Pentaho, making it very hard for non-Data Engineers to see the business logic and follow the data lineage. Similarly, back-end software engineers found that having to configure Pentaho locally just to validate ETL logic was far from straightforward.
Moreover, we had version control of the XML artifacts of PDI in Git from the very beginning, but peer reviews started to take longer because of the more complex logic and the need to always open the transformation or job files in Pentaho. It was not possible to simply go to GitHub and review the ETL code — there were a few other steps involved.
With the growing team, it was no longer possible to know where everything was and guess where precisely a change in PDI transformation had been made. We foresaw challenges in maintaining our high quality of development as we grew.
Another drawback was that PDI upgrades were nearly impossible — we attempted them a few times and ended up breaking more things than we were fixing, so we stayed with the same major version. Not being able to benefit from the development of a tool was a significant downside for us. For example, we started heavily using metadata injections, although there were better out-of-the-box alternatives in the newer versions. We were also heavy users of JSON sources, and due to the slow built-in PDI component, we needed to do customizations to the ‘Fast JSON’ component. All in all, we didn’t see PDI scaling at Picnic.
At first, we were pretty content with just having Cron jobs and Kubernetes schedules. However, as our use case became more demanding, we started looking at Airflow. We liked the idea of an interface that gave us an overview of all jobs and a visualization of their dependencies. We set up a proof-of-concept (PoC) in Docker on AWS, but found it surprisingly tricky to get the UI responsive and running smoothly. We also found it difficult to manually trigger Directed Acyclic Graphs (DAGs), which was often required in our case, even after regular deployments. In the end, after the PoC in 2017, Airflow didn’t seem to be the best orchestration tool for Picnic.
Enter the three big migration projects 🎭
The valuable tools we used in the start-up phase of Picnic were not the tools we needed to get out of a challenging scale-up situation. We also needed to start thinking differently: we had let ourselves stray too far from the lean Data Engineering architecture focusing on delivering business value that we had initially envisioned. Instead, we were wrangling with dozens of tools that had become fragile, and no one person knew them all. Fragmented tools and fragmented knowledge were starting to hold us back. At the first signs of this situation, we embarked on a journey to re-invent and consolidate our Data Engineering tech stack.
Migrating the DWH to Snowflake
As we searched for a new DWH database, we considered several technologies: Databricks, MonetDB, BigQuery, Greenplum, and Snowflake. In our team, we had some experience with all of those except for Snowflake. We contacted them in early 2018 to start a PoC, and the process went very smoothly.
There were two primary drivers behind the issues we were facing with Redshift.
- Tight coupling of compute resources and storage was forcing us to choose between hot, warm, and cold data storage for analytics. This is very difficult, especially in a rapidly growing organization where analytics priorities change with every new insight.
- Poor isolation of queries between production DWH processes, Data Science, Tableau, SQL, and ad hoc analysis required draconian measures to ensure that production jobs would remain intact. Ultimately, this meant no direct SQL access to the DWH for analysts.
Both of those issues were handled by design in Snowflake. If we used its cloud DWH solution, we could store all the data we had, including in-app analytics, and easily create virtual warehouses of computing resources per user group without the groups impacting each other’s queries. In addition, we would be able to do a global analysis of data from both markets and eliminate the need to duplicate global data.
Another significant advantage of Snowflake is the performance that comes from auto-clustering and auto-scaling. There is no need for full-time DBAs, and analysts can be completely independent in running queries in isolated warehouses.
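As a sketch of what that isolation looks like in practice (warehouse names and sizes here are illustrative, not our actual configuration), each user group gets its own virtual warehouse over the same shared storage:

```sql
-- Illustrative only: separate compute per user group, same underlying data
CREATE WAREHOUSE IF NOT EXISTS analysts_wh
  WAREHOUSE_SIZE = 'SMALL' AUTO_SUSPEND = 60 AUTO_RESUME = TRUE;

CREATE WAREHOUSE IF NOT EXISTS elt_wh
  WAREHOUSE_SIZE = 'LARGE' AUTO_SUSPEND = 60 AUTO_RESUME = TRUE;
```

A runaway analyst query can then, at worst, saturate `analysts_wh`, while production ELT keeps running on its own compute; with auto-suspend, idle warehouses stop accruing cost.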
All this sounded too good to be true, and having seen so much, we were naturally a bit skeptical. However, after running a pilot, we got on board completely. Picnic was the first big Snowflake customer in Europe, and at the time, there were no other companies we could look at for guidance. We went from a light PoC in two weeks to a full-blown migration over the course of three months. Afterward, we were able to completely switch from Redshift + Redshift Spectrum + Athena to Snowflake, and move all Tableau data sources to live connections. This kind of speed and agility has always been in the Picnic DNA.
The multiple tools we used for analysis were best in class as standalone for the purpose they were built for. Our experience, however, was that the sum of the parts was less than the value we were expecting. A lot of analytical value was lost in the technology context switch. Using SQL as a single language in a single platform like Snowflake created synergy between all teams working with data and enabled us to technically grow together rather than drift apart in technicalities.
Migrating the ELT jobs to Python
Initially, at Picnic we chose to use PDI (for more information on why we made this initial choice, read the previous blog post in the series, “How we built our Lakeless Data Warehouse”). Python was the language of choice for data science projects from the start, and within three years, we decided to migrate all our Data Engineering processes to Python as well. There are several reasons why we chose to do so; for more details, take a look at Matthieu Caneill’s article “Building a distributed ETL pipeline for a monolithic data warehouse”.
- Rich data ecosystem in Python supporting data streaming
- Cross-team skill in both tech & business
- Private Picnic libraries for common data tasks
- Automated CI/CD pipeline already in place
- Easy deployment with Docker on Kubernetes
- Efficient and effective peer review process over Github
- Better dissemination of knowledge via code
- Easier upgrade process to a newer version
- Unstructured JSON data handling capabilities
- Modern language attracting top Data Engineering talent
- Faster onboarding on a general programming language
- Controlled upgrades with fixed Python package versions
- Single development environment in IntelliJ
The migration project started with a few independent jobs that ingested data from third-party systems. This allowed us to build private modules for common tasks, such as loading data into the DWH using SQLAlchemy. After the initial PoC and experiments with streaming data, we started a large-scale migration of our custom-built Data Vault automation framework. The initial re-build took several months, on and off, before we could migrate a single Data Vault job. Once this was possible, we started to migrate the 80+ DV jobs and implemented new jobs directly in Python. Fast forward to the present day: the migration project is complete, and we are already experiencing all of the benefits we envisioned when we decided to change our ELT tool.
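For a flavor of what such a shared loading module can look like, here is a heavily simplified sketch (table name, columns, and the function itself are illustrative; we load into the DWH via SQLAlchemy, but the stdlib `sqlite3` stands in here so the example is self-contained and runnable):

```python
import sqlite3
from typing import Iterable, Tuple

def load_rows(conn: sqlite3.Connection, table: str, rows: Iterable[Tuple[int, str]]) -> int:
    """Create the staging table if needed and bulk-insert rows, as a loader module might."""
    rows = list(rows)
    conn.execute(f"CREATE TABLE IF NOT EXISTS {table} (id INTEGER, payload TEXT)")
    conn.executemany(f"INSERT INTO {table} VALUES (?, ?)", rows)
    conn.commit()
    return len(rows)

# Illustrative usage: an ingestion job calls the shared loader instead of writing its own SQL
conn = sqlite3.connect(":memory:")
loaded = load_rows(conn, "stg_orders", [(1, "apples"), (2, "bread")])
print(f"loaded {loaded} rows")
```

The point of such modules is that every ingestion job reuses one reviewed, tested code path for the load step, rather than each job re-implementing connection handling and inserts.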
We discovered a particularly positive outcome during the migration: the Data Engineers are much happier with the technology stack and the development process. Above all, they feel more confident in their technical skills being on a par with Software Engineering. One of our senior developers said, “I have been a Data Engineer for many years, working on large and small projects in different industries, and now for the first time, I feel like a Software Engineer. Using advanced DevOps tools that have been readily available is now possible, and we are no longer isolated from the other tech teams.” So we’ve found that on several levels — technology, process, and people — our move to Python for ELT jobs has been very successful.
Migrating the scheduling and orchestration to Argo
The migration of the scheduling probably followed the least clear path. We felt many limitations with Cron jobs when it came to ease of configuration, overview, and conditional execution. For that reason, most of the orchestration logic had been implemented directly in PDI. One of the possible alternatives we piloted was Airflow. Although it could cover many of our use cases, we experienced challenges with the GUI responsiveness and with manually restarting jobs.
In the end, we chose Argo. It is an open-source solution that allowed us to build Kubernetes-native workflows, events, and CI/CD pipelines. Designed from the ground up for containers, it runs on any Kubernetes cluster and enables orchestration of parallel jobs.
We have now migrated all our jobs to Argo and are very happy with it. An added benefit is the native support for the local timezone, which allows us to schedule jobs at the times closest to business processes without taking many additional steps to accommodate daylight saving time. All the workflow configuration is done in YAML manifest files, and depending on the job, different secrets are configured to be pulled from HashiCorp Vault at runtime.
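For a flavor of such a manifest, here is a minimal sketch of an Argo CronWorkflow (the names, image, and schedule are illustrative, not one of our actual jobs); the `timezone` field is what lets the schedule follow local time across daylight saving transitions:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: CronWorkflow
metadata:
  name: nightly-dv-load            # illustrative name
spec:
  schedule: "0 2 * * *"            # 02:00 local time
  timezone: "Europe/Amsterdam"     # DST handled natively by Argo
  workflowSpec:
    entrypoint: run-job
    templates:
      - name: run-job
        container:
          image: example/elt-job:latest        # illustrative image
          command: ["python", "-m", "jobs.nightly"]
```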
Key takeaways
There were many learnings from our scale-up journey in analytics, but two stand out.
First, migrations are necessary to maintain development velocity and reduce technical debt. A Data Engineering tech stack should not be fixed around the skills and experience of the team. It has to be in line with the needs of the business, and if it is apparent that the existing tech stack is holding back value creation, it has to be changed. This is a difficult process of learning and re-inventing the Data Engineering team, but it’s the only way to stay strong and meet expectations. The migration process needs to be carried out with persistence and discipline in order not to leave any legacy behind. Top-level organizational commitment is vital to ensure continuity in the planning process.
Second, for startups and scale-ups, the leaner the Data Engineering tech stack, the more adaptive the team is to changing business requirements, and the faster it can move. Every team member has similar yet complementary skills, making central planning easier. There is no tight specialization around tooling, which enables us to allocate resources quickly and efficiently when needed. In a fast-growing organization, the modus operandi is that every week brings priority tasks that were not planned; being flexible about who can execute them enables us to react very quickly.
The topic for the fifth (and last) article is still open. What is interesting to you? What data challenges do you face and would like to hear our take on them? I invite you to give me some ideas. ✨