Picnic is an online-only supermarket operating in three countries with a lowest-price guarantee. We rely heavily on data-driven decisions to deliver on it. That’s why Picnic collects more than 300M unique events per week from customer applications and our internal systems, which power future analysis in the DWH.
What are those events and how do we use them? The first source is our customer application. It collects data on user behavior, such as whether customers clicked on a product and how the checkout process went. Based on this, our analysts figure out how Picnic can improve product recommendations or enhance the application UI, giving customers an easy and seamless way to buy the most suitable products.
The second, equally vital source is our internal systems, which send events on warehouse capacity, payments, product availability, and much more. These events enable us to plan procurement, optimize warehouse operations, and do everything we can to make our customers happy.
But how do events reach the DWH? And why are Confluent and Kinesis involved? Let’s dive together into the Picnic Analytics Platform and see how it happens! This is the first post in a series of articles about Picnic’s journey to Apache Kafka. Over the course of the series, we will also tell you about monitoring setup, Snowplow configuration for Apache Kafka, and our Apache Kafka connect setup.
Analytics Platform: The Past
As you may have gathered by now, our data delivery platform consists of two major parts: a data pipeline for customer application data and one for our internal services. We separated them so that each can be scaled and optimized independently, since the nature and volume of their data are completely different. While customer applications produce almost an order of magnitude more events than internal services, the SLA for the latter is much stricter: not a single event may be lost.
Let’s start with how our pipeline for customer app data looked a few months ago. The app sends data via an Nginx proxy, and from there we rely heavily on the Snowplow pipeline (you can read more on Snowplow in one of our articles here). The Snowplow Collector checks that messages are syntactically correct and then forwards them to either the good stream or the bad one. For data streaming we use AWS Kinesis plus DynamoDB. AWS Kinesis is a service that provides durable and scalable streams for real-time data processing; internally it consists of shards, which are its units of scalability and parallelism. It retains all our data for one week, giving us replayability. AWS DynamoDB stores each consumer’s position (i.e., offset) in the Kinesis shards. Next, the Snowplow Enrich service validates the schema of our data against the schema registry (Snowplow Iglu) and forwards it to the enriched good stream or, if validation fails, to the bad one. Our custom Picnic Snowplow-Snowflake loader retrieves data from Kinesis, batches it, and saves it in Avro format to an AWS S3 bucket. Finally, Snowflake reads the events from S3 using Snowpipes.
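To make the validate-and-route step concrete, here is a minimal sketch of the idea in Python. The event shape, the required fields, and the list-based "streams" are all illustrative stand-ins; real Snowplow schemas live in Iglu and the real streams are Kinesis streams.

```python
import json

# Illustrative required fields; real schemas are resolved from Iglu.
REQUIRED_FIELDS = {"event_id", "timestamp", "event_name"}

def validate(raw: str):
    """Return the parsed event if it is syntactically and structurally
    valid, or None otherwise."""
    try:
        event = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not REQUIRED_FIELDS.issubset(event):
        return None
    return event

def route(raw: str, good: list, bad: list) -> None:
    """Forward a raw message to the good stream if it validates,
    otherwise to the bad stream (lists stand in for the streams)."""
    event = validate(raw)
    if event is not None:
        good.append(event)
    else:
        bad.append(raw)
```

The key design point is that nothing is ever dropped: a message that fails validation still lands in the bad stream, where it can be inspected and replayed later.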
The pipeline for our internal services data looks quite similar to the customer one, with a few changes:
- There is no Nginx and customer app. Our services send messages to RabbitMQ and then another of our services posts them to Snowplow Collector.
- There is no Snowplow Iglu. Schema validation is done on the producer level.
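Producer-level validation simply means an event is checked against its contract before it is ever published to RabbitMQ. A toy sketch of the idea, with a hypothetical two-field schema and a `deque` standing in for a RabbitMQ queue:

```python
from collections import deque

# Hypothetical minimal contract; each real producer validates against
# its own schema before publishing.
SCHEMA = {"order_id": str, "amount_cents": int}

def publish(event: dict, queue: deque) -> bool:
    """Validate on the producer side, then enqueue. Invalid events are
    rejected before they ever enter the pipeline."""
    for field, expected_type in SCHEMA.items():
        if not isinstance(event.get(field), expected_type):
            return False
    queue.append(event)
    return True
```

Rejecting bad events at the source is what lets this pipeline skip the Iglu validation stage entirely.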
Analytics Platform: The Change
Picnic is growing blazingly fast, and with it grows the need for more performant and reliable data streaming, as well as for in-stream data analysis. We were happy with the AWS Kinesis pipelines for quite a while, but over time some limitations and complexities became too restrictive to ignore or work around. Here is a short list of them:
- Inability to store data for more than one week. Replayability and recoverability are key properties of data pipelines; they allow us not only to recover quickly from malformed events or pipeline failures, but also to do stream processing in which consumers need historical data to derive the present or predict the future.
- Lack of an ecosystem around AWS Kinesis. We dreamed of rich tooling around our data pipelines that would let us stream data to other systems without building custom services from scratch.
- Last but not least, we wanted exactly-once semantics for some special cases, such as dynamic business rules evaluation.
We felt that the time for a change had come and we set sail to find the right answer to all these issues at once.
Our journey led us to the land of Apache Kafka. It is an event streaming platform that acts as a distributed append-only log. Streams are called topics in Apache Kafka, and they are divided into partitions (like shards in AWS Kinesis). Using Apache Kafka over AWS Kinesis gives us the following benefits:
- We can store data for as long as we want. Apache Kafka doesn’t have limitations on volume or time for data retention.
- Apache Kafka Connect provides exactly what we dreamed of: an ecosystem of tools to source and sink data from data streams.
- Exactly-once semantics are finally possible thanks to idempotent producers and transactions.
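Idempotency here means the broker discards duplicates of a record it has already appended, so a producer can safely retry. The real mechanism lives inside the Kafka broker (producer IDs plus per-partition sequence numbers); the following is only a toy model of that deduplication logic:

```python
class IdempotentLog:
    """Toy append-only log that deduplicates by (producer_id, sequence),
    mimicking how a Kafka broker rejects retried duplicates."""

    def __init__(self):
        self.records = []
        self.last_seq = {}  # producer_id -> highest sequence appended

    def append(self, producer_id: str, seq: int, value: str) -> bool:
        if self.last_seq.get(producer_id, -1) >= seq:
            return False  # a retry of something already appended
        self.records.append(value)
        self.last_seq[producer_id] = seq
        return True
```

Because a retried append with the same sequence number is a no-op, a network timeout followed by a retry can no longer produce a duplicate record, which is the building block that transactions then extend across topics.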
Moreover, Apache Kafka has more to offer, such as querying streams with SQL via ksqlDB, no vendor lock-in, and clear monitoring of consumer positions in the streams. But as we are currently a small team of only three engineers, we wanted a cherry on top: Apache Kafka as a SaaS. We compared a few providers and selected Confluent, a feature-rich SaaS provider of Kafka that covers all of our present and future needs. To sweeten it even more, our cost estimations showed that the Confluent setup would be even cheaper than the AWS Kinesis one! The time for migration had come.
Analytics Platform: The Present
We started by redesigning the architecture of the pipelines, which led to an astonishing simplification of our internal-services pipeline and a somewhat leaner customer-application pipeline.
The internal-services pipeline shrank from five managed services to just one! We use Apache Kafka Connect with the Confluent RabbitMQ source plugin to forward data from RabbitMQ queues to Kafka topics. It is worth mentioning that originally all data from RabbitMQ was sent into a single Kinesis stream, which put the burden of event separation on the DWH and made scaling less predictable, though it saved us a few euros. We have now moved to one Apache Kafka topic per RabbitMQ queue, which greatly simplifies scaling and immediately gives us clean streams of homogeneous data. The Confluent-managed Snowflake connector then loads the data seamlessly into Snowflake.
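For a feel of how little there is to run, a Kafka Connect source connector of this kind is configured with a small JSON document submitted to the Connect REST API. The snippet below is illustrative only: the connector name, queue, topic, host, and credentials are made up, and the full option list is in the Confluent RabbitMQ Source connector documentation.

```json
{
  "name": "rabbitmq-source-orders",
  "config": {
    "connector.class": "io.confluent.connect.rabbitmq.RabbitMQSourceConnector",
    "kafka.topic": "orders",
    "rabbitmq.queue": "orders",
    "rabbitmq.host": "rabbitmq.internal.example",
    "rabbitmq.username": "connect",
    "rabbitmq.password": "********",
    "tasks.max": "1"
  }
}
```

One such config per RabbitMQ queue gives the topic-per-queue layout described above, with no custom forwarding code to maintain.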
The customer-app pipeline looks much the same, but we dropped the dependency on DynamoDB, since Apache Kafka consumers store their offsets in Kafka itself.
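Conceptually, what DynamoDB used to hold for Kinesis is just a map from (consumer group, topic, partition) to the next offset to read, and Kafka keeps that map internally. A toy model of the bookkeeping:

```python
class OffsetStore:
    """Toy model of Kafka's internal consumer-offset tracking: each
    (group, topic, partition) maps to the next offset to read. With
    AWS Kinesis, this bookkeeping required a separate DynamoDB table."""

    def __init__(self):
        self.offsets = {}

    def commit(self, group: str, topic: str, partition: int, offset: int):
        self.offsets[(group, topic, partition)] = offset

    def position(self, group: str, topic: str, partition: int) -> int:
        # In this toy model, a new consumer group starts at offset 0.
        return self.offsets.get((group, topic, partition), 0)
```

Folding this state into the streaming platform itself is what let us retire a whole managed service from the pipeline.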
Analytics Platform: The Future
We have great plans for the Analytics Platform: enabling real-time analysis with Confluent-managed ksqlDB, moving some of our data streams from RabbitMQ to Apache Kafka, and finally attaching real-time dashboards to the topics. The migration from AWS Kinesis to Apache Kafka has also enabled other initiatives in the company, for instance dynamic business rules evaluation and action execution.
We are excited about our migration and the road ahead, and we hope you found this journey interesting too!
Interested in joining one of the amazing Data teams at Picnic? Check out our open positions and apply today!
Data Engineer — Event Streaming Platform
Java Developer — Event Streaming Platform
Software Engineer (Java)
Software Engineer (Python) … and many more roles.