Making the right business and tech decisions isn’t an easy process in our ever-changing world. To accomplish this, it’s best to rely on facts and truth coming from the data on events around us. At Picnic, we invest a great deal of time into knowing our customers (and company) by collecting analytics events coming from our app and internal services. Of course, the analysis itself is carried out in our amazing data warehouse (check out an article on that by our data hero, Iliana Iankoulova) but before that, the data should be delivered to it. We built our special Analytics Platform to collect, prepare, and load data. At the heart of it lies an event streaming platform, in our case Apache Kafka powered by Confluent. Why did we choose it? Let’s take a look at our tech radar on Event Streaming Platforms and see what options were on our plate and what has changed over time (if you don’t know what Picnic tech radars are, check out this amazing article by Philip Leonard).
Event Streaming Platforms
Let’s first define an event as machine-readable data which is emitted by a device or service when something happens, for instance, a customer clicked in an app. Event streaming is then a brokering and transporting process of a single event or a small batch of events from a producer to a consumer. Event streaming platforms are receiving, transforming events on the fly (although this is optional), and then exposing events to the consumers. An important property of event streaming platforms that differentiates them from messaging queues systems is that many consumers should be able to read the same event any number of times while an event is present in the stream. Additionally, a platform should be reliable and follow at-least-once semantics (or even better — exactly-once) to avoid any loss in data.
Based on this definition we can see that technologies like classical RabbitMQ fall off the list because multiple consumers cannot read the same message multiple times (but we’ll talk about RabbitMQ Streams!). Although we did an internal comparison, if we abandon event streaming platforms and move to message queue, the cons of not having replayability outweigh the pros of having RabbitMQ already adopted.
Moreover, technologies like Apache Spark or Apache Flink are much more about event processing and not streaming; they do not actually keep the messages for multiple consumers. The separation of concerns and the line between the categorization of these systems is quite blurred. For instance, Apache Kafka, an event streaming platform, has support for KStreams and KSQL to achieve in-stream processing like aggregations and joins. However, the most important classification is what lies at its core: either it is receiving, keeping, and providing data to consumers, or processing and modifying data. Thus, we will talk today about streaming-first technologies rather than processing-first.
Current status: Deprecate
Before our recent migration to Apache Kafka (there will be a separate post on that, stay tuned!), we used AWS Kinesis as the heart of the Analytics Platform. AWS Kinesis is a fully-managed event streaming platform that integrates nicely with other AWS services like Lambda. Services and tools for AWS Kinesis can be written in any language which supports AWS libraries.
However, over the last several years we encountered a few limitations and complexities with AWS Kinesis which drove us to first demote and now deprecate it. Here’s the top of that issue list:
- It’s impossible to keep data within streams for more than 1 week. This was not adequate for us since replayability of events is crucial for disaster recovery and our plans for real-time data analysis.
- Vendor-specific ecosystems limited us to develop services regardless of AWS tech stack. Besides, the ecosystem for sourcing and sinking of data is far less diverse than with other technologies. We believe that the primary reason for that is AWS Kinesis being proprietary technology: it doesn’t encourage developers to build a strong tooling ecosystem around it like open-source technologies do.
- Consumers’ positions have to be stored outside of AWS Kinesis, typically in AWS DynamoDB. This means that maintenance and operational costs are going up.
- Monitoring of consumer positions in streams and the ability to reset it to a specific point helps with recovery and monitoring, but it’s also essential for real-time processing to avoid reading unnecessary data syncing.
The demotion process started a year ago. We carefully wrote down pain points and workarounds for them, searched what we can add to AWS Kinesis instead of replacing it, and invested time into the development of custom services around it. However, the structural and unresolvable issues outlined above led us to search for alternatives.
The depreciation process is tightly coupled with assessment and trial of technologies that are good candidates for the replacement: developers’ time is reallocated to the replacement technology and steered away from the sunsetted technology. To make it possible we put the following restrictions and processes on the tools around AWS Kinesis:
- Feature-freeze. No new major features should be introduced into Kinesis-related services except bug fixes and security updates.
- Freeze of new pipelines. Unless it’s vital for business continuity, we postponed deployments of new event streams.
- Overprovisioning of resources. We decided to slightly increase the number of shards in Kinesis streams even knowing it isn’t needed right now. While costs are up, this overhead frees up developers from operational maintenance and performance monitoring. Hence, we re-allocated time to fully focus on the replacement of AWS Kinesis.
Current status: Adopted
Apache Kafka is Picnic’s choice and the future foundation of our event streaming platform product strategy. It acts as a distributed persistent log with no (theoretical) limitation on retention time or size. Topics are semantic units in Kafka and one typically creates a topic per data schema. Consumers of data can be grouped together and work simultaneously reading data from a topic. They store topic offsets within Kafka itself, making it clear where they are, and making them resilient to restarts. A topic consists of partitions which are units of parallelism: they cap the number of parallelly working consumers within one group (but not across groups).
Up until Apache Kafka 3.0, Kafka consists of 2 core infrastructure components: Kafka brokers and Apache Zookeeper. The former is handling consumers, producers, and data storage, and the latter is used for orchestration of the brokers and configuration storage. It’s worth mentioning that soon Apache Zookeeper will be dropped in Kafka 3.0, and Kafka brokers will be able to establish consensus on their own via quorum controllers (you can find more information and features in the nice blogpost from Confluent on Kafka 3.0). This not only simplifies management for self-hosted systems but should also bring significant performance improvements and extend the soft-cap limit on the number of partitions per cluster.
Apart from being a decent technology with a huge community all around the world, Apache Kafka grew lots of meat on the bones by introducing a thriving data engineering ecosystem around it. Apache Kafka Connect is a system for sourcing and sinking data out of Kafka by use of connectors. Other Apache projects, like Apache Camel, also make adapters to enrich Apache Kafka Connect capabilities even more. Most of the connectors can work with just simple configuration but of course, you can modify the code or, even better, write your own transformation and validation functions to do single message transformation over the passing data. In Picnic we heavily rely on the Snowflake connector to deliver data straight into our Data warehouse but also on the RabbitMQ connector to propagate data from our internal services.
Apache Kafka Connect is not the only shining gem. KSQL lets you run complex SQL queries over streams including joins to accomplish in-stream data analysis. Apache Kafka also provides an exactly-once guarantee if you need it, and are ready to pay the price in performance (but for certain use cases it’s a must-have).
Of course, nothing comes for free. Apache Kafka is quite known for its complicated configuration to make it efficient with large-scale data. It’s also hard to choose the right number of partitions to handle spurs in data streams. At Picnic, we resolved the first issue by using Confluent Cloud, managed Apache Kafka solutions. It brought great relief to our team by removing maintenance and reducing configuration time to almost zero. It also greatly sped up development and integration by providing a managed Snowflake connector for Apache Kafka Connect, and KSQL DB. As for the second issue, we currently overprovision the number of partitions and keep thinking about the best possible solution.
We started the assessment of Apache Kafka around the time of AWS Kinesis demotion. The first step was a technological and architectural review. After that, we built a tiny PoC using docker-compose to see if our Snowplow pipelines and Apache Kafka Connect could work properly given our environment. Then we could fill out a proper proposal and assess the risks and implications of this migration. The following year, we were busy building a robust trial process. We:
- Built additional pipelines for one of our production environments using Kafka.
- Thoroughly verified data integrity, checking if any data loss happened (in turn, it did, but what it was and how we fixed it is for another article).
- Created monitoring setup around the new pipeline.
- Prepared a set of scripts we needed to simplify further work.
The trial process took us half a year including trying out new Apache Kafka ecosystem features, ironing out any issues in our setup, and communicating with all stakeholders. Afterward, we fully adopted Apache Kafka and started on our journey to full-scale deployments across all market environments.
Current status: Registered
Apache Pulsar is a relatively new distributed messaging system and event streaming platform. It covers both domains and looks as if some RabbitMQ features were added to Apache Kafka plus special perks coming on its own. From message queues perspective Apache Pulsar can work over AMQP protocol on broker level, and use direct or fan-out strategies similar to RabbitMQ (but not topic strategy, i.e., routing keys are not truly there). Therefore, one can argue that it’s following a basic pub/sub pattern rather than full-fledged message queueing. Its event streaming capabilities are broad and fit into our definition, plus it supports random access reads from a topic assuming message id is known.
Apache Pulsar architecture consists of 3 components: Pulsar brokers, Apache Zookeeper, and Apache BookKeeper. One interesting thing in such separation is that Pulsar brokers are stateless and their primary goal is to serve inbound and outbound messages and requests, while Apache BookKeeper provides persistence of event data. It allows to scale these 2 layers independently providing greater flexibility.
We will keep an eye on Apache Pulsar, but at the moment Apache Kafka is winning our hearts and minds for several reasons:
- Apache Kafka Connect has a much greater variety of sources and sinks compared to Apache Pulsar I/O. In particular, for us, the lack of a Snowflake Connector is a minus point.
- KStreams and KSQL are extremely powerful and the next closest equivalent in Apache Pulsar is Pulsar Functions. However, in our view, it’s more akin to slightly more feature-rich Apache Kafka Connect single message transformations which won’t be enough to cover our needs of in-stream data analysis.
- The community size of Apache Pulsar is constantly growing yet it’s still dwarfed by that of Apache Kafka.
Current status: Registered
RabbitMQ Streams is a brand new player on the event streaming platform field, announced in July 2021. This new feature allows users to create log-like streams over the AMQP protocol just by providing the right set of headers, and like Lazy Queues moves RabbitMQ into the territory of out-of primary memory persistence. Overall, the core set of features is pretty dry and fits exactly into our definition, the biggest advantage is that if you have RabbitMQ 3.9 you get it for free and all your applications can send data to streams over AMQP. Consumers are required to use a new API but it’s pretty close to the existing one.
At the moment we add RabbitMQ Streams to our radar and will wait for more news, features, and benchmarks to make our next move.
What’s to come?
Every quarter the guild produces an updated Picnic Tech Radar and shares it with the world. We hope that this will give you an insight into what we do here at Picnic, our vision for the future, and also kick-off an engagement with relevant tech communities. In the coming blogs, we’ll continue our deep dives by selecting a sector of particular interest. Please use the poll below or comments section to suggest topics—thanks!