Every team wants to deliver new features that delight their users, but if the application is not available, even the best feature won’t have the intended effect. At Picnic, we prioritize our customers’ ability to place orders and receive groceries seamlessly at any time. To ensure this availability, we need to be able to see what our systems are doing at any point, which makes the observability of our systems essential.
In the realm of observability, three core components come into play: logs, metrics, and traces.
Today, we will focus on tracing and Application Performance Monitoring (APM), as they are particularly crucial when dealing with complex distributed systems. We want to share our journey of replacing our previous solution with Datadog: from the initial decision to migrate, through the assessment of solutions, to the successful migration of our entire system.
Why look for a new solution?
Basically, we were not satisfied with our existing solution.
For one, the Java agent lacked support for several crucial frameworks we use in our company’s technology stack. Also, we faced multiple outages and performance degradations resulting from agent upgrades.
Additionally, our solution’s future-proofing left much to be desired. We encountered inconsistencies in OpenTelemetry support, and the lack of a reliable Terraform provider blocked the Infrastructure as Code (IaC) approach in certain instances.
Last but not least, the per-user pricing model was not suitable for our growth.
What did we want from the new solution?
Our primary aim was a solution that provides a consistent feature set across languages and frameworks, so that teams can choose the ones they prefer.
Moreover, we recognized the OpenTelemetry (OTEL) standard as the future of tracing and were looking for a solution that could seamlessly integrate OTEL into clients and platforms, as we expected OTEL to deliver that consistent feature set and give our teams the aforementioned flexibility. Finally, to build a platform on top of the solution, it needed to support Infrastructure as Code (IaC), preferably via Terraform.
Why did we choose Datadog in the end?
- Datadog’s focus on cloud-native environments closely aligns with our technology stack: deep support and integrations with Kubernetes and main cloud providers.
- Their contributions to OTEL auto-instrumentation and commitment to support OTEL data within the platform ensure future compatibility.
- Datadog has a strong focus on Infrastructure as Code support via Terraform.
- Their documentation-first approach has made it easier for us to adopt the platform and harness its full potential.
- Their continuous innovation and drive for improvement are in line with how we approach our own tech stack at Picnic.
Proof of concept
To assess various solutions and make informed comparisons, we conducted a brief Proof of Concept (PoC) phase. During each PoC, we transitioned multiple teams to use the solution as their daily driver for two weeks. Following this period, we gathered feedback from these teams. We evaluated their experiences against 30+ success criteria, such as user experience (UX), available integrations, and observability capabilities for applications, infrastructure, queues, and synthetic tests.
Having completed this PoC for several solutions and carefully compared the results, we decided to move forward with Datadog and use it at Picnic.
Installations and app instrumentation
While installing the Datadog agent on a Kubernetes cluster using the official Helm chart is a straightforward process, configuring application pods can appear rather complex. This complexity arises from setting up the required environment variables, injecting the Datadog agent for Java into the images/pods, configuring the JVM to load it, and mounting Unix sockets to communicate with the Datadog agent DaemonSet. While these tasks can be accomplished using Helm chart modifications, we decided to follow the admission controller pattern, specifically the MutatingAdmissionWebhook.
This approach simplifies pod configuration, hiding complex Kubernetes configuration from end users and streamlining the application Helm chart logic for maintainers. Datadog provides an admission controller built into its cluster agent. By configuring just three labels and one annotation on a pod, we can automatically instrument it (an additional init container injects the Datadog agent for Java) and propagate the required configuration, making the application ready to send traces. The annotation, for example, selects the version of the Java library to inject:
admission.datadoghq.com/java-lib.version: "<CONTAINER IMAGE TAG>"
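To make the pattern concrete, here is a minimal Python sketch of what a mutating admission webhook does in general. It is not Datadog's implementation: the function name `build_admission_response`, the container name `tracer-init`, and the image `registry.example/java-tracer-init` are hypothetical. Only the annotation key comes from the line above, and the response shape follows the Kubernetes `admission.k8s.io/v1` AdmissionReview format.

```python
import base64
import json

# Annotation key taken from the article's example line.
LIB_VERSION_ANNOTATION = "admission.datadoghq.com/java-lib.version"
# Placeholder image name: an assumption for illustration only.
INIT_IMAGE = "registry.example/java-tracer-init"


def build_admission_response(review: dict) -> dict:
    """Answer an AdmissionReview, injecting an init container when the pod is annotated."""
    request = review["request"]
    pod = request["object"]
    annotations = pod.get("metadata", {}).get("annotations", {})
    response = {"uid": request["uid"], "allowed": True}

    version = annotations.get(LIB_VERSION_ANNOTATION)
    if version is not None:
        init_container = {"name": "tracer-init", "image": f"{INIT_IMAGE}:{version}"}
        if "initContainers" in pod.get("spec", {}):
            # Append to the existing list of init containers.
            patch = [{"op": "add", "path": "/spec/initContainers/-", "value": init_container}]
        else:
            # Create the list, since JSON Patch cannot append to a missing array.
            patch = [{"op": "add", "path": "/spec/initContainers", "value": [init_container]}]
        response["patchType"] = "JSONPatch"
        # The API server expects the patch as base64-encoded JSON.
        response["patch"] = base64.b64encode(json.dumps(patch).encode()).decode()

    return {
        "apiVersion": "admission.k8s.io/v1",
        "kind": "AdmissionReview",
        "response": response,
    }
```

Pods without the annotation are admitted unchanged, which is what keeps the application Helm charts free of instrumentation logic: opting in is a matter of metadata only.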
As we previously mentioned, Picnic believes in OTEL as the future standard for all observability-related integrations. This belief led us to choose OTEL auto-instrumentation for our Python applications as a first step towards a full shift to OTEL standards, since the number of Python apps at Picnic is much lower than the number of Java apps. To align OTEL and Datadog instrumented applications as closely as possible, we also opted for the OTEL operator, primarily for its built-in admission controller, for the same reasons we used it with Datadog.
To streamline trace collection to a single point, we made the decision not to employ the OTEL collector, and instead use the Datadog agent as our collector.
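As a rough illustration of what this means for an OTEL-instrumented pod, the sketch below builds the environment variables that point the exporter at the node-local Datadog agent. The variable names are standard OpenTelemetry SDK settings. The helper `otlp_env` is hypothetical, and the sketch assumes the agent has OTLP ingestion enabled on the conventional gRPC port 4317, which the article does not specify.

```python
def otlp_env(service: str, env: str, version: str, agent_host: str, port: int = 4317) -> dict:
    """Build OTEL SDK environment variables that send traces to a local agent.

    Assumes the agent accepts OTLP over gRPC on `port` (an assumption, not
    something stated in the article).
    """
    return {
        "OTEL_SERVICE_NAME": service,
        # Resource attributes carry the environment and version of the service.
        "OTEL_RESOURCE_ATTRIBUTES": f"deployment.environment={env},service.version={version}",
        "OTEL_EXPORTER_OTLP_ENDPOINT": f"http://{agent_host}:{port}",
        "OTEL_EXPORTER_OTLP_PROTOCOL": "grpc",
    }


# Example: values a pod might receive, with the node IP as the agent host.
example = otlp_env("basket-service", "prod", "1.4.0", "10.0.0.12")
```

With this setup both the Datadog-instrumented Java apps and the OTEL-instrumented Python apps report to the same agent, so there is one collection path to operate.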
The final solution architecture:
Observability as Code
Observability as Code is a critical part of our approach. It involves storing Datadog configuration as code within the application code repository and using TeamCity (our CI server) to deliver changes consistently.
Our key objectives for this approach are to:
- Encourage using code to define, deploy, and update Datadog objects.
- Enhance ownership and accountability by allowing developer teams to manage their observability configuration and review changes before applying them.
- Reduce operational toil, improve maintenance, and accelerate observability adoption across the organization.
We’ve successfully put this solution into practice by leveraging Terraform for provisioning resources for each service. We’ve taken care to maintain separate states to ensure configuration independence.
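One way to picture the separate states is a CI step that initializes Terraform with a distinct state key per service. The sketch below is only an assumption about how that could look: the key layout `observability/<environment>/<service>.tfstate` and the function names are hypothetical, and it presumes a backend that accepts a `key` setting (as the S3 backend does). The `-backend-config` and `-input=false` flags are standard `terraform init` options.

```python
from typing import Dict, List


def state_key(service: str, environment: str) -> str:
    """Hypothetical state layout: one state file per service and environment."""
    return f"observability/{environment}/{service}.tfstate"


def terraform_init_command(service: str, environment: str) -> List[str]:
    """Build the `terraform init` invocation for one service's own state."""
    return [
        "terraform",
        "init",
        "-input=false",
        f"-backend-config=key={state_key(service, environment)}",
    ]


def init_commands(services: List[str], environment: str) -> Dict[str, List[str]]:
    """Map each service to its init command, rejecting duplicate service names."""
    if len(set(services)) != len(services):
        raise ValueError("duplicate service names would share a state")
    return {s: terraform_init_command(s, environment) for s in services}
```

Because each service resolves to its own state, a broken or locked state in one service cannot block changes to another team's monitors or dashboards.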
Here is the full flow of making changes to the observability configuration, from development to deployment:
Challenges we faced
Of course, it was not all smooth sailing, and we encountered our fair share of challenges along the way. Here are some of the most prominent ones:
- We initially faced issues with the visibility of asynchronous communication through RabbitMQ when using auto-instrumentation. In collaboration with Datadog, we resolved this by fine-tuning settings and configuring explicit spans for the methods we are interested in via the dd.trace.methods configuration parameter.
- Enabling profiling introduced a noticeable overhead. However, through adjustments to the profiling profile and fine-tuning the Java Flight Recorder configuration, we successfully found a balance between performance and visibility.
- Datadog aggregates data based on the specific “operations” it is associated with, such as acting as a server, client, RabbitMQ interaction, database query, or various methods. While this approach is effective when focusing on a particular aspect of your application, it can be less than ideal when you need a holistic overview of all operations simultaneously. Unfortunately, we have not been able to find a satisfactory solution for this challenge, but we know that changing this behavior is a high priority for the Datadog team. For now, we are using custom dashboards to do such aggregations.
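For the first challenge above, the dd.trace.methods parameter takes a semicolon-separated list of fully qualified class names, each followed by its method names in square brackets. The helper below is a small sketch that assembles such a value. The function name and the class names in the usage example are hypothetical, not taken from Picnic's codebase.

```python
from typing import Dict, List


def build_trace_methods(targets: Dict[str, List[str]]) -> str:
    """Build a dd.trace.methods value from {class name: [method names]}."""
    parts = []
    for class_name, methods in targets.items():
        if not methods:
            raise ValueError(f"no methods listed for {class_name}")
        parts.append(f"{class_name}[{','.join(methods)}]")
    return ";".join(parts)


# Hypothetical RabbitMQ listener classes, for illustration only.
value = build_trace_methods({
    "com.example.orders.OrderListener": ["onMessage", "retry"],
    "com.example.payments.PaymentHandler": ["handle"],
})
jvm_flag = f"-Ddd.trace.methods={value}"
```

Generating the value from a reviewed mapping keeps the list of explicitly traced methods readable as it grows, instead of hand-editing one long string.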
What do we like?
We find Datadog easy to use for instrumenting applications and pods, and it has significantly increased APM adoption across our teams. Datadog provides great coverage of our tech stack through auto-instrumentation and the ability to incorporate OTEL where needed.
The capability to aggregate data in one place, combined with a wide range of integrations, simplifies data collection and access. You can seamlessly jump from traces to Kubernetes metrics, check database performance, and review RabbitMQ queue statuses, all in one place. Datadog’s deep integration with Kubernetes offers a wealth of data out of the box, including manifest history, network monitoring, and more.
A nice surprise was Universal Service Monitoring (USM), an eBPF-based solution that provides basic observability even for applications that you prefer not to instrument explicitly. It helps us keep an overview of the whole system’s status even if some parts are not instrumented.
What is next?
We view our journey with Datadog as just the beginning. In the near future, we plan to:
- Utilise Terraform and Datadog’s built-in SLO functionality to create a solution for managing Service Level Objectives (SLOs) as code across all teams.
- Extend our adoption to cover not only the backend but also multiple frontend parts using Real User Monitoring (RUM).
- Enhance the visibility of event-driven applications by exploring Data Streams Monitoring.
- Further aggregate data in one place by exploring Network Device Monitoring.