Building a Scalable Caching Platform

Dylan Goldsborough, 7 Apr, 2021 - 6 min read

The assortment of an app-based grocery store is a dynamic thing. A team of category managers, striving to meet the needs of the customer, adds and removes products on a daily basis. Meanwhile, the fulfillment team tweaks the status of products that are currently unavailable, and the store team adds new attributes that they wish to display on the products in the store. Unlike in a physical store, changes to the assortment do not require a small army of shelf stockers, meaning changes may be made at any moment of the day. This creates a problem: how do we build a scalable platform that ensures all microservices in the Picnic tech stack stay in sync and up to date with the actual assortment at all times?

In this article we walk through the solution that we have developed in-house, moving from the source of our assortment data to the consuming microservices. Each step of the way we will discuss a component of our solution, the Publish Cache. More about our microservice story can be found in this article.

Source Data Snapshots

Within Picnic, we have data sources with multiple editors that serve as an input for backend systems. Data from these sources needs to be readily available to our platform in order for it to function, and we need to be able to publish consistent snapshots to ensure all backend systems are working with the same data. In addition, these sources are not always well suited to frequent querying.

An example of such a system is the PIM (short for “product information management” system), the source of all product data within Picnic. The information is curated by category managers, who are constantly tweaking values in their day-to-day tasks. Not every change should be broadcast, however, as the system is often in a work-in-progress (WIP) state. As the PIM is not limited to what is shown in-store but also provides all product information required in the supply chain, there are many places where things could break if the wrong thing were to be released. To prevent inconsistencies, we create a snapshot of the data in the PIM. This snapshot, which is communicated to our platform, carries a version number and is immutable. The snapshot generation process is controlled by category management to ensure that WIP states are not accidentally broadcast to the platform.

Our PIM system did not support the snapshot functionality we desired, and we also wanted to prevent every service from hitting the PIM directly so that its rate limits are respected. For this reason, we decided to write the Publish Cache as a central service that manages the publishing of snapshots for us. The service is written in Python using the AIOHTTP web framework and is backed by a MongoDB database, with an eye on a short time to market and low maintenance effort. As a bonus, writing our own layer between the PIM and the rest of our tech stack allowed us to make the data source system pluggable, making the Publish Cache broadly applicable to any data source we implement a connector for.

Collections and Views

Within Picnic, we have an organizational split between those who maintain source data systems and those who consume from them. In this setup, the team(s) responsible for maintaining the source data ought to have control over what is exposed. The consuming teams, on the other hand, should have control over the way in which they consume the data. We have enabled this in the Publish Cache by separating the definition of a pullable data set (a “view”) from the available data a view can be constructed from (a “collection”).

Taking the PIM system as an example: the product data available in the PIM is vast and constantly changing. To provide a stable foundation, the Publish Cache imports a subset of the columns in tables and bundles them as collections. During every snapshot generation, the Publish Cache pulls all data that currently falls under a collection and persists it with the snapshot data version number attached. This provides the maintainers of the PIM data model control of what is used in production and gives developers a guarantee that the data they are pulling is stable and production-ready.
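As a rough illustration of the idea, the sketch below copies only the exposed columns of a source table into a versioned, read-only collection snapshot. The row layout, the `PRODUCT_COLLECTION` definition, and the `snapshot_collection` helper are hypothetical names chosen for this example; they are not the actual Publish Cache data model.

```python
from types import MappingProxyType

# Hypothetical source rows, standing in for a table in the PIM.
SOURCE_ROWS = [
    {"id": "p1", "name": "Milk", "price_cents": 99, "draft_note": "check supplier"},
    {"id": "p2", "name": "Bread", "price_cents": 149, "draft_note": ""},
]

# Hypothetical collection definition: the subset of columns that is exposed.
PRODUCT_COLLECTION = {"name": "products", "columns": ["id", "name", "price_cents"]}


def snapshot_collection(rows, collection, data_version):
    """Copy only the exposed columns and tag the result with a data version."""
    columns = collection["columns"]
    # MappingProxyType gives a read-only view, mimicking snapshot immutability.
    records = tuple(
        MappingProxyType({column: row[column] for column in columns})
        for row in rows
    )
    return {
        "collection": collection["name"],
        "data_version": data_version,
        "records": records,
    }


snapshot = snapshot_collection(SOURCE_ROWS, PRODUCT_COLLECTION, data_version=42)
```

Columns that are not part of the collection, such as the made-up `draft_note` here, never reach the snapshot, which is what would give the data maintainers control over what is used downstream.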

Product data is consumed by many different teams, and because every team has its own use case for PIM data, we opted for a self-service model. Using the REST interface of the Publish Cache, developers can define their own views on the available collections. A view is a model transformation from the source system’s data model to the target system’s model. This transformation is configured as a JSON document that refers to one or more collections available in the Publish Cache. In addition to transforming the data model, a view can also include a filter on the underlying collection(s).
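To make the notion of a view more concrete, here is a minimal sketch of what such a JSON definition and its evaluation could look like. The JSON layout (`fields`, `filter`) and the `compute_view` function are assumptions made up for this example, and the sketch handles a single collection only; the real view format is not shown in this article.

```python
import json

# Hypothetical view definition: rename fields and keep only available products.
VIEW_DEFINITION = json.loads("""
{
  "name": "storefront_products",
  "collection": "products",
  "fields": {"productId": "id", "title": "name"},
  "filter": {"field": "available", "equals": true}
}
""")

# Hypothetical collection contents, as persisted during snapshot generation.
COLLECTIONS = {
    "products": [
        {"id": "p1", "name": "Milk", "available": True},
        {"id": "p2", "name": "Bread", "available": False},
    ]
}


def compute_view(definition, collections):
    """Apply the optional filter, then map source fields to target fields."""
    rows = collections[definition["collection"]]
    row_filter = definition.get("filter")
    if row_filter is not None:
        rows = [r for r in rows if r.get(row_filter["field"]) == row_filter["equals"]]
    return [
        {target: row[source] for target, source in definition["fields"].items()}
        for row in rows
    ]


view_contents = compute_view(VIEW_DEFINITION, COLLECTIONS)
```

Keeping the definition declarative means a consuming team can change its own view without touching the collection it is built on.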

After the collections are pulled during snapshot generation, all current views are computed from the collections and then persisted, labeled with the data version. As a result, view contents are immutable and can only be retrieved by specifying a data version. A schema check is performed when computing the new view contents; a failure here prevents the snapshot from being published, which keeps the data consistent for the target systems. View contents are exposed as JSON through the REST API of the Publish Cache.
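The gating behaviour of the schema check can be sketched as follows. This is an illustrative simplification under stated assumptions: the schema is a plain mapping of field names to Python types, the store is a dictionary, and all views are validated before anything is written. `SchemaError`, `check_schema`, and `publish_snapshot` are hypothetical names, not part of the Publish Cache API.

```python
class SchemaError(Exception):
    """Raised when computed view contents do not match the expected schema."""


def check_schema(rows, schema):
    """Verify that every row has each expected field with the expected type."""
    for index, row in enumerate(rows):
        for field, expected_type in schema.items():
            if field not in row:
                raise SchemaError(f"row {index}: missing field {field!r}")
            if not isinstance(row[field], expected_type):
                raise SchemaError(f"row {index}: field {field!r} has the wrong type")


def publish_snapshot(store, data_version, computed_views, schemas):
    """Persist all views under the new version only if every view is valid."""
    # Validate everything first, so a failure leaves the store untouched.
    for name, rows in computed_views.items():
        check_schema(rows, schemas[name])
    for name, rows in computed_views.items():
        store[(name, data_version)] = rows
    store["current_version"] = data_version


store = {"current_version": 1}
schemas = {"storefront_products": {"productId": str, "title": str}}

publish_snapshot(
    store, 2, {"storefront_products": [{"productId": "p1", "title": "Milk"}]}, schemas
)

try:
    # The title is missing here, so version 3 must not be published.
    publish_snapshot(store, 3, {"storefront_products": [{"productId": "p2"}]}, schemas)
    rejected = False
except SchemaError:
    rejected = True
```

After the failed attempt the current version is still 2, so consumers are never pointed at a snapshot that failed validation.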

The Consumers

At the end of the snapshot generation process, the new snapshot is synchronized with the backend services: the Publish Cache emits the new data version number over RabbitMQ. Once notified, each service calls the Publish Cache to retrieve the new version of the views it is interested in. While this model minimizes the total number of calls to the Publish Cache, it does bring a challenge: because all services are notified of a new version simultaneously and request data as soon as they are notified, every snapshot publication is followed by a burst of traffic.
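The notify-then-pull flow can be illustrated with a small in-memory stand-in. Note that this sketch does not use RabbitMQ or any RabbitMQ client library; `InMemoryFanout`, `Consumer`, `fetch_view`, and the view names are hypothetical and only mimic the shape of the interaction described above.

```python
class InMemoryFanout:
    """Stand-in for a message broker: delivers each message to all subscribers."""

    def __init__(self):
        self._subscribers = []

    def subscribe(self, callback):
        self._subscribers.append(callback)

    def publish(self, data_version):
        for callback in self._subscribers:
            callback(data_version)


class Consumer:
    """A backend service that pulls its view whenever a new version is announced."""

    def __init__(self, view_name, fetch_view):
        self.view_name = view_name
        self._fetch_view = fetch_view
        self.data_version = None
        self.contents = None

    def on_new_version(self, data_version):
        self.contents = self._fetch_view(self.view_name, data_version)
        self.data_version = data_version


# Hypothetical persisted view contents, keyed by (view name, data version).
VIEW_STORE = {
    ("storefront_products", 7): [{"productId": "p1"}],
    ("warehouse_products", 7): [{"sku": "p1"}],
}
fetch_calls = []


def fetch_view(view_name, data_version):
    """Stand-in for the REST call a service would make to the Publish Cache."""
    fetch_calls.append((view_name, data_version))
    return VIEW_STORE[(view_name, data_version)]


fanout = InMemoryFanout()
storefront = Consumer("storefront_products", fetch_view)
warehouse = Consumer("warehouse_products", fetch_view)
fanout.subscribe(storefront.on_new_version)
fanout.subscribe(warehouse.on_new_version)

# A single notification triggers one fetch per subscribed service.
fanout.publish(7)
```

Only the version number travels over the message channel; the payload itself is pulled by each service, which is why all fetches arrive at roughly the same time.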

To mitigate the burst of requests, which consists of around 50 GB of data being requested over the span of a minute, we made use of the immutability of view contents by applying an asynchronous LRU cache when retrieving view contents. This reduces the number of concurrent copies of the same data we load in memory, as there typically are a few popular large views that cause delays. It also allows us to consolidate database queries for concurrent requests, which has the added benefit of not duplicating the deserialization effort that comes with processing the query result.
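One way to obtain both effects is to cache the in-flight task rather than the finished result, so that concurrent requests for the same view and version await a single load. The following is a minimal sketch of that technique using only `asyncio` from the standard library; it is not the Publish Cache implementation, and `AsyncLRUCache` and `load_view` are hypothetical names.

```python
import asyncio
from collections import OrderedDict


class AsyncLRUCache:
    """LRU cache of tasks: concurrent callers for one key share a single load."""

    def __init__(self, loader, maxsize=8):
        self._loader = loader
        self._maxsize = maxsize
        self._entries = OrderedDict()  # key -> asyncio.Task

    async def get(self, key):
        task = self._entries.get(key)
        if task is None:
            # First request for this key: start the load and cache the task.
            task = asyncio.create_task(self._loader(*key))
            self._entries[key] = task
            if len(self._entries) > self._maxsize:
                self._entries.popitem(last=False)  # evict least recently used
        else:
            self._entries.move_to_end(key)  # mark as recently used
        try:
            # shield() keeps the shared load alive if one waiter is cancelled.
            return await asyncio.shield(task)
        except Exception:
            # Do not keep failed loads in the cache.
            if self._entries.get(key) is task:
                del self._entries[key]
            raise


load_calls = []


async def load_view(view_name, data_version):
    """Stand-in for the database query plus deserialization."""
    load_calls.append((view_name, data_version))
    await asyncio.sleep(0.01)
    return {"view": view_name, "data_version": data_version}


async def demo():
    cache = AsyncLRUCache(load_view, maxsize=2)
    key = ("storefront_products", 7)
    # Five concurrent requests for the same immutable view contents.
    return await asyncio.gather(*(cache.get(key) for _ in range(5)))


results = asyncio.run(demo())
```

Because view contents are immutable per data version, the key `(view name, data version)` never needs invalidation; entries only leave the cache through LRU eviction.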

Because deciding when the PIM is “publish-ready” is a human action, mistakes can happen. To prevent failures, a select few important consumers where wrong data has a large impact (for instance the storefront) are allowed to sanity check the data and request a rollback. When receiving a rollback request the Publish Cache will move the current version back to the requested version and re-emit the data version that it rolled back to on RabbitMQ. Rolling back across the entire tech stack is important to avoid working with different data versions in the store and supply chain.
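The rollback step amounts to moving a version pointer and re-announcing it. The sketch below assumes integer data versions that increase over time and an `emit` callback that stands in for the RabbitMQ notification; `VersionPointer` and its methods are hypothetical names for this example.

```python
class VersionPointer:
    """Tracks the current data version and re-announces it after a rollback."""

    def __init__(self, published_versions, current_version, emit):
        self._published = set(published_versions)
        self.current_version = current_version
        self._emit = emit  # stand-in for emitting the version to all services

    def rollback(self, target_version):
        if target_version not in self._published:
            raise ValueError(f"version {target_version} was never published")
        if target_version >= self.current_version:
            raise ValueError("a rollback must target an earlier version")
        self.current_version = target_version
        # Every service is notified, so the whole stack converges on one version.
        self._emit(target_version)


emitted = []
pointer = VersionPointer(
    published_versions=[5, 6, 7], current_version=7, emit=emitted.append
)
pointer.rollback(6)
```

Since older versions are immutable and still stored, rolling back requires no recomputation: services simply fetch the contents for the re-emitted version.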

Epilogue

Of course, a project is never truly “done”; we have some further ambitions for the Publish Cache. In addition to the data source system, we are also working on making the storage system pluggable. One alternative backend we are looking to support next to MongoDB is Amazon S3. As data versions are immutable, it is possible to generate a flat file for every view version during snapshot generation and serve redirects to these files instead of the view contents themselves.

Internally, we are also piloting whether the Publish Cache is useful for more than PIM data. We have identified other data sources that would benefit similarly from versioning and distributing data through a central service. On that note: if you are interested in learning more about the Publish Cache, feel free to reach out!
