The most important thing for a successful analytics strategy. Data Mesh, or Hub-and-Spoke? Is “lakeless” a thing!? … and other reflections on building data governance.
Since the publication of the first blog post in this series, we have received numerous questions via social media, direct messages, public posts, and meet-up discussions. It’s been truly amazing to see so much interest and, as promised, we will address the most frequently raised topics in this post. Though many of the ongoing questions were already covered in the previous blog posts, a few warranted their own article.
A quick recap of the Picnic Data Engineering Series:
- Part 1: “Picnic’s Lakeless Data Warehouse”
- Part 2: “Data Engineer’s Role in the Future of Groceries”
- Part 3: “How We Built Our Lakeless Data Warehouse”
- Part 4: “Scaling Business Intelligence with Python and Snowflake”
- Part 5: “7 Antifragile Principles for a Successful Data Warehouse”
Questions addressed in this article:
- What exactly is a “lakeless” Data Warehouse?
- What are your thoughts on the topic of Data Mesh, and whether to adopt it at Picnic?
- As a Data Engineer/Lead, I feel that I don’t have the best seat at the organizational table. Any advice?
- How to succeed with an analytics strategy?
Let’s dive in!
What exactly is a “lakeless” Data Warehouse?
The adjective “lakeless” simply emphasizes that we are strong believers in governed analytical data management. It is not a new paradigm; on the contrary, we deeply respect the classics and have built a product that is:
- Rooted in three decades of Data Warehousing wisdom by Kimball and Linstedt
- Adapted to the capabilities of modern technologies
- Shaped by five years of war stories in supply-chain management and e-commerce
- Complemented with antifragile processes
- Focused on securing and protecting sensitive data
- Designed to enable decision-making in a rapidly-growing online supermarket
What are your thoughts on the topic of Data Mesh, and whether to adopt it at Picnic?
Choosing an architectural and organizational design in any tech area is a matter of trade-offs, and this is especially true in Data Engineering. There is no silver bullet.
Our Data Engineering team at Picnic consists of 14 engineers with professional experience as BI, DWH, and Software Engineering specialists in consulting and e-commerce businesses. Together, we have reached an empirical understanding that there are certain values we should not compromise on; they are fundamental for an organization that considers analytics a core competence. From a Data Engineering perspective, the following values are essential to making an initiative successful in the long term:
- All code released to production is reviewed by two people who have the skills to challenge the assumptions, design, and implementation.
This is especially important for central concepts like orders, promotions, and customer feedback. Pull requests are never rubber-stamped and released without the reviewers fully understanding their impact. This strict process is key for managing technical debt, sharing knowledge, and maintaining high-quality standards.
- Data Engineers work on a daily basis with other Data Engineers to create a productive learning environment.
Ideally, 3–4 Data Engineers should work closely together, with at least one at a senior level. Otherwise, there is no continuity and engineers have a hard time growing; the same mechanical issues get solved multiple times by re-inventing the wheel with slight variations. As an engineer who has been in a situation where I was the only one with a specific skill set, I know how challenging and lonely such a place is. I worked with amazing Software Engineers and Analysts who were always willing to collaborate, but nothing matches the productivity, joy, and depth reached when professionals with the same skills challenge each other while building analytics products. By analogy: to win an Olympic medal in swimming, training with sprinters and cyclists could be useful and fun, but it is far more effective to have a world-class swimming coach and teammates who push each other to excel.
- Without skin in the game, a central function is at high risk of becoming detached and bureaucratic, or bypassed.
As Data Engineers who set central policies for the platform, we also use them all the time and feel the pain when they don’t work. Data Engineers with a DevOps mentality have all the incentives to create pragmatic guidelines and enforce them. If the central function is detached from the concrete business challenges, it is at high risk of being ignored in the interest of velocity, or it will create friction in the organization.
At Picnic, based on our context, team structure, and vision, we see value in DWH centralization and our implementation follows the three values described above. In addition, we have high development velocity, multiple antifragile tools to mitigate the downsides of centralization, and very high-quality analytics data across the business. Here are some of the reasons why centralized DWH works in our context:
- The setup is especially powerful in a microservices operational ecosystem, where data integration is notoriously difficult due to missing business keys and conceptual differences. Someone needs to look out for common operational data entities, and even where integration is not feasible, to at least make the gaps visible.
- We can build automation tools for complex processes, such as diepvries, our open-source Data Vault project. This would not have been possible with federated Data Engineering, as it is a big tech project that requires advanced skills in ETL, Python streaming, and deeply practical pipeline implementation knowledge. We could only achieve it because we have an amazing central team of Data Engineers, where resources can be pooled while we challenge and help each other.
- With our 25+ tech product teams and an equal number of business analytics teams, data engineering resources are scarce. Having a few Data Engineers scattered among those teams would be challenging at best, and with the constraints around finding exceptional talent, it’s virtually impossible.
- Our business priorities shift from quarter to quarter, and we need to adapt capacity in a matter of weeks, sometimes days. This works very well in the current setup of Data Engineering squads. For example, during the COVID-19 pandemic, Picnic reduced its marketing efforts to focus on fulfillment capacity and order completeness. Having a Data Engineer work on non-priority projects because they are staffed on a vertical achieves only a local optimum. With a central team, we can immediately staff urgent projects, no matter the domain, without sacrificing continuity or overstretching people.
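One of the points above mentions diepvries, our open-source Data Vault automation project. The core mechanic such a tool automates can be sketched in a few lines of Python. This is a simplified illustration of the Data Vault hub-loading idea, not the actual diepvries API: a hub load derives a deterministic hash key from each business key, so any pipeline loading the same entity always produces the same surrogate key, and re-running a batch is idempotent.

```python
import hashlib


def hash_key(*business_key_parts: str) -> str:
    """Derive a deterministic surrogate key from a business key.

    Data Vault hubs use hashed business keys so that every pipeline
    loading the same entity produces the same key. This is a
    hypothetical simplification of what diepvries automates.
    """
    normalized = "|".join(p.strip().upper() for p in business_key_parts)
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()


def load_hub(existing_keys: set, staged_rows: list) -> list:
    """Insert only business keys not yet present in the hub (idempotent load)."""
    new_records = []
    for row in staged_rows:
        hk = hash_key(row["order_id"])
        if hk not in existing_keys:
            new_records.append({"hub_order_hashkey": hk, "order_id": row["order_id"]})
            existing_keys.add(hk)
    return new_records
```

Because the key derivation normalizes and hashes the business key, loading the same staged batch twice produces no new hub records the second time, which is what makes Data Vault loads safe to re-run.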
That said, we are following the development of Data Mesh with interest as it becomes more mature and better tested in practice. Although there is an obvious difference between Picnic’s centralized DWH and the federated Data Mesh, there are also some shared principles:
- Data as a product.
Traditionally, DWH teams were too often left to deal with bad data quality. Data was a by-product of operational processes, and Data Engineers had to scrape logs and got tangled in an endless stream of maintenance tasks. In our view, this is not so much an issue with DWH centralization as with organizational priorities and processes. Data should be treated as part of the quality of service of every product, with visible metrics. And, indeed, a federated process for this makes a lot of sense. For example, at Picnic, we have automated dashboards indicating data source issues, which are shared weekly with the entire tech team. This results in actions taken by the product teams; the DWH’s role is to provide visibility. Other distributed measures that work well are schema testing, behavioral testing, and protocols for breaking schema changes, all performed at the source microservice.
- A platform where distributed analytics creation is built from the ground up.
We provide many tools for analysts to independently load data into Snowflake and combine it with production DWH sources. Those schemas are explicitly named TEMP and SANDBOX to make their status clear, promoting a process of bringing a product to production while letting analysts move as quickly as they need. In some cases, we go a step further: we have frameworks that accept analyst-authored queries by design. For example, defining top-level KPIs is a peer-reviewed process in which any analyst can implement a metric that goes live in a matter of days without Data Engineering involvement. Our Data Scientists, meanwhile, own their own schemas.
- Quality standards from a central platform team, including guidelines for back-end teams to streamline data generation. There is no better position from which to promote and enforce quality standards than a centralized DWH team with extensive practical experience and data acumen.
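The schema testing at the source mentioned above can be sketched as a minimal, standard-library-only check that a source microservice might run in CI before shipping events downstream. The field names and event shape here are invented for illustration; they are not Picnic’s actual event contracts.

```python
# Expected event contract for a hypothetical order event.
# Illustrative only: not Picnic's actual schema.
EXPECTED_SCHEMA = {
    "order_id": str,
    "customer_id": str,
    "total_cents": int,
    "created_at": str,
}


def validate_event(event: dict) -> list:
    """Return a list of schema violations (an empty list means the event passes)."""
    errors = []
    for field, expected_type in EXPECTED_SCHEMA.items():
        if field not in event:
            errors.append(f"missing field: {field}")
        elif not isinstance(event[field], expected_type):
            errors.append(
                f"wrong type for {field}: expected {expected_type.__name__}, "
                f"got {type(event[field]).__name__}"
            )
    unexpected = set(event) - set(EXPECTED_SCHEMA)
    if unexpected:
        # New fields are fine to add, but should go through a
        # schema-change protocol before appearing in production events.
        errors.append(f"undeclared fields: {sorted(unexpected)}")
    return errors
```

A check like this, run in the source team’s CI, catches breaking schema changes before they ever reach the DWH, which is exactly the federated quality process described above.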
One thing that works well in our setup, though it is incompatible with the domain division of Data Mesh, is the high quality of data integration and the consistent experience for analysts. By centralizing the heavy lifting, we enable all analysts to work in a decentralized way, relying on a predictable structure and similar SLAs. Data Mesh prescribes splitting either by source system or by business team, which makes root-cause analysis across many sources more challenging. For example, today we take a holistic approach to every avocado’s journey: from the supplier purchase order, to receiving and picking in our warehouse, to driving behavior on the road and the temperature throughout the delivery trip, to the moment a customer gives us feedback on its freshness. The central integration of all this data not only gives powerful insights, but also makes it easy for analysts on the business side to work together and change context without maintenance concerns.
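The value of that central integration can be sketched with a toy example: once every source shares conformed keys, tracing one delivery across systems is a sequence of cheap lookups. All table and field names below are invented for illustration.

```python
# Toy, in-memory stand-ins for four integrated sources, joined by
# conformed purchase-order and delivery keys (names are invented).
purchase_orders = {"po-7": {"supplier": "GreenFarms", "article": "avocado"}}
picks = {"d-42": {"po": "po-7", "picked_at": "07:15"}}
trip_temps = {"d-42": [4.1, 4.3, 4.0]}  # degrees C during the delivery trip
feedback = {"d-42": {"rating": 2, "comment": "not ripe"}}


def trace_delivery(delivery_id: str) -> dict:
    """Follow one delivery across all four sources via the shared keys."""
    pick = picks[delivery_id]
    return {
        "supplier": purchase_orders[pick["po"]]["supplier"],
        "picked_at": pick["picked_at"],
        "max_temp_c": max(trip_temps[delivery_id]),
        "rating": feedback[delivery_id]["rating"],
    }
```

In a domain-split setup, each of these lookups would cross a team boundary with its own keys and SLAs; with central integration, the whole trace is one query over conformed structures.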
In my personal view, Data Mesh is suited to large organizations with:
- Loose coupling among teams and less integrated business units
- An abundance of Data Engineering talent so that they can be staffed redundantly
- A history of mergers and acquisitions, where a central DWH will be a multi-year project to deliver any initial value and can’t keep up with the integration of new companies
- A global market with local development infrastructures and little need for cross-country integrated analytics, or with thousands of engineers
- Multiple product lines with little commonality, which only need aggregated analytics at the top level
- A static organizational structure, in which few new teams are formed and people don’t move around as much
Data Mesh is a response to Data Lake challenges and definitely addresses many of them. Still, it compromises by accepting multiple versions of the same truth across multiple models, and it risks a weak central governance function detached from the business. If the implementation is not managed carefully, this can regress into silos and create barriers to enforcing company-wide data security and privacy policies. Finally, if Kimball data modeling is used, there is a complex Directed Acyclic Graph (DAG) of dependencies for refreshing conformed dimensions and facts. Without giving up on conformity, the effort to decouple based on domains would be prohibitively high.
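To make the DAG point concrete, here is a small sketch using Python’s standard-library topological sorter; the table names are invented. Conformed dimensions must refresh before any fact that references them, and facts can depend on other facts, which is why cutting the graph along domain boundaries is so costly.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Invented example of a conformed-dimension dependency graph:
# each table maps to the set of tables it depends on, so shared
# dimensions must refresh before the facts that reference them.
dependencies = {
    "dim_customer": set(),
    "dim_product": set(),
    "fact_order": {"dim_customer", "dim_product"},
    "fact_delivery": {"dim_customer", "fact_order"},
    "fact_feedback": {"dim_customer", "fact_delivery"},
}

# A valid refresh order: all of a table's dependencies come before it.
refresh_order = list(TopologicalSorter(dependencies).static_order())
```

Handing, say, fact_feedback to a separate domain team does not remove its edge to dim_customer; the cross-domain dependency survives any organizational split, and the refresh order still has to respect it.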
At this stage of Picnic’s journey, we don’t see the need to make concessions on data quality, and we have some tricks up our sleeves to maintain DWH development velocity as the business grows. We have proven that we can get very creative when solving challenges: since we started with the DWH, we have scaled 100x and expanded to two more countries. Our strategy pays off in Data Science and SQL analysis. It is not easy, but it’s all about priorities. Some freedom of individual product teams is restricted. Still, autonomy is not a goal; it is a means of bringing business value.
In any case, Picnic is happy to see the innovation in the Data Engineering space and the emergence of new architectural and organizational patterns. The Data Engineering discipline needs more research and ideas like Data Mesh to shed light on the importance of managing an essential asset of the digital revolution: data. It is an area where the technology is steps ahead of the best practices for using it sustainably and securely at the architectural level.
I am looking forward to seeing Data Mesh further detailed into practical “how-tos” that all analytical roles can adopt as a daily routine. So far, it has been thoroughly defined from an organizational perspective; in the areas of data modeling, data lineage, and data protection, however, it could be more specific. The books of Kimball and Linstedt provide excellent examples to follow, detailing how concrete use cases are solved with little room for interpretation. I am especially interested in how, after years of operation, companies using Data Mesh deal with large-scale, long-running migrations that cross domain boundaries; how teams align on historic data; and how business users perceive analytics data quality.
Next to Data Mesh, we are following the adoption of an alternative decentralized approach: the Hub-and-Spoke model. It is a classic dating back to the 1950s, often used in the transportation industry and more recently adopted for data teams, for example at Postman. As we strengthen the SQL and Python capabilities of our business teams, the Hub-and-Spoke pattern serves as an inspiration. To learn more about this model, I recommend the articles “The Data Mesh and the Hub-Spoke: A Macro Pattern for Scaling Analytics” by Pradeep Menon and “The hub-and-spoke model: An alternative to data mesh” by David Mariani.
As a Data Engineer/Lead, I feel that I don’t have the best seat at the organizational table. Any advice?
The more our initiatives are based on facts rather than opinions, the easier it is to convince people at every level of the organization. For this, it is important for Data Engineering leaders to keep up with the latest trends in R&D while also understanding the conceptual foundations of Data Warehousing. I believe it is essential to be assertive about why certain approaches don’t work for analytics, even if they are popular in Software Engineering. There is no single answer: reading blog posts and books, staying connected to other companies, and collecting internal data will give you plenty of ammunition for substantiating your vision.
Besides all the materials shared throughout the series, I recommend two blog posts that have had a great deal of influence on how Data Engineering at Picnic works, how I understand my leadership role, and why my work matters. They are both written by Maxime Beauchemin, creator of Apache Superset and Apache Airflow, and are called “The Rise of the Data Engineer,” and, the sequel, “The Downfall of the Data Engineer.”
Beauchemin’s writing gave us validation that being a Data Engineer is hard and doesn’t come with a manual. It is a role that means different things depending on the organization you are in. Still, no matter the challenge or the context, we should not give up on conformed dimensions and conformed metrics. Those articles gave us the confidence to imagine a role for Picnic’s Data Engineers in which we don’t make compromises, are proud of our work, and can be a multiplier for the business. When the articles were published in 2017, we were experiencing many Data Warehousing scalability pains, and we were pressured from many sides to go for a Data Lake. The reality check that Data Engineering often has the worst seat at the table gave us enormous motivation to show that we deserve to be heard and have the power to push back on popular trends when they compromise our values.
By having a strong DWH voice in our organizations, we establish it as a discipline on its own merits. With more data to manage than ever before, Data Engineering has a tailwind behind it. This demand, in turn, will trigger universities to design more programs, which will formalize the vast body of knowledge and give new professionals a better foundation. Until Data Engineering becomes more mainstream, we should stay vigilant about the topics that matter for the responsible management of data and dare to speak up when something doesn’t feel right.
How to succeed with an analytics strategy?
Regardless of the type of analytics strategy — Data Warehouse, Data Mesh, Data Lakehouse, Hub-and-Spoke, or another paradigm — it is incredibly important to be Specific, Methodical, and Consistent (SMaC). SMaC is a concept developed by Jim Collins and Morten Hansen in their book Great by Choice: How to Manage Through Chaos. Too often we see that other areas of the business get much more focused strategic thinking, while analytics is left behind to follow the lead of operational systems. This results in trying multiple things at once, not giving any strategy enough runway to be realized, or worse: applying reactive treatment to poorly understood challenges.
Here is a powerful quote from Collins’ book that I find extremely relevant for data engineering:
The more uncertain, fast-changing and unforgiving your environment, the more SMaC one needs to be. A SMaC recipe is a set of durable operating practices that create a replicable and consistent success formula; it is clear and concrete, enabling the entire enterprise to unify and organize its efforts, giving clear guidance regarding what to do and what not to do. A SMaC recipe reflects empirical validation and insight about what actually works and why.
Picnic Data Engineers might appear dogmatic, even rigid at times. In reality, it is about being empirically creative in developing and evolving our SMaC recipe of a “lakeless” Data Warehouse, fanatically disciplined in sticking to it, and productively paranoid in sensing necessary changes. We constantly question our DWH processes but rarely amend them, which provides a good foundation for continuous learning. For an analytical system collecting historical data whose value is in some cases only realized after years, this predictability balances continuity and change.
There are plenty of glimpses into our SMaC recipe in the blog posts in this series. To illustrate how specific we get: have you noticed the logo on each of the articles in the series? The three Picnic cubes are our Data Engineering brand logo. It was custom-made for our team and is widely used in Slack on topics needing our attention or involving the team. The cubes represent the multidimensional OLAP cubes that are part of the Data Warehouse. One cube is missing, to indicate that our work is never done: it is an ongoing challenge to build the most relevant analytics sources for the business.
Thank you for following the series and reaching out with so much feedback. Now that the series is complete, I will take a break from writing to focus on new projects and spend more time at in-person events. I am looking forward to conversations on building next-level tech in Data Warehousing, Data Science, and Master Data Management. Before I go, special thanks to all my colleagues who took the time to review and debate the series over the past two years, challenging me to step up with every publication. On to the next Picnic chapter … “I adapt to the unknown, under wandering stars I’ve grown” ♬