Summary of the "Data Mesh in Practice" book

This article provides a summary of the book "Data Mesh in Practice - How to Set Up a Data-Driven Organization". The book can be freely downloaded here. There is also a podcast episode on Data Mesh Radio with one of the authors. The book is a roughly 60-page report on how to operate a data mesh architecture within an enterprise. It is divided into two main parts: the first describes the data mesh architecture and the reasons why it exists; the second shows how the data mesh journey can be executed within an enterprise.

Part 1: What Is Data Mesh and Why Do We Need It?

Pain Points of Centralized Data Responsibility

The data warehouse approach is to create a centralized, reliable, single-source-of-truth repository. This approach does not scale, and it is impossible to provide a high-quality source of truth on top of dynamic source systems. The data lake approach has similar issues: all data is loaded into the data lake, and the actual quality control only happens at read time (schema-on-read). Both approaches put the responsibility on a central data infrastructure team, and such teams are usually detached from the use cases and from the domain knowledge of the source systems.
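
To make the schema-on-read problem concrete, here is a minimal Python sketch (the record fields are invented for illustration, not taken from the book): raw records land in the lake without validation, and the schema is only applied when a consumer reads the data, which is where quality problems surface.

```python
import json

# Raw events land in the "lake" without any validation (no schema-on-write).
raw_records = [
    '{"customer_id": 42, "amount": "19.99"}',
    '{"customer_id": null, "amount": "oops"}',  # malformed record is stored anyway
]

def read_orders(raw):
    """Schema-on-read: the schema is enforced only when the data is consumed."""
    for line in raw:
        record = json.loads(line)
        try:
            yield {"customer_id": int(record["customer_id"]),
                   "amount": float(record["amount"])}
        except (TypeError, ValueError):
            # Quality issues only surface here, far away from the producing team.
            print(f"skipping malformed record: {record}")

for order in read_orders(raw_records):
    print(order)
```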

The Pillars of Data Mesh

Data mesh is not another iteration of a centralized data analytics architecture (such as the data warehouse, data lake, or data lakehouse). Data mesh can be considered a paradigm shift and is built on four pillars:

Decentralized domain ownership

Business topics (customers, sales, orders, ...) or technical topics (click streams) that are complex and important enough form a domain. Such a domain should define the boundaries of ownership. The decentralized ownership of a domain by domain experts allows for scaling (new sources, changes, ...).

Data as a product

Analytical data is treated with a product-thinking mindset. The data product must be continuously aligned with the feedback of its consumers and can be started as a minimum viable product. The internal workings of a data product are up to the team; the product provides its data in the form of an SQL API, storage access, or similar, and is not intended for end users. A data product is the architectural quantum, i.e., the basic building block, of the data mesh. The success of a data product should be measured (e.g., via usage metrics), and these measurements should guide the product lifecycle.
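
As a sketch of what an output port with usage measurement could look like, consider the following Python snippet. All names here are hypothetical and not from the book; the point is that the internal storage stays hidden, consumers only see the output port, and usage is counted as a success metric.

```python
from dataclasses import dataclass, field

@dataclass
class CustomerDataProduct:
    """Hypothetical data product: internal workings are hidden behind an output port."""
    _rows: list = field(default_factory=lambda: [
        {"customer_id": 1, "country": "DE"},
        {"customer_id": 2, "country": "US"},
    ])
    usage_count: int = 0  # success metric guiding the product lifecycle

    def query(self, country: str) -> list:
        """Output port: query-style access; internal storage is up to the team."""
        self.usage_count += 1
        return [r for r in self._rows if r["country"] == country]

product = CustomerDataProduct()
print(product.query("DE"))   # consumer access via the output port
print(product.usage_count)   # feeds the product's success metrics
```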

Self-service data infrastructure platform

A data mesh is built on top of a self-service data infrastructure that is offered as a platform. Such a platform should provide the tools and services that allow domain teams to easily build data products. The platform is centrally managed, but it should be domain agnostic and provide no domain-specific tooling.
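
The following is a minimal sketch of what a self-service provisioning call could look like. The interface and names are entirely assumed for illustration; the book does not prescribe a concrete API. The key property shown is that the platform offers generic, domain-agnostic capabilities that a domain team composes.

```python
# Hypothetical self-service platform interface: a domain team declares what it
# needs, and the platform provisions domain-agnostic building blocks such as
# storage and access control.

def provision_data_product(name: str, storage: str, readers: list) -> dict:
    """Simulate platform provisioning; a real platform would create the
    infrastructure (buckets, schemas, service accounts) behind this call."""
    return {
        "name": name,
        "storage": f"{storage}://data-products/{name}",
        "readers": readers,
        "status": "provisioned",
    }

# The platform knows nothing about "orders" as a domain concept; it only
# provides generic capabilities.
print(provision_data_product("orders", "s3", readers=["analytics-team"]))
```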

Federated computational data governance

A centralized data governance would be a bottleneck; however, some global data governance is required, e.g., for legal compliance. The data platform should ensure that such data governance rules are enforced computationally.
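
A minimal sketch of what "enforced computationally" could mean in practice, assuming an invented policy check that the platform runs before a data product release (the rule, field names, and schema format are illustrative, not from the book):

```python
# Assumed global policy: columns on this list count as PII and must be
# marked as encrypted before a data product may be released.
PII_FIELDS = {"email", "phone", "birthdate"}

def check_pii_policy(schema: dict) -> list:
    """Return policy violations: PII columns that are not encrypted."""
    return [
        column
        for column, properties in schema.items()
        if column in PII_FIELDS and not properties.get("encrypted", False)
    ]

schema = {
    "customer_id": {"type": "int"},
    "email": {"type": "string", "encrypted": False},
}
violations = check_pii_policy(schema)
if violations:
    print(f"release blocked, unencrypted PII columns: {violations}")
```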

Part 2: The Data Mesh Journey

A truly data-driven organization is based on decentralized data ownership. Adapting an organization to the data mesh paradigm takes time and is a journey in three steps:

Getting Started: A Data Product–Centered Mindset Shift

It is a misconception that change can be forced by technology. Organizations sit on top of large amounts of data that are poorly documented and lack clear ownership.

The chapter presents a case study that illustrates, by example, what a lack of data ownership means. The case study makes the problems of unclear data ownership tangible; for more details, the section should be read in the book.

The perspective of the data producers is oriented more towards their operational services. With awareness of how the data is used for analytical purposes, and by communicating changes in the operational domain early, incidents in the analytical domain can be prevented. By creating incentives for the upstream dependencies in the data flow, long-term collaboration and success can be established. This can happen through social incentivization, e.g., by mentioning the upstream team's work as important when announcing new features. It can also happen through material incentivization, e.g., when the value produced from the downstream data increases the team-level budget of the upstream dependency, so that it can invest more resources into fulfilling the downstream data requirements.

Getting started with data mesh is primarily a mindset change, not a technology or tool change. Such an organizational change should start small: within a "fail fast" setting, an MVP of a first data product should be built, and consumers of this data product should be involved. It is important to define the minimum requirements very clearly.

The infrastructure platform should not be built before the first data product, but together with it. To reduce the risk of platform overengineering, only things that are really required should be built. The infrastructure might be built completely greenfield or by adapting existing infrastructure.

Scaling the Mesh: Self-Serve Data Infrastructure

Data infrastructure teams are overloaded by their central responsibility: the central team needs domain knowledge for decision making, and manual centralized processes don't scale. With a case study on centralized compute capabilities, the authors exemplify the problems of teams with central responsibility; this case study is best read directly in the book. Central infrastructure with decentralized responsibility would prevent the overloading of the central team.

The data infrastructure for the data mesh has to provide many capabilities. Among other things, the data platform should be built on open standards to allow for interoperability.
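
To illustrate the open-standards point, here is a small sketch that exchanges data via Apache Parquet using pyarrow. The library and format are an illustrative choice of this summary; the book argues for open standards in general, not for this particular stack.

```python
import pyarrow as pa
import pyarrow.parquet as pq

# The producing domain writes its output port as Parquet ...
orders = pa.table({"order_id": [1, 2], "customer_id": [42, 43]})
pq.write_table(orders, "orders.parquet")

# ... and any consumer with a Parquet reader can use the data, regardless of
# the producer's internal technology stack.
print(pq.read_table("orders.parquet").to_pydict())
```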

Sustaining the Mesh: Federated Computational Data Governance

The federated data governance pillar of data mesh is important to prevent two data products from being non-interoperable because, e.g., they use different identifiers for the same domain object. Data mesh has no central source of truth, but multiple contextualized versions of the truth. Things with the same name (e.g., customer) do not necessarily have the same meaning in each context. Cross-domain mappings and the identification of polysemes must be handled by a federated governance group. This should not end in a centralized enterprise model, but rather in a prototype-based approach on a contextual level (possibly with short-lived cross-domain working groups).
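
A minimal sketch of such a cross-domain mapping, with invented domain names and identifiers: two domains keep their own "customer" IDs, and a maintained mapping translates between them instead of forcing one enterprise-wide identifier.

```python
# Hypothetical mapping maintained by a federated working group: the sales and
# support domains each keep their own customer identifier.
SALES_TO_SUPPORT = {"S-1001": "SUP-77", "S-1002": "SUP-91"}

def to_support_id(sales_id: str) -> str:
    """Translate a sales-domain customer ID into the support domain."""
    try:
        return SALES_TO_SUPPORT[sales_id]
    except KeyError:
        raise KeyError(f"no cross-domain mapping for {sales_id}") from None

print(to_support_id("S-1001"))  # -> SUP-77
```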

There is a necessity for global data governance rules, e.g., in the context of the GDPR, which must be followed. The self-service platform should provide services to enforce such data governance rules (e.g., the deletion of a user account in all data products, or the encryption of PII data).
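
As a sketch of such a platform service, the following Python snippet fans a single "right to be forgotten" request out to every registered data product, so that each domain team does not have to rebuild this logic. The registry and class names are assumptions made for illustration.

```python
class DataProduct:
    """Hypothetical data product that the platform can ask to delete a user."""
    def __init__(self, name: str):
        self.name = name
        self.records = {"user-1": {"email": "a@example.com"}}

    def delete_user(self, user_id: str) -> None:
        self.records.pop(user_id, None)
        print(f"{self.name}: deleted {user_id}")

# Platform-side registry of all data products in the mesh.
registry = [DataProduct("orders"), DataProduct("clickstream")]

def forget_user(user_id: str) -> None:
    """Platform service: enforce GDPR deletion across all data products."""
    for product in registry:
        product.delete_user(user_id)

forget_user("user-1")
```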

Industry Practices

Common Pitfalls

Best Practices