Semantic Layers  ·  MDM  ·  Pharma Data Architecture

From Ontology to MDM: How Semantic Layers Are Replacing Hub-and-Spoke Master Data Architectures in Pharma

Why the next generation of pharma data architecture treats meaning as a first-class artifact — and what that demands of the teams who build it. OWL, SHACL, SKOS, R2RML, data mesh, production exemplars, and a five-step migration pattern.

May 26, 2026 • ~30 min read • Ali Shahmohammadi, Ph.D. • 24 references
75%: MDM programs fail to meet business objectives (Gartner via Dataversity)
4: W3C Recommendations (OWL, SKOS, SHACL, R2RML) that form the entire semantic stack
3+: Major pharma semantic-layer implementations in production (Bayer COLID, Roche EDIS, Pistoia FAIR)
Sep 2026: EMA eAF mandatory for all Centrally Authorised Products, the forcing function
Table of Contents
  1. The Gap Between MDM-as-Sold and MDM-as-Needed
  2. Semantic Layer: A Working Definition
  3. Why Pharma Is Unusually Suited to Semantic Layers
  4. The W3C Stack That Does the Actual Work
  5. The Federation Angle: Data Mesh as Operating Model
  6. Pharma Exemplars Already in Production
  7. Migration Pattern: Hub to Read Model
  8. Implementation Pitfalls Worth Naming
  9. Where the Next 24 Months Go
01 — The Problem

The Gap Between MDM-as-Sold and MDM-as-Needed

Pharma has mature, well-funded MDM capability. And almost nobody downstream trusts it completely.

Walk into almost any large pharma’s enterprise architecture function today and you will find a mature, well-funded Master Data Management capability. Customer, product, organization, site, study, asset — each has a hub, a steward, a workflow, an integration layer, and a roadmap of “domains still to onboard.” It looks like a solved problem.

Then ask anyone downstream — a clinical data manager, a regulatory writer, a pharmacovigilance analyst, an R&D data scientist — whether they trust it. The answer is almost always the same: for some things, in some systems, sometimes.

The gap is not a tooling failure. The hubs do what hubs do. The gap is architectural: the row-and-column “golden record” is the wrong unit of mastering for an industry whose data exchange surface is a stack of evolving, machine-processable, semantically rich standards. Per Gartner research cited via Dataversity, around 75% of MDM programs fail to meet their business objectives, a figure that has reportedly worsened since 2015 [1].

In pharma, the failure is usually not “we couldn’t deduplicate customers.” It’s “we built a hub, and then we still had to maintain a parallel translation layer for every regulatory submission, every clinical standard, every research consumer.” That parallel translation layer is the de facto semantic layer. The question facing pharma enterprise architecture in 2026 is whether to keep treating it as an integration tax — or to promote it to the primary architectural primitive and demote the hub to a derived view.

The core argument: Stop maintaining translation tiers. Promote them. The semantic layer is not a new technology: every piece of the stack has been a stable W3C Recommendation since 2017 or earlier, and three of the four for well over a decade. The production exemplars (Bayer COLID since 2019, Roche EDIS since 2017) have been running long enough to call them proven. What is new is the willingness to make the semantic layer the actual master.

02 — Definition

Semantic Layer: A Working Definition

A semantic layer is logical, not physical. It holds the meaning of the data, not the data itself — and that distinction is the whole architectural argument.

Strip it back to the architectural definition: “A semantic layer is a piece of enterprise data architecture designed to simplify interactions between complex data storage systems and business users… The semantic layer provides an intuitive interface that converts that data into meaningful business terms.” [2]

The crucial property is logical, not physical. A semantic layer does not hold the data; it holds the meaning of the data, expressed as a model that downstream consumers — humans, BI tools, agents, regulators — query against. The data itself can live in a lake, a warehouse, a relational system, a graph store, or all of those at once.

The lineage is older than most pharma teams realize. The first commercial semantic layers were the BusinessObjects “Universe” and the MicroStrategy “Semantic Graph” in the 1990s [3]. Looker’s LookML (2012) made “semantics as code” (Git-versioned, peer-reviewed model definitions) into a mainstream practice [3]. Today’s iteration is the universal or headless semantic layer: a tool-agnostic platform (dbt Semantic Layer, Cube, AtScale, Power BI semantic models) that defines logic once and serves it through APIs to every downstream consumer [3]. Gartner explicitly positions semantic layers as a structural component of AI-era analytics architecture [4].

| Primitive | Holds | Optimized for | Failure mode |
|---|---|---|---|
| MDM hub | Reference records (gold) | Operational reconciliation | Lossy denormalization, brittle survivorship |
| Data warehouse / lake | Facts and history | Analytics throughput | Schema sprawl, no shared meaning |
| Knowledge graph | Entities + typed edges + provenance | Compositional reasoning | Governance scale, query latency |
| Semantic layer | Definitions, mappings, constraints | Consistent interpretation across consumers | Drift if not version-controlled |

The key insight: The semantic layer is not a substitute for the others — it is the contract that binds them. It says: “Wherever this entity lives, this is what it means, these are the rules it must satisfy, and these are the names it answers to across the standards we care about.”

03 — The Pharma Case

Why Pharma Is Unusually Suited to Semantic Layers

Almost every industry has standards. Pharma has a stack of them, each authoritative for a different slice of the same entity, each maintained by a different body, each updated on its own cadence.

The multi-standard reality of pharma is not a problem to be solved — it is the reason a semantic layer outperforms a hub. A hub-and-spoke MDM has to either flatten all of this into a denormalized row (and lose the semantics that make the standards regulatorily binding) or replicate the relationships in custom mapping tables that drift over time. A semantic layer treats every standard as a named, versioned view onto a common underlying model.

The same molecule, viewed through the CDISC lens, appears as an EXTRT value in an SDTM exposure record keyed by STUDYID and USUBJID; viewed through the IDMP lens, as an MPID with linked substances; viewed through the OMOP lens, as a drug concept referenced from the DRUG_EXPOSURE table. One model. Many projections.
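The "one model, many projections" idea can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the `Product` fields, the identifier values, and the three projection functions are hypothetical and greatly simplified, and real projections would be driven by the published CDISC, IDMP, and OMOP specifications.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Product:
    """Hypothetical canonical entity, held once in the common model."""
    iri: str               # stable identifier in the semantic layer
    name: str
    substances: tuple      # names of linked substances
    mpid: str              # illustrative IDMP-style product identifier
    omop_concept_id: int   # illustrative OMOP-style concept identifier

def as_sdtm_ex(p, studyid, usubjid):
    # SDTM-style exposure row: treatment name keyed by study and subject.
    return {"STUDYID": studyid, "USUBJID": usubjid, "EXTRT": p.name.upper()}

def as_idmp(p):
    # IDMP-style view: product identifier with its linked substances.
    return {"MPID": p.mpid, "substances": list(p.substances)}

def as_omop_drug_exposure(p, person_id):
    # OMOP-style view: a person-level row pointing at a drug concept.
    return {"person_id": person_id, "drug_concept_id": p.omop_concept_id}

# All values below are invented for the example.
product = Product(
    iri="https://example.org/product/0001",
    name="Examplinib",
    substances=("examplinib mesylate",),
    mpid="EX-MPID-0001",
    omop_concept_id=999001,
)
```

Each function reads the same `product` object, so a correction to the canonical entity shows up in every view without any view-specific copy to reconcile.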

| Standard | Governing Body | Scope | Regulatory binding |
|---|---|---|---|
| CDISC SDTM / ADaM / CDASH | CDISC | Clinical study data | FDA, PMDA (required) [5] |
| OMOP CDM | OHDSI | Observational / RWD | Open community standard [8] |
| HL7 SPL | HL7 / FDA | Product labeling | FDA-binding [10] |
| EMA eAF | EMA | Regulatory applications | Mandatory for CAPs from Sep 1, 2026 [11] |
| ISO IDMP | ISO / EMA | Product identification | EMA SPOR services |
| MedDRA / RxNorm / NCIt / UNII | ICH / NLM / NCI / FDA | Terminologies | Overlay every layer |

Historical precedent: BRIDG Model (ISO 14199, Dec 2024). The Biomedical Research Integrated Domain Group model, jointly governed by CDISC, HL7, ISO, NCI, and FDA, demonstrated that one harmonized semantic model can underpin many surface standards [12]. Its real legacy is proof of concept: one model, many standards. NCI/CBIIT now notes that the model is “no longer actively maintained” as a living artifact and that it should be treated as a foundational reference [13].

04 — The Stack

The W3C Stack That Does the Actual Work

Four W3C Recommendations, the newest of which dates from 2017, give you the core of what is needed to build a declarative, version-controllable, queryable semantic layer on top of RDF and SPARQL.

OWL 2
Web Ontology Language

Provides formal class semantics: subClassOf, equivalentClass, disjointness, property characteristics, and the reasoning substrate behind any nontrivial enterprise ontology. OWL defines what entities exist.

W3C Recommendation, 11 December 2012 [14]
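To make "reasoning substrate" concrete, here is a toy sketch of two inferences an OWL reasoner performs: following subclass links transitively and detecting a disjointness clash. It is plain Python, not an OWL reasoner, and the class names are hypothetical rather than taken from any published ontology.

```python
# Toy class hierarchy: class -> set of direct superclasses (rdfs:subClassOf).
SUBCLASS_OF = {
    "Vaccine": {"BiologicalProduct"},
    "BiologicalProduct": {"MedicinalProduct"},
    "SmallMoleculeProduct": {"MedicinalProduct"},
    "MedicinalProduct": set(),
}

# Pairs of classes declared disjoint: nothing may belong to both.
DISJOINT = {frozenset({"BiologicalProduct", "SmallMoleculeProduct"})}

def superclasses(cls):
    """All direct and inferred superclasses (transitive closure)."""
    seen, stack = set(), list(SUBCLASS_OF.get(cls, ()))
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.add(parent)
            stack.extend(SUBCLASS_OF.get(parent, ()))
    return seen

def is_consistent(asserted_types):
    """False if the asserted types entail membership in two disjoint classes."""
    entailed = set(asserted_types)
    for t in asserted_types:
        entailed |= superclasses(t)
    return not any(pair <= entailed for pair in DISJOINT)
```

Asserting that something is both a `Vaccine` and a `SmallMoleculeProduct` is rejected even though no rule mentions `Vaccine` directly; the clash is inferred through `BiologicalProduct`.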
SKOS
Simple Knowledge Organization System

Represents controlled vocabularies, thesauri, and taxonomies in RDF. The right tool for MedDRA-style hierarchies and value sets. SKOS defines how entities are labeled and arranged.

W3C Recommendation, 18 August 2009 [15]
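A small sketch of what SKOS-style labeling and arrangement buys a consumer: lookup by any label, and a walk up the hierarchy. The dictionary keys mirror `skos:prefLabel`, `skos:altLabel`, and `skos:broader`, but the data structure is an assumption for illustration and the concept codes are invented, not MedDRA content.

```python
# Toy concept scheme; codes and hierarchy are made up for the example.
CONCEPTS = {
    "C1": {"prefLabel": "Cardiac disorders", "altLabels": [], "broader": None},
    "C2": {"prefLabel": "Cardiac arrhythmias", "altLabels": ["Arrhythmia"], "broader": "C1"},
    "C3": {"prefLabel": "Atrial fibrillation", "altLabels": ["AF", "AFib"], "broader": "C2"},
}

def find_concept(label):
    """Resolve a preferred or alternative label to a concept id, ignoring case."""
    wanted = label.casefold()
    for cid, concept in CONCEPTS.items():
        labels = [concept["prefLabel"], *concept["altLabels"]]
        if any(candidate.casefold() == wanted for candidate in labels):
            return cid
    return None

def broader_path(cid):
    """Preferred labels from the concept up to the top of its hierarchy."""
    path = []
    while cid is not None:
        path.append(CONCEPTS[cid]["prefLabel"])
        cid = CONCEPTS[cid]["broader"]
    return path
```

A consumer that only knows the synonym "afib" can still land on the right concept and roll it up to its parent groupings.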
SHACL
Shapes Constraint Language

Validates RDF graphs against declarative constraints. In a semantic layer, SHACL is what makes “this CDISC dataset is conformant” or “this IDMP submission is complete” into a query, not a code path. SHACL defines what must hold for entities.

W3C Recommendation, 20 July 2017 [16]
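The point about conformance being "a query, not a code path" can be illustrated by keeping the constraints as data and running one generic validator over them. This sketch is loosely modeled on `sh:minCount`, `sh:maxCount`, and `sh:datatype`; it is not SHACL itself, and the property names and shape are hypothetical.

```python
# A "shape" expressed as data: property -> constraints.
PRODUCT_SHAPE = {
    "mpid": {"min_count": 1, "max_count": 1, "datatype": str},
    "substance": {"min_count": 1, "datatype": str},
    "strength_mg": {"max_count": 1, "datatype": float},
}

def validate(node, shape):
    """Return a list of violation messages; an empty list means the node conforms."""
    violations = []
    for prop, rule in shape.items():
        values = node.get(prop, [])
        if len(values) < rule.get("min_count", 0):
            violations.append(f"{prop}: fewer than {rule['min_count']} value(s)")
        if "max_count" in rule and len(values) > rule["max_count"]:
            violations.append(f"{prop}: more than {rule['max_count']} value(s)")
        for value in values:
            if not isinstance(value, rule["datatype"]):
                violations.append(f"{prop}: {value!r} is not {rule['datatype'].__name__}")
    return violations

# Invented example nodes: each property maps to a list of values.
good = {"mpid": ["EX-0001"], "substance": ["examplinib"], "strength_mg": [50.0]}
bad = {"mpid": [], "substance": ["examplinib"], "strength_mg": ["50 mg"]}
```

Adding a new rule means editing `PRODUCT_SHAPE`, which can be reviewed and versioned like any other model artifact; the validator itself does not change.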
R2RML
RDB to RDF Mapping Language

Projects existing relational systems into the semantic layer without rebuilding them. This is the key to incremental migration: your MDM hub becomes one input among several. R2RML defines how relational data projects into the model.

W3C Recommendation, 27 September 2012 [17]
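The projection idea can be sketched with the standard-library `sqlite3` module standing in for the relational hub. The `MAPPING` dictionary is only loosely analogous to an R2RML triples map (a subject template plus column-to-predicate pairs); the table name, columns, prefixes, and IRIs are all assumptions made for the example, and a real deployment would use an R2RML processor rather than hand-written code.

```python
import sqlite3

# Hypothetical mapping: which table to read, how to mint subject IRIs,
# and which column becomes which predicate.
MAPPING = {
    "table": "mdm_product",
    "subject_template": "https://example.org/product/{product_id}",
    "class": "ex:MedicinalProduct",
    "predicates": {"name": "ex:name", "mpid": "ex:mpid"},
}

def project(conn, mapping):
    """Yield (subject, predicate, object) triples for every row of the table."""
    conn.row_factory = sqlite3.Row
    for row in conn.execute(f"SELECT * FROM {mapping['table']}"):
        subject = mapping["subject_template"].format(**dict(row))
        yield (subject, "rdf:type", mapping["class"])
        for column, predicate in mapping["predicates"].items():
            if row[column] is not None:  # a NULL column produces no triple
                yield (subject, predicate, row[column])

# In-memory stand-in for an existing hub table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mdm_product (product_id INTEGER, name TEXT, mpid TEXT)")
conn.execute("INSERT INTO mdm_product VALUES (1, 'Examplinib', NULL)")
triples = list(project(conn, MAPPING))
```

The hub table is read as it stands; nothing in it has to be remodeled before its rows can participate in the semantic layer.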

What this means in practice: OWL defines what entities exist. SKOS defines how they are labeled and arranged. SHACL defines what must hold for them. R2RML defines how to project the relational systems you already have. The same query can resolve an entity across a clinical warehouse (via R2RML), a regulatory submission (via IDMP shapes), and a SharePoint document library (via SKOS tagging). Hub-and-spoke MDM products generally offer no comparable native primitive.

05 — Operating Model

The Federation Angle: Data Mesh as the Operating Model

A semantic layer owned by a central team and consumed by everyone else is just a centralized warehouse with a fancier query language. The pattern that holds up under pharma’s organizational scale is a federation.

Zhamak Dehghani’s data mesh thesis articulates the operating model in four principles: “domain-oriented decentralized data ownership and architecture, data as a product, self-serve data infrastructure as a platform, and federated computational governance.” [18, 19]

For pharma, the mapping is direct. Each functional domain — discovery, translational, clinical, regulatory, manufacturing, commercial, safety — owns its slice of the model. Each publishes data products with explicit contracts. A central platform team operates the substrate (the graph store, the mapping engine, the SHACL validator, the SKOS service). A federated governance body (semantic council, ontology board — call it what you like) arbitrates cross-domain alignment.

| Architectural primitive | Is about… | Role in the federated model |
|---|---|---|
| Data mesh | Ownership and contract | Who owns which slice and what they guarantee |
| Semantic layer | Meaning | What entities mean and how they relate across domains |
| MDM hub | Reference-record reconciliation | A derived asset serving legacy operational systems, not the master |

The mistake the last decade made was treating MDM as if it could do all three at once. These three primitives are complementary, not alternative. The hub does not disappear — it just stops being the master.
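One way to picture how ownership, contracts, and shared meaning fit together is a domain-owned data product that declares which semantic-layer terms it exposes, checked automatically by the platform. The sketch below is an assumption-laden illustration: the `DataProduct` fields, the term names, and the single policy in `governance_check` are hypothetical, not a prescribed contract format.

```python
from dataclasses import dataclass, field

# Shared vocabulary maintained by the federated governance body (invented terms).
SHARED_TERMS = {"ex:MedicinalProduct", "ex:Substance", "ex:Study", "ex:Site"}

@dataclass
class DataProduct:
    name: str
    owner_domain: str   # the domain accountable for this product
    version: str        # contract version, e.g. "1.2.0"
    terms: set = field(default_factory=set)  # semantic-layer terms it exposes

def governance_check(product):
    """Automated policy: every exposed term must exist in the shared model."""
    unknown = sorted(product.terms - SHARED_TERMS)
    return {"product": product.name, "passed": not unknown, "unknown_terms": unknown}

clinical = DataProduct("study-registry", "clinical", "1.0.0", {"ex:Study", "ex:Site"})
safety = DataProduct("case-series", "safety", "0.3.0", {"ex:Study", "ex:AdverseEvent"})
```

Here the safety domain stays free to publish on its own schedule, but the check flags `ex:AdverseEvent` as a term that still needs to be agreed across domains before consumers can rely on it.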

06 — Production Evidence

Pharma Exemplars Already in Production

This is not theoretical. Three pharma-authored implementations are openly documented and worth looking at.

Bayer
COLID — Corporate Linked Data

Fully operational across all Bayer divisions since January 2019. Provides persistent, globally unique URIs for corporate metadata assets, an RDF data model, and a SPARQL endpoint for consumers. Published under the BSD-3-Clause license, it is to date the most concrete example of a major pharma replacing a traditional MDM-style metadata registry with a semantic-layer architecture [20].

In production since Jan 2019 • Open source (BSD-3-Clause)
Roche
EDIS + Lynx

EDIS (Enhanced Data and Insight Sharing) launched in 2017 as a company-wide program to transform Roche’s data management strategy. The Roche Dataset Portal’s metamodel is “entirely specified using FAIR standards and community vocabularies.” [21] The companion Lynx system is a knowledge-graph engine for reference data integration across Roche’s semantic ecosystem [22].

EDIS launched 2017 • Lynx: SEMANTiCS 2021
Pistoia Alliance
FAIR Implementation Project

Cross-pharma project with deliverables including the FAIR Toolkit, the FAIR Maturity Matrix (v1.1, March 2025), and FAIR-aligned submission frameworks for in vitro pharmacology and bioassay metadata [23]. The FAIR Maturity Matrix provides a defensible target state for a semantic-layer program and lets you measure progress without inventing metrics from scratch.

FAIR Maturity Matrix v1.1 — March 2025

MELLODDY (IMI) — the federation proof of concept: The European IMI MELLODDY project (Bayer, GSK, Novartis, Janssen, and six other pharmas) built an industry-scale federated machine-learning platform for drug discovery without sharing the underlying data. The architectural lesson is the same one the semantic-layer pattern enforces: the unit of sharing is contracts and meaning, not raw rows.

07 — Migration Pattern

Migration Pattern: How a Hub Becomes a Read Model

The hardest question is not “should we build a semantic layer?” — that is a settled bet. It’s “what happens to the MDM hub we already have?”

The wrong answers are “rip it out” (politically and operationally untenable) and “ignore the new architecture” (which is how parallel-stack rot starts). The pattern that works, in roughly this order:

Step 1: Stand up the semantic layer alongside the hub

Pick one anchor domain — products and substances is usually the right starting point, given the IDMP-O foundation already exists. Define the ontology in OWL, the value sets in SKOS, the constraints in SHACL. Use R2RML to project the existing MDM hub into the semantic layer, so the hub becomes one source among several rather than the source of truth.

OWL ontology • SKOS value sets • SHACL constraints • R2RML mapping
Step 2: Make the semantic layer authoritative for new uses

New downstream consumers (regulatory submission generators, AI agents, FAIR data products, cross-domain analytics) read from the semantic layer, never the hub directly. This is the moment the architectural center of gravity actually moves.

Regulatory submission generators • AI agents • FAIR data products
Step 3: Demote the hub to a read-side cache

The hub continues to serve legacy operational systems that need a flat API. But the hub’s content is now derived from the semantic layer — facts asserted in the graph project down into hub rows, not the other way around. The survivorship rules that used to be the hub’s secret sauce become declarative SHACL constraints, version-controlled and auditable.

Legacy system bridge • SHACL replaces survivorship rules
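The direction of flow in this step, from graph facts down to hub rows, can be sketched as follows. The text above proposes SHACL for the survivorship rules; this sketch substitutes a plain Python dictionary of source priorities as a stand-in, and the facts, source names, and attribute names are all hypothetical.

```python
# Facts asserted in the graph: (subject, attribute, value, source system).
FACTS = [
    ("product/1", "name", "Examplinib", "regulatory"),
    ("product/1", "name", "EXAMPLINIB TABS", "erp"),
    ("product/1", "mpid", "EX-0001", "regulatory"),
    ("product/2", "name", "Samplumab", "erp"),
]

# Survivorship as version-controlled data instead of hub-internal code:
# for each attribute, sources in descending order of trust.
SURVIVORSHIP = {"name": ["regulatory", "erp"], "mpid": ["regulatory"]}

def project_hub_rows(facts, rules):
    """Derive one flat hub row per subject from graph facts."""
    rows = {}
    best_rank = {}  # (subject, attribute) -> rank of the source that currently wins
    for subject, attr, value, source in facts:
        order = rules.get(attr, [])
        if source not in order:
            continue  # source is not trusted for this attribute
        rank = order.index(source)
        if rank < best_rank.get((subject, attr), len(order)):
            best_rank[(subject, attr)] = rank
            rows.setdefault(subject, {"id": subject})[attr] = value
    return list(rows.values())

hub_rows = project_hub_rows(FACTS, SURVIVORSHIP)
```

Changing which source wins for `name` is a one-line edit to `SURVIVORSHIP` followed by a re-projection, which is what makes the rule auditable.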
Step 4: Federate ownership

As more domains stand up their own slice of the semantic model, the central team’s job shifts from “owning the master records” to “operating the platform and arbitrating cross-domain alignment.” This is when data mesh principles stop being slogans and start being how the system actually runs.

Semantic council • Domain ownership • Cross-domain invariants
Step 5: Project into external standards as views

CDISC Define-XML, IDMP submission XML, FDA SPL, EMA eAF: each becomes a generated artifact, produced by querying the semantic layer through the appropriate shapes. New regulatory standards require a new projection, not a re-platforming. EMA eAF goes mandatory for CAPs on 1 September 2026 [11]; that is the near-term forcing function.

CDISC Define-XML • IDMP submission XML • FDA SPL • EMA eAF (Sep 2026)
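"Generated artifact" here means the submission file is rendered from the model on demand rather than maintained by hand. A minimal sketch with the standard-library `xml.etree.ElementTree` follows; the element names and the `MODEL` contents are invented for illustration and do not follow the real Define-XML, IDMP, SPL, or eAF schemas.

```python
import xml.etree.ElementTree as ET

# Stand-in for the result of querying the semantic layer (hypothetical data).
MODEL = {
    "product/1": {
        "name": "Examplinib",
        "mpid": "EX-0001",
        "substances": ["examplinib mesylate"],
    },
}

def render_submission_view(model):
    """Generate an XML artifact by reading the model; nothing is stored twice."""
    root = ET.Element("ProductList")
    for iri, entry in sorted(model.items()):
        product = ET.SubElement(root, "Product", {"id": iri})
        ET.SubElement(product, "Name").text = entry["name"]
        ET.SubElement(product, "Identifier").text = entry["mpid"]
        for substance in entry["substances"]:
            ET.SubElement(product, "Substance").text = substance
    return ET.tostring(root, encoding="unicode")

xml_text = render_submission_view(MODEL)
```

Supporting a further standard would mean adding another render function over the same `MODEL`, which is the sense in which a new standard needs a new projection rather than a re-platforming.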

The economic case writes itself once a single regulatory submission is materially cheaper to generate. The integration tax in pharma R&D is structurally large and increasingly avoidable: the marginal cost of integrating a new system is now genuinely lower than maintaining a custom hub mapping, particularly with LLMs handling schema-mapping tasks [24].

08 — Pitfalls

Implementation Pitfalls Worth Naming

A few failure modes that consume budget without producing outcomes.

Treating the ontology as an IT artifact

An enterprise ontology is regulated reference data with its own change-control regime. It needs versioning (SemVer is fine), release notes, deprecation policies, and an explicit governance body. Building it as a JIRA project that “the data team owns” produces an ontology nobody outside the team trusts. Fund the ontologists; staff a semantic council that includes regulatory, clinical, and commercial domain experts.
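A change-control regime of this kind can be partly automated. The sketch below classifies an ontology release by comparing term sets, under one assumed policy: removing a term is a major change and is only allowed after the term was marked deprecated, adding a term is a minor change, and anything else is a patch. The policy and the function name are illustrative assumptions, not an established convention for ontology versioning.

```python
def classify_release(old_terms, new_terms, deprecated=()):
    """Return "major", "minor", or "patch" for a proposed ontology release."""
    old_terms, new_terms = set(old_terms), set(new_terms)
    removed = old_terms - new_terms
    added = new_terms - old_terms

    # Assumed deprecation policy: a term must be deprecated before removal.
    removed_without_notice = removed - set(deprecated)
    if removed_without_notice:
        raise ValueError(
            f"removed without prior deprecation: {sorted(removed_without_notice)}"
        )
    if removed:
        return "major"
    if added:
        return "minor"
    return "patch"
```

Run in a release pipeline, a check like this turns the deprecation policy from a document into a gate that a proposed release either passes or fails.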

Conflating “semantic layer” with “BI tool semantic model”

A Power BI semantic model is a semantic layer in the BI-tool-specific sense — useful, but scoped to a single consumption tool. The enterprise semantic layer is upstream of every BI tool, every AI agent, every regulatory submission generator. The two coexist; they are not interchangeable.

Skipping SHACL

Constraints are the part of the stack that turns a semantic model from an ER diagram into something operationally trustworthy. Without SHACL (or equivalent shape language), the model is documentation; with it, the model is a contract enforceable by the platform.

Centralizing the council

Federated governance is the principle. A single central body that adjudicates every change is the anti-pattern. Each domain should be able to evolve its own slice within agreed cross-domain invariants; the central role is to maintain the invariants and arbitrate when domains disagree.

Underestimating the talent gap

Ontology engineering is a discipline. Hiring a “knowledge graph engineer” without a clear distinction between data engineering and semantic engineering produces a team that builds graphs which look right and reason wrong. The job descriptions need to differentiate; the training pipeline rarely does.

09 — Looking Ahead

Where the Next 24 Months Go

Three things are converging fast enough to make 2026–2027 the inflection year for this transition in pharma.

Regulatory pressure for machine-processable submissions

CDISC requirements are a decade old. IDMP rollout continues. EMA eAF becomes mandatory for Centrally Authorised Products on 1 September 2026 [11]. SPL is already FDA-binding [10]. Each is a forcing function for structured, semantically grounded internal data, and each is materially cheaper to serve from a semantic layer than from a hub plus a translation tier.

Foundation models reshape the mapping economy

Schema mapping, entity resolution, and ontology alignment — the historic bottlenecks in semantic-layer rollouts — are exactly the tasks where LLMs match or exceed prior state-of-the-art. The economics of building and maintaining a semantic layer at pharma scale have shifted; the marginal cost of integrating a new system is now genuinely lower than maintaining a custom hub mapping.

Data mesh moves from talking point to operating model

The “data-as-product” framing has reached the point where major pharma organizations are restructuring data teams around it. Combined with semantic-layer thinking, this gives the federated ownership model that scales — without surrendering coherence.

The architectural conclusion: The master data layer of the next pharma stack is not a hub. It is an ontology, governed as a regulated artifact, expressed in W3C-standard primitives, owned by federated domains, and serving the rest of the enterprise — including the legacy MDM hub itself — as a logical, version-controlled view. The hub does not disappear. It just stops being the master.

Closing Thought

Stop Maintaining Translation Tiers. Promote Them.

Pharma has been pretending for a long time that a row in a hub is the same thing as the meaning of a medicinal product, a clinical study, a substance, or a site. It never was. The semantic layer is not a new technology: every piece of the stack has been a stable W3C Recommendation since 2017 or earlier, and the production exemplars (Bayer COLID since 2019, Roche EDIS since 2017) have been running long enough to call them proven.

What is new is the willingness — driven by regulatory math, by AI economics, and by the cumulative weight of MDM programs that didn’t deliver — to actually make the semantic layer the master. The teams that get to that architectural state first will spend the rest of the decade integrating new systems by mapping, not by migrating. The teams that don’t will keep paying the integration tax.

Stop maintaining translation tiers. Promote them.

References


  1. Knight, M. Common Master Data Management (MDM) Pitfalls. Dataversity, 11 July 2025; cites a DGIQ Conference presentation attributing the figure to Gartner research. dataversity.net
  2. IBM. What is a semantic layer? IBM Think Topics. ibm.com
  3. Databricks. Semantic Layer Architecture: Components, Design Patterns, and AI Integration. Databricks Blog. databricks.com
  4. Gartner. Rethink Semantic Layers to Support the Future of Analytics and AI, 8 April 2025. gartner.com
  5. CDISC. Foundational Standards (SDTM, ADaM, CDASH, SEND). cdisc.org
  6. CDISC. Define-XML. cdisc.org
  7. U.S. Food and Drug Administration. Study Data Technical Conformance Guide: Technical Specifications Document. Docket FDA-2014-D-0092. fda.gov
  8. OHDSI. Data Standardization. ohdsi.org
  9. Hripcsak, G., Duke, J. D., Shah, N. H., et al. Observational Health Data Sciences and Informatics (OHDSI): Opportunities for Observational Researchers. Studies in Health Technology and Informatics 216:574–578, 2015. doi.org
  10. U.S. Food and Drug Administration. Structured Product Labeling Resources. fda.gov
  11. European Medicines Agency. EU Electronic Application Forms (eAF). ema.europa.eu
  12. HL7 International. BRIDG Model (Biomedical Research Integrated Domain Group). confluence.hl7.org
  13. National Cancer Institute / CBIIT. BRIDG Model Documentation; notes that the model “is no longer actively maintained.” cbiit.github.io
  14. W3C. OWL 2 Web Ontology Language Document Overview (Second Edition). W3C Recommendation, 11 December 2012. w3.org
  15. W3C. SKOS Simple Knowledge Organization System Reference. W3C Recommendation, 18 August 2009. w3.org
  16. W3C. Shapes Constraint Language (SHACL). W3C Recommendation, 20 July 2017. w3.org
  17. W3C. R2RML: RDB to RDF Mapping Language. W3C Recommendation, 27 September 2012. w3.org
  18. Dehghani, Z. How to Move Beyond a Monolithic Data Lake to a Distributed Data Mesh. martinfowler.com, 20 May 2019. martinfowler.com
  19. Dehghani, Z. Data Mesh Principles and Logical Architecture. martinfowler.com, 3 December 2020. martinfowler.com
  20. Bayer Group. COLID: Corporate Linked Data (open-source documentation). bayer-group.github.io
  21. Pistoia Alliance FAIR Toolkit. FAIR Data by Design: Roche EDIS / Roche Dataset Portal Case Study. fairtoolkit.pistoiaalliance.org
  22. Fernández, J. D., Lasierra, N. (Roche). Lynx: A FAIR Knowledge Graph Engine for Reference Data Integration. SEMANTiCS 2021 EU Conference. semantics.cc
  23. Pistoia Alliance. FAIR Implementation Project (FAIR Toolkit, FAIR Maturity Matrix v1.1, March 2025). pistoiaalliance.org
  24. Chilukuri, S., Fleming, E., Westra, A. Digital in R&D: The $100 Billion Opportunity. McKinsey & Company. mckinsey.com