Why the next generation of pharma data architecture treats meaning as a first-class artifact — and what that demands of the teams who build it. OWL, SHACL, SKOS, R2RML, data mesh, production exemplars, and a five-step migration pattern.
Pharma has mature, well-funded MDM capability. And almost nobody downstream trusts it completely.
Walk into almost any large pharma’s enterprise architecture function today and you will find a mature, well-funded Master Data Management capability. Customer, product, organization, site, study, asset — each has a hub, a steward, a workflow, an integration layer, and a roadmap of “domains still to onboard.” It looks like a solved problem.
Then ask anyone downstream — a clinical data manager, a regulatory writer, a pharmacovigilance analyst, an R&D data scientist — whether they trust it. The answer is almost always the same: for some things, in some systems, sometimes.
The gap is not a tooling failure. The hubs do what hubs do. The gap is architectural: the row-and-column “golden record” is the wrong unit of mastering for an industry whose data exchange surface is a stack of evolving, machine-processable, semantically rich standards. Per Gartner research cited via Dataversity, around 75% of MDM programs fail to meet their business objectives, a figure that has reportedly worsened since 2015 [1].
In pharma, the failure is usually not “we couldn’t deduplicate customers.” It’s “we built a hub, and then we still had to maintain a parallel translation layer for every regulatory submission, every clinical standard, every research consumer.” That parallel translation layer is the de facto semantic layer. The question facing pharma enterprise architecture in 2026 is whether to keep treating it as an integration tax — or to promote it to the primary architectural primitive and demote the hub to a derived view.
The core argument: Stop maintaining translation tiers. Promote them. The semantic layer is not a new technology: OWL, SKOS, and R2RML have been stable W3C Recommendations for over a decade, and SHACL since 2017. The production exemplars (Bayer COLID since 2019, Roche EDIS since 2017) have been running long enough to call them proven. What is new is the willingness to make the semantic layer the actual master.
A semantic layer is logical, not physical. It holds the meaning of the data, not the data itself — and that distinction is the whole architectural argument.
Strip it back to the architectural definition: “A semantic layer is a piece of enterprise data architecture designed to simplify interactions between complex data storage systems and business users… The semantic layer provides an intuitive interface that converts that data into meaningful business terms.” [2]
The crucial property is logical, not physical. A semantic layer does not hold the data; it holds the meaning of the data, expressed as a model that downstream consumers — humans, BI tools, agents, regulators — query against. The data itself can live in a lake, a warehouse, a relational system, a graph store, or all of those at once.
The lineage is older than most pharma teams realize. The first commercial semantic layers were the BusinessObjects “Universe” and the MicroStrategy “Semantic Graph” in the 1990s [3]. Looker’s LookML (2012) made “semantics as code” (Git-versioned, peer-reviewed model definitions) into a mainstream practice [3]. Today’s iteration is the universal or headless semantic layer: a tool-agnostic platform (dbt Semantic Layer, Cube, AtScale, Power BI semantic models) that defines logic once and serves it through APIs to every downstream consumer [3]. Gartner explicitly positions semantic layers as a structural component of AI-era analytics architecture [4].
| Primitive | Holds | Optimized for | Failure mode |
|---|---|---|---|
| MDM hub | Reference records (gold) | Operational reconciliation | Lossy denormalization, brittle survivorship |
| Data warehouse / lake | Facts and history | Analytics throughput | Schema sprawl, no shared meaning |
| Knowledge graph | Entities + typed edges + provenance | Compositional reasoning | Governance scale, query latency |
| Semantic layer | Definitions, mappings, constraints | Consistent interpretation across consumers | Drift if not version-controlled |
The key insight: The semantic layer is not a substitute for the others — it is the contract that binds them. It says: “Wherever this entity lives, this is what it means, these are the rules it must satisfy, and these are the names it answers to across the standards we care about.”
Almost every industry has standards. Pharma has a stack of them, each authoritative for a different slice of the same entity, each maintained by a different body, each updated on its own cadence.
The multi-standard reality of pharma is not a problem to be solved — it is the reason a semantic layer outperforms a hub. A hub-and-spoke MDM has to either flatten all of this into a denormalized row (and lose the semantics that make the standards regulatorily binding) or replicate the relationships in custom mapping tables that drift over time. A semantic layer treats every standard as a named, versioned view onto a common underlying model.
The same molecule, viewed through the CDISC lens, appears as STUDYID.SUBJID.EXTRT; viewed through the IDMP lens, as an MPID with linked substances; viewed through the OMOP lens, as a concept in the Drug Exposure domain. One model. Many projections.
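The "one model, many projections" idea can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the `TreatmentFact` class, its field names, the identifier values, and the three projection functions are all hypothetical, and the output dictionaries are simplified stand-ins rather than conformant CDISC, IDMP, or OMOP records.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TreatmentFact:
    """One canonical fact in the shared model (all field names are hypothetical)."""
    study_id: str
    subject_id: str
    treatment_name: str
    mpid: str              # placeholder for an IDMP medicinal product identifier
    omop_concept_id: int   # placeholder for an OMOP drug concept


def as_cdisc(fact: TreatmentFact) -> dict:
    # SDTM-style view using the variable names quoted in the text
    return {"STUDYID": fact.study_id, "SUBJID": fact.subject_id, "EXTRT": fact.treatment_name}


def as_idmp(fact: TreatmentFact) -> dict:
    # IDMP-style view: the product identifier is the key
    return {"MPID": fact.mpid, "name": fact.treatment_name}


def as_omop(fact: TreatmentFact) -> dict:
    # OMOP-style view: the treatment becomes a concept reference
    return {"domain": "Drug Exposure", "drug_concept_id": fact.omop_concept_id}


# Each standard is a named view onto the same underlying fact.
PROJECTIONS = {"cdisc": as_cdisc, "idmp": as_idmp, "omop": as_omop}


def project(fact: TreatmentFact, standard: str) -> dict:
    """Render the same fact through the named standard's lens."""
    return PROJECTIONS[standard](fact)


# Made-up example values, not real identifiers.
fact = TreatmentFact("STUDY-001", "SUBJ-0042", "DRUG A", "MPID-EXAMPLE-001", 999999)
views = {name: project(fact, name) for name in PROJECTIONS}
```

Supporting a new standard in this sketch means adding one more entry to `PROJECTIONS`; the canonical fact itself does not change.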
| Standard | Governing Body | Scope | Regulatory binding |
|---|---|---|---|
| CDISC SDTM / ADaM / CDASH | CDISC | Clinical study data | FDA, PMDA (required) [5] |
| OMOP CDM | OHDSI | Observational / RWD | Open community standard [8] |
| HL7 SPL | HL7 / FDA | Product labeling | FDA-binding [10] |
| EMA eAF | EMA | Regulatory applications | Mandatory for CAPs from 1 September 2026 [11] |
| ISO IDMP | ISO / EMA | Product identification | EMA SPOR services |
| MedDRA / RxNorm / NCIt / UNII | ICH / NLM / NCI / FDA | Terminologies | Overlay every layer |
Historical precedent, BRIDG Model (ISO 14199, Dec 2024): The Biomedical Research Integrated Domain Group model, jointly governed by CDISC, HL7, ISO, NCI, and FDA, demonstrated that one harmonized semantic model can underpin many surface standards [12]. Its real legacy is proof of concept: one model, many standards. NCI/CBIIT now notes the model is “no longer actively maintained” as a living artifact and should be treated as a foundational reference [13].
Four W3C Recommendations, three of them stable for over a decade and the fourth (SHACL) since 2017, give you everything needed to build a declarative, version-controllable, queryable semantic layer from end to end.
OWL provides formal class semantics: subClassOf, equivalentClass, disjointness, property characteristics, and the reasoning substrate behind any nontrivial enterprise ontology. OWL defines what entities exist.
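To make the "reasoning substrate" of OWL concrete, here is a minimal pure-Python sketch of a single inference: following subclass axioms transitively so that an entity asserted as one class is also recognized as every superclass. It covers only subclass entailment, a tiny fragment of what an OWL reasoner does, and the `ex:` class names are invented for illustration.

```python
# Subclass axioms as (subclass, superclass) pairs; the ex: names are hypothetical.
SUBCLASS_OF = {
    ("ex:InvestigationalProduct", "ex:MedicinalProduct"),
    ("ex:MedicinalProduct", "ex:Product"),
}


def superclasses(cls, axioms):
    """Return every class reachable from cls by following subclass axioms."""
    found = set()
    frontier = [cls]
    while frontier:
        current = frontier.pop()
        for sub, sup in axioms:
            if sub == current and sup not in found:
                found.add(sup)
                frontier.append(sup)
    return found


def inferred_types(asserted_type, axioms):
    """The asserted type plus everything the subclass hierarchy entails."""
    return {asserted_type} | superclasses(asserted_type, axioms)


types = inferred_types("ex:InvestigationalProduct", SUBCLASS_OF)
```

A consumer that asks for every `ex:Product` would then also receive investigational products, without anyone having stored that fact explicitly.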
SKOS represents controlled vocabularies, thesauri, and taxonomies in RDF. It is the right tool for MedDRA-style hierarchies and value sets. SKOS defines how entities are labeled and arranged.
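A SKOS-style hierarchy is essentially concepts with preferred labels and `broader` links. The sketch below walks that structure in plain Python to roll a specific term up to its ancestors. The concept identifiers and labels are invented (they are not MedDRA content), and the sketch assumes each concept has at most one broader concept, which SKOS itself does not require.

```python
# Hypothetical concept scheme: id -> preferred label and one broader concept.
CONCEPTS = {
    "ex:C1": {"prefLabel": "Headache disorders", "broader": None},
    "ex:C2": {"prefLabel": "Migraine", "broader": "ex:C1"},
    "ex:C3": {"prefLabel": "Migraine with aura", "broader": "ex:C2"},
}


def broader_path(concept_id, concepts):
    """Labels of all ancestors, nearest first, following broader links."""
    path = []
    seen = set()  # guards against accidental cycles in the vocabulary
    current = concepts[concept_id]["broader"]
    while current is not None and current not in seen:
        seen.add(current)
        path.append(concepts[current]["prefLabel"])
        current = concepts[current]["broader"]
    return path


rollup = broader_path("ex:C3", CONCEPTS)
```

The same traversal is what lets an analyst count events at any level of a terminology without hard-coding the hierarchy into each query.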
SHACL validates RDF graphs against declarative constraints. In a semantic layer, SHACL is what turns “this CDISC dataset is conformant” or “this IDMP submission is complete” into a query, not a code path. SHACL defines what is true of entities.
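The point about SHACL is that constraints are data, not code. The following pure-Python stand-in shows that shape of thing: a declarative shape with cardinality rules, and one generic function that evaluates any such shape. It is not SHACL and does not use a SHACL engine; the shape layout, the `ex:` property names, and the node dictionaries are all assumptions made for illustration.

```python
# A declarative shape: which class it targets and what each property must satisfy.
PRODUCT_SHAPE = {
    "target_class": "ex:MedicinalProduct",
    "properties": [
        {"path": "ex:mpid", "min_count": 1, "max_count": 1},
        {"path": "ex:hasSubstance", "min_count": 1},
    ],
}


def validate(node, shape):
    """Return a list of violation messages; an empty list means the node conforms."""
    if node.get("type") != shape["target_class"]:
        return []  # the shape does not target this node
    violations = []
    for rule in shape["properties"]:
        values = node.get(rule["path"], [])
        if len(values) < rule.get("min_count", 0):
            violations.append(f"{node['id']}: too few values for {rule['path']}")
        max_count = rule.get("max_count")
        if max_count is not None and len(values) > max_count:
            violations.append(f"{node['id']}: too many values for {rule['path']}")
    return violations


complete = {"id": "ex:prod1", "type": "ex:MedicinalProduct",
            "ex:mpid": ["MPID-EXAMPLE-001"], "ex:hasSubstance": ["ex:subst1"]}
incomplete = {"id": "ex:prod2", "type": "ex:MedicinalProduct",
              "ex:mpid": ["MPID-EXAMPLE-002"]}

report_ok = validate(complete, PRODUCT_SHAPE)
report_bad = validate(incomplete, PRODUCT_SHAPE)
```

Changing what "complete" means here is an edit to `PRODUCT_SHAPE`, which can be versioned and reviewed like any other artifact, rather than a change to validation code.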
R2RML projects existing relational systems into the semantic layer without rebuilding them. This is the key to incremental migration: your MDM hub becomes one input among several. R2RML defines how relational data projects into the model.
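The mechanics of an R2RML-style projection can be imitated with the standard library: a declarative mapping names a table, a subject URI template, and which columns feed which predicates, and a generic function turns rows into triples. This is a sketch of the idea only, not R2RML syntax or an R2RML processor; the `product_hub` table, its columns, the `example.org` URIs, and the `ex:` predicates are hypothetical.

```python
import sqlite3

# Declarative mapping from one relational table to triples (names are hypothetical).
MAPPING = {
    "table": "product_hub",
    "subject_template": "http://example.org/product/{product_id}",
    "class": "ex:MedicinalProduct",
    "predicate_columns": {"ex:name": "name", "ex:mpid": "mpid"},
}


def project_rows(conn, mapping):
    """Turn every row of the mapped table into (subject, predicate, object) triples."""
    conn.row_factory = sqlite3.Row  # lets us read columns by name
    rows = conn.execute(f"SELECT * FROM {mapping['table']}").fetchall()
    triples = []
    for row in rows:
        subject = mapping["subject_template"].format(**dict(row))
        triples.append((subject, "rdf:type", mapping["class"]))
        for predicate, column in mapping["predicate_columns"].items():
            if row[column] is not None:  # as in R2RML, a NULL yields no triple
                triples.append((subject, predicate, row[column]))
    return triples


# An in-memory table standing in for an existing MDM hub.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE product_hub (product_id INTEGER, name TEXT, mpid TEXT)")
conn.execute("INSERT INTO product_hub VALUES (1, 'Drug A', 'MPID-EXAMPLE-001')")
conn.execute("INSERT INTO product_hub VALUES (2, 'Drug B', NULL)")
triples = project_rows(conn, MAPPING)
```

The hub table is never modified or copied into a new schema; the mapping is the only new artifact, which is what makes the migration incremental.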
What this means in practice: OWL defines what entities exist. SKOS defines how they are labeled and arranged. SHACL defines what is true of them. R2RML defines how to project the relational systems you already have. The same query can resolve an entity across a clinical warehouse (via R2RML), a regulatory submission (via IDMP shapes), and a SharePoint document library (via SKOS tagging). There is no comparable native primitive in any hub-and-spoke MDM product.
A semantic layer owned by a central team and consumed by everyone else is just a centralized warehouse with a fancier query language. The pattern that holds up under pharma’s organizational scale is a federation.
Zhamak Dehghani’s data mesh thesis articulates the operating model in four principles: “domain-oriented decentralized data ownership and architecture, data as a product, self-serve data infrastructure as a platform, and federated computational governance.” [18, 19]
For pharma, the mapping is direct. Each functional domain — discovery, translational, clinical, regulatory, manufacturing, commercial, safety — owns its slice of the model. Each publishes data products with explicit contracts. A central platform team operates the substrate (the graph store, the mapping engine, the SHACL validator, the SKOS service). A federated governance body (semantic council, ontology board — call it what you like) arbitrates cross-domain alignment.
| Architectural primitive | Is about… | Role in the federated model |
|---|---|---|
| Data mesh | Ownership and contract | Who owns which slice and what they guarantee |
| Semantic layer | Meaning | What entities mean and how they relate across domains |
| MDM hub | Reference-record reconciliation | A derived asset serving legacy operational systems — not the master |
The mistake the last decade made was treating MDM as if it could do all three at once. These three primitives are complementary, not alternative. The hub does not disappear — it just stops being the master.
This is not theoretical. Three pharma-authored implementations are openly documented and worth looking at.
Bayer’s COLID has been fully operational across all Bayer divisions since January 2019. It provides persistent, globally unique URIs for corporate metadata assets, an RDF data model, and a SPARQL endpoint for consumers. It is published under the BSD-3-Clause license and is, to date, the most concrete example of a major pharma replacing a traditional MDM-style metadata registry with a semantic-layer architecture [20].
EDIS (Enhanced Data and Insight Sharing) launched in 2017 as a company-wide program to transform Roche’s data management strategy. The Roche Dataset Portal’s metamodel is “entirely specified using FAIR standards and community vocabularies.” [21] The companion Lynx system is a knowledge-graph engine for reference data integration across Roche’s semantic ecosystem [22].
The third is a cross-pharma project whose deliverables include the FAIR Toolkit, the FAIR Maturity Matrix (v1.1, March 2025), and FAIR-aligned submission frameworks for in vitro pharmacology and bioassay metadata [23]. The FAIR Maturity Matrix provides a defensible target state for a semantic-layer program and lets you measure progress without inventing metrics from scratch.
MELLODDY (IMI) — the federation proof of concept: The European IMI MELLODDY project (Bayer, GSK, Novartis, Janssen, and six other pharmas) built an industry-scale federated machine-learning platform for drug discovery without sharing the underlying data. The architectural lesson is the same one the semantic-layer pattern enforces: the unit of sharing is contracts and meaning, not raw rows.
The hardest question is not “should we build a semantic layer?” — that is a settled bet. It’s “what happens to the MDM hub we already have?”
The wrong answers are “rip it out” (politically and operationally untenable) and “ignore the new architecture” (which is how parallel-stack rot starts). The pattern that works, in roughly this order:
Pick one anchor domain — products and substances is usually the right starting point, given the IDMP-O foundation already exists. Define the ontology in OWL, the value sets in SKOS, the constraints in SHACL. Use R2RML to project the existing MDM hub into the semantic layer, so the hub becomes one source among several rather than the source of truth.
New downstream consumers (regulatory submission generators, AI agents, FAIR data products, cross-domain analytics) read from the semantic layer, never the hub directly. This is the moment the architectural center of gravity actually moves.
The hub continues to serve legacy operational systems that need a flat API. But the hub’s content is now derived from the semantic layer — facts asserted in the graph project down into hub rows, not the other way around. The survivorship rules that used to be the hub’s secret sauce become declarative SHACL constraints, version-controlled and auditable.
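What "declarative survivorship" and "hub rows derived from the graph" could look like is sketched below in plain Python. The rule table ranks sources per attribute, and a generic function derives the flat hub row from provenance-tagged assertions. The text envisions these rules as SHACL constraints; this sketch uses a simple priority table instead, and the source names, attributes, and values are all invented.

```python
# Survivorship as data: per attribute, sources in order of trust (hypothetical names).
SURVIVORSHIP = {
    "name": ["regulatory", "erp", "crm"],
    "strength": ["regulatory", "erp"],
}

# Facts asserted in the semantic layer, each tagged with its source.
ASSERTIONS = [
    {"entity": "ex:prod1", "attr": "name", "value": "Drug A 10 mg tablet", "source": "regulatory"},
    {"entity": "ex:prod1", "attr": "name", "value": "DRUG-A-10", "source": "erp"},
    {"entity": "ex:prod1", "attr": "strength", "value": "10 mg", "source": "erp"},
]


def derive_hub_row(entity, assertions, rules):
    """Project graph assertions down into one flat hub row using the rule table."""
    row = {"entity": entity}
    for attr, priority in rules.items():
        # Collect the candidate value from each source for this attribute.
        candidates = {
            a["source"]: a["value"]
            for a in assertions
            if a["entity"] == entity and a["attr"] == attr
        }
        # The first source in priority order that has a value wins.
        row[attr] = next((candidates[s] for s in priority if s in candidates), None)
    return row


hub_row = derive_hub_row("ex:prod1", ASSERTIONS, SURVIVORSHIP)
```

Because the losing values stay in `ASSERTIONS`, the derivation is auditable: changing a priority order is a reviewable diff, and the row can be regenerated at any time.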
As more domains stand up their own slice of the semantic model, the central team’s job shifts from “owning the master records” to “operating the platform and arbitrating cross-domain alignment.” This is when data mesh principles stop being slogans and start being how the system actually runs.
CDISC Define-XML, IDMP submission XML, FDA SPL, EMA eAF: each becomes a generated artifact, produced by querying the semantic layer through the appropriate shapes. New regulatory standards require a new projection, not a re-platforming. EMA eAF goes mandatory for CAPs on 1 September 2026 [11]; that is the near-term forcing function.
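As a rough illustration of "submission as generated artifact", the sketch below renders an XML document from a product record using Python's standard library. The element names (`ProductSubmission`, `ProductId`, and so on) and the record layout are hypothetical and do not follow the real Define-XML, IDMP, SPL, or eAF schemas; the point is only that the document is an output of the model, not a hand-maintained source.

```python
import xml.etree.ElementTree as ET


def render_submission(product):
    """Generate an XML artifact from a product record (hypothetical element names)."""
    root = ET.Element("ProductSubmission")
    ET.SubElement(root, "ProductId").text = product["mpid"]
    ET.SubElement(root, "Name").text = product["name"]
    substances = ET.SubElement(root, "Substances")
    # Sort so that regenerating from the same facts yields identical output.
    for substance in sorted(product["substances"]):
        ET.SubElement(substances, "Substance").text = substance
    return ET.tostring(root, encoding="unicode")


# In a real layer this record would be the result of a query; here it is made up.
product = {
    "mpid": "MPID-EXAMPLE-001",
    "name": "Drug A",
    "substances": {"Substance Y", "Substance X"},
}
xml_text = render_submission(product)
```

A new regulatory format would, under this pattern, be one more `render_*` function over the same record rather than a new integration.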
The economic case writes itself once a single regulatory submission is materially cheaper to generate. The integration tax in pharma R&D is structurally large and increasingly avoidable: the marginal cost of integrating a new system is now genuinely lower than maintaining a custom hub mapping, particularly with LLMs handling schema-mapping tasks [24].
A few failure modes that consume budget without producing outcomes.
An enterprise ontology is regulated reference data with its own change-control regime. It needs versioning (SemVer is fine), release notes, deprecation policies, and an explicit governance body. Building it as a JIRA project that “the data team owns” produces an ontology nobody outside the team trusts. Fund the ontologists; staff a semantic council that includes regulatory, clinical, and commercial domain experts.
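One way to make SemVer concrete for an ontology is to derive the version bump from the change itself. The sketch below assumes a simple policy that is not from the source: removing a term is breaking (major), adding a term is backward-compatible (minor), and anything else is a patch. The term names are hypothetical, and a real policy would also need to account for changed definitions and deprecations.

```python
def next_version(version, old_terms, new_terms):
    """Suggest the next SemVer string from the set of terms before and after a change."""
    major, minor, patch = (int(part) for part in version.split("."))
    if old_terms - new_terms:
        # A term consumers may depend on was removed: breaking change.
        return f"{major + 1}.0.0"
    if new_terms - old_terms:
        # Only additions: existing consumers keep working.
        return f"{major}.{minor + 1}.0"
    # Same term set: treat as a label or documentation fix.
    return f"{major}.{minor}.{patch + 1}"


released = {"ex:MedicinalProduct", "ex:Substance"}
with_addition = released | {"ex:ManufacturingSite"}
with_removal = {"ex:MedicinalProduct"}

after_addition = next_version("1.4.2", released, with_addition)
after_removal = next_version("1.4.2", released, with_removal)
after_edit = next_version("1.4.2", released, released)
```

Running a check like this in the release pipeline gives the semantic council a mechanical starting point for release notes, with the final call left to human review.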
A Power BI semantic model is a semantic layer in the BI-tool-specific sense — useful, but scoped to a single consumption tool. The enterprise semantic layer is upstream of every BI tool, every AI agent, every regulatory submission generator. The two coexist; they are not interchangeable.
Constraints are the part of the stack that turns a semantic model from an ER diagram into something operationally trustworthy. Without SHACL (or equivalent shape language), the model is documentation; with it, the model is a contract enforceable by the platform.
Federated governance is the principle. A single central body that adjudicates every change is the anti-pattern. Each domain should be able to evolve its own slice within agreed cross-domain invariants; the central role is to maintain the invariants and arbitrate when domains disagree.
Ontology engineering is a discipline. Hiring a “knowledge graph engineer” without a clear distinction between data engineering and semantic engineering produces a team that builds graphs which look right and reason wrong. The job descriptions need to differentiate; the training pipeline rarely does.
Three things are converging fast enough to make 2026–2027 the inflection year for this transition in pharma.
CDISC requirements are a decade old. IDMP rollout continues. EMA eAF becomes mandatory for Centrally Authorised Products on 1 September 2026 [11]. SPL is already FDA-binding [10]. Each is a forcing function for structured, semantically grounded internal data, and each is materially cheaper to serve from a semantic layer than from a hub plus a translation tier.
Schema mapping, entity resolution, and ontology alignment — the historic bottlenecks in semantic-layer rollouts — are exactly the tasks where LLMs match or exceed prior state-of-the-art. The economics of building and maintaining a semantic layer at pharma scale have shifted; the marginal cost of integrating a new system is now genuinely lower than maintaining a custom hub mapping.
The “data-as-product” framing has reached the point where major pharma organizations are restructuring data teams around it. Combined with semantic-layer thinking, this gives the federated ownership model that scales — without surrendering coherence.
The architectural conclusion: The master data layer of the next pharma stack is not a hub. It is an ontology, governed as a regulated artifact, expressed in W3C-standard primitives, owned by federated domains, and serving the rest of the enterprise — including the legacy MDM hub itself — as a logical, version-controlled view. The hub does not disappear. It just stops being the master.
Pharma has been pretending for a long time that a row in a hub is the same thing as the meaning of a medicinal product, a clinical study, a substance, or a site. It never was. The semantic layer is not a new technology: OWL, SKOS, and R2RML have been stable W3C Recommendations for a decade or more, SHACL since 2017, and the production exemplars (Bayer COLID since 2019, Roche EDIS since 2017) have been running long enough to call them proven.
What is new is the willingness — driven by regulatory math, by AI economics, and by the cumulative weight of MDM programs that didn’t deliver — to actually make the semantic layer the master. The teams that get to that architectural state first will spend the rest of the decade integrating new systems by mapping, not by migrating. The teams that don’t will keep paying the integration tax.
Stop maintaining translation tiers. Promote them.