Product Mastering Across R&D-to-Commercial: Why Knowledge Graphs Beat Traditional MDM

01 — The Problem

The Product Mastering Problem in Pharma

In most industries, “product” is a stable thing. Pharma does not fit that world — the entity is not a thing, it is a trajectory.

Master Data Management was designed for the world of washing machines: a SKU, a price, a manufacturer, and a few attributes that change slowly. Gartner defines it as the discipline that ensures “uniformity, accuracy, stewardship, semantic consistency and accountability of the enterprise’s official shared master data assets.”¹ That world assumes a stable entity with a single identity. Pharma has neither.

The same scientific entity moves through a dozen identity transformations before it reaches a patient. The right answer to any master data question depends on which point in the lifecycle the asker stands at. There is no single gold record because there is no single business question a one-row record would answer correctly.

Discovery: Structure hash, ChEMBL ID, Internal code
→ Preclinical: UNII (FDA GSRS), INN (WHO), ISO 11238 substance
→ Clinical: IND / EudraCT, Protocol product, Dose form + route
→ Regulatory: MPID (ISO 11615), PhPID (ISO 11616), Marketing Auth.
→ Commercial: NDC (US), GTIN, Brand + label

Each stage is authoritative for its own identifiers. No single row captures all of them correctly at every point in time.

The core insight: A single asset accumulates identifiers across discovery, development, manufacturing, regulatory, supply chain, and commercial systems — each with its own authoritative source and its own update cadence. The entity is not a thing. It is a trajectory. Stop trying to flatten it into a row.

02 — Why MDM Fails

Why Hub-and-Spoke MDM Breaks Down

Classical MDM’s four implementation styles — registry, consolidation, coexistence, centralized — all assume the same primitive: a denormalized “golden” row chosen by survivorship rules. When the entity is relational and evolving, they all fail in the same ways.

Lossy denormalization of multi-cardinality

An MPID maps to multiple PhPIDs (strengths/forms), each PhPID to multiple substances, each substance to multiple specified-substance variants. Forcing this into a flat hub schema either explodes row counts or hides cardinality in a JSON blob the stewardship workflow can’t reason over. ISO IDMP is structurally a network — a graph published as five interlocking standards.³
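A minimal sketch of the cardinality problem, assuming a toy hierarchy. Every identifier below (MPID-1, PhPID-A, SUB-1 and so on) is a made-up placeholder, not a real code. The hierarchy is stored once as edges, then flattened into root-to-leaf rows the way a hub schema would need it.

```python
from collections import defaultdict

# Hypothetical identifiers, for illustration only.
EDGES = [
    ("MPID-1", "hasPhPID", "PhPID-A"),
    ("MPID-1", "hasPhPID", "PhPID-B"),
    ("PhPID-A", "hasSubstance", "SUB-1"),
    ("PhPID-A", "hasSubstance", "SUB-2"),
    ("PhPID-B", "hasSubstance", "SUB-1"),
    ("SUB-1", "hasSpecifiedSubstance", "SPEC-1a"),
    ("SUB-1", "hasSpecifiedSubstance", "SPEC-1b"),
    ("SUB-2", "hasSpecifiedSubstance", "SPEC-2a"),
]


def children(edges):
    """Index edges as parent -> list of child nodes."""
    index = defaultdict(list)
    for subject, _predicate, obj in edges:
        index[subject].append(obj)
    return index


def flat_rows(node, index):
    """Enumerate every root-to-leaf path, i.e. the rows of a flat table."""
    kids = index.get(node, [])
    if not kids:
        return [[node]]
    return [[node] + path for kid in kids for path in flat_rows(kid, index)]


index = children(EDGES)
rows = flat_rows("MPID-1", index)
repeats = sum("SUB-1" in row for row in rows)
print(len(rows), "flat rows;", repeats, "of them repeat SUB-1")
```

In the graph, SUB-1 is one node; in the flattened table it is copied into every row that passes through it, and the row count grows multiplicatively with each level of fan-out.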

The “single source of truth” misdiagnosis

The problem is not that pharma has too many sources of truth — it is that each is authoritative for a different facet at a different time. The LIMS is authoritative for structure. The regulatory system is authoritative for MPID assignment. Survivorship rules cannot capture “authoritative for attribute X only after milestone Y.”
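One way to picture "authoritative for attribute X only after milestone Y" is as a small rule table evaluated against the asset's milestone history. This is a sketch under assumptions: the rule shape, the milestone name `marketing_authorization`, and the source labels are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class AuthorityRule:
    attribute: str
    source: str
    after_milestone: Optional[str]  # None means authoritative from the start


# Hypothetical rules and milestone names, for illustration only.
RULES = [
    AuthorityRule("structure", "LIMS", None),
    AuthorityRule("mpid", "regulatory_system", "marketing_authorization"),
]


def authoritative_source(attribute, as_of, milestones):
    """Return the source that is authoritative for `attribute` on `as_of`.

    `milestones` maps milestone name -> date reached. A rule applies only
    once its milestone (if any) was reached on or before `as_of`.
    """
    for rule in RULES:
        if rule.attribute != attribute:
            continue
        if rule.after_milestone is None:
            return rule.source
        reached = milestones.get(rule.after_milestone)
        if reached is not None and reached <= as_of:
            return rule.source
    return None  # no source is authoritative yet


milestones = {"marketing_authorization": date(2021, 3, 1)}
print(authoritative_source("mpid", date(2020, 1, 1), milestones))
print(authoritative_source("mpid", date(2022, 1, 1), milestones))
```

The answer depends on both the attribute and the point in time, which is exactly the pair of inputs a single survivorship ranking does not take.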

Stewardship assumes a single entity owner

In R&D-to-commercial, ownership changes hands at every stage: chemistry → CMC → clinical → regulatory → commercial. Each handoff is itself a piece of master data. Hub MDM has no native primitive for this; it gets encoded as a status column that inevitably drifts from the actual SOP.

Historical truth & bitemporality

An adverse event on a 2018 batch of a 2014 formulation must trace to the specific substance, site, and product information that existed at that time, not to the current golden record. Most MDM hubs implement temporality as effective-dated rows, which record only when a fact was true. Pharmacovigilance requires bitemporality (when the fact was true × when we knew it), and a hub built around a single surviving row has no place to keep the second axis.
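A minimal sketch of a bitemporal lookup, assuming each fact carries a validity interval plus the date it was recorded. The batch and site identifiers, and the scenario of a site correction recorded in 2019, are invented for illustration.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Fact:
    subject: str
    predicate: str
    value: str
    valid_from: date   # when the fact became true in the world
    valid_to: date     # exclusive end of validity
    recorded_at: date  # when the organisation recorded it


FOREVER = date.max

# Hypothetical data: the manufacturing site was corrected after the fact.
FACTS = [
    Fact("batch:B-2018-07", "manufacturedAt", "site:A",
         date(2018, 1, 1), FOREVER, date(2018, 1, 5)),
    Fact("batch:B-2018-07", "manufacturedAt", "site:B",
         date(2018, 1, 1), FOREVER, date(2019, 6, 1)),
]


def as_of(facts, subject, predicate, valid_on, known_on):
    """Value true on `valid_on` according to what had been recorded by `known_on`."""
    candidates = [
        f for f in facts
        if f.subject == subject
        and f.predicate == predicate
        and f.valid_from <= valid_on < f.valid_to
        and f.recorded_at <= known_on
    ]
    if not candidates:
        return None
    # The latest recorded assertion wins; earlier ones stay in the store.
    return max(candidates, key=lambda f: f.recorded_at).value


key = ("batch:B-2018-07", "manufacturedAt")
print(as_of(FACTS, *key, valid_on=date(2018, 3, 1), known_on=date(2018, 12, 31)))
print(as_of(FACTS, *key, valid_on=date(2018, 3, 1), known_on=date(2020, 1, 1)))
```

The two calls ask about the same day in 2018 but get different answers, because the second one is allowed to see the 2019 correction. Both answers remain reproducible because the earlier assertion is never overwritten.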

Industry signal: The Pistoia Alliance’s IDMP-O project — built jointly by 11 pharmaceutical companies (Bayer, Novartis, GSK, Roche, Merck KGaA, Boehringer Ingelheim, J&J, AstraZeneca, Amgen, AbbVie, and Pfizer) — exists because relational MDM implementations of ISO IDMP have produced diverging, non-interoperable outputs across jurisdictions.⁵ When eleven major pharma companies pool funding to build an ontology rather than another hub, that’s a signal worth reading.

03 — The Alternative

The Knowledge Graph Alternative

A knowledge graph inverts the MDM primitive. Instead of a denormalized row, the unit of mastering is the triple: (subject, predicate, object). Four properties of the graph model do real work that MDM cannot.

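@prefix :             <https://example.org/pm#> .  # example.org namespaces are placeholders
@prefix compound:     <https://example.org/compound/> .
@prefix devCandidate: <https://example.org/devCandidate/> .
@prefix agentRun:     <https://example.org/agentRun/> .
@prefix unii:         <https://example.org/unii/> .
@prefix owl:          <http://www.w3.org/2002/07/owl#> .
@prefix xsd:          <http://www.w3.org/2001/XMLSchema#> .

# Example: lifecycle transition as a typed edge with provenance
compound:CHEMBL1234 :nominatedAs devCandidate:DC-2017-44 .
devCandidate:DC-2017-44 :effectiveDate "2017-06-12"^^xsd:date .
devCandidate:DC-2017-44 :assertedBy agentRun:OntoCurator-v2.1-20170612 .
devCandidate:DC-2017-44 :assertedAt "2017-06-12T14:22:01Z"^^xsd:dateTime .

# Both identities survive; nothing is overwritten
compound:CHEMBL1234 owl:sameAs unii:ABCD1234EF .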

Entities are persistent across lifecycle states

A discovery compound and the marketed substance are linked by an edge — not collapsed into one row. Both identities survive. A pharmacovigilance query in 2030 can resolve the 2014 entity exactly as it stood, because nothing was overwritten.

Lifecycle transitions are edges, not status columns

“Compound X was nominated as development candidate on 2017-06-12 under decision D-2017-44” becomes a typed edge with provenance and an effective interval. The graph encodes the event of transition — which is precisely what ICH E6(R3) audit trails require.⁷

Semantic versioning replaces survivorship

Each fact is asserted by a named source graph with a version and validity interval. The same property can hold different values for different consumers simultaneously — clinical ops reads the protocol-locked definition while commercial reads the current marketed config, from the same graph, by selecting the appropriate named graph or VALID-AT timestamp.
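A small sketch of consumer-specific reads, assuming facts are stored as quads whose fourth element names the graph they were asserted in. The graph names, product identifier, and strengths are hypothetical.

```python
# Quads: (subject, predicate, object, named_graph). All names are hypothetical.
QUADS = [
    ("product:X", "strength", "5 mg", "graph:protocol-locked-v3"),
    ("product:X", "strength", "10 mg", "graph:marketed-current"),
    ("product:X", "doseForm", "tablet", "graph:marketed-current"),
]


def read(quads, subject, predicate, graph):
    """Values of `predicate` for `subject` as asserted in one named graph."""
    return [o for s, p, o, g in quads if (s, p, g) == (subject, predicate, graph)]


# Clinical ops and commercial ask the same question of the same store.
print(read(QUADS, "product:X", "strength", "graph:protocol-locked-v3"))
print(read(QUADS, "product:X", "strength", "graph:marketed-current"))
```

Neither value had to "win": the choice of graph is made by the reader at query time rather than by a survivorship rule at load time.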

Reasoning is compositional

Because ontology classes carry formal semantics (subClass, equivalentClass, disjointWith), queries can ask “every marketed product whose active moiety is a JAK inhibitor manufactured at site S” without anyone hard-coding “JAK inhibitor” into a hub attribute. The classification lives in ChEBI or the relevant drug ontology, not the master data system.
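The query in the paragraph above can be sketched as a walk up a subclass chain followed by an ordinary filter. This is a simplification under assumptions: the hierarchy is single-parent, a moiety is treated as a leaf class, and every class name and product record is invented rather than taken from ChEBI.

```python
# Hypothetical class hierarchy and product facts; none are real ontology IDs.
SUBCLASS_OF = {
    "drugA_moiety": "JAK1 inhibitor",
    "JAK1 inhibitor": "JAK inhibitor",
    "JAK inhibitor": "kinase inhibitor",
    "drugB_moiety": "proton pump inhibitor",
}

PRODUCTS = [
    {"id": "product:1", "moiety": "drugA_moiety", "site": "site:S", "marketed": True},
    {"id": "product:2", "moiety": "drugB_moiety", "site": "site:S", "marketed": True},
    {"id": "product:3", "moiety": "drugA_moiety", "site": "site:T", "marketed": True},
]


def is_a(cls, ancestor):
    """Walk the subclass chain upward; True if `ancestor` is reached."""
    seen = set()
    while cls is not None and cls not in seen:
        if cls == ancestor:
            return True
        seen.add(cls)
        cls = SUBCLASS_OF.get(cls)
    return False


def marketed_products(drug_class, site):
    """Marketed products at `site` whose moiety falls under `drug_class`."""
    return [
        p["id"] for p in PRODUCTS
        if p["marketed"] and p["site"] == site and is_a(p["moiety"], drug_class)
    ]


print(marketed_products("JAK inhibitor", "site:S"))
```

No product record mentions "JAK inhibitor"; the classification is resolved from the hierarchy at query time, so reclassifying a moiety changes the answer without touching product data.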

Industrial evidence: AstraZeneca’s Biological Insights Knowledge Graph (BIKG) contains approximately 10.9 million nodes (22 types) and over 118 million edges (59 types), built from 39+ internal and public datasets for target identification and drug repurposing.⁹ Open Targets, Hetionet, and dozens of peer-reviewed KGs document the same pattern across the industry.¹⁰,¹¹

04 — Ontology Stack

Connecting MDM to Ontologies

The right pattern is to let the ontologies become the master, and have what used to be MDM serve as the operational read model for systems that still need a flat view.

Layer	Ontology / Standard	Role in Product KG	Key IDs
Chemical structure & moiety	ChEBI	Classifies substances by structure and biological role; supports active-moiety reasoning across ingredient ↔ drug class ↔ mechanism	CHEBI:xxxxxx
Drug product semantics	DrON	OBO-Foundry drug ontology aligning RxNorm-derived clinical concepts with formal class semantics; bridges chemical and clinical views	DRON:xxxxxxxx
Substance identity (regulatory)	UNII / GSRS, ISO 11238	Globally unique substance identifiers and the substance-definition standard; required for all FDA submissions	UNII: 10-char
Clinical drug normalization	RxNorm	NLM-maintained normalized names linking ingredient ↔ strength ↔ dose form ↔ brand; the standard interface to EMR/EHR systems	RxCUI: xxxxxxx
Adverse events & indications	MedDRA	ICH-maintained terminology required for regulatory safety reporting globally; maps to FAERS, EudraVigilance, VigiBase	MedDRA PT code
Oncology / mechanism / anatomy	NCIt	NCI Thesaurus, maintained by NCI Enterprise Vocabulary Services and used widely in regulatory submissions; covers drug, disease, anatomy, and procedure hierarchies	C xxxxxxx
End-to-end product identity	IDMP-O (Pistoia)	The integrating layer — realizes ISO 11238/11239/11240/11615/11616/21090 in OWL; engineered to bridge chemistry ontologies upstream and regulatory identifiers downstream⁵	MPID / PhPID

What makes this a stack rather than a list is alignment: where the vocabularies overlap, equivalent classes, properties, and cross-references are declared, so a SPARQL query about an “active moiety” can resolve across ChEBI, UNII, and RxNorm consistently.

The MDM-style flat record doesn’t disappear in this architecture — it becomes a materialized projection of the graph, generated for systems that need a denormalized API. Crucially, the projection is derivable and reproducible from a version of the graph, which means data lineage is intrinsic rather than bolted on.
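A sketch of a reproducible projection, assuming the graph snapshot can be serialized to a sorted list of triples. The function names `snapshot_id` and `project`, the product, and its values are all hypothetical; the point is only that the flat row carries a fingerprint of the snapshot it was derived from.

```python
import hashlib
import json

# Hypothetical triples for one product; identifiers are placeholders.
TRIPLES = [
    ("product:X", "brandName", "Examplex"),
    ("product:X", "mpid", "MPID-0001"),
    ("product:X", "doseForm", "tablet"),
]


def snapshot_id(triples):
    """Deterministic, order-independent fingerprint of a graph snapshot."""
    canonical = json.dumps(sorted(triples), separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


def project(triples, subject, columns):
    """Flatten selected predicates of `subject` into one row, tagged with lineage."""
    row = {"id": subject, "graph_snapshot": snapshot_id(triples)}
    for s, p, o in triples:
        if s == subject and p in columns:
            row[p] = o
    return row


print(project(TRIPLES, "product:X", {"brandName", "mpid"}))
```

Running the projection twice against the same snapshot gives the same row and the same fingerprint, so a downstream consumer can always state which version of the graph its flat record came from.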

05 — Reference Architecture

A Reference Architecture

This is the shape that emerges in practice across discovery, development, regulatory, and commercial. Three architectural choices distinguish it from “another KG on top of the data lake.”

1. Ontology layer is governed independently

Ontology releases are versioned (SemVer) and the graph carries usedOntologyVersion provenance on every assertion. A MedDRA v27.0 → v27.1 upgrade does not silently change historical safety queries — it produces a new named graph the consumer can opt into.
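A sketch of version pinning, assuming every coding assertion records the ontology release it was made against. The term codes (PT:0001, PT:0002) and cases are placeholders, not real MedDRA content, and the field name `usedOntologyVersion` follows the provenance property named above.

```python
# Each assertion records the ontology release it was coded against.
# Codes and cases below are placeholders, not real MedDRA content.
ASSERTIONS = [
    {"case": "case:1", "codedAs": "PT:0001", "usedOntologyVersion": "27.0"},
    {"case": "case:2", "codedAs": "PT:0001", "usedOntologyVersion": "27.0"},
]


def cases_coded_as(assertions, term, ontology_version):
    """Cases coded to `term` under one pinned ontology release."""
    return sorted(
        a["case"] for a in assertions
        if a["codedAs"] == term and a["usedOntologyVersion"] == ontology_version
    )


before = cases_coded_as(ASSERTIONS, "PT:0001", "27.0")

# Simulate an upgrade: the recoding is added under the new release,
# and nothing recorded under 27.0 is edited in place.
upgraded = ASSERTIONS + [
    {"case": "case:2", "codedAs": "PT:0002", "usedOntologyVersion": "27.1"},
]
after = cases_coded_as(upgraded, "PT:0001", "27.0")

print(before == after)
```

A consumer that stays pinned to 27.0 sees the same result before and after the upgrade; one that opts into 27.1 sees the recoded case.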

2. Named graphs partition the lifecycle

Discovery, clinical, regulatory, and commercial facts each live in their own named graph with explicit cross-graph edges. This makes “authoritative for attribute X only after milestone Y” expressible declaratively — not as a status column.

3. Materialized read models are deliverables, not the master

IDMP submission XML, the safety data warehouse, the commercial product hierarchy — all generated from the graph by query, all reproducible from a graph snapshot. The MDM hub, where one still exists, becomes a cache.

06 — Agentic AI

Agentic AI for First-Pass Mastering, Human-in-the-Loop for the Rest

Foundation models change the economics of entity resolution. Three peer-reviewed results carry the load — and the workable pattern is agentic first-pass with structured handoff.

Research foundation: Narayan et al. (Stanford, PVLDB 2022) showed foundation models match or exceed task-specific SOTA on five data wrangling tasks including entity matching and schema matching — accessed only via in-context learning.¹⁸ Peeters & Bizer (Mannheim, 2023) confirmed ChatGPT reaches fine-tuned model performance on entity-matching benchmarks without any task-specific training data.¹⁹ Fernandez et al. (Chicago, PVLDB 2023) grounded the broader thesis: LLMs have “unprecedented ability to ground database tuples, schemas, and queries in real-world concepts.”²⁰

Step 01. Source Ingestion: Agent extracts entities from ELN, LIMS, regulatory source with structural identifiers and raw text context

→ Step 02. Candidate Matching: LLM proposes candidate matches with calibrated confidence, grounded citations (ontology class, regulatory record, structural ID)

→ Step 03. High-Confidence Path: Proposals above domain threshold auto-asserted into staging named graph with provenance pointing to agent run and model version

→ Step 04. Human Review Queue: Low-confidence or policy-sensitive proposals (marketed products, registered substances, safety signals) routed to stewards with agent’s reasoning attached

→ Step 05. Feedback Loop: Steward decisions feed back into the agent’s few-shot context for the next batch, a closed loop that compounds quality over time

The non-negotiable constraint is provenance on every asserted triple. Without it, the system fails inspection. With it, the agent becomes auditable: every assertion has a who (agent run + model version), a why (evidence), and a when (assertedAt timestamp). Any triple can be retracted without disturbing unrelated facts. This is something MDM survivorship architecturally cannot offer.
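The routing and provenance rules above can be sketched as follows. This is an illustration under assumptions, not a reference implementation: the 0.95 threshold, the `Proposal` shape, and the run and model identifiers are all hypothetical, and the confidence score is taken as given rather than produced by a real model.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Proposal:
    triple: tuple                  # (subject, predicate, object)
    confidence: float              # model-reported score in [0, 1]
    evidence: str                  # citation the agent grounded the match on
    policy_sensitive: bool = False


# Threshold is an assumption for this sketch; a real one is set per domain.
AUTO_ASSERT_THRESHOLD = 0.95


def route(proposals, agent_run, model_version, now=None):
    """Split proposals into auto-asserted staging triples and a review queue."""
    now = now or datetime.now(timezone.utc)
    staging, review_queue = [], []
    for p in proposals:
        if p.policy_sensitive or p.confidence < AUTO_ASSERT_THRESHOLD:
            review_queue.append(p)  # a steward decides
            continue
        staging.append({
            "triple": p.triple,
            "assertedBy": f"{agent_run}/{model_version}",  # who
            "evidence": p.evidence,                        # why
            "assertedAt": now.isoformat(),                 # when
        })
    return staging, review_queue


def retract(staging, agent_run):
    """Drop every triple asserted by one agent run; all others are untouched."""
    return [a for a in staging if not a["assertedBy"].startswith(agent_run + "/")]
```

Because each staged triple names the run that asserted it, a bad run can be withdrawn as a unit by calling `retract` with that run's identifier, without touching assertions from other runs or from stewards.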

07 — AI-Readiness & Regulatory

What This Means for AI-Readiness and Regulatory Submissions

Two regulatory currents are converging on the same conclusion: product data needs to be machine-processable, traceable, and FAIR. The harder half of “AI-readiness” is data-side, not model-side.

EMA

Reflection Paper on AI in the Medicinal Product Lifecycle

CHMP/CVMP adoption September 2024. Covers AI use across discovery, non-clinical, clinical trials, manufacturing, and post-authorization. Explicitly emphasizes risk management of data quality, traceability, and human oversight throughout the lifecycle.²¹

EMA/CHMP/CVMP/83833/2023 · Sep 2024

FDA

AI/ML in Drug Development Discussion Paper

FDA’s 2023 counterpart to the EMA reflection paper. Sets expectations for transparency, validation, and data provenance in AI-assisted drug development workflows. Presupposes well-mastered, traceable source data.²²

FDA Discussion Paper · 2023

ICH

E6(R3) Good Clinical Practice

Adopted January 2025, EMA effective July 2025. Modernizes GCP for decentralized trials, electronic sources, and quality-by-design, all of which presuppose well-mastered protocol and product metadata.⁷

Adopted 6 Jan 2025 · EMA effective Jul 2025

IDMP as a query, not a project: If your product mastering is a row-shaped hub, generating IDMP-conformant XML at the cardinality and bitemporality the standard demands is a recurring engineering project, every submission. If your product mastering is graph-shaped and aligned to IDMP-O, IDMP becomes a SPARQL query. The same is true for pharmacovigilance line listings, label content management, and structured product labeling.

“AI-readiness” is often discussed as a model-side concern (GPUs, MLOps, vector DBs). The harder, less glamorous half is data-side: an LLM that has to ground “is this the same product as the one in the 2019 BLA” in a brittle MDM gold record will hallucinate. One that can query a typed, versioned, provenance-rich knowledge graph won’t have to. The FAIR Guiding Principles²³ are, almost word-for-word, a description of a properly governed knowledge graph: globally unique persistent identifiers, retrieval by standardized protocol, formal knowledge representation, rich provenance.

08 — Implementation Roadmap

Implementation Roadmap

A staged path that has survived contact with reality. Workable budget for phases 0–2: a core team of 4–6 engineers, 1–2 ontologists, 1–2 stewards per domain, and 9–12 months.

The most common failure mode is not technical — it’s treating the ontology as an IT artifact instead of a regulated piece of master data with its own governance. Fund the ontologists.

Phase 0: Pick One Transition, Not the Whole Lifecycle

Good candidates are the candidate-nomination → IND filing transition and the marketing-authorization → first-commercial-launch transition. Both are bounded enough to ship in a quarter and consequential enough to fund the next phase. Avoid the “we’ll boil the ocean” KG that maps every system on day one.

Scoping | Stakeholder alignment | ~1 month

Phase 1: Stand Up the Ontology and Graph in Parallel

Adopt IDMP-O as the spine; pull in ChEBI, RxNorm, MedDRA, NCIt, and UNII as needed for the chosen transition. Define internal extensions explicitly in a versioned namespace you control, with documented alignments to public ontologies.

IDMP-O | ChEBI · RxNorm · MedDRA | Neptune or Neo4j | ~2–3 months

Phase 2: Wire In Two Source Systems, End to End

One discovery-side (LIMS or ELN), one regulatory-side (the system holding MPID assignments). Build the agentic mastering loop with human-in-the-loop review on everything that touches a regulatory identifier. Measure: % triples with grounded provenance, steward override rate, round-trip latency from source change to graph assertion.
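The three measures named above are simple ratios and a delay, and a sketch of how they might be computed follows. The record shapes (`assertedBy`, `steward_decision`, `changed_at` and so on) are assumptions for illustration, and the median is one reasonable summary of latency among several.

```python
from datetime import datetime
from statistics import median


def provenance_coverage(assertions):
    """Share of assertions carrying non-empty who / why / when fields."""
    required = ("assertedBy", "evidence", "assertedAt")
    grounded = sum(1 for a in assertions if all(a.get(k) for k in required))
    return grounded / len(assertions) if assertions else 0.0


def override_rate(reviews):
    """Share of steward reviews that changed the agent's proposal."""
    overridden = sum(
        1 for r in reviews if r["steward_decision"] != r["agent_proposal"]
    )
    return overridden / len(reviews) if reviews else 0.0


def median_latency_seconds(events):
    """Median delay between a source change and its graph assertion."""
    delays = [
        (e["asserted_at"] - e["changed_at"]).total_seconds() for e in events
    ]
    return median(delays) if delays else None
```

A falling override rate at constant provenance coverage is the signal that the feedback loop is improving the agent rather than the stewards simply approving more.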

Agentic entity matching | Human review queue | Provenance on every triple | ~3–4 months

Phase 3: Generate the First Downstream Deliverable from the Graph

Pick one — IDMP submission XML, a pharmacovigilance line listing, or a commercial product hierarchy — and replace its existing pipeline with a query against the graph. This is the moment that proves the architecture commercially and funds the remaining phases.

SPARQL / Cypher query | IDMP XML generation | Phv line listing | ~2 months

Phase 4: Reduce the MDM Hub to a Read Model

Existing hubs do not need to die. They become projections, refreshed from the graph, serving systems that still want a flat API. New systems read the graph directly. The hub’s stewardship workflows retire one domain at a time.

Materialized views | Legacy system bridging | Stewardship migration

Phase 5: Expand Transitions and Named Graphs

Each subsequent lifecycle transition adds a named graph and a small number of typed edges. The cost of adding the n+1th transition is sublinear in n — the architectural payoff that hub-and-spoke MDM never delivered. Scale the agentic mastering loop across all source domains.

Named graph expansion | Cross-lifecycle edges | Sublinear marginal cost

Closing Thought

The Trajectory Is Not a Row

The interesting question is no longer whether graphs will replace traditional MDM for pharma product mastering — the regulators, the standards bodies, and eleven of the top twenty pharma companies have already answered that with IDMP-O. The interesting question is whether your organization will treat the next IDMP wave, the next AI regulation, and the next R&D-to-commercial handoff as another point fix to the hub — or as the moment to invest in the architecture the next decade of pharma data already presupposes.

The trajectory is not a row. Stop trying to flatten it.

References

1. Gartner. Master Data Management (MDM), IT Glossary. gartner.com
2. U.S. Food and Drug Administration. FDA’s Global Substance Registration System (GSRS) / UNII. gsrs.ncats.nih.gov
3. International Organization for Standardization. ISO 11615:2017, Identification of medicinal products. iso.org
4. European Medicines Agency. Data on medicines (ISO IDMP standards): overview. ema.europa.eu
5. Pistoia Alliance. IDMP-O: IDMP Ontology project page. pistoiaalliance.org
6. Pistoia Alliance Ontologies. IDMP-O GitHub repository. github.com
7. ICH. E6(R3) Good Clinical Practice, Final Guideline, adopted 6 January 2025. ich.org
8. EMBL-EBI. ChEBI: Chemical Entities of Biological Interest. ebi.ac.uk/chebi
9. Geleta D., Nikolov A., et al. (AstraZeneca). Biological Insights Knowledge Graph: an integrated knowledge graph to support drug development. bioRxiv 2021. doi.org
10. Ochoa D., et al. Open Targets Platform: supporting systematic drug-target identification and prioritisation. Nucleic Acids Research, 2020. doi.org
11. Himmelstein D.S., et al. Systematic integration of biomedical knowledge prioritizes drugs for repurposing (Hetionet). eLife, 2017. doi.org
12. MacLean F. Knowledge graphs and their applications in drug discovery. Expert Opinion on Drug Discovery, 2021. doi.org
13. Bonner S., et al. A Review of Biomedical Datasets Relating to Drug Discovery: A Knowledge Graph Perspective. Briefings in Bioinformatics, 2022. doi.org
14. Hanna J., et al. Building a drug ontology based on RxNorm and other sources (DrON). Journal of Biomedical Semantics 4:44, 2013. doi.org
15. U.S. National Library of Medicine. RxNorm. nlm.nih.gov
16. ICH MSSO. MedDRA: Medical Dictionary for Regulatory Activities. meddra.org
17. National Cancer Institute. NCI Thesaurus (NCIt). ncit.nci.nih.gov
18. Narayan A., et al. Can Foundation Models Wrangle Your Data? PVLDB Vol. 16, 2022. arxiv.org
19. Peeters R., Bizer C. Using ChatGPT for Entity Matching. ADBIS 2023. arxiv.org
20. Fernandez R.C., et al. How Large Language Models Will Disrupt Data Management. PVLDB 16(11), 2023. doi.org
21. EMA. Reflection paper on the use of AI in the medicinal product lifecycle. EMA/CHMP/CVMP/83833/2023, Sep 2024. ema.europa.eu
22. FDA. AI/ML in Drug Development, Discussion Paper, 2023. fda.gov
23. Wilkinson M.D., et al. The FAIR Guiding Principles for scientific data management and stewardship. Scientific Data 3:160018, 2016. doi.org
24. Yeh I. Unified Under a Universal Language: The Pistoia Alliance’s IDMP Ontology. Bio-IT World, 25 June 2024. bio-itworld.com
