Master Data  ·  Knowledge Graphs  ·  Pharma R&D

Product Mastering Across the
R&D-to-Commercial Lifecycle:
Why Knowledge Graphs Beat Traditional MDM

A practitioner’s view on why the row-and-column gold record breaks down for medicinal products — and what to build instead. Covers ISO IDMP, IDMP-O, bitemporal graphs, agentic entity resolution, and a deployable reference architecture.

May 25, 2026 · ~25 min read · Ali Shahmohammadi, Ph.D.

11 · Major pharma companies co-building IDMP-O to replace hub MDM
5 · ISO IDMP standards forming a graph, not a table
118M+ · Edges in AstraZeneca’s production knowledge graph (BIKG)
0 · Survivorship rules needed when the ontology is the master

Table of Contents
  1. The Product Mastering Problem
  2. Why Hub-and-Spoke MDM Breaks Down
  3. The Knowledge Graph Alternative
  4. Connecting MDM to Ontologies
  5. Reference Architecture
  6. Agentic AI for First-Pass Mastering
  7. AI-Readiness & Regulatory Submissions
  8. Implementation Roadmap

01 — The Problem

The Product Mastering Problem in Pharma

In most industries, “product” is a stable thing. Pharma does not fit that world — the entity is not a thing, it is a trajectory.

Master Data Management was designed for the world of washing machines: a SKU, a price, a manufacturer, and a few attributes that change slowly. Gartner defines it as the discipline that ensures “uniformity, accuracy, stewardship, semantic consistency and accountability of the enterprise’s official shared master data assets.” [1] That world assumes a stable entity with a single identity. Pharma has neither.

The same scientific entity moves through a dozen identity transformations before it reaches a patient. The right answer to any master data question depends on which point in the lifecycle the asker stands at. There is no single gold record because there is no single business question a one-row record would answer correctly.

Discovery: Structure hash · ChEMBL ID · Internal code
Preclinical: UNII (FDA GSRS) · INN (WHO) · ISO 11238 substance
Clinical: IND / EudraCT · Protocol product · Dose form + route
Regulatory: MPID (ISO 11615) · PhPID (ISO 11616) · Marketing Auth.
Commercial: NDC (US) · GTIN · Brand + label

Each stage is authoritative for its own identifiers. No single row captures all of them correctly at every point in time.

The core insight: A single asset accumulates identifiers across discovery, development, manufacturing, regulatory, supply chain, and commercial systems — each with its own authoritative source and its own update cadence. The entity is not a thing. It is a trajectory. Stop trying to flatten it into a row.

02 — Why MDM Fails

Why Hub-and-Spoke MDM Breaks Down

Classical MDM’s four implementation styles — registry, consolidation, coexistence, centralized — all assume the same primitive: a denormalized “golden” row chosen by survivorship rules. When the entity is relational and evolving, they all fail in the same ways.

Lossy denormalization of multi-cardinality

An MPID maps to multiple PhPIDs (strengths/forms), each PhPID to multiple substances, each substance to multiple specified-substance variants. Forcing this into a flat hub schema either explodes row counts or hides cardinality in a JSON blob the stewardship workflow can’t reason over. ISO IDMP is structurally a network: a graph published as five interlocking standards [3].
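
A minimal Python sketch of that cardinality problem, under stated assumptions: the identifiers (MPID-001, PhPID-A, SUB-1, and so on) and the predicate names are invented for illustration and are not real IDMP values. It holds one product as a small edge list, then flattens it the way a hub schema would, with one row per full path.

```python
# Hypothetical identifiers and predicate names, for illustration only.
EDGES = [
    ("MPID-001", "hasPhPID", "PhPID-A"),
    ("MPID-001", "hasPhPID", "PhPID-B"),
    ("PhPID-A", "hasSubstance", "SUB-1"),
    ("PhPID-A", "hasSubstance", "SUB-2"),
    ("PhPID-B", "hasSubstance", "SUB-1"),
    ("SUB-1", "hasSpecifiedSubstance", "SPEC-1a"),
    ("SUB-1", "hasSpecifiedSubstance", "SPEC-1b"),
    ("SUB-2", "hasSpecifiedSubstance", "SPEC-2a"),
]


def objects(subject, predicate):
    """Return every object linked from `subject` by `predicate`, sorted."""
    return sorted(o for s, p, o in EDGES if s == subject and p == predicate)


def flatten(mpid):
    """Denormalize the graph into one row per MPID/PhPID/substance/variant path."""
    rows = []
    for phpid in objects(mpid, "hasPhPID"):
        for substance in objects(phpid, "hasSubstance"):
            for variant in objects(substance, "hasSpecifiedSubstance"):
                rows.append((mpid, phpid, substance, variant))
    return rows


rows = flatten("MPID-001")
# Count how many flat rows repeat the single substance SUB-1.
sub1_copies = sum(1 for row in rows if row[2] == "SUB-1")
print(f"{len(EDGES)} edges, {len(rows)} flat rows, SUB-1 repeated in {sub1_copies} rows")
```

In the edge list each fact about SUB-1 is stated once; in the flat form it is copied into every row that passes through it, so a correction to that substance becomes a multi-row update. Row counts grow with the product of the cardinalities, while edges grow with their sum.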

The “single source of truth” misdiagnosis

The problem is not that pharma has too many sources of truth — it is that each is authoritative for a different facet at a different time. The LIMS is authoritative for structure. The regulatory system is authoritative for MPID assignment. Survivorship rules cannot capture “authoritative for attribute X only after milestone Y.”

Stewardship assumes a single entity owner

In R&D-to-commercial, ownership transitions handoff by handoff: chemistry → CMC → clinical → regulatory → commercial. Each handoff is itself a piece of master data. Hub MDM has no native primitive for this; it gets encoded as a status column that inevitably drifts from the actual SOP.

Historical truth & bitemporality

An adverse event on a 2018 batch of a 2014 formulation must trace to the specific substance, site, and product information that existed at that time — not the current golden record. Most MDM hubs implement temporality as effective-dated rows. Pharmacovigilance requires bitemporality (when the fact was true × when we knew it), which hub MDM architecturally cannot offer.
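
To make the two time axes concrete, here is a small Python sketch. It is an illustration under assumptions, not a reference design: the `Fact` fields, the `as_of` helper, and the sample data (formulation `F-2014`, `Site-A`, `Site-B`) are all hypothetical. Each fact carries a validity interval (when it was true) and an assertion interval (when the system believed it).

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Fact:
    subject: str
    predicate: str
    value: str
    valid_from: date                      # when the fact became true in the world
    valid_to: Optional[date]              # None means still true
    asserted_at: date                     # when the system recorded the fact
    retracted_at: Optional[date] = None   # when the system stopped believing it


def as_of(facts, subject, predicate, valid_at, known_at):
    """Values true at `valid_at`, as the system believed them at `known_at`."""
    result = []
    for f in facts:
        if f.subject != subject or f.predicate != predicate:
            continue
        valid = f.valid_from <= valid_at and (f.valid_to is None or valid_at < f.valid_to)
        known = f.asserted_at <= known_at and (f.retracted_at is None or known_at < f.retracted_at)
        if valid and known:
            result.append(f.value)
    return result


# Hypothetical history: a 2019 correction reveals the site changed in mid-2017.
facts = [
    Fact("F-2014", "manufacturedAt", "Site-A", date(2014, 1, 1), None,
         asserted_at=date(2014, 1, 10), retracted_at=date(2019, 3, 1)),
    Fact("F-2014", "manufacturedAt", "Site-A", date(2014, 1, 1), date(2017, 6, 1),
         asserted_at=date(2019, 3, 1)),
    Fact("F-2014", "manufacturedAt", "Site-B", date(2017, 6, 1), None,
         asserted_at=date(2019, 3, 1)),
]

believed_in_2018 = as_of(facts, "F-2014", "manufacturedAt", date(2018, 5, 1), date(2018, 12, 31))
believed_in_2020 = as_of(facts, "F-2014", "manufacturedAt", date(2018, 5, 1), date(2020, 1, 1))
print(believed_in_2018, believed_in_2020)
```

The same question about a 2018 batch gets two defensible answers, depending on whether the asker wants what was known at the time or what is known now. Effective-dated rows alone carry only the first axis.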

Industry signal: The Pistoia Alliance’s IDMP-O project, built jointly by 11 pharmaceutical companies (Bayer, Novartis, GSK, Roche, Merck KGaA, Boehringer Ingelheim, J&J, AstraZeneca, Amgen, AbbVie, and Pfizer), exists because relational MDM implementations of ISO IDMP have produced diverging, non-interoperable outputs across jurisdictions [5]. When eleven major pharma companies pool funding to build an ontology rather than another hub, that’s a signal worth reading.

03 — The Alternative

The Knowledge Graph Alternative

A knowledge graph inverts the MDM primitive. Instead of a denormalized row, the unit of mastering is the triple: (subject, predicate, object). Four properties of the graph model do real work that MDM cannot.

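# Example: lifecycle transition as a typed edge with provenance
@prefix : <https://example.org/pkg#> .  # example.org namespaces are placeholders
@prefix compound: <https://example.org/compound/> .
@prefix devCandidate: <https://example.org/devCandidate/> .
@prefix agentRun: <https://example.org/agentRun/> .
@prefix unii: <https://example.org/unii/> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

compound:CHEMBL1234 :nominatedAs devCandidate:DC-2017-44 .
devCandidate:DC-2017-44 :effectiveDate "2017-06-12"^^xsd:date .
devCandidate:DC-2017-44 :assertedBy agentRun:OntoCurator-v2.1-20170612 .
devCandidate:DC-2017-44 :assertedAt "2017-06-12T14:22:01Z"^^xsd:dateTime .

# Both identities survive — nothing is overwritten
compound:CHEMBL1234 owl:sameAs unii:ABCD1234EF .

Entities are persistent across lifecycle states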

A discovery compound and the marketed substance are linked by an edge — not collapsed into one row. Both identities survive. A pharmacovigilance query in 2030 can resolve the 2014 entity exactly as it stood, because nothing was overwritten.

Lifecycle transitions are edges, not status columns

“Compound X was nominated as development candidate on 2017-06-12 under decision D-2017-44” becomes a typed edge with provenance and an effective interval. The graph encodes the event of transition itself, which is the kind of record that audit-trail expectations under ICH E6(R3) call for [7].

Semantic versioning replaces survivorship

Each fact is asserted by a named source graph with a version and validity interval. The same property can hold different values for different consumers simultaneously — clinical ops reads the protocol-locked definition while commercial reads the current marketed config, from the same graph, by selecting the appropriate named graph or VALID-AT timestamp.

Reasoning is compositional

Because ontology classes carry formal semantics (subClassOf, equivalentClass, disjointWith), queries can ask “every marketed product whose active moiety is a JAK inhibitor manufactured at site S” without anyone hard-coding “JAK inhibitor” into a hub attribute. The classification lives in ChEBI or the relevant drug ontology, not the master data system.
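
The sketch below shows the idea in plain Python rather than SPARQL. It assumes a toy class hierarchy standing in for an ontology such as ChEBI (the real ontology models drug classes with more nuance), and the product IDs and site codes are hypothetical. The query walks subclass links transitively, so the product records never mention “JAK inhibitor” at all.

```python
# Simplified stand-in for an ontology's subclass hierarchy: child -> parents.
SUBCLASS_OF = {
    "tofacitinib": {"JAK inhibitor"},
    "baricitinib": {"JAK inhibitor"},
    "JAK inhibitor": {"kinase inhibitor"},
    "atorvastatin": {"statin"},
}

# Hypothetical product records: they carry the active moiety, not the drug class.
PRODUCTS = [
    {"id": "P1", "active_moiety": "tofacitinib", "site": "S"},
    {"id": "P2", "active_moiety": "baricitinib", "site": "T"},
    {"id": "P3", "active_moiety": "atorvastatin", "site": "S"},
]


def is_a(term, target):
    """True if `term` equals `target` or reaches it through subclass links."""
    stack, seen = [term], set()
    while stack:
        current = stack.pop()
        if current == target:
            return True
        if current in seen:
            continue
        seen.add(current)
        stack.extend(SUBCLASS_OF.get(current, ()))
    return False


def products_in_class(target, site):
    """IDs of products made at `site` whose active moiety falls under `target`."""
    return [
        p["id"]
        for p in PRODUCTS
        if p["site"] == site and is_a(p["active_moiety"], target)
    ]


print(products_in_class("JAK inhibitor", "S"))
print(products_in_class("kinase inhibitor", "S"))
```

Adding a new class above or below “JAK inhibitor” changes only the hierarchy, not the product records or the query.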

Industrial evidence: AstraZeneca’s Biological Insights Knowledge Graph (BIKG) contains approximately 10.9 million nodes (22 types) and over 118 million edges (59 types), built from 39+ internal and public datasets for target identification and drug repurposing [9]. Open Targets, Hetionet, and dozens of peer-reviewed KGs document the same pattern across the industry [10][11].

04 — Ontology Stack

Connecting MDM to Ontologies

The right pattern is to let the ontologies become the master, and have what used to be MDM serve as the operational read model for systems that still need a flat view.

| Layer | Ontology / Standard | Role in Product KG | Key IDs |
|---|---|---|---|
| Chemical structure & moiety | ChEBI | Classifies substances by structure and biological role; supports active-moiety reasoning across ingredient ↔ drug class ↔ mechanism | CHEBI:xxxxxx |
| Drug product semantics | DrON | OBO-Foundry drug ontology aligning RxNorm-derived clinical concepts with formal class semantics; bridges chemical and clinical views | DRON:xxxxxxxx |
| Substance identity (regulatory) | UNII / GSRS · ISO 11238 | Globally unique substance identifiers and the substance-definition standard; required for all FDA submissions | UNII: 10-char |
| Clinical drug normalization | RxNorm | NLM-maintained normalized names linking ingredient ↔ strength ↔ dose form ↔ brand; the standard interface to EMR/EHR systems | RxCUI: xxxxxxx |
| Adverse events & indications | MedDRA | ICH-maintained terminology required for regulatory safety reporting globally; maps to FAERS, EudraVigilance, VigiBase | MedDRA PT code |
| Oncology / mechanism / anatomy | NCIt | NCI Enterprise Vocabulary used widely in regulatory submissions; covers drug, disease, anatomy, and procedure hierarchies | C xxxxxxx |
| End-to-end product identity | IDMP-O (Pistoia) | The integrating layer: realizes ISO 11238/11239/11240/11615/11616/21090 in OWL; engineered to bridge chemistry ontologies upstream and regulatory identifiers downstream [5] | MPID / PhPID |

What makes this stack work — rather than just a list — is alignment: each ontology declares equivalent classes and properties where they overlap, so a SPARQL query about an “active moiety” can resolve across ChEBI, UNII, and RxNorm consistently.

The MDM-style flat record doesn’t disappear in this architecture — it becomes a materialized projection of the graph, generated for systems that need a denormalized API. Crucially, the projection is derivable and reproducible from a version of the graph, which means data lineage is intrinsic rather than bolted on.
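
One way to picture “derivable and reproducible” is the Python sketch below. The function names (`snapshot_id`, `project_flat`), the predicates, and the sample product are assumptions made for illustration. The flat row is computed from a set of triples and stamped with a fingerprint of the snapshot it came from.

```python
import hashlib
import json


def snapshot_id(triples):
    """Deterministic, order-independent fingerprint of a graph snapshot."""
    canonical = json.dumps(sorted(triples), separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


def project_flat(triples, product_id, columns):
    """Build one denormalized row for `product_id`.

    `columns` maps a flat column name to the graph predicate it is read from.
    Multi-valued predicates are joined with "|", which is visibly lossy.
    """
    row = {"product_id": product_id, "source_snapshot": snapshot_id(triples)}
    for column, predicate in columns.items():
        values = sorted(o for s, p, o in triples if s == product_id and p == predicate)
        row[column] = "|".join(values)
    return row


# Hypothetical product and predicates.
triples = [
    ("MPID-001", "hasName", "Examplinib"),
    ("MPID-001", "hasStrength", "5 mg"),
    ("MPID-001", "hasStrength", "10 mg"),
]
columns = {"name": "hasName", "strengths": "hasStrength"}

row = project_flat(triples, "MPID-001", columns)
print(row)
```

Because the row records which snapshot produced it, anyone holding that snapshot can regenerate the identical row. That is the sense in which lineage comes with the projection rather than being reconstructed afterwards.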

05 — Reference Architecture

A Reference Architecture

The shape that emerges in practice, across discovery, development, regulatory, and commercial, comes with three architectural choices that distinguish it from “another KG on top of the data lake.”

ONTOLOGY LAYER (versioned, FAIR)
  ChEBI · DrON · RxNorm · MedDRA · NCIt · UNII / GSRS · IDMP-O (Pistoia Alliance)
  Alignment relations: subClassOf · equivalentClass · sameAs · alignedWith

PRODUCT KNOWLEDGE GRAPH (RDF / LPG)
  Entities: Compound · Substance · PharmaceuticalProduct · MedicinalProduct · PackagedProduct · Indication · Organization · MarketingAuthorization · ClinicalStudy · Batch
  Edges: evolvedInto · activeMoietyOf · authorizedAs · manufacturedAt · indicatedFor · supersededBy
  Named graphs: discovery · clinical · regulatory · commercial
  Bitemporal: validAt × assertedAt

Flow labels: ingest + reconcile · steward + version · project

AGENTIC AI
  · Entity matching & resolution
  · Schema mapping
  · Link prediction
  · Provenance on every triple

HUMAN STEWARDS
  · Exception review queue
  · Ontology curation decisions
  · Policy & governance approvals
  · Human-in-the-loop sign-off

MATERIALIZED READ MODELS
  · Flat MDM views (legacy systems)
  · IDMP submission XML
  · Pharmacovigilance line listings
  · Commercial product hierarchy

SOURCE SYSTEMS
  ELN · LIMS · CTMS · Regulatory submission systems · MES · ERP · Safety / Pharmacovigilance · CRM / Commercial

1. Ontology layer is governed independently

Ontology releases are versioned (SemVer) and the graph carries usedOntologyVersion provenance on every assertion. A MedDRA v27.0 → v27.1 upgrade does not silently change historical safety queries — it produces a new named graph the consumer can opt into.

2. Named graphs partition the lifecycle

Discovery, clinical, regulatory, and commercial facts each live in their own named graph with explicit cross-graph edges. This makes “authoritative for attribute X only after milestone Y” expressible declaratively — not as a status column.
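
As a rough illustration of what “declaratively” can mean here, the Python sketch below keeps the authority rules as data rather than as a status column. Everything in it is an assumption for the example: the `AUTHORITY` table, the milestone name, the attribute names, and the sample values.

```python
from datetime import date

# attribute -> ordered rules of (named graph, milestone that activates it).
# A milestone of None means the graph is authoritative from the start.
AUTHORITY = {
    "structure": [("discovery", None)],
    "dose_form": [("clinical", None), ("regulatory", "marketing_authorization")],
}

# Hypothetical quads: (named graph, subject, attribute, value).
QUADS = [
    ("clinical", "P1", "dose_form", "capsule"),
    ("regulatory", "P1", "dose_form", "film-coated tablet"),
]


def authoritative_graph(attribute, milestones, at):
    """Return the graph of the last rule whose milestone was reached by `at`."""
    chosen = None
    for graph, milestone in AUTHORITY[attribute]:
        if milestone is None or (milestone in milestones and milestones[milestone] <= at):
            chosen = graph
    return chosen


def resolve(subject, attribute, milestones, at):
    """Read the attribute from whichever named graph is authoritative at `at`."""
    graph = authoritative_graph(attribute, milestones, at)
    return [v for g, s, a, v in QUADS if g == graph and s == subject and a == attribute]


milestones = {"marketing_authorization": date(2021, 3, 1)}
before = resolve("P1", "dose_form", milestones, date(2020, 1, 1))
after = resolve("P1", "dose_form", milestones, date(2022, 1, 1))
print(before, after)
```

Both values stay in the graph. What changes at the milestone is which named graph a consumer is pointed to, and that rule can be reviewed and versioned like any other piece of master data.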

3. Materialized read models are deliverables, not the master

IDMP submission XML, the safety data warehouse, the commercial product hierarchy — all generated from the graph by query, all reproducible from a graph snapshot. The MDM hub, where one still exists, becomes a cache.

06 — Agentic AI

Agentic AI for First-Pass Mastering, Human-in-the-Loop for the Rest

Foundation models change the economics of entity resolution. Three peer-reviewed results carry the load — and the workable pattern is agentic first-pass with structured handoff.

Research foundation: Narayan et al. (Stanford, PVLDB 2022) showed foundation models match or exceed task-specific SOTA on five data wrangling tasks including entity matching and schema matching, accessed only via in-context learning [18]. Peeters & Bizer (Mannheim, 2023) confirmed ChatGPT reaches fine-tuned model performance on entity-matching benchmarks without any task-specific training data [19]. Fernandez et al. (Chicago, PVLDB 2023) grounded the broader thesis: LLMs have “unprecedented ability to ground database tuples, schemas, and queries in real-world concepts.” [20]

Step 01: Source Ingestion

The agent extracts entities from ELN, LIMS, and regulatory sources, with structural identifiers and raw text context.

Step 02: Candidate Matching

The LLM proposes candidate matches with calibrated confidence and grounded citations (ontology class, regulatory record, structural ID).

Step 03: High-Confidence Path

Proposals above a domain-specific threshold are auto-asserted into a staging named graph, with provenance pointing to the agent run and model version.

Step 04: Human Review Queue

Low-confidence or policy-sensitive proposals (marketed products, registered substances, safety signals) are routed to stewards with the agent’s reasoning attached.

Step 05: Feedback Loop

Steward decisions feed back into the agent’s few-shot context for the next batch, a closed loop that compounds quality over time.

The non-negotiable constraint is provenance on every asserted triple. Without it, the system fails inspection. With it, the agent becomes auditable: every assertion has a who (agent run + model version), a why (evidence), and a when (assertedAt timestamp). Any triple can be retracted without disturbing unrelated facts. This is something MDM survivorship architecturally cannot offer.
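
The routing and provenance rules above can be sketched in a few lines of Python. This is a simplified illustration, not a production design: the class names, the 0.95 threshold, and the agent-run label are assumptions, and a real system would persist assertions in a named graph rather than a list.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class Proposal:
    triple: tuple            # (subject, predicate, object) proposed by the agent
    confidence: float        # calibrated match confidence in [0, 1]
    evidence: list           # the "why": ontology class, regulatory record, etc.
    policy_sensitive: bool = False


@dataclass
class Assertion:
    triple: tuple
    asserted_by: str         # the "who": agent run plus model version
    evidence: list
    asserted_at: datetime    # the "when"


class StagingGraph:
    def __init__(self, threshold=0.95):  # threshold value is an assumption
        self.threshold = threshold
        self.assertions = []
        self.review_queue = []

    def route(self, proposal, agent_run):
        """Auto-assert confident, non-sensitive proposals; queue the rest."""
        if proposal.policy_sensitive or proposal.confidence < self.threshold:
            self.review_queue.append(proposal)
            return "review"
        self.assertions.append(
            Assertion(proposal.triple, agent_run, proposal.evidence,
                      datetime.now(timezone.utc))
        )
        return "asserted"

    def retract(self, triple):
        """Remove one triple's assertions; unrelated facts are untouched."""
        kept = [a for a in self.assertions if a.triple != triple]
        removed = len(self.assertions) - len(kept)
        self.assertions = kept
        return removed


staging = StagingGraph()
run = "matcher-run-001/model-v1"  # hypothetical agent run label
routes = [
    staging.route(Proposal(("C1", "sameSubstanceAs", "U1"), 0.99, ["structure hash"]), run),
    staging.route(Proposal(("C2", "sameSubstanceAs", "U2"), 0.97, ["name match"]), run),
    staging.route(Proposal(("C3", "sameSubstanceAs", "U3"), 0.70, ["name match"]), run),
    staging.route(Proposal(("P9", "authorizedAs", "MPID-9"), 0.99, ["label"], True), run),
]
print(routes)
```

Retraction works per triple because each assertion carries its own provenance: `staging.retract(("C1", "sameSubstanceAs", "U1"))` removes that one fact and leaves the `C2` assertion in place.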

07 — AI-Readiness & Regulatory

What This Means for AI-Readiness and Regulatory Submissions

Two regulatory currents are converging on the same conclusion: product data needs to be machine-processable, traceable, and FAIR. The harder half of “AI-readiness” is data-side, not model-side.

EMA: Reflection Paper on AI in the Medicinal Product Lifecycle

CHMP/CVMP adoption September 2024. Covers AI use across discovery, non-clinical, clinical trials, manufacturing, and post-authorization. Explicitly emphasizes risk management of data quality, traceability, and human oversight throughout the lifecycle [21].

EMA/CHMP/CVMP/83833/2023 · Sep 2024

FDA: AI/ML in Drug Development Discussion Paper

FDA’s 2023 counterpart to the EMA reflection paper. Sets expectations for transparency, validation, and data provenance in AI-assisted drug development workflows. Presupposes well-mastered, traceable source data [22].

FDA Discussion Paper · 2023

ICH: E6(R3) Good Clinical Practice

Adopted January 2025, EMA effective July 2025. Modernizes GCP for decentralized trials, electronic sources, and quality-by-design, all of which presuppose well-mastered protocol and product metadata [7].

Adopted 6 Jan 2025 · EMA effective Jul 2025

IDMP as a query, not a project: If your product mastering is a row-shaped hub, generating IDMP-conformant XML at the cardinality and bitemporality the standard demands is a recurring engineering project, every submission. If your product mastering is graph-shaped and aligned to IDMP-O, IDMP becomes a SPARQL query. The same is true for pharmacovigilance line listings, label content management, and structured product labeling.
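
To show the shape of “deliverable as a query”, here is a deliberately small Python stand-in. It is not the IDMP message format: the element names (`MedicinalProduct`, `PharmaceuticalProduct`, `Substance`), the predicates, and the identifiers are hypothetical, and a real pipeline would run SPARQL against the graph and emit the schema the regulator specifies.

```python
import xml.etree.ElementTree as ET

# Hypothetical graph content for one product.
TRIPLES = [
    ("MPID-001", "hasName", "Examplinib 10 mg tablets"),
    ("MPID-001", "hasPhPID", "PhPID-A"),
    ("MPID-001", "hasPhPID", "PhPID-B"),
    ("PhPID-A", "hasSubstance", "SUB-1"),
    ("PhPID-B", "hasSubstance", "SUB-1"),
    ("PhPID-B", "hasSubstance", "SUB-2"),
]


def objects(subject, predicate):
    """Return every object linked from `subject` by `predicate`, sorted."""
    return sorted(o for s, p, o in TRIPLES if s == subject and p == predicate)


def product_xml(mpid):
    """Walk the graph from one MPID and emit nested XML (illustrative schema)."""
    root = ET.Element("MedicinalProduct", id=mpid)
    for name in objects(mpid, "hasName"):
        ET.SubElement(root, "Name").text = name
    for phpid in objects(mpid, "hasPhPID"):
        pharma = ET.SubElement(root, "PharmaceuticalProduct", id=phpid)
        for substance in objects(phpid, "hasSubstance"):
            ET.SubElement(pharma, "Substance", id=substance)
    return ET.tostring(root, encoding="unicode")


xml_text = product_xml("MPID-001")
print(xml_text)
```

The nesting in the output follows the edges directly, so the one-to-many relationships survive without a denormalization step. Regenerating the document for the next submission is a rerun of the same traversal against a newer snapshot.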

“AI-readiness” is often discussed as a model-side concern (GPUs, MLOps, vector DBs). The harder, less glamorous half is data-side: an LLM that has to ground “is this the same product as the one in the 2019 BLA” in a brittle MDM gold record is prone to hallucinate. One that can query a typed, versioned, provenance-rich knowledge graph won’t have to guess. The FAIR Guiding Principles [23] are, almost word-for-word, a description of a properly governed knowledge graph: globally unique persistent identifiers, retrieval by standardized protocol, formal knowledge representation, rich provenance.

08 — Implementation Roadmap

Implementation Roadmap

A staged path that has survived contact with reality. Workable budget for phases 0–2: a core team of 4–6 engineers, 1–2 ontologists, 1–2 stewards per domain, and 9–12 months.

The most common failure mode is not technical — it’s treating the ontology as an IT artifact instead of a regulated piece of master data with its own governance. Fund the ontologists.

Phase 0: Pick One Transition, Not the Whole Lifecycle

The candidate-nomination → IND filing transition, or the marketing-authorization → first-commercial-launch transition. Both are bounded enough to ship in a quarter and consequential enough to fund the next phase. Avoid the “we’ll boil the ocean” KG that maps every system on day one.

Scoping · Stakeholder alignment · ~1 month

Phase 1: Stand Up the Ontology and Graph in Parallel

Adopt IDMP-O as the spine; pull in ChEBI, RxNorm, MedDRA, NCIt, and UNII as needed for the chosen transition. Define internal extensions explicitly in a versioned namespace you control, with documented alignments to public ontologies.

IDMP-O · ChEBI · RxNorm · MedDRA · Neptune or Neo4j · ~2–3 months

Phase 2: Wire In Two Source Systems, End to End

One discovery-side (LIMS or ELN), one regulatory-side (the system holding MPID assignments). Build the agentic mastering loop with human-in-the-loop review on everything that touches a regulatory identifier. Measure: % triples with grounded provenance, steward override rate, round-trip latency from source change to graph assertion.
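
The three measures named above are simple to compute once assertions and steward reviews are logged. The Python sketch below assumes a hypothetical log shape (dictionaries with `asserted_by`, `evidence`, `agent_proposal`, and `steward_decision` keys) and uses the median for latency, which is one reasonable choice among several.

```python
from statistics import median


def mastering_metrics(assertions, reviews, latencies_s):
    """Summarize provenance coverage, steward overrides, and round-trip latency.

    assertions:  dicts with optional "asserted_by" and "evidence" keys
    reviews:     dicts with "agent_proposal" and "steward_decision" keys
    latencies_s: seconds from source change to graph assertion
    """
    grounded = sum(1 for a in assertions if a.get("asserted_by") and a.get("evidence"))
    overrides = sum(1 for r in reviews if r["steward_decision"] != r["agent_proposal"])
    return {
        "provenance_pct": 100.0 * grounded / len(assertions) if assertions else 0.0,
        "override_rate": overrides / len(reviews) if reviews else 0.0,
        "median_latency_s": median(latencies_s) if latencies_s else None,
    }


# Hypothetical log entries.
assertions = [
    {"asserted_by": "run-1", "evidence": ["structure hash"]},
    {"asserted_by": "run-1", "evidence": ["regulatory record"]},
    {"asserted_by": "run-2", "evidence": ["ontology class"]},
    {"asserted_by": "run-2", "evidence": []},  # no grounding evidence recorded
]
reviews = [
    {"agent_proposal": "match", "steward_decision": "match"},
    {"agent_proposal": "match", "steward_decision": "match"},
    {"agent_proposal": "match", "steward_decision": "no_match"},
    {"agent_proposal": "no_match", "steward_decision": "no_match"},
    {"agent_proposal": "match", "steward_decision": "match"},
]
metrics = mastering_metrics(assertions, reviews, [30, 90, 60])
print(metrics)
```

A falling override rate over successive batches is a direct check on whether the feedback loop is doing its job. A provenance percentage below 100 points to assertions that would not survive an inspection.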

Agentic entity matching · Human review queue · Provenance on every triple · ~3–4 months

Phase 3: Generate the First Downstream Deliverable from the Graph

Pick one — IDMP submission XML, a pharmacovigilance line listing, or a commercial product hierarchy — and replace its existing pipeline with a query against the graph. This is the moment that proves the architecture commercially and funds the remaining phases.

SPARQL / Cypher query · IDMP XML generation · Pharmacovigilance line listing · ~2 months

Phase 4: Reduce the MDM Hub to a Read Model

Existing hubs do not need to die. They become projections, refreshed from the graph, serving systems that still want a flat API. New systems read the graph directly. The hub’s stewardship workflows retire one domain at a time.

Materialized views · Legacy system bridging · Stewardship migration

Phase 5: Expand Transitions and Named Graphs

Each subsequent lifecycle transition adds a named graph and a small number of typed edges. The cost of adding the (n+1)th transition is sublinear in n, which is the architectural payoff that hub-and-spoke MDM never delivered. Scale the agentic mastering loop across all source domains.

Named graph expansion · Cross-lifecycle edges · Sublinear marginal cost

Closing Thought

The Trajectory Is Not a Row

The interesting question is no longer whether graphs will replace traditional MDM for pharma product mastering: the regulators, the standards bodies, and eleven major pharma companies have already answered that with IDMP-O. The interesting question is whether your organization will treat the next IDMP wave, the next AI regulation, and the next R&D-to-commercial handoff as another point fix to the hub, or as the moment to invest in the architecture the next decade of pharma data already presupposes.

The trajectory is not a row. Stop trying to flatten it.

References


  1. Gartner. Master Data Management (MDM), IT Glossary. gartner.com
  2. U.S. Food and Drug Administration. FDA’s Global Substance Registration System (GSRS) / UNII. gsrs.ncats.nih.gov
  3. International Organization for Standardization. ISO 11615:2017 — Identification of medicinal products. iso.org
  4. European Medicines Agency. Data on medicines (ISO IDMP standards): overview. ema.europa.eu
  5. Pistoia Alliance. IDMP-O — IDMP Ontology project page. pistoiaalliance.org
  6. Pistoia Alliance Ontologies. IDMP-O GitHub repository. github.com
  7. ICH. E6(R3) Good Clinical Practice — Final Guideline, adopted 6 January 2025. ich.org
  8. EMBL-EBI. ChEBI — Chemical Entities of Biological Interest. ebi.ac.uk/chebi
  9. Geleta D., Nikolov A., et al. (AstraZeneca). Biological Insights Knowledge Graph: an integrated knowledge graph to support drug development. bioRxiv 2021. doi.org
  10. Ochoa D., et al. Open Targets Platform: supporting systematic drug-target identification and prioritisation. Nucleic Acids Research, 2020. doi.org
  11. Himmelstein D.S., et al. Systematic integration of biomedical knowledge prioritizes drugs for repurposing (Hetionet). eLife, 2017. doi.org
  12. MacLean F. Knowledge graphs and their applications in drug discovery. Expert Opinion on Drug Discovery, 2021. doi.org
  13. Bonner S., et al. A Review of Biomedical Datasets Relating to Drug Discovery: A Knowledge Graph Perspective. Briefings in Bioinformatics, 2022. doi.org
  14. Hanna J., et al. Building a drug ontology based on RxNorm and other sources (DrON). Journal of Biomedical Semantics 4:44, 2013. doi.org
  15. U.S. National Library of Medicine. RxNorm. nlm.nih.gov
  16. ICH MSSO. MedDRA — Medical Dictionary for Regulatory Activities. meddra.org
  17. National Cancer Institute. NCI Thesaurus (NCIt). ncit.nci.nih.gov
  18. Narayan A., et al. Can Foundation Models Wrangle Your Data? PVLDB Vol. 16, 2022. arxiv.org
  19. Peeters R., Bizer C. Using ChatGPT for Entity Matching. ADBIS 2023. arxiv.org
  20. Fernandez R.C., et al. How Large Language Models Will Disrupt Data Management. PVLDB 16(11), 2023. doi.org
  21. EMA. Reflection paper on the use of AI in the medicinal product lifecycle. EMA/CHMP/CVMP/83833/2023, Sep 2024. ema.europa.eu
  22. FDA. AI/ML in Drug Development — Discussion Paper, 2023. fda.gov
  23. Wilkinson M.D., et al. The FAIR Guiding Principles for scientific data management and stewardship. Scientific Data 3:160018, 2016. doi.org
  24. Yeh I. Unified Under a Universal Language: The Pistoia Alliance’s IDMP Ontology. Bio-IT World, 25 June 2024. bio-itworld.com