A practitioner’s view on why the row-and-column gold record breaks down for medicinal products — and what to build instead. Covers ISO IDMP, IDMP-O, bitemporal graphs, agentic entity resolution, and a deployable reference architecture.
In most industries, “product” is a stable thing. Pharma does not fit that world — the entity is not a thing, it is a trajectory.
Master Data Management was designed for the world of washing machines: a SKU, a price, a manufacturer, and a few attributes that change slowly. Gartner defines it as the discipline that ensures “uniformity, accuracy, stewardship, semantic consistency and accountability of the enterprise’s official shared master data assets.” [1] That world assumes a stable entity with a single identity. Pharma has neither.
The same scientific entity moves through a dozen identity transformations before it reaches a patient. The right answer to any master data question depends on which point in the lifecycle the asker stands at. There is no single gold record because there is no single business question a one-row record would answer correctly.
Each stage is authoritative for its own identifiers. No single row captures all of them correctly at every point in time.
The core insight: A single asset accumulates identifiers across discovery, development, manufacturing, regulatory, supply chain, and commercial systems — each with its own authoritative source and its own update cadence. The entity is not a thing. It is a trajectory. Stop trying to flatten it into a row.
Classical MDM’s four implementation styles — registry, consolidation, coexistence, centralized — all assume the same primitive: a denormalized “golden” row chosen by survivorship rules. When the entity is relational and evolving, they all fail in the same ways.
An MPID maps to multiple PhPIDs (strengths/forms), each PhPID to multiple substances, each substance to multiple specified-substance variants. Forcing this into a flat hub schema either explodes row counts or hides cardinality in a JSON blob the stewardship workflow can’t reason over. ISO IDMP is structurally a network: a graph published as five interlocking standards. [3]
The problem is not that pharma has too many sources of truth — it is that each is authoritative for a different facet at a different time. The LIMS is authoritative for structure. The regulatory system is authoritative for MPID assignment. Survivorship rules cannot capture “authoritative for attribute X only after milestone Y.”
Across the R&D-to-commercial lifecycle, ownership moves one handoff at a time: chemistry → CMC → clinical → regulatory → commercial. Each handoff is itself a piece of master data. Hub MDM has no native primitive for this; it gets encoded as a status column that inevitably drifts from the actual SOP.
An adverse event on a 2018 batch of a 2014 formulation must trace to the specific substance, site, and product information that existed at that time — not the current golden record. Most MDM hubs implement temporality as effective-dated rows. Pharmacovigilance requires bitemporality (when the fact was true × when we knew it), which hub MDM architecturally cannot offer.
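To make the two time axes concrete, here is a minimal Python sketch of a bitemporal lookup. The `BitemporalFact` class, the `as_of` helper, and the formulation and site identifiers are hypothetical, invented for illustration; a real graph store would express the same idea with its own temporal machinery.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class BitemporalFact:
    subject: str
    predicate: str
    value: str
    valid_from: date            # when the fact became true in the world
    valid_to: Optional[date]    # None means "still true"
    recorded_on: date           # when we learned (recorded) the fact


def as_of(facts, subject, predicate, valid_at, known_at):
    """Value that held at `valid_at`, using only what was recorded by `known_at`."""
    candidates = [
        f for f in facts
        if f.subject == subject
        and f.predicate == predicate
        and f.valid_from <= valid_at
        and (f.valid_to is None or valid_at < f.valid_to)
        and f.recorded_on <= known_at
    ]
    if not candidates:
        return None
    # Among everything known at that time, the latest recording wins.
    return max(candidates, key=lambda f: f.recorded_on).value


# Hypothetical data: a site transfer happened in 2017 but was recorded in 2019.
FACTS = [
    BitemporalFact("formulation:F-2014", "manufacturedAt", "site:A",
                   date(2014, 1, 1), None, date(2014, 1, 15)),
    BitemporalFact("formulation:F-2014", "manufacturedAt", "site:B",
                   date(2017, 7, 1), None, date(2019, 2, 1)),
]

# What we believed in 2018 about mid-2018, versus what we know now about mid-2018.
believed_then = as_of(FACTS, "formulation:F-2014", "manufacturedAt",
                      date(2018, 6, 1), date(2018, 12, 31))
known_now = as_of(FACTS, "formulation:F-2014", "manufacturedAt",
                  date(2018, 6, 1), date(2020, 1, 1))
```

Both answers are legitimate: an inspector reconstructing a 2018 decision needs `believed_then`, while a safety investigation needs `known_now`. A single effective-dated row can only hold one of them.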
Industry signal: The Pistoia Alliance’s IDMP-O project, built jointly by 11 pharmaceutical companies (Bayer, Novartis, GSK, Roche, Merck KGaA, Boehringer Ingelheim, J&J, AstraZeneca, Amgen, AbbVie, and Pfizer), exists because relational MDM implementations of ISO IDMP have produced diverging, non-interoperable outputs across jurisdictions. [5] When eleven major pharma companies pool funding to build an ontology rather than another hub, that’s a signal worth reading.
A knowledge graph inverts the MDM primitive. Instead of a denormalized row, the unit of mastering is the triple: (subject, predicate, object). Four properties of the graph model do real work that MDM cannot.
A discovery compound and the marketed substance are linked by an edge — not collapsed into one row. Both identities survive. A pharmacovigilance query in 2030 can resolve the 2014 entity exactly as it stood, because nothing was overwritten.
“Compound X was nominated as development candidate on 2017-06-12 under decision D-2017-44” becomes a typed edge with provenance and an effective interval. The graph encodes the event of transition, which is precisely what ICH E6(R3) audit trails require. [7]
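A rough Python sketch of identity as a chain of typed transition edges follows. The first edge mirrors the nomination example above; the second edge, the `TransitionEdge` class, and the `identity_at` helper are hypothetical. The sketch assumes an acyclic chain with at most one outgoing transition per identity.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class TransitionEdge:
    source: str
    predicate: str
    target: str
    effective_on: date
    decision_ref: str   # provenance: the decision that authorized the transition


EDGES = [
    TransitionEdge("compound:X", "nominatedAs", "candidate:X",
                   date(2017, 6, 12), "D-2017-44"),
    # Hypothetical later transition, added only to show a multi-step chain.
    TransitionEdge("candidate:X", "registeredAs", "substance:X",
                   date(2019, 3, 1), "D-2019-07"),
]


def identity_at(edges, start, on):
    """Follow transitions that were already effective on date `on`."""
    by_source = {e.source: e for e in edges}
    current = start
    while current in by_source and by_source[current].effective_on <= on:
        current = by_source[current].target
    return current


def lineage(edges, start):
    """Every identity the entity has carried, in order. Nothing is overwritten."""
    by_source = {e.source: e for e in edges}
    path = [start]
    while path[-1] in by_source:
        path.append(by_source[path[-1]].target)
    return path
```

Because the earlier identities remain nodes in their own right, a query can ask either "what was this in 2016?" or "what did this become?" without consulting an archive table.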
Each fact is asserted by a named source graph with a version and validity interval. The same property can hold different values for different consumers simultaneously — clinical ops reads the protocol-locked definition while commercial reads the current marketed config, from the same graph, by selecting the appropriate named graph or VALID-AT timestamp.
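As an illustration only, named graphs can be modeled as quads: a triple plus the graph that asserts it. The graph names, product identifier, and strengths below are hypothetical, and `read_value` is a stand-in for what a SPARQL `GRAPH` clause would do in a real triple store.

```python
# Each quad is (named_graph, subject, predicate, object).
QUADS = [
    ("graph:clinical/protocol-lock", "product:P1", "strength", "5 mg"),
    ("graph:commercial/current", "product:P1", "strength", "10 mg"),
    ("graph:commercial/current", "product:P1", "doseForm", "tablet"),
]


def read_value(quads, graph, subject, predicate):
    """Values of (subject, predicate) as asserted by one named graph."""
    return [o for g, s, p, o in quads
            if g == graph and s == subject and p == predicate]


clinical_view = read_value(QUADS, "graph:clinical/protocol-lock",
                           "product:P1", "strength")
commercial_view = read_value(QUADS, "graph:commercial/current",
                             "product:P1", "strength")
```

The two consumers disagree on the value and neither is wrong; the disagreement is data, not a conflict for a survivorship rule to settle.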
Because ontology classes carry formal semantics (subClass, equivalentClass, disjointWith), queries can ask “every marketed product whose active moiety is a JAK inhibitor manufactured at site S” without anyone hard-coding “JAK inhibitor” into a hub attribute. The classification lives in ChEBI or the relevant drug ontology, not the master data system.
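The following sketch shows the reasoning step in miniature: a transitive walk over is-a edges, then a filter on site. The class hierarchy, moiety identifiers, and product records are hypothetical placeholders, not real ChEBI content, and a production system would delegate this to a reasoner or a SPARQL property path rather than hand-rolled code.

```python
# Hypothetical is-a edges: child -> set of direct parents.
SUBCLASS_OF = {
    "moiety:M1": {"class:JAKInhibitor"},
    "moiety:M2": {"class:ProtonPumpInhibitor"},
    "class:JAKInhibitor": {"class:KinaseInhibitor"},
}

PRODUCTS = [
    {"id": "product:P1", "active_moiety": "moiety:M1", "site": "site:S"},
    {"id": "product:P2", "active_moiety": "moiety:M2", "site": "site:S"},
    {"id": "product:P3", "active_moiety": "moiety:M1", "site": "site:T"},
]


def ancestors(cls, subclass_of):
    """All classes reachable from `cls` by following is-a edges upward."""
    seen = set()
    stack = [cls]
    while stack:
        current = stack.pop()
        for parent in subclass_of.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def products_in_class(products, subclass_of, target_class, site):
    """Products made at `site` whose active moiety falls under `target_class`."""
    return [
        p["id"] for p in products
        if p["site"] == site
        and target_class in ancestors(p["active_moiety"], subclass_of)
    ]
```

Note that no product record mentions "JAK inhibitor". If the ontology reclassifies a moiety, the query result changes without touching the product data.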
Industrial evidence: AstraZeneca’s Biological Insights Knowledge Graph (BIKG) contains approximately 10.9 million nodes (22 types) and over 118 million edges (59 types), built from 39+ internal and public datasets for target identification and drug repurposing. [9] Open Targets, Hetionet, and dozens of peer-reviewed KGs document the same pattern across the industry. [10][11]
The right pattern is to let the ontologies become the master, and have what used to be MDM serve as the operational read model for systems that still need a flat view.
| Layer | Ontology / Standard | Role in Product KG | Key IDs |
|---|---|---|---|
| Chemical structure & moiety | ChEBI | Classifies substances by structure and biological role; supports active-moiety reasoning across ingredient ↔ drug class ↔ mechanism | CHEBI:xxxxxx |
| Drug product semantics | DrON | OBO-Foundry drug ontology aligning RxNorm-derived clinical concepts with formal class semantics; bridges chemical and clinical views | DRON:xxxxxxxx |
| Substance identity (regulatory) | UNII / GSRS, ISO 11238 | Globally unique substance identifiers and the substance-definition standard; required for all FDA submissions | UNII: 10-char |
| Clinical drug normalization | RxNorm | NLM-maintained normalized names linking ingredient ↔ strength ↔ dose form ↔ brand; the standard interface to EMR/EHR systems | RxCUI: xxxxxxx |
| Adverse events & indications | MedDRA | ICH-maintained terminology required for regulatory safety reporting globally; maps to FAERS, EudraVigilance, VigiBase | MedDRA PT code |
| Oncology / mechanism / anatomy | NCIt | NCI Enterprise Vocabulary used widely in regulatory submissions; covers drug, disease, anatomy, and procedure hierarchies | C xxxxxxx |
| End-to-end product identity | IDMP-O (Pistoia) | The integrating layer: realizes ISO 11238/11239/11240/11615/11616/21090 in OWL; engineered to bridge chemistry ontologies upstream and regulatory identifiers downstream [5] | MPID / PhPID |
What makes this stack work — rather than just a list — is alignment: each ontology declares equivalent classes and properties where they overlap, so a SPARQL query about an “active moiety” can resolve across ChEBI, UNII, and RxNorm consistently.
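One simplified way to picture alignment is as an equivalence closure over identifiers from different vocabularies. The sketch below uses a union-find structure; the `Alignment` class and the placeholder identifiers are hypothetical, and real OWL `equivalentClass` semantics are richer than plain transitive equivalence.

```python
class Alignment:
    """Union-find over identifiers declared equivalent across vocabularies."""

    def __init__(self):
        self._parent = {}

    def _find(self, x):
        self._parent.setdefault(x, x)
        while self._parent[x] != x:
            # Path halving keeps lookups cheap as the alignment grows.
            self._parent[x] = self._parent[self._parent[x]]
            x = self._parent[x]
        return x

    def declare_equivalent(self, a, b):
        root_a, root_b = self._find(a), self._find(b)
        if root_a != root_b:
            self._parent[root_b] = root_a

    def same_entity(self, a, b):
        return self._find(a) == self._find(b)

    def equivalents(self, x):
        root = self._find(x)
        return sorted(k for k in list(self._parent) if self._find(k) == root)


# Placeholder identifiers, not real registry entries.
alignment = Alignment()
alignment.declare_equivalent("chebi:EXAMPLE-1", "unii:EXAMPLE-1")
alignment.declare_equivalent("unii:EXAMPLE-1", "rxcui:EXAMPLE-1")
```

Nobody declared the ChEBI and RxNorm placeholders equivalent directly; the link follows from the two declared alignments, which is the property the stack relies on.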
The MDM-style flat record doesn’t disappear in this architecture — it becomes a materialized projection of the graph, generated for systems that need a denormalized API. Crucially, the projection is derivable and reproducible from a version of the graph, which means data lineage is intrinsic rather than bolted on.
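A minimal sketch of such a projection, assuming the graph is available as a set of triples: the flat row is a pure function of the snapshot, and it carries a hash of the snapshot it was derived from. The predicate names, product identifiers, and the `snapshot_id` and `project_flat` helpers are hypothetical.

```python
import hashlib
import json


def snapshot_id(triples):
    """Short content hash identifying the graph version a row was derived from."""
    canonical = json.dumps(sorted(triples), ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


def project_flat(triples, subject):
    """Denormalized row for one subject; multi-valued predicates become lists."""
    row = {"id": subject, "graph_snapshot": snapshot_id(triples)}
    for s, p, o in sorted(triples):   # sorted so the output is deterministic
        if s == subject:
            row.setdefault(p, []).append(o)
    return row


TRIPLES = {
    ("product:P1", "hasStrength", "5 mg"),
    ("product:P1", "hasStrength", "10 mg"),
    ("product:P1", "hasDoseForm", "tablet"),
    ("product:P2", "hasDoseForm", "capsule"),
}

row = project_flat(TRIPLES, "product:P1")
```

Running the projection twice against the same snapshot yields the same row and the same `graph_snapshot` value, so any flat record can be traced back to, and regenerated from, the graph version that produced it.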
The shape that emerges in practice spans discovery, development, regulatory, and commercial. Three architectural choices distinguish it from “another KG on top of the data lake.”
Ontology releases are versioned (SemVer) and the graph carries usedOntologyVersion provenance on every assertion. A MedDRA v27.0 → v27.1 upgrade does not silently change historical safety queries — it produces a new named graph the consumer can opt into.
Discovery, clinical, regulatory, and commercial facts each live in their own named graph with explicit cross-graph edges. This makes “authoritative for attribute X only after milestone Y” expressible declaratively — not as a status column.
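To show what "declaratively" could mean here, the sketch below keeps authority as a small rule table rather than a status column. The attribute names, graph names, milestone names, and the tie-breaking policy (the rule tied to the latest reached milestone wins) are all assumptions made for illustration.

```python
# Assumed lifecycle order; later milestones take precedence.
MILESTONE_ORDER = [
    "candidate_nomination",
    "protocol_lock",
    "marketing_authorization",
    "first_commercial_launch",
]

# (attribute, authoritative graph, milestone that must be reached, or None).
AUTHORITY_RULES = [
    ("structure", "graph:discovery", None),
    ("mpid", "graph:regulatory", "marketing_authorization"),
    ("strength", "graph:clinical", "protocol_lock"),
    ("strength", "graph:commercial", "first_commercial_launch"),
]


def authoritative_graph(attribute, reached):
    """Graph that is authoritative for `attribute` given the reached milestones."""
    best, best_rank = None, -2
    for attr, graph, milestone in AUTHORITY_RULES:
        if attr != attribute:
            continue
        if milestone is None:
            rank = -1                       # unconditional rule, lowest precedence
        elif milestone in reached:
            rank = MILESTONE_ORDER.index(milestone)
        else:
            continue                        # rule not yet in force
        if rank > best_rank:
            best, best_rank = graph, rank
    return best
```

Changing who owns an attribute then means editing a rule, which can itself be versioned and reviewed like any other piece of master data.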
IDMP submission XML, the safety data warehouse, the commercial product hierarchy — all generated from the graph by query, all reproducible from a graph snapshot. The MDM hub, where one still exists, becomes a cache.
Foundation models change the economics of entity resolution. Three peer-reviewed results carry the load — and the workable pattern is agentic first-pass with structured handoff.
Research foundation: Narayan et al. (Stanford, PVLDB 2022) showed that foundation models match or exceed task-specific SOTA on five data wrangling tasks, including entity matching and schema matching, when accessed only via in-context learning. [18] Peeters & Bizer (Mannheim, 2023) confirmed that ChatGPT reaches fine-tuned model performance on entity-matching benchmarks without any task-specific training data. [19] Fernandez et al. (Chicago, PVLDB 2023) grounded the broader thesis: LLMs have “unprecedented ability to ground database tuples, schemas, and queries in real-world concepts.” [20]
1. The agent extracts entities from ELN, LIMS, and regulatory sources, with structural identifiers and raw text context.
2. The LLM proposes candidate matches with calibrated confidence and grounded citations (ontology class, regulatory record, structural ID).
3. Proposals above the domain threshold are auto-asserted into a staging named graph, with provenance pointing to the agent run and model version.
4. Low-confidence or policy-sensitive proposals (marketed products, registered substances, safety signals) are routed to stewards with the agent’s reasoning attached.
5. Steward decisions feed back into the agent’s few-shot context for the next batch, a closed loop that compounds quality over time.
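The routing decision in the middle of that loop can be sketched in a few lines of Python. Everything named here is an assumption: the `MatchProposal` shape, the entity-kind labels, the 0.95 default threshold, and the extra rule that proposals without evidence always go to a steward.

```python
from dataclasses import dataclass

# Entity kinds that always require human review, per the policy described above.
POLICY_SENSITIVE = {"marketed_product", "registered_substance", "safety_signal"}


@dataclass(frozen=True)
class MatchProposal:
    left_id: str
    right_id: str
    confidence: float        # assumed to be calibrated upstream
    entity_kind: str
    evidence: tuple = ()     # grounded citations backing the match


def route(proposal, threshold=0.95):
    """Decide where a proposed match goes next."""
    if proposal.entity_kind in POLICY_SENSITIVE:
        return "steward_review"          # never auto-assert on sensitive kinds
    if not proposal.evidence:
        return "steward_review"          # an ungrounded match is not trusted
    if proposal.confidence >= threshold:
        return "auto_assert_staging"
    return "steward_review"


confident = MatchProposal("eln:C-1", "lims:S-9", 0.98,
                          "discovery_compound", ("structural ID match",))
marketed = MatchProposal("reg:R-4", "erp:M-2", 0.99,
                         "marketed_product", ("regulatory record",))
```

The threshold would in practice be set per domain and revisited as steward override rates come in; the point of the sketch is only that sensitivity and grounding are checked before confidence.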
The non-negotiable constraint is provenance on every asserted triple. Without it, the system fails inspection. With it, the agent becomes auditable: every assertion has a who (agent run + model version), a why (evidence), and a when (assertedAt timestamp). Any triple can be retracted without disturbing unrelated facts. This is something MDM survivorship architecturally cannot offer.
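A small sketch of the who/why/when requirement and of targeted retraction, under stated assumptions: `ProvenanceStore`, its method names, and the run and model labels are hypothetical, and retraction is modeled by marking an agent run as withdrawn rather than deleting history.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Assertion:
    triple: tuple
    agent_run: str          # who: the agent run that produced it
    model_version: str      # who: the model behind that run
    evidence: str           # why
    asserted_at: datetime   # when


class ProvenanceStore:
    def __init__(self):
        self._assertions = []
        self._retracted_runs = set()

    def add(self, triple, agent_run, model_version, evidence):
        """Reject any triple that arrives without complete provenance."""
        if not (agent_run and model_version and evidence):
            raise ValueError("triple rejected: provenance is incomplete")
        assertion = Assertion(triple, agent_run, model_version, evidence,
                              datetime.now(timezone.utc))
        self._assertions.append(assertion)
        return assertion

    def retract_run(self, agent_run):
        """Withdraw everything one run asserted; returns how many triples."""
        self._retracted_runs.add(agent_run)
        return sum(1 for a in self._assertions if a.agent_run == agent_run)

    def triples(self):
        """Currently asserted triples (retracted runs excluded)."""
        return [a.triple for a in self._assertions
                if a.agent_run not in self._retracted_runs]

    def history(self):
        """Full audit trail, including retracted assertions."""
        return list(self._assertions)
```

If a model version turns out to be faulty, `retract_run` removes only what that run asserted; facts from other runs are untouched and the audit trail still shows what was once believed.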
Two regulatory currents are converging on the same conclusion: product data needs to be machine-processable, traceable, and FAIR. The harder half of “AI-readiness” is data-side, not model-side.
EMA reflection paper on AI in the medicinal product lifecycle: adopted by CHMP/CVMP in September 2024. Covers AI use across discovery, non-clinical, clinical trials, manufacturing, and post-authorization. Explicitly emphasizes risk management of data quality, traceability, and human oversight throughout the lifecycle. [21]
FDA’s 2023 counterpart to the EMA reflection paper. Sets expectations for transparency, validation, and data provenance in AI-assisted drug development workflows. Presupposes well-mastered, traceable source data. [22]
ICH E6(R3): adopted January 2025, effective at EMA from July 2025. Modernizes GCP for decentralized trials, electronic sources, and quality-by-design, all of which presuppose well-mastered protocol and product metadata. [7]
IDMP as a query, not a project: If your product mastering is a row-shaped hub, generating IDMP-conformant XML at the cardinality and bitemporality the standard demands is a recurring engineering project, every submission. If your product mastering is graph-shaped and aligned to IDMP-O, IDMP becomes a SPARQL query. The same is true for pharmacovigilance line listings, label content management, and structured product labeling.
“AI-readiness” is often discussed as a model-side concern (GPUs, MLOps, vector DBs). The harder, less glamorous half is data-side: an LLM that has to ground “is this the same product as the one in the 2019 BLA” in a brittle MDM gold record will hallucinate. One that can query a typed, versioned, provenance-rich knowledge graph won’t have to. The FAIR Guiding Principles [23] are, almost word-for-word, a description of a properly governed knowledge graph: globally unique persistent identifiers, retrieval by standardized protocol, formal knowledge representation, rich provenance.
A staged path that has survived contact with reality. Workable budget for phases 0–2: a core team of 4–6 engineers, 1–2 ontologists, 1–2 stewards per domain, and 9–12 months.
The most common failure mode is not technical — it’s treating the ontology as an IT artifact instead of a regulated piece of master data with its own governance. Fund the ontologists.
Start with one lifecycle transition: candidate nomination → IND filing, or marketing authorization → first commercial launch. Both are bounded enough to ship in a quarter and consequential enough to fund the next phase. Avoid the “we’ll boil the ocean” KG that maps every system on day one.
Adopt IDMP-O as the spine; pull in ChEBI, RxNorm, MedDRA, NCIt, and UNII as needed for the chosen transition. Define internal extensions explicitly in a versioned namespace you control, with documented alignments to public ontologies.
Connect two source systems: one discovery-side (LIMS or ELN) and one regulatory-side (the system holding MPID assignments). Build the agentic mastering loop with human-in-the-loop review on everything that touches a regulatory identifier. Measure three things: the percentage of triples with grounded provenance, the steward override rate, and the round-trip latency from source change to graph assertion.
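The three measures are simple enough to compute from logs the mastering loop already produces. The sketch below assumes hypothetical input shapes (dicts with an `evidence` key, pairs of agent and steward decisions, pairs of timestamps) and uses the median for latency, which is a choice rather than a requirement.

```python
from datetime import datetime
from statistics import median


def pct_grounded(triples):
    """Percentage of triples carrying non-empty evidence."""
    if not triples:
        return 0.0
    grounded = sum(1 for t in triples if t.get("evidence"))
    return 100.0 * grounded / len(triples)


def override_rate(reviews):
    """Share of reviewed proposals where the steward disagreed with the agent."""
    if not reviews:
        return 0.0
    overridden = sum(1 for agent, steward in reviews if agent != steward)
    return overridden / len(reviews)


def median_latency_seconds(events):
    """Median seconds between a source change and its graph assertion."""
    return median([(asserted - changed).total_seconds()
                   for changed, asserted in events])


# Hypothetical sample: three source changes and when each reached the graph.
EVENTS = [
    (datetime(2025, 1, 1, 9, 0, 0), datetime(2025, 1, 1, 9, 1, 0)),
    (datetime(2025, 1, 1, 9, 0, 0), datetime(2025, 1, 1, 9, 2, 0)),
    (datetime(2025, 1, 1, 9, 0, 0), datetime(2025, 1, 1, 9, 5, 0)),
]
```

A falling override rate alongside a stable grounded percentage is the signal that the feedback loop is working; a falling override rate with falling grounding suggests stewards are rubber-stamping.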
Pick one downstream output (IDMP submission XML, a pharmacovigilance line listing, or a commercial product hierarchy) and replace its existing pipeline with a query against the graph. This is the moment that proves the architecture commercially and funds the remaining phases.
Existing hubs do not need to die. They become projections, refreshed from the graph, serving systems that still want a flat API. New systems read the graph directly. The hub’s stewardship workflows retire one domain at a time.
Each subsequent lifecycle transition adds a named graph and a small number of typed edges. The cost of adding the n+1th transition is sublinear in n — the architectural payoff that hub-and-spoke MDM never delivered. Scale the agentic mastering loop across all source domains.
The interesting question is no longer whether graphs will replace traditional MDM for pharma product mastering — the regulators, the standards bodies, and eleven of the top twenty pharma companies have already answered that with IDMP-O. The interesting question is whether your organization will treat the next IDMP wave, the next AI regulation, and the next R&D-to-commercial handoff as another point fix to the hub — or as the moment to invest in the architecture the next decade of pharma data already presupposes.
The trajectory is not a row. Stop trying to flatten it.