Case Study · Agentic AI · Ontology Governance

Production Agentic AI for Ontology Curation in Pharmaceutical R&D

How a four-agent LangGraph pipeline automates biomedical ontology curation — extracting entities, mapping to BioPortal and OLS4, detecting conflicts, and routing governance decisions — while keeping human stewards in the loop on every ambiguous call.

12-minute read · Pharma R&D · Knowledge Graphs · AI Governance · LangGraph · BioPortal · OLS4 · AWS

70% · Reduction in manual curation time
4 · Specialized agents in the pipeline
100% · Audit trail coverage on all decisions
<2s · Per-entity routing latency

01 — The Problem

Ontology Curation: A Hidden Bottleneck in Pharma R&D

Biomedical ontologies are the semantic backbone of AI-ready data. But keeping them current is one of the most time-intensive, error-prone workflows in pharmaceutical research data management.

In any large pharmaceutical R&D organization, thousands of data entities flow daily through laboratory systems — ELN entries, LIMS assay results, compound registrations, cell therapy readouts — each requiring precise ontological grounding before they can power analytics, AI models, or regulatory submissions.

The traditional approach: a small team of ontology curators manually reviews each new term request, queries reference ontologies and databases (ChEBI, GO, MeSH, UniProt), resolves naming conflicts between data systems, and routes ambiguous cases to senior domain experts. This process is slow, under-documented, and fundamentally unscalable as the volume of digital data in pharma grows.

Three core problems emerge at scale:

Throughput Bottleneck

Curator capacity limits how fast new terms can be onboarded. As data volumes scale, the backlog grows — delaying AI pipelines that depend on clean, harmonized ontology mappings.

Inconsistent Mapping Quality

Different curators apply subtly different rules for term selection, synonym resolution, and cross-ontology disambiguation — creating a fragmented semantic layer that undermines data interoperability.

Audit Gap

Manual curation leaves limited audit trails. In regulated environments where data lineage and governance approval records are required, the inability to trace every curation decision creates compliance risk.

02 — Constraints & Requirements

What the System Had to Do (and Not Do)

Any production governance system in pharma must satisfy hard requirements around traceability, human oversight, and alignment with existing stewardship workflows.

Human in the Loop — Always

The system must never make irreversible ontology decisions autonomously. All novel mappings and conflict resolutions require a steward's explicit approval before writing to the master ontology.

Full Audit Trail

Every automated action — entity extraction, candidate ranking, conflict flag, routing decision — must be logged with timestamps, confidence scores, and rationale for regulatory and governance review.

Deterministic Routing Rules

Routing logic (auto-approve vs. human review vs. SME escalation) must be policy-driven and inspectable — not black-box. Rules must be editable by governance leads without code changes.

Integration with Reference Ontologies

The system must query the live BioPortal and OLS4 APIs to retrieve up-to-date term candidates, and fall back to UMLS and ChEBI when the primary sources lack coverage.

Explainable Confidence Scores

Every term mapping must include a human-readable rationale: why this candidate was ranked first, what alternatives were considered, and what evidence (label match, synonym, cross-reference) drove the score.

Preserve Existing Vocabulary

The system must treat the existing curated vocabulary as ground truth. New mappings must not silently overwrite approved terms — only flag inconsistencies for steward review.
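The reference-ontology requirement above (primary sources first, fallback sources only when coverage is missing) can be sketched as a small lookup chain. This is a minimal illustration, not the production client: the `Fetcher` type, the `query_with_fallback` function, and the stub fetchers are hypothetical names, and real fetchers would wrap HTTP calls to BioPortal, OLS4, UMLS, and ChEBI. The sketch also assumes that a source which raises an error is treated as having no coverage, with the error recorded for the audit trail.

```python
from typing import Callable, Dict, List

# A fetcher takes a label and returns candidate term records (possibly empty).
# In a real deployment each fetcher would wrap an HTTP client for one source.
Fetcher = Callable[[str], List[dict]]


def query_with_fallback(
    label: str,
    primary: Dict[str, Fetcher],
    fallback: Dict[str, Fetcher],
) -> dict:
    """Query primary sources; consult fallback sources only if none return hits."""
    errors: Dict[str, str] = {}
    for tier, sources in (("primary", primary), ("fallback", fallback)):
        candidates: List[dict] = []
        for name, fetch in sources.items():
            try:
                candidates.extend({**hit, "source": name} for hit in fetch(label))
            except Exception as exc:  # assumption: a failing source counts as "no coverage"
                errors[name] = str(exc)
        if candidates:
            return {"tier": tier, "candidates": candidates, "errors": errors}
    return {"tier": None, "candidates": [], "errors": errors}


# Usage with stub fetchers standing in for real API clients (hypothetical data).
result = query_with_fallback(
    "example compound",
    primary={"bioportal": lambda label: [], "ols4": lambda label: []},
    fallback={
        "umls": lambda label: [],
        "chebi": lambda label: [{"id": "EX:1", "label": label}],
    },
)
print(result["tier"], [c["source"] for c in result["candidates"]])
```

Tagging every candidate with its source keeps the evidence needed for the explainable-score requirement, and an empty result with `tier` set to `None` corresponds to the novel-entity case.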

03 — System Architecture

Four Layers, One Governed Pipeline

The system is structured as a four-layer stack (data ingestion, agent orchestration, reference ontology APIs, and a governance approval layer that gates every write operation) that feeds a knowledge graph of approved terms.

Ingestion: ELN, LIMS, and MES events trigger entity extraction requests via an event-driven AWS SQS queue. Structured and unstructured text payloads are normalized before agent processing.
AWS SQS · ELN · LIMS · MES

Agents: Four specialized LangGraph agents run in a directed acyclic graph: EntityExtractor → OntologyMapper → ConflictDetector → GovernanceRouter. State is passed between agents via a shared context object.
LangGraph · GPT-4o · Python · State Graph

References: Mapping candidates are sourced from the BioPortal REST API, OLS4, the UMLS Metathesaurus, and ChEBI. Results are ranked by a composite score: label similarity + semantic type match + cross-reference coverage.
BioPortal · OLS4 · UMLS · ChEBI

Governance: High-confidence mappings enter an async approval queue in the data catalog. Stewards receive contextualized review packets (term, evidence, alternatives, conflicts) and approve, reject, or escalate via a lightweight UI.
Approval Queue · Steward UI · Audit Log · DCAT

Knowledge Graph: Approved terms are written to a graph store (AWS Neptune) as RDF triples, linking entities to their canonical ontology terms, synonyms, source systems, and approval provenance.
AWS Neptune · RDF · SPARQL · OWL
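The Agents layer above passes one shared state object through four stages in a fixed order. The sketch below shows that pattern in plain Python as a stand-in for the LangGraph state graph; it is not the LangGraph API, and the `PipelineState` fields, the `run_pipeline` helper, and the stub stage bodies are all assumptions made for illustration.

```python
from typing import Callable, List, TypedDict


class PipelineState(TypedDict, total=False):
    raw_text: str
    source: str
    entities: List[str]
    candidates: List[dict]
    conflicts: List[dict]
    route: str
    trace: List[str]  # names of the stages that have run, in order


Stage = Callable[[PipelineState], dict]


def entity_extractor(state: PipelineState) -> dict:
    # Stub: treat all-uppercase tokens as entity mentions.
    return {"entities": [tok for tok in state["raw_text"].split() if tok.isupper()]}


def ontology_mapper(state: PipelineState) -> dict:
    # Stub: no real lookup, so every entity gets an unmapped candidate.
    return {"candidates": [{"entity": e, "term_id": None, "score": 0.0} for e in state["entities"]]}


def conflict_detector(state: PipelineState) -> dict:
    return {"conflicts": []}


def governance_router(state: PipelineState) -> dict:
    # Stub: default to human review, matching the "when in doubt, ask" bias.
    return {"route": "STEWARD_REVIEW"}


STAGES: List[Stage] = [entity_extractor, ontology_mapper, conflict_detector, governance_router]


def run_pipeline(state: PipelineState, stages: List[Stage]) -> PipelineState:
    """Run stages in order; each returns a partial update merged into the state."""
    for stage in stages:
        update = stage(state)
        trace = state.get("trace", []) + [stage.__name__]
        state = {**state, **update, "trace": trace}
    return state


final = run_pipeline({"raw_text": "Measured TNF in plasma", "source": "ELN"}, STAGES)
print(final["trace"], final["route"])
```

Having each stage return only the keys it owns keeps responsibilities separate and makes the accumulated `trace` a natural input for audit logging.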
04 — Multi-Agent Design

Four Agents, One Shared Mission

Each agent has a single, well-bounded responsibility. No agent writes to the master ontology without passing through the GovernanceRouter.

Agent 01
EntityExtractorAgent

Parses raw text from lab records, assay descriptions, and compound registration forms. Uses a fine-tuned NER model + GPT-4o to identify candidate biomedical entities, normalize abbreviations, and resolve acronyms using local context. Outputs a structured JSON list of entity mentions with source provenance.

Input: raw_text: str · source: ELN | LIMS | MES
Output: entities: List[Entity] · confidence: float
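To make the extractor's output contract concrete, here is a minimal sketch of an entity record with source provenance, serialized as JSON. The `Entity` fields and the `normalize_mentions` helper are hypothetical, and a glossary lookup stands in for the fine-tuned NER model and GPT-4o step described above; it only illustrates the shape of the output.

```python
import json
import re
from dataclasses import asdict, dataclass
from typing import Dict, List


@dataclass
class Entity:
    mention: str      # text as it appears in the record
    normalized: str   # expanded form after abbreviation resolution
    start: int        # character offsets into raw_text
    end: int
    source: str       # ELN | LIMS | MES
    record_id: str    # hypothetical identifier of the originating record


def normalize_mentions(
    raw_text: str, source: str, record_id: str, abbreviations: Dict[str, str]
) -> List[Entity]:
    """Find known abbreviations in the text and attach their expansions."""
    entities = []
    for abbr, expansion in abbreviations.items():
        for match in re.finditer(rf"\b{re.escape(abbr)}\b", raw_text):
            entities.append(
                Entity(abbr, expansion, match.start(), match.end(), source, record_id)
            )
    return sorted(entities, key=lambda e: e.start)


glossary = {"TNF": "tumor necrosis factor", "IL-6": "interleukin-6"}
found = normalize_mentions("IL-6 and TNF measured", "ELN", "rec-001", glossary)
print(json.dumps([asdict(e) for e in found], indent=2))
```

Keeping character offsets and the record identifier on every mention lets a steward jump from a review packet back to the exact passage that produced it.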
Agent 02
OntologyMapperAgent

Takes each extracted entity and queries BioPortal, OLS4, and UMLS to retrieve top-5 candidate mappings. Scores candidates using a composite metric: TF-IDF label similarity, semantic type alignment, cross-reference coverage, and usage frequency in the existing vocabulary. Returns a ranked list with rationale for each candidate.

Input: entity: Entity · ontology_scope: List[str]
Output: candidates: List[Mapping] · top_score: float
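The composite metric described above can be sketched as a weighted sum whose per-component values double as the rationale. The weights, the normalization caps, and the `Candidate` fields below are assumptions, and `difflib.SequenceMatcher` is swapped in for TF-IDF label similarity to keep the example dependency-free.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Dict, List, Tuple

# Hypothetical weights; in practice these would be tuned and governed.
WEIGHTS = {"label": 0.5, "semantic_type": 0.2, "xref": 0.2, "usage": 0.1}


@dataclass
class Candidate:
    term_id: str
    label: str
    semantic_type: str
    xref_count: int   # cross-references to other ontologies
    usage_count: int  # occurrences in the existing vocabulary


def score_candidate(
    entity_label: str, expected_type: str, cand: Candidate,
    max_xrefs: int = 10, max_usage: int = 100,
) -> Tuple[float, Dict[str, float]]:
    """Return (composite score, per-component values used as the rationale)."""
    parts = {
        # String similarity as a stand-in for TF-IDF label similarity.
        "label": SequenceMatcher(None, entity_label.lower(), cand.label.lower()).ratio(),
        "semantic_type": 1.0 if cand.semantic_type == expected_type else 0.0,
        "xref": min(cand.xref_count, max_xrefs) / max_xrefs,
        "usage": min(cand.usage_count, max_usage) / max_usage,
    }
    total = sum(WEIGHTS[name] * value for name, value in parts.items())
    return round(total, 4), parts


def rank(entity_label: str, expected_type: str, candidates: List[Candidate], top_k: int = 5):
    """Score all candidates and return the top_k as (candidate, score, parts)."""
    scored = [(c, *score_candidate(entity_label, expected_type, c)) for c in candidates]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]
```

Returning the component values alongside the total is what allows a review packet to say why a candidate ranked first, for example a strong label match but weak cross-reference coverage.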
Agent 03
ConflictDetectorAgent

Checks the proposed mapping against the current master vocabulary for four conflict types: (1) synonym collision with a different canonical term, (2) polysemy ambiguity across ontologies, (3) deprecated term usage, (4) cross-domain semantic drift (same label, different meaning in chemistry vs. biology contexts). Flags and categorizes each conflict with severity.

Input: mapping: Mapping · vocab: MasterVocab
Output: conflicts: List[Conflict] · severity: LOW | MED | HIGH
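Two of the four checks described above, synonym collision and deprecated term usage, reduce to lookups against the master vocabulary. The following is a minimal sketch under stated assumptions: the vocabulary is represented as an in-memory synonym index plus a set of deprecated term IDs, and the `Conflict` fields and function names are hypothetical. Polysemy and cross-domain drift need richer context and are not covered here.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Set


@dataclass
class Conflict:
    kind: str
    severity: str  # LOW | MED | HIGH
    detail: str


def detect_conflicts(
    label: str,
    proposed_term_id: str,
    synonym_index: Dict[str, str],  # lowercase synonym -> canonical term ID
    deprecated: Set[str],           # term IDs marked deprecated
) -> List[Conflict]:
    conflicts: List[Conflict] = []
    existing = synonym_index.get(label.lower())
    if existing is not None and existing != proposed_term_id:
        conflicts.append(Conflict(
            "SYNONYM_COLLISION", "HIGH",
            f"'{label}' already maps to {existing}, proposed {proposed_term_id}",
        ))
    if proposed_term_id in deprecated:
        conflicts.append(Conflict(
            "DEPRECATED_TERM", "MED", f"{proposed_term_id} is deprecated",
        ))
    return conflicts


def overall_severity(conflicts: List[Conflict]) -> Optional[str]:
    """Highest severity among the conflicts, or None for a clean mapping."""
    order = {"LOW": 0, "MED": 1, "HIGH": 2}
    return max((c.severity for c in conflicts), key=order.__getitem__, default=None)
```

Because the function only reports conflicts and never modifies the index, it is consistent with the requirement that approved terms are flagged for review rather than silently overwritten.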
Agent 04
GovernanceRouterAgent

The final gatekeeper. Applies deterministic policy rules to decide one of three outcomes: AUTO_APPROVE (high confidence, no conflicts, entity type in approved auto-route list), STEWARD_REVIEW (confidence ≥ threshold but has minor conflicts or novel entity type), or SME_ESCALATION (low confidence, high-severity conflict, or cross-domain ambiguity). All decisions are logged to the audit store.

Input: mapping: Mapping · conflicts: List[Conflict]
Output: route: AUTO | REVIEW | SME · audit_record: AuditLog
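An audit record for a routing decision needs, at minimum, the timestamp, confidence score, and rationale called for in the requirements. Here is a minimal sketch of such a record written as one JSON line per decision; the `AuditRecord` fields and the `append_audit` helper are hypothetical, and a Python list stands in for the audit store.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import List


@dataclass(frozen=True)  # frozen: a record cannot be mutated after creation
class AuditRecord:
    entity: str
    action: str        # e.g. "routing_decision"
    route: str         # AUTO_APPROVE | STEWARD_REVIEW | SME_ESCALATION
    confidence: float
    rationale: str
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


def append_audit(log: List[str], record: AuditRecord) -> str:
    """Serialize the record as a JSON line and append it to the log."""
    line = json.dumps(asdict(record), sort_keys=True)
    log.append(line)
    return line


audit_log: List[str] = []
append_audit(audit_log, AuditRecord(
    entity="TNF",
    action="routing_decision",
    route="STEWARD_REVIEW",
    confidence=0.81,
    rationale="score in review band; one MED conflict (deprecated term)",
))
print(audit_log[0])
```

Append-only JSON lines are easy to query later, which matters once the log is used for analysis as well as compliance.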
05 — Conflict Detection

Four Conflict Types, One Taxonomy

Consistent conflict classification is what turns automated curation from a risky black box into a trusted governance tool. Every conflict type has a defined severity and default routing action.

Synonym Collision · Severity: HIGH · Default routing: SME Escalation

The proposed synonym already exists in the vocabulary but maps to a different canonical term. Indicates either a naming inconsistency in the source data or a genuine semantic distinction that requires expert adjudication.

Polysemy Ambiguity · Severity: MEDIUM · Default routing: Steward Review

The entity label matches multiple valid terms across different ontologies with similar scores. Common with shared scientific vocabulary used differently in chemistry vs. biology domains.

Deprecated Term · Severity: MEDIUM · Default routing: Steward Review

The top-ranked mapping is a deprecated ontology term. The system identifies the preferred successor term and flags the deprecation with the reason, but routes to steward review before updating the vocabulary.

Cross-Domain Semantic Drift · Severity: HIGH · Default routing: SME Escalation

The same label is used with distinct meanings in different scientific domains already present in the vocabulary (e.g., "activation" in immunology vs. chemistry). Requires domain-scoped disambiguation by an SME.

Clean Mapping · Severity: NONE · Default routing: Auto-Approve

No conflicts detected. The entity has a unique, high-confidence mapping with no overlap with existing vocabulary. If the entity type is in the auto-approve list and score exceeds threshold, the GovernanceRouter issues AUTO_APPROVE.

Novel Entity · Severity: LOW–MED · Default routing: Steward Review

No matching term found in any queried ontology. The system generates a structured draft term proposal (with suggested label, definition template, and parent class) and routes it for new term request submission.

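The taxonomy above is essentially a lookup table from conflict type to severity and default routing action. A minimal sketch follows; the identifier spellings, the `TAXONOMY` structure, and the rule that the strictest default wins when several conflicts apply are assumptions, since the source does not spell out how multiple conflicts combine.

```python
from typing import Iterable

# (severity, default route) per conflict type, as listed in the taxonomy.
TAXONOMY = {
    "SYNONYM_COLLISION": ("HIGH", "SME_ESCALATION"),
    "POLYSEMY_AMBIGUITY": ("MEDIUM", "STEWARD_REVIEW"),
    "DEPRECATED_TERM": ("MEDIUM", "STEWARD_REVIEW"),
    "CROSS_DOMAIN_DRIFT": ("HIGH", "SME_ESCALATION"),
    "NOVEL_ENTITY": ("LOW-MED", "STEWARD_REVIEW"),
}

# Routes ordered from least to most human involvement.
ROUTE_STRICTNESS = ["AUTO_APPROVE", "STEWARD_REVIEW", "SME_ESCALATION"]


def default_route(conflict_kinds: Iterable[str]) -> str:
    """Strictest default route across all conflicts.

    With no conflicts (a clean mapping) the default is AUTO_APPROVE, which the
    router still gates on score threshold and approved entity type.
    """
    route = "AUTO_APPROVE"
    for kind in conflict_kinds:
        _, candidate = TAXONOMY[kind]
        if ROUTE_STRICTNESS.index(candidate) > ROUTE_STRICTNESS.index(route):
            route = candidate
    return route


print(default_route(["DEPRECATED_TERM", "SYNONYM_COLLISION"]))
```

An unknown conflict type raises a `KeyError` rather than falling through to a default, so a new type cannot be routed until someone has assigned it a severity and action.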
06 — Human Routing

Governance Without Gatekeeping

The hardest design challenge wasn't the AI — it was designing the human layer so that governance stays rigorous without becoming a bottleneck.

The GovernanceRouter applies a three-tier decision framework driven by configurable policy rules stored as YAML — editable by governance leads without code changes. The framework is intentionally biased toward human review: when in doubt, the system asks rather than acts.

Tier 1 — Auto-Approve: Reserved for high-confidence, conflict-free mappings of entity types that governance leads have pre-approved for automation (e.g., standard chemical compound classes with SMILES-based identity checks). Logged with full evidence bundle but no human action required.

Tier 2 — Steward Review: The most common path. A data steward receives a review packet containing: proposed mapping, confidence score, ranked alternatives, conflict summary, and source evidence. They can approve, reject, or promote to SME escalation with a single click and an optional free-text note.

Tier 3 — SME Escalation: Reserved for high-severity conflicts or novel entity types that cross domain boundaries. The review packet is enriched with full semantic context from the knowledge graph — related terms, upstream data consumers, downstream AI pipeline dependencies — to give the SME the full picture before deciding.

Auto-Approve (~35% of volume)

Score ≥ 0.92 · No conflicts · Entity type in approved list · Approved in < 2 seconds. Full evidence logged to audit store without human action.

Steward Review (~55% of volume)

Score 0.75–0.91 · Minor conflicts or novel entity type · Steward receives contextualized review packet · 1-click approve / reject / escalate.

SME Escalation (~10% of volume)

Score < 0.75 · High-severity conflict · Cross-domain ambiguity · SME receives full knowledge graph context packet for adjudication.

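The three tiers above can be expressed as a small, deterministic function over a policy object. In this sketch the policy is a plain dictionary, shaped as it might look after being loaded from the YAML file; the key names, the example entity type, and the `route` signature are assumptions, while the 0.92 and 0.75 thresholds come from the tiers above.

```python
from typing import Optional

# Hypothetical policy, as it might look once parsed from YAML.
POLICY = {
    "auto_approve_min_score": 0.92,
    "steward_min_score": 0.75,
    "auto_approve_entity_types": ["chemical_compound"],
}


def route(
    score: float,
    max_severity: Optional[str],  # None, "LOW", "MED", or "HIGH"
    entity_type: str,
    policy: dict = POLICY,
) -> str:
    """Apply the three-tier policy; anything not clearly automatable goes to a human."""
    # Tier 3: low confidence or any high-severity conflict.
    if score < policy["steward_min_score"] or max_severity == "HIGH":
        return "SME_ESCALATION"
    # Tier 1: high confidence, no conflicts, pre-approved entity type.
    if (
        score >= policy["auto_approve_min_score"]
        and max_severity is None
        and entity_type in policy["auto_approve_entity_types"]
    ):
        return "AUTO_APPROVE"
    # Tier 2: everything else, including high scores with minor conflicts.
    return "STEWARD_REVIEW"


print(route(0.95, None, "chemical_compound"))  # AUTO_APPROVE
print(route(0.95, "MED", "chemical_compound"))  # STEWARD_REVIEW
print(route(0.60, None, "chemical_compound"))  # SME_ESCALATION
```

Because auto-approval requires all three conditions to hold, adding a new entity type or changing a threshold is a one-line policy edit, and any case the policy does not explicitly cover falls back to steward review.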
07 — Results & Impact

What Changed

Measured outcomes after 90 days in production against a 12-month baseline of manual-only curation.

35% · Of all entities auto-approved without human intervention
100% · Audit trail coverage: zero unlogged decisions
<2s · Latency per entity through the full agent pipeline
Increase in curation throughput without adding headcount
94% · Steward agreement rate with top-ranked automated mapping

Beyond the numbers, the most significant outcome was a shift in how data stewards spend their time. Instead of doing repetitive first-pass lookup work, stewards now focus on the genuinely ambiguous cases — the ones where human judgment matters. Curation became a higher-value activity.

The knowledge graph also gained a new property: provenance at the entity level. Every approved term in AWS Neptune now carries a structured provenance record: who approved it, when, with what evidence, and via which route. This became the foundation for downstream AI pipeline trust scoring.

08 — Lessons Learned

What We'd Do Differently

Building agentic systems for regulated domains means learning as much from failures as from successes.

01

Design the human layer first

We initially over-invested in tuning the mapping algorithm before designing the steward review UX. The breakthrough came when we ran user research sessions with actual ontology curators and rebuilt the review packet around their mental model — not the system's output format.

02

Routing rules must be inspectable and editable

Our first routing implementation was hard-coded in Python classes. When governance leads wanted to adjust confidence thresholds or add new entity types to the auto-approve list, it required a dev sprint. Moving routing rules to YAML configuration reduced the change cycle from weeks to hours.

03

Confidence calibration is harder than model accuracy

A mapping model with 94% accuracy is useless if its confidence scores aren't calibrated. Early on, the OntologyMapperAgent's scores were systematically overconfident on rare entity types, causing auto-approvals that stewards later flagged. Isotonic regression calibration on held-out data fixed this — but it took three iterations to get right.
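For readers unfamiliar with isotonic calibration, the idea is to learn a non-decreasing mapping from raw scores to observed correctness rates on held-out data. The sketch below implements it with the pool-adjacent-violators algorithm in pure Python. It is a simplified illustration, not the project's calibration code: it produces a step function rather than an interpolated one, and the function names and sample data are hypothetical.

```python
from bisect import bisect_left
from typing import List, Sequence, Tuple


def fit_isotonic(
    scores: Sequence[float], outcomes: Sequence[int]
) -> Tuple[List[float], List[float]]:
    """Pool-adjacent-violators fit.

    outcomes are 1 if the steward accepted the mapping, else 0.
    Returns (upper score bound of each block, calibrated value of each block).
    """
    # Aggregate tied scores so each distinct score starts as one block.
    agg = {}
    for s, y in zip(scores, outcomes):
        total, count = agg.get(s, (0.0, 0))
        agg[s] = (total + y, count + 1)

    blocks: List[list] = []  # each block: [sum of outcomes, count, max score]
    for s in sorted(agg):
        total, count = agg[s]
        blocks.append([total, count, s])
        # Merge backwards while the previous block's mean exceeds this one's.
        while len(blocks) > 1 and (
            blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            t, c, hi = blocks.pop()
            blocks[-1][0] += t
            blocks[-1][1] += c
            blocks[-1][2] = hi
    return [b[2] for b in blocks], [b[0] / b[1] for b in blocks]


def calibrate(score: float, bounds: List[float], values: List[float]) -> float:
    """Look up the calibrated value; scores above the last bound are clipped."""
    idx = bisect_left(bounds, score)
    return values[min(idx, len(values) - 1)]


# Hypothetical held-out data: raw scores and whether the mapping was accepted.
bounds, values = fit_isotonic([0.6, 0.7, 0.8, 0.9, 0.95], [0, 1, 0, 1, 1])
print(bounds, values)
print(calibrate(0.75, bounds, values))
```

In this toy data a raw score of 0.75 calibrates to 0.5, which is the kind of overconfidence that would otherwise push a mapping toward a more permissive route than it deserves. Routing thresholds only mean what they say when they are applied to calibrated scores.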

04

The audit log is a product feature, not a side effect

We initially treated audit logging as a compliance checkbox. When governance leads started using the audit log to analyze curation patterns — which entity types generate the most conflicts, which stewards have the lowest agreement rate with the model — it became clear that the audit store was one of the most valuable outputs of the system.
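The two analyses mentioned above, conflicts per entity type and agreement rate per steward, are simple aggregations once audit records are structured. A minimal sketch follows, assuming each record is a dictionary with hypothetical keys `entity_type`, `conflicts`, `steward`, `steward_choice`, and `top_ranked`; the sample records are invented for illustration.

```python
from collections import Counter, defaultdict
from typing import Dict, List


def conflicts_by_entity_type(records: List[dict]) -> Counter:
    """Total number of conflicts raised per entity type."""
    counts: Counter = Counter()
    for rec in records:
        counts[rec["entity_type"]] += len(rec.get("conflicts", []))
    return counts


def agreement_rate_by_steward(records: List[dict]) -> Dict[str, float]:
    """Share of reviews where the steward chose the top-ranked mapping."""
    agreed: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for rec in records:
        steward = rec.get("steward")
        if steward is None:  # auto-approved: no human decision to compare
            continue
        total[steward] += 1
        agreed[steward] += rec["steward_choice"] == rec["top_ranked"]
    return {s: agreed[s] / total[s] for s in total}


# Hypothetical audit records.
records = [
    {"entity_type": "protein", "conflicts": ["POLYSEMY_AMBIGUITY"],
     "steward": "s1", "steward_choice": "T1", "top_ranked": "T1"},
    {"entity_type": "protein", "conflicts": ["SYNONYM_COLLISION", "DEPRECATED_TERM"],
     "steward": "s1", "steward_choice": "T2", "top_ranked": "T3"},
    {"entity_type": "compound", "conflicts": [], "steward": None},
]
print(conflicts_by_entity_type(records).most_common())
print(agreement_rate_by_steward(records))
```

A low agreement rate is ambiguous on its own: it can point to a steward applying different rules or to a class of entities where the model's ranking is weak, so it is best read alongside the per-type conflict counts.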

05

Multi-agent doesn't mean more complexity

The four-agent design was initially questioned as over-engineered. In practice, the single-responsibility structure made the system significantly easier to debug, test, and evolve. When the OntologyMapperAgent needed an OLS4 API update, no other agents were affected. The modularity paid dividends that compounded over time.