Agentic AI  ·  Ontology Curation  ·  Drug R&D

Building Agentic AI Systems
for Ontology Curation

How a four-agent LangGraph pipeline automates biomedical term extraction, ontology mapping, conflict detection, and governance routing — while preserving human oversight at every critical decision point.

May 19, 2026  ·  14 min read  ·  Ali Shahmohammadi, Ph.D.
4 Specialized Agents
83% Auto-Approval Rate
4 hr Mean Resolution Time
96% Steward Agreement
01 — The Problem
The Ontology Problem in Drug R&D

Biomedical terminology is fragmented across systems. AI/ML models can't be trained on inconsistently labelled data.

Walk through any pharmaceutical research organisation and you'll find the same problem everywhere. A process development scientist in San Diego writes "cell viability" in her ELN. Her counterpart in Osaka records the same measurement as "VIAB %". The LIMS in the Swiss manufacturing facility stores it as "Viability_Score_Trypan". The regulatory submission uses the official term OBI:0001985 — but only because a data steward looked it up six weeks before the BLA deadline.

These four strings represent the same scientific concept, but no machine knows that, and no AI/ML model trained across these four systems will generalise correctly. The downstream cost is that this fragmentation blocks exactly the kind of cross-dataset meta-analysis and federated AI that could accelerate drug development by years.

Without Ontology Curation

  • 60–70% of R&D data carries unmapped or inconsistently mapped terms
  • Manual review backlogs of 2–6 weeks per dataset ingestion
  • Hallucinated or non-existent CURIEs propagate silently into the knowledge graph
  • Cross-study analytics corrupted by synonym / homonym conflicts
  • No documented audit trail for terminology governance decisions
  • 2–4 FTE domain stewards per therapeutic area cannot scale to 5M+ data points/year

With OntoCurator Agent

  • 83% of mappings auto-approved with ≥0.92 confidence threshold
  • Mean resolution time cut from 18 days to 4 hours
  • Live BioPortal + OLS4 verification eliminates hallucinated CURIEs
  • Cross-dataset conflict detection surfaces inconsistencies automatically
  • Full audit trail — every decision logged to the PostgreSQL checkpointer
  • Stewards review pre-analysed cards, not raw proposals
02 — Design
Why an Agentic Architecture?

A single LLM call works for simple cases. It fails predictably at the edges that matter most in regulated environments.

The naive approach is one prompt: "Here is a data field label. Which ontology term does it map to?" This works for straightforward cases, but its failure modes show why a single call isn't enough.

Hallucinated Identifiers
LLMs confidently generate plausible-looking but non-existent CURIEs. Without live API verification, these propagate silently into the knowledge graph.
Ambiguous Mappings
A single call cannot surface that "cell viability" maps plausibly to both OBI:0001985 and BAO:0002166 with different semantic precedence depending on assay context.
No Conflict Detection
If dataset A maps "VIAB" to one term and dataset B maps "Cell Viability" to a different term, a single-call architecture has no mechanism to detect or escalate the inconsistency.
No Governance Trail
Pharmaceutical terminology decisions require documented human approval. A single call produces an answer but no audit trail, no routing, and no approval workflow.
Agentic Solution
Each agent has a narrow, well-defined responsibility. LangGraph manages state transitions, conditional routing, and human-in-the-loop interrupts. The result is auditable, reversible, and trustworthy.
Design Principle
The goal is not to replace domain stewards. It's to eliminate the 80% of curation work that is mechanical — term lookup, synonym matching, duplicate detection — so stewards focus on the 20% that requires scientific judgment.
03 — System Design
Four Agents, One Pipeline

Each agent is a self-contained LangChain Runnable with its own prompt template, tool bindings, output schema, and error handling.

EntityExtractorAgent

Agent 01  ·  GPT-powered NER

Parses experimental protocols, assay definitions, and ELN entries using GPT to extract candidate scientific entities — measurements, biological entities, reagents, and process parameters — as typed spans with source provenance.

In: raw document text  ·  Out: List[EntitySpan]  ·  Structured output  ·  5 entity types

OntologyMapperAgent

Agent 02  ·  BioPortal + OLS4

Queries BioPortal and EBI OLS4 concurrently across 12 priority ontologies. Ranks candidates by Information Content and semantic similarity. Returns top-3 candidates with confidence scores.

In: List[EntitySpan]  ·  Out: List[MappingCandidate]  ·  12 ontologies  ·  IC + semantic ranking

ConflictDetectorAgent

Agent 03  ·  Registry comparison

Compares each new mapping proposal against the organisation's existing ontology registry. Detects synonym conflicts (same concept, different CURIEs) and homonym conflicts (same string, different concepts).

In: Candidates + Registry  ·  Out: List[ConflictReport]  ·  Synonym  ·  Homonym

GovernanceRouterAgent

Agent 04  ·  Tiered disposition

Applies configurable routing rules: auto-approve high-confidence unambiguous mappings, route conflicts to named domain stewards, escalate novel entity types to the ontology governance board.

In: List[ConflictReport]  ·  Out: GovernanceDecision  ·  Auto / Steward / Board
04 — Agent 1
EntityExtractorAgent

GPT-powered named entity recognition specialised for pharmaceutical R&D documents, enforcing Pydantic schema at the LLM output boundary.

Unlike general biomedical NER models, this agent handles the messy, contextual language found in internal ELN entries, process descriptions, and analytical method summaries. The source_offset field preserves provenance so every mapping decision traces back to the exact position in the source document. Using with_structured_output() enforces Pydantic schema compliance and eliminates an entire class of parsing bugs.

entity_extractor.py
from typing import List, Literal

from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI
from pydantic import BaseModel, Field

# EXTRACTION_SYSTEM_PROMPT is defined elsewhere in the module (not shown here).

class EntitySpan(BaseModel):
    text:            str   # Exact surface form from the source document
    normalized:      str   # Lowercase, no units or suffixes
    entity_type:     Literal["Measurement", "BiologicalEntity", "ChemicalEntity",
                             "ProcessParameter", "Assay"]
    confidence:      float = Field(ge=0.0, le=1.0)
    context_snippet: str   # 30-word window, used by OntologyMapperAgent
    source_offset:   tuple[int, int]  # character offsets into the source text

class ExtractionResult(BaseModel):
    # Assumed wrapper (not shown in the original listing): structured output
    # takes one top-level schema, so the extracted spans are wrapped in a model.
    entities: List[EntitySpan]

class EntityExtractorAgent:
    def __init__(self, model: str = "gpt-4.1"):
        llm = ChatOpenAI(model=model, temperature=0)
        prompt = ChatPromptTemplate.from_messages([
            ("system", EXTRACTION_SYSTEM_PROMPT),
            ("human", "Document ID: {doc_id}\n\nText:\n{text}")
        ])
        # with_structured_output() enforces Pydantic schema at the LLM boundary
        self.chain = prompt | llm.with_structured_output(ExtractionResult)

    def extract(self, text: str, doc_id: str) -> ExtractionResult:
        return self.chain.invoke({"text": text, "doc_id": doc_id})
with_structured_output()  ·  Pydantic v2  ·  Source provenance  ·  Context snippet  ·  5 entity types
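The provenance fields can be illustrated without an LLM. The following is a minimal sketch, assuming offsets are character positions and the context window is counted in whitespace-separated words; `LocatedSpan` and `locate_spans` are hypothetical names, not part of the pipeline shown above.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class LocatedSpan:
    """Hypothetical stand-in for the provenance fields of EntitySpan."""
    text: str
    source_offset: Tuple[int, int]  # [start, end) character offsets
    context_snippet: str


def locate_spans(document: str, surface: str, window_words: int = 30) -> List[LocatedSpan]:
    """Find every occurrence of `surface` and attach offsets plus a word window."""
    if not surface:
        return []
    half = window_words // 2
    spans: List[LocatedSpan] = []
    start = document.find(surface)
    while start != -1:
        end = start + len(surface)
        # Keep up to `half` words on each side of the mention.
        before = document[:start].split()
        after = document[end:].split()
        left = before[max(len(before) - half, 0):]
        snippet = " ".join(left + [surface] + after[:half])
        spans.append(LocatedSpan(surface, (start, end), snippet))
        start = document.find(surface, end)
    return spans


doc = "Cells were stained with trypan blue and cell viability was recorded at 24 h."
span = locate_spans(doc, "cell viability")[0]
start, end = span.source_offset
print(doc[start:end])
```

Slicing the source document with the stored offsets returns the original surface form, which is the property that lets a mapping decision be traced back to its exact position.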
05 — Agent 2
OntologyMapperAgent

Concurrent BioPortal and OLS4 queries across 12 priority ontologies, ranked by Information Content and semantic similarity.

Information Content (IC) measures specificity — a high-IC term is a precise leaf-node concept. For curation we want the most specific correct term, so IC is the primary ranking signal. Semantic similarity against the entity's context snippet handles cases where IC alone would surface the technically-correct-but-contextually-wrong term. The agent fires BioPortal and OLS4 searches concurrently with asyncio.gather, deduplicates by CURIE, and re-ranks the merged set. The domain-fine-tuned S-PubMedBert-MS-MARCO model shows 18% higher precision@1 vs generic sentence transformers on the CRAFT corpus benchmark.

ontology_mapper.py
import asyncio
from dataclasses import dataclass
from typing import List

from sentence_transformers import SentenceTransformer

from entity_extractor import EntitySpan

@dataclass
class MappingCandidate:
    curie:          str    # e.g. "OBI:0001985"
    label:          str    # e.g. "cell viability assay"
    ontology:       str
    ic_score:       float  # information content (0–1)
    semantic_sim:   float  # cosine similarity to entity context snippet
    combined_score: float  # 0.6 * ic + 0.4 * semantic_sim
    definition:     str
    source:         str    # "bioportal" | "ols4"

class OntologyMapperAgent:
    PRIORITY_ONTOLOGIES = ["OBI", "BAO", "GO", "CHEBI", "NCIT", "HP",
                           "EFO", "CLO", "IAO", "SBO", "UO", "PATO"]

    def __init__(self):
        self.embedder = SentenceTransformer("pritamdeka/S-PubMedBert-MS-MARCO")

    async def map_entity(self, entity: EntitySpan) -> List[MappingCandidate]:
        # Query both services concurrently (the _search_* helpers are not shown here)
        bp_results, ols_results = await asyncio.gather(
            self._search_bioportal(entity.normalized, entity.entity_type),
            self._search_ols4(entity.normalized, entity.entity_type),
        )
        # Merge the two result sets so each CURIE appears once
        candidates = self._deduplicate_by_curie(bp_results + ols_results)
        # Re-rank against the context snippet before taking the top 3
        candidates = self._rank_by_semantic_similarity(candidates, entity.context_snippet)
        return sorted(candidates, key=lambda c: c.combined_score, reverse=True)[:3]
asyncio.gather  ·  S-PubMedBert-MS-MARCO  ·  BioPortal API  ·  EBI OLS4  ·  IC ranking  ·  Top-3 candidates
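The gather, deduplicate, and re-rank flow can be sketched end to end with stubbed searches. This is an illustration under stated assumptions: the two `*_stub` coroutines stand in for the real HTTP calls, their scores are made-up values, and keeping the higher-scoring duplicate is one possible deduplication policy rather than the one the agent necessarily uses. Only the 0.6 / 0.4 weighting comes from the listing above.

```python
import asyncio
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Candidate:
    curie: str
    ic_score: float
    semantic_sim: float
    source: str

    @property
    def combined_score(self) -> float:
        # Weights taken from the MappingCandidate comment: 0.6 * ic + 0.4 * semantic_sim
        return 0.6 * self.ic_score + 0.4 * self.semantic_sim


async def search_bioportal_stub(term: str) -> List[Candidate]:
    await asyncio.sleep(0)  # stands in for an HTTP round trip
    return [Candidate("OBI:0001985", 0.90, 0.80, "bioportal"),
            Candidate("BAO:0002166", 0.70, 0.85, "bioportal")]


async def search_ols4_stub(term: str) -> List[Candidate]:
    await asyncio.sleep(0)
    return [Candidate("OBI:0001985", 0.90, 0.75, "ols4")]


def deduplicate_by_curie(candidates: List[Candidate]) -> List[Candidate]:
    """Keep one candidate per CURIE, preferring the higher combined score."""
    best: Dict[str, Candidate] = {}
    for c in candidates:
        if c.curie not in best or c.combined_score > best[c.curie].combined_score:
            best[c.curie] = c
    return list(best.values())


async def map_term(term: str, top_k: int = 3) -> List[Candidate]:
    # Both searches run concurrently; gather returns results in call order.
    bp, ols = await asyncio.gather(search_bioportal_stub(term), search_ols4_stub(term))
    merged = deduplicate_by_curie(bp + ols)
    return sorted(merged, key=lambda c: c.combined_score, reverse=True)[:top_k]


if __name__ == "__main__":
    for c in asyncio.run(map_term("cell viability")):
        print(c.curie, c.source, round(c.combined_score, 2))
```

With these illustrative inputs the OBI candidate appears once after deduplication and ranks ahead of the BAO candidate, because its higher IC score outweighs the BAO term's slightly higher semantic similarity.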
06 — Agent 3
ConflictDetectorAgent

The component that single-call architectures entirely miss — cross-dataset inconsistency detection using the live registry and OWL reasoning.

A synonym conflict occurs when source strings that denote the same concept are mapped to two different CURIEs: either one mapping is wrong, or the two CURIEs are equivalent and the synonym relationship must be explicitly registered. A homonym conflict occurs when the same source string is used in different systems but refers to semantically distinct concepts. The _same_concept() method uses OWL reasoning over a locally cached ontology graph (owlready2) to distinguish the two cases by checking equivalent-class and subclass-of relationships.

conflict_detector.py
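from dataclasses import dataclass
from enum import Enum
from typing import List

from entity_extractor import EntitySpan
from ontology_mapper import MappingCandidate

# Orphans below this score need human review; 0.92 matches the Tier 1
# auto-approval threshold quoted throughout the article.
AUTO_APPROVE_THRESHOLD = 0.92

class ConflictType(Enum):
    SYNONYM  = "synonym_conflict"   # same concept, different CURIEs
    HOMONYM  = "homonym_conflict"   # same string, different concepts
    ORPHAN   = "orphan"             # no existing registry entry
    APPROVED = "pre_approved"       # exact match to an approved mapping

@dataclass
class ConflictReport:
    entity:                EntitySpan
    proposed_mapping:      MappingCandidate
    conflict_type:         ConflictType
    existing_curie:        str | None = None
    requires_human_review: bool      = False

class ConflictDetectorAgent:
    def __init__(self, registry):
        # Registry client exposing lookup(normalized, entity_type); not shown here
        self.registry = registry

    def analyze(self, candidates: List[MappingCandidate], entity: EntitySpan) -> ConflictReport:
        if not candidates:
            raise ValueError("analyze() needs at least one mapping candidate")
        existing = self.registry.lookup(entity.normalized, entity.entity_type)
        best = candidates[0]  # candidates arrive sorted by combined_score, best first
        if not existing:
            return ConflictReport(entity, best, ConflictType.ORPHAN,
                requires_human_review=best.combined_score < AUTO_APPROVE_THRESHOLD)
        if existing.curie == best.curie:
            return ConflictReport(entity, best, ConflictType.APPROVED)
        # _same_concept() (OWL reasoning via owlready2) is not shown in this excerpt
        conflict = (ConflictType.SYNONYM
                    if self._same_concept(existing.curie, best.curie)
                    else ConflictType.HOMONYM)
        return ConflictReport(entity, best, conflict,
            existing_curie=existing.curie, requires_human_review=True)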
Synonym conflicts  ·  Homonym conflicts  ·  OWL reasoning  ·  owlready2  ·  Registry lookup
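The four-way classification can be exercised in isolation. The sketch below assumes an in-memory dictionary in place of the organisation's registry and a hand-written equivalence table in place of the OWL reasoner; the `EX:` identifiers are made-up placeholders, not real ontology terms.

```python
from enum import Enum
from typing import Callable, Dict, Optional, Tuple


class ConflictType(Enum):
    SYNONYM = "synonym_conflict"
    HOMONYM = "homonym_conflict"
    ORPHAN = "orphan"
    APPROVED = "pre_approved"


# (normalized term, entity type) -> approved CURIE
Registry = Dict[Tuple[str, str], str]


def classify(term: str, entity_type: str, proposed_curie: str,
             registry: Registry,
             same_concept: Callable[[str, str], bool]) -> ConflictType:
    existing: Optional[str] = registry.get((term, entity_type))
    if existing is None:
        return ConflictType.ORPHAN       # nothing registered yet
    if existing == proposed_curie:
        return ConflictType.APPROVED     # matches the approved mapping
    if same_concept(existing, proposed_curie):
        return ConflictType.SYNONYM      # same concept, different CURIEs
    return ConflictType.HOMONYM          # same string, different concepts


# Hypothetical equivalence table standing in for the OWL reasoner.
EQUIVALENT = {frozenset({"EX:0000001", "EX:0000002"})}


def same_concept(a: str, b: str) -> bool:
    return a == b or frozenset({a, b}) in EQUIVALENT


registry: Registry = {("cell viability", "Measurement"): "EX:0000001"}
print(classify("cell viability", "Measurement", "EX:0000002", registry, same_concept))
```

Passing `same_concept` as a callable keeps the classification logic testable without loading an ontology graph; a reasoner-backed implementation could be swapped in behind the same signature.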
07 — Agent 4
GovernanceRouterAgent

Three-tier routing: auto-approve, steward review, or board escalation — based on confidence score, conflict type, and ontology priority.

Tier 1 — Auto-Approve
High-confidence, unambiguous
APPROVED exact matches, or ORPHAN with combined score ≥0.92 and Priority 1 ontology. Written directly to registry with source=AUTO_APPROVED.
Tier 2 — Steward Review
Conflicts and low-confidence orphans
Conflicts that do not require board escalation, plus orphans with 0.75 ≤ score < 0.92. Routed to the named domain steward with a pre-analysed decision card. 72-hour SLA.
Tier 3 — Board Escalation
Novel types and homonym conflicts
Entity types not covered by existing ontologies, or homonym conflicts in core platform terminologies. Formal governance board vote required.
governance_router.py
# Disposition, GovernanceDecision, steward_registry, PRIORITY_ONTOLOGIES and
# _sla_date are defined elsewhere in the project and are not shown here.
from conflict_detector import ConflictReport, ConflictType

class GovernanceRouterAgent:
    def route(self, report: ConflictReport) -> GovernanceDecision:
        # Tier 1: auto-approve exact matches to an already approved mapping
        if report.conflict_type == ConflictType.APPROVED:
            return GovernanceDecision(Disposition.AUTO_APPROVED, report)
        # Tier 1: auto-approve high-confidence orphans in a priority ontology
        if (report.conflict_type == ConflictType.ORPHAN
                and not report.requires_human_review
                and report.proposed_mapping.ontology in self.PRIORITY_ONTOLOGIES):
            return GovernanceDecision(Disposition.AUTO_APPROVED, report)
        # Tier 2: steward review for everything except homonym conflicts
        steward = self.steward_registry.get(
            report.entity.entity_type, domain=report.proposed_mapping.ontology)
        if steward and report.conflict_type != ConflictType.HOMONYM:
            return GovernanceDecision(Disposition.STEWARD_REVIEW, report,
                assigned_to=steward.email, due_date=self._sla_date(hours=72))
        # Tier 3: board escalation (homonym conflicts, or no steward found)
        return GovernanceDecision(Disposition.BOARD_ESCALATED, report)
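The tier logic can be checked as a pure function. This is a simplified sketch, not the production router: `Proposal` is a hypothetical flattened record, the steward lookup is reduced to an optional email, and the 0.75 lower bound from the Tier 2 description is not modelled because the article does not say what happens below it.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Disposition(Enum):
    AUTO_APPROVED = "auto_approved"
    STEWARD_REVIEW = "steward_review"
    BOARD_ESCALATED = "board_escalated"


@dataclass
class Proposal:
    conflict_type: str  # "pre_approved" | "orphan" | "synonym_conflict" | "homonym_conflict"
    combined_score: float
    ontology: str
    steward_email: Optional[str] = None  # None when no steward covers this domain


AUTO_APPROVE_THRESHOLD = 0.92  # from the Tier 1 description
PRIORITY_ONTOLOGIES = {"OBI", "BAO", "GO", "CHEBI", "NCIT", "HP",
                       "EFO", "CLO", "IAO", "SBO", "UO", "PATO"}


def route(p: Proposal) -> Disposition:
    # Tier 1: exact match to an approved mapping
    if p.conflict_type == "pre_approved":
        return Disposition.AUTO_APPROVED
    # Tier 1: confident orphan in a priority ontology
    if (p.conflict_type == "orphan"
            and p.combined_score >= AUTO_APPROVE_THRESHOLD
            and p.ontology in PRIORITY_ONTOLOGIES):
        return Disposition.AUTO_APPROVED
    # Tier 2: a steward exists and the conflict is not a homonym
    if p.steward_email and p.conflict_type != "homonym_conflict":
        return Disposition.STEWARD_REVIEW
    # Tier 3: everything else goes to the governance board
    return Disposition.BOARD_ESCALATED


print(route(Proposal("orphan", 0.95, "OBI")))
print(route(Proposal("orphan", 0.80, "OBI", "steward@example.org")))
print(route(Proposal("homonym_conflict", 0.99, "OBI", "steward@example.org")))
```

Note the fall-through behaviour this makes visible: a proposal with no matching steward ends up at the board even when it is not a homonym conflict.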
08 — Orchestration
The LangGraph State Machine

A typed CurationState flows through the graph, accumulating results at each node. Conditional edges implement routing. PostgreSQL checkpointing enables forensic debugging and HITL interrupts.

Pipeline Nodes
Extract: EntityExtractorAgent (raw text → List[EntitySpan])
  ↓ state.entities
Map: OntologyMapperAgent (spans → List[MappingCandidate])
  ↓ state.candidates
Detect: ConflictDetectorAgent (candidates → List[ConflictReport])
  ↓ state.conflicts
Govern: GovernanceRouterAgent (conflicts → List[GovernanceDecision])
  ↓ conditional edge
Interrupt: Human review gate (graph suspends, state persisted to DB)
  ↓ steward feedback
Commit: Registry write (approved mappings persisted with full lineage)
State Machine Code
pipeline.py
from typing import List, TypedDict

from langgraph.graph import StateGraph, END
from langgraph.checkpoint.postgres import PostgresSaver

# EntitySpan, MappingCandidate, ConflictReport and GovernanceDecision come from
# the agent modules; the *_node functions and route_after_governance are not shown.

class CurationState(TypedDict):
    document_id:        str
    raw_text:           str
    entities:           List[EntitySpan]
    candidates:         List[MappingCandidate]
    conflicts:          List[ConflictReport]
    decisions:          List[GovernanceDecision]
    requires_interrupt: bool
    steward_feedback:   str | None

def build_graph(checkpointer):
    # Returns the compiled, runnable graph rather than the StateGraph builder
    graph = StateGraph(CurationState)
    graph.add_node("extract",   extractor_node)
    graph.add_node("map",       mapper_node)
    graph.add_node("detect",    conflict_node)
    graph.add_node("govern",    governance_node)
    graph.add_node("interrupt", human_review_node)
    graph.add_node("commit",    registry_write_node)
    graph.set_entry_point("extract")
    graph.add_edge("extract", "map")
    graph.add_edge("map",     "detect")
    graph.add_edge("detect",  "govern")
    # route_after_governance returns "human_review" or "auto"
    graph.add_conditional_edges("govern",
        route_after_governance,
        {"human_review": "interrupt", "auto": "commit"})
    graph.add_edge("interrupt", "commit")
    graph.add_edge("commit",   END)
    return graph.compile(
        checkpointer=checkpointer,
        interrupt_before=["interrupt"]   # pause before the human review node runs
    )
LangGraph  ·  interrupt_before  ·  PostgresSaver  ·  Typed state  ·  Conditional edges
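The listing references route_after_governance without showing it. One plausible shape is sketched below as plain Python, assuming decisions are available as dictionaries with a "disposition" key and that both steward and board dispositions pass through the human review gate; both are assumptions, and only the edge labels "human_review" and "auto" come from the graph definition.

```python
from typing import Any, Dict, List

# Assumed disposition labels that require a person to act before commit.
REVIEW_DISPOSITIONS = {"steward_review", "board_escalated"}


def route_after_governance(state: Dict[str, Any]) -> str:
    """Return the edge label consumed by add_conditional_edges."""
    decisions: List[Dict[str, Any]] = state.get("decisions", [])
    needs_review = bool(state.get("requires_interrupt", False)) or any(
        d["disposition"] in REVIEW_DISPOSITIONS for d in decisions
    )
    return "human_review" if needs_review else "auto"


print(route_after_governance({"decisions": [{"disposition": "auto_approved"}]}))
print(route_after_governance({"decisions": [{"disposition": "steward_review"}]}))
```

Because the function only reads state and returns a label, it can be unit-tested without building the graph or connecting a checkpointer.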
09 — Governance
Human-in-the-Loop Governance

Governance is a first-class product feature, not an afterthought. Every steward decision is explicit, time-stamped, and tied to a named individual.

The interrupt_before=["interrupt"] parameter suspends graph execution just before the human review node runs, persisting the full state to the PostgreSQL checkpointer. The steward receives a notification and reviews a structured decision card. After the decision, the graph resumes from the checkpoint and completes the commit step. User research revealed that stewards wanted visibility into all automated decisions, not just uncertain ones, so the production system now sends a weekly digest of auto-approved mappings with a 7-day exception window.

01
Source context
The 30-word snippet from the source document with the entity highlighted — so the steward understands how the term was actually used.
02
Proposed mapping + confidence breakdown
The top-1 candidate CURIE with IC score, semantic similarity, combined score, and the ontology definition — not just the identifier.
03
Ranked alternatives from the top 3
The next two candidates ranked by combined score — so the steward can one-click select an alternative without doing their own search.
04
Conflict description + existing mapping
For conflict cases, the card shows the conflicting existing mapping, which dataset it came from, and when it was approved.
05
Action buttons with rationale field
Approve / Reject / Request-Synonym / Escalate. A free-text rationale field is required for board-escalated items and stored in the audit log.
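The five card sections map naturally onto a small data structure. The sketch below is hypothetical: the class and function names are illustrative, and the only rules encoded are the ones stated above (top-1 proposal, the next two candidates as alternatives, four actions, and a mandatory rationale for board-escalated items).

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

ACTIONS: Tuple[str, ...] = ("Approve", "Reject", "Request-Synonym", "Escalate")


@dataclass
class CandidateView:
    curie: str
    definition: str
    ic_score: float
    semantic_sim: float
    combined_score: float


@dataclass
class DecisionCard:
    context_snippet: str                 # 01 source context
    proposed: CandidateView              # 02 top-1 mapping with score breakdown
    alternatives: List[CandidateView]    # 03 remaining ranked candidates
    existing_curie: Optional[str]        # 04 conflicting registry entry, if any
    actions: Tuple[str, ...] = ACTIONS   # 05 buttons offered to the steward


def build_card(context_snippet: str, ranked: List[CandidateView],
               existing_curie: Optional[str] = None) -> DecisionCard:
    """Split a best-first candidate list into the proposal and its alternatives."""
    if not ranked:
        raise ValueError("at least one ranked candidate is required")
    return DecisionCard(context_snippet, ranked[0], ranked[1:3], existing_curie)


def validate_action(action: str, rationale: str, board_escalated: bool) -> None:
    """Reject unknown actions and missing rationales on board-escalated items."""
    if action not in ACTIONS:
        raise ValueError(f"unknown action: {action}")
    if board_escalated and not rationale.strip():
        raise ValueError("a rationale is required for board-escalated items")
```

Validating the action server-side, rather than relying on the UI alone, keeps the audit log free of decisions that lack the required rationale.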
10 — Impact
Results & Impact

Piloted on three years of historical assay data across two therapeutic areas. Baseline: Jira-based manual curation with 18-day mean resolution time.

83% Auto-approval rate on production data
4 hr Mean resolution time vs 18-day baseline
96% Steward agreement with top-1 proposal
14 Silent conflicts surfaced in historical data

The 83% auto-approval rate means domain stewards now actively review only one in six incoming proposals. Of the reviewed cases, 96% result in the steward selecting the agent's top-ranked candidate — confirming that the mapping quality is high enough to warrant the confidence threshold.

The most significant impact is the elimination of the backlog effect. Under the manual workflow, curation bottlenecks caused dataset ingestions to be delayed by 2–6 weeks. The agentic pipeline removes this constraint entirely for the auto-approved 83%.

The conflict detector surfaced 14 terminology inconsistencies in the historical data that had existed for 2–3 years without being detected — inconsistencies that were silently corrupting cross-study analytics.
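The headline ratios follow directly from the reported figures. The arithmetic below uses only numbers quoted in this section and treats the 18-day baseline as calendar days of 24 hours, which is an assumption.

```python
auto_approval_rate = 0.83

# Share of proposals that still reach a steward
reviewed_share = 1 - auto_approval_rate        # about 0.17
proposals_per_review = 1 / reviewed_share      # about 5.9, i.e. roughly one in six

# Mean resolution time: 18 days before, 4 hours after
baseline_hours = 18 * 24                       # 432 hours
speedup = baseline_hours / 4                   # 108.0

print(round(proposals_per_review, 1), speedup)
```

So "one in six" is a rounding of about one in 5.9, and the change in mean resolution time corresponds to a reduction of roughly two orders of magnitude.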

11 — Lessons
Lessons Learned

Six things we would do the same again — and one trap we walked into.

What worked well

Build the governance workflow first

  • Designing the steward review UI before writing AI code shaped every data structure — the context snippet, confidence breakdown, and conflict report format all emerged from user research
  • with_structured_output() throughout the pipeline eliminated an entire class of integration bugs — type-safe state is non-negotiable
  • LangGraph checkpointing enables full forensic debugging — every state transition is stored, every steward disagreement is explainable
  • Domain-fine-tuned embeddings (S-PubMedBert) give 18% higher precision@1 over generic sentence transformers — domain vocabulary matters
  • Conflict detection is the genuine novel contribution — extraction and mapping are well-served by existing tools; cross-dataset detection is the hard part
The trap & what's next

BioPortal rate limits are your bottleneck

  • BioPortal free tier: 15 req/s. For bulk historical curation this is the dominant constraint — add a Redis cache layer (TTL 30 days) before the API call, not after
  • Send a weekly digest of auto-approved mappings with a 7-day exception window: stewards discovered they wanted visibility into automated decisions, not just uncertain ones
  • Next: ProvenanceTraceAgent — generates PROV-O compatible lineage graphs for each mapping decision, enabling automated scoring of RDA indicator RDA-I3-01M
  • Next: Integration with FAIR Data Toolkit Phase 2 — auto-scoring ontology interoperability indicators from the curation registry
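The "cache before the API call" advice for BioPortal is the cache-aside pattern. The sketch below illustrates it with an in-memory TTL cache standing in for Redis, so it runs without a server; `TTLCache`, `cached_search`, and the key format are hypothetical, and the `search` argument stands in for the real BioPortal client.

```python
import time
from typing import Callable, Dict, List, Optional, Tuple

THIRTY_DAYS = 30 * 24 * 60 * 60  # TTL in seconds, matching the 30-day suggestion


class TTLCache:
    """In-memory stand-in for Redis: values expire after ttl_seconds."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._store: Dict[str, Tuple[float, List[str]]] = {}

    def get(self, key: str) -> Optional[List[str]]:
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._store[key]  # expired: drop it and report a miss
            return None
        return value

    def set(self, key: str, value: List[str]) -> None:
        self._store[key] = (self._clock() + self._ttl, value)


def cached_search(term: str, cache: TTLCache,
                  search: Callable[[str], List[str]]) -> List[str]:
    """Check the cache first; only call the rate-limited API on a miss."""
    key = f"bioportal:{term.strip().lower()}"
    hit = cache.get(key)
    if hit is not None:
        return hit
    result = search(term)
    cache.set(key, result)
    return result
```

Normalising the key (trimmed, lowercased) means surface variants of the same label share one cache entry, which matters most during bulk historical curation where the same terms recur across datasets.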