How a four-agent LangGraph pipeline automates biomedical term extraction, ontology mapping, conflict detection, and governance routing — while preserving human oversight at every critical decision point.
Biomedical terminology is fragmented across systems. AI/ML models cannot be trained reliably on inconsistently labelled data.
Walk through any pharmaceutical research organisation and you'll find the same problem everywhere. A process development scientist in San Diego writes "cell viability" in her ELN. Her counterpart in Osaka records the same measurement as "VIAB %". The LIMS in the Swiss manufacturing facility stores it as "Viability_Score_Trypan". The regulatory submission uses the official term OBI:0001985 — but only because a data steward looked it up six weeks before the BLA deadline.
These four strings represent the same scientific concept. No machine knows that. No AI/ML model trained across these four systems will generalise correctly. The inconsistency blocks exactly the kind of cross-dataset meta-analysis and federated AI that could accelerate drug development by years.
A single LLM call works for simple cases. It fails predictably at the edges that matter most in regulated environments.
The naive approach is one prompt: "Here is a data field label. Which ontology term does it map to?" This works for straightforward cases, but its failure modes show why a single call is not enough.
Each agent is a self-contained LangChain Runnable with its own prompt template, tool bindings, output schema, and error handling.
EntityExtractorAgent: Parses experimental protocols, assay definitions, and ELN entries using GPT to extract candidate scientific entities (measurements, biological entities, reagents, and process parameters) as typed spans with source provenance.
OntologyMapperAgent: Queries BioPortal and EBI OLS4 concurrently across 12 priority ontologies. Ranks candidates by Information Content and semantic similarity. Returns the top-3 candidates with confidence scores.
ConflictDetectorAgent: Compares each new mapping proposal against the organisation's existing ontology registry. Detects synonym conflicts (same concept, different CURIEs) and homonym conflicts (same string, different concepts).
GovernanceRouterAgent: Applies configurable routing rules: auto-approve high-confidence unambiguous mappings, route conflicts to named domain stewards, escalate novel entity types to the ontology governance board.
GPT-powered named entity recognition specialised for pharmaceutical R&D documents, enforcing Pydantic schema at the LLM output boundary.
Unlike general biomedical NER models, this agent handles the messy, contextual language found in internal ELN entries, process descriptions, and analytical method summaries. The source_offset field preserves provenance so every mapping decision traces back to the exact position in the source document. Using with_structured_output() enforces Pydantic schema compliance and eliminates an entire class of parsing bugs.
from typing import List, Literal

from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI
from pydantic import BaseModel, Field


class EntitySpan(BaseModel):
    text: str                       # Exact surface form from the source document
    normalized: str                 # Lowercase, no units or suffixes
    entity_type: Literal["Measurement", "BiologicalEntity", "ChemicalEntity",
                         "ProcessParameter", "Assay"]
    confidence: float = Field(ge=0.0, le=1.0)
    context_snippet: str            # 30-word window, used by OntologyMapperAgent
    source_offset: tuple[int, int]  # Position of the span in the source document


class ExtractionResult(BaseModel):
    # Not shown in the original listing; assumed here to wrap the extracted spans
    entities: List[EntitySpan]


class EntityExtractorAgent:
    # EXTRACTION_SYSTEM_PROMPT is the system prompt string, defined elsewhere (not shown)
    def __init__(self, model: str = "gpt-4.1"):
        llm = ChatOpenAI(model=model, temperature=0)
        prompt = ChatPromptTemplate.from_messages([
            ("system", EXTRACTION_SYSTEM_PROMPT),
            ("human", "Document ID: {doc_id}\n\nText:\n{text}"),
        ])
        # with_structured_output() enforces Pydantic schema at the LLM boundary
        self.chain = prompt | llm.with_structured_output(ExtractionResult)

    def extract(self, text: str, doc_id: str) -> ExtractionResult:
        return self.chain.invoke({"text": text, "doc_id": doc_id})
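The provenance guarantee is only useful if the offsets are actually checked. The following is a minimal sketch of such a check, assuming source_offset is a half-open (start, end) character range into the original text (the article does not state the convention). Span is a hypothetical plain-dataclass stand-in for EntitySpan, and both function names are illustrative.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Span:
    """Hypothetical stand-in for EntitySpan with only the fields needed here."""
    text: str
    source_offset: Tuple[int, int]  # assumed half-open [start, end)


def offsets_are_consistent(document: str, span: Span) -> bool:
    """True if slicing the document at source_offset reproduces span.text."""
    start, end = span.source_offset
    if not (0 <= start <= end <= len(document)):
        return False
    return document[start:end] == span.text


def find_bad_spans(document: str, spans: List[Span]) -> List[Span]:
    """Return spans whose offsets do not point at their own surface form."""
    return [s for s in spans if not offsets_are_consistent(document, s)]


doc = "Measured cell viability after 48 h; VIAB % was 92."
good = Span("cell viability", (9, 23))
bad = Span("VIAB %", (0, 6))  # offset points at "Measur", not "VIAB %"
print(find_bad_spans(doc, [good, bad]))
```

Running a check like this right after extraction would catch cases where the model returns a plausible surface form with offsets that do not match it, before the span reaches the mapper.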
Concurrent BioPortal and OLS4 queries across 12 priority ontologies, ranked by Information Content and semantic similarity.
Information Content (IC) measures specificity — a high-IC term is a precise leaf-node concept. For curation we want the most specific correct term, so IC is the primary ranking signal. Semantic similarity against the entity's context snippet handles cases where IC alone would surface the technically-correct-but-contextually-wrong term. The agent fires BioPortal and OLS4 searches concurrently with asyncio.gather, deduplicates by CURIE, and re-ranks the merged set. The domain-fine-tuned S-PubMedBert-MS-MARCO model shows 18% higher precision@1 vs generic sentence transformers on the CRAFT corpus benchmark.
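The article does not say how IC is computed. One common corpus-free choice is the intrinsic IC of Seco et al., which depends only on how many descendants a term has and falls in the same 0 to 1 range as the ic_score field. The sketch below works under that assumption; the toy hierarchy and term names are made up for illustration.

```python
import math
from typing import Dict, List

# Toy is-a hierarchy (parent -> children); the terms are illustrative only.
CHILDREN: Dict[str, List[str]] = {
    "assay": ["viability assay", "binding assay"],
    "viability assay": ["trypan blue viability assay"],
    "binding assay": [],
    "trypan blue viability assay": [],
}


def descendant_count(term: str) -> int:
    """Number of distinct terms strictly below `term`."""
    seen = set()
    stack = list(CHILDREN.get(term, []))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(CHILDREN.get(node, []))
    return len(seen)


def intrinsic_ic(term: str) -> float:
    """Seco-style intrinsic IC: 1.0 for leaves, 0.0 for the root."""
    total = len(CHILDREN)
    return 1.0 - math.log(descendant_count(term) + 1) / math.log(total)


for t in CHILDREN:
    print(f"{t}: {intrinsic_ic(t):.2f}")
```

Leaf terms score highest under this definition, which matches the stated preference for the most specific correct term.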
import asyncio
from dataclasses import dataclass
from typing import List

from sentence_transformers import SentenceTransformer


@dataclass
class MappingCandidate:
    curie: str             # e.g. "OBI:0001985"
    label: str             # e.g. "cell viability assay"
    ontology: str
    ic_score: float        # information content (0–1)
    semantic_sim: float    # cosine similarity to entity context snippet
    combined_score: float  # 0.6 * ic + 0.4 * semantic_sim
    definition: str
    source: str            # "bioportal" | "ols4"


class OntologyMapperAgent:
    PRIORITY_ONTOLOGIES = ["OBI", "BAO", "GO", "CHEBI", "NCIT", "HP",
                           "EFO", "CLO", "IAO", "SBO", "UO", "PATO"]

    def __init__(self):
        self.embedder = SentenceTransformer("pritamdeka/S-PubMedBert-MS-MARCO")

    # The private helpers (_search_bioportal, _search_ols4, _deduplicate_by_curie,
    # _rank_by_semantic_similarity) are not shown here.
    async def map_entity(self, entity: EntitySpan) -> List[MappingCandidate]:
        # Query both services concurrently
        bp_results, ols_results = await asyncio.gather(
            self._search_bioportal(entity.normalized, entity.entity_type),
            self._search_ols4(entity.normalized, entity.entity_type),
        )
        candidates = self._deduplicate_by_curie(bp_results + ols_results)
        candidates = self._rank_by_semantic_similarity(candidates, entity.context_snippet)
        # Return the three highest-scoring candidates
        return sorted(candidates, key=lambda c: c.combined_score, reverse=True)[:3]
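The private helpers are not shown in the article. The following is a sketch of what the gather, deduplicate, and re-rank steps could look like, with stubbed search coroutines in place of real BioPortal and OLS4 calls and a plain cosine function in place of the sentence-transformer model. The names search_bioportal_stub, search_ols4_stub, and rank are hypothetical, the "EX:" CURIEs are placeholders, and the 0.6/0.4 weights come from the combined_score comment above.

```python
import asyncio
import math
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


async def search_bioportal_stub(term: str) -> List[Dict]:
    # Stand-in for an HTTP call; returns canned candidates.
    return [
        {"curie": "EX:0001", "ic": 0.9, "vec": [1.0, 0.0], "source": "bioportal"},
        {"curie": "EX:0002", "ic": 0.4, "vec": [0.0, 1.0], "source": "bioportal"},
    ]


async def search_ols4_stub(term: str) -> List[Dict]:
    # Stand-in for an HTTP call; overlaps with the other source on EX:0001.
    return [
        {"curie": "EX:0001", "ic": 0.9, "vec": [1.0, 0.0], "source": "ols4"},
        {"curie": "EX:0003", "ic": 0.7, "vec": [0.6, 0.8], "source": "ols4"},
    ]


def deduplicate_by_curie(cands: List[Dict]) -> List[Dict]:
    """Keep the first candidate seen for each CURIE."""
    unique: Dict[str, Dict] = {}
    for c in cands:
        unique.setdefault(c["curie"], c)
    return list(unique.values())


async def rank(term: str, context_vec: Sequence[float], top_k: int = 3) -> List[Dict]:
    """Query both stubs concurrently, merge, score, and return the top_k."""
    bp, ols = await asyncio.gather(search_bioportal_stub(term), search_ols4_stub(term))
    merged = deduplicate_by_curie(bp + ols)
    for c in merged:
        c["semantic_sim"] = cosine(c["vec"], context_vec)
        c["combined_score"] = 0.6 * c["ic"] + 0.4 * c["semantic_sim"]
    return sorted(merged, key=lambda c: c["combined_score"], reverse=True)[:top_k]


ranked = asyncio.run(rank("cell viability", context_vec=[1.0, 0.0]))
print([c["curie"] for c in ranked])
```

In this toy data the duplicate hit on EX:0001 collapses to the first source that returned it, and the mid-IC candidate whose vector is closer to the context outranks the low-IC one.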
The component that single-call architectures entirely miss — cross-dataset inconsistency detection using the live registry and OWL reasoning.
A synonym conflict occurs when the same concept ends up mapped to two different CURIEs, for example when the proposed CURIE differs from the registered one but both denote the same thing. Either one mapping is wrong, or a synonym relationship must be explicitly registered. A homonym conflict occurs when the same source string is used in different systems but refers to semantically distinct concepts. The _same_concept() method uses OWL reasoning via a locally cached ontology graph (owlready2) to distinguish these two cases by checking equivalent-class and subclass-of relationships.
from dataclasses import dataclass
from enum import Enum


class ConflictType(Enum):
    SYNONYM = "synonym_conflict"   # same concept, different CURIEs
    HOMONYM = "homonym_conflict"   # same string, different concepts
    ORPHAN = "orphan"              # no existing registry entry
    APPROVED = "pre_approved"      # exact match to an approved mapping


@dataclass
class ConflictReport:
    entity: EntitySpan
    proposed_mapping: MappingCandidate
    conflict_type: ConflictType
    existing_curie: str | None = None
    requires_human_review: bool = False


class ConflictDetectorAgent:
    def __init__(self, registry):
        self.registry = registry   # approved-mapping registry (interface not shown)

    def analyze(self, candidates, entity) -> ConflictReport:
        # Assumes candidates is non-empty and sorted best-first
        existing = self.registry.lookup(entity.normalized, entity.entity_type)
        best = candidates[0]
        if not existing:
            # New term: only low-confidence orphans need a human
            return ConflictReport(entity, best, ConflictType.ORPHAN,
                                  requires_human_review=best.combined_score < 0.85)
        if existing.curie == best.curie:
            return ConflictReport(entity, best, ConflictType.APPROVED)
        # _same_concept() is the OWL-based check described above (not shown)
        conflict = (ConflictType.SYNONYM
                    if self._same_concept(existing.curie, best.curie)
                    else ConflictType.HOMONYM)
        return ConflictReport(entity, best, conflict,
                              existing_curie=existing.curie, requires_human_review=True)
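The _same_concept() check itself is not shown. As a rough illustration of the idea only, the sketch below replaces the owlready2 reasoner with a tiny in-memory graph of asserted equivalent-class and subclass-of links. Treating a subclass as the "same concept" as its ancestor is one possible reading of the description above, not necessarily the production rule, and every CURIE here is a placeholder.

```python
from typing import Dict, Set

# Asserted axioms for a toy ontology; placeholder CURIEs.
EQUIVALENT: Dict[str, Set[str]] = {
    "EX:1": {"EX:1a"},
    "EX:1a": {"EX:1"},
}
SUBCLASS_OF: Dict[str, Set[str]] = {
    "EX:2": {"EX:1"},  # EX:2 is a subclass of EX:1
    "EX:9": set(),     # unrelated term
}


def ancestors(curie: str) -> Set[str]:
    """All classes reachable through subclass-of and equivalence links."""
    seen: Set[str] = set()
    stack = [curie]
    while stack:
        node = stack.pop()
        for nxt in SUBCLASS_OF.get(node, set()) | EQUIVALENT.get(node, set()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    seen.discard(curie)
    return seen


def same_concept(a: str, b: str) -> bool:
    """True if a and b are identical, equivalent, or one subsumes the other."""
    return a == b or b in ancestors(a) or a in ancestors(b)


def classify(existing: str, proposed: str) -> str:
    """Label a CURIE mismatch the way ConflictDetectorAgent.analyze does."""
    return "synonym_conflict" if same_concept(existing, proposed) else "homonym_conflict"


print(classify("EX:1", "EX:1a"))  # equivalent classes
print(classify("EX:1", "EX:9"))   # unrelated classes
```

A real reasoner would also infer links that are not asserted directly, which is the main reason to use owlready2 rather than a hand-rolled traversal like this one.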
Three-tier routing: auto-approve, steward review, or board escalation — based on confidence score, conflict type, and ontology priority.
class GovernanceRouterAgent:
    # Disposition, GovernanceDecision, self.steward_registry, self.PRIORITY_ONTOLOGIES
    # and self._sla_date() are defined elsewhere (not shown)
    def route(self, report: ConflictReport) -> GovernanceDecision:
        # Tier 1: auto-approve
        if report.conflict_type == ConflictType.APPROVED:
            return GovernanceDecision(Disposition.AUTO_APPROVED, report)
        if (report.conflict_type == ConflictType.ORPHAN
                and not report.requires_human_review
                and report.proposed_mapping.ontology in self.PRIORITY_ONTOLOGIES):
            return GovernanceDecision(Disposition.AUTO_APPROVED, report)

        # Tier 2: steward review
        steward = self.steward_registry.get(
            report.entity.entity_type, domain=report.proposed_mapping.ontology)
        if steward and report.conflict_type != ConflictType.HOMONYM:
            return GovernanceDecision(Disposition.STEWARD_REVIEW, report,
                                      assigned_to=steward.email,
                                      due_date=self._sla_date(hours=72))

        # Tier 3: board escalation
        return GovernanceDecision(Disposition.BOARD_ESCALATED, report)
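Disposition, GovernanceDecision, and _sla_date are referenced but not defined in the listing. A self-contained sketch of how they might look follows. The field names mirror the calls above, while the enum values, the UTC convention, and the trimmed-down route function that takes plain values instead of a ConflictReport are assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from enum import Enum
from typing import Optional


class Disposition(Enum):
    AUTO_APPROVED = "auto_approved"
    STEWARD_REVIEW = "steward_review"
    BOARD_ESCALATED = "board_escalated"


@dataclass
class GovernanceDecision:
    disposition: Disposition
    report: object                       # the ConflictReport in the real pipeline
    assigned_to: Optional[str] = None
    due_date: Optional[datetime] = None


def sla_date(hours: int, now: Optional[datetime] = None) -> datetime:
    """Deadline `hours` from now, in UTC."""
    now = now or datetime.now(timezone.utc)
    return now + timedelta(hours=hours)


def route(conflict_type: str, needs_review: bool, in_priority_ontology: bool,
          steward_email: Optional[str],
          now: Optional[datetime] = None) -> GovernanceDecision:
    """Simplified three-tier routing on plain values."""
    # Tier 1: auto-approve
    if conflict_type == "pre_approved":
        return GovernanceDecision(Disposition.AUTO_APPROVED, None)
    if conflict_type == "orphan" and not needs_review and in_priority_ontology:
        return GovernanceDecision(Disposition.AUTO_APPROVED, None)
    # Tier 2: steward review (never for homonym conflicts)
    if steward_email and conflict_type != "homonym_conflict":
        return GovernanceDecision(Disposition.STEWARD_REVIEW, None,
                                  assigned_to=steward_email,
                                  due_date=sla_date(72, now))
    # Tier 3: board escalation
    return GovernanceDecision(Disposition.BOARD_ESCALATED, None)


fixed_now = datetime(2025, 1, 1, tzinfo=timezone.utc)
decision = route("synonym_conflict", True, True, "steward@example.org", fixed_now)
print(decision.disposition.value, decision.due_date)
```

Note that, as in the original listing, a homonym conflict goes to the board even when a steward is available, and any case without a registered steward also falls through to Tier 3.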
A typed CurationState flows through the graph, accumulating results at each node. Conditional edges implement routing. PostgreSQL checkpointing enables forensic debugging and human-in-the-loop (HITL) interrupts.
from typing import List, TypedDict

from langgraph.graph import StateGraph, END
from langgraph.checkpoint.postgres import PostgresSaver  # passed in as `checkpointer`


class CurationState(TypedDict):
    document_id: str
    raw_text: str
    entities: List[EntitySpan]
    candidates: List[MappingCandidate]
    conflicts: List[ConflictReport]
    decisions: List[GovernanceDecision]
    requires_interrupt: bool
    steward_feedback: str | None


def build_graph(checkpointer):
    """Wire the nodes and return the compiled graph (node functions not shown)."""
    graph = StateGraph(CurationState)
    graph.add_node("extract", extractor_node)
    graph.add_node("map", mapper_node)
    graph.add_node("detect", conflict_node)
    graph.add_node("govern", governance_node)
    graph.add_node("interrupt", human_review_node)
    graph.add_node("commit", registry_write_node)

    graph.set_entry_point("extract")
    graph.add_edge("extract", "map")
    graph.add_edge("map", "detect")
    graph.add_edge("detect", "govern")
    # route_after_governance returns "human_review" or "auto"
    graph.add_conditional_edges("govern",
                                route_after_governance,
                                {"human_review": "interrupt", "auto": "commit"})
    graph.add_edge("interrupt", "commit")
    graph.add_edge("commit", END)

    # compile() returns the runnable graph, paused before the human review node
    return graph.compile(
        checkpointer=checkpointer,
        interrupt_before=["interrupt"]
    )
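route_after_governance is referenced but not shown. Since the conditional edge maps the strings "human_review" and "auto" to nodes, the function presumably returns one of those two keys. A minimal sketch, assuming the governance node sets requires_interrupt whenever any decision is not auto-approved; MiniState and governance_node_sketch are hypothetical stand-ins for the real state and node.

```python
from typing import List, TypedDict


class MiniState(TypedDict):
    """Cut-down stand-in for CurationState with only the fields used here."""
    dispositions: List[str]
    requires_interrupt: bool


def governance_node_sketch(state: MiniState) -> dict:
    """Return a partial state update flagging whether a human must review."""
    needs_human = any(d != "auto_approved" for d in state["dispositions"])
    return {"requires_interrupt": needs_human}


def route_after_governance(state: MiniState) -> str:
    """Return the key used in the conditional-edge mapping."""
    return "human_review" if state["requires_interrupt"] else "auto"


state: MiniState = {"dispositions": ["auto_approved", "steward_review"],
                    "requires_interrupt": False}
state.update(governance_node_sketch(state))
print(route_after_governance(state))
```

Keeping the routing function a pure read of the state, with the decision logic in the node, makes the branch taken visible in every checkpoint.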
Governance is a first-class product feature, not an afterthought. Every steward decision is explicit, time-stamped, and tied to a named individual.
The interrupt_before=["interrupt"] parameter suspends graph execution at the human review node, persisting the full state to the PostgreSQL checkpointer. The steward receives a notification and reviews a structured decision card. After the decision, the graph resumes from the checkpoint and completes the commit step. User research revealed that stewards wanted visibility into all automated decisions, not just uncertain ones — the production system now sends a weekly digest of auto-approved mappings with a 7-day exception window.
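The 7-day exception window for auto-approved mappings comes down to a timestamp comparison. A sketch, assuming each auto-approval is stored with a UTC approval time and that the window is measured from that time (the article does not say whether it starts at approval or at digest delivery). The record fields, function names, and placeholder CURIEs are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import List

EXCEPTION_WINDOW = timedelta(days=7)


@dataclass(frozen=True)
class AutoApproval:
    source_term: str
    curie: str
    approved_at: datetime  # timezone-aware, UTC


def can_raise_exception(record: AutoApproval, now: datetime) -> bool:
    """True while a steward may still contest this auto-approval."""
    return timedelta(0) <= now - record.approved_at <= EXCEPTION_WINDOW


def weekly_digest(records: List[AutoApproval], now: datetime) -> List[AutoApproval]:
    """Auto-approvals still inside the window, newest first."""
    recent = [r for r in records if can_raise_exception(r, now)]
    return sorted(recent, key=lambda r: r.approved_at, reverse=True)


now = datetime(2025, 3, 10, tzinfo=timezone.utc)
records = [
    AutoApproval("VIAB %", "EX:0001", now - timedelta(days=2)),
    AutoApproval("cell viability", "EX:0001", now - timedelta(days=6)),
    AutoApproval("Viability_Score_Trypan", "EX:0001", now - timedelta(days=9)),
]
print([r.source_term for r in weekly_digest(records, now)])
```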
Piloted on three years of historical assay data across two therapeutic areas. Baseline: Jira-based manual curation with 18-day mean resolution time.
The 83% auto-approval rate means domain stewards now actively review only one in six incoming proposals. Of the reviewed cases, 96% result in the steward selecting the agent's top-ranked candidate — confirming that the mapping quality is high enough to warrant the confidence threshold.
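The two percentages compose in a simple way. Assuming the 96% figure applies to the 17% of proposals that reach a steward, the arithmetic can be checked directly:

```python
auto_approved = 0.83
reviewed = 1 - auto_approved          # share of proposals that reach a steward
top_candidate_accepted = 0.96         # share of reviewed proposals

one_in_n = 1 / reviewed               # basis for the "one in six" figure
overridden = reviewed * (1 - top_candidate_accepted)

print(f"reviewed share: {reviewed:.0%} (about 1 in {one_in_n:.1f})")
print(f"steward picks a different candidate: {overridden:.2%} of all proposals")
```

Under that assumption, stewards end up choosing something other than the agent's top candidate for well under 1% of all incoming proposals.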
The most significant impact is the elimination of the backlog effect. Under the manual workflow, curation bottlenecks caused dataset ingestions to be delayed by 2–6 weeks. The agentic pipeline removes this constraint entirely for the auto-approved 83%.
The conflict detector surfaced 14 terminology inconsistencies in the historical data that had existed for 2–3 years without being detected — inconsistencies that were silently corrupting cross-study analytics.
Six things we would do the same again — and one trap we walked into.
with_structured_output() throughout the pipeline eliminated an entire class of integration bugs — type-safe state is non-negotiable