Rules for the numbers. Language for the rest.
A generic, MDR-aligned system that drafts TDM reports: a deterministic rule engine owns every number and safety threshold, an LLM writes the interpretation, and a clinician signs every report.
Built directly on medicalvalues' own proof-of-concept
One decision drives the whole design
The proof-of-concept showed exactly how LLMs break in TDM — wrong steady-state, oscillating ranges, hallucinated numbers. So the numbers go to rules, the prose to the model, the signature to a human.
Hard rule / LLM split
If an output is fully determined by the inputs plus a citable rule, it is computed deterministically. Only synthesis, narrative and gap-finding go to the LLM — and every LLM claim is grounded in a cited source.
Generic engine, drug modules
The engine is drug-agnostic; ranges, half-lives and interactions live in versioned drug modules. Validate the engine once, then qualify each module before switching it on.
Narrow intended use
The system produces a draft a qualified professional reviews and signs. It informs the decision; it never makes it. That keeps the device manageable under MDR.
AI Act & DSGVO by design
We treat it as a high-risk AI system. Traceable logging and an on-prem option for health data (DSGVO) are built in from day one.
A TDM interpretation is not "the value is X"
The interpretation — not the number — is the deliverable, and it runs in a fixed order. I exploit that order to split what a rule must compute from what an LLM may phrase.
The order I exploit
- Settle the pharmacokinetics first — steady state and trough vs peak. deterministic · rule
- Classify against the one canonical reference range: low · therapeutic · elevated · toxic. deterministic · rule
- Reason about causes & context — adherence, organ function, interactions, metabolizer status. interpretive · LLM, human-confirmed
The first two steps condition every downstream judgement, and they are exactly where the PoC's models stumbled. So they are computed deterministically, before a word is written.
I treat Adler et al., Table 1 as the output spec to reverse-engineer, not something to re-derive (see §03).
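A minimal sketch of the first deterministic step, under stated assumptions: steady state is taken as reached after about five half-lives on an unchanged dose (a common pharmacokinetic rule of thumb, not a value from the source), and a sample counts as a trough when it is drawn within a tolerance window before the next dose. The function name `settle_pk`, the constant, the window, and the half-life used in the example are all hypothetical; the real numbers belong in the drug module.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Assumption: rule-of-thumb multiplier; the real value belongs in the drug module.
STEADY_STATE_HALF_LIVES = 5.0


@dataclass(frozen=True)
class SamplingFacts:
    steady_state: bool
    sample_type: str  # "trough" or "not_trough"


def settle_pk(
    half_life_h: float,
    dosing_interval_h: float,
    last_dose_change: datetime,
    last_ingestion: datetime,
    sampled_at: datetime,
    trough_window_h: float = 1.0,  # hypothetical tolerance before the next dose
) -> SamplingFacts:
    """Decide steady state and trough vs not-trough from timestamps alone."""
    time_on_stable_dose = sampled_at - last_dose_change
    needed = timedelta(hours=STEADY_STATE_HALF_LIVES * half_life_h)
    steady = time_on_stable_dose >= needed

    hours_since_dose = (sampled_at - last_ingestion).total_seconds() / 3600
    is_trough = hours_since_dose >= dosing_interval_h - trough_window_h
    return SamplingFacts(steady, "trough" if is_trough else "not_trough")


facts = settle_pk(
    half_life_h=7.0,  # illustrative only, not a clinical reference value
    dosing_interval_h=12.0,
    last_dose_change=datetime(2025, 1, 1, 8, 0),
    last_ingestion=datetime(2025, 1, 10, 20, 0),
    sampled_at=datetime(2025, 1, 11, 7, 30),
)
print(facts)
```

Because the result depends only on timestamps and module constants, the same inputs always give the same facts, which is the property the LLM could not guarantee in the PoC.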
The proof-of-concept is our specification
Adler et al. scored standalone GPT-4o and Gemini 2.0 Flash on 10 fictional TDM cases (curated RAG, CLEAR framework): structured, but only "average" (12–18 / 25). The study showed where they fail — and each failure becomes a design rule. Caveat I'd flag: n = 10, single rater — hypothesis-generating, not confirmatory.
CLEAR scores by category (scale 1–5)
Biggest gaps: Completeness and Lack of false information — exactly what a deterministic core and completeness checks protect.
Failure observed → design response
| Failure observed | Design response |
|---|---|
| Numbers mishandled: levetiracetam steady state and trough misjudged | Deterministic rules |
| Conflicting RAG sources: 43 mg/L read as both normal and elevated | One canonical source |
| Run-to-run variation: the caveats included differ between runs | Template + checks |
| Hallucination risk | Provenance + sign-off |
How I derive the medical workflow
Reverse-engineer the report the lab already produces, anchor every threshold in a citable source, and separate the deterministic from the interpretive. The order: intake → compute → classify → interpret → review → sign.
Sources
- Existing report structures
- Guidelines: AGNP, DEGAM S1 (AWMF)
- SmPCs, PSIAC, CYP/P-gp references
- Expert think-aloud on real cases
- Lab / LIS processes & data gaps
Stakeholders
- Clinical pharmacologists (authors)
- Treating clinicians (consumers)
- Lab scientists / LIS owners
- Quality & Regulatory + safety officer
- Software / ML engineers
Validation
- Traceability: threshold ↔ source
- Expert-authored gold case set
- Rules: ~100% exact on gold set
- LLM: CLEAR + hallucination rate, ≥2 reviewers
- Clinical eval + usability (IEC 62366-1)
Workflow anatomy — the parts the brief asks for
Inputs: Patient (age/sex/weight, renal & hepatic function), drug + dose + schedule, measured value + unit, metabolite / metabolite-to-parent ratio (MPR) if any, sampling timepoint, last dose-change date, last ingestion, co-medications, comorbidities.
Outputs: A signed TDM report: deterministic facts sheet, interpretation narrative with citations, ranked causes, monitoring suggestions, flagged uncertainty & missing data, plus a complete audit log.
Decision points: Critical field missing? · steady state reached? · trough vs peak? · in / out of range? · toxic? · retrieval confidence sufficient?
Exceptions: Unsupported drug → manual route. Missing critical datum → conditional report + request. Implausible value → flag, don't interpret. Low retrieval confidence → “not assessable”.
Escalations: Toxic / critical level → urgent clinician escalation, report flagged. Low-confidence or deviation from standard rec. → senior pharmacologist, expedited review.
Intended use: Drafts a TDM report for a qualified professional to review, edit and sign. It informs; it never decides or auto-releases. (Class IIa, human-in-the-loop.)
In scope: interpreting a measured concentration in context. Out of scope: ordering the test, prescribing, issuing a dose as an order, any case without a measured value.
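The exception and escalation rules above can be sketched as one deterministic triage function. This is an illustration, not the system's implementation: the `CaseState` fields, the `Action` names, the confidence cut-off of 0.7, and the choice to treat an unsupported drug and an implausible value as hard stops are assumptions made for the example.

```python
from dataclasses import dataclass, field
from enum import Enum


class Action(Enum):
    MANUAL_ROUTE = "unsupported drug: manual route"
    FLAG_ONLY = "implausible value: flag, do not interpret"
    URGENT_ESCALATION = "toxic level: urgent clinician escalation"
    CONDITIONAL_REPORT = "missing critical datum: conditional report + request"
    NOT_ASSESSABLE = "low retrieval confidence: not assessable"
    SENIOR_REVIEW = "senior pharmacologist, expedited review"
    ROUTINE_REVIEW = "routine review"


@dataclass
class CaseState:
    drug_supported: bool = True
    value_plausible: bool = True
    toxic: bool = False
    missing_critical: list = field(default_factory=list)
    retrieval_confidence: float = 1.0
    deviates_from_standard: bool = False


def triage(case: CaseState, min_confidence: float = 0.7) -> list:
    """Return every action that applies; the threshold is a placeholder."""
    # Assumed hard stops: nothing downstream is meaningful for these cases.
    if not case.drug_supported:
        return [Action.MANUAL_ROUTE]
    if not case.value_plausible:
        return [Action.FLAG_ONLY]

    actions = []
    if case.toxic:
        actions.append(Action.URGENT_ESCALATION)
    if case.missing_critical:
        actions.append(Action.CONDITIONAL_REPORT)
    low_confidence = case.retrieval_confidence < min_confidence
    if low_confidence:
        actions.append(Action.NOT_ASSESSABLE)
    if low_confidence or case.deviates_from_standard:
        actions.append(Action.SENIOR_REVIEW)
    return actions or [Action.ROUTINE_REVIEW]


print(triage(CaseState(toxic=True, missing_critical=["last ingestion"])))
```

Returning a list rather than a single route lets a toxic level and a missing datum both surface on the same case instead of one masking the other.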
The separation test
One question decides where each reasoning element lives: is the output fully determined by the inputs plus a fixed, citable rule?
Standardise
Intake schema · all deterministic computations · report structure · mandatory safety caveats · classification language · escalation triggers. Variability here is a defect.
Leave to the clinician
The final dosing decision · patient-specific weighting of causes · integration with the full clinical picture · any deviation from the standard recommendation (with rationale). The clinician signs; the system never does.
How it is medically validated
Two tracks. Deterministic part: unit tests against the gold set — exact match on thresholds, classification, steady-state, flags. Interpretive part: blinded CLEAR + hallucination-rate scoring by ≥2 pharmacologists, plus usability and clinical evaluation. Every threshold traces to its source; the full V&V and monitoring stack is in §08.
Prototype workflow — live
Antiepileptic monitoring (levetiracetam), grounded in AGNP & Patsalos. The flow is drug-agnostic; the module supplies the numbers — change the inputs and the facts sheet recomputes.
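To illustrate "the module supplies the numbers", here is a hedged sketch of a versioned drug module and the classification it drives. The `DrugModule` fields, the `facts_sheet` helper, the drug name `example-drug`, and every threshold are placeholders invented for the example; they are not clinical reference values and not the prototype's actual module.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DrugModule:
    drug: str
    version: str
    source: str        # the one canonical source for the range
    unit: str
    lower: float       # lower bound of the therapeutic range
    upper: float       # upper bound of the therapeutic range
    toxic_from: float  # level at or above which the result counts as toxic

    def classify(self, value: float) -> str:
        """Map a measured value to low / therapeutic / elevated / toxic."""
        if value >= self.toxic_from:
            return "toxic"
        if value > self.upper:
            return "elevated"
        if value < self.lower:
            return "low"
        return "therapeutic"


def facts_sheet(module: DrugModule, value: float) -> dict:
    """Assemble the deterministic facts the narrative is later written from."""
    return {
        "drug": module.drug,
        "module_version": module.version,
        "value": value,
        "unit": module.unit,
        "classification": module.classify(value),
        "range_source": module.source,
    }


# Placeholder numbers only; a real module would cite its guideline source.
MODULE = DrugModule(
    drug="example-drug", version="0.1.0", source="hypothetical guideline table",
    unit="mg/L", lower=10.0, upper=40.0, toxic_from=100.0,
)

print(facts_sheet(MODULE, 43.0))
```

Swapping in a different `DrugModule` changes the numbers without touching the engine code, which is what allows the engine to be validated once and each module to be qualified separately.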
Main input data (per Adler et al.)
The workflow
Latency: rule engine + facts sheet < ~50 ms (sync) · retrieval ~100–400 ms · LLM synthesis + guardrail ~2–8 s (async). The toxicity fast-path fires on the synchronous spine, independent of the LLM.
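The split between the synchronous spine and the asynchronous LLM step can be sketched with `asyncio`. Everything here is a stand-in: `rule_engine`, `draft_narrative`, the toxic threshold, and the short sleep that simulates retrieval plus synthesis are assumptions for illustration, and no real model is called.

```python
import asyncio


def rule_engine(value: float, toxic_from: float) -> dict:
    """Synchronous spine: compute the facts without any model call."""
    return {"value": value, "toxic": value >= toxic_from}


async def draft_narrative(facts: dict) -> str:
    """Stand-in for retrieval + LLM synthesis + guardrail (the slow part)."""
    await asyncio.sleep(0.01)
    return f"Draft based on facts sheet: {facts}"


async def handle_case(value: float, toxic_from: float, alerts: list):
    facts = rule_engine(value, toxic_from)
    if facts["toxic"]:
        # Toxicity fast-path: raised before the LLM is even scheduled.
        alerts.append("URGENT: toxic level")
    draft_task = asyncio.create_task(draft_narrative(facts))
    alert_before_draft = bool(alerts) and not draft_task.done()
    draft = await draft_task
    return facts, draft, alert_before_draft


alerts = []
facts, draft, alert_before_draft = asyncio.run(handle_case(120.0, 100.0, alerts))
print(alerts, alert_before_draft)
```

The alert is appended before the drafting task has run at all, so a slow or failed LLM call cannot delay the escalation.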
Rule-engine demo
Designing the LLM agent
The agent is the language and synthesis layer on top of a deterministic spine. It composes, explains and flags — it is never the calculator, the classifier, or the decision-maker.
Deriving the agent's role
The role falls out of five drivers.
- Reports are slow & knowledge-heavy → automate the draft, not the decision.
- Output must match the standard report structure → constrained, templated generation.
- Experts settle PK-state first, then reason → rules run first; LLM only after the facts sheet.
- Structured case + curated KB → retrieval-grounded, not free recall.
- Hallucination & numeric error are the PoC's failure modes → no numbers from the LLM, provenance + sign-off.
What it may do · must not do · where a human is mandatory
| Allowed — LLM | Not allowed — must be rules | Mandatory human review |
|---|---|---|
| Draft narrative from the facts sheet | Steady-state determination | Every report |
| Explain reasoning in clinical prose | Trough vs peak; range classification | Any toxic / critical flag |
| Enumerate causes for out-of-range values | Toxicity thresholds & safety flags | Any missing critical data |
| Identify missing / ambiguous inputs | MPR computation; interaction detection | Any deviation from standard rec. |
| Phrase monitoring advice as suggestions | Unit conversion | Any low-retrieval-confidence case |
| Summarise retrieved knowledge with citations | All other numeric / threshold logic (the PoC's failure zone) | Nothing auto-releases |
Identifying & selecting the knowledge sources
The knowledge base is curated and closed — not the open web. Selection criteria: authority (guideline body, SmPC, peer-reviewed), jurisdiction (German / EU), recency & version, and coverage of the active drug module — each source tagged with authority, version, jurisdiction and effective date.
RAG, from a medical-expert view
The decisive move: conflict resolution is pre-decided. For each parameter there is one canonical source, so the agent can't oscillate.
Every retrieved fact shows its provenance; numeric claims are grounded in the facts sheet, not retrieved prose; missing knowledge yields “not assessable”, not a guess; low retrieval confidence escalates to a human.
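A minimal sketch of pre-decided conflict resolution, assuming a small in-memory knowledge base. The `SourceEntry` fields mirror the tags named above (authority, version, jurisdiction, effective date); the source ids, the drug name, and the range strings are hypothetical placeholders, and `resolve` is an illustrative helper rather than the system's actual retrieval API.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class SourceEntry:
    source_id: str
    authority: str
    version: str
    jurisdiction: str
    effective: date
    value: str


# Two sources disagree on the same parameter (placeholder content).
KB = {
    ("example-drug", "reference_range"): [
        SourceEntry("guideline-A", "guideline body", "2018", "DE/EU",
                    date(2018, 1, 1), "10-40 mg/L"),
        SourceEntry("review-B", "peer-reviewed", "2008", "international",
                    date(2008, 1, 1), "12-46 mg/L"),
    ],
}

# The decision is made once, at curation time, not by the model at run time.
CANONICAL = {("example-drug", "reference_range"): "guideline-A"}


def resolve(drug: str, parameter: str) -> Optional[SourceEntry]:
    """Return the canonical entry, or None so the caller reports 'not assessable'."""
    canonical_id = CANONICAL.get((drug, parameter))
    for entry in KB.get((drug, parameter), []):
        if entry.source_id == canonical_id:
            return entry
    return None


print(resolve("example-drug", "reference_range"))
print(resolve("example-drug", "half_life"))  # no knowledge: None, not a guess
```

Because the lookup never sees the non-canonical entry, the same measured value cannot be read against two ranges in two runs.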
The RAG conflict — and the fix
A levetiracetam trough of 43 mg/L:
Human-in-the-loop
Drafts arrive asynchronously in a prioritised reviewer worklist (toxic/critical first). The reviewer sees the facts sheet (with provenance), the LLM draft (with citations & flagged uncertainty), the missing-info list, and any alerts; they edit, confirm safety flags, and sign. Routine → one reviewer; toxic / low-confidence → senior clinician, expedited. Edits become training & eval data.
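The prioritised worklist can be sketched as a priority queue. The priority levels and the `Worklist` class are assumptions for illustration; the only behaviour taken from the text is that toxic or critical cases are served first, and the example adds first-in-first-out ordering within a level.

```python
import heapq
import itertools

# Assumed ordering: lower number is reviewed first.
PRIORITY = {"toxic": 0, "low_confidence": 1, "routine": 2}


class Worklist:
    """Reviewer queue: highest priority first, FIFO within a priority level."""

    def __init__(self):
        self._heap = []
        self._arrival = itertools.count()  # tie-breaker keeps arrival order

    def add(self, case_id: str, kind: str) -> None:
        heapq.heappush(self._heap, (PRIORITY[kind], next(self._arrival), case_id))

    def next_case(self) -> str:
        return heapq.heappop(self._heap)[2]

    def __len__(self) -> int:
        return len(self._heap)


worklist = Worklist()
worklist.add("case-A", "routine")
worklist.add("case-B", "toxic")
worklist.add("case-C", "routine")
worklist.add("case-D", "low_confidence")

order = [worklist.next_case() for _ in range(4)]
print(order)
```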
Evaluate & monitor after deployment
Log every case end-to-end. Track reviewer edit rate (draft-quality proxy), safety-flag overrides, hallucination incidents, retrieval confidence, and drift. Periodic re-scoring vs gold standard, strict change control. The full assurance stack is in §08.
The agent, step by step
Eight steps (levetiracetam instance). The rule engine runs first and produces the facts sheet; the LLM is always downstream — and runs asynchronously, never in the clinician's blocking path.
Implementation blueprint — a portable reference agent
The same agent design as a framework-agnostic project: an orchestrator that delegates to sub-agents, provider-agnostic JSON tool schemas, reusable skills, RAG over a curated KB, and multi-provider model routing. It maps to the Claude Agent SDK, LangGraph or the OpenAI Agents SDK — only the bindings change.
Why not 'agents everywhere'
Multi-agent systems can cost roughly 15× the tokens of a single chat (published multi-agent benchmarks), and help only when work is genuinely parallel. So the sequential clinical core is a single guarded agent; sub-agents are reserved for the parallel retrieval breadth and offline evaluation. Complexity is added only where it earns its cost.
Evaluation & assurance — proving the system works
Confidence comes from a layered assurance stack: deterministic tests, RAG metrics, agent/orchestrator checks, an LLM judge calibrated against humans, guardrails, and live monitoring — each mapped to MDR / IEC 62304 / AI Act verification.
The assurance stack
Deterministic tests: Unit + integration tests run the gold set through the rule engine and assert exact matches on thresholds, classification, steady-state, trough/peak and every safety flag. Anything below ~100% blocks the build.
RAG metrics: RAGAS scores the retrieval + grounding: faithfulness (anti-hallucination), context precision & recall, context-entity recall, response relevancy and noise sensitivity. Faithfulness is a hard release gate at ≥0.90: an unsupported numeric claim in a TDM report can drive a wrong dose, so the cost of a false statement is clinical, not cosmetic.
Agent / orchestrator checks: Beyond the final text, we score tool-call accuracy + F1 (right tool, right args, right order), agent goal accuracy (did it reach the intended outcome), and topic adherence (did it stay inside the TDM scope). Both end-state and trajectory are scored, because an agent can be right via an unsafe path.
LLM judge: An LLM judge scores each draft on the CLEAR rubric (G-Eval style) against the reference report. It is calibrated against ≥2 human pharmacologists, uses a different model family than the writer to limit self-preference, and randomises answer order against position bias. The judge augments human review; it never replaces sign-off.
Guardrails: Guardrail unit tests (every number re-checked vs the facts sheet, every claim vs a citation), groundedness / hallucination detection, escalation tests (toxic level must always block release), and adversarial / red-team prompts (prompt injection, out-of-scope asks).
Live monitoring: Every case is traced end-to-end. We track reviewer edit rate (draft-quality proxy), safety-flag override rate, retrieval confidence, draft-latency p50/p95, escalation-SLA adherence, and score drift; re-score a 20% live sample with RAGAS. After an alarm we roll back to the last validated config, widen the gold set, or re-calibrate the judge, always under change control; the ≥2-human judge re-calibration runs quarterly. No silent model/prompt/module update.
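The guardrail rule that every number in the draft is re-checked against the facts sheet can be sketched with a regular expression. This is a deliberately simple illustration: the function name, the facts-sheet keys, and the range values are hypothetical, and a production verifier would also need to handle units, dates, and number formats this pattern ignores.

```python
import re

# Matches integers and decimals that are not part of a longer word or number.
NUMBER = re.compile(r"(?<![\w.])\d+(?:\.\d+)?")


def unsupported_numbers(draft: str, facts: dict) -> list:
    """Return numeric tokens in the draft that do not appear in the facts sheet."""
    allowed = {
        float(v)
        for v in facts.values()
        if isinstance(v, (int, float)) and not isinstance(v, bool)
    }
    return [tok for tok in NUMBER.findall(draft) if float(tok) not in allowed]


# Placeholder facts sheet; only the 43 mg/L value is taken from the text.
facts = {"value_mg_l": 43.0, "range_lower": 10, "range_upper": 40, "toxic": False}

ok_draft = "The trough of 43 mg/L lies above the reference range of 10-40 mg/L."
bad_draft = "The trough of 43 mg/L is elevated; consider reducing to 500 mg."

print(unsupported_numbers(ok_draft, facts))
print(unsupported_numbers(bad_draft, facts))
```

Any non-empty result would block the draft from the worklist, since the model has introduced a number the rule engine never produced.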
RAGAS metric explorer
The metrics we gate on, what each measures, and the target for a clinical-grade TDM report.
Tool-call verification — were the right tools actually called?
Each run's tool trace is checked against the reference trace (conformant vs faulty below). The same check runs online on every production case: a run that deviates from the tool contract — wrong tool/args/order, or a missing required call like interaction_check — is blocked from the worklist and routed to a human.
Expected (reference) trace
Actual trace
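A hedged sketch of the trace check: the actual tool trace is compared with the reference trace for missing calls, unexpected tools, wrong arguments, and wrong order. Only `interaction_check` is a tool name from the text; the other tool names, the argument shapes, and the simplifying assumption that each tool is called at most once are invented for the example.

```python
def check_trace(actual: list, reference: list) -> list:
    """Return contract violations; an empty list means the run is conformant.

    Each trace is a list of (tool_name, args_dict) pairs. Assumes each tool
    appears at most once per run.
    """
    problems = []
    ref_args = dict(reference)
    seen = [name for name, _ in actual]

    for name in ref_args:
        if name not in seen:
            problems.append(f"missing required call: {name}")
    for name, args in actual:
        if name not in ref_args:
            problems.append(f"unexpected tool: {name}")
        elif args != ref_args[name]:
            problems.append(f"wrong arguments: {name}")

    # Compare the relative order of the calls both traces share.
    expected_order = [n for n, _ in reference if n in seen]
    observed_order = [n for n in seen if n in ref_args]
    if observed_order != expected_order:
        problems.append("wrong call order")
    return problems


REFERENCE = [
    ("compute_facts", {"case_id": "c1"}),        # hypothetical tool
    ("interaction_check", {"case_id": "c1"}),    # named in the text
    ("retrieve_sources", {"drug": "example-drug"}),  # hypothetical tool
]
conformant = list(REFERENCE)
faulty = [REFERENCE[0], REFERENCE[2]]  # interaction_check was skipped

print(check_trace(conformant, REFERENCE))
print(check_trace(faulty, REFERENCE))
```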
Release / readiness gates (illustrative)
Internal-testing readiness is a green board. A single regression (a metric below target) blocks release; a metric that still passes but is trending the wrong way (amber) triggers review.
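The board logic can be sketched as a small gate function. Several things here are assumptions: amber is read as "still passing, but worse than the last validated run", all metrics are treated as higher-is-better, and apart from the faithfulness target of 0.90 stated earlier, the metric names, targets, and scores are hypothetical.

```python
def gate_status(value: float, target: float, previous: float = None) -> str:
    """red = below target; amber = passing but worse than last validated run."""
    if value < target:
        return "red"
    if previous is not None and value < previous:
        return "amber"
    return "green"


def release_allowed(board: dict) -> bool:
    """A single red gate blocks the release; amber only triggers review."""
    return all(status != "red" for status in board.values())


# (current value, target, value at last validated run) with placeholder scores.
metrics = {
    "faithfulness": (0.93, 0.90, 0.95),
    "context_recall": (0.88, 0.85, 0.86),      # hypothetical target
    "tool_call_accuracy": (0.97, 0.95, 0.97),  # hypothetical target
}

board = {name: gate_status(*vals) for name, vals in metrics.items()}
print(board, release_allowed(board))
```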
The thread back to MDR: this stack is the software verification & validation (IEC 62304), the AI-Act accuracy/robustness evidence (Art. 15), the clinical performance data (MDCG 2020-1), and the post-market plan (Art. 72) — one assurance system serving both the medical-device file and engineering confidence.
Pragmatic under regulation
Intended use (Zweckbestimmung) is the biggest lever. Frame the device to inform and keep a human in the loop, and the regulatory and validation burden stays tractable.
MDR class explorer
The IMDRF significance framework (adopted in MDCG 2019-11) maps what the output does × how serious the situation is. Change the selectors:
(IMDRF grid)
MDR Rule 11's text defaults decision-support software to Class IIa. We adopt IIa as the conservative baseline and keep the human-in-the-loop so the device only informs — a narrow-therapeutic-index module in a critical setting can be re-assessed toward IIb in its own risk file.
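The selector logic of the class explorer can be sketched as a lookup table. The mapping below is my transcription of the Rule 11 classification table in MDCG 2019-11 (Annex III) and should be verified against the guidance text before any regulatory use; the key names and the `mdr_class` helper are hypothetical.

```python
# Assumption: transcribed from the MDCG 2019-11 Annex III table; verify before use.
# Key: (state of healthcare situation, significance of the information provided)
MDR_CLASS = {
    ("critical", "treat_or_diagnose"): "III",
    ("critical", "drive"): "IIb",
    ("critical", "inform"): "IIa",
    ("serious", "treat_or_diagnose"): "IIb",
    ("serious", "drive"): "IIa",
    ("serious", "inform"): "IIa",
    ("non_serious", "treat_or_diagnose"): "IIa",
    ("non_serious", "drive"): "IIa",
    ("non_serious", "inform"): "IIa",
}


def mdr_class(situation: str, significance: str) -> str:
    """Look up the indicative MDR class for one cell of the grid."""
    return MDR_CLASS[(situation, significance)]


print(mdr_class("serious", "inform"))
print(mdr_class("critical", "drive"))
```

Read this way, the grid shows why the intended use matters: keeping the output at "inform" holds every row at IIa, while moving to "drive" in a critical setting is the cell that reaches IIb.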
The German + EU regulatory stack
MDR — Reg. (EU) 2017/745
Rule 11: decision-support software is Class IIa (→ IIb/III by harm severity). Class IIa+ ⇒ notified-body conformity assessment.
MPDG (national)
The Medizinprodukterecht-Durchführungsgesetz flanks the directly-applicable MDR — competent authorities, language, clinical-investigation procedure, penalties. Replaced the MPG.
BfArM — competent authority
The federal authority for medical devices (DIMDI was merged into BfArM in 2020).
Clinical evaluation (MDR Annex XIV)
Under limited data: an equivalence/literature route backed by gold-set performance + the PoC, escalating to a focused clinical investigation (ISO 14155) only where the literature gap demands it.
Not a DiGA
DiGA fast-track (DVG/DiGAV, §139e SGB V) is for patient-facing apps. A clinician-facing TDM tool stays plain CE-marked SaMD — the DiGA door is the wrong one.
EU AI Act — Reg. (EU) 2024/1689
SaMD needing a notified body ⇒ high-risk via Art. 6(1) + Annex I. Obligations from 2 Aug 2027; the Digital Omnibus (provisional, May 2026) moves product-embedded high-risk to 2 Aug 2028.
DSGVO / BDSG + §203 StGB
Drug levels are special-category health data (Art. 9). Third-country transfer + confidentiality push toward EU-hosted / on-prem / local-model processing — our routing policy.
Notified bodies (DE)
TÜV SÜD, TÜV Rheinland, DEKRA, mdc — German notified bodies already fold AI-Act expectations into MDR reviews, so we design for both now.
MDCG 2019-11 Rev.1 (2025)
The June-2025 revision of the software qualification/classification guidance now explicitly covers AI, modular software and EHR interoperability.
EU AI Act — the obligations
As a high-risk AI system the device carries: risk management (Art. 9), data governance (Art. 10), technical documentation (Art. 11), logging (Art. 12), transparency (Art. 13), human oversight (Art. 14) — our review-and-sign step — accuracy/robustness/cybersecurity (Art. 15), and post-market monitoring (Art. 72). These largely overlap with MDR/ISO 13485/14971 — design once, evidence for both.
Standards we build against
The HITL + narrow intended use are the principal risk controls in the ISO 14971 file: “wrong steady-state → wrong interpretation” is mitigated by deterministic computation, provenance, the guardrail, and expert sign-off.
Development phases & readiness gates
Discovery & spec
Elicitation, report reverse-engineering, intended-use & risk file, gold case set.
Deterministic core
Rule engine + first drug module.
gate: ~100% on gold
LLM + RAG + guardrail
Curated KB, grounded generation, output verifier.
gate: CLEAR + faithfulness
Internal validation
End-to-end on gold set, ≥2 pharmacologists, usability.
Shadow-mode pilot
System drafts; experts compare to own reports.
gate: edit rate / safety
Conformity & release
Then module-by-module expansion.
gate: new-module qual.
Balancing the trade-offs
Narrow intended use
Lowers the class, shrinks the validation surface, speeds the path — and raises safety. The single biggest lever.
Rule / LLM split
Move fast on low-risk narrative; the safety-critical core is deterministic, testable, provable.
Modular architecture
Scale without a validation explosion: validate the engine once, qualify modules incrementally.
Don't fine-tune first
Context engineering + function calling + grounded RAG + guardrails. Fine-tuning & local models (MedGemma, Meditron) are a phase-2 lever for performance & on-prem privacy.
HITL as enabler
Safety control, regulatory enabler (keeps it 'informing' → IIa), and a data engine via captured edits.
What we trade
Less automation and narrower coverage now — in exchange for safety, speed, and validatability. Up-front curation pays for both validation and monitoring.