Therapeutic Drug Monitoring · Rule-based workflows · LLM agents

Rules for the numbers. Language for the rest.

A generic, MDR-aligned system that drafts TDM reports: a deterministic rule engine owns every number and safety threshold, an LLM writes the interpretation, and a clinician signs every report.

Built directly on medicalvalues' own proof-of-concept

Rule-based · LLM · Retrieval / RAG · Human · Escalation · Output
00 / DESIGN THESIS

One decision drives the whole design

The proof-of-concept showed exactly how LLMs break in TDM — wrong steady-state, oscillating ranges, hallucinated numbers. So the numbers go to rules, the prose to the model, the signature to a human.

Commitment 1

Hard rule / LLM split

If an output is fully determined by the inputs plus a citable rule, it is computed deterministically. Only synthesis, narrative and gap-finding go to the LLM — and every LLM claim is grounded in a cited source.

Commitment 2

Generic engine, drug modules

The engine is drug-agnostic; ranges, half-lives and interactions live in versioned drug modules. Validate the engine once, then qualify each module before switching it on.
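A minimal sketch of what "versioned drug module" could mean in code. The class and field names are my assumptions, not the PoC's; the levetiracetam values reuse the half-life (7, assumed to be hours) and the AGNP 2018 range quoted elsewhere in this walkthrough, and the version string and effective date are placeholders.

```python
from dataclasses import dataclass, replace
from datetime import date
from typing import Dict, Optional


@dataclass(frozen=True)
class DrugModule:
    """One versioned drug module (field names are illustrative assumptions)."""
    drug: str
    version: str
    effective: date
    half_life_h: float      # elimination half-life, hours
    range_low: float        # canonical reference range, lower limit
    range_high: float       # canonical reference range, upper limit
    unit: str
    source: str             # citable source of the range
    qualified: bool = False # set only after module qualification


class ModuleRegistry:
    """The drug-agnostic engine only ever sees qualified modules."""

    def __init__(self) -> None:
        self._modules: Dict[str, DrugModule] = {}

    def register(self, module: DrugModule) -> None:
        self._modules[module.drug] = module

    def get(self, drug: str) -> Optional[DrugModule]:
        module = self._modules.get(drug)
        # Unknown or not-yet-qualified modules behave like an unsupported drug.
        if module is None or not module.qualified:
            return None
        return module


# Placeholder version and date; range and half-life as quoted in this document.
draft = DrugModule("levetiracetam", "0.1.0", date(2024, 1, 1),
                   7.0, 10.0, 40.0, "mg/L", "AGNP 2018")
registry = ModuleRegistry()
registry.register(draft)                           # registered, still switched off
registry.register(replace(draft, qualified=True))  # switched on after qualification
```

Keeping `qualified` on the module itself means "switching on" a drug is a reviewed data change, not a code change.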

Commitment 3

Narrow intended use

The system produces a draft a qualified professional reviews and signs. It informs the decision; it never makes it. That keeps the device manageable under MDR.

Commitment 4

AI Act & DSGVO by design

We treat it as a high-risk AI system. Traceable logging and an on-prem option for health data (DSGVO) are built in from day one.

01 / UNDERSTANDING Answers → Discussion 1 · understanding TDM

A TDM interpretation is not "the value is X"

The interpretation — not the number — is the deliverable, and it runs in a fixed order. I exploit that order to split what a rule must compute from what an LLM may phrase.

The order I exploit

  • Settle the pharmacokinetics first: steady state and trough vs peak. [deterministic · rule]
  • Classify against the one canonical reference range: low · therapeutic · elevated · toxic. [deterministic · rule]
  • Reason about causes & context: adherence, organ function, interactions, metabolizer status. [interpretive · LLM, human-confirmed]

Steady-state & trough must be settled first

They condition every downstream judgement — and they are exactly where the PoC's models stumbled. So they are computed deterministically, before a word is written.
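As a sketch of how these two checks could be computed before any text is generated: the five-half-lives rule of thumb and the one-hour trough window below are my assumptions; in the real system both would come from the drug module, and the function names are hypothetical.

```python
from datetime import datetime


def steady_state_reached(last_dose_change: datetime, sampled_at: datetime,
                         half_life_h: float, n_half_lives: float = 5.0) -> bool:
    """True if at least n_half_lives have elapsed since the last dose change."""
    elapsed_h = (sampled_at - last_dose_change).total_seconds() / 3600
    return elapsed_h >= n_half_lives * half_life_h


def sample_type(last_ingestion: datetime, sampled_at: datetime,
                dosing_interval_h: float, trough_window_h: float = 1.0) -> str:
    """Label a sample by its timing relative to the last ingestion."""
    since_dose_h = (sampled_at - last_ingestion).total_seconds() / 3600
    if since_dose_h < 0:
        return "implausible"   # sample timestamp precedes the ingestion
    if since_dose_h >= dosing_interval_h - trough_window_h:
        return "trough"        # drawn shortly before the next dose is due
    return "non-trough"
```

With a half-life of 7 h, a sample drawn 34 h after the last dose change is not yet at steady state under this rule; one drawn at 35 h is.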

I treat Adler et al., Table 1 as the output spec to reverse-engineer, not something to re-derive (see §03).

02 / EVIDENCE BASE Answers → evidence · the spec

The proof-of-concept is our specification

Adler et al. scored standalone GPT-4o and Gemini 2.0 Flash on 10 fictional TDM cases (curated RAG, CLEAR framework): structured, but only "average" (12–18 / 25). The study showed where they fail — and each failure becomes a design rule. Caveat I'd flag: n = 10, single rater — hypothesis-generating, not confirmatory.

[Chart: CLEAR scores by category (scale 1–5); series: Gemini 2.0 Flash, GPT-4o]

Biggest gaps: Completeness and Lack of false information — exactly what a deterministic core and completeness checks protect.

Failure observed → design response

Numbers mishandled (levetiracetam steady-state & trough misjudged) → deterministic rules
Conflicting RAG sources (43 mg/L read as both normal and elevated) → one canonical source
Run-to-run variation (caveats differ between runs) → template + checks
Hallucination risk → provenance + sign-off
03 / TASK 1A Answers → Task 1a · deriving the workflow

How I derive the medical workflow

Reverse-engineer the report the lab already produces, anchor every threshold in a citable source, and separate the deterministic from the interpretive. The order: intake → compute → classify → interpret → review → sign.

Sources

  • Existing report structures
  • Guidelines: AGNP, DEGAM S1 (AWMF)
  • SmPCs, PSIAC, CYP/P-gp references
  • Expert think-aloud on real cases
  • Lab / LIS processes & data gaps

Stakeholders

  • Clinical pharmacologists (authors)
  • Treating clinicians (consumers)
  • Lab scientists / LIS owners
  • Quality & Regulatory + safety officer
  • Software / ML engineers

Validation

  • Traceability: threshold ↔ source
  • Expert-authored gold case set
  • Rules: ~100% exact on gold set
  • LLM: CLEAR + hallucination rate, ≥2 reviewers
  • Clinical eval + usability (IEC 62366-1)

Workflow anatomy — the parts the brief asks for

Required inputs

Patient (age/sex/weight, renal & hepatic function), drug + dose + schedule, measured value + unit, metabolite/MPR if any, sampling timepoint, last dose-change date, last ingestion, co-medications, comorbidities.
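One way to make the intake explicit is a typed case record plus a completeness check. This is a reduced sketch: the field names and the choice of which fields count as critical are my assumptions, to be fixed with the pharmacologists.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional


@dataclass
class TdmCase:
    """Reduced intake record; field names are illustrative."""
    drug: Optional[str] = None
    dose_mg: Optional[float] = None
    schedule: Optional[str] = None
    measured_value: Optional[float] = None
    unit: Optional[str] = None
    sampled_at: Optional[datetime] = None
    last_ingestion: Optional[datetime] = None
    last_dose_change: Optional[datetime] = None
    co_medications: List[str] = field(default_factory=list)
    comorbidities: List[str] = field(default_factory=list)


# Assumption: without these, steady state, trough/peak or the range check
# cannot be computed, so the report can only be conditional.
CRITICAL_FIELDS = ("drug", "measured_value", "unit",
                   "sampled_at", "last_ingestion", "last_dose_change")


def missing_critical(case: TdmCase) -> List[str]:
    """Names of critical fields that are still empty."""
    return [name for name in CRITICAL_FIELDS if getattr(case, name) is None]
```

The returned list doubles as the "missing data" section of the report and as the request sent back to the submitter.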

Outputs

A signed TDM report: deterministic facts sheet, interpretation narrative with citations, ranked causes, monitoring suggestions, flagged uncertainty & missing data — plus a complete audit log.

Decision points

Critical field missing? · steady state reached? · trough vs peak? · in / out of range? · toxic? · retrieval confidence sufficient?

Exception handling

Unsupported drug → manual route. Missing critical datum → conditional report + request. Implausible value → flag, don't interpret. Low retrieval confidence → “not assessable”.
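The four exceptions above reduce to a small routing function. The order of the checks and the confidence threshold of 0.7 are assumptions for illustration, not validated values.

```python
from enum import Enum
from typing import Sequence


class Route(Enum):
    MANUAL = "manual route"
    FLAG_ONLY = "flag, don't interpret"
    CONDITIONAL = "conditional report + request"
    NOT_ASSESSABLE = "not assessable"
    DRAFT = "draft report"


def route_case(drug_supported: bool, value_plausible: bool,
               missing_fields: Sequence[str], retrieval_confidence: float,
               min_confidence: float = 0.7) -> Route:
    """Pick the handling route; earlier checks win (assumed precedence)."""
    if not drug_supported:
        return Route.MANUAL
    if not value_plausible:
        return Route.FLAG_ONLY
    if missing_fields:
        return Route.CONDITIONAL
    if retrieval_confidence < min_confidence:
        return Route.NOT_ASSESSABLE
    return Route.DRAFT
```

Because the function is pure, every route can be pinned by a gold-set test.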

Escalation paths

Toxic / critical level → urgent clinician escalation, report flagged. Low-confidence or deviation from standard rec. → senior pharmacologist, expedited review.

Intended use

Drafts a TDM report for a qualified professional to review, edit and sign. It informs; it never decides or auto-releases. (Class IIa, human-in-the-loop.)

Workflow boundaries

In scope: interpreting a measured concentration in context. Out of scope: ordering the test, prescribing, issuing a dose as an order, any case without a measured value.

The separation test

One question decides where each reasoning element lives: is the output fully determined by the inputs plus a fixed, citable rule?

Standardise

Intake schema · all deterministic computations · report structure · mandatory safety caveats · classification language · escalation triggers. Variability here is a defect.

Leave to the clinician

The final dosing decision · patient-specific weighting of causes · integration with the full clinical picture · any deviation from the standard recommendation (with rationale). The clinician signs; the system never does.

How it is medically validated

Two tracks. Deterministic part: unit tests against the gold set — exact match on thresholds, classification, steady-state, flags. Interpretive part: blinded CLEAR + hallucination-rate scoring by ≥2 pharmacologists, plus usability and clinical evaluation. Every threshold traces to its source; the full V&V and monitoring stack is in §08.

04 / TASK 1B · PROTOTYPE Answers → Task 1b · prototype workflow (antiepileptics)

Prototype workflow — live

Antiepileptic monitoring (levetiracetam), grounded in AGNP & Patsalos. The flow is drug-agnostic; the module supplies the numbers — change the inputs and the facts sheet recomputes.

Main input data (per Adler et al.)

Patient: age · sex · weight / height / BMI · smoking status
Drug & dose: substance · dosing scheme · medications without measurement
Measured value: concentration + unit · trough vs peak
Reference range: lower & upper limit (canonical source)
Metabolite / MPR: metabolite value + range · MPR + range · sum + range (if any)
Last dose adjustment: date (drives steady-state assessment)
Ingestion & sampling: last ingestion datetime · blood-withdrawal timepoint
Co-medications: full list → interaction & enzyme check
Comorbidities: renal / hepatic function · age-related PK

The workflow

Structured intake: patient · drug · value · sampling · co-meds
↓
Completeness & plausibility: critical field missing? yes → flag conditional, request datum
↓
PK-state determination: steady state? trough vs peak?
↓
Reference-range classification
↓
Safety & sampling checks → flags (if toxic → URGENT escalation to clinician)
↓
LLM synthesis: narrative · causes · recommendations, grounded on facts
↓
Deterministic output guardrail
↓
Mandatory human review & sign-off
↓
Validated report + audit log

Latency: rule engine + facts sheet < ~50 ms (sync) · retrieval ~100–400 ms · LLM synthesis + guardrail ~2–8 s (async). The toxicity fast-path fires on the synchronous spine, independent of the LLM.
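A sketch of that ordering with `asyncio`: the rule engine and the escalation run synchronously, the LLM call is scheduled as a task and cannot delay either. The function names are hypothetical and `asyncio.sleep` stands in for the model call.

```python
import asyncio
from typing import Dict, List


def rule_engine(value: float, toxic_threshold: float) -> Dict[str, object]:
    """Synchronous spine: build the facts sheet, including the toxicity flag."""
    return {"value": value, "toxic": value >= toxic_threshold}


async def llm_synthesis(facts: Dict[str, object], log: List[str]) -> str:
    await asyncio.sleep(0.01)   # stand-in for retrieval + model latency
    log.append("draft ready")
    return "Draft narrative grounded on the facts sheet."


async def handle_case(value: float, toxic_threshold: float,
                      log: List[str]) -> Dict[str, object]:
    facts = rule_engine(value, toxic_threshold)
    if facts["toxic"]:
        log.append("urgent escalation")   # fires before any LLM work starts
    draft = asyncio.create_task(llm_synthesis(facts, log))
    log.append("facts sheet returned")    # the clinician already has the facts
    await draft                           # the draft arrives later
    return facts
```

For a toxic value the log reads: escalation, facts sheet, then draft. The threshold is a parameter here because it belongs to the drug module, not to the orchestration code.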

Rule-engine demo

Facts sheet deterministic
Illustrative thresholds; in production every parameter comes from a curated, versioned source (AGNP, Patsalos, SmPC), never hard-coded. Not for clinical use.
05 / TASK 2A Answers → Task 2a · designing the agent

Designing the LLM agent

The agent is the language and synthesis layer on top of a deterministic spine. It composes, explains and flags — it is never the calculator, the classifier, or the decision-maker.

Deriving the agent's role

The role falls out of five drivers.

Clinical need

Reports are slow & knowledge-heavy → automate the draft, not the decision.

Reporting requirement

Output must match the standard report structure → constrained, templated generation.

Expert workflow

Experts settle PK-state first, then reason → rules run first; LLM only after the facts sheet.

Available data

Structured case + curated KB → retrieval-grounded, not free recall.

Safety constraint

Hallucination & numeric error are the PoC's failure modes → no numbers from the LLM, provenance + sign-off.

Role: a retrieval-grounded drafting & explanation assistant that turns a deterministic facts sheet into a cited, standardized TDM report draft — and surfaces what's missing — for a clinician to review and sign.

What it may do · must not do · where a human is mandatory

Allowed (LLM)

  • Draft narrative from the facts sheet
  • Explain reasoning in clinical prose
  • Enumerate causes for out-of-range values
  • Identify missing / ambiguous inputs
  • Phrase monitoring advice as suggestions
  • Summarise retrieved knowledge with citations

Not allowed (must be rules; all numeric / threshold work, the PoC's failure zone)

  • Steady-state determination
  • Trough vs peak; range classification
  • Toxicity thresholds & safety flags
  • MPR computation; interaction detection
  • Unit conversion

Mandatory human review

  • Every report
  • Any toxic / critical flag
  • Any missing critical data
  • Any deviation from standard rec.
  • Any low-retrieval-confidence case
  • Nothing auto-releases

Identifying & selecting the knowledge sources

The knowledge base is curated and closed — not the open web. Selection criteria: authority (guideline body, SmPC, peer-reviewed), jurisdiction (German / EU), recency & version, and coverage of the active drug module — each source tagged with authority, version, jurisdiction and effective date.

AGNP consensus (Hiemke et al. 2018) · DEGAM S1 / AWMF · Patsalos 2018 (antiepileptics) · SmPCs (manufacturer PK/ADR) · PSIAC interactions · CYP / P-gp pharmacogenetics

RAG, from a medical-expert view

The decisive move: conflict resolution is pre-decided. For each parameter there is one canonical source, so the agent can't oscillate.

Every retrieved fact shows its provenance; numeric claims are grounded in the facts sheet, not retrieved prose; missing knowledge yields “not assessable”, not a guess; low retrieval confidence escalates to a human.

The RAG conflict — and the fix

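A levetiracetam trough of 43 mg/L is elevated under one source and normal under the other:

Source A · AGNP 2018: 10–40 mg/L
Source B · alt. lab: 12–46 µg/mL (µg/mL and mg/L are numerically identical)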
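The conflict and its resolution fit in a few lines. This sketch assumes AGNP 2018 is the canonical source for this parameter; the table and function names are hypothetical, and the toxic threshold is left as an optional parameter because none is quoted here.

```python
from typing import Optional, Tuple

# Conversion factors to mg/L; µg/mL and mg/L are the same concentration.
TO_MG_PER_L = {"mg/L": 1.0, "µg/mL": 1.0, "ug/mL": 1.0, "ng/mL": 0.001}

# One canonical source per parameter (assumed choice for illustration).
CANONICAL_RANGE = {
    "levetiracetam": {"source": "AGNP 2018", "low": 10.0, "high": 40.0},
}


def classify(value: float, unit: str, low: float, high: float,
             toxic: Optional[float] = None) -> str:
    """Classify a concentration against a range given in mg/L."""
    v = value * TO_MG_PER_L[unit]
    if toxic is not None and v >= toxic:
        return "toxic"
    if v > high:
        return "elevated"
    if v < low:
        return "low"
    return "therapeutic"


def classify_canonical(drug: str, value: float, unit: str) -> Tuple[str, str]:
    """Return (classification, source) using the single canonical range."""
    ref = CANONICAL_RANGE[drug]
    return classify(value, unit, ref["low"], ref["high"]), ref["source"]
```

Against source A the 43 mg/L trough is "elevated", against source B it is "therapeutic"; `classify_canonical` can only ever return the first, together with its provenance.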

Human-in-the-loop

Drafts arrive asynchronously in a prioritised reviewer worklist (toxic/critical first). The reviewer sees the facts sheet (with provenance), the LLM draft (with citations & flagged uncertainty), the missing-info list, and any alerts; they edit, confirm safety flags, and sign. Routine → one reviewer; toxic / low-confidence → senior clinician, expedited. Edits become training & eval data.
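The prioritised worklist is essentially a priority queue. A sketch using the standard-library `heapq`; the three category names and their ranking are my assumptions.

```python
import heapq
import itertools
from typing import List, Optional, Tuple

# Lower number = reviewed first (assumed ranking).
PRIORITY = {"toxic": 0, "low_confidence": 1, "routine": 2}


class ReviewerWorklist:
    """Priority queue of draft reports; first-in, first-out within a category."""

    def __init__(self) -> None:
        self._heap: List[Tuple[int, int, str]] = []
        self._arrival = itertools.count()   # tie-breaker keeps arrival order

    def add(self, case_id: str, category: str) -> None:
        heapq.heappush(self._heap,
                       (PRIORITY[category], next(self._arrival), case_id))

    def next_case(self) -> Optional[str]:
        """Pop the most urgent case, or None if the worklist is empty."""
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]
```

Routing to a senior clinician would hang off the same category field; it is left out to keep the sketch short.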

Evaluate & monitor after deployment

Log every case end-to-end. Track reviewer edit rate (draft-quality proxy), safety-flag overrides, hallucination incidents, retrieval confidence, and drift. Periodic re-scoring vs gold standard, strict change control. The full assurance stack is in §08.

06 / TASK 2B · PROTOTYPE Answers → Task 2b · prototype agent (levetiracetam)

The agent, step by step

Eight steps (levetiracetam instance). The rule engine runs first and produces the facts sheet; the LLM is always downstream — and runs asynchronously, never in the clinician's blocking path.

07 / AGENT ARCHITECTURE Answers → the portable agent architecture

Implementation blueprint — a portable reference agent

The same agent design as a framework-agnostic project: an orchestrator that delegates to sub-agents, provider-agnostic JSON tool schemas, reusable skills, RAG over a curated KB, and multi-provider model routing. It maps to the Claude Agent SDK, LangGraph or the OpenAI Agents SDK — only the bindings change.
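To illustrate what a provider-agnostic tool schema could look like, here is `pk_calculator` described with plain JSON Schema. The argument names follow the reference trace in §08; the wrapper keys (`name`, `description`, `parameters`), the units and the helper are my assumptions, and each SDK binding would wrap this dictionary in its own way.

```python
import json
from typing import Any, Dict, List

PK_CALCULATOR_TOOL: Dict[str, Any] = {
    "name": "pk_calculator",
    "description": "Deterministic steady-state and trough/peak assessment.",
    "parameters": {
        "type": "object",
        "properties": {
            "half_life": {"type": "number",
                          "description": "Elimination half-life (assumed hours)"},
            "last_change": {"type": "string", "format": "date-time"},
            "sampled_at": {"type": "string", "format": "date-time"},
        },
        "required": ["half_life", "last_change", "sampled_at"],
        "additionalProperties": False,
    },
}


def missing_args(tool: Dict[str, Any], args: Dict[str, Any]) -> List[str]:
    """Required arguments the model failed to supply for a tool call."""
    return [k for k in tool["parameters"]["required"] if k not in args]


# The schema is plain data, so it serialises unchanged for any binding.
schema_json = json.dumps(PK_CALCULATOR_TOOL, indent=2)
```

A call with missing arguments is rejected before it reaches the rule engine, which keeps the model from improvising inputs.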

[Diagram: reference agent architecture]

Why not 'agents everywhere'

Multi-agent systems can cost roughly 15× the tokens of a single chat (a figure Anthropic reports for its multi-agent research system), and help only when work is genuinely parallel. So the sequential clinical core is a single guarded agent; sub-agents are reserved for the parallel retrieval breadth and offline evaluation. Complexity is added only where it earns its cost.

08 / EVALUATION & ASSURANCE Answers → Task 2a · evaluate & monitor + assurance

Evaluation & assurance — proving the system works

Confidence comes from a layered assurance stack: deterministic tests, RAG metrics, agent/orchestrator checks, an LLM judge calibrated against humans, guardrails, and live monitoring — each mapped to MDR / IEC 62304 / AI Act verification.

The assurance stack

1
Deterministic rule tests
The non-negotiable floor

Unit + integration tests run the gold set through the rule engine and assert exact matches on thresholds, classification, steady-state, trough/peak and every safety flag — anything below ~100% blocks the build.

pytest · gold_cases.jsonl · CI gate
→ IEC 62304 unit/integration V&V · ISO 14971 risk control verification
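A sketch of the gold-set gate. The three inline cases are invented placeholders in the shape a `gold_cases.jsonl` might take, and `classify` is a stand-in for the rule engine; in CI the same loop would sit inside a pytest test that fails on any non-empty result.

```python
import json
from typing import Callable, List, Tuple

# Placeholder gold cases (made up for illustration, one JSON object per line).
GOLD_JSONL = """\
{"case_id": "G-001", "value": 43.0, "low": 10.0, "high": 40.0, "expected": "elevated"}
{"case_id": "G-002", "value": 25.0, "low": 10.0, "high": 40.0, "expected": "therapeutic"}
{"case_id": "G-003", "value": 6.0, "low": 10.0, "high": 40.0, "expected": "low"}
"""


def classify(value: float, low: float, high: float) -> str:
    """Stand-in for the rule engine's range classification."""
    if value > high:
        return "elevated"
    if value < low:
        return "low"
    return "therapeutic"


def gold_failures(jsonl: str,
                  engine: Callable[[float, float, float], str]
                  ) -> List[Tuple[str, str, str]]:
    """Return (case_id, expected, got) for every case that does not match exactly."""
    failures = []
    for line in jsonl.splitlines():
        if not line.strip():
            continue
        case = json.loads(line)
        got = engine(case["value"], case["low"], case["high"])
        if got != case["expected"]:
            failures.append((case["case_id"], case["expected"], got))
    return failures
```

An empty list means the build may proceed; any entry names the case to investigate.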
2
RAG quality (RAGAS)
Is the retrieval grounded and complete?

RAGAS scores the retrieval + grounding: faithfulness (anti-hallucination), context precision & recall, context-entity recall, response relevancy and noise sensitivity. Faithfulness is a hard release gate at ≥0.90 — an unsupported numeric claim in a TDM report can drive a wrong dose, so the cost of a false statement is clinical, not cosmetic.

RAGAS · DeepEval · CI gate
→ AI Act Art. 15 (accuracy & robustness) · hallucination control
3
Agent / orchestrator evaluation
Did it do the right thing, the right way?

Beyond the final text: tool-call accuracy + F1 (right tool, right args, right order), agent goal accuracy (did it reach the intended outcome), and topic adherence (did it stay inside the TDM scope). Both end-state and trajectory are scored, because an agent can be right via an unsafe path.

RAGAS agentic · DeepEval traces · LangSmith
→ assurance that the rule engine was actually invoked (not faked)
4
Report quality — LLM-as-judge
Calibrated, not trusted blindly

An LLM judge scores each draft on the CLEAR rubric (G-Eval style) against the reference report. It is calibrated against ≥2 human pharmacologists, uses a different model family than the writer to limit self-preference, and randomises answer order against position bias. The judge augments human review; it never replaces sign-off.

eval-judge · G-Eval / CLEAR · ≥2 human raters
→ clinical performance (MDCG 2020-1) + usability (IEC 62366-1)
5
Safety, guardrails & red-team
Adversarial and failure-mode testing

Guardrail unit tests (every number re-checked vs the facts sheet, every claim vs a citation), groundedness / hallucination detection, escalation tests (toxic level must always block release), and adversarial / red-team prompts (prompt injection, out-of-scope asks).

Promptfoo (red-team) · guardrail-verifier · Galileo
→ ISO 14971 + AAMI CR34971 (AI-specific risk)
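The "every number re-checked vs the facts sheet" guardrail can be sketched as a simple extraction-and-compare step. The regex and the function name are my own; a production verifier would also need to handle units, rounding and an allow-list for things like citation years.

```python
import re
from typing import Dict, List

# Integers or decimals that are not part of a longer word or number.
NUMBER = re.compile(r"(?<![\w.])\d+(?:\.\d+)?")


def unsupported_numbers(draft: str, facts: Dict[str, object]) -> List[str]:
    """Numbers in the draft that do not appear anywhere in the facts sheet."""
    allowed = {
        float(v) for v in facts.values()
        if isinstance(v, (int, float)) and not isinstance(v, bool)
    }
    return [tok for tok in NUMBER.findall(draft) if float(tok) not in allowed]
```

A draft passes only if the list is empty; anything else blocks release and goes back for regeneration or human review.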
6
Online monitoring & drift
Re-calibrated on a cadence

Every case is traced end-to-end. We track reviewer edit rate (draft-quality proxy), safety-flag override rate, retrieval confidence, draft-latency p50/p95, escalation-SLA adherence, and score drift; re-score a 20% live sample with RAGAS. After an alarm we roll back to the last validated config, widen the gold set, or re-calibrate the judge — under change control; the ≥2-human judge re-calibration runs quarterly. No silent model/prompt/module update.

Arize Phoenix / Braintrust · OTel tracing · drift alarms
→ AI Act Art. 72 + MDR post-market surveillance
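Two of these monitoring pieces are easy to sketch: a deterministic 20% live sample and the reviewer edit rate. Hash-based sampling is my choice so that a case is either always or never in the sample; defining the edit rate as the share of drafts that were edited at all is also an assumption.

```python
import hashlib
from typing import Dict, List


def in_live_sample(case_id: str, rate: float = 0.20) -> bool:
    """Deterministically select about `rate` of all cases for re-scoring."""
    digest = hashlib.sha256(case_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64   # uniform in [0, 1)
    return bucket < rate


def reviewer_edit_rate(cases: List[Dict[str, bool]]) -> float:
    """Share of signed drafts the reviewer changed before signing."""
    if not cases:
        return 0.0
    return sum(1 for c in cases if c["edited"]) / len(cases)
```

Because selection depends only on the case ID, the sample can be reproduced during an audit without storing a separate sampling log.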

RAGAS metric explorer

The metrics we gate on, what each measures, and the target for a clinical-grade TDM report.

Tool-call verification — were the right tools actually called?

Each run's tool trace is checked against the reference trace (conformant vs faulty below). The same check runs online on every production case: a run that deviates from the tool contract — wrong tool/args/order, or a missing required call like interaction_check — is blocked from the worklist and routed to a human.

Expected (reference) trace

db_query()
sample_id="LEV-204"
pk_calculator()
half_life=7, last_change, sampled_at
interaction_check()
drugs=[levetiracetam, …]
kb_search()
q="levetiracetam elevated causes"
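The comparison against this reference trace can be sketched as follows. It checks tool names and order only; argument checking is left out, and the function name and message strings are hypothetical.

```python
from typing import List, Sequence

EXPECTED_TRACE = ("db_query", "pk_calculator", "interaction_check", "kb_search")


def check_trace(actual: Sequence[str],
                expected: Sequence[str] = EXPECTED_TRACE) -> List[str]:
    """Return the deviations from the tool contract; empty means conformant."""
    problems = []
    for tool in expected:
        if tool not in actual:
            problems.append(f"missing required call: {tool}")
    for tool in actual:
        if tool not in expected:
            problems.append(f"unexpected call: {tool}")
    # Compare the relative order of the expected tools that were called;
    # a repeated call also shows up here as a deviation.
    called = [t for t in actual if t in expected]
    reference = [t for t in expected if t in actual]
    if called != reference:
        problems.append("call order deviates from reference")
    return problems
```

Any non-empty result keeps the run off the worklist and routes it to a human.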

Actual trace [interactive view: conformant vs faulty run]

Release / readiness gates (illustrative)

Internal-testing readiness is a green board. A single regression (a metric below target) blocks release; a passing metric that is drifting toward its limit (amber, shown as WATCH) triggers review.

Faithfulness: 0.93 · PASS · target ≥ 0.90
Context recall: 0.88 · PASS · target ≥ 0.85
Tool-call accuracy: 0.96 · PASS · target ≥ 0.95
CLEAR mean (judge): 4.2 / 5 · PASS · target ≥ 4.0
Reviewer edit rate: 23 % · WATCH · target ≤ 25 %
Hallucination incidents: 0.4 % · PASS · target ≤ 1 %
Escalation fast-path SLA: 99.6 % · PASS · target < 60 s, ≥ 99.5 %
Draft latency p95: 6.4 s · PASS · target ≤ 8 s
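The board logic can be sketched as data plus one function. The targets and measured values are the illustrative ones above; the WATCH rule (passing, but worse than the previous run) is my reading of the amber state, and the previous edit rate of 20% is a made-up example.

```python
import operator
from typing import Callable, Dict, Optional, Tuple

# metric -> (comparison that must hold, target)
GATES: Dict[str, Tuple[Callable[[float, float], bool], float]] = {
    "faithfulness": (operator.ge, 0.90),
    "context_recall": (operator.ge, 0.85),
    "tool_call_accuracy": (operator.ge, 0.95),
    "clear_mean": (operator.ge, 4.0),
    "reviewer_edit_rate": (operator.le, 0.25),
    "hallucination_rate": (operator.le, 0.01),
    "fast_path_sla": (operator.ge, 0.995),
    "draft_latency_p95_s": (operator.le, 8.0),
}


def gate_board(measured: Dict[str, float],
               previous: Optional[Dict[str, float]] = None) -> Dict[str, str]:
    """PASS / WATCH / FAIL per metric; WATCH = passing but worse than last run."""
    previous = previous or {}
    board = {}
    for name, (holds, target) in GATES.items():
        value = measured[name]
        if not holds(value, target):
            board[name] = "FAIL"
        elif name in previous and not holds(value, previous[name]):
            board[name] = "WATCH"
        else:
            board[name] = "PASS"
    return board


def release_allowed(board: Dict[str, str]) -> bool:
    """A single FAIL blocks release; WATCH only triggers review."""
    return "FAIL" not in board.values()


measured = {"faithfulness": 0.93, "context_recall": 0.88,
            "tool_call_accuracy": 0.96, "clear_mean": 4.2,
            "reviewer_edit_rate": 0.23, "hallucination_rate": 0.004,
            "fast_path_sla": 0.996, "draft_latency_p95_s": 6.4}
board = gate_board(measured, previous={"reviewer_edit_rate": 0.20})
```

With these inputs only the reviewer edit rate comes out as WATCH, and release is still allowed.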

The thread back to MDR: this stack is the software verification & validation (IEC 62304), the AI-Act accuracy/robustness evidence (Art. 15), the clinical performance data (MDCG 2020-1), and the post-market plan (Art. 72) — one assurance system serving both the medical-device file and engineering confidence.

09 / CONSTRAINTS · TIME · RESOURCES · MDR Answers → constraints + Discussion 6

Pragmatic under regulation

Intended use (Zweckbestimmung) is the biggest lever. Frame the device to inform and keep a human in the loop, and the regulatory and validation burden stays tractable.

MDR class explorer

The IMDRF significance framework (adopted in MDCG 2019-11) maps what the output does × how serious the situation is.

[Interactive class explorer. Resulting class on the IMDRF grid: IIa]
Our position: Class IIa

MDR Rule 11's text defaults decision-support software to Class IIa. We adopt IIa as the conservative baseline and keep the human-in-the-loop so the device only informs — a narrow-therapeutic-index module in a critical setting can be re-assessed toward IIb in its own risk file.
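For reference, the grid behind the explorer can be written as a lookup table. The mapping below is how I read the classification table in MDCG 2019-11 (Annex III) and should be checked against the guidance text before it is relied on; the key names are my own.

```python
from typing import Dict, Tuple

# (state of healthcare situation, significance of the information) -> MDR class
IMDRF_GRID: Dict[Tuple[str, str], str] = {
    ("critical", "treat_or_diagnose"): "III",
    ("critical", "drive"): "IIb",
    ("critical", "inform"): "IIa",
    ("serious", "treat_or_diagnose"): "IIb",
    ("serious", "drive"): "IIa",
    ("serious", "inform"): "IIa",
    ("non_serious", "treat_or_diagnose"): "IIa",
    ("non_serious", "drive"): "IIa",
    ("non_serious", "inform"): "IIa",
}


def resulting_class(situation: str, significance: str) -> str:
    """Look up the class for one cell of the grid."""
    return IMDRF_GRID[(situation, significance)]
```

As long as the output only informs, every row resolves to IIa; the class rises only when the output drives management or diagnoses in a critical setting, which is why the intended use matters so much.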

The German + EU regulatory stack

MDR — Reg. (EU) 2017/745

Rule 11: decision-support software is Class IIa (→ IIb/III by harm severity). Class IIa+ ⇒ notified-body conformity assessment.

MPDG (national)

The Medizinprodukterecht-Durchführungsgesetz flanks the directly-applicable MDR — competent authorities, language, clinical-investigation procedure, penalties. Replaced the MPG.

BfArM — competent authority

The federal authority for medical devices (DIMDI was merged into BfArM in 2020).

Clinical evaluation (MDR Annex XIV)

Under limited data: an equivalence/literature route backed by gold-set performance + the PoC, escalating to a focused clinical investigation (ISO 14155) only where the literature gap demands it.

Not a DiGA

DiGA fast-track (DVG/DiGAV, §139e SGB V) is for patient-facing apps. A clinician-facing TDM tool stays plain CE-marked SaMD — the DiGA door is the wrong one.

EU AI Act — Reg. (EU) 2024/1689

SaMD needing a notified body ⇒ high-risk via Art. 6(1) + Annex I. Obligations from 2 Aug 2027; the Digital Omnibus (provisional, May 2026) moves product-embedded high-risk to 2 Aug 2028.

DSGVO / BDSG + §203 StGB

Drug levels are special-category health data (Art. 9). Third-country transfer + confidentiality push toward EU-hosted / on-prem / local-model processing — our routing policy.

Notified bodies (DE)

TÜV SÜD, TÜV Rheinland, DEKRA, mdc — German notified bodies already fold AI-Act expectations into MDR reviews, so we design for both now.

MDCG 2019-11 Rev.1 (2025)

The June-2025 revision of the software qualification/classification guidance now explicitly covers AI, modular software and EHR interoperability.

EU AI Act — the obligations

As a high-risk AI system the device carries: risk management (Art. 9), data governance (Art. 10), technical documentation (Art. 11), logging (Art. 12), transparency (Art. 13), human oversight (Art. 14, our review-and-sign step), accuracy/robustness/cybersecurity (Art. 15), and post-market monitoring (Art. 72). These largely overlap with MDR/ISO 13485/14971: design once, evidence for both.

Data governance · Transparency · Human oversight · Logging · Post-market monitoring

Standards we build against

  • ISO 13485 · QMS
  • ISO 14971 + TR 24971 · risk
  • AAMI CR34971 · AI/ML risk
  • IEC 62304 · lifecycle
  • IEC 62366-1 · usability
  • IEC 81001-5-1 · security
  • ISO/IEC 42001 · AI mgmt
  • ISO 14155 · clinical

The HITL + narrow intended use are the principal risk controls in the ISO 14971 file: “wrong steady-state → wrong interpretation” is mitigated by deterministic computation, provenance, the guardrail, and expert sign-off.

Development phases & readiness gates

PHASE 1

Discovery & spec

Elicitation, report reverse-engineering, intended-use & risk file, gold case set.

PHASE 2

Deterministic core

Rule engine + first drug module.

gate: ~100% on gold
PHASE 3

LLM + RAG + guardrail

Curated KB, grounded generation, output verifier.

gate: CLEAR + faithfulness
PHASE 4

Internal validation

End-to-end on gold set, ≥2 pharmacologists, usability.

PHASE 5

Shadow-mode pilot

System drafts; experts compare to own reports.

gate: edit rate / safety
PHASE 6

Conformity & release

Then module-by-module expansion.

gate: new-module qual.

Balancing the trade-offs

Narrow intended use

Lowers the class, shrinks the validation surface, speeds the path — and raises safety. The single biggest lever.

Rule / LLM split

Move fast on low-risk narrative; the safety-critical core is deterministic, testable, provable.

Modular architecture

Scale without a validation explosion: validate the engine once, qualify modules incrementally.

Don't fine-tune first

Context engineering + function calling + grounded RAG + guardrails. Fine-tuning & local models (MedGemma, Meditron) are a later-stage lever for performance & on-prem privacy.

HITL as enabler

Safety control, regulatory enabler (keeps it 'informing' → IIa), and a data engine via captured edits.

What we trade

Less automation and narrower coverage now — in exchange for safety, speed, and validatability. Up-front curation pays for both validation and monitoring.