EMR Traces → Generalized Clinic Markov Chain

What This Project Is

This repository contains de-identified EMR (Electronic Medical Record) traces from a colorectal / colonoscopy-focused surgical clinic in Calgary, Alberta (affiliated with Tom Baker Cancer Centre and Alberta Health Services). Each "sample patient" folder contains the complete paper trail of a single patient's journey through the clinic — every fax, letter, lab result, booking task, phone call, prescription, operative report, and surveillance visit, converted into a structured JSON trace and accompanied by the original source PDFs.

The goal is to analyze every patient's trace and synthesize the traces into a single generalized Markov chain that models every possible pathway a patient can take through this clinic, from referral to discharge, with empirically derived transition probabilities, sojourn times, and identified system failure modes.

Why a Markov Chain?

A Markov chain is a mathematical model in which a system moves through a set of states (here, phases of care), and the probability of moving to the next state depends only on the current state, not on how the system got there. Think of it as a map of every possible route through the clinic, with percentages at each fork showing how likely each path is.

This is useful because:

It makes the invisible visible. Healthcare systems are complex — dozens of actors (GPs, surgeons, oncologists, radiologists, admin staff, patients) interact across months or years. A Markov model collapses this into a readable diagram showing every possible journey.
It quantifies bottlenecks. If 30% of patients experience a scheduling delay at a particular state, that shows up as a transition probability to a "delay" state. You can measure exactly where the system fails and how often.
It enables simulation. Once you have the transition matrix, you can simulate thousands of hypothetical patients to answer questions like: "What's the expected time from referral to treatment for a cancer patient?" or "If we reduce the hemorrhoid queue from 12 months to 3 months, how many cancers would we catch earlier?"
It reveals hidden structure. Patterns that are invisible in individual charts — like the fact that admin task priority almost never matches clinical urgency, or that patients catch scheduling errors more often than staff do — become statistically visible across many patients.
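The simulation idea above can be sketched in a few lines of Python. This is a minimal illustration, not the project's tooling: the states, probabilities, and mean sojourn times are invented placeholders rather than clinic data, `simulate_patient` is a hypothetical helper, and exponentially distributed sojourn times are an assumption made only for simplicity.

```python
import random

# Toy transition table: state -> list of (next_state, probability).
# All numbers are placeholders for illustration, not clinic estimates.
TRANSITIONS = {
    "referral": [("consultation", 1.0)],
    "consultation": [("conservative_management", 0.85), ("cancer_workup", 0.15)],
    "conservative_management": [("discharge", 1.0)],
    "cancer_workup": [("treatment", 1.0)],
    "treatment": [("discharge", 1.0)],
}

# Placeholder mean days spent in each non-absorbing state.
MEAN_SOJOURN_DAYS = {
    "referral": 270.0,
    "consultation": 10.0,
    "conservative_management": 30.0,
    "cancer_workup": 35.0,
    "treatment": 180.0,
}


def simulate_patient(rng, start="referral", absorbing="discharge"):
    """Walk one hypothetical patient through the chain; return (path, total_days)."""
    state, path, days = start, [start], 0.0
    while state != absorbing:
        # Assumption: exponential sojourn time with the state's mean.
        days += rng.expovariate(1.0 / MEAN_SOJOURN_DAYS[state])
        next_states, weights = zip(*TRANSITIONS[state])
        state = rng.choices(next_states, weights=weights, k=1)[0]
        path.append(state)
    return path, days


rng = random.Random(0)  # fixed seed so the run is repeatable
runs = [simulate_patient(rng) for _ in range(10_000)]
cancer_share = sum("cancer_workup" in path for path, _ in runs) / len(runs)
mean_days = sum(days for _, days in runs) / len(runs)
print(f"share entering cancer workup: {cancer_share:.3f}")
print(f"mean days referral to discharge: {mean_days:.0f}")
```

Changing an entry in `MEAN_SOJOURN_DAYS` and rerunning is the simplest version of a "what if the queue were shorter" experiment, though any real answer would depend on replacing the placeholders with measured values.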

How It Works: The Pipeline

Step 1: Raw Data (PDFs)

Each patient's source material is a set of scanned clinical documents — faxes, letters, lab reports, imaging reports, operative notes, EMR task screenshots, prescriptions. These are the ground truth. They live in each patient's patient[N]_pdfs/ subfolder.

Step 2: Structured Trace (JSON)

Each patient's PDFs have been manually analyzed and converted into a structured JSON trace (patient[N]_trace.json). Each step in the trace captures:

{
  "step_id": "PAT16-S7",
  "step_type": "procedure_execution",
  "elapsed_days": 281,
  "day_gap_from_previous": 11,
  "actor": "endoscopist",
  "document_type": "endoscopy_report",
  "observation": {
    "document_summary": "...",
    "extracted_fields": { ... },
    "prior_state": { ... }
  },
  "action_space": [
    "possible_action_1",
    "possible_action_2",
    "..."
  ],
  "action_taken": "actual_action_taken",
  "result": {
    "outcome": "...",
    "state_change": { ... },
    "next_step": "..."
  }
}

Key fields:

step_type: What kind of event this is (referral, consultation, procedure, lab result, admin task, phone call, etc.)
elapsed_days: Days since the original referral (Day 0)
actor: Who performed the action (GP, specialist, admin staff, patient, radiologist, etc.)
observation: What was known at this point — the document contents, extracted clinical data, and the prior state
action_space: All plausible actions that could have been taken at this decision point (not just what was done, but what else could have happened)
action_taken: What actually happened
result: The outcome, how the patient's state changed, and what happens next

The action_space field is particularly important — it captures the counterfactual branches. Even though only one action was taken, the other options represent transitions that other patients might take at the same state.
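A quick consistency check over a trace might look like the sketch below. It assumes the steps are available as a list of objects shaped like the example above (the actual top-level layout of patient[N]_trace.json is not specified here), and it assumes that day_gap_from_previous equals the difference between consecutive elapsed_days values. `check_steps` is a hypothetical helper, and every field value in the example other than those of PAT16-S7 is made up.

```python
import json

REQUIRED_KEYS = {
    "step_id", "step_type", "elapsed_days", "day_gap_from_previous",
    "actor", "document_type", "observation", "action_space",
    "action_taken", "result",
}


def check_steps(steps):
    """Return a list of human-readable problems found in a list of step dicts."""
    problems = []
    previous_elapsed = None
    for index, step in enumerate(steps):
        label = step.get("step_id", f"step #{index}")
        missing = REQUIRED_KEYS - step.keys()
        if missing:
            problems.append(f"{label}: missing {sorted(missing)}")
            previous_elapsed = step.get("elapsed_days")
            continue
        # Assumption: the gap field equals the difference in elapsed_days.
        if previous_elapsed is not None:
            gap = step["elapsed_days"] - previous_elapsed
            if gap != step["day_gap_from_previous"]:
                problems.append(
                    f"{label}: day_gap_from_previous is "
                    f"{step['day_gap_from_previous']}, expected {gap}"
                )
        previous_elapsed = step["elapsed_days"]
    return problems


# Two abbreviated steps; the first one is invented for illustration.
example = json.loads("""
[
  {"step_id": "PAT16-S6", "step_type": "admin_task", "elapsed_days": 270,
   "day_gap_from_previous": 0, "actor": "admin_staff",
   "document_type": "emr_task", "observation": {}, "action_space": [],
   "action_taken": "", "result": {}},
  {"step_id": "PAT16-S7", "step_type": "procedure_execution",
   "elapsed_days": 281, "day_gap_from_previous": 11, "actor": "endoscopist",
   "document_type": "endoscopy_report", "observation": {},
   "action_space": [], "action_taken": "", "result": {}}
]
""")
print(check_steps(example))  # an empty list means no problems were found
```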

Step 3: Patient-Level Markov Analysis (Markdown)

Each patient gets a patient[N]_markov_process.md file that:

Explains the patient's full journey in plain English
Includes a glossary of all medical terms
Maps each step to a Markov state
Documents every state transition with probabilities (initially estimated from literature, later refined from data)
Identifies system failures, delays, and communication gaps
Notes Markov-violating features (history dependence, parallel processes, etc.)
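The step-to-state mapping can be pictured with a small sketch. The mapping table below is hypothetical: in practice the mapping is done by reading each patient's documents, and a single step_type will not always correspond to a single state. The sketch only shows the mechanics of skipping administrative steps and collapsing consecutive repeats, which also means self-transitions are not represented.

```python
# Hypothetical mapping from step_type to a clinical state. None marks steps
# treated as administrative noise that leave the clinical state unchanged.
STEP_TYPE_TO_STATE = {
    "referral": "GP Referral",
    "consultation": "Specialist Consultation",
    "procedure_execution": "Cancer Workup",
    "admin_task": None,
    "phone_call": None,
}


def state_sequence(steps):
    """Collapse trace steps into a sequence of distinct consecutive states."""
    sequence = []
    for step in steps:
        state = STEP_TYPE_TO_STATE.get(step["step_type"])
        if state is None:
            continue  # unmapped or administrative step: no state change
        if not sequence or sequence[-1] != state:
            sequence.append(state)
    return sequence


# Abbreviated, invented steps carrying only the field this sketch reads.
example_steps = [
    {"step_type": "referral"},
    {"step_type": "admin_task"},
    {"step_type": "consultation"},
    {"step_type": "phone_call"},
    {"step_type": "consultation"},
    {"step_type": "procedure_execution"},
]
print(state_sequence(example_steps))
```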

Step 4: Patient-Level Flowchart (Mermaid)

Each patient gets a patient[N]_flowchart.mmd file — a detailed Mermaid diagram showing every appointment, admin task, phone call, and clinical event with exact calendar dates, priorities, completion lags, and outcomes. These are color-coded:

Red = Cancer diagnosis / urgent findings
Pink = Phone calls / patient communication
Yellow = Admin tasks / scheduling
Green = Good outcomes / milestones
Coral/Orange = System failures / errors

Step 5: Generalized Clinic Markov Chain (The End Goal)

As more patients are added, their individual traces are synthesized into a single master model that represents the entire clinic. This involves:

a) State Space Discovery

Each new patient may reveal states we haven't seen before. Patient 16 gave us the cancer pathway (referral → staging → neoadjuvant → surgery → surveillance). Other patients will add:

Hemorrhoid banding / conservative management
Polyp surveillance pathways
IBD (Crohn's, ulcerative colitis) management
Fistula / fissure / abscess surgical pathways
Diverticular disease management
Emergency presentations (obstruction, perforation, GI bleed)
Fecal incontinence / rectal prolapse pathways
Screening colonoscopy → normal → discharge
Watch-and-wait after complete clinical response
Palliative pathways for metastatic disease

The state space grows with each patient until it stabilizes (no new states are discovered — the clinic's full scope of practice is captured).
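One way to watch for stabilization is to record how many previously unseen states each new patient contributes. The sketch below assumes each patient has already been reduced to a list of state names. `discovery_curve` and `has_stabilized` are hypothetical helpers, the three-patient window is an arbitrary choice, and the example sequences are invented.

```python
def discovery_curve(patient_sequences):
    """For each patient, in order, count the states not seen in earlier patients."""
    seen = set()
    new_counts = []
    for sequence in patient_sequences:
        new_states = set(sequence) - seen
        new_counts.append(len(new_states))
        seen |= new_states
    return new_counts


def has_stabilized(new_counts, window=3):
    """Heuristic: the most recent `window` patients added no new states."""
    return len(new_counts) >= window and sum(new_counts[-window:]) == 0


# Invented state sequences for five hypothetical patients.
sequences = [
    ["GP Referral", "Specialist Consultation", "Cancer Workup"],
    ["GP Referral", "Specialist Consultation", "Conservative Management"],
    ["GP Referral", "Specialist Consultation", "Conservative Management"],
    ["GP Referral", "Specialist Consultation", "Cancer Workup"],
    ["GP Referral", "Specialist Consultation", "Conservative Management"],
]
curve = discovery_curve(sequences)
print(curve, has_stabilized(curve))
```

A run of zeros at the end of the curve only suggests stabilization. A rare pathway can still appear later, so the heuristic is a prompt to look, not a stopping rule.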

b) Transition Counting

Every time a patient moves from State A to State B, that's a count. After N patients:

P(A → B) = count(A → B) / count(all transitions from A)

For example, if 20 patients arrive at the "Specialist Consultation" state and 17 are confirmed hemorrhoids while 3 turn out to have cancer:

P(Consult → Conservative Management) = 17/20 = 0.85
P(Consult → Cancer Workup) = 3/20 = 0.15

These replace the literature-estimated probabilities with real clinic-specific numbers.
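The counting rule above translates directly into code. This sketch assumes each patient has been reduced to an ordered list of state names. `transition_probabilities` is a hypothetical helper, and the input simply reproduces the 17-versus-3 example from the text.

```python
from collections import Counter, defaultdict


def transition_probabilities(patient_sequences):
    """Estimate P(A -> B) as count(A -> B) / count(all transitions from A)."""
    counts = defaultdict(Counter)
    for sequence in patient_sequences:
        # Each consecutive pair of states is one observed transition.
        for current, following in zip(sequence, sequence[1:]):
            counts[current][following] += 1
    return {
        state: {nxt: n / sum(followers.values()) for nxt, n in followers.items()}
        for state, followers in counts.items()
    }


# The worked example from the text: 20 consultations, 17 benign, 3 cancer.
sequences = (
    [["Specialist Consultation", "Conservative Management"]] * 17
    + [["Specialist Consultation", "Cancer Workup"]] * 3
)
probs = transition_probabilities(sequences)
print(probs["Specialist Consultation"])
```

States that never appear as the source of a transition (absorbing states such as discharge) have no row in the result, which is worth remembering when the output is later laid out as a square matrix.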

c) Sojourn Time Distributions

For each state, we collect the time every patient spent there:

How long patients wait in the referral queue (by diagnosis)
How long staging takes (from colonoscopy to complete staging)
How long from diagnosis to treatment start
How long admin tasks take to complete vs. their due dates

This gives us distributions rather than just averages, so we can make statements of the form "the median wait for a hemorrhoid referral is 14 months (IQR 10–18)" or "the median staging CT-to-MRI gap is 19 days (range 7–35)."
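As a rough sketch, intervals can be derived from dated milestones and then summarized per state. The Patient 16 milestone days below are obtained by summing the intervals listed in this README (272, 9, and 35 days), which assumes those intervals are consecutive. The list of waits for other patients is invented, and both helpers are hypothetical.

```python
from statistics import median, quantiles


def intervals(milestones):
    """Turn [(label, elapsed_days), ...] into {"a -> b": days} for consecutive pairs."""
    return {
        f"{a} -> {b}": day_b - day_a
        for (a, day_a), (b, day_b) in zip(milestones, milestones[1:])
    }


def summarize(days):
    """Median, interquartile range, and range of sojourn times (needs >= 2 values)."""
    q1, _, q3 = quantiles(days, n=4, method="inclusive")
    return {
        "n": len(days),
        "median": median(days),
        "iqr": (q1, q3),
        "range": (min(days), max(days)),
    }


# Milestone days for Patient 16, assuming the listed intervals are consecutive.
patient16 = [
    ("referral", 0),
    ("consultation", 272),
    ("tissue_diagnosis", 281),
    ("staging_complete", 316),
]
print(intervals(patient16))

# Invented referral waits (days) for five hypothetical patients.
hypothetical_waits = [210, 272, 300, 365, 420]
print(summarize(hypothetical_waits))
```

With only a handful of patients per state, the quartiles are sensitive to the interpolation method, so reporting the raw values alongside the summary is probably the safer habit.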

d) Failure Mode Cataloging

Each patient's trace reveals system breakdowns. These are cataloged across patients to identify:

Recurring failures: Are misfiled requisitions a one-off or systemic?
Priority mismatches: How often does admin task priority fail to reflect clinical urgency?
Communication gaps: How often do patients have to chase their own results?
Delay patterns: Are delays concentrated at specific transitions (e.g., MRI scheduling, post-op follow-up booking)?
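Separating one-off failures from recurring ones comes down to counting how many distinct patients show each failure mode. The sketch below assumes failures have been hand-tagged with short labels per patient. The labels, the second patient, and the `recurring_failures` helper are all hypothetical.

```python
from collections import defaultdict


def recurring_failures(failures_by_patient, min_patients=2):
    """Return {failure_mode: patient_count} for modes seen in >= min_patients patients."""
    patients_per_mode = defaultdict(set)
    for patient_id, modes in failures_by_patient.items():
        for mode in modes:
            # A set ensures each patient counts once per failure mode.
            patients_per_mode[mode].add(patient_id)
    return {
        mode: len(patients)
        for mode, patients in patients_per_mode.items()
        if len(patients) >= min_patients
    }


# Hand-assigned tags; "patientNN" and its tags are invented for illustration.
failures_by_patient = {
    "patient16": ["misfiled_requisition", "task_completion_lag", "scheduling_error"],
    "patientNN": ["task_completion_lag", "task_completion_lag"],
}
print(recurring_failures(failures_by_patient))
```

Counting patients rather than occurrences keeps one patient with many repeated lags from making a failure mode look systemic on their own.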

e) The Master Transition Matrix

The final output is a single transition matrix with:

All discovered states as rows and columns
Empirical transition probabilities in each cell
Confidence intervals based on sample size
Sojourn time distributions for each state
Annotated failure modes at vulnerable transitions
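For the confidence intervals, one common choice with small counts is the Wilson score interval. The README does not specify which interval the project uses, so treat this as one reasonable option, not the chosen method. `wilson_interval` is a hypothetical helper, and it treats each transition out of a state as a separate binomial proportion, which is a simplification when a state has more than two exits.

```python
from math import sqrt


def wilson_interval(successes, total, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives about 95%)."""
    if total == 0:
        return (0.0, 1.0)  # no observations: the proportion is unconstrained
    p = successes / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half_width = z * sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return (max(0.0, centre - half_width), min(1.0, centre + half_width))


# The 17-of-20 example from the transition counting section.
low, high = wilson_interval(17, 20)
print(f"P(Consult -> Conservative Management) = 0.85, 95% CI {low:.2f} to {high:.2f}")
```

For 17 of 20 the interval runs from roughly 0.64 to 0.95, which illustrates how wide the uncertainty remains at this sample size.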

Repository Structure

EMR Traces/
├── README.md                              ← This file
│
├── Sample Patient 16/
│   ├── patient16_trace.json               ← Structured trace (98 steps, 1844 days)
│   ├── patient16_markov_process.md        ← Full analysis + Markov model + glossary
│   ├── patient16_flowchart.mmd            ← Mermaid diagram with all dates/scheduling
│   ├── patient16_pdfs/                    ← Source PDFs (P16_S1.pdf through P16_S98.pdf)
│   └── Sample Patient.zip                 ← Original zip archive
│
├── Sample Patient NN/                     ← Future patients follow same structure
│   ├── patientNN_trace.json
│   ├── patientNN_markov_process.md
│   ├── patientNN_flowchart.mmd
│   └── patientNN_pdfs/
│
└── (future) generalized_markov_model.md   ← Master model synthesized from all patients

What Patient 16 Taught Us

Patient 16 (John Peters, 73M) is the first and most extensively documented case. His 98-step, 5-year trace provided the foundational state space for the cancer pathway:

States discovered:

GP Referral → Wait Queue → Specialist Consultation → Cancer Workup (colonoscopy + biopsy) → Staging (CT + MRI) → Tumor Board → Neoadjuvant Therapy (RAPIDO: SCRT + CAPOX) → Restaging → Surgery (LAR) → Post-Op Pathology → Stoma Management → Ileostomy Reversal → Surveillance → De-escalation

Key metrics from this single patient:

Referral to consultation: 272 days (9 months in hemorrhoid queue)
Consultation to tissue diagnosis: 9 days
Diagnosis to complete staging: 35 days
Staging to cancer centre referral: 40 days
Referral to treatment start: 354 days (~1 year)
Total treatment duration (RT + chemo): ~6 months
Surgery to discharge: 11 days
Ileostomy duration: 13 months
Current status: No recurrence at 3+ years, surveillance de-escalating

System failures identified:

9-month referral wait with a mass growing undetected
FIT test false negative in a patient with active rectal cancer
Sedation prescription task completed 15 days after the MRI it was for
Post-op follow-up booking task with 172-day completion lag
Misfiled requisition causing 6-week delay to ileostomy reversal
DI scheduling error assigning wrong CT date (caught by admin)
Patient LARS crisis with unanswered email and 131-day task lag
Discharge summary describing surgery as "uncomplicated" despite intraoperative anastomotic failure

Markov-violating features noted:

Strong history dependence (treatment decisions depend on full prior state, not just current)
Highly variable sojourn times (9 months in hemorrhoid queue vs. 9 days on cancer pathway)
Parallel processes (CT, MRI, pathology, admin tasks running concurrently)
~30% of steps are administrative noise that doesn't change clinical state
Patient agency (preferences, schedule changes, proactive result-seeking)
Non-stationary system (admin efficiency improved over the 5-year trace)

What We Expect from More Patients

Each additional patient will contribute in one or more of these ways:

| Patient Profile | What It Adds to the Model |
| --- | --- |
| Hemorrhoid patient (confirmed, discharged) | The most common pathway; fills in the "85% exit" from consultation |
| Polyp found on screening colonoscopy | Endoscopic management branch, polyp surveillance pathway |
| Early-stage cancer (T1-2 N0) | Direct-to-surgery branch (skipping neoadjuvant) |
| Metastatic cancer (M1 at diagnosis) | Palliative pathway, systemic therapy without curative surgery |
| Complete clinical response after chemoRT | Watch-and-wait branch (organ preservation) |
| IBD patient (Crohn's/UC) | Entirely new pathway: medical management, biologics, possible surgery |
| Fistula / fissure / abscess | Surgical pathway for benign anorectal conditions |
| Diverticular disease | Medical vs. surgical management branch |
| Emergency presentation | Urgent/emergent entry point bypassing referral queue |
| Patient who drops out / refuses treatment | "Lost to follow-up" absorbing state |

With ~15–20 patients covering the clinic's main diagnostic categories, the generalized model should stabilize — meaning new patients mostly traverse already-known states rather than discovering new ones. The transition probabilities will continue to refine with each additional patient.

Limitations and Honest Caveats

Small sample size. Even with all planned patients, we'll have tens of cases, not thousands. Transition probabilities will have wide confidence intervals. This is a structural model (what paths exist) more than a precise statistical one (exact probabilities).
Selection bias. The "sample patients" were selected for educational/analytical purposes, likely favoring complex or interesting cases. The true distribution of cases at this clinic is probably 70%+ routine hemorrhoid/polyp patients, which will be underrepresented.
Markov assumption is an approximation. Real clinical decision-making is deeply history-dependent. The model works by encoding enough context into the state definition (e.g., "Post-Neoadjuvant Surgery" is a different state than "Upfront Surgery") but can't capture everything.
Single clinic, single health system. This model describes one specific colorectal clinic in Calgary within Alberta Health Services. Wait times, referral patterns, treatment protocols, and administrative workflows will differ at other clinics and in other healthcare systems.
Point-in-time snapshot. Clinical practice evolves. The RAPIDO protocol used for Patient 16 may be replaced by newer approaches. The model captures how this clinic operated during the period covered by the traces, not necessarily how it operates today or will operate tomorrow.
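The state-encoding workaround mentioned under the Markov assumption caveat can be illustrated with a very small sketch. The function and the single rule inside it are hypothetical. They show only the surgery example from that caveat, where the same clinical event maps to different states depending on what came before.

```python
def augmented_state(base_state, history):
    """Fold the relevant part of a patient's history into the state name."""
    if base_state == "Surgery":
        # Same event, different state depending on prior treatment.
        if "Neoadjuvant Therapy" in history:
            return "Post-Neoadjuvant Surgery"
        return "Upfront Surgery"
    return base_state


print(augmented_state("Surgery", ["Staging", "Tumor Board", "Neoadjuvant Therapy"]))
print(augmented_state("Surgery", ["Staging", "Tumor Board"]))
```

Each rule of this kind multiplies the number of states, and with tens of patients that quickly thins out the transition counts, so the encoding has to stay selective.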

How to Use This Repository

If you're analyzing a new patient:

Place the PDFs in Sample Patient NN/patientNN_pdfs/
Put the structured trace in Sample Patient NN/patientNN_trace.json
Put the analysis and Markov mapping in Sample Patient NN/patientNN_markov_process.md
Put the detailed flowchart in Sample Patient NN/patientNN_flowchart.mmd
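A small script can confirm that a new patient folder follows this layout. This is a sketch that assumes the naming pattern shown in the repository structure above. `expected_paths` and `missing_paths` are hypothetical helpers, and the root folder name passed in the example is taken from the directory tree.

```python
from pathlib import Path


def expected_paths(root, patient_number):
    """List the four paths a patient folder is expected to contain."""
    folder = Path(root) / f"Sample Patient {patient_number}"
    prefix = f"patient{patient_number}"
    return [
        folder / f"{prefix}_pdfs",
        folder / f"{prefix}_trace.json",
        folder / f"{prefix}_markov_process.md",
        folder / f"{prefix}_flowchart.mmd",
    ]


def missing_paths(root, patient_number):
    """Return the expected paths that do not exist on disk."""
    return [path for path in expected_paths(root, patient_number) if not path.exists()]


# Prints every expected path that is absent relative to the current directory.
for path in missing_paths("EMR Traces", 16):
    print(f"missing: {path}")
```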

If you're building the generalized model:

Read each patient's _markov_process.md to understand their pathway
Map their states to the master state space (adding new states as needed)
Count transitions across all patients
Update the master transition matrix
Document new failure modes and sojourn times

If you're viewing the Mermaid diagrams:

GitHub renders .mmd files natively
VS Code: install the "Mermaid Preview" extension
CLI: npx -p @mermaid-js/mermaid-cli mmdc -i file.mmd -o file.svg
Web: paste contents at mermaid.live