EMR Traces → Generalized Clinic Markov Chain
What This Project Is
This repository contains de-identified EMR (Electronic Medical Record) traces from a colorectal / colonoscopy-focused surgical clinic in Calgary, Alberta (affiliated with Tom Baker Cancer Centre and Alberta Health Services). Each "sample patient" folder contains the complete paper trail of a single patient's journey through the clinic — every fax, letter, lab result, booking task, phone call, prescription, operative report, and surveillance visit, converted into a structured JSON trace and accompanied by the original source PDFs.
The goal is to analyze every patient's trace and synthesize them into a single generalized Markov chain that models every possible pathway a patient can take through this clinic — from referral to discharge — with empirically-derived transition probabilities, sojourn times, and identified system failure modes.
Why a Markov Chain?
A Markov chain is a mathematical model where a system moves through a set of states (phases of care), and the probability of transitioning to the next state depends only on the current state, not on how the system got there. Think of it as a map of every possible route through the clinic, with percentages at each fork showing how likely each path is.
This is useful because:
- It makes the invisible visible. Healthcare systems are complex: dozens of actors (GPs, surgeons, oncologists, radiologists, admin staff, patients) interact across months or years. A Markov model collapses this into a readable diagram showing every possible journey.
- It quantifies bottlenecks. If 30% of patients experience a scheduling delay at a particular state, that shows up as a transition probability to a "delay" state. You can measure exactly where the system fails and how often.
- It enables simulation. Once you have the transition matrix, you can simulate thousands of hypothetical patients to answer questions like: "What's the expected time from referral to treatment for a cancer patient?" or "If we reduce the hemorrhoid queue from 12 months to 3 months, how many cancers would we catch earlier?"
- It reveals hidden structure. Patterns that are invisible in individual charts (like the fact that admin task priority almost never matches clinical urgency, or that patients catch scheduling errors more often than staff do) become statistically visible across many patients.
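To make the simulation idea concrete, here is a minimal sketch of walking hypothetical patients through a transition table. The state names and probabilities in `TRANSITIONS` are placeholders for illustration, not clinic data, and `simulate_patient` is a hypothetical helper rather than part of this repository.

```python
import random

# Hypothetical transition probabilities, for illustration only (not clinic data).
TRANSITIONS = {
    "Referral": {"Consultation": 1.0},
    "Consultation": {"Conservative Management": 0.85, "Cancer Workup": 0.15},
    "Conservative Management": {"Discharge": 1.0},
    "Cancer Workup": {"Treatment": 1.0},
    "Treatment": {"Discharge": 1.0},
}


def simulate_patient(transitions, start, rng, max_steps=100):
    """Walk one hypothetical patient from `start` until an absorbing state."""
    path = [start]
    state = start
    # A state with no outgoing transitions is treated as absorbing.
    while state in transitions and len(path) < max_steps:
        targets = list(transitions[state])
        weights = [transitions[state][t] for t in targets]
        state = rng.choices(targets, weights=weights, k=1)[0]
        path.append(state)
    return path


rng = random.Random(0)  # fixed seed so runs are repeatable
paths = [simulate_patient(TRANSITIONS, "Referral", rng) for _ in range(10_000)]
cancer_share = sum("Cancer Workup" in p for p in paths) / len(paths)
print(f"Simulated share entering cancer workup: {cancer_share:.3f}")
```

With sojourn times attached to each state, the same loop could also accumulate elapsed days per simulated patient.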
How It Works: The Pipeline
Step 1: Raw Data (PDFs)
Each patient's source material is a set of scanned clinical documents — faxes, letters, lab reports, imaging reports, operative notes, EMR task screenshots, prescriptions. These are the ground truth. They live in each patient's patient[N]_pdfs/ subfolder.
Step 2: Structured Trace (JSON)
Each patient's PDFs have been manually analyzed and converted into a structured JSON trace (patient[N]_trace.json). Each step in the trace captures:
{
  "step_id": "PAT16-S7",
  "step_type": "procedure_execution",
  "elapsed_days": 281,
  "day_gap_from_previous": 11,
  "actor": "endoscopist",
  "document_type": "endoscopy_report",
  "observation": {
    "document_summary": "...",
    "extracted_fields": { ... },
    "prior_state": { ... }
  },
  "action_space": [
    "possible_action_1",
    "possible_action_2",
    "..."
  ],
  "action_taken": "actual_action_taken",
  "result": {
    "outcome": "...",
    "state_change": { ... },
    "next_step": "..."
  }
}
Key fields:
- step_type: What kind of event this is (referral, consultation, procedure, lab result, admin task, phone call, etc.)
- elapsed_days: Days since the original referral (Day 0)
- actor: Who performed the action (GP, specialist, admin staff, patient, radiologist, etc.)
- observation: What was known at this point — the document contents, extracted clinical data, and the prior state
- action_space: All plausible actions that could have been taken at this decision point (not just what was done, but what else could have happened)
- action_taken: What actually happened
- result: The outcome, how the patient's state changed, and what happens next
The action_space field is particularly important — it captures the counterfactual branches. Even though only one action was taken, the other options represent transitions that other patients might take at the same state.
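As a rough sketch of how a trace could be read programmatically, the snippet below parses a small made-up trace, checks that the day counters agree, and lists the untaken branches at one step. It assumes a trace file is a JSON array of step objects and that `day_gap_from_previous` equals the difference in `elapsed_days` between consecutive steps; neither is confirmed here. The `DEMO-*` steps and both helper functions are hypothetical.

```python
import json

# Hypothetical three-step trace for illustration; IDs and values are made up.
# Assumption: a trace file holds a JSON array of step objects.
SAMPLE_TRACE = """
[
  {"step_id": "DEMO-S1", "elapsed_days": 0, "day_gap_from_previous": 0,
   "action_space": ["triage_routine", "triage_urgent"],
   "action_taken": "triage_routine"},
  {"step_id": "DEMO-S2", "elapsed_days": 270, "day_gap_from_previous": 270,
   "action_space": ["book_colonoscopy", "conservative_management", "discharge"],
   "action_taken": "book_colonoscopy"},
  {"step_id": "DEMO-S3", "elapsed_days": 281, "day_gap_from_previous": 10,
   "action_space": ["take_biopsy", "no_biopsy"],
   "action_taken": "take_biopsy"}
]
"""


def gap_mismatches(steps):
    """Return step_ids whose day_gap_from_previous disagrees with elapsed_days."""
    bad = []
    for prev, cur in zip(steps, steps[1:]):
        expected = cur["elapsed_days"] - prev["elapsed_days"]
        if expected != cur["day_gap_from_previous"]:
            bad.append(cur["step_id"])
    return bad


def counterfactuals(step):
    """Return the actions in action_space that were not taken."""
    return [a for a in step["action_space"] if a != step["action_taken"]]


steps = json.loads(SAMPLE_TRACE)
mismatches = gap_mismatches(steps)
untaken = counterfactuals(steps[1])
print("Inconsistent day gaps:", mismatches)
print("Untaken branches at DEMO-S2:", untaken)
```

The untaken branches are candidates for transitions that other patients may take from the same state.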
Step 3: Patient-Level Markov Analysis (Markdown)
Each patient gets a patient[N]_markov_process.md file that:
- Explains the patient's full journey in plain English
- Includes a glossary of all medical terms
- Maps each step to a Markov state
- Documents every state transition with probabilities (initially estimated from literature, later refined from data)
- Identifies system failures, delays, and communication gaps
- Notes Markov-violating features (history dependence, parallel processes, etc.)
Step 4: Patient-Level Flowchart (Mermaid)
Each patient gets a patient[N]_flowchart.mmd file — a detailed Mermaid diagram showing every appointment, admin task, phone call, and clinical event with exact calendar dates, priorities, completion lags, and outcomes. These are color-coded:
- Red = Cancer diagnosis / urgent findings
- Pink = Phone calls / patient communication
- Yellow = Admin tasks / scheduling
- Green = Good outcomes / milestones
- Coral/Orange = System failures / errors
Step 5: Generalized Clinic Markov Chain (The End Goal)
As more patients are added, their individual traces are synthesized into a single master model that represents the entire clinic. This involves:
a) State Space Discovery
Each new patient may reveal states we haven't seen before. Patient 16 gave us the cancer pathway (referral → staging → neoadjuvant → surgery → surveillance). Other patients will add:
- Hemorrhoid banding / conservative management
- Polyp surveillance pathways
- IBD (Crohn's, ulcerative colitis) management
- Fistula / fissure / abscess surgical pathways
- Diverticular disease management
- Emergency presentations (obstruction, perforation, GI bleed)
- Fecal incontinence / rectal prolapse pathways
- Screening colonoscopy → normal → discharge
- Watch-and-wait after complete clinical response
- Palliative pathways for metastatic disease
The state space grows with each patient until it stabilizes (no new states are discovered — the clinic's full scope of practice is captured).
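One way to watch for stabilization is to record which states each new patient adds. The sketch below assumes each patient has already been reduced to an ordered list of state names; the example paths are invented and `discovery_curve` is a hypothetical helper.

```python
def discovery_curve(patient_paths):
    """For each patient, in order, list the states no earlier patient visited."""
    seen = set()
    curve = []
    for path in patient_paths:
        new_states = sorted(set(path) - seen)
        curve.append(new_states)
        seen.update(new_states)
    return curve


# Invented example paths, not real patients.
example_paths = [
    ["GP Referral", "Wait Queue", "Specialist Consultation", "Cancer Workup"],
    ["GP Referral", "Wait Queue", "Specialist Consultation",
     "Conservative Management", "Discharge"],
    ["GP Referral", "Wait Queue", "Specialist Consultation",
     "Conservative Management", "Discharge"],
]

curve = discovery_curve(example_paths)
for index, new_states in enumerate(curve, start=1):
    print(f"Patient {index}: {len(new_states)} new state(s) {new_states}")
```

A run of patients that add no new states suggests the state space is close to complete for the case mix seen so far.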
b) Transition Counting
Every time a patient moves from State A to State B, that's a count. After N patients:
P(A → B) = count(A → B) / count(all transitions from A)
For example, if 20 patients arrive at the "Specialist Consultation" state and 17 are confirmed to have hemorrhoids while 3 turn out to have cancer:
P(Consult → Conservative Management) = 17/20 = 0.85
P(Consult → Cancer Workup) = 3/20 = 0.15
These replace the literature-estimated probabilities with real clinic-specific numbers.
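The counting rule above can be sketched in a few lines. This assumes each patient is represented as an ordered list of state names; `transition_probabilities` is a hypothetical helper, and the 17/3 split simply reproduces the worked example.

```python
from collections import Counter, defaultdict


def transition_probabilities(paths):
    """Estimate P(A -> B) as count(A -> B) / count(all transitions from A)."""
    counts = defaultdict(Counter)
    for path in paths:
        for a, b in zip(path, path[1:]):
            counts[a][b] += 1
    return {
        a: {b: n / sum(outgoing.values()) for b, n in outgoing.items()}
        for a, outgoing in counts.items()
    }


# Reproduces the worked example: 17 hemorrhoid confirmations, 3 cancers.
example_paths = (
    [["Consult", "Conservative Management"]] * 17
    + [["Consult", "Cancer Workup"]] * 3
)
probs = transition_probabilities(example_paths)
print(probs["Consult"])
```

Each row of the result sums to 1, so it can be dropped straight into a transition matrix.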
c) Sojourn Time Distributions
For each state, we collect the time every patient spent there:
- How long patients wait in the referral queue (by diagnosis)
- How long staging takes (from colonoscopy to complete staging)
- How long from diagnosis to treatment start
- How long admin tasks take to complete vs. their due dates
This gives us not just averages but distributions — we can say "the median wait for a hemorrhoid referral is 14 months (IQR 10–18)" or "staging CT-to-MRI gap is 19 days (range 7–35)."
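A minimal sketch of summarizing one state's sojourn times with the standard library follows. The day counts are invented, `summarize_sojourn` is a hypothetical helper, and the "inclusive" quartile method is one reasonable choice among several, not something this project has settled on.

```python
import statistics


def summarize_sojourn(days):
    """Return median, interquartile range, and range for a list of day counts."""
    q1, median, q3 = statistics.quantiles(days, n=4, method="inclusive")
    return {
        "median": median,
        "iqr": (q1, q3),
        "range": (min(days), max(days)),
    }


# Invented CT-to-MRI gaps in days, for illustration only.
example_gaps = [7, 12, 19, 25, 35]
summary = summarize_sojourn(example_gaps)
print(summary)
```

With only a handful of patients per state, the range and the raw values are often more honest to report than the quartiles.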
d) Failure Mode Cataloging
Each patient's trace reveals system breakdowns. These are cataloged across patients to identify:
- Recurring failures: Are misfiled requisitions a one-off or systemic?
- Priority mismatches: How often does admin task priority fail to reflect clinical urgency?
- Communication gaps: How often do patients have to chase their own results?
- Delay patterns: Are delays concentrated at specific transitions (e.g., MRI scheduling, post-op follow-up booking)?
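To separate one-off failures from systemic ones, a simple tally of how many distinct patients hit each failure mode is a reasonable starting point. The sketch below assumes failure modes have been given consistent labels across patients; the labels, patient IDs, and `recurring_failures` helper are all hypothetical.

```python
from collections import defaultdict


def recurring_failures(failures_by_patient, min_patients=2):
    """Count distinct patients per failure mode, keeping modes seen in >= min_patients."""
    patients_per_mode = defaultdict(set)
    for patient, modes in failures_by_patient.items():
        for mode in modes:
            patients_per_mode[mode].add(patient)
    return {
        mode: len(patients)
        for mode, patients in patients_per_mode.items()
        if len(patients) >= min_patients
    }


# Invented catalog, for illustration only.
example_catalog = {
    "P-A": ["misfiled_requisition", "priority_mismatch"],
    "P-B": ["priority_mismatch"],
    "P-C": ["priority_mismatch", "misfiled_requisition", "unanswered_email"],
}
recurring = recurring_failures(example_catalog)
print(recurring)
```

Counting patients rather than occurrences keeps one chaotic chart from dominating the catalog.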
e) The Master Transition Matrix
The final output is a single transition matrix with:
- All discovered states as rows and columns
- Empirical transition probabilities in each cell
- Confidence intervals based on sample size
- Sojourn time distributions for each state
- Annotated failure modes at vulnerable transitions
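The README does not say how the confidence intervals will be computed. As one option, the Wilson score interval behaves reasonably for small counts; the sketch below applies it to the 3-of-20 example from the transition counting section. The `wilson_interval` helper is hypothetical.

```python
import math


def wilson_interval(successes, total, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if total == 0:
        raise ValueError("no transitions observed from this state")
    p = successes / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return centre - half, centre + half


low, high = wilson_interval(3, 20)
print(f"P(Consult -> Cancer Workup) = 0.15, 95% CI ({low:.2f}, {high:.2f})")
```

For 3 of 20 the interval runs from roughly 0.05 to 0.36, which illustrates how wide the uncertainty is at this sample size.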
Repository Structure
EMR Traces/
├── README.md ← This file
│
├── Sample Patient 16/
│ ├── patient16_trace.json ← Structured trace (98 steps, 1844 days)
│ ├── patient16_markov_process.md ← Full analysis + Markov model + glossary
│ ├── patient16_flowchart.mmd ← Mermaid diagram with all dates/scheduling
│ ├── patient16_pdfs/ ← Source PDFs (P16_S1.pdf through P16_S98.pdf)
│ └── Sample Patient.zip ← Original zip archive
│
├── Sample Patient NN/ ← Future patients follow same structure
│ ├── patientNN_trace.json
│ ├── patientNN_markov_process.md
│ ├── patientNN_flowchart.mmd
│ └── patientNN_pdfs/
│
└── (future) generalized_markov_model.md ← Master model synthesized from all patients
What Patient 16 Taught Us
Patient 16 (John Peters, 73M) is the first and most extensively documented case. His 98-step, 5-year trace provided the foundational state space for the cancer pathway:
States discovered:
- GP Referral → Wait Queue → Specialist Consultation → Cancer Workup (colonoscopy + biopsy) → Staging (CT + MRI) → Tumor Board → Neoadjuvant Therapy (RAPIDO: SCRT + CAPOX) → Restaging → Surgery (LAR) → Post-Op Pathology → Stoma Management → Ileostomy Reversal → Surveillance → De-escalation
Key metrics from this single patient:
- Referral to consultation: 272 days (9 months in hemorrhoid queue)
- Consultation to tissue diagnosis: 9 days
- Diagnosis to complete staging: 35 days
- Staging to cancer centre referral: 40 days
- Referral to treatment start: 354 days (~1 year)
- Total treatment duration (RT + chemo): ~6 months
- Surgery to discharge: 11 days
- Ileostomy duration: 13 months
- Current status: No recurrence at 3+ years, surveillance de-escalating
System failures identified:
- 9-month referral wait with a mass growing undetected
- FIT test false negative in a patient with active rectal cancer
- Sedation prescription task completed 15 days after the MRI it was for
- Post-op follow-up booking task with 172-day completion lag
- Misfiled requisition causing 6-week delay to ileostomy reversal
- DI scheduling error assigning wrong CT date (caught by admin)
- Patient LARS crisis with unanswered email and 131-day task lag
- Discharge summary describing surgery as "uncomplicated" despite intraoperative anastomotic failure
Markov-violating features noted:
- Strong history dependence (treatment decisions depend on full prior state, not just current)
- Highly variable sojourn times (9 months in hemorrhoid queue vs. 9 days on cancer pathway)
- Parallel processes (CT, MRI, pathology, admin tasks running concurrently)
- ~30% of steps are administrative noise that doesn't change clinical state
- Patient agency (preferences, schedule changes, proactive result-seeking)
- Non-stationary system (admin efficiency improved over the 5-year trace)
What We Expect from More Patients
Each additional patient will contribute in one or more of these ways:
| Patient Profile | What It Adds to the Model |
|---|---|
| Hemorrhoid patient (confirmed, discharged) | The most common pathway — fills in the "85% exit" from consultation |
| Polyp found on screening colonoscopy | Endoscopic management branch, polyp surveillance pathway |
| Early-stage cancer (T1-2 N0) | Direct-to-surgery branch (skipping neoadjuvant) |
| Metastatic cancer (M1 at diagnosis) | Palliative pathway, systemic therapy without curative surgery |
| Complete clinical response after chemoRT | Watch-and-wait branch (organ preservation) |
| IBD patient (Crohn's/UC) | Entirely new pathway: medical management, biologics, possible surgery |
| Fistula / fissure / abscess | Surgical pathway for benign anorectal conditions |
| Diverticular disease | Medical vs. surgical management branch |
| Emergency presentation | Urgent/emergent entry point bypassing referral queue |
| Patient who drops out / refuses treatment | "Lost to follow-up" absorbing state |
With ~15–20 patients covering the clinic's main diagnostic categories, the generalized model should stabilize — meaning new patients mostly traverse already-known states rather than discovering new ones. The transition probabilities will continue to refine with each additional patient.
Limitations and Honest Caveats
- Small sample size. Even with all planned patients, we'll have tens of cases, not thousands. Transition probabilities will have wide confidence intervals. This is a structural model (what paths exist) more than a precise statistical one (exact probabilities).
- Selection bias. The "sample patients" were selected for educational/analytical purposes, likely favoring complex or interesting cases. The true distribution of cases at this clinic is probably 70%+ routine hemorrhoid/polyp patients, which will be underrepresented.
- Markov assumption is an approximation. Real clinical decision-making is deeply history-dependent. The model works by encoding enough context into the state definition (e.g., "Post-Neoadjuvant Surgery" is a different state than "Upfront Surgery") but can't capture everything.
- Single clinic, single health system. This model describes one specific colorectal clinic in Calgary within Alberta Health Services. Wait times, referral patterns, treatment protocols, and administrative workflows will differ at other clinics and in other healthcare systems.
- Point-in-time snapshot. Clinical practice evolves. The RAPIDO protocol used for Patient 16 may be replaced by newer approaches. The model captures how this clinic operated during the period covered by the traces, not necessarily how it operates today or will operate tomorrow.
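The state-encoding workaround mentioned under the Markov assumption caveat can be sketched as a relabeling pass over a patient's path. This is an illustration under simple assumptions: the generic state names "Surgery" and "Neoadjuvant Therapy" and the `augment_with_history` helper are hypothetical, and a real model would likely need more than one history flag.

```python
def augment_with_history(path, trigger="Neoadjuvant Therapy", target="Surgery"):
    """Relabel `target` according to whether `trigger` occurred earlier in the path."""
    seen_trigger = False
    augmented = []
    for state in path:
        if state == target:
            label = "Post-Neoadjuvant Surgery" if seen_trigger else "Upfront Surgery"
            augmented.append(label)
        else:
            augmented.append(state)
        if state == trigger:
            seen_trigger = True
    return augmented


with_neoadjuvant = augment_with_history(
    ["Staging", "Neoadjuvant Therapy", "Restaging", "Surgery"]
)
direct = augment_with_history(["Staging", "Surgery"])
print(with_neoadjuvant)
print(direct)
```

Splitting states this way restores some of the Markov property at the cost of a larger state space and fewer observations per state.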
How to Use This Repository
If you're analyzing a new patient:
- Place the PDFs in Sample Patient NN/patientNN_pdfs/
- The structured trace should go in Sample Patient NN/patientNN_trace.json
- Analysis and Markov mapping go in Sample Patient NN/patientNN_markov_process.md
- Detailed flowchart goes in Sample Patient NN/patientNN_flowchart.mmd
If you're building the generalized model:
- Read each patient's _markov_process.md to understand their pathway
- Map their states to the master state space (adding new states as needed)
- Count transitions across all patients
- Update the master transition matrix
- Document new failure modes and sojourn times
If you're viewing the Mermaid diagrams:
- GitHub renders .mmd files natively
- VS Code: install the "Mermaid Preview" extension
- CLI: npx -p @mermaid-js/mermaid-cli mmdc -i file.mmd -o file.svg
- Web: paste contents at mermaid.live