bioreasoning_phenotype

Pharmacology reasoning-chain env on small-molecule perturbations. The model is given a SMILES + cell line + assay context and must reason from chemical structure through to a cellular phenotype, in graded intermediate steps. Each step grades a different scientific question; rewards combine into a weighted aggregate. Optional tools give the model lookup support for compound identity and Hallmark pathway ontology without exposing outcome labels.

Current package version: 0.16.7. Hub env: abugoot/bioreasoning_phenotype. This README describes the environment interface and data construction. Project history and research results live in REPORT.md.

Chain

Full-chain examples ask for three upstream steps, one phenotype-relevant evidence step, and one downstream phenotype, which varies per example (magnitude examples skip the evidence step). Standalone curriculum lanes ask for one selected step. The grader only checks the answer tags, so explanatory reasoning should stay outside them.

Step	Answer tag	Grade against	Metric
1. Target	`<TARGET>SYM1\|SYM2</TARGET>`	Drug Repurposing Hub `target` column	set F1
2. MoA	`<MOA>HDAC inhibitor</MOA>`	DRH `moa` column	normalized exact match
3. Pathways	`<PATHWAYS>HALLMARK_X:up, HALLMARK_Y:down, ...</PATHWAYS>`	all empirically non-neutral Hallmark signed enrichments from L1000 MODZ under the active evidence threshold	set F1 on (name, direction) tuples
3a. Single Hallmark direction	`<PATHWAY_DIRECTION>up</PATHWAY_DIRECTION>`	fixed prompted Hallmark's q20/q80 LINCS bin	exact `up` / `down` / `neutral` match
3b. Hallmark contrast	`<SHIFTED_HALLMARK>HALLMARK_X</SHIFTED_HALLMARK><PATHWAY_DIRECTION>up</PATHWAY_DIRECTION>`	two prompted Hallmarks from the same context, exactly one neutral and one shifted	0.7 shifted-Hallmark selection + 0.3 shifted direction
4. Evidence	`<EVIDENCE>aggregate:cell_cycle_program=down; focused:HALLMARK_X=down; ...</EVIDENCE>`	phenotype-relevant empirical q20/q80 Hallmark direction bins + aggregate evidence + top shifted pathways	aggregate evidence + sparse focused-evidence F1 + top-pathway F1
5. Phenotype	`<<P>>label</<P>>`	per-phenotype GT (see below)	per-phenotype scorer
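
Several rows above score with set F1. The following is a minimal sketch of that metric, assuming pipe-separated symbols inside `<TARGET>` and comma-separated `name:direction` entries inside `<PATHWAYS>`. The helper names, the case normalization, and the empty-vs-empty convention are illustrative assumptions, not the package's actual parser or grader.

```python
import re
from typing import Optional


def extract_tag(text: str, tag: str) -> Optional[str]:
    """Return the content of the last <TAG>...</TAG> pair, or None if absent."""
    matches = re.findall(rf"<{tag}>(.*?)</{tag}>", text, flags=re.DOTALL)
    return matches[-1].strip() if matches else None


def set_f1(pred: set, gold: set) -> float:
    """Harmonic mean of precision and recall over two sets."""
    if not pred and not gold:
        return 1.0  # assumption: two empty sets count as a perfect match
    true_pos = len(pred & gold)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(pred)
    recall = true_pos / len(gold)
    return 2 * precision * recall / (precision + recall)


def parse_targets(content: str) -> set:
    """Split 'SYM1|SYM2' into an upper-cased set of gene symbols."""
    return {sym.strip().upper() for sym in content.split("|") if sym.strip()}


def parse_pathways(content: str) -> set:
    """Split 'NAME:up, NAME:down' into a set of (name, direction) tuples."""
    pairs = set()
    for item in content.split(","):
        name, sep, direction = item.strip().rpartition(":")
        if sep and name:  # skip entries without a 'name:direction' shape
            pairs.add((name.strip().upper(), direction.strip().lower()))
    return pairs


completion = "Reasoning goes here. <TARGET>HDAC1|HDAC2</TARGET>"
predicted = parse_targets(extract_tag(completion, "TARGET"))
target_score = set_f1(predicted, {"HDAC1", "HDAC3"})
```

Because pathway entries are compared as `(name, direction)` tuples, a correct Hallmark name with the wrong sign counts as both a false positive and a false negative.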

Downstream phenotype is one of (selected per example):

Phenotype	Tag	Type	Labels
viability	`<VIABILITY>`	continuous LFC	piecewise-linear on \|err\|; full credit at \|err\| ≤ 0.25, zero credit at \|err\| ≥ 2.0
cell_cycle	`<CELL_CYCLE>`	3-class	`arrest` / `no_effect` / `proliferation`
stress	`<STRESS>`	4-class	`none` / `apoptosis` / `UPR` / `DNA_damage`
magnitude	`<MAGNITUDE>`	3-class	`inert` / `moderate` / `strong`
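
The viability scorer can be sketched as follows. The two breakpoints (0.25 and 2.0) come from the table; the single straight-line ramp between them is an assumption about what "piecewise-linear" means here, and the function name is illustrative.

```python
def viability_score(pred_lfc: float, true_lfc: float,
                    full_credit: float = 0.25,
                    zero_credit: float = 2.0) -> float:
    """Score a predicted log-fold-change against the ground truth.

    Full credit when the absolute error is within `full_credit`, zero when it
    reaches `zero_credit`, and (by assumption) a linear ramp in between.
    """
    err = abs(pred_lfc - true_lfc)
    if err <= full_credit:
        return 1.0
    if err >= zero_credit:
        return 0.0
    return (zero_credit - err) / (zero_credit - full_credit)


close_guess = viability_score(-1.0, -1.1)    # error 0.1, inside the full-credit band
midway_guess = viability_score(0.0, 1.125)   # error 1.125, halfway down the ramp
far_guess = viability_score(0.0, -2.5)       # error 2.5, beyond the zero point
```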

Default reward weights: target 0.08 / moa 0.10 / pathways 0.17 / evidence 0.25 / phenotype 0.40. The aggregate only counts steps requested by the entry point. format_compliance is a tracked metric, not part of the reward.
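
As a rough illustration of how the aggregate restricts itself to requested steps, the sketch below renormalizes the default weights over whichever steps an entry point asks for. The renormalization rule is an assumption consistent with "weighted average over requested steps"; the function name and signature are hypothetical.

```python
DEFAULT_WEIGHTS = {
    "target": 0.08,
    "moa": 0.10,
    "pathways": 0.17,
    "evidence": 0.25,
    "phenotype": 0.40,
}


def aggregate_reward(step_scores: dict, weights: dict = DEFAULT_WEIGHTS) -> float:
    """Weighted average over the steps present in `step_scores`.

    Steps that the entry point did not request are simply left out of
    `step_scores`, so they contribute to neither numerator nor denominator.
    """
    total_weight = sum(weights[step] for step in step_scores)
    if total_weight == 0:
        return 0.0
    weighted = sum(weights[step] * score for step, score in step_scores.items())
    return weighted / total_weight


# from_pathways requests only evidence and phenotype.
from_pathways_reward = aggregate_reward({"evidence": 0.5, "phenotype": 1.0})
```

Under this assumption a `from_pathways` rollout is scored on a 0.25 : 0.40 split between evidence and phenotype rather than being capped at 0.65.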

Data substrate

LINCS L1000 Level 5 MODZ signatures over 6 cell lines (A375, A549, HEPG2, HT29, MCF7, PC3) — used for DEG ranking, signed Hallmark enrichment, transcriptional magnitude (‖z‖₂), cell-cycle / stress / magnitude bucket labels
PRISM Repurposing 24Q2 Extended Primary — continuous viability LFC per (compound × cell), averaged over full Broad sample IDs when the same 13-character compound ID appears multiple times
Drug Repurposing Hub (clue.io) — pert_iname, target, MoA, SMILES, InChIKey, PubChem CID
MSigDB Hallmark v2024.1 — 50 gene sets for pathway enrichment

Joined at the molecule/cell-line level: Drug Repurposing Hub samples collapse to the 13-character Broad compound ID, LINCS uses that short pert_id, PRISM full Broad IDs are averaged by the same short ID, and DepMap ACH IDs map PRISM cell lines to LINCS short cell-line codes. Exact physical-sample matching between LINCS and PRISM is only available for a tiny fraction of pairs, so the env does not treat sample/vendor/batch identity as model-visible context.
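
A minimal sketch of the short-ID collapse and PRISM averaging described above. The Broad IDs in the example are made-up placeholders, the row layout is hypothetical, and the real pipeline presumably works on data frames rather than tuples.

```python
from collections import defaultdict
from statistics import mean


def short_broad_id(full_id: str) -> str:
    """Collapse a full Broad sample ID to its 13-character compound ID."""
    return full_id[:13]


def average_lfc_by_short_id(rows):
    """Average viability LFC per (short compound ID, cell line).

    `rows` is an iterable of (full_broad_id, cell_line, lfc) tuples.
    """
    grouped = defaultdict(list)
    for full_id, cell_line, lfc in rows:
        grouped[(short_broad_id(full_id), cell_line)].append(lfc)
    return {key: mean(values) for key, values in grouped.items()}


# Placeholder IDs for illustration only; they are not real compounds.
prism_rows = [
    ("BRD-K00000001-001-01-1", "A549", -1.0),
    ("BRD-K00000001-003-05-2", "A549", -2.0),
    ("BRD-K00000002-001-01-1", "A549", 0.5),
]
averaged = average_lfc_by_short_id(prism_rows)
```

The first two rows share a short ID and cell line, so they collapse into one averaged entry; this is the molecule-level join, not a physical-sample match.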

chain_gt contains 6,276 (compound × cell-line) pairs; 6,270 have the full upstream chain required for examples. Examples are skinny (one phenotype per example) and cover full-chain, ablation, and standalone curriculum entry points. The split is roughly 90% train / 10% test by compound hash, so train and test never share compounds.
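
One way to get a compound-disjoint split is to hash the compound ID and threshold the result, sketched below. The choice of SHA-256, the 32-bit bucket, and the function name are assumptions; the source only states that the split is by compound hash at roughly 90/10.

```python
import hashlib


def split_for_compound(compound_id: str, test_fraction: float = 0.10) -> str:
    """Assign a compound to 'train' or 'test' from a stable hash of its ID.

    Every example derived from the same compound lands in the same split,
    regardless of cell line, phenotype, or entry point.
    """
    digest = hashlib.sha256(compound_id.encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # roughly uniform in [0, 1)
    return "test" if bucket < test_fraction else "train"


# Placeholder IDs for illustration only.
assignments = {f"BRD-K{i:08d}": split_for_compound(f"BRD-K{i:08d}") for i in range(1000)}
n_test = sum(1 for split in assignments.values() if split == "test")
```

Hashing the ID instead of shuffling rows keeps the assignment stable when examples are rebuilt or new entry points are added.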

v0.10 prompt/data notes:

Prompts explicitly separate the LINCS expression assay context (dose_um, time_h) from the PRISM viability assay context (dose, 5-day endpoint, same compound/cell line but separate assay).
Viability prompts no longer say the PRISM LFC is "under this treatment condition", because that wrongly conflated the LINCS expression condition with the PRISM viability protocol.
Packaged examples carry audit/provenance columns for lincs_dose_um, lincs_time_h, prism_dose_um, prism_duration_h, prism_screens, prism_n_full_ids, and prism_aggregation.
Drug Repurposing Hub has multiple physical sample rows per short Broad ID, but retained compounds have unique pert_iname, target, MoA, and almost always a unique SMILES. The existing first-row collapse is therefore treated as a molecule-level convenience, not as physical sample provenance.
The main chain prompt does not include an inline Hallmark menu or extra concise/exact-tag instructions.
hallmark_tools=True remains available as an optional ontology-tool ablation, but the primary compound-tool setting leaves it disabled.

v0.11 prompt/data notes:

Adds phenotype_from_evidence, a medium-evidence endpoint task. The prompt gives target, MoA, top signed Hallmarks, and selected LINCS-derived Hallmark modules as coarse bins (flat, weak/moderate/strong up/down), then asks only for the downstream phenotype.
Medium evidence is intentionally coarsened from raw Hallmark scores to avoid turning derived cell-cycle/stress labels into exact threshold-calculation tasks.
phenotype_from_evidence is generated for viability, cell_cycle, and stress; magnitude is excluded because it is directly derived from expression strength.

v0.12 prompt/data notes:

Makes evidence a first-class chain step for viability, cell_cycle, and stress: target → MoA → pathways → evidence → phenotype.
The evidence step is explicitly conditioned on the next downstream phenotype class. It asks the model to predict coarse bins for fixed focused Hallmark modules, plus top_up / top_down Hallmark names. For viability only, it also asks for the overall transcriptional-magnitude bucket.
Adds evidence_from_pathways, a standalone pathway-to-evidence curriculum lane with target, MoA, and top signed pathways given.
Magnitude examples keep the older target → MoA → pathways → phenotype chain because a magnitude evidence step would directly expose the magnitude label.

v0.13 prompt/data notes:

Sparse evidence targets: the focused evidence packet includes only non-flat focused Hallmark modules. Flat focused modules are omitted; examples with no non-flat focused module use an explicit focused:none prompt convention.
The focused-evidence metric now behaves like precision/recall over non-flat focused entries with partial credit for correct direction/strength. Extra non-flat focused predictions are penalized, so a generic "mostly flat" answer no longer receives high evidence credit on UPR/DNA-damage slices.
top_up / top_down remain as secondary shifted-pathway evidence, and viability examples still include the transcriptional-magnitude bucket.

v0.14 prompt/data notes:

Focused evidence bins now use empirical per-Hallmark q20/q80 thresholds: down below q20, up above q80, and neutral otherwise.
The package also carries q10/q90 and q30/q70 evidence variants for diagnostics. load_environment(..., evidence_schemes=["q10_q90"]) selects a non-default scheme; None defaults to q20/q80.
Prompt-visible focused evidence still lists only non-neutral modules.
Evidence answers also include phenotype-relevant aggregate fields, such as cell_cycle_program=down|neutral_or_mixed|up or stress_axis=none|apoptosis|UPR|DNA_damage.
The evidence score now includes an aggregate subscore, so the pathway-to-evidence curriculum must learn both relevant module selection and the aggregate bridge used by phenotype-from-evidence prompts.
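
The empirical binning rule can be sketched with the standard library. The percentile pairs mirror the q20/q80, q10/q90, and q30/q70 schemes named above; the quantile interpolation method and the function names are assumptions, and the package may compute its thresholds differently.

```python
from statistics import quantiles


def empirical_thresholds(scores, low_pct: int = 20, high_pct: int = 80):
    """Return the (low, high) percentile cut points for one Hallmark's scores."""
    # With n=100, cuts[k - 1] is the k-th percentile.
    cuts = quantiles(scores, n=100, method="inclusive")
    return cuts[low_pct - 1], cuts[high_pct - 1]


def direction_bin(score: float, q_low: float, q_high: float) -> str:
    """Map a signed enrichment score to down / neutral / up."""
    if score < q_low:
        return "down"
    if score > q_high:
        return "up"
    return "neutral"


# Toy score distribution for a single Hallmark across all contexts.
hallmark_scores = [float(i) for i in range(11)]
q20, q80 = empirical_thresholds(hallmark_scores)          # default scheme
q10, q90 = empirical_thresholds(hallmark_scores, 10, 90)  # diagnostic variant
bins = [direction_bin(s, q20, q80) for s in (1.0, 5.0, 9.5)]
```

Thresholds are computed per Hallmark, so the same raw score can be neutral for one gene set and shifted for another.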

v0.15 prompt/data notes:

The pathway step now uses the same empirical threshold family as the evidence step. Instead of asking for the legacy top-5 Hallmark summary, pathway labels contain all non-neutral Hallmark pathways under the active scheme (q20_q80 by default).
Evidence-scheme variants carry matching pathway contexts. For example, evidence_schemes=["q10_q90"] gives q10/q90-thresholded pathways in from_pathways, evidence_from_pathways, and phenotype_from_evidence prompts, while the default env gives q20/q80-thresholded pathways.
The legacy top-5 pathway summary is retained in data construction only as pathways_top5_signed for audit/debugging, not as the graded pathway target.

v0.16 prompt/data notes:

Adds pathway_direction_from_moa, a dense pathway curriculum lane. The prompt gives SMILES, cell line, LINCS assay context, target, MoA, and one canonical Hallmark pathway, then asks for only that pathway's empirical direction bin: up, down, or neutral.
This first dense lane is materialized for the default q20/q80 pathway state only, to avoid tripling the fixed-Hallmark row set before the task has proven useful.
The metric hallmark_direction_accuracy tracks exact class accuracy for this fixed-Hallmark task. The older pathway_direction_accuracy metric still refers to direction accuracy within full signed pathway-list predictions.

v0.16.1 prompt notes:

The single-Hallmark direction prompt no longer frames neutral as the default when a direct mechanism is unclear. It defines neutral as the empirical middle bin and warns that missing direct tool/LINCS evidence is not itself evidence for neutral.

v0.16.2 package notes:

Rebuilds the packaged example parquet so the v0.16.1 single-Hallmark prompt wording is present in materialized pathway_direction_from_moa rows.

v0.16.3 loader notes:

Adds balance_train_by, a train-only undersampling option for auxiliary curriculum probes. For pathway_direction_from_moa, setting balance_train_by="pathway_direction" produces equal neutral/up/down train counts while preserving the natural held-out eval distribution.

v0.16.4 prompt/tool notes:

Clarifies the single-Hallmark direction prompt by spelling out the default q20/q80 empirical bins: down below the Hallmark-specific 20th percentile, neutral between the 20th and 80th percentiles, and up above the 80th percentile.
The single-Hallmark direction prompt now explicitly suggests using the optional describe_hallmark tool, when enabled, to inspect member genes before deciding whether the compound target/MoA has a direct, indirect, or weak relationship to the requested gene set.

v0.16.5 loader notes:

Adds train_class_mix, a train-only proportional undersampling option used with balance_train_by. For example, setting balance_train_by="pathway_direction" and train_class_mix={"neutral": 0.5, "down": 0.25, "up": 0.25} yields a mild direction-enriched training split while preserving the natural held-out eval distribution.
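
A sketch of proportional undersampling toward a target class mix. This is not the loader's code: the sizing rule (take the largest total that every class can supply), the seeding, and the row layout are all assumptions made for illustration.

```python
import random
from collections import Counter, defaultdict


def undersample_to_mix(rows, key: str, class_mix: dict, seed: int = 0):
    """Undersample `rows` so that class proportions match `class_mix`.

    `rows` is a list of dicts, `key` names the class column, and `class_mix`
    maps class label to target fraction (fractions should sum to 1).
    """
    by_class = defaultdict(list)
    for row in rows:
        by_class[row[key]].append(row)
    # Largest total size such that no class is asked for more rows than it has.
    total = min(len(by_class[label]) / frac
                for label, frac in class_mix.items() if frac > 0)
    rng = random.Random(seed)
    sampled = []
    for label, frac in class_mix.items():
        k = min(len(by_class[label]), round(total * frac))
        sampled.extend(rng.sample(by_class[label], k))
    return sampled


# Toy train set: 600 neutral, 100 down, 120 up.
train_rows = ([{"pathway_direction": "neutral"}] * 600
              + [{"pathway_direction": "down"}] * 100
              + [{"pathway_direction": "up"}] * 120)
mixed = undersample_to_mix(
    train_rows, "pathway_direction",
    {"neutral": 0.5, "down": 0.25, "up": 0.25},
)
mixed_counts = Counter(row["pathway_direction"] for row in mixed)
```

In this toy case the scarcest class relative to its share (`down`) fixes the total at 400 rows. Only the train split would be passed through such a function, leaving the eval distribution untouched.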

v0.16.6 prompt/data notes:

Adds pathway_contrast_from_moa, a two-candidate Hallmark contrast lane. The prompt gives SMILES, cell line, LINCS assay context, target, MoA, and two candidate Hallmarks from the same compound-cell context. Exactly one candidate is neutral and one is shifted under q20/q80.
The model outputs the shifted canonical Hallmark name in <SHIFTED_HALLMARK> and the shifted direction in <PATHWAY_DIRECTION>. Reward is 70% shifted-Hallmark selection and 30% direction, with direction credit gated on selecting the correct shifted Hallmark.
Packaged contrast rows sample three neutral distractors for each shifted Hallmark within a context, keeping the task dense without materializing every possible shifted-vs-neutral pair.
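
The gated contrast reward can be written out as a short sketch. The 0.7 / 0.3 weights and the gating come from the description above; the function name and the case-insensitive comparison are assumptions.

```python
def contrast_score(pred_hallmark: str, pred_direction: str,
                   gold_hallmark: str, gold_direction: str,
                   selection_weight: float = 0.7,
                   direction_weight: float = 0.3) -> float:
    """Score a two-candidate contrast answer.

    Direction credit is only available once the shifted Hallmark itself
    has been selected correctly.
    """
    selected = pred_hallmark.strip().upper() == gold_hallmark.strip().upper()
    if not selected:
        return 0.0  # gated: a lucky direction guess earns nothing
    direction_ok = pred_direction.strip().lower() == gold_direction.strip().lower()
    return selection_weight + (direction_weight if direction_ok else 0.0)


both_right = contrast_score("HALLMARK_E2F_TARGETS", "down", "HALLMARK_E2F_TARGETS", "down")
name_only = contrast_score("HALLMARK_E2F_TARGETS", "up", "HALLMARK_E2F_TARGETS", "down")
wrong_name = contrast_score("HALLMARK_APOPTOSIS", "down", "HALLMARK_E2F_TARGETS", "down")
```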

v0.16.7 prompt/data notes:

Adds pathway_contrast_context_from_moa, a scaffolded version of the two-Hallmark contrast lane. It gives a deterministic, non-exhaustive sample of other observed shifted Hallmarks from the same LINCS context while intentionally omitting both candidate Hallmarks.
This lane is for ablations that test whether partial pathway-state context, with or without the optional describe_hallmark tool, helps models infer the held-out shifted Hallmark and its direction.

Entry points

Pre-fill upstream steps to measure scaffolding sensitivity:

smiles_only — model starts from SMILES, predicts target / MoA / pathways / evidence / phenotype
from_target — target is given as context, model predicts MoA / pathways / evidence / phenotype
from_moa — target + MoA given, model predicts pathways / evidence / phenotype
from_pathways — target + MoA + pathways given, model predicts evidence / phenotype
phenotype_direct — no chain at all, just SMILES + cell + "predict phenotype"
target_from_smiles — standalone target prediction from SMILES
moa_from_smiles — standalone MoA prediction from SMILES
moa_from_target — standalone MoA prediction with target given
pathways_from_smiles — standalone signed-Hallmark prediction from SMILES
pathways_from_moa — standalone signed-Hallmark prediction with target + MoA given
pathway_direction_from_moa — standalone fixed-Hallmark up / down / neutral prediction with target + MoA given
pathway_contrast_from_moa — standalone two-Hallmark contrast: choose the shifted candidate and its up / down direction with target + MoA given
pathway_contrast_context_from_moa — scaffolded two-Hallmark contrast with a non-exhaustive sample of other shifted Hallmarks from the same LINCS context
evidence_from_pathways — standalone phenotype-relevant evidence prediction with target + MoA + signed pathways given
phenotype_from_moa — phenotype prediction with target + MoA given
phenotype_from_pathways — phenotype prediction with signed pathways given
phenotype_from_evidence — phenotype prediction with target + MoA + signed pathways + coarse LINCS module evidence given

The first five are the original full-chain/scaffold diagnostic entry points. The standalone lanes were added for curriculum phases that refresh a specific part of the chain without requiring the model to solve every upstream step in the same rollout.

`load_environment` args

load_environment(
    entry_points=None,          # list[str] | None — default: all entry points
    phenotypes=None,            # list[str] | None — default: all four
    evidence_schemes=None,      # list[str] | None — default: ["q20_q80"]
    cell_lines=None,            # list[str] | None — default: all 6
    num_train_examples=-1,      # int — -1 = all, else downsample
    num_eval_examples=-1,
    balance_train_by=None,      # str | None — train-only undersampling column
    train_class_mix=None,       # dict[str, float] | None — proportional train mix
    reward_weights=None,        # dict[str, float] — default below
    tools=False,                # if True, expose identify_compound via ToolEnv
    hallmark_tools=False,       # if True, expose describe_hallmark via ToolEnv
    max_tool_turns=5,           # bound tool-call rounds when any tool is enabled
)

Default reward weights:

{"target": 0.08, "moa": 0.10, "pathways": 0.17, "evidence": 0.25, "phenotype": 0.40}
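
The eval commands under Local usage pass environment arguments as a JSON object through `-a`. The helper below is a hypothetical convenience for building that string from Python: it checks keys against the signature shown above and serializes them. It assumes every `load_environment` keyword can be passed this way, which the source demonstrates only for some of them.

```python
import json
import shlex

# Keyword names copied from the load_environment signature above.
KNOWN_ARGS = {
    "entry_points", "phenotypes", "evidence_schemes", "cell_lines",
    "num_train_examples", "num_eval_examples", "balance_train_by",
    "train_class_mix", "reward_weights", "tools", "hallmark_tools",
    "max_tool_turns",
}


def env_args_json(**kwargs) -> str:
    """Serialize environment arguments, rejecting misspelled keys early."""
    unknown = set(kwargs) - KNOWN_ARGS
    if unknown:
        raise ValueError(f"unknown load_environment args: {sorted(unknown)}")
    return json.dumps(kwargs)


args = env_args_json(
    entry_points=["pathway_direction_from_moa"],
    balance_train_by="pathway_direction",
    train_class_mix={"neutral": 0.5, "down": 0.25, "up": 0.25},
)
print("-a", shlex.quote(args))  # shell-safe fragment for the command line
```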

Rubric

One reward function, plus tracked metrics for the upstream, evidence, and phenotype steps:

Function	Type	Range
`aggregate_reward`	reward (weight 1.0)	0–1 weighted average over requested steps
`target_f1`	metric	0–1 set F1 on gene symbols
`moa_accuracy`	metric	0 or 1 normalized exact match
`pathway_signed_f1`	metric	0–1 set F1 on `(name, direction)` tuples
`pathway_name_validity`	metric	fraction of parsed pathway predictions using exact canonical Hallmark names
`pathway_name_f1`	metric	0–1 set F1 on pathway names, ignoring direction
`pathway_direction_accuracy`	metric	direction accuracy among exact pathway-name overlaps
`hallmark_direction_accuracy`	metric	exact class match for the single fixed-Hallmark direction lane
`contrast_shifted_accuracy`	metric	exact shifted-Hallmark selection for the two-candidate contrast lane
`contrast_direction_accuracy`	metric	exact up/down direction, gated on selecting the correct shifted Hallmark
`contrast_score`	metric	same 0.7 selection / 0.3 direction score used by the contrast reward
`evidence_score`	metric	weighted evidence score over aggregate fields, non-neutral focused entries, top pathways, validity, and viability magnitude
`evidence_aggregate_score`	metric	exact match on phenotype-relevant aggregate evidence fields
`evidence_focused_bin_score`	metric	sparse F1-style score over non-neutral focused modules; exact direction gets full credit and extra non-neutral guesses hurt precision
`evidence_top_f1`	metric	direction-aware F1 on `top_up` / `top_down` Hallmark names
`evidence_name_validity`	metric	fraction of parsed evidence Hallmark names using exact canonical Hallmark names
`evidence_magnitude_score`	metric	viability-only magnitude bucket score
`phenotype_score`	metric	per-phenotype (LFC piecewise-linear, else exact match)
`format_compliance`	metric	0–1 fraction of requested answer tags present
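
As an illustration of the last row, `format_compliance` can be approximated as the share of requested tags that appear as a complete open/close pair. The regex, the handling of an empty request list, and the function signature are assumptions rather than the packaged implementation.

```python
import re


def format_compliance(completion: str, requested_tags: list) -> float:
    """Fraction of requested answer tags present as <TAG>...</TAG> pairs."""
    if not requested_tags:
        return 1.0  # assumption: nothing requested means nothing missing
    present = sum(
        1 for tag in requested_tags
        if re.search(rf"<{tag}>.*?</{tag}>", completion, flags=re.DOTALL)
    )
    return present / len(requested_tags)


completion = (
    "The scaffold resembles a hydroxamate.\n"
    "<TARGET>HDAC1|HDAC2</TARGET>\n"
    "<MOA>HDAC inhibitor</MOA>\n"
)
compliance = format_compliance(completion, ["TARGET", "MOA", "PATHWAYS", "VIABILITY"])
```

Since this metric is tracked but not rewarded, a low value points to prompt or parsing problems rather than directly lowering `aggregate_reward`.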

Tools (optional)

tools=True exposes the compound lookup:

identify_compound(smiles: str) -> {
    "exact_match": {"name": "..."} | None,        # InChIKey-canonical lookup in our 1046 compounds
    "descriptors": {                              # rdkit physicochemical
        "molecular_weight": ..., "logp": ..., "tpsa": ...,
        "num_h_bond_donors": ..., "num_h_bond_acceptors": ...,
        "num_rings": ..., "num_aromatic_rings": ..., "num_rotatable_bonds": ...,
    },
    "scaffold": "<canonical SMILES of Bemis-Murcko scaffold>",
    "nearest_neighbors": [
        {"name": "<drug name>", "similarity": 0.xx},  # top-5 Tanimoto
        ...
    ]
}

Returns drug name + structural context without leaking target / MoA / pathways. For known compounds, exact_match gives the name directly; for novel/perturbed SMILES, the nearest-neighbor list surfaces analog hints (e.g. an unknown adrenergic returns "dobutamine at 0.80 similarity").
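
The nearest-neighbor field can be illustrated with a dependency-free Tanimoto sketch. Fingerprints are modeled here as sets of on-bit indices and the library entries are placeholders; the actual tool uses rdkit, and its fingerprint type and rounding are not specified in this README.

```python
def tanimoto(a: frozenset, b: frozenset) -> float:
    """Tanimoto (Jaccard) similarity between two sets of on-bit indices."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def nearest_neighbors(query_bits: frozenset, library: dict, k: int = 5) -> list:
    """Return the top-k library compounds by similarity to the query."""
    scored = [
        {"name": name, "similarity": round(tanimoto(query_bits, bits), 2)}
        for name, bits in library.items()
    ]
    scored.sort(key=lambda entry: entry["similarity"], reverse=True)
    return scored[:k]


# Placeholder fingerprints for illustration only.
library = {
    "compound_a": frozenset({1, 2, 3, 4, 5}),
    "compound_b": frozenset({1, 9}),
    "compound_c": frozenset({7, 8}),
}
neighbors = nearest_neighbors(frozenset({1, 2, 3, 4}), library, k=2)
```

Only names and similarities are returned, which matches the tool's intent of giving structural hints without exposing target, MoA, or pathway labels.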

hallmark_tools=True exposes ontology metadata only:

describe_hallmark(pathway: str, max_genes: int = 25) -> {
    "canonical_name": "HALLMARK_ANDROGEN_RESPONSE",
    "matched_from": "androgen signaling",
    "msigdb_url": "...",
    "gene_count": 101,
    "member_genes_sample": ["ABCC4", "ABHD2", "..."],
    "aliases": ["ANDROGEN_SIGNALING", "..."],
}

This deliberately does not return compound-specific pathway scores, observed directions, or phenotype labels. It is intended to reduce vocabulary/ontology errors without turning the pathway step into retrieval of the answer.

Local usage

Evals run via prime eval run against the Hub-published env (abugoot/bioreasoning_phenotype). Auth is handled by prime login — no PRIME_API_KEY env var needed. Both prime eval run and vf-eval execute the same code path; prime eval run adds auth, billing preflight, model-registry validation, and auto-uploads results to prime eval list.

uv pip install -e .

# diagnostic eval (default = all entry points × 4 phenotypes, random sample)
prime eval run abugoot/bioreasoning_phenotype@0.15.0 \
  -m openai/gpt-4.1-mini \
  -n 32 -r 1 -a '{}' --max-tokens 16384

# full-chain eval (smiles_only only)
prime eval run abugoot/bioreasoning_phenotype@0.15.0 \
  -m openai/gpt-4.1-mini \
  -n 100 -r 1 -a '{"entry_points":["smiles_only"]}' \
  --max-tokens 16384

# select a phenotype subset
prime eval run abugoot/bioreasoning_phenotype@0.15.0 \
  -m openai/gpt-4.1-mini \
  -n 100 -r 1 \
  -a '{"entry_points":["smiles_only"], "phenotypes":["viability","cell_cycle","stress"]}' \
  --max-tokens 16384

# with compound retrieval only
prime eval run abugoot/bioreasoning_phenotype@0.15.0 \
  -m openai/gpt-4.1-mini \
  -n 100 -r 1 \
  -a '{"entry_points":["smiles_only"], "tools": true, "max_tool_turns": 5}' \
  --max-tokens 16384

# with compound retrieval + Hallmark ontology tool
prime eval run abugoot/bioreasoning_phenotype@0.15.0 \
  -m openai/gpt-4.1-mini \
  -n 100 -r 1 \
  -a '{"entry_points":["smiles_only"], "tools": true, "hallmark_tools": true, "max_tool_turns": 5}' \
  --max-tokens 16384

Data prep (one-time, dev only)

Scripts in scripts/ build the bundled data/smallmol_chain_examples.parquet and data/compound_table.parquet:

fetch_smallmol.py — download L1000 GCTX, PRISM CSV, Drug Repurposing Hub, Hallmark GMT
build_compound_table.py — InChIKey-canonical compound × cell line table
filter_l1000.py — extract MODZ signatures for our compound set
compute_chain_gt.py — per-pair GT (target, MoA, signed pathways, medium-evidence packet, cell_cycle / stress / magnitude buckets, viability LFC)
build_examples.py — expand to all entry points × phenotypes, split train/test by compound hash
smoke.py — local sanity checks

The data-prep extra (pip install -e ".[data-prep]") installs cmapPy, h5py, openpyxl, scipy. (rdkit is in the runtime deps because the identify_compound tool uses it.)