bioreasoning_phenotype
Pharmacology reasoning-chain env on small-molecule perturbations. The model is given a SMILES + cell line + assay context and must reason from chemical structure through to a cellular phenotype, in graded intermediate steps. Each step grades a different scientific question; rewards combine into a weighted aggregate. Optional tools give the model non-outcome lookup support for compound identity and Hallmark pathway ontology.
Current package version: 0.10.0. Hub env: abugoot/bioreasoning_phenotype.
This README describes the environment interface and data construction. Project
history and research results live in REPORT.md.
Chain
Full-chain examples ask for three upstream steps + one downstream phenotype (varies per example). Standalone curriculum lanes ask for one selected step. The grader checks only the answer tags; free-form reasoning between or around the tags is welcome but is neither required nor graded.
| Step | Answer tag | Grade against | Metric |
|---|---|---|---|
| 1. Target | <TARGET>SYM1\|SYM2</TARGET> | Drug Repurposing Hub (DRH) target column | set F1 |
| 2. MoA | <MOA>HDAC inhibitor</MOA> | DRH moa column | normalized exact match |
| 3. Pathways | <PATHWAYS>HALLMARK_X:up, HALLMARK_Y:down, ...</PATHWAYS> | Top-5 Hallmark signed enrichment from L1000 MODZ | set F1 on (name, direction) tuples |
| 4. Phenotype | <<P>>label</<P>> | per-phenotype GT (see below) | per-phenotype scorer |
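A minimal sketch of how the answer tags could be pulled out of a completion and how the target step's set F1 could be computed. The function names, the choice to take the last occurrence of a tag, and the upper-casing of gene symbols are assumptions for illustration, not the environment's actual parser.

```python
import re
from typing import Optional, Set


def extract_tag(text: str, tag: str) -> Optional[str]:
    """Return the stripped body of the last <tag>...</tag> pair, or None if absent."""
    pattern = rf"<{re.escape(tag)}>(.*?)</{re.escape(tag)}>"
    matches = re.findall(pattern, text, flags=re.DOTALL)
    return matches[-1].strip() if matches else None


def parse_targets(completion: str) -> Set[str]:
    """Split the TARGET body on '|' into a set of gene symbols (upper-casing is an assumption)."""
    body = extract_tag(completion, "TARGET")
    if body is None:
        return set()
    return {sym.strip().upper() for sym in body.split("|") if sym.strip()}


def set_f1(pred: Set[str], gold: Set[str]) -> float:
    """Harmonic mean of precision and recall over two sets; 0.0 when nothing overlaps."""
    tp = len(pred & gold)
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)


completion = "Likely an HDAC binder. <TARGET>HDAC1|HDAC2</TARGET> <MOA>HDAC inhibitor</MOA>"
print(parse_targets(completion))
print(set_f1(parse_targets(completion), {"HDAC1"}))
```

Text outside the tags is ignored here, which matches the statement that only the tags are checked.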
Downstream phenotype is one of (selected per example):
| Phenotype | Tag | Type | Labels / scoring |
|---|---|---|---|
| viability | <VIABILITY> | continuous LFC | piecewise-linear on absolute error: full credit within ±0.25, zero at ±2.0 or beyond |
| cell_cycle | <CELL_CYCLE> | 3-class | arrest / no_effect / proliferation |
| stress | <STRESS> | 4-class | none / apoptosis / UPR / DNA_damage |
| magnitude | <MAGNITUDE> | 3-class | inert / moderate / strong |
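The viability scorer can be sketched as follows. The table gives only the two breakpoints (full credit within 0.25 LFC, zero at 2.0 LFC or beyond); the single straight ramp between them, and the function and parameter names, are assumptions.

```python
def viability_score(pred_lfc: float, true_lfc: float,
                    full_credit: float = 0.25, zero_credit: float = 2.0) -> float:
    """Score a predicted log-fold-change against ground truth on a 0-1 scale."""
    err = abs(pred_lfc - true_lfc)
    if err <= full_credit:
        return 1.0
    if err >= zero_credit:
        return 0.0
    # Assumed: one linear segment from (full_credit, 1.0) down to (zero_credit, 0.0).
    return (zero_credit - err) / (zero_credit - full_credit)


print(viability_score(-1.0, -1.1))   # inside the full-credit band
print(viability_score(0.0, 1.125))   # halfway down the ramp
```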
Default reward weights: target 0.15 / moa 0.15 / pathways 0.25 / phenotype 0.45. The aggregate only counts steps requested by the entry point (so e.g. from_pathways examples weight only the phenotype). format_compliance is a tracked metric, not part of the reward.
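As an illustration of how the aggregate restricts itself to requested steps, the sketch below renormalizes the default weights over whichever steps an entry point asks for. The function name and the treatment of a missing step score as 0.0 are assumptions.

```python
from typing import Dict, Iterable, Optional

DEFAULT_WEIGHTS = {"target": 0.15, "moa": 0.15, "pathways": 0.25, "phenotype": 0.45}


def aggregate_reward(step_scores: Dict[str, float],
                     requested: Iterable[str],
                     weights: Optional[Dict[str, float]] = None) -> float:
    """Weighted average of step scores over the requested steps only."""
    weights = weights or DEFAULT_WEIGHTS
    requested = list(requested)
    total = sum(weights[step] for step in requested)
    if total == 0:
        return 0.0
    # Assumed: a requested step with no parsed answer contributes a score of 0.0.
    return sum(weights[step] * step_scores.get(step, 0.0) for step in requested) / total


# from_pathways: only the phenotype is requested, so it carries the whole reward.
print(aggregate_reward({"phenotype": 0.5}, ["phenotype"]))
# smiles_only: all four steps are requested.
print(aggregate_reward({"target": 1.0, "moa": 0.0, "pathways": 0.4, "phenotype": 1.0},
                       ["target", "moa", "pathways", "phenotype"]))
```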
Data substrate
- LINCS L1000 Level 5 MODZ signatures over 6 cell lines (A375, A549, HEPG2, HT29, MCF7, PC3): used for DEG ranking, signed Hallmark enrichment, transcriptional magnitude (‖z‖₂), and cell-cycle / stress / magnitude bucket labels
- PRISM Repurposing 24Q2 Extended Primary: continuous viability LFC per (compound × cell), averaged over full Broad sample IDs when the same 13-character compound ID appears multiple times
- Drug Repurposing Hub (clue.io): pert_iname, target, MoA, SMILES, InChIKey, PubChem CID
- MSigDB Hallmark v2024.1: 50 gene sets for pathway enrichment
Joined at the molecule/cell-line level: Drug Repurposing Hub samples collapse to the 13-character Broad compound ID, LINCS uses that short pert_id, PRISM full Broad IDs are averaged by the same short ID, and DepMap ACH IDs map PRISM cell lines to LINCS short cell-line codes. Exact physical-sample matching between LINCS and PRISM is only available for a tiny fraction of pairs, so the env does not treat sample/vendor/batch identity as model-visible context.
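The collapse to the 13-character compound ID and the PRISM averaging can be sketched in a few lines. The Broad IDs and LFC values below are made up for illustration, and the helper names are hypothetical; the real pipeline lives in the data-prep scripts.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, Tuple


def short_broad_id(full_id: str) -> str:
    """Keep the first 13 characters of a full Broad sample ID (e.g. 'BRD-K12345678')."""
    return full_id[:13]


def average_lfc(rows: Iterable[Tuple[str, str, float]]) -> Dict[Tuple[str, str], float]:
    """Average viability LFC over full sample IDs sharing a short ID and cell line."""
    grouped = defaultdict(list)
    for full_id, cell_line, lfc in rows:
        grouped[(short_broad_id(full_id), cell_line)].append(lfc)
    return {key: mean(values) for key, values in grouped.items()}


# Hypothetical rows: two physical samples of one compound, one sample of another.
rows = [
    ("BRD-K12345678-001-01-1", "A375", -1.0),
    ("BRD-K12345678-003-02-5", "A375", -2.0),
    ("BRD-A00000001-001-01-1", "A375", 0.5),
]
print(average_lfc(rows))
```

Averaging at this level is what makes sample, vendor, and batch identity invisible downstream.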
chain_gt contains 6,276 (compound × cell-line) pairs; 6,270 have the full upstream chain required for examples. Examples are skinny (one phenotype per sample) and cover full-chain, ablation, and standalone curriculum entry points. The split is ~90% train / 10% test by compound hash, so train and test never share compounds.
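A hash-based split of this kind might look like the sketch below. SHA-256, the 32-bit bucket, and the function name are assumptions; the hash that build_examples.py actually uses is not stated here. The point is that the split depends only on the compound ID, so every example for a compound lands on the same side.

```python
import hashlib


def split_for(compound_id: str, test_fraction: float = 0.10) -> str:
    """Deterministically assign a compound to 'train' or 'test' from its ID alone."""
    digest = hashlib.sha256(compound_id.encode("utf-8")).hexdigest()
    # Map the first 32 bits of the digest to a number in [0, 1).
    bucket = int(digest[:8], 16) / 0x100000000
    return "test" if bucket < test_fraction else "train"


# The same (hypothetical) compound ID always gets the same split, whatever the cell line.
print(split_for("BRD-K12345678"))
```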
v0.10 prompt/data notes:
- Prompts explicitly separate the LINCS expression assay context (dose_um, time_h) from the PRISM viability assay context (dose, 5-day endpoint, same compound/cell line but separate assay).
- Viability prompts no longer say the PRISM LFC is "under this treatment condition", because that wrongly conflated the LINCS expression condition with the PRISM viability protocol.
- Packaged examples carry audit/provenance columns for lincs_dose_um, lincs_time_h, prism_dose_um, prism_duration_h, prism_screens, prism_n_full_ids, and prism_aggregation.
- Drug Repurposing Hub has multiple physical sample rows per short Broad ID, but retained compounds have unique pert_iname, target, MoA, and almost always a unique SMILES. The existing first-row collapse is therefore treated as a molecule-level convenience, not as physical sample provenance.
- The main chain prompt does not include an inline Hallmark menu or extra concise/exact-tag instructions. hallmark_tools=True remains available as an optional ontology-tool ablation, but the primary compound-tool setting leaves it disabled.
Entry points
Pre-fill upstream steps to measure scaffolding sensitivity:
- smiles_only: model starts from SMILES, predicts all 4 steps (the RL training setting)
- from_target: target is given as context, model predicts MoA / pathways / phenotype
- from_moa: target + MoA given, model predicts pathways / phenotype
- from_pathways: target + MoA + pathways given, model predicts phenotype only
- phenotype_direct: no chain at all, just SMILES + cell + "predict phenotype"
- target_from_smiles: standalone target prediction from SMILES
- moa_from_smiles: standalone MoA prediction from SMILES
- moa_from_target: standalone MoA prediction with target given
- pathways_from_smiles: standalone signed-Hallmark prediction from SMILES
- pathways_from_moa: standalone signed-Hallmark prediction with target + MoA given
- phenotype_from_moa: phenotype prediction with target + MoA given
- phenotype_from_pathways: phenotype prediction with signed pathways given
The first five are the original full-chain/scaffold diagnostic entry points. The standalone lanes were added for curriculum phases that refresh a specific part of the chain without requiring the model to solve every upstream step in the same rollout.
load_environment args
load_environment(
entry_points=None, # list[str] | None — default: all entry points
phenotypes=None, # list[str] | None — default: all four
cell_lines=None, # list[str] | None — default: all 6
num_train_examples=-1, # int — -1 = all, else downsample
num_eval_examples=-1,
reward_weights=None, # dict[str, float] — default below
tools=False, # if True, expose identify_compound via ToolEnv
hallmark_tools=False, # if True, expose describe_hallmark via ToolEnv
max_tool_turns=5, # bound tool-call rounds when any tool is enabled
)
Default reward weights:
{"target": 0.15, "moa": 0.15, "pathways": 0.25, "phenotype": 0.45}
Rubric
One reward function + eight tracked metrics:
| Function | Type | Range |
|---|---|---|
| aggregate_reward | reward (weight 1.0) | 0–1 weighted average over requested steps |
| target_f1 | metric | 0–1 set F1 on gene symbols |
| moa_accuracy | metric | 0 or 1 normalized exact match |
| pathway_signed_f1 | metric | 0–1 set F1 on (name, direction) tuples |
| pathway_name_validity | metric | fraction of parsed pathway predictions using exact canonical Hallmark names |
| pathway_name_f1 | metric | 0–1 set F1 on pathway names, ignoring direction |
| pathway_direction_accuracy | metric | direction accuracy among exact pathway-name overlaps |
| phenotype_score | metric | per-phenotype (LFC piecewise-linear, else exact match) |
| format_compliance | metric | 0–1 fraction of requested answer tags present |
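The three pathway metrics differ only in what they compare, which a small sketch makes concrete. The parsing rules (comma-separated name:direction items, malformed items dropped) and the 0.0 returned for direction accuracy when no names overlap are assumptions, as are the function names.

```python
from typing import Dict, Set, Tuple

Pair = Tuple[str, str]  # (Hallmark name, "up" or "down")


def parse_pathways(body: str) -> Set[Pair]:
    """Parse 'HALLMARK_X:up, HALLMARK_Y:down' into (name, direction) tuples."""
    pairs = set()
    for item in body.split(","):
        name, sep, direction = item.strip().rpartition(":")
        direction = direction.strip().lower()
        if sep and name.strip() and direction in ("up", "down"):
            pairs.add((name.strip(), direction))
    return pairs


def f1(pred: set, gold: set) -> float:
    tp = len(pred & gold)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred), tp / len(gold)
    return 2 * precision * recall / (precision + recall)


def pathway_metrics(pred: Set[Pair], gold: Set[Pair]) -> Dict[str, float]:
    # Assumed: each pathway name appears with at most one direction per side.
    pred_dir, gold_dir = dict(pred), dict(gold)
    shared = set(pred_dir) & set(gold_dir)
    agree = sum(pred_dir[name] == gold_dir[name] for name in shared)
    return {
        "pathway_signed_f1": f1(pred, gold),
        "pathway_name_f1": f1(set(pred_dir), set(gold_dir)),
        "pathway_direction_accuracy": agree / len(shared) if shared else 0.0,
    }


pred = parse_pathways("HALLMARK_APOPTOSIS:up, HALLMARK_E2F_TARGETS:up")
gold = {("HALLMARK_APOPTOSIS", "up"), ("HALLMARK_E2F_TARGETS", "down"),
        ("HALLMARK_MYC_TARGETS_V1", "down")}
print(pathway_metrics(pred, gold))
```

In this example the model names two of three pathways correctly but gets one direction wrong, so the name-level F1 is higher than the signed F1.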
Tools (optional)
tools=True exposes the compound lookup:
identify_compound(smiles: str) -> {
"exact_match": {"name": "..."} | None, # InChIKey-canonical lookup in our 1046 compounds
"descriptors": { # rdkit physicochemical
"molecular_weight": ..., "logp": ..., "tpsa": ...,
"num_h_bond_donors": ..., "num_h_bond_acceptors": ...,
"num_rings": ..., "num_aromatic_rings": ..., "num_rotatable_bonds": ...,
},
"scaffold": "<canonical SMILES of Bemis-Murcko scaffold>",
"nearest_neighbors": [
{"name": "<drug name>", "similarity": 0.xx}, # top-5 Tanimoto
...
]
}
Returns drug name + structural context without leaking target / MoA / pathways.
For known compounds, exact_match gives the name directly; for novel/perturbed
SMILES, the nearest-neighbor list surfaces analog hints (e.g. an unknown
adrenergic returns "dobutamine at 0.80 similarity").
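To show the shape of the nearest-neighbor ranking, here is a dependency-free sketch of Tanimoto similarity over sets of fingerprint on-bits. The real tool computes its fingerprints with rdkit; the integer bit sets, drug names, and function names below are placeholders, and the fingerprint type is not specified in this README.

```python
from typing import Dict, FrozenSet, List


def tanimoto(a: FrozenSet[int], b: FrozenSet[int]) -> float:
    """Tanimoto (Jaccard) similarity: shared on-bits over total on-bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def nearest_neighbors(query: FrozenSet[int],
                      library: Dict[str, FrozenSet[int]],
                      k: int = 5) -> List[dict]:
    """Return the k most similar library entries as name/similarity records."""
    scored = [(name, round(tanimoto(query, fp), 2)) for name, fp in library.items()]
    # Highest similarity first; ties broken alphabetically for a stable order.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return [{"name": name, "similarity": sim} for name, sim in scored[:k]]


# Placeholder fingerprints standing in for rdkit output.
library = {
    "drug_a": frozenset({1, 2, 3, 4}),
    "drug_b": frozenset({1, 9}),
    "drug_c": frozenset({7, 8}),
}
print(nearest_neighbors(frozenset({1, 2, 3, 5}), library, k=2))
```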
hallmark_tools=True exposes ontology metadata only:
describe_hallmark(pathway: str, max_genes: int = 25) -> {
"canonical_name": "HALLMARK_ANDROGEN_RESPONSE",
"matched_from": "androgen signaling",
"msigdb_url": "...",
"gene_count": 101,
"member_genes_sample": ["ABCC4", "ABHD2", "..."],
"aliases": ["ANDROGEN_SIGNALING", "..."],
}
This deliberately does not return compound-specific pathway scores, observed directions, or phenotype labels. It is intended to reduce vocabulary/ontology errors without turning the pathway step into retrieval of the answer.
Local usage
Evals run via prime eval run against the Hub-published env (abugoot/bioreasoning_phenotype). Auth is handled by prime login — no PRIME_API_KEY env var needed. Both prime eval run and vf-eval execute the same code path; prime eval run adds auth, billing preflight, model-registry validation, and auto-uploads results to prime eval list.
uv pip install -e .
# diagnostic eval (default = all entry points × 4 phenotypes, random sample)
prime eval run abugoot/bioreasoning_phenotype@0.10.0 \
-m openai/gpt-4.1-mini \
-n 32 -r 1 -a '{}' --max-tokens 16384
# full-chain eval (smiles_only only)
prime eval run abugoot/bioreasoning_phenotype@0.10.0 \
-m openai/gpt-4.1-mini \
-n 100 -r 1 -a '{"entry_points":["smiles_only"]}' \
--max-tokens 16384
# select a phenotype subset
prime eval run abugoot/bioreasoning_phenotype@0.10.0 \
-m openai/gpt-4.1-mini \
-n 100 -r 1 \
-a '{"entry_points":["smiles_only"], "phenotypes":["viability","cell_cycle","stress"]}' \
--max-tokens 16384
# with compound retrieval only
prime eval run abugoot/bioreasoning_phenotype@0.10.0 \
-m openai/gpt-4.1-mini \
-n 100 -r 1 \
-a '{"entry_points":["smiles_only"], "tools": true, "max_tool_turns": 5}' \
--max-tokens 16384
# with compound retrieval + Hallmark ontology tool
prime eval run abugoot/bioreasoning_phenotype@0.10.0 \
-m openai/gpt-4.1-mini \
-n 100 -r 1 \
-a '{"entry_points":["smiles_only"], "tools": true, "hallmark_tools": true, "max_tool_turns": 5}' \
--max-tokens 16384
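The JSON passed to -a mirrors the load_environment arguments, so a small helper can build it and catch misspelled entry points before a run is launched. This is a convenience sketch, not part of the package; build_env_args is a hypothetical name, and the decision to emit max_tool_turns only when a tool is enabled is an assumption.

```python
import json
from typing import List, Optional

ENTRY_POINTS = {
    "smiles_only", "from_target", "from_moa", "from_pathways", "phenotype_direct",
    "target_from_smiles", "moa_from_smiles", "moa_from_target",
    "pathways_from_smiles", "pathways_from_moa",
    "phenotype_from_moa", "phenotype_from_pathways",
}
PHENOTYPES = {"viability", "cell_cycle", "stress", "magnitude"}


def build_env_args(entry_points: Optional[List[str]] = None,
                   phenotypes: Optional[List[str]] = None,
                   tools: bool = False,
                   hallmark_tools: bool = False,
                   max_tool_turns: int = 5) -> str:
    """Return a JSON string suitable for the -a flag, validating names first."""
    unknown = (set(entry_points or []) - ENTRY_POINTS) | (set(phenotypes or []) - PHENOTYPES)
    if unknown:
        raise ValueError(f"unknown entry points or phenotypes: {sorted(unknown)}")
    args = {}
    if entry_points:
        args["entry_points"] = list(entry_points)
    if phenotypes:
        args["phenotypes"] = list(phenotypes)
    if tools:
        args["tools"] = True
    if hallmark_tools:
        args["hallmark_tools"] = True
    if tools or hallmark_tools:
        args["max_tool_turns"] = max_tool_turns
    return json.dumps(args)


print(build_env_args(entry_points=["smiles_only"], tools=True))
```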
Data prep (one-time, dev only)
Scripts in scripts/ build the bundled data/smallmol_chain_examples.parquet and data/compound_table.parquet:
- fetch_smallmol.py: download L1000 GCTX, PRISM CSV, Drug Repurposing Hub, Hallmark GMT
- build_compound_table.py: InChIKey-canonical compound × cell line table
- filter_l1000.py: extract MODZ signatures for our compound set
- compute_chain_gt.py: per-pair GT (target, MoA, signed pathways, cell_cycle / stress / magnitude buckets, viability LFC)
- build_examples.py: expand to all entry points × phenotypes, split train/test by compound hash
- smoke.py: local sanity checks
The data-prep extra (pip install -e ".[data-prep]") installs cmapPy, h5py, openpyxl, scipy. (rdkit is in the runtime deps because the identify_compound tool uses it.)