LawBench

LawBench is designed to assess LLMs’ legal capabilities precisely at three cognitive levels: (1) Legal knowledge memorization: whether LLMs can memorize the necessary legal concepts, articles, and facts; (2) Legal knowledge understanding: whether LLMs…

Type: RL Env
Capabilities: Legal Reasoning
Runtime: ORS
License: unknown
Size: 10,000 tasks
Published: Mar 2026

LawBench

OpenReward Environment

Description

LawBench is an environment for evaluating LLMs on Chinese legal knowledge. Agents answer 10,000 legal tasks spanning 20 task types across 3 cognitive levels: memorization (2 tasks), understanding (10 tasks), and applying (8 tasks). All text is in Chinese.

Capabilities

  • Legal knowledge recall (article recitation, judicial exam QA)
  • Legal text understanding (NER, reading comprehension, classification, proofreading)
  • Legal reasoning and judgment prediction (charge prediction, prison term estimation, case analysis)

Compute Requirements

No special compute requirements. CPU-only, no GPU needed.

License

Apache-2.0

Tasks

  • Split: zero_shot (10,000 tasks total)
  • 20 task types, 500 examples each:
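    • Memorization (2): article recitation (1-1), knowledge question answering (1-2)
    • Understanding (10): document proofreading (2-1), dispute focus identification (2-2), marital dispute identification (2-3), issue topic identification (2-4), reading comprehension (2-5), named entity recognition (2-6), opinion summarization (2-7), argument mining (2-8), event detection (2-9), trigger word extraction (2-10)
    • Applying (8): fact-based article prediction (3-1), scene-based article prediction (3-2), charge prediction (3-3), prison term prediction without article (3-4), prison term prediction with article (3-5), case analysis (3-6), criminal damages calculation (3-7), legal consultation (3-8)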
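
The task IDs used in the metric table under Reward Structure follow a `level-index` pattern (for example `2-5`). The sketch below enumerates them and checks the totals. It assumes the first number encodes the cognitive level; `LEVELS`, `all_task_ids`, and `level_of` are illustrative names, not part of the environment's API.

```python
# Assumption: task IDs have the form "<level>-<index>", where level 1 is
# memorization, 2 is understanding, and 3 is applying, with the per-level
# task counts listed above.
LEVELS = {1: ("memorization", 2), 2: ("understanding", 10), 3: ("applying", 8)}
EXAMPLES_PER_TASK = 500


def all_task_ids() -> list[str]:
    """Enumerate every task ID, from '1-1' through '3-8'."""
    return [
        f"{level}-{index}"
        for level, (_, count) in LEVELS.items()
        for index in range(1, count + 1)
    ]


def level_of(task_id: str) -> str:
    """Map a task ID such as '2-5' to the name of its cognitive level."""
    level = int(task_id.split("-")[0])
    return LEVELS[level][0]


task_ids = all_task_ids()
total_examples = len(task_ids) * EXAMPLES_PER_TASK
print(len(task_ids), total_examples)  # 20 10000
```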

Reward Structure

Deterministic, verifiable scoring. No LLM judge. Each task type uses one of 11 metrics:

| Metric | Tasks |
| --- | --- |
| ROUGE-L (jieba) | 1-1, 2-7, 3-2, 3-8 |
| Multi-choice accuracy | 1-2, 2-2, 2-4, 2-8, 3-6 |
| Set F1 | 2-3, 2-9 |
| Character-level F1 | 2-5 |
| IE soft F1 | 2-6 |
| Trigger soft F1 | 2-10 |
| Article number F1 | 3-1 |
| Accusation F1 | 3-3 |
| Normalized log-distance | 3-4, 3-5 |
| Number accuracy | 3-7 |
| Edit F0.5 (character-level) | 2-1 |

Rewards are continuous in [0, 1], computed per-example.
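
To make the task-to-metric mapping concrete, the sketch below transcribes the table above into a lookup and implements two of the simpler metrics. It is not the environment's actual scoring code: the function names, the handling of empty sets, and the `max_months` normalizer in `normalized_log_distance` are all assumptions chosen for illustration.

```python
import math

# Task ID -> metric name, transcribed from the metric table above.
_TASKS_BY_METRIC = {
    "rouge_l": ["1-1", "2-7", "3-2", "3-8"],
    "multi_choice_accuracy": ["1-2", "2-2", "2-4", "2-8", "3-6"],
    "set_f1": ["2-3", "2-9"],
    "char_f1": ["2-5"],
    "ie_soft_f1": ["2-6"],
    "trigger_soft_f1": ["2-10"],
    "article_number_f1": ["3-1"],
    "accusation_f1": ["3-3"],
    "normalized_log_distance": ["3-4", "3-5"],
    "number_accuracy": ["3-7"],
    "edit_f0_5": ["2-1"],
}
METRIC_BY_TASK = {
    task: metric for metric, tasks in _TASKS_BY_METRIC.items() for task in tasks
}


def set_f1(predicted: set[str], gold: set[str]) -> float:
    """F1 between a predicted label set and a gold label set, in [0, 1]."""
    if not predicted or not gold:
        # Assumption: two empty sets count as a perfect match.
        return float(predicted == gold)
    overlap = len(predicted & gold)
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


def normalized_log_distance(
    predicted_months: float, gold_months: float, max_months: float = 300.0
) -> float:
    """Score in [0, 1] that falls as the log-scale gap between terms grows.

    Assumption: max_months is a hypothetical normalizer, not a constant
    taken from LawBench.
    """
    gap = abs(math.log(predicted_months + 1) - math.log(gold_months + 1))
    return max(0.0, 1.0 - gap / math.log(max_months + 1))


print(METRIC_BY_TASK["3-4"])             # normalized_log_distance
print(set_f1({"a", "b"}, {"b", "c"}))    # 0.5
print(normalized_log_distance(12, 12))   # 1.0
```

Working on a log scale means a fixed absolute error costs less on a long sentence than on a short one, which suits prison-term estimates that span several orders of magnitude.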

Data

  • Source: open-compass/LawBench on GitHub
  • Format: 20 JSON files, each containing 500 examples with instruction, question, answer fields
  • Size: ~50 MB total
  • Language: Chinese
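
A minimal sketch of reading one task file in the format described above. It assumes each file is a JSON array of objects and that a prompt is formed by joining `instruction` and `question`; `parse_examples`, `build_prompt`, and the sample record are hypothetical placeholders, not content from the dataset.

```python
import json

REQUIRED_FIELDS = ("instruction", "question", "answer")


def parse_examples(text: str) -> list[dict]:
    """Parse one task file's contents and check each example's fields."""
    examples = json.loads(text)
    for i, example in enumerate(examples):
        missing = [field for field in REQUIRED_FIELDS if field not in example]
        if missing:
            raise ValueError(f"example {i} is missing fields: {missing}")
    return examples


def build_prompt(example: dict) -> str:
    # Assumption: the prompt is the instruction followed by the question.
    return f"{example['instruction']}\n{example['question']}"


# Placeholder record standing in for a real file, which holds 500 examples.
sample_file = json.dumps(
    [{"instruction": "INSTRUCTION", "question": "QUESTION", "answer": "ANSWER"}],
    ensure_ascii=False,
)
examples = parse_examples(sample_file)
prompt = build_prompt(examples[0])
print(prompt)
```

With real files on disk, read them with `encoding="utf-8"` so the Chinese text decodes correctly.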

Tools

  • submit(answer: str) — Submit an answer for grading. Single-turn; ends the episode.

Time Horizon

Single-turn. One tool call per task.
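
The single-turn contract can be sketched as an episode object that accepts exactly one `submit` call. This is only an illustration under stated assumptions: `SingleTurnEpisode` and `exact_match` are hypothetical names, and the real ORS runtime interface may look quite different.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class SingleTurnEpisode:
    """Hypothetical model of one task: one submit call, then the episode ends."""

    gold_answer: str
    score: Callable[[str, str], float]  # (prediction, gold) -> reward
    done: bool = False

    def submit(self, answer: str) -> float:
        if self.done:
            raise RuntimeError("episode already ended; submit may be called once")
        self.done = True
        reward = self.score(answer, self.gold_answer)
        # Clamp to [0, 1], the reward range stated under Reward Structure.
        return min(1.0, max(0.0, reward))


def exact_match(prediction: str, gold: str) -> float:
    # Stand-in scorer for the sketch; each task type uses its own metric.
    return float(prediction.strip() == gold.strip())


episode = SingleTurnEpisode(gold_answer="B", score=exact_match)
reward = episode.submit(" B ")
print(reward, episode.done)  # 1.0 True
```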

Environment Difficulty

Varies by task type:

  • Memorization tasks: moderate (requires specific legal knowledge)
  • Understanding tasks: moderate to hard (classification, NER, proofreading)
  • Applying tasks: hard (legal judgment prediction, case analysis)

Safety

Tasks involve Chinese legal content including criminal law, civil disputes, and judicial proceedings. All content is from public legal datasets.

Citations

@misc{fei2023lawbenchbenchmarkinglegalknowledge,
      title={LawBench: Benchmarking Legal Knowledge of Large Language Models},
      author={Zhiwei Fei and Xiaoyu Shen and Dawei Zhu and Fengzhe Zhou and Zhuo Han and Songyang Zhang and Kai Chen and Zongwen Shen and Jidong Ge},
      year={2023},
      eprint={2309.16289},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2309.16289},
}