LawBench
Description
LawBench is an environment for evaluating LLMs on Chinese legal knowledge. Agents answer 10,000 legal tasks spanning 20 task types across 3 cognitive levels: memorization (2 task types), understanding (10 task types), and applying (8 task types). All text is in Chinese.
Capabilities
- Legal knowledge recall (article recitation, judicial exam QA)
- Legal text understanding (NER, reading comprehension, classification, proofreading)
- Legal reasoning and judgment prediction (charge prediction, prison term estimation, case analysis)
Compute Requirements
No special compute requirements. CPU-only, no GPU needed.
License
Tasks
- Split: `zero_shot` (10,000 tasks total), 20 task types with 500 examples each:
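  - Memorization (2): article recitation (1-1), knowledge QA (1-2)
  - Understanding (10): proofreading (2-1), dispute focus identification (2-2), classification (marital dispute identification 2-3, issue topic identification 2-4), reading comprehension (2-5), NER (2-6), summarization (2-7), argument mining (2-8), event detection (2-9), trigger extraction (2-10)
  - Applying (8): article prediction (fact-based 3-1, scene-based 3-2), charge prediction (3-3), prison term prediction (without article 3-4, with article 3-5), case analysis (3-6), damages calculation (3-7), legal consultation (3-8)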
Reward Structure
Deterministic, verifiable scoring. No LLM judge. Each task type uses one of 11 metrics:
| Metric | Tasks |
|---|---|
| ROUGE-L (jieba) | 1-1, 2-7, 3-2, 3-8 |
| Multi-choice accuracy | 1-2, 2-2, 2-4, 2-8, 3-6 |
| Set F1 | 2-3, 2-9 |
| Character-level F1 | 2-5 |
| IE soft F1 | 2-6 |
| Trigger soft F1 | 2-10 |
| Article number F1 | 3-1 |
| Accusation F1 | 3-3 |
| Normalized log-distance | 3-4, 3-5 |
| Number accuracy | 3-7 |
| Edit F0.5 (character-level) | 2-1 |
Rewards are continuous in [0, 1], computed per-example.
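
As a rough illustration of how per-example scoring could be wired up, the sketch below transcribes the metric table into a lookup and implements a plain set-based F1. The names `METRIC_BY_TASK` and `set_f1` are hypothetical, and the handling of empty sets is an assumption; the environment's actual metric code may differ in details such as answer parsing and normalization.

```python
from typing import Iterable

# Task ID -> metric name, transcribed from the table above.
METRIC_BY_TASK = {
    **dict.fromkeys(["1-1", "2-7", "3-2", "3-8"], "ROUGE-L (jieba)"),
    **dict.fromkeys(["1-2", "2-2", "2-4", "2-8", "3-6"], "Multi-choice accuracy"),
    **dict.fromkeys(["2-3", "2-9"], "Set F1"),
    "2-5": "Character-level F1",
    "2-6": "IE soft F1",
    "2-10": "Trigger soft F1",
    "3-1": "Article number F1",
    "3-3": "Accusation F1",
    **dict.fromkeys(["3-4", "3-5"], "Normalized log-distance"),
    "3-7": "Number accuracy",
    "2-1": "Edit F0.5 (character-level)",
}


def set_f1(predicted: Iterable[str], gold: Iterable[str]) -> float:
    """Harmonic mean of precision and recall over two label sets, in [0, 1]."""
    pred_set, gold_set = set(predicted), set(gold)
    if not pred_set and not gold_set:
        # Assumption: two empty sets count as a perfect match.
        return 1.0
    overlap = len(pred_set & gold_set)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_set)
    recall = overlap / len(gold_set)
    return 2 * precision * recall / (precision + recall)
```

For example, predicting the labels `{"a", "b"}` against gold labels `{"b", "c"}` gives precision 0.5 and recall 0.5, so the reward for that example is 0.5.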
Data
- Source: open-compass/LawBench on GitHub
- Format: 20 JSON files, each containing 500 examples with `instruction`, `question`, and `answer` fields
- Size: ~50 MB total
- Language: Chinese
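
A minimal sketch of reading one task file, assuming each file is a JSON array of objects carrying the three documented fields. The helper names `load_task_file` and `build_prompt` are hypothetical, and joining the instruction and question with a newline is an assumption about prompt construction rather than the environment's confirmed behavior.

```python
import json
from pathlib import Path

# Field names as documented in the Data section.
REQUIRED_FIELDS = ("instruction", "question", "answer")


def load_task_file(path):
    """Load one task file and check that every example has the documented fields."""
    with Path(path).open(encoding="utf-8") as f:
        examples = json.load(f)  # assumed to be a list of dicts
    for i, example in enumerate(examples):
        missing = [key for key in REQUIRED_FIELDS if key not in example]
        if missing:
            raise ValueError(f"{path}: example {i} is missing {missing}")
    return examples


def build_prompt(example):
    """Join instruction and question into one prompt (assumed format)."""
    return f"{example['instruction']}\n{example['question']}"
```

Opening the file with `encoding="utf-8"` matters here because all text is Chinese and the platform default encoding is not always UTF-8.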
Tools
`submit(answer: str)`: Submit an answer for grading. Single-turn; ends the episode.
Time Horizon
Single-turn. One tool call per task.
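
The single-turn contract can be pictured with the sketch below. `SingleTurnEpisode` and `exact_match` are hypothetical stand-ins written for illustration, not the environment's real classes; only the `submit(answer: str)` signature and the one-call-per-task rule come from this document.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class SingleTurnEpisode:
    """Hypothetical model of one task: a prompt, a gold answer, and a scorer."""
    prompt: str
    gold: str
    scorer: Callable[[str, str], float]
    done: bool = False

    def submit(self, answer: str) -> float:
        """Grade the answer and end the episode; a second call is rejected."""
        if self.done:
            raise RuntimeError("episode already ended; only one submit call is allowed")
        self.done = True
        return self.scorer(answer, self.gold)


def exact_match(prediction: str, gold: str) -> float:
    # Placeholder scorer for the sketch; real tasks use the metrics listed
    # under Reward Structure.
    return 1.0 if prediction.strip() == gold.strip() else 0.0
```

Because the first `submit` call ends the episode, an agent has no opportunity to revise an answer after seeing its reward.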
Environment Difficulty
Varies by task type:
- Memorization tasks: moderate (requires specific legal knowledge)
- Understanding tasks: moderate to hard (classification, NER, proofreading)
- Applying tasks: hard (legal judgment prediction, case analysis)
Safety
Tasks involve Chinese legal content including criminal law, civil disputes, and judicial proceedings. All content is from public legal datasets.
Citations
@misc{fei2023lawbenchbenchmarkinglegalknowledge,
title={LawBench: Benchmarking Legal Knowledge of Large Language Models},
author={Zhiwei Fei and Xiaoyu Shen and Dawei Zhu and Fengzhe Zhou and Zhuo Han and Songyang Zhang and Kai Chen and Zongwen Shen and Jidong Ge},
year={2023},
eprint={2309.16289},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2309.16289},
}