LawBench

Description

LawBench is an environment for evaluating LLMs on Chinese legal knowledge. Agents answer 10,000 legal tasks spanning 20 task types across 3 cognitive levels: memorization (2 task types), understanding (10), and applying (8). All text is in Chinese.

Capabilities

Legal knowledge recall (article recitation, judicial exam QA)
Legal text understanding (NER, reading comprehension, classification, proofreading)
Legal reasoning and judgment prediction (charge prediction, prison term estimation, case analysis)

Compute Requirements

No special compute requirements: the environment runs on CPU only and needs no GPU.

License

Apache-2.0

Tasks

Split: zero_shot (10,000 tasks total)
20 task types, 500 examples each:
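- Memorization (2): article recitation (1-1), knowledge QA (1-2)
- Understanding (10): document proofreading (2-1), dispute focus identification (2-2), marital dispute identification (2-3), issue topic identification (2-4), reading comprehension (2-5), NER (2-6), opinion summarization (2-7), argument mining (2-8), event detection (2-9), trigger word extraction (2-10)
- Applying (8): fact-based article prediction (3-1), scene-based article prediction (3-2), charge prediction (3-3), prison term prediction without article (3-4), prison term prediction with article (3-5), case analysis (3-6), criminal damages calculation (3-7), legal consultation (3-8)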
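
The task identifiers in the metric table under Reward Structure follow a level-index pattern (for example 2-5). As a minimal sketch, assuming that pattern holds for every task type, the IDs and the split size can be enumerated as below. `TASKS_PER_LEVEL`, `EXAMPLES_PER_TASK`, and `task_ids` are illustrative names, not part of the environment.

```python
# Illustrative sketch: these names are hypothetical, not the environment's API.
TASKS_PER_LEVEL = {1: 2, 2: 10, 3: 8}  # memorization, understanding, applying
EXAMPLES_PER_TASK = 500


def task_ids() -> list[str]:
    """Return IDs such as "1-1" or "3-8", ordered by level and then index."""
    return [
        f"{level}-{index}"
        for level, count in TASKS_PER_LEVEL.items()
        for index in range(1, count + 1)
    ]


ids = task_ids()
total_examples = len(ids) * EXAMPLES_PER_TASK
print(len(ids), total_examples)  # 20 10000
```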

Reward Structure

Deterministic, verifiable scoring. No LLM judge. Each task type uses one of 11 metrics:

Metric	Tasks
ROUGE-L (jieba)	1-1, 2-7, 3-2, 3-8
Multi-choice accuracy	1-2, 2-2, 2-4, 2-8, 3-6
Set F1	2-3, 2-9
Character-level F1	2-5
IE soft F1	2-6
Trigger soft F1	2-10
Article number F1	3-1
Accusation F1	3-3
Normalized log-distance	3-4, 3-5
Number accuracy	3-7
Edit F0.5 (character-level)	2-1

Each reward is a real value in [0, 1], computed per example.
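
The metric table can be read as a lookup from task ID to metric. The sketch below transcribes it into a dictionary and inverts it so that a grader could dispatch on the task ID. It also includes a textbook set-level F1 as one example of a deterministic scorer. The dictionary and function names are illustrative, and `set_f1` uses the standard definition, which may differ in detail from the environment's own implementation.

```python
# Metric table transcribed from this section; the dict names are illustrative.
TASKS_BY_METRIC = {
    "ROUGE-L (jieba)": ["1-1", "2-7", "3-2", "3-8"],
    "Multi-choice accuracy": ["1-2", "2-2", "2-4", "2-8", "3-6"],
    "Set F1": ["2-3", "2-9"],
    "Character-level F1": ["2-5"],
    "IE soft F1": ["2-6"],
    "Trigger soft F1": ["2-10"],
    "Article number F1": ["3-1"],
    "Accusation F1": ["3-3"],
    "Normalized log-distance": ["3-4", "3-5"],
    "Number accuracy": ["3-7"],
    "Edit F0.5 (character-level)": ["2-1"],
}

# Invert to task -> metric so a grader can dispatch on the task ID.
METRIC_BY_TASK = {
    task: metric for metric, tasks in TASKS_BY_METRIC.items() for task in tasks
}


def set_f1(predicted: set[str], gold: set[str]) -> float:
    """Textbook F1 over two label sets (assumed definition, not the environment's code)."""
    if not predicted or not gold:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(predicted == gold)
    overlap = len(predicted & gold)
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)
```

Keeping the table as data rather than as branching logic makes it easy to check that all 20 task types are covered and that each maps to exactly one metric.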

Data

Source: open-compass/LawBench on GitHub
Format: 20 JSON files, each containing 500 examples with instruction, question, answer fields
Size: ~50 MB total
Language: Chinese
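
As a rough sketch of how one task file might be read, the snippet below loads a JSON file and checks for the three documented fields. It assumes each file holds a JSON array of objects, which is not stated above. `load_examples` and `build_prompt` are hypothetical helpers, and the newline separator in the prompt is an arbitrary choice.

```python
import json
from pathlib import Path

REQUIRED_FIELDS = ("instruction", "question", "answer")


def load_examples(path: Path) -> list[dict]:
    """Load one task file and check that each example has the expected fields.

    Assumes the file holds a JSON array of objects; adjust if the layout differs.
    """
    with path.open(encoding="utf-8") as handle:
        examples = json.load(handle)
    for position, example in enumerate(examples):
        missing = [field for field in REQUIRED_FIELDS if field not in example]
        if missing:
            raise ValueError(f"{path.name}: example {position} lacks {missing}")
    return examples


def build_prompt(example: dict) -> str:
    """Join instruction and question; the separator is an illustrative choice."""
    return f"{example['instruction']}\n{example['question']}"
```

Reading with an explicit `encoding="utf-8"` matters here because all text is Chinese and the platform default encoding may not be UTF-8.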

Tools

submit(answer: str): submits an answer for grading. Single-turn; the call ends the episode.
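
The single-call contract can be pictured with a small stand-in. Only the `submit(answer: str)` signature comes from this document. The `SingleTurnEpisode` class, its constructor, and the `exact_match` scorer are hypothetical and exist purely to illustrate that the first call returns a reward and any later call is rejected.

```python
from typing import Callable


class SingleTurnEpisode:
    """Hypothetical wrapper: one submit() call scores the answer and closes the episode."""

    def __init__(self, gold: str, scorer: Callable[[str, str], float]):
        self._gold = gold
        self._scorer = scorer
        self.done = False

    def submit(self, answer: str) -> float:
        if self.done:
            raise RuntimeError("episode already ended; submit() may be called once")
        self.done = True
        reward = self._scorer(answer, self._gold)
        # Clamp defensively so the reward stays in the documented [0, 1] range.
        return max(0.0, min(1.0, reward))


def exact_match(prediction: str, gold: str) -> float:
    """Stand-in scorer for illustration only."""
    return float(prediction.strip() == gold.strip())


episode = SingleTurnEpisode(gold="B", scorer=exact_match)
reward = episode.submit("B")
```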

Time Horizon

Single-turn. One tool call per task.

Environment Difficulty

Varies by task type:

Memorization tasks: moderate (requires specific legal knowledge)
Understanding tasks: moderate to hard (classification, NER, proofreading)
Applying tasks: hard (legal judgment prediction, case analysis)

Safety

Tasks involve Chinese legal content including criminal law, civil disputes, and judicial proceedings. All content is from public legal datasets.

Citations

@misc{fei2023lawbenchbenchmarkinglegalknowledge,
      title={LawBench: Benchmarking Legal Knowledge of Large Language Models},
      author={Zhiwei Fei and Xiaoyu Shen and Dawei Zhu and Fengzhe Zhou and Zhuo Han and Songyang Zhang and Kai Chen and Zongwen Shen and Jidong Ge},
      year={2023},
      eprint={2309.16289},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2309.16289},
}