LegalKit

September 29, 2025

Chinese README: README_zh.md

LegalKit is a practical and extensible evaluation toolkit for legal-domain Large Language Models (LLMs). It unifies dataset adapters, model generation, offline JSON evaluation, and LLM-as-Judge scoring into a single workflow, and provides an optional lightweight Web UI for users who prefer to work outside the terminal.


✨ Features

  • 📚 Multi-dataset support: legal QA, case reasoning, judgment generation, etc.
  • 🔌 Modular design: pluggable load_tasks, Generator, Evaluator
  • ⚙️ Unified model specs: local paths, HuggingFace (hf:), API endpoints (api:)
  • ⚡ Acceleration: vLLM and LMDeploy backends (tensor parallel + data parallel)
  • 📄 Offline evaluation: evaluate directly from JSON predictions
  • 🧑‍⚖️ LLM-as-Judge: independent configuration for judge models, batch evaluation supported
  • 📊 Multi-dimensional evaluation: BLEU, ROUGE-L, BERTScore + law-specific criteria
  • 🔄 Resumable runs: checkpointing via sharded prediction storage
  • 🌐 Optional Web UI: submit tasks and browse evaluation results

📦 Installation

From source

conda create -n legalkit python=3.10 -y
conda activate legalkit

git clone https://github.com/DavidMiao1127/LegalKit.git
cd LegalKit
pip install -e .
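
A quick smoke test after installing (the CLI is argparse-based, so --help should print the full flag list):

python legalkit/main.py --help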

From PyPI

pip install legalkit

🗂 Project Structure

LegalKit/
  legalkit/
    main.py            # CLI entry: argument parsing & multiprocess orchestration
    judge.py           # Judge config + LLM-as-Judge runner
    storage.py         # Prediction sharding & aggregation
    datasets/          # Dataset adapters
  web/                 # Flask-based Web UI
  data/                # Built-in dataset artifacts & templates
  example/             # Example YAML configs
  run_output/          # Output directory
  README.md / README_zh.md

⚙️ Core Concepts

| Concept | Description |
| --- | --- |
| Model Spec | Local path, hf: (HuggingFace repo), or api: (remote endpoint); see the example below |
| Dataset | Defined under legalkit/datasets/<Name>; must implement load_tasks() |
| Task | Unit of grouped records identified by ID |
| Generator | Executes batched model inference |
| Evaluator | Computes metrics or rule-based evaluation |
| Judge Runner | Independent model used for evaluation only |
| StorageManager | Handles sharded prediction files and merging |
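
For example, a model spec can point straight at a HuggingFace repo; a minimal generation-only run might look like this (the repo name is illustrative):

python legalkit/main.py \
  --models hf:Qwen/Qwen2.5-7B-Instruct \
  --datasets JECQA \
  --task infer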

🧪 CLI Parameters

| Flag | Type / Values | Default | Description |
| --- | --- | --- | --- |
| --models | list[str] | None (optional in JSON mode) | Model specs: local dir, hf:Org/Repo, api:Name, or a directory containing multiple sub-models |
| --datasets | list[str] | None (required) | Dataset names (under legalkit/datasets/) |
| --accelerator / -a | vllm / lmdeploy | None | Accelerator backend for generation |
| --num_workers | int | 1 | Number of data-parallel worker processes |
| --tensor_parallel | int | 1 | Tensor/model parallel degree (scales GPU demand) |
| --task | infer / eval / all | all | Phase: generation only / evaluation only / both; forced to eval in JSON mode |
| --resume / -r | path | – | Resume from an existing run directory (skips completed records) |
| --output_dir | path | ./run_output | Root output directory (with timestamped subfolders) |
| --temperature | float | 1.0 | Sampling temperature (primary model) |
| --top_p | float | 1.0 | Nucleus sampling parameter |
| --max_tokens | int | 8192 | Maximum generation tokens |
| --sub_tasks | list[str] | None | Restrict to specific subtasks (dataset-defined) |
| --batch_size | int | 1 | Batch size per worker |
| --repetition_penalty | float | 1.0 | Repetition penalty |
| --api_url | str | None | Base URL for an api: model |
| --api_key | str | None | API authentication key |
| --json_eval | flag | False | Enable offline JSON evaluation mode |
| --json_paths | list[str] | – (required in JSON mode) | Prediction files: a single path or Dataset=/abs/path.json pairs |
| --json_model_label | str | json_eval | Synthetic model name in JSON mode (if --models is not given) |
| --judge | str | None | Judge model spec (same format as --models) |
| --judge_accelerator | vllm / lmdeploy | None | Accelerator backend for the judge model |
| --judge_tensor_parallel | int | 1 | Tensor parallelism for the judge model |
| --judge_batch_size | int | 4 in JSON mode; else batch_size or 1 | Judge evaluation batch size |
| --judge_temperature | float | temperature or 0.0 | Sampling temperature for the judge |
| --judge_top_p | float | top_p or 1.0 | Nucleus sampling for the judge |
| --judge_max_tokens | int | max_tokens or 512 | Maximum tokens for judge outputs |
| --judge_repetition_penalty | float | repetition_penalty or 1.0 | Repetition penalty for the judge |
| --judge_api_url | str | api_url | Override API URL for the judge |
| --judge_api_key | str | api_key | Override API key for the judge |
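
For instance, scoring a remote endpoint through the api: spec might look like the following (the endpoint name, URL, and key are placeholders):

python legalkit/main.py \
  --models api:my-endpoint \
  --api_url https://api.example.com/v1 \
  --api_key $API_KEY \
  --datasets LawBench \
  --task all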

🚀 Quick Start

Minimal run (generation + evaluation):

python legalkit/main.py \
  --models /path/to/local/model \
  --datasets CaseGen \
  --task all \
  --num_workers 1 --tensor_parallel 1 \
  --max_tokens 4096 --temperature 0.0

With vLLM acceleration:

python legalkit/main.py \
  --models /path/to/model \
  --datasets LawBench \
  --accelerator vllm \
  --tensor_parallel 2 \
  --num_workers 1

Resume from checkpoint:

python legalkit/main.py --resume run_output/20250710-093723 --task all

📄 Config-Driven Execution

Example config.yaml:

args:
  models: ["/models/Qwen2.5-32B-Instruct"]
  datasets: ["JECQA"]
  sub_tasks: ["1-1", "1-2"]
  accelerator: lmdeploy
  num_workers: 2
  tensor_parallel: 4
  temperature: 0.0
  max_tokens: 4096
  task: all

Run with:

python legalkit/main.py --config example/config_jecqa.yaml

📥 Offline Evaluation (JSON Predictions)

Evaluate using existing predictions only (no model inference):

python legalkit/main.py \
  --datasets JECQA \
  --json_eval \
  --json_paths /data/jecqa_preds.json

Multiple datasets:

python legalkit/main.py \
  --datasets LawBench JECQA \
  --json_eval \
  --json_paths LawBench=/data/lawbench.json JECQA=/data/jecqa.json \
  --json_model_label merged_external

🧑‍⚖️ LLM-as-Judge

Introduce a secondary model to produce qualitative or structured evaluation scores.

Example:

python legalkit/main.py \
  --datasets CaseGen \
  --json_eval --json_paths /data/casegen_preds.json \
  --judge hf:Qwen/Qwen2.5-7B-Instruct \
  --judge_batch_size 2 --judge_max_tokens 256
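
The judge can also sit behind a remote API via the judge_* overrides (the endpoint name, URL, and key below are placeholders):

python legalkit/main.py \
  --datasets CaseGen \
  --json_eval --json_paths /data/casegen_preds.json \
  --judge api:my-judge-endpoint \
  --judge_api_url https://api.example.com/v1 \
  --judge_api_key $API_KEY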

🌐 Web UI (Optional)

Install dependencies:

pip install flask flask-cors

Run:

./start.sh

Access via browser:

http://localhost:5000

🔍 Output Structure

run_output/<TIMESTAMP>/
  config.yaml
  <model_id>/
    result.json        # Task -> score dictionary
    predict/           # Sharded predictions
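
To pretty-print the merged scores for a run (substitute the placeholders from the tree above):

python -m json.tool run_output/<TIMESTAMP>/<model_id>/result.json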

📚 Supported Datasets

  • Comprehensive benchmarks: LawBench, LegalBench, LexEval
  • Case generation / reasoning: CaseGen, CaseHold
  • QA / Knowledge: JECQA, LAiW, LexGLUE
  • Retrieval / RAG: LexRAG
  • Judicial benchmarks: CAIL series (2019–2025)