Data Dumps & Bulk Downloads

For bulk access to OLDP data, use the dump_api_data management command. It exports every public API resource to a gzipped JSONL file and writes a manifest.json that describes the snapshot. This is the canonical way OLDP publishes its data, including the HuggingFace dataset openlegaldata/court-decisions-germany, which is produced from a dump via oldp-toolkit.

Usage

./manage.py dump_api_data ./workingdir/snapshot-2026-04-29 --override

Arguments:

output (positional) — directory path relative to WORKING_DIR.
--override — replace an existing output directory.
--limit N — cap the number of records per resource (default 0 = unlimited).

Output layout

snapshot-2026-04-29/
├── cases.jsonl.gz
├── courts.jsonl.gz
├── laws.jsonl.gz
├── law_books.jsonl.gz
├── cities.jsonl.gz
├── states.jsonl.gz
├── countries.jsonl.gz
├── annotation_labels.jsonl.gz
├── case_annotations.jsonl.gz
├── case_markers.jsonl.gz
└── manifest.json

Each *.jsonl.gz contains one record per line, serialized using the same serializer the public REST API uses, so the dump shape mirrors the API.
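Because each file is plain gzipped JSONL, it can be streamed with the Python standard library alone. The following is a minimal sketch; the helper name iter_records is illustrative and not part of OLDP.

```python
import gzip
import json
from pathlib import Path
from typing import Iterator


def iter_records(path: Path) -> Iterator[dict]:
    """Yield one decoded JSON object per non-empty line of a gzipped JSONL file."""
    # Text mode ("rt") lets gzip handle decompression and UTF-8 decoding,
    # so the file is streamed line by line instead of loaded into memory.
    with gzip.open(path, mode="rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # tolerate a trailing blank line
                yield json.loads(line)
```

For example, `for court in iter_records(Path("snapshot-2026-04-29/courts.jsonl.gz")): ...` walks the courts file one record at a time.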

Snapshot contract

The dump command guarantees the following properties:

Always-accepted filter

Every Case, Law, LawBook, and Court record in a dump has review_status == "accepted". Pending or declined records are never written to a dump and therefore never reach published artifacts.

Stable ordering

Records are written in ascending primary-key order. Given the same database state, two consecutive dump runs produce byte-identical files. This makes downstream snapshots (HuggingFace dataset versions, benchmark resolution indices) reproducible.
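A consumer can spot-check both properties with a short script. This is a sketch under one assumption: the primary key appears in each record under the key "id", which the text above does not confirm. The function names are illustrative.

```python
import gzip
import hashlib
import json
from pathlib import Path


def file_sha256(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's raw (compressed) bytes."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in 1 MiB chunks so large dumps do not need to fit in memory.
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_sorted_by_id(path: Path, key: str = "id") -> bool:
    """Check that records appear in strictly ascending order of `key`.

    The key name "id" is an assumption about the serialized record shape.
    """
    previous = None
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            if not line.strip():
                continue
            current = json.loads(line)[key]
            if previous is not None and current <= previous:
                return False
            previous = current
    return True
```

Comparing `file_sha256` for the same file name in two snapshot directories is one way to confirm that two runs against an unchanged database produced identical output.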

Self-describing manifest

manifest.json records the snapshot identity:

{
  "snapshot_date": "2026-04-29T12:34:56+00:00",
  "oldp_version": "0.9.13",
  "filters": {"review_status": "accepted"},
  "files": {
    "cases.jsonl.gz": {"row_count": 318442},
    "laws.jsonl.gz": {"row_count": 47821},
    "law_books.jsonl.gz": {"row_count": 612},
    "courts.jsonl.gz": {"row_count": 1284}
  }
}

Downstream consumers should pin to a specific snapshot_date so that benchmark scores and analyses stay reproducible as the OLDP database grows.
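One way to enforce such a pin is to validate the manifest before processing a snapshot. The sketch below relies only on the manifest keys shown in the example above (snapshot_date, files, row_count); the function names and the choice to return a dictionary of mismatches are illustrative.

```python
import gzip
import json
from pathlib import Path
from typing import Dict, Optional, Tuple


def count_rows(path: Path) -> int:
    """Count non-empty lines in a gzipped JSONL file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return sum(1 for line in handle if line.strip())


def check_manifest(
    snapshot_dir: Path, expected_snapshot_date: Optional[str] = None
) -> Dict[str, Tuple[int, int]]:
    """Compare manifest row counts with the files on disk.

    Returns {file_name: (row_count_in_manifest, row_count_on_disk)} for every
    file whose counts differ; an empty dict means the snapshot is consistent.
    """
    manifest_path = snapshot_dir / "manifest.json"
    manifest = json.loads(manifest_path.read_text(encoding="utf-8"))

    # Refuse to continue if the snapshot is not the one the consumer pinned.
    if (
        expected_snapshot_date is not None
        and manifest["snapshot_date"] != expected_snapshot_date
    ):
        raise ValueError(
            f"expected snapshot {expected_snapshot_date!r}, "
            f"found {manifest['snapshot_date']!r}"
        )

    mismatches: Dict[str, Tuple[int, int]] = {}
    for name, info in manifest["files"].items():
        actual = count_rows(snapshot_dir / name)
        if actual != info["row_count"]:
            mismatches[name] = (info["row_count"], actual)
    return mismatches
```

Calling `check_manifest(Path("snapshot-2026-04-29"), "2026-04-29T12:34:56+00:00")` would raise if a different snapshot were mounted at that path and would otherwise report any truncated files.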

Citation-friendly fields

Some serializers denormalize fields that would otherwise require joining multiple JSONL files:

laws.jsonl.gz records carry a book_code field (the parent LawBook.code) so a citation matcher can build a (book_code, slug) -> law_id index without loading law_books.jsonl.gz.
courts.jsonl.gz records carry both code (canonical, ECLI-derived) and aliases (newline-separated alternative names like “BGH” / “Bundesgerichtshof”) for resolving citations to a court_id.
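The two lookup tables described above can be built in a single pass over each file. This is a minimal sketch, not the matcher used by any OLDP tool. It assumes that each record exposes its primary key as "id" and that law records have a "slug" field. Case-folding the court names is a design choice made here for illustration.

```python
import gzip
import json
from pathlib import Path
from typing import Dict, Iterator, Tuple


def iter_records(path: Path) -> Iterator[dict]:
    """Yield one JSON object per non-empty line of a gzipped JSONL file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield json.loads(line)


def build_law_index(laws_path: Path) -> Dict[Tuple[str, str], int]:
    """Map (book_code, slug) to the law's id, using only laws.jsonl.gz."""
    # "slug" and "id" are assumed field names; "book_code" is documented above.
    return {
        (law["book_code"], law["slug"]): law["id"]
        for law in iter_records(laws_path)
    }


def build_court_index(courts_path: Path) -> Dict[str, int]:
    """Map every court code and alias (case-folded) to the court's id."""
    index: Dict[str, int] = {}
    for court in iter_records(courts_path):
        # aliases is a newline-separated string; it may be empty or missing.
        names = [court["code"]] + (court.get("aliases") or "").splitlines()
        for name in names:
            name = name.strip()
            if name:
                # setdefault keeps the first court seen if two share a name.
                index.setdefault(name.casefold(), court["id"])
    return index
```

With these indexes, `build_court_index(path).get("bgh")` resolves a cited court name to a court_id without any further joins.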

Downstream consumers

oldp-toolkit reads cases.jsonl.gz (auto-detects the .gz suffix), converts HTML to Markdown, extracts inline reference markers, and publishes the result as a HuggingFace parquet dataset.