# Data Dumps & Bulk Downloads

For bulk access to OLDP data, the `dump_api_data` management command exports every public API resource to gzipped JSONL files, accompanied by a `manifest.json` describing the snapshot. This is the canonical way OLDP publishes its data, including the [HuggingFace dataset `openlegaldata/court-decisions-germany`](https://huggingface.co/datasets/openlegaldata/court-decisions-germany) (produced from a dump via [`oldp-toolkit`](https://github.com/openlegaldata/oldp-toolkit)).

## Usage

```bash
./manage.py dump_api_data ./workingdir/snapshot-2026-04-29 --override
```

Arguments:

- `output` (positional): directory path relative to `WORKING_DIR`.
- `--override`: replace an existing output directory.
- `--limit N`: cap the number of records per resource (default `0` = unlimited).

## Output layout

```
snapshot-2026-04-29/
├── cases.jsonl.gz
├── courts.jsonl.gz
├── laws.jsonl.gz
├── law_books.jsonl.gz
├── cities.jsonl.gz
├── states.jsonl.gz
├── countries.jsonl.gz
├── annotation_labels.jsonl.gz
├── case_annotations.jsonl.gz
├── case_markers.jsonl.gz
└── manifest.json
```

Each `*.jsonl.gz` contains one record per line, serialized using the same serializer the public REST API uses, so the dump shape mirrors the API.

## Snapshot contract

The dump command guarantees the following properties:

### Always-accepted filter

Records on `Case`, `Law`, `LawBook`, and `Court` always have `review_status == "accepted"`. Pending or declined records are never written to a dump and therefore never reach published artifacts.

### Stable ordering

Records are written in ascending primary-key order. Given the same database state, two consecutive dump runs produce byte-identical files. This makes downstream snapshots (HuggingFace dataset versions, benchmark resolution indices) reproducible.

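
A consumer who wants to confirm the ordering guarantee on a downloaded file could do so with something like the sketch below. It is not part of OLDP: the helper names are hypothetical, and it assumes the primary key is serialized under an `id` field, which should be checked against the actual records. The digest is computed over the decompressed content, so it compares the records themselves rather than the gzip container.

```python
import gzip
import hashlib
import json


def iter_records(path):
    """Yield one dict per non-empty line of a gzipped JSONL file."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            if line.strip():
                yield json.loads(line)


def is_sorted_by_key(path, key="id"):
    """Return True if `key` is strictly ascending across the file.

    Assumption: the primary key is serialized under "id".
    """
    previous = None
    for record in iter_records(path):
        current = record[key]
        if previous is not None and current <= previous:
            return False
        previous = current
    return True


def content_sha256(path):
    """Return the SHA-256 hex digest of the decompressed content."""
    digest = hashlib.sha256()
    with gzip.open(path, "rb") as fh:
        # Read in 1 MiB chunks so large dumps are not loaded into memory.
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Comparing `content_sha256("run-a/cases.jsonl.gz")` with the digest of the same file from a second run is one way to check that two dumps of the same database state contain identical records.
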
### Self-describing manifest

`manifest.json` records the snapshot identity:

```json
{
  "snapshot_date": "2026-04-29T12:34:56+00:00",
  "oldp_version": "0.9.13",
  "filters": {"review_status": "accepted"},
  "files": {
    "cases.jsonl.gz": {"row_count": 318442},
    "laws.jsonl.gz": {"row_count": 47821},
    "law_books.jsonl.gz": {"row_count": 612},
    "courts.jsonl.gz": {"row_count": 1284}
  }
}
```

Downstream consumers should pin against `snapshot_date` so that benchmark scores or analyses remain reproducible across OLDP database growth.

## Citation-friendly fields

Some serializers denormalize fields that would otherwise require joining multiple JSONL files:

- **`laws.jsonl.gz`** records carry a `book_code` field (the parent `LawBook.code`) so a citation matcher can build a `(book_code, slug) -> law_id` index without loading `law_books.jsonl.gz`.
- **`courts.jsonl.gz`** records carry both `code` (canonical, ECLI-derived) and `aliases` (newline-separated alternative names like "BGH" / "Bundesgerichtshof") for resolving citations to a `court_id`.

## Downstream consumers

- [`oldp-toolkit`](https://github.com/openlegaldata/oldp-toolkit) reads `cases.jsonl.gz` (auto-detects the `.gz` suffix), converts HTML to Markdown, extracts inline reference markers, and publishes the result as a HuggingFace parquet dataset.
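
For consumers that read a dump directly rather than through `oldp-toolkit`, the manifest and the citation-friendly fields described above might be used roughly as follows. This is a minimal sketch, not OLDP code: all function names are hypothetical, and it assumes each record exposes its primary key as `id` and that law records carry a `slug` field alongside `book_code`.

```python
import gzip
import json
from pathlib import Path


def iter_records(path):
    """Yield one dict per non-empty line of a gzipped JSONL file."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            if line.strip():
                yield json.loads(line)


def check_row_counts(snapshot_dir):
    """Return {filename: (expected, actual)} for files whose row count
    differs from the one recorded in manifest.json."""
    snapshot_dir = Path(snapshot_dir)
    manifest = json.loads(
        (snapshot_dir / "manifest.json").read_text(encoding="utf-8")
    )
    mismatches = {}
    for name, meta in manifest["files"].items():
        actual = sum(1 for _ in iter_records(snapshot_dir / name))
        if actual != meta["row_count"]:
            mismatches[name] = (meta["row_count"], actual)
    return mismatches


def build_law_index(laws_path):
    """Map (book_code, slug) -> law id. Assumes "id" holds the primary key."""
    return {
        (rec["book_code"], rec["slug"]): rec["id"]
        for rec in iter_records(laws_path)
    }


def build_court_index(courts_path):
    """Map the canonical code and every alias -> court id."""
    index = {}
    for rec in iter_records(courts_path):
        # `aliases` is newline-separated; tolerate a missing or null value.
        names = [rec["code"]] + (rec.get("aliases") or "").splitlines()
        for name in names:
            name = name.strip()
            if name:
                # Keep the first court seen if two courts share a name.
                index.setdefault(name, rec["id"])
    return index
```

An empty result from `check_row_counts` suggests the download is complete. `setdefault` keeps the first match when an alias is ambiguous; a real citation matcher may prefer to collect all candidates instead.
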