Elasticsearch

We use Elasticsearch (ES) as the search backend. This document collects useful commands and queries for working with ES.

Propagate database entries to search index

Rebuild index: ./manage.py rebuild_index
Update existing index: ./manage.py update_index

Index fields driving citation lookups

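Three fields back the filter queries described here: the Boolean is_latest, plus the multi-value cited_laws and cited_cases, which power the “find documents tagged with X” lookups. Each lives on the corresponding search index and needs a reindex of that app whenever the field shape changes: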

Field	Index	Purpose
`is_latest`	`LawIndex`	Boolean mirroring `LawBook.latest`. The search backend always filters `not (django_ct=laws.law AND is_latest=false)` so stale revisions never enter the haystack hydration loop.
`cited_laws`	`CaseIndex`	List of `"<book_slug>__<section_slug>"` tokens for every law section the case cites. Powers `/search/?cited_law_book=&cited_law_section=`, the citing-cases panel on `/law/<book>/<sec>/`, and the REST + MCP `citing_cases` endpoints.
`cited_cases`	`CaseIndex`	List of Case PKs (as strings) for every case the case cites. Powers `/search/?cited_case=<id>`, the citing-cases panel on `/case/<slug>/`, and the corresponding API + MCP endpoints.

The token format used by cited_laws is intentional: two underscores cannot appear inside a Django slug, so f"{book_slug}__{section_slug}" is unambiguously parseable and safe to use inside an ES query_string literal. The helper oldp.apps.cases.search_indexes.cited_law_token renders the token; consumers should call it rather than concatenating the slugs manually.
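The token round trip can be sketched as follows. This is a minimal illustration, not the project's code: the function bodies and the parse_cited_law_token name are assumptions, and the real cited_law_token helper may have a different signature.

```python
# Hypothetical stand-ins for the cited_laws token helpers; only the
# "<book_slug>__<section_slug>" format itself comes from the text above.
TOKEN_SEPARATOR = "__"


def cited_law_token(book_slug: str, section_slug: str) -> str:
    """Join two slugs into a single cited_laws token."""
    for slug in (book_slug, section_slug):
        # A slug that is empty or already contains the separator would
        # make the token ambiguous, so reject it up front.
        if not slug or TOKEN_SEPARATOR in slug:
            raise ValueError(f"not a usable slug: {slug!r}")
    return f"{book_slug}{TOKEN_SEPARATOR}{section_slug}"


def parse_cited_law_token(token: str) -> tuple[str, str]:
    """Split a token back into (book_slug, section_slug)."""
    book_slug, separator, section_slug = token.partition(TOKEN_SEPARATOR)
    if not separator or not book_slug or not section_slug:
        raise ValueError(f"not a cited_laws token: {token!r}")
    return book_slug, section_slug
```

For example, cited_law_token("bgb", "823") gives "bgb__823", and parsing that token returns ("bgb", "823").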

Reindex requirement after deploy

A release that adds or changes one of the fields above needs an operator-run reindex on prod to populate the new shape. From inside the app container:

python manage.py update_index laws   # populates is_latest
python manage.py update_index cases  # populates cited_laws + cited_cases

Estimated runtime (May 2026 prod data):

update_index laws — ~4-5 min for ~110k law sections.
update_index cases — ~28 h single-worker, ~12.5 h with -k 4 for ~424k cases. Pass -k 4 on prod to keep the wall time manageable; the bottleneck is the ~1 MB content TextField pulled per row, and parallel workers amortise the per-batch network cost across 4 MySQL→app sockets.
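To put the runtimes above into perspective, the implied throughput can be worked out directly from the quoted figures. The snippet below is plain arithmetic on those numbers and measures nothing itself.

```python
# Figures quoted above (May 2026 prod data); everything else is derived.
CASES = 424_000
SINGLE_WORKER_HOURS = 28.0
FOUR_WORKER_HOURS = 12.5


def rows_per_second(rows: int, hours: float) -> float:
    """Average indexing throughput over the whole run."""
    return rows / (hours * 3600)


single = rows_per_second(CASES, SINGLE_WORKER_HOURS)
parallel = rows_per_second(CASES, FOUR_WORKER_HOURS)
speedup = SINGLE_WORKER_HOURS / FOUR_WORKER_HOURS

print(f"single worker: {single:.1f} cases/s")   # 4.2 cases/s
print(f"-k 4:          {parallel:.1f} cases/s")  # 9.4 cases/s
print(f"speedup:       {speedup:.2f}x")          # 2.24x
```

Four workers give roughly a 2.24x speedup rather than 4x, so the gain from parallelism is sub-linear on these figures.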

Before the reindex completes, downstream surfaces that rely on the new field render their empty state (“No cases cite this section yet.” / “No other cases cite this decision yet.” in the web UI; empty paginated list in REST; total_citing_cases: 0 in MCP). Free-text search, facets, and the search backend’s existing fields are unaffected: the reindex is additive and incremental.

Realtime sync on Case writes

oldp.apps.cases.signals connects post_save and post_delete handlers on Case that mirror the row-level review_status filter from CaseIndex.index_queryset into the ES index in realtime:

Event	ES action
`Case.save()` with `review_status='accepted'`	`index.update_object()` (upsert)
`Case.save()` with `review_status` in `{pending,rejected}`	`index.remove_object()`
`Case.delete()` (hard delete)	`index.remove_object()`
`loaddata` (`raw=True` on the save signal)	no-op — fixture flow runs `update_index` after

Both handlers defer via transaction.on_commit, so a rolled-back save never leaks into ES, and ES exceptions are logged but swallowed so a search-backend outage cannot break Case.save() callers. This covers admin edits and the case PATCH API endpoint; the only remaining drift source is QuerySet.update() (see below).
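The decision table and the error handling described above can be modelled in plain Python. This is a sketch of the logic only, not the code in oldp.apps.cases.signals: EsAction, es_action_for_save and run_swallowing_errors are hypothetical names, and treating every status other than 'accepted' as a removal is an assumption.

```python
import logging
from enum import Enum
from typing import Callable

logger = logging.getLogger(__name__)


class EsAction(Enum):
    UPDATE = "update_object"
    REMOVE = "remove_object"
    NOOP = "noop"


def es_action_for_save(review_status: str, raw: bool = False) -> EsAction:
    """Map a Case save onto the ES action from the table above."""
    if raw:
        # Fixture loading (loaddata): update_index runs afterwards.
        return EsAction.NOOP
    if review_status == "accepted":
        return EsAction.UPDATE
    # Assumption: any non-accepted status (pending, rejected) is removed.
    return EsAction.REMOVE


def run_swallowing_errors(action: Callable[[], None]) -> bool:
    """Run an ES call, logging instead of propagating failures.

    In a Django handler, a callable like this would be the one passed to
    transaction.on_commit so it only runs after a successful commit.
    """
    try:
        action()
    except Exception:
        logger.exception("ES sync failed; the save itself is unaffected")
        return False
    return True
```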

Bulk operations bypass the signals

Case.objects.filter(...).update(...) is a single SQL UPDATE that does not fire post_save, so the realtime sync above does not run for bulk paths. bulk_approve_cases is the canonical example:

# Approve without updating ES — fast, but ES will drift
python manage.py bulk_approve_cases

# Approve and immediately sync the touched rows into ES
python manage.py bulk_approve_cases --update-index

Always pass --update-index when running bulk_approve_cases on prod unless you plan to run a full update_index cases afterwards. The flag batches the updated PKs through backend.update(index, cases) in the same batch boundaries used by the SQL update, so there is no separate full-table scan.
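The idea of reusing the SQL batch boundaries for the ES update can be sketched like this. The function names and the callback shapes are hypothetical; the actual bulk_approve_cases command may be organised differently.

```python
from typing import Callable, Iterator, Optional, Sequence


def batched(pks: Sequence[int], batch_size: int) -> Iterator[Sequence[int]]:
    """Yield consecutive slices of at most batch_size primary keys."""
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")
    for start in range(0, len(pks), batch_size):
        yield pks[start:start + batch_size]


def approve_in_batches(
    pks: Sequence[int],
    batch_size: int,
    sql_update: Callable[[Sequence[int]], int],
    es_update: Optional[Callable[[Sequence[int]], None]] = None,
) -> int:
    """Approve rows batch by batch, syncing each batch to ES if asked to.

    sql_update stands in for the QuerySet.update() call and returns the
    number of rows it touched; es_update stands in for the ES sync that
    --update-index enables (None models running without the flag).
    """
    touched = 0
    for batch in batched(pks, batch_size):
        touched += sql_update(batch)
        if es_update is not None:
            # Same PKs, same batch: no second pass over the table.
            es_update(batch)
    return touched
```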

Periodic reconciliation

For periodic safety (e.g., after manual SQL edits or to catch any row that slipped through a bulk path) run the drift-prune script from the deployment repo:

deployment/scripts/prune_stale_es_docs.sh cases.case            # dry run
deployment/scripts/prune_stale_es_docs.sh cases.case --apply    # delete stale docs

The script scrolls every cases.case doc PK from ES, diffs the result against the canonical Case.get_queryset().values_list("pk") set, and deletes only the orphans. It runs in seconds even on the full 424k index because it transfers only PKs, not document payloads.
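The diff at the heart of the prune step amounts to a set difference. The sketch below illustrates it with hypothetical function names and an injected delete callback; it is not the shell script itself, and the string normalisation assumes ES returns PKs as strings while the database returns integers.

```python
from typing import Callable, Iterable


def find_stale_doc_pks(es_pks: Iterable, db_pks: Iterable) -> set[str]:
    """Return PKs that exist in ES but not in the canonical queryset."""
    # Normalise both sides to strings so "42" and 42 compare equal.
    return {str(pk) for pk in es_pks} - {str(pk) for pk in db_pks}


def prune(
    es_pks: Iterable,
    db_pks: Iterable,
    delete_doc: Callable[[str], None],
    apply: bool = False,
) -> set[str]:
    """Report stale docs; delete them only when apply is True."""
    stale = find_stale_doc_pks(es_pks, db_pks)
    if apply:
        for pk in sorted(stale):
            delete_doc(pk)
    return stale
```

With apply left at False the function only reports the orphans, which mirrors the dry-run default of the script.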

For the user-facing search surfaces (web /search/, REST /api/cases/search/, MCP search_cases) and the full matrix of filters they support — keyword + facets + date range + citation graph, plus the order_by=date sort toggle — see Search.

Service-layer surfaces backed by Elasticsearch

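After PR #224 / PR #225, citing-cases lookups are served by ES on every surface. The remaining citation endpoints stay on SQL: the references/ forward-refs endpoints (which return the immediate marker chain, a Python dict, and have always come straight out of the ORM), the flat /api/references/ resource, and the rarely used citing_laws endpoints: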

Surface	Backend
Web `/law/<book>/<sec>/` “Referenced by” panel	ES (`cited_laws`)
Web `/case/<slug>/` “Cited by” panel	ES (`cited_cases`)
Web `/search/?cited_law_book=&cited_law_section=`	ES (`cited_laws`)
Web `/search/?cited_case=<id>`	ES (`cited_cases`)
REST `/api/laws/<id>/citing_cases/`	ES (`cited_laws`)
REST `/api/cases/<id>/citing_cases/`	ES (`cited_cases`)
REST `/api/{laws,cases}/<id>/references/`	SQL (forward refs)
REST `/api/references/` flat resource	SQL (analytical)
REST `/api/{laws,cases}/<id>/citing_laws/`	SQL (rare)
MCP `get_cases_for_law`	ES (`cited_laws`)
MCP `get_citing_cases` (cases citing a case)	ES (`cited_cases`)
MCP `get_case_references` (forward refs)	SQL

During an ES outage, a citing-cases surface returns:

Web: a “search unavailable” notice with a deep link to the search results page (the user can retry once ES recovers).
REST: 503 SearchBackendUnavailable (hard outage) or 503 SearchBackendTimeout (retryable: true in the body — transient warm-up).
MCP: {error, retryable, hint} envelope matching the search_cases tool’s contract.
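A client can use the retryable flag to decide whether to try again. The sketch below assumes the REST 503 body is a JSON object with a boolean "retryable" key, as described above; fetch_with_retry and its parameters are hypothetical, and the HTTP call is injected so the logic stays independent of any client library.

```python
import time
from typing import Callable, Tuple

Response = Tuple[int, dict]  # (HTTP status, decoded JSON body)


def fetch_with_retry(
    fetch: Callable[[], Response],
    attempts: int = 3,
    delay_seconds: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Response:
    """Call fetch, retrying only on 503 responses marked retryable."""
    if attempts < 1:
        raise ValueError("attempts must be >= 1")
    status, body = fetch()
    for attempt in range(1, attempts):
        if status != 503 or not body.get("retryable"):
            # Success, a non-503 error, or a hard outage: stop here.
            break
        sleep(delay_seconds * attempt)  # simple linear back-off
        status, body = fetch()
    return status, body
```

A hard outage (503 without retryable: true) is returned to the caller after the first attempt, so callers do not keep polling a backend that is down.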

The relational citation graph (Reference rows + marker through-tables) remains the source of truth and feeds the ES indexer — see oldp/apps/references/services/citation_graph.py.

Queries

curl -XGET 'localhost:9200/oldp/law/_search?pretty&q='

curl -XGET 'localhost:9200/oldp/law/_search?pretty' -H 'Content-Type: application/json' -d '
{
    "query": {
        "match" : {
            "book_code" : "AbwV"
        }
    },
    "sort": [
        { "doknr": { "order": "asc" } },
        "_score"
    ],
    "_source" : ["doknr", "title"]
}'
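The same law query can be issued from Python using only the standard library. This is a sketch under assumptions: the function names are hypothetical, and the base URL and the /oldp/law/_search path follow the curl examples in this section and may not match every deployment.

```python
import json
import urllib.request


def build_law_query(book_code: str) -> dict:
    """Build the request body used in the curl example above."""
    return {
        "query": {"match": {"book_code": book_code}},
        "sort": [{"doknr": {"order": "asc"}}, "_score"],
        "_source": ["doknr", "title"],
    }


def search_laws(book_code: str, base_url: str = "http://localhost:9200") -> dict:
    """POST the query to ES and return the decoded JSON response."""
    request = urllib.request.Request(
        f"{base_url}/oldp/law/_search",
        data=json.dumps(build_law_query(book_code)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)
```

Calling search_laws("AbwV") requires a reachable ES node; build_law_query can be inspected on its own without one.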

curl -XGET 'localhost:9200/oldp/case/_search?pretty' -H 'Content-Type: application/json' -d '
{
    "_source" : ["text", "title"]
}'

Check cluster health

curl -XGET 'https://localhost:9200/_cat/health?v'

Load Index Mappings

curl -XPUT localhost:9200/leegle -H 'Content-Type: application/json' -d @oldp/assets/es_index.json