Elasticsearch
We rely on Elasticsearch (ES) as our search backend. This document collects useful commands and queries for working with ES.
Propagate database entries to search index
Rebuild index:
./manage.py rebuild_index
Update existing index:
./manage.py update_index
Index fields driving citation lookups
Three multi-value fields back the “find documents tagged with X” filter queries. Each lives on the corresponding search index and needs a reindex of that app whenever the field shape changes:
| Field | Index | Purpose |
|---|---|---|
| | | Boolean mirroring |
| | | List of |
| | | List of Case PKs (as strings) for every case the case cites. Powers |
The token format used by cited_laws is intentional: two underscores
cannot appear inside a Django slug, so f"{book_slug}__{section_slug}"
is unambiguously parseable and safe to use inside an ES query_string
literal. The helper oldp.apps.cases.search_indexes.cited_law_token
renders the token; consumers should call it rather than concatenating
manually.
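The token format can be illustrated with a minimal sketch. The functions make_token and parse_token below are hypothetical stand-ins written for this document, not the real cited_law_token helper, whose signature and validation may differ. The sketch assumes slugs never contain a double underscore and never start or end with an underscore, which is the property the format relies on.

```python
SEPARATOR = "__"


def _check_slug(slug: str) -> None:
    # Assumption: a valid slug is non-empty, has no double underscore,
    # and does not start or end with an underscore.
    if not slug or SEPARATOR in slug or slug.startswith("_") or slug.endswith("_"):
        raise ValueError(f"slug not usable in a token: {slug!r}")


def make_token(book_slug: str, section_slug: str) -> str:
    """Join two slugs into a single token (hypothetical helper)."""
    _check_slug(book_slug)
    _check_slug(section_slug)
    return f"{book_slug}{SEPARATOR}{section_slug}"


def parse_token(token: str) -> tuple[str, str]:
    """Split a token back into (book_slug, section_slug)."""
    book_slug, sep, section_slug = token.partition(SEPARATOR)
    if not sep:
        raise ValueError(f"missing separator in token: {token!r}")
    _check_slug(book_slug)
    _check_slug(section_slug)
    return book_slug, section_slug


token = make_token("bgb", "par-1")
print(token, parse_token(token))
```

Because the separator cannot occur inside either half, splitting on its first occurrence recovers both slugs without any escaping.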
Reindex requirement after deploy
A release that adds or changes one of the fields above needs an operator-run reindex on prod to populate the new shape. From inside the app container:
python manage.py update_index laws # populates is_latest
python manage.py update_index cases # populates cited_laws + cited_cases
Estimated runtime (May 2026 prod data):
update_index laws: ~4-5 min for ~110k law sections.
update_index cases: ~28 h single-worker, ~12.5 h with -k 4 for ~424k cases. Pass -k 4 on prod to keep the wall time manageable; the bottleneck is the ~1 MB content TextField pulled per row, and parallel workers amortise the per-batch network cost across 4 MySQL→app sockets.
Before the reindex completes, downstream surfaces that rely on the
new field render their empty state (“No cases cite this section
yet.” / “No other cases cite this decision yet.” in the web UI;
empty paginated list in REST; total_citing_cases: 0 in MCP). Free-
text search, facets, and the search backend’s existing fields are
unaffected — the reindex is additive and incremental.
Realtime sync on Case writes
oldp.apps.cases.signals connects post_save and post_delete
handlers on Case that mirror the row-level review_status filter
from CaseIndex.index_queryset into the ES index in realtime:
| Event | ES action |
|---|---|
| | |
| | |
| | |
| | no-op: fixture flow runs |
Both handlers defer via transaction.on_commit, so a rolled-back
save never leaks into ES, and ES exceptions are logged but swallowed
so a search-backend outage cannot break Case.save() callers. This
covers admin edits and the case PATCH API endpoint; the only
remaining drift source is QuerySet.update() (see below).
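The two guarantees described above (deferral until commit, and swallowing ES errors) can be sketched without Django. PendingCommit is a toy stand-in for transaction.on_commit and run_swallowing_errors is a hypothetical wrapper; neither name comes from oldp.apps.cases.signals, and the real handlers may be structured differently.

```python
import logging

logger = logging.getLogger("case_index_sync")


def run_swallowing_errors(action, *args) -> bool:
    """Run an index action; log any exception instead of raising it."""
    try:
        action(*args)
        return True
    except Exception:
        # A search-backend outage must not propagate to the caller of save().
        logger.exception("ES sync failed for args %r", args)
        return False


class PendingCommit:
    """Toy model of on_commit: callbacks run only if the transaction commits."""

    def __init__(self):
        self._callbacks = []

    def on_commit(self, callback):
        self._callbacks.append(callback)

    def commit(self):
        callbacks, self._callbacks = self._callbacks, []
        for callback in callbacks:
            callback()

    def rollback(self):
        # Rolled-back work never reaches the index.
        self._callbacks.clear()


indexed = []
tx = PendingCommit()
tx.on_commit(lambda: run_swallowing_errors(indexed.append, 42))
tx.rollback()
tx.commit()
print(indexed)  # still empty: the rolled-back save left no trace
```

In this model a rollback simply discards the queued callbacks, which matches Django's documented behaviour of never running on_commit callbacks for a rolled-back transaction.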
Bulk operations bypass the signals
Case.objects.filter(...).update(...) is a single SQL UPDATE that
does not fire post_save, so the realtime sync above does not
run for bulk paths. bulk_approve_cases is the canonical example:
# Approve without updating ES — fast, but ES will drift
python manage.py bulk_approve_cases
# Approve and immediately sync the touched rows into ES
python manage.py bulk_approve_cases --update-index
Always pass --update-index when running bulk_approve_cases on
prod unless you plan to run a full update_index cases afterwards.
The flag batches the updated PKs through backend.update(index, cases) in the same batch boundaries used by the SQL update, so
there is no separate full-table scan.
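The shared batch boundaries can be pictured with a small sketch. The names batched and approve_in_batches, and the two callables, are hypothetical; in the real command the index step would be the backend.update(index, cases) call mentioned above, and the batch size is an assumption.

```python
from typing import Callable, Iterator, Sequence


def batched(pks: Sequence[int], size: int) -> Iterator[Sequence[int]]:
    """Yield consecutive slices of at most `size` primary keys."""
    for start in range(0, len(pks), size):
        yield pks[start:start + size]


def approve_in_batches(
    pks: Sequence[int],
    size: int,
    sql_update: Callable[[Sequence[int]], None],
    index_update: Callable[[Sequence[int]], None],
) -> int:
    """Apply the SQL update and the index update to the same batches."""
    total = 0
    for batch in batched(pks, size):
        sql_update(batch)    # stands in for the single SQL UPDATE per batch
        index_update(batch)  # stands in for the per-batch index sync
        total += len(batch)
    return total


sql_log, es_log = [], []
count = approve_in_batches([1, 2, 3, 4, 5], 2, sql_log.append, es_log.append)
print(count, sql_log == es_log)
```

Each batch is touched once by both steps, so the index sync only sees rows the SQL update just changed.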
Periodic reconciliation
For periodic safety (e.g., after manual SQL edits or to catch any row that slipped through a bulk path) run the drift-prune script from the deployment repo:
deployment/scripts/prune_stale_es_docs.sh cases.case # dry run
deployment/scripts/prune_stale_es_docs.sh cases.case --apply # delete stale docs
The script scrolls every cases.case doc PK from ES, diffs against
the canonical Case.get_queryset().values_list("pk") set, and
deletes only the orphans. Runs in seconds even on the full 424k
index because it transfers only PKs, not document payloads.
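The core of the reconciliation is a set difference. The sketch below is not the prune_stale_es_docs.sh script; find_stale_ids and prune are hypothetical names, and it assumes ES document PKs arrive as strings while database PKs are integers.

```python
from typing import Callable, Iterable, List


def find_stale_ids(es_ids: Iterable[str], db_pks: Iterable[int]) -> List[str]:
    """Return ES document PKs that have no matching database row."""
    canonical = {str(pk) for pk in db_pks}
    return sorted(set(es_ids) - canonical)


def prune(
    es_ids: Iterable[str],
    db_pks: Iterable[int],
    delete: Callable[[str], None],
    apply: bool = False,
) -> List[str]:
    """Report orphans; delete them only when apply is True (dry run otherwise)."""
    stale = find_stale_ids(es_ids, db_pks)
    if apply:
        for doc_id in stale:
            delete(doc_id)
    return stale


deleted = []
print(prune(["1", "2", "3"], [1, 3], deleted.append))              # dry run
print(prune(["1", "2", "3"], [1, 3], deleted.append, apply=True))  # deletes
```

Only identifiers are compared, which is why the amount of data moved stays small regardless of document size.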
For the user-facing search surfaces (web /search/, REST
/api/cases/search/, MCP search_cases) and the full matrix of
filters they support — keyword + facets + date range + citation graph,
plus the order_by=date sort toggle — see Search.
Service-layer surfaces backed by Elasticsearch
After PR #224 / PR #225 the citation graph is served by ES on every
surface except the references/ forward-refs endpoints (which return
the immediate marker chain — a Python dict — and have always come
straight out of the ORM):
| Surface | Backend |
|---|---|
| Web | ES ( |
| Web | ES ( |
| Web | ES ( |
| Web | ES ( |
| REST | ES ( |
| REST | ES ( |
| REST | SQL (forward refs) |
| REST | SQL (analytical) |
| REST | SQL (rare) |
| MCP | ES ( |
| MCP | ES ( |
| MCP | SQL |
An ES outage on a citing-cases surface returns:
Web: a “search unavailable” notice with a deep link to the search results page (the user can retry once ES recovers).
REST: 503 SearchBackendUnavailable (hard outage) or 503 SearchBackendTimeout (retryable: true in the body; transient warm-up).
MCP: {error, retryable, hint} envelope matching the search_cases tool’s contract.
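A client might honour the retryable flag roughly as follows. This is a sketch under assumptions: the response body is modelled as a dict with a boolean "retryable" key, and fetch_with_retry, its parameters, and the backoff schedule are hypothetical rather than part of any OLDP client.

```python
import time
from typing import Callable, Tuple

Response = Tuple[int, dict]


def fetch_with_retry(
    fetch: Callable[[], Response],
    attempts: int = 3,
    delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> Response:
    """Retry only 503 responses whose body is marked retryable."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        status, body = fetch()
        retryable = status == 503 and bool(body.get("retryable"))
        if not retryable or attempt == attempts:
            # Success, hard outage, or retries exhausted: hand back as-is.
            return status, body
        sleep(delay * attempt)  # linear backoff between attempts
    raise AssertionError("unreachable")


# Simulated backend: one transient timeout, then a normal answer.
responses = iter([(503, {"retryable": True}), (200, {"results": []})])
print(fetch_with_retry(lambda: next(responses), sleep=lambda s: None))
```

A hard outage (503 without the flag) is returned immediately, so callers do not keep hammering a backend that is down.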
The relational citation graph (Reference rows + marker
through-tables) remains the source of truth and feeds the ES
indexer — see oldp/apps/references/services/citation_graph.py.
Queries
curl -XGET 'localhost:9200/oldp/law/_search?pretty&q='
curl -XGET 'localhost:9200/oldp/law/_search?pretty' -H 'Content-Type: application/json' -d '
{
"query": {
"match" : {
"book_code" : "AbwV"
}
},
"sort": [
{ "doknr": { "order": "asc" } },
"_score"
],
"_source" : ["doknr", "title"]
}'
curl -XGET 'localhost:9200/oldp/case/_search?pretty' -H 'Content-Type: application/json' -d '
{
"_source" : ["text", "title"]
}'
Check cluster health
curl -XGET 'https://localhost:9200/_cat/health?v'
Load Index Mappings
curl -XPUT localhost:9200/leegle -H 'Content-Type: application/json' -d @oldp/assets/es_index.json