feat(health): add /health/ endpoint for OpsLog monitoring #55
Overview
Implement a machine-readable GET /health/ endpoint so OpsLog (and any other monitoring system) can assess the runtime health of the site without needing SSH or Django management commands.

The endpoint must conform to the SUM Platform health contract defined in the spec.
HTTP Contract
ok → 200
degraded → 200
unhealthy → 503

Critical checks (failure → unhealthy + 503): db, cache
Non-critical checks (failure → degraded + 200): celery, backup

Response Schema
- version.git_sha: GIT_SHA env var (short commit SHA, baked in at image build time).
- version.build: BUILD_ID env var (format: build-YYYYMMDD-<sha>, baked in at image build time).

Implementation Plan
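As a starting point, the aggregation rule from the HTTP contract (a failed db or cache check gives unhealthy and 503; a failed celery or backup check gives degraded and 200) could be sketched in plain Python as follows. The function name aggregate and the CRITICAL set are illustrative assumptions, not names taken from the spec.

```python
from typing import Dict, Tuple

# Check names come from the HTTP contract: a failure in a CRITICAL check makes
# the site unhealthy; a failure in any other check only degrades it.
CRITICAL = {"db", "cache"}


def aggregate(checks: Dict[str, dict]) -> Tuple[str, int]:
    """Map per-check results to (overall status, HTTP status code).

    Hypothetical helper: the name and signature are assumptions.
    """
    failed = {name for name, result in checks.items()
              if result.get("status") != "ok"}
    if failed & CRITICAL:
        return "unhealthy", 503
    if failed:
        return "degraded", 200
    return "ok", 200


if __name__ == "__main__":
    results = {
        "db": {"status": "ok", "latency_ms": 1.2},
        "cache": {"status": "ok", "latency_ms": 0.3},
        "celery": {"status": "fail", "detail": "broker unreachable"},
        "backup": {"status": "ok"},
    }
    print(aggregate(results))  # ('degraded', 200)
```

Keeping this rule separate from the view would let the integration tests focus on HTTP behaviour while the rule itself is covered by plain unit tests.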
1. New apps/health/ app
Create a dedicated health app rather than adding to core, to keep monitoring concerns isolated and easy to locate.

2. checks.py: four check functions
check_db()
- Uses django.db.connection to execute SELECT 1
- Times the query with time.perf_counter()
- Returns {"status": "ok", "latency_ms": <float>} on success
- Returns {"status": "fail", "detail": str(exc)} on any exception
check_cache()
- cache.set(key, probe_value, timeout=5)
- cache.get(key) and verify value matches
- cache.delete(key)
- Returns {"status": "ok", "latency_ms": <float>} on success
- Returns {"status": "fail", "detail": ...} on mismatch or exception
check_celery()
- Reads os.environ.get("CELERY_BROKER_URL")
- Unset → {"status": "ok", "detail": "Celery not configured: CELERY_BROKER_URL is unset"}
- Otherwise connects via kombu.Connection with connect_timeout=3, max_retries=1
- Connection succeeds → {"status": "ok"}
- Connection fails → {"status": "fail", "detail": str(exc)}
- kombu is not in nohype's dependency tree. Use a lazy import inside check_celery() with try/except ImportError, returning ok + skip detail if kombu is not installed. Do not add kombu to requirements.
check_backup()
- Reads os.environ.get("BACKUP_STATUS_FILE")
- Unset → {"status": "fail", "detail": "Backup monitoring not configured: BACKUP_STATUS_FILE is unset"}
- FileNotFoundError → {"status": "fail", "detail": "Backup status file not found: <path>"}
- PermissionError / other IO error → {"status": "fail", "detail": str(exc)}
- Invalid file contents → {"status": "fail", "detail": "Invalid backup status file"}
- Last backup older than 48 h → {"status": "fail", "detail": "Last backup is <age> old (> 48 h)"}
- Otherwise → {"status": "ok"}

3. views.py: health_view(request)

4. urls.py

5. Wire into config/urls.py
Add before the Wagtail catch-all (path("", include(wagtail_urls))).

6. Register app in config/settings/base.py
Add "apps.health" to INSTALLED_APPS.

7. Build metadata wiring: GIT_SHA and BUILD_ID
GIT_SHA and BUILD_ID are baked into the image at build time so the running container always knows what commit it is serving. Three files require changes:
- Dockerfile: accept build args and promote to ENV (placed after COPY . /app to avoid busting the apt/pip cache).
- docker-compose.prod.yml: expand build: app to object form and forward the args.
- deploy/deploy.sh: capture values from the freshly-pulled repo and export them before docker compose runs, so Docker Compose's variable interpolation picks them up.

The deploy script's post-startup health check should also be updated to probe /health/ instead of /, so deploys fail fast if the DB is unreachable rather than just checking that gunicorn responds.

Result: every production image has the real SHA and build ID baked in (e.g. GIT_SHA=26ba3245, BUILD_ID=build-20260306-26ba3245). The unknown fallback only applies to CI test builds and local dev, where OpsLog's _has_valid_site_version() still accepts it (non-empty string).

Tests
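The test_backup_* cases in this section become easier to write if check_backup() depends only on the environment and an injectable clock. The sketch below follows the failure messages given in the plan. The status file format (a single ISO 8601 timestamp of the last backup) and the now parameter are assumptions made for illustration, since the spec excerpt does not define them.

```python
import os
from datetime import datetime, timedelta, timezone

MAX_AGE = timedelta(hours=48)  # staleness threshold from the spec


def check_backup(now=None):
    """Sketch of check_backup(); file format and `now` are assumptions."""
    path = os.environ.get("BACKUP_STATUS_FILE")
    if not path:
        return {"status": "fail",
                "detail": "Backup monitoring not configured: "
                          "BACKUP_STATUS_FILE is unset"}
    try:
        with open(path, encoding="utf-8") as fh:
            raw = fh.read().strip()
    except FileNotFoundError:
        return {"status": "fail",
                "detail": f"Backup status file not found: {path}"}
    except OSError as exc:  # PermissionError and other IO errors
        return {"status": "fail", "detail": str(exc)}
    except UnicodeDecodeError:  # binary or otherwise undecodable contents
        return {"status": "fail", "detail": "Invalid backup status file"}
    try:
        # ASSUMPTION: the file holds one ISO 8601 timestamp of the last backup.
        last = datetime.fromisoformat(raw)
    except ValueError:
        return {"status": "fail", "detail": "Invalid backup status file"}
    if last.tzinfo is None:
        last = last.replace(tzinfo=timezone.utc)  # treat naive stamps as UTC
    now = now or datetime.now(timezone.utc)
    age = now - last
    if age > MAX_AGE:
        return {"status": "fail",
                "detail": f"Last backup is {age} old (> 48 h)"}
    return {"status": "ok"}
```

In a pytest test, monkeypatch.setenv("BACKUP_STATUS_FILE", str(tmp_path / "status")) together with a fixed now value would cover the fresh, stale, and invalid cases without depending on wall-clock time.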
test_checks.py
Unit-test each check in isolation using unittest.mock / pytest monkeypatch:
- test_db_ok
- test_db_fail: cursor.execute raises OperationalError
- test_cache_ok
- test_cache_fail: cache.get returns wrong value
- test_celery_no_broker: CELERY_BROKER_URL unset → ok + detail
- test_celery_no_kombu
- test_celery_ok
- test_celery_fail
- test_backup_no_env: BACKUP_STATUS_FILE unset → fail
- test_backup_missing_file
- test_backup_fresh
- test_backup_stale
- test_backup_invalid

test_views.py
Integration tests using django.test.Client:
- test_healthy: status ok
- test_degraded_celery: status degraded
- test_degraded_backup: status degraded
- test_unhealthy_db: status unhealthy
- test_unhealthy_cache: status unhealthy
- test_response_shape
- test_version_fields: git_sha and build present and non-empty
- test_no_cache_headers: Cache-Control: no-cache set

Acceptance Criteria
- GET /health/ returns valid JSON matching the schema above
- 200 when status is ok or degraded
- 503 when status is unhealthy
- db and cache failures produce unhealthy
- celery and backup failures produce degraded
- checks.celery returns ok-with-detail when CELERY_BROKER_URL is unset
- checks.celery returns ok-with-detail when kombu is not installed (no ImportError)
- checks.backup returns fail-with-detail when BACKUP_STATUS_FILE is unset
- version.git_sha is the real short commit SHA in production (not unknown)
- version.build is build-YYYYMMDD-<sha> in production (not unknown)
- Dockerfile accepts GIT_SHA and BUILD_ID build args and bakes them into ENV
- docker-compose.prod.yml forwards GIT_SHA / BUILD_ID from shell env as build args
- deploy/deploy.sh captures and exports both vars before running docker compose
- Cache-Control: no-cache set (via @never_cache)
- Tests pass (pytest apps/health/)

Resolution: point 2 - GIT_SHA / BUILD_ID wiring
Implemented build-time injection across three files:
- Dockerfile: ARG + ENV placed after COPY . /app to avoid busting the apt/pip cache.
- docker-compose.prod.yml: expanded build: app to object form.
- deploy/deploy.sh: captures values from the freshly-pulled repo after git pull, then exports them for docker compose.

Example: GIT_SHA=26ba3245, BUILD_ID=build-20260306-26ba3245

Fallback to unknown is preserved: CI test builds and local dev still work without the args, and unknown satisfies OpsLog's _has_valid_site_version() (non-empty string).

Review of updated spec - looks good ✅
Verified against the repo. A few minor notes:
1. deploy/deploy.sh health check should use /health/ not /
The deploy script currently does curl ... http://localhost:8001/ to confirm the site is up. Once this endpoint exists, it should switch to curl ... http://localhost:8001/health/ so that deploys fail fast if the DB is unreachable, rather than just checking whether gunicorn responds. Not blocking, but worth including as a one-line tweak in scope.

2. time.monotonic() vs time.perf_counter()
The spec says time.monotonic() for latency measurement. The SUM Platform reference uses time.perf_counter(), which has higher resolution. Either works fine for millisecond-level measurements; just noting the divergence. perf_counter is marginally better for sub-ms cache probes.

3. docker-compose.prod.yml build context
The spec says to expand build: app to object form, which is correct. The current value build: app means the build context is the app/ subdirectory (on the production host, APP_DIR is checked out there). The expanded form needs context: app to preserve this. The spec has this right.

4. Everything else checks out:
- apps/health/ as a standalone app: correct isolation
- kombu import with ImportError guard: correct
- check_backup() returning fail when unconfigured: matches SUM PR #1571 semantics
- @never_cache on the view: correct for monitoring endpoints

Spec is implementation-ready.
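
To make note 2 concrete, a latency-measuring wrapper built on time.perf_counter() might look like the sketch below. The names timed_check and probe are hypothetical; the result shapes follow the check_db() / check_cache() return values in the spec. The demo at the bottom prints each clock's advertised resolution, which varies by platform.

```python
import time
from typing import Callable


def timed_check(probe: Callable[[], object]) -> dict:
    """Run probe() and report its latency in milliseconds.

    Hypothetical helper: any exception raised by probe() is reported as a
    failed check instead of propagating to the caller.
    """
    start = time.perf_counter()
    try:
        probe()
    except Exception as exc:
        return {"status": "fail", "detail": str(exc)}
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return {"status": "ok", "latency_ms": round(elapsed_ms, 3)}


if __name__ == "__main__":
    # Compare the two clocks' advertised resolution on this platform.
    for name in ("monotonic", "perf_counter"):
        print(name, time.get_clock_info(name).resolution)
    print(timed_check(lambda: sum(range(1000))))
```

Both clocks are monotonic, so swapping one for the other changes only the measurement resolution, not the correctness of the elapsed-time calculation.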