feat(health): add /health/ endpoint for OpsLog monitoring #55

Closed
opened 2026-03-06 14:37:23 +00:00 by claude · 2 comments
Member

Overview

Implement a machine-readable GET /health/ endpoint so OpsLog (and any other monitoring system) can assess the runtime health of the site without needing SSH or Django management commands.

The endpoint must conform to the SUM Platform health contract defined in the spec.


HTTP Contract

| Overall status | HTTP code |
|---|---:|
| ok | 200 |
| degraded | 200 |
| unhealthy | 503 |

Critical checks (failure → unhealthy + 503): db, cache

Non-critical checks (failure → degraded + 200): celery, backup


Response Schema

{
  "status": "ok | degraded | unhealthy",
  "version": { "git_sha": "26ba3245", "build": "build-20260306-26ba3245" },
  "checks": {
    "db":     { "status": "ok", "latency_ms": 1.7 },
    "cache":  { "status": "ok", "latency_ms": 0.8 },
    "celery": { "status": "ok" },
    "backup": { "status": "ok" }
  },
  "timestamp": "2026-03-06T13:05:00Z"
}

version.git_sha: GIT_SHA env var (short commit SHA, baked in at image build time).
version.build: BUILD_ID env var (format: build-YYYYMMDD-<sha>, baked in at image build time).
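
On the consumer side, the schema above could be checked with a small validator. This is only a sketch under assumptions: the name `validate_health_payload`, the list-of-problems return value, and the exact strictness are illustrative and are not taken from OpsLog or the SUM Platform spec.

```python
from datetime import datetime

OVERALL_STATUSES = {"ok", "degraded", "unhealthy"}
CHECK_STATUSES = {"ok", "fail"}
REQUIRED_CHECKS = ("db", "cache", "celery", "backup")


def _is_one_of(value, allowed):
    """True if value is a string contained in the allowed set."""
    return isinstance(value, str) and value in allowed


def validate_health_payload(payload):
    """Return a list of problems found in a /health/ payload (empty if none)."""
    problems = []
    if not _is_one_of(payload.get("status"), OVERALL_STATUSES):
        problems.append("status must be ok, degraded or unhealthy")

    version = payload.get("version")
    if not isinstance(version, dict):
        version = {}
    for field in ("git_sha", "build"):
        value = version.get(field)
        if not isinstance(value, str) or not value:
            problems.append(f"version.{field} must be a non-empty string")

    checks = payload.get("checks")
    if not isinstance(checks, dict):
        checks = {}
    for name in REQUIRED_CHECKS:
        result = checks.get(name)
        if not isinstance(result, dict) or not _is_one_of(result.get("status"), CHECK_STATUSES):
            problems.append(f"checks.{name}.status must be ok or fail")

    try:
        # Same format string the view uses to render the timestamp.
        datetime.strptime(payload.get("timestamp", ""), "%Y-%m-%dT%H:%M:%SZ")
    except (TypeError, ValueError):
        problems.append("timestamp must look like 2026-03-06T13:05:00Z")
    return problems
```

Returning a list rather than raising keeps every violation visible in one pass, which tends to be more useful in a monitoring log than the first failure alone.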


Implementation Plan

1. New apps/health/ app

Create a dedicated health app rather than adding to core, to keep monitoring concerns isolated and easy to locate.

apps/health/
  __init__.py
  apps.py
  checks.py      ← individual check functions
  views.py       ← health view
  urls.py
  tests/
    __init__.py
    test_checks.py
    test_views.py

2. checks.py — four check functions

check_db()

  • Open a cursor via django.db.connection
  • Execute SELECT 1
  • Measure wall-clock latency with time.perf_counter()
  • Return {"status": "ok", "latency_ms": <float>} on success
  • Return {"status": "fail", "detail": str(exc)} on any exception
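
The steps above could be sketched as follows. The optional `connection` parameter is an assumption added here so the function can be unit-tested with a fake; the plan itself only says to use `django.db.connection`. Rounding the latency to two decimals is likewise an illustrative choice.

```python
import time


def check_db(connection=None):
    """Run SELECT 1 and report status plus latency in milliseconds."""
    if connection is None:
        # Imported lazily so the function can be tested without Django settings.
        from django.db import connection
    start = time.perf_counter()
    try:
        with connection.cursor() as cursor:
            cursor.execute("SELECT 1")
    except Exception as exc:  # any failure counts as a failed check
        return {"status": "fail", "detail": str(exc)}
    latency_ms = (time.perf_counter() - start) * 1000
    return {"status": "ok", "latency_ms": round(latency_ms, 2)}
```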

check_cache()

  • Generate a one-shot UUID key
  • Call cache.set(key, probe_value, timeout=5)
  • Call cache.get(key) and verify value matches
  • Call cache.delete(key)
  • Measure end-to-end latency
  • Return {"status": "ok", "latency_ms": <float>} on success
  • Return {"status": "fail", "detail": ...} on mismatch or exception
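
A minimal sketch of the probe described above. The injectable `cache` parameter, the `health-probe-` key prefix, and the mismatch message are assumptions made for illustration; by default the function falls back to Django's `django.core.cache.cache`.

```python
import time
import uuid


def check_cache(cache=None):
    """Round-trip a throwaway key through the cache and time it."""
    if cache is None:
        # Imported lazily so the function can be tested without Django settings.
        from django.core.cache import cache
    key = f"health-probe-{uuid.uuid4().hex}"  # one-shot key, never reused
    probe_value = uuid.uuid4().hex
    start = time.perf_counter()
    try:
        cache.set(key, probe_value, timeout=5)
        stored = cache.get(key)
        cache.delete(key)
    except Exception as exc:
        return {"status": "fail", "detail": str(exc)}
    if stored != probe_value:
        return {"status": "fail", "detail": "Cache probe value mismatch"}
    latency_ms = (time.perf_counter() - start) * 1000
    return {"status": "ok", "latency_ms": round(latency_ms, 2)}
```

The 5-second timeout on `set` means a probe key that is never deleted (for example, because `delete` raised) still expires on its own.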

check_celery()

  • Read os.environ.get("CELERY_BROKER_URL")
  • If unset → {"status": "ok", "detail": "Celery not configured: CELERY_BROKER_URL is unset"}
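  • If set → open a kombu.Connection with connect_timeout=3 and call ensure_connection(max_retries=1) on it (max_retries is an argument of ensure_connection, not of the Connection constructor)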
    • On success → {"status": "ok"}
    • On exception → {"status": "fail", "detail": str(exc)}
  • Note: kombu is not in nohype's dependency tree. Import it lazily inside check_celery(), wrapped in try/except ImportError, and return ok plus a skip detail if kombu is not installed. Do not add kombu to requirements.
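
One way the Celery check could look, as a sketch rather than the final implementation. The `environ` parameter and the wording of the skip detail are assumptions. The sketch passes `connect_timeout` to `kombu.Connection` and `max_retries` to `ensure_connection()`, which is where kombu accepts the retry limit.

```python
import os


def check_celery(environ=None):
    """Probe the Celery broker if one is configured and kombu is available."""
    environ = os.environ if environ is None else environ
    broker_url = environ.get("CELERY_BROKER_URL")
    if not broker_url:
        return {
            "status": "ok",
            "detail": "Celery not configured: CELERY_BROKER_URL is unset",
        }
    try:
        # Lazy import: kombu is optional and must not become a hard dependency.
        from kombu import Connection
    except ImportError:
        return {"status": "ok", "detail": "Celery check skipped: kombu is not installed"}
    try:
        with Connection(broker_url, connect_timeout=3) as conn:
            conn.ensure_connection(max_retries=1)
    except Exception as exc:
        return {"status": "fail", "detail": str(exc)}
    return {"status": "ok"}
```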

check_backup()

  • Read os.environ.get("BACKUP_STATUS_FILE")
  • If unset → {"status": "fail", "detail": "Backup monitoring not configured: BACKUP_STATUS_FILE is unset"}
  • Try to open & read the file:
    • FileNotFoundError → {"status": "fail", "detail": "Backup status file not found: <path>"}
    • PermissionError / other IO error → {"status": "fail", "detail": str(exc)}
    • Invalid content → {"status": "fail", "detail": "Invalid backup status file"}
  • File contains a single Unix timestamp (float, seconds since epoch) as plain text
  • If age > 48 h → {"status": "fail", "detail": "Last backup is <age> old (> 48 h)"}
  • Otherwise → {"status": "ok"}
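
The backup rules above might be implemented along these lines. The `environ` and `now` parameters are assumptions added for testability, the age is rendered in hours as one possible reading of `<age>`, and treating non-finite values such as `nan` as invalid content is an extra guard not stated in the plan.

```python
import math
import os
import time

MAX_BACKUP_AGE_S = 48 * 3600  # 48 hours, per the plan


def check_backup(environ=None, now=None):
    """Report whether the last recorded backup is recent enough."""
    environ = os.environ if environ is None else environ
    path = environ.get("BACKUP_STATUS_FILE")
    if not path:
        return {
            "status": "fail",
            "detail": "Backup monitoring not configured: BACKUP_STATUS_FILE is unset",
        }
    try:
        # errors="replace" turns undecodable bytes into invalid content
        # instead of raising UnicodeDecodeError during the read.
        with open(path, encoding="utf-8", errors="replace") as fh:
            raw = fh.read().strip()
    except FileNotFoundError:
        return {"status": "fail", "detail": f"Backup status file not found: {path}"}
    except OSError as exc:  # PermissionError and other IO errors
        return {"status": "fail", "detail": str(exc)}
    try:
        last_backup = float(raw)
    except ValueError:
        last_backup = math.nan
    if not math.isfinite(last_backup):
        return {"status": "fail", "detail": "Invalid backup status file"}
    age_s = (time.time() if now is None else now) - last_backup
    if age_s > MAX_BACKUP_AGE_S:
        detail = f"Last backup is {age_s / 3600:.1f} h old (> 48 h)"
        return {"status": "fail", "detail": detail}
    return {"status": "ok"}
```

Passing `now` explicitly lets the fresh and stale cases be tested with fixed timestamps instead of patching the clock.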

3. views.py: health_view(request)

from django.http import JsonResponse
from django.views.decorators.cache import never_cache
import os
from datetime import datetime, timezone
from .checks import check_db, check_cache, check_celery, check_backup

CRITICAL = {"db", "cache"}

@never_cache
def health_view(request):
    checks = {
        "db":     check_db(),
        "cache":  check_cache(),
        "celery": check_celery(),
        "backup": check_backup(),
    }

    if any(checks[k]["status"] == "fail" for k in CRITICAL):
        overall = "unhealthy"
    elif any(v["status"] == "fail" for v in checks.values()):
        overall = "degraded"
    else:
        overall = "ok"

    payload = {
        "status": overall,
        "version": {
            "git_sha": os.environ.get("GIT_SHA", "unknown"),
            "build":   os.environ.get("BUILD_ID", "unknown"),
        },
        "checks": checks,
        "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
    }
    return JsonResponse(payload, status=503 if overall == "unhealthy" else 200)

4. urls.py

from django.urls import path
from .views import health_view

urlpatterns = [path("", health_view, name="health")]

5. Wire into config/urls.py

Add before the Wagtail catch-all (path("", include(wagtail_urls))):

path("health/", include("apps.health.urls")),

6. Register app in config/settings/base.py

Add "apps.health" to INSTALLED_APPS.

7. Build metadata wiring — GIT_SHA and BUILD_ID

GIT_SHA and BUILD_ID are baked into the image at build time so the running container always knows what commit it is serving. Three files require changes:

Dockerfile — accept build args and promote to ENV (placed after COPY . /app to avoid busting the apt/pip cache):

ARG GIT_SHA=unknown
ARG BUILD_ID=unknown
ENV GIT_SHA=${GIT_SHA} \
    BUILD_ID=${BUILD_ID}

docker-compose.prod.yml — expand build: app to object form and forward the args:

build:
  context: app
  args:
    GIT_SHA: ${GIT_SHA:-unknown}
    BUILD_ID: ${BUILD_ID:-unknown}

deploy/deploy.sh — capture values from the freshly-pulled repo and export them before docker compose runs, so Docker Compose's variable interpolation picks them up:

GIT_SHA=$(git -C "${APP_DIR}" rev-parse --short HEAD)
BUILD_ID="build-$(date +%Y%m%d)-${GIT_SHA}"
export GIT_SHA BUILD_ID

The deploy script's post-startup health check should also be updated to probe /health/ instead of /, so that a deploy fails fast when the DB or cache is unreachable instead of only confirming that gunicorn responds:

if curl -fsS -H "Host: nohypeai.net" http://localhost:8001/health/ >/dev/null 2>&1; then

Result: every production image has the real SHA and build ID baked in (e.g. GIT_SHA=26ba3245, BUILD_ID=build-20260306-26ba3245). The unknown fallback only applies to CI test builds and local dev, where OpsLog's _has_valid_site_version() still accepts it (non-empty string).


Tests

test_checks.py

Unit-test each check in isolation using unittest.mock / pytest monkeypatch:

| Test | Scenario |
|---|---|
| test_db_ok | Cursor executes successfully |
| test_db_fail | cursor.execute raises OperationalError |
| test_cache_ok | Normal set/get/delete probe |
| test_cache_fail | cache.get returns wrong value |
| test_celery_no_broker | CELERY_BROKER_URL unset → ok + detail |
| test_celery_no_kombu | kombu not installed → ok + skip detail |
| test_celery_ok | Broker reachable → ok |
| test_celery_fail | Broker unreachable → fail |
| test_backup_no_env | BACKUP_STATUS_FILE unset → fail |
| test_backup_missing_file | File path set but file absent → fail |
| test_backup_fresh | Valid timestamp, age < 48 h → ok |
| test_backup_stale | Timestamp > 48 h → fail |
| test_backup_invalid | Unreadable/corrupt file → fail |

test_views.py

Integration tests using django.test.Client:

| Test | Mocked checks | Expected HTTP | Expected status |
|---|---|---|---|
| test_healthy | all ok | 200 | ok |
| test_degraded_celery | celery=fail | 200 | degraded |
| test_degraded_backup | backup=fail | 200 | degraded |
| test_unhealthy_db | db=fail | 503 | unhealthy |
| test_unhealthy_cache | cache=fail | 503 | unhealthy |
| test_response_shape | all ok | - | All required fields present |
| test_version_fields | - | - | git_sha and build present and non-empty |
| test_no_cache_headers | - | - | Cache-Control header includes no-cache |
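
The status rows of this matrix can also be exercised without Django if the aggregation is factored into a pure function. The sketch below assumes hypothetical helpers `aggregate` and `make_checks`; the plan itself keeps this logic inline in health_view and tests it through django.test.Client.

```python
CRITICAL = {"db", "cache"}
CHECK_NAMES = ("db", "cache", "celery", "backup")


def aggregate(checks):
    """Map per-check results to (overall status, HTTP code)."""
    failed = {name for name, result in checks.items() if result["status"] == "fail"}
    if failed & CRITICAL:
        return "unhealthy", 503
    if failed:
        return "degraded", 200
    return "ok", 200


def make_checks(**overrides):
    """Build an all-ok result set, then apply per-check status overrides."""
    checks = {name: {"status": "ok"} for name in CHECK_NAMES}
    for name, status in overrides.items():
        checks[name] = {"status": status}
    return checks


# One entry per status row of the test_views.py matrix.
CASES = [
    (make_checks(), ("ok", 200)),
    (make_checks(celery="fail"), ("degraded", 200)),
    (make_checks(backup="fail"), ("degraded", 200)),
    (make_checks(db="fail"), ("unhealthy", 503)),
    (make_checks(cache="fail"), ("unhealthy", 503)),
]

for checks, expected in CASES:
    assert aggregate(checks) == expected
```

A critical failure takes precedence over any non-critical one, so a result set with both db and celery failing still maps to unhealthy and 503.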

Acceptance Criteria

  • GET /health/ returns valid JSON matching the schema above
  • Returns 200 when status is ok or degraded
  • Returns 503 when status is unhealthy
  • db and cache failures produce unhealthy
  • celery and backup failures produce degraded
  • checks.celery returns ok-with-detail when CELERY_BROKER_URL is unset
  • checks.celery returns ok-with-detail when kombu is not installed (no ImportError)
  • checks.backup returns fail-with-detail when BACKUP_STATUS_FILE is unset
  • version.git_sha is the real short commit SHA in production (not unknown)
  • version.build is build-YYYYMMDD-<sha> in production (not unknown)
  • Dockerfile accepts GIT_SHA and BUILD_ID build args and bakes them into ENV
  • docker-compose.prod.yml forwards GIT_SHA/BUILD_ID from shell env as build args
  • deploy/deploy.sh captures and exports both vars before running docker compose
  • Response Cache-Control header includes no-cache (set by @never_cache, alongside its other directives)
  • All tests pass (pytest apps/health/)
  • Endpoint accessible without authentication
Author
Member

Resolution: point 2 — GIT_SHA / BUILD_ID wiring

Implemented build-time injection across three files:

Dockerfile: ARG+ENV placed after COPY . /app to avoid busting the apt/pip cache:

ARG GIT_SHA=unknown
ARG BUILD_ID=unknown
ENV GIT_SHA=${GIT_SHA} BUILD_ID=${BUILD_ID}

docker-compose.prod.yml — expanded build: app to object form:

build:
  context: app
  args:
    GIT_SHA: ${GIT_SHA:-unknown}
    BUILD_ID: ${BUILD_ID:-unknown}

deploy/deploy.sh — captures values from the freshly-pulled repo after git pull, then exports for docker compose:

GIT_SHA=$(git -C "${APP_DIR}" rev-parse --short HEAD)
BUILD_ID="build-$(date +%Y%m%d)-${GIT_SHA}"
export GIT_SHA BUILD_ID

Example: GIT_SHA=26ba3245, BUILD_ID=build-20260306-26ba3245

Fallback to unknown is preserved — CI test builds and local dev still work without the args, and unknown satisfies OpsLog's _has_valid_site_version() (non-empty string).

Owner

Review of updated spec — looks good

Verified against the repo. A few minor notes:

1. deploy/deploy.sh health check should use /health/ not /
The deploy script currently does curl ... http://localhost:8001/ to confirm the site is up. Once this endpoint exists, it should switch to curl ... http://localhost:8001/health/ — that way deploys fail-fast if the DB is unreachable, rather than just checking if gunicorn responds. Not blocking but worth including as a one-line tweak in scope.

2. time.monotonic() vs time.perf_counter()
The spec says time.monotonic() for latency measurement. The SUM Platform reference uses time.perf_counter() which has higher resolution. Either works fine for millisecond-level measurements — just noting the divergence. perf_counter is marginally better for sub-ms cache probes.

3. docker-compose.prod.yml build context
The spec says expand build: app to object form — correct. The current value build: app means the build context is the app/ subdirectory (on the production host, APP_DIR is checked out there). The expanded form needs context: app to preserve this. The spec has this right.

4. Everything else checks out:

  • apps/health/ as a standalone app — correct isolation
  • URL registration before Wagtail catch-all — correct
  • Lazy kombu import with ImportError guard — correct
  • Backup fail when unconfigured — matches SUM PR #1571 semantics
  • @never_cache on the view — correct for monitoring endpoints
  • Dict-based check returns (not dataclass) — simpler than SUM, works fine since there is no shared library dependency
  • Test matrix covers all the edge cases OpsLog validates against

Spec is implementation-ready.

mark closed this issue 2026-03-06 17:42:10 +00:00

Reference: nohype/main-site#55