feat(health): add /health/ endpoint for OpsLog monitoring #56

Merged
mark merged 2 commits from feature/health-endpoint-opslog into main 2026-03-06 17:42:10 +00:00
Owner

Closes #55

Summary

  • add a dedicated apps.health app with DB, cache, celery, and backup checks plus a machine-readable /health/ endpoint
  • wire the endpoint into Django settings/URLs and cover it with unit and integration tests
  • bake GIT_SHA and BUILD_ID into production builds and update deploy health checks to probe /health/

Testing

  • pytest apps/health/ -q --tb=short -o addopts=''
  • ruff check .
  • mypy apps config
  • pytest --ignore=e2e
  • Tailwind build reproducibility check
Closes #55 ## Summary - add a dedicated `apps.health` app with DB, cache, celery, and backup checks plus a machine-readable `/health/` endpoint - wire the endpoint into Django settings/URLs and cover it with unit and integration tests - bake `GIT_SHA` and `BUILD_ID` into production builds and update deploy health checks to probe `/health/` ## Testing - `pytest apps/health/ -q --tb=short -o addopts=''` - `ruff check .` - `mypy apps config` - `pytest --ignore=e2e` - Tailwind build reproducibility check
codex_b added 1 commit 2026-03-06 15:46:37 +00:00
feat: add health monitoring endpoint
All checks were successful
CI / nightly-e2e (pull_request) Has been skipped
CI / deploy (pull_request) Has been skipped
CI / pr-e2e (pull_request) Successful in 1m38s
CI / ci (pull_request) Successful in 1m46s
10e39b8331
mark requested changes 2026-03-06 15:59:09 +00:00
Dismissed
mark left a comment
Owner

QA Review — PR #56 (health endpoint)

Tests pass (21/21 health, 180/180 full suite), ruff clean, no regressions. Code is well-structured overall. Two issues need fixing before merge.


🔴 Must Fix

1. check_backup() permanently degrades every environment

checks.py:60-61 — When BACKUP_STATUS_FILE is unset, check_backup() returns {"status": "fail", ...}. Since backup is not in CRITICAL_CHECKS, this causes the health endpoint to return "degraded" (HTTP 200) in every environment where that env var is absent — which is currently all of them (not set in .env, docker-compose.prod.yml, or entrypoint.prod.sh).

This is inconsistent with how check_celery() handles the unconfigured case (returns "ok" with a detail note at line 43). For an OpsLog monitoring endpoint, permanent "degraded" defeats the purpose — OpsLog will never see a clean "ok" status, causing alert fatigue and masking real degradations.

Fix: Either:

  • (a) Return {"status": "ok", "detail": "Backup monitoring not configured: BACKUP_STATUS_FILE is unset"} matching the celery pattern, or
  • (b) Add BACKUP_STATUS_FILE to the production environment and treat missing config as a genuine failure — but then this should be documented.

Option (a) is strongly preferred for consistency.

2. test_version_fields is a no-op assertion

test_views.py:86-94 — The test sets GIT_SHA and BUILD_ID via monkeypatch, but then only asserts the values are truthy:

assert payload["version"]["git_sha"]
assert payload["version"]["build"]

Since the view defaults to "unknown" when the env vars are unset, and "unknown" is truthy, this test passes even without the monkeypatch. It does not verify the env vars are actually read.

Fix: Assert the exact values:

assert payload["version"]["git_sha"] == "59cc1c4"
assert payload["version"]["build"] == "build-20260306-59cc1c4"

🟡 Minor (non-blocking)

  • No HTTP method restriction on health_view — The endpoint accepts POST, PUT, DELETE, etc. Consider adding @require_GET or require_http_methods(["GET", "HEAD"]).
  • @pytest.mark.django_db on test_db_ok/test_db_fail — These tests monkeypatch the cursor and never touch the real DB; the marker adds unnecessary transaction overhead.
  • check_cache() skips cleanup on mismatch — When cached_value != probe_value (line 33), the function returns without cache.delete(cache_key). The 5s TTL makes this harmless, but a finally block would be tidier.
## QA Review — PR #56 (health endpoint) Tests pass (21/21 health, 180/180 full suite), ruff clean, no regressions. Code is well-structured overall. Two issues need fixing before merge. --- ### 🔴 Must Fix **1. `check_backup()` permanently degrades every environment** `checks.py:60-61` — When `BACKUP_STATUS_FILE` is unset, `check_backup()` returns `{"status": "fail", ...}`. Since backup is not in `CRITICAL_CHECKS`, this causes the health endpoint to return `"degraded"` (HTTP 200) in *every* environment where that env var is absent — which is currently *all* of them (not set in `.env`, `docker-compose.prod.yml`, or `entrypoint.prod.sh`). This is inconsistent with how `check_celery()` handles the unconfigured case (returns `"ok"` with a detail note at line 43). For an OpsLog monitoring endpoint, permanent `"degraded"` defeats the purpose — OpsLog will never see a clean `"ok"` status, causing alert fatigue and masking real degradations. **Fix:** Either: - (a) Return `{"status": "ok", "detail": "Backup monitoring not configured: BACKUP_STATUS_FILE is unset"}` matching the celery pattern, **or** - (b) Add `BACKUP_STATUS_FILE` to the production environment and treat missing config as a genuine failure — but then this should be documented. Option (a) is strongly preferred for consistency. **2. `test_version_fields` is a no-op assertion** `test_views.py:86-94` — The test sets `GIT_SHA` and `BUILD_ID` via monkeypatch, but then only asserts the values are truthy: ```python assert payload["version"]["git_sha"] assert payload["version"]["build"] ``` Since the view defaults to `"unknown"` when the env vars are unset, and `"unknown"` is truthy, this test passes even without the monkeypatch. It does not verify the env vars are actually read. **Fix:** Assert the exact values: ```python assert payload["version"]["git_sha"] == "59cc1c4" assert payload["version"]["build"] == "build-20260306-59cc1c4" ``` --- ### 🟡 Minor (non-blocking) - **No HTTP method restriction on `health_view`** — The endpoint accepts POST, PUT, DELETE, etc. Consider adding `@require_GET` or `require_http_methods(["GET", "HEAD"])`. - **`@pytest.mark.django_db` on `test_db_ok`/`test_db_fail`** — These tests monkeypatch the cursor and never touch the real DB; the marker adds unnecessary transaction overhead. - **`check_cache()` skips cleanup on mismatch** — When `cached_value != probe_value` (line 33), the function returns without `cache.delete(cache_key)`. The 5s TTL makes this harmless, but a `finally` block would be tidier.
Owner

Amendment to review item #1:

On reflection, the check_backup() behaviour is correct — backups are not optional in production the way Celery is, so returning "degraded" when BACKUP_STATUS_FILE is unset is the right signal. The fix should be on the deployment side: configure BACKUP_STATUS_FILE in the production .env (or docker-compose.prod.yml environment block) pointing to an actual backup status file.

The PR should add this env var to the production config so the check passes on deploy. The celery comparison was a false equivalence on my part.

Review item #2 (test_version_fields no-op assertion) still stands.

**Amendment to review item #1:** On reflection, the `check_backup()` behaviour is correct — backups are not optional in production the way Celery is, so returning `"degraded"` when `BACKUP_STATUS_FILE` is unset is the right signal. The fix should be on the deployment side: configure `BACKUP_STATUS_FILE` in the production `.env` (or `docker-compose.prod.yml` environment block) pointing to an actual backup status file. The PR should add this env var to the production config so the check passes on deploy. The celery comparison was a false equivalence on my part. Review item #2 (`test_version_fields` no-op assertion) still stands.
mark dismissed mark's review 2026-03-06 16:02:24 +00:00
Reason:

Superseded — resubmitting with corrected feedback.

mark requested changes 2026-03-06 16:02:40 +00:00
Dismissed
mark left a comment
Owner

QA Review — PR #56 (health endpoint)

Tests pass (21/21 health, 180/180 full suite), ruff clean, no regressions. Code is well-structured overall. Two issues need fixing before merge.


🔴 Must Fix

1. BACKUP_STATUS_FILE not configured in production

check_backup() correctly returns "fail" when BACKUP_STATUS_FILE is unset — backups are not optional and a missing config should surface as degraded. However, this env var is not defined anywhere in the deployment config (.env, docker-compose.prod.yml, or entrypoint.prod.sh), so the health endpoint will report "degraded" immediately on deploy.

Fix: Add BACKUP_STATUS_FILE to the production environment (e.g. in docker-compose.prod.yml or the .env template) pointing to the actual backup status file path.

2. test_version_fields is a no-op assertion

test_views.py:86-94 — The test sets GIT_SHA and BUILD_ID via monkeypatch, but then only asserts the values are truthy:

assert payload["version"]["git_sha"]
assert payload["version"]["build"]

Since the view defaults to "unknown" when the env vars are unset, and "unknown" is truthy, this test passes even without the monkeypatch. It does not verify the env vars are actually read.

Fix: Assert the exact values:

assert payload["version"]["git_sha"] == "59cc1c4"
assert payload["version"]["build"] == "build-20260306-59cc1c4"

🟡 Minor (non-blocking)

  • No HTTP method restriction on health_view — Consider adding @require_GET or require_http_methods(["GET", "HEAD"]).
  • @pytest.mark.django_db on test_db_ok/test_db_fail — These tests monkeypatch the cursor and never touch the real DB; the marker adds unnecessary transaction overhead.
  • check_cache() skips cleanup on mismatch — When cached_value != probe_value (line 33), the function returns without cache.delete(cache_key). The 5s TTL makes this harmless, but a finally block would be tidier.
## QA Review — PR #56 (health endpoint) Tests pass (21/21 health, 180/180 full suite), ruff clean, no regressions. Code is well-structured overall. Two issues need fixing before merge. --- ### 🔴 Must Fix **1. `BACKUP_STATUS_FILE` not configured in production** `check_backup()` correctly returns `"fail"` when `BACKUP_STATUS_FILE` is unset — backups are not optional and a missing config should surface as degraded. However, this env var is not defined anywhere in the deployment config (`.env`, `docker-compose.prod.yml`, or `entrypoint.prod.sh`), so the health endpoint will report `"degraded"` immediately on deploy. **Fix:** Add `BACKUP_STATUS_FILE` to the production environment (e.g. in `docker-compose.prod.yml` or the `.env` template) pointing to the actual backup status file path. **2. `test_version_fields` is a no-op assertion** `test_views.py:86-94` — The test sets `GIT_SHA` and `BUILD_ID` via monkeypatch, but then only asserts the values are truthy: ```python assert payload["version"]["git_sha"] assert payload["version"]["build"] ``` Since the view defaults to `"unknown"` when the env vars are unset, and `"unknown"` is truthy, this test passes even without the monkeypatch. It does not verify the env vars are actually read. **Fix:** Assert the exact values: ```python assert payload["version"]["git_sha"] == "59cc1c4" assert payload["version"]["build"] == "build-20260306-59cc1c4" ``` --- ### 🟡 Minor (non-blocking) - **No HTTP method restriction on `health_view`** — Consider adding `@require_GET` or `require_http_methods(["GET", "HEAD"])`. - **`@pytest.mark.django_db` on `test_db_ok`/`test_db_fail`** — These tests monkeypatch the cursor and never touch the real DB; the marker adds unnecessary transaction overhead. - **`check_cache()` skips cleanup on mismatch** — When `cached_value != probe_value` (line 33), the function returns without `cache.delete(cache_key)`. The 5s TTL makes this harmless, but a `finally` block would be tidier.
codex_b added 1 commit 2026-03-06 16:09:00 +00:00
fix: address health endpoint review feedback
All checks were successful
CI / nightly-e2e (pull_request) Has been skipped
CI / deploy (pull_request) Has been skipped
CI / ci (pull_request) Successful in 1m33s
CI / pr-e2e (pull_request) Successful in 1m42s
a450e7409f
mark approved these changes 2026-03-06 16:24:36 +00:00
mark left a comment
Owner

Both review items addressed. BACKUP_STATUS_FILE wired into prod compose with a read-only host mount, version test now asserts exact values. 180 tests pass, ruff clean. LGTM.

Both review items addressed. BACKUP_STATUS_FILE wired into prod compose with a read-only host mount, version test now asserts exact values. 180 tests pass, ruff clean. LGTM.
mark merged commit 15ef35e249 into main 2026-03-06 17:42:10 +00:00
mark deleted branch feature/health-endpoint-opslog 2026-03-06 17:42:11 +00:00
Sign in to join this conversation.
No Reviewers
No Label
2 Participants
Notifications
Due Date
No due date set.
Dependencies

No dependencies set.

Reference: nohype/main-site#56