---
visibility: public
---

# Spec 008 — Observability

**Status:** Draft
**Related:** [PRD FR-22..FR-25](../prd/001-picortex-v1.md#observability), Jacob's global [dev-patterns](https://github.com/tmad4000/jacob-computer-config-private)

## Goal

Make everything observable without standing up a paid SaaS. Logs should be useful both for the user ("what happened in chat X last Tuesday?") and for the developer ("why did the discriminator skip this message?").

## Structured logs

- **Logger:** `pino` with `pino-pretty` in dev, JSON in prod.
- **Level:** default `info`; `debug` via `LOG_LEVEL=debug`.
- **Location:** stdout, captured by journald under systemd (no Docker); no separate log files.
- **Fields every log carries:**
  - `time` (ISO)
  - `level` (`debug`/`info`/`warn`/`error`)
  - `request_id` (see below)
  - `chat_id` (if in chat context)
  - `event_type` (e.g. `linq.inbound`, `tmux.turn.start`, `discriminator.decision`)
  - `msg` (free text)
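
A minimal sketch of the record shape above, written as a type plus a runtime check that the log-shape unit test could reuse. `LogRecord`, `isLogRecord`, and `parseLogLine` are hypothetical names, not part of pino. Note that pino emits `time` as epoch milliseconds and `level` as a number by default, so ISO timestamps and string labels need explicit configuration (`timestamp: pino.stdTimeFunctions.isoTime` and a `formatters.level` function).

```ts
// Hypothetical shape check for one JSON log line; field names come from the list above.
// "debug" is included because LOG_LEVEL=debug is supported.
const LEVELS = ["debug", "info", "warn", "error"] as const;
type Level = (typeof LEVELS)[number];

export interface LogRecord {
  time: string;        // ISO 8601 timestamp
  level: Level;
  request_id?: string; // assumed absent outside a request context
  chat_id?: string;    // only set in chat context
  event_type: string;  // e.g. "linq.inbound"
  msg: string;
}

export function isLogRecord(value: unknown): value is LogRecord {
  if (typeof value !== "object" || value === null) return false;
  const r = value as Record<string, unknown>;
  const optionalString = (v: unknown) => v === undefined || typeof v === "string";
  return (
    typeof r.time === "string" && !Number.isNaN(Date.parse(r.time)) &&
    typeof r.level === "string" && (LEVELS as readonly string[]).includes(r.level) &&
    optionalString(r.request_id) &&
    optionalString(r.chat_id) &&
    typeof r.event_type === "string" &&
    typeof r.msg === "string"
  );
}

// Returns the parsed record, or null if the line is not valid JSON or has the wrong shape.
export function parseLogLine(line: string): LogRecord | null {
  try {
    const parsed: unknown = JSON.parse(line);
    return isLogRecord(parsed) ? parsed : null;
  } catch {
    return null;
  }
}
```

A line in pino's default format (numeric `level`, epoch `time`) fails this check, which makes a missing logger option show up in tests rather than in production logs.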

## Request IDs

- Fastify middleware generates a request ID (UUID v7) for every inbound HTTP request.
- The response echoes it in the `X-Request-ID` header.
- Every log line written in that request's async context includes it as `request_id`.
- Spawned child processes inherit it through the `PICORTEX_REQUEST_ID` environment variable.
- Linq inbound events store it in their `events` SQLite row.
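
The propagation rules above can be sketched with Node's built-in `AsyncLocalStorage`. This is an illustration under assumptions, not the project's implementation: `uuidv7`, `withRequestId`, `currentRequestId`, and `childEnv` are hypothetical helpers. Node's `crypto.randomUUID()` produces version 4 IDs, so the version 7 layout (48-bit millisecond timestamp, then version and variant bits, per RFC 9562) is assembled by hand here.

```ts
import { randomBytes } from "node:crypto";
import { AsyncLocalStorage } from "node:async_hooks";

// Build a UUID v7: 48-bit big-endian Unix ms timestamp, version 7, RFC variant, random rest.
export function uuidv7(nowMs: number = Date.now()): string {
  const bytes = randomBytes(16);
  bytes.writeUIntBE(nowMs, 0, 6);                            // bytes 0..5: timestamp
  bytes.writeUInt8((bytes.readUInt8(6) & 0x0f) | 0x70, 6);   // high nibble = version 7
  bytes.writeUInt8((bytes.readUInt8(8) & 0x3f) | 0x80, 8);   // top two bits = variant 10
  const hex = bytes.toString("hex");
  return [
    hex.slice(0, 8), hex.slice(8, 12), hex.slice(12, 16), hex.slice(16, 20), hex.slice(20),
  ].join("-");
}

const requestContext = new AsyncLocalStorage<{ requestId: string }>();

// Run fn with the request ID visible to everything it calls, including async continuations.
export function withRequestId<T>(requestId: string, fn: () => T): T {
  return requestContext.run({ requestId }, fn);
}

export function currentRequestId(): string | undefined {
  return requestContext.getStore()?.requestId;
}

// Environment for a spawned child process, carrying the ID as PICORTEX_REQUEST_ID.
export function childEnv(
  base: Record<string, string | undefined> = process.env,
): Record<string, string | undefined> {
  const id = currentRequestId();
  return id === undefined ? { ...base } : { ...base, PICORTEX_REQUEST_ID: id };
}
```

In a request handler this would look like `withRequestId(uuidv7(), () => handle(req))`, with `childEnv()` passed as the `env` option when spawning. Because the timestamp leads the ID, IDs sort roughly by creation time, which helps when scanning the `events` table.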

## `/api/frontend-log`

This endpoint follows Jacob's global rules. Client-side:

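```ts
// Report uncaught browser errors to the server log.
window.addEventListener('error', ev => {
  fetch('/api/frontend-log', {
    method: 'POST',
    // Without this header a string body is sent as text/plain, not JSON.
    headers: { 'Content-Type': 'application/json' },
    keepalive: true, // let the request finish even if the page is unloading
    body: JSON.stringify({
      level: 'error',
      message: ev.message,
      error: ev.error?.toString(),
      stack: ev.error?.stack,
      context: { url: location.href, ua: navigator.userAgent, build: __VERSION__ }
    })
  }).catch(() => {}) // the error reporter must never throw itself
})
```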

Server-side endpoint:

- Accepts request bodies up to `FRONTEND_LOG_MAX_BYTES` (default 64 KB)
- Rate-limited to 30 requests per minute per IP
- Logs under `event_type: "frontend"` with the browser-supplied fields plus the request ID tying it to the current user session
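
The two limits above could be checked with something like the following. It is a sketch only: `FixedWindowLimiter` and `withinSizeLimit` are hypothetical names, a fixed one-minute window is just one way to read "30/min", and the in-memory map assumes a single server process.

```ts
// Fixed-window rate limiter keyed by client IP (assumed strategy, kept in memory).
export class FixedWindowLimiter {
  private readonly windows = new Map<string, { start: number; count: number }>();
  private readonly limit: number;
  private readonly windowMs: number;

  constructor(limit = 30, windowMs = 60_000) {
    this.limit = limit;
    this.windowMs = windowMs;
  }

  // Returns true if this request is allowed, false if the key is over its limit.
  allow(key: string, nowMs: number = Date.now()): boolean {
    const current = this.windows.get(key);
    if (current === undefined || nowMs - current.start >= this.windowMs) {
      this.windows.set(key, { start: nowMs, count: 1 }); // start a fresh window
      return true;
    }
    if (current.count >= this.limit) return false;
    current.count += 1;
    return true;
  }
}

// Size check in bytes, not characters: multi-byte UTF-8 text counts at its encoded length.
export function withinSizeLimit(body: string, maxBytes = 64 * 1024): boolean {
  return new TextEncoder().encode(body).length <= maxBytes;
}
```

A handler would call `withinSizeLimit(rawBody)` and `limiter.allow(clientIp)` before logging, and respond with 413 or 429 respectively when a check fails (status codes are a suggestion, not something this spec fixes).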

## Metrics

No Prometheus in v1. Instead, lightweight counters live in a SQLite `metrics` table and are exposed by `/health`:

```
chats_total
chats_active_7d
turns_total
turns_last_24h
discriminator_skipped_24h
errors_last_24h
```

`/health` returns:
```json
{
  "status": "ok",
  "version": "0.0.1",
  "commit": "abcd123",
  "uptime_seconds": 3412,
  "db_ok": true,
  "tmux_ok": true,
  "metrics": { ... }
}
```
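
One way to assemble that payload, as a hedged sketch: the counter names mirror the list above, while `HealthInputs`, `buildHealth`, and the `"degraded"` status value are assumptions (the spec only shows `"ok"`).

```ts
// Counter names mirror the metrics list; the rest of this block is illustrative.
export interface Metrics {
  chats_total: number;
  chats_active_7d: number;
  turns_total: number;
  turns_last_24h: number;
  discriminator_skipped_24h: number;
  errors_last_24h: number;
}

export interface HealthInputs {
  version: string;
  commit: string;
  startedAtMs: number; // process start time, Unix ms
  dbOk: boolean;
  tmuxOk: boolean;
  metrics: Metrics;
}

export function buildHealth(input: HealthInputs, nowMs: number = Date.now()) {
  return {
    // "degraded" is an assumed value for the case where a dependency check fails.
    status: input.dbOk && input.tmuxOk ? "ok" : "degraded",
    version: input.version,
    commit: input.commit,
    uptime_seconds: Math.floor((nowMs - input.startedAtMs) / 1000),
    db_ok: input.dbOk,
    tmux_ok: input.tmuxOk,
    metrics: input.metrics,
  };
}
```

Keeping the builder free of I/O means the SQLite read and the tmux probe happen in the route handler, and the payload shape can be unit-tested without either.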

## Network egress allowlist

Processes running as a Claude Code chat user should only be able to reach:
- `api.anthropic.com`
- `registry.npmjs.org` (for tooling, if used by Claude)
- `pypi.org` (if Python is used)
- `github.com`, `raw.githubusercontent.com`
- Anything the user explicitly allowlists in `/etc/picortex/egress-allowlist.txt`

Enforced via the iptables `owner` match on the chat user's UID. Rejected connections log an event, and Jacob gets an alert when a new host is attempted (learning mode).
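
As a rough illustration of how the allowlist could turn into rules, the sketch below parses the allowlist file and emits argument lists for `iptables`. Several things are assumptions: the file format (one host per line, `#` comments), the helper names, and the log prefix. iptables resolves hostnames to IP addresses at the moment a rule is added, so rules for hosts with changing addresses would need periodic refresh, and a real ruleset would also have to permit DNS and loopback traffic for the chat user.

```ts
// Default hosts come from the list above.
export const DEFAULT_HOSTS = [
  "api.anthropic.com",
  "registry.npmjs.org",
  "pypi.org",
  "github.com",
  "raw.githubusercontent.com",
];

// Assumed file format: one host per line, "#" starts a comment, blank lines ignored.
export function parseAllowlist(fileText: string): string[] {
  return fileText
    .split("\n")
    .map((line) => line.replace(/#.*$/, "").trim())
    .filter((line) => line.length > 0);
}

// Argument lists for `iptables`, one per rule, in evaluation order:
// accept each allowed host, then log and reject everything else from this UID.
export function egressRules(uid: number, extraHosts: string[]): string[][] {
  const hosts = Array.from(new Set([...DEFAULT_HOSTS, ...extraHosts]));
  const owner = ["-A", "OUTPUT", "-m", "owner", "--uid-owner", String(uid)];
  return [
    ...hosts.map((host) => [...owner, "-d", host, "-j", "ACCEPT"]),
    [...owner, "-j", "LOG", "--log-prefix", "picortex-egress-reject "],
    [...owner, "-j", "REJECT"],
  ];
}
```

Returning argument arrays rather than shell strings lets the caller pass them to `execFile("iptables", args)` without any quoting concerns. The `LOG` rule placed before `REJECT` is what would feed the rejected-connection event.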

## Sentry (optional, post-v0.1)

If Jacob wants error aggregation, add `@sentry/node` and `@sentry/browser`. Keep it off by default.

## Testing

- **Unit:** request-ID middleware; log shape sanity.
- **Integration:** frontend-log roundtrip.
- **Manual:** tail logs during E2E; verify every turn has a request ID.

## Open questions

- OQ1: Where are logs archived long-term? (Not in v1 — stdout + journald is fine.)
- OQ2: Do we want Axiom or Loki integration? (Not for v1. Cortex uses Axiom.)