
Overview

This guide covers monitoring and observing a running ISCL Core instance. ISCL produces structured JSON logs via pino (Fastify’s built-in logger), exposes a health endpoint for liveness probes, and writes an append-only audit trail in SQLite that doubles as a source of operational metrics. Together, these three signals give ops engineers full visibility into the transaction pipeline without requiring additional instrumentation.

Structured logging (pino)

Fastify uses pino for structured JSON logging. Every log line is a self-contained JSON object written to stdout. The logger is configured in packages/core/src/api/app.ts:
const app = Fastify({
  logger: options.logger !== false ? { level: "info" } : false,
});
When the server starts via packages/core/src/main.ts, logger: true is always passed, so production instances emit info-level logs by default.

Log format

Every log entry includes these standard pino fields:
Field | Type | Description
--- | --- | ---
level | number | Numeric log level (see table below)
time | number | Unix timestamp in milliseconds
pid | number | Process ID
hostname | string | Machine hostname
reqId | string | Unique per-request correlation ID
msg | string | Human-readable message
Fastify automatically logs every HTTP request and response, attaching the reqId to both entries for correlation.

Example log output

{
  "level": 30,
  "time": 1707580800000,
  "pid": 1234,
  "hostname": "iscl-core",
  "reqId": "req-1",
  "msg": "incoming request",
  "req": { "method": "POST", "url": "/v1/tx/build" }
}
{
  "level": 30,
  "time": 1707580800050,
  "pid": 1234,
  "hostname": "iscl-core",
  "reqId": "req-1",
  "msg": "request completed",
  "res": { "statusCode": 200 },
  "responseTime": 50
}
A fatal startup failure (e.g., port already in use) logs at level 60:
{
  "level": 60,
  "time": 1707580800000,
  "pid": 1234,
  "hostname": "iscl-core",
  "msg": "Failed to start ISCL Core",
  "err": { "message": "listen EADDRINUSE: address already in use 127.0.0.1:3100" }
}

Log levels

Level | Value | When Used
--- | --- | ---
fatal | 60 | Server cannot start (port conflict, missing config)
error | 50 | Unhandled errors, broadcast failures
warn | 40 | Policy denials, high risk scores
info | 30 | Request lifecycle, audit events (default level)
debug | 20 | Builder details, RPC calls, schema validation
trace | 10 | Full request/response bodies

Configuring log level

  • buildApp({ logger: true }) — info level (default in production)
  • buildApp({ logger: false }) — disabled (used in test suites)
  • Future: ISCL_LOG_LEVEL env var support is planned for v0.2+
To temporarily lower the level for debugging in a non-production environment, modify the Fastify constructor call in app.ts:
logger: { level: process.env.ISCL_LOG_LEVEL ?? "info" }
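Shown in context, a minimal sketch of that constructor with the override applied (the surrounding options in app.ts may differ; treat this as illustrative only):
import Fastify from "fastify";

// Hypothetical sketch: honor ISCL_LOG_LEVEL when set, otherwise keep the
// documented "info" default.
const app = Fastify({
  logger: { level: process.env.ISCL_LOG_LEVEL ?? "info" },
});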

Request correlation

Every inbound HTTP request receives a unique reqId (e.g., req-1, req-2). Use this value to correlate the request log with its response log, any intermediate error logs, and downstream RPC calls.
When troubleshooting a failed transaction, search your log aggregator for the reqId to see the full request lifecycle.
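As a sketch of that workflow against a local ndjson capture (the log file name here is hypothetical), a small Node script can replay a single request's lifecycle:
import { createReadStream } from "node:fs";
import { createInterface } from "node:readline";

// Print every log entry that carries the given reqId, in order.
async function traceRequest(logPath: string, reqId: string): Promise<void> {
  const lines = createInterface({ input: createReadStream(logPath) });
  for await (const line of lines) {
    try {
      const entry = JSON.parse(line);
      if (entry.reqId === reqId) console.log(entry);
    } catch {
      // Ignore lines that are not valid JSON (e.g., partial writes).
    }
  }
}

// Example: reconstruct the lifecycle of the request shown above.
traceRequest("iscl-core.log", "req-1").catch(console.error);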

Health monitoring

Health endpoint

GET /v1/health returns the current server status:
{
  "status": "ok",
  "version": "0.1.0",
  "uptime": 3600.123
}
Field | Type | Description
--- | --- | ---
status | string | Always "ok" when the server is responsive
version | string | ISCL Core version
uptime | number | Seconds since process start (process.uptime())
The response schema enforces additionalProperties: false, so these are the only fields returned. A 200 response with status: "ok" confirms the Fastify server and its registered routes are operational. Additionally, every response includes an X-ISCL-Version header (currently 0.1.0) that external monitors can check without parsing the body.
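A minimal probe sketch, assuming Node 18+ (global fetch) and the default bind address described under Environment variables below:
// Returns true only when /v1/health answers 200 with status "ok" within 5 seconds.
async function checkHealth(baseUrl = "http://127.0.0.1:3100"): Promise<boolean> {
  try {
    const res = await fetch(`${baseUrl}/v1/health`, {
      signal: AbortSignal.timeout(5000),
    });
    if (!res.ok) return false;
    // The version is also exposed via the X-ISCL-Version header,
    // so it can be checked without parsing the body.
    console.log("X-ISCL-Version:", res.headers.get("x-iscl-version"));
    const body = (await res.json()) as { status: string; uptime: number };
    return body.status === "ok";
  } catch {
    return false; // connection error or 5-second timeout
  }
}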

Monitoring pattern

  • Poll GET /v1/health every 30 seconds from your monitoring system.
  • Alert if: response takes longer than 5 seconds, status is not "ok", or 3+ consecutive failures occur.
For Docker deployments, add a container-level healthcheck:
services:
  iscl-core:
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:3100/v1/health"]
      interval: 30s
      timeout: 5s
      retries: 3
      start_period: 10s

Environment variables

The server binds to a configurable host and port (from packages/core/src/main.ts):
Variable | Default | Description
--- | --- | ---
ISCL_PORT | 3100 | HTTP listen port
ISCL_HOST | 127.0.0.1 | Bind address
Ensure your health checks target the correct host and port.
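A hypothetical startup sketch mirroring this flow (the actual main.ts may differ; the catch branch corresponds to the level-60 fatal log shown earlier):
import Fastify from "fastify";

// Resolve the bind address from ISCL_PORT / ISCL_HOST, using the documented defaults.
const port = Number(process.env.ISCL_PORT ?? 3100);
const host = process.env.ISCL_HOST ?? "127.0.0.1";

const app = Fastify({ logger: { level: "info" } });

app.listen({ port, host }).catch((err) => {
  app.log.fatal({ err }, "Failed to start ISCL Core");
  process.exit(1);
});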

Audit events as observability

The SQLite audit trail (@clavion/audit) records every significant step in the transaction pipeline. These 14 event types serve double duty as observability signals.

Event catalog

Event | Source | Signal
--- | --- | ---
policy_evaluated | /v1/tx/build | Policy decision (allow/deny/require_approval)
tx_built | /v1/tx/build | Successful transaction build
preflight_completed | /v1/tx/preflight | Simulation result + risk score
approve_request_created | /v1/tx/approve-request | Approval flow initiated
approval_granted | ApprovalService | User approved transaction
approval_rejected | ApprovalService | User declined transaction
web_approval_decided | Approval UI routes | Web UI approval decision
signature_created | WalletService | Key signed transaction
signing_denied | WalletService | Signing blocked (policy/token)
tx_broadcast | /v1/tx/sign-and-send | Successful RPC broadcast
broadcast_failed | /v1/tx/sign-and-send | Broadcast error
skill_registered | /v1/skills/register | New skill registered
skill_registration_failed | /v1/skills/register | Skill registration rejected
skill_revoked | DELETE /v1/skills/:name | Skill revoked
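To see where a recent transaction sits in the pipeline, the audit trail can be queried directly. A sketch using better-sqlite3 (the database path is an assumption; the table and column names follow the SQL examples under Deriving metrics from SQLite):
import Database from "better-sqlite3";

// Tail the 20 most recent audit events, newest first.
const db = new Database("audit.db", { readonly: true });
const rows = db
  .prepare(
    "SELECT timestamp, event, intent_id FROM audit_events ORDER BY timestamp DESC LIMIT 20"
  )
  .all();
console.table(rows);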

Alert-worthy conditions

Set up alerts for these conditions in your monitoring system.
Event | Condition | Meaning
--- | --- | ---
policy_evaluated | decision: "deny" rate spikes | Possible misconfiguration or attack
broadcast_failed | Any occurrence | RPC node issue or gas problem
signing_denied | Any occurrence | Unauthorized signing attempt
approval_rejected | High rate | Users rejecting agent-proposed txs
sandbox_error | Any occurrence | Sandbox execution failure
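A sketch of an hourly alert check against the audit database, again assuming better-sqlite3 and the table layout used in the SQL examples below (the console.error call stands in for your alerting hook):
import Database from "better-sqlite3";

const db = new Database("audit.db", { readonly: true });
const oneHourAgo = Date.now() - 3_600_000;

// Count alert-worthy events from the last hour; any occurrence warrants investigation.
const row = db
  .prepare(
    `SELECT
       COALESCE(SUM(CASE WHEN event = 'broadcast_failed' THEN 1 ELSE 0 END), 0) AS broadcast_failures,
       COALESCE(SUM(CASE WHEN event = 'signing_denied' THEN 1 ELSE 0 END), 0) AS signing_denials
     FROM audit_events
     WHERE timestamp > ?`
  )
  .get(oneHourAgo) as { broadcast_failures: number; signing_denials: number };

if (row.broadcast_failures > 0 || row.signing_denials > 0) {
  console.error("ALERT: broadcast failures or signing denials in the last hour", row);
}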

Key metrics

Derive these operational metrics from audit events and HTTP responses:
Metric | Derivation | Alert Threshold
--- | --- | ---
Throughput | Count of tx_broadcast events per hour | Baseline-dependent
Error rate | broadcast_failed / (tx_broadcast + broadcast_failed) | > 10%
Denial rate | policy_evaluated with deny / total evaluations | Spike detection
Approval latency | Time between approve_request_created and approval_granted or approval_rejected | > 300s (TTL expiry)
RPC health | HTTP 502 responses from /v1/balance or /v1/tx/preflight | Any occurrence
Rate limit hits | Rate limit denials per wallet per hour | Policy-dependent
Signing denials | Count of signing_denied events | Any occurrence

Deriving metrics from SQLite

Broadcast error rate over the last hour:
SELECT
  ROUND(
    100.0 * SUM(CASE WHEN event = 'broadcast_failed' THEN 1 ELSE 0 END)
    / COUNT(*),
    2
  ) AS error_rate_pct
FROM audit_events
WHERE event IN ('tx_broadcast', 'broadcast_failed')
  AND timestamp > (strftime('%s', 'now') * 1000 - 3600000);
Approval latency (average seconds):
SELECT AVG(g.timestamp - r.timestamp) / 1000.0 AS avg_approval_seconds
FROM audit_events r
JOIN audit_events g ON r.intent_id = g.intent_id
WHERE r.event = 'approve_request_created'
  AND g.event IN ('approval_granted', 'approval_rejected');

Log forwarding

ISCL produces newline-delimited JSON (ndjson) on stdout, a format that most log aggregation systems can ingest without custom parsing.
Use Filebeat to ship container logs to Elasticsearch:
# filebeat.yml
filebeat.inputs:
  - type: container
    paths:
      - /var/lib/docker/containers/*/*.log
    processors:
      - decode_json_fields:
          fields: ["message"]
          target: "iscl"
output.elasticsearch:
  hosts: ["http://elasticsearch:9200"]
  index: "iscl-core-%{+yyyy.MM.dd}"

Future: Prometheus and OpenTelemetry

Planned for v0.2+. These features are not yet available.
  • Prometheus /metrics endpoint — histograms for request latency by route, counters for transaction types (transfer, swap, approve), gauges for pending approvals.
  • OpenTelemetry trace spans — spans across the full tx pipeline (build, preflight, approve, sign, broadcast) with intentId as the correlation ID.
  • Distributed tracing — propagate trace context from adapter (Domain A) through ISCL Core (Domain B) to RPC nodes, enabling end-to-end latency analysis.
  • Alertmanager integration — fire alerts based on Prometheus rules for broadcast failures, signing denials, and approval TTL expiry.

Next steps