Sortie exposes a `/metrics` endpoint in Prometheus text exposition format on the same port as the JSON API and HTML dashboard. It is available whenever the HTTP server is enabled via `--port` or `server.port`.
## Gauges
Point-in-time values. Sortie updates these after every state mutation — dispatch, worker exit, retry, reconciliation.
| Name | Labels | Description | Producing layer |
|---|---|---|---|
| `sortie_sessions_running` | — | Currently running agent sessions. | Coordination |
| `sortie_sessions_retrying` | — | Issues awaiting retry. Includes error retries, continuation retries, and stall retries sitting in the timer queue. | Coordination |
| `sortie_slots_available` | — | Remaining dispatch slots: `max_concurrent_agents - running - claimed`. Reaches 0 when the orchestrator is at capacity. | Coordination |
| `sortie_active_sessions_elapsed_seconds` | — | Sum of wall-clock elapsed seconds across all running sessions. Recomputed from each session's `started_at` timestamp on every poll cycle. Use this to detect active work even when no sessions have recently completed (the runtime counter only increments on session end). | Coordination |
| `sortie_ssh_host_usage` | `host` | Active workers on a given SSH host. Only populated when `extensions.worker.ssh_hosts` is configured. | Coordination |
The `host` label on `sortie_ssh_host_usage` matches the values in your `ssh_hosts` list exactly (e.g., `host="build01.internal"`).
## Counters
Monotonically increasing values. Apply `rate()` or `increase()` to extract per-second or per-interval throughput.
| Name | Labels | Description | Producing layer |
|---|---|---|---|
| `sortie_tokens_total` | `type` | Cumulative LLM tokens consumed. `type` is `input`, `output`, or `cache_read`. | Coordination |
| `sortie_agent_runtime_seconds_total` | — | Cumulative agent runtime. Incremented when a session ends, not while it runs. For live elapsed time, use the `sortie_active_sessions_elapsed_seconds` gauge. | Coordination |
| `sortie_dispatches_total` | `outcome` | Dispatch attempts. `outcome` is `success` (worker spawned) or `error` (spawn failed). | Coordination |
| `sortie_worker_exits_total` | `exit_type` | Worker session completions. `exit_type` is `normal` (agent finished), `error` (agent or infrastructure failure), or `cancelled` (reconciliation or shutdown). | Coordination |
| `sortie_retries_total` | `trigger` | Retry scheduling events. `trigger` is `error` (failed attempt), `continuation` (successful turn, more work remains), `timer` (retry timer fired), or `stall` (stall timeout detected). | Coordination |
| `sortie_reconciliation_actions_total` | `action` | Reconciliation outcomes per issue checked. `action` is `stop` (issue state no longer active), `cleanup` (terminal state, workspace removed), or `keep` (still active, no action). | Coordination |
| `sortie_poll_cycles_total` | `result` | Poll tick outcomes. `result` is `success` (fetched and dispatched), `error` (tracker fetch failed), or `skipped` (preflight validation failed, dispatch skipped). | Coordination |
| `sortie_tracker_requests_total` | `operation`, `result` | Tracker adapter API calls. Each adapter method increments this independently; the orchestrator never touches it. `operation` is `fetch_candidates`, `fetch_issue`, `fetch_comments`, or `transition`. `result` is `success` or `error`. | Integration |
| `sortie_handoff_transitions_total` | `result` | Handoff state transition outcomes. `result` is `success` (issue transitioned), `error` (transition API failed, retry scheduled as fallback), or `skipped` (no `handoff_state` configured). | Coordination |
| `sortie_tool_calls_total` | `tool`, `result` | Agent tool call completions. `tool` is the tool name (e.g., `Bash`, `tracker_api`). `result` is `success` or `error`. | Coordination |
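Counter semantics matter when querying: `rate()` yields a per-second rate, while `increase()` yields a total over a window. A sketch using the token counter above:

```promql
# Tokens consumed over the past hour, broken down by type
sum(increase(sortie_tokens_total[1h])) by (type)
```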
## Histograms
Distributions with predefined buckets. Query percentiles with `histogram_quantile()`. Each histogram automatically produces `_bucket`, `_sum`, and `_count` time series.
| Name | Labels | Description | Buckets | Producing layer |
|---|---|---|---|---|
| `sortie_poll_duration_seconds` | — | Wall-clock time per complete poll cycle (tracker fetch through dispatch). | Exponential from 0.1s, factor 2, 10 buckets (0.1s → 51.2s) | Coordination |
| `sortie_worker_duration_seconds` | `exit_type` | Wall-clock time per worker session, from spawn to exit. `exit_type` is `normal`, `error`, or `cancelled`. | Exponential from 10s, factor 2, 12 buckets (10s → ~5.7h) | Coordination |
The poll duration histogram is tuned for O(seconds) cycles — tracker API latency plus dispatch overhead. The worker duration histogram covers the full range from quick failures (tens of seconds) to long-running agent sessions (hours).
Bucket boundaries for `sortie_poll_duration_seconds`: 0.1, 0.2, 0.4, 0.8, 1.6, 3.2, 6.4, 12.8, 25.6, 51.2 seconds.
Bucket boundaries for `sortie_worker_duration_seconds`: 10, 20, 40, 80, 160, 320, 640, 1280, 2560, 5120, 10240, 20480 seconds (~10s to ~5.7h).
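Because buckets are cumulative (`le` upper bounds), they can answer threshold questions directly. As a sketch, the fraction of worker sessions completing within 320 seconds (a boundary chosen here purely for illustration):

```promql
sum(rate(sortie_worker_duration_seconds_bucket{le="320"}[30m]))
  / sum(rate(sortie_worker_duration_seconds_count[30m]))
```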
## Info
Static metadata exposed as a gauge with constant value 1.
| Name | Labels | Description | Producing layer |
|---|---|---|---|
| `sortie_build_info` | `version`, `go_version` | Build metadata. Use to verify which Sortie version is running and to join with other metrics in Grafana dashboards. | Observability |
```promql
sortie_build_info
# => sortie_build_info{go_version="go1.24.1",version="0.5.0"} 1
```
## Cardinality model
You will not find `issue_id` or `issue_identifier` as Prometheus labels. This is deliberate.
Sortie's concurrency is O(10) agents, not O(10,000) microservice endpoints — but issue identifiers are unbounded over time. Adding them as labels would create an ever-growing number of time series that degrades Prometheus storage and query performance for no operational benefit.
Prometheus answers aggregate questions: "How many sessions are running?", "What is the token burn rate?", "Are dispatches failing?" The JSON API answers per-issue questions: "What is PROJ-42 doing right now?", "How many tokens has this session consumed?" Use both.
## PromQL examples
These queries assume the default 15-second scrape interval. Adjust `rate()` windows if your interval differs; the window should span at least 4 scrape intervals.
### Token burn rate
```promql
sum(rate(sortie_tokens_total[5m])) by (type) * 60
```
Tokens per minute, broken down by `type` (`input`, `output`, and `cache_read`). Multiply by your provider's per-token pricing to get cost per minute.
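As an illustration with placeholder prices (the per-token figures below are made up, not any provider's actual rates):

```promql
# Approximate dollars per minute; 0.000003 and 0.000015 are example prices only
(
    sum(rate(sortie_tokens_total{type="input"}[5m]))  * 0.000003
  + sum(rate(sortie_tokens_total{type="output"}[5m])) * 0.000015
) * 60
```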
### Dispatch throughput and error rate
```promql
sum(rate(sortie_dispatches_total[5m])) by (outcome)
```
Dispatches per second by outcome. A sustained non-zero `outcome="error"` rate means workspace preparation or agent spawn is failing; check structured logs for the root cause.
To get the error ratio as a percentage:
```promql
rate(sortie_dispatches_total{outcome="error"}[5m])
  / on() sum(rate(sortie_dispatches_total[5m]))
  * 100
```
### Active sessions
```promql
sortie_sessions_running
```
Current running sessions. For capacity headroom:
```promql
sortie_slots_available / (sortie_sessions_running + sortie_slots_available) * 100
```
Percentage of dispatch capacity remaining. Alert when this stays below 10% — you are running near your concurrency ceiling.
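One way to turn that into a Prometheus alerting rule (a sketch; the rule name, threshold, and `for` duration are suggestions to tune for your environment):

```yaml
groups:
  - name: sortie-capacity
    rules:
      - alert: SortieLowDispatchHeadroom
        expr: |
          sortie_slots_available
            / (sortie_sessions_running + sortie_slots_available) * 100 < 10
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Sortie dispatch headroom below 10% for 10 minutes"
```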
### Worker duration percentiles
```promql
histogram_quantile(0.50, rate(sortie_worker_duration_seconds_bucket[30m]))
histogram_quantile(0.95, rate(sortie_worker_duration_seconds_bucket[30m]))
histogram_quantile(0.99, rate(sortie_worker_duration_seconds_bucket[30m]))
```
p50, p95, and p99 worker session duration over the last 30 minutes. Use a wider window (30m+) because worker sessions are long-lived — a 5-minute window may not contain enough completed sessions for meaningful percentiles.
### Retry rate by trigger
```promql
sum(rate(sortie_retries_total[5m])) by (trigger)
```
Retries per second by trigger type. A spike in `trigger="error"` retries signals systemic agent failures. A spike in `trigger="stall"` retries means agents are hanging; check `agent.stall_timeout_ms` in your workflow config.
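A hedged expression for catching stalls as they appear (the window and threshold are illustrative):

```promql
# Non-empty whenever any stall-triggered retry occurred in the last 15 minutes
sum(increase(sortie_retries_total{trigger="stall"}[15m])) > 0
```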
### Poll cycle duration trend
```promql
rate(sortie_poll_duration_seconds_sum[5m]) / rate(sortie_poll_duration_seconds_count[5m])
```
Average poll cycle duration over 5 minutes. This is dominated by tracker API latency. If it climbs steadily, your tracker is slowing down or returning larger result sets.
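Averages can hide tail latency; a p95 view over the same histogram is a useful complement:

```promql
histogram_quantile(0.95, rate(sortie_poll_duration_seconds_bucket[5m]))
```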
### Tool call error rate
```promql
sum(rate(sortie_tool_calls_total{result="error"}[5m])) by (tool)
  / on(tool) sum(rate(sortie_tool_calls_total[5m])) by (tool)
  * 100
```
Error percentage per tool. A high error rate on `tracker_api` suggests credential or connectivity issues with your tracker. High error rates on other tools (e.g., `Bash`) are usually agent-side problems, not Sortie infrastructure issues.
## Grafana dashboard
A reference Grafana dashboard JSON is available for import at `grafana-dashboard.json`. It is tested against Grafana 10+ and uses the `sortie_` metrics documented on this page.
The dashboard includes these panels:
| Panel | Metric(s) | Visualization |
|---|---|---|
| Active sessions | `sortie_sessions_running`, `sortie_sessions_retrying`, `sortie_slots_available` | Stat + time series |
| Token consumption | `sortie_tokens_total` | Time series (rate), broken down by `type` |
| Dispatch outcomes | `sortie_dispatches_total` | Stacked bar (rate), success vs error |
| Worker exits | `sortie_worker_exits_total` | Stacked bar (rate) by `exit_type` |
| Worker duration | `sortie_worker_duration_seconds` | Heatmap + p50/p95/p99 lines |
| Retry activity | `sortie_retries_total` | Time series (rate) by `trigger` |
| Poll cycle health | `sortie_poll_cycles_total`, `sortie_poll_duration_seconds` | Status + duration overlay |
| Tracker API | `sortie_tracker_requests_total` | Time series (rate) by `operation` × `result` |
| Handoff transitions | `sortie_handoff_transitions_total` | Stat counters by `result` |
| Tool calls | `sortie_tool_calls_total` | Time series (rate) by `tool` |
| SSH host utilization | `sortie_ssh_host_usage` | Bar gauge per host (hidden when no SSH hosts are configured) |
| Build info | `sortie_build_info` | Stat showing `version` and `go_version` |
Import the JSON file in Grafana via Dashboards → Import → Upload JSON file. Set your Prometheus data source when prompted.
## Scrape configuration
Add Sortie as a scrape target in `prometheus.yml`:
```yaml
scrape_configs:
  - job_name: sortie
    static_configs:
      - targets: ["localhost:8080"]
```
Replace `localhost:8080` with the host and port where Sortie's HTTP server is running. Sortie binds to loopback by default; if Prometheus runs on a different machine, you will need to configure Sortie's bind address accordingly.
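For a remote deployment only the target changes; the hostname below is a placeholder, and the explicit `scrape_interval` matches the 15-second default that the PromQL examples on this page assume:

```yaml
scrape_configs:
  - job_name: sortie
    scrape_interval: 15s
    static_configs:
      - targets: ["sortie-host.internal:8080"]  # placeholder hostname
```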
The endpoint also serves `promhttp_metric_handler_requests_total` and `promhttp_metric_handler_errors_total` for scrape self-instrumentation, plus Go runtime metrics (`go_goroutines`, `go_memstats_*`, `process_*`) from the standard process and Go collectors.
For a complete setup walkthrough covering installation, alerting rules, and remote host discovery, see Monitor with Prometheus.