Sortie exposes a `/metrics` endpoint in Prometheus text exposition format on the same port as the JSON API and HTML dashboard. It is available whenever the HTTP server is enabled via `--port` or `server.port`.
## Gauges
Point-in-time values. Sortie updates these after every state mutation — dispatch, worker exit, retry, reconciliation.
| Name | Labels | Description | Producing layer |
|---|---|---|---|
| `sortie_sessions_running` | — | Currently running agent sessions. | Coordination |
| `sortie_sessions_retrying` | — | Issues awaiting retry. Includes error retries, continuation retries, and stall retries sitting in the timer queue. | Coordination |
| `sortie_slots_available` | — | Remaining dispatch slots: `max_concurrent_agents - running - claimed`. Reaches 0 when the orchestrator is at capacity. | Coordination |
| `sortie_active_sessions_elapsed_seconds` | — | Sum of wall-clock elapsed seconds across all running sessions. Recomputed from each session's `started_at` timestamp on every poll cycle. Use this to detect active work even when no sessions have recently completed (the runtime counter only increments on session end). | Coordination |
| `sortie_ssh_host_usage` | `host` | Active workers on a given SSH host. Only populated when `extensions.worker.ssh_hosts` is configured. | Coordination |
The `host` label on `sortie_ssh_host_usage` matches the values in your `ssh_hosts` list exactly (e.g., `host="build01.internal"`).
## Counters
Monotonically increasing values. Apply `rate()` or `increase()` to extract per-second or per-interval throughput.
| Name | Labels | Description | Producing layer |
|---|---|---|---|
| `sortie_tokens_total` | `type` | Cumulative LLM tokens consumed. `type` is `input`, `output`, or `cache_read`. | Coordination |
| `sortie_agent_runtime_seconds_total` | — | Cumulative agent runtime. Incremented when a session ends, not while it runs. For live elapsed time, use the `sortie_active_sessions_elapsed_seconds` gauge. | Coordination |
| `sortie_dispatches_total` | `outcome` | Dispatch attempts. `outcome` is `success` (worker spawned) or `error` (spawn failed). | Coordination |
| `sortie_worker_exits_total` | `exit_type` | Worker session completions. `exit_type` is `normal` (agent finished), `error` (agent or infrastructure failure), or `cancelled` (reconciliation or shutdown). | Coordination |
| `sortie_retries_total` | `trigger` | Retry scheduling events. `trigger` is `error` (failed attempt), `continuation` (successful turn, more work remains), `timer` (retry timer fired), or `stall` (stall timeout detected). | Coordination |
| `sortie_reconciliation_actions_total` | `action` | Reconciliation outcomes per issue checked. `action` is `stop` (issue state no longer active), `cleanup` (terminal state, workspace removed), or `keep` (still active, no action). | Coordination |
| `sortie_poll_cycles_total` | `result` | Poll tick outcomes. `result` is `success` (fetched and dispatched), `error` (tracker fetch failed), or `skipped` (preflight validation failed, dispatch skipped). | Coordination |
| `sortie_tracker_requests_total` | `operation`, `result` | Tracker adapter API calls. Each adapter method increments this independently; the orchestrator never touches it. `operation` is `fetch_candidates`, `fetch_issue`, `fetch_comments`, or `transition`. `result` is `success` or `error`. | Integration |
| `sortie_handoff_transitions_total` | `result` | Handoff state transition outcomes. `result` is `success` (issue transitioned), `error` (transition API failed, retry scheduled as fallback), or `skipped` (no `handoff_state` configured). | Coordination |
| `sortie_tool_calls_total` | `tool`, `result` | Agent tool call completions. `tool` is the tool name (e.g., `Bash`, `tracker_api`). `result` is `success` or `error`. | Coordination |
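Counter semantics matter when querying: `rate()` yields a per-second rate, while `increase()` yields a total over a window. A sketch using the token counter above:

```promql
# Tokens consumed over the past hour, broken down by type
sum(increase(sortie_tokens_total[1h])) by (type)
```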
## Histograms
Distributions with predefined buckets. Query percentiles with `histogram_quantile()`. Each histogram automatically produces `_bucket`, `_sum`, and `_count` time series.
| Name | Labels | Description | Buckets | Producing layer |
|---|---|---|---|---|
| `sortie_poll_duration_seconds` | — | Wall-clock time per complete poll cycle (tracker fetch through dispatch). | Exponential from 0.1s, factor 2, 10 buckets (0.1s → 51.2s) | Coordination |
| `sortie_worker_duration_seconds` | `exit_type` | Wall-clock time per worker session, from spawn to exit. `exit_type` is `normal`, `error`, or `cancelled`. | Exponential from 10s, factor 2, 12 buckets (10s → ~5.7h) | Coordination |
The poll duration histogram is tuned for O(seconds) cycles — tracker API latency plus dispatch overhead. The worker duration histogram covers the full range from quick failures (tens of seconds) to long-running agent sessions (hours).
Bucket boundaries for `sortie_poll_duration_seconds`: 0.1, 0.2, 0.4, 0.8, 1.6, 3.2, 6.4, 12.8, 25.6, 51.2 seconds.
Bucket boundaries for `sortie_worker_duration_seconds`: 10, 20, 40, 80, 160, 320, 640, 1280, 2560, 5120, 10240, 20480 seconds (~10s to ~5.7h).
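Because buckets are cumulative (`le` upper bounds), they can answer threshold questions directly. As a sketch, the fraction of worker sessions completing within 320 seconds (a boundary chosen here purely for illustration):

```promql
sum(rate(sortie_worker_duration_seconds_bucket{le="320"}[30m]))
  / sum(rate(sortie_worker_duration_seconds_count[30m]))
```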
## Info
Static metadata exposed as a gauge with constant value 1.
| Name | Labels | Description | Producing layer |
|---|---|---|---|
| `sortie_build_info` | `version`, `go_version` | Build metadata. Use to verify which Sortie version is running and to join with other metrics in Grafana dashboards. | Observability |
```promql
sortie_build_info
# => sortie_build_info{go_version="go1.24.1",version="0.5.0"} 1
```
## Cardinality model
You will not find `issue_id` or `issue_identifier` as Prometheus labels. This is deliberate.
Sortie's concurrency is O(10) agents, not O(10,000) microservice endpoints — but issue identifiers are unbounded over time. Adding them as labels would create an ever-growing number of time series that degrades Prometheus storage and query performance for no operational benefit.
Prometheus answers aggregate questions: "How many sessions are running?", "What is the token burn rate?", "Are dispatches failing?" The JSON API answers per-issue questions: "What is PROJ-42 doing right now?", "How many tokens has this session consumed?" Use both.
## PromQL examples
These queries assume the default 15-second scrape interval. Adjust `rate()` windows if your interval differs; the window should span at least 4 scrape intervals.
### Token burn rate
```promql
sum(rate(sortie_tokens_total[5m])) by (type) * 60
```
Tokens per minute, broken down by `type` (`input`, `output`, and `cache_read`). Multiply by your provider's per-token pricing to get cost per minute.
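As an illustration with placeholder prices (the per-token figures below are made up, not any provider's actual rates):

```promql
# Approximate dollars per minute; 0.000003 and 0.000015 are example prices only
(
    sum(rate(sortie_tokens_total{type="input"}[5m]))  * 0.000003
  + sum(rate(sortie_tokens_total{type="output"}[5m])) * 0.000015
) * 60
```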
### Dispatch throughput and error rate
```promql
sum(rate(sortie_dispatches_total[5m])) by (outcome)
```
Dispatches per second by outcome. A sustained non-zero `outcome="error"` rate means workspace preparation or agent spawn is failing; check structured logs for the root cause.
To get the error ratio as a percentage:
```promql
rate(sortie_dispatches_total{outcome="error"}[5m])
  / on() sum(rate(sortie_dispatches_total[5m]))
  * 100
```
### Active sessions
```promql
sortie_sessions_running
```
Current running sessions. For capacity headroom:
```promql
sortie_slots_available / (sortie_sessions_running + sortie_slots_available) * 100
```
Percentage of dispatch capacity remaining. Alert when this stays below 10% — you are running near your concurrency ceiling.
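One way to turn that into a Prometheus alerting rule (a sketch; the rule name, threshold, and `for` duration are suggestions to tune for your environment):

```yaml
groups:
  - name: sortie-capacity
    rules:
      - alert: SortieLowDispatchHeadroom
        expr: |
          sortie_slots_available
            / (sortie_sessions_running + sortie_slots_available) * 100 < 10
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Sortie dispatch headroom below 10% for 10 minutes"
```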
### Worker duration percentiles
```promql
histogram_quantile(0.50, rate(sortie_worker_duration_seconds_bucket[30m]))
histogram_quantile(0.95, rate(sortie_worker_duration_seconds_bucket[30m]))
histogram_quantile(0.99, rate(sortie_worker_duration_seconds_bucket[30m]))
```
p50, p95, and p99 worker session duration over the last 30 minutes. Use a wider window (30m+) because worker sessions are long-lived — a 5-minute window may not contain enough completed sessions for meaningful percentiles.
### Retry rate by trigger
```promql
sum(rate(sortie_retries_total[5m])) by (trigger)
```
Retries per second by trigger type. A spike in `trigger="error"` retries signals systemic agent failures. A spike in `trigger="stall"` retries means agents are hanging; check `agent.stall_timeout_ms` in your workflow config.
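A hedged expression for catching stalls as they appear (the window and threshold are illustrative):

```promql
# Non-empty whenever any stall-triggered retry occurred in the last 15 minutes
sum(increase(sortie_retries_total{trigger="stall"}[15m])) > 0
```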
### Poll cycle duration trend
```promql
rate(sortie_poll_duration_seconds_sum[5m]) / rate(sortie_poll_duration_seconds_count[5m])
```
Average poll cycle duration over 5 minutes. This is dominated by tracker API latency. If it climbs steadily, your tracker is slowing down or returning larger result sets.
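Averages can hide tail latency; a p95 view over the same histogram is a useful complement:

```promql
histogram_quantile(0.95, rate(sortie_poll_duration_seconds_bucket[5m]))
```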
### Tool call error rate
```promql
sum(rate(sortie_tool_calls_total{result="error"}[5m])) by (tool)
  / on(tool) sum(rate(sortie_tool_calls_total[5m])) by (tool)
  * 100
```
Error percentage per tool. A high error rate on `tracker_api` suggests credential or connectivity issues with your tracker. High error rates on other tools (e.g., `Bash`) are usually agent-side problems, not Sortie infrastructure issues.
## Grafana dashboard
A reference Grafana dashboard JSON is available for import at `grafana-dashboard.json`. It is tested against Grafana 10+ and uses the `sortie_` metrics documented on this page.
The dashboard includes these panels:
| Panel | Metric(s) | Visualization |
|---|---|---|
| Active sessions | `sortie_sessions_running`, `sortie_sessions_retrying`, `sortie_slots_available` | Stat + time series |
| Token consumption | `sortie_tokens_total` | Time series (rate), broken down by `type` |
| Dispatch outcomes | `sortie_dispatches_total` | Stacked bar (rate), success vs error |
| Worker exits | `sortie_worker_exits_total` | Stacked bar (rate) by `exit_type` |
| Worker duration | `sortie_worker_duration_seconds` | Heatmap + p50/p95/p99 lines |
| Retry activity | `sortie_retries_total` | Time series (rate) by `trigger` |
| Poll cycle health | `sortie_poll_cycles_total`, `sortie_poll_duration_seconds` | Status + duration overlay |
| Tracker API | `sortie_tracker_requests_total` | Time series (rate) by `operation` × `result` |
| Handoff transitions | `sortie_handoff_transitions_total` | Stat counters by `result` |
| Tool calls | `sortie_tool_calls_total` | Time series (rate) by `tool` |
| SSH host utilization | `sortie_ssh_host_usage` | Bar gauge per host (hidden when no SSH hosts are configured) |
| Build info | `sortie_build_info` | Stat showing `version` and `go_version` |
Import the JSON file in Grafana via Dashboards → Import → Upload JSON file. Set your Prometheus data source when prompted.
## Scrape configuration
Add Sortie as a scrape target in `prometheus.yml`:
```yaml
scrape_configs:
  - job_name: sortie
    static_configs:
      - targets: ["localhost:8080"]
```
Replace `localhost:8080` with the host and port where Sortie's HTTP server is running. Sortie binds to loopback by default; if Prometheus runs on a different machine, you will need to configure Sortie's bind address accordingly.
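For a remote deployment only the target changes; the hostname below is a placeholder, and the explicit `scrape_interval` matches the 15-second default that the PromQL examples on this page assume:

```yaml
scrape_configs:
  - job_name: sortie
    scrape_interval: 15s
    static_configs:
      - targets: ["sortie-host.internal:8080"]  # placeholder hostname
```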
The endpoint also serves `promhttp_metric_handler_requests_total` and `promhttp_metric_handler_errors_total` for scrape self-instrumentation, plus Go runtime metrics (`go_goroutines`, `go_memstats_*`, `process_*`) from the standard process and Go collectors.
For a complete setup walkthrough covering installation, alerting rules, and remote host discovery, see Monitor with Prometheus.