Prometheus Metrics

Sortie exposes a /metrics endpoint in Prometheus text exposition format on the same port as the JSON API and HTML dashboard. The HTTP server starts by default on port 7678. See the CLI reference for port and host configuration.

Note

When the HTTP server is disabled (--port 0), the orchestrator uses a no-op metrics implementation. Metrics are not collected internally - they are discarded, not buffered.

Gauges

Point-in-time values. Sortie updates these after every state mutation - dispatch, worker exit, retry, reconciliation.

Name	Labels	Description	Producing layer
`sortie_sessions_running`	-	Currently running agent sessions.	Coordination
`sortie_sessions_retrying`	-	Issues awaiting retry. Includes error retries, continuation retries, and stall retries sitting in the timer queue.	Coordination
`sortie_slots_available`	-	Remaining dispatch slots: `max_concurrent_agents - running - claimed`. Reaches 0 when the orchestrator is at capacity.	Coordination
`sortie_active_sessions_elapsed_seconds`	-	Sum of wall-clock elapsed seconds across all running sessions. Recomputed from each session’s `started_at` timestamp on every poll cycle. Use this to detect active work even when no sessions have recently completed (the runtime counter only increments on session end).	Coordination
`sortie_ssh_host_usage`	`host`	Active workers on a given SSH host. Only populated when `extensions.worker.ssh_hosts` is configured.	Coordination

The host label on sortie_ssh_host_usage matches the values in your ssh_hosts list exactly (e.g., host="build01.internal").

Counters

Monotonically increasing. Apply rate() or increase() to extract per-second or per-interval throughput.

Name	Labels	Description	Producing layer
`sortie_tokens_total`	`type`	Cumulative LLM tokens consumed. `type` is `input`, `output`, or `cache_read`.	Coordination
`sortie_agent_runtime_seconds_total`	-	Cumulative agent runtime. Incremented when a session ends, not while it runs. For live elapsed time, use the `sortie_active_sessions_elapsed_seconds` gauge.	Coordination
`sortie_dispatches_total`	`outcome`	Dispatch attempts. `outcome` is `success` (worker spawned) or `error` (spawn failed).	Coordination
`sortie_worker_exits_total`	`exit_type`	Worker session completions. `exit_type` is `normal` (agent finished), `error` (agent or infrastructure failure), or `cancelled` (reconciliation or shutdown).	Coordination
`sortie_retries_total`	`trigger`	Retry scheduling events. `trigger` is `error` (failed attempt), `continuation` (successful turn, more work remains), `timer` (retry timer fired), or `stall` (stall timeout detected).	Coordination
`sortie_reconciliation_actions_total`	`action`	Reconciliation outcomes per issue checked. `action` is `stop` (issue state no longer active), `cleanup` (terminal state, workspace removed), or `keep` (still active, no action).	Coordination
`sortie_poll_cycles_total`	`result`	Poll tick outcomes. `result` is `success` (fetched and dispatched), `error` (tracker fetch failed), or `skipped` (preflight validation failed, dispatch skipped).	Coordination
`sortie_tracker_requests_total`	`operation`, `result`	Tracker adapter API calls. Each adapter method increments this independently - the orchestrator never touches it. `operation` is `fetch_candidates`, `fetch_issue`, `fetch_comments`, `transition`, or `comment`. `result` is `success` or `error`.	Integration
`sortie_handoff_transitions_total`	`result`	Handoff state transition outcomes. `result` is `success` (issue transitioned), `error` (transition API failed, retry scheduled as fallback), or `skipped` (no `handoff_state` configured).	Coordination
`sortie_dispatch_transitions_total`	`result`	Dispatch-time in-progress transition outcomes. `result` is `success` (issue transitioned at dispatch), `error` (transition API failed; worker continues to workspace preparation), or `skipped` (issue was already in the target state). Only recorded when `tracker.in_progress_state` is configured.	Coordination
`sortie_tracker_comments_total`	`lifecycle`, `result`	Tracker comment attempts. `lifecycle` is `dispatch`, `completion`, or `failure`. `result` is `success` or `error`. Only recorded when `tracker.comments.*` flags are enabled. Comment failures are non-fatal - they increment the `error` result but never block the orchestrator.	Coordination
`sortie_tool_calls_total`	`tool`, `result`	Agent tool call completions. `tool` is the tool name (e.g., `Bash`, `tracker_api`). `result` is `success` or `error`.	Coordination
`sortie_ci_status_checks_total`	`result`	CI status check outcomes. `result` is `passing`, `pending`, `failing`, or `error`. Only recorded when the CI reconciliation loop runs.	Coordination
`sortie_ci_escalations_total`	`action`	CI escalation actions taken when checks remain non-passing beyond the configured threshold. `action` is `label`, `comment`, or `error`.	Coordination
`sortie_self_review_iterations_total`	`verdict`	Self-review iterations by outcome. `verdict` is `pass` (verification succeeded), `iterate` (agent re-prompted for another attempt), or `none` (no verdict produced). Only recorded when `self_review.enabled: true` is set. When self-review is disabled, this counter remains at zero.	Coordination
`sortie_self_review_sessions_total`	`final_verdict`	Self-review sessions by final outcome. `final_verdict` is `pass`, `iterate`, or `none`. One increment per completed self-review session. Only recorded when self-review is enabled.	Coordination
`sortie_self_review_cap_reached_total`	-	Self-review sessions that hit the iteration cap without passing. A sustained non-zero rate means verification commands are consistently failing - check your `self_review.verify_commands` configuration. Only recorded when self-review is enabled.	Coordination

Histograms

Distribution summaries with pre-defined buckets. Query percentiles with histogram_quantile(). Each histogram produces _bucket, _sum, and _count time series automatically.

Name	Labels	Description	Buckets	Producing layer
`sortie_poll_duration_seconds`	-	Wall-clock time per complete poll cycle (tracker fetch through dispatch).	Exponential from 0.1s, factor 2, 10 buckets (0.1s → 51.2s)	Coordination
`sortie_worker_duration_seconds`	`exit_type`	Wall-clock time per worker session, from spawn to exit. `exit_type` is `normal`, `error`, or `cancelled`.	Exponential from 10s, factor 2, 12 buckets (10s → ~5.7h)	Coordination
`sortie_self_review_verification_duration_seconds`	`command`	Wall-clock time per verification command execution during self-review. `command` is the first 64 characters of the shell command. Only recorded when self-review is enabled.	Exponential from 10s, factor 2, 12 buckets (10s → ~5.7h)	Coordination

The poll duration histogram is tuned for O(seconds) cycles - tracker API latency plus dispatch overhead. The worker duration histogram covers the full range from quick failures (tens of seconds) to long-running agent sessions (hours).

Bucket boundaries for sortie_poll_duration_seconds: 0.1, 0.2, 0.4, 0.8, 1.6, 3.2, 6.4, 12.8, 25.6, 51.2 seconds.

Bucket boundaries for sortie_worker_duration_seconds: 10, 20, 40, 80, 160, 320, 640, 1280, 2560, 5120, 10240, 20480 seconds (10s to ~5.7h).

The verification duration histogram shares the worker duration bucket boundaries. Verification commands range from fast linters (seconds) to full test suites (minutes).

Bucket boundaries for sortie_self_review_verification_duration_seconds: 10, 20, 40, 80, 160, 320, 640, 1280, 2560, 5120, 10240, 20480 seconds (10s to ~5.7h).
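The boundary lists above follow directly from the "start, factor, count" rule in the histogram table. As a minimal sketch (the helper name exponential_buckets is illustrative, and whether Sortie builds its buckets with an equivalent helper internally is an assumption), the same boundaries can be reproduced like this:

```python
def exponential_buckets(start: float, factor: float, count: int) -> list[float]:
    """Return `count` upper bounds: start, start*factor, start*factor**2, ..."""
    if start <= 0 or factor <= 1 or count < 1:
        raise ValueError("need start > 0, factor > 1, count >= 1")
    return [start * factor**i for i in range(count)]


# Parameters taken from the histogram table on this page.
poll = exponential_buckets(0.1, 2, 10)
worker = exponential_buckets(10, 2, 12)

print(poll[-1])                       # 51.2
print(worker[-1])                     # 20480
print(round(worker[-1] / 3600, 2))    # 5.69 (hours)
```

Observations above the last finite boundary land in the implicit +Inf bucket, so a worker session longer than 20480 seconds is still counted but cannot be resolved any finer.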

Info

Static metadata exposed as a gauge with constant value 1.

Name	Labels	Description	Producing layer
`sortie_build_info`	`version`, `go_version`	Build metadata. Use to verify which Sortie version is running and to join with other metrics in Grafana dashboards.	Observability

sortie_build_info
# => sortie_build_info{go_version="go1.24.1",version="0.5.0"} 1
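If you want to read the running version from a script rather than from Grafana, the exposition line can be split into name, labels, and value. The following is a minimal sketch, not a full parser: parse_sample is a hypothetical helper, it ignores timestamps and does not unescape label values, and it only handles the simple single-line shape shown above.

```python
import re

# One sample line in Prometheus text exposition format, copied from above.
SAMPLE = 'sortie_build_info{go_version="go1.24.1",version="0.5.0"} 1'

# Matches key="value" pairs; escaped characters are kept as-is, not unescaped.
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_sample(line: str) -> tuple[str, dict[str, str], float]:
    """Split 'name{labels} value' into its parts (no timestamp support)."""
    head, value = line.rsplit(" ", 1)
    name, _, label_part = head.partition("{")
    labels = dict(LABEL_RE.findall(label_part))
    return name, labels, float(value)


name, labels, value = parse_sample(SAMPLE)
print(labels["version"])     # 0.5.0
print(labels["go_version"])  # go1.24.1
```

For anything beyond a quick check, a maintained Prometheus client library parser is the safer choice.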

Cardinality model

You will not find issue_id or issue_identifier as Prometheus labels. This is deliberate.

Sortie’s concurrency is O(10) agents, not O(10,000) microservice endpoints - but issue identifiers are unbounded over time. Adding them as labels would create an ever-growing number of time series that degrades Prometheus storage and query performance for no operational benefit.

Prometheus answers aggregate questions: “How many sessions are running?”, “What is the token burn rate?”, “Are dispatches failing?” The JSON API answers per-issue questions: “What is PROJ-42 doing right now?”, “How many tokens has this session consumed?” Use both.
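To make the cardinality argument concrete, the sketch below counts the upper bound on time series for one metric as the product of its label cardinalities. The label values come from the counter table on this page; the issue_id label and the figure of 5,000 issues are purely hypothetical, chosen only to show the growth.

```python
from math import prod

# Label values copied from the sortie_tracker_requests_total row.
TRACKER_REQUEST_LABELS = {
    "operation": ["fetch_candidates", "fetch_issue", "fetch_comments",
                  "transition", "comment"],
    "result": ["success", "error"],
}


def max_series(label_values: dict[str, list[str]]) -> int:
    """Upper bound on time series for one metric: product of label cardinalities."""
    return prod(len(values) for values in label_values.values())


bounded = max_series(TRACKER_REQUEST_LABELS)

# Hypothetical: the same metric with an issue_id label after 5,000 issues.
hypothetical_issue_ids = [f"PROJ-{n}" for n in range(1, 5001)]
unbounded = max_series({**TRACKER_REQUEST_LABELS, "issue_id": hypothetical_issue_ids})

print(bounded)    # 10
print(unbounded)  # 50000
```

The bounded count stays fixed for the life of the deployment, while the hypothetical labelled variant grows with every new issue.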

PromQL examples

These queries assume a 15-second scrape interval. Prometheus itself defaults to 1m when scrape_interval is not set, so adjust rate() windows if your interval differs - the window should span at least 4 scrape intervals.

Token burn rate

sum(rate(sortie_tokens_total[5m])) by (type) * 60

Tokens per minute, broken down by type (input, output, and cache_read). Multiply each series by your provider’s per-token price for that type to get cost per minute.
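The cost calculation can be sketched as follows. The prices and the observed token rates are made-up placeholders, not real provider pricing or Sortie output; substitute your own figures. The function assumes one tokens-per-minute value per type series, as returned by the query above.

```python
# Hypothetical prices in USD per million tokens; replace with your provider's rates.
PRICE_PER_MILLION = {"input": 3.00, "output": 15.00, "cache_read": 0.30}


def cost_per_minute(tokens_per_minute: dict[str, float],
                    price_per_million: dict[str, float]) -> float:
    """Sum tokens/min * price/token across the `type` label values."""
    return sum(
        rate * price_per_million[token_type] / 1_000_000
        for token_type, rate in tokens_per_minute.items()
    )


# Made-up query result: one tokens-per-minute value per `type` series.
observed = {"input": 120_000, "output": 8_000, "cache_read": 400_000}
print(round(cost_per_minute(observed, PRICE_PER_MILLION), 4))  # 0.6
```

Keeping the price table keyed by the same values as the type label means a new token type fails loudly with a KeyError instead of being silently priced at zero.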

Dispatch throughput and error rate

sum(rate(sortie_dispatches_total[5m])) by (outcome)

Dispatches per second by outcome. A sustained non-zero outcome="error" rate means workspace preparation or agent spawn is failing - check structured logs for the root cause.

To get the error ratio as a percentage:

sum(rate(sortie_dispatches_total{outcome="error"}[5m]))
/ sum(rate(sortie_dispatches_total[5m]))
* 100

Active sessions

sortie_sessions_running

Current running sessions. For capacity headroom:

sortie_slots_available / (sortie_sessions_running + sortie_slots_available) * 100

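Percentage of dispatch capacity remaining. Because sortie_slots_available already subtracts claimed issues, the denominator equals max_concurrent_agents minus claimed slots, so the ratio reads slightly high while dispatches are in flight. Alert when this stays below 10% - you are running near your concurrency ceiling.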
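A worked example of the headroom arithmetic, combining the sortie_slots_available formula from the gauge table with the query above. This is a sketch under stated assumptions: the function names are illustrative, flooring at zero is an assumption about how the gauge behaves, and the sample numbers are invented.

```python
def slots_available(max_concurrent_agents: int, running: int, claimed: int) -> int:
    """Documented gauge formula; the floor at zero is an assumption of this sketch."""
    return max(max_concurrent_agents - running - claimed, 0)


def headroom_percent(running: int, available: int) -> float:
    """Evaluate sortie_slots_available / (running + available) * 100."""
    total = running + available
    if total == 0:
        # PromQL would yield NaN for 0/0; returning 0.0 is a choice made here.
        return 0.0
    return available / total * 100


# Invented state: 10 slots configured, 8 sessions running, 1 issue claimed.
available = slots_available(max_concurrent_agents=10, running=8, claimed=1)
print(available)                                 # 1
print(round(headroom_percent(8, available), 1))  # 11.1
```

With one claimed issue the query reports 11.1% rather than 10%, because the claimed slot is missing from the denominator.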

Worker duration percentiles

histogram_quantile(0.50, rate(sortie_worker_duration_seconds_bucket[30m]))
histogram_quantile(0.95, rate(sortie_worker_duration_seconds_bucket[30m]))
histogram_quantile(0.99, rate(sortie_worker_duration_seconds_bucket[30m]))

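p50, p95, and p99 worker session duration over the last 30 minutes. Because the histogram carries the exit_type label, each query returns one series per exit type; wrap the rate in sum by (le) (...) to aggregate across exit types. Use a wider window (30m+) because worker sessions are long-lived - a 5-minute window may not contain enough completed sessions for meaningful percentiles.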
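It helps to know how a percentile is estimated from buckets, because the result can never be more precise than the bucket boundaries. The sketch below is a simplified re-implementation of the linear interpolation that histogram_quantile() applies to classic histograms. It is not Prometheus source code, it works on raw cumulative counts rather than rate() output, and the bucket counts are invented.

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate the q-quantile from cumulative (upper_bound, count) pairs.

    `buckets` must be sorted by upper bound and end with the +Inf bucket.
    """
    if not 0 <= q <= 1:
        raise ValueError("q must be within [0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    # First bucket whose cumulative count reaches the rank.
    idx = next(i for i, (_, count) in enumerate(buckets) if count >= rank)
    upper, count = buckets[idx]
    if math.isinf(upper):
        # Rank falls in the +Inf bucket: report the highest finite bound.
        return buckets[-2][0]
    lower = buckets[idx - 1][0] if idx > 0 else 0.0
    below = buckets[idx - 1][1] if idx > 0 else 0.0
    in_bucket = count - below
    if in_bucket == 0:
        return lower  # only reachable for q == 0 with an empty first bucket
    # Assume observations are spread evenly inside the bucket.
    return lower + (upper - lower) * (rank - below) / in_bucket


# Invented cumulative counts on the first worker-duration boundaries.
sample = [(10, 0), (20, 0), (40, 2), (80, 6), (160, 9), (320, 10), (math.inf, 10)]
print(round(histogram_quantile(0.50, sample), 1))  # 70.0
print(round(histogram_quantile(0.95, sample), 1))  # 240.0
```

With factor-2 buckets, the p95 of 240 seconds only tells you the true value lies somewhere between 160 and 320 seconds, so treat percentile panels as coarse indicators.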

Retry rate by trigger

sum(rate(sortie_retries_total[5m])) by (trigger)

Retries per second by trigger type. A spike in trigger="error" retries signals systemic agent failures. A spike in trigger="stall" retries means agents are hanging - check agent.stall_timeout_ms in your workflow config.

Poll cycle duration trend

rate(sortie_poll_duration_seconds_sum[5m]) / rate(sortie_poll_duration_seconds_count[5m])

Average poll cycle duration over 5 minutes. This is dominated by tracker API latency. If it climbs steadily, your tracker is slowing down or returning larger result sets.

Tool call error rate

sum(rate(sortie_tool_calls_total{result="error"}[5m])) by (tool)
/ on(tool) sum(rate(sortie_tool_calls_total[5m])) by (tool)
* 100

Error percentage per tool. A high error rate on tracker_api suggests credential or connectivity issues with your tracker. High error rates on other tools (e.g., Bash) are usually agent-side problems, not Sortie infrastructure issues.

Self-review pass rate

sum(rate(sortie_self_review_sessions_total{final_verdict="pass"}[30m]))
/ sum(rate(sortie_self_review_sessions_total[30m]))
* 100

Percentage of self-review sessions that ended with a passing verdict over the last 30 minutes. A declining pass rate means agents are producing code that fails verification commands more often - review your prompt templates and verify commands. Use a wider window (30m+) because self-review sessions complete infrequently.

For cap-hit monitoring:

rate(sortie_self_review_cap_reached_total[1h])

Sessions per second that exhausted all iterations without passing. Any sustained non-zero value warrants investigation. See Configure self-review for tuning iteration caps and verify commands.

Grafana dashboard

A reference Grafana dashboard JSON is available for import at grafana-dashboard.json. It is tested against Grafana 10+ and uses the sortie_ metrics documented on this page.

The dashboard organizes panels into eight collapsible rows. Each panel maps to one or more metrics from the tables above.

Row	Panel	Metric(s)	Visualization
Overview	Build info	`sortie_build_info`	Stat (`version`, `go_version`)
Overview	Active sessions	`sortie_sessions_running`, `sortie_sessions_retrying`, `sortie_slots_available`	Stat + time series
Overview	Active sessions elapsed	`sortie_active_sessions_elapsed_seconds`	Stat
Throughput	Token consumption	`sortie_tokens_total`	Time series (rate) by `type`
Throughput	Dispatch outcomes	`sortie_dispatches_total`	Time series (rate), `success` vs `error`
Throughput	Agent runtime	`sortie_agent_runtime_seconds_total`	Time series (rate)
Workers	Worker exits	`sortie_worker_exits_total`	Time series (rate) by `exit_type`
Workers	Worker duration	`sortie_worker_duration_seconds`	Heatmap + p50/p95/p99 percentile lines
Reliability	Retry activity	`sortie_retries_total`	Time series (rate) by `trigger`
Reliability	Poll cycle health	`sortie_poll_cycles_total`, `sortie_poll_duration_seconds`	Count + duration overlay
Reliability	Reconciliation actions	`sortie_reconciliation_actions_total`	Time series (rate) by `action`
Integration	Tracker API	`sortie_tracker_requests_total`	Time series (rate) by `operation` × `result`
Integration	Handoff transitions	`sortie_handoff_transitions_total`	Stat counters by `result`
Integration	Dispatch transitions	`sortie_dispatch_transitions_total`	Stat counters by `result`
Integration	Tracker comments	`sortie_tracker_comments_total`	Time series (rate) by `lifecycle` × `result`
CI Feedback	CI status checks	`sortie_ci_status_checks_total`	Time series (rate) by `result`
CI Feedback	CI escalations	`sortie_ci_escalations_total`	Time series (rate) by `action`
Agent	Tool calls	`sortie_tool_calls_total`	Time series (rate) by `tool`
Agent	SSH host utilization	`sortie_ssh_host_usage`	Bar gauge per `host` (hidden when no SSH hosts configured)
Self-Review	Review pass rate	`sortie_self_review_sessions_total`	Stat (pass % gauge)
Self-Review	Iteration verdicts	`sortie_self_review_iterations_total`	Time series (rate) by `verdict`
Self-Review	Verification duration	`sortie_self_review_verification_duration_seconds`	Heatmap + p50/p95 percentile lines
Self-Review	Cap reached	`sortie_self_review_cap_reached_total`	Stat counter (hidden when self-review is disabled)

Import the JSON file in Grafana via Dashboards → Import → Upload JSON file. Set your Prometheus data source when prompted.

Scrape configuration

Add Sortie as a scrape target in prometheus.yml:

scrape_configs:
  - job_name: sortie
    static_configs:
      - targets: ["localhost:7678"]

Replace localhost:7678 with the host and port where Sortie’s HTTP server is running. Sortie binds to 127.0.0.1 by default - if Prometheus runs on a different machine, pass --host 0.0.0.0 to Sortie or configure a reverse proxy to make the port reachable.

The endpoint also serves promhttp_metric_handler_requests_total and promhttp_metric_handler_errors_total for scrape self-instrumentation, plus Go runtime metrics (go_goroutines, go_memstats_*, process_*) from the standard process and Go collectors.

For a complete setup walkthrough covering installation, alerting rules, and remote host discovery, see Monitor with Prometheus.
