Prometheus Metrics

Sortie exposes a /metrics endpoint in Prometheus text exposition format on the same port as the JSON API and HTML dashboard. The HTTP server starts by default on port 7678. See CLI reference for port and host configuration.

Note

When the HTTP server is disabled (--port 0), the orchestrator uses a no-op metrics implementation. Metrics are not collected internally - they are discarded, not buffered.

Gauges

Point-in-time values. Sortie updates these after every state mutation - dispatch, worker exit, retry, reconciliation.

| Name | Labels | Description | Producing layer |
| --- | --- | --- | --- |
| sortie_sessions_running | - | Currently running agent sessions. | Coordination |
| sortie_sessions_retrying | - | Issues awaiting retry. Includes error retries, continuation retries, and stall retries sitting in the timer queue. | Coordination |
| sortie_slots_available | - | Remaining dispatch slots: max_concurrent_agents - running - claimed. Reaches 0 when the orchestrator is at capacity. | Coordination |
| sortie_active_sessions_elapsed_seconds | - | Sum of wall-clock elapsed seconds across all running sessions. Recomputed from each session’s started_at timestamp on every poll cycle. Use this to detect active work even when no sessions have recently completed (the runtime counter only increments on session end). | Coordination |
| sortie_ssh_host_usage | host | Active workers on a given SSH host. Only populated when extensions.worker.ssh_hosts is configured. | Coordination |

The host label on sortie_ssh_host_usage matches the values in your ssh_hosts list exactly (e.g., host="build01.internal").
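
The arithmetic behind two of these gauges can be sketched in a few lines of Python. This illustrates the formulas stated in the table and is not Sortie's implementation: the function names, the clamp at zero, and the sample values are all assumptions made for the example.

```python
from datetime import datetime, timedelta, timezone


def slots_available(max_concurrent_agents: int, running: int, claimed: int) -> int:
    """Remaining dispatch slots, per the table: max - running - claimed.

    Clamping at zero is an assumption made for this sketch.
    """
    return max(0, max_concurrent_agents - running - claimed)


def active_sessions_elapsed_seconds(started_at: list[datetime], now: datetime) -> float:
    """Sum of wall-clock seconds elapsed across all running sessions."""
    return sum((now - start).total_seconds() for start in started_at)


# Hypothetical poll time and two running sessions started 60s and 90s earlier.
now = datetime(2025, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
sessions = [now - timedelta(seconds=60), now - timedelta(seconds=90)]

print(slots_available(max_concurrent_agents=10, running=6, claimed=2))
print(active_sessions_elapsed_seconds(sessions, now))
```

Because the elapsed gauge is recomputed from start timestamps, it keeps growing while sessions run and drops when a session ends, which is why it complements the runtime counter.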

Counters

Monotonically increasing. Apply rate() or increase() to extract per-second or per-interval throughput.

| Name | Labels | Description | Producing layer |
| --- | --- | --- | --- |
| sortie_tokens_total | type | Cumulative LLM tokens consumed. type is input, output, or cache_read. | Coordination |
| sortie_agent_runtime_seconds_total | - | Cumulative agent runtime. Incremented when a session ends, not while it runs. For live elapsed time, use the sortie_active_sessions_elapsed_seconds gauge. | Coordination |
| sortie_dispatches_total | outcome | Dispatch attempts. outcome is success (worker spawned) or error (spawn failed). | Coordination |
| sortie_worker_exits_total | exit_type | Worker session completions. exit_type is normal (agent finished), error (agent or infrastructure failure), or cancelled (reconciliation or shutdown). | Coordination |
| sortie_retries_total | trigger | Retry scheduling events. trigger is error (failed attempt), continuation (successful turn, more work remains), timer (retry timer fired), or stall (stall timeout detected). | Coordination |
| sortie_reconciliation_actions_total | action | Reconciliation outcomes per issue checked. action is stop (issue state no longer active), cleanup (terminal state, workspace removed), or keep (still active, no action). | Coordination |
| sortie_poll_cycles_total | result | Poll tick outcomes. result is success (fetched and dispatched), error (tracker fetch failed), or skipped (preflight validation failed, dispatch skipped). | Coordination |
| sortie_tracker_requests_total | operation, result | Tracker adapter API calls. Each adapter method increments this independently - the orchestrator never touches it. operation is fetch_candidates, fetch_issue, fetch_comments, transition, or comment. result is success or error. | Integration |
| sortie_handoff_transitions_total | result | Handoff state transition outcomes. result is success (issue transitioned), error (transition API failed, retry scheduled as fallback), or skipped (no handoff_state configured). | Coordination |
| sortie_dispatch_transitions_total | result | Dispatch-time in-progress transition outcomes. result is success (issue transitioned at dispatch), error (transition API failed; worker continues to workspace preparation), or skipped (issue was already in the target state). Only recorded when tracker.in_progress_state is configured. | Coordination |
| sortie_tracker_comments_total | lifecycle, result | Tracker comment attempts. lifecycle is dispatch, completion, or failure. result is success or error. Only recorded when tracker.comments.* flags are enabled. Comment failures are non-fatal - they increment the error result but never block the orchestrator. | Coordination |
| sortie_tool_calls_total | tool, result | Agent tool call completions. tool is the tool name (e.g., Bash, tracker_api). result is success or error. | Coordination |
| sortie_ci_status_checks_total | result | CI status check outcomes. result is passing, pending, failing, or error. Only recorded when the CI reconciliation loop runs. | Coordination |
| sortie_ci_escalations_total | action | CI escalation actions taken when checks remain non-passing beyond the configured threshold. action is label, comment, or error. | Coordination |
| sortie_self_review_iterations_total | verdict | Self-review iterations by outcome. verdict is pass (verification succeeded), iterate (agent re-prompted for another attempt), or none (no verdict produced). Only recorded when self_review.enabled: true is set. When self-review is disabled, this counter remains at zero. | Coordination |
| sortie_self_review_sessions_total | final_verdict | Self-review sessions by final outcome. final_verdict is pass, iterate, or none. One increment per completed self-review session. Only recorded when self-review is enabled. | Coordination |
| sortie_self_review_cap_reached_total | - | Self-review sessions that hit the iteration cap without passing. A sustained non-zero rate means verification commands are consistently failing - check your self_review.verify_commands configuration. Only recorded when self-review is enabled. | Coordination |

Histograms

Distribution summaries with pre-defined buckets. Query percentiles with histogram_quantile(). Each histogram produces _bucket, _sum, and _count time series automatically.

| Name | Labels | Description | Buckets | Producing layer |
| --- | --- | --- | --- | --- |
| sortie_poll_duration_seconds | - | Wall-clock time per complete poll cycle (tracker fetch through dispatch). | Exponential from 0.1s, factor 2, 10 buckets (0.1s → 51.2s) | Coordination |
| sortie_worker_duration_seconds | exit_type | Wall-clock time per worker session, from spawn to exit. exit_type is normal, error, or cancelled. | Exponential from 10s, factor 2, 12 buckets (10s → ~5.7h) | Coordination |
| sortie_self_review_verification_duration_seconds | command | Wall-clock time per verification command execution during self-review. command is the first 64 characters of the shell command. Only recorded when self-review is enabled. | Exponential from 10s, factor 2, 12 buckets (10s → ~5.7h) | Coordination |

The poll duration histogram is tuned for O(seconds) cycles - tracker API latency plus dispatch overhead. The worker duration histogram covers the full range from quick failures (tens of seconds) to long-running agent sessions (hours).

Bucket boundaries for sortie_poll_duration_seconds: 0.1, 0.2, 0.4, 0.8, 1.6, 3.2, 6.4, 12.8, 25.6, 51.2 seconds.

Bucket boundaries for sortie_worker_duration_seconds: 10, 20, 40, 80, 160, 320, 640, 1280, 2560, 5120, 10240, 20480 seconds (10s to ~5.7h).

The verification duration histogram shares the worker duration bucket boundaries. Verification commands range from fast linters (seconds) to full test suites (minutes).

Bucket boundaries for sortie_self_review_verification_duration_seconds: 10, 20, 40, 80, 160, 320, 640, 1280, 2560, 5120, 10240, 20480 seconds (10s to ~5.7h).
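
The boundary lists above follow directly from the "exponential from start, factor, count" description in the table. A minimal Python sketch that reproduces them, assuming each boundary is start × factor^i (the helper name is hypothetical):

```python
def exponential_buckets(start: float, factor: float, count: int) -> list[float]:
    """Upper bounds of `count` buckets, each `factor` times the previous one."""
    return [start * factor**i for i in range(count)]


poll = exponential_buckets(0.1, 2, 10)
worker = exponential_buckets(10, 2, 12)

print([round(b, 1) for b in poll])
print(worker)
print(round(worker[-1] / 3600, 1), "hours")
```

Observations larger than the last boundary still count toward the implicit +Inf bucket, so a poll cycle longer than 51.2 seconds or a session longer than 20480 seconds is recorded but cannot be resolved more precisely.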

Info

Static metadata exposed as a gauge with constant value 1.

| Name | Labels | Description | Producing layer |
| --- | --- | --- | --- |
| sortie_build_info | version, go_version | Build metadata. Use to verify which Sortie version is running and to join with other metrics in Grafana dashboards. | Observability |

sortie_build_info
# => sortie_build_info{go_version="go1.24.1",version="0.5.0"} 1

Cardinality model

You will not find issue_id or issue_identifier as Prometheus labels. This is deliberate.

Sortie’s concurrency is O(10) agents, not O(10,000) microservice endpoints - but issue identifiers are unbounded over time. Adding them as labels would create an ever-growing number of time series that degrades Prometheus storage and query performance for no operational benefit.

Prometheus answers aggregate questions: “How many sessions are running?”, “What is the token burn rate?”, “Are dispatches failing?” The JSON API answers per-issue questions: “What is PROJ-42 doing right now?”, “How many tokens has this session consumed?” Use both.
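
To make the cardinality argument concrete, the following Python sketch estimates an upper bound on time series per metric as the product of the number of values each label can take. The helper name and the PROJ-n identifiers are hypothetical; the label values for sortie_tracker_requests_total come from the counters table.

```python
from math import prod


def series_count(label_values: dict[str, list[str]]) -> int:
    """Upper bound on series for one metric: product of per-label value counts."""
    return prod(len(values) for values in label_values.values())


tracker_requests = {
    "operation": ["fetch_candidates", "fetch_issue", "fetch_comments",
                  "transition", "comment"],
    "result": ["success", "error"],
}
print(series_count(tracker_requests))

# Hypothetical: the same metric if an issue_id label were added.
for issues_seen in (100, 10_000):
    issue_ids = [f"PROJ-{n}" for n in range(issues_seen)]
    with_issue_id = dict(tracker_requests, issue_id=issue_ids)
    print(issues_seen, series_count(with_issue_id))
```

The bounded label set stays at 10 series no matter how long Sortie runs, while the hypothetical issue_id variant grows with every issue ever processed.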

PromQL examples

These queries assume a 15-second scrape interval. Prometheus itself defaults to a 1-minute interval when scrape_interval is not set, so adjust rate() windows if your interval differs: the window should span at least 4 scrape intervals.

Token burn rate

sum(rate(sortie_tokens_total[5m])) by (type) * 60

Tokens per minute, broken down by type (input, output, and cache_read). Multiply each type by your provider’s per-token price for that type to get cost per minute.
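
As a worked example of the cost calculation, here is a short Python sketch. The token rates and the prices are made-up placeholders, not real provider pricing, and the function name is hypothetical; substitute the query results and your provider's price sheet.

```python
def cost_per_minute(tokens_per_minute: dict[str, float],
                    usd_per_million_tokens: dict[str, float]) -> float:
    """Multiply each token type's per-minute rate by its price and sum."""
    return sum(
        rate * usd_per_million_tokens[token_type] / 1_000_000
        for token_type, rate in tokens_per_minute.items()
    )


# Hypothetical query results, keyed by the type label.
tokens_per_minute = {"input": 200_000, "output": 50_000, "cache_read": 1_000_000}
# Placeholder prices in USD per million tokens (assumed, not real pricing).
usd_per_million_tokens = {"input": 3.0, "output": 15.0, "cache_read": 0.3}

print(round(cost_per_minute(tokens_per_minute, usd_per_million_tokens), 2))
```

Keeping the price table keyed by the same type values as the metric avoids pricing all tokens at a single rate, which matters when output or cached tokens are priced differently.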

Dispatch throughput and error rate

sum(rate(sortie_dispatches_total[5m])) by (outcome)

Dispatches per second by outcome. A sustained non-zero outcome="error" rate means workspace preparation or agent spawn is failing - check structured logs for the root cause.

To get the error ratio as a percentage:

rate(sortie_dispatches_total{outcome="error"}[5m])
/ on() sum(rate(sortie_dispatches_total[5m]))
* 100

Active sessions

sortie_sessions_running

Current running sessions. For capacity headroom:

sortie_slots_available / (sortie_sessions_running + sortie_slots_available) * 100

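Percentage of dispatch capacity remaining. Because sortie_slots_available already subtracts claimed issues, the denominator equals max_concurrent_agents minus claimed slots, so treat the result as an approximation while claims are pending. Alert when this stays below 10% - you are running near your concurrency ceiling.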

Worker duration percentiles

histogram_quantile(0.50, rate(sortie_worker_duration_seconds_bucket[30m]))
histogram_quantile(0.95, rate(sortie_worker_duration_seconds_bucket[30m]))
histogram_quantile(0.99, rate(sortie_worker_duration_seconds_bucket[30m]))

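p50, p95, and p99 worker session duration over the last 30 minutes. Because the histogram carries an exit_type label, each query returns one series per exit_type; wrap the rate() in sum by (le) (...) to get a single percentile across all exit types. Use a wider window (30m+) because worker sessions are long-lived - a 5-minute window may not contain enough completed sessions for meaningful percentiles.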
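
To see how a percentile is estimated from bucket counts, the following Python sketch mirrors the general idea behind histogram_quantile(): find the bucket that contains the target rank, then interpolate linearly inside it. It is a simplified illustration, not Prometheus's exact implementation, and the bucket counts are hypothetical.

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate the q-quantile (0 < q <= 1) from cumulative buckets.

    buckets holds (upper_bound, cumulative_count) pairs sorted by upper bound,
    ending with the +Inf bucket.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    # First bucket whose cumulative count reaches the rank.
    idx = next(i for i, (_, count) in enumerate(buckets) if count >= rank)
    upper, count = buckets[idx]
    if math.isinf(upper):
        # The quantile falls in +Inf: report the highest finite boundary.
        return buckets[-2][0]
    lower, below = (0.0, 0.0) if idx == 0 else buckets[idx - 1]
    # Linear interpolation between the bucket's lower and upper bounds.
    return lower + (upper - lower) * (rank - below) / (count - below)


# Hypothetical cumulative counts for ten completed sessions.
completed = [(10, 0), (20, 2), (40, 6), (80, 10), (math.inf, 10)]
print(histogram_quantile(0.50, completed))
print(histogram_quantile(0.95, completed))
```

Because the estimate is interpolated within a bucket, its precision is limited by bucket width: with factor-2 boundaries, a reported p95 can be off by up to the width of the bucket it lands in.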

Retry rate by trigger

sum(rate(sortie_retries_total[5m])) by (trigger)

Retries per second by trigger type. A spike in trigger="error" retries signals systemic agent failures. A spike in trigger="stall" retries means agents are hanging - check agent.stall_timeout_ms in your workflow config.

Poll cycle duration trend

rate(sortie_poll_duration_seconds_sum[5m]) / rate(sortie_poll_duration_seconds_count[5m])

Average poll cycle duration over 5 minutes. This is dominated by tracker API latency. If it climbs steadily, your tracker is slowing down or returning larger result sets.

Tool call error rate

sum(rate(sortie_tool_calls_total{result="error"}[5m])) by (tool)
/ on(tool) sum(rate(sortie_tool_calls_total[5m])) by (tool)
* 100

Error percentage per tool. A high error rate on tracker_api suggests credential or connectivity issues with your tracker. High error rates on other tools (e.g., Bash) are usually agent-side problems, not Sortie infrastructure issues.

Self-review pass rate

rate(sortie_self_review_sessions_total{final_verdict="pass"}[30m])
/ on() sum(rate(sortie_self_review_sessions_total[30m]))
* 100

Percentage of self-review sessions that ended with a passing verdict over the last 30 minutes. A declining pass rate means agents are producing code that fails verification commands more often - review your prompt templates and verify commands. Use a wider window (30m+) because self-review sessions complete infrequently.

For cap-hit monitoring:

rate(sortie_self_review_cap_reached_total[1h])

Sessions per second that exhausted all iterations without passing. Any sustained non-zero value warrants investigation. See Configure self-review for tuning iteration caps and verify commands.

Grafana dashboard

A reference Grafana dashboard JSON is available for import at grafana-dashboard.json. It is tested against Grafana 10+ and uses the sortie_ metrics documented on this page.

The dashboard organizes panels into eight collapsible rows. Each panel maps to one or more metrics from the tables above.

| Row | Panel | Metric(s) | Visualization |
| --- | --- | --- | --- |
| Overview | Build info | sortie_build_info | Stat (version, go_version) |
| Overview | Active sessions | sortie_sessions_running, sortie_sessions_retrying, sortie_slots_available | Stat + time series |
| Overview | Active sessions elapsed | sortie_active_sessions_elapsed_seconds | Stat |
| Throughput | Token consumption | sortie_tokens_total | Time series (rate) by type |
| Throughput | Dispatch outcomes | sortie_dispatches_total | Time series (rate), success vs error |
| Throughput | Agent runtime | sortie_agent_runtime_seconds_total | Time series (rate) |
| Workers | Worker exits | sortie_worker_exits_total | Time series (rate) by exit_type |
| Workers | Worker duration | sortie_worker_duration_seconds | Heatmap + p50/p95/p99 percentile lines |
| Reliability | Retry activity | sortie_retries_total | Time series (rate) by trigger |
| Reliability | Poll cycle health | sortie_poll_cycles_total, sortie_poll_duration_seconds | Count + duration overlay |
| Reliability | Reconciliation actions | sortie_reconciliation_actions_total | Time series (rate) by action |
| Integration | Tracker API | sortie_tracker_requests_total | Time series (rate) by operation × result |
| Integration | Handoff transitions | sortie_handoff_transitions_total | Stat counters by result |
| Integration | Dispatch transitions | sortie_dispatch_transitions_total | Stat counters by result |
| Integration | Tracker comments | sortie_tracker_comments_total | Time series (rate) by lifecycle × result |
| CI Feedback | CI status checks | sortie_ci_status_checks_total | Time series (rate) by result |
| CI Feedback | CI escalations | sortie_ci_escalations_total | Time series (rate) by action |
| Agent | Tool calls | sortie_tool_calls_total | Time series (rate) by tool |
| Agent | SSH host utilization | sortie_ssh_host_usage | Bar gauge per host (hidden when no SSH hosts configured) |
| Self-Review | Review pass rate | sortie_self_review_sessions_total | Stat (pass % gauge) |
| Self-Review | Iteration verdicts | sortie_self_review_iterations_total | Time series (rate) by verdict |
| Self-Review | Verification duration | sortie_self_review_verification_duration_seconds | Heatmap + p50/p95 percentile lines |
| Self-Review | Cap reached | sortie_self_review_cap_reached_total | Stat counter (hidden when self-review is disabled) |

Import the JSON file in Grafana via Dashboards → Import → Upload JSON file. Set your Prometheus data source when prompted.

Scrape configuration

Add Sortie as a scrape target in prometheus.yml:

scrape_configs:
  - job_name: sortie
    static_configs:
      - targets: ["localhost:7678"]

Replace localhost:7678 with the host and port where Sortie’s HTTP server is running. Sortie binds to 127.0.0.1 by default - if Prometheus runs on a different machine, pass --host 0.0.0.0 to Sortie or configure a reverse proxy to make the port reachable.

The endpoint also serves promhttp_metric_handler_requests_total and promhttp_metric_handler_errors_total for scrape self-instrumentation, plus Go runtime metrics (go_goroutines, go_memstats_*, process_*) from the standard process and Go collectors.
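
Since the endpoint mixes Sortie metrics with promhttp_, go_, and process_ series, a script that inspects the raw output usually wants to filter by prefix. The Python sketch below parses exposition text and keeps only sortie_ samples. It is a deliberately simplified parser (it ignores timestamps and does not unescape label values), and the sample text, including the HELP wording, is made up for illustration.

```python
import re

# name, optional {labels}, then the sample value
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def sortie_samples(exposition_text: str, prefix: str = "sortie_"):
    """Return (name, labels, value) tuples for samples whose name has the prefix."""
    samples = []
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        match = SAMPLE_RE.match(line)
        if match is None or not match.group(1).startswith(prefix):
            continue
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


# Hypothetical excerpt of a /metrics response.
EXAMPLE = """\
# HELP sortie_sessions_running Currently running agent sessions.
# TYPE sortie_sessions_running gauge
sortie_sessions_running 3
sortie_tokens_total{type="input"} 1200
go_goroutines 42
promhttp_metric_handler_requests_total{code="200"} 7
sortie_build_info{go_version="go1.24.1",version="0.5.0"} 1
"""

for name, labels, value in sortie_samples(EXAMPLE):
    print(name, labels, value)
```

For anything beyond a quick check, a maintained exposition-format parser is a safer choice than a hand-written regular expression.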

For a complete setup walkthrough covering installation, alerting rules, and remote host discovery, see Monitor with Prometheus.
