State Machine

Sortie maintains two layers of state for every issue it processes. The orchestration state tracks whether the orchestrator has claimed the issue and what it is doing with it. The run attempt phase tracks where a single agent invocation stands within its lifecycle. Both are independent of tracker states (To Do, In Progress) - they are Sortie’s internal bookkeeping.

See also: WORKFLOW.md configuration for active_states, terminal_states, handoff_state, and in_progress_state; error reference for error kinds that trigger retries; CLI reference for --dry-run mode that simulates dispatch without launching agents; dashboard reference for real-time visibility into orchestration state.

Orchestration states

Every issue known to the orchestrator is in one of five states. Claimed is an umbrella state: a claimed issue is always either Running or RetryQueued. The orchestrator is the single authority for these transitions - no other component mutates scheduling state.

State	Description
`Unclaimed`	The issue is not running and has no retry scheduled. Eligible for dispatch if it meets candidate selection rules.
`Claimed`	The orchestrator has reserved the issue to prevent duplicate dispatch. A claimed issue is always either `Running` or `RetryQueued`.
`Running`	A worker goroutine exists for this issue. The issue is tracked in the `running` map with a live `RunningEntry`.
`RetryQueued`	No worker is running, but a retry timer exists. The issue remains claimed until the timer fires and either re-dispatches or releases.
`Released`	The claim has been removed. The issue is no longer tracked. This happens when the issue reaches a terminal tracker state, leaves the active state set, is missing from the tracker, or exhausts its retry path.

flowchart TD
    UC([Unclaimed]) --> RN

    subgraph Claimed
        RN[Running] --> RQ[RetryQueued]
        RQ --> RN
    end

    Claimed --> RL([Released])
    RL --> UC

    classDef idle fill:#f0f0f4,stroke:#8b8fa3,color:#3a3d4a
    classDef active fill:#dbeafe,stroke:#3b82f6,color:#1e3a5f,stroke-width:2px
    classDef waiting fill:#fef3c7,stroke:#d97706,color:#78350f
    classDef released fill:#f0f0f4,stroke:#8b8fa3,color:#3a3d4a,stroke-dasharray:5 5

    class UC idle
    class RN active
    class RQ waiting
    class RL released

    style Claimed fill:none,stroke:#3b82f6,stroke-width:2px,rx:8,color:#3b82f6

Transition details

Unclaimed → Claimed. Occurs during the dispatch phase of a poll tick. The issue must pass all candidate eligibility checks, which include a free global concurrency slot and, where configured, a free per-state slot. The issue enters Running immediately - there is no Claimed without a worker.

Running → RetryQueued. Four outcomes lead here (none apply when a soft-stop signal is active; see Claimed → Released below):

Normal exit, issue still active, no soft-stop: continuation retry after 1 000 ms fixed delay.
Normal exit, handoff fails, no soft-stop: continuation retry after 1 000 ms.
Error exit, retryable: exponential backoff retry (see backoff formula).
Stall timeout: worker is killed; exponential backoff retry is scheduled.

RetryQueued → Running. The retry timer fires. The orchestrator re-fetches candidates, confirms the issue is still eligible, acquires a slot, and launches a new worker. If no slot is available, the retry is rescheduled with the same backoff.

Claimed → Released. The claim is removed and no retry is scheduled:

Reconciliation detects the tracker state is terminal or no longer in active_states.
The retry timer fires but the issue is absent from the candidate list.
The max_sessions budget is reached.
The worker error is classified as non-retryable.
A handoff_state transition succeeds (the tracker now owns the issue).
Soft-stop blocked: worker exits normally, claim released. No handoff transition, no continuation retry.
Soft-stop needs-human-review, handoff succeeds: worker exits normally, handoff transition performed, claim released.
Soft-stop needs-human-review, handoff fails: worker exits normally, handoff fails, claim released without retry.

Released → Unclaimed. A released issue can be re-dispatched on a future poll tick if its tracker state returns to an active state. The orchestrator does not remember previous releases - each poll tick evaluates eligibility from scratch.
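
The transitions above can be summarized as a small lookup table. This is a minimal sketch, not Sortie's actual implementation: the names State, allowed, CanTransition, and IsClaimed are hypothetical and chosen only for illustration.

```go
package main

// State is a hypothetical enum for the orchestration states described above.
// Claimed is modeled as a predicate rather than a value, because a claimed
// issue is always either Running or RetryQueued.
type State int

const (
	Unclaimed State = iota
	Running
	RetryQueued
	Released
)

// allowed lists the transitions named in the "Transition details" section.
var allowed = map[State][]State{
	Unclaimed:   {Running},              // Unclaimed → Claimed (enters Running immediately)
	Running:     {RetryQueued, Released}, // retry scheduled, or claim removed
	RetryQueued: {Running, Released},     // timer fires: re-dispatch or release
	Released:    {Unclaimed},             // eligible again on a future poll tick
}

// CanTransition reports whether moving from one state to another is permitted.
func CanTransition(from, to State) bool {
	for _, next := range allowed[from] {
		if next == to {
			return true
		}
	}
	return false
}

// IsClaimed reports whether the state falls under the Claimed umbrella.
func IsClaimed(s State) bool {
	return s == Running || s == RetryQueued
}
```

For example, CanTransition(Unclaimed, RetryQueued) is false, which mirrors the rule that there is no Claimed without a worker.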

Run attempt phases

Each worker attempt progresses through a linear sequence of phases. Terminal phases end the attempt and produce a WorkerResult delivered to the orchestrator.

Phase	Description
`DispatchTransition`	Optional. When `tracker.in_progress_state` is configured, the worker calls `TransitionIssue` before workspace preparation. If the issue is already in the target state, the call is skipped (debug log only). Failure is non-fatal - the worker logs a warning and continues.
`DispatchComment`	Optional. When `tracker.comments.on_dispatch` is `true`, the worker posts a tracker comment acknowledging that Sortie has claimed the issue. Fires after the dispatch transition and before workspace preparation. Failure is non-fatal - the worker logs a warning and continues.
`PreparingWorkspace`	Workspace directory is created or reused. `after_create` and `before_run` hooks execute.
`BuildingPrompt`	The `text/template` prompt body is rendered with issue data, attempt number, and turn context.
`LaunchingAgentProcess`	The agent adapter starts a session (subprocess, API call, or mock).
`InitializingSession`	Waiting for the `session_started` event from the agent adapter.
`StreamingTurn`	The agent is actively working. Token usage, tool calls, and status events stream in.
`Finishing`	The turn ended. `after_run` hooks execute. The worker checks whether to loop for another turn.
`Succeeded`	Terminal. The worker completed all turns without error.
`Failed`	Terminal. An error occurred during any earlier phase.
`TimedOut`	Terminal. The turn exceeded `agent.turn_timeout_ms`.
`Stalled`	Terminal. No agent event arrived within `agent.stall_timeout_ms`. Detected by reconciliation.
`CanceledByReconciliation`	Terminal. The worker’s context was cancelled because the issue’s tracker state became terminal or left the active set.

flowchart TD
    DT[DispatchTransition] --> DC[DispatchComment]
    DC --> PW[PreparingWorkspace]
    PW --> BP[BuildingPrompt]
    BP --> LA[LaunchingAgentProcess]
    LA --> IS[InitializingSession]
    IS --> ST[StreamingTurn]
    ST --> FN[Finishing]

    FN --> ST
    FN --> OK([Succeeded])

    ST --> TO([TimedOut])
    ST --> SL([Stalled])
    ST --> CR([CanceledByReconciliation])

    classDef phase fill:#dbeafe,stroke:#3b82f6,color:#1e3a5f
    classDef active fill:#bfdbfe,stroke:#2563eb,color:#1e3a5f,stroke-width:2px
    classDef success fill:#d1fae5,stroke:#059669,color:#064e3b,stroke-width:2px
    classDef failure fill:#fee2e2,stroke:#dc2626,color:#7f1d1d

    class DT,DC,PW,BP,LA,IS,FN phase
    class ST active
    class OK success
    class TO,SL,CR failure

Any phase from PreparingWorkspace through StreamingTurn can also transition to Failed on error. The specific error trigger for each phase is documented in the table above.
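
As an illustration of the phase table, the sketch below models phases as string constants and separates the terminal ones. The Phase type and IsTerminal method are hypothetical names; only the phase names themselves come from the table.

```go
package main

// Phase is a hypothetical type for the run attempt phases listed above.
type Phase string

const (
	DispatchTransition       Phase = "DispatchTransition"
	DispatchComment          Phase = "DispatchComment"
	PreparingWorkspace       Phase = "PreparingWorkspace"
	BuildingPrompt           Phase = "BuildingPrompt"
	LaunchingAgentProcess    Phase = "LaunchingAgentProcess"
	InitializingSession      Phase = "InitializingSession"
	StreamingTurn            Phase = "StreamingTurn"
	Finishing                Phase = "Finishing"
	Succeeded                Phase = "Succeeded"
	Failed                   Phase = "Failed"
	TimedOut                 Phase = "TimedOut"
	Stalled                  Phase = "Stalled"
	CanceledByReconciliation Phase = "CanceledByReconciliation"
)

// IsTerminal reports whether the phase ends the attempt, that is, whether it
// is one of the five phases marked "Terminal" in the table.
func (p Phase) IsTerminal() bool {
	switch p {
	case Succeeded, Failed, TimedOut, Stalled, CanceledByReconciliation:
		return true
	}
	return false
}
```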

Multi-turn behavior

A single worker attempt can execute multiple agent turns. After each turn:

The worker checks the tracker for the issue’s current state.
If the state is still active and the turn count has not reached agent.max_turns, the worker loops back to StreamingTurn.
The first turn uses the full rendered prompt. Continuation turns send only continuation guidance to the existing agent thread.
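
The loop decision described above might look like the following sketch. Both helpers are hypothetical; the assumption that turns are counted from 1 is mine and is not stated in this page.

```go
package main

// ShouldContinue reports whether the worker loops back to StreamingTurn:
// the tracker state must still be active and the number of completed turns
// must be below agent.max_turns.
func ShouldContinue(stateActive bool, turnsDone, maxTurns int) bool {
	return stateActive && turnsDone < maxTurns
}

// PromptFor picks the text sent for a given turn (assumed to be 1-based):
// the full rendered prompt on the first turn, continuation guidance after it.
func PromptFor(turn int, fullPrompt, continuation string) string {
	if turn <= 1 {
		return fullPrompt
	}
	return continuation
}
```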

Transition triggers

Six external events drive state transitions. Each is handled by the orchestrator’s single-writer event loop.

Trigger	What happens
Poll tick	Reconcile running issues (stall detection + tracker state refresh). Run preflight validation. Fetch candidates. Sort by priority. Dispatch eligible issues until slots are exhausted. Dispatched workers perform the optional in-progress transition (via `tracker.in_progress_state`) and optional dispatch comment (via `tracker.comments.on_dispatch`) as their first steps.
Worker exit (normal)	Remove `running` entry. Persist run history to SQLite. Update token totals. Three outcome paths: (1) no soft-stop, issue active – schedule continuation retry or perform handoff transition (retry on handoff failure); (2) soft-stop `blocked` – release claim, no handoff, no retry; (3) soft-stop `needs-human-review` – perform handoff transition (if configured and issue active), release claim (no retry on handoff failure). Post completion comment if `tracker.comments.on_completion` is enabled (detached goroutine, non-blocking).
Worker exit (error)	Remove `running` entry. Persist run history. Classify error. If retryable, schedule exponential backoff retry. If not, release claim. Post failure comment if `tracker.comments.on_failure` is enabled (detached goroutine, non-blocking).
Agent update event	Update live session fields: token counters, session ID, thread ID, agent PID, rate limits, last activity timestamp.
Retry timer fired	Re-fetch candidates. If the issue is still eligible and slots are available, dispatch. If no slots, reschedule. If the issue is gone or inactive, release claim.
Reconciliation: tracker state refresh	For each running issue: terminal state → cancel worker, clean workspace. Still active → update snapshot. Neither active nor terminal → cancel worker, no cleanup.

Candidate eligibility

An issue is eligible for dispatch when all conditions are true:

Condition	Details
Required fields present	`id`, `identifier`, `title`, and `state` must be non-empty.
State is active	`state` is in `tracker.active_states` (case-insensitive).
State is not terminal	`state` is not in `tracker.terminal_states`.
Not running	`id` is not in the `running` map.
Not claimed	`id` is not in the `claimed` set.
Global slots available	`running_count < polling.max_concurrent_agents`.
Per-state slots available	Running count for this state < `polling.max_concurrent_agents_by_state[state]` (if configured).
Blockers resolved	No entry in `blocked_by` has a state that is in `active_states`.
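
The conditions in the table can be combined into a single predicate. The sketch below is an assumption-laden illustration: the Issue, Blocker, Config, and Snapshot types and their fields are hypothetical, and the case-insensitive matching of terminal states, blocker states, and per-state limit keys is assumed (the table only states it for active_states).

```go
package main

import "strings"

// Hypothetical types; field names are illustrative, not Sortie's real structs.
type Blocker struct{ State string }

type Issue struct {
	ID, Identifier, Title, State string
	BlockedBy                    []Blocker
}

type Config struct {
	ActiveStates, TerminalStates []string
	MaxConcurrent                int
	MaxByState                   map[string]int // keyed by lowercased state (assumption)
}

type Snapshot struct {
	Running map[string]string // issue ID -> tracker state of the running issue
	Claimed map[string]bool   // issue IDs currently claimed
}

// inStates reports whether s matches any listed state, ignoring case.
func inStates(states []string, s string) bool {
	for _, x := range states {
		if strings.EqualFold(x, s) {
			return true
		}
	}
	return false
}

// Eligible applies every row of the eligibility table in order.
func Eligible(is Issue, cfg Config, snap Snapshot) bool {
	if is.ID == "" || is.Identifier == "" || is.Title == "" || is.State == "" {
		return false // required fields
	}
	if !inStates(cfg.ActiveStates, is.State) || inStates(cfg.TerminalStates, is.State) {
		return false // must be active and not terminal
	}
	if _, running := snap.Running[is.ID]; running || snap.Claimed[is.ID] {
		return false // already running or claimed
	}
	if len(snap.Running) >= cfg.MaxConcurrent {
		return false // no global slot
	}
	if limit, ok := cfg.MaxByState[strings.ToLower(is.State)]; ok {
		n := 0
		for _, st := range snap.Running {
			if strings.EqualFold(st, is.State) {
				n++
			}
		}
		if n >= limit {
			return false // no per-state slot
		}
	}
	for _, b := range is.BlockedBy {
		if inStates(cfg.ActiveStates, b.State) {
			return false // a blocker is still in an active state
		}
	}
	return true
}
```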

Issues are sorted for dispatch: priority ascending (nil last), created_at oldest first, identifier lexicographic tiebreaker.
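
A comparator for this ordering could be sketched as follows. The Candidate type is hypothetical; a nil Priority pointer stands in for "no priority set".

```go
package main

import (
	"sort"
	"time"
)

// Candidate is a hypothetical view of an issue with only the sort keys.
type Candidate struct {
	Identifier string
	Priority   *int // nil means no priority; sorts last
	CreatedAt  time.Time
}

// SortForDispatch orders candidates by priority ascending (nil last),
// then created_at oldest first, then identifier lexicographically.
func SortForDispatch(cs []Candidate) {
	sort.SliceStable(cs, func(i, j int) bool {
		a, b := cs[i], cs[j]
		switch {
		case a.Priority != nil && b.Priority == nil:
			return true
		case a.Priority == nil && b.Priority != nil:
			return false
		case a.Priority != nil && b.Priority != nil && *a.Priority != *b.Priority:
			return *a.Priority < *b.Priority
		}
		if !a.CreatedAt.Equal(b.CreatedAt) {
			return a.CreatedAt.Before(b.CreatedAt)
		}
		return a.Identifier < b.Identifier
	})
}
```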

Backoff formula

Sortie uses two retry delay strategies depending on the exit type.

Continuation retry (normal worker exit, issue still active):

$$\text{delay} = 1000 \text{ ms}$$

Error retry (worker failure, stall timeout):

$$\text{delay} = \min\left(10000 \times 2^{(\text{attempt} - 1)},\ \text{max\_retry\_backoff\_ms}\right) \text{ ms}$$

Default max_retry_backoff_ms: 300 000 ms (5 minutes). Configurable via agent.max_retry_backoff_ms.

Attempt	Delay
1	10 s
2	20 s
3	40 s
4	80 s
5	160 s
6+	300 s (cap)

When a retry fires but no concurrency slot is available, the retry is rescheduled at the same backoff level with the error "no available orchestrator slots".
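
The two delay strategies can be reproduced with a few lines of integer arithmetic. This is a sketch in milliseconds; the constant and function names are hypothetical, and treating attempt values below 1 as attempt 1 is an assumption.

```go
package main

// ContinuationDelayMs is the fixed delay after a normal worker exit.
const ContinuationDelayMs int64 = 1000

// baseErrorDelayMs is the first-attempt delay in the error retry formula.
const baseErrorDelayMs int64 = 10000

// ErrorRetryDelayMs returns min(10000 * 2^(attempt-1), maxBackoffMs).
// Doubling stops as soon as the cap is reached, so large attempt numbers
// cannot overflow. Attempts below 1 are treated as attempt 1 (assumption).
func ErrorRetryDelayMs(attempt int, maxBackoffMs int64) int64 {
	delay := baseErrorDelayMs
	for i := 1; i < attempt && delay < maxBackoffMs; i++ {
		delay *= 2
	}
	if delay > maxBackoffMs {
		return maxBackoffMs
	}
	return delay
}
```

With the default cap of 300 000 ms, attempts 1 through 5 yield 10 000, 20 000, 40 000, 80 000, and 160 000 ms, and attempt 6 onward yields the cap, matching the table above.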

Reconciliation

Reconciliation runs at the start of every poll tick, before dispatch. It has two parts.

Part A - Stall detection. For each running issue, compute elapsed time since the last agent event (or started_at if no event has arrived). If elapsed exceeds agent.stall_timeout_ms, the worker is killed and an exponential backoff retry is scheduled. Disabled when stall_timeout_ms is zero or negative.
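
The stall check in Part A reduces to one comparison. In this sketch, timestamps are Unix milliseconds and a lastEventMs of 0 means no agent event has arrived yet; both conventions, and the function name, are assumptions made for illustration.

```go
package main

// IsStalled reports whether a running issue has exceeded the stall timeout.
// Elapsed time is measured from the last agent event, or from startedAtMs
// when no event has arrived (lastEventMs == 0). A zero or negative timeout
// disables detection.
func IsStalled(nowMs, startedAtMs, lastEventMs, stallTimeoutMs int64) bool {
	if stallTimeoutMs <= 0 {
		return false
	}
	ref := lastEventMs
	if ref == 0 {
		ref = startedAtMs
	}
	return nowMs-ref > stallTimeoutMs
}
```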

Part B - Tracker state refresh. Fetch current tracker states for all running issue IDs.

Tracker reports	Action
Terminal state	Cancel worker. Mark workspace for cleanup after worker exits.
Still active	Update the in-memory issue snapshot. Worker continues.
Neither active nor terminal	Cancel worker. No workspace cleanup.
Fetch fails	Keep all workers running. Retry on next tick.
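
The table in Part B maps directly onto a classification function. The Action type, its constants, and RefreshAction are hypothetical names; case-insensitive matching of terminal states is an assumption.

```go
package main

import "strings"

// Action is a hypothetical result type for the tracker state refresh.
type Action int

const (
	KeepRunning      Action = iota // fetch failed: leave workers alone, retry next tick
	UpdateSnapshot                 // still active: refresh in-memory issue data
	CancelAndCleanup               // terminal: cancel worker, clean workspace after exit
	CancelNoCleanup                // neither active nor terminal: cancel worker only
)

// matches reports whether s is in states, ignoring case.
func matches(states []string, s string) bool {
	for _, x := range states {
		if strings.EqualFold(x, s) {
			return true
		}
	}
	return false
}

// RefreshAction picks the action for one running issue. Terminal is checked
// before active so that a state listed in both is treated as terminal.
func RefreshAction(fetchFailed bool, state string, active, terminal []string) Action {
	switch {
	case fetchFailed:
		return KeepRunning
	case matches(terminal, state):
		return CancelAndCleanup
	case matches(active, state):
		return UpdateSnapshot
	default:
		return CancelNoCleanup
	}
}
```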

Recovery at startup

When Sortie starts (or restarts after a crash), it reconstructs orchestration state from SQLite and the tracker.

Open SQLite database and apply schema migrations.
Load persisted retry entries. Reconstruct retry timers from stored due_at timestamps.
Enumerate workspace directories on disk and map directory names to issue identifiers.
Query the tracker for terminal-state issues among those with existing workspaces. Remove stale workspace directories.
Query the tracker for active issues. Reconcile with persisted state.
Begin the normal poll loop.

If the terminal-issue query fails at startup, Sortie logs a warning and continues - workspace cleanup is deferred to the next successful reconciliation.
