State Machine

Sortie maintains two layers of state for every issue it processes. The orchestration state tracks whether the orchestrator has claimed the issue and what it is doing with it. The run attempt phase tracks where a single agent invocation stands within its lifecycle. Both are independent of tracker states (To Do, In Progress): they are Sortie’s internal bookkeeping.

See also: WORKFLOW.md configuration for active_states, terminal_states, handoff_state, and in_progress_state; error reference for error kinds that trigger retries; CLI reference for --dry-run mode that simulates dispatch without launching agents; dashboard reference for real-time visibility into orchestration state.


Orchestration states

Every issue known to the orchestrator is in exactly one of five states. The orchestrator is the single authority for these transitions - no other component mutates scheduling state.

| State | Description |
| --- | --- |
| Unclaimed | The issue is not running and has no retry scheduled. Eligible for dispatch if it meets candidate selection rules. |
| Claimed | The orchestrator has reserved the issue to prevent duplicate dispatch. A claimed issue is always either Running or RetryQueued. |
| Running | A worker goroutine exists for this issue. The issue is tracked in the running map with a live RunningEntry. |
| RetryQueued | No worker is running, but a retry timer exists. The issue remains claimed until the timer fires and either re-dispatches or releases. |
| Released | The claim has been removed. The issue is no longer tracked. This happens when the issue reaches a terminal tracker state, leaves the active state set, is missing from the tracker, or exhausts its retry path. |

flowchart TD
    UC([Unclaimed]) --> RN

    subgraph Claimed
        RN[Running] --> RQ[RetryQueued]
        RQ --> RN
    end

    Claimed --> RL([Released])
    RL --> UC

    classDef idle fill:#f0f0f4,stroke:#8b8fa3,color:#3a3d4a
    classDef active fill:#dbeafe,stroke:#3b82f6,color:#1e3a5f,stroke-width:2px
    classDef waiting fill:#fef3c7,stroke:#d97706,color:#78350f
    classDef released fill:#f0f0f4,stroke:#8b8fa3,color:#3a3d4a,stroke-dasharray:5 5

    class UC idle
    class RN active
    class RQ waiting
    class RL released

    style Claimed fill:none,stroke:#3b82f6,stroke-width:2px,rx:8,color:#3b82f6

Transition details

Unclaimed → Claimed. Occurs during the dispatch phase of a poll tick. The issue must pass all candidate eligibility checks, which include an available global concurrency slot and, where configured, an available per-state slot. The issue enters Running immediately - a new claim is never created without a worker.

Running → RetryQueued. Four worker exit outcomes lead here (the normal-exit cases do not apply when a soft-stop signal is active; see Claimed → Released below):

  • Normal exit, issue still active, no soft-stop: continuation retry after 1 000 ms fixed delay.
  • Normal exit, handoff fails, no soft-stop: continuation retry after 1 000 ms.
  • Error exit, retryable: exponential backoff retry (see backoff formula).
  • Stall timeout: worker is killed; exponential backoff retry is scheduled.

RetryQueued → Running. The retry timer fires. The orchestrator re-fetches candidates, confirms the issue is still eligible, acquires a slot, and launches a new worker. If no slot is available, the retry is rescheduled with the same backoff.

Claimed → Released. The claim is removed and no retry is scheduled:

  • Reconciliation detects the tracker state is terminal or no longer in active_states.
  • The retry timer fires but the issue is absent from the candidate list.
  • The max_sessions budget is reached.
  • The worker error is classified as non-retryable.
  • A handoff_state transition succeeds (the tracker now owns the issue).
  • Soft-stop blocked: worker exits normally, claim released. No handoff transition, no continuation retry.
  • Soft-stop needs-human-review, handoff succeeds: worker exits normally, handoff transition performed, claim released.
  • Soft-stop needs-human-review, handoff fails: worker exits normally, handoff fails, claim released without retry.

Released → Unclaimed. A released issue can be re-dispatched on a future poll tick if its tracker state returns to an active state. The orchestrator does not remember previous releases - each poll tick evaluates eligibility from scratch.
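
The transitions above can be summarized as a small lookup table. The following Go sketch is illustrative only: the `State` type, the `allowed` map, and `CanTransition` are hypothetical names, not Sortie's actual implementation. Claimed is modeled as a grouping over Running and RetryQueued rather than as a separate value.

```go
package main

// State is a hypothetical type for the orchestration states in this section.
type State int

const (
	Unclaimed State = iota
	Running
	RetryQueued
	Released
)

// IsClaimed reports whether the state belongs to the Claimed group.
// A claimed issue is always either Running or RetryQueued.
func (s State) IsClaimed() bool {
	return s == Running || s == RetryQueued
}

// allowed lists the transitions described under "Transition details".
var allowed = map[State][]State{
	Unclaimed:   {Running},               // dispatch: claim and start a worker
	Running:     {RetryQueued, Released}, // worker exit or reconciliation
	RetryQueued: {Running, Released},     // retry timer fired
	Released:    {Unclaimed},             // eligible again on a later poll tick
}

// CanTransition reports whether from -> to is one of the documented transitions.
func CanTransition(from, to State) bool {
	for _, s := range allowed[from] {
		if s == to {
			return true
		}
	}
	return false
}
```

In this model `CanTransition(Unclaimed, RetryQueued)` is false, which mirrors the rule that a newly claimed issue always starts with a worker.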


Run attempt phases

Each worker attempt progresses through a linear sequence of phases. Terminal phases end the attempt and produce a WorkerResult delivered to the orchestrator.

| Phase | Description |
| --- | --- |
| DispatchTransition | Optional. When tracker.in_progress_state is configured, the worker calls TransitionIssue before workspace preparation. If the issue is already in the target state, the call is skipped (debug log only). Failure is non-fatal - the worker logs a warning and continues. |
| DispatchComment | Optional. When tracker.comments.on_dispatch is true, the worker posts a tracker comment acknowledging that Sortie has claimed the issue. Fires after the dispatch transition and before workspace preparation. Failure is non-fatal - the worker logs a warning and continues. |
| PreparingWorkspace | Workspace directory is created or reused. after_create and before_run hooks execute. |
| BuildingPrompt | The text/template prompt body is rendered with issue data, attempt number, and turn context. |
| LaunchingAgentProcess | The agent adapter starts a session (subprocess, API call, or mock). |
| InitializingSession | Waiting for the session_started event from the agent adapter. |
| StreamingTurn | The agent is actively working. Token usage, tool calls, and status events stream in. |
| Finishing | The turn ended. after_run hooks execute. The worker checks whether to loop for another turn. |
| Succeeded | Terminal. The worker completed all turns without error. |
| Failed | Terminal. An error occurred during any earlier phase. |
| TimedOut | Terminal. The turn exceeded agent.turn_timeout_ms. |
| Stalled | Terminal. No agent event arrived within agent.stall_timeout_ms. Detected by reconciliation. |
| CanceledByReconciliation | Terminal. The worker’s context was cancelled because the issue’s tracker state became terminal or left the active set. |

flowchart TD
    DT[DispatchTransition] --> DC[DispatchComment]
    DC --> PW[PreparingWorkspace]
    PW --> BP[BuildingPrompt]
    BP --> LA[LaunchingAgentProcess]
    LA --> IS[InitializingSession]
    IS --> ST[StreamingTurn]
    ST --> FN[Finishing]

    FN --> ST
    FN --> OK([Succeeded])

    ST --> TO([TimedOut])
    ST --> SL([Stalled])
    ST --> CR([CanceledByReconciliation])

    classDef phase fill:#dbeafe,stroke:#3b82f6,color:#1e3a5f
    classDef active fill:#bfdbfe,stroke:#2563eb,color:#1e3a5f,stroke-width:2px
    classDef success fill:#d1fae5,stroke:#059669,color:#064e3b,stroke-width:2px
    classDef failure fill:#fee2e2,stroke:#dc2626,color:#7f1d1d

    class DT,DC,PW,BP,LA,IS,FN phase
    class ST active
    class OK success
    class TO,SL,CR failure

Any phase from PreparingWorkspace through StreamingTurn can also transition to Failed on error. The specific error trigger for each phase is documented in the table above.
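
As a rough illustration, the phase list can be modeled as an ordered enumeration in which the terminal phases come last. This is a sketch under that assumption: the `Phase` type and the `IsTerminal` and `CanFail` helpers are hypothetical and only encode what the table and the sentence above state.

```go
package main

// Phase is a hypothetical enumeration of the run attempt phases,
// declared in the order given in the table above.
type Phase int

const (
	DispatchTransition Phase = iota
	DispatchComment
	PreparingWorkspace
	BuildingPrompt
	LaunchingAgentProcess
	InitializingSession
	StreamingTurn
	Finishing
	Succeeded
	Failed
	TimedOut
	Stalled
	CanceledByReconciliation
)

// IsTerminal reports whether the phase ends the attempt.
// It relies on the terminal phases being declared last.
func (p Phase) IsTerminal() bool {
	return p >= Succeeded
}

// CanFail reports whether the phase may transition to Failed on error:
// any phase from PreparingWorkspace through StreamingTurn.
func (p Phase) CanFail() bool {
	return p >= PreparingWorkspace && p <= StreamingTurn
}
```

The two dispatch phases are excluded from `CanFail` because their failures are documented as non-fatal.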

Multi-turn behavior

A single worker attempt can execute multiple agent turns. After each turn:

  1. The worker checks the tracker for the issue’s current state.
  2. If the state is still active and the turn count has not reached agent.max_turns, the worker loops back to StreamingTurn.
  3. The first turn uses the full rendered prompt. Continuation turns send only continuation guidance to the existing agent thread.
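
The per-turn decision above can be sketched in Go as two small functions. The names `shouldContinue` and `promptFor` are hypothetical, and the case-insensitive state comparison is borrowed from the candidate eligibility rules rather than confirmed for this code path.

```go
package main

import "strings"

// shouldContinue reports whether the worker loops back to StreamingTurn:
// the tracker state must still be active and the number of completed
// turns must be below agent.max_turns.
func shouldContinue(state string, activeStates []string, turnsDone, maxTurns int) bool {
	if turnsDone >= maxTurns {
		return false
	}
	for _, a := range activeStates {
		if strings.EqualFold(a, state) {
			return true
		}
	}
	return false
}

// promptFor picks the text sent to the agent: the full rendered prompt
// on the first turn, continuation guidance on every later turn.
func promptFor(turn int, fullPrompt, continuation string) string {
	if turn == 1 {
		return fullPrompt
	}
	return continuation
}
```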

Transition triggers

Six external events drive state transitions. Each is handled by the orchestrator’s single-writer event loop.

| Trigger | What happens |
| --- | --- |
| Poll tick | Reconcile running issues (stall detection + tracker state refresh). Run preflight validation. Fetch candidates. Sort by priority. Dispatch eligible issues until slots are exhausted. Dispatched workers perform the optional in-progress transition (via tracker.in_progress_state) and optional dispatch comment (via tracker.comments.on_dispatch) as their first steps. |
| Worker exit (normal) | Remove running entry. Persist run history to SQLite. Update token totals. Three outcome paths: (1) no soft-stop, issue active – schedule continuation retry or perform handoff transition (retry on handoff failure); (2) soft-stop blocked – release claim, no handoff, no retry; (3) soft-stop needs-human-review – perform handoff transition (if configured and issue active), release claim (no retry on handoff failure). Post completion comment if tracker.comments.on_completion is enabled (detached goroutine, non-blocking). |
| Worker exit (error) | Remove running entry. Persist run history. Classify error. If retryable, schedule exponential backoff retry. If not, release claim. Post failure comment if tracker.comments.on_failure is enabled (detached goroutine, non-blocking). |
| Agent update event | Update live session fields: token counters, session ID, thread ID, agent PID, rate limits, last activity timestamp. |
| Retry timer fired | Re-fetch candidates. If the issue is still eligible and slots are available, dispatch. If no slots, reschedule. If the issue is gone or inactive, release claim. |
| Reconciliation: tracker state refresh | For each running issue: terminal state → cancel worker, clean workspace. Still active → update snapshot. Neither active nor terminal → cancel worker, no cleanup. |
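
The three outcome paths of the "Worker exit (normal)" row can be written as one decision function. This is a sketch, not Sortie's code: all type and function names are hypothetical. It also makes two assumptions that the table does not spell out: without a soft-stop, a handoff is attempted whenever a handoff state is configured, and an issue that is no longer active is released.

```go
package main

// SoftStop is a hypothetical enumeration of the soft-stop signals.
type SoftStop int

const (
	NoSoftStop SoftStop = iota
	SoftStopBlocked
	SoftStopNeedsHumanReview
)

// Outcome is what the orchestrator does with the claim.
type Outcome int

const (
	ContinuationRetry Outcome = iota
	ReleaseClaim
)

// NormalExit holds the inputs to the decision (hypothetical shape).
type NormalExit struct {
	SoftStop          SoftStop
	IssueActive       bool
	HandoffConfigured bool
	HandoffSucceeded  bool // only meaningful when a handoff is attempted
}

// Decide returns whether a handoff transition is attempted and what
// happens to the claim afterwards.
func Decide(e NormalExit) (handoffAttempted bool, out Outcome) {
	switch e.SoftStop {
	case SoftStopBlocked:
		// Path 2: release, no handoff, no retry.
		return false, ReleaseClaim
	case SoftStopNeedsHumanReview:
		// Path 3: handoff if configured and active; release either way.
		return e.HandoffConfigured && e.IssueActive, ReleaseClaim
	}
	// Path 1: no soft-stop.
	if !e.IssueActive {
		return false, ReleaseClaim // assumption: inactive issues are released
	}
	if !e.HandoffConfigured {
		return false, ContinuationRetry
	}
	if e.HandoffSucceeded {
		return true, ReleaseClaim // the tracker now owns the issue
	}
	return true, ContinuationRetry // retry on handoff failure
}
```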

Candidate eligibility

An issue is eligible for dispatch when all conditions are true:

| Condition | Details |
| --- | --- |
| Required fields present | id, identifier, title, and state must be non-empty. |
| State is active | state is in tracker.active_states (case-insensitive). |
| State is not terminal | state is not in tracker.terminal_states. |
| Not running | id is not in the running map. |
| Not claimed | id is not in the claimed set. |
| Global slots available | running_count < polling.max_concurrent_agents. |
| Per-state slots available | Running count for this state < polling.max_concurrent_agents_by_state[state] (if configured). |
| Blockers resolved | No entry in blocked_by has a state that is in active_states. |
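
The conditions in the table can be read as a single predicate. The Go sketch below is a hedged illustration: `Issue`, `Blocker`, `Snapshot`, and `Eligible` are hypothetical names, the per-state maps are assumed to be keyed by lower-cased state name, and terminal states are compared case-insensitively by assumption (the table only states this for active states).

```go
package main

import "strings"

// Blocker and Issue are hypothetical shapes for tracker data.
type Blocker struct{ State string }

type Issue struct {
	ID, Identifier, Title, State string
	BlockedBy                    []Blocker
}

// Snapshot is a hypothetical view of orchestrator state and config.
type Snapshot struct {
	ActiveStates, TerminalStates []string
	Running, Claimed             map[string]struct{} // sets of issue IDs
	RunningByState               map[string]int      // lower-cased state -> count
	MaxConcurrent                int                 // polling.max_concurrent_agents
	MaxByState                   map[string]int      // lower-cased state -> limit
}

func containsFold(list []string, s string) bool {
	for _, v := range list {
		if strings.EqualFold(v, s) {
			return true
		}
	}
	return false
}

// Eligible applies every condition from the table in order.
func Eligible(is Issue, s Snapshot) bool {
	if is.ID == "" || is.Identifier == "" || is.Title == "" || is.State == "" {
		return false
	}
	if !containsFold(s.ActiveStates, is.State) || containsFold(s.TerminalStates, is.State) {
		return false
	}
	_, running := s.Running[is.ID]
	_, claimed := s.Claimed[is.ID]
	if running || claimed {
		return false
	}
	if len(s.Running) >= s.MaxConcurrent {
		return false
	}
	key := strings.ToLower(is.State)
	if limit, ok := s.MaxByState[key]; ok && s.RunningByState[key] >= limit {
		return false
	}
	for _, b := range is.BlockedBy {
		if containsFold(s.ActiveStates, b.State) {
			return false // blocker is still in an active state
		}
	}
	return true
}
```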

Issues are sorted for dispatch: priority ascending (nil last), created_at oldest first, identifier lexicographic tiebreaker.


Backoff formula

Sortie uses two retry delay strategies depending on the exit type.

Continuation retry (normal worker exit, issue still active):

$$\text{delay} = 1000 \text{ ms}$$

Error retry (worker failure, stall timeout):

$$\text{delay} = \min\left(10000 \times 2^{(\text{attempt} - 1)},\ \text{max\_retry\_backoff\_ms}\right)$$

Default max_retry_backoff_ms: 300 000 (5 minutes). Configurable via agent.max_retry_backoff_ms.

| Attempt | Delay |
| --- | --- |
| 1 | 10 s |
| 2 | 20 s |
| 3 | 40 s |
| 4 | 80 s |
| 5 | 160 s |
| 6+ | 300 s (cap) |

When a retry fires but no concurrency slot is available, the retry is rescheduled at the same backoff level with the error "no available orchestrator slots".


Reconciliation

Reconciliation runs at the start of every poll tick, before dispatch. It has two parts.

Part A - Stall detection. For each running issue, compute elapsed time since the last agent event (or started_at if no event has arrived). If elapsed exceeds agent.stall_timeout_ms, the worker is killed and an exponential backoff retry is scheduled. Disabled when stall_timeout_ms is zero or negative.

Part B - Tracker state refresh. Fetch current tracker states for all running issue IDs.

| Tracker reports | Action |
| --- | --- |
| Terminal state | Cancel worker. Mark workspace for cleanup after worker exits. |
| Still active | Update the in-memory issue snapshot. Worker continues. |
| Neither active nor terminal | Cancel worker. No workspace cleanup. |
| Fetch fails | Keep all workers running. Retry on next tick. |
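
The first three rows of the table amount to a classification of the refreshed tracker state. The sketch below uses hypothetical names (`Action`, `ReconcileAction`) and assumes case-insensitive matching for both state lists. The fetch-failure row is left to the caller, which would skip classification entirely and keep all workers running.

```go
package main

import "strings"

// Action is a hypothetical enumeration of reconciliation outcomes.
type Action int

const (
	UpdateSnapshot   Action = iota // still active: worker continues
	CancelAndCleanup               // terminal: cancel, clean workspace after exit
	CancelNoCleanup                // neither active nor terminal
)

func inFold(list []string, s string) bool {
	for _, v := range list {
		if strings.EqualFold(v, s) {
			return true
		}
	}
	return false
}

// ReconcileAction maps a running issue's current tracker state to an action.
func ReconcileAction(state string, active, terminal []string) Action {
	switch {
	case inFold(terminal, state):
		return CancelAndCleanup
	case inFold(active, state):
		return UpdateSnapshot
	default:
		return CancelNoCleanup
	}
}
```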

Recovery at startup

When Sortie starts (or restarts after a crash), it reconstructs orchestration state from SQLite and the tracker.

  1. Open SQLite database and apply schema migrations.
  2. Load persisted retry entries. Reconstruct retry timers from stored due_at timestamps.
  3. Enumerate workspace directories on disk and map directory names to issue identifiers.
  4. Query the tracker for terminal-state issues among those with existing workspaces. Remove stale workspace directories.
  5. Query the tracker for active issues. Reconcile with persisted state.
  6. Begin the normal poll loop.

If the terminal-issue query fails at startup, Sortie logs a warning and continues - workspace cleanup is deferred to the next successful reconciliation.
