Recovery Agent Framework

When systems fail,
agents recover.

CrisisMode is an open framework for building autonomous recovery agents that restore IT systems during severe incidents. Safety guarantees, forensic state preservation, and human-in-the-loop coordination — as first-class primitives.

crisismode — pnpm dev
"Agents propose, the framework disposes. Agents produce recovery plans describing what they intend to do. The framework validates, orchestrates, and enforces those plans."
— Recovery Agent Contract Specification v0.2.1

Built for the worst moment

CrisisMode is not a general-purpose automation platform. It is the tool an organization reaches for when normal operational tooling has failed or is insufficient.

01 — Safety

Safety by Default

Agents inherit safety guarantees from the framework. An agent that follows the contract cannot bypass state preservation, skip approval gates, or exceed its declared blast radius.

02 — Forensics

Forensic-First Recovery

The framework captures system state before mutating actions, preserving evidence for post-incident analysis, compliance, and learning — within the constraints of system health.

03 — Human-in-the-Loop

Structured Human Interaction

Notification, approval, escalation, and communication are first-class primitives with the same rigor as system actions. Not an afterthought — a design constraint.

04 — Speed

Approval Speed, Not Bypass

Pre-authorized action catalogs make approval fast for known scenarios. The system never provides mechanisms to skip approval under pressure.

05 — Trust

Graduated Trust

Agents earn autonomy over time through demonstrated reliability in specific scenarios and environments. Trust is scoped per agent, scenario, and environment.

06 — Resilience

Graceful Degradation

The framework sheds capabilities progressively as the environment degrades, rather than failing entirely. Recovery capability is always available at some level.

13 steps from alert to recovery

Every recovery follows a structured sequence. The framework orchestrates each phase — agents never interact directly with target systems.

Trigger Framework

Framework receives trigger from alert, health check, or manual invocation

Catalog Check Framework

Check pre-authorized action catalogs for matching scenario

Agent Selection Framework

Identify applicable agent based on trigger context and manifest declarations

Context Assembly Framework

Assemble context bundle — system topology, trust levels, organizational policies

Diagnosis Agent

Agent performs read-only investigation using provided context

Diagnostic Plan Agent

Agent may submit a lightweight diagnostic plan for investigative mutations

Plan Creation Agent

Agent produces a Recovery Plan — linear steps with bounded decision points

Plan Validation Framework

Validate plan against manifest, organizational policies, and blast radius

Catalog Match Framework

If plan matches a catalog entry, approval is pre-satisfied for covered risk levels

Human Gates Human

Execute approval gates per risk classification and trust level

Execution Framework

Orchestrate plan steps: snapshot → action → verify → notify

Replanning Agent

At declared checkpoints, agent may revise remaining plan based on current state

Completion Framework

Produce forensic record and trigger post-recovery notifications

Four layers, designed to shed

The framework operates as concentric layers. As the environment degrades, outer layers shed while core recovery remains available. A recovery tool that requires healthy infrastructure to operate is useless.

L4 Enrichment Phase 2

Advanced trust analytics, stakeholder communication rendering, observed impact monitoring, topology feedback loop

L3 Coordination Should be available

Human approval routing, escalation management, notification delivery, pre-authorized catalog matching, fallback approval

L2 Safety Must be available

State preservation capture, plan validation against manifest, blast radius enforcement, forensic record assembly

L1 Execution Kernel Always available

Sequential plan execution, command dispatch, precondition evaluation, success criteria checks, local audit log, stepwise rollback. Zero external dependencies.

Rigorous by design

The Recovery Agent Contract Specification defines the interface between agents and framework with the precision of a protocol specification. Every requirement is phased, every interaction is structured, every decision is auditable.

7
Step Types
4
Risk Levels
4
Trust Levels
4
Degradation Layers
3
Blast Radius Tiers
22
Specification Sections
Recovery Plans

Readable Under Pressure

Linear sequences with bounded decision points. A plan with 10 steps and one binary decision is comprehensible at 3 AM during a P1 outage. A 30-node graph is not.

Replanning

Adaptability Without Complexity

When conditions change beyond what a binary decision point can handle, the agent produces a new plan. Simple plans, clean audit trail, novel situations handled naturally.

Pre-Authorization

Reviewed in Calm, Activated in Crisis

Organizations pre-authorize specific recovery approaches during calm conditions. A crisis activates the pre-approved response — fast for the safe phase, controlled for the risky phase.

Recovery is too important to improvise

CrisisMode is open source. Read the specification, run the interactive demo, build your first recovery agent.