CrisisMode is an open framework for building autonomous recovery agents that restore IT systems during severe incidents. Safety guarantees, forensic state preservation, and human-in-the-loop coordination — as first-class primitives.
CrisisMode is not a general-purpose automation platform. It is the tool an organization reaches for when normal operational tooling has failed or is insufficient.
Agents inherit safety guarantees from the framework. An agent that follows the contract cannot bypass state preservation, skip approval gates, or exceed its declared blast radius.
The framework captures system state before mutating actions, preserving evidence for post-incident analysis, compliance, and learning — within the constraints of system health.
Notification, approval, escalation, and communication are first-class primitives with the same rigor as system actions. Not an afterthought — a design constraint.
Pre-authorized action catalogs make approval fast for known scenarios. The system never provides mechanisms to skip approval under pressure.
Agents earn autonomy over time through demonstrated reliability in specific scenarios and environments. Trust is scoped per agent, scenario, and environment.
The framework sheds capabilities progressively as the environment degrades, rather than failing entirely. Recovery capability is always available at some level.
Every recovery follows a structured sequence. The framework orchestrates each phase — agents never interact directly with target systems.
Framework receives trigger from alert, health check, or manual invocation
Check pre-authorized action catalogs for matching scenario
Identify applicable agent based on trigger context and manifest declarations
Assemble context bundle — system topology, trust levels, organizational policies
Agent performs read-only investigation using provided context
Agent may submit a lightweight diagnostic plan for investigative mutations
Agent produces a Recovery Plan — linear steps with bounded decision points
Validate plan against manifest, organizational policies, and blast radius
If plan matches a catalog entry, approval is pre-satisfied for covered risk levels
Execute approval gates per risk classification and trust level
Orchestrate plan steps: snapshot → action → verify → notify
At declared checkpoints, agent may revise remaining plan based on current state
Produce forensic record and trigger post-recovery notifications
The framework operates as concentric layers. As the environment degrades, outer layers shed while core recovery remains available. A recovery tool that requires healthy infrastructure to operate is useless.
Advanced trust analytics, stakeholder communication rendering, observed impact monitoring, topology feedback loop
Human approval routing, escalation management, notification delivery, pre-authorized catalog matching, fallback approval
State preservation capture, plan validation against manifest, blast radius enforcement, forensic record assembly
Sequential plan execution, command dispatch, precondition evaluation, success criteria checks, local audit log, stepwise rollback. Zero external dependencies.
The Recovery Agent Contract Specification defines the interface between agents and framework with the precision of a protocol specification. Every requirement is phased, every interaction is structured, every decision is auditable.
Linear sequences with bounded decision points. A plan with 10 steps and one binary decision is comprehensible at 3 AM during a P1 outage. A 30-node graph is not.
When conditions change beyond what a binary decision point can handle, the agent produces a new plan. Simple plans, clean audit trail, novel situations handled naturally.
Organizations pre-authorize specific recovery approaches during calm conditions. A crisis activates the pre-approved response — fast for the safe phase, controlled for the risky phase.
CrisisMode is open source. Read the specification, run the interactive demo, build your first recovery agent.