CrisisMode — Recovery Agent Framework

What Ships Today

Not a spec. A working tool.

Run crisismode scan with zero configuration. It detects what's running, checks health, and tells you what to look at first. No YAML, no credentials, no setup ceremony.

Recovery Agents

PostgreSQL, Redis, etcd, Kafka, Kubernetes, Ceph, and Flink, plus five agents for cross-cutting concerns

Escalation Levels

Observe, diagnose, suggest, repair-safe, repair-destructive — progressive depth under operator control
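One way to picture the five levels is as an ordered scale with an operator-set ceiling. The sketch below is illustrative only: the enum, the function name, and the idea of a single ceiling value are assumptions, not CrisisMode's actual API.

```python
from enum import IntEnum


class EscalationLevel(IntEnum):
    """The five levels named above, ordered from least to most invasive."""
    OBSERVE = 1
    DIAGNOSE = 2
    SUGGEST = 3
    REPAIR_SAFE = 4
    REPAIR_DESTRUCTIVE = 5


def is_permitted(requested: EscalationLevel, operator_ceiling: EscalationLevel) -> bool:
    """Hypothetical check: allow an action only at or below the operator's ceiling."""
    return requested <= operator_ceiling
```

With a ceiling of REPAIR_SAFE, a suggestion is permitted and a destructive repair is not.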

Check Formats

Native JSON, Nagios/Icinga, Goss YAML, and Sensu — thousands of existing checks work out of the box
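As an illustration of why existing checks carry over, Nagios-style plugins report status through a fixed exit-code convention (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) and optional performance data after a pipe character. A minimal sketch of normalizing such a result follows; the function name and the output dictionary shape are assumptions for illustration, not CrisisMode's native JSON format.

```python
# Standard Nagios plugin exit codes.
NAGIOS_STATES = {0: "OK", 1: "WARNING", 2: "CRITICAL", 3: "UNKNOWN"}


def parse_nagios_result(exit_code: int, stdout: str) -> dict:
    """Normalize a plugin's exit code and first output line into a plain dict."""
    first_line = stdout.splitlines()[0] if stdout.strip() else ""
    # Text before the first "|" is the human-readable message; the rest is perfdata.
    message, _, perfdata = first_line.partition("|")
    return {
        "status": NAGIOS_STATES.get(exit_code, "UNKNOWN"),
        "message": message.strip(),
        "perfdata": perfdata.strip(),
    }
```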

Modern Application Incidents

Bad deploy rollback
AI provider degradation and failover
Database migration failures
Queue and worker backlog
Config and environment drift

Stateful Infrastructure Recovery

Live PostgreSQL replication
Redis memory pressure
etcd consensus recovery
Kafka partition rebalance
Kubernetes node cascades
Ceph OSD recovery
Flink checkpoint failures

Core Capabilities

Built for the worst moment

CrisisMode is not a general-purpose automation platform. It is the tool an organization reaches for when normal operational tooling has failed or is insufficient.

01 — Safety

Safety by Default

Agents inherit safety guarantees from the framework. An agent that follows the contract cannot bypass state preservation, skip approval gates, or exceed its declared blast radius.
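The blast-radius guarantee can be pictured as a check the framework runs on the agent's behalf. The following is a rough sketch under the assumption that a blast radius is expressed as a set of target names; the exception and function names are hypothetical.

```python
from typing import Iterable, Set


class BlastRadiusExceeded(Exception):
    """Raised when a plan touches targets the agent did not declare."""


def check_blast_radius(plan_targets: Iterable[str], declared: Set[str]) -> None:
    """Reject any plan whose targets fall outside the declared set."""
    outside = sorted(set(plan_targets) - declared)
    if outside:
        raise BlastRadiusExceeded(f"plan touches undeclared targets: {outside}")
```

Because the check lives in the framework rather than in the agent, an agent cannot opt out of it.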

02 — Forensics

Forensic-First Recovery

The framework captures system state before mutating actions, preserving evidence for post-incident analysis, compliance, and learning — within the constraints of system health.
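A minimal sketch of the capture-then-mutate ordering, assuming state capture is a callable that may fail on a degraded system. The function name, the `capture_required` flag, and the log entry shape are illustrative assumptions, not the framework's interface.

```python
from typing import Callable, List


def run_mutating_action(
    capture_state: Callable[[], object],
    action: Callable[[], object],
    forensic_log: List[dict],
    capture_required: bool = True,
) -> object:
    """Capture state first, record it, and only then run the mutating action."""
    try:
        snapshot = capture_state()
    except Exception as exc:
        if capture_required:
            raise RuntimeError("state capture failed; refusing to mutate") from exc
        # Degraded mode (assumed): proceed, but record that evidence is missing.
        forensic_log.append({"snapshot": None, "capture_error": str(exc)})
    else:
        forensic_log.append({"snapshot": snapshot})
    return action()
```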

03 — Human-in-the-Loop

Structured Human Interaction

Notification, approval, escalation, and communication are first-class primitives with the same rigor as system actions. Not an afterthought — a design constraint.

04 — Speed

Approval Speed, Not Bypass

Pre-authorized action catalogs make approval fast for known scenarios. The system never provides mechanisms to skip approval under pressure.

05 — Trust

Graduated Trust

Agents earn autonomy over time through demonstrated reliability in specific scenarios and environments. Trust is scoped per agent, scenario, and environment.
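To make the scoping concrete, here is a small sketch in which trust is keyed by the (agent, scenario, environment) triple. The promotion threshold and the reset-on-failure rule are assumptions chosen for illustration; the source does not specify how trust is computed.

```python
from collections import defaultdict


class TrustLedger:
    """Tracks consecutive successes per (agent, scenario, environment)."""

    def __init__(self, promote_after: int = 3) -> None:
        self.promote_after = promote_after  # hypothetical threshold
        self._successes = defaultdict(int)

    def record(self, agent: str, scenario: str, environment: str, success: bool) -> None:
        key = (agent, scenario, environment)
        # Assumed policy: a single failure resets the streak for that scope only.
        self._successes[key] = self._successes[key] + 1 if success else 0

    def is_trusted(self, agent: str, scenario: str, environment: str) -> bool:
        return self._successes.get((agent, scenario, environment), 0) >= self.promote_after
```

Success in staging says nothing about production: each scope accumulates its own record.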

06 — Resilience

Graceful Degradation

The framework sheds capabilities progressively as the environment degrades, rather than failing entirely. Recovery capability is always available at some level.

Execution Flow

13 steps from alert to recovery

Every recovery follows a structured sequence. The framework orchestrates each phase — agents never interact directly with target systems.

1. Trigger (Framework)

Framework receives trigger from alert, health check, or manual invocation

2. Catalog Check (Framework)

Check pre-authorized action catalogs for matching scenario

3. Agent Selection (Framework)

Identify applicable agent based on trigger context and manifest declarations

4. Context Assembly (Framework)

Assemble context bundle: system topology, trust levels, organizational policies

5. Diagnosis (Agent)

Agent performs read-only investigation using provided context

6. Diagnostic Plan (Agent)

Agent may submit a lightweight diagnostic plan for investigative mutations

7. Plan Creation (Agent)

Agent produces a Recovery Plan: linear steps with bounded decision points

8. Plan Validation (Framework)

Validate plan against manifest, organizational policies, and blast radius

9. Catalog Match (Framework)

If plan matches a catalog entry, approval is pre-satisfied for covered risk levels

10. Human Gates (Human)

Execute approval gates per risk classification and trust level

11. Execution (Framework)

Orchestrate plan steps: snapshot → action → verify → notify

12. Replanning (Agent)

At declared checkpoints, agent may revise remaining plan based on current state

13. Completion (Framework)

Produce forensic record and trigger post-recovery notifications
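The execution phase's per-step cycle (snapshot → action → verify → notify) can be sketched as a simple loop. This is an illustration under assumptions: the `Step` fields, the stop-on-failed-verification behavior, and the record shape are hypothetical and not taken from the specification.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    name: str
    snapshot: Callable[[], object]  # captures state before the action
    action: Callable[[], None]      # the mutating operation
    verify: Callable[[], bool]      # success criteria check


def execute_plan(steps: List[Step], notify: Callable[[str], None]) -> List[dict]:
    """Run steps in order; halt at the first step that fails verification."""
    record = []
    for step in steps:
        before = step.snapshot()
        step.action()
        ok = step.verify()
        notify(f"{step.name}: {'ok' if ok else 'failed'}")
        record.append({"step": step.name, "snapshot": before, "verified": ok})
        if not ok:
            break  # assumed: remaining steps wait for replanning or rollback
    return record
```

The returned list doubles as raw material for the forensic record produced at completion.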

Degradation Architecture

Four layers, designed to shed

The framework operates as concentric layers. As the environment degrades, outer layers shed while core recovery remains available. A recovery tool that requires healthy infrastructure to operate is useless.

L4 Enrichment (Phase 2)

Advanced trust analytics, stakeholder communication rendering, observed impact monitoring, topology feedback loop

L3 Coordination (should be available)

Human approval routing, escalation management, notification delivery, pre-authorized catalog matching, fallback approval

L2 Safety (must be available)

State preservation capture, plan validation against manifest, blast radius enforcement, forensic record assembly

L1 Execution Kernel (always available)

Sequential plan execution, command dispatch, precondition evaluation, success criteria checks, local audit log, stepwise rollback. Zero external dependencies.
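A small sketch of outside-in shedding, assuming each outer layer depends on every layer beneath it, so that losing a layer also sheds the layers outside it. That dependency rule and the health-map input are assumptions made for illustration.

```python
from typing import Dict, List

# Ordered from the core outward.
LAYERS = ["L1 Execution Kernel", "L2 Safety", "L3 Coordination", "L4 Enrichment"]


def active_layers(healthy: Dict[str, bool]) -> List[str]:
    """Return the layers still in service; L1 is always included."""
    active = [LAYERS[0]]
    for layer in LAYERS[1:]:
        if not healthy.get(layer, False):
            break  # this layer and everything outside it is shed
        active.append(layer)
    return active
```

Even with nothing reported healthy, the function still returns the execution kernel, mirroring the "always available" guarantee.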

The Specification

Rigorous by design

The Recovery Agent Contract Specification defines the interface between agents and framework with the precision of a protocol specification. Every requirement is phased, every interaction is structured, every decision is auditable.

The specification is organized into sections that define step types, risk levels, trust levels, degradation layers, and blast radius tiers.

Recovery Plans

Readable Under Pressure

Linear sequences with bounded decision points. A plan with 10 steps and one binary decision is comprehensible at 3 AM during a P1 outage. A 30-node graph is not.
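One way to model "linear with bounded decision points" is a flat list in which a decision may only choose between two short lists of plain actions, so nesting is impossible by construction. The class names and the example step names below are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Union


@dataclass
class Action:
    name: str


@dataclass
class BinaryDecision:
    question: str
    check: Callable[[], bool]
    if_yes: List[Action]  # branches hold actions only, never further decisions
    if_no: List[Action]


PlanItem = Union[Action, BinaryDecision]


def resolve(plan: List[PlanItem]) -> List[str]:
    """Walk the plan top to bottom and return the action names actually taken."""
    path: List[str] = []
    for item in plan:
        if isinstance(item, BinaryDecision):
            branch = item.if_yes if item.check() else item.if_no
            path.extend(a.name for a in branch)
        else:
            path.append(item.name)
    return path
```

An operator can read such a plan straight down the page, which is the property the paragraph above argues for.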

Replanning

Adaptability Without Complexity

When conditions change beyond what a binary decision point can handle, the agent produces a new plan. Simple plans, clean audit trail, novel situations handled naturally.

Pre-Authorization

Reviewed in Calm, Activated in Crisis

Organizations pre-authorize specific recovery approaches during calm conditions. A crisis activates the pre-approved response — fast for the safe phase, controlled for the risky phase.
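The "fast for the safe phase, controlled for the risky phase" split might look like the sketch below, where a catalog entry covers steps up to a maximum risk level and anything above it still goes to a human gate. The risk level names, the `CatalogEntry` fields, and the scenario identifier are hypothetical placeholders; the source does not list the actual risk levels.

```python
from dataclasses import dataclass
from typing import List, Tuple

RISK_ORDER = ["routine", "elevated", "high"]  # hypothetical names, lowest first


@dataclass
class CatalogEntry:
    scenario: str
    max_covered_risk: str  # highest risk level pre-approved for this scenario


def steps_needing_approval(
    scenario: str, steps: List[Tuple[str, str]], catalog: List[CatalogEntry]
) -> List[str]:
    """Return names of steps whose risk exceeds what the catalog pre-authorizes."""
    entry = next((e for e in catalog if e.scenario == scenario), None)
    # With no matching entry, nothing is covered and every step needs approval.
    covered = -1 if entry is None else RISK_ORDER.index(entry.max_covered_risk)
    return [name for name, risk in steps if RISK_ORDER.index(risk) > covered]
```

Note that the function only narrows what needs approval; it never removes the gate for uncovered steps.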

When systems fail, agents recover.

Recovery is too important to improvise
