RD-OS State Persistence & Checkpoint System

Checkpoint-resume, state persistence, and progress recovery

“OpenClaw can restart and LLM context can be lost, but project progress must be recoverable.”


Core Problem

Challenges

┌─────────────────────────────────────────────────────────────────┐
│                    Scale Challenges                             │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  1. Agent Count: 1000+ agents                                   │
│     - Cannot store all state in LLM context                     │
│     - Cannot log every action to memory                         │
│     - Need aggregation + sampling                               │
│                                                                 │
│  2. Long-Running Tasks: Days to weeks                           │
│     - OpenClaw may restart                                      │
│     - Network may fail                                          │
│     - API rate limits may hit                                   │
│     - Need checkpoint + resume                                  │
│                                                                 │
│  3. Memory Limits: LLM context is finite                        │
│     - Cannot accumulate infinite history                        │
│     - Need summarization + pruning                              │
│     - Critical state must be external                           │
│                                                                 │
│  4. Progress Tracking: Need to know "where are we?"             │
│     - Which repos analyzed?                                     │
│     - Which repos migrated?                                     │
│     - Which agents active?                                      │
│     - Need persistent progress store                            │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Solution Architecture

State Persistence Layers

┌─────────────────────────────────────────────────────────────────┐
│                    State Persistence Architecture               │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Layer 0: Ephemeral (LLM Context)                               │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │  Current conversation, recent actions, working memory   │   │
│  │  ❌ Lost on restart                                      │   │
│  └─────────────────────────────────────────────────────────┘   │
│                              │                                  │
│                              ▼                                  │
│  Layer 1: Short-Term (Session State)                            │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │  memory/YYYY-MM-DD.md                                    │   │
│  │  Daily logs, recent events                               │   │
│  │  ⚠️ Survives restart, but not structured for recovery   │   │
│  └─────────────────────────────────────────────────────────┘   │
│                              │                                  │
│                              ▼                                  │
│  Layer 2: Medium-Term (Project State)                           │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │  .rd-os/state/                                           │   │
│  │  - agent-states/    (per-agent checkpoint)              │   │
│  │  - progress/        (aggregated progress)               │   │
│  │  - checkpoints/     (snapshot at milestones)            │   │
│  │  ✅ Structured for recovery                              │   │
│  └─────────────────────────────────────────────────────────┘   │
│                              │                                  │
│                              ▼                                  │
│  Layer 3: Long-Term (Durable Store)                             │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │  .rd-os/store/                                           │   │
│  │  - progress.db      (SQLite: definitive progress)       │   │
│  │  - agents.db        (SQLite: agent registry)            │   │
│  │  - artifacts/       (generated files, reports)          │   │
│  │  ✅ Source of truth, survives everything                │   │
│  └─────────────────────────────────────────────────────────┘   │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Design Principles

1. External State > LLM Context

❌ Bad: Store progress in conversation history
   - Lost on restart
   - Consumes context tokens
   - Hard to query

✅ Good: Store progress in files/database
   - Survives restart
   - No context cost
   - Easy to query

2. Checkpoint Early, Checkpoint Often

❌ Bad: Checkpoint only at end of batch
   - Lose entire batch on failure

✅ Good: Checkpoint after each unit of work
   - Lose only current unit
   - Fast recovery

3. Aggregation > Individual Tracking

❌ Bad: Track every action of 1000 agents
   - Too much data
   - Exceeds context limits

✅ Good: Aggregate state
   - Per-component summary
   - Sampling for details
   - On-demand drill-down

4. Idempotent Operations

❌ Bad: "Migrate repo X" (may duplicate if retried)
   - Risk of corruption

✅ Good: "Ensure repo X is migrated" (safe to retry)
   - Check state first
   - Skip if done
   - Safe to retry

State Storage Structure

Directory Layout

mono-repo/
└── .rd-os/
    ├── state/                      # Runtime state (can rebuild)
    │   ├── agent-states/           # Per-agent checkpoint
    │   │   ├── repo-001.state.json
    │   │   ├── repo-002.state.json
    │   │   └── ...
    │   ├── progress/               # Aggregated progress
    │   │   ├── analysis-progress.json
    │   │   ├── migration-progress.json
    │   │   └── daily-summary/
    │   │       ├── 2026-02-28.json
    │   │       └── ...
    │   └── checkpoints/            # Milestone snapshots
    │       ├── checkpoint-001-analysis-complete/
    │       ├── checkpoint-002-p0-migrated/
    │       └── ...
    │
    └── store/                      # Durable store (source of truth)
        ├── progress.db             # SQLite: definitive progress
        ├── agents.db               # SQLite: agent registry
        ├── artifacts/              # Generated outputs
        │   ├── analysis-report.json
        │   ├── migration-log.jsonl
        │   └── ...
        └── config/                 # Configuration
            ├── agents.yaml
            ├── workflows.yaml
            └── policies.yaml
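
The layout above can be bootstrapped with a short script. A minimal sketch (the subdirectory list is taken from the layout; `init_rd_os_dirs` is an illustrative helper name, not part of the system):

```python
from pathlib import Path

# Subdirectories from the layout above: state/ is rebuildable, store/ is durable
RD_OS_DIRS = [
    "state/agent-states",
    "state/progress/daily-summary",
    "state/checkpoints",
    "store/artifacts",
    "store/config",
]

def init_rd_os_dirs(repo_root: str) -> Path:
    """Create the .rd-os directory skeleton; safe to call repeatedly."""
    base = Path(repo_root) / ".rd-os"
    for rel in RD_OS_DIRS:
        (base / rel).mkdir(parents=True, exist_ok=True)
    return base
```

Because `mkdir(..., exist_ok=True)` is idempotent, this can run on every startup, which matches the "ensure, don't do" principle later in this document.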

Agent State Checkpoint

Per-Agent State File

// .rd-os/state/agent-states/repo-001.state.json
{
  "agent_id": "repo-001-analyzer",
  "repo_name": "pingcap/tidb",
  "status": "done",
  "created_at": "2026-02-28T10:00:00Z",
  "updated_at": "2026-02-28T10:15:00Z",
  
  "work": {
    "phase": "analysis",
    "subtask": "dependency_mapping",
    "progress_percent": 100,
    "items_total": 50,
    "items_completed": 50,
    "items_failed": 0
  },
  
  "result": {
    "success": true,
    "output_path": ".rd-os/store/artifacts/repo-001-analysis.json",
    "summary": {
      "lines_of_code": 652000,
      "dependencies": 127,
      "test_coverage": 78.5,
      "last_commit": "2026-02-28",
      "merge_recommendation": "P0-migrate"
    }
  },
  
  "checkpoint": {
    "last_action": "wrote_dependency_graph",
    "last_action_time": "2026-02-28T10:15:00Z",
    "can_resume": false,
    "resume_point": null
  },
  
  "errors": []
}
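
An agent may crash mid-write, and a half-written state file is worse than a stale one. One common approach, sketched here for the file above (`save_state`/`load_state` are illustrative names), is to write to a temp file and atomically rename it into place:

```python
import json
import os
import tempfile

def save_state(path: str, state: dict) -> None:
    """Atomically replace a state file: readers see old or new, never partial."""
    dir_name = os.path.dirname(path) or "."
    fd, tmp = tempfile.mkstemp(dir=dir_name, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f, indent=2)
            f.flush()
            os.fsync(f.fileno())  # ensure bytes hit disk before the rename
        os.replace(tmp, path)     # atomic on both POSIX and Windows
    except BaseException:
        os.unlink(tmp)            # never leave a partial temp file behind
        raise

def load_state(path: str) -> dict:
    with open(path) as f:
        return json.load(f)
```

The temp file is created in the same directory as the target so `os.replace` stays within one filesystem, which is what makes the rename atomic.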

State Transitions

┌─────────┐     ┌─────────┐     ┌─────────┐     ┌─────────┐
│ pending │────▶│ running │────▶│  done   │     │ failed  │
└─────────┘     └─────────┘     └─────────┘     └─────────┘
                    │   │                         ▲    │
                    │   └─────────────────────────┘    │
                    │     ┌─────────┐                  │
                    └────▶│ paused  │◀─────────────────┘
                          └─────────┘

State Checkpoint Triggers:

  1. State transition (pending → running → done)
  2. Every N items completed (e.g., every 10 repos analyzed)
  3. Before/after external API calls
  4. On error (for debugging)
  5. Periodic heartbeat (every 5 minutes)
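
Trigger 2 (every N items) reduces to a small counter wrapped around the save call. A sketch, assuming the caller supplies a `save` callback (the `EveryN` class name is illustrative):

```python
class EveryN:
    """Invoke a save callback every n completed items."""

    def __init__(self, n: int, save):
        self.n = n
        self.save = save
        self.completed = 0

    def item_done(self, state: dict) -> bool:
        """Record one finished item; checkpoint if we hit the interval."""
        self.completed += 1
        if self.completed % self.n == 0:
            self.save(state)
            return True
        return False
```

The return value lets the caller log checkpoint events; a crash loses at most `n - 1` items of progress, which is the trade-off named in design principle 2.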

Progress Tracking

Aggregated Progress (Batch Level)

// .rd-os/state/progress/analysis-progress.json
{
  "phase": "repository_analysis",
  "started_at": "2026-02-28T00:00:00Z",
  "updated_at": "2026-02-28T16:00:00Z",
  
  "summary": {
    "total_repos": 400,
    "analyzed": 150,
    "in_progress": 50,
    "pending": 200,
    "failed": 0,
    "progress_percent": 37.5
  },
  
  "by_priority": {
    "P0": { "total": 50, "analyzed": 50, "pending": 0 },
    "P1": { "total": 100, "analyzed": 80, "pending": 20 },
    "P2": { "total": 150, "analyzed": 20, "pending": 130 },
    "P3": { "total": 100, "analyzed": 0, "pending": 100 }
  },
  
  "current_batch": {
    "batch_id": "batch-003",
    "repos": ["repo-101", "repo-102", "..."],
    "started_at": "2026-02-28T14:00:00Z",
    "estimated_complete": "2026-02-28T18:00:00Z"
  },
  
  "rate": {
    "repos_per_hour": 25,
    "estimated_completion": "2026-03-01T08:00:00Z"
  }
}
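
The `rate` block can be maintained mechanically from the summary: remaining work divided by observed throughput gives the ETA. A sketch of that arithmetic (`estimate_eta` is an illustrative name):

```python
from datetime import datetime, timedelta

def estimate_eta(now: datetime, remaining: int, repos_per_hour: float) -> datetime:
    """Project completion time from the current throughput."""
    hours_left = remaining / repos_per_hour
    return now + timedelta(hours=hours_left)
```

Recomputing this on every progress update keeps `estimated_completion` honest as throughput changes, rather than trusting an estimate made at batch start.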

SQLite Schema (Definitive Store)

-- progress.db schema

-- Repository registry
CREATE TABLE repos (
    repo_id TEXT PRIMARY KEY,
    name TEXT NOT NULL,
    priority TEXT,  -- P0, P1, P2, P3
    category TEXT,  -- product, platform, tool, etc.
    created_at TIMESTAMP,
    updated_at TIMESTAMP
);

-- Analysis progress
CREATE TABLE analysis_state (
    repo_id TEXT PRIMARY KEY,
    status TEXT,  -- pending, running, done, failed
    progress_percent INTEGER,
    started_at TIMESTAMP,
    completed_at TIMESTAMP,
    result_json TEXT,
    error_message TEXT,
    FOREIGN KEY (repo_id) REFERENCES repos(repo_id)
);

-- Migration progress
CREATE TABLE migration_state (
    repo_id TEXT PRIMARY KEY,
    status TEXT,  -- pending, running, done, failed
    phase TEXT,   -- prep, transfer, integrate, validate
    progress_percent INTEGER,
    started_at TIMESTAMP,
    completed_at TIMESTAMP,
    result_json TEXT,
    error_message TEXT,
    FOREIGN KEY (repo_id) REFERENCES repos(repo_id)
);

-- Agent registry
CREATE TABLE agents (
    agent_id TEXT PRIMARY KEY,
    type TEXT,    -- analyzer, migrator, guardian, etc.
    assigned_repo_id TEXT,
    status TEXT,  -- active, idle, paused, error
    last_heartbeat TIMESTAMP,
    FOREIGN KEY (assigned_repo_id) REFERENCES repos(repo_id)
);

-- Checkpoints
CREATE TABLE checkpoints (
    checkpoint_id TEXT PRIMARY KEY,
    checkpoint_type TEXT,  -- batch, milestone, periodic
    created_at TIMESTAMP,
    state_snapshot TEXT,   -- JSON of full state
    recoverable BOOLEAN
);

-- Event log (for debugging/audit)
CREATE TABLE events (
    event_id TEXT PRIMARY KEY,
    timestamp TIMESTAMP,
    event_type TEXT,
    agent_id TEXT,
    repo_id TEXT,
    details TEXT
);

-- Indexes for fast queries
CREATE INDEX idx_analysis_status ON analysis_state(status);
CREATE INDEX idx_migration_status ON migration_state(status);
CREATE INDEX idx_agent_status ON agents(status);
CREATE INDEX idx_events_timestamp ON events(timestamp);
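
A sketch of driving this schema from Python's built-in `sqlite3`, abridged to the `repos` and `analysis_state` tables. The `INSERT OR REPLACE` makes status writes idempotent, matching the retry principle (helper names are illustrative):

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS repos (
    repo_id TEXT PRIMARY KEY,
    name TEXT NOT NULL,
    priority TEXT
);
CREATE TABLE IF NOT EXISTS analysis_state (
    repo_id TEXT PRIMARY KEY,
    status TEXT,
    progress_percent INTEGER,
    FOREIGN KEY (repo_id) REFERENCES repos(repo_id)
);
CREATE INDEX IF NOT EXISTS idx_analysis_status ON analysis_state(status);
"""

def open_progress_db(path: str) -> sqlite3.Connection:
    db = sqlite3.connect(path)
    db.executescript(SCHEMA)  # IF NOT EXISTS keeps this safe on every startup
    return db

def set_analysis_state(db, repo_id: str, status: str, percent: int) -> None:
    """Idempotent status write: replaying the same update is harmless."""
    with db:  # implicit transaction, committed on success
        db.execute(
            "INSERT OR REPLACE INTO analysis_state VALUES (?, ?, ?)",
            (repo_id, status, percent),
        )
```

Wrapping each write in `with db:` gives transactional updates, so a crash mid-write never leaves a torn row in the source of truth.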

Recovery Protocol

Restart Recovery Flow

┌─────────────────────────────────────────────────────────────────┐
│              OpenClaw Restart → Recovery Flow                   │
└─────────────────────────────────────────────────────────────────┘

OpenClaw Starts
       │
       ▼
┌─────────────────────────────────────────────────────────────┐
│  1. Load Configuration                                      │
│     ├─ Read .rd-os/config/agents.yaml                       │
│     ├─ Read .rd-os/config/workflows.yaml                    │
│     └─ Initialize agent registry                            │
└─────────────────────────────────────────────────────────────┘
       │
       ▼
┌─────────────────────────────────────────────────────────────┐
│  2. Load State from Durable Store                           │
│     ├─ Query progress.db: what's done?                      │
│     ├─ Query agents.db: what agents exist?                  │
│     └─ Build in-memory state                                │
└─────────────────────────────────────────────────────────────┘
       │
       ▼
┌─────────────────────────────────────────────────────────────┐
│  3. Reconcile State                                         │
│     ├─ Compare expected vs actual state                     │
│     ├─ Find incomplete work                                 │
│     └─ Identify recoverable tasks                           │
└─────────────────────────────────────────────────────────────┘
       │
       ▼
┌─────────────────────────────────────────────────────────────┐
│  4. Resume Incomplete Work                                  │
│     ├─ For each incomplete task:                            │
│     │   ├─ Check if resumable                               │
│     │   ├─ Load checkpoint (if exists)                      │
│     │   └─ Resume from checkpoint                           │
│     └─ For non-resumable: restart from beginning            │
└─────────────────────────────────────────────────────────────┘
       │
       ▼
┌─────────────────────────────────────────────────────────────┐
│  5. Resume Agent Execution                                  │
│     ├─ Spawn agents for pending work                        │
│     ├─ Resume paused agents                                 │
│     └─ Continue normal operation                            │
└─────────────────────────────────────────────────────────────┘

Recovery Complete

Recovery Example

# Pseudo-code: Recovery logic

async def recover_after_restart():
    # Load durable state
    db = load_database(".rd-os/store/progress.db")
    
    # Find incomplete analysis (running or pending)
    incomplete = db.query("""
        SELECT repo_id, progress_percent
        FROM analysis_state
        WHERE status IN ('running', 'pending')
    """)

    for task in incomplete:
        if task.progress_percent > 0:
            # Has progress - resume from the agent's last checkpoint
            checkpoint = load_checkpoint(task.repo_id)
            await resume_analysis(task.repo_id, checkpoint)
        else:
            # No progress - restart from scratch
            await start_analysis(task.repo_id)
    
    # Find incomplete migrations
    # ... similar logic
    
    # Resume agents
    agents = db.query("SELECT * FROM agents WHERE status = 'active'")
    for agent in agents:
        await resume_agent(agent.agent_id)
    
    log.info(f"Recovery complete: {len(incomplete)} tasks resumed")

Checkpoint Strategy

Checkpoint Types

Type        Frequency         Content               Use Case
─────────   ───────────────   ───────────────────   ───────────────────
Micro       Every action      Agent state           Crash recovery
Batch       Every N items     Batch summary         Batch resume
Milestone   Phase complete    Full state snapshot   Phase resume
Periodic    Every N minutes   Aggregated progress   Time-based recovery

Checkpoint Implementation

# Pseudo-code: Checkpoint manager

class CheckpointManager:
    def __init__(self, base_path: str):
        self.base_path = base_path
        self.state_path = f"{base_path}/state"
        self.store_path = f"{base_path}/store"
    
    def save_agent_state(self, agent_id: str, state: dict):
        """Save per-agent checkpoint (micro)"""
        path = f"{self.state_path}/agent-states/{agent_id}.state.json"
        state['checkpoint_time'] = now()
        write_json(path, state)
        
        # Also update SQLite
        db.execute("""
            INSERT OR REPLACE INTO agent_states (agent_id, state_json, updated_at)
            VALUES (?, ?, ?)
        """, (agent_id, json.dumps(state), now()))
    
    def save_batch_progress(self, batch_id: str, progress: dict):
        """Save batch progress (batch)"""
        path = f"{self.state_path}/progress/{batch_id}.json"
        write_json(path, progress)
        
        # Update SQLite summary
        db.execute("""
            UPDATE batch_progress
            SET progress_json = ?, updated_at = ?
            WHERE batch_id = ?
        """, (json.dumps(progress), now(), batch_id))
    
    def save_milestone(self, milestone_name: str):
        """Save full state snapshot (milestone)"""
        checkpoint_id = f"checkpoint-{milestone_name}-{timestamp()}"
        path = f"{self.state_path}/checkpoints/{checkpoint_id}"
        
        # Snapshot everything
        snapshot = {
            'milestone': milestone_name,
            'timestamp': now(),
            'analysis_state': db.query_all("SELECT * FROM analysis_state"),
            'migration_state': db.query_all("SELECT * FROM migration_state"),
            'agent_state': db.query_all("SELECT * FROM agents"),
            'progress_summary': self.calculate_progress_summary()
        }
        
        write_json(f"{path}/snapshot.json", snapshot)
        
        # Record in SQLite
        db.execute("""
            INSERT INTO checkpoints (checkpoint_id, checkpoint_type, created_at, state_snapshot, recoverable)
            VALUES (?, ?, ?, ?, ?)
        """, (checkpoint_id, 'milestone', now(), json.dumps(snapshot), True))
        
        return checkpoint_id
    
    def load_checkpoint(self, checkpoint_id: str) -> dict:
        """Load checkpoint for recovery"""
        path = f"{self.state_path}/checkpoints/{checkpoint_id}/snapshot.json"
        return read_json(path)
    
    def get_recovery_state(self) -> dict:
        """Get current state for recovery"""
        return {
            'analysis': db.query_all("SELECT * FROM analysis_state WHERE status != 'done'"),
            'migration': db.query_all("SELECT * FROM migration_state WHERE status != 'done'"),
            'agents': db.query_all("SELECT * FROM agents WHERE status != 'idle'"),
            'latest_checkpoint': db.query_one("SELECT * FROM checkpoints ORDER BY created_at DESC LIMIT 1")
        }

Progress Aggregation (Avoiding Context Explosion)

Hierarchical Aggregation

Level 0: Individual Agent (1000+ agents)
├─ repo-001-analyzer: done
├─ repo-002-analyzer: running (50%)
├─ repo-003-analyzer: pending
└─ ... (1000+ entries - too many for context)
         │
         ▼ Aggregate (every 10 agents)
Level 1: Batch Summary (100 batches)
├─ batch-001: 10/10 done
├─ batch-002: 8/10 done, 2 running
├─ batch-003: 0/10 done, 10 pending
└─ ... (100 entries - still too many)
         │
         ▼ Aggregate (by priority)
Level 2: Priority Summary (4 priorities)
├─ P0: 50/50 done (100%)
├─ P1: 80/100 done (80%)
├─ P2: 20/150 done (13%)
└─ P3: 0/100 done (0%)
         │
         ▼ Aggregate (overall)
Level 3: Overall Summary (fits in context)
└─ Total: 150/400 done (37.5%)
         - 50 in progress
         - 200 pending
         - 0 failed
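
The collapse from Level 0 to Level 2 is a group-by over the per-agent state records. A sketch, assuming each record carries `priority` and `status` fields (the field names follow the checkpoint format above; the function name is illustrative):

```python
from collections import Counter

def aggregate_by_priority(agent_states: list) -> dict:
    """Collapse per-agent records into a per-priority summary (Level 2)."""
    counts = {}
    for s in agent_states:
        counts.setdefault(s["priority"], Counter())[s["status"]] += 1
    return {
        prio: {"total": sum(c.values()), "done": c["done"], "pending": c["pending"]}
        for prio, c in counts.items()
    }
```

Running this over 1000+ state files yields a four-entry dict, which is what actually enters the LLM context; the raw records stay on disk for drill-down.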

Context-Friendly Progress Report

// What goes into LLM context (small, actionable)
{
  "phase": "repository_analysis",
  "overall": {
    "total": 400,
    "done": 150,
    "in_progress": 50,
    "pending": 200,
    "failed": 0,
    "percent": 37.5
  },
  "by_priority": {
    "P0": "100% done ✅",
    "P1": "80% done 🏃",
    "P2": "13% done 🏃",
    "P3": "0% done ⏳"
  },
  "current_focus": "P1 batch-009 (8/10 done)",
  "next_up": "P1 batch-010 (10 repos)",
  "eta": "2026-03-01T08:00:00Z",
  "issues": [],
  "last_checkpoint": "checkpoint-batch-008-20260228-1400"
}

Key: Detailed state in SQLite, summary in context.


Idempotent Operations

Pattern: “Ensure” Instead of “Do”

# ❌ Bad: Not idempotent
async def migrate_repo(repo_id: str):
    """Migrate repo - may duplicate if retried"""
    transfer_code(repo_id)
    update_build_config(repo_id)
    mark_migrated(repo_id)
    # If fails after transfer, retry duplicates!

# ✅ Good: Idempotent
async def ensure_repo_migrated(repo_id: str):
    """Ensure repo is migrated - safe to retry"""
    # Check current state
    state = get_migration_state(repo_id)
    
    if state == 'done':
        log.info(f"{repo_id} already migrated, skipping")
        return
    
    if state == 'transfer_complete':
        log.info(f"{repo_id} transfer done, resuming config update")
        update_build_config(repo_id)
        mark_migrated(repo_id)
        return
    
    # Start from beginning
    transfer_code(repo_id)
    update_build_config(repo_id)
    mark_migrated(repo_id)

State Machine for Migration

┌─────────┐     ┌─────────┐     ┌─────────┐     ┌─────────┐     ┌─────────┐
│ pending │────▶│  prep   │────▶│ transfer│────▶│integrate│────▶│  done   │
└─────────┘     └─────────┘     └─────────┘     └─────────┘     └─────────┘
                    │                 │                 │
                    ▼                 ▼                 ▼
               [prep_done]      [transfer_done]   [integrate_done]
               
Each state transition is checkpointed.
Retry from last completed state.
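
That resume-from-last-phase logic can be sketched as a walk over the ordered phase list, with per-phase handlers supplied by the caller (names are illustrative):

```python
# Ordered phases from the state machine above; "done" follows the last one
PHASES = ["prep", "transfer", "integrate"]

def resume_migration(repo_id, last_done, handlers):
    """Run only the phases after last_done (None means start from scratch)."""
    start = PHASES.index(last_done) + 1 if last_done else 0
    executed = []
    for phase in PHASES[start:]:
        handlers[phase](repo_id)  # on failure, last_done is unchanged -> safe retry
        executed.append(phase)
    return executed
```

As long as each handler checkpoints its phase marker only after succeeding, a retry re-enters here and skips everything already done, which is exactly the idempotency property the section describes.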

Monitoring & Observability

Progress Dashboard (Query SQLite)

-- Overall progress
SELECT 
    COUNT(*) as total,
    SUM(CASE WHEN status = 'done' THEN 1 ELSE 0 END) as done,
    SUM(CASE WHEN status = 'running' THEN 1 ELSE 0 END) as running,
    SUM(CASE WHEN status = 'pending' THEN 1 ELSE 0 END) as pending,
    ROUND(100.0 * SUM(CASE WHEN status = 'done' THEN 1 ELSE 0 END) / COUNT(*), 2) as percent
FROM analysis_state;

-- Progress by priority
SELECT 
    r.priority,
    COUNT(*) as total,
    SUM(CASE WHEN a.status = 'done' THEN 1 ELSE 0 END) as done,
    ROUND(100.0 * SUM(CASE WHEN a.status = 'done' THEN 1 ELSE 0 END) / COUNT(*), 2) as percent
FROM repos r
JOIN analysis_state a ON r.repo_id = a.repo_id
GROUP BY r.priority;

-- Agent health
SELECT 
    status,
    COUNT(*) as count,
    MAX(last_heartbeat) as last_activity
FROM agents
GROUP BY status;

-- Recent failures
SELECT 
    repo_id,
    error_message,
    updated_at
FROM analysis_state
WHERE status = 'failed'
ORDER BY updated_at DESC
LIMIT 10;

Alerting

# .rd-os/config/alerts.yaml
alerts:
  - name: high_failure_rate
    condition: "failed_count / total_count > 0.05"
    severity: warning
    action: notify_human

  - name: stalled_progress
    condition: "no_progress_for_minutes > 60"
    severity: warning
    action: notify_human

  - name: agent_down
    condition: "agent_heartbeat_age_minutes > 10"
    severity: critical
    action: notify_human + restart_agent

  - name: checkpoint_age
    condition: "last_checkpoint_age_minutes > 30"
    severity: warning
    action: force_checkpoint
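
A sketch of evaluating such rules against a metrics snapshot. Conditions are written here as plain Python predicates rather than the string expressions in the YAML (evaluating those strings would need an expression parser); the rule names and thresholds mirror the config above:

```python
def check_alerts(metrics: dict) -> list:
    """Return the names of alert rules that fire for a metrics snapshot."""
    rules = {
        "high_failure_rate": lambda m: m["failed_count"] / m["total_count"] > 0.05,
        "stalled_progress": lambda m: m["no_progress_for_minutes"] > 60,
        "agent_down": lambda m: m["agent_heartbeat_age_minutes"] > 10,
        "checkpoint_age": lambda m: m["last_checkpoint_age_minutes"] > 30,
    }
    return [name for name, cond in rules.items() if cond(metrics)]
```

The caller maps each fired name back to its `action` (notify, restart, force checkpoint); keeping evaluation separate from action makes the rules easy to test.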

Implementation Checklist

Phase 1: Basic Persistence

  • Create .rd-os/state/ and .rd-os/store/ directories
  • Implement JSON state file writer
  • Implement per-agent checkpoint
  • Implement progress.db SQLite schema
  • Add checkpoint triggers (per-action, per-batch)

Phase 2: Recovery

  • Implement recovery protocol
  • Test restart recovery (simulate crash)
  • Implement idempotent operations
  • Add state reconciliation logic

Phase 3: Aggregation

  • Implement hierarchical aggregation
  • Create context-friendly progress summaries
  • Add drill-down queries (on-demand details)

Phase 4: Monitoring

  • Create progress dashboard (CLI or web)
  • Implement alerting rules
  • Add checkpoint management (list, restore, prune)

Example: Recovery After OpenClaw Restart

Scenario: OpenClaw restarts during repo analysis (150/400 done)

1. OpenClaw starts
   └─> RD-OS initialization

2. Load .rd-os/store/progress.db
   └─> Query: What's the state?
   └─> Result: 150 done, 50 running, 200 pending

3. Reconcile running tasks
   └─> For each "running" task:
       ├─> Load agent state from .rd-os/state/agent-states/
       ├─> Check if resumable
       └─> Resume or restart

4. Resume agents
   └─> Spawn 50 agents for running tasks
   └─> Spawn agents for pending tasks (up to concurrency limit)

5. Continue normal operation
   └─> Analysis continues from 150/400 (37.5%)
   └─> No work lost, no duplication

Total recovery time: <1 minute
Work lost: 0 (if micro-checkpointing) or <1 batch (if batch-checkpointing)

Conclusion

Key Principles:

  1. External State - Never rely on LLM context for progress
  2. Frequent Checkpoints - Checkpoint every unit of work
  3. Idempotent Operations - Safe to retry anything
  4. Hierarchical Aggregation - Summary in context, details in DB
  5. Recovery Protocol - Automated recovery on restart

Result:

  • OpenClaw can restart anytime
  • LLM context can be lost
  • Progress is never lost
  • Work resumes automatically
  • No manual intervention needed

This is how you build a system that runs for weeks with 1000+ agents.


“The system must be resilient to failure, because at scale, failure is inevitable.”