RD-OS State Persistence & Checkpoint System
Checkpoint/resume, state persistence, and progress recovery
“OpenClaw can restart and the LLM context can be lost, but project progress must be recoverable.”
Core Problem
Challenges
┌─────────────────────────────────────────────────────────────────┐
│ Scale Challenges │
├─────────────────────────────────────────────────────────────────┤
│ │
│ 1. Agent Count: 1000+ agents │
│ - Cannot store all state in LLM context │
│ - Cannot log every action to memory │
│ - Need aggregation + sampling │
│ │
│ 2. Long-Running Tasks: Days to weeks │
│ - OpenClaw may restart │
│ - Network may fail │
│ - API rate limits may hit │
│ - Need checkpoint + resume │
│ │
│ 3. Memory Limits: LLM context is finite │
│ - Cannot accumulate infinite history │
│ - Need summarization + pruning │
│ - Critical state must be external │
│ │
│ 4. Progress Tracking: Need to know "where are we?" │
│ - Which repos analyzed? │
│ - Which repos migrated? │
│ - Which agents active? │
│ - Need persistent progress store │
│ │
└─────────────────────────────────────────────────────────────────┘
Solution Architecture
State Persistence Layers
┌─────────────────────────────────────────────────────────────────┐
│ State Persistence Architecture │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Layer 0: Ephemeral (LLM Context) │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ Current conversation, recent actions, working memory │ │
│ │ ❌ Lost on restart │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ Layer 1: Short-Term (Session State) │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ memory/YYYY-MM-DD.md │ │
│ │ Daily logs, recent events │ │
│ │ ⚠️ Survives restart, but not structured for recovery │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ Layer 2: Medium-Term (Project State) │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ .rd-os/state/ │ │
│ │ - agent-states/ (per-agent checkpoint) │ │
│ │ - progress/ (aggregated progress) │ │
│ │ - checkpoints/ (snapshot at milestones) │ │
│ │ ✅ Structured for recovery │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ Layer 3: Long-Term (Durable Store) │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ .rd-os/store/ │ │
│ │ - progress.db (SQLite: definitive progress) │ │
│ │ - agents.db (SQLite: agent registry) │ │
│ │ - artifacts/ (generated files, reports) │ │
│ │ ✅ Source of truth, survives everything │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────┘
Design Principles
1. External State > LLM Context
❌ Bad: Store progress in conversation history
- Lost on restart
- Consumes context tokens
- Hard to query
✅ Good: Store progress in files/database
- Survives restart
- No context cost
- Easy to query
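The contrast can be sketched in a few lines of Python. `ProgressStore` and its atomic-write pattern are illustrative, not part of RD-OS:

```python
import json
import os
import tempfile

class ProgressStore:
    """Illustrative file-backed progress store (name is hypothetical).

    Progress lives on disk, not in the LLM context, so it survives
    restarts and consumes no context tokens.
    """

    def __init__(self, path: str):
        self.path = path

    def save(self, progress: dict) -> None:
        # Write to a temp file, then rename: a crash mid-write
        # never leaves a half-written store behind.
        tmp = self.path + ".tmp"
        with open(tmp, "w") as f:
            json.dump(progress, f)
        os.replace(tmp, self.path)

    def load(self) -> dict:
        if not os.path.exists(self.path):
            return {}
        with open(self.path) as f:
            return json.load(f)

# Progress survives a "restart" (a fresh ProgressStore instance).
path = os.path.join(tempfile.mkdtemp(), "analysis-progress.json")
ProgressStore(path).save({"analyzed": 150, "total": 400})
restored = ProgressStore(path).load()
```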
2. Checkpoint Early, Checkpoint Often
❌ Bad: Checkpoint only at end of batch
- Lose entire batch on failure
✅ Good: Checkpoint after each unit of work
- Lose only current unit
- Fast recovery
3. Aggregation > Individual Tracking
❌ Bad: Track every action of 1000 agents
- Too much data
- Exceeds context limits
✅ Good: Aggregate state
- Per-component summary
- Sampling for details
- On-demand drill-down
4. Idempotent Operations
❌ Bad: "Migrate repo X" (may duplicate if retried)
- Risk of corruption
✅ Good: "Ensure repo X is migrated" (safe to retry)
- Check state first
- Skip if done
- Safe to retry
State Storage Structure
Directory Layout
mono-repo/
└── .rd-os/
├── state/ # Runtime state (can rebuild)
│ ├── agent-states/ # Per-agent checkpoint
│ │ ├── repo-001.state.json
│ │ ├── repo-002.state.json
│ │ └── ...
│ ├── progress/ # Aggregated progress
│ │ ├── analysis-progress.json
│ │ ├── migration-progress.json
│ │ └── daily-summary/
│ │ ├── 2026-02-28.json
│ │ └── ...
│ └── checkpoints/ # Milestone snapshots
│ ├── checkpoint-001-analysis-complete/
│ ├── checkpoint-002-p0-migrated/
│ └── ...
│
└── store/ # Durable store (source of truth)
├── progress.db # SQLite: definitive progress
├── agents.db # SQLite: agent registry
├── artifacts/ # Generated outputs
│ ├── analysis-report.json
│ ├── migration-log.jsonl
│ └── ...
└── config/ # Configuration
├── agents.yaml
├── workflows.yaml
└── policies.yaml
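The layout above can be bootstrapped with a short script; `init_rd_os` is a hypothetical helper, and creation is idempotent so it is safe to run on every start:

```python
import tempfile
from pathlib import Path

# Directory layout from the design above (relative to .rd-os/).
LAYOUT = [
    "state/agent-states",
    "state/progress/daily-summary",
    "state/checkpoints",
    "store/artifacts",
    "store/config",
]

def init_rd_os(root: str) -> Path:
    """Create the .rd-os/ tree; safe to call repeatedly."""
    base = Path(root) / ".rd-os"
    for rel in LAYOUT:
        (base / rel).mkdir(parents=True, exist_ok=True)  # idempotent
    return base

base = init_rd_os(tempfile.mkdtemp())
```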
Agent State Checkpoint
Per-Agent State File
// .rd-os/state/agent-states/repo-001.state.json
{
  "agent_id": "repo-001-analyzer",
  "repo_name": "pingcap/tidb",
  "status": "done",
  "created_at": "2026-02-28T10:00:00Z",
  "updated_at": "2026-02-28T10:15:00Z",
  "work": {
    "phase": "analysis",
    "subtask": "dependency_mapping",
    "progress_percent": 100,
    "items_total": 50,
    "items_completed": 50,
    "items_failed": 0
  },
  "result": {
    "success": true,
    "output_path": ".rd-os/store/artifacts/repo-001-analysis.json",
    "summary": {
      "lines_of_code": 652000,
      "dependencies": 127,
      "test_coverage": 78.5,
      "last_commit": "2026-02-28",
      "merge_recommendation": "P0-migrate"
    }
  },
  "checkpoint": {
    "last_action": "wrote_dependency_graph",
    "last_action_time": "2026-02-28T10:15:00Z",
    "can_resume": false,
    "resume_point": null
  },
  "errors": []
}
State Transitions
┌─────────┐     ┌─────────┐     ┌─────────┐     ┌─────────┐
│ pending │────▶│ running │────▶│  done   │     │ failed  │
└─────────┘     └────┬────┘     └─────────┘     └────┬────┘
                     │                               │
                     │          ┌─────────┐          │
                     └─────────▶│ paused  │◀─────────┘
                                └─────────┘
State Checkpoint Triggers:
- State transition (pending → running → done)
- Every N items completed (e.g., every 10 repos analyzed)
- Before/after external API calls
- On error (for debugging)
- Periodic heartbeat (every 5 minutes)
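One way to combine these triggers is a small policy object consulted by the agent loop; `CheckpointPolicy` and its thresholds are illustrative:

```python
import time

class CheckpointPolicy:
    """Hypothetical trigger policy combining the rules above."""

    def __init__(self, every_n_items: int = 10, heartbeat_secs: float = 300.0):
        self.every_n_items = every_n_items
        self.heartbeat_secs = heartbeat_secs
        self.items_since_checkpoint = 0
        self.last_checkpoint = time.monotonic()

    def should_checkpoint(self, state_changed: bool = False,
                          item_done: bool = False,
                          error: bool = False) -> bool:
        if item_done:
            self.items_since_checkpoint += 1
        due = (
            state_changed                                        # state transition
            or error                                             # checkpoint on error
            or self.items_since_checkpoint >= self.every_n_items # every N items
            or time.monotonic() - self.last_checkpoint >= self.heartbeat_secs
        )
        if due:
            self.items_since_checkpoint = 0
            self.last_checkpoint = time.monotonic()
        return due

# With a batch size of 3, the third completed item triggers a checkpoint.
policy = CheckpointPolicy(every_n_items=3, heartbeat_secs=300)
decisions = [policy.should_checkpoint(item_done=True) for _ in range(4)]
```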
Progress Tracking
Aggregated Progress (Batch Level)
// .rd-os/state/progress/analysis-progress.json
{
  "phase": "repository_analysis",
  "started_at": "2026-02-28T00:00:00Z",
  "updated_at": "2026-02-28T16:00:00Z",
  "summary": {
    "total_repos": 400,
    "analyzed": 150,
    "in_progress": 50,
    "pending": 200,
    "failed": 0,
    "progress_percent": 37.5
  },
  "by_priority": {
    "P0": { "total": 50,  "analyzed": 50, "pending": 0 },
    "P1": { "total": 100, "analyzed": 80, "pending": 20 },
    "P2": { "total": 150, "analyzed": 20, "pending": 130 },
    "P3": { "total": 100, "analyzed": 0,  "pending": 100 }
  },
  "current_batch": {
    "batch_id": "batch-003",
    "repos": ["repo-101", "repo-102", "..."],
    "started_at": "2026-02-28T14:00:00Z",
    "estimated_complete": "2026-02-28T18:00:00Z"
  },
  "rate": {
    "repos_per_hour": 25,
    "estimated_completion": "2026-03-01T08:00:00Z"
  }
}
SQLite Schema (Definitive Store)
-- progress.db schema

-- Repository registry
CREATE TABLE repos (
    repo_id     TEXT PRIMARY KEY,
    name        TEXT NOT NULL,
    priority    TEXT,       -- P0, P1, P2, P3
    category    TEXT,       -- product, platform, tool, etc.
    created_at  TIMESTAMP,
    updated_at  TIMESTAMP
);

-- Analysis progress
CREATE TABLE analysis_state (
    repo_id          TEXT PRIMARY KEY,
    status           TEXT,       -- pending, running, done, failed
    progress_percent INTEGER,
    started_at       TIMESTAMP,
    completed_at     TIMESTAMP,
    result_json      TEXT,
    error_message    TEXT,
    FOREIGN KEY (repo_id) REFERENCES repos(repo_id)
);

-- Migration progress
CREATE TABLE migration_state (
    repo_id          TEXT PRIMARY KEY,
    status           TEXT,       -- pending, running, done, failed
    phase            TEXT,       -- prep, transfer, integrate, validate
    progress_percent INTEGER,
    started_at       TIMESTAMP,
    completed_at     TIMESTAMP,
    result_json      TEXT,
    error_message    TEXT,
    FOREIGN KEY (repo_id) REFERENCES repos(repo_id)
);

-- Agent registry
CREATE TABLE agents (
    agent_id         TEXT PRIMARY KEY,
    type             TEXT,       -- analyzer, migrator, guardian, etc.
    assigned_repo_id TEXT,
    status           TEXT,       -- active, idle, paused, error
    last_heartbeat   TIMESTAMP,
    FOREIGN KEY (assigned_repo_id) REFERENCES repos(repo_id)
);

-- Checkpoints
CREATE TABLE checkpoints (
    checkpoint_id   TEXT PRIMARY KEY,
    checkpoint_type TEXT,       -- batch, milestone, periodic
    created_at      TIMESTAMP,
    state_snapshot  TEXT,       -- JSON of full state
    recoverable     BOOLEAN
);

-- Event log (for debugging/audit)
CREATE TABLE events (
    event_id   TEXT PRIMARY KEY,
    timestamp  TIMESTAMP,
    event_type TEXT,
    agent_id   TEXT,
    repo_id    TEXT,
    details    TEXT
);

-- Indexes for fast queries
CREATE INDEX idx_analysis_status  ON analysis_state(status);
CREATE INDEX idx_migration_status ON migration_state(status);
CREATE INDEX idx_agent_status     ON agents(status);
CREATE INDEX idx_events_timestamp ON events(timestamp);
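As a sanity check, the core tables and a progress roll-up can be exercised against an in-memory SQLite database (RD-OS would point at `.rd-os/store/progress.db` instead):

```python
import sqlite3

# Create the two core tables from the schema above and seed a few rows.
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE repos (
    repo_id TEXT PRIMARY KEY, name TEXT NOT NULL, priority TEXT,
    category TEXT, created_at TIMESTAMP, updated_at TIMESTAMP
);
CREATE TABLE analysis_state (
    repo_id TEXT PRIMARY KEY, status TEXT, progress_percent INTEGER,
    started_at TIMESTAMP, completed_at TIMESTAMP,
    result_json TEXT, error_message TEXT,
    FOREIGN KEY (repo_id) REFERENCES repos(repo_id)
);
CREATE INDEX idx_analysis_status ON analysis_state(status);
""")
rows = [("repo-%03d" % i, "done" if i <= 3 else "pending") for i in range(1, 9)]
db.executemany("INSERT INTO analysis_state (repo_id, status) VALUES (?, ?)", rows)

# Overall progress: SQLite comparisons yield 0/1, so SUM counts matches.
done, total = db.execute(
    "SELECT SUM(status = 'done'), COUNT(*) FROM analysis_state"
).fetchone()
```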
Recovery Protocol
Restart Recovery Flow
┌─────────────────────────────────────────────────────────────────┐
│ OpenClaw Restart → Recovery Flow │
└─────────────────────────────────────────────────────────────────┘
OpenClaw Starts
│
▼
┌─────────────────────────────────────────────────────────────┐
│ 1. Load Configuration │
│ ├─ Read .rd-os/config/agents.yaml │
│ ├─ Read .rd-os/config/workflows.yaml │
│ └─ Initialize agent registry │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ 2. Load State from Durable Store │
│ ├─ Query progress.db: what's done? │
│ ├─ Query agents.db: what agents exist? │
│ └─ Build in-memory state │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ 3. Reconcile State │
│ ├─ Compare expected vs actual state │
│ ├─ Find incomplete work │
│ └─ Identify recoverable tasks │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ 4. Resume Incomplete Work │
│ ├─ For each incomplete task: │
│ │ ├─ Check if resumable │
│ │ ├─ Load checkpoint (if exists) │
│ │ └─ Resume from checkpoint │
│ └─ For non-resumable: restart from beginning │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ 5. Resume Agent Execution │
│ ├─ Spawn agents for pending work │
│ ├─ Resume paused agents │
│ └─ Continue normal operation │
└─────────────────────────────────────────────────────────────┘
Recovery Complete
Recovery Example
# Pseudo-code: Recovery logic
# (load_database, resume_analysis, etc. are pseudo helpers)
async def recover_after_restart():
    # Load the durable source of truth
    db = load_database(".rd-os/store/progress.db")

    # Find incomplete analysis tasks
    incomplete = db.query("""
        SELECT repo_id, progress_percent
        FROM analysis_state
        WHERE status IN ('running', 'pending')
    """)
    for task in incomplete:
        if task.progress_percent > 0:
            # Has progress - resume from the per-agent checkpoint
            checkpoint = load_checkpoint(task.repo_id)
            await resume_analysis(task.repo_id, checkpoint)
        else:
            # No progress - restart from the beginning
            await start_analysis(task.repo_id)

    # Find incomplete migrations
    # ... similar logic

    # Resume agents that were active before the restart
    agents = db.query("SELECT * FROM agents WHERE status = 'active'")
    for agent in agents:
        await resume_agent(agent.agent_id)

    log.info(f"Recovery complete: {len(incomplete)} tasks resumed")
Checkpoint Strategy
Checkpoint Types
| Type | Frequency | Content | Use Case |
|---|---|---|---|
| Micro | Every action | Agent state | Crash recovery |
| Batch | Every N items | Batch summary | Batch resume |
| Milestone | Phase complete | Full state snapshot | Phase resume |
| Periodic | Every N minutes | Aggregated progress | Time-based recovery |
Checkpoint Implementation
# Pseudo-code: Checkpoint manager
# (db, now(), timestamp(), write_json(), read_json() are pseudo helpers;
# agent_states and batch_progress are auxiliary tables alongside the schema above)
class CheckpointManager:
    def __init__(self, base_path: str):
        self.base_path = base_path
        self.state_path = f"{base_path}/state"
        self.store_path = f"{base_path}/store"

    def save_agent_state(self, agent_id: str, state: dict):
        """Save per-agent checkpoint (micro)"""
        path = f"{self.state_path}/agent-states/{agent_id}.state.json"
        state['checkpoint_time'] = now()
        write_json(path, state)
        # Mirror into SQLite
        db.execute("""
            INSERT OR REPLACE INTO agent_states (agent_id, state_json, updated_at)
            VALUES (?, ?, ?)
        """, (agent_id, json.dumps(state), now()))

    def save_batch_progress(self, batch_id: str, progress: dict):
        """Save batch progress (batch)"""
        path = f"{self.state_path}/progress/{batch_id}.json"
        write_json(path, progress)
        # Update the SQLite summary
        db.execute("""
            UPDATE batch_progress
            SET progress_json = ?, updated_at = ?
            WHERE batch_id = ?
        """, (json.dumps(progress), now(), batch_id))

    def save_milestone(self, milestone_name: str):
        """Save full state snapshot (milestone)"""
        checkpoint_id = f"checkpoint-{milestone_name}-{timestamp()}"
        path = f"{self.state_path}/checkpoints/{checkpoint_id}"
        # Snapshot everything
        snapshot = {
            'milestone': milestone_name,
            'timestamp': now(),
            'analysis_state': db.query_all("SELECT * FROM analysis_state"),
            'migration_state': db.query_all("SELECT * FROM migration_state"),
            'agent_state': db.query_all("SELECT * FROM agents"),
            'progress_summary': self.calculate_progress_summary()
        }
        write_json(f"{path}/snapshot.json", snapshot)
        # Record in SQLite
        db.execute("""
            INSERT INTO checkpoints (checkpoint_id, checkpoint_type, created_at, state_snapshot, recoverable)
            VALUES (?, ?, ?, ?, ?)
        """, (checkpoint_id, 'milestone', now(), json.dumps(snapshot), True))
        return checkpoint_id

    def load_checkpoint(self, checkpoint_id: str) -> dict:
        """Load checkpoint for recovery"""
        path = f"{self.state_path}/checkpoints/{checkpoint_id}/snapshot.json"
        return read_json(path)

    def get_recovery_state(self) -> dict:
        """Get current state for recovery"""
        return {
            'analysis': db.query_all("SELECT * FROM analysis_state WHERE status != 'done'"),
            'migration': db.query_all("SELECT * FROM migration_state WHERE status != 'done'"),
            'agents': db.query_all("SELECT * FROM agents WHERE status != 'idle'"),
            'latest_checkpoint': db.query_one("SELECT * FROM checkpoints ORDER BY created_at DESC LIMIT 1")
        }
Progress Aggregation (Avoiding Context Explosion)
Hierarchical Aggregation
Level 0: Individual Agent (1000+ agents)
├─ repo-001-analyzer: done
├─ repo-002-analyzer: running (50%)
├─ repo-003-analyzer: pending
└─ ... (1000+ entries - too many for context)
          │
          ▼ Aggregate (every 10 agents)
Level 1: Batch Summary (100 batches)
├─ batch-001: 10/10 done
├─ batch-002: 8/10 done, 2 running
├─ batch-003: 0/10 done, 10 pending
└─ ... (100 entries - still too many)
          │
          ▼ Aggregate (by priority)
Level 2: Priority Summary (4 priorities)
├─ P0: 50/50 done (100%)
├─ P1: 80/100 done (80%)
├─ P2: 20/150 done (13%)
└─ P3: 0/100 done (0%)
          │
          ▼ Aggregate (overall)
Level 3: Overall Summary (fits in context)
└─ Total: 150/400 done (37.5%)
   - 50 in progress
   - 200 pending
   - 0 failed
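The roll-up can be sketched in a few lines; the `(repo_id, priority, status)` input shape is hypothetical:

```python
from collections import Counter

def aggregate(agent_states):
    """Collapse level-0 agent states into priority and overall summaries,
    so only a handful of numbers ever reach the LLM context."""
    by_priority = {}
    for _repo, priority, status in agent_states:
        by_priority.setdefault(priority, Counter())[status] += 1
    overall = Counter()
    for counts in by_priority.values():
        overall.update(counts)
    total = sum(overall.values())
    return {
        "by_priority": {
            p: f"{counts['done']}/{sum(counts.values())} done"
            for p, counts in sorted(by_priority.items())
        },
        "overall": f"{overall['done']}/{total} done",
    }

states = [("r1", "P0", "done"), ("r2", "P0", "done"),
          ("r3", "P1", "done"), ("r4", "P1", "running"),
          ("r5", "P2", "pending")]
summary = aggregate(states)
```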
Context-Friendly Progress Report
// What goes into LLM context (small, actionable)
{
  "phase": "repository_analysis",
  "overall": {
    "total": 400,
    "done": 150,
    "in_progress": 50,
    "pending": 200,
    "failed": 0,
    "percent": 37.5
  },
  "by_priority": {
    "P0": "100% done ✅",
    "P1": "80% done 🏃",
    "P2": "13% done 🏃",
    "P3": "0% done ⏳"
  },
  "current_focus": "P1 batch-009 (8/10 done)",
  "next_up": "P1 batch-010 (10 repos)",
  "eta": "2026-03-01T08:00:00Z",
  "issues": [],
  "last_checkpoint": "checkpoint-batch-008-20260228-1400"
}
Key: Detailed state in SQLite, summary in context.
Idempotent Operations
Pattern: “Ensure” Instead of “Do”
# ❌ Bad: Not idempotent
async def migrate_repo(repo_id: str):
    """Migrate repo - may duplicate work if retried"""
    transfer_code(repo_id)
    update_build_config(repo_id)
    mark_migrated(repo_id)
    # If this fails after transfer, a retry duplicates the transfer!

# ✅ Good: Idempotent
async def ensure_repo_migrated(repo_id: str):
    """Ensure repo is migrated - safe to retry"""
    # Check current state first
    state = get_migration_state(repo_id)
    if state == 'done':
        log.info(f"{repo_id} already migrated, skipping")
        return
    if state == 'transfer_complete':
        log.info(f"{repo_id} transfer done, resuming config update")
        update_build_config(repo_id)
        mark_migrated(repo_id)
        return
    # No prior progress - start from the beginning
    transfer_code(repo_id)
    update_build_config(repo_id)
    mark_migrated(repo_id)
State Machine for Migration
┌─────────┐     ┌─────────┐     ┌─────────┐     ┌─────────┐     ┌─────────┐
│ pending │────▶│  prep   │────▶│transfer │────▶│integrate│────▶│  done   │
└─────────┘     └────┬────┘     └────┬────┘     └────┬────┘     └─────────┘
                     │               │               │
                     ▼               ▼               ▼
                [prep_done]   [transfer_done]  [integrate_done]
Each state transition is checkpointed.
Retry from last completed state.
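Retry-from-last-completed-state can be sketched as a tiny driver; the phase names follow the diagram, while the step functions and the state store (a dict standing in for migration_state) are illustrative:

```python
# Phase order from the state machine above.
PHASES = ["prep", "transfer", "integrate"]

def ensure_migrated(repo_id, state_store, steps):
    """Run only the phases after the last checkpointed one."""
    done_through = state_store.get(repo_id)  # last completed phase, or None
    start = PHASES.index(done_through) + 1 if done_through else 0
    for phase in PHASES[start:]:
        steps[phase](repo_id)         # do the work for this phase
        state_store[repo_id] = phase  # checkpoint the transition
    return state_store[repo_id]

# Record which phases actually execute.
ran = []
steps = {p: (lambda p: lambda r: ran.append(p))(p) for p in PHASES}

# Simulate a crash after the transfer phase: only integrate re-runs.
store = {"repo-001": "transfer"}
ensure_migrated("repo-001", store, steps)
```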
Monitoring & Observability
Progress Dashboard (Query SQLite)
-- Overall progress
SELECT
    COUNT(*) AS total,
    SUM(CASE WHEN status = 'done'    THEN 1 ELSE 0 END) AS done,
    SUM(CASE WHEN status = 'running' THEN 1 ELSE 0 END) AS running,
    SUM(CASE WHEN status = 'pending' THEN 1 ELSE 0 END) AS pending,
    ROUND(100.0 * SUM(CASE WHEN status = 'done' THEN 1 ELSE 0 END) / COUNT(*), 2) AS percent
FROM analysis_state;

-- Progress by priority
SELECT
    r.priority,
    COUNT(*) AS total,
    SUM(CASE WHEN a.status = 'done' THEN 1 ELSE 0 END) AS done,
    ROUND(100.0 * SUM(CASE WHEN a.status = 'done' THEN 1 ELSE 0 END) / COUNT(*), 2) AS percent
FROM repos r
JOIN analysis_state a ON r.repo_id = a.repo_id
GROUP BY r.priority;

-- Agent health
SELECT
    status,
    COUNT(*) AS count,
    MAX(last_heartbeat) AS last_activity
FROM agents
GROUP BY status;

-- Recent failures
SELECT
    repo_id,
    error_message,
    updated_at
FROM analysis_state
WHERE status = 'failed'
ORDER BY updated_at DESC
LIMIT 10;
Alerting
# .rd-os/config/alerts.yaml
alerts:
  - name: high_failure_rate
    condition: "failed_count / total_count > 0.05"
    severity: warning
    action: notify_human

  - name: stalled_progress
    condition: "no_progress_for_minutes > 60"
    severity: warning
    action: notify_human

  - name: agent_down
    condition: "agent_heartbeat_age_minutes > 10"
    severity: critical
    action: notify_human + restart_agent

  - name: checkpoint_age
    condition: "last_checkpoint_age_minutes > 30"
    severity: warning
    action: force_checkpoint
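A minimal evaluator for rules of this shape might look like the following; the metric names are illustrative and would be computed from the SQLite queries above:

```python
# Each alert is a name plus a predicate over a metrics snapshot.
ALERTS = [
    ("high_failure_rate", lambda m: m["failed"] / max(m["total"], 1) > 0.05),
    ("stalled_progress",  lambda m: m["minutes_since_progress"] > 60),
    ("agent_down",        lambda m: m["max_heartbeat_age_min"] > 10),
]

def evaluate(metrics):
    """Return the names of all alerts whose condition holds."""
    return [name for name, cond in ALERTS if cond(metrics)]

# 30/400 failures (7.5%) and a 42-minute-old heartbeat both fire.
fired = evaluate({
    "failed": 30, "total": 400,
    "minutes_since_progress": 5,
    "max_heartbeat_age_min": 42,
})
```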
Implementation Checklist
Phase 1: Basic Persistence
- Create .rd-os/state/ and .rd-os/store/ directories
- Implement JSON state file writer
- Implement per-agent checkpoint
- Implement progress.db SQLite schema
- Add checkpoint triggers (per-action, per-batch)
Phase 2: Recovery
- Implement recovery protocol
- Test restart recovery (simulate crash)
- Implement idempotent operations
- Add state reconciliation logic
Phase 3: Aggregation
- Implement hierarchical aggregation
- Create context-friendly progress summaries
- Add drill-down queries (on-demand details)
Phase 4: Monitoring
- Create progress dashboard (CLI or web)
- Implement alerting rules
- Add checkpoint management (list, restore, prune)
Example: Recovery After OpenClaw Restart
Scenario: OpenClaw restarts during repo analysis (150/400 done)
1. OpenClaw starts
   └─> RD-OS initialization
2. Load .rd-os/store/progress.db
   └─> Query: What's the state?
       └─> Result: 150 done, 50 running, 200 pending
3. Reconcile running tasks
   └─> For each "running" task:
       ├─> Load agent state from .rd-os/state/agent-states/
       ├─> Check if resumable
       └─> Resume or restart
4. Resume agents
   ├─> Spawn 50 agents for running tasks
   └─> Spawn agents for pending tasks (up to concurrency limit)
5. Continue normal operation
   ├─> Analysis continues from 150/400 (37.5%)
   └─> No work lost, no duplication
Total recovery time: <1 minute
Work lost: 0 (if micro-checkpointing) or <1 batch (if batch-checkpointing)
Conclusion
Key Principles:
- External State - Never rely on LLM context for progress
- Frequent Checkpoints - Checkpoint every unit of work
- Idempotent Operations - Safe to retry anything
- Hierarchical Aggregation - Summary in context, details in DB
- Recovery Protocol - Automated recovery on restart
Result:
- OpenClaw can restart anytime
- LLM context can be lost
- Progress is never lost
- Work resumes automatically
- No manual intervention needed
This is how you build a system that runs for weeks with 1000+ agents.
“The system must be resilient to failure, because at scale, failure is inevitable.”