RD-OS OpenClaw Architecture
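OpenClaw as the Main Brain + a Cluster of Sub-Agents
“OpenClaw is the Orchestrator; sub-agents are temporary workers that are destroyed after use, and state is persisted in the file system.”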
Core Architecture
OpenClaw Role
┌─────────────────────────────────────────────────────────────────┐
│ OpenClaw │
│ (The Orchestrator) │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Role: Master Controller │
│ │
│ Responsibilities: │
│ ├─ Maintain global state (via .rd-os/store/) │
│ ├─ Make high-level decisions │
│ ├─ Spawn sub-agents for parallel work │
│ ├─ Collect and synthesize results │
│ ├─ Handle exceptions and escalations │
│ └─ Report progress to humans │
│ │
│ Memory: │
│ ├─ Short-term: Conversation context (lost on restart) │
│ └─ Long-term: .rd-os/store/ (survives restart) │
│ │
│ Models: │
│ ├─ OpenClaw: qwen3.5-plus (or user's choice) │
│ └─ Sub-agents: qwen3.5-plus (cheap, fast) │
│ │
└─────────────────────────────────────────────────────────────────┘
Sub-Agent Model
┌─────────────────────────────────────────────────────────────────┐
│ Sub-Agent Pattern │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Lifecycle: │
│ │
│ 1. Spawn │
│ ├─ OpenClaw calls sessions_spawn() │
│ ├─ Task: "Analyze repo-001, output to .rd-os/state/..." │
│ └─ Model: qwen3.5-plus (cheap) │
│ │
│ 2. Execute │
│ ├─ Sub-agent works independently │
│ ├─ Writes checkpoints to .rd-os/state/ │
│ └─ Reports completion via sessions_send() │
│ │
│ 3. Collect │
│ ├─ OpenClaw reads output from .rd-os/state/ │
│ ├─ Synthesizes results │
│ └─ Updates .rd-os/store/progress.db │
│ │
│ 4. Destroy │
│ ├─ Sub-agent session ends (cleanup=delete) │
│ └─ No memory retained (state is in files) │
│ │
│ Key Insight: │
│ - Sub-agents are DISPOSABLE WORKERS │
│ - State is in FILES, not in agent memory │
│ - OpenClaw can restart, sub-agents can die, progress remains │
│ │
└─────────────────────────────────────────────────────────────────┘
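The lifecycle above depends on checkpoints surviving a sub-agent crash. The following is a minimal sketch of how a sub-agent might write them, assuming JSON files and an atomic rename so that a crash mid-write never leaves a half-written checkpoint. The function names `write_checkpoint` and `read_checkpoint` are illustrative and not part of OpenClaw; the field names follow the checkpoint fields listed in the task template later in this document.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Optional


def write_checkpoint(path: Path, step_completed: int, partial_results: dict) -> None:
    """Write a checkpoint atomically: temp file in the same directory, then rename."""
    path.parent.mkdir(parents=True, exist_ok=True)
    payload = {
        "step_completed": step_completed,
        "partial_results": partial_results,
        "can_resume": True,
    }
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(payload, f)
            f.flush()
            os.fsync(f.fileno())
        # Rename over the old checkpoint; readers see either the old or the new file.
        os.replace(tmp_name, path)
    except BaseException:
        os.unlink(tmp_name)
        raise


def read_checkpoint(path: Path) -> Optional[dict]:
    """Return the checkpoint dict, or None if no checkpoint exists yet."""
    if not path.exists():
        return None
    return json.loads(path.read_text(encoding="utf-8"))
```

Writing to a temporary file first means the orchestrator never has to handle a truncated JSON file during recovery.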
System Architecture
Three-Layer Architecture
┌─────────────────────────────────────────────────────────────────┐
│ Layer 1: OpenClaw (Main) │
│ (Persistent Controller) │
├─────────────────────────────────────────────────────────────────┤
│ │
│ - Maintains .rd-os/store/progress.db │
│ - Makes scheduling decisions │
│ - Spawns sub-agents via sessions_spawn() │
│ - Collects results reported via sessions_send() │
│ - Handles human interaction │
│ - Recovers from restart (reads from .rd-os/store/) │
│ │
│ Model: qwen3.5-plus (or user's preferred model) │
│ Lifetime: Long-running (weeks to months) │
│ │
└─────────────────────────────────────────────────────────────────┘
│
│ sessions_spawn()
│ sessions_send()
▼
┌─────────────────────────────────────────────────────────────────┐
│ Layer 2: Sub-Agent Pool (Ephemeral) │
│ (Disposable Workers) │
├─────────────────────────────────────────────────────────────────┤
│ │
│ - Created on-demand via sessions_spawn() │
│ - Focused task: "Analyze this repo", "Migrate that repo" │
│ - Writes state to .rd-os/state/agent-states/{id}.json │
│ - Reports completion, then destroyed │
│ - No long-term memory (state is in files) │
│ │
│ Model: qwen3.5-plus (cheap, fast) │
│ Lifetime: Short (minutes to hours per task) │
│ Concurrency: 10-50 simultaneous sub-agents │
│ │
└─────────────────────────────────────────────────────────────────┘
│
│ File I/O
▼
┌─────────────────────────────────────────────────────────────────┐
│ Layer 3: Persistent State (Files + DB) │
│ (Source of Truth) │
├─────────────────────────────────────────────────────────────────┤
│ │
│ .rd-os/ │
│ ├── state/ │
│ │ ├── agent-states/ # Per-sub-agent checkpoint │
│ │ ├── progress/ # Aggregated progress │
│ │ └── checkpoints/ # Milestone snapshots │
│ │ │
│ └── store/ │
│ ├── progress.db # SQLite: definitive state │
│ ├── agents.db # SQLite: sub-agent registry │
│ ├── artifacts/ # Generated reports │
│ └── config/ # Configuration │
│ │
│ Key: This layer SURVIVES everything │
│ - OpenClaw restart → OK, read from DB │
│ - Sub-agent dies → OK, checkpoint in files │
│ - Gateway crash → OK, DB is durable │
│ │
└─────────────────────────────────────────────────────────────────┘
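A sketch of how Layer 3 could be initialized, using Python's standard `sqlite3` module. The directory names come from the layout above and the column names from the queries used elsewhere in this document. The column types, the constraints, the `init_rd_os` helper, and the choice to keep both tables in `progress.db` are assumptions (the pseudo-code here accesses both through a single `self.db` handle).

```python
import sqlite3
from pathlib import Path

# Directory names are taken from the layout above.
SUBDIRS = [
    "state/agent-states",
    "state/progress",
    "state/checkpoints",
    "store/artifacts",
    "store/config",
]

# Column names follow the queries in this document; types are assumptions.
SCHEMA = """
CREATE TABLE IF NOT EXISTS analysis_state (
    repo_id          TEXT PRIMARY KEY,
    status           TEXT NOT NULL DEFAULT 'pending',
    progress_percent INTEGER NOT NULL DEFAULT 0,
    last_checkpoint  TEXT,
    result_json      TEXT,
    completed_at     TEXT
);
CREATE TABLE IF NOT EXISTS sub_agents (
    agent_id     TEXT PRIMARY KEY,
    type         TEXT NOT NULL,
    repo_id      TEXT NOT NULL,
    status       TEXT NOT NULL,
    spawned_at   TEXT NOT NULL,
    completed_at TEXT
);
"""


def init_rd_os(root: Path) -> sqlite3.Connection:
    """Create the .rd-os/ tree under root and open progress.db with the schema applied."""
    base = root / ".rd-os"
    for sub in SUBDIRS:
        (base / sub).mkdir(parents=True, exist_ok=True)
    conn = sqlite3.connect(base / "store" / "progress.db")
    conn.executescript(SCHEMA)
    conn.commit()
    return conn
```

Because every statement uses `IF NOT EXISTS` and `exist_ok=True`, the function is safe to call again after a restart.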
OpenClaw Workflow
Main Loop
# Pseudo-code: OpenClaw main orchestration loop
import asyncio
import json

class OpenClawOrchestrator:
    """
    OpenClaw as the main orchestrator
    """
    async def run(self):
        # 1. Recovery (after restart)
        await self.recover_state()
        # 2. Main loop
        while not self.is_complete():
            # 2.1 Check progress
            progress = self.load_progress()
            # 2.2 Make scheduling decisions
            decisions = self.make_scheduling_decisions(progress)
            # 2.3 Spawn sub-agents for new work
            for decision in decisions:
                if decision.action == 'analyze':
                    await self.spawn_analyzer(decision.repo)
                elif decision.action == 'migrate':
                    await self.spawn_migrator(decision.repo)
                elif decision.action == 'deep_dive':
                    await self.spawn_deep_analysis_team(decision.repo)
            # 2.4 Check for completed sub-agents
            completed = await self.check_completed_sub_agents()
            for result in completed:
                await self.process_result(result)
            # 2.5 Handle escalations
            await self.handle_escalations()
            # 2.6 Update progress
            await self.update_progress()
            # 2.7 Checkpoint
            await self.checkpoint()
            # 2.8 Wait (avoid busy loop)
            await asyncio.sleep(60)
        # 3. Completion
        await self.generate_final_report()
    async def spawn_analyzer(self, repo: Repo):
        """
        Spawn a sub-agent to analyze a repo
        """
        task = f"""
        Analyze repository: {repo.name}
        Output to: .rd-os/state/agent-states/{repo.id}-analysis.json
        Steps:
        1. Read repo metadata from GitHub API
        2. Analyze code structure
        3. Map dependencies
        4. Assess code quality
        5. Generate merge recommendation
        Checkpoint after each step.
        Report completion via sessions_send().
        """
        # Spawn sub-agent (qwen3.5-plus, cheap)
        session = await sessions_spawn(
            task=task,
            model='qwen3.5-plus',
            cleanup='delete',  # Destroy after completion
            label=f'analyzer-{repo.id}'
        )
        # Register sub-agent
        self.db.execute("""
            INSERT INTO sub_agents (agent_id, type, repo_id, status, spawned_at)
            VALUES (?, 'analyzer', ?, 'running', ?)
        """, (session.id, repo.id, now()))
    async def process_result(self, result: SubAgentResult):
        """
        Process completed sub-agent result
        """
        # Read output from file
        output = read_json(result.output_path)
        # Update progress DB
        self.db.execute("""
            UPDATE analysis_state
            SET status = 'done', result_json = ?, completed_at = ?
            WHERE repo_id = ?
        """, (json.dumps(output), now(), result.repo_id))
        # Update sub-agent registry
        self.db.execute("""
            UPDATE sub_agents
            SET status = 'completed', completed_at = ?
            WHERE agent_id = ?
        """, (now(), result.agent_id))
        # Synthesize findings (OpenClaw does this)
        await self.synthesize_findings(result.repo_id, output)
        # Make next decision (spawn more agents? escalate?)
        await self.make_next_decision(result)
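The main loop calls `make_scheduling_decisions()` without defining it. One possible policy, sketched below under stated assumptions, is to fill the free concurrency slots with pending repos in a stable order. The `Decision` type, the status strings, and the `repo_id` field are hypothetical; the real orchestrator may carry a full repo object and also emit `migrate` and `deep_dive` actions.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Decision:
    action: str   # 'analyze', 'migrate', or 'deep_dive', as in the main loop
    repo_id: str  # hypothetical: the main loop above uses decision.repo


def make_scheduling_decisions(
    statuses: Dict[str, str], max_concurrent: int = 50
) -> List[Decision]:
    """Pick pending repos to analyze without exceeding max_concurrent running agents.

    statuses maps repo_id -> 'pending' | 'running' | 'done' (an assumed encoding).
    """
    running = sum(1 for s in statuses.values() if s == "running")
    free_slots = max(0, max_concurrent - running)
    pending = sorted(r for r, s in statuses.items() if s == "pending")
    return [Decision("analyze", repo_id) for repo_id in pending[:free_slots]]
```

Keeping the policy a pure function of the loaded progress makes it easy to test without spawning any sessions.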
Sub-Agent Lifecycle
State Machine
┌─────────┐     ┌─────────┐     ┌─────────┐     ┌─────────┐
│  idle   │────▶│ running │────▶│  done   │     │ failed  │
└─────────┘     └─────────┘     └─────────┘     └─────────┘
     ▲               │                               │
     │               │          ┌─────────┐          │
     │               └─────────▶│ paused  │◀─────────┘
     │                          └─────────┘
     │
     │ sessions_spawn()
     │
┌─────────┐
│OpenClaw │
└─────────┘
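The state machine can be enforced in code with a small transition table. This is a sketch, not OpenClaw's implementation: the edges idle → running, running → done, running → paused, and failed → paused are the ones drawn in the diagram, while running → failed and paused → running are assumptions added so that `failed` is reachable and a paused agent can resume.

```python
from enum import Enum


class AgentState(Enum):
    IDLE = "idle"
    RUNNING = "running"
    DONE = "done"
    FAILED = "failed"
    PAUSED = "paused"


# Edges drawn in the diagram, plus two assumed ones marked below.
TRANSITIONS = {
    AgentState.IDLE: {AgentState.RUNNING},
    AgentState.RUNNING: {
        AgentState.DONE,
        AgentState.PAUSED,
        AgentState.FAILED,  # assumption: the diagram draws no edge into 'failed'
    },
    AgentState.FAILED: {AgentState.PAUSED},
    AgentState.PAUSED: {AgentState.RUNNING},  # assumption: resume from a checkpoint
    AgentState.DONE: set(),
}


def transition(current: AgentState, target: AgentState) -> AgentState:
    """Return target if the move is allowed, otherwise raise ValueError."""
    if target not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition: {current.value} -> {target.value}")
    return target
```

Rejecting illegal moves early keeps a stale completion message from flipping a `done` agent back to `running` in the registry.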
Sub-Agent Task Template
# Template for sub-agent tasks
ANALYZER_TASK_TEMPLATE = """
You are a Repository Analyzer Agent.
TASK: Analyze {repo_name}
OUTPUT: .rd-os/state/agent-states/{repo_id}-analysis.json
INSTRUCTIONS:
1. Read repo metadata from .rd-os/store/repos/{repo_id}.json
2. Analyze code structure (use GitHub API or local clone)
3. Map dependencies (go.mod, package.json, etc.)
4. Assess code quality (tests, docs, lint)
5. Generate merge recommendation (P0/P1/P2/P3/archive)
CHECKPOINTING:
- After each step, write checkpoint to:
.rd-os/state/agent-states/{repo_id}-analysis.checkpoint.json
- Include: step_completed, partial_results, can_resume
COMPLETION:
- Write final output to: .rd-os/state/agent-states/{repo_id}-analysis.json
- Send completion message via sessions_send():
"Analysis complete: {repo_id}, output: {output_path}"
MODEL: qwen3.5-plus
TIMEOUT: 30 minutes
CLEANUP: delete (session destroyed after completion)
"""
Recovery After OpenClaw Restart
Recovery Flow
OpenClaw Restarts
│
▼
┌─────────────────────────────────────────────────────────────┐
│ 1. Load State from .rd-os/store/progress.db │
│ ├─ Query: What repos are analyzed? │
│ ├─ Query: What repos are in progress? │
│ └─ Query: What sub-agents were running? │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ 2. Reconcile Sub-Agent State │
│ ├─ Find sub-agents marked 'running' │
│ ├─ Check if they have checkpoints │
│ ├─ If checkpoint exists → respawn with resume │
│ └─ If no checkpoint → restart from beginning │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ 3. Resume Orchestration │
│ ├─ Continue main loop │
│ ├─ Spawn new sub-agents for pending work │
│ └─ Resume from last checkpoint │
└─────────────────────────────────────────────────────────────┘
Result: OpenClaw can restart anytime, progress is never lost
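Step 2 of the flow (reconcile) reduces to one decision per in-progress repo: resume if a checkpoint file exists, otherwise restart. A minimal sketch of that decision as a side-effect-free function follows; `plan_recovery` is a hypothetical name, and the checkpoint file naming follows the task template in this document.

```python
from pathlib import Path
from typing import Dict, Iterable


def plan_recovery(running_repo_ids: Iterable[str], state_dir: Path) -> Dict[str, str]:
    """Map each in-progress repo to 'resume' (checkpoint found) or 'restart'."""
    plan = {}
    for repo_id in running_repo_ids:
        checkpoint = state_dir / f"{repo_id}-analysis.checkpoint.json"
        plan[repo_id] = "resume" if checkpoint.is_file() else "restart"
    return plan
```

Separating the plan from the spawning lets the restart path be tested against a temporary directory.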
Recovery Example
# Pseudo-code: OpenClaw recovery
async def recover_state(self):
    """
    Recover state after OpenClaw restart
    """
    # Load progress DB
    self.db = load_database('.rd-os/store/progress.db')
    # Find incomplete analysis
    incomplete = self.db.query("""
        SELECT repo_id, progress_percent, last_checkpoint
        FROM analysis_state
        WHERE status = 'running'
    """)
    handled = set()  # repo IDs already respawned, so orphans are not spawned twice
    for task in incomplete:
        handled.add(task.repo_id)
        # Check if sub-agent has checkpoint
        checkpoint_path = f".rd-os/state/agent-states/{task.repo_id}-analysis.checkpoint.json"
        if exists(checkpoint_path):
            # Resume from checkpoint
            checkpoint = read_json(checkpoint_path)
            await self.resume_analyzer(task.repo_id, checkpoint)
            log.info(f"Resumed analysis: {task.repo_id} from step {checkpoint['step_completed']}")
        else:
            # No checkpoint, restart. spawn_analyzer() expects a Repo, not an ID;
            # load_repo() is a hypothetical lookup helper.
            await self.spawn_analyzer(self.load_repo(task.repo_id))
            log.warning(f"No checkpoint for {task.repo_id}, restarting")
    # Find orphaned sub-agents (running but no progress).
    # Assumes a `checkpoints` table that records which agents wrote a checkpoint.
    orphaned = self.db.query("""
        SELECT agent_id, repo_id, spawned_at
        FROM sub_agents
        WHERE status = 'running'
        AND agent_id NOT IN (SELECT DISTINCT agent_id FROM checkpoints)
    """)
    for orphan in orphaned:
        if orphan.repo_id in handled:
            continue  # already resumed or restarted above
        # Sub-agent died without checkpoint
        log.warning(f"Orphaned sub-agent: {orphan.agent_id}, restarting")
        await self.spawn_analyzer(self.load_repo(orphan.repo_id))
    log.info(f"Recovery complete: {len(incomplete)} tasks recovered")
Scaling Strategy
Concurrency Control
import asyncio

class ConcurrencyManager:
    """
    Manage sub-agent concurrency
    """
    def __init__(self, max_concurrent: int = 50):
        self.max_concurrent = max_concurrent
        self.active_count = 0
        self.lock = asyncio.Lock()

    async def acquire(self) -> bool:
        """
        Acquire a slot for new sub-agent (non-blocking: returns False when full)
        """
        async with self.lock:
            if self.active_count < self.max_concurrent:
                self.active_count += 1
                return True
            return False

    async def release(self):
        """
        Release a slot when sub-agent completes
        """
        async with self.lock:
            # Guard against a double release driving the count negative
            if self.active_count > 0:
                self.active_count -= 1

    def get_utilization(self) -> float:
        return self.active_count / self.max_concurrent
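`ConcurrencyManager.acquire()` returns `False` when no slot is free, so the caller has to retry. An alternative sketch using the standard `asyncio.Semaphore` is shown below; it makes callers wait for a slot instead. `run_with_limit`, `demo`, and `fake_agent` are hypothetical names, and the sleep stands in for a sub-agent doing work.

```python
import asyncio
from typing import Awaitable, Callable, List


async def run_with_limit(
    jobs: List[Callable[[], Awaitable[str]]], max_concurrent: int
) -> List[str]:
    """Run all jobs, keeping at most max_concurrent in flight at once."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def guarded(job: Callable[[], Awaitable[str]]) -> str:
        async with semaphore:  # the slot is released even if the job raises
            return await job()

    return await asyncio.gather(*(guarded(job) for job in jobs))


async def demo() -> int:
    """Return the peak number of jobs observed running at the same time."""
    active = 0
    peak = 0

    async def fake_agent() -> str:
        nonlocal active, peak
        active += 1
        peak = max(peak, active)
        await asyncio.sleep(0.01)  # stands in for a sub-agent doing work
        active -= 1
        return "done"

    await run_with_limit([fake_agent] * 10, max_concurrent=3)
    return peak
```

Calling `asyncio.run(demo())` reports the peak concurrency, which stays at the configured limit. The counter-based manager above remains useful when the orchestrator wants to skip work rather than wait.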
Batch Processing
# Process repos in batches (avoid overwhelming system)
from typing import List

async def process_in_batches(self, repos: List[Repo], batch_size: int = 50):
    """
    Process repos in batches
    """
    for i in range(0, len(repos), batch_size):
        batch = repos[i:i+batch_size]
        log.info(f"Processing batch {i//batch_size + 1}: {len(batch)} repos")
        # Spawn sub-agents for batch
        tasks = [self.spawn_analyzer(repo) for repo in batch]
        # Wait for the spawn calls to return; failures come back as exception
        # objects instead of aborting the batch. This does not wait for the
        # analyses themselves and applies no timeout.
        await asyncio.gather(*tasks, return_exceptions=True)
        # Checkpoint after batch
        await self.checkpoint(f'batch-{i//batch_size}')
        # Rate limit (avoid API throttling)
        await asyncio.sleep(60)
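The batch loop above does not bound how long a batch may take. A sketch of one way to add a per-item timeout with the standard `asyncio.wait_for` follows; `chunked` and `run_batch` are hypothetical helpers, and `worker` stands in for whatever coroutine actually waits for a sub-agent to finish.

```python
import asyncio
from typing import Awaitable, Callable, Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def chunked(items: Sequence[T], size: int) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of at most size items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


async def run_batch(
    batch: Sequence[T],
    worker: Callable[[T], Awaitable[object]],
    timeout_s: float,
) -> List[object]:
    """Run worker over one batch; failed or timed-out items yield an exception object."""

    async def bounded(item: T) -> object:
        return await asyncio.wait_for(worker(item), timeout=timeout_s)

    return await asyncio.gather(*(bounded(i) for i in batch), return_exceptions=True)
```

With 400 repos and a batch size of 50, `chunked` yields the eight batches used in the end-to-end example. A timed-out item appears in the result list as an `asyncio.TimeoutError` instance, so one slow repo does not block the rest of its batch.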
Communication Pattern
OpenClaw ↔ Sub-Agent
┌─────────────────────────────────────────────────────────────────┐
│ OpenClaw ↔ Sub-Agent Communication │
├─────────────────────────────────────────────────────────────────┤
│ │
│ 1. OpenClaw → Sub-Agent: sessions_spawn(task) │
│ ├─ Task description │
│ ├─ Output path │
│ └─ Checkpoint requirements │
│ │
│ 2. Sub-Agent → File System: write_checkpoint() │
│ ├─ Progress updates │
│ ├─ Partial results │
│ └─ Recovery point │
│ │
│ 3. Sub-Agent → OpenClaw: sessions_send(message) │
│ ├─ "Task complete: {repo_id}" │
│ ├─ "Error: {error_message}" │
│ └─ "Escalation: {issue}" │
│ │
│ 4. OpenClaw → File System: read_output() │
│ ├─ Read final output │
│ ├─ Read checkpoints │
│ └─ Update progress DB │
│ │
│ Key: Communication is MINIMAL │
│ - Sub-agents don't retain state │
│ - Everything is in files │
│ - OpenClaw can restart, sub-agents are disposable │
│ │
└─────────────────────────────────────────────────────────────────┘
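Since sub-agents report back with short text messages, the orchestrator needs to classify them. A minimal sketch of prefix-based parsing for the three message shapes listed above follows; `AgentMessage` and `parse_message` are hypothetical names, and how the text arrives from `sessions_send()` is not shown.

```python
from typing import NamedTuple


class AgentMessage(NamedTuple):
    kind: str     # 'complete', 'error', 'escalation', or 'unknown'
    payload: str


# Prefixes follow the three message shapes listed in the diagram above.
_PREFIXES = {
    "Task complete:": "complete",
    "Error:": "error",
    "Escalation:": "escalation",
}


def parse_message(text: str) -> AgentMessage:
    """Classify a sub-agent message by prefix; unrecognized text is kept as 'unknown'."""
    stripped = text.strip()
    for prefix, kind in _PREFIXES.items():
        if stripped.startswith(prefix):
            return AgentMessage(kind, stripped[len(prefix):].strip())
    return AgentMessage("unknown", stripped)
```

Messages that match no prefix are returned as `unknown` rather than dropped, so the orchestrator can log or escalate them.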
Cost Optimization
Model Selection
| Component | Model | Rationale |
|---|---|---|
| OpenClaw (Main) | qwen3.5-plus | Good balance of cost/capability |
| Sub-Agents | qwen3.5-plus | Cheap, fast, disposable |
| Deep Analysis | qwen3.5-plus (or upgrade if needed) | Can upgrade for complex tasks |
Cost Estimate (400 Repos)
Analysis Phase:
├─ 400 repos × ~10K tokens/repo = 4M tokens
├─ qwen3.5-plus: $0.002/1K tokens
└─ Total: ~$8
Migration Phase:
├─ 400 repos × ~50K tokens/repo = 20M tokens
├─ qwen3.5-plus: $0.002/1K tokens
└─ Total: ~$40
Ongoing Operations (monthly):
├─ Guardian agents: ~100K tokens/day
├─ Monthly: 3M tokens
└─ Total: ~$6/month
Total First Year: ~$500 (the itemized lines above account for about $120: $8 + $40 + 12 × $6; the remainder is not broken down in this estimate)
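The itemized figures can be checked with a few lines of arithmetic. The sketch below uses the per-token price and token counts quoted in the estimate and assumes a flat price and a 30-day month; `cost_usd` is a hypothetical helper.

```python
PRICE_PER_1K_TOKENS = 0.002  # USD, the rate quoted in the estimate above


def cost_usd(tokens: int, price_per_1k: float = PRICE_PER_1K_TOKENS) -> float:
    """Convert a token count to dollars at a flat per-1K-token price."""
    return tokens / 1000 * price_per_1k


analysis = cost_usd(400 * 10_000)     # 400 repos x ~10K tokens = 4M tokens
migration = cost_usd(400 * 50_000)    # 400 repos x ~50K tokens = 20M tokens
monthly_ops = cost_usd(100_000 * 30)  # ~100K tokens/day over a 30-day month
itemized_first_year = analysis + migration + 12 * monthly_ops

print(f"analysis=${analysis:.2f} migration=${migration:.2f} "
      f"ops=${monthly_ops:.2f}/month itemized_first_year=${itemized_first_year:.2f}")
```

Deep-analysis runs are not itemized in the estimate, so they are left out of this calculation as well.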
Example: Full Workflow
End-to-End Example
Scenario: Analyze 400 repos with OpenClaw + sub-agents
Day 1: Initialization
├─ OpenClaw starts
├─ Creates .rd-os/ directory structure
├─ Loads repo list (400 repos)
├─ Spawns 50 sub-agents (batch 1)
└─ Checkpoint: "400 repos loaded, batch 1 started"
Day 1-2: Analysis (Batch 1-8)
├─ Each batch: 50 repos
├─ Sub-agents analyze in parallel
├─ OpenClaw collects results
├─ Updates progress.db
├─ Spawns next batch
└─ Checkpoint after each batch
Day 2: Analysis Complete
├─ 400/400 repos analyzed
├─ OpenClaw synthesizes findings
├─ Identifies: 50 S-tier, 100 A-tier, 150 B-tier, 100 C-tier
└─ Checkpoint: "Analysis complete"
Day 2-3: Deep Analysis (S-tier)
├─ 50 S-tier repos
├─ Each gets 5-8 sub-agents for deep analysis
├─ OpenClaw coordinates teams
├─ Produces 50 deep reports
└─ Checkpoint: "Deep analysis complete"
Day 3-7: Migration (P0)
├─ 50 P0 repos migrated
├─ Sub-agents handle migration tasks
├─ OpenClaw validates each migration
└─ Checkpoint: "P0 migrated"
... (continue for P1, P2, P3)
Week 4: Complete
├─ 400/400 repos migrated
├─ OpenClaw generates final report
└─ System transitions to "guardian mode"
Implementation Checklist
Phase 1: OpenClaw Orchestration
- Create .rd-os/ directory structure
- Implement progress.db schema
- Implement OpenClaw main loop
- Implement sub-agent spawning
- Implement result collection
Phase 2: Sub-Agent Tasks
- Create analyzer task template
- Create migrator task template
- Implement checkpointing in sub-agents
- Implement completion reporting
Phase 3: Recovery
- Implement OpenClaw recovery protocol
- Test restart recovery
- Implement sub-agent respawn
- Test sub-agent failure recovery
Phase 4: Optimization
- Implement concurrency control
- Implement batch processing
- Add rate limiting
- Tune performance
Conclusion
Key Insights:
- OpenClaw is the Brain - Maintains state, makes decisions, coordinates
- Sub-Agents are Hands - Execute tasks, disposable, no long-term memory
- Files are Memory - State in .rd-os/store/, survives everything
- Recovery is Automatic - OpenClaw restarts, reads DB, resumes
- Cost is Low - qwen3.5-plus for everything, ~$500 first year
This is how you build a resilient, scalable system with OpenClaw as the orchestrator.
“OpenClaw doesn’t do all the work. OpenClaw organizes the work.”