RD-OS OpenClaw Architecture

OpenClaw as the Main Brain + Sub-Agent Cluster

“OpenClaw is the Orchestrator; sub-agents are temporary workers that are destroyed after use, and state is persisted in the file system.”

Core Architecture

OpenClaw Role

┌─────────────────────────────────────────────────────────────────┐
│                         OpenClaw                                │
│                    (The Orchestrator)                           │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Role: Master Controller                                        │
│                                                                 │
│  Responsibilities:                                              │
│  ├─ Maintain global state (via .rd-os/store/)                  │
│  ├─ Make high-level decisions                                   │
│  ├─ Spawn sub-agents for parallel work                         │
│  ├─ Collect and synthesize results                             │
│  ├─ Handle exceptions and escalations                          │
│  └─ Report progress to humans                                  │
│                                                                 │
│  Memory:                                                        │
│  ├─ Short-term: Conversation context (lost on restart)         │
│  └─ Long-term: .rd-os/store/ (survives restart)                │
│                                                                 │
│  Models:                                                        │
│  ├─ OpenClaw: qwen3.5-plus (or user's choice)                  │
│  └─ Sub-agents: qwen3.5-plus (cheap, fast)                     │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Sub-Agent Model

┌─────────────────────────────────────────────────────────────────┐
│                      Sub-Agent Pattern                          │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Lifecycle:                                                     │
│                                                                 │
│  1. Spawn                                                       │
│     ├─ OpenClaw calls sessions_spawn()                          │
│     ├─ Task: "Analyze repo-001, output to .rd-os/state/..."    │
│     └─ Model: qwen3.5-plus (cheap)                              │
│                                                                 │
│  2. Execute                                                     │
│     ├─ Sub-agent works independently                            │
│     ├─ Writes checkpoints to .rd-os/state/                      │
│     └─ Reports completion via sessions_send()                   │
│                                                                 │
│  3. Collect                                                     │
│     ├─ OpenClaw reads output from .rd-os/state/                 │
│     ├─ Synthesizes results                                      │
│     └─ Updates .rd-os/store/progress.db                         │
│                                                                 │
│  4. Destroy                                                     │
│     ├─ Sub-agent session ends (cleanup=delete)                  │
│     └─ No memory retained (state is in files)                   │
│                                                                 │
│  Key Insight:                                                   │
│  - Sub-agents are DISPOSABLE WORKERS                           │
│  - State is in FILES, not in agent memory                      │
│  - OpenClaw can restart, sub-agents can die, progress remains  │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

System Architecture

Three-Layer Architecture

┌─────────────────────────────────────────────────────────────────┐
│                    Layer 1: OpenClaw (Main)                     │
│                   (Persistent Controller)                       │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  - Maintains .rd-os/store/progress.db                          │
│  - Makes scheduling decisions                                   │
│  - Spawns sub-agents via sessions_spawn()                      │
│  - Collects results via sessions_send()                        │
│  - Handles human interaction                                    │
│  - Recovers from restart (reads from .rd-os/store/)            │
│                                                                 │
│  Model: qwen3.5-plus (or user's preferred model)               │
│  Lifetime: Long-running (weeks to months)                      │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
                              │
                              │ sessions_spawn()
                              │ sessions_send()
                              ▼
┌─────────────────────────────────────────────────────────────────┐
│                 Layer 2: Sub-Agent Pool (Ephemeral)             │
│                    (Disposable Workers)                         │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  - Created on-demand via sessions_spawn()                      │
│  - Focused task: "Analyze this repo", "Migrate that repo"      │
│  - Writes state to .rd-os/state/agent-states/{id}.json         │
│  - Reports completion, then destroyed                          │
│  - No long-term memory (state is in files)                     │
│                                                                 │
│  Model: qwen3.5-plus (cheap, fast)                             │
│  Lifetime: Short (minutes to hours per task)                   │
│  Concurrency: 10-50 simultaneous sub-agents                    │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
                              │
                              │ File I/O
                              ▼
┌─────────────────────────────────────────────────────────────────┐
│              Layer 3: Persistent State (Files + DB)             │
│                  (Source of Truth)                              │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  .rd-os/                                                        │
│  ├── state/                                                     │
│  │   ├── agent-states/         # Per-sub-agent checkpoint      │
│  │   ├── progress/             # Aggregated progress           │
│  │   └── checkpoints/          # Milestone snapshots           │
│  │                                                              │
│  └── store/                                                     │
│      ├── progress.db           # SQLite: definitive state      │
│      ├── agents.db             # SQLite: sub-agent registry    │
│      ├── artifacts/            # Generated reports             │
│      └── config/               # Configuration                 │
│                                                                 │
│  Key: This layer SURVIVES everything                           │
│  - OpenClaw restart → OK, read from DB                         │
│  - Sub-agent dies → OK, checkpoint in files                    │
│  - Gateway crash → OK, DB is durable                           │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
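
A minimal sketch of initializing this layout, using only the Python standard library. The directory and file names come from the tree above, and the table and column names mirror the queries used in the pseudo-code elsewhere in this document (`analysis_state`, `sub_agents`). The column types, the `'pending'` default status, and the `init_store` function name are assumptions for illustration. Placing `sub_agents` in `agents.db` follows the tree; the pseudo-code uses a single `self.db` handle, so how the two files are attached is left open here.

```python
import sqlite3
from pathlib import Path

# Directory names taken from the layout above.
DIRS = [
    "state/agent-states",
    "state/progress",
    "state/checkpoints",
    "store/artifacts",
    "store/config",
]

# Column names mirror the queries in this document; types and defaults are assumptions.
PROGRESS_SCHEMA = """
CREATE TABLE IF NOT EXISTS analysis_state (
    repo_id          TEXT PRIMARY KEY,
    status           TEXT NOT NULL DEFAULT 'pending',
    progress_percent REAL DEFAULT 0,
    last_checkpoint  TEXT,
    result_json      TEXT,
    completed_at     TEXT
);
"""

AGENTS_SCHEMA = """
CREATE TABLE IF NOT EXISTS sub_agents (
    agent_id     TEXT PRIMARY KEY,
    type         TEXT NOT NULL,
    repo_id      TEXT NOT NULL,
    status       TEXT NOT NULL,
    spawned_at   TEXT NOT NULL,
    completed_at TEXT
);
"""


def init_store(root: Path) -> None:
    """Create the .rd-os/ tree and both SQLite files. Safe to call repeatedly."""
    for rel in DIRS:
        (root / rel).mkdir(parents=True, exist_ok=True)
    for db_name, schema in (("progress.db", PROGRESS_SCHEMA),
                            ("agents.db", AGENTS_SCHEMA)):
        conn = sqlite3.connect(root / "store" / db_name)
        try:
            conn.executescript(schema)
            conn.commit()
        finally:
            conn.close()
```

Calling `init_store(Path(".rd-os"))` at startup is idempotent, so it can run on every OpenClaw start, including after a restart.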

OpenClaw Workflow

Main Loop

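# Pseudo-code: OpenClaw main orchestration loop
# Assumes sessions_spawn(), read_json(), now(), Repo and SubAgentResult are provided elsewhere.

import asyncio
import json

class OpenClawOrchestrator: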
    """
    OpenClaw as the main orchestrator
    """
    
    async def run(self):
        # 1. Recovery (after restart)
        await self.recover_state()
        
        # 2. Main loop
        while not self.is_complete():
            # 2.1 Check progress
            progress = self.load_progress()
            
            # 2.2 Make scheduling decisions
            decisions = self.make_scheduling_decisions(progress)
            
            # 2.3 Spawn sub-agents for new work
            for decision in decisions:
                if decision.action == 'analyze':
                    await self.spawn_analyzer(decision.repo)
                elif decision.action == 'migrate':
                    await self.spawn_migrator(decision.repo)
                elif decision.action == 'deep_dive':
                    await self.spawn_deep_analysis_team(decision.repo)
            
            # 2.4 Check for completed sub-agents
            completed = await self.check_completed_sub_agents()
            for result in completed:
                await self.process_result(result)
            
            # 2.5 Handle escalations
            await self.handle_escalations()
            
            # 2.6 Update progress
            await self.update_progress()
            
            # 2.7 Checkpoint
            await self.checkpoint()
            
            # 2.8 Wait (avoid busy loop)
            await asyncio.sleep(60)
        
        # 3. Completion
        await self.generate_final_report()
    
    async def spawn_analyzer(self, repo: Repo):
        """
        Spawn a sub-agent to analyze a repo
        """
        task = f"""
        Analyze repository: {repo.name}
        
        Output to: .rd-os/state/agent-states/{repo.id}-analysis.json
        
        Steps:
        1. Read repo metadata from GitHub API
        2. Analyze code structure
        3. Map dependencies
        4. Assess code quality
        5. Generate merge recommendation
        
        Checkpoint after each step.
        Report completion via sessions_send().
        """
        
        # Spawn sub-agent (qwen3.5-plus, cheap)
        session = await sessions_spawn(
            task=task,
            model='qwen3.5-plus',
            cleanup='delete',  # Destroy after completion
            label=f'analyzer-{repo.id}'
        )
        
        # Register sub-agent
        self.db.execute("""
            INSERT INTO sub_agents (agent_id, type, repo_id, status, spawned_at)
            VALUES (?, 'analyzer', ?, 'running', ?)
        """, (session.id, repo.id, now()))
    
    async def process_result(self, result: SubAgentResult):
        """
        Process completed sub-agent result
        """
        # Read output from file
        output = read_json(result.output_path)
        
        # Update progress DB
        self.db.execute("""
            UPDATE analysis_state
            SET status = 'done', result_json = ?, completed_at = ?
            WHERE repo_id = ?
        """, (json.dumps(output), now(), result.repo_id))
        
        # Update sub-agent registry
        self.db.execute("""
            UPDATE sub_agents
            SET status = 'completed', completed_at = ?
            WHERE agent_id = ?
        """, (now(), result.agent_id))
        
        # Synthesize findings (OpenClaw does this)
        await self.synthesize_findings(result.repo_id, output)
        
        # Make next decision (spawn more agents? escalate?)
        await self.make_next_decision(result)

Sub-Agent Lifecycle

State Machine

┌─────────┐     ┌─────────┐     ┌─────────┐     ┌─────────┐
│  idle   │────▶│ running │────▶│ done    │     │ failed  │
└─────────┘     └─────────┘     └─────────┘     └─────────┘
     ▲              │                                 │
     │              │     ┌─────────┐                │
     │              └────▶│ paused  │◀───────────────┘
     │                    └─────────┘
     │
     │ sessions_spawn()
     │
┌─────────┐
│OpenClaw │
└─────────┘
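
One way to make the diagram enforceable is a small transition table. The sketch below encodes the edges as drawn (idle → running, running → done, running → paused, failed → paused). The diagram shows no edge into `failed` and no edge out of `paused`, so `running → failed` and `paused → running` are added here as assumptions and marked as such in the code. The `AgentState` and `transition` names are hypothetical.

```python
from enum import Enum


class AgentState(Enum):
    IDLE = "idle"
    RUNNING = "running"
    DONE = "done"
    FAILED = "failed"
    PAUSED = "paused"


# Edges drawn in the diagram, plus two assumed edges (marked below).
TRANSITIONS = {
    AgentState.IDLE: {AgentState.RUNNING},
    AgentState.RUNNING: {
        AgentState.DONE,
        AgentState.PAUSED,
        AgentState.FAILED,  # ASSUMPTION: not drawn in the diagram
    },
    AgentState.FAILED: {AgentState.PAUSED},
    AgentState.PAUSED: {AgentState.RUNNING},  # ASSUMPTION: resume edge, not drawn
    AgentState.DONE: set(),
}


def transition(current: AgentState, target: AgentState) -> AgentState:
    """Return target if the move is allowed, otherwise raise ValueError."""
    if target not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition: {current.value} -> {target.value}")
    return target
```

Checking every status update against a table like this catches bookkeeping mistakes, such as marking a finished sub-agent as running again.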

Sub-Agent Task Template

# Template for sub-agent tasks

ANALYZER_TASK_TEMPLATE = """
You are a Repository Analyzer Agent.

TASK: Analyze {repo_name}
OUTPUT: .rd-os/state/agent-states/{repo_id}-analysis.json

INSTRUCTIONS:
1. Read repo metadata from .rd-os/store/repos/{repo_id}.json
2. Analyze code structure (use GitHub API or local clone)
3. Map dependencies (go.mod, package.json, etc.)
4. Assess code quality (tests, docs, lint)
5. Generate merge recommendation (P0/P1/P2/P3/archive)

CHECKPOINTING:
- After each step, write checkpoint to:
  .rd-os/state/agent-states/{repo_id}-analysis.checkpoint.json
- Include: step_completed, partial_results, can_resume

COMPLETION:
- Write final output to: .rd-os/state/agent-states/{repo_id}-analysis.json
- Send completion message via sessions_send():
  "Analysis complete: {repo_id}, output: {output_path}"

MODEL: qwen3.5-plus
TIMEOUT: 30 minutes
CLEANUP: delete (session destroyed after completion)
"""
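
The CHECKPOINTING section of the template could be backed by a helper like the one below. This is a sketch, not the project's implementation: the field names (`step_completed`, `partial_results`, `can_resume`) come from the template, while the function names, the integer step number, and the write-to-temp-then-rename strategy are assumptions. The rename is there so that a sub-agent killed mid-write does not leave a half-written checkpoint behind.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Any, Dict, Optional


def write_checkpoint(path: Path, step_completed: int,
                     partial_results: Dict[str, Any],
                     can_resume: bool = True) -> None:
    """Write the checkpoint to a temp file in the same directory, then rename it into place."""
    path.parent.mkdir(parents=True, exist_ok=True)
    payload = {
        "step_completed": step_completed,
        "partial_results": partial_results,
        "can_resume": can_resume,
    }
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            json.dump(payload, tmp)
            tmp.flush()
            os.fsync(tmp.fileno())
        # On POSIX filesystems the rename is atomic: readers see the old
        # checkpoint or the new one, never a partial file.
        os.replace(tmp_name, path)
    except BaseException:
        if os.path.exists(tmp_name):
            os.remove(tmp_name)
        raise


def read_checkpoint(path: Path) -> Optional[Dict[str, Any]]:
    """Return the checkpoint dict, or None if the file is missing or not valid JSON."""
    try:
        with open(path, encoding="utf-8") as f:
            return json.load(f)
    except (FileNotFoundError, json.JSONDecodeError):
        return None
```

A sub-agent would call `write_checkpoint(...)` after each numbered step, and the orchestrator would call `read_checkpoint(...)` during recovery to decide where to resume.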

Recovery After OpenClaw Restart

Recovery Flow

OpenClaw Restarts
       │
       ▼
┌─────────────────────────────────────────────────────────────┐
│  1. Load State from .rd-os/store/progress.db                │
│     ├─ Query: What repos are analyzed?                      │
│     ├─ Query: What repos are in progress?                   │
│     └─ Query: What sub-agents were running?                 │
└─────────────────────────────────────────────────────────────┘
       │
       ▼
┌─────────────────────────────────────────────────────────────┐
│  2. Reconcile Sub-Agent State                               │
│     ├─ Find sub-agents marked 'running'                     │
│     ├─ Check if they have checkpoints                       │
│     ├─ If checkpoint exists → respawn with resume           │
│     └─ If no checkpoint → restart from beginning            │
└─────────────────────────────────────────────────────────────┘
       │
       ▼
┌─────────────────────────────────────────────────────────────┐
│  3. Resume Orchestration                                    │
│     ├─ Continue main loop                                   │
│     ├─ Spawn new sub-agents for pending work                │
│     └─ Resume from last checkpoint                          │
└─────────────────────────────────────────────────────────────┘

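Result: OpenClaw can restart at any time; completed work is kept, and an interrupted task redoes at most the steps since its last checkpoint (or restarts from the beginning if it has none)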
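
Step 2 of the flow (reconcile) reduces to one decision per interrupted task: resume if a checkpoint file exists, otherwise restart. A minimal sketch of that decision as a pure function, assuming the checkpoint naming used in the task template (`{repo_id}-analysis.checkpoint.json`); the `plan_recovery` name and the `"resume"`/`"restart"` labels are hypothetical.

```python
from pathlib import Path
from typing import Dict, Iterable


def plan_recovery(running_repo_ids: Iterable[str],
                  agent_states_dir: Path) -> Dict[str, str]:
    """Map each repo_id that was marked 'running' to 'resume' or 'restart'."""
    plan = {}
    for repo_id in running_repo_ids:
        checkpoint = agent_states_dir / f"{repo_id}-analysis.checkpoint.json"
        plan[repo_id] = "resume" if checkpoint.is_file() else "restart"
    return plan
```

Keeping the decision separate from the spawning makes it easy to test without starting any sub-agents, and keying the plan by `repo_id` means each repo is acted on once.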

Recovery Example

# Pseudo-code: OpenClaw recovery

async def recover_state(self):
    """
    Recover state after OpenClaw restart
    """
    # Load progress DB
    self.db = load_database('.rd-os/store/progress.db')
    
    # Find incomplete analysis
    incomplete = self.db.query("""
        SELECT repo_id, progress_percent, last_checkpoint
        FROM analysis_state
        WHERE status = 'running'
    """)
    
    handled = set()  # repo_ids already resumed or restarted in this pass
    
    for task in incomplete:
        # Check if sub-agent has checkpoint
        checkpoint_path = f".rd-os/state/agent-states/{task.repo_id}-analysis.checkpoint.json"
        
        if exists(checkpoint_path):
            # Resume from checkpoint (field name matches the task template: step_completed)
            checkpoint = read_json(checkpoint_path)
            await self.resume_analyzer(task.repo_id, checkpoint)
            log.info(f"Resumed analysis: {task.repo_id} from step {checkpoint['step_completed']}")
        else:
            # No checkpoint, restart. spawn_analyzer() expects a Repo, not a repo_id;
            # load_repo() is a hypothetical helper that reads .rd-os/store/repos/{repo_id}.json
            await self.spawn_analyzer(self.load_repo(task.repo_id))
            log.warning(f"No checkpoint for {task.repo_id}, restarting")
        handled.add(task.repo_id)
    
    # Find orphaned sub-agents (running but no progress)
    orphaned = self.db.query("""
        SELECT agent_id, repo_id, spawned_at
        FROM sub_agents
        WHERE status = 'running'
        AND agent_id NOT IN (SELECT DISTINCT agent_id FROM checkpoints)
    """)
    
    for orphan in orphaned:
        if orphan.repo_id in handled:
            continue  # already resumed or restarted above; avoid a duplicate sub-agent
        # Sub-agent died without checkpoint
        log.warning(f"Orphaned sub-agent: {orphan.agent_id}, restarting")
        await self.spawn_analyzer(self.load_repo(orphan.repo_id))
        handled.add(orphan.repo_id)
    
    log.info(f"Recovery complete: {len(handled)} tasks resumed or restarted")

Scaling Strategy

Concurrency Control

import asyncio

class ConcurrencyManager:
    """
    Manage sub-agent concurrency
    """
    
    def __init__(self, max_concurrent: int = 50):
        self.max_concurrent = max_concurrent
        self.active_count = 0
        self.lock = asyncio.Lock()
    
    async def acquire(self) -> bool:
        """
        Acquire a slot for new sub-agent
        """
        async with self.lock:
            if self.active_count < self.max_concurrent:
                self.active_count += 1
                return True
            return False
    
    async def release(self):
        """
        Release a slot when sub-agent completes
        """
        async with self.lock:
            # Guard against an unmatched release() driving the count negative
            if self.active_count > 0:
                self.active_count -= 1
    
    def get_utilization(self) -> float:
        return self.active_count / self.max_concurrent
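
The manager above is non-blocking: `acquire()` returns `False` when all slots are taken, so the caller has to retry. An alternative sketch, assuming it is acceptable for a spawn to wait for a free slot, uses the standard-library `asyncio.Semaphore`. The `run_bounded` name and the zero-argument `jobs` callables are hypothetical stand-ins for whatever actually spawns and awaits a sub-agent.

```python
import asyncio
from typing import Awaitable, Callable, Iterable, List


async def run_bounded(jobs: Iterable[Callable[[], Awaitable[object]]],
                      max_concurrent: int = 50) -> List[object]:
    """Run all jobs with at most max_concurrent in flight at once."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def guarded(job):
        async with semaphore:  # waits here while all slots are taken
            return await job()

    # Results come back in input order; exceptions are returned, not raised.
    return await asyncio.gather(*(guarded(job) for job in jobs),
                                return_exceptions=True)
```

The slot is released by the `async with` block even if the job raises, which removes the need for a matching `release()` call.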

Batch Processing

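# Process repos in batches (avoid overwhelming system)

import asyncio
from typing import List

async def process_in_batches(self, repos: List[Repo], batch_size: int = 50):
    """
    Process repos in batches
    """
    for i in range(0, len(repos), batch_size):
        batch = repos[i:i+batch_size]
        batch_no = i // batch_size + 1
        
        log.info(f"Processing batch {batch_no}: {len(batch)} repos")
        
        # Spawn sub-agents for batch
        tasks = [self.spawn_analyzer(repo) for repo in batch]
        
        # Wait for all spawns to return. Note: spawn_analyzer() returns once the
        # session is spawned, not when the analysis finishes; completion is picked
        # up later by check_completed_sub_agents() in the main loop.
        results = await asyncio.gather(*tasks, return_exceptions=True)
        
        # return_exceptions=True hides failures unless they are inspected
        for repo, result in zip(batch, results):
            if isinstance(result, Exception):
                log.warning(f"Spawn failed for {repo.id}: {result}")
        
        # Checkpoint after batch (same numbering as the log line above)
        await self.checkpoint(f'batch-{batch_no}')
        
        # Rate limit (avoid API throttling)
        await asyncio.sleep(60)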

Communication Pattern

OpenClaw ↔ Sub-Agent

┌─────────────────────────────────────────────────────────────────┐
│              OpenClaw ↔ Sub-Agent Communication                 │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  1. OpenClaw → Sub-Agent: sessions_spawn(task)                  │
│     ├─ Task description                                         │
│     ├─ Output path                                              │
│     └─ Checkpoint requirements                                  │
│                                                                 │
│  2. Sub-Agent → File System: write_checkpoint()                 │
│     ├─ Progress updates                                         │
│     ├─ Partial results                                          │
│     └─ Recovery point                                           │
│                                                                 │
│  3. Sub-Agent → OpenClaw: sessions_send(message)                │
│     ├─ "Task complete: {repo_id}"                               │
│     ├─ "Error: {error_message}"                                 │
│     └─ "Escalation: {issue}"                                    │
│                                                                 │
│  4. OpenClaw → File System: read_output()                       │
│     ├─ Read final output                                        │
│     ├─ Read checkpoints                                         │
│     └─ Update progress DB                                       │
│                                                                 │
│  Key: Communication is MINIMAL                                  │
│  - Sub-agents don't retain state                               │
│  - Everything is in files                                      │
│  - OpenClaw can restart, sub-agents are disposable             │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
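
Because the sub-agent → OpenClaw channel carries only short, prefixed strings, the receiving side can be a small classifier. A sketch under the assumption that messages start with exactly the three prefixes shown in step 3 of the box; the `AgentMessage` type, the `parse_message` name, and the kind labels are hypothetical.

```python
from dataclasses import dataclass

# Prefixes taken from the message examples above; the kind labels are assumptions.
PREFIXES = {
    "Task complete:": "complete",
    "Error:": "error",
    "Escalation:": "escalation",
}


@dataclass(frozen=True)
class AgentMessage:
    kind: str     # "complete", "error", "escalation", or "unknown"
    payload: str  # text after the prefix (the whole message if unknown)


def parse_message(raw: str) -> AgentMessage:
    """Classify a sessions_send() message by its leading prefix."""
    text = raw.strip()
    for prefix, kind in PREFIXES.items():
        if text.startswith(prefix):
            return AgentMessage(kind, text[len(prefix):].strip())
    return AgentMessage("unknown", text)
```

Anything that does not match falls through as `"unknown"` rather than raising, so a malformed message from one sub-agent cannot stall the orchestrator loop.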

Cost Optimization

Model Selection

Component        Model                                 Rationale
OpenClaw (Main)  qwen3.5-plus                          Good balance of cost/capability
Sub-Agents       qwen3.5-plus                          Cheap, fast, disposable
Deep Analysis    qwen3.5-plus (or upgrade if needed)   Can upgrade for complex tasks

Cost Estimate (400 Repos)

Analysis Phase:
├─ 400 repos × ~10K tokens/repo = 4M tokens
├─ qwen3.5-plus: $0.002/1K tokens
└─ Total: ~$8

Migration Phase:
├─ 400 repos × ~50K tokens/repo = 20M tokens
├─ qwen3.5-plus: $0.002/1K tokens
└─ Total: ~$40

Ongoing Operations (monthly):
├─ Guardian agents: ~100K tokens/day
├─ Monthly: 3M tokens
└─ Total: ~$6/month

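Total First Year: ~$500 budget (one-time migration + ongoing ops; the line items above sum to ~$120, i.e. $8 + $40 + 12 × $6, and deep analysis is not itemized)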
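
The per-phase arithmetic can be reproduced with a few lines. The token counts and the $0.002 per 1K tokens price are the figures quoted above, not verified pricing; `cost_usd` is a hypothetical helper. The sketch sums only the three itemized lines, so it does not by itself reproduce the ~$500 first-year figure.

```python
PRICE_PER_1K_TOKENS = 0.002  # USD, the qwen3.5-plus figure quoted above


def cost_usd(tokens: int, price_per_1k: float = PRICE_PER_1K_TOKENS) -> float:
    """Token count times the per-1K price, rounded to cents."""
    return round(tokens / 1000 * price_per_1k, 2)


analysis = cost_usd(400 * 10_000)      # 400 repos x ~10K tokens = 4M tokens
migration = cost_usd(400 * 50_000)     # 400 repos x ~50K tokens = 20M tokens
monthly_ops = cost_usd(100_000 * 30)   # ~100K tokens/day = 3M tokens/month
first_year_itemized = round(analysis + migration + 12 * monthly_ops, 2)

print(analysis, migration, monthly_ops, first_year_itemized)
# prints: 8.0 40.0 6.0 120.0
```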

Example: Full Workflow

End-to-End Example

Scenario: Analyze 400 repos with OpenClaw + sub-agents

Day 1: Initialization
├─ OpenClaw starts
├─ Creates .rd-os/ directory structure
├─ Loads repo list (400 repos)
├─ Spawns 50 sub-agents (batch 1)
└─ Checkpoint: "400 repos loaded, batch 1 started"

Day 1-2: Analysis (Batch 1-8)
├─ Each batch: 50 repos
├─ Sub-agents analyze in parallel
├─ OpenClaw collects results
├─ Updates progress.db
├─ Spawns next batch
└─ Checkpoint after each batch

Day 2: Analysis Complete
├─ 400/400 repos analyzed
├─ OpenClaw synthesizes findings
├─ Identifies: 50 S-tier, 100 A-tier, 150 B-tier, 100 C-tier
└─ Checkpoint: "Analysis complete"

Day 2-3: Deep Analysis (S-tier)
├─ 50 S-tier repos
├─ Each gets 5-8 sub-agents for deep analysis
├─ OpenClaw coordinates teams
├─ Produces 50 deep reports
└─ Checkpoint: "Deep analysis complete"

Day 3-7: Migration (P0)
├─ 50 P0 repos migrated
├─ Sub-agents handle migration tasks
├─ OpenClaw validates each migration
└─ Checkpoint: "P0 migrated"

... (continue for P1, P2, P3)

Week 4: Complete
├─ 400/400 repos migrated
├─ OpenClaw generates final report
└─ System transitions to "guardian mode"

Implementation Checklist

Phase 1: OpenClaw Orchestration

Create .rd-os/ directory structure
Implement progress.db schema
Implement OpenClaw main loop
Implement sub-agent spawning
Implement result collection

Phase 2: Sub-Agent Tasks

Create analyzer task template
Create migrator task template
Implement checkpointing in sub-agents
Implement completion reporting

Phase 3: Recovery

Implement OpenClaw recovery protocol
Test restart recovery
Implement sub-agent respawn
Test sub-agent failure recovery

Phase 4: Optimization

Implement concurrency control
Implement batch processing
Add rate limiting
Tune performance

Conclusion

Key Insights:

1. OpenClaw is the Brain: maintains state, makes decisions, coordinates
2. Sub-Agents are Hands: execute tasks, disposable, no long-term memory
3. Files are Memory: state lives under .rd-os/ (state/ and store/) and survives OpenClaw restarts, sub-agent failures, and gateway crashes
4. Recovery is Automatic: OpenClaw restarts, reads the DB, resumes
5. Cost is Low: qwen3.5-plus for everything, ~$500 first year

This is how you build a resilient, scalable system with OpenClaw as the orchestrator.

“OpenClaw doesn’t do all the work. OpenClaw organizes the work.”

