
Agent Cage Design Document

🎯 Core Concept

“1,000 cages housing 1,000 AIs”

Each Cage is:

  • An isolated execution environment (Docker Container / K8s Pod)
  • A dedicated resource quota (CPU, Memory, GPU, Token Budget)
  • Persistent state storage (Agent Memory, Task History, Outputs)
  • Independent health monitoring (Heartbeat, Error Rate, Resource Usage)
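These four properties can be captured in a single record. A minimal Python sketch; all field names and defaults here are illustrative assumptions, not the platform's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class ResourceQuota:
    # Hard limits enforced by the runtime; values mirror the quotas below
    cpu_cores: float = 2.0
    memory_gb: float = 4.0
    gpu_fraction: float = 0.5
    tokens_per_hour: int = 100_000

@dataclass
class Cage:
    cage_id: str                       # e.g. "042"
    agent_type: str                    # e.g. "space-guardian"
    quota: ResourceQuota = field(default_factory=ResourceQuota)
    state_dir: str = "/cage/state"     # persistent state mount
    status: str = "CREATED"            # lifecycle state (see the state machine)

cage = Cage(cage_id="042", agent_type="space-guardian")
```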

📦 Cage Architecture

┌─────────────────────────────────────────────────────────────────────────┐
│                         Cage #042                                        │
│                    (Isolated Agent Environment)                          │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│  ┌─────────────────────────────────────────────────────────────────┐   │
│  │                    Agent Runtime                                 │   │
│  │  ┌───────────────────────────────────────────────────────────┐  │   │
│  │  │  OpenClaw Agent Instance                                   │  │   │
│  │  │  - Model: qwen3.5-plus                                    │  │   │
│  │  │  - Context Window: 262K tokens                            │  │   │
│  │  │  - Skills: [space-guardian, incident-responder, ...]      │  │   │
│  │  │  - Memory: Short-term (session) + Long-term (files)       │  │   │
│  │  └───────────────────────────────────────────────────────────┘  │   │
│  └─────────────────────────────────────────────────────────────────┘   │
│                                   │                                     │
│  ┌────────────────────────────────┼────────────────────────────────┐   │
│  │                    Resource Quotas                               │   │
│  │  CPU: 2 cores (limit)         Memory: 4GB (limit)               │   │
│  │  GPU: 0.5 A10 (limit)         Tokens: 100K/hour (limit)         │   │
│  │  Network: 100 Mbps (limit)    Storage: 10GB (limit)             │   │
│  └─────────────────────────────────────────────────────────────────┘   │
│                                   │                                     │
│  ┌────────────────────────────────┼────────────────────────────────┐   │
│  │                    Persistent State                              │   │
│  │  /cage/state/                                                    │   │
│  │  ├── agent.json        - Agent identity & config                 │   │
│  │  ├── memory.md         - Long-term memory                        │   │
│  │  ├── task_history.jsonl - Completed tasks log                    │   │
│  │  ├── outputs/          - Generated artifacts                     │   │
│  │  └── metrics.jsonl     - Performance metrics                     │   │
│  └─────────────────────────────────────────────────────────────────┘   │
│                                   │                                     │
│  ┌────────────────────────────────┼────────────────────────────────┐   │
│  │                    Health Monitor                                │   │
│  │  - Heartbeat: Every 30s                                          │   │
│  │  - Error Tracking: Capture & report exceptions                   │   │
│  │  - Resource Monitoring: CPU, Memory, Token usage                 │   │
│  │  - Auto-recovery: Restart on crash, Migrate on overload          │   │
│  └─────────────────────────────────────────────────────────────────┘   │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘

🔧 Technical Implementation

Kubernetes Pod Template

apiVersion: v1
kind: Pod
metadata:
  name: cage-042
  namespace: agent-platform
  labels:
    cage-id: "042"
    agent-type: "space-guardian"
    status: "active"
  annotations:
    agent.openclaw.ai/id: "agent-042"
    agent.openclaw.ai/created: "2026-03-01T10:00:00Z"
spec:
  # Resource Quotas
  containers:
  - name: agent-runtime
    image: openclaw/agent-runtime:v1.0.0
    resources:
      requests:
        cpu: "1"
        memory: "2Gi"
        # GPU counts must be whole integers in Kubernetes; a 0.5-A10 share
        # requires GPU time-slicing or MIG partitioning at the node level
        nvidia.com/gpu: "1"
      limits:
        cpu: "2"
        memory: "4Gi"
        nvidia.com/gpu: "1"   # extended resources: requests must equal limits
    
    # Environment Variables
    env:
    - name: CAGE_ID
      value: "042"
    - name: AGENT_ID
      value: "agent-042"
    - name: AGENT_TYPE
      value: "space-guardian"
    - name: TOKEN_BUDGET_HOURLY
      value: "100000"
    - name: ORCHESTRATOR_URL
      value: "http://orchestrator.agent-platform.svc:8080"
    
    # Volume Mounts
    volumeMounts:
    - name: state-volume
      mountPath: /cage/state
    - name: outputs-volume
      mountPath: /cage/outputs
    - name: logs-volume
      mountPath: /cage/logs
    
    # Health Checks
    livenessProbe:
      httpGet:
        path: /health/live
        port: 8080
      initialDelaySeconds: 30
      periodSeconds: 30
      timeoutSeconds: 10
      failureThreshold: 3
    
    readinessProbe:
      httpGet:
        path: /health/ready
        port: 8080
      initialDelaySeconds: 10
      periodSeconds: 10
      timeoutSeconds: 5
      failureThreshold: 3
  
  volumes:
  - name: state-volume
    persistentVolumeClaim:
      claimName: cage-042-state
  - name: outputs-volume
    persistentVolumeClaim:
      claimName: cage-042-outputs
  - name: logs-volume
    emptyDir:
      sizeLimit: 1Gi
  
  # Node Affinity (optional: spread across nodes)
  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchLabels:
              app: agent-runtime
          topologyKey: kubernetes.io/hostname
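For programmatic creation, the template above can be rendered per cage. A hedged sketch that builds the manifest as a plain Python dict; the helper name `build_cage_pod_manifest` is hypothetical, and submitting the result to a cluster would go through a Kubernetes client:

```python
def build_cage_pod_manifest(cage_id: str, agent_type: str) -> dict:
    """Render (a subset of) the Pod template above for one cage."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {
            "name": f"cage-{cage_id}",
            "namespace": "agent-platform",
            "labels": {"cage-id": cage_id, "agent-type": agent_type,
                       "status": "active"},
        },
        "spec": {
            "containers": [{
                "name": "agent-runtime",
                "image": "openclaw/agent-runtime:v1.0.0",
                "resources": {
                    # GPU counts must be whole integers in Kubernetes
                    "requests": {"cpu": "1", "memory": "2Gi",
                                 "nvidia.com/gpu": "1"},
                    "limits": {"cpu": "2", "memory": "4Gi",
                               "nvidia.com/gpu": "1"},
                },
                "env": [
                    {"name": "CAGE_ID", "value": cage_id},
                    {"name": "AGENT_ID", "value": f"agent-{cage_id}"},
                    {"name": "AGENT_TYPE", "value": agent_type},
                ],
            }],
        },
    }

manifest = build_cage_pod_manifest("042", "space-guardian")
```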

📊 Cage State Machine

┌─────────────────────────────────────────────────────────────────────────┐
│                           Cage State Machine                            │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│                           ┌─────────────┐                               │
│                           │  CREATED    │                               │
│                           └──────┬──────┘                               │
│                                  │ start()                              │
│                                  ▼                                      │
│  ┌─────────────┐          ┌─────────────┐          ┌─────────────┐      │
│  │  STOPPED    │◄─────────│  STARTING   │─────────►│   ACTIVE    │      │
│  └─────────────┘  failed  └─────────────┘          └──────┬──────┘      │
│       ▲                                                   │             │
│       │                                                   │ error()     │
│       │                    ┌─────────────┐                │             │
│       │                    │   ERROR     │◄───────────────┘             │
│       │                    └──────┬──────┘                              │
│       │                           │ recover()                           │
│       │                           ▼                                     │
│       │                    ┌─────────────┐                              │
│       └────────────────────│  RECOVERING │                              │
│                            └─────────────┘                              │
│                                                                         │
│  State Transitions:                                                     │
│  - CREATED → STARTING:    Pod scheduled, container starting             │
│  - STARTING → ACTIVE:     Health check passed, ready for tasks          │
│  - STARTING → STOPPED:    Startup failed                                │
│  - ACTIVE → ERROR:        Runtime error detected                        │
│  - ERROR → RECOVERING:    Auto-recovery initiated                       │
│  - RECOVERING → ACTIVE:   Recovery successful                           │
│  - RECOVERING → STOPPED:  Recovery failed                               │
│  - ACTIVE → STOPPED:      Manual stop or resource reclamation           │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘
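The transition table above translates directly into a guard that rejects illegal transitions. A minimal sketch:

```python
# Allowed transitions, taken directly from the state machine above
TRANSITIONS = {
    "CREATED":    {"STARTING"},
    "STARTING":   {"ACTIVE", "STOPPED"},
    "ACTIVE":     {"ERROR", "STOPPED"},
    "ERROR":      {"RECOVERING"},
    "RECOVERING": {"ACTIVE", "STOPPED"},
    "STOPPED":    set(),                   # terminal state
}

class CageStateMachine:
    def __init__(self):
        self.state = "CREATED"

    def transition(self, target: str) -> None:
        if target not in TRANSITIONS[self.state]:
            raise ValueError(f"illegal transition {self.state} -> {target}")
        self.state = target

# Walk the recovery path: start, fail at runtime, recover
sm = CageStateMachine()
for step in ("STARTING", "ACTIVE", "ERROR", "RECOVERING", "ACTIVE"):
    sm.transition(step)
```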

📁 Cage Directory Layout

/cage/
├── state/                    # Persistent state
│   ├── agent.json           # Agent identity & config
│   │   {
│   │     "id": "agent-042",
│   │     "cage_id": "042",
│   │     "type": "space-guardian",
│   │     "config": { ... },
│   │     "created_at": "2026-03-01T10:00:00Z"
│   │   }
│   │
│   ├── memory.md            # Long-term memory (similar to MEMORY.md)
│   │   # The agent's learning history and distilled experience
│   │
│   ├── task_history.jsonl   # Task history log
│   │   {"task_id": "...", "type": "...", "status": "...", ...}
│   │
│   ├── context.json         # Current context window
│   │   {
│   │     "current_task": "...",
│   │     "conversation": [...],
│   │     "tools_available": [...]
│   │   }
│   │
│   └── metrics.jsonl        # Performance metrics
│       {"timestamp": "...", "cpu": 0.78, "memory": 0.62, "tokens": 45000}
│
├── outputs/                  # Generated artifacts
│   ├── 2026-03-01/
│   │   ├── artifact-001.json
│   │   ├── artifact-002.md
│   │   └── artifact-003.py
│   └── 2026-03-02/
│       └── ...
│
├── logs/                     # Runtime logs
│   ├── agent.log            # Main agent log
│   ├── task.log             # Task execution log
│   └── error.log            # Error log
│
└── tmp/                      # Temporary files
    └── ...
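Bootstrapping this layout is mechanical. A sketch that creates the skeleton and seeds `agent.json`; the helper name is illustrative:

```python
import json
import tempfile
from pathlib import Path

def init_cage_layout(root: Path, cage_id: str, agent_type: str) -> None:
    """Create the directory skeleton above and seed the state files."""
    for sub in ("state", "outputs", "logs", "tmp"):
        (root / sub).mkdir(parents=True, exist_ok=True)
    identity = {
        "id": f"agent-{cage_id}",
        "cage_id": cage_id,
        "type": agent_type,
        "config": {},
        "created_at": "2026-03-01T10:00:00Z",
    }
    (root / "state" / "agent.json").write_text(json.dumps(identity, indent=2))
    # Empty seeds for long-term memory and the append-only logs
    for name in ("memory.md", "task_history.jsonl", "metrics.jsonl"):
        (root / "state" / name).touch()

root = Path(tempfile.mkdtemp())   # stands in for /cage/
init_cage_layout(root, "042", "space-guardian")
```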

🔄 Cage Lifecycle Management

Creation Flow

1. Orchestrator decides to create a new Cage
   ↓
2. Allocate a Cage ID (001-1000)
   ↓
3. Create the Kubernetes Pod
   ↓
4. Mount the Persistent Volumes
   ↓
5. Start the Agent Runtime
   ↓
6. Health checks pass
   ↓
7. Register with the Agent Registry
   ↓
8. Begin accepting tasks
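The creation steps above can be sketched as a fixed pipeline; each step is injected as a callable so the ordering is explicit and testable. All step names are assumptions, and real implementations would call the Kubernetes API, the volume provisioner, the Agent Registry, and so on:

```python
def create_cage(cage_id: str, steps: dict) -> list:
    """Run the creation pipeline in order; `steps` maps names to callables."""
    order = [
        "allocate_id", "create_pod", "mount_volumes", "start_runtime",
        "health_check", "register", "accept_tasks",
    ]
    done = []
    for name in order:
        steps[name](cage_id)   # any exception aborts the pipeline here
        done.append(name)
    return done

# Stub every step to just record that it ran
calls = []
stubs = {name: (lambda cid, n=name: calls.append(n)) for name in (
    "allocate_id", "create_pod", "mount_volumes", "start_runtime",
    "health_check", "register", "accept_tasks")}
completed = create_cage("042", stubs)
```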

Execution Flow

1. Fetch a task from the Task Queue
   ↓
2. Load the task context
   ↓
3. Execute the task (agent reasoning + tool calls)
   ↓
4. Save artifacts to /cage/outputs/
   ↓
5. Update the task status
   ↓
6. Send heartbeat + metrics
   ↓
7. Return to idle and wait for the next task
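One iteration of this loop might look like the following sketch, with every dependency injected as a callable; the names are illustrative:

```python
def run_once(fetch_task, execute, save_artifact, report):
    """One pass of the task loop; all dependencies are injected."""
    task = fetch_task()
    if task is None:
        return None                        # queue empty: stay idle
    result = execute(task)                 # agent reasoning + tool calls
    path = save_artifact(task["task_id"], result)
    report({"task_id": task["task_id"], "status": "done", "artifact": path})
    return path

# Exercise with stubs
reports = []
artifact = run_once(
    fetch_task=lambda: {"task_id": "t-001", "type": "scan"},
    execute=lambda task: {"findings": []},
    save_artifact=lambda tid, result: f"/cage/outputs/2026-03-01/{tid}.json",
    report=reports.append,
)
```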

Recovery Flow

1. Error detected (failed health check / abnormal exit)
   ↓
2. Mark the Cage as ERROR
   ↓
3. Save the current state to persistent storage
   ↓
4. Attempt to restart the Pod
   ↓
5. Restore from the persisted state
   ↓
6. Health checks pass
   ↓
7. Resume task execution
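The restart/restore/health-check cycle above can be sketched as a bounded retry loop; this is illustrative, and the real recovery path would go through the Kubernetes API:

```python
def recover(restart_pod, restore_state, health_ok, max_attempts=3):
    """Bounded restart → restore → health-check loop."""
    for _ in range(max_attempts):
        restart_pod()
        restore_state()                    # reload /cage/state/ contents
        if health_ok():
            return "ACTIVE"                # RECOVERING → ACTIVE
    return "STOPPED"                       # RECOVERING → STOPPED (recovery failed)

# Simulate a health check that only passes on the second attempt
checks = {"n": 0}
def flaky_health():
    checks["n"] += 1
    return checks["n"] >= 2

state = recover(lambda: None, lambda: None, flaky_health)
```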

Teardown Flow

1. Teardown requested (resource reclamation / agent retirement)
   ↓
2. Stop accepting new tasks
   ↓
3. Wait for the current task to finish (or force-terminate it)
   ↓
4. Archive artifacts to cold storage
   ↓
5. Back up critical state
   ↓
6. Delete the Kubernetes Pod
   ↓
7. Release the Persistent Volumes
   ↓
8. Deregister from the Agent Registry

📈 Cage Metrics & Monitoring

Real-time Metrics

cage_metrics:
  resource_usage:
    cpu_percent: "0-100"
    memory_percent: "0-100"
    gpu_percent: "0-100"
    disk_usage_bytes: "integer"
    network_rx_bytes: "integer"
    network_tx_bytes: "integer"
  
  agent_status:
    status: "active|idle|busy|blocked|error"
    current_task_id: "uuid"
    task_duration_seconds: "integer"
    tokens_used: "integer"
    tokens_remaining: "integer"
  
  health:
    heartbeat_timestamp: "ISO8601"
    uptime_seconds: "integer"
    error_count_1h: "integer"
    success_rate_24h: "float (0-1)"
  
  productivity:
    tasks_completed_24h: "integer"
    artifacts_generated_24h: "integer"
    avg_task_duration_seconds: "float"
    quality_score_avg: "float (0-100)"

Aggregated Metrics

fleet_metrics:
  total_cages: 1000
  active_cages: 856
  idle_cages: 120
  error_cages: 24
  
  resource_totals:
    cpu_allocated: "2000 cores"
    cpu_used: "1456 cores"
    memory_allocated: "4000 GB"
    memory_used: "2890 GB"
    tokens_budget_daily: "2.4B"   # 1000 cages x 2.4M tokens/day each
    tokens_used_daily: "756M"
  
  productivity:
    tasks_completed_24h: 12456
    artifacts_generated_24h: 45678
    avg_resolution_time_minutes: 8.2
    auto_resolution_rate: 0.72
  
  cost:
    compute_cost_daily: "$9,500"   # 1000 cages x $9.50 (incl. storage & network)
    token_cost_daily: "$1,512"     # 756M tokens used x $0.002/1K
    archival_storage_daily: "$25"
    total_cost_daily: "$11,037"
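The fleet counters above are a straightforward roll-up of per-cage statuses. A sketch:

```python
from collections import Counter

def aggregate_fleet(cage_statuses):
    """Roll per-cage statuses up into the fleet counters above."""
    counts = Counter(cage_statuses)
    return {
        "total_cages": len(cage_statuses),
        "active_cages": counts["active"],
        "idle_cages": counts["idle"],
        "error_cages": counts["error"],
    }

# Reproduce the example fleet: 856 active + 120 idle + 24 error = 1000
fleet = aggregate_fleet(["active"] * 856 + ["idle"] * 120 + ["error"] * 24)
```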

🔐 Cage Security Design

Isolation Mechanisms

isolation:
  namespace: "Dedicated K8s namespace per Cage"
  network_policy: "Restricts network access between Cages"
  service_account: "Dedicated service account per Cage"
  secrets: "Secrets management isolated per Cage"
  
  resource_limits:
    cpu: "Hard limit, prevents resource contention"
    memory: "Hard limit, one Cage's OOM cannot affect others"
    disk: "Quota-managed, prevents storage exhaustion"
    network: "Bandwidth-limited, prevents congestion"

Access Control

rbac:
  cage_service_account:
    permissions:
      - read: own_state
      - write: own_outputs
      - execute: assigned_tasks
    denied:
      - access: other_cages
      - modify: orchestrator
      - delete: persistent_volumes
  
  orchestrator_access:
    permissions:
      - create: cages
      - delete: cages
      - send_tasks: any_cage
      - read_metrics: all_cages

💰 Cage Cost Model

Daily Cost per Cage

cage_042_daily_cost:
  compute:
    kubernetes_pod: "2 vCPU x 24h x $0.05/vCPU/h = $2.40"
    gpu_share: "0.5 A10 x 24h x $0.50/GPU/h = $6.00"
    storage: "10GB x $0.10/GB/day = $1.00"
    networking: "~$0.10"
    subtotal: "$9.50"
  
  tokens:
    budget: "100K tokens/hour x 24h = 2.4M tokens/day"
    cost: "2.4M x $0.002/1K = $4.80"
  
  total_per_cage_per_day: "$14.30"
  total_per_cage_per_month: "$429"
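The per-cage arithmetic above can be reproduced in a few lines; all rates are this document's illustrative figures, not real cloud prices:

```python
def daily_cage_cost(vcpus=2, gpu_share=0.5, storage_gb=10,
                    tokens_per_hour=100_000, vcpu_hour_usd=0.05,
                    gpu_hour_usd=0.50, gb_day_usd=0.10,
                    network_day_usd=0.10, usd_per_1k_tokens=0.002):
    """Per-cage daily cost, reproducing the breakdown above."""
    compute = (vcpus * 24 * vcpu_hour_usd          # $2.40
               + gpu_share * 24 * gpu_hour_usd     # $6.00
               + storage_gb * gb_day_usd           # $1.00
               + network_day_usd)                  # $0.10
    tokens = tokens_per_hour * 24 / 1000 * usd_per_1k_tokens   # $4.80
    return round(compute + tokens, 2)

cost = daily_cage_cost()   # 14.3, the $14.30/day figure above
```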

Cost at 1,000-Cage Scale

fleet_1000_monthly_cost:
  compute: "$9.50 x 1000 x 30 = $285,000"
  tokens: "$4.80 x 1000 x 30 = $144,000"
  cold_storage: "$0.50 x 1000 x 30 = $15,000"  # archived artifacts; per-cage PV storage is already in compute
  management_overhead: "$20,000"
  
  total_monthly: "$464,000"
  total_annual: "$5,568,000"
  
  cost_per_artifact: "$464,000 / 1,000,000 artifacts = $0.46"
  cost_per_task: "$464,000 / 500,000 tasks = $0.93"

🚀 Scaling Strategy

Auto-scaling

horizontal_pod_autoscaler:
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
  
  scale_up:
    when: "avg utilization > 80% for 5 minutes"
    step: "+10% of current capacity"
    max: "1000 cages"
  
  scale_down:
    when: "avg utilization < 40% for 30 minutes"
    step: "-10% of current capacity"
    min: "100 cages"

Queue-driven Scaling

queue_based_scaling:
  metrics:
    - queue_depth: "number of pending tasks"
    - avg_wait_time: "average task wait time"
  
  scale_up_trigger:
    - queue_depth > 500
    - avg_wait_time > 5 minutes
  
  scale_down_trigger:
    - queue_depth < 50
    - avg_wait_time < 30 seconds
    - idle_cages > 30%
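A sketch of the decision function implied by these triggers, assuming (since the document does not say) that scale-up fires on any one trigger while scale-down requires all three, which is the conservative reading:

```python
def scaling_decision(queue_depth, avg_wait_seconds, idle_ratio):
    """Queue-driven scaling policy with the thresholds from this document."""
    # Scale up if ANY up-trigger is met
    if queue_depth > 500 or avg_wait_seconds > 5 * 60:
        return "scale_up"
    # Scale down only if ALL down-triggers are met
    if queue_depth < 50 and avg_wait_seconds < 30 and idle_ratio > 0.30:
        return "scale_down"
    return "hold"
```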

📝 Next Steps

  1. Implement the Cage Operator (K8s Custom Resource)
  2. Build the Agent Runtime image (Docker)
  3. Set up the monitoring stack (Prometheus + Grafana)
  4. Implement auto-scaling (HPA + queue-based)
  5. Load-test 1,000 Cages running concurrently