Agent Cage Design Document
🎯 Core Concept
“1000 cages housing 1000 AIs”
Each Cage provides:
- An isolated execution environment (Docker container / K8s Pod)
- Dedicated resource quotas (CPU, memory, GPU, token budget)
- Persistent state storage (agent memory, task history, outputs)
- Independent health monitoring (heartbeat, error rate, resource usage)
📦 Cage Architecture
┌───────────────────────────────────────────────────────────────────────┐
│                               Cage #042                               │
│                     (Isolated Agent Environment)                      │
├───────────────────────────────────────────────────────────────────────┤
│                                                                       │
│  ┌─────────────────────────────────────────────────────────────────┐  │
│  │                          Agent Runtime                          │  │
│  │  ┌───────────────────────────────────────────────────────────┐  │  │
│  │  │ OpenClaw Agent Instance                                   │  │  │
│  │  │ - Model: qwen3.5-plus                                     │  │  │
│  │  │ - Context Window: 262K tokens                             │  │  │
│  │  │ - Skills: [space-guardian, incident-responder, ...]       │  │  │
│  │  │ - Memory: Short-term (session) + Long-term (files)        │  │  │
│  │  └───────────────────────────────────────────────────────────┘  │  │
│  └─────────────────────────────────────────────────────────────────┘  │
│                                   │                                   │
│  ┌────────────────────────────────┼────────────────────────────────┐  │
│  │                         Resource Quotas                         │  │
│  │ CPU:     2 cores (limit)        Memory:  4GB (limit)            │  │
│  │ GPU:     0.5 A10 (limit)        Tokens:  100K/hour (limit)      │  │
│  │ Network: 100 Mbps (limit)       Storage: 10GB (limit)           │  │
│  └─────────────────────────────────────────────────────────────────┘  │
│                                   │                                   │
│  ┌────────────────────────────────┼────────────────────────────────┐  │
│  │                        Persistent State                         │  │
│  │ /cage/state/                                                    │  │
│  │ ├── agent.json          - Agent identity & config               │  │
│  │ ├── memory.md           - Long-term memory                      │  │
│  │ ├── task_history.jsonl  - Completed tasks log                   │  │
│  │ └── metrics.jsonl       - Performance metrics                   │  │
│  │ /cage/outputs/          - Generated artifacts                   │  │
│  └─────────────────────────────────────────────────────────────────┘  │
│                                   │                                   │
│  ┌────────────────────────────────┼────────────────────────────────┐  │
│  │                         Health Monitor                          │  │
│  │ - Heartbeat: Every 30s                                          │  │
│  │ - Error Tracking: Capture & report exceptions                   │  │
│  │ - Resource Monitoring: CPU, Memory, Token usage                 │  │
│  │ - Auto-recovery: Restart on crash, Migrate on overload          │  │
│  └─────────────────────────────────────────────────────────────────┘  │
│                                                                       │
└───────────────────────────────────────────────────────────────────────┘
🔧 Technical Implementation
Kubernetes Pod Template
apiVersion: v1
kind: Pod
metadata:
  name: cage-042
  namespace: agent-platform
  labels:
    app: agent-runtime            # matched by the podAntiAffinity selector below
    cage-id: "042"
    agent-type: "space-guardian"
    status: "active"
  annotations:
    agent.openclaw.ai/id: "agent-042"
    agent.openclaw.ai/created: "2026-03-01T10:00:00Z"
spec:
  containers:
    - name: agent-runtime
      image: openclaw/agent-runtime:v1.0.0
      # Resource quotas
      resources:
        requests:
          cpu: "1"
          memory: "2Gi"
          nvidia.com/gpu: "1"
        limits:
          cpu: "2"
          memory: "4Gi"
          # Extended resources such as GPUs must be whole numbers, and the
          # request must equal the limit. The 0.5 A10 share from the quota
          # table needs a GPU-sharing mechanism (e.g. time-slicing) on the node.
          nvidia.com/gpu: "1"
      # Environment variables
      env:
        - name: CAGE_ID
          value: "042"
        - name: AGENT_ID
          value: "agent-042"
        - name: AGENT_TYPE
          value: "space-guardian"
        - name: TOKEN_BUDGET_HOURLY
          value: "100000"
        - name: ORCHESTRATOR_URL
          value: "http://orchestrator.agent-platform.svc:8080"
      # Volume mounts
      volumeMounts:
        - name: state-volume
          mountPath: /cage/state
        - name: outputs-volume
          mountPath: /cage/outputs
        - name: logs-volume
          mountPath: /cage/logs
      # Health checks
      livenessProbe:
        httpGet:
          path: /health/live
          port: 8080
        initialDelaySeconds: 30
        periodSeconds: 30
        timeoutSeconds: 10
        failureThreshold: 3
      readinessProbe:
        httpGet:
          path: /health/ready
          port: 8080
        initialDelaySeconds: 10
        periodSeconds: 10
        timeoutSeconds: 5
        failureThreshold: 3
  volumes:
    - name: state-volume
      persistentVolumeClaim:
        claimName: cage-042-state
    - name: outputs-volume
      persistentVolumeClaim:
        claimName: cage-042-outputs
    - name: logs-volume
      emptyDir:
        sizeLimit: 1Gi
  # Pod anti-affinity (optional: spread cages across nodes)
  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          podAffinityTerm:
            labelSelector:
              matchLabels:
                app: agent-runtime
            topologyKey: kubernetes.io/hostname
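With up to 1000 cages, the per-cage fields in the template (name, labels, env values, claim names) would presumably be generated rather than written by hand. The following is a minimal sketch of that idea, not the platform's actual tooling: `build_cage_pod` is a hypothetical helper, it emits only a subset of the template above, and it assumes cage IDs are zero-padded to three digits as in `cage-042`.

```python
import json


def build_cage_pod(cage_id: int, agent_type: str,
                   token_budget_hourly: int = 100_000) -> dict:
    """Return a (partial) Pod manifest for one cage as a plain dict."""
    cid = f"{cage_id:03d}"  # zero-padded ID, e.g. 42 -> "042"
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {
            "name": f"cage-{cid}",
            "namespace": "agent-platform",
            "labels": {"cage-id": cid, "agent-type": agent_type,
                       "status": "active"},
        },
        "spec": {
            "containers": [{
                "name": "agent-runtime",
                "image": "openclaw/agent-runtime:v1.0.0",
                "env": [
                    {"name": "CAGE_ID", "value": cid},
                    {"name": "AGENT_ID", "value": f"agent-{cid}"},
                    {"name": "AGENT_TYPE", "value": agent_type},
                    # Env values must be strings in a Pod spec.
                    {"name": "TOKEN_BUDGET_HOURLY",
                     "value": str(token_budget_hourly)},
                ],
                "volumeMounts": [
                    {"name": "state-volume", "mountPath": "/cage/state"},
                ],
            }],
            "volumes": [{
                "name": "state-volume",
                "persistentVolumeClaim": {"claimName": f"cage-{cid}-state"},
            }],
        },
    }


pod = build_cage_pod(42, "space-guardian")
print(json.dumps(pod["metadata"], indent=2))
```

Resources, probes, and affinity are omitted here for brevity; a real generator would copy them from the template unchanged.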
📊 Cage State Machine
┌───────────────────────────────────────────────────────────────────────┐
│                          Cage State Machine                           │
├───────────────────────────────────────────────────────────────────────┤
│                                                                       │
│                           ┌─────────────┐                             │
│                           │   CREATED   │                             │
│                           └──────┬──────┘                             │
│                                  │ start()                            │
│                                  ▼                                    │
│  ┌─────────────┐          ┌─────────────┐          ┌─────────────┐    │
│  │   STOPPED   │◄─────────│  STARTING   │─────────►│   ACTIVE    │    │
│  └─────────────┘  failed  └─────────────┘          └──────┬──────┘    │
│         ▲                                                 │           │
│         │                                                 │           │
│         │                 ┌─────────────┐                 │           │
│         │                 │    ERROR    │◄────────────────┘ error()   │
│         │                 └──────┬──────┘                             │
│         │                        │                                    │
│         │                        │ recover()                          │
│         │                        ▼                                    │
│         │                 ┌─────────────┐                             │
│         └─────────────────│ RECOVERING  │                             │
│                           └─────────────┘                             │
│                                                                       │
│  State Transitions:                                                   │
│  - CREATED → STARTING: Pod scheduled, container starting              │
│  - STARTING → ACTIVE: Health check passed, ready for tasks            │
│  - STARTING → STOPPED: Startup failed                                 │
│  - ACTIVE → ERROR: Runtime error detected                             │
│  - ERROR → RECOVERING: Auto-recovery initiated                        │
│  - RECOVERING → ACTIVE: Recovery successful                           │
│  - RECOVERING → STOPPED: Recovery failed                              │
│  - ACTIVE → STOPPED: Manual stop or resource reclamation              │
│                                                                       │
└───────────────────────────────────────────────────────────────────────┘
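The transition list above maps naturally onto a lookup table. The sketch below is one possible encoding, not the platform's implementation: the class and exception names are hypothetical, and `STOPPED` is treated as terminal only because the list defines no transition out of it.

```python
from enum import Enum


class CageState(Enum):
    CREATED = "CREATED"
    STARTING = "STARTING"
    ACTIVE = "ACTIVE"
    ERROR = "ERROR"
    RECOVERING = "RECOVERING"
    STOPPED = "STOPPED"


# Allowed transitions, copied from the "State Transitions" list.
TRANSITIONS = {
    CageState.CREATED: {CageState.STARTING},
    CageState.STARTING: {CageState.ACTIVE, CageState.STOPPED},
    CageState.ACTIVE: {CageState.ERROR, CageState.STOPPED},
    CageState.ERROR: {CageState.RECOVERING},
    CageState.RECOVERING: {CageState.ACTIVE, CageState.STOPPED},
    CageState.STOPPED: set(),  # assumption: no restart path is defined
}


class InvalidTransition(Exception):
    """Raised when a transition is not in the table."""


class Cage:
    def __init__(self, cage_id: str) -> None:
        self.cage_id = cage_id
        self.state = CageState.CREATED

    def transition(self, target: CageState) -> None:
        # Reject anything the table does not explicitly allow.
        if target not in TRANSITIONS[self.state]:
            raise InvalidTransition(f"{self.state.value} -> {target.value}")
        self.state = target
```

Keeping the table as data means adding a transition later (for example a restart path out of `STOPPED`) is a one-line change.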
📁 Cage Directory Structure
/cage/
├── state/                     # Persistent state
│   ├── agent.json             # Agent identity
│   │     {
│   │       "id": "agent-042",
│   │       "cage_id": "042",
│   │       "type": "space-guardian",
│   │       "config": { ... },
│   │       "created_at": "2026-03-01T10:00:00Z"
│   │     }
│   │
│   ├── memory.md              # Long-term memory (similar to MEMORY.md)
│   │     # The agent's learning history and lessons learned
│   │
│   ├── task_history.jsonl     # Task history log
│   │     {"task_id": "...", "type": "...", "status": "...", ...}
│   │
│   ├── context.json           # Current context window
│   │     {
│   │       "current_task": "...",
│   │       "conversation": [...],
│   │       "tools_available": [...]
│   │     }
│   │
│   └── metrics.jsonl          # Performance metrics
│         {"timestamp": "...", "cpu": 0.78, "memory": 0.62, "tokens": 45000}
│
├── outputs/                   # Generated artifacts
│   ├── 2026-03-01/
│   │   ├── artifact-001.json
│   │   ├── artifact-002.md
│   │   └── artifact-003.py
│   └── 2026-03-02/
│       └── ...
│
├── logs/                      # Runtime logs
│   ├── agent.log              # Main agent log
│   ├── task.log               # Task execution log
│   └── error.log              # Error log
│
└── tmp/                       # Temporary files
    └── ...
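Both `task_history.jsonl` and `metrics.jsonl` are append-only logs with one JSON object per line. As a rough illustration of how a runtime might write and read them, here is a small sketch; `append_jsonl` and `read_jsonl` are hypothetical helper names, and a temporary directory stands in for `/cage/state` so the example can run anywhere.

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def append_jsonl(path: Path, record: dict) -> None:
    """Append one record as a single JSON line."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")


def read_jsonl(path: Path) -> list[dict]:
    """Read all records, skipping blank lines."""
    with path.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Stand-in for /cage/state (assumption: the real path is a mounted volume).
state_dir = Path(tempfile.mkdtemp()) / "state"
metrics_file = state_dir / "metrics.jsonl"

append_jsonl(metrics_file, {
    "timestamp": datetime.now(timezone.utc).isoformat(),
    "cpu": 0.78, "memory": 0.62, "tokens": 45000,
})
```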
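🔄 Cage Lifecycle Management
Creation Flow
1. The Orchestrator decides to create a new Cage
↓
2. Allocate a Cage ID (001-1000)
↓
3. Create the Kubernetes Pod
↓
4. Mount the Persistent Volumes
↓
5. Start the Agent Runtime
↓
6. Health check passes
↓
7. Register with the Agent Registry
↓
8. Start accepting tasks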
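Run Flow
1. Fetch a task from the Task Queue
↓
2. Load the task context
↓
3. Execute the task (agent inference + tool calls)
↓
4. Save artifacts to /cage/outputs/
↓
5. Update the task status
↓
6. Send heartbeat + metrics
↓
7. Return to idle and wait for the next task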
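The seven steps of the run flow can be sketched as a single loop iteration. This is an illustrative toy, not the Agent Runtime: `Task` and `CageWorker` are hypothetical names, an in-process `queue.Queue` stands in for the Task Queue, in-memory containers stand in for `/cage/outputs/` and the task log, and `execute` is a placeholder for agent inference and tool calls.

```python
import queue
from dataclasses import dataclass, field


@dataclass
class Task:
    task_id: str
    payload: str


@dataclass
class CageWorker:
    cage_id: str
    status: str = "idle"
    outputs: dict = field(default_factory=dict)  # stands in for /cage/outputs/
    history: list = field(default_factory=list)  # stands in for task_history.jsonl
    heartbeats: int = 0

    def execute(self, task: Task) -> str:
        # Placeholder for agent inference + tool calls.
        return task.payload.upper()

    def run_once(self, tasks: "queue.Queue[Task]") -> bool:
        """Process one task; return False if the queue is empty."""
        try:
            task = tasks.get_nowait()                # 1. fetch a task
        except queue.Empty:
            return False
        self.status = "busy"                         # 2. load context (omitted)
        artifact = self.execute(task)                # 3. execute the task
        self.outputs[task.task_id] = artifact        # 4. save the artifact
        self.history.append(
            {"task_id": task.task_id, "status": "completed"})  # 5. update status
        self.heartbeats += 1                         # 6. heartbeat + metrics
        self.status = "idle"                         # 7. back to idle
        return True


tasks: "queue.Queue[Task]" = queue.Queue()
tasks.put(Task("t-1", "scan orbit"))
worker = CageWorker("042")
while worker.run_once(tasks):
    pass
```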
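Recovery Flow
1. Error detected (health check failure / abnormal exit)
↓
2. Mark the Cage as ERROR
↓
3. Save the current state to persistent storage
↓
4. Attempt to restart the Pod
↓
5. Restore from the persisted state
↓
6. Health check passes
↓
7. Resume task execution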
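One way to read the recovery flow is as a bounded retry loop that ends in either ACTIVE or STOPPED, matching the two exits from RECOVERING in the state machine. The sketch below makes that explicit. Note the assumptions: the design does not state a retry limit, so `max_attempts` is invented for illustration, and the four callables are hypothetical hooks the orchestrator would supply.

```python
from typing import Callable


def recover(
    save_state: Callable[[], None],
    restart_pod: Callable[[], None],
    restore_state: Callable[[], None],
    health_check: Callable[[], bool],
    max_attempts: int = 3,  # assumption: the design does not specify a limit
) -> str:
    """Run the recovery flow; return the resulting state name."""
    save_state()               # step 3: persist state before touching the Pod
    for _ in range(max_attempts):
        restart_pod()          # step 4: attempt a restart
        restore_state()        # step 5: reload the persisted state
        if health_check():     # step 6: health check passes
            return "ACTIVE"    # step 7: resume task execution
    return "STOPPED"           # RECOVERING -> STOPPED: recovery failed
```

A production version would likely add a delay between attempts; none is shown because the design does not describe one.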
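Teardown Flow
1. Teardown instruction received (resource reclamation / agent retirement)
↓
2. Stop accepting new tasks
↓
3. Wait for the current task to finish (or force-terminate it)
↓
4. Archive artifacts to cold storage
↓
5. Back up critical state
↓
6. Delete the Kubernetes Pod
↓
7. Release the Persistent Volumes
↓
8. Deregister from the Agent Registry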
📈 Cage Metrics Monitoring
Real-time Metrics
cage_metrics:
  resource_usage:
    cpu_percent: "0-100"
    memory_percent: "0-100"
    gpu_percent: "0-100"
    disk_usage_bytes: "integer"
    network_rx_bytes: "integer"
    network_tx_bytes: "integer"
  agent_status:
    status: "active|idle|busy|blocked|error"
    current_task_id: "uuid"
    task_duration_seconds: "integer"
    tokens_used: "integer"
    tokens_remaining: "integer"
  health:
    heartbeat_timestamp: "ISO8601"
    uptime_seconds: "integer"
    error_count_1h: "integer"
    success_rate_24h: "float (0-1)"
  productivity:
    tasks_completed_24h: "integer"
    artifacts_generated_24h: "integer"
    avg_task_duration_seconds: "float"
    quality_score_avg: "float (0-100)"
Aggregated Metrics
fleet_metrics:
  total_cages: 1000
  active_cages: 856
  idle_cages: 120
  error_cages: 24
  resource_totals:
    cpu_allocated: "2000 cores"
    cpu_used: "1456 cores"
    memory_allocated: "4000 GB"
    memory_used: "2890 GB"
    tokens_budget_daily: "1B"
    tokens_used_daily: "756M"
  productivity:
    tasks_completed_24h: 12456
    artifacts_generated_24h: 45678
    avg_resolution_time_minutes: 8.2
    auto_resolution_rate: 0.72
  cost:
    compute_cost_daily: "$450"
    token_cost_daily: "$756"
    storage_cost_daily: "$25"
    total_cost_daily: "$1,231"
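🔐 Cage Security Design
Isolation Mechanisms
isolation:
  namespace: "A dedicated K8s Namespace per Cage"
  network_policy: "Restrict network access between Cages"
  service_account: "A dedicated service account per Cage"
  secrets: "Secrets managed in isolation per Cage"
resource_limits:
  cpu: "Hard limit, prevents resource contention"
  memory: "Hard limit, prevents an OOM from affecting other Cages"
  disk: "Quota-managed, prevents storage exhaustion"
  network: "Bandwidth cap, prevents network congestion"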
Access Control
rbac:
  cage_service_account:
    permissions:
      - read: own_state
      - write: own_outputs
      - execute: assigned_tasks
    denied:
      - access: other_cages
      - modify: orchestrator
      - delete: persistent_volumes
  orchestrator_access:
    permissions:
      - create: cages
      - delete: cages
      - send_tasks: any_cage
      - read_metrics: all_cages
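The access-control table above can be read as a deny-by-default policy where an explicit `denied` entry always wins. The sketch below models that reading in plain Python; `POLICY` and `is_allowed` are hypothetical names, and the precedence rule is an assumption, since the design does not say how conflicts resolve. Note also that Kubernetes RBAC itself is purely additive and has no deny rules, so in a real cluster the `denied` list would be enforced by simply never granting those permissions.

```python
# Policy table mirroring the rbac block: (action, resource) pairs per role.
POLICY = {
    "cage_service_account": {
        "allow": {
            ("read", "own_state"),
            ("write", "own_outputs"),
            ("execute", "assigned_tasks"),
        },
        "deny": {
            ("access", "other_cages"),
            ("modify", "orchestrator"),
            ("delete", "persistent_volumes"),
        },
    },
    "orchestrator_access": {
        "allow": {
            ("create", "cages"),
            ("delete", "cages"),
            ("send_tasks", "any_cage"),
            ("read_metrics", "all_cages"),
        },
        "deny": set(),
    },
}


def is_allowed(role: str, action: str, resource: str) -> bool:
    """Deny by default; an explicit deny overrides any allow."""
    rules = POLICY.get(role)
    if rules is None:
        return False  # unknown roles get nothing
    pair = (action, resource)
    if pair in rules["deny"]:
        return False
    return pair in rules["allow"]
```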
💰 Cage Cost Model
Daily Cost per Cage
cage_042_daily_cost:
  compute:
    kubernetes_pod: "2 vCPU x 24h x $0.05/vCPU/h = $2.40"
    gpu_share: "0.5 A10 x 24h x $0.50/GPU/h = $6.00"
    storage: "10GB x $0.10/GB/day = $1.00"
    networking: "~$0.10"
    subtotal: "$9.50"
  tokens:
    budget: "100K tokens/hour x 24h = 2.4M tokens/day"
    cost: "2.4M x $0.002/1K = $4.80"
  total_per_cage_per_day: "$14.30"
  total_per_cage_per_month: "$429"
Cost at 1000-Cage Scale
fleet_1000_monthly_cost:
  compute: "$9.50 x 1000 x 30 = $285,000"
  tokens: "$4.80 x 1000 x 30 = $144,000"
  storage: "$0.50 x 1000 x 30 = $15,000"
  management_overhead: "$20,000"
  total_monthly: "$464,000"
  total_annual: "$5,568,000"
  cost_per_artifact: "$464,000 / 1,000,000 artifacts = $0.46"
  cost_per_task: "$464,000 / 500,000 tasks = $0.93"
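The arithmetic in both cost tables can be reproduced with a few lines of code, which also makes it easy to try different prices. This sketch takes every rate and volume directly from the tables (30-day month, 1,000,000 artifacts and 500,000 tasks per month); the function names are hypothetical.

```python
HOURS_PER_DAY = 24
DAYS_PER_MONTH = 30


def cage_daily_cost() -> dict:
    """Per-cage daily cost, using the unit prices from the table."""
    pod = 2 * HOURS_PER_DAY * 0.05        # 2 vCPU at $0.05/vCPU/h
    gpu = 0.5 * HOURS_PER_DAY * 0.50      # 0.5 A10 at $0.50/GPU/h
    storage = 10 * 0.10                   # 10 GB at $0.10/GB/day
    networking = 0.10
    compute = pod + gpu + storage + networking
    tokens_per_day = 100_000 * HOURS_PER_DAY
    token_cost = tokens_per_day / 1000 * 0.002   # $0.002 per 1K tokens
    return {"compute": compute, "tokens": token_cost,
            "total": compute + token_cost}


def fleet_monthly_cost(cages: int = 1000) -> dict:
    """Fleet-level monthly cost, following the 1000-cage table."""
    daily = cage_daily_cost()
    compute = daily["compute"] * cages * DAYS_PER_MONTH
    tokens = daily["tokens"] * cages * DAYS_PER_MONTH
    storage = 0.50 * cages * DAYS_PER_MONTH      # fleet storage line item
    overhead = 20_000                            # management overhead
    total = compute + tokens + storage + overhead
    return {
        "total_monthly": total,
        "total_annual": total * 12,
        "cost_per_artifact": total / 1_000_000,
        "cost_per_task": total / 500_000,
    }


if __name__ == "__main__":
    print({k: round(v, 2) for k, v in cage_daily_cost().items()})
    print({k: round(v, 2) for k, v in fleet_monthly_cost().items()})
```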
🚀 Scaling Strategy
Auto-scaling
horizontal_pod_autoscaler:
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
  scale_up:
    when: "avg utilization > 80% for 5 minutes"
    step: "+10% of current capacity"
    max: "1000 cages"
  scale_down:
    when: "avg utilization < 40% for 30 minutes"
    step: "-10% of current capacity"
    min: "100 cages"
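The `scale_up` and `scale_down` rules reduce to a step of 10% of current capacity, clamped to the 100 to 1000 cage range. The following sketch shows that calculation; `next_capacity` is a hypothetical function, and rounding the step down with a floor of one cage is an assumption the design does not address. In an actual Kubernetes HPA (`autoscaling/v2`), comparable limits are normally expressed through the `behavior` field with `Percent` policies and a stabilization window.

```python
MIN_CAGES = 100
MAX_CAGES = 1000
STEP_PERCENT = 10  # "+/-10% of current capacity"


def next_capacity(current: int, avg_utilization: float,
                  minutes_in_state: float) -> int:
    """Return the new cage count for one scaling evaluation."""
    # Integer arithmetic avoids float rounding; always move by at least 1.
    step = max(1, current * STEP_PERCENT // 100)
    if avg_utilization > 0.80 and minutes_in_state >= 5:
        return min(MAX_CAGES, current + step)   # scale up, capped at max
    if avg_utilization < 0.40 and minutes_in_state >= 30:
        return max(MIN_CAGES, current - step)   # scale down, floored at min
    return current                              # no change
```

For example, `next_capacity(500, 0.85, 5)` yields 550, while `next_capacity(105, 0.30, 45)` is clamped to the minimum of 100.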
Task-Queue-Driven Scaling
queue_based_scaling:
  metrics:
    - queue_depth: "Number of pending tasks"
    - avg_wait_time: "Average task wait time"
  scale_up_trigger:
    - queue_depth > 500
    - avg_wait_time > 5 minutes
  scale_down_trigger:
    - queue_depth < 50
    - avg_wait_time < 30 seconds
    - idle_cages > 30%
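The trigger lists do not say whether their conditions combine with AND or OR. The sketch below picks one plausible reading and labels it as such: any single scale-up condition is enough to add capacity, while all scale-down conditions must hold before capacity is removed. `QueueStats` and `scaling_decision` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class QueueStats:
    queue_depth: int          # number of pending tasks
    avg_wait_seconds: float   # average task wait time
    idle_fraction: float      # idle cages / total cages, in [0, 1]


def scaling_decision(s: QueueStats) -> str:
    """Return "scale_up", "scale_down", or "hold"."""
    # Assumption: ANY scale-up condition triggers a scale-up.
    if s.queue_depth > 500 or s.avg_wait_seconds > 5 * 60:
        return "scale_up"
    # Assumption: ALL scale-down conditions must hold together.
    if (s.queue_depth < 50
            and s.avg_wait_seconds < 30
            and s.idle_fraction > 0.30):
        return "scale_down"
    return "hold"
```

The asymmetry is deliberate in this sketch: scaling up eagerly and scaling down conservatively tends to reduce oscillation when the queue is bursty.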
📝 Next Steps
- Implement the Cage Operator (K8s Custom Resource)
- Build the Agent Runtime image (Docker image)
- Set up the monitoring stack (Prometheus + Grafana)
- Implement auto-scaling (HPA + queue-based)
- Stress test (1000 Cages running concurrently)