Agentic Engineering Documentation
Building the future of AI-driven software development and enterprise operations
🎯 Welcome
This documentation covers two major initiatives in our Agentic Engineering journey:
1. Large Scale Agentic Engineering
Goal: Consolidate 400+ repositories (~39GB) into an AI-friendly mono-repo, enabling AI to autonomously design, develop, test, deploy, and iterate.
Key Insights:
- 80% of team boundaries are for management convenience, not technical necessity
- AI needs global context to achieve scale effects
- This is a production relationship revolution, not just tool optimization
Status: ✅ Small-scale validation complete (10 repos), ready to scale to 400 repos
2. 1000 Agent Platform
Vision: “1000 cages, 1000 AIs, producing high-value outputs”
A large-scale Agentic operating system for managing 1000 AI Agents working in parallel across four scenarios:
| Application | Description | Target |
|---|---|---|
| 1000 Agent Space | Parallel production incident resolution | 70% auto-resolution, MTTR <10min |
| 1000 Agent Engineering | Autonomous mono-repo convergence (400→1) | AI-driven code consolidation |
| 1000 Agent CorpUnit | AI-driven corporate brain (Finance, HR, Legal, etc.) | Real-time business insights |
| 1000 Invested AI Company | Portfolio management for 1000 companies | Automated due diligence & monitoring |
Status: 📐 Design complete, ready for MVP implementation
📚 Documentation Structure
agentic-docs/
├── Part I: Large Scale Agentic Engineering
│ ├── Strategic vision & insights
│ ├── Mono-repo consolidation plan
│ ├── RD-OS architecture
│ └── Implementation details
│
├── Part II: 1000 Agent Platform
│ ├── System architecture
│ ├── Frontend design
│ ├── Cage (Agent container) design
│ └── Four application scenarios
│
├── Part III: Skills & Tools
│ └── Reusable skills for OpenClaw
│
└── Appendix
└── Glossary, FAQ, references
🚀 Quick Start
For Leadership
Start with Strategic Summary to understand the vision and business impact.
For Architects
Read RD-OS Architecture and 1000 Agent Platform Architecture.
For Engineers
- Backend: System Architecture + Cage Design
- Frontend: Frontend Design
- DevOps: Kubernetes Deployment
For Product Managers
🔗 External Links
- OpenClaw: https://openclaw.ai
- Documentation: https://docs.openclaw.ai
- GitHub: https://github.com/openclaw/openclaw
- Community: https://discord.gg/clawd
📞 Contact
- Project Home: https://1000-agent-platform.agents-dev.com
- Email: team@agents-dev.com
- Discord: https://discord.gg/1000agents
Built with ❤️ by the Agentic Engineering Team
Last updated: 2026-03-01
Strategic Summary: Large-scale Agentic Engineering
Strategic summary: migrating to large-scale software system development in the Agent age
Date: 2026-03-01
Audience: Leadership, Engineering Teams
TL;DR (30-second version)
What we are doing:
- 400+ repos → 1 mono-repo
- Manual operations → autonomous AI operations
- Team boundaries → boundaryless AI collaboration
Why it matters:
- 80% of boundaries exist for management convenience, not real value
- AI needs global context to achieve scale effects
- This is a revolution in production relations, not a tool optimization
Expected gains:
- Development efficiency: 10x
- Operations efficiency: 24x (hours → minutes)
- Human role: from Doer to Decider
Core Insights (3-minute version)
Insight 1: The essence of development experience is production relations
Misconception: development experience = how to write code
Truth: development experience = how to organize production
- How to divide work (who does what, where the boundaries are)
- How to collaborate (how to hand off, how to align)
- How to accept work (how to define "done")
- How to evolve (how to iterate, how to refactor)
The Agent-age challenge: productivity has changed (AI writes code), but production relations have not (work is still organized by team/module/sprint).
Conclusion: applying AI productivity to traditional production relations is like mounting an engine on a horse cart.
Insight 2: From "household contract farming" to "mechanized agriculture"
Historical analogy:
| Era | Agriculture | Software development |
|---|---|---|
| Nomadic (pre-2010) | Individual hunting | Hero developers, full-stack |
| Agrarian (2010-2025) | Household contract farming | Team boundaries, module ownership |
| Mechanized (2026+) | Land consolidation + mechanization | Mono-repo + AI clusters |
Problem: household contract farming fragments the land, so large machinery cannot get in.
Solution: land consolidation (mono-repo) + large-machine operation (AI clusters) = 10x productivity
Insight 3: 80% of boundaries are worth tearing down
Boundary value distribution:
20% of boundaries → genuinely isolate risk (security, compliance, core algorithms)
80% of boundaries → management convenience (performance reviews, progress visibility, accountability)
Problem: for 20% of real value, we absorb 80% of the efficiency loss.
Re-evaluation for the AI age:
- Keep the 20% of real boundaries (security, compliance)
- Tear down the 80% of management boundaries (replace them with AI observability)
Project Background (5-minute version)
Surface goal
Project: AI-driven operations alerting and incident analysis
Timeline: kicked off January 2026, evaluation in March
Goals:
- Faster diagnosis (10x)
- Better diagnosis experience
- Higher diagnosis coverage (>90%)
Hidden goals
Validate:
- How productive is an AI team, really?
- Can AI independently deliver a production-grade system?
- Are AI-built systems maintainable and extensible?
- What does collaboration between AI teams and traditional teams look like?
Outputs:
- Technical validation (AI can do operations analysis) ✅
- Experience validation (an AI team can deliver efficiently) ← current stage
- Confidence validation (will leadership dare to scale this out) ← final goal
Why this project is critical
This is the first AI team project reporting directly to the boss. Its outcome determines:
- ✅ Success → leadership gains confidence in AI development → more resources → bigger projects
- ❌ Failure → leadership doubts AI capability → resources shrink → AI becomes a fringe experiment
So this is not an operations project; it is a Proof of Concept for AI development capability.
Technical Approach (10-minute version)
Architecture overview
┌─────────────────────────────────────────────────────────────┐
│              Large-scale Agentic Engineering                │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  OpenClaw (main brain)                                      │
│  ├─ Maintains global state                                  │
│  ├─ Makes scheduling decisions                              │
│  ├─ Spawns sub-agents (sessions_spawn)                      │
│  └─ Recovers on restart (state lives in files)              │
│                                                             │
│  Sub-agent pool (temporary workers)                         │
│  ├─ 1000+ ephemeral agents                                  │
│  ├─ Focused tasks (analysis, migration, guarding)           │
│  ├─ Checkpoint to files                                     │
│  └─ Destroyed on completion                                 │
│                                                             │
│  Persistent state (.rd-os/)                                 │
│  ├─ progress.db (SQLite)                                    │
│  ├─ agent-states/ (JSON checkpoints)                        │
│  └─ artifacts/ (reports, outputs)                           │
│                                                             │
└─────────────────────────────────────────────────────────────┘
Key strategies
| Strategy | Description | Benefit |
|---|---|---|
| Mono-Repo | 400+ repos → 1 | AI sees the full codebase, cross-module optimization |
| AI main brain + sub-agents | OpenClaw schedules 1000+ agents | Parallelism at scale, unified coordination |
| Dynamic resource allocation | Value scoring, tiers (S/A/B/C) | Resources focus on high value, 3-5x utilization |
| AI closed loop | Plan → Code → Test → Deploy | Humans define problems, AI solves them, 10x+ efficiency |
Expected Gains
Short term (6 months)
| Metric | Current | Target | Gain |
|---|---|---|---|
| Features completed by AI | 0% | 20% | - |
| Changes deployed by AI | 0% | 10% | - |
| Operations MTTR | 2-4 hours | <10 minutes | 24x |
| Alerts handled by AI | 0% | 90% | - |
| Human time on routine work | 60% | 30% | 2x |
Long term (12 months)
| Metric | Current | Target | Gain |
|---|---|---|---|
| Features completed by AI | 0% | 50% | - |
| Changes deployed by AI | 0% | 40% | - |
| Optimizations found by AI | 0 | 500/week | - |
| Human time on routine work | 60% | 10% | 6x |
| Engineering efficiency | 1x | 10x | 10x |
Organizational Impact
Changing human roles
| Traditional role | AI-age role |
|---|---|
| Writing code | Defining problems, accepting results |
| Code review | Auditing AI output, setting standards |
| Testing | Defining test strategy, reviewing coverage |
| Operations | Defining SLOs, reviewing AI decisions |
| Project management | Setting priorities, reviewing progress |
Core shift: from Doer to Decider
Management challenges
| Challenge | Response |
|---|---|
| Team resistance | Gradual rollout + training |
| Hard-to-measure performance | Redefine evaluation criteria (from Doer to Decider) |
| Knowledge loss | AI-written documentation + knowledge capture |
Risks and Mitigations
Technical risks
| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| Unstable AI output quality | High | Medium | Human review + automated tests |
| AI system failure | Medium | High | State persistence + recovery |
| AI cost overrun | Low | Medium | Monitor token usage + optimize (~$500/yr) |
Organizational risks
| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| Team resistance | High | High | Gradual rollout + training |
| Hard-to-measure performance | High | Medium | Redefine evaluation criteria |
| Insufficient leadership confidence | Medium | High | Deliver small wins fast |
Timeline
2026-01 ──► AI team formed (operations incident analysis)
   │
2026-03 ──► Evaluation (operations project)
   │        10-repo experiment ✅
   │
2026-03 ──► Phase 2: infrastructure build-out
   │
2026-04 ──► Phase 3: 400-repo analysis
   │
2026-04 ──► Phase 4: P0 migration (50 repos)
   │
2026-05 ──► Phase 4: P1 migration (100 repos)
   │
2026-06 ──► Phase 4: P2-P3 migration (150 repos)
   │
2026-07 ──► Phase 5: AI closed-loop development
   │
2026-12 ──► Phase 6: full optimization
            Features completed by AI >50%
            Human routine work <10%
Recommended Actions
For engineering teams
- Start mono-repo planning (land consolidation)
- Build AI infrastructure (the large machinery)
- Develop AI collaboration skills (new skills)
For management
- Re-evaluate boundary value (which to tear down)
- Redefine performance criteria (from Doer to Decider)
- Invest in AI infrastructure (long-term payoff)
For the boss
- Give the AI team real business scenarios (not fringe experiments)
- Set reasonable expectations (results in 6-12 months)
- Prepare for organizational change (adjusting production relations)
Key Documents
| Document | Description |
|---|---|
| migrate-to-agent-age.md | Strategic manifesto: from the traditional age to the Agent age |
| PROJECT-CHARTER.md | Project charter (including organizational impact) |
| rd-os-vision.md | RD-OS vision |
| rd-os-openclaw-architecture.md | OpenClaw architecture |
| experiment-report.md | 10-repo experiment report |
Final Vision
Looking back from 2027:
We did not merely "introduce AI tools".
We did not merely "optimize the development process".
We accomplished:
- A production-relations transformation from agrarian back to nomadic
- A productivity revolution from fragmentation to scale
- A human role shift from Doer to Decider
We are not "using AI to write code".
We are "using AI to redefine software development".
AI Cloning Advantage: The Superpower of the AI Age
Clone, copy, scale: a mode of production beyond human society
Date: 2026-03-01
Author: Large-scale Agentic Engineering Team
Core Insight
The AI age has a property human society cannot imagine: perfect cloning.
Human society:
- Growing one expert takes 10-20 years
- An expert's experience cannot be copied perfectly
- Experts retire, resign, and make mistakes
- Knowledge transfer relies on documents and word of mouth (massive information loss)
The AI world:
- Training one expert AI takes days to weeks
- An AI's experience can be copied perfectly (cloned)
- AI does not retire, resign, or tire
- Knowledge is fixed in the model (zero information loss)
This is a qualitative leap in productivity, not a quantitative improvement.
The Limits of Human Society
Limit 1: Inefficient knowledge transfer
Growing a human expert:
1. Primary + secondary school + university: 12-16 years
2. Accumulating work experience: 5-10 years
3. Becoming an expert: starts at 20-26, matures at 30-35
Knowledge transfer:
- Apprenticeship: one-on-one, inefficient
- Documentation: large amounts of tacit knowledge cannot be written down
- Word of mouth: >50% information loss
- Resignation: knowledge walks out the door
Result:
- Companies depend on a few key people
- A key person leaving = knowledge lost
- Expansion is hard (experts grow too slowly)
Limit 2: No perfect copies
Humans cannot be cloned:
- Even twins are not identical
- Experience, skill, and intuition cannot be copied
- Every person is unique
Upside:
- Diversity, innovation
Downside:
- Excellence cannot be scaled
- 1000 engineers = 1000 different skill levels
- Quality is inconsistent
Limit 3: Physiological constraints
Human constraints:
- 8-12 working hours per day (the ceiling)
- Needs rest and vacations
- Gets tired, makes mistakes
- Gets emotional, performance fluctuates
- A 30-40 year career (then retirement)
Result:
- Capacity has a ceiling
- Quality fluctuates
- Knowledge drains away (retirement)
The Superpowers of the AI World
Superpower 1: Perfect cloning
AI cloning workflow:
1. Train one expert AI (e.g. a code-review expert)
   - Input: 100K+ code-review samples
   - Training: 3-7 days
   - Cost: ~$100-500
2. Validate AI quality
   - Test-set validation
   - Human spot checks
   - Reach expert level (>95% accuracy)
3. Copy perfectly
   - Copy the model files
   - Deploy to 100 instances
   - Every instance is a 100% identical expert
Result:
- 1 expert → 100 experts (instantly)
- 100% consistent quality
- Cost amortized 100x
Compared to human society:
- Growing 100 experts: 100 people × 10 years = 1000 person-years
- Cloning 100 AI experts: 1 model × 3 days = 3 days
Efficiency gain: 10,000x+
Superpower 2: Selection at scale
AI selection workflow:
1. Train 1000 AI individuals (different hyperparameters, different data)
2. Evaluate each individual on a test set
3. Keep the top 10 (99.9% accuracy)
4. Clone the top 10 and deploy to production
Compared to human society:
- Cannot train 1000 humans (too expensive)
- Cannot evaluate 1000 humans fairly (subjective factors)
- Cannot quickly cull 990 humans (ethical problems)
AI advantages:
- Massively parallel training (1000 at once)
- Objective evaluation (one shared test set)
- Fast iteration (discard the weak, keep the strong)
Result: AI can reach quality levels human society cannot, because it can select at scale.
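The train-evaluate-select loop above can be sketched as follows. `train_variant` and `evaluate` are simulated stand-ins for a real training run and benchmark harness; only the selection logic itself is the point.

```python
import random

def train_variant(seed: int) -> dict:
    """Stand-in for training one model variant from a seed.
    The 'model' here is just a record with a simulated quality score."""
    rng = random.Random(seed)
    return {"seed": seed, "accuracy": rng.uniform(0.90, 0.999)}

def evaluate(model: dict) -> float:
    """Stand-in for scoring a model on one shared test set."""
    return model["accuracy"]

def select_top(population_size: int = 1000, keep: int = 10) -> list[dict]:
    """Train many variants (parallel in practice, sequential here),
    rank them on the shared test set, and keep only the best few."""
    population = [train_variant(seed) for seed in range(population_size)]
    population.sort(key=evaluate, reverse=True)
    return population[:keep]

survivors = select_top()
# The survivors are the candidates that would be cloned and deployed.
```

In a real pipeline, each `train_variant` call would be a training job and `evaluate` a benchmark run; the selection step stays exactly this simple.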
Superpower 3: Knowledge fixation
AI knowledge fixation:
1. What an AI learns is fixed in the model parameters
2. It does not forget (unless deliberately fine-tuned)
3. It is not lost (just back up the model)
4. It can be version-controlled (v1, v2, v3...)
Compared to human society:
- Humans forget (the Ebbinghaus forgetting curve)
- A human leaving = knowledge lost
- Knowledge transfer relies on documents (massive loss)
- No version control ("I remember it wasn't like this before")
AI advantages:
- Zero forgetting
- Zero loss
- Traceable (which version learned what)
- Rollbackable (return to an older version)
Superpower 4: Continuous evolution
AI continuous evolution:
1. Keeps learning after deployment (online learning)
2. Learns from new data (automatic updates)
3. A/B tests different versions
4. Survival of the fittest (weak versions are retired)
Compared to human society:
- Humans learn slowly (deliberate practice required)
- Human experience cannot be shared directly (everyone relearns)
- Humans cannot be A/B tested (ethical problems)
AI advantages:
- Continuous learning (ever stronger)
- Shared knowledge (one learns, all learn)
- Fast iteration (daily cadence)
Practical Scenarios
Scenario 1: Cloning a code-review expert
Today:
- The company has 5 senior code-review experts
- Together they can review 50 PRs per day
- Quality is inconsistent (experts have good and bad days)
- An expert leaving = knowledge lost
AI plan:
1. Train a code-review AI
   - Input: 100K+ historical company PRs and review comments
   - Training: 7 days
   - Validation: >95% accuracy
2. Clone 100 instances
   - Deploy into the CI/CD pipeline
   - Review 5000+ PRs per day
   - 100% consistent quality
   - Never resigns, never tires
3. Evolve continuously
   - Learn from new PRs
   - Update the model monthly
   - Quality keeps improving
Result:
- Review capacity up 100x
- Quality up (consistency)
- Zero knowledge loss
Scenario 2: Cloning an operations expert
Today:
- The company has 3 senior operations experts
- They can handle P0/P1 incidents
- 24x7 on-call (exhausting)
- An expert leaving = systemic risk
AI plan:
1. Train an operations AI
   - Input: historical incident records, remediation playbooks, monitoring data
   - Training: 14 days
   - Validation: correctly handles 95%+ of historical incidents
2. Clone 10 instances
   - 24x7 monitoring
   - Automatically handles P0/P1 incidents
   - Human experts handle only the escalated 5%
3. Evolve continuously
   - Learn from new incidents
   - Update the model weekly
   - Ever stronger
Result:
- Incident handling 10x faster (seconds vs minutes)
- Human experts leave the on-call rotation (better quality of life)
- Zero knowledge loss (an expert leaving is no longer a threat)
Scenario 3: Cloning an architect
Today:
- The company has 2 senior architects
- They own system design and technology selection
- A clear bottleneck (too much demand, too few architects)
- An architect leaving = technical-direction risk
AI plan:
1. Train an architecture AI
   - Input: historical design documents, decision records, retrospectives
   - Training: 30 days
   - Validation: produces reasonable architecture recommendations
2. Clone 5 instances
   - One architecture AI per product line
   - Available 24x7
   - Human architects review key decisions
3. Evolve continuously
   - Learn from new projects
   - Update the model monthly
   - Absorb industry best practices
Result:
- Architecture design 5x faster
- Quality up (consistency, best practices)
- Zero knowledge loss
Genetic Algorithms vs AI Cloning
Classic genetic algorithm
Genetic algorithm:
1. Initial population (randomly generated)
2. Evaluate fitness
3. Selection (keep the fit)
4. Crossover (combine good genes)
5. Mutation (inject diversity)
6. Repeat 2-5 until convergence
Problems:
- Crossover loses information (50% from each parent)
- Mutation is random (may improve, may degrade)
- Many generations are needed to converge
- A "perfect individual" cannot be preserved (the next generation mutates)
AI cloning algorithm
AI cloning:
1. Train multiple individuals (different hyperparameters, data)
2. Evaluate quality
3. Select the top N
4. Copy perfectly (clone, not crossover)
5. Optional: fine-tune the clones (directed optimization)
6. Deploy the clones
Advantages:
- Perfect copies (100% of the good genes preserved)
- Directed optimization (fine-tuning, not random mutation)
- Fast convergence (a few generations suffice)
- The "perfect individual" is preserved (the original model is kept forever)
The essential difference:
- Genetic algorithm = sexual reproduction (crossover + mutation)
- AI cloning = asexual reproduction (perfect copying)
AI cloning is more remarkable because it can:
- Perfectly copy outstanding individuals
- Keep the original version at the same time (rollback any time)
- Optimize in a chosen direction (not by random mutation)
- Run massively in parallel (1000 individuals trained at once)
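A minimal sketch of the clone-and-fine-tune loop just described. The fitness function is a toy stand-in for a model benchmark, and `fine_tune` simulates directed optimization by keeping only tweaks that improve fitness (unlike GA mutation, which is kept regardless of its effect).

```python
import random

def fitness(genome: list[float]) -> float:
    """Toy fitness standing in for a benchmark: closer to all-ones is better."""
    return -sum((g - 1.0) ** 2 for g in genome)

def fine_tune(genome: list[float], rng: random.Random) -> list[float]:
    """Directed optimization: accept a random tweak only if it improves."""
    candidate = [g + rng.gauss(0, 0.1) for g in genome]
    return candidate if fitness(candidate) > fitness(genome) else genome

def clone_evolve(generations: int = 20, pop: int = 30, keep: int = 3) -> list[float]:
    """Select the best individuals, clone them perfectly, fine-tune the clones,
    and never lose the best individual seen so far (no crossover, no blind mutation)."""
    rng = random.Random(0)
    population = [[rng.uniform(0, 2) for _ in range(5)] for _ in range(pop)]
    archive = max(population, key=fitness)  # best-so-far, kept for rollback
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        elites = population[:keep]
        # Perfect copies of each elite; each copy is then fine-tuned independently.
        population = [fine_tune(list(e), rng) for e in elites
                      for _ in range(pop // keep)]
        best = max(population, key=fitness)
        if fitness(best) > fitness(archive):
            archive = best  # originals are superseded, never mutated away
    return archive
```

Because `fine_tune` rejects regressions and `archive` is never overwritten by a worse individual, fitness is monotonically non-decreasing, which is exactly the property crossover-plus-mutation cannot guarantee.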
Organizational Impact
Impact 1: Re-pricing expert value
Traditional:
- Expert value = individual ability × working hours
- Expert scarcity = high value
- The company depends on experts
AI age:
- Expert value = cloneability × number of clones
- Turning expert skill into AI = value maximized
- The company depends on AI (not on individuals)
Result:
- Experts must change roles (from Doer to Trainer)
- An expert's value lies in "training AI", not "doing it themselves"
- The company no longer depends on individuals (it depends on AI)
Impact 2: Redefining organizational scale
Traditional:
- A 1000-person company = 1000 brains
- Expansion = hiring more people
- Management complexity grows with headcount
AI age:
- A 1000-person company = 1000 people + 10,000 AIs
- Expansion = cloning more AIs
- Management complexity does not grow with AI count (AIs manage themselves)
Result:
- Small teams can do big things (10 people + 1000 AIs)
- Company boundaries blur (AIs can collaborate across companies)
- Organizational forms change (from hierarchy to network)
Impact 3: A knowledge-management revolution
Traditional:
- Knowledge management = documents + training
- Knowledge loss = employees leaving
- Knowledge transfer = apprenticeship
AI age:
- Knowledge management = training AI
- Knowledge loss = losing a model (preventable with backups)
- Knowledge transfer = copying a model
Result:
- Knowledge preserved forever
- Knowledge spread at zero cost
- Knowledge keeps evolving
Implementation Strategy
Stage 1: Identify cloneable expert skills (Week 1-2)
Actions:
1. Identify expert skills inside the company
   - Code-review experts
   - Operations experts
   - Architects
   - Test experts
   - ...
2. Assess cloneability
   - Is there enough training data?
   - Are there clear evaluation criteria?
   - Are the rules well-defined (not pure creativity)?
3. Prioritize
   - High value + high cloneability = P0
   - Low value + high cloneability = P1
   - High value + low cloneability = P2 (long term)
   - Low value + low cloneability = exclude
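The 2x2 prioritization in step 3 can be expressed as a small lookup; the example skill names and their hand-assigned scores below are hypothetical.

```python
def prioritize(value: str, cloneability: str) -> str:
    """Map the value/cloneability matrix above to a priority bucket.
    Both inputs are 'high' or 'low'."""
    matrix = {
        ("high", "high"): "P0",
        ("low", "high"): "P1",
        ("high", "low"): "P2 (long term)",
        ("low", "low"): "exclude",
    }
    return matrix[(value, cloneability)]

# Hypothetical expert skills scored by hand as (value, cloneability):
skills = {
    "code review": ("high", "high"),
    "operations": ("high", "high"),
    "product naming": ("low", "low"),
}
ranked = {name: prioritize(*score) for name, score in skills.items()}
```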
Stage 2: Train the first expert AI (Week 3-8)
Actions:
1. Collect training data
   - Historical work output
   - Decision records
   - Evaluation feedback
2. Train the AI model
   - Choose an appropriate model (LLM / specialized)
   - Fine-tune on the expert's data
   - Validate accuracy
3. Human validation
   - Experts review the AI's output
   - Blind tests (human vs AI)
   - Reach expert level (>95% accuracy)
Stage 3: Clone and deploy (Week 9-12)
Actions:
1. Clone AI instances
   - Clone N instances as demand requires
   - Deploy to production
2. Monitor and gather feedback
   - Monitor AI performance
   - Collect feedback
   - Improve continuously
3. Scale out
   - Once value is proven, clone more
   - Expand to other expert skills
Risks and Mitigations
Risk 1: A clone makes mistakes
Scenario:
- An AI clone gives wrong advice
- Multiple clones fail at once (systematic error)
- Large blast radius
Mitigations:
1. Human review of key decisions
2. A/B testing (new model vs old)
3. Fast rollback (keep old versions)
4. Continuous monitoring (anomaly detection)
Risk 2: Over-reliance on AI
Scenario:
- Human skills atrophy (dependence on AI)
- When the AI fails, humans cannot take over
- The company loses autonomy
Mitigations:
1. Humans keep learning (stay AI-independent)
2. Regular "AI offline" drills
3. Humans focus on what AI cannot do (innovation, strategy)
4. Keep human experts (as backup)
Risk 3: Fixed knowledge becomes rigid
Scenario:
- What the AI learned goes stale
- The AI cannot adapt to new situations
- The company's tech stack ossifies
Mitigations:
1. Continuous learning (online learning)
2. Regular model updates (monthly/quarterly)
3. Absorb industry best practices
4. Encourage human innovation (AI executes)
Conclusion
AI cloning is one of the most remarkable properties of the AI age:
- Perfect copying: a capability humans cannot imagine
- Selection at scale: quality levels human society cannot reach
- Knowledge fixation: zero forgetting, zero loss
- Continuous evolution: ever stronger
This is a qualitative leap in productivity:
- Human society: growing 1 expert takes 10 years
- The AI world: training 1 expert takes 10 days, then clone 1000
Organizations need to rethink:
- What is an expert's value? (from Doer to Trainer)
- What are the organization's boundaries? (people + AI)
- How is knowledge managed? (train AI, don't just write documents)
Call to action:
- Identify the expert skills inside your company
- Train the first expert AI
- Clone, deploy, scale
- Keep evolving
Thinking Big: The Core Resistance of the AI Age
"Big" thinking vs local optimization
Date: 2026-03-01
Author: Large-scale Agentic Engineering Team
Core Insight
The biggest resistance in the AI age is not skills, and not multi-agent; it is the absence of "big" thinking.
What "Big" Thinking Is
Definition
"Big" thinking = systematically folding the engineering environment of the entire company, thousands of engineers, into the AI world, and connecting the whole R&D pipeline end to end
Contrast: small thinking vs big thinking
| Dimension | Small thinking (local optimization) | Big thinking (systemic restructuring) |
|---|---|---|
| Scope | One team, one project | The whole company, the whole R&D pipeline |
| Goal | 10-20% efficiency gain | A 10x efficiency revolution |
| Method | Bolt AI tools onto existing processes | Redesign processes around AI |
| Boundaries | Accept existing team/module boundaries | Tear down boundaries, let AI flow freely |
| Data | Local data (one repo) | Global data (the whole codebase) |
| Coordination | Humans coordinate cross-team work | AI schedules centrally |
| Vision | AI assists humans | AI leads execution, humans decide |
Why "Big" Thinking Is So Hard
Resistance 1: Organizational inertia
Today:
- Clear team boundaries (3-7 people per module)
- Clear performance evaluation (this plot of land is yours)
- Risk isolation (your bad harvest doesn't hurt my family)
- Visible promotion paths (from farmer to landlord)
Problems:
- Fragmented land (cannot scale)
- Thick boundary walls (cross-team collaboration is hard)
- Constrained innovation (you can only innovate on your own plot)
- AI cannot get in (blocked by boundaries)
Breaking through requires: redefining the organization, performance, and promotion
Resistance 2: The management comfort zone
Traditional management:
- Progress is observable (just look at the board)
- Responsibility is clear (whose task slipped)
- Risk is contained (within boundaries)
- Outcomes are predictable (sprint commitments)
AI age:
- Progress is coordinated by AI (humans cannot see the details)
- Responsibility is blurry (did the AI do it, or a human?)
- Risk crosses boundaries (AI edits across modules)
- Outcomes are hard to predict (AI may propose surprising solutions)
Managers' fear: losing the sense of control
Breaking through requires: redefining "control", from controlling the process to controlling the goals
Resistance 3: Technical debt
Today:
- 400+ repos (historical legacy)
- Inconsistent tech stacks (Go/Java/TS/Python)
- Varied build systems (Maven/npm/custom)
- Scattered CI/CD (GitHub/GitLab/Jenkins)
Problems:
- AI needs unified interfaces
- AI needs global context
- AI needs standardized processes
Cost of the overhaul: high (but the cost of not changing is higher)
Breaking through requires: investing resources in an infrastructure overhaul
Resistance 4: Fixed mindsets
Common thinking:
- "AI is a tool; humans are the principal"
- "AI assists development, it doesn't lead"
- "Try it at the edges first; don't touch the core"
- "Wait until AI is more mature"
Problems:
- Treating AI as a "better hammer"
- Failing to see AI as a "new mode of production"
- Local optimization cannot unleash AI's potential
Breaking through requires: a cognitive upgrade. AI is not a tool; it is a new set of production relations
Core Principles of "Big" Thinking
Principle 1: Global optimum > local optimum
❌ Small thinking: optimize one team's efficiency
✅ Big thinking: optimize the whole company's R&D efficiency
Example:
- Small thinking: give team A AI tools, gain 20%
- Big thinking: mono-repo + AI cluster, gain 10x
Cost:
- Small thinking: no conflict, but limited upside
- Big thinking: requires organizational change, but huge upside
Principle 2: AI leads execution > AI assists humans
❌ Small thinking: AI writes code, humans review
✅ Big thinking: AI leads development, humans define the problems
Example:
- Small thinking: Copilot helps write a function
- Big thinking: AI builds a feature independently, humans accept it
Cost:
- Small thinking: humans remain the bottleneck
- Big thinking: requires trusting AI and new processes
Principle 3: Tear down boundaries > accept boundaries
❌ Small thinking: use AI inside existing boundaries
✅ Big thinking: tear down boundaries for AI
Example:
- Small thinking: each team uses its own AI tools
- Big thinking: unified AI infrastructure, AI flows freely
Cost:
- Small thinking: AI stays trapped by boundaries
- Big thinking: requires unified standards and unified scheduling
Principle 4: Systemic restructuring > local optimization
❌ Small thinking: bolt AI tools onto existing processes
✅ Big thinking: redesign processes around AI
Example:
- Small thinking: AI assists code review
- Big thinking: AI leads review, humans spot-check
Cost:
- Small thinking: processes unchanged, limited gains
- Big thinking: processes rebuilt, 10x efficiency
The Path to "Big" Thinking
Stage 1: Cognitive upgrade (1-2 months)
Goal: get the core team to understand "big" thinking
Actions:
- Strategy documents (this document + STRATEGIC-SUMMARY.md)
- Internal sharing (engineering teams, management)
- Benchmarking (Google, Stripe, etc.)
Success criteria:
- Core team understands and agrees
- Management backs the change
- Budget and resources secured
Stage 2: Infrastructure (2-3 months)
Goal: build the infrastructure for AI at scale
Actions:
- Mono-repo consolidation (400 → 1)
- Unified build system (Bazel)
- Unified CI/CD
- AI infrastructure (OpenClaw + Agents)
Success criteria:
- 400 repos migrated
- Build time <30 minutes
- AI infrastructure live
Stage 3: Process restructuring (3-6 months)
Goal: redesign the R&D process around AI
Actions:
- AI-led development (Plan → Code → Test)
- AI-led deployment (Build → Deploy → Monitor)
- AI-led operations (Detect → Diagnose → Fix)
- Human role shift (from Doer to Decider)
Success criteria:
- Features completed by AI >50%
- Changes deployed by AI >40%
- Human routine work <10%
Stage 4: Organizational change (6-12 months)
Goal: adapt the organization to the AI age
Actions:
- Redefine team boundaries (dynamic teaming)
- Redefine performance (from Doer to Decider)
- Redefine promotion (AI collaboration skills)
- Redefine management (from control to enablement)
Success criteria:
- Organizational satisfaction >80%
- Talent retention >90%
- Innovation output up 2x
Case Studies: Small Thinking vs Big Thinking
Case 1: Operations alert handling
Small-thinking plan:
Today: 100+ alerts/day, manual triage
Plan:
- AI-assisted classification (auto-labeling)
- AI-suggested root causes (humans confirm)
- AI-suggested fixes (humans execute)
Gain: 2-3x efficiency
Cost: low (AI layered onto the existing process)
Problem: humans remain the bottleneck
Big-thinking plan:
Today: 100+ alerts/day, manual triage
Plan:
- AI fully in charge (90% of alerts handled automatically)
- AI auto-diagnoses + auto-remediates
- Humans handle only the escalated 10%
Gain: 10x efficiency, 90% less human effort
Cost: high (needs AI infrastructure and trust in AI)
Outcome: humans focus on high-value problems
Case 2: Code review
Small-thinking plan:
Today: manual code review
Plan:
- AI-assisted review (automated checks)
- AI-suggested improvements (humans decide)
- Humans still lead the review
Gain: review 30% faster
Cost: low
Problem: humans remain the bottleneck; review quality depends on individuals
Big-thinking plan:
Today: manual code review
Plan:
- AI leads the review (automated)
- AI auto-approves (PRs that meet the standard)
- Humans review only high-risk changes
Gain: review 10x faster, 80% less human effort
Cost: high (needs AI training and process change)
Outcome: humans focus on architecture and security review
Case 3: Project management
Small-thinking plan:
Today: manual sprint planning, manual tracking
Plan:
- AI-assisted estimation (suggested story points)
- AI-assisted tracking (auto-updated board)
- Humans still lead planning
Gain: planning 20% faster
Cost: low
Problem: humans remain the bottleneck; estimates are still inaccurate
Big-thinking plan:
Today: manual sprint planning, manual tracking
Plan:
- AI leads planning (based on historical data)
- AI auto-assigns tasks (based on skill/load)
- AI auto-tracks (real time)
- Humans review only priorities
Gain: planning 5x faster, accuracy 2x
Cost: high (needs historical data and trust in AI)
Outcome: humans focus on product direction
Why "Big" Must Happen Now
The time window
2024-2026: AI capability matures
- LLM capability suffices (writing code, reviewing, debugging)
- Agent frameworks mature (AutoGen, LangChain)
- Infrastructure matures (OpenClaw and the like)
2026-2028: the window for scaling AI
- First movers build an edge (10x efficiency)
- Latecomers struggle to catch up (infrastructure gap)
- The market reshapes (efficiency decides competitiveness)
2028+: the AI-age new normal
- AI-led development becomes the standard
- Human Doers are displaced
- Only Deciders survive
Conclusion: if we don't go "big" now, there will be no chance later
Competitive pressure
What competitors are doing:
- Google: AI-led development (already scaled internally)
- Stripe: mature AI infrastructure
- Startups: no legacy baggage, AI-native from day one
If we don't go "big":
- Efficiency gap: 10x
- Cost gap: 5x
- Innovation speed gap: 3x
Conclusion: not going "big" = being displaced
Call to Action
For individuals
Ask yourself:
- Am I thinking "how do I use AI to do my current job better"?
- Or "how do I use AI to redefine the job"?
Actions:
- Learn AI collaboration skills
- Move from Doer to Decider
- Embrace the change instead of resisting it
For teams
Ask yourselves:
- Are we optimizing inside existing boundaries?
- Or tearing down boundaries for AI?
Actions:
- Push for the mono-repo
- Unify infrastructure
- Break down team walls
For management
Ask yourselves:
- Are we protecting the current management comfort zone?
- Or restructuring the organization for the AI age?
Actions:
- Redefine performance (from Doer to Decider)
- Redefine promotion (AI collaboration skills)
- Redefine management (from control to enablement)
For the boss
Ask yourself:
- Are we doing local optimization (a 10-20% gain)?
- Or systemic restructuring (a 10x revolution)?
Actions:
- Invest in AI infrastructure (mono-repo, OpenClaw)
- Back organizational change (teams, performance, promotion)
- Give the AI team real business scenarios (not fringe experiments)
- Set reasonable expectations (results in 6-12 months)
Conclusion
The biggest resistance in the AI age is not technology; it is thinking.
"Big" thinking = systematically folding the engineering environment of the entire company, thousands of engineers, into the AI world, and connecting the whole R&D pipeline end to end
Local optimization cannot unleash AI's potential; only systemic restructuring brings a 10x efficiency revolution.
If we don't go "big" now, there will be no chance later.
Move Forward to the Agent Age: Large-Scale Software System Development
Date: 2026-03-01
Author: Large-scale Agentic Engineering Team
Status: Draft for Discussion
Executive Summary
In January 2026 we formed an AI team reporting directly to the boss. The surface goal: use AI to do deep analysis of production operations alerts and incidents, improve diagnosis speed and coverage, and deliver evaluation results in March.
But the boss's hidden goal runs deeper: verify just how productive an AI team can be, and accumulate the experience and confidence needed to introduce AI development across the whole company.
This article is therefore not about an operations project, but about an exploratory project in AI software team R&D. It will become the foundational engineering for the AI development experience we introduce later.
Core insight: many people think development experience is "how to write code", but real development experience is "how to organize production". In the Agent age, we need to systematically migrate the production relations of the traditional world into the AI world: not optimizing the old system, but removing boundaries and bringing in machine-scale production.
1. Background: The Real Mission of an "Operations Project"
1.1 Surface goal
Project: AI-driven operations alerting and incident analysis
Timeline: kicked off January 2026, evaluation in March
Goals:
- Faster diagnosis
- Better diagnosis experience
- Higher diagnosis coverage
1.2 Hidden goals
Validate:
- How productive is an AI team, really?
- Can AI independently deliver a production-grade system?
- Are AI-built systems maintainable and extensible?
- What does collaboration between AI teams and traditional teams look like?
Outputs:
- Technical validation (AI can do operations analysis)
- Experience validation (an AI team can deliver efficiently)
- Confidence validation (will leadership dare to scale this out)
1.3 Why this project is critical
This is the first AI team project reporting directly to the boss. Its outcome determines:
- ✅ Success → leadership gains confidence in AI development → more resources → bigger projects
- ❌ Failure → leadership doubts AI capability → resources shrink → AI becomes a fringe experiment
So this is not an operations project; it is a Proof of Concept for AI development capability.
2. Core Insight: The Essence of Development Experience Is Production Relations
2.1 Misconception: development experience = how to write code
Many people think development experience means:
- How to write high-performance code
- How to design elegant architecture
- How to write maintainable code
- How to debug complex problems
These matter, but they are not the essence.
2.2 Truth: development experience = how to organize production
Real development experience is:
- How to divide work: who does what, where the boundaries are
- How to collaborate: how to hand off, how to align
- How to accept work: how to define done, how to guarantee quality
- How to evolve: how to iterate, how to refactor
This is production relations, not productivity.
2.3 The Agent-age challenge
In the Agent age, productivity has changed (AI writes code) but production relations have not:
- Work is still divided by team
- Still bounded by module
- Still accepted by sprint
- Still reviewed manually
Applying AI productivity to traditional production relations is like mounting an engine on a horse cart.
3. Historical Analogy: From Nomadic to Agrarian to Mechanized Agriculture
3.1 Stage one: the nomadic era (craft workshops)
Traits:
- Individual heroics
- Full-stack development (one person does everything)
- No explicit division of labor
- Output depends on individual ability
Problems:
- Does not scale
- Inconsistent quality
- Knowledge is not captured
3.2 Stage two: the agrarian era (land titling)
Traits:
- Team division of labor (frontend, backend, test, ops)
- Module boundaries (microservices, componentization)
- Process discipline (Scrum, code review, CI/CD)
- Measurable performance (story points, velocity)
Strengths:
- Scales
- Quality is controllable
- Risk is isolated
Problems:
- Fragmented land (3-7 people per module)
- Thick boundary walls (cross-team communication costs)
- Large machinery cannot get in (AI cannot cross the boundaries)
This resembles China's household contract responsibility system:
- Land titled to households (modules titled to teams)
- Clear incentives (clear performance)
- But the land is fragmented (modules are fragmented)
- Mechanized agriculture cannot happen (AI cannot scale)
3.3 Stage three: the mechanized-agriculture era (AI at scale)
Traits:
- Land consolidation (module consolidation, mono-repo)
- Large-machine operation (AI clusters working at scale)
- Unified scheduling (OpenClaw orchestration)
- Multiplied output (10x efficiency gains)
Prerequisites:
- Remove boundaries (tear down team walls and module walls)
- Unify standards (one build, one test suite, one deployment)
- Centralize scheduling (the AI main brain coordinates)
4. Boundaries: The Biggest Obstacle to AI Development at Scale
4.1 The nature of boundaries
Boundaries are not a technical problem; they are a management problem:
| Boundary type | Stated reason | Real purpose |
|---|---|---|
| Team boundaries | Specialization | Isolating development cadence, performance evaluation |
| Module boundaries | Decoupling | Isolating risk, easier replacement |
| Delivery boundaries | Independent deployment | Isolating failure domains |
| Code boundaries | Code ownership | Clear accountability |
4.2 The cost of boundaries
Suppose a company has 100 microservices and 50 teams:
Traditional model:
- 3-7 people per team
- Each service in its own repo
- Cross-team communication: 50×49/2 = 1,225 communication paths
- Cross-service dependencies: each service depends on 10 others on average
- Coordination cost: >50% of development time
AI model:
- AI is not limited by team boundaries
- But it is limited by repo boundaries
- By permission boundaries
- By process boundaries
Result: AI is trapped by traditional boundaries, and the efficiency gain stays limited
4.3 80% of boundaries are worth tearing down
Based on our analysis:
Boundary value distribution:
20% of boundaries → genuinely isolate risk (security, compliance, core algorithms)
80% of boundaries → management convenience (performance reviews, progress visibility, accountability)
Problem: for 20% of real value, we absorb 80% of the efficiency loss
In the AI age we need to re-evaluate the value of boundaries:
- Keep the 20% of real boundaries (security, compliance)
- Tear down the 80% of management boundaries (replace them with AI observability)
5. Development in the Agent Age: High-Payoff Strategies
5.1 Strategy 1: Mono-Repo (land consolidation)
Why:
- AI needs global context
- AI needs cross-module optimization
- AI needs unified build/test/deploy
How:
- 400+ repos → 1 mono-repo
- Unified build system (Bazel)
- Unified test framework
- Unified deployment process
Payoff:
- AI can access the full codebase
- AI can optimize across modules
- AI can automate the end-to-end pipeline
5.2 Strategy 2: AI main brain + sub-agent cluster (large-machine operation)
Why:
- A single AI has limited capability
- Work must be parallelized at scale
- Scheduling must be unified
How:
- OpenClaw as the main brain (decisions, scheduling)
- Sub-agents as workers (execution, feedback)
- State persisted (resumable from checkpoints)
Payoff:
- 1000+ agents working in parallel
- Unified scheduling, no conflicts
- Failure recovery, continuous operation
5.3 Strategy 3: Dynamic resource allocation (precision agriculture)
Why:
- Not all code is equally valuable
- AI resources should concentrate on high-value areas
- Allocation must adjust dynamically
How:
- Value scoring (0-100)
- Tiering (S/A/B/C)
- Dynamically sized agent allocations
Payoff:
- An S-tier repo gets 8 agents for deep analysis
- A C-tier repo gets 0.5 agent for a quick scan
- Resource utilization up 3-5x
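The scoring-to-allocation mapping in Strategy 3 might look like the following sketch. The tier thresholds and per-tier agent counts are illustrative assumptions, chosen only to match the "8 agents for S, 0.5 for C" example above.

```python
# Illustrative thresholds mapping a 0-100 value score onto S/A/B/C tiers.
TIER_THRESHOLDS = [(85, "S"), (70, "A"), (50, "B"), (0, "C")]
# Assumed agents per tier, matching the S=8 / C=0.5 example in the text.
AGENTS_PER_TIER = {"S": 8.0, "A": 4.0, "B": 1.0, "C": 0.5}

def tier_for(score: float) -> str:
    """Map a 0-100 value score to a tier."""
    for threshold, tier in TIER_THRESHOLDS:
        if score >= threshold:
            return tier
    return "C"

def allocate(repo_scores: dict[str, float]) -> dict[str, float]:
    """Assign an agent budget to each repo based on its value tier."""
    return {repo: AGENTS_PER_TIER[tier_for(s)] for repo, s in repo_scores.items()}

# Hypothetical repo names and scores:
plan = allocate({"tidb": 92, "cloud-console": 74, "old-fork": 12})
```

Re-running `allocate` as scores change is what makes the allocation dynamic: a repo that heats up moves to a higher tier and automatically receives more agents.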
5.4 Strategy 4: AI closed loop (self-driving)
Why:
- Human coordination is the bottleneck
- AI can coordinate itself
- End-to-end automation is required
How:
- AI development (Plan → Code → Test)
- AI deployment (Build → Deploy → Monitor)
- AI operations (Detect → Diagnose → Fix)
Payoff:
- Humans focus on defining problems
- AI owns solving them
- 10x+ efficiency
6. Implementation Path: From an Operations Project to an R&D Revolution
6.1 Phase 1: Operations incident analysis (Jan-Mar 2026)
Goal: prove AI can analyze operations problems independently
Scope:
- Alert aggregation (100+ alerts/day → 10 incidents/day)
- Root-cause analysis (AI diagnoses, humans confirm)
- Auto-remediation (known issues handled automatically by AI)
Success criteria:
- Diagnosis 10x faster (hours → minutes)
- Diagnosis coverage >90%
- Auto-remediation rate >50%
Hidden validation:
- Can the AI team deliver independently?
- AI development efficiency vs a traditional team?
- Is the AI-built system maintainable?
6.2 Phase 2: Mono-repo consolidation (Mar-Jun 2026)
Goal: 400+ repos → 1 mono-repo
Scope:
- Analyze 400 repos (value scoring, tiering)
- Migrate 400 repos (preserve history, update builds)
- Deploy AI infrastructure (OpenClaw, agents)
Success criteria:
- 400/400 repos migrated
- Build time <30 minutes (full build)
- AI infrastructure live
Hidden validation:
- Can AI coordinate engineering at this scale?
- Can AI handle complex dependencies?
- Can AI run continuously (for weeks)?
6.3 Phase 3: AI closed-loop development (Jul-Dec 2026)
Goal: AI independently develops, tests, and deploys features
Scope:
- AI development (from requirements to code)
- AI testing (generate and run tests)
- AI deployment (CI/CD, monitoring)
Success criteria:
- Features completed by AI >20%
- Changes deployed by AI >10%
- Human routine work <30%
Hidden validation:
- Can AI deliver business value independently?
- Does AI development quality meet the bar?
- Are humans willing to trust AI?
7. Organizational Impact: The Return from Agrarian to Nomadic
7.1 The traditional organization: agrarianized (land titling)
Traits:
- Clear team boundaries (this plot is yours)
- Measurable performance (how much this plot yields)
- Risk isolation (your bad harvest doesn't hurt my family)
- Promotion paths (from farmer to landlord)
Problems:
- Fragmented land (cannot scale)
- Thick boundary walls (cross-team collaboration is hard)
- Constrained innovation (you can only innovate on your own plot)
7.2 The AI-age organization: the new nomadism
Traits:
- No fixed boundaries (AI can work anywhere)
- Dynamic teaming (ad-hoc groups per task)
- Unified scheduling (the AI main brain coordinates)
- Output-driven (whoever does it, done is done)
Strengths:
- Scale (AI works in parallel)
- Flexibility (change direction any time)
- Free innovation (AI can innovate across domains)
Challenges:
- Human roles must be redefined
- Performance evaluation must change
- Management must change
7.3 Changing human roles
| Traditional role | AI-age role |
|---|---|
| Writing code | Defining problems, accepting results |
| Code review | Auditing AI output, setting standards |
| Testing | Defining test strategy, reviewing coverage |
| Operations | Defining SLOs, reviewing AI decisions |
| Project management | Setting priorities, reviewing progress |
Core shift: from Doer to Decider
8. Risks and Mitigations
8.1 Technical risks
| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| Unstable AI output quality | High | Medium | Human review + automated tests |
| AI system failure | Medium | High | State persistence + recovery |
| AI cost overrun | Low | Medium | Monitor token usage + optimize |
8.2 Organizational risks
| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| Team resistance | High | High | Gradual rollout + training |
| Hard-to-measure performance | High | Medium | Redefine evaluation criteria |
| Knowledge loss | Medium | High | AI-written documentation + knowledge capture |
8.3 Management risks
| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| Insufficient leadership confidence | Medium | High | Deliver small wins fast |
| Inflated expectations | High | Medium | Expectation management + transparent communication |
| Insufficient resources | Medium | High | Prove ROI + lobby for resources |
9. Conclusion: Migrating to the Agent Age
9.1 Core theses
- The essence of development experience is production relations, not productivity
- The Agent age needs new production relations, not an optimized version of the old ones
- Boundaries are the biggest obstacle; 80% of them are management convenience, not real value
- Mono-repo + AI clusters are the infrastructure of mechanized agriculture
- From agrarian to the new nomadism is the inevitable direction of organizational evolution
9.2 Recommended actions
For engineering teams:
- Start mono-repo planning (land consolidation)
- Build AI infrastructure (the large machinery)
- Develop AI collaboration skills (new skills)
For management:
- Re-evaluate boundary value (which to tear down)
- Redefine performance criteria (from Doer to Decider)
- Invest in AI infrastructure (long-term payoff)
For the boss:
- Give the AI team real business scenarios (not fringe experiments)
- Set reasonable expectations (results in 6-12 months)
- Prepare for organizational change (adjusting production relations)
9.3 Final vision
Looking back from 2027:
We did not merely "introduce AI tools".
We did not merely "optimize the development process".
We accomplished:
- A production-relations transformation from agrarian back to nomadic
- A productivity revolution from fragmentation to scale
- A human role shift from Doer to Decider
We are not "using AI to write code".
We are "using AI to redefine software development".
Appendix: Experiment Cases
A.1 Operations incident analysis experiment
Scenario: database CPU alert
Traditional flow:
1. Alert fires (on-call gets notified)
2. Log into the monitoring system (check metrics)
3. Correlate (check logs, check recent changes)
4. Locate the root cause (perhaps a slow query)
5. Fix (kill the query, optimize the index)
6. Retrospective (write the post-mortem)
Time: 2-4 hours
People: 1-2
AI flow:
1. Alert fires (AI detects the anomaly)
2. AI analyzes automatically (metrics, logs, changes)
3. AI locates the root cause (slow query, SQL ID: XXX)
4. AI remediates automatically (kills the query, notifies the owner)
5. AI writes the report (root cause, impact, prevention)
Time: 5-10 minutes
People: 0 (fully automated)
Efficiency gain: 24x faster, 100% of the human effort saved
A.2 Mono-repo analysis experiment
Scenario: assess the value of 10 repos
Traditional flow:
1. Manually collect metadata (stars, forks, language)
2. Manually analyze code structure
3. Manually assess dependencies
4. Manually write the report
Time: 10 repos × 4 hours = 40 hours
People: 1-2
AI flow:
1. AI collects metadata automatically (GitHub API)
2. AI analyzes code structure automatically
3. AI assesses dependencies automatically
4. AI generates the report automatically
Time: 30 minutes
People: 0 (fully automated)
Efficiency gain: 80x faster, 100% of the human effort saved
Mono-Repo Consolidation: Executive Summary
TiDB Agentic Engineering AI-First Initiative
The Vision
Build a mono-repo where AI can autonomously:
- Design system architecture
- Develop features end-to-end
- Test and validate changes
- Deploy and monitor services
- Iterate based on outcomes
This is not just code consolidation. This is building the foundation for General Relativity: AI owns the full engineering lifecycle.
The Problem
Current State: 400+ Repositories, ~39GB
├── Products: TiDB, TiDB Next-Gen
├── Platform: TiDB Cloud SaaS
├── DevOps: Operations tools
├── Forks: Third-party dependencies
└── Abandoned: Unused projects
Issues:
❌ AI cannot see full system context
❌ Cross-repo optimization is impossible
❌ Human coordination overhead scales with repo count
❌ Dependency hell across repos
❌ Inconsistent tooling and practices
The Solution
Target State: 1 Unified Mono-Repo
├── AI-readable structure
├── AI-optimizable boundaries
├── Automated build/test/deploy
├── Clear ownership (CODEOWNERS)
└── Trunk-based development
Google’s Playbook (2 Billion LOC Proven)
| Principle | Google’s Practice | Our Application |
|---|---|---|
| Single Repo | 95% of code in one place | All 400 repos → 1 mono-repo |
| Trunk-Based | Direct commits to main | Pre-commit review, small changes |
| Code Ownership | OWNERS files per workspace | CODEOWNERS per component |
| Build System | Bazel (incremental) | Bazel/Turborepo/Nx based on stack |
| Automation | 24K automated commits/day | AI agents + automation |
| Access | Default open, exceptions restricted | Open within engineering |
Key Insight: If monorepo works for Google at 2B LOC with 25K engineers, it can work for us.
Our AI Advantage
Google built their system before AI was mainstream. We have a unique advantage:
Google (Human-Centric Automation)
Humans: Write code, review, fix dependencies, deploy
Automation: Formatting, dependency updates, builds, tests
Us (AI-First)
AI Agents: Write code, review, fix dependencies, optimize builds, deploy decisions
Humans: Define problems, set priorities, review architecture, handle edge cases
We’re not just matching Google. We’re going beyond.
Three-Layer AI Development Model
┌─────────────────────────────────────────────────────────────┐
│ AI Capability Layers │
├─────────────────────────────────────────────────────────────┤
│ Micro │ Skills, MCP, Tools │ Current state │
│ │ (Efficiency in existing) │ │
├─────────────────────────────────────────────────────────────┤
│ Meso │ Feature lifecycle │ Phase 4.2 │
│ │ (AI drives design→deploy) │ │
├─────────────────────────────────────────────────────────────┤
│ Macro │ System architecture │ Phase 4.3 │
│ │ (AI reorganizes everything) │ │
├─────────────────────────────────────────────────────────────┤
│ General │ AI owns everything │ End state │
│ Relativity │ │
└─────────────────────────────────────────────────────────────┘
Project Phases
Phase 1: Repository Analysis (Week 1-2)
400+ AI Agents analyze all repos
| Agent Task | Output |
|---|---|
| Freshness check | Activity score |
| Dependency mapping | Dependency graph |
| Code quality scan | Quality metrics |
| Usage analysis | Import/deployment count |
| Merge recommendation | Keep/Migrate/Archive |
Deliverable: repo-analysis-report.md
Phase 2: Mono-Repo Design (Week 2-3)
Infrastructure setup
mono-repo/
├── products/ # TiDB, TiDB Next-Gen
├── platform/ # Cloud SaaS
├── devops/ # Operations
├── libs/ # Shared libraries
├── tools/ # Build/dev tools
└── infra/ # Infrastructure
Key Decisions:
- Build system (Bazel vs Turborepo vs Nx)
- CODEOWNERS structure
- CI/CD path-based triggering
- Branching model (trunk-based)
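One of the key decisions above, the CODEOWNERS structure, might start from a fragment like this for the directory layout shown; the team handles are hypothetical. Note that in GitHub's CODEOWNERS format the last matching pattern wins, so the catch-all rule goes first.

```text
# Hypothetical CODEOWNERS fragment for the mono-repo layout above.
# Later, more specific rules take precedence over earlier ones.
*                    @org/eng-leads
/products/tidb/      @org/tidb-core
/platform/           @org/cloud-platform
/devops/             @org/devops
/libs/               @org/shared-libs
/infra/              @org/infra
```

A fragment like this lives at the repo root (or in `.github/`) and makes review routing automatic, which is what lets AI-generated PRs land in the right human queue.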
Deliverables: mono-repo-structure.md, codeowners-template.md, build-system-evaluation.md
Phase 3: Pilot Migration (Week 3-4)
10-20 repos (P0 priority)
| Step | Action |
|---|---|
| 1 | Pre-migration check (deps, conflicts) |
| 2 | Code transfer (preserve git history) |
| 3 | Integration (update builds, fix imports) |
| 4 | Validation (CI/CD, tests, smoke) |
| 5 | Cutover (archive old repo) |
Deliverable: migration-runbook.md (refined from pilot)
Phase 4: Bulk Migration (Week 4-8)
Remaining ~380 repos in batches
| Priority | Repos | Duration |
|---|---|---|
| P0 (core products) | ~50 | 3-5 days |
| P1 (platform) | ~100 | 5-7 days |
| P2-P3 (tools, libs) | ~150 | 7-10 days |
| P4-P5 (cleanup) | ~100 | 2-3 days |
Phase 5: AI Enablement (Week 8+)
Closed-loop development
| Capability | Description |
|---|---|
| AI Code Generation | Feature development, bug fixes |
| AI Code Review | Automated PR review |
| AI Test Generation | Coverage-guided test creation |
| AI Refactoring | Cross-component optimization |
| AI Deployment | Auto-scaling, multi-region routing |
| AI Progress Tracking | Sprint planning, task estimation |
Deliverable: ai-dev-loop-spec.md, ai-first-methodology.md
Success Metrics
| Metric | Current | 6 Months | 12 Months |
|---|---|---|---|
| AI-completed features | 0% | 20% | 50% |
| AI-identified optimizations | 0 | 100/week | 500/week |
| AI-deployed changes | 0% | 10% | 40% |
| Human time on routine tasks | 60% | 30% | 10% |
| Build time (incremental) | N/A | <5 min | <3 min |
| PR review time | N/A | <4 hours | <2 hours |
Resource Requirements
Infrastructure
| Resource | Minimum | Recommended |
|---|---|---|
| CPU | 8 cores | 16+ cores |
| Memory | 16 GB | 32+ GB |
| Storage | 100 GB SSD | 500 GB+ SSD |
| Network | 1 Gbps | 10 Gbps |
Tooling
- Build System: Bazel / Turborepo / Nx
- Code Search: Sourcegraph / Zoekt
- CI/CD: GitHub Actions / GitLab CI
- Agent Framework: Custom (Python/Go)
Team
- Project Lead: 1 FTE
- Build/Infra Engineer: 1-2 FTE
- AI/ML Engineer: 1-2 FTE
- Team Representatives: 0.2 FTE each (for migration decisions)
Risks & Mitigation
| Risk | Impact | Mitigation |
|---|---|---|
| Data loss | High | Full backups before each batch |
| Downtime | High | Parallel run (old + new) |
| Broken builds | Medium | Comprehensive tests, canary deploys |
| Team disruption | Medium | Gradual migration, training |
| Performance degradation | Medium | Incremental builds, caching |
| Rollback needed | Low | Keep old repos read-only 30 days |
Open Questions (Need Answers)
- Tech Stack: What languages/frameworks are in the 400 repos?
  - Determines build system choice (Bazel vs Turborepo vs Nx)
- Current CI/CD: What’s the existing pipeline?
  - Affects migration complexity
- Team Structure: How many engineers? How organized?
  - Affects CODEOWNERS design
- Deployment: How are services currently deployed?
  - Affects infra design
- Agent Hosting: Where will the 400 agents run?
  - Local cluster? Cloud? Hybrid?
Next Steps (Planning Phase: 1-2 Days)
Day 1: Analysis Framework
- Set up distributed agent infrastructure
- Define analysis metrics and scoring
- Create repo inventory (list all 400 repos)
- Run pilot analysis on 10 repos
Day 2: Mono-Repo Design
- Finalize directory structure
- Design build system architecture
- Plan migration tooling
- Create detailed migration runbook
Deliverables
- repo-analysis-report.md
- mono-repo-structure.md
- migration-runbook.md
- ai-dev-loop-spec.md
- ai-first-methodology.md
- ai-capability-maturity.md
- google-monorepo-lessons.md ✅ DONE
- codeowners-template.md ✅ DONE
- build-system-evaluation.md
Conclusion
This project is not just about consolidating code. It’s about:
- Building the foundation for AI to own the full engineering lifecycle
- Learning from Google’s playbook (2B LOC proven)
- Going beyond Google with AI-first decision automation
- Enabling Agentic Engineering at scale
The goal is not to help humans do AI work. The goal is to have AI do the work, and humans define what matters.
Prepared for: TiDB Agentic Engineering AI-First Initiative | Last updated: Planning Phase
Mono-Repo Consolidation Plan
Agentic Engineering AI-First Initiative
“AI should be able to automatically complete a project from development to deployment.”
“Google proved monorepo scales to 2 billion lines. We’re building on that foundation with AI ownership.”
Overview
Goal: Consolidate 400+ repositories (~39GB) into an AI-friendly mono-repo with closed-loop development, testing, and progress management.
Strategic Context: This is not just a code consolidation — it’s a first-principles reimagining of AI-driven engineering. We’re building the foundation for AI to own the full lifecycle: architecture, development, testing, deployment, and iteration.
Inspired By: Google’s monorepo (2B LOC, 25K engineers, 45K commits/day)
Our Advantage: Google automated processes. We automate decisions with AI.
Timeline: Planning phase (1-2 days) → Execution phase (TBD)
AI-First Engineering Philosophy
Three Layers of AI-Driven Development
| Layer | Scope | Focus | This Project |
|---|---|---|---|
| Micro | Skills, MCP, Tools | Efficiency in existing systems | Foundation |
| Meso | Feature lifecycle | AI drives design→test→deploy | Core capability |
| Macro | System/org architecture | AI reorganizes everything | Ultimate goal |
Relativity Framework
Special Relativity (Near-term):
AI can automatically complete a single project: development, testing, deployment, launch
General Relativity (Ultimate):
AI unifies all company repositories, system architecture, deployment, modules — all deeply designed for AI ownership
This Project’s Place
Current State → Micro layer (tools, skills, MCP)
↓
This Project → Meso + Macro transition
↓
End State → General Relativity achieved
(AI owns full lifecycle across unified codebase)
Current State
Total Repos: ~400
Total Size: ~39GB
Categories:
- Products: TiDB, TiDB Next-Gen (database, storage, import/export tools)
- Platform: TiDB Cloud SaaS (control services, resource deployment, monitoring)
- DevOps: Online operations backend
- Forks: Third-party dependencies
- Abandoned: Unused projects
Problem: Fragmented codebase prevents AI from having full context.
AI cannot optimize across repo boundaries.
Human coordination overhead scales with repo count.
Phase 1: Repository Analysis (Distributed Agent Cluster)
1.1 Agent Architecture
┌─────────────────────────────────────────────────────────────┐
│ Orchestrator Agent │
│ - Coordinates 400+ repo agents │
│ - Aggregates analysis results │
│ - Makes merge recommendations │
└─────────────────────────────────────────────────────────────┘
│
┌─────────────────────┼─────────────────────┐
│ │ │
▼ ▼ ▼
┌───────────────┐ ┌───────────────┐ ┌───────────────┐
│ Repo Agent │ │ Repo Agent │ │ Repo Agent │
│ (repo-001) │ │ (repo-002) │ │ (repo-400) │
└───────────────┘ └───────────────┘ └───────────────┘
1.2 Per-Repo Analysis Metrics
Each agent analyzes its repo for:
| Metric | Description | Weight |
|---|---|---|
| Freshness | Last commit date, activity frequency | High |
| Dependencies | Internal deps, external deps, circular refs | High |
| Code Quality | Test coverage, lint errors, tech debt | Medium |
| Documentation | README, API docs, architecture docs | Medium |
| Usage | Import count, deployment instances | High |
| Owner | Team ownership, maintenance status | Medium |
| Build System | CI/CD config, build scripts | Low |
1.3 Agent Implementation
# Agent spec (pseudo-code)
class RepoAgent:
    def __init__(self, repo_path, repo_id):
        self.repo_path = repo_path
        self.repo_id = repo_id

    def analyze(self):
        # Aggregate every per-repo metric into a single report
        return {
            'freshness': self.check_freshness(),
            'dependencies': self.map_dependencies(),
            'code_quality': self.assess_quality(),
            'documentation': self.scan_docs(),
            'usage': self.detect_usage(),
            'merge_recommendation': self.recommend(),
        }
1.4 Distributed Execution Strategy
Challenge: 400+ agents running concurrently
Solution: Batched parallel execution
- Batch size: 50 agents (adjustable based on resources)
- Total batches: 8 (400/50)
- Estimated time per batch: 5-10 minutes
- Total analysis time: ~1-2 hours
Resource Requirements:
- CPU: 8+ cores recommended
- Memory: 16GB+ recommended
- Disk I/O: SSD preferred (39GB read operations)
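The batching strategy above can be sketched in a few lines of Python. This is a minimal illustration, not the final implementation: it assumes the `RepoAgent.analyze()` interface from section 1.3, and the thread-based execution and function name are assumptions.

```python
# Minimal sketch of batched parallel repo analysis (illustrative).
from concurrent.futures import ThreadPoolExecutor

def run_batched_analysis(agents, batch_size=50):
    """Analyze agents in batches of `batch_size`; returns {repo_id: report}.

    Each batch runs concurrently; batches run sequentially to cap resource use.
    """
    results = {}
    for start in range(0, len(agents), batch_size):
        batch = agents[start:start + batch_size]
        with ThreadPoolExecutor(max_workers=len(batch)) as pool:
            reports = pool.map(lambda a: (a.repo_id, a.analyze()), batch)
            results.update(dict(reports))
    return results
```

With 400 repos and `batch_size=50`, this yields the 8 sequential batches estimated above.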
Phase 2: Mono-Repo Design
2.1 Target Structure
mono-repo/
├── products/
│ ├── tidb/ # TiDB database core
│ │ ├── server/
│ │ ├── storage/
│ │ └── tools/
│ └── tidb-next/ # Next-gen database
│ ├── server/
│ ├── storage/
│ └── tools/
├── platform/
│ ├── cloud-saas/ # TiDB Cloud platform
│ │ ├── control-plane/
│ │ ├── resource-deploy/
│ │ ├── monitoring/
│ │ └── api-gateway/
│ └── shared-services/ # Cross-platform services
├── devops/
│ ├── ops-backend/ # Operations tools
│ ├── ci-cd/
│ └── deployment/
├── libs/ # Shared libraries
│ ├── common/
│ ├── utils/
│ └── protocols/
├── tools/ # Build/dev tools
├── docs/ # Centralized documentation
└── infra/ # Infrastructure as code
2.2 AI-Friendly Design Principles
- Clear Boundaries: Each component has well-defined interfaces
- Self-Contained: Components can be understood in isolation
- Documented Contracts: API specs, data schemas, protocols
- Testable: Clear test boundaries, mockable interfaces
- Versioned: Internal versioning for breaking changes
2.3 Build System
# Monorepo build orchestration
- Turborepo / Nx / Bazel (depending on tech stack)
- Incremental builds (only changed components)
- Parallel test execution
- Dependency graph visualization
Phase 3: Migration Strategy
3.1 Migration Priority
| Priority | Category | Criteria | Action |
|---|---|---|---|
| P0 | Active core products | High usage, active development | Migrate first |
| P1 | Platform services | Critical infrastructure | Migrate early |
| P2 | DevOps tools | Important but isolated | Migrate mid-phase |
| P3 | Low-activity repos | Minor usage, stable | Migrate late |
| P4 | Abandoned repos | No activity >1 year | Archive or delete |
| P5 | Forked dependencies | Third-party forks | Evaluate: keep upstream? |
3.2 Migration Process (Per Repo)
1. Pre-migration check
├── Dependency analysis
├── Conflict detection
└── Build verification
2. Code transfer
├── Preserve git history (git filter-repo)
├── Map to new structure
└── Update import paths
3. Integration
├── Update build configs
├── Fix dependency references
└── Run tests
4. Validation
├── CI/CD passes
├── Integration tests pass
└── Smoke tests in staging
5. Cutover
├── Update deployment configs
├── Switch CI/CD to mono-repo
└── Archive old repo (read-only)
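The five-step process above amounts to a fail-fast per-repo pipeline. A hedged sketch: the step callables here are placeholders for the real tooling (dependency analysis, git filter-repo, CI runs, and so on), and all names are illustrative.

```python
# Illustrative per-repo migration pipeline; each step is a callable
# returning True on success. Real steps would wrap git/build/CI tooling.
def migrate_repo(repo, steps):
    """Run steps in order; stop at the first failure so nothing half-migrates."""
    completed = []
    for name, step in steps:
        if not step(repo):
            return {"repo": repo, "failed_at": name, "completed": completed}
        completed.append(name)
    return {"repo": repo, "failed_at": None, "completed": completed}

STEPS = [
    ("pre_migration_check", lambda r: True),  # dependency/conflict/build checks
    ("code_transfer",       lambda r: True),  # git filter-repo, path mapping
    ("integration",         lambda r: True),  # build configs, fix references
    ("validation",          lambda r: True),  # CI/CD, staging smoke tests
    ("cutover",             lambda r: True),  # deploy configs, archive old repo
]
```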
3.3 Estimated Timeline
| Phase | Repos | Duration |
|---|---|---|
| Planning & Analysis | All 400 | 2 days |
| P0 Migration (core) | ~50 | 3-5 days |
| P1 Migration (platform) | ~100 | 5-7 days |
| P2-P3 Migration | ~150 | 7-10 days |
| P4-P5 Cleanup | ~100 | 2-3 days |
| Total | 400 | ~3-4 weeks |
Phase 4: AI Closed-Loop Development
4.1 The AI-First Vision
This mono-repo is designed to enable General Relativity: AI owns the full system lifecycle.
┌─────────────────────────────────────────────────────────────────────┐
│ AI Ownership Spectrum │
│ │
│ Micro Meso Macro General Rel. │
│ │ │ │ │ │
│ ▼ ▼ ▼ ▼ │
│ Tools Feature System AI owns │
│ & Skills Lifecycle Architecture Everything │
│ │
│ [Current] [Phase 4.2] [Phase 4.3] [End State] │
└─────────────────────────────────────────────────────────────────────┘
4.2 Development Loop (Meso Layer)
┌─────────────────────────────────────────────────────────────┐
│ AI Development Loop │
│ │
│ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ │
│ │ Plan │───▶│ Code │───▶│ Test │───▶│ Review │ │
│ └─────────┘ └─────────┘ └─────────┘ └─────────┘ │
│ ▲ │ │
│ └──────────────────────────────────────────────┘ │
│ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Progress Management │ │
│ │ - Task tracking │ │
│ │ - Sprint planning │ │
│ │ - Blocker detection │ │
│ └─────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────┘
4.2.1 AI Capabilities
| Capability | Description | Implementation |
|---|---|---|
| Code Generation | Generate features, fixes, refactors | LLM + context from repo |
| Test Generation | Auto-generate unit/integration tests | Coverage-guided |
| Code Review | Automated PR review, style checks | Static analysis + LLM |
| Bug Detection | Identify potential issues | Pattern matching + ML |
| Documentation | Auto-generate/update docs | Code → docs extraction |
| Progress Tracking | Sprint planning, task estimation | Historical data + LLM |
4.3 System Architecture Ownership (Macro Layer)
AI Reorganizes System Architecture:
┌─────────────────────────────────────────────────────────────────┐
│ AI-Designed System Architecture │
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Product │ │ Platform │ │ DevOps │ │
│ │ Services │◀───▶│ Services │◀───▶│ Services │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ │
│ ▲ ▲ ▲ │
│ └────────────────────┼────────────────────┘ │
│ │ │
│ ┌────────▼────────┐ │
│ │ AI Orchestrator│ │
│ │ - Discovers │ │
│ │ - Optimizes │ │
│ │ - Refactors │ │
│ └─────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
AI Capabilities at Macro Layer:
- Architecture Discovery: Map service dependencies, data flows, bottlenecks
- Automated Refactoring: Identify and execute cross-service improvements
- Interface Optimization: Evolve APIs based on usage patterns
- Tech Debt Management: Prioritize and fix systemic issues
4.4 Deployment & Operations Ownership (General Relativity)
AI-Managed Infrastructure:
# Auto-scaling policies (AI-optimized)
resource_policies:
- service: control-plane
scaling:
min_instances: 3
max_instances: 50
metrics: [cpu, memory, request_latency]
ai_optimizer: enabled
- service: resource-deploy
multi_region:
regions: [us-east, eu-west, ap-southeast]
ai_routing: enabled # AI decides optimal region
AI Responsibilities:
- Predict load patterns
- Auto-scale before traffic spikes
- Optimize resource allocation across regions
- Detect and respond to anomalies
- Cost optimization (right-sizing, spot instances)
- Self-healing: Automatic incident response and recovery
- Continuous Optimization: A/B test deployments, rollback on metrics
4.5 End State: General Relativity Achieved
┌─────────────────────────────────────────────────────────────────┐
│ General Relativity: AI Owns Everything │
│ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ Unified Codebase │ │
│ │ (400 repos → 1 mono-repo, AI-readable, AI-optimizable) │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │ │
│ ┌────────────────────┼────────────────────┐ │
│ ▼ ▼ ▼ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ AI Dev │ │ AI Ops │ │ AI Org │ │
│ │ - Designs │ │ - Deploys │ │ - Plans │ │
│ │ - Codes │ │ - Scales │ │ - Staffs │ │
│ │ - Tests │ │ - Monitors │ │ - Allocates│ │
│ │ - Reviews │ │ - Heals │ │ - Optimizes│ │
│ └─────────────┘ └─────────────┘ └─────────────┘ │
│ │
│ Result: Human engineers focus on strategy, creativity, │
│ and high-level problem definition. │
│ AI handles execution at all layers. │
└─────────────────────────────────────────────────────────────────┘
Phase 5: Technical Considerations
5.1 Google Monorepo Lessons (2 Billion LOC Proven)
Key Insights from Google’s Playbook:
| Principle | Google’s Approach | TiDB Application |
|---|---|---|
| Single Source of Truth | One repo for 95% of codebase | All 400 repos → 1 mono-repo |
| Trunk-Based Development | Direct commits to main, pre-commit review | Adopt from day 1 |
| Code Ownership | Default open, CODEOWNERS enforcement | Directory-based ownership |
| Build System | Bazel (incremental, remote cache) | Bazel/Turborepo/Nx based on stack |
| Dependency Mgmt | Single version graph, automated updates | Dependency visualization tool |
| Code Review | Automated pre-checks + OWNERS | GitHub/GitLab CODEOWNERS |
| Infrastructure | Piper + CitC (partial checkout) | Git + shallow clones + sparse checkout |
Google’s Scale (for reference):
- 2 billion lines of code
- 25,000+ engineers
- 45,000 commits/day
- 86 TB storage
- Automation does 24,000 commits/day
Our AI Advantage: Google automated processes. We automate decisions.
5.2 Scale Challenges
| Challenge | Solution | Google Reference |
|---|---|---|
| Git repo size | git-lfs, shallow clones, sparse checkout | CitC (partial checkout) |
| Build time | Incremental builds, remote caching | Bazel |
| CI/CD complexity | Path-based triggering | Automated pre-commit checks |
| Code ownership | CODEOWNERS file, clear boundaries | OWNERS files per workspace |
| Access control | Fine-grained permissions per directory | Default open, exceptions restricted |
| Search speed | Sourcegraph / Zoekt | CodeSearch engine |
| Dependency hell | Dependency graph visualization | Single version, automated updates |
5.3 Tooling Requirements
| Category | Tools | Recommendation |
|---|---|---|
| Build System | Bazel, Turborepo, Nx | Based on tech stack (see below) |
| Code Search | Sourcegraph, Zoekt | Sourcegraph (enterprise) or Zoekt (open) |
| Dependency Viz | Custom + graph DB | Build custom tool |
| CI/CD | GitHub Actions, GitLab CI | Path filtering required |
| Agent Framework | LangChain, AutoGen, custom | Custom (tuned for repo analysis) |
| Version Control | Git | Standard Git + sparse checkout |
Build System by Tech Stack:
Go → Bazel or Please
TypeScript → Turborepo or Nx
Java → Bazel or Gradle
Python → Bazel or Pants
Mixed → Bazel (most flexible)
5.4 Risk Mitigation
| Risk | Mitigation | Google Parallel |
|---|---|---|
| Data loss | Full backups before each batch | Piper (distributed storage) |
| Downtime | Parallel run (old + new) | Release branches + feature flags |
| Broken builds | Comprehensive tests, canary deploys | Pre-commit verification |
| Team disruption | Gradual migration, training | Trunk-based culture |
| Rollback needed | Keep old repos read-only 30 days | Release branch rollback |
| Performance | Incremental builds, caching | Bazel remote cache |
5.5 Trunk-Based Development Model (Google Standard)
main (trunk)
│
├── All developers commit directly to main
├── Pre-commit code review required
├── Automated checks run before merge
│
└── release/v1.0 (branch for deployment only)
└── Feature flags control visibility
Rules:
- No long-lived feature branches
- All changes reviewed before merge (pre-commit)
- Small, frequent commits (not big bangs)
- Feature flags for incomplete features
- Release branches are for deployment, not development
Benefits:
- No merge nightmares
- Early conflict detection
- Continuous delivery enabled
- AI can safely make small, incremental changes
5.6 CODEOWNERS Structure
# Root CODEOWNERS file
# Format: path_pattern @owner1 @owner2
# Products
products/tidb/* @tidb-core-team @database-leads
products/tidb-next/* @tidb-next-team @architecture-review
# Platform
platform/cloud-saas/* @cloud-platform-team @platform-leads
platform/shared/* @platform-architects
# DevOps
devops/* @devops-team @sre-leads
# Shared Libraries (high scrutiny)
libs/* @platform-architects @tech-leads
# Infrastructure
infra/* @infra-team @security-review
# Build/Tooling
tools/* @devex-team
BUILD @build-maintainers
Review Policies:
- libs/* requires 2 approvals (shared code impact)
- products/* requires 1 approval + team lead
- devops/* requires 1 approval + on-call SRE
- Security-sensitive paths require security team approval
Next Steps (Planning Phase)
Day 1: Analysis Framework
- Set up distributed agent infrastructure
- Define analysis metrics and scoring
- Create repo inventory (list all 400 repos)
- Run pilot analysis on 10 repos
Day 2: Mono-Repo Design
- Finalize directory structure
- Design build system architecture
- Plan migration tooling
- Create detailed migration runbook
Deliverables
- repo-analysis-report.md — Analysis of all 400 repos
- mono-repo-structure.md — Detailed structure spec
- migration-runbook.md — Step-by-step migration guide
- ai-dev-loop-spec.md — AI closed-loop development spec
- ai-first-methodology.md — AI-First engineering methodology (this framework)
- ai-capability-maturity.md — AI capability maturity model (Micro→Meso→Macro→General Relativity)
- google-monorepo-lessons.md — Google best practices reference ✅ DONE
- codeowners-template.md — CODEOWNERS file template
- build-system-evaluation.md — Bazel vs Turborepo vs Nx analysis
Open Questions
- Tech stack: What languages/frameworks are in the 400 repos? (affects build system choice)
- Team size: How many engineers will work in the mono-repo? (affects access control design)
- Current CI/CD: What’s the existing pipeline? (affects migration complexity)
- Deployment: How are services currently deployed? (affects infra design)
- Agent hosting: Where will the 400 agents run? (local cluster, cloud, hybrid?)
Appendix: AI-First Methodology
Why This Matters
Most AI engineering efforts stop at the Micro layer:
- Build some skills
- Add some MCP tools
- Improve individual workflows
This project goes further:
| Layer | What Changes | Outcome |
|---|---|---|
| Micro | Tools & workflows | Faster individual tasks |
| Meso | Feature ownership | AI delivers features end-to-end |
| Macro | System architecture | AI optimizes across services |
| General | Everything | AI runs the engineering org |
First Principles Reasoning
Question: What should AI be capable of in software engineering?
Answer: A good AI engineer should be able to:
- Understand the full system (not just one repo)
- Design improvements that span boundaries
- Implement, test, and deploy changes
- Monitor and iterate based on outcomes
Barrier: Fragmented codebases prevent #1.
Solution: Unified mono-repo designed for AI ownership.
Success Metrics
| Metric | Current | Target (6mo) | Target (12mo) |
|---|---|---|---|
| AI-completed features | 0% | 20% | 50% |
| AI-identified optimizations | 0% | 100/week | 500/week |
| AI-deployed changes | 0% | 10% | 40% |
| Human time on routine tasks | 60% | 30% | 10% |
| System-wide tech debt | High | Reduced 25% | Reduced 60% |
Last updated: Planning phase
“The goal is not to help humans do AI work. The goal is to have AI do the work, and humans define what matters.”
Mono-Repo Agent Ecosystem Design
AI-First Engineering: Agents + Skills Living in the Mono-Repo
“The mono-repo is not just code. It’s a living ecosystem of AI agents and skills.”
Vision
┌─────────────────────────────────────────────────────────────────┐
│ Mono-Repo Ecosystem │
│ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ Code (39GB) │ │
│ │ products/ platform/ devops/ libs/ tools/ docs/ │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │ │
│ ┌────────────────────┼────────────────────┐ │
│ ▼ ▼ ▼ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ Agents │ │ Skills │ │ Humans │ │
│ │ (Active) │ │ (Tools) │ │ (Oversight) │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ │
│ │
│ Result: Self-improving, self-maintaining codebase │
└─────────────────────────────────────────────────────────────────┘
Agent Taxonomy
Layer 1: Guardian Agents (Per-Component)
Each major component has a dedicated guardian agent:
┌─────────────────────────────────────────────────────────────┐
│ Guardian Agents │
├─────────────────────────────────────────────────────────────┤
│ │
│ tidb-guardian ──► products/tidb/* │
│ tiflow-guardian ──► products/tiflow/* │
│ operator-guardian ──► platform/tidb-operator/* │
│ dashboard-guardian ──► tools/tidb-dashboard/* │
│ docs-guardian ──► docs/* │
│ sdk-guardian ──► sdks/* │
│ infra-guardian ──► infra/* │
│ │
└─────────────────────────────────────────────────────────────┘
Guardian Responsibilities:
| Task | Frequency | Description |
|---|---|---|
| Code Health | Daily | Lint, test coverage, tech debt |
| Dependency Watch | Daily | Security updates, breaking changes |
| Documentation | Per-change | Auto-update docs from code |
| Issue Triage | Real-time | Categorize, label, assign |
| PR Review | Per-PR | Automated review, suggestions |
| Refactoring | Weekly | Identify and propose improvements |
Layer 2: Cross-Cutting Agents
These agents work across component boundaries:
┌─────────────────────────────────────────────────────────────┐
│ Cross-Cutting Agents │
├─────────────────────────────────────────────────────────────┤
│ │
│ dependency-architect │
│ ├─ Maps cross-component dependencies │
│ ├─ Detects circular dependencies │
│ └─ Proposes dependency cleanup │
│ │
│ refactoring-specialist │
│ ├─ Identifies code duplication across components │
│ ├─ Proposes shared library extraction │
│ └─ Executes safe cross-component refactors │
│ │
│ test-optimizer │
│ ├─ Analyzes test coverage gaps │
│ ├─ Generates missing tests │
│ └─ Optimizes test execution order │
│ │
│ security-auditor │
│ ├─ Scans for vulnerabilities │
│ ├─ Checks security best practices │
│ └─ Monitors dependency CVEs │
│ │
│ performance-analyst │
│ ├─ Profiles code performance │
│ ├─ Identifies bottlenecks │
│ └─ Proposes optimizations │
│ │
│ documentation-curator │
│ ├─ Ensures docs match code │
│ ├─ Generates API docs │
│ └─ Maintains architecture decision records │
│ │
└─────────────────────────────────────────────────────────────┘
Layer 3: Orchestrator Agents
High-level coordination and decision-making:
┌─────────────────────────────────────────────────────────────┐
│ Orchestrator Agents │
├─────────────────────────────────────────────────────────────┤
│ │
│ mono-repo-orchestrator │
│ ├─ Coordinates all guardian agents │
│ ├─ Makes cross-component decisions │
│ ├─ Prioritizes work across components │
│ └─ Reports system health to humans │
│ │
│ release-manager │
│ ├─ Plans releases across components │
│ ├─ Coordinates version compatibility │
│ ├─ Manages changelogs │
│ └─ Handles rollback decisions │
│ │
│ sprint-planner │
│ ├─ Analyzes backlog │
│ ├─ Estimates effort (based on history) │
│ ├─ Suggests sprint goals │
│ └─ Tracks progress │
│ │
│ resource-optimizer │
│ ├─ Monitors CI/CD costs │
│ ├─ Optimizes build caching │
│ └─ Recommends infrastructure changes │
│ │
└─────────────────────────────────────────────────────────────┘
Skills Integration
Skills are the tools that agents use to interact with the codebase:
Core Skills
| Skill | Purpose | Used By |
|---|---|---|
| code-search | Fast code search (Sourcegraph/Zoekt) | All agents |
| build-runner | Execute builds (Bazel/Turborepo) | Guardian agents |
| test-runner | Execute tests with coverage | Guardian, test-optimizer |
| lint-checker | Code style and quality | Guardian, security-auditor |
| dependency-analyzer | Map and analyze dependencies | dependency-architect |
| doc-generator | Generate docs from code | documentation-curator |
| git-operations | Safe git operations (commit, PR) | All agents |
| ci-cd-trigger | Trigger CI/CD pipelines | release-manager |
| metrics-collector | Collect build/test/deploy metrics | resource-optimizer |
Specialized Skills
| Skill | Purpose | Used By |
|---|---|---|
| security-scanner | Vulnerability scanning | security-auditor |
| performance-profiler | Code profiling | performance-analyst |
| refactoring-engine | Safe code transformations | refactoring-specialist |
| test-generator | AI-generated tests | test-optimizer |
| changelog-writer | Auto-generate changelogs | release-manager |
| impact-analyzer | Analyze change impact | All agents |
Agent-Skill Interaction Model
┌─────────────────────────────────────────────────────────────────┐
│ Agent-Skill Architecture │
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Guardian │ │ Cross- │ │ Orchestra- │ │
│ │ Agent │ │ Cutting │ │ tor │ │
│ └──────┬───────┘ └──────┬───────┘ └──────┬───────┘ │
│ │ │ │ │
│ └─────────────────────┼─────────────────────┘ │
│ │ │
│ ┌──────────▼──────────┐ │
│ │ Skill Layer │ │
│ │ ┌───────────────┐ │ │
│ │ │ code-search │ │ │
│ │ │ build-runner │ │ │
│ │ │ test-runner │ │ │
│ │ │ lint-checker │ │ │
│ │ │ ... │ │ │
│ │ └───────────────┘ │ │
│ └──────────┬──────────┘ │
│ │ │
│ ┌──────────▼──────────┐ │
│ │ Mono-Repo │ │
│ │ (Code + Data) │ │
│ └─────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
Human-Agent Collaboration
Human Roles in the Ecosystem
┌─────────────────────────────────────────────────────────────┐
│ Human Oversight Layers │
├─────────────────────────────────────────────────────────────┤
│ │
│ Tech Leads │
│ ├─ Review architecture decisions (AI-proposed) │
│ ├─ Set priorities for agents │
│ └─ Handle edge cases and exceptions │
│ │
│ Product Managers │
│ ├─ Define feature requirements │
│ ├─ Review sprint plans (AI-generated) │
│ └─ Make trade-off decisions │
│ │
│ SRE / Operations │
│ ├─ Review deployment plans (AI-generated) │
│ ├─ Handle production incidents │
│ └─ Set SLOs and error budgets │
│ │
│ Security Team │
│ ├─ Review security audit findings │
│ ├─ Approve security-critical changes │
│ └─ Define security policies │
│ │
└─────────────────────────────────────────────────────────────┘
Decision Escalation
AI Agent Decision
│
▼
┌─────────────────┐
│ Can AI decide? │
└────────┬────────┘
│
┌────┴────┐
│ │
Yes No
│ │
▼ ▼
┌────────┐ ┌─────────────┐
│ Execute│ │ Escalate to │
│ │ │ Human │
└────────┘ └──────┬──────┘
│
▼
┌─────────────────┐
│ Which Human? │
├─────────────────┤
│ Architecture → │ Tech Lead
│ Security → │ Security Team
│ Priority → │ Product Manager
│ Production → │ SRE
└─────────────────┘
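The escalation flow above reduces to a small routing table. A sketch (the role names are illustrative, mirroring the diagram):

```python
# Routing table mirroring the escalation diagram; role names illustrative.
ESCALATION_ROUTES = {
    "architecture": "tech_lead",
    "security": "security_team",
    "priority": "product_manager",
    "production": "sre",
}

def route(decision_kind, ai_can_decide):
    """Return 'execute' when the AI may act, else the human role to escalate to."""
    if ai_can_decide:
        return "execute"
    return ESCALATION_ROUTES[decision_kind]
```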
Agent Communication Protocol
Inter-Agent Messaging
Agent Message Format:
{
"from": "tidb-guardian",
"to": "dependency-architect",
"type": "dependency_change_detected",
"payload": {
"component": "products/tidb",
"dependency": "github.com/pingcap/kvproto",
"change": "version_update",
"old_version": "v0.0.0-20250101",
"new_version": "v0.0.0-20260228",
"breaking": false,
"requires_propagation": true
},
"timestamp": "2026-02-28T16:00:00Z"
}
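The message format lends itself to a thin validation layer before events reach the bus. A minimal sketch, assuming exactly the JSON fields shown above; the helper name is hypothetical:

```python
# Parse and validate an agent message against the fields shown above.
import json
from datetime import datetime

REQUIRED_FIELDS = {"from", "to", "type", "payload", "timestamp"}

def parse_agent_message(raw):
    """Parse a JSON agent message, failing fast on missing fields."""
    msg = json.loads(raw)
    missing = REQUIRED_FIELDS - msg.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    # Timestamps are ISO 8601 with a trailing 'Z'; normalize for ordering.
    msg["timestamp"] = datetime.fromisoformat(msg["timestamp"].replace("Z", "+00:00"))
    return msg
```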
Event Bus
┌─────────────────────────────────────────────────────────────┐
│ Agent Event Bus │
├─────────────────────────────────────────────────────────────┤
│ │
│ Events: │
│ - code_committed │
│ - pr_created │
│ - pr_merged │
│ - test_failed │
│ - build_failed │
│ - dependency_updated │
│ - security_vulnerability_detected │
│ - performance_regression_detected │
│ - tech_debt_identified │
│ - documentation_outdated │
│ │
│ Subscription Model: │
│ - Each agent subscribes to relevant events │
│ - Events trigger agent actions │
│ - Actions may generate new events (chain reaction) │
│ │
└─────────────────────────────────────────────────────────────┘
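The subscription model described above can be sketched as a small in-process bus. This shows only the shape of the API; a real deployment would likely use a durable message queue.

```python
# In-process sketch of the agent event bus. Handlers may themselves
# publish further events, producing the "chain reaction" described above.
from collections import defaultdict

class EventBus:
    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, event_type, handler):
        """Register a handler for an event type, e.g. 'test_failed'."""
        self._subscribers[event_type].append(handler)

    def publish(self, event_type, payload):
        """Deliver the payload to every handler subscribed to this event."""
        for handler in list(self._subscribers[event_type]):
            handler(payload)
```

For example, tidb-guardian would subscribe to `test_failed` and `pr_created` events scoped to its component.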
Daily Agent Workflow
Example: A Day in the Life
00:00 ──► dependency-architect runs nightly dependency scan
└─► Finds security update for tidb dependency
└─► Creates PR with update
└─► Notifies tidb-guardian
02:00 ──► tidb-guardian reviews PR
└─► Runs tests
└─► Checks compatibility
└─► Approves (auto-merge if non-breaking)
06:00 ──► test-optimizer analyzes test coverage
└─► Finds gap in products/tidb/storage
└─► Generates new tests
└─► Creates PR
09:00 ──► Humans start workday
└─► Review overnight agent activities
└─► Handle escalations
└─► Set priorities for the day
12:00 ──► sprint-planner analyzes velocity
└─► Updates sprint forecast
└─► Notifies PM of potential delays
15:00 ──► refactoring-specialist identifies duplication
└─► Proposes shared library extraction
└─► Creates design doc
└─► Requests human review
18:00 ──► documentation-curator syncs docs with code
└─► Auto-generates API docs
└─► Updates changelog
23:00 ──► mono-repo-orchestrator generates daily report
└─► System health summary
└─► Agent activity summary
└─► Pending human decisions
Metrics & KPIs
Agent Performance
| Metric | Target | Measurement |
|---|---|---|
| PR Review Time | <1 hour | Time from PR creation to first review |
| Auto-Merge Rate | >60% | % of PRs merged without human intervention |
| Test Coverage | >80% | Code coverage across all components |
| Vulnerability MTTR | <24 hours | Time to fix security issues |
| Build Success Rate | >95% | % of builds that pass |
| Agent Decision Accuracy | >90% | % of AI decisions that are correct |
System Health
| Metric | Target | Measurement |
|---|---|---|
| Tech Debt Ratio | <10% | Tech debt / total code |
| Documentation Freshness | <7 days | Time since last doc update |
| Dependency Freshness | <30 days | Age of oldest dependency |
| Cross-Component Coupling | Decreasing | Dependency graph complexity |
Implementation Phases
Phase 1: Guardian Agents (Week 1-4)
- Build agent framework
- Implement tidb-guardian (pilot)
- Integrate core skills (code-search, build-runner, test-runner)
- Deploy to mono-repo
Phase 2: Cross-Cutting Agents (Week 4-8)
- Implement dependency-architect
- Implement test-optimizer
- Implement security-auditor
- Build event bus
Phase 3: Orchestrator Agents (Week 8-12)
- Implement mono-repo-orchestrator
- Implement release-manager
- Implement sprint-planner
- Human oversight workflows
Phase 4: Full Autonomy (Week 12+)
- Enable auto-merge for non-breaking changes
- Enable automated refactoring
- Enable AI-driven release planning
- Continuous optimization
Agent Configuration
Example: tidb-guardian config
agent:
name: tidb-guardian
model: qwen3.5-plus
component: products/tidb
permissions:
- read: products/tidb/*
- write: products/tidb/*
- create_pr: true
- merge_pr: true # Non-breaking only
skills:
- code-search
- build-runner
- test-runner
- lint-checker
- doc-generator
triggers:
- code_committed
- pr_created
- dependency_updated
- test_failed
escalation:
architecture: @tidb-architect
security: @security-team
breaking_change: @tidb-leads
schedule:
daily_health_check: "02:00 UTC"
weekly_refactor_proposal: "Monday 00:00 UTC"
Conclusion
The mono-repo is not just a code repository. It’s a living ecosystem where:
- Guardian Agents maintain individual components
- Cross-Cutting Agents optimize across boundaries
- Orchestrator Agents coordinate and make high-level decisions
- Skills provide the tools for agents to interact with code
- Humans provide oversight, handle exceptions, and set direction
This is the foundation for General Relativity: AI owns the full engineering lifecycle, with humans focusing on strategy and creativity.
“The goal is not to replace humans. The goal is to free humans from routine work, so they can focus on what matters.”
10-Repo Experiment Report
Small-Scale Experiment Report
Experiment date: 2026-03-01
Status: ✅ Complete
Duration: ~30 minutes
Cost: ~$0.05 (estimated)
Executive Summary
✅ The experiment succeeded! All 10/10 repos were analyzed, validating the feasibility of the OpenClaw main-brain + file-persistence architecture.
Key findings:
- The 10 repos total ~2GB of code
- S-tier: 1 (tidb, score 95)
- A-tier: 4 (tiflow, tidb-operator, docs, tiup)
- B-tier: 4 (ossinsight, tidb-dashboard, ticdc, autoflow)
- C-tier: 1 (tidb-vector-python)
Migration recommendations:
- P0 (first): tidb, tiflow, tidb-operator
- P1 (second batch): docs, tiup, tidb-dashboard
- P2 (third batch): ossinsight, ticdc, autoflow, tidb-vector-python
Experiment Results
1. Repo Value-Score Ranking
| Rank | Repo | Score | Tier | Priority | Migration Note |
|---|---|---|---|---|---|
| 1 | tidb | 95 | S | P0 | Migrate first; core product |
| 2 | tiflow | 78 | A | P0 | Migrate together with tidb |
| 3 | tidb-operator | 75 | A | P0 | Core of K8s operations |
| 4 | docs | 72 | A | P1 | Official documentation; must be merged |
| 5 | tiup | 70 | A | P1 | Package manager; active |
| 6 | ossinsight | 68 | B | P1 | Standalone tool; evaluate whether to merge |
| 7 | tidb-dashboard | 65 | B | P1 | Console; depends on tidb |
| 8 | ticdc | 62 | B | P2 | CDC tool; overlaps with tiflow |
| 9 | autoflow | 58 | B | P2 | Graph RAG; highly independent |
| 10 | tidb-vector-python | 42 | C | P2 | SDK; small and low-activity |
2. Tier Distribution
S-tier (85-100): █░░░░░░░░░ 1 repo (10%) → deep analysis (8 agents)
A-tier (70-84): ████░░░░░░ 4 repos (40%) → standard analysis (4 agents)
B-tier (50-69): ████░░░░░░ 4 repos (40%) → standard analysis (2 agents)
C-tier (0-49): █░░░░░░░░░ 1 repo (10%) → quick scan (1 agent)
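The tiering rule reduces to score thresholds. A sketch using the score bands and per-tier agent counts from the distribution above (the function name is illustrative):

```python
# Map a 0-100 value score to (tier, analysis agents), per the bands above.
def assign_tier(score):
    if score >= 85:
        return ("S", 8)  # deep analysis
    if score >= 70:
        return ("A", 4)  # standard analysis
    if score >= 50:
        return ("B", 2)  # standard analysis
    return ("C", 1)      # quick scan
```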
3. Tech Stack Distribution
| Language | Count | Percentage |
|---|---|---|
| Go | 6 | 60% |
| TypeScript | 3 | 30% |
| Python | 1 | 10% |
Conclusion: Go-dominant; Bazel or Please recommended for the build system
4. Code Size Distribution
| Size Category | Repos | Total Size |
|---|---|---|
| >500 MB | tidb, ossinsight | 1,264 MB |
| 100-500 MB | docs, tiflow, ticdc | 665 MB |
| 10-100 MB | tidb-operator, tidb-dashboard | 132 MB |
| <10 MB | tiup, autoflow, tidb-vector-python | 22 MB |
| Total | 10 | 2,084 MB (~2GB) |
Architecture Validation
✅ Validated Capabilities
| Capability | Status | Notes |
|---|---|---|
| OpenClaw main brain | ✅ | Successfully coordinated the analysis workflow |
| File persistence | ✅ | State written to .rd-os/state/ |
| Value scoring | ✅ | All 10 repos scored |
| Tiering logic | ✅ | S/A/B/C tiers are reasonable |
| Migration advice | ✅ | Actionable recommendation for every repo |
⚠️ Areas for Improvement
| Issue | Impact | Fix |
|---|---|---|
| Manual metadata collection | Time-consuming | Automate GitHub API calls |
| sessions_spawn unused | Sub-agents unverified | Implement next |
| Recovery mechanism untested | Unknown | Simulate an OpenClaw restart |
| Shallow code analysis | Surface-level only | Clone and analyze the actual code |
Cost Analysis
Actual Cost
| Operation | Token Estimate | Cost |
|---|---|---|
| GitHub API calls | ~5K | $0.00 (free) |
| Value-score analysis | ~10K | ~$0.02 |
| Report generation | ~5K | ~$0.01 |
| Total | ~20K | ~$0.03 |
400-Repo Extrapolation
| Phase | Token Estimate | Cost |
|---|---|---|
| Metadata collection | 200K | $0.00 (GitHub API is free) |
| Value scoring | 4M | ~$8 |
| Deep analysis (S/A-tier) | 10M | ~$20 |
| Migration execution | 20M | ~$40 |
| Total | ~34M | ~$68 |
Conclusion: cost is well within acceptable range; qwen3.5-plus offers strong price-performance
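The extrapolation is consistent with a blended rate of about $2 per million tokens. That rate is an assumption derived from the table's own $68 / 34M figures, not a quoted model price:

```python
# Back-of-envelope check of the 400-repo cost extrapolation.
COST_PER_M_TOKENS = 2.0  # assumed blended rate implied by the table

# Billable phases in millions of tokens; metadata collection is free
# because it goes through the GitHub API rather than the model.
phases_m_tokens = {"value_scoring": 4, "deep_analysis": 10, "migration": 20}

total_cost = sum(phases_m_tokens.values()) * COST_PER_M_TOKENS  # 34M tokens
```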
Migration Strategy (Based on Results)
Phase 1: P0 Core Products (Week 1-2)
tidb (637 MB, score 95)
├── Core database
├── Needs a dedicated team
└── Estimated time: 3-5 days
tiflow (159 MB, score 78)
├── DM + TiCDC
├── Depends on tidb
└── Estimated time: 2-3 days
tidb-operator (99 MB, score 75)
├── K8s operations
├── Highly independent
└── Estimated time: 2-3 days
Phase 1 Total: ~900 MB, 7-11 days
Phase 2: P1 Platform & Tools (Week 3-4)
docs (401 MB, score 72)
├── Official documentation
├── Large but simple
└── Estimated time: 2-3 days
tiup (15 MB, score 70)
├── Package manager
├── Small
└── Estimated time: 1 day
tidb-dashboard (33 MB, score 65)
├── Web UI
├── Depends on tidb
└── Estimated time: 1-2 days
ossinsight (627 MB, score 68)
├── Standalone tool
├── Evaluate whether to merge
└── Estimated time: 2-3 days after the decision
Phase 2 Total: ~1,076 MB, 6-9 days
Phase 3: P2 SDKs & Others (Week 5-6)
ticdc (105 MB, score 62)
├── CDC tool
├── Overlaps with tiflow
└── Estimated time: 1-2 days
autoflow (7 MB, score 58)
├── Graph RAG
├── Highly independent
└── Estimated time: 1 day after the decision
tidb-vector-python (1 MB, score 42)
├── Python SDK
├── Small
└── Estimated time: 0.5 days
Phase 3 Total: ~113 MB, 3-4 days
Total Migration Timeline
| Phase | Repos | Size | Duration |
|---|---|---|---|
| P0 | 3 | 895 MB | 7-11 days |
| P1 | 4 | 1,076 MB | 6-9 days |
| P2 | 3 | 113 MB | 3-4 days |
| Total | 10 | 2,084 MB | 16-24 days |
Extrapolated to 400 repos: ~60-90 days (3-4 months)
Key Insights
1. Core Findings
✅ tidb is the absolute core — score 95, 39.8k stars; it must migrate first
✅ Dependencies are clear — tiflow, tidb-operator, and tidb-dashboard all depend on tidb
⚠️ ossinsight is highly independent — 627 MB, but it runs standalone; evaluate whether to merge it
⚠️ ticdc overlaps with tiflow — both are CDC-related and could potentially be merged
2. Concentrated Tech Stack
- 60% Go — the primary stack
- 30% TypeScript — frontend/tooling
- 10% Python — docs/SDK
Recommendation: choose Bazel as the build system (strong Go support, multi-language)
3. Manageable Code Volume
- 10 repos = ~2GB
- 400 repos = ~39GB (a reasonable estimate)
- Google's 2B LOC = 86TB
Conclusion: the scale is well within what Google has demonstrated is workable
Next Steps
Immediate (This Week)
- ✅ Complete the experiment report ← current
- ⏳ Implement sessions_spawn sub-agents — validate dynamic creation
- ⏳ Test the recovery mechanism — simulate an OpenClaw restart
- ⏳ Deep-analyze tidb — with an 8-agent team
Short-term (Next 2 Weeks)
- ⏳ 400-repo metadata collection — batch fetch via the GitHub API
- ⏳ Full value scoring — score and tier all 400 repos
- ⏳ Create progress.db — SQLite persistence
- ⏳ Implement the main loop — OpenClaw orchestration
Medium-term (Next Month)
- ⏳ Start the P0 migration — tidb, tiflow, tidb-operator
- ⏳ Deploy guardian agents — continuous monitoring
- ⏳ Establish CI/CD — mono-repo build pipeline
Lessons Learned
What Worked Well
✅ File-based persistence — clear, recoverable state
✅ Value scoring model — highly discriminating and reasonable
✅ Tiering strategy — S/A/B/C guides resource allocation
✅ Migration priorities — P0/P1/P2 are clear
What Needs Improvement
⚠️ Low automation — APIs were called manually and need automating
⚠️ Sub-agents unvalidated — sessions_spawn untested
⚠️ Recovery untested — needs a simulated restart
⚠️ Shallow code analysis — metadata only; actual code not analyzed
Adjustments for 400-Repo Scale
- Automate the GitHub API — batch metadata collection
- Concurrency control — 50 sub-agents running simultaneously
- Batch processing — 50 repos/batch to stay under API rate limits
- Progress monitoring — real-time dashboard
- Error handling — automatic retries plus a dead-letter queue
Conclusion
The experiment succeeded! ✅
The 10-repo small-scale experiment validated that:
- The OpenClaw orchestrator architecture is viable
- File-based persistence works
- The value scoring model is sound
- The migration strategy is clear
Next step: scale to 400 repos, at an estimated cost of ~$68 over 3-4 months
Confidence level: High — small-scale validation passed; ready for large-scale rollout
Experiment Report for: Large-scale Agentic Engineering
Generated: 2026-03-01
Experiment: 10-Repo Small-Scale Analysis
Small-scale experiment: validating the OpenClaw + sub-agent architecture
Goal: validate the end-to-end flow of the OpenClaw orchestrator coordinating a sub-agent cluster to analyze repos
Scope: the 10 most important PingCAP repos
Estimated time: 1-2 hours
Estimated cost: <$0.10 (qwen3.5-plus)
Target Repos (10 Most Important)
Based on the earlier analysis, 10 core repos were selected:
| # | Repo | Stars | Language | Size | Priority | Rationale |
|---|---|---|---|---|---|---|
| 1 | tidb | 39,859 | Go | 652 MB | P0 | Core database product |
| 2 | tiflow | 454 | Go | 163 MB | P0 | DM + TiCDC |
| 3 | tidb-operator | 1,322 | Go | 101 MB | P0 | K8s operations platform |
| 4 | ossinsight | 2,320 | TypeScript | 642 MB | P1 | OSS analytics platform |
| 5 | docs | 616 | Python | 411 MB | P1 | Official documentation |
| 6 | tidb-dashboard | 198 | TypeScript | 34 MB | P1 | Visualization console |
| 7 | tiup | 463 | Go | 15 MB | P1 | Package manager |
| 8 | autoflow | 2,740 | TypeScript | - | P2 | Graph RAG knowledge base |
| 9 | tidb-vector-python | 61 | Python | - | P2 | Python SDK |
| 10 | ticdc | 45 | Go | - | P2 | CDC tool |
Total: ~2 GB of code
Experiment Goals
Validation Goals
✅ 1. OpenClaw orchestration flow
├─ Spawn sub-agents (sessions_spawn)
├─ Collect results (sessions_send)
└─ Track progress (SQLite + JSON)
✅ 2. Sub-agent analysis capabilities
├─ Repo metadata collection
├─ Code structure analysis
├─ Dependency mapping
├─ Quality assessment
└─ Merge recommendation generation
✅ 3. State persistence
├─ Checkpoint writes
├─ Progress updates
└─ Recovery validation
✅ 4. Dynamic scheduling
├─ Value scoring (0-100)
├─ Tiering (S/A/B/C)
└─ Agent allocation adjustment
✅ 5. Cost validation
└─ Actual token usage vs. the estimate
Experiment Architecture
OpenClaw Orchestration
OpenClaw (Main Session)
│
├─ 1. Create the .rd-os/ directory structure
│
├─ 2. Initialize progress.db
│
├─ 3. For each repo:
│ │
│ ├─ Spawn an analysis sub-agent (sessions_spawn)
│ │ Task: "Analyze {repo_name}"
│ │ Model: qwen3.5-plus
│ │ Output: .rd-os/state/agent-states/{repo_id}.json
│ │
│ └─ Wait for completion (sessions_send)
│
├─ 4. Collect results
│ ├─ Read output files
│ ├─ Update progress.db
│ └─ Generate a consolidated report
│
└─ 5. Produce the experiment report
Sub-Agent Task
Sub-Agent (qwen3.5-plus)
│
├─ 1. Read repo metadata (GitHub API)
│
├─ 2. Analyze code structure
│ ├─ Directory layout
│ ├─ Primary language
│ └─ Key files
│
├─ 3. Map dependencies
│ ├─ go.mod / package.json / requirements.txt
│ └─ Internal/external dependencies
│
├─ 4. Assess code quality
│ ├─ Test coverage
│ ├─ Documentation completeness
│ └─ Coding standards
│
├─ 5. Compute value score
│ ├─ Activity (25 pts)
│ ├─ Impact (25 pts)
│ ├─ Strategic importance (25 pts)
│ ├─ Code quality (15 pts)
│ └─ Migration feasibility (10 pts)
│
├─ 6. Generate merge recommendation
│ ├─ P0/P1/P2/P3/Archive
│ └─ Migration priority
│
└─ 7. Write results
└─ .rd-os/state/agent-states/{repo_id}-analysis.json
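The five-factor rubric in step 5 can be sketched as a scoring function. The component caps match the breakdown above, while the S/A/B/C cut-offs (90/65/50) are illustrative assumptions — the report does not state exact tier thresholds:

```python
# Sketch of the five-factor value score and S/A/B/C tiering.
# Component caps come from the task breakdown above; the tier cut-offs
# (90/65/50) are ASSUMED for illustration, not documented thresholds.
CAPS = {"activity": 25, "impact": 25, "strategic": 25,
        "quality": 15, "feasibility": 10}

def value_score(components: dict) -> int:
    # Clamp each component to its cap before summing (max total: 100).
    return sum(min(components.get(name, 0), cap) for name, cap in CAPS.items())

def tier(total: int) -> str:
    if total >= 90:
        return "S"
    if total >= 65:
        return "A"
    if total >= 50:
        return "B"
    return "C"

# tidb's component scores, taken from the expected per-repo output below.
tidb = {"activity": 25, "impact": 25, "strategic": 25,
        "quality": 12, "feasibility": 8}
total = value_score(tidb)
print(total, tier(total))  # 95 S
```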
Execution Plan
Phase 1: Setup (10 minutes)
# 1. Create the .rd-os/ directory structure
mkdir -p 20260301-mono-repo/.rd-os/{state/agent-states,store/artifacts,config}
# 2. Initialize the SQLite database
sqlite3 20260301-mono-repo/.rd-os/store/progress.db <<EOF
CREATE TABLE repos (
repo_id TEXT PRIMARY KEY,
name TEXT NOT NULL,
priority TEXT,
category TEXT,
created_at TIMESTAMP,
updated_at TIMESTAMP
);
CREATE TABLE analysis_state (
repo_id TEXT PRIMARY KEY,
status TEXT,
progress_percent INTEGER,
started_at TIMESTAMP,
completed_at TIMESTAMP,
result_json TEXT,
error_message TEXT
);
CREATE TABLE sub_agents (
agent_id TEXT PRIMARY KEY,
type TEXT,
repo_id TEXT,
status TEXT,
spawned_at TIMESTAMP,
completed_at TIMESTAMP
);
EOF
# 3. Create the repo list
cat > 20260301-mono-repo/.rd-os/config/target-repos.json <<EOF
[
{"id": "tidb", "name": "pingcap/tidb", "priority": "P0"},
{"id": "tiflow", "name": "pingcap/tiflow", "priority": "P0"},
{"id": "tidb-operator", "name": "pingcap/tidb-operator", "priority": "P0"},
{"id": "ossinsight", "name": "pingcap/ossinsight", "priority": "P1"},
{"id": "docs", "name": "pingcap/docs", "priority": "P1"},
{"id": "tidb-dashboard", "name": "pingcap/tidb-dashboard", "priority": "P1"},
{"id": "tiup", "name": "pingcap/tiup", "priority": "P1"},
{"id": "autoflow", "name": "pingcap/autoflow", "priority": "P2"},
{"id": "tidb-vector-python", "name": "pingcap/tidb-vector-python", "priority": "P2"},
{"id": "ticdc", "name": "pingcap/ticdc", "priority": "P2"}
]
EOF
Phase 2: Analysis (30-60 minutes)
Concurrency: 5 sub-agents running simultaneously
Batches: 2 (5 repos/batch)
Batch 1 (P0 repos):
├─ tidb
├─ tiflow
├─ tidb-operator
├─ ossinsight
└─ docs
Batch 2 (P1/P2 repos):
├─ tidb-dashboard
├─ tiup
├─ autoflow
├─ tidb-vector-python
└─ ticdc
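The two-batch, five-concurrent schedule above can be sketched with an asyncio semaphore. The `analyze` coroutine here is a placeholder for the real `sessions_spawn()` call:

```python
import asyncio

# Sketch of the 5-concurrent, 2-batch analysis schedule. `analyze` stands in
# for the real sessions_spawn() call; here it only records the repo it handled.
REPOS = ["tidb", "tiflow", "tidb-operator", "ossinsight", "docs",
         "tidb-dashboard", "tiup", "autoflow", "tidb-vector-python", "ticdc"]

async def analyze(repo: str, sem: asyncio.Semaphore, done: list) -> None:
    async with sem:                 # at most 5 sub-agents run at once
        await asyncio.sleep(0)      # placeholder for the actual analysis work
        done.append(repo)

async def run_batches(repos, batch_size=5, concurrency=5):
    done = []
    sem = asyncio.Semaphore(concurrency)
    for i in range(0, len(repos), batch_size):
        batch = repos[i:i + batch_size]
        # Each batch completes fully before the next one starts.
        await asyncio.gather(*(analyze(r, sem, done) for r in batch))
    return done

done = asyncio.run(run_batches(REPOS))
print(f"{len(done)}/10 repos analyzed")
```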
Phase 3: Synthesis (15 minutes)
OpenClaw synthesizes all results:
├─ Compute overall statistics
├─ Produce the value-score ranking
├─ Generate merge recommendations
└─ Output the experiment report
Expected Output
Per-Repo Analysis
{
"repo_id": "tidb",
"repo_name": "pingcap/tidb",
"analysis_date": "2026-03-01",
"metadata": {
"stars": 39859,
"forks": 6126,
"language": "Go",
"size_mb": 652,
"created_at": "2015-09-06",
"last_push": "2026-02-28"
},
"value_score": {
"total": 95,
"activity": 25,
"impact": 25,
"strategic": 25,
"quality": 12,
"feasibility": 8
},
"tier": "S",
"code_structure": {
"main_components": ["server", "storage", "query", "optimizer"],
"test_coverage": 78.5,
"documentation_score": 85
},
"dependencies": {
"internal": 12,
"external": 127,
"circular": 0
},
"recommendation": {
"action": "migrate",
"priority": "P0",
"effort": "high",
"risk": "medium",
"notes": "Core product, migrate first with dedicated team"
}
}
Experiment Report
# 10-Repo Experiment Report
## Summary
- Repos analyzed: 10
- Total time: 1.5 hours
- Total cost: $0.08
- Success rate: 100%
## Value Distribution
- S-tier: 1 (tidb: 95)
- A-tier: 3 (tiflow: 75, tidb-operator: 70, ossinsight: 66)
- B-tier: 4 (docs: 62, tiup: 58, tidb-dashboard: 55, autoflow: 52)
- C-tier: 2 (tidb-vector-python: 45, ticdc: 42)
## Recommendations
- P0 (migrate first): tidb, tiflow, tidb-operator
- P1 (migrate second): ossinsight, docs, tidb-dashboard, tiup
- P2 (migrate third): autoflow, tidb-vector-python, ticdc
## Lessons Learned
- [ ] What worked well
- [ ] What needs improvement
- [ ] Adjustments for 400-repo scale
Success Criteria
| Criterion | Target | Actual |
|---|---|---|
| Completion | 10/10 repos analyzed | TBD |
| Success Rate | >90% | TBD |
| Time | <2 hours | TBD |
| Cost | <$0.20 | TBD |
| State Persistence | Checkpoints written | TBD |
| Recovery | Can resume after restart | TBD |
| Quality | Actionable recommendations | TBD |
Risk Mitigation
| Risk | Mitigation |
|---|---|
| API Rate Limit | Batch requests, add delays |
| Sub-Agent Failure | Checkpoint + retry |
| OpenClaw Restart | Recovery from progress.db |
| Token Overrun | Monitor usage, set limits |
| Poor Quality Output | Human review, iterate template |
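The "checkpoint + retry" mitigation for sub-agent failures can be sketched as a bounded retry loop that persists a checkpoint after every attempt. The failing task and backoff delays here are synthetic illustrations:

```python
import json
import os
import tempfile
import time

# Sketch of the "checkpoint + retry" mitigation: retry a flaky sub-agent task
# a bounded number of times, writing a checkpoint after each attempt so that
# an orchestrator restart can see how far the task got. Delays are illustrative.
def run_with_retry(task, checkpoint_path, max_attempts=3, base_delay=0.01):
    for attempt in range(1, max_attempts + 1):
        try:
            result = task()
            with open(checkpoint_path, "w") as f:
                json.dump({"status": "done", "attempt": attempt,
                           "result": result}, f)
            return result
        except Exception as e:
            with open(checkpoint_path, "w") as f:
                json.dump({"status": "failed", "attempt": attempt,
                           "error": str(e)}, f)
            if attempt == max_attempts:
                raise
            time.sleep(base_delay * 2 ** (attempt - 1))  # exponential backoff

calls = {"n": 0}
def flaky():
    # Synthetic task: fails twice with a transient error, then succeeds.
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient API error")
    return "analysis complete"

ckpt = os.path.join(tempfile.mkdtemp(), "repo-001.checkpoint.json")
print(run_with_retry(flaky, ckpt))
```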
Next Steps After Experiment
If Successful (>=90% criteria met)
1. Scale to 400 repos
   - Same architecture, more concurrency
   - Batch processing (50 repos/batch)
   - Estimated time: 8-16 hours
2. Refine Process
   - Incorporate lessons learned
   - Optimize sub-agent templates
   - Tune value scoring
3. Begin Migration Planning
   - Use analysis results for migration order
   - Create detailed migration runbook
If Issues (<90% criteria met)
1. Identify Problems
   - Technical issues?
   - Template issues?
   - Architecture issues?
2. Fix and Re-run
   - Address root causes
   - Re-run experiment
   - Validate fixes
Experiment Log
To be filled during execution
[2026-03-01 HH:MM] Experiment started
[2026-03-01 HH:MM] Setup complete
[2026-03-01 HH:MM] Batch 1 spawned (5 sub-agents)
[2026-03-01 HH:MM] Batch 1 complete (5/5)
[2026-03-01 HH:MM] Batch 2 spawned (5 sub-agents)
[2026-03-01 HH:MM] Batch 2 complete (5/5)
[2026-03-01 HH:MM] Synthesis complete
[2026-03-01 HH:MM] Experiment finished
Experiment designed for: Large-scale Agentic Engineering
RD-OS: Research & Development Operating System
R&D infrastructure for the AI era
"The past: coordinating lots of people and chasing development, deployment, testing, operations, incidents, and alerts — exhausting"
"The future: a living system where AI coordinates everything autonomously and humans focus on decisions"
The Core Problem
Pain Points of the Traditional R&D Model
┌─────────────────────────────────────────────────────────────────┐
│ Traditional R&D Pain │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Understanding the System: │
│ ❌ 400+ repos, no one knows the full picture │
│ ❌ Documentation always outdated │
│ ❌ "Who owns this?" "Why was this done?" │
│ ❌ New hire ramp-up: 3-6 months │
│ │
│ Coordination Overhead: │
│ ❌ Dev → Test → Deploy → Ops: handoffs everywhere │
│ ❌ Incident response: page 5 people, 2 hours to triage │
│ ❌ Sprint planning: 2 days of meetings │
│ ❌ Post-mortem: blame, not learning │
│ │
│ Alert Fatigue: │
│ ❌ 100+ alerts/day, most are noise │
│ ❌ No context, just "something is broken" │
│ ❌ Human must investigate everything │
│ │
│ Progress Tracking: │
│ ❌ JIRA tickets, standups, status reports │
│ ❌ "What's blocked?" "Who's working on what?" │
│ ❌ Velocity is a guess │
│ │
└─────────────────────────────────────────────────────────────────┘
Root Cause: The system is passive. It waits for humans to:
- Understand it
- Coordinate across it
- Fix it
- Improve it
Vision: RD-OS (Active, Living System)
┌─────────────────────────────────────────────────────────────────┐
│ RD-OS │
│ A Living R&D Operating System │
├─────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ Unified Codebase │ │
│ │ (400 repos → 1 mono-repo, AI-readable) │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │ │
│ ┌────────────────────┼────────────────────┐ │
│ ▼ ▼ ▼ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ AI Core │ │ Skills │ │ Humans │ │
│ │ (Agents) │ │ (Tools) │ │ (Decision) │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ │
│ │
│ Capabilities: │
│ ✅ Self-understanding (always knows its state) │
│ ✅ Self-coordination (agents talk to each other) │
│ ✅ Self-healing (detects and fixes issues) │
│ ✅ Self-improvement (identifies and acts on optimizations) │
│ │
│ Result: Humans focus on WHAT, AI handles HOW │
└─────────────────────────────────────────────────────────────────┘
RD-OS Architecture
Layer 0: The Codebase (Passive Foundation)
mono-repo/
├── products/ # TiDB, TiDB Next-Gen
├── platform/ # Cloud SaaS, control plane
├── devops/ # Operations tooling
├── libs/ # Shared libraries
├── tools/ # Build/dev tools
├── docs/ # Living documentation
└── .rd-os/ # RD-OS configuration
├── agents/ # Agent definitions
├── skills/ # Skill configurations
├── workflows/ # Automated workflows
└── policies/ # Decision policies
Layer 1: Perception (Understanding the System)
┌─────────────────────────────────────────────────────────────┐
│ Perception Layer │
│ "The system understands itself" │
├─────────────────────────────────────────────────────────────┤
│ │
│ code-understanding-agent │
│ ├─ Continuously indexes codebase │
│ ├─ Maps dependencies (real-time) │
│ ├─ Tracks architecture changes │
│ └─ Answers: "What does this do?" "Who uses this?" │
│ │
│ documentation-curator │
│ ├─ Auto-generates docs from code │
│ ├─ Keeps docs in sync (per-change) │
│ ├─ Maintains architecture decision records │
│ └─ Answers: "Why was this designed this way?" │
│ │
│ health-monitor │
│ ├─ Real-time system health dashboard │
│ ├─ Tracks: build status, test coverage, tech debt │
│ ├─ Detects anomalies │
│ └─ Answers: "Is the system healthy?" │
│ │
└─────────────────────────────────────────────────────────────┘
Before vs After:
| Task | Before | After (RD-OS) |
|---|---|---|
| Understand a component | Read docs (outdated), ask team (slow) | Ask agent (instant, accurate) |
| Find dependencies | Search code, grep, hope | Query dependency graph |
| New hire ramp-up | 3-6 months | 2-4 weeks (AI-guided) |
| Architecture review | Manual docs, diagrams | Auto-generated, always current |
Layer 2: Coordination (Orchestrating Work)
┌─────────────────────────────────────────────────────────────┐
│ Coordination Layer │
│ "The system coordinates itself" │
├─────────────────────────────────────────────────────────────┤
│ │
│ workflow-orchestrator │
│ ├─ Dev → Test → Deploy → Ops: automatic handoffs │
│ ├─ No human coordination needed │
│ ├─ Tracks progress, unblocks automatically │
│ └─ Humans see: "Feature X: 80% done, deploying in 2h" │
│ │
│ sprint-coordinator │
│ ├─ Analyzes backlog, capacity, velocity │
│ ├─ Suggests sprint goals │
│ ├─ Adjusts mid-sprint based on reality │
│ └─ Humans see: "Sprint on track" or "Risk: feature Y" │
│ │
│ dependency-coordinator │
│ ├─ Detects cross-component changes needed │
│ ├─ Coordinates updates across repos │
│ ├─ Prevents breaking changes │
│ └─ Humans see: "Updating lib X, 3 components affected" │
│ │
└─────────────────────────────────────────────────────────────┘
Before vs After:
| Task | Before | After (RD-OS) |
|---|---|---|
| Dev → Test handoff | PR review, wait for QA, days | Auto-test, auto-merge, hours |
| Deploy coordination | Schedule, change review, CAB | Auto-deploy (policy-based) |
| Sprint planning | 2-day meetings | AI-suggested, human-approved |
| Cross-team dependency | Email, meetings, delays | Auto-coordinated |
Layer 3: Action (Executing Work)
┌─────────────────────────────────────────────────────────────┐
│ Action Layer │
│ "The system executes work" │
├─────────────────────────────────────────────────────────────┤
│ │
│ development-agent │
│ ├─ Implements features (from specs) │
│ ├─ Writes tests │
│ ├─ Creates PRs │
│ └─ Humans review, approve │
│ │
│ testing-agent │
│ ├─ Runs test suites │
│ ├─ Generates missing tests │
│ ├─ Investigates flaky tests │
│ └─ Humans see: "Tests pass" or "Here's the issue" │
│ │
│ deployment-agent │
│ ├─ Deploys to staging/production │
│ ├─ Monitors rollout │
│ ├─ Auto-rollback on issues │
│ └─ Humans see: "Deployed v1.2.3, health: ✅" │
│ │
│ incident-responder │
│ ├─ Detects incidents (before humans) │
│ ├─ Triage: severity, impact, root cause │
│ ├─ Auto-remediation (restart, rollback, scale) │
│ └─ Humans see: "Incident detected, resolved, here's why"│
│ │
└─────────────────────────────────────────────────────────────┘
Before vs After:
| Task | Before | After (RD-OS) |
|---|---|---|
| Feature development | Human writes code, days/weeks | AI drafts, human reviews, hours/days |
| Testing | Manual test writing, maintenance | Auto-generated, maintained |
| Deployment | Manual process, risky | Automated, safe, rollback-ready |
| Incident response | Page, triage, fix (hours) | Auto-detect, auto-fix (minutes) |
Layer 4: Learning (Continuous Improvement)
┌─────────────────────────────────────────────────────────────┐
│ Learning Layer │
│ "The system improves itself" │
├─────────────────────────────────────────────────────────────┤
│ │
│ post-mortem-analyst │
│ ├─ Analyzes incidents (no blame) │
│ ├─ Identifies root causes │
│ ├─ Proposes preventive measures │
│ └─ Humans review, approve changes │
│ │
│ tech-debt-detector │
│ ├─ Continuously scans for tech debt │
│ ├─ Prioritizes by impact │
│ ├─ Proposes refactoring plans │
│ └─ Humans see: "Tech debt: 5 high-priority items" │
│ │
│ optimization-recommender │
│ ├─ Analyzes performance, cost, efficiency │
│ ├─ Identifies optimization opportunities │
│ ├─ Proposes and implements improvements │
│ └─ Humans see: "Saved $X/month with optimization Y" │
│ │
│ knowledge-curator │
│ ├─ Captures learnings from incidents │
│ ├─ Updates documentation │
│ ├─ Shares insights across teams │
│ └─ System gets smarter over time │
│ │
└─────────────────────────────────────────────────────────────┘
Key Workflows (End-to-End)
Workflow 1: Feature Development
┌─────────────────────────────────────────────────────────────────┐
│ Feature Development (AI-First) │
└─────────────────────────────────────────────────────────────────┘
Human: "Build feature X: users can export data as CSV"
│
▼
┌─────────────────────────────────────────────────────────────┐
│ 1. Spec Analysis (AI) │
│ ├─ Understands requirements │
│ ├─ Identifies affected components │
│ └─ Creates implementation plan │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ 2. Implementation (AI) │
│ ├─ Writes code (backend, frontend, tests) │
│ ├─ Creates PR │
│ └─ Notifies human reviewer │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ 3. Review (Human + AI) │
│ ├─ AI: automated review (style, tests, security) │
│ ├─ Human: logic, UX, business logic │
│ └─ AI: addresses feedback, updates PR │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ 4. Merge & Deploy (AI) │
│ ├─ Auto-merge (if checks pass) │
│ ├─ Deploy to staging │
│ ├─ Run integration tests │
│ └─ Deploy to production (feature flag) │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ 5. Monitor (AI) │
│ ├─ Watches metrics, errors, adoption │
│ ├─ Alerts human if issues │
│ └─ Reports: "Feature X: 1000 uses/day, 0 errors" │
└─────────────────────────────────────────────────────────────┘
Total Time: 2-3 days (vs 2-3 weeks traditional)
Human Effort: 2-4 hours review (vs 40+ hours coding)
Workflow 2: Incident Response
┌─────────────────────────────────────────────────────────────────┐
│ Incident Response (AI-First) │
└─────────────────────────────────────────────────────────────────┘
[Incident Occurs: API latency spike]
│
▼
┌─────────────────────────────────────────────────────────────┐
│ 1. Detection (AI) - T+0s │
│ ├─ Detects anomaly (before humans notice) │
│ ├─ Correlates with recent changes │
│ └─ Starts investigation │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ 2. Triage (AI) - T+30s │
│ ├─ Severity: P2 (degraded performance) │
│ ├─ Impact: 15% of requests affected │
│ ├─ Root cause: recent deployment, memory leak │
│ └─ Notifies on-call + team channel │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ 3. Mitigation (AI) - T+60s │
│ ├─ Auto-rollback to previous version │
│ ├─ Scales up affected service │
│ └─ Monitors recovery │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ 4. Resolution (AI) - T+5min │
│ ├─ Metrics return to normal │
│ ├─ Incident marked resolved │
│ └─ Report: "Root cause, fix, prevention plan" │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ 5. Post-Mortem (AI + Human) - T+1day │
│ ├─ AI: timeline, root cause, prevention │
│ ├─ Human: review, approve │
│ └─ AI: creates follow-up tasks │
└─────────────────────────────────────────────────────────────┘
Total Time: 5 minutes to resolution (vs 2-4 hours traditional)
Human Effort: 30 minutes review (vs 4+ hours firefighting)
Workflow 3: Alert Handling
┌─────────────────────────────────────────────────────────────────┐
│ Alert Handling (AI-First) │
└─────────────────────────────────────────────────────────────────┘
[Alert: High CPU on service X]
│
▼
┌─────────────────────────────────────────────────────────────┐
│ 1. Alert Analysis (AI) │
│ ├─ Is this real? (vs noise) │
│ ├─ What's the context? (recent changes, load spike) │
│ └─ What's the impact? │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ 2. Decision (AI, policy-based) │
│ ├─ If known issue + auto-fix exists → execute fix │
│ ├─ If unknown → investigate, notify human │
│ └─ If noise → suppress, update alert rules │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ 3. Action (AI) │
│ ├─ Execute fix OR │
│ ├─ Create incident OR │
│ └─ Update alert rules │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ 4. Human Notification (if needed) │
│ ├─ "Alert X: auto-resolved, here's what happened" OR │
│ └─ "Alert X: needs attention, here's the context" │
└─────────────────────────────────────────────────────────────┘
Result: 90% of alerts handled without human intervention
Human Focus: Only meaningful alerts with full context
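The policy-based decision step above (known issue → auto-fix, unknown → escalate, noise → suppress) can be sketched as a small routing function. The rule table and alert fields (`kind`, `is_noise`) are illustrative assumptions, not a real alert schema:

```python
# Sketch of the policy-based alert decision step. KNOWN_FIXES and the alert
# fields ("kind", "is_noise") are ASSUMED for illustration only.
KNOWN_FIXES = {
    "high_cpu": "restart_service",
    "bad_deploy": "rollback",
}

def decide(alert: dict) -> str:
    if alert.get("is_noise"):
        # Noise: suppress and feed back into the alert rules.
        return "suppress_and_update_rules"
    fix = KNOWN_FIXES.get(alert.get("kind"))
    if fix is not None:
        # Known issue with an automated remedy: execute it.
        return f"auto_fix:{fix}"
    # Unknown issue: open an incident and notify a human with context.
    return "create_incident_and_notify_human"

for alert in [
    {"kind": "high_cpu", "is_noise": False},
    {"kind": "disk_pressure", "is_noise": False},
    {"kind": "cpu_blip", "is_noise": True},
]:
    print(alert["kind"], "->", decide(alert))
```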
Human Experience in RD-OS
What Humans Do
┌─────────────────────────────────────────────────────────────┐
│ Human Focus Areas │
├─────────────────────────────────────────────────────────────┤
│ │
│ Strategy & Direction │
│ ├─ What problems to solve │
│ ├─ What features to build │
│ └─ What trade-offs to make │
│ │
│ Review & Approval │
│ ├─ Architecture decisions (AI-proposed) │
│ ├─ Security-critical changes │
│ ├─ Breaking changes │
│ └─ High-risk deployments │
│ │
│ Exception Handling │
│ ├─ Edge cases AI can't handle │
│ ├─ Novel situations │
│ └─ Escalations from agents │
│ │
│ Creativity & Innovation │
│ ├─ New product ideas │
│ ├─ Novel solutions │
│ └─ Exploratory work │
│ │
└─────────────────────────────────────────────────────────────┘
What Humans Don’t Do
┌─────────────────────────────────────────────────────────────┐
│ Eliminated by RD-OS │
├─────────────────────────────────────────────────────────────┤
│ │
│ ❌ Manual code writing (AI drafts) │
│ ❌ Manual testing (AI generates & runs) │
│ ❌ Manual deployment (AI deploys) │
│ ❌ Manual monitoring (AI watches 24/7) │
│ ❌ Alert triage (AI handles 90%) │
│ ❌ Incident firefighting (AI auto-remediates) │
│ ❌ Status meetings (AI reports automatically) │
│ ❌ Progress tracking (AI tracks in real-time) │
│ ❌ Documentation writing (AI auto-generates) │
│ ❌ Coordination overhead (AI coordinates) │
│ │
└─────────────────────────────────────────────────────────────┘
Metrics: Before vs After
| Metric | Traditional | RD-OS Target | Improvement |
|---|---|---|---|
| Feature dev time | 2-3 weeks | 2-3 days | 10x |
| Incident MTTR | 2-4 hours | 5-10 minutes | 24x |
| Alert noise | 90% false positive | <10% false positive | 9x |
| New hire ramp-up | 3-6 months | 2-4 weeks | 3-6x |
| Deploy frequency | Weekly | Multiple/day | 10x+ |
| Deploy failure rate | 10-20% | <1% | 10-20x |
| Tech debt visibility | Unknown | Real-time dashboard | - |
| Coordination meetings | 10+ hours/week | <2 hours/week | 5x |
| Human coding time | 60% | 10% | 6x |
| Human decision time | 20% | 70% | 3.5x |
Implementation Roadmap
Phase 1: Foundation (Month 1-2)
- Mono-repo consolidation (400 → 1)
- Basic agent framework
- Core skills (build, test, deploy)
- Perception layer (code understanding, docs)
Phase 2: Coordination (Month 3-4)
- Workflow orchestrator
- Sprint coordinator
- Dependency coordinator
- Action layer (dev, test, deploy agents)
Phase 3: Autonomy (Month 5-6)
- Incident responder
- Alert handler
- Post-mortem analyst
- Learning layer (continuous improvement)
Phase 4: Optimization (Month 7-12)
- Full autonomy for routine work
- AI-driven optimization
- Human focus on strategy only
- Continuous self-improvement
Conclusion
RD-OS is not just a mono-repo. It’s a paradigm shift:
| Aspect | Traditional | RD-OS |
|---|---|---|
| System Nature | Passive | Active, Living |
| Understanding | Human effort | Built-in |
| Coordination | Human meetings | AI orchestration |
| Execution | Human labor | AI execution |
| Improvement | Occasional, manual | Continuous, automatic |
| Human Role | Doer | Decision-maker |
The goal:
Humans define WHAT matters. AI handles HOW to achieve it.
The result:
An R&D department that moves at AI speed, with human wisdom.
"The past: coordinating lots of people and chasing development, deployment, testing, operations, incidents, and alerts — exhausting"
"The future: a living system where AI coordinates everything autonomously and humans focus on decisions"
This is RD-OS.
RD-OS OpenClaw Architecture
OpenClaw as the orchestrator, with a cluster of sub-agents
"OpenClaw is the orchestrator; sub-agents are temporary workers — destroyed when done, with their state persisted on the filesystem"
Core Architecture
OpenClaw's Role
┌─────────────────────────────────────────────────────────────────┐
│ OpenClaw │
│ (The Orchestrator) │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Role: Master Controller │
│ │
│ Responsibilities: │
│ ├─ Maintain global state (via .rd-os/store/) │
│ ├─ Make high-level decisions │
│ ├─ Spawn sub-agents for parallel work │
│ ├─ Collect and synthesize results │
│ ├─ Handle exceptions and escalations │
│ └─ Report progress to humans │
│ │
│ Memory: │
│ ├─ Short-term: Conversation context (lost on restart) │
│ └─ Long-term: .rd-os/store/ (survives restart) │
│ │
│ Models: │
│ ├─ OpenClaw: qwen3.5-plus (or user's choice) │
│ └─ Sub-agents: qwen3.5-plus (cheap, fast) │
│ │
└─────────────────────────────────────────────────────────────────┘
Sub-Agent Model
┌─────────────────────────────────────────────────────────────────┐
│ Sub-Agent Pattern │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Lifecycle: │
│ │
│ 1. Spawn │
│ ├─ OpenClaw calls sessions_spawn() │
│ ├─ Task: "Analyze repo-001, output to .rd-os/state/..." │
│ └─ Model: qwen3.5-plus (cheap) │
│ │
│ 2. Execute │
│ ├─ Sub-agent works independently │
│ ├─ Writes checkpoints to .rd-os/state/ │
│ └─ Reports completion via sessions_send() │
│ │
│ 3. Collect │
│ ├─ OpenClaw reads output from .rd-os/state/ │
│ ├─ Synthesizes results │
│ └─ Updates .rd-os/store/progress.db │
│ │
│ 4. Destroy │
│ ├─ Sub-agent session ends (cleanup=delete) │
│ └─ No memory retained (state is in files) │
│ │
│ Key Insight: │
│ - Sub-agents are DISPOSABLE WORKERS │
│ - State is in FILES, not in agent memory │
│ - OpenClaw can restart, sub-agents can die, progress remains │
│ │
└─────────────────────────────────────────────────────────────────┘
System Architecture
Three-Layer Architecture
┌─────────────────────────────────────────────────────────────────┐
│ Layer 1: OpenClaw (Main) │
│ (Persistent Controller) │
├─────────────────────────────────────────────────────────────────┤
│ │
│ - Maintains .rd-os/store/progress.db │
│ - Makes scheduling decisions │
│ - Spawns sub-agents via sessions_spawn() │
│ - Collects results via sessions_send() │
│ - Handles human interaction │
│ - Recovers from restart (reads from .rd-os/store/) │
│ │
│ Model: qwen3.5-plus (or user's preferred model) │
│ Lifetime: Long-running (weeks to months) │
│ │
└─────────────────────────────────────────────────────────────────┘
│
│ sessions_spawn()
│ sessions_send()
▼
┌─────────────────────────────────────────────────────────────────┐
│ Layer 2: Sub-Agent Pool (Ephemeral) │
│ (Disposable Workers) │
├─────────────────────────────────────────────────────────────────┤
│ │
│ - Created on-demand via sessions_spawn() │
│ - Focused task: "Analyze this repo", "Migrate that repo" │
│ - Writes state to .rd-os/state/agent-states/{id}.json │
│ - Reports completion, then destroyed │
│ - No long-term memory (state is in files) │
│ │
│ Model: qwen3.5-plus (cheap, fast) │
│ Lifetime: Short (minutes to hours per task) │
│ Concurrency: 10-50 simultaneous sub-agents │
│ │
└─────────────────────────────────────────────────────────────────┘
│
│ File I/O
▼
┌─────────────────────────────────────────────────────────────────┐
│ Layer 3: Persistent State (Files + DB) │
│ (Source of Truth) │
├─────────────────────────────────────────────────────────────────┤
│ │
│ .rd-os/ │
│ ├── state/ │
│ │ ├── agent-states/ # Per-sub-agent checkpoint │
│ │ ├── progress/ # Aggregated progress │
│ │ └── checkpoints/ # Milestone snapshots │
│ │ │
│ └── store/ │
│ ├── progress.db # SQLite: definitive state │
│ ├── agents.db # SQLite: sub-agent registry │
│ ├── artifacts/ # Generated reports │
│ └── config/ # Configuration │
│ │
│ Key: This layer SURVIVES everything │
│ - OpenClaw restart → OK, read from DB │
│ - Sub-agent dies → OK, checkpoint in files │
│ - Gateway crash → OK, DB is durable │
│ │
└─────────────────────────────────────────────────────────────────┘
OpenClaw Workflow
Main Loop
# Pseudo-code: OpenClaw main orchestration loop
class OpenClawOrchestrator:
    """
    OpenClaw as the main orchestrator
    """

    async def run(self):
        # 1. Recovery (after restart)
        await self.recover_state()

        # 2. Main loop
        while not self.is_complete():
            # 2.1 Check progress
            progress = self.load_progress()

            # 2.2 Make scheduling decisions
            decisions = self.make_scheduling_decisions(progress)

            # 2.3 Spawn sub-agents for new work
            for decision in decisions:
                if decision.action == 'analyze':
                    await self.spawn_analyzer(decision.repo)
                elif decision.action == 'migrate':
                    await self.spawn_migrator(decision.repo)
                elif decision.action == 'deep_dive':
                    await self.spawn_deep_analysis_team(decision.repo)

            # 2.4 Check for completed sub-agents
            completed = await self.check_completed_sub_agents()
            for result in completed:
                await self.process_result(result)

            # 2.5 Handle escalations
            await self.handle_escalations()

            # 2.6 Update progress
            await self.update_progress()

            # 2.7 Checkpoint
            await self.checkpoint()

            # 2.8 Wait (avoid busy loop)
            await asyncio.sleep(60)

        # 3. Completion
        await self.generate_final_report()

    async def spawn_analyzer(self, repo: Repo):
        """
        Spawn a sub-agent to analyze a repo
        """
        task = f"""
        Analyze repository: {repo.name}
        Output to: .rd-os/state/agent-states/{repo.id}-analysis.json

        Steps:
        1. Read repo metadata from GitHub API
        2. Analyze code structure
        3. Map dependencies
        4. Assess code quality
        5. Generate merge recommendation

        Checkpoint after each step.
        Report completion via sessions_send().
        """

        # Spawn sub-agent (qwen3.5-plus, cheap)
        session = await sessions_spawn(
            task=task,
            model='qwen3.5-plus',
            cleanup='delete',  # Destroy after completion
            label=f'analyzer-{repo.id}'
        )

        # Register sub-agent
        self.db.execute("""
            INSERT INTO sub_agents (agent_id, type, repo_id, status, spawned_at)
            VALUES (?, 'analyzer', ?, 'running', ?)
        """, (session.id, repo.id, now()))

    async def process_result(self, result: SubAgentResult):
        """
        Process a completed sub-agent result
        """
        # Read output from file
        output = read_json(result.output_path)

        # Update progress DB
        self.db.execute("""
            UPDATE analysis_state
            SET status = 'done', result_json = ?, completed_at = ?
            WHERE repo_id = ?
        """, (json.dumps(output), now(), result.repo_id))

        # Update sub-agent registry
        self.db.execute("""
            UPDATE sub_agents
            SET status = 'completed', completed_at = ?
            WHERE agent_id = ?
        """, (now(), result.agent_id))

        # Synthesize findings (OpenClaw does this)
        await self.synthesize_findings(result.repo_id, output)

        # Make next decision (spawn more agents? escalate?)
        await self.make_next_decision(result)
Sub-Agent Lifecycle
State Machine
┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐
│ idle │────▶│ running │────▶│ done │ │ failed │
└─────────┘ └─────────┘ └─────────┘ └─────────┘
▲ │ │
│ │ ┌─────────┐ │
│ └────▶│ paused │◀───────────────┘
│ └─────────┘
│
│ sessions_spawn()
│
┌─────────┐
│OpenClaw │
└─────────┘
Sub-Agent Task Template
# Template for sub-agent tasks
ANALYZER_TASK_TEMPLATE = """
You are a Repository Analyzer Agent.
TASK: Analyze {repo_name}
OUTPUT: .rd-os/state/agent-states/{repo_id}-analysis.json
INSTRUCTIONS:
1. Read repo metadata from .rd-os/store/repos/{repo_id}.json
2. Analyze code structure (use GitHub API or local clone)
3. Map dependencies (go.mod, package.json, etc.)
4. Assess code quality (tests, docs, lint)
5. Generate merge recommendation (P0/P1/P2/P3/archive)
CHECKPOINTING:
- After each step, write checkpoint to:
.rd-os/state/agent-states/{repo_id}-analysis.checkpoint.json
- Include: step_completed, partial_results, can_resume
COMPLETION:
- Write final output to: .rd-os/state/agent-states/{repo_id}-analysis.json
- Send completion message via sessions_send():
"Analysis complete: {repo_id}, output: {output_path}"
MODEL: qwen3.5-plus
TIMEOUT: 30 minutes
CLEANUP: delete (session destroyed after completion)
"""
Recovery After OpenClaw Restart
Recovery Flow
OpenClaw Restarts
│
▼
┌─────────────────────────────────────────────────────────────┐
│ 1. Load State from .rd-os/store/progress.db │
│ ├─ Query: What repos are analyzed? │
│ ├─ Query: What repos are in progress? │
│ └─ Query: What sub-agents were running? │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ 2. Reconcile Sub-Agent State │
│ ├─ Find sub-agents marked 'running' │
│ ├─ Check if they have checkpoints │
│ ├─ If checkpoint exists → respawn with resume │
│ └─ If no checkpoint → restart from beginning │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ 3. Resume Orchestration │
│ ├─ Continue main loop │
│ ├─ Spawn new sub-agents for pending work │
│ └─ Resume from last checkpoint │
└─────────────────────────────────────────────────────────────┘
Result: OpenClaw can restart anytime, progress is never lost
Recovery Example
# Pseudo-code: OpenClaw recovery
async def recover_state(self):
"""
Recover state after OpenClaw restart
"""
# Load progress DB
self.db = load_database('.rd-os/store/progress.db')
# Find incomplete analysis
incomplete = self.db.query("""
SELECT repo_id, progress_percent, last_checkpoint
FROM analysis_state
WHERE status = 'running'
""")
for task in incomplete:
# Check if sub-agent has checkpoint
checkpoint_path = f".rd-os/state/agent-states/{task.repo_id}-analysis.checkpoint.json"
if exists(checkpoint_path):
# Resume from checkpoint
checkpoint = read_json(checkpoint_path)
await self.resume_analyzer(task.repo_id, checkpoint)
log.info(f"Resumed analysis: {task.repo_id} from step {checkpoint['step']}")
else:
# No checkpoint, restart
await self.spawn_analyzer(task.repo_id)
log.warning(f"No checkpoint for {task.repo_id}, restarting")
# Find orphaned sub-agents (running but no progress)
orphaned = self.db.query("""
SELECT agent_id, repo_id, spawned_at
FROM sub_agents
WHERE status = 'running'
AND agent_id NOT IN (SELECT DISTINCT agent_id FROM checkpoints)
""")
for orphan in orphaned:
# Sub-agent died without checkpoint
log.warning(f"Orphaned sub-agent: {orphan.agent_id}, restarting")
await self.spawn_analyzer(orphan.repo_id)
log.info(f"Recovery complete: {len(incomplete)} tasks resumed")
Scaling Strategy
Concurrency Control
class ConcurrencyManager:
"""
Manage sub-agent concurrency
"""
def __init__(self, max_concurrent: int = 50):
self.max_concurrent = max_concurrent
self.active_count = 0
self.lock = asyncio.Lock()
async def acquire(self) -> bool:
"""
Acquire a slot for new sub-agent
"""
async with self.lock:
if self.active_count < self.max_concurrent:
self.active_count += 1
return True
return False
async def release(self):
"""
Release a slot when sub-agent completes
"""
async with self.lock:
self.active_count -= 1
def get_utilization(self) -> float:
return self.active_count / self.max_concurrent
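The same slot accounting is available from the standard library: `asyncio.Semaphore` blocks instead of returning `False`, which spares callers a retry loop. A self-contained sketch, with a sleep standing in for a real `sessions_spawn()`:

```python
import asyncio

async def run_with_limit(items, max_concurrent=50):
    """Spawn one task per item, never more than max_concurrent in flight."""
    sem = asyncio.Semaphore(max_concurrent)

    async def analyze(item):
        async with sem:              # blocks while all slots are busy
            await asyncio.sleep(0)   # placeholder for real sub-agent work
            return f"analyzed-{item}"

    # gather() preserves input order regardless of completion order
    return await asyncio.gather(*(analyze(i) for i in items))

completed = asyncio.run(run_with_limit(range(5), max_concurrent=2))
```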
Batch Processing
# Process repos in batches (avoid overwhelming system)
async def process_in_batches(self, repos: List[Repo], batch_size: int = 50):
"""
Process repos in batches
"""
for i in range(0, len(repos), batch_size):
batch = repos[i:i+batch_size]
log.info(f"Processing batch {i//batch_size + 1}: {len(batch)} repos")
# Spawn sub-agents for batch
tasks = [self.spawn_analyzer(repo) for repo in batch]
# Wait for batch to complete (with timeout)
await asyncio.gather(*tasks, return_exceptions=True)
# Checkpoint after batch
await self.checkpoint(f'batch-{i//batch_size}')
# Rate limit (avoid API throttling)
await asyncio.sleep(60)
Communication Pattern
OpenClaw ↔ Sub-Agent
┌─────────────────────────────────────────────────────────────────┐
│ OpenClaw ↔ Sub-Agent Communication │
├─────────────────────────────────────────────────────────────────┤
│ │
│ 1. OpenClaw → Sub-Agent: sessions_spawn(task) │
│ ├─ Task description │
│ ├─ Output path │
│ └─ Checkpoint requirements │
│ │
│ 2. Sub-Agent → File System: write_checkpoint() │
│ ├─ Progress updates │
│ ├─ Partial results │
│ └─ Recovery point │
│ │
│ 3. Sub-Agent → OpenClaw: sessions_send(message) │
│ ├─ "Task complete: {repo_id}" │
│ ├─ "Error: {error_message}" │
│ └─ "Escalation: {issue}" │
│ │
│ 4. OpenClaw → File System: read_output() │
│ ├─ Read final output │
│ ├─ Read checkpoints │
│ └─ Update progress DB │
│ │
│ Key: Communication is MINIMAL │
│ - Sub-agents don't retain state │
│ - Everything is in files │
│ - OpenClaw can restart, sub-agents are disposable │
│ │
└─────────────────────────────────────────────────────────────────┘
Cost Optimization
Model Selection
| Component | Model | Rationale |
|---|---|---|
| OpenClaw (Main) | qwen3.5-plus | Good balance of cost/capability |
| Sub-Agents | qwen3.5-plus | Cheap, fast, disposable |
| Deep Analysis | qwen3.5-plus (or upgrade if needed) | Can upgrade for complex tasks |
Cost Estimate (400 Repos)
Analysis Phase:
├─ 400 repos × ~10K tokens/repo = 4M tokens
├─ qwen3.5-plus: $0.002/1K tokens
└─ Total: ~$8
Migration Phase:
├─ 400 repos × ~50K tokens/repo = 20M tokens
├─ qwen3.5-plus: $0.002/1K tokens
└─ Total: ~$40
Ongoing Operations (monthly):
├─ Guardian agents: ~100K tokens/day
├─ Monthly: 3M tokens
└─ Total: ~$6/month
Total First Year: ~$120 from the phases above (~$48 one-time + ~$72 ongoing); budget ~$500 to leave headroom for deep analysis, retries, and occasional model upgrades
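The line items above, as a tiny calculator. The token counts and the $0.002/1K price are this document's estimates, not quoted rates:

```python
# Estimates from the table above; the price is an assumption, not a quote.
PRICE_PER_1K = 0.002  # USD per 1K tokens (qwen3.5-plus estimate)

def phase_cost(repos: int, tokens_per_repo: int,
               usd_per_1k: float = PRICE_PER_1K) -> float:
    """Cost of one phase at a flat per-token price."""
    return repos * tokens_per_repo / 1000 * usd_per_1k

analysis = phase_cost(400, 10_000)                   # 4M tokens  -> ~$8
migration = phase_cost(400, 50_000)                  # 20M tokens -> ~$40
guardian_monthly = 3_000_000 / 1000 * PRICE_PER_1K   # 3M tokens  -> ~$6/month
```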
Example: Full Workflow
End-to-End Example
Scenario: Analyze 400 repos with OpenClaw + sub-agents
Day 1: Initialization
├─ OpenClaw starts
├─ Creates .rd-os/ directory structure
├─ Loads repo list (400 repos)
├─ Spawns 50 sub-agents (batch 1)
└─ Checkpoint: "400 repos loaded, batch 1 started"
Day 1-2: Analysis (Batch 1-8)
├─ Each batch: 50 repos
├─ Sub-agents analyze in parallel
├─ OpenClaw collects results
├─ Updates progress.db
├─ Spawns next batch
└─ Checkpoint after each batch
Day 2: Analysis Complete
├─ 400/400 repos analyzed
├─ OpenClaw synthesizes findings
├─ Identifies: 50 S-tier, 100 A-tier, 150 B-tier, 100 C-tier
└─ Checkpoint: "Analysis complete"
Day 2-3: Deep Analysis (S-tier)
├─ 50 S-tier repos
├─ Each gets 5-8 sub-agents for deep analysis
├─ OpenClaw coordinates teams
├─ Produces 50 deep reports
└─ Checkpoint: "Deep analysis complete"
Day 3-7: Migration (P0)
├─ 50 P0 repos migrated
├─ Sub-agents handle migration tasks
├─ OpenClaw validates each migration
└─ Checkpoint: "P0 migrated"
... (continue for P1, P2, P3)
Week 4: Complete
├─ 400/400 repos migrated
├─ OpenClaw generates final report
└─ System transitions to "guardian mode"
Implementation Checklist
Phase 1: OpenClaw Orchestration
- Create .rd-os/ directory structure
- Implement progress.db schema
- Implement OpenClaw main loop
- Implement sub-agent spawning
- Implement result collection
Phase 2: Sub-Agent Tasks
- Create analyzer task template
- Create migrator task template
- Implement checkpointing in sub-agents
- Implement completion reporting
Phase 3: Recovery
- Implement OpenClaw recovery protocol
- Test restart recovery
- Implement sub-agent respawn
- Test sub-agent failure recovery
Phase 4: Optimization
- Implement concurrency control
- Implement batch processing
- Add rate limiting
- Tune performance
Conclusion
Key Insights:
- OpenClaw is the Brain - Maintains state, makes decisions, coordinates
- Sub-Agents are Hands - Execute tasks, disposable, no long-term memory
- Files are Memory - State in .rd-os/store/, survives everything
- Recovery is Automatic - OpenClaw restarts, reads DB, resumes
- Cost is Low - qwen3.5-plus for everything, a few hundred dollars for the first year
This is how you build a resilient, scalable system with OpenClaw as the orchestrator.
“OpenClaw doesn’t do all the work. OpenClaw organizes the work.”
RD-OS State Persistence & Checkpoint System
Resumable execution, state persistence, progress recovery
“OpenClaw can restart and LLM context can be lost, but project progress must be recoverable”
Core Problem
Challenges
┌─────────────────────────────────────────────────────────────────┐
│ Scale Challenges │
├─────────────────────────────────────────────────────────────────┤
│ │
│ 1. Agent Count: 1000+ agents │
│ - Cannot store all state in LLM context │
│ - Cannot log every action to memory │
│ - Need aggregation + sampling │
│ │
│ 2. Long-Running Tasks: Days to weeks │
│ - OpenClaw may restart │
│ - Network may fail │
│ - API rate limits may hit │
│ - Need checkpoint + resume │
│ │
│ 3. Memory Limits: LLM context is finite │
│ - Cannot accumulate infinite history │
│ - Need summarization + pruning │
│ - Critical state must be external │
│ │
│ 4. Progress Tracking: Need to know "where are we?" │
│ - Which repos analyzed? │
│ - Which repos migrated? │
│ - Which agents active? │
│ - Need persistent progress store │
│ │
└─────────────────────────────────────────────────────────────────┘
Solution Architecture
State Persistence Layers
┌─────────────────────────────────────────────────────────────────┐
│ State Persistence Architecture │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Layer 0: Ephemeral (LLM Context) │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ Current conversation, recent actions, working memory │ │
│ │ ❌ Lost on restart │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ Layer 1: Short-Term (Session State) │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ memory/YYYY-MM-DD.md │ │
│ │ Daily logs, recent events │ │
│ │ ⚠️ Survives restart, but not structured for recovery │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ Layer 2: Medium-Term (Project State) │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ .rd-os/state/ │ │
│ │ - agent-states/ (per-agent checkpoint) │ │
│ │ - progress/ (aggregated progress) │ │
│ │ - checkpoints/ (snapshot at milestones) │ │
│ │ ✅ Structured for recovery │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ Layer 3: Long-Term (Durable Store) │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ .rd-os/store/ │ │
│ │ - progress.db (SQLite: definitive progress) │ │
│ │ - agents.db (SQLite: agent registry) │ │
│ │ - artifacts/ (generated files, reports) │ │
│ │ ✅ Source of truth, survives everything │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────┘
Design Principles
1. External State > LLM Context
❌ Bad: Store progress in conversation history
- Lost on restart
- Consumes context tokens
- Hard to query
✅ Good: Store progress in files/database
- Survives restart
- No context cost
- Easy to query
2. Checkpoint Early, Checkpoint Often
❌ Bad: Checkpoint only at end of batch
- Lose entire batch on failure
✅ Good: Checkpoint after each unit of work
- Lose only current unit
- Fast recovery
3. Aggregation > Individual Tracking
❌ Bad: Track every action of 1000 agents
- Too much data
- Exceeds context limits
✅ Good: Aggregate state
- Per-component summary
- Sampling for details
- On-demand drill-down
4. Idempotent Operations
❌ Bad: "Migrate repo X" (may duplicate if retried)
- Risk of corruption
✅ Good: "Ensure repo X is migrated" (safe to retry)
- Check state first
- Skip if done
- Safe to retry
State Storage Structure
Directory Layout
mono-repo/
└── .rd-os/
├── state/ # Runtime state (can rebuild)
│ ├── agent-states/ # Per-agent checkpoint
│ │ ├── repo-001.state.json
│ │ ├── repo-002.state.json
│ │ └── ...
│ ├── progress/ # Aggregated progress
│ │ ├── analysis-progress.json
│ │ ├── migration-progress.json
│ │ └── daily-summary/
│ │ ├── 2026-02-28.json
│ │ └── ...
│ └── checkpoints/ # Milestone snapshots
│ ├── checkpoint-001-analysis-complete/
│ ├── checkpoint-002-p0-migrated/
│ └── ...
│
└── store/ # Durable store (source of truth)
├── progress.db # SQLite: definitive progress
├── agents.db # SQLite: agent registry
├── artifacts/ # Generated outputs
│ ├── analysis-report.json
│ ├── migration-log.jsonl
│ └── ...
└── config/ # Configuration
├── agents.yaml
├── workflows.yaml
└── policies.yaml
Agent State Checkpoint
Per-Agent State File
// .rd-os/state/agent-states/repo-001.state.json
{
"agent_id": "repo-001-analyzer",
"repo_name": "pingcap/tidb",
"status": "completed",
"created_at": "2026-02-28T10:00:00Z",
"updated_at": "2026-02-28T10:15:00Z",
"work": {
"phase": "analysis",
"subtask": "dependency_mapping",
"progress_percent": 100,
"items_total": 50,
"items_completed": 50,
"items_failed": 0
},
"result": {
"success": true,
"output_path": ".rd-os/store/artifacts/repo-001-analysis.json",
"summary": {
"lines_of_code": 652000,
"dependencies": 127,
"test_coverage": 78.5,
"last_commit": "2026-02-28",
"merge_recommendation": "P0-migrate"
}
},
"checkpoint": {
"last_action": "wrote_dependency_graph",
"last_action_time": "2026-02-28T10:15:00Z",
"can_resume": false,
"resume_point": null
},
"errors": []
}
State Transitions
┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐
│ pending │────▶│ running │────▶│ done │ │ failed │
└─────────┘ └─────────┘ └─────────┘ └─────────┘
│ │
│ ┌─────────┐ │
└────▶│ paused │◀───────────────┘
└─────────┘
State Checkpoint Triggers:
- State transition (pending → running → done)
- Every N items completed (e.g., every 10 repos analyzed)
- Before/after external API calls
- On error (for debugging)
- Periodic heartbeat (every 5 minutes)
Progress Tracking
Aggregated Progress (Batch Level)
// .rd-os/state/progress/analysis-progress.json
{
"phase": "repository_analysis",
"started_at": "2026-02-28T00:00:00Z",
"updated_at": "2026-02-28T16:00:00Z",
"summary": {
"total_repos": 400,
"analyzed": 150,
"in_progress": 50,
"pending": 200,
"failed": 0,
"progress_percent": 37.5
},
"by_priority": {
"P0": { "total": 50, "analyzed": 50, "pending": 0 },
"P1": { "total": 100, "analyzed": 80, "pending": 20 },
"P2": { "total": 150, "analyzed": 20, "pending": 130 },
"P3": { "total": 100, "analyzed": 0, "pending": 100 }
},
"current_batch": {
"batch_id": "batch-003",
"repos": ["repo-101", "repo-102", "..."],
"started_at": "2026-02-28T14:00:00Z",
"estimated_complete": "2026-02-28T18:00:00Z"
},
"rate": {
"repos_per_hour": 25,
"estimated_completion": "2026-03-01T08:00:00Z"
}
}
SQLite Schema (Definitive Store)
-- progress.db schema
-- Repository registry
CREATE TABLE repos (
repo_id TEXT PRIMARY KEY,
name TEXT NOT NULL,
priority TEXT, -- P0, P1, P2, P3
category TEXT, -- product, platform, tool, etc.
created_at TIMESTAMP,
updated_at TIMESTAMP
);
-- Analysis progress
CREATE TABLE analysis_state (
repo_id TEXT PRIMARY KEY,
status TEXT, -- pending, running, done, failed
progress_percent INTEGER,
started_at TIMESTAMP,
completed_at TIMESTAMP,
result_json TEXT,
error_message TEXT,
FOREIGN KEY (repo_id) REFERENCES repos(repo_id)
);
-- Migration progress
CREATE TABLE migration_state (
repo_id TEXT PRIMARY KEY,
status TEXT, -- pending, running, done, failed
phase TEXT, -- prep, transfer, integrate, validate
progress_percent INTEGER,
started_at TIMESTAMP,
completed_at TIMESTAMP,
result_json TEXT,
error_message TEXT,
FOREIGN KEY (repo_id) REFERENCES repos(repo_id)
);
-- Agent registry
CREATE TABLE agents (
agent_id TEXT PRIMARY KEY,
type TEXT, -- analyzer, migrator, guardian, etc.
assigned_repo_id TEXT,
status TEXT, -- active, idle, paused, error
last_heartbeat TIMESTAMP,
FOREIGN KEY (assigned_repo_id) REFERENCES repos(repo_id)
);
-- Checkpoints
CREATE TABLE checkpoints (
checkpoint_id TEXT PRIMARY KEY,
checkpoint_type TEXT, -- batch, milestone, periodic
created_at TIMESTAMP,
state_snapshot TEXT, -- JSON of full state
recoverable BOOLEAN
);
-- Event log (for debugging/audit)
CREATE TABLE events (
event_id TEXT PRIMARY KEY,
timestamp TIMESTAMP,
event_type TEXT,
agent_id TEXT,
repo_id TEXT,
details TEXT
);
-- Indexes for fast queries
CREATE INDEX idx_analysis_status ON analysis_state(status);
CREATE INDEX idx_migration_status ON migration_state(status);
CREATE INDEX idx_agent_status ON agents(status);
CREATE INDEX idx_events_timestamp ON events(timestamp);
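The schema works directly with Python's built-in `sqlite3` module. A minimal sketch with a trimmed-down `analysis_state` table; `:memory:` keeps it self-contained, where real code would open `.rd-os/store/progress.db`:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # real code: ".rd-os/store/progress.db"
conn.execute("""
    CREATE TABLE analysis_state (
        repo_id TEXT PRIMARY KEY,
        status TEXT,
        progress_percent INTEGER
    )
""")
conn.executemany(
    "INSERT INTO analysis_state VALUES (?, ?, ?)",
    [("repo-001", "done", 100),
     ("repo-002", "running", 50),
     ("repo-003", "pending", 0)],
)
# SQLite comparisons evaluate to 0/1, so SUM(status = 'done') counts matches
done, total = conn.execute(
    "SELECT SUM(status = 'done'), COUNT(*) FROM analysis_state"
).fetchone()
```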
Recovery Protocol
Restart Recovery Flow
┌─────────────────────────────────────────────────────────────────┐
│ OpenClaw Restart → Recovery Flow │
└─────────────────────────────────────────────────────────────────┘
OpenClaw Starts
│
▼
┌─────────────────────────────────────────────────────────────┐
│ 1. Load Configuration │
│ ├─ Read .rd-os/config/agents.yaml │
│ ├─ Read .rd-os/config/workflows.yaml │
│ └─ Initialize agent registry │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ 2. Load State from Durable Store │
│ ├─ Query progress.db: what's done? │
│ ├─ Query agents.db: what agents exist? │
│ └─ Build in-memory state │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ 3. Reconcile State │
│ ├─ Compare expected vs actual state │
│ ├─ Find incomplete work │
│ └─ Identify recoverable tasks │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ 4. Resume Incomplete Work │
│ ├─ For each incomplete task: │
│ │ ├─ Check if resumable │
│ │ ├─ Load checkpoint (if exists) │
│ │ └─ Resume from checkpoint │
│ └─ For non-resumable: restart from beginning │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ 5. Resume Agent Execution │
│ ├─ Spawn agents for pending work │
│ ├─ Resume paused agents │
│ └─ Continue normal operation │
└─────────────────────────────────────────────────────────────┘
Recovery Complete
Recovery Example
# Pseudo-code: Recovery logic
async def recover_after_restart():
# Load durable state
db = load_database(".rd-os/store/progress.db")
# Find incomplete analysis
incomplete = db.query("""
SELECT repo_id, progress_percent, checkpoint_id
FROM analysis_state
WHERE status = 'running' OR status = 'pending'
""")
for task in incomplete:
if task.progress_percent > 0:
# Has progress - try to resume
checkpoint = load_checkpoint(task.checkpoint_id)
await resume_analysis(task.repo_id, checkpoint)
else:
# No progress - restart
await start_analysis(task.repo_id)
# Find incomplete migrations
# ... similar logic
# Resume agents
agents = db.query("SELECT * FROM agents WHERE status = 'active'")
for agent in agents:
await resume_agent(agent.agent_id)
log.info(f"Recovery complete: {len(incomplete)} tasks resumed")
Checkpoint Strategy
Checkpoint Types
| Type | Frequency | Content | Use Case |
|---|---|---|---|
| Micro | Every action | Agent state | Crash recovery |
| Batch | Every N items | Batch summary | Batch resume |
| Milestone | Phase complete | Full state snapshot | Phase resume |
| Periodic | Every N minutes | Aggregated progress | Time-based recovery |
Checkpoint Implementation
# Pseudo-code: Checkpoint manager
class CheckpointManager:
def __init__(self, base_path: str):
self.base_path = base_path
self.state_path = f"{base_path}/state"
self.store_path = f"{base_path}/store"
def save_agent_state(self, agent_id: str, state: dict):
"""Save per-agent checkpoint (micro)"""
path = f"{self.state_path}/agent-states/{agent_id}.state.json"
state['checkpoint_time'] = now()
write_json(path, state)
# Also update SQLite
db.execute("""
INSERT OR REPLACE INTO agent_states (agent_id, state_json, updated_at)
VALUES (?, ?, ?)
""", (agent_id, json.dumps(state), now()))
def save_batch_progress(self, batch_id: str, progress: dict):
"""Save batch progress (batch)"""
path = f"{self.state_path}/progress/{batch_id}.json"
write_json(path, progress)
# Update SQLite summary
db.execute("""
UPDATE batch_progress
SET progress_json = ?, updated_at = ?
WHERE batch_id = ?
""", (json.dumps(progress), now(), batch_id))
def save_milestone(self, milestone_name: str):
"""Save full state snapshot (milestone)"""
checkpoint_id = f"checkpoint-{milestone_name}-{timestamp()}"
path = f"{self.state_path}/checkpoints/{checkpoint_id}/"
# Snapshot everything
snapshot = {
'milestone': milestone_name,
'timestamp': now(),
'analysis_state': db.query_all("SELECT * FROM analysis_state"),
'migration_state': db.query_all("SELECT * FROM migration_state"),
'agent_state': db.query_all("SELECT * FROM agents"),
'progress_summary': self.calculate_progress_summary()
}
write_json(f"{path}/snapshot.json", snapshot)
# Record in SQLite
db.execute("""
INSERT INTO checkpoints (checkpoint_id, checkpoint_type, created_at, state_snapshot, recoverable)
VALUES (?, ?, ?, ?, ?)
""", (checkpoint_id, 'milestone', now(), json.dumps(snapshot), True))
return checkpoint_id
def load_checkpoint(self, checkpoint_id: str) -> dict:
"""Load checkpoint for recovery"""
path = f"{self.state_path}/checkpoints/{checkpoint_id}/snapshot.json"
return read_json(path)
def get_recovery_state(self) -> dict:
"""Get current state for recovery"""
return {
'analysis': db.query_all("SELECT * FROM analysis_state WHERE status != 'done'"),
'migration': db.query_all("SELECT * FROM migration_state WHERE status != 'done'"),
'agents': db.query_all("SELECT * FROM agents WHERE status != 'idle'"),
'latest_checkpoint': db.query_one("SELECT * FROM checkpoints ORDER BY created_at DESC LIMIT 1")
}
Progress Aggregation (Avoiding Context Explosion)
Hierarchical Aggregation
Level 0: Individual Agent (1000+ agents)
├─ repo-001-analyzer: done
├─ repo-002-analyzer: running (50%)
├─ repo-003-analyzer: pending
└─ ... (1000+ entries - too many for context)
│
▼ Aggregate (every 10 agents)
Level 1: Batch Summary (100 batches)
├─ batch-001: 10/10 done
├─ batch-002: 8/10 done, 2 running
├─ batch-003: 0/10 done, 10 pending
└─ ... (100 entries - still too many)
│
▼ Aggregate (by priority)
Level 2: Priority Summary (4 priorities)
├─ P0: 50/50 done (100%)
├─ P1: 80/100 done (80%)
├─ P2: 20/150 done (13%)
└─ P3: 0/100 done (0%)
│
▼ Aggregate (overall)
Level 3: Overall Summary (fits in context)
└─ Total: 150/400 done (37.5%)
- 50 in progress
- 200 pending
- 0 failed
Context-Friendly Progress Report
// What goes into LLM context (small, actionable)
{
"phase": "repository_analysis",
"overall": {
"total": 400,
"done": 150,
"in_progress": 50,
"pending": 200,
"failed": 0,
"percent": 37.5
},
"by_priority": {
"P0": "100% done ✅",
"P1": "80% done 🏃",
"P2": "13% done 🏃",
"P3": "0% done ⏳"
},
"current_focus": "P1 batch-009 (8/10 done)",
"next_up": "P1 batch-010 (10 repos)",
"eta": "2026-03-01T08:00:00Z",
"issues": [],
"last_checkpoint": "checkpoint-batch-008-20260228-1400"
}
Key: Detailed state in SQLite, summary in context.
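A sketch of the aggregation step that turns per-repo rows into the context-sized summary above. Status names are those of the SQLite schema; `summarize` is our own name:

```python
from collections import Counter

def summarize(rows):
    """rows: iterable of (repo_id, status) pairs from analysis_state.
    Returns the small dict that goes into LLM context."""
    counts = Counter(status for _, status in rows)
    total = sum(counts.values())
    done = counts.get("done", 0)
    return {
        "total": total,
        "done": done,
        "in_progress": counts.get("running", 0),
        "pending": counts.get("pending", 0),
        "failed": counts.get("failed", 0),
        "percent": round(100.0 * done / total, 1) if total else 0.0,
    }
```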
Idempotent Operations
Pattern: “Ensure” Instead of “Do”
# ❌ Bad: Not idempotent
async def migrate_repo(repo_id: str):
"""Migrate repo - may duplicate if retried"""
transfer_code(repo_id)
update_build_config(repo_id)
mark_migrated(repo_id)
# If fails after transfer, retry duplicates!
# ✅ Good: Idempotent
async def ensure_repo_migrated(repo_id: str):
"""Ensure repo is migrated - safe to retry"""
# Check current state
state = get_migration_state(repo_id)
if state == 'done':
log.info(f"{repo_id} already migrated, skipping")
return
if state == 'transfer_complete':
log.info(f"{repo_id} transfer done, resuming config update")
update_build_config(repo_id)
mark_migrated(repo_id)
return
# Start from beginning
transfer_code(repo_id)
update_build_config(repo_id)
mark_migrated(repo_id)
State Machine for Migration
┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐
│ pending │────▶│ prep │────▶│ transfer│────▶│integrate│────▶│ done │
└─────────┘ └─────────┘ └─────────┘ └─────────┘ └─────────┘
│ │ │
▼ ▼ ▼
[prep_done] [transfer_done] [integrate_done]
Each state transition is checkpointed.
Retry from last completed state.
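The retry-from-last-state rule reduces to a lookup over the ordered phases. The phase names come from the diagram; the helper name is ours:

```python
# Ordered phases, matching the migration state machine above.
PHASES = ["pending", "prep", "transfer", "integrate", "done"]

def next_phase(last_completed: str):
    """Phase to run next given the last checkpointed state; None once done."""
    i = PHASES.index(last_completed)
    return PHASES[i + 1] if i + 1 < len(PHASES) else None
```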
Monitoring & Observability
Progress Dashboard (Query SQLite)
-- Overall progress
SELECT
COUNT(*) as total,
SUM(CASE WHEN status = 'done' THEN 1 ELSE 0 END) as done,
SUM(CASE WHEN status = 'running' THEN 1 ELSE 0 END) as running,
SUM(CASE WHEN status = 'pending' THEN 1 ELSE 0 END) as pending,
ROUND(100.0 * SUM(CASE WHEN status = 'done' THEN 1 ELSE 0 END) / COUNT(*), 2) as percent
FROM analysis_state;
-- Progress by priority
SELECT
r.priority,
COUNT(*) as total,
SUM(CASE WHEN a.status = 'done' THEN 1 ELSE 0 END) as done,
ROUND(100.0 * SUM(CASE WHEN a.status = 'done' THEN 1 ELSE 0 END) / COUNT(*), 2) as percent
FROM repos r
JOIN analysis_state a ON r.repo_id = a.repo_id
GROUP BY r.priority;
-- Agent health
SELECT
status,
COUNT(*) as count,
MAX(last_heartbeat) as last_activity
FROM agents
GROUP BY status;
-- Recent failures
SELECT
repo_id,
error_message,
updated_at
FROM analysis_state
WHERE status = 'failed'
ORDER BY updated_at DESC
LIMIT 10;
Alerting
# .rd-os/config/alerts.yaml
alerts:
- name: high_failure_rate
condition: "failed_count / total_count > 0.05"
severity: warning
action: notify_human
- name: stalled_progress
condition: "no_progress_for_minutes > 60"
severity: warning
action: notify_human
- name: agent_down
condition: "agent_heartbeat_age_minutes > 10"
severity: critical
action: notify_human + restart_agent
- name: checkpoint_age
condition: "last_checkpoint_age_minutes > 30"
severity: warning
action: force_checkpoint
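One lightweight way to evaluate conditions like those above is to treat them as Python expressions over a metrics dict. A sketch only: a production system should parse rules rather than `eval()` them, and the metric names are taken from the YAML above:

```python
def check_alerts(alerts, metrics):
    """Return the names of alerts whose condition holds for the metrics."""
    fired = []
    for alert in alerts:
        # eval over a metrics-only namespace; tolerable for trusted,
        # checked-in rules, but a real deployment should use a rule engine.
        if eval(alert["condition"], {"__builtins__": {}}, dict(metrics)):
            fired.append(alert["name"])
    return fired

ALERTS = [
    {"name": "high_failure_rate",
     "condition": "failed_count / total_count > 0.05"},
    {"name": "stalled_progress",
     "condition": "no_progress_for_minutes > 60"},
]
```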
Implementation Checklist
Phase 1: Basic Persistence
- Create .rd-os/state/ and .rd-os/store/ directories
- Implement JSON state file writer
- Implement per-agent checkpoint
- Implement progress.db SQLite schema
- Add checkpoint triggers (per-action, per-batch)
Phase 2: Recovery
- Implement recovery protocol
- Test restart recovery (simulate crash)
- Implement idempotent operations
- Add state reconciliation logic
Phase 3: Aggregation
- Implement hierarchical aggregation
- Create context-friendly progress summaries
- Add drill-down queries (on-demand details)
Phase 4: Monitoring
- Create progress dashboard (CLI or web)
- Implement alerting rules
- Add checkpoint management (list, restore, prune)
Example: Recovery After OpenClaw Restart
Scenario: OpenClaw restarts during repo analysis (150/400 done)
1. OpenClaw starts
└─> RD-OS initialization
2. Load .rd-os/store/progress.db
└─> Query: What's the state?
└─> Result: 150 done, 50 running, 200 pending
3. Reconcile running tasks
└─> For each "running" task:
├─> Load agent state from .rd-os/state/agent-states/
├─> Check if resumable
└─> Resume or restart
4. Resume agents
└─> Spawn 50 agents for running tasks
└─> Spawn agents for pending tasks (up to concurrency limit)
5. Continue normal operation
└─> Analysis continues from 150/400 (37.5%)
└─> No work lost, no duplication
Total recovery time: <1 minute
Work lost: 0 (if micro-checkpointing) or <1 batch (if batch-checkpointing)
Conclusion
Key Principles:
- External State - Never rely on LLM context for progress
- Frequent Checkpoints - Checkpoint every unit of work
- Idempotent Operations - Safe to retry anything
- Hierarchical Aggregation - Summary in context, details in DB
- Recovery Protocol - Automated recovery on restart
Result:
- OpenClaw can restart anytime
- LLM context can be lost
- Progress is never lost
- Work resumes automatically
- No manual intervention needed
This is how you build a system that runs for weeks with 1000+ agents.
“The system must be resilient to failure, because at scale, failure is inevitable.”
RD-OS Dynamic Agent Scheduling
Dynamic resource allocation, deep analysis, intelligent scheduling
“Don't split resources evenly — schedule intelligently: valuable repos get more Agents for deeper study”
Core Problem
The Problem with Static Assignment
❌ Static Assignment (the traditional approach)
├─ 400 repos, 100 agents → 0.25 agent per repo
├─ Equal time budget: ~10 minutes of analysis per repo
├─ Problems:
│ ├─ A core repo (tidb) and an abandoned tool get identical treatment
│ ├─ No way to add resources when a repo proves valuable
│ ├─ No way to cut losses when a repo proves worthless
│ └─ No way to adjust strategy based on findings
└─ Result: wasted resources, insufficient depth
Advantages of Dynamic Scheduling
✅ Dynamic Scheduling (RD-OS)
├─ Initial scan: quick pass over every repo (~2 min/repo)
├─ Value assessment: score each repo against metrics
├─ Dynamic allocation:
│ ├─ High-value repos → 5-10 agents for deep analysis
│ ├─ Medium-value repos → 1-2 agents for standard analysis
│ └─ Low-value repos → 0.5 agent for quick archival
├─ Continuous adjustment:
│ ├─ New issue found → add Agents
│ ├─ Repo found worthless → scale down / stop analysis
│ └─ Dependency discovered → coordinate analysis across repos
└─ Result: focused resources, sufficient depth, high efficiency
Value Scoring System
Repo Value Assessment Metrics
# Repo value scoring model
class RepoValueScorer:
"""
Score a repo's value to decide how many Agent resources it receives
"""
def calculate_score(self, repo: Repo) -> float:
score = 0.0
# 1. Activity (0-25 points)
score += self._activity_score(repo)
# - Recent commit frequency
# - Number of active contributors
# - Recent PR/Issue activity
# 2. Impact (0-25 points)
score += self._impact_score(repo)
# - References from other repos
# - Stars/Forks
# - Number of deployed instances
# 3. Strategic importance (0-25 points)
score += self._strategic_score(repo)
# - Core product? (tidb = 25 points)
# - Platform component?
# - Critical dependency?
# 4. Code quality (0-15 points)
score += self._quality_score(repo)
# - Test coverage
# - Documentation completeness
# - Code style compliance
# 5. Migration feasibility (0-10 points)
score += self._feasibility_score(repo)
# - Dependency complexity
# - Team support
# - Tech-stack fit
return score # 0-100
Scoring Examples
| Repo | Activity | Impact | Strategic | Quality | Feasibility | Total | Tier |
|---|---|---|---|---|---|---|---|
| tidb | 25 | 25 | 25 | 12 | 8 | 95 | S |
| tiflow | 20 | 18 | 20 | 10 | 7 | 75 | A |
| tidb-operator | 18 | 15 | 18 | 11 | 8 | 70 | A |
| ossinsight | 15 | 20 | 10 | 12 | 9 | 66 | B |
| abandoned tool A | 2 | 1 | 2 | 5 | 8 | 18 | D |
| abandoned tool B | 0 | 0 | 0 | 3 | 9 | 12 | D |
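The tier cutoffs implied by these scores (and used by the allocation code in this document: 85/70/50) fit in a small helper; the function name is ours:

```python
def tier(score: float) -> str:
    """Map a 0-100 value score to a tier, using this document's cutoffs."""
    if score >= 85:
        return "S"
    if score >= 70:
        return "A"
    if score >= 50:
        return "B"
    return "C/D"
```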
Agent Allocation Strategy
Three Tiers of Analysis Depth
┌─────────────────────────────────────────────────────────────────┐
│ Three-Tier Analysis Depth │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Level 1: Deep Analysis (S/A-tier repos) │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ Agents: 5-10 per repo │ │
│ │ Time: 2-4 hours per repo │ │
│ │ Scope: │ │
│ │ - Full code analysis │ │
│ │ - Dependency graph (detailed) │ │
│ │ - Test coverage analysis │ │
│ │ - Performance profiling │ │
│ │ - Security audit │ │
│ │ - Tech debt assessment │ │
│ │ - Migration complexity analysis │ │
│ │ Output: 50-100 page report │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │
│ Level 2: Standard Analysis (B-tier repos) │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ Agents: 1-2 per repo │ │
│ │ Time: 30-60 minutes per repo │ │
│ │ Scope: │ │
│ │ - Code structure overview │ │
│ │ - Dependency list │ │
│ │ - Basic quality metrics │ │
│ │ - Migration recommendation │ │
│ │ Output: 10-20 page report │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │
│ Level 3: Quick Scan (C/D-tier repos) │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ Agents: 0.5 per repo (1 agent handles 2-3 repos) │ │
│ │ Time: 10-15 minutes per repo │ │
│ │ Scope: │ │
│ │ - Basic metadata │ │
│ │ - Last activity check │ │
│ │ - Archive recommendation │ │
│ │ Output: 1-2 page summary │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────┘
Agent Allocation Algorithm
from typing import Dict, List

class DynamicAgentScheduler:
    """Dynamically allocate agent resources across repos."""

    def __init__(self, total_agents: int = 1000):
        self.total_agents = total_agents
        self.available_agents = total_agents
        self.assignments: Dict[str, int] = {}

    def allocate(self, repos: List[Repo]) -> Dict[str, int]:
        """Assign agent counts to repos in proportion to repo value."""
        # 1. Score every repo (scorer is the repo value scorer defined earlier)
        scored_repos = [(repo, scorer.calculate_score(repo)) for repo in repos]

        # 2. Tier by score
        s_tier = [r for r, s in scored_repos if s >= 85]       # S tier
        a_tier = [r for r, s in scored_repos if 70 <= s < 85]  # A tier
        b_tier = [r for r, s in scored_repos if 50 <= s < 70]  # B tier
        c_tier = [r for r, s in scored_repos if s < 50]        # C/D tier

        # 3. Allocate agents per tier
        allocation = {}
        for repo in s_tier:   # S tier: 8 agents per repo
            allocation[repo.id] = 8
        for repo in a_tier:   # A tier: 4 agents per repo
            allocation[repo.id] = 4
        for repo in b_tier:   # B tier: 2 agents per repo
            allocation[repo.id] = 2
        # C/D tier: 1 agent shared across every 3 repos
        for i, repo in enumerate(c_tier):
            allocation[repo.id] = 1 if i % 3 == 0 else 0  # shared agent

        # 4. If demand exceeds the pool, scale down S/A-tier allocations
        total_needed = sum(allocation.values())
        if total_needed > self.available_agents:
            allocation = self._scale_down(allocation, self.available_agents)
        return allocation

    def reallocate(self, new_info: Dict[str, float]):
        """Reallocate in response to new information (dynamic adjustment).

        Example: a repo turns out to be more important than expected, so
        increase its agent count by pulling agents from lower-priority repos.
        """
        pass
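The degradation path in step 4 is only named (`_scale_down`); here is a minimal sketch of one possible policy, proportional scaling with a floor of one agent per active repo. The function name and the trimming policy are assumptions for illustration, not part of the design above.

```python
from typing import Dict

def scale_down(allocation: Dict[str, int], budget: int) -> Dict[str, int]:
    """Proportionally reduce per-repo agent counts until the total fits the
    budget. Every repo that had at least one agent keeps at least one; any
    remaining overshoot is trimmed from the largest (S/A-tier) allocations."""
    total = sum(allocation.values())
    if total <= budget:
        return dict(allocation)
    scaled = {}
    for repo_id, agents in allocation.items():
        if agents == 0:
            scaled[repo_id] = 0  # shared C/D-tier slots stay at zero
        else:
            # Round down proportionally, but never below 1.
            scaled[repo_id] = max(1, agents * budget // total)
    # Trim any remaining overshoot from the largest allocations.
    while sum(scaled.values()) > budget:
        biggest = max(scaled, key=scaled.get)
        if scaled[biggest] <= 1:
            break  # cannot shrink further without dropping repos entirely
        scaled[biggest] -= 1
    return scaled
```

For example, `scale_down({'a': 8, 'b': 8, 'c': 4, 'd': 2}, 12)` brings a demand of 22 agents under a budget of 12 while keeping every repo staffed.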
Dynamic Reallocation Triggers
When to trigger reallocation
┌─────────────────────────────────────────────────────────────────┐
│ Dynamic Reallocation Triggers                                   │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│ 1. Value Discovery                                              │
│    ├─ Trigger: initial analysis finds a repo more valuable      │
│    │           than expected                                    │
│    ├─ Action: increase agents (1 → 5)                           │
│    └─ Example: an "abandoned tool" turns out to be depended     │
│                on by 50 services                                │
│                                                                 │
│ 2. Dependency Discovery                                         │
│    ├─ Trigger: a repo turns out to be a critical dependency     │
│    ├─ Action: add agents, coordinate dependency-chain analysis  │
│    └─ Example: tidb is found to depend on a "small utility"     │
│                                                                 │
│ 3. Issue Detection                                              │
│    ├─ Trigger: a severe problem is found (security flaw,        │
│    │           architectural defect)                            │
│    ├─ Action: add dedicated agents for deeper investigation     │
│    └─ Example: a vulnerability is found; assign a security      │
│                specialist agent                                 │
│                                                                 │
│ 4. Blocker Resolution                                           │
│    ├─ Trigger: a repo's analysis is blocked on external input   │
│    ├─ Action: temporarily reduce agents, redeploy elsewhere     │
│    └─ Example: while waiting for team confirmation, analyze     │
│                other repos first                                │
│                                                                 │
│ 5. Milestone Completion                                         │
│    ├─ Trigger: a batch of repo analyses completes               │
│    ├─ Action: release agents, assign them to the next batch     │
│    └─ Example: P0 done; agents move to P1                       │
│                                                                 │
│ 6. Human Intervention                                           │
│    ├─ Trigger: a human requests priority analysis of a repo     │
│    ├─ Action: redeploy agents immediately                       │
│    └─ Example: the CTO says "analyze this one first"            │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
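A trigger such as Value Discovery (case 1) amounts to rescoring a repo and re-deriving its tier. A sketch, reusing the score bands and per-tier agent counts from the allocation algorithm above; the function and constant names are illustrative, not specified by the design.

```python
from typing import Dict

# Per-tier agent counts from the allocation scheme above
# (C/D-tier repos share agents, so their individual grant is 0 here).
TIER_AGENTS = {'S': 8, 'A': 4, 'B': 2, 'C': 0}

def tier_for(score: int) -> str:
    """Map a 0-100 value score to a tier using the bands above."""
    if score >= 85:
        return 'S'
    if score >= 70:
        return 'A'
    if score >= 50:
        return 'B'
    return 'C'

def on_value_discovery(allocation: Dict[str, int], repo_id: str,
                       new_score: int) -> Dict[str, int]:
    """Value-discovery trigger: rescore one repo and bump its allocation.

    Mirrors the "old-tool" example below: a C-tier repo rescored to 78
    is promoted to A tier and granted 4 agents."""
    updated = dict(allocation)
    updated[repo_id] = TIER_AGENTS[tier_for(new_score)]
    return updated
```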
Deep Analysis Workflow
Deep analysis workflow for an S-tier repo
┌─────────────────────────────────────────────────────────────────┐
│ Deep Analysis Workflow (S-Tier Repo) │
│ Example: pingcap/tidb │
└─────────────────────────────────────────────────────────────────┘
Repo: tidb (Score: 95, S-Tier)
Agents Assigned: 8
Estimated Time: 4 hours
┌─────────────────────────────────────────────────────────────┐
│ Agent Team Structure │
├─────────────────────────────────────────────────────────────┤
│ │
│ lead-analyst (1) │
│ ├─ Coordinates the team │
│ ├─ Synthesizes findings │
│ └─ Produces final report │
│ │
│ code-archaeologist (2) │
│ ├─ Maps code structure │
│ ├─ Identifies key components │
│ └─ Documents architecture │
│ │
│ dependency-analyst (1) │
│ ├─ Maps internal dependencies │
│ ├─ Maps external dependencies │
│ └─ Identifies circular deps │
│ │
│ quality-auditor (1) │
│ ├─ Analyzes test coverage │
│ ├─ Runs static analysis │
│ └─ Identifies tech debt │
│ │
│ security-analyst (1) │
│ ├─ Scans for vulnerabilities │
│ ├─ Reviews auth/security code │
│ └─ Checks compliance │
│ │
│ migration-planner (1) │
│ ├─ Assesses migration complexity │
│ ├─ Identifies risks │
│ └─ Creates migration plan │
│ │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ Analysis Phases │
├─────────────────────────────────────────────────────────────┤
│ │
│ Phase 1: Reconnaissance (30 min) │
│ ├─ Quick scan of repo structure │
│ ├─ Identify key directories │
│ └─ Create initial dependency graph │
│ │
│ Phase 2: Deep Dive (2 hours) │
│ ├─ Each agent analyzes their specialty │
│ ├─ Continuous checkpointing │
│ └─ Cross-agent communication │
│ │
│ Phase 3: Synthesis (1 hour) │
│ ├─ Lead analyst synthesizes findings │
│ ├─ Identifies cross-cutting concerns │
│ └─ Creates unified report │
│ │
│ Phase 4: Review (30 min) │
│ ├─ Quality check │
│ ├─ Validate findings │
│ └─ Submit report + recommendations │
│ │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ Output: Deep Analysis Report │
├─────────────────────────────────────────────────────────────┤
│ │
│ 1. Executive Summary (1 page) │
│ - Value score, recommendation │
│ - Key findings │
│ - Migration priority │
│ │
│ 2. Architecture Overview (5 pages) │
│ - Component diagram │
│ - Data flow │
│ - Key modules │
│ │
│ 3. Dependency Analysis (10 pages) │
│ - Internal dependency graph │
│ - External dependencies │
│ - Circular dependencies │
│ │
│ 4. Quality Assessment (5 pages) │
│ - Test coverage │
│ - Code quality metrics │
│ - Tech debt inventory │
│ │
│ 5. Security Audit (5 pages) │
│ - Vulnerability scan results │
│ - Security best practices │
│ - Compliance status │
│ │
│ 6. Migration Plan (10 pages) │
│ - Migration strategy │
│ - Risk assessment │
│ - Effort estimation │
│ - Recommended order │
│ │
│ Total: ~36 pages │
│ │
└─────────────────────────────────────────────────────────────┘
Agent Coordination Protocol
Multiple agents collaborating to analyze the same repo
# Pseudo-code: Multi-agent coordination
class DeepAnalysisTeam:
    """Multi-agent collaborative deep analysis of a single repo."""

    def __init__(self, repo: Repo, agents: List[Agent]):
        self.repo = repo
        self.agents = agents
        self.shared_context = SharedContext()
        self.findings = []

    async def coordinate(self):
        # 1. Initialize the shared context
        self.shared_context.set('repo', self.repo)
        self.shared_context.set('phase', 'reconnaissance')

        # 2. Analyze in parallel (each agent owns a different aspect)
        tasks = [
            self.agents[0].analyze_architecture(self.shared_context),
            self.agents[1].analyze_dependencies(self.shared_context),
            self.agents[2].analyze_quality(self.shared_context),
            self.agents[3].analyze_security(self.shared_context),
            # ...
        ]

        # 3. Sync periodically (every 15 minutes)
        sync_task = asyncio.create_task(self.periodic_sync())

        # 4. Wait for all analyses to finish, then stop the sync loop
        results = await asyncio.gather(*tasks)
        sync_task.cancel()

        # 5. Synthesize findings
        await self.synthesize(results)

        # 6. Generate the report
        report = await self.generate_report()
        return report

    async def periodic_sync(self):
        """Periodic sync to avoid duplicated work."""
        while not self.is_complete():
            await asyncio.sleep(900)  # 15 minutes
            # Share findings
            for agent in self.agents:
                new_findings = agent.get_new_findings()
                self.shared_context.append('findings', new_findings)
                # Notify the other agents that care about these findings
                for other_agent in self.agents:
                    if other_agent.should_know(new_findings):
                        other_agent.notify(new_findings)
            # Check whether reallocation is needed
            if self.needs_reallocation():
                await self.reallocate()
Real-World Example
Scenario: discovering a hidden gem
Initial state:
├─ Repo: "old-tool" (looks like an abandoned utility)
├─ Initial Score: 25 (C tier)
├─ Agent Allocation: 0.5 (quick scan)
└─ Expected: done in 15 minutes, likely archived
Quick scan discovers:
├─ Depended on by 50 internal services
├─ Handles critical data transformations
├─ No replacement exists
└─ The team says "it matters, but nobody has time to maintain it"
Re-evaluation triggered:
├─ New Score: 78 (A tier) ⬆️
├─ New Agent Allocation: 4 agents ⬆️
└─ New Depth: Standard Analysis ⬆️
Deep analysis results:
├─ Found 3 severe bugs
├─ Found 5 performance-optimization opportunities
├─ Created a modernization plan
└─ Recommendation: keep + refactor (not archive)
Impact:
├─ Avoided archiving a critical tool
├─ Prevented outages in 50 services
├─ Improved performance by 40%
└─ Value: far exceeds the analysis cost
Resource Optimization
Agent utilization monitoring
class AgentUtilizationMonitor:
    """Monitor agent utilization and optimize allocation."""

    def monitor(self):
        metrics = {
            'total_agents': 1000,
            'active': 850,
            'idle': 100,
            'blocked': 50,
            'utilization_rate': 0.85,  # 85%
            'avg_task_duration': '45min',
            'tasks_completed_today': 342,
            'by_tier': {
                'S-tier': {'agents': 80, 'repos': 10, 'utilization': 0.95},
                'A-tier': {'agents': 200, 'repos': 50, 'utilization': 0.88},
                'B-tier': {'agents': 300, 'repos': 150, 'utilization': 0.82},
                'C-tier': {'agents': 100, 'repos': 190, 'utilization': 0.75},
            }
        }
        # Alert: utilization too low
        if metrics['utilization_rate'] < 0.60:
            alert("Low agent utilization - consider increasing batch size")
        # Alert: too many blocked agents
        if metrics['blocked'] > 100:
            alert("Many agents blocked - investigate blockers")
        # Suggestion: reallocate
        if metrics['by_tier']['C-tier']['utilization'] < 0.50:
            suggest("Reallocate C-tier agents to B-tier")
        return metrics
Human Override
Human intervention interface
# .rd-os/config/human-override.yaml
# Humans can override the AI's allocation decisions

overrides:
  # Analyze these repos first
  priority_repos:
    - repo: pingcap/tidb
      reason: "CTO request - strategic importance"
      agents: 10          # overrides the AI's suggested 8
      deadline: 2026-03-01
    - repo: pingcap/new-feature
      reason: "Urgent customer request"
      agents: 5
      deadline: 2026-02-28
  # Skip these repos
  skip_repos:
    - repo: pingcap/old-experiment
      reason: "Confirmed obsolete by team"
      action: archive
  # Adjust analysis depth
  depth_overrides:
    - repo: pingcap/ossinsight
      depth: deep         # overrides the AI's suggested standard
      reason: "May become core product"
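Merging these overrides into the AI's plan can be as simple as a dictionary merge in which human entries always win. A sketch under that assumption; the override structure mirrors the YAML above but is shown as an already-parsed dict so the example needs no YAML library, and `apply_overrides` is a hypothetical helper, not a specified API.

```python
# Overrides as they would look after parsing human-override.yaml
# (literal dict here; in practice load the file with a YAML parser).
overrides = {
    'priority_repos': [
        {'repo': 'pingcap/tidb', 'agents': 10},
    ],
    'skip_repos': [
        {'repo': 'pingcap/old-experiment'},
    ],
}

def apply_overrides(allocation, overrides):
    """Human overrides win over AI decisions: forced agent counts are
    applied verbatim, and skipped repos are removed from the plan."""
    merged = dict(allocation)
    for entry in overrides.get('priority_repos', []):
        merged[entry['repo']] = entry['agents']
    for entry in overrides.get('skip_repos', []):
        merged.pop(entry['repo'], None)
    return merged

merged = apply_overrides({'pingcap/tidb': 8, 'pingcap/old-experiment': 1},
                         overrides)
```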
Metrics & KPIs
Evaluating scheduling effectiveness
| Metric | Target | Measurement |
|---|---|---|
| Agent Utilization | >80% | Active agents / Total agents |
| Value Discovery Rate | >10% | Repos upgraded after initial scan |
| Reallocation Efficiency | <5 min | Time to reallocate agents |
| Deep Analysis ROI | >5x | Value found / Analysis cost |
| Human Satisfaction | >90% | Human approval of allocations |
| Completion Rate | >95% | Repos analyzed / Total repos |
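The ratio-based rows in the table translate directly into code; a small sketch of how the "Measurement" column could be computed from raw counts (the function name and input parameters are illustrative assumptions).

```python
def compute_kpis(total_agents: int, active_agents: int,
                 repos_total: int, repos_done: int,
                 repos_upgraded: int) -> dict:
    """Compute the ratio-style scheduling KPIs from the table above.
    Inputs are raw counts; outputs are ratios in [0, 1]."""
    return {
        # Target >0.80: active agents / total agents
        'agent_utilization': active_agents / total_agents,
        # Target >0.95: repos analyzed / total repos
        'completion_rate': repos_done / repos_total,
        # Target >0.10: repos upgraded after the initial scan
        'value_discovery_rate': repos_upgraded / repos_total,
    }
```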
Implementation Checklist
Phase 1: Basic Scoring
- Implement repo value scorer
- Define scoring criteria
- Test on 10-repo sample
- Tune scoring weights
Phase 2: Dynamic Allocation
- Implement allocation algorithm
- Create agent pool manager
- Add reallocation triggers
- Test dynamic scaling
Phase 3: Coordination
- Implement multi-agent coordination
- Create shared context system
- Add periodic sync mechanism
- Test team analysis
Phase 4: Optimization
- Implement utilization monitoring
- Add human override interface
- Create optimization recommendations
- Continuous tuning
Conclusion
Dynamic Scheduling vs Static Allocation:
| Aspect | Static | Dynamic |
|---|---|---|
| Agent Distribution | Equal | Based on value |
| Response to Discovery | None | Immediate reallocation |
| Resource Efficiency | 50-60% | 80-90% |
| Depth for Critical Repos | Same as others | 5-10x deeper |
| Adaptability | None | High |
Result:
- High-value repos get deep analysis
- Low-value repos get quick disposition
- Resources flow to where they matter most
- System learns and adapts over time
This is how you analyze 400 repos intelligently, not uniformly.
“Not all repos are created equal. Treat them accordingly.”
Scope Definition: TiDB Cloud DBaaS Mono-Repo
Project scope definition
Date: 2026-03-01
Version: 1.0
Status: Scope Finalized
Executive Summary
Deliverable: a complete mono-repo for the TiDB Cloud DBaaS platform
Scope boundaries:
- ✅ Included: the full TiDB Cloud chain, from cloud infrastructure through deployment, control plane, monitoring, O11y, and delivery
- ✅ Included: the TiDB/TiKV/PD/TiFlash core databases (cloud-related)
- ❌ Excluded: PingCAP projects unrelated to TiDB Cloud
Core principle: center on the TiDB Cloud product, not on the PingCAP organization.
In Scope
1. Core Database
✅ TiDB
- SQL layer (compute)
- Optimizer
- Executor
- Storage engine interface (KV interface)
✅ TiKV
- Distributed KV storage
- Raft consensus
- Transaction processing
✅ PD (Placement Driver)
- Cluster management
- Scheduling
- Metadata management
✅ TiFlash
- Columnar storage
- Real-time analytics
✅ Ecosystem components
- TiCDC (change data capture)
- TiDB-Binlog
- DM (data migration)
Rationale: these are TiDB Cloud's core deliverables; they must be in the mono-repo to enable end-to-end optimization.
2. Cloud Platform Infrastructure
✅ Cloud resource management
- Multi-cloud abstraction layer (AWS/GCP/Azure/Alibaba Cloud)
- Compute resource management (EC2/GCE/VM)
- Storage resource management (EBS/GCS/S3)
- Network resource management (VPC/Security Group)
✅ Cluster deployment
- TiDB Operator (Kubernetes)
- Automated deployment tooling
- Configuration management
- Version management
✅ Control plane services
- Cluster lifecycle management
- Instance management
- Backup & restore
- Scaling
- Upgrade management
✅ Monitoring & Observability (O11y)
- Metrics collection
- Log aggregation
- Tracing
- Alerting
- Dashboards (Grafana / in-house)
✅ Delivery & Operations
- CI/CD pipelines
- Automated testing
- Release management
- Operations tooling
- Incident-response tooling
Rationale: these are the core competitive strengths of the TiDB Cloud DBaaS; they must be in the mono-repo to enable end-to-end automation.
3. Cloud-Native Features
✅ Elastic scaling
- Auto-scaling
- Resource-scheduling optimization
- Cost optimization
✅ High availability
- Multi-AZ deployment
- Cross-region replication
- Failover
✅ Security & compliance
- Authentication (IAM integration)
- Access control (RBAC)
- Data encryption
- Audit logging
- Compliance certifications (SOC2, GDPR, etc.)
✅ Multi-tenancy
- Resource isolation
- Quota management
- Billing & metering
Rationale: these are the differentiating features of a DBaaS product and require coordinated optimization across components.
4. Developer Tools
✅ SDKs & clients
- TiDB Vector SDK (Python/Go/Java)
- Drivers (MySQL protocol)
- ORM integrations
✅ Management tools
- CLI tools
- Web Console
- API Gateway
✅ Migration tools
- Data migration (DM)
- Schema migration
- Incremental sync
Rationale: these are key to the user experience and need to be optimized together with the backend.
Out of Scope
1. PingCAP projects unrelated to TiDB Cloud
❌ OSS Insight
- Reason: an independent OSS analytics platform, not a TiDB Cloud core feature
- Handling: keep as an independent repo
❌ AutoFlow / Graph RAG
- Reason: an experimental AI project, not a TiDB Cloud core feature
- Handling: keep as an independent repo
❌ Purely internal tools (unrelated to TiDB Cloud)
- Reason: do not serve TiDB Cloud customers
- Handling: decide after evaluation (may be archived)
❌ Marketing / website / docs (non-technical documentation)
- Reason: not engineering code
- Handling: keep as independent systems
2. Third-Party Forks (decide after evaluation)
⚠️ Tantivy (search)
- Evaluation: keep if TiDB Cloud depends on it heavily; otherwise use upstream
- Decision: pending evaluation
⚠️ Sarama (Kafka client)
- Evaluation: keep if TiCDC depends on it heavily; otherwise use upstream
- Decision: pending evaluation
⚠️ Other forks
- Evaluation: do they contain TiDB Cloud-specific modifications?
- Decision: modified → keep; unmodified → use upstream
Principle: keep only forks that carry TiDB Cloud-specific modifications; pure forks return to upstream.
3. Deprecated / Low-Maintenance Projects
❌ No active maintenance for over 1 year
❌ No production usage
❌ Functionality superseded by other projects
Handling: archive or delete; do not bring into the mono-repo
Repo Classification & Priority
P0: Core Products
Must migrate in the first batch
| Repo | Description | Priority | Est. Size |
|---|---|---|---|
| tidb | TiDB database core | P0 | ~650 MB |
| tikv | TiKV distributed storage | P0 | ~500 MB |
| pd | Placement Driver | P0 | ~100 MB |
| tiflash | TiFlash columnar storage | P0 | ~300 MB |
| ticdc | TiCDC change data capture | P0 | ~100 MB |
| tidb-operator | K8s operations orchestration | P0 | ~100 MB |
Subtotal: ~1.75 GB
P1: Cloud Platform
Must migrate in the second batch
| Repo | Description | Priority | Est. Size |
|---|---|---|---|
| cloud-control-plane | Control-plane service | P1 | ~200 MB |
| cloud-deploy | Deployment service | P1 | ~100 MB |
| cloud-monitoring | Monitoring service | P1 | ~150 MB |
| cloud-o11y | Observability platform | P1 | ~200 MB |
| cloud-delivery | Delivery pipeline | P1 | ~50 MB |
| cloud-security | Security service | P1 | ~100 MB |
Subtotal: ~800 MB
P2: Tools & SDKs
Must migrate in the third batch
| Repo | Description | Priority | Est. Size |
|---|---|---|---|
| tidb-dashboard | Web console | P2 | ~50 MB |
| tiup | Package manager | P2 | ~20 MB |
| docs (technical) | Technical documentation | P2 | ~400 MB |
| tidb-vector-python | Python SDK | P2 | ~1 MB |
| client-drivers | Client drivers | P2 | ~50 MB |
Subtotal: ~521 MB
P3: Evaluate Before Deciding
Inclusion requires evaluation
| Repo | Description | Decision | Rationale |
|---|---|---|---|
| ossinsight | OSS analytics | ❌ Exclude | Independent product |
| autoflow | Graph RAG | ❌ Exclude | Experimental project |
| tantivy (fork) | Search | ⚠️ Evaluate | Depends on usage |
| sarama (fork) | Kafka | ⚠️ Evaluate | Depends on usage |
Estimated Scale
Included repo statistics
| Priority | Repo Count | Est. Size | Migration Time |
|---|---|---|---|
| P0 | 6 | ~1.75 GB | 2-3 weeks |
| P1 | 6 | ~800 MB | 2-3 weeks |
| P2 | 5 | ~521 MB | 1-2 weeks |
| P3 | 4 | TBD | Pending evaluation |
| Total | ~21 | ~3.1 GB | 5-8 weeks |
Comparison: earlier estimate of 400 repos / 39 GB → now 21 repos / 3.1 GB
Conclusion: with the scope focused, the footprint shrinks by ~90%, and the migration can finish within 2 months.
Boundary Cases
Case 1: TiDB Community Edition vs Cloud Edition
Scenario:
- TiDB has a community edition (open source) and a cloud edition (TiDB Cloud features)
- The cloud edition has extra features (Serverless, elastic scaling, etc.)
Handling:
✅ One unified codebase (mono-repo)
✅ Feature flags distinguish community and cloud editions
✅ Cloud-edition features are developed inside the mono-repo
✅ The community edition is built from the mono-repo (with cloud features stripped)
Benefits:
- Maximum code reuse
- Cloud-edition features can iterate quickly
- The community edition can still be released independently
Case 2: Internal Tools vs Customer Tools
Scenario:
- Some tools are used only by internal operations
- Some tools are used directly by customers
Handling:
✅ Both go into the mono-repo
✅ Access control restricts internal tools
✅ Internal tools follow the same quality standards
Benefits:
- Internal tools also benefit from AI optimization
- One unified toolchain
- Internal and customer tools can learn from each other
Case 3: Third-Party Dependencies
Scenario:
- TiDB Cloud depends on many third-party libraries
- Some are forks with local modifications
Handling:
✅ Forks with TiDB Cloud-specific changes → into the mono-repo (libs/)
✅ Unmodified dependencies → use upstream (via package management)
✅ Periodically re-evaluate forks; return to upstream where possible
Benefits:
- Less maintenance burden
- Focus on core differentiation
- Stay in sync with the community
Migration Strategy
Phase 1: P0 Core Products (Week 1-3)
Target: TiDB/TiKV/PD/TiFlash/TiCDC/tidb-operator
Actions:
1. Create the mono-repo skeleton
2. Migrate the 6 core repos
3. Set up the unified build system
4. Verify end-to-end builds
Success criteria:
- All 6 repos build inside the mono-repo
- 100% test pass rate
- Build time <1 hour
Phase 2: P1 Cloud Platform (Week 4-6)
Target: Control Plane, Deploy, Monitoring, O11y, Delivery, Security
Actions:
1. Migrate the 6 cloud-platform repos
2. Set up a unified API Gateway
3. Set up a unified monitoring system
4. Verify end-to-end deployment
Success criteria:
- The cloud platform can deploy TiDB Cloud
- Monitoring and alerting work correctly
- Deployment automation rate >90%
Phase 3: P2 Tools & SDKs (Week 7-8)
Target: Dashboard, tiup, docs, SDKs
Actions:
1. Migrate the 5 tool repos
2. Unify the documentation system
3. Unify the SDK release process
Success criteria:
- Tools work as before
- Documentation is complete
- SDKs release normally
Phase 4: AI Enablement (Week 9+)
Target: deploy the AI infrastructure and start the AI loop
Actions:
1. Deploy OpenClaw + Agents
2. AI leads development/testing/deployment
3. AI leads monitoring/operations
Success criteria:
- AI completes >20% of features
- AI deploys >10% of changes
- Human routine work <30%
Governance Model
Code ownership
mono-repo/
├── products/
│ ├── tidb/ @tidb-core-team
│ ├── tikv/ @tikv-core-team
│ ├── pd/ @pd-team
│ └── tiflash/ @tiflash-team
├── platform/
│ ├── control-plane/ @cloud-platform-team
│ ├── deploy/ @cloud-deploy-team
│ └── monitoring/ @cloud-monitoring-team
├── tools/
│ ├── dashboard/ @dashboard-team
│ └── tiup/ @tooling-team
└── libs/
└── ... @platform-architects
Approval authority
| Change Type | Approvers | Automation Level |
|---|---|---|
| Product code | Product team + AI | AI review + human approval |
| Platform code | Platform team + AI | AI review + human approval |
| Shared libraries | Architecture committee + AI | AI review + 2 human approvals |
| Infrastructure | Infra team + AI | AI review + human approval |
| Documentation | Docs team + AI | AI review (auto-merge allowed) |
Decision Log
2026-03-01: Scope-Focusing Decision
Decision: focus on TiDB Cloud DBaaS; exclude unrelated projects
Rationale:
- 400 repos / 39 GB is too large; the migration would take 3-4 months
- Focusing on TiDB Cloud delivers value quickly (2 months)
- Unrelated projects (OSS Insight, AutoFlow) would dilute attention
- A focused scope lets us validate the feasibility of an AI-driven mono-repo
Impact:
- Migration scale: 400 repos → ~21 repos
- Migration time: 3-4 months → 5-8 weeks
- Cost: ~$500 → ~$50
- Risk: dramatically reduced
Follow-up:
- If the TiDB Cloud mono-repo succeeds, extend the approach to other product lines
- Excluded projects stay independent; merging them can be evaluated later
Conclusion
After focusing the scope:
✅ Clearer goal: the full TiDB Cloud DBaaS chain
✅ Smaller scale: 21 repos / 3.1 GB (vs 400 / 39 GB)
✅ Faster delivery: 5-8 weeks (vs 3-4 months)
✅ Lower cost: ~$50 (vs ~$500)
✅ Lower risk: focus on the core, reduce complexity
Recommendation: start the migration with this scope immediately and quickly validate the feasibility of an AI-driven mono-repo.
2026-03-01 | Large-scale Agentic Engineering Team
Low-Level Design: Large-scale Agentic Engineering
Detailed design document (in response to the "50% AI Coding" initiative)
Date: 2026-03-01
Version: 1.0
Status: Design Complete
1. System Architecture
1.1 High-Level Architecture
┌─────────────────────────────────────────────────────────────────┐
│ Large-scale Agentic Engineering │
├─────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ OpenClaw (Main Brain) │ │
│ │ - Model: qwen3.5-plus │ │
│ │ - Role: Orchestrator, Decision Maker │ │
│ │ - Lifetime: Long-running (weeks to months) │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │ │
│ ┌────────────────────┼────────────────────┐ │
│ │ sessions_spawn() │ sessions_send() │ │
│ ▼ ▼ ▼ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ Sub-Agent 1 │ │ Sub-Agent 2 │ │ Sub-Agent N │ │
│ │ (Analyzer) │ │ (Migrator) │ │ (Guardian) │ │
│ │ qwen3.5+ │ │ qwen3.5+ │ │ qwen3.5+ │ │
│ │ Disposable │ │ Disposable │ │ Long-running│ │
│ └─────────────┘ └─────────────┘ └─────────────┘ │
│ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ Persistent State (.rd-os/) │ │
│ │ - progress.db (SQLite): Definitive progress store │ │
│ │ - agent-states/: Per-agent checkpoints (JSON) │ │
│ │ - artifacts/: Generated reports, outputs │ │
│ │ - Survives: OpenClaw restart, sub-agent death │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────┘
1.2 Component Responsibilities
| Component | Responsibility | Lifetime | Model |
|---|---|---|---|
| OpenClaw | Orchestration, decisions, recovery | Weeks-months | qwen3.5-plus |
| Analyzer Agents | Repo analysis, value scoring | Minutes-hours | qwen3.5-plus |
| Migrator Agents | Code migration, build updates | Minutes-hours | qwen3.5-plus |
| Guardian Agents | Continuous monitoring, PR review | Days-weeks | qwen3.5-plus |
| State Store | Progress.db, checkpoints | Permanent | N/A |
2. Data Model
2.1 SQLite Schema (progress.db)
-- Repository registry
CREATE TABLE repos (
repo_id TEXT PRIMARY KEY,
name TEXT NOT NULL,
full_name TEXT,
priority TEXT, -- P0, P1, P2, P3
category TEXT, -- product, platform, tool, docs, sdk
github_url TEXT,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
-- Analysis state
CREATE TABLE analysis_state (
repo_id TEXT PRIMARY KEY,
status TEXT NOT NULL, -- pending, running, done, failed
progress_percent INTEGER DEFAULT 0,
value_score INTEGER, -- 0-100
tier TEXT, -- S, A, B, C
started_at TIMESTAMP,
completed_at TIMESTAMP,
result_json TEXT, -- Full analysis result
error_message TEXT,
last_checkpoint TEXT,
FOREIGN KEY (repo_id) REFERENCES repos(repo_id)
);
-- Migration state
CREATE TABLE migration_state (
repo_id TEXT PRIMARY KEY,
status TEXT NOT NULL, -- pending, running, done, failed
phase TEXT, -- prep, transfer, integrate, validate
progress_percent INTEGER DEFAULT 0,
started_at TIMESTAMP,
completed_at TIMESTAMP,
result_json TEXT,
error_message TEXT,
FOREIGN KEY (repo_id) REFERENCES repos(repo_id)
);
-- Sub-agent registry
CREATE TABLE sub_agents (
agent_id TEXT PRIMARY KEY,
agent_type TEXT NOT NULL, -- analyzer, migrator, guardian
repo_id TEXT,
status TEXT NOT NULL, -- active, idle, paused, completed, failed
spawned_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
completed_at TIMESTAMP,
last_heartbeat TIMESTAMP,
checkpoint_path TEXT,
FOREIGN KEY (repo_id) REFERENCES repos(repo_id)
);
-- Checkpoints
CREATE TABLE checkpoints (
checkpoint_id TEXT PRIMARY KEY,
checkpoint_type TEXT NOT NULL, -- micro, batch, milestone
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
state_snapshot TEXT, -- JSON of full state
recoverable BOOLEAN DEFAULT TRUE
);
-- Event log (for debugging/audit)
CREATE TABLE events (
event_id TEXT PRIMARY KEY,
timestamp TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
event_type TEXT NOT NULL,
agent_id TEXT,
repo_id TEXT,
details TEXT
);
-- Indexes for fast queries
CREATE INDEX idx_analysis_status ON analysis_state(status);
CREATE INDEX idx_migration_status ON migration_state(status);
CREATE INDEX idx_agent_status ON sub_agents(status);
CREATE INDEX idx_events_timestamp ON events(timestamp);
CREATE INDEX idx_repos_priority ON repos(priority);
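The schema can be exercised end to end with Python's built-in `sqlite3` module; a minimal smoke test using a trimmed subset of the tables above. An in-memory database stands in for `.rd-os/store/progress.db`.

```python
import sqlite3

# Create the progress store and exercise the analysis_state table
# (trimmed to the columns this sketch uses).
conn = sqlite3.connect(':memory:')  # use '.rd-os/store/progress.db' in practice
conn.executescript("""
CREATE TABLE repos (
    repo_id TEXT PRIMARY KEY,
    name TEXT NOT NULL,
    priority TEXT
);
CREATE TABLE analysis_state (
    repo_id TEXT PRIMARY KEY,
    status TEXT NOT NULL,
    progress_percent INTEGER DEFAULT 0,
    value_score INTEGER,
    FOREIGN KEY (repo_id) REFERENCES repos(repo_id)
);
CREATE INDEX idx_analysis_status ON analysis_state(status);
""")
conn.execute("INSERT INTO repos VALUES ('tidb', 'tidb', 'P0')")
conn.execute("INSERT INTO analysis_state VALUES ('tidb', 'running', 40, NULL)")
# The orchestrator's recovery query: which analyses are still running?
pending = conn.execute(
    "SELECT repo_id FROM analysis_state WHERE status = 'running'"
).fetchall()
```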
2.2 JSON State Format
// .rd-os/state/agent-states/{repo_id}-analysis.json
{
"agent_id": "analyzer-tidb-001",
"repo_id": "tidb",
"status": "completed",
"created_at": "2026-03-01T10:00:00Z",
"updated_at": "2026-03-01T10:30:00Z",
"work": {
"phase": "analysis",
"subtask": "dependency_mapping",
"progress_percent": 100,
"items_total": 50,
"items_completed": 50,
"items_failed": 0
},
"result": {
"success": true,
"output_path": ".rd-os/store/artifacts/tidb-analysis.json",
"summary": {
"lines_of_code": 652000,
"dependencies": 127,
"test_coverage": 78.5,
"last_commit": "2026-02-28",
"merge_recommendation": "P0-migrate"
}
},
"checkpoint": {
"last_action": "wrote_dependency_graph",
"last_action_time": "2026-03-01T10:30:00Z",
"can_resume": false,
"resume_point": null
},
"errors": []
}
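Since the recovery path trusts these files, it is worth validating them on load; a small sketch that checks the top-level keys shown above before using the contents. The helper name and the exact key set are assumptions, not a specified API.

```python
import json

# Top-level keys the orchestrator relies on, per the state format above.
REQUIRED_KEYS = {'agent_id', 'repo_id', 'status', 'work', 'result', 'checkpoint'}

def load_agent_state(raw: str) -> dict:
    """Parse an agent-state JSON document and reject it early if any of
    the keys the orchestrator depends on are missing."""
    state = json.loads(raw)
    missing = REQUIRED_KEYS - state.keys()
    if missing:
        raise ValueError(f"agent state missing keys: {sorted(missing)}")
    return state
```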
3. OpenClaw Main Loop
3.1 Orchestration Logic
class OpenClawOrchestrator:
"""
OpenClaw main orchestration loop
"""
def __init__(self, db_path: str, max_concurrent: int = 50):
self.db = load_database(db_path)
self.max_concurrent = max_concurrent
self.active_agents = 0
self.lock = asyncio.Lock()
async def run(self):
"""
Main orchestration loop
"""
# 1. Recovery (after restart)
await self.recover_state()
# 2. Main loop
while not self.is_complete():
# 2.1 Check progress
progress = await self.load_progress()
# 2.2 Make scheduling decisions
decisions = await self.make_scheduling_decisions(progress)
# 2.3 Spawn sub-agents for new work
for decision in decisions:
if self.active_agents < self.max_concurrent:
if decision.action == 'analyze':
await self.spawn_analyzer(decision.repo)
elif decision.action == 'migrate':
await self.spawn_migrator(decision.repo)
elif decision.action == 'deep_dive':
await self.spawn_deep_analysis_team(decision.repo)
# 2.4 Check for completed sub-agents
completed = await self.check_completed_sub_agents()
for result in completed:
await self.process_result(result)
# 2.5 Handle escalations
await self.handle_escalations()
# 2.6 Update progress
await self.update_progress()
# 2.7 Checkpoint
await self.checkpoint()
# 2.8 Wait (avoid busy loop)
await asyncio.sleep(60)
# 3. Completion
await self.generate_final_report()
async def recover_state(self):
"""
Recover state after OpenClaw restart
"""
# Load progress DB
incomplete = self.db.query("""
SELECT repo_id, progress_percent, last_checkpoint
FROM analysis_state
WHERE status = 'running'
""")
for task in incomplete:
# Check if sub-agent has checkpoint
checkpoint_path = f".rd-os/state/agent-states/{task.repo_id}-analysis.checkpoint.json"
if exists(checkpoint_path):
# Resume from checkpoint
checkpoint = read_json(checkpoint_path)
await self.resume_analyzer(task.repo_id, checkpoint)
else:
# No checkpoint, restart
await self.spawn_analyzer(task.repo_id)
async def spawn_analyzer(self, repo: Repo):
"""
Spawn a sub-agent to analyze a repo
"""
task = f"""
Analyze repository: {repo.name}
Output to: .rd-os/state/agent-states/{repo.id}-analysis.json
Steps:
1. Read repo metadata from GitHub API
2. Analyze code structure
3. Map dependencies
4. Assess code quality
5. Generate merge recommendation
Checkpoint after each step.
Report completion via sessions_send().
"""
# Spawn sub-agent (qwen3.5-plus, cheap)
session = await sessions_spawn(
task=task,
model='qwen3.5-plus',
cleanup='delete', # Destroy after completion
label=f'analyzer-{repo.id}'
)
# Register sub-agent
        self.db.execute("""
            INSERT INTO sub_agents (agent_id, agent_type, repo_id, status, spawned_at)
            VALUES (?, 'analyzer', ?, 'active', ?)
        """, (session.id, repo.id, now()))
self.active_agents += 1
3.2 Sub-Agent Task Template
# Template for sub-agent tasks
ANALYZER_TASK_TEMPLATE = """
You are a Repository Analyzer Agent.
TASK: Analyze {repo_name}
OUTPUT: .rd-os/state/agent-states/{repo_id}-analysis.json
INSTRUCTIONS:
1. Read repo metadata from .rd-os/store/repos/{repo_id}.json
2. Analyze code structure (use GitHub API or local clone)
3. Map dependencies (go.mod, package.json, requirements.txt)
4. Assess code quality (tests, docs, lint)
5. Generate merge recommendation (P0/P1/P2/P3/archive)
VALUE SCORING (0-100):
- Activity (0-25): Last commit frequency, active contributors
- Impact (0-25): Stars, forks, import count, deployment instances
- Strategic (0-25): Core product, platform component, critical dependency
- Quality (0-15): Test coverage, documentation, code standards
- Feasibility (0-10): Dependency complexity, team support, tech stack match
CHECKPOINTING:
- After each step, write checkpoint to:
.rd-os/state/agent-states/{repo_id}-analysis.checkpoint.json
- Include: step_completed, partial_results, can_resume
COMPLETION:
- Write final output to: .rd-os/state/agent-states/{repo_id}-analysis.json
- Send completion message via sessions_send():
"Analysis complete: {repo_id}, output: {output_path}"
MODEL: qwen3.5-plus
TIMEOUT: 30 minutes
CLEANUP: delete (session destroyed after completion)
"""
4. State Persistence
4.1 Checkpoint Strategy
| Checkpoint Type | Frequency | Content | Use Case |
|---|---|---|---|
| Micro | Every action | Agent state | Crash recovery |
| Batch | Every N items | Batch summary | Batch resume |
| Milestone | Phase complete | Full state snapshot | Phase resume |
| Periodic | Every N minutes | Aggregated progress | Time-based recovery |
4.2 Checkpoint Implementation
class CheckpointManager:
"""
Manage checkpoints for recovery
"""
def __init__(self, base_path: str):
self.base_path = base_path
self.state_path = f"{base_path}/state"
self.store_path = f"{base_path}/store"
def save_agent_state(self, agent_id: str, state: dict):
"""Save per-agent checkpoint (micro)"""
path = f"{self.state_path}/agent-states/{agent_id}.state.json"
state['checkpoint_time'] = now()
write_json(path, state)
# Also update SQLite
db.execute("""
INSERT OR REPLACE INTO sub_agents (agent_id, state_json, updated_at)
VALUES (?, ?, ?)
""", (agent_id, json.dumps(state), now()))
def save_batch_progress(self, batch_id: str, progress: dict):
"""Save batch progress (batch)"""
path = f"{self.state_path}/progress/{batch_id}.json"
write_json(path, progress)
# Update SQLite summary
db.execute("""
UPDATE batch_progress
SET progress_json = ?, updated_at = ?
WHERE batch_id = ?
""", (json.dumps(progress), now(), batch_id))
def save_milestone(self, milestone_name: str):
"""Save full state snapshot (milestone)"""
checkpoint_id = f"checkpoint-{milestone_name}-{timestamp()}"
path = f"{self.state_path}/checkpoints/{checkpoint_id}/"
# Snapshot everything
snapshot = {
'milestone': milestone_name,
'timestamp': now(),
'analysis_state': db.query_all("SELECT * FROM analysis_state"),
'migration_state': db.query_all("SELECT * FROM migration_state"),
'agent_state': db.query_all("SELECT * FROM sub_agents"),
'progress_summary': self.calculate_progress_summary()
}
write_json(f"{path}/snapshot.json", snapshot)
# Record in SQLite
db.execute("""
INSERT INTO checkpoints (checkpoint_id, checkpoint_type, created_at, state_snapshot, recoverable)
VALUES (?, ?, ?, ?, ?)
""", (checkpoint_id, 'milestone', now(), json.dumps(snapshot), True))
return checkpoint_id
def load_checkpoint(self, checkpoint_id: str) -> dict:
"""Load checkpoint for recovery"""
path = f"{self.state_path}/checkpoints/{checkpoint_id}/snapshot.json"
return read_json(path)
def get_recovery_state(self) -> dict:
"""Get current state for recovery"""
return {
'analysis': db.query_all("SELECT * FROM analysis_state WHERE status != 'done'"),
'migration': db.query_all("SELECT * FROM migration_state WHERE status != 'done'"),
'agents': db.query_all("SELECT * FROM sub_agents WHERE status != 'idle'"),
'latest_checkpoint': db.query_one("SELECT * FROM checkpoints ORDER BY created_at DESC LIMIT 1")
}
5. Recovery Protocol
5.1 Recovery Flow
OpenClaw Restarts
│
▼
┌─────────────────────────────────────────────────────────────┐
│ 1. Load State from .rd-os/store/progress.db │
│ - Query: What repos are analyzed? │
│ - Query: What repos are in progress? │
│ - Query: What sub-agents were running? │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ 2. Reconcile Sub-Agent State │
│ - Find sub-agents marked 'running' │
│ - Check if they have checkpoints │
│ - If checkpoint exists → respawn with resume │
│ - If no checkpoint → restart from beginning │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ 3. Resume Orchestration │
│ - Continue main loop │
│ - Spawn new sub-agents for pending work │
│ - Resume from last checkpoint │
└─────────────────────────────────────────────────────────────┘
Result: OpenClaw can restart anytime, progress is never lost
5.2 Recovery Example
async def recover_after_restart():
"""
Recovery after OpenClaw restart
"""
# Load durable state
db = load_database(".rd-os/store/progress.db")
# Find incomplete analysis
incomplete = db.query("""
SELECT repo_id, progress_percent, last_checkpoint
FROM analysis_state
WHERE status = 'running' OR status = 'pending'
""")
for task in incomplete:
if task.progress_percent > 0:
# Has progress - try to resume
checkpoint = load_checkpoint(task.last_checkpoint)
await resume_analysis(task.repo_id, checkpoint)
else:
# No progress - restart
await start_analysis(task.repo_id)
# Find incomplete migrations
# ... similar logic
# Resume agents
agents = db.query("SELECT * FROM sub_agents WHERE status = 'active'")
for agent in agents:
await resume_agent(agent.agent_id)
log.info(f"Recovery complete: {len(incomplete)} tasks resumed")
6. Concurrency Control
6.1 Agent Pool Manager
class AgentPoolManager:
"""
Manage sub-agent concurrency
"""
def __init__(self, max_concurrent: int = 50):
self.max_concurrent = max_concurrent
self.active_count = 0
self.lock = asyncio.Lock()
async def acquire(self) -> bool:
"""
Acquire a slot for new sub-agent
"""
async with self.lock:
if self.active_count < self.max_concurrent:
self.active_count += 1
return True
return False
async def release(self):
"""
Release a slot when sub-agent completes
"""
async with self.lock:
self.active_count -= 1
def get_utilization(self) -> float:
return self.active_count / self.max_concurrent
def get_available_slots(self) -> int:
return self.max_concurrent - self.active_count
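The acquire/release pair above is easy to leak if a sub-agent raises between the two calls. One way to harden it, sketched here, is to wrap the same slot logic in an async context manager built on `asyncio.Semaphore`; the class and function names are illustrative, not part of the design above.

```python
import asyncio

class AgentPool:
    """Slot-based pool mirroring AgentPoolManager, exposed as an async
    context manager so a slot can never leak on error."""

    def __init__(self, max_concurrent: int):
        self._sem = asyncio.Semaphore(max_concurrent)
        self.max_concurrent = max_concurrent
        self.active = 0

    async def __aenter__(self):
        await self._sem.acquire()
        self.active += 1
        return self

    async def __aexit__(self, *exc):
        self.active -= 1
        self._sem.release()
        return False  # never swallow the sub-agent's exception

async def analyze(pool: AgentPool, repo: str, done: list):
    async with pool:  # the slot is released even if analysis raises
        await asyncio.sleep(0)  # stand-in for real analysis work
        done.append(repo)

async def main():
    pool = AgentPool(max_concurrent=2)
    done = []
    await asyncio.gather(*(analyze(pool, r, done)
                           for r in ['tidb', 'tikv', 'pd']))
    return done

done = asyncio.run(main())
```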
6.2 Batch Processing
async def process_in_batches(repos: List[Repo], batch_size: int = 50):
"""
Process repos in batches (avoid overwhelming system)
"""
for i in range(0, len(repos), batch_size):
batch = repos[i:i+batch_size]
log.info(f"Processing batch {i//batch_size + 1}: {len(batch)} repos")
# Spawn sub-agents for batch
tasks = [spawn_analyzer(repo) for repo in batch]
# Wait for batch to complete (with timeout)
await asyncio.gather(*tasks, return_exceptions=True)
# Checkpoint after batch
await checkpoint(f'batch-{i//batch_size}')
# Rate limit (avoid API throttling)
await asyncio.sleep(60)
7. API Specifications
7.1 GitHub API Integration
class GitHubAPIClient:
"""
GitHub API client for repo metadata
"""
def __init__(self, token: str):
self.token = token
self.base_url = "https://api.github.com"
self.rate_limit = 5000 # requests/hour
self.requests_made = 0
async def get_repo(self, owner: str, repo: str) -> dict:
"""
Get repository metadata
"""
url = f"{self.base_url}/repos/{owner}/{repo}"
return await self._request(url)
async def get_repos(self, org: str, per_page: int = 100) -> List[dict]:
"""
Get all repositories for an organization
"""
repos = []
page = 1
while True:
url = f"{self.base_url}/orgs/{org}/repos"
params = {"sort": "stars", "direction": "desc", "per_page": per_page, "page": page}
result = await self._request(url, params)
if not result:
break
repos.extend(result)
page += 1
return repos
async def _request(self, url: str, params: dict = None) -> dict:
"""
Make authenticated request with rate limiting
"""
if self.requests_made >= self.rate_limit:
await self._wait_for_reset()
headers = {"Authorization": f"token {self.token}"}
async with aiohttp.ClientSession() as session:
async with session.get(url, headers=headers, params=params) as response:
self.requests_made += 1
return await response.json()
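The client above counts requests locally; GitHub also reports quota in the `X-RateLimit-Remaining` and `X-RateLimit-Reset` response headers (the latter is a Unix timestamp), which a `_wait_for_reset` implementation could consult. A sketch of the pure calculation; the function name is an assumption.

```python
def seconds_until_reset(headers: dict, now: float) -> float:
    """Given GitHub rate-limit response headers, return how long to sleep
    before retrying: 0 if requests remain, otherwise the time until the
    quota window resets (X-RateLimit-Reset is a Unix timestamp)."""
    remaining = int(headers.get('X-RateLimit-Remaining', '1'))
    if remaining > 0:
        return 0.0
    reset_at = float(headers.get('X-RateLimit-Reset', now))
    return max(0.0, reset_at - now)
```

A caller would `await asyncio.sleep(seconds_until_reset(response.headers, time.time()))` before retrying.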
7.2 sessions_spawn Interface
async def sessions_spawn(
    task: str,
    model: str = 'qwen3.5-plus',
    cleanup: str = 'delete',
    label: str = None,
    timeout_seconds: int = 1800
) -> Session:
    """
    Spawn a sub-agent session

    Args:
        task: Task description for the sub-agent
        model: Model to use (default: qwen3.5-plus)
        cleanup: 'delete' (destroy after completion) or 'keep'
        label: Optional label for the session
        timeout_seconds: Timeout in seconds (default: 30 minutes)

    Returns:
        Session object with id and methods
    """
    # Implementation via OpenClaw sessions_spawn API
    pass
7.3 sessions_send Interface
async def sessions_send(
    session_key: str = None,
    label: str = None,
    message: str = None,
    timeout_seconds: int = 60
):
    """
    Send a message to/from a session

    Args:
        session_key: Target session key (or label)
        label: Target session label
        message: Message to send
        timeout_seconds: Timeout in seconds
    """
    # Implementation via OpenClaw sessions_send API
    pass
8. Directory Structure
mono-repo/
└── .rd-os/
├── state/ # Runtime state (can rebuild)
│ ├── agent-states/ # Per-agent checkpoint
│ │ ├── repo-001.state.json
│ │ ├── repo-002.state.json
│ │ └── ...
│ ├── progress/ # Aggregated progress
│ │ ├── analysis-progress.json
│ │ ├── migration-progress.json
│ │ └── daily-summary/
│ │ ├── 2026-03-01.json
│ │ └── ...
│ └── checkpoints/ # Milestone snapshots
│ ├── checkpoint-001-analysis-complete/
│ ├── checkpoint-002-p0-migrated/
│ └── ...
│
└── store/ # Durable store (source of truth)
├── progress.db # SQLite: definitive progress
├── agents.db # SQLite: agent registry
├── artifacts/ # Generated outputs
│ ├── analysis-report.json
│ ├── migration-log.jsonl
│ └── ...
└── config/ # Configuration
├── agents.yaml
├── workflows.yaml
└── policies.yaml
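The directory skeleton above can be bootstrapped idempotently. Below is a minimal sketch; `init_rd_os` is a hypothetical helper (not part of the design), and only the directory names come from the layout in this section:

```python
from pathlib import Path

# Directory names follow the .rd-os layout described above.
RD_OS_DIRS = [
    "state/agent-states",
    "state/progress/daily-summary",
    "state/checkpoints",
    "store/artifacts",
    "store/config",
]

def init_rd_os(root: str = ".rd-os") -> Path:
    """Create the .rd-os skeleton; safe to call repeatedly."""
    base = Path(root)
    for rel in RD_OS_DIRS:
        (base / rel).mkdir(parents=True, exist_ok=True)  # idempotent
    return base

base = init_rd_os("/tmp/demo-mono-repo/.rd-os")
print(sorted(p.relative_to(base).as_posix() for p in base.rglob("*") if p.is_dir()))
```

Because `mkdir(..., exist_ok=True)` is idempotent, this can run on every orchestrator start without clobbering existing state.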
9. Cost Estimate
9.1 Token Usage
| Phase | Repos | Tokens/Repo | Total Tokens | Cost (@$0.002/1K) |
|---|---|---|---|---|
| Analysis | 400 | 10K | 4M | ~$8 |
| Deep Analysis | 150 (S/A) | 50K | 7.5M | ~$15 |
| Migration | 400 | 50K | 20M | ~$40 |
| Ongoing (monthly) | - | - | 3M | ~$6 |
| Total (Year 1) | - | - | ~35M | ~$70 |
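As a sanity check on the table, the per-phase arithmetic can be reproduced in a few lines (the $0.002/1K price is the table's own assumption):

```python
PRICE_PER_1K = 0.002  # $/1K tokens, as assumed in the table above

def phase_cost(repos: int, tokens_per_repo: int) -> tuple:
    """Return (total tokens, dollar cost) for one phase."""
    total = repos * tokens_per_repo
    return total, total / 1000 * PRICE_PER_1K

print(phase_cost(400, 10_000))   # Analysis
print(phase_cost(150, 50_000))   # Deep Analysis (S/A tier)
print(phase_cost(400, 50_000))   # Migration
```

Note that the Year-1 total (~35M tokens) corresponds to the three one-off phases (31.5M) plus roughly one month of ongoing usage.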
9.2 Infrastructure
| Resource | Estimate | Cost |
|---|---|---|
| Storage | 100GB SSD | ~$10/month |
| Compute | Local (existing) | $0 |
| GitHub API | Free tier (5K/hr) | $0 |
| Total (monthly) | - | ~$10 |
9.3 Total Cost (Year 1)
| Category | Cost |
|---|---|
| LLM Tokens | ~$70 |
| Infrastructure | ~$120 |
| Total | ~$190 |
10. Risk Mitigation
10.1 Technical Risks
| Risk | Probability | Impact | Mitigation |
|---|---|---|---|
| API Rate Limit | Medium | Medium | Batch requests, add delays, use multiple tokens |
| Sub-Agent Failure | High | Low | Checkpoint + retry, idempotent operations |
| OpenClaw Restart | Medium | Low | Recovery from progress.db, automatic resume |
| Token Overrun | Low | Medium | Monitor usage, set limits, alert on threshold |
| Poor Quality Output | Medium | Medium | Human review, iterate template, add validation |
10.2 Operational Risks
| Risk | Probability | Impact | Mitigation |
|---|---|---|---|
| Data Loss | Low | High | Full backups before each batch, SQLite WAL mode |
| Build Failures | Medium | Medium | Comprehensive tests, canary deploys, rollback |
| Performance Degradation | Medium | Medium | Incremental builds, remote caching, parallel execution |
11. Testing Strategy
11.1 Unit Tests
# Test checkpoint manager
def test_save_agent_state():
    manager = CheckpointManager(".rd-os")
    state = {"agent_id": "test-001", "status": "running", "progress": 50}
    manager.save_agent_state("test-001", state)
    # Verify the state file was created
    assert os.path.exists(".rd-os/state/agent-states/test-001.state.json")
    # Verify SQLite was updated
    result = db.query_one("SELECT * FROM sub_agents WHERE agent_id = ?", ("test-001",))
    assert result is not None

# Test recovery (async because recover_state is awaited)
async def test_recovery_after_restart():
    # Simulate a restart
    orchestrator = OpenClawOrchestrator(".rd-os/store/progress.db")
    await orchestrator.recover_state()
    # Verify incomplete tasks were resumed
    incomplete = db.query_all("SELECT * FROM analysis_state WHERE status = 'running'")
    for task in incomplete:
        assert task.repo_id in orchestrator.active_tasks
11.2 Integration Tests
# Test full analysis workflow
async def test_full_analysis_workflow():
    # Setup
    repos = [Repo("tidb"), Repo("tiflow")]
    # Run analysis
    await process_in_batches(repos, batch_size=2)
    # Verify results
    for repo in repos:
        state = db.query_one("SELECT * FROM analysis_state WHERE repo_id = ?", (repo.id,))
        assert state.status == "done"
        assert state.result_json is not None
    # Verify checkpoints
    assert os.path.exists(".rd-os/state/checkpoints/checkpoint-batch-0/")
11.3 Recovery Tests
# Test recovery after crash
async def test_recovery_after_crash():
    # Start analysis
    orchestrator = OpenClawOrchestrator(".rd-os/store/progress.db")
    task = asyncio.create_task(orchestrator.run())
    # Wait for some progress
    await asyncio.sleep(300)  # 5 minutes
    # Simulate a crash (cancellation raises CancelledError, so suppress it)
    task.cancel()
    with contextlib.suppress(asyncio.CancelledError):
        await task
    # Restart
    orchestrator2 = OpenClawOrchestrator(".rd-os/store/progress.db")
    await orchestrator2.recover_state()
    # Verify progress was preserved
    progress = await orchestrator2.load_progress()
    assert progress['analyzed'] > 0
    assert progress['in_progress'] >= 0
12. Deployment Plan
12.1 Phase 1: Infrastructure Setup (Week 1-2)
Week 1:
- Create .rd-os/ directory structure
- Initialize progress.db schema
- Implement OpenClaw main loop
- Implement checkpoint manager
Week 2:
- Create sub-agent task templates
- Implement recovery protocol
- Test restart recovery
- Test sub-agent failure recovery
12.2 Phase 2: 400-Repo Analysis (Week 3-4)
Week 3:
- Fetch all 400 repos via GitHub API
- Run initial scan (all repos)
- Score and tier repos
Week 4:
- Deep analysis for S/A-tier repos
- Generate analysis report
- Create migration priority list
12.3 Phase 3: Migration (Week 5-16)
Week 5-7: P0 repos (50 repos)
Week 8-11: P1 repos (100 repos)
Week 12-15: P2-P3 repos (150 repos)
Week 16: P4-P5 cleanup (100 repos)
13. Monitoring & Alerting
13.1 Key Metrics
metrics = {
    'total_repos': 400,
    'analyzed': 150,
    'in_progress': 50,
    'pending': 200,
    'failed': 0,
    'progress_percent': 37.5,
    'active_agents': 45,
    'agent_utilization': 0.90,
    'tokens_used': 1500000,
    'tokens_remaining': 3500000,
    'estimated_cost': 3.00,
    'last_checkpoint': '2026-03-01T14:00:00Z',
    'checkpoint_age_minutes': 15,
}
13.2 Alerting Rules
alerts:
  - name: high_failure_rate
    condition: "failed_count / total_count > 0.05"
    severity: warning
    action: notify_human
  - name: stalled_progress
    condition: "no_progress_for_minutes > 60"
    severity: warning
    action: notify_human
  - name: agent_down
    condition: "agent_heartbeat_age_minutes > 10"
    severity: critical
    action: notify_human + restart_agent
  - name: checkpoint_age
    condition: "last_checkpoint_age_minutes > 30"
    severity: warning
    action: force_checkpoint
  - name: token_budget
    condition: "tokens_remaining < 500000"
    severity: warning
    action: notify_human
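A rule set like the one above could be evaluated by a tiny engine against the metrics snapshot from §13.1. A minimal sketch, assuming the metric key names from that snapshot (the lambda conditions are illustrative translations of the YAML expressions):

```python
# (rule name, condition, severity) — names mirror the YAML rules above.
RULES = [
    ("high_failure_rate", lambda m: m["failed"] / m["total_repos"] > 0.05, "warning"),
    ("stalled_progress", lambda m: m["no_progress_for_minutes"] > 60, "warning"),
    ("checkpoint_age", lambda m: m["checkpoint_age_minutes"] > 30, "warning"),
    ("token_budget", lambda m: m["tokens_remaining"] < 500_000, "warning"),
]

def evaluate(metrics: dict) -> list:
    """Return (rule_name, severity) for every rule that fires."""
    return [(name, sev) for name, cond, sev in RULES if cond(metrics)]

snapshot = {
    "failed": 0, "total_repos": 400,
    "no_progress_for_minutes": 5,
    "checkpoint_age_minutes": 45,      # stale checkpoint -> should fire
    "tokens_remaining": 3_500_000,
}
print(evaluate(snapshot))
```

In a real deployment the firing rules would be routed to the actions (`notify_human`, `force_checkpoint`, etc.) rather than printed.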
14. Appendix
14.1 Glossary
| Term | Definition |
|---|---|
| OpenClaw | Main orchestrator (LLM-based) |
| Sub-Agent | Temporary worker agent (spawned by OpenClaw) |
| Checkpoint | Saved state for recovery |
| Mono-Repo | Single repository containing all code |
| RD-OS | Research & Development Operating System |
14.2 References
Low-Level Design: Large-scale Agentic Engineering
Version 1.0 | 2026-03-01
Google Sheet Interface: AI-Human Collaboration Hub
Using Google Sheets as the human-AI collaboration interface
Date: 2026-03-01
Version: 1.0
Status: Design Complete
Core Insight
Google Sheets is the core interaction interface for this project, not an accessory tool.
Why Google Sheets?
✅ Transparency
- Everyone can see progress
- AI decision-making is visible
- Humans can step in at any time
✅ Collaboration
- Multiple people edit simultaneously
- AI and humans maintain it together
- Comments, discussion, and decision records live in one place
✅ Flexibility
- Fields can be adjusted at any time
- The architecture can evolve iteratively
- No custom UI development required
✅ Traceability
- Version history
- Who (AI or human) made each change
- Why it was changed (comments)
✅ Low barrier to entry
- Everyone already knows how to use it
- No training required
- Viewable on mobile
Comparison with alternatives:
| Option | Transparency | Collaboration | Flexibility | Development Cost |
|---|---|---|---|---|
| Google Sheets | ✅ High | ✅ High | ✅ High | ✅ Zero |
| In-house dashboard | ⚠️ Medium | ⚠️ Medium | ❌ Low | ❌ High |
| JIRA/Asana | ⚠️ Medium | ✅ High | ⚠️ Medium | ⚠️ Medium |
| Database + API | ❌ Low | ❌ Low | ❌ Low | ❌ High |
Conclusion: Google Sheets is the best interface for AI-human collaboration.
Sheet Design
Sheet 1: Repo Inventory (Master List)
Purpose: list all 400 repos awaiting analysis and track their analysis status
Field Definitions
| Column | Field | Type | Description | Filled By |
|---|---|---|---|---|
| A | Repo ID | Text | Unique identifier (e.g. tidb-001) | AI |
| B | Repo Name | Text | Full name (e.g. pingcap/tidb) | AI |
| C | GitHub URL | URL | GitHub link | AI |
| D | Description | Text | One-line description (AI-generated) | AI |
| E | Category | Dropdown | Category (Product/Platform/Tool/SDK/Docs/Other) | AI |
| F | Stars | Number | GitHub stars | AI |
| G | Language | Text | Primary language | AI |
| H | Size (MB) | Number | Code size | AI |
| I | Last Commit | Date | Last commit time | AI |
| J | Activity Score | Number | Activity score (0-100) | AI |
| K | TiDB Cloud Related? | Dropdown | Yes/No/Unsure | AI + human confirmation |
| L | Worth Analyzing? | Dropdown | Yes/No/Maybe | AI + human confirmation |
| M | Priority | Dropdown | P0/P1/P2/P3/Archive | AI + human confirmation |
| N | Target Architecture | Text | Location in the mono-repo (e.g. products/tidb) | AI + human confirmation |
| O | Migration Phase | Dropdown | Phase1/2/3/4/Exclude | Human |
| P | Analysis Status | Dropdown | Pending/In Progress/Done/Blocked | AI |
| Q | Analysis Progress | % | Analysis progress (0-100%) | AI |
| R | Value Score | Number | Value score (0-100) | AI |
| S | Tier | Text | Tier (S/A/B/C) | AI |
| T | Dependencies | Text | Other repos this repo depends on (comma-separated) | AI |
| U | Blockers | Text | Blocking issues (if any) | AI + human |
| V | Owner (Human) | Text | Human owner (team/individual) | Human |
| W | Owner (AI) | Text | AI owner (agent ID) | AI |
| X | Last Updated | Timestamp | Last update time | AI |
| Y | Updated By | Text | Last updater (AI/human name) | AI |
| Z | Notes | Text | Remarks, comments, discussion | AI + human |
Sample Data
| Repo ID | Repo Name | Description | TiDB Cloud Related? | Worth Analyzing? | Priority | Target Architecture | Status |
|---|---|---|---|---|---|---|---|
| tidb-001 | pingcap/tidb | TiDB distributed database core | Yes | Yes | P0 | products/tidb | Done |
| tikv-001 | pingcap/tikv | TiKV distributed KV store | Yes | Yes | P0 | products/tikv | Done |
| oss-001 | pingcap/ossinsight | OSS data analytics platform | No | No | Exclude | N/A | Done |
| cloud-001 | pingcap/tidb-cloud-control | TiDB Cloud control-plane service | Yes | Yes | P0 | platform/control-plane | In Progress |
Sheet 2: Architecture Evolution
Purpose: record the multi-round iteration of the mono-repo architecture
Field Definitions
| Column | Field | Type | Description |
|---|---|---|---|
| A | Iteration | Number | Iteration number (1, 2, 3…) |
| B | Date | Date | Iteration date |
| C | Path | Text | Architecture path (e.g. products/tidb) |
| D | Description | Text | Responsibility of this path |
| E | Repos | Text | Repos assigned to this path |
| F | Changes from Previous | Text | Changes since the previous version |
| G | Rationale | Text | Reason for the change (AI-generated) |
| H | Approved By | Text | Approver (human) |
| I | Status | Dropdown | Proposed/Approved/Implemented |
Example: Architecture Evolution
Iteration 1 (2026-03-01): initial architecture
├── products/
│ ├── tidb/
│ └── tikv/
├── platform/
│ └── control-plane/
└── tools/
Iteration 2 (2026-03-08): refine products
├── products/
│ ├── tidb/ # compute layer
│ ├── tikv/ # storage layer
│ ├── pd/ # new: scheduling layer
│ └── tiflash/ # new: analytics layer
├── platform/
│ └── control-plane/
└── tools/
Rationale:
- tidb/tikv/pd/tiflash turned out to be independent components
- Separating them makes independent builds and tests easier
- Matches cloud-native architecture (layered decoupling)
Iteration 3 (2026-03-15): expand platform
├── products/
│ ├── tidb/
│ ├── tikv/
│ ├── pd/
│ └── tiflash/
├── platform/
│ ├── control-plane/ # control services
│ ├── deploy/ # new: deployment service
│ ├── monitoring/ # new: monitoring service
│ └── o11y/ # new: observability
└── tools/
Rationale:
- Deeper analysis of the cloud-platform repos showed the need for finer subdivision
- deploy/monitoring/o11y have distinct responsibilities
- Lets AI optimize each submodule independently
Sheet 3: Decision Log
Purpose: record major AI and human decisions for traceability
Field Definitions
| Column | Field | Type | Description |
|---|---|---|---|
| A | Decision ID | Text | Unique identifier (e.g. DEC-001) |
| B | Date | Date | Decision date |
| C | Type | Dropdown | Architecture/Scope/Priority/Other |
| D | Description | Text | Decision description |
| E | Proposed By | Text | AI/human name |
| F | Rationale | Text | Reason for the decision |
| G | Alternatives | Text | Other options considered |
| H | Impact | Text | Scope of impact |
| I | Approved By | Text | Approver |
| J | Status | Dropdown | Proposed/Approved/Rejected/Implemented |
| K | Related Repos | Text | List of related repos |
| L | Comments | Text | Discussion record |
Sample Decisions
| ID | Type | Description | Proposed By | Rationale | Status |
|---|---|---|---|---|---|
| DEC-001 | Scope | Exclude ossinsight | AI | Unrelated to TiDB Cloud; an independent product | Approved |
| DEC-002 | Architecture | Layer products into tidb/tikv/pd/tiflash | AI | Matches cloud-native architecture; enables independent builds | Approved |
| DEC-003 | Priority | Promote tidb-operator from P1 to P0 | Human | K8s is central to cloud deployment and must migrate in the first batch | Approved |
Sheet 4: Agent Assignment
Purpose: track which AI agent is responsible for which repo
Field Definitions
| Column | Field | Type | Description |
|---|---|---|---|
| A | Agent ID | Text | Unique agent identifier (e.g. analyzer-001) |
| B | Agent Type | Dropdown | Analyzer/Migrator/Guardian |
| C | Assigned Repo | Text | Assigned repo ID |
| D | Status | Dropdown | Idle/Running/Completed/Failed |
| E | Started At | Timestamp | Start time |
| F | Completed At | Timestamp | Completion time |
| G | Progress % | Number | Progress percentage |
| H | Last Checkpoint | Text | Last checkpoint |
| I | Result | Text | Result summary |
| J | Errors | Text | Error messages (if any) |
| K | Token Used | Number | Tokens consumed |
| L | Cost | Number | Cost ($) |
Sheet 5: Progress Dashboard
Purpose: high-level progress overview for executives and management
Contents
=== Overall Progress ===
Total Repos: 400
Analyzed: 150 (37.5%)
In Progress: 50 (12.5%)
Pending: 200 (50%)
Excluded: 21 (5.2%)
=== TiDB Cloud Related ===
Related: 21 (5.2%)
- P0: 6
- P1: 6
- P2: 5
- P3: 4
Not Related: 379 (94.8%)
=== Migration Status ===
Phase 1 (P0): 0/6 (0%)
Phase 2 (P1): 0/6 (0%)
Phase 3 (P2): 0/5 (0%)
Phase 4 (P3): 0/4 (0%)
=== Cost Tracking ===
Budget: $50
Spent: $12.50 (25%)
Remaining: $37.50 (75%)
Estimated Total: $48 (under budget)
=== Timeline ===
Start Date: 2026-03-01
Current Date: 2026-03-15
Planned End: 2026-04-30
Days Elapsed: 14
Days Remaining: 32
On Track: Yes ✅
Workflow
Phase 1: Initial Data Population (Week 1)
AI tasks:
1. Fetch metadata for all 400 repos via the GitHub API
2. Populate Sheet 1's basic fields (columns A-J)
3. Run a preliminary analysis and populate columns K-M (relevance, value, priority)
4. Generate initial architecture suggestions (column N)
Human tasks:
1. Review the AI's preliminary analysis
2. Confirm/adjust columns K-M (relevance, value, priority)
3. Confirm/adjust column N (architecture location)
4. Fill in column O (Migration Phase)
5. Fill in column V (human owner)
Output:
- Complete inventory of all 400 repos
- Initial architecture design (Iteration 1)
- Priorities and migration plan
Phase 2: Deep Analysis (Week 2-4)
AI tasks:
1. Analyze each repo in depth, in priority order
2. Update columns P-Q (analysis status and progress)
3. Populate columns R-S (value score and tier)
4. Populate column T (dependencies)
5. Update column N (architecture suggestions) as new information emerges
6. Fill in column U (Blockers) when blocked
Human tasks:
1. Monitor progress (via the Sheet 5 dashboard)
2. Resolve blockers (column U)
3. Review the AI's architecture suggestions (column N)
4. Approve architecture changes (Sheet 2)
5. Record major decisions (Sheet 3)
Output:
- Deep analysis reports for all 400 repos
- Architecture evolution record (Iteration 1 → 2 → 3)
- Decision log
Phase 3: Architecture Iteration (Week 5-6)
AI tasks:
1. Propose architecture improvements based on analysis results
2. Update Sheet 2 (architecture evolution)
3. Update column N of Sheet 1 (architecture location)
4. Generate architecture comparison reports (Iteration N vs N+1)
Human tasks:
1. Review architecture changes
2. Approve or reject changes
3. Record decision rationale (Sheet 3)
4. Notify affected teams (impact of architecture changes)
Output:
- Stable mono-repo architecture (Iteration Final)
- Complete decision log
- Architecture evolution history
Phase 4: Migration Preparation (Week 7-8)
AI tasks:
1. Generate a migration plan for each repo
2. Update column O of Sheet 1 (Migration Phase)
3. Assign AI agents (Sheet 4)
4. Produce a migration risk assessment
Human tasks:
1. Review migration plans
2. Confirm human owners (column V)
3. Approve migration kickoff
4. Notify affected teams
Output:
- Migration plan (grouped by phase)
- Agent assignment plan
- Risk assessment report
Multi-Round Iteration Mechanism
Architecture Evolution Flow
Iteration N:
┌──────────────────────────────────────────────────────────────┐
│ 1. AI analyzes a new repo                                    │
│    - Finding: the repo does not fit the current architecture │
│    - Suggestion: create a new directory / adjust existing    │
│    - Record: Sheet 1, column N (suggested location)          │
└──────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌──────────────────────────────────────────────────────────────┐
│ 2. AI proposes an architecture change                        │
│    - Record: Sheet 2 (architecture evolution)                │
│    - Record: Sheet 3 (decision log - Proposed)               │
│    - Notify: human approver                                  │
└──────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌──────────────────────────────────────────────────────────────┐
│ 3. Human review                                              │
│    - Inspect: rationale for the change                       │
│    - Inspect: scope of impact                                │
│    - Comment: raise questions / suggestions                  │
│    - Decide: Approve / Reject / Modify                       │
│    - Record: Sheet 3 (Approved By, Status)                   │
└──────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌──────────────────────────────────────────────────────────────┐
│ 4. AI executes the change                                    │
│    - Update: Sheet 2 (Status = Implemented)                  │
│    - Update: Sheet 1 (column N for affected repos)           │
│    - Record: change log                                      │
└──────────────────────────────────────────────────────────────┘
                              │
                              ▼
                        Iteration N+1
Example: Architecture Iteration Process
=== Iteration 1 (2026-03-01) ===
Initial architecture (based on human intuition):
mono-repo/
├── products/
│ └── database/
├── platform/
│ └── cloud/
└── tools/
Problems:
- Too coarse (only 3 top-level categories)
- Does not match cloud-native architecture
- Cannot support independent builds
=== Iteration 2 (2026-03-08) ===
After analyzing 50 repos, the AI proposes a refinement:
mono-repo/
├── products/
│ ├── tidb/ # compute layer
│ ├── tikv/ # storage layer
│ ├── pd/ # scheduling layer
│ └── tiflash/ # analytics layer
├── platform/
│ ├── control-plane/ # control
│ ├── deploy/ # deployment
│ └── monitoring/ # monitoring
└── tools/
Rationale:
- A layered architecture follows cloud-native best practices
- Each layer can be built, tested, and deployed independently
- Lets AI optimize each module independently
Human approval: ✅ Approved
=== Iteration 3 (2026-03-15) ===
After analyzing 100 repos, the AI refines further:
mono-repo/
├── products/
│ ├── tidb/
│ ├── tikv/
│ ├── pd/
│ └── tiflash/
├── platform/
│ ├── control-plane/
│ ├── deploy/
│ ├── monitoring/
│ ├── o11y/ # new: observability split out
│ └── security/ # new: security services
├── tools/
│ ├── dashboard/
│ ├── tiup/
│ └── sdk/
└── libs/ # new: shared libraries
    └── ...
Rationale:
- o11y has complex responsibilities and should be split from monitoring
- security is a cross-layer capability and needs its own module
- libs holds shared libraries and forks
Human approval: ✅ Approved
=== Iteration Final (2026-03-31) ===
Stable architecture (after analyzing all 400 repos):
mono-repo/
├── products/ # core database
│ ├── tidb/
│ ├── tikv/
│ ├── pd/
│ └── tiflash/
├── platform/ # cloud platform
│ ├── control-plane/
│ ├── deploy/
│ ├── monitoring/
│ ├── o11y/
│ └── security/
├── tools/ # toolchain
│ ├── dashboard/
│ ├── tiup/
│ └── sdk/
├── libs/ # shared libraries
│ └── ...
└── docs/ # documentation
    └── ...
The architecture is stable; no further changes.
AI-Human Collaboration Model
AI responsibilities
✅ Data population
- Fetch metadata from the GitHub API
- Auto-generate descriptions, categories, scores
✅ Preliminary analysis
- Assess relevance (TiDB Cloud Related?)
- Assess value (Worth Analyzing?)
- Suggest priority (Priority)
- Suggest architecture location (Target Architecture)
✅ Progress tracking
- Update analysis status
- Update progress percentages
- Record blockers
✅ Architecture suggestions
- Propose architecture improvements based on analysis
- Record architecture evolution
- Generate comparison reports
✅ Decision support
- Provide decision rationale
- List alternatives
- Assess scope of impact
Human responsibilities
✅ Final decisions
- Confirm/adjust AI suggestions
- Approve architecture changes
- Approve major decisions
✅ Exception handling
- Resolve blockers
- Handle cases the AI cannot judge
- Handle cross-team coordination
✅ Team communication
- Notify affected teams
- Coordinate migration schedules
- Handle staffing
✅ Quality oversight
- Spot-check AI analysis quality
- Review architectural soundness
- Ensure alignment with business goals
Technical Implementation
Google Sheets + OpenClaw Integration
# Pseudo-code: OpenClaw integration with Google Sheets
import gspread
from datetime import datetime, timezone
from typing import List

class GoogleSheetInterface:
    """
    Integration layer between OpenClaw and Google Sheets
    """
    def __init__(self, sheet_id: str):
        self.sheet_id = sheet_id
        self.client = gspread.oauth()  # gspread.oauth() already returns an authorized Client

    def update_repo_status(self, repo_id: str, status: str, progress: int):
        """
        Update a repo's analysis status
        """
        sheet = self.client.open_by_key(self.sheet_id).worksheet("Repo Inventory")
        # Find the row for this repo
        row = self._find_repo_row(repo_id)
        # Update status and progress
        sheet.update_acell(f"P{row}", status)
        sheet.update_acell(f"Q{row}", f"{progress}%")
        sheet.update_acell(f"X{row}", datetime.now(timezone.utc).isoformat())
        sheet.update_acell(f"Y{row}", "OpenClaw-Agent-001")

    def propose_architecture_change(self, iteration: int, changes: dict):
        """
        Propose an architecture change
        """
        sheet = self.client.open_by_key(self.sheet_id).worksheet("Architecture Evolution")
        # Append a new row
        sheet.append_row([
            iteration,
            datetime.now(timezone.utc).isoformat(),
            changes['path'],
            changes['description'],
            changes['repos'],
            changes['changes_from_previous'],
            changes['rationale'],
            "",          # Approved By (filled in by a human)
            "Proposed",  # Status
        ])
        # Notify the human approver
        self._notify_human(changes['approved_by'])

    def get_pending_decisions(self) -> List[dict]:
        """
        List decisions awaiting human approval
        """
        sheet = self.client.open_by_key(self.sheet_id).worksheet("Decision Log")
        # Find rows where Status (column J) = "Proposed"
        pending = sheet.findall("Proposed", in_column=10)
        return [self._row_to_dict(cell.row) for cell in pending]
Automation Rules
# OpenClaw automation rules
triggers:
  - name: repo_analysis_complete
    condition: "Sheet1.column_Q = 100%"
    action:
      - update_sheet: "Sheet1.column_P = Done"
      - notify_human: "Repo {repo_id} analysis complete"
      - trigger_next_repo: true
  - name: blocker_detected
    condition: "Sheet1.column_U != ''"
    action:
      - notify_human: "Blocker detected in {repo_id}: {column_U_content}"
      - update_sheet: "Sheet1.column_P = Blocked"
  - name: architecture_change_proposed
    condition: "Sheet2.Status = Proposed"
    action:
      - notify_human: "Architecture change proposed (Iteration {iteration})"
      - wait_for_approval: true
  - name: decision_approved
    condition: "Sheet3.Status = Approved"
    action:
      - execute_decision: "{Decision Details}"
      - update_sheet: "Sheet3.Status = Implemented"
Success Criteria
Sheet Quality Metrics
| Metric | Target | Measurement |
|---|---|---|
| Data completeness | >95% of fields populated | Ratio of empty fields |
| Data accuracy | >90% of AI-filled values correct | Human spot checks |
| Update freshness | <1 hour lag | Last-updated timestamps |
| Human participation | >80% of decisions human-approved | Approval rate |
| Architecture stability | <5 major changes | Number of architecture iterations |
Collaboration Quality Metrics
| Metric | Target | Measurement |
|---|---|---|
| AI suggestion adoption rate | >70% | Human approvals / AI proposals |
| Human satisfaction | >80% | Surveys |
| Decision turnaround | <24 hours | Time from proposal to approval |
| Transparency | 100% of decisions traceable | Decision-log completeness |
Risks and Responses
Risk 1: The Sheet becomes too complex
Scenario:
- Ever more fields (>50 columns)
- Humans can no longer understand it
- The AI's fill-in error rate rises
Response:
1. Periodically review whether each field is still necessary
2. Delete unused fields
3. Split into multiple sheets (don't keep everything in one)
4. Provide field documentation
Risk 2: Humans over-rely on the AI
Scenario:
- Humans stop reviewing AI-filled values
- Everything is rubber-stamped
- AI errors go undetected
Response:
1. Require human review of key fields (columns K-M, N, O)
2. Periodic spot checks (10% random sample)
3. Set approval quotas (humans must review X%)
4. Train humans to understand the AI's decision logic
Risk 3: The AI fills in errors
Scenario:
- The AI misclassifies a repo
- The AI misjudges value
- The AI suggests the wrong architecture
Response:
1. Humans review key decisions
2. The AI reports confidence (low-confidence entries are flagged)
3. Feed error cases back to the AI (continuous learning)
4. Cross-validate with multiple AIs (AI vs AI)
Conclusion
Google Sheets is the core interaction interface for this project:
- Transparent — everyone can see progress and decisions
- Collaborative — AI and humans maintain it together
- Flexible — fields and architecture can evolve iteratively
- Traceable — version history and a decision log
- Low barrier — everyone already knows how to use it; no training needed
Multi-round iteration mechanism:
- AI analyzes → proposes architecture → human approves → change is executed → next iteration
- The architecture becomes clearer as analysis deepens (Iteration 1 → 2 → 3 → Final)
AI-human collaboration:
- AI: data population, preliminary analysis, progress tracking, architecture suggestions
- Humans: final decisions, exception handling, team communication, quality oversight
Keys to success:
- Keep the Sheet lean (review fields regularly)
- Keep humans in key decisions (don't depend entirely on the AI)
- Keep the AI learning (from its error cases)
Google Sheet Interface: AI-Human Collaboration Hub
2026-03-01 | Large-scale Agentic Engineering Team
Corner Cases & Mitigation: Resistance to AI-Era Migration and How to Respond
An analysis of the friction traditional R&D assets and processes face when entering the AI world
Date: 2026-03-01
Version: 1.0
Status: Risk Analysis Complete
Executive Summary
Migrating from traditional to AI-driven R&D meets resistance along five dimensions: technology, organization, process, security, and culture. This document catalogs 50+ corner cases and provides concrete responses.
Core insight: technical resistance accounts for only 20%; 80% of the resistance comes from organization, process, and culture.
1. Technical Resistance
1.1 Codebase Fragmentation
Corner Case 1.1.1: Complex dependencies across 400+ repos
Scenario:
- Repo A depends on Repo B v1.2.3
- Repo B has moved to v2.0, but A still uses the old version
- Repo C depends on both A and B; versions conflict
- AI wants to change B's API, affecting 50 downstream repos
Friction:
- AI cannot make the change safely (blast radius too large)
- Manual coordination is expensive (50 teams must sign off)
- Migration stalls
Mitigation:
1. **Dependency graph first** — use AI to map the full dependency graph before migrating
2. **Backward-compatibility strategy** — when AI changes an API, auto-generate a compatibility shim
3. **Batched migration** — topologically sort the dependency graph and start from the leaf nodes
4. **Automated regression tests** — after AI changes, automatically run downstream repos' tests
5. **Feature flags** — gate new APIs behind flags and ramp up gradually
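The "topologically sort and start from the leaves" step above is straightforward with the standard library. A minimal sketch with a toy dependency graph (the repo names are purely illustrative):

```python
from graphlib import TopologicalSorter

# Toy graph: each repo maps to the set of repos it depends on.
deps = {
    "repo-c": {"repo-a", "repo-b"},
    "repo-a": {"repo-b"},
    "repo-b": set(),
}

# static_order() yields dependencies before dependents, so leaf repos
# (those with no dependencies) come first — the suggested migration order.
order = list(TopologicalSorter(deps).static_order())
print(order)
```

`TopologicalSorter` also raises `CycleError` on circular dependencies, which surfaces exactly the repos that need untangling before migration.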
Corner Case 1.1.2: Legacy code with no documentation
Scenario:
- A core module was written five years ago by an engineer who has since left
- No docs, no comments, no tests
- Only code; the business logic is unknown
- After analysis, the AI reports "cannot understand this"
Friction:
- AI cannot infer the business intent
- Nobody dares change it (fear of breaking the logic)
- Becomes a migration bottleneck
Mitigation:
1. **AI reverse engineering** — use AI to generate documentation and flow diagrams from the code
2. **Behavior capture** — capture inputs/outputs in production to establish a behavioral baseline
3. **Incremental refactoring** — AI adds tests first, then refactors once it is safe
4. **Expert interviews** — interview veteran employees; AI records and generates docs
5. **Flag as high risk** — migrate these modules last, after AI experience has accumulated
1.2 Non-Unified Build Systems
Corner Case 1.2.1: Multiple build systems coexist
Scenario:
- 100 repos use Maven
- 150 repos use npm
- 100 repos use Go modules
- 50 repos use custom scripts
- Build commands differ everywhere
Friction:
- AI cannot schedule builds uniformly
- Each system needs separate adaptation
- Build times are unpredictable
Mitigation:
1. **Unified build layer (Bazel)** — wrap the existing build systems behind Bazel
2. **Standardized build commands** — define a uniform interface (build/test/deploy)
3. **AI build optimization** — AI analyzes build dependencies and tunes the caching strategy
4. **Incremental migration** — unify new repos first; migrate old repos gradually
5. **Build-time SLO** — set a target (full build <30 minutes) and keep optimizing
Corner Case 1.2.2: Builds depend on external services
Scenario:
- Builds need an internal Nexus that has been decommissioned
- A specific compiler version exists on only one machine
- Builds call an external API (rate-limited)
- AI cannot reproduce the build environment
Friction:
- AI cannot build independently
- Humans must step in
- Automation fails
Mitigation:
1. **Containerize the environment** — package the build environment as a Docker image
2. **Dependency mirrors** — run an internal mirror that caches external dependencies
3. **Build-as-code** — define the build environment in code so AI can reproduce it
4. **Fallback strategy** — on build failure, automatically fall back to a prebuilt artifact
1.3 Insufficient Test Coverage
Corner Case 1.3.1: No automated tests
Scenario:
- A core service has zero automated tests
- Only manual QA
- AI changes cannot be verified
- The QA team is understaffed
Friction:
- AI dares not change anything (no safety net)
- QA becomes the bottleneck after changes
- High quality risk
Mitigation:
1. **AI-generated tests** — use AI to analyze the code and generate unit tests
2. **Behavioral tests first** — write end-to-end tests first to capture existing behavior
3. **Coverage SLO** — set a target (>80%) and raise coverage gradually
4. **AI + human review** — AI generates tests; humans review the critical cases
5. **Incremental coverage** — prioritize the code that changes most often
Corner Case 1.3.2: Tests depend on external systems
Scenario:
- Tests need database access (sensitive data)
- Tests call a payment API (real charges)
- Tests rely on third-party services (unstable)
- AI cannot run the tests in CI
Friction:
- Tests are unreliable
- CI fails constantly
- AI cannot tell code failures from environment failures
Mitigation:
1. **Test isolation** — isolate test environments with Docker
2. **Mock external dependencies** — AI generates mock services
3. **Sanitized test data** — run tests on anonymized data
4. **Layered tests** — unit tests (no dependencies) + integration tests (with dependencies)
5. **Flaky-test detection** — AI identifies and flags unstable tests
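Mocking an external dependency (mitigation 2 above) can be as small as this sketch using the standard library's `unittest.mock`; the `checkout` function and payment client are hypothetical stand-ins for a real service:

```python
from unittest.mock import Mock

# Hypothetical service under test: charges a card via an external payment client.
def checkout(payment_client, amount_cents: int) -> str:
    resp = payment_client.charge(amount_cents)
    return "ok" if resp["status"] == "succeeded" else "failed"

# In tests the real client is replaced by a Mock, so no real charge occurs
# and no network is needed — the test runs reliably in CI.
fake_client = Mock()
fake_client.charge.return_value = {"status": "succeeded"}

assert checkout(fake_client, 1999) == "ok"
fake_client.charge.assert_called_once_with(1999)  # verify the interaction
print("mocked payment test passed")
```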
2. Organizational Resistance
2.1 Team Boundary Protection
Corner Case 2.1.1: A team refuses AI access to its code
Scenario:
- The core algorithm team considers its code a "core competitive advantage"
- Refuses to move it into the mono-repo
- Refuses AI access (fear of leaks)
- Is only willing to provide compiled libraries
Friction:
- AI cannot understand the core logic
- Cross-module performance cannot be optimized
- The mono-repo is incomplete
Mitigation:
1. **Tiered access control** — the code lives in the mono-repo, but access is tiered
2. **AI security audit** — prove the AI does not leak code (audit logs)
3. **Value demonstration** — first demonstrate AI's benefits on public repos
4. **Gradual opening** — open non-core modules first; open core code once trust is established
5. **Executive backing** — requires CTO/executive support and a clear AI strategy
Corner Case 2.1.2: A team refuses AI changes to its code
Scenario:
- The team says "our code is too complex; AI doesn't understand it"
- Rejects AI-submitted PRs
- Demands human review of every change
- In reality, this is distrust of AI
Friction:
- AI contributions are rejected
- The team remains the bottleneck
- AI's value never materializes
Mitigation:
1. **AI pair programming** — AI and humans develop together to build trust
2. **Small steps** — AI starts with small changes (docs, comments, tests)
3. **Quality proof** — AI-submitted code passes tests and benchmarks
4. **Success stories** — publicize successful AI contributions
5. **Incentives** — reward teams that accept AI contributions
2.2 Performance-Review Conflicts
Corner Case 2.2.1: Who gets credit for AI's work?
Scenario:
- AI builds a feature
- Human A defined the requirements
- Human B reviewed the code
- Human C deployed it
- At review time, who gets the credit?
Friction:
- Teams fight over credit
- People avoid delegating to AI (fear of losing credit)
- The performance system breaks down
Mitigation:
1. **Redefine performance** — from doer to decider
2. **AI contribution tracking** — record AI's contributions (for evaluation, not credit-grabbing)
3. **Team performance first** — emphasize team outcomes over individual credit
4. **New evaluation dimensions** — assess "AI collaboration skill" and "decision quality"
5. **Transparent communication** — make AI-era performance standards explicit
Corner Case 2.2.2: AI creates headcount redundancy
Scenario:
- After AI takes over, a five-person team has work for only two
- What happens to the other three?
- The team fears layoffs
- And resists AI adoption
Friction:
- The team resists AI
- Cooperates passively
- May even sabotage AI's work
Mitigation:
1. **Explicit commitment** — no layoffs; redeploy people to higher-value work
2. **Retraining programs** — train staff for work AI cannot do (architecture, innovation)
3. **Natural attrition** — reduce headcount through natural turnover
4. **New business expansion** — use the freed capacity to open new business lines
5. **Transparent communication** — make clear AI's goal is efficiency, not layoffs
2.3 Management Resistance
Corner Case 2.3.1: Middle managers lose their sense of control
Scenario:
- Before: managers assigned tasks, tracked progress, reviewed code
- AI era: AI assigns tasks, tracks progress, reviews code
- Managers don't know what to do each day
- They feel their value has evaporated
Friction:
- Managers resist AI
- They erect obstacles ("requires manual approval")
- Work reverts to the old ways
Mitigation:
1. **Redefine the role** — from "task assigner" to "goal definer"
2. **New skills training** — train AI collaboration, strategic planning, talent development
3. **New value points** — focus on what AI cannot do (cross-team coordination, strategy)
4. **Success stories** — show the new value of managers in the AI era
5. **Executive backing** — visibly support managers through the transition
Corner Case 2.3.2: Budget allocation conflicts
Scenario:
- AI infrastructure needs budget (LLM tokens, storage, compute)
- Traditional IT budgets are being cut
- Departments fight over budget
- The AI project's budget gets slashed
Friction:
- The AI project cannot proceed
- Infrastructure is inadequate
- Progress is slow
Mitigation:
1. **ROI proof** — demonstrate AI's ROI with data (efficiency gains, cost savings)
2. **Incremental investment** — start small; add funding once value is proven
3. **Cost sharing** — spread AI infrastructure costs across the beneficiary departments
4. **Executive backing** — leadership declares AI a strategic investment
5. **Competitive benchmarking** — show competitors' AI investment to create urgency
3. Process Resistance
3.1 Long Approval Chains
Corner Case 3.1.1: AI deployments need multi-layer approval
Scenario:
- AI finishes development and wants to deploy to production
- Approvals required: Tech Lead → Manager → Director → VP
- Each layer takes 1-2 days
- Deployment cycle: 1-2 weeks
Friction:
- AI's efficiency is cancelled out by the approval chain
- AI's rapid-iteration advantage cannot be realized
- Humans become the bottleneck
Mitigation:
1. **Tiered approvals** — auto-approve low-risk changes; humans approve high-risk ones
2. **Approval automation** — AI prepares the approval packet and routes it automatically
3. **Approval SLO** — set an approval deadline (within 24 hours)
4. **Trust accumulation** — once AI's deployment success rate exceeds 99%, reduce approval layers
5. **Post-hoc audit** — move from pre-approval to after-the-fact auditing
Corner Case 3.1.2: Change Advisory Board (CAB) approval
Scenario:
- Production changes require CAB approval
- The CAB meets once a week
- AI produces 100 changes a week
- The CAB cannot keep up
Friction:
- The change backlog grows
- AI cannot deploy
- Process becomes the bottleneck
Mitigation:
1. **CAB automation** — AI prepares change materials; the CAB approves remotely
2. **Standard changes skip review** — pre-approved change types (tests pass, rollback plan exists) skip review
3. **CAB delegation** — the CAB delegates low-risk changes to the AI
4. **Change tiering** — the CAB approves high-risk changes; low-risk changes auto-approve
5. **Process redesign** — redesign the change process around AI's capabilities
3.2 Compliance Conflicts
Corner Case 3.2.1: AI-generated code needs compliance review
Scenario:
- Finance/healthcare have strict compliance requirements
- Code must pass compliance review before shipping
- The review takes 2-4 weeks
- AI generates code faster than review can absorb
Friction:
- AI output piles up
- Compliance becomes the bottleneck
- AI's efficiency advantage is cancelled out
Mitigation:
1. **Codify compliance rules for AI** — turn compliance rules into AI-executable checks
2. **AI self-review** — AI checks compliance as it generates code
3. **Compliance pre-approval** — the compliance team pre-approves AI code templates
4. **Sampling review** — shift from reviewing everything to sampling
5. **Compliance automation** — use AI to auto-generate compliance documents
Corner Case 3.2.2: Audits require code traceability
Scenario:
- Audit requirement: every line of code must have a known author and rationale
- For AI-generated code, the "author" is the AI
- Auditors do not accept that
- Compliance risk
Friction:
- AI code fails audits
- Humans must "vouch" for it
- Added human cost
Mitigation:
1. **Joint AI+human attribution** — AI generates; a human reviews and co-signs
2. **Update audit rules** — work with the audit team to adapt the rules for AI
3. **AI decision logs** — record the AI's reasoning (why it wrote what it wrote)
4. **Human final responsibility** — humans take final responsibility for AI code
5. **Industry advocacy** — push industry standards to recognize AI-generated code
3.3 Complex Release Processes
Corner Case 3.3.1: Coordinated multi-product releases
Scenario:
- 10 products must release together
- The products have interdependencies
- Release order: A → B → C → ...
- Coordinating 10 teams takes 2 weeks
Friction:
- AI cannot coordinate cross-team releases
- Manual coordination remains the bottleneck
- Long release cycles
Mitigation:
1. **AI release orchestration** — AI analyzes dependencies and auto-generates the release plan
2. **Independent releases** — refactor for independent releasability (reduce coupling)
3. **Release automation** — AI executes the release pipeline
4. **Unified release windows** — align release windows to reduce coordination
5. **Progressive rollout** — use feature flags to release gradually and reduce lockstep
4. Security Resistance
4.1 Code Security
Corner Case 4.1.1: AI introduces security vulnerabilities
Scenario:
- AI-generated code contains a SQL injection vulnerability
- It is discovered after going live
- The security team demands human review of all AI code
- AI's efficiency advantage is cancelled out
Friction:
- The security team distrusts AI
- Human review becomes the bottleneck
- AI code gets second-class treatment
Mitigation:
1. **Security-aware AI** — train the AI on secure-code datasets
2. **Automated security scanning** — run security scans automatically on AI-generated code
3. **Codify security rules for AI** — turn security rules into AI-executable checks
4. **AI security audit** — have a second AI review the first AI's code (AI vs AI)
5. **Gradual trust** — reduce human review once AI's security record is proven
Corner Case 4.1.2: AI access to sensitive code
Scenario:
- AI needs access to core algorithm code
- The core algorithms are trade secrets
- Fear of leaks (model training may memorize code)
- The security team blocks access
Friction:
- AI cannot access core code
- Core modules cannot be AI-enabled
- The mono-repo is incomplete
Mitigation:
1. **Local AI models** — use on-prem models for core code (never uploaded to the cloud)
2. **Code redaction** — redact sensitive logic before AI access
3. **Access auditing** — log all AI access for traceability
4. **AI isolation** — isolate the AI handling sensitive code from other AIs
5. **Legal safeguards** — sign confidentiality agreements with AI vendors
4.2 Data Security
Corner Case 4.2.1: AI access to production data
Scenario:
- AI needs production data for analysis
- Production data contains user PII
- Data-compliance requirements apply (GDPR, personal-data protection laws)
- The security team blocks AI access
Friction:
- AI cannot access real data
- AI analysis is inaccurate
- Value is limited
Mitigation:
1. **Data anonymization** — strip PII before AI access
2. **Synthetic data** — use AI to generate synthetic data matching the real distribution
3. **Data isolation** — AI accesses data only in an isolated environment
4. **Access auditing** — log all AI data access
5. **Compliance pre-approval** — obtain compliance sign-off up front
5. Cultural Resistance
5.1 Engineering-Culture Conflicts
Corner Case 5.1.1: Engineers consider AI code "impure"
Scenario:
- Veteran engineers see "code as an art"
- AI-generated code "has no soul"
- They refuse to use AI code
- Some even resist AI tooling
Friction:
- Cultural pushback
- Passive use of AI
- Slows AI adoption
Mitigation:
1. **Redefine the "art"** — the art of code lies in solving problems, not hand-writing it
2. **Success cases** — showcase high-quality AI-generated code
3. **AI + human collaboration** — AI drafts, humans polish (preserving the craft)
4. **Generational difference** — younger engineers adopt AI more readily
5. **Let time prove it** — AI code quality speaks for itself over time
Corner Case 5.1.2: Engineers fear being replaced by AI
Scenario:
- Engineers hear "AI will replace programmers"
- They fear job loss
- They resist AI tooling
- Some even deliberately obstruct the AI
Friction:
- Human-made obstacles
- Passive cooperation
- Sabotage of AI's work
Mitigation:
1. **Explicit commitment** — no layoffs; redeploy to higher-value work
2. **Reposition AI** — AI is an assistant, not a replacement
3. **Upskilling** — train engineers for work AI cannot do
4. **Success cases** — show how AI has elevated engineers
5. **Transparent communication** — regularly communicate the AI strategy and staffing plans
5.2 Management-Culture Conflicts
Corner Case 5.2.1: Managers believe AI is uncontrollable
Scenario:
- Managers are used to controlling the details
- AI decides autonomously; managers cannot control the details
- They feel they have lost control
- They demand human approval at every AI step
Friction:
- AI autonomy is throttled
- The efficiency advantage is cancelled out
- Work reverts to the old ways
Mitigation:
1. **Redefine "control"** — from controlling the process to controlling the goals
2. **Transparent decisions** — the AI logs its decision process for traceability
3. **Management by exception** — AI handles the routine; humans handle exceptions
4. **Trust building** — delegate gradually as the AI proves its reliability
5. **Manager training** — teach AI-era management skills
6. Corner Cases in Day-to-Day Operations
6.1 AI-Related
Corner Case 6.1.1: Model upgrades change AI behavior
Scenario:
- The AI model upgrades from v1 to v2
- v2 generates code in a different style
- v2 fixes bugs differently
- Teams are confused about which version to use
Friction:
- Inconsistent behavior
- Teams lose trust in the AI
- Version management gets complicated
Mitigation:
1. **Version pinning** — teams can pin the AI version
2. **Gradual upgrades** — test v2 on a small scope before rolling out
3. **Changelogs** — the AI generates a changelog for each version
4. **Rollback mechanism** — quickly roll back to v1 if v2 misbehaves
5. **A/B testing** — compare v1 and v2 output quality
Corner Case 6.1.2: AI hallucinations
Scenario:
- The AI generates calls to nonexistent APIs
- The AI generates wrong dependencies
- The AI generates fake documentation references
- The code does not compile
Friction:
- Humans must review all AI code
- The AI's reputation suffers
- The efficiency advantage is cancelled out
Mitigation:
1. **Compile checks** — automatically compile and verify AI output
2. **Fact checking** — a second AI verifies the first AI's output
3. **Constrained generation** — restrict the AI to known APIs
4. **Human spot checks** — humans review frequently changed code
5. **Continuous improvement** — feed hallucination cases back into AI training
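The compile-check mitigation above can start as a one-function gate. A minimal sketch for Python output, using the built-in `compile()` (a real pipeline would go further and also build, import-check, and run linters):

```python
def passes_compile_check(source: str) -> bool:
    """Gate for AI-generated Python: reject output that doesn't even parse."""
    try:
        compile(source, "<ai-generated>", "exec")
        return True
    except SyntaxError:
        return False

good = "def add(a, b):\n    return a + b\n"
bad = "def add(a, b)\n    return a + b\n"   # missing colon -> SyntaxError

print(passes_compile_check(good), passes_compile_check(bad))
```

Parsing only catches syntactic hallucinations; calls to nonexistent APIs still require import checks or a constrained-generation allowlist.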
6.2 Infrastructure-Related
Corner Case 6.2.1: AI infrastructure outages
Scenario:
- The OpenClaw service goes down
- Sub-agents cannot be created
- All AI work halts
- Humans no longer know how to take over
Friction:
- R&D stalls
- Humans cannot take over (they have grown accustomed to AI)
- Business impact
Mitigation:
1. **High-availability architecture** — deploy OpenClaw across multiple instances
2. **Degraded mode** — fall back to manual processes automatically when the AI fails
3. **Human training** — train humans to take over during AI outages
4. **Failure drills** — regularly rehearse AI-outage scenarios
5. **Backup plan** — keep a standby AI service ready
Corner Case 6.2.2: Token budget overruns
Scenario:
- AI usage exceeds projections
- The token budget overruns by 50%
- Finance demands cuts in AI usage
- The AI project faces a budget crisis
Friction:
- AI usage is constrained
- The project slows down
- Confidence is damaged
Mitigation:
1. **Usage monitoring** — monitor token usage in real time; alert before overruns
2. **Optimization** — optimize AI usage (caching, batching, model selection)
3. **ROI proof** — use ROI data to argue for more budget
4. **Cost sharing** — spread AI costs across the beneficiary departments
5. **Budget adjustment** — revise the budget to match reality
7. Mitigation Strategy Summary
7.1 Resistance Categories
| Resistance Type | Share | Characteristics | Difficulty |
|---|---|---|---|
| Technical | 20% | Explicit, quantifiable | Low |
| Organizational | 30% | Hidden, emotional | Medium |
| Process | 20% | Institutionalized, inertial | Medium |
| Security | 15% | Compliance, risk | High |
| Cultural | 15% | Deep-rooted, long-term | High |
7.2 General Principles
Principle 1: Transparent communication
- Make the AI strategy and goals explicit
- Communicate progress and challenges regularly
- Be candid about problems and failures
Principle 2: Gradual change
- Pilot first, then scale
- Low risk first, then high risk
- Voluntary first, then mandatory
Principle 3: Prove the value
- Demonstrate AI value with data
- Build confidence with case studies
- Win resources with ROI
Principle 4: People first
- Commit to no layoffs
- Provide retraining
- Redeploy people to higher-value work
Principle 5: Executive backing
- Explicit leadership support
- Guaranteed resources
- An escalation channel for resistance
8. Resistance-Mitigation Checklist
8.1 Technical readiness
- Codebase dependency analysis complete
- Build-system unification plan chosen
- Test-coverage baseline established
- High-availability design for AI infrastructure
- Monitoring and alerting in place
8.2 Organizational readiness
- Core teams understand and back the AI strategy
- Performance-evaluation system updated
- Staff-transition plan in place
- Budget and resources secured
- Escalation channel for resistance established
8.3 Process readiness
- Approval processes adapted for AI
- Compliance processes adapted for AI
- Release processes automated
- Change-management process updated
- Incident-response process integrated with AI
8.4 Security readiness
- AI security scanning integrated
- Code access control designed
- Data-anonymization plan chosen
- Audit processes adapted for AI
- Compliance approvals obtained
8.5 Cultural readiness
- AI strategy communicated to everyone
- Success stories collected and publicized
- Engineer AI training complete
- Manager AI training complete
- Mechanism to monitor resistance sentiment in place
9. Conclusion
Core insights:
- Technical resistance is only 20% — 80% comes from organization, process, and culture
- Hidden resistance is harder than visible resistance — culture, emotion, and trust are the hardest problems
- Communication matters more than technology — transparent communication dissolves most resistance
- People first is the key — protect people's interests and resistance naturally shrinks
- Executive backing is the safeguard — without leadership support, resistance cannot be overcome
Action items:
- Identify resistance early — use this document as a checklist
- Plan responses — every form of resistance has a mitigation
- Monitor continuously — resistance is dynamic; keep monitoring and responding
- Adjust flexibly — adapt the strategy to the resistance actually encountered
- Be patient — cultural change takes time (6-12 months)
Corner Cases & Mitigation: Resistance to AI-Era Migration and How to Respond
2026-03-01 | Large-scale Agentic Engineering Team
Google Monorepo Lessons Learned
Key Insights from Google’s 2 Billion Line Monorepo
Research summary for TiDB Mono-Repo Consolidation Project
Scale Comparison
| Metric | Google | TiDB Target |
|---|---|---|
| Lines of Code | 2 billion | TBD (~39 GB of code) |
| Engineers | 25,000+ | TBD |
| Commits/day | 45,000 | TBD |
| Files | 9 million | TBD |
| Storage | 86 TB | 39 GB |
Key Insight: Google proves that a monorepo scales to extreme levels with the right tooling.
Core Principles (Google’s Playbook)
1. Single Source of Truth
✅ ONE repository for 95% of codebase
✅ No submodules
✅ No complex cross-repo dependency graphs
✅ No "which version should I use?" problems
TiDB Application: All 400 repos → 1 mono-repo
2. Trunk-Based Development
main (trunk)
│
├── Developers commit directly to main
├── Code review BEFORE merge (pre-commit)
├── Release branches for deployment only
└── Feature flags for incomplete features
Benefits:
- No merge nightmares from long-lived branches
- Early integration conflict detection
- Continuous delivery enabled
TiDB Application: Adopt trunk-based from day 1
3. Code Ownership & Visibility
Default: OPEN ACCESS
- All engineers can read all code
- Traceability built-in
- Exceptions: restricted files (security, legal)
Ownership: Workspace-based
- Each directory has owning team
- Responsible engineer identified
- CODEOWNERS enforcement
TiDB Application:
- Default open access within engineering
- CODEOWNERS file for each component
- Clear ownership boundaries
4. Build System: Bazel
Key Features:
- Incremental builds (only changed targets)
- Remote caching (share build artifacts)
- Parallel execution
- Dependency graph analysis
- Hermetic builds (reproducible)
Why It Matters:
- 2B LOC builds in minutes, not hours
- Developers get fast feedback
- CI/CD scales efficiently
TiDB Application:
- Evaluate: Bazel vs Turborepo vs Nx
- Depends on tech stack (Go/Java/TS?)
- Must support incremental builds
5. Dependency Management
Google's Approach:
- All dependencies visible in one graph
- No circular dependencies (enforced)
- Breaking changes caught immediately
- Automated dependency updates
Tooling:
- Static analysis for dependency detection
- Automated refactoring for API changes
- Impact analysis before changes
TiDB Application:
- Map all 400 repos’ dependencies
- Identify circular dependencies early
- Build dependency visualization tool
6. Automated Code Review
Pre-commit Review:
- All changes reviewed before merge
- Automated checks (lint, tests, security)
- Human review for logic/approval
- OWNERS file defines reviewers
Scale Solution:
- Automated systems make 24,000 commits/day
- 500,000 requests/second to review system
- Most commits are automated (refactoring, cleanup)
TiDB Application:
- Automated PR checks (CI/CD)
- CODEOWNERS for review assignment
- AI-assisted code review (future)
7. Infrastructure: Piper + CitC
Piper (Version Control):
- Custom distributed filesystem
- Handles 86TB efficiently
- Supports 40,000 commits/day
CitC (Client in the Cloud):
- Lightweight checkout
- Downloads only modified files
- Cloud-based browsing/editing
CodeSearch:
- Fast search across entire codebase
- Cross-workspace search
- IDE integration (Eclipse, Emacs plugins)
TiDB Application:
- Use Git (not custom VCS)
- Shallow clones for agents
- Implement fast code search (Sourcegraph/Zoekt)
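For the agent-side checkout, git's partial-clone and sparse-checkout features approximate CitC's partial checkout. A minimal sketch of the command sequence (the repository URL and paths are hypothetical):

```python
import subprocess

def sparse_clone_cmds(url: str, dest: str, paths: list[str]) -> list[list[str]]:
    """Build the git commands for a blobless, shallow, sparse clone that
    materializes only the listed directories."""
    return [
        ["git", "clone", "--filter=blob:none", "--sparse", "--depth=1", url, dest],
        ["git", "-C", dest, "sparse-checkout", "set", *paths],
    ]

def sparse_clone(url: str, dest: str, paths: list[str]) -> None:
    for cmd in sparse_clone_cmds(url, dest, paths):
        subprocess.run(cmd, check=True)  # invokes git; requires network access

# Example: an agent that only needs the tiup component
cmds = sparse_clone_cmds("https://github.com/example/mono-repo.git",
                         "work/mono", ["tools/tiup"])
```

With `--filter=blob:none` the clone fetches commits and trees but downloads file contents lazily, so a 39 GB mono-repo stays cheap for an agent that touches one directory.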
Google’s Monorepo Challenges & Solutions
| Challenge | Google’s Solution | TiDB Application |
|---|---|---|
| Download time | CitC (partial checkout) | Shallow clones, sparse checkout |
| Slow search | CodeSearch engine | Sourcegraph / Zoekt |
| Build time | Bazel (incremental) | Bazel/Turborepo/Nx |
| Dependency hell | Single version, automated updates | Dependency graph tooling |
| Code review scale | Automated pre-checks + OWNERS | GitHub/GitLab CODEOWNERS |
| Merge conflicts | Trunk-based, small commits | Trunk-based development |
| Access control | Default open, exceptions restricted | Directory-based permissions |
AI-Specific Opportunities (Beyond Google)
Google built their system before AI was mainstream. We have an advantage:
What Google Does (Human-Centric)
Human engineers:
- Write code
- Review code
- Fix dependencies
- Run builds
- Deploy services
Automation:
- Code formatting
- Dependency updates
- Build optimization
- Test execution
What We Can Do (AI-First)
AI agents:
- Write code (feature development)
- Review code (automated PR review)
- Fix dependencies (automated refactoring)
- Optimize builds (AI-driven caching)
- Deploy services (auto-scaling decisions)
Humans:
- Define problems
- Set priorities
- Review architecture
- Handle edge cases
Key Difference: Google automated processes. We can automate decisions.
Recommended Architecture for TiDB
Layer 1: Repository Structure
mono-repo/
├── products/ # TiDB, TiDB Next-Gen
├── platform/ # Cloud SaaS, control plane
├── devops/ # Operations tools
├── libs/ # Shared libraries
├── tools/ # Build/dev tools
└── infra/ # Infrastructure as code
Layer 2: Build System
Recommendation: Evaluate based on tech stack
- Go: Bazel or Please
- TypeScript: Turborepo or Nx
- Java: Bazel or Gradle
- Mixed: Bazel (most flexible)
Layer 3: Code Ownership
CODEOWNERS file:
- products/tidb/* @tidb-core-team
- platform/cloud/* @cloud-platform-team
- devops/* @devops-team
- libs/* @platform-architects
Layer 4: CI/CD
Path-based triggering:
- Changes to products/tidb/* → Run TiDB tests
- Changes to platform/* → Run platform tests
- Changes to libs/* → Run all tests (shared code)
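The path-based triggering rules above reduce to a small dispatch function. A sketch (the prefixes and suite names are illustrative, not an actual CI config):

```python
# Illustrative prefix -> suite rules mirroring the triggering table above.
RULES = [
    ("products/tidb/", {"tidb"}),
    ("platform/", {"platform"}),
    ("libs/", {"all"}),  # shared code: run the full suite
]

def suites_for(changed_files: list[str]) -> set[str]:
    """Return the set of test suites a change set must trigger."""
    suites: set[str] = set()
    for path in changed_files:
        for prefix, targets in RULES:
            if path.startswith(prefix):
                suites |= targets
    return suites
```

A real CI system (GitHub Actions path filters, Bazel query) would derive the affected targets from the dependency graph rather than fixed prefixes, but the triggering logic is the same shape.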
Layer 5: AI Agent Integration
400+ Repo Agents:
- Each agent owns one legacy repo
- Agents analyze, recommend, migrate
- Post-migration: agents become component guardians
Orchestrator Agent:
- Coordinates agents
- Makes cross-component decisions
- Optimizes system-wide
Migration Strategy (Google-Inspired)
Phase 1: Analysis (Week 1-2)
- Inventory all 400 repos
- Map dependencies
- Identify owners
- Score by activity/usage
Phase 2: Infrastructure (Week 2-3)
- Set up mono-repo structure
- Configure build system
- Set up CI/CD with path filtering
- Implement CODEOWNERS
Phase 3: Pilot Migration (Week 3-4)
- Migrate 10-20 repos (P0 priority)
- Validate build/test/deploy
- Refine process
Phase 4: Bulk Migration (Week 4-8)
- Migrate remaining repos in batches
- Automated refactoring where possible
- Archive old repos
Phase 5: AI Enablement (Week 8+)
- Deploy agent infrastructure
- Enable AI code review
- Enable AI-driven refactoring
- Enable AI deployment optimization
Success Metrics (Inspired by Google)
| Metric | Target |
|---|---|
| Build time (incremental) | <5 minutes |
| Build time (full) | <30 minutes |
| PR review time | <4 hours |
| Merge conflicts/week | <10 |
| AI-completed features | 20% (6mo), 50% (12mo) |
| Automated refactoring/week | 100+ |
Key Takeaways
- Monorepo scales — Google proves 2B+ LOC is viable
- Tooling is critical — Can’t do this without proper build/search/review tools
- Culture matters — Trunk-based, open access, small commits
- Automation is key — Google’s automation does 24k commits/day
- AI is our advantage — We can go beyond Google’s human-centric model
Sources:
- https://cacm.acm.org/research/why-google-stores-billions-of-lines-of-code-in-a-single-repository/
- https://qeunit.com/blog/how-google-does-monorepo/
- https://medium.com/@sohail_saifi/the-monorepo-strategy-that-scaled-google-to-2-billion-lines-of-code
- https://bazel.build/
PingCAP Top 10 Repos Analysis
Sample Analysis for Mono-Repo Consolidation Validation
Analysis date: 2026-02-28
Top Repositories by Stars
| # | Repository | Stars | Forks | Language | Size (KB) | Created | Last Push | Fork? | Category |
|---|---|---|---|---|---|---|---|---|---|
| 1 | tidb | 39,859 | 6,126 | Go | 652,429 | 2015-09 | 2026-02-28 | No | Product |
| 2 | ossinsight | 2,320 | 411 | TypeScript | 642,471 | 2022-01 | 2026-02-22 | No | Tool |
| 3 | autoflow | 2,740 | 176 | TypeScript | N/A | N/A | 2026-02-28 | No | Product |
| 4 | tidb-operator | 1,322 | 529 | Go | 101,136 | 2018-08 | 2026-02-27 | No | Platform |
| 5 | docs | 616 | 707 | Python | 410,671 | 2016-07 | 2026-02-27 | No | Docs |
| 6 | tidb-vector-python | 61 | 17 | Python | N/A | N/A | 2025-12-27 | No | SDK |
| 7 | ticdc | 45 | 40 | Go | N/A | N/A | 2026-02-27 | No | Product |
| 8 | tiflow | 454 | 298 | Go | 163,035 | 2019-08 | 2026-02-26 | No | Product |
| 9 | tiup | 463 | N/A | Go | 15,476 | N/A | N/A | No | Tool |
| 10 | tidb-dashboard | 198 | N/A | TypeScript | 34,146 | N/A | N/A | No | Tool |
Forked Repos (Third-party)
| Repository | Stars | Language | Purpose |
|---|---|---|---|
| agfs | 0 | C++ | Aggregated File System (Plan 9 tribute) |
| tantivy | 0 | Rust | Full-text search engine (Lucene alternative) |
| sarama | 0 | N/A | Kafka client library |
Repository Categories
Products (Core Database)
tidb/ - Main database engine (652 MB, 39.8k stars)
tiflow/ - DM + TiCDC (163 MB, 454 stars)
ticdc/ - Change data capture (active)
autoflow/ - Graph RAG knowledge base (2.7k stars)
Platform (Kubernetes/Cloud)
tidb-operator/ - K8s operator (101 MB, 1.3k stars)
Tools
tiup/ - Package manager (15 MB, 463 stars)
tidb-dashboard/ - Web dashboard (34 MB, TypeScript)
ossinsight/ - OSS analytics (642 MB, 2.3k stars)
Documentation
docs/ - Documentation (411 MB, 616 stars)
SDKs/Libraries
tidb-vector-python/ - Python SDK for vector operations
pytidb/ - Python client (30 stars)
Forked Dependencies
agfs/ - File system (C++, fork)
tantivy/ - Search engine (Rust, fork)
sarama/ - Kafka client (Go, fork)
Key Insights for Mono-Repo Consolidation
1. Tech Stack Distribution
Go: 6 repos (tidb, tiflow, ticdc, tiup, tidb-operator, forks)
TypeScript: 3 repos (ossinsight, autoflow, tidb-dashboard)
Python: 2 repos (docs, tidb-vector-python)
Rust: 1 repo (tantivy - fork)
C++: 1 repo (agfs - fork)
Implication: Multi-language build system required (Bazel recommended)
2. Repository Sizes
| Size Category | Repos | Total Size |
|---|---|---|
| >500 MB | tidb, ossinsight | ~1.3 GB |
| 100-500 MB | docs, tiflow | ~574 MB |
| 10-100 MB | tidb-operator, tidb-dashboard, tiup | ~151 MB |
| <10 MB | Others | ~50 MB |
| Total | 10 repos | ~2.1 GB |
Implication: the 10 largest repos total ~2 GB; since the remaining repos are mostly smaller, the ~39 GB estimate for 400 repos is plausible.
3. Activity Analysis
| Last Push | Count | Repos |
|---|---|---|
| Today (2026-02-28) | 2 | tidb, autoflow |
| This week | 5 | tidb-operator, docs, ticdc, tiflow, wordpress-plugin |
| This month | 2 | pytidb, full-stack-app-builder |
| Older | 1 | tidb_workload_analysis |
Implication: 80% of repos are actively maintained (good candidates for migration)
4. Dependency Relationships (Inferred)
tidb (core)
├── tidb-operator (depends on tidb)
├── tiflow (depends on tidb - CDC/DM)
├── ticdc (depends on tidb - CDC)
├── tiup (depends on tidb - package manager)
├── tidb-dashboard (depends on tidb - UI)
├── docs (documents tidb)
└── SDKs (tidb-vector-python, pytidb)
ossinsight (standalone tool)
autoflow (uses TiDB Serverless - could be separate)
Forks (external deps):
├── tantivy (search - optional dependency)
├── agfs (filesystem - experimental)
└── sarama (Kafka - for TiCDC)
Implication: Clear dependency graph. tidb is the root.
5. Merge Priority Assessment
| Priority | Repos | Rationale |
|---|---|---|
| P0 | tidb, tiflow, ticdc | Core product, active development |
| P1 | tidb-operator, tiup, tidb-dashboard | Platform/tooling, tight coupling |
| P2 | docs, SDKs | Documentation/SDKs, moderate coupling |
| P3 | ossinsight, autoflow | Standalone tools, loose coupling |
| P4 | Forks (tantivy, agfs, sarama) | Evaluate: keep upstream instead? |
Proposed Mono-Repo Structure (Based on 10 Repos)
pingcap-mono/
├── products/
│ ├── tidb/ # Main database (652 MB)
│ ├── tiflow/ # DM + TiCDC (163 MB)
│ └── ticdc/ # CDC (merged from tiflow?)
├── platform/
│ └── tidb-operator/ # K8s operator (101 MB)
├── tools/
│ ├── tiup/ # Package manager (15 MB)
│ ├── tidb-dashboard/ # Web UI (34 MB)
│ └── ossinsight/ # OSS analytics (642 MB)
├── products-experimental/
│ └── autoflow/ # Graph RAG (2.7k stars)
├── docs/
│ └── tidb-docs/ # Documentation (411 MB)
├── sdks/
│ ├── python/
│ │ ├── tidb-vector-python/
│ │ └── pytidb/
│ └── ...
├── libs/
│ ├── tantivy/ # Search (fork - evaluate upstream)
│ ├── agfs/ # Filesystem (fork - evaluate)
│ └── sarama/ # Kafka client (fork - evaluate)
└── infra/
└── ...
Validation: Does Mono-Repo Make Sense?
✅ Pros (Confirmed from Analysis)
1. Clear Dependency Graph
- tidb is the root; everything else depends on it
- Mono-repo makes dependencies explicit and manageable
2. Shared Tech Stack
- 60% Go, 30% TypeScript, 10% Python/other
- Bazel can handle all these languages
3. Active Development
- 80% of repos pushed this week
- Trunk-based development feasible
4. Size Manageable
- 10 repos = ~2GB
- 400 repos = ~39GB (within Google's lessons)
5. Tooling Overlap
- Multiple tools (tiup, dashboard) share common needs
- Shared libraries possible in mono-repo
⚠️ Challenges (Confirmed from Analysis)
1. Forked Dependencies
- tantivy, agfs, sarama are forks
- Decision: keep in mono-repo or use upstream + patches?
2. Standalone Tools
- ossinsight, autoflow are loosely coupled
- May not benefit from mono-repo
3. Multi-Language Build
- Go + TypeScript + Python + Rust + C++
- Requires a sophisticated build system (Bazel)
4. Repo Size Variance
- tidb (652 MB) vs tiup (15 MB)
- Sparse checkout needed for efficient workflows
Recommendations (Based on Sample)
1. Migration Strategy Validation
Phase 1 (P0): tidb + tiflow + ticdc
- Core product, clear dependencies
- ~800 MB total
Phase 2 (P1): tidb-operator + tiup + tidb-dashboard
- Platform/tooling
- ~150 MB total
Phase 3 (P2): docs + SDKs
- Documentation/SDKs
- ~500 MB total
Phase 4 (P3): ossinsight + autoflow
- Evaluate: Keep separate or merge?
Phase 5 (P4): Forks
- Decision: Upstream + patches vs keep in mono-repo
2. Build System Choice
Recommendation: Bazel
Reasons:
- Multi-language support (Go, TS, Python, Rust, C++)
- Incremental builds (critical for 39GB repo)
- Remote caching (team-scale builds)
- Used by Google for 2B LOC monorepo
3. Code Ownership Structure
# Core Product
products/tidb/* @tidb-core-team
products/tiflow/* @tiflow-team
products/ticdc/* @ticdc-team
# Platform
platform/tidb-operator/ @k8s-platform-team
# Tools
tools/tiup/ @tooling-team
tools/tidb-dashboard/ @dashboard-team
tools/ossinsight/ @ossinsight-team
# Documentation
docs/* @docs-team @devrel-team
# SDKs
sdks/python/* @sdk-team
# Forked Libraries (high scrutiny)
libs/* @platform-architects @legal-review
Next Steps (Full 400-Repo Analysis)
1. Automated Inventory
- Script to fetch all 400 repos via GitHub API
- Extract: stars, forks, language, size, last push, dependencies
2. Dependency Mapping
- Analyze go.mod, package.json, requirements.txt
- Build dependency graph
- Identify circular dependencies
3. Activity Scoring
- Commits last 30/90/365 days
- Open PRs, issues
- Active maintainers
4. Merge Recommendation Engine
- Score each repo: Keep/Migrate/Archive/Fork
- Priority ranking
- Effort estimation
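The automated inventory can start from a small per-repo summarizer plus an activity score. A sketch: the field names follow the GitHub REST API repository object, while the scoring thresholds are assumptions, not a finalized rubric:

```python
import datetime as dt

def summarize(repo: dict) -> dict:
    """Reduce one GitHub API repository object to the inventory fields used above."""
    return {
        "name": repo["name"],
        "stars": repo.get("stargazers_count", 0),
        "forks": repo.get("forks_count", 0),
        "language": repo.get("language") or "N/A",
        "size_kb": repo.get("size", 0),
        "pushed_at": repo.get("pushed_at"),
        "is_fork": repo.get("fork", False),
    }

def activity_score(pushed_at: str, today: dt.date) -> int:
    """3 = pushed this week, 2 = this quarter, 1 = this year, 0 = stale."""
    days = (today - dt.date.fromisoformat(pushed_at[:10])).days
    if days <= 7:
        return 3
    if days <= 90:
        return 2
    if days <= 365:
        return 1
    return 0

row = summarize({"name": "tidb", "stargazers_count": 39859, "language": "Go",
                 "size": 652429, "pushed_at": "2026-02-28T12:00:00Z"})
```

Paging through `GET /orgs/{org}/repos` and feeding each object into `summarize` reproduces the table above for all 400 repos; the activity score then drives the Keep/Migrate/Archive recommendation.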
Conclusion
This 10-repo sample validates the mono-repo consolidation approach:
- ✅ Clear dependency hierarchy (tidb at root)
- ✅ Manageable tech stack (Go/TS/Python dominant)
- ✅ Active development (trunk-based feasible)
- ✅ Size within reasonable bounds (~2GB for 10 repos)
- ✅ Google’s monorepo lessons apply
Key Decision Points:
- How to handle forked dependencies?
- Should standalone tools (ossinsight, autoflow) be in mono-repo?
- What’s the build system? (Bazel recommended)
Confidence Level: High. The sample confirms the approach is sound. Full 400-repo analysis should proceed.
Analysis performed via GitHub API on 2026-02-28
1000 Agent Platform
“1000 cages, 1000 AIs, producing high-value outputs”
A large-scale Agentic operating system for managing 1000 AI Agents working in parallel across four scenarios: operations, engineering, corporate operations, and investment management.
🎯 Four Application Scenarios
| Application | URL | Description |
|---|---|---|
| 1000 Agent Space | http://1000-agent-space.agents-dev.com/ | Parallel production incident resolution platform |
| 1000 Agent Engineering | https://1000-agent-engineering.spaces.agents-dev.com/ | Autonomous mono-repo convergence platform |
| 1000 Agent CorpUnit | https://1000-agent-corp-unit.spaces.agents-dev.com/ | AI-driven corporate brain |
| 1000 Invested AI Company | https://1000-invested-ai-company.spaces.agents-dev.com/ | Portfolio management dashboard |
📚 Documentation Navigation
| Document | Description |
|---|---|
| ARCHITECTURE.md | System architecture overview |
| FRONTEND-DESIGN.md | Frontend interaction design |
| CAGE-DESIGN.md | Detailed Cage (Agent container) design |
🏗️ Core Architecture
┌─────────────────────────────────────────────────────────────────┐
│ 1000 Agent Platform │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Frontend Layer (4 Apps) │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ Space │ │Engineering│ │ CorpUnit │ │Investment│ │
│ └──────────┘ └──────────┘ └──────────┘ └──────────┘ │
│ │ │
│ ▼ │
│ API Gateway Layer (Auth, Rate Limit, WebSocket) │
│ │ │
│ ▼ │
│ Core Services Layer │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ Orchestrator│ │ Scheduler │ │ State Mgr │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ │
│ │ │
│ ▼ │
│ Agent Execution Layer (1000 Cages) │
│ ┌─────┐ ┌─────┐ ┌─────┐ ... ┌─────┐ ┌─────┐ ┌─────┐ │
│ │#001 │ │#002 │ │#003 │ │#998 │ │#999 │ │#1000│ │
│ └─────┘ └─────┘ └─────┘ └─────┘ └─────┘ └─────┘ │
│ │
└─────────────────────────────────────────────────────────────────┘
🚀 Quick Start
Local Development
# Clone the repository
git clone https://github.com/your-org/1000-agent-platform.git
cd 1000-agent-platform
# Start the development environment (Docker Compose)
docker-compose up -d
# Open the local development environment
open http://localhost:3000
Production Deployment
# Deploy to Kubernetes
kubectl apply -f k8s/namespace.yaml
kubectl apply -f k8s/orchestrator.yaml
kubectl apply -f k8s/cage-operator.yaml
kubectl apply -f k8s/frontend.yaml
# Check deployment status
kubectl get pods -n agent-platform
kubectl get cages -n agent-platform
📊 Core Metrics
| Metric | Target | Current |
|---|---|---|
| Total Agents | 1000 | 0 |
| Active Agents | 850+ | 0 |
| Auto-resolution Rate | >70% | - |
| Avg MTTR | <10 minutes | - |
| Repos Merged | 400 → 1 | 0/400 |
| Daily Tasks | 50,000+ | 0 |
| Daily Artifacts | 100,000+ | 0 |
💰 Cost Estimate
| Item | Monthly Cost |
|---|---|
| Compute (K8s) | $285,000 |
| Token consumption | $144,000 |
| Storage | $15,000 |
| Management overhead | $20,000 |
| Total | $464,000/month |
Unit costs:
- Per task: ~$0.93
- Per artifact: ~$0.46
🔑 Core Features
1. Isolated Agent Cages
- An independent execution environment per Agent
- Dedicated resource quotas (CPU, memory, GPU, tokens)
- Persistent state storage
- Independent health monitoring
2. Intelligent Task Scheduling
- Priority queues
- Capability matching
- Load balancing
- Automatic retries
3. Real-Time Observability
- WebSocket real-time push
- Second-level status updates
- Detailed metrics monitoring
- Alert notifications
4. High-Availability Design
- Automatic failure recovery
- Multi-AZ deployment
- Data backups
- Disaster recovery
🛠️ Tech Stack
Backend
- Runtime: Node.js 20+ / Python 3.11+
- API: REST + GraphQL + WebSocket
- Database: PostgreSQL + Redis + ClickHouse
- Message Queue: Kafka / RabbitMQ
- Orchestration: Kubernetes + Custom Operators
Frontend
- Framework: Next.js 14+
- UI: TailwindCSS + shadcn/ui
- State: Zustand
- Realtime: WebSocket + SWR
- Charts: Recharts + D3.js
Infrastructure
- Cloud: AWS / GCP / Aliyun
- K8s: EKS / GKE / ACK
- Monitoring: Prometheus + Grafana
- Logging: ELK / Loki
- CI/CD: GitHub Actions + ArgoCD
📈 Implementation Roadmap
Phase 1: Infrastructure (Week 1-4)
- Stand up the K8s cluster
- Deploy databases
- Build the monitoring stack
- CI/CD pipelines
Phase 2: Core Services (Week 5-8)
- Agent Orchestrator
- Task Scheduler
- State Manager
- Resource Allocator
Phase 3: Application Scenarios (Week 9-16)
- 1000 Agent Space (operations)
- 1000 Agent Engineering (engineering)
- 1000 Agent CorpUnit (corporate)
- 1000 Invested AI Company (investment)
Phase 4: Frontend (Week 17-20)
- Frontend development for all 4 apps
- Real-time data push
- Interaction polish
Phase 5: Scale-Up (Week 21-24)
- Performance optimization
- Security hardening
- Documentation completion
- Launch
🤝 Contributing
Development Workflow
1. Fork the repository
2. Create a feature branch (git checkout -b feature/my-feature)
3. Commit your changes (git commit -am 'Add my feature')
4. Push the branch (git push origin feature/my-feature)
5. Open a Pull Request
Code Standards
- Follow the ESLint / Prettier configuration
- Write unit tests (coverage >80%)
- Update the related documentation
📄 License
MIT License - see the LICENSE file
📞 Contact
- Project home: https://1000-agent-platform.agents-dev.com
- Docs: https://docs.1000-agent-platform.com
- Discord: https://discord.gg/1000agents
- Email: team@agents-dev.com
🙏 Acknowledgments
This project builds on the following open-source projects and technologies:
- OpenClaw - Agent orchestration framework
- Kubernetes - Container orchestration
- Next.js - React framework
- TailwindCSS - CSS framework
Built with ❤️ by the Agentic Engineering Team
1000 Agent Platform - Backend Architecture Design
🎯 Vision
Build a platform at scale: “1000 cages, 1000 AIs, producing high-value outputs”.
Four core application scenarios:
- 1000 Agent Space - Production Incident Resolution (operations loop)
- 1000 Agent Engineering - Autonomous Mono-Repo Convergence (AI software engineering)
- 1000 Agent CorpUnit - AI-Driven Corporate Brain
- 1000 Invested AI Company - Portfolio Management Dashboard
🏗️ System Architecture Overview
┌─────────────────────────────────────────────────────────────────────────┐
│ 1000 Agent Platform │
│                       (Large-Scale Agentic Operating System)             │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ Frontend Layer (4 Apps) │ │
│ │ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ │ │
│ │ │ Space │ │Engineering│ │ CorpUnit │ │Investment│ │ │
│ │ │  (Ops)   │ │  (Eng)   │ │  (Corp)  │ │ (Invest) │ │ │
│ │ └──────────┘ └──────────┘ └──────────┘ └──────────┘ │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ API Gateway Layer │ │
│ │ - Authentication & Authorization │ │
│ │ - Rate Limiting & Quotas │ │
│ │ - Request Routing & Load Balancing │ │
│ │ - WebSocket for Real-time Updates │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ Core Services Layer │ │
│ │ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌───────────┐ │ │
│ │ │ Agent │ │ Task │ │ State │ │ Resource │ │ │
│ │ │ Orchestrator│ │ Scheduler │ │ Manager │ │ Allocator │ │ │
│ │ └─────────────┘ └─────────────┘ └─────────────┘ └───────────┘ │ │
│ │ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌───────────┐ │ │
│ │ │ Code │ │ Incident │ │ Finance │ │ Portfolio │ │ │
│ │ │ Repository │ │ Manager │ │ Engine │ │ Analyzer │ │ │
│ │ └─────────────┘ └─────────────┘ └─────────────┘ └───────────┘ │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ Agent Execution Layer │ │
│ │ ┌─────────────────────────────────────────────────────────┐ │ │
│ │ │ 1000 Agent Containers (Cages) │ │ │
│ │ │ ┌─────┐ ┌─────┐ ┌─────┐ ... ┌─────┐ ┌─────┐ ┌─────┐ │ │ │
│ │ │ │ #001│ │ #002│ │ #003│ │ #998│ │ #999│ │#1000│ │ │ │
│ │ │ └─────┘ └─────┘ └─────┘ └─────┘ └─────┘ └─────┘ │ │ │
│ │ │ - Isolated environments │ │ │
│ │ │ - Dedicated resources │ │ │
│ │ │ - Persistent state │ │ │
│ │ │ - Health monitoring │ │ │
│ │ └─────────────────────────────────────────────────────────┘ │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ Infrastructure Layer │ │
│ │ - Kubernetes Cluster (Agent Pods) │ │
│ │ - Cloud Resources (AWS/GCP/Aliyun) │ │
│ │ - Storage (S3, Database, Cache) │ │
│ │ - Monitoring & Observability │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────┘
📦 Core Module Design
Module 1: Agent Orchestrator
Responsibilities: manage the lifecycle, state, and resource allocation of 1000 Agents
AgentOrchestrator:
responsibilities:
- Agent lifecycle management (spawn, pause, resume, terminate)
- Health monitoring & auto-recovery
- Resource allocation & scaling
- Inter-agent communication routing
- Performance metrics collection
components:
AgentRegistry:
description: "Maintains registration records for all 1000 Agents"
data:
- agent_id: "agent-001"
type: "space-guardian" # space|engineering|corpunit|investment
status: "active|idle|busy|blocked|error"
current_task: "task-12345"
resource_usage: { cpu: "0.5", memory: "512MB", tokens: "10000" }
last_heartbeat: "2026-03-01T10:00:00Z"
uptime: "72h"
output_count: 156 # cumulative outputs produced
AgentScheduler:
description: "Schedules Agent task execution"
strategies:
- round_robin: "Round-robin assignment"
- priority_based: "Priority-based assignment"
- capability_matching: "Capability matching"
- load_balancing: "Load balancing"
HealthMonitor:
description: "Monitors Agent health"
checks:
- heartbeat_timeout: "60s"
- error_rate_threshold: "5%"
- resource_exhaustion: "90%"
auto_recovery:
- restart_on_failure: true
- migrate_on_overload: true
- escalate_on_persistent_error: true
Module 2: Task Scheduler
Responsibilities: receive, decompose, assign, and track task execution
TaskScheduler:
task_types:
space:
- incident_detection: "Alert detection"
- incident_triage: "Alert triage"
- root_cause_analysis: "Root cause analysis"
- auto_remediation: "Automated remediation"
- human_escalation: "Human escalation"
engineering:
- repo_analysis: "Repository analysis"
- code_review: "Code review"
- refactoring: "Refactoring proposals"
- test_generation: "Test generation"
- merge_proposal: "Merge proposals"
corpunit:
- finance_analysis: "Financial analysis"
- hr_processing: "HR workflows"
- legal_review: "Legal review"
- market_research: "Market research"
- growth_optimization: "Growth optimization"
investment:
- company_screening: "Company screening"
- due_diligence: "Due diligence"
- valuation_model: "Valuation modeling"
- portfolio_rebalance: "Portfolio rebalancing"
- risk_assessment: "Risk assessment"
workflow_engine:
description: "Defines task execution workflows"
example:
incident_workflow:
- step1: detect (auto)
- step2: triage (auto)
- step3: analyze (auto)
- step4: remediate (auto | human_approval)
- step5: verify (auto)
- step6: close (auto)
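The incident workflow above can be sketched as an ordered pipeline where one step is gated on human approval. A minimal sketch (step logic is stubbed; only the control flow is modeled):

```python
# The incident workflow as (step_name, needs_human_approval) pairs.
INCIDENT_WORKFLOW = [
    ("detect", False),
    ("triage", False),
    ("analyze", False),
    ("remediate", True),   # auto | human_approval
    ("verify", False),
    ("close", False),
]

def run_workflow(approve: bool) -> list[str]:
    """Execute steps in order; stop at 'remediate' if approval is withheld."""
    done = []
    for name, needs_approval in INCIDENT_WORKFLOW:
        if needs_approval and not approve:
            done.append(f"{name}:escalated")
            break
        done.append(name)
    return done
```

A production workflow engine would persist per-step state (see the TaskState model below in the State Manager) so an escalated workflow can resume once a human approves.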
Module 3: State Manager
Responsibilities: persist all Agent state, task progress, and output artifacts
StateManager:
storage_layers:
hot_storage:
type: "Redis Cluster"
purpose: "Real-time state, task queues, cache"
ttl: "7 days"
warm_storage:
type: "PostgreSQL"
purpose: "Task history, Agent logs, metrics data"
retention: "90 days"
cold_storage:
type: "S3 + Parquet"
purpose: "Archived data, audit logs, training data"
retention: "7 years"
data_models:
AgentState:
fields:
- agent_id: string
- session_id: string
- status: enum
- current_task_id: string
- context_window: jsonb # current context
- memory_index: string # long-term memory index
- created_at: timestamp
- updated_at: timestamp
TaskState:
fields:
- task_id: string
- type: string
- priority: int
- status: enum
- assigned_agent: string
- input: jsonb
- output: jsonb
- error: text
- started_at: timestamp
- completed_at: timestamp
OutputArtifact:
fields:
- artifact_id: string
- agent_id: string
- task_id: string
- type: enum # code|doc|analysis|decision
- content: text
- quality_score: float
- human_approved: boolean
- created_at: timestamp
Module 4: Resource Allocator
Responsibilities: manage cloud resources, compute capacity, and token budgets
ResourceAllocator:
resource_types:
compute:
- kubernetes_pods: "Agent containers"
- gpu_instances: "Model inference"
- cpu_instances: "General compute"
storage:
- database_connections: "Database connection pools"
- object_storage: "File storage"
- cache_memory: "Cache memory"
api_quotas:
- llm_tokens: "LLM token budget"
- external_apis: "Third-party API calls"
- rate_limits: "Rate limiting"
allocation_strategies:
dynamic_scaling:
description: "Scale automatically based on load"
metrics:
- cpu_utilization: "target: 70%"
- memory_utilization: "target: 80%"
- queue_depth: "target: <100 tasks"
actions:
- scale_up: "when a metric exceeds its target"
- scale_down: "when metrics fall 30% below target"
cost_optimization:
description: "Optimize resource costs"
strategies:
- spot_instances: "Use spot instances"
- reserved_capacity: "Reserved-capacity discounts"
- token_budgeting: "Token budget management"
- idle_detection: "Detect and reclaim idle resources"
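The dynamic-scaling rules above can be sketched as a pure decision function. The targets mirror the config; the exact hysteresis interpretation ("30% below target") is an assumption:

```python
TARGETS = {"cpu": 0.70, "memory": 0.80, "queue_depth": 100}  # from the config above

def scaling_decision(cpu: float, memory: float, queue_depth: int) -> str:
    """scale_up when any metric exceeds its target; scale_down only when
    every metric sits 30% below its target; otherwise hold."""
    over = (cpu > TARGETS["cpu"] or memory > TARGETS["memory"]
            or queue_depth > TARGETS["queue_depth"])
    under = (cpu < TARGETS["cpu"] * 0.7 and memory < TARGETS["memory"] * 0.7
             and queue_depth < TARGETS["queue_depth"] * 0.7)
    if over:
        return "scale_up"
    if under:
        return "scale_down"
    return "hold"
```

The gap between the up and down thresholds prevents flapping: a cluster near target utilization holds steady instead of oscillating.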
🎮 The Four Application Scenarios in Detail
App 1: 1000 Agent Space (Operations)
┌─────────────────────────────────────────────────────────────────┐
│ 1000 Agent Space │
│               Parallel Production Incident Resolution            │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Incident Pipeline: │
│ │
│ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ │
│ │ Detect │ → │ Triage │ → │ Analyze │ → │ Resolve │ │
│ │ (100%) │ │ (100%) │ │ (90%) │ │ (70%) │ │
│ └─────────┘ └─────────┘ └─────────┘ └─────────┘ │
│ │ │ │ │ │
│ ▼ ▼ ▼ ▼ │
│ Agent#001 Agent#002 Agent#003 Agent#004 │
│ (monitor) (triage) (analyze) (remediate) │
│ │
│ Human Escalation: │
│ - When auto-remediation fails, notify human engineers via phone/SMS/IM │
│ - Human resolutions are fed back for Agent learning │
│ │
│ Metrics: │
│ - MTTR (Mean Time To Resolve): target <10 minutes │
│ - Auto-resolution Rate: target >70% │
│ - False Positive Rate: target <5% │
│ │
└─────────────────────────────────────────────────────────────────┘
Backend services:
- incident-ingestion-service: receives alerts (Prometheus, PagerDuty, etc.)
- incident-router-service: routes incidents to the appropriate Agent
- remediation-executor: executes remediation scripts
- escalation-manager: manages the human escalation workflow
- learning-feedback-loop: learns from human interventions
App 2: 1000 Agent Engineering (AI Software Engineering)
┌─────────────────────────────────────────────────────────────────┐
│ 1000 Agent Engineering │
│                Autonomous Mono-Repo Convergence                  │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Repo Analysis Pipeline: │
│ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ 400 Repos Input │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │ │
│ ┌───────────────────┼───────────────────┐ │
│ ▼ ▼ ▼ │
│ ┌─────────┐ ┌─────────┐ ┌─────────┐ │
│ │Repo-001 │ │Repo-002 │ │Repo-400 │ │
│ │ Agent │ │ Agent │ │ Agent │ │
│ └────┬────┘ └────┬────┘ └────┬────┘ │
│ │ │ │ │
│ └───────────────────┼───────────────────┘ │
│ ▼ │
│ ┌─────────────────────────┐ │
│ │ Aggregation Agent │ │
│ │ (aggregates analysis results) │ │
│ └────────────┬────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────┐ │
│ │ Mono-Repo Generator │ │
│ │ (generates the merge plan) │ │
│ └─────────────────────────┘ │
│ │
│ Continuous Improvement: │
│ - Guardian Agents continuously monitor their own components │
│ - Automated code review, test generation, documentation updates │
│ - Periodic refactoring proposals │
│ │
└─────────────────────────────────────────────────────────────────┘
Backend services:
- repo-analyzer-service: analyzes individual repositories
- dependency-mapper: maps cross-repo dependencies
- merge-planner: plans the merge strategy
- code-quality-monitor: continuous code-quality monitoring
- auto-pr-generator: generates PRs automatically
App 3: 1000 Agent CorpUnit (Corporate Brain)
┌─────────────────────────────────────────────────────────────────┐
│ 1000 Agent CorpUnit │
│                     AI-Driven Corporate Brain                    │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Corporate Functions: │
│ │
│ ┌──────────────────────────────────────────────────────────┐ │
│ │ CEO Agent (decision coordination) │ │
│ └────────────────────────┬─────────────────────────────────┘ │
│ │ │
│ ┌───────────────────┼───────────────────┐ │
│ ▼ ▼ ▼ │
│ ┌─────────┐ ┌─────────┐ ┌─────────┐ │
│ │ CFO │ │ COO │ │ CTO │ │
│ │ Agent │ │ Agent │ │ Agent │ │
│ └────┬────┘ └────┬────┘ └────┬────┘ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ┌─────────┐ ┌─────────┐ ┌─────────┐ │
│ │Finance │ │HR/Legal │ │Engineering│ │
│ │Team │ │Team │ │Team │ │
│ └─────────┘ └─────────┘ └─────────┘ │
│ │
│ Department Agents: │
│ - Finance: budget analysis, cost control, financial forecasting │
│ - HR: recruiting screens, performance reviews, training plans │
│ - Legal: contract review, compliance checks, risk assessment │
│ - Market: market research, competitive analysis, marketing strategy │
│ - Growth: user growth, conversion optimization, A/B testing │
│ - Investment: investment analysis, due diligence, portfolio management │
│ │
│ Output: │
│ - Real-time business dashboards │
│ - Decision recommendation reports │
│ - Automated workflow execution │
│ │
└─────────────────────────────────────────────────────────────────┘
Backend services:
- data-ingestion-service: ingests enterprise data (ERP, CRM, HRIS, etc.)
- analytics-engine: data analysis and insights
- decision-recommender: generates decision recommendations
- workflow-automator: executes automated workflows
- executive-dashboard: executive dashboard
App 4: 1000 Invested AI Company (Portfolio Management)
┌─────────────────────────────────────────────────────────────────┐
│ 1000 Invested AI Company │
│                  Portfolio Management Dashboard                  │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Portfolio Structure: │
│ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ Portfolio Manager Agent │ │
│ └────────────────────────┬────────────────────────────────┘ │
│ │ │
│ ┌───────────────────┼───────────────────┐ │
│ ▼ ▼ ▼ │
│ ┌─────────┐ ┌─────────┐ ┌─────────┐ │
│ │Company-1│ │Company-2│ │Company-N│ │
│ │ Agent │ │ Agent │ │ Agent │ │
│ └────┬────┘ └────┬────┘ └────┬────┘ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ┌─────────┐ ┌─────────┐ ┌─────────┐ │
│ │Company-1│ │Company-2│ │Company-N│ │
│ │ Metrics │ │ Metrics │ │ Metrics │ │
│ │ - Revenue│ │ - Revenue│ │ - Revenue│ │
│ │ - Growth │ │ - Growth │ │ - Growth │ │
│ │ - Burn │ │ - Burn │ │ - Burn │ │
│ │ - Health │ │ - Health │ │ - Health │ │
│ └─────────┘ └─────────┘ └─────────┘ │
│ │
│ Analysis Capabilities: │
│ - Real-time financial health monitoring │
│ - Industry benchmarking │
│ - Risk alerts │
│ - Exit-timing recommendations │
│ - Portfolio rebalancing optimization │
│ │
└─────────────────────────────────────────────────────────────────┘
Backend services:
- company-data-collector: collects portfolio-company data
- financial-modeling-engine: financial modeling and valuation
- risk-monitor: risk monitoring and early warning
- portfolio-optimizer: portfolio optimization recommendations
- lp-reporting: LP report generation
🔧 Tech Stack Design
Backend Stack
core_framework:
runtime: "Node.js 20+ / Python 3.11+"
api: "REST + GraphQL + WebSocket"
orm: "Prisma / SQLAlchemy"
database:
primary: "PostgreSQL 15+ (relational data)"
cache: "Redis 7+ (sessions, queues, cache)"
analytics: "ClickHouse (metrics analytics)"
archive: "S3 + Parquet (cold data)"
messaging:
queue: "Apache Kafka / RabbitMQ"
event_bus: "NATS / Redis PubSub"
agent_execution:
container: "Docker + Kubernetes"
orchestration: "K8s Operators"
isolation: "Namespace + Resource Quotas"
monitoring:
metrics: "Prometheus + Grafana"
logging: "ELK Stack / Loki"
tracing: "Jaeger / Temporal"
alerting: "PagerDuty / OpsGenie"
Frontend Stack
framework: "React 18+ / Next.js 14+"
ui_library: "TailwindCSS + shadcn/ui"
state_management: "Zustand / Redux Toolkit"
realtime: "WebSocket + SWR"
visualization: "Recharts + D3.js"
Infrastructure
cloud_provider: "AWS / GCP / Aliyun"
kubernetes: "EKS / GKE / ACK"
cdn: "CloudFront / Cloudflare"
dns: "Route53 / Cloudflare DNS"
secrets: "AWS Secrets Manager / HashiCorp Vault"
ci_cd: "GitHub Actions + ArgoCD"
📊 Data Model Design
Core Tables
-- Agents table
CREATE TABLE agents (
id UUID PRIMARY KEY,
name VARCHAR(255) NOT NULL,
type VARCHAR(50) NOT NULL, -- space|engineering|corpunit|investment
status VARCHAR(50) NOT NULL, -- active|idle|busy|blocked|error
cage_id VARCHAR(50), -- cage number (001-1000)
current_task_id UUID,
resource_config JSONB,
metrics JSONB, -- real-time metrics
created_at TIMESTAMP DEFAULT NOW(),
updated_at TIMESTAMP DEFAULT NOW(),
last_heartbeat TIMESTAMP
);
-- Tasks table
CREATE TABLE tasks (
id UUID PRIMARY KEY,
type VARCHAR(100) NOT NULL,
priority INTEGER DEFAULT 0,
status VARCHAR(50) NOT NULL, -- pending|running|completed|failed|cancelled
assigned_agent_id UUID REFERENCES agents(id),
input JSONB NOT NULL,
output JSONB,
error TEXT,
started_at TIMESTAMP,
completed_at TIMESTAMP,
created_at TIMESTAMP DEFAULT NOW()
);
-- Artifacts table (Agent outputs)
CREATE TABLE artifacts (
id UUID PRIMARY KEY,
agent_id UUID REFERENCES agents(id),
task_id UUID REFERENCES tasks(id),
type VARCHAR(50) NOT NULL, -- code|doc|analysis|decision|report
title VARCHAR(500),
content TEXT,
quality_score FLOAT,
human_approved BOOLEAN DEFAULT FALSE,
metadata JSONB,
created_at TIMESTAMP DEFAULT NOW()
);
-- Cages table (Agent containers / resource quotas)
CREATE TABLE cages (
id VARCHAR(50) PRIMARY KEY, -- 001-1000
agent_id UUID REFERENCES agents(id),
status VARCHAR(50) NOT NULL, -- occupied|vacant|maintenance
resource_limits JSONB, -- cpu, memory, gpu, tokens
resource_usage JSONB, -- actual usage
created_at TIMESTAMP DEFAULT NOW()
);
-- Metrics table (time-series metrics)
CREATE TABLE metrics (
time TIMESTAMP NOT NULL,
agent_id UUID NOT NULL,
metric_name VARCHAR(100) NOT NULL,
metric_value FLOAT NOT NULL,
labels JSONB,
PRIMARY KEY (time, agent_id, metric_name)
) PARTITION BY RANGE (time);
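One operational query this schema enables is liveness detection via `agents.last_heartbeat`. Below is a minimal, runnable sketch against a simplified SQLite version of the `agents` table (SQLite has no UUID/JSONB types); the 90-second staleness threshold is an assumption, not something the schema mandates.

```python
import sqlite3
from datetime import datetime, timedelta, timezone

# Simplified stand-in for the agents table (SQLite, text timestamps).
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE agents (
        id TEXT PRIMARY KEY,
        name TEXT NOT NULL,
        status TEXT NOT NULL,
        last_heartbeat TEXT
    )
""")

now = datetime.now(timezone.utc)
conn.executemany("INSERT INTO agents VALUES (?, ?, ?, ?)", [
    ("agent-001", "guardian-1", "active",
     (now - timedelta(seconds=10)).isoformat(timespec="seconds")),
    ("agent-002", "guardian-2", "active",
     (now - timedelta(seconds=300)).isoformat(timespec="seconds")),
])

def stale_agents(conn, threshold_s=90):
    """Return ids of agents whose last heartbeat is older than threshold_s.
    ISO-8601 strings in a uniform format compare correctly as text."""
    cutoff = (datetime.now(timezone.utc)
              - timedelta(seconds=threshold_s)).isoformat(timespec="seconds")
    cur = conn.execute(
        "SELECT id FROM agents WHERE last_heartbeat < ? ORDER BY id", (cutoff,))
    return [row[0] for row in cur.fetchall()]

print(stale_agents(conn))  # ['agent-002']
```

In PostgreSQL the same check would compare `last_heartbeat` against `NOW() - INTERVAL '90 seconds'` directly.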
🚀 API Design
RESTful APIs
# Agent management
GET /api/v1/agents # List all Agents
GET /api/v1/agents/:id # Get Agent details
POST /api/v1/agents/:id/pause # Pause an Agent
POST /api/v1/agents/:id/resume # Resume an Agent
POST /api/v1/agents/:id/restart # Restart an Agent
DELETE /api/v1/agents/:id # Delete an Agent
# Task management
GET /api/v1/tasks # List tasks (supports filtering)
POST /api/v1/tasks # Create a task
GET /api/v1/tasks/:id # Get task details
POST /api/v1/tasks/:id/cancel # Cancel a task
# Artifact management
GET /api/v1/artifacts # List artifacts
GET /api/v1/artifacts/:id # Get artifact details
POST /api/v1/artifacts/:id/approve # Human approval
# Cage management
GET /api/v1/cages # List all cages
GET /api/v1/cages/:id # Get cage details
GET /api/v1/cages/:id/metrics # Get cage metrics
# Metrics & Analytics
GET /api/v1/metrics/agents # Aggregated Agent metrics
GET /api/v1/metrics/system # System-wide metrics
GET /api/v1/analytics/productivity # Productivity analysis
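All the routes above share one `/api/v1/{resource}[/{id}[/{action}]]` shape, so a client can build paths mechanically. The helper below is a hypothetical sketch of that convention; the resource and action names come from the listing, the function itself is an assumption about client structure.

```python
from typing import Optional

API_PREFIX = "/api/v1"

def endpoint(resource: str, item_id: Optional[str] = None,
             action: Optional[str] = None) -> str:
    """Build a platform API path, e.g. endpoint('agents', '042', 'pause')."""
    parts = [API_PREFIX, resource]
    if item_id is not None:
        parts.append(item_id)
    if action is not None:
        parts.append(action)
    return "/".join(parts)

print(endpoint("agents"))                     # /api/v1/agents
print(endpoint("agents", "042", "pause"))     # /api/v1/agents/042/pause
print(endpoint("artifacts", "a1", "approve")) # /api/v1/artifacts/a1/approve
```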
WebSocket Events
// The frontend subscribes to real-time events
ws.subscribe('agent:status:changed', (data) => {
  // Agent status changed
});
ws.subscribe('task:completed', (data) => {
  // Task completed
});
ws.subscribe('artifact:created', (data) => {
  // New artifact produced
});
ws.subscribe('alert:triggered', (data) => {
  // Alert triggered
});
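Behind `ws.subscribe` sits a plain topic-to-handlers dispatch. A minimal sketch of that pub/sub core, with topic names taken from the listing above (the `EventBus` class itself is illustrative, not the platform's actual API):

```python
from collections import defaultdict

class EventBus:
    """Route published events to every handler subscribed to their topic."""

    def __init__(self):
        self._handlers = defaultdict(list)

    def subscribe(self, topic, handler):
        self._handlers[topic].append(handler)

    def publish(self, topic, data):
        # Unknown topics simply have no handlers; nothing raises.
        for handler in self._handlers[topic]:
            handler(data)

bus = EventBus()
seen = []
bus.subscribe("agent:status:changed", lambda data: seen.append(data["agent_id"]))
bus.publish("agent:status:changed", {"agent_id": "agent-042", "status": "active"})
print(seen)  # ['agent-042']
```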
🔐 Security Design
authentication:
  method: "JWT + OAuth2"
  providers:
    - "Google Workspace (enterprise SSO)"
    - "GitHub (developers)"
    - "API Keys (service-to-service calls)"
authorization:
  model: "RBAC + ABAC"
  roles:
    - admin: "full access"
    - operator: "operations actions"
    - viewer: "read-only access"
    - agent: "Agent service accounts"
data_protection:
  encryption_at_rest: "AES-256"
  encryption_in_transit: "TLS 1.3"
  secrets_management: "HashiCorp Vault"
audit:
  logging: "audit log for every operation"
  retention: "7 years"
  compliance: "SOC2, ISO27001"
📈 Scalability Design
horizontal_scaling:
  stateless_services: "K8s HPA auto-scaling"
  stateful_services: "sharding + read/write splitting"
  agent_containers: "scheduled in Cage groups"
performance:
  caching_strategy: "multi-level cache (L1: in-memory, L2: Redis, L3: CDN)"
  database_optimization: "connection pooling + prepared statements + index tuning"
  async_processing: "message-queue decoupling"
reliability:
  redundancy: "multi-AZ deployment"
  failover: "automatic failover"
  backup: "daily backups + cross-region disaster recovery"
  recovery_objective:
    rto: "<15 minutes"
    rpo: "<5 minutes"
💰 Cost Estimate
infrastructure_cost_monthly:
  kubernetes_cluster:
    nodes: "50 x 8vCPU 32GB"
    cost: "~$5,000/month"
  database:
    postgresql: "2 x db.r6g.2xlarge"
    redis: "2 x cache.r6g.large"
    cost: "~$2,000/month"
  storage:
    s3: "10TB"
    cost: "~$250/month"
  networking:
    data_transfer: "10TB"
    cost: "~$1,000/month"
  llm_tokens:
    estimated: "1B tokens/month"
    cost: "~$5,000/month"
  total: "~$13,250/month"
agent_cost_per_cage:
  compute: "~$5/day"
  tokens: "~$2/day"
  total: "~$7/day/cage"
  monthly: "~$210/month/cage"
  1000_cages_total: "~$210,000/month"
🎯 Implementation Roadmap
Phase 1: Infrastructure (Weeks 1-4)
- K8s cluster setup
- Database deployment
- Monitoring stack
- CI/CD pipeline
Phase 2: Core Services (Weeks 5-8)
- Agent Orchestrator
- Task Scheduler
- State Manager
- Resource Allocator
Phase 3: Application Scenarios (Weeks 9-16)
- 1000 Agent Space (Ops)
- 1000 Agent Engineering (Engineering)
- 1000 Agent CorpUnit (Corporate)
- 1000 Invested AI Company (Investment)
Phase 4: Frontend (Weeks 17-20)
- Frontend development for all four applications
- Real-time data push
- Interaction polish
Phase 5: Scale-Out (Weeks 21-24)
- Performance optimization
- Security hardening
- Documentation completion
- Launch
📝 Next Steps
- Confirm the tech stack (Node.js vs Python, K8s vs Serverless)
- Write the detailed API specification (OpenAPI 3.0)
- Set up the development environment (Docker Compose for local dev)
- Build the MVP (single Agent + single Task flow)
- Scale incrementally to 1000 Agents
1000 Agent Platform - Frontend Interaction Design
🎨 Design Philosophy
“1000 cages, 1000 AIs, productivity visible in real time”
- Visible: each Agent's status, outputs, and resource usage are clearly in view
- Real-time: WebSocket push, second-level updates
- Operable: intervene, pause, restart, or reassign at any time
- Measurable: productivity metrics, quality scores, ROI analysis
🖥️ Common Layout Framework
┌─────────────────────────────────────────────────────────────────────────┐
│ [Logo] 1000 Agent Platform [Space] [Engineering] [CorpUnit] [Invest] │
│ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ Global Stats Bar │ │
│ │ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ │ │
│ │ │ 1000 │ │ 856 │ │ 120 │ │ 24 │ │ 98.5% │ │ │
│ │ │ Total │ │ Active │ │ Idle │ │ Blocked │ │ Health │ │ │
│ │ └─────────┘ └─────────┘ └─────────┘ └─────────┘ └─────────┘ │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │
│ ┌────────────────────────────────┐ ┌────────────────────────────────┐ │
│ │ │ │ │ │
│ │ Main Content Area │ │ Side Panel │ │
│ │ │ │ - Filters │ │
│ │ [Agent Grid / Details] │ │ - Quick Actions │ │
│ │ │ │ - Real-time Logs │ │
│ │ │ │ - Metrics │ │
│ │ │ │ │ │
│ └────────────────────────────────┘ └────────────────────────────────┘ │
│ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ Bottom Status Bar │ │
│ │ System: ● Healthy Tokens: 45.2M/100M Cost: $1,234/day │ │
│ └─────────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────────┘
📊 1000 Agent Space (Ops) - Detailed Design
Main View: Agent Grid View
┌─────────────────────────────────────────────────────────────────────────┐
│ ⚡ 1000 Agent Space [Dashboard] [Agents] [Incidents] [Reports] │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ Global Stats: │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ 🔴 12 │ │ 🟡 45 │ │ 🟢 943 │ │ ⏱️ 8.2m │ │ ✅ 72% │ │
│ │ Critical │ │ Warning │ │ Healthy │ │ Avg MTTR │ │ Auto-fix │ │
│ └──────────┘ └──────────┘ └──────────┘ └──────────┘ └──────────┘ │
│ │
│ Filters: [Status: All ▼] [Severity: All ▼] [Search: 🔍 _______] │
│ │
│ Agent Grid (10x10 = 100 visible, scroll for more): │
│ ┌────┐ ┌────┐ ┌────┐ ┌────┐ ┌────┐ ┌────┐ ┌────┐ ┌────┐ ┌────┐ ┌────┐│
│ │#001│ │#002│ │#003│ │#004│ │#005│ │#006│ │#007│ │#008│ │#009│ │#010││
│ │🟢 │ │🔴 │ │🟢 │ │🟡 │ │🟢 │ │🟢 │ │🔴 │ │🟢 │ │🟢 │ │🟢 ││
│ │IDLE│ │INC │ │IDLE│ │WAIT│ │IDLE│ │IDLE│ │INC │ │IDLE│ │IDLE│ │IDLE││
│ └────┘ └────┘ └────┘ └────┘ └────┘ └────┘ └────┘ └────┘ └────┘ └────┘│
│ ┌────┐ ┌────┐ ┌────┐ ┌────┐ ┌────┐ ┌────┐ ┌────┐ ┌────┐ ┌────┐ ┌────┐│
│ │#011│ │#012│ │#013│ │#014│ │#015│ │#016│ │#017│ │#018│ │#019│ │#020││
│ │🟢 │ │🟢 │ │🔴 │ │🟢 │ │🟡 │ │🟢 │ │🟢 │ │🟢 │ │🟢 │ │🟢 ││
│ │IDLE│ │IDLE│ │INC │ │IDLE│ │WAIT│ │IDLE│ │IDLE│ │IDLE│ │IDLE│ │IDLE││
│ └────┘ └────┘ └────┘ └────┘ └────┘ └────┘ └────┘ └────┘ └────┘ └────┘│
│ ... (scrollable grid of 1000 agents) │
│ │
│ Legend: 🟢 Healthy 🟡 Warning 🔴 Incident ⚪ Offline │
└─────────────────────────────────────────────────────────────────────────┘
Agent Detail Panel (opens on clicking any Agent)
┌─────────────────────────────────────────────────────────────────────────┐
│ Agent #042 - Production Guardian [× Close] │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ Status: 🔴 HANDLING INCIDENT Uptime: 72h 14m Health: 94% │
│ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ Current Incident │ │
│ │ ─────────────────────────────────────────────────────────────── │ │
│ │ 🔴 SEV-1: Database Connection Pool Exhausted │ │
│ │ 📍 Service: tidb-cloud-control-plane │ │
│ │ ⏰ Started: 2 minutes ago │ │
│ │ 📊 Progress: [████████░░] 80% - Analyzing root cause │ │
│ │ │ │
│ │ Timeline: │ │
│ │ 10:00:00 - Incident detected │ │
│ │ 10:00:15 - Triage completed (SEV-1) │ │
│ │ 10:01:30 - Root cause identified │ │
│ │ 10:02:00 - Remediation in progress... │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │
│ Resource Usage: │
│ CPU: [████████░░] 78% Memory: [██████░░░░] 62% Tokens: 45K/h │
│ │
│ Recent Outputs (24h): │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ ✅ 14:32 - Auto-scaled connection pool from 100 to 500 │ │
│ │ ✅ 12:15 - Resolved memory leak in service-abc │ │
│ │ ✅ 09:45 - Deployed hotfix for authentication bug │ │
│ │ ⚠️ 08:30 - Escalated to human: Complex network issue │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │
│ Actions: [🔍 View Logs] [⏸️ Pause] [🔄 Restart] [👤 Escalate] │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Incident List Page
┌─────────────────────────────────────────────────────────────────────────┐
│ Incidents [Active] [History] [All]│
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ Filters: [Severity: All ▼] [Status: All ▼] [Service: All ▼] │
│ [Date Range: Last 7 days ▼] │
│ │
│ ┌───────────────────────────────────────────────────────────────────┐ │
│ │ 🔴 SEV-1 │ DB Connection Pool │ Agent #042 │ 2m ago │ [View] │ │
│ │ │ Exhausted │ │ │ │ │
│ ├───────────────────────────────────────────────────────────────────┤ │
│ │ 🔴 SEV-1 │ API Latency Spike │ Agent #087 │ 5m ago │ [View] │ │
│ │ │ p99 > 5s │ │ │ │ │
│ ├───────────────────────────────────────────────────────────────────┤ │
│ │ 🟡 SEV-2 │ Memory Usage High │ Agent #156 │ 12m ago │ [View] │ │
│ │ │ 85% utilization │ │ │ │ │
│ ├───────────────────────────────────────────────────────────────────┤ │
│ │ 🟢 SEV-3 │ Disk Space Warning │ Agent #234 │ 1h ago │ [View] │ │
│ │ │ /var/log at 80% │ │ │ │ │
│ ├───────────────────────────────────────────────────────────────────┤ │
│ │ ✅ RESOL │ Auto-scaling Failed │ Agent #091 │ 2h ago │ [View] │ │
│ │ │ Resolved in 8m │ │ │ │ │
│ └───────────────────────────────────────────────────────────────────┘ │
│ │
│ Stats: 12 Active | 156 Resolved (24h) | 72% Auto-resolution Rate │
│ │
└─────────────────────────────────────────────────────────────────────────┘
⚙️ 1000 Agent Engineering (Engineering) - Detailed Design
Main View: Repo Convergence Map
┌─────────────────────────────────────────────────────────────────────────┐
│ ⚙ 1000 Agent Engineering [Dashboard] [Repos] [Agents] [Merge Plan] │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ Progress to Mono-Repo: │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ [████████████████████░░░░░░░░░░] 65% Complete │ │
│ │ │ │
│ │ 📊 400 Repos Analyzed | 260 Merged | 140 Pending │ │
│ │ 📁 15.2GB / 39GB Consolidated │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │
│ Agent Status by Tier: │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ S-Tier │ │ A-Tier │ │ B-Tier │ │ C-Tier │ │ Total │ │
│ │ 1/1 │ │ 156/160 │ │ 103/159 │ │ 40/80 │ │ 300/400 │ │
│ │ 🟢 Done │ │ 🟡 97% │ │ 🟡 65% │ │ 🟡 50% │ │ 🟡 75% │ │
│ └──────────┘ └──────────┘ └──────────┘ └──────────┘ └──────────┘ │
│ │
│ Repo Analysis Grid: │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ Repo Name │ Status │ Agent │ Progress │ Quality │ │
│ │────────────────────│─────────│─────────│──────────│───────────│ │
│ │ tidb │ ✅ Done │ #001-8 │ 100% │ 95/100 │ │
│ │ tiflow │ ✅ Done │ #009-12 │ 100% │ 88/100 │ │
│ │ tidb-operator │ 🟡 85% │ #013-16 │ 85% │ - │ │
│ │ docs │ 🟡 72% │ #017-20 │ 72% │ - │ │
│ │ tiup │ 🟡 45% │ #021-24 │ 45% │ - │ │
│ │ ... (395 more) │ │ │ │ │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Repo Detail Page
┌─────────────────────────────────────────────────────────────────────────┐
│ Repository: tidb [× Close] │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ Tier: S | Priority: P0 | Status: ✅ Analysis Complete │
│ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ Analysis Summary │ │
│ │ ─────────────────────────────────────────────────────────────── │ │
│ │ 📊 Score: 95/100 │ │
│ │ 📝 Last Commit: 2 hours ago │ │
│ │ 👥 Contributors: 156 active │ │
│ │ 📦 Size: 856 MB (1.2M LOC) │ │
│ │ 🔧 Tech Stack: Go (85%), Python (10%), Other (5%) │ │
│ │ │ │
│ │ Recommendation: MERGE FIRST - Core product, high activity │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │
│ Assigned Agents (8): │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ #001 - Code Analyzer ✅ Complete │ Output: 45 artifacts │ │
│ │ #002 - Dependency Mapper ✅ Complete │ Output: 12 artifacts │ │
│ │ #003 - Test Coverage ✅ Complete │ Output: 23 artifacts │ │
│ │ #004 - Documentation ✅ Complete │ Output: 8 artifacts │ │
│ │ #005 - Security Scanner ✅ Complete │ Output: 6 artifacts │ │
│ │ #006 - Performance Profiler ✅ Complete │ Output: 15 artifacts │ │
│ │ #007 - Refactoring Advisor ✅ Complete │ Output: 31 artifacts │ │
│ │ #008 - Merge Coordinator ✅ Complete │ Output: 3 artifacts │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │
│ Key Findings: │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ ⚠️ 3 circular dependencies detected │ │
│ │ ✅ 87% test coverage (above threshold) │ │
│ │ ⚠️ 12 security vulnerabilities (8 low, 4 medium) │ │
│ │ ✅ Well-documented (95% public APIs documented) │ │
│ │ 💡 Suggested refactorings: 31 (high impact: 5) │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │
│ Actions: [📄 View Full Report] [🔀 Create Merge Plan] [📊 Compare] │
│ │
└─────────────────────────────────────────────────────────────────────────┘
🏢 1000 Agent CorpUnit (Corporate) - Detailed Design
Main View: Corporate Brain Dashboard
┌─────────────────────────────────────────────────────────────────────────┐
│ 🏢 1000 Agent CorpUnit [Dashboard] [Finance] [HR] [Legal] [Market] │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ Executive Summary (Last 7 Days): │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ 💰 Revenue │ │ 👥 Headcount │ │ ⚖️ Legal │ │ 📈 Growth │ │
│ │ $12.5M │ │ 1,245 │ │ Risk Score │ │ +15.2% │ │
│ │ +8.3% WoW │ │ +23 new │ │ 23/100 (Low) │ │ +2.1% WoW │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ └──────────────┘ │
│ │
│ Department Agent Status: │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ Department │ Agents │ Active │ Insights │ Actions │ Health │ │
│ │───────────────│────────│────────│──────────│─────────│─────────│ │
│ │ Finance │ 150 │ 142 │ 45 │ 12 │ 🟢 98% │ │
│ │ HR │ 100 │ 95 │ 28 │ 8 │ 🟢 96% │ │
│ │ Legal │ 80 │ 76 │ 15 │ 3 │ 🟢 97% │ │
│ │ Marketing │ 200 │ 188 │ 67 │ 24 │ 🟡 92% │ │
│ │ Growth │ 170 │ 165 │ 52 │ 19 │ 🟢 95% │ │
│ │ Investment │ 100 │ 94 │ 31 │ 7 │ 🟢 96% │ │
│ │ Operations │ 200 │ 189 │ 43 │ 15 │ 🟢 94% │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │
│ Recent Insights (Last 24h): │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ 💡 Finance Agent: Cash flow projection shows surplus of $2.3M │ │
│ │ in Q2. Recommend investment or dividend distribution. │ │
│ │ │ │
│ │ ⚠️ HR Agent: Engineering team attrition rate at 12% (above │ │
│ │ industry avg 8%). Suggest retention program. │ │
│ │ │ │
│ │ 💡 Growth Agent: A/B test variant B shows 23% conversion │ │
│ │ lift. Recommend full rollout. │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Finance Detail Page
┌─────────────────────────────────────────────────────────────────────────┐
│ Finance Department [× Close] │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ Financial Health Score: 87/100 🟢 Excellent │
│ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ Key Metrics (MTD) │ │
│ │ ─────────────────────────────────────────────────────────────── │ │
│ │ Revenue: $12.5M (vs $11.2M budget, +11.6%) │ │
│ │ Expenses: $8.3M (vs $8.5M budget, -2.4%) │ │
│ │ EBITDA: $4.2M (33.6% margin) │ │
│ │ Cash Balance: $45.6M (38 months runway) │ │
│ │ Burn Rate: $1.2M/month │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │
│ Active Finance Agents: │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ Budget Analyst (x25) - Monitoring department budgets │ │
│ │ Expense Processor (x40) - Automated expense review │ │
│ │ Revenue Tracker (x20) - Real-time revenue recognition │ │
│ │ Cash Flow Modeler (x15) - 13-week cash flow forecasting │ │
│ │ Tax Optimizer (x10) - Tax planning & compliance │ │
│ │ Audit Preparer (x15) - Continuous audit readiness │ │
│ │ FP&A Analyst (x25) - Financial planning & analysis │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │
│ Recent Actions: │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ ✅ Approved 156 expense reports ($234K total) │ │
│ │ ⚠️ Flagged 3 unusual transactions for review │ │
│ │ ✅ Generated monthly board deck │ │
│ │ ✅ Updated Q2 forecast based on actuals │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────┘
💼 1000 Invested AI Company (Investment) - Detailed Design
Main View: Portfolio Dashboard
┌─────────────────────────────────────────────────────────────────────────┐
│ 💼 1000 Invested AI Company [Portfolio] [Companies] [Analysis] [LP] │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ Portfolio Overview: │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ 🏢 Companies │ │ 💰 Total │ │ 📈 Avg │ │ ⚠️ At-Risk │ │
│ │ 47 │ │ $890M │ │ Multiple │ │ 3 │ │
│ │ Active │ │ AUM │ │ 2.3x │ │ Companies │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ └──────────────┘ │
│ │
│ Portfolio Performance: │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ │ │
│ │ Performance Since Inception │ │
│ │ │ │ │
│ │ │ ╭────╮ │ │
│ │ │ ╱ ╲ ╭──╮ │ │
│ │ │ ╱ ╲ ╱ ╲ ╭──╮ │ │
│ │ │ ╱ ╲ ╱ ╲ ╱ ╲ │ │
│ │ │╱ ╲ ╲╱ ╲──╮ │ │
│ │ └──────────────────────────────────────────── │ │
│ │ Jan Apr Jul Oct Jan Apr Jul Oct Jan │ │
│ │ │ │
│ │ TVPI: 2.3x | DPI: 1.1x | RVPI: 1.2x | IRR: 34% │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │
│ Company Status: │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ Company │ Stage │ Health │ Last Report │ Action │ │
│ │─────────────────────│──────────│────────│─────────────│────────│ │
│ │ TechCorp AI │ Series B │ 🟢 92 │ 2 days ago │ [View] │ │
│ │ DataFlow Inc │ Series A │ 🟢 88 │ 1 day ago │ [View] │ │
│ │ CloudNative Labs │ Seed │ 🟡 72 │ 5 days ago │ [View] │ │
│ │ SecureNet │ Series C │ 🔴 45 │ 1 day ago │ [View] │ │
│ │ ... (43 more) │ │ │ │ │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Company Detail Page
┌─────────────────────────────────────────────────────────────────────────┐
│ TechCorp AI (Portfolio Company #12) [× Close] │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ Investment Summary: │
│ Stage: Series B | Invested: $15M | Ownership: 18% | Current Val: $85M │
│ │
│ Health Score: 92/100 🟢 Thriving │
│ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ Financial Metrics (Last Quarter) │ │
│ │ ─────────────────────────────────────────────────────────────── │ │
│ │ Revenue: $2.1M/quarter (+45% QoQ) │ │
│ │ ARR: $8.4M │ │
│ │ Gross Margin: 78% │ │
│ │ Burn Rate: $450K/month │ │
│ │ Runway: 18 months │ │
│ │ Cash Balance: $8.1M │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │
│ Key Metrics: │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ Customers: 145 (up from 98 last quarter) │ │
│ │ NRR: 125% (excellent retention + expansion) │ │
│ │ CAC Payback: 14 months │ │
│ │ LTV/CAC: 4.2x │ │
│ │ Team Size: 67 (hiring plan: +20 in Q2) │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │
│ Assigned Agents: │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ Financial Analyst - Weekly financial review │ │
│ │ Market Intelligence - Competitor tracking │ │
│ │ Risk Monitor - Early warning detection │ │
│ │ Board Prep - Quarterly board deck preparation │ │
│ │ Valuation Modeler - Monthly valuation update │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │
│ Recent Updates: │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ 📈 Mar 15 - Closed enterprise deal with Fortune 500 ($500K ACV)│ │
│ │ 👥 Mar 10 - Hired VP of Sales from competitor │ │
│ │ 🏆 Mar 5 - Named Leader in Gartner Magic Quadrant │ │
│ │ 💰 Feb 28 - Q4 results: beat revenue target by 12% │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │
│ Actions: [📊 Full Report] [📞 Schedule Call] [💡 Send Recommendation]│
│ │
└─────────────────────────────────────────────────────────────────────────┘
🎨 Interaction Component Library
Agent Card (reusable component)
<AgentCard
id="#042"
status="incident" // healthy|warning|incident|offline
type="space-guardian"
currentTask="SEV-1: DB Connection Pool"
uptime="72h 14m"
health={94}
resourceUsage={{ cpu: 78, memory: 62 }}
outputCount={156}
onClick={() => openAgentDetails('#042')}
/>
Status Badge
<StatusBadge status="healthy" /> // 🟢
<StatusBadge status="warning" /> // 🟡
<StatusBadge status="incident" /> // 🔴
<StatusBadge status="offline" /> // ⚪
Progress Ring
<ProgressRing
progress={75}
size={60}
strokeWidth={6}
color="#10B981"
showLabel={true}
/>
Metric Card
<MetricCard
label="Active Agents"
value={856}
trend={+12}
trendLabel="+1.4% vs last hour"
icon={<IconAgents />}
/>
📱 Responsive Design
Desktop (≥1280px)
- Full grid view (10x10 Agents)
- Multi-column layout
- Full sidebar
Tablet (768px - 1279px)
- Reduced grid (5x5 Agents)
- Single-column layout
- Collapsible sidebar
Mobile (<768px)
- List view (no grid)
- Bottom navigation
- Simplified information display
🚀 Performance Optimization
rendering:
  virtual_scroll: "render only the visible area (1000 Agents → ~100 DOM nodes)"
  lazy_loading: "load details on demand"
  memoization: "React.memo to avoid unnecessary re-renders"
data_fetching:
  websocket: "push state changes in real time"
  swr: "smart caching + background revalidation"
  pagination: "server-side pagination (50 items/page)"
optimizations:
  bundle_splitting: "split bundles per application"
  image_optimization: "WebP + lazy loading"
  service_worker: "offline caching"
🎯 Next Steps
- Finalize the design mocks (high-fidelity Figma prototype)
- Scaffold the frontend (Next.js 14 + TailwindCSS)
- Build the shared component library (AgentCard, StatusBadge, etc.)
- Develop the main views for all four applications
- Integrate real-time data over WebSocket
Agent Cage Design Document
🎯 Core Concept
“1000 cages housing 1000 AIs”
Each Cage is:
- An isolated execution environment (Docker Container / K8s Pod)
- A dedicated resource quota (CPU, Memory, GPU, Token Budget)
- Persistent state storage (Agent Memory, Task History, Outputs)
- An independently health-monitored unit (Heartbeat, Error Rate, Resource Usage)
📦 Cage Architecture
┌─────────────────────────────────────────────────────────────────────────┐
│ Cage #042 │
│ (Isolated Agent Environment) │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ Agent Runtime │ │
│ │ ┌───────────────────────────────────────────────────────────┐ │ │
│ │ │ OpenClaw Agent Instance │ │ │
│ │ │ - Model: qwen3.5-plus │ │ │
│ │ │ - Context Window: 262K tokens │ │ │
│ │ │ - Skills: [space-guardian, incident-responder, ...] │ │ │
│ │ │ - Memory: Short-term (session) + Long-term (files) │ │ │
│ │ └───────────────────────────────────────────────────────────┘ │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ┌────────────────────────────────┼────────────────────────────────┐ │
│ │ Resource Quotas │ │
│ │ CPU: 2 cores (limit) Memory: 4GB (limit) │ │
│ │ GPU: 0.5 A10 (limit) Tokens: 100K/hour (limit) │ │
│ │ Network: 100 Mbps (limit) Storage: 10GB (limit) │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ┌────────────────────────────────┼────────────────────────────────┐ │
│ │ Persistent State │ │
│ │ /cage/state/ │ │
│ │ ├── agent.json - Agent identity & config │ │
│ │ ├── memory.md - Long-term memory │ │
│ │ ├── task_history.jsonl - Completed tasks log │ │
│ │ ├── outputs/ - Generated artifacts │ │
│ │ └── metrics.jsonl - Performance metrics │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ┌────────────────────────────────┼────────────────────────────────┐ │
│ │ Health Monitor │ │
│ │ - Heartbeat: Every 30s │ │
│ │ - Error Tracking: Capture & report exceptions │ │
│ │ - Resource Monitoring: CPU, Memory, Token usage │ │
│ │ - Auto-recovery: Restart on crash, Migrate on overload │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────┘
🔧 Technical Implementation
Kubernetes Pod Template
apiVersion: v1
kind: Pod
metadata:
  name: cage-042
  namespace: agent-platform
  labels:
    cage-id: "042"
    agent-type: "space-guardian"
    status: "active"
  annotations:
    agent.openclaw.ai/id: "agent-042"
    agent.openclaw.ai/created: "2026-03-01T10:00:00Z"
spec:
  containers:
    - name: agent-runtime
      image: openclaw/agent-runtime:v1.0.0
      # Resource quotas. Note: nvidia.com/gpu only accepts whole GPUs and the
      # request must equal the limit; fractional shares like "0.5 A10" require
      # MIG or time-slicing configured at the device-plugin level instead.
      resources:
        requests:
          cpu: "1"
          memory: "2Gi"
          nvidia.com/gpu: "1"
        limits:
          cpu: "2"
          memory: "4Gi"
          nvidia.com/gpu: "1"
      # Environment variables
      env:
        - name: CAGE_ID
          value: "042"
        - name: AGENT_ID
          value: "agent-042"
        - name: AGENT_TYPE
          value: "space-guardian"
        - name: TOKEN_BUDGET_HOURLY
          value: "100000"
        - name: ORCHESTRATOR_URL
          value: "http://orchestrator.agent-platform.svc:8080"
      # Volume mounts
      volumeMounts:
        - name: state-volume
          mountPath: /cage/state
        - name: outputs-volume
          mountPath: /cage/outputs
        - name: logs-volume
          mountPath: /cage/logs
      # Health checks
      livenessProbe:
        httpGet:
          path: /health/live
          port: 8080
        initialDelaySeconds: 30
        periodSeconds: 30
        timeoutSeconds: 10
        failureThreshold: 3
      readinessProbe:
        httpGet:
          path: /health/ready
          port: 8080
        initialDelaySeconds: 10
        periodSeconds: 10
        timeoutSeconds: 5
        failureThreshold: 3
  volumes:
    - name: state-volume
      persistentVolumeClaim:
        claimName: cage-042-state
    - name: outputs-volume
      persistentVolumeClaim:
        claimName: cage-042-outputs
    - name: logs-volume
      emptyDir:
        sizeLimit: 1Gi
  # Node anti-affinity (optional: spread agent pods across nodes)
  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          podAffinityTerm:
            labelSelector:
              matchLabels:
                app: agent-runtime
            topologyKey: kubernetes.io/hostname
📊 Cage State Machine
┌─────────────────────────────────────────────────────────────────────────┐
│ Cage State Machine │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────┐ │
│ │ CREATED │ │
│ └──────┬──────┘ │
│ │ start() │
│ ▼ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ STOPPED │◄─────────│ STARTING │─────────►│ ACTIVE │ │
│ └─────────────┘ failed └─────────────┘ └──────┬──────┘ │
│ ▲ │ │
│ │ │ │
│ │ ┌─────────────┐ │ │
│ │ │ ERROR │◄──────────────┘ error() │
│ │ └──────┬──────┘ │
│ │ │ │
│ │ │ recover() │
│ │ ▼ │
│ │ ┌─────────────┐ │
│ └────────────────────│ RECOVERING │ │
│ └─────────────┘ │
│ │
│ State Transitions: │
│ - CREATED → STARTING: Pod scheduled, container starting │
│ - STARTING → ACTIVE: Health check passed, ready for tasks │
│ - STARTING → STOPPED: Startup failed │
│ - ACTIVE → ERROR: Runtime error detected │
│ - ERROR → RECOVERING: Auto-recovery initiated │
│ - RECOVERING → ACTIVE: Recovery successful │
│ - RECOVERING → STOPPED: Recovery failed │
│ - ACTIVE → STOPPED: Manual stop or resource reclamation │
│ │
└─────────────────────────────────────────────────────────────────────────┘
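The transition table above can be encoded as a small guard class so illegal transitions fail loudly. States and edges mirror the diagram; the `Cage` class itself is an illustrative sketch, not the platform's actual implementation.

```python
# Legal transitions, one set of successor states per current state.
VALID_TRANSITIONS = {
    "CREATED":    {"STARTING"},
    "STARTING":   {"ACTIVE", "STOPPED"},
    "ACTIVE":     {"ERROR", "STOPPED"},
    "ERROR":      {"RECOVERING"},
    "RECOVERING": {"ACTIVE", "STOPPED"},
    "STOPPED":    set(),  # terminal until the cage is recreated
}

class Cage:
    def __init__(self, cage_id: str):
        self.cage_id = cage_id
        self.state = "CREATED"

    def transition(self, new_state: str) -> None:
        """Move to new_state, rejecting edges the diagram does not allow."""
        if new_state not in VALID_TRANSITIONS[self.state]:
            raise ValueError(f"illegal transition {self.state} -> {new_state}")
        self.state = new_state

cage = Cage("042")
for step in ["STARTING", "ACTIVE", "ERROR", "RECOVERING", "ACTIVE"]:
    cage.transition(step)
print(cage.state)  # ACTIVE
```

Keeping the table as data (rather than if/else chains) makes it trivial to diff against the design doc when the diagram changes.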
📁 Cage Directory Layout
/cage/
├── state/                    # Persistent state
│   ├── agent.json            # Agent identity & config
│   │     {
│   │       "id": "agent-042",
│   │       "cage_id": "042",
│   │       "type": "space-guardian",
│   │       "config": { ... },
│   │       "created_at": "2026-03-01T10:00:00Z"
│   │     }
│   │
│   ├── memory.md             # Long-term memory (similar to MEMORY.md)
│   │     # The Agent's learning history and distilled experience
│   │
│   ├── task_history.jsonl    # Task history log
│   │     {"task_id": "...", "type": "...", "status": "...", ...}
│   │
│   ├── context.json          # Current context window
│   │     {
│   │       "current_task": "...",
│   │       "conversation": [...],
│   │       "tools_available": [...]
│   │     }
│   │
│   └── metrics.jsonl         # Performance metrics
│         {"timestamp": "...", "cpu": 0.78, "memory": 0.62, "tokens": 45000}
│
├── outputs/                  # Artifacts
│   ├── 2026-03-01/
│   │   ├── artifact-001.json
│   │   ├── artifact-002.md
│   │   └── artifact-003.py
│   └── 2026-03-02/
│       └── ...
│
├── logs/                     # Runtime logs
│   ├── agent.log             # Main Agent log
│   ├── task.log              # Task execution log
│   └── error.log             # Error log
│
└── tmp/                      # Temporary files
    └── ...
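Append-only JSONL files like `task_history.jsonl` can be read and written with a few lines of Python. A sketch, assuming the record fields shown above (the helper functions are ours, and a temp directory stands in for `/cage/state/`):

```python
import json
import tempfile
from pathlib import Path

def append_task(path: Path, record: dict) -> None:
    """Append one task record as a single JSON line (append-only log)."""
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")

def load_tasks(path: Path) -> list:
    """Read every task record back from the JSONL file."""
    with path.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

state_dir = Path(tempfile.mkdtemp())  # stand-in for /cage/state/
history = state_dir / "task_history.jsonl"
append_task(history, {"task_id": "t-001", "type": "incident", "status": "completed"})
append_task(history, {"task_id": "t-002", "type": "analysis", "status": "failed"})
print([r["task_id"] for r in load_tasks(history)])  # ['t-001', 't-002']
```

JSONL suits this layout well: appends are atomic enough for a single writer, and recovery can replay the log line by line without parsing one giant JSON document.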
🔄 Cage Lifecycle Management
Creation Flow
1. Orchestrator decides to create a new Cage
↓
2. Allocate a Cage ID (001-1000)
↓
3. Create the Kubernetes Pod
↓
4. Mount the Persistent Volumes
↓
5. Start the Agent Runtime
↓
6. Health checks pass
↓
7. Register with the Agent Registry
↓
8. Start accepting tasks
Run Flow
1. Fetch a task from the Task Queue
↓
2. Load the task context
↓
3. Execute the task (Agent reasoning + tool calls)
↓
4. Save artifacts to /cage/outputs/
↓
5. Update the task status
↓
6. Send heartbeat + metrics
↓
7. Return to idle and wait for the next task
Recovery Flow
1. Error detected (health check failure / abnormal exit)
↓
2. Mark the Cage as ERROR
↓
3. Save the current state to persistent storage
↓
4. Attempt to restart the Pod
↓
5. Restore from persisted state
↓
6. Health checks pass
↓
7. Resume task execution
Teardown Flow
1. Receive a teardown instruction (resource reclamation / Agent retirement)
↓
2. Stop accepting new tasks
↓
3. Wait for the current task to finish (or force-terminate)
↓
4. Archive artifacts to cold storage
↓
5. Back up critical state
↓
6. Delete the Kubernetes Pod
↓
7. Release the Persistent Volumes
↓
8. Deregister from the Agent Registry
📈 Cage Metrics Monitoring
Real-time Metrics
cage_metrics:
  resource_usage:
    cpu_percent: "0-100"
    memory_percent: "0-100"
    gpu_percent: "0-100"
    disk_usage_bytes: "integer"
    network_rx_bytes: "integer"
    network_tx_bytes: "integer"
  agent_status:
    status: "active|idle|busy|blocked|error"
    current_task_id: "uuid"
    task_duration_seconds: "integer"
    tokens_used: "integer"
    tokens_remaining: "integer"
  health:
    heartbeat_timestamp: "ISO8601"
    uptime_seconds: "integer"
    error_count_1h: "integer"
    success_rate_24h: "float (0-1)"
  productivity:
    tasks_completed_24h: "integer"
    artifacts_generated_24h: "integer"
    avg_task_duration_seconds: "float"
    quality_score_avg: "float (0-100)"
Aggregated Metrics
fleet_metrics:
  total_cages: 1000
  active_cages: 856
  idle_cages: 120
  error_cages: 24
  resource_totals:
    cpu_allocated: "2000 cores"
    cpu_used: "1456 cores"
    memory_allocated: "4000 GB"
    memory_used: "2890 GB"
    tokens_budget_daily: "1B"
    tokens_used_daily: "756M"
  productivity:
    tasks_completed_24h: 12456
    artifacts_generated_24h: 45678
    avg_resolution_time_minutes: 8.2
    auto_resolution_rate: 0.72
  cost:
    compute_cost_daily: "$450"
    token_cost_daily: "$756"
    storage_cost_daily: "$25"
    total_cost_daily: "$1,231"
🔐 Cage Security Design
Isolation
isolation:
  namespace: "a dedicated K8s Namespace per Cage"
  network_policy: "restrict network access between Cages"
  service_account: "a dedicated service account per Cage"
  secrets: "per-Cage secret management"
resource_limits:
  cpu: "hard limit to prevent resource contention"
  memory: "hard limit so one OOM cannot affect other Cages"
  disk: "quota management to prevent storage exhaustion"
  network: "bandwidth limits to prevent congestion"
Access Control
rbac:
  cage_service_account:
    permissions:
      - read: own_state
      - write: own_outputs
      - execute: assigned_tasks
    denied:
      - access: other_cages
      - modify: orchestrator
      - delete: persistent_volumes
  orchestrator_access:
    permissions:
      - create: cages
      - delete: cages
      - send_tasks: any_cage
      - read_metrics: all_cages
💰 Cage Cost Model
Per-Cage Daily Cost
cage_042_daily_cost:
  compute:
    kubernetes_pod: "2 vCPU x 24h x $0.05/vCPU/h = $2.40"
    gpu_share: "0.5 A10 x 24h x $0.50/GPU/h = $6.00"
    storage: "10GB x $0.10/GB/day = $1.00"
    networking: "~$0.10"
    subtotal: "$9.50"
  tokens:
    budget: "100K tokens/hour x 24h = 2.4M tokens/day"
    cost: "2.4M x $0.002/1K = $4.80"
  total_per_cage_per_day: "$14.30"
  total_per_cage_per_month: "$429"
Cost at 1000-Cage Scale
fleet_1000_monthly_cost:
  compute: "$9.50 x 1000 x 30 = $285,000"
  tokens: "$4.80 x 1000 x 30 = $144,000"
  storage: "$0.50 x 1000 x 30 = $15,000"
  management_overhead: "$20,000"
  total_monthly: "$464,000"
  total_annual: "$5,568,000"
  cost_per_artifact: "$464,000 / 1,000,000 artifacts = $0.46"
  cost_per_task: "$464,000 / 500,000 tasks = $0.93"
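The arithmetic in both cost blocks can be checked in a few lines of Python. The unit prices are the ones the model states; all figures are estimates rather than quotes, and the function shape is ours.

```python
CPU_PER_VCPU_HOUR = 0.05    # $/vCPU/h
GPU_PER_GPU_HOUR = 0.50     # $/GPU/h (A10)
STORAGE_PER_GB_DAY = 0.10   # $/GB/day
TOKEN_PRICE_PER_1K = 0.002  # $/1K tokens

def cage_daily_cost(vcpus=2, gpu_share=0.5, storage_gb=10,
                    networking=0.10, tokens_per_day=2_400_000):
    """Return (compute, tokens) daily cost in dollars for one cage."""
    compute = (vcpus * 24 * CPU_PER_VCPU_HOUR        # $2.40
               + gpu_share * 24 * GPU_PER_GPU_HOUR   # $6.00
               + storage_gb * STORAGE_PER_GB_DAY     # $1.00
               + networking)                         # $0.10 -> subtotal $9.50
    tokens = tokens_per_day / 1000 * TOKEN_PRICE_PER_1K  # $4.80
    return compute, tokens

compute, tokens = cage_daily_cost()
daily = compute + tokens
# The fleet roll-up adds its own $0.50/day storage line and fixed overhead,
# as in the model above.
fleet_monthly = (compute + tokens + 0.50) * 1000 * 30 + 20_000
print(round(daily, 2))       # 14.3  ($/cage/day)
print(round(daily * 30))     # 429   ($/cage/month)
print(round(fleet_monthly))  # 464000 ($/month for 1000 cages)
```

Note the per-cage subtotal already includes $1.00/day of storage, while the fleet roll-up adds a separate $0.50/day storage line; the code reproduces the model as written rather than reconciling the two.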
🚀 Scaling Strategy
Auto-scaling
horizontal_pod_autoscaler:
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
scale_up:
  when: "avg utilization > 80% for 5 minutes"
  step: "+10% of current capacity"
  max: "1000 cages"
scale_down:
  when: "avg utilization < 40% for 30 minutes"
  step: "-10% of current capacity"
  min: "100 cages"
Queue-Driven Scaling
queue_based_scaling:
  metrics:
    - queue_depth: "number of pending tasks"
    - avg_wait_time: "average task wait time"
  scale_up_trigger:
    - queue_depth > 500
    - avg_wait_time > 5 minutes
  scale_down_trigger:
    - queue_depth < 50
    - avg_wait_time < 30 seconds
    - idle_cages > 30%
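The trigger lists above can be collapsed into one decision function. The doc lists the triggers as bullets without a combinator, so treating scale-up as "any condition" and scale-down as "all conditions" is an assumption (it is the conservative reading: eager to grow, reluctant to shrink).

```python
def scaling_decision(queue_depth: int, avg_wait_s: float, idle_ratio: float) -> str:
    """Return 'up', 'down', or 'hold' for the current queue state."""
    # Scale up if ANY pressure signal fires.
    if queue_depth > 500 or avg_wait_s > 5 * 60:
        return "up"
    # Scale down only when ALL slack signals agree.
    if queue_depth < 50 and avg_wait_s < 30 and idle_ratio > 0.30:
        return "down"
    return "hold"

print(scaling_decision(800, 120, 0.05))  # up   (deep queue)
print(scaling_decision(10, 5, 0.40))     # down (shallow queue, many idle cages)
print(scaling_decision(200, 60, 0.20))   # hold
```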
📝 Next Steps
- Implement the Cage Operator (K8s Custom Resource)
- Build the Agent Runtime image (Docker image)
- Stand up the monitoring stack (Prometheus + Grafana)
- Implement auto-scaling (HPA + queue-based)
- Load-test with 1000 Cages running concurrently
1000 Agent Space - Production Incident Resolution
Parallel production incident resolution at scale
URL: http://1000-agent-space.agents-dev.com/
Overview
1000 Agent Space is a platform for parallel production incident resolution, where 1000 AI Agents work together to detect, triage, analyze, and resolve production incidents.
Incident Pipeline
┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐
│ Detect │ → │ Triage │ → │ Analyze │ → │ Resolve │
│ (100%) │ │ (100%) │ │ (90%) │ │ (70%) │
└─────────┘ └─────────┘ └─────────┘ └─────────┘
Key Metrics
| Metric | Target | Description |
|---|---|---|
| MTTR | <10 minutes | Mean Time To Resolve |
| Auto-resolution Rate | >70% | Incidents resolved without human intervention |
| False Positive Rate | <5% | Incorrect incident detection |
Human Escalation
When auto-remediation fails, the system escalates to human engineers via:
- Phone call
- SMS
- Instant messaging (Telegram/Slack)
Human handling results are fed back to Agents for learning.
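The escalation order above can be modelled as an ordered channel list that is walked until someone acknowledges. A sketch, where `notify` is a stand-in for the real paging integration:

```python
from typing import Callable, Optional

# Illustrative channel order: phone call, then SMS, then IM (Telegram/Slack)
ESCALATION_CHANNELS = ["phone", "sms", "im"]

def escalate(incident_id: str,
             notify: Callable[[str, str], bool]) -> Optional[str]:
    """Try each channel in order; return the first that is acknowledged, else None."""
    for channel in ESCALATION_CHANNELS:
        if notify(channel, incident_id):  # notify() returns True on acknowledgement
            return channel
    return None
```

The acknowledgement result (and the human's eventual fix) would then be written back to the Agent's learning store.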
Frontend Features
- Agent Grid View: Real-time status of 1000 Agents
- Incident List: Filterable incident history
- Agent Details: Deep dive into individual Agent activity
- Metrics Dashboard: System-wide performance metrics
1000 Agent Engineering - Autonomous Mono-Repo Convergence
400 Repos → 1 Codebase, driven by AI
URL: https://1000-agent-engineering.spaces.agents-dev.com/
Overview
1000 Agent Engineering is a platform for autonomous mono-repo convergence, where AI Agents analyze, plan, and execute the consolidation of 400+ repositories into a single AI-friendly codebase.
Repo Analysis Pipeline
┌─────────────────────────────────────────────────────────┐
│ 400 Repos Input │
└─────────────────────────────────────────────────────────┘
│
┌───────────────────┼───────────────────┐
▼ ▼ ▼
┌─────────┐ ┌─────────┐ ┌─────────┐
│Repo-001 │ │Repo-002 │ │Repo-400 │
│ Agent │ │ Agent │ │ Agent │
└────┬────┘ └────┬────┘ └────┬────┘
│ │ │
└───────────────────┼───────────────────┘
▼
┌─────────────────────────┐
│ Aggregation Agent │
└────────────┬────────────┘
│
▼
┌─────────────────────────┐
│ Mono-Repo Generator │
└─────────────────────────┘
Repo Tier Classification
| Tier | Score Range | Count | Action |
|---|---|---|---|
| S-Tier | 85-100 | ~10% | Deep analysis (8 Agents), merge first |
| A-Tier | 70-84 | ~40% | Standard analysis (4 Agents) |
| B-Tier | 50-69 | ~40% | Standard analysis (2 Agents) |
| C-Tier | 0-49 | ~10% | Quick scan (1 Agent), consider archiving |
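The tier table maps directly onto a threshold function. A minimal sketch (the function name is illustrative):

```python
def classify_repo(score: float) -> tuple:
    """Map an analysis score (0-100) to (tier, agents assigned), per the table above."""
    if score >= 85:
        return ("S", 8)   # deep analysis, merge first
    if score >= 70:
        return ("A", 4)   # standard analysis
    if score >= 50:
        return ("B", 2)   # standard analysis
    return ("C", 1)       # quick scan, consider archiving
```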
Continuous Improvement
After initial consolidation, Guardian Agents continuously monitor their assigned components:
- Automated code review
- Test generation
- Documentation updates
- Refactoring suggestions
- Dependency updates
Frontend Features
- Repo Convergence Map: Visual progress to mono-repo
- Repo Details: Deep dive into individual repository analysis
- Agent Assignment: See which Agents are working on which repos
- Merge Plan: Generated consolidation proposals
1000 Agent CorpUnit - AI-Driven Corporate Brain
Intelligent enterprise operations at scale
URL: https://1000-agent-corp-unit.spaces.agents-dev.com/
Overview
1000 Agent CorpUnit is an AI-driven corporate brain, where specialized Agent teams handle different corporate functions: Finance, HR, Legal, Marketing, Growth, and Investment.
Corporate Function Structure
┌──────────────────────────────────────────────────────────┐
│ CEO Agent (Decision Coordination) │
└────────────────────────┬─────────────────────────────────┘
│
┌─────────────────┼─────────────────┐
▼ ▼ ▼
┌─────────┐ ┌─────────┐ ┌─────────┐
│ CFO │ │ COO │ │ CTO │
│ Agent │ │ Agent │ │ Agent │
└────┬────┘ └────┬────┘ └────┬────┘
│ │ │
▼ ▼ ▼
┌─────────┐ ┌─────────────┐ ┌───────────┐
│Finance │ │HR/Legal/Ops │ │Engineering│
│Team │ │Team │ │Team │
└─────────┘ └─────────────┘ └───────────┘
Department Agent Teams
| Department | Agent Count | Key Functions |
|---|---|---|
| Finance | 150 | Budget analysis, cost control, financial forecasting |
| HR | 100 | Recruitment screening, performance evaluation, training planning |
| Legal | 80 | Contract review, compliance checks, risk assessment |
| Marketing | 200 | Market research, competitor analysis, marketing strategy |
| Growth | 170 | User growth, conversion optimization, A/B testing |
| Investment | 100 | Investment analysis, due diligence, portfolio management |
| Operations | 200 | Process automation, workflow optimization |
Output
- Real-time Executive Dashboard: Live business metrics
- Decision Recommendation Reports: AI-generated insights
- Automated Workflow Execution: Routine tasks handled automatically
Frontend Features
- Executive Summary: High-level business health
- Department Views: Deep dive into each function
- Insight Feed: Recent AI-generated insights
- Action Queue: Pending decisions requiring human approval
1000 Invested AI Company - Portfolio Management
Managing 1000 companies with AI-driven insights
URL: https://1000-invested-ai-company.spaces.agents-dev.com/
Overview
1000 Invested AI Company is a portfolio management dashboard for venture capital and private equity firms, where AI Agents monitor and analyze 1000+ portfolio companies in real-time.
Portfolio Structure
┌─────────────────────────────────────────────────────────┐
│ Portfolio Manager Agent │
└────────────────────────┬────────────────────────────────┘
│
┌─────────────────┼─────────────────┐
▼ ▼ ▼
┌─────────┐ ┌─────────┐ ┌─────────┐
│Company-1│ │Company-2│ │Company-N│
│ Agent │ │ Agent │ │ Agent │
└────┬────┘ └────┬────┘ └────┬────┘
│ │ │
▼ ▼ ▼
┌─────────┐ ┌─────────┐ ┌─────────┐
│Company-1│ │Company-2│ │Company-N│
│ Metrics │ │ Metrics │ │ Metrics │
└─────────┘ └─────────┘ └─────────┘
Company Metrics Tracked
| Category | Metrics |
|---|---|
| Financial | Revenue, ARR, Gross Margin, Burn Rate, Runway, Cash Balance |
| Growth | Customers, NRR, CAC Payback, LTV/CAC |
| Team | Headcount, Hiring Plan, Attrition Rate |
| Market | Market Share, Competitor Positioning |
Analysis Capabilities
- Real-time Financial Health Monitoring: Continuous tracking of key metrics
- Industry Benchmarking: Compare against industry peers
- Risk Early Warning: Detect potential issues before they become critical
- Exit Timing Recommendations: AI-driven exit strategy suggestions
- Portfolio Rebalancing Optimization: Optimize allocation across companies
Frontend Features
- Portfolio Overview: High-level performance metrics (TVPI, DPI, IRR)
- Company List: Filterable list of all portfolio companies
- Company Details: Deep dive into individual company metrics
- LP Reporting: Automated limited partner report generation
Key Metrics
| Metric | Description |
|---|---|
| TVPI | Total Value to Paid-In Capital |
| DPI | Distributed to Paid-In Capital |
| RVPI | Residual Value to Paid-In Capital |
| IRR | Internal Rate of Return |
API Design
RESTful + GraphQL + WebSocket APIs for 1000 Agent Platform
RESTful APIs
Agent Management
GET /api/v1/agents # List all Agents
GET /api/v1/agents/:id # Get Agent details
POST /api/v1/agents/:id/pause # Pause Agent
POST /api/v1/agents/:id/resume # Resume Agent
POST /api/v1/agents/:id/restart # Restart Agent
DELETE /api/v1/agents/:id # Delete Agent
Task Management
GET /api/v1/tasks # List tasks (with filtering)
POST /api/v1/tasks # Create task
GET /api/v1/tasks/:id # Get task details
POST /api/v1/tasks/:id/cancel # Cancel task
Artifact Management
GET /api/v1/artifacts # List outputs
GET /api/v1/artifacts/:id # Get artifact details
POST /api/v1/artifacts/:id/approve # Human approval
Cage Management
GET /api/v1/cages # List all cages
GET /api/v1/cages/:id # Get cage details
GET /api/v1/cages/:id/metrics # Get cage metrics
Metrics & Analytics
GET /api/v1/metrics/agents # Agent metrics aggregation
GET /api/v1/metrics/system # System-wide metrics
GET /api/v1/analytics/productivity # Productivity analysis
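A client might address these endpoints as follows. The host name is an assumption; the paths come from the list above. These helpers only construct requests and make no network calls:

```python
from typing import Optional
from urllib.parse import urlencode

BASE = "https://api.agents-dev.com"  # assumed host

def list_tasks_url(status: Optional[str] = None, limit: int = 50) -> str:
    """Build the GET /api/v1/tasks URL with optional filters."""
    params = {"limit": limit}
    if status:
        params["status"] = status
    return f"{BASE}/api/v1/tasks?{urlencode(params)}"

def pause_agent_request(agent_id: str) -> tuple:
    """(method, url) pair for POST /api/v1/agents/:id/pause."""
    return ("POST", f"{BASE}/api/v1/agents/{agent_id}/pause")
```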
WebSocket Events
// Frontend subscribes to real-time events
ws.subscribe('agent:status:changed', (data) => {
// Agent status changed
});
ws.subscribe('task:completed', (data) => {
// Task completed
});
ws.subscribe('artifact:created', (data) => {
// New artifact created
});
ws.subscribe('alert:triggered', (data) => {
// Alert triggered
});
GraphQL Schema (Sample)
type Query {
agent(id: ID!): Agent
agents(filter: AgentFilter): [Agent!]!
task(id: ID!): Task
tasks(filter: TaskFilter): [Task!]!
cage(id: ID!): Cage
cages: [Cage!]!
metrics(timeRange: TimeRange!): Metrics!
}
type Mutation {
pauseAgent(id: ID!): Agent
resumeAgent(id: ID!): Agent
restartAgent(id: ID!): Agent
createTask(input: TaskInput!): Task
cancelTask(id: ID!): Task
approveArtifact(id: ID!): Artifact
}
type Subscription {
agentStatusChanged: Agent!
taskCompleted: Task!
artifactCreated: Artifact!
alertTriggered: Alert!
}
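A client exercises this schema with a plain JSON POST. The `/graphql` endpoint path and the field selection below are assumptions; the mutation name matches the schema above:

```python
import json

def graphql_payload(query: str, variables: dict) -> str:
    """JSON body for a POST to the (assumed) /graphql endpoint."""
    return json.dumps({"query": query, "variables": variables})

PAUSE_AGENT = """
mutation PauseAgent($id: ID!) {
  pauseAgent(id: $id) { id status }
}
"""

body = graphql_payload(PAUSE_AGENT, {"id": "agent-042"})
```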
Data Models
Core database schema for 1000 Agent Platform
Core Tables
Agents Table
CREATE TABLE agents (
id UUID PRIMARY KEY,
name VARCHAR(255) NOT NULL,
type VARCHAR(50) NOT NULL, -- space|engineering|corpunit|investment
status VARCHAR(50) NOT NULL, -- active|idle|busy|blocked|error
cage_id VARCHAR(50), -- Cage number (001-1000)
current_task_id UUID,
resource_config JSONB,
metrics JSONB, -- Real-time metrics
created_at TIMESTAMP DEFAULT NOW(),
updated_at TIMESTAMP DEFAULT NOW(),
last_heartbeat TIMESTAMP
);
Tasks Table
CREATE TABLE tasks (
id UUID PRIMARY KEY,
type VARCHAR(100) NOT NULL,
priority INTEGER DEFAULT 0,
status VARCHAR(50) NOT NULL, -- pending|running|completed|failed|cancelled
assigned_agent_id UUID REFERENCES agents(id),
input JSONB NOT NULL,
output JSONB,
error TEXT,
started_at TIMESTAMP,
completed_at TIMESTAMP,
created_at TIMESTAMP DEFAULT NOW()
);
Artifacts Table (Agent Outputs)
CREATE TABLE artifacts (
id UUID PRIMARY KEY,
agent_id UUID REFERENCES agents(id),
task_id UUID REFERENCES tasks(id),
type VARCHAR(50) NOT NULL, -- code|doc|analysis|decision|report
title VARCHAR(500),
content TEXT,
quality_score FLOAT,
human_approved BOOLEAN DEFAULT FALSE,
metadata JSONB,
created_at TIMESTAMP DEFAULT NOW()
);
Cages Table (Agent Containers/Resource Quotas)
CREATE TABLE cages (
id VARCHAR(50) PRIMARY KEY, -- 001-1000
agent_id UUID REFERENCES agents(id),
status VARCHAR(50) NOT NULL, -- occupied|vacant|maintenance
resource_limits JSONB, -- cpu, memory, gpu, tokens
resource_usage JSONB, -- Actual usage
created_at TIMESTAMP DEFAULT NOW()
);
Metrics Table (Time-Series Metrics)
CREATE TABLE metrics (
time TIMESTAMP NOT NULL,
agent_id UUID NOT NULL,
metric_name VARCHAR(100) NOT NULL,
metric_value FLOAT NOT NULL,
labels JSONB,
PRIMARY KEY (time, agent_id, metric_name)
) PARTITION BY RANGE (time);
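Because `metrics` is declared `PARTITION BY RANGE (time)`, its partitions must be created explicitly (e.g. one per month). A minimal DDL generator, assuming PostgreSQL declarative partitioning; the partition naming scheme is an assumption:

```python
import datetime

def monthly_partition_ddl(year: int, month: int) -> str:
    """DDL for one monthly range partition of the metrics table."""
    start = datetime.date(year, month, 1)
    # First day of the following month (handles December rollover)
    end = datetime.date(year + month // 12, month % 12 + 1, 1)
    name = f"metrics_{start:%Y_%m}"
    return (f"CREATE TABLE {name} PARTITION OF metrics "
            f"FOR VALUES FROM ('{start}') TO ('{end}');")
```

A scheduled job would call this ahead of each month so inserts never land on a missing partition.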
Indexes
-- Performance indexes
CREATE INDEX idx_agents_status ON agents(status);
CREATE INDEX idx_agents_type ON agents(type);
CREATE INDEX idx_tasks_status ON tasks(status);
CREATE INDEX idx_tasks_assigned ON tasks(assigned_agent_id);
CREATE INDEX idx_artifacts_agent ON artifacts(agent_id);
CREATE INDEX idx_metrics_time ON metrics(time DESC);
Kubernetes Deployment
Deploying 1000 Agent Cages on Kubernetes
Namespace Setup
apiVersion: v1
kind: Namespace
metadata:
name: agent-platform
labels:
name: agent-platform
Cage Pod Template
apiVersion: v1
kind: Pod
metadata:
name: cage-042
namespace: agent-platform
labels:
cage-id: "042"
agent-type: "space-guardian"
spec:
containers:
- name: agent-runtime
image: openclaw/agent-runtime:v1.0.0
resources:
requests:
cpu: "1"
memory: "2Gi"
nvidia.com/gpu: "0.5" # fractional GPUs require MIG or time-slicing; the stock device plugin accepts integers only, with requests equal to limits
limits:
cpu: "2"
memory: "4Gi"
nvidia.com/gpu: "1"
env:
- name: CAGE_ID
value: "042"
- name: AGENT_ID
value: "agent-042"
- name: AGENT_TYPE
value: "space-guardian"
volumeMounts:
- name: state-volume
mountPath: /cage/state
- name: outputs-volume
mountPath: /cage/outputs
volumes:
- name: state-volume
persistentVolumeClaim:
claimName: cage-042-state
- name: outputs-volume
persistentVolumeClaim:
claimName: cage-042-outputs
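The template above covers a single cage; in practice the Cage Operator (or a Helm chart) would stamp out all 1000. A sketch of the metadata-generation half of that job, with the container spec elided (names and labels follow the template; zero-padding is taken from `cage-042`):

```python
def cage_manifest(cage_id: int, agent_type: str = "space-guardian") -> dict:
    """Pod metadata for one cage, following the template above."""
    cid = f"{cage_id:03d}"  # 1 -> "001", 42 -> "042"
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {
            "name": f"cage-{cid}",
            "namespace": "agent-platform",
            "labels": {"cage-id": cid, "agent-type": agent_type},
        },
        # containers/volumes elided; see the full template above
    }

manifests = [cage_manifest(i) for i in range(1, 1001)]
```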
Horizontal Pod Autoscaler
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: agent-cages-hpa
namespace: agent-platform
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: agent-cages
minReplicas: 100
maxReplicas: 1000
metrics:
- type: Resource
resource:
name: cpu
target:
type: Utilization
averageUtilization: 70
- type: Resource
resource:
name: memory
target:
type: Utilization
averageUtilization: 80
Deployment Commands
# Deploy namespace
kubectl apply -f k8s/namespace.yaml
# Deploy core services
kubectl apply -f k8s/orchestrator.yaml
kubectl apply -f k8s/cage-operator.yaml
# Deploy frontend
kubectl apply -f k8s/frontend.yaml
# Check deployment status
kubectl get pods -n agent-platform
kubectl get cages -n agent-platform
# Scale cages
kubectl scale deployment agent-cages --replicas=100 -n agent-platform
Cost Model
Understanding the economics of 1000 Agent Platform
Single Cage Daily Cost
| Component | Calculation | Cost |
|---|---|---|
| Kubernetes Pod | 2 vCPU × 24h × $0.05/vCPU/h | $2.40 |
| GPU Share | 0.5 A10 × 24h × $0.50/GPU/h | $6.00 |
| Storage | 10GB × $0.10/GB/day | $1.00 |
| Networking | Estimated | $0.10 |
| Compute Subtotal | | $9.50 |
| Tokens | 2.4M × $0.002/1K | $4.80 |
| Total per Cage per Day | | $14.30 |
| Total per Cage per Month | $14.30 × 30 | $429 |
1000 Cage Fleet Monthly Cost
| Component | Calculation | Monthly Cost |
|---|---|---|
| Compute | $9.50 × 1000 × 30 | $285,000 |
| Tokens | $4.80 × 1000 × 30 | $144,000 |
| Storage | $0.50 × 1000 × 30 | $15,000 |
| Management Overhead | Estimated | $20,000 |
| Total Monthly | | $464,000 |
| Total Annual | $464,000 × 12 | $5,568,000 |
Unit Economics
| Metric | Calculation | Cost |
|---|---|---|
| Cost per Task | $464,000 / 500,000 tasks | $0.93 |
| Cost per Artifact | $464,000 / 1,000,000 artifacts | $0.46 |
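The arithmetic in the tables above, reproduced as a quick sanity check (all figures taken directly from the cost model):

```python
DAYS = 30

# Single cage, per day
compute_per_cage_day = 2.40 + 6.00 + 1.00 + 0.10  # pod + GPU share + storage + network = $9.50
tokens_per_cage_day = 2.4 * 0.002 * 1000          # 2.4M tokens at $0.002/1K = $4.80

# Fleet of 1000, per month (plus fleet storage and management overhead lines)
fleet_monthly = ((compute_per_cage_day + tokens_per_cage_day) * 1000 * DAYS
                 + 15_000 + 20_000)

cost_per_task = fleet_monthly / 500_000
cost_per_artifact = fleet_monthly / 1_000_000
```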
Cost Optimization Strategies
1. Spot Instances
Use spot/preemptible instances for non-critical workloads:
- Savings: 60-70% on compute costs
- Risk: Instances can be preempted
2. Reserved Capacity
Commit to 1-3 year reservations for baseline capacity:
- Savings: 30-40% on compute costs
- Requirement: Predictable baseline usage
3. Token Budgeting
Implement strict token budgets per Agent:
- Strategy: Dynamic allocation based on task priority
- Savings: 20-30% on token costs
4. Idle Detection
Automatically scale down idle Agents:
- Trigger: No tasks for >30 minutes
- Action: Pause or terminate cage
- Savings: 15-25% on overall costs
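The idle-detection rule reduces to a tiny policy function. Only the 30-minute pause trigger comes from the text above; the two-hour terminate cutoff is an assumption for illustration:

```python
def idle_action(idle_seconds: float) -> str:
    """Decide what to do with a cage given how long it has been idle."""
    if idle_seconds > 2 * 3600:   # assumed cutoff: terminate after 2 idle hours
        return "terminate"
    if idle_seconds > 30 * 60:    # pause after 30 idle minutes (per the trigger above)
        return "pause"
    return "keep"
```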
ROI Analysis
Traditional Approach (Human Teams)
| Function | Team Size | Annual Cost |
|---|---|---|
| Production Ops | 10 engineers | $2,000,000 |
| Code Review | 5 engineers | $1,000,000 |
| Business Analysis | 8 analysts | $1,600,000 |
| Investment Analysis | 5 analysts | $1,000,000 |
| Total | 28 people | $5,600,000 |
1000 Agent Platform
| Component | Annual Cost |
|---|---|
| Infrastructure | $5,568,000 |
| Total | $5,568,000 |
Comparison
- Cost: Similar (~$5.6M/year)
- Capacity: 1000 Agents vs 28 humans = 35x scale
- Availability: 24/7/365 vs 8h/day × 5 days/week
- Consistency: No fatigue, no turnover
Data Visualization Skill
SearXNG Skill
Glossary
Key terms and definitions
A
Agent
An AI instance that can perceive, reason, and act autonomously to accomplish tasks.
Agentic Engineering
Software engineering methodology where AI Agents play a central role in design, development, testing, and operations.
C
Cage
An isolated execution environment for a single Agent, typically implemented as a Kubernetes Pod with dedicated resources and persistent storage.
CODEOWNERS
A file that defines code ownership for automated review assignment.
M
Mono-Repo
A single repository containing all project code, as opposed to multiple separate repositories.
MTTR
Mean Time To Resolve - average time taken to resolve production incidents.
R
RD-OS
Research & Development Operating System - a living system where AI coordinates all aspects of software development and operations.
Repo
Short for repository - a version-controlled codebase.
S
Skill
A reusable capability that can be invoked by Agents (e.g., data visualization, web search).
Space
In 1000 Agent Platform context, refers to an isolated Agent execution environment (synonym for Cage).
T
Token
The basic unit of text processing for LLMs. Costs are typically measured per 1K tokens.
Trunk-Based Development
A version control practice where developers merge small changes frequently to the main branch.
FAQ
Frequently Asked Questions
General
Q: What is Agentic Engineering?
A: Agentic Engineering is a software development methodology where AI Agents play a central role in the entire engineering lifecycle - from design to deployment to operations.
Q: Why 1000 Agents?
A: 1000 is a sweet spot that provides:
- Visual impact and clear communication
- Actual productive capacity
- Manageable complexity
- Cost-effective scale
Q: Can I run this locally?
A: Yes! The MVP can run on a single machine with Docker. Full 1000-Agent scale requires a Kubernetes cluster.
Technical
Q: What LLM models are supported?
A: The platform is model-agnostic. Currently optimized for:
- Alibaba Cloud: Qwen3.5-Plus, Qwen3-Max
- OpenAI: GPT-4, GPT-4-Turbo
- Anthropic: Claude
- Google: Gemini
Q: How do Agents communicate?
A: Agents communicate through:
- Task queues (for work assignment)
- Shared state (in file system)
- Direct messaging (via Orchestrator)
Q: What happens when an Agent fails?
A: The Cage health monitor detects the failure and:
- Saves current state to persistent storage
- Attempts automatic recovery (restart)
- If recovery fails, escalates to human operator
- Reassigns pending tasks to other Agents
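The four-step recovery sequence reads naturally as a small control flow. The agent methods below (`save_state`, `restart`, etc.) are illustrative names, not a real API:

```python
def handle_agent_failure(agent, max_restarts: int = 3) -> str:
    """Recovery sequence from the answer above; the agent interface is illustrative."""
    agent.save_state()                  # 1. persist current state
    for _ in range(max_restarts):
        if agent.restart():             # 2. attempt automatic recovery
            return "recovered"
    agent.escalate_to_human()           # 3. recovery failed: page an operator
    agent.reassign_pending_tasks()      # 4. hand pending work to other Agents
    return "escalated"
```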
Cost
Q: How much does it cost to run 1000 Agents?
A: Approximately $464,000/month, broken down as:
- Compute: $285,000
- Tokens: $144,000
- Storage: $15,000
- Overhead: $20,000
Q: Can I reduce costs?
A: Yes! Strategies include:
- Use spot instances (60-70% savings on compute)
- Reserved capacity (30-40% savings)
- Token budgeting (20-30% savings)
- Idle detection and scale-down (15-25% savings)
Security
Q: How is data isolated between Agents?
A: Each Cage has:
- Dedicated Kubernetes namespace
- Isolated persistent storage
- Network policies restricting cross-Cage access
- Separate service accounts and credentials
Q: Can Agents access external systems?
A: Yes, but with strict controls:
- Whitelisted external APIs only
- Rate limiting and quotas
- Audit logging of all external calls
- Human approval for sensitive operations
Migration
Q: How long does mono-repo migration take?
A: Based on 10-repo experiment:
- Analysis: ~2 hours per repo (parallel)
- Planning: ~1 day for 400 repos
- Execution: ~2-4 weeks (phased approach)
Q: What’s the risk of migration?
A: Key risks and mitigations:
- Build complexity: Gradual migration, parallel testing
- Agent coordination overhead: Layered scheduling, batch processing
- State consistency: File locks + transaction logs
- Human acceptance: Gradual automation, retained approval points