RD-OS: Research & Development Operating System
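R&D Infrastructure for the AI Era
“Before: coordinating many people to keep up with development, deployment, testing, operations, incidents, and alerts was exhausting.”
“Future: a living system in which AI coordinates everything autonomously and humans focus on decisions.”
Core Problem
Pain Points of the Traditional R&D Model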
┌─────────────────────────────────────────────────────────────────┐
│ Traditional R&D Pain │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Understanding the System: │
│ ❌ 400+ repos, no one knows the full picture │
│ ❌ Documentation always outdated │
│ ❌ "Who owns this?" "Why was this done?" │
│ ❌ New hire ramp-up: 3-6 months │
│ │
│ Coordination Overhead: │
│ ❌ Dev → Test → Deploy → Ops: handoffs everywhere │
│ ❌ Incident response: page 5 people, 2 hours to triage │
│ ❌ Sprint planning: 2 days of meetings │
│ ❌ Post-mortem: blame, not learning │
│ │
│ Alert Fatigue: │
│ ❌ 100+ alerts/day, most are noise │
│ ❌ No context, just "something is broken" │
│ ❌ Human must investigate everything │
│ │
│ Progress Tracking: │
│ ❌ JIRA tickets, standups, status reports │
│ ❌ "What's blocked?" "Who's working on what?" │
│ ❌ Velocity is a guess │
│ │
└─────────────────────────────────────────────────────────────────┘
Root Cause: The system is passive. It waits for humans to:
- Understand it
- Coordinate across it
- Fix it
- Improve it
Vision: RD-OS (Active, Living System)
┌─────────────────────────────────────────────────────────────────┐
│                              RD-OS                              │
│                  A Living R&D Operating System                  │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│   ┌─────────────────────────────────────────────────────────┐   │
│   │                    Unified Codebase                     │   │
│   │         (400 repos → 1 mono-repo, AI-readable)          │   │
│   └─────────────────────────────────────────────────────────┘   │
│                                │                                │
│           ┌────────────────────┼────────────────────┐           │
│           ▼                    ▼                    ▼           │
│    ┌─────────────┐      ┌─────────────┐      ┌─────────────┐    │
│    │   AI Core   │      │   Skills    │      │   Humans    │    │
│    │  (Agents)   │      │   (Tools)   │      │ (Decision)  │    │
│    └─────────────┘      └─────────────┘      └─────────────┘    │
│                                                                 │
│  Capabilities:                                                  │
│  ✅ Self-understanding (always knows its state)                 │
│  ✅ Self-coordination (agents talk to each other)               │
│  ✅ Self-healing (detects and fixes issues)                     │
│  ✅ Self-improvement (identifies and acts on optimizations)     │
│                                                                 │
│  Result: Humans focus on WHAT, AI handles HOW                   │
└─────────────────────────────────────────────────────────────────┘
RD-OS Architecture
Layer 0: The Codebase (Passive Foundation)
mono-repo/
├── products/ # TiDB, TiDB Next-Gen
├── platform/ # Cloud SaaS, control plane
├── devops/ # Operations tooling
├── libs/ # Shared libraries
├── tools/ # Build/dev tools
├── docs/ # Living documentation
└── .rd-os/ # RD-OS configuration
├── agents/ # Agent definitions
├── skills/ # Skill configurations
├── workflows/ # Automated workflows
└── policies/ # Decision policies
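A minimal sketch of how tooling might sanity-check the `.rd-os/` layout above. The function name `missing_config_dirs` and the idea of a layout check are assumptions for illustration; the document does not specify how this configuration is loaded or validated.

```python
from pathlib import Path

# Subdirectories listed under .rd-os/ in the layout above.
EXPECTED_DIRS = ("agents", "skills", "workflows", "policies")


def missing_config_dirs(repo_root):
    """Return the expected .rd-os/ subdirectories that do not exist under repo_root."""
    config_root = Path(repo_root) / ".rd-os"
    return [name for name in EXPECTED_DIRS if not (config_root / name).is_dir()]
```

An empty result means all four configuration directories are present; anything else names what is still missing.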
Layer 1: Perception (Understanding the System)
┌─────────────────────────────────────────────────────────────┐
│ Perception Layer │
│ "The system understands itself" │
├─────────────────────────────────────────────────────────────┤
│ │
│ code-understanding-agent │
│ ├─ Continuously indexes codebase │
│ ├─ Maps dependencies (real-time) │
│ ├─ Tracks architecture changes │
│ └─ Answers: "What does this do?" "Who uses this?" │
│ │
│ documentation-curator │
│ ├─ Auto-generates docs from code │
│ ├─ Keeps docs in sync (per-change) │
│ ├─ Maintains architecture decision records │
│ └─ Answers: "Why was this designed this way?" │
│ │
│ health-monitor │
│ ├─ Real-time system health dashboard │
│ ├─ Tracks: build status, test coverage, tech debt │
│ ├─ Detects anomalies │
│ └─ Answers: "Is the system healthy?" │
│ │
└─────────────────────────────────────────────────────────────┘
Before vs After:
| Task | Before | After (RD-OS) |
|---|---|---|
| Understand a component | Read docs (outdated), ask team (slow) | Ask agent (instant, accurate) |
| Find dependencies | Search code, grep, hope | Query dependency graph |
| New hire ramp-up | 3-6 months | 2-4 weeks (AI-guided) |
| Architecture review | Manual docs, diagrams | Auto-generated, always current |
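The "Who uses this?" question and the dependency-graph query in the table can be sketched as a reverse lookup over a dependency map. The component names and the `DEPENDS_ON` data below are hypothetical; a real index would be produced by the code-understanding-agent, whose storage format the document does not define.

```python
from collections import defaultdict, deque

# Hypothetical edges: component -> components it depends on.
DEPENDS_ON = {
    "platform/api": ["libs/auth", "libs/storage"],
    "platform/billing": ["libs/auth"],
    "devops/backup": ["libs/storage"],
    "libs/auth": ["libs/crypto"],
    "libs/storage": [],
    "libs/crypto": [],
}


def build_reverse_index(depends_on):
    """Invert the graph so each component maps to its direct users."""
    used_by = defaultdict(set)
    for component, deps in depends_on.items():
        for dep in deps:
            used_by[dep].add(component)
    return used_by


def who_uses(target, depends_on):
    """Return every component that depends on target, directly or transitively."""
    used_by = build_reverse_index(depends_on)
    seen, queue = set(), deque([target])
    while queue:
        current = queue.popleft()
        for user in used_by.get(current, ()):
            if user not in seen:
                seen.add(user)
                queue.append(user)
    return sorted(seen)
```

With this sample data, `who_uses("libs/crypto", DEPENDS_ON)` walks from `libs/auth` up to the two platform components that use it.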
Layer 2: Coordination (Orchestrating Work)
┌─────────────────────────────────────────────────────────────┐
│ Coordination Layer │
│ "The system coordinates itself" │
├─────────────────────────────────────────────────────────────┤
│ │
│ workflow-orchestrator │
│ ├─ Dev → Test → Deploy → Ops: automatic handoffs │
│ ├─ No human coordination needed │
│ ├─ Tracks progress, unblocks automatically │
│ └─ Humans see: "Feature X: 80% done, deploying in 2h" │
│ │
│ sprint-coordinator │
│ ├─ Analyzes backlog, capacity, velocity │
│ ├─ Suggests sprint goals │
│ ├─ Adjusts mid-sprint based on reality │
│ └─ Humans see: "Sprint on track" or "Risk: feature Y" │
│ │
│ dependency-coordinator │
│ ├─ Detects cross-component changes needed │
│ ├─ Coordinates updates across components │
│ ├─ Prevents breaking changes │
│ └─ Humans see: "Updating lib X, 3 components affected" │
│ │
└─────────────────────────────────────────────────────────────┘
Before vs After:
| Task | Before | After (RD-OS) |
|---|---|---|
| Dev → Test handoff | PR review, wait for QA, days | Auto-test, auto-merge, hours |
| Deploy coordination | Schedule, change review, CAB | Auto-deploy (policy-based) |
| Sprint planning | 2-day meetings | AI-suggested, human-approved |
| Cross-team dependency | Email, meetings, delays | Auto-coordinated |
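One way to read "Auto-deploy (policy-based)" is as a small gate that separates routine changes from those reserved for human approval. The sketch below is an assumption, not the document's policy format: the `Change` fields mirror the review categories this document assigns to humans (security-critical changes, breaking changes, high-risk deployments), and the three outcome strings are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Change:
    checks_passed: bool
    security_critical: bool = False
    breaking: bool = False
    high_risk: bool = False


def deploy_decision(change):
    """Classify a change as blocked, needing human approval, or auto-deployable."""
    if not change.checks_passed:
        return "block"
    if change.security_critical or change.breaking or change.high_risk:
        return "needs_human_approval"
    return "auto_deploy"
```

Keeping the policy as a pure function makes it easy to audit and to test before it gates any real deployment.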
Layer 3: Action (Executing Work)
┌─────────────────────────────────────────────────────────────┐
│ Action Layer │
│ "The system executes work" │
├─────────────────────────────────────────────────────────────┤
│ │
│ development-agent │
│ ├─ Implements features (from specs) │
│ ├─ Writes tests │
│ ├─ Creates PRs │
│ └─ Humans review, approve │
│ │
│ testing-agent │
│ ├─ Runs test suites │
│ ├─ Generates missing tests │
│ ├─ Investigates flaky tests │
│ └─ Humans see: "Tests pass" or "Here's the issue" │
│ │
│ deployment-agent │
│ ├─ Deploys to staging/production │
│ ├─ Monitors rollout │
│ ├─ Auto-rollback on issues │
│ └─ Humans see: "Deployed v1.2.3, health: ✅" │
│ │
│ incident-responder │
│ ├─ Detects incidents (before humans) │
│ ├─ Triage: severity, impact, root cause │
│ ├─ Auto-remediation (restart, rollback, scale) │
│ └─ Humans see: "Incident detected, resolved, here's why"│
│ │
└─────────────────────────────────────────────────────────────┘
Before vs After:
| Task | Before | After (RD-OS) |
|---|---|---|
| Feature development | Human writes code, days/weeks | AI drafts, human reviews, hours/days |
| Testing | Manual test writing, maintenance | Auto-generated, maintained |
| Deployment | Manual process, risky | Automated, safe, rollback-ready |
| Incident response | Page, triage, fix (hours) | Auto-detect, auto-fix (minutes) |
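The deployment-agent's "Auto-rollback on issues" could be driven by a rule like the one sketched here: roll back only when several consecutive error-rate samples exceed a multiple of the pre-deploy baseline. The thresholds (`max_ratio`, `min_samples`, `floor`) are illustrative assumptions; the document does not specify rollback criteria.

```python
def should_rollback(baseline_error_rate, observed_error_rates,
                    max_ratio=2.0, min_samples=3, floor=0.001):
    """Return True when the last min_samples observations all exceed the threshold.

    The threshold is baseline * max_ratio, but never lower than `floor`,
    so a zero baseline does not trigger a rollback on a single stray error.
    """
    if len(observed_error_rates) < min_samples:
        return False  # not enough data since the rollout started
    threshold = max(baseline_error_rate * max_ratio, floor)
    recent = observed_error_rates[-min_samples:]
    return all(rate > threshold for rate in recent)
```

Requiring consecutive breaches trades a slightly slower reaction for fewer rollbacks caused by a single noisy sample.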
Layer 4: Learning (Continuous Improvement)
┌─────────────────────────────────────────────────────────────┐
│ Learning Layer │
│ "The system improves itself" │
├─────────────────────────────────────────────────────────────┤
│ │
│ post-mortem-analyst │
│ ├─ Analyzes incidents (no blame) │
│ ├─ Identifies root causes │
│ ├─ Proposes preventive measures │
│ └─ Humans review, approve changes │
│ │
│ tech-debt-detector │
│ ├─ Continuously scans for tech debt │
│ ├─ Prioritizes by impact │
│ ├─ Proposes refactoring plans │
│ └─ Humans see: "Tech debt: 5 high-priority items" │
│ │
│ optimization-recommender │
│ ├─ Analyzes performance, cost, efficiency │
│ ├─ Identifies optimization opportunities │
│ ├─ Proposes and implements improvements │
│ └─ Humans see: "Saved $X/month with optimization Y" │
│ │
│ knowledge-curator │
│ ├─ Captures learnings from incidents │
│ ├─ Updates documentation │
│ ├─ Shares insights across teams │
│ └─ System gets smarter over time │
│ │
└─────────────────────────────────────────────────────────────┘
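As a rough sketch of "Prioritizes by impact" in the tech-debt-detector, the code below ranks items and produces the one-line summary humans would see. The 1-5 `impact` and `effort` scales and the cutoff for "high-priority" are assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DebtItem:
    name: str
    impact: int  # hypothetical scale: 1 (low) to 5 (high)
    effort: int  # hypothetical scale: 1 (small) to 5 (large)


def prioritize(items, high_priority_impact=4):
    """Rank by impact (descending), then effort (ascending); also return the high-priority subset."""
    ranked = sorted(items, key=lambda i: (-i.impact, i.effort, i.name))
    high = [i for i in ranked if i.impact >= high_priority_impact]
    return ranked, high


def summary(items):
    """Build the one-line status shown to humans."""
    _, high = prioritize(items)
    return f"Tech debt: {len(high)} high-priority items"
```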
Key Workflows (End-to-End)
Workflow 1: Feature Development
┌─────────────────────────────────────────────────────────────────┐
│ Feature Development (AI-First) │
└─────────────────────────────────────────────────────────────────┘
Human: "Build feature X: users can export data as CSV"
│
▼
┌─────────────────────────────────────────────────────────────┐
│ 1. Spec Analysis (AI) │
│ ├─ Understands requirements │
│ ├─ Identifies affected components │
│ └─ Creates implementation plan │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ 2. Implementation (AI) │
│ ├─ Writes code (backend, frontend, tests) │
│ ├─ Creates PR │
│ └─ Notifies human reviewer │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ 3. Review (Human + AI) │
│ ├─ AI: automated review (style, tests, security) │
│ ├─ Human: correctness, UX, business logic │
│ └─ AI: addresses feedback, updates PR │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ 4. Merge & Deploy (AI) │
│ ├─ Auto-merge (if checks pass) │
│ ├─ Deploy to staging │
│ ├─ Run integration tests │
│ └─ Deploy to production (feature flag) │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ 5. Monitor (AI) │
│ ├─ Watches metrics, errors, adoption │
│ ├─ Alerts human if issues │
│ └─ Reports: "Feature X: 1000 uses/day, 0 errors" │
└─────────────────────────────────────────────────────────────┘
Total Time: 2-3 days (vs 2-3 weeks traditional)
Human Effort: 2-4 hours review (vs 40+ hours coding)
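The five steps above can be sketched as a linear state machine with two gates: human approval to leave review, and passing checks to complete merge and deploy. The stage names and the `advance` function are hypothetical; the document describes the flow but not an implementation.

```python
STAGES = ("spec_analysis", "implementation", "review", "merge_and_deploy", "monitor")


def advance(stage, human_approved=False, checks_passed=False):
    """Return the next stage, or the same stage if its exit condition is not met."""
    if stage == "review" and not human_approved:
        return stage  # waiting for the human reviewer
    if stage == "merge_and_deploy" and not checks_passed:
        return stage  # auto-merge only happens when checks pass
    index = STAGES.index(stage)
    return STAGES[min(index + 1, len(STAGES) - 1)]
```

The only point where the flow waits on a person is the review stage, which matches the 2-4 hours of human effort quoted above.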
Workflow 2: Incident Response
┌─────────────────────────────────────────────────────────────────┐
│ Incident Response (AI-First) │
└─────────────────────────────────────────────────────────────────┘
[Incident Occurs: API latency spike]
│
▼
┌─────────────────────────────────────────────────────────────┐
│ 1. Detection (AI) - T+0s │
│ ├─ Detects anomaly (before humans notice) │
│ ├─ Correlates with recent changes │
│ └─ Starts investigation │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ 2. Triage (AI) - T+30s │
│ ├─ Severity: P2 (degraded performance) │
│ ├─ Impact: 15% of requests affected │
│ ├─ Root cause: recent deployment, memory leak │
│ └─ Notifies on-call + team channel │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ 3. Mitigation (AI) - T+60s │
│ ├─ Auto-rollback to previous version │
│ ├─ Scales up affected service │
│ └─ Monitors recovery │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ 4. Resolution (AI) - T+5min │
│ ├─ Metrics return to normal │
│ ├─ Incident marked resolved │
│ └─ Report: "Root cause, fix, prevention plan" │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ 5. Post-Mortem (AI + Human) - T+1day │
│ ├─ AI: timeline, root cause, prevention │
│ ├─ Human: review, approve │
│ └─ AI: creates follow-up tasks │
└─────────────────────────────────────────────────────────────┘
Total Time: 5 minutes to resolution (vs 2-4 hours traditional)
Human Effort: 30 minutes review (vs 4+ hours firefighting)
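The "Correlates with recent changes" step and the rollback-first mitigation can be sketched as a time-window lookup over deployment records. The record fields (`service`, `version`, `deployed_at`), the 30-minute window, and the step strings are assumptions for illustration, not part of the described system.

```python
from datetime import datetime, timedelta


def suspect_deployments(anomaly_at, deployments, window=timedelta(minutes=30)):
    """Return deployments made within `window` before the anomaly, newest first."""
    suspects = [
        d for d in deployments
        if timedelta(0) <= anomaly_at - d["deployed_at"] <= window
    ]
    return sorted(suspects, key=lambda d: d["deployed_at"], reverse=True)


def mitigation_steps(suspects):
    """Prefer rolling back the newest suspect; otherwise escalate to on-call."""
    if suspects:
        newest = suspects[0]
        return [f"roll back {newest['service']} from {newest['version']}",
                "monitor recovery"]
    return ["notify on-call", "monitor recovery"]
```

When no recent deployment matches, the sketch escalates rather than guessing, which keeps novel situations with humans.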
Workflow 3: Alert Handling
┌─────────────────────────────────────────────────────────────────┐
│ Alert Handling (AI-First) │
└─────────────────────────────────────────────────────────────────┘
[Alert: High CPU on service X]
│
▼
┌─────────────────────────────────────────────────────────────┐
│ 1. Alert Analysis (AI) │
│ ├─ Is this real? (vs noise) │
│ ├─ What's the context? (recent changes, load spike) │
│ └─ What's the impact? │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ 2. Decision (AI, policy-based) │
│ ├─ If known issue + auto-fix exists → execute fix │
│ ├─ If unknown → investigate, notify human │
│ └─ If noise → suppress, update alert rules │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ 3. Action (AI) │
│ ├─ Execute fix OR │
│ ├─ Create incident OR │
│ └─ Update alert rules │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ 4. Human Notification (if needed) │
│ ├─ "Alert X: auto-resolved, here's what happened" OR │
│ └─ "Alert X: needs attention, here's the context" │
└─────────────────────────────────────────────────────────────┘
Result: 90% of alerts handled without human intervention
Human Focus: Only meaningful alerts with full context
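The decision step above maps naturally onto a three-way branch. In this sketch, noise is checked first because Alert Analysis asks "Is this real?" before anything else; the `Action` names, the `known_fixes` mapping, and the fix identifier in the usage example are hypothetical.

```python
from enum import Enum


class Action(Enum):
    EXECUTE_FIX = "execute_fix"
    CREATE_INCIDENT = "create_incident"
    SUPPRESS = "suppress_and_update_rules"


def decide(alert_name, is_noise, known_fixes):
    """Return (action, fix) for an alert, following the policy in step 2."""
    if is_noise:
        return Action.SUPPRESS, None
    fix = known_fixes.get(alert_name)
    if fix is not None:
        return Action.EXECUTE_FIX, fix  # known issue with an auto-fix
    return Action.CREATE_INCIDENT, None  # unknown: investigate and notify a human


def needs_human(action):
    """Only unknown issues are routed to a person for attention."""
    return action is Action.CREATE_INCIDENT
```

For example, `decide("high_cpu_service_x", False, {"high_cpu_service_x": "scale_out"})` selects the auto-fix, while an alert with no entry in `known_fixes` becomes an incident with a human notified.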
Human Experience in RD-OS
What Humans Do
┌─────────────────────────────────────────────────────────────┐
│ Human Focus Areas │
├─────────────────────────────────────────────────────────────┤
│ │
│ Strategy & Direction │
│ ├─ What problems to solve │
│ ├─ What features to build │
│ └─ What trade-offs to make │
│ │
│ Review & Approval │
│ ├─ Architecture decisions (AI-proposed) │
│ ├─ Security-critical changes │
│ ├─ Breaking changes │
│ └─ High-risk deployments │
│ │
│ Exception Handling │
│ ├─ Edge cases AI can't handle │
│ ├─ Novel situations │
│ └─ Escalations from agents │
│ │
│ Creativity & Innovation │
│ ├─ New product ideas │
│ ├─ Novel solutions │
│ └─ Exploratory work │
│ │
└─────────────────────────────────────────────────────────────┘
What Humans Don’t Do
┌─────────────────────────────────────────────────────────────┐
│ Eliminated by RD-OS │
├─────────────────────────────────────────────────────────────┤
│ │
│ ❌ Manual code writing (AI drafts) │
│ ❌ Manual testing (AI generates & runs) │
│ ❌ Manual deployment (AI deploys) │
│ ❌ Manual monitoring (AI watches 24/7) │
│ ❌ Alert triage (AI handles 90%) │
│ ❌ Incident firefighting (AI auto-remediates) │
│ ❌ Status meetings (AI reports automatically) │
│ ❌ Progress tracking (AI tracks in real-time) │
│ ❌ Documentation writing (AI auto-generates) │
│ ❌ Coordination overhead (AI coordinates) │
│ │
└─────────────────────────────────────────────────────────────┘
Metrics: Before vs After
| Metric | Traditional | RD-OS Target | Improvement |
|---|---|---|---|
| Feature dev time | 2-3 weeks | 2-3 days | 5-7x |
| Incident MTTR | 2-4 hours | 5-10 minutes | 24x |
| Alert noise | 90% false positive | <10% false positive | 9x |
| New hire ramp-up | 3-6 months | 2-4 weeks | 3-6x |
| Deploy frequency | Weekly | Multiple/day | 10x+ |
| Deploy failure rate | 10-20% | <1% | 10-20x |
| Tech debt visibility | Unknown | Real-time dashboard | - |
| Coordination meetings | 10+ hours/week | <2 hours/week | 5x |
| Human coding time | 60% | 10% | 6x |
| Human decision time | 20% | 70% | 3.5x |
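Several of the improvement factors in the table follow directly from dividing the traditional value by the target. The sketch below reproduces that arithmetic for the rows with unambiguous units; the function name `improvement_factor` is an assumption, and the inputs are copied from the table.

```python
def improvement_factor(before, after):
    """Ratio of the traditional value to the RD-OS target."""
    return before / after


# Values copied from the table; percentages are kept as whole numbers.
mttr_fast = improvement_factor(2 * 60, 5)    # 2 hours vs 5 minutes
mttr_slow = improvement_factor(4 * 60, 10)   # 4 hours vs 10 minutes
alert_noise = improvement_factor(90, 10)     # 90% vs 10% false positives
meetings = improvement_factor(10, 2)         # 10 vs 2 hours per week
coding_time = improvement_factor(60, 10)     # 60% vs 10% of human time
decision_time = 70 / 20                      # a share that should grow, so target / before
```

Both ends of the MTTR range give the same factor of 24, which is why that row shows a single number rather than a range.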
Implementation Roadmap
Phase 1: Foundation (Month 1-2)
- Mono-repo consolidation (400 → 1)
- Basic agent framework
- Core skills (build, test, deploy)
- Perception layer (code understanding, docs)
Phase 2: Coordination (Month 3-4)
- Workflow orchestrator
- Sprint coordinator
- Dependency coordinator
- Action layer (dev, test, deploy agents)
Phase 3: Autonomy (Month 5-6)
- Incident responder
- Alert handler
- Post-mortem analyst
- Learning layer (continuous improvement)
Phase 4: Optimization (Month 7-12)
- Full autonomy for routine work
- AI-driven optimization
- Human focus on strategy, approvals, and exceptions
- Continuous self-improvement
Conclusion
RD-OS is not just a mono-repo. It’s a paradigm shift:
| Aspect | Traditional | RD-OS |
|---|---|---|
| System Nature | Passive | Active, Living |
| Understanding | Human effort | Built-in |
| Coordination | Human meetings | AI orchestration |
| Execution | Human labor | AI execution |
| Improvement | Occasional, manual | Continuous, automatic |
| Human Role | Doer | Decision-maker |
The goal:
Humans define WHAT matters. AI handles HOW to achieve it.
The result:
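An R&D department that moves at AI speed, with human wisdom.
“Before: coordinating many people to keep up with development, deployment, testing, operations, incidents, and alerts was exhausting.”
“Future: a living system in which AI coordinates everything autonomously and humans focus on decisions.”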
This is RD-OS.