RD-OS: Research & Development Operating System

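R&D infrastructure for the AI era

"The past: coordinating many people and chasing development, deployment, testing, operations, incidents, and alerts. It took too much effort."

"The future: a living system in which AI coordinates everything autonomously and humans focus on decisions."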


Core Problem

Pain Points of the Traditional R&D Model

┌─────────────────────────────────────────────────────────────────┐
│                    Traditional R&D Pain                         │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Understanding the System:                                      │
│  ❌ 400+ repos, no one knows the full picture                   │
│  ❌ Documentation always outdated                               │
│  ❌ "Who owns this?" "Why was this done?"                       │
│  ❌ New hire ramp-up: 3-6 months                                │
│                                                                 │
│  Coordination Overhead:                                         │
│  ❌ Dev → Test → Deploy → Ops: handoffs everywhere              │
│  ❌ Incident response: page 5 people, 2 hours to triage         │
│  ❌ Sprint planning: 2 days of meetings                         │
│  ❌ Post-mortem: blame, not learning                            │
│                                                                 │
│  Alert Fatigue:                                                 │
│  ❌ 100+ alerts/day, most are noise                             │
│  ❌ No context, just "something is broken"                      │
│  ❌ Human must investigate everything                           │
│                                                                 │
│  Progress Tracking:                                             │
│  ❌ JIRA tickets, standups, status reports                      │
│  ❌ "What's blocked?" "Who's working on what?"                  │
│  ❌ Velocity is a guess                                         │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Root Cause: The system is passive. It waits for humans to:

  • Understand it
  • Coordinate across it
  • Fix it
  • Improve it

Vision: RD-OS (Active, Living System)

┌─────────────────────────────────────────────────────────────────┐
│                         RD-OS                                   │
│              A Living R&D Operating System                      │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │                    Unified Codebase                      │   │
│  │         (400 repos → 1 mono-repo, AI-readable)          │   │
│  └─────────────────────────────────────────────────────────┘   │
│                              │                                  │
│         ┌────────────────────┼────────────────────┐            │
│         ▼                    ▼                    ▼             │
│  ┌─────────────┐     ┌─────────────┐     ┌─────────────┐       │
│  │   AI Core   │     │   Skills    │     │   Humans    │       │
│  │  (Agents)   │     │  (Tools)    │     │ (Decision)  │       │
│  └─────────────┘     └─────────────┘     └─────────────┘       │
│                                                                 │
│  Capabilities:                                                  │
│  ✅ Self-understanding (always knows its state)                │
│  ✅ Self-coordination (agents talk to each other)              │
│  ✅ Self-healing (detects and fixes issues)                    │
│  ✅ Self-improvement (identifies and acts on optimizations)   │
│                                                                 │
│  Result: Humans focus on WHAT, AI handles HOW                   │
└─────────────────────────────────────────────────────────────────┘

RD-OS Architecture

Layer 0: The Codebase (Passive Foundation)

mono-repo/
├── products/          # TiDB, TiDB Next-Gen
├── platform/          # Cloud SaaS, control plane
├── devops/            # Operations tooling
├── libs/              # Shared libraries
├── tools/             # Build/dev tools
├── docs/              # Living documentation
└── .rd-os/            # RD-OS configuration
    ├── agents/        # Agent definitions
    ├── skills/        # Skill configurations
    ├── workflows/     # Automated workflows
    └── policies/      # Decision policies

Layer 1: Perception (Understanding the System)

┌─────────────────────────────────────────────────────────────┐
│                  Perception Layer                           │
│         "The system understands itself"                     │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  code-understanding-agent                                   │
│    ├─ Continuously indexes codebase                        │
│    ├─ Maps dependencies (real-time)                        │
│    ├─ Tracks architecture changes                          │
│    └─ Answers: "What does this do?" "Who uses this?"       │
│                                                             │
│  documentation-curator                                      │
│    ├─ Auto-generates docs from code                        │
│    ├─ Keeps docs in sync (per-change)                      │
│    ├─ Maintains architecture decision records              │
│    └─ Answers: "Why was this designed this way?"           │
│                                                             │
│  health-monitor                                             │
│    ├─ Real-time system health dashboard                    │
│    ├─ Tracks: build status, test coverage, tech debt       │
│    ├─ Detects anomalies                                    │
│    └─ Answers: "Is the system healthy?"                    │
│                                                             │
└─────────────────────────────────────────────────────────────┘

Before vs After:

| Task | Before | After (RD-OS) |
|---|---|---|
| Understand a component | Read docs (outdated), ask team (slow) | Ask agent (instant, accurate) |
| Find dependencies | Search code, grep, hope | Query dependency graph |
| New hire ramp-up | 3-6 months | 2-4 weeks (AI-guided) |
| Architecture review | Manual docs, diagrams | Auto-generated, always current |
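
The "Query dependency graph" row can be illustrated with a small sketch. It answers the "Who uses this?" question by walking reversed dependency edges. The component names and the `DEPENDS_ON` mapping below are hypothetical examples, not the actual RD-OS index format.

```python
from collections import defaultdict, deque

# Hypothetical dependency edges: component -> components it depends on.
DEPENDS_ON = {
    "products/tidb": ["libs/storage", "libs/rpc"],
    "platform/control-plane": ["libs/rpc"],
    "devops/deployer": ["platform/control-plane"],
    "libs/storage": [],
    "libs/rpc": [],
}


def reverse_graph(depends_on):
    """Invert the edges so each component maps to its direct users."""
    used_by = defaultdict(set)
    for component, deps in depends_on.items():
        for dep in deps:
            used_by[dep].add(component)
    return used_by


def who_uses(component, depends_on):
    """Return every component that depends on `component`, directly or transitively."""
    used_by = reverse_graph(depends_on)
    seen, queue = set(), deque([component])
    while queue:
        current = queue.popleft()
        for user in used_by.get(current, ()):
            if user not in seen:
                seen.add(user)
                queue.append(user)
    return sorted(seen)


print(who_uses("libs/rpc", DEPENDS_ON))
# ['devops/deployer', 'platform/control-plane', 'products/tidb']
```

A breadth-first walk is used here so that transitive users (such as the deployer, which only depends on the control plane) are reported alongside direct ones.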

Layer 2: Coordination (Orchestrating Work)

┌─────────────────────────────────────────────────────────────┐
│                 Coordination Layer                          │
│      "The system coordinates itself"                        │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  workflow-orchestrator                                      │
│    ├─ Dev → Test → Deploy → Ops: automatic handoffs        │
│    ├─ No human coordination needed                         │
│    ├─ Tracks progress, unblocks automatically              │
│    └─ Humans see: "Feature X: 80% done, deploying in 2h"   │
│                                                             │
│  sprint-coordinator                                         │
│    ├─ Analyzes backlog, capacity, velocity                 │
│    ├─ Suggests sprint goals                                │
│    ├─ Adjusts mid-sprint based on reality                  │
│    └─ Humans see: "Sprint on track" or "Risk: feature Y"   │
│                                                             │
│  dependency-coordinator                                     │
│    ├─ Detects cross-component changes needed               │
│    ├─ Coordinates updates across repos                     │
│    ├─ Prevents breaking changes                            │
│    └─ Humans see: "Updating lib X, 3 components affected"  │
│                                                             │
└─────────────────────────────────────────────────────────────┘

Before vs After:

| Task | Before | After (RD-OS) |
|---|---|---|
| Dev → Test handoff | PR review, wait for QA, days | Auto-test, auto-merge, hours |
| Deploy coordination | Schedule, change review, CAB | Auto-deploy (policy-based) |
| Sprint planning | 2-day meetings | AI-suggested, human-approved |
| Cross-team dependency | Email, meetings, delays | Auto-coordinated |
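
"Auto-deploy (policy-based)" could look roughly like the gate below. This is a minimal sketch: the `Change` fields and the three outcome strings are assumptions made for illustration, chosen so that security-critical, breaking, and high-risk changes are routed to a human while everything else proceeds automatically.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Change:
    """Hypothetical summary of a change, as a policy engine might see it."""
    tests_passed: bool
    touches_security_critical: bool
    is_breaking: bool
    risk: str  # "low", "medium", or "high" (assumed labels)


def deploy_decision(change: Change) -> str:
    """Return 'blocked', 'needs-human-approval', or 'auto-deploy'."""
    if not change.tests_passed:
        return "blocked"
    if change.touches_security_critical or change.is_breaking or change.risk == "high":
        return "needs-human-approval"
    return "auto-deploy"


routine = Change(tests_passed=True, touches_security_critical=False,
                 is_breaking=False, risk="low")
risky = Change(tests_passed=True, touches_security_critical=True,
               is_breaking=False, risk="low")
print(deploy_decision(routine))  # auto-deploy
print(deploy_decision(risky))    # needs-human-approval
```

Keeping the policy as a pure function of the change makes it easy to review and test separately from the deployment machinery.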

Layer 3: Action (Executing Work)

┌─────────────────────────────────────────────────────────────┐
│                    Action Layer                             │
│         "The system executes work"                          │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  development-agent                                          │
│    ├─ Implements features (from specs)                     │
│    ├─ Writes tests                                         │
│    ├─ Creates PRs                                          │
│    └─ Humans review, approve                               │
│                                                             │
│  testing-agent                                              │
│    ├─ Runs test suites                                     │
│    ├─ Generates missing tests                              │
│    ├─ Investigates flaky tests                             │
│    └─ Humans see: "Tests pass" or "Here's the issue"       │
│                                                             │
│  deployment-agent                                           │
│    ├─ Deploys to staging/production                        │
│    ├─ Monitors rollout                                     │
│    ├─ Auto-rollback on issues                              │
│    └─ Humans see: "Deployed v1.2.3, health: ✅"            │
│                                                             │
│  incident-responder                                         │
│    ├─ Detects incidents (before humans)                    │
│    ├─ Triage: severity, impact, root cause                 │
│    ├─ Auto-remediation (restart, rollback, scale)          │
│    └─ Humans see: "Incident detected, resolved, here's why"│
│                                                             │
└─────────────────────────────────────────────────────────────┘

Before vs After:

| Task | Before | After (RD-OS) |
|---|---|---|
| Feature development | Human writes code, days/weeks | AI drafts, human reviews, hours/days |
| Testing | Manual test writing, maintenance | Auto-generated, maintained |
| Deployment | Manual process, risky | Automated, safe, rollback-ready |
| Incident response | Page, triage, fix (hours) | Auto-detect, auto-fix (minutes) |
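
The deployment-agent's "monitor rollout, auto-rollback on issues" behaviour can be sketched as a loop over health samples. The `deploy` callback, the error-rate samples, and the 1% threshold are all assumptions for illustration; a real agent would read live metrics rather than a list.

```python
from typing import Callable, Iterable, List


def rollout(
    new_version: str,
    previous_version: str,
    deploy: Callable[[str], None],
    error_rates: Iterable[float],
    max_error_rate: float = 0.01,  # hypothetical threshold
) -> str:
    """Deploy `new_version`, watch error-rate samples, roll back on the first bad one."""
    deploy(new_version)
    for rate in error_rates:
        if rate > max_error_rate:
            deploy(previous_version)
            return f"rolled back to {previous_version}"
    return f"deployed {new_version}, health: ok"


deployed: List[str] = []  # stands in for the real deployment target
print(rollout("v1.2.3", "v1.2.2", deployed.append, [0.001, 0.002, 0.001]))
print(rollout("v1.2.4", "v1.2.3", deployed.append, [0.001, 0.08]))
print(deployed)  # ['v1.2.3', 'v1.2.4', 'v1.2.3']
```

The second call shows the rollback path: the 8% error sample exceeds the threshold, so the previous version is redeployed.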

Layer 4: Learning (Continuous Improvement)

┌─────────────────────────────────────────────────────────────┐
│                   Learning Layer                            │
│        "The system improves itself"                         │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  post-mortem-analyst                                        │
│    ├─ Analyzes incidents (no blame)                        │
│    ├─ Identifies root causes                               │
│    ├─ Proposes preventive measures                         │
│    └─ Humans review, approve changes                       │
│                                                             │
│  tech-debt-detector                                         │
│    ├─ Continuously scans for tech debt                     │
│    ├─ Prioritizes by impact                                │
│    ├─ Proposes refactoring plans                           │
│    └─ Humans see: "Tech debt: 5 high-priority items"       │
│                                                             │
│  optimization-recommender                                   │
│    ├─ Analyzes performance, cost, efficiency               │
│    ├─ Identifies optimization opportunities                │
│    ├─ Proposes and implements improvements                 │
│    └─ Humans see: "Saved $X/month with optimization Y"     │
│                                                             │
│  knowledge-curator                                          │
│    ├─ Captures learnings from incidents                    │
│    ├─ Updates documentation                                │
│    ├─ Shares insights across teams                         │
│    └─ System gets smarter over time                        │
│                                                             │
└─────────────────────────────────────────────────────────────┘

Key Workflows (End-to-End)

Workflow 1: Feature Development

┌─────────────────────────────────────────────────────────────────┐
│              Feature Development (AI-First)                     │
└─────────────────────────────────────────────────────────────────┘

Human: "Build feature X: users can export data as CSV"
   │
   ▼
┌─────────────────────────────────────────────────────────────┐
│  1. Spec Analysis (AI)                                      │
│     ├─ Understands requirements                             │
│     ├─ Identifies affected components                       │
│     └─ Creates implementation plan                          │
└─────────────────────────────────────────────────────────────┘
   │
   ▼
┌─────────────────────────────────────────────────────────────┐
│  2. Implementation (AI)                                     │
│     ├─ Writes code (backend, frontend, tests)               │
│     ├─ Creates PR                                           │
│     └─ Notifies human reviewer                              │
└─────────────────────────────────────────────────────────────┘
   │
   ▼
┌─────────────────────────────────────────────────────────────┐
│  3. Review (Human + AI)                                     │
│     ├─ AI: automated review (style, tests, security)        │
│     ├─ Human: logic, UX, business logic                     │
│     └─ AI: addresses feedback, updates PR                   │
└─────────────────────────────────────────────────────────────┘
   │
   ▼
┌─────────────────────────────────────────────────────────────┐
│  4. Merge & Deploy (AI)                                     │
│     ├─ Auto-merge (if checks pass)                          │
│     ├─ Deploy to staging                                    │
│     ├─ Run integration tests                                │
│     └─ Deploy to production (feature flag)                  │
└─────────────────────────────────────────────────────────────┘
   │
   ▼
┌─────────────────────────────────────────────────────────────┐
│  5. Monitor (AI)                                            │
│     ├─ Watches metrics, errors, adoption                    │
│     ├─ Alerts human if issues                               │
│     └─ Reports: "Feature X: 1000 uses/day, 0 errors"        │
└─────────────────────────────────────────────────────────────┘

Total Time: 2-3 days (vs 2-3 weeks traditional)
Human Effort: 2-4 hours review (vs 40+ hours coding)
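
The five stages above can be modelled as an ordered pipeline that stops at the first stage that does not pass, for example when the human reviewer requests changes. This is a sketch only: the stage names mirror the diagram, while the `handlers` mapping and its boolean return convention are assumptions.

```python
from typing import Callable, Dict, List, Tuple

# Stage names follow the workflow diagram, in order.
STAGES = ["spec_analysis", "implementation", "review", "merge_and_deploy", "monitor"]


def run_pipeline(handlers: Dict[str, Callable[[], bool]]) -> Tuple[str, List[str]]:
    """Run stages in order; stop at the first handler that returns False."""
    completed: List[str] = []
    for stage in STAGES:
        if not handlers[stage]():
            return f"stopped at {stage}", completed
        completed.append(stage)
    return "done", completed


# Hypothetical handlers: every stage passes except the human review gate.
handlers = {stage: (lambda: True) for stage in STAGES}
handlers["review"] = lambda: False
print(run_pipeline(handlers))
# ('stopped at review', ['spec_analysis', 'implementation'])
```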

Workflow 2: Incident Response

┌─────────────────────────────────────────────────────────────────┐
│              Incident Response (AI-First)                       │
└─────────────────────────────────────────────────────────────────┘

[Incident Occurs: API latency spike]
   │
   ▼
┌─────────────────────────────────────────────────────────────┐
│  1. Detection (AI) - T+0s                                   │
│     ├─ Detects anomaly (before humans notice)               │
│     ├─ Correlates with recent changes                       │
│     └─ Starts investigation                                 │
└─────────────────────────────────────────────────────────────┘
   │
   ▼
┌─────────────────────────────────────────────────────────────┐
│  2. Triage (AI) - T+30s                                     │
│     ├─ Severity: P2 (degraded performance)                  │
│     ├─ Impact: 15% of requests affected                     │
│     ├─ Root cause: recent deployment, memory leak           │
│     └─ Notifies on-call + team channel                      │
└─────────────────────────────────────────────────────────────┘
   │
   ▼
┌─────────────────────────────────────────────────────────────┐
│  3. Mitigation (AI) - T+60s                                 │
│     ├─ Auto-rollback to previous version                    │
│     ├─ Scales up affected service                           │
│     └─ Monitors recovery                                    │
└─────────────────────────────────────────────────────────────┘
   │
   ▼
┌─────────────────────────────────────────────────────────────┐
│  4. Resolution (AI) - T+5min                                │
│     ├─ Metrics return to normal                             │
│     ├─ Incident marked resolved                             │
│     └─ Report: "Root cause, fix, prevention plan"           │
└─────────────────────────────────────────────────────────────┘
   │
   ▼
┌─────────────────────────────────────────────────────────────┐
│  5. Post-Mortem (AI + Human) - T+1day                       │
│     ├─ AI: timeline, root cause, prevention                 │
│     ├─ Human: review, approve                               │
│     └─ AI: creates follow-up tasks                          │
└─────────────────────────────────────────────────────────────┘

Total Time: 5 minutes to resolution (vs 2-4 hours traditional)
Human Effort: 30 minutes review (vs 4+ hours firefighting)
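
Two of the triage steps above, "correlates with recent changes" and assigning a severity from impact, lend themselves to a short sketch. The 30-minute lookback window, the severity thresholds, and the sample deployments are assumptions chosen so that the 15% impact from the diagram maps to P2; the source does not define them.

```python
from datetime import datetime, timedelta
from typing import List, Tuple

Deployment = Tuple[str, datetime]  # (version, finished_at)


def suspect_deployments(
    anomaly_at: datetime,
    deployments: List[Deployment],
    window: timedelta = timedelta(minutes=30),  # hypothetical lookback window
) -> List[Deployment]:
    """Return deployments that finished within `window` before the anomaly, newest first."""
    recent = [d for d in deployments if timedelta(0) <= anomaly_at - d[1] <= window]
    return sorted(recent, key=lambda d: d[1], reverse=True)


def severity(fraction_affected: float) -> str:
    """Map the share of affected requests to a severity (assumed thresholds)."""
    if fraction_affected >= 0.5:
        return "P1"
    if fraction_affected >= 0.05:
        return "P2"
    return "P3"


# Made-up timeline for illustration.
anomaly = datetime(2024, 1, 1, 12, 0)
deploys = [
    ("api v41", datetime(2024, 1, 1, 9, 0)),
    ("api v42", datetime(2024, 1, 1, 11, 50)),
    ("web v7", datetime(2024, 1, 1, 12, 5)),
]
print(severity(0.15), [v for v, _ in suspect_deployments(anomaly, deploys)])
# P2 ['api v42']
```

Only the deployment that finished ten minutes before the spike is flagged; the one three hours earlier and the one after the anomaly are excluded.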

Workflow 3: Alert Handling

┌─────────────────────────────────────────────────────────────────┐
│              Alert Handling (AI-First)                          │
└─────────────────────────────────────────────────────────────────┘

[Alert: High CPU on service X]
   │
   ▼
┌─────────────────────────────────────────────────────────────┐
│  1. Alert Analysis (AI)                                     │
│     ├─ Is this real? (vs noise)                             │
│     ├─ What's the context? (recent changes, load spike)     │
│     └─ What's the impact?                                   │
└─────────────────────────────────────────────────────────────┘
   │
   ▼
┌─────────────────────────────────────────────────────────────┐
│  2. Decision (AI, policy-based)                             │
│     ├─ If known issue + auto-fix exists → execute fix       │
│     ├─ If unknown → investigate, notify human              │
│     └─ If noise → suppress, update alert rules             │
└─────────────────────────────────────────────────────────────┘
   │
   ▼
┌─────────────────────────────────────────────────────────────┐
│  3. Action (AI)                                             │
│     ├─ Execute fix OR                                       │
│     ├─ Create incident OR                                   │
│     └─ Update alert rules                                   │
└─────────────────────────────────────────────────────────────┘
   │
   ▼
┌─────────────────────────────────────────────────────────────┐
│  4. Human Notification (if needed)                          │
│     ├─ "Alert X: auto-resolved, here's what happened" OR    │
│     └─ "Alert X: needs attention, here's the context"       │
└─────────────────────────────────────────────────────────────┘

Result: 90% of alerts handled without human intervention
Human Focus: Only meaningful alerts with full context
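
The three-way decision in step 2 can be written down directly. In this sketch the `Action` names, the `known_fixes` lookup table, and the fix identifier are hypothetical; the branching follows the diagram: noise updates the alert rules, a known issue with an auto-fix is fixed, and anything else becomes an incident for a human.

```python
from enum import Enum
from typing import Dict, Optional, Tuple


class Action(Enum):
    EXECUTE_FIX = "execute_fix"
    CREATE_INCIDENT = "create_incident"
    UPDATE_ALERT_RULES = "update_alert_rules"


def decide(
    alert_name: str,
    is_noise: bool,
    known_fixes: Dict[str, str],
) -> Tuple[Action, Optional[str]]:
    """Pick an action for an alert; the second element is the fix to run, if any."""
    if is_noise:
        return Action.UPDATE_ALERT_RULES, None
    fix = known_fixes.get(alert_name)
    if fix is not None:
        return Action.EXECUTE_FIX, fix
    return Action.CREATE_INCIDENT, None


# Hypothetical mapping from alert name to an automated remediation.
KNOWN_FIXES = {"high-cpu-service-x": "scale-out-service-x"}

print(decide("high-cpu-service-x", False, KNOWN_FIXES))
print(decide("disk-full-service-y", False, KNOWN_FIXES))
print(decide("high-cpu-service-x", True, KNOWN_FIXES))
```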

Human Experience in RD-OS

What Humans Do

┌─────────────────────────────────────────────────────────────┐
│                  Human Focus Areas                          │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  Strategy & Direction                                       │
│    ├─ What problems to solve                               │
│    ├─ What features to build                               │
│    └─ What trade-offs to make                              │
│                                                             │
│  Review & Approval                                          │
│    ├─ Architecture decisions (AI-proposed)                 │
│    ├─ Security-critical changes                            │
│    ├─ Breaking changes                                     │
│    └─ High-risk deployments                                │
│                                                             │
│  Exception Handling                                         │
│    ├─ Edge cases AI can't handle                           │
│    ├─ Novel situations                                     │
│    └─ Escalations from agents                              │
│                                                             │
│  Creativity & Innovation                                    │
│    ├─ New product ideas                                    │
│    ├─ Novel solutions                                      │
│    └─ Exploratory work                                     │
│                                                             │
└─────────────────────────────────────────────────────────────┘

What Humans Don’t Do

┌─────────────────────────────────────────────────────────────┐
│              Eliminated by RD-OS                            │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  ❌ Manual code writing (AI drafts)                         │
│  ❌ Manual testing (AI generates & runs)                    │
│  ❌ Manual deployment (AI deploys)                          │
│  ❌ Manual monitoring (AI watches 24/7)                     │
│  ❌ Alert triage (AI handles 90%)                           │
│  ❌ Incident firefighting (AI auto-remediates)              │
│  ❌ Status meetings (AI reports automatically)              │
│  ❌ Progress tracking (AI tracks in real-time)              │
│  ❌ Documentation writing (AI auto-generates)               │
│  ❌ Coordination overhead (AI coordinates)                  │
│                                                             │
└─────────────────────────────────────────────────────────────┘

Metrics: Before vs After

| Metric | Traditional | RD-OS Target | Improvement |
|---|---|---|---|
| Feature dev time | 2-3 weeks | 2-3 days | 10x |
| Incident MTTR | 2-4 hours | 5-10 minutes | 24x |
| Alert noise | 90% false positive | <10% false positive | 9x |
| New hire ramp-up | 3-6 months | 2-4 weeks | 3-6x |
| Deploy frequency | Weekly | Multiple/day | 10x+ |
| Deploy failure rate | 10-20% | <1% | 10-20x |
| Tech debt visibility | Unknown | Real-time dashboard | - |
| Coordination meetings | 10+ hours/week | <2 hours/week | 5x |
| Human coding time | 60% | 10% | 6x |
| Human decision time | 20% | 70% | 3.5x |
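
The Improvement column can be read as the ratio of the Traditional value to the RD-OS target. Because most rows give ranges, that ratio is itself a range. A minimal sketch of the arithmetic, assuming weeks are converted to 7 calendar days (an assumption; the table does not state which unit it uses):

```python
from typing import Tuple

Range = Tuple[float, float]  # (low, high) in the same unit


def improvement_range(before: Range, after: Range) -> Tuple[float, float]:
    """Return the (smallest, largest) speed-up implied by two ranges."""
    smallest = before[0] / after[1]  # best old case vs. worst new case
    largest = before[1] / after[0]   # worst old case vs. best new case
    return smallest, largest


# Incident MTTR: 2-4 hours vs 5-10 minutes, both expressed in minutes.
print(improvement_range((120, 240), (5, 10)))  # (12.0, 48.0)

# Feature dev time: 2-3 weeks vs 2-3 days, in calendar days (assumed unit).
low, high = improvement_range((14, 21), (2, 3))
print(round(low, 2), high)  # 4.67 10.5
```

Under these assumptions, the single figures in the table (24x and 10x) fall inside the computed ranges and are best read as targets, not guaranteed multipliers.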

Implementation Roadmap

Phase 1: Foundation (Month 1-2)

  • Mono-repo consolidation (400 → 1)
  • Basic agent framework
  • Core skills (build, test, deploy)
  • Perception layer (code understanding, docs)

Phase 2: Coordination (Month 3-4)

  • Workflow orchestrator
  • Sprint coordinator
  • Dependency coordinator
  • Action layer (dev, test, deploy agents)

Phase 3: Autonomy (Month 5-6)

  • Incident responder
  • Alert handler
  • Post-mortem analyst
  • Learning layer (continuous improvement)

Phase 4: Optimization (Month 7-12)

  • Full autonomy for routine work
  • AI-driven optimization
  • Human focus on strategy only
  • Continuous self-improvement

Conclusion

RD-OS is not just a mono-repo. It’s a paradigm shift:

| Aspect | Traditional | RD-OS |
|---|---|---|
| System Nature | Passive | Active, Living |
| Understanding | Human effort | Built-in |
| Coordination | Human meetings | AI orchestration |
| Execution | Human labor | AI execution |
| Improvement | Occasional, manual | Continuous, automatic |
| Human Role | Doer | Decision-maker |

The goal:

Humans define WHAT matters. AI handles HOW to achieve it.

The result:

An R&D department that moves at AI speed, with human wisdom.


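"The past: coordinating many people and chasing development, deployment, testing, operations, incidents, and alerts. It took too much effort."

"The future: a living system in which AI coordinates everything autonomously and humans focus on decisions."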

This is RD-OS.