CNTRLPLANE-3724: adding agentic SDLC documentation for etcd #389

Open
sandeepknd wants to merge 1 commit into openshift:main from sandeepknd:sdlc-etcd

@sandeepknd sandeepknd commented Jun 25, 2026

Hi Team,
This PR adds agentic SDLC documentation for etcd. Details are below.

AGENTS.md - AI Agent Development Guide

  - Architecture patterns ✓
  - Code organization ✓
  - Development workflows ✓
  - What to Always/Ask/Never Do sections ✓
  - Common agent mistakes ✓
  - Testing strategies ✓

ARCHITECTURE.md - Detailed System Documentation

  - System architecture diagrams ✓
  - Component architecture ✓
  - Design decisions with rationale ✓
  - Failure modes and recovery ✓
  - Deployment topology ✓

CONTRIBUTING.md - Updated with Links

  - References to new documentation ✓
  - Integration with existing workflow ✓

Summary by CodeRabbit

  • Documentation
    • Added a new quick-reference guide and an in-depth architecture overview for the OpenShift etcd fork.
    • Expanded contributor guidance with clearer pointers to fork-specific workflows and key reading.
    • Linked the main docs together so it’s easier to find development, recovery, performance, and operational information.

@openshift-ci openshift-ci Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 25, 2026

openshift-ci Bot commented Jun 25, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

coderabbitai Bot commented Jun 25, 2026

Walkthrough

Adds AGENTS.md and ARCHITECTURE.md as repository documentation, updates CLAUDE.md to point to AGENTS.md, and revises CONTRIBUTING.md to surface the new fork-specific reading list and resource links.

Changes

Repository documentation and guidance

| Layer | File(s) | Summary |
| --- | --- | --- |
| AGENTS reference and operations | AGENTS.md | Adds the guide scope, repository map, compaction and restore guidance, TLS notes, and performance tuning references. |
| AGENTS rules and metadata | AGENTS.md | Adds development workflows, testing guidance, critical rules, common mistakes, metrics, defaults, OpenShift notes, resource links, and footer metadata. |
| Architecture overview and core storage path | ARCHITECTURE.md | Adds the system overview, data flow, EtcdServer, Raft node, MVCC, backend persistence, WAL, and snapshot store sections. |
| Consensus, storage semantics, and client APIs | ARCHITECTURE.md | Adds Raft behavior, storage semantics, compaction and defragmentation, transactions, client APIs, and watch semantics. |
| Leases, auth, cluster operations, and deployment | ARCHITECTURE.md | Adds lease, authentication, RBAC, cluster management, recovery, performance, failure-mode, and deployment topology sections. |
| Contributor references and symlink | CLAUDE.md, CONTRIBUTING.md | Points CLAUDE.md at AGENTS.md and revises CONTRIBUTING.md notes and resource links. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~15 minutes

🚥 Pre-merge checks | ✅ 15
✅ Passed checks (15 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately summarizes the main change: adding agentic SDLC documentation for etcd. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Stable And Deterministic Test Names | ✅ Passed | PASS: The PR only adds/edits Markdown docs; I found no Ginkgo test titles or dynamic test-name patterns in the changed files. |
| Test Structure And Quality | ✅ Passed | The PR only changes AGENTS.md, ARCHITECTURE.md, CLAUDE.md, and CONTRIBUTING.md; no Ginkgo test code was modified, so this test-quality check is not applicable. |
| Microshift Test Compatibility | ✅ Passed | No new Ginkgo e2e tests or OpenShift API usages were introduced; the e2e suites are plain testing-based etcd tests with no MicroShift-only concerns. |
| Single Node Openshift (Sno) Test Compatibility | ✅ Passed | PR changes are docs-only (AGENTS/ARCHITECTURE/CONTRIBUTING/CLAUDE); no new Ginkgo e2e tests or SNO-sensitive multi-node assumptions were added. |
| Topology-Aware Scheduling Compatibility | ✅ Passed | Only docs and a symlink were added; no deployment manifests, operator code, or controllers were changed, so no topology-aware scheduling issue is introduced. |
| Ote Binary Stdout Contract | ✅ Passed | PR only adds docs/symlink; no changes to main/init/TestMain/suite setup or logging code, so no stdout contract risk. |
| Ipv6 And Disconnected Network Test Compatibility | ✅ Passed | PR adds docs only; no new/changed Ginkgo e2e tests or IPv4/external-network assumptions were introduced. |
| No-Weak-Crypto | ✅ Passed | Added docs mention SHA256/CRC32/TLS/RBAC only; no MD5/SHA1/DES/RC4/3DES/Blowfish/ECB, custom crypto, or secret-comparison code found. |
| Container-Privileges | ✅ Passed | Scanned the added YAML/Dockerfiles in the commit tree and found no privileged, hostPID, hostNetwork, hostIPC, or allowPrivilegeEscalation settings. |
| No-Sensitive-Data-In-Logs | ✅ Passed | The PR only adds docs; log examples are redacted and any credentials/hosts shown are generic placeholders, not real sensitive data. |

@sandeepknd sandeepknd changed the title adding agentic SDLC documentation for etcd CNTRLPLANE-3724: adding agentic SDLC documentation for etcd Jun 25, 2026
@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Jun 25, 2026

openshift-ci-robot commented Jun 25, 2026

@sandeepknd: This pull request references CNTRLPLANE-3724 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the task to target the "5.0.0" version, but no target version was set.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@sandeepknd sandeepknd marked this pull request as ready for review June 25, 2026 13:13
@openshift-ci openshift-ci Bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 25, 2026
@openshift-ci openshift-ci Bot requested review from hasbro17 and tjungblu June 25, 2026 13:14
@sandeepknd

Author

/assign @dusk125

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 6

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@AGENTS.md`:
- Around line 117-131: The watch design notes overstate the guarantee by saying
“No events are lost”; update the wording in the AGENTS.md watch section to
describe ordered, resumable delivery instead, and keep the rest of the guidance
consistent with the watcher-related implementation symbols `watchable_store.go`,
`watcher.go`, and `pkg/v3/notify` so readers understand watches can be
compacted, disconnected, or backpressured.
- Around line 81-99: The fenced diagram example is missing a language tag,
causing markdownlint warnings and inconsistent rendering. Update the affected
fenced blocks in AGENTS.md to use an explicit tag such as text, and scan the
rest of the file for the same bare-fence pattern so all examples are tagged
consistently.

In `@ARCHITECTURE.md`:
- Around line 150-173: Update the EtcdServer sketch so it mirrors the live
struct in server.go: replace the non-existent types and fields shown here with
the actual EtcdServer members, and add the missing real ones such as
snapshotter, authStore, alarmStore, and AccessController. Use the current
server/etcdserver/server.go EtcdServer definition as the source of truth, and
keep the field grouping aligned with the real layout so this documentation stays
authoritative.
- Around line 788-805: Qualify the Watch Guarantees section in ARCHITECTURE.md
so it does not read as an absolute promise; update the text under the “Watch
Guarantees” and “Slow Consumer Handling” headings to reflect that ordering,
uniqueness, and resumability only hold within the available history window and
that watches may be canceled on compaction or slow consumption. Keep the wording
aligned with the existing watcher/victim behavior described in the
watch-handling docs and make sure the guarantees are presented as conditional
rather than unconditional.
- Around line 1059-1070: Update the recovery section to make the restore flow
the default path: in the snapshot recovery steps shown near the “Restore from
snapshot on one member” and “Start restored member” examples, keep the `etcdutl
snapshot restore` flow as the primary guidance and replace the `etcd
--force-new-cluster` start step with a stronger note that `--force-new-cluster`
is only a discouraged fallback. Reference the existing recovery examples and
wording around `snapshot restore` and `force-new-cluster` so the new text
clearly steers users to the restore-based path first.
- Around line 56-1428: The fenced code blocks in ARCHITECTURE.md are missing
language annotations, which triggers markdownlint MD040. Update each bare fence
in the document to use the appropriate language tag based on the snippet
content, such as go, bash, protobuf, or text, and ensure the named sections
around the examples (for example, the gRPC service definitions, client/v3
samples, and shell commands) are tagged consistently. If any fence is
intentionally language-agnostic, handle it via an explicit lint exemption
instead of leaving it bare.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 6d4c4813-b06a-4b41-9c66-342b7476e084

📥 Commits

Reviewing files that changed from the base of the PR and between bf6c009 and 7088966.

📒 Files selected for processing (4)
  • AGENTS.md
  • ARCHITECTURE.md
  • CLAUDE.md
  • CONTRIBUTING.md

Comment thread AGENTS.md Outdated
Comment on lines +81 to +99
```
┌─────────────────────────────────────────┐
│ KV Store (Interface) │
└─────────────────┬───────────────────────┘
┌─────────────────▼───────────────────────┐
│ MVCC Layer │
│ - Revision management │
│ - Transaction coordination │
│ - Watch event generation │
└─────────────────┬───────────────────────┘
┌─────────────────▼───────────────────────┐
│ BoltDB Backend (bbolt) │
│ - B+tree storage │
│ - ACID transactions │
│ - Snapshot support │
└─────────────────────────────────────────┘
```

📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win

Tag the fenced examples.

These bare fences are already tripping markdownlint, and the same pattern repeats throughout the file. Please add explicit language tags (text, go, protobuf, etc.) so the doc renders consistently and the lint warnings go away.

🛠️ Example fix
- ```
+ ```text
🧰 Tools
🪛 markdownlint-cli2 (0.22.1)

[warning] 81-81: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

Source: Linters/SAST tools

Comment thread AGENTS.md Outdated
Comment thread ARCHITECTURE.md
Comment on lines +56 to +1428
```
┌─────────────────────────────────────────────────────────────────────┐
│ etcd Cluster │
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Node 1 │ │ Node 2 │ │ Node 3 │ │
│ │ (Leader) │◄────►│ (Follower) │◄────►│ (Follower) │ │
│ │ │ │ │ │ │ │
│ │ ┌────────┐ │ │ ┌────────┐ │ │ ┌────────┐ │ │
│ │ │ gRPC │ │ │ │ gRPC │ │ │ │ gRPC │ │ │
│ │ │ Server │ │ │ │ Server │ │ │ │ Server │ │ │
│ │ └────┬───┘ │ │ └────┬───┘ │ │ └────┬───┘ │ │
│ │ │ │ │ │ │ │ │ │ │
│ │ ┌────▼───┐ │ │ ┌────▼───┐ │ │ ┌────▼───┐ │ │
│ │ │ Raft │ │ │ │ Raft │ │ │ │ Raft │ │ │
│ │ │ Node │ │ │ │ Node │ │ │ │ Node │ │ │
│ │ └────┬───┘ │ │ └────┬───┘ │ │ └────┬───┘ │ │
│ │ │ │ │ │ │ │ │ │ │
│ │ ┌────▼───┐ │ │ ┌────▼───┐ │ │ ┌────▼───┐ │ │
│ │ │ MVCC │ │ │ │ MVCC │ │ │ │ MVCC │ │ │
│ │ │ Store │ │ │ │ Store │ │ │ │ Store │ │ │
│ │ └────┬───┘ │ │ └────┬───┘ │ │ └────┬───┘ │ │
│ │ │ │ │ │ │ │ │ │ │
│ │ ┌────▼───┐ │ │ ┌────▼───┐ │ │ ┌────▼───┐ │ │
│ │ │ BoltDB │ │ │ │ BoltDB │ │ │ │ BoltDB │ │ │
│ │ │Backend │ │ │ │Backend │ │ │ │Backend │ │ │
│ │ └────────┘ │ │ └────────┘ │ │ └────────┘ │ │
│ │ │ │ │ │ │ │ │ │ │
│ │ ┌────▼───┐ │ │ ┌────▼───┐ │ │ ┌────▼───┐ │ │
│ │ │ WAL │ │ │ │ WAL │ │ │ │ WAL │ │ │
│ │ │ Log │ │ │ │ Log │ │ │ │ Log │ │ │
│ │ └────────┘ │ │ └────────┘ │ │ └────────┘ │ │
│ │ │ │ │ │ │ │ │ │ │
│ │ ┌────▼───┐ │ │ ┌────▼───┐ │ │ ┌────▼───┐ │ │
│ │ │ Snap │ │ │ │ Snap │ │ │ │ Snap │ │ │
│ │ │ Store │ │ │ │ Store │ │ │ │ Store │ │ │
│ │ └────────┘ │ │ └────────┘ │ │ └────────┘ │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ │
│ ▲ ▲ ▲ │
└─────────┼─────────────────────┼─────────────────────┼───────────────┘
│ │ │
│ │ │
┌─────┴─────────────────────┴─────────────────────┴─────┐
│ Client Applications │
│ (Kubernetes API Server, etcdctl, custom clients) │
└─────────────────────────────────────────────────────────┘
```

### Data Flow

**Write Operation (Linearizable)**:
```
1. Client → gRPC API (any node)
2. Node → Forward to Leader (if not leader)
3. Leader → Propose to Raft
4. Raft → Replicate to majority
5. Raft → Commit entry
6. Leader → Apply to MVCC store
7. MVCC → Write to BoltDB backend
8. Backend → Persist to disk
9. Leader → Return response to client
```

**Read Operation (Linearizable)**:
```
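1. Client → gRPC API (any node)
2. Node → Request read index from leader (leader confirms leadership with a quorum)
3. Node → Wait until local applied index reaches the read index
4. Node → Read from local MVCC store
5. MVCC → Query BoltDB backend
6. Node → Return response to client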
```

**Read Operation (Serializable)**:
```
1. Client → gRPC API (any node)
2. Node → Read from local MVCC store (no quorum check)
3. MVCC → Query BoltDB backend
4. Node → Return response to client
```

## Core Components

### 1. EtcdServer

**Location**: `server/etcdserver/server.go`

**Responsibilities**:
- Coordinate all server operations
- Manage Raft node lifecycle
- Process client requests
- Apply committed Raft entries
- Manage cluster membership
- Handle snapshots and WAL

**Key Structures**:
```go
type EtcdServer struct {
// Raft consensus
r raftNode
raftStorage *raft.MemoryStorage

// Storage
kv mvcc.ConsistentWatchableKV
be backend.Backend

// Cluster state
cluster api.Cluster
id types.ID

// Configuration
Cfg config.ServerConfig

// Lease management
lessor lease.Lessor

// Apply layer
applyV3 apply.ApplyV3
}
```

**Event Loop**:
The server runs a main event loop that:
1. Receives committed Raft entries
2. Applies entries to state machine (MVCC store)
3. Sends responses to waiting clients
4. Processes snapshots
5. Handles leadership changes

### 2. Raft Node

**Location**: `server/etcdserver/raft.go`

**Responsibilities**:
- Implement Raft consensus protocol
- Manage leader election
- Replicate log entries
- Handle network communication between nodes
- Manage Raft configuration changes

**Raft States**:
- **Leader**: Accepts writes, replicates to followers
- **Follower**: Replicates from leader, redirects writes
- **Candidate**: Transitional state during election
- **Learner**: Non-voting member (used for adding nodes)

**Communication**:
- Uses `rafthttp` package for peer-to-peer communication
- Maintains persistent connections between nodes
- Handles message serialization and network failures

### 3. MVCC Storage

**Location**: `server/storage/mvcc/`

**Architecture**:
```
┌─────────────────────────────────────────┐
│ ConsistentWatchableKV │
│ (Combines consistency + watch) │
└─────────────────┬───────────────────────┘
┌─────────────────▼───────────────────────┐
│ WatchableKV │
│ (Adds watch functionality) │
└─────────────────┬───────────────────────┘
┌─────────────────▼───────────────────────┐
│ KV Store │
│ (Core MVCC operations) │
│ - Put, Get, Delete, Txn │
│ - Revision management │
└─────────────────┬───────────────────────┘
┌─────────────────▼───────────────────────┐
│ BoltDB Backend │
│ (Persistent storage) │
└─────────────────────────────────────────┘
```

**Key Concepts**:

**Revision**: Global monotonically increasing counter
- Increments on every write transaction
- Used for point-in-time queries
- Forms the basis for MVCC

**Key Structure**:
```
Key: /registry/pods/default/my-pod
CreateRevision: 100
ModRevision: 105
Version: 3
Value: <serialized pod data>
```

**Index Structure**:
```
In-memory index (rebuilt from BoltDB at startup):
key → keyIndex (revision history)
keyIndex → <ModRevision, Generations<CreateRevision, Version, Revisions>>

BoltDB Buckets:
key: revision → key-value data

meta → consistentIndex (last applied Raft index)
meta → scheduledCompactRevision

### 4. Backend Storage (BoltDB)

**Location**: `server/storage/backend/`

**BoltDB Characteristics**:
- Embedded key-value database
- B+tree data structure
- ACID transactions
- MVCC support
- Memory-mapped files for performance
- Single-writer, multiple-readers

**Buckets**:
- `key`: Stores key-value data keyed by revision (the per-key revision index is kept in memory)
- `meta`: Stores metadata (consistent index, compaction, etc.)
- `lease`: Stores lease information
- `auth`: Stores authentication data
- `members`: Stores cluster membership
- `cluster`: Stores cluster configuration

**Backend Operations**:
```go
// Batch write (transaction)
tx := be.BatchTx()
tx.Lock()
defer tx.Unlock()
tx.UnsafePut(buckets.Key, key, value)
```

**Optimization**:
- Read transactions don't block writes
- Batch commits for better performance
- Periodic defragmentation to reclaim space

### 5. Write-Ahead Log (WAL)

**Location**: `server/storage/wal/`

**Purpose**: Ensure durability of Raft log entries before they're applied.

**Characteristics**:
- Append-only log structure
- Fsync after every write for durability
- Segmented files for easier management
- Used for crash recovery

**WAL Record Types**:
```go
type Record struct {
Type RecordType // Entry, State, Snapshot, CRC
Data []byte
Crc uint32
}
```

**Recovery Process**:
1. Read WAL from last snapshot
2. Replay entries to rebuild Raft state
3. Apply committed entries to state machine
4. Resume normal operation
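
The replay step depends on detecting a torn or corrupted record before it is applied. The following is a minimal sketch of that idea, assuming a rolling CRC-32 (Castagnoli) in which each record's checksum covers all data written so far. The `record`, `appendRecord`, and `replay` names are hypothetical and are not the types in `server/storage/wal/`.

```go
package main

import (
	"errors"
	"fmt"
	"hash/crc32"
)

// castagnoli is the CRC-32 polynomial table assumed by this sketch.
var castagnoli = crc32.MakeTable(crc32.Castagnoli)

var errCRCMismatch = errors.New("crc mismatch")

// record is a simplified stand-in for a WAL record (hypothetical type).
type record struct {
	Type int64
	Data []byte
	Crc  uint32 // rolling checksum over this and all earlier records
}

// appendRecord chains the new record's checksum onto the previous one.
func appendRecord(wal []record, typ int64, data []byte) []record {
	var prev uint32
	if len(wal) > 0 {
		prev = wal[len(wal)-1].Crc
	}
	return append(wal, record{Type: typ, Data: data, Crc: crc32.Update(prev, castagnoli, data)})
}

// replay recomputes the rolling checksum and stops at the first record
// that does not match. It returns the number of records that verified.
func replay(wal []record) (int, error) {
	var crc uint32
	for i, r := range wal {
		crc = crc32.Update(crc, castagnoli, r.Data)
		if crc != r.Crc {
			return i, errCRCMismatch
		}
	}
	return len(wal), nil
}

func main() {
	var wal []record
	for _, entry := range []string{"put a=1", "put b=2", "delete a"} {
		wal = appendRecord(wal, 1, []byte(entry))
	}
	n, err := replay(wal)
	fmt.Println("intact log:", n, err)

	wal[1].Data = []byte("put b=3") // simulate on-disk corruption
	n, err = replay(wal)
	fmt.Println("corrupted log:", n, err)
}
```

Because the checksum is chained, a damaged record also invalidates everything after it, which is why the sketch stops at the first mismatch instead of skipping the bad record.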

### 6. Snapshot Store

**Location**: `server/etcdserver/api/snap/`

**Purpose**: Periodic snapshots of entire state for faster recovery.

**Snapshot Process**:
```
1. Trigger snapshot (after N entries, typically 10,000)
2. Serialize current MVCC state
3. Write snapshot file
4. Update WAL with snapshot metadata
5. Truncate old WAL entries
```

**Benefits**:
- Faster recovery (don't replay entire WAL)
- Smaller WAL size
- Efficient cluster bootstrapping

**Snapshot Format**:
```
Snapshot File:
- Metadata (index, term, cluster config)
- BoltDB database dump
- CRC checksum
```

## Raft Consensus

### Raft Overview

etcd uses the Raft consensus algorithm to maintain a consistent, replicated log across all nodes.

**Raft Properties**:
- **Leader-based**: One leader coordinates all writes
- **Strong consistency**: Linearizable reads and writes
- **Fault tolerance**: Survives f failures in 2f+1 cluster
- **Understandable**: Simpler than Paxos, easier to implement
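
The fault-tolerance property is plain majority arithmetic. A small sketch (the function names are illustrative only) makes the relationship between cluster size, quorum, and tolerated failures explicit:

```go
package main

import "fmt"

// quorum returns the number of voting members that must acknowledge an
// entry before it can be committed: a strict majority of the cluster.
func quorum(members int) int {
	return members/2 + 1
}

// faultTolerance returns how many members can fail while a quorum remains.
func faultTolerance(members int) int {
	return members - quorum(members)
}

func main() {
	for _, n := range []int{1, 3, 4, 5, 7} {
		fmt.Printf("members=%d quorum=%d tolerated failures=%d\n",
			n, quorum(n), faultTolerance(n))
	}
}
```

Note that a 4-member cluster tolerates the same single failure as a 3-member cluster, which is why odd cluster sizes are the usual recommendation.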

### Leader Election

**Process**:
1. Follower times out waiting for heartbeat
2. Becomes candidate, increments term
3. Votes for itself, requests votes from others
4. Wins if receives majority votes
5. Becomes leader, sends heartbeats

**Election Timeout**: Randomized to avoid split votes
- Typical: 1000-5000ms
- Prevents multiple candidates simultaneously

**Safety**: Only candidates with up-to-date logs can win
- Candidate's log must contain all committed entries
- Ensures committed entries are never lost

### Log Replication

**Write Flow**:
```
1. Client sends write to leader
2. Leader appends entry to local log
3. Leader sends AppendEntries RPC to followers
4. Followers append entry, respond with success
5. Leader commits entry after majority acknowledges
6. Leader applies entry to state machine
7. Leader notifies followers of commit
8. Followers apply entry to state machine
```

**Log Structure**:
```
Index: 1 2 3 4 5 6
Term: 1 1 2 2 3 3
Entry: [A] [B] [C] [D] [E] [F]
↑ ↑
Committed Uncommitted
```

**Commit Rules**:
- Entry is committed when majority has it
- All entries before committed entry are also committed
- Committed entries are durable and will never be lost
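
The first commit rule can be sketched as a calculation over the highest index each member is known to hold. This is a simplified illustration with hypothetical names. It omits the additional Raft requirement that a leader only commits entries from its own term by counting replicas.

```go
package main

import (
	"fmt"
	"sort"
)

// commitIndex returns the highest log index stored on a majority of
// members. matchIndex holds one value per voting member (leader included):
// the highest index known to be replicated on that member.
func commitIndex(matchIndex []uint64) uint64 {
	if len(matchIndex) == 0 {
		return 0
	}
	sorted := append([]uint64(nil), matchIndex...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] > sorted[j] })
	quorum := len(sorted)/2 + 1
	// The quorum-th largest value is present on at least quorum members.
	return sorted[quorum-1]
}

func main() {
	// Leader and one follower hold index 6, one follower holds 4, two lag at 3.
	fmt.Println(commitIndex([]uint64{6, 6, 4, 3, 3})) // prints 4
}
```

In the example, index 4 is the highest index present on three of five members, so entries 5 and 6 stay uncommitted until another follower catches up.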

### Log Compaction

**Problem**: Log grows unbounded over time.

**Solution**: Snapshot + truncate log.

**Process**:
1. Create snapshot of current state
2. Store snapshot index and term
3. Truncate log up to snapshot index
4. New nodes receive snapshot instead of full log

**Triggered by**: Raft entry count (default: 10,000 entries)

### Network Partitions

**Scenario**: Network partition splits cluster into two groups.

**Majority Partition** (has quorum):
- Elects new leader
- Continues accepting writes
- Operates normally

**Minority Partition** (no quorum):
- Cannot elect leader
- Rejects writes
- Accepts serializable reads (may be stale)

**Recovery**: When partition heals
- Minority rejoins cluster
- Syncs with current leader
- Conflicting uncommitted entries are discarded

### Membership Changes

**Safe Reconfiguration**: Raft's joint consensus prevents split-brain during membership changes.

**Process**:
1. Propose configuration change (add/remove member)
2. Enter joint consensus (both old and new configs)
3. Commit joint consensus
4. Transition to new configuration
5. Commit new configuration

**Learner Members**: Non-voting members used for safe addition
- Receive log replication
- Don't participate in voting
- Promoted to voting member when caught up

## Storage Architecture

### MVCC Implementation

**Multi-Version Concurrency Control** enables:
- Snapshot isolation for transactions
- Historical queries
- Watch from any revision
- Non-blocking reads

**Revision Semantics**:

**Main Revision**: Global counter for all changes
```
Transaction 1: Put key=A → Revision 10
Transaction 2: Put key=B, Put key=C → Revision 11 (both get same revision)
```

**Mod Revision**: When key was last modified
```
Put key=A value=1 → ModRevision=10
Put key=A value=2 → ModRevision=15
Put key=B value=x → ModRevision=16
```

**Version**: How many times the key has been modified since it was created (a delete resets it)
```
Put key=A value=1 → Version=1
Put key=A value=2 → Version=2
Put key=A value=3 → Version=3
```

**Key Index Structure**:
```go
type keyIndex struct {
key []byte
modified revision // last modified revision
generations []generation
}

type generation struct {
ver int64 // version counter
created revision // create revision
revs []revision // all modifications
}
```

**Example**:
```
Put foo=a → Rev 10
Put foo=b → Rev 15
Delete foo → Rev 20
Put foo=c → Rev 25

keyIndex for "foo":
modified: (25,0)
generations:
[0]: created: (10,0), ver: 3, revs: [(10,0), (15,0), (20,0) tombstone]
[1]: created: (25,0), ver: 1, revs: [(25,0)]
```
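
To make the generation bookkeeping concrete, here is a small model of how puts and deletes could update a key index. It is a sketch under one stated assumption: the delete (tombstone) revision is recorded in the generation it closes, and a fresh empty generation is opened afterwards. The types are simplified, and the `mainRev`/`subRev` field names are hypothetical.

```go
package main

import "fmt"

type revision struct{ mainRev, subRev int64 }

type generation struct {
	ver     int64      // number of modifications in this generation
	created revision   // revision that created the key
	revs    []revision // every modification, including the tombstone
}

type keyIndex struct {
	key         string
	modified    revision
	generations []generation
}

// put records a modification in the current (last) generation.
func (ki *keyIndex) put(mainRev int64) {
	rev := revision{mainRev: mainRev}
	if len(ki.generations) == 0 {
		ki.generations = append(ki.generations, generation{})
	}
	g := &ki.generations[len(ki.generations)-1]
	if len(g.revs) == 0 {
		g.created = rev
	}
	g.revs = append(g.revs, rev)
	g.ver++
	ki.modified = rev
}

// tombstone records the delete in the current generation, then opens a
// new empty generation for any later re-creation of the key.
func (ki *keyIndex) tombstone(mainRev int64) {
	ki.put(mainRev)
	ki.generations = append(ki.generations, generation{})
}

func main() {
	ki := &keyIndex{key: "foo"}
	ki.put(10)       // Put foo=a
	ki.put(15)       // Put foo=b
	ki.tombstone(20) // Delete foo
	ki.put(25)       // Put foo=c

	fmt.Println("modified:", ki.modified)
	for i, g := range ki.generations {
		fmt.Printf("[%d] created=%v ver=%d revs=%v\n", i, g.created, g.ver, g.revs)
	}
}
```

Keeping the tombstone inside the closed generation lets a reader at revision 20 or later see that the key was deleted without consulting the next generation.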

### Compaction

**Purpose**: Reclaim space by removing old revisions.

**Types**:

**Periodic Compaction** (default mode when auto-compaction is enabled):
- Automatically compacts based on time
- Keeps revisions for configured duration (e.g., 5 minutes)
- `--auto-compaction-mode=periodic --auto-compaction-retention=5m`

**Revision Compaction**:
- Keeps last N revisions
- `--auto-compaction-mode=revision --auto-compaction-retention=1000`

**Process**:
1. Mark revisions < target as deleted
2. Async goroutine removes deleted revisions
3. BoltDB frees space in B+tree
4. Freed pages are reusable by BoltDB immediately, but the file does not shrink

**Effect on Operations**:
- Queries at compacted revision return `ErrCompacted`
- Watches from compacted revision fail
- Historical data is lost
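
The effect on reads can be illustrated with a toy history for a single key. This is a sketch, not etcd's MVCC store. It assumes that compaction keeps the newest revision at or below the compaction point (so a read at that point still succeeds) and that only reads below it fail. The error values and type names are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical sentinel errors for this sketch.
var (
	errCompacted = errors.New("requested revision has been compacted")
	errNotFound  = errors.New("key did not exist at that revision")
)

// keyHistory is a toy model of one key's revisions.
type keyHistory struct {
	compactRev int64
	revs       []int64 // ascending revisions at which the key was written
	values     map[int64]string
}

func (h *keyHistory) put(rev int64, value string) {
	h.revs = append(h.revs, rev)
	h.values[rev] = value
}

// compact drops every revision older than the newest one at or below rev,
// so a read at rev itself can still be answered.
func (h *keyHistory) compact(rev int64) {
	h.compactRev = rev
	keepFrom := 0
	for i, r := range h.revs {
		if r <= rev {
			keepFrom = i
		}
	}
	for _, r := range h.revs[:keepFrom] {
		delete(h.values, r)
	}
	h.revs = h.revs[keepFrom:]
}

// get returns the value visible at the given revision.
func (h *keyHistory) get(at int64) (string, error) {
	if at < h.compactRev {
		return "", errCompacted
	}
	for i := len(h.revs) - 1; i >= 0; i-- {
		if h.revs[i] <= at {
			return h.values[h.revs[i]], nil
		}
	}
	return "", errNotFound
}

func main() {
	h := &keyHistory{values: map[int64]string{}}
	h.put(10, "a")
	h.put(15, "b")
	h.put(25, "c")
	h.compact(20)

	fmt.Println(h.get(12)) // below the compaction point: error
	fmt.Println(h.get(20)) // served by the retained revision 15
	fmt.Println(len(h.revs), "revisions remain")
}
```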

### Defragmentation

**Problem**: Even after compaction, BoltDB file has fragmentation and wasted space.

**Solution**: Defragmentation creates new database file with only live data.

**Process**:
1. Create new BoltDB file
2. Copy all live data to new file
3. Atomically replace old file
4. Old file space reclaimed

**Trigger**:
```bash
etcdctl defrag # Online defrag (blocks writes)
etcdutl defrag --data-dir=/path # Offline defrag
```

**Trade-offs**:
- Online: Convenient but blocks writes, doubles disk usage temporarily
- Offline: Requires downtime but more efficient

### Transaction Model

**Transaction Structure**:
```
If <conditions>
Then <operations>
Else <operations>
```

**Example**:
```go
txn := Txn().
If(Compare(Value("key"), "=", "old")).
Then(OpPut("key", "new"), OpPut("status", "updated")).
Else(OpGet("key"))
```

**Semantics**:
- Evaluated atomically
- All comparisons in If() evaluated first
- Execute Then() if all comparisons succeed
- Execute Else() otherwise
- Return results of executed operations

**Compare Operations**:
- `Value`: Compare key value
- `Version`: Compare key version
- `CreateRevision`: Compare create revision
- `ModRevision`: Compare mod revision
- `Lease`: Compare lease ID

**Use Cases**:
- Compare-and-swap (CAS)
- Distributed locks
- Conditional updates
- Atomic multi-key operations
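
The compare-and-swap use case follows directly from the If/Then/Else semantics. The sketch below models it against an in-memory map guarded by a mutex, which stands in for the atomic evaluation that etcd provides. `kvStore` and `compareAndSwap` are hypothetical names and are not part of any etcd package.

```go
package main

import (
	"fmt"
	"sync"
)

// kvStore is an in-memory stand-in for a key-value store.
type kvStore struct {
	mu   sync.Mutex
	data map[string]string
}

// compareAndSwap models:
//
//	If(Value(key) == expected) Then(Put(key, next)) Else(Get(key))
//
// The comparison and the chosen branch run under one lock, so no other
// writer can slip in between them. It reports whether the Then branch ran
// and the value stored under key afterwards.
func (s *kvStore) compareAndSwap(key, expected, next string) (succeeded bool, current string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.data[key] == expected {
		s.data[key] = next
		return true, next
	}
	return false, s.data[key]
}

func main() {
	s := &kvStore{data: map[string]string{"key": "old"}}

	ok, cur := s.compareAndSwap("key", "old", "new")
	fmt.Println(ok, cur) // first caller wins

	ok, cur = s.compareAndSwap("key", "old", "other")
	fmt.Println(ok, cur) // second caller sees the updated value instead
}
```

A distributed lock is the same pattern with the comparison made against the key's create revision instead of its value.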

## Client API

### gRPC Services

**KV Service** (`rpc.proto`):
```protobuf
service KV {
rpc Range(RangeRequest) returns (RangeResponse); // Get
rpc Put(PutRequest) returns (PutResponse); // Put
rpc DeleteRange(DeleteRangeRequest) returns (DeleteRangeResponse); // Delete
rpc Txn(TxnRequest) returns (TxnResponse); // Transaction
rpc Compact(CompactionRequest) returns (CompactionResponse); // Compact
}
```

**Watch Service**:
```protobuf
service Watch {
rpc Watch(stream WatchRequest) returns (stream WatchResponse);
}
```

**Lease Service**:
```protobuf
service Lease {
rpc LeaseGrant(LeaseGrantRequest) returns (LeaseGrantResponse);
rpc LeaseRevoke(LeaseRevokeRequest) returns (LeaseRevokeResponse);
rpc LeaseKeepAlive(stream LeaseKeepAliveRequest) returns (stream LeaseKeepAliveResponse);
rpc LeaseTimeToLive(LeaseTimeToLiveRequest) returns (LeaseTimeToLiveResponse);
rpc LeaseLeases(LeaseLeasesRequest) returns (LeaseLeasesResponse);
}
```

**Cluster Service**:
```protobuf
service Cluster {
rpc MemberAdd(MemberAddRequest) returns (MemberAddResponse);
rpc MemberRemove(MemberRemoveRequest) returns (MemberRemoveResponse);
rpc MemberUpdate(MemberUpdateRequest) returns (MemberUpdateResponse);
rpc MemberList(MemberListRequest) returns (MemberListResponse);
rpc MemberPromote(MemberPromoteRequest) returns (MemberPromoteResponse);
}
```

**Maintenance Service**:
```protobuf
service Maintenance {
rpc Alarm(AlarmRequest) returns (AlarmResponse);
rpc Status(StatusRequest) returns (StatusResponse);
rpc Defragment(DefragmentRequest) returns (DefragmentResponse);
rpc Hash(HashRequest) returns (HashResponse);
rpc HashKV(HashKVRequest) returns (HashKVResponse);
rpc Snapshot(SnapshotRequest) returns (stream SnapshotResponse);
rpc MoveLeader(MoveLeaderRequest) returns (MoveLeaderResponse);
rpc Downgrade(DowngradeRequest) returns (DowngradeResponse);
}
```

### Client Library (client/v3)

**Basic Operations**:
```go
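package main

import (
	"context"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// check aborts on the first error to keep the example short.
func check(err error) {
	if err != nil {
		log.Fatal(err)
	}
}

func main() {
	// Create client
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"localhost:2379"},
		DialTimeout: 5 * time.Second,
	})
	check(err)
	defer cli.Close()

	// One deadline shared by every request below
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	// Put
	_, err = cli.Put(ctx, "key", "value")
	check(err)

	// Get
	getResp, err := cli.Get(ctx, "key")
	check(err)

	// Get with prefix
	prefixResp, err := cli.Get(ctx, "prefix", clientv3.WithPrefix())
	check(err)
	log.Printf("get: %d key(s), prefix get: %d key(s)", len(getResp.Kvs), len(prefixResp.Kvs))

	// Delete
	_, err = cli.Delete(ctx, "key")
	check(err)

	// Transaction: swap "old" for "new", otherwise read the current value
	txnResp, err := cli.Txn(ctx).
		If(clientv3.Compare(clientv3.Value("key"), "=", "old")).
		Then(clientv3.OpPut("key", "new")).
		Else(clientv3.OpGet("key")).
		Commit()
	check(err)
	log.Printf("transaction succeeded: %v", txnResp.Succeeded)
}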
```

**Watch**:
```go
watchChan := cli.Watch(context.Background(), "key")
for watchResp := range watchChan {
for _, event := range watchResp.Events {
fmt.Printf("Type: %s, Key: %s, Value: %s\n",
event.Type, event.Kv.Key, event.Kv.Value)
}
}
```

**Lease**:
```go
// Grant lease
lease, err := cli.Grant(ctx, 10) // 10 seconds

// Put with lease
_, err = cli.Put(ctx, "key", "value", clientv3.WithLease(lease.ID))

// Keep alive
ch, err := cli.KeepAlive(context.Background(), lease.ID)
for range ch {
// Lease renewed
}

// Revoke lease (deletes associated keys)
_, err = cli.Revoke(ctx, lease.ID)
```

## Watch Mechanism

### Architecture

```
┌────────────────────────────────────────────────┐
│ Watch Clients │
└─────────────────┬──────────────────────────────┘
┌─────────────────▼──────────────────────────────┐
│ WatchableStore │
│ ┌──────────────────────────────────────────┐ │
│ │ Watcher Registry │ │
│ │ - watchers map[string]*watcherGroup │ │
│ │ - victims (slow watchers) │ │
│ └──────────────────────────────────────────┘ │
└─────────────────┬──────────────────────────────┘
┌─────────────────▼──────────────────────────────┐
│ Event Generator │
│ - Notifies watchers on Put/Delete │
│ - Batches events for efficiency │
└─────────────────┬──────────────────────────────┘
┌─────────────────▼──────────────────────────────┐
│ MVCC Store │
│ - Generates events during apply │
└────────────────────────────────────────────────┘
```

### Watch Types

**Key Watch**: Watch single key
```go
watchChan := cli.Watch(ctx, "foo")
```

**Prefix Watch**: Watch all keys with prefix
```go
watchChan := cli.Watch(ctx, "foo", clientv3.WithPrefix())
```

**Range Watch**: Watch key range
```go
watchChan := cli.Watch(ctx, "foo", clientv3.WithRange("foz"))
```

**Historical Watch**: Watch from past revision
```go
watchChan := cli.Watch(ctx, "foo", clientv3.WithRev(100))
```

### Event Types

```go
type Event struct {
Type EventType // PUT or DELETE
Kv *KeyValue // Current key-value
PrevKv *KeyValue // Previous key-value (if WithPrevKV)
}
```

### Watch Guarantees

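1. **Ordered**: Events are delivered in revision order
2. **Reliable**: Within the retained (uncompacted) history, a watch never skips a subsequence of events
3. **Unique**: An event never appears on the same watch twice
4. **Resumable**: A canceled or disconnected watch can resume from the last received revision, provided that revision has not been compacted
5. **Atomic**: All events from one revision (for example, a multi-key transaction) are delivered together in a single response

These guarantees hold only while the watch is active. A watch that falls behind compaction is canceled and must be re-established.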
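
The resume behaviour can be sketched as a small decision function. This is a minimal illustration, not client/v3 code: `resumePoint` is a hypothetical helper, and the exact boundary between "resumable" and "compacted" is an assumption of the sketch.

```go
package main

import "fmt"

// resumePoint is a hypothetical helper (not part of clientv3). Given the
// revision of the last event a client processed and the server's compaction
// revision, it decides where a new watch should start.
//
// Assumption of this sketch: if the next revision the client needs is older
// than the compaction revision, history is gone and the client has to re-read
// the current state before watching again.
func resumePoint(lastSeen, compactRev int64) (start int64, relist bool) {
	next := lastSeen + 1
	if next < compactRev {
		// The events between next and compactRev were compacted away.
		return 0, true
	}
	return next, false
}

func main() {
	// History still available: resume right after the last seen revision.
	fmt.Println(resumePoint(100, 50))
	// History compacted: the caller must re-list and start a fresh watch.
	fmt.Println(resumePoint(100, 500))
}
```

When `relist` is true, a client would typically issue a `Get` for the watched range and start the new watch from the revision reported in that response.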

### Slow Consumer Handling

**Problem**: Slow consumer can't keep up with event rate.

**Solution**: Event buffering with overflow detection.

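**Behavior**:
- Events are buffered in a bounded channel per watch stream (`chanBufLen`: 1024 in older releases, reduced to 128 in later ones)
- If the buffer is full, the watcher is marked as a "victim" and its undelivered event batch is set aside
- A background loop retries delivery of victim batches; once delivered, the watcher returns to normal handling
- A watcher that falls behind the compaction revision is canceled, and the client must re-establish it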
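
The buffering-with-overflow idea can be sketched with a bounded channel and a non-blocking send. The `watcher` type, its fields, and the `retry` method below are hypothetical simplifications for illustration and are not the types used in etcd's MVCC package.

```go
package main

import "fmt"

// event is a simplified stand-in for a watch event.
type event struct {
	rev int64
	key string
}

// watcher is a hypothetical watcher with a bounded delivery buffer.
type watcher struct {
	ch      chan event
	victim  bool    // true while the consumer is too slow
	pending []event // events that could not be delivered yet, in order
}

func newWatcher(bufLen int) *watcher {
	return &watcher{ch: make(chan event, bufLen)}
}

// send tries a non-blocking delivery. If the buffer is full, the watcher is
// marked as a victim and the event is parked in pending to preserve order.
func (w *watcher) send(ev event) bool {
	if w.victim {
		w.pending = append(w.pending, ev)
		return false
	}
	select {
	case w.ch <- ev:
		return true
	default:
		w.victim = true
		w.pending = append(w.pending, ev)
		return false
	}
}

// retry flushes pending events in order and clears the victim flag once
// everything has been handed to the consumer's channel.
func (w *watcher) retry() {
	for len(w.pending) > 0 {
		select {
		case w.ch <- w.pending[0]:
			w.pending = w.pending[1:]
		default:
			return // still full, try again later
		}
	}
	w.victim = false
}

func main() {
	w := newWatcher(2)
	for rev := int64(1); rev <= 3; rev++ {
		w.send(event{rev: rev, key: "foo"})
	}
	fmt.Println("victim after overflow:", w.victim)

	<-w.ch // the consumer catches up
	<-w.ch
	w.retry()
	fmt.Println("victim after retry:", w.victim)
}
```

The non-blocking `select` is the key design choice: the producer never stalls on one slow consumer, at the cost of tracking parked events separately.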

## Lease System

### Architecture

```text
┌────────────────────────────────────────────┐
│ Lessor │
│ ┌──────────────────────────────────────┐ │
│ │ Lease Map │ │
│ │ leaseID → Lease{TTL, keys} │ │
│ └──────────────────────────────────────┘ │
│ ┌──────────────────────────────────────┐ │
│ │ Expiry Queue │ │
│ │ heap of leases by expiry time │ │
│ └──────────────────────────────────────┘ │
└───────────────┬────────────────────────────┘
│ Expired leases
┌───────────────────────────────────────────┐
│ Lease Revoker │
│ - Proposes lease revocation via Raft │
│ - Deletes associated keys │
└───────────────────────────────────────────┘
```

### Lease Lifecycle

**1. Grant Lease**:
```go
lease, err := cli.Grant(ctx, 30) // 30 seconds TTL
```
- Assigns unique lease ID
- Sets initial TTL
- Raft-replicated for consistency

**2. Attach Keys to Lease**:
```go
cli.Put(ctx, "key", "value", clientv3.WithLease(lease.ID))
```
- Key ownership tied to lease
- Multiple keys can share one lease
- Key deleted when lease expires

**3. Keep Alive (Renew)**:
```go
ch, err := cli.KeepAlive(ctx, lease.ID)
for range ch {
	// Each received response means the lease was renewed
}
```
- Client sends periodic heartbeats
- Resets lease expiry time
- Continues until context canceled

**4. Lease Expiration**:
- Lessor detects expired lease
- Proposes revocation via Raft
- Keys associated with lease deleted
- Lease removed from map

**5. Explicit Revocation**:
```go
cli.Revoke(ctx, lease.ID)
```
- Immediately revokes lease
- Deletes all associated keys
- Raft-replicated

### Use Cases

**Distributed Locks**:
```go
// Acquire lock
lease, _ := cli.Grant(ctx, 30)
txn := cli.Txn(ctx).
If(clientv3.Compare(clientv3.CreateRevision("lock"), "=", 0)).
Then(clientv3.OpPut("lock", "holder", clientv3.WithLease(lease.ID)))
resp, _ := txn.Commit()

if resp.Succeeded {
// Lock acquired
defer cli.Revoke(context.Background(), lease.ID)

// Keep alive in background
ch, _ := cli.KeepAlive(context.Background(), lease.ID)
go func() {
for range ch {}
}()

// Critical section
}
```
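
Why the `CreateRevision == 0` comparison yields mutual exclusion can be seen in a small in-memory model. The `lockStore` type below is a hypothetical stand-in for the server side of the transaction; it is not etcd code and omits leases and expiry.

```go
package main

import (
	"fmt"
	"sync"
)

// lockStore is a hypothetical in-memory model of the compare-and-put
// transaction used for locking.
type lockStore struct {
	mu        sync.Mutex
	rev       int64
	createRev map[string]int64 // a missing entry (0) means the key does not exist
	holder    map[string]string
}

func newLockStore() *lockStore {
	return &lockStore{createRev: map[string]int64{}, holder: map[string]string{}}
}

// tryAcquire models: If(CreateRevision(key) == 0).Then(Put(key, who)).
// The compare and the put happen atomically, so only one caller can win.
func (s *lockStore) tryAcquire(key, who string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.createRev[key] != 0 {
		return false
	}
	s.rev++
	s.createRev[key] = s.rev
	s.holder[key] = who
	return true
}

// release models deleting the key (for example, when its lease is revoked).
func (s *lockStore) release(key string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	delete(s.createRev, key)
	delete(s.holder, key)
}

func main() {
	s := newLockStore()
	var wg sync.WaitGroup
	var mu sync.Mutex
	winners := 0
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			if s.tryAcquire("lock", fmt.Sprintf("client-%d", id)) {
				mu.Lock()
				winners++
				mu.Unlock()
			}
		}(i)
	}
	wg.Wait()
	fmt.Println("winners:", winners)
}
```

In a real client, checking the errors returned by `Grant` and `Commit` matters: a failed grant would otherwise attach the lock to no lease at all.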

**Session Management**:
- Client creates lease at start
- Attaches session data to lease
- Keeps lease alive periodically
- Session auto-deleted if client crashes

**Service Discovery**:
- Service registers endpoint with lease
- Keeps lease alive while running
- Endpoint removed on service crash

## Authentication and Authorization

### Authentication

**Supported Methods**:
- **Simple Password**: Username/password authentication
- **TLS Client Certificates**: Mutual TLS authentication

**User Management**:
```bash
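etcdctl user add root                  # Root user must exist before auth is enabled
etcdctl user add myuser                # Add user (prompts for a password)
etcdctl role add admin                 # A role must exist before it is granted
etcdctl user grant-role myuser admin   # Grant role
etcdctl auth enable                    # Enable auth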
```

**Client Authentication**:
```go
cli, err := clientv3.New(clientv3.Config{
Endpoints: []string{"localhost:2379"},
Username: "myuser",
Password: "mypassword",
})
```

### Authorization (RBAC)

**Role-Based Access Control**:

**Roles**: Named collection of permissions
```bash
etcdctl role add myrole
etcdctl role grant-permission myrole read /foo
etcdctl role grant-permission myrole readwrite /bar
```

**Users**: Assigned one or more roles
```bash
etcdctl user add alice
etcdctl user grant-role alice myrole
```

**Permissions**:
- `read`: Get, watch
- `write`: Put, delete
- `readwrite`: Both read and write

**Key Ranges**: Permissions apply to key ranges
```bash
# Permission on single key
etcdctl role grant-permission myrole read /exact-key

# Permission on key prefix
etcdctl role grant-permission myrole read /prefix/ --prefix=true

# Permission on key range
etcdctl role grant-permission myrole read /start /end
```
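
Key ranges are half-open intervals `[start, end)`, and a prefix is turned into a range by computing its end key. The sketch below shows one way to do that with plain byte comparison; `prefixRangeEnd` and `inRange` are names chosen for this illustration, and treating `"\x00"` as "no upper bound" is stated here as the convention the sketch assumes.

```go
package main

import "fmt"

// prefixRangeEnd returns the exclusive end key covering every key that starts
// with prefix: the last byte below 0xff is incremented and anything after it
// is dropped.
func prefixRangeEnd(prefix string) string {
	end := []byte(prefix)
	for i := len(end) - 1; i >= 0; i-- {
		if end[i] < 0xff {
			end[i]++
			return string(end[:i+1])
		}
	}
	// Empty or all-0xff prefix: there is no finite upper bound.
	// Assumption of this sketch: "\x00" stands for "unbounded".
	return "\x00"
}

// inRange reports whether key lies in the half-open interval [start, end).
func inRange(key, start, end string) bool {
	if key < start {
		return false
	}
	return end == "\x00" || key < end
}

func main() {
	end := prefixRangeEnd("/prefix/")
	fmt.Printf("range end: %q\n", end)
	fmt.Println(inRange("/prefix/a", "/prefix/", end)) // inside the prefix
	fmt.Println(inRange("/prefixes", "/prefix/", end)) // shares letters, not the prefix
}
```

Because comparison is bytewise, a range only matches anything when `start` sorts before `end`; it is worth checking this when granting range permissions by hand.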

**Root User**: Special user with all permissions
- Must be created (`etcdctl user add root`) before `auth enable` can succeed
- Cannot be deleted while authentication is enabled
- Used for administrative tasks

## Cluster Management

### Cluster Bootstrapping

**Static Bootstrap**: All members known at start
```bash
# Member 1
etcd --name=member1 \
--initial-cluster=member1=http://host1:2380,member2=http://host2:2380,member3=http://host3:2380 \
--initial-cluster-state=new

# Member 2
etcd --name=member2 \
--initial-cluster=member1=http://host1:2380,member2=http://host2:2380,member3=http://host3:2380 \
--initial-cluster-state=new

# Member 3
etcd --name=member3 \
--initial-cluster=member1=http://host1:2380,member2=http://host2:2380,member3=http://host3:2380 \
--initial-cluster-state=new
```

**Discovery Bootstrap**: Members discover each other via discovery service
```bash
# Generate discovery URL
curl https://discovery.etcd.io/new?size=3

# Start members with discovery URL
etcd --name=member1 --discovery=https://discovery.etcd.io/xxxxx
```

### Adding Members

**1. Add Learner** (recommended):
```bash
etcdctl member add newmember --learner=true --peer-urls=http://newhost:2380
```

**2. Start New Member**:
```bash
etcd --name=newmember \
--initial-cluster-state=existing \
--initial-cluster=member1=http://host1:2380,...,newmember=http://newhost:2380
```

**3. Promote Learner to Voting Member**:
```bash
etcdctl member promote <member-id>
```

**Why Learners?**
- Prevents quorum loss during catch-up
- New member doesn't vote until fully synchronized
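- Learners do not count toward quorum, so adding one does not change the number of votes required (etcd caps the number of learners per cluster; the default limit is 1)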

### Removing Members

```bash
# List members
etcdctl member list

# Remove member
etcdctl member remove <member-id>

# Stop member process
systemctl stop etcd
```

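**Quorum Considerations**:
- Removing a member shrinks the cluster, and quorum is recalculated from the new size
- 3-member cluster: tolerates 1 failure (quorum: 2); removing 1 member leaves a 2-member cluster that tolerates none
- 5-member cluster: tolerates 2 failures (quorum: 3)
- Remove members one at a time, and never remove a majority of members simultaneously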
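
The quorum figures follow from a simple formula: quorum is a strict majority, `n/2 + 1` with integer division, and fault tolerance is whatever remains. A short sketch (function names are chosen for this example):

```go
package main

import "fmt"

// quorum returns the number of members needed for a strict majority.
func quorum(n int) int { return n/2 + 1 }

// faultTolerance returns how many members can fail while quorum is kept.
func faultTolerance(n int) int { return n - quorum(n) }

func main() {
	fmt.Println("members quorum tolerance")
	for n := 1; n <= 7; n++ {
		fmt.Printf("%7d %6d %9d\n", n, quorum(n), faultTolerance(n))
	}
}
```

The table makes the usual sizing advice visible: a 4-member cluster tolerates no more failures than a 3-member one, which is why odd sizes are preferred.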

### Disaster Recovery

**Scenario**: Lost quorum (majority of members failed).

**Recovery Steps**:

**1. Stop all members**:
```bash
systemctl stop etcd
```

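**2. Restore from snapshot on one member**:
```bash
etcdutl snapshot restore snapshot.db \
--name=member1 \
--data-dir=/var/lib/etcd-restored \
--initial-cluster=member1=http://host1:2380 \
--initial-advertise-peer-urls=http://host1:2380
```
The restore writes a new data directory and rewrites the member and cluster IDs, so the restored member cannot rejoin the old cluster by accident.

**3. Start restored member** normally, pointing it at the restored data directory:
```bash
etcd --name=member1 \
--data-dir=/var/lib/etcd-restored \
--listen-peer-urls=http://host1:2380 \
--listen-client-urls=http://host1:2379 \
--advertise-client-urls=http://host1:2379
```
`--force-new-cluster` is a discouraged last resort for cases where no snapshot is available; it can panic if old members are still alive.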

**4. Add new members** (follow normal add process).

## Performance Characteristics

### Throughput

**Typical Performance** (on SSD):
- Sequential writes: ~10,000 ops/sec
- Random writes: ~5,000-8,000 ops/sec
- Reads (local): ~100,000+ ops/sec
- Linearizable reads (quorum): ~10,000 ops/sec

**Factors**:
- Disk I/O (WAL fsync is bottleneck)
- Network latency (Raft replication)
- Key/value size
- Number of watchers
- CPU and memory

### Latency

**Write Latency** (p99):
- Local SSD: 10-50ms
- Network SSD: 50-100ms
- HDD: 100-500ms

**Read Latency**:
- Serializable (local): <1ms
- Linearizable (quorum): 10-50ms

**Components**:
- Network RTT: 1-10ms
- Raft replication: 5-20ms
- WAL fsync: 5-20ms (SSD), 50-200ms (HDD)
- BoltDB write: 1-5ms

### Scalability Limits

**Cluster Size**:
- Recommended: 3 or 5 members
- Maximum: 7 members (diminishing returns)
- Larger clusters: Higher latency, lower throughput

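**Database Size**:
- Default quota: 2GB (`--quota-backend-bytes`)
- Suggested maximum quota: 8GB
- When the quota is exceeded, etcd raises a `NOSPACE` alarm and rejects writes until space is reclaimed and the alarm is disarmed
- Maximum tested: 100GB+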

**Watchers**:
- Typical: <10,000 watchers
- Tested: 100,000+ watchers
- Impact: Memory usage, event fanout latency

**Keys**:
- Millions of keys supported
- Watch performance degrades with many keys per prefix
- Compaction critical for large keyspaces

### Optimization Techniques

**1. Use SSDs**: Dramatic improvement in write latency.

**2. Dedicated Disk**: Don't share disk with other I/O-intensive apps.

**3. Tune OS**:
```bash
# Increase file descriptors
ulimit -n 65536

# Disable swap
swapoff -a

# I/O scheduler (`none` on multi-queue kernels; older kernels call it `noop`)
echo none > /sys/block/sda/queue/scheduler
```

**4. etcd Configuration**:
```bash
# Raft entries between snapshots (higher = fewer snapshots, but more memory and slower follower catch-up)
--snapshot-count=50000

# Larger request size limit
--max-request-bytes=10485760

# Auto-compaction
--auto-compaction-mode=periodic
--auto-compaction-retention=5m
```

**5. Client Best Practices**:
- Use serializable reads when possible
- Batch operations in transactions
- Use prefix watches instead of many individual watches
- Close watchers when done
- Reuse client connections

## Failure Modes and Recovery

### Single Node Failure

**3-Member Cluster**:
- Quorum: 2 nodes
- Healthy nodes: 2
- Status: **Operational**
- Behavior: Cluster continues, leader election if leader failed

**5-Member Cluster**:
- Quorum: 3 nodes
- Healthy nodes: 4
- Status: **Operational**
- Behavior: Cluster continues normally

**Recovery**: Replace failed node with new member.

### Quorum Loss

**3-Member Cluster** (2 failures):
- Quorum: 2 nodes
- Healthy nodes: 1
- Status: **Unavailable**
- Behavior: Reads may work (serializable), writes fail

**5-Member Cluster** (3 failures):
- Quorum: 3 nodes
- Healthy nodes: 2
- Status: **Unavailable**

**Recovery**: Restore from snapshot or repair members.

### Network Partition

**Scenario**: 3-member cluster splits into [2] and [1].

**Majority Partition [2]**:
- Has quorum
- Elects leader
- Accepts writes
- Operational

**Minority Partition [1]**:
- No quorum
- Cannot elect leader
- Rejects writes
- Serves stale serializable reads

**Recovery**: When partition heals
- Minority rejoins
- Syncs with leader
- Resumes normal operation

### Disk Failure

**Symptoms**:
- Slow I/O
- WAL write errors
- Backend commit timeouts
- Member drops out of cluster

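**Recovery**:
1. Stop member
2. Replace disk
3. If the rest of the cluster still has quorum, remove the member and re-add it (as a learner) so it resyncs from the leader
4. If quorum is lost, restore the cluster from a snapshot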

### Database Corruption

**Detection**:
- Hash mismatch errors
- Backend corruption errors
- Cluster consistency check failures

**Recovery**:
1. Identify corrupt member
2. Stop corrupt member
3. Restore from snapshot
4. Restart member

**Prevention**:
- Use ECC memory
- Validate backups regularly
- Monitor cluster health

### Split-Brain Prevention

**Raft Guarantees**:
- Only one leader per term
- Leader requires majority votes
- Two partitions cannot both have quorum

**Example**: 3-node cluster splits [2] vs [1]
- Partition [2]: Can elect leader, form quorum
- Partition [1]: Cannot elect leader, no quorum
- **No split-brain possible**
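
The claim that two partitions cannot both hold quorum can be checked exhaustively for small clusters. The sketch below enumerates every two-way split; `hasQuorum` and `maxQuorumSides` are helper names invented for this example.

```go
package main

import "fmt"

// hasQuorum reports whether a partition of the given size holds a strict
// majority of an n-member cluster.
func hasQuorum(size, n int) bool { return size > n/2 }

// maxQuorumSides returns the largest number of sides that hold quorum at the
// same time, across every possible two-way split of an n-member cluster.
func maxQuorumSides(n int) int {
	most := 0
	for a := 0; a <= n; a++ {
		sides := 0
		if hasQuorum(a, n) {
			sides++
		}
		if hasQuorum(n-a, n) {
			sides++
		}
		if sides > most {
			most = sides
		}
	}
	return most
}

func main() {
	for n := 1; n <= 7; n++ {
		fmt.Printf("n=%d: at most %d side(s) with quorum\n", n, maxQuorumSides(n))
	}
}
```

For even cluster sizes an exact half-and-half split leaves neither side with quorum, so the cluster is unavailable rather than split-brained.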

## Design Decisions

### Why Raft Instead of Paxos?

**Reasons**:
- **Understandability**: Raft is easier to understand and implement
- **Strong leader**: Simplifies log management
- **Modularity**: Separate leader election, log replication, safety
- **Proof of correctness**: Formally verified safety properties

**Trade-offs**:
- Paxos may have slightly better performance in some scenarios
- Raft's strong leader can be a bottleneck

### Why BoltDB?

**Reasons**:
- **Embedded**: No separate database process
- **ACID**: Strong consistency guarantees
- **MVCC**: Perfect fit for etcd's needs
- **Memory-mapped**: Efficient reads
- **Simple**: Easy to understand and debug

**Trade-offs**:
- Single-writer (all writes through one goroutine)
- File size growth requires defragmentation
- Not optimized for very large datasets (>100GB)

### Why gRPC?

**Reasons**:
- **Performance**: Binary protocol, HTTP/2 multiplexing
- **Type Safety**: Protocol buffers with generated code
- **Streaming**: Bi-directional streaming for watch
- **Cross-language**: Clients in any language
- **Built-in**: Authentication, load balancing, timeouts

**Trade-offs**:
- More complex than REST
- Requires HTTP/2
- Less human-readable than JSON

### Why MVCC?

**Reasons**:
- **Watch**: Enables efficient watch from any revision
- **Transactions**: Snapshot isolation for txns
- **History**: Point-in-time queries
- **Kubernetes**: Matches Kubernetes resourceVersion semantics

**Trade-offs**:
- Storage overhead (multiple versions)
- Requires compaction to reclaim space
- More complex than simple key-value
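
The trade-offs above are easier to see in a toy multi-version store. This is a deliberately small sketch with hypothetical types (`store`, `version`); it models revisions, point-in-time reads, and compaction, and makes no claim to match etcd's on-disk layout.

```go
package main

import "fmt"

// version is one historical state of a key.
type version struct {
	rev     int64
	value   string
	deleted bool // tombstone marker
}

// store keeps every version of every key, ordered by revision.
type store struct {
	rev     int64
	history map[string][]version
}

func newStore() *store { return &store{history: map[string][]version{}} }

// put records a new version and returns the revision it was written at.
func (s *store) put(key, value string) int64 {
	s.rev++
	s.history[key] = append(s.history[key], version{rev: s.rev, value: value})
	return s.rev
}

// remove records a tombstone instead of erasing history.
func (s *store) remove(key string) int64 {
	s.rev++
	s.history[key] = append(s.history[key], version{rev: s.rev, deleted: true})
	return s.rev
}

// getAt returns the value of key as it was visible at revision rev.
func (s *store) getAt(key string, rev int64) (string, bool) {
	vs := s.history[key]
	for i := len(vs) - 1; i >= 0; i-- {
		if vs[i].rev <= rev {
			if vs[i].deleted {
				return "", false
			}
			return vs[i].value, true
		}
	}
	return "", false
}

// compact drops versions that were superseded at or before rev, keeping the
// newest version at or before rev so reads at rev still work.
func (s *store) compact(rev int64) {
	for key, vs := range s.history {
		keepFrom := 0
		for i, v := range vs {
			if v.rev <= rev {
				keepFrom = i
			}
		}
		s.history[key] = vs[keepFrom:]
	}
}

func main() {
	s := newStore()
	s.put("k", "a") // revision 1
	s.put("k", "b") // revision 2
	fmt.Println(s.getAt("k", 1))
	fmt.Println(s.getAt("k", 2))
	s.compact(2)
	fmt.Println(len(s.history["k"]), "version(s) kept after compaction")
}
```

The storage overhead and the need for compaction are both visible here: history grows with every write until `compact` trims it.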

## Deployment Topology

### Development (1 Node)

```text
┌──────────────┐
│ etcd-1 │
│ (Single) │
└──────────────┘
```

**Use**: Local development, testing
**Fault Tolerance**: None
**Performance**: Full read/write speed

### Production (3 Nodes)

```text
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ etcd-1 │◄────►│ etcd-2 │◄────►│ etcd-3 │
│ (Leader) │ │ (Follower) │ │ (Follower) │
└──────────────┘ └──────────────┘ └──────────────┘
```

**Use**: Small production clusters
**Fault Tolerance**: 1 node failure
**Quorum**: 2 nodes

### High Availability (5 Nodes)

```text
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ etcd-1 │◄────►│ etcd-2 │◄────►│ etcd-3 │
│ (Leader) │ │ (Follower) │ │ (Follower) │
└──────────────┘ └──────────────┘ └──────────────┘
▲ ▲
│ │
▼ ▼
┌──────────────┐ ┌──────────────┐
│ etcd-4 │◄──────────────────────────►│ etcd-5 │
│ (Follower) │ │ (Follower) │
└──────────────┘ └──────────────┘
```

**Use**: Large production clusters
**Fault Tolerance**: 2 node failures
**Quorum**: 3 nodes

### Multi-Region (5 Nodes)

```text
Region 1 Region 2 Region 3
┌──────────┐ ┌──────────┐ ┌──────────┐
│ etcd-1 │◄───────►│ etcd-2 │◄───────►│ etcd-3 │
│(Follower)│ │ (Leader) │ │(Follower)│
└──────────┘ └──────────┘ └──────────┘
Region 1 ┌──────────┐ Region 3
┌──────────┐ │ etcd-4 │ ┌──────────┐
│ etcd-5 │◄───────►│(Follower)│◄───────►│ (etc) │
│(Follower)│ └──────────┘ │ │
└──────────┘ Region 2 └──────────┘
```

**Use**: Global availability
**Considerations**:
- Higher latency (cross-region)
- Place majority in low-latency region
- Consider network costs

### Kubernetes/OpenShift

```text
┌─────────────────────────────────────────────┐
│ Kubernetes/OpenShift Cluster │
│ │
│ ┌────────────────────────────────────────┐ │
│ │ Control Plane Nodes │ │
│ │ │ │
│ │ ┌──────────┐ ┌──────────┐ ┌───────┐│ │
│ │ │ etcd-1 │ │ etcd-2 │ │ etcd-3││ │
│ │ │(Static │ │(Static │ │(Static││ │
│ │ │ Pod) │ │ Pod) │ │ Pod) ││ │
│ │ └──────────┘ └──────────┘ └───────┘│ │
│ │ ▲ ▲ ▲ │ │
│ └───────┼──────────────┼──────────────┼──┘ │
│ │ │ │ │
│ ┌───────▼──────────────▼──────────────▼──┐ │
│ │ kube-apiserver instances │ │
│ │ (read/write cluster state to etcd) │ │
│ └────────────────────────────────────────┘ │
└─────────────────────────────────────────────┘
```

**Characteristics**:
- etcd runs as static pods
- Co-located with kube-apiserver
- Dedicated data directory (hostPath)
- Separate network for peer communication

---

**Document Version**: 1.0
**Last Updated**: 2026-06-25
**Maintained By**: OpenShift etcd Team

📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win

Add language tags to the fenced blocks.

markdownlint is already flagging every bare fence here (MD040). Please mark the snippets with the right language (go, bash, protobuf, text, etc.) or exempt the file if bare fences are intentional.

🧰 Tools
🪛 LanguageTool

[style] ~1293-~1293: As an alternative to the over-used intensifier ‘very’, consider replacing this phrase.
Context: ...res defragmentation - Not optimized for very large datasets (>100GB) ### Why gRPC? **Rea...

(EN_WEAK_ADJECTIVE)

🪛 markdownlint-cli2 (0.22.1)

[warning] 56-56: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


[warning] 107-107: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


[warning] 120-120: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


[warning] 129-129: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


[warning] 211-211: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


[warning] 243-243: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


[warning] 252-252: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


[warning] 331-331: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


[warning] 345-345: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


[warning] 384-384: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


[warning] 396-396: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


[warning] 471-471: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


[warning] 477-477: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


[warning] 484-484: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


[warning] 506-506: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


[warning] 570-570: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


[warning] 730-730: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


[warning] 811-811: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


[warning] 1326-1326: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


[warning] 1339-1339: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


[warning] 1352-1352: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


[warning] 1372-1372: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


[warning] 1396-1396: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@ARCHITECTURE.md` around lines 56 - 1428, The fenced code blocks in
ARCHITECTURE.md are missing language annotations, which triggers markdownlint
MD040. Update each bare fence in the document to use the appropriate language
tag based on the snippet content, such as go, bash, protobuf, or text, and
ensure the named sections around the examples (for example, the gRPC service
definitions, client/v3 samples, and shell commands) are tagged consistently. If
any fence is intentionally language-agnostic, handle it via an explicit lint
exemption instead of leaving it bare.

Source: Linters/SAST tools

Comment thread ARCHITECTURE.md
Comment on lines +150 to +173
**Key Structures**:
```go
type EtcdServer struct {
// Raft consensus
r raftNode
raftStorage *raft.MemoryStorage

// Storage
kv mvcc.ConsistentWatchableKV
be backend.Backend

// Cluster state
cluster api.Cluster
id types.ID

// Configuration
Cfg config.ServerConfig

// Lease management
lessor lease.Lessor

// Apply layer
applyV3 apply.ApplyV3
}

📐 Maintainability & Code Quality | 🟠 Major | ⚡ Quick win

Make this EtcdServer sketch match server.go.

The current example names types that do not exist in the live struct (mvcc.ConsistentWatchableKV, api.Cluster, apply.ApplyV3) and omits several real fields (snapshotter, authStore, alarmStore, AccessController, etc.). Readers will treat this as authoritative, so it should mirror the actual struct layout. Based on the current server/etcdserver/server.go struct.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@ARCHITECTURE.md` around lines 150 - 173, Update the EtcdServer sketch so it
mirrors the live struct in server.go: replace the non-existent types and fields
shown here with the actual EtcdServer members, and add the missing real ones
such as snapshotter, authStore, alarmStore, and AccessController. Use the
current server/etcdserver/server.go EtcdServer definition as the source of
truth, and keep the field grouping aligned with the real layout so this
documentation stays authoritative.

Comment thread ARCHITECTURE.md
Comment on lines +788 to +805
### Watch Guarantees

1. **Ordered**: Events delivered in revision order
2. **Reliable**: No events are lost or duplicated
3. **Resumable**: Can resume from any revision
4. **Atomic**: Transactional puts generate single event

### Slow Consumer Handling

**Problem**: Slow consumer can't keep up with event rate.

**Solution**: Event buffering with overflow detection.

**Behavior**:
- Events buffered in channel (default 1024)
- If buffer fills, watcher marked as "victim"
- Victim watchers receive all queued events in one batch
- Client must process or risk watch cancellation

📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win

Qualify the watch guarantees.

This reads as an absolute guarantee, but etcd only promises ordered/unique/reliable delivery within the available history window, and a slow or compacted watch can still be canceled and must be re-established. (etcd.io)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@ARCHITECTURE.md` around lines 788 - 805, Qualify the Watch Guarantees section
in ARCHITECTURE.md so it does not read as an absolute promise; update the text
under the “Watch Guarantees” and “Slow Consumer Handling” headings to reflect
that ordering, uniqueness, and resumability only hold within the available
history window and that watches may be canceled on compaction or slow
consumption. Keep the wording aligned with the existing watcher/victim behavior
described in the watch-handling docs and make sure the guarantees are presented
as conditional rather than unconditional.

Comment thread ARCHITECTURE.md
Comment on lines +1059 to +1070
**2. Restore from snapshot on one member**:
```bash
etcdutl snapshot restore snapshot.db \
--name=member1 \
--initial-cluster=member1=http://host1:2380 \
--initial-advertise-peer-urls=http://host1:2380
```

**3. Start restored member**:
```bash
etcd --force-new-cluster
```

🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Use the restore flow as the default recovery path.

The normal recovery path is etcdutl snapshot restore, which creates new data dirs and rewrites member/cluster IDs; --force-new-cluster is only a discouraged fallback and can panic if old members are still alive. (etcd.io)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@ARCHITECTURE.md` around lines 1059 - 1070, Update the recovery section to
make the restore flow the default path: in the snapshot recovery steps shown
near the “Restore from snapshot on one member” and “Start restored member”
examples, keep the `etcdutl snapshot restore` flow as the primary guidance and
replace the `etcd --force-new-cluster` start step with a stronger note that
`--force-new-cluster` is only a discouraged fallback. Reference the existing
recovery examples and wording around `snapshot restore` and `force-new-cluster`
so the new text clearly steers users to the restore-based path first.

@openshift-ci

openshift-ci Bot commented Jun 26, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: sandeepknd
Once this PR has been reviewed and has the lgtm label, please ask for approval from dusk125. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@coderabbitai coderabbitai Bot left a comment

♻️ Duplicate comments (5)
ARCHITECTURE.md (4)

1059-1070: 🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Don't make --force-new-cluster the normal restart step.

After snapshot restore, the restored member should be started normally. --force-new-cluster is only a fallback and makes this recovery path look unsafe by default.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@ARCHITECTURE.md` around lines 1059 - 1070, The recovery steps in the snapshot
restore section should not present `--force-new-cluster` as the standard way to
start the restored member. Update the instructions around the `snapshot restore`
and `etcd` startup steps so the restored member is started normally by default,
and mention `--force-new-cluster` only as an exceptional fallback if needed.

150-173: 📐 Maintainability & Code Quality | 🟠 Major | 🏗️ Heavy lift

Mirror the live EtcdServer layout.

This sketch still doesn't match server.go and will read as authoritative. Please replace the placeholder types/fields with the real members from the struct definition.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@ARCHITECTURE.md` around lines 150 - 173, The EtcdServer sketch in
Architecture documentation is still using placeholder types and fields instead
of the real struct layout from EtcdServer in server.go. Update the documented
members to match the actual EtcdServer definition exactly, including the correct
field names and types for raft, storage, cluster state, configuration, lease,
and apply components, so the reference stays authoritative and in sync with the
code.

788-805: 🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Qualify the watch guarantees.

No events are lost or duplicated is still too strong; watches can be canceled by compaction or slow consumers and must be resumed from a revision.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@ARCHITECTURE.md` around lines 788 - 805, Qualify the Watch Guarantees section
in ARCHITECTURE.md so it no longer states unconditional reliability. Update the
guarantee text to reflect that watches preserve order and resumability, but may
be canceled by compaction or slow consumer handling, requiring clients to resume
from a revision; keep the wording aligned with the Watch Guarantees and Slow
Consumer Handling sections.

56-102: 📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win

Tag the remaining bare fences.

The document still has many unlabeled code/diagram fences, so MD040 will keep firing and the examples will render inconsistently. Please mark them with explicit languages such as go, bash, protobuf, or text.

Also applies to: 107-134, 211-261, 331-350, 384-402, 471-488, 506-517, 730-754, 780-786, 811-831, 1326-1416

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@ARCHITECTURE.md` around lines 56 - 102, The architecture document still
contains bare fenced blocks that trigger MD040 and inconsistent rendering.
Update the affected fences around the etcd cluster diagram and the other listed
sections by adding an explicit language tag that matches the content, using text
for diagrams and the appropriate code language for any actual snippets. Keep the
existing content intact and make sure each fence is clearly labeled so the
markdown linter stops flagging them.

Source: Linters/SAST tools

AGENTS.md (1)

21-30: 📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win

Tag the remaining bare fences.

These blocks still trip MD040. Please add explicit language tags (text, bash, etc.) so the examples render consistently.

Also applies to: 69-74, 178-183

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@AGENTS.md` around lines 21 - 30, The documentation examples still contain
bare fenced code blocks that trigger MD040. Update the fenced blocks in
AGENTS.md under the listed section blocks so each fence has an explicit language
tag such as text or bash, and apply the same fix to the other remaining bare
fences noted in the comment. Keep the surrounding example content unchanged and
ensure the fenced blocks under the storage/client/CLI sections are consistently
tagged.

Source: Linters/SAST tools

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 70734fd5-c535-4179-b634-a516c6b59d78

📥 Commits

Reviewing files that changed from the base of the PR and between 7088966 and 1bcee55.

📒 Files selected for processing (4)
  • AGENTS.md
  • ARCHITECTURE.md
  • CLAUDE.md
  • CONTRIBUTING.md
✅ Files skipped from review due to trivial changes (2)
  • CLAUDE.md
  • CONTRIBUTING.md

@openshift-ci

openshift-ci Bot commented Jun 26, 2026

@sandeepknd: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
|---|---|---|---|---|
| ci/prow/upstream-integration | 1bcee55 | link | false | /test upstream-integration |
| ci/prow/upstream-e2e | 1bcee55 | link | false | /test upstream-e2e |
| ci/prow/e2e-aws-ovn | 1bcee55 | link | true | /test e2e-aws-ovn |

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Labels

jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.

3 participants