Kubernetes 深度研究:從架構原理到源碼實現的完整解析
引言:從容器到編排的演進
在 Docker 普及之後,容器化應用的部署變得簡單,但隨之而來的問題是:如何管理成百上千個容器?如何處理服務發現、負載均衡、自動擴縮容、滾動更新?這就是 Kubernetes 要解決的問題。
Kubernetes(希臘語意為「舵手」)源自 Google 內部的 Borg 系統,集結了 Google 十多年大規模容器管理的經驗。它不僅是一個容器編排工具,更是一個完整的分散式系統平台。
本文將從架構設計到源碼實現,全面解析 Kubernetes 的核心機制。
第一章:Kubernetes 整體架構
1.1 架構概覽
graph TB
subgraph "Control Plane 控制平面"
API[API Server<br/>集群入口]
ETCD[(etcd<br/>分散式存儲)]
SCHED[Scheduler<br/>調度器]
CM[Controller Manager<br/>控制器管理器]
CCM[Cloud Controller<br/>Manager]
end
subgraph "Worker Node 1"
K1[kubelet]
KP1[kube-proxy]
CR1[Container Runtime]
P1[Pod A]
P2[Pod B]
end
subgraph "Worker Node 2"
K2[kubelet]
KP2[kube-proxy]
CR2[Container Runtime]
P3[Pod C]
P4[Pod D]
end
API <--> ETCD
SCHED --> API
CM --> API
CCM --> API
K1 --> API
K2 --> API
KP1 --> API
KP2 --> API
K1 --> CR1
K2 --> CR2
CR1 --> P1
CR1 --> P2
CR2 --> P3
CR2 --> P4
1.2 設計哲學
Kubernetes 的設計遵循幾個核心原則:
聲明式 API(Declarative API)
# 使用者只需聲明期望狀態
apiVersion: apps/v1
kind: Deployment
metadata:
name: nginx
spec:
replicas: 3 # 期望 3 個副本
selector:
matchLabels:
app: nginx
template:
metadata:
labels:
app: nginx
spec:
containers:
- name: nginx
image: nginx:1.21
系統會自動將實際狀態調整到期望狀態,這就是「控制迴路」模式。
控制迴路(Control Loop)
graph LR
A[觀察 Observe] --> B[比較 Diff]
B --> C[行動 Act]
C --> A
subgraph "控制器邏輯"
B
D[期望狀態<br/>Desired State]
E[實際狀態<br/>Current State]
D --> B
E --> B
end
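下面用一段極簡的 Go 程式示意這個「觀察—比較—行動」迴路。這不是 Kubernetes 的實際源碼,replicaController 及其字段都是為說明概念而假設的:
package main

import (
	"fmt"
	"time"
)

// 極簡控制迴路示意:持續把實際狀態調和(Reconcile)到期望狀態
type replicaController struct {
	desired int // 期望副本數
	current int // 實際副本數(真實系統中應來自觀察/本地緩存)
}

func (c *replicaController) reconcile() {
	diff := c.desired - c.current // 比較期望與實際
	switch {
	case diff > 0:
		fmt.Printf("scale up: +%d\n", diff) // 行動:擴容
		c.current += diff
	case diff < 0:
		fmt.Printf("scale down: %d\n", diff) // 行動:縮容
		c.current += diff
	}
}

func main() {
	c := &replicaController{desired: 3, current: 1}
	// 真實控制器由 Watch 事件與定期重同步驅動,這裡簡化為固定次數的輪詢
	for i := 0; i < 3; i++ {
		c.reconcile()
		time.Sleep(100 * time.Millisecond)
	}
}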
鬆耦合架構
- 各組件通過 API Server 通信
- Watch 機制實現事件驅動
- 組件可獨立升級和擴展
第二章:控制平面深度剖析
2.1 API Server:集群的神經中樞
API Server 是控制平面的核心組件:它是整個集群的統一入口,也是唯一直接讀寫 etcd 的組件,其他組件都必須通過它來存取資源。
graph TB
subgraph "API Server 內部架構"
REQ[HTTP Request] --> AUTH[認證 Authentication]
AUTH --> AUTHZ[授權 Authorization]
AUTHZ --> ADM[准入控制 Admission Control]
ADM --> VAL[驗證 Validation]
VAL --> ETCD[(etcd Storage)]
end
subgraph "認證方式"
AUTH --> C1[X509 證書]
AUTH --> C2[Bearer Token]
AUTH --> C3[ServiceAccount]
AUTH --> C4[OIDC]
end
subgraph "授權方式"
AUTHZ --> R1[RBAC]
AUTHZ --> R2[ABAC]
AUTHZ --> R3[Webhook]
end
API Server 請求處理流程(源碼分析)
// staging/src/k8s.io/apiserver/pkg/server/handler.go
func (d director) ServeHTTP(w http.ResponseWriter, req *http.Request) {
// 1. 路由匹配
path := req.URL.Path
// 2. 根據路徑選擇處理器
for _, prefix := range d.prefixes {
if strings.HasPrefix(path, prefix) {
// 3. 處理 API 請求
d.handlers[prefix].ServeHTTP(w, req)
return
}
}
}
// pkg/registry/core/pod/storage/storage.go
// Pod 資源的 REST 存儲實現
type REST struct {
*genericregistry.Store
proxyTransport http.RoundTripper
}
func (r *REST) Create(ctx context.Context, obj runtime.Object,
createValidation rest.ValidateObjectFunc,
options *metav1.CreateOptions) (runtime.Object, error) {
// 1. 驗證對象
if err := createValidation(ctx, obj); err != nil {
return nil, err
}
// 2. 準入控制
// 3. 寫入 etcd
return r.Store.Create(ctx, obj, createValidation, options)
}
Watch 機制實現
Watch 是 Kubernetes 事件驅動架構的核心:
// staging/src/k8s.io/apiserver/pkg/storage/etcd3/watcher.go
type watcher struct {
client *clientv3.Client
codec runtime.Codec
key string
rev int64
ctx context.Context
cancel context.CancelFunc
result chan watch.Event
}
func (w *watcher) run() {
// 創建 etcd watch channel
watchCh := w.client.Watch(w.ctx, w.key,
clientv3.WithRev(w.rev),
clientv3.WithPrefix())
for response := range watchCh {
for _, event := range response.Events {
// 轉換 etcd 事件為 Kubernetes watch 事件
obj, err := w.decode(event.Kv.Value)
if err != nil {
continue
}
var eventType watch.EventType
switch event.Type {
case clientv3.EventTypePut:
if event.IsCreate() {
eventType = watch.Added
} else {
eventType = watch.Modified
}
case clientv3.EventTypeDelete:
eventType = watch.Deleted
}
w.result <- watch.Event{
Type: eventType,
Object: obj,
}
}
}
}
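站在客戶端角度,可以透過 client-go 直接體驗這套 Watch 機制。以下是一個最小示例,假設在集群外運行並使用本地 kubeconfig(路徑與命名空間僅為示意):
package main

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// 假設使用本地 kubeconfig;集群內可改用 rest.InClusterConfig()
	config, err := clientcmd.BuildConfigFromFlags("", "/root/.kube/config")
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// 對 default 命名空間的 Pod 發起 Watch,透過長連接持續接收事件
	w, err := clientset.CoreV1().Pods("default").Watch(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	defer w.Stop()

	for event := range w.ResultChan() {
		pod, ok := event.Object.(*corev1.Pod)
		if !ok {
			continue
		}
		// event.Type 即上文的 Added / Modified / Deleted
		fmt.Printf("%s %s/%s\n", event.Type, pod.Namespace, pod.Name)
	}
}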
2.2 etcd:分散式一致性存儲
etcd 是 Kubernetes 的「大腦」,存儲所有集群狀態。
graph TB
subgraph "etcd 集群(Raft 共識)"
L[Leader]
F1[Follower 1]
F2[Follower 2]
L --> |心跳/日誌複製| F1
L --> |心跳/日誌複製| F2
end
API[API Server] --> |寫入| L
L --> |確認| API
subgraph "數據模型"
K1["/registry/pods/default/nginx-xxx"]
K2["/registry/services/default/nginx-svc"]
K3["/registry/deployments/default/nginx"]
end
etcd 中的 Kubernetes 數據結構
# 查看 etcd 中的數據
# 注意:etcd 中的值是 protobuf 編碼,需借助 auger 之類的社群工具解碼(此處假設已安裝 auger)
etcdctl get /registry/pods/default/nginx-deployment-xxx --print-value-only | \
  auger decode
# etcd 鍵值結構
/registry/
├── pods/
│ └── {namespace}/
│ └── {pod-name}
├── services/
│ └── {namespace}/
│ └── {service-name}
├── deployments/
│ └── {namespace}/
│ └── {deployment-name}
├── replicasets/
├── configmaps/
├── secrets/
└── ...
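也可以直接用 etcd 的 Go 客戶端按前綴列出這些鍵,驗證上述目錄結構。以下示例中的 endpoint 為假設值,實際環境還需補上 TLS 證書配置:
package main

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"https://127.0.0.1:2379"}, // 假設的 endpoint
		DialTimeout: 5 * time.Second,
		// 生產環境需配置 TLS
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	// 只取鍵名,按前綴列出所有 Pod 相關的鍵
	resp, err := cli.Get(context.TODO(), "/registry/pods/",
		clientv3.WithPrefix(), clientv3.WithKeysOnly())
	if err != nil {
		panic(err)
	}
	for _, kv := range resp.Kvs {
		fmt.Println(string(kv.Key))
	}
}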
etcd 效能優化
// etcd 客戶端配置
cfg := clientv3.Config{
Endpoints: []string{"https://etcd1:2379", "https://etcd2:2379"},
DialTimeout: 5 * time.Second,
	// 單次請求的訊息大小上限
MaxCallSendMsgSize: 2 * 1024 * 1024,
MaxCallRecvMsgSize: math.MaxInt32,
// TLS 配置
TLS: tlsConfig,
}
// 使用壓縮減少存儲空間
// API Server 啟動參數
// --etcd-compaction-interval=5m
2.3 Scheduler:智慧調度引擎
Scheduler 負責將 Pod 調度到最合適的節點。
graph TB
subgraph "調度流程"
NEW[新 Pod 創建] --> Q[調度隊列]
Q --> PRE[Prefilter<br/>預過濾]
PRE --> FIL[Filter<br/>過濾]
FIL --> POST[PostFilter<br/>後過濾]
POST --> SCORE[Score<br/>評分]
SCORE --> RES[Reserve<br/>預留]
RES --> BIND[Bind<br/>綁定]
end
subgraph "過濾插件"
FIL --> F1[NodeResourcesFit]
FIL --> F2[NodeAffinity]
FIL --> F3[PodTopologySpread]
FIL --> F4[TaintToleration]
end
subgraph "評分插件"
SCORE --> S1[NodeResourcesBalancedAllocation]
SCORE --> S2[ImageLocality]
SCORE --> S3[InterPodAffinity]
end
調度器核心邏輯(源碼分析)
// pkg/scheduler/scheduler.go
func (sched *Scheduler) scheduleOne(ctx context.Context) {
// 1. 從隊列獲取待調度的 Pod
podInfo := sched.NextPod()
pod := podInfo.Pod
// 2. 調度週期
scheduleResult, err := sched.schedulePod(ctx, fwk, state, pod)
if err != nil {
// 調度失敗處理
sched.handleSchedulingFailure(ctx, fwk, podInfo, err)
return
}
// 3. 異步綁定
go func() {
// 執行綁定週期
err := sched.bind(ctx, fwk, pod, scheduleResult.SuggestedHost, state)
if err != nil {
sched.handleBindingFailure(ctx, fwk, podInfo, err)
}
}()
}
// pkg/scheduler/framework/runtime/framework.go
func (f *frameworkImpl) RunFilterPlugins(
ctx context.Context,
state *framework.CycleState,
pod *v1.Pod,
nodeInfo *framework.NodeInfo,
) framework.PluginToStatus {
statuses := make(framework.PluginToStatus)
// 依次運行所有過濾插件
for _, pl := range f.filterPlugins {
status := f.runFilterPlugin(ctx, pl, state, pod, nodeInfo)
if !status.IsSuccess() {
// 記錄失敗原因
statuses[pl.Name()] = status
if !status.IsUnschedulable() {
return statuses
}
}
}
return statuses
}
調度器評分算法
// pkg/scheduler/framework/plugins/noderesources/balanced_allocation.go
func (ba *BalancedAllocation) Score(
ctx context.Context,
state *framework.CycleState,
pod *v1.Pod,
nodeName string,
) (int64, *framework.Status) {
nodeInfo, err := ba.handle.SnapshotSharedLister().NodeInfos().Get(nodeName)
if err != nil {
return 0, framework.AsStatus(err)
}
// 計算 CPU 和 Memory 的使用比例
requested := nodeInfo.Requested
allocatable := nodeInfo.Allocatable
cpuFraction := float64(requested.MilliCPU) / float64(allocatable.MilliCPU)
memoryFraction := float64(requested.Memory) / float64(allocatable.Memory)
// 計算平衡分數 - 越平衡分數越高
// 使用方差來衡量資源使用的平衡程度
mean := (cpuFraction + memoryFraction) / 2
variance := ((cpuFraction-mean)*(cpuFraction-mean) +
(memoryFraction-mean)*(memoryFraction-mean)) / 2
// 轉換為 0-100 的分數
score := int64((1 - variance) * float64(framework.MaxNodeScore))
return score, nil
}
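代入具體數字可以更直觀地理解這個評分:假設節點 A 的 CPU 已請求 50%、記憶體已請求 70%,節點 B 分別為 10% 與 90%。下面按上文的簡化公式驗算(僅為演算示意,上游實際實現細節略有差異,例如以標準差衡量並可擴展到更多資源維度):
package main

import "fmt"

const maxNodeScore = 100

// 與上文相同的簡化評分:資源使用越「平衡」,方差越小,分數越高
func balancedScore(cpuFraction, memFraction float64) int64 {
	mean := (cpuFraction + memFraction) / 2
	variance := ((cpuFraction-mean)*(cpuFraction-mean) +
		(memFraction-mean)*(memFraction-mean)) / 2
	return int64((1 - variance) * maxNodeScore)
}

func main() {
	// 節點 A:CPU 50%、Memory 70%,方差 0.01,得分 99
	fmt.Println(balancedScore(0.5, 0.7))
	// 節點 B:CPU 10%、Memory 90%,方差 0.16,得分 84 —— 越不平衡,分數越低
	fmt.Println(balancedScore(0.1, 0.9))
}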
2.4 Controller Manager:控制器集合
Controller Manager 運行所有內建控制器,實現聲明式 API 的核心邏輯。
graph TB
subgraph "Controller Manager"
RC[ReplicaSet Controller]
DC[Deployment Controller]
SC[StatefulSet Controller]
DSC[DaemonSet Controller]
JC[Job Controller]
NC[Node Controller]
SAC[ServiceAccount Controller]
EC[Endpoint Controller]
NSC[Namespace Controller]
end
subgraph "控制迴路示例:ReplicaSet"
W[Watch ReplicaSet] --> CHECK{實際副本數<br/>== 期望副本數?}
CHECK --> |否| CREATE[創建/刪除 Pod]
CHECK --> |是| WAIT[等待事件]
CREATE --> W
WAIT --> W
end
API[API Server] <--> RC
API <--> DC
API <--> SC
ReplicaSet 控制器源碼分析
// pkg/controller/replicaset/replica_set.go
func (rsc *ReplicaSetController) syncReplicaSet(
ctx context.Context,
key string,
) error {
namespace, name, err := cache.SplitMetaNamespaceKey(key)
if err != nil {
return err
}
// 1. 獲取 ReplicaSet
rs, err := rsc.rsLister.ReplicaSets(namespace).Get(name)
if errors.IsNotFound(err) {
return nil
}
// 2. 獲取所有匹配的 Pod
selector, err := metav1.LabelSelectorAsSelector(rs.Spec.Selector)
allPods, err := rsc.podLister.Pods(rs.Namespace).List(selector)
// 3. 過濾活躍的 Pod
filteredPods := controller.FilterActivePods(allPods)
// 4. 計算差異並調整
diff := len(filteredPods) - int(*(rs.Spec.Replicas))
if diff < 0 {
// 需要創建 Pod
diff *= -1
successfulCreations, err := rsc.createPods(ctx, rs, diff)
// ...
} else if diff > 0 {
// 需要刪除 Pod
podsToDelete := getPodsToDelete(filteredPods, diff)
rsc.deletePods(ctx, rs, podsToDelete)
// ...
}
// 5. 更新狀態
newStatus := calculateStatus(rs, filteredPods)
_, err = rsc.updateReplicaSetStatus(ctx, rs, newStatus)
return err
}
Deployment 控制器的滾動更新邏輯
// pkg/controller/deployment/rolling.go
func (dc *DeploymentController) rolloutRolling(
ctx context.Context,
d *apps.Deployment,
rsList []*apps.ReplicaSet,
) error {
newRS, oldRSs, err := dc.getAllReplicaSetsAndSyncRevision(ctx, d, rsList)
// 計算可以擴容和縮容的數量
// maxSurge: 滾動更新時可以超出期望副本數的最大數量
// maxUnavailable: 滾動更新時可以不可用的最大數量
	maxSurge, _, err := resolveFenceposts(
d.Spec.Strategy.RollingUpdate.MaxSurge,
d.Spec.Strategy.RollingUpdate.MaxUnavailable,
*(d.Spec.Replicas),
)
// 1. 擴容新的 ReplicaSet
scaledUp, err := dc.reconcileNewReplicaSet(ctx, d, newRS, maxSurge)
if scaledUp {
return dc.syncRolloutStatus(ctx, d, newRS, oldRSs)
}
// 2. 縮容舊的 ReplicaSet
scaledDown, err := dc.reconcileOldReplicaSets(ctx, d, newRS, oldRSs)
if scaledDown {
return dc.syncRolloutStatus(ctx, d, newRS, oldRSs)
}
// 3. 檢查是否完成
if deploymentComplete(d, &d.Status) {
return dc.cleanupDeployment(ctx, oldRSs, d)
}
return nil
}
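resolveFenceposts 的取整方向值得注意:maxSurge 按期望副本數向上取整,maxUnavailable 向下取整,且兩者不會同時為 0,否則滾動更新無法推進。以 10 個副本、兩個參數都是 25% 為例,可以用下面的小程式驗算(僅示意取整邏輯,並非源碼本身):
package main

import (
	"fmt"
	"math"
)

// 簡化版的百分比解析:surge 向上取整,unavailable 向下取整
func resolve(surgePercent, unavailablePercent float64, desired int) (int, int) {
	maxSurge := int(math.Ceil(float64(desired) * surgePercent))
	maxUnavailable := int(math.Floor(float64(desired) * unavailablePercent))
	// 兩者不能同時為 0,否則既不能多建也不能先刪
	if maxSurge == 0 && maxUnavailable == 0 {
		maxUnavailable = 1
	}
	return maxSurge, maxUnavailable
}

func main() {
	surge, unavailable := resolve(0.25, 0.25, 10)
	// 輸出 3 2:更新期間最多同時存在 13 個 Pod,最少保持 8 個可用
	fmt.Println(surge, unavailable)
}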
第三章:工作節點組件深度解析
3.1 kubelet:節點代理
kubelet 是運行在每個節點上的代理,負責管理 Pod 的生命週期。
graph TB
subgraph "kubelet 架構"
SYNC[SyncLoop<br/>主同步迴路]
PM[Pod Manager]
PLEG[Pod Lifecycle<br/>Event Generator]
CM[Container Manager]
VM[Volume Manager]
SM[Status Manager]
PM_MGR[Probe Manager]
SYNC --> PM
SYNC --> PLEG
SYNC --> CM
SYNC --> VM
SYNC --> SM
SYNC --> PM_MGR
end
subgraph "Container Runtime Interface"
CRI[CRI gRPC Server]
CR1[containerd]
CR2[CRI-O]
CM --> CRI
CRI --> CR1
CRI --> CR2
end
API[API Server] --> |Watch| SYNC
SYNC --> |Update Status| API
kubelet 主循環(源碼分析)
// pkg/kubelet/kubelet.go
func (kl *Kubelet) syncLoopIteration(
ctx context.Context,
configCh <-chan kubetypes.PodUpdate,
handler SyncHandler,
syncCh <-chan time.Time,
housekeepingCh <-chan time.Time,
plegCh <-chan *pleg.PodLifecycleEvent,
) bool {
select {
case u, open := <-configCh:
// 1. 處理 Pod 配置更新
if !open {
return false
}
switch u.Op {
case kubetypes.ADD:
handler.HandlePodAdditions(u.Pods)
case kubetypes.UPDATE:
handler.HandlePodUpdates(u.Pods)
case kubetypes.DELETE:
handler.HandlePodRemoves(u.Pods)
case kubetypes.RECONCILE:
handler.HandlePodReconcile(u.Pods)
}
case e := <-plegCh:
// 2. 處理 PLEG 事件
if isSyncPodWorthy(e) {
pod, ok := kl.podManager.GetPodByUID(e.ID)
if ok {
kl.podWorkers.UpdatePod(UpdatePodOptions{
Pod: pod,
UpdateType: kubetypes.SyncPodSync,
})
}
}
case <-syncCh:
// 3. 定期同步所有 Pod
podsToSync := kl.getPodsToSync()
for _, pod := range podsToSync {
kl.podWorkers.UpdatePod(UpdatePodOptions{
Pod: pod,
UpdateType: kubetypes.SyncPodSync,
})
}
case <-housekeepingCh:
// 4. 清理任務
if !kl.sourcesReady.AllReady() {
return true
}
kl.HandlePodCleanups(ctx)
}
return true
}
Pod 同步流程
// pkg/kubelet/pod_workers.go
func (p *podWorkers) managePodLoop(podUpdates <-chan struct{}) {
var lastSyncTime time.Time
for range podUpdates {
ctx, update, canStart, canEverStart, ok := p.getPodUpdate()
if !ok {
continue
}
// 執行同步
err := p.podSyncer.SyncPod(ctx, update.Options.UpdateType,
update.Options.Pod,
update.Options.MirrorPod,
update.Status)
lastSyncTime = time.Now()
p.completeWork(update, err)
}
}
// pkg/kubelet/kubelet.go
func (kl *Kubelet) SyncPod(
ctx context.Context,
updateType kubetypes.SyncPodType,
pod, mirrorPod *v1.Pod,
podStatus *kubecontainer.PodStatus,
) (isTerminal bool, err error) {
// 1. 創建 Pod 數據目錄
if err := kl.makePodDataDirs(pod); err != nil {
return false, err
}
// 2. 等待 Volume 掛載
if err := kl.volumeManager.WaitForAttachAndMount(pod); err != nil {
return false, err
}
// 3. 獲取 Pull Secrets
pullSecrets := kl.getPullSecretsForPod(pod)
// 4. 調用容器運行時創建 Pod
result := kl.containerRuntime.SyncPod(ctx, pod, podStatus,
pullSecrets, kl.backOff)
return false, result.Error()
}
3.2 Container Runtime Interface (CRI)
CRI 是 Kubernetes 與容器運行時之間的標準接口。
graph LR
subgraph "kubelet"
CRI_CLIENT[CRI Client]
end
subgraph "Container Runtime"
subgraph "containerd"
CRI_PLUGIN[CRI Plugin]
CONTAINERD[containerd daemon]
SHIM[containerd-shim]
end
subgraph "runc"
RUNC[runc]
end
end
CRI_CLIENT --> |gRPC| CRI_PLUGIN
CRI_PLUGIN --> CONTAINERD
CONTAINERD --> SHIM
SHIM --> RUNC
RUNC --> |創建| CONTAINER[Container]
CRI 接口定義
// api/cri/runtime/v1/api.proto
service RuntimeService {
// Pod 沙箱管理
rpc RunPodSandbox(RunPodSandboxRequest)
returns (RunPodSandboxResponse) {}
rpc StopPodSandbox(StopPodSandboxRequest)
returns (StopPodSandboxResponse) {}
rpc RemovePodSandbox(RemovePodSandboxRequest)
returns (RemovePodSandboxResponse) {}
rpc PodSandboxStatus(PodSandboxStatusRequest)
returns (PodSandboxStatusResponse) {}
rpc ListPodSandbox(ListPodSandboxRequest)
returns (ListPodSandboxResponse) {}
// 容器管理
rpc CreateContainer(CreateContainerRequest)
returns (CreateContainerResponse) {}
rpc StartContainer(StartContainerRequest)
returns (StartContainerResponse) {}
rpc StopContainer(StopContainerRequest)
returns (StopContainerResponse) {}
rpc RemoveContainer(RemoveContainerRequest)
returns (RemoveContainerResponse) {}
rpc ListContainers(ListContainersRequest)
returns (ListContainersResponse) {}
rpc ContainerStatus(ContainerStatusRequest)
returns (ContainerStatusResponse) {}
// 執行命令
rpc ExecSync(ExecSyncRequest) returns (ExecSyncResponse) {}
rpc Exec(ExecRequest) returns (ExecResponse) {}
rpc Attach(AttachRequest) returns (AttachResponse) {}
}
service ImageService {
rpc ListImages(ListImagesRequest) returns (ListImagesResponse) {}
rpc ImageStatus(ImageStatusRequest) returns (ImageStatusResponse) {}
rpc PullImage(PullImageRequest) returns (PullImageResponse) {}
rpc RemoveImage(RemoveImageRequest) returns (RemoveImageResponse) {}
}
kubelet 調用 CRI 創建容器
// pkg/kubelet/kuberuntime/kuberuntime_container.go
func (m *kubeGenericRuntimeManager) startContainer(
ctx context.Context,
podSandboxID string,
podSandboxConfig *runtimeapi.PodSandboxConfig,
spec *startSpec,
pod *v1.Pod,
podStatus *kubecontainer.PodStatus,
pullSecrets []v1.Secret,
podIP string,
podIPs []string,
) (string, error) {
container := spec.container
// 1. 拉取鏡像
imageRef, err := m.imagePuller.EnsureImageExists(
ctx, pod, container, pullSecrets, podSandboxConfig)
// 2. 創建容器配置
containerConfig, err := m.generateContainerConfig(
ctx, container, pod, restartCount,
podIP, imageRef, podIPs)
// 3. 調用 CRI 創建容器
containerID, err := m.runtimeService.CreateContainer(
ctx, podSandboxID, containerConfig, podSandboxConfig)
// 4. 啟動容器
err = m.runtimeService.StartContainer(ctx, containerID)
return containerID, nil
}
3.3 kube-proxy:服務網路代理
kube-proxy 負責實現 Service 的網路轉發規則。
graph TB
subgraph "kube-proxy 模式"
subgraph "iptables 模式"
IPT[iptables rules]
IPT --> |DNAT| POD1[Pod 1]
IPT --> |DNAT| POD2[Pod 2]
end
subgraph "IPVS 模式"
IPVS[IPVS Virtual Server]
IPVS --> |RR/LC/WRR| POD3[Pod 1]
IPVS --> |負載均衡| POD4[Pod 2]
end
end
CLIENT[Client] --> |ClusterIP| IPT
CLIENT2[Client] --> |ClusterIP| IPVS
iptables 規則生成
// pkg/proxy/iptables/proxier.go
func (proxier *Proxier) syncProxyRules() {
// 為每個 Service 生成 iptables 規則
for svcName, svc := range proxier.svcPortMap {
// 1. KUBE-SERVICES 鏈 - Service ClusterIP
proxier.natRules.Write(
"-A", string(kubeServicesChain),
"-m", "comment", "--comment", fmt.Sprintf("%s cluster IP", svcName),
"-m", protocol, "-p", protocol,
"-d", svc.ClusterIP().String(),
"--dport", strconv.Itoa(svc.Port()),
"-j", string(svcChain),
)
		// 2. 為每個 Endpoint 生成規則
		n := len(allLocallyReachableEndpoints)
		for i, ep := range allLocallyReachableEndpoints {
			epChain := svc.endpointChainName(ep.Endpoint)
			// 使用概率實現負載均衡
			// 第一個 Endpoint: 1/n 概率
			// 第二個 Endpoint: 1/(n-1) 概率
			// ...
			probability := 1.0 / float64(n-i)
			proxier.natRules.Write(
				"-A", string(svcChain),
				"-m", "statistic",
				"--mode", "random",
				"--probability", fmt.Sprintf("%0.10f", probability),
				"-j", string(epChain),
			)
			// 3. Endpoint 鏈 - DNAT 到 Pod IP
			proxier.natRules.Write(
				"-A", string(epChain),
				"-m", protocol, "-p", protocol,
				"-j", "DNAT",
				"--to-destination", ep.Endpoint,
			)
		}
	}
}
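這種「逐條遞增概率」的寫法能保證整體均勻分佈:第一條規則以 1/n 命中,未命中的流量落到下一條以 1/(n-1) 命中,依此類推,最後一條必然命中。以 3 個 Endpoint 為例,可以驗算每個後端最終各分得 1/3 的流量(純演算示意):
package main

import "fmt"

func main() {
	n := 3
	remaining := 1.0 // 尚未被前面規則命中的流量比例
	for i := 0; i < n; i++ {
		p := 1.0 / float64(n-i)  // 該條規則的匹配概率
		overall := remaining * p // 佔總流量的比例
		fmt.Printf("endpoint %d: rule probability %.4f, overall share %.4f\n", i, p, overall)
		remaining *= 1 - p
	}
	// 三行輸出的 overall share 均為 0.3333,即三個後端各得 1/3 流量
}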
IPVS 模式實現
// pkg/proxy/ipvs/proxier.go
func (proxier *Proxier) syncService(
svcName string,
svc *servicePortInfo,
endpoints []Endpoint,
) error {
// 1. 創建 IPVS Virtual Server
virtualServer := &utilipvs.VirtualServer{
Address: net.ParseIP(svc.ClusterIP().String()),
Port: uint16(svc.Port()),
Protocol: string(svc.Protocol()),
Scheduler: proxier.ipvsScheduler, // rr, lc, wrr, sh, dh...
}
if err := proxier.ipvs.AddVirtualServer(virtualServer); err != nil {
return err
}
// 2. 為每個 Endpoint 添加 Real Server
for _, ep := range endpoints {
realServer := &utilipvs.RealServer{
Address: net.ParseIP(ep.IP()),
Port: uint16(ep.Port()),
Weight: 1,
}
if err := proxier.ipvs.AddRealServer(virtualServer, realServer); err != nil {
return err
}
}
return nil
}
第四章:核心資源對象生命週期
4.1 Pod 生命週期
stateDiagram-v2
[*] --> Pending: 創建 Pod
Pending --> Running: 容器啟動成功
Pending --> Failed: 調度/啟動失敗
Running --> Succeeded: 所有容器正常退出
Running --> Failed: 容器異常退出
Running --> Running: 容器重啟
Succeeded --> [*]
Failed --> [*]
state Pending {
[*] --> Scheduling
Scheduling --> Scheduled: 找到節點
Scheduled --> ImagePulling
ImagePulling --> ContainerCreating
}
state Running {
[*] --> ContainersReady
ContainersReady --> Probing
Probing --> Ready: 探針通過
}
Pod 創建完整流程
sequenceDiagram
participant User
participant API as API Server
participant ETCD as etcd
participant Sched as Scheduler
participant Kubelet as kubelet
participant CRI as Container Runtime
User->>API: kubectl create pod
API->>API: 認證/授權/准入
API->>ETCD: 存儲 Pod (nodeName 為空)
API-->>User: Pod created
Sched->>API: Watch 新 Pod
API-->>Sched: Pod 事件
Sched->>Sched: 過濾 + 評分
Sched->>API: 更新 Pod.spec.nodeName
API->>ETCD: 存儲綁定信息
Kubelet->>API: Watch 分配給本節點的 Pod
API-->>Kubelet: Pod 事件
Kubelet->>CRI: RunPodSandbox
CRI-->>Kubelet: Sandbox ID
Kubelet->>CRI: CreateContainer
CRI-->>Kubelet: Container ID
Kubelet->>CRI: StartContainer
Kubelet->>API: 更新 Pod Status
Pod 健康檢查
apiVersion: v1
kind: Pod
metadata:
name: health-check-demo
spec:
containers:
- name: app
image: myapp:1.0
# 存活探針 - 失敗則重啟容器
livenessProbe:
httpGet:
path: /healthz
port: 8080
initialDelaySeconds: 30
periodSeconds: 10
failureThreshold: 3
# 就緒探針 - 失敗則從 Service 移除
readinessProbe:
httpGet:
path: /ready
port: 8080
initialDelaySeconds: 5
periodSeconds: 5
# 啟動探針 - 啟動期間使用
startupProbe:
httpGet:
path: /startup
port: 8080
failureThreshold: 30
periodSeconds: 10
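配合上面的探針配置,應用側只需暴露對應的 HTTP 端點。以下是一個假設的最小實現,端口與路徑對應上面的 YAML,/startup、/healthz、/ready 的判斷邏輯純屬示意:
package main

import (
	"log"
	"net/http"
	"sync/atomic"
	"time"
)

func main() {
	var started, ready atomic.Bool

	// 模擬啟動與預熱過程;真實應用應以實際初始化結果為準
	go func() {
		time.Sleep(10 * time.Second)
		started.Store(true)
		time.Sleep(5 * time.Second)
		ready.Store(true)
	}()

	ok := func(w http.ResponseWriter, cond bool) {
		if cond {
			w.WriteHeader(http.StatusOK)
		} else {
			w.WriteHeader(http.StatusServiceUnavailable)
		}
	}

	http.HandleFunc("/startup", func(w http.ResponseWriter, r *http.Request) { ok(w, started.Load()) })
	http.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) { ok(w, true) }) // 進程存活即可
	http.HandleFunc("/ready", func(w http.ResponseWriter, r *http.Request) { ok(w, ready.Load()) })

	log.Fatal(http.ListenAndServe(":8080", nil))
}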
探針實現
// pkg/kubelet/prober/prober.go
func (pb *prober) runProbe(
ctx context.Context,
probeType probeType,
p *v1.Probe,
pod *v1.Pod,
status v1.PodStatus,
container v1.Container,
containerID kubecontainer.ContainerID,
) (probe.Result, string, error) {
timeout := time.Duration(p.TimeoutSeconds) * time.Second
if p.Exec != nil {
// Exec 探針
return pb.exec.Probe(pb.newExecInContainer(
ctx, container, containerID, p.Exec.Command, timeout))
}
if p.HTTPGet != nil {
// HTTP 探針
scheme := strings.ToLower(string(p.HTTPGet.Scheme))
host := p.HTTPGet.Host
if host == "" {
host = status.PodIP
}
port := p.HTTPGet.Port.IntValue()
path := p.HTTPGet.Path
url := formatURL(scheme, host, port, path)
headers := buildHeader(p.HTTPGet.HTTPHeaders)
return pb.http.Probe(url, headers, timeout)
}
if p.TCPSocket != nil {
// TCP 探針
port := p.TCPSocket.Port.IntValue()
host := p.TCPSocket.Host
if host == "" {
host = status.PodIP
}
return pb.tcp.Probe(host, port, timeout)
}
return probe.Unknown, "", fmt.Errorf("missing probe handler")
}
4.2 Service 與 Endpoints
graph TB
subgraph "Service 類型"
CIP[ClusterIP<br/>集群內部訪問]
NP[NodePort<br/>節點端口暴露]
LB[LoadBalancer<br/>外部負載均衡]
EXT[ExternalName<br/>DNS 別名]
end
subgraph "Service 工作原理"
SVC[Service<br/>10.96.0.1:80]
EP[Endpoints]
POD1[Pod 1<br/>10.244.1.10:8080]
POD2[Pod 2<br/>10.244.2.20:8080]
POD3[Pod 3<br/>10.244.3.30:8080]
SVC --> EP
EP --> POD1
EP --> POD2
EP --> POD3
end
subgraph "Endpoint 選擇"
SELECTOR[Label Selector<br/>app=myapp]
SELECTOR --> |匹配| POD1
SELECTOR --> |匹配| POD2
SELECTOR --> |匹配| POD3
end
Endpoints Controller
// pkg/controller/endpoint/endpoints_controller.go
func (e *Controller) syncService(ctx context.Context, key string) error {
namespace, name, _ := cache.SplitMetaNamespaceKey(key)
// 1. 獲取 Service
service, err := e.serviceLister.Services(namespace).Get(name)
if errors.IsNotFound(err) {
// Service 已刪除,刪除對應的 Endpoints
return e.client.CoreV1().Endpoints(namespace).Delete(
ctx, name, metav1.DeleteOptions{})
}
// 2. 獲取匹配的 Pod
pods, err := e.podLister.Pods(namespace).List(
labels.Set(service.Spec.Selector).AsSelectorPreValidated())
// 3. 構建 Endpoints
subsets := []v1.EndpointSubset{}
for _, pod := range pods {
if !podutil.IsPodReady(pod) {
continue // 跳過未就緒的 Pod
}
epa := v1.EndpointAddress{
IP: pod.Status.PodIP,
NodeName: &pod.Spec.NodeName,
TargetRef: &v1.ObjectReference{
Kind: "Pod",
Namespace: pod.Namespace,
Name: pod.Name,
UID: pod.UID,
},
}
for _, servicePort := range service.Spec.Ports {
portNum, _ := podutil.FindPort(pod, &servicePort)
epp := v1.EndpointPort{
Port: int32(portNum),
Protocol: servicePort.Protocol,
Name: servicePort.Name,
}
subsets = addEndpointSubset(subsets, epa, epp)
}
}
// 4. 創建/更新 Endpoints
newEndpoints := &v1.Endpoints{
ObjectMeta: metav1.ObjectMeta{
Name: service.Name,
Namespace: service.Namespace,
},
Subsets: subsets,
}
return e.client.CoreV1().Endpoints(namespace).Update(
ctx, newEndpoints, metav1.UpdateOptions{})
}
4.3 Deployment 滾動更新策略
graph TB
subgraph "滾動更新過程"
D[Deployment v2]
subgraph "ReplicaSet 變化"
RS1[RS v1: 3 → 2 → 1 → 0]
RS2[RS v2: 0 → 1 → 2 → 3]
end
D --> RS1
D --> RS2
end
subgraph "時間線"
T1[T1: v1=3, v2=0]
T2[T2: v1=2, v2=1]
T3[T3: v1=1, v2=2]
T4[T4: v1=0, v2=3]
T1 --> T2 --> T3 --> T4
end
Deployment 策略配置
apiVersion: apps/v1
kind: Deployment
metadata:
name: rolling-update-demo
spec:
replicas: 10
strategy:
type: RollingUpdate
rollingUpdate:
# 最多超出期望副本數的 25%
maxSurge: 25%
      # 最多允許 25% 的副本不可用
maxUnavailable: 25%
# 最小就緒時間
minReadySeconds: 10
# 保留的歷史版本數
revisionHistoryLimit: 10
# 進度截止時間
progressDeadlineSeconds: 600
selector:
matchLabels:
app: myapp
template:
metadata:
labels:
app: myapp
spec:
containers:
- name: app
image: myapp:v2
readinessProbe:
httpGet:
path: /ready
port: 8080
periodSeconds: 5
第五章:Kubernetes 網路模型
5.1 網路設計原則
Kubernetes 網路遵循以下原則:
- 每個 Pod 擁有獨立的 IP 地址
- Pod 之間可以直接通信,無需 NAT
- Node 與 Pod 之間可以直接通信
- Pod 看到的自己的 IP 就是其他 Pod 看到的 IP
graph TB
subgraph "Node 1 (10.0.1.1)"
subgraph "Pod Network 10.244.1.0/24"
P1[Pod A<br/>10.244.1.10]
P2[Pod B<br/>10.244.1.20]
end
BR1[cni0 Bridge]
VETH1[veth pairs]
P1 --> VETH1
P2 --> VETH1
VETH1 --> BR1
end
subgraph "Node 2 (10.0.1.2)"
subgraph "Pod Network 10.244.2.0/24"
P3[Pod C<br/>10.244.2.10]
P4[Pod D<br/>10.244.2.20]
end
BR2[cni0 Bridge]
VETH2[veth pairs]
P3 --> VETH2
P4 --> VETH2
VETH2 --> BR2
end
subgraph "Overlay Network"
VXLAN[VXLAN/IPIP Tunnel]
end
BR1 --> VXLAN
BR2 --> VXLAN
5.2 CNI (Container Network Interface)
CNI 是 Kubernetes 網路插件的標準接口。
graph LR
subgraph "CNI 工作流程"
KUBELET[kubelet]
CNI[CNI Plugin]
IPAM[IPAM Plugin]
KUBELET --> |1. ADD| CNI
CNI --> |2. 分配 IP| IPAM
IPAM --> |3. IP 地址| CNI
CNI --> |4. 配置網路| NETNS[Pod Network Namespace]
end
CNI 配置文件
// /etc/cni/net.d/10-calico.conflist
{
"name": "k8s-pod-network",
"cniVersion": "0.3.1",
"plugins": [
{
"type": "calico",
"log_level": "info",
"datastore_type": "kubernetes",
"nodename": "node1",
"mtu": 1440,
"ipam": {
"type": "calico-ipam"
},
"policy": {
"type": "k8s"
},
"kubernetes": {
"kubeconfig": "/etc/cni/net.d/calico-kubeconfig"
}
},
{
"type": "portmap",
"snat": true,
"capabilities": {"portMappings": true}
}
]
}
CNI 插件接口
// CNI 插件需要實現的接口
type CNI interface {
AddNetworkList(ctx context.Context, net *NetworkConfigList,
rt *RuntimeConf) (types.Result, error)
DelNetworkList(ctx context.Context, net *NetworkConfigList,
rt *RuntimeConf) error
CheckNetworkList(ctx context.Context, net *NetworkConfigList,
rt *RuntimeConf) error
}
// kubelet 調用 CNI
func (plugin *cniNetworkPlugin) SetUpPod(
namespace string, name string,
id kubecontainer.ContainerID,
annotations, options map[string]string,
) error {
// 構建 RuntimeConf
rt := &libcni.RuntimeConf{
ContainerID: id.ID,
NetNS: netnsPath,
IfName: defaultIfName,
Args: [][2]string{
{"K8S_POD_NAMESPACE", namespace},
{"K8S_POD_NAME", name},
{"K8S_POD_INFRA_CONTAINER_ID", id.ID},
},
}
// 調用 CNI 插件
result, err := plugin.cni.AddNetworkList(context.TODO(),
plugin.netConfig, rt)
return err
}
5.3 常見 CNI 插件比較
graph TB
subgraph "Flannel"
F_VXLAN[VXLAN 模式]
F_HOST[Host-GW 模式]
F_SIMPLE[簡單易用]
end
subgraph "Calico"
C_BGP[BGP 路由]
C_IPIP[IPIP 隧道]
C_POLICY[網路策略]
C_PERF[高性能]
end
subgraph "Cilium"
CI_EBPF[eBPF 數據平面]
CI_L7[L7 感知]
CI_OBS[可觀測性]
CI_SEC[安全策略]
end
subgraph "Weave"
W_MESH[Mesh 網路]
W_ENCRYPT[加密通信]
W_AUTO[自動發現]
end
Calico BGP 路由原理
graph LR
subgraph "Node 1"
P1[Pod<br/>10.244.1.10]
BIRD1[BIRD BGP]
end
subgraph "Node 2"
P2[Pod<br/>10.244.2.10]
BIRD2[BIRD BGP]
end
subgraph "路由表"
R1[10.244.1.0/24 via Node1]
R2[10.244.2.0/24 via Node2]
end
BIRD1 <--> |BGP| BIRD2
BIRD1 --> R1
BIRD2 --> R2
P1 --> |直接路由| P2
第六章:存儲系統設計
6.1 Volume 類型與抽象
graph TB
subgraph "Volume 類型"
EMT[emptyDir<br/>臨時存儲]
HOST[hostPath<br/>節點存儲]
CFG[configMap/secret<br/>配置存儲]
PVC_V[persistentVolumeClaim<br/>持久存儲]
CSI_V[csi<br/>CSI 驅動]
end
subgraph "持久卷架構"
PV[PersistentVolume<br/>集群級資源]
PVC[PersistentVolumeClaim<br/>命名空間級請求]
SC[StorageClass<br/>動態配置]
PVC --> |綁定| PV
SC --> |動態創建| PV
end
subgraph "後端存儲"
NFS[NFS]
CEPH[Ceph RBD]
AWS[AWS EBS]
GCE[GCE PD]
LOCAL[Local Volume]
end
PV --> NFS
PV --> CEPH
PV --> AWS
6.2 CSI (Container Storage Interface)
graph TB
subgraph "CSI 架構"
subgraph "Controller Plugin"
CS[Controller Service<br/>CreateVolume/DeleteVolume]
end
subgraph "Node Plugin"
NS[Node Service<br/>NodePublishVolume]
end
subgraph "Kubernetes 組件"
ATTACH[external-attacher]
PROV[external-provisioner]
REG[node-driver-registrar]
end
end
PROV --> |CreateVolume| CS
ATTACH --> |ControllerPublish| CS
REG --> NS
subgraph "kubelet"
KL[Volume Manager]
end
KL --> |NodePublishVolume| NS
CSI Driver 接口
// CSI Controller 接口
type ControllerServer interface {
	// 創建 Volume
	CreateVolume(context.Context, *CreateVolumeRequest) (*CreateVolumeResponse, error)
	// 刪除 Volume
	DeleteVolume(context.Context, *DeleteVolumeRequest) (*DeleteVolumeResponse, error)
	// 附加 Volume 到節點
	ControllerPublishVolume(context.Context, *ControllerPublishVolumeRequest) (*ControllerPublishVolumeResponse, error)
	// 從節點分離 Volume
	ControllerUnpublishVolume(context.Context, *ControllerUnpublishVolumeRequest) (*ControllerUnpublishVolumeResponse, error)
	// 創建快照
	CreateSnapshot(context.Context, *CreateSnapshotRequest) (*CreateSnapshotResponse, error)
}
// CSI Node 接口
type NodeServer interface {
	// 掛載 Volume 到 Pod
	NodePublishVolume(context.Context, *NodePublishVolumeRequest) (*NodePublishVolumeResponse, error)
	// 卸載 Volume
	NodeUnpublishVolume(context.Context, *NodeUnpublishVolumeRequest) (*NodeUnpublishVolumeResponse, error)
	// 設備準備(格式化、掛載到全局目錄)
	NodeStageVolume(context.Context, *NodeStageVolumeRequest) (*NodeStageVolumeResponse, error)
	// 設備清理
	NodeUnstageVolume(context.Context, *NodeUnstageVolumeRequest) (*NodeUnstageVolumeResponse, error)
}
動態供應流程
sequenceDiagram
participant User
participant API as API Server
participant Prov as external-provisioner
participant CSI as CSI Controller
participant Storage as Storage Backend
User->>API: 創建 PVC
API->>Prov: Watch PVC
Prov->>Prov: 檢查 StorageClass
Prov->>CSI: CreateVolume
CSI->>Storage: 創建實際存儲
Storage-->>CSI: Volume ID
CSI-->>Prov: Volume 信息
Prov->>API: 創建 PV
API->>API: 綁定 PVC 和 PV
第七章:自定義資源與 Operator 模式
7.1 Custom Resource Definition (CRD)
CRD 允許擴展 Kubernetes API。
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
name: databases.example.com
spec:
group: example.com
versions:
- name: v1
served: true
storage: true
schema:
openAPIV3Schema:
type: object
properties:
spec:
type: object
properties:
engine:
type: string
enum: ["mysql", "postgresql"]
version:
type: string
replicas:
type: integer
minimum: 1
maximum: 5
storage:
type: string
pattern: '^[0-9]+Gi$'
status:
type: object
properties:
phase:
type: string
readyReplicas:
type: integer
subresources:
status: {}
scale:
specReplicasPath: .spec.replicas
statusReplicasPath: .status.readyReplicas
scope: Namespaced
names:
plural: databases
singular: database
kind: Database
shortNames:
- db
使用自定義資源
apiVersion: example.com/v1
kind: Database
metadata:
name: my-database
spec:
engine: postgresql
version: "14.0"
replicas: 3
storage: 100Gi
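CRD 註冊後,這個 Database 資源就和內建資源一樣可以被 kubectl 及各類客戶端操作。以下是用 client-go 的 dynamic client 讀取它的假設示例(kubeconfig 路徑僅為示意):
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", "/root/.kube/config") // 假設路徑
	if err != nil {
		panic(err)
	}
	dyn, err := dynamic.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// GVR 對應上面 CRD 中定義的 group / version / plural
	gvr := schema.GroupVersionResource{Group: "example.com", Version: "v1", Resource: "databases"}
	db, err := dyn.Resource(gvr).Namespace("default").Get(context.TODO(), "my-database", metav1.GetOptions{})
	if err != nil {
		panic(err)
	}

	// 自定義資源以 Unstructured 形式返回,字段按路徑訪問
	engine, _, _ := unstructured.NestedString(db.Object, "spec", "engine")
	fmt.Println("engine:", engine)
}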
7.2 Operator 模式
Operator 是將運維知識編碼到軟體中的模式。
graph TB
subgraph "Operator 架構"
OP[Operator Controller]
subgraph "監控"
W1[Watch CRD]
W2[Watch Pod]
W3[Watch PVC]
W4[Watch Service]
end
subgraph "調和邏輯"
R1[創建/更新資源]
R2[配置管理]
R3[備份恢復]
R4[擴縮容]
R5[故障處理]
end
W1 --> OP
W2 --> OP
W3 --> OP
W4 --> OP
OP --> R1
OP --> R2
OP --> R3
OP --> R4
OP --> R5
end
API[API Server] <--> OP
使用 controller-runtime 實現 Operator
// 數據庫 Operator 示例
package controllers
import (
	"context"
	"fmt"

	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"

	dbv1 "example.com/database-operator/api/v1"
)
type DatabaseReconciler struct {
client.Client
Scheme *runtime.Scheme
}
// Reconcile 是核心調和邏輯
func (r *DatabaseReconciler) Reconcile(
ctx context.Context,
req ctrl.Request,
) (ctrl.Result, error) {
log := ctrl.LoggerFrom(ctx)
// 1. 獲取 Database 資源
var database dbv1.Database
if err := r.Get(ctx, req.NamespacedName, &database); err != nil {
return ctrl.Result{}, client.IgnoreNotFound(err)
}
// 2. 確保 StatefulSet 存在
sts := r.buildStatefulSet(&database)
if err := r.createOrUpdate(ctx, sts); err != nil {
return ctrl.Result{}, err
}
// 3. 確保 Service 存在
svc := r.buildService(&database)
if err := r.createOrUpdate(ctx, svc); err != nil {
return ctrl.Result{}, err
}
// 4. 確保 PVC 存在
pvc := r.buildPVC(&database)
if err := r.createOrUpdate(ctx, pvc); err != nil {
return ctrl.Result{}, err
}
// 5. 更新狀態
database.Status.Phase = "Running"
database.Status.ReadyReplicas = sts.Status.ReadyReplicas
if err := r.Status().Update(ctx, &database); err != nil {
return ctrl.Result{}, err
}
log.Info("Reconciled database", "name", database.Name)
return ctrl.Result{}, nil
}
// SetupWithManager 設置 Controller
func (r *DatabaseReconciler) SetupWithManager(mgr ctrl.Manager) error {
return ctrl.NewControllerManagedBy(mgr).
For(&dbv1.Database{}).
Owns(&appsv1.StatefulSet{}).
Owns(&corev1.Service{}).
Owns(&corev1.PersistentVolumeClaim{}).
Complete(r)
}
// buildStatefulSet 構建 StatefulSet
func (r *DatabaseReconciler) buildStatefulSet(db *dbv1.Database) *appsv1.StatefulSet {
replicas := int32(db.Spec.Replicas)
sts := &appsv1.StatefulSet{
ObjectMeta: metav1.ObjectMeta{
Name: db.Name,
Namespace: db.Namespace,
},
Spec: appsv1.StatefulSetSpec{
Replicas: &replicas,
Selector: &metav1.LabelSelector{
MatchLabels: map[string]string{"app": db.Name},
},
ServiceName: db.Name,
Template: corev1.PodTemplateSpec{
ObjectMeta: metav1.ObjectMeta{
Labels: map[string]string{"app": db.Name},
},
Spec: corev1.PodSpec{
Containers: []corev1.Container{{
Name: "database",
Image: fmt.Sprintf("%s:%s", db.Spec.Engine, db.Spec.Version),
Ports: []corev1.ContainerPort{{
ContainerPort: 5432,
}},
VolumeMounts: []corev1.VolumeMount{{
Name: "data",
MountPath: "/var/lib/data",
}},
}},
},
},
VolumeClaimTemplates: []corev1.PersistentVolumeClaim{{
ObjectMeta: metav1.ObjectMeta{Name: "data"},
Spec: corev1.PersistentVolumeClaimSpec{
AccessModes: []corev1.PersistentVolumeAccessMode{
corev1.ReadWriteOnce,
},
Resources: corev1.ResourceRequirements{
Requests: corev1.ResourceList{
corev1.ResourceStorage: resource.MustParse(db.Spec.Storage),
},
},
},
}},
},
}
// 設置 OwnerReference
ctrl.SetControllerReference(db, sts, r.Scheme)
return sts
}
第八章:安全機制深度解析
8.1 認證與授權
graph TB
subgraph "認證 Authentication"
REQ[請求] --> AUTH{認證}
AUTH --> C1[X.509 證書]
AUTH --> C2[Bearer Token]
AUTH --> C3[ServiceAccount]
AUTH --> C4[OIDC]
AUTH --> C5[Webhook]
end
subgraph "授權 Authorization"
AUTH --> AUTHZ{授權}
AUTHZ --> R1[RBAC]
AUTHZ --> R2[ABAC]
AUTHZ --> R3[Node]
AUTHZ --> R4[Webhook]
end
subgraph "准入控制 Admission"
AUTHZ --> ADM{准入}
ADM --> A1[Mutating Webhook]
ADM --> A2[Validating Webhook]
ADM --> A3[內建控制器]
end
RBAC 配置示例
# 定義角色
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
namespace: default
name: pod-reader
rules:
- apiGroups: [""]
resources: ["pods"]
verbs: ["get", "watch", "list"]
- apiGroups: [""]
resources: ["pods/log"]
verbs: ["get"]
---
# 綁定角色到用戶
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
name: read-pods
namespace: default
subjects:
- kind: User
name: jane
apiGroup: rbac.authorization.k8s.io
- kind: ServiceAccount
name: default
namespace: kube-system
roleRef:
kind: Role
name: pod-reader
apiGroup: rbac.authorization.k8s.io
---
# 集群級別角色
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
name: secret-reader
rules:
- apiGroups: [""]
resources: ["secrets"]
verbs: ["get", "watch", "list"]
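排查 RBAC 問題時,除了 kubectl auth can-i,也可以直接發起 SelfSubjectAccessReview,確認當前身份是否具備某項權限。以下是一個假設的最小示例:
package main

import (
	"context"
	"fmt"

	authv1 "k8s.io/api/authorization/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", "/root/.kube/config") // 假設路徑
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// 詢問:當前身份能否在 default 命名空間 list pods?
	review := &authv1.SelfSubjectAccessReview{
		Spec: authv1.SelfSubjectAccessReviewSpec{
			ResourceAttributes: &authv1.ResourceAttributes{
				Namespace: "default",
				Verb:      "list",
				Resource:  "pods",
			},
		},
	}
	resp, err := clientset.AuthorizationV1().SelfSubjectAccessReviews().
		Create(context.TODO(), review, metav1.CreateOptions{})
	if err != nil {
		panic(err)
	}
	fmt.Println("allowed:", resp.Status.Allowed)
}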
8.2 Pod 安全
apiVersion: v1
kind: Pod
metadata:
name: secure-pod
spec:
# Pod 級別安全上下文
securityContext:
runAsNonRoot: true
runAsUser: 1000
runAsGroup: 3000
fsGroup: 2000
seccompProfile:
type: RuntimeDefault
containers:
- name: app
image: myapp:1.0
# 容器級別安全上下文
securityContext:
allowPrivilegeEscalation: false
readOnlyRootFilesystem: true
capabilities:
drop:
- ALL
add:
- NET_BIND_SERVICE
# 資源限制
resources:
limits:
cpu: "1"
memory: "512Mi"
requests:
cpu: "100m"
memory: "128Mi"
# 只讀掛載
volumeMounts:
- name: tmp
mountPath: /tmp
- name: config
mountPath: /etc/config
readOnly: true
volumes:
- name: tmp
emptyDir: {}
- name: config
configMap:
name: app-config
8.3 NetworkPolicy
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: api-server-policy
namespace: production
spec:
podSelector:
matchLabels:
app: api-server
policyTypes:
- Ingress
- Egress
ingress:
# 只允許來自 web 層的流量
- from:
- podSelector:
matchLabels:
app: web-frontend
- namespaceSelector:
matchLabels:
name: monitoring
ports:
- protocol: TCP
port: 8080
egress:
# 只允許訪問數據庫
- to:
- podSelector:
matchLabels:
app: database
ports:
- protocol: TCP
port: 5432
# 允許 DNS 查詢
- to:
- namespaceSelector: {}
podSelector:
matchLabels:
k8s-app: kube-dns
ports:
- protocol: UDP
port: 53
第九章:生產環境最佳實踐
9.1 高可用架構
graph TB
subgraph "Control Plane HA"
LB[Load Balancer]
subgraph "Master 1"
API1[API Server]
SCHED1[Scheduler<br/>leader]
CM1[Controller Manager<br/>standby]
end
subgraph "Master 2"
API2[API Server]
SCHED2[Scheduler<br/>standby]
CM2[Controller Manager<br/>leader]
end
subgraph "Master 3"
API3[API Server]
SCHED3[Scheduler<br/>standby]
CM3[Controller Manager<br/>standby]
end
LB --> API1
LB --> API2
LB --> API3
end
subgraph "etcd Cluster"
ETCD1[(etcd 1)]
ETCD2[(etcd 2)]
ETCD3[(etcd 3)]
ETCD1 <--> ETCD2
ETCD2 <--> ETCD3
ETCD1 <--> ETCD3
end
API1 --> ETCD1
API2 --> ETCD2
API3 --> ETCD3
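Scheduler 與 Controller Manager 的「一主多備」依賴 client-go 的 leader election:各實例競爭同一把 Lease 鎖,持鎖者成為 leader,失鎖則停止工作。下面是一個簡化的用法示意,Lease 名稱、命名空間與身份標識均為假設值:
package main

import (
	"context"
	"log"
	"os"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", "/root/.kube/config") // 假設路徑
	if err != nil {
		log.Fatal(err)
	}
	clientset := kubernetes.NewForConfigOrDie(config)

	id, _ := os.Hostname() // 以主機名作為競選者身份
	lock := &resourcelock.LeaseLock{
		LeaseMeta: metav1.ObjectMeta{Name: "demo-controller", Namespace: "kube-system"},
		Client:    clientset.CoordinationV1(),
		LockConfig: resourcelock.ResourceLockConfig{
			Identity: id,
		},
	}

	leaderelection.RunOrDie(context.Background(), leaderelection.LeaderElectionConfig{
		Lock:            lock,
		LeaseDuration:   15 * time.Second, // 與 kube-controller-manager 默認值一致
		RenewDeadline:   10 * time.Second,
		RetryPeriod:     2 * time.Second,
		ReleaseOnCancel: true,
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: func(ctx context.Context) {
				log.Println("became leader, start controllers")
			},
			OnStoppedLeading: func() {
				log.Println("lost leadership, stop working")
			},
		},
	})
}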
9.2 資源管理
# ResourceQuota - 命名空間資源配額
apiVersion: v1
kind: ResourceQuota
metadata:
name: compute-quota
namespace: team-a
spec:
hard:
requests.cpu: "20"
requests.memory: 40Gi
limits.cpu: "40"
limits.memory: 80Gi
pods: "50"
services: "10"
persistentvolumeclaims: "20"
secrets: "20"
configmaps: "20"
---
# LimitRange - 默認資源限制
apiVersion: v1
kind: LimitRange
metadata:
name: default-limits
namespace: team-a
spec:
limits:
- type: Container
default:
cpu: "500m"
memory: "512Mi"
defaultRequest:
cpu: "100m"
memory: "128Mi"
max:
cpu: "2"
memory: "4Gi"
min:
cpu: "50m"
memory: "64Mi"
9.3 監控與可觀測性
# Prometheus ServiceMonitor
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
name: app-monitor
namespace: monitoring
spec:
selector:
matchLabels:
app: myapp
endpoints:
- port: metrics
interval: 15s
path: /metrics
namespaceSelector:
matchNames:
- production
---
# 重要 Kubernetes 指標
# 1. API Server 指標
# - apiserver_request_total
# - apiserver_request_duration_seconds
# - etcd_request_duration_seconds
#
# 2. kubelet 指標
# - kubelet_running_pods
# - kubelet_running_containers
# - container_cpu_usage_seconds_total
# - container_memory_usage_bytes
#
# 3. kube-state-metrics
# - kube_pod_status_phase
# - kube_deployment_status_replicas
# - kube_node_status_condition
第十章:源碼架構總覽
10.1 代碼目錄結構
kubernetes/
├── cmd/ # 各組件入口
│ ├── kube-apiserver/
│ ├── kube-controller-manager/
│ ├── kube-scheduler/
│ ├── kubelet/
│ └── kube-proxy/
│
├── pkg/ # 核心實現
│ ├── api/ # API 類型定義
│ ├── controller/ # 內建控制器
│ ├── kubelet/ # kubelet 實現
│ ├── scheduler/ # 調度器實現
│ ├── proxy/ # kube-proxy 實現
│ └── registry/ # API 資源存儲
│
├── staging/ # 可獨立發布的庫
│ └── src/k8s.io/
│ ├── api/ # API 類型
│ ├── apimachinery/ # API 機制
│ ├── client-go/ # Go 客戶端
│ └── apiserver/ # API Server 庫
│
├── vendor/ # 依賴
└── hack/ # 腳本工具
10.2 核心設計模式
Informer 模式
// client-go 的 Informer 機制
type sharedIndexInformer struct {
indexer Indexer // 本地緩存
controller Controller // Watch 控制器
processor *sharedProcessor // 事件處理器
}
func (s *sharedIndexInformer) Run(stopCh <-chan struct{}) {
// 1. 創建 DeltaFIFO 隊列
fifo := NewDeltaFIFOWithOptions(DeltaFIFOOptions{
KnownObjects: s.indexer,
EmitDeltaTypeReplaced: true,
})
// 2. 創建 Controller
s.controller = New(Config{
Queue: fifo,
ListerWatcher: s.listerWatcher,
ObjectType: s.objectType,
FullResyncPeriod: s.resyncCheckPeriod,
RetryOnError: false,
Process: func(obj interface{}, isInInitialList bool) error {
// 處理資源變更
for _, d := range obj.(Deltas) {
switch d.Type {
case Sync, Replaced, Added, Updated:
s.indexer.Update(d.Object)
s.processor.distribute(updateNotification{d.Object}, false)
case Deleted:
s.indexer.Delete(d.Object)
s.processor.distribute(deleteNotification{d.Object}, false)
}
}
return nil
},
})
// 3. 啟動事件處理
wg.StartWithChannel(processorStopCh, s.processor.run)
// 4. 啟動控制器
s.controller.Run(stopCh)
}
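實際編寫控制器時通常不直接操作 sharedIndexInformer,而是透過 SharedInformerFactory 獲取 Informer 並註冊事件回調,本地緩存與 Watch 斷線重連都由框架處理。下面是一個假設的最小用法(kubeconfig 路徑與 resync 週期僅為示意):
package main

import (
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", "/root/.kube/config") // 假設路徑
	if err != nil {
		panic(err)
	}
	clientset := kubernetes.NewForConfigOrDie(config)

	// 每 30 秒做一次全量 resync
	factory := informers.NewSharedInformerFactory(clientset, 30*time.Second)
	podInformer := factory.Core().V1().Pods().Informer()

	podInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc: func(obj interface{}) {
			pod := obj.(*corev1.Pod)
			fmt.Println("ADD", pod.Namespace, pod.Name)
		},
		UpdateFunc: func(oldObj, newObj interface{}) {
			pod := newObj.(*corev1.Pod)
			fmt.Println("UPDATE", pod.Namespace, pod.Name)
		},
		DeleteFunc: func(obj interface{}) {
			fmt.Println("DELETE")
		},
	})

	stopCh := make(chan struct{})
	defer close(stopCh)
	factory.Start(stopCh)            // 啟動所有已註冊的 Informer
	factory.WaitForCacheSync(stopCh) // 等待本地緩存完成初次 List
	<-stopCh
}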
WorkQueue 模式
// 帶限流的工作隊列
type rateLimitingController struct {
queue workqueue.RateLimitingInterface
}
func (c *rateLimitingController) processNextWorkItem() bool {
// 1. 從隊列獲取項目
key, quit := c.queue.Get()
if quit {
return false
}
defer c.queue.Done(key)
// 2. 處理
err := c.sync(key.(string))
// 3. 錯誤處理
if err != nil {
// 限流重試
c.queue.AddRateLimited(key)
return true
}
// 4. 成功則忘記(重置重試計數)
c.queue.Forget(key)
return true
}
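典型控制器會把 Informer 事件轉成「namespace/name」形式的 key 放入這種限流隊列,再由固定數量的 worker 消費,既能削峰合併重複事件,也避免對故障對象無限快速重試。下面用 client-go 的 workqueue 包寫一個極簡的生產/消費示意(key 內容純屬假設):
package main

import (
	"fmt"
	"time"

	"k8s.io/client-go/util/workqueue"
)

func main() {
	// 默認限流器:單項指數退避 + 整體令牌桶
	queue := workqueue.NewRateLimitingQueue(workqueue.DefaultControllerRateLimiter())
	defer queue.ShutDown()

	// 生產者:真實場景由 Informer 的事件回調調用 queue.Add(key)
	go func() {
		for i := 0; i < 3; i++ {
			queue.Add(fmt.Sprintf("default/pod-%d", i))
		}
	}()

	// 消費者 worker
	go func() {
		for {
			key, shutdown := queue.Get()
			if shutdown {
				return
			}
			fmt.Println("sync", key)
			// 成功:Forget 重置退避計數;失敗則應改調用 AddRateLimited(key)
			queue.Forget(key)
			queue.Done(key)
		}
	}()

	time.Sleep(time.Second)
}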
總結
Kubernetes 是一個極其複雜但設計精良的分散式系統。其核心設計理念包括:
- 聲明式 API:使用者描述期望狀態,系統自動調和
- 控制迴路:Watch + Reconcile 的事件驅動架構
- 鬆耦合:組件通過 API Server 通信,可獨立擴展
- 可擴展性:CRD、Operator、CNI、CSI 等擴展機制
深入理解 Kubernetes 的內部機制,對以下工作都至關重要:
- 排查問題(為什麼 Pod 沒被調度?為什麼 Service 不通?)
- 性能優化(調度器配置、網路插件選擇)
- 擴展開發(自定義控制器、Operator)
希望這篇深度解析能幫助你從「會用 Kubernetes」進階到「理解 Kubernetes」。