1. Data background

100,000+ Pods

1,300+ Nodes

3 clusters (each: 11 master nodes + 17 etcd nodes)

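2. Problems encountered

kube-apiserver: request scheduling and latency problems;
Controllers could not observe the latest changes from the API server in time, so processing latency was high;
The scheduler had high latency and low throughput and could not keep up with day-to-day business demand;
The etcd architecture was poorly designed, and etcd stability and performance could not meet business needs;
After an abnormal restart, service recovery took several minutes;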

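3. The optimization path

3.1 Hardware, network, storage, and architecture

Virtual machine (or physical server) level:

Replace old with new: replace all long-serving machines with current-generation servers, purchasing different resource types to match different workload types.
Adjust VM sizing: add memory, CPU cores, and other resources to VMs to meet high-concurrency load.
Use a high-performance virtualization stack: choose a well-performing hypervisor (such as KVM or Xen) to make full use of the hardware.
Oversell host resources: for example, a host that actually has 48 cores reports 60 cores to the apiserver, which oversells the host's resources.

Hardware level:

Multi-core processors: more cores raise the system's processing capacity so it copes better with high-concurrency load.
Caches: make full use of hardware caches to reduce data access latency.
High-performance network interfaces: use high-performance NICs and switches for faster transfers and lower latency.

Network level:

Load balancing: use load-balancing devices or software to spread requests evenly across servers and raise overall concurrent capacity.
More bandwidth: higher network bandwidth supports more concurrent connections and removes transfer bottlenecks.
Protocol optimization: use lower-latency, higher-throughput protocols, for example gRPC instead of plain HTTP, or QUIC instead of TCP.

Storage level:

High-performance storage devices: use SSD or NVMe devices to improve read/write speed and response time.
Data caching: use a cache (such as Redis or Memcached) to reduce pressure on backend storage.
Database optimization: tune indexes and queries to improve database read/write performance.

Architecture level:

Asynchronous processing: use an asynchronous model, such as message queues or an event-driven architecture, to decouple request handling and raise concurrency.
Distributed architecture: spread load across multiple nodes to improve horizontal scalability.
Horizontal splitting: split the system by function or module, according to load and business needs, to raise concurrent capacity.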

3.2 Kernel level

Raise kernel limits in /etc/sysctl.conf

1) When the file handle limit is reached, you typically see errors such as "Too many open files" or "Socket/File: Can't open so many files":

bash

# file-max is the system-wide limit on the number of open file handles
fs.file-max=1000000

2) Size the ARP cache. When the ARP table maintained by the kernel grows very large, consider tuning:

bash

# Minimum number of entries kept in the ARP cache. Below this number the garbage collector does not run. Default: 128.
net.ipv4.neigh.default.gc_thresh1=1024

# Soft limit on the number of entries in the ARP cache. The garbage collector lets the count exceed this number for 5 seconds before it starts collecting. Default: 512.
net.ipv4.neigh.default.gc_thresh2=4096

# Hard limit on the number of entries in the ARP cache. Once the count exceeds this number, the garbage collector runs immediately. Default: 1024.
net.ipv4.neigh.default.gc_thresh3=8192

3) conntrack: tuning for connection tracking (Connection Tracking):

bash

# Maximum number of tracked connection entries, that is, how many connections netfilter can track in kernel memory at once
net.netfilter.nf_conntrack_max=10485760

# Maximum number of packets queued per network interface when packets arrive faster than the kernel can process them
net.core.netdev_max_backlog=10000

# Timeout, in seconds, for tracked TCP connections in the ESTABLISHED state
net.netfilter.nf_conntrack_tcp_timeout_established=300

# Hash table size (read-only on some kernels; on a 64-bit system with 8 GB of memory the default is 65536, doubling at 16 GB, and so on)
net.netfilter.nf_conntrack_buckets=655360

4) inotify watches file system events (file creation, modification, deletion, and so on) and notifies the application when they occur:

bash

# Default: 128. Upper limit on the number of inotify instances each real user ID can create
fs.inotify.max_user_instances=524288

# Default: 8192. Upper limit on the number of watches a user can create across inotify instances
fs.inotify.max_user_watches=524288

5) Full configuration

bash

# Kubernetes Settings
vm.max_map_count = 262144
kernel.softlockup_panic = 1
kernel.softlockup_all_cpu_backtrace = 1
net.ipv4.ip_local_reserved_ports = 30000-32767

# Increase the number of connections
net.core.somaxconn = 32768

# Maximum Socket Receive Buffer
net.core.rmem_max = 16777216

# Maximum Socket Send Buffer
net.core.wmem_max = 16777216

# Increase the maximum total buffer-space allocatable
net.ipv4.tcp_wmem = 4096 87380 16777216
net.ipv4.tcp_rmem = 4096 87380 16777216

# Increase the number of outstanding syn requests allowed
net.ipv4.tcp_max_syn_backlog = 8096


# For persistent HTTP connections
net.ipv4.tcp_slow_start_after_idle = 0

# Allow to reuse TIME_WAIT sockets for new connections
# when it is safe from protocol viewpoint
net.ipv4.tcp_tw_reuse = 1

# Max number of packets that can be queued on interface input
# If kernel is receiving packets faster than can be processed
# this queue increases
net.core.netdev_max_backlog = 16384

# Increase size of file handles and inode cache
fs.file-max = 2097152

# Max number of inotify instances and watches for a user
# Since dockerd runs as a single user, the default instances value of 128 per user is too low
# e.g. uses of inotify: nginx ingress controller, kubectl logs -f
fs.inotify.max_user_instances = 8192
fs.inotify.max_user_watches = 524288

# Additional sysctl flags that kubelet expects
vm.overcommit_memory = 1
kernel.panic = 10
kernel.panic_on_oops = 1

# Prevent docker from changing iptables: https://github.com/kubernetes/kubernetes/issues/40182
net.ipv4.ip_forward=1

On AWS, two additional parameters need to be set:

bash

# AWS settings
# Issue #23395
net.ipv4.neigh.default.gc_thresh1=0

# Enable IPv6 forwarding for network plugins that don't do it themselves
net.ipv6.conf.all.forwarding=1

Parameter reference

Category	Kernel parameter	Description	Reference
Kubernetes	`vm.max_map_count = 262144`	Limits the number of VMAs (virtual memory areas) a process may own. A larger value helps Elasticsearch, MongoDB, and other heavy mmap users	ES Configuration
Kubernetes	`kernel.softlockup_panic = 1`	Panic when a soft lockup is detected; used to deal with kernel soft-lockup bugs seen with K8S	root cause kernel soft lockups · Issue #37853 · kubernetes/kubernetes (github.com)
Kubernetes	`kernel.softlockup_all_cpu_backtrace = 1`	Dump backtraces of all CPUs when a soft lockup is detected; used to diagnose the same soft-lockup bugs	root cause kernel soft lockups · Issue #37853 · kubernetes/kubernetes (github.com)
Kubernetes	`net.ipv4.ip_local_reserved_ports = 30000-32767`	Reserves the default K8S NodePort range so it is not handed out as ephemeral ports	service-node-port-range and ip_local_port_range collision · Issue #6342 · kubernetes/kops (github.com)
Network	`net.core.somaxconn = 32768`	Upper limit of the socket listen() backlog. The backlog is the socket's listen queue: a request that has not yet been handled or established waits there. Raising it allows more pending connections	Image: We should tweak our sysctls · Issue #261 · kubernetes-retired/kube-deploy (github.com)
Network	`net.core.rmem_max = 16777216`	Maximum socket receive buffer size, in bytes	Image: We should tweak our sysctls · Issue #261 · kubernetes-retired/kube-deploy (github.com)
Network	`net.core.wmem_max = 16777216`	Maximum socket send buffer size, in bytes	Image: We should tweak our sysctls · Issue #261 · kubernetes-retired/kube-deploy (github.com)
Network	`net.ipv4.tcp_wmem = 4096 87380 16777216` `net.ipv4.tcp_rmem = 4096 87380 16777216`	Minimum, default, and maximum TCP send/receive buffer sizes; raises the maximum allocatable buffer space	Image: We should tweak our sysctls · Issue #261 · kubernetes-retired/kube-deploy (github.com)
Network	`net.ipv4.tcp_max_syn_backlog = 8096`	Length of the queue of connections (SYN messages) not yet acknowledged by the client, 1024 by default. Raises the number of outstanding SYN requests allowed	Image: We should tweak our sysctls · Issue #261 · kubernetes-retired/kube-deploy (github.com)
Network	`net.ipv4.tcp_slow_start_after_idle = 0`	Disables TCP slow start after an idle period, which benefits persistent HTTP connections	Image: We should tweak our sysctls · Issue #261 · kubernetes-retired/kube-deploy (github.com)
Network	`net.ipv4.tcp_tw_reuse = 1`	Allows sockets in TIME_WAIT to be reused for new TCP connections when it is safe from the protocol's point of view; 0 disables reuse	Image: We should tweak our sysctls · Issue #261 · kubernetes-retired/kube-deploy (github.com)
Network	`net.core.netdev_max_backlog = 16384`	When the NIC receives packets faster than the kernel can process them, the packets are held in a queue; this parameter is the maximum length of that queue	Image: We should tweak our sysctls · Issue #261 · kubernetes-retired/kube-deploy (github.com)
File system	`fs.file-max = 2097152`	Maximum number of file handles allowed system-wide, that is, how many files Linux can have open. Increases the size of the file handle and inode cache	Image: We should tweak our sysctls · Issue #261 · kubernetes-retired/kube-deploy (github.com)
File system	`fs.inotify.max_user_instances = 8192` `fs.inotify.max_user_watches = 524288`	Maximum number of inotify instances and watches per user. Because dockerd runs as a single user, the default of 128 instances per user is too low. Examples of inotify users: nginx ingress controller, kubectl logs -f	Image: We should tweak our sysctls · Issue #261 · kubernetes-retired/kube-deploy (github.com)
kubelet	`vm.overcommit_memory = 1`	Memory allocation policy. 1 means the kernel always allows allocations (always overcommit), regardless of the current memory state	Image: We should tweak our sysctls · Issue #261 · kubernetes-retired/kube-deploy (github.com)
kubelet	`kernel.panic = 10`	Reboot automatically 10 seconds after a kernel panic	Image: We should tweak our sysctls · Issue #261 · kubernetes-retired/kube-deploy (github.com)
kubelet	`kernel.panic_on_oops = 1`	Call panic() when an oops occurs	Image: We should tweak our sysctls · Issue #261 · kubernetes-retired/kube-deploy (github.com)
Network	`net.ipv4.ip_forward=1`	Enables IP forwarding, and also prevents Docker from changing iptables	Upgrading docker 1.13 on nodes causes outbound container traffic to stop working · Issue #40182 · kubernetes/kubernetes (github.com)
Network	`net.ipv4.neigh.default.gc_thresh1=0`	Fixes the `arp_cache: neighbor table overflow!` error on AWS	arp_cache: neighbor table overflow! · Issue #4533 · kubernetes/kops (github.com)
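
After editing /etc/sysctl.conf, it can help to confirm that the running kernel actually holds the values from the file. The following is a minimal sketch, not part of the original setup: the function names `normalize_sysctl_conf` and `sysctl_drift` and the script name in the usage comment are made up for illustration, and it assumes the standard `sysctl -n` command is available on the node.

```bash
#!/usr/bin/env bash
# Sketch: compare a sysctl.conf-style file with the live kernel values.
# Function names are hypothetical helpers, not an existing tool.

# Strip comments and blank lines, and normalize "key = value" to "key=value".
normalize_sysctl_conf() {
  sed -e 's/#.*$//' \
      -e 's/^[[:space:]]*//' \
      -e 's/[[:space:]]*$//' \
      -e '/^$/d' \
      -e 's/[[:space:]]*=[[:space:]]*/=/' "$1"
}

# Print one line per key whose live value differs from the file.
sysctl_drift() {
  normalize_sysctl_conf "$1" | while IFS='=' read -r key want; do
    # sysctl prints multi-value keys tab-separated; squeeze to single spaces.
    have="$(sysctl -n "$key" 2>/dev/null | tr -s '[:space:]' ' ' | sed 's/ $//')"
    [ "$have" = "$want" ] || echo "$key: want '$want', have '${have:-<unset>}'"
  done
}

# Usage: ./sysctl-drift.sh /etc/sysctl.conf
if [ -n "${1:-}" ]; then
  sysctl_drift "$1"
fi
```

Running `sysctl -p` first loads the file; any line the script still prints afterwards points at a key the kernel rejected or a module that is not loaded yet (for example nf_conntrack).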

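3.3 etcd performance optimization

Architecture level:

1. Build a highly available etcd cluster, and add etcd members automatically as the cluster grows;

Hardware level:

1. etcd uses local SSDs as backend storage;

2. etcd is deployed separately, on machines that are not k8s nodes;

3. etcd snapshots (snap) and the write-ahead log (wal) are stored on separate disks;

1) etcd is very sensitive to disk write latency, so for heavily loaded clusters etcd must use local SSDs or high-performance cloud disks. fio can measure the disk's actual sequential write IOPS.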

bash

# WARNING: writing to a raw device such as /dev/sda1 destroys the data on it; run this only against an unused device or a scratch file.
$ fio -filename=/dev/sda1 -direct=1 -iodepth 1 -thread -rw=write -ioengine=psync -bs=4k -size=60G -numjobs=64 -runtime=10 -group_reporting -name=file

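2) etcd must persist data to its log files on disk, so disk activity from other processes can increase write latency, which leads to etcd request timeouts and temporary leader loss.

Giving the etcd process a higher disk I/O priority lets etcd run stably alongside those processes.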

bash

$ ionice -c2 -n0 -p $(pgrep etcd)

3) The default etcd space quota is 2 GB; once it is exceeded, etcd stops accepting writes. Raise the quota with the --quota-backend-bytes flag; the supported maximum is 8 GB.

bash

--quota-backend-bytes 8589934592
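
The byte value above is simply 8 GiB written out. As a small sketch (the helper name `quota_bytes` is hypothetical), the conversion and the 8 GB ceiling stated above can be expressed as:

```bash
#!/usr/bin/env bash
# quota_bytes GIB -> byte value for --quota-backend-bytes.
# Rejects values outside 1..8 GiB, following the maximum stated in the text.
quota_bytes() {
  local gib=$1
  if [ "$gib" -lt 1 ] || [ "$gib" -gt 8 ]; then
    echo "quota must be between 1 and 8 GiB" >&2
    return 1
  fi
  echo $(( gib * 1024 * 1024 * 1024 ))
}

echo "--quota-backend-bytes=$(quota_bytes 8)"
```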

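4) If the etcd leader serves a large number of concurrent client requests, it may delay processing follower peer requests because of network congestion. Follower nodes may then log send-buffer errors such as: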

bash

dropped MsgProp to 247ae21ff9436b2d since streamMsg's sending buffer is full
dropped MsgAppResp to 247ae21ff9436b2d since streamMsg's sending buffer is full

These errors can be resolved by prioritizing etcd's peer traffic over its client traffic. On Linux, tc can prioritize the peer traffic:

bash

$ tc qdisc add dev eth0 root handle 1: prio bands 3
$ tc filter add dev eth0 parent 1: protocol ip prio 1 u32 match ip sport 2380 0xffff flowid 1:1
$ tc filter add dev eth0 parent 1: protocol ip prio 1 u32 match ip dport 2380 0xffff flowid 1:1
$ tc filter add dev eth0 parent 1: protocol ip prio 2 u32 match ip sport 2379 0xffff flowid 1:1
$ tc filter add dev eth0 parent 1: protocol ip prio 2 u32 match ip dport 2379 0xffff flowid 1:1

5) To improve performance in large clusters, events can be stored in a separate etcd cluster by configuring kube-apiserver flags:

bash

## Add the etcd configuration
vim /etc/kubernetes/manifests/kube-apiserver.yaml
## Add the following: the first flag is the current main etcd cluster, the second routes Event objects to a separate etcd cluster.
## Servers inside one override are separated by semicolons, not commas.

--etcd-servers="http://etcd1:2379,http://etcd2:2379,http://etcd3:2379" \
--etcd-servers-overrides="/events#http://etcd4:2379;http://etcd5:2379;http://etcd6:2379"

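6) The current solution is to build the etcd cluster with etcd operator, a controller that is aware of application state and extends the Kubernetes API to create, manage, and configure application instances automatically.

etcd operator has the following features:

create/destroy: deploys and deletes etcd clusters automatically, with no manual configuration.
resize: scales the etcd cluster out and in dynamically.
backup: supports data backup of the etcd cluster, and cluster restore and rebuild.
upgrade: upgrades the etcd cluster without interrupting service.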

4. kube-apiserver optimization

4.1 Parameter tuning

kube-apiserver

bash

--max-mutating-requests-inflight int The maximum number of mutating requests in flight at a given time. When the server exceeds this, it rejects requests. Zero for no limit. (default 200)

--max-requests-inflight int The maximum number of non-mutating requests in flight at a given time. When the server exceeds this, it rejects requests. Zero for no limit. (default 400)

With 1000 to 3000 nodes, the recommendation is:

bash

--max-requests-inflight=1500
--max-mutating-requests-inflight=500

With more than 3000 nodes, the recommendation is:

bash

--max-requests-inflight=3000
--max-mutating-requests-inflight=1000

When the cluster has a very large number of nodes and pods, the watch cache can be raised somewhat:

bash

# --watch-cache-sizes raises the watch cache size for specific resources (the default size is 100).
# Format: resource#size, comma separated with no spaces, for example:
--watch-cache-sizes=nodes#1000,pods#5000

Configure kube-apiserver memory

Use --target-ram-mb (where the flag is available in your Kubernetes version) to set the kube-apiserver memory target in MB; the following formula gives a reasonable value:
--target-ram-mb=node_nums * 60
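
The sizing rules in this section can be collected into a small helper. This is only a sketch of the rules of thumb stated above; the function names are hypothetical, and for clusters under 1000 nodes, where the text gives no recommendation, it falls back to the default values quoted in the flag help (400 / 200).

```bash
#!/usr/bin/env bash
# Sketch: derive kube-apiserver sizing flags from a node count,
# using only the rules of thumb stated in this section.

# target_ram_mb NODES -> node count * 60 (MB)
target_ram_mb() {
  echo $(( $1 * 60 ))
}

# inflight_flags NODES -> recommended in-flight request limits.
# Below 1000 nodes the section gives no recommendation, so the
# defaults (400 / 200) are printed unchanged.
inflight_flags() {
  local nodes=$1
  if [ "$nodes" -gt 3000 ]; then
    echo "--max-requests-inflight=3000 --max-mutating-requests-inflight=1000"
  elif [ "$nodes" -ge 1000 ]; then
    echo "--max-requests-inflight=1500 --max-mutating-requests-inflight=500"
  else
    echo "--max-requests-inflight=400 --max-mutating-requests-inflight=200"
  fi
}

# Example: a 1300-node cluster, the scale described in section 1.
inflight_flags 1300
echo "--target-ram-mb=$(target_ram_mb 1300)"
```

For 1300 nodes the formula gives 78000 MB, which is worth checking against the memory actually available on the master nodes before applying it.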

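4.2 apiserver load balancing

Option 1: run multiple kube-apiserver instances behind an external LB.

Option 2: set --apiserver-count and --endpoint-reconciler-type so that multiple kube-apiserver instances join the endpoints of the Kubernetes Service, which provides high availability.

4.3 Performance analysis with pprof

pprof is one of Go's most powerful tools; source-level performance analysis requires pprof.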

bash

# Install the required package (Graphviz renders the graph)
$ brew install graphviz
# Start pprof
$ go tool pprof http://localhost:8001/debug/pprof/profile
File: kube-apiserver
Type: cpu
Time: Oct 11, 2019 at 11:39am (CST)
Duration: 30s, Total samples = 620ms ( 2.07%)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) web    # the web command generates an SVG file

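The graph and the interactive interface show CPU time, goroutine blocking, and similar information. The apiserver handles a large number of objects, and serialization consumes a very large share of the time.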

5. kube-controller-manager optimization

5.1 Parameter tuning

Raise --kube-api-qps: the queries-per-second limit toward the API server can be raised to 100; the default is 20;
Raise --kube-api-burst: can be raised to 100 or higher (the example below uses 150); the default is 30;
Disable unneeded controllers: the default is --controllers=*, which starts all default controllers; a controller that is not needed can be disabled with a -name entry;

# - --controllers=*,-<controller-name>

--kube-api-qps=100
--kube-api-burst=150

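5.2 Informer preloading during kube-controller-manager upgrades

To keep the interruption caused by a single controller-manager upgrade as short as possible, two changes were made:

Pre-start the controller informers so the data the controllers need is loaded in advance;
When the active controller-manager is upgraded, it releases the Leader Lease proactively, so the standby takes over immediately;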

5.3 High availability through leader election

kube-controller-manager can achieve high availability through leader election; add the following command-line flags (recent Kubernetes releases use leases as the resource lock, so keep endpoints only on versions that still support it):

--leader-elect=true
--leader-elect-lease-duration=15s
--leader-elect-renew-deadline=10s
--leader-elect-resource-lock=endpoints
--leader-elect-retry-period=2s
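
The three timing flags are not independent. As far as I understand the leader election library used by these components, the lease duration must exceed the renew deadline, and the renew deadline must exceed the retry period multiplied by a jitter factor of 1.2. The sketch below checks that relationship in whole seconds; the function name is hypothetical and the 1.2 factor is an assumption taken from that library rather than from this document.

```bash
#!/usr/bin/env bash
# valid_leader_elect LEASE RENEW RETRY (whole seconds)
# Succeeds when LEASE > RENEW and RENEW > 1.2 * RETRY.
# The comparison is scaled by 10 to stay in integer arithmetic.
valid_leader_elect() {
  local lease=$1 renew=$2 retry=$3
  [ "$lease" -gt "$renew" ] && [ $(( renew * 10 )) -gt $(( retry * 12 )) ]
}

# The values used above: 15s / 10s / 2s.
if valid_leader_elect 15 10 2; then
  echo "leader election timings are consistent"
else
  echo "leader election timings are inconsistent"
fi
```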

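6. kube-scheduler optimization

First, make good use of the scheduler's basic features:

Pod/Node Affinity & Anti-affinity // affinity
Taint & Toleration // taints & tolerations
Eviction & Preemption // eviction & preemption

The core behaviors of priority-based preemptive scheduling are:

Eviction: performed by the kubelet process.

Preemption: performed by the scheduler.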

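Eviction:

When a node runs short of resources (under resource pressure), the kubelet on that node performs eviction. The kubelet weighs pod priority, resource requests, and actual usage to decide which pods to evict. Among pods of the same priority, the pod whose actual usage exceeds its request by the largest multiple is evicted first. Pods in the BestEffort QoS class define no resource requests (CPU/memory request), so their actual usage may be very large;

Preemption:

When a new pod cannot be scheduled because its resource needs cannot be met, the scheduler may (at its discretion) evict some lower-priority pods to make room for it. This is the preemption mechanism;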

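Pod Disruption Budget (PDB):

A PodDisruptionBudget sets either the minimum number of an application's pods that must stay running or the minimum percentage that must stay running. This ensures that voluntary disruptions never take down too many of the application's pods at once, so the business is not interrupted and its SLA is not degraded.

1. minAvailable: the minimum number of available pods, that is, the minimum number of the application's pods that must be running, or the minimum percentage of running pods out of the total.

2. maxUnavailable: the maximum number of unavailable pods, that is, the maximum number of the application's pods that may be unavailable, or the maximum percentage of unavailable pods out of the total.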
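
A PodDisruptionBudget using minAvailable might look like the manifest printed by the sketch below. The helper name `render_pdb`, the object name `web-pdb`, and the label `app: web` are all made up for illustration; the manifest assumes the policy/v1 API, which requires a reasonably recent cluster.

```bash
#!/usr/bin/env bash
# render_pdb NAME APP_LABEL MIN_AVAILABLE
# Prints a policy/v1 PodDisruptionBudget manifest to stdout.
# MIN_AVAILABLE is an integer here; a percentage would be written
# as a quoted string such as "80%".
render_pdb() {
  local name=$1 app=$2 min_available=$3
  cat <<EOF
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: ${name}
spec:
  minAvailable: ${min_available}
  selector:
    matchLabels:
      app: ${app}
EOF
}

# Example: keep at least 2 pods labelled app=web running.
render_pdb web-pdb web 2
```

The output can be piped to `kubectl apply -f -`. Only one of minAvailable and maxUnavailable can be set in a single PodDisruptionBudget.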

6.1 Parameter tuning

Raise --kube-api-qps: can be raised to 100; the default is 50

--kube-api-qps=100
--kube-api-burst=150

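6.2 Scheduler optimization

Extending the scheduler: the scheduler can be extended conveniently through scheduler_extender. GPU scheduling, for example, can be supported with scheduler_extender + device-plugins.
Multiple schedulers: kubernetes also supports running several schedulers in one cluster to schedule different workloads. The scheduler can be selected in a pod's spec.schedulerName, or in a job's .spec.template.spec.schedulerName.
Dynamic scheduling: the default kubernetes scheduler makes a one-time decision when a pod is created and never rebalances pods across the cluster afterwards, so actual resource usage becomes uneven and some hosts turn into hot spots. To cover this gap in the default scheduler, kubernetes incubated a tool called Descheduler; see the official documentation for details.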

6.3 High availability through leader election

--leader-elect=true
--leader-elect-lease-duration=15s
--leader-elect-renew-deadline=10s
--leader-elect-resource-lock=endpoints
--leader-elect-retry-period=2s

7. kubelet optimization

7.1 Parameter tuning

--max-pods: the maximum number of pods the kubelet can run.

--image-pull-progress-deadline: the image pull timeout.

--eviction-hard and --eviction-soft: the hard and soft thresholds of the kubelet's pod eviction policy.

--image-gc-high-threshold and --image-gc-low-threshold: the disk usage thresholds for the kubelet's image garbage collection.

--serialize-image-pulls: pulls images one at a time. The default is true; setting it to false increases concurrency.

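7.2 kubelet status update mechanism

The kubelet periodically reports its status to the apiserver; --node-status-update-frequency sets the reporting interval, 10s by default.
kube-controller-manager checks kubelet status every --node-monitor-period, 5s by default.
After a node has been unreachable for a while, kubernetes marks it NotReady; this duration is set with --node-monitor-grace-period, 40s by default.
For a node that is still starting up, --node-startup-grace-period sets how long it may stay unresponsive before being marked unhealthy, 1m0s by default.
After a node has been unreachable for longer, kubernetes starts deleting the pods on it; this duration is set with --pod-eviction-timeout, 5m0s by default.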

Default parameters:

Parameter	Default
--node-status-update-frequency	10s
--node-monitor-period	5s
--node-monitor-grace-period	40s
--pod-eviction-timeout	5m

Fast update and fast reaction:

Parameter	Value
--node-status-update-frequency	4s
--node-monitor-period	2s
--node-monitor-grace-period	20s
--pod-eviction-timeout	30s

Medium update and average reaction:

Parameter	Value
--node-status-update-frequency	20s
--node-monitor-period	5s
--node-monitor-grace-period	2m
--pod-eviction-timeout	1m

In this scenario the node status is updated every 20s. Before the controller manager considers the node unhealthy, there are (2 × 60 / 20) × 5 = 30 node status update attempts. 1m after the node is marked down, eviction is triggered.

With 1000 nodes, there are 60s / 20s × 1000 = 3000 node status updates per minute.

Low update and slow reaction:

Parameter	Value
--node-status-update-frequency	1m
--node-monitor-period	5s
--node-monitor-grace-period	5m
--pod-eviction-timeout	1m

The kubelet updates node status once every 1m; before the node is considered unhealthy there are 5m / 1m × 5 = 25 update attempts. Once the node is unhealthy, pods start being evicted 1m later.
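
The arithmetic used in the profiles above can be reproduced with two small helpers, which makes it easier to evaluate other combinations before changing the flags. This is a sketch: the function names are hypothetical, durations are passed in whole seconds, and the factor of 5 is the per-update retry count used in the calculations above.

```bash
#!/usr/bin/env bash
# updates_per_minute NODES FREQ_SECONDS
# Node status updates reaching the apiserver per minute.
updates_per_minute() {
  echo $(( $1 * 60 / $2 ))
}

# update_attempts GRACE_SECONDS FREQ_SECONDS [RETRIES]
# Status update attempts that fit into --node-monitor-grace-period,
# with RETRIES attempts per update (5, as in the text above).
update_attempts() {
  local retries=${3:-5}
  echo $(( $1 / $2 * retries ))
}

# Medium profile: 20s update frequency, 2m grace period, 1000 nodes.
echo "updates per minute: $(updates_per_minute 1000 20)"
echo "attempts before unhealthy: $(update_attempts 120 20)"
```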

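7.3 Use the bookmark mechanism

A watch bookmark in Kubernetes (K8s) is a special watch event that tells the client which resourceVersion the server has already delivered, even when none of the objects the client is watching have changed.

In the Kubernetes API, a client opts in by adding allowWatchBookmarks=true to a watch request. The server may then send events of type BOOKMARK whose object carries only the kind and metadata.resourceVersion; the client records that resourceVersion as its latest known position.

The common scenario is resuming a watch after a disconnect or restart. With a recent resourceVersion from a bookmark, the client can restart the watch from that point instead of receiving a "resource version too old" error and re-listing everything, which reduces load on the apiserver in large clusters. Pagination of large lists is a separate mechanism that uses the limit and continue parameters.

bash

GET /api/v1/namespaces/default/pods?watch=1&allowWatchBookmarks=true&resourceVersion=<last seen resourceVersion>

Bookmarks are mostly consumed by client libraries (for example client-go informers) rather than handled by hand, so application code rarely deals with them directly.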

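7.4 Limit eviction

Eviction is not recommended under resource pressure, particularly on nodes with special properties, for the following reasons:

Resource availability: special-purpose nodes may provide unique capabilities, such as high-performance computing, storage devices, or dedicated network connections.
Resource scheduling: a cluster usually has only a few such nodes, and they are treated as limited and valuable resources.
Rescheduling cost: frequent evictions in a high-concurrency cluster cause frequent rescheduling, which includes choosing a new node for each evicted container and migrating its state and data.

7.5 In-place upgrade

Customize the components, or implement it through an operator;
For the resource that represents the application in k8s, when the image in a pod changes, only the pod is updated instead of being rebuilt, and the kubelet restarts the container for the change to take effect.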

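8. kube-proxy optimization

8.1 Use ipvs mode

The differences between IPVS mode and iptables mode are as follows:

Performance and scalability: IPVS is a kernel-based TCP/UDP load balancer with higher performance and better scalability than iptables.
Load-balancing algorithms: IPVS offers several algorithms, such as round robin, weighted round robin, and least connections.
Forwarding modes: IPVS itself supports NAT, direct routing, and tunneling; kube-proxy uses IPVS in NAT mode, so the destination address is still rewritten to the backend Pod's IP address.
Dynamic configuration updates: IPVS supports dynamic updates, so kube-proxy can add, delete, and update load-balancing rules at runtime without regenerating the whole iptables rule set.

8.2 Tuning

--conntrack-tcp-timeout-close-wait: the conntrack timeout for TCP connections in the CLOSE_WAIT state.

--conntrack-max-per-core: the maximum number of tracked connections per CPU core.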
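
To see what a given --conntrack-max-per-core means for a node, the sketch below estimates the resulting conntrack limit. It relies on assumptions that are not stated in this document: that kube-proxy multiplies the per-core value by the number of cores, that it applies a floor set by --conntrack-min, that the defaults are 32768 and 131072, and that a per-core value of 0 leaves the kernel limit untouched. The function name is hypothetical.

```bash
#!/usr/bin/env bash
# conntrack_max CORES [PER_CORE] [MIN]
# Estimated nf_conntrack_max that kube-proxy would set.
# Assumed defaults: PER_CORE=32768, MIN=131072.
conntrack_max() {
  local cores=$1 per_core=${2:-32768} min=${3:-131072}
  if [ "$per_core" -eq 0 ]; then
    # 0 is assumed to mean "leave the kernel value as it is".
    echo "unchanged"
    return 0
  fi
  local scaled=$(( cores * per_core ))
  if [ "$scaled" -lt "$min" ]; then
    echo "$min"
  else
    echo "$scaled"
  fi
}

# Example: a 64-core host with the assumed defaults.
echo "estimated nf_conntrack_max: $(conntrack_max 64)"
```

If the estimate is lower than the net.netfilter.nf_conntrack_max value set through sysctl in section 3.2, it is worth checking which of the two settings ends up in effect on the node.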

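9. Image optimization

A container image averages around 1 to 2 GB. Pulling images frequently can saturate the host's bandwidth and even affect the image registry.

1. Image optimization;

Use lightweight base images based on Alpine Linux, BusyBox, or scratch;
Do not split across two build stages what can be done in one;
Remove unnecessary dependencies and files;
Use minimal operating system components in the image.

2. Image caching;

3. Distribute images over P2P, for example with Dragonfly;

4. Preload base images (images are generally split into three layers):

Layer 1: the base image, that is, the OS;
Layer 2: the environment image, that is, an image with services such as nginx or tomcat;
Layer 3: the business image, that is, the image containing business code.
Base images are rarely updated and can be preloaded on all hosts; environment images can be loaded on a schedule; business images are pulled in real time.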

10. Docker optimization

10.1 daemon.json

bash

# Create the docker configuration directory
mkdir -p /etc/docker

cat > /etc/docker/daemon.json <<EOF
{
  "exec-opts": [
    "native.cgroupdriver=systemd"
  ],
  "max-concurrent-downloads": 10,
  "max-concurrent-uploads": 5,
  "live-restore": true,
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "100m",
    "max-file": "5"
  },
  "storage-driver": "overlay2",
  "storage-opts": [
    "overlay2.override_kernel_check=true"
  ],
  "registry-mirrors": [],
  "data-root": "/data/docker"
}
EOF

10.2 pause

Download the pause image in advance and import it on the nodes

bash

registry.cn-hangzhou.aliyuncs.com/google_containers/pause:3.5
