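Production environment issue - 20250606

Background: this morning a colleague pointed out that the scheduled job from the early hours of today had not finished running.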

1. Troubleshooting

Looking at the pod that reported the failure shows the following error message:

10m Warning FailedKillPod pod/nwp-pggrid-29152744-mrfwq error killing pod: failed to "KillPodSandbox" for "8497fda4-832b-43f3-becf-f9bb76e5a3d2" with KillPodSandboxError: "rpc error: code = Unknown desc = failed to destroy network for sandbox \"34af0a51469031d27d3cae61399b1b6399345c38a381a43894337535dbd69d4a\": plugin type=\"egress-cni\" name=\"egress-cni\" failed (delete): netplugin failed: \"runtime: failed to create new OS thread (have 6 already; errno=11)\\nruntime: may need to increase max user processes (ulimit -u)\\nfatal error: newosproc\\n\\ngoroutine 1 [running, locked to thread]:\\nruntime.throw({0x55caa09e53e8?, 0xc0001dfd40?})\\n\\t/root/sdk/go1.20.4/src/runtime/panic.go:1047 +0x5f fp=0xc0001dfd18 sp=0xc0001dfce8 pc=0x55caa073151f\\nruntime.newosproc(0xc000100c00)\\n\\t/root/sdk/go1.20.4/src/runtime/os_linux.go:195 +0x177 fp=0xc0001dfd80 sp=0xc0001dfd18 pc=0x55caa072dad7\\nruntime.newm1(0xc000100c00)\\n\\t/root/sdk/go1.20.4/src/runtime/proc.go:2253 +0xcf fp=0xc0001dfdc0 sp=0xc0001dfd80 pc=0x55caa0737e8f\\nruntime.newm(0x55caa0713253?, 0x0, 0x55caa070a427?)\\n\\t/root/sdk/go1.20.4/src/runtime/proc.go:2228 +0x12c fp=0xc0001dfdf0 sp=0xc0001dfdc0 pc=0x55caa0737d4c\\nruntime.startTemplateThread()\\n\\t/root/sdk/go1.20.4/src/runtime/proc.go:2273 +0x90 fp=0xc0001dfe20 sp=0xc0001dfdf0 pc=0x55caa0737f70\\nruntime.LockOSThread()\\n\\t/root/sdk/go1.20.4/src/runtime/proc.go:4550 +0x1d fp=0xc0001dfe40 sp=0xc0001dfe20 pc=0x55caa073d4fd\\nmain.init.0()\\n\\t/go/src/github.com/aws/amazon-vpc-cni-k8s/cmd/egress-cni-plugin/main.go:34 +0x17 fp=0xc0001dfe50 sp=0xc0001dfe40 pc=0x55caa09e14f7\\nruntime.doInit(0x55caa0f1d920)\\n\\t/root/sdk/go1.20.4/src/runtime/proc.go:6506 +0x128 fp=0xc0001dff80 sp=0xc0001dfe50 pc=0x55caa0741428\\nruntime.main()\\n\\t/root/sdk/go1.20.4/src/runtime/proc.go:233 +0x1d3 fp=0xc0001dffe0 sp=0xc0001dff80 pc=0x55caa0733dd3\\nruntime.goexit()\\n\\t/root/sdk/go1.20.4/src/runtime/asm_amd64.s:1598 +0x1 fp=0xc0001dffe8 sp=0xc0001dffe0 
pc=0x55caa0762661\\n\\ngoroutine 2 [force gc (idle)]:\\nruntime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)\\n\\t/root/sdk/go1.20.4/src/runtime/proc.go:381 +0xd6 fp=0xc000084fb0 sp=0xc000084f90 pc=0x55caa0734256\\nruntime.goparkunlock(...)\\n\\t/root/sdk/go1.20.4/src/runtime/proc.go:387\\nruntime.forcegchelper()\\n\\t/root/sdk/go1.20.4/src/runtime/proc.go:305 +0xb0 fp=0xc000084fe0 sp=0xc000084fb0 pc=0x55caa0734090\\nruntime.goexit()\\n\\t/root/sdk/go1.20.4/src/runtime/asm_amd64.s:1598 +0x1 fp=0xc000084fe8 sp=0xc000084fe0 pc=0x55caa0762661\\ncreated by runtime.init.6\\n\\t/root/sdk/go1.20.4/src/runtime/proc.go:293 +0x25\\n\\ngoroutine 3 [GC sweep wait]:\\nruntime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)\\n\\t/root/sdk/go1.20.4/src/runtime/proc.go:381 +0xd6 fp=0xc000085780 sp=0xc000085760 pc=0x55caa0734256\\nruntime.goparkunlock(...)\\n\\t/root/sdk/go1.20.4/src/runtime/proc.go:387\\nruntime.bgsweep(0x0?)\\n\\t/root/sdk/go1.20.4/src/runtime/mgcsweep.go:278 +0x8e fp=0xc0000857c8 sp=0xc000085780 pc=0x55caa072028e\\nruntime.gcenable.func1()\\n\\t/root/sdk/go1.20.4/src/runtime/mgc.go:178 +0x26 fp=0xc0000857e0 sp=0xc0000857c8 pc=0x55caa0715746\\nruntime.goexit()\\n\\t/root/sdk/go1.20.4/src/runtime/asm_amd64.s:1598 +0x1 fp=0xc0000857e8 sp=0xc0000857e0 pc=0x55caa0762661\\ncreated by runtime.gcenable\\n\\t/root/sdk/go1.20.4/src/runtime/mgc.go:178 +0x6b\\n\\ngoroutine 4 [GC scavenge wait]:\\nruntime.gopark(0xc00007e000?, 0x55caa0c31778?, 0x1?, 0x0?, 0x0?)\\n\\t/root/sdk/go1.20.4/src/runtime/proc.go:381 +0xd6 fp=0xc000085f70 sp=0xc000085f50 pc=0x55caa0734256\\nruntime.goparkunlock(...)\\n\\t/root/sdk/go1.20.4/src/runtime/proc.go:387\\nruntime.(*scavengerState).park(0x55caa0f51f40)\\n\\t/root/sdk/go1.20.4/src/runtime/mgcscavenge.go:400 +0x53 fp=0xc000085fa0 sp=0xc000085f70 pc=0x55caa071e1b3\\nruntime.bgscavenge(0x0?)\\n\\t/root/sdk/go1.20.4/src/runtime/mgcscavenge.go:628 +0x45 fp=0xc000085fc8 sp=0xc000085fa0 
pc=0x55caa071e785\\nruntime.gcenable.func2()\\n\\t/root/sdk/go1.20.4/src/runtime/mgc.go:179 +0x26 fp=0xc000085fe0 sp=0xc000085fc8 pc=0x55caa07156e6\\nruntime.goexit()\\n\\t/root/sdk/go1.20.4/src/runtime/asm_amd64.s:1598 +0x1 fp=0xc000085fe8 sp=0xc000085fe0 pc=0x55caa0762661\\ncreated by runtime.gcenable\\n\\t/root/sdk/go1.20.4/src/runtime/mgc.go:179 +0xaa\\n\\ngoroutine 17 [finalizer wait]:\\nruntime.gopark(0x55caa07345d2?, 0x7f4896428a48?, 0x0?, 0x0?, 0xc000084770?)\\n\\t/root/sdk/go1.20.4/src/runtime/proc.go:381 +0xd6 fp=0xc000084628 sp=0xc000084608 pc=0x55caa0734256\\nruntime.runfinq()\\n\\t/root/sdk/go1.20.4/src/runtime/mfinal.go:193 +0x107 fp=0xc0000847e0 sp=0xc000084628 pc=0x55caa0714767\\nruntime.goexit()\\n\\t/root/sdk/go1.20.4/src/runtime/asm_amd64.s:1598 +0x1 fp=0xc0000847e8 sp=0xc0000847e0 pc=0x55caa0762661\\ncreated by runtime.createfing\\n\\t/root/sdk/go1.20.4/src/runtime/mfinal.go:163 +0x45\\n\""

Key part:

failed to create new OS thread (have 6 already; errno=11)\\nruntime: may need to increase max user processes (ulimit -u)

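Pods in the cluster were failing. Investigation showed that the node had reached its limit on the number of processes, so no new OS thread could be created (errno=11 is EAGAIN; on Linux every thread takes an ID from the same space as process IDs).
The Go runtime's own hint: the maximum number of user processes may need to be increased.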
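To see which processes are using up the thread budget on a node, a minimal sketch along the following lines may help. It is not part of the original investigation: the function name top_thread_consumers and its optional second argument (an alternate proc root, useful for testing) are hypothetical. It only reads the Name: and Threads: fields of /proc/<pid>/status.

```shell
#!/bin/sh
# Hypothetical helper: print "<threads> <pid> <name>" for the processes
# holding the most threads, highest first.
# $1 = number of rows to print (default 10)
# $2 = proc root to scan (default /proc; assumption, handy for dry runs)
top_thread_consumers() {
    limit="${1:-10}"
    proc_root="${2:-/proc}"
    for status in "$proc_root"/[0-9]*/status; do
        # Derive the PID from the path, e.g. /proc/123/status -> 123.
        pid="${status#"$proc_root"/}"
        pid="${pid%/status}"
        # A process may exit between the glob and the read; ignore errors.
        awk -v pid="$pid" \
            '/^Name:/ {name=$2} /^Threads:/ {print $2, pid, name}' \
            "$status" 2>/dev/null
    done | sort -rn | head -n "$limit"
}

top_thread_consumers 10
```

A process at the top of this list with thousands of threads is a likely candidate for a thread leak and is worth checking before raising any limit.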

Check the system-wide maximum PID value:

# cat /proc/sys/kernel/pid_max 
32768
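The limit by itself does not show how close the node is to it. As a rough sketch (the function names are hypothetical and not from the original note), the total number of tasks can be read from the fourth field of /proc/loadavg, which has the form "runnable/total" and counts processes and threads together, and then compared with pid_max:

```shell
#!/bin/sh
# total_tasks [FILE] -> total number of scheduling entities (processes and
# threads), parsed from a loadavg-style file (default /proc/loadavg).
total_tasks() {
    awk '{split($4, a, "/"); print a[2]}' "${1:-/proc/loadavg}"
}

# usage_pct TASKS PID_MAX -> integer percentage of the PID space in use.
usage_pct() {
    echo $(( $1 * 100 / $2 ))
}

# Print a one-line summary for the current node.
report_pid_usage() {
    pid_max=$(cat /proc/sys/kernel/pid_max)
    tasks=$(total_tasks)
    echo "tasks=$tasks pid_max=$pid_max usage=$(usage_pct "$tasks" "$pid_max")%"
}
```

Running report_pid_usage on the affected node prints the task count next to the limit. A usage figure close to 100% would be consistent with the errno=11 failures above. A low figure would suggest looking at other limits instead, such as the per-user limit shown under "Max processes" in /proc/self/limits.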

2. Solution

Raise the PID limit:

echo "kernel.pid_max=65535" >> /etc/sysctl.conf
sysctl -p
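Appending with >> adds a duplicate line every time the command is repeated. One possible way to keep the file clean is an idempotent helper such as the sketch below. The function set_sysctl_conf is hypothetical, and it assumes a sed that supports -i and -E (GNU sed on a typical Linux node). The demo writes to a temporary file rather than /etc/sysctl.conf.

```shell
#!/bin/sh
# set_sysctl_conf KEY VALUE [FILE]
# Hypothetical helper: set KEY=VALUE in FILE (default /etc/sysctl.conf),
# replacing an existing entry instead of appending a second one.
set_sysctl_conf() {
    key="$1"
    value="$2"
    file="${3:-/etc/sysctl.conf}"
    touch "$file"
    if grep -Eq "^[[:space:]]*${key}[[:space:]]*=" "$file"; then
        # Key already present: rewrite that line in place.
        sed -i -E "s|^[[:space:]]*${key}[[:space:]]*=.*|${key}=${value}|" "$file"
    else
        echo "${key}=${value}" >> "$file"
    fi
}

# Demo against a scratch file: calling twice still leaves a single entry.
demo_conf=$(mktemp)
set_sysctl_conf kernel.pid_max 65535 "$demo_conf"
set_sysctl_conf kernel.pid_max 65535 "$demo_conf"
cat "$demo_conf"
rm -f "$demo_conf"
```

On a real node this would be run as root without the third argument, followed by sysctl -p, and the result can be confirmed with cat /proc/sys/kernel/pid_max.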

This value can in fact be raised further: on a 64-bit system the maximum pid_max is 4194304. Tune it according to workload and operational experience.
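The 64-bit ceiling of 4194304 is 4 * 1024 * 1024, i.e. 2^22. As a small sketch (the names PID_MAX_LIMIT and clamp_pid_max are chosen here for illustration), a candidate value can be capped at that ceiling before it is written to the configuration:

```shell
#!/bin/sh
# Ceiling for kernel.pid_max on 64-bit systems: 4 * 1024 * 1024 = 4194304.
PID_MAX_LIMIT=$((4 * 1024 * 1024))

# clamp_pid_max DESIRED -> prints DESIRED, capped at PID_MAX_LIMIT.
clamp_pid_max() {
    if [ "$1" -gt "$PID_MAX_LIMIT" ]; then
        echo "$PID_MAX_LIMIT"
    else
        echo "$1"
    fi
}

clamp_pid_max 65535      # prints 65535
clamp_pid_max 10000000   # prints 4194304
```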