Fixing kubeadm init failures on k8s 1.24


Kubernetes 1.24 was released recently, but deployments inside China started failing at the kubeadm init step.
A search showed that quite a few people had hit the same problem, mostly in China; English-language results on Google were comparatively scarce.
Log analysis eventually traced it back, once again, to pulling images from gcr.

This post records the analysis and the fix.

The conclusion up front

The official Chinese translation of the k8s docs has not caught up with the English version: in 1.24 you additionally need to configure where the sandbox (pause) image is downloaded from.
Otherwise the domestic mirror configured the old way does not take effect, the pause image cannot be pulled, and none of the k8s core services can start.

The relevant instructions from the upstream docs follow.
If you are using containerd, configure it like this:

Overriding the sandbox (pause) image 
In your containerd config you can overwrite the sandbox image by setting the following config:

[plugins."io.containerd.grpc.v1.cri"]
  sandbox_image = "k8s.gcr.io/pause:3.2"
You might need to restart containerd as well once you've updated the config file: systemctl restart containerd
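As a concrete sketch, the edit can be scripted. The Aliyun mirror (registry.aliyuncs.com/google_containers) and the pause:3.7 tag are assumptions; use whatever mirror and tag match your cluster. The demo below works on a scratch copy of the config; on a real node the file is /etc/containerd/config.toml and containerd must be restarted afterwards:

```shell
# Work on a scratch copy; on a real node edit /etc/containerd/config.toml.
conf=$(mktemp)
# Use the runtime's default config if containerd is installed,
# otherwise fall back to a minimal stub for demonstration.
containerd config default > "$conf" 2>/dev/null || cat > "$conf" <<'EOF'
[plugins."io.containerd.grpc.v1.cri"]
  sandbox_image = "k8s.gcr.io/pause:3.6"
EOF
# Point sandbox_image at a reachable mirror (Aliyun here, an assumption).
sed -i 's#sandbox_image = ".*"#sandbox_image = "registry.aliyuncs.com/google_containers/pause:3.7"#' "$conf"
grep sandbox_image "$conf"
# On the real node: systemctl restart containerd
```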

If you are using cri-o, configure it like this:

Overriding the sandbox (pause) image
In your CRI-O config you can set the following config value:

[crio.image]
pause_image="registry.k8s.io/pause:3.6"
This config option supports live configuration reload to apply this change: systemctl reload crio or by sending SIGHUP to the crio process.
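CRI-O also reads drop-in files from /etc/crio/crio.conf.d/, which keeps the override separate from the main config. A sketch follows; the mirror address and pause tag are assumptions, and the demo writes to a temp directory instead of the real drop-in path:

```shell
# Demo in a temp dir; on a real node use /etc/crio/crio.conf.d/.
d=$(mktemp -d)
cat > "$d/01-pause-image.conf" <<'EOF'
[crio.image]
pause_image = "registry.aliyuncs.com/google_containers/pause:3.7"
EOF
cat "$d/01-pause-image.conf"
# On the real node: systemctl reload crio (or send SIGHUP to the crio process)
```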

If you are using cri-dockerd, configure it like this:

Overriding the sandbox (pause) image 
The cri-dockerd adapter accepts a command line argument for specifying which container image to use as the Pod infrastructure container (“pause image”). The command line argument to use is --pod-infra-container-image
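One way to pass that flag persistently is a systemd drop-in. The unit name cri-docker.service and the ExecStart line below follow the upstream packaging but are assumptions; check your own install. The demo writes to a temp directory rather than /etc/systemd/system/cri-docker.service.d/:

```shell
# Demo in a temp dir; on a real node use
# /etc/systemd/system/cri-docker.service.d/.
d=$(mktemp -d)
cat > "$d/10-pause-image.conf" <<'EOF'
[Service]
ExecStart=
ExecStart=/usr/bin/cri-dockerd --container-runtime-endpoint fd:// \
  --pod-infra-container-image=registry.aliyuncs.com/google_containers/pause:3.7
EOF
cat "$d/10-pause-image.conf"
# On the real node: systemctl daemon-reload && systemctl restart cri-docker
```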

Symptoms

Following the official deployment procedure, a k8s cluster is set up with kubeadm, but the init step below fails:

kubeadm init \
  --kubernetes-version 1.24.2 \
  --apiserver-advertise-address=**** \
  --image-repository registry.aliyuncs.com/google_containers 

The command already uses Aliyun's mirror of gcr, to work around k8s.gcr.io being blocked in China and the core k8s service images being unpullable as a result.

But the command times out during initialization and returns the following:

[wait-control-plane] Waiting for the kubelet to boot up the control plane as static Pods from directory "/etc/kubernetes/manifests". This can take up to 4m0s
[kubelet-check] Initial timeout of 40s passed.

Unfortunately, an error has occurred:
        timed out waiting for the condition

This error is likely caused by:
        - The kubelet is not running
        - The kubelet is unhealthy due to a misconfiguration of the node in some way (required cgroups disabled)

If you are on a systemd-powered system, you can try to troubleshoot the error with the following commands:
        - 'systemctl status kubelet'
        - 'journalctl -xeu kubelet'

Additionally, a control plane component may have crashed or exited when started by the container runtime.
To troubleshoot, list all containers using your preferred container runtimes CLI.
Here is one example how you may list all running Kubernetes containers by using crictl:
        - 'crictl --runtime-endpoint unix:///var/run/crio/crio.sock ps -a | grep kube | grep -v pause'
        Once you have found the failing container, you can inspect its logs with:
        - 'crictl --runtime-endpoint unix:///var/run/crio/crio.sock logs CONTAINERID'

Running journalctl -xeu kubelet then reveals these errors:

"RunPodSandbox from runtime service failed" err="rpc error: code = Unknown desc = failed to get sandbox image \"k8s.gcr.io/pause:3.6
"Failed to create sandbox for pod" err="rpc error: code = Unknown desc = failed to get sandbox image \"k8s.gcr.io/pause:3.6\":
"CreatePodSandbox for pod failed" err="rpc error: code = Unknown desc = failed to get sandbox image \"k8s.gcr.io/pause:3.6\":

as well as a large number of repeated errors (k8s-141 is the hostname of the master node):

"Error getting node" err="node \"k8s-141\" not found"
"Error getting node" err="node \"k8s-141\" not found"
"Error getting node" err="node \"k8s-141\" not found"
"Error getting node" err="node \"k8s-141\" not found"
"Error getting node" err="node \"k8s-141\" not found"
"Error getting node" err="node \"k8s-141\" not found"

Root cause

The logs above show that the pods for the k8s core services fail to be created because fetching the pause image fails: it is always pulled from k8s.gcr.io.
Note that cri-o had already been correctly configured with a domestic mirror before init was run.

Downgrading from 1.24 to 1.23 and repeating the exact same procedure creates the cluster successfully.

It turns out that k8s 1.24 added configuration support for the CRI sandbox (pause) image.
The repository set via kubeadm init --image-repository is no longer passed down to the CRI runtime for pulling the pause image.
Instead it must be set in the CRI runtime's own config file, as described at the beginning of this post.
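Before re-running kubeadm init, it helps to confirm the runtime can now fetch the pause image directly. crictl and the Aliyun mirror are assumptions here; the snippet only prints the command when crictl is absent:

```shell
img=registry.aliyuncs.com/google_containers/pause:3.7
if command -v crictl >/dev/null 2>&1; then
  crictl pull "$img"   # should succeed without contacting k8s.gcr.io
else
  echo "crictl not found; on the node run: crictl pull $img"
fi
```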


Author: 2356
Copyright: Unless otherwise noted, all articles on this blog are licensed under CC BY 4.0. Please credit 2356 as the source when reposting!