k8s部署的kafka集群,其中两个kafka节点频繁crashbackoff

▓千年祇园 卐 发表于: 2023-09-15   最后更新时间: 2023-09-18 10:04:06   1,009 游览

在一个由3个节点组成的kafka集群里,有2个kafka节点一直重启失败,下面是其日志

[2023-09-15 06:16:30,240] INFO [Log partition=poc_alarm_queue-2, dir=/bitnami/kafka/data] Recovering unflushed segment 0 (kafka.log.Log)
[2023-09-15 06:16:30,240] INFO [Log partition=poc_alarm_queue-2, dir=/bitnami/kafka/data] Loading producer state till offset 0 with message format version 2 (kafka.log.Log)
[2023-09-15 06:16:31,981] INFO Terminating process due to signal SIGTERM (org.apache.kafka.common.utils.LoggingSignalHandler)
[2023-09-15 06:16:31,983] INFO [KafkaServer id=1] shutting down (kafka.server.KafkaServer)
[2023-09-15 06:16:31,987] ERROR [KafkaServer id=1] Fatal error during KafkaServer shutdown. (kafka.server.KafkaServer)
java.lang.IllegalStateException: Kafka server is still starting up, cannot shut down!
        at kafka.server.KafkaServer.shutdown(KafkaServer.scala:602)
        at kafka.server.KafkaServerStartable.shutdown(KafkaServerStartable.scala:54)
        at kafka.Kafka$.$anonfun$main$3(Kafka.scala:80)
        at kafka.utils.Exit$.$anonfun$addShutdownHook$1(Exit.scala:38)
        at java.base/java.lang.Thread.run(Thread.java:834)
[2023-09-15 06:16:31,990] ERROR Halting Kafka. (kafka.server.KafkaServerStartable)

此外,kafka集群连接zk集群,zk集群看起来是正常的,用zkServer.sh status查看3个节点基本ok, 不过也有其客户端连接超时和失败的日志信息:

2023-09-15 06:13:03,026 [myid:3] - WARN  [NIOWorkerThread-2:NIOServerCnxn@364] - Unexpected exception
EndOfStreamException: Unable to read additional data from client, it probably closed the socket: address = /192.168.4.12:31708, session = 0x3000004f3460053
        at org.apache.zookeeper.server.NIOServerCnxn.handleFailedRead(NIOServerCnxn.java:163)
        at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:326)
        at org.apache.zookeeper.server.NIOServerCnxnFactory$IOWorkRequest.doWork(NIOServerCnxnFactory.java:522)
        at org.apache.zookeeper.server.WorkerService$ScheduledWorkRequest.run(WorkerService.java:154)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
        at java.base/java.lang.Thread.run(Unknown Source)
2023-09-15 06:13:06,011 [myid:3] - INFO  [CommitProcessor:3:LeaderSessionTracker@104] - Committing global session 0x10000043eed0061
2023-09-15 06:13:06,188 [myid:3] - INFO  [CommitProcessor:3:LeaderSessionTracker@104] - Committing global session 0x2000003f1530054
2023-09-15 06:13:08,440 [myid:3] - INFO  [SessionTracker:ZooKeeperServer@610] - Expiring session 0x10000043eed0060, timeout of 18000ms exceeded
2023-09-15 06:13:08,440 [myid:3] - INFO  [RequestThrottler:QuorumZooKeeperServer@159] - Submitting global closeSession request for session 0x10000043eed0060
2023-09-15 06:13:20,441 [myid:3] - INFO  [SessionTracker:ZooKeeperServer@610] - Expiring session 0x3000004f3460053, timeout of 18000ms exceeded
2023-09-15 06:13:20,441 [myid:3] - INFO  [RequestThrottler:QuorumZooKeeperServer@159] - Submitting global closeSession request for session 0x3000004f3460053
2023-09-15 06:13:38,441 [myid:3] - INFO  [SessionTracker:ZooKeeperServer@610] - Expiring session 0x2000003f1530053, timeout of 18000ms exceeded
2023-09-15 06:13:38,441 [myid:3] - INFO  [RequestThrottler:QuorumZooKeeperServer@159] - Submitting global closeSession request for session 0x2000003f1530053
2023-09-15 06:13:50,441 [myid:3] - INFO  [SessionTracker:ZooKeeperServer@610] - Expiring session 0x2000003f1530054, timeout of 18000ms exceeded
2023-09-15 06:13:50,441 [myid:3] - INFO  [RequestThrottler:QuorumZooKeeperServer@159] - Submitting global closeSession request for session 0x2000003f1530054
2023-09-15 06:14:04,748 [myid:3] - INFO  [CommitProcessor:3:LeaderSessionTracker@104] - Committing global session 0x3000004f3460054
2023-09-15 06:14:04,791 [myid:3] - INFO  [RequestThrottler:QuorumZooKeeperServer@159] - Submitting global closeSession request for session 0x3000004f3460054
2023-09-15 06:14:04,903 [myid:3] - INFO  [CommitProcessor:3:LeaderSessionTracker@104] - Committing global session 0x2000003f1530055
2023-09-15 06:14:28,637 [myid:3] - INFO  [CommitProcessor:3:LeaderSessionTracker@104] - Committing global session 0x10000043eed0062
2023-09-15 06:14:28,807 [myid:3] - INFO  [CommitProcessor:3:LeaderSessionTracker@104] - Committing global session 0x3000004f3460055
2023-09-15 06:14:48,441 [myid:3] - INFO  [SessionTracker:ZooKeeperServer@610] - Expiring session 0x2000003f1530055, timeout of 18000ms exceeded
2023-09-15 06:14:48,441 [myid:3] - INFO  [RequestThrottler:QuorumZooKeeperServer@159] - Submitting global closeSession request for session 0x2000003f1530055
2023-09-15 06:14:53,043 [myid:3] - WARN  [NIOWorkerThread-1:NIOServerCnxn@364] - Unexpected exception
EndOfStreamException: Unable to read additional data from client, it probably closed the socket: address = /192.168.4.12:60734, session = 0x3000004f3460055
        at org.apache.zookeeper.server.NIOServerCnxn.handleFailedRead(NIOServerCnxn.java:163)
        at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:326)
        at org.apache.zookeeper.server.NIOServerCnxnFactory$IOWorkRequest.doWork(NIOServerCnxnFactory.java:522)
        at org.apache.zookeeper.server.WorkerService$ScheduledWorkRequest.run(WorkerService.java:154)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
        at java.base/java.lang.Thread.run(Unknown Source)
2023-09-15 06:14:56,129 [myid:3] - INFO  [CommitProcessor:3:LeaderSessionTracker@104] - Committing global session 0x2000003f1530056
2023-09-15 06:14:56,306 [myid:3] - INFO  [CommitProcessor:3:LeaderSessionTracker@104] - Committing global session 0x2000003f1530057
2023-09-15 06:15:06,441 [myid:3] - INFO  [SessionTracker:ZooKeeperServer@610] - Expiring session 0x3000004f3460055, timeout of 18000ms exceeded
2023-09-15 06:15:06,441 [myid:3] - INFO  [RequestThrottler:QuorumZooKeeperServer@159] - Submitting global closeSession request for session 0x3000004f3460055
2023-09-15 06:15:40,441 [myid:3] - INFO  [SessionTracker:ZooKeeperServer@610] - Expiring session 0x2000003f1530057, timeout of 18000ms exceeded
2023-09-15 06:15:40,441 [myid:3] - INFO  [RequestThrottler:QuorumZooKeeperServer@159] - Submitting global closeSession request for session 0x2000003f1530057

请问这种情况一般如何定位解决?

发表于 2023-09-15
添加评论

收到了停机信号,kafka还在初始化中,导致关闭失败。

排查思路:

基于k8s安装的kafka需要注意容器健康探测验证,否则一直被k8s认为不健康而重启。

  • initialDelaySeconds: 120
    告诉 kubelet 在第一次执行 probe 之前要的等待120秒钟

参考:kubernetes配置Liveness和Readiness探针

谢谢回复~。目前我的配置如下:

livenessProbe:
  failureThreshold: 2
  initialDelaySeconds: 10
  periodSeconds: 10
  successThreshold: 1
  tcpSocket:
    port: 9092
  timeoutSeconds: 5
readinessProbe:
  failureThreshold: 6
  initialDelaySeconds: 5
  periodSeconds: 10
  successThreshold: 1
  tcpSocket:
    port: 9092
  timeoutSeconds: 5

这里只需要调整initialDelaySeconds为120秒么?调整为这个值的一般会参考什么指标?谢谢
此外,上面的配置还有没有其它问题呢

就是kafka启动,你5秒后开始探测是否健康,太快了,kafka肯定还没有启来。

没有标准的指标,这个指标就有点复杂,越大给它充足的启动时间最好。
比如你设置了2核cpu给kafka程序,那么启动时间是60秒完成,如果是4核就30秒就ok了。

嗯嗯,了解了。您说的是readinessProbe中的initialDelaySeconds应该改为120吧,存活探针中的initialDelaySeconds是否也要相应改到120?

我试试,多谢

我将上述initalDelaySeconds改为120后,发现pod至少要等120秒才能就绪,这个时间有点久。
是不是可以这样理解:如果:

  • failureThreshold: 6
  • initialDelaySeconds: 120

pod将在连连续6次失败总共等待720秒后pod状态变为crashloopbackoff。
如果将initialDelaySeconds设置为10,则总共等待1分钟pod状态就变为crashloopbackoff呢?

你的答案

查看kafka相关的其他问题或提一个您自己的问题