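The cluster has 3 ZooKeeper nodes and 4 Kafka brokers. After Kafka had been running for about 7 days, warnings like the following appeared in the Kafka log:
WARN Client session timed out, have not heard from server in 72318ms for sessionid 0x27a5c2e2cfb0001 (org.apache.zookeeper.ClientCnxn)
Kafka then reconnected to ZooKeeper on its own and the connection was re-established. Over roughly the next 10 hours, however, the session between Kafka and ZooKeeper was dropped and re-established more and more often, until the session finally expired: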
[2021-07-07 14:12:12,805] WARN Unable to reconnect to ZooKeeper service, session 0x37a5c2eeb6a0008 has expired (org.apache.zookeeper.ClientCnxn)
[2021-07-07 14:12:12,805] INFO zookeeper state changed (Expired) (org.I0Itec.zkclient.ZkClient)
[2021-07-07 14:12:12,805] INFO Unable to reconnect to ZooKeeper service, session 0x37a5c2eeb6a0008 has expired, closing socket connection (org.apache.zookeeper.ClientCnxn)
...
The Kafka process is still running, but this broker is no longer registered under /brokers/ids in ZooKeeper.
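To see how the timeouts develop over time, the log can be scanned for the two ClientCnxn messages quoted above. This is only a rough sketch: the regular expressions are derived from the exact message wording in this post, and the function name `summarize_zk_events` is made up for illustration.

```python
import re

# Patterns derived from the two ClientCnxn messages quoted in this post.
TIMED_OUT = re.compile(
    r"Client session timed out, have not heard from server in (\d+)ms "
    r"for sessionid (0x[0-9a-fA-F]+)"
)
EXPIRED = re.compile(
    r"Unable to reconnect to ZooKeeper service, session (0x[0-9a-fA-F]+) has expired"
)


def summarize_zk_events(lines):
    """Return (silences, expired).

    silences maps a session id to the list of reported silence durations in ms;
    expired is the set of session ids that were reported as expired.
    """
    silences = {}
    expired = set()
    for line in lines:
        match = TIMED_OUT.search(line)
        if match:
            silences.setdefault(match.group(2), []).append(int(match.group(1)))
            continue
        match = EXPIRED.search(line)
        if match:
            expired.add(match.group(1))
    return silences, expired
```

Since any iterable of strings works, an open file handle on server.log can be passed in directly. Silence durations that keep growing from one warning to the next would suggest the broker itself is stalling rather than the network being flaky.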
Kafka topic information:
[BEGIN] 2021/7/8 15:52:39
[root@kafka2 kafka]# bin/kafka-topics.sh --describe --zookeeper XX.XX.XX.XX:2181,XX.XX.XX.XX:2181,XX.XX.XX.XX:2181
Topic:ED PartitionCount:4 ReplicationFactor:1 Configs:
Topic: ED Partition: 0 Leader: 1 Replicas: 1 Isr: 1
Topic: ED Partition: 1 Leader: 3 Replicas: 3 Isr: 3
Topic: ED Partition: 2 Leader: 0 Replicas: 0 Isr: 0
Topic: ED Partition: 3 Leader: 2 Replicas: 2 Isr: 2
Topic:__consumer_offsets PartitionCount:200 ReplicationFactor:1 Configs:segment.bytes=104857600,cleanup.policy=compact,compression.type=producer
Topic: __consumer_offsets Partition: 0 Leader: 1 Replicas: 1 Isr: 1
...
Topic: __consumer_offsets Partition: 93 Leader: 2 Replicas: 2 Isr: 2
Topic: __consumer_offsets Partition: 94 Leader: 3 Replicas: 3 Isr: 3
Topic: __consumer_offsets Partition: 95 Leader: 0 Replicas: 0 Isr: 0
Topic: __consumer_offsets Partition: 96 Leader: 2 Replicas: 2 Isr: 2
Topic: __consumer_offsets Partition: 97 Leader: 3 Replicas: 3 Isr: 3
...
Topic: __consumer_offsets Partition: 197 Leader: 0 Replicas: 0 Isr: 0
Topic: __consumer_offsets Partition: 198 Leader: 2 Replicas: 2 Isr: 2
Topic: __consumer_offsets Partition: 199 Leader: 3 Replicas: 3 Isr: 3
[END] 2021/7/8 15:52:51
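Because every partition here has a single replica, losing one broker makes all of its partitions unavailable. A small sketch that reads the `--describe` output above and lists the affected partitions; the helper names are hypothetical and the parser assumes the line layout shown in this post.

```python
import re
from collections import Counter

# Matches the per-partition lines of `kafka-topics.sh --describe` as shown above.
# The topic header line ("Topic:ED PartitionCount:4 ...") does not match.
PARTITION_LINE = re.compile(
    r"Topic:\s*(\S+)\s+Partition:\s*(\d+)\s+Leader:\s*(-?\d+)"
    r"\s+Replicas:\s*([\d,]+)\s+Isr:\s*([\d,]*)"
)


def parse_partitions(lines):
    """Turn describe output lines into a list of dicts, skipping other lines."""
    rows = []
    for line in lines:
        match = PARTITION_LINE.search(line)
        if match is None:
            continue
        topic, partition, leader, replicas, isr = match.groups()
        rows.append({
            "topic": topic,
            "partition": int(partition),
            "leader": int(leader),
            "replicas": [int(b) for b in replicas.split(",")],
            "isr": [int(b) for b in isr.split(",") if b],
        })
    return rows


def partitions_lost_if_broker_down(rows, broker_id):
    """Partitions whose only replica lives on the given broker."""
    return [(r["topic"], r["partition"]) for r in rows if r["replicas"] == [broker_id]]


def leaders_per_broker(rows):
    """Count how many partitions each broker currently leads."""
    return Counter(r["leader"] for r in rows)
```

With the ED topic above, `partitions_lost_if_broker_down(rows, 2)` would report partition 3 of ED, plus every `__consumer_offsets` partition hosted on broker 2.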
The Kafka log level settings (log4j) are as follows:
log4j.rootLogger=INFO, stdout, kafkaAppender
log4j.appender.stdout=org.apache.log4j.ConsoleAppender
log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern=[%d] %p %m (%c)%n
log4j.appender.kafkaAppender=org.apache.log4j.DailyRollingFileAppender
log4j.appender.kafkaAppender.DatePattern='.'yyyy-MM-dd-HH
log4j.appender.kafkaAppender.File=${kafka.logs.dir}/server.log
log4j.appender.kafkaAppender.layout=org.apache.log4j.PatternLayout
log4j.appender.kafkaAppender.layout.ConversionPattern=[%d] %p %m (%c)%n
log4j.appender.stateChangeAppender=org.apache.log4j.DailyRollingFileAppender
log4j.appender.stateChangeAppender.DatePattern='.'yyyy-MM-dd-HH
log4j.appender.stateChangeAppender.File=${kafka.logs.dir}/state-change.log
log4j.appender.stateChangeAppender.layout=org.apache.log4j.PatternLayout
log4j.appender.stateChangeAppender.layout.ConversionPattern=[%d] %p %m (%c)%n
log4j.appender.requestAppender=org.apache.log4j.DailyRollingFileAppender
log4j.appender.requestAppender.DatePattern='.'yyyy-MM-dd-HH
log4j.appender.requestAppender.File=${kafka.logs.dir}/kafka-request.log
log4j.appender.requestAppender.layout=org.apache.log4j.PatternLayout
log4j.appender.requestAppender.layout.ConversionPattern=[%d] %p %m (%c)%n
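There are no ERROR-level messages.
Of the 4 Kafka brokers, only this one reports the reconnection problem. Its log also contains a large number of other warnings,
for example: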
WARN Attempting to send response via channel for which there is no open connection, connection id XXXXXXXX:9092-XXXXXXXX:47415-114379938 (kafka.network.Processor)
and
WARN Received a PartitionLeaderEpoch assignment for an epoch < latestEpoch. This implies messages have arrived out of order. New: {epoch:0, offset:116331733}, Current: {epoch:24994, offset114475406} for Partition: XXXXXX-3 (kafka.server.epoch.LeaderEpochFileCache)
All topics in this deployment have only one replica.
Today Kafka was found to have stopped serving again. Checking the GC log showed:
2021-07-15T09:49:41.106+0800: 571718.919: [Full GC (Allocation Failure) 31G->29G(32G), 67.2720092 secs]
[Eden: 0.0B(1632.0M)->0.0B(1632.0M) Survivors: 0.0B->0.0B Heap: 32.0G(32.0G)->29.9G(32.0G)], [Metaspace: 31574K->31440K(32768K)]
[Times: user=124.42 sys=0.00, real=67.27 secs]
2021-07-15T09:50:48.380+0800: 571786.192: [GC concurrent-mark-reset-for-overflow]
2021-07-15T09:50:48.380+0800: 571786.192: [GC concurrent-mark-abort]
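So Kafka has run out of heap memory: the Full GC took about 67 seconds and only brought the heap from 32.0G down to 29.9G. What is actually causing the memory exhaustion is still being investigated.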
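A stop-the-world pause of this length would by itself be enough to lose the ZooKeeper session, since the broker cannot send heartbeats while the JVM is paused; the 72318ms silence in the first warning is of the same order. The sketch below pulls the pause out of a Full GC line and compares it with a session timeout. The value 6000 ms is an assumption (as far as I know it was the default for `zookeeper.session.timeout.ms` in Kafka releases of that time; the value actually configured on this broker is not shown in this post), and the function names are made up for illustration.

```python
import re

# Matches a Full GC line in the format shown in the GC log above.
FULL_GC = re.compile(
    r"\[Full GC \(([^)]*)\)\s+(\d+[KMG])->(\d+[KMG])\((\d+[KMG])\),\s+([\d.]+) secs\]"
)

# ASSUMPTION: not taken from this post; replace with the broker's real setting.
ASSUMED_SESSION_TIMEOUT_MS = 6000


def parse_full_gc(line):
    """Return cause, heap sizes and pause length of a Full GC line, or None."""
    match = FULL_GC.search(line)
    if match is None:
        return None
    cause, before, after, capacity, seconds = match.groups()
    return {
        "cause": cause,
        "heap_before": before,
        "heap_after": after,
        "heap_capacity": capacity,
        "pause_ms": float(seconds) * 1000.0,
    }


def pause_outlasts_session(pause_ms, session_timeout_ms=ASSUMED_SESSION_TIMEOUT_MS):
    """True if the JVM was paused for longer than the ZooKeeper session timeout."""
    return pause_ms > session_timeout_ms
```

Run over the whole GC log, this would show whether the Full GC timestamps line up with the session timeouts in server.log, which would tie the reconnect problem to the heap exhaustion rather than to ZooKeeper or the network.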