Edited 3 years ago

Kafka disconnects from ZooKeeper abnormally

kafka

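The cluster has 3 ZooKeeper nodes and 4 Kafka brokers. About 7 days after Kafka was started, warnings like the following appeared in the Kafka log: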

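WARN Client session timed out, have not heard from server in 72318ms for sessionid 0x27a5c2e2cfb0001 (org.apache.zookeeper.ClientCnxn)

Kafka then reconnected to ZooKeeper on its own and the connection was restored. About 10 hours later, however, the session between Kafka and ZooKeeper started dropping and reconnecting more and more frequently, until the session expired completely: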
[2021-07-07 14:12:12,805] WARN Unable to reconnect to ZooKeeper service, session 0x37a5c2eeb6a0008 has expired (org.apache.zookeeper.ClientCnxn)
[2021-07-07 14:12:12,805] INFO zookeeper state changed (Expired) (org.I0Itec.zkclient.ZkClient)
[2021-07-07 14:12:12,805] INFO Unable to reconnect to ZooKeeper service, session 0x37a5c2eeb6a0008 has expired, closing socket connection (org.apache.zookeeper.ClientCnxn)
......

The Kafka process is still running, but this broker is no longer listed under /brokers/ids.
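
To see how the silence intervals develop over time, the timeout warnings can be pulled out of server.log with a small script. This is a minimal sketch in Python, assuming the warning text has exactly the form quoted above; the function name `session_timeouts` and the sample list are illustrative and not part of Kafka.

```python
import re

# Matches the ZooKeeper client warning quoted in this post.
TIMEOUT_RE = re.compile(
    r"Client session timed out, have not heard from server in (\d+)ms "
    r"for sessionid (0x[0-9a-fA-F]+)"
)


def session_timeouts(lines):
    """Return (session_id, silence_ms) for every timeout warning in lines."""
    found = []
    for line in lines:
        match = TIMEOUT_RE.search(line)
        if match:
            found.append((match.group(2), int(match.group(1))))
    return found


# Sample input built from the log lines shown above.
sample = [
    "WARN Client session timed out, have not heard from server in 72318ms "
    "for sessionid 0x27a5c2e2cfb0001 (org.apache.zookeeper.ClientCnxn)",
    "[2021-07-07 14:12:12,805] INFO zookeeper state changed (Expired) "
    "(org.I0Itec.zkclient.ZkClient)",
]

for session_id, silence_ms in session_timeouts(sample):
    print(session_id, silence_ms)
```

In practice the lines would come from the open server.log file rather than from a hard-coded list.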

Kafka topic information:

[BEGIN] 2021/7/8 15:52:39
[root@kafka2 kafka]# bin/kafka-topics.sh --describe --zookeeper XX.XX.XX.XX:2181,XX.XX.XX.XX:2181,XX.XX.XX.XX:2181
Topic:ED    PartitionCount:4    ReplicationFactor:1    Configs:
    Topic: ED    Partition: 0    Leader: 1    Replicas: 1    Isr: 1
    Topic: ED    Partition: 1    Leader: 3    Replicas: 3    Isr: 3
    Topic: ED    Partition: 2    Leader: 0    Replicas: 0    Isr: 0
    Topic: ED    Partition: 3    Leader: 2    Replicas: 2    Isr: 2
Topic:__consumer_offsets    PartitionCount:200    ReplicationFactor:1    Configs:segment.bytes=104857600,cleanup.policy=compact,compression.type=producer
    Topic: __consumer_offsets    Partition: 0    Leader: 1    Replicas: 1    Isr: 1
    ...
    Topic: __consumer_offsets    Partition: 93    Leader: 2    Replicas: 2    Isr: 2
    Topic: __consumer_offsets    Partition: 94    Leader: 3    Replicas: 3    Isr: 3
    Topic: __consumer_offsets    Partition: 95    Leader: 0    Replicas: 0    Isr: 0
    Topic: __consumer_offsets    Partition: 96    Leader: 2    Replicas: 2    Isr: 2
    Topic: __consumer_offsets    Partition: 97    Leader: 3    Replicas: 3    Isr: 3
    ...
    Topic: __consumer_offsets    Partition: 197    Leader: 0    Replicas: 0    Isr: 0
    Topic: __consumer_offsets    Partition: 198    Leader: 2    Replicas: 2    Isr: 2
    Topic: __consumer_offsets    Partition: 199    Leader: 3    Replicas: 3    Isr: 3

[END] 2021/7/8 15:52:51
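
Since every topic here has ReplicationFactor 1, each partition exists on exactly one broker, so a broker that drops out of ZooKeeper takes its partitions with it. The sketch below, in Python, lists the partitions whose only replica is a given broker. It assumes the `--describe` output format shown above; the function name `partitions_at_risk` and the choice of broker 2 are purely illustrative, since the post does not say which broker id was affected.

```python
import re

# One partition line of `kafka-topics.sh --describe` output.
PARTITION_RE = re.compile(
    r"Topic:\s*(\S+)\s+Partition:\s*(\d+)\s+Leader:\s*(-?\d+)\s+"
    r"Replicas:\s*([\d,]+)\s+Isr:\s*([\d,]*)"
)


def partitions_at_risk(describe_output, broker_id):
    """List (topic, partition) pairs whose only replica is broker_id."""
    at_risk = []
    for line in describe_output.splitlines():
        match = PARTITION_RE.search(line)
        if not match:
            continue  # topic summary lines and "..." carry no partition data
        topic, partition, _leader, replicas, _isr = match.groups()
        replica_ids = [int(r) for r in replicas.split(",")]
        if replica_ids == [broker_id]:
            at_risk.append((topic, int(partition)))
    return at_risk


# Sample input copied from the describe output above.
sample = """\
Topic:ED    PartitionCount:4    ReplicationFactor:1    Configs:
    Topic: ED    Partition: 0    Leader: 1    Replicas: 1    Isr: 1
    Topic: ED    Partition: 1    Leader: 3    Replicas: 3    Isr: 3
    Topic: ED    Partition: 2    Leader: 0    Replicas: 0    Isr: 0
    Topic: ED    Partition: 3    Leader: 2    Replicas: 2    Isr: 2
"""

# Broker id 2 is a hypothetical choice for illustration.
print(partitions_at_risk(sample, broker_id=2))
```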

The Kafka log levels are configured as follows:

log4j.rootLogger=INFO, stdout, kafkaAppender

log4j.appender.stdout=org.apache.log4j.ConsoleAppender
log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern=[%d] %p %m (%c)%n

log4j.appender.kafkaAppender=org.apache.log4j.DailyRollingFileAppender
log4j.appender.kafkaAppender.DatePattern='.'yyyy-MM-dd-HH
log4j.appender.kafkaAppender.File=${kafka.logs.dir}/server.log
log4j.appender.kafkaAppender.layout=org.apache.log4j.PatternLayout
log4j.appender.kafkaAppender.layout.ConversionPattern=[%d] %p %m (%c)%n

log4j.appender.stateChangeAppender=org.apache.log4j.DailyRollingFileAppender
log4j.appender.stateChangeAppender.DatePattern='.'yyyy-MM-dd-HH
log4j.appender.stateChangeAppender.File=${kafka.logs.dir}/state-change.log
log4j.appender.stateChangeAppender.layout=org.apache.log4j.PatternLayout
log4j.appender.stateChangeAppender.layout.ConversionPattern=[%d] %p %m (%c)%n

log4j.appender.requestAppender=org.apache.log4j.DailyRollingFileAppender
log4j.appender.requestAppender.DatePattern='.'yyyy-MM-dd-HH
log4j.appender.requestAppender.File=${kafka.logs.dir}/kafka-request.log
log4j.appender.requestAppender.layout=org.apache.log4j.PatternLayout
log4j.appender.requestAppender.layout.ConversionPattern=[%d] %p %m (%c)%n

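There are no ERROR-level messages.

Of the 4 Kafka brokers, only this one reports the reconnection problem. Its log also contains a large number of other warnings, for example: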

WARN Attempting to send response via channel for which there is no open connection, connection id XXXXXXXX:9092-XXXXXXXX:47415-114379938 (kafka.network.Processor)

WARN Received a PartitionLeaderEpoch assignment for an epoch < latestEpoch. This implies messages have arrived out of order. New: {epoch:0, offset:116331733}, Current: {epoch:24994, offset114475406} for Partition: XXXXXX-3 (kafka.server.epoch.LeaderEpochFileCache)

All topics in this environment have only one replica.

Today Kafka stopped serving again. Checking the GC log showed:

2021-07-15T09:49:41.106+0800: 571718.919: [Full GC (Allocation Failure)  31G->29G(32G), 67.2720092 secs]
   [Eden: 0.0B(1632.0M)->0.0B(1632.0M) Survivors: 0.0B->0.0B Heap: 32.0G(32.0G)->29.9G(32.0G)], [Metaspace: 31574K->31440K(32768K)]
 [Times: user=124.42 sys=0.00, real=67.27 secs]
2021-07-15T09:50:48.380+0800: 571786.192: [GC concurrent-mark-reset-for-overflow]
2021-07-15T09:50:48.380+0800: 571786.192: [GC concurrent-mark-abort]

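This shows that Kafka had run out of heap memory: the 32G heap was full and the 67-second Full GC freed only about 2G. What caused the heap to fill up is still being investigated.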
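
One possible reading, which the post itself does not confirm, is that the Full GC pauses and the ZooKeeper session problems are connected: during a stop-the-world pause the broker cannot send heartbeats, so a pause longer than the session timeout can let the session expire. The Python sketch below extracts Full GC pause times from lines in the format shown above and flags those longer than a given timeout. The names `full_gc_pauses` and `pauses_exceeding` are illustrative, and the 6000 ms value is only an assumed example, since the configured `zookeeper.session.timeout.ms` is not given in the post.

```python
import re

# Matches a Full GC line such as the one in the GC log above.
FULL_GC_RE = re.compile(r"\[Full GC \([^)]*\)\s+\S+, ([\d.]+) secs\]")


def full_gc_pauses(lines):
    """Return the pause length in seconds of every Full GC line."""
    pauses = []
    for line in lines:
        match = FULL_GC_RE.search(line)
        if match:
            pauses.append(float(match.group(1)))
    return pauses


def pauses_exceeding(lines, session_timeout_ms):
    """Return the Full GC pauses (seconds) longer than session_timeout_ms."""
    return [p for p in full_gc_pauses(lines) if p * 1000 > session_timeout_ms]


# Sample input copied from the GC log above.
sample = [
    "2021-07-15T09:49:41.106+0800: 571718.919: "
    "[Full GC (Allocation Failure)  31G->29G(32G), 67.2720092 secs]",
    "2021-07-15T09:50:48.380+0800: 571786.192: [GC concurrent-mark-abort]",
]

# 6000 ms is an assumed timeout for illustration, not a value from the post.
print(pauses_exceeding(sample, session_timeout_ms=6000))
```

If the flagged pauses line up in time with the "Client session timed out" warnings in server.log, that would support the connection; if they do not, the cause of the disconnects lies elsewhere.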