Kafka服务隔几天挂一次 尝试解决记录贴

李呲花 发表于: 2024-12-19   最后更新时间: 2024-12-30 11:11:54   171 游览

现象

Kafka服务隔几天就卡死一次 kill-9之后 重启能恢复正常 过几天又不行了

版本kafka3.0.0,用的自带的zookeeper,数据量不小,每个主题保存了5G,大概最多几个小时的数据

用了自带的sasl-plaintext鉴权,单服务器单节点

配置

broker.id=0
listeners=LOCAL://xxx:9092,JTKG://xxx:9093,GDKJ://xxx:9094  # 跨网所以映射了多个,这里的xxx代表同一个ip
advertised.listeners=LOCAL://xxx:9092,JTKG://yyy:9093,GDKJ://zzz:29092  # xxx,yyy,zzz是不同ip
listener.security.protocol.map=LOCAL:SASL_PLAINTEXT,JTKG:SASL_PLAINTEXT,GDKJ:SASL_PLAINTEXT
inter.broker.listener.name=LOCAL
sasl.enabled.mechanisms=SCRAM-SHA-256
sasl.mechanism.inter.broker.protocol=SCRAM-SHA-256
authorizer.class.name=kafka.security.authorizer.AclAuthorizer
super.users=User:admin
num.network.threads=3
num.io.threads=8
socket.send.buffer.bytes=1024000
socket.receive.buffer.bytes=1024000
socket.request.max.bytes=104857600
log.dirs=/home/kafka/kafka_2.12-3.0.0/kafka-logs
num.partitions=1
num.recovery.threads.per.data.dir=1
offsets.topic.replication.factor=1
transaction.state.log.replication.factor=1
transaction.state.log.min.isr=1
log.retention.hours=168
log.retention.bytes=5368709120
log.segment.bytes=1073741824
log.retention.check.interval.ms=300000
zookeeper.connect=localhost:2181
zookeeper.connection.timeout.ms=60000
group.initial.rebalance.delay.ms=0

最近一次日志

[2024-12-19 13:14:25,486] DEBUG [Controller id=0] Resigning (kafka.controller.KafkaController)
[2024-12-19 13:14:25,940] DEBUG [Controller id=0] Unregister BrokerModifications handler for Set() (kafka.controller.KafkaController)
[2024-12-19 13:14:26,058] INFO [PartitionStateMachine controllerId=0] Stopped partition state machine (kafka.controller.ZkPartitionStateMachine)
[2024-12-19 13:14:26,136] INFO [ReplicaStateMachine controllerId=0] Stopped replica state machine (kafka.controller.ZkReplicaStateMachine)
[2024-12-19 13:14:26,677] INFO [Controller id=0] Resigned (kafka.controller.KafkaController)
[2024-12-19 13:14:26,760] DEBUG [Controller id=0] Resigning (kafka.controller.KafkaController)
[2024-12-19 13:14:26,760] DEBUG [Controller id=0] Unregister BrokerModifications handler for Set() (kafka.controller.KafkaController)
[2024-12-19 13:14:26,760] INFO [PartitionStateMachine controllerId=0] Stopped partition state machine (kafka.controller.ZkPartitionStateMachine)
[2024-12-19 13:14:26,760] INFO [ReplicaStateMachine controllerId=0] Stopped replica state machine (kafka.controller.ZkReplicaStateMachine)
[2024-12-19 13:14:26,760] INFO [Controller id=0] Resigned (kafka.controller.KafkaController)
[2024-12-19 13:14:29,817] INFO [Controller id=0] 0 successfully elected as the controller. Epoch incremented to 60 and epoch zk version is now 60 (kafka.controller.KafkaController)
[2024-12-19 13:14:30,134] INFO [Controller id=0] Registering handlers (kafka.controller.KafkaController)
[2024-12-19 13:14:30,165] INFO [Controller id=0] Deleting log dir event notifications (kafka.controller.KafkaController)
[2024-12-19 13:14:30,207] INFO [Controller id=0] Deleting isr change notifications (kafka.controller.KafkaController)
[2024-12-19 13:14:30,209] INFO [Controller id=0] Initializing controller context (kafka.controller.KafkaController)
[2024-12-19 13:14:30,556] INFO [Controller id=0] Initialized broker epochs cache: Map(0 -> 5125) (kafka.controller.KafkaController)
[2024-12-19 13:14:30,910] DEBUG [Controller id=0] Register BrokerModifications handler for Set(0) (kafka.controller.KafkaController)
[2024-12-19 13:14:31,219] DEBUG [Channel manager on controller 0]: Controller 0 trying to connect to broker 0 (kafka.controller.ControllerChannelManager)
[2024-12-19 13:14:31,281] INFO [RequestSendThread controllerId=0] Starting (kafka.controller.RequestSendThread)
[2024-12-19 13:14:31,282] INFO [Controller id=0] Currently active brokers in the cluster: Set(0) (kafka.controller.KafkaController)
[2024-12-19 13:14:31,283] INFO [Controller id=0] Currently shutting brokers in the cluster: Set() (kafka.controller.KafkaController)
[2024-12-19 13:14:31,283] INFO [Controller id=0] Current list of topics in the cluster: Set(xxx) (kafka.controller.KafkaController)
[2024-12-19 13:14:31,284] INFO [Controller id=0] Fetching topic deletions in progress (kafka.controller.KafkaController)
[2024-12-19 13:14:31,364] INFO [Controller id=0] List of topics to be deleted:  (kafka.controller.KafkaController)
[2024-12-19 13:14:31,364] INFO [Controller id=0] List of topics ineligible for deletion:  (kafka.controller.KafkaController)
[2024-12-19 13:14:31,364] INFO [Controller id=0] Initializing topic deletion manager (kafka.controller.KafkaController)
[2024-12-19 13:14:31,384] INFO [Topic Deletion Manager 0] Initializing manager with initial deletions: Set(), initial ineligible deletions: Set() (kafka.controller.TopicDeletionManager)
[2024-12-19 13:14:31,399] INFO [Controller id=0] Sending update metadata request (kafka.controller.KafkaController)
[2024-12-19 13:14:31,510] INFO [ReplicaStateMachine controllerId=0] Initializing replica state (kafka.controller.ZkReplicaStateMachine)
[2024-12-19 13:14:31,519] INFO [ReplicaStateMachine controllerId=0] Triggering online replica state changes (kafka.controller.ZkReplicaStateMachine)
[2024-12-19 13:14:31,558] INFO [RequestSendThread controllerId=0] Controller 0 connected to x'x'xxxx:9092 (id: 0 rack: null) for sending state change requests (kafka.controller.RequestSendThread)
[2024-12-19 13:14:31,673] INFO [ReplicaStateMachine controllerId=0] Triggering offline replica state changes (kafka.controller.ZkReplicaStateMachine)
[2024-12-19 13:14:31,674] DEBUG [ReplicaStateMachine controllerId=0] Started replica state machine with initial state -> Map(xxx) (kafka.controller.ZkReplicaStateMachine)
[2024-12-19 13:14:31,675] INFO [PartitionStateMachine controllerId=0] Initializing partition state (kafka.controller.ZkPartitionStateMachine)
[2024-12-19 13:14:31,677] INFO [PartitionStateMachine controllerId=0] Triggering online partition state changes (kafka.controller.ZkPartitionStateMachine)
[2024-12-19 13:14:31,679] DEBUG [PartitionStateMachine controllerId=0] Started partition state machine with initial state -> Map(xxxxxxx) (kafka.controller.ZkPartitionStateMachine)
[2024-12-19 13:14:31,679] INFO [Controller id=0] Ready to serve as the new controller with epoch 60 (kafka.controller.KafkaController)
[2024-12-19 13:14:31,683] INFO [Controller id=0] Partitions undergoing preferred replica election:  (kafka.controller.KafkaController)
[2024-12-19 13:14:31,683] INFO [Controller id=0] Partitions that completed preferred replica election:  (kafka.controller.KafkaController)
[2024-12-19 13:14:31,692] INFO [Controller id=0] Skipping preferred replica election for partitions due to topic deletion:  (kafka.controller.KafkaController)
[2024-12-19 13:14:31,692] INFO [Controller id=0] Resuming preferred replica election for partitions:  (kafka.controller.KafkaController)
[2024-12-19 13:14:31,693] INFO [Controller id=0] Starting replica leader election (PREFERRED) for partitions  triggered by ZkTriggered (kafka.controller.KafkaController)
[2024-12-19 13:14:31,823] INFO [Controller id=0] Starting the controller scheduler (kafka.controller.KafkaController)
[2024-12-19 13:14:31,847] DEBUG [Controller id=0] Resigning (kafka.controller.KafkaController)
[2024-12-19 13:14:31,847] DEBUG [Controller id=0] Unregister BrokerModifications handler for Set(0) (kafka.controller.KafkaController)
[2024-12-19 13:14:31,859] INFO [PartitionStateMachine controllerId=0] Stopped partition state machine (kafka.controller.ZkPartitionStateMachine)
[2024-12-19 13:14:31,859] INFO [ReplicaStateMachine controllerId=0] Stopped replica state machine (kafka.controller.ZkReplicaStateMachine)
[2024-12-19 13:14:32,002] INFO [RequestSendThread controllerId=0] Shutting down (kafka.controller.RequestSendThread)
[2024-12-19 13:14:32,051] WARN [RequestSendThread controllerId=0] Controller 0 epoch 60 fails to send request (xxx) to broker xxx:9092 (id: 0 rack: null). Reconnecting to broker. (kafka.controller.RequestSendThread)
java.io.IOException: Client was shutdown before response was read
    at org.apache.kafka.clients.NetworkClientUtils.sendAndReceive(NetworkClientUtils.java:109)
    at kafka.controller.RequestSendThread.doWork(ControllerChannelManager.scala:252)
    at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:96)
[2024-12-19 13:14:32,070] ERROR [RequestSendThread controllerId=0] Controller 0 fails to send a request to broker xxx:9092 (id: 0 rack: null) (kafka.controller.RequestSendThread)
java.lang.InterruptedException
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1326)
    at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:277)
    at kafka.utils.ShutdownableThread.pause(ShutdownableThread.scala:82)
    at kafka.controller.RequestSendThread.backoff$1(ControllerChannelManager.scala:233)
    at kafka.controller.RequestSendThread.doWork(ControllerChannelManager.scala:261)
    at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:96)
[2024-12-19 13:14:32,071] INFO [RequestSendThread controllerId=0] Stopped (kafka.controller.RequestSendThread)
[2024-12-19 13:14:32,076] INFO [RequestSendThread controllerId=0] Shutdown completed (kafka.controller.RequestSendThread)
[2024-12-19 13:14:32,146] INFO [Controller id=0] Resigned (kafka.controller.KafkaController)
[2024-12-19 13:14:32,316] DEBUG [Controller id=0] Broker 0 has been elected as the controller, so stopping the election process. (kafka.controller.KafkaController)
[2024-12-19 13:14:51,582] INFO [ControllerEventThread controllerId=0] Shutting down (kafka.controller.ControllerEventManager$ControllerEventThread)
[2024-12-19 13:14:51,582] INFO [ControllerEventThread controllerId=0] Stopped (kafka.controller.ControllerEventManager$ControllerEventThread)
[2024-12-19 13:14:51,582] INFO [ControllerEventThread controllerId=0] Shutdown completed (kafka.controller.ControllerEventManager$ControllerEventThread)
[2024-12-19 13:14:51,583] DEBUG [Controller id=0] Resigning (kafka.controller.KafkaController)
[2024-12-19 13:14:51,583] DEBUG [Controller id=0] Unregister BrokerModifications handler for Set() (kafka.controller.KafkaController)
[2024-12-19 13:14:51,583] INFO [PartitionStateMachine controllerId=0] Stopped partition state machine (kafka.controller.ZkPartitionStateMachine)
[2024-12-19 13:14:51,583] INFO [ReplicaStateMachine controllerId=0] Stopped replica state machine (kafka.controller.ZkReplicaStateMachine)
[2024-12-19 13:14:51,583] INFO [Controller id=0] Resigned (kafka.controller.KafkaController)
发表于 2024-12-19
添加评论

这几天主要做了几件事

  1. 发现JVM的内存使用默认的有点小了,并没有有效利用服务器内存资源,故 Kafka 改为 -Xmx16G -Xms1G,zookeeper 改为 -Xmx2G -Xms512M
  2. Kafka 版本由 3.0.0 升级为 3.9.0
  3. 监测发现内存会一直增长,swap 也会一直增长且只增不减,猜测为 log.flush.interval.messageslog.flush.interval.ms 走默认值 null,使得消息一直存在 Page Cache 中。目前前者改为10000,后者改为1000,swap 基本稳定了,但内存还是会慢慢增长,后续还需继续观察

贴个解决方案链接 https://www.hangge.com/blog/cache/detail_3526.html

又监控了几天,swap 空间大概会在5天左右涨到80%,故猜测上次的3并没有太大用。

这次调整为:

  1. 堆内存最大值下调为12g export KAFKA_HEAP_OPTS="-Xms1g -Xmx12g"
  2. 直接内存配置为8g export KAFKA_HEAP_OPTS="$KAFKA_HEAP_OPTS -XX:MaxDirectMemorySize=8g"
  3. 元空间配置为512M export KAFKA_HEAP_OPTS="$KAFKA_HEAP_OPTS -XX:MaxMetaspaceSize=512m"

方案来自于 Github Copilot

半兽人 -> 李呲花 12天前

现在还会挂吗?

李呲花 -> 半兽人 11天前

现在每天都监控着 看着内存和swap快满了就重启下服务 然后俩就都降下来了 但是还是没找到为啥内存会一直加。。。

这个是正常现象么?我想着是如果还是找不到原因就看看起个定时任务,每周定时重启hhhh

大概有30左右主题,30左右消费者,生产的话有三个主题大概1秒3000条,其他的大概一秒100左右

李呲花 -> 半兽人 11天前

会是这个360的问题么?服务器上会有些安全相关的组件,这个我大概率没法子改

1348082 root      20   0  487420   9420   8500 S   1.3   0.0  58:56.67 360cwmonitor -d
1348125 root      20   0  379572  12256  12248 S   0.7   0.0  23:53.12 360cwexam -d
半兽人 -> 李呲花 10天前

内存和Swap满是因为kafka利用了当前物理机可用的内存(Java堆外内存)提高效率的,如果其他进程需要的话,会释放这部分内存,是正常的。

李呲花 -> 半兽人 10天前

好的 我再监控监控 感谢

你的答案

查看kafka相关的其他问题或提一个您自己的问题