The test cluster has three brokers:
broker0:9099
broker1:9091
broker2:9092
If I kill 9091 by itself, or 9092 by itself, the cluster rebalances and quickly goes back to working normally.
But if I kill 9099 by itself, consumers can no longer consume (producers keep working fine).
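For reference, a plain consumer like the sketch below is enough to reproduce the symptom. The topic name testTopic is an assumption; the broker host/ports and the group id testGroup match the cluster and request logs shown later in this post, and poll(long) matches the 0.9/0.10-era client those logs come from.

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class ReproConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        // All three brokers are listed, so reaching the cluster itself is not the problem.
        props.put("bootstrap.servers", "192.168.0.108:9099,192.168.0.108:9091,192.168.0.108:9092");
        props.put("group.id", "testGroup");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
        consumer.subscribe(Collections.singletonList("testTopic")); // hypothetical topic name

        while (true) {
            // With broker0 (9099) down, no records ever come back: the consumer keeps
            // blocking and retrying inside ensureCoordinatorReady() / lookupCoordinator().
            ConsumerRecords<String, String> records = consumer.poll(1000);
            records.forEach(r -> System.out.println(r.offset() + ": " + r.value()));
        }
    }
}
```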
I traced the consumer source code and walked through the flow; the consumer turns out to be stuck at the point where it picks its coordinator:
ensureCoordinatorReady()
lookupCoordinator()
Roughly, the flow is this:
Only when the consumer treats 9099 as the coordinator and sends its request to that broker does it get a successful response:
ClientResponse(receivedTimeMs=1515898295592, disconnected=false, request=ClientRequest(expectResponse=true, callback=org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient$RequestFutureCompletionHandler@7bc58891, request=RequestSend(header={api_key=10,api_version=0,correlation_id=30285,client_id=consumer-1}, body={group_id=testGroup}), createdTimeMs=1515898285668, sendTimeMs=1515898295578), **responseBody={error_code=0,coordinator={node_id=0,host=192.168.0.108,port=9099}})**
When the consumer tries 9091 and 9092 in turn as the coordinator and sends the same request to them, both attempts fail with a response like this:
ClientResponse(receivedTimeMs=1515897561105, disconnected=false, request=ClientRequest(expectResponse=true, callback=org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient$RequestFutureCompletionHandler@488f3dd1, request=RequestSend(header={api_key=10,api_version=0,correlation_id=30281,client_id=consumer-1}, body={group_id=testGroup}), createdTimeMs=1515897558800, sendTimeMs=1515897561104),
responseBody=**{error_code=15,coordinator={node_id=-1,host=,port=-1}})**
Why is that? The telling part is this bit of the response:
responseBody=**{error_code=15,coordinator={node_id=-1,host=,port=-1}})**
error_code=15 is Kafka's GROUP_COORDINATOR_NOT_AVAILABLE: the broker that was asked cannot name a live coordinator for the group.
So what makes broker 9099 special? Googling the error, the very first hit explained it:
https://stackoverflow.com/questions/42362911/kafka-high-level-consumer-error-code-15
As it turned out, the all partitions of the __consumer_offsets topic were located on dead nodes (nodes that I turned off and that will never come back). I solved the issue by shutting the cluster down, deleting the __consumer_offsets topic from Zookeeper and then starting the cluster again.
Then I looked at the partition/replica layout of the __consumer_offsets topic in ZooKeeper, and sure enough, all 50 partitions sat on broker0. I couldn't believe it.
> bin/kafka-topics.sh --describe --zookeeper localhost:2182 --topic __consumer_offsets
Topic:__consumer_offsets PartitionCount:50 ReplicationFactor:1 Configs:segment.bytes=104857600,cleanup.policy=compact,compression.type=producer
Topic: __consumer_offsets Partition: 0 Leader: 0 Replicas: 0 Isr: 0
Topic: __consumer_offsets Partition: 1 Leader: 0 Replicas: 0 Isr: 0
Topic: __consumer_offsets Partition: 2 Leader: 0 Replicas: 0 Isr: 0
Topic: __consumer_offsets Partition: 3 Leader: 0 Replicas: 0 Isr: 0
......
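That layout matters because of how the coordinator is chosen: a group's coordinator is the broker leading the __consumer_offsets partition that the group id hashes to. A minimal sketch of that mapping (the 50 and the group id come from the output and logs above; Kafka's own code uses a safe absolute-value helper, Math.abs here is just for illustration):

```java
public class CoordinatorPartition {
    public static void main(String[] args) {
        String groupId = "testGroup";        // group id from the request logs above
        int offsetsTopicPartitions = 50;     // PartitionCount of __consumer_offsets

        // Kafka maps a consumer group to one partition of __consumer_offsets by hashing
        // the group id; the leader of that partition acts as the group's coordinator.
        int partition = Math.abs(groupId.hashCode()) % offsetsTopicPartitions;

        System.out.println("group '" + groupId + "' -> __consumer_offsets partition " + partition);
        // With all 50 partitions led by broker 0 (9099), that leader is broker 0 whatever
        // the hash works out to -- so killing 9099 leaves every group without a coordinator,
        // which is exactly the error_code=15 seen above.
    }
}
```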
Then I deleted the __consumer_offsets node from ZooKeeper (see the sketch below for what that boils down to) and restarted every broker in the cluster. Note that right after the restart, __consumer_offsets does not exist in ZooKeeper yet; it is only created once a consumer starts up.
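A minimal sketch of that ZooKeeper-side cleanup, assuming the plain ZooKeeper Java client is on the classpath (the same thing can be done interactively with zkCli); the connect string matches the --zookeeper address used in the kafka-topics.sh commands, and depending on the setup the matching /config/topics znode may need cleaning too:

```java
import org.apache.zookeeper.ZKUtil;
import org.apache.zookeeper.ZooKeeper;

public class DropConsumerOffsets {
    public static void main(String[] args) throws Exception {
        // Connect to the ZooKeeper ensemble the brokers use (do this with all brokers stopped).
        ZooKeeper zk = new ZooKeeper("localhost:2182", 10_000, event -> { });

        // Topic partition assignments live under /brokers/topics/<topic>; removing the node
        // recursively lets the brokers recreate __consumer_offsets from scratch on restart.
        ZKUtil.deleteRecursive(zk, "/brokers/topics/__consumer_offsets");

        zk.close();
    }
}
```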
Once a consumer was running, I checked __consumer_offsets again: its partitions were now spread evenly across the three brokers:
> bin/kafka-topics.sh --zookeeper localhost:2182 --describe --topic __consumer_offsets
Topic:__consumer_offsets PartitionCount:50 ReplicationFactor:1 Configs:segment.bytes=104857600,cleanup.policy=compact,compression.type=producer
Topic: __consumer_offsets Partition: 0 Leader: 2 Replicas: 2 Isr: 2
Topic: __consumer_offsets Partition: 1 Leader: 0 Replicas: 0 Isr: 0
Topic: __consumer_offsets Partition: 2 Leader: 1 Replicas: 1 Isr: 1
Topic: __consumer_offsets Partition: 3 Leader: 2 Replicas: 2 Isr: 2
.........
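For completeness, the same layout check can be done from code with the AdminClient API; this is a sketch assuming a kafka-clients 0.11+ dependency (newer than the consumer client in the logs above):

```java
import java.util.Collections;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.TopicDescription;

public class OffsetsTopicLayout {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "192.168.0.108:9099,192.168.0.108:9091,192.168.0.108:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            Map<String, TopicDescription> topics =
                    admin.describeTopics(Collections.singletonList("__consumer_offsets")).all().get();

            // Print leader / replicas / ISR per partition, the same info as kafka-topics.sh --describe.
            topics.get("__consumer_offsets").partitions().forEach(p ->
                    System.out.printf("partition %d leader=%s replicas=%s isr=%s%n",
                            p.partition(), p.leader(), p.replicas(), p.isr()));
        }
    }
}
```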
At this point I thought I had found the cause: the consumer process had been started before the cluster was fully up, so the internal offsets topic __consumer_offsets had been created only on the broker that was already running. Once that broker died, the cluster became unusable for consumers. So with all partitions of __consumer_offsets spread evenly across several brokers, high availability should, in theory, follow.
So I ran the kill test again, and the problem came right back.......... absolutely maddening.
High availability still failed: whether the whole cluster kept working still came down to the first broker that had been started.
Out of ideas, I went back to the broker config file and looked at the settings related to __consumer_offsets:
############################# Internal Topic Settings #############################
# The replication factor for the group metadata internal topics "__consumer_offsets" and "__transaction_state"
# For anything other than development testing, a value greater than 1 is recommended to ensure availability such as 3.
offsets.topic.replication.factor=1
transaction.state.log.replication.factor=1
transaction.state.log.min.isr=1
To get high availability this needs to be changed to 3. After changing it to 3, high availability finally worked: with 3 brokers, the cluster keeps functioning as long as any one broker is alive.
In short: to keep every partition of __consumer_offsets available, offsets.topic.replication.factor has to be set to at least 3.
What still puzzles me: the partition replicas of the __consumer_offsets topic don't get spread across every broker, so once all the brokers holding a given replica are down, consumption stops?
And when offsets.topic.replication.factor is 1, no matter how the __consumer_offsets partitions are distributed, the cluster only keeps working as long as the first broker that was started stays alive......
My previous question is at https://www.orchome.com/792; it can no longer be edited, so I've started this new thread.