Kafka partition rebalance: with 15 partitions and 160 GB of data per partition, roughly how long does the process take?

lq0317。 Posted: 2020-08-11   Last updated: 2020-08-11 23:08:01   2,129 views

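It has been running for a full day now and still has not completed.

半兽人 -> lq0317。 4 years ago

Someone I know once spent 3 days on this...
How long the migration takes is determined by the gap between the rate at which new messages keep pouring in and the rate at which the migration copies data.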
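
To make that reasoning concrete, here is a rough back-of-the-envelope sketch. It assumes the new replicas only catch up at the difference between the copy rate and the incoming write rate; the rates in the example (20 MB/s copy, 5 MB/s ingest) are made-up numbers for illustration, not measurements from this cluster, and the function name is hypothetical.

```python
def estimate_catch_up_hours(bytes_to_copy, copy_rate_bps, ingest_rate_bps):
    """Estimate hours until new replicas catch up.

    bytes_to_copy   -- data that still has to be copied to the new replicas
    copy_rate_bps   -- replication throughput in bytes per second (assumed)
    ingest_rate_bps -- rate of new incoming messages in bytes per second (assumed)

    Returns None when the copy rate does not exceed the ingest rate,
    because in that case the replicas never catch up.
    """
    net_rate = copy_rate_bps - ingest_rate_bps
    if net_rate <= 0:
        return None
    return bytes_to_copy / net_rate / 3600


if __name__ == "__main__":
    # 15 partitions * 160 GB, copied once each (illustrative assumption).
    total = 15 * 160 * 10**9
    print(estimate_catch_up_hours(total, 20 * 10**6, 5 * 10**6))
```

If the ingest rate is close to the copy rate, the estimate grows very quickly, which matches the "3 days" experience mentioned above.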

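lq0317。 -> 半兽人 4 years ago

Could it get interrupted midway? It has been 2 days and it still isn't done. Is there any way to tell how long it will take?

lq0317。 -> 半兽人 4 years ago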
[root@prd-kafka-01 opt]# /usr/hdp/2.6.4.0-91/kafka/bin/kafka-topics.sh --describe --zookeeper 172.19.38.217:2181 --topic ods_be_monitor_item_detail
Topic:ods_be_monitor_item_detail        PartitionCount:15       ReplicationFactor:3     Configs:retention.ms=172800000
        Topic: ods_be_monitor_item_detail       Partition: 0    Leader: 1006    Replicas: 1006,1005,1008        Isr: 1006,1005,1008
        Topic: ods_be_monitor_item_detail       Partition: 1    Leader: 1008    Replicas: 1008,1006,1009,1005   Isr: 1008,1006,1005
        Topic: ods_be_monitor_item_detail       Partition: 2    Leader: 1005    Replicas: 1005,1006,1008,1010,1009      Isr: 1005,1008,1006
        Topic: ods_be_monitor_item_detail       Partition: 3    Leader: 1006    Replicas: 1005,1006,1008,1010,1009      Isr: 1006,1008,1005
        Topic: ods_be_monitor_item_detail       Partition: 4    Leader: 1008    Replicas: 1005,1010,1006,1008   Isr: 1008,1006,1005
        Topic: ods_be_monitor_item_detail       Partition: 5    Leader: 1005    Replicas: 1006,1008,1009,1005   Isr: 1005,1008,1006
        Topic: ods_be_monitor_item_detail       Partition: 6    Leader: 1006    Replicas: 1005,1006,1008,1010,1009      Isr: 1006,1008,1005
        Topic: ods_be_monitor_item_detail       Partition: 7    Leader: 1008    Replicas: 1005,1006,1008,1010,1009      Isr: 1008,1006,1005
        Topic: ods_be_monitor_item_detail       Partition: 8    Leader: 1005    Replicas: 1010,1005,1006,1008   Isr: 1005,1008,1006
        Topic: ods_be_monitor_item_detail       Partition: 9    Leader: 1006    Replicas: 1005,1006,1008        Isr: 1006,1005,1008
        Topic: ods_be_monitor_item_detail       Partition: 10   Leader: 1008    Replicas: 1005,1006,1008,1010,1009      Isr: 1008,1006,1005
        Topic: ods_be_monitor_item_detail       Partition: 11   Leader: 1005    Replicas: 1008,1010,1005,1006   Isr: 1005,1008,1006
        Topic: ods_be_monitor_item_detail       Partition: 12   Leader: 1006    Replicas: 1009,1005,1006,1008   Isr: 1006,1008,1005
        Topic: ods_be_monitor_item_detail       Partition: 13   Leader: 1008    Replicas: 1010,1006,1008,1005   Isr: 1008,1006,1005
        Topic: ods_be_monitor_item_detail       Partition: 14   Leader: 1005    Replicas: 1005,1008,1009,1006   Isr: 1005,1008,1006
Is this normal?
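
One way to read the `--describe` output above is to list, per partition, the replicas that are assigned but not yet in the ISR. The sketch below does only that text parsing; the function name is hypothetical and it assumes the line layout shown in the paste (`Partition: N  Leader: X  Replicas: a,b,c  Isr: a,b`).

```python
import re

# Matches the per-partition lines of kafka-topics.sh --describe as pasted above.
# The topic summary line ("PartitionCount:15 ...") does not match this pattern.
LINE_RE = re.compile(
    r"Partition:\s*(\d+)\s+Leader:\s*(-?\d+)\s+Replicas:\s*([\d,]+)\s+Isr:\s*([\d,]*)"
)


def replicas_not_in_isr(describe_output):
    """Return {partition: [replica ids assigned but missing from the ISR]}."""
    result = {}
    for line in describe_output.splitlines():
        match = LINE_RE.search(line)
        if not match:
            continue
        partition = int(match.group(1))
        replicas = {int(x) for x in match.group(3).split(",") if x}
        isr = {int(x) for x in match.group(4).split(",") if x}
        missing = sorted(replicas - isr)
        if missing:
            result[partition] = missing
    return result


if __name__ == "__main__":
    sample = """\
Topic:ods_be_monitor_item_detail  PartitionCount:15  ReplicationFactor:3
    Topic: ods_be_monitor_item_detail  Partition: 0  Leader: 1006  Replicas: 1006,1005,1008  Isr: 1006,1005,1008
    Topic: ods_be_monitor_item_detail  Partition: 1  Leader: 1008  Replicas: 1008,1006,1009,1005  Isr: 1008,1006,1005
"""
    print(replicas_not_in_isr(sample))
```

Applied to the full paste, only brokers 1009 and 1010 show up as missing, which fits the later statement that two nodes were added and are still catching up.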
lq0317。 -> 半兽人 4 years ago
[root@prd-kafka-01 opt]# /usr/hdp/2.6.4.0-91/kafka/bin/kafka-reassign-partitions.sh --zookeeper 172.19.38.217:2181 --reassignment-json-file expand-cluster-ods-be-reassignment.json --verify
Status of partition reassignment: 
Reassignment of partition [ods_be_monitor_item_detail,8] is still in progress
Reassignment of partition [ods_be_monitor_item_detail,9] completed successfully
Reassignment of partition [ods_be_monitor_item_detail,6] is still in progress
Reassignment of partition [ods_be_monitor_item_detail,14] is still in progress
Reassignment of partition [ods_be_monitor_item_detail,5] is still in progress
Reassignment of partition [ods_be_monitor_item_detail,11] is still in progress
Reassignment of partition [ods_be_monitor_item_detail,13] is still in progress
Reassignment of partition [ods_be_monitor_item_detail,3] is still in progress
Reassignment of partition [ods_be_monitor_item_detail,2] is still in progress
Reassignment of partition [ods_be_monitor_item_detail,4] is still in progress
Reassignment of partition [ods_be_monitor_item_detail,1] is still in progress
Reassignment of partition [ods_be_monitor_item_detail,10] is still in progress
Reassignment of partition [ods_be_monitor_item_detail,12] is still in progress
Reassignment of partition [ods_be_monitor_item_detail,0] completed successfully
Reassignment of partition [ods_be_monitor_item_detail,7] is still in progress
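
To track progress over time, the `--verify` output above can be tallied by status. This is a minimal sketch that only parses text in the format shown in the paste; the function name is hypothetical and it does not talk to Kafka or ZooKeeper.

```python
import re
from collections import Counter

# "Reassignment of partition [topic,partition] <status text>"
STATUS_RE = re.compile(r"Reassignment of partition \[(.+),(\d+)\] (.+)$")


def tally_verify_output(verify_output):
    """Count partitions per status text in kafka-reassign-partitions.sh --verify output."""
    counts = Counter()
    for line in verify_output.splitlines():
        match = STATUS_RE.search(line.strip())
        if match:
            counts[match.group(3)] += 1
    return counts


if __name__ == "__main__":
    sample = """\
Status of partition reassignment:
Reassignment of partition [ods_be_monitor_item_detail,8] is still in progress
Reassignment of partition [ods_be_monitor_item_detail,9] completed successfully
Reassignment of partition [ods_be_monitor_item_detail,6] is still in progress
"""
    for status, count in tally_verify_output(sample).items():
        print(count, status)
```

Running the command periodically and comparing the counts gives a coarse view of whether anything is moving; for the paste above the tally is 2 completed and 13 still in progress.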

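Why do you need to rebalance? Is there some problem with your cluster? With 15 partitions at 160 GB each, a rebalance costs a lot of time and resources and will seriously hurt Kafka's throughput and efficiency.

I expanded the cluster by 2 nodes, so it needs to be rebalanced.

半兽人 -> lq0317。 4 years ago

No, it won't be interrupted. Keep an eye on the partition sync offsets. Also, you have 3 replicas, so the volume really is huge.

lq0317。 -> 半兽人 4 years ago

OK, I'll watch it for another day or two. Thanks a lot.

lq0317。 -> 半兽人 4 years ago

Hello, may I ask how to observe the partition sync offsets?

lq0317。 -> 半兽人 4 years ago

Mine has been running for almost 5 days and still isn't done. I'm worried something will go wrong.

lq0317。 -> 半兽人 4 years ago

Writes now keep failing with NotLeaderForPartitionError.

lq0317。 -> 半兽人 4 years ago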
[2020-08-14 17:39:01,727] INFO [KafkaApi-1009] Closing connection due to error during produce request with correlation id 572 from client id producer-1 with ack=0
Topic and partition to exceptions: pshop_sell_status_topic-12 -> org.apache.kafka.common.errors.NotLeaderForPartitionException (kafka.server.KafkaApis)
[2020-08-14 17:39:01,877] INFO [KafkaApi-1009] Closing connection due to error during produce request with correlation id 578 from client id producer-1 with ack=0
Topic and partition to exceptions: pshop_sell_status_topic-2 -> org.apache.kafka.common.errors.NotLeaderForPartitionException (kafka.server.KafkaApis)
[2020-08-14 17:39:01,928] INFO [KafkaApi-1009] Closing connection due to error during produce request with correlation id 583 from client id producer-1 with ack=0
Topic and partition to exceptions: pshop_sell_status_topic-7 -> org.apache.kafka.common.errors.NotLeaderForPartitionException (kafka.server.KafkaApis)
[2020-08-14 17:39:02,079] INFO [KafkaApi-1009] Closing connection due to error during produce request with correlation id 589 from client id producer-1 with ack=0
Topic and partition to exceptions: pshop_sell_status_topic-12 -> org.apache.kafka.common.errors.NotLeaderForPartitionException (kafka.server.KafkaApis)
[2020-08-14 17:39:02,129] INFO [KafkaApi-1009] Closing connection due to error during produce request with correlation id 594 from client id producer-1 with ack=0
Topic and partition to exceptions: pshop_sell_status_topic-2 -> org.apache.kafka.common.errors.NotLeaderForPartitionException (kafka.server.KafkaApis)
[2020-08-14 17:40:06,193] INFO [KafkaApi-1009] Closing connection due to error during produce request with correlation id 601 from client id producer-1 with ack=0
Topic and partition to exceptions: pshop_sell_status_topic-7 -> org.apache.kafka.common.errors.NotLeaderForPartitionException (kafka.server.KafkaApis)
[2020-08-14 17:40:06,295] INFO [KafkaApi-1009] Closing connection due to error during produce request with correlation id 608 from client id producer-1 with ack=0
Topic and partition to exceptions: pshop_sell_status_topic-12 -> org.apache.kafka.common.errors.NotLeaderForPartitionException (kafka.server.KafkaApis)
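半兽人 -> lq0317。 4 years ago

Those are INFO logs, so you can ignore them.
You can log on to the machine and look at the physical files to see how far the offsets have synced (typing on my phone).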
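
As a rough illustration of "looking at the physical files": each partition lives in a directory named `<topic>-<partition>` under the broker's log directory, and each segment file is named after its base offset. The sketch below sums the `.log` sizes and reports the highest segment base offset for one partition directory. The function name and the example path are hypothetical; the actual log directory depends on the broker's configuration.

```python
import os


def summarize_partition_dir(partition_dir):
    """Return (total bytes of .log segments, highest segment base offset).

    The base offset is taken from the segment file name, e.g.
    00000000076574100245.log has base offset 76574100245.
    Returns (0, None) if the directory holds no segment files.
    """
    total_bytes = 0
    max_base_offset = None
    for name in os.listdir(partition_dir):
        stem, ext = os.path.splitext(name)
        if ext != ".log" or not stem.isdigit():
            continue  # skip .index files and anything else
        total_bytes += os.path.getsize(os.path.join(partition_dir, name))
        base_offset = int(stem)
        if max_base_offset is None or base_offset > max_base_offset:
            max_base_offset = base_offset
    return total_bytes, max_base_offset


if __name__ == "__main__":
    # Hypothetical path; replace with a real partition directory on the broker.
    path = "/kafka-logs/ods_be_monitor_item_detail-8"
    if os.path.isdir(path):
        print(summarize_partition_dir(path))
```

Running this for the same partition on the leader and on a newly added broker, and repeating it after a while, shows whether the new replica's size and latest segment are moving toward the leader's.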

lq0317。 -> 半兽人 4 years ago
[2020-08-14 18:21:58,656] ERROR [KafkaApi-1009] Error when handling request {controller_id=1005,controller_epoch=22,partition_states=[{topic=monitor_shop_selltime_status_v3,partition=2,controller_epoch=22,leader=1005,leader_epoch=2,isr=[1005,1006,1008],zk_version=13,replicas=[1005,1006,1008,1010,1009]}],live_leaders=[{id=1005,host=kafka1.sh-internal.com,port=6667}]} (kafka.server.KafkaApis)
java.io.IOException: Malformed line in offset checkpoint file: pshop sell status topic 7 0'
        at kafka.server.OffsetCheckpoint.malformedLineException$1(OffsetCheckpoint.scala:81)
        at kafka.server.OffsetCheckpoint.liftedTree2$1(OffsetCheckpoint.scala:104)
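lq0317。 -> 半兽人 4 years ago

Now I'm getting this error. I deleted the earlier partition reassignment task, but data still can't be written.

半兽人 -> lq0317。 4 years ago

Did you touch the migration?

lq0317。 -> 半兽人 4 years ago

No, but someone created a topic like this for me:
pshop sell status topic, with spaces in the name.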

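lq0317。 -> 半兽人 4 years ago

The two files recovery-point-offset-checkpoint and replication-offset-checkpoint always end up containing this pshop sell status topic entry. Deleting the files and restarting doesn't help either. How do I fix this now? This is production and it's urgent, so please reply.

半兽人 -> lq0317。 4 years ago

It's caused by a format problem. How could a topic with spaces in its name have been created successfully?
Which Kafka version is this? I don't know whether your offsets are stored in zk or in Kafka's own __consumer_offsets.
The topic has to be deleted from there.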
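
For context on the "format problem": the error message above suggests the checkpoint reader expects each entry to be three whitespace-separated fields (topic, partition, offset), so a topic name containing spaces yields extra fields. The sketch below checks a checkpoint file's text under that assumption, also assuming the first two lines are a version number and an entry count. It only reports suspicious lines and does not modify anything; the function name is hypothetical.

```python
def find_malformed_checkpoint_lines(text):
    """Return [(line_number, line)] for entry lines that are not 'topic partition offset'.

    Assumed layout: line 1 = version, line 2 = number of entries,
    remaining lines = one entry each with exactly three whitespace-separated fields.
    """
    malformed = []
    lines = text.splitlines()
    for number, line in enumerate(lines[2:], start=3):
        if not line.strip():
            continue
        fields = line.split()
        if len(fields) != 3 or not fields[1].isdigit() or not fields[2].isdigit():
            malformed.append((number, line))
    return malformed


if __name__ == "__main__":
    sample = (
        "0\n"
        "2\n"
        "ods_be_monitor_item_detail 3 12345\n"
        "pshop sell status topic 7 0\n"
    )
    for number, line in find_malformed_checkpoint_lines(sample):
        print(number, repr(line))
```

In the sample, the last line splits into six fields instead of three, which is the kind of entry the `Malformed line in offset checkpoint file` error complains about.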

lq0317。 -> 半兽人 4 years ago

[root@prd-kafka-01 kafka]# find ./libs/ -name *kafka_* | head -1 | grep -o '\kafka[^\n]*'
kafka_2.11-0.10.1.2.6.4.0-91.jar

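lq0317。 -> 半兽人 4 years ago

What should I do now? I've tried many approaches and none of them fixed it.

lq0317。 -> lq0317。 4 years ago

Could you give me a way to contact you?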

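半兽人 -> lq0317。 4 years ago

Here's my suggestion. This is a production Kafka, and you need to clean out the problem topic by hand, but that is a fairly high-risk operation in production, and you are also in the middle of migrating data.
1. Set up a new Kafka cluster and redirect the business traffic to it.
2. Once the traffic has been moved away, you can repair the old Kafka cluster without pressure.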

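lq0317。 -> 半兽人 4 years ago

I have already stopped the data migration in zk.
There were five nodes before. One of them still has a problem: it keeps recovering data and its port is not listening.

lq0317。 -> 半兽人 4 years ago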
[2020-08-15 00:17:37,087] INFO Recovering unflushed segment 307489432 in log raw_shop_business_detail-2. (kafka.log.Log)
[2020-08-15 00:17:37,709] INFO Recovering unflushed segment 76574100245 in log ods_eleme_monitor_item_detail-7. (kafka.log.Log)
[2020-08-15 00:17:39,330] INFO Recovering unflushed segment 76634617423 in log ods_eleme_monitor_item_detail-11. (kafka.log.Log)
[2020-08-15 00:17:39,730] INFO Recovering unflushed segment 76885515852 in log ods_eleme_monitor_item_detail-0. (kafka.log.Log)
[2020-08-15 00:17:44,113] INFO Recovering unflushed segment 307582694 in log raw_shop_business_detail-2. (kafka.log.Log)
[2020-08-15 00:17:46,124] INFO Recovering unflushed segment 76635155355 in log ods_eleme_monitor_item_detail-11. (kafka.log.Log)
[2020-08-15 00:17:48,664] INFO Recovering unflushed segment 76574648701 in log ods_eleme_monitor_item_detail-7. (kafka.log.Log)
[2020-08-15 00:17:48,928] INFO Recovering unflushed segment 76886065583 in log ods_eleme_monitor_item_detail-0. (kafka.log.Log)
[2020-08-15 00:17:50,801] INFO Recovering unflushed segment 307675968 in log raw_shop_business_detail-2. (kafka.log.Log)
[2020-08-15 00:17:52,978] INFO Recovering unflushed segment 76635693417 in log ods_eleme_monitor_item_detail-11. (kafka.log.Log)
[2020-08-15 00:17:57,688] INFO Recovering unflushed segment 76886609961 in log ods_eleme_monitor_item_detail-0. (kafka.log.Log)
[2020-08-15 00:17:57,825] INFO Recovering unflushed segment 307770638 in log raw_shop_business_detail-2. (kafka.log.Log)
[2020-08-15 00:17:59,792] INFO Recovering unflushed segment 76636232706 in log ods_eleme_monitor_item_detail-11. (kafka.log.Log)
[2020-08-15 00:17:59,811] INFO Recovering unflushed segment 76575198496 in log ods_eleme_monitor_item_detail-7. (kafka.log.Log)
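半兽人 -> lq0317。 4 years ago

Point log.dir on the problem node at a new directory (you can keep the old data) and let that broker resync its data from scratch.
If all of your topics have more than 1 replica, you can afford to be a bit more brute-force about it.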

lq0317。 -> 半兽人 4 years ago

I ended up setting up a new cluster.
