Kafka Monitoring

Original
半兽人, posted 2015-03-10, last updated 2020-01-12 17:05:14

6.6 Monitoring

Kafka uses Yammer Metrics for metrics reporting in the server. The Java clients use Kafka Metrics, a built-in metrics registry that minimizes transitive dependencies pulled into client applications. Both expose metrics via JMX and can be configured to report stats through pluggable stats reporters that hook into your monitoring system.

All Kafka rate metrics have a corresponding cumulative count metric with the suffix -total. For example, records-consumed-rate has a corresponding metric named records-consumed-total.
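The naming rule can be sketched as a small helper. This is only an illustration of the -rate to -total convention described here; the function name total_metric_name is made up for this example and is not part of any Kafka API.

```python
def total_metric_name(rate_metric: str) -> str:
    """Return the cumulative-count counterpart of a rate metric name."""
    suffix = "-rate"
    if not rate_metric.endswith(suffix):
        raise ValueError(f"not a rate metric: {rate_metric!r}")
    # Swap the trailing "-rate" for "-total".
    return rate_metric[: -len(suffix)] + "-total"


print(total_metric_name("records-consumed-rate"))
```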

The easiest way to see the available metrics is to start jconsole and point it at a running Kafka client or server; this lets you browse all metrics exposed over JMX.

Security Considerations for Remote Monitoring using JMX

Apache Kafka disables remote JMX by default. You can enable remote monitoring using JMX by setting the environment variable JMX_PORT for processes started using the CLI, or by setting standard Java system properties to enable remote JMX programmatically. You must enable security when enabling remote JMX in production scenarios to ensure that unauthorized users cannot monitor or control your broker or application, or the platform on which these are running. Note that authentication is disabled for JMX by default in Kafka, and security configs must be overridden for production deployments by setting the environment variable KAFKA_JMX_OPTS for processes started using the CLI or by setting appropriate Java system properties. See Monitoring and Management Using JMX Technology for details on securing JMX.
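As a rough sketch of the override described above, the helper below builds the two environment variables for a CLI-started process. The -Dcom.sun.management.jmxremote.* flags are the standard JVM system properties for remote JMX; the function name, the port and the file paths are hypothetical, and the TLS keystore settings that ssl=true also requires are left out.

```python
from typing import Dict


def secure_jmx_env(port: int, password_file: str, access_file: str) -> Dict[str, str]:
    """Build JMX_PORT and KAFKA_JMX_OPTS with authentication and TLS switched on."""
    opts = [
        "-Dcom.sun.management.jmxremote",
        # Kafka's default options set these two to false; override them here.
        "-Dcom.sun.management.jmxremote.authenticate=true",
        "-Dcom.sun.management.jmxremote.ssl=true",
        f"-Dcom.sun.management.jmxremote.password.file={password_file}",
        f"-Dcom.sun.management.jmxremote.access.file={access_file}",
    ]
    return {"JMX_PORT": str(port), "KAFKA_JMX_OPTS": " ".join(opts)}


# Hypothetical port and paths, for illustration only.
env = secure_jmx_env(9999, "/etc/kafka/jmx.password", "/etc/kafka/jmx.access")
for key, value in env.items():
    print(f"export {key}='{value}'")
```

The resulting dictionary could be merged into the environment passed to the process that starts the broker.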

The metrics are described below:

DESCRIPTION
    MBEAN NAME
    NORMAL VALUE
Message in rate
    kafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec
Byte in rate from clients
    kafka.server:type=BrokerTopicMetrics,name=BytesInPerSec
Byte in rate from other brokers
    kafka.server:type=BrokerTopicMetrics,name=ReplicationBytesInPerSec
Request rate
    kafka.network:type=RequestMetrics,name=RequestsPerSec,request={Produce|FetchConsumer|FetchFollower}
Error rate
    kafka.network:type=RequestMetrics,name=ErrorsPerSec,request=([-.\w]+),error=([-.\w]+)
    Number of errors in responses counted per-request-type, per-error-code. If a response contains multiple errors, all are counted. error=NONE indicates successful responses.
Request size in bytes
    kafka.network:type=RequestMetrics,name=RequestBytes,request=([-.\w]+)
    Size of requests for each request type.
Temporary memory size in bytes
    kafka.network:type=RequestMetrics,name=TemporaryMemoryBytes,request={Produce|Fetch}
    Temporary memory used for message format conversions and decompression.
Message conversion time
    kafka.network:type=RequestMetrics,name=MessageConversionsTimeMs,request={Produce|Fetch}
    Time in milliseconds spent on message format conversions.
Message conversion rate
    kafka.server:type=BrokerTopicMetrics,name={Produce|Fetch}MessageConversionsPerSec,topic=([-.\w]+)
    Number of records which required message format conversion.
Byte out rate to clients
    kafka.server:type=BrokerTopicMetrics,name=BytesOutPerSec
Byte out rate to other brokers
    kafka.server:type=BrokerTopicMetrics,name=ReplicationBytesOutPerSec
Message validation failure rate due to no key specified for compacted topic
    kafka.server:type=BrokerTopicMetrics,name=NoKeyCompactedTopicRecordsPerSec
Message validation failure rate due to invalid magic number
    kafka.server:type=BrokerTopicMetrics,name=InvalidMagicNumberRecordsPerSec
Message validation failure rate due to incorrect CRC checksum
    kafka.server:type=BrokerTopicMetrics,name=InvalidMessageCrcRecordsPerSec
Message validation failure rate due to non-continuous offset or sequence number in batch
    kafka.server:type=BrokerTopicMetrics,name=InvalidOffsetOrSequenceRecordsPerSec
Log flush rate and time
    kafka.log:type=LogFlushStats,name=LogFlushRateAndTimeMs
# of under replicated partitions (|ISR| < |all replicas|)
    kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions
    0
# of under minIsr partitions (|ISR| < min.insync.replicas)
    kafka.server:type=ReplicaManager,name=UnderMinIsrPartitionCount
    0
# of at minIsr partitions (|ISR| = min.insync.replicas)
    kafka.server:type=ReplicaManager,name=AtMinIsrPartitionCount
    0
# of offline log directories
    kafka.log:type=LogManager,name=OfflineLogDirectoryCount
    0
Is controller active on broker
    kafka.controller:type=KafkaController,name=ActiveControllerCount
    only one broker in the cluster should have 1
Leader election rate
    kafka.controller:type=ControllerStats,name=LeaderElectionRateAndTimeMs
    non-zero when there are broker failures
Unclean leader election rate
    kafka.controller:type=ControllerStats,name=UncleanLeaderElectionsPerSec
    0
Pending topic deletes
    kafka.controller:type=KafkaController,name=TopicsToDeleteCount
Pending replica deletes
    kafka.controller:type=KafkaController,name=ReplicasToDeleteCount
Ineligible pending topic deletes
    kafka.controller:type=KafkaController,name=TopicsIneligibleToDeleteCount
Ineligible pending replica deletes
    kafka.controller:type=KafkaController,name=ReplicasIneligibleToDeleteCount
Partition counts
    kafka.server:type=ReplicaManager,name=PartitionCount
    mostly even across brokers
Leader replica counts
    kafka.server:type=ReplicaManager,name=LeaderCount
    mostly even across brokers
ISR shrink rate
    kafka.server:type=ReplicaManager,name=IsrShrinksPerSec
    If a broker goes down, ISR for some of the partitions will shrink. When that broker is up again, ISR will be expanded once the replicas are fully caught up. Other than that, the expected value for both ISR shrink rate and expansion rate is 0.
ISR expansion rate
    kafka.server:type=ReplicaManager,name=IsrExpandsPerSec
    See above
Max lag in messages between follower and leader replicas
    kafka.server:type=ReplicaFetcherManager,name=MaxLag,clientId=Replica
    lag should be proportional to the maximum batch size of a produce request.
Lag in messages per follower replica
    kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=([-.\w]+),topic=([-.\w]+),partition=([0-9]+)
    lag should be proportional to the maximum batch size of a produce request.
Requests waiting in the producer purgatory
    kafka.server:type=DelayedOperationPurgatory,name=PurgatorySize,delayedOperation=Produce
    non-zero if acks=-1 is used
Requests waiting in the fetch purgatory
    kafka.server:type=DelayedOperationPurgatory,name=PurgatorySize,delayedOperation=Fetch
    size depends on fetch.max.wait.ms in the consumer
Request total time
    kafka.network:type=RequestMetrics,name=TotalTimeMs,request={Produce|FetchConsumer|FetchFollower}
    broken into queue, local, remote and response send time
Time the request waits in the request queue
    kafka.network:type=RequestMetrics,name=RequestQueueTimeMs,request={Produce|FetchConsumer|FetchFollower}
Time the request is processed at the leader
    kafka.network:type=RequestMetrics,name=LocalTimeMs,request={Produce|FetchConsumer|FetchFollower}
Time the request waits for the follower
    kafka.network:type=RequestMetrics,name=RemoteTimeMs,request={Produce|FetchConsumer|FetchFollower}
    non-zero for produce requests when acks=-1
Time the request waits in the response queue
    kafka.network:type=RequestMetrics,name=ResponseQueueTimeMs,request={Produce|FetchConsumer|FetchFollower}
Time to send the response
    kafka.network:type=RequestMetrics,name=ResponseSendTimeMs,request={Produce|FetchConsumer|FetchFollower}
Number of messages the consumer lags behind the producer by. Published by the consumer, not broker.
    kafka.consumer:type=consumer-fetch-manager-metrics,client-id={client-id}
    Attribute: records-lag-max
The average fraction of time the network processors are idle
    kafka.network:type=SocketServer,name=NetworkProcessorAvgIdlePercent
    between 0 and 1, ideally > 0.3
The number of connections disconnected on a processor due to a client not re-authenticating and then using the connection beyond its expiration time for anything other than re-authentication
    kafka.server:type=socket-server-metrics,listener=[SASL_PLAINTEXT|SASL_SSL],networkProcessor=<#>,name=expired-connections-killed-count
    ideally 0 when re-authentication is enabled, implying there are no longer any older, pre-2.2.0 clients connecting to this (listener, processor) combination
The total number of connections disconnected, across all processors, due to a client not re-authenticating and then using the connection beyond its expiration time for anything other than re-authentication
    kafka.network:type=SocketServer,name=ExpiredConnectionsKilledCount
    ideally 0 when re-authentication is enabled, implying there are no longer any older, pre-2.2.0 clients connecting to this broker
The average fraction of time the request handler threads are idle
    kafka.server:type=KafkaRequestHandlerPool,name=RequestHandlerAvgIdlePercent
    between 0 and 1, ideally > 0.3
Bandwidth quota metrics per (user, client-id), user or client-id
    kafka.server:type={Produce|Fetch},user=([-.\w]+),client-id=([-.\w]+)
    Two attributes. throttle-time indicates the amount of time in ms the client was throttled. Ideally = 0. byte-rate indicates the data produce/consume rate of the client in bytes/sec. For (user, client-id) quotas, both user and client-id are specified. If per-client-id quota is applied to the client, user is not specified. If per-user quota is applied, client-id is not specified.
Request quota metrics per (user, client-id), user or client-id
    kafka.server:type=Request,user=([-.\w]+),client-id=([-.\w]+)
    Two attributes. throttle-time indicates the amount of time in ms the client was throttled. Ideally = 0. request-time indicates the percentage of time spent in broker network and I/O threads to process requests from client group. For (user, client-id) quotas, both user and client-id are specified. If per-client-id quota is applied to the client, user is not specified. If per-user quota is applied, client-id is not specified.
Requests exempt from throttling
    kafka.server:type=Request
    exempt-throttle-time indicates the percentage of time spent in broker network and I/O threads to process requests that are exempt from throttling.
ZooKeeper client request latency
    kafka.server:type=ZooKeeperClientMetrics,name=ZooKeeperRequestLatencyMs
    Latency in milliseconds for ZooKeeper requests from broker.
ZooKeeper connection status
    kafka.server:type=SessionExpireListener,name=SessionState
    Connection status of broker's ZooKeeper session which may be one of Disconnected|SyncConnected|AuthFailed|ConnectedReadOnly|SaslAuthenticated|Expired.
Max time to load group metadata
    kafka.server:type=group-coordinator-metrics,name=partition-load-time-max
    maximum time, in milliseconds, it took to load offsets and group metadata from the consumer offset partitions loaded in the last 30 seconds
Avg time to load group metadata
    kafka.server:type=group-coordinator-metrics,name=partition-load-time-avg
    average time, in milliseconds, it took to load offsets and group metadata from the consumer offset partitions loaded in the last 30 seconds
Max time to load transaction metadata
    kafka.server:type=transaction-coordinator-metrics,name=partition-load-time-max
    maximum time, in milliseconds, it took to load transaction metadata from the consumer offset partitions loaded in the last 30 seconds
Avg time to load transaction metadata
    kafka.server:type=transaction-coordinator-metrics,name=partition-load-time-avg
    average time, in milliseconds, it took to load transaction metadata from the consumer offset partitions loaded in the last 30 seconds
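
The "normal value" entries of the broker table can be turned into simple alert rules. The sketch below assumes the metric values have already been collected (for example over JMX) into plain dictionaries keyed by the name= part of the MBean, with rate metrics reduced to a single number. The function names and the dictionary layout are assumptions made for this example, not a Kafka API.

```python
from typing import Dict, List

# Metrics whose normal value in the broker table is 0.
ZERO_EXPECTED = (
    "UnderReplicatedPartitions",
    "UnderMinIsrPartitionCount",
    "AtMinIsrPartitionCount",
    "OfflineLogDirectoryCount",
    "UncleanLeaderElectionsPerSec",
)
# Idle fractions are "between 0 and 1, ideally > 0.3".
IDLE_METRICS = ("NetworkProcessorAvgIdlePercent", "RequestHandlerAvgIdlePercent")
MIN_IDLE = 0.3


def broker_warnings(sample: Dict[str, float]) -> List[str]:
    """Compare one broker's sampled values with the normal values from the table."""
    warnings = []
    for name in ZERO_EXPECTED:
        value = sample.get(name, 0)
        if value != 0:
            warnings.append(f"{name} is {value}, expected 0")
    for name in IDLE_METRICS:
        value = sample.get(name)
        if value is not None and value <= MIN_IDLE:
            warnings.append(f"{name} is {value}, ideally > {MIN_IDLE}")
    return warnings


def controller_warnings(samples: Dict[str, Dict[str, float]]) -> List[str]:
    """Exactly one broker in the cluster should report ActiveControllerCount = 1."""
    active = sum(s.get("ActiveControllerCount", 0) for s in samples.values())
    if active == 1:
        return []
    return [f"ActiveControllerCount sums to {active} across brokers, expected 1"]
```

broker_warnings is meant to run per broker, while controller_warnings needs one sample from every broker because the expectation holds for the cluster as a whole.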

Common monitoring metrics for producer/consumer/connect

The following metrics are available on producer/consumer/connector instances. For specific metrics, please see the following sections.

METRIC/ATTRIBUTE NAME: DESCRIPTION
    MBEAN NAME
connection-close-rate: Connections closed per second in the window.
    kafka.[producer|consumer|connect]:type=[producer|consumer|connect]-metrics,client-id=([-.\w]+)
connection-creation-rate: New connections established per second in the window.
    kafka.[producer|consumer|connect]:type=[producer|consumer|connect]-metrics,client-id=([-.\w]+)
network-io-rate: The average number of network operations (reads or writes) on all connections per second.
    kafka.[producer|consumer|connect]:type=[producer|consumer|connect]-metrics,client-id=([-.\w]+)
outgoing-byte-rate: The average number of outgoing bytes sent per second to all servers.
    kafka.[producer|consumer|connect]:type=[producer|consumer|connect]-metrics,client-id=([-.\w]+)
request-rate: The average number of requests sent per second.
    kafka.[producer|consumer|connect]:type=[producer|consumer|connect]-metrics,client-id=([-.\w]+)
request-size-avg: The average size of all requests in the window.
    kafka.[producer|consumer|connect]:type=[producer|consumer|connect]-metrics,client-id=([-.\w]+)
request-size-max: The maximum size of any request sent in the window.
    kafka.[producer|consumer|connect]:type=[producer|consumer|connect]-metrics,client-id=([-.\w]+)
incoming-byte-rate: Bytes/second read off all sockets.
    kafka.[producer|consumer|connect]:type=[producer|consumer|connect]-metrics,client-id=([-.\w]+)
response-rate: Responses received per second.
    kafka.[producer|consumer|connect]:type=[producer|consumer|connect]-metrics,client-id=([-.\w]+)
select-rate: Number of times the I/O layer checked for new I/O to perform per second.
    kafka.[producer|consumer|connect]:type=[producer|consumer|connect]-metrics,client-id=([-.\w]+)
io-wait-time-ns-avg: The average length of time the I/O thread spent waiting for a socket ready for reads or writes in nanoseconds.
    kafka.[producer|consumer|connect]:type=[producer|consumer|connect]-metrics,client-id=([-.\w]+)
io-wait-ratio: The fraction of time the I/O thread spent waiting.
    kafka.[producer|consumer|connect]:type=[producer|consumer|connect]-metrics,client-id=([-.\w]+)
io-time-ns-avg: The average length of time for I/O per select call in nanoseconds.
    kafka.[producer|consumer|connect]:type=[producer|consumer|connect]-metrics,client-id=([-.\w]+)
io-ratio: The fraction of time the I/O thread spent doing I/O.
    kafka.[producer|consumer|connect]:type=[producer|consumer|connect]-metrics,client-id=([-.\w]+)
connection-count: The current number of active connections.
    kafka.[producer|consumer|connect]:type=[producer|consumer|connect]-metrics,client-id=([-.\w]+)
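
The MBean names in this table are patterns: the bracketed alternatives stand for the client type, and ([-.\w]+) stands for the client's client-id. A minimal sketch of turning the pattern into a concrete name is shown below. The helper client_mbean is hypothetical, and it only covers client-ids that match the character class used in the table.

```python
import re

CLIENT_TYPES = ("producer", "consumer", "connect")
# Same character class as the ([-.\w]+) placeholder in the table.
CLIENT_ID = re.compile(r"[-.\w]+")


def client_mbean(client_type: str, client_id: str) -> str:
    """Fill in the client type and client-id of the common client MBean pattern."""
    if client_type not in CLIENT_TYPES:
        raise ValueError(f"unknown client type: {client_type!r}")
    if not CLIENT_ID.fullmatch(client_id):
        raise ValueError(f"client-id does not match [-.\\w]+: {client_id!r}")
    return f"kafka.{client_type}:type={client_type}-metrics,client-id={client_id}"


# "my-app-1" is a made-up client-id, i.e. the value of the client's client.id setting.
print(client_mbean("producer", "my-app-1"))
```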

Common per-broker metrics for producer/consumer/connect

The following metrics are available on producer/consumer/connector instances. For specific metrics, please see the following sections.

METRIC/ATTRIBUTE NAME: DESCRIPTION
    MBEAN NAME
outgoing-byte-rate: The average number of outgoing bytes sent per second for a node.
    kafka.[producer|consumer|connect]:type=[consumer|producer|connect]-node-metrics,client-id=([-.\w]+),node-id=([0-9]+)
request-rate: The average number of requests sent per second for a node.
    kafka.[producer|consumer|connect]:type=[consumer|producer|connect]-node-metrics,client-id=([-.\w]+),node-id=([0-9]+)
request-size-avg: The average size of all requests in the window for a node.
    kafka.[producer|consumer|connect]:type=[consumer|producer|connect]-node-metrics,client-id=([-.\w]+),node-id=([0-9]+)
request-size-max: The maximum size of any request sent in the window for a node.
    kafka.[producer|consumer|connect]:type=[consumer|producer|connect]-node-metrics,client-id=([-.\w]+),node-id=([0-9]+)
incoming-byte-rate: The average number of bytes received per second for a node.
    kafka.[producer|consumer|connect]:type=[consumer|producer|connect]-node-metrics,client-id=([-.\w]+),node-id=([0-9]+)
request-latency-avg: The average request latency in ms for a node.
    kafka.[producer|consumer|connect]:type=[consumer|producer|connect]-node-metrics,client-id=([-.\w]+),node-id=([0-9]+)
request-latency-max: The maximum request latency in ms for a node.
    kafka.[producer|consumer|connect]:type=[consumer|producer|connect]-node-metrics,client-id=([-.\w]+),node-id=([0-9]+)
response-rate: Responses received per second for a node.
    kafka.[producer|consumer|connect]:type=[consumer|producer|connect]-node-metrics,client-id=([-.\w]+),node-id=([0-9]+)

Producer monitoring

The following metrics are available on producer instances.

METRIC/ATTRIBUTE NAME: DESCRIPTION
    MBEAN NAME
waiting-threads: The number of user threads blocked waiting for buffer memory to enqueue their records.
    kafka.producer:type=producer-metrics,client-id=([-.\w]+)
buffer-total-bytes: The maximum amount of buffer memory the client can use (whether or not it is currently used).
    kafka.producer:type=producer-metrics,client-id=([-.\w]+)
buffer-available-bytes: The total amount of buffer memory that is not being used (either unallocated or in the free list).
    kafka.producer:type=producer-metrics,client-id=([-.\w]+)
bufferpool-wait-time: The fraction of time an appender waits for space allocation.
    kafka.producer:type=producer-metrics,client-id=([-.\w]+)
batch-size-avg: The average number of bytes sent per partition per-request.
    kafka.producer:type=producer-metrics,client-id=([-.\w]+)
batch-size-max: The max number of bytes sent per partition per-request.
    kafka.producer:type=producer-metrics,client-id=([-.\w]+)
compression-rate-avg: The average compression rate of record batches.
    kafka.producer:type=producer-metrics,client-id=([-.\w]+)
record-queue-time-avg: The average time in ms record batches spent in the record accumulator.
    kafka.producer:type=producer-metrics,client-id=([-.\w]+)
record-queue-time-max: The maximum time in ms record batches spent in the record accumulator.
    kafka.producer:type=producer-metrics,client-id=([-.\w]+)
request-latency-avg: The average request latency in ms.
    kafka.producer:type=producer-metrics,client-id=([-.\w]+)
request-latency-max: The maximum request latency in ms.
    kafka.producer:type=producer-metrics,client-id=([-.\w]+)
record-send-rate: The average number of records sent per second.
    kafka.producer:type=producer-metrics,client-id=([-.\w]+)
records-per-request-avg: The average number of records per request.
    kafka.producer:type=producer-metrics,client-id=([-.\w]+)
record-retry-rate: The average per-second number of retried record sends.
    kafka.producer:type=producer-metrics,client-id=([-.\w]+)
record-error-rate: The average per-second number of record sends that resulted in errors.
    kafka.producer:type=producer-metrics,client-id=([-.\w]+)
record-size-max: The maximum record size.
    kafka.producer:type=producer-metrics,client-id=([-.\w]+)
record-size-avg: The average record size.
    kafka.producer:type=producer-metrics,client-id=([-.\w]+)
requests-in-flight: The current number of in-flight requests awaiting a response.
    kafka.producer:type=producer-metrics,client-id=([-.\w]+)
metadata-age: The age in seconds of the current producer metadata being used.
    kafka.producer:type=producer-metrics,client-id=([-.\w]+)
record-send-rate: The average number of records sent per second for a topic.
    kafka.producer:type=producer-topic-metrics,client-id=([-.\w]+),topic=([-.\w]+)
byte-rate: The average number of bytes sent per second for a topic.
    kafka.producer:type=producer-topic-metrics,client-id=([-.\w]+),topic=([-.\w]+)
compression-rate: The average compression rate of record batches for a topic.
    kafka.producer:type=producer-topic-metrics,client-id=([-.\w]+),topic=([-.\w]+)
record-retry-rate: The average per-second number of retried record sends for a topic.
    kafka.producer:type=producer-topic-metrics,client-id=([-.\w]+),topic=([-.\w]+)
record-error-rate: The average per-second number of record sends that resulted in errors for a topic.
    kafka.producer:type=producer-topic-metrics,client-id=([-.\w]+),topic=([-.\w]+)
produce-throttle-time-max: The maximum time in ms a request was throttled by a broker.
    kafka.producer:type=producer-metrics,client-id=([-.\w]+)
produce-throttle-time-avg: The average time in ms a request was throttled by a broker.
    kafka.producer:type=producer-metrics,client-id=([-.\w]+)
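
Two of the producer attributes combine into a derived figure that is often easier to alert on: the share of buffer memory currently in use. The sketch below is an illustration only. The function name is invented, the values are assumed to have been read from buffer-total-bytes and buffer-available-bytes, and the sample numbers are made up.

```python
def buffer_utilization(total_bytes: float, available_bytes: float) -> float:
    """Fraction of the producer's buffer memory in use, between 0 and 1."""
    if total_bytes <= 0:
        raise ValueError("buffer-total-bytes must be positive")
    # Used memory is whatever is not reported as available.
    return (total_bytes - available_bytes) / total_bytes


# Sample values: a 32 MiB buffer with 8 MiB still free.
print(buffer_utilization(33554432, 8388608))
```

A utilization that stays close to 1 together with a non-zero waiting-threads value suggests that user threads are blocking on buffer memory.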

New consumer monitoring

The following metrics are available on new consumer instances.

Consumer Group Metrics

METRIC/ATTRIBUTE NAME: DESCRIPTION
    MBEAN NAME
commit-latency-avg: The average time taken for a commit request
    kafka.consumer:type=consumer-coordinator-metrics,client-id=([-.\w]+)
commit-latency-max: The max time taken for a commit request
    kafka.consumer:type=consumer-coordinator-metrics,client-id=([-.\w]+)
commit-rate: The number of commit calls per second
    kafka.consumer:type=consumer-coordinator-metrics,client-id=([-.\w]+)
assigned-partitions: The number of partitions currently assigned to this consumer
    kafka.consumer:type=consumer-coordinator-metrics,client-id=([-.\w]+)
heartbeat-response-time-max: The max time taken to receive a response to a heartbeat request
    kafka.consumer:type=consumer-coordinator-metrics,client-id=([-.\w]+)
heartbeat-rate: The average number of heartbeats per second
    kafka.consumer:type=consumer-coordinator-metrics,client-id=([-.\w]+)
join-time-avg: The average time taken for a group rejoin
    kafka.consumer:type=consumer-coordinator-metrics,client-id=([-.\w]+)
join-time-max: The max time taken for a group rejoin
    kafka.consumer:type=consumer-coordinator-metrics,client-id=([-.\w]+)
join-rate: The number of group joins per second
    kafka.consumer:type=consumer-coordinator-metrics,client-id=([-.\w]+)
sync-time-avg: The average time taken for a group sync
    kafka.consumer:type=consumer-coordinator-metrics,client-id=([-.\w]+)
sync-time-max: The max time taken for a group sync
    kafka.consumer:type=consumer-coordinator-metrics,client-id=([-.\w]+)
sync-rate: The number of group syncs per second
    kafka.consumer:type=consumer-coordinator-metrics,client-id=([-.\w]+)
last-heartbeat-seconds-ago: The number of seconds since the last coordinator heartbeat
    kafka.consumer:type=consumer-coordinator-metrics,client-id=([-.\w]+)

Consumer Fetch Metrics

METRIC/ATTRIBUTE NAME: DESCRIPTION
    MBEAN NAME
fetch-size-avg: The average number of bytes fetched per request
    kafka.consumer:type=consumer-fetch-manager-metrics,client-id=([-.\w]+)
fetch-size-max: The maximum number of bytes fetched per request
    kafka.consumer:type=consumer-fetch-manager-metrics,client-id=([-.\w]+)
bytes-consumed-rate: The average number of bytes consumed per second
    kafka.consumer:type=consumer-fetch-manager-metrics,client-id=([-.\w]+)
records-per-request-avg: The average number of records in each request
    kafka.consumer:type=consumer-fetch-manager-metrics,client-id=([-.\w]+)
records-consumed-rate: The average number of records consumed per second
    kafka.consumer:type=consumer-fetch-manager-metrics,client-id=([-.\w]+)
fetch-latency-avg: The average time taken for a fetch request
    kafka.consumer:type=consumer-fetch-manager-metrics,client-id=([-.\w]+)
fetch-latency-max: The max time taken for a fetch request
    kafka.consumer:type=consumer-fetch-manager-metrics,client-id=([-.\w]+)
fetch-rate: The number of fetch requests per second
    kafka.consumer:type=consumer-fetch-manager-metrics,client-id=([-.\w]+)
records-lag-max: The maximum lag in terms of number of records for any partition in this window
    kafka.consumer:type=consumer-fetch-manager-metrics,client-id=([-.\w]+)
fetch-throttle-time-avg: The average throttle time in ms
    kafka.consumer:type=consumer-fetch-manager-metrics,client-id=([-.\w]+)
fetch-throttle-time-max: The maximum throttle time in ms
    kafka.consumer:type=consumer-fetch-manager-metrics,client-id=([-.\w]+)

Topic-level Fetch Metrics

METRIC/ATTRIBUTE NAME: DESCRIPTION
    MBEAN NAME
fetch-size-avg: The average number of bytes fetched per request for a specific topic.
    kafka.consumer:type=consumer-fetch-manager-metrics,client-id=([-.\w]+),topic=([-.\w]+)
fetch-size-max: The maximum number of bytes fetched per request for a specific topic.
    kafka.consumer:type=consumer-fetch-manager-metrics,client-id=([-.\w]+),topic=([-.\w]+)
bytes-consumed-rate: The average number of bytes consumed per second for a specific topic.
    kafka.consumer:type=consumer-fetch-manager-metrics,client-id=([-.\w]+),topic=([-.\w]+)
records-per-request-avg: The average number of records in each request for a specific topic.
    kafka.consumer:type=consumer-fetch-manager-metrics,client-id=([-.\w]+),topic=([-.\w]+)
records-consumed-rate: The average number of records consumed per second for a specific topic.
    kafka.consumer:type=consumer-fetch-manager-metrics,client-id=([-.\w]+),topic=([-.\w]+)

Others

We recommend monitoring GC time and other stats, as well as various server stats such as CPU utilization, I/O service time, etc. On the client side, we recommend monitoring the message/byte rate (global and per topic) and request rate/size/time. On the consumer side, we recommend monitoring the max lag in messages among all partitions and the min fetch request rate. For a consumer to keep up, max lag needs to be less than a threshold and min fetch rate needs to be larger than 0.
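
The keep-up condition in the last sentence can be written as a small predicate. This is a sketch under assumptions: the inputs are taken to be samples of records-lag-max and fetch-rate collected by your own monitoring, and the function name and the threshold are placeholders that depend on the workload.

```python
from typing import Iterable


def consumer_keeps_up(max_lag: float, fetch_rates: Iterable[float], lag_threshold: float) -> bool:
    """True when max lag is below the threshold and the minimum fetch rate is above 0."""
    rates = list(fetch_rates)
    if not rates:
        # No fetch-rate samples at all: treat the consumer as not keeping up.
        return False
    return max_lag < lag_threshold and min(rates) > 0


# Made-up samples: lag of 120 records, three fetch-rate readings, threshold of 1000.
print(consumer_keeps_up(120, [4.2, 3.9, 5.0], lag_threshold=1000))
```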

Audit

The final alerting we do is on the correctness of the data delivery. We audit that every message that is sent is consumed by all consumers and measure the lag for this to occur. For important topics we alert if a certain completeness is not achieved in a certain time period. The details of this are discussed in KAFKA-260.
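
One way to picture the completeness check is to compare per-time-bucket counts of messages sent with counts of messages consumed. The sketch below is not the design discussed in KAFKA-260. The bucket labels, the function names and the 0.999 requirement are all assumptions chosen for illustration.

```python
from typing import Dict, List


def completeness(produced: Dict[str, int], consumed: Dict[str, int]) -> Dict[str, float]:
    """Fraction of sent messages that were consumed, per time bucket."""
    result = {}
    for bucket, sent in produced.items():
        if sent > 0:
            # Cap at the sent count so duplicate deliveries cannot exceed 100%.
            result[bucket] = min(consumed.get(bucket, 0), sent) / sent
    return result


def incomplete_buckets(produced: Dict[str, int], consumed: Dict[str, int],
                       required: float = 0.999) -> List[str]:
    """Buckets whose completeness is below the required fraction."""
    fractions = completeness(produced, consumed)
    return sorted(b for b, frac in fractions.items() if frac < required)


# Made-up counts for two ten-minute buckets.
sent = {"10:00": 1000, "10:10": 500}
seen = {"10:00": 1000, "10:10": 450}
print(incomplete_buckets(sent, seen))
```

An alert for an important topic would fire when a bucket is still listed after the allowed time period has passed.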

Updated on 2020-01-12

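luo, 2 years ago

Has anyone here monitored the Kafka page cache hit ratio?

半兽人 -> luo, 2 years ago

Kafka has a hit ratio? It is not like Redis, you know.

luo -> 半兽人, 2 years ago

I saw this metric mentioned in the official documentation: hitRatio-avg: The average cache hit ratio defined as the ratio of cache read hits over the total cache read requests.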

https://kafka.apache.org/24/documentation.html#kafka_streams_cache_monitoring

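半兽人 -> luo, 2 years ago

That one belongs to Kafka Streams. Are you using it?

黄永杰, 3 years ago

kafka_server_zookeeperclientmetrics_zookeeperrequestlatencyms: has anyone seen a dashboard show very large values for this metric?
The Kafka metrics show 17000+ at the 0.999 quantile, and the 0.99 quantile is also very high.
Does the author understand what this metric means?

半兽人 -> 黄永杰, 3 years ago

Judging by the name, it is the latency (in milliseconds) of Kafka's requests to ZooKeeper. The larger the value, the higher the latency. I would take it at face value.

黄永杰 -> 黄永杰, 3 years ago

This is the format in which the Kafka JMX metrics show it: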

kafka_server_zookeeperclientmetrics_zookeeperrequestlatencyms{quantile="0.50"} 1.0
kafka_server_zookeeperclientmetrics_zookeeperrequestlatencyms{quantile="0.75"} 1.0
kafka_server_zookeeperclientmetrics_zookeeperrequestlatencyms{quantile="0.95"} 4.0
kafka_server_zookeeperclientmetrics_zookeeperrequestlatencyms{quantile="0.98"} 14587.7
kafka_server_zookeeperclientmetrics_zookeeperrequestlatencyms{quantile="0.99"} 17068.0
kafka_server_zookeeperclientmetrics_zookeeperrequestlatencyms{quantile="0.999",} 17068.0

An article I read explains quantiles like this: if the 0.9-quantile is 120, then 90% of all sampled values are less than 120.

https://cloud.tencent.com/developer/news/319419

So it seems this cannot simply be read as a latency value…
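
The quantile definition quoted above can be checked against the pasted lines with a short parser. This is a minimal sketch that handles only lines of the shape shown in this thread; the regular expression and the function name are written for this example and are not a general parser for the Prometheus exposition format.

```python
import re
from typing import Dict, Iterable

# Matches: metric_name{quantile="0.99"} 17068.0 (a trailing comma before } is tolerated).
LINE = re.compile(r'^(?P<name>\w+)\{quantile="(?P<q>[0-9.]+)",?\}\s+(?P<value>\S+)$')


def parse_quantiles(lines: Iterable[str]) -> Dict[float, float]:
    """Map each quantile to its sampled value in milliseconds."""
    out = {}
    for line in lines:
        match = LINE.match(line.strip())
        if match:
            out[float(match.group("q"))] = float(match.group("value"))
    return out


sample = [
    'kafka_server_zookeeperclientmetrics_zookeeperrequestlatencyms{quantile="0.50"} 1.0',
    'kafka_server_zookeeperclientmetrics_zookeeperrequestlatencyms{quantile="0.98"} 14587.7',
    'kafka_server_zookeeperclientmetrics_zookeeperrequestlatencyms{quantile="0.99"} 17068.0',
    'kafka_server_zookeeperclientmetrics_zookeeperrequestlatencyms{quantile="0.999",} 17068.0',
]
q = parse_quantiles(sample)
print(q[0.5], q[0.99])
```

Under that definition each value is still a latency in milliseconds: about half of the sampled requests finished within 1 ms, while roughly the slowest 2% took more than 14 seconds.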

黄永杰 -> 黄永杰, 3 years ago

https://grafana.com/grafana/dashboards/11962
This Prometheus dashboard computes the latency directly with sum(kafka_server_zookeeperclientmetrics_zookeeperrequestlatencyms{job=\"$job\",instance=~\"$broker\"})by(instance)
which does not seem right to me.

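李东, 3 years ago

Does anyone have an example of monitoring a Kafka cluster with jmxtrans, writing the metrics to InfluxDB, and displaying them in Grafana? I am mainly looking for the jmxtrans JSON and the Grafana dashboard JSON. Thanks!

喵帕斯~, 6 years ago

How can I monitor when rebalances happen, how many there are, and similar information?

木木&很呆, 6 years ago

Why can't I find the kafka.consumer classes in JConsole?

Have you enabled JMX on your consumer?

+1. A Baidu search says it is a version issue, for example that 2.2 removed kafka.consumer. I still have not found how to get the information under kafka.consumer.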

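鹰击长空, 7 years ago

I have two questions:

  1. Is it feasible to monitor the maximum producer response wait time for each topic?
  2. request-latency-max
    The maximum request latency in ms.

    kafka.producer:type=producer-metrics,client-id=([-.\w]+)

    If I want to monitor this metric, what is the client-id when I access it from a program, and how do I obtain it? Many thanks.

半兽人 -> 鹰击长空, 7 years ago

The consumer group ID you specified.

半兽人 -> 鹰击长空, 7 years ago

Explanation of client-id: an id string passed to the server when making requests. Its purpose is to let the server's request logging record this [logical application name], so that the source of a request can be traced by more than just ip/port.

鹰击长空 -> 半兽人, 7 years ago

Thanks for the explanation. In a real application, how do I get this client-id from a program?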
