EDP

  1. Home
  2. Docs
  3. EDP
  4. 6. OPERATIONS
  5. 6.6 Monitoring

6.6 Monitoring

Kafka uses Yammer Metrics for metrics reporting in the server. The Java clients use Kafka Metrics, a built-in metrics registry that minimizes transitive dependencies pulled into client applications. Both expose metrics via JMX and can be configured to report stats using pluggable stats reporters to hook up to your monitoring system.

All Kafka rate metrics have a corresponding cumulative count metric with suffix -total. For example, records-consumed-rate has a corresponding metric named records-consumed-total.

The easiest way to see the available metrics is to fire up jconsole and point it at a running kafka client or server; this will allow browsing all metrics with JMX.

Security Considerations for Remote Monitoring using JMX

Apache Kafka disables remote JMX by default. You can enable remote monitoring using JMX by setting the environment variable JMX_PORT for processes started using the CLI or standard Java system properties to enable remote JMX programmatically. You must enable security when enabling remote JMX in production scenarios to ensure that unauthorized users cannot monitor or control your broker or application as well as the platform on which these are running. Note that authentication is disabled for JMX by default in Kafka and security configs must be overridden for production deployments by setting the environment variable KAFKA_JMX_OPTS for processes started using the CLI or by setting appropriate Java system properties. See Monitoring and Management Using JMX Technology for details on securing JMX.

We do graphing and alerting on the following metrics:

DESCRIPTIONMBEAN NAMENORMAL VALUE
Message in ratekafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec
Byte in rate from clientskafka.server:type=BrokerTopicMetrics,name=BytesInPerSec
Byte in rate from other brokerskafka.server:type=BrokerTopicMetrics,name=ReplicationBytesInPerSec
Request ratekafka.network:type=RequestMetrics,name=RequestsPerSec,request={Produce|FetchConsumer|FetchFollower}
Error ratekafka.network:type=RequestMetrics,name=ErrorsPerSec,request=([-.\w]+),error=([-.\w]+)Number of errors in responses counted per-request-type, per-error-code. If a response contains multiple errors, all are counted. error=NONE indicates successful responses.
Request size in byteskafka.network:type=RequestMetrics,name=RequestBytes,request=([-.\w]+)Size of requests for each request type.
Temporary memory size in byteskafka.network:type=RequestMetrics,name=TemporaryMemoryBytes,request={Produce|Fetch}Temporary memory used for message format conversions and decompression.
Message conversion timekafka.network:type=RequestMetrics,name=MessageConversionsTimeMs,request={Produce|Fetch}Time in milliseconds spent on message format conversions.
Message conversion ratekafka.server:type=BrokerTopicMetrics,name={Produce|Fetch}MessageConversionsPerSec,topic=([-.\w]+)Number of records which required message format conversion.
Byte out rate to clientskafka.server:type=BrokerTopicMetrics,name=BytesOutPerSec
Byte out rate to other brokerskafka.server:type=BrokerTopicMetrics,name=ReplicationBytesOutPerSec
Message validation failure rate due to no key specified for compacted topickafka.server:type=BrokerTopicMetrics,name=NoKeyCompactedTopicRecordsPerSec
Message validation failure rate due to invalid magic numberkafka.server:type=BrokerTopicMetrics,name=InvalidMagicNumberRecordsPerSec
Message validation failure rate due to incorrect crc checksumkafka.server:type=BrokerTopicMetrics,name=InvalidMessageCrcRecordsPerSec
Message validation failure rate due to non-continuous offset or sequence number in batchkafka.server:type=BrokerTopicMetrics,name=InvalidOffsetOrSequenceRecordsPerSec
Log flush rate and timekafka.log:type=LogFlushStats,name=LogFlushRateAndTimeMs
# of under replicated partitions (the number of non-reassigning replicas – the number of ISR > 0)kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions0
# of under minIsr partitions (|ISR| < min.insync.replicas)kafka.server:type=ReplicaManager,name=UnderMinIsrPartitionCount0
# of at minIsr partitions (|ISR| = min.insync.replicas)kafka.server:type=ReplicaManager,name=AtMinIsrPartitionCount0
# of offline log directorieskafka.log:type=LogManager,name=OfflineLogDirectoryCount0
Is controller active on brokerkafka.controller:type=KafkaController,name=ActiveControllerCountonly one broker in the cluster should have 1
Leader election ratekafka.controller:type=ControllerStats,name=LeaderElectionRateAndTimeMsnon-zero when there are broker failures
Unclean leader election ratekafka.controller:type=ControllerStats,name=UncleanLeaderElectionsPerSec0
Pending topic deleteskafka.controller:type=KafkaController,name=TopicsToDeleteCount
Pending replica deleteskafka.controller:type=KafkaController,name=ReplicasToDeleteCount
Ineligible pending topic deleteskafka.controller:type=KafkaController,name=TopicsIneligibleToDeleteCount
Ineligible pending replica deleteskafka.controller:type=KafkaController,name=ReplicasIneligibleToDeleteCount
Partition countskafka.server:type=ReplicaManager,name=PartitionCountmostly even across brokers
Leader replica countskafka.server:type=ReplicaManager,name=LeaderCountmostly even across brokers
ISR shrink ratekafka.server:type=ReplicaManager,name=IsrShrinksPerSecIf a broker goes down, ISR for some of the partitions will shrink. When that broker is up again, ISR will be expanded once the replicas are fully caught up. Other than that, the expected value for both ISR shrink rate and expansion rate is 0. 
ISR expansion ratekafka.server:type=ReplicaManager,name=IsrExpandsPerSecSee above
Max lag in messages btw follower and leader replicaskafka.server:type=ReplicaFetcherManager,name=MaxLag,clientId=Replicalag should be proportional to the maximum batch size of a produce request.
Lag in messages per follower replicakafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=([-.\w]+),topic=([-.\w]+),partition=([0-9]+)lag should be proportional to the maximum batch size of a produce request.
Requests waiting in the producer purgatorykafka.server:type=DelayedOperationPurgatory,name=PurgatorySize,delayedOperation=Producenon-zero if ack=-1 is used
Requests waiting in the fetch purgatorykafka.server:type=DelayedOperationPurgatory,name=PurgatorySize,delayedOperation=Fetchsize depends on fetch.wait.max.ms in the consumer
Request total timekafka.network:type=RequestMetrics,name=TotalTimeMs,request={Produce|FetchConsumer|FetchFollower}broken into queue, local, remote and response send time
Time the request waits in the request queuekafka.network:type=RequestMetrics,name=RequestQueueTimeMs,request={Produce|FetchConsumer|FetchFollower}
Time the request is processed at the leaderkafka.network:type=RequestMetrics,name=LocalTimeMs,request={Produce|FetchConsumer|FetchFollower}
Time the request waits for the followerkafka.network:type=RequestMetrics,name=RemoteTimeMs,request={Produce|FetchConsumer|FetchFollower}non-zero for produce requests when ack=-1
Time the request waits in the response queuekafka.network:type=RequestMetrics,name=ResponseQueueTimeMs,request={Produce|FetchConsumer|FetchFollower}
Time to send the responsekafka.network:type=RequestMetrics,name=ResponseSendTimeMs,request={Produce|FetchConsumer|FetchFollower}
Number of messages the consumer lags behind the producer by. Published by the consumer, not broker.kafka.consumer:type=consumer-fetch-manager-metrics,client-id={client-id} Attribute: records-lag-max
The average fraction of time the network processors are idlekafka.network:type=SocketServer,name=NetworkProcessorAvgIdlePercentbetween 0 and 1, ideally > 0.3
The number of connections disconnected on a processor due to a client not re-authenticating and then using the connection beyond its expiration time for anything other than re-authenticationkafka.server:type=socket-server-metrics,listener=[SASL_PLAINTEXT|SASL_SSL],networkProcessor=<#>,name=expired-connections-killed-countideally 0 when re-authentication is enabled, implying there are no longer any older, pre-2.2.0 clients connecting to this (listener, processor) combination
The total number of connections disconnected, across all processors, due to a client not re-authenticating and then using the connection beyond its expiration time for anything other than re-authenticationkafka.network:type=SocketServer,name=ExpiredConnectionsKilledCountideally 0 when re-authentication is enabled, implying there are no longer any older, pre-2.2.0 clients connecting to this broker
The average fraction of time the request handler threads are idlekafka.server:type=KafkaRequestHandlerPool,name=RequestHandlerAvgIdlePercentbetween 0 and 1, ideally > 0.3
Bandwidth quota metrics per (user, client-id), user or client-idkafka.server:type={Produce|Fetch},user=([-.\w]+),client-id=([-.\w]+)Two attributes. throttle-time indicates the amount of time in ms the client was throttled. Ideally = 0. byte-rate indicates the data produce/consume rate of the client in bytes/sec. For (user, client-id) quotas, both user and client-id are specified. If per-client-id quota is applied to the client, user is not specified. If per-user quota is applied, client-id is not specified.
Request quota metrics per (user, client-id), user or client-idkafka.server:type=Request,user=([-.\w]+),client-id=([-.\w]+)Two attributes. throttle-time indicates the amount of time in ms the client was throttled. Ideally = 0. request-time indicates the percentage of time spent in broker network and I/O threads to process requests from client group. For (user, client-id) quotas, both user and client-id are specified. If per-client-id quota is applied to the client, user is not specified. If per-user quota is applied, client-id is not specified.
Requests exempt from throttlingkafka.server:type=Requestexempt-throttle-time indicates the percentage of time spent in broker network and I/O threads to process requests that are exempt from throttling.
ZooKeeper client request latencykafka.server:type=ZooKeeperClientMetrics,name=ZooKeeperRequestLatencyMsLatency in millseconds for ZooKeeper requests from broker.
ZooKeeper connection statuskafka.server:type=SessionExpireListener,name=SessionStateConnection status of broker’s ZooKeeper session which may be one of Disconnected|SyncConnected|AuthFailed|ConnectedReadOnly|SaslAuthenticated|Expired.
Max time to load group metadatakafka.server:type=group-coordinator-metrics,name=partition-load-time-maxmaximum time, in milliseconds, it took to load offsets and group metadata from the consumer offset partitions loaded in the last 30 seconds (including time spent waiting for the loading task to be scheduled)
Avg time to load group metadatakafka.server:type=group-coordinator-metrics,name=partition-load-time-avgaverage time, in milliseconds, it took to load offsets and group metadata from the consumer offset partitions loaded in the last 30 seconds (including time spent waiting for the loading task to be scheduled)
Max time to load transaction metadatakafka.server:type=transaction-coordinator-metrics,name=partition-load-time-maxmaximum time, in milliseconds, it took to load transaction metadata from the consumer offset partitions loaded in the last 30 seconds (including time spent waiting for the loading task to be scheduled)
Avg time to load transaction metadatakafka.server:type=transaction-coordinator-metrics,name=partition-load-time-avgaverage time, in milliseconds, it took to load transaction metadata from the consumer offset partitions loaded in the last 30 seconds (including time spent waiting for the loading task to be scheduled)
Consumer Group Offset Countkafka.server:type=GroupMetadataManager,name=NumOffsetsTotal number of committed offsets for Consumer Groups
Consumer Group Countkafka.server:type=GroupMetadataManager,name=NumGroupsTotal number of Consumer Groups
Consumer Group Count, per Statekafka.server:type=GroupMetadataManager,name=NumGroups[PreparingRebalance,CompletingRebalance,Empty,Stable,Dead]The number of Consumer Groups in each state: PreparingRebalance, CompletingRebalance, Empty, Stable, Dead
Number of reassigning partitionskafka.server:type=ReplicaManager,name=ReassigningPartitionsThe number of reassigning leader partitions on a broker.
Outgoing byte rate of reassignment traffickafka.server:type=BrokerTopicMetrics,name=ReassignmentBytesOutPerSec
Incoming byte rate of reassignment traffickafka.server:type=BrokerTopicMetrics,name=ReassignmentBytesInPerSec