Essential metrics to monitor

Cloudera Manager collects a high number of performance metrics for the Kafka services running on your clusters. Certain metrics should be monitored in any Kafka deployment as they can help you to improve the stability and performance of your Kafka deployment.

The following tables collect the Kafka broker metrics that Cloudera recommends you to monitor in any Kafka deployment. For more information on metrics, including a full list of Kafka metrics, see Cloudera Manager Metrics.

Table 1. ZooKeeper connectivity metrics
Metric Name Description Unit Importance Parents Version Availability
kafka_zookeeper_expires_rate Measures the session expires per second. Expires per second High cluster, kafka, rack CDH 5, CDH 6, Cloudera 7
kafka_zookeeper_request_latency_avg Request latency between the broker and Zookeeper. ms High cluster, kafka, rack CDH 5, CDH 6, Cloudera 7
Table 2. KRaft controller metrics
Metric Name Description Unit Importance Parents Version Availability
kafka_active_controller Will be 1 if this instance is the active controller, 0 otherwise. Number of controllers High cluster, kafka, rack Cloudera 7
kafka_preferred_replica_imbalance Number of partitions where the lead replica is not the preferred replica. Partition count Medium cluster, kafka, rack Cloudera 7
kafka_offline_partitions Number of unavailable partitions. Partition count High cluster, kafka, rack Cloudera 7
kafka_global_partition_count Total number of partitions in the cluster. Partition count Low cluster, kafka, rack Cloudera 7
kafka_global_topic_count Total number of topics in the cluster. Topic count Low cluster, kafka, rack Cloudera 7
kafka_active_broker_count Number of active brokers in the cluster. Broker count High cluster, kafka, rack Cloudera 7
kafka_fenced_broker_count Number of fenced brokers in the cluster. Broker count High cluster, kafka, rack Cloudera 7
kafka_zk_migration_state An enumeration of the possible migration states the cluster can be in. 0 = NONE, cluster created in KRaft mode; 4 = ZK, Migration has not started, controller is a ZK controller; 2 = PRE_MIGRATION, the KRaft Controller is waiting for all ZK brokers to register in migration mode; 1 = MIGRATION, ZK metadata has been migrated, but some broker is still running in ZK mode; 3 = POST_MIGRATION, the cluster migration is complete. Discrete states Medium cluster, kafka, rack Cloudera 7
kafka_metadata_log_end_offset The current raft log end offset. Offset count Low cluster, kafka, rack Cloudera 7
kafka_metadata_high_watermark The metadata high watermark maintained on this member; -1 if it is unknown. Offset count Medium cluster, kafka, rack Cloudera 7
kafka_metadata_fetch_records_avg_count The average number of records fetched from the leader of the raft quorum. Message count Low cluster, kafka, rack Cloudera 7
kafka_quorum_member_state The state of the quorum member. 0 = leader; 1 = candidate; 2 = voted; 3 = follower; 4 = unattached; 5 = observer. Discrete states High cluster, kafka, rack Cloudera 7
Table 3. Network metrics
Metric Name Description Unit Importance Parents Version Availability
kafka_network_processor_avg_idle The average free capacity of the network processors. Should be > 0.3. Percentage of free capacity High cluster, kafka, rack CDH 5, CDH 6, Cloudera 7
kafka_request_queue_size Size of the request queue in Kafka. Message count Medium cluster, kafka, rack CDH 5, CDH 6, Cloudera 7
kafka_response_queue_size Size of the response queue in Kafka. Message count Medium cluster, kafka, rack CDH 5, CDH 6, Cloudera 7
kafka_messages_received_rate Number of messages written to topic on this broker. Messages per second Medium cluster, kafka, rack CDH 5, CDH 6, Cloudera 7
Table 4. Disk utilization metrics
Metric Name Description Unit Importance Parents Version Availability
kafka_produce_local_time_rate Local Time spent in responding to Produce requests. Requests per second Low cluster, kafka, rack CDH 5, CDH 6, Cloudera 7
kafka_log_flush_rate Rate of flushing Kafka logs to disk. Flushes per second Low cluster, kafka, rack CDH 5, CDH 6, Cloudera 7
kafka_request_handler_avg_idle_rate The average free capacity of the request handler. Should be > 0.3. Percentage of free capacity High cluster, kafka, rack CDH 5, CDH 6, Cloudera 7
Table 5. Kafka metrics
Metric Name Description Unit Importance Parents Version Availability
kafka_broker_state The state the broker is in. 0 = NotRunning, 1 = Starting, 2 = RecoveringFromUncleanShutdown, 3 = RunningAsBroker, 4 = RunningAsController, 6 = PendingControlledShutdown, 7 = BrokerShuttingDown Discrete states Medium cluster, kafka, rack CDH 5, CDH 6, Cloudera 7
kafka_jvm_gc_runs_rate Number of garbage collector runs performed on this broker. Events per second Medium cluster, kafka, rack CDH 5, CDH 6, Cloudera 7
kafka_isr_expands_rate ISR expands per second. Events per second Medium cluster, kafka, rack CDH 5, CDH 6, Cloudera 7
kafka_isr_shrinks_rate ISR shrinks per second. Events per second Medium cluster, kafka, rack CDH 5, CDH 6, Cloudera 7
kafka_max_replication_lag Maximum replication lag on the broker, across all fetchers, topics, and partitions. Messages Medium cluster, kafka, rack CDH 5, CDH 6, Cloudera 7
kafka_offline_partitions Number of offline partitions. Partition count High cluster, kafka, rack CDH 5, CDH 6, Cloudera 7
kafka_under_min_isr_partition_count Count of partitions with less than the configured minimum in-sync replicas available. Partition count High cluster, kafka, rack CDH 5, CDH 6, Cloudera 7
kafka_under_replicated_partitions Count of partitions with unavailable replicas. Partition count Low cluster, kafka, rack CDH 5, CDH 6, Cloudera 7
kafka_active_controller Shows the number of active controllers at a given time. Ideally it should be 1. Number of controllers High cluster, kafka, rack CDH 5, CDH 6, Cloudera 7