Fixed Issues in HBase

Review the list of HBase issues that are resolved in Cloudera Runtime 7.3.1, its service packs and cumulative hotfixes.

Cloudera Runtime 7.3.1.400 SP2

CDPD-84435: The upgrade operation fails with a message “Failed to decommission RegionServer.”
7.3.1.400
This issue is fixed. The HBase Master now aborts when it detects a WALSyncTimeoutException while making edits to the MasterRegion.

Apache Jira: HBASE-28803

CDPD-83544: Performance degradation may occur when using the persistent cache, causing a restart that involves full cache recovery.
7.3.1.400
This issue is fixed.

Apache Jira: HBASE-29326

CDPD-81524: Add configurable throttling of region moves in CacheAwareLoadBalancer
7.3.1.400
This fix introduces throttling of region moves for LoadBalancer implementations. The throttling time is configurable through the hbase.master.balancer.move.throttlingMillis property, with a default value of 60000 milliseconds.

In this change, the only balancer implementation that applies throttling is the CacheAwareLoadBalancer. All other balancers inherit the no-op default provided by the LoadBalancer interface.

The CacheAwareLoadBalancer throttling implementation performs throttling only for regions moving to the target server with a region cached ratio below the threshold configurable by hbase.master.balancer.stochastic.throttling.cacheRatio (80% by default).
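The throttling rule described above can be sketched as follows. This is a minimal illustration in Python, not the HBase implementation: the function and constant names are hypothetical, and applying the delay as a simple sleep is an assumption. Only the two property names and their default values come from this release note.

```python
import time

# Defaults quoted in the release note; the Python names are hypothetical.
MOVE_THROTTLING_MILLIS = 60000  # hbase.master.balancer.move.throttlingMillis
THROTTLING_CACHE_RATIO = 0.8    # hbase.master.balancer.stochastic.throttling.cacheRatio


def should_throttle(cached_ratio_on_target, threshold=THROTTLING_CACHE_RATIO):
    """Return True when the region's cached ratio on the target server is below the threshold."""
    return cached_ratio_on_target < threshold


def throttle_move(cached_ratio_on_target,
                  throttling_millis=MOVE_THROTTLING_MILLIS,
                  sleep=time.sleep):
    """Delay a region move only when the rule applies; return the delay in milliseconds."""
    if not should_throttle(cached_ratio_on_target):
        return 0
    # Assumption: the delay is modeled as a plain sleep before the move.
    sleep(throttling_millis / 1000.0)
    return throttling_millis
```

A move to a server that already caches 90% of the region is not delayed, while a move to a server that caches 10% of it waits for the configured time.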

Apache Jira: HBASE-29168

CDPD-81524: The ENCODED_DATA block type is not considered within BucketCache.notifyFileCachingCompleted
7.3.1.400
This fix addresses a defect in BucketCache.notifyFileCachingCompleted, where only blocks of the DATA type were counted. When an encoding such as FASTDIFF is used, the data block type becomes ENCODED_DATA, so these blocks were not accounted for in the internal cache metrics. This affected the cache-aware balancer after cache recovery following a crash or restart (with persistent cache enabled), because the percentage of each region held in the cache was not calculated accurately.
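The difference between the two counting rules can be illustrated with a short sketch. The Python enum and function names below are hypothetical stand-ins; only the DATA and ENCODED_DATA block types are taken from the description above, and LEAF_INDEX is included merely as an example of a non-data block.

```python
from enum import Enum


class BlockType(Enum):
    DATA = "DATA"
    ENCODED_DATA = "ENCODED_DATA"
    LEAF_INDEX = "LEAF_INDEX"  # example of a block that is not a data block


def count_data_blocks_before_fix(blocks):
    """Old rule: only DATA blocks are counted, so encoded blocks are missed."""
    return sum(1 for b in blocks if b is BlockType.DATA)


def count_data_blocks_after_fix(blocks):
    """New rule: DATA and ENCODED_DATA blocks both count as data blocks."""
    data_types = (BlockType.DATA, BlockType.ENCODED_DATA)
    return sum(1 for b in blocks if b in data_types)
```

For a file written with an encoding, every data block is of type ENCODED_DATA, so the old rule reports zero cached data blocks even when the file is fully cached.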

Apache Jira: HBASE-29243

CDPD-81524: Enable BlockCache implementations to define dynamic properties
7.3.1.400
This resolution introduces dynamic configurability for the following properties related to free space management and block prioritization:
  • hbase.bucketcache.acceptfactor
  • hbase.bucketcache.minfactor
  • hbase.bucketcache.extrafreefactor
  • hbase.bucketcache.single.factor
  • hbase.bucketcache.multi.factor
  • hbase.bucketcache.memory.factor
  • hbase.bucketcache.queue.addition.waittime
  • hbase.bucketcache.persist.intervalinmillis
  • hbase.bucketcache.persistence.chunksize

Apache Jira: HBASE-29249

CDPD-81524: Display hit ratio metrics by configurable, granular periods
7.3.1.400
This change introduces two additional properties:
  • hbase.blockcache.stats.periods, which defines the number of periods (windows) to report
  • hbase.blockcache.stats.period.minute, which defines the length of each of these periods, in minutes

If hbase.blockcache.stats.periods is defined and is greater than one, HBase creates a scheduled executor that rolls the metrics calculation at the hbase.blockcache.stats.period.minute rate. The hit ratio is then calculated for each of the last periods (as defined by hbase.blockcache.stats.periods), accounting for only the hits and requests that occurred during the interval of the given period (as defined by hbase.blockcache.stats.period.minute).
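The per-period calculation can be sketched as a small rolling-window counter. This is an illustrative Python model under stated assumptions, not the HBase code: the class and method names are hypothetical, and roll() stands in for the call that the scheduled executor would make once per period.

```python
from collections import deque


class PeriodicHitRatio:
    """Keeps hit and request counts for the last N completed periods."""

    def __init__(self, periods):
        # Completed periods, oldest first; older ones are dropped automatically.
        self.windows = deque(maxlen=periods)
        self.hits = 0
        self.requests = 0

    def record(self, hit):
        """Count one cache request in the current period."""
        self.requests += 1
        if hit:
            self.hits += 1

    def roll(self):
        """Close the current period and start a new one with zeroed counters."""
        self.windows.append((self.hits, self.requests))
        self.hits = 0
        self.requests = 0

    def ratios(self):
        """Hit ratio of each retained period; 0.0 for a period without requests."""
        return [h / r if r else 0.0 for h, r in self.windows]
```

Because the counters are reset on every roll, each ratio reflects only the requests made during its own period rather than the lifetime of the process.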

Apache Jira: HBASE-29276

CDPD-81524: Avoid adding new blocks during prefetch if usage is greater than the accept factor
7.3.1.400
Previously, when cache prefetch was enabled and cache usage reached the configured acceptance factor, it resulted in a cycle of frequent mass block evictions until the prefetch thread completed reading the entire file. This process proved to be both costly and inefficient. An initial attempt to mitigate this issue was proposed in HBASE-28176; however, that solution only interrupted the prefetch thread after it had already attempted to cache the current block being read, which could still trigger a mass eviction.

To completely avoid evictions triggered solely by prefetch, this change evaluates the impact of adding the current block to the cache before attempting to write it. This check runs only when caching from prefetch threads; standard client reads and HFile writes still attempt to cache the associated block.
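The check described above can be sketched as follows. The function and parameter names are hypothetical, and comparing the projected usage against capacity multiplied by the accept factor is an assumption about how the impact is evaluated.

```python
def should_cache_block(block_size, used_bytes, capacity_bytes,
                       accept_factor, from_prefetch):
    """Decide whether a block should be written to the cache."""
    if not from_prefetch:
        # Client reads and HFile writes still attempt to cache the block.
        return True
    # Assumption: the impact is the usage the cache would reach with this block.
    projected_usage = used_bytes + block_size
    return projected_usage <= capacity_bytes * accept_factor
```

With a capacity of 100 bytes, 90 bytes in use, and an accept factor of 0.95, a prefetch thread skips a 10-byte block but still caches a 4-byte block, while a client read attempts to cache either one.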

Apache Jira: HBASE-29288

Cloudera Runtime 7.3.1.300 SP1 CHF 1

There are no fixed issues in this release.

Cloudera Runtime 7.3.1.200 SP1

CDPD-77399: HBase fails to register the servlet metrics and throws ClassNotFoundException: org.apache.hadoop.metrics.MetricsServlet
This issue is fixed. HBase no longer warns about the Hadoop 2-based metric servlet class on a Hadoop 3 deployment.

Apache Jira: HBASE-28315

Cloudera Runtime 7.3.1.100 CHF 1

There are no fixed issues in this release.

Cloudera Runtime 7.3.1

CDPD-67520: JWT authentication expects [sub] claim in the payload

A JWT payload can have a custom claim for Subject/Principal instead of the standard sub claim.

You can set the hbase.security.oauth.jwt.token.principal.claim configuration property in Cloudera Manager under HBase Service Advanced Configuration Snippet (Safety Valve) for hbase-site.xml to define the custom Subject/Principal claim.
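The effect of a configurable principal claim can be illustrated with a short sketch. This Python function is hypothetical and does not reflect how HBase parses tokens: it only decodes the payload and deliberately skips signature verification, so it must not be used for real authentication. The claim name upn in the usage note is an example, not a value taken from this document.

```python
import base64
import json


def principal_from_jwt(token, principal_claim="sub"):
    """Read the principal from a JWT payload WITHOUT verifying the signature."""
    payload_b64 = token.split(".")[1]
    # JWT segments are base64url-encoded without padding; restore it first.
    padded = payload_b64 + "=" * (-len(payload_b64) % 4)
    payload = json.loads(base64.urlsafe_b64decode(padded))
    # Raises KeyError if the configured claim is missing from the payload.
    return payload[principal_claim]
```

Calling principal_from_jwt(token) reads the standard sub claim, while principal_from_jwt(token, "upn") reads a custom claim instead, which mirrors what the configuration property selects.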

CDPD-66387: RegionServer should be aborted when WAL.sync throws TimeoutIOException
This fix adds logic for handling WAL.sync timeouts. If WAL.sync throws a timeout exception, HBase wraps the TimeoutIOException in a special WALSyncTimeoutIOException. When an upper layer, such as HRegion.doMiniBatchMutate called by HRegion.batchMutation, catches this special exception, HBase aborts the region server.
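The wrap-and-abort pattern described above can be sketched in a few lines. This is a simplified Python model, not the HBase code: the exception classes only borrow the names from the description, and RegionServer, wal_sync, and batch_mutate are hypothetical stand-ins.

```python
class TimeoutIOException(IOError):
    """Stand-in for a generic I/O timeout."""


class WALSyncTimeoutIOException(IOError):
    """Stand-in for the special exception that marks a WAL sync timeout."""


class RegionServer:
    def __init__(self):
        self.aborted = False

    def abort(self, reason):
        self.aborted = True
        self.reason = reason


def wal_sync(do_sync):
    """Run the sync and wrap a timeout so that callers can recognize it."""
    try:
        do_sync()
    except TimeoutIOException as e:
        raise WALSyncTimeoutIOException("WAL sync timed out") from e


def batch_mutate(server, do_sync):
    """Upper layer: abort the server only for the special sync timeout."""
    try:
        wal_sync(do_sync)
    except WALSyncTimeoutIOException as e:
        server.abort(str(e))
        raise
```

Using a dedicated exception type lets the upper layer tell a WAL sync timeout apart from other I/O errors, which do not trigger an abort in this sketch.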

Apache Jira: HBASE-27230

CDPD-65373: Make delay prefetch property dynamically configurable
This change allows you to dynamically configure the hbase.hfile.prefetch.delay property using Cloudera Manager. Update the value and refresh the HBase service; the new value is then applied to the HBase service automatically.

Apache Jira: HBASE-28292

CDPD-74494: JVM crashes intermittently on ARM64 machines
JVM crashes were observed in HBase services running on the arm64 architecture with JDK 17. The issue was traced to a specific module that had a very large member function. The fix refactors the module and splits the large function into multiple smaller functions.

Apache Jira: HBASE-28206

CDPD-73117: Bucket cache utilization is dropped after a rolling restart
For a persistent bucket cache larger than 1.3 TB, the corresponding backing map information (information related to the persistent cache) grows beyond 2 GB. However, 2 GB is the size limit for protobuf messages, which are used to persist the backing map information. When the message grew beyond 2 GB, the backing map was only partially persisted, and after a restart the cache appeared to have shrunk.

With this fix, the backing map information is split into chunks smaller than 2 GB. All of the information, even beyond 2 GB, is now persisted and can be retrieved after a rolling restart.
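The chunking idea can be sketched as follows. This is an illustrative Python function with hypothetical names. It groups entries by byte size for simplicity; the criterion that HBase actually uses to delimit chunks is not stated here.

```python
def chunk_entries(entries, max_chunk_bytes, size_of=len):
    """Group entries into consecutive chunks that stay within max_chunk_bytes.

    A single entry larger than the limit is placed in a chunk of its own.
    """
    chunks, current, current_size = [], [], 0
    for entry in entries:
        size = size_of(entry)
        # Start a new chunk when adding this entry would exceed the limit.
        if current and current_size + size > max_chunk_bytes:
            chunks.append(current)
            current, current_size = [], 0
        current.append(entry)
        current_size += size
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be serialized as its own message, and reading the chunks back in order restores the complete set of entries.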

OPSAPS-70946: The hbase-site.xml file does not contain xinclude for the refreshable files
HBase now supports generating hbase-site.xml with xinclude, which is needed for the hbase-site-refreshable.xml file.

OPSAPS-70908: Refresh cluster command fails during ephemeral cache zero downtime upgrade
When Kerberos was enabled, configurations from refreshable files encountered an authentication failure during the refresh command.
hbase/hbase.sh
["refresh-regionserver","hbase.hfile.prefetch.delay","hbase.rs.cacheblocksonwrite",
"hbase.block.data.cacheonread","hbase.rs.evictblocksonclose"]

To fix this, RegionServerRefreshCommand now sets SCM_KERBEROS_PRINCIPAL as the Kerberos principal in the environment of the region server refresh process.

OPSAPS-70866: Invalid HBase prefetch configurations during rolling runtime upgrade
The default values of hbase_hfile_prefetch_delay and hbase.block.data.cacheonread are reverted to 1000 ms and true, respectively.

OPSAPS-70294: HBase must use load balancing for the WEBHBASE Knox service
For CDPD 7.3.0 and later, the WEBHBASE service is configured for sticky load balancing instead of high availability in Knox.

OPSAPS-70035: HBase ZooKeeper client TLS toggle should also control the daemon roles
This issue is fixed. HBase ZooKeeper secure client mode now affects all roles.

OPSAPS-69983: Set ZooKeeper store types to HBase service configuration
HBase now automatically sets the ZooKeeper truststore type based on ScmParams.

OPSAPS-69805: HBase client configuration does not use a secure port if Client TLS is enabled
HBase uses a secure ZooKeeper port in client connections only if it is explicitly enabled.

OPSAPS-69757: Make HBase TLS connection to ZooKeeper disabled by default
The HBase TLS connection to ZooKeeper is now disabled by default because it breaks some use cases. HBase introduces a new property to enable or disable it in client roles. The default value is disabled.

OPSAPS-57937: No alerts are generated when the HBase process is in a hung state
Previously, HBase Master monitoring (canary) showed a green status even if the Master had not initialized yet. This fix adds extra checks that query HBase to verify that it is up and running.

OPSAPS-53851: ZooKeeper SSL/TLS support for HBase
Cloudera Manager configures HBase for a secure ZooKeeper connection if ZooKeeper TLS is enabled.

CDPD-74725: HBase throws org.apache.hbase.thirdparty.io.netty.util.ResourceLeakDetector exception
Direct memory buffer leaks in HBase, which could lead to heap issues in the long run, are fixed.

Apache Jiras: HBASE-28890 and HBASE-28893

CDPD-72120: Allow specifying a filter for the REST multiget endpoint (addendum: add back SCAN_FILTER constant)
HBase now allows specifying a filter for the REST multiget endpoint. An addendum adds back the SCAN_FILTER constant.

Apache Jira: HBASE-28518

CDPD-71008: REST Java client library assumes stateless servers
This issue is fixed.

Apache Jira: HBASE-28500

CDPD-71007: hbase-rest client shading conflicts with hbase-shaded-client in HBase 2.x
This issue is fixed.

Apache Jira: HBASE-28526

CDPD-71006: Support non-SPNEGO authentication methods and implement session handling in the REST Java client library
This issue is fixed.

Apache Jira: HBASE-28501

CDPD-70493: MultiRowRangeFilter deserialization fails in org.apache.hadoop.hbase.rest.model.ScannerModel
This issue is fixed.

Apache Jira: HBASE-28626

CDPD-69335: Use a single GET call in the REST multiget endpoint
This issue is fixed.

Apache Jira: HBASE-28523

CDPD-68900: HBase properties need to be dynamically configured
The following configurations can be dynamically configured.
  • hbase.rs.evictblocksonclose
  • hbase.rs.cacheblocksonwrite
  • hbase.block.data.cacheonread

After changing the values of these configurations, you no longer need to restart the region servers. These configurations help achieve better throughput.

HBase reads the changed values from hbase-site.xml and updates them in the appropriate classes.

CDPD-68550: BucketCache.notifyFileCachingCompleted might incorrectly consider a file fully cached
This issue is fixed.

Apache Jira: HBASE-28458

CDPD-68154: BucketCache.evictBlocksByHfileName does not work after a cache recovery from a file
This issue is fixed.

Apache Jira: HBASE-28450

CDPD-64046: BucketCache.blocksByHFile might leak on allocationFailure or on input/output errors, leading to a cache leak and extra heap usage
This issue is fixed.

Apache Jira: HBASE-28211

CDPD-63765: Move the NavigableSet add operation to the writer thread in BucketCache
This fix addresses potential cache leaks and extra memory usage.

Apache Jira: HBASE-26305

CDPD-62737: PrefetchExecutor must not run for files from the CF levels that have disabled BLOCKCACHE
This fix allows disabling the caching or pre-caching of individual tables.

Apache Jira: HBASE-28217

CDPD-45890: Fix the miss count in one of the CombinedBlockCache getBlock implementations
This fix improves the accuracy of the hit ratio chart in Cloudera Manager.

Apache Jira: HBASE-28189