Fixed issues

Review the fixed issues in this release of the Cloudera Data Warehouse service on cloud.

CDPD-89414: Incorrect results for window functions with IGNORE NULLS
When you used the FIRST_VALUE and LAST_VALUE window functions with the IGNORE NULLS clause while vectorization was enabled, the results were incorrect. This occurred because the vectorized execution engine did not properly handle the IGNORE NULLS setting for these functions.
This issue is addressed by modifying the vectorized processing for FIRST_VALUE and LAST_VALUE to correctly respect the IGNORE NULLS clause, ensuring the same results are produced whether vectorization is enabled or disabled.
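
For reference, a minimal query of the affected shape (table t and its columns are hypothetical; in Hive, the optional boolean second argument to FIRST_VALUE and LAST_VALUE requests IGNORE NULLS behavior):

SELECT grp, id,
       FIRST_VALUE(val, true) OVER (PARTITION BY grp ORDER BY id) AS first_non_null,
       LAST_VALUE(val, true) OVER (PARTITION BY grp ORDER BY id) AS last_non_null
FROM t;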

Apache Jira: HIVE-29122

CDPD-60770: Passwords with special characters fail to connect with Beeline
When you used a password containing special characters like #, ^, or ; in a JDBC URL for a Beeline connection, the connection failed with a 401 error. This happened because Beeline did not correctly interpret these special characters in the password.
This issue is resolved by introducing a new method to reparse the password from the original JDBC URL, allowing Beeline to correctly handle and authenticate passwords containing special characters.
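
For example, a Beeline connection of the previously failing shape (host, port, and credentials are placeholders):

beeline -u "jdbc:hive2://host:10000/default;user=hive;password=Pa#ss^w;rd"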

Apache Jira: HIVE-28805

CDPD-85600: Select queries with ORDER BY fail due to compression error
When you ran a Hive SELECT query with an ORDER BY clause, it failed with a java.io.IOException and a java.lang.UnsatisfiedLinkError related to the zlib decompressor.
This issue is addressed by ensuring that the zlib native library is loaded correctly.

Apache Jira: HIVE-28805

CDPD-90301: Stack overflow error from queries with OR and MIN filters
Queries caused a stack overflow error when they contained multiple OR conditions on the same expression, such as MINUTE(date_) = 2 OR MINUTE(date_) = 10.
This issue is addressed by modifying the HivePointLookupOptimizerRule to keep the original order of expressions and to check if a merge can be performed before creating a new expression.

Apache Jira: HIVE-29208

CDPD-90303: Incorrect results from a CASE expression
A query that used a CASE expression to conditionally return values produced an incorrect result. The query plan incorrectly folded the CASE statement into a COALESCE function, which led to a logic error that filtered out some of the expected results.
This issue is addressed by adding a more strict check when converting CASE expressions into COALESCE during query optimization.
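
For context, folding CASE into COALESCE is an equivalence only for patterns of the following shape, where the THEN branch repeats the tested expression (table and column names are hypothetical):

SELECT CASE WHEN a IS NOT NULL THEN a ELSE b END AS case_form,
       COALESCE(a, b) AS coalesce_form
FROM t;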

Apache Jira: HIVE-24902

CDPD-80655: Compile error with ambiguous column reference
A Hive query using CREATE TABLE AS SELECT with a GROUP BY clause and a window function failed with an "Ambiguous column reference" error. This happened because the query plan couldn't correctly handle redundant keys in the GROUP BY clause.
This issue is fixed by improving the query planner's logic to properly handle complex expressions and their aliases within window functions, allowing the query to compile and run successfully.
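
A sketch of the query shape described above (all names are hypothetical):

CREATE TABLE out_tbl AS
SELECT grp,
       COUNT(*) AS cnt,
       RANK() OVER (PARTITION BY grp ORDER BY COUNT(*)) AS rnk
FROM src
GROUP BY grp;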

Apache Jira: HIVE-28878

DWX-20754: Invalid column reference in lateral view queries
The virtual column BLOCK__OFFSET__INSIDE__FILE could not be referenced correctly in queries using lateral views, resulting in the following error:
FAILED: SemanticException Line 0:-1 Invalid column reference 'BLOCK__OFFSET__INSIDE__FILE'
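
For reference, a query of the previously failing shape (table src, array column arr, and the alias names are hypothetical):

SELECT BLOCK__OFFSET__INSIDE__FILE, item
FROM src LATERAL VIEW explode(arr) lv AS item;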

This issue is now resolved.

Apache Jira: HIVE-28938

DWX-21855: Impala executors fail to shut down gracefully
During a graceful shutdown, Impala executors wait for running queries to finish, up to the graceful shutdown deadline (--shutdown_deadline_s). However, the istio-proxy container on the Impala executor pod was terminated immediately, which made the executors unreachable; they were then removed from the Impala cluster membership and their running queries were cancelled.
This issue is now resolved by ensuring that the istio-proxy container's lifecycle does not affect the executor's cluster membership.

IMPALA-14263: Enhanced join strategy for large clusters
The query planner's cost model for broadcast joins could be skewed by the number of nodes in a cluster. This could lead to suboptimal join strategy choices, especially in large clusters with skewed data, where a partitioned join was chosen over a more efficient broadcast join.
This issue is now resolved by introducing the broadcast_cost_scale_factor query option as an additional tuning mechanism, besides query hints, for overriding the query planner's decision. To set it cluster-wide for all queries, add the following key-value pair to the default_query_options startup option:
broadcast_cost_scale_factor=<less than 1.0>
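
The option can also be set for an individual session; for example (the value shown is illustrative):

SET broadcast_cost_scale_factor=0.5;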

Apache Jira: IMPALA-14263

IMPALA-11402: Fetching metadata for tables with huge numbers of files no longer fails with OutOfMemoryError
Previously, when the Impala coordinator tried to fetch file metadata for extremely large tables (those with millions of files or partitions), the Impala Catalog service would attempt to return all the file details at once. This often exceeded the Java memory limits, causing the service to crash with an OutOfMemoryError.
This issue is addressed by configuring the Catalog service to limit the number of file descriptors included in a single getPartialCatalogObject response. A new configuration flag, catalog_partial_fetch_max_files, is introduced to define the maximum number of file descriptors allowed per response (with a default of 1,000,000 files).
If a request exceeds this limit, the Catalog service will truncate the response and return metadata for only a subset of the requested partitions. The coordinator is now designed to detect this truncated response and automatically send new batch requests to fetch the remaining partitions until all required metadata is retrieved. This change ensures that the coordinator can successfully fetch and process the metadata for extremely large tables without crashing due to memory limits.
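
For example, to lower the per-response cap, add the flag to the Catalog service startup options (the value shown is illustrative):

--catalog_partial_fetch_max_files=200000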

Apache Jira: IMPALA-11402

CDPD-77261: Impala can now read Parquet integer data as DECIMAL after schema changes
Previously, if you changed a column type from an integer (INT or BIGINT) to a DECIMAL using ALTER TABLE, Impala could fail to read the original Parquet data files. This happened because the files lacked the specific metadata (logical types) Impala expected for decimals, resulting in an error.
Impala is now more flexible when reading Parquet files following schema evolution. If Impala encounters an integer type but the schema expects a DECIMAL, it automatically assumes a suitable decimal precision and scale, allowing you to successfully query the updated table:
  • INT32 is read as DECIMAL(9, 0).
  • INT64 is read as DECIMAL(18, 0).
This change supports common schema evolution practices by allowing you to update column types without manually rewriting old data files.
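
For example, a column type change of this kind (table and column names are hypothetical) no longer breaks reads of the original files:

ALTER TABLE sales CHANGE amount amount DECIMAL(18,0);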

Apache Jira: IMPALA-13625

IMPALA-12927: Impala can now correctly read BINARY columns in JSON tables
Previously, Impala couldn't correctly read BINARY columns in JSON tables, often resulting in errors or incorrect data. This happened because Impala assumed the data was always Base64 encoded, which wasn't true for files written by older Hive versions.
Impala now supports a new table property, 'json.binary.format' (BASE64 or RAWSTRING), and a query option, JSON_BINARY_FORMAT, to explicitly define the binary encoding. This ensures Impala reads the data correctly. If no format is specified, Impala will now return an error instead of risking silent data corruption.
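
For example, to declare the encoding at the table level or to override it for a session (the table name is hypothetical):

ALTER TABLE json_tbl SET TBLPROPERTIES ('json.binary.format'='BASE64');
SET JSON_BINARY_FORMAT=RAWSTRING;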

Apache Jira: IMPALA-12927

CDPD-81076: LEFT ANTI JOIN fails on Iceberg V2 tables with Delete files
Queries using a LEFT ANTI JOIN failed with an AnalysisException if the right-side table was an Iceberg V2 table containing delete files. For example, consider the following query:
SELECT * FROM table_a a
LEFT ANTI JOIN iceberg_v2_table b
ON a.id = b.id;

The error Illegal column/field reference 'b.input_file_name' of semi-/anti-joined table 'b' was displayed because semi-joined tuples need to be explicitly made visible for paths pointing inside them to be resolvable.

The fix updates the IcebergScanPlanner to ensure that the tuple containing the virtual fields is made visible when it is semi-joined.

Apache Jira: IMPALA-13888

CDPD-81053: Enable MERGE statement for Iceberg tables with equality deletes
This patch fixes an issue that caused MERGE statements to fail on Iceberg tables that use equality deletes.

The failure occurred because the delete expression calculation was missing the data sequence number, even though the underlying data description included it. This mismatch caused row evaluation to fail.

The fix ensures the data sequence number is correctly included in the result expressions, allowing MERGE operations to complete successfully on these tables.
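
A MERGE statement of the previously failing shape (all names are hypothetical):

MERGE INTO ice_target t
USING updates u ON t.id = u.id
WHEN MATCHED THEN UPDATE SET val = u.val
WHEN NOT MATCHED THEN INSERT VALUES (u.id, u.val);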

Apache Jira: IMPALA-13674

CDPD-77773: Tolerate missing data files during Iceberg table loading
This fix addresses an issue where an Iceberg table would fail to load completely if any of its data files were missing from the file system. The resulting TableLoadingException left the table in an incomplete state, blocking all operations on it.

Impala now tolerates missing data files during the table loading process. An exception is thrown only if a query subsequently attempts to read one of the missing files.

This change allows other operations that do not depend on the missing data—such as ROLLBACK, DROP PARTITION, or SELECT statements on valid partitions—to execute successfully.

Apache Jira: IMPALA-13654

CDPD-78508: Skip reloading Iceberg tables when metadata JSON file is the same
This patch optimizes metadata handling for Iceberg tables, particularly those that are updated frequently.

Previously, if an event processor was lagging, Impala might receive numerous update events for the same table (for example, 100 events). Impala would attempt to reload the table 100 times, even if the table's state was already up-to-date after processing the first event.

With this fix, Impala now compares the path of the incoming metadata JSON file with the one that is currently loaded. If the metadata file location is the same, Impala skips the reload, correctly assuming the table is already up-to-date. This significantly reduces unnecessary metadata processing.

Apache Jira: IMPALA-13718

Fixed Common Vulnerabilities and Exposures

Common Vulnerabilities and Exposures (CVE) that are fixed in this release:

CVE-2025-30065: Code execution vulnerability in schema parsing of the Apache Parquet-avro module in versions lower than 1.15.1.
CVE-2020-20703: Buffer overflow vulnerability in Vim v8.1.2135 that allows a remote attacker to execute arbitrary code using the operand parameter.
CVE-2024-53990: Cookie handling vulnerability in the AsyncHttpClient (AHC) library leading to cross-user cookie misuse.
CVE-2024-52533: Buffer overflow vulnerability in GNOME GLib SOCKS4 proxy handling (gio/gsocks4aproxy.c).
CVE-2024-52046: Apache MINA ObjectSerializationDecoder vulnerability leading to Remote Code Execution (RCE).
CVE-2017-6519: Avahi-daemon IPv6 unicast query handling vulnerability leading to DoS and information leakage.