What's new
This section lists major features and updates for the Cloudera Data Catalog service.
April 8, 2025
This release (3.1.0) of the Cloudera Data Catalog service introduces the following new changes:
Improved services for profilers
- Thanks to the improved Cluster Setup API, the configuration of profilers is simplified.
- Executor-related settings only specify the maximum number of workers; an internal service manages autoscaling within this range.
- Redesigned profiler setup
- Settings for instance sizing and autoscaling are introduced.
Improved profiler UI
The improved profilers present a more user-friendly UI and several extended capabilities for Compute Cluster enabled environments.
- New names for profilers in Compute Cluster enabled environments:
- The Cluster Sensitivity Profiler is now called the Data Compliance profiler.
- The Hive Column Profiler is now called the Statistics Collector profiler.
- The Ranger Audit Profiler is now called the Activity Profiler.
- Redesigned Profilers menu for easier access to jobs, configurations and their history, asset filtering, and tag rules:
- The individual profilers show new metrics
- Number of profiled assets of the last job
- Job duration of the last job
- The profilers menu also shows the next jobs’ start time and the number of completions
- The CRON-expression-based scheduler is supplemented with a natural-language-based scheduler
- Asset Filtering Rules is expanded with the list of assets affected by your rule set
- You can now access the Configuration History of a profiler, where you can check your changes in a sequential order
- The Job Summary page introduces new metrics:
- Worker details:
- Worker memory limit
- Threads per worker
- Number of workers
- Last run check details
- The Job Summary page provides the list of profiled assets.
Redesigned and expanded Tag Rules for Compute Cluster enabled environments
- Profiling of table names is introduced alongside column values and column names.
- Atlas classifications (Cloudera Data Catalog tags) can be used in a more granular way thanks to the distinction between parent and child tags.
- Tag rules are data lake specific in Compute Cluster enabled environments compared to being valid for all data lakes in VM-based environments.
- The new Tag Rules tab offers filters to allow for faster searching and displays:
- List of applied parent and child tags
- Tag rule status (Can be used to filter for tag rules not yet validated by Dry Run)
- Rule types
- You can filter for tag rules that apply child tags
- The initial loading time of rules has been decreased.
- You can upload regex patterns in CSV files for easier handling.
- You can now specify the weightage for column value based matching (previously fixed at 85%). The column value weightage and the column name weightage add up to 100%.
- When profiling column values, you can upload a sample set of column values instead of defining a regex pattern.
- You can review your configuration before finalizing your tag rule.
- Dry Run: Before deploying your tag rules, you have to test them with actual table data.
- New API calls are available.
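The weightage setting in the list above can be pictured with a small sketch. Only two facts come from these notes: the column value weightage and the column name weightage add up to 100%, and the value weightage was previously fixed at 85%. The scoring formula (a linear blend of a name match and the share of matching sample values), the function names `name_weight_pct` and `blended_score`, and the email patterns are assumptions for illustration, not the documented behavior of Cloudera Data Catalog.

```python
import re


def name_weight_pct(value_weight_pct: float) -> float:
    """Return the column name weightage so that both weightages add up to 100%."""
    if not 0 <= value_weight_pct <= 100:
        raise ValueError("value weightage must be between 0 and 100")
    return 100 - value_weight_pct


def blended_score(column_name, sample_values, name_pattern, value_pattern,
                  value_weight_pct=85.0):
    """Hypothetical score in [0, 1]: a weighted blend of name and value matches."""
    # 1.0 if the column name matches the name pattern, otherwise 0.0.
    name_hit = 1.0 if re.search(name_pattern, column_name, re.IGNORECASE) else 0.0
    # Share of sample values that fully match the value pattern.
    if sample_values:
        matches = sum(1 for v in sample_values if re.fullmatch(value_pattern, v))
        value_ratio = matches / len(sample_values)
    else:
        value_ratio = 0.0
    return (value_weight_pct * value_ratio
            + name_weight_pct(value_weight_pct) * name_hit) / 100


# Example: half of the sampled values look like email addresses, the name matches.
score = blended_score("email", ["a@b.com", "nope"],
                      r"e?mail", r"[^@\s]+@[^@\s]+\.\w+", value_weight_pct=85)
```

Under these assumptions, raising the value weightage makes the sampled data count for more than the column name, and lowering it does the opposite.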
New file formats for Compute Cluster based profilers
Compute Cluster based profilers also support the ORC and Avro file formats.
January 16, 2025
This release (2025-M1) of the Cloudera Data Catalog service introduces the following new changes:
Containerized architecture for profilers
- Only the required number of Kubernetes pods is launched, based on the size of the database to be profiled. You pay only for the cloud resources used, and only while the profilers are using them.
- The deployment of the containerized profiler architecture is more streamlined and quicker than that of the previous VM-based architecture.
- Because the architecture is containerized, later upgrades can be carried out more easily, without the multiple dependencies of the VM-based architecture, which used multiple services.
- Profilers now also support the following file formats:
- VM-based environments: CSV, ORC
- Compute Cluster enabled environments: CSV and Parquet
- Hive Column Profilers and Cluster Sensitivity Profilers also support profiling Iceberg Tables, including with On-Demand Profilers.
- For more information, see Profiler architecture in Compute Cluster enabled environment.
Redesigned Dashboard menu
- A new Dashboard is introduced to give an overview of your data lakes and profilers, including:
- Data lake type and status
- Profiler status
- Last 10 assets bookmarked by you
- Last run of profiler
- Number of assets profiled
Redesigned Search menu
The Search menu is reorganized so that information is easier to access. You can expand each entity result to see its qualified name, database, classification, and assigned terms. You can use these to check whether your query returns the expected results.
Improved display of comments in Asset Details
Following this release, you can hover over the Comment field for individual schema entries in Asset Details to preview longer comments without opening them.
Common time format
Asset Details and other menus now use the same time format for a more readable overview: MM/DD/YYYY hh:mm A.
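As an illustration of the time format above, the following sketch renders a timestamp in the MM/DD/YYYY hh:mm A layout. It assumes the pattern uses the common token convention in which hh is a zero-padded 12-hour clock and A is an uppercase AM/PM marker; these notes do not say which formatting library the UI uses, and the helper name `format_timestamp` is hypothetical.

```python
from datetime import datetime


def format_timestamp(dt: datetime) -> str:
    """Render dt as MM/DD/YYYY hh:mm A (12-hour clock with AM/PM)."""
    # Build the AM/PM marker by hand so the result does not depend on the locale.
    meridiem = "AM" if dt.hour < 12 else "PM"
    return f"{dt.strftime('%m/%d/%Y %I:%M')} {meridiem}"


afternoon = format_timestamp(datetime(2025, 1, 16, 14, 5))   # 01/16/2025 02:05 PM
midnight = format_timestamp(datetime(2025, 4, 8, 0, 30))     # 04/08/2025 12:30 AM
```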
Removed features
The following features have been removed:
- Tagging multiple assets in the Search menu