Hive Column Profiler configuration
In addition to the generic configuration, the Hive Column Profiler has additional parameters that you can optionally edit.
- Go to Profilers > Configs.
- Select your data lake.
- Select Hive Column Profiler.
The Detail page is displayed.
- Use the toggle button to enable or disable the profiler.
- Select a schedule to run the profiler. The schedule is implemented as a Quartz cron expression.
For more information, see Understanding the Cron Expression generator.
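As a rough illustration of what such a schedule looks like, the sketch below splits a Quartz cron expression into its named fields. Quartz expressions have six required fields plus an optional year. The helper name `split_quartz_cron` and the sample schedule are hypothetical and are not part of the product.

```python
# Field names in the order Quartz expects them; the seventh (year) is optional.
QUARTZ_FIELDS = [
    "seconds", "minutes", "hours",
    "day_of_month", "month", "day_of_week", "year",
]


def split_quartz_cron(expression: str) -> dict[str, str]:
    """Map each whitespace-separated field to its Quartz field name."""
    parts = expression.split()
    if len(parts) not in (6, 7):
        raise ValueError(f"expected 6 or 7 fields, got {len(parts)}")
    return dict(zip(QUARTZ_FIELDS, parts))


# Hypothetical schedule: every day at 02:00:00.
daily = split_quartz_cron("0 0 2 * * ?")
print(daily["hours"], daily["minutes"], daily["seconds"])
```

This only checks the number of fields. It does not validate the values inside each field.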
- Select Last Run Check and set a period if needed.
- Set the sample settings:
- Select the Sample Data Size.
- From the drop-down list, select the type of sample data size.
- Enter the value based on the previously selected type.
- Continue with the resource settings.
- In Advanced Options, set the following:
- Number of Executors - Enter the number of executors to launch for running this profiler.
- Executor Cores - Enter the number of cores to be used for each executor.
- Executor Memory - Enter the amount of memory in GB to be used per executor process.
- Driver Cores - Enter the number of cores to be used for the driver process.
- Driver Memory - Enter the amount of memory to be used for the driver process.
- In Pod Configurations, update the following:
- Pod CPU Limit: Indicates the maximum number of cores that can be allocated to a Pod. The accepted values range from one through eight.
- Pod CPU Requirement: The minimum number of CPUs that are allocated to a Pod when it is provisioned. If the node where a Pod is running has enough resources available, a container is allowed to use more of a resource than its request for that resource specifies. However, a container is not allowed to use more than its resource limit. The accepted values range from one through eight.
- Pod Memory Limit: The maximum amount of memory that can be allocated to a Pod. The accepted values range from 1 through 256.
- Pod Memory Requirement: The minimum amount of RAM that is allocated to a Pod when it is provisioned. If the node where a Pod is running has enough resources available, a container is allowed to use more of a resource than its request for that resource specifies. However, a container is not allowed to use more than its resource limit. The accepted values range from 1 through 256.
- In Executor Configurations, update the following:
- Number of workers: Indicates the number of processes that are used by the distributed computing framework. The accepted values range from one through eight.
- Number of threads per worker: Indicates the number of threads used by each worker to complete the job. The accepted values range from one through eight.
- Worker Memory limit in GB: To avoid overutilization of memory, this parameter enforces an upper threshold on memory usage for a given worker. For example, if you have an 8 GB Pod and 4 threads, the value of this parameter must be 2 GB. The accepted values range from one through four.
Executor configurations are the runtime configurations. These configurations must be changed if you change the Pod configurations or when additional compute power is required.
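The ranges and the worker memory example above can be checked with a small script before you enter values in the UI. This is a minimal sketch under stated assumptions: the class and function names are hypothetical, the rule that a request must not exceed its limit is inferred from the request and limit descriptions, and dividing Pod memory by the thread count generalizes the single 8 GB and 4 threads example. It does not call any product API.

```python
from dataclasses import dataclass


@dataclass
class ProfilerResources:
    """Hypothetical container for the values entered in the UI."""
    pod_cpu_limit: int
    pod_cpu_request: int
    pod_memory_limit: int      # the example in the text treats Pod memory as GB
    pod_memory_request: int
    workers: int
    threads_per_worker: int
    worker_memory_limit_gb: int


# Accepted ranges as listed in the text (inclusive).
RANGES = {
    "pod_cpu_limit": (1, 8),
    "pod_cpu_request": (1, 8),
    "pod_memory_limit": (1, 256),
    "pod_memory_request": (1, 256),
    "workers": (1, 8),
    "threads_per_worker": (1, 8),
    "worker_memory_limit_gb": (1, 4),
}


def validate(res: ProfilerResources) -> list[str]:
    """Return a list of problems; an empty list means none were found."""
    problems = []
    for name, (low, high) in RANGES.items():
        value = getattr(res, name)
        if not low <= value <= high:
            problems.append(f"{name}={value} is outside {low}..{high}")
    # Assumption: a request larger than its limit is never meaningful.
    if res.pod_cpu_request > res.pod_cpu_limit:
        problems.append("pod_cpu_request exceeds pod_cpu_limit")
    if res.pod_memory_request > res.pod_memory_limit:
        problems.append("pod_memory_request exceeds pod_memory_limit")
    return problems


def suggested_worker_memory_gb(pod_memory_gb: float, threads: int) -> float:
    """Generalize the 8 GB Pod / 4 threads = 2 GB example (an assumption)."""
    return pod_memory_gb / threads


settings = ProfilerResources(4, 2, 8, 4, 1, 4, 2)
print(validate(settings))                 # []
print(suggested_worker_memory_gb(8, 4))   # 2.0
```

Returning a list of problems instead of raising on the first one lets you correct every out-of-range value in a single pass.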
- Add Asset Filter Rules as needed to customize the selection and deselection of assets that the profiler profiles.
- Set your Deny List and Allow List.
The profiler skips assets that meet any criteria in the Deny List and includes assets that meet any criteria in the Allow List.
- Select the Deny List or Allow List tab.
- Click Add New to include rules.
- Select the key from the drop-down list. You can select from the following:
- Database name
- Asset name
- Asset owner
- Path to the asset
- Created date
- Select the operator from the drop-down list. Depending on the key selected, you can select an operator such as equals or contains. For example, you can select assets whose names contain a particular string.
- Enter the value corresponding to the key. For example, you can enter a string as mentioned in the previous example.
- Click Done. Once the rule is added, you can toggle its state to enable or disable it as needed.
- Click Save to apply the configuration changes to the selected profiler.
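To make the Deny List and Allow List behavior concrete, the following sketch evaluates rules against an asset described as a dictionary. It is illustrative only. The key names, the `Rule` class, and `should_profile` are hypothetical. The text does not say what happens when an asset matches both lists or when the Allow List is empty, so the sketch assumes that a Deny List match wins and that an empty Allow List allows everything.

```python
from dataclasses import dataclass


@dataclass
class Rule:
    key: str              # hypothetical key name, e.g. "asset_name"
    operator: str         # "equals" or "contains"
    value: str
    enabled: bool = True  # mirrors the per-rule toggle in the UI

    def matches(self, asset: dict) -> bool:
        """Return True if this enabled rule matches the asset."""
        if not self.enabled:
            return False
        actual = str(asset.get(self.key, ""))
        if self.operator == "equals":
            return actual == self.value
        if self.operator == "contains":
            return self.value in actual
        raise ValueError(f"unsupported operator: {self.operator}")


def should_profile(asset: dict, deny: list[Rule], allow: list[Rule]) -> bool:
    # Assumption: a Deny List match takes precedence over the Allow List.
    if any(rule.matches(asset) for rule in deny):
        return False
    # Assumption: an empty Allow List means every remaining asset is allowed.
    if not allow:
        return True
    return any(rule.matches(asset) for rule in allow)


deny_rules = [Rule("asset_name", "contains", "tmp")]
allow_rules = [Rule("database_name", "equals", "sales")]

orders = {"database_name": "sales", "asset_name": "orders"}
scratch = {"database_name": "sales", "asset_name": "tmp_orders"}

print(should_profile(orders, deny_rules, allow_rules))   # True
print(should_profile(scratch, deny_rules, allow_rules))  # False
```

A disabled rule never matches, so turning off the only Deny List rule in this example would cause `tmp_orders` to be profiled again.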