Known issues for Cloudera AI on premises 1.5.5
This section lists known issues that you might run into while using Cloudera AI on premises.
- DSE-44699: Provisioning a workbench fails with the error "pool guaranteed resources larger than parent's available guaranteed resources"
With the Quota Management feature enabled, creating a Cloudera AI Workbench might fail with the error "pool guaranteed resources larger than parent's available guaranteed resources".
- DSE-44367: Buildkitd Pod CrashLoopBackOff due to port conflicts
During the creation or upgrade of a Cloudera AI Workbench, buildkitd pods may occasionally enter a CrashLoopBackOff state. This typically happens when the port used by BuildKit is not properly released during pod restarts or is occupied by another process. You may encounter errors such as: buildkitd: listen tcp 0.0.0.0:1234: bind: address already in use
Workaround:
If you experience this issue, perform a rollout restart of the buildkitd pods to ensure they start correctly: kubectl rollout restart daemonset buildkitd -n [***WORKBENCH NAMESPACE***]
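To confirm the restart succeeded, the following is a minimal sketch, assuming kubectl access to the cluster; the namespace is a placeholder:
# Wait for the daemonset rollout to complete, then check the pod state.
kubectl rollout status daemonset buildkitd -n [***WORKBENCH NAMESPACE***]
kubectl get pods -n [***WORKBENCH NAMESPACE***] | grep buildkitd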
- DSE-44319: Model import from Model Hub fails with an error communicating with the Cloudera AI Registry
If you have used self-signed certificates, or your trust store lacks the certificate required to trust the Cloudera AI Registry domain, Cloudera AI Registry calls from the UI might fail.
Figure 1. Importing a model fails
Workaround:
When facing issues with an untrusted certificate:
- Select Cloudera AI Registry from the left navigation pane. The Cloudera AI Registry page is displayed.
- Open the Cloudera AI Registry Details page.
- Copy the domain and open it in a new browser tab.
- Add the certificate permanently to your device's trust store rather than trusting it only for the current session. For permanent trust, export the certificate from your browser and import it into your operating system's certificate manager, or fetch and install it from the command line as sketched below.
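The following is a minimal command-line sketch of fetching and trusting the certificate, assuming a Linux host with OpenSSL and a RHEL-style trust store; the registry domain is a placeholder, and trust-store paths differ on other operating systems:
# Fetch the certificate presented by the Cloudera AI Registry domain.
openssl s_client -connect [***REGISTRY DOMAIN***]:443 -showcerts </dev/null 2>/dev/null \
  | openssl x509 -outform PEM > registry-domain.crt
# Add it to the system trust store (RHEL/CentOS paths shown; other distributions differ).
sudo cp registry-domain.crt /etc/pki/ca-trust/source/anchors/
sudo update-ca-trust extract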
- DSE-43704: Rename custom tee binary to cml-tee
Certain vulnerability scanners may incorrectly flag Cloudera AI as using vulnerable versions of coreutils and tee. Cloudera AI services include a custom tee binary, developed entirely in-house by Cloudera, which is not based on the open-source coreutils library. The current version of Cloudera's custom tee command is 0.9, which may be mistakenly identified as the tee command from coreutils that contains known vulnerabilities.
- DSE-44827: The model fails with an unknown status
Model deployments may fail to start if the model build relies on an add-on that was hotfixed in the release. This issue occurs when a model deployment restart overlaps with the add-on hotfixing process, leading to conflicts.
Workaround:
To resolve this issue, create a new build and deploy it for the affected model deployments.
- DSE-44682: Model deployment fails at the building stage due to TLS issues
TLS-related issues may occasionally occur during the Cloudera AI model build process in Cloudera Embedded Container Service clusters, specifically when pulling images from the container registry. These issues are typically caused by missing registry certificates on the worker nodes, which should be located under /etc/docker/certs.d/.
Workaround:
To address this issue, ensure that the required registry certificates are present on all worker nodes. Follow these steps to recover (a command sketch follows the steps):
- Identify a reference worker node. Select a worker node where the model build process completes successfully without TLS errors. This node is expected to have the correct registry certificates in place. In general, any worker node running s2i-builder pods is likely to have the necessary certificates.
- Locate the registry certificates. On the reference worker node, navigate to /etc/docker/certs.d/[***REGISTRY NAME***]/ and verify that the registry.crt file exists.
- Distribute certificates to affected nodes. Copy the certificate file (registry.crt) from the reference node to the same path (/etc/docker/certs.d/[***REGISTRY NAME***]/) on the affected worker nodes that lack the required certificates. Make sure that the certificates for both the image pull and push registries are present and correctly placed on all worker nodes.
- Perform a rollout restart of the buildkitd daemonset so that the certificates are applied properly to the buildkit pods.
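The following is a minimal sketch of the last two steps, assuming SSH access to the nodes; the node names, registry name, and workbench namespace are placeholders:
# Copy the certificate from the reference node to an affected node.
scp root@[***REFERENCE NODE***]:/etc/docker/certs.d/[***REGISTRY NAME***]/registry.crt /tmp/registry.crt
ssh root@[***AFFECTED NODE***] 'mkdir -p /etc/docker/certs.d/[***REGISTRY NAME***]'
scp /tmp/registry.crt root@[***AFFECTED NODE***]:/etc/docker/certs.d/[***REGISTRY NAME***]/
# Restart the buildkitd daemonset so the pods pick up the certificates.
kubectl rollout restart daemonset buildkitd -n [***WORKBENCH NAMESPACE***]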
- DSE-44913: Spark executor fluentd init container fails
A bug introduced in a recent change to handle dynamic volume association with Spark pods causes the Fluent Bit executor, responsible for log collection, to crash. As a result, logs from affected pods are not included in debug bundles, and the Spark executor pod logs do not appear in the session's Logs tab in the UI. Despite this, the Spark executors continue to function as expected: the engine container still starts, and the script executes.
Cloudera AI Inference service known issues
- DSE-44238: Cannot create Cloudera AI Inference service application deployment via CDP CLI when Ozone credentials are passed
Cloudera AI Inference service cannot be created via CDP CLI. Create the Cloudera AI Inference service only via the UI.
- DSE-44141: Failed to delete deployment in executing DeleteMLServingApp
Cloudera AI Inference service fails to remove all namespaces if the Cloudera AI Inference service is deleted after an installation failure.
Workaround:
Manually delete the following namespaces from the cluster (a command sketch follows this list):
- knative-serving
- kserve
- cml-serving
- knox-caii
- serving-default
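As a minimal sketch, assuming kubectl access with permission to delete namespaces, all five can be removed in one command:
# Delete the namespaces left behind by the failed installation.
kubectl delete namespace knative-serving kserve cml-serving knox-caii serving-default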
Further known issues with Cloudera AI Inference service:
- Updating the description after a model has been added to a model endpoint leads to a UI mismatch in the model builder between the models it lists and the models deployed.
- Embedding models function in two modes: query or passage. The mode has to be specified when interacting with the models. There are two ways to do this (see the sketch after this item):
  - suffix the model ID in the payload with either -query or -passage, or
  - specify the input_type parameter in the request payload.
  For more information, see NVIDIA documentation.
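The following is a minimal sketch of both options, assuming an OpenAI-compatible embeddings endpoint; the endpoint URL, token variable, and model ID are placeholders, not values from this documentation:
# Option 1: suffix the model ID with -query or -passage.
curl -s "https://[***ENDPOINT***]/v1/embeddings" \
  -H "Authorization: Bearer ${CDP_TOKEN}" \
  -H "Content-Type: application/json" \
  -d '{"model": "[***MODEL ID***]-query", "input": ["What is Cloudera AI?"]}'
# Option 2: pass the input_type parameter instead.
curl -s "https://[***ENDPOINT***]/v1/embeddings" \
  -H "Authorization: Bearer ${CDP_TOKEN}" \
  -H "Content-Type: application/json" \
  -d '{"model": "[***MODEL ID***]", "input": ["What is Cloudera AI?"], "input_type": "query"}'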
- Embedding models only accept strings as input. Token stream input is currently not supported.
- Llama 3.2 Vision models are not supported on AWS on A10G and L40S GPUs.
- The Llama 3.1 70B Instruct model L40S profile needs 8 GPUs to deploy successfully, while the NVIDIA documentation lists this model profile as needing only 4 L40S GPUs.
- Mistral 7B models for NIM version 1.1.2 require the max_tokens parameter in the request payload. This API regression is known to affect the Test Model UI functionality for this specific NIM version.
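A minimal request sketch with max_tokens set, assuming an OpenAI-compatible chat completions endpoint; the endpoint URL, token variable, and model ID are placeholders:
# Include max_tokens explicitly for Mistral 7B on NIM 1.1.2.
curl -s "https://[***ENDPOINT***]/v1/chat/completions" \
  -H "Authorization: Bearer ${CDP_TOKEN}" \
  -H "Content-Type: application/json" \
  -d '{"model": "[***MODEL ID***]", "messages": [{"role": "user", "content": "Hello"}], "max_tokens": 256}'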
- NIM endpoints reply with a 307 temporary redirect if the URL ends with a trailing slash (/). Make sure not to include a trailing slash character at the end of NIM endpoint URLs.
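For illustration, with a hypothetical endpoint path:
https://[***ENDPOINT***]/v1/chat/completions    # correct: no trailing slash
https://[***ENDPOINT***]/v1/chat/completions/   # returns a 307 temporary redirect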