Known issues for Cloudera Data Services on premises 1.5.5
You must be aware of the known issues and limitations, the areas of impact, and the workarounds in the Cloudera Data Services on premises 1.5.5 release.
Known Issues in Cloudera Data Services on premises 1.5.5
- OBS-8038: When using the Grafana Dashboard URL shortener, the shortened URL defaults to localhost:3000. This happens because the URL shortener uses the local server address instead of the actual domain name of the Cloudera Observability instance. As a result, users cannot access the shortened URL.
- You must not use the shortened URL as-is. To ensure users can access the URL, update it to use the correct Cloudera Observability instance domain name, in the form cp_domain/{shorten_url}, as in the sketch below.
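For illustration, a minimal sketch of rewriting a shortened URL to the correct domain; cp.example.com and the /goto/abc123 path are placeholders, not values from an actual deployment:
```python
from urllib.parse import urlsplit, urlunsplit

def fix_short_url(short_url: str, cp_domain: str) -> str:
    # Swap the localhost:3000 host for the Cloudera Observability
    # instance domain, keeping the path, query, and fragment intact.
    parts = urlsplit(short_url)
    return urlunsplit((parts.scheme, cp_domain, parts.path, parts.query, parts.fragment))

# Placeholder values for illustration only.
print(fix_short_url("http://localhost:3000/goto/abc123", "cp.example.com"))
# -> http://cp.example.com/goto/abc123
```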
- DWX-20809: Cloudera Data Services on premises installations on RHEL 8.9 or lower versions may encounter issues
- You may notice issues when installing Cloudera Data Services on premises on Cloudera Embedded Container Service clusters running on RHEL 8.9 or lower versions. Pod crashloops are observed with the following error:
Warning FailedCreatePodSandBox 1s (x2 over 4s) kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: unable to init seccomp: error loading seccomp filter into kernel: error loading seccomp filter: errno 524: unknown
The issue is due to a memory leak with 'seccomp' (Secure Computing Mode) in the Linux kernel. If your kernel version is not 6.2 or higher, or if it is not part of the list of versions mentioned here, you may face issues during installation. A quick kernel check is sketched below.
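A minimal sketch of checking the running kernel before installing, assuming only that the kernel release string begins with major.minor (as it does on RHEL):
```python
import platform

def kernel_at_least(major: int, minor: int) -> bool:
    # platform.release() returns e.g. '4.18.0-513.el8.x86_64' on RHEL 8.
    got_major, got_minor = (int(x) for x in platform.release().split(".")[:2])
    return (got_major, got_minor) >= (major, minor)

if not kernel_at_least(6, 2):
    print("Kernel below 6.2: verify it includes the seccomp memory-leak fix "
          "before installing.")
```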
- COMPX-20705: [153CHF-155] Post ECS upgrade, pods are stuck in ApplicationRejected state
- After upgrading the CDP installation, pods on Kubernetes can be left in a failure state showing "ApplicationRejected". This is caused by a delay in settings being applied to Kubernetes as part of the post-upgrade steps.
- OPSX-6303 - ECS server went down - 'etcdserver: mvcc: database space exceeded'
- The ECS server may fail with the error message "etcdserver: mvcc: database space exceeded" in large clusters. A generic remediation sketch follows.
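For reference, the generic remediation documented for etcd itself is to compact the key-value history, defragment, and clear the space alarm. A minimal sketch, assuming etcdctl v3 is on the PATH with endpoints and certificates supplied through the environment (on ECS, etcd runs inside RKE2, so access details will differ); this is not a Cloudera-specific procedure:
```python
import json
import subprocess

def etcdctl(*args: str) -> str:
    # Assumes ETCDCTL_API=3 endpoints/certs are configured via environment variables.
    return subprocess.run(["etcdctl", *args], check=True,
                          capture_output=True, text=True).stdout

# Find the current revision reported by the endpoint.
status = json.loads(etcdctl("endpoint", "status", "--write-out=json"))
revision = status[0]["Status"]["header"]["revision"]

# Compact history up to that revision, reclaim disk space, clear the alarm.
etcdctl("compact", str(revision))
etcdctl("defrag")
etcdctl("alarm", "disarm")
```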
- OPSX-6295 - Control Plane upgrade failing with cadence-matching and cadence-history
- If extra cadence-matching and cadence-history pods are stuck in the Init:CreateContainerError state, the Cloudera Embedded Container Service upgrade to 1.5.5 will be stuck in a retry loop because the validation that all pods are running fails.
- OPSX-4391 - External docker cert not base64 encoded
- When using Cloudera Data Services on premises on ECS, in some rare situations, the CA certificate for the Docker registry in the cdp namespace is incorrectly encoded, resulting in TLS errors when connecting to the Docker registry.
- OPSX-6245 - Airgap | Multiple pods are in pending state on rolling restart
- Performing back-to-back rolling restarts on ECS clusters can intermittently fail during the Vault unseal step. During rapid consecutive rolling restarts, the kube-controller-manager pod may not return to a ready state promptly. This can cause a cascading effect where other critical pods, including Vault, fail to initialize properly. As a result, the unseal Vault step fails. A readiness check that can be run between restarts is sketched below.
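A minimal sketch of waiting for pods to settle before starting the next restart, using the standard kubectl wait subcommand; the namespace names are assumptions, so substitute the ones used by your cluster:
```python
import subprocess

def wait_for_ready(namespace: str, timeout: str = "10m") -> None:
    # Block until every pod in the namespace reports the Ready condition.
    subprocess.run(
        ["kubectl", "wait", "--for=condition=Ready", "pods", "--all",
         "-n", namespace, f"--timeout={timeout}"],
        check=True)

# Assumed namespaces; adjust to where kube-controller-manager and Vault run.
for ns in ("kube-system", "vault-system"):
    wait_for_ready(ns)
```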
- OPSX-4684 - Start ECS command shows green (finished) even though starting the Docker server failed on one of the hosts
- The Docker service starts, but one or more Docker roles fail to start because the corresponding host is unhealthy.
- OPSX-5986 - ECS fresh install failing with helm-install-rke2-ingress-nginx pod failing to come into Completed state
- ECS fresh install fails at the "Execute command Reapply All Settings to Cluster on service ECS" step due to a timeout waiting for helm-install.
- OPSX-6298 - Issue on service namespace cleanup
- Uninstalling services from the Cloudera Data Services on premises UI might fail for various reasons.
- OPSX-6265 - Setting inotify max_user_instances config
- We cannot recommend an exact value for the inotify max_user_instances setting; the appropriate value depends on all the workloads that run on a given node.
- COMPX-20362 - Use API to create a pool that has a subset of resource types
- The Resource Management UI displays only three resource types: CPU, memory, and GPU. When creating a quota, the UI always sets all three resource types it knows about: CPU, memory, and GPU (the Kubernetes resource nvidia.com/gpu). If no value is chosen for a resource type, a value of 0 is set, blocking the use of that resource type.
Known issues from previous releases carried in Cloudera Data Services on premises 1.5.5
Known Issues identified in 1.5.4
- DOCS-21833: Orphaned replicas/pods are not automatically cleaned up, leading to volume fill-up issues
- By default, Longhorn does not automatically delete orphaned replica directories. You can enable automatic deletion by setting orphan-auto-deletion to true, as in the sketch below.
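Longhorn exposes its settings as Setting custom resources, so the flag can be flipped with kubectl; a minimal sketch, assuming the default longhorn-system namespace:
```python
import subprocess

# Patch the Longhorn orphan-auto-deletion setting to true so orphaned
# replica directories are cleaned up automatically.
subprocess.run(
    ["kubectl", "-n", "longhorn-system",
     "patch", "settings.longhorn.io", "orphan-auto-deletion",
     "--type=merge", "-p", '{"value": "true"}'],
    check=True)
```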
- OPSX-5310: Longhorn engine images were not deployed on ECS server nodes
- Longhorn engine images were not deployed on ECS server nodes due to missing tolerations for Cloudera Control Plane taints. This caused the engine DaemonSet to schedule only on ECS agent nodes, preventing deployment on Cloudera Control Plane nodes.
- OPSX-5155: OS Upgrade | Pods are not starting after the OS upgrade from RHEL 8.6 to 8.8
- After an OS upgrade and start of the Cloudera Embedded Container Service service, pods fail to come up due to stale state.
- OPSX-5055: Cloudera Embedded Container Service upgrade failed at Unseal Vault step
- During a Cloudera Embedded Container Service upgrade from the 1.5.2 to the 1.5.4 release, the vault pod fails to start due to an error caused by the Longhorn volume being unable to attach to the host. The error is as below:
Warning FailedAttachVolume 3m16s (x166 over 5h26m) attachdetach-controller AttachVolume.Attach failed for volume "pvc-0ba86385-9064-4ef9-9019-71976b4902a5" : rpc error: code = Internal desc = volume pvc-0ba86385-9064-4ef9-9019-71976b4902a5 failed to attach to node host-1.cloudera.com with attachmentID csi-7659ab0e6655d308d2316536269de47b4e66062539f135bf6012bfc8b41fc345: the volume is currently attached to different node host-2.cloudera.com
- OPSX-4684: Start Cloudera Embedded Container Service command shows green (finished) even though starting the Docker server failed on one of the hosts
- The Docker service starts, but one or more Docker roles fail to start because the corresponding host is unhealthy.
- OPSX-735: Kerberos service should handle Cloudera Manager downtime
- The Cloudera Manager Server in the base cluster must be running to generate Kerberos principals for Cloudera on premises. If there is downtime, you may observe Kerberos-related errors.
Known Issues identified in 1.5.2
- OPSX-4594: [ECS Restart Stability] Post rolling restart few volumes are in detached state (vault being one of them)
- After a rolling restart, some volumes may be in a detached state.
- OPSX-4392: Getting the real client IP address in the application
- CML has a feature that adds an audit event for each user action (Monitoring User Events). In Private Cloud, instead of the real client IP address, the internal IP address is captured and logged to the internal database.
- CDPVC-1137, CDPAM-4388, COMPX-15083, and COMPX-15418: OpenShift Container Platform version upgrade from 4.10 to 4.11 fails due to a Pod Disruption Budget (PDB) issue
- A PDB can prevent a node from draining, which makes the nodes report the Ready,SchedulingDisabled state. As a result, the node is not updated to the correct Kubernetes version when you upgrade OpenShift Container Platform from 4.10 to 4.11.
- PULSE-944 and PULSE-941 Cloudera Observability namespace not created after platform upgrade from 1.5.1 to 1.5.2
- The Cloudera Observability namespace is not created after a platform upgrade from Cloudera Private Cloud Data Services 1.5.1 to 1.5.2.
During the creation of the resource pool, the Cloudera Observability namespace is provided by Cloudera on premises. If the provisioning flow is not completed, for example due to a timing difference between the start of the computeAPI pod and the call to the computeAPI pod by the service, the namespace is not created.
- PULSE-921 Cloudera Observability namespace has no pods
- The Cloudera Observability namespace should have the same number of pods as nodes. When the Cloudera Observability namespace has no pods, the prometheus-node-exporter-1.6.0 Helm release state becomes invalid and Cloudera Data Services on premises is unable to uninstall and reinstall the namespace. Also, because the Node Exporter is not installed in the Cloudera Observability namespace, its metrics, such as node_cpu_seconds_total, are unavailable when querying Prometheus in the control plane.
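To confirm whether the metric is missing, you can query the control-plane Prometheus HTTP API directly; a minimal sketch, where the Prometheus URL is a placeholder for your deployment's address:
```python
import requests

PROM_URL = "http://prometheus.example.com:9090"  # placeholder address

# An empty result set means no Node Exporter series are being scraped.
resp = requests.get(f"{PROM_URL}/api/v1/query",
                    params={"query": "node_cpu_seconds_total"})
resp.raise_for_status()
series = resp.json()["data"]["result"]
print(f"node_cpu_seconds_total series found: {len(series)}")
```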
- PULSE-697 Add node-exporter to Cloudera Data Services on premises
- When expanding a cluster with new nodes, if there are insufficient CPU and memory resources, the Node Exporter will encounter difficulties deploying new pods on the additional nodes.
- PULSE-935 Longhorn volumes are over 90% of the capacity alerts on Prometheus volumes
- Cloudera Manager displays the following alert about your Prometheus volumes: "Concerning: Firing alerts for Longhorn: The actual used space of Longhorn volume is over 90% of the capacity."
Longhorn stores historical data as snapshots that are counted together with the active data in the volume's actual size. This size is therefore greater than the volume's nominal data value.
- PULSE-937 Private-Key field change in Update Remote Write request does not reflect in enabling the metric flow
- When using the Cloudera Management Console UI for Remote Storage, the Disable option does not deactivate the remote write configuration, even though the action returns a positive result message. Therefore, when disabling a remote storage configuration, use the CLI client to disable it directly through the API.
- PULSE-841 Disabling the remote write configuration logs an error in both cp prometheus and env prometheus
- When a metric replication is set up between the cluster and Cloudera Observability and the connection is disabled or deleted, Prometheus writes an error message that states that it cannot replicate the metrics.
- PULSE-895 Disabling the remote write config in the UI is broken in Cloudera Private Cloud Data Services
- The Remote Write Enable and Disable options in the Cloudera Management Console’s User Interface do not work when a Remote Storage configuration is created with a requestSignerAuth type from either the HTTP API or using the CDP-CLI tool.
- PULSE-936 No alert to indicate that the metric flow is affected because of a wrong private key configuration
- A remote write alert was not triggered when the wrong private key was used in a Remote Storage configuration.
Known Issues identified in 1.5.1
- External metadata databases are no longer supported on OCP
- As of Cloudera Private Cloud Data Services 1.5.1, external Control Plane metadata databases are no longer supported. New installations require the use of an embedded Cloudera Control Plane database. Upgrades from Cloudera Private Cloud Data Services 1.4.1 or 1.5.0 to 1.5.1 are supported, but there is currently no migration path from a previous external Cloudera Control Plane database to the embedded Cloudera Control Plane database. Upgrades from 1.4.1 or 1.5.0 with external Cloudera Control Plane metadata databases also require additional steps, which are described in the Cloudera Private Cloud Data Services 1.5.1 upgrade topics.
- DOCS-15855: Networking API is deprecated after upgrade to Cloudera Private Cloud Data Services 1.5.1 (K8s 1.24)
- When the control plane is upgraded from 1.4.1 to 1.5.1, the Kubernetes version changes to 1.24. The Livy pods running in existing Virtual Clusters (VCs) use a deprecated networking API for creating ingress for Spark driver pods. Because the old networking API is deprecated and does not exist in Kubernetes 1.24, any new job run will not work for the existing VCs.
- CDPQE-24295: Update docker client to fetch the correct HA Proxy image
- When you execute the Docker command to fetch the Cloudera-provided images into your air-gapped environment, Docker may pull an incorrect version of the HAProxy image, especially if you are using an outdated Docker client. This happens because the Cloudera registry contains images with multiple platform versions, and older Docker clients may lack the capability to retrieve the appropriate architecture version, such as amd64. An example of pinning the platform explicitly follows.
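With a sufficiently recent Docker client (20.10 or later), you can pin the architecture explicitly when pulling; a minimal sketch, where the image reference is illustrative rather than the actual Cloudera repository path:
```python
import subprocess

# --platform forces a multi-arch manifest to resolve to amd64;
# the image name below is a placeholder, not the real repository path.
image = "registry.example.com/cloudera/haproxy:latest"
subprocess.run(
    ["docker", "pull", "--platform", "linux/amd64", image],
    check=True)
```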
- OPSX-4266: Cloudera Embedded Container Service upgrade from 1.5.0 to 1.5.1 is failing in Cadence schema update job
- When upgrading from Cloudera Embedded Container Service 1.5.0 to 1.5.1, the CONTROL_PLANE_CANARY fails with the following error:
Firing alerts for Control Plane: Job did not complete in time, Job failed to complete.
And the cdp-release-cdp-cadence-schema-update job fails.
- OPSX-4076:
- When you delete an environment after the backup event, the restore operation for the backup does not bring up the environment.
- OPSX-4024: CM truststore import into unified truststore should handle duplicate CommonNames
- If multiple CA certificates with exactly the same value for the Common Name field are present in the Cloudera Manager truststore when a Cloudera Data Services on premises cluster is installed, only one of them may be imported into the Data Services truststore. This may cause certificate errors if an incorrect or old certificate is imported. A sketch for detecting duplicate Common Names follows.
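To spot duplicates ahead of the install, you can list the truststore with keytool and count Common Names; a minimal sketch, where the truststore path and password are placeholders:
```python
import re
import subprocess
from collections import Counter

# Placeholder path and password; use your Cloudera Manager truststore values.
out = subprocess.run(
    ["keytool", "-list", "-v",
     "-keystore", "/path/to/truststore.jks", "-storepass", "changeit"],
    check=True, capture_output=True, text=True).stdout

# keytool prints one "Owner: CN=..., OU=..." line per certificate.
cns = re.findall(r"^Owner:.*?CN=([^,\n]+)", out, flags=re.MULTILINE)
for cn, count in Counter(cns).items():
    if count > 1:
        print(f"Duplicate Common Name in truststore: {cn} (x{count})")
```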
- COMOPS-2822: OCP error x509: certificate signed by unknown authority
- The error x509: certificate signed by unknown authority usually means that the Docker daemon that is used by Kubernetes on the managed cluster does not trust the self-signed certificate.
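Docker's standard mechanism for trusting a private registry's CA is a per-registry certificate directory; a minimal sketch, where the registry host and certificate path are placeholders:
```python
import shutil
from pathlib import Path

# Docker trusts /etc/docker/certs.d/<registry>/ca.crt for that registry.
registry = "registry.example.com:5000"   # placeholder registry host:port
dest = Path("/etc/docker/certs.d") / registry
dest.mkdir(parents=True, exist_ok=True)
shutil.copy("/path/to/self-signed-ca.crt", dest / "ca.crt")
# Restart the Docker daemon afterwards so it picks up the new CA.
```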
- OPSX-3073 Cloudera Embedded Container Service First Run command failed at the setup storage step with error "Timed out waiting for local path storage to come up"
- Pod stuck in pending state on a host for a long time. Error in the role log related to the CNI plugin:
Events:
Type Reason Age From Message ---- ------ ---- ---- ------- Warning FailedCreatePodSandBox 3m5s (x269 over 61m) kubelet (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "70427e9b26fb014750dfe4441fdfae96cb4d73e3256ff5673217602d503e806f": failed to find plugin "calico" in path [/opt/cni/bin]
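The event indicates that kubelet could not find the calico binary in the CNI plugin directory; a minimal check you can run on the affected host:
```python
from pathlib import Path

# kubelet looked for the plugin in /opt/cni/bin (see the event above).
plugin = Path("/opt/cni/bin/calico")
print(f"{plugin}: {'present' if plugin.is_file() else 'MISSING'}")
```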
- OPSX-3528: [Pulse] Prometheus config reload fails if multiple remote storage configurations exist with the same name
- It is possible to create multiple remote storage configurations with the same name. However, if such a situation occurs, the metrics will not flow to the remote storage because the configuration reload of the original Prometheus fails.
- OPSX-1405: Able to create multiple Cloudera on premises Environments with the same name
- If two users try to create an environment with the same name at the same time, it might result in an unusable environment.
- OPSX-1412: Creating a new environment through the Cloudera CLI reports intermittently that "Environment name is not unique" even though it is unique
- When multiple users try to create the same environment at the same time, or when automation creates an environment with retries, environment creation may fail on collision with a previous create request.
Known Issues identified in 1.5.0
- The Rebuilding field inside volume.meta is set to true, causing the volume to get stuck in an attaching/detaching loop
- This is a condition that can occur in Cloudera Embedded Container Service Longhorn storage.
Known Issues identified before 1.5.0
- OPSX-5629: COE Insight from case 922848: Not able to connect to Bitbucket
- After installing Cloudera AI on a Cloudera Embedded Container Service cluster, users were not able to connect to the internal Bitbucket repository.
- OPSX-2484: FileAlreadyExistsException during timestamp filtering
- The timestamp filtering may result in a FileAlreadyExistsException when a file with the same name already exists in the tmp directory.
- OPSX-2772: For the Account Administrator user, the update roles functionality should be disabled
- An Account Administrator user holds the largest set of privileges and cannot be modified through the current UI; even if a user tries to modify the permissions, the system does not support changing them for an Account Administrator.
- Recover fast in case of node failures with Cloudera Embedded Container Service HA
- When a node is deleted from the cloud or made unavailable, it takes more than two minutes until the pods are rescheduled on another node.
- Cloudera Data Services on premises Cloudera Embedded Container Service: Failed to perform First Run of services.
- If an issue is encountered during the Install Cloudera Control Plane step of the Containerized Cluster First Run, the installation is re-attempted infinitely rather than the command failing.
- Environment creation through the CDP CLI fails when the base cluster includes Ozone
- Problem: An attempt to create an environment using the CDP command-line interface fails in a Cloudera Private Cloud Data Services deployment when the Cloudera Base on premises cluster is in a degraded state and includes the Ozone service.
- Filtering the diagnostic data by time range might result in a FileAlreadyExistsException
- Problem: Filtering the collected diagnostic data might result in a FileAlreadyExistsException if the /tmp directory already contains a file by that name.
- Kerberos service does not always handle Cloudera Manager downtime
- Problem: The Cloudera Manager Server in the base cluster must be running to generate Kerberos principals for Cloudera on premises. If there is downtime, you might observe Kerberos-related errors.
- Updating user roles for the admin user does not update privileges
- In the Cloudera Management Console, changing roles on the User Management page does not change privileges of the admin user.
- Upgrade applies values that cannot be patched
- If the size of a persistent volume claim in a Containerized Cluster is manually modified, subsequent upgrades of the cluster will fail.