Known Issues in Cloudera Data Services on premises 1.5.4 SP2
Learn about the known issues for Cloudera Data Services on premises, their impact on functionality, and the available workarounds.
DWX-20809: Cloudera Data Services on premises
installations on RHEL 8.9 or lower versions may encounter issues
You may notice issues when installing Cloudera Data Services on premises on Cloudera Embedded Container Service (ECS) clusters running on RHEL 8.9 or lower versions. Pods crash-loop with the following error:
Warning FailedCreatePodSandBox 1s (x2 over 4s) kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create containerd task: failed to create shim task: OCI runtime create
failed: runc create failed: unable to start container process: unable to init seccomp: error loading seccomp filter into kernel: error loading seccomp filter: errno 524: unknown
The issue is caused by a memory leak in seccomp (Secure Computing Mode) in the Linux kernel. If your kernel version is not 6.2 or higher, and is not in the list of versions mentioned here, you may face issues during installation.
To avoid this issue, increase the value of net.core.bpf_jit_limit by running the following command on all ECS hosts:
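The command is not shown above; a likely sketch, assuming the limit value commonly used to work around this seccomp leak (verify the exact value against your Cloudera documentation before applying it):

```shell
# Raise the BPF JIT allocation limit so seccomp filters can keep loading
# despite the kernel memory leak (the value is an assumption; adjust as needed).
sysctl -w net.core.bpf_jit_limit=528482304

# To persist the setting across reboots (file path is illustrative):
echo "net.core.bpf_jit_limit=528482304" >> /etc/sysctl.d/99-bpf-jit.conf
```

Note that this only raises the ceiling the leak consumes; the kernel upgrade described below is the actual fix.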
However, Cloudera recommends upgrading the Linux kernel to an appropriate version that contains a patch for the memory leak issue. For a list of versions that contain this patch, see this link.
OPSX-5776 and OPSX-5747: ECS - Some of the rke2-canal DaemonSet
pods in the kube-system namespace are stuck in Init state causing longhorn volume attach
issues
In a few cases, after upgrading from Cloudera Data Services on premises 1.5.3 or 1.5.3-CHF1 to 1.5.4 SP2, a pod that belongs to the rke2-canal DaemonSet is stuck in Init status. This causes some pods in the kube-system and longhorn-system namespaces to be in Init or CrashLoopBackOff status. This manifests as a volume attach failure in the embedded-db-0 pod in the CDP namespace, and causes some pods in the CDP namespace to be in CrashLoopBackOff state.
Perform a rolling restart of the rke2-canal DaemonSet by running the following command:
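The restart command is not shown above; a minimal sketch, assuming the DaemonSet name and namespace given in the issue description:

```shell
# Trigger a rolling restart of the rke2-canal DaemonSet
kubectl rollout restart daemonset rke2-canal -n kube-system
```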
Monitor the DaemonSet restart status by running the following command:
kubectl get ds -n kube-system
After the rke2-canal DaemonSet restart is complete, if any pods in DaemonSets within the longhorn-system namespace remain in Init or CrashLoopBackOff state, perform a rolling restart of those DaemonSets. Choose the appropriate command based on the specific DaemonSet that is failing. If more than one DaemonSet requires a restart, restart them sequentially, one at a time.
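A sketch of such a restart, using a placeholder DaemonSet name (substitute the DaemonSet that is actually failing):

```shell
# List DaemonSets in the longhorn-system namespace to identify the failing one
kubectl get ds -n longhorn-system

# Rolling-restart the failing DaemonSet, then wait for it to converge
# before moving on to the next one
kubectl rollout restart daemonset <daemonset-name> -n longhorn-system
kubectl rollout status daemonset <daemonset-name> -n longhorn-system
```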
OPSAPS-72270: Start ECS command fails on uncordon nodes step
In an ECS HA cluster, the server node sometimes restarts during startup. This causes the uncordon step to fail.
Run the following command on the same node to verify whether the kube-apiserver is ready:
kubectl get pods -n kube-system | grep kube-apiserver
Resume the command from the Cloudera Manager UI.
OPSX-5239: Updating the External Docker Registry Certificate
command fails when existing Pods are restarted.
If a wrong certificate is applied using the path ECS -> admin -> certificates, it cannot be corrected afterwards by running the Update External Docker Registry Certificate command in Cloudera Manager. Replacing the external Docker registry certificate with an invalid one and then running this command to restore the correct certificate is not a supported workflow.
For example:
1. Install PVC with an external Docker registry.
2. Update the wrong certificate in the ECS configurations and run the Update External Docker Registry Certificate command.
3. Restart all the Pods in the cdp namespace. (Pods go into ImagePullBackOff error state.)
4. Update the correct certificate in the ECS configurations and run the Update External Docker Registry Certificate command.
Running step 4 in Cloudera Manager does not recover from the wrong certificate.
Workaround: None.
OPSAPS-72964 and OPSAPS-72769: Unseal Vault command fails after
restarting the ECS service
The Unseal Vault command sometimes fails after restarting the ECS service.
It may take some time for the ECS cluster to be up and running after a restart operation. If the Unseal Vault command fails after the restart, follow these steps:
1. Verify that the pod vault-0 in the vault-system namespace is running.
2. Once it is in Running state, initiate the Unseal Vault command from the ECS service's Action menu.
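The first step can be checked with a kubectl query; a sketch, assuming the pod and namespace names from the step above:

```shell
# Confirm vault-0 is in the Running state before unsealing
kubectl get pod vault-0 -n vault-system
```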
OPSX-5986: ECS fresh install fails because the helm-install-rke2-ingress-nginx pod does not reach Completed state
The ECS fresh install fails at the "Execute command Reapply All Settings to Cluster on service ECS" step due to a timeout waiting for helm-install. To confirm the issue, run the following kubectl command on the ECS server host to check whether the pod is stuck in a Running state:
kubectl get pods -n kube-system | grep helm-install-rke2-ingress-nginx
To resolve the issue, manually delete the pod by running the following command:
kubectl delete pod <helm-install-rke2-ingress-nginx-pod-name> -n kube-system
Then, in the Cloudera Manager UI, click Resume to proceed with the fresh install.