Known Issues in Cloudera Data Services on premises 1.5.4 SP2

The following are the known issues in the 1.5.4 SP2 release of Cloudera Data Services on premises.

DWX-20809: Cloudera Data Services on premises installations on RHEL 8.9 or lower versions may encounter issues
You may notice issues when installing Cloudera Data Services on premises on Cloudera Embedded Container Service (ECS) clusters running on RHEL 8.9 or lower versions. Pods crash-loop with the following error:
Warning  FailedCreatePodSandBox           1s (x2 over 4s)  kubelet   Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create containerd task: failed to create shim task: OCI runtime create
 failed: runc create failed: unable to start container process: unable to init seccomp: error loading seccomp filter into kernel: error loading seccomp filter: errno 524: unknown 
The issue is due to a memory leak with 'seccomp' (Secure Computing Mode) in the Linux kernel. If your kernel version is not 6.2 or higher, or is not part of the list of patched versions mentioned here, you may face issues during installation.
To avoid this issue, increase the value of net.core.bpf_jit_limit by running the following command on all ECS hosts:
[root@host ~]# sysctl net.core.bpf_jit_limit=528482304

However, Cloudera recommends upgrading the Linux kernel to an appropriate version that contains a patch for the memory leak issue. For a list of versions that contain this patch, see this link.
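Note that the sysctl change above applies only to the running system and does not persist across reboots. The following sketch shows one way to persist the setting with a drop-in file under /etc/sysctl.d (the file name is illustrative) and to check the running kernel version when deciding whether a kernel upgrade is needed:
# Check the running kernel version
uname -r
# Persist the higher BPF JIT limit across reboots (file name is an example)
echo "net.core.bpf_jit_limit=528482304" > /etc/sysctl.d/99-bpf-jit-limit.conf
sysctl -p /etc/sysctl.d/99-bpf-jit-limit.conf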

OPSX-5776 and OPSX-5747: ECS - Some of the rke2-canal DaemonSet pods in the kube-system namespace are stuck in Init state causing longhorn volume attach issues
In a few cases, after upgrading from Cloudera Data Services on premises 1.5.3 or 1.5.3-CHF1 to 1.5.4 SP2, a pod that belongs to the rke2-canal DaemonSet is stuck in the Init status. This causes some pods in the kube-system and longhorn-system namespaces to be in the Init or CrashLoopBackOff status, which manifests as a volume attach failure in the embedded-db-0 pod in the CDP namespace and causes some pods in the CDP namespace to be in the CrashLoopBackOff state. To resolve this issue, perform the following steps:
  1. Perform a rolling restart of rke2-canal DaemonSet by running the following command:
    kubectl rollout restart ds rke2-canal -n kube-system
  2. Monitor the DaemonSet restart status by running the following command:
    kubectl get ds -n kube-system
  3. After the rke2-canal DaemonSet restart is complete, if any pods in DaemonSets within the longhorn-system namespace remain in Init or CrashLoopBackOff state, perform a rolling restart of those DaemonSets. Choose the appropriate command based on the specific DaemonSet that is failing. If more than one DaemonSet requires a restart, restart them sequentially, one at a time.
    kubectl rollout restart ds longhorn-csi-plugin -n longhorn-system
    kubectl rollout restart ds longhorn-manager -n longhorn-system
    kubectl rollout restart ds engine-image-ei-6b4330bf -n longhorn-system
    kubectl rollout restart ds engine-image-ei-ea8e2e58 -n longhorn-system 
  4. Monitor the DaemonSet restart status by running the following command:
    kubectl get ds -n longhorn-system
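After the Longhorn DaemonSets stabilize, you can confirm that no pods remain stuck in the Init or CrashLoopBackOff state. A quick check such as the following lists any pods that are not Running or Completed (the CDP namespace is assumed to be named cdp here; adjust it to match your environment):
kubectl get pods -n kube-system | grep -Ev 'Running|Completed'
kubectl get pods -n longhorn-system | grep -Ev 'Running|Completed'
kubectl get pods -n cdp | grep -Ev 'Running|Completed'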
OPSX-5903: 1.5.3 to 1.5.4 SP2 ECS - The upgrade fails when the rke2-ingress-nginx-controller deployment exceeds its progress deadline.
Upgrade fails while running the following command:
kubectl rollout status deployment/rke2-ingress-nginx-controller -n kube-system --timeout=5m
To work around this issue, refresh the rke2-ingress-nginx-controller deployment by performing the following steps:
  1. Find the number of replicas for the deployment:
    kubectl get deployment -n kube-system rke2-ingress-nginx-controller
  2. Scale it down to 0:
    kubectl scale deployment rke2-ingress-nginx-controller -n kube-system --replicas=0
  3. Once all associated pods are terminated, scale the deployment back up to the number of replicas noted in step 1 (you can verify the recovery with the check shown after these steps):
    kubectl scale deployment rke2-ingress-nginx-controller -n kube-system --replicas=<n>
  4. Resume the upgrade.
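Before you resume, you can confirm that the ingress controller pods have returned to the Running state (the grep pattern below is illustrative):
kubectl get pods -n kube-system | grep rke2-ingress-nginx-controller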
OPSAPS-72270: Start ECS command fails on uncordon nodes step

In an ECS HA cluster, the server node sometimes restarts during startup. This causes the uncordon step to fail.

Run the following command on the same node to verify whether the kube-apiserver is ready:
kubectl get pods -n kube-system | grep kube-apiserver

Once the kube-apiserver pod is ready, resume the command from the Cloudera Manager UI.
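You can also check that no nodes are left cordoned before resuming; nodes that are still cordoned report SchedulingDisabled in their status:
kubectl get nodes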

OPSAPS-72964 and OPSAPS-72769: Unseal Vault command fails after restarting the ECS service
The Unseal Vault command sometimes fails after the ECS service is restarted.

It may take some time for the ECS cluster to be up and running after a restart operation. If the Unseal Vault command fails after the restart, follow these steps:

  1. Verify that the pod vault-0 in the vault-system namespace is running (a sample check is shown after these steps).
  2. Once the pod is in the Running state, initiate the Unseal Vault command from the ECS service's Actions menu.
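A quick way to check the pod status from an ECS server host:
kubectl get pods -n vault-system | grep vault-0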
OPSX-5986: ECS fresh install fails because the helm-install-rke2-ingress-nginx pod does not reach the Completed state
ECS fresh install fails at the "Execute command Reapply All Settings to Cluster on service ECS" step due to a timeout waiting for helm-install. To confirm the issue, run the following kubectl command on the ECS server host and check whether the pod is stuck in the Running state:
kubectl get pods -n kube-system | grep helm-install-rke2-ingress-nginx
To resolve the issue, manually delete the pod by running the following command:
kubectl delete pod <helm-install-rke2-ingress-nginx-pod-name> -n kube-system
Then, click Resume in the Cloudera Manager UI to proceed with the fresh install process.
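After resuming, the helm-install pod is typically recreated automatically; you can re-run the earlier check to confirm that the new pod reaches the Completed state:
kubectl get pods -n kube-system | grep helm-install-rke2-ingress-nginx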