Known issues for the Cloudera Data Services on premises 1.5.5
You must be aware of the known issues and limitations, the areas of impact, and the
workarounds in the Cloudera Data Services on premises 1.5.5 release.
Known Issues in Cloudera Data Services on premises 1.5.5
OBS-8038: When using the Grafana Dashboard URL shortener, the shortened URL defaults to
localhost:3000. This behavior occurs because the URL shortener uses the local server
address instead of the actual domain name of the Cloudera Observability instance. As a
result, users cannot access the shortened URL.
Do not use the shortened URL as is. To ensure users can access the URL, update it to use
the correct Cloudera Observability instance domain name, such as cp_domain/{shorten_url}.
OPSX-6209 and DWX-20809: Cloudera Data
Services on premises installations on RHEL 8.9 or lower versions may encounter
issues
You may notice issues when installing Cloudera Data Services on premises on Cloudera
Embedded Container Service clusters running on RHEL 8.9 or lower versions. Pods enter a
crash loop with the following error:
Warning FailedCreatePodSandBox 1s (x2 over 4s) kubelet Failed to create pod
sandbox: rpc error: code = Unknown desc = failed to create containerd task: failed to create
shim task: OCI runtime create
failed: runc create failed: unable to start container process: unable to init seccomp: error
loading seccomp filter into kernel: error loading seccomp filter: errno 524: unknown
The issue is due to a memory leak with 'seccomp' (Secure Computing Mode) in the Linux
kernel. If your kernel version is not 6.2 or higher, or is not part of
the list of versions mentioned here, you may face issues during installation.
To avoid this issue, increase the value of
net.core.bpf_jit_limit by running the following command on all ECS
hosts:
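The value shown below is an example; confirm the recommended value for your environment
with Cloudera before applying it.
# Increase the BPF JIT limit on every ECS host (example value)
sysctl -w net.core.bpf_jit_limit=528482304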
However,
Cloudera recommends upgrading the Linux kernel to an appropriate version that contains
a patch for the memory leak issue. For a list of versions that contain this patch, see
this link.
COMPX-20705: [153CHF-155] Post ECS upgrade pods are stuck in ApplicationRejected
State
After upgrading the CDP installation, pods on Kubernetes can be left in a failure
state showing "ApplicationRejected". This is caused by a delay in settings being applied
to Kubernetes as part of the post-upgrade steps.
To resolve this issue, restart the scheduler so that it picks up the latest Kubernetes
settings. Also, restart YuniKorn using the following commands:
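The commands below are a sketch that assumes YuniKorn runs as the yunikorn-scheduler
deployment in the yunikorn namespace; adjust the namespace and deployment name to match
your cluster.
# Restart YuniKorn by scaling its scheduler down and back up
kubectl scale deployment yunikorn-scheduler -n yunikorn --replicas=0
kubectl scale deployment yunikorn-scheduler -n yunikorn --replicas=1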
OPSX-6303 - ECS server went down - 'etcdserver: mvcc: database space exceeded'
The ECS server may fail with the error message "etcdserver: mvcc: database space
exceeded" in large clusters.
Add the following to the safety valve for the server group:
etcd-arg:
- "quota-backend-bytes=4294967296"
Restart stale services (select the option to re-deploy client configurations).
The default value for quota-backend-bytes is 2 GB. It can be
increased up to 8 GB.
OPSX-6295 - Control Plane upgrade failing with cadence-matching and
cadence-history
If extra cadence-matching and cadence-history pods are stuck in the
Init:CreateContainerError state, the Cloudera Embedded Container Service upgrade to 1.5.5
becomes stuck in a retry loop because the validation that all pods are running fails.
You must manually apply the workaround to proceed with the upgrade: delete the stuck
cadence pods.
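For example (the pod names and namespace below are placeholders; substitute the pods that
are stuck and the Cloudera Control Plane namespace):
# List the cadence pods and delete the ones stuck in Init:CreateContainerError
kubectl get pods -n <control-plane-namespace> | grep cadence
kubectl delete pod <cadence-matching-pod-name> <cadence-history-pod-name> -n <control-plane-namespace>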
OPSX-4391 - External docker cert not base64 encoded
When using Cloudera Data Services on premises
on ECS, in some rare situations, the CA certificate for the Docker registry in the cdp
namespace is incorrectly encoded, resulting in TLS errors when connecting to the Docker
registry.
Compare and edit the contents of the
"cdp-private-installer-docker-cert" secret in the cdp namespace so that it matches the
contents of the "cdp-private-installer-docker-cert" secret in other namespaces. The
secrets and their corresponding namespaces can be identified using the
command:
kubectl get secret -A | grep cdp-private-installer-docker-cert
Inspect
each secret using the
command:
kubectl get secret -n cdp cdp-private-installer-docker-cert -o yaml
Replace "cdp" with the different namespace names. If necessary, modify the secret in the
cdp namespace using the
command:
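A minimal sketch, assuming you edit the secret in place; make sure any replacement
certificate data you paste is base64 encoded:
kubectl edit secret cdp-private-installer-docker-cert -n cdp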
OPSX-6245 - Airgap | Multiple pods are in pending state on rolling restart
Performing back-to-back rolling restarts on ECS clusters can intermittently fail
during the Vault unseal step. During rapid consecutive rolling restarts, the
kube-controller-manager pod may not return to a ready state promptly. This can cause a
cascading effect where other critical pods, including Vault, fail to initialize
properly. As a result, the unseal Vault step fails.
As a workaround, perform the following steps:
Stop the ECS role that failed.
Start the ECS role again.
If required, perform the rolling restart again.
OPSX-4684 - Start ECS command shows green(finished) even though start docker server
failed on one of the hosts
The Docker service starts, but one or more Docker roles fail to start because the
corresponding host is unhealthy.
Make sure the host is healthy, then start the Docker role on that host.
OPSX-5986 - ECS fresh install failing with helm-install-rke2-ingress-nginx pod failing
to come into Completed state
ECS fresh install fails at the "Execute command Reapply All
Settings to Cluster on service ECS" step due to a timeout waiting for helm-install.
To confirm the issue, run the following kubectl command on the ECS server host to
check if the pod is stuck in a running
state:
kubectl get pods -n kube-system | grep helm-install-rke2-ingress-nginx
To
resolve the issue, manually delete the pod by running:
kubectl delete pod <helm-install-rke2-ingress-nginx-pod-name> -n kube-system
Then, click Resume to proceed with the fresh install process in the Cloudera Manager UI.
OPSX-6298 - Issue on service namespace cleanup
Uninstalling services from the Cloudera Data Services on premises UI might fail for
various reasons.
If uninstallation of a service fails, trigger the service uninstall process again and
select "Force Delete" to ensure that all metadata of the service is removed from the
Cloudera side. Then, go to the OpenShift UI and search for that service
namespace / project. In that project/namespace, select the Actions button at the top
right of the screen and choose Delete Project.
If you go back to the main Projects screen, you can see that the project moves to the
"Terminating" status, after which it is removed from the OCP platform. This action
ensures that all the entities linked to that project/namespace are also removed by
OpenShift.
Cloudera cannot recommend an exact value for the inotify max_user_instances
configuration; it depends on all the workloads that run on a given node.
With newly introduced features such as Istio and cert-manager in the Cloudera Control
Plane, you must set inotify max_user_instances to 256 instead of 128 to resolve this
issue.
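For example, the limit can be raised with sysctl on each affected node (a sketch; the
file name used to persist the setting is arbitrary):
# Raise the per-user inotify instance limit immediately
sysctl -w fs.inotify.max_user_instances=256
# Persist the setting across reboots (example file name)
echo "fs.inotify.max_user_instances=256" > /etc/sysctl.d/99-inotify.conf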
COMPX-20362 - Use API to create a pool that has a subset of resource types
The Resource Management UI supports displaying only three resource types: CPU, memory,
and GPU. When creating a quota, the Resource Management UI always sets all three resource
types it knows about: CPU, memory, and GPU (the Kubernetes resource nvidia.com/gpu). If
no value is chosen for a resource type, a value of 0 is set, blocking the use of that
resource type.
To create a pool that has a subset of resource types, you must use the REST API, as
sketched below.
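The request below is purely illustrative: the endpoint path and payload fields are
hypothetical placeholders rather than the actual Cloudera API schema, so consult the
Cloudera REST API reference for the real resource pool definition. The point it
demonstrates is that the payload includes only the resource types you want (here, CPU and
memory) and omits GPU.
# Hypothetical sketch: create a resource pool that defines only CPU and memory quotas
curl -k -u <user>:<password> -X POST "https://<control-plane-host>/<resource-pool-api-path>" \
  -H "Content-Type: application/json" \
  -d '{"name": "example-pool", "quota": {"cpu": "20", "memory": "80Gi"}}'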
Known issues from previous releases carried in Cloudera Data Services on premises 1.5.5
Known Issues identified in 1.5.4
DOCS-21833: Orphaned replicas/pods are not getting auto cleaned up leading to volume
fill-up issues
By default, Longhorn will not automatically delete the orphaned replica directory.
One can enable the automatic deletion by setting
orphan-auto-deletion to
true.
No workaround available.
OPSX-5310: Longhorn engine images were not deployed on ECS server nodes
Longhorn engine images were not deployed on ECS server nodes due to missing
tolerations for Cloudera Control Plane taints. This caused the engine
DaemonSet to schedule only on ECS agent nodes, preventing deployment on Cloudera Control Plane nodes.
Check the Engine DaemonSet Status. Run the following command to
check if the Longhorn engine DaemonSet is missing on certain
nodes:
kubectl get ds -n longhorn-system | grep engine
Identify Taints on Affected Nodes. Run the following command to check for taints
on affected
nodes:
kubectl describe node <node-name> | grep Taints
Manually Edit the DaemonSet to Add a Toleration. Edit the Longhorn engine
DaemonSet
YAML:
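For example (a sketch; the DaemonSet name and taint key are placeholders, use the engine
image DaemonSet and the taint identified in the previous steps):
kubectl edit ds <engine-image-daemonset-name> -n longhorn-system
Under spec.template.spec, add a toleration that matches the Cloudera Control Plane taint,
for example:
tolerations:
- key: "<taint-key-from-previous-step>"
  operator: "Exists"
  effect: "NoSchedule"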
Apply the changes and verify the deployment. Save and exit the editor. Then, check
if the DaemonSet is now running on all necessary nodes:
kubectl get pods -n longhorn-system -o wide | grep engine
Verify that the engine pods are successfully scheduled on the affected ECS server
nodes.
OPSX-5155: OS Upgrade | Pods are not starting after the OS
upgrade from RHEL 8.6 to 8.8
After an OS upgrade and a start of the Cloudera Embedded Container Service, pods fail to come up due to stale state.
Restart the Cloudera Embedded Container Service
cluster.
OPSX-5055: Cloudera Embedded Container Service upgrade failed at
Unseal Vault step
During a Cloudera Embedded Container Service upgrade from the 1.5.2 to the 1.5.4 release,
the vault pod fails to start due to an error caused by the Longhorn volume being unable
to attach to the host. The error is as follows:
Warning FailedAttachVolume 3m16s (x166 over 5h26m)
attachdetach-controller AttachVolume.Attach failed for volume
"pvc-0ba86385-9064-4ef9-9019-71976b4902a5" : rpc error: code = Internal desc =
volume pvc-0ba86385-9064-4ef9-9019-71976b4902a5 failed to attach to node
host-1.cloudera.com with attachmentID
csi-7659ab0e6655d308d2316536269de47b4e66062539f135bf6012bfc8b41fc345: the volume is
currently attached to different node host-2.cloudera.com
Follow the steps below, provided by SUSE, to ensure the Longhorn volume is correctly
attached to the node where the vault pod is running.
# Find the volume name that is failing to attach to the vault pod,
# for example pvc-bc73e7d3-c7e7-468a-b8e0-afdb8033e40b from the pod logs.
kubectl edit volumeattachments.longhorn.io -n longhorn-system pvc-bc73e7d3-c7e7-468a-b8e0-afdb8033e40b
# Update the "spec:" section of the volumeattachment and replace
attachmentTickets section with {} as shown below and save.
spec:
attachmentTickets: {}
volume: pvc-bc73e7d3-c7e7-468a-b8e0-afdb8033e40b
# scale down the vault statefulset to 0 and scale it back up.
kubectl scale sts vault --replicas=0 -n vault-system
kubectl scale sts vault --replicas=1 -n vault-system
OPSX-4684: Start Cloudera Embedded Container Service command
shows green(finished) even though start docker server failed on one of the hosts
The Docker service starts, but one or more Docker roles fail to start because the
corresponding host is unhealthy.
Ensure the host is healthy. Start the Docker role on the host.
OPSX-735: Kerberos service should handle Cloudera Manager downtime
The Cloudera Manager Server in the base cluster must be running to generate Kerberos
principals for Cloudera on premises. If it is down, you may observe Kerberos-related
errors.
Resolve downtime on Cloudera Manager. If you encounter Kerberos
errors, you can retry the operation (such as retrying creation of the Virtual
Warehouse).
Known Issues identified in 1.5.2
OPSX-4594: [ECS Restart Stability] Post rolling restart few
volumes are in detached state (vault being one of them)
After a rolling restart, some volumes may be in a detached state.
Open the Longhorn UI to view the detached volumes.
Perform the following operations for each volume in a detached state:
Identify the workload name and type from the volume details.
Identify the workload and number of replicas using kubectl or the Kubernetes
UI.
Scale the workload down to 0.
Wait for the pods associated with the workload to fully terminate.
Scale the workload back up to the number of replicas it had originally.
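A minimal sketch of the scale-down and scale-up steps, assuming the workload is a
Deployment (the names and replica count are placeholders):
kubectl scale deployment <workload-name> -n <namespace> --replicas=0
# Wait until all pods of the workload have terminated, then scale back up
kubectl scale deployment <workload-name> -n <namespace> --replicas=<original-replica-count>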
To prevent this issue, use the Longhorn UI to set the number of replicas for the
volume to at least 3.
OPSX-4392: Getting the real client IP address in the
application
CML has a feature for adding an audit event for each user action (Monitoring User
Events). In Private Cloud, the internal IP is captured instead of the client IP and
logged into the internal database.
In ECS, set the enable-real-ip configuration to true for the nginx ingress controller,
as sketched below.
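One way to do this is through an RKE2 HelmChartConfig for the bundled ingress-nginx
chart; the sketch below assumes the default rke2-ingress-nginx chart name and kube-system
namespace used by RKE2 on ECS.
apiVersion: helm.cattle.io/v1
kind: HelmChartConfig
metadata:
  name: rke2-ingress-nginx
  namespace: kube-system
spec:
  valuesContent: |-
    controller:
      config:
        enable-real-ip: "true"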
CDPVC-1137, CDPAM-4388, COMPX-15083, and COMPX-15418: OpenShift Container Platform version upgrade from 4.10 to 4.11 fails due to a
Pod Disruption Budget (PDB) issue
A PDB can prevent a node from draining, which makes the node report the
Ready,SchedulingDisabled state. As a result, the node is not updated to the correct
Kubernetes version when you upgrade OpenShift Container Platform from 4.10 to 4.11.
To resolve this issue, confirm that the upgrade has
failed due to the PDB issue, and then manually delete the PDBs from the Cloudera on premises namespace.
Run the following command to check whether the nodes are stuck in the
Ready,SchedulingDisabled
state:
oc get nodes
Get the machine config daemon details of the particular pod as
follows:
oc get po -n openshift-machine-config-operator -l 'k8s-app=machine-config-daemon' -o wide
Check the logs of the machine config operator of that particular node as
follows:
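For example (a sketch; the container name assumes the standard layout of the
machine-config-daemon pod):
oc logs [***MACHINE-CONFIG-DAEMON-NAME***] -n openshift-machine-config-operator -c machine-config-daemon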
Replace
[***MACHINE-CONFIG-DAEMON-NAME***] with the actual machine
config daemon name.
You may see one of the following errors in the node
logs:
error when evicting pods/cdp-release-cpx-liftie-****" -n
"[***PRIVATE-CLOUD-NAMESPACE***] Cannot evict pod as
it would violate the pod's disruption budget
error when evicting
pods/"cdp-release-cluster-proxy-[******]" -n
"[***PRIVATE-CLOUD-NAMESPACE***] Cannot evict pod as
it would violate the pod's disruption budget
Delete the PDB from the Cloudera on premises
namespace as follows:
Obtain the PDB for the cdp-release-cluster-proxy
namespace:
oc get pdb -n [***PRIVATE-CLOUD-NAMESPACE***] | grep cdp-release-cluster-proxy
Back up the
PDB:
oc get pdb [***PDB-NAME-OF-CLUSTER-PROXY***] -n [***PRIVATE-CLOUD-NAMESPACE***] -o yaml >> [***BACKUP-FILE-NAME***].yaml