Performance Differences between CDH and CDP
Assess the performance changes this migration can bring. If you are planning to migrate the current Impala workload to Public Cloud, conduct a performance impact analysis to evaluate how this migration will affect you.
IO Performance Considerations
On-prem CDH hosts often have substantial IO hardware attached to support large scan operations on HDFS, potentially providing 10s of GB per seconds of bandwidth with many SSD devices and dedicated interconnects. Due to the transient nature and cost structure of cloud instances, such a model is not practical for primary storage in CDP.
Like many AWS database offerings, HFS in CDP uses EBS volumes for persistence. The EBS gp2 has a bandwidth limit of 250MB/sec/volume. In addition, EBS may be throttled to zero throughput for extended durations if bandwidth exceeds thresholds. Because of these limitations, it is not practical in many cases to rely on direct IO to EBS for performance. EBS is also routed over shared network hardware and may have additional performance limitations due to redundancy.
To mitigate the PC IO bandwidth discrepancy, ephemeral storage is relied on heavily for caching working sets. While this is existing Impala behavior carried over from CDH, the penalty for going to primary storage is much higher so more data must be cached locally to maintain equivalent performance. Since ephemeral storage is also used for spilling of intermediate results, it is import to avoid excess spilling which could compete for bandwidth.