How to use the post copy reconciliation script for HDFS replication policies

CDP Private Cloud Base versions 7.7.1 CHF22 and 7.11.3 CHF8 and higher support the latest version of the Post Copy Reconciliation (PCR) script. You can run the PCR script on your HDFS replication policies using different methods, depending on your requirements. You can run the PCR script for HDFS replication policies between on-premises clusters if the target cluster is on a supported version. You can also set options to record debug information and use the extra logging capabilities for troubleshooting purposes.

Some use cases for the PCR script are:
  • When replicating large amounts of data, you might want to verify whether all the data was replicated successfully.
  • After a recovery or failover scenario, you might want to check data integrity.
  • When there is a change on the target but no snapshot of it is available on the target, you might want to verify whether the data on the source and target is in sync.

What is a PCR script?

The PCR script validates the data that is replicated using the HDFS replication policy. It checks whether the HDFS replication was successful by verifying whether the source and target locations have the same content. It accomplishes this by performing a full file listing on the source and target after replication, and then uses the file listings to compare the following attributes:
Paths of source and target data
The PCR script compares this attribute by default.
File sizes
You can disable this comparison using the pcrEnableLengthCheck=false query parameter in the PCR API.
File last modification time
You can disable this comparison using the pcrEnableModtimeCheck=false query parameter in the PCR API.
Cyclic redundancy check (CRC) checksums
PCR checks this attribute when available. You can disable this comparison using the pcrEnableCrcCheck=false query parameter in the PCR API. For example, /clusters/[***CLUSTER NAME***]/services/[***SERVICE***]/replications/[***SCHEDULE ID***]/postCopyReconciliation?pcrEnableCrcCheck=false&pcrEnableModtimeCheck=false
To compare checksums, the source must support the checksum extension for the "DistCpFileStatus" class. PCR compares checksums for HDFS replication policies only if the following conditions are met:
  • The replication is between on-premises clusters
  • Both the source and target clusters support the checksum extension
  • The target cluster supports PCR
  • The source and target files are not encrypted
  • The source and target files have the same block size
To write the checksums for PCR, you can enable one of the following:
  • Enable the CRC FileStatus extension only for PCR by setting the ENABLE_FILESTATUS_CRC_FOR_ADDITIONAL_DEBUG_STEPS = true key-value pair in the target Cloudera Manager > Clusters > HDFS service > Configuration > hdfs_replication_env_safety_valve property.
  • Enable the CRC FileStatus extension globally by setting the following key-value pairs for the target Cloudera Manager > Clusters > HDFS service > Configuration > hdfs_replication_env_safety_valve property:
    • ENABLE_FILESTATUS_EXTENSIONS = true
    • ENABLE_FILESTATUS_CRC_EXTENSION = true
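
If you prefer to set these values outside the Cloudera Manager UI, the following is a minimal sketch that updates the safety valve through the Cloudera Manager REST API using Python. The host, API version, cluster name, and credentials are hypothetical placeholders; verify the endpoint and configuration format against the API reference for your Cloudera Manager version.

    import requests

    # Hypothetical Cloudera Manager host, API version, and cluster name.
    CM_HOST = "https://cm.example.com:7183"
    URL = f"{CM_HOST}/api/v41/clusters/Cluster1/services/hdfs/config"

    # Caution: this PUT replaces the current value of the safety valve,
    # so merge any existing entries into the value string first.
    payload = {"items": [{
        "name": "hdfs_replication_env_safety_valve",
        "value": "ENABLE_FILESTATUS_EXTENSIONS=true\n"
                 "ENABLE_FILESTATUS_CRC_EXTENSION=true",
    }]}

    resp = requests.put(URL, json=payload,
                        auth=("admin", "admin"),  # hypothetical credentials
                        verify=False)             # test environments only
    resp.raise_for_status()
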
Replication Manager performs the following steps when you run the PCR script or when you include the PCR command step in an HDFS replication policy:
  1. Checks whether snapshots are available on the source and target. When available, the snapshots are listed in the next command step; otherwise, the source and target directories are listed directly.
  2. Performs a full file listing on the source and target; the source and target listings run in parallel. If the source supports it, the source file listing runs as a remote command on the source, and the resulting listing file is transferred to the target.
  3. Runs the PCR to compare the two file listings, after which the results are saved to the mismatch_paths.tsv file and, if enabled, the all_paths.tsv file. If a fail-on status is detected, the replication policy run fails.
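
Conceptually, the comparison in step 3 is a join of the two listings on path. The following Python sketch is an illustration only, not Replication Manager's actual implementation; the listing file names and the four-column format (path, length, modification time, CRC) are assumptions. It flags entries using the mismatch categories described later in this topic.

    import csv

    def load_listing(path):
        # Read a TSV listing of (path, length, modtime, crc) rows into a dict
        # keyed by path. This format is an illustrative assumption.
        with open(path, newline="") as f:
            return {row[0]: row[1:] for row in csv.reader(f, delimiter="\t")}

    source = load_listing("source_listing.tsv")   # hypothetical file names
    target = load_listing("target_listing.tsv")

    with open("mismatch_paths.tsv", "w", newline="") as out:
        writer = csv.writer(out, delimiter="\t")
        for p, attrs in source.items():
            if p not in target:
                writer.writerow([p, "MISSING_ON_TARGET"])
            elif attrs != target[p]:
                writer.writerow([p, "OTHER_MISMATCH"])
        for p in target:
            if p not in source:
                writer.writerow([p, "MISSING_ON_SOURCE"])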

The PCR run and the replication run for the same replication job must not overlap. If they overlap, the replication run is not affected, but the PCR results become unreliable. Therefore, do not run the PCR script while a replication run is active.

The debug output of PCR is available in the mismatch_paths.tsv file on the target HDFS, and is saved in the $logDir/debug directory. For example, hdfs://user/hdfs/.cm/distcp/2023-08-24_206/debug/mismatch_paths.tsv.
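
To inspect the results, you can read the file directly from the target HDFS. For example, using the path from the example above:

    hdfs dfs -cat /user/hdfs/.cm/distcp/2023-08-24_206/debug/mismatch_paths.tsv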

If you want to restore the earlier PCR output format, set the com.cloudera.enterprise.distcp.post-copy-reconciliation.legacy-output-format.enabled = true key-value pair in the target Cloudera Manager > Clusters > HDFS service > Configuration > hdfs_replication_hdfs_site_safety_valve property.
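
Advanced configuration snippets for *-site.xml properties typically take XML property elements. The following is a sketch of the entry; confirm the exact format for your Cloudera Manager version:

    <property>
      <name>com.cloudera.enterprise.distcp.post-copy-reconciliation.legacy-output-format.enabled</name>
      <value>true</value>
    </property>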

Different methods to run PCR

You can use one of the following methods to run PCR on an HDFS replication policy:

Run the PCR script using the API

Use the /clusters/[***CLUSTER NAME***]/services/[***SERVICE***]/replications/[***SCHEDULE ID***]/postCopyReconciliation API endpoint.

When you set the API parameters, you can choose to compare any or all of the supported attributes (file size, file modification time, and CRC checksums) during the PCR script run. By default, the checks for all these attributes are enabled.
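
The following Python sketch shows one way to invoke this endpoint with some checks disabled. The host, API version, schedule ID, credentials, and the HTTP method (POST, as with other Cloudera Manager command endpoints) are assumptions; verify them against the API reference for your Cloudera Manager version.

    import requests

    # Hypothetical Cloudera Manager host, API version, and identifiers.
    CM_HOST = "https://cm.example.com:7183"
    URL = (f"{CM_HOST}/api/v41/clusters/Cluster1/services/hdfs"
           f"/replications/42/postCopyReconciliation")

    # Disable the CRC and modification-time checks; the path and file-size
    # comparisons remain enabled by default.
    params = {"pcrEnableCrcCheck": "false", "pcrEnableModtimeCheck": "false"}

    resp = requests.post(URL, params=params,
                         auth=("admin", "admin"),  # hypothetical credentials
                         verify=False)             # test environments only
    resp.raise_for_status()
    print(resp.json())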

Include PCR as part of a replication job

To include the PCR script in an HDFS replication policy as a command step, enter the SCHEDULES_WITH_ADDITIONAL_DEBUG_STEPS = [***COMMA-SEPARATED LIST OF NUMERICAL IDS OF THE REPLICATION POLICIES***] key-value pair in the target Cloudera Manager > Clusters > HDFS service > Configuration > hdfs_replication_env_safety_valve property, and then run the replication policy. The PCR step is added automatically to subsequent replication runs. In this method, PCR runs as a command step and does not interfere with the replication process.
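
For example, to add the PCR step to the replication policies whose numerical IDs are 3 and 7 (hypothetical IDs), the entry is:

    SCHEDULES_WITH_ADDITIONAL_DEBUG_STEPS = 3,7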

You can also enable the checks individually for each attribute (file size, file modification time, and CRC checksums) when you include the PCR script as part of the replication job. To do so, set the following variables to true in the target Cloudera Manager > Clusters > HDFS service > Configuration > hdfs_replication_hdfs_site_safety_valve property before you run the replication policy, as shown in the example after this list. Replication Manager validates only the attributes that you set to true:
  • com.cloudera.enterprise.distcp.post-copy-reconciliation.length-check.enabled
  • com.cloudera.enterprise.distcp.post-copy-reconciliation.modtime-check.enabled
  • com.cloudera.enterprise.distcp.post-copy-reconciliation.crc-check.enabled
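
For example, to validate file sizes and CRC checksums but skip the modification-time check, the entries might look like the following, in the same XML property format shown earlier (confirm the format for your Cloudera Manager version):

    <property>
      <name>com.cloudera.enterprise.distcp.post-copy-reconciliation.length-check.enabled</name>
      <value>true</value>
    </property>
    <property>
      <name>com.cloudera.enterprise.distcp.post-copy-reconciliation.crc-check.enabled</name>
      <value>true</value>
    </property>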

Debug and extra logging for PCR

Additionally, you can perform the following steps to enable debug steps and extra logging for PCR, which can help you troubleshoot issues:

  • To save the debug-related information, enter the following key-value pairs in the target Cloudera Manager > Clusters > HDFS service > Configuration > hdfs_replication_hdfs_site_safety_valve property:
    • com.cloudera.enterprise.distcp.post-copy-reconciliation.fail-on = MISSING_ON_TARGET, MISSING_ON_SOURCE, OTHER_MISMATCH, ANY_MISMATCH, or NONE

      The mismatch_paths.tsv file is updated.

    • com.cloudera.enterprise.distcp.post-copy-reconciliation.all-paths=true

      An entry is added to the all_paths.tsv file for each compared path.

  • To initiate and save extra logging information, enter the EXTRA_LOG_CONFIGS_[***NUMERICAL ID OF THE REPLICATION POLICY***] = [***VALUE***] key-value pair in the target Cloudera Manager > Clusters > HDFS service > Configuration > hdfs_replication_env_safety_valve property.
    For example, if your on-premises cluster is on Microsoft Azure, the value is:
    log4j.rootLogger=INFO,console;hadoop.root.logger=INFO,console;log4j.appender.console=org.apache.log4j.ConsoleAppender;log4j.appender.console.target=System.err;log4j.appender.console.layout=org.apache.log4j.PatternLayout;log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{2}: %m%n;log4j.logger.org.apache.hadoop.fs.azurebfs.services.AbfsIoUtils=DEBUG,console;log4j.logger.org.apache.hadoop.fs.azurebfs.services.AbfsClient=DEBUG,console;log4j.logger.distcp.SimpleCopyListing=DEBUG,console;log4j.logger.distcp.SnapshotDiffGenerator=DEBUG,console

The extra debug logs are available in the $logDir/debug directory. For example, hdfs://user/hdfs/.cm/distcp/2023-08-24_206/debug.