HDFS Sink

Learn more about the HDFS Sink connector.

The HDFS Sink connector can be used to transfer data from Kafka topics to files on HDFS clusters. Each partition of every topic results in a collection of files named in the following pattern:

{topic name}_{partition number}_{end_offset}.{file extension}

For example, running the HDFS Sink connector on partition 0 of a topic named sourceTopic can yield the following series of files:

sourceTopic_0_50.avro - holding records 0 to 50
sourceTopic_0_79.avro - holding records 51 to 79
...

The HDFS Sink connector periodically commits records to final result files. Each commit results in a separate "chunk" file.
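The naming pattern above can be illustrated with a short Python sketch that parses chunk file names and derives the record range each file holds. This is not part of the connector. The helper names (`parse_chunk_name`, `offset_ranges`) and the `first_offset` parameter are hypothetical, and the sketch assumes that all names belong to the same topic partition and that the first chunk starts at offset 0, as in the example.

```python
import re

# Pattern from the text: {topic name}_{partition number}_{end_offset}.{file extension}
# Topic names may themselves contain underscores, so the topic group is greedy
# and the two numeric fields are anchored at the end of the name.
CHUNK_NAME = re.compile(
    r"^(?P<topic>.+)_(?P<partition>\d+)_(?P<end_offset>\d+)\.(?P<ext>[^.]+)$"
)


def parse_chunk_name(name):
    """Split a chunk file name into (topic, partition, end_offset, extension)."""
    match = CHUNK_NAME.match(name)
    if match is None:
        raise ValueError(f"not a chunk file name: {name!r}")
    return (
        match["topic"],
        int(match["partition"]),
        int(match["end_offset"]),
        match["ext"],
    )


def offset_ranges(names, first_offset=0):
    """Return {file name: (first offset, last offset)} for one topic partition.

    Assumption: every name belongs to the same topic partition and the first
    chunk starts at first_offset (a hypothetical parameter, 0 by default).
    """
    ordered = sorted(names, key=lambda n: parse_chunk_name(n)[2])
    ranges = {}
    start = first_offset
    for name in ordered:
        end = parse_chunk_name(name)[2]
        ranges[name] = (start, end)
        start = end + 1  # the next chunk begins right after this one ends
    return ranges


if __name__ == "__main__":
    chunks = ["sourceTopic_0_79.avro", "sourceTopic_0_50.avro"]
    for name, (start, end) in offset_ranges(chunks).items():
        print(f"{name}: records {start} to {end}")
```

Because only the end offset is encoded in the file name, the start of each chunk has to be inferred from the previous chunk, which is why the files are sorted by end offset first.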

Configuration example for writing data to HDFS

A simple configuration example for the HDFS Sink connector.

The following is a simple configuration example for the HDFS Sink connector. Short descriptions of the properties set in this example are also provided. For a full properties reference, see the HDFS Sink properties reference.

{
    "connector.class": "com.cloudera.dim.kafka.connect.hdfs.HdfsSinkConnector",
    "tasks.max": 1,
    "key.converter": "org.apache.kafka.connect.storage.StringConverter",
    "value.converter": "com.cloudera.dim.kafka.connect.converts.AvroConverter",
    "value.converter.passthrough.enabled": true,
    "value.converter.schema.registry.url": "http://localhost:9090/api/v1",
    "topics": "avro_topic",
    "hdfs.uri": "hdfs://my-host.my-realm.com:8020",
    "hdfs.output": "/topics_output/",
    "output.writer": "com.cloudera.dim.kafka.connect.partition.writers.avro.AvroPartitionWriter",
    "output.avro.passthrough.enabled": true,
    "hdfs.kerberos.authentication": true,
    "hdfs.kerberos.user.principal": "user_account@MY-REALM.COM",
    "hdfs.kerberos.keytab.path": "/path/to/user_account.keytab",
    "hdfs.kerberos.namenode.principal": "hdfs/_HOST@MY-REALM.COM",
    "hadoop.conf.path": "/etc/hadoop/"
}

connector.class: Class name of the HDFS Sink connector.
key.converter: The converter capable of understanding the data format of the key of each record on this topic.
value.converter: The converter capable of understanding the data format of the value of each record on this topic.
Note: When the AvroConverter is used, you can specify Schema Registry properties to be used by the AvroConverter's Schema Registry client. To do so, add the required Schema Registry property as a suffix to the value.converter property, for example, value.converter.schema.registry.url. Properties defined this way are passed on to the Schema Registry client used by the AvroConverter.
value.converter.passthrough.enabled: Controls whether data is converted into the Kafka Connect intermediate data format before it is written to an output file. Because the input and output formats are the same in this example, the property is set to true, that is, data is not converted.
value.converter.schema.registry.url: The URL of the Schema Registry server. This property is mandatory if the topic has records encoded in Avro format.
topics: List of topics to consume data from.
hdfs.uri: The URI of the NameNode of the HDFS cluster.
hdfs.output: The destination folder on the HDFS cluster where output files will reside.
output.writer: Determines the output file format. Because in this example the output format is Avro, AvroPartitionWriter is used.
output.avro.passthrough.enabled: This property has to match the configuration of the value.converter.passthrough.enabled property because both the input and output formats are Avro.
hdfs.kerberos.authentication: Enables or disables Kerberos authentication.
hdfs.kerberos.user.principal: The user principal that the Kafka Connect role will use.
hdfs.kerberos.keytab.path: The path to the Kerberos keytab file.
hdfs.kerberos.namenode.principal: The Kerberos principal used by the NameNode. This is necessary when the HDFS cluster has data encryption turned on.
hadoop.conf.path: The path to the Hadoop configuration files. This is necessary when the HDFS cluster has data encryption turned on.
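Some of the constraints described above can be checked before a configuration is deployed. The following Python sketch is one possible pre-flight check, not a feature of the connector. The function name `check_config` is hypothetical, the configuration is assumed to be loaded from JSON with real booleans, and the rule that Kerberos authentication needs both a user principal and a keytab path is an assumption inferred from the example rather than a documented requirement. Missing passthrough flags fall back to true, the default listed in the properties reference.

```python
AVRO_CONVERTER = "com.cloudera.dim.kafka.connect.converts.AvroConverter"


def check_config(config):
    """Return a list of human-readable problems found in a connector config."""
    problems = []

    # The two passthrough flags have to agree when input and output are Avro.
    converter_passthrough = config.get("value.converter.passthrough.enabled", True)
    writer_passthrough = config.get("output.avro.passthrough.enabled", True)
    if converter_passthrough != writer_passthrough:
        problems.append(
            "output.avro.passthrough.enabled must match "
            "value.converter.passthrough.enabled"
        )

    # The Schema Registry URL is mandatory for Avro-encoded records.
    uses_avro = config.get("value.converter") == AVRO_CONVERTER
    if uses_avro and not config.get("value.converter.schema.registry.url"):
        problems.append("value.converter.schema.registry.url is required for Avro")

    # Assumption: Kerberos authentication needs a principal and a keytab.
    if config.get("hdfs.kerberos.authentication", False):
        for key in ("hdfs.kerberos.user.principal", "hdfs.kerberos.keytab.path"):
            if not config.get(key):
                problems.append(f"{key} is required when Kerberos is enabled")

    return problems


if __name__ == "__main__":
    broken = {
        "value.converter": AVRO_CONVERTER,
        "value.converter.passthrough.enabled": True,
        "output.avro.passthrough.enabled": False,
    }
    for problem in check_config(broken):
        print(problem)
```

An empty list means that none of the checked rules were violated. It does not guarantee that the connector accepts the configuration.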

Configuration example for writing data to Ozone FS

A simple configuration example for the HDFS Sink connector that writes data to the Ozone FS.

The following is a simple configuration example for the HDFS Sink connector. In this example data is written to the Ozone FS. Short descriptions of the properties set in this example are also provided. For a full properties reference, see the HDFS Sink properties reference.

{
    "connector.class": "com.cloudera.dim.kafka.connect.hdfs.HdfsSinkConnector",
    "tasks.max": 1,
    "key.converter": "org.apache.kafka.connect.storage.StringConverter",
    "value.converter": "com.cloudera.dim.kafka.connect.converts.AvroConverter",
    "value.converter.passthrough.enabled": true,
    "value.converter.schema.registry.url": "http://localhost:9090/api/v1",
    "topics": "avro_topic",
    "hdfs.uri": "ofs://ozone1/volume1/bucket1/",
    "hdfs.output": "/topics_output/",
    "output.writer": "com.cloudera.dim.kafka.connect.partition.writers.avro.AvroPartitionWriter",
    "output.avro.passthrough.enabled": true,
    "hdfs.kerberos.authentication": true,
    "hdfs.kerberos.user.principal": "user_account@MY-REALM.COM",
    "hdfs.kerberos.keytab.path": "/path/to/user_account.keytab",
    "hadoop.conf.path": "/etc/hadoop/"
}

connector.class: Class name of the HDFS Sink connector.
key.converter: The converter capable of understanding the data format of the key of each record on this topic.
value.converter: The converter capable of understanding the data format of the value of each record on this topic.
Note: When the AvroConverter is used, you can specify Schema Registry properties to be used by the AvroConverter's Schema Registry client. To do so, add the required Schema Registry property as a suffix to the value.converter property, for example, value.converter.schema.registry.url. Properties defined this way are passed on to the Schema Registry client used by the AvroConverter.
value.converter.passthrough.enabled: Controls whether data is converted into the Kafka Connect intermediate data format before it is written to an output file. Because the input and output formats are the same in this example, the property is set to true, that is, data is not converted.
value.converter.schema.registry.url: The URL of the Schema Registry server. This property is mandatory if the topic has records encoded in Avro format.
topics: List of topics to consume data from.
hdfs.uri: The Ozone FS (ofs) URI.
hdfs.output: The destination folder on the target file system (Ozone in this example) where output files will reside.
output.writer: Determines the output file format. Because in this example the output format is Avro, AvroPartitionWriter is used.
output.avro.passthrough.enabled: This property has to match the configuration of the value.converter.passthrough.enabled property because both the input and output formats are Avro.
hdfs.kerberos.authentication: Enables or disables Kerberos authentication.
hdfs.kerberos.user.principal: The user principal that the Kafka Connect role will use.
hdfs.kerberos.keytab.path: The path to the Kerberos keytab file.
hadoop.conf.path: The path to the Hadoop configuration files. This is necessary when the HDFS cluster has data encryption turned on.
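Compared with the HDFS example, the connection properties of the Ozone example differ in two ways: hdfs.uri holds an ofs URI, and hdfs.kerberos.namenode.principal is absent. The following Python sketch derives an Ozone configuration from an existing HDFS one along those lines. It is an illustration only. The function name `to_ozone_config` is hypothetical, and the sketch assumes that no other property needs to change, which may not hold for every cluster.

```python
from urllib.parse import urlparse


def to_ozone_config(hdfs_config, ofs_uri):
    """Return a copy of hdfs_config that points at an Ozone FS (ofs) URI."""
    parsed = urlparse(ofs_uri)
    if parsed.scheme != "ofs":
        raise ValueError(f"expected an ofs:// URI, got {ofs_uri!r}")

    config = dict(hdfs_config)  # leave the caller's dictionary untouched
    config["hdfs.uri"] = ofs_uri
    # The Ozone example does not set a NameNode principal, so drop it if present.
    config.pop("hdfs.kerberos.namenode.principal", None)
    return config


if __name__ == "__main__":
    hdfs_config = {
        "hdfs.uri": "hdfs://my-host.my-realm.com:8020",
        "hdfs.output": "/topics_output/",
        "hdfs.kerberos.namenode.principal": "hdfs/_HOST@MY-REALM.COM",
    }
    print(to_ozone_config(hdfs_config, "ofs://ozone1/volume1/bucket1/"))
```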

HDFS Sink properties reference

HDFS Sink connector properties reference.

The following table collects connector properties that are specific to the HDFS Sink connector. For properties common to all sink connectors, see the upstream Apache Kafka documentation.


Property Name	Description	Type	Default Value	Accepted Values	Recommended Values
`hdfs.uri`	The file system URI to connect to on the destination cluster. This property supports any valid Hadoop-compatible file system (HCFS) URI, for example, an HDFS or ofs URI.	String	None
`hdfs.output`	The root directory on the HDFS cluster where all the output files will reside. The sub path has the following pattern: `{topic}/{topic}_{partition}_{endoffset}.{file extension}`	String	/tmp		Any path on the HDFS file system where the role has read and write permissions.
`hdfs.kerberos.authentication`	Enables or disables secure access to the HDFS cluster by authenticating with Kerberos.	Boolean	false	true or false
`hdfs.kerberos.user.principal`	The Kerberos user principal.	String	null	The host-dependent Kerberos principal assigned to the Kafka Connect role.
`hdfs.kerberos.keytab.path`	The path to the Kerberos keytab file.	String	null		In a Cloudera Manager provisioned environment, it’s recommended to use the Cloudera Manager Config Provider to automatically provision the path.
`hdfs.kerberos.namenode.principal`	The Kerberos NameNode principal. Required when the HDFS cluster has data encryption enabled.	String	null
`hadoop.conf.path`	The path to the site-specific Hadoop configuration XML files. Required when the HDFS cluster has data encryption enabled.	String	null
`output.writer`	The output file writer, which determines the type of file to be written to the HDFS cluster. The value of this property should be the FQCN of a class that implements the `PartitionWriter` interface.	String	null	com.cloudera.dim.kafka.connect.partition.writers.avro.AvroPartitionWriter, com.cloudera.dim.kafka.connect.partition.writers.json.JsonPartitionWriter, com.cloudera.dim.kafka.connect.hdfs.parquet.ParquetPartitionWriter, com.cloudera.dim.kafka.connect.partition.writers.txt.TxtPartitionWriter	com.cloudera.dim.kafka.connect.partition.writers.avro.AvroPartitionWriter
`output.avro.passthrough.enabled`	Configures whether the output writer expects an Avro encoded Kafka Connect data record. Must match the configuration of `value.converter.passthrough.enabled`.	Boolean	true	true or false	True if input and output are both Avro.
`value.converter`	The converter to be used to translate the value field of the source Kafka record into Kafka Connect Data format.	String	Inherited from Kafka Connect worker properties.	org.apache.kafka.connect.json.JsonConverter, org.apache.kafka.connect.storage.StringConverter, com.cloudera.dim.kafka.connect.converts.AvroConverter	com.cloudera.dim.kafka.connect.converts.AvroConverter
`value.converter.schema.registry.url`	The URL of the Schema Registry server.	String	null
`value.converter.passthrough.enabled`	Configures whether the AvroConverter translates an Avro record into Kafka Connect Data or transparently passes the Avro encoded bytes as payload.	Boolean	true	true or false	True if input and output are both Avro.
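The defaults in the table and the sub path pattern given for `hdfs.output` can be combined to predict where a chunk file ends up. The following Python sketch shows one way to do this. It is an illustration under stated assumptions, not connector code: the helper names (`with_defaults`, `output_file_path`) are hypothetical, only the non-null defaults from the table are included, and the file extension is passed in by the caller because the table does not map writers to extensions.

```python
import posixpath

# Non-null defaults copied from the properties table.
DEFAULTS = {
    "hdfs.output": "/tmp",
    "hdfs.kerberos.authentication": False,
    "output.avro.passthrough.enabled": True,
    "value.converter.passthrough.enabled": True,
}


def with_defaults(config):
    """Return config with the table defaults filled in for missing keys."""
    return {**DEFAULTS, **config}


def output_file_path(config, topic, partition, end_offset, extension):
    """Build {hdfs.output}/{topic}/{topic}_{partition}_{endoffset}.{extension}."""
    root = with_defaults(config)["hdfs.output"]
    file_name = f"{topic}_{partition}_{end_offset}.{extension}"
    # posixpath keeps forward slashes regardless of the local operating system.
    return posixpath.join(root, topic, file_name)


if __name__ == "__main__":
    config = {"hdfs.output": "/topics_output/"}
    print(output_file_path(config, "avro_topic", 0, 50, "avro"))
    print(output_file_path({}, "avro_topic", 0, 50, "avro"))
```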