HDFS to S3 BDR Job Filled Up Disk Space

When a DistCp job copies data from HDFS to Amazon S3, it first buffers the data to local disk, and the upload only completes when the output stream is closed via the close() method call. This is due to the nature of the S3 object store: data written to an S3A OutputStream is not written incrementally.

As a result, the disks hosting the temporary directories defined in fs.s3a.buffer.dir must have enough capacity to store the entire buffered file. This is documented in the Apache Hadoop documentation: Hadoop-AWS module: Integration with Amazon Web Services.

So, if the disk behind the directory pointed to by fs.s3a.buffer.dir does not have enough space to hold the temporary files (by default it points to ${hadoop.tmp.dir}/s3a, and ${hadoop.tmp.dir} defaults to /tmp/hadoop-${user.name}), a "No space left on device" error will occur:

2019-04-04 08:53:34,072 ERROR [main] com.cloudera.enterprise.distcp.util.RetriableCommand: Failure in Retriable command: Copying hdfs://nameservice1/user/eric/table1/part-m-00001 to s3a://s3-host/user/table1/part-m-00001
java.io.IOException: No space left on device

Because the default buffer location sits under /tmp, filling it up can also trigger various errors in other services running on the same host, including service crashes and Kerberos-related errors.
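One way to catch this before the job fails is to check the free space on the volume behind the buffer directory up front. The sketch below assumes the default location under /tmp and an illustrative 100 GB requirement; both are assumptions to adjust for your cluster and job:

```shell
#!/bin/sh
# Check free space on the volume hosting the S3A buffer directory.
# BUFFER_DIR defaults to /tmp here because the default fs.s3a.buffer.dir
# lives under /tmp/hadoop-${user.name}/s3a; adjust for your cluster.
BUFFER_DIR="${BUFFER_DIR:-/tmp}"
REQUIRED_KB=$((100 * 1024 * 1024))   # 100 GB in KB, a hypothetical job size

# df -Pk prints sizes in 1 KB blocks; column 4 is the available space
AVAIL_KB=$(df -Pk "$BUFFER_DIR" | awk 'NR==2 {print $4}')

if [ "$AVAIL_KB" -lt "$REQUIRED_KB" ]; then
  echo "WARNING: only ${AVAIL_KB} KB free under ${BUFFER_DIR}" >&2
fi
```

Remember that several DistCp mappers can run on the same host, so the headroom needs to cover the largest files being copied concurrently, not just one.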

To avoid this issue, point fs.s3a.buffer.dir to a directory with much larger capacity to hold the temporary files. How large the disk needs to be depends on the job you run.
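As a sketch, the override goes into core-site.xml (or the corresponding safety valve in Cloudera Manager); /data/s3a-buffer below is a hypothetical mount point with plenty of free space:

```xml
<property>
  <name>fs.s3a.buffer.dir</name>
  <!-- /data/s3a-buffer is a hypothetical large volume; a comma-separated
       list of directories is also accepted -->
  <value>/data/s3a-buffer</value>
</property>
```

The same setting can also be passed per job on the DistCp command line as a generic option, e.g. -Dfs.s3a.buffer.dir=/data/s3a-buffer, which avoids a cluster-wide configuration change.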

Hope the above information is helpful.


