How to use “filters” to exclude files when using DistCp

This article explains how to use the DistCp “filters” feature to exclude files that do not need to be copied. Apache Hadoop 2.8.0 added support for filtering out files whose paths match certain regular expressions, so that they are not copied to the destination when the DistCp command is issued. This feature was introduced by HADOOP-1540. However, it is not obvious how to define the regular expressions. We have had customers who defined the following regexps in the filter file:
Trash
staging
\/\.Trash\/
\/\.staging\/
They hoped that all files under .Trash and .staging would be skipped. However, this does not work. The reason is visible in the class src/main/java/org/apache/hadoop/tools/RegexCopyFilter.java:
@Override
public boolean shouldCopy(Path path) {
  for (Pattern filter : filters) {
    if (filter.matcher(path.toString()).matches()) {
      return false;
    }
  }
  
  return true;
}
We can see that the code uses the Matcher.matches() method, which attempts to match the entire region against the pattern. So the pattern “Trash” from the example above will not match “/path/to/.Trash/filename”, because it matches only part of the string, not the entire string. To fix this, use patterns that cover the whole path:
.*\.Trash.*
.*\.staging.*
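To make the difference concrete, here is a minimal standalone sketch that mimics the matches()-based check quoted above. It is not the Hadoop class itself: the class name FilterDemo, the helper compile() and the sample paths are made up for illustration, and it works on plain strings instead of Hadoop Path objects.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of Hadoop.
class FilterDemo {
    // Compile each filter line into a Pattern (illustrative helper).
    static List<Pattern> compile(String... regexes) {
        List<Pattern> patterns = new ArrayList<>();
        for (String regex : regexes) {
            patterns.add(Pattern.compile(regex));
        }
        return patterns;
    }

    // Same logic as the snippet above: skip the path only when some
    // pattern matches the ENTIRE path string.
    static boolean shouldCopy(List<Pattern> filters, String path) {
        for (Pattern filter : filters) {
            if (filter.matcher(path).matches()) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String trashFile = "/path/to/.Trash/filename";

        // Partial patterns never cover the whole path, so nothing is filtered.
        List<Pattern> partial = compile("Trash", "\\/\\.Trash\\/");
        System.out.println(shouldCopy(partial, trashFile)); // true (still copied)

        // Patterns wrapped in .* cover the whole path, so the file is skipped.
        List<Pattern> full = compile(".*\\.Trash.*", ".*\\.staging.*");
        System.out.println(shouldCopy(full, trashFile)); // false (skipped)
        System.out.println(shouldCopy(full, "/data/keep/file.txt")); // true (copied)
    }
}
```

Note that the doubled backslashes are only Java string escaping. In the filter file itself each pattern is written with single backslashes, as shown above.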
The full command looks like this:
hadoop distcp -filters /path/to/filterfile.txt hdfs://source/path hdfs://destination/path
Hope this helps.

8 Comments

  1. Prakash

    Hi ,
    I am using Hadoop 2.7.0 and want to skip some files with the DistCp command. I saw the -filters option on the Apache website, but it doesn’t help me.
    Is there a way to achieve the above scenario? Please suggest.

    1. Eric Lin

      Hi Prakash,

      Sorry about the delay, I am just back from holiday.

      As I explained in the article, you can use the -filters option and put the regular expressions in a text file to define the files that you need to filter.

      If it does not work for you, can you please explain which part is not working? What does your filter file look like?

      Thanks

    1. Eric Lin

      Hi HuanCao,

      For an absolute path, just use the full path, but remember to add escape characters, like below:

      \/path\/to\/.Trash\/filename

      Hope that answers your question.

      Cheers
      Eric
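As a rough check of the absolute-path pattern from the reply above, the sketch below applies the same whole-string matching to a few sample paths. The class name AbsolutePathFilterDemo, the helper isFiltered() and the sample paths are invented for illustration. It also shows one detail worth keeping in mind: the dot before Trash is not escaped in that pattern, so it matches any single character. Escaping it as \. makes the match stricter.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, not part of Hadoop.
class AbsolutePathFilterDemo {
    // True when the regex matches the whole path, i.e. the file would be skipped.
    static boolean isFiltered(String regex, String path) {
        return Pattern.compile(regex).matcher(path).matches();
    }

    public static void main(String[] args) {
        // Filter-file line: \/path\/to\/.Trash\/filename
        String asWritten = "\\/path\\/to\\/.Trash\\/filename";
        System.out.println(isFiltered(asWritten, "/path/to/.Trash/filename")); // true
        System.out.println(isFiltered(asWritten, "/path/to/.Trash/other"));    // false
        // The unescaped dot matches any character, so this path is filtered too.
        System.out.println(isFiltered(asWritten, "/path/to/xTrash/filename")); // true

        // Filter-file line with the dot escaped: \/path\/to\/\.Trash\/filename
        String dotEscaped = "\\/path\\/to\\/\\.Trash\\/filename";
        System.out.println(isFiltered(dotEscaped, "/path/to/.Trash/filename")); // true
        System.out.println(isFiltered(dotEscaped, "/path/to/xTrash/filename")); // false
    }
}
```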
