    Trash
    staging
    \/\.Trash\/
    \/\.staging\/

and hoping that all files under .Trash and .staging will be skipped. However, this does not work. The reason can be seen in the code of the class src/main/java/org/apache/hadoop/tools/RegexCopyFilter.java:
    @Override
    public boolean shouldCopy(Path path) {
      for (Pattern filter : filters) {
        if (filter.matcher(path.toString()).matches()) {
          return false;
        }
      }
      return true;
    }

We can see that the code uses the Matcher.matches function, which attempts to match the entire region against the pattern. So the above example of “Trash” will not match “/path/to/.Trash/filename”, because the pattern covers only part of the string, not the entire string. To fix it, use the following:
    .*\.Trash.*
    .*\.staging.*

The full command looks like this:
    hadoop distcp -filters /path/to/filterfile.txt hdfs://source/path hdfs://destination/path

Hope this helps.
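To see the difference between a partial and a full match without running a DistCp job, here is a minimal sketch in plain Java. The class name FilterMatchDemo is hypothetical, and the method only imitates the shouldCopy loop quoted above using a String instead of a Hadoop Path, so it is an illustration rather than the actual Hadoop code.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical demo class: imitates the matching loop of RegexCopyFilter
// with plain strings, so it runs without any Hadoop dependency.
class FilterMatchDemo {
    // A path is skipped only when some pattern matches the ENTIRE string.
    static boolean shouldCopy(String path, List<Pattern> filters) {
        for (Pattern filter : filters) {
            if (filter.matcher(path).matches()) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String path = "/path/to/.Trash/filename";

        // Pattern that only describes a fragment of the path.
        Pattern partial = Pattern.compile("\\/\\.Trash\\/");
        // Pattern that describes the whole path.
        Pattern full = Pattern.compile(".*\\.Trash.*");

        System.out.println(shouldCopy(path, Arrays.asList(partial))); // true: file is still copied
        System.out.println(shouldCopy(path, Arrays.asList(full)));    // false: file is skipped

        // find() looks for the pattern anywhere in the string, unlike matches().
        System.out.println(partial.matcher(path).find());             // true
    }
}
```

The last line shows that the fragment pattern is not wrong as a regular expression. It simply never covers the whole path, which is what matches() requires.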
Hi,
I am using Hadoop 2.7.0 and want to skip some files with the distcp command. I saw the -filters option on the Apache website, but it doesn’t help me.
Is there a way to achieve this? Please suggest.
Hi Prakash,
Sorry about the delay, I am just back from holiday.
As I explained in the article, you can use the -filters option and put the regular expressions in a text file to define the files that you need to filter.
If it does not work for you, can you please explain which part is not working? What does your filter file look like?
Thanks
Should the filter file be in HDFS, or on the local file system of the host where the command is going to be executed?
Hi Abhishek,
Thanks for visiting my blog and posting questions.
The file should be local to the host where you run the command. You can see the source code of the RegexCopyFilter class here:
https://github.com/apache/hadoop/blob/release-3.2.1-RC0/hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/RegexCopyFilter.java#L55
It uses Java’s java.io.File class to read the filter file.
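As a rough illustration of what reading the filter file locally means, the sketch below loads one regular expression per line from a local file through java.io.File. The class name LocalFilterFileDemo and the method loadPatterns are hypothetical, and skipping blank lines is my own assumption, so please treat it as a simplified imitation and check the linked source for the real behaviour.

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical demo class: a simplified imitation of loading a filter file.
class LocalFilterFileDemo {
    // Reads a file from the LOCAL file system and compiles each line as a regex.
    static List<Pattern> loadPatterns(String fileName) throws IOException {
        List<Pattern> patterns = new ArrayList<>();
        File file = new File(fileName); // resolved on the local host, not in HDFS
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new FileInputStream(file), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                // Assumption of this sketch: blank lines are ignored.
                if (!line.trim().isEmpty()) {
                    patterns.add(Pattern.compile(line));
                }
            }
        }
        return patterns;
    }

    public static void main(String[] args) throws IOException {
        // Pass the local path of the filter file as the first argument.
        for (Pattern p : loadPatterns(args[0])) {
            System.out.println(p.pattern());
        }
    }
}
```

Because the file is opened with java.io.File, a path such as hdfs://... would not be resolved here. The file has to exist on the machine that launches the command.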
Cheers
Eric
How do I match an absolute path? Can you give an example? Thanks.
Hi HuanCao,
For an absolute path, just use the full path, but remember to escape the special characters, including the dot. So like below:
\/path\/to\/\.Trash\/filename
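If you want to check such a pattern before using it, a small sketch like the one below can help. It uses the pattern above with the dot escaped and tests it with Matcher.matches, the same call the filter class uses. The class name AbsolutePathDemo and the host name namenode are hypothetical. Whether the path string that DistCp passes to the filter includes a scheme and host like hdfs://namenode:8020 is an assumption you should verify in your own environment; if it does, a leading .* is one way to allow for it.

```java
import java.util.regex.Pattern;

// Hypothetical demo class for trying out an absolute-path filter pattern.
class AbsolutePathDemo {
    // True when the regex matches the ENTIRE path string, as Matcher.matches requires.
    static boolean isFiltered(String regex, String path) {
        return Pattern.compile(regex).matcher(path).matches();
    }

    public static void main(String[] args) {
        // In Java source every regex backslash is doubled.
        String absolute = "\\/path\\/to\\/\\.Trash\\/filename";
        String withPrefix = ".*" + absolute;

        // Plain absolute path: the pattern covers the whole string.
        System.out.println(isFiltered(absolute, "/path/to/.Trash/filename"));                       // true

        // Fully qualified path (hypothetical host): the scheme and host are not covered.
        System.out.println(isFiltered(absolute, "hdfs://namenode:8020/path/to/.Trash/filename"));   // false

        // A leading .* absorbs whatever comes before the absolute path.
        System.out.println(isFiltered(withPrefix, "hdfs://namenode:8020/path/to/.Trash/filename")); // true
    }
}
```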
Hope that answers your question.
Cheers
Eric
Hi, I would like to add a filter pattern for visit_date 2021 and 2022. How do I do that?
Sorry Abhilasha,
I am not exactly sure what pattern you are after.