In the last few weeks, I have been working on an issue in Sqoop where Sqoop's "--append" option does not work well with the parameter "-Dmapreduce.output.basename". The goal of adding "-Dmapreduce.output.basename" is to ask Sqoop to generate custom file names for the target files in HDFS, rather than using the default "part" prefix for all files.
However, Sqoop has a bug: when moving files from the temp directory to the target directory in HDFS, it does not respect the setting for mapreduce.output.basename, which results in no data being moved to the final destination.
This can be confirmed by turning on DEBUG logging in Sqoop (using --verbose):
sqoop import -Dmapreduce.output.basename="eric-test" \
  --connect jdbc:mysql://mysql-host.com/test \
  --username root --password 'root' --table test \
  --target-dir /tmp/ericlin-test/sqoop/test --fields-terminated-by '\t' \
  --verbose --append

18/05/28 22:24:44 INFO util.AppendUtils: Appending to directory test
18/05/28 22:24:44 DEBUG util.AppendUtils: Filename: _SUCCESS ignored
18/05/28 22:24:44 DEBUG util.AppendUtils: Filename: eric-test-m-00000 ignored
18/05/28 22:24:44 DEBUG util.AppendUtils: Filename: eric-test-m-00001 ignored
18/05/28 22:24:44 DEBUG util.AppendUtils: Filename: eric-test-m-00002 ignored
18/05/28 22:24:44 DEBUG util.AppendUtils: Deleting temporary folder 14935e396acc4ea7b9a6236c66064c9b_test
From the output, you can see that Sqoop ignored all the files generated with the prefix "eric-test".
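The behaviour in the log can be illustrated with a small shell sketch. This is a hypothetical simplification of the filename filter in AppendUtils (the real code is Java inside Sqoop); it only approximates the logic, but shows why every file named with a custom basename is skipped:

```shell
# Hypothetical simplification of old AppendUtils behaviour: only files whose
# names start with "part" are moved; marker files and everything else are ignored.
should_move() {
  case "$1" in
    _*)    return 1 ;;  # markers like _SUCCESS are always skipped
    part*) return 0 ;;  # the hard-coded prefix check behind the bug
    *)     return 1 ;;  # custom basenames like "eric-test-*" fall through here
  esac
}

for f in _SUCCESS eric-test-m-00000 eric-test-m-00001 part-m-00000; do
  if should_move "$f"; then echo "$f moved"; else echo "$f ignored"; fi
done
```

With a custom basename such as "eric-test", every data file falls into the last branch, so nothing reaches the target directory.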
I submitted an upstream JIRA, SQOOP-3330, and after a few review cycles, my patch was finally accepted and committed to Sqoop's trunk code base. I am looking forward to the fix being backported into CDH in the near future.
Update (28th of April, 2020): This fix is confirmed to have been backported into CDH since version 6.1.0, so if you face this issue, upgrading to the latest CDH/CDP will include the fix. For older versions, the workaround is to simply add the "part_" prefix to mapreduce.output.basename, for example "part_eric_test", to force Sqoop to copy those files, as older versions of Sqoop only look for filenames that start with "part".
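Applied to the earlier example, the workaround command might look like this (a sketch only; the connection details are the same placeholders as above):

```shell
# Workaround for pre-6.1.0 Sqoop: keep the "part" prefix in the basename so
# AppendUtils still recognises the files, while adding a custom suffix.
sqoop import -Dmapreduce.output.basename="part_eric_test" \
  --connect jdbc:mysql://mysql-host.com/test \
  --username root --password 'root' --table test \
  --target-dir /tmp/ericlin-test/sqoop/test \
  --fields-terminated-by '\t' \
  --verbose --append
```

The resulting files will be named like "part_eric_test-m-00000", which passes the old prefix check.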
We are currently using MySQL as the Sqoop metastore. The MySQL DB is shared with all other Hadoop services.
As of now, we use a file-based approach for our Sqoop incremental pull requirements. We plan to use Sqoop's incremental option, which uses the Sqoop metastore. We are concerned that the shared MySQL metastore DB may be impacted by Sqoop jobs.
What is the best practice on Sqoop incremental?
Can we use SQLite for each job instead of shared MySQL?
What are the nuances of using SQLite?
Sorry about the late reply.
If you have concerns about the shared MySQL instance, have you considered a dedicated MySQL host just for Sqoop?
By default, Sqoop uses HSQLDB; would that be suitable for you?
In my personal opinion, Sqoop does not use the metastore heavily; it only saves and retrieves data from the database when you run saved jobs. Unless you run your jobs multiple times a second, I would not worry too much about it.
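For reference, a saved incremental job backed by the metastore looks roughly like this (a sketch only; the table, column, host names, and metastore URL are all placeholders, and --meta-connect simply points the client at whichever shared or dedicated metastore you choose):

```shell
# Create a saved job; Sqoop stores the job definition, including the
# evolving --last-value, in the metastore rather than on the local disk.
sqoop job \
  --meta-connect jdbc:hsqldb:hsql://metastore-host.com:16000/sqoop \
  --create daily_orders_pull -- import \
  --connect jdbc:mysql://mysql-host.com/sales \
  --username sqoop_user --password-file /user/sqoop/pwd.txt \
  --table orders \
  --incremental append --check-column order_id --last-value 0 \
  --target-dir /data/sales/orders

# Each execution reads the stored last-value, pulls only new rows, and
# writes the updated last-value back to the metastore.
sqoop job \
  --meta-connect jdbc:hsqldb:hsql://metastore-host.com:16000/sqoop \
  --exec daily_orders_pull
```

As you can see, the metastore is only touched at job creation and at each execution, which is why the load on it stays low.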
Hope the above is useful.
Again, apologies for the delay in response.