How to import BLOB data into HBase directly using Sqoop

How to import BLOB data into HBase directly using Sqoop

Recently I was dealing with an issue that I was not able to import BLOB data correctly into HBase from Oracle database. All other columns were imported successfully, however, the BLOB column failed to appear in HBase table. My test table has three columns, ID:int, DATA_S:VARCHAR2 and DATA_B:BLOB. The following was the original command that failed to import BLOB column:
sqoop import -Dsqoop.hbase.add.row.key=true \
    --connect jdbc:oracle:thin:@//:1521/orcl \
    --username USERNAME --password password --table TEST_TABLE \ 
    --hbase-create-table --hbase-table test_case_no_bulkload \
    --column-family cf --split-by ID \
    --hbase-row-key ID
Result as below:
ROW          COLUMN+CELL
 1           column=cf:DATA_S, timestamp=1475219854006, value=Test
 1           column=cf:ID, timestamp=1475219854006, value=1
 2           column=cf:DATA_S, timestamp=1475219894990, value=Test MOre
 2           column=cf:ID, timestamp=1475219894990, value=2
2 row(s) in 0.7070 seconds
You can see that DATA_B column was missing in the destination HBase table. The fix here is to use the bulkload for HBase, see the command below:
sqoop import -Dsqoop.hbase.add.row.key=true \
    --connect jdbc:oracle:thin:@//:1521/orcl \
    --username USERNAME --password password --table TEST_TABLE \ 
    --hbase-create-table --hbase-table test_case_bulkload \
    --column-family cf --split-by ID \
    --hbase-row-key ID \
    --hbase-bulkload
See the new result:
hbase(main):002:0> scan 'test_case_bulkload'
ROW       COLUMN+CELL
 1        column=cf:DATA_B, timestamp=1475220177891, value=2f 72 6f 6f 74 2f 31 30 39 33 33 31 2f 6c 65 69 73 61 5f 63 68 72 69 73 74 6d 61 73 5f 66 61 6c 73 65 5f 63 6f 6c 6f 72 2e 70 6e 67
 1        column=cf:DATA_S, timestamp=1475220177891, value=Test
 2        column=cf:DATA_B, timestamp=1475220177891, value=41 6e 6f 74 68 65 72 20 74 65 73 74
 2        column=cf:DATA_S, timestamp=1475220177891, value=Test MOre
2 row(s) in 0.0620 seconds
You can see that column DATA_B has been created in HBase. However, it comes to another problem. Do you notice the ID column in HBase is missing? We expect the ID column should be created as part of the column family because we specified “-Dsqoop.hbase.add.row.key=true” in the import command. This is caused by a known issue that is reported by my colleague, see SQOOP-2952 for details. Basically the issue is when using –hbase-bulkload, -Dsqoop.hbase.add.row.key=true will be ignored and the key will not be created as part of column family for the new HBase table. If you do not care about the row key to be in the column family, then –hbase-bulkload is the solution for you, otherwise you will have to wait for SQOOP-2952 to be resolved. Hope this helps.

Loading

One comment

Leave a Reply

Your email address will not be published. Required fields are marked *

My new Snowflake Blog is now live. I will not be updating this blog anymore but will continue with new contents in the Snowflake world!