Hive “INSERT OVERWRITE” Does Not Remove Existing Data

When Hive runs an “INSERT OVERWRITE” into a partition of an external table whose directory already exists on HDFS, its behaviour depends on whether the partition definition already exists in the metastore:

1) If the partition definition does not exist in the metastore, Hive will not try to guess where the target partition directories are (whether static or dynamic partitions), so it cannot delete the existing files under the partitions it is about to write to.

2) If the partition definition does exist in the metastore, Hive will remove all files under the target partition directory before writing new data into it.
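Before running the overwrite, it is worth checking which of the two cases applies. A quick way to do so (using the table created later in this post):

```sql
-- Check whether the partition definition exists in the metastore.
-- If p=p1 is not listed here, case 1) applies and existing files
-- under that directory will NOT be removed by INSERT OVERWRITE.
SHOW PARTITIONS partition_test;
```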

To reproduce the issue, I did the following:

Log in as the “hdfs” user and run the following commands:

hdfs dfs -mkdir test
hdfs dfs -mkdir test/p=p1
touch test.txt
hdfs dfs -put test.txt test/p=p1

Confirm that there is one file under test/p=p1:

hdfs dfs -ls test/p=p1
Found 1 items
-rw-r--r--   3 hdfs supergroup          5 2015-05-04 17:30 test/p=p1/test.txt

Then start “hive” and run:

DROP TABLE IF EXISTS partition_test;
CREATE EXTERNAL TABLE partition_test (a int) PARTITIONED BY (p string) LOCATION '/user/hdfs/test';
INSERT OVERWRITE TABLE partition_test PARTITION (p = 'p1') SELECT  FROM ;

The output from the above “INSERT OVERWRITE”:

Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1430100146027_0004, Tracking URL = http://host-10-17-74-166.coe.cloudera.com:8088/proxy/application_1430100146027_0004/
Kill Command = /opt/cloudera/parcels/CDH-5.3.3-1.cdh5.3.3.p0.5/lib/hadoop/bin/hadoop job  -kill job_1430100146027_0004
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2015-05-05 00:15:35,220 Stage-1 map = 0%,  reduce = 0%
2015-05-05 00:15:48,740 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 3.19 sec
MapReduce Total cumulative CPU time: 3 seconds 190 msec
Ended Job = job_1430100146027_0004
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to: hdfs://ha-test/user/hdfs/test/p=p1/.hive-staging_hive_2015-05-05_00-13-47_253_4887262776207257351-1/-ext-10000
Loading data to table default.partition_test partition (p=p1)
Partition default.partition_test{p=p1} stats: [numFiles=2, numRows=33178, totalSize=194973, rawDataSize=161787]
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1   Cumulative CPU: 3.19 sec   HDFS Read: 2219273 HDFS Write: 195055 SUCCESS
Total MapReduce CPU Time Spent: 3 seconds 190 msec

Confirm that test.txt was not removed:

hdfs dfs -ls test/p=p1
Found 2 items
-rwxr-xr-x   3 hdfs supergroup     194965 2015-05-05 00:15 test/p=p1/000000_0
-rw-r--r--   3 hdfs supergroup          8 2015-05-05 00:10 test/p=p1/test.txt
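Note that the first “INSERT OVERWRITE” with a static partition spec has now registered the partition in the metastore (see the “Loading data to table default.partition_test partition (p=p1)” line in the output above), which is why the next overwrite will behave differently. This can be verified with:

```sql
-- p=p1 should now appear in the output, so case 2) applies
-- to the next INSERT OVERWRITE against this partition.
SHOW PARTITIONS partition_test;
```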

Rename 000000_0 to 11111111:

hdfs dfs -mv test/p=p1/000000_0 test/p=p1/11111111

Confirm that there are now two files under test/p=p1:

hdfs dfs -ls test/p=p1
Found 2 items
-rwxr-xr-x   3 hdfs supergroup     194965 2015-05-05 00:15 test/p=p1/11111111
-rw-r--r--   3 hdfs supergroup          8 2015-05-05 00:10 test/p=p1/test.txt

Run the following query again:

INSERT OVERWRITE TABLE partition_test PARTITION (p = 'p1') SELECT  FROM ;

The output from second “INSERT OVERWRITE”:

Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1430100146027_0005, Tracking URL = http://host-10-17-74-166.coe.cloudera.com:8088/proxy/application_1430100146027_0005/
Kill Command = /opt/cloudera/parcels/CDH-5.3.3-1.cdh5.3.3.p0.5/lib/hadoop/bin/hadoop job  -kill job_1430100146027_0005
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2015-05-05 00:23:39,298 Stage-1 map = 0%,  reduce = 0%
2015-05-05 00:23:48,891 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 2.92 sec
MapReduce Total cumulative CPU time: 2 seconds 920 msec
Ended Job = job_1430100146027_0005
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to: hdfs://ha-test/user/hdfs/test/p=p1/.hive-staging_hive_2015-05-05_00-21-58_505_3688057093497278728-1/-ext-10000
Loading data to table default.partition_test partition (p=p1)
Moved: 'hdfs://ha-test/user/hdfs/test/p=p1/11111111' to trash at: hdfs://ha-test/user/hdfs/.Trash/Current
Moved: 'hdfs://ha-test/user/hdfs/test/p=p1/test.txt' to trash at: hdfs://ha-test/user/hdfs/.Trash/Current
Partition default.partition_test{p=p1} stats: [numFiles=1, numRows=33178, totalSize=194965, rawDataSize=161787]
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1   Cumulative CPU: 2.92 sec   HDFS Read: 2219273 HDFS Write: 195055 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 920 msec

Finally, confirm that there is only one file under the test/p=p1 directory; both 11111111 and test.txt were moved to the .Trash directory:

hdfs dfs -ls test/p=p1
Found 1 items
-rwxr-xr-x   3 hdfs supergroup       4954 2015-05-04 17:36 test/p=p1/000000_0

The above test confirms that existing files remain in the target partition directory when the table was newly created and had no partition definitions in the metastore.

To fix this issue, run the following Hive query before the “INSERT OVERWRITE” to recover the missing partition definitions:

MSCK REPAIR TABLE partition_test;
OK
Partitions not in metastore:	partition_test:p=p1
Repair: Added partition to metastore partition_test:p=p1
Time taken: 0.486 seconds, Fetched: 2 row(s)
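MSCK REPAIR TABLE scans the whole table location, which can be slow for tables with many partitions. When the missing partition is already known, adding it explicitly is a lighter-weight alternative (same table and partition as above):

```sql
-- Register the known partition directly in the metastore
-- instead of scanning the whole table location.
ALTER TABLE partition_test ADD IF NOT EXISTS PARTITION (p = 'p1');
```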
