This article explains one of the possible reasons caused Hive dynamic insert query to only use one reducer to do the job, out of thousands of other reducers which do no job at all. See the below Hive query that triggers dynamic partitioning insert:
INSERT INTO TABLE target_table PARTITION (p1)
SELECT col1, col2, p1 FROM source_table
WHERE p1 >= '2015-05-01' AND p1 < '2015-05-02'; 
This query will create thousands of reducers, however, the end result is that there is only 1 reducer does the job, which takes more than 3 hours, and all other reducers all finish processing within a few minutes. So the final output from the job is a single fat data file in HDFS which is more than several GBs in size. If you look at the query closely, you can see that condition “p1 >= ‘2015-05-01’ AND p2 < ‘2015-05-02′” is essentially the same as p1 = ‘2015-05-01’, which means only one partition will be created from the query. After checking Hive setting “hive.optimize.sort.dynamic.partition”, it has value of “TRUE”, which means the partition column will be globally sorted before sending data to reducers based on the partition column. And because there is only one partition, all data will be sent to one reducer only even though there are thousands of reducers. There are two way here to resolve the issue: 1. Do not use dynamic partitioning. Because the result from the SELECT query will only return 1 partition column value, we can avoid this issue by using static partitioning instead. So use the following query instead:
INSERT INTO TABLE target_table PARTITION (p1 = '2015-05-01')
SELECT col1, col2, p1 FROM source_table
WHERE p1 >= '2015-05-01' AND p1 <'2015-05-02';

This will ask Hive to not apply dynamic partitioning logic, but use static partitioning instead. 2. Another way is to disable the global dynamic partitioning sorting. Run the following in Hive/Beeline command line before running the dynamic partition query:
SET hive.optimize.sort.dynamic.partition=false;
For more information about hive.optimize.sort.dynamic.partition, please have a look at Hive’s Apache Documentation and search for “hive.optimize.sort.dynamic.partition”. The following is the official description: When enabled, dynamic partitioning column will be globally sorted. This way we can keep only one record writer open for each partition value in the reducer thereby reducing the memory pressure on reducers.

5 Comments

Leave a Reply

Your email address will not be published. Required fields are marked *