Impala query with GRUOP BY clause failed without disk spilling

Impala query with GRUOP BY clause failed without disk spilling

This article explains one of the possible reasons that caused Impala query to fail with Out of Memory Exception. The following is the profile from failed query:
Query Type: DML
    Query State: EXCEPTION
    Query Status: 
Memory limit exceeded
    Sql Statement: insert OVERWRITE TABLE  
SELECT id, MIN(_id) AS uuid FROM  
WHERE id IS NOT NULL AND _id IS NOT NULL GROUP BY id

This query was big, and used more than 60GB of memory, in theory with GROUP BY, Impala should trigger disk spilling when it was about to reach the memory limit. However, this was not the case:
BlockMgr
        - AsyncTotalTime: 0
        - BlockWritesOutstanding: 0
        - BlocksCreated: 0
        - BlocksRecycled: 0
        - BufferedPins: 0
        - BytesWritten: 0
        - InactiveTotalTime: 0
        - MaxBlockSize: 8388608
        - MemoryLimit: 65283502080
        - PeakMemoryUsage: 0
        - TotalBufferWaitTime: 0
        - TotalEncryptionTime: 0
        - TotalIntegrityCheckTime: 0
        - TotalReadBlockTime: 0
        - TotalTime: 0
You can see that BytesWritten is “0”, which indicated that no disk spilling happened. By checking the cluster setting, I found that that disk spilling was disabled in the safety valve in Cloudera Manager:
--enable_partitioned_aggregation=false
--enable_partitioned_hash_join=false
“enable_partitioned_aggregation” controls whether to enable disk spilling when doing aggregation, while “enable_partitioned_hash_join” controls disk spilling when doing a hash join. “enable_partitioned_aggregation” was disabled which caused the mentioned query to fail with Out of Memory exception. To re-enable the disk spilling for impala, following the instructions below if you are using Cloudera Manager: Set the following in the safety valve:
--enable_partitioned_aggregation=true
--enable_partitioned_hash_join=true
Or even just remove the following:
--enable_partitioned_aggregation=false
--enable_partitioned_hash_join=false
will re-enable the disk spilling, as it is by default. We confirmed that the disk spilling happened and query succeeded after the change and restarted Impala:
BlockMgr
    - AsyncTotalTime: 0
    - BlockWritesOutstanding: 0
    - BlocksCreated: 3613
    - BlocksRecycled: 3379
    - BufferedPins: 48
    - BytesWritten: 2904251616
    - InactiveTotalTime: 0
    - MaxBlockSize: 8388608
    - MemoryLimit: 65283502080
    - PeakMemoryUsage: 29098027690
    - TotalBufferWaitTime: 0
    - TotalEncryptionTime: 0
    - TotalIntegrityCheckTime: 0
    - TotalReadBlockTime: 6839466872
    - TotalTime: 0

Leave a Reply

Your email address will not be published.

December 2015
M T W T F S S
 123456
78910111213
14151617181920
21222324252627
28293031  

My new Snowflake Blog is now live. I will not be updating this blog anymore but will continue with new contents in the Snowflake world!