How to confirm Dynamic Partition Pruning works in Impala – Hadoop Troubleshooting Guide

This article explains how to confirm Impala’s new Dynamic Partition Pruning feature is effective in CDH5.7.x. Dynamic Partition Pruning is a new feature introduced from CDH5.7.x / Impala 2.5, where information about the partition is collected during run time and impala prunes unnecessary partitions in the ways that were impractical to predict in advance. I have the following test case to show / prove that Dynamic Partition Pruning is working effectively. 1. Create tables with testing data:

CREATE TABLE yy (s string) PARTITIONED BY (year int) STORED AS PARQUET;
INSERT INTO yy PARTITION (year) VALUES ('1999', 1999), ('2000', 2000),
  ('2001', 2001), ('2010',2010);
COMPUTE STATS yy;

CREATE TABLE yy2 (s string) PARTITIONED BY (year int) STORED AS PARQUET;
INSERT INTO yy2 PARTITION (year) VALUES ('1999', 1999), ('2000', 2000),
  ('2001', 2001);
COMPUTE STATS yy2;

It is important to COMPUTE STATS on those tables to make Dynamic Partition Pruning effective 2. Run the following query:

Query: SELECT s FROM yy2 WHERE year IN (SELECT MIN(year) FROM yy LIMIT 1)
+------+
| s    |
+------+
| 1999 |
+------+
Fetched 1 row(s) in 0.55s

The result returned as expected because the minimum value for “year” column in table “yy” is 1999 and it is used in the WHERE condition to limit the result in table “yy2” 3. Check the explain query:

Query: explain SELECT s FROM yy2 WHERE year IN (SELECT MIN(year) FROM yy LIMIT 1)
+----------------------------------------------------------+
| Explain String                                           |
+----------------------------------------------------------+
| Estimated Per-Host Requirements: Memory=16.00MB VCores=1 |
|                                                          |
| 07:EXCHANGE [UNPARTITIONED]                              |
| |                                                        |
| 03:HASH JOIN [LEFT SEMI JOIN, BROADCAST]                 |
| |  hash predicates: year = min(year)                     |
| |  runtime filters: RF000 <- min(year)                   |
| |                                                        |
| |--06:EXCHANGE [BROADCAST]                               |
| |  |                                                     |
| |  05:AGGREGATE [FINALIZE]                               |
| |  |  output: min:merge(year)                            |
| |  |  limit: 1                                           |
| |  |                                                     |
| |  04:EXCHANGE [UNPARTITIONED]                           |
| |  |                                                     |
| |  02:AGGREGATE                                          |
| |  |  output: min(year)                                  |
| |  |                                                     |
| |  01:SCAN HDFS [default.yy]                             |
| |     partitions=4/4 files=4 size=936B                   |
| |                                                        |
| 00:SCAN HDFS [default.yy2]                               |
|    partitions=3/3 files=3 size=702B                      |
|    runtime filters: RF000 -> year                        |
+----------------------------------------------------------+
Fetched 25 row(s) in 0.05s

We can see that table “yy2” was scheduled to be scanned through all 3 partitions 1999, 2000 and 2001, this is expected because at compile time, Impala does not know the result of “SELECT MIN(year) FROM yy LIMIT 1”, so it plans to scan through all partitions in table “yy2”. 4. Let’s check the SUMMARY of the query:

Operator          #Hosts   Avg Time   Max Time  #Rows  Est. #Rows  Peak Mem  Est. Peak Mem  Detail                    
----------------------------------------------------------------------------------------------------------------------
07:EXCHANGE            1    0.000ns    0.000ns      1           1         0        -1.00 B  UNPARTITIONED             
03:HASH JOIN           3  180.999ms  259.000ms      1           1   2.02 MB         5.00 B  LEFT SEMI JOIN, BROADCAST 
|--06:EXCHANGE         3    0.000ns    0.000ns      1           1         0              0  BROADCAST                 
|  05:AGGREGATE        1  215.000ms  215.000ms      1           1  24.00 KB        -1.00 B  FINALIZE                  
|  04:EXCHANGE         1    0.000ns    0.000ns      3           1         0        -1.00 B  UNPARTITIONED             
|  02:AGGREGATE        3  147.333ms  168.000ms      3           1  20.00 KB       10.00 MB                            
|  01:SCAN HDFS        3   28.333ms   39.000ms      4           4  36.00 KB              0  default.yy            
00:SCAN HDFS           3    2.000ms    6.000ms      1           3  46.00 KB       16.00 MB  default.yy2

Check the last row for the SCAN HDFS operator for table “yy2”, we can see that “Est. #Rows” was 3, however, the actual #Rows scanned was only 1. This confirms that the Dynamic Partition Pruning is working, as only one partition/row was scanned instead of 3 partitions/rows 5. Another way to check Dynamic Partition Pruning is effective is to check the Filter in the SCAN operator per node in the query PROFILE. In my test I have 3 hosts. – Host one:

Filter 0
    - Files processed: 1
    - Files rejected: 0
    - Files total: 1

– Host two:

Filter 0
    - Files processed: 1
    - Files rejected: 1
    - Files total: 1

– Host three:

Filter 0
    - Files processed: 1
    - Files rejected: 1
    - Files total: 1

From above, we can see that each host processed one file each, however, two of them rejected the files due to Dynamic Partition Pruning.

Other posts that you might also be interested:

Leave a Reply Cancel reply