CREATE TABLE yy (s string) PARTITIONED BY (year int) STORED AS PARQUET; INSERT INTO yy PARTITION (year) VALUES ('1999', 1999), ('2000', 2000), ('2001', 2001), ('2010',2010); COMPUTE STATS yy; CREATE TABLE yy2 (s string) PARTITIONED BY (year int) STORED AS PARQUET; INSERT INTO yy2 PARTITION (year) VALUES ('1999', 1999), ('2000', 2000), ('2001', 2001); COMPUTE STATS yy2;It is important to COMPUTE STATS on those tables to make Dynamic Partition Pruning effective 2. Run the following query:
Query: SELECT s FROM yy2 WHERE year IN (SELECT MIN(year) FROM yy LIMIT 1) +------+ | s | +------+ | 1999 | +------+ Fetched 1 row(s) in 0.55sThe result returned as expected because the minimum value for “year” column in table “yy” is 1999 and it is used in the WHERE condition to limit the result in table “yy2” 3. Check the explain query:
Query: explain SELECT s FROM yy2 WHERE year IN (SELECT MIN(year) FROM yy LIMIT 1) +----------------------------------------------------------+ | Explain String | +----------------------------------------------------------+ | Estimated Per-Host Requirements: Memory=16.00MB VCores=1 | | | | 07:EXCHANGE [UNPARTITIONED] | | | | | 03:HASH JOIN [LEFT SEMI JOIN, BROADCAST] | | | hash predicates: year = min(year) | | | runtime filters: RF000 <- min(year) | | | | | |--06:EXCHANGE [BROADCAST] | | | | | | | 05:AGGREGATE [FINALIZE] | | | | output: min:merge(year) | | | | limit: 1 | | | | | | | 04:EXCHANGE [UNPARTITIONED] | | | | | | | 02:AGGREGATE | | | | output: min(year) | | | | | | | 01:SCAN HDFS [default.yy] | | | partitions=4/4 files=4 size=936B | | | | | 00:SCAN HDFS [default.yy2] | | partitions=3/3 files=3 size=702B | | runtime filters: RF000 -> year | +----------------------------------------------------------+ Fetched 25 row(s) in 0.05sWe can see that table “yy2” was scheduled to be scanned through all 3 partitions 1999, 2000 and 2001, this is expected because at compile time, Impala does not know the result of “SELECT MIN(year) FROM yy LIMIT 1”, so it plans to scan through all partitions in table “yy2”. 4. Let’s check the SUMMARY of the query:
Operator #Hosts Avg Time Max Time #Rows Est. #Rows Peak Mem Est. Peak Mem Detail ---------------------------------------------------------------------------------------------------------------------- 07:EXCHANGE 1 0.000ns 0.000ns 1 1 0 -1.00 B UNPARTITIONED 03:HASH JOIN 3 180.999ms 259.000ms 1 1 2.02 MB 5.00 B LEFT SEMI JOIN, BROADCAST |--06:EXCHANGE 3 0.000ns 0.000ns 1 1 0 0 BROADCAST | 05:AGGREGATE 1 215.000ms 215.000ms 1 1 24.00 KB -1.00 B FINALIZE | 04:EXCHANGE 1 0.000ns 0.000ns 3 1 0 -1.00 B UNPARTITIONED | 02:AGGREGATE 3 147.333ms 168.000ms 3 1 20.00 KB 10.00 MB | 01:SCAN HDFS 3 28.333ms 39.000ms 4 4 36.00 KB 0 default.yy 00:SCAN HDFS 3 2.000ms 6.000ms 1 3 46.00 KB 16.00 MB default.yy2Check the last row for the SCAN HDFS operator for table “yy2”, we can see that “Est. #Rows” was 3, however, the actual #Rows scanned was only 1. This confirms that the Dynamic Partition Pruning is working, as only one partition/row was scanned instead of 3 partitions/rows 5. Another way to check Dynamic Partition Pruning is effective is to check the Filter in the SCAN operator per node in the query PROFILE. In my test I have 3 hosts. – Host one:
Filter 0 - Files processed: 1 - Files rejected: 0 - Files total: 1– Host two:
Filter 0 - Files processed: 1 - Files rejected: 1 - Files total: 1– Host three:
Filter 0 - Files processed: 1 - Files rejected: 1 - Files total: 1From above, we can see that each host processed one file each, however, two of them rejected the files due to Dynamic Partition Pruning.