Last year, I wrote the first part of my Impala Query Profile series. I have been busy and unable to get to Part 2 until now, so apologies for the delay. If you missed it, please check out the first post here: Impala Query Profile Explained – Part 1.
In this second part of the series, I will focus on the Query Plan and Execution Summary sections of the profile, using the same PROFILE from Part 1.
The Query Plan and Execution Summary sections look like below:
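For reference, the plan we will walk through belongs to a simple row-count query. Judging from the operators in the profile (a COUNT(*) aggregation over default.sample_07), the original statement would have been something like:

```sql
-- Assumed query, reconstructed from the operators in the profile
SELECT COUNT(*) FROM default.sample_07;
```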
Query (id=36433472787e1cab:29c30e7800000000):
  Summary:
    ....Skipped here....
    Plan:
----------------
Max Per-Host Resource Reservation: Memory=0B
Per-Host Resource Estimates: Memory=52.00MB
WARNING: The following tables are missing relevant table and/or column statistics.
default.sample_07

F01:PLAN FRAGMENT [UNPARTITIONED] hosts=1 instances=1
|  Per-Host Resources: mem-estimate=10.00MB mem-reservation=0B
PLAN-ROOT SINK
|  mem-estimate=0B mem-reservation=0B
|
03:AGGREGATE [FINALIZE]
|  output: count:merge(*)
|  mem-estimate=10.00MB mem-reservation=0B spill-buffer=2.00MB
|  tuple-ids=1 row-size=8B cardinality=1
|
02:EXCHANGE [UNPARTITIONED]
|  mem-estimate=0B mem-reservation=0B
|  tuple-ids=1 row-size=8B cardinality=1
|
F00:PLAN FRAGMENT [RANDOM] hosts=1 instances=1
Per-Host Resources: mem-estimate=42.00MB mem-reservation=0B
01:AGGREGATE
|  output: count(*)
|  mem-estimate=10.00MB mem-reservation=0B spill-buffer=2.00MB
|  tuple-ids=1 row-size=8B cardinality=1
|
00:SCAN HDFS [default.sample_07, RANDOM]
   partitions=1/1 files=1 size=44.98KB
   stats-rows=unavailable extrapolated-rows=disabled
   table stats: rows=unavailable size=44.98KB
   column stats: all
   mem-estimate=32.00MB mem-reservation=0B
   tuple-ids=0 row-size=0B cardinality=unavailable
----------------
Estimated Per-Host Mem: 54525952
Tables Missing Stats: default.sample_07
Per Host Min Reservation: xxx-3.xxxx.com:22000(0) xxx-4.xxxx.com:22000(0)
Request Pool: root.hive
Admission result: Admitted immediately
ExecSummary:
Operator        #Hosts   Avg Time   Max Time  #Rows  Est. #Rows  Peak Mem  Est. Peak Mem  Detail
-----------------------------------------------------------------------------------------------------------
03:AGGREGATE         1    0.000ns    0.000ns      1           1  20.00 KB       10.00 MB  FINALIZE
02:EXCHANGE          1  868.991ms  868.991ms      1           1         0              0  UNPARTITIONED
01:AGGREGATE         1    0.000ns    0.000ns      1           1  16.00 KB       10.00 MB
00:SCAN HDFS         1  743.001ms  743.001ms    823          -1  80.00 KB       32.00 MB  default.sample_07
OK, let’s dive in.
1. Table/Column Statistics:
Max Per-Host Resource Reservation: Memory=0B
Per-Host Resource Estimates: Memory=52.00MB
WARNING: The following tables are missing relevant table and/or column statistics.
default.sample_07
The first two lines simply state resource information; they are not critical and are not used very often.
The next line, however, is very important: Impala tells us whether it detected up-to-date statistics for the tables involved in the query. This is crucial because Impala uses table/column statistics both to estimate resources and to choose the best strategy to run the query. If the stats are out of date, Impala can end up with a bad query plan, which hurts overall query performance.
In my example, we can see that the stats for table default.sample_07 are missing. Impala produces this warning so that users know COMPUTE STATS should be run on the table to fix it.
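To clear the warning, gather the statistics and then verify that they are populated. A minimal sketch using the table from the profile:

```sql
-- Gather table and column statistics
COMPUTE STATS default.sample_07;

-- Verify: #Rows should now show a real value instead of -1
SHOW TABLE STATS default.sample_07;
SHOW COLUMN STATS default.sample_07;
```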
For more information on Impala stats, please refer to documentation Table and Column Statistics provided by Cloudera.
2. Query Plan Details:
F01:PLAN FRAGMENT [UNPARTITIONED] hosts=1 instances=1
|  Per-Host Resources: mem-estimate=10.00MB mem-reservation=0B
PLAN-ROOT SINK
|  mem-estimate=0B mem-reservation=0B
|
03:AGGREGATE [FINALIZE]
|  output: count:merge(*)
|  mem-estimate=10.00MB mem-reservation=0B spill-buffer=2.00MB
|  tuple-ids=1 row-size=8B cardinality=1
|
02:EXCHANGE [UNPARTITIONED]
|  mem-estimate=0B mem-reservation=0B
|  tuple-ids=1 row-size=8B cardinality=1
|
F00:PLAN FRAGMENT [RANDOM] hosts=1 instances=1
Per-Host Resources: mem-estimate=42.00MB mem-reservation=0B
01:AGGREGATE
|  output: count(*)
|  mem-estimate=10.00MB mem-reservation=0B spill-buffer=2.00MB
|  tuple-ids=1 row-size=8B cardinality=1
|
00:SCAN HDFS [default.sample_07, RANDOM]
   partitions=1/1 files=1 size=44.98KB
   stats-rows=unavailable extrapolated-rows=disabled
   table stats: rows=unavailable size=44.98KB
   column stats: all
   mem-estimate=32.00MB mem-reservation=0B
   tuple-ids=0 row-size=0B cardinality=unavailable
Now it comes to the interesting part: the query plan. The query plan is one of the most important parts of the Impala PROFILE to know how to read, as it tells you how the table is scanned, and how data is exchanged and joined to produce the final result.
This can get very complicated for a complex query, but let's start with this simple query to understand the basics. One thing to remember is that you need to read the Query Plan backwards (bottom to top), which lets you follow what Impala planned to do.
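You do not have to run the query to see this plan: Impala's EXPLAIN statement prints it up front, and the EXPLAIN_LEVEL query option controls how much detail is shown. A quick sketch against the table from the profile:

```sql
-- Show extended detail in the plan output (0=minimal .. 3=verbose)
SET EXPLAIN_LEVEL=2;
EXPLAIN SELECT COUNT(*) FROM default.sample_07;
```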
a. It normally starts with HDFS Scan:
00:SCAN HDFS [default.sample_07, RANDOM]
   partitions=1/1 files=1 size=44.98KB
   stats-rows=unavailable extrapolated-rows=disabled
   table stats: rows=unavailable size=44.98KB
   column stats: all
   mem-estimate=32.00MB mem-reservation=0B
   tuple-ids=0 row-size=0B cardinality=unavailable
From the above, we can see the following useful information:
- the table had one partition, and Impala read that one partition (partitions=1/1). This does not necessarily mean the table is partitioned; an unpartitioned table also shows as 1/1
- there was only one file under the table/partition (files=1)
- the total size read by Impala was 44.98KB
- no stats were available for this table (stats-rows=unavailable, table stats: rows=unavailable and cardinality=unavailable)
- Impala estimated 32MB of memory for this scan, and no memory was reserved
b. After the HDFS scan completed, Impala needed to perform an aggregation, since the query was a COUNT(*):
01:AGGREGATE
|  output: count(*)
|  mem-estimate=10.00MB mem-reservation=0B spill-buffer=2.00MB
|  tuple-ids=1 row-size=8B cardinality=1
There isn’t much to explain here; this operator performs the local aggregation step on each worker node.
c. Fragment information:
F00:PLAN FRAGMENT [RANDOM] hosts=1 instances=1
Per-Host Resources: mem-estimate=42.00MB mem-reservation=0B
This bit of information, shown just above the 00:SCAN HDFS and 01:AGGREGATE operators, tells us that both the Scan and Aggregation operators belong to Fragment F00, which ran on 1 host with 1 instance. The Fragment ID F00 can be used to find the matching Fragment statistics later in the PROFILE, which give more detail about how this Fragment ran at run time. I will cover this in a later part of the series.
d. Exchange Operation:
02:EXCHANGE [UNPARTITIONED]
|  mem-estimate=0B mem-reservation=0B
|  tuple-ids=1 row-size=8B cardinality=1
So after the aggregation was done on each worker node, the results needed to be sent from each worker node to the coordinator; that is what happened here. After that, the coordinator performed the final aggregation/merge on those results:
03:AGGREGATE [FINALIZE]
|  output: count:merge(*)
|  mem-estimate=10.00MB mem-reservation=0B spill-buffer=2.00MB
|  tuple-ids=1 row-size=8B cardinality=1
Both of these operations belong to the same Fragment, F01, which again can be used to cross-reference the rest of the Profile data for more detailed stats about the query:
F01:PLAN FRAGMENT [UNPARTITIONED] hosts=1 instances=1
Now, let’s have a look at the Summary section of the Profile:
Operator        #Hosts   Avg Time   Max Time  #Rows  Est. #Rows  Peak Mem  Est. Peak Mem  Detail
-----------------------------------------------------------------------------------------------------------
03:AGGREGATE         1  999.992us  999.992us      1           1  20.00 KB       10.00 MB  FINALIZE
02:EXCHANGE          1  831.992ms  831.992ms      1           1         0              0  UNPARTITIONED
01:AGGREGATE         1    0.000ns    0.000ns      1           1  16.00 KB       10.00 MB
00:SCAN HDFS         1  709.995ms  709.995ms    823          -1  80.00 KB       32.00 MB  default.sample_07
Here you can find the following useful information:
- It shows the average and maximum time each operator took across hosts. If there is a big difference between the two, you know there was imbalance/skew across the worker nodes; in theory each node should process a similar amount of data, so we would expect all nodes to finish in a similar time range
- If the values for “#Rows” and “Est. #Rows” are way off (in my case, -1 for “Est. #Rows” on the SCAN HDFS operator versus 823 for “#Rows”, the actual number of rows returned), Impala’s table statistics are missing or out of date. Here we had no table stats at all, so Impala reported -1 as the estimate. If the estimate is positive but still far from the actual row count, the stats are stale and we need to run “COMPUTE STATS” against the table to refresh them
- the “#Hosts” column tells us how many worker nodes participated in that particular operation. In my case, since the data was small, only 1 host ran the query
- “Peak Mem” and “Est. Peak Mem” are self-explanatory: the actual memory used versus the estimate that Impala calculated based on table stats
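If you run queries from impala-shell, you can print this same table right after a query finishes with the SUMMARY command (and the full profile with the PROFILE command), without digging the profile out of the web UI:

```sql
SELECT COUNT(*) FROM default.sample_07;
-- In impala-shell, immediately after the query finishes:
SUMMARY;   -- prints the ExecSummary table shown above
PROFILE;   -- prints the full query profile
```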
If the query contains joins, this section also shows which join strategy was used for each join operation: either a Broadcast or a Shuffle join. I will try to cover this as well in a later part of the series.
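As a preview, the join strategy appears in the plan as a HASH JOIN operator annotated with BROADCAST or PARTITIONED (shuffle), and you can override the planner's choice with a join hint. A hypothetical sketch (the table names here are made up for illustration):

```sql
-- [SHUFFLE] or [BROADCAST] after JOIN overrides the planner's choice;
-- STRAIGHT_JOIN keeps the tables in the order written
SELECT STRAIGHT_JOIN a.id, b.name
FROM big_table a
JOIN [SHUFFLE] small_table b ON a.id = b.id;
```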
That’s all for Part 2 of the series; I hope it is useful. Next time I will try to share more complicated query profiles and work through them to understand more.
See you next time!
Thanks for the post. it helps a lot to me to understand how to read Impala Profile.
Keep up the good work.
Hi Jacky,
Thanks for visiting my site and providing your feedback. I am glad that it helps!
Cheers
Eric
tuple-ids=0 row-size=0B cardinality=unavailable
How to explain it and how to estimate it?
Thanks
Hi Fucun,
Sorry about the delay in responding. It means there are no stats for this table, so the cardinality is unavailable and the others are 0. Impala does not estimate this.
Cheers
Eric
Hi Eric,
Your blog is a blessing to our community.
I’m eagerly waiting for Part-3 as there are not enough good blogs or videos which explain the query plan like you did.
Cheers,
Teddy
Hi Snehasish,
Thanks for visiting my blog and sharing your love.
Just let you know that I have written Part 3 of this back in May, please check out below link:
https://www.ericlin.me/2019/05/impala-query-profile-explained-part-3/
I am planning for Part 4 soon.
Cheers
Eric