Impala Query Profile Explained – Part 2 – Hadoop Troubleshooting Guide

Last year, I wrote the first part of my Impala Query Profile series. I have been busy and not able to add the Part 2 of the series, apologies here. If you missed it, please check it out the last post here first: Impala Query Profile Explained – Part 1.

In the second part of the series, I will be focusing on the Query Plan and Execution Summary section of the profile. And I will be using the same PROFILE that I used in my part 1 of the series.

The Query Plan and Execution Summary looks like below:

Query (id=36433472787e1cab:29c30e7800000000):
  Summary:
    ....Skipped here....
    Plan: 
----------------
Max Per-Host Resource Reservation: Memory=0B
Per-Host Resource Estimates: Memory=52.00MB
WARNING: The following tables are missing relevant table and/or column statistics.
default.sample_07

F01:PLAN FRAGMENT [UNPARTITIONED] hosts=1 instances=1
|  Per-Host Resources: mem-estimate=10.00MB mem-reservation=0B
PLAN-ROOT SINK
|  mem-estimate=0B mem-reservation=0B
|
03:AGGREGATE [FINALIZE]
|  output: count:merge(*)
|  mem-estimate=10.00MB mem-reservation=0B spill-buffer=2.00MB
|  tuple-ids=1 row-size=8B cardinality=1
|
02:EXCHANGE [UNPARTITIONED]
|  mem-estimate=0B mem-reservation=0B
|  tuple-ids=1 row-size=8B cardinality=1
|
F00:PLAN FRAGMENT [RANDOM] hosts=1 instances=1
Per-Host Resources: mem-estimate=42.00MB mem-reservation=0B
01:AGGREGATE
|  output: count(*)
|  mem-estimate=10.00MB mem-reservation=0B spill-buffer=2.00MB
|  tuple-ids=1 row-size=8B cardinality=1
|
00:SCAN HDFS [default.sample_07, RANDOM]
   partitions=1/1 files=1 size=44.98KB
   stats-rows=unavailable extrapolated-rows=disabled
   table stats: rows=unavailable size=44.98KB
   column stats: all
   mem-estimate=32.00MB mem-reservation=0B
   tuple-ids=0 row-size=0B cardinality=unavailable
----------------
    Estimated Per-Host Mem: 54525952
    Tables Missing Stats: default.sample_07
    Per Host Min Reservation: xxx-3.xxxx.com:22000(0) xxx-4.xxxx.com:22000(0) 
    Request Pool: root.hive
    Admission result: Admitted immediately
    ExecSummary: 
Operator       Hosts   Avg Time   Max Time  #Rows  Est. #Rows  Peak Mem  Est. Peak Mem  Detail            
-----------------------------------------------------------------------------------------------------------
03:AGGREGATE        1    0.000ns    0.000ns      1           1  20.00 KB       10.00 MB  FINALIZE          
02:EXCHANGE         1  868.991ms  868.991ms      1           1         0              0  UNPARTITIONED     
01:AGGREGATE        1    0.000ns    0.000ns      1           1  16.00 KB       10.00 MB                    
00:SCAN HDFS        1  743.001ms  743.001ms    823          -1  80.00 KB       32.00 MB  default.sample_07

OK, let’s dive in.

1. Table/Column Statistics:

Max Per-Host Resource Reservation: Memory=0B
Per-Host Resource Estimates: Memory=52.00MB
WARNING: The following tables are missing relevant table and/or column statistics.
default.sample_07

The first two lines simply state resource information, they are not very important and not used very often.

The next line is very important however, as Impala tells us if it has detected that the tables involved in the query have up-to-date information about their stats or not. This is very crucial because Impala uses table/column statistics information to do resource estimation as well as perform query plan to determine the best strategy to run the query. If the stats are not up-to-date, Impala will end up with bad query plan, hence will affect the overall query performance.

In my example, we can see that the table default.sample_07’s stats are missing. Impala produced the warning so that users are informed about this and COMPUTE STATS should be performed on the table to fix this.

For more information on Impala stats, please refer to documentation Table and Column Statistics provided by Cloudera.

2. Query Plan Details:

F01:PLAN FRAGMENT [UNPARTITIONED] hosts=1 instances=1
|  Per-Host Resources: mem-estimate=10.00MB mem-reservation=0B
PLAN-ROOT SINK
|  mem-estimate=0B mem-reservation=0B
|
03:AGGREGATE [FINALIZE]
|  output: count:merge(*)
|  mem-estimate=10.00MB mem-reservation=0B spill-buffer=2.00MB
|  tuple-ids=1 row-size=8B cardinality=1
|
02:EXCHANGE [UNPARTITIONED]
|  mem-estimate=0B mem-reservation=0B
|  tuple-ids=1 row-size=8B cardinality=1
|
F00:PLAN FRAGMENT [RANDOM] hosts=1 instances=1
Per-Host Resources: mem-estimate=42.00MB mem-reservation=0B
01:AGGREGATE
|  output: count(*)
|  mem-estimate=10.00MB mem-reservation=0B spill-buffer=2.00MB
|  tuple-ids=1 row-size=8B cardinality=1
|
00:SCAN HDFS [default.sample_07, RANDOM]
   partitions=1/1 files=1 size=44.98KB
   stats-rows=unavailable extrapolated-rows=disabled
   table stats: rows=unavailable size=44.98KB
   column stats: all
   mem-estimate=32.00MB mem-reservation=0B
   tuple-ids=0 row-size=0B cardinality=unavailable

Now, it comes to the interesting part: the query plan. Query plan is one of the most important part of Impala PROFILE that you need to know how to read, as it tells you how table is scanned, data exchanged and joined to get the final result.

This can get very complicated if your query is complex, but let’s start with this simple query to understand the basis. One thing to remember is that you need to read the Query Plan backwards, which will allow you to follow what Impala planned to do.

a. It normally starts with HDFS Scan:

00:SCAN HDFS [default.sample_07, RANDOM]
   partitions=1/1 files=1 size=44.98KB
   stats-rows=unavailable extrapolated-rows=disabled
   table stats: rows=unavailable size=44.98KB
   column stats: all
   mem-estimate=32.00MB mem-reservation=0B
   tuple-ids=0 row-size=0B cardinality=unavailable

From above, we can see below useful information:

there was only one partition in the table, and Impala also read one partition. This does not necessarily mean that this table is partitioned. If the table is not partitioned, it will just be shown as 1/1.
there was only one file under the table/partition (files=1)
the total size read by Impala was 44.98KB
there were no stats available for this table (stats-rows=unavailable, table stats: rows=unavailable and cardinality=unavailable)
estimated memory to be 32MB to run the query and no memory were reserved

b. After HDFS scan was complete, Impala needed to do Aggregation, as we did COUNT(*):

01:AGGREGATE
|  output: count(*)
|  mem-estimate=10.00MB mem-reservation=0B spill-buffer=2.00MB
|  tuple-ids=1 row-size=8B cardinality=1

There isn’t much to explain here, but just to know that this operation does the Aggregation step.

c. Fragment information:

F00:PLAN FRAGMENT [RANDOM] hosts=1 instances=1
Per-Host Resources: mem-estimate=42.00MB mem-reservation=0B

This bit of information just above the 00:SCAN HDFS and 01:AGGREGATE operators tells us that both Scan and Aggregation Operator belongs to Fragment F00, which ran on 1 host and 1 instance. This Fragment ID of F00 can be used to find the actual Fragment statistic in the later part of PROFILE, which can tell us more detailed information about how this Fragment runs at run time. I will also cover this in the later part of the series.

d. Exchange Operation:

02:EXCHANGE [UNPARTITIONED]
|  mem-estimate=0B mem-reservation=0B
|  tuple-ids=1 row-size=8B cardinality=1

So after aggregation was done on each worker node, the results needed to be exchanged from each worker node to the coordinator, that was what happened here. After that, the coordinator needed to do the final aggregation/merger on the those results:

03:AGGREGATE [FINALIZE]
|  output: count:merge(*)
|  mem-estimate=10.00MB mem-reservation=0B spill-buffer=2.00MB
|  tuple-ids=1 row-size=8B cardinality=1

And both of above two operations belonged to the same Fragment 01, which again can be used to reference the rest of Profile data to find out more detailed stats about the query:

F01:PLAN FRAGMENT [UNPARTITIONED] hosts=1 instances=1

Now, let’s have a look at the Summary Section of the Profile:

Operator       #Hosts   Avg Time   Max Time  #Rows  Est. #Rows  Peak Mem  Est. Peak Mem  Detail
-----------------------------------------------------------------------------------------------------------
03:AGGREGATE        1  999.992us  999.992us      1           1  20.00 KB       10.00 MB  FINALIZE
02:EXCHANGE         1  831.992ms  831.992ms      1           1         0              0  UNPARTITIONED
01:AGGREGATE        1    0.000ns    0.000ns      1           1  16.00 KB       10.00 MB
00:SCAN HDFS        1  709.995ms  709.995ms    823          -1  80.00 KB       32.00 MB  default.sample_07

Here you can find below information that could be useful:

It tells the Average time and Maximum time each operation took. If there is big difference between the two, you would know that there was in-balance/skew when running jobs in each worker node, as in theory, they should be processing similar amount of data and we should expect all of the nodes to finish in similar time range
If the values for “#Row” and “Est. #Rows” are way off, in my case -1 for Est. #Rows for SCAN HDFS operation and 823 for #Row (the actual number of rows returned after running the query), we know that Impala has out of date information about the table statistics. In my case, we did not have table stats, so Impala reported “-1” estimated value. If the estimated value is positive, but is still different from actual rows returned, then we know that we need to run “COMPUTE STATS” against this table to update the statistics.
the “#Hosts” column tells us know many worker nodes participated in the query for that particular operation. In my case, since the data was small, we only had 1 host to run the query.
The “Peak Mem” and “Est. Peak Mem” are self-explanatory, they are the actual memory used vs. the estimated memory that Impala calculated based on table stats.

If there are joins in queries, this section will also show us what join strategies were used in the join operation, either Broadcast or Shuffle Join. I will try to cover this as well in the later part of the series.

That’s all for this part II of the series, and hope that they are useful. I will try to get more complicated query Profiles to share next time and work through to understand more.

See you next time!

6 Comments

Jacky

May 12, 2019 at 2:03 PM 5 years ago


Thanks for the post. it helps a lot to me to understand how to read Impala Profile.

Keep up the good work.
1. Eric Lin
  
  May 25, 2019 at 9:09 PM 5 years ago
  
  
  Hi Jacky,
  
  Thanks for visiting my site and provide your feedback. I am glad that it helps!
  
  Cheers
  Eric
fucun

July 9, 2019 at 7:49 PM 5 years ago


tuple-ids=0 row-size=0B cardinality=unavailable How to explain it and how to estimate it.

Thanks
1. Eric Lin
  
  July 17, 2019 at 9:46 PM 5 years ago
  
  
  Hi Fucun,
  
  Sorry about the delay in responding. That means there is no stats for this table so cardinality is unavailable and others are 0. Impala does not estimate this.
  
  Cheers
  Eric
Snehasish Dey

August 1, 2019 at 11:06 PM 5 years ago


Hi Eric,
Your blog is a blessing to our community.
I’m eagerly waiting for Part-3 as there is not enough good blogs or video which explain the query plan like you did.

Cheers,
Teddy
1. Eric Lin
  
  August 2, 2019 at 8:38 AM 5 years ago
  
  
  Hi Snehasish,
  
  Thanks for visiting my blog and share your love.
  
  Just let you know that I have written Part 3 of this back in May, please check out below link:
  https://www.ericlin.me/2019/05/impala-query-profile-explained-part-3/
  
  I am planning for Part 4 soon.
  
  Cheers
  Eric

Other posts that you might also be interested:

6 Comments

Leave a Reply to Eric Lin Cancel reply