All right. In Part 2 of the Impala Profile series, I talked about how to read the basic Impala query plan and Execution Summary from the profile information. If you missed them, you can find the links below:
Impala Query Profile Explained – Part 1
Impala Query Profile Explained – Part 2
In this third part of the series, I will still be focusing on the Query Plan and the Execution Summary, but using a more complicated query against real-life data downloaded from Kaggle's Flight Delays database.
This database has 3 tables: flights, airlines and airports.
The query that generated the profile is as below:
SELECT
    a.airline AS airline_name,
    COUNT(IF(cancelled = 1, 1, NULL)) AS cancelled,
    COUNT(1) AS total,
    CONCAT(CAST(CAST(COUNT(IF(cancelled = 1, 1, NULL)) / COUNT(1) AS DECIMAL(8,4)) * 100 AS STRING), "%") AS cancelled_rate
FROM flights f
JOIN airlines a ON (f.airline = a.iata_code)
GROUP BY a.airline
ORDER BY a.airline
This query JOINs the flights and airlines tables and generates a report that tells us the cancellation rate for each airline across all flights that happened during 2015. The result looks like below:
+------------------------------+-----------+---------+----------------+
| airline_name                 | cancelled | total   | cancelled_rate |
+------------------------------+-----------+---------+----------------+
| Alaska Airlines Inc.         | 669       | 172521  | 0.3800%        |
| American Airlines Inc.       | 10919     | 725984  | 1.5000%        |
| American Eagle Airlines Inc. | 15025     | 294632  | 5.0900%        |
| Atlantic Southeast Airlines  | 15231     | 571977  | 2.6600%        |
| Delta Air Lines Inc.         | 3824      | 875881  | 0.4300%        |
| Frontier Airlines Inc.       | 588       | 90836   | 0.6400%        |
| Hawaiian Airlines Inc.       | 171       | 76272   | 0.2200%        |
| JetBlue Airways              | 4276      | 267048  | 1.6000%        |
| Skywest Airlines Inc.        | 9960      | 588353  | 1.6900%        |
| Southwest Airlines Co.       | 16043     | 1261855 | 1.2700%        |
| Spirit Air Lines             | 2004      | 117379  | 1.7000%        |
| US Airways Inc.              | 4067      | 198715  | 2.0400%        |
| United Air Lines Inc.        | 6573      | 515723  | 1.2700%        |
| Virgin America               | 534       | 61903   | 0.8600%        |
+------------------------------+-----------+---------+----------------+
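A side note on the cancelled_rate column: it is simply cancelled / total, cast to DECIMAL(8,4) and formatted as a percentage. The values in the result suggest the DECIMAL cast truncates rather than rounds (669/172521 ≈ 0.003877 is reported as 0.3800%, not 0.3900%). A small Python sketch of the same arithmetic:

```python
from decimal import ROUND_DOWN, Decimal

def cancelled_rate(cancelled: int, total: int) -> str:
    """Mimic CONCAT(CAST(CAST(cancelled / total AS DECIMAL(8,4)) * 100 AS STRING), '%')."""
    ratio = Decimal(cancelled) / Decimal(total)
    # DECIMAL(8,4) keeps 4 digits after the decimal point; ROUND_DOWN matches
    # the truncating behaviour suggested by the result table above
    ratio = ratio.quantize(Decimal("0.0001"), rounding=ROUND_DOWN)
    return f"{ratio * 100}%"

print(cancelled_rate(669, 172521))    # Alaska Airlines Inc. → 0.3800%
print(cancelled_rate(15025, 294632))  # American Eagle Airlines Inc. → 5.0900%
```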
And the full details of the query plan and execution summary from the profile are as below:
F03:PLAN FRAGMENT [UNPARTITIONED] hosts=1 instances=1
|  Per-Host Resources: mem-estimate=0B mem-reservation=0B
PLAN-ROOT SINK
|  mem-estimate=0B mem-reservation=0B
|
08:MERGING-EXCHANGE [UNPARTITIONED]
|  order by: a.airline ASC
|  mem-estimate=0B mem-reservation=0B
|  tuple-ids=3 row-size=52B cardinality=14
|
F02:PLAN FRAGMENT [HASH(a.airline)] hosts=4 instances=4
Per-Host Resources: mem-estimate=22.00MB mem-reservation=13.94MB
04:SORT
|  order by: a.airline ASC
|  mem-estimate=12.00MB mem-reservation=12.00MB spill-buffer=2.00MB
|  tuple-ids=3 row-size=52B cardinality=14
|
07:AGGREGATE [FINALIZE]
|  output: count:merge(if(cancelled = 1, 1, NULL)), count:merge(*)
|  group by: a.airline
|  mem-estimate=10.00MB mem-reservation=1.94MB spill-buffer=64.00KB
|  tuple-ids=2 row-size=52B cardinality=14
|
06:EXCHANGE [HASH(a.airline)]
|  mem-estimate=0B mem-reservation=0B
|  tuple-ids=2 row-size=52B cardinality=14
|
F00:PLAN FRAGMENT [RANDOM] hosts=4 instances=4
Per-Host Resources: mem-estimate=187.94MB mem-reservation=3.94MB
03:AGGREGATE [STREAMING]
|  output: count(if(cancelled = 1, 1, NULL)), count(*)
|  group by: a.airline
|  mem-estimate=10.00MB mem-reservation=2.00MB spill-buffer=64.00KB
|  tuple-ids=2 row-size=52B cardinality=14
|
02:HASH JOIN [INNER JOIN, BROADCAST]
|  hash predicates: f.airline = a.iata_code
|  fk/pk conjuncts: f.airline = a.iata_code
|  runtime filters: RF000 <- a.iata_code
|  mem-estimate=1.94MB mem-reservation=1.94MB spill-buffer=64.00KB
|  tuple-ids=0,1 row-size=73B cardinality=5819079
|
|--05:EXCHANGE [BROADCAST]
|  |  mem-estimate=0B mem-reservation=0B
|  |  tuple-ids=1 row-size=54B cardinality=14
|  |
|  F01:PLAN FRAGMENT [RANDOM] hosts=1 instances=1
|  Per-Host Resources: mem-estimate=32.00MB mem-reservation=0B
|  01:SCAN HDFS [flight_delay.airlines a, RANDOM]
|     partitions=1/1 files=1 size=341B
|     stats-rows=14 extrapolated-rows=disabled
|     table stats: rows=14 size=341B
|     column stats: all
|     mem-estimate=32.00MB mem-reservation=0B
|     tuple-ids=1 row-size=54B cardinality=14
|
00:SCAN HDFS [flight_delay.flights f, RANDOM]
   partitions=1/1 files=1 size=564.96MB
   runtime filters: RF000 -> f.airline
   stats-rows=5819079 extrapolated-rows=disabled
   table stats: rows=5819079 size=564.96MB
   column stats: all
   mem-estimate=176.00MB mem-reservation=0B
   tuple-ids=0 row-size=19B cardinality=5819079
----------------
Estimated Per-Host Mem: 253689856
Per Host Min Reservation: host-xxx.xxxx.com:22000(17.88 MB) host-xxx.xxxx.com:22000(17.88 MB) host-xxx.xxxx.com:22000(17.88 MB) host-xxx.xxxx.com:22000(17.88 MB)
Request Pool: root.hive
Admission result: Admitted immediately
ExecSummary:
Operator              #Hosts   Avg Time   Max Time  #Rows  Est. #Rows   Peak Mem  Est. Peak Mem  Detail
-------------------------------------------------------------------------------------------------------------------------
08:MERGING-EXCHANGE        1    4s122ms    4s122ms     14          14          0              0  UNPARTITIONED
04:SORT                    4  249.999us  999.996us     14          14   12.02 MB       12.00 MB
07:AGGREGATE               4    2.750ms    4.000ms     14          14    1.99 MB       10.00 MB  FINALIZE
06:EXCHANGE                4    4s100ms    4s137ms     55          14          0              0  HASH(a.airline)
03:AGGREGATE               4  280.499ms  339.002ms     55          14   10.11 MB       10.00 MB  STREAMING
02:HASH JOIN               4  177.749ms  184.999ms  5.82M       5.82M   10.05 MB        1.94 MB  INNER JOIN, BROADCAST
|--05:EXCHANGE             4    0.000ns    0.000ns     14          14          0              0  BROADCAST
|  01:SCAN HDFS            1   97.000ms   97.000ms     14          14  177.00 KB       32.00 MB  flight_delay.airlines a
00:SCAN HDFS               4    2s052ms    3s278ms  5.82M       5.82M   40.06 MB      176.00 MB  flight_delay.flights f
OK, let’s jump to the Execution Summary section first this time, as it is easier to read and it is normally the first section I check when helping CDH users troubleshoot Impala query issues. I will assume you have read Part 2 of the series, so you know what has been explained already, and I will go a bit faster this time.
So, from the above Execution Summary, we can see what happened during the lifetime of the query execution (remember to read it from the bottom up):
- It took an average of 2 seconds to scan the HDFS data for table flight_delay.flights (2s052ms)
- Impala estimated that table flight_delay.flights has 5.82 million rows, and the scan returned the same number, which means the table stats are up-to-date (5.82M)
- Impala estimated that 176MB of memory would be needed to scan table flight_delay.flights, but in fact only 40MB was used. This is expected, as a memory estimate can never match actual usage exactly; the idea is to get as close as possible
- Since the data is big, Impala was able to split the file and perform the scan operation on 4 hosts, so that the load was distributed
- After the scan of table flight_delay.flights completed, Impala started scanning the other table, flight_delay.airlines. Again, the estimated and actual rows returned match, so the table stats are up-to-date for flight_delay.airlines as well. And since the table only has 14 rows, it took just 97 milliseconds to scan
- As the table is small, with only 14 rows, Impala used only 1 host for the scan operation
- The next step was to broadcast the smaller table flight_delay.airlines to all nodes in the cluster that would perform the query, in my case 4 hosts (BROADCAST)
- Once the broadcast was done, Impala performed the hash join between flight_delay.airlines and flight_delay.flights, which took 177 milliseconds and used 10MB of memory
- Since we called the COUNT aggregate function, Impala was instructed to perform the AGGREGATE operation, which again ran on 4 hosts, spent 280ms and 10MB of memory, and returned 55 rows
- Since the above step was performed on 4 worker nodes, Impala then needed to merge the results from them, which was achieved by exchanging the data internally and then performing the final aggregation on the intermediate result
- A sort operation was performed after the final aggregation completed, as we had “ORDER BY” in our query
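One detail worth calling out: the STREAMING aggregate (03) runs independently on each of the 4 hosts, so each host emits its own partial counts per airline. That is why 03:AGGREGATE and 06:EXCHANGE show 55 actual rows against an estimate of 14, and only the FINALIZE aggregate (07) collapses them to the final 14 groups. A minimal Python sketch of this two-phase pattern, using made-up airline codes and 2 hosts instead of 4:

```python
from collections import Counter

# Hypothetical per-host input: the airline code of each scanned row.
# Phase 1 (03:AGGREGATE STREAMING) runs independently on every host,
# so each host emits its own partial group counts.
host1_rows = ["AA", "AA", "DL"]
host2_rows = ["AA", "DL", "DL"]
partials = [Counter(rows) for rows in (host1_rows, host2_rows)]

# 06:EXCHANGE then hashes groups to hosts, and phase 2
# (07:AGGREGATE FINALIZE) merges the partial counts per group.
final = Counter()
for p in partials:
    final.update(p)

# 4 partial groups across 2 hosts (like the 55 rows for 14 groups above),
# but only 2 groups after the merge (like the final 14).
print(sum(len(p) for p in partials))  # → 4
print(dict(final))                    # → {'AA': 3, 'DL': 3}
```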
You can match the operator numbers, like 00, 01, 02, etc., in the Summary section to the numbers in the Query Plan section, which will tell you more details about each operation. I covered those details in Part 2, so please refer to that post if you need a reference.
Now, let’s jump further down the PROFILE and have a look at the Planner and Query Timeline:
Planner Timeline
  Analysis finished: 3ms (3389346)
  Equivalence classes computed: 3ms (3600838)
  Single node plan created: 4ms (4625920)
  Runtime filters computed: 4ms (4734686)
  Distributed plan created: 5ms (5120630)
  Lineage info computed: 13ms (13666462)
  Planning finished: 15ms (15712999)
Query Timeline
  Query submitted: 0ns (0)
  Planning finished: 16ms (16999947)
  Submit for admission: 17ms (17999944)
  Completed admission: 17ms (17999944)
  Ready to start on 4 backends: 18ms (18999941)
  All 4 execution backends (10 fragment instances) started: 28ms (28999909)
  Rows available: 4.28s (4280986646)
  First row fetched: 4.31s (4308986559)
Each line is pretty much self-explanatory. We can see that the query planning took 15ms, the query was submitted for admission at the 17ms mark, execution started on the worker nodes at 28ms, rows were ready at 4.28 seconds, and the first row was fetched by the client at 4.31 seconds. This gives you a very clear overview of how long each stage took; if any stage is slow, it will be obvious, and we can then drill down further to see what might have happened.
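One thing to keep in mind when reading these timelines: each timestamp is cumulative from query submission, so the duration of a stage is the difference between consecutive entries. A quick Python sketch that turns the nanosecond values from the Query Timeline above into per-stage durations:

```python
# Cumulative nanosecond timestamps copied from the Query Timeline above
timeline = [
    ("Query submitted", 0),
    ("Planning finished", 16999947),
    ("Submit for admission", 17999944),
    ("Completed admission", 17999944),
    ("Ready to start on 4 backends", 18999941),
    ("All 4 execution backends started", 28999909),
    ("Rows available", 4280986646),
    ("First row fetched", 4308986559),
]

durations = {}
prev = 0
for event, ts in timeline:
    durations[event] = (ts - prev) / 1e6  # stage duration in milliseconds
    prev = ts
    print(f"{event}: +{durations[event]:.0f}ms")

# The biggest gap is between "backends started" and "Rows available":
# that is the actual query execution time.
```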
Since my query was fast, it is not very interesting here. Let’s have a look at the profile from another, real production Impala query:
Query Compilation: 16.268ms
  - Metadata of all 1 tables cached: 1.786ms (1.786ms)
  - Analysis finished: 6.162ms (4.376ms)
  - Value transfer graph computed: 6.537ms (374.918us)
  - Single node plan created: 7.955ms (1.417ms)
  - Runtime filters computed: 8.274ms (318.815us)
  - Distributed plan created: 8.430ms (156.307us)
  - Lineage info computed: 9.664ms (1.234ms)
  - Planning finished: 16.268ms (6.603ms)
Query Timeline: 35m46s
  - Query submitted: 0.000ns (0.000ns)
  - Planning finished: 22.001ms (22.001ms)
  - Submit for admission: 23.001ms (1.000ms)
  - Completed admission: 23.001ms (0.000ns)
  - Ready to start on 2 backends: 24.001ms (1.000ms)
  - All 2 execution backends (2 fragment instances) started: 36.001ms (12.000ms)
  - Rows available: 5m51s (5m51s)
  - First row fetched: 5m52s (950.045ms)
  - Last row fetched: 35m46s (29m53s)
  - Released admission control resources: 35m46s (1.000ms)
  - Unregister query: 35m46s (30.001ms)
  - ComputeScanRangeAssignmentTimer: 0.000ns
This was taken from a case where an Impala query took a long time to run and the customer wanted to find out why. From the Query Timeline, we can clearly see that it took almost 6 minutes (5m51s) from the start of execution (All 2 execution backends started) until data was available (Rows available). These 6 minutes of execution could be normal: if there are lots of joins over large datasets, it is common for a query to run for several minutes.
However, we can also notice that it took almost 30 minutes to pass the data back to the client: the first row was fetched at the 6-minute mark, but the last row only at the 36-minute mark. From here, we could suspect a networking issue between the Impala coordinator and the client (as the data fetch is initiated by the client, like impala-shell or Hue, against the Impala coordinator host). Another possibility is that the client was capturing the results and performing other actions, like printing them to the screen; if the returned dataset is big, that can be time consuming.
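To put numbers on that split: server-side execution is roughly "Rows available" minus "backends started", while client fetch time is "Last row fetched" minus "First row fetched". Here is a small helper (my own sketch, not part of any Impala tooling) that converts the profile's human-readable durations back to seconds:

```python
import re

def to_seconds(duration: str) -> float:
    """Parse Impala profile durations such as '35m46s', '950.045ms' or '0.000ns'."""
    units = {"h": 3600.0, "m": 60.0, "s": 1.0, "ms": 1e-3, "us": 1e-6, "ns": 1e-9}
    return sum(float(value) * units[unit]
               for value, unit in re.findall(r"([\d.]+)(ms|us|ns|h|m|s)", duration))

# Server-side execution: backends started -> rows available
execution = to_seconds("5m51s") - to_seconds("36.001ms")
# Client-side fetching: first row fetched -> last row fetched
fetching = to_seconds("35m46s") - to_seconds("5m52s")
print(f"execution: ~{execution:.0f}s, client fetching: ~{fetching:.0f}s")  # ~351s vs ~1794s
```

With those numbers the bottleneck clearly sits on the fetch side, not on query execution.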
So this section of the PROFILE can point us in the right direction when looking for the bottleneck.
This concludes Part 3 of the Impala Profile series. In the next part, I will explain how to relate the operator numbers shown in the Query Plan section to the bottom sections of the profile, which show the detailed metrics for each operation, both averaged and individually per host.
Hope to see you again!
I have gone through all 3 Impala query profile parts and they are wonderfully crafted. Thanks a lot for sharing the knowledge, Eric. Very much appreciated.
Thanks for visiting my blog and I am glad that those posts are useful.
(as data fetch happens from Client, like impala-shell or Hue, to Impala Coordinator host)
Is this a mistake? It seems more reasonable that data fetch happens from the Coordinator host to the client.
Firstly, thanks for reading my post and posting your comment.
However, the data fetch is initiated from the client side, as the API call is triggered from there to the Impala coordinator host, effectively saying “I want more data, please give me more”. It is the returning of data that happens on the server side.
Please let me know if you still think otherwise.
Hi, thanks for your wonderful post! But I’m wondering why 06 EXCHANGE took much longer than 02 HASH JOIN, given that there were only 55 rows to be exchanged across the network?
Thanks for visiting and providing feedback. Unfortunately, I do not have the full PROFILE of this query anymore, so I am unable to find out why. I might need to re-run this in a new lab after loading the data, and I will share the result here if I have further findings.
Hi Eric! I wish to thank you for your didactic article. It was very clarifying for me. To prove that I read it very carefully, I noticed a little mistake: when you analyzed the summary, you said that AGGREGATE had returned 55 rows, but that was the number estimated by Impala. The actual number was only 14, the maximum possible in this case, since it can’t be larger than the number of rows of the smaller table, which has the column used for the aggregation.
Thanks for adding comment on my site, and I am really sorry for the delayed response.
Based on the Summary, the estimated row count for the aggregation was 14 and the actual number was 55. Since it was a JOIN, it is possible for the result to return more than 14 rows, depending on what is in the other table.
I still do not see the issue here, but please correct me again if you think otherwise.
Hi Eric, I’m new to this profiling concept, but after seeing your blog I gained confidence. I watched parts 1, 2 and 3
heading toward 4 and 5