SELECT * query triggered Map Only job under CDH5.5.1, but not from CDH5.3.x

This article explains why a map-only job is launched when running a simple “SELECT * FROM <table>” query in CDH5.5.x, while the same query did not need any MapReduce job in CDH5.3.x. After upgrading to CDH5.5.1, running “SELECT * FROM <table>” from Beeline launched a map-only job, which did not happen in CDH5.3.2. See the log below:
INFO : Number of reduce tasks is set to 0 since there's no reduce operator
INFO : number of splits:5
INFO : Submitting tokens for job: job_1453140482266_0991
INFO : Kind: HDFS_DELEGATION_TOKEN, Service: ha-hdfs:nameservice1, Ident: (HDFS_DELEGATION_TOKEN token 209597 for hive)
INFO : The url to track the job: http://host:8088/proxy/application_1453140482266_0991/
INFO : Starting Job = job_1453140482266_0991, Tracking URL = http://host:8088/proxy/application_1453140482266_0991/
INFO : Kill Command = /opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/lib/hadoop/bin/hadoop job -kill job_1453140482266_0991
INFO : Hadoop job information for Stage-1: number of mappers: 5; number of reducers: 0
INFO : 2016-01-20 16:10:40,443 Stage-1 map = 0%, reduce = 0%
INFO : 2016-01-20 16:10:52,912 Stage-1 map = 1%, reduce = 0%, Cumulative CPU 8.69 sec
INFO : 2016-01-20 16:10:53,955 Stage-1 map = 2%, reduce = 0%, Cumulative CPU 45.28 sec
INFO : 2016-01-20 16:10:56,043 Stage-1 map = 4%, reduce = 0%, Cumulative CPU 49.02 sec
INFO : 2016-01-20 16:10:57,084 Stage-1 map = 7%, reduce = 0%, Cumulative CPU 63.54 sec
INFO : 2016-01-20 16:11:00,209 Stage-1 map = 11%, reduce = 0%, Cumulative CPU 83.36 sec
INFO : 2016-01-20 16:11:03,328 Stage-1 map = 16%, reduce = 0%, Cumulative CPU 101.25 sec
INFO : 2016-01-20 16:11:05,404 Stage-1 map = 21%, reduce = 0%, Cumulative CPU 104.74 sec
INFO : 2016-01-20 16:11:06,442 Stage-1 map = 32%, reduce = 0%, Cumulative CPU 119.81 sec
INFO : 2016-01-20 16:11:09,553 Stage-1 map = 34%, reduce = 0%, Cumulative CPU 138.41 sec
INFO : 2016-01-20 16:11:11,648 Stage-1 map = 38%, reduce = 0%, Cumulative CPU 141.73 sec
INFO : 2016-01-20 16:11:12,696 Stage-1 map = 45%, reduce = 0%, Cumulative CPU 156.4 sec
INFO : 2016-01-20 16:11:18,940 Stage-1 map = 55%, reduce = 0%, Cumulative CPU 192.69 sec
INFO : 2016-01-20 16:11:21,014 Stage-1 map = 59%, reduce = 0%, Cumulative CPU 196.29 sec
INFO : 2016-01-20 16:11:22,051 Stage-1 map = 74%, reduce = 0%, Cumulative CPU 210.45 sec
INFO : 2016-01-20 16:11:27,218 Stage-1 map = 77%, reduce = 0%, Cumulative CPU 234.26 sec
INFO : 2016-01-20 16:11:28,253 Stage-1 map = 85%, reduce = 0%, Cumulative CPU 245.14 sec
INFO : 2016-01-20 16:11:34,502 Stage-1 map = 90%, reduce = 0%, Cumulative CPU 261.37 sec
INFO : 2016-01-20 16:11:38,647 Stage-1 map = 97%, reduce = 0%, Cumulative CPU 270.36 sec
INFO : 2016-01-20 16:11:40,736 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 273.89 sec
INFO : MapReduce Total cumulative CPU time: 4 minutes 33 seconds 890 msec
INFO : Ended Job = job_1453140482266_0991
This is caused by the default value of hive.fetch.task.conversion.threshold, which is -1 in CDH5.3.x but 268435456 (256MB) in CDH5.5.1. To avoid the map-only job, set hive.fetch.task.conversion.threshold either to -1 or to a value larger than the current size of the table. See the Hive Configuration Properties documentation for more details; the relevant entry is quoted below:

hive.fetch.task.conversion.threshold
Default Value: -1 in Hive 0.13.0 and 0.13.1, 1073741824 (1 GB) in Hive 0.14.0 and later
Added In: Hive 0.13.0 with HIVE-3990; default changed in Hive 0.14.0 with HIVE-7397

Input threshold (in bytes) for applying hive.fetch.task.conversion. If target table is native, input length is calculated by summation of file lengths. If it’s not native, the storage handler for the table can optionally implement the org.apache.hadoop.hive.ql.metadata.InputEstimator interface. A negative threshold means hive.fetch.task.conversion is applied without any input length threshold.

The solution is to set hive.fetch.task.conversion.threshold to -1 so that the query is always converted to a FETCH task. However, it is advised not to set hive.fetch.task.conversion.threshold to -1 globally; the threshold was applied for a reason, please see HIVE-7397. Basically, we do not want to apply the conversion to a table that is super large, like a table with petabytes of data, which might cause Hive to crash. It is recommended to set hive.fetch.task.conversion.threshold to -1 in a session only when querying a table that is bigger than the 268435456-byte default in CDH5.5.1 and the user knows that the table is not too big, so it is safe to do so.
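As an illustration, the session-scoped workaround can be applied directly from Beeline. The following is a minimal sketch; test_table is a hypothetical table name, and DESCRIBE FORMATTED is used only to confirm the table's totalSize (in bytes, under Table Parameters) before lowering the threshold:

-- Check how big the table actually is before disabling the threshold
DESCRIBE FORMATTED test_table;

-- Disable the input length threshold for this session only,
-- so the query is converted to a FETCH task instead of a map-only job
SET hive.fetch.task.conversion.threshold=-1;

-- This should now be served directly as a fetch task, with no MapReduce job launched
SELECT * FROM test_table;

-- Optionally restore the CDH5.5.1 default for the rest of the session
SET hive.fetch.task.conversion.threshold=268435456;

Because SET only affects the current session, the global default stays in place for other users, which is consistent with the advice above not to change this value cluster-wide.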

