How to control the number of mappers required for a Hive query

This article explains how to increase or decrease the number of mappers used by a particular Hive query. In most cases, setting both “mapreduce.input.fileinputformat.split.maxsize” and “mapreduce.input.fileinputformat.split.minsize” to the same value controls the number of mappers (either up or down) that Hive launches for a query. For example, for a text file 200000 bytes in size, setting
set mapreduce.input.fileinputformat.split.maxsize=100000;
set mapreduce.input.fileinputformat.split.minsize=100000;
will trigger two mappers for the MapReduce job, while setting
set mapreduce.input.fileinputformat.split.maxsize=50000;
set mapreduce.input.fileinputformat.split.minsize=50000;
will trigger four mappers for the same job. If hive.input.format is set to “org.apache.hadoop.hive.ql.io.CombineHiveInputFormat”, which is the default in newer versions of Hive, Hive will also combine small files whose sizes are smaller than mapreduce.input.fileinputformat.split.minsize into a single split, reducing the number of mappers and the overhead of starting too many of them. However, this is also affected by the data locality of the HDFS blocks: if many small files are stored on different data nodes, Hive cannot combine them into one split because they do not reside on the same machine.
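
As a quick sketch of a complete session, the lines below put the two split settings and the input format together. The table name web_logs and its single 200000-byte text file are assumptions for illustration only:

-- Pin both split bounds to 50000 bytes, so a 200000-byte file
-- is read as 200000 / 50000 = 4 splits, i.e. 4 mappers.
set mapreduce.input.fileinputformat.split.maxsize=50000;
set mapreduce.input.fileinputformat.split.minsize=50000;
-- CombineHiveInputFormat (the default in newer Hive versions) merges files
-- smaller than split.minsize into shared splits where data locality allows.
set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
-- Run the query; the mapper count is reported in the job output,
-- e.g. "Hadoop job information for Stage-1: number of mappers: 4".
select count(*) from web_logs;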

