When a simple count query in Hive like below:
SELECT COUNT(*) FROM table WHERE col = 'value';
with 2GB of data takes almost 30 minutes to finish in a reasonable sized cluster like 10 nodes, how do you determine the cause of the slowness? There are many possible causes of this issue, however, I think the following three are a great start:
  1. The number of partitions
  2. The number of files per partition
  3. The size of each files under each partition compared with block size
Small files and large mount of files are big problems in Hadoop, so we should avoid them whenever possible. Please have a read on Cloudera’s blog: The Small Files Problem. It was written in 2009, a bit old, but same principal still apply. So when you have noticed that your query against a relatively small data set, say a few GBs of data, yet it still takes around 20-30 minutes to run on a reasonable sized cluster, please look out for the possible causes mentioned above. To prove that this is the cause, you can do a simple test by copying the data from one table to another and Hive will do the job for you by merging small files so that they will be close to the block size configured in the cluster. For example:
CREATE TABLE table_test LIKE original_table;
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
INSERT OVERWRITE TABLE table_test PARTITION (rundate, qsequence) SELECT * FROM original_table;
This assumes that the number of partitions are reasonable, if not, you can update the query so that the number of partitions generated in the new table can be reduced, while data still maintained. So the goal here is to:
  • Reduce the number of partitions per table
  • Reduce the number of files per partition
  • Increase the size of each file under each partition so that they will be close to the block size configured in the cluster (default of 128MB in latest CDH)
Of course there are other causes, but those ones mentioned in the article are just the basic checks to start with.

Leave a Reply

Your email address will not be published. Required fields are marked *