Hive’s Staging Directory Not Cleaned Up Properly

This article explains a situation where Hive leaves behind the staging directories it creates during processing, instead of cleaning them up after the job finishes successfully. The issue occurs when users run Hive queries through Hue’s Hive Editor; it does not apply to queries run through Beeline, the Hive CLI, or the JDBC driver.

To reproduce the issue, simply run a SELECT COUNT(*) query against any table through Hue’s Hive Editor, then check the staging location afterwards (defined by the hive.exec.stagingdir property). You will notice that a directory looking something like “.hive-staging_hive_2015-12-15_10-46-52_381_5695733254813362445-1329” remains.

This happens because Hue keeps each query open for a long time, or never closes it, depending on configuration, and the staging directory is only cleaned up when the query connection is closed. With Beeline, the Hive CLI, or the JDBC driver, the query connection is closed as soon as the query finishes and the data is returned to the client, so the cleanup is triggered straight away. Hue, however, needs to keep the query connection open so that users can come back and retrieve query results later, and so that navigating away from the query page leaves the query running in the background rather than killing it. As a result, the staging directories never get cleaned up.

There are two possible workarounds:
  1. To force Hue to close the query when the user navigates away from the page, do the following:
      1. Go to CM > Hue > Configuration
      2. Add the following to “Hue Service Advanced Configuration Snippet (Safety Valve) for hue_safety_valve.ini”:
        [beeswax]
        close_queries=true
        
        If the [beeswax] section already exists, simply add “close_queries=true” to it.
      3. Save and restart Hue
    However, this workaround has downsides:
    1. Users will no longer be able to retrieve historical query results through Hue for finished queries
    2. If the user navigates away from the Hive Editor page while a query is running, the query will be closed and killed, rather than continuing in the background
  2. Set the following HiveServer2 parameters to control the session and operation/query timeout values:
    hive.server2.session.check.interval = 1 hour
    hive.server2.idle.operation.timeout = 1 day
    hive.server2.idle.session.timeout = 3 days
    
    This ensures that any HS2 session is closed after 3 days of inactivity, with a session check running every hour, and that any operation/query kept open for more than 1 day is also closed. Timing out idle operations forces queries to be closed, which in turn triggers the cleanup of their staging directories. These settings can be found by going to CM > Hive > Configuration and searching for the following configuration names:
    Session Check Interval
    Idle Operation Timeout
    Idle Session Timeout
    
    The values above are recommendations; they should be adjusted based on cluster usage and typical query running times.
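
To verify the behavior, you can list the table location after running a query through Hue and check whether a staging directory is left behind, and confirm the timeout values HiveServer2 picked up after the restart. A minimal sketch, assuming illustrative warehouse paths and host/port:

    # List the table location and look for leftover staging directories
    # (warehouse path is illustrative; adjust to your environment)
    hdfs dfs -ls /user/hive/warehouse/mydb.db/mytable | grep '\.hive-staging'
    
    # Print the effective timeout values from a beeline session
    # (host/port are illustrative)
    beeline -u jdbc:hive2://hs2-host:10000 -e "
    SET hive.server2.session.check.interval;
    SET hive.server2.idle.operation.timeout;
    SET hive.server2.idle.session.timeout;"

Once the idle operation timeout has passed and the query has been closed, the same listing should come back empty.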
 

13 Comments

    1. Eric Lin

      Hi Ajinkya,

      Sorry about the delay, I just got back from holiday. Do you mean deleting those staging files that are stale? You can simply remove them manually by issuing “hdfs dfs -rm -r /path/to/staging/dir”. In theory, they should be cleaned up by HiveServer2 automatically, though that depends on the setup.
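
      For the manual route, a small sketch like the following can help find and remove stale staging directories in bulk. The table path and the 3-day cutoff are assumptions (and it assumes GNU date, i.e. Linux); double-check that no running query still owns a directory before deleting it:

        # Remove .hive-staging directories older than a cutoff (here: 3 days)
        # hdfs dfs -ls prints: perms repl owner group size date time path
        TABLE_DIR=/user/hive/warehouse/mydb.db/mytable   # illustrative path
        CUTOFF=$(date -d '-3 days' +%Y-%m-%d)            # GNU date assumed
        hdfs dfs -ls "$TABLE_DIR" 2>/dev/null \
          | awk -v cutoff="$CUTOFF" '$8 ~ /\.hive-staging/ && $6 < cutoff {print $8}' \
          | while read -r dir; do
              echo "Removing stale staging dir: $dir"
              hdfs dfs -rm -r -skipTrash "$dir"
            done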

    1. Eric Lin

      Hi Akhil,

      Thanks for visiting and leaving a comment on my blog. The Hive staging directory is a temporary directory used during processing; the final data is copied to its destination upon completion. This avoids writing data directly to the destination folder, which could cause issues for other users reading the data while it is being written.

      Hope that makes sense.

      Cheers

  1. Hi Eric, if we just changed the property – hive.exec.stagingdir – to another path,
    would it have a serious influence overall?

    Is that a good idea (my concern is that there may be conflicts if many Hive clients produce intermediate files at the same time)?

    (Right now I have a few tasks that add the setter
    set hive.exec.scratchdir=/tmp/hive/temps;

    before the query)

    1. Eric Lin

      Hi Steve,

      Sorry about the late reply. I was on holiday this month.

      It should be safe to change it, as long as you do not have encryption enabled, or the paths are in the same encryption zone.

      The staging directory is used per job, so it should not interfere with other jobs. Did you change it on the client side or the server side?
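
      If you want the override per query on the client side (similar to your scratchdir setter), a sketch like this works through beeline; the JDBC URL and path here are illustrative:

        # Per-session override of the staging directory
        beeline -u jdbc:hive2://hs2-host:10000 \
          --hiveconf hive.exec.stagingdir=/tmp/hive/.hive-staging \
          -e "SELECT COUNT(*) FROM mydb.mytable;"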

    1. Eric Lin

      Hi Naief,

      Thanks for visiting and posting a comment on my blog. For the differences between stagingdir and scratchdir, please see the page below:
      https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties

      and search for hive.exec.stagingdir. It mentions the following:

      Directory name that will be created inside table locations in order to support HDFS encryption. This replaces hive.exec.scratchdir for query results, with the exception of read-only tables. In all cases hive.exec.scratchdir is still used for other temporary files, such as job plans.
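
      A quick way to see both values on your own cluster is to print them from a beeline session (host/port illustrative):

        beeline -u jdbc:hive2://hs2-host:10000 -e "
        SET hive.exec.stagingdir;
        SET hive.exec.scratchdir;"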

      Hope this answers your question.

      Cheers
      Eric

  2. shrikant

    Hi Eric,

    Thanks for the Blog.
    I have been following your blog and got one concern to set the below configuration

    hive.server2.session.check.interval = 1 hour
    hive.server2.idle.operation.timeout = 1 day –> if this is set to 1 day, will a query that runs for more than 1 hour fail?
    hive.server2.idle.session.timeout = 3 days
    Why can’t the idle session timeout be a shorter time, for example 30 mins? If the session is idle for 30 mins it should close the session, with a session check interval of 15 mins.
    Please guide me on this

    Thanks in advance

    1. Eric Lin

      Hi Shrikant,

      I am really sorry about the delay in getting back to you. I got your email from your comment, but did not have time to follow up back then and forgot about it. My apologies.

      To answer your questions:

      1. hive.server2.idle.operation.timeout = 1 day –> if this is set to 1 day, will a query that runs for more than 1 hour fail?
      >> Did you mean a query that runs for more than 1 hour, or 1 day? Either way, it should not matter. As long as the query is running it won’t be timed out, as it is not considered idle. It is considered idle if the client holds the operation open after the query finishes and does not perform any further operation, like fetching more rows. This is to stop clients like Hue from holding operations open for a long time.

      2. Why can’t the idle session timeout be a shorter time, for example 30 mins? If the session is idle for 30 mins it should close the session, with a session check interval of 15 mins.

      You can set the idle session timeout to less than 30 minutes; however, it might cause issues on the end user side, as users would need to constantly reconnect, particularly Hue users.

      3 days is just a guide; you can certainly set it to a few hours or 1 day, depending on how your users use their Hive clients.

      Hope the above answers your questions.

      Cheers
      Eric

  3. venkatesh

    Hi Eric,

    I am using Spark’s “saveAsTable” to write data into a Hive table, and my question is: is there any way to write data directly into the Hive warehouse directory and avoid the Hive staging directory? It is taking a long time to move data from the staging directory to the warehouse directory.

    1. Eric Lin

      Hi Venkat,

      Thanks for visiting and commenting on my blog post.

      To answer your question, I do not think you can avoid the staging directory. If you check the Spark code here:
      https://github.com/cloudera/spark/blob/cdh6.3.2/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/SaveAsHiveFile.scala#L106-L114

      You can see that Spark follows what Hive does, and the staging dir will be used. The purpose of the staging dir is to prevent users from seeing half-written data, so that they either see the full data after a successful ingestion, or do not see it at all upon failure. This is by design, and I do not think avoiding the staging dir is a good idea.

      Hope the above answers your question.

      Cheers
      Eric

  4. Desh

    Hey Eric,
    I am using Ambari 2.7.3.0. I have thousands of staging directories on /warehouse/tablespace/managed/hive/xxx.db/…
    currently Hive Exec Scratchdir is set to /tmp/hive
    hive.start.cleanup.scratchdir is set to false

    There is no property set as: hive.exec.stagingdir

    How can I set things up to automatically delete the staging directories under /warehouse/tablespace/managed/hive/xxx.db/…?

    Thanks in advance. Much appreciated.

