Hive’s Staging Directory Not Cleaned Up Properly

This article explains a situation where Hive leaves behind the staging directories it creates during processing, instead of cleaning them up after the job finishes successfully. The issue occurs when users run Hive queries through Hue’s Hive Editor; it does not apply to queries run through Beeline, the Hive CLI, or the JDBC driver.

To reproduce the issue, simply run a SELECT COUNT(*) query against any table through Hue’s Hive Editor, then check the staging location afterwards (defined by the hive.exec.stagingdir property). You will notice that a directory looking something like “.hive-staging_hive_2015-12-15_10-46-52_381_5695733254813362445-1329” remains.

This happens because Hue keeps each query open for a long time, or never closes it, depending on configuration, and the staging directory is only cleaned up when the query connection is closed. With Beeline, the Hive CLI, or the JDBC driver, the query connection is closed as soon as the query finishes and the data is returned to the client, so the cleanup is triggered straight away. Hue, however, needs to keep the query connection open so that users can come back and retrieve query results later, and so that navigating away from the query page leaves the query running in the background rather than killing it. As a result, the staging directories never get cleaned up.

There are two possible workarounds:
  1. To force Hue to close the query when the user navigates away from the page, do the following:
      1. Go to CM > Hue > Configuration
      2. Add the following to “Hue Service Advanced Configuration Snippet (Safety Valve) for hue_safety_valve.ini”:
        [beeswax]
        close_queries=true
        
        If the [beeswax] section already exists, simply add “close_queries=true” to it.
      3. Save and restart Hue
    However, this workaround has downsides:
    1. Users will no longer be able to retrieve historical query results through Hue for finished queries
    2. If the user navigates away from the Hive Editor page while a query is running, the query will be closed and killed, rather than continuing in the background
  2. Set the following HiveServer2 parameters to control the session and operation/query timeout values:
    hive.server2.session.check.interval = 1 hour
    hive.server2.idle.operation.timeout = 1 day
    hive.server2.idle.session.timeout = 3 days
    
    This ensures that any HS2 session is closed after 3 days of inactivity, with a session check running every hour, and that any operation/query kept open for more than 1 day is also closed. Timing out idle operations forces queries to be closed, which in turn triggers the cleanup of their staging directories. These settings can be found by going to CM > Hive > Configuration and searching for the following configuration names:
    Session Check Interval
    Idle Operation Timeout
    Idle Session Timeout
    
    The values above are recommendations; they should be adjusted based on cluster usage and typical query running times.
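
To verify the behavior, you can list the table location after running a query through Hue and check whether a staging directory is left behind, and confirm the timeout values HiveServer2 picked up after the restart. A minimal sketch, assuming illustrative warehouse paths and host/port:

    # List the table location and look for leftover staging directories
    # (warehouse path is illustrative; adjust to your environment)
    hdfs dfs -ls /user/hive/warehouse/mydb.db/mytable | grep '\.hive-staging'
    
    # Print the effective timeout values from a beeline session
    # (host/port are illustrative)
    beeline -u jdbc:hive2://hs2-host:10000 -e "
    SET hive.server2.session.check.interval;
    SET hive.server2.idle.operation.timeout;
    SET hive.server2.idle.session.timeout;"

Once the idle operation timeout has passed and the query has been closed, the same listing should come back empty.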
 

13 Comments

    1. Eric Lin

      Hi Ajinkya,

      Sorry about the delay, I just got back from holiday. Do you mean deleting those staging files that are stale? You can simply remove them manually by issuing “hdfs dfs -rm -r /path/to/staging/dir”. In theory, they should be cleaned up by HiveServer2 automatically, though that depends on the setup.
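
      For the manual route, a small sketch like the following can help find and remove stale staging directories in bulk. The table path and the 3-day cutoff are assumptions (and it assumes GNU date, i.e. Linux); double-check that no running query still owns a directory before deleting it:

        # Remove .hive-staging directories older than a cutoff (here: 3 days)
        # hdfs dfs -ls prints: perms repl owner group size date time path
        TABLE_DIR=/user/hive/warehouse/mydb.db/mytable   # illustrative path
        CUTOFF=$(date -d '-3 days' +%Y-%m-%d)            # GNU date assumed
        hdfs dfs -ls "$TABLE_DIR" 2>/dev/null \
          | awk -v cutoff="$CUTOFF" '$8 ~ /\.hive-staging/ && $6 < cutoff {print $8}' \
          | while read -r dir; do
              echo "Removing stale staging dir: $dir"
              hdfs dfs -rm -r -skipTrash "$dir"
            done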

    1. Eric Lin

      Hi Akhil,

      Thanks for visiting and leaving a comment on my blog. The Hive staging directory is a temporary directory used during processing; the final data is copied to its destination upon completion. This avoids writing data directly to the destination folder, which could cause issues for other users reading the data while it is being written.

      Hope that makes sense.

      Cheers

  1. Hi Eric, if we just changed the property – hive.exec.stagingdir – to another path,
    would it have a serious influence overall?

    Is that a good idea (my concern is that there may be conflicts if many Hive clients produce intermediate files at the same time)?

    (Right now I have a few tasks that add the setter
    set hive.exec.scratchdir=/tmp/hive/temps;

    before the query)

    1. Eric Lin

      Hi Steve,

      Sorry about the late reply. I was on holiday this month.

      It should be safe to change it, as long as you do not have encryption enabled, or the paths are in the same encryption zone.

      The staging directory is used per job, so it should not interfere with other jobs. Did you change it on the client side or the server side?
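
      If you want the override per query on the client side (similar to your scratchdir setter), a sketch like this works through beeline; the JDBC URL and path here are illustrative:

        # Per-session override of the staging directory
        beeline -u jdbc:hive2://hs2-host:10000 \
          --hiveconf hive.exec.stagingdir=/tmp/hive/.hive-staging \
          -e "SELECT COUNT(*) FROM mydb.mytable;"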

    1. Eric Lin

      Hi Naief,

      Thanks for visiting and posting a comment on my blog. For the differences between stagingdir and scratchdir, please see the page below:
      https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties

      and search for hive.exec.stagingdir. It mentions the following:

      Directory name that will be created inside table locations in order to support HDFS encryption. This replaces hive.exec.scratchdir for query results, with the exception of read-only tables. In all cases hive.exec.scratchdir is still used for other temporary files, such as job plans.
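
      A quick way to see both values on your own cluster is to print them from a beeline session (host/port illustrative):

        beeline -u jdbc:hive2://hs2-host:10000 -e "
        SET hive.exec.stagingdir;
        SET hive.exec.scratchdir;"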

      Hope this answers your question.

      Cheers
      Eric

  2. shrikant

    Hi Eric,

    Thanks for the Blog.
    I have been following your blog and got one concern to set the below configuration

    hive.server2.session.check.interval = 1 hour
    hive.server2.idle.operation.timeout = 1 day –> if this is set to 1 day, will a query that runs for more than 1 hour fail?
    hive.server2.idle.session.timeout = 3 days
    Why can’t the idle session timeout be a shorter time, for example 30 mins? If the session is idle for 30 mins it should close the session, with a session check interval of 15 mins.
    Please guide me on this

    Thanks in advance

    1. Eric Lin

      Hi Shrikant,

      I am really sorry about the delay in getting back to you. I got your email from your comment, but did not have time to follow up back then and forgot about it. My apologies.

      To answer your questions:

      1. hive.server2.idle.operation.timeout = 1 day –> if this is set to 1 day, will a query that runs for more than 1 hour fail?
      >> Did you mean a query that runs for more than 1 hour, or 1 day? Either way, it should not matter. As long as the query is running it won’t be timed out, as it is not considered idle. It is considered idle if the client holds the operation open after the query finishes and does not perform any further operation, like fetching more rows. This is to stop clients like Hue from holding operations open for a long time.

      2. Why can’t the idle session timeout be a shorter time, for example 30 mins? If the session is idle for 30 mins it should close the session, with a session check interval of 15 mins.

      You can set the idle session timeout to less than 30 minutes; however, it might cause issues on the end user side, as users would need to constantly reconnect, particularly Hue users.

      3 days is just a guide; you can certainly set it to a few hours or 1 day, depending on how your users use their Hive clients.

      Hope the above answers your questions.

      Cheers
      Eric

  3. venkatesh

    Hi Eric,

    I am using Spark’s “saveAsTable” to write data into a Hive table, and my question is: is there any way to write data directly into the Hive warehouse directory and avoid the Hive staging directory? It is taking a long time to move data from the staging directory to the warehouse directory.

    1. Eric Lin

      Hi Venkat,

      Thanks for visiting and commenting on my blog post.

      To answer your question, I do not think you can avoid the staging directory. If you check the Spark code here:
      https://github.com/cloudera/spark/blob/cdh6.3.2/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/SaveAsHiveFile.scala#L106-L114

      You can see that Spark follows what Hive does, and the staging dir will be used. The purpose of the staging dir is to prevent users from seeing half-written data, so that they either see the full data after a successful ingestion, or do not see it at all upon failure. This is by design, and I do not think avoiding the staging dir is a good idea.

      Hope the above answers your question.

      Cheers
      Eric

  4. Desh

    Hey Eric,
    I am using Ambari 2.7.3.0. I have thousands of staging directories on /warehouse/tablespace/managed/hive/xxx.db/…
    currently Hive Exec Scratchdir is set to /tmp/hive
    hive.start.cleanup.scratchdir is set to false

    There is no property set as: hive.exec.stagingdir

    How can I set things up to automatically delete the staging directories under /warehouse/tablespace/managed/hive/xxx.db/…?

    Thanks in advance. Much appreciated.

