- To force Hue to close queries when the user navigates away from the page, you can do the following:
  - go to CM > Hue > Configuration
  - add the following to “Hue Service Advanced Configuration Snippet (Safety Valve) for hue_safety_valve.ini”:
    [beeswax]
    close_queries=true
    (if the [beeswax] section already exists, simply add “close_queries=true” to it)
  - Save and restart Hue
- Users will no longer be able to retrieve historical query results through Hue for finished queries
- If the user navigates away from the Hive Editor page while a query is running, the query will be closed and killed, rather than left running in the background
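Putting those steps together, the complete safety valve entry would look like this (the section header is required if it is not already present):

```ini
# Hue Service Advanced Configuration Snippet (Safety Valve) for hue_safety_valve.ini
[beeswax]
close_queries=true
```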
- Set the following HiveServer2 parameters to control the session and operation/query timeout values:

hive.server2.session.check.interval = 1 hour
hive.server2.idle.operation.timeout = 1 day
hive.server2.idle.session.timeout = 3 days

This ensures that any HS2 session will be closed after 3 days of inactivity, with a session check running every hour. Any operation/query that is kept open for more than 1 day will also be closed, which forces idle queries to be closed and hence triggers the cleanup of their staging directories.

These settings can be found by going to CM > Hive > Configuration and searching for the following configuration names: Session Check Interval, Idle Operation Timeout and Idle Session Timeout.

The above are recommended values; however, they should be adjusted based on cluster usage and query running times.
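If you manage hive-site.xml by hand rather than through CM, the equivalent entries would look like the sketch below (Hive accepts time-unit suffixes such as h and d for these properties; the values shown are the same examples as above, not mandatory settings):

```xml
<!-- Close idle HS2 sessions/operations; tune these values for your cluster -->
<property>
  <name>hive.server2.session.check.interval</name>
  <value>1h</value>
</property>
<property>
  <name>hive.server2.idle.operation.timeout</name>
  <value>1d</value>
</property>
<property>
  <name>hive.server2.idle.session.timeout</name>
  <value>3d</value>
</property>
```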
How do we delete the staging files?
Hi Ajinkya,
Sorry about the delay, I just got back from holiday. Do you mean deleting those staging files that are stale? You can simply remove them manually by issuing “hdfs dfs -rm -r /path/to/staging/dir”. In theory, they should be cleaned up by HiveServer2 automatically, but that depends on the setup.
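As a sketch, stale staging directories can be found and removed like below. The warehouse path and the staging directory name are assumptions for illustration; always confirm a directory is stale before deleting it, since removing the staging directory of a still-running query will fail that query:

```shell
# List candidate staging directories under an assumed table location
hdfs dfs -ls /user/hive/warehouse/mydb.db/mytable | grep '.hive-staging'

# Remove a confirmed-stale staging directory
# (-skipTrash frees the space immediately instead of moving it to trash)
hdfs dfs -rm -r -skipTrash /user/hive/warehouse/mydb.db/mytable/.hive-staging_hive_example
```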
Hey, why do we use staging directories?
Hi Akhil,
Thanks for visiting and leaving a comment on my blog. The Hive staging directory is a temporary directory used during processing; the final data is copied to the destination upon completion. This avoids writing data live into the destination folder, which could cause issues for other users reading the data while it is being written.
Hope that makes sense.
Cheers
Hi Eric, if we just change the property hive.exec.stagingdir to another path,
will it have any serious impact overall?
Is that a good idea? (My concern is that there may be conflicts if many Hive clients produce intermediate files at the same time.)
(Right now I have a few tasks that add the setting
set hive.exec.scratchdir=/tmp/hive/temps;
before the query.)
Hi Steve,
Sorry about the late reply. I was on holiday this month.
It should be safe to change it, provided that you do not have encryption enabled, or that both paths are in the same encryption zone.
The staging directory is created per job, so it should not interfere with other jobs. Did you change it on the client side or the server side?
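To change it on the client side for a single session, it can be set in Beeline before running the query; the path below is just an example:

```sql
-- Client-side, per-session override (example path, not a recommendation)
SET hive.exec.stagingdir=/tmp/hive/.hive-staging;
```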
Hi Eric,
In my cluster I do not see hive.exec.stagingdir. Is it a custom parameter for hive-site?
Also what is the difference between hive.exec.stagingdir and hive.exec.scratchdir?
Thank you.
Regards,
Naief
Hi Naief,
Thanks for visiting and posting comment on my blog. For the differences between stagingdir and scratchdir, please see below page:
https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties
and search for hive.exec.stagingdir. It says the following:
“Directory name that will be created inside table locations in order to support HDFS encryption. This replaces hive.exec.scratchdir for query results, with the exception of read-only tables. In all cases hive.exec.scratchdir is still used for other temporary files, such as job plans.”
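For reference, the defaults for the two properties look like this in hive-site.xml: hive.exec.stagingdir is a directory name created under the table location, while hive.exec.scratchdir is an absolute HDFS path:

```xml
<property>
  <name>hive.exec.stagingdir</name>
  <!-- default: a dot-prefixed directory created inside the table/partition location -->
  <value>.hive-staging</value>
</property>
<property>
  <name>hive.exec.scratchdir</name>
  <!-- default: used for other temporary files, such as job plans -->
  <value>/tmp/hive</value>
</property>
```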
Hope this answers your question.
Cheers
Eric
Hi Eric,
Thanks for the Blog.
I have been following your blog and got one concern to set the below configuration
hive.server2.session.check.interval = 1 hour
hive.server2.idle.operation.timeout = 1 day –> if this is set to 1 day, and a query runs for more than 1 hour, will that query fail?
hive.server2.idle.session.timeout = 3 days
Why can’t the idle session timeout be a shorter time, for example 30 mins? If the session is idle for 30 mins it should close the session, with a session check interval of 15 mins.
Please guide me on this
Thanking in advance
Hi Shrikant,
I am really sorry about the delay in getting back to you. I got your email from your comment, but did not have time to follow up back then and forgot about it. My apologies.
To answer your questions:
1. hive.server2.idle.operation.timeout = 1 day –> if this is set to 1 day, and a query runs for more than 1 hour, will that query fail?
>> Did you mean if the query runs for more than 1 hour, or 1 day? Either way, it should not matter. As long as the query is running it won’t be timed out, as it is not considered idle. It is considered idle if the client holds the operation open after the query finishes without performing any further operations, such as fetching more rows. This setting exists to stop clients like Hue from holding operations open for a long time.
2. Why can’t the idle session timeout be a shorter time, for example 30 mins, with a session check interval of 15 mins?
You can set the idle session timeout to less than 30 minutes; however, it might cause issues on the end-user side, as users would need to constantly reconnect, particularly Hue users.
3 days is just a guide; you can certainly set it to a few hours or 1 day, depending on how your users use their Hive clients.
Hope above answers your questions.
Cheers
Eric
Hi Eric,
I am using Spark’s saveAsTable to write data into a Hive table, and my question is: is there any way to write data directly into the Hive warehouse directory and avoid the Hive staging directory? It is taking a long time to move data from the staging directory to the warehouse directory.
Hi Venkat,
Thanks for visiting and commenting on my blog post.
To answer your question, I do not think you can avoid staging directory. If you check spark code here:
https://github.com/cloudera/spark/blob/cdh6.3.2/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/SaveAsHiveFile.scala#L106-L114
You can see that Spark follows what Hive does, and the staging dir will be used. The purpose of the staging dir is to prevent users from seeing half-written data, so that users either see the full data after a successful insertion, or do not see it at all upon failure. This is by design, and I do not think avoiding the staging dir is a good idea.
Hope above answers your question.
Cheers
Eric
Hey Eric,
I am using Ambari 2.7.3.0. I have thousands of staging directories under /warehouse/tablespace/managed/hive/xxx.db/…
Currently Hive Exec Scratchdir is set to /tmp/hive
hive.start.cleanup.scratchdir is set to false
There is no property set for hive.exec.stagingdir
How can I set it to automatically delete staging directories under /warehouse/tablespace/managed/hive/xxx.db/…?
Thanks in advance. Much appreciated.