How to Find Out the Query That Locked Hive Table

I have seen many Hive users face an issue where their query took a long time to compile, even against a table that is small and has few columns or partitions. Those queries failed with an error message similar to the one below, because the table was locked and the query simply had to wait for the lock to be released:

2020-02-17 04:10:51,294 ERROR ZooKeeperHiveLockManager: [HiveServer2-Background-Pool: Thread-391004]: Unable to acquire IMPLICIT, EXCLUSIVE lock @@table_name= after 100 attempts.
2020-02-17 04:10:51,296 ERROR org.apache.hadoop.hive.ql.Driver: [HiveServer2-Background-Pool: Thread-391004]: FAILED: Error in acquiring locks: Locks on the underlying objects cannot be acquired. retry after some time
org.apache.hadoop.hive.ql.lockmgr.LockException: Locks on the underlying objects cannot be acquired. retry after some time

This post will show you how to find the original query that locked the table, and then work out why that query took so long to run. This requires that the issue is still present at the time of troubleshooting: once the original query has finished, the lock is released, and we will not be able to perform the steps below.

Firstly, we need to identify which table, under which database, was locked. This should be obvious from the log above, or simply from the query that failed with the error.

Then go to a ZooKeeper host and run the "zookeeper-client" command to check the znode information. When Hive runs a query, it creates a znode in ZooKeeper under /hive_zookeeper_namespace_hive/<db-name>/<table-name>, and in the case of partitions, under /hive_zookeeper_namespace_hive/<db-name>/<table-name>/<part-name>=<part-value>. In my example, the znode LOCK-SHARED-0000000000 below was created for my table "test" under database "default":

[zk: localhost:2181(CONNECTED) 4] ls /hive_zookeeper_namespace_hive/default/test
[LOCK-SHARED-0000000000]

We can see that it was a SHARED lock, because I simply ran a SELECT query against the table.
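The znode path convention described above can be sketched as a small helper. This is just an illustration of the layout, not part of Hive itself; the function name and the `(name, value)` partition tuple are my own choices:

```python
def lock_znode_path(db, table, partition=None,
                    ns="hive_zookeeper_namespace_hive"):
    """Build the ZooKeeper path where Hive's ZK lock manager puts lock znodes.

    `partition` is an optional (partition-name, partition-value) tuple,
    matching the <part-name>=<part-value> suffix for partition-level locks.
    """
    path = f"/{ns}/{db}/{table}"
    if partition:
        name, value = partition
        path += f"/{name}={value}"
    return path

print(lock_znode_path("default", "test"))
# /hive_zookeeper_namespace_hive/default/test
print(lock_znode_path("default", "test", ("dt", "2020-02-25")))
# /hive_zookeeper_namespace_hive/default/test/dt=2020-02-25
```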

To find out which query created the znode, run the "get" command against it in ZooKeeper:

[zk: localhost:2181(CONNECTED) 5] get /hive_zookeeper_namespace_hive/default/test/LOCK-SHARED-0000000000
hive_20200225162929_c91be6f4-b89a-49a9-af24-674aec719efc:1582676988255:IMPLICIT:SELECT count(*) FROM
cZxid = 0x2bccc
ctime = Tue Feb 25 16:29:48 PST 2020
mZxid = 0x2bccc
mtime = Tue Feb 25 16:29:48 PST 2020
pZxid = 0x2bccc
cversion = 0
dataVersion = 0
aclVersion = 0
ephemeralOwner = 0x17041db20995420
dataLength = 119
numChildren = 0
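The lock data stored in the znode follows a simple colon-delimited layout: query ID, lock creation time in epoch milliseconds, lock type, and the query text. As a rough sketch of how to pull it apart (assuming the data string is copied from the `get` output above, and that the query text is always the last field, since the query itself may contain colons):

```python
from datetime import datetime, timezone

# Znode data as returned by the `get` command above; assumed layout:
# <query-id>:<epoch-millis>:<lock-type>:<query-text>
lock_data = ("hive_20200225162929_c91be6f4-b89a-49a9-af24-674aec719efc"
             ":1582676988255:IMPLICIT:SELECT count(*) FROM test")

# The query text may itself contain colons, so split at most 3 times
query_id, epoch_ms, lock_type, query = lock_data.split(":", 3)

# Convert the epoch milliseconds into a readable UTC timestamp
acquired_at = datetime.fromtimestamp(int(epoch_ms) / 1000, tz=timezone.utc)

print(query_id)    # hive_20200225162929_c91be6f4-b89a-49a9-af24-674aec719efc
print(lock_type)   # IMPLICIT
print(query)       # SELECT count(*) FROM test
print(acquired_at) # 2020-02-26 00:29:48.255000+00:00
```

Note that 2020-02-26 00:29:48 UTC is 16:29:48 PST, matching the ctime shown in the znode stat.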

You can now see very clearly that a query "SELECT count(*) FROM test" was triggered from an IP address that is my HiveServer2 host, and that the query ID in Hive was hive_20200225162929_c91be6f4-b89a-49a9-af24-674aec719efc. We can then take this query ID and search the HiveServer2 log to find out when the query was triggered and how long it took:

2020-02-25 16:29:48,212 INFO  org.apache.hadoop.hive.ql.Driver: [HiveServer2-Handler-Pool: Thread-148]: Compiling command(queryId=hive_20200225162929_c91be6f4-b89a-49a9-af24-674aec719efc): SELECT count(*) FROM test
2020-02-25 16:29:48,250 INFO  org.apache.hadoop.hive.ql.Driver: [HiveServer2-Handler-Pool: Thread-148]: Completed compiling command(queryId=hive_20200225162929_c91be6f4-b89a-49a9-af24-674aec719efc); Time taken: 0.038 seconds

2020-02-25 16:29:48,264 INFO  org.apache.hadoop.hive.ql.Driver: [HiveServer2-Background-Pool: Thread-204]: Executing command(queryId=hive_20200225162929_c91be6f4-b89a-49a9-af24-674aec719efc): SELECT count(*) FROM test
2020-02-25 16:30:08,281 INFO  org.apache.hadoop.hive.ql.Driver: [HiveServer2-Background-Pool: Thread-204]: Completed executing command(queryId=hive_20200225162929_c91be6f4-b89a-49a9-af24-674aec719efc); Time taken: 20.016 seconds
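The compile and execution durations can also be cross-checked directly from the timestamps on these log lines. A minimal sketch, with the four timestamps above pasted in as strings (the `%Y-%m-%d %H:%M:%S,%f` pattern matches the comma-separated-milliseconds layout shown in the log):

```python
from datetime import datetime

# Timestamps copied from the HiveServer2 log lines above
FMT = "%Y-%m-%d %H:%M:%S,%f"

compile_start = datetime.strptime("2020-02-25 16:29:48,212", FMT)
compile_end   = datetime.strptime("2020-02-25 16:29:48,250", FMT)
exec_start    = datetime.strptime("2020-02-25 16:29:48,264", FMT)
exec_end      = datetime.strptime("2020-02-25 16:30:08,281", FMT)

print((compile_end - compile_start).total_seconds())  # 0.038
print((exec_end - exec_start).total_seconds())        # 20.017
```

The 0.038-second gap matches the "Time taken: 0.038 seconds" reported for compilation, and the ~20-second gap matches the reported execution time.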

In my example it took only 0.038 seconds to compile and 20 seconds to run. For a query that takes a long time to compile and run, we can get the EXPLAIN EXTENDED output of the query to understand how it was planned and what data was involved, and we may also need the DDL of the tables involved to understand why compilation was slow. There are lots of reasons for slow query compilation; the common ones are below:

  • Too many partitions need to be scanned
  • Table is too wide with many columns and Hive has Sentry column level privileges enabled
  • Query string is too long; yes, I have seen query strings several hundred MB in size, don't ask me why and how :), though this is not very common
  • Slow metadata retrieval from the Hive Metastore's backend database
  • Slow Sentry user group mapping retrieval

Of course, there are many more, so we need to check the details and determine the cause on a case-by-case basis.

I hope the above information is useful.

