I have seen lots of Hive users faced an issue that their query took a long time to compile, even for a table that is small and without many columns or partitions. Those queries failed with error message similar to below, which was due to table was locked and query simply had to wait for release of the lock:
2020-02-17 04:10:51,294 ERROR ZooKeeperHiveLockManager: [HiveServer2-Background-Pool: Thread-391004]: Unable to acquire IMPLICIT, EXCLUSIVE lock @@table_name= after 100 attempts. 2020-02-17 04:10:51,296 ERROR org.apache.hadoop.hive.ql.Driver: [HiveServer2-Background-Pool: Thread-391004]: FAILED: Error in acquiring locks: Locks on the underlying objects cannot be acquired. retry after some time org.apache.hadoop.hive.ql.lockmgr.LockException: Locks on the underlying objects cannot be acquired. retry after some time
This post will show you how to find out the original query that locked the table, then find out why this query took so long to run. This requires that the issue is present at the time of troubleshooting. If the original query finished, the lock will be released, then we won’t be able to perform below steps.
Firstly, we need to understand which table under which database was locked. This should be obvious from the log mentioned above, or simply from the query that failed with above mentioned error.
Then go to ZooKeeper host and run command “zookeeper-client” to check znodes information. When Hive runs a query, it will create a znode in ZK under /hive_zookeeper_namespace_hive/<db-name/<table-name>, and if in the case of partitions, it will be under /hive_zookeeper_namespace_hive/<db-name/<table-name>/<part-name>=<part-value>. In my example, below znode LOCK-SHARED-0000000000 was created for my table “test” under database “default”:
[zk: localhost:2181(CONNECTED) 4] ls /hive_zookeeper_namespace_hive/default/test [LOCK-SHARED-0000000000]
We can see that it was a SHARED lock, because I simply ran a SELECT query against the table.
To find out what query created the znode, run “get” command against it in ZK:
[zk: localhost:2181(CONNECTED) 5] get /hive_zookeeper_namespace_hive/default/test/LOCK-SHARED-0000000000 hive_20200225162929_c91be6f4-b89a-49a9-af24-674aec719efc:1582676988255:IMPLICIT:SELECT count(*) FROM test:xx.xx.xxx.xxx cZxid = 0x2bccc ctime = Tue Feb 25 16:29:48 PST 2020 mZxid = 0x2bccc mtime = Tue Feb 25 16:29:48 PST 2020 pZxid = 0x2bccc cversion = 0 dataVersion = 0 aclVersion = 0 ephemeralOwner = 0x17041db20995420 dataLength = 119 numChildren = 0
You can now see very clearly that, a query “SELECT count(*) FROM test” was triggered from IP address xx.xx.xxx.xxx, which is my HiveServer2 host, and the query ID in Hive was hive_20200225162929_c91be6f4-b89a-49a9-af24-674aec719efc. We can then take this Query ID and search in HiveServer2 log to find out when the query was triggered and how long it took:
2020-02-25 16:29:48,212 INFO org.apache.hadoop.hive.ql.Driver: [HiveServer2-Handler-Pool: Thread-148]: Compiling command(queryId=hive_20200225162929_c91be6f4-b89a-49a9-af24-674aec719efc): SELECT count(*) FROM test ... 2020-02-25 16:29:48,250 INFO org.apache.hadoop.hive.ql.Driver: [HiveServer2-Handler-Pool: Thread-148]: Completed compiling command(queryId=hive_20200225162929_c91be6f4-b89a-49a9-af24-674aec719efc); Time taken: 0.038 seconds 2020-02-25 16:29:48,264 INFO org.apache.hadoop.hive.ql.Driver: [HiveServer2-Background-Pool: Thread-204]: Executing command(queryId=hive_20200225162929_c91be6f4-b89a-49a9-af24-674aec719efc): SELECT count(*) FROM test ... 2020-02-25 16:30:08,281 INFO org.apache.hadoop.hive.ql.Driver: [HiveServer2-Background-Pool: Thread-204]: Completed executing command(queryId=hive_20200225162929_c91be6f4-b89a-49a9-af24-674aec719efc); Time taken: 20.016 seconds
In my example it only too 0.038 seconds to compile and 20 seconds to run. If a query that took a long time to compile and run, then we can get EXPLAIN EXTENDED output of the query and understand how it was planned, what data was involved, and possibly also need DDL of tables involved to understand why it was slow to compile. There are lots of reasons for the slow query compilation, the common ones are below:
- Too many partitions need to be scanned
- Table is too wide with many columns and Hive has Sentry column level privileges enabled
- Query string is too long, and yes, I have see query string with several hundreds MB in size, don’t ask me why and how :), though not very common
- Slow performance from Hive MetaStore’s backend database to retrieve Metadata information
- Slow Sentry user group mapping retrieval
Of course, there are a lot more, so need to check in details and determine the cause on case by case basis.
Hope above information can be useful.