How Table Locking Works in Hive

This article explains how table locking works in Hive by walking through a series of Hive commands and their outputs. To do this, I have created two simple tables in my small cluster, called “test” and “test_partitioned”.

Initially, when no query is running against the “test” table, the table should have no locks:

hive> SHOW LOCKS test;
OK
Time taken: 0.305 seconds

If you do a simple query like:

SELECT count(1) FROM test;

you will see that the table will be “SHARED” locked:

hive> SHOW LOCKS test;
OK
default@test	SHARED
Time taken: 0.159 seconds, Fetched: 1 row(s)

A “SHARED” lock is also called a “READ” lock: other sessions can still read from the table, but any writes have to wait until the lock is released.
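To see this read-compatible behaviour in action, you can run reads from two sessions at once. The sketch below is illustrative (the second session’s queries are not from my original test run):

```sql
-- Session 1: a long-running read takes a SHARED lock on the table
SELECT count(1) FROM test;

-- Session 2: another read proceeds immediately,
-- because SHARED locks are compatible with each other
SELECT * FROM test LIMIT 10;

-- Session 2: a write, however, has to wait until the SHARED lock is released
INSERT OVERWRITE TABLE test SELECT COUNT(1) FROM sample_07;
```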

Now, if you want to write data to the table using a query like:

INSERT OVERWRITE TABLE test SELECT COUNT(1) FROM sample_07;

the table will be locked “EXCLUSIVELY”:

hive> SHOW LOCKS test;
OK
default@test	EXCLUSIVE
Time taken: 0.158 seconds, Fetched: 1 row(s)

An “EXCLUSIVE” lock is also called a “WRITE” lock: no one else can read from or write to the table while the lock is present. All other queries have to wait for the current query to finish before they can start.

You can also manually lock tables whenever you want:

hive> LOCK TABLE test SHARED;
OK
Time taken: 0.244 seconds
hive> SHOW LOCKS test;
OK
default@test	SHARED
Time taken: 0.107 seconds, Fetched: 1 row(s)
hive> UNLOCK TABLE test;
OK
Time taken: 0.255 seconds
hive> SHOW LOCKS test;
OK
Time taken: 0.114 seconds
hive> LOCK TABLE test EXCLUSIVE;
OK
Time taken: 0.154 seconds
hive> SHOW LOCKS test;
OK
default@test	EXCLUSIVE
Time taken: 0.083 seconds, Fetched: 1 row(s)
hive> UNLOCK TABLE test;
OK
Time taken: 0.127 seconds
hive> SHOW LOCKS test;
OK
Time taken: 0.232 seconds

The locking can also be applied to table partitions:

hive> LOCK TABLE test_partitioned PARTITION (p='p1') EXCLUSIVE;
OK
Time taken: 0.31 seconds
hive> SHOW LOCKS test_partitioned PARTITION (p='p1');
OK
default@test_partitioned@p=p1	EXCLUSIVE
Time taken: 0.189 seconds, Fetched: 1 row(s)
hive> SHOW LOCKS test_partitioned;
OK
Time taken: 0.105 seconds
hive> UNLOCK TABLE test_partitioned PARTITION (p='p1');
OK
Time taken: 0.136 seconds
hive> SHOW LOCKS test_partitioned PARTITION (p='p1');
OK
Time taken: 0.123 seconds
hive> SHOW LOCKS test_partitioned;
OK
Time taken: 0.081 seconds 

When you write to a partition of a table using static partitioning, an EXCLUSIVE lock will be applied to the partition being written to, and a SHARED lock will be applied to the table itself:

INSERT OVERWRITE TABLE test_partitioned PARTITION (p='p1') SELECT salary FROM sample_07;
hive> SHOW LOCKS test_partitioned;
OK
default@test_partitioned	SHARED
Time taken: 1.345 seconds, Fetched: 1 row(s)
hive> SHOW LOCKS test_partitioned PARTITION (p='p1');
OK
default@test_partitioned@p=p1	EXCLUSIVE
Time taken: 0.243 seconds, Fetched: 1 row(s)

However, there is a bug in Hive: when you run an “INSERT OVERWRITE” using dynamic partitioning, Hive is unable to figure out which partitions need to be locked, so it currently applies only a “SHARED” lock to the table being updated.

To be absolutely safe, Hive should apply an EXCLUSIVE lock to the table, preventing any further updates to the table and all of its partitions, but it does not.

So, against the same table, I am able to run the following query twice in two different Hive sessions, and they will race with each other:

hive> INSERT OVERWRITE TABLE test_partitioned PARTITION (p) SELECT salary, 'p1' AS p FROM sample_07;

hive> INSERT OVERWRITE TABLE test_partitioned PARTITION (p) SELECT salary, 'p1' AS p FROM sample_07;

Of course, you will have to enable dynamic partitioning for the above query to run.
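For reference, these are the settings I mean (a minimal sketch; “nonstrict” mode allows all partition columns to be dynamic):

```sql
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
```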

They can be run at the same time, and whichever finishes last will overwrite the results of the other.
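While one of the racing inserts is running, you can check the locks yourself; per the bug described above, you should see only a SHARED lock on the table, and no lock on the partition being overwritten:

```sql
SHOW LOCKS test_partitioned;
SHOW LOCKS test_partitioned PARTITION (p='p1');
```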

I tested this under CDH5.3 and CDH5.4, and the bug is present in both. There is an upstream Hive JIRA issue;
have a look at HIVE-3509 – “Exclusive locks are not acquired when using dynamic partitions” for more information.

That’s all I have to say about locks in Hive. If I missed anything, please let me know in the comments.

24 Comments

    1. Eric Lin

      Hi Boris,

      Thanks for visiting.

I believe in that case the attempt to obtain the exclusive lock will have to wait until the shared lock is released (that’s the purpose of a shared lock: to prevent others from writing).

      Hope this helps.

  1. radhika

I am getting the below error while executing “show locks tablename”:
    FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. show Locks LockManager not specified

    1. Eric Lin

      Hi Radhika,

Thanks for visiting and for your question on my blog. Based on the error message “LockManager not specified”, I suspect one or both of the below:

      1. hive.support.concurrency is not set to “true”
      2. you do not have ZooKeeper enabled, which is required for locking to work in Hive.
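As a sketch of the relevant configuration (normally set in hive-site.xml; the property values here are illustrative, and the ZooKeeper quorum must point at your own ensemble):

```sql
SET hive.support.concurrency=true;
SET hive.lock.manager=org.apache.hadoop.hive.ql.lockmgr.zookeeper.ZooKeeperHiveLockManager;
SET hive.zookeeper.quorum=zk-host1:2181,zk-host2:2181,zk-host3:2181;
```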

      Cheers

    1. Eric Lin

      Hi Amulyam,

      Thanks for your comments.

Hive now has an ACID transactions feature available. Hortonworks supports it; however, Cloudera still considers the feature experimental and currently does not support ACID in CDH.

Normally, users do not use manual locking on Hive tables, because Hive queries themselves take care of that automatically. However, if a user decides that, for whatever reason, he/she does not want others to view or update the table, then locking can be used.

      Hope above helps.

      1. Amulyam Agrawal

I have a use case where I want the whole table to be locked because I want to overwrite it, but many read queries keep coming in during this time. What do you suggest in this scenario?

        1. Eric Lin

          Hi Amulyam,

In that case the READ operation really has to wait. If you force READING while WRITING, the READ operation will return unexpected results. Currently Hive enforces an EXCLUSIVE lock while writing, which causes any subsequent read on the same table to wait. You don’t really have a choice, and you should not want one anyway, for the sake of consistency.

If you have the table partitioned, and READING and WRITING operate on different partitions, then that will work, as they will not interfere with each other. The EXCLUSIVE lock will only apply to the partitions that the query runs over.
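As a sketch of this, assuming partitions p1 and p2 both exist in the “test_partitioned” table from the post:

```sql
-- Session 1: takes an EXCLUSIVE lock on partition p1 (plus SHARED on the table)
INSERT OVERWRITE TABLE test_partitioned PARTITION (p='p1')
SELECT salary FROM sample_07;

-- Session 2: reads a different partition, so it does not have to wait
SELECT * FROM test_partitioned WHERE p = 'p2';
```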

          Cheers.

    1. Eric Lin

      Hi Raushan,

      Thanks for visiting my blog. To answer your question, no, you can’t drop a table if there is a shared lock. Hive will wait for the lock to be released before dropping the table. If the lock is there for a long time, then the drop query might eventually fail.

      Cheers

    1. Eric Lin

      Hi Ram,

      Thanks for visiting my blog and posting questions.

When you INSERT OVERWRITE into the same partition you are reading from, the EXCLUSIVE lock will be placed on the partition for this query, and the query will still READ first and then WRITE, so there is no race condition here; the query should still work.

      Other queries against this partition will have to wait, regardless of whether reading or writing.

      Cheers
      Eric

  2. Rahul Sharma

Thank you for the explanation. I was going through your blog to fulfil my requirement, so let me summarize my use case: I have to read the data from table x, and while reading it I want to make sure that other writers wait until the current read completes, because I have to truncate the table after that read operation, so that only fresh entries are present.
Let me start explaining the use cases.
    1) SHARED LOCK
    hive> lock table gdpr_del_req_status2 shared;
    OK
    Time taken: 0.076 seconds
    hive>
Now in another session, I am trying to write:
    insert into table gdpr.gdpr_del_req_status2 PARTITION(req_dts) select req_id,tbl_guid,req_status,update_user ,current_date() as req_dts from gdpr.gdpr_del_req_status1 limit 2;
    Query ID = idscah_20200602170505_c627bf20-52cf-439d-808c-6cdd3b289071
    Total jobs = 3
    Launching Job 1 out of 3
    Number of reduce tasks determined at compile time: 1
    In order to change the average load for a reducer (in bytes):
    set hive.exec.reducers.bytes.per.reducer=
    In order to limit the maximum number of reducers:
    set hive.exec.reducers.max=
    In order to set a constant number of reducers:
    set mapreduce.job.reduces=

When I check the locks on the current table, I see that only a ‘SHARED’ lock has been acquired by the latest write operation. Why is that? It should be an ‘EXCLUSIVE’ lock because of the write op. Correct me if I am wrong.
    hive> show locks gdpr_del_req_status2;
    OK
    gdpr@gdpr_del_req_status2 SHARED
    gdpr@gdpr_del_req_status2 SHARED
2) EXCLUSIVE LOCK
    hive> lock table gdpr_del_req_status2 EXCLUSIVE;
    OK
    Time taken: 0.043 seconds
    hive>
Now in another session I am not able to read/write:
    hive> select * from gdpr.gdpr_del_req_status2;
The query gets stuck, which works as expected. But here is another surprise: in the current session I am also not able to read/write. Why is that?
This is the session in which I took the EXCLUSIVE lock:
    hive> lock table gdpr_del_req_status2 EXCLUSIVE;
    OK
    Time taken: 0.043 seconds
    hive> desc gdpr_del_req_status2;
Now the above query is stuck.
Please correct my understanding if I am doing anything wrong.

    1. Eric Lin

      Hi Rahul,

      Thanks for visiting my blog and post questions. Let me answer them below:

1. Currently Hive has a bug where it does not acquire an EXCLUSIVE lock when dynamic partitioning is used; please refer to the JIRA in the upstream:

      https://issues.apache.org/jira/browse/HIVE-3509

which I already mentioned in my post. So you are right that it should be EXCLUSIVE in this case, but due to the bug, it currently is not.

2. If a table is EXCLUSIVELY locked, then no other queries can proceed until the table is unlocked. This is by design; otherwise, why would we need the lock? An EXCLUSIVE lock prevents both READ and WRITE, while a SHARED lock prevents only WRITE.

      Hope above answers your question. If not, please let me know.

      Cheers
      Eric

  3. Hamit Cherah

    Hi Eric,
    Thank you very much for your analysis and your sharing.

In my case, we are still using CDH5.13 (and yes, it does still exist) with Hive 1.1 (so no ACID).

    let me explain my problem to you and try to simplify it.

    Suppose a table T has been partitioned statically. And consider two queries: (1) Q_w a query which writes to a partition ‘p1’ of T at time ‘t’ and which will take 30 minutes to complete. (2) Q_r a query which will simply read the whole table T (Select * from T) and which is launched at ‘t’ + 10minutes (ie Q_r is launched 10 minutes after Q_w).
Unless I am mistaken, at moment ‘t’ + 10 minutes:
    – table T has a “Shared” status and
    – partition ‘p1’ of T has an “Exclusive” status (because Q_w has locked this partition).

    My question is the following :
If we consider that we are at ‘t’ + 10 minutes, with 20 minutes left before Q_w finishes writing to ‘p1’, do I understand correctly that Q_r cannot return a result before the write to ‘p1’ completes? If so, how can we handle read and write queries in parallel (you understand that a reading process cannot wait 20 long minutes in real life!)?

    1. Eric Lin

      Hi Hamit,

      Thanks for visiting my blog and posting questions.

      Yes, you are correct that at ‘t’ + 10 minutes, the table has shared lock status and p1 partition has an exclusive lock.

However, since the table only holds a SHARED lock while Q_w is running against p1, Q_r should be permitted to start without waiting for Q_w to finish first, so you should not face any delay.

I suggest you perform a simple test to confirm, but I believe that is the case.

      Hope above helps.

      Cheers
      Eric

  4. Kumar

    Hi Eric,
My Hive table’s lock count is very high in production because we have 13K partitions in the table. How can we avoid the huge lock count, and what changes can we make to the table to reduce it?

    1. Eric Lin

      Hi Kumar,

      Sorry about the late response, I was on holiday earlier.

Looks like you have a large-number-of-partitions issue. A few thoughts:

1. Can you review how big the files under each partition are? If they are too small, can you merge them to reduce the total partition count?
2. Review your query: does it need to scan all 13K partitions? Can you optimize it?

These of course depend on how your partition keys were set up. The goal is to reduce the number of partitions if possible, or to avoid a full table scan in your query.

      Cheers
      Eric

    1. Eric Lin

      Hi Ahmed,

      Thanks for visiting my blog and posting questions.

I do not fully understand your question, but I guess you are asking whether you need to manually put the Hive table under a write lock. If so, you should not need to do it manually, as Hive automatically creates locks based on your query type. So if you SELECT from a partition of a table, that partition will be SHARED locked; if you update the whole table using INSERT INTO, then an EXCLUSIVE lock will be acquired for the whole table.

      Hope that answers your question.

      Cheers
      Eric

  5. Florin

    Hi Eric,

I have a weird use case.
I have two inserts into the same partition of a Parquet table. When those inserts happen in the same second, one of them writes its output into an extra nested folder, not directly into the path.

    Example:
    correct path: /folder/finalFolder/
wrong (what actually happens with one of the queries): /folder/finalFolder/finalFolder

So, in other words, only when the queries insert in the same second, one of them creates an extra folder and puts the Parquet file there, instead of placing the file in the defined path.

    1. Eric Lin

      Hi Florin,

Thanks for visiting my blog. However, I am no longer at Cloudera and am not working with CDH/CDP anymore, so I can’t test this for you, and I do not have an answer, as I have not seen this before. It is best that you reach out to the CDH/CDP community for an answer.

      Apologies.

      Eric
