Impala metadata not synced across all Impala Daemons when Load Balancer is enabled

Recently I have been dealing with quite a few customers who hit the same issue: Impala metadata is out of sync between Impala Daemons. The common cause was a Load Balancer set up in front of the Impala Daemons, combined with the way they run Impala queries. Here is what happened. When running Impala queries through different impala-shell sessions, metadata updates from the previous session are not visible in the next session. For example, if you run the following impala-shell commands in a shell script:
impala-shell -i load-balancer-host -q "CREATE TABLE test (a INT)"
impala-shell -i load-balancer-host -q "SELECT * FROM test"
It is likely that you will get a table not found error for “test”. This is caused by a delay in propagating the metadata update to the rest of the Impala daemons. For example, when the first command was run:
impala-shell -i load-balancer-host -q "CREATE TABLE test (a INT)"
impala-shell might be connected to impalahost1, and the metadata on impalahost1 is updated with the new table “test”. However, at this point only impalahost1 has this information; the rest of the Impala daemons have not received the update yet, because it is broadcast to them asynchronously through the statestore. And because this query is fast, the next impala-shell is triggered straight away:
impala-shell -i load-balancer-host -q "SELECT * FROM test"
This time the load balancer forwards the request to impalahost2. Since its metadata has not been updated yet, impalahost2 does not know that table “test” exists, so a table not found exception is raised. The solution is to force Impala to update the metadata on all Impala daemons before a query’s result is returned to the user. This is done through the SYNC_DDL query option. In the impala-shell session, run the following command:
SET SYNC_DDL=1;
This asks Impala to make sure the metadata update has reached all Impala daemon hosts before the query result is returned. So our script will look like this:
impala-shell -i load-balancer-host -q "SET SYNC_DDL=1; CREATE TABLE test (a INT)"
impala-shell -i load-balancer-host -q "SELECT * FROM test"
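If a script issues many such statements, the prefix can be wrapped in a small helper. This is only a sketch: `with_sync_ddl`, `run_ddl`, and the `IMPALA_HOST` variable are hypothetical names made up for illustration, and `load-balancer-host` is the same placeholder used above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: build a query string that enables SYNC_DDL
# in the same impala-shell session as the statement itself.
with_sync_ddl() {
  printf 'SET SYNC_DDL=1; %s' "$1"
}

# Hypothetical wrapper: send the prefixed statement through the load balancer.
# IMPALA_HOST is an assumed variable; it falls back to the placeholder host.
run_ddl() {
  impala-shell -i "${IMPALA_HOST:-load-balancer-host}" -q "$(with_sync_ddl "$1")"
}

# Usage (needs a reachable Impala cluster, so it is left commented out):
# run_ddl "CREATE TABLE test (a INT)"
# impala-shell -i load-balancer-host -q "SELECT * FROM test"
```

The SET and the DDL have to travel in the same `-q` string, because a query option set in one impala-shell invocation does not carry over to the next one.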
Because this option can introduce a delay after each write operation, if you are running a sequence of CREATE DATABASE, CREATE TABLE, ALTER TABLE, INSERT, and similar statements within a setup script, you can minimize the overall delay by enabling the SYNC_DDL query option only near the end, before the final DDL statement. More information about the SYNC_DDL query option can be found on the “SYNC_DDL Query Option” page of our documentation website.
Another option is to run all the queries in the same impala-shell session, like the example below:
impala-shell -i load-balancer-host -q "CREATE TABLE test (a INT); SELECT * FROM test;"
In this case both queries run on the same daemon (impalahost1, for example), so we won’t hit the metadata sync delay. Hope this helps.
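As a postscript, the tip about enabling SYNC_DDL only before the final DDL statement of a setup script could look roughly like the sketch below. The database and table names (`demo_db`, `test`), the `build_setup_sql` function, and the `IMPALA_HOST` variable are all made up for illustration; adapt them to your own script.

```shell
#!/usr/bin/env bash
# Sketch: run a batch of DDL in one session and turn on SYNC_DDL
# only before the last DDL statement, so only that statement waits
# for the metadata to reach every Impala daemon.

# Hypothetical setup statements; demo_db and test are example names.
build_setup_sql() {
  cat <<'SQL'
CREATE DATABASE demo_db;
CREATE TABLE demo_db.test (a INT);
SET SYNC_DDL=1;
ALTER TABLE demo_db.test ADD COLUMNS (b STRING);
SQL
}

# Only contact the cluster when IMPALA_HOST (an assumed variable) is set;
# otherwise just print the statements so the script can be dry-run.
if [ -n "${IMPALA_HOST:-}" ]; then
  impala-shell -i "$IMPALA_HOST" -q "$(build_setup_sql)"
else
  build_setup_sql
fi
```

Once this returns, a later impala-shell session that lands on a different daemon should already see the new database and table.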


2 Comments

  1. Kartik

    Thanks Eric. I was facing the same situation when using the following sequence of commands:
    1) Recover partition
    2) Refresh table
    3) Show files in partition
    I was getting not found errors in the 3rd step. I looked in several places, and yours is the only article I found on this. I will implement SYNC_DDL in my code.

