Column Stats Shows Incorrect Stats Information in Impala

Column Stats Shows Incorrect Stats Information in Impala

Another bug identified today in Impala while helping customers solving a weird Impala issue. The problem is that “SHOW COLUMN STATS” command in Impala shows incorrect stats information, either shows “-1” for distinct values or the number is not matching with real distinct values:
query: show column stats test
+--------+------------+------------------+--------+----------+----------+
| Column | Type       | #Distinct Values | #Nulls | Max Size | Avg Size |
+--------+------------+------------------+--------+----------+----------+
| a      | INT        | 3                | -1     | 4        | 4        |
| b      | INT        | 3                | -1     | 4        | 4        |
| c      | TIMESTAMP  | -1               | -1     | 4        | 4        |
+--------+------------+------------------+--------+----------+----------+
Identified this is a bug in Hive HMS API when renaming tables, and old table name’s stats becomes stale. This bug exists in CDH5.3.x. In CDH5.4.x, column stats got deleted after a table is renamed. So, to avoid facing this issue: 1) Avoid renaming tables 2) If renaming tables can’t be avoided, manually remove column stats from table “TAB_COL_STATS” in the HMS DB after table is renamed
DELETE FROM "TAB_COL_STATS" WHERE "TABLE_NAME" = '' AND "TBL_ID" = ;
And then re-compute the stats for the table. You can find out the “TBL_ID” from “TBLS” table in the HMS DB. Please refer to the follow upstream JIRA issues: HIVE-9720: Metastore does not properly migrate column stats when renaming a table across databases. HIVE-9866: Changing a column’s type doesn’t change column stats type in metastore Hope this helps.

Leave a Reply

Your email address will not be published.

My new Snowflake Blog is now live. I will not be updating this blog anymore but will continue with new contents in the Snowflake world!