Timestamp stored in Parquet file format in Impala Showing GMT Value

This article explains why Impala and Hive return different timestamp values for the same table when that table was created and populated from Hive. It also outlines the steps to force Impala to apply local time zone conversion when reading timestamp fields stored in the Parquet file format.

When Hive stores a timestamp value into Parquet format, it converts local time into UTC time, and when it reads data out, it converts back to local time.

Impala, on the other hand, does no conversion when it reads the timestamp field, so it returns the UTC time instead of the local time.
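The difference can be illustrated with a small Python sketch. It only models the behavior described above and does not call Hive or Impala. The fixed UTC+10 offset, the sample timestamp, and the function names are assumptions made for illustration; the `convert_legacy` parameter stands in for the optional conversion discussed later in this article.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the local time zone is a fixed UTC+10 offset. Real time zones
# have daylight saving rules; a fixed offset keeps the sketch simple.
LOCAL_TZ = timezone(timedelta(hours=10))


def hive_write(local_wall_clock: datetime) -> datetime:
    """Model Hive's write path: local wall-clock time -> UTC value stored in Parquet."""
    aware = local_wall_clock.replace(tzinfo=LOCAL_TZ)
    return aware.astimezone(timezone.utc).replace(tzinfo=None)


def hive_read(stored: datetime) -> datetime:
    """Model Hive's read path: stored UTC value -> local wall-clock time."""
    aware = stored.replace(tzinfo=timezone.utc)
    return aware.astimezone(LOCAL_TZ).replace(tzinfo=None)


def impala_read(stored: datetime, convert_legacy: bool = False) -> datetime:
    """Model Impala's read path: the stored value is returned unchanged
    unless the (hypothetical) convert_legacy switch is turned on."""
    return hive_read(stored) if convert_legacy else stored


inserted = datetime(2015, 6, 1, 9, 30, 0)   # value inserted from Hive (local time)
stored = hive_write(inserted)

print("stored in Parquet:   ", stored)                       # 2015-05-31 23:30:00
print("Hive reads:          ", hive_read(stored))            # 2015-06-01 09:30:00
print("Impala reads:        ", impala_read(stored))          # 2015-05-31 23:30:00
print("Impala with convert: ", impala_read(stored, True))    # 2015-06-01 09:30:00
```

In this model Hive round-trips the value back to local time, while Impala shows the stored UTC value unless conversion is requested.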

Both behaviors are by design. More information can be found in the Impala documentation for the TIMESTAMP Data Type.

However, Impala can also be configured to apply the conversion to timestamp fields stored in the Parquet file format (only available in Cloudera Manager 5.4), which is also mentioned in the documentation referenced above. To do this, follow the steps below:

  1. Go to Impala Services home page
  2. Click on “Configuration”
  3. On the left side under “Filters”, click “Impala Daemon” under “Scope” and “Advanced” under “Category”
  4. Locate “Impala Daemon Command Line Argument Advanced Configuration Snippet (Safety Valve)”, and then enter the following:
     --convert_legacy_hive_parquet_utc_timestamps=true
  5. Save the changes
  6. Restart all Impala Daemons

To confirm that the change has taken effect, follow the steps below:

  1. Go to Impala Home page
  2. Click on the “Instances” tab
  3. Click on any “Impala Daemon” link (make sure you have restarted all of them)
  4. Under “Summary” > “Quick Links”, click on “Impala Daemon Web UI”
  5. A new page will open; click on the last tab at the top of the page, named “/varz”
  6. Search for “convert_legacy_hive_parquet_utc_timestamps” and confirm that it is set to “true”: --convert_legacy_hive_parquet_utc_timestamps=true
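The same check can be scripted rather than done by hand. The sketch below is a rough illustration, assuming the “/varz” page lists flags in the `--name=value` form shown in the last step and that the Impala Daemon Web UI is reachable over plain HTTP on its default port 25000 without authentication. The function names, the host name in the usage comment, and the sample text are hypothetical.

```python
import re
from typing import Optional
from urllib.request import urlopen


def fetch_varz(host: str, port: int = 25000, timeout: float = 10.0) -> str:
    """Download the /varz page from an Impala Daemon's web UI.

    Assumes plain HTTP and no authentication; adjust for secured clusters.
    """
    with urlopen(f"http://{host}:{port}/varz", timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")


def flag_value(page_text: str, flag: str) -> Optional[str]:
    """Return the value shown as --flag=value in the page text, or None if absent."""
    match = re.search(r"--" + re.escape(flag) + r"=([^\s<]*)", page_text)
    return match.group(1) if match else None


# Made-up sample of what the page text might contain, for illustration only.
sample = (
    "--catalog_service_port=26000\n"
    "--convert_legacy_hive_parquet_utc_timestamps=true\n"
)
print(flag_value(sample, "convert_legacy_hive_parquet_utc_timestamps"))  # true

# Against a live daemon (hypothetical host name):
# text = fetch_varz("impalad-host.example.com")
# print(flag_value(text, "convert_legacy_hive_parquet_utc_timestamps"))
```

Since every Impala Daemon has to be restarted for the change to apply, it may be worth running the check against each daemon host rather than just one.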


This enables Impala to perform the time zone conversion when reading timestamp fields from Parquet files.

Update:

Please be warned that enabling this option comes with a performance hit. Refer to the upstream Impala JIRA IMPALA-3316 for more details.

 
