Unable to query Hive parquet table after altering column type

Currently, Hive does not support changing column types for Parquet tables; a proposed fix was rejected upstream over performance concerns (see the JIRAs discussed below). I have developed the following test case to demonstrate the bug:
DROP TABLE IF EXISTS test;
CREATE TABLE test (a INT, b DOUBLE) STORED AS PARQUET;
INSERT OVERWRITE TABLE test VALUES (1000, 1000);
SELECT * FROM test;
ALTER TABLE test CHANGE a a DOUBLE;
SELECT * FROM test;
The following is the output:
0: jdbc:hive2://10.17.80.41:10000/default> drop table if exists test;
No rows affected (0.408 seconds)
0: jdbc:hive2://10.17.80.41:10000/default> create table test (a int, b double) stored as parquet;
No rows affected (0.28 seconds)
0: jdbc:hive2://10.17.80.41:10000/default> insert overwrite table test values (1000, 1000);
INFO  : Number of reduce tasks is set to 0 since there's no reduce operator
WARN  : Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
INFO  : number of splits:1
INFO  : Submitting tokens for job: job_1444272676566_0015
INFO  : Kind: HDFS_DELEGATION_TOKEN, Service: 10.17.80.40:8020, Ident: (HDFS_DELEGATION_TOKEN token 78 for hive)
INFO  : The url to track the job: http://host-10-17-80-40.coe.cloudera.com:8088/proxy/application_1444272676566_0015/
INFO  : Starting Job = job_1444272676566_0015, Tracking URL = http://host-10-17-80-40.coe.cloudera.com:8088/proxy/application_1444272676566_0015/
INFO  : Kill Command = /opt/cloudera/parcels/CDH-5.4.7-1.cdh5.4.7.p0.3/lib/hadoop/bin/hadoop job  -kill job_1444272676566_0015
INFO  : Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
INFO  : 2015-10-08 04:14:07,188 Stage-1 map = 0%,  reduce = 0%
INFO  : 2015-10-08 04:14:17,845 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 3.49 sec
INFO  : MapReduce Total cumulative CPU time: 3 seconds 490 msec
INFO  : Ended Job = job_1444272676566_0015
INFO  : Stage-4 is selected by condition resolver.
INFO  : Stage-3 is filtered out by condition resolver.
INFO  : Stage-5 is filtered out by condition resolver.
INFO  : Moving data to: hdfs://host-10-17-80-40.coe.cloudera.com:8020/user/hive/warehouse/test/.hive-staging_hive_2015-10-08_04-13-54_259_91084841227249311-1/-ext-10000 from hdfs://host-10-17-80-40.coe.cloudera.com:8020/user/hive/warehouse/test/.hive-staging_hive_2015-10-08_04-13-54_259_91084841227249311-1/-ext-10002
INFO  : Loading data to table default.test from hdfs://host-10-17-80-40.coe.cloudera.com:8020/user/hive/warehouse/test/.hive-staging_hive_2015-10-08_04-13-54_259_91084841227249311-1/-ext-10000
INFO  : Table default.test stats: [numFiles=1, numRows=1, totalSize=272, rawDataSize=2]
No rows affected (25.223 seconds)
0: jdbc:hive2://10.17.80.41:10000/default> select * from test;
INFO  : Number of reduce tasks is set to 0 since there's no reduce operator
WARN  : Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
INFO  : number of splits:1
INFO  : Submitting tokens for job: job_1444272676566_0016
INFO  : Kind: HDFS_DELEGATION_TOKEN, Service: 10.17.80.40:8020, Ident: (HDFS_DELEGATION_TOKEN token 79 for hive)
INFO  : The url to track the job: http://host-10-17-80-40.coe.cloudera.com:8088/proxy/application_1444272676566_0016/
INFO  : Starting Job = job_1444272676566_0016, Tracking URL = http://host-10-17-80-40.coe.cloudera.com:8088/proxy/application_1444272676566_0016/
INFO  : Kill Command = /opt/cloudera/parcels/CDH-5.4.7-1.cdh5.4.7.p0.3/lib/hadoop/bin/hadoop job  -kill job_1444272676566_0016
INFO  : Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
INFO  : 2015-10-08 04:14:36,141 Stage-1 map = 0%,  reduce = 0%
INFO  : 2015-10-08 04:14:46,540 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 2.82 sec
INFO  : MapReduce Total cumulative CPU time: 2 seconds 820 msec
INFO  : Ended Job = job_1444272676566_0016
+-------+---------+--+
|   a   |    b    |
+-------+---------+--+
| 1000  | 1000.0  |
+-------+---------+--+
1 row selected (28.204 seconds)
0: jdbc:hive2://10.17.80.41:10000/default> alter table test change a a double;
No rows affected (0.378 seconds)
0: jdbc:hive2://10.17.80.41:10000/default> select * from test;
INFO  : Number of reduce tasks is set to 0 since there's no reduce operator
WARN  : Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
INFO  : number of splits:1
INFO  : Submitting tokens for job: job_1444272676566_0017
INFO  : Kind: HDFS_DELEGATION_TOKEN, Service: 10.17.80.40:8020, Ident: (HDFS_DELEGATION_TOKEN token 81 for hive)
INFO  : The url to track the job: http://host-10-17-80-40.coe.cloudera.com:8088/proxy/application_1444272676566_0017/
INFO  : Starting Job = job_1444272676566_0017, Tracking URL = http://host-10-17-80-40.coe.cloudera.com:8088/proxy/application_1444272676566_0017/
INFO  : Kill Command = /opt/cloudera/parcels/CDH-5.4.7-1.cdh5.4.7.p0.3/lib/hadoop/bin/hadoop job  -kill job_1444272676566_0017
INFO  : Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
INFO  : 2015-10-08 04:15:04,794 Stage-1 map = 0%,  reduce = 0%
INFO  : 2015-10-08 04:15:39,006 Stage-1 map = 100%,  reduce = 0%
ERROR : Ended Job = job_1444272676566_0017 with errors
Error: Error while processing statement: FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask (state=08S01,code=2)
0: jdbc:hive2://10.17.80.41:10000/default>
The map task fails because the Parquet data files still store column a as INT while the table metadata now declares it as DOUBLE, and Hive's Parquet reader does not convert between the two. After some research I found two JIRAs for this issue: HIVE-6784 and HIVE-12080. The original issue, HIVE-6784, was created a while back; a patch was submitted, but the upstream developers rejected it over concerns about its performance impact, so the issue was never fixed. A newer JIRA, HIVE-12080, was created recently. It is essentially the same as HIVE-6784, but requests a fix that does not introduce any performance hit. So, at the time of writing, there is no fix for this problem yet. The workaround is to re-generate the table by running "INSERT OVERWRITE {table_name} SELECT * FROM {table_name}" so that all data is rewritten with the new column type, as sketched below.
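As an illustration, here is the workaround applied to the test table from the case above (a minimal sketch following the workaround as described; the table name test comes from the test case, so substitute your own):
-- Re-generate the table so the data files are rewritten with the new column type
INSERT OVERWRITE TABLE test SELECT * FROM test;
-- The query should now succeed
SELECT * FROM test;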

