Query to a Hive RCFile table fails with error “LazySimpleSerDe: expects either BytesWritable or Text object”

Steps to reproduce the issue: 1) Create a Hive table using the following commands:
DROP TABLE IF EXISTS test;

CREATE TABLE test (a int) PARTITIONED BY (p int)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' 
STORED AS RCFILE;
2) Load data into the table:
INSERT INTO TABLE test PARTITION (p=1) VALUES (1), (2);
3) Get the output of the table definition:
SHOW CREATE TABLE test;
and we get the following output:
CREATE TABLE `test`(
  `a` int)
PARTITIONED BY (
  `p` int)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '|'
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.RCFileInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.RCFileOutputFormat';
4) Use the output from step 3) to create a new table:
CREATE TABLE `test_copy`(
  `a` int)
PARTITIONED BY (
  `p` int)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '|'
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.RCFileInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.RCFileOutputFormat';
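At this point you can already spot the problem by comparing the two tables’ storage information. In the “SerDe Library” line of the output, “test” should show org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe, while “test_copy” shows org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, which is the root cause explained later in this post:
DESCRIBE FORMATTED test;
DESCRIBE FORMATTED test_copy;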
5) Manually copy the data from table “test” to “test_copy” using HDFS commands, preserving the corresponding partition directories (a sketch of this step is shown after step 6 below).
6) Recover the partition information:
MSCK REPAIR TABLE test_copy;
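For reference, the manual copy in step 5 might look like the following. This is only a sketch: it assumes the default database, the default Hive warehouse location under /user/hive/warehouse, and two partitions p=1 and p=2, so adjust the paths to your environment. For a real PROD-to-STAGING copy across clusters you would typically use DistCp instead of a plain cp:
hdfs dfs -mkdir -p /user/hive/warehouse/test_copy
hdfs dfs -cp /user/hive/warehouse/test/p=1 /user/hive/warehouse/test_copy/
hdfs dfs -cp /user/hive/warehouse/test/p=2 /user/hive/warehouse/test_copy/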
7) Query the new table using a SELECT statement:
SELECT * FROM test_copy;
and we get the following error:
java.io.IOException: org.apache.hadoop.hive.serde2.SerDeException: class org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe: expects either BytesWritable or Text object
with the following stack trace in the HS2 log:
Caused by: java.io.IOException: org.apache.hadoop.hive.serde2.SerDeException: 
class org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe: expects BytesRefArrayWritable!
        at org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:507)
        at org.apache.hadoop.hive.ql.exec.FetchOperator.pushRow(FetchOperator.java:414)
        at org.apache.hadoop.hive.ql.exec.FetchTask.fetch(FetchTask.java:138)
        at org.apache.hadoop.hive.ql.Driver.getResults(Driver.java:1655)
        at org.apache.hive.service.cli.operation.SQLOperation.getNextRowSet(SQLOperation.java:324)
        ... 13 more
Caused by: org.apache.hadoop.hive.serde2.SerDeException: class org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe: expects BytesRefArrayWritable!
        at org.apache.hadoop.hive.serde2.columnar.ColumnarSerDeBase.deserialize(ColumnarSerDeBase.java:47)
        at org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:488)
        ... 17 more
8) We then tried to change the table’s SerDe Lib:
ALTER TABLE test_copy SET SERDE 'org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe';
but the issue still remained.

The typical use case for this scenario is copying a Hive table from a PROD environment to a STAGING environment: you create the table on the destination cluster using the output of the “SHOW CREATE TABLE” statement, then use DistCp to copy the underlying HDFS files from one cluster to the other.

I have reported this issue upstream as HIVE-12506, and it is now resolved and committed into Hive 1.3, 2.0 and 2.1. Basically, what happened was that in step 3) above, the “SHOW CREATE TABLE” command generated an incorrect statement that creates an RCFile table using “LazySimpleSerDe”, which is designed for the TextFile format only. For an RCFile table to work, we need to use “ColumnarSerDe” instead. In step 8) we tried to change the table’s SerDe to “ColumnarSerDe”; however, this does not change the SerDe already recorded against the existing partitions.

To work around this issue on earlier versions of Hive, we have the following options:

1. If there are only a limited number of partitions, we can manually fix the SerDe information for each partition, as below (a quick way to verify the result follows these statements):
ALTER TABLE test_copy PARTITION (p=1) SET SERDE 'org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe';
ALTER TABLE test_copy PARTITION (p=2) SET SERDE 'org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe';
ALTER TABLE test_copy PARTITION (p=3) SET SERDE 'org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe';
...
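You can confirm that a partition now points at the correct SerDe by checking its storage information; the “SerDe Library” line in the output should show ColumnarSerDe rather than LazySimpleSerDe:
DESCRIBE FORMATTED test_copy PARTITION (p=1);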
2. If there are a lot of partitions and fixing them one by one is impractical, follow the steps below: a) If the table is not an external table, temporarily make it external:
ALTER TABLE test_copy SET TBLPROPERTIES('EXTERNAL'='TRUE');
This allows us to drop the table without deleting the underlying HDFS data. b) Drop the table:
DROP TABLE test_copy;
c) Re-create the table:
CREATE TABLE `test_copy`(
  `a` int)
PARTITIONED BY (
  `p` int)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '|'
STORED AS RCFILE;
Notice that we used a different format:
STORED AS RCFILE
instead of
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.RCFileInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.RCFileOutputFormat'
This is because “STORED AS RCFILE” uses “ColumnarSerDe” by default, while the latter form falls back to the wrong “LazySimpleSerDe”. d) Recover the partitions by running:
MSCK REPAIR TABLE test_copy;
e) Revert the external flag if you changed it in step a):
ALTER TABLE test_copy SET TBLPROPERTIES('EXTERNAL'='FALSE');
Now you should be able to SELECT data from the table again.
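As a side note, if you prefer to keep the explicit INPUTFORMAT/OUTPUTFORMAT form (for example, in generated DDL scripts), adding a ROW FORMAT SERDE clause that names ColumnarSerDe should be equivalent to “STORED AS RCFILE”. Here is a sketch; the FIELDS TERMINATED BY clause is dropped because the delimiter does not affect RCFile’s on-disk format:
CREATE TABLE `test_copy`(
  `a` int)
PARTITIONED BY (
  `p` int)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.RCFileInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.RCFileOutputFormat';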
