- A Hive table (from_flume) whose data was ingested by Flume
- Create another table with the same column structure (from_hive)
- Insert data into the new table by selecting from the old table; this worked, and SELECT COUNT(*) returned the correct result:
INSERT INTO from_hive SELECT * FROM from_flume;
- At this stage, SELECT queries work on both the old and the new table
- Copy the data files generated by Flume into the new table's location, so that the Flume-written files sit alongside the Hive-written files under the same table directory (a sketch of these steps follows the stack trace below)
- Running SELECT * against the table now fails with the following error:
Error: java.io.IOException: java.io.IOException: java.lang.IndexOutOfBoundsException
    at org.apache.hadoop.hive.io.HiveIOExceptionHandlerChain.handleRecordReaderNextException(HiveIOExceptionHandlerChain.java:121)
    at org.apache.hadoop.hive.io.HiveIOExceptionHandlerUtil.handleRecordReaderNextException(HiveIOExceptionHandlerUtil.java:77)
    at org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileRecordReader.doNextWithExceptionHandler(HadoopShimsSecure.java:226)
    at org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileRecordReader.next(HadoopShimsSecure.java:136)
    at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:199)
    at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:185)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:52)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:453)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Caused by: java.io.IOException: java.lang.IndexOutOfBoundsException
    at org.apache.hadoop.hive.io.HiveIOExceptionHandlerChain.handleRecordReaderNextException(HiveIOExceptionHandlerChain.java:121)
    at org.apache.hadoop.hive.io.HiveIOExceptionHandlerUtil.handleRecordReaderNextException(HiveIOExceptionHandlerUtil.java:77)
    at org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.doNext(HiveContextAwareRecordReader.java:355)
    at org.apache.hadoop.hive.ql.io.CombineHiveRecordReader.doNext(CombineHiveRecordReader.java:105)
    at org.apache.hadoop.hive.ql.io.CombineHiveRecordReader.doNext(CombineHiveRecordReader.java:41)
    at org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.next(HiveContextAwareRecordReader.java:116)
    at org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileRecordReader.doNextWithExceptionHandler(HadoopShimsSecure.java:224)
    ... 11 more
Caused by: java.lang.IndexOutOfBoundsException
    at java.io.DataInputStream.readFully(DataInputStream.java:192)
    at org.apache.hadoop.io.Text.readWithKnownLength(Text.java:319)
    at org.apache.hadoop.io.Text.readFields(Text.java:291)
    at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:71)
    at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:42)
    at org.apache.hadoop.io.SequenceFile$Reader.deserializeValue(SequenceFile.java:2245)
    at org.apache.hadoop.io.SequenceFile$Reader.getCurrentValue(SequenceFile.java:2229)
    at org.apache.hadoop.mapred.SequenceFileRecordReader.getCurrentValue(SequenceFileRecordReader.java:109)
    at org.apache.hadoop.mapred.SequenceFileRecordReader.next(SequenceFileRecordReader.java:84)
    at org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.doNext(HiveContextAwareRecordReader.java:350)
    ... 15 more
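For reference, here is a minimal sketch of the steps above; the single-column schema, the paths, and the compression settings are illustrative assumptions, not the original setup:

-- Original table, populated by a Flume HDFS sink that writes
-- Snappy-compressed SequenceFiles into the table's directory
CREATE TABLE from_flume (line STRING) STORED AS SEQUENCEFILE;

-- New table with the same column structure
CREATE TABLE from_hive (line STRING) STORED AS SEQUENCEFILE;

-- Make the INSERT below also write Snappy-compressed SequenceFiles
SET hive.exec.compress.output=true;
SET mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;

INSERT INTO from_hive SELECT * FROM from_flume;

Then, from a shell, copy the Flume-written files next to the Hive-written ones:

hdfs dfs -cp /user/flume/landing/* /user/hive/warehouse/from_hive/

At this point SELECT * FROM from_hive fails with the error above. Dumping the first bytes of one file from each source (for example with hdfs dfs -cat <file> | head -c 150) shows two different SequenceFile headers: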
Flume source:
SEQ!org.apache.hadoop.io.LongWritable"org.apache.hadoop.io.BytesWritable)org.apache.hadoop.io.compress.SnappyCodec

Select source:
SEQ"org.apache.hadoop.io.BytesWritableorg.apache.hadoop.io.Text)org.apache.hadoop.io.compress.SnappyCodec

The Flume-written files use LongWritable keys with BytesWritable values, while the files written by the INSERT ... SELECT use BytesWritable keys with Text values (the stray !, " and ) characters are binary length prefixes from the SequenceFile header; some prefixes are non-printable and do not show at all). So when Hive uses the default CombineHiveInputFormat class to read the Snappy files, one mapper reads multiple files, and because files with different structures end up in the same mapper, Hive cannot read them together properly. The solution is to set hive.input.format to org.apache.hadoop.hive.ql.io.HiveInputFormat, which prevents Hive from using the CombineHiveInputFormat class to combine multiple Snappy files when reading. This ensures that each mapper reads only one file; the side effect is that more mappers are used, or files are processed sequentially:
SET hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;
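With that setting in place for the session, each mapper reads a single file, so the mixed headers no longer end up in one split; a quick check might look like this:

SET hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;
-- the query that previously failed should now succeed
SELECT * FROM from_hive LIMIT 10;

If mixed-header files are expected regularly, hive.input.format can also be set in hive-site.xml, at the cost of giving up split combining for all queries.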