Unable to read Parquet files with same schema and different flags in Pig

Today, I found a bug in Pig that prevents it from reading a table made up of multiple Parquet files that share the same columns but carry different repetition flags (required vs. optional). See the example below, showing the schemas of two files in the same table:
message example {
    required binary file_name (UTF8);
    required binary date_time (UTF8);
    required binary tail (UTF8);
    required binary event (UTF8);
    required binary value (UTF8);
    optional int32 record_number;
    required binary src (UTF8);
}

message schema {
    optional binary file_name;
    optional binary date_time;
    optional binary tail;
    optional binary event;
    optional binary value;
    optional int32 record_number;
    optional binary src;
} 
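By the way, one way to confirm the mismatch is to print the footer schema of each file with parquet-tools (assuming the parquet-tools utility shipped with CDH is on your path; the file paths below are just placeholders for the files under your table's directory):

parquet-tools schema /path/to/table/part-m-00000.parquet
parquet-tools schema /path/to/table/part-m-00001.parquet

One file will show the required ... (UTF8) columns and the other the optional ones, matching the two schemas above.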
If you try to load this table in Pig, the following error is returned:
[main] ERROR org.apache.pig.PigServer - exception during parsing: Error during parsing. 
repetition constraint is more restrictive: can not merge type required binary file_name (UTF8) into optional binary file_name
Failed to parse: repetition constraint is more restrictive: can not merge type required binary file_name (UTF8) 
into optional binary file_name
at org.apache.pig.parser.QueryParserDriver.parse(QueryParserDriver.java:198)
at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1676)
at org.apache.pig.PigServer$Graph.access$000(PigServer.java:1409)
at org.apache.pig.PigServer.parseAndBuild(PigServer.java:342)
at org.apache.pig.PigServer.executeBatch(PigServer.java:367)
at org.apache.pig.PigServer.executeBatch(PigServer.java:353)
at org.apache.pig.tools.grunt.GruntParser.executeBatch(GruntParser.java:140)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:202)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:173)
at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:84)
at org.apache.pig.Main.run(Main.java:478)
at org.apache.pig.PigRunner.run(PigRunner.java:49)
at org.apache.oozie.action.hadoop.PigMain.runPigJob(PigMain.java:286)
at org.apache.oozie.action.hadoop.PigMain.run(PigMain.java:226)
at org.apache.oozie.action.hadoop.LauncherMain.run(LauncherMain.java:39)
at org.apache.oozie.action.hadoop.PigMain.main(PigMain.java:74)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.oozie.action.hadoop.LauncherMapper.map(LauncherMapper.java:227)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:450)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1642)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)
At the time of writing, this issue affects both CDH5.3.x and CDH5.4.x. It is reported in PARQUET-138 but has not been fixed yet. However, another issue, PARQUET-139, which has been fixed since CDH5.4.0, gives us a workaround. To apply it, upgrade CDH to 5.4.x and then update the Pig script from:
data = LOAD '$path_to_source'
USING parquet.pig.ParquetLoader as(
    file_name:bytearray,
    date_time:bytearray,
    tail:bytearray,
    event:bytearray,
    value:bytearray,
    record_number:int,
    src:bytearray
);
to:
data = LOAD '$path_to_source'
USING parquet.pig.ParquetLoader(
    'file_name:bytearray,date_time:bytearray,tail:bytearray,event:bytearray,value:bytearray,record_number:int,src:bytearray'
);
So instead of declaring each column in the AS() clause, we pass the whole schema as a single string to ParquetLoader’s constructor. After this change, the problem should be fixed. Hope this helps.
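As a quick sanity check under CDH5.4.x, a sketch like the following (using the same $path_to_source placeholder as above) should now load and describe the data without hitting the repetition constraint error:

data = LOAD '$path_to_source'
USING parquet.pig.ParquetLoader(
    'file_name:bytearray,date_time:bytearray,tail:bytearray,event:bytearray,value:bytearray,record_number:int,src:bytearray'
);
DESCRIBE data;           -- should print the declared schema
sample = LIMIT data 10;  -- pull a few rows from the mixed Parquet files
DUMP sample;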
