hadoop jar /opt/cloudera/parcels/GPLEXTRAS-5.7.0-1.cdh5.7.0.p0.40/lib/hadoop/lib/hadoop-lzo.jar com.hadoop.compression.lzo.DistributedLzoIndexer /tmp/lzo_test

However, it failed with the following error:
16/09/10 03:05:51 INFO mapreduce.Job: Task Id : attempt_1473404927068_0005_m_000000_0, Status : FAILED
Error: java.lang.NullPointerException
    at com.hadoop.mapreduce.LzoSplitRecordReader.initialize(LzoSplitRecordReader.java:50)
    at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.initialize(MapTask.java:548)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:786)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1693)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)

After some research, it turned out that I did not have the LZO codec enabled in my cluster. I did the following to resolve the issue:

– Add the gplextras parcel to Cloudera Manager’s parcel configuration: https://archive.cloudera.com/gplextras5/parcels/x.x.x where x.x.x should match your CDH version (5.7.0 in my case)

com.hadoop.compression.lzo.LzopCodec
com.hadoop.compression.lzo.LzoCodec

sudo yum install lzop

After the above changes, the index job should finish without issues:
hadoop jar /opt/cloudera/parcels/GPLEXTRAS-5.7.0-1.cdh5.7.0.p0.40/lib/hadoop/lib/hadoop-lzo.jar com.hadoop.compression.lzo.DistributedLzoIndexer /tmp/lzo_test
16/09/10 03:22:10 INFO lzo.GPLNativeCodeLoader: Loaded native gpl library
16/09/10 03:22:10 INFO lzo.LzoCodec: Successfully loaded & initialized native-lzo library [hadoop-lzo rev 820f83d05b8d916e89dbb72d6ef129113b277303]
16/09/10 03:22:12 INFO lzo.DistributedLzoIndexer: Adding LZO file hdfs://host-10-17-101-195.coe.cloudera.com:8020/tmp/lzo_test/test.txt.lzo to indexing list (no index currently exists)
16/09/10 03:22:12 INFO Configuration.deprecation: mapred.map.tasks.speculative.execution is deprecated. Instead, use mapreduce.map.speculative
16/09/10 03:22:12 INFO client.RMProxy: Connecting to ResourceManager at host-10-17-101-195.coe.cloudera.com/10.17.101.195:8032
16/09/10 03:22:12 INFO hdfs.DFSClient: Created HDFS_DELEGATION_TOKEN token 119 for hive on 10.17.101.195:8020
16/09/10 03:22:12 INFO security.TokenCache: Got dt for hdfs://host-10-17-101-195.coe.cloudera.com:8020; Kind: HDFS_DELEGATION_TOKEN, Service: 10.17.101.195:8020, Ident: (HDFS_DELEGATION_TOKEN token 119 for hive)
16/09/10 03:22:12 WARN mapreduce.JobResourceUploader: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
16/09/10 03:22:12 INFO input.FileInputFormat: Total input paths to process : 1
16/09/10 03:22:13 INFO mapreduce.JobSubmitter: number of splits:1
16/09/10 03:22:13 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1473502403576_0002
16/09/10 03:22:13 INFO mapreduce.JobSubmitter: Kind: HDFS_DELEGATION_TOKEN, Service: 10.17.101.195:8020, Ident: (HDFS_DELEGATION_TOKEN token 119 for hive)
16/09/10 03:22:13 INFO impl.YarnClientImpl: Submitted application application_1473502403576_0002
16/09/10 03:22:13 INFO mapreduce.Job: The url to track the job: http://host-10-17-101-195.coe.cloudera.com:8088/proxy/application_1473502403576_0002/
16/09/10 03:22:13 INFO mapreduce.Job: Running job: job_1473502403576_0002
16/09/10 03:22:22 INFO mapreduce.Job: Job job_1473502403576_0002 running in uber mode : false
16/09/10 03:22:22 INFO mapreduce.Job: map 0% reduce 0%
16/09/10 03:22:35 INFO mapreduce.Job: map 22% reduce 0%
16/09/10 03:22:38 INFO mapreduce.Job: map 36% reduce 0%
16/09/10 03:22:41 INFO mapreduce.Job: map 50% reduce 0%
16/09/10 03:22:44 INFO mapreduce.Job: map 62% reduce 0%
16/09/10 03:22:47 INFO mapreduce.Job: map 76% reduce 0%
16/09/10 03:22:50 INFO mapreduce.Job: map 90% reduce 0%
16/09/10 03:22:53 INFO mapreduce.Job: map 100% reduce 0%
16/09/10 03:22:53 INFO mapreduce.Job: Job job_1473502403576_0002 completed successfully
16/09/10 03:22:53 INFO mapreduce.Job: Counters: 30
    File System Counters
        FILE: Number of bytes read=0
        FILE: Number of bytes written=118080
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=85196
        HDFS: Number of bytes written=85008
        HDFS: Number of read operations=2
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=4
    Job Counters
        Launched map tasks=1
        Data-local map tasks=1
        Total time spent by all maps in occupied slots (ms)=28846
        Total time spent by all reduces in occupied slots (ms)=0
        Total time spent by all map tasks (ms)=28846
        Total vcore-seconds taken by all map tasks=28846
        Total megabyte-seconds taken by all map tasks=29538304
    Map-Reduce Framework
        Map input records=10626
        Map output records=10626
        Input split bytes=138
        Spilled Records=0
        Failed Shuffles=0
        Merged Map outputs=0
        GC time elapsed (ms)=32
        CPU time spent (ms)=3960
        Physical memory (bytes) snapshot=233476096
        Virtual memory (bytes) snapshot=1564155904
        Total committed heap usage (bytes)=176160768
    File Input Format Counters
        Bytes Read=85058
    File Output Format Counters
        Bytes Written=0
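
If the NullPointerException comes back, a quick sanity check on a gateway host may save a failed job. The sketch below is not part of the original fix. It assumes the two LZO codec classes are expected to appear in the standard Hadoop property io.compression.codecs, and that the client configuration on the host reflects what Cloudera Manager deployed. The function name has_lzo_codecs and the variable LZO_DIR are hypothetical names made up for this example.

```shell
#!/usr/bin/env bash
# Sketch only: has_lzo_codecs and LZO_DIR are hypothetical names for this example.

LZO_DIR="/tmp/lzo_test"   # directory indexed earlier in this post

# Return 0 only if the comma-separated codec list names both LZO codec classes.
has_lzo_codecs() {
  # Strip whitespace and wrap in commas so each class can be matched as a whole entry.
  _codec_list=",$(printf '%s' "$1" | tr -d '[:space:]'),"
  case "$_codec_list" in
    *,com.hadoop.compression.lzo.LzoCodec,*) ;;
    *) return 1 ;;
  esac
  case "$_codec_list" in
    *,com.hadoop.compression.lzo.LzopCodec,*) ;;
    *) return 1 ;;
  esac
  return 0
}

# Only query the cluster when the Hadoop client is installed on this host.
if command -v hdfs >/dev/null 2>&1; then
  if has_lzo_codecs "$(hdfs getconf -confKey io.compression.codecs 2>/dev/null)"; then
    echo "LZO codecs are listed in io.compression.codecs"
  else
    echo "LZO codecs are missing from io.compression.codecs"
  fi
  # After a successful index job, an .index file should sit next to each .lzo file.
  hadoop fs -ls "$LZO_DIR"
fi
```

On a host without the Hadoop client the script does nothing beyond defining the function, so it is safe to source elsewhere. If the check reports the codecs as missing even though they were added in Cloudera Manager, the client configuration on that host may simply not have been redeployed yet.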