How to Index LZO files in Hadoop

Today I was trying to index an LZO file in HDFS using the following hadoop command:
hadoop jar /opt/cloudera/parcels/GPLEXTRAS-5.7.0-1.cdh5.7.0.p0.40/lib/hadoop/lib/hadoop-lzo.jar com.hadoop.compression.lzo.DistributedLzoIndexer /tmp/lzo_test
However, the job failed with the following error:
16/09/10 03:05:51 INFO mapreduce.Job: Task Id : attempt_1473404927068_0005_m_000000_0, Status : FAILED
Error: java.lang.NullPointerException
       	at com.hadoop.mapreduce.LzoSplitRecordReader.initialize(
       	at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.initialize(
       	at org.apache.hadoop.mapred.MapTask.runNewMapper(
       	at org.apache.hadoop.mapred.YarnChild$
       	at Method)
       	at org.apache.hadoop.mapred.YarnChild.main(
After some research, it turned out that the LZO codec was not enabled in my cluster. I did the following to resolve the issue:

– Add the gplextras parcel to Cloudera Manager's parcel configuration, where x.x.x should match your CDH version (5.7.0 in my case)
– Install the GPL Extras parcel through Cloudera Manager as normal
– Add the LZO codec classes to Compression Codecs (io.compression.codecs) in Cloudera Manager > HDFS > Configuration
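The classes to append to io.compression.codecs are the two shipped with hadoop-lzo (the exact existing value on your cluster may differ, so append rather than overwrite):

```
com.hadoop.compression.lzo.LzoCodec
com.hadoop.compression.lzo.LzopCodec
```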
– Restart the cluster
– Deploy the client configuration
– Install the native-lzo library on all hosts in the cluster:
sudo yum install lzop
After the above changes, the indexing job finished without issues:
hadoop jar /opt/cloudera/parcels/GPLEXTRAS-5.7.0-1.cdh5.7.0.p0.40/lib/hadoop/lib/hadoop-lzo.jar com.hadoop.compression.lzo.DistributedLzoIndexer /tmp/lzo_test
16/09/10 03:22:10 INFO lzo.GPLNativeCodeLoader: Loaded native gpl library
16/09/10 03:22:10 INFO lzo.LzoCodec: Successfully loaded & initialized native-lzo library [hadoop-lzo rev 820f83d05b8d916e89dbb72d6ef129113b277303]
16/09/10 03:22:12 INFO lzo.DistributedLzoIndexer: Adding LZO file hdfs:// to indexing list (no index currently exists)
16/09/10 03:22:12 INFO Configuration.deprecation: is deprecated. Instead, use
16/09/10 03:22:12 INFO client.RMProxy: Connecting to ResourceManager at
16/09/10 03:22:12 INFO hdfs.DFSClient: Created HDFS_DELEGATION_TOKEN token 119 for hive on
16/09/10 03:22:12 INFO security.TokenCache: Got dt for hdfs://; Kind: HDFS_DELEGATION_TOKEN, Service:, Ident: (HDFS_DELEGATION_TOKEN token 119 for hive)
16/09/10 03:22:12 WARN mapreduce.JobResourceUploader: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
16/09/10 03:22:12 INFO input.FileInputFormat: Total input paths to process : 1
16/09/10 03:22:13 INFO mapreduce.JobSubmitter: number of splits:1
16/09/10 03:22:13 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1473502403576_0002
16/09/10 03:22:13 INFO mapreduce.JobSubmitter: Kind: HDFS_DELEGATION_TOKEN, Service:, Ident: (HDFS_DELEGATION_TOKEN token 119 for hive)
16/09/10 03:22:13 INFO impl.YarnClientImpl: Submitted application application_1473502403576_0002
16/09/10 03:22:13 INFO mapreduce.Job: The url to track the job:
16/09/10 03:22:13 INFO mapreduce.Job: Running job: job_1473502403576_0002
16/09/10 03:22:22 INFO mapreduce.Job: Job job_1473502403576_0002 running in uber mode : false
16/09/10 03:22:22 INFO mapreduce.Job:  map 0% reduce 0%
16/09/10 03:22:35 INFO mapreduce.Job:  map 22% reduce 0%
16/09/10 03:22:38 INFO mapreduce.Job:  map 36% reduce 0%
16/09/10 03:22:41 INFO mapreduce.Job:  map 50% reduce 0%
16/09/10 03:22:44 INFO mapreduce.Job:  map 62% reduce 0%
16/09/10 03:22:47 INFO mapreduce.Job:  map 76% reduce 0%
16/09/10 03:22:50 INFO mapreduce.Job:  map 90% reduce 0%
16/09/10 03:22:53 INFO mapreduce.Job:  map 100% reduce 0%
16/09/10 03:22:53 INFO mapreduce.Job: Job job_1473502403576_0002 completed successfully
16/09/10 03:22:53 INFO mapreduce.Job: Counters: 30
       	File System Counters
       		FILE: Number of bytes read=0
       		FILE: Number of bytes written=118080
       		FILE: Number of read operations=0
       		FILE: Number of large read operations=0
       		FILE: Number of write operations=0
       		HDFS: Number of bytes read=85196
       		HDFS: Number of bytes written=85008
       		HDFS: Number of read operations=2
       		HDFS: Number of large read operations=0
       		HDFS: Number of write operations=4
       	Job Counters
       		Launched map tasks=1
       		Data-local map tasks=1
       		Total time spent by all maps in occupied slots (ms)=28846
       		Total time spent by all reduces in occupied slots (ms)=0
       		Total time spent by all map tasks (ms)=28846
       		Total vcore-seconds taken by all map tasks=28846
       		Total megabyte-seconds taken by all map tasks=29538304
       	Map-Reduce Framework
       		Map input records=10626
       		Map output records=10626
       		Input split bytes=138
       		Spilled Records=0
       		Failed Shuffles=0
       		Merged Map outputs=0
       		GC time elapsed (ms)=32
       		CPU time spent (ms)=3960
       		Physical memory (bytes) snapshot=233476096
       		Virtual memory (bytes) snapshot=1564155904
       		Total committed heap usage (bytes)=176160768
       	File Input Format Counters
       		Bytes Read=85058
       	File Output Format Counters
       		Bytes Written=0
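For the curious: the indexer writes a `<file>.lzo.index` file next to each LZO file, which is what lets MapReduce split the otherwise non-splittable LZO stream. To the best of my understanding, the index is simply a flat sequence of big-endian 8-byte offsets marking compressed block boundaries (matching Java's `DataOutputStream.writeLong`). A minimal sketch of reading such an index, with a hypothetical file path:

```python
import struct

def read_lzo_index(path):
    """Read a hadoop-lzo .index file: a flat sequence of
    big-endian 8-byte block offsets into the .lzo file."""
    offsets = []
    with open(path, "rb") as f:
        while True:
            chunk = f.read(8)
            if len(chunk) < 8:
                break
            # '>q' = big-endian signed 64-bit, matching Java's writeLong()
            offsets.append(struct.unpack(">q", chunk)[0])
    return offsets
```

Each offset marks the start of an LZO block, so a mapper can seek straight to a block boundary instead of decompressing the whole file from the beginning.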

