Disable Hive Output Compression

Disable Hive Output Compression

In EM, we use Snappy as default compression for all Hive tables, which means all file data generated by Hive will be have “.snappy” as extension. This is certainly handy to save some disk space. However, sometimes you do want to select some data out from Hadoop’s raw files and transport the data to somewhere else that can be further analysed (as raw data). To achieve this, I use the following method: 1. Disable the output compression inline after login into “hive” shell
set hive.exec.compress.output=false;
2. Create external table link to the raw data:
CREATE EXTERNAL TABLE IF NOT EXISTS `external_table` (
    column1 int,
    column2 string,
    time int
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LOCATION 'hdfs://$hadoopHost/$hadoopDirectory'
Replace $hadoopHost and $hadoopDirectory accordingly. 3. Create hive native table to hold the data you want:
CREATE TABLE IF NOT EXISTS `tmp_table` (
    column1 int,
    column2 string,
    time int
)
row format delimited
FIELDS TERMINATED BY '\t'
STORED AS TextFile;
4. Then copy the data with the condition you want:
INSERT OVERWRITE TABLE `tmp_table` SELECT column1, column2, time FROM external_table WHERE time >= 111111 AND column2 = 'some-column-value';
5. Now log into hadoop namenode and check the data:
hadoop fs -ls /user/hive/warehouse/default/tmp_table
Assuming that you use the “default” database, you will see a file named “000000_0”. You can simply use “vim” to view the file as it is a plain text file. Now you can copy the file to local disk and move around as you wish.

Leave a Reply

Your email address will not be published.

My new Snowflake Blog is now live. I will not be updating this blog anymore but will continue with new contents in the Snowflake world!