How to create a Hive multi-character delimitered table – Hadoop Troubleshooting Guide

We have the following data:

ROW1~|`900~|`3103~|`2015~|`AA~|`1002003829
ROW2~|`900~|`3103~|`2015~|`BB~|`1002005103
ROW3~|`900~|`3103~|`2015~|`CC~|`1002003829
ROW4~|`900~|`3103~|`2015~|`DD~|`1002003829
ROW5~|`900~|`3103~|`2015~|`FF~|`1002003829

How to load the data into a Hive table with delimiter “~|`”? Well, it is pretty straightforward, just use the “MultiDelimitSerDe” which is available since CDH5.1.4, example as folllows:

CREATE TABLE test_multi 
(a string, b string, c string, d string, e string, f string) 
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.MultiDelimitSerDe' 
WITH SERDEPROPERTIES (
    "field.delim"="~|`",
    "collection.delim"=":",
    "mapkey.delim"="@"
);

After loading the data, the table will look something like this:

hive> LOAD DATA LOCAL INPATH '/tmp/data-multi.txt' OVERWRITE INTO TABLE test_multi;

hive> SELECT * FROM test_multi;
+----------------+---------------+---------------+---------------+---------------+---------------+--+
| test_multi.a   | test_multi.b  | test_multi.c  | test_multi.d  | test_multi.e  | test_multi.f  |
+----------------+---------------+---------------+---------------+---------------+---------------+--+
| ROW1           | 900           | 3103          | 2015          | AA            | 1002003829    |
| ROW2           | 900           | 3103          | 2015          | BB            | 1002005103    |
| ROW3           | 900           | 3103          | 2015          | CC            | 1002003829    |
| ROW4           | 900           | 3103          | 2015          | DD            | 1002003829    |
| ROW5           | 900           | 3103          | 2015          | FF            | 1002003829    |
+----------------+---------------+---------------+---------------+---------------+---------------+--+

This is the preferred way of loading multi-character delimited data into Hive over the use of “org.apache.hadoop.hive.serde2.RegexSerDe”, as it is simpler and faster.

5 Comments

sujoy

July 17, 2016 at 2:15 PM 8 years ago

For some reason I am not able to add hive-contrib jars to my hive jars .
Will there be any chance to define a simple multi delimiter for the RegexSerde ?

suresh

November 3, 2016 at 9:58 PM 8 years ago

Reply

i have some issue

have some data in csv file
that data like

column1 , column2
1,2,3,4 , 5,6,7,8

how can i load data into hive pleace help me
thank you
1. Eric Lin
  
  November 19, 2016 at 8:48 PM 8 years ago
  
  Reply
  
  Hi Suresh,
  
  In this case you will need to quote the strings, so that they are in the proper CSV file format, like below:
  
  column1,column2
  “1,2,3,4”,”5,6,7,8″
  
  And then you can use OpenCSVSerde for your table like below:
  
  CREATE EXTERNAL TABLE test (a string, b string, c string)
  ROW FORMAT SERDE ‘org.apache.hadoop.hive.serde2.OpenCSVSerde’
  LOCATION ‘/user/hive/warehouse/test.db/test’;
  
  The latest Hive supports OpenCSVSerde natively, so you don’t need to add any custom jars.
  
  Hope this helps.
Eric Lin

November 19, 2016 at 8:44 PM 8 years ago

Reply

Hi Sujoy,

Sorry for the very late reply. My email server did not send me emails before, so I missed your update. This has now been fixed, so I should get all update emails from my blog.

Before I answer your question, I would like to check what the issue you are facing? And what version of Hive are you using?

Sanjiv Singh

November 13, 2016 at 5:35 AM 8 years ago

Try this :https://github.com/sanjivsingh/FixedLengthAndDelimitedSerde
– Supported both fixed length and delimited data serialization/deserialization
– It support both case in single definition.

M	T	W	T	F	S	S
		1	2	3	4	5
6	7	8	9	10	11	12
13	14	15	16	17	18	19
20	21	22	23	24	25	26
27	28	29	30	31

Other posts that you might also be interested:

5 Comments

Leave a Reply Cancel reply