Hive supports array type columns so that you can store a list of values for a row all inside a single column, and better yet can still be queried. This is particularly useful to me in order to reduce the number of data rows in our database.
Our original “answer” table looks like this:
qId
cId
vId
1
1
v1
1
2
v1
1
2
v2
And the new “answer” table looks like this:
qId
cId
vIds
1
1
“v1”
1
2
“v1”, “v2”
You can see that it has saved me one record in this instance. Of course it is not big deal, but with hundreds of millions of rows of data, it will have great impact.
Another advantage of saving data in this format is that you can selectively get the data you want and then expand them again to the original format. For example, the new “answer” table you have above, we can expand the data again in the following way:
SELECT qId, cId, vId FROM answer
LATERAL VIEW explode(vIds) visitor AS vId
WHERE cId = 2
“vIds” is the column name in the new “answer” table, “visitor” is the LATERAL VIEW TABLE alias and “vId” is the new column alias so that it can be selected in the query.
This will produce the following data:
qId
cId
vId
1
2
v1
1
2
v2
The new table will make your table size smaller and have less records to scan through to find the data, so it should improve your application’s performance.
You can get more information from the official Hive Wiki page.
If you have any other thoughts, please feel free to comment below to let me know.
Hi Eric,
Nice explanation. Could you explain how to achieve the converse of this i.e, how to reduce the number of rows like storing the column value as an array
I have written like this:
select key1, key2, key3 from
lateral view posexplode(col_name_1) key1 as q1, key1
lateral view posexplode(col_name_2) key2 as q2, key2
lateral view posexplode(col_name_3) key3 as q3, key3
where q1=q2 and q1=q3;
Gets an exception as:
FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask
Sorry about the delay. My test showed that your query is correct, here is my version:
select key1, key2, key3 from map_test
lateral view posexplode(a) t1 as q1, key1
lateral view posexplode(b) t2 as q2, key2
lateral view posexplode(c) t3 as q3, key3
where q1 = q2 and q1=q3;
I think you missed the table name after “select key1, key2, key3 from”? Please try again by adding the missing table name and let me know how you goes.
t1, t2 and t3 are aliases for posexplode function call, like a table alias, so that you can reference in the SELECT statement.
q1, key1 and the rest are the actual key value pairs of each row returned by the posexplode function, it is probably better to be names as pos1, val1, pos2, val2 would make more sense.
Hey Eric, thanks for the blog.
I had a doubt. If we have json string stored in one of the column. Within the json string, there are many fields, and one of the field is a comma separated array of strings say str1, str2, str3…. which are again json. How to access the entities within these str1, str2 ,str3.
Example json : {“time”:123,”type”:”show”,”page”:”home”,”pageId”:”ws12s”,”things”:[{“Program”:”new”,”position”:3,”Id”:”123″},{“Program”:”old”,”position”:2,”Id”:”142″},{“Program”:”outdated”,”position”:1,”Id”:”214″}],”module”:”main”}
The table contains the example json-type strings in a column. Supposedly the table has 100 such rows, I need to count the number of “Id”:”123″ in the entire table. What will be the give query for that?
Thanks! this was very helpful
Hi Anju,
Thanks for visiting and I am glad that my article helps.
worked greate for me too, thanks!
awesome!. thanks!
also found that you can do self reference with an alias to repeat another column value along with the explode column values.
Hi sgad,
Thanks for your comment. Is it possible if you can share the detailed steps to achieve this?
Thanks again.
ofcourse, here it is:
select year, colx
from (
select 2017 year, ‘a,b,c’ config_value
) x
LATERAL VIEW explode(split(regexp_replace(trim(config_value),’\”,’,’)) x as colx
Thanks Sgad for sharing ^_^ and sorry about the delay in response. I totally overlooked your update, apologies.
Hi Eric,
Nice explanation. Could you explain how to achieve the converse of this i.e, how to reduce the number of rows like storing the column value as an array
Hi Jaise,
Thanks for visiting my blog.
Did you mean to group a list of values into array? Maybe collect_set or collect_list?
I have test data:
select * from test10;
+———–+———–+–+
| test10.a | test10.b |
+———–+———–+–+
| 1 | 1 |
| 1 | 2 |
| 1 | 4 |
| 2 | 1 |
+———–+———–+–+
And collect_list gives me:
select a, collect_list(b) from test10 group by a
+—-+———-+–+
| a | _c1 |
+—-+———-+–+
| 1 | [1,2,4] |
| 2 | [1] |
+—-+———-+–+
Is this you are after?
collect_set will remove duplicates, while collect_list will leave duplicates as they are.
Hope above helps.
Cheers
Have 3 arrays:
[21,31,41], [121,131,141], [1021,1031,1041]
Wanted to explode as:
21, 121, 1021
31, 131, 1031
41, 141, 1041
I have written like this:
select key1, key2, key3 from
lateral view posexplode(col_name_1) key1 as q1, key1
lateral view posexplode(col_name_2) key2 as q2, key2
lateral view posexplode(col_name_3) key3 as q3, key3
where q1=q2 and q1=q3;
Gets an exception as:
FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask
Hi Vinod,
Sorry about the delay. My test showed that your query is correct, here is my version:
select key1, key2, key3 from map_test
lateral view posexplode(a) t1 as q1, key1
lateral view posexplode(b) t2 as q2, key2
lateral view posexplode(c) t3 as q3, key3
where q1 = q2 and q1=q3;
and following results are returned:
+——-+——-+——-+–+
| key1 | key2 | key3 |
+——-+——-+——-+–+
| 1 | 3 | 6 |
| 2 | 4 | 7 |
| 3 | 5 | 8 |
+——-+——-+——-+–+
I think you missed the table name after “select key1, key2, key3 from”? Please try again by adding the missing table name and let me know how you goes.
Hi Eric,
Thanks for information.. it helps.
Br,
Maruthi
what is t1 , t2, t3 and q1 ,q2 and q3 here ?
Hi Kamal,
Sorry about the late reply.
t1, t2 and t3 are aliases for posexplode function call, like a table alias, so that you can reference in the SELECT statement.
q1, key1 and the rest are the actual key value pairs of each row returned by the posexplode function, it is probably better to be names as pos1, val1, pos2, val2 would make more sense.
Please refer to Apache doc:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-posexplode(array)
Cheers
Eric
How to create Hive table with column names as USERNAME,FIRSTNAME,LASTNAME and PASSWORD by using this below XML format?
Hi Kishore,
Thanks for visiting my blog. However, I do not see any XML output, did you forget to post in the comment or maybe WordPress removed the XML content?
Does this phrase:
“vIds” is the table name in the new “answer” table,
really mean
“vIds” is the column name in the new “answer” table,
?
Hi CSmith,
Thanks a lot for your visit and your pick up of my typo on the blog. I have updated it with correct wording.
Thanks again!
Cheers
Eric
Hey Eric, thanks for the blog.
I had a doubt. If we have json string stored in one of the column. Within the json string, there are many fields, and one of the field is a comma separated array of strings say str1, str2, str3…. which are again json. How to access the entities within these str1, str2 ,str3.
Example json : {“time”:123,”type”:”show”,”page”:”home”,”pageId”:”ws12s”,”things”:[{“Program”:”new”,”position”:3,”Id”:”123″},{“Program”:”old”,”position”:2,”Id”:”142″},{“Program”:”outdated”,”position”:1,”Id”:”214″}],”module”:”main”}
The table contains the example json-type strings in a column. Supposedly the table has 100 such rows, I need to count the number of “Id”:”123″ in the entire table. What will be the give query for that?
Hey Eric,
Thankful to you man !! This helped me
Thanks,
Praneeth
Thanks for visiting my blog and your comment Praneeth, glad that it helps.