Hive supports array type columns so that you can store a list of values for a row all inside a single column, and better yet can still be queried. This is particularly useful to me in order to reduce the number of data rows in our database. Our original “answer” table looks like this:
qId cId vId
1 1 v1
1 2 v1
1 2 v2
And the new “answer” table looks like this:
qId cId vIds
1 1 “v1”
1 2 “v1”, “v2”
You can see that it has saved me one record in this instance. Of course it is not big deal, but with hundreds of millions of rows of data, it will have great impact. Another advantage of saving data in this format is that you can selectively get the data you want and then expand them again to the original format. For example, the new “answer” table you have above, we can expand the data again in the following way:
SELECT qId, cId, vId FROM answer
LATERAL VIEW explode(vIds) visitor AS vId
WHERE cId = 2
“vIds” is the column name in the new “answer” table, “visitor” is the LATERAL VIEW TABLE alias and “vId” is the new column alias so that it can be selected in the query. This will produce the following data:
qId cId vId
1 2 v1
1 2 v2
The new table will make your table size smaller and have less records to scan through to find the data, so it should improve your application’s performance. You can get more information from the official Hive Wiki page. If you have any other thoughts, please feel free to comment below to let me know.

21 Comments

      1. sgad

        ofcourse, here it is:

        select year, colx
        from (
        select 2017 year, ‘a,b,c’ config_value
        ) x
        LATERAL VIEW explode(split(regexp_replace(trim(config_value),’\”,’,’)) x as colx

    1. Eric Lin

      Hi Jaise,

      Thanks for visiting my blog.

      Did you mean to group a list of values into array? Maybe collect_set or collect_list?

      I have test data:

      select * from test10;
      +———–+———–+–+
      | test10.a | test10.b |
      +———–+———–+–+
      | 1 | 1 |
      | 1 | 2 |
      | 1 | 4 |
      | 2 | 1 |
      +———–+———–+–+

      And collect_list gives me:

      select a, collect_list(b) from test10 group by a
      +—-+———-+–+
      | a | _c1 |
      +—-+———-+–+
      | 1 | [1,2,4] |
      | 2 | [1] |
      +—-+———-+–+

      Is this you are after?

      collect_set will remove duplicates, while collect_list will leave duplicates as they are.

      Hope above helps.

      Cheers

  1. Vinod Kumar Chandran

    Have 3 arrays:
    [21,31,41], [121,131,141], [1021,1031,1041]

    Wanted to explode as:
    21, 121, 1021
    31, 131, 1031
    41, 141, 1041

    I have written like this:
    select key1, key2, key3 from
    lateral view posexplode(col_name_1) key1 as q1, key1
    lateral view posexplode(col_name_2) key2 as q2, key2
    lateral view posexplode(col_name_3) key3 as q3, key3

    where q1=q2 and q1=q3;

    Gets an exception as:
    FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask

    1. Eric Lin

      Hi Vinod,

      Sorry about the delay. My test showed that your query is correct, here is my version:

      select key1, key2, key3 from map_test
      lateral view posexplode(a) t1 as q1, key1
      lateral view posexplode(b) t2 as q2, key2
      lateral view posexplode(c) t3 as q3, key3
      where q1 = q2 and q1=q3;

      and following results are returned:

      +——-+——-+——-+–+
      | key1 | key2 | key3 |
      +——-+——-+——-+–+
      | 1 | 3 | 6 |
      | 2 | 4 | 7 |
      | 3 | 5 | 8 |
      +——-+——-+——-+–+

      I think you missed the table name after “select key1, key2, key3 from”? Please try again by adding the missing table name and let me know how you goes.

        1. Eric Lin

          Hi Kamal,

          Sorry about the late reply.

          t1, t2 and t3 are aliases for posexplode function call, like a table alias, so that you can reference in the SELECT statement.

          q1, key1 and the rest are the actual key value pairs of each row returned by the posexplode function, it is probably better to be names as pos1, val1, pos2, val2 would make more sense.

          Please refer to Apache doc:
          https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-posexplode(array)

          Cheers
          Eric

  2. Abhinav

    Hey Eric, thanks for the blog.
    I had a doubt. If we have json string stored in one of the column. Within the json string, there are many fields, and one of the field is a comma separated array of strings say str1, str2, str3…. which are again json. How to access the entities within these str1, str2 ,str3.
    Example json : {“time”:123,”type”:”show”,”page”:”home”,”pageId”:”ws12s”,”things”:[{“Program”:”new”,”position”:3,”Id”:”123″},{“Program”:”old”,”position”:2,”Id”:”142″},{“Program”:”outdated”,”position”:1,”Id”:”214″}],”module”:”main”}

    The table contains the example json-type strings in a column. Supposedly the table has 100 such rows, I need to count the number of “Id”:”123″ in the entire table. What will be the give query for that?

Leave a Reply

Your email address will not be published. Required fields are marked *