- UDF – normal user-defined functions, simplest to write
- UDAF – user-defined aggregation functions, the ones typically used in the “group by” case
- UDTF – user-defined table functions, which turn one input row into multiple output rows, like Hive's built-in “explode” function
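
To make the difference between the three types concrete, here is a rough plain-Java analogy of the "shape" of each one. It is only a sketch: the class name UdfShapesDemo and its methods are made up for illustration and do not use or mirror the Hive API, they just show one-value-in/one-value-out (UDF), many-rows-in/one-value-out (UDAF) and one-row-in/many-rows-out (UDTF).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java analogy only: none of these methods use or mirror the Hive API.
final class UdfShapesDemo {

    // UDF shape: one input value in, one output value out (like UPPER(col)).
    static String upper(String value) {
        return value == null ? null : value.toUpperCase();
    }

    // UDAF shape: many input rows in, one aggregated value out (like SUM(col)).
    static int sum(List<Integer> rows) {
        int total = 0;
        for (Integer row : rows) {
            if (row != null) { // skip NULLs, as SUM does
                total += row;
            }
        }
        return total;
    }

    // UDTF shape: one input row in, zero or more output rows out (like explode(array_col)).
    static List<String> explode(String[] arrayColumn) {
        List<String> rows = new ArrayList<>();
        if (arrayColumn != null) {
            rows.addAll(Arrays.asList(arrayColumn));
        }
        return rows;
    }

    public static void main(String[] args) {
        System.out.println(upper("hive"));                    // HIVE
        System.out.println(sum(Arrays.asList(1, 2, 3)));      // 6
        System.out.println(explode(new String[] {"a", "b"})); // [a, b]
    }
}
```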

package com.effectivemeasure.hive.udaf;

import org.apache.hadoop.hive.ql.exec.UDAF;
import org.apache.hadoop.hive.ql.exec.UDAFEvaluator;

import java.util.HashMap;

/**
 * Groups key/value pairs and returns them as a map of key to value,
 * one map per GROUP BY group.
 *
 * For example, the query
 * "SELECT id, GROUP_MAP(choice_id, question_id) FROM table GROUP BY id"
 *
 * will be able to return the following values:
 *
 * 2f5d017334da977740ee723-61885258 -> {3165:411,2:1,3162:410,3159:409}
 *
 * @author Eric Lin
 */
public final class GroupMap extends UDAF {

    public static class Evaluator implements UDAFEvaluator {

        // Key/value pairs collected so far for the current group.
        private HashMap<Integer, Integer> buffer;

        public Evaluator() {
            init();
        }

        /**
         * Initializes the evaluator and resets its internal state.
         */
        public void init() {
            buffer = new HashMap<Integer, Integer>();
        }

        /**
         * Called every time there is a new value to be aggregated.
         * The parameters are the same as the ones passed to the function in the Hive query.
         * The first value seen for a key wins; later values for the same key are ignored.
         *
         * @param key Integer
         * @param value Integer
         * @return Boolean
         */
        public boolean iterate(Integer key, Integer value) {
            if (!buffer.containsKey(key)) {
                buffer.put(key, value);
            }
            return true;
        }

        /**
         * Called when the separate jobs on the different data nodes are done (partial aggregation).
         *
         * @return HashMap
         */
        public HashMap<Integer, Integer> terminatePartial() {
            return buffer;
        }

        /**
         * Called when merging the partial results calculated on all data nodes.
         *
         * @param another HashMap
         * @return Boolean
         */
        public boolean merge(HashMap<Integer, Integer> another) {
            // null might be passed in case there is no input data.
            if (another == null) {
                return true;
            }
            for (Integer key : another.keySet()) {
                if (!buffer.containsKey(key)) {
                    buffer.put(key, another.get(key));
                }
            }
            return true;
        }

        /**
         * Called when the final result of the aggregation is needed.
         *
         * @return HashMap
         */
        public HashMap<Integer, Integer> terminate() {
            if (buffer.size() == 0) {
                return null;
            }
            return buffer;
        }
    }
}

- A UDAF must be a subclass of org.apache.hadoop.hive.ql.exec.UDAF
- It must contain one or more nested static classes implementing org.apache.hadoop.hive.ql.exec.UDAFEvaluator
- Each evaluator must provide the 5 required functions “init”, “iterate”, “terminatePartial”, “merge” and “terminate”; I have explained each of them in the comments of the code above
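
To see how the five functions fit together, the following stand-alone sketch simulates one group being aggregated across two input splits. It uses no Hive classes: the class GroupMapLifecycleDemo, the method aggregate and the sample rows are made up for illustration, the inner Evaluator simply copies the logic of GroupMap.Evaluator, and the call order (iterate and terminatePartial on the map side, merge and terminate on the reduce side) is my assumption of the typical sequence rather than a guarantee of what Hive does for every query plan.

```java
import java.util.HashMap;
import java.util.Map;

// Stand-alone simulation: no Hive classes are used, and the driver code in
// aggregate() is an assumed call order, not Hive's actual execution engine.
final class GroupMapLifecycleDemo {

    // Same logic as GroupMap.Evaluator, minus the Hive interface.
    static final class Evaluator {
        private HashMap<Integer, Integer> buffer;

        Evaluator() {
            init();
        }

        void init() {
            buffer = new HashMap<>();
        }

        boolean iterate(Integer key, Integer value) {
            if (!buffer.containsKey(key)) { // first value seen for a key wins
                buffer.put(key, value);
            }
            return true;
        }

        HashMap<Integer, Integer> terminatePartial() {
            return buffer;
        }

        boolean merge(HashMap<Integer, Integer> another) {
            if (another == null) {
                return true;
            }
            for (Map.Entry<Integer, Integer> entry : another.entrySet()) {
                if (!buffer.containsKey(entry.getKey())) {
                    buffer.put(entry.getKey(), entry.getValue());
                }
            }
            return true;
        }

        HashMap<Integer, Integer> terminate() {
            return buffer.isEmpty() ? null : buffer;
        }
    }

    // Aggregates the rows of one group that were spread over two splits.
    // Each row is {key, value}, e.g. {choice_id, question_id}.
    static Map<Integer, Integer> aggregate(int[][] split1, int[][] split2) {
        // Map side: one evaluator per split, fed row by row.
        Evaluator mapper1 = new Evaluator();
        for (int[] row : split1) {
            mapper1.iterate(row[0], row[1]);
        }
        Evaluator mapper2 = new Evaluator();
        for (int[] row : split2) {
            mapper2.iterate(row[0], row[1]);
        }

        // Reduce side: merge the partial results, then produce the final value.
        Evaluator reducer = new Evaluator();
        reducer.merge(mapper1.terminatePartial());
        reducer.merge(mapper2.terminatePartial());
        return reducer.terminate();
    }

    public static void main(String[] args) {
        int[][] split1 = {{3165, 411}, {2, 1}};
        int[][] split2 = {{3162, 410}, {3159, 409}, {2, 99}};
        System.out.println(aggregate(split1, split2));
    }
}
```

With these sample rows the result holds four entries, and key 2 keeps the value 1 from the first partial result that was merged; the 99 from the second split is dropped. Note that this makes the outcome for duplicate keys depend on the order in which rows and partial results arrive.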
- Compile the Java code and package it into a JAR file
- Put the JAR file on the namenode in any location, for example /tmp
- Run the Hive command “ADD JAR /tmp/my-udaf.jar”
- Run the Hive command “CREATE TEMPORARY FUNCTION group_map AS 'com.effectivemeasure.hive.udaf.GroupMap'” (the fully qualified class name, not just the package)
- Finally, use the function like any other in a Hive query: “SELECT id, GROUP_MAP(choice_id, question_id) FROM table GROUP BY id”