COLLECT_SET() in Hive, keep duplicates?

Question

Is there a way to keep the duplicates in a collected set in Hive, or to simulate the kind of aggregate collection that Hive provides using some other method? I want to aggregate all of the items in a column that share the same key into an array, with duplicates.

I.e.:

    hash_id | num_of_cats
    =====================
    ad3jkfk   4
    ad3jkfk   4
    ad3jkfk   2
    fkjh43f   1
    fkjh43f   8
    fkjh43f   8
    rjkhd93   7
    rjkhd93   4
    rjkhd93   7

should return:

    hash_agg | cats_aggregate
    ===========================
    ad3jkfk    Array<int>(4,4,2)
    fkjh43f    Array<int>(1,8,8)
    rjkhd93    Array<int>(7,4,7)

Marvin W · Accepted Answer

Try COLLECT_LIST(col), which is available from Hive 0.13.0 onward. Unlike COLLECT_SET, it does not remove duplicates.

    SELECT
        hash_id,
        COLLECT_LIST(num_of_cats) AS aggr_set
    FROM tablename
    WHERE blablabla
    GROUP BY hash_id;
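To make the difference between the two aggregates concrete, here is a minimal plain-Java sketch (not Hive code) that groups the sample rows from the question and collects them once into a list and once into a set. The class and method names are made up for illustration, and the insertion-ordered collections are only a convenience for readable output; Hive itself makes no such ordering promise.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Plain-Java illustration only; no Hive classes are involved.
class CollectSemantics {
    // Sample rows from the question: {hash_id, num_of_cats}.
    static final String[][] ROWS = {
        {"ad3jkfk", "4"}, {"ad3jkfk", "4"}, {"ad3jkfk", "2"},
        {"fkjh43f", "1"}, {"fkjh43f", "8"}, {"fkjh43f", "8"},
        {"rjkhd93", "7"}, {"rjkhd93", "4"}, {"rjkhd93", "7"},
    };

    // Mimics COLLECT_LIST: every value is kept, duplicates included.
    static Map<String, List<Integer>> collectList(String[][] rows) {
        Map<String, List<Integer>> out = new LinkedHashMap<>();
        for (String[] r : rows) {
            out.computeIfAbsent(r[0], k -> new ArrayList<>()).add(Integer.parseInt(r[1]));
        }
        return out;
    }

    // Mimics COLLECT_SET: a repeated value is stored only once per key.
    static Map<String, Set<Integer>> collectSet(String[][] rows) {
        Map<String, Set<Integer>> out = new LinkedHashMap<>();
        for (String[] r : rows) {
            out.computeIfAbsent(r[0], k -> new LinkedHashSet<>()).add(Integer.parseInt(r[1]));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(collectList(ROWS)); // {ad3jkfk=[4, 4, 2], fkjh43f=[1, 8, 8], rjkhd93=[7, 4, 7]}
        System.out.println(collectSet(ROWS));  // {ad3jkfk=[4, 2], fkjh43f=[1, 8], rjkhd93=[7, 4]}
    }
}
```

The list variant reproduces the output the question asks for, while the set variant loses the repeated 4, 8 and 7.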

Jeff Mc · Answer

There is nothing built in, but writing user-defined functions, including aggregates, isn't that bad. The only rough part is making them type-generic. Here is an example of a collect aggregate:

    package com.example;

    import java.util.ArrayList;
    import org.apache.hadoop.hive.ql.exec.UDFArgumentTypeException;
    import org.apache.hadoop.hive.ql.metadata.HiveException;
    import org.apache.hadoop.hive.ql.parse.SemanticException;
    import org.apache.hadoop.hive.ql.udf.generic.AbstractGenericUDAFResolver;
    import org.apache.hadoop.hive.ql.udf.generic.GenericUDAFEvaluator;
    import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
    import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
    import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorUtils;
    import org.apache.hadoop.hive.serde2.objectinspector.PrimitiveObjectInspector;
    import org.apache.hadoop.hive.serde2.objectinspector.StandardListObjectInspector;
    import org.apache.hadoop.hive.serde2.typeinfo.TypeInfo;

    public class CollectAll extends AbstractGenericUDAFResolver
    {
        // Validates the argument list and returns the evaluator.
        @Override
        public GenericUDAFEvaluator getEvaluator(TypeInfo[] tis)
                throws SemanticException
        {
            if (tis.length != 1)
            {
                throw new UDFArgumentTypeException(tis.length - 1,
                        "Exactly one argument is expected.");
            }
            if (tis[0].getCategory() != ObjectInspector.Category.PRIMITIVE)
            {
                throw new UDFArgumentTypeException(0,
                        "Only primitive type arguments are accepted but "
                        + tis[0].getTypeName() + " was passed as parameter 1.");
            }
            return new CollectAllEvaluator();
        }

        public static class CollectAllEvaluator extends GenericUDAFEvaluator
        {
            private PrimitiveObjectInspector inputOI;
            private StandardListObjectInspector loi;
            private StandardListObjectInspector internalMergeOI;

            @Override
            public ObjectInspector init(Mode m, ObjectInspector[] parameters)
                    throws HiveException
            {
                super.init(m, parameters);
                if (m == Mode.PARTIAL1)
                {
                    // Raw rows in, partial list out.
                    inputOI = (PrimitiveObjectInspector) parameters[0];
                    return ObjectInspectorFactory
                            .getStandardListObjectInspector((PrimitiveObjectInspector) ObjectInspectorUtils
                                    .getStandardObjectInspector(inputOI));
                }
                else
                {
                    if (!(parameters[0] instanceof StandardListObjectInspector))
                    {
                        // Raw rows in, final list out (no partial stage).
                        inputOI = (PrimitiveObjectInspector) ObjectInspectorUtils
                                .getStandardObjectInspector(parameters[0]);
                        return (StandardListObjectInspector) ObjectInspectorFactory
                                .getStandardListObjectInspector(inputOI);
                    }
                    else
                    {
                        // Partial lists in, merged list out.
                        internalMergeOI = (StandardListObjectInspector) parameters[0];
                        inputOI = (PrimitiveObjectInspector) internalMergeOI.getListElementObjectInspector();
                        loi = (StandardListObjectInspector) ObjectInspectorUtils.getStandardObjectInspector(internalMergeOI);
                        return loi;
                    }
                }
            }

            // Holds the values collected so far for one group.
            static class ArrayAggregationBuffer implements AggregationBuffer
            {
                ArrayList<Object> container;
            }

            @Override
            public void reset(AggregationBuffer ab) throws HiveException
            {
                ((ArrayAggregationBuffer) ab).container = new ArrayList<Object>();
            }

            @Override
            public AggregationBuffer getNewAggregationBuffer() throws HiveException
            {
                ArrayAggregationBuffer ret = new ArrayAggregationBuffer();
                reset(ret);
                return ret;
            }

            // Adds one input value to the buffer; nulls are skipped.
            @Override
            public void iterate(AggregationBuffer ab, Object[] parameters)
                    throws HiveException
            {
                assert (parameters.length == 1);
                Object p = parameters[0];
                if (p != null)
                {
                    ArrayAggregationBuffer agg = (ArrayAggregationBuffer) ab;
                    agg.container.add(ObjectInspectorUtils.copyToStandardObject(p, this.inputOI));
                }
            }

            // Returns a copy of the partial list.
            @Override
            public Object terminatePartial(AggregationBuffer ab)
                    throws HiveException
            {
                ArrayAggregationBuffer agg = (ArrayAggregationBuffer) ab;
                ArrayList<Object> ret = new ArrayList<Object>(agg.container.size());
                ret.addAll(agg.container);
                return ret;
            }

            // Appends every element of a partial list to the buffer.
            @Override
            public void merge(AggregationBuffer ab, Object o)
                    throws HiveException
            {
                ArrayAggregationBuffer agg = (ArrayAggregationBuffer) ab;
                ArrayList<Object> partial = (ArrayList<Object>) internalMergeOI.getList(o);
                for (Object i : partial)
                {
                    agg.container.add(ObjectInspectorUtils.copyToStandardObject(i, this.inputOI));
                }
            }

            // Returns a copy of the final list.
            @Override
            public Object terminate(AggregationBuffer ab) throws HiveException
            {
                ArrayAggregationBuffer agg = (ArrayAggregationBuffer) ab;
                ArrayList<Object> ret = new ArrayList<Object>(agg.container.size());
                ret.addAll(agg.container);
                return ret;
            }
        }
    }

Then, in Hive, just run add jar Whatever.jar; followed by CREATE TEMPORARY FUNCTION collect_all AS 'com.example.CollectAll'; You should then be able to use it as expected.

    hive> SELECT hash_id, collect_all(num_of_cats) FROM test GROUP BY hash_id;
    OK
    ad3jkfk [4,4,2]
    fkjh43f [1,8,8]
    rjkhd93 [7,4,7]
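For readers unfamiliar with the UDAF callbacks, the following plain-Java sketch walks through the order in which a collect-style aggregate typically uses them: iterate per row, terminatePartial per task, merge on the reduce side, then terminate. It uses no Hive classes. The Buffer type, the aggregate helper and the two-way split of the rows are assumptions made for illustration, not how Hive necessarily schedules the work.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java simulation of the UDAF call sequence; no Hive classes are used.
class UdafLifecycleSketch {
    // Stand-in for ArrayAggregationBuffer: holds the values seen so far.
    static final class Buffer {
        final List<Object> container = new ArrayList<>();
    }

    // iterate(): called once per input row; nulls are skipped.
    static void iterate(Buffer buf, Object value) {
        if (value != null) {
            buf.container.add(value);
        }
    }

    // terminatePartial(): hands a copy of the partial list to the next stage.
    static List<Object> terminatePartial(Buffer buf) {
        return new ArrayList<>(buf.container);
    }

    // merge(): appends one partial list to the receiving buffer.
    static void merge(Buffer buf, List<Object> partial) {
        buf.container.addAll(partial);
    }

    // terminate(): returns the final array for the group.
    static List<Object> terminate(Buffer buf) {
        return new ArrayList<>(buf.container);
    }

    // Aggregates one group whose rows were split across several tasks.
    static List<Object> aggregate(List<List<Object>> splits) {
        Buffer reducer = new Buffer();
        for (List<Object> split : splits) {
            Buffer mapper = new Buffer();
            for (Object row : split) {
                iterate(mapper, row);
            }
            merge(reducer, terminatePartial(mapper));
        }
        return terminate(reducer);
    }

    public static void main(String[] args) {
        // Hypothetical split of the rows for key ad3jkfk across two tasks.
        List<List<Object>> splits = Arrays.asList(
            Arrays.<Object>asList(4, 4),
            Arrays.<Object>asList(2));
        System.out.println(aggregate(splits)); // prints [4, 4, 2]
    }
}
```

Because merge simply appends, duplicates survive every stage, which is the property the question asks for.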

Note that the order of the elements should be considered undefined. If you intend to use this to feed information into n_grams, you may need to extend it a bit to sort the data as needed.
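One common way to get a defined order is to collect each value together with a position key and sort after aggregation. Below is a minimal plain-Java sketch of that idea. The Item type, its position field and the restoreOrder helper are hypothetical names for illustration; in Hive the same idea would have to be expressed in the UDAF or in the query.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Plain-Java illustration of "collect with a position, then sort".
class OrderedCollectSketch {
    // Hypothetical row: a value plus the position it should have in the result.
    static final class Item {
        final int position;
        final String value;

        Item(int position, String value) {
            this.position = position;
            this.value = value;
        }
    }

    // Sorts the collected items by position, then keeps only the values.
    static List<String> restoreOrder(List<Item> collected) {
        List<Item> sorted = new ArrayList<>(collected);
        sorted.sort(Comparator.comparingInt((Item it) -> it.position));
        List<String> values = new ArrayList<>();
        for (Item it : sorted) {
            values.add(it.value);
        }
        return values;
    }

    public static void main(String[] args) {
        // Items as they might arrive from an aggregate with undefined order.
        List<Item> collected = Arrays.asList(
            new Item(2, "brown"), new Item(0, "the"), new Item(1, "quick"));
        System.out.println(restoreOrder(collected)); // prints [the, quick, brown]
    }
}
```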

nephtes · Answer

I modified Jeff Mc's code to remove the restriction (presumably inherited from collect_set) that the input must be a primitive type. This version can collect structs, maps and arrays as well as primitives.

    package com.example;

    import java.util.ArrayList;
    import org.apache.hadoop.hive.ql.exec.UDFArgumentTypeException;
    import org.apache.hadoop.hive.ql.metadata.HiveException;
    import org.apache.hadoop.hive.ql.parse.SemanticException;
    import org.apache.hadoop.hive.ql.udf.generic.AbstractGenericUDAFResolver;
    import org.apache.hadoop.hive.ql.udf.generic.GenericUDAFEvaluator;
    import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
    import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
    import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorUtils;
    import org.apache.hadoop.hive.serde2.objectinspector.StandardListObjectInspector;
    import org.apache.hadoop.hive.serde2.typeinfo.TypeInfo;

    public class CollectAll extends AbstractGenericUDAFResolver
    {
        // Only the argument count is checked; any type category is accepted.
        @Override
        public GenericUDAFEvaluator getEvaluator(TypeInfo[] tis)
                throws SemanticException
        {
            if (tis.length != 1)
            {
                throw new UDFArgumentTypeException(tis.length - 1,
                        "Exactly one argument is expected.");
            }
            return new CollectAllEvaluator();
        }

        public static class CollectAllEvaluator extends GenericUDAFEvaluator
        {
            // A general ObjectInspector instead of a PrimitiveObjectInspector.
            private ObjectInspector inputOI;
            private StandardListObjectInspector loi;
            private StandardListObjectInspector internalMergeOI;

            @Override
            public ObjectInspector init(Mode m, ObjectInspector[] parameters)
                    throws HiveException
            {
                super.init(m, parameters);
                if (m == Mode.PARTIAL1)
                {
                    // Raw rows in, partial list out.
                    inputOI = parameters[0];
                    return ObjectInspectorFactory
                            .getStandardListObjectInspector(ObjectInspectorUtils
                                    .getStandardObjectInspector(inputOI));
                }
                else
                {
                    if (!(parameters[0] instanceof StandardListObjectInspector))
                    {
                        // Raw rows in, final list out (no partial stage).
                        inputOI = ObjectInspectorUtils
                                .getStandardObjectInspector(parameters[0]);
                        return (StandardListObjectInspector) ObjectInspectorFactory
                                .getStandardListObjectInspector(inputOI);
                    }
                    else
                    {
                        // Partial lists in, merged list out.
                        internalMergeOI = (StandardListObjectInspector) parameters[0];
                        inputOI = internalMergeOI.getListElementObjectInspector();
                        loi = (StandardListObjectInspector) ObjectInspectorUtils.getStandardObjectInspector(internalMergeOI);
                        return loi;
                    }
                }
            }

            // Holds the values collected so far for one group.
            static class ArrayAggregationBuffer implements AggregationBuffer
            {
                ArrayList<Object> container;
            }

            @Override
            public void reset(AggregationBuffer ab) throws HiveException
            {
                ((ArrayAggregationBuffer) ab).container = new ArrayList<Object>();
            }

            @Override
            public AggregationBuffer getNewAggregationBuffer() throws HiveException
            {
                ArrayAggregationBuffer ret = new ArrayAggregationBuffer();
                reset(ret);
                return ret;
            }

            // Adds one input value to the buffer; nulls are skipped.
            @Override
            public void iterate(AggregationBuffer ab, Object[] parameters)
                    throws HiveException
            {
                assert (parameters.length == 1);
                Object p = parameters[0];
                if (p != null)
                {
                    ArrayAggregationBuffer agg = (ArrayAggregationBuffer) ab;
                    agg.container.add(ObjectInspectorUtils.copyToStandardObject(p, this.inputOI));
                }
            }

            // Returns a copy of the partial list.
            @Override
            public Object terminatePartial(AggregationBuffer ab)
                    throws HiveException
            {
                ArrayAggregationBuffer agg = (ArrayAggregationBuffer) ab;
                ArrayList<Object> ret = new ArrayList<Object>(agg.container.size());
                ret.addAll(agg.container);
                return ret;
            }

            // Appends every element of a partial list to the buffer.
            @Override
            public void merge(AggregationBuffer ab, Object o)
                    throws HiveException
            {
                ArrayAggregationBuffer agg = (ArrayAggregationBuffer) ab;
                ArrayList<Object> partial = (ArrayList<Object>) internalMergeOI.getList(o);
                for (Object i : partial)
                {
                    agg.container.add(ObjectInspectorUtils.copyToStandardObject(i, this.inputOI));
                }
            }

            // Returns a copy of the final list.
            @Override
            public Object terminate(AggregationBuffer ab) throws HiveException
            {
                ArrayAggregationBuffer agg = (ArrayAggregationBuffer) ab;
                ArrayList<Object> ret = new ArrayList<Object>(agg.container.size());
                ret.addAll(agg.container);
                return ret;
            }
        }
    }

jlemaitre · Answer

As of Hive 0.13, there is a built-in UDAF called collect_list() that does exactly this.

Jerome Banks · Answer

Check out the Brickhouse collect UDAF ( http://github.com/klout/brickhouse/blob/master/src/main/java/brickhouse/udf/collect/CollectUDAF.java )

It also supports collecting into a map. Brickhouse also contains many useful UDFs that are not in the standard Hive distribution.

Jai Prakash · Answer

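Here is the exact Hive query that does the job (it only works in Hive 0.13 and later). Note that it must use collect_list rather than collect_set, since collect_set drops duplicates:

    SELECT hash_id, collect_list(num_of_cats) FROM tablename GROUP BY hash_id;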

mgokayla · Answer

For what it's worth (though I know this is an older post), Hive 0.13.0 provides a new collect_list() function that does not deduplicate.

Nikhil · Answer

Workaround for collecting structs

Suppose you have a table:

    tableWithStruct(
        id string,
        obj struct<a:string,b:string>
    )

Now create another table:

    CREATE EXTERNAL TABLE tablename (
        id string,
        temp array<string>
    )
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY '\t'
    COLLECTION ITEMS TERMINATED BY ','
    MAP KEYS TERMINATED BY '|'

Insert query:

    insert into table tablename
    select
        id,
        collect(concat_ws('|', cast(obj.a as string), cast(obj.b as string)))
    from tableWithStruct
    group by id;

Now create another table at the same location as tablename:

    CREATE EXTERNAL TABLE tablename_final (
        id string,
        array_list array<struct<a:string,b:string>>
    )
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY '\t'
    COLLECTION ITEMS TERMINATED BY ','
    MAP KEYS TERMINATED BY '|'

When you select from tablename_final, you will get the desired output.
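The trick works because both tables read the same delimited text and only interpret it differently. A rough plain-Java sketch of that round trip follows, assuming ',' separates array elements and '|' separates struct fields as in the table definitions. The Obj class and the encode/decode helpers are hypothetical names for illustration, not part of Hive.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java illustration of the delimited-text round trip; not Hive code.
class DelimitedStructSketch {
    // Hypothetical stand-in for struct<a:string,b:string>.
    static final class Obj {
        final String a;
        final String b;

        Obj(String a, String b) {
            this.a = a;
            this.b = b;
        }
    }

    // Write side: fields joined by '|', array elements joined by ','.
    static String encode(List<Obj> objs) {
        List<String> parts = new ArrayList<>();
        for (Obj o : objs) {
            parts.add(o.a + "|" + o.b);
        }
        return String.join(",", parts);
    }

    // Read side: the same text split back into elements, then into fields.
    static List<Obj> decode(String column) {
        List<Obj> out = new ArrayList<>();
        if (column.isEmpty()) {
            return out;
        }
        for (String element : column.split(",")) {
            String[] fields = element.split("\\|", 2);
            out.add(new Obj(fields[0], fields.length > 1 ? fields[1] : null));
        }
        return out;
    }

    public static void main(String[] args) {
        // Two identical structs, to show that duplicates are preserved.
        String stored = encode(Arrays.asList(new Obj("x", "y"), new Obj("x", "y")));
        System.out.println(stored); // prints x|y,x|y
        System.out.println(decode(stored).size()); // prints 2
    }
}
```

In this sketch, a field value that itself contains ',' or '|' would be split in the wrong place, so the approach is only safe when the data cannot contain the delimiter characters.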