pyspark.sql.functions.hll_union_agg

pyspark.sql.functions.hll_union_agg(col, allowDifferentLgConfigK=None)

Aggregate function: merges previously created DataSketches HllSketch instances through a DataSketches Union instance and returns the updatable binary representation of the merged HllSketch. Throws an exception if the sketches have different lgConfigK values and allowDifferentLgConfigK is unset or set to false.

New in version 3.5.0.

Parameters
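col : Column or str

The column of binary HllSketch representations to merge, for example the output of hll_sketch_agg.

allowDifferentLgConfigK : Column or bool, optional

Allow sketches with different lgConfigK values to be merged (defaults to false).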

Returns
Column

The binary representation of the merged HllSketch.

Examples

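>>> from pyspark.sql.functions import (
...     col, hll_sketch_agg, hll_sketch_estimate, hll_union_agg, lit
... )
>>> df1 = spark.createDataFrame([1,2,2,3], "INT")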
>>> df1 = df1.agg(hll_sketch_agg("value").alias("sketch"))
>>> df2 = spark.createDataFrame([4,5,5,6], "INT")
>>> df2 = df2.agg(hll_sketch_agg("value").alias("sketch"))
>>> df3 = df1.union(df2).agg(hll_sketch_estimate(
...     hll_union_agg("sketch")
... ).alias("distinct_cnt"))
>>> df3.show()
+------------+
|distinct_cnt|
+------------+
|           6|
+------------+
>>> df4 = df1.union(df2).agg(hll_sketch_estimate(
...     hll_union_agg("sketch", lit(False))
... ).alias("distinct_cnt"))
>>> df4.show()
+------------+
|distinct_cnt|
+------------+
|           6|
+------------+
>>> df5 = df1.union(df2).agg(hll_sketch_estimate(
...     hll_union_agg(col("sketch"), lit(False))
... ).alias("distinct_cnt"))
>>> df5.show()
+------------+
|distinct_cnt|
+------------+
|           6|
+------------+