pyspark.sql.functions.shuffle#

pyspark.sql.functions.shuffle(col)[source]#

Array function: Generates a random permutation of the given array.

New in version 2.4.0.

Changed in version 3.4.0: Supports Spark Connect.

Parameters
col : Column or str

The array column, or the name of the column, to be shuffled.

Returns
Column

A new column that contains an array of elements in random order.

Notes

The shuffle function is non-deterministic, meaning the order of the output array can be different for each execution.
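
As an illustrative sketch (not part of the upstream examples), each row's array is shuffled independently, so identical input arrays can come back in different orders within the same query; the ordering shown below is only one possible outcome:

>>> import pyspark.sql.functions as sf
>>> df = spark.createDataFrame([([1, 2, 3, 4],), ([1, 2, 3, 4],)], ['data'])
>>> df.select(sf.shuffle(df.data)).show()
+-------------+
|shuffle(data)|
+-------------+
| [3, 1, 4, 2]|
| [2, 4, 1, 3]|
+-------------+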

Examples

Example 1: Shuffling a simple array

>>> import pyspark.sql.functions as sf
>>> df = spark.createDataFrame([([1, 20, 3, 5],)], ['data'])
>>> df.select(sf.shuffle(df.data)).show() 
+-------------+
|shuffle(data)|
+-------------+
|[1, 3, 20, 5]|
+-------------+

Example 2: Shuffling an array with null values

>>> import pyspark.sql.functions as sf
>>> df = spark.createDataFrame([([1, 20, None, 3],)], ['data'])
>>> df.select(sf.shuffle(df.data)).show() 
+----------------+
|   shuffle(data)|
+----------------+
|[20, 3, NULL, 1]|
+----------------+

Example 3: Shuffling an array with duplicate values

>>> import pyspark.sql.functions as sf
>>> df = spark.createDataFrame([([1, 2, 2, 3, 3, 3],)], ['data'])
>>> df.select(sf.shuffle(df.data)).show() 
+------------------+
|     shuffle(data)|
+------------------+
|[3, 2, 1, 3, 2, 3]|
+------------------+

Example 4: Shuffling an array with different types of elements

>>> import pyspark.sql.functions as sf
>>> df = spark.createDataFrame([(['a', 'b', 'c', 1, 2, 3],)], ['data'])
>>> df.select(sf.shuffle(df.data)).show() 
+------------------+
|     shuffle(data)|
+------------------+
|[1, c, 2, a, b, 3]|
+------------------+
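
Example 5: Shuffling by passing the column name as a string

Because col accepts either a Column or a column name string, the column can also be referenced by name (a supplementary sketch; the order shown is only one possible result):

>>> import pyspark.sql.functions as sf
>>> df = spark.createDataFrame([([1, 20, 3, 5],)], ['data'])
>>> df.select(sf.shuffle('data')).show()
+-------------+
|shuffle(data)|
+-------------+
|[5, 1, 20, 3]|
+-------------+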