pyspark.sql.functions.shuffle
pyspark.sql.functions.shuffle(col)
Array function: Generates a random permutation of the given array.
New in version 2.4.0.
Changed in version 3.4.0: Supports Spark Connect.
Parameters
col : Column or str
    The name of the column or expression to be shuffled.
Returns
Column
    A new column that contains an array of elements in random order.
Notes
The shuffle function is non-deterministic, meaning the order of the output array can be different for each execution.
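A minimal sketch of this behavior, assuming an active SparkSession bound to spark (as in the examples below): evaluating the same shuffle expression twice may print a different ordering each time, so no fixed output is shown here.
>>> import pyspark.sql.functions as sf
>>> df = spark.createDataFrame([([1, 20, 3, 5],)], ['data'])
>>> df.select(sf.shuffle(df.data)).show()  # order varies between runs
>>> df.select(sf.shuffle(df.data)).show()  # likely a different order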
Examples
Example 1: Shuffling a simple array
>>> import pyspark.sql.functions as sf
>>> df = spark.createDataFrame([([1, 20, 3, 5],)], ['data'])
>>> df.select(sf.shuffle(df.data)).show()
+-------------+
|shuffle(data)|
+-------------+
|[1, 3, 20, 5]|
+-------------+
Example 2: Shuffling an array with null values
>>> import pyspark.sql.functions as sf
>>> df = spark.createDataFrame([([1, 20, None, 3],)], ['data'])
>>> df.select(sf.shuffle(df.data)).show()
+----------------+
|   shuffle(data)|
+----------------+
|[20, 3, NULL, 1]|
+----------------+
Example 3: Shuffling an array with duplicate values
>>> import pyspark.sql.functions as sf
>>> df = spark.createDataFrame([([1, 2, 2, 3, 3, 3],)], ['data'])
>>> df.select(sf.shuffle(df.data)).show()
+------------------+
|     shuffle(data)|
+------------------+
|[3, 2, 1, 3, 2, 3]|
+------------------+
Example 4: Shuffling an array with different types of elements
>>> import pyspark.sql.functions as sf
>>> df = spark.createDataFrame([(['a', 'b', 'c', 1, 2, 3],)], ['data'])
>>> df.select(sf.shuffle(df.data)).show()
+------------------+
|     shuffle(data)|
+------------------+
|[1, c, 2, a, b, 3]|
+------------------+
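The examples above pass the Column object df.data directly. Since col also accepts a column name as a string, an equivalent call can be written as below; this is a minimal sketch with the output omitted, because the resulting order is random.
>>> import pyspark.sql.functions as sf
>>> df = spark.createDataFrame([([1, 20, 3, 5],)], ['data'])
>>> df.select(sf.shuffle('data')).show()  # same behavior, column given by name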