pyspark.pandas.DataFrame.diff
DataFrame.diff(periods: int = 1, axis: Union[int, str] = 0) → pyspark.pandas.frame.DataFrame
First discrete difference of element.
Calculates the difference of a DataFrame element compared with another element in the DataFrame (default is the element in the same column of the previous row).
Note
The current implementation of diff uses Spark’s Window without a partition specification. This moves all data into a single partition on a single machine and can cause serious performance degradation. Avoid this method with very large datasets.
- Parameters
- periods : int, default 1
Periods to shift for calculating difference, accepts negative values.
- axis : int, default 0 or ‘index’
Can only be set to 0 now.
- Returns
- diffed : DataFrame
Examples
>>> df = ps.DataFrame({'a': [1, 2, 3, 4, 5, 6],
...                    'b': [1, 1, 2, 3, 5, 8],
...                    'c': [1, 4, 9, 16, 25, 36]}, columns=['a', 'b', 'c'])
>>> df
   a  b   c
0  1  1   1
1  2  1   4
2  3  2   9
3  4  3  16
4  5  5  25
5  6  8  36
>>> df.diff()
     a    b     c
0  NaN  NaN   NaN
1  1.0  0.0   3.0
2  1.0  1.0   5.0
3  1.0  1.0   7.0
4  1.0  2.0   9.0
5  1.0  3.0  11.0
Difference with 3rd previous row
>>> df.diff(periods=3)
     a    b     c
0  NaN  NaN   NaN
1  NaN  NaN   NaN
2  NaN  NaN   NaN
3  3.0  2.0  15.0
4  3.0  4.0  21.0
5  3.0  6.0  27.0
Difference with following row
>>> df.diff(periods=-1)
     a    b     c
0 -1.0  0.0  -3.0
1 -1.0 -1.0  -5.0
2 -1.0 -1.0  -7.0
3 -1.0 -2.0  -9.0
4 -1.0 -3.0 -11.0
5  NaN  NaN   NaN
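The outputs above are consistent with a simple per-column rule: row i holds the value at row i minus the value at row i - periods, and NaN where no such row exists. The following is a minimal pure-Python sketch of that rule, not the Spark-based implementation; the helper name `diff_column` is hypothetical and is used only for illustration.

```python
import math


def diff_column(values, periods=1):
    """Illustrative helper (hypothetical, not part of pyspark.pandas).

    Returns a list where out[i] = values[i] - values[i - periods],
    or NaN when row i - periods falls outside the column.
    """
    n = len(values)
    out = []
    for i in range(n):
        j = i - periods  # index of the row being subtracted
        if 0 <= j < n:
            out.append(float(values[i] - values[j]))
        else:
            out.append(math.nan)  # no counterpart row to compare against
    return out


# Column 'c' from the example DataFrame above.
c = [1, 4, 9, 16, 25, 36]
print(diff_column(c))              # compare with the previous row
print(diff_column(c, periods=3))   # compare with the 3rd previous row
print(diff_column(c, periods=-1))  # compare with the following row
```

A positive periods value leaves the first rows as NaN, and a negative value leaves the last rows as NaN, which matches the placement of NaN in the examples.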