pyspark.sql.DataFrame¶
-
class
pyspark.sql.
DataFrame
(jdf: py4j.java_gateway.JavaObject, sql_ctx: Union[SQLContext, SparkSession])[source]¶ A distributed collection of data grouped into named columns.
New in version 1.3.0.
Changed in version 3.4.0: Supports Spark Connect.
Notes
A DataFrame should only be created as described above. It should not be directly created via using the constructor.
Examples
A
DataFrame
is equivalent to a relational table in Spark SQL, and can be created using various functions inSparkSession
:>>> people = spark.createDataFrame([ ... {"deptId": 1, "age": 40, "name": "Hyukjin Kwon", "gender": "M", "salary": 50}, ... {"deptId": 1, "age": 50, "name": "Takuya Ueshin", "gender": "M", "salary": 100}, ... {"deptId": 2, "age": 60, "name": "Xinrong Meng", "gender": "F", "salary": 150}, ... {"deptId": 3, "age": 20, "name": "Haejoon Lee", "gender": "M", "salary": 200} ... ])
Once created, it can be manipulated using the various domain-specific-language (DSL) functions defined in:
DataFrame
,Column
.To select a column from the
DataFrame
, use the apply method:>>> age_col = people.age
A more concrete example:
>>> # To create DataFrame using SparkSession ... department = spark.createDataFrame([ ... {"id": 1, "name": "PySpark"}, ... {"id": 2, "name": "ML"}, ... {"id": 3, "name": "Spark SQL"} ... ])
>>> people.filter(people.age > 30).join( ... department, people.deptId == department.id).groupBy( ... department.name, "gender").agg({"salary": "avg", "age": "max"}).show() +-------+------+-----------+--------+ | name|gender|avg(salary)|max(age)| +-------+------+-----------+--------+ | ML| F| 150.0| 60| |PySpark| M| 75.0| 50| +-------+------+-----------+--------+
Methods
agg
(*exprs)Aggregate on the entire
DataFrame
without groups (shorthand fordf.groupBy().agg()
).alias
(alias)Returns a new
DataFrame
with an alias set.approxQuantile
(col, probabilities, relativeError)Calculates the approximate quantiles of numerical columns of a
DataFrame
.cache
()Persists the
DataFrame
with the default storage level (MEMORY_AND_DISK).checkpoint
([eager])Returns a checkpointed version of this
DataFrame
.coalesce
(numPartitions)Returns a new
DataFrame
that has exactly numPartitions partitions.colRegex
(colName)Selects column based on the column name specified as a regex and returns it as
Column
.collect
()Returns all the records as a list of
Row
.corr
(col1, col2[, method])Calculates the correlation of two columns of a
DataFrame
as a double value.count
()Returns the number of rows in this
DataFrame
.cov
(col1, col2)Calculate the sample covariance for the given columns, specified by their names, as a double value.
createGlobalTempView
(name)Creates a global temporary view with this
DataFrame
.Creates or replaces a global temporary view using the given name.
createOrReplaceTempView
(name)Creates or replaces a local temporary view with this
DataFrame
.createTempView
(name)Creates a local temporary view with this
DataFrame
.crossJoin
(other)Returns the cartesian product with another
DataFrame
.crosstab
(col1, col2)Computes a pair-wise frequency table of the given columns.
cube
(*cols)Create a multi-dimensional cube for the current
DataFrame
using the specified columns, so we can run aggregations on them.describe
(*cols)Computes basic statistics for numeric and string columns.
distinct
()Returns a new
DataFrame
containing the distinct rows in thisDataFrame
.drop
(*cols)Returns a new
DataFrame
without specified columns.dropDuplicates
([subset])Return a new
DataFrame
with duplicate rows removed, optionally only considering certain columns.dropDuplicatesWithinWatermark
([subset])Return a new
DataFrame
with duplicate rows removed,drop_duplicates
([subset])drop_duplicates()
is an alias fordropDuplicates()
.dropna
([how, thresh, subset])Returns a new
DataFrame
omitting rows with null values.exceptAll
(other)Return a new
DataFrame
containing rows in thisDataFrame
but not in anotherDataFrame
while preserving duplicates.explain
([extended, mode])Prints the (logical and physical) plans to the console for debugging purposes.
fillna
(value[, subset])Replace null values, alias for
na.fill()
.filter
(condition)Filters rows using the given condition.
first
()Returns the first row as a
Row
.foreach
(f)Applies the
f
function to each partition of thisDataFrame
.freqItems
(cols[, support])Finding frequent items for columns, possibly with false positives.
groupBy
(*cols)Groups the
DataFrame
using the specified columns, so we can run aggregation on them.groupby
(*cols)groupby()
is an alias forgroupBy()
.head
([n])Returns the first
n
rows.hint
(name, *parameters)Specifies some hint on the current
DataFrame
.Returns a best-effort snapshot of the files that compose this
DataFrame
.intersect
(other)Return a new
DataFrame
containing rows only in both thisDataFrame
and anotherDataFrame
.intersectAll
(other)Return a new
DataFrame
containing rows in both thisDataFrame
and anotherDataFrame
while preserving duplicates.isEmpty
()Checks if the
DataFrame
is empty and returns a boolean value.isLocal
()Returns
True
if thecollect()
andtake()
methods can be run locally (without any Spark executors).join
(other[, on, how])Joins with another
DataFrame
, using the given join expression.limit
(num)Limits the result count to the number specified.
localCheckpoint
([eager])Returns a locally checkpointed version of this
DataFrame
.mapInArrow
(func, schema[, barrier])Maps an iterator of batches in the current
DataFrame
using a Python native function that takes and outputs a PyArrow’s RecordBatch, and returns the result as aDataFrame
.mapInPandas
(func, schema[, barrier])Maps an iterator of batches in the current
DataFrame
using a Python native function that takes and outputs a pandas DataFrame, and returns the result as aDataFrame
.melt
(ids, values, variableColumnName, …)Unpivot a DataFrame from wide format to long format, optionally leaving identifier columns set.
observe
(observation, *exprs)Define (named) metrics to observe on the DataFrame.
offset
(num)Returns a new :class: DataFrame by skipping the first n rows.
orderBy
(*cols, **kwargs)Returns a new
DataFrame
sorted by the specified column(s).pandas_api
([index_col])Converts the existing DataFrame into a pandas-on-Spark DataFrame.
persist
([storageLevel])Sets the storage level to persist the contents of the
DataFrame
across operations after the first time it is computed.printSchema
([level])Prints out the schema in the tree format.
randomSplit
(weights[, seed])Randomly splits this
DataFrame
with the provided weights.registerTempTable
(name)Registers this
DataFrame
as a temporary table using the given name.repartition
(numPartitions, *cols)Returns a new
DataFrame
partitioned by the given partitioning expressions.repartitionByRange
(numPartitions, *cols)Returns a new
DataFrame
partitioned by the given partitioning expressions.replace
(to_replace[, value, subset])Returns a new
DataFrame
replacing a value with another value.rollup
(*cols)Create a multi-dimensional rollup for the current
DataFrame
using the specified columns, so we can run aggregation on them.sameSemantics
(other)Returns True when the logical query plans inside both
DataFrame
s are equal and therefore return the same results.sample
([withReplacement, fraction, seed])Returns a sampled subset of this
DataFrame
.sampleBy
(col, fractions[, seed])Returns a stratified sample without replacement based on the fraction given on each stratum.
select
(*cols)Projects a set of expressions and returns a new
DataFrame
.selectExpr
(*expr)Projects a set of SQL expressions and returns a new
DataFrame
.Returns a hash code of the logical query plan against this
DataFrame
.show
([n, truncate, vertical])Prints the first
n
rows to the console.sort
(*cols, **kwargs)Returns a new
DataFrame
sorted by the specified column(s).sortWithinPartitions
(*cols, **kwargs)Returns a new
DataFrame
with each partition sorted by the specified column(s).subtract
(other)Return a new
DataFrame
containing rows in thisDataFrame
but not in anotherDataFrame
.summary
(*statistics)Computes specified statistics for numeric and string columns.
tail
(num)Returns the last
num
rows as alist
ofRow
.take
(num)Returns the first
num
rows as alist
ofRow
.to
(schema)Returns a new
DataFrame
where each row is reconciled to match the specified schema.toDF
(*cols)Returns a new
DataFrame
that with new specified column namestoJSON
([use_unicode])Converts a
DataFrame
into aRDD
of string.toLocalIterator
([prefetchPartitions])Returns an iterator that contains all of the rows in this
DataFrame
.toPandas
()Returns the contents of this
DataFrame
as Pandaspandas.DataFrame
.to_koalas
([index_col])to_pandas_on_spark
([index_col])transform
(func, *args, **kwargs)Returns a new
DataFrame
.union
(other)Return a new
DataFrame
containing the union of rows in this and anotherDataFrame
.unionAll
(other)Return a new
DataFrame
containing the union of rows in this and anotherDataFrame
.unionByName
(other[, allowMissingColumns])Returns a new
DataFrame
containing union of rows in this and anotherDataFrame
.unpersist
([blocking])Marks the
DataFrame
as non-persistent, and remove all blocks for it from memory and disk.unpivot
(ids, values, variableColumnName, …)Unpivot a DataFrame from wide format to long format, optionally leaving identifier columns set.
where
(condition)withColumn
(colName, col)Returns a new
DataFrame
by adding a column or replacing the existing column that has the same name.withColumnRenamed
(existing, new)Returns a new
DataFrame
by renaming an existing column.withColumns
(*colsMap)Returns a new
DataFrame
by adding multiple columns or replacing the existing columns that have the same names.withColumnsRenamed
(colsMap)Returns a new
DataFrame
by renaming multiple columns.withMetadata
(columnName, metadata)Returns a new
DataFrame
by updating an existing column with metadata.withWatermark
(eventTime, delayThreshold)Defines an event time watermark for this
DataFrame
.writeTo
(table)Create a write configuration builder for v2 sources.
Attributes
Retrieves the names of all columns in the
DataFrame
as a list.Returns all column names and their data types as a list.
Returns
True
if thisDataFrame
contains one or more sources that continuously return data as it arrives.Returns a
DataFrameNaFunctions
for handling missing values.Returns the content as an
pyspark.RDD
ofRow
.Returns the schema of this
DataFrame
as apyspark.sql.types.StructType
.Returns Spark session that created this
DataFrame
.sql_ctx
Returns a
DataFrameStatFunctions
for statistic functions.Get the
DataFrame
’s current storage level.Interface for saving the content of the non-streaming
DataFrame
out into external storage.Interface for saving the content of the streaming
DataFrame
out into external storage.