pyspark.pandas.DataFrame.query#
- DataFrame.query(expr, inplace=False)[source]#
Query the columns of a DataFrame with a boolean expression.
Note
Internal columns that start with a ‘__’ prefix can be accessed; however, they are not supposed to be accessed.
Note
This API delegates to Spark SQL, so the syntax follows Spark SQL. Therefore, pandas-specific syntax such as @ is not supported. If you want the pandas syntax, you can work around with DataFrame.pandas_on_spark.apply_batch(), but you should be aware that query_func will be executed on different nodes in a distributed manner. So, for example, to use the @ syntax, make sure the variable is serialized by putting it within the closure as below.
>>> df = ps.DataFrame({'A': range(2000), 'B': range(2000)})
>>> def query_func(pdf):
...     num = 1995
...     return pdf.query('A > @num')
>>> df.pandas_on_spark.apply_batch(query_func)
         A     B
1996  1996  1996
1997  1997  1997
1998  1998  1998
1999  1999  1999
- Parameters
- expr : str
The query string to evaluate.
You can refer to column names that contain spaces by surrounding them in backticks.
For example, if one of your columns is called a a and you want to sum it with b, your query should be `a a` + b.
- inplace : bool
Whether the query should modify the data in place or return a modified copy.
- Returns
- DataFrame
DataFrame resulting from the provided query expression.
Examples
>>> df = ps.DataFrame({'A': range(1, 6),
...                    'B': range(10, 0, -2),
...                    'C C': range(10, 5, -1)})
>>> df
   A   B  C C
0  1  10   10
1  2   8    9
2  3   6    8
3  4   4    7
4  5   2    6
>>> df.query('A > B')
   A  B  C C
4  5  2    6
The previous expression is equivalent to
>>> df[df.A > df.B]
   A  B  C C
4  5  2    6
For columns with spaces in their name, you can use backtick quoting.
>>> df.query('B == `C C`')
   A   B  C C
0  1  10   10
The previous expression is equivalent to
>>> df[df.B == df['C C']]
   A   B  C C
0  1  10   10
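The inplace option filters the DataFrame in place instead of returning a new one. A minimal sketch, assuming the same inplace semantics as pandas (the call returns None and the DataFrame itself is modified); df2 here is a hypothetical example frame:
>>> df2 = ps.DataFrame({'A': range(1, 6), 'B': range(10, 0, -2)})
>>> df2.query('A > 3', inplace=True)
>>> df2
   A  B
3  4  4
4  5  2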