pyspark.pandas.DataFrame.query#
- DataFrame.query(expr, inplace=False)[source]#
Query the columns of a DataFrame with a boolean expression.
Note
Internal columns that start with a ‘__’ prefix can be accessed; however, they are not supposed to be accessed.
Note
This API delegates to Spark SQL, so the syntax follows Spark SQL. Therefore, pandas-specific syntax such as @ is not supported. If you want the pandas syntax, you can work around with DataFrame.pandas_on_spark.apply_batch(), but you should be aware that query_func will be executed on different nodes in a distributed manner. So, for example, to use the @ syntax, make sure the variable is serialized by putting it within the closure as below.
>>> df = ps.DataFrame({'A': range(2000), 'B': range(2000)})
>>> def query_func(pdf):
...     num = 1995
...     return pdf.query('A > @num')
>>> df.pandas_on_spark.apply_batch(query_func)
         A     B
1996  1996  1996
1997  1997  1997
1998  1998  1998
1999  1999  1999
- Parameters
- expr : str
The query string to evaluate.
You can refer to column names that contain spaces by surrounding them in backticks.
For example, if one of your columns is called a a and you want to sum it with b, your query should be `a a` + b.
- inplace : bool
Whether the query should modify the data in place or return a modified copy.
- Returns
- DataFrame
DataFrame resulting from the provided query expression.
Examples
>>> df = ps.DataFrame({'A': range(1, 6),
...                    'B': range(10, 0, -2),
...                    'C C': range(10, 5, -1)})
>>> df
   A   B  C C
0  1  10   10
1  2   8    9
2  3   6    8
3  4   4    7
4  5   2    6
>>> df.query('A > B')
   A  B  C C
4  5  2    6
The previous expression is equivalent to
>>> df[df.A > df.B]
   A  B  C C
4  5  2    6
For columns with spaces in their name, you can use backtick quoting.
>>> df.query('B == `C C`')
   A   B  C C
0  1  10   10
The previous expression is equivalent to
>>> df[df.B == df['C C']]
   A   B  C C
0  1  10   10
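The inplace option filters the DataFrame in place instead of returning a new one. A minimal sketch, assuming the same inplace semantics as pandas (the call returns None and the DataFrame itself is modified); df2 here is a hypothetical example frame:
>>> df2 = ps.DataFrame({'A': range(1, 6), 'B': range(10, 0, -2)})
>>> df2.query('A > 3', inplace=True)
>>> df2
   A  B
3  4  4
4  5  2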