API reference

This page provides a list of all publicly accessible modules, methods, and classes in the dask_cudf namespace.

Creating and storing DataFrames

Like Dask, Dask-cuDF supports creating DataFrames from a variety of storage formats. For on-disk formats that are not supported directly in Dask-cuDF, we recommend using Dask’s data reading facilities and then calling from_dask_dataframe() to obtain a Dask-cuDF object.

dask_cudf.from_cudf(data, npartitions=None, chunksize=None, sort=True, name=None)

Create a DataFrame from a cudf.DataFrame.

This function is a thin wrapper around dask.dataframe.from_pandas(). It accepts the same arguments (described below), except that it operates on cuDF rather than pandas objects.

Construct a Dask DataFrame from a Pandas DataFrame

This splits an in-memory Pandas dataframe into several parts and constructs a dask.dataframe from those parts on which Dask.dataframe can operate in parallel. By default, the input dataframe will be sorted by the index to produce cleanly-divided partitions (with known divisions). To preserve the input ordering, make sure the input index is monotonically-increasing. The sort=False option will also avoid reordering, but will not result in known divisions.

Note that, despite parallelism, Dask.dataframe may not always be faster than Pandas. We recommend that you stay with Pandas for as long as possible before switching to Dask.dataframe.

Parameters:
data: pandas.DataFrame or pandas.Series

The DataFrame/Series with which to construct a Dask DataFrame/Series.

npartitions: int, optional

The number of partitions of the index to create. Note that if there are duplicate values or insufficient elements in data.index, the output may have fewer partitions than requested.

chunksize: int, optional

The desired number of rows per index partition to use. Note that depending on the size and index of the dataframe, actual partition sizes may vary.

sort: bool

Sort the input by index first to obtain cleanly divided partitions (with known divisions). If False, the input will not be sorted, and all divisions will be set to None. Default is True.

name: string, optional

An optional keyname for the dataframe. Defaults to hashing the input.

Returns:
dask.DataFrame or dask.Series

A dask DataFrame/Series partitioned along the index

Raises:
TypeError

If something other than a pandas.DataFrame or pandas.Series is passed in.

See also

from_array

Construct a dask.DataFrame from an array that has record dtype

read_csv

Construct a dask.DataFrame from a CSV file

Examples

>>> from dask.dataframe import from_pandas
>>> df = pd.DataFrame(dict(a=list('aabbcc'), b=list(range(6))),
...                   index=pd.date_range(start='20100101', periods=6))
>>> ddf = from_pandas(df, npartitions=3)
>>> ddf.divisions  
(Timestamp('2010-01-01 00:00:00', freq='D'),
 Timestamp('2010-01-03 00:00:00', freq='D'),
 Timestamp('2010-01-05 00:00:00', freq='D'),
 Timestamp('2010-01-06 00:00:00', freq='D'))
>>> ddf = from_pandas(df.a, npartitions=3)  # Works with Series too!
>>> ddf.divisions  
(Timestamp('2010-01-01 00:00:00', freq='D'),
 Timestamp('2010-01-03 00:00:00', freq='D'),
 Timestamp('2010-01-05 00:00:00', freq='D'),
 Timestamp('2010-01-06 00:00:00', freq='D'))
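As a rough illustration of how npartitions relates to the divisions shown above, the following pure-Python sketch picks evenly spaced boundaries from an already sorted index. The helper sketch_divisions is hypothetical and is not the Dask or Dask-cuDF implementation; in particular it ignores duplicate index values, which can reduce the number of partitions in practice.

```python
import math


def sketch_divisions(sorted_index, npartitions):
    """Hypothetical helper: evenly spaced division boundaries for a sorted index."""
    n = len(sorted_index)
    chunksize = math.ceil(n / npartitions)  # approximate rows per partition
    # The first index value of each partition becomes a lower boundary.
    starts = [sorted_index[i] for i in range(0, n, chunksize)]
    # The last index value closes the final partition.
    return tuple(starts) + (sorted_index[-1],)


index = [f"2010-01-0{day}" for day in range(1, 7)]  # six sorted daily labels
divisions = sketch_divisions(index, npartitions=3)
print(divisions)
```

With six rows and npartitions=3, the sketch yields four boundaries (three partitions), which mirrors the shape of the divisions in the example above.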

dask_cudf.from_dask_dataframe(df)

Convert a Dask dask.dataframe.DataFrame to a Dask-cuDF one.

Parameters:
df: dask.dataframe.DataFrame

The Dask dataframe to convert.

Returns:
dask_cudf.DataFrame

A new Dask collection backed by cuDF objects.

dask_cudf.from_delayed(dfs: Delayed | distributed.Future | Iterable[Delayed | distributed.Future], meta=None, divisions: tuple | Literal['sorted'] | None = None, prefix: str = 'from-delayed', verify_meta: bool = True) -> DataFrame | Series

Create a Dask DataFrame from many Dask Delayed objects.

Parameters:
dfs

A dask.delayed.Delayed, a distributed.Future, or an iterable of either of these objects, e.g. returned by client.submit. These comprise the individual partitions of the resulting dataframe. If a single object is provided (not an iterable), then the resulting dataframe will have only one partition.

meta: pd.DataFrame, pd.Series, dict, iterable, tuple, optional

An empty pd.DataFrame or pd.Series that matches the dtypes and column names of the output. This metadata is necessary for many algorithms in dask dataframe to work. For ease of use, some alternative inputs are also available. Instead of a DataFrame, a dict of {name: dtype} or iterable of (name, dtype) can be provided (note that the order of the names should match the order of the columns). Instead of a series, a tuple of (name, dtype) can be used. If not provided, dask will try to infer the metadata. This may lead to unexpected results, so providing meta is recommended. For more information, see dask.dataframe.utils.make_meta.

divisions

Partition boundaries along the index. For a tuple, see https://docs.dask.org/en/latest/dataframe-design.html#partitions. The string ‘sorted’ computes the delayed values to find the index values, and assumes that the indexes are mutually sorted. If None, index information is not used.

prefix

Prefix to prepend to the keys.

verify_meta

If True, check that the partitions have consistent metadata. Defaults to True.
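A minimal sketch of building the empty meta object described above, using only pandas. The column names and dtypes are made up for illustration. With Dask-cuDF the partitions hold cuDF objects, so treat this as a description of the expected schema rather than a tested Dask-cuDF recipe.

```python
import pandas as pd

# Hypothetical schema: column name -> dtype, in the order the columns appear.
schema = {"a": "int64", "b": "float64"}

# An empty DataFrame that carries only column names and dtypes.
meta = pd.DataFrame({name: pd.Series(dtype=dtype) for name, dtype in schema.items()})

print(meta.dtypes)
```

The same schema could also be passed as the dict itself, as noted in the meta description.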

dask_cudf.read_csv(path, blocksize='default', **kwargs)

Read CSV files into a DataFrame.

This API parallelizes the cudf.read_csv() function in the following ways:

It supports loading many files at once using globstrings:

>>> import dask_cudf
>>> df = dask_cudf.read_csv("myfiles.*.csv")

In some cases it can break up large files:

>>> df = dask_cudf.read_csv("largefile.csv", blocksize="256 MiB")

It can read CSV files from external resources (e.g. S3, HTTP, FTP):

>>> df = dask_cudf.read_csv("s3://bucket/myfiles.*.csv")
>>> df = dask_cudf.read_csv("https://www.mycloud.com/sample.csv")

Internally read_csv uses cudf.read_csv() and supports many of the same keyword arguments with the same performance guarantees. See the docstring for cudf.read_csv() for more information on available keyword arguments.

Parameters:
path: str, path object, or file-like object

Either a path to a file (a str, pathlib.Path, or py._path.local.LocalPath), a URL (including http, ftp, and S3 locations), or any object with a read() method (such as a file handle returned by the builtin open() function, or a StringIO).

blocksize: int or str, default “256 MiB”

The target task partition size. If None, a single block is used for each file.

**kwargs: dict

Passthrough keyword arguments that are sent to cudf.read_csv().

Notes

If any of skipfooter/skiprows/nrows are passed, blocksize will default to None.
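To make the effect of blocksize concrete, the sketch below estimates how many partitions a set of files would produce. The helper estimated_partitions and the file sizes are hypothetical, and the real reader may split files differently (for example at line boundaries), so the numbers are only a rough guide.

```python
import math

MIB = 2**20  # bytes in one mebibyte


def estimated_partitions(file_sizes, blocksize=256 * MIB):
    """Hypothetical estimate of the partition count for files of the given byte sizes."""
    if blocksize is None:
        # A single block is used for each file.
        return len(file_sizes)
    # Otherwise each file is cut into blocks of at most `blocksize` bytes.
    return sum(max(1, math.ceil(size / blocksize)) for size in file_sizes)


sizes = [1024 * MIB, 100 * MIB]  # a 1 GiB file and a 100 MiB file
n_blocks = estimated_partitions(sizes)
n_files = estimated_partitions(sizes, blocksize=None)
print(n_blocks, n_files)
```

Under these assumptions the 1 GiB file contributes four blocks and the small file one, while blocksize=None gives one partition per file.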

Examples

>>> import dask_cudf
>>> ddf = dask_cudf.read_csv("sample.csv", usecols=["a", "b"])
>>> ddf.compute()
   a      b
0  1     hi
1  2  hello
2  3     ai

dask_cudf.read_json(url_path, engine='auto', **kwargs)

Read JSON data into a DataFrame.

This function wraps dask.dataframe.read_json(), and passes engine=partial(cudf.read_json, engine="auto") by default.

Parameters:
url_path: str, list of str

Location to read from. If a string, can include a glob character to find a set of file names. Supports protocol specifications such as "s3://".

engine: str or Callable, default “auto”

If str, this value will be used as the engine argument when cudf.read_json() is used to create each partition. If a Callable, this value will be used as the underlying function used to create each partition from JSON data. The default value is “auto”, so that engine=partial(cudf.read_json, engine="auto") will be passed to dask.dataframe.read_json() by default.

**kwargs

Keyword arguments to pass through to dask.dataframe.read_json().

Returns:
DataFrame

Examples

Load single file

>>> from dask_cudf import read_json
>>> read_json('myfile.json')  

Load large line-delimited JSON files using partitions of approximately 256 MiB (2**28 bytes)

>>> read_json('data/file*.json', blocksize=2**28)

Load nested JSON data

>>> read_json('myfile.json')  

dask_cudf.read_orc(path, columns=None, filters=None, storage_options=None, **kwargs)

Read ORC files into a DataFrame.

Note that this function is mostly borrowed from upstream Dask.

Parameters:
path: str or list[str]

Location of file(s), which can be a full URL with protocol specifier, and may include glob character if a single string.

columns: None or list[str]

Columns to load. If None, loads all.

filters: None or list of tuple or list of lists of tuples

If not None, specifies a filter predicate used to filter out row groups using statistics stored for each row group as Parquet metadata. Row groups that do not match the given filter predicate are not read. The predicate is expressed in disjunctive normal form (DNF) like [[('x', '=', 0), ...], ...]. DNF allows arbitrary boolean logical combinations of single column predicates. The innermost tuples each describe a single column predicate. The list of inner predicates is interpreted as a conjunction (AND), forming a more selective and multiple column predicate. Finally, the outermost list combines these filters as a disjunction (OR). Predicates may also be passed as a list of tuples. This form is interpreted as a single conjunction. To express OR in predicates, one must use the (preferred) notation of list of lists of tuples.

storage_options: None or dict

Further parameters to pass to the bytes backend.

Returns:
dask_cudf.DataFrame
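The disjunctive normal form accepted by filters can be sketched with a small pure-Python evaluator. The helper matches and the sample values are hypothetical. The real reader applies predicates to stored statistics in order to skip whole chunks of a file, whereas this sketch only shows how the nested lists combine with AND and OR on a single record.

```python
import operator

# Comparison operators that may appear in a (column, op, value) predicate.
OPS = {
    "=": operator.eq, "==": operator.eq, "!=": operator.ne,
    "<": operator.lt, "<=": operator.le, ">": operator.gt, ">=": operator.ge,
}


def matches(record, filters):
    """Hypothetical helper: evaluate DNF filters against one dict-like record."""
    # A flat list of tuples is interpreted as a single conjunction.
    if filters and isinstance(filters[0], tuple):
        filters = [filters]
    # Outer list: OR over conjunctions. Inner list: AND over predicates.
    return any(
        all(OPS[op](record[col], value) for col, op, value in conjunction)
        for conjunction in filters
    )


record = {"x": 0, "y": 5}
both = matches(record, [("x", "=", 0), ("y", ">", 10)])        # x == 0 AND y > 10
either = matches(record, [[("x", "=", 0)], [("y", ">", 10)]])  # x == 0 OR y > 10
print(both, either)
```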

dask_cudf.read_parquet(path, columns=None, **kwargs)

Read parquet files into a DataFrame.

Calls dask.dataframe.read_parquet() with engine=CudfEngine to coordinate the execution of cudf.read_parquet(), and to ultimately create a DataFrame collection.

See the dask.dataframe.read_parquet() documentation for all available options.

Examples

>>> from dask_cudf import read_parquet
>>> df = read_parquet("/path/to/dataset/")  

When dealing with one or more large Parquet files whose in-memory footprint exceeds 15% of device memory, the split_row_groups argument should be used to map Parquet row-groups to DataFrame partitions (instead of files to partitions). For example, the following code will map each row-group to a distinct partition:

>>> df = read_parquet(..., split_row_groups=True)  

To map multiple row-groups to each partition, an integer can be passed to split_row_groups to specify the maximum number of row-groups allowed in each output partition:

>>> df = read_parquet(..., split_row_groups=10)  
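The mapping from row-groups to partitions described above can be pictured with the following sketch. The helper partition_row_groups is hypothetical and only models a single file; it is not how Dask or Dask-cuDF plans reads internally.

```python
def partition_row_groups(n_row_groups, split_row_groups):
    """Hypothetical helper: group the row-group ids of one file into partitions."""
    ids = list(range(n_row_groups))
    if split_row_groups is False:
        return [ids]  # the whole file maps to one partition
    # True means one row-group per partition; an int is the maximum per partition.
    step = 1 if split_row_groups is True else int(split_row_groups)
    return [ids[i:i + step] for i in range(0, n_row_groups, step)]


per_group = partition_row_groups(25, True)
batched = partition_row_groups(25, 10)
print(len(per_group), [len(p) for p in batched])
```

For a file with 25 row-groups, split_row_groups=True gives 25 partitions in this model, while split_row_groups=10 gives three partitions holding 10, 10, and 5 row-groups.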

dask_cudf.to_orc(df, path, write_index=True, storage_options=None, compression='snappy', compute=True, **kwargs)

Write a DataFrame to ORC file(s) (one file per partition).

Parameters:
df: DataFrame

path: str or pathlib.Path

Destination directory for data. Prepend with protocol like s3:// or hdfs:// for remote data.

write_index: boolean, optional

Whether or not to write the index. Defaults to True.

storage_options: None or dict

Further parameters to pass to the bytes backend.

compression: string or dict, optional

compute: bool, optional

If True (default) then the result is computed immediately. If False then a Delayed object is returned for future computation.

Grouping

As discussed in the Dask documentation for groupby, operations such as groupby, join, and merge that require matching up rows of a DataFrame become significantly more challenging in a parallel setting than they are in serial. Dask-cuDF faces the same challenges. However, for certain groupby operations, we can take advantage of functionality in cuDF that allows us to compute multiple aggregations at once. There are therefore two interfaces to grouping in Dask-cuDF: the general DataFrame.groupby(), which returns a CudfDataFrameGroupBy object, and a specialized groupby_agg(). Generally speaking, you should not need to call groupby_agg() directly, since Dask-cuDF will arrange to call it if possible.
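The kind of request that benefits from computing several aggregations in one pass looks like the sketch below. It uses plain pandas as a stand-in so that it runs without a GPU. The column names are made up, and whether a given Dask-cuDF call is routed through the specialized path depends on the aggregations requested.

```python
import pandas as pd

# Hypothetical data: one key column and two value columns.
df = pd.DataFrame(
    {"key": ["a", "a", "b", "b"], "x": [1, 2, 3, 4], "y": [10.0, 20.0, 30.0, 40.0]}
)

# Several aggregations over several columns, requested in a single agg() call.
result = df.groupby("key").agg({"x": ["sum", "max"], "y": "mean"})
print(result)
```

The result has one row per group and one column per (input column, aggregation) pair.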

class dask_cudf.groupby.CudfDataFrameGroupBy(*args, sort=None, **kwargs)

Bases: DataFrameGroupBy

Attributes

index

Methods

agg([arg, split_every, split_out, ...])

Aggregate using one or more specified operations

aggregate(arg[, split_every, split_out, ...])

Aggregate using one or more specified operations

apply(func, *args, **kwargs)

Parallel version of pandas GroupBy.apply

bfill([limit])

Backward fill the values.

corr([ddof, split_every, split_out, ...])

Compute pairwise correlation of columns, excluding NA/null values.

count([split_every, split_out])

Compute count of group, excluding missing values.

cov([ddof, split_every, split_out, std, ...])

Compute pairwise covariance of columns, excluding NA/null values.

cumcount([axis])

Number each item in each group from 0 to the length of that group - 1.

cumprod([axis, numeric_only])

Cumulative product for each group.

cumsum([axis, numeric_only])

Cumulative sum for each group.

ffill([limit])

Forward fill the values.

fillna([value, method, limit, axis])

Fill NA/NaN values using the specified method.

first([split_every, split_out])

Compute the first non-null entry of each column.

get_group(key)

Construct DataFrame from group with provided name.

idxmax([split_every, split_out, ...])

Return index of first occurrence of maximum over requested axis.

idxmin([split_every, split_out, ...])

Return index of first occurrence of minimum over requested axis.

last([split_every, split_out])

Compute the last non-null entry of each column.

max([split_every, split_out])

Compute max of group values.

mean([split_every, split_out])

Compute mean of groups, excluding missing values.

median([split_every, split_out, ...])

Compute median of groups, excluding missing values.

min([split_every, split_out])

Compute min of group values.

prod([split_every, split_out, ...])

Compute prod of group values.

rolling(window[, min_periods, center, ...])

Provides rolling transformations.

shift([periods, freq, axis, fill_value, meta])

Parallel version of pandas GroupBy.shift

size([split_every, split_out, shuffle_method])

Compute group sizes.

std([split_every, split_out])

Compute standard deviation of groups, excluding missing values.

sum([split_every, split_out])

Compute sum of group values.

transform(func, *args, **kwargs)

Parallel version of pandas GroupBy.transform

var([split_every, split_out])

Compute variance of groups, excluding missing values.

collect

compute

agg(arg=None, split_every=None, split_out=1, shuffle_method=None, **kwargs)

Aggregate using one or more specified operations

Based on pd.core.groupby.DataFrameGroupBy.agg

Parameters:
arg: callable, str, list or dict, optional

Aggregation spec. Accepted combinations are:

  • callable function

  • string function name

  • list of functions and/or function names, e.g. [np.sum, 'mean']

  • dict of column names -> function, function name or list of such.

  • None only if named aggregation syntax is used

split_every: int, optional

Number of intermediate partitions that may be aggregated at once. This defaults to 8. If your intermediate partitions are likely to be small (either due to a small number of groups or a small initial partition size), consider increasing this number for better performance.

split_out: int, optional

Number of output partitions. Default is 1.

shuffle_method: bool or str, optional

Whether a shuffle-based algorithm should be used. A specific algorithm name may also be specified (e.g. "tasks" or "p2p"). The shuffle-based algorithm is likely to be more efficient than shuffle_method=False when split_out>1 and the number of unique groups is large (high cardinality). Default is False when split_out = 1. When split_out > 1, it chooses the algorithm set by the shuffle option in the dask config system, or "tasks" if nothing is set.

kwargs: tuple or pd.NamedAgg, optional

Used for named aggregations where the keywords are the output column names and the values are tuples where the first element is the input column name and the second element is the aggregation function. pandas.NamedAgg can also be used as the value. To use the named aggregation syntax, arg must be set to None.
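The named aggregation syntax described for kwargs can be sketched in plain pandas, which is used here only because it runs without a GPU. The data and the output names total and largest are made up for illustration.

```python
import pandas as pd

# Hypothetical data with one key column and one value column.
df = pd.DataFrame({"key": ["a", "a", "b", "b"], "x": [1, 2, 3, 4]})

# arg is left unset; each keyword names an output column and maps to
# (input column, aggregation function) or an equivalent pd.NamedAgg.
result = df.groupby("key").agg(
    total=("x", "sum"),
    largest=pd.NamedAgg(column="x", aggfunc="max"),
)
print(result)
```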

aggregate(arg, split_every=None, split_out=1, shuffle_method=None)

Aggregate using one or more specified operations

Based on pd.core.groupby.DataFrameGroupBy.aggregate

Parameters:
arg: callable, str, list or dict, optional

Aggregation spec. Accepted combinations are:

  • callable function

  • string function name

  • list of functions and/or function names, e.g. [np.sum, 'mean']

  • dict of column names -> function, function name or list of such.

  • None only if named aggregation syntax is used

split_every: int, optional

Number of intermediate partitions that may be aggregated at once. This defaults to 8. If your intermediate partitions are likely to be small (either due to a small number of groups or a small initial partition size), consider increasing this number for better performance.

split_out: int, optional

Number of output partitions. Default is 1.

shuffle_method: bool or str, optional

Whether a shuffle-based algorithm should be used. A specific algorithm name may also be specified (e.g. "tasks" or "p2p"). The shuffle-based algorithm is likely to be more efficient than shuffle_method=False when split_out>1 and the number of unique groups is large (high cardinality). Default is False when split_out = 1. When split_out > 1, it chooses the algorithm set by the shuffle option in the dask config system, or "tasks" if nothing is set.

kwargs: tuple or pd.NamedAgg, optional

Used for named aggregations where the keywords are the output column names and the values are tuples where the first element is the input column name and the second element is the aggregation function. pandas.NamedAgg can also be used as the value. To use the named aggregation syntax, arg must be set to None.
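The role of split_every can be pictured as the fan-in of a tree reduction over per-partition results. The sketch below is a simplified model with a hypothetical tree_reduce helper and integer stand-ins for partial results; it is not the scheduler's actual task graph.

```python
def tree_reduce(parts, combine, split_every=8):
    """Hypothetical helper: combine partial results at most `split_every` at a time."""
    depth = 0
    while len(parts) > 1:
        # Each pass merges groups of up to `split_every` intermediate results.
        parts = [
            combine(parts[i:i + split_every])
            for i in range(0, len(parts), split_every)
        ]
        depth += 1
    return parts[0], depth


partials = list(range(1, 101))  # stand-ins for 100 per-partition partial sums
total_default, depth_default = tree_reduce(partials, sum)               # 100 -> 13 -> 2 -> 1
total_wide, depth_wide = tree_reduce(partials, sum, split_every=32)     # 100 -> 4 -> 1
print(depth_default, depth_wide)
```

A larger split_every gives a shallower tree with fewer, larger combine steps, which is why it can help when the intermediate results are small.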

apply(func, *args, **kwargs)

Parallel version of pandas GroupBy.apply

This mimics the pandas version except for the following:

  1. If the grouper does not align with the index then this causes a full shuffle. The order of rows within each group may not be preserved.

  2. Dask’s GroupBy.apply is not appropriate for aggregations. For custom aggregations, use dask.dataframe.groupby.Aggregation.

Warning

Pandas’ groupby-apply can be used to apply arbitrary functions, including aggregations that result in one row per group. Dask’s groupby-apply will apply func once on each group, doing a shuffle if needed, such that each group is contained in one partition. When func is a reduction, you’ll end up with one row per group. To apply a custom aggregation with Dask, use dask.dataframe.groupby.Aggregation.
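The reason a dedicated aggregation is preferred over apply for reductions is that a reduction can usually be split into a per-partition step, a combine step, and a finalize step, so no shuffle of whole groups is needed. The sketch below mimics that three-step idea for a grouped mean using plain pandas and two hand-made partitions. The function names chunk, combine, and finalize and the data are illustrative assumptions, not the Dask API.

```python
import pandas as pd

# Two hypothetical partitions of the same logical DataFrame.
part1 = pd.DataFrame({"key": ["a", "b", "a"], "x": [1.0, 2.0, 3.0]})
part2 = pd.DataFrame({"key": ["b", "a"], "x": [4.0, 8.0]})


def chunk(part):
    # Per-partition step: partial sum and count for each group.
    grouped = part.groupby("key")["x"]
    return pd.DataFrame({"sum": grouped.sum(), "count": grouped.count()})


def combine(chunks):
    # Combine step: add up the partial results that share a group key.
    return pd.concat(chunks).groupby(level=0).sum()


def finalize(combined):
    # Final step: turn the combined sums and counts into means.
    return combined["sum"] / combined["count"]


group_means = finalize(combine([chunk(part1), chunk(part2)]))
print(group_means)
```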

Parameters:
func: function

Function to apply

args, kwargs: Scalar, Delayed or object

Arguments and keywords to pass to the function.

meta: pd.DataFrame, pd.Series, dict, iterable, tuple, optional

An empty pd.DataFrame or pd.Series that matches the dtypes and column names of the output. This metadata is necessary for many algorithms in dask dataframe to work. For ease of use, some alternative inputs are also available. Instead of a DataFrame, a dict of {name: dtype} or iterable of (name, dtype) can be provided (note that the order of the names should match the order of the columns). Instead of a series, a tuple of (name, dtype) can be used. If not provided, dask will try to infer the metadata. This may lead to unexpected results, so providing meta is recommended. For more information, see dask.dataframe.utils.make_meta.

Returns:
applied: Series or DataFrame depending on columns keyword

bfill(limit=None)

Backward fill the values.

This docstring was copied from pandas.core.groupby.groupby.GroupBy.bfill.

Some inconsistencies with the Dask version may exist.

Parameters:
limit: int, optional

Limit of how many values to fill.

Returns:
Series or DataFrame

Object with missing values filled.

See also

Series.bfill

Backward fill the missing values in the dataset.

DataFrame.bfill

Backward fill the missing values in the dataset.

Series.fillna

Fill NaN values of a Series.

DataFrame.fillna

Fill NaN values of a DataFrame.

corr(ddof=1, split_every=None, split_out=1, numeric_only=_NoDefault.no_default)

Compute pairwise correlation of columns, excluding NA/null values.

This docstring was copied from pandas.core.frame.DataFrame.corr.

Some inconsistencies with the Dask version may exist.

Groupby correlation: corr(X, Y) = cov(X, Y) / (std_x * std_y)

Parameters:
method: {‘pearson’, ‘kendall’, ‘spearman’} or callable (Not supported in Dask)

Method of correlation:

  • pearson : standard correlation coefficient

  • kendall : Kendall Tau correlation coefficient

  • spearman : Spearman rank correlation

  • callable: callable with input two 1d ndarrays

    and returning a float. Note that the returned matrix from corr will have 1 along the diagonals and will be symmetric regardless of the callable’s behavior.

min_periods: int, optional (Not supported in Dask)

Minimum number of observations required per pair of columns to have a valid result. Currently only available for Pearson and Spearman correlation.

numeric_only: bool, default True

Include only float, int or boolean data.

New in version 1.5.0.

Deprecated since version 1.5.0: The default value of numeric_only will be False in a future version of pandas.

Returns:
DataFrame

Correlation matrix.

See also

DataFrame.corrwith

Compute pairwise correlation with another DataFrame or Series.

Series.corr

Compute the correlation between two Series.

Notes

Pearson, Kendall and Spearman correlation are currently computed using pairwise complete observations.

Examples

>>> def histogram_intersection(a, b):  
...     v = np.minimum(a, b).sum().round(decimals=1)
...     return v
>>> df = pd.DataFrame([(.2, .3), (.0, .6), (.6, .0), (.2, .1)],  
...                   columns=['dogs', 'cats'])
>>> df.corr(method=histogram_intersection)  
      dogs  cats
dogs   1.0   0.3
cats   0.3   1.0
>>> df = pd.DataFrame([(1, 1), (2, np.nan), (np.nan, 3), (4, 4)],  
...                   columns=['dogs', 'cats'])
>>> df.corr(min_periods=3)  
      dogs  cats
dogs   1.0   NaN
cats   NaN   1.0

count(split_every=None, split_out=1)

Compute count of group, excluding missing values.

This docstring was copied from pandas.core.groupby.groupby.GroupBy.count.

Some inconsistencies with the Dask version may exist.

Returns:
Series or DataFrame

Count of values within each group.

See also

Series.groupby

Apply a function groupby to a Series.

DataFrame.groupby

Apply a function groupby to each row or column of a DataFrame.

cov(ddof=1, split_every=None, split_out=1, std=False, numeric_only=_NoDefault.no_default)

Compute pairwise covariance of columns, excluding NA/null values.

This docstring was copied from pandas.core.frame.DataFrame.cov.

Some inconsistencies with the Dask version may exist.

Groupby covariance is accomplished by

  1. Computing intermediate values for sum, count, and the product of all columns: a b c -> a*a, a*b, b*b, b*c, c*c.

  2. The values are then aggregated and the final covariance value is calculated: cov(X, Y) = (mean(X*Y) - Xbar * Ybar) * N / (N - ddof)

When std is True, the correlation is calculated instead.
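The two steps above can be sketched in pure Python: accumulate sums, counts, and products, then derive the covariance (or, with std=True, the correlation) from those totals. The helper cov_from_sums is hypothetical and works on two plain lists rather than on grouped partitions. The data is the dogs/cats pair used in the Examples section of this entry.

```python
import math


def cov_from_sums(xs, ys, ddof=1, std=False):
    """Hypothetical helper: covariance (or correlation) from sums and products."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    # Sample covariance from the accumulated totals.
    cov = (sxy - sx * sy / n) / (n - ddof)
    if not std:
        return cov
    # Variances from the same kind of totals, used to normalize.
    var_x = (sum(x * x for x in xs) - sx * sx / n) / (n - ddof)
    var_y = (sum(y * y for y in ys) - sy * sy / n) / (n - ddof)
    return cov / math.sqrt(var_x * var_y)


dogs = [1, 0, 2, 1]
cats = [2, 3, 0, 1]
cov_dc = cov_from_sums(dogs, cats)
corr_dc = cov_from_sums(dogs, cats, std=True)
print(cov_dc, corr_dc)
```

Because only sums, counts, and products are needed, each partition can contribute its totals independently before the final value is computed.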

Compute the pairwise covariance among the series of a DataFrame. The returned data frame is the covariance matrix of the columns of the DataFrame.

Both NA and null values are automatically excluded from the calculation. (See the note below about bias from missing values.) A threshold can be set for the minimum number of observations for each value created. Comparisons with observations below this threshold will be returned as NaN.

This method is generally used for the analysis of time series data to understand the relationship between different measures across time.

Parameters:
min_periods: int, optional (Not supported in Dask)

Minimum number of observations required per pair of columns to have a valid result.

ddof: int, default 1

Delta degrees of freedom. The divisor used in calculations is N - ddof, where N represents the number of elements.

New in version 1.1.0.

numeric_only: bool, default True

Include only float, int or boolean data.

New in version 1.5.0.

Deprecated since version 1.5.0: The default value of numeric_only will be False in a future version of pandas.

Returns:
DataFrame

The covariance matrix of the series of the DataFrame.

See also

Series.cov

Compute covariance with another Series.

core.window.ewm.ExponentialMovingWindow.cov

Exponential weighted sample covariance.

core.window.expanding.Expanding.cov

Expanding sample covariance.

core.window.rolling.Rolling.cov

Rolling sample covariance.

Notes

Returns the covariance matrix of the DataFrame’s time series. The covariance is normalized by N-ddof.

For DataFrames that have Series that are missing data (assuming that data is missing at random) the returned covariance matrix will be an unbiased estimate of the variance and covariance between the member Series.

However, for many applications this estimate may not be acceptable because the estimate covariance matrix is not guaranteed to be positive semi-definite. This could lead to estimate correlations having absolute values which are greater than one, and/or a non-invertible covariance matrix. See Estimation of covariance matrices for more details.

Examples

>>> df = pd.DataFrame([(1, 2), (0, 3), (2, 0), (1, 1)],  
...                   columns=['dogs', 'cats'])
>>> df.cov()  
          dogs      cats
dogs  0.666667 -1.000000
cats -1.000000  1.666667
>>> np.random.seed(42)  
>>> df = pd.DataFrame(np.random.randn(1000, 5),  
...                   columns=['a', 'b', 'c', 'd', 'e'])
>>> df.cov()  
          a         b         c         d         e
a  0.998438 -0.020161  0.059277 -0.008943  0.014144
b -0.020161  1.059352 -0.008543 -0.024738  0.009826
c  0.059277 -0.008543  1.010670 -0.001486 -0.000271
d -0.008943 -0.024738 -0.001486  0.921297 -0.013692
e  0.014144  0.009826 -0.000271 -0.013692  0.977795

Minimum number of periods

This method also supports an optional min_periods keyword that specifies the required minimum number of non-NA observations for each column pair in order to have a valid result:

>>> np.random.seed(42)  
>>> df = pd.DataFrame(np.random.randn(20, 3),  
...                   columns=['a', 'b', 'c'])
>>> df.loc[df.index[:5], 'a'] = np.nan  
>>> df.loc[df.index[5:10], 'b'] = np.nan  
>>> df.cov(min_periods=12)  
          a         b         c
a  0.316741       NaN -0.150812
b       NaN  1.248003  0.191417
c -0.150812  0.191417  0.895202

cumcount(axis=_NoDefault.no_default)

Number each item in each group from 0 to the length of that group - 1.

This docstring was copied from pandas.core.groupby.groupby.GroupBy.cumcount.

Some inconsistencies with the Dask version may exist.

Essentially this is equivalent to

self.apply(lambda x: pd.Series(np.arange(len(x)), x.index))

Parameters:
ascending: bool, default True (Not supported in Dask)

If False, number in reverse, from length of group - 1 to 0.

Returns:
Series

Sequence number of each element within each group.

See also

ngroup

Number the groups themselves.

Examples

>>> df = pd.DataFrame([['a'], ['a'], ['a'], ['b'], ['b'], ['a']],  
...                   columns=['A'])
>>> df  
   A
0  a
1  a
2  a
3  b
4  b
5  a
>>> df.groupby('A').cumcount()  
0    0
1    1
2    2
3    0
4    1
5    3
dtype: int64
>>> df.groupby('A').cumcount(ascending=False)  
0    3
1    2
2    1
3    1
4    0
5    0
dtype: int64

cumprod(axis=_NoDefault.no_default, numeric_only=_NoDefault.no_default)

Cumulative product for each group.

This docstring was copied from pandas.core.groupby.groupby.GroupBy.cumprod.

Some inconsistencies with the Dask version may exist.

Returns:
Series or DataFrame

See also

Series.groupby

Apply a function groupby to a Series.

DataFrame.groupby

Apply a function groupby to each row or column of a DataFrame.

cumsum(axis=_NoDefault.no_default, numeric_only=_NoDefault.no_default)

Cumulative sum for each group.

This docstring was copied from pandas.core.groupby.groupby.GroupBy.cumsum.

Some inconsistencies with the Dask version may exist.

Returns:
Series or DataFrame

See also

Series.groupby

Apply a function groupby to a Series.

DataFrame.groupby

Apply a function groupby to each row or column of a DataFrame.

ffill(limit=None)

Forward fill the values.

This docstring was copied from pandas.core.groupby.groupby.GroupBy.ffill.

Some inconsistencies with the Dask version may exist.

Parameters:
limit: int, optional

Limit of how many values to fill.

Returns:
Series or DataFrame

Object with missing values filled.

See also

Series.ffill

Forward fill the missing values in the dataset.

DataFrame.ffill

Forward fill the missing values in the dataset.

Series.fillna

Fill NaN values of a Series.

DataFrame.fillna

Fill NaN values of a DataFrame.

fillna(value=None, method=None, limit=None, axis=_NoDefault.no_default)

Fill NA/NaN values using the specified method.

Parameters:
value: scalar, default None

Value to use to fill holes (e.g. 0).

method: {‘bfill’, ‘ffill’, None}, default None

Method to use for filling holes in reindexed Series. ffill: propagate last valid observation forward to next valid. bfill: use next valid observation to fill gap.

axis: {0 or ‘index’, 1 or ‘columns’}

Axis along which to fill missing values.

limit: int, default None

If method is specified, this is the maximum number of consecutive NaN values to forward/backward fill. In other words, if there is a gap with more than this number of consecutive NaNs, it will only be partially filled. If method is not specified, this is the maximum number of entries along the entire axis where NaNs will be filled. Must be greater than 0 if not None.

Returns:
Series or DataFrame

Object with missing values filled
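The effect of limit on a forward fill within groups can be seen in the sketch below. It uses the pandas groupby ffill method as a stand-in, since that runs without a GPU. The data is made up, and behavior in Dask-cuDF may differ in details such as fills across partition boundaries.

```python
import pandas as pd

nan = float("nan")

# Hypothetical data: group "a" has a gap of three missing values, group "b" starts with one.
df = pd.DataFrame(
    {"k": ["a", "a", "a", "a", "b", "b"], "v": [1.0, nan, nan, nan, nan, 2.0]}
)

# Forward fill within each group, filling at most one consecutive missing value.
filled = df.groupby("k")["v"].ffill(limit=1)
print(filled.tolist())
```

Only the first missing value after 1.0 in group "a" is filled. The leading missing value in group "b" stays missing because fills never cross group boundaries.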

first(split_every=None, split_out=1)

Compute the first non-null entry of each column.

This docstring was copied from pandas.core.groupby.groupby.GroupBy.first.

Some inconsistencies with the Dask version may exist.

Parameters:
numeric_only: bool, default False

Include only float, int, boolean columns.

min_count: int, default -1 (Not supported in Dask)

The required number of valid values to perform the operation. If fewer than min_count non-NA values are present the result will be NA.

Returns:
Series or DataFrame

First non-null of values within each group.

See also

DataFrame.groupby

Apply a function groupby to each row or column of a DataFrame.

DataFrame.core.groupby.GroupBy.last

Compute the last non-null entry of each column.

DataFrame.core.groupby.GroupBy.nth

Take the nth row from each group.

Examples

>>> df = pd.DataFrame(dict(A=[1, 1, 3], B=[None, 5, 6], C=[1, 2, 3],  
...                        D=['3/11/2000', '3/12/2000', '3/13/2000']))
>>> df['D'] = pd.to_datetime(df['D'])  
>>> df.groupby("A").first()  
     B  C          D
A
1  5.0  1 2000-03-11
3  6.0  3 2000-03-13
>>> df.groupby("A").first(min_count=2)  
    B    C          D
A
1 NaN  1.0 2000-03-11
3 NaN  NaN        NaT
>>> df.groupby("A").first(numeric_only=True)  
     B  C
A
1  5.0  1
3  6.0  3

get_group(key)

Construct DataFrame from group with provided name.

This docstring was copied from pandas.core.groupby.groupby.GroupBy.get_group.

Some inconsistencies with the Dask version may exist.

Known inconsistencies:

If the group is not present, Dask will return an empty Series/DataFrame.

Parameters:
name: object (Not supported in Dask)

The name of the group to get as a DataFrame.

obj: DataFrame, default None (Not supported in Dask)

The DataFrame to take the DataFrame out of. If it is None, the object groupby was called on will be used.

Returns:
group: same type as obj

idxmax(split_every=None, split_out=1, shuffle_method=None, axis=_NoDefault.no_default, skipna=True, numeric_only=_NoDefault.no_default)

Return index of first occurrence of maximum over requested axis.

This docstring was copied from pandas.core.frame.DataFrame.idxmax.

Some inconsistencies with the Dask version may exist.

NA/null values are excluded.

Parameters:
axis: {0 or ‘index’, 1 or ‘columns’}, default 0

The axis to use. 0 or ‘index’ for row-wise, 1 or ‘columns’ for column-wise.

skipna: bool, default True

Exclude NA/null values. If an entire row/column is NA, the result will be NA.

numeric_only: bool, default False

Include only float, int or boolean data.

New in version 1.5.0.

Returns:
Series

Indexes of maxima along the specified axis.

Raises:
ValueError
  • If the row/column is empty

See also

Series.idxmax

Return index of the maximum element.

Notes

This method is the DataFrame version of ndarray.argmax.

Examples

Consider a dataset containing food consumption in Argentina.

>>> df = pd.DataFrame({'consumption': [10.51, 103.11, 55.48],  
...                    'co2_emissions': [37.2, 19.66, 1712]},
...                    index=['Pork', 'Wheat Products', 'Beef'])
>>> df  
                consumption  co2_emissions
Pork                  10.51         37.20
Wheat Products       103.11         19.66
Beef                  55.48       1712.00

By default, it returns the index for the maximum value in each column.

>>> df.idxmax()  
consumption     Wheat Products
co2_emissions             Beef
dtype: object

To return the index for the maximum value in each row, use axis="columns".

>>> df.idxmax(axis="columns")  
Pork              co2_emissions
Wheat Products     consumption
Beef              co2_emissions
dtype: object
idxmin(split_every=None, split_out=1, shuffle_method=None, axis=_NoDefault.no_default, skipna=True, numeric_only=_NoDefault.no_default)#

Return index of first occurrence of minimum over requested axis.

This docstring was copied from pandas.core.frame.DataFrame.idxmin.

Some inconsistencies with the Dask version may exist.

NA/null values are excluded.

Parameters:
axis{0 or ‘index’, 1 or ‘columns’}, default 0

The axis to use. 0 or ‘index’ for row-wise, 1 or ‘columns’ for column-wise.

skipnabool, default True

Exclude NA/null values. If an entire row/column is NA, the result will be NA.

numeric_onlybool, default False

Include only float, int or boolean data.

New in version 1.5.0.

Returns:
Series

Indexes of minima along the specified axis.

Raises:
ValueError
  • If the row/column is empty

See also

Series.idxmin

Return index of the minimum element.

Notes

This method is the DataFrame version of ndarray.argmin.

Examples

Consider a dataset containing food consumption in Argentina.

>>> df = pd.DataFrame({'consumption': [10.51, 103.11, 55.48],  
...                    'co2_emissions': [37.2, 19.66, 1712]},
...                    index=['Pork', 'Wheat Products', 'Beef'])
>>> df  
                consumption  co2_emissions
Pork                  10.51         37.20
Wheat Products       103.11         19.66
Beef                  55.48       1712.00

By default, it returns the index for the minimum value in each column.

>>> df.idxmin()  
consumption                Pork
co2_emissions    Wheat Products
dtype: object

To return the index for the minimum value in each row, use axis="columns".

>>> df.idxmin(axis="columns")  
Pork                consumption
Wheat Products    co2_emissions
Beef                consumption
dtype: object
last(split_every=None, split_out=1)#

Compute the last non-null entry of each column.

This docstring was copied from pandas.core.groupby.groupby.GroupBy.last.

Some inconsistencies with the Dask version may exist.

Parameters:
numeric_onlybool, default False

Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data.

min_countint, default -1 (Not supported in Dask)

The required number of valid values to perform the operation. If fewer than min_count non-NA values are present the result will be NA.

Returns:
Series or DataFrame

Last non-null of values within each group.

See also

DataFrame.groupby

Apply a function groupby to each row or column of a DataFrame.

DataFrame.core.groupby.GroupBy.first

Compute the first non-null entry of each column.

DataFrame.core.groupby.GroupBy.nth

Take the nth row from each group.

Examples

>>> df = pd.DataFrame(dict(A=[1, 1, 3], B=[5, None, 6], C=[1, 2, 3]))  
>>> df.groupby("A").last()  
     B  C
A
1  5.0  2
3  6.0  3
max(split_every=None, split_out=1)#

Compute max of group values.

This docstring was copied from pandas.core.groupby.groupby.GroupBy.max.

Some inconsistencies with the Dask version may exist.

Parameters:
numeric_onlybool, default False

Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data.

min_countint, default -1 (Not supported in Dask)

The required number of valid values to perform the operation. If fewer than min_count non-NA values are present the result will be NA.

Returns:
Series or DataFrame

Computed max of values within each group.
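
A short pandas sketch of the per-group maximum, with made-up data. Missing values are skipped, so a group with at least one valid entry still produces a result.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 1, 3],
                   "B": [5.0, None, 6.0],
                   "C": [1, 2, 3]})

# One row per distinct value of A, holding the column-wise maximum.
result = df.groupby("A").max()
```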

mean(split_every=None, split_out=1)#

Compute mean of groups, excluding missing values.

This docstring was copied from pandas.core.groupby.groupby.GroupBy.mean.

Some inconsistencies with the Dask version may exist.

Parameters:
numeric_onlybool, default True

Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data.

enginestr, default None (Not supported in Dask)
  • 'cython' : Runs the operation through C-extensions from cython.

  • 'numba' : Runs the operation through JIT compiled code from numba.

  • None : Defaults to 'cython' or globally setting compute.use_numba

New in version 1.4.0.

engine_kwargsdict, default None (Not supported in Dask)
  • For 'cython' engine, there are no accepted engine_kwargs

  • For 'numba' engine, the engine can accept nopython, nogil and parallel dictionary keys. The values must either be True or False. The default engine_kwargs for the 'numba' engine is {'nopython': True, 'nogil': False, 'parallel': False}

New in version 1.4.0.

Returns:
pandas.Series or pandas.DataFrame

See also

Series.groupby

Apply a function groupby to a Series.

DataFrame.groupby

Apply a function groupby to each row or column of a DataFrame.

Examples

>>> df = pd.DataFrame({'A': [1, 1, 2, 1, 2],  
...                    'B': [np.nan, 2, 3, 4, 5],
...                    'C': [1, 2, 1, 1, 2]}, columns=['A', 'B', 'C'])

Groupby one column and return the mean of the remaining columns in each group.

>>> df.groupby('A').mean()  
     B         C
A
1  3.0  1.333333
2  4.0  1.500000

Groupby two columns and return the mean of the remaining column.

>>> df.groupby(['A', 'B']).mean()  
         C
A B
1 2.0  2.0
  4.0  1.0
2 3.0  1.0
  5.0  2.0

Group by one column and return the mean of only a particular column in each group.

>>> df.groupby('A')['B'].mean()  
A
1    3.0
2    4.0
Name: B, dtype: float64
median(split_every=None, split_out=1, shuffle_method=None, numeric_only=_NoDefault.no_default)#

Compute median of groups, excluding missing values.

This docstring was copied from pandas.core.groupby.groupby.GroupBy.median.

Some inconsistencies with the Dask version may exist.

For multiple groupings, the result index will be a MultiIndex.

Parameters:
numeric_onlybool, default True

Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data.

Returns:
Series or DataFrame

Median of values within each group.

See also

Series.groupby

Apply a function groupby to a Series.

DataFrame.groupby

Apply a function groupby to each row or column of a DataFrame.
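
As an illustration of "excluding missing values", here is a small pandas sketch with invented data. The NaN in group 1 is ignored, so the median is taken over the two remaining values.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 1, 1, 2, 2],
                   "B": [1.0, None, 3.0, 4.0, 10.0]})

# Group 1 has valid values 1.0 and 3.0; group 2 has 4.0 and 10.0.
result = df.groupby("A")["B"].median()
```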

min(split_every=None, split_out=1)#

Compute min of group values.

This docstring was copied from pandas.core.groupby.groupby.GroupBy.min.

Some inconsistencies with the Dask version may exist.

Parameters:
numeric_onlybool, default False

Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data.

min_countint, default -1 (Not supported in Dask)

The required number of valid values to perform the operation. If fewer than min_count non-NA values are present the result will be NA.

Returns:
Series or DataFrame

Computed min of values within each group.
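
The effect of numeric_only can be sketched in pandas as follows. The frame and its column names are hypothetical, and whether the Dask-cuDF method honors the option in the same way is not confirmed here.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 1, 3],
                   "label": ["x", "y", "z"],
                   "C": [4, 2, 9]})

# With numeric_only=True the string column is left out of the result.
result = df.groupby("A").min(numeric_only=True)
```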

prod(split_every=None, split_out=1, shuffle_method=None, min_count=None, numeric_only=_NoDefault.no_default)#

Compute prod of group values.

This docstring was copied from pandas.core.groupby.groupby.GroupBy.prod.

Some inconsistencies with the Dask version may exist.

Parameters:
numeric_onlybool, default True

Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data.

min_countint, default 0

The required number of valid values to perform the operation. If fewer than min_count non-NA values are present the result will be NA.

Returns:
Series or DataFrame

Computed prod of values within each group.
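
A pandas sketch of how min_count changes the result for a group with no valid values. The data are invented. With the default, the product over an all-NA group is the multiplicative identity; requiring at least one valid value turns it into NA.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 1, 2], "B": [2.0, 3.0, None]})

# Default: no minimum number of valid values is required.
default = df.groupby("A")["B"].prod()

# Require at least one non-NA value per group.
strict = df.groupby("A")["B"].prod(min_count=1)
```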

rolling(window, min_periods=None, center=False, win_type=None, axis=0)#

Provides rolling transformations.

Note

Since MultiIndexes are not well supported in Dask, this method returns a dataframe with the same index as the original data. The groupby column is not added as the first level of the index like pandas does.

This method works differently from other groupby methods. It does a groupby on each partition (plus some overlap). This means that the output has the same shape and number of partitions as the original.

Parameters:
windowstr, offset

Size of the moving window. This is the number of observations used for calculating the statistic. Data must have a DatetimeIndex.

min_periodsint, default None

Minimum number of observations in window required to have a value (otherwise result is NA).

centerboolean, default False

Set the labels at the center of the window.

win_typestring, default None

Provide a window type. The recognized window types are identical to pandas.

axisint, default 0
Returns:
a Rolling object on which to call a method to compute a statistic

Examples

>>> import dask
>>> ddf = dask.datasets.timeseries(freq="1h")
>>> result = ddf.groupby("name").x.rolling('1D').max()
shift(periods=1, freq=_NoDefault.no_default, axis=_NoDefault.no_default, fill_value=_NoDefault.no_default, meta=_NoDefault.no_default)#

Parallel version of pandas GroupBy.shift

This mimics the pandas version except for the following:

If the grouper does not align with the index then this causes a full shuffle. The order of rows within each group may not be preserved.

Parameters:
periodsDelayed, Scalar or int, default 1

Number of periods to shift.

freqDelayed, Scalar or str, optional

Frequency string.

axisaxis to shift, default 0

Shift direction.

fill_valueScalar, Delayed or object, optional

The scalar value to use for newly introduced missing values.

metapd.DataFrame, pd.Series, dict, iterable, tuple, optional

An empty pd.DataFrame or pd.Series that matches the dtypes and column names of the output. This metadata is necessary for many algorithms in dask dataframe to work. For ease of use, some alternative inputs are also available. Instead of a DataFrame, a dict of {name: dtype} or iterable of (name, dtype) can be provided (note that the order of the names should match the order of the columns). Instead of a series, a tuple of (name, dtype) can be used. If not provided, dask will try to infer the metadata. This may lead to unexpected results, so providing meta is recommended. For more information, see dask.dataframe.utils.make_meta.

Returns:
shiftedSeries or DataFrame shifted within each group.

Examples

>>> import dask
>>> ddf = dask.datasets.timeseries(freq="1h")
>>> result = ddf.groupby("name").shift(1, meta={"id": int, "x": float, "y": float})
size(split_every=None, split_out=1, shuffle_method=None)#

Compute group sizes.

This docstring was copied from pandas.core.groupby.groupby.GroupBy.size.

Some inconsistencies with the Dask version may exist.

Returns:
DataFrame or Series

Number of rows in each group as a Series if as_index is True or a DataFrame if as_index is False.

See also

Series.groupby

Apply a function groupby to a Series.

DataFrame.groupby

Apply a function groupby to each row or column of a DataFrame.
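
A brief pandas sketch contrasting size with count, using made-up data. size counts every row in a group, while count only counts non-NA entries.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 1, 2], "B": [1.0, None, 3.0]})

sizes = df.groupby("A").size()         # rows per group, NA included
counts = df.groupby("A")["B"].count()  # non-NA values of B per group
```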

std(split_every=None, split_out=1)#

Compute standard deviation of groups, excluding missing values.

This docstring was copied from pandas.core.groupby.groupby.GroupBy.std.

Some inconsistencies with the Dask version may exist.

For multiple groupings, the result index will be a MultiIndex.

Parameters:
ddofint, default 1

Degrees of freedom.

enginestr, default None (Not supported in Dask)
  • 'cython' : Runs the operation through C-extensions from cython.

  • 'numba' : Runs the operation through JIT compiled code from numba.

  • None : Defaults to 'cython' or globally setting compute.use_numba

New in version 1.4.0.

engine_kwargsdict, default None (Not supported in Dask)
  • For 'cython' engine, there are no accepted engine_kwargs

  • For 'numba' engine, the engine can accept nopython, nogil and parallel dictionary keys. The values must either be True or False. The default engine_kwargs for the 'numba' engine is {'nopython': True, 'nogil': False, 'parallel': False}

New in version 1.4.0.

numeric_onlybool, default True

Include only float, int or boolean data.

New in version 1.5.0.

Returns:
Series or DataFrame

Standard deviation of values within each group.

See also

Series.groupby

Apply a function groupby to a Series.

DataFrame.groupby

Apply a function groupby to each row or column of a DataFrame.
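
The role of ddof can be sketched in pandas with a two-element group (values invented). The divisor is N - ddof, so ddof=1 gives the sample standard deviation and ddof=0 the population one.

```python
import math
import pandas as pd

df = pd.DataFrame({"A": [1, 1], "B": [2.0, 4.0]})

sample_std = df.groupby("A")["B"].std()            # ddof=1: sqrt(2 / 1)
population_std = df.groupby("A")["B"].std(ddof=0)  # ddof=0: sqrt(2 / 2)
```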

sum(split_every=None, split_out=1)#

Compute sum of group values.

This docstring was copied from pandas.core.groupby.groupby.GroupBy.sum.

Some inconsistencies with the Dask version may exist.

Parameters:
numeric_onlybool, default True

Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data.

min_countint, default 0

The required number of valid values to perform the operation. If fewer than min_count non-NA values are present the result will be NA.

Returns:
Series or DataFrame

Computed sum of values within each group.
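
A pandas sketch of min_count with a threshold above one; the data are illustrative. Group 2 has only a single valid value, so requiring two valid values yields NA for that group.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 1, 2, 2],
                   "B": [1.0, 2.0, 5.0, None]})

default = df.groupby("A")["B"].sum()             # NA values are skipped
strict = df.groupby("A")["B"].sum(min_count=2)   # needs two valid values
```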

transform(func, *args, **kwargs)#

Parallel version of pandas GroupBy.transform

This mimics the pandas version except for the following:

  1. If the grouper does not align with the index then this causes a full shuffle. The order of rows within each group may not be preserved.

  2. Dask’s GroupBy.transform is not appropriate for aggregations. For custom aggregations, use dask.dataframe.groupby.Aggregation.

Warning

Pandas’ groupby-transform can be used to apply arbitrary functions, including aggregations that result in one row per group. Dask’s groupby-transform will apply func once on each group, doing a shuffle if needed, such that each group is contained in one partition. When func is a reduction, you’ll end up with one row per group. To apply a custom aggregation with Dask, use dask.dataframe.groupby.Aggregation.

Parameters:
func: function

Function to apply

args, kwargsScalar, Delayed or object

Arguments and keywords to pass to the function.

metapd.DataFrame, pd.Series, dict, iterable, tuple, optional

An empty pd.DataFrame or pd.Series that matches the dtypes and column names of the output. This metadata is necessary for many algorithms in dask dataframe to work. For ease of use, some alternative inputs are also available. Instead of a DataFrame, a dict of {name: dtype} or iterable of (name, dtype) can be provided (note that the order of the names should match the order of the columns). Instead of a series, a tuple of (name, dtype) can be used. If not provided, dask will try to infer the metadata. This may lead to unexpected results, so providing meta is recommended. For more information, see dask.dataframe.utils.make_meta.

Returns:
appliedSeries or DataFrame depending on columns keyword
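
A pandas sketch of the kind of shape-preserving function that transform is suited to, next to the aggregation that should be used when one row per group is wanted. The frame and column names are hypothetical. With Dask, the same func would be passed together with meta, as recommended above.

```python
import pandas as pd

df = pd.DataFrame({"key": ["a", "a", "b"], "x": [1.0, 3.0, 10.0]})

# Shape-preserving: one output value per input row.
demeaned = df.groupby("key")["x"].transform(lambda s: s - s.mean())

# Reduction: one output value per group, so use an aggregation instead.
means = df.groupby("key")["x"].agg("mean")
```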
var(split_every=None, split_out=1)#

Compute variance of groups, excluding missing values.

This docstring was copied from pandas.core.groupby.groupby.GroupBy.var.

Some inconsistencies with the Dask version may exist.

For multiple groupings, the result index will be a MultiIndex.

Parameters:
ddofint, default 1

Degrees of freedom.

enginestr, default None (Not supported in Dask)
  • 'cython' : Runs the operation through C-extensions from cython.

  • 'numba' : Runs the operation through JIT compiled code from numba.

  • None : Defaults to 'cython' or globally setting compute.use_numba

New in version 1.4.0.

engine_kwargsdict, default None (Not supported in Dask)
  • For 'cython' engine, there are no accepted engine_kwargs

  • For 'numba' engine, the engine can accept nopython, nogil and parallel dictionary keys. The values must either be True or False. The default engine_kwargs for the 'numba' engine is {'nopython': True, 'nogil': False, 'parallel': False}

New in version 1.4.0.

numeric_onlybool, default True

Include only float, int or boolean data.

New in version 1.5.0.

Returns:
Series or DataFrame

Variance of values within each group.

See also

Series.groupby

Apply a function groupby to a Series.

DataFrame.groupby

Apply a function groupby to each row or column of a DataFrame.
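
A pandas sketch of the default ddof=1 on groups of different sizes (values invented). A group with a single observation has no sample variance, so the result for it is NA.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 1, 2], "B": [2.0, 4.0, 7.0]})

sample_var = df.groupby("A")["B"].var()            # divisor N - 1
population_var = df.groupby("A")["B"].var(ddof=0)  # divisor N
```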

dask_cudf.groupby_agg(ddf, gb_cols, aggs_in, split_every=None, split_out=None, dropna=True, sep='___', sort=False, as_index=True, shuffle_method=None)#

Optimized groupby aggregation for Dask-CuDF.

Parameters:
ddfDataFrame

DataFrame object to perform grouping on.

gb_colsstr or list[str]

Column names to group by.

aggs_instr, list, or dict

Aggregations to perform.

split_everyint (optional)

How to group intermediate aggregates.

dropnabool

Drop grouping key values corresponding to NA values.

as_indexbool

Currently ignored.

sortbool

Sort the group keys. Better performance is obtained when not sorting.

shuffle_methodstr (optional)

Control how shuffling of the DataFrame is performed.

sepstr

Internal usage.

See also

DataFrame.groupby

generic groupby of a DataFrame

dask.dataframe.apply_concat_apply

for more description of the split_every argument.

Notes

This “optimized” approach is more performant than the algorithm implemented in DataFrame.apply() because it allows the cuDF backend to perform multiple aggregations at once.

This aggregation algorithm only supports the following options:

  • “collect”

  • “count”

  • “first”

  • “last”

  • “max”

  • “mean”

  • “min”

  • “std”

  • “sum”

  • “var”
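
The idea of performing several aggregations at once can be illustrated with plain pandas. This is only a sketch: the frame and column names are made up, and the assumption that aggs_in accepts a dict of this exact shape rests solely on the "str, list, or dict" type given above.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 1, 2],
                   "x": [1.0, 2.0, 3.0],
                   "y": [4, 5, 6]})

# Hypothetical aggregation spec: column name -> one or more aggregations,
# all taken from the supported list above.
aggs_in = {"x": ["sum", "mean"], "y": "max"}

# A single groupby call computes every requested aggregation.
result = df.groupby("a").agg(aggs_in)
```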

DataFrames and Series#

The core distributed objects provided by Dask-cuDF are the DataFrame and Series. These inherit respectively from dask.dataframe.DataFrame and dask.dataframe.Series, and so the API is essentially identical. The full API is provided below.

class dask_cudf.DataFrame(dsk, name, meta, divisions)#

Bases: _Frame, DataFrame

A distributed Dask DataFrame where the backing dataframe is a cuDF DataFrame.

Typically you would not construct this object directly, but rather use one of Dask-cuDF’s IO routines.

Most operations on Dask DataFrames are supported, with many of the same caveats.

Attributes

attrs

Dictionary of global attributes of this dataset.

divisions

Tuple of npartitions + 1 values, in ascending order, marking the lower/upper bounds of each partition's index.

dtypes

Return data types

iloc

Purely integer-location based indexing for selection by position.

index

Return dask Index instance

known_divisions

Whether divisions are already known

loc

Purely label-location based indexer for selection by label.

ndim

Return dimensionality

npartitions

Return number of partitions

partitions

Slice dataframe by partitions

shape

Return a tuple representing the dimensionality of the DataFrame.

size

Size of the Series or DataFrame as a Delayed object.

values

Return a dask.array of the values of this dataframe

axes

columns

empty
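
To make the divisions attribute above more concrete, the following pure-Python sketch maps an index value to a partition number. The helper partition_for is hypothetical and not part of Dask or Dask-cuDF. It assumes that each interior division opens a new partition and that the final division is the inclusive upper bound of the last partition.

```python
from bisect import bisect_right


def partition_for(divisions, value):
    """Return the number of the partition whose index range holds `value`."""
    if not divisions[0] <= value <= divisions[-1]:
        raise KeyError(value)
    # Assumed convention: the last division is inclusive, so clamp the
    # result to the final partition.
    return min(bisect_right(divisions, value) - 1, len(divisions) - 2)


# Three partitions are described by npartitions + 1 = 4 boundary values.
divisions = (0, 10, 20, 30)
first = partition_for(divisions, 0)
second = partition_for(divisions, 10)
last = partition_for(divisions, 30)
```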

Methods

abs()

Return a Series/DataFrame with absolute numeric value of each element.

add(other[, axis, level, fill_value])

Get Addition of DataFrame or Series and other, element-wise (binary operator add).

add_prefix(prefix)

Prefix labels with string prefix.

add_suffix(suffix)

Suffix labels with string suffix.

align(other[, join, axis, fill_value])

Align two objects on their axes with the specified join method.

all([axis, skipna, split_every, out])

Return whether all elements are True, potentially over an axis.

any([axis, skipna, split_every, out])

Return whether any element is True, potentially over an axis.

append(other[, interleave_partitions])

Append rows of other to the end of caller, returning a new object.

apply(func[, axis, broadcast, raw, reduce, ...])

Parallel version of pandas.DataFrame.apply

applymap(func[, meta])

Apply a function to a Dataframe elementwise.

assign(**kwargs)

Assign new columns to a DataFrame.

astype(dtype)

Cast a pandas object to a specified dtype.

bfill([axis, limit])

Synonym for DataFrame.fillna() with method='bfill'.

categorize([columns, index, split_every])

Convert columns of the DataFrame to category dtype.

clear_divisions()

Forget division information

clip([lower, upper, axis])

Trim values at input threshold(s).

combine(other, func[, fill_value, overwrite])

Perform column-wise combine with another DataFrame.

combine_first(other)

Update null elements with value in the same location in other.

compute(**kwargs)

Compute this dask collection

compute_current_divisions([col])

Compute the current divisions of the DataFrame.

copy([deep])

Make a copy of the dataframe

corr([method, min_periods, numeric_only, ...])

Compute pairwise correlation of columns, excluding NA/null values.

count([axis, split_every, numeric_only])

Count non-NA cells for each column or row.

cov([min_periods, numeric_only, split_every])

Compute pairwise covariance of columns, excluding NA/null values.

cummax([axis, skipna, out])

Return cumulative maximum over a DataFrame or Series axis.

cummin([axis, skipna, out])

Return cumulative minimum over a DataFrame or Series axis.

cumprod([axis, skipna, dtype, out])

Return cumulative product over a DataFrame or Series axis.

cumsum([axis, skipna, dtype, out])

Return cumulative sum over a DataFrame or Series axis.

describe([split_every, percentiles, ...])

Generate descriptive statistics.

diff([periods, axis])

First discrete difference of element.

div(other[, axis, level, fill_value])

Get Floating division of dataframe and other, element-wise (binary operator truediv).

divide(other[, axis, level, fill_value])

Get Floating division of dataframe and other, element-wise (binary operator truediv).

dot(other[, meta])

Compute the dot product between the Series and the columns of other.

drop([labels, axis, columns, errors])

Drop specified labels from rows or columns.

drop_duplicates([subset, split_every, ...])

Return DataFrame with duplicate rows removed.

dropna([how, subset, thresh])

Remove missing values.

enforce_runtime_divisions()

Enforce the current divisions at runtime

eq(other[, axis, level])

Get Equal to of dataframe and other, element-wise (binary operator eq).

eval(expr[, inplace])

Evaluate a string describing operations on DataFrame columns.

explode(column)

Transform each element of a list-like to a row, replicating index values.

ffill([axis, limit])

Synonym for DataFrame.fillna() with method='ffill'.

fillna([value, method, limit, axis])

Fill NA/NaN values using the specified method.

first(offset)

Select initial periods of time series data based on a date offset.

floordiv(other[, axis, level, fill_value])

Get Integer division of DataFrame or Series and other, element-wise (binary operator floordiv).

from_dict(data, *, npartitions[, orient, ...])

Construct a Dask DataFrame from a Python Dictionary

ge(other[, axis, level])

Get Greater than or equal to of dataframe and other, element-wise (binary operator ge).

get_partition(n)

Get a dask DataFrame/Series representing the nth partition.

groupby([by])

Group DataFrame using a mapper or by a Series of columns.

gt(other[, axis, level])

Get Greater than of dataframe and other, element-wise (binary operator gt).

head([n, npartitions, compute])

First n rows of the dataset

idxmax([axis, skipna, split_every, numeric_only])

Return index of first occurrence of maximum over requested axis.

idxmin([axis, skipna, split_every, numeric_only])

Return index of first occurrence of minimum over requested axis.

info([buf, verbose, memory_usage])

Concise summary of a Dask DataFrame.

isin(values)

Whether each element in the DataFrame is contained in values.

isna()

Detect missing values.

isnull()

DataFrame.isnull is an alias for DataFrame.isna.

items()

Iterate over (column name, Series) pairs.

iterrows()

Iterate over DataFrame rows as (index, Series) pairs.

itertuples([index, name])

Iterate over DataFrame rows as namedtuples.

join(other[, shuffle_method])

Join columns of another DataFrame.

kurtosis([axis, fisher, bias, nan_policy, ...])

Return unbiased kurtosis over requested axis.

last(offset)

Select final periods of time series data based on a date offset.

le(other[, axis, level])

Get Less than or equal to of dataframe and other, element-wise (binary operator le).

lt(other[, axis, level])

Get Less than of dataframe and other, element-wise (binary operator lt).

map_overlap(func, before, after, *args, **kwargs)

Apply a function to each partition, sharing rows with adjacent partitions.

map_partitions(func, *args, **kwargs)

Apply Python function on each DataFrame partition.

mask(cond[, other])

Replace values where the condition is True.

max([axis, skipna, split_every, out, ...])

Return the maximum of the values over the requested axis.

mean([axis, skipna, split_every, dtype, ...])

Return the mean of the values over the requested axis.

median([axis, method])

Return the median of the values over the requested axis.

median_approximate([axis, method])

Return the approximate median of the values over the requested axis.

melt([id_vars, value_vars, var_name, ...])

Unpivots a DataFrame from wide format to long format, optionally leaving identifier variables set.

memory_usage([index, deep])

Return the memory usage of each column in bytes.

memory_usage_per_partition([index, deep])

Return the memory usage of each partition

merge(other[, shuffle_method])

Merge the DataFrame with another DataFrame

min([axis, skipna, split_every, out, ...])

Return the minimum of the values over the requested axis.

mod(other[, axis, level, fill_value])

Get Modulo of DataFrame or Series and other, element-wise (binary operator mod).

mode([dropna, split_every, numeric_only])

Get the mode(s) of each element along the selected axis.

mul(other[, axis, level, fill_value])

Get Multiplication of DataFrame or Series and other, element-wise (binary operator mul).

ne(other[, axis, level])

Get Not equal to of dataframe and other, element-wise (binary operator ne).

nlargest([n, columns, split_every])

Return the first n rows ordered by columns in descending order.

notnull()

DataFrame.notnull is an alias for DataFrame.notna.

nsmallest([n, columns, split_every])

Return the first n rows ordered by columns in ascending order.

nunique([split_every, dropna, axis])

Count number of distinct elements in specified axis.

nunique_approx([split_every])

Approximate number of unique rows.

persist(**kwargs)

Persist this dask collection into memory

pipe(func, *args, **kwargs)

Apply chainable functions that expect Series or DataFrames.

pivot_table([index, columns, values, aggfunc])

Create a spreadsheet-style pivot table as a DataFrame.

pop(item)

Return item and drop from frame.

pow(other[, axis, level, fill_value])

Get Exponential of DataFrame or Series and other, element-wise (binary operator pow).

prod([axis, skipna, split_every, dtype, ...])

Return the product of the values over the requested axis.

product([axis, skipna, split_every, dtype, ...])

Return the product of the values over the requested axis.

quantile([q, axis, numeric_only, method])

Approximate row-wise and precise column-wise quantiles of DataFrame

query(expr, **kwargs)

Filter dataframe with complex expression

radd(other[, axis, level, fill_value])

Get Addition of DataFrame or Series and other, element-wise (binary operator radd).

random_split(frac[, random_state, shuffle])

Pseudorandomly split dataframe into different pieces row-wise

rdiv(other[, axis, level, fill_value])

Get Floating division of dataframe and other, element-wise (binary operator rtruediv).

reduction(chunk[, aggregate, combine, meta, ...])

Generic row-wise reductions.

rename([index, columns])

Alter axes labels.

repartition([divisions, npartitions, ...])

Repartition dataframe along new divisions

replace([to_replace, value, regex])

Replace values given in to_replace with value.

resample(rule[, closed, label])

Resample time-series data.

reset_index([drop])

Reset the index to the default index.

rfloordiv(other[, axis, level, fill_value])

Get Integer division of DataFrame or Series and other, element-wise (binary operator rfloordiv).

rmod(other[, axis, level, fill_value])

Get Modulo of DataFrame or Series and other, element-wise (binary operator rmod).

rmul(other[, axis, level, fill_value])

Get Multiplication of DataFrame or Series and other, element-wise (binary operator rmul).

rolling(window[, min_periods, center, ...])

Provides rolling transformations.

round([decimals])

Round a DataFrame to a variable number of decimal places.

rpow(other[, axis, level, fill_value])

Get Exponential of DataFrame or Series and other, element-wise (binary operator rpow).

rsub(other[, axis, level, fill_value])

Get Subtraction of DataFrame or Series and other, element-wise (binary operator rsub).

rtruediv(other[, axis, level, fill_value])

Get Floating division of DataFrame or Series and other, element-wise (binary operator rtruediv).

sample([n, frac, replace, random_state])

Random sample of items

select_dtypes([include, exclude])

Return a subset of the DataFrame's columns based on the column dtypes.

sem([axis, skipna, ddof, split_every, ...])

Return unbiased standard error of the mean over requested axis.

set_index(other[, sorted, divisions, ...])

Set the DataFrame index (row labels) using an existing column.

shift([periods, freq, axis])

Shift index by desired number of periods with an optional time freq.

shuffle(*args[, shuffle_method])

Wraps dask.dataframe DataFrame.shuffle method

skew([axis, bias, nan_policy, out, numeric_only])

Return unbiased skew over requested axis.

sort_values(by[, ignore_index, max_branch, ...])

Sort the dataset by a single column.

squeeze([axis])

Squeeze 1 dimensional axis objects into scalars.

std([axis, skipna, ddof, split_every, ...])

Return sample standard deviation over requested axis.

sub(other[, axis, level, fill_value])

Get Subtraction of DataFrame or Series and other, element-wise (binary operator sub).

sum([axis, skipna, split_every, dtype, out, ...])

Return the sum of the values over the requested axis.

tail([n, compute])

Last n rows of the dataset

to_backend([backend])

Move to a new DataFrame backend

to_bag([index, format])

Create Dask Bag from a Dask DataFrame

to_csv(filename, **kwargs)

Store Dask DataFrame to CSV files

to_dask_array([lengths, meta])

Convert a dask DataFrame to a dask array.

to_dask_dataframe(**kwargs)

Create a dask.dataframe object from a dask_cudf object

to_delayed([optimize_graph])

Convert into a list of dask.delayed objects, one per partition.

to_hdf(path_or_buf, key[, mode, append])

Store Dask Dataframe to Hierarchical Data Format (HDF) files

to_html([max_rows])

Render a DataFrame as an HTML table.

to_json(filename, *args, **kwargs)

See dd.to_json docstring for more information

to_orc(path, **kwargs)

Calls dask_cudf.io.to_orc

to_parquet(path, *args, **kwargs)

Calls dask.dataframe.io.to_parquet with CudfEngine backend

to_records([index, lengths])

Create Dask Array from a Dask Dataframe

to_sql(name, uri[, schema, if_exists, ...])

See dd.to_sql docstring for more information

to_string([max_rows])

Render a DataFrame to a console-friendly tabular output.

to_timestamp([freq, how, axis])

Cast to DatetimeIndex of timestamps, at beginning of period.

truediv(other[, axis, level, fill_value])

Get Floating division of DataFrame or Series and other, element-wise (binary operator truediv).

var([axis, skipna, ddof, split_every, ...])

Return unbiased variance over requested axis.

visualize([filename, format, optimize_graph])

Render the computation of this object's task graph using graphviz.

where(cond[, other])

Replace values where the condition is False.

apply_rows

map

abs()#

Return a Series/DataFrame with absolute numeric value of each element.

This docstring was copied from pandas.core.frame.DataFrame.abs.

Some inconsistencies with the Dask version may exist.

This function only applies to elements that are all numeric.

Returns:
abs

Series/DataFrame containing the absolute value of each element.

See also

numpy.absolute

Calculate the absolute value element-wise.

Notes

For a complex input a + bj (for example, 1.2 + 1j), the absolute value is \(\sqrt{ a^2 + b^2 }\).
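
As a quick check of the formula, the following sketch uses only the Python standard library (not dask_cudf) to compute the modulus of 1.2 + 1j. The variable names are chosen for illustration.

```python
import math

z = 1.2 + 1j  # a = 1.2, b = 1.0

# Modulus computed directly from the formula sqrt(a^2 + b^2).
by_formula = math.sqrt(z.real ** 2 + z.imag ** 2)

# Python's built-in abs() applies the same definition to complex numbers.
by_builtin = abs(z)

print(round(by_formula, 5))
print(math.isclose(by_formula, by_builtin))
```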

Examples

Absolute numeric values in a Series.

>>> s = pd.Series([-1.10, 2, -3.33, 4])  
>>> s.abs()  
0    1.10
1    2.00
2    3.33
3    4.00
dtype: float64

Absolute numeric values in a Series with complex numbers.

>>> s = pd.Series([1.2 + 1j])  
>>> s.abs()  
0    1.56205
dtype: float64

Absolute numeric values in a Series with a Timedelta element.

>>> s = pd.Series([pd.Timedelta('1 days')])  
>>> s.abs()  
0   1 days
dtype: timedelta64[ns]

Select rows with data closest to certain value using argsort (from StackOverflow).

>>> df = pd.DataFrame({  
...     'a': [4, 5, 6, 7],
...     'b': [10, 20, 30, 40],
...     'c': [100, 50, -30, -50]
... })
>>> df  
     a    b    c
0    4   10  100
1    5   20   50
2    6   30  -30
3    7   40  -50
>>> df.loc[(df.c - 43).abs().argsort()]  
     a    b    c
1    5   20   50
0    4   10  100
2    6   30  -30
3    7   40  -50
add(other, axis='columns', level=None, fill_value=None)#

Get Addition of DataFrame or Series and other, element-wise (binary operator add).

This docstring was copied from cudf.core.series.Series.add.

Some inconsistencies with the Dask version may exist.

Equivalent to frame + other, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters:
otherscalar, sequence, Series, or DataFrame

Any single or multiple element data structure, or list-like object.

axisint or string

Only 0 is supported for series, 1 or columns supported for dataframe

levelint or name

Broadcast across a level, matching Index values on the passed MultiIndex level. Not yet supported.

fill_valuefloat or None, default None

Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing.

Returns:
DataFrame or Series

Result of the arithmetic operation.

Examples

DataFrame

>>> df = cudf.DataFrame(  
...     {'angles': [0, 3, 4], 'degrees': [360, 180, 360]},
...     index=['circle', 'triangle', 'rectangle']
... )
>>> df.add(1)  
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361

Series

>>> a = cudf.Series([1, 1, 1, None], index=['a', 'b', 'c', 'd'])  
>>> b = cudf.Series([1, None, 1, None], index=['a', 'b', 'd', 'e'])  
>>> a.add(b)  
a       2
b    <NA>
c    <NA>
d    <NA>
e    <NA>
dtype: int64
>>> a.add(b, fill_value=0)  
a       2
b       1
c       1
d       1
e    <NA>
dtype: int64
add_prefix(prefix)#

Prefix labels with string prefix.

This docstring was copied from pandas.core.frame.DataFrame.add_prefix.

Some inconsistencies with the Dask version may exist.

For Series, the row labels are prefixed. For DataFrame, the column labels are prefixed.

Parameters:
prefixstr

The string to add before each label.

Returns:
Series or DataFrame

New Series or DataFrame with updated labels.

See also

Series.add_suffix

Suffix row labels with string suffix.

DataFrame.add_suffix

Suffix column labels with string suffix.

Examples

>>> s = pd.Series([1, 2, 3, 4])  
>>> s  
0    1
1    2
2    3
3    4
dtype: int64
>>> s.add_prefix('item_')  
item_0    1
item_1    2
item_2    3
item_3    4
dtype: int64
>>> df = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [3, 4, 5, 6]})  
>>> df  
   A  B
0  1  3
1  2  4
2  3  5
3  4  6
>>> df.add_prefix('col_')  
     col_A  col_B
0       1       3
1       2       4
2       3       5
3       4       6
add_suffix(suffix)#

Suffix labels with string suffix.

This docstring was copied from pandas.core.frame.DataFrame.add_suffix.

Some inconsistencies with the Dask version may exist.

For Series, the row labels are suffixed. For DataFrame, the column labels are suffixed.

Parameters:
suffixstr

The string to add after each label.

Returns:
Series or DataFrame

New Series or DataFrame with updated labels.

See also

Series.add_prefix

Prefix row labels with string prefix.

DataFrame.add_prefix

Prefix column labels with string prefix.

Examples

>>> s = pd.Series([1, 2, 3, 4])  
>>> s  
0    1
1    2
2    3
3    4
dtype: int64
>>> s.add_suffix('_item')  
0_item    1
1_item    2
2_item    3
3_item    4
dtype: int64
>>> df = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [3, 4, 5, 6]})  
>>> df  
   A  B
0  1  3
1  2  4
2  3  5
3  4  6
>>> df.add_suffix('_col')  
     A_col  B_col
0       1       3
1       2       4
2       3       5
3       4       6
align(other, join='outer', axis=None, fill_value=None)#

Align two objects on their axes with the specified join method.

This docstring was copied from pandas.core.frame.DataFrame.align.

Some inconsistencies with the Dask version may exist.

Join method is specified for each axis Index.

Parameters:
otherDataFrame or Series
join{‘outer’, ‘inner’, ‘left’, ‘right’}, default ‘outer’
axisallowed axis of the other object, default None

Align on index (0), columns (1), or both (None).

levelint or level name, default None (Not supported in Dask)

Broadcast across a level, matching Index values on the passed MultiIndex level.

copybool, default True (Not supported in Dask)

Always returns new objects. If copy=False and no reindexing is required then original objects are returned.

fill_valuescalar, default np.NaN

Value to use for missing values. Defaults to NaN, but can be any “compatible” value.

method{‘backfill’, ‘bfill’, ‘pad’, ‘ffill’, None}, default None (Not supported in Dask)

Method to use for filling holes in reindexed Series:

  • pad / ffill: propagate last valid observation forward to next valid.

  • backfill / bfill: use NEXT valid observation to fill gap.

limitint, default None (Not supported in Dask)

If method is specified, this is the maximum number of consecutive NaN values to forward/backward fill. In other words, if there is a gap with more than this number of consecutive NaNs, it will only be partially filled. If method is not specified, this is the maximum number of entries along the entire axis where NaNs will be filled. Must be greater than 0 if not None.

fill_axis{0 or ‘index’, 1 or ‘columns’}, default 0 (Not supported in Dask)

Filling axis, method and limit.

broadcast_axis{0 or ‘index’, 1 or ‘columns’}, default None (Not supported in Dask)

Broadcast values along this axis, if aligning two objects of different dimensions.

Returns:
(left, right)(DataFrame, type of other)

Aligned objects.

Examples

>>> df = pd.DataFrame(  
...     [[1, 2, 3, 4], [6, 7, 8, 9]], columns=["D", "B", "E", "A"], index=[1, 2]
... )
>>> other = pd.DataFrame(  
...     [[10, 20, 30, 40], [60, 70, 80, 90], [600, 700, 800, 900]],
...     columns=["A", "B", "C", "D"],
...     index=[2, 3, 4],
... )
>>> df  
   D  B  E  A
1  1  2  3  4
2  6  7  8  9
>>> other  
    A    B    C    D
2   10   20   30   40
3   60   70   80   90
4  600  700  800  900

Align on columns:

>>> left, right = df.align(other, join="outer", axis=1)  
>>> left  
   A  B   C  D  E
1  4  2 NaN  1  3
2  9  7 NaN  6  8
>>> right  
    A    B    C    D   E
2   10   20   30   40 NaN
3   60   70   80   90 NaN
4  600  700  800  900 NaN

We can also align on the index:

>>> left, right = df.align(other, join="outer", axis=0)  
>>> left  
    D    B    E    A
1  1.0  2.0  3.0  4.0
2  6.0  7.0  8.0  9.0
3  NaN  NaN  NaN  NaN
4  NaN  NaN  NaN  NaN
>>> right  
    A      B      C      D
1    NaN    NaN    NaN    NaN
2   10.0   20.0   30.0   40.0
3   60.0   70.0   80.0   90.0
4  600.0  700.0  800.0  900.0

Finally, the default axis=None will align on both index and columns:

>>> left, right = df.align(other, join="outer", axis=None)  
>>> left  
     A    B   C    D    E
1  4.0  2.0 NaN  1.0  3.0
2  9.0  7.0 NaN  6.0  8.0
3  NaN  NaN NaN  NaN  NaN
4  NaN  NaN NaN  NaN  NaN
>>> right  
       A      B      C      D   E
1    NaN    NaN    NaN    NaN NaN
2   10.0   20.0   30.0   40.0 NaN
3   60.0   70.0   80.0   90.0 NaN
4  600.0  700.0  800.0  900.0 NaN
all(axis=None, skipna=True, split_every=False, out=None)#

Return whether all elements are True, potentially over an axis.

This docstring was copied from pandas.core.frame.DataFrame.all.

Some inconsistencies with the Dask version may exist.

Returns True unless there is at least one element within a series or along a Dataframe axis that is False or equivalent (e.g. zero or empty).

Parameters:
axis{0 or ‘index’, 1 or ‘columns’, None}, default 0

Indicate which axis or axes should be reduced. For Series this parameter is unused and defaults to 0.

  • 0 / ‘index’ : reduce the index, return a Series whose index is the original column labels.

  • 1 / ‘columns’ : reduce the columns, return a Series whose index is the original index.

  • None : reduce all axes, return a scalar.

bool_onlybool, default None (Not supported in Dask)

Include only boolean columns. If None, will attempt to use everything, then use only boolean data. Not implemented for Series.

skipnabool, default True

Exclude NA/null values. If the entire row/column is NA and skipna is True, then the result will be True, as for an empty row/column. If skipna is False, then NA are treated as True, because these are not equal to zero.

levelint or level name, default None (Not supported in Dask)

If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series.

Deprecated since version 1.3.0: The level keyword is deprecated. Use groupby instead.

**kwargsany, default None

Additional keywords have no effect but might be accepted for compatibility with NumPy.

Returns:
Series or DataFrame

If level is specified, then, DataFrame is returned; otherwise, Series is returned.

See also

Series.all

Return True if all elements are True.

DataFrame.any

Return True if one (or more) elements are True.

Examples

Series

>>> pd.Series([True, True]).all()  
True
>>> pd.Series([True, False]).all()  
False
>>> pd.Series([], dtype="float64").all()  
True
>>> pd.Series([np.nan]).all()  
True
>>> pd.Series([np.nan]).all(skipna=False)  
True

DataFrames

Create a dataframe from a dictionary.

>>> df = pd.DataFrame({'col1': [True, True], 'col2': [True, False]})  
>>> df  
   col1   col2
0  True   True
1  True  False

Default behaviour checks if values in each column all return True.

>>> df.all()  
col1     True
col2    False
dtype: bool

Specify axis='columns' to check if values in each row all return True.

>>> df.all(axis='columns')  
0     True
1    False
dtype: bool

Or axis=None for whether every value is True.

>>> df.all(axis=None)  
False
any(axis=None, skipna=True, split_every=False, out=None)#

Return whether any element is True, potentially over an axis.

This docstring was copied from pandas.core.frame.DataFrame.any.

Some inconsistencies with the Dask version may exist.

Returns False unless there is at least one element within a series or along a Dataframe axis that is True or equivalent (e.g. non-zero or non-empty).

Parameters:
axis{0 or ‘index’, 1 or ‘columns’, None}, default 0

Indicate which axis or axes should be reduced. For Series this parameter is unused and defaults to 0.

  • 0 / ‘index’ : reduce the index, return a Series whose index is the original column labels.

  • 1 / ‘columns’ : reduce the columns, return a Series whose index is the original index.

  • None : reduce all axes, return a scalar.

bool_onlybool, default None (Not supported in Dask)

Include only boolean columns. If None, will attempt to use everything, then use only boolean data. Not implemented for Series.

skipnabool, default True

Exclude NA/null values. If the entire row/column is NA and skipna is True, then the result will be False, as for an empty row/column. If skipna is False, then NA are treated as True, because these are not equal to zero.

levelint or level name, default None (Not supported in Dask)

If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series.

Deprecated since version 1.3.0: The level keyword is deprecated. Use groupby instead.

**kwargsany, default None

Additional keywords have no effect but might be accepted for compatibility with NumPy.

Returns:
Series or DataFrame

If level is specified, then, DataFrame is returned; otherwise, Series is returned.

See also

numpy.any

Numpy version of this method.

Series.any

Return whether any element is True.

Series.all

Return whether all elements are True.

DataFrame.any

Return whether any element is True over requested axis.

DataFrame.all

Return whether all elements are True over requested axis.

Examples

Series

For Series input, the output is a scalar indicating whether any element is True.

>>> pd.Series([False, False]).any()  
False
>>> pd.Series([True, False]).any()  
True
>>> pd.Series([], dtype="float64").any()  
False
>>> pd.Series([np.nan]).any()  
False
>>> pd.Series([np.nan]).any(skipna=False)  
True

DataFrame

Whether each column contains at least one True element (the default).

>>> df = pd.DataFrame({"A": [1, 2], "B": [0, 2], "C": [0, 0]})  
>>> df  
   A  B  C
0  1  0  0
1  2  2  0
>>> df.any()  
A     True
B     True
C    False
dtype: bool

Aggregating over the columns.

>>> df = pd.DataFrame({"A": [True, False], "B": [1, 2]})  
>>> df  
       A  B
0   True  1
1  False  2
>>> df.any(axis='columns')  
0    True
1    True
dtype: bool
>>> df = pd.DataFrame({"A": [True, False], "B": [1, 0]})  
>>> df  
       A  B
0   True  1
1  False  0
>>> df.any(axis='columns')  
0     True
1    False
dtype: bool

Aggregating over the entire DataFrame with axis=None.

>>> df.any(axis=None)  
True

any for an empty DataFrame is an empty Series.

>>> pd.DataFrame([]).any()  
Series([], dtype: bool)
append(other, interleave_partitions=False)#

Append rows of other to the end of caller, returning a new object.

This docstring was copied from pandas.core.frame.DataFrame.append.

Some inconsistencies with the Dask version may exist.

Deprecated since version 1.4.0: Use concat() instead. For further details, see “Deprecated DataFrame.append and Series.append” in the pandas 1.4.0 release notes.

Columns in other that are not in the caller are added as new columns.

Parameters:
otherDataFrame or Series/dict-like object, or list of these

The data to append.

ignore_indexbool, default False (Not supported in Dask)

If True, the resulting axis will be labeled 0, 1, …, n - 1.

verify_integritybool, default False (Not supported in Dask)

If True, raise ValueError on creating index with duplicates.

sortbool, default False (Not supported in Dask)

Sort columns if the columns of self and other are not aligned.

Changed in version 1.0.0: Changed to not sort by default.

Returns:
DataFrame

A new DataFrame consisting of the rows of caller and the rows of other.

See also

concat

General function to concatenate DataFrame or Series objects.

Notes

If a list of dict/series is passed and the keys are all contained in the DataFrame’s index, the order of the columns in the resulting DataFrame will be unchanged.

Iteratively appending rows to a DataFrame can be more computationally intensive than a single concatenate. A better solution is to append those rows to a list and then concatenate the list with the original DataFrame all at once.

Examples

>>> df = pd.DataFrame([[1, 2], [3, 4]], columns=list('AB'), index=['x', 'y'])  
>>> df  
   A  B
x  1  2
y  3  4
>>> df2 = pd.DataFrame([[5, 6], [7, 8]], columns=list('AB'), index=['x', 'y'])  
>>> df.append(df2)  
   A  B
x  1  2
y  3  4
x  5  6
y  7  8

With ignore_index set to True:

>>> df.append(df2, ignore_index=True)  
   A  B
0  1  2
1  3  4
2  5  6
3  7  8

The following, while not recommended methods for generating DataFrames, show two ways to generate a DataFrame from multiple data sources.

Less efficient:

>>> df = pd.DataFrame(columns=['A'])  
>>> for i in range(5):  
...     df = df.append({'A': i}, ignore_index=True)
>>> df  
   A
0  0
1  1
2  2
3  3
4  4

More efficient:

>>> pd.concat([pd.DataFrame([i], columns=['A']) for i in range(5)],  
...           ignore_index=True)
   A
0  0
1  1
2  2
3  3
4  4
apply(func, axis=0, broadcast=None, raw=False, reduce=None, args=(), meta=_NoDefault.no_default, result_type=None, **kwds)#

Parallel version of pandas.DataFrame.apply

This mimics the pandas version except for the following:

  1. Only axis=1 is supported (and must be specified explicitly).

  2. The user should provide output metadata via the meta keyword.

Parameters:
funcfunction

Function to apply to each column/row

axis{0 or ‘index’, 1 or ‘columns’}, default 0
  • 0 or ‘index’: apply function to each column (NOT SUPPORTED)

  • 1 or ‘columns’: apply function to each row

metapd.DataFrame, pd.Series, dict, iterable, tuple, optional

An empty pd.DataFrame or pd.Series that matches the dtypes and column names of the output. This metadata is necessary for many algorithms in dask dataframe to work. For ease of use, some alternative inputs are also available. Instead of a DataFrame, a dict of {name: dtype} or iterable of (name, dtype) can be provided (note that the order of the names should match the order of the columns). Instead of a series, a tuple of (name, dtype) can be used. If not provided, dask will try to infer the metadata. This may lead to unexpected results, so providing meta is recommended. For more information, see dask.dataframe.utils.make_meta.

argstuple

Positional arguments to pass to function in addition to the array/series

kwds

Additional keyword arguments will be passed as keywords to the function

Returns:
appliedSeries or DataFrame

See also

dask.DataFrame.map_partitions

Examples

>>> import pandas as pd
>>> import dask.dataframe as dd
>>> df = pd.DataFrame({'x': [1, 2, 3, 4, 5],
...                    'y': [1., 2., 3., 4., 5.]})
>>> ddf = dd.from_pandas(df, npartitions=2)

Apply a function row-wise, passing in extra arguments in args and kwargs:

>>> def myadd(row, a, b=1):
...     return row.sum() + a + b
>>> res = ddf.apply(myadd, axis=1, args=(2,), b=1.5)  

By default, dask tries to infer the output metadata by running your provided function on some fake data. This works well in many cases, but can sometimes be expensive, or even fail. To avoid this, you can manually specify the output metadata with the meta keyword. This can be specified in many forms, for more information see dask.dataframe.utils.make_meta.

Here we specify the output is a Series with name 'x', and dtype float64:

>>> res = ddf.apply(myadd, axis=1, args=(2,), b=1.5, meta=('x', 'f8'))

In the case where the metadata doesn’t change, you can also pass in the object itself directly:

>>> res = ddf.apply(lambda row: row + 1, axis=1, meta=ddf)
applymap(func, meta=_NoDefault.no_default)#

Apply a function to a Dataframe elementwise.

This docstring was copied from pandas.core.frame.DataFrame.applymap.

Some inconsistencies with the Dask version may exist.

This method applies a function that accepts and returns a scalar to every element of a DataFrame.

Parameters:
funccallable

Python function, returns a single value from a single value.

na_action{None, ‘ignore’}, default None (Not supported in Dask)

If ‘ignore’, propagate NaN values, without passing them to func.

New in version 1.2.

**kwargs

Additional keyword arguments to pass to func.

New in version 1.3.0.

Returns:
DataFrame

Transformed DataFrame.

See also

DataFrame.apply

Apply a function along input axis of DataFrame.

Examples

>>> df = pd.DataFrame([[1, 2.12], [3.356, 4.567]])  
>>> df  
       0      1
0  1.000  2.120
1  3.356  4.567
>>> df.applymap(lambda x: len(str(x)))  
   0  1
0  3  4
1  5  5

Like Series.map, NA values can be ignored:

>>> df_copy = df.copy()  
>>> df_copy.iloc[0, 0] = pd.NA  
>>> df_copy.applymap(lambda x: len(str(x)), na_action='ignore')  
     0  1
0  NaN  4
1  5.0  5

Note that a vectorized version of func often exists, which will be much faster. You could square each number elementwise.

>>> df.applymap(lambda x: x**2)  
           0          1
0   1.000000   4.494400
1  11.262736  20.857489

But it’s better to avoid applymap in that case.

>>> df ** 2  
           0          1
0   1.000000   4.494400
1  11.262736  20.857489
assign(**kwargs)#

Assign new columns to a DataFrame.

This docstring was copied from pandas.core.frame.DataFrame.assign.

Some inconsistencies with the Dask version may exist.

Returns a new object with all original columns in addition to new ones. Existing columns that are re-assigned will be overwritten.

Parameters:
**kwargsdict of {str: callable or Series}

The column names are keywords. If the values are callable, they are computed on the DataFrame and assigned to the new columns. The callable must not change input DataFrame (though pandas doesn’t check it). If the values are not callable, (e.g. a Series, scalar, or array), they are simply assigned.

Returns:
DataFrame

A new DataFrame with the new columns in addition to all the existing columns.

Notes

Assigning multiple columns within the same assign is possible. Later items in ‘**kwargs’ may refer to newly created or modified columns in ‘df’; items are computed and assigned into ‘df’ in order.

Examples

>>> df = pd.DataFrame({'temp_c': [17.0, 25.0]},  
...                   index=['Portland', 'Berkeley'])
>>> df  
          temp_c
Portland    17.0
Berkeley    25.0

Where the value is a callable, evaluated on df:

>>> df.assign(temp_f=lambda x: x.temp_c * 9 / 5 + 32)  
          temp_c  temp_f
Portland    17.0    62.6
Berkeley    25.0    77.0

Alternatively, the same behavior can be achieved by directly referencing an existing Series or sequence:

>>> df.assign(temp_f=df['temp_c'] * 9 / 5 + 32)  
          temp_c  temp_f
Portland    17.0    62.6
Berkeley    25.0    77.0

You can create multiple columns within the same assign where one of the columns depends on another one defined within the same assign:

>>> df.assign(temp_f=lambda x: x['temp_c'] * 9 / 5 + 32,  
...           temp_k=lambda x: (x['temp_f'] +  459.67) * 5 / 9)
          temp_c  temp_f  temp_k
Portland    17.0    62.6  290.15
Berkeley    25.0    77.0  298.15
astype(dtype)#

Cast a pandas object to the specified dtype.

This docstring was copied from pandas.core.frame.DataFrame.astype.

Some inconsistencies with the Dask version may exist.

Parameters:
dtypedata type, or dict of column name -> data type

Use a numpy.dtype or Python type to cast entire pandas object to the same type. Alternatively, use {col: dtype, …}, where col is a column label and dtype is a numpy.dtype or Python type to cast one or more of the DataFrame’s columns to column-specific types.

copybool, default True (Not supported in Dask)

Return a copy when copy=True (be very careful setting copy=False as changes to values then may propagate to other pandas objects).

errors{‘raise’, ‘ignore’}, default ‘raise’ (Not supported in Dask)

Control raising of exceptions on invalid data for provided dtype.

  • raise : allow exceptions to be raised

  • ignore : suppress exceptions. On error return original object.

Returns:
castedsame type as caller

See also

to_datetime

Convert argument to datetime.

to_timedelta

Convert argument to timedelta.

to_numeric

Convert argument to a numeric type.

numpy.ndarray.astype

Cast a numpy array to a specified type.

Notes

Deprecated since version 1.3.0: Using astype to convert from timezone-naive dtype to timezone-aware dtype is deprecated and will raise in a future version. Use Series.dt.tz_localize() instead.

Examples

Create a DataFrame:

>>> d = {'col1': [1, 2], 'col2': [3, 4]}  
>>> df = pd.DataFrame(data=d)  
>>> df.dtypes  
col1    int64
col2    int64
dtype: object

Cast all columns to int32:

>>> df.astype('int32').dtypes  
col1    int32
col2    int32
dtype: object

Cast col1 to int32 using a dictionary:

>>> df.astype({'col1': 'int32'}).dtypes  
col1    int32
col2    int64
dtype: object

Create a series:

>>> ser = pd.Series([1, 2], dtype='int32')  
>>> ser  
0    1
1    2
dtype: int32
>>> ser.astype('int64')  
0    1
1    2
dtype: int64

Convert to categorical type:

>>> ser.astype('category')  
0    1
1    2
dtype: category
Categories (2, int64): [1, 2]

Convert to ordered categorical type with custom ordering:

>>> from pandas.api.types import CategoricalDtype  
>>> cat_dtype = CategoricalDtype(  
...     categories=[2, 1], ordered=True)
>>> ser.astype(cat_dtype)  
0    1
1    2
dtype: category
Categories (2, int64): [2 < 1]

Note that using copy=False and changing data on a new pandas object may propagate changes:

>>> s1 = pd.Series([1, 2])  
>>> s2 = s1.astype('int64', copy=False)  
>>> s2[0] = 10  
>>> s1  # note that s1[0] has changed too  
0    10
1     2
dtype: int64

Create a series of dates:

>>> ser_date = pd.Series(pd.date_range('20200101', periods=3))  
>>> ser_date  
0   2020-01-01
1   2020-01-02
2   2020-01-03
dtype: datetime64[ns]
property attrs#

Dictionary of global attributes of this dataset.

This docstring was copied from pandas.core.frame.DataFrame.attrs.

Some inconsistencies with the Dask version may exist.

Warning

attrs is experimental and may change without warning.

See also

DataFrame.flags

Global flags applying to this object.
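
A minimal sketch of how attrs behaves, shown on a plain pandas DataFrame because the docstring above is copied from pandas. Whether a Dask-cuDF collection carries attrs through its operations is not stated on this page, so treat this as illustrative only. The "units" key is made up for the example.

```python
import pandas as pd

df = pd.DataFrame({"temp": [17.0, 25.0]})

# attrs is a plain dict attached to the object; it starts out empty.
print(df.attrs)

# Store free-form, dataset-level metadata (the "units" key is hypothetical).
df.attrs["units"] = "celsius"
print(df.attrs)
```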

bfill(axis=None, limit=None)#

Synonym for DataFrame.fillna() with method='bfill'.

This docstring was copied from pandas.core.frame.DataFrame.bfill.

Some inconsistencies with the Dask version may exist.

Returns:
Series/DataFrame or None

Object with missing values filled or None if inplace=True.
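
The sketch below illustrates backward filling on a small pandas Series, since the docstring above is copied from pandas. It assumes the Dask-cuDF method follows the same semantics, which this page does not spell out.

```python
import pandas as pd

s = pd.Series([None, 2.0, None, None, 5.0, None])

# Each missing value takes the next valid observation that follows it.
# The trailing missing value has nothing after it, so it stays missing.
filled = s.bfill()

# limit caps how many consecutive missing values are filled in each gap.
filled_limited = s.bfill(limit=1)

print(filled.tolist())
print(filled_limited.tolist())
```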

categorize(columns=None, index=None, split_every=None, **kwargs)#

Convert columns of the DataFrame to category dtype.

Parameters:
columnslist, optional

A list of column names to convert to categoricals. By default any column with an object dtype is converted to a categorical, and any unknown categoricals are made known.

indexbool, optional

Whether to categorize the index. By default, object indices are converted to categorical, and unknown categorical indices are made known. Set True to always categorize the index, False to never.

split_everyint, optional

Group partitions into groups of this size while performing a tree-reduction. If set to False, no tree-reduction will be used. Default is 16.

kwargs

Keyword arguments are passed on to compute.

clear_divisions()#

Forget division information

clip(lower=None, upper=None, axis=None)#

Trim values at input threshold(s).

This docstring was copied from pandas.core.frame.DataFrame.clip.

Some inconsistencies with the Dask version may exist.

Assigns values outside boundary to boundary values. Thresholds can be singular values or array like, and in the latter case the clipping is performed element-wise in the specified axis.

Parameters:
lowerfloat or array-like, default None

Minimum threshold value. All values below this threshold will be set to it. A missing threshold (e.g NA) will not clip the value.

upperfloat or array-like, default None

Maximum threshold value. All values above this threshold will be set to it. A missing threshold (e.g NA) will not clip the value.

axis{0 or ‘index’, 1 or ‘columns’, None}, default None

Align object with lower and upper along the given axis. For Series this parameter is unused and defaults to None.

inplacebool, default False (Not supported in Dask)

Whether to perform the operation in place on the data.

*args, **kwargs

Additional keywords have no effect but might be accepted for compatibility with numpy.

Returns:
Series or DataFrame or None

Same type as calling object with the values outside the clip boundaries replaced or None if inplace=True.

See also

Series.clip

Trim values at input threshold in series.

DataFrame.clip

Trim values at input threshold in dataframe.

numpy.clip

Clip (limit) the values in an array.

Examples

>>> data = {'col_0': [9, -3, 0, -1, 5], 'col_1': [-2, -7, 6, 8, -5]}  
>>> df = pd.DataFrame(data)  
>>> df  
   col_0  col_1
0      9     -2
1     -3     -7
2      0      6
3     -1      8
4      5     -5

Clips per column using lower and upper thresholds:

>>> df.clip(-4, 6)  
   col_0  col_1
0      6     -2
1     -3     -4
2      0      6
3     -1      6
4      5     -4

Clips using specific lower and upper thresholds per column element:

>>> t = pd.Series([2, -4, -1, 6, 3])  
>>> t  
0    2
1   -4
2   -1
3    6
4    3
dtype: int64
>>> df.clip(t, t + 4, axis=0)  
   col_0  col_1
0      6      2
1     -3     -4
2      0      3
3      6      8
4      5      3

Clips using specific lower threshold per column element, with missing values:

>>> t = pd.Series([2, -4, np.NaN, 6, 3])  
>>> t  
0    2.0
1   -4.0
2    NaN
3    6.0
4    3.0
dtype: float64
>>> df.clip(t, axis=0)  
   col_0  col_1
0      9      2
1     -3     -4
2      0      6
3      6      8
4      5      3
combine(other, func, fill_value=None, overwrite=True)#

Perform column-wise combine with another DataFrame.

This docstring was copied from pandas.core.frame.DataFrame.combine.

Some inconsistencies with the Dask version may exist.

Combines a DataFrame with other DataFrame using func to element-wise combine columns. The row and column indexes of the resulting DataFrame will be the union of the two.

Parameters:
otherDataFrame

The DataFrame to merge column-wise.

funcfunction

Function that takes two series as inputs and returns a Series or a scalar. Used to merge the two dataframes column by column.

fill_valuescalar value, default None

The value to fill NaNs with prior to passing any column to the merge func.

overwritebool, default True

If True, columns in self that do not exist in other will be overwritten with NaNs.

Returns:
DataFrame

Combination of the provided DataFrames.

See also

DataFrame.combine_first

Combine two DataFrame objects and default to non-null values in frame calling the method.

Examples

Combine using a simple function that chooses the smaller column.

>>> df1 = pd.DataFrame({'A': [0, 0], 'B': [4, 4]})  
>>> df2 = pd.DataFrame({'A': [1, 1], 'B': [3, 3]})  
>>> take_smaller = lambda s1, s2: s1 if s1.sum() < s2.sum() else s2  
>>> df1.combine(df2, take_smaller)  
   A  B
0  0  3
1  0  3

Example using a true element-wise combine function.

>>> df1 = pd.DataFrame({'A': [5, 0], 'B': [2, 4]})  
>>> df2 = pd.DataFrame({'A': [1, 1], 'B': [3, 3]})  
>>> df1.combine(df2, np.minimum)  
   A  B
0  1  2
1  0  3

Using fill_value fills Nones prior to passing the column to the merge function.

>>> df1 = pd.DataFrame({'A': [0, 0], 'B': [None, 4]})  
>>> df2 = pd.DataFrame({'A': [1, 1], 'B': [3, 3]})  
>>> df1.combine(df2, take_smaller, fill_value=-5)  
   A    B
0  0 -5.0
1  0  4.0

If the same element is None in both dataframes, both values are filled with fill_value before the merge function runs, so the result holds fill_value at that position:

>>> df1 = pd.DataFrame({'A': [0, 0], 'B': [None, 4]})  
>>> df2 = pd.DataFrame({'A': [1, 1], 'B': [None, 3]})  
>>> df1.combine(df2, take_smaller, fill_value=-5)  
   A    B
0  0 -5.0
1  0  3.0

Example that demonstrates the use of overwrite and behavior when the axis differ between the dataframes.

>>> df1 = pd.DataFrame({'A': [0, 0], 'B': [4, 4]})  
>>> df2 = pd.DataFrame({'B': [3, 3], 'C': [-10, 1], }, index=[1, 2])  
>>> df1.combine(df2, take_smaller)  
     A    B     C
0  NaN  NaN   NaN
1  NaN  3.0 -10.0
2  NaN  3.0   1.0
>>> df1.combine(df2, take_smaller, overwrite=False)  
     A    B     C
0  0.0  NaN   NaN
1  0.0  3.0 -10.0
2  NaN  3.0   1.0

Demonstrating the preference of the passed in dataframe.

>>> df2 = pd.DataFrame({'B': [3, 3], 'C': [1, 1], }, index=[1, 2])  
>>> df2.combine(df1, take_smaller)  
     A    B   C
0  0.0  NaN NaN
1  0.0  3.0 NaN
2  NaN  3.0 NaN
>>> df2.combine(df1, take_smaller, overwrite=False)  
     A    B   C
0  0.0  NaN NaN
1  0.0  3.0 1.0
2  NaN  3.0 1.0
combine_first(other)#

Update null elements with value in the same location in other.

This docstring was copied from pandas.core.frame.DataFrame.combine_first.

Some inconsistencies with the Dask version may exist.

Combine two DataFrame objects by filling null values in one DataFrame with non-null values from the other DataFrame. The row and column indexes of the resulting DataFrame will be the union of the two. When calling first.combine_first(second), the result keeps the value from first wherever both first.loc[index, col] and second.loc[index, col] are non-missing; values from second are used only where first is missing.

Parameters:
otherDataFrame

Provided DataFrame to use to fill null values.

Returns:
DataFrame

The result of combining the provided DataFrame with the other object.

See also

DataFrame.combine

Perform series-wise operation on two DataFrames using a given function.

Examples

>>> df1 = pd.DataFrame({'A': [None, 0], 'B': [None, 4]})  
>>> df2 = pd.DataFrame({'A': [1, 1], 'B': [3, 3]})  
>>> df1.combine_first(df2)  
     A    B
0  1.0  3.0
1  0.0  4.0

Null values still persist if the location of that null value does not exist in other:

>>> df1 = pd.DataFrame({'A': [None, 0], 'B': [4, None]})  
>>> df2 = pd.DataFrame({'B': [3, 3], 'C': [1, 1]}, index=[1, 2])  
>>> df1.combine_first(df2)  
     A    B    C
0  NaN  4.0  NaN
1  0.0  3.0  1.0
2  NaN  3.0  1.0
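
For two frames that already share the same index and columns, the description above amounts to filling the nulls of the calling frame from other. The following is a minimal sketch using plain pandas objects (an assumption, so that it runs without a GPU); fillna appears only as a point of comparison and is not a statement about how combine_first is implemented.

```python
import numpy as np
import pandas as pd

# Two frames with identical index and columns.
first = pd.DataFrame({"A": [np.nan, 0.0], "B": [np.nan, 4.0]})
second = pd.DataFrame({"A": [1.0, 1.0], "B": [3.0, 3.0]})

combined = first.combine_first(second)

# For aligned frames, filling nulls from `second` gives the same values.
filled = first.fillna(second)

print(combined)
print(np.array_equal(combined.to_numpy(), filled.to_numpy()))
```

When the indexes or columns differ, combine_first additionally takes the union of both, which fillna does not do.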
compute(**kwargs)#

Compute this Dask collection.

This turns a lazy Dask collection into its in-memory equivalent. For example, a Dask array turns into a NumPy array and a Dask dataframe turns into a Pandas dataframe (for a Dask-cuDF collection, the in-memory equivalent is a cuDF object). The entire dataset must fit into memory before calling this operation.

Parameters:
schedulerstring, optional

Which scheduler to use, such as “threads”, “synchronous” or “processes”. If not provided, the default is to check the global settings first, and then fall back to the collection defaults.

optimize_graphbool, optional

If True [default], the graph is optimized before computation. Otherwise the graph is run as is. This can be useful for debugging.

kwargs

Extra keywords to forward to the scheduler function.

See also

dask.compute
compute_current_divisions(col=None)#

Compute the current divisions of the DataFrame.

This method triggers immediate computation. If you find yourself running this command repeatedly for the same dataframe, we recommend storing the result so you don’t have to rerun it.

If the column or index values overlap between partitions, raises ValueError. To prevent this, make sure the data are sorted by the column or index.

Parameters:
colstring, optional

Calculate the divisions for a non-index column by passing in the name of the column. If col is not specified, the index will be used to calculate divisions. In this case, if the divisions are already known, they will be returned immediately without computing.

Examples

>>> import dask
>>> ddf = dask.datasets.timeseries(start="2021-01-01", end="2021-01-07", freq="1h").clear_divisions()
>>> divisions = ddf.compute_current_divisions()
>>> print(divisions)  
(Timestamp('2021-01-01 00:00:00'),
 Timestamp('2021-01-02 00:00:00'),
 Timestamp('2021-01-03 00:00:00'),
 Timestamp('2021-01-04 00:00:00'),
 Timestamp('2021-01-05 00:00:00'),
 Timestamp('2021-01-06 00:00:00'),
 Timestamp('2021-01-06 23:00:00'))
>>> ddf.divisions = divisions
>>> ddf.known_divisions
True
>>> ddf = ddf.reset_index().clear_divisions()
>>> divisions = ddf.compute_current_divisions("timestamp")
>>> print(divisions)  
(Timestamp('2021-01-01 00:00:00'),
 Timestamp('2021-01-02 00:00:00'),
 Timestamp('2021-01-03 00:00:00'),
 Timestamp('2021-01-04 00:00:00'),
 Timestamp('2021-01-05 00:00:00'),
 Timestamp('2021-01-06 00:00:00'),
 Timestamp('2021-01-06 23:00:00'))
>>> ddf = ddf.set_index("timestamp", divisions=divisions, sorted=True)
copy(deep=False)#

Make a copy of the dataframe.

This is strictly a shallow copy of the underlying computational graph. It does not affect the underlying data.

Parameters:
deepboolean, default False

The deep value must be False. It is declared as a parameter only for compatibility with third-party libraries such as cuDF.

corr(method='pearson', min_periods=None, numeric_only=_NoDefault.no_default, split_every=False)#

Compute pairwise correlation of columns, excluding NA/null values.

This docstring was copied from pandas.core.frame.DataFrame.corr.

Some inconsistencies with the Dask version may exist.

Parameters:
method{‘pearson’, ‘kendall’, ‘spearman’} or callable

Method of correlation:

  • pearson : standard correlation coefficient

  • kendall : Kendall Tau correlation coefficient

  • spearman : Spearman rank correlation

  • callable: callable with input two 1d ndarrays and returning a float. Note that the returned matrix from corr will have 1 along the diagonals and will be symmetric regardless of the callable’s behavior.

min_periodsint, optional

Minimum number of observations required per pair of columns to have a valid result. Currently only available for Pearson and Spearman correlation.

numeric_onlybool, default True

Include only float, int or boolean data.

New in version 1.5.0.

Deprecated since version 1.5.0: The default value of numeric_only will be False in a future version of pandas.

Returns:
DataFrame

Correlation matrix.

See also

DataFrame.corrwith

Compute pairwise correlation with another DataFrame or Series.

Series.corr

Compute the correlation between two Series.

Notes

Pearson, Kendall and Spearman correlation are currently computed using pairwise complete observations.

Examples

>>> def histogram_intersection(a, b):  
...     v = np.minimum(a, b).sum().round(decimals=1)
...     return v
>>> df = pd.DataFrame([(.2, .3), (.0, .6), (.6, .0), (.2, .1)],  
...                   columns=['dogs', 'cats'])
>>> df.corr(method=histogram_intersection)  
      dogs  cats
dogs   1.0   0.3
cats   0.3   1.0
>>> df = pd.DataFrame([(1, 1), (2, np.nan), (np.nan, 3), (4, 4)],  
...                   columns=['dogs', 'cats'])
>>> df.corr(min_periods=3)  
      dogs  cats
dogs   1.0   NaN
cats   NaN   1.0
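
The note on pairwise complete observations can be checked by hand: for each pair of columns, only rows where both values are present contribute. The sketch below uses pandas and NumPy as a stand-in (an assumption; the data are made up for illustration) and compares the reported Pearson coefficient with one computed after dropping incomplete rows.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"dogs": [1.0, 2.0, np.nan, 4.0, 5.0], "cats": [2.0, np.nan, 3.0, 5.0, 4.0]}
)

# Pearson correlation as reported by corr().
reported = df.corr().loc["dogs", "cats"]

# Manual version: keep only rows where both columns are present.
complete = df.dropna(subset=["dogs", "cats"])
manual = np.corrcoef(complete["dogs"], complete["cats"])[0, 1]

print(reported, manual)

# Only 3 rows are complete for this pair, so requiring 4 gives NaN for it.
strict = df.corr(min_periods=4)
print(strict)
```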
count(axis=None, split_every=False, numeric_only=False)#

Count non-NA cells for each column or row.

This docstring was copied from pandas.core.frame.DataFrame.count.

Some inconsistencies with the Dask version may exist.

The values None, NaN, NaT, and optionally numpy.inf (depending on pandas.options.mode.use_inf_as_na) are considered NA.

Parameters:
axis{0 or ‘index’, 1 or ‘columns’}, default 0

If 0 or ‘index’ counts are generated for each column. If 1 or ‘columns’ counts are generated for each row.

levelint or str, optional (Not supported in Dask)

If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a DataFrame. A str specifies the level name.

numeric_onlybool, default False

Include only float, int or boolean data.

Returns:
Series or DataFrame

For each column/row the number of non-NA/null entries. If level is specified returns a DataFrame.

See also

Series.count

Number of non-NA elements in a Series.

DataFrame.value_counts

Count unique combinations of columns.

DataFrame.shape

Number of DataFrame rows and columns (including NA elements).

DataFrame.isna

Boolean same-sized DataFrame showing places of NA elements.

Examples

Constructing DataFrame from a dictionary:

>>> df = pd.DataFrame({"Person":  
...                    ["John", "Myla", "Lewis", "John", "Myla"],
...                    "Age": [24., np.nan, 21., 33, 26],
...                    "Single": [False, True, True, True, False]})
>>> df  
   Person   Age  Single
0    John  24.0   False
1    Myla   NaN    True
2   Lewis  21.0    True
3    John  33.0    True
4    Myla  26.0   False

Notice the uncounted NA values:

>>> df.count()  
Person    5
Age       4
Single    5
dtype: int64

Counts for each row:

>>> df.count(axis='columns')  
0    3
1    2
2    3
3    3
4    3
dtype: int64
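
The relationship between count and the isna/notna helpers listed under See also can be sketched as follows, again with plain pandas as an assumption: counting non-NA cells gives the same numbers as summing the boolean notna mask.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "Person": ["John", "Myla", "Lewis", "John", "Myla"],
        "Age": [24.0, np.nan, 21.0, 33.0, 26.0],
        "Single": [False, True, True, True, False],
    }
)

per_column = df.count()

# Summing the boolean mask counts the True (non-NA) entries per column.
via_mask = df.notna().sum()

print(per_column)
print((per_column == via_mask).all())
```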
cov(min_periods=None, numeric_only=_NoDefault.no_default, split_every=False)#

Compute pairwise covariance of columns, excluding NA/null values.

This docstring was copied from pandas.core.frame.DataFrame.cov.

Some inconsistencies with the Dask version may exist.

Compute the pairwise covariance among the series of a DataFrame. The returned data frame is the covariance matrix of the columns of the DataFrame.

Both NA and null values are automatically excluded from the calculation. (See the note below about bias from missing values.) A threshold can be set for the minimum number of observations for each value created. Comparisons with observations below this threshold will be returned as NaN.

This method is generally used for the analysis of time series data to understand the relationship between different measures across time.

Parameters:
min_periodsint, optional

Minimum number of observations required per pair of columns to have a valid result.

ddofint, default 1 (Not supported in Dask)

Delta degrees of freedom. The divisor used in calculations is N - ddof, where N represents the number of elements.

New in version 1.1.0.

numeric_onlybool, default True

Include only float, int or boolean data.

New in version 1.5.0.

Deprecated since version 1.5.0: The default value of numeric_only will be False in a future version of pandas.

Returns:
DataFrame

The covariance matrix of the series of the DataFrame.

See also

Series.cov

Compute covariance with another Series.

core.window.ewm.ExponentialMovingWindow.cov

Exponential weighted sample covariance.

core.window.expanding.Expanding.cov

Expanding sample covariance.

core.window.rolling.Rolling.cov

Rolling sample covariance.

Notes

Returns the covariance matrix of the DataFrame’s time series. The covariance is normalized by N-ddof.

For DataFrames that have Series that are missing data (assuming that data is missing at random) the returned covariance matrix will be an unbiased estimate of the variance and covariance between the member Series.

However, for many applications this estimate may not be acceptable because the estimated covariance matrix is not guaranteed to be positive semi-definite. This could lead to estimated correlations having absolute values which are greater than one, and/or a non-invertible covariance matrix. See Estimation of covariance matrices for more details.

Examples

>>> df = pd.DataFrame([(1, 2), (0, 3), (2, 0), (1, 1)],  
...                   columns=['dogs', 'cats'])
>>> df.cov()  
          dogs      cats
dogs  0.666667 -1.000000
cats -1.000000  1.666667
>>> np.random.seed(42)  
>>> df = pd.DataFrame(np.random.randn(1000, 5),  
...                   columns=['a', 'b', 'c', 'd', 'e'])
>>> df.cov()  
          a         b         c         d         e
a  0.998438 -0.020161  0.059277 -0.008943  0.014144
b -0.020161  1.059352 -0.008543 -0.024738  0.009826
c  0.059277 -0.008543  1.010670 -0.001486 -0.000271
d -0.008943 -0.024738 -0.001486  0.921297 -0.013692
e  0.014144  0.009826 -0.000271 -0.013692  0.977795

Minimum number of periods

This method also supports an optional min_periods keyword that specifies the required minimum number of non-NA observations for each column pair in order to have a valid result:

>>> np.random.seed(42)  
>>> df = pd.DataFrame(np.random.randn(20, 3),  
...                   columns=['a', 'b', 'c'])
>>> df.loc[df.index[:5], 'a'] = np.nan  
>>> df.loc[df.index[5:10], 'b'] = np.nan  
>>> df.cov(min_periods=12)  
          a         b         c
a  0.316741       NaN -0.150812
b       NaN  1.248003  0.191417
c -0.150812  0.191417  0.895202
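
The N - ddof normalization mentioned in the Notes can be reproduced by hand for a frame without missing values. This is a sketch with plain pandas and NumPy (an assumption), using the default ddof of 1: center each column, take the cross-products, and divide by N - 1.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([(1, 2), (0, 3), (2, 0), (1, 1)], columns=["dogs", "cats"])

# Deviations of each column from its mean.
centered = df - df.mean()
n = len(df)

# Sum of cross-products divided by N - 1 (that is, ddof=1).
manual = centered.T.dot(centered) / (n - 1)

print(manual)
print(np.allclose(manual.to_numpy(), df.cov().to_numpy()))
```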
cummax(axis=None, skipna=True, out=None)#

Return cumulative maximum over a DataFrame or Series axis.

This docstring was copied from pandas.core.frame.DataFrame.cummax.

Some inconsistencies with the Dask version may exist.

Returns a DataFrame or Series of the same size containing the cumulative maximum.

Parameters:
axis{0 or ‘index’, 1 or ‘columns’}, default 0

The index or the name of the axis. 0 is equivalent to None or ‘index’. For Series this parameter is unused and defaults to 0.

skipnabool, default True

Exclude NA/null values. If an entire row/column is NA, the result will be NA.

*args, **kwargs

Additional keywords have no effect but might be accepted for compatibility with NumPy.

Returns:
Series or DataFrame

Return cumulative maximum of Series or DataFrame.

See also

core.window.expanding.Expanding.max

Similar functionality but ignores NaN values.

DataFrame.max

Return the maximum over DataFrame axis.

DataFrame.cummax

Return cumulative maximum over DataFrame axis.

DataFrame.cummin

Return cumulative minimum over DataFrame axis.

DataFrame.cumsum

Return cumulative sum over DataFrame axis.

DataFrame.cumprod

Return cumulative product over DataFrame axis.

Examples

Series

>>> s = pd.Series([2, np.nan, 5, -1, 0])  
>>> s  
0    2.0
1    NaN
2    5.0
3   -1.0
4    0.0
dtype: float64

By default, NA values are ignored.

>>> s.cummax()  
0    2.0
1    NaN
2    5.0
3    5.0
4    5.0
dtype: float64

To include NA values in the operation, use skipna=False:

>>> s.cummax(skipna=False)  
0    2.0
1    NaN
2    NaN
3    NaN
4    NaN
dtype: float64

DataFrame

>>> df = pd.DataFrame([[2.0, 1.0],  
...                    [3.0, np.nan],
...                    [1.0, 0.0]],
...                    columns=list('AB'))
>>> df  
     A    B
0  2.0  1.0
1  3.0  NaN
2  1.0  0.0

By default, iterates over rows and finds the maximum in each column. This is equivalent to axis=None or axis='index'.

>>> df.cummax()  
     A    B
0  2.0  1.0
1  3.0  NaN
2  3.0  1.0

To iterate over columns and find the maximum in each row, use axis=1:

>>> df.cummax(axis=1)  
     A    B
0  2.0  2.0
1  3.0  NaN
2  1.0  1.0
cummin(axis=None, skipna=True, out=None)#

Return cumulative minimum over a DataFrame or Series axis.

This docstring was copied from pandas.core.frame.DataFrame.cummin.

Some inconsistencies with the Dask version may exist.

Returns a DataFrame or Series of the same size containing the cumulative minimum.

Parameters:
axis{0 or ‘index’, 1 or ‘columns’}, default 0

The index or the name of the axis. 0 is equivalent to None or ‘index’. For Series this parameter is unused and defaults to 0.

skipnabool, default True

Exclude NA/null values. If an entire row/column is NA, the result will be NA.

*args, **kwargs

Additional keywords have no effect but might be accepted for compatibility with NumPy.

Returns:
Series or DataFrame

Return cumulative minimum of Series or DataFrame.

See also

core.window.expanding.Expanding.min

Similar functionality but ignores NaN values.

DataFrame.min

Return the minimum over DataFrame axis.

DataFrame.cummax

Return cumulative maximum over DataFrame axis.

DataFrame.cummin

Return cumulative minimum over DataFrame axis.

DataFrame.cumsum

Return cumulative sum over DataFrame axis.

DataFrame.cumprod

Return cumulative product over DataFrame axis.

Examples

Series

>>> s = pd.Series([2, np.nan, 5, -1, 0])  
>>> s  
0    2.0
1    NaN
2    5.0
3   -1.0
4    0.0
dtype: float64

By default, NA values are ignored.

>>> s.cummin()  
0    2.0
1    NaN
2    2.0
3   -1.0
4   -1.0
dtype: float64

To include NA values in the operation, use skipna=False:

>>> s.cummin(skipna=False)  
0    2.0
1    NaN
2    NaN
3    NaN
4    NaN
dtype: float64

DataFrame

>>> df = pd.DataFrame([[2.0, 1.0],  
...                    [3.0, np.nan],
...                    [1.0, 0.0]],
...                    columns=list('AB'))
>>> df  
     A    B
0  2.0  1.0
1  3.0  NaN
2  1.0  0.0

By default, iterates over rows and finds the minimum in each column. This is equivalent to axis=None or axis='index'.

>>> df.cummin()  
     A    B
0  2.0  1.0
1  2.0  NaN
2  1.0  0.0

To iterate over columns and find the minimum in each row, use axis=1:

>>> df.cummin(axis=1)  
     A    B
0  2.0  1.0
1  3.0  NaN
2  1.0  0.0
cumprod(axis=None, skipna=True, dtype=None, out=None)#

Return cumulative product over a DataFrame or Series axis.

This docstring was copied from pandas.core.frame.DataFrame.cumprod.

Some inconsistencies with the Dask version may exist.

Returns a DataFrame or Series of the same size containing the cumulative product.

Parameters:
axis{0 or ‘index’, 1 or ‘columns’}, default 0

The index or the name of the axis. 0 is equivalent to None or ‘index’. For Series this parameter is unused and defaults to 0.

skipnabool, default True

Exclude NA/null values. If an entire row/column is NA, the result will be NA.

*args, **kwargs

Additional keywords have no effect but might be accepted for compatibility with NumPy.

Returns:
Series or DataFrame

Return cumulative product of Series or DataFrame.

See also

core.window.expanding.Expanding.prod

Similar functionality but ignores NaN values.

DataFrame.prod

Return the product over DataFrame axis.

DataFrame.cummax

Return cumulative maximum over DataFrame axis.

DataFrame.cummin

Return cumulative minimum over DataFrame axis.

DataFrame.cumsum

Return cumulative sum over DataFrame axis.

DataFrame.cumprod

Return cumulative product over DataFrame axis.

Examples

Series

>>> s = pd.Series([2, np.nan, 5, -1, 0])  
>>> s  
0    2.0
1    NaN
2    5.0
3   -1.0
4    0.0
dtype: float64

By default, NA values are ignored.

>>> s.cumprod()  
0     2.0
1     NaN
2    10.0
3   -10.0
4    -0.0
dtype: float64

To include NA values in the operation, use skipna=False:

>>> s.cumprod(skipna=False)  
0    2.0
1    NaN
2    NaN
3    NaN
4    NaN
dtype: float64

DataFrame

>>> df = pd.DataFrame([[2.0, 1.0],  
...                    [3.0, np.nan],
...                    [1.0, 0.0]],
...                    columns=list('AB'))
>>> df  
     A    B
0  2.0  1.0
1  3.0  NaN
2  1.0  0.0

By default, iterates over rows and finds the product in each column. This is equivalent to axis=None or axis='index'.

>>> df.cumprod()  
     A    B
0  2.0  1.0
1  6.0  NaN
2  6.0  0.0

To iterate over columns and find the product in each row, use axis=1:

>>> df.cumprod(axis=1)  
     A    B
0  2.0  2.0
1  3.0  NaN
2  1.0  0.0
cumsum(axis=None, skipna=True, dtype=None, out=None)#

Return cumulative sum over a DataFrame or Series axis.

This docstring was copied from pandas.core.frame.DataFrame.cumsum.

Some inconsistencies with the Dask version may exist.

Returns a DataFrame or Series of the same size containing the cumulative sum.

Parameters:
axis{0 or ‘index’, 1 or ‘columns’}, default 0

The index or the name of the axis. 0 is equivalent to None or ‘index’. For Series this parameter is unused and defaults to 0.

skipnabool, default True

Exclude NA/null values. If an entire row/column is NA, the result will be NA.

*args, **kwargs

Additional keywords have no effect but might be accepted for compatibility with NumPy.

Returns:
Series or DataFrame

Return cumulative sum of Series or DataFrame.

See also

core.window.expanding.Expanding.sum

Similar functionality but ignores NaN values.

DataFrame.sum

Return the sum over DataFrame axis.

DataFrame.cummax

Return cumulative maximum over DataFrame axis.

DataFrame.cummin

Return cumulative minimum over DataFrame axis.

DataFrame.cumsum

Return cumulative sum over DataFrame axis.

DataFrame.cumprod

Return cumulative product over DataFrame axis.

Examples

Series

>>> s = pd.Series([2, np.nan, 5, -1, 0])  
>>> s  
0    2.0
1    NaN
2    5.0
3   -1.0
4    0.0
dtype: float64

By default, NA values are ignored.

>>> s.cumsum()  
0    2.0
1    NaN
2    7.0
3    6.0
4    6.0
dtype: float64

To include NA values in the operation, use skipna=False:

>>> s.cumsum(skipna=False)  
0    2.0
1    NaN
2    NaN
3    NaN
4    NaN
dtype: float64

DataFrame

>>> df = pd.DataFrame([[2.0, 1.0],  
...                    [3.0, np.nan],
...                    [1.0, 0.0]],
...                    columns=list('AB'))
>>> df  
     A    B
0  2.0  1.0
1  3.0  NaN
2  1.0  0.0

By default, iterates over rows and finds the sum in each column. This is equivalent to axis=None or axis='index'.

>>> df.cumsum()  
     A    B
0  2.0  1.0
1  5.0  NaN
2  6.0  1.0

To iterate over columns and find the sum in each row, use axis=1:

>>> df.cumsum(axis=1)  
     A    B
0  2.0  3.0
1  3.0  NaN
2  1.0  1.0
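
The See also entry for Expanding.sum describes it as similar but handling NaN differently. The following sketch (plain pandas, used as an assumption in place of a Dask-cuDF object) puts the two side by side: cumsum leaves NaN at the position of the missing input, while the expanding sum reports the running total there.

```python
import numpy as np
import pandas as pd

s = pd.Series([2, np.nan, 5, -1, 0])

# NaN stays at the position where the input is NaN.
running = s.cumsum()

# The expanding window skips NaN, so the running total carries through.
expanding = s.expanding().sum()

print(pd.DataFrame({"cumsum": running, "expanding_sum": expanding}))
```

The two results agree everywhere except at the row holding the missing value.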
describe(split_every=False, percentiles=None, percentiles_method='default', include=None, exclude=None, datetime_is_numeric=_NoDefault.no_default)#

Generate descriptive statistics.

This docstring was copied from pandas.core.frame.DataFrame.describe.

Some inconsistencies with the Dask version may exist.

Descriptive statistics include those that summarize the central tendency, dispersion and shape of a dataset’s distribution, excluding NaN values.

Analyzes both numeric and object series, as well as DataFrame column sets of mixed data types. The output will vary depending on what is provided. Refer to the notes below for more detail.

Parameters:
percentileslist-like of numbers, optional

The percentiles to include in the output. All should fall between 0 and 1. The default is [.25, .5, .75], which returns the 25th, 50th, and 75th percentiles.

include‘all’, list-like of dtypes or None (default), optional

A white list of data types to include in the result. Ignored for Series. Here are the options:

  • ‘all’ : All columns of the input will be included in the output.

  • A list-like of dtypes : Limits the results to the provided data types. To limit the result to numeric types submit numpy.number. To limit it instead to object columns submit the object data type (the numpy.object alias is no longer available in recent NumPy releases). Strings can also be used in the style of select_dtypes (e.g. df.describe(include=['O'])). To select pandas categorical columns, use 'category'.

  • None (default) : The result will include all numeric columns.

excludelist-like of dtypes or None (default), optional

A black list of data types to omit from the result. Ignored for Series. Here are the options:

  • A list-like of dtypes : Excludes the provided data types from the result. To exclude numeric types submit numpy.number. To exclude object columns submit the object data type (the numpy.object alias is no longer available in recent NumPy releases). Strings can also be used in the style of select_dtypes (e.g. df.describe(exclude=['O'])). To exclude pandas categorical columns, use 'category'.

  • None (default) : The result will exclude nothing.

datetime_is_numericbool, default False

Whether to treat datetime dtypes as numeric. This affects statistics calculated for the column. For DataFrame input, this also controls whether datetime columns are included by default.

New in version 1.1.0.

Returns:
Series or DataFrame

Summary statistics of the Series or Dataframe provided.

See also

DataFrame.count

Count number of non-NA/null observations.

DataFrame.max

Maximum of the values in the object.

DataFrame.min

Minimum of the values in the object.

DataFrame.mean

Mean of the values.

DataFrame.std

Standard deviation of the observations.

DataFrame.select_dtypes

Subset of a DataFrame including/excluding columns based on their dtype.

Notes

For numeric data, the result’s index will include count, mean, std, min, max as well as lower, 50 and upper percentiles. By default the lower percentile is 25 and the upper percentile is 75. The 50 percentile is the same as the median.

For object data (e.g. strings or timestamps), the result’s index will include count, unique, top, and freq. The top is the most common value. The freq is the most common value’s frequency. Timestamps also include the first and last items.

If multiple object values have the highest count, then the count and top results will be arbitrarily chosen from among those with the highest count.

For mixed data types provided via a DataFrame, the default is to return only an analysis of numeric columns. If the dataframe consists only of object and categorical data without any numeric columns, the default is to return an analysis of both the object and categorical columns. If include='all' is provided as an option, the result will include a union of attributes of each type.

The include and exclude parameters can be used to limit which columns in a DataFrame are analyzed for the output. The parameters are ignored when analyzing a Series.

Examples

Describing a numeric Series.

>>> s = pd.Series([1, 2, 3])  
>>> s.describe()  
count    3.0
mean     2.0
std      1.0
min      1.0
25%      1.5
50%      2.0
75%      2.5
max      3.0
dtype: float64

Describing a categorical Series.

>>> s = pd.Series(['a', 'a', 'b', 'c'])  
>>> s.describe()  
count     4
unique    3
top       a
freq      2
dtype: object

Describing a timestamp Series.

>>> s = pd.Series([  
...   np.datetime64("2000-01-01"),
...   np.datetime64("2010-01-01"),
...   np.datetime64("2010-01-01")
... ])
>>> s.describe(datetime_is_numeric=True)  
count                      3
mean     2006-09-01 08:00:00
min      2000-01-01 00:00:00
25%      2004-12-31 12:00:00
50%      2010-01-01 00:00:00
75%      2010-01-01 00:00:00
max      2010-01-01 00:00:00
dtype: object

Describing a DataFrame. By default only numeric fields are returned.

>>> df = pd.DataFrame({'categorical': pd.Categorical(['d','e','f']),  
...                    'numeric': [1, 2, 3],
...                    'object': ['a', 'b', 'c']
...                   })
>>> df.describe()  
       numeric
count      3.0
mean       2.0
std        1.0
min        1.0
25%        1.5
50%        2.0
75%        2.5
max        3.0

Describing all columns of a DataFrame regardless of data type.

>>> df.describe(include='all')  
       categorical  numeric object
count            3      3.0      3
unique           3      NaN      3
top              f      NaN      a
freq             1      NaN      1
mean           NaN      2.0    NaN
std            NaN      1.0    NaN
min            NaN      1.0    NaN
25%            NaN      1.5    NaN
50%            NaN      2.0    NaN
75%            NaN      2.5    NaN
max            NaN      3.0    NaN

Describing a column from a DataFrame by accessing it as an attribute.

>>> df.numeric.describe()  
count    3.0
mean     2.0
std      1.0
min      1.0
25%      1.5
50%      2.0
75%      2.5
max      3.0
Name: numeric, dtype: float64

Including only numeric columns in a DataFrame description.

>>> df.describe(include=[np.number])  
       numeric
count      3.0
mean       2.0
std        1.0
min        1.0
25%        1.5
50%        2.0
75%        2.5
max        3.0

Including only string columns in a DataFrame description.

>>> df.describe(include=[object])  
       object
count       3
unique      3
top         a
freq        1

Including only categorical columns from a DataFrame description.

>>> df.describe(include=['category'])  
       categorical
count            3
unique           3
top              d
freq             1

Excluding numeric columns from a DataFrame description.

>>> df.describe(exclude=[np.number])  
       categorical object
count            3      3
unique           3      3
top              f      a
freq             1      1

Excluding object columns from a DataFrame description.

>>> df.describe(exclude=[object])  
       categorical  numeric
count            3      3.0
unique           3      NaN
top              f      NaN
freq             1      NaN
mean           NaN      2.0
std            NaN      1.0
min            NaN      1.0
25%            NaN      1.5
50%            NaN      2.0
75%            NaN      2.5
max            NaN      3.0
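
The percentiles parameter is not exercised in the examples above. A minimal sketch with a plain pandas Series (an assumption; the values are made up) shows how the requested percentiles appear as labelled rows in the result.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5])

# Request the 10th, 50th and 90th percentiles instead of the default quartiles.
summary = s.describe(percentiles=[0.1, 0.5, 0.9])

print(summary)
```

Each requested percentile is labelled with its percentage, so the 10th percentile is available as summary["10%"].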
diff(periods=1, axis=0)#

First discrete difference of element.

This docstring was copied from pandas.core.frame.DataFrame.diff.

Some inconsistencies with the Dask version may exist.

Note

Pandas currently uses an object-dtype column to represent boolean data with missing values. This can cause issues for boolean-specific operations, like |. To enable boolean- specific operations, at the cost of metadata that doesn’t match pandas, use .astype(bool) after the shift.

Calculates the difference of a DataFrame element compared with another element in the DataFrame (default is element in previous row).

Parameters:
periodsint, default 1

Periods to shift for calculating difference, accepts negative values.

axis{0 or ‘index’, 1 or ‘columns’}, default 0

Take difference over rows (0) or columns (1).

Returns:
DataFrame

First differences of the Series.

See also

DataFrame.pct_change

Percent change over given number of periods.

DataFrame.shift

Shift index by desired number of periods with an optional time freq.

Series.diff

First discrete difference of object.

Notes

For boolean dtypes, this uses operator.xor() rather than operator.sub(). The result is calculated according to current dtype in DataFrame, however dtype of the result is always float64.

Examples

Difference with previous row

>>> df = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6],  
...                    'b': [1, 1, 2, 3, 5, 8],
...                    'c': [1, 4, 9, 16, 25, 36]})
>>> df  
   a  b   c
0  1  1   1
1  2  1   4
2  3  2   9
3  4  3  16
4  5  5  25
5  6  8  36
>>> df.diff()  
     a    b     c
0  NaN  NaN   NaN
1  1.0  0.0   3.0
2  1.0  1.0   5.0
3  1.0  1.0   7.0
4  1.0  2.0   9.0
5  1.0  3.0  11.0

Difference with previous column

>>> df.diff(axis=1)  
    a  b   c
0 NaN  0   0
1 NaN -1   3
2 NaN -1   7
3 NaN -1  13
4 NaN  0  20
5 NaN  2  28

Difference with 3rd previous row

>>> df.diff(periods=3)  
     a    b     c
0  NaN  NaN   NaN
1  NaN  NaN   NaN
2  NaN  NaN   NaN
3  3.0  2.0  15.0
4  3.0  4.0  21.0
5  3.0  6.0  27.0

Difference with following row

>>> df.diff(periods=-1)  
     a    b     c
0 -1.0  0.0  -3.0
1 -1.0 -1.0  -5.0
2 -1.0 -1.0  -7.0
3 -1.0 -2.0  -9.0
4 -1.0 -3.0 -11.0
5  NaN  NaN   NaN

Overflow in input dtype

>>> df = pd.DataFrame({'a': [1, 0]}, dtype=np.uint8)  
>>> df.diff()  
       a
0    NaN
1  255.0
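
For signed integer or float columns, the difference described above matches subtracting a shifted copy of the frame, which is why DataFrame.shift is listed under See also. The sketch below checks this with plain pandas (an assumption); the unsigned overflow case shown just above is an exception, since subtracting after a shift does not wrap around.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3, 4, 5, 6], "b": [1, 1, 2, 3, 5, 8]})

results = {}
for periods in (1, 3, -1):
    via_diff = df.diff(periods=periods)
    via_shift = df - df.shift(periods)
    # equals() treats NaNs in the same position as equal.
    results[periods] = via_diff.equals(via_shift)

print(results)
```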
div(other, axis='columns', level=None, fill_value=None)#

Get Floating division of dataframe and other, element-wise (binary operator truediv).

This docstring was copied from pandas.core.frame.DataFrame.div.

Some inconsistencies with the Dask version may exist.

Equivalent to dataframe / other, but with support to substitute a fill_value for missing data in one of the inputs. With reverse version, rtruediv.

Among flexible wrappers (add, sub, mul, div, mod, pow) to arithmetic operators: +, -, *, /, //, %, **.

Parameters:
otherscalar, sequence, Series, dict or DataFrame

Any single or multiple element data structure, or list-like object.

axis{0 or ‘index’, 1 or ‘columns’}

Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’). For Series input, axis to match Series index on.

levelint or label

Broadcast across a level, matching Index values on the passed MultiIndex level.

fill_valuefloat or None, default None

Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing.

Returns:
DataFrame

Result of the arithmetic operation.

See also

DataFrame.add

Add DataFrames.

DataFrame.sub

Subtract DataFrames.

DataFrame.mul

Multiply DataFrames.

DataFrame.div

Divide DataFrames (float division).

DataFrame.truediv

Divide DataFrames (float division).

DataFrame.floordiv

Divide DataFrames (integer division).

DataFrame.mod

Calculate modulo (remainder after division).

DataFrame.pow

Calculate exponential power.

Notes

Mismatched indices will be unioned together.
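
A small sketch of that index union, together with the effect of fill_value, using plain pandas frames as an assumption (the labels p, q and r are made up for illustration):

```python
import pandas as pd

a = pd.DataFrame({"x": [1.0, 2.0]}, index=["p", "q"])
b = pd.DataFrame({"x": [2.0, 4.0]}, index=["q", "r"])

# Only "q" exists on both sides; "p" and "r" become NaN.
plain = a.div(b)

# fill_value=1 stands in for whichever side is missing before dividing.
filled = a.div(b, fill_value=1)

print(plain)
print(filled)
```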

Examples

>>> df = pd.DataFrame({'angles': [0, 3, 4],  
...                    'degrees': [360, 180, 360]},
...                   index=['circle', 'triangle', 'rectangle'])
>>> df  
           angles  degrees
circle          0      360
triangle        3      180
rectangle       4      360

Add a scalar with the operator version, which returns the same results.

>>> df + 1  
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361
>>> df.add(1)  
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361

Divide by constant with reverse version.

>>> df.div(10)  
           angles  degrees
circle        0.0     36.0
triangle      0.3     18.0
rectangle     0.4     36.0
>>> df.rdiv(10)  
             angles   degrees
circle          inf  0.027778
triangle   3.333333  0.055556
rectangle  2.500000  0.027778

Subtract a list and Series by axis with operator version.

>>> df - [1, 2]  
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358
>>> df.sub([1, 2], axis='columns')  
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358
>>> df.sub(pd.Series([1, 1, 1], index=['circle', 'triangle', 'rectangle']),  
...        axis='index')
           angles  degrees
circle         -1      359
triangle        2      179
rectangle       3      359

Multiply a dictionary by axis.

>>> df.mul({'angles': 0, 'degrees': 2})  
           angles  degrees
circle          0      720
triangle        0      360
rectangle       0      720
>>> df.mul({'circle': 0, 'triangle': 2, 'rectangle': 3}, axis='index')
           angles  degrees
circle          0        0
triangle        6      360
rectangle      12     1080

Multiply a DataFrame of different shape with operator version.

>>> other = pd.DataFrame({'angles': [0, 3, 4]},  
...                      index=['circle', 'triangle', 'rectangle'])
>>> other  
           angles
circle          0
triangle        3
rectangle       4
>>> df * other  
           angles  degrees
circle          0      NaN
triangle        9      NaN
rectangle      16      NaN
>>> df.mul(other, fill_value=0)  
           angles  degrees
circle          0      0.0
triangle        9      0.0
rectangle      16      0.0

Divide by a MultiIndex by level.

>>> df_multindex = pd.DataFrame({'angles': [0, 3, 4, 4, 5, 6],  
...                              'degrees': [360, 180, 360, 360, 540, 720]},
...                             index=[['A', 'A', 'A', 'B', 'B', 'B'],
...                                    ['circle', 'triangle', 'rectangle',
...                                     'square', 'pentagon', 'hexagon']])
>>> df_multindex  
             angles  degrees
A circle          0      360
  triangle        3      180
  rectangle       4      360
B square          4      360
  pentagon        5      540
  hexagon         6      720
>>> df.div(df_multindex, level=1, fill_value=0)  
             angles  degrees
A circle        NaN      1.0
  triangle      1.0      1.0
  rectangle     1.0      1.0
B square        0.0      0.0
  pentagon      0.0      0.0
  hexagon       0.0      0.0
divide(other, axis='columns', level=None, fill_value=None)#

Get Floating division of dataframe and other, element-wise (binary operator truediv).

This docstring was copied from pandas.core.frame.DataFrame.divide.

Some inconsistencies with the Dask version may exist.

Equivalent to dataframe / other, but with support to substitute a fill_value for missing data in one of the inputs. With reverse version, rtruediv.

Among flexible wrappers (add, sub, mul, div, mod, pow) to arithmetic operators: +, -, *, /, //, %, **.

Parameters:
otherscalar, sequence, Series, dict or DataFrame

Any single or multiple element data structure, or list-like object.

axis{0 or ‘index’, 1 or ‘columns’}

Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’). For Series input, axis to match Series index on.

levelint or label

Broadcast across a level, matching Index values on the passed MultiIndex level.

fill_valuefloat or None, default None

Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing.

Returns:
DataFrame

Result of the arithmetic operation.

See also

DataFrame.add

Add DataFrames.

DataFrame.sub

Subtract DataFrames.

DataFrame.mul

Multiply DataFrames.

DataFrame.div

Divide DataFrames (float division).

DataFrame.truediv

Divide DataFrames (float division).

DataFrame.floordiv

Divide DataFrames (integer division).

DataFrame.mod

Calculate modulo (remainder after division).

DataFrame.pow

Calculate exponential power.

Notes

Mismatched indices will be unioned together.

Examples

>>> df = pd.DataFrame({'angles': [0, 3, 4],  
...                    'degrees': [360, 180, 360]},
...                   index=['circle', 'triangle', 'rectangle'])
>>> df  
           angles  degrees
circle          0      360
triangle        3      180
rectangle       4      360

Add a scalar with the operator version, which returns the same results.

>>> df + 1  
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361
>>> df.add(1)  
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361

Divide by constant with reverse version.

>>> df.div(10)  
           angles  degrees
circle        0.0     36.0
triangle      0.3     18.0
rectangle     0.4     36.0
>>> df.rdiv(10)  
             angles   degrees
circle          inf  0.027778
triangle   3.333333  0.055556
rectangle  2.500000  0.027778

Subtract a list and Series by axis with operator version.

>>> df - [1, 2]  
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358
>>> df.sub([1, 2], axis='columns')  
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358
>>> df.sub(pd.Series([1, 1, 1], index=['circle', 'triangle', 'rectangle']),  
...        axis='index')
           angles  degrees
circle         -1      359
triangle        2      179
rectangle       3      359

Multiply a dictionary by axis.

>>> df.mul({'angles': 0, 'degrees': 2})  
           angles  degrees
circle          0      720
triangle        0      360
rectangle       0      720
>>> df.mul({'circle': 0, 'triangle': 2, 'rectangle': 3}, axis='index')
           angles  degrees
circle          0        0
triangle        6      360
rectangle      12     1080

Multiply a DataFrame of different shape with operator version.

>>> other = pd.DataFrame({'angles': [0, 3, 4]},  
...                      index=['circle', 'triangle', 'rectangle'])
>>> other  
           angles
circle          0
triangle        3
rectangle       4
>>> df * other  
           angles  degrees
circle          0      NaN
triangle        9      NaN
rectangle      16      NaN
>>> df.mul(other, fill_value=0)  
           angles  degrees
circle          0      0.0
triangle        9      0.0
rectangle      16      0.0

Divide by a MultiIndex by level.

>>> df_multindex = pd.DataFrame({'angles': [0, 3, 4, 4, 5, 6],  
...                              'degrees': [360, 180, 360, 360, 540, 720]},
...                             index=[['A', 'A', 'A', 'B', 'B', 'B'],
...                                    ['circle', 'triangle', 'rectangle',
...                                     'square', 'pentagon', 'hexagon']])
>>> df_multindex  
             angles  degrees
A circle          0      360
  triangle        3      180
  rectangle       4      360
B square          4      360
  pentagon        5      540
  hexagon         6      720
>>> df.div(df_multindex, level=1, fill_value=0)  
             angles  degrees
A circle        NaN      1.0
  triangle      1.0      1.0
  rectangle     1.0      1.0
B square        0.0      0.0
  pentagon      0.0      0.0
  hexagon       0.0      0.0
property divisions#

Tuple of npartitions + 1 values, in ascending order, marking the lower/upper bounds of each partition’s index. Divisions allow Dask to know which partition will contain a given value, significantly speeding up operations like loc, merge, and groupby by not having to search the full dataset.

Example: for divisions = (0, 10, 50, 100), there are three partitions, where the index in each partition contains values [0, 10), [10, 50), and [50, 100], respectively. Dask therefore knows df.loc[45] will be in the second partition.

When every item in divisions is None, the divisions are unknown. Most operations can still be performed, but some will be much slower, and a few may fail.

It is uncommon to set divisions directly. Instead, use set_index, which sorts and splits the data as needed. See https://docs.dask.org/en/latest/dataframe-design.html#partitions.
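
How known divisions narrow a lookup to a single partition can be sketched with the standard library. This is a minimal illustration of the idea, not Dask's internal code; partition_for is a hypothetical helper name.

```python
from bisect import bisect_right


def partition_for(divisions, value):
    """Return the 0-based partition whose index range contains value."""
    if any(d is None for d in divisions):
        raise ValueError("divisions are unknown; every partition must be searched")
    if not divisions[0] <= value <= divisions[-1]:
        raise KeyError(value)
    npartitions = len(divisions) - 1
    # bisect_right counts the boundaries <= value. The last partition is
    # closed on the right, so clamp to the final partition number.
    return min(bisect_right(divisions, value) - 1, npartitions - 1)


divisions = (0, 10, 50, 100)
print(partition_for(divisions, 45))   # 1, i.e. the second partition
print(partition_for(divisions, 10))   # 1, because [10, 50) includes 10
print(partition_for(divisions, 100))  # 2, because the last range is [50, 100]
```

When every division is None no such shortcut exists, which is why operations on data with unknown divisions can be slower.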

dot(other, meta=_NoDefault.no_default)#

Compute the dot product between the Series and the columns of other.

This docstring was copied from pandas.core.series.Series.dot.

Some inconsistencies with the Dask version may exist.

This method computes the dot product between the Series and another one, or the Series and each column of a DataFrame, or the Series and each column of an array.

It can also be called using self @ other in Python >= 3.5.

Parameters:
otherSeries, DataFrame or array-like

The other object whose columns are used to compute the dot product.

Returns:
scalar, Series or numpy.ndarray

The dot product of the Series and other if other is a Series; a Series of the dot products of the Series with each column of other if other is a DataFrame; or a numpy.ndarray of the dot products of the Series with each column of other if other is a NumPy array.

See also

DataFrame.dot

Compute the matrix product with the DataFrame.

Series.mul

Multiplication of series and other, element-wise.

Notes

The Series and other have to share the same index if other is a Series or a DataFrame.
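
The column-by-column computation can be written out in plain Python. This is only a sketch of the arithmetic: dot here is a hypothetical stand-alone function operating on lists, and it ignores index alignment entirely.

```python
def dot(values, columns):
    """Return the dot product of `values` with each column in `columns`."""
    results = []
    for column in columns:
        if len(column) != len(values):
            raise ValueError("each column must have the same length as values")
        # Sum of element-wise products for this column.
        results.append(sum(x * y for x, y in zip(values, column)))
    return results


s = [0, 1, 2, 3]
print(dot(s, [[-1, 2, -3, 4]]))                # [8]
print(dot(s, [[0, -2, 4, 6], [1, 3, -5, 7]]))  # [24, 14]
```

The inputs are the same numbers used in the examples for this method, with the two-column frame written as one list per column.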

Examples

>>> s = pd.Series([0, 1, 2, 3])  
>>> other = pd.Series([-1, 2, -3, 4])  
>>> s.dot(other)  
8
>>> s @ other  
8
>>> df = pd.DataFrame([[0, 1], [-2, 3], [4, -5], [6, 7]])  
>>> s.dot(df)  
0    24
1    14
dtype: int64
>>> arr = np.array([[0, 1], [-2, 3], [4, -5], [6, 7]])  
>>> s.dot(arr)  
array([24, 14])
drop(labels=None, axis=0, columns=None, errors='raise')#

Drop specified labels from rows or columns.

This docstring was copied from pandas.core.frame.DataFrame.drop.

Some inconsistencies with the Dask version may exist.

Remove rows or columns by specifying label names and corresponding axis, or by specifying directly index or column names. When using a multi-index, labels on different levels can be removed by specifying the level. See the user guide for more information about the now unused levels.

Parameters:
labelssingle label or list-like

Index or column labels to drop. A tuple will be used as a single label and not treated as a list-like.

axis{0 or ‘index’, 1 or ‘columns’}, default 0

Whether to drop labels from the index (0 or ‘index’) or columns (1 or ‘columns’).

indexsingle label or list-like (Not supported in Dask)

Alternative to specifying axis (labels, axis=0 is equivalent to index=labels).

columnssingle label or list-like

Alternative to specifying axis (labels, axis=1 is equivalent to columns=labels).

levelint or level name, optional (Not supported in Dask)

For MultiIndex, level from which the labels will be removed.

inplacebool, default False (Not supported in Dask)

If False, return a copy. Otherwise, do operation inplace and return None.

errors{‘ignore’, ‘raise’}, default ‘raise’

If ‘ignore’, suppress error and only existing labels are dropped.

Returns:
DataFrame or None

DataFrame without the removed index or column labels or None if inplace=True.

Raises:
KeyError

If any of the labels is not found in the selected axis.

See also

DataFrame.loc

Label-location based indexer for selection by label.

DataFrame.dropna

Return DataFrame with labels on given axis omitted where (all or any) data are missing.

DataFrame.drop_duplicates

Return DataFrame with duplicate rows removed, optionally only considering certain columns.

Series.drop

Return Series with specified index labels removed.

Examples

>>> df = pd.DataFrame(np.arange(12).reshape(3, 4),  
...                   columns=['A', 'B', 'C', 'D'])
>>> df  
   A  B   C   D
0  0  1   2   3
1  4  5   6   7
2  8  9  10  11

Drop columns

>>> df.drop(['B', 'C'], axis=1)  
   A   D
0  0   3
1  4   7
2  8  11
>>> df.drop(columns=['B', 'C'])  
   A   D
0  0   3
1  4   7
2  8  11

Drop rows by index

>>> df.drop([0, 1])  
   A  B   C   D
2  8  9  10  11

Drop columns and/or rows of MultiIndex DataFrame

>>> midx = pd.MultiIndex(levels=[['lama', 'cow', 'falcon'],  
...                              ['speed', 'weight', 'length']],
...                      codes=[[0, 0, 0, 1, 1, 1, 2, 2, 2],
...                             [0, 1, 2, 0, 1, 2, 0, 1, 2]])
>>> df = pd.DataFrame(index=midx, columns=['big', 'small'],  
...                   data=[[45, 30], [200, 100], [1.5, 1], [30, 20],
...                         [250, 150], [1.5, 0.8], [320, 250],
...                         [1, 0.8], [0.3, 0.2]])
>>> df  
                big     small
lama    speed   45.0    30.0
        weight  200.0   100.0
        length  1.5     1.0
cow     speed   30.0    20.0
        weight  250.0   150.0
        length  1.5     0.8
falcon  speed   320.0   250.0
        weight  1.0     0.8
        length  0.3     0.2

Drop a specific index combination from the MultiIndex DataFrame, i.e., drop the combination 'falcon' and 'weight', which deletes only the corresponding row

>>> df.drop(index=('falcon', 'weight'))  
                big     small
lama    speed   45.0    30.0
        weight  200.0   100.0
        length  1.5     1.0
cow     speed   30.0    20.0
        weight  250.0   150.0
        length  1.5     0.8
falcon  speed   320.0   250.0
        length  0.3     0.2
>>> df.drop(index='cow', columns='small')  
                big
lama    speed   45.0
        weight  200.0
        length  1.5
falcon  speed   320.0
        weight  1.0
        length  0.3
>>> df.drop(index='length', level=1)  
                big     small
lama    speed   45.0    30.0
        weight  200.0   100.0
cow     speed   30.0    20.0
        weight  250.0   150.0
falcon  speed   320.0   250.0
        weight  1.0     0.8
drop_duplicates(subset=None, split_every=None, split_out=1, shuffle_method=None, ignore_index=False, **kwargs)#

Return DataFrame with duplicate rows removed.

This docstring was copied from pandas.core.frame.DataFrame.drop_duplicates.

Some inconsistencies with the Dask version may exist.

Known inconsistencies:

keep=False will raise a NotImplementedError

Considering certain columns is optional. Indexes, including time indexes are ignored.

Parameters:
subsetcolumn label or sequence of labels, optional

Only consider certain columns for identifying duplicates, by default use all of the columns.

keep{‘first’, ‘last’, False}, default ‘first’ (Not supported in Dask)

Determines which duplicates (if any) to keep.

  • first : Drop duplicates except for the first occurrence.

  • last : Drop duplicates except for the last occurrence.

  • False : Drop all duplicates.

inplacebool, default False (Not supported in Dask)

Whether to modify the DataFrame rather than creating a new one.

ignore_indexbool, default False

If True, the resulting axis will be labeled 0, 1, …, n - 1.

New in version 1.0.0.

Returns:
DataFrame or None

DataFrame with duplicates removed or None if inplace=True.

See also

DataFrame.value_counts

Count unique combinations of columns.

Examples

Consider a dataset containing ramen ratings.

>>> df = pd.DataFrame({  
...     'brand': ['Yum Yum', 'Yum Yum', 'Indomie', 'Indomie', 'Indomie'],
...     'style': ['cup', 'cup', 'cup', 'pack', 'pack'],
...     'rating': [4, 4, 3.5, 15, 5]
... })
>>> df  
    brand style  rating
0  Yum Yum   cup     4.0
1  Yum Yum   cup     4.0
2  Indomie   cup     3.5
3  Indomie  pack    15.0
4  Indomie  pack     5.0

By default, it removes duplicate rows based on all columns.

>>> df.drop_duplicates()  
    brand style  rating
0  Yum Yum   cup     4.0
2  Indomie   cup     3.5
3  Indomie  pack    15.0
4  Indomie  pack     5.0

To remove duplicates on specific column(s), use subset.

>>> df.drop_duplicates(subset=['brand'])  
    brand style  rating
0  Yum Yum   cup     4.0
2  Indomie   cup     3.5

To remove duplicates and keep last occurrences, use keep.

>>> df.drop_duplicates(subset=['brand', 'style'], keep='last')  
    brand style  rating
1  Yum Yum   cup     4.0
2  Indomie   cup     3.5
4  Indomie  pack     5.0
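
The effect of subset and keep can be sketched in plain Python on the same ramen data. This is an illustration of the selection rule only, assuming rows are dictionaries; surviving_positions is a hypothetical helper and not how pandas or Dask implement de-duplication.

```python
def surviving_positions(rows, subset, keep="first"):
    """Return the row positions kept after dropping duplicates on `subset`."""
    if keep not in ("first", "last"):
        raise ValueError("this sketch only handles keep='first' or keep='last'")
    chosen = {}
    for pos, row in enumerate(rows):
        key = tuple(row[column] for column in subset)
        if keep == "first":
            chosen.setdefault(key, pos)  # remember the first position only
        else:
            chosen[key] = pos            # overwrite, so the last position wins
    return sorted(chosen.values())


rows = [
    {"brand": "Yum Yum", "style": "cup", "rating": 4.0},
    {"brand": "Yum Yum", "style": "cup", "rating": 4.0},
    {"brand": "Indomie", "style": "cup", "rating": 3.5},
    {"brand": "Indomie", "style": "pack", "rating": 15.0},
    {"brand": "Indomie", "style": "pack", "rating": 5.0},
]

print(surviving_positions(rows, ["brand"]))                        # [0, 2]
print(surviving_positions(rows, ["brand", "style"], keep="last"))  # [1, 2, 4]
```

The printed positions match the index labels shown in the examples above.
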
dropna(how=_NoDefault.no_default, subset=None, thresh=_NoDefault.no_default)#

Remove missing values.

This docstring was copied from pandas.core.frame.DataFrame.dropna.

Some inconsistencies with the Dask version may exist.

See the User Guide for more on which values are considered missing, and how to work with missing data.

Parameters:
axis{0 or ‘index’, 1 or ‘columns’}, default 0 (Not supported in Dask)

Determine if rows or columns which contain missing values are removed.

  • 0, or ‘index’ : Drop rows which contain missing values.

  • 1, or ‘columns’ : Drop columns which contain missing value.

Changed in version 1.0.0: Passing a tuple or list to drop on multiple axes is no longer supported. Only a single axis is allowed.

how{‘any’, ‘all’}, default ‘any’

Determine if row or column is removed from DataFrame, when we have at least one NA or all NA.

  • ‘any’ : If any NA values are present, drop that row or column.

  • ‘all’ : If all values are NA, drop that row or column.

threshint, optional

Require that many non-NA values. Cannot be combined with how.

subsetcolumn label or sequence of labels, optional

Labels along other axis to consider, e.g. if you are dropping rows these would be a list of columns to include.

inplacebool, default False (Not supported in Dask)

Whether to modify the DataFrame rather than creating a new one.

Returns:
DataFrame or None

DataFrame with NA entries dropped from it or None if inplace=True.

See also

DataFrame.isna

Indicate missing values.

DataFrame.notna

Indicate existing (non-missing) values.

DataFrame.fillna

Replace missing values.

Series.dropna

Drop missing values.

Index.dropna

Drop missing indices.

Examples

>>> df = pd.DataFrame({"name": ['Alfred', 'Batman', 'Catwoman'],  
...                    "toy": [np.nan, 'Batmobile', 'Bullwhip'],
...                    "born": [pd.NaT, pd.Timestamp("1940-04-25"),
...                             pd.NaT]})
>>> df  
       name        toy       born
0    Alfred        NaN        NaT
1    Batman  Batmobile 1940-04-25
2  Catwoman   Bullwhip        NaT

Drop the rows where at least one element is missing.

>>> df.dropna()  
     name        toy       born
1  Batman  Batmobile 1940-04-25

Drop the columns where at least one element is missing.

>>> df.dropna(axis='columns')  
       name
0    Alfred
1    Batman
2  Catwoman

Drop the rows where all elements are missing.

>>> df.dropna(how='all')  
       name        toy       born
0    Alfred        NaN        NaT
1    Batman  Batmobile 1940-04-25
2  Catwoman   Bullwhip        NaT

Keep only the rows with at least 2 non-NA values.

>>> df.dropna(thresh=2)  
       name        toy       born
1    Batman  Batmobile 1940-04-25
2  Catwoman   Bullwhip        NaT

Define in which columns to look for missing values.

>>> df.dropna(subset=['name', 'toy'])  
       name        toy       born
1    Batman  Batmobile 1940-04-25
2  Catwoman   Bullwhip        NaT

Keep the DataFrame with valid entries in the same variable.

>>> df.dropna(inplace=True)  
>>> df  
     name        toy       born
1  Batman  Batmobile 1940-04-25
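
The row-wise rules for how and thresh can be summarised in a short plain-Python sketch. It assumes rows are tuples and None marks a missing value; rows_kept is a hypothetical helper that only mirrors the documented behaviour for dropping rows.

```python
def rows_kept(rows, how="any", thresh=None):
    """Return positions of rows that survive, treating None as missing."""
    kept = []
    for pos, row in enumerate(rows):
        present = sum(value is not None for value in row)
        if thresh is not None:
            keep = present >= thresh      # require that many non-missing values
        elif how == "any":
            keep = present == len(row)    # drop if any value is missing
        elif how == "all":
            keep = present > 0            # drop only if every value is missing
        else:
            raise ValueError(f"unknown how: {how!r}")
        if keep:
            kept.append(pos)
    return kept


rows = [
    ("Alfred", None, None),
    ("Batman", "Batmobile", "1940-04-25"),
    ("Catwoman", "Bullwhip", None),
]

print(rows_kept(rows))              # [1]
print(rows_kept(rows, how="all"))   # [0, 1, 2]
print(rows_kept(rows, thresh=2))    # [1, 2]
```
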
property dtypes#

Return data types

enforce_runtime_divisions()#

Enforce the current divisions at runtime

eq(other, axis='columns', level=None)#

Get Equal to of dataframe and other, element-wise (binary operator eq).

This docstring was copied from pandas.core.frame.DataFrame.eq.

Some inconsistencies with the Dask version may exist.

Among flexible wrappers (eq, ne, le, lt, ge, gt) to comparison operators.

Equivalent to ==, !=, <=, <, >=, > with support to choose axis (rows or columns) and level for comparison.

Parameters:
otherscalar, sequence, Series, or DataFrame

Any single or multiple element data structure, or list-like object.

axis{0 or ‘index’, 1 or ‘columns’}, default ‘columns’

Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’).

levelint or label

Broadcast across a level, matching Index values on the passed MultiIndex level.

Returns:
DataFrame of bool

Result of the comparison.

See also

DataFrame.eq

Compare DataFrames for equality elementwise.

DataFrame.ne

Compare DataFrames for inequality elementwise.

DataFrame.le

Compare DataFrames for less than inequality or equality elementwise.

DataFrame.lt

Compare DataFrames for strictly less than inequality elementwise.

DataFrame.ge

Compare DataFrames for greater than inequality or equality elementwise.

DataFrame.gt

Compare DataFrames for strictly greater than inequality elementwise.

Notes

Mismatched indices will be unioned together. NaN values are considered different (i.e. NaN != NaN).
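
The NaN rule follows ordinary floating-point comparison and can be checked in plain Python. The eq function below is a hypothetical list-based stand-in for the element-wise method.

```python
import math

nan = float("nan")


def eq(left, right):
    """Element-wise ==; a NaN never compares equal, not even to another NaN."""
    return [a == b for a, b in zip(left, right)]


print(eq([1.0, nan, 3.0], [1.0, nan, 4.0]))      # [True, False, False]
print([math.isnan(x) for x in [1.0, nan, 3.0]])  # [False, True, False]
```

To find missing values, test for them explicitly (as math.isnan does here) rather than comparing against NaN.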

Examples

>>> df = pd.DataFrame({'cost': [250, 150, 100],  
...                    'revenue': [100, 250, 300]},
...                   index=['A', 'B', 'C'])
>>> df  
   cost  revenue
A   250      100
B   150      250
C   100      300

Comparison with a scalar, using either the operator or method:

>>> df == 100  
    cost  revenue
A  False     True
B  False    False
C   True    False
>>> df.eq(100)  
    cost  revenue
A  False     True
B  False    False
C   True    False

When other is a Series, the columns of a DataFrame are aligned with the index of other and broadcast:

>>> df != pd.Series([100, 250], index=["cost", "revenue"])  
    cost  revenue
A   True     True
B   True    False
C  False     True

Use the method to control the broadcast axis:

>>> df.ne(pd.Series([100, 300], index=["A", "D"]), axis='index')  
   cost  revenue
A  True    False
B  True     True
C  True     True
D  True     True

When comparing to an arbitrary sequence, the number of columns must match the number of elements in other:

>>> df == [250, 100]  
    cost  revenue
A   True     True
B  False    False
C  False    False

Use the method to control the axis:

>>> df.eq([250, 250, 100], axis='index')  
    cost  revenue
A   True    False
B  False     True
C   True    False

Compare to a DataFrame of different shape.

>>> other = pd.DataFrame({'revenue': [300, 250, 100, 150]},  
...                      index=['A', 'B', 'C', 'D'])
>>> other  
   revenue
A      300
B      250
C      100
D      150
>>> df.gt(other)  
    cost  revenue
A  False    False
B  False    False
C  False     True
D  False    False

Compare to a MultiIndex by level.

>>> df_multindex = pd.DataFrame({'cost': [250, 150, 100, 150, 300, 220],  
...                              'revenue': [100, 250, 300, 200, 175, 225]},
...                             index=[['Q1', 'Q1', 'Q1', 'Q2', 'Q2', 'Q2'],
...                                    ['A', 'B', 'C', 'A', 'B', 'C']])
>>> df_multindex  
      cost  revenue
Q1 A   250      100
   B   150      250
   C   100      300
Q2 A   150      200
   B   300      175
   C   220      225
>>> df.le(df_multindex, level=1)  
       cost  revenue
Q1 A   True     True
   B   True     True
   C   True     True
Q2 A  False     True
   B   True    False
   C   True    False
eval(expr, inplace=None, **kwargs)#

Evaluate a string describing operations on DataFrame columns.

This docstring was copied from pandas.core.frame.DataFrame.eval.

Some inconsistencies with the Dask version may exist.

Operates on columns only, not specific rows or elements. This allows eval to run arbitrary code, which can make you vulnerable to code injection if you pass user input to this function.
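
One way to reduce the injection risk is to check an expression before passing it on. The sketch below is an assumption-laden illustration, not part of pandas, Dask, or Dask-cuDF: is_safe_expr is a hypothetical helper that uses Python's ast module to accept only basic arithmetic over known column names, and it rejects anything else (including assignments, which this conservative check does not cover).

```python
import ast

# Node types permitted in an expression: arithmetic on names and numbers only.
ALLOWED_NODES = (
    ast.Expression, ast.BinOp, ast.UnaryOp, ast.Name, ast.Load, ast.Constant,
    ast.Add, ast.Sub, ast.Mult, ast.Div, ast.UAdd, ast.USub,
)


def is_safe_expr(expr, columns):
    """Return True if expr only combines known columns and numeric literals."""
    try:
        tree = ast.parse(expr, mode="eval")
    except SyntaxError:
        return False
    for node in ast.walk(tree):
        if not isinstance(node, ALLOWED_NODES):
            return False  # calls, attribute access, subscripts, and so on
        if isinstance(node, ast.Name) and node.id not in columns:
            return False  # unknown identifier
        if isinstance(node, ast.Constant) and not isinstance(node.value, (int, float)):
            return False  # strings and other literals
    return True


columns = {"A", "B"}
print(is_safe_expr("A + B", columns))                       # True
print(is_safe_expr("__import__('os').system('ls')", columns))  # False
```

An allow-list like this is stricter than the expression grammar eval accepts, so it may need extending for legitimate use; the point is to reject untrusted input by default.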

Parameters:
exprstr

The expression string to evaluate.

inplacebool, default False

If the expression contains an assignment, whether to perform the operation inplace and mutate the existing DataFrame. Otherwise, a new DataFrame is returned.

**kwargs

See the documentation for eval() for complete details on the keyword arguments accepted by query().

Returns:
ndarray, scalar, pandas object, or None

The result of the evaluation or None if inplace=True.

See also

DataFrame.query

Evaluates a boolean expression to query the columns of a frame.

DataFrame.assign

Can evaluate an expression or function to create new values for a column.

eval

Evaluate a Python expression as a string using various backends.

Notes

For more details see the API documentation for eval(). For detailed examples see enhancing performance with eval.

Examples

>>> df = pd.DataFrame({'A': range(1, 6), 'B': range(10, 0, -2)})  
>>> df  
   A   B
0  1  10
1  2   8
2  3   6
3  4   4
4  5   2
>>> df.eval('A + B')  
0    11
1    10
2     9
3     8
4     7
dtype: int64

Assignment is allowed, though by default the original DataFrame is not modified.

>>> df.eval('C = A + B')  
   A   B   C
0  1  10  11
1  2   8  10
2  3   6   9
3  4   4   8
4  5   2   7
>>> df  
   A   B
0  1  10
1  2   8
2  3   6
3  4   4
4  5   2

Use inplace=True to modify the original DataFrame.

>>> df.eval('C = A + B', inplace=True)  
>>> df  
   A   B   C
0  1  10  11
1  2   8  10
2  3   6   9
3  4   4   8
4  5   2   7

Multiple columns can be assigned to using multi-line expressions:

>>> df.eval(  
...     '''
... C = A + B
... D = A - B
... '''
... )
   A   B   C  D
0  1  10  11 -9
1  2   8  10 -6
2  3   6   9 -3
3  4   4   8  0
4  5   2   7  3
explode(column)#

Transform each element of a list-like to a row, replicating index values.

This docstring was copied from pandas.core.frame.DataFrame.explode.

Some inconsistencies with the Dask version may exist.

New in version 0.25.0.

Parameters:
columnIndexLabel

Column(s) to explode. For multiple columns, specify a non-empty list in which each element is a str or tuple; the list-like data of all specified columns must have matching lengths on each row of the frame.

New in version 1.3.0: Multi-column explode

ignore_indexbool, default False (Not supported in Dask)

If True, the resulting index will be labeled 0, 1, …, n - 1.

New in version 1.1.0.

Returns:
DataFrame

Exploded lists to rows of the subset columns; index will be duplicated for these rows.

Raises:
ValueError
  • If columns of the frame are not unique.

  • If the specified columns to explode is an empty list.

  • If the specified columns to explode do not have a matching count of elements row-wise in the frame.

See also

DataFrame.unstack

Pivot a level of the (necessarily hierarchical) index labels.

DataFrame.melt

Unpivot a DataFrame from wide format to long format.

Series.explode

Explode a DataFrame from list-like columns to long format.

Notes

This routine will explode list-likes including lists, tuples, sets, Series, and np.ndarray. The result dtype of the subset rows will be object. Scalars will be returned unchanged, and empty list-likes will result in a np.nan for that row. In addition, the ordering of rows in the output will be non-deterministic when exploding sets.

Reference the user guide for more examples.
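
The rules in the note above can be sketched for a single column in plain Python. This is an illustration under simplifying assumptions: rows are (index, dict) pairs, only list, tuple, and set count as list-like, and explode_rows is a hypothetical helper rather than the library routine.

```python
import math


def explode_rows(rows, column):
    """Turn each element of a list-like cell into its own row."""
    out = []
    for index, row in rows:
        value = row[column]
        if isinstance(value, (list, tuple, set)):
            # An empty list-like still yields one row, holding NaN.
            items = list(value) or [math.nan]
        else:
            items = [value]  # scalars (including strings) pass through unchanged
        for item in items:
            # The index label is repeated for every element of the cell.
            out.append((index, {**row, column: item}))
    return out


rows = [
    (0, {"A": [0, 1, 2], "B": 1}),
    (1, {"A": "foo", "B": 1}),
    (2, {"A": [], "B": 1}),
    (3, {"A": [3, 4], "B": 1}),
]

exploded = explode_rows(rows, "A")
print([index for index, _ in exploded])  # [0, 0, 0, 1, 2, 3, 3]
```
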

Examples

>>> df = pd.DataFrame({'A': [[0, 1, 2], 'foo', [], [3, 4]],  
...                    'B': 1,
...                    'C': [['a', 'b', 'c'], np.nan, [], ['d', 'e']]})
>>> df  
           A  B          C
0  [0, 1, 2]  1  [a, b, c]
1        foo  1        NaN
2         []  1         []
3     [3, 4]  1     [d, e]

Single-column explode.

>>> df.explode('A')  
     A  B          C
0    0  1  [a, b, c]
0    1  1  [a, b, c]
0    2  1  [a, b, c]
1  foo  1        NaN
2  NaN  1         []
3    3  1     [d, e]
3    4  1     [d, e]

Multi-column explode.

>>> df.explode(list('AC'))  
     A  B    C
0    0  1    a
0    1  1    b
0    2  1    c
1  foo  1  NaN
2  NaN  1  NaN
3    3  1    d
3    4  1    e
ffill(axis=None, limit=None)#

Synonym for DataFrame.fillna() with method='ffill'.

This docstring was copied from pandas.core.frame.DataFrame.ffill.

Some inconsistencies with the Dask version may exist.

Returns:
Series/DataFrame or None

Object with missing values filled or None if inplace=True.

fillna(value=None, method=None, limit=None, axis=None)#

Fill NA/NaN values using the specified method.

This docstring was copied from pandas.core.frame.DataFrame.fillna.

Some inconsistencies with the Dask version may exist.

Parameters:
valuescalar, dict, Series, or DataFrame

Value to use to fill holes (e.g. 0), alternately a dict/Series/DataFrame of values specifying which value to use for each index (for a Series) or column (for a DataFrame). Values not in the dict/Series/DataFrame will not be filled. This value cannot be a list.

method{‘backfill’, ‘bfill’, ‘pad’, ‘ffill’, None}, default None

Method to use for filling holes in reindexed Series:

  • pad / ffill : propagate last valid observation forward to next valid.

  • backfill / bfill : use next valid observation to fill gap.

axis{0 or ‘index’, 1 or ‘columns’}

Axis along which to fill missing values. For Series this parameter is unused and defaults to 0.

inplacebool, default False (Not supported in Dask)

If True, fill in-place. Note: this will modify any other views on this object (e.g., a no-copy slice for a column in a DataFrame).

limitint, default None

If method is specified, this is the maximum number of consecutive NaN values to forward/backward fill. In other words, if there is a gap with more than this number of consecutive NaNs, it will only be partially filled. If method is not specified, this is the maximum number of entries along the entire axis where NaNs will be filled. Must be greater than 0 if not None.

downcastdict, default is None (Not supported in Dask)

A dict of item->dtype of what to downcast if possible, or the string ‘infer’ which will try to downcast to an appropriate equal type (e.g. float64 to int64 if possible).

Returns:
DataFrame or None

Object with missing values filled or None if inplace=True.
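
How limit bounds a forward fill can be sketched in plain Python. The function below is a hypothetical single-column illustration that uses None for missing values; it is not the pandas or cuDF implementation.

```python
def ffill(values, limit=None):
    """Forward-fill None entries, filling at most `limit` in a row."""
    filled = []
    last_valid = None
    run = 0  # length of the current run of consecutive missing values
    for value in values:
        if value is None:
            run += 1
            if last_valid is not None and (limit is None or run <= limit):
                filled.append(last_valid)
            else:
                filled.append(None)  # nothing to propagate, or limit exceeded
        else:
            last_valid = value
            run = 0
            filled.append(value)
    return filled


column = [None, 3.0, None, None, None, 5.0]
print(ffill(column))           # [None, 3.0, 3.0, 3.0, 3.0, 5.0]
print(ffill(column, limit=2))  # [None, 3.0, 3.0, 3.0, None, 5.0]
```

With limit=2 the gap of three missing values is only partially filled, and the leading missing value stays missing because there is no earlier observation to propagate.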

See also

interpolate

Fill NaN values using interpolation.

reindex

Conform object to new index.

asfreq

Convert TimeSeries to specified frequency.

Examples

>>> df = pd.DataFrame([[np.nan, 2, np.nan, 0],  
...                    [3, 4, np.nan, 1],
...                    [np.nan, np.nan, np.nan, np.nan],
...                    [np.nan, 3, np.nan, 4]],
...                   columns=list("ABCD"))
>>> df  
     A    B   C    D
0  NaN  2.0 NaN  0.0
1  3.0  4.0 NaN  1.0
2  NaN  NaN NaN  NaN
3  NaN  3.0 NaN  4.0

Replace all NaN elements with 0s.

>>> df.fillna(0)  
     A    B    C    D
0  0.0  2.0  0.0  0.0
1  3.0  4.0  0.0  1.0
2  0.0  0.0  0.0  0.0
3  0.0  3.0  0.0  4.0

We can also propagate non-null values forward or backward.

>>> df.fillna(method="ffill")  
     A    B   C    D
0  NaN  2.0 NaN  0.0
1  3.0  4.0 NaN  1.0
2  3.0  4.0 NaN  1.0
3  3.0  3.0 NaN  4.0

Replace all NaN elements in columns ‘A’, ‘B’, ‘C’, and ‘D’ with 0, 1, 2, and 3 respectively.

>>> values = {"A": 0, "B": 1, "C": 2, "D": 3}  
>>> df.fillna(value=values)  
     A    B    C    D
0  0.0  2.0  2.0  0.0
1  3.0  4.0  2.0  1.0
2  0.0  1.0  2.0  3.0
3  0.0  3.0  2.0  4.0

Only replace the first NaN element.

>>> df.fillna(value=values, limit=1)  
     A    B    C    D
0  0.0  2.0  2.0  0.0
1  3.0  4.0  NaN  1.0
2  NaN  1.0  NaN  3.0
3  NaN  3.0  NaN  4.0

When filling using a DataFrame, replacement happens along the same column names and same indices.

>>> df2 = pd.DataFrame(np.zeros((4, 4)), columns=list("ABCE"))  
>>> df.fillna(df2)  
     A    B    C    D
0  0.0  2.0  0.0  0.0
1  3.0  4.0  0.0  1.0
2  0.0  0.0  0.0  NaN
3  0.0  3.0  0.0  4.0

Note that column D is not affected since it is not present in df2.

first(offset)#

Select initial periods of time series data based on a date offset.

This docstring was copied from pandas.core.frame.DataFrame.first.

Some inconsistencies with the Dask version may exist.

When having a DataFrame with dates as index, this function can select the first few rows based on a date offset.

Parameters:
offsetstr, DateOffset or dateutil.relativedelta

The offset length of the data that will be selected. For instance, ‘1M’ will display all the rows having their index within the first month.

Returns:
Series or DataFrame

A subset of the caller.

Raises:
TypeError

If the index is not a DatetimeIndex

See also

last

Select final periods of time series based on a date offset.

at_time

Select values at a particular time of the day.

between_time

Select values between particular times of the day.

Examples

>>> i = pd.date_range('2018-04-09', periods=4, freq='2D')  
>>> ts = pd.DataFrame({'A': [1, 2, 3, 4]}, index=i)  
>>> ts  
            A
2018-04-09  1
2018-04-11  2
2018-04-13  3
2018-04-15  4

Get the rows for the first 3 days:

>>> ts.first('3D')  
            A
2018-04-09  1
2018-04-11  2

Notice that the data for the first 3 calendar days were returned, not the first 3 days observed in the dataset, and therefore data for 2018-04-13 was not returned.
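Recent pandas releases deprecate DataFrame.first, so the same calendar-window selection may be easier to express as an explicit mask. The sketch below assumes a fixed-length offset such as '3D'; the names cutoff and first_3_days are illustrative and not part of any API.

```python
import pandas as pd

i = pd.date_range("2018-04-09", periods=4, freq="2D")
ts = pd.DataFrame({"A": [1, 2, 3, 4]}, index=i)

# The window starts at the first index value and spans 3 calendar days.
cutoff = ts.index[0] + pd.Timedelta("3D")

# Keep rows whose timestamp falls before the cutoff (2018-04-12).
first_3_days = ts.loc[ts.index < cutoff]
print(first_3_days)
```

Only the rows for 2018-04-09 and 2018-04-11 fall inside the window, which matches the behaviour described above.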

floordiv(other, axis='columns', level=None, fill_value=None)#

Get Integer division of DataFrame or Series and other, element-wise (binary operator floordiv).

This docstring was copied from cudf.core.series.Series.floordiv.

Some inconsistencies with the Dask version may exist.

Equivalent to frame // other, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters:
otherscalar, sequence, Series, or DataFrame

Any single or multiple element data structure, or list-like object.

axisint or string

Only 0 is supported for series, 1 or columns supported for dataframe

levelint or name

Broadcast across a level, matching Index values on the passed MultiIndex level. Not yet supported.

fill_valuefloat or None, default None

Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing.

Returns:
DataFrame or Series

Result of the arithmetic operation.

Examples

DataFrame

>>> df = cudf.DataFrame(  
...     {'angles': [0, 3, 4], 'degrees': [360, 180, 360]},
...     index=['circle', 'triangle', 'rectangle']
... )
>>> df.floordiv(1)  
           angles  degrees
circle          0      360
triangle        3      180
rectangle       4      360

Series

>>> a = cudf.Series([1, 1, 1, None], index=['a', 'b', 'c', 'd'])  
>>> b = cudf.Series([1, None, 1, None], index=['a', 'b', 'd', 'e'])  
>>> a.floordiv(b)  
a       1
b    <NA>
c    <NA>
d    <NA>
e    <NA>
dtype: int64
>>> a.floordiv(b, fill_value=0)  
a                      1
b    9223372036854775807
c    9223372036854775807
d                      0
e                   <NA>
dtype: int64
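The fill_value rule can also be sketched on the CPU with pandas, whose Series.floordiv accepts the same fill_value argument. This is only an illustration for readers without a GPU at hand; cuDF results can differ in dtype and in how division by zero is reported, as the output above shows.

```python
import pandas as pd

a = pd.Series([7, 8, None, None], index=["a", "b", "c", "d"])
b = pd.Series([2, None, 3, None], index=["a", "b", "c", "d"])

# A value missing on one side only is replaced by fill_value before dividing.
# Row "d" is missing on both sides, so it stays missing.
result = a.floordiv(b, fill_value=1)
print(result)
```

Rows "b" and "c" are computed as 8 // 1 and 1 // 3 respectively, while row "d" remains missing.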
classmethod from_dict(data, *, npartitions, orient='columns', dtype=None, columns=None)#

Construct a Dask DataFrame from a Python Dictionary

See also

dask.dataframe.from_dict
ge(other, axis='columns', level=None)#

Get Greater than or equal to of dataframe and other, element-wise (binary operator ge).

This docstring was copied from pandas.core.frame.DataFrame.ge.

Some inconsistencies with the Dask version may exist.

One of the flexible wrappers (eq, ne, le, lt, ge, gt) around comparison operators.

Equivalent to ==, !=, <=, <, >=, > with support to choose axis (rows or columns) and level for comparison.

Parameters:
otherscalar, sequence, Series, or DataFrame

Any single or multiple element data structure, or list-like object.

axis{0 or ‘index’, 1 or ‘columns’}, default ‘columns’

Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’).

levelint or label

Broadcast across a level, matching Index values on the passed MultiIndex level.

Returns:
DataFrame of bool

Result of the comparison.

See also

DataFrame.eq

Compare DataFrames for equality elementwise.

DataFrame.ne

Compare DataFrames for inequality elementwise.

DataFrame.le

Compare DataFrames for less than inequality or equality elementwise.

DataFrame.lt

Compare DataFrames for strictly less than inequality elementwise.

DataFrame.ge

Compare DataFrames for greater than inequality or equality elementwise.

DataFrame.gt

Compare DataFrames for strictly greater than inequality elementwise.

Notes

Mismatched indices will be unioned together. NaN values are considered different (i.e. NaN != NaN).

Examples

>>> df = pd.DataFrame({'cost': [250, 150, 100],  
...                    'revenue': [100, 250, 300]},
...                   index=['A', 'B', 'C'])
>>> df  
   cost  revenue
A   250      100
B   150      250
C   100      300

Comparison with a scalar, using either the operator or method:

>>> df == 100  
    cost  revenue
A  False     True
B  False    False
C   True    False
>>> df.eq(100)  
    cost  revenue
A  False     True
B  False    False
C   True    False

When other is a Series, the columns of a DataFrame are aligned with the index of other and broadcast:

>>> df != pd.Series([100, 250], index=["cost", "revenue"])  
    cost  revenue
A   True     True
B   True    False
C  False     True

Use the method to control the broadcast axis:

>>> df.ne(pd.Series([100, 300], index=["A", "D"]), axis='index')  
   cost  revenue
A  True    False
B  True     True
C  True     True
D  True     True

When comparing to an arbitrary sequence, the number of columns must match the number of elements in other:

>>> df == [250, 100]  
    cost  revenue
A   True     True
B  False    False
C  False    False

Use the method to control the axis:

>>> df.eq([250, 250, 100], axis='index')  
    cost  revenue
A   True    False
B  False     True
C   True    False

Compare to a DataFrame of different shape.

>>> other = pd.DataFrame({'revenue': [300, 250, 100, 150]},  
...                      index=['A', 'B', 'C', 'D'])
>>> other  
   revenue
A      300
B      250
C      100
D      150
>>> df.gt(other)  
    cost  revenue
A  False    False
B  False    False
C  False     True
D  False    False

Compare to a MultiIndex by level.

>>> df_multindex = pd.DataFrame({'cost': [250, 150, 100, 150, 300, 220],  
...                              'revenue': [100, 250, 300, 200, 175, 225]},
...                             index=[['Q1', 'Q1', 'Q1', 'Q2', 'Q2', 'Q2'],
...                                    ['A', 'B', 'C', 'A', 'B', 'C']])
>>> df_multindex  
      cost  revenue
Q1 A   250      100
   B   150      250
   C   100      300
Q2 A   150      200
   B   300      175
   C   220      225
>>> df.le(df_multindex, level=1)  
       cost  revenue
Q1 A   True     True
   B   True     True
   C   True     True
Q2 A  False     True
   B   True    False
   C   True    False
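The examples above are shared across the comparison wrappers and never call ge itself. A minimal sketch with pandas, used here as a CPU stand-in (the variable names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"cost": [250, 150, 100],
                   "revenue": [100, 250, 300]},
                  index=["A", "B", "C"])

# Method form of `df >= 150`.
at_least_150 = df.ge(150)

# Compare every row against its own threshold by broadcasting along the index.
thresholds = pd.Series([250, 250, 250], index=["A", "B", "C"])
per_row = df.ge(thresholds, axis="index")

print(at_least_150)
print(per_row)
```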
get_partition(n)#

Get a dask DataFrame/Series representing the nth partition.

Parameters:
nint

The 0-indexed partition number to select.

Returns:
Dask DataFrame or Series

The same type as the original object.

Examples

>>> import dask
>>> ddf = dask.datasets.timeseries(start="2021-01-01", end="2021-01-07", freq="1h")
>>> ddf.get_partition(0)  
Dask DataFrame Structure:
                 name     id        x        y
npartitions=1
2021-01-01     string  int64  float64  float64
2021-01-02        ...    ...      ...      ...
Dask Name: get-partition, 3 graph layers
groupby(by=None, **kwargs)#

Group DataFrame using a mapper or by a Series of columns.

This docstring was copied from pandas.core.frame.DataFrame.groupby.

Some inconsistencies with the Dask version may exist.

A groupby operation involves some combination of splitting the object, applying a function, and combining the results. This can be used to group large amounts of data and compute operations on these groups.

Parameters:
bymapping, function, label, or list of labels

Used to determine the groups for the groupby. If by is a function, it’s called on each value of the object’s index. If a dict or Series is passed, the Series or dict VALUES will be used to determine the groups (the Series’ values are first aligned; see .align() method). If a list or ndarray of length equal to the selected axis is passed (see the groupby user guide), the values are used as-is to determine the groups. A label or list of labels may be passed to group by the columns in self. Notice that a tuple is interpreted as a (single) key.

axis{0 or ‘index’, 1 or ‘columns’}, default 0 (Not supported in Dask)

Split along rows (0) or columns (1). For Series this parameter is unused and defaults to 0.

levelint, level name, or sequence of such, default None (Not supported in Dask)

If the axis is a MultiIndex (hierarchical), group by a particular level or levels. Do not specify both by and level.

as_indexbool, default True (Not supported in Dask)

For aggregated output, return object with group labels as the index. Only relevant for DataFrame input. as_index=False is effectively “SQL-style” grouped output.

sortbool, default True

Sort group keys. Get better performance by turning this off. Note this does not influence the order of observations within each group. Groupby preserves the order of rows within each group.

group_keysbool, optional

When calling apply and the by argument produces a like-indexed (i.e. a transform) result, add group keys to index to identify pieces. By default group keys are not included when the result’s index (and column) labels match the inputs, and are included otherwise. This argument has no effect if the result produced is not like-indexed with respect to the input.

Changed in version 1.5.0: Warns that group_keys will no longer be ignored when the result from apply is a like-indexed Series or DataFrame. Specify group_keys explicitly to include the group keys or not.

squeezebool, default False (Not supported in Dask)

Reduce the dimensionality of the return type if possible, otherwise return a consistent type.

Deprecated since version 1.1.0.

observedbool, default False

This only applies if any of the groupers are Categoricals. If True: only show observed values for categorical groupers. If False: show all values for categorical groupers.

dropnabool, default True

If True, and if group keys contain NA values, NA values together with row/column will be dropped. If False, NA values will also be treated as the key in groups.

New in version 1.1.0.

Returns:
DataFrameGroupBy

Returns a groupby object that contains information about the groups.

See also

resample

Convenience method for frequency conversion and resampling of time series.

Notes

See the user guide for more detailed usage and examples, including splitting an object into groups, iterating through groups, selecting a group, aggregation, and more.

Examples

>>> df = pd.DataFrame({'Animal': ['Falcon', 'Falcon',  
...                               'Parrot', 'Parrot'],
...                    'Max Speed': [380., 370., 24., 26.]})
>>> df  
   Animal  Max Speed
0  Falcon      380.0
1  Falcon      370.0
2  Parrot       24.0
3  Parrot       26.0
>>> df.groupby(['Animal']).mean()  
        Max Speed
Animal
Falcon      375.0
Parrot       25.0

Hierarchical Indexes

We can groupby different levels of a hierarchical index using the level parameter:

>>> arrays = [['Falcon', 'Falcon', 'Parrot', 'Parrot'],  
...           ['Captive', 'Wild', 'Captive', 'Wild']]
>>> index = pd.MultiIndex.from_arrays(arrays, names=('Animal', 'Type'))  
>>> df = pd.DataFrame({'Max Speed': [390., 350., 30., 20.]},  
...                   index=index)
>>> df  
                Max Speed
Animal Type
Falcon Captive      390.0
       Wild         350.0
Parrot Captive       30.0
       Wild          20.0
>>> df.groupby(level=0).mean()  
        Max Speed
Animal
Falcon      370.0
Parrot       25.0
>>> df.groupby(level="Type").mean()  
         Max Speed
Type
Captive      210.0
Wild         185.0

We can also choose to include NA in group keys or not by setting dropna parameter, the default setting is True.

>>> l = [[1, 2, 3], [1, None, 4], [2, 1, 3], [1, 2, 2]]  
>>> df = pd.DataFrame(l, columns=["a", "b", "c"])  
>>> df.groupby(by=["b"]).sum()  
    a   c
b
1.0 2   3
2.0 2   5
>>> df.groupby(by=["b"], dropna=False).sum()  
    a   c
b
1.0 2   3
2.0 2   5
NaN 1   4
>>> l = [["a", 12, 12], [None, 12.3, 33.], ["b", 12.3, 123], ["a", 1, 1]]  
>>> df = pd.DataFrame(l, columns=["a", "b", "c"])  
>>> df.groupby(by="a").sum()  
    b     c
a
a   13.0   13.0
b   12.3  123.0
>>> df.groupby(by="a", dropna=False).sum()  
    b     c
a
a   13.0   13.0
b   12.3  123.0
NaN 12.3   33.0

When using .apply(), use group_keys to include or exclude the group keys. The group_keys argument defaults to True (include).

>>> df = pd.DataFrame({'Animal': ['Falcon', 'Falcon',  
...                               'Parrot', 'Parrot'],
...                    'Max Speed': [380., 370., 24., 26.]})
>>> df.groupby("Animal", group_keys=True).apply(lambda x: x)  
          Animal  Max Speed
Animal
Falcon 0  Falcon      380.0
       1  Falcon      370.0
Parrot 2  Parrot       24.0
       3  Parrot       26.0
>>> df.groupby("Animal", group_keys=False).apply(lambda x: x)  
   Animal  Max Speed
0  Falcon      380.0
1  Falcon      370.0
2  Parrot       24.0
3  Parrot       26.0
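The effect of the sort parameter described above can be sketched with pandas. This is an illustration of the pandas behaviour only; as noted, the Dask version may differ in some respects.

```python
import pandas as pd

df = pd.DataFrame({"key": ["b", "a", "b", "a"],
                   "value": [1, 2, 3, 4]})

# Default: group keys are sorted in the result.
sorted_keys = df.groupby("key").sum()

# sort=False: group keys appear in order of first appearance.
unsorted_keys = df.groupby("key", sort=False).sum()

print(sorted_keys)
print(unsorted_keys)
```

The aggregated values are the same in both cases; only the order of the group keys in the result changes.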
gt(other, axis='columns', level=None)#

Get Greater than of dataframe and other, element-wise (binary operator gt).

This docstring was copied from pandas.core.frame.DataFrame.gt.

Some inconsistencies with the Dask version may exist.

One of the flexible wrappers (eq, ne, le, lt, ge, gt) around comparison operators.

Equivalent to ==, !=, <=, <, >=, > with support to choose axis (rows or columns) and level for comparison.

Parameters:
otherscalar, sequence, Series, or DataFrame

Any single or multiple element data structure, or list-like object.

axis{0 or ‘index’, 1 or ‘columns’}, default ‘columns’

Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’).

levelint or label

Broadcast across a level, matching Index values on the passed MultiIndex level.

Returns:
DataFrame of bool

Result of the comparison.

See also

DataFrame.eq

Compare DataFrames for equality elementwise.

DataFrame.ne

Compare DataFrames for inequality elementwise.

DataFrame.le

Compare DataFrames for less than inequality or equality elementwise.

DataFrame.lt

Compare DataFrames for strictly less than inequality elementwise.

DataFrame.ge

Compare DataFrames for greater than inequality or equality elementwise.

DataFrame.gt

Compare DataFrames for strictly greater than inequality elementwise.

Notes

Mismatched indices will be unioned together. NaN values are considered different (i.e. NaN != NaN).

Examples

>>> df = pd.DataFrame({'cost': [250, 150, 100],  
...                    'revenue': [100, 250, 300]},
...                   index=['A', 'B', 'C'])
>>> df  
   cost  revenue
A   250      100
B   150      250
C   100      300

Comparison with a scalar, using either the operator or method:

>>> df == 100  
    cost  revenue
A  False     True
B  False    False
C   True    False
>>> df.eq(100)  
    cost  revenue
A  False     True
B  False    False
C   True    False

When other is a Series, the columns of a DataFrame are aligned with the index of other and broadcast:

>>> df != pd.Series([100, 250], index=["cost", "revenue"])  
    cost  revenue
A   True     True
B   True    False
C  False     True

Use the method to control the broadcast axis:

>>> df.ne(pd.Series([100, 300], index=["A", "D"]), axis='index')  
   cost  revenue
A  True    False
B  True     True
C  True     True
D  True     True

When comparing to an arbitrary sequence, the number of columns must match the number of elements in other:

>>> df == [250, 100]  
    cost  revenue
A   True     True
B  False    False
C  False    False

Use the method to control the axis:

>>> df.eq([250, 250, 100], axis='index')  
    cost  revenue
A   True    False
B  False     True
C   True    False

Compare to a DataFrame of different shape.

>>> other = pd.DataFrame({'revenue': [300, 250, 100, 150]},  
...                      index=['A', 'B', 'C', 'D'])
>>> other  
   revenue
A      300
B      250
C      100
D      150
>>> df.gt(other)  
    cost  revenue
A  False    False
B  False    False
C  False     True
D  False    False

Compare to a MultiIndex by level.

>>> df_multindex = pd.DataFrame({'cost': [250, 150, 100, 150, 300, 220],  
...                              'revenue': [100, 250, 300, 200, 175, 225]},
...                             index=[['Q1', 'Q1', 'Q1', 'Q2', 'Q2', 'Q2'],
...                                    ['A', 'B', 'C', 'A', 'B', 'C']])
>>> df_multindex  
      cost  revenue
Q1 A   250      100
   B   150      250
   C   100      300
Q2 A   150      200
   B   300      175
   C   220      225
>>> df.le(df_multindex, level=1)  
       cost  revenue
Q1 A   True     True
   B   True     True
   C   True     True
Q2 A  False     True
   B   True    False
   C   True    False
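A small sketch of how gt treats ties and missing values, using pandas as a CPU stand-in (the frames here are made up for illustration): the comparison is strict, and any comparison involving NaN evaluates to False.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"cost": [250, 150, np.nan]}, index=["A", "B", "C"])
other = pd.DataFrame({"cost": [200, 150, np.nan]}, index=["A", "B", "C"])

# A: 250 > 200 is True; B: 150 > 150 is False (tie); C: NaN > NaN is False.
strictly_greater = df.gt(other)
print(strictly_greater)
```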
head(n=5, npartitions=1, compute=True)#

First n rows of the dataset

Parameters:
nint, optional

The number of rows to return. Default is 5.

npartitionsint, optional

Elements are only taken from the first npartitions, with a default of 1. If there are fewer than n rows in the first npartitions a warning will be raised and any found rows returned. Pass -1 to use all partitions.

computebool, optional

Whether to compute the result, default is True.

idxmax(axis=None, skipna=True, split_every=False, numeric_only=_NoDefault.no_default)#

Return index of first occurrence of maximum over requested axis.

This docstring was copied from pandas.core.frame.DataFrame.idxmax.

Some inconsistencies with the Dask version may exist.

NA/null values are excluded.

Parameters:
axis{0 or ‘index’, 1 or ‘columns’}, default 0

The axis to use. 0 or ‘index’ for row-wise, 1 or ‘columns’ for column-wise.

skipnabool, default True

Exclude NA/null values. If an entire row/column is NA, the result will be NA.

numeric_onlybool, default False

Include only float, int or boolean data.

New in version 1.5.0.

Returns:
Series

Indexes of maxima along the specified axis.

Raises:
ValueError
  • If the row/column is empty

See also

Series.idxmax

Return index of the maximum element.

Notes

This method is the DataFrame version of ndarray.argmax.

Examples

Consider a dataset containing food consumption in Argentina.

>>> df = pd.DataFrame({'consumption': [10.51, 103.11, 55.48],  
...                    'co2_emissions': [37.2, 19.66, 1712]},
...                    index=['Pork', 'Wheat Products', 'Beef'])
>>> df  
                consumption  co2_emissions
Pork                  10.51         37.20
Wheat Products       103.11         19.66
Beef                  55.48       1712.00

By default, it returns the index for the maximum value in each column.

>>> df.idxmax()  
consumption     Wheat Products
co2_emissions             Beef
dtype: object

To return the index for the maximum value in each row, use axis="columns".

>>> df.idxmax(axis="columns")  
Pork              co2_emissions
Wheat Products     consumption
Beef              co2_emissions
dtype: object
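Two details from the description above, first occurrence on ties and exclusion of NA values, can be sketched with pandas (the data is made up for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, 3, 3], "b": [np.nan, 2.0, 1.0]},
                  index=["x", "y", "z"])

# Column "a": the maximum 3 occurs at "y" and "z"; the first label is returned.
# Column "b": the NaN at "x" is skipped, so the maximum 2.0 at "y" is found.
labels = df.idxmax()
print(labels)
```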
idxmin(axis=None, skipna=True, split_every=False, numeric_only=_NoDefault.no_default)#

Return index of first occurrence of minimum over requested axis.

This docstring was copied from pandas.core.frame.DataFrame.idxmin.

Some inconsistencies with the Dask version may exist.

NA/null values are excluded.

Parameters:
axis{0 or ‘index’, 1 or ‘columns’}, default 0

The axis to use. 0 or ‘index’ for row-wise, 1 or ‘columns’ for column-wise.

skipnabool, default True

Exclude NA/null values. If an entire row/column is NA, the result will be NA.

numeric_onlybool, default False

Include only float, int or boolean data.

New in version 1.5.0.

Returns:
Series

Indexes of minima along the specified axis.

Raises:
ValueError
  • If the row/column is empty

See also

Series.idxmin

Return index of the minimum element.

Notes

This method is the DataFrame version of ndarray.argmin.

Examples

Consider a dataset containing food consumption in Argentina.

>>> df = pd.DataFrame({'consumption': [10.51, 103.11, 55.48],  
...                    'co2_emissions': [37.2, 19.66, 1712]},
...                    index=['Pork', 'Wheat Products', 'Beef'])
>>> df  
                consumption  co2_emissions
Pork                  10.51         37.20
Wheat Products       103.11         19.66
Beef                  55.48       1712.00

By default, it returns the index for the minimum value in each column.

>>> df.idxmin()  
consumption                Pork
co2_emissions    Wheat Products
dtype: object

To return the index for the minimum value in each row, use axis="columns".

>>> df.idxmin(axis="columns")  
Pork                consumption
Wheat Products    co2_emissions
Beef                consumption
dtype: object
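The relationship to ndarray.argmin mentioned in the notes can be sketched as follows, using pandas as a stand-in: idxmin returns an index label, while argmin returns an integer position.

```python
import pandas as pd

df = pd.DataFrame({"consumption": [10.51, 103.11, 55.48]},
                  index=["Pork", "Wheat Products", "Beef"])

# Label of the row holding the minimum value.
label = df.idxmin()["consumption"]

# Integer position of the minimum in the underlying array.
position = df["consumption"].to_numpy().argmin()

print(label, position)
```

Looking up df.index[position] recovers the same label that idxmin returns.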
property iloc#

Purely integer-location based indexing for selection by position.

Only indexing the column positions is supported. Trying to select row positions will raise a ValueError.

See Indexing into Dask DataFrames for more.

Examples

>>> df.iloc[:, [2, 0, 1]]  
property index#

Return dask Index instance

info(buf=None, verbose=False, memory_usage=False)#

Concise summary of a Dask DataFrame.

isin(values)#

Whether each element in the DataFrame is contained in values.

This docstring was copied from pandas.core.frame.DataFrame.isin.

Some inconsistencies with the Dask version may exist.

Parameters:
valuesiterable, Series, DataFrame or dict

The result will only be true at a location if all the labels match. If values is a Series, that’s the index. If values is a dict, the keys must be the column names, which must match. If values is a DataFrame, then both the index and column labels must match.

Returns:
DataFrame

DataFrame of booleans showing whether each element in the DataFrame is contained in values.

See also

DataFrame.eq

Equality test for DataFrame.

Series.isin

Equivalent method on Series.

Series.str.contains

Test if pattern or regex is contained within a string of a Series or Index.

Examples

>>> df = pd.DataFrame({'num_legs': [2, 4], 'num_wings': [2, 0]},  
...                   index=['falcon', 'dog'])
>>> df  
        num_legs  num_wings
falcon         2          2
dog            4          0

When values is a list, check whether every value in the DataFrame is present in the list (which animals have 0 or 2 legs or wings):

>>> df.isin([0, 2])  
        num_legs  num_wings
falcon      True       True
dog        False       True

To check if values is not in the DataFrame, use the ~ operator:

>>> ~df.isin([0, 2])  
        num_legs  num_wings
falcon     False      False
dog         True      False

When values is a dict, we can pass values to check for each column separately:

>>> df.isin({'num_wings': [0, 3]})  
        num_legs  num_wings
falcon     False      False
dog        False       True

When values is a Series or DataFrame the index and column must match. Note that ‘falcon’ does not match based on the number of legs in other.

>>> other = pd.DataFrame({'num_legs': [8, 3], 'num_wings': [0, 2]},  
...                      index=['spider', 'falcon'])
>>> df.isin(other)  
        num_legs  num_wings
falcon     False       True
dog        False      False
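The Series case described in the parameter notes (matching on the index) is not shown above. A minimal pandas sketch, with an illustrative Series named expected:

```python
import pandas as pd

df = pd.DataFrame({"num_legs": [2, 4], "num_wings": [2, 0]},
                  index=["falcon", "dog"])

# Each row is compared with the Series value that shares its index label.
expected = pd.Series([2, 4], index=["falcon", "dog"])
matches = df.isin(expected)
print(matches)
```

Every cell in the "falcon" row is compared with 2 and every cell in the "dog" row with 4, so only the dog's num_wings fails to match.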
isna()#

Detect missing values.

This docstring was copied from pandas.core.frame.DataFrame.isna.

Some inconsistencies with the Dask version may exist.

Return a boolean same-sized object indicating if the values are NA. NA values, such as None or numpy.nan, get mapped to True values. Everything else gets mapped to False values. Characters such as empty strings '' or numpy.inf are not considered NA values (unless you set pandas.options.mode.use_inf_as_na = True).

Returns:
DataFrame

Mask of bool values for each element in DataFrame that indicates whether an element is an NA value.

See also

DataFrame.isnull

Alias of isna.

DataFrame.notna

Boolean inverse of isna.

DataFrame.dropna

Omit axes labels with missing values.

isna

Top-level isna.

Examples

Show which entries in a DataFrame are NA.

>>> df = pd.DataFrame(dict(age=[5, 6, np.nan],  
...                    born=[pd.NaT, pd.Timestamp('1939-05-27'),
...                          pd.Timestamp('1940-04-25')],
...                    name=['Alfred', 'Batman', ''],
...                    toy=[None, 'Batmobile', 'Joker']))
>>> df  
   age       born    name        toy
0  5.0        NaT  Alfred       None
1  6.0 1939-05-27  Batman  Batmobile
2  NaN 1940-04-25              Joker
>>> df.isna()  
     age   born   name    toy
0  False   True  False   True
1  False  False  False  False
2   True  False  False  False

Show which entries in a Series are NA.

>>> ser = pd.Series([5, 6, np.nan])  
>>> ser  
0    5.0
1    6.0
2    NaN
dtype: float64
>>> ser.isna()  
0    False
1    False
2     True
dtype: bool
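The statement that empty strings and infinities are not treated as NA can be checked with a short pandas sketch (default options assumed; the column names are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"text": ["", None], "number": [np.inf, np.nan]})

# Only None and NaN are flagged; "" and inf are ordinary values.
mask = df.isna()
print(mask)
```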
isnull()#

DataFrame.isnull is an alias for DataFrame.isna.

This docstring was copied from pandas.core.frame.DataFrame.isnull.

Some inconsistencies with the Dask version may exist.

Detect missing values.

Return a boolean same-sized object indicating if the values are NA. NA values, such as None or numpy.nan, get mapped to True values. Everything else gets mapped to False values. Characters such as empty strings '' or numpy.inf are not considered NA values (unless you set pandas.options.mode.use_inf_as_na = True).

Returns:
DataFrame

Mask of bool values for each element in DataFrame that indicates whether an element is an NA value.

See also

DataFrame.isnull

Alias of isna.

DataFrame.notna

Boolean inverse of isna.

DataFrame.dropna

Omit axes labels with missing values.

isna

Top-level isna.

Examples

Show which entries in a DataFrame are NA.

>>> df = pd.DataFrame(dict(age=[5, 6, np.nan],  
...                    born=[pd.NaT, pd.Timestamp('1939-05-27'),
...                          pd.Timestamp('1940-04-25')],
...                    name=['Alfred', 'Batman', ''],
...                    toy=[None, 'Batmobile', 'Joker']))
>>> df  
   age       born    name        toy
0  5.0        NaT  Alfred       None
1  6.0 1939-05-27  Batman  Batmobile
2  NaN 1940-04-25              Joker
>>> df.isna()  
     age   born   name    toy
0  False   True  False   True
1  False  False  False  False
2   True  False  False  False

Show which entries in a Series are NA.

>>> ser = pd.Series([5, 6, np.nan])  
>>> ser  
0    5.0
1    6.0
2    NaN
dtype: float64
>>> ser.isna()  
0    False
1    False
2     True
dtype: bool
items()#

Iterate over (column name, Series) pairs.

This docstring was copied from pandas.core.frame.DataFrame.items.

Some inconsistencies with the Dask version may exist.

Iterates over the DataFrame columns, returning a tuple with the column name and the content as a Series.

Yields:
labelobject

The column names for the DataFrame being iterated over.

contentSeries

The column entries belonging to each label, as a Series.

See also

DataFrame.iterrows

Iterate over DataFrame rows as (index, Series) pairs.

DataFrame.itertuples

Iterate over DataFrame rows as namedtuples of the values.

Examples

>>> df = pd.DataFrame({'species': ['bear', 'bear', 'marsupial'],  
...                   'population': [1864, 22000, 80000]},
...                   index=['panda', 'polar', 'koala'])
>>> df  
        species   population
panda   bear      1864
polar   bear      22000
koala   marsupial 80000
>>> for label, content in df.items():  
...     print(f'label: {label}')
...     print(f'content: {content}', sep='\n')
...
label: species
content:
panda         bear
polar         bear
koala    marsupial
Name: species, dtype: object
label: population
content:
panda     1864
polar    22000
koala    80000
Name: population, dtype: int64
iterrows()#

Iterate over DataFrame rows as (index, Series) pairs.

This docstring was copied from pandas.core.frame.DataFrame.iterrows.

Some inconsistencies with the Dask version may exist.

Yields:
indexlabel or tuple of label

The index of the row. A tuple for a MultiIndex.

dataSeries

The data of the row as a Series.

See also

DataFrame.itertuples

Iterate over DataFrame rows as namedtuples of the values.

DataFrame.items

Iterate over (column name, Series) pairs.

Notes

  1. Because iterrows returns a Series for each row, it does not preserve dtypes across the rows (dtypes are preserved across columns for DataFrames). For example,

    >>> df = pd.DataFrame([[1, 1.5]], columns=['int', 'float'])  
    >>> row = next(df.iterrows())[1]  
    >>> row  
    int      1.0
    float    1.5
    Name: 0, dtype: float64
    >>> print(row['int'].dtype)  
    float64
    >>> print(df['int'].dtype)  
    int64
    

    To preserve dtypes while iterating over the rows, it is better to use itertuples() which returns namedtuples of the values and which is generally faster than iterrows.

  2. You should never modify something you are iterating over. This is not guaranteed to work in all cases. Depending on the data types, the iterator returns a copy and not a view, and writing to it will have no effect.
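The dtype point in the first note can be made concrete by comparing the two iterators side by side. This is a pandas sketch with made-up column names n and x:

```python
import pandas as pd

df = pd.DataFrame({"n": [1], "x": [1.5]})

# iterrows packs each row into one Series, so the integer is upcast to float.
_, row_series = next(df.iterrows())

# itertuples keeps each column's own type for every field.
row_tuple = next(df.itertuples(index=False))

print(row_series.dtype, type(row_tuple.n), type(row_tuple.x))
```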

itertuples(index=True, name='Pandas')#

Iterate over DataFrame rows as namedtuples.

This docstring was copied from pandas.core.frame.DataFrame.itertuples.

Some inconsistencies with the Dask version may exist.

Parameters:
indexbool, default True

If True, return the index as the first element of the tuple.

namestr or None, default “Pandas”

The name of the returned namedtuples or None to return regular tuples.

Returns:
iterator

An object to iterate over namedtuples for each row in the DataFrame with the first field possibly being the index and following fields being the column values.

See also

DataFrame.iterrows

Iterate over DataFrame rows as (index, Series) pairs.

DataFrame.items

Iterate over (column name, Series) pairs.

Notes

The column names will be renamed to positional names if they are invalid Python identifiers, repeated, or start with an underscore.

Examples

>>> df = pd.DataFrame({'num_legs': [4, 2], 'num_wings': [0, 2]},  
...                   index=['dog', 'hawk'])
>>> df  
      num_legs  num_wings
dog          4          0
hawk         2          2
>>> for row in df.itertuples():  
...     print(row)
...
Pandas(Index='dog', num_legs=4, num_wings=0)
Pandas(Index='hawk', num_legs=2, num_wings=2)

By setting the index parameter to False we can remove the index as the first element of the tuple:

>>> for row in df.itertuples(index=False):  
...     print(row)
...
Pandas(num_legs=4, num_wings=0)
Pandas(num_legs=2, num_wings=2)

Setting the name parameter gives the yielded namedtuples a custom name:

>>> for row in df.itertuples(name='Animal'):  
...     print(row)
...
Animal(Index='dog', num_legs=4, num_wings=0)
Animal(Index='hawk', num_legs=2, num_wings=2)
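The renaming rule from the notes can be sketched with pandas. The column names below are chosen deliberately to be an invalid identifier and an underscore-prefixed name:

```python
import pandas as pd

df = pd.DataFrame({"my col": [1], "_hidden": [2], "ok": [3]})

row = next(df.itertuples(index=False))

# Unusable names are replaced by positional ones; valid names are kept.
fields = row._fields
print(fields)
```

Positional access such as row[0] keeps working regardless of how the fields are named.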
join(other, shuffle_method=None, **kwargs)#

Join columns of another DataFrame.

This docstring was copied from pandas.core.frame.DataFrame.join.

Some inconsistencies with the Dask version may exist.

Join columns with other DataFrame either on index or on a key column. Efficiently join multiple DataFrame objects by index at once by passing a list.

Parameters:
otherDataFrame, Series, or a list containing any combination of them

Index should be similar to one of the columns in this one. If a Series is passed, its name attribute must be set, and that will be used as the column name in the resulting joined DataFrame.

onstr, list of str, or array-like, optional

Column or index level name(s) in the caller to join on the index in other, otherwise joins index-on-index. If multiple values given, the other DataFrame must have a MultiIndex. Can pass an array as the join key if it is not already contained in the calling DataFrame. Like an Excel VLOOKUP operation.

how{‘left’, ‘right’, ‘outer’, ‘inner’}, default ‘left’

How to handle the operation of the two objects.

  • left: use calling frame’s index (or column if on is specified)

  • right: use other’s index.

  • outer: form union of calling frame’s index (or column if on is specified) with other’s index, and sort it lexicographically.

  • inner: form intersection of calling frame’s index (or column if on is specified) with other’s index, preserving the order of the calling frame’s index.

  • cross: creates the cartesian product from both frames, preserves the order of the left keys.

    New in version 1.2.0.

lsuffixstr, default ‘’

Suffix to use from left frame’s overlapping columns.

rsuffixstr, default ‘’

Suffix to use from right frame’s overlapping columns.

sortbool, default False (Not supported in Dask)

Order result DataFrame lexicographically by the join key. If False, the order of the join key depends on the join type (how keyword).

validatestr, optional (Not supported in Dask)

If specified, checks if join is of specified type.

  • “one_to_one” or “1:1”: check if join keys are unique in both left and right datasets.

  • “one_to_many” or “1:m”: check if join keys are unique in left dataset.

  • “many_to_one” or “m:1”: check if join keys are unique in right dataset.

  • “many_to_many” or “m:m”: allowed, but does not result in checks.

New in version 1.5.0.

Returns:
DataFrame

A dataframe containing columns from both the caller and other.

See also

DataFrame.merge

For column(s)-on-column(s) operations.

Notes

Parameters on, lsuffix, and rsuffix are not supported when passing a list of DataFrame objects.

Support for specifying index levels as the on parameter was added in version 0.23.0.

Examples

>>> df = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3', 'K4', 'K5'],  
...                    'A': ['A0', 'A1', 'A2', 'A3', 'A4', 'A5']})
>>> df  
  key   A
0  K0  A0
1  K1  A1
2  K2  A2
3  K3  A3
4  K4  A4
5  K5  A5
>>> other = pd.DataFrame({'key': ['K0', 'K1', 'K2'],  
...                       'B': ['B0', 'B1', 'B2']})
>>> other  
  key   B
0  K0  B0
1  K1  B1
2  K2  B2

Join DataFrames using their indexes.

>>> df.join(other, lsuffix='_caller', rsuffix='_other')  
  key_caller   A key_other    B
0         K0  A0        K0   B0
1         K1  A1        K1   B1
2         K2  A2        K2   B2
3         K3  A3       NaN  NaN
4         K4  A4       NaN  NaN
5         K5  A5       NaN  NaN

If we want to join using the key columns, we need to set key to be the index in both df and other. The joined DataFrame will have key as its index.

>>> df.set_index('key').join(other.set_index('key'))  
      A    B
key
K0   A0   B0
K1   A1   B1
K2   A2   B2
K3   A3  NaN
K4   A4  NaN
K5   A5  NaN

Another option to join using the key columns is to use the on parameter. DataFrame.join always uses other’s index but we can use any column in df. This method preserves the original DataFrame’s index in the result.

>>> df.join(other.set_index('key'), on='key')  
  key   A    B
0  K0  A0   B0
1  K1  A1   B1
2  K2  A2   B2
3  K3  A3  NaN
4  K4  A4  NaN
5  K5  A5  NaN

Using non-unique key values shows how they are matched.

>>> df = pd.DataFrame({'key': ['K0', 'K1', 'K1', 'K3', 'K0', 'K1'],  
...                    'A': ['A0', 'A1', 'A2', 'A3', 'A4', 'A5']})
>>> df  
  key   A
0  K0  A0
1  K1  A1
2  K1  A2
3  K3  A3
4  K0  A4
5  K1  A5
>>> df.join(other.set_index('key'), on='key', validate='m:1')  
  key   A    B
0  K0  A0   B0
1  K1  A1   B1
2  K1  A2   B1
3  K3  A3  NaN
4  K0  A4   B0
5  K1  A5   B1
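
The effect of the how options described in the parameters above can be compared side by side. The following is a minimal sketch using plain pandas and made-up frames (left, right, and labels are illustrative names, not part of the API); it records which index labels survive each index-on-index join. Dask may not support every pandas option, so treat this as an illustration of the pandas semantics only.

```python
import pandas as pd

# Two small frames whose indexes only partly overlap (illustrative data).
left = pd.DataFrame({"A": ["A0", "A1", "A2"]}, index=["K0", "K1", "K2"])
right = pd.DataFrame({"B": ["B1", "B2", "B3"]}, index=["K1", "K2", "K3"])

# Join index-on-index once per strategy and record which labels survive.
labels = {
    how: list(left.join(right, how=how).index)
    for how in ("left", "right", "inner", "outer")
}

for how, index_labels in labels.items():
    print(f"{how:>5}: {index_labels}")
```

A left join keeps the caller's labels, a right join keeps other's labels, inner keeps only the shared labels, and outer keeps the sorted union.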
property known_divisions#

Whether divisions are already known

kurtosis(axis=0, fisher=True, bias=True, nan_policy='propagate', out=None, numeric_only=_NoDefault.no_default)#

Return unbiased kurtosis over requested axis.

This docstring was copied from pandas.core.frame.DataFrame.kurtosis.

Some inconsistencies with the Dask version may exist.

Note

This implementation follows the dask.array.stats implementation of kurtosis and calculates kurtosis without taking into account a bias term for finite sample size, which corresponds to the default settings of the scipy.stats kurtosis calculation. This differs from pandas.

Further, this method currently does not support filtering out NaN values, which also differs from pandas.

Kurtosis obtained using Fisher’s definition of kurtosis (kurtosis of normal == 0.0). Normalized by N-1.

Parameters:
axis{index (0), columns (1)}

Axis for the function to be applied on. For Series this parameter is unused and defaults to 0.

skipnabool, default True (Not supported in Dask)

Exclude NA/null values when computing the result.

levelint or level name, default None (Not supported in Dask)

If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series.

Deprecated since version 1.3.0: The level keyword is deprecated. Use groupby instead.

numeric_onlybool, default None

Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for Series.

Deprecated since version 1.5.0: Specifying numeric_only=None is deprecated. The default value will be False in a future version of pandas.

**kwargs

Additional keyword arguments to be passed to the function.

Returns:
Series or DataFrame (if level specified)
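
The note above distinguishes a kurtosis computed without a finite-sample bias correction from the corrected value that pandas reports. The following pure-Python sketch contrasts the two conventions, assuming the standard moment-based formulas; the function names are hypothetical and this is not the Dask implementation.

```python
def fisher_kurtosis_biased(values):
    """Fisher (excess) kurtosis from population moments, no bias correction."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m4 = sum((v - mean) ** 4 for v in values) / n  # fourth central moment
    return m4 / m2 ** 2 - 3.0


def fisher_kurtosis_unbiased(values):
    """Same quantity with the usual finite-sample correction (needs n > 3)."""
    n = len(values)
    g2 = fisher_kurtosis_biased(values)
    return (n - 1) / ((n - 2) * (n - 3)) * ((n + 1) * g2 + 6)


sample = [1.0, 2.0, 3.0, 4.0, 5.0]
biased = fisher_kurtosis_biased(sample)
unbiased = fisher_kurtosis_unbiased(sample)
print(round(biased, 6), round(unbiased, 6))
```

For this sample the uncorrected value is about -1.3 and the corrected value about -1.2, so the two conventions can differ noticeably on small inputs.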
last(offset)#

Select final periods of time series data based on a date offset.

This docstring was copied from pandas.core.frame.DataFrame.last.

Some inconsistencies with the Dask version may exist.

For a DataFrame with a sorted DatetimeIndex, this function selects the last few rows based on a date offset.

Parameters:
offsetstr, DateOffset, dateutil.relativedelta

The offset length of the data that will be selected. For instance, ‘3D’ will display all the rows having their index within the last 3 days.

Returns:
Series or DataFrame

A subset of the caller.

Raises:
TypeError

If the index is not a DatetimeIndex

See also

first

Select initial periods of time series based on a date offset.

at_time

Select values at a particular time of the day.

between_time

Select values between particular times of the day.

Examples

>>> i = pd.date_range('2018-04-09', periods=4, freq='2D')  
>>> ts = pd.DataFrame({'A': [1, 2, 3, 4]}, index=i)  
>>> ts  
            A
2018-04-09  1
2018-04-11  2
2018-04-13  3
2018-04-15  4

Get the rows for the last 3 days:

>>> ts.last('3D')  
            A
2018-04-13  3
2018-04-15  4

Notice that the data for the last 3 calendar days were returned, not the last 3 observed days in the dataset, and therefore the data for 2018-04-11 was not returned.
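
The calendar-based selection described above can also be written as an explicit boolean filter. The following is a minimal sketch using plain pandas, assuming a fixed-length offset such as '3D' and a sorted DatetimeIndex; cutoff and last_three_days are illustrative names.

```python
import pandas as pd

i = pd.date_range("2018-04-09", periods=4, freq="2D")
ts = pd.DataFrame({"A": [1, 2, 3, 4]}, index=i)

# Keep the rows that are strictly later than (last timestamp - offset).
offset = pd.Timedelta("3D")
cutoff = ts.index[-1] - offset
last_three_days = ts[ts.index > cutoff]
print(last_three_days)
```

The cutoff here is 2018-04-12, which is why the row for 2018-04-11 is excluded even though it is among the last three observations.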

le(other, axis='columns', level=None)#

Return whether each element of the DataFrame is less than or equal to other, element-wise (binary operator le).

This docstring was copied from pandas.core.frame.DataFrame.le.

Some inconsistencies with the Dask version may exist.

Among flexible wrappers (eq, ne, le, lt, ge, gt) to comparison operators.

Equivalent to ==, !=, <=, <, >=, > with support to choose axis (rows or columns) and level for comparison.

Parameters:
otherscalar, sequence, Series, or DataFrame

Any single or multiple element data structure, or list-like object.

axis{0 or ‘index’, 1 or ‘columns’}, default ‘columns’

Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’).

levelint or label

Broadcast across a level, matching Index values on the passed MultiIndex level.

Returns:
DataFrame of bool

Result of the comparison.

See also

DataFrame.eq

Compare DataFrames for equality elementwise.

DataFrame.ne

Compare DataFrames for inequality elementwise.

DataFrame.le

Compare DataFrames for less than inequality or equality elementwise.

DataFrame.lt

Compare DataFrames for strictly less than inequality elementwise.

DataFrame.ge

Compare DataFrames for greater than inequality or equality elementwise.

DataFrame.gt

Compare DataFrames for strictly greater than inequality elementwise.

Notes

Mismatched indices will be unioned together. NaN values are considered different (i.e. NaN != NaN).

Examples

>>> df = pd.DataFrame({'cost': [250, 150, 100],  
...                    'revenue': [100, 250, 300]},
...                   index=['A', 'B', 'C'])
>>> df  
   cost  revenue
A   250      100
B   150      250
C   100      300

Comparison with a scalar, using either the operator or method:

>>> df == 100  
    cost  revenue
A  False     True
B  False    False
C   True    False
>>> df.eq(100)  
    cost  revenue
A  False     True
B  False    False
C   True    False

When other is a Series, the columns of a DataFrame are aligned with the index of other and broadcast:

>>> df != pd.Series([100, 250], index=["cost", "revenue"])  
    cost  revenue
A   True     True
B   True    False
C  False     True

Use the method to control the broadcast axis:

>>> df.ne(pd.Series([100, 300], index=["A", "D"]), axis='index')  
   cost  revenue
A  True    False
B  True     True
C  True     True
D  True     True

When comparing to an arbitrary sequence, the number of columns must match the number of elements in other:

>>> df == [250, 100]  
    cost  revenue
A   True     True
B  False    False
C  False    False

Use the method to control the axis:

>>> df.eq([250, 250, 100], axis='index')  
    cost  revenue
A   True    False
B  False     True
C   True    False

Compare to a DataFrame of different shape.

>>> other = pd.DataFrame({'revenue': [300, 250, 100, 150]},  
...                      index=['A', 'B', 'C', 'D'])
>>> other  
   revenue
A      300
B      250
C      100
D      150
>>> df.gt(other)  
    cost  revenue
A  False    False
B  False    False
C  False     True
D  False    False

Compare to a MultiIndex by level.

>>> df_multindex = pd.DataFrame({'cost': [250, 150, 100, 150, 300, 220],  
...                              'revenue': [100, 250, 300, 200, 175, 225]},
...                             index=[['Q1', 'Q1', 'Q1', 'Q2', 'Q2', 'Q2'],
...                                    ['A', 'B', 'C', 'A', 'B', 'C']])
>>> df_multindex  
      cost  revenue
Q1 A   250      100
   B   150      250
   C   100      300
Q2 A   150      200
   B   300      175
   C   220      225
>>> df.le(df_multindex, level=1)  
       cost  revenue
Q1 A   True     True
   B   True     True
   C   True     True
Q2 A  False     True
   B   True    False
   C   True    False
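
The copied examples above mostly exercise eq, ne, and gt. The following is a small sketch of le itself against a scalar, using plain pandas and the same cost/revenue frame; by_method and by_operator are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({"cost": [250, 150, 100],
                   "revenue": [100, 250, 300]},
                  index=["A", "B", "C"])

# The method and the operator give the same element-wise result.
by_method = df.le(150)
by_operator = df <= 150
print(by_method)
```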
property loc#

Purely label-location based indexer for selection by label.

>>> df.loc["b"]  
>>> df.loc["b":"d"]  
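
The two calls above have no data attached, so the following minimal sketch shows the same label-based selections with plain pandas on a made-up frame. In pandas a label slice includes both endpoints; the Dask indexer is lazy, and its return type for a single label may differ from the pandas Series shown here.

```python
import pandas as pd

df = pd.DataFrame({"x": [10, 20, 30, 40, 50]}, index=list("abcde"))

row_b = df.loc["b"]            # single label: in pandas, that row as a Series
rows_b_to_d = df.loc["b":"d"]  # label slice: both endpoints are included
print(row_b["x"], list(rows_b_to_d.index))
```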
lt(other, axis='columns', level=None)#

Return whether each element of the DataFrame is strictly less than other, element-wise (binary operator lt).

This docstring was copied from pandas.core.frame.DataFrame.lt.

Some inconsistencies with the Dask version may exist.

Among flexible wrappers (eq, ne, le, lt, ge, gt) to comparison operators.

Equivalent to ==, !=, <=, <, >=, > with support to choose axis (rows or columns) and level for comparison.

Parameters:
otherscalar, sequence, Series, or DataFrame

Any single or multiple element data structure, or list-like object.

axis{0 or ‘index’, 1 or ‘columns’}, default ‘columns’

Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’).

levelint or label

Broadcast across a level, matching Index values on the passed MultiIndex level.

Returns:
DataFrame of bool

Result of the comparison.

See also

DataFrame.eq

Compare DataFrames for equality elementwise.

DataFrame.ne

Compare DataFrames for inequality elementwise.

DataFrame.le

Compare DataFrames for less than inequality or equality elementwise.

DataFrame.lt

Compare DataFrames for strictly less than inequality elementwise.

DataFrame.ge

Compare DataFrames for greater than inequality or equality elementwise.

DataFrame.gt

Compare DataFrames for strictly greater than inequality elementwise.

Notes

Mismatched indices will be unioned together. NaN values are considered different (i.e. NaN != NaN).

Examples

>>> df = pd.DataFrame({'cost': [250, 150, 100],  
...                    'revenue': [100, 250, 300]},
...                   index=['A', 'B', 'C'])
>>> df  
   cost  revenue
A   250      100
B   150      250
C   100      300

Comparison with a scalar, using either the operator or method:

>>> df == 100  
    cost  revenue
A  False     True
B  False    False
C   True    False
>>> df.eq(100)  
    cost  revenue
A  False     True
B  False    False
C   True    False

When other is a Series, the columns of a DataFrame are aligned with the index of other and broadcast:

>>> df != pd.Series([100, 250], index=["cost", "revenue"])  
    cost  revenue
A   True     True
B   True    False
C  False     True

Use the method to control the broadcast axis:

>>> df.ne(pd.Series([100, 300], index=["A", "D"]), axis='index')  
   cost  revenue
A  True    False
B  True     True
C  True     True
D  True     True

When comparing to an arbitrary sequence, the number of columns must match the number of elements in other:

>>> df == [250, 100]  
    cost  revenue
A   True     True
B  False    False
C  False    False

Use the method to control the axis:

>>> df.eq([250, 250, 100], axis='index')  
    cost  revenue
A   True    False
B  False     True
C   True    False

Compare to a DataFrame of different shape.

>>> other = pd.DataFrame({'revenue': [300, 250, 100, 150]},  
...                      index=['A', 'B', 'C', 'D'])
>>> other  
   revenue
A      300
B      250
C      100
D      150
>>> df.gt(other)  
    cost  revenue
A  False    False
B  False    False
C  False     True
D  False    False

Compare to a MultiIndex by level.

>>> df_multindex = pd.DataFrame({'cost': [250, 150, 100, 150, 300, 220],  
...                              'revenue': [100, 250, 300, 200, 175, 225]},
...                             index=[['Q1', 'Q1', 'Q1', 'Q2', 'Q2', 'Q2'],
...                                    ['A', 'B', 'C', 'A', 'B', 'C']])
>>> df_multindex  
      cost  revenue
Q1 A   250      100
   B   150      250
   C   100      300
Q2 A   150      200
   B   300      175
   C   220      225
>>> df.le(df_multindex, level=1)  
       cost  revenue
Q1 A   True     True
   B   True     True
   C   True     True
Q2 A  False     True
   B   True    False
   C   True    False
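
The following is a small sketch of lt with a per-row threshold, using plain pandas and the same cost/revenue frame. The limits Series is made up for illustration; axis='index' aligns it with the row labels rather than the column names.

```python
import pandas as pd

df = pd.DataFrame({"cost": [250, 150, 100],
                   "revenue": [100, 250, 300]},
                  index=["A", "B", "C"])

# One limit per row label; compare every column against its row's limit.
limits = pd.Series([300, 200, 100], index=["A", "B", "C"])
below_limit = df.lt(limits, axis="index")
print(below_limit)
```

Row C shows the strictness of the comparison: a cost of 100 is not less than a limit of 100.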
map_overlap(func, before, after, *args, **kwargs)#

Apply a function to each partition, sharing rows with adjacent partitions.

This can be useful for implementing windowing functions such as df.rolling(...).mean() or df.diff().

Parameters:
funcfunction

Function applied to each partition.

beforeint, timedelta or string timedelta

The rows to prepend to partition i from the end of partition i - 1.

afterint, timedelta or string timedelta

The rows to append to partition i from the beginning of partition i + 1.

args, kwargs

Positional and keyword arguments to pass to the function. Positional arguments are computed on a per-partition basis, while keyword arguments are shared across all partitions. The partition itself will be the first positional argument, with all other arguments passed after. Arguments can be Scalar, Delayed, or regular Python objects. DataFrame-like args (both dask and pandas) will be repartitioned to align (if necessary) before applying the function; see align_dataframes to control this behavior.

enforce_metadatabool, default True

Whether to enforce at runtime that the structure of the DataFrame produced by func actually matches the structure of meta. This will rename and reorder columns for each partition, and will raise an error if this doesn’t work, but it won’t raise if dtypes don’t match.

transform_divisionsbool, default True

Whether to apply the function onto the divisions and apply those transformed divisions to the output.

align_dataframesbool, default True

Whether to repartition DataFrame- or Series-like args (both dask and pandas) so their divisions align before applying the function. This requires all inputs to have known divisions. Single-partition inputs will be split into multiple partitions.

If False, all inputs must have either the same number of partitions or a single partition. Single-partition inputs will be broadcast to every partition of multi-partition inputs.

metapd.DataFrame, pd.Series, dict, iterable, tuple, optional

An empty pd.DataFrame or pd.Series that matches the dtypes and column names of the output. This metadata is necessary for many algorithms in dask dataframe to work. For ease of use, some alternative inputs are also available. Instead of a DataFrame, a dict of {name: dtype} or iterable of (name, dtype) can be provided (note that the order of the names should match the order of the columns). Instead of a series, a tuple of (name, dtype) can be used. If not provided, dask will try to infer the metadata. This may lead to unexpected results, so providing meta is recommended. For more information, see dask.dataframe.utils.make_meta.

Notes

Given positive integers before and after, and a function func, map_overlap does the following:

  1. Prepend before rows to each partition i from the end of partition i - 1. The first partition has no rows prepended.

  2. Append after rows to each partition i from the beginning of partition i + 1. The last partition has no rows appended.

  3. Apply func to each partition, passing in any extra args and kwargs if provided.

  4. Trim before rows from the beginning of all but the first partition.

  5. Trim after rows from the end of all but the last partition.
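
The five steps above can be sketched in pure Python, with each partition modelled as a list of rows. This is only an illustration under stated assumptions, not the Dask implementation: map_overlap_sketch and rolling_sum_2 are hypothetical names, func is assumed to return one output per input row, and rows are only borrowed from the directly adjacent partition.

```python
def map_overlap_sketch(partitions, func, before, after):
    """Mimic the five steps on a list of lists (one inner list per partition)."""
    results = []
    last = len(partitions) - 1
    for i, part in enumerate(partitions):
        # Steps 1 and 2: borrow rows from the neighbouring partitions.
        head = partitions[i - 1][-before:] if i > 0 and before > 0 else []
        tail = partitions[i + 1][:after] if i < last and after > 0 else []
        # Step 3: apply the function to the extended partition.
        out = func(head + part + tail)
        # Steps 4 and 5: trim the borrowed rows from the result.
        results.append(out[len(head):len(out) - len(tail)])
    return results


def rolling_sum_2(rows):
    """Trailing window of size 2; the first row has no predecessor."""
    return [None if i == 0 else rows[i - 1] + rows[i] for i in range(len(rows))]


partitions = [[1, 2, 4], [7, 11]]
trimmed = map_overlap_sketch(partitions, rolling_sum_2, before=2, after=0)
flat = [value for part in trimmed for value in part]
print(flat)
```

Without the prepended rows, the first row of the second partition would have no predecessor and would come out as a missing value instead of 11.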

Examples

Given a DataFrame, Series, or Index, such as:

>>> import pandas as pd
>>> import dask.dataframe as dd
>>> df = pd.DataFrame({'x': [1, 2, 4, 7, 11],
...                    'y': [1., 2., 3., 4., 5.]})
>>> ddf = dd.from_pandas(df, npartitions=2)

A rolling sum with a trailing moving window of size 2 can be computed by overlapping 2 rows before each partition, and then mapping calls to df.rolling(2).sum():

>>> ddf.compute()
    x    y
0   1  1.0
1   2  2.0
2   4  3.0
3   7  4.0
4  11  5.0
>>> ddf.map_overlap(lambda df: df.rolling(2).sum(), 2, 0).compute()
      x    y
0   NaN  NaN
1   3.0  3.0
2   6.0  5.0
3  11.0  7.0
4  18.0  9.0

The pandas diff method computes a discrete difference shifted by a number of periods (can be positive or negative). This can be implemented by mapping calls to df.diff to each partition after prepending/appending that many rows, depending on sign:

>>> def diff(df, periods=1):
...     before, after = (periods, 0) if periods > 0 else (0, -periods)
...     return df.map_overlap(lambda df, periods=1: df.diff(periods),
...                           before, after, periods=periods)
>>> diff(ddf, 1).compute()
     x    y
0  NaN  NaN
1  1.0  1.0
2  2.0  1.0
3  3.0  1.0
4  4.0  1.0

If you have a DatetimeIndex, you can use a pd.Timedelta, or any string convertible to a pd.Timedelta, for time-based windows:

>>> ts = pd.Series(range(10), index=pd.date_range('2017', periods=10))
>>> dts = dd.from_pandas(ts, npartitions=2)
>>> dts.map_overlap(lambda df: df.rolling('2D').sum(),
...                 pd.Timedelta('2D'), 0).compute()
2017-01-01     0.0
2017-01-02     1.0
2017-01-03     3.0
2017-01-04     5.0
2017-01-05     7.0
2017-01-06     9.0
2017-01-07    11.0
2017-01-08    13.0
2017-01-09    15.0
2017-01-10    17.0
Freq: D, dtype: float64
map_partitions(func, *args, **kwargs)#

Apply Python function on each DataFrame partition.

Note that the index and divisions are assumed to remain unchanged.

Parameters:
funcfunction

The function applied to each partition. If this function accepts the special partition_info keyword argument, it will receive information on the partition’s relative location within the dataframe.

args, kwargs

Positional and keyword arguments to pass to the function. Positional arguments are computed on a per-partition basis, while keyword arguments are shared across all partitions. The partition itself will be the first positional argument, with all other arguments passed after. Arguments can be Scalar, Delayed, or regular Python objects. DataFrame-like args (both dask and pandas) will be repartitioned to align (if necessary) before applying the function; see align_dataframes to control this behavior.

enforce_metadatabool, default True

Whether to enforce at runtime that the structure of the DataFrame produced by func actually matches the structure of meta. This will rename and reorder columns for each partition, and will raise an error if this doesn’t work, but it won’t raise if dtypes don’t match.

transform_divisionsbool, default True

Whether to apply the function onto the divisions and apply those transformed divisions to the output.

align_dataframesbool, default True

Whether to repartition DataFrame- or Series-like args (both dask and pandas) so their divisions align before applying the function. This requires all inputs to have known divisions. Single-partition inputs will be split into multiple partitions.

If False, all inputs must have either the same number of partitions or a single partition. Single-partition inputs will be broadcast to every partition of multi-partition inputs.

metapd.DataFrame, pd.Series, dict, iterable, tuple, optional

An empty pd.DataFrame or pd.Series that matches the dtypes and column names of the output. This metadata is necessary for many algorithms in dask dataframe to work. For ease of use, some alternative inputs are also available. Instead of a DataFrame, a dict of {name: dtype} or iterable of (name, dtype) can be provided (note that the order of the names should match the order of the columns). Instead of a series, a tuple of (name, dtype) can be used. If not provided, dask will try to infer the metadata. This may lead to unexpected results, so providing meta is recommended. For more information, see dask.dataframe.utils.make_meta.

Examples

Given a DataFrame, Series, or Index, such as:

>>> import pandas as pd
>>> import dask.dataframe as dd
>>> df = pd.DataFrame({'x': [1, 2, 3, 4, 5],
...                    'y': [1., 2., 3., 4., 5.]})
>>> ddf = dd.from_pandas(df, npartitions=2)

One can use map_partitions to apply a function on each partition. Extra arguments and keywords can optionally be provided, and will be passed to the function after the partition.

Here we apply a function with arguments and keywords to a DataFrame, resulting in a Series:

>>> def myadd(df, a, b=1):
...     return df.x + df.y + a + b
>>> res = ddf.map_partitions(myadd, 1, b=2)
>>> res.dtype
dtype('float64')

Here we apply a function to a Series resulting in a Series:

>>> res = ddf.x.map_partitions(lambda x: len(x)) # ddf.x is a Dask Series Structure
>>> res.dtype
dtype('int64')

By default, dask tries to infer the output metadata by running your provided function on some fake data. This works well in many cases, but can sometimes be expensive, or even fail. To avoid this, you can manually specify the output metadata with the meta keyword. This can be specified in many forms, for more information see dask.dataframe.utils.make_meta.

Here we specify the output is a Series with no name, and dtype float64:

>>> res = ddf.map_partitions(myadd, 1, b=2, meta=(None, 'f8'))

Here we map a function that takes in a DataFrame, and returns a DataFrame with a new column:

>>> res = ddf.map_partitions(lambda df: df.assign(z=df.x * df.y))
>>> res.dtypes
x      int64
y    float64
z    float64
dtype: object

As before, the output metadata can also be specified manually. This time we pass in a dict, as the output is a DataFrame:

>>> res = ddf.map_partitions(lambda df: df.assign(z=df.x * df.y),
...                          meta={'x': 'i8', 'y': 'f8', 'z': 'f8'})
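
A dict passed as meta stands for an empty frame with the given column names and dtypes. The following sketch builds such a frame by hand with plain pandas and checks it against a real result; empty_meta and meta_spec are illustrative names, and the construction is an assumption about what the dict represents rather than a copy of dask.dataframe.utils.make_meta.

```python
import pandas as pd

meta_spec = {"x": "i8", "y": "f8", "z": "f8"}

# An empty frame carrying only column names and dtypes.
empty_meta = pd.DataFrame(
    {name: pd.Series(dtype=dtype) for name, dtype in meta_spec.items()}
)

# The structure that the mapped function actually produces on real data.
df = pd.DataFrame({"x": [1, 2, 3], "y": [1.0, 2.0, 3.0]})
result = df.assign(z=df.x * df.y)

same_columns = list(result.columns) == list(empty_meta.columns)
same_dtypes = [str(t) for t in result.dtypes] == [str(t) for t in empty_meta.dtypes]
print(len(empty_meta), same_columns, same_dtypes)
```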

In the case where the metadata doesn’t change, you can also pass in the object itself directly:

>>> res = ddf.map_partitions(lambda df: df.head(), meta=ddf)

Also note that the index and divisions are assumed to remain unchanged. If the function you’re mapping changes the index/divisions, you’ll need to clear them afterwards:

>>> ddf.map_partitions(func).clear_divisions()  

Your map function gets information about where it is in the dataframe by accepting a special partition_info keyword argument.

>>> def func(partition, partition_info=None):
...     pass

This will receive the following information:

>>> partition_info  
{'number': 1, 'division': 3}

For each positional or keyword argument that is a Dask DataFrame, you will receive the number (n), which identifies the nth partition of that DataFrame, and the division (the first index value in the partition). If divisions are not known (for instance, if the index is not sorted), the division will be None.

mask(cond, other=nan)#

Replace values where the condition is True.

This docstring was copied from pandas.core.frame.DataFrame.mask.

Some inconsistencies with the Dask version may exist.

Parameters:
condbool Series/DataFrame, array-like, or callable

Where cond is False, keep the original value. Where True, replace with corresponding value from other. If cond is callable, it is computed on the Series/DataFrame and should return boolean Series/DataFrame or array. The callable must not change input Series/DataFrame (though pandas doesn’t check it).

otherscalar, Series/DataFrame, or callable

Entries where cond is True are replaced with corresponding value from other. If other is callable, it is computed on the Series/DataFrame and should return scalar or Series/DataFrame. The callable must not change input Series/DataFrame (though pandas doesn’t check it).

inplacebool, default False (Not supported in Dask)

Whether to perform the operation in place on the data.

axisint, default None (Not supported in Dask)

Alignment axis if needed. For Series this parameter is unused and defaults to 0.

levelint, default None (Not supported in Dask)

Alignment level if needed.

errorsstr, {‘raise’, ‘ignore’}, default ‘raise’ (Not supported in Dask)

Note that currently this parameter won’t affect the results and will always coerce to a suitable dtype.

  • ‘raise’ : allow exceptions to be raised.

  • ‘ignore’ : suppress exceptions. On error return original object.

Deprecated since version 1.5.0: This argument had no effect.

try_castbool, default None (Not supported in Dask)

Try to cast the result back to the input type (if possible).

Deprecated since version 1.3.0: Manually cast back if necessary.

Returns:
Same type as caller or None if inplace=True.

See also

DataFrame.where()

Return an object of same shape as self.

Notes

The mask method is an application of the if-then idiom. For each element in the calling DataFrame, if cond is False the element is used; otherwise the corresponding element from the DataFrame other is used. If the axis of other does not align with axis of cond Series/DataFrame, the misaligned index positions will be filled with True.

The signature for DataFrame.where() differs from numpy.where(). Roughly df1.where(m, df2) is equivalent to np.where(m, df1, df2).

For further details and examples see the mask documentation in indexing.

The dtype of the object takes precedence. The fill value is cast to the object’s dtype, if this can be done losslessly.

Examples

>>> s = pd.Series(range(5))  
>>> s.where(s > 0)  
0    NaN
1    1.0
2    2.0
3    3.0
4    4.0
dtype: float64
>>> s.mask(s > 0)  
0    0.0
1    NaN
2    NaN
3    NaN
4    NaN
dtype: float64
>>> s = pd.Series(range(5))  
>>> t = pd.Series([True, False])  
>>> s.where(t, 99)  
0     0
1    99
2    99
3    99
4    99
dtype: int64
>>> s.mask(t, 99)  
0    99
1     1
2    99
3    99
4    99
dtype: int64
>>> s.where(s > 1, 10)  
0    10
1    10
2     2
3     3
4     4
dtype: int64
>>> s.mask(s > 1, 10)  
0     0
1     1
2    10
3    10
4    10
dtype: int64
>>> df = pd.DataFrame(np.arange(10).reshape(-1, 2), columns=['A', 'B'])  
>>> df  
   A  B
0  0  1
1  2  3
2  4  5
3  6  7
4  8  9
>>> m = df % 3 == 0  
>>> df.where(m, -df)  
   A  B
0  0 -1
1 -2  3
2 -4 -5
3  6 -7
4 -8  9
>>> df.where(m, -df) == np.where(m, df, -df)  
      A     B
0  True  True
1  True  True
2  True  True
3  True  True
4  True  True
>>> df.where(m, -df) == df.mask(~m, -df)  
      A     B
0  True  True
1  True  True
2  True  True
3  True  True
4  True  True
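
The parameter descriptions above allow both cond and other to be callables, which the copied examples do not show. The following is a minimal sketch with plain pandas; whether the Dask version accepts callables in the same way is not confirmed here.

```python
import pandas as pd

s = pd.Series(range(5))

# cond and other are both computed from the Series itself:
# values greater than 2 are replaced by ten times their value.
masked = s.mask(lambda x: x > 2, lambda x: x * 10)
print(masked.tolist())
```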
max(axis=0, skipna=True, split_every=False, out=None, numeric_only=None)#

Return the maximum of the values over the requested axis.

This docstring was copied from pandas.core.frame.DataFrame.max.

Some inconsistencies with the Dask version may exist.

If you want the index of the maximum, use idxmax. This is the equivalent of the numpy.ndarray method argmax.

Parameters:
axis{index (0), columns (1)}

Axis for the function to be applied on. For Series this parameter is unused and defaults to 0.

skipnabool, default True

Exclude NA/null values when computing the result.

levelint or level name, default None (Not supported in Dask)

If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series.

Deprecated since version 1.3.0: The level keyword is deprecated. Use groupby instead.

numeric_onlybool, default None

Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for Series.

Deprecated since version 1.5.0: Specifying numeric_only=None is deprecated. The default value will be False in a future version of pandas.

**kwargs

Additional keyword arguments to be passed to the function.

Returns:
Series or DataFrame (if level specified)

See also

Series.sum

Return the sum.

Series.min

Return the minimum.

Series.max

Return the maximum.

Series.idxmin

Return the index of the minimum.

Series.idxmax

Return the index of the maximum.

DataFrame.sum

Return the sum over the requested axis.

DataFrame.min

Return the minimum over the requested axis.

DataFrame.max

Return the maximum over the requested axis.

DataFrame.idxmin

Return the index of the minimum over the requested axis.

DataFrame.idxmax

Return the index of the maximum over the requested axis.

Examples

>>> idx = pd.MultiIndex.from_arrays([  
...     ['warm', 'warm', 'cold', 'cold'],
...     ['dog', 'falcon', 'fish', 'spider']],
...     names=['blooded', 'animal'])
>>> s = pd.Series([4, 2, 0, 8], name='legs', index=idx)  
>>> s  
blooded  animal
warm     dog       4
         falcon    2
cold     fish      0
         spider    8
Name: legs, dtype: int64
>>> s.max()  
8
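
The copied example above uses a Series only. The following is a small sketch with plain pandas and made-up data, showing how the axis and skipna parameters change the result for a DataFrame that contains a missing value.

```python
import pandas as pd

nan = float("nan")
df = pd.DataFrame({"a": [1.0, nan, 3.0], "b": [4.0, 5.0, 6.0]})

col_max = df.max()                     # NaN in column "a" is skipped
col_max_strict = df.max(skipna=False)  # NaN propagates into the result
row_max = df.max(axis=1)               # one maximum per row
print(col_max.tolist(), row_max.tolist())
```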
mean(axis=0, skipna=True, split_every=False, dtype=None, out=None, numeric_only=None)#

Return the mean of the values over the requested axis.

This docstring was copied from pandas.core.frame.DataFrame.mean.

Some inconsistencies with the Dask version may exist.

Parameters:
axis{index (0), columns (1)}

Axis for the function to be applied on. For Series this parameter is unused and defaults to 0.

skipnabool, default True

Exclude NA/null values when computing the result.

levelint or level name, default None (Not supported in Dask)

If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series.

Deprecated since version 1.3.0: The level keyword is deprecated. Use groupby instead.

numeric_onlybool, default None

Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for Series.

Deprecated since version 1.5.0: Specifying numeric_only=None is deprecated. The default value will be False in a future version of pandas.

**kwargs

Additional keyword arguments to be passed to the function.

Returns:
Series or DataFrame (if level specified)
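
This entry has no example, so here is a minimal sketch of the axis parameter with plain pandas and made-up data: axis=0 gives one mean per column, and axis=1 gives one mean per row.

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3], "y": [10, 20, 30]})

col_mean = df.mean(axis=0)  # indexed by column name
row_mean = df.mean(axis=1)  # indexed by row label
print(col_mean.tolist(), row_mean.tolist())
```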
median(axis=None, method='default')#

Return the median of the values over the requested axis.

This docstring was copied from pandas.core.frame.DataFrame.median.

Some inconsistencies with the Dask version may exist.

Parameters:
axis{index (0), columns (1)}

Axis for the function to be applied on. For Series this parameter is unused and defaults to 0.

skipnabool, default True (Not supported in Dask)

Exclude NA/null values when computing the result.

levelint or level name, default None (Not supported in Dask)

If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series.

Deprecated since version 1.3.0: The level keyword is deprecated. Use groupby instead.

numeric_onlybool, default None (Not supported in Dask)

Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for Series.

Deprecated since version 1.5.0: Specifying numeric_only=None is deprecated. The default value will be False in a future version of pandas.

**kwargs

Additional keyword arguments to be passed to the function.

Returns:
Series or DataFrame (if level specified)
median_approximate(axis=None, method='default')#

Return the approximate median of the values over the requested axis.

Parameters:
axis{0, 1, “index”, “columns”} (default 0)

0 or "index" for row-wise, 1 or "columns" for column-wise

method{‘default’, ‘tdigest’, ‘dask’}, optional

The method to use. By default, Dask’s internal custom algorithm ("dask") is used. If set to "tdigest", tdigest is used for floats and ints, falling back to "dask" otherwise.

melt(id_vars=None, value_vars=None, var_name=None, value_name='value', col_level=None)#

Unpivots a DataFrame from wide format to long format, optionally leaving identifier variables set.

This function is useful to massage a DataFrame into a format where one or more columns are identifier variables (id_vars), while all other columns, considered measured variables (value_vars), are “unpivoted” to the row axis, leaving just two non-identifier columns, ‘variable’ and ‘value’.

Parameters:
frameDataFrame
id_varstuple, list, or ndarray, optional

Column(s) to use as identifier variables.

value_varstuple, list, or ndarray, optional

Column(s) to unpivot. If not specified, uses all columns that are not set as id_vars.

var_namescalar

Name to use for the ‘variable’ column. If None it uses frame.columns.name or ‘variable’.

value_namescalar, default ‘value’

Name to use for the ‘value’ column.

col_levelint or string, optional

If columns are a MultiIndex then use this level to melt.

Returns:
DataFrame

Unpivoted DataFrame.
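
As a rough illustration of the wide-to-long reshaping described above, the following plain-Python sketch unpivots a list of row dictionaries. It is not the dask_cudf implementation; the helper name melt_rows and the sample data are hypothetical, and only the row ordering convention (all rows for the first value column, then the next) is modelled.

```python
def melt_rows(rows, id_vars, value_vars=None,
              var_name="variable", value_name="value"):
    """Unpivot a list of row dicts from wide to long format."""
    if value_vars is None:
        # Default: every column that is not an identifier column.
        first = rows[0] if rows else {}
        value_vars = [c for c in first if c not in id_vars]
    out = []
    # Emit all rows for the first value column, then the next one, and so on.
    for col in value_vars:
        for row in rows:
            long_row = {k: row[k] for k in id_vars}
            long_row[var_name] = col
            long_row[value_name] = row[col]
            out.append(long_row)
    return out


wide = [
    {"name": "a", "x": 1, "y": 10},
    {"name": "b", "x": 2, "y": 20},
]
long_rows = melt_rows(wide, id_vars=["name"])
for r in long_rows:
    print(r)
```

Each input row contributes one output row per value column, so two rows with two value columns yield four long-format rows.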

memory_usage(index=True, deep=False)#

Return the memory usage of each column in bytes.

This docstring was copied from pandas.core.frame.DataFrame.memory_usage.

Some inconsistencies with the Dask version may exist.

The memory usage can optionally include the contribution of the index and elements of object dtype.

This value is displayed in DataFrame.info by default. This can be suppressed by setting pandas.options.display.memory_usage to False.

Parameters:
indexbool, default True

Specifies whether to include the memory usage of the DataFrame’s index in returned Series. If index=True, the memory usage of the index is the first item in the output.

deepbool, default False

If True, introspect the data deeply by interrogating object dtypes for system-level memory consumption, and include it in the returned values.

Returns:
Series

A Series whose index is the original column names and whose values are the memory usage of each column in bytes.

See also

numpy.ndarray.nbytes

Total bytes consumed by the elements of an ndarray.

Series.memory_usage

Bytes consumed by a Series.

Categorical

Memory-efficient array for string values with many repeated values.

DataFrame.info

Concise summary of a DataFrame.

Notes

See the Frequently Asked Questions for more details.

Examples

>>> dtypes = ['int64', 'float64', 'complex128', 'object', 'bool']  
>>> data = dict([(t, np.ones(shape=5000, dtype=int).astype(t))  
...              for t in dtypes])
>>> df = pd.DataFrame(data)  
>>> df.head()  
   int64  float64            complex128  object  bool
0      1      1.0              1.0+0.0j       1  True
1      1      1.0              1.0+0.0j       1  True
2      1      1.0              1.0+0.0j       1  True
3      1      1.0              1.0+0.0j       1  True
4      1      1.0              1.0+0.0j       1  True
>>> df.memory_usage()  
Index           128
int64         40000
float64       40000
complex128    80000
object        40000
bool           5000
dtype: int64
>>> df.memory_usage(index=False)  
int64         40000
float64       40000
complex128    80000
object        40000
bool           5000
dtype: int64

The memory footprint of object dtype columns is ignored by default:

>>> df.memory_usage(deep=True)  
Index            128
int64          40000
float64        40000
complex128     80000
object        180000
bool            5000
dtype: int64

Use a Categorical for efficient storage of an object-dtype column with many repeated values.

>>> df['object'].astype('category').memory_usage(deep=True)  
5244
memory_usage_per_partition(index=True, deep=False)#

Return the memory usage of each partition

Parameters:
indexbool, default True

Specifies whether to include the memory usage of the index in returned Series.

deepbool, default False

If True, introspect the data deeply by interrogating object dtypes for system-level memory consumption, and include it in the returned values.

Returns:
Series

A Series whose index is the partition number and whose values are the memory usage of each partition in bytes.

merge(other, shuffle_method=None, **kwargs)#

Merge the DataFrame with another DataFrame

This will merge the two datasets, either on the indices, a certain column in each dataset or the index in one dataset and the column in another.

Parameters:
right: dask.dataframe.DataFrame
how{‘left’, ‘right’, ‘outer’, ‘inner’}, default: ‘inner’

How to handle the operation of the two objects:

  • left: use calling frame’s index (or column if on is specified)

  • right: use other frame’s index

  • outer: form union of calling frame’s index (or column if on is specified) with other frame’s index, and sort it lexicographically

  • inner: form intersection of calling frame’s index (or column if on is specified) with other frame’s index, preserving the order of the calling frame’s index

onlabel or list

Column or index level names to join on. These must be found in both DataFrames. If on is None and not merging on indexes then this defaults to the intersection of the columns in both DataFrames.

left_onlabel or list, or array-like

Column to join on in the left DataFrame. Unlike in pandas, arrays and lists are only supported if their length is 1.

right_onlabel or list, or array-like

Column to join on in the right DataFrame. Unlike in pandas, arrays and lists are only supported if their length is 1.

left_indexboolean, default False

Use the index from the left DataFrame as the join key.

right_indexboolean, default False

Use the index from the right DataFrame as the join key.

suffixes2-length sequence (tuple, list, …)

Suffix to apply to overlapping column names in the left and right side, respectively

indicatorboolean or string, default False

If True, adds a column to output DataFrame called “_merge” with information on the source of each row. If string, column with information on source of each row will be added to output DataFrame, and column will be named value of string. Information column is Categorical-type and takes on a value of “left_only” for observations whose merge key only appears in left DataFrame, “right_only” for observations whose merge key only appears in right DataFrame, and “both” if the observation’s merge key is found in both.

npartitions: int or None, optional

The ideal number of output partitions. This is only utilised when performing a hash_join (merging on columns only). If None then npartitions = max(lhs.npartitions, rhs.npartitions). Default is None.

shuffle_method: {‘disk’, ‘tasks’, ‘p2p’}, optional

Either 'disk' for single-node operation, or 'tasks' or 'p2p' for distributed operation. Will be inferred from your current scheduler.

broadcast: boolean or float, optional

Whether to use a broadcast-based join in lieu of a shuffle-based join for supported cases. By default, a simple heuristic will be used to select the underlying algorithm. If a floating-point value is specified, that number will be used as the broadcast_bias within the simple heuristic (a large number makes Dask more likely to choose the broadcast_join code path). See broadcast_join for more information.

Notes

There are three ways to join dataframes:

  1. Joining on indices. In this case the divisions are aligned using the function dask.dataframe.multi.align_partitions. Afterwards, each partition is merged with the pandas merge function.

  2. Joining one on index and one on column. In this case the divisions of the dataframe merged by index (\(d_i\)) are used to divide the dataframe merged by column (\(d_c\)) using dask.dataframe.multi.rearrange_by_divisions. The merged dataframe (\(d_m\)) then has exactly the same divisions as \(d_i\). This can lead to issues if you merge multiple rows from \(d_c\) to one row in \(d_i\).

  3. Joining both on columns. In this case a hash join is performed using dask.dataframe.multi.hash_join.

In some cases, you may see a MemoryError if the merge operation requires an internal shuffle, because shuffling places all rows that have the same index in the same partition. To avoid this error, make sure all rows with the same on-column value can fit on a single partition.
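
The third case in the notes above (joining both on columns) and the memory caveat can be pictured with a small plain-Python sketch. This is only a conceptual model of a hash join, not the code in dask.dataframe.multi.hash_join; the helper names and the sample rows are hypothetical.

```python
from collections import defaultdict


def hash_partition(rows, key, npartitions):
    """Send each row to the partition chosen by the hash of its join key."""
    parts = [[] for _ in range(npartitions)]
    for row in rows:
        parts[hash(row[key]) % npartitions].append(row)
    return parts


def inner_join(left, right, key):
    """Plain in-memory inner join of two lists of dicts on `key`."""
    lookup = defaultdict(list)
    for r in right:
        lookup[r[key]].append(r)
    return [{**l, **r} for l in left for r in lookup.get(l[key], [])]


def hash_join(left, right, key, npartitions):
    """Shuffle both sides by key hash, then join partition by partition."""
    left_parts = hash_partition(left, key, npartitions)
    right_parts = hash_partition(right, key, npartitions)
    # Equal keys share a partition number, so each pair joins independently.
    return [inner_join(lp, rp, key) for lp, rp in zip(left_parts, right_parts)]


left = [{"k": i % 3, "a": i} for i in range(6)]
right = [{"k": k, "b": k * 10} for k in range(3)]
joined = hash_join(left, right, "k", npartitions=2)
for number, part in enumerate(joined):
    print(number, part)
```

Because every row with a given key value is routed to the same partition, a key that occurs very often produces one very large partition, which is the situation the MemoryError note warns about.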

min(axis=0, skipna=True, split_every=False, out=None, numeric_only=None)#

Return the minimum of the values over the requested axis.

This docstring was copied from pandas.core.frame.DataFrame.min.

Some inconsistencies with the Dask version may exist.

If you want the index of the minimum, use idxmin. This is the equivalent of the numpy.ndarray method argmin.

Parameters:
axis{index (0), columns (1)}

Axis for the function to be applied on. For Series this parameter is unused and defaults to 0.

skipnabool, default True

Exclude NA/null values when computing the result.

levelint or level name, default None (Not supported in Dask)

If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series.

Deprecated since version 1.3.0: The level keyword is deprecated. Use groupby instead.

numeric_onlybool, default None

Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for Series.

Deprecated since version 1.5.0: Specifying numeric_only=None is deprecated. The default value will be False in a future version of pandas.

**kwargs

Additional keyword arguments to be passed to the function.

Returns:
Series or DataFrame (if level specified)

See also

Series.sum

Return the sum.

Series.min

Return the minimum.

Series.max

Return the maximum.

Series.idxmin

Return the index of the minimum.

Series.idxmax

Return the index of the maximum.

DataFrame.sum

Return the sum over the requested axis.

DataFrame.min

Return the minimum over the requested axis.

DataFrame.max

Return the maximum over the requested axis.

DataFrame.idxmin

Return the index of the minimum over the requested axis.

DataFrame.idxmax

Return the index of the maximum over the requested axis.

Examples

>>> idx = pd.MultiIndex.from_arrays([  
...     ['warm', 'warm', 'cold', 'cold'],
...     ['dog', 'falcon', 'fish', 'spider']],
...     names=['blooded', 'animal'])
>>> s = pd.Series([4, 2, 0, 8], name='legs', index=idx)  
>>> s  
blooded  animal
warm     dog       4
         falcon    2
cold     fish      0
         spider    8
Name: legs, dtype: int64
>>> s.min()  
0
mod(other, axis='columns', level=None, fill_value=None)#

Get Modulo of DataFrame or Series and other, element-wise (binary operator mod).

This docstring was copied from cudf.core.series.Series.mod.

Some inconsistencies with the Dask version may exist.

Equivalent to frame % other, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters:
otherscalar, sequence, Series, or DataFrame

Any single or multiple element data structure, or list-like object.

axisint or string

Only 0 is supported for series, 1 or columns supported for dataframe

levelint or name

Broadcast across a level, matching Index values on the passed MultiIndex level. Not yet supported.

fill_valuefloat or None, default None

Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing.

Returns:
DataFrame or Series

Result of the arithmetic operation.

Examples

DataFrame

>>> df = cudf.DataFrame(  
...     {'angles': [0, 3, 4], 'degrees': [360, 180, 360]},
...     index=['circle', 'triangle', 'rectangle']
... )
>>> df.mod(1)  
           angles  degrees
circle          0        0
triangle        0        0
rectangle       0        0

Series

>>> a = cudf.Series([1, 1, 1, None], index=['a', 'b', 'c', 'd'])  
>>> b = cudf.Series([1, None, 1, None], index=['a', 'b', 'd', 'e'])  
>>> a.mod(b)  
a       0
b    <NA>
c    <NA>
d    <NA>
e    <NA>
dtype: int64
>>> a.mod(b, fill_value=0)  
a             0
b    4294967295
c    4294967295
d             0
e          <NA>
dtype: int64
mode(dropna=True, split_every=False, numeric_only=False)#

Get the mode(s) of each element along the selected axis.

This docstring was copied from pandas.core.frame.DataFrame.mode.

Some inconsistencies with the Dask version may exist.

The mode of a set of values is the value that appears most often. It can be multiple values.

Parameters:
axis{0 or ‘index’, 1 or ‘columns’}, default 0 (Not supported in Dask)

The axis to iterate over while searching for the mode:

  • 0 or ‘index’ : get mode of each column

  • 1 or ‘columns’ : get mode of each row.

numeric_onlybool, default False

If True, only apply to numeric columns.

dropnabool, default True

Don’t consider counts of NaN/NaT.

Returns:
DataFrame

The modes of each column or row.

See also

Series.mode

Return the highest frequency value in a Series.

Series.value_counts

Return the counts of values in a Series.

Examples

>>> df = pd.DataFrame([('bird', 2, 2),  
...                    ('mammal', 4, np.nan),
...                    ('arthropod', 8, 0),
...                    ('bird', 2, np.nan)],
...                   index=('falcon', 'horse', 'spider', 'ostrich'),
...                   columns=('species', 'legs', 'wings'))
>>> df  
           species  legs  wings
falcon        bird     2    2.0
horse       mammal     4    NaN
spider   arthropod     8    0.0
ostrich       bird     2    NaN

By default, missing values are not considered, and the modes of wings are both 0 and 2. Because the resulting DataFrame has two rows, the second row of species and legs contains NaN.

>>> df.mode()  
  species  legs  wings
0    bird   2.0    0.0
1     NaN   NaN    2.0

With dropna=False, NaN values are considered and can be the mode (as for wings).

>>> df.mode(dropna=False)  
  species  legs  wings
0    bird     2    NaN

Setting numeric_only=True, only the mode of numeric columns is computed, and columns of other types are ignored.

>>> df.mode(numeric_only=True)  
   legs  wings
0   2.0    0.0
1   NaN    2.0

To compute the mode over columns and not rows, use the axis parameter:

>>> df.mode(axis='columns', numeric_only=True)  
           0    1
falcon   2.0  NaN
horse    4.0  NaN
spider   0.0  8.0
ostrich  2.0  NaN
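
The default behaviour shown above (missing values dropped, several modes allowed, shorter columns padded) can be mimicked in plain Python. This is a minimal sketch of the semantics only, using None for missing values; the helper names are hypothetical and nothing here reflects how Dask or cuDF compute the result.

```python
from statistics import multimode


def column_modes(values):
    """Sorted list of the most frequent non-missing values."""
    present = [v for v in values if v is not None]
    return sorted(multimode(present))


def mode_table(columns):
    """Modes per column, padded with None to a common length."""
    per_column = {name: column_modes(vals) for name, vals in columns.items()}
    height = max(len(m) for m in per_column.values())
    return {name: m + [None] * (height - len(m))
            for name, m in per_column.items()}


columns = {
    "species": ["bird", "mammal", "arthropod", "bird"],
    "legs": [2, 4, 8, 2],
    "wings": [2.0, None, 0.0, None],
}
modes = mode_table(columns)
print(modes)
```

The wings column has two equally frequent values, so the result has two rows and the other columns are padded with a missing value in the second row.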
mul(other, axis='columns', level=None, fill_value=None)#

Get Multiplication of DataFrame or Series and other, element-wise (binary operator mul).

This docstring was copied from cudf.core.series.Series.mul.

Some inconsistencies with the Dask version may exist.

Equivalent to frame * other, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters:
otherscalar, sequence, Series, or DataFrame

Any single or multiple element data structure, or list-like object.

axisint or string

Only 0 is supported for series, 1 or columns supported for dataframe

levelint or name

Broadcast across a level, matching Index values on the passed MultiIndex level. Not yet supported.

fill_valuefloat or None, default None

Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing.

Returns:
DataFrame or Series

Result of the arithmetic operation.

Examples

DataFrame

>>> df = cudf.DataFrame(  
...     {'angles': [0, 3, 4], 'degrees': [360, 180, 360]},
...     index=['circle', 'triangle', 'rectangle']
... )
>>> df.multiply(1)  
           angles  degrees
circle          0      360
triangle        3      180
rectangle       4      360

Series

>>> a = cudf.Series([1, 1, 1, None], index=['a', 'b', 'c', 'd'])  
>>> b = cudf.Series([1, None, 1, None], index=['a', 'b', 'd', 'e'])  
>>> a.multiply(b)  
a       1
b    <NA>
c    <NA>
d    <NA>
e    <NA>
dtype: int64
>>> a.multiply(b, fill_value=0)  
a       1
b       0
c       0
d       0
e    <NA>
dtype: int64
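
The fill_value rule used by the flexible arithmetic methods (align on the union of labels, fill a value that is missing on one side only, keep the result missing when both sides are missing) can be written out in plain Python. This is a sketch of the documented semantics under the assumption that dictionaries stand in for Series and None stands in for a missing value; mul_with_fill is a hypothetical helper, not a cuDF function.

```python
def mul_with_fill(a, b, fill_value=None):
    """Multiply two label -> value dicts; None marks a missing value."""
    out = {}
    for label in sorted(set(a) | set(b)):
        x, y = a.get(label), b.get(label)
        if x is None and y is None:
            out[label] = None  # missing on both sides stays missing
            continue
        if fill_value is not None:
            x = fill_value if x is None else x
            y = fill_value if y is None else y
        out[label] = None if x is None or y is None else x * y
    return out


a = {"a": 1, "b": 1, "c": 1, "d": None}
b = {"a": 1, "b": None, "d": 1, "e": None}
print(mul_with_fill(a, b))
print(mul_with_fill(a, b, fill_value=0))
```

With the same inputs as the Series example above, the sketch reproduces the pattern of missing and filled entries shown there.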
property ndim#

Return dimensionality

ne(other, axis='columns', level=None)#

Get Not equal to of dataframe and other, element-wise (binary operator ne).

This docstring was copied from pandas.core.frame.DataFrame.ne.

Some inconsistencies with the Dask version may exist.

Among flexible wrappers (eq, ne, le, lt, ge, gt) to comparison operators.

Equivalent to ==, !=, <=, <, >=, > with support to choose axis (rows or columns) and level for comparison.

Parameters:
otherscalar, sequence, Series, or DataFrame

Any single or multiple element data structure, or list-like object.

axis{0 or ‘index’, 1 or ‘columns’}, default ‘columns’

Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’).

levelint or label

Broadcast across a level, matching Index values on the passed MultiIndex level.

Returns:
DataFrame of bool

Result of the comparison.

See also

DataFrame.eq

Compare DataFrames for equality elementwise.

DataFrame.ne

Compare DataFrames for inequality elementwise.

DataFrame.le

Compare DataFrames for less than inequality or equality elementwise.

DataFrame.lt

Compare DataFrames for strictly less than inequality elementwise.

DataFrame.ge

Compare DataFrames for greater than inequality or equality elementwise.

DataFrame.gt

Compare DataFrames for strictly greater than inequality elementwise.

Notes

Mismatched indices will be unioned together. NaN values are considered different (i.e. NaN != NaN).

Examples

>>> df = pd.DataFrame({'cost': [250, 150, 100],  
...                    'revenue': [100, 250, 300]},
...                   index=['A', 'B', 'C'])
>>> df  
   cost  revenue
A   250      100
B   150      250
C   100      300

Comparison with a scalar, using either the operator or method:

>>> df == 100  
    cost  revenue
A  False     True
B  False    False
C   True    False
>>> df.eq(100)  
    cost  revenue
A  False     True
B  False    False
C   True    False

When other is a Series, the columns of a DataFrame are aligned with the index of other and broadcast:

>>> df != pd.Series([100, 250], index=["cost", "revenue"])  
    cost  revenue
A   True     True
B   True    False
C  False     True

Use the method to control the broadcast axis:

>>> df.ne(pd.Series([100, 300], index=["A", "D"]), axis='index')  
   cost  revenue
A  True    False
B  True     True
C  True     True
D  True     True

When comparing to an arbitrary sequence, the number of columns must match the number of elements in other:

>>> df == [250, 100]  
    cost  revenue
A   True     True
B  False    False
C  False    False

Use the method to control the axis:

>>> df.eq([250, 250, 100], axis='index')  
    cost  revenue
A   True    False
B  False     True
C   True    False

Compare to a DataFrame of different shape.

>>> other = pd.DataFrame({'revenue': [300, 250, 100, 150]},  
...                      index=['A', 'B', 'C', 'D'])
>>> other  
   revenue
A      300
B      250
C      100
D      150
>>> df.gt(other)  
    cost  revenue
A  False    False
B  False    False
C  False     True
D  False    False

Compare to a MultiIndex by level.

>>> df_multindex = pd.DataFrame({'cost': [250, 150, 100, 150, 300, 220],  
...                              'revenue': [100, 250, 300, 200, 175, 225]},
...                             index=[['Q1', 'Q1', 'Q1', 'Q2', 'Q2', 'Q2'],
...                                    ['A', 'B', 'C', 'A', 'B', 'C']])
>>> df_multindex  
      cost  revenue
Q1 A   250      100
   B   150      250
   C   100      300
Q2 A   150      200
   B   300      175
   C   220      225
>>> df.le(df_multindex, level=1)  
       cost  revenue
Q1 A   True     True
   B   True     True
   C   True     True
Q2 A  False     True
   B   True    False
   C   True    False
nlargest(n=5, columns=_NoDefault.no_default, split_every=None)#

Return the first n rows ordered by columns in descending order.

This docstring was copied from pandas.core.frame.DataFrame.nlargest.

Some inconsistencies with the Dask version may exist.

Return the first n rows with the largest values in columns, in descending order. The columns that are not specified are returned as well, but not used for ordering.

This method is equivalent to df.sort_values(columns, ascending=False).head(n), but more performant.

Parameters:
nint

Number of rows to return.

columnslabel or list of labels

Column label(s) to order by.

keep{‘first’, ‘last’, ‘all’}, default ‘first’ (Not supported in Dask)

Where there are duplicate values:

  • first : prioritize the first occurrence(s)

  • last : prioritize the last occurrence(s)

  • all : do not drop any duplicates, even if it means selecting more than n items.

Returns:
DataFrame

The first n rows ordered by the given columns in descending order.

See also

DataFrame.nsmallest

Return the first n rows ordered by columns in ascending order.

DataFrame.sort_values

Sort DataFrame by the values.

DataFrame.head

Return the first n rows without re-ordering.

Notes

This function cannot be used with all column types. For example, when specifying columns with object or category dtypes, TypeError is raised.

Examples

>>> df = pd.DataFrame({'population': [59000000, 65000000, 434000,  
...                                   434000, 434000, 337000, 11300,
...                                   11300, 11300],
...                    'GDP': [1937894, 2583560 , 12011, 4520, 12128,
...                            17036, 182, 38, 311],
...                    'alpha-2': ["IT", "FR", "MT", "MV", "BN",
...                                "IS", "NR", "TV", "AI"]},
...                   index=["Italy", "France", "Malta",
...                          "Maldives", "Brunei", "Iceland",
...                          "Nauru", "Tuvalu", "Anguilla"])
>>> df  
          population      GDP alpha-2
Italy       59000000  1937894      IT
France      65000000  2583560      FR
Malta         434000    12011      MT
Maldives      434000     4520      MV
Brunei        434000    12128      BN
Iceland       337000    17036      IS
Nauru          11300      182      NR
Tuvalu         11300       38      TV
Anguilla       11300      311      AI

In the following example, we will use nlargest to select the three rows having the largest values in column “population”.

>>> df.nlargest(3, 'population')  
        population      GDP alpha-2
France    65000000  2583560      FR
Italy     59000000  1937894      IT
Malta       434000    12011      MT

When using keep='last', ties are resolved in reverse order:

>>> df.nlargest(3, 'population', keep='last')  
        population      GDP alpha-2
France    65000000  2583560      FR
Italy     59000000  1937894      IT
Brunei      434000    12128      BN

When using keep='all', all duplicate items are maintained:

>>> df.nlargest(3, 'population', keep='all')  
          population      GDP alpha-2
France      65000000  2583560      FR
Italy       59000000  1937894      IT
Malta         434000    12011      MT
Maldives      434000     4520      MV
Brunei        434000    12128      BN

To order by the largest values in column “population” and then “GDP”, we can specify multiple columns like in the next example.

>>> df.nlargest(3, ['population', 'GDP'])  
        population      GDP alpha-2
France    65000000  2583560      FR
Italy     59000000  1937894      IT
Brunei      434000    12128      BN
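
The stated equivalence with sorting in descending order and taking the first n rows can be checked with the standard library, and the same idea extends to partitioned data: take the top n of each partition, then the top n of the combined candidates. The sketch below is an illustration with hypothetical tuples standing in for rows; it is an assumption that a partitioned implementation works along these lines, and the split into two partitions is arbitrary.

```python
import heapq

rows = [
    ("Italy", 59000000), ("France", 65000000), ("Malta", 434000),
    ("Maldives", 434000), ("Brunei", 434000), ("Iceland", 337000),
]


def population(row):
    return row[1]


# Full sort followed by head(n).
top3_sorted = sorted(rows, key=population, reverse=True)[:3]

# Heap-based selection avoids sorting everything.
top3_heap = heapq.nlargest(3, rows, key=population)

# Partitioned variant: reduce each partition first, then reduce the candidates.
partitions = [rows[:3], rows[3:]]
candidates = [r for part in partitions
              for r in heapq.nlargest(3, part, key=population)]
top3_partitioned = heapq.nlargest(3, candidates, key=population)

print(top3_sorted)
print(top3_heap)
print(top3_partitioned)
```

All three approaches select the same rows, with ties resolved in favour of the earliest occurrence.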
notnull()#

DataFrame.notnull is an alias for DataFrame.notna.

This docstring was copied from pandas.core.frame.DataFrame.notnull.

Some inconsistencies with the Dask version may exist.

Detect existing (non-missing) values.

Return a boolean same-sized object indicating if the values are not NA. Non-missing values get mapped to True. Characters such as empty strings '' or numpy.inf are not considered NA values (unless you set pandas.options.mode.use_inf_as_na = True). NA values, such as None or numpy.NaN, get mapped to False values.

Returns:
DataFrame

Mask of bool values for each element in DataFrame that indicates whether an element is not an NA value.

See also

DataFrame.notnull

Alias of notna.

DataFrame.isna

Boolean inverse of notna.

DataFrame.dropna

Omit axes labels with missing values.

notna

Top-level notna.

Examples

Show which entries in a DataFrame are not NA.

>>> df = pd.DataFrame(dict(age=[5, 6, np.NaN],  
...                    born=[pd.NaT, pd.Timestamp('1939-05-27'),
...                          pd.Timestamp('1940-04-25')],
...                    name=['Alfred', 'Batman', ''],
...                    toy=[None, 'Batmobile', 'Joker']))
>>> df  
   age       born    name        toy
0  5.0        NaT  Alfred       None
1  6.0 1939-05-27  Batman  Batmobile
2  NaN 1940-04-25              Joker
>>> df.notna()  
     age   born  name    toy
0   True  False  True  False
1   True   True  True   True
2  False   True  True   True

Show which entries in a Series are not NA.

>>> ser = pd.Series([5, 6, np.NaN])  
>>> ser  
0    5.0
1    6.0
2    NaN
dtype: float64
>>> ser.notna()  
0     True
1     True
2    False
dtype: bool
property npartitions: int#

Return number of partitions

nsmallest(n=5, columns=_NoDefault.no_default, split_every=None)#

Return the first n rows ordered by columns in ascending order.

This docstring was copied from pandas.core.frame.DataFrame.nsmallest.

Some inconsistencies with the Dask version may exist.

Return the first n rows with the smallest values in columns, in ascending order. The columns that are not specified are returned as well, but not used for ordering.

This method is equivalent to df.sort_values(columns, ascending=True).head(n), but more performant.

Parameters:
nint

Number of items to retrieve.

columnslist or str

Column name or names to order by.

keep{‘first’, ‘last’, ‘all’}, default ‘first’ (Not supported in Dask)

Where there are duplicate values:

  • first : take the first occurrence.

  • last : take the last occurrence.

  • all : do not drop any duplicates, even if it means selecting more than n items.

Returns:
DataFrame

See also

DataFrame.nlargest

Return the first n rows ordered by columns in descending order.

DataFrame.sort_values

Sort DataFrame by the values.

DataFrame.head

Return the first n rows without re-ordering.

Examples

>>> df = pd.DataFrame({'population': [59000000, 65000000, 434000,  
...                                   434000, 434000, 337000, 337000,
...                                   11300, 11300],
...                    'GDP': [1937894, 2583560 , 12011, 4520, 12128,
...                            17036, 182, 38, 311],
...                    'alpha-2': ["IT", "FR", "MT", "MV", "BN",
...                                "IS", "NR", "TV", "AI"]},
...                   index=["Italy", "France", "Malta",
...                          "Maldives", "Brunei", "Iceland",
...                          "Nauru", "Tuvalu", "Anguilla"])
>>> df  
          population      GDP alpha-2
Italy       59000000  1937894      IT
France      65000000  2583560      FR
Malta         434000    12011      MT
Maldives      434000     4520      MV
Brunei        434000    12128      BN
Iceland       337000    17036      IS
Nauru         337000      182      NR
Tuvalu         11300       38      TV
Anguilla       11300      311      AI

In the following example, we will use nsmallest to select the three rows having the smallest values in column “population”.

>>> df.nsmallest(3, 'population')  
          population    GDP alpha-2
Tuvalu         11300     38      TV
Anguilla       11300    311      AI
Iceland       337000  17036      IS

When using keep='last', ties are resolved in reverse order:

>>> df.nsmallest(3, 'population', keep='last')  
          population  GDP alpha-2
Anguilla       11300  311      AI
Tuvalu         11300   38      TV
Nauru         337000  182      NR

When using keep='all', all duplicate items are maintained:

>>> df.nsmallest(3, 'population', keep='all')  
          population    GDP alpha-2
Tuvalu         11300     38      TV
Anguilla       11300    311      AI
Iceland       337000  17036      IS
Nauru         337000    182      NR

To order by the smallest values in column “population” and then “GDP”, we can specify multiple columns like in the next example.

>>> df.nsmallest(3, ['population', 'GDP'])  
          population  GDP alpha-2
Tuvalu         11300   38      TV
Anguilla       11300  311      AI
Nauru         337000  182      NR
nunique(split_every=False, dropna=True, axis=0)#

Count number of distinct elements in specified axis.

This docstring was copied from pandas.core.frame.DataFrame.nunique.

Some inconsistencies with the Dask version may exist.

Return Series with number of distinct elements. Can ignore NaN values.

Parameters:
axis{0 or ‘index’, 1 or ‘columns’}, default 0

The axis to use. 0 or ‘index’ for row-wise, 1 or ‘columns’ for column-wise.

dropnabool, default True

Don’t include NaN in the counts.

Returns:
Series

See also

Series.nunique

Method nunique for Series.

DataFrame.count

Count non-NA cells for each column or row.

Examples

>>> df = pd.DataFrame({'A': [4, 5, 6], 'B': [4, 1, 1]})  
>>> df.nunique()  
A    3
B    2
dtype: int64
>>> df.nunique(axis=1)  
0    1
1    2
2    2
dtype: int64
nunique_approx(split_every=None)#

Approximate number of unique rows.

This method uses the HyperLogLog algorithm for cardinality estimation to compute the approximate number of unique rows. The approximate error is 0.406%.

Parameters:
split_everyint, optional

Group partitions into groups of this size while performing a tree-reduction. If set to False, no tree-reduction will be used. Default is 8.

Returns:
a float representing the approximate number of elements
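
To give a feel for how HyperLogLog combines with a partition-wise reduction, here is a compact plain-Python sketch. It is not Dask's implementation: the hash function (SHA-1), the 12 index bits, and the helper names are all assumptions made for illustration. The quoted 0.406% would be consistent with the usual error formula 1.04 / sqrt(m) for m = 2**16 registers, but that is an inference rather than something stated above.

```python
import hashlib
import math

P = 12          # index bits (assumed for this sketch)
M = 1 << P      # number of registers


def hll_registers(items):
    """Build the register array for one partition of items."""
    registers = [0] * M
    for item in items:
        digest = hashlib.sha1(repr(item).encode()).digest()
        h = int.from_bytes(digest[:8], "big")       # 64-bit hash value
        index = h >> (64 - P)                       # first P bits pick a register
        rest = h & ((1 << (64 - P)) - 1)            # remaining 64 - P bits
        rank = (64 - P) - rest.bit_length() + 1     # position of the leftmost 1-bit
        registers[index] = max(registers[index], rank)
    return registers


def hll_merge(a, b):
    """Combine two register arrays; this is the reduction step."""
    return [max(x, y) for x, y in zip(a, b)]


def hll_count(registers):
    """Turn registers into a cardinality estimate."""
    alpha = 0.7213 / (1 + 1.079 / M)
    estimate = alpha * M * M / sum(2.0 ** -r for r in registers)
    zeros = registers.count(0)
    if estimate <= 2.5 * M and zeros:
        # Small-range correction (linear counting).
        estimate = M * math.log(M / zeros)
    return estimate


# Two overlapping partitions holding 50000 distinct values in total.
partitions = [range(0, 30000), range(20000, 50000)]
merged = hll_registers(partitions[0])
for part in partitions[1:]:
    merged = hll_merge(merged, hll_registers(part))
approx = hll_count(merged)
print(approx)
```

Merging by element-wise maximum gives the same registers as processing all items at once, which is why the estimate can be computed partition by partition and combined in a tree.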
property partitions#

Slice dataframe by partitions

This allows partitionwise slicing of a Dask Dataframe. You can perform normal Numpy-style slicing, but now rather than slice elements of the array you slice along partitions so, for example, df.partitions[:5] produces a new Dask Dataframe of the first five partitions. Valid indexers are integers, sequences of integers, slices, or boolean masks.

Returns:
A Dask DataFrame

Examples

>>> df.partitions[0]  
>>> df.partitions[:3]  
>>> df.partitions[::10]  
persist(**kwargs)#

Persist this dask collection into memory

This turns a lazy Dask collection into a Dask collection with the same metadata, but now with the results fully computed or actively computing in the background.

The action of this function differs significantly depending on the active task scheduler. If the task scheduler supports asynchronous computing, as is the case with the dask.distributed scheduler, then persist will return immediately and the return value’s task graph will contain Dask Future objects. However, if the task scheduler only supports blocking computation, then the call to persist will block and the return value’s task graph will contain concrete Python results.

This function is particularly useful when using distributed systems, because the results will be kept in distributed memory, rather than returned to the local process as with compute.

Parameters:
schedulerstring, optional

Which scheduler to use like “threads”, “synchronous” or “processes”. If not provided, the default is to check the global settings first, and then fall back to the collection defaults.

optimize_graphbool, optional

If True [default], the graph is optimized before computation. Otherwise the graph is run as is. This can be useful for debugging.

**kwargs

Extra keywords to forward to the scheduler function.

Returns:
New dask collections backed by in-memory data

See also

dask.persist
pipe(func, *args, **kwargs)#

Apply chainable functions that expect Series or DataFrames.

This docstring was copied from pandas.core.frame.DataFrame.pipe.

Some inconsistencies with the Dask version may exist.

Parameters:
funcfunction

Function to apply to the Series/DataFrame. args, and kwargs are passed into func. Alternatively a (callable, data_keyword) tuple where data_keyword is a string indicating the keyword of callable that expects the Series/DataFrame.

argsiterable, optional

Positional arguments passed into func.

kwargsmapping, optional

A dictionary of keyword arguments passed into func.

Returns:
objectthe return type of func.

See also

DataFrame.apply

Apply a function along input axis of DataFrame.

DataFrame.applymap

Apply a function elementwise on a whole DataFrame.

Series.map

Apply a mapping correspondence on a Series.

Notes

Use .pipe when chaining together functions that expect Series, DataFrames or GroupBy objects. Instead of writing

>>> func(g(h(df), arg1=a), arg2=b, arg3=c)  

You can write

>>> (df.pipe(h)  
...    .pipe(g, arg1=a)
...    .pipe(func, arg2=b, arg3=c)
... )  

If you have a function that takes the data as (say) the second argument, pass a tuple indicating which keyword expects the data. For example, suppose func takes its data as arg2:

>>> (df.pipe(h)  
...    .pipe(g, arg1=a)
...    .pipe((func, 'arg2'), arg1=a, arg3=c)
...  )  
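
A concrete version of the chaining pattern may help. The sketch below uses plain pandas, and the helper functions drop_missing, add_ratio and scale are hypothetical, defined only for this illustration.

```python
import pandas as pd

def drop_missing(df):
    # Takes the frame as its first argument, so it can be piped directly.
    return df.dropna()

def add_ratio(df, numerator, denominator):
    # Extra keyword arguments are forwarded by pipe.
    return df.assign(ratio=df[numerator] / df[denominator])

def scale(factor, data):
    # The frame is NOT the first argument here, so pipe needs
    # a (callable, data_keyword) tuple to know where to put it.
    return data * factor

df = pd.DataFrame({"a": [1.0, 2.0, None], "b": [2.0, 4.0, 8.0]})

result = (
    df.pipe(drop_missing)
      .pipe(add_ratio, numerator="a", denominator="b")
      .pipe((scale, "data"), 10)
)
print(result)
```

The last step is equivalent to calling scale(10, data=...) on the output of the previous step.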
pivot_table(index=None, columns=None, values=None, aggfunc='mean')#

Create a spreadsheet-style pivot table as a DataFrame. Target columns must have category dtype to infer result’s columns. index, columns, values and aggfunc must be all scalar.

Parameters:
valuesscalar

column to aggregate

indexscalar

column to be index

columnsscalar

column to be columns

aggfunc{‘mean’, ‘sum’, ‘count’}, default ‘mean’
Returns:
tableDataFrame
pop(item)#

Return item and drop from frame. Raise KeyError if not found.

This docstring was copied from pandas.core.frame.DataFrame.pop.

Some inconsistencies with the Dask version may exist.

Parameters:
itemlabel

Label of column to be popped.

Returns:
Series

Examples

>>> df = pd.DataFrame([('falcon', 'bird', 389.0),  
...                    ('parrot', 'bird', 24.0),
...                    ('lion', 'mammal', 80.5),
...                    ('monkey', 'mammal', np.nan)],
...                   columns=('name', 'class', 'max_speed'))
>>> df  
     name   class  max_speed
0  falcon    bird      389.0
1  parrot    bird       24.0
2    lion  mammal       80.5
3  monkey  mammal        NaN
>>> df.pop('class')  
0      bird
1      bird
2    mammal
3    mammal
Name: class, dtype: object
>>> df  
     name  max_speed
0  falcon      389.0
1  parrot       24.0
2    lion       80.5
3  monkey        NaN
pow(other, axis='columns', level=None, fill_value=None)#

Get Exponential of DataFrame or Series and other, element-wise (binary operator pow).

This docstring was copied from cudf.core.series.Series.pow.

Some inconsistencies with the Dask version may exist.

Equivalent to frame ** other, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters:
otherscalar, sequence, Series, or DataFrame

Any single or multiple element data structure, or list-like object.

axisint or string

Only 0 is supported for series, 1 or columns supported for dataframe

levelint or name

Broadcast across a level, matching Index values on the passed MultiIndex level. Not yet supported.

fill_valuefloat or None, default None

Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing.

Returns:
DataFrame or Series

Result of the arithmetic operation.

Examples

DataFrame

>>> df = cudf.DataFrame(  
...     {'angles': [0, 3, 4], 'degrees': [360, 180, 360]},
...     index=['circle', 'triangle', 'rectangle']
... )
>>> df.pow(1)  
           angles  degrees
circle          0      360
triangle        3      180
rectangle       4      360

Series

>>> a = cudf.Series([1, 1, 1, None], index=['a', 'b', 'c', 'd'])  
>>> b = cudf.Series([1, None, 1, None], index=['a', 'b', 'd', 'e'])  
>>> a.pow(b)  
a       1
b    <NA>
c    <NA>
d    <NA>
e    <NA>
dtype: int64
>>> a.pow(b, fill_value=0)  
a       1
b       1
c       1
d       0
e    <NA>
dtype: int64
prod(axis=None, skipna=True, split_every=False, dtype=None, out=None, min_count=None, numeric_only=None)#

Return the product of the values over the requested axis.

This docstring was copied from pandas.core.frame.DataFrame.prod.

Some inconsistencies with the Dask version may exist.

Parameters:
axis{index (0), columns (1)}

Axis for the function to be applied on. For Series this parameter is unused and defaults to 0.

skipnabool, default True

Exclude NA/null values when computing the result.

levelint or level name, default None (Not supported in Dask)

If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series.

Deprecated since version 1.3.0: The level keyword is deprecated. Use groupby instead.

numeric_onlybool, default None

Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for Series.

Deprecated since version 1.5.0: Specifying numeric_only=None is deprecated. The default value will be False in a future version of pandas.

min_countint, default 0

The required number of valid values to perform the operation. If fewer than min_count non-NA values are present the result will be NA.

**kwargs

Additional keyword arguments to be passed to the function.

Returns:
Series or DataFrame (if level specified)

See also

Series.sum

Return the sum.

Series.min

Return the minimum.

Series.max

Return the maximum.

Series.idxmin

Return the index of the minimum.

Series.idxmax

Return the index of the maximum.

DataFrame.sum

Return the sum over the requested axis.

DataFrame.min

Return the minimum over the requested axis.

DataFrame.max

Return the maximum over the requested axis.

DataFrame.idxmin

Return the index of the minimum over the requested axis.

DataFrame.idxmax

Return the index of the maximum over the requested axis.

Examples

By default, the product of an empty or all-NA Series is 1

>>> pd.Series([], dtype="float64").prod()  
1.0

This can be controlled with the min_count parameter

>>> pd.Series([], dtype="float64").prod(min_count=1)  
nan

Thanks to the skipna parameter, min_count handles all-NA and empty series identically.

>>> pd.Series([np.nan]).prod()  
1.0
>>> pd.Series([np.nan]).prod(min_count=1)  
nan
product(axis=None, skipna=True, split_every=False, dtype=None, out=None, min_count=None, numeric_only=None)#

Return the product of the values over the requested axis.

This docstring was copied from pandas.core.frame.DataFrame.prod.

Some inconsistencies with the Dask version may exist.

Parameters:
axis{index (0), columns (1)}

Axis for the function to be applied on. For Series this parameter is unused and defaults to 0.

skipnabool, default True

Exclude NA/null values when computing the result.

levelint or level name, default None (Not supported in Dask)

If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series.

Deprecated since version 1.3.0: The level keyword is deprecated. Use groupby instead.

numeric_onlybool, default None

Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for Series.

Deprecated since version 1.5.0: Specifying numeric_only=None is deprecated. The default value will be False in a future version of pandas.

min_countint, default 0

The required number of valid values to perform the operation. If fewer than min_count non-NA values are present the result will be NA.

**kwargs

Additional keyword arguments to be passed to the function.

Returns:
Series or DataFrame (if level specified)

See also

Series.sum

Return the sum.

Series.min

Return the minimum.

Series.max

Return the maximum.

Series.idxmin

Return the index of the minimum.

Series.idxmax

Return the index of the maximum.

DataFrame.sum

Return the sum over the requested axis.

DataFrame.min

Return the minimum over the requested axis.

DataFrame.max

Return the maximum over the requested axis.

DataFrame.idxmin

Return the index of the minimum over the requested axis.

DataFrame.idxmax

Return the index of the maximum over the requested axis.

Examples

By default, the product of an empty or all-NA Series is 1

>>> pd.Series([], dtype="float64").prod()  
1.0

This can be controlled with the min_count parameter

>>> pd.Series([], dtype="float64").prod(min_count=1)  
nan

Thanks to the skipna parameter, min_count handles all-NA and empty series identically.

>>> pd.Series([np.nan]).prod()  
1.0
>>> pd.Series([np.nan]).prod(min_count=1)  
nan
quantile(q=0.5, axis=0, numeric_only=_NoDefault.no_default, method='default')#

Approximate row-wise and precise column-wise quantiles of DataFrame

Parameters:
qlist/array of floats, default 0.5 (50%)

Iterable of numbers ranging from 0 to 1 for the desired quantiles

axis{0, 1, ‘index’, ‘columns’} (default 0)

0 or ‘index’ for row-wise, 1 or ‘columns’ for column-wise

method{‘default’, ‘tdigest’, ‘dask’}, optional

What method to use. By default, Dask’s internal custom algorithm ('dask') is used. If set to 'tdigest', tdigest is used for floats and ints, falling back to 'dask' otherwise.

query(expr, **kwargs)#

Filter dataframe with complex expression

Blocked version of pd.DataFrame.query

Parameters:
expr: str

The query string to evaluate. You can refer to column names that are not valid Python variable names by surrounding them in backticks. Dask does not fully support referring to variables using the ‘@’ character, use f-strings or the local_dict keyword argument instead.

Notes

This is like the sequential version except that this will also happen in many threads. This may conflict with numexpr which will use multiple threads itself. We recommend that you set numexpr to use a single thread:

import numexpr
numexpr.set_num_threads(1)

Examples

>>> import pandas as pd
>>> import dask.dataframe as dd
>>> df = pd.DataFrame({'x': [1, 2, 1, 2],
...                    'y': [1, 2, 3, 4],
...                    'z z': [4, 3, 2, 1]})
>>> ddf = dd.from_pandas(df, npartitions=2)

Refer to column names directly:

>>> ddf.query('y > x').compute()
   x  y  z z
2  1  3    2
3  2  4    1

Refer to column name using backticks:

>>> ddf.query('`z z` > x').compute()
   x  y  z z
0  1  1    4
1  2  2    3
2  1  3    2

Refer to variable name using f-strings:

>>> value = 1
>>> ddf.query(f'x == {value}').compute()
   x  y  z z
0  1  1    4
2  1  3    2

Refer to variable name using local_dict:

>>> ddf.query('x == @value', local_dict={"value": value}).compute()
   x  y  z z
0  1  1    4
2  1  3    2
radd(other, axis='columns', level=None, fill_value=None)#

Get Addition of DataFrame or Series and other, element-wise (binary operator radd).

This docstring was copied from cudf.core.series.Series.radd.

Some inconsistencies with the Dask version may exist.

Equivalent to other + frame, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters:
otherscalar, sequence, Series, or DataFrame

Any single or multiple element data structure, or list-like object.

axisint or string

Only 0 is supported for series, 1 or columns supported for dataframe

levelint or name

Broadcast across a level, matching Index values on the passed MultiIndex level. Not yet supported.

fill_valuefloat or None, default None

Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing.

Returns:
DataFrame or Series

Result of the arithmetic operation.

Examples

DataFrame

>>> df = cudf.DataFrame(  
...     {'angles': [0, 3, 4], 'degrees': [360, 180, 360]},
...     index=['circle', 'triangle', 'rectangle']
... )
>>> df.radd(1)  
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361

Series

>>> a = cudf.Series([1, 1, 1, None], index=['a', 'b', 'c', 'd'])  
>>> b = cudf.Series([1, None, 1, None], index=['a', 'b', 'd', 'e'])  
>>> a.radd(b)  
a       2
b    <NA>
c    <NA>
d    <NA>
e    <NA>
dtype: int64
>>> a.radd(b, fill_value=0)  
a       2
b       1
c       1
d       1
e    <NA>
dtype: int64
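
Because addition of numbers is commutative, the numeric examples do not show how radd differs from add. The sketch below uses plain pandas (an assumption; cuDF behaviour is not shown) with string data, where operand order is visible, and then repeats the fill_value idea on a small numeric case.

```python
import pandas as pd

s = pd.Series(["a", "b"])

forward = s.add("x")    # s + "x"
reverse = s.radd("x")   # "x" + s  (operands swapped)
print(forward.tolist(), reverse.tolist())

# fill_value substitutes for an entry missing on one side only.
a = pd.Series([1, 2], index=["a", "b"])
b = pd.Series([10], index=["a"])
filled = a.radd(b, fill_value=0)   # b + a, with b["b"] treated as 0
print(filled.tolist())
```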
random_split(frac, random_state=None, shuffle=False)#

Pseudorandomly split dataframe into different pieces row-wise

Parameters:
fraclist

List of floats that should sum to one.

random_stateint or np.random.RandomState

If int create a new RandomState with this as the seed. Otherwise draw from the passed RandomState.

shufflebool, default False

If set to True, the dataframe is shuffled (within partition) before the split.

See also

dask.DataFrame.sample

Examples

50/50 split

>>> a, b = df.random_split([0.5, 0.5])  

80/10/10 split, consistent random_state

>>> a, b, c = df.random_split([0.8, 0.1, 0.1], random_state=123)  
rdiv(other, axis='columns', level=None, fill_value=None)#

Get Floating division of dataframe and other, element-wise (binary operator rtruediv).

This docstring was copied from pandas.core.frame.DataFrame.rdiv.

Some inconsistencies with the Dask version may exist.

Equivalent to other / dataframe, but with support to substitute a fill_value for missing data in one of the inputs. With reverse version, truediv.

Among flexible wrappers (add, sub, mul, div, mod, pow) to arithmetic operators: +, -, *, /, //, %, **.

Parameters:
otherscalar, sequence, Series, dict or DataFrame

Any single or multiple element data structure, or list-like object.

axis{0 or ‘index’, 1 or ‘columns’}

Whether to compare by the index (0 or ‘index’) or columns. (1 or ‘columns’). For Series input, axis to match Series index on.

levelint or label

Broadcast across a level, matching Index values on the passed MultiIndex level.

fill_valuefloat or None, default None

Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing.

Returns:
DataFrame

Result of the arithmetic operation.

See also

DataFrame.add

Add DataFrames.

DataFrame.sub

Subtract DataFrames.

DataFrame.mul

Multiply DataFrames.

DataFrame.div

Divide DataFrames (float division).

DataFrame.truediv

Divide DataFrames (float division).

DataFrame.floordiv

Divide DataFrames (integer division).

DataFrame.mod

Calculate modulo (remainder after division).

DataFrame.pow

Calculate exponential power.

Notes

Mismatched indices will be unioned together.

Examples

>>> df = pd.DataFrame({'angles': [0, 3, 4],  
...                    'degrees': [360, 180, 360]},
...                   index=['circle', 'triangle', 'rectangle'])
>>> df  
           angles  degrees
circle          0      360
triangle        3      180
rectangle       4      360

Add a scalar with the operator version, which returns the same result.

>>> df + 1  
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361
>>> df.add(1)  
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361

Divide by constant with reverse version.

>>> df.div(10)  
           angles  degrees
circle        0.0     36.0
triangle      0.3     18.0
rectangle     0.4     36.0
>>> df.rdiv(10)  
             angles   degrees
circle          inf  0.027778
triangle   3.333333  0.055556
rectangle  2.500000  0.027778

Subtract a list and Series by axis with operator version.

>>> df - [1, 2]  
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358
>>> df.sub([1, 2], axis='columns')  
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358
>>> df.sub(pd.Series([1, 1, 1], index=['circle', 'triangle', 'rectangle']),  
...        axis='index')
           angles  degrees
circle         -1      359
triangle        2      179
rectangle       3      359

Multiply a dictionary by axis.

>>> df.mul({'angles': 0, 'degrees': 2})  
           angles  degrees
circle          0      720
triangle        0      360
rectangle       0      720
>>> df.mul({'circle': 0, 'triangle': 2, 'rectangle': 3}, axis='index')  
           angles  degrees
circle          0        0
triangle        6      360
rectangle      12     1080

Multiply a DataFrame of different shape with operator version.

>>> other = pd.DataFrame({'angles': [0, 3, 4]},  
...                      index=['circle', 'triangle', 'rectangle'])
>>> other  
           angles
circle          0
triangle        3
rectangle       4
>>> df * other  
           angles  degrees
circle          0      NaN
triangle        9      NaN
rectangle      16      NaN
>>> df.mul(other, fill_value=0)  
           angles  degrees
circle          0      0.0
triangle        9      0.0
rectangle      16      0.0

Divide by a MultiIndex by level.

>>> df_multindex = pd.DataFrame({'angles': [0, 3, 4, 4, 5, 6],  
...                              'degrees': [360, 180, 360, 360, 540, 720]},
...                             index=[['A', 'A', 'A', 'B', 'B', 'B'],
...                                    ['circle', 'triangle', 'rectangle',
...                                     'square', 'pentagon', 'hexagon']])
>>> df_multindex  
             angles  degrees
A circle          0      360
  triangle        3      180
  rectangle       4      360
B square          4      360
  pentagon        5      540
  hexagon         6      720
>>> df.div(df_multindex, level=1, fill_value=0)  
             angles  degrees
A circle        NaN      1.0
  triangle      1.0      1.0
  rectangle     1.0      1.0
B square        0.0      0.0
  pentagon      0.0      0.0
  hexagon       0.0      0.0
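
Most of the examples above come from the shared pandas docstring and exercise other operators. A short sketch focused on rdiv itself, using plain pandas and a frame without zeros so that no infinities appear:

```python
import pandas as pd

df = pd.DataFrame(
    {"angles": [3, 4], "degrees": [180, 360]},
    index=["triangle", "rectangle"],
)

reversed_div = df.rdiv(10)  # computes 10 / df
operator_div = 10 / df      # the operator form of the same thing
print(reversed_div)
```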
reduction(chunk, aggregate=None, combine=None, meta=_NoDefault.no_default, token=None, split_every=None, chunk_kwargs=None, aggregate_kwargs=None, combine_kwargs=None, **kwargs)#

Generic row-wise reductions.

Parameters:
chunkcallable

Function to operate on each partition. Should return a pandas.DataFrame, pandas.Series, or a scalar.

aggregatecallable, optional

Function to operate on the concatenated result of chunk. If not specified, defaults to chunk. Used to do the final aggregation in a tree reduction.

The input to aggregate depends on the output of chunk. If the output of chunk is a:

  • scalar: Input is a Series, with one row per partition.

  • Series: Input is a DataFrame, with one row per partition. Columns are the rows in the output series.

  • DataFrame: Input is a DataFrame, with one row per partition. Columns are the columns in the output dataframes.

Should return a pandas.DataFrame, pandas.Series, or a scalar.

combinecallable, optional

Function to operate on intermediate concatenated results of chunk in a tree-reduction. If not provided, defaults to aggregate. The input/output requirements should match that of aggregate described above.

metapd.DataFrame, pd.Series, dict, iterable, tuple, optional

An empty pd.DataFrame or pd.Series that matches the dtypes and column names of the output. This metadata is necessary for many algorithms in dask dataframe to work. For ease of use, some alternative inputs are also available. Instead of a DataFrame, a dict of {name: dtype} or iterable of (name, dtype) can be provided (note that the order of the names should match the order of the columns). Instead of a series, a tuple of (name, dtype) can be used. If not provided, dask will try to infer the metadata. This may lead to unexpected results, so providing meta is recommended. For more information, see dask.dataframe.utils.make_meta.

tokenstr, optional

The name to use for the output keys.

split_everyint, optional

Group partitions into groups of this size while performing a tree-reduction. If set to False, no tree-reduction will be used, and all intermediates will be concatenated and passed to aggregate. Default is 8.

chunk_kwargsdict, optional

Keyword arguments to pass on to chunk only.

aggregate_kwargsdict, optional

Keyword arguments to pass on to aggregate only.

combine_kwargsdict, optional

Keyword arguments to pass on to combine only.

kwargs

All remaining keywords will be passed to chunk, combine, and aggregate.

Examples

>>> import pandas as pd
>>> import dask.dataframe as dd
>>> df = pd.DataFrame({'x': range(50), 'y': range(50, 100)})
>>> ddf = dd.from_pandas(df, npartitions=4)

Count the number of rows in a DataFrame. To do this, count the number of rows in each partition, then sum the results:

>>> res = ddf.reduction(lambda x: x.count(),
...                     aggregate=lambda x: x.sum())
>>> res.compute()
x    50
y    50
dtype: int64

Count the number of rows in a Series with elements greater than or equal to a value (provided via a keyword).

>>> def count_greater(x, value=0):
...     return (x >= value).sum()
>>> res = ddf.x.reduction(count_greater, aggregate=lambda x: x.sum(),
...                       chunk_kwargs={'value': 25})
>>> res.compute()
25

Aggregate both the sum and count of a Series at the same time:

>>> def sum_and_count(x):
...     return pd.Series({'count': x.count(), 'sum': x.sum()},
...                      index=['count', 'sum'])
>>> res = ddf.x.reduction(sum_and_count, aggregate=lambda x: x.sum())
>>> res.compute()
count      50
sum      1225
dtype: int64

Doing the same, but for a DataFrame. Here chunk returns a DataFrame, meaning the input to aggregate is a DataFrame with an index with non-unique entries for both ‘x’ and ‘y’. We groupby the index, and sum each group to get the final result.

>>> def sum_and_count(x):
...     return pd.DataFrame({'count': x.count(), 'sum': x.sum()},
...                         columns=['count', 'sum'])
>>> res = ddf.reduction(sum_and_count,
...                     aggregate=lambda x: x.groupby(level=0).sum())
>>> res.compute()
   count   sum
x     50  1225
y     50  3725
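
The roles of chunk, combine, aggregate and split_every can be pictured with a pure-Python model. This is a conceptual sketch only, not Dask's implementation: tree_reduce is a hypothetical helper, partitions are plain lists, and "concatenation" is modelled as passing a list of intermediate results.

```python
def tree_reduce(partitions, chunk, combine, aggregate, split_every=8):
    """Toy model of a tree reduction over a list of partitions."""
    if split_every < 2:
        raise ValueError("split_every must be at least 2")
    # Step 1: apply chunk to every partition independently.
    results = [chunk(part) for part in partitions]
    # Step 2: while there are too many intermediates, combine them
    # in groups of at most split_every.
    while len(results) > split_every:
        results = [
            combine(results[i:i + split_every])
            for i in range(0, len(results), split_every)
        ]
    # Step 3: one final aggregate over what is left.
    return aggregate(results)

# Ten partitions holding the numbers 0..49.
partitions = [list(range(i, i + 5)) for i in range(0, 50, 5)]
total = tree_reduce(partitions, chunk=sum, combine=sum, aggregate=sum,
                    split_every=4)
print(total)
```

With ten partitions and split_every=4, the ten chunk results are first combined into three intermediates, which are then passed to aggregate.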
rename(index=None, columns=None)#

Alter axes labels.

This docstring was copied from pandas.core.frame.DataFrame.rename.

Some inconsistencies with the Dask version may exist.

Function / dict values must be unique (1-to-1). Labels not contained in a dict / Series will be left as-is. Extra labels listed don’t throw an error.

See the user guide for more.

Parameters:
mapperdict-like or function (Not supported in Dask)

Dict-like or function transformations to apply to that axis’ values. Use either mapper and axis to specify the axis to target with mapper, or index and columns.

indexdict-like or function (Not supported in Dask)

Alternative to specifying axis (mapper, axis=0 is equivalent to index=mapper).

columnsdict-like or function

Alternative to specifying axis (mapper, axis=1 is equivalent to columns=mapper).

axis{0 or ‘index’, 1 or ‘columns’}, default 0 (Not supported in Dask)

Axis to target with mapper. Can be either the axis name (‘index’, ‘columns’) or number (0, 1). The default is ‘index’.

copybool, default True (Not supported in Dask)

Also copy underlying data.

inplacebool, default False (Not supported in Dask)

Whether to modify the DataFrame rather than creating a new one. If True then value of copy is ignored.

levelint or level name, default None (Not supported in Dask)

In case of a MultiIndex, only rename labels in the specified level.

errors{‘ignore’, ‘raise’}, default ‘ignore’ (Not supported in Dask)

If ‘raise’, raise a KeyError when a dict-like mapper, index, or columns contains labels that are not present in the Index being transformed. If ‘ignore’, existing keys will be renamed and extra keys will be ignored.

Returns:
DataFrame or None

DataFrame with the renamed axis labels or None if inplace=True.

Raises:
KeyError

If any of the labels is not found in the selected axis and “errors=’raise’”.

See also

DataFrame.rename_axis

Set the name of the axis.

Examples

DataFrame.rename supports two calling conventions

  • (index=index_mapper, columns=columns_mapper, ...)

  • (mapper, axis={'index', 'columns'}, ...)

We highly recommend using keyword arguments to clarify your intent.

Rename columns using a mapping:

>>> df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})  
>>> df.rename(columns={"A": "a", "B": "c"})  
   a  c
0  1  4
1  2  5
2  3  6

Rename index using a mapping:

>>> df.rename(index={0: "x", 1: "y", 2: "z"})  
   A  B
x  1  4
y  2  5
z  3  6

Cast index labels to a different type:

>>> df.index  
RangeIndex(start=0, stop=3, step=1)
>>> df.rename(index=str).index  
Index(['0', '1', '2'], dtype='object')
>>> df.rename(columns={"A": "a", "B": "b", "C": "c"}, errors="raise")  
Traceback (most recent call last):
KeyError: ['C'] not found in axis

Using axis-style parameters:

>>> df.rename(str.lower, axis='columns')  
   a  b
0  1  4
1  2  5
2  3  6
>>> df.rename({1: 2, 2: 4}, axis='index')  
   A  B
0  1  4
2  2  5
4  3  6
repartition(divisions=None, npartitions=None, partition_size=None, freq=None, force=False)#

Repartition dataframe along new divisions

Parameters:
divisionslist, optional

The “dividing lines” used to split the dataframe into partitions. For divisions=[0, 10, 50, 100], there would be three output partitions, where the new index contained [0, 10), [10, 50), and [50, 100), respectively. See https://docs.dask.org/en/latest/dataframe-design.html#partitions. Only used if npartitions and partition_size aren’t specified. For convenience, if given an integer this will defer to npartitions, and if given a string it will defer to partition_size (see below).

npartitionsint, optional

Approximate number of partitions of output. Only used if partition_size isn’t specified. The number of partitions used may be slightly lower than npartitions depending on data distribution, but will never be higher.

partition_size: int or string, optional

Max number of bytes of memory for each partition. Use numbers or strings like 5MB. If specified npartitions and divisions will be ignored. Note that the size reflects the number of bytes used as computed by pandas.DataFrame.memory_usage, which will not necessarily match the size when storing to disk.

Warning

This keyword argument triggers computation to determine the memory size of each partition, which may be expensive.

freqstr, pd.Timedelta

A period on which to partition timeseries data like '7D' or '12h' or pd.Timedelta(hours=12). Assumes a datetime index.

forcebool, default False

Allows the expansion of the existing divisions. If False then the new divisions’ lower and upper bounds must be the same as the old divisions’.

Notes

Exactly one of divisions, npartitions, partition_size, or freq should be specified. A ValueError will be raised when that is not the case.

Also note that len(divisions) is equal to npartitions + 1. This is because divisions represents the upper and lower bounds of each partition. The first item is the lower bound of the first partition, the second item is the lower bound of the second partition and the upper bound of the first partition, and so on. The second-to-last item is the lower bound of the last partition, and the last (extra) item is the upper bound of the last partition.
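
The relationship between divisions and partitions can be worked through in plain Python. This is an illustrative sketch, not Dask code: partition_of is a hypothetical helper, and it assumes the final upper bound belongs to the last partition.

```python
from bisect import bisect_right

divisions = [0, 10, 50, 100]
npartitions = len(divisions) - 1          # 4 boundaries -> 3 partitions

# (lower, upper) bounds of each partition, read off adjacent pairs.
bounds = list(zip(divisions[:-1], divisions[1:]))

def partition_of(value):
    """Return the position of the partition whose bounds contain value."""
    if not divisions[0] <= value <= divisions[-1]:
        raise ValueError("value lies outside the divisions")
    # Assumption: the last upper bound is assigned to the last partition.
    return min(bisect_right(divisions, value) - 1, npartitions - 1)

print(npartitions, bounds)
print(partition_of(0), partition_of(10), partition_of(49), partition_of(100))
```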

Examples

>>> df = df.repartition(npartitions=10)  
>>> df = df.repartition(divisions=[0, 5, 10, 20])  
>>> df = df.repartition(freq='7d')  
replace(to_replace=None, value=None, regex=False)#

Replace values given in to_replace with value.

This docstring was copied from pandas.core.frame.DataFrame.replace.

Some inconsistencies with the Dask version may exist.

Values of the DataFrame are replaced with other values dynamically.

This differs from updating with .loc or .iloc, which require you to specify a location to update with some value.

Parameters:
to_replacestr, regex, list, dict, Series, int, float, or None

How to find the values that will be replaced.

  • numeric, str or regex:

    • numeric: numeric values equal to to_replace will be replaced with value

    • str: string exactly matching to_replace will be replaced with value

    • regex: regexs matching to_replace will be replaced with value

  • list of str, regex, or numeric:

    • First, if to_replace and value are both lists, they must be the same length.

    • Second, if regex=True then all of the strings in both lists will be interpreted as regexs otherwise they will match directly. This doesn’t matter much for value since there are only a few possible substitution regexes you can use.

    • str, regex and numeric rules apply as above.

  • dict:

    • Dicts can be used to specify different replacement values for different existing values. For example, {'a': 'b', 'y': 'z'} replaces the value ‘a’ with ‘b’ and ‘y’ with ‘z’. To use a dict in this way, the optional value parameter should not be given.

    • For a DataFrame a dict can specify that different values should be replaced in different columns. For example, {'a': 1, 'b': 'z'} looks for the value 1 in column ‘a’ and the value ‘z’ in column ‘b’ and replaces these values with whatever is specified in value. The value parameter should not be None in this case. You can treat this as a special case of passing two lists except that you are specifying the column to search in.

    • For a DataFrame nested dictionaries, e.g., {'a': {'b': np.nan}}, are read as follows: look in column ‘a’ for the value ‘b’ and replace it with NaN. The optional value parameter should not be specified to use a nested dict in this way. You can nest regular expressions as well. Note that column names (the top-level dictionary keys in a nested dictionary) cannot be regular expressions.

  • None:

    • This means that the regex argument must be a string, compiled regular expression, or list, dict, ndarray or Series of such elements. If value is also None then this must be a nested dictionary or Series.

See the examples section for examples of each of these.

valuescalar, dict, list, str, regex, default None

Value to replace any values matching to_replace with. For a DataFrame a dict of values can be used to specify which value to use for each column (columns not in the dict will not be filled). Regular expressions, strings and lists or dicts of such objects are also allowed.

inplacebool, default False (Not supported in Dask)

Whether to modify the DataFrame rather than creating a new one.

limitint, default None (Not supported in Dask)

Maximum size gap to forward or backward fill.

regexbool or same types as to_replace, default False

Whether to interpret to_replace and/or value as regular expressions. If this is True then to_replace must be a string. Alternatively, this could be a regular expression or a list, dict, or array of regular expressions in which case to_replace must be None.

method{‘pad’, ‘ffill’, ‘bfill’} (Not supported in Dask)

The method to use for replacement, when to_replace is a scalar, list or tuple and value is None.

Changed in version 0.23.0: Added to DataFrame.

Returns:
DataFrame

Object after replacement.

Raises:
AssertionError
  • If regex is not a bool and to_replace is not None.

TypeError
  • If to_replace is not a scalar, array-like, dict, or None

  • If to_replace is a dict and value is not a list, dict, ndarray, or Series

  • If to_replace is None and regex is not compilable into a regular expression or is a list, dict, ndarray, or Series.

  • When replacing multiple bool or datetime64 objects and the arguments to to_replace does not match the type of the value being replaced

ValueError
  • If a list or an ndarray is passed to to_replace and value but they are not the same length.

See also

DataFrame.fillna

Fill NA values.

DataFrame.where

Replace values based on boolean condition.

Series.str.replace

Simple string replacement.

Notes

  • Regex substitution is performed under the hood with re.sub. The rules for substitution for re.sub are the same.

  • Regular expressions will only substitute on strings, meaning you cannot provide, for example, a regular expression matching floating point numbers and expect the columns in your frame that have a numeric dtype to be matched. However, if those floating point numbers are strings, then you can do this.

  • This method has a lot of options. You are encouraged to experiment and play with this method to gain intuition about how it works.

  • When a dict is used as the to_replace value, the key(s) in the dict act as the to_replace part and the value(s) in the dict act as the value parameter.
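
The note about regular expressions only substituting on strings can be checked with a small sketch. It uses plain pandas (the library this docstring was copied from) rather than a Dask or Dask-cuDF frame, and the two-column frame with columns `num` and `txt` is a hypothetical example that does not appear in the original documentation.

```python
import pandas as pd

# Hypothetical frame: the same values stored once as floats and once as strings.
df = pd.DataFrame({"num": [1.5, 2.5], "txt": ["1.5", "2.5"]})

# The pattern can only match entries of the string column;
# the numeric column is left untouched.
out = df.replace(to_replace=r"^1\.5$", value="X", regex=True)

print(out["num"].tolist())  # [1.5, 2.5]
print(out["txt"].tolist())  # ['X', '2.5']
```

To match the numeric column as well, one option is to convert it to strings first, as the note suggests.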

Examples

Scalar `to_replace` and `value`

>>> s = pd.Series([1, 2, 3, 4, 5])  
>>> s.replace(1, 5)  
0    5
1    2
2    3
3    4
4    5
dtype: int64
>>> df = pd.DataFrame({'A': [0, 1, 2, 3, 4],  
...                    'B': [5, 6, 7, 8, 9],
...                    'C': ['a', 'b', 'c', 'd', 'e']})
>>> df.replace(0, 5)
   A  B  C
0  5  5  a
1  1  6  b
2  2  7  c
3  3  8  d
4  4  9  e

List-like `to_replace`

>>> df.replace([0, 1, 2, 3], 4)
   A  B  C
0  4  5  a
1  4  6  b
2  4  7  c
3  4  8  d
4  4  9  e
>>> df.replace([0, 1, 2, 3], [4, 3, 2, 1])
   A  B  C
0  4  5  a
1  3  6  b
2  2  7  c
3  1  8  d
4  4  9  e
>>> s.replace([1, 2], method='bfill')  
0    3
1    3
2    3
3    4
4    5
dtype: int64

dict-like `to_replace`

>>> df.replace({0: 10, 1: 100})
     A  B  C
0   10  5  a
1  100  6  b
2    2  7  c
3    3  8  d
4    4  9  e
>>> df.replace({'A': 0, 'B': 5}, 100)
     A    B  C
0  100  100  a
1    1    6  b
2    2    7  c
3    3    8  d
4    4    9  e
>>> df.replace({'A': {0: 100, 4: 400}})
     A  B  C
0  100  5  a
1    1  6  b
2    2  7  c
3    3  8  d
4  400  9  e

Regular expression `to_replace`

>>> df = pd.DataFrame({'A': ['bat', 'foo', 'bait'],  
...                    'B': ['abc', 'bar', 'xyz']})
>>> df.replace(to_replace=r'^ba.$', value='new', regex=True)
      A    B
0   new  abc
1   foo  new
2  bait  xyz
>>> df.replace({'A': r'^ba.$'}, {'A': 'new'}, regex=True)
      A    B
0   new  abc
1   foo  bar
2  bait  xyz
>>> df.replace(regex=r'^ba.$', value='new')
      A    B
0   new  abc
1   foo  new
2  bait  xyz
>>> df.replace(regex={r'^ba.$': 'new', 'foo': 'xyz'})
      A    B
0   new  abc
1   xyz  new
2  bait  xyz
>>> df.replace(regex=[r'^ba.$', 'foo'], value='new')
      A    B
0   new  abc
1   new  new
2  bait  xyz

Compare the behavior of s.replace({'a': None}) and s.replace('a', None) to understand the peculiarities of the to_replace parameter:

>>> s = pd.Series([10, 'a', 'a', 'b', 'a'])  

When a dict is passed as the to_replace value, the value(s) in the dict act as the value parameter. s.replace({'a': None}) is equivalent to s.replace(to_replace={'a': None}, value=None, method=None):

>>> s.replace({'a': None})  
0      10
1    None
2    None
3       b
4    None
dtype: object

When value is not explicitly passed and to_replace is a scalar, list or tuple, replace uses the method parameter (default ‘pad’) to do the replacement. This is why the ‘a’ values are replaced by 10 in rows 1 and 2 and by ‘b’ in row 4 in this case.

>>> s.replace('a')  
0    10
1    10
2    10
3     b
4     b
dtype: object

On the other hand, if None is explicitly passed for value, it will be respected:

>>> s.replace('a', None)  
0      10
1    None
2    None
3       b
4    None
dtype: object

Changed in version 1.4.0: Previously the explicit None was silently ignored.

resample(rule, closed=None, label=None)#

Resample time-series data.

This docstring was copied from pandas.core.frame.DataFrame.resample.

Some inconsistencies with the Dask version may exist.

Convenience method for frequency conversion and resampling of time series. The object must have a datetime-like index (DatetimeIndex, PeriodIndex, or TimedeltaIndex), or the caller must pass the label of a datetime-like series/index to the on/level keyword parameter.

Parameters:
ruleDateOffset, Timedelta or str

The offset string or object representing target conversion.

axis{0 or ‘index’, 1 or ‘columns’}, default 0 (Not supported in Dask)

Which axis to use for up- or down-sampling. For Series this parameter is unused and defaults to 0. Must be DatetimeIndex, TimedeltaIndex or PeriodIndex.

closed{‘right’, ‘left’}, default None

Which side of bin interval is closed. The default is ‘left’ for all frequency offsets except for ‘M’, ‘A’, ‘Q’, ‘BM’, ‘BA’, ‘BQ’, and ‘W’ which all have a default of ‘right’.

label{‘right’, ‘left’}, default None

Which bin edge label to label bucket with. The default is ‘left’ for all frequency offsets except for ‘M’, ‘A’, ‘Q’, ‘BM’, ‘BA’, ‘BQ’, and ‘W’ which all have a default of ‘right’.

convention{‘start’, ‘end’, ‘s’, ‘e’}, default ‘start’ (Not supported in Dask)

For PeriodIndex only, controls whether to use the start or end of rule.

kind{‘timestamp’, ‘period’}, optional, default None (Not supported in Dask)

Pass ‘timestamp’ to convert the resulting index to a DateTimeIndex or ‘period’ to convert it to a PeriodIndex. By default the input representation is retained.

loffsettimedelta, default None (Not supported in Dask)

Adjust the resampled time labels.

Deprecated since version 1.1.0: You should add the loffset to the df.index after the resample. See below.

baseint, default 0 (Not supported in Dask)

For frequencies that evenly subdivide 1 day, the “origin” of the aggregated intervals. For example, for ‘5min’ frequency, base could range from 0 through 4. Defaults to 0.

Deprecated since version 1.1.0: The new arguments that you should use are ‘offset’ or ‘origin’.

onstr, optional (Not supported in Dask)

For a DataFrame, column to use instead of index for resampling. Column must be datetime-like.

levelstr or int, optional (Not supported in Dask)

For a MultiIndex, level (name or number) to use for resampling. level must be datetime-like.

originTimestamp or str, default ‘start_day’ (Not supported in Dask)

The timestamp on which to adjust the grouping. The timezone of origin must match the timezone of the index. If string, must be one of the following:

  • ‘epoch’: origin is 1970-01-01

  • ‘start’: origin is the first value of the timeseries

  • ‘start_day’: origin is the first day at midnight of the timeseries

New in version 1.1.0.

  • ‘end’: origin is the last value of the timeseries

  • ‘end_day’: origin is the ceiling midnight of the last day

New in version 1.3.0.

offsetTimedelta or str, default is None (Not supported in Dask)

An offset timedelta added to the origin.

New in version 1.1.0.

group_keysbool, optional (Not supported in Dask)

Whether to include the group keys in the result index when using .apply() on the resampled object. Not specifying group_keys will retain values-dependent behavior from pandas 1.4 and earlier (see pandas 1.5.0 Release notes for examples). In a future version of pandas, the behavior will default to the same as specifying group_keys=False.

New in version 1.5.0.

Returns:
pandas.core.Resampler

Resampler object.

See also

Series.resample

Resample a Series.

DataFrame.resample

Resample a DataFrame.

groupby

Group DataFrame by mapping, function, label, or list of labels.

asfreq

Reindex a DataFrame with the given frequency without grouping.

Notes

See the user guide for more.

To learn more about the offset strings, please see this link.

Examples

Start by creating a series with 9 one minute timestamps.

>>> index = pd.date_range('1/1/2000', periods=9, freq='T')  
>>> series = pd.Series(range(9), index=index)  
>>> series  
2000-01-01 00:00:00    0
2000-01-01 00:01:00    1
2000-01-01 00:02:00    2
2000-01-01 00:03:00    3
2000-01-01 00:04:00    4
2000-01-01 00:05:00    5
2000-01-01 00:06:00    6
2000-01-01 00:07:00    7
2000-01-01 00:08:00    8
Freq: T, dtype: int64

Downsample the series into 3 minute bins and sum the values of the timestamps falling into a bin.

>>> series.resample('3T').sum()  
2000-01-01 00:00:00     3
2000-01-01 00:03:00    12
2000-01-01 00:06:00    21
Freq: 3T, dtype: int64

Downsample the series into 3 minute bins as above, but label each bin using the right edge instead of the left. Note that the value in the bucket used as the label is not included in the bucket that it labels. For example, in the original series the bucket 2000-01-01 00:03:00 contains the value 3, but the summed value in the resampled bucket with the label 2000-01-01 00:03:00 does not include 3 (if it did, the summed value would be 6, not 3). To include this value, close the right side of the bin interval, as illustrated in the example below this one.

>>> series.resample('3T', label='right').sum()  
2000-01-01 00:03:00     3
2000-01-01 00:06:00    12
2000-01-01 00:09:00    21
Freq: 3T, dtype: int64

Downsample the series into 3 minute bins as above, but close the right side of the bin interval.

>>> series.resample('3T', label='right', closed='right').sum()  
2000-01-01 00:00:00     0
2000-01-01 00:03:00     6
2000-01-01 00:06:00    15
2000-01-01 00:09:00    15
Freq: 3T, dtype: int64
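
The difference between label and closed can be made explicit with a short sketch. It is written against plain pandas (not Dask-cuDF) and uses the 'min' alias instead of 'T', on the assumption that a recent pandas version is installed, where 'T' is deprecated. The variable names are illustrative only.

```python
import pandas as pd

index = pd.date_range("2000-01-01", periods=9, freq="min")
series = pd.Series(range(9), index=index)

left = series.resample("3min").sum()
right_label = series.resample("3min", label="right").sum()
right_closed = series.resample("3min", label="right", closed="right").sum()

# label='right' only moves each bin's label by one bin width;
# the bin contents (and therefore the sums) are unchanged.
print(left.tolist(), right_label.tolist())  # [3, 12, 21] [3, 12, 21]

# closed='right' changes which timestamps fall into each bin,
# so both the sums and the number of bins differ.
print(right_closed.tolist())  # [0, 6, 15, 15]
```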

Upsample the series into 30 second bins.

>>> series.resample('30S').asfreq()[0:5]   # Select first 5 rows  
2000-01-01 00:00:00   0.0
2000-01-01 00:00:30   NaN
2000-01-01 00:01:00   1.0
2000-01-01 00:01:30   NaN
2000-01-01 00:02:00   2.0
Freq: 30S, dtype: float64

Upsample the series into 30 second bins and fill the NaN values using the ffill method.

>>> series.resample('30S').ffill()[0:5]  
2000-01-01 00:00:00    0
2000-01-01 00:00:30    0
2000-01-01 00:01:00    1
2000-01-01 00:01:30    1
2000-01-01 00:02:00    2
Freq: 30S, dtype: int64

Upsample the series into 30 second bins and fill the NaN values using the bfill method.

>>> series.resample('30S').bfill()[0:5]  
2000-01-01 00:00:00    0
2000-01-01 00:00:30    1
2000-01-01 00:01:00    1
2000-01-01 00:01:30    2
2000-01-01 00:02:00    2
Freq: 30S, dtype: int64

Pass a custom function via apply

>>> def custom_resampler(arraylike):  
...     return np.sum(arraylike) + 5
...
>>> series.resample('3T').apply(custom_resampler)  
2000-01-01 00:00:00     8
2000-01-01 00:03:00    17
2000-01-01 00:06:00    26
Freq: 3T, dtype: int64

For a Series with a PeriodIndex, the keyword convention can be used to control whether to use the start or end of rule.

Resample a year by quarter using ‘start’ convention. Values are assigned to the first quarter of the period.

>>> s = pd.Series([1, 2], index=pd.period_range('2012-01-01',  
...                                             freq='A',
...                                             periods=2))
>>> s  
2012    1
2013    2
Freq: A-DEC, dtype: int64
>>> s.resample('Q', convention='start').asfreq()  
2012Q1    1.0
2012Q2    NaN
2012Q3    NaN
2012Q4    NaN
2013Q1    2.0
2013Q2    NaN
2013Q3    NaN
2013Q4    NaN
Freq: Q-DEC, dtype: float64

Resample quarters by month using ‘end’ convention. Values are assigned to the last month of the period.

>>> q = pd.Series([1, 2, 3, 4], index=pd.period_range('2018-01-01',  
...                                                   freq='Q',
...                                                   periods=4))
>>> q  
2018Q1    1
2018Q2    2
2018Q3    3
2018Q4    4
Freq: Q-DEC, dtype: int64
>>> q.resample('M', convention='end').asfreq()  
2018-03    1.0
2018-04    NaN
2018-05    NaN
2018-06    2.0
2018-07    NaN
2018-08    NaN
2018-09    3.0
2018-10    NaN
2018-11    NaN
2018-12    4.0
Freq: M, dtype: float64

For DataFrame objects, the keyword on can be used to specify the column instead of the index for resampling.

>>> d = {'price': [10, 11, 9, 13, 14, 18, 17, 19],  
...      'volume': [50, 60, 40, 100, 50, 100, 40, 50]}
>>> df = pd.DataFrame(d)  
>>> df['week_starting'] = pd.date_range('01/01/2018',  
...                                     periods=8,
...                                     freq='W')
>>> df  
   price  volume week_starting
0     10      50    2018-01-07
1     11      60    2018-01-14
2      9      40    2018-01-21
3     13     100    2018-01-28
4     14      50    2018-02-04
5     18     100    2018-02-11
6     17      40    2018-02-18
7     19      50    2018-02-25
>>> df.resample('M', on='week_starting').mean()  
               price  volume
week_starting
2018-01-31     10.75    62.5
2018-02-28     17.00    60.0

For a DataFrame with MultiIndex, the keyword level can be used to specify on which level the resampling needs to take place.

>>> days = pd.date_range('1/1/2000', periods=4, freq='D')  
>>> d2 = {'price': [10, 11, 9, 13, 14, 18, 17, 19],  
...       'volume': [50, 60, 40, 100, 50, 100, 40, 50]}
>>> df2 = pd.DataFrame(  
...     d2,
...     index=pd.MultiIndex.from_product(
...         [days, ['morning', 'afternoon']]
...     )
... )
>>> df2  
                      price  volume
2000-01-01 morning       10      50
           afternoon     11      60
2000-01-02 morning        9      40
           afternoon     13     100
2000-01-03 morning       14      50
           afternoon     18     100
2000-01-04 morning       17      40
           afternoon     19      50
>>> df2.resample('D', level=0).sum()  
            price  volume
2000-01-01     21     110
2000-01-02     22     140
2000-01-03     32     150
2000-01-04     36      90

If you want to adjust the start of the bins based on a fixed timestamp:

>>> start, end = '2000-10-01 23:30:00', '2000-10-02 00:30:00'  
>>> rng = pd.date_range(start, end, freq='7min')  
>>> ts = pd.Series(np.arange(len(rng)) * 3, index=rng)  
>>> ts  
2000-10-01 23:30:00     0
2000-10-01 23:37:00     3
2000-10-01 23:44:00     6
2000-10-01 23:51:00     9
2000-10-01 23:58:00    12
2000-10-02 00:05:00    15
2000-10-02 00:12:00    18
2000-10-02 00:19:00    21
2000-10-02 00:26:00    24
Freq: 7T, dtype: int64
>>> ts.resample('17min').sum()  
2000-10-01 23:14:00     0
2000-10-01 23:31:00     9
2000-10-01 23:48:00    21
2000-10-02 00:05:00    54
2000-10-02 00:22:00    24
Freq: 17T, dtype: int64
>>> ts.resample('17min', origin='epoch').sum()  
2000-10-01 23:18:00     0
2000-10-01 23:35:00    18
2000-10-01 23:52:00    27
2000-10-02 00:09:00    39
2000-10-02 00:26:00    24
Freq: 17T, dtype: int64
>>> ts.resample('17min', origin='2000-01-01').sum()  
2000-10-01 23:24:00     3
2000-10-01 23:41:00    15
2000-10-01 23:58:00    45
2000-10-02 00:15:00    45
Freq: 17T, dtype: int64

If you want to adjust the start of the bins with an offset Timedelta, the two following lines are equivalent:

>>> ts.resample('17min', origin='start').sum()  
2000-10-01 23:30:00     9
2000-10-01 23:47:00    21
2000-10-02 00:04:00    54
2000-10-02 00:21:00    24
Freq: 17T, dtype: int64
>>> ts.resample('17min', offset='23h30min').sum()  
2000-10-01 23:30:00     9
2000-10-01 23:47:00    21
2000-10-02 00:04:00    54
2000-10-02 00:21:00    24
Freq: 17T, dtype: int64

If you want to take the largest Timestamp as the end of the bins:

>>> ts.resample('17min', origin='end').sum()  
2000-10-01 23:35:00     0
2000-10-01 23:52:00    18
2000-10-02 00:09:00    27
2000-10-02 00:26:00    63
Freq: 17T, dtype: int64

In contrast with the start_day, you can use end_day to take the ceiling midnight of the largest Timestamp as the end of the bins and drop the bins not containing data:

>>> ts.resample('17min', origin='end_day').sum()  
2000-10-01 23:38:00     3
2000-10-01 23:55:00    15
2000-10-02 00:12:00    45
2000-10-02 00:29:00    45
Freq: 17T, dtype: int64

To replace the use of the deprecated base argument, you can now use offset; in this example, it is equivalent to base=2:

>>> ts.resample('17min', offset='2min').sum()  
2000-10-01 23:16:00     0
2000-10-01 23:33:00     9
2000-10-01 23:50:00    36
2000-10-02 00:07:00    39
2000-10-02 00:24:00    24
Freq: 17T, dtype: int64

To replace the use of the deprecated loffset argument:

>>> from pandas.tseries.frequencies import to_offset  
>>> loffset = '19min'  
>>> ts_out = ts.resample('17min').sum()  
>>> ts_out.index = ts_out.index + to_offset(loffset)  
>>> ts_out  
2000-10-01 23:33:00     0
2000-10-01 23:50:00     9
2000-10-02 00:07:00    21
2000-10-02 00:24:00    54
2000-10-02 00:41:00    24
Freq: 17T, dtype: int64

reset_index(drop=False)#

Reset the index to the default index.

Note that unlike in pandas, the reset dask.dataframe index will not be monotonically increasing from 0. Instead, it will restart at 0 for each partition (e.g. index1 = [0, ..., 10], index2 = [0, ...]). This is due to the inability to statically know the full length of the index.

For DataFrame with multi-level index, returns a new DataFrame with labeling information in the columns under the index names, defaulting to ‘level_0’, ‘level_1’, etc. if any are None. For a standard index, the index name will be used (if set), otherwise a default ‘index’ or ‘level_0’ (if ‘index’ is already taken) will be used.

Parameters:
dropboolean, default False

Do not try to insert index into dataframe columns.
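
The per-partition restart described above can be emulated with plain pandas. This is only a sketch: the two slices below are hypothetical stand-ins for the partitions of a Dask DataFrame, and resetting each slice separately mirrors the documented behaviour rather than calling Dask itself.

```python
import pandas as pd

df = pd.DataFrame({"x": [10, 20, 30, 40, 50]}, index=list("abcde"))

# Stand-ins for two partitions (assumed split: 3 rows + 2 rows).
partitions = [df.iloc[:3], df.iloc[3:]]

# Each partition is reset on its own, so the numbering restarts at 0.
reset = pd.concat([part.reset_index() for part in partitions])

print(reset.index.tolist())    # [0, 1, 2, 0, 1]
print(reset.columns.tolist())  # ['index', 'x']
```

Because the resulting index contains duplicates, label-based lookups on it may return more than one row.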

rfloordiv(other, axis='columns', level=None, fill_value=None)#

Get Integer division of DataFrame or Series and other, element-wise (binary operator rfloordiv).

This docstring was copied from cudf.core.series.Series.rfloordiv.

Some inconsistencies with the Dask version may exist.

Equivalent to other // frame, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters:
otherscalar, sequence, Series, or DataFrame

Any single or multiple element data structure, or list-like object.

axisint or string

Only 0 is supported for Series; 1 or ‘columns’ is supported for DataFrame.

levelint or name

Broadcast across a level, matching Index values on the passed MultiIndex level. Not yet supported.

fill_valuefloat or None, default None

Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing.

Returns:
DataFrame or Series

Result of the arithmetic operation.

Examples

DataFrame

>>> df = cudf.DataFrame(  
...     {'angles': [0, 3, 4], 'degrees': [360, 180, 360]},
...     index=['circle', 'triangle', 'rectangle']
... )
>>> df.rfloordiv(1)  
                        angles  degrees
circle     9223372036854775807        0
triangle                     0        0
rectangle                    0        0

Series

>>> a = cudf.Series([1, 1, 1, None], index=['a', 'b', 'c', 'd'])  
>>> b = cudf.Series([1, None, 1, None], index=['a', 'b', 'd', 'e'])  
>>> a.rfloordiv(b)  
a       1
b    <NA>
c    <NA>
d    <NA>
e    <NA>
dtype: int64
>>> a.rfloordiv(b, fill_value=0)  
a                      1
b                      0
c                      0
d    9223372036854775807
e                   <NA>
dtype: int64

rmod(other, axis='columns', level=None, fill_value=None)#

Get Modulo of DataFrame or Series and other, element-wise (binary operator rmod).

This docstring was copied from cudf.core.series.Series.rmod.

Some inconsistencies with the Dask version may exist.

Equivalent to other % frame, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters:
otherscalar, sequence, Series, or DataFrame

Any single or multiple element data structure, or list-like object.

axisint or string

Only 0 is supported for Series; 1 or ‘columns’ is supported for DataFrame.

levelint or name

Broadcast across a level, matching Index values on the passed MultiIndex level. Not yet supported.

fill_valuefloat or None, default None

Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing.

Returns:
DataFrame or Series

Result of the arithmetic operation.

Examples

DataFrame

>>> df = cudf.DataFrame(  
...     {'angles': [0, 3, 4], 'degrees': [360, 180, 360]},
...     index=['circle', 'triangle', 'rectangle']
... )
>>> df.rmod(1)
               angles  degrees
circle     4294967295        1
triangle            1        1
rectangle           1        1

Series

>>> a = cudf.Series([1, 1, 1, None], index=['a', 'b', 'c', 'd'])  
>>> b = cudf.Series([1, None, 1, None], index=['a', 'b', 'd', 'e'])  
>>> a.rmod(b)  
a       0
b    <NA>
c    <NA>
d    <NA>
e    <NA>
dtype: int64
>>> a.rmod(b, fill_value=0)  
a             0
b             0
c             0
d    4294967295
e          <NA>
dtype: int64

rmul(other, axis='columns', level=None, fill_value=None)#

Get Multiplication of DataFrame or Series and other, element-wise (binary operator rmul).

This docstring was copied from cudf.core.series.Series.rmul.

Some inconsistencies with the Dask version may exist.

Equivalent to other * frame, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters:
otherscalar, sequence, Series, or DataFrame

Any single or multiple element data structure, or list-like object.

axisint or string

Only 0 is supported for Series; 1 or ‘columns’ is supported for DataFrame.

levelint or name

Broadcast across a level, matching Index values on the passed MultiIndex level. Not yet supported.

fill_valuefloat or None, default None

Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing.

Returns:
DataFrame or Series

Result of the arithmetic operation.

Examples

DataFrame

>>> df = cudf.DataFrame(  
...     {'angles': [0, 3, 4], 'degrees': [360, 180, 360]},
...     index=['circle', 'triangle', 'rectangle']
... )
>>> df.rmul(1)
           angles  degrees
circle          0      360
triangle        3      180
rectangle       4      360

Series

>>> a = cudf.Series([1, 1, 1, None], index=['a', 'b', 'c', 'd'])  
>>> b = cudf.Series([1, None, 1, None], index=['a', 'b', 'd', 'e'])  
>>> a.rmul(b)  
a       1
b    <NA>
c    <NA>
d    <NA>
e    <NA>
dtype: int64
>>> a.rmul(b, fill_value=0)  
a       1
b       0
c       0
d       0
e    <NA>
dtype: int64

rolling(window, min_periods=None, center=False, win_type=None, axis=_NoDefault.no_default)#

Provides rolling transformations.

Parameters:
windowint, str, offset

Size of the moving window. This is the number of observations used for calculating the statistic. When not using a DatetimeIndex, the window size must not be so large as to span more than one adjacent partition. If using an offset or offset alias like ‘5D’, the data must have a DatetimeIndex.

Changed in version 0.15.0: Now accepts offsets and string offset aliases

min_periodsint, default None

Minimum number of observations in window required to have a value (otherwise result is NA).

centerboolean, default False

Set the labels at the center of the window.

win_typestring, default None

Provide a window type. The recognized window types are identical to pandas.

axisint, str, None, default 0

This parameter is deprecated with pandas>=2.1.

Returns:
a Rolling object on which to call a method to compute a statistic
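
As a minimal sketch of how window, min_periods and center interact, the snippet below uses a plain pandas Series, on the assumption that the Dask version follows the same semantics within the partition limits noted above. The variable names are illustrative.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5])

# Default: a result needs a full window of 3 observations.
full = s.rolling(window=3).sum()

# min_periods=1: a result is produced as soon as one observation is available.
partial = s.rolling(window=3, min_periods=1).sum()

# center=True: the label sits in the middle of the window instead of at its end.
centered = s.rolling(window=3, center=True).sum()

print(full.tolist())      # [nan, nan, 6.0, 9.0, 12.0]
print(partial.tolist())   # [1.0, 3.0, 6.0, 9.0, 12.0]
print(centered.tolist())  # [nan, 6.0, 9.0, 12.0, nan]
```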

round(decimals=0)#

Round a DataFrame to a variable number of decimal places.

This docstring was copied from pandas.core.frame.DataFrame.round.

Some inconsistencies with the Dask version may exist.

Parameters:
decimalsint, dict, Series

Number of decimal places to round each column to. If an int is given, round each column to the same number of places. Otherwise dict and Series round to variable numbers of places. Column names should be in the keys if decimals is a dict-like, or in the index if decimals is a Series. Any columns not included in decimals will be left as is. Elements of decimals which are not columns of the input will be ignored.

*args

Additional keywords have no effect but might be accepted for compatibility with numpy.

**kwargs

Additional keywords have no effect but might be accepted for compatibility with numpy.

Returns:
DataFrame

A DataFrame with the affected columns rounded to the specified number of decimal places.

See also

numpy.around

Round a numpy array to the given number of decimals.

Series.round

Round a Series to the given number of decimals.

Examples

>>> df = pd.DataFrame([(.21, .32), (.01, .67), (.66, .03), (.21, .18)],  
...                   columns=['dogs', 'cats'])
>>> df
   dogs  cats
0  0.21  0.32
1  0.01  0.67
2  0.66  0.03
3  0.21  0.18

By providing an integer each column is rounded to the same number of decimal places

>>> df.round(1)
   dogs  cats
0   0.2   0.3
1   0.0   0.7
2   0.7   0.0
3   0.2   0.2

With a dict, the number of places for specific columns can be specified with the column names as key and the number of decimal places as value

>>> df.round({'dogs': 1, 'cats': 0})
   dogs  cats
0   0.2   0.0
1   0.0   1.0
2   0.7   0.0
3   0.2   0.0

Using a Series, the number of places for specific columns can be specified with the column names as index and the number of decimal places as value

>>> decimals = pd.Series([0, 1], index=['cats', 'dogs'])  
>>> df.round(decimals)
   dogs  cats
0   0.2   0.0
1   0.0   1.0
2   0.7   0.0
3   0.2   0.0

rpow(other, axis='columns', level=None, fill_value=None)#

Get Exponential of DataFrame or Series and other, element-wise (binary operator rpow).

This docstring was copied from cudf.core.series.Series.rpow.

Some inconsistencies with the Dask version may exist.

Equivalent to other ** frame, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters:
otherscalar, sequence, Series, or DataFrame

Any single or multiple element data structure, or list-like object.

axisint or string

Only 0 is supported for Series; 1 or ‘columns’ is supported for DataFrame.

levelint or name

Broadcast across a level, matching Index values on the passed MultiIndex level. Not yet supported.

fill_valuefloat or None, default None

Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing.

Returns:
DataFrame or Series

Result of the arithmetic operation.

Examples

DataFrame

>>> df = cudf.DataFrame(  
...     {'angles': [0, 3, 4], 'degrees': [360, 180, 360]},
...     index=['circle', 'triangle', 'rectangle']
... )
>>> df.rpow(1)
           angles  degrees
circle          1        1
triangle        1        1
rectangle       1        1

Series

>>> a = cudf.Series([1, 1, 1, None], index=['a', 'b', 'c', 'd'])  
>>> b = cudf.Series([1, None, 1, None], index=['a', 'b', 'd', 'e'])  
>>> a.rpow(b)  
a       1
b    <NA>
c    <NA>
d    <NA>
e    <NA>
dtype: int64
>>> a.rpow(b, fill_value=0)  
a       1
b       0
c       0
d       1
e    <NA>
dtype: int64

rsub(other, axis='columns', level=None, fill_value=None)#

Get Subtraction of DataFrame or Series and other, element-wise (binary operator rsub).

This docstring was copied from cudf.core.series.Series.rsub.

Some inconsistencies with the Dask version may exist.

Equivalent to other - frame, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters:
otherscalar, sequence, Series, or DataFrame

Any single or multiple element data structure, or list-like object.

axisint or string

Only 0 is supported for Series; 1 or ‘columns’ is supported for DataFrame.

levelint or name

Broadcast across a level, matching Index values on the passed MultiIndex level. Not yet supported.

fill_valuefloat or None, default None

Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing.

Returns:
DataFrame or Series

Result of the arithmetic operation.

Examples

DataFrame

>>> df = cudf.DataFrame(  
...     {'angles': [0, 3, 4], 'degrees': [360, 180, 360]},
...     index=['circle', 'triangle', 'rectangle']
... )
>>> df.rsub(1)
           angles  degrees
circle          1     -359
triangle       -2     -179
rectangle      -3     -359

Series

>>> a = cudf.Series([1, 1, 1, None], index=['a', 'b', 'c', 'd'])  
>>> b = cudf.Series([1, None, 1, None], index=['a', 'b', 'd', 'e'])  
>>> a.rsub(b)  
a       0
b    <NA>
c    <NA>
d    <NA>
e    <NA>
dtype: int64
>>> a.rsub(b, fill_value=0)  
a       0
b      -1
c      -1
d       1
e    <NA>
dtype: int64
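
As a hedged illustration of how a reflected operator and fill_value interact, the sketch below repeats the Series example with plain pandas instead of cuDF (an assumption made so that it runs without a GPU). Note that pandas reports missing results as NaN in a float64 Series rather than as <NA>.

```python
import pandas as pd

a = pd.Series([1, 1, 1, None], index=["a", "b", "c", "d"])
b = pd.Series([1, None, 1, None], index=["a", "b", "d", "e"])

# a.rsub(b) computes b - a after aligning both Series on their index.
plain = a.rsub(b)

# With fill_value=0, a value missing on one side only is treated as 0;
# a label missing on both sides ('e') stays missing.
filled = a.rsub(b, fill_value=0)

print(plain.equals(b - a))  # True
print(filled)
```

The same pattern applies to the other reflected operators: the object the method is called on is the right-hand operand.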

rtruediv(other, axis='columns', level=None, fill_value=None)#

Get Floating division of DataFrame or Series and other, element-wise (binary operator rtruediv).

This docstring was copied from cudf.core.series.Series.rtruediv.

Some inconsistencies with the Dask version may exist.

Equivalent to other / frame, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters:
otherscalar, sequence, Series, or DataFrame

Any single or multiple element data structure, or list-like object.

axisint or string

Only 0 is supported for Series; 1 or ‘columns’ is supported for DataFrame.

levelint or name

Broadcast across a level, matching Index values on the passed MultiIndex level. Not yet supported.

fill_valuefloat or None, default None

Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing.

Returns:
DataFrame or Series

Result of the arithmetic operation.

Examples

DataFrame

>>> df = cudf.DataFrame(  
...     {'angles': [0, 3, 4], 'degrees': [360, 180, 360]},
...     index=['circle', 'triangle', 'rectangle']
... )
>>> df.rtruediv(1)
             angles   degrees
circle          inf  0.002778
triangle   0.333333  0.005556
rectangle  0.250000  0.002778

Series

>>> a = cudf.Series([1, 1, 1, None], index=['a', 'b', 'c', 'd'])  
>>> b = cudf.Series([1, None, 1, None], index=['a', 'b', 'd', 'e'])  
>>> a.rtruediv(b)  
a     1.0
b    <NA>
c    <NA>
d    <NA>
e    <NA>
dtype: float64
>>> a.rtruediv(b, fill_value=0)  
a     1.0
b     0.0
c     0.0
d     Inf
e    <NA>
dtype: float64

sample(n=None, frac=None, replace=False, random_state=None)#

Random sample of items

Parameters:
nint, optional

Number of items to return. Not supported by Dask; use frac instead.

fracfloat, optional

Approximate fraction of items to return. This sampling fraction is applied to all partitions equally. Note that this is an approximate fraction. You should not expect exactly len(df) * frac items to be returned, as the exact number of elements selected will depend on how your data is partitioned (but should be pretty close in practice).

replaceboolean, optional

Sample with or without replacement. Default = False.

random_stateint or np.random.RandomState

If an int, we create a new RandomState with this as the seed; otherwise we draw from the passed RandomState.
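
To illustrate why frac is only approximate, the sketch below applies the fraction to each partition separately, using plain pandas slices as hypothetical stand-ins for Dask partitions. The split into two partitions of four rows and the per-partition seeds are assumptions made for the example; Dask derives its own per-partition random state.

```python
import pandas as pd

df = pd.DataFrame({"x": range(8)})

# Stand-ins for two partitions (assumed split: 4 rows + 4 rows).
partitions = [df.iloc[:4], df.iloc[4:]]
frac = 0.4


def sample_partitions(parts, frac, seeds):
    # Apply the same fraction to every partition, one seed per partition.
    return pd.concat(
        [p.sample(frac=frac, random_state=seed) for p, seed in zip(parts, seeds)]
    )


sampled = sample_partitions(partitions, frac, seeds=[0, 1])
again = sample_partitions(partitions, frac, seeds=[0, 1])

# The row count depends on per-partition rounding, so it need not
# equal len(df) * frac exactly.
print(len(sampled), "rows sampled; len(df) * frac =", len(df) * frac)

# Reusing the same seeds reproduces the same sample.
print(sampled.equals(again))  # True
```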

select_dtypes(include=None, exclude=None)#

Return a subset of the DataFrame’s columns based on the column dtypes.

This docstring was copied from pandas.core.frame.DataFrame.select_dtypes.

Some inconsistencies with the Dask version may exist.

Parameters:
include, excludescalar or list-like

A selection of dtypes or strings to be included/excluded. At least one of these parameters must be supplied.

Returns:
DataFrame

The subset of the frame including the dtypes in include and excluding the dtypes in exclude.

Raises:
ValueError
  • If both of include and exclude are empty

  • If include and exclude have overlapping elements

  • If any kind of string dtype is passed in.

See also

DataFrame.dtypes

Return Series with the data type of each column.

Notes

  • To select all numeric types, use np.number or 'number'

  • To select strings you must use the object dtype, but note that this will return all object dtype columns

  • See the numpy dtype hierarchy

  • To select datetimes, use np.datetime64, 'datetime' or 'datetime64'

  • To select timedeltas, use np.timedelta64, 'timedelta' or 'timedelta64'

  • To select Pandas categorical dtypes, use 'category'

  • To select Pandas datetimetz dtypes, use 'datetimetz' (new in 0.20.0) or 'datetime64[ns, tz]'
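
The selectors listed in the notes can be tried out with a short sketch. It uses plain pandas and a hypothetical four-column frame that is not part of the original examples; the Dask version is assumed to select the same columns.

```python
import pandas as pd

df = pd.DataFrame({
    "a": [1, 2],
    "b": [True, False],
    "c": [1.0, 2.0],
    "d": pd.to_datetime(["2020-01-01", "2020-01-02"]),
})

# 'number' covers integer and float columns, but not booleans.
numeric = df.select_dtypes(include="number")

# 'datetime' selects datetime64 columns.
datetimes = df.select_dtypes(include="datetime")

# exclude removes the listed dtypes and keeps everything else.
rest = df.select_dtypes(exclude=["number", "datetime"])

print(numeric.columns.tolist())    # ['a', 'c']
print(datetimes.columns.tolist())  # ['d']
print(rest.columns.tolist())       # ['b']
```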

Examples

>>> df = pd.DataFrame({'a': [1, 2] * 3,  
...                    'b': [True, False] * 3,
...                    'c': [1.0, 2.0] * 3})
>>> df  
        a      b  c
0       1   True  1.0
1       2  False  2.0
2       1   True  1.0
3       2  False  2.0
4       1   True  1.0
5       2  False  2.0
>>> df.select_dtypes(include='bool')  
   b
0  True
1  False
2  True
3  False
4  True
5  False
>>> df.select_dtypes(include=['float64'])  
   c
0  1.0
1  2.0
2  1.0
3  2.0
4  1.0
5  2.0
>>> df.select_dtypes(exclude=['int64'])  
       b    c
0   True  1.0
1  False  2.0
2   True  1.0
3  False  2.0
4   True  1.0
5  False  2.0
sem(axis=None, skipna=True, ddof=1, split_every=False, numeric_only=None)#

Return unbiased standard error of the mean over requested axis.

This docstring was copied from pandas.core.frame.DataFrame.sem.

Some inconsistencies with the Dask version may exist.

Normalized by N-1 by default. This can be changed using the ddof argument.

Parameters:
axis{index (0), columns (1)}

For Series this parameter is unused and defaults to 0.

skipnabool, default True

Exclude NA/null values. If an entire row/column is NA, the result will be NA.

levelint or level name, default None (Not supported in Dask)

If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series.

Deprecated since version 1.3.0: The level keyword is deprecated. Use groupby instead.

ddofint, default 1

Delta Degrees of Freedom. The divisor used in calculations is N - ddof, where N represents the number of elements.

numeric_onlybool, default None

Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for Series.

Deprecated since version 1.5.0: Specifying numeric_only=None is deprecated. The default value will be False in a future version of pandas.

Returns:
Series or DataFrame (if level specified)
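
The role of ddof can be illustrated with a small pure-Python sketch of the standard error of the mean, computed as the standard deviation (with divisor N - ddof) divided by the square root of N. The helper standard_error is hypothetical and only mirrors the definition; it is not the Dask or cuDF implementation.

```python
import math


def standard_error(values, ddof=1):
    """Standard error of the mean: sqrt(variance with divisor N - ddof) / sqrt(N)."""
    data = [v for v in values if v is not None]  # mimics skipna=True
    n = len(data)
    mean = sum(data) / n
    variance = sum((v - mean) ** 2 for v in data) / (n - ddof)
    return math.sqrt(variance) / math.sqrt(n)


ages = [21, 25, 62, 43]
sem_default = standard_error(ages)             # divisor N - 1
sem_population = standard_error(ages, ddof=0)  # divisor N
print(round(sem_default, 6), round(sem_population, 6))
```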
set_index(other, sorted=False, divisions=None, shuffle_method=None, **kwargs)#

Set the DataFrame index (row labels) using an existing column.

If sort=False, this function operates exactly like pandas.set_index and sets the index on the DataFrame. If sort=True (default), this function also sorts the DataFrame by the new index. This can have a significant impact on performance, because joins, groupbys, lookups, etc. are all much faster on that column. However, this performance increase comes at a cost: sorting a parallel dataset requires expensive shuffles. Often we call set_index once, directly after data ingest and filtering, and then perform many cheap computations on the sorted dataset.

With sort=True, this function is much more expensive. Under normal operation this function does an initial pass over the index column to compute approximate quantiles to serve as future divisions. It then passes over the data a second time, splitting up each input partition into several pieces and sharing those pieces to all of the output partitions now in sorted order.
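
The two passes described above can be sketched in plain Python. This is only an illustration under stated assumptions, not Dask's implementation: compute_divisions and shuffle_by_divisions are hypothetical helpers, exact quantiles from the statistics module stand in for Dask's approximate ones, and partitions are modelled as lists of index values.

```python
from bisect import bisect_right
from statistics import quantiles


def compute_divisions(values, npartitions):
    """First pass: pick quantile cut points to serve as divisions."""
    cuts = quantiles(values, n=npartitions, method="inclusive")
    return [min(values)] + cuts + [max(values)]


def shuffle_by_divisions(input_partitions, divisions):
    """Second pass: route every value to its output partition, then sort each one."""
    inner = divisions[1:-1]  # boundaries between neighbouring output partitions
    outputs = [[] for _ in range(len(divisions) - 1)]
    for part in input_partitions:
        for value in part:
            outputs[bisect_right(inner, value)].append(value)
    return [sorted(out) for out in outputs]


input_partitions = [[7, 1, 9, 4], [3, 8, 2, 6], [5, 0, 11, 10]]
all_values = [v for part in input_partitions for v in part]

divisions = compute_divisions(all_values, npartitions=3)
output_partitions = shuffle_by_divisions(input_partitions, divisions)
print(output_partitions)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11]]
```

Every input partition contributes pieces to several output partitions, which is why the second pass is the expensive, shuffle-heavy step.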

In some cases we can alleviate those costs. For example, if your dataset is already sorted, we can avoid making many small pieces; if you know good values at which to split the new index column, we can avoid the initial pass over the data. If your new index is a datetime index and your data is already sorted by day, this entire operation can be done for free. You can control these options with the following parameters.

Parameters:
other: string or Dask Series

Column to use as index.

drop: boolean, default True

Delete column to be used as the new index.

sorted: bool, optional

Whether the index column is already sorted in increasing order. Defaults to False.

npartitions: int, None, or ‘auto’

The ideal number of output partitions. If None, use the same as the input. If ‘auto’ then decide by memory use. Only used when divisions is not given. If divisions is given, the number of output partitions will be len(divisions) - 1.

divisions: list, optional

The “dividing lines” used to split the new index into partitions. For divisions=[0, 10, 50, 100], there would be three output partitions, where the new index contained [0, 10), [10, 50), and [50, 100), respectively. See https://docs.dask.org/en/latest/dataframe-design.html#partitions. If not given (default), good divisions are calculated by immediately computing the data and looking at the distribution of its values. For large datasets, this can be expensive. Note that if sorted=True, specified divisions are assumed to match the existing partitions in the data; if this is untrue you should leave divisions empty and call repartition after set_index.

inplace: bool, optional

Modifying the DataFrame in place is not supported by Dask. Defaults to False.

sort: bool, optional

If True, sort the DataFrame by the new index. Otherwise set the index on the individual existing partitions. Defaults to True.

shuffle_method: {‘disk’, ‘tasks’, ‘p2p’}, optional

Either 'disk' for single-node operation or 'tasks' and 'p2p' for distributed operation. Will be inferred by your current scheduler.

compute: bool, default False

Whether or not to trigger an immediate computation. Defaults to False. Note that even if you set compute=False, an immediate computation will still be triggered if divisions is None.

partition_size: int, optional

Desired size of each partition in bytes. Only used when npartitions='auto'.

Examples

>>> import dask
>>> ddf = dask.datasets.timeseries(start="2021-01-01", end="2021-01-07", freq="1h").reset_index()
>>> ddf2 = ddf.set_index("x")
>>> ddf2 = ddf.set_index(ddf.x)
>>> ddf2 = ddf.set_index(ddf.timestamp, sorted=True)

A common case is when we have a datetime column that we know to be sorted and is cleanly divided by day. We can set this index for free by specifying both that the column is pre-sorted and the particular divisions along which it is separated:

>>> import pandas as pd
>>> divisions = pd.date_range(start="2021-01-01", end="2021-01-07", freq='1D')
>>> divisions
DatetimeIndex(['2021-01-01', '2021-01-02', '2021-01-03', '2021-01-04',
               '2021-01-05', '2021-01-06', '2021-01-07'],
              dtype='datetime64[ns]', freq='D')

Note that len(divisions) is equal to npartitions + 1. This is because divisions represents the upper and lower bounds of each partition. The first item is the lower bound of the first partition, the second item is the lower bound of the second partition and the upper bound of the first partition, and so on. The second-to-last item is the lower bound of the last partition, and the last (extra) item is the upper bound of the last partition.

>>> ddf2 = ddf.set_index("timestamp", sorted=True, divisions=divisions.tolist())
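
The relationship between divisions and partition bounds can be checked with a short standard-library sketch. It rebuilds the same seven daily divisions with datetime.date instead of pandas (an assumption made only to keep the snippet dependency-free) and pairs neighbouring entries into (lower, upper) bounds.

```python
from datetime import date, timedelta

start = date(2021, 1, 1)
# Seven daily divisions: 2021-01-01 through 2021-01-07.
divisions = [start + timedelta(days=i) for i in range(7)]

# Each partition is bounded by two neighbouring divisions.
bounds = list(zip(divisions[:-1], divisions[1:]))
npartitions = len(bounds)

print(npartitions)  # 6, one fewer than len(divisions)
for lower, upper in bounds:
    print(lower, "->", upper)
```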

If you’ll be running set_index on the same (or similar) datasets repeatedly, you could save time by letting Dask calculate good divisions once, then copy-pasting them to reuse. This is especially helpful running in a Jupyter notebook:

>>> ddf2 = ddf.set_index("name")  # slow, calculates data distribution
>>> ddf2.divisions  
["Alice", "Laura", "Ursula", "Zelda"]
>>> # ^ Now copy-paste this and edit the line above to:
>>> # ddf2 = ddf.set_index("name", divisions=["Alice", "Laura", "Ursula", "Zelda"])
property shape#

Return a tuple representing the dimensionality of the DataFrame.

The number of rows is a Delayed result. The number of columns is a concrete integer.

Examples

>>> df.shape  
(Delayed('int-07f06075-5ecc-4d77-817e-63c69a9188a8'), 2)
shift(periods=1, freq=None, axis=0)#

Shift index by desired number of periods with an optional time freq.

This docstring was copied from pandas.core.frame.DataFrame.shift.

Some inconsistencies with the Dask version may exist.

When freq is not passed, shift the index without realigning the data. If freq is passed (in this case, the index must be date or datetime, or it will raise a NotImplementedError), the index will be increased using the periods and the freq. freq can be inferred when specified as “infer” as long as either freq or inferred_freq attribute is set in the index.

Parameters:
periodsint

Number of periods to shift. Can be positive or negative.

freqDateOffset, tseries.offsets, timedelta, or str, optional

Offset to use from the tseries module or time rule (e.g. ‘EOM’). If freq is specified then the index values are shifted but the data is not realigned. That is, use freq if you would like to extend the index when shifting and preserve the original data. If freq is specified as “infer” then it will be inferred from the freq or inferred_freq attributes of the index. If neither of those attributes exist, a ValueError is thrown.

axis{0 or ‘index’, 1 or ‘columns’, None}, default None

Shift direction. For Series this parameter is unused and defaults to 0.

fill_valueobject, optional (Not supported in Dask)

The scalar value to use for newly introduced missing values. the default depends on the dtype of self. For numeric data, np.nan is used. For datetime, timedelta, or period data, etc. NaT is used. For extension dtypes, self.dtype.na_value is used.

Changed in version 1.1.0.

Returns:
DataFrame

Copy of input object, shifted.

See also

Index.shift

Shift values of Index.

DatetimeIndex.shift

Shift values of DatetimeIndex.

PeriodIndex.shift

Shift values of PeriodIndex.

tshift

Shift the time index, using the index’s frequency if available.

Examples

>>> df = pd.DataFrame({"Col1": [10, 20, 15, 30, 45],  
...                    "Col2": [13, 23, 18, 33, 48],
...                    "Col3": [17, 27, 22, 37, 52]},
...                   index=pd.date_range("2020-01-01", "2020-01-05"))
>>> df  
            Col1  Col2  Col3
2020-01-01    10    13    17
2020-01-02    20    23    27
2020-01-03    15    18    22
2020-01-04    30    33    37
2020-01-05    45    48    52
>>> df.shift(periods=3)  
            Col1  Col2  Col3
2020-01-01   NaN   NaN   NaN
2020-01-02   NaN   NaN   NaN
2020-01-03   NaN   NaN   NaN
2020-01-04  10.0  13.0  17.0
2020-01-05  20.0  23.0  27.0
>>> df.shift(periods=1, axis="columns")  
            Col1  Col2  Col3
2020-01-01   NaN    10    13
2020-01-02   NaN    20    23
2020-01-03   NaN    15    18
2020-01-04   NaN    30    33
2020-01-05   NaN    45    48
>>> df.shift(periods=3, fill_value=0)  
            Col1  Col2  Col3
2020-01-01     0     0     0
2020-01-02     0     0     0
2020-01-03     0     0     0
2020-01-04    10    13    17
2020-01-05    20    23    27
>>> df.shift(periods=3, freq="D")  
            Col1  Col2  Col3
2020-01-04    10    13    17
2020-01-05    20    23    27
2020-01-06    15    18    22
2020-01-07    30    33    37
2020-01-08    45    48    52
>>> df.shift(periods=3, freq="infer")  
            Col1  Col2  Col3
2020-01-04    10    13    17
2020-01-05    20    23    27
2020-01-06    15    18    22
2020-01-07    30    33    37
2020-01-08    45    48    52
shuffle(*args, shuffle_method=None, **kwargs)#

Wraps dask.dataframe DataFrame.shuffle method

property size#

Size of the Series or DataFrame as a Delayed object.

Examples

>>> series.size  
dd.Scalar<size-ag..., dtype=int64>
skew(axis=0, bias=True, nan_policy='propagate', out=None, numeric_only=_NoDefault.no_default)#

Return unbiased skew over requested axis.

This docstring was copied from pandas.core.frame.DataFrame.skew.

Some inconsistencies with the Dask version may exist.

Note

This implementation follows the dask.array.stats implementation of skewness and calculates skewness without taking into account a bias term for finite sample size, which corresponds to the default settings of the scipy.stats skewness calculation. However, Pandas corrects for this, so the values differ by a factor of (n * (n - 1)) ** 0.5 / (n - 2), where n is the number of samples.

Further, this method currently does not support filtering out NaN values, which is another difference from Pandas.

Normalized by N-1.

Parameters:
axis{index (0), columns (1)}

Axis for the function to be applied on. For Series this parameter is unused and defaults to 0.

skipnabool, default True (Not supported in Dask)

Exclude NA/null values when computing the result.

levelint or level name, default None (Not supported in Dask)

If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series.

Deprecated since version 1.3.0: The level keyword is deprecated. Use groupby instead.

numeric_onlybool, default None

Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for Series.

Deprecated since version 1.5.0: Specifying numeric_only=None is deprecated. The default value will be False in a future version of pandas.

**kwargs

Additional keyword arguments to be passed to the function.

Returns:
Series or DataFrame (if level specified)
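
The factor mentioned in the note above can be made concrete with a small sketch. It computes the biased (moment-based) skewness and then multiplies by (n * (n - 1)) ** 0.5 / (n - 2) to obtain the sample-size-adjusted value. The helpers biased_skew and bias_correction are hypothetical names, and the code illustrates the formulas only; it is not the Dask, SciPy, or Pandas implementation.

```python
import math


def biased_skew(values):
    """Moment-based skewness m3 / m2**1.5, with no finite-sample correction."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


def bias_correction(n):
    """Factor relating the biased and the adjusted skewness (requires n > 2)."""
    return math.sqrt(n * (n - 1)) / (n - 2)


values = [1.0, 2.0, 3.0, 10.0]
g1 = biased_skew(values)                          # what the note describes for Dask
adjusted = g1 * bias_correction(len(values))      # bias-adjusted counterpart
print(g1, adjusted)
```

For small n the two values differ noticeably (the factor is about 1.73 for n = 4), while for large n the factor approaches 1.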
sort_values(by, ignore_index=False, max_branch=None, divisions=None, set_divisions=False, ascending=True, na_position='last', sort_function=None, sort_function_kwargs=None, shuffle_method=None, **kwargs)#

Sort the dataset by one or more columns.

Sorting a parallel dataset requires expensive shuffles and is generally not recommended. See set_index for implementation details.

Parameters:
by: str or list[str]

Column(s) to sort by.

npartitions: int, None, or ‘auto’

The ideal number of output partitions. If None, use the same as the input. If ‘auto’ then decide by memory use.

ascending: bool, optional

Sort ascending vs. descending. Defaults to True.

na_position: {‘last’, ‘first’}, optional

Puts NaNs at the beginning if ‘first’ and at the end if ‘last’. Defaults to ‘last’.

sort_function: function, optional

Sorting function to use when sorting underlying partitions. If None, defaults to M.sort_values (the partition library’s implementation of sort_values).

sort_function_kwargs: dict, optional

Additional keyword arguments to pass to the partition sorting function. By default, by, ascending, and na_position are provided.

Examples

>>> df2 = df.sort_values('x')  
squeeze(axis=None)#

Squeeze 1 dimensional axis objects into scalars.

This docstring was copied from pandas.core.frame.DataFrame.squeeze.

Some inconsistencies with the Dask version may exist.

Series or DataFrames with a single element are squeezed to a scalar. DataFrames with a single column or a single row are squeezed to a Series. Otherwise the object is unchanged.

This method is most useful when you don’t know if your object is a Series or DataFrame, but you do know it has just a single column. In that case you can safely call squeeze to ensure you have a Series.

Parameters:
axis{0 or ‘index’, 1 or ‘columns’, None}, default None

A specific axis to squeeze. By default, all length-1 axes are squeezed. For Series this parameter is unused and defaults to None.

Returns:
DataFrame, Series, or scalar

The projection after squeezing axis or all the axes.

See also

Series.iloc

Integer-location based indexing for selecting scalars.

DataFrame.iloc

Integer-location based indexing for selecting Series.

Series.to_frame

Inverse of DataFrame.squeeze for a single-column DataFrame.

Examples

>>> primes = pd.Series([2, 3, 5, 7])  

Slicing might produce a Series with a single value:

>>> even_primes = primes[primes % 2 == 0]  
>>> even_primes  
0    2
dtype: int64
>>> even_primes.squeeze()  
2

Squeezing objects with more than one value in every axis does nothing:

>>> odd_primes = primes[primes % 2 == 1]  
>>> odd_primes  
1    3
2    5
3    7
dtype: int64
>>> odd_primes.squeeze()  
1    3
2    5
3    7
dtype: int64

Squeezing is even more effective when used with DataFrames.

>>> df = pd.DataFrame([[1, 2], [3, 4]], columns=['a', 'b'])  
>>> df  
   a  b
0  1  2
1  3  4

Slicing a single column will produce a DataFrame with the columns having only one value:

>>> df_a = df[['a']]  
>>> df_a  
   a
0  1
1  3

So the columns can be squeezed down, resulting in a Series:

>>> df_a.squeeze('columns')  
0    1
1    3
Name: a, dtype: int64

Slicing a single row from a single column will produce a single scalar DataFrame:

>>> df_0a = df.loc[df.index < 1, ['a']]  
>>> df_0a  
   a
0  1

Squeezing the rows produces a single scalar Series:

>>> df_0a.squeeze('rows')  
a    1
Name: 0, dtype: int64

Squeezing all axes will project directly into a scalar:

>>> df_0a.squeeze()  
1
std(axis=None, skipna=True, ddof=1, split_every=False, dtype=None, out=None, numeric_only=_NoDefault.no_default)#

Return sample standard deviation over requested axis.

This docstring was copied from pandas.core.frame.DataFrame.std.

Some inconsistencies with the Dask version may exist.

Normalized by N-1 by default. This can be changed using the ddof argument.

Parameters:
axis{index (0), columns (1)}

For Series this parameter is unused and defaults to 0.

skipnabool, default True

Exclude NA/null values. If an entire row/column is NA, the result will be NA.

levelint or level name, default None (Not supported in Dask)

If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series.

Deprecated since version 1.3.0: The level keyword is deprecated. Use groupby instead.

ddofint, default 1

Delta Degrees of Freedom. The divisor used in calculations is N - ddof, where N represents the number of elements.

numeric_onlybool, default None

Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for Series.

Deprecated since version 1.5.0: Specifying numeric_only=None is deprecated. The default value will be False in a future version of pandas.

Returns:
Series or DataFrame (if level specified)

Notes

To have the same behaviour as numpy.std, use ddof=0 (instead of the default ddof=1)

Examples

>>> df = pd.DataFrame({'person_id': [0, 1, 2, 3],  
...                   'age': [21, 25, 62, 43],
...                   'height': [1.61, 1.87, 1.49, 2.01]}
...                  ).set_index('person_id')
>>> df  
           age  height
person_id
0           21    1.61
1           25    1.87
2           62    1.49
3           43    2.01

The standard deviation of the columns can be found as follows:

>>> df.std()  
age       18.786076
height     0.237417

Alternatively, ddof=0 can be set to normalize by N instead of N-1:

>>> df.std(ddof=0)  
age       16.269219
height     0.205609
sub(other, axis='columns', level=None, fill_value=None)#

Get Subtraction of DataFrame or Series and other, element-wise (binary operator sub).

This docstring was copied from cudf.core.series.Series.sub.

Some inconsistencies with the Dask version may exist.

Equivalent to frame - other, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters:
otherscalar, sequence, Series, or DataFrame

Any single or multiple element data structure, or list-like object.

axisint or string

Only 0 is supported for series, 1 or columns supported for dataframe

levelint or name

Broadcast across a level, matching Index values on the passed MultiIndex level. Not yet supported.

fill_valuefloat or None, default None

Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing.

Returns:
DataFrame or Series

Result of the arithmetic operation.

Examples

DataFrame

>>> df = cudf.DataFrame(  
...     {'angles': [0, 3, 4], 'degrees': [360, 180, 360]},
...     index=['circle', 'triangle', 'rectangle']
... )
>>> df.sub(1)  
           angles  degrees
circle         -1      359
triangle        2      179
rectangle       3      359

Series

>>> a = cudf.Series([1, 1, 1, None], index=['a', 'b', 'c', 'd'])  
>>> b = cudf.Series([1, None, 1, None], index=['a', 'b', 'd', 'e'])  
>>> a.sub(b)  
a       0
b    <NA>
c    <NA>
d    <NA>
e    <NA>
dtype: int64
>>> a.sub(b, fill_value=0)  
a       0
b       1
c       1
d      -1
e    <NA>
dtype: int64
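
The fill_value rule described in the parameters can be traced with a pure-Python sketch that uses dictionaries in place of Series. The helper sub_with_fill is hypothetical and None stands in for <NA>; it follows the stated rule (fill a value that is missing on one side, keep the result missing when both sides are missing) and is not cuDF's implementation.

```python
def sub_with_fill(left, right, fill_value=None):
    """Label-aligned subtraction over two dicts, with None playing the role of <NA>."""
    result = {}
    for label in sorted(set(left) | set(right)):
        a = left.get(label)   # None if the label is absent or the value is missing
        b = right.get(label)
        if a is None and b is None:
            result[label] = None  # missing on both sides stays missing
            continue
        if fill_value is not None:
            a = fill_value if a is None else a
            b = fill_value if b is None else b
        result[label] = None if a is None or b is None else a - b
    return result


a = {"a": 1, "b": 1, "c": 1, "d": None}
b = {"a": 1, "b": None, "d": 1, "e": None}

without_fill = sub_with_fill(a, b)
with_fill = sub_with_fill(a, b, fill_value=0)
print(without_fill)
print(with_fill)
```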
sum(axis=None, skipna=True, split_every=False, dtype=None, out=None, min_count=None, numeric_only=None)#

Return the sum of the values over the requested axis.

This docstring was copied from pandas.core.frame.DataFrame.sum.

Some inconsistencies with the Dask version may exist.

This is equivalent to the method numpy.sum.

Parameters:
axis{index (0), columns (1)}

Axis for the function to be applied on. For Series this parameter is unused and defaults to 0.

skipnabool, default True

Exclude NA/null values when computing the result.

levelint or level name, default None (Not supported in Dask)

If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series.

Deprecated since version 1.3.0: The level keyword is deprecated. Use groupby instead.

numeric_onlybool, default None

Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for Series.

Deprecated since version 1.5.0: Specifying numeric_only=None is deprecated. The default value will be False in a future version of pandas.

min_countint, default 0

The required number of valid values to perform the operation. If fewer than min_count non-NA values are present the result will be NA.

**kwargs

Additional keyword arguments to be passed to the function.

Returns:
Series or DataFrame (if level specified)

See also

Series.sum

Return the sum.

Series.min

Return the minimum.

Series.max

Return the maximum.

Series.idxmin

Return the index of the minimum.

Series.idxmax

Return the index of the maximum.

DataFrame.sum

Return the sum over the requested axis.

DataFrame.min

Return the minimum over the requested axis.

DataFrame.max

Return the maximum over the requested axis.

DataFrame.idxmin

Return the index of the minimum over the requested axis.

DataFrame.idxmax

Return the index of the maximum over the requested axis.

Examples

>>> idx = pd.MultiIndex.from_arrays([  
...     ['warm', 'warm', 'cold', 'cold'],
...     ['dog', 'falcon', 'fish', 'spider']],
...     names=['blooded', 'animal'])
>>> s = pd.Series([4, 2, 0, 8], name='legs', index=idx)  
>>> s  
blooded  animal
warm     dog       4
         falcon    2
cold     fish      0
         spider    8
Name: legs, dtype: int64
>>> s.sum()  
14

By default, the sum of an empty or all-NA Series is 0.

>>> pd.Series([], dtype="float64").sum()  # min_count=0 is the default  
0.0

This can be controlled with the min_count parameter. For example, if you’d like the sum of an empty series to be NaN, pass min_count=1.

>>> pd.Series([], dtype="float64").sum(min_count=1)  
nan

Thanks to the skipna parameter, min_count handles all-NA and empty series identically.

>>> pd.Series([np.nan]).sum()  
0.0
>>> pd.Series([np.nan]).sum(min_count=1)  
nan
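
The interaction between skipna and min_count shown in these examples can be summarised in a few lines of plain Python. The function sum_with_min_count is a hypothetical sketch of the documented behaviour for a one-dimensional sequence of floats, not the Pandas or Dask implementation.

```python
import math


def sum_with_min_count(values, skipna=True, min_count=0):
    """Sum a sequence, returning NaN when too few valid values are present."""
    valid = [v for v in values if not (isinstance(v, float) and math.isnan(v))]
    if not skipna and len(valid) != len(values):
        return math.nan  # a NaN in the input propagates when skipna=False
    if len(valid) < min_count:
        return math.nan  # not enough non-NA values to report a sum
    return float(sum(valid))


print(sum_with_min_count([]))                         # 0.0
print(sum_with_min_count([], min_count=1))            # nan
print(sum_with_min_count([math.nan]))                 # 0.0
print(sum_with_min_count([math.nan], min_count=1))    # nan
```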
tail(n=5, compute=True)#

Last n rows of the dataset

Caveat: this only checks the last n rows of the last partition.
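
A consequence of this caveat is that fewer than n rows may come back when the last partition is short. The sketch below models partitions as plain lists; the tail function here is a hypothetical stand-in that only looks at the final partition, as the caveat describes.

```python
def tail(partitions, n=5):
    """Return up to n trailing rows, looking only at the final partition."""
    last = partitions[-1]
    return last[-n:] if n > 0 else []


# Twelve rows in total, but only two of them live in the last partition.
partitions = [[0, 1, 2, 3, 4, 5], [6, 7, 8, 9], [10, 11]]
rows = tail(partitions, n=5)
print(rows)  # [10, 11]
```

If the trailing rows must span partitions, repartitioning beforehand or selecting the rows by index is a safer choice.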

to_backend(backend: str | None = None, **kwargs)#

Move to a new DataFrame backend

Parameters:
backendstr, Optional

The name of the new backend to move to. The default is the current “dataframe.backend” configuration.

Returns:
DataFrame, Series or Index
to_bag(index=False, format='tuple')#

Create Dask Bag from a Dask DataFrame

Parameters:
indexbool, optional

If True, the elements are tuples of (index, value), otherwise they’re just the value. Default is False.

format{“tuple”, “dict”, “frame”}, optional

Whether to return a bag of tuples, dictionaries, or dataframe-like objects. Default is “tuple”. If “frame”, the original partitions of df will not be transformed in any way.

Examples

>>> bag = df.to_bag()  
to_csv(filename, **kwargs)#

Store Dask DataFrame to CSV files

One filename per partition will be created. You can specify the filenames in a variety of ways.

Use a globstring:

>>> df.to_csv('/path/to/data/export-*.csv')  

The * will be replaced by the increasing sequence 0, 1, 2, …

/path/to/data/export-0.csv
/path/to/data/export-1.csv

Use a globstring and a name_function= keyword argument. The name_function function should expect an integer and produce a string. Strings produced by name_function must preserve the order of their respective partition indices.

>>> from datetime import date, timedelta
>>> def name(i):
...     return str(date(2015, 1, 1) + i * timedelta(days=1))
>>> name(0)
'2015-01-01'
>>> name(15)
'2015-01-16'
>>> df.to_csv('/path/to/data/export-*.csv', name_function=name)  
/path/to/data/export-2015-01-01.csv
/path/to/data/export-2015-01-02.csv
...
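
The ordering requirement on name_function can be checked before writing anything. In the sketch below, expand_globstring and zero_padded are hypothetical helpers: the first substitutes each generated name for the asterisk and raises ValueError if the names do not sort in partition order. Raising is a choice made for this illustration; it is not a statement about how Dask itself reacts.

```python
def expand_globstring(pattern, npartitions, name_function=str):
    """Replace '*' with one name per partition, checking that order is preserved."""
    names = [name_function(i) for i in range(npartitions)]
    if names != sorted(names):
        raise ValueError("name_function must preserve partition order")
    return [pattern.replace("*", name) for name in names]


def zero_padded(i):
    """Two-digit names sort lexicographically in the same order as the integers."""
    return f"{i:02d}"


paths = expand_globstring("/path/to/data/export-*.csv", 12, zero_padded)
print(paths[0], paths[-1])

# Unpadded integers break lexicographic order once there are ten or more
# partitions, because "10" sorts before "2".
try:
    expand_globstring("/path/to/data/export-*.csv", 12, str)
    plain_ints_ok = True
except ValueError:
    plain_ints_ok = False
print(plain_ints_ok)  # False
```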

You can also provide an explicit list of paths:

>>> paths = ['/path/to/data/alice.csv', '/path/to/data/bob.csv', ...]  
>>> df.to_csv(paths) 

You can also provide a directory name:

>>> df.to_csv('/path/to/data') 

The files will be numbered 0, 1, 2, (and so on) suffixed with ‘.part’:

/path/to/data/0.part
/path/to/data/1.part
Parameters:
dfdask.DataFrame

Data to save

filenamestring or list

Absolute or relative filepath(s). Prefix with a protocol like s3:// to save to remote filesystems.

single_filebool, default False

Whether to save everything into a single CSV file. Under the single file mode, each partition is appended at the end of the specified CSV file.

encodingstring, default ‘utf-8’

A string representing the encoding to use in the output file.

modestr, default ‘w’

Python file mode. The default is ‘w’ (or ‘wt’), for writing a new file or overwriting an existing file in text mode. ‘a’ (or ‘at’) will append to an existing file in text mode or create a new file if it does not already exist. See open().

name_functioncallable, default None

Function accepting an integer (partition index) and producing a string to replace the asterisk in the given filename globstring. Should preserve the lexicographic order of partitions. Not supported when single_file is True.

compressionstring, optional

A string representing the compression to use in the output file, allowed values are ‘gzip’, ‘bz2’, ‘xz’, only used when the first argument is a filename.

computebool, default True

If True, immediately executes. If False, returns a set of delayed objects, which can be computed at a later time.

storage_optionsdict

Parameters passed on to the backend filesystem class.

header_first_partition_onlybool, default None

If set to True, only write the header row in the first output file. By default, headers are written to all partitions under the multiple file mode (single_file is False) and written only once under the single file mode (single_file is True). It must be True under the single file mode.

compute_kwargsdict, optional

Options to be passed in to the compute method

kwargsdict, optional

Additional parameters to pass to pandas.DataFrame.to_csv().

Returns:
The names of the files written, if they were computed right away.
If not, the delayed tasks associated with writing the files.
Raises:
ValueError

If header_first_partition_only is set to False or name_function is specified when single_file is True.

See also

fsspec.open_files
to_dask_array(lengths=None, meta=None)#

Convert a dask DataFrame to a dask array.

Parameters:
lengthsbool or Sequence of ints, optional

How to determine the chunk sizes for the output array. By default, the output array will have unknown chunk lengths along the first axis, which can cause some later operations to fail.

  • True : immediately compute the length of each partition

  • Sequence : a sequence of integers to use for the chunk sizes on the first axis. These values are not validated for correctness, beyond ensuring that the number of items matches the number of partitions.

metaobject, optional

An optional meta parameter can be passed for dask to override the default metadata on the underlying dask array.

Returns:
to_dask_dataframe(**kwargs)#

Create a dask.dataframe object from a dask_cudf object

to_delayed(optimize_graph=True)#

Convert into a list of dask.delayed objects, one per partition.

Parameters:
optimize_graphbool, optional

If True [default], the graph is optimized before converting into dask.delayed objects.

Examples

>>> partitions = df.to_delayed()  
to_hdf(path_or_buf, key, mode='a', append=False, **kwargs)#

Store Dask Dataframe to Hierarchical Data Format (HDF) files

This is a parallel version of the Pandas function of the same name. Please see the Pandas docstring for more detailed information about shared keyword arguments.

This function differs from the Pandas version by saving the many partitions of a Dask DataFrame in parallel, either to many files, or to many datasets within the same file. You may specify this parallelism with an asterisk * within the filename or datapath, and an optional name_function. The asterisk will be replaced with an increasing sequence of integers starting from 0 or with the result of calling name_function on each of those integers.

This function only supports the Pandas 'table' format, not the more specialized 'fixed' format.

Parameters:
pathstring, pathlib.Path

Path to a target filename. Supports strings, pathlib.Path, or any object implementing the __fspath__ protocol. May contain a * to denote many filenames.

keystring

Datapath within the files. May contain a * to denote many locations

name_functionfunction

A function to convert the * in the above options to a string. Should take in a number from 0 to the number of partitions and return a string. (see examples below)

computebool

Whether or not to execute immediately. If False then this returns a dask.Delayed value.

lockbool, Lock, optional

Lock to use to prevent concurrency issues. By default a threading.Lock, multiprocessing.Lock or SerializableLock will be used depending on your scheduler if a lock is required. See dask.utils.get_scheduler_lock for more information about lock selection.

schedulerstring

The scheduler to use, like “threads” or “processes”

**other:

See pandas.to_hdf for more information

Returns:
filenameslist

Returned if compute is True. List of file names that each partition is saved to.

delayeddask.Delayed

Returned if compute is False. Delayed object to execute to_hdf when computed.

See also

read_hdf
to_parquet

Examples

Save Data to a single file

>>> df.to_hdf('output.hdf', '/data')            

Save data to multiple datapaths within the same file:

>>> df.to_hdf('output.hdf', '/data-*')          

Save data to multiple files:

>>> df.to_hdf('output-*.hdf', '/data')          

Save data to multiple files, using the multiprocessing scheduler:

>>> df.to_hdf('output-*.hdf', '/data', scheduler='processes') 

Specify a custom naming scheme. This writes files as ‘2000-01-01.hdf’, ‘2000-01-02.hdf’, ‘2000-01-03.hdf’, etc.

>>> from datetime import date, timedelta
>>> base = date(year=2000, month=1, day=1)
>>> def name_function(i):
...     ''' Convert integer 0 to n to a string '''
...     return str(base + timedelta(days=i))
>>> df.to_hdf('*.hdf', '/data', name_function=name_function) 
to_html(max_rows=5)#

Render a DataFrame as an HTML table.

This docstring was copied from pandas.core.frame.DataFrame.to_html.

Some inconsistencies with the Dask version may exist.

Parameters:
bufstr, Path or StringIO-like, optional, default None (Not supported in Dask)

Buffer to write to. If None, the output is returned as a string.

columnssequence, optional, default None (Not supported in Dask)

The subset of columns to write. Writes all columns by default.

col_spacestr or int, list or dict of int or str, optional (Not supported in Dask)

The minimum width of each column in CSS length units. An int is assumed to be px units.

New in version 0.25.0: Ability to use str.

headerbool, optional (Not supported in Dask)

Whether to print column labels, default True.

indexbool, optional, default True (Not supported in Dask)

Whether to print index (row) labels.

na_repstr, optional, default ‘NaN’ (Not supported in Dask)

String representation of NaN to use.

formatterslist, tuple or dict of one-param. functions, optional (Not supported in Dask)

Formatter functions to apply to columns’ elements by position or name. The result of each function must be a unicode string. List/tuple must be of length equal to the number of columns.

float_formatone-parameter function, optional, default None (Not supported in Dask)

Formatter function to apply to columns’ elements if they are floats. This function must return a unicode string and will be applied only to the non-NaN elements, with NaN being handled by na_rep.

Changed in version 1.2.0.

sparsifybool, optional, default True (Not supported in Dask)

Set to False for a DataFrame with a hierarchical index to print every multiindex key at each row.

index_namesbool, optional, default True (Not supported in Dask)

Prints the names of the indexes.

justifystr, default None (Not supported in Dask)

How to justify the column labels. If None uses the option from the print configuration (controlled by set_option), ‘right’ out of the box. Valid values are

  • left

  • right

  • center

  • justify

  • justify-all

  • start

  • end

  • inherit

  • match-parent

  • initial

  • unset.

max_rowsint, optional

Maximum number of rows to display in the console.

max_colsint, optional (Not supported in Dask)

Maximum number of columns to display in the console.

show_dimensionsbool, default False (Not supported in Dask)

Display DataFrame dimensions (number of rows by number of columns).

decimalstr, default ‘.’ (Not supported in Dask)

Character recognized as decimal separator, e.g. ‘,’ in Europe.

bold_rowsbool, default True (Not supported in Dask)

Make the row labels bold in the output.

classesstr or list or tuple, default None (Not supported in Dask)

CSS class(es) to apply to the resulting html table.

escapebool, default True (Not supported in Dask)

Convert the characters <, >, and & to HTML-safe sequences.

notebook{True, False}, default False (Not supported in Dask)

Whether the generated HTML is for IPython Notebook.

borderint (Not supported in Dask)

A border=border attribute is included in the opening <table> tag. Default pd.options.display.html.border.

table_idstr, optional (Not supported in Dask)

A css id is included in the opening <table> tag if specified.

render_linksbool, default False (Not supported in Dask)

Convert URLs to HTML links.

encodingstr, default “utf-8” (Not supported in Dask)

Set character encoding.

New in version 1.0.

Returns:
str or None

If buf is None, returns the result as a string. Otherwise returns None.

See also

to_string

Convert DataFrame to a string.

to_json(filename, *args, **kwargs)#

See dd.to_json docstring for more information

to_orc(path, **kwargs)#

Calls dask_cudf.io.to_orc

to_parquet(path, *args, **kwargs)#

Calls dask.dataframe.io.to_parquet with CudfEngine backend

to_records(index=False, lengths=None)#

Create Dask Array from a Dask Dataframe

Warning: This creates a dask.array without precise shape information. Operations that depend on shape information, like slicing or reshaping, will not work.

See also

dask.dataframe._Frame.values
dask.dataframe.from_dask_array

Examples

>>> df.to_records()  
to_sql(name: str, uri: str, schema=None, if_exists: str = 'fail', index: bool = True, index_label=None, chunksize=None, dtype=None, method=None, compute=True, parallel=False, engine_kwargs=None)#

See dd.to_sql docstring for more information

to_string(max_rows=5)#

Render a DataFrame to a console-friendly tabular output.

Parameters:
bufstr, Path or StringIO-like, optional, default None (Not supported in Dask)

Buffer to write to. If None, the output is returned as a string.

columnssequence, optional, default None (Not supported in Dask)

The subset of columns to write. Writes all columns by default.

col_spaceint, list or dict of int, optional (Not supported in Dask)

The minimum width of each column. If a list of ints is given, each integer corresponds to one column. If a dict is given, the key references the column, while the value defines the space to use.

headerbool or sequence of str, optional (Not supported in Dask)

Write out the column names. If a list of strings is given, it is assumed to be aliases for the column names.

indexbool, optional, default True (Not supported in Dask)

Whether to print index (row) labels.

na_repstr, optional, default ‘NaN’ (Not supported in Dask)

String representation of NaN to use.

formatterslist, tuple or dict of one-param. functions, optional (Not supported in Dask)

Formatter functions to apply to columns’ elements by position or name. The result of each function must be a unicode string. List/tuple must be of length equal to the number of columns.

float_formatone-parameter function, optional, default None (Not supported in Dask)

Formatter function to apply to columns’ elements if they are floats. This function must return a unicode string and will be applied only to the non-NaN elements, with NaN being handled by na_rep.

This docstring was copied from pandas.core.frame.DataFrame.to_string.

Some inconsistencies with the Dask version may exist.

Changed in version 1.2.0.

sparsifybool, optional, default True (Not supported in Dask)

Set to False for a DataFrame with a hierarchical index to print every multiindex key at each row.

index_namesbool, optional, default True (Not supported in Dask)

Prints the names of the indexes.

justifystr, default None (Not supported in Dask)

How to justify the column labels. If None uses the option from the print configuration (controlled by set_option), ‘right’ out of the box. Valid values are

  • left

  • right

  • center

  • justify

  • justify-all

  • start

  • end

  • inherit

  • match-parent

  • initial

  • unset.

max_rowsint, optional

Maximum number of rows to display in the console.

max_colsint, optional (Not supported in Dask)

Maximum number of columns to display in the console.

show_dimensionsbool, default False (Not supported in Dask)

Display DataFrame dimensions (number of rows by number of columns).

decimalstr, default ‘.’ (Not supported in Dask)

Character recognized as decimal separator, e.g. ‘,’ in Europe.

line_widthint, optional (Not supported in Dask)

Width to wrap a line in characters.

min_rowsint, optional (Not supported in Dask)

The number of rows to display in the console in a truncated repr (when number of rows is above max_rows).

max_colwidthint, optional (Not supported in Dask)

Max width to truncate each column in characters. By default, no limit.

New in version 1.0.0.

encodingstr, default “utf-8” (Not supported in Dask)

Set character encoding.

New in version 1.0.

Returns:
str or None

If buf is None, returns the result as a string. Otherwise returns None.

See also

to_html

Convert DataFrame to HTML.

Examples

>>> d = {'col1': [1, 2, 3], 'col2': [4, 5, 6]}  
>>> df = pd.DataFrame(d)  
>>> print(df.to_string())  
   col1  col2
0     1     4
1     2     5
2     3     6
to_timestamp(freq=None, how='start', axis=0)#

Cast to DatetimeIndex of timestamps, at beginning of period.

This docstring was copied from pandas.core.frame.DataFrame.to_timestamp.

Some inconsistencies with the Dask version may exist.

Parameters:
freqstr, default frequency of PeriodIndex

Desired frequency.

how{‘s’, ‘e’, ‘start’, ‘end’}

Convention for converting period to timestamp; start of period vs. end.

axis{0 or ‘index’, 1 or ‘columns’}, default 0

The axis to convert (the index by default).

copybool, default True (Not supported in Dask)

If False then underlying input data is not copied.

Returns:
DataFrame with DatetimeIndex
truediv(other, axis='columns', level=None, fill_value=None)#

Get Floating division of DataFrame or Series and other, element-wise (binary operator truediv).

This docstring was copied from cudf.core.series.Series.truediv.

Some inconsistencies with the Dask version may exist.

Equivalent to frame / other, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters:
otherscalar, sequence, Series, or DataFrame

Any single or multiple element data structure, or list-like object.

axisint or string

Only 0 is supported for Series; 1 or ‘columns’ is supported for DataFrame.

levelint or name

Broadcast across a level, matching Index values on the passed MultiIndex level. Not yet supported.

fill_valuefloat or None, default None

Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing.

Returns:
DataFrame or Series

Result of the arithmetic operation.

Examples

DataFrame

>>> df = cudf.DataFrame(  
...     {'angles': [0, 3, 4], 'degrees': [360, 180, 360]},
...     index=['circle', 'triangle', 'rectangle']
... )
>>> df.truediv(1)  
        angles  degrees
circle        0.0    360.0
triangle      3.0    180.0
rectangle     4.0    360.0

Series

>>> a = cudf.Series([1, 1, 1, None], index=['a', 'b', 'c', 'd'])  
>>> b = cudf.Series([1, None, 1, None], index=['a', 'b', 'd', 'e'])  
>>> a.truediv(b)  
a     1.0
b    <NA>
c    <NA>
d    <NA>
e    <NA>
dtype: float64
>>> a.truediv(b, fill_value=0)  
a     1.0
b     Inf
c     Inf
d     0.0
e    <NA>
dtype: float64
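
The interaction between label alignment and fill_value in the Series example can be traced with a small pure-Python sketch. It is not how cuDF or Dask-cuDF implements the operation: `truediv_aligned` is a hypothetical helper, plain dicts stand in for Series, and None stands in for a missing value.

```python
import math


def truediv_aligned(a, b, fill_value=None):
    """Divide two label->value dicts element-wise over the union of labels."""
    result = {}
    for label in sorted(set(a) | set(b)):
        x, y = a.get(label), b.get(label)
        if fill_value is not None and (x is None) != (y is None):
            # Exactly one side is missing: substitute fill_value for it.
            x = fill_value if x is None else x
            y = fill_value if y is None else y
        if x is None or y is None:
            # Still missing (both sides absent, or no fill_value given).
            result[label] = None
        elif y == 0:
            # Floating-point convention: nonzero / 0 is +/-inf, 0 / 0 is NaN.
            result[label] = math.nan if x == 0 else math.copysign(math.inf, x)
        else:
            result[label] = x / y
    return result


a = {"a": 1, "b": 1, "c": 1, "d": None}
b = {"a": 1, "b": None, "d": 1, "e": None}
no_fill = truediv_aligned(a, b)
with_fill = truediv_aligned(a, b, fill_value=0)
print(no_fill)
print(with_fill)
```

Label ‘e’ stays missing even with fill_value=0 because it is missing on both sides, which matches the rule stated for fill_value.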
property values#

Return a dask.array of the values of this dataframe

Warning: This creates a dask.array without precise shape information. Operations that depend on shape information, like slicing or reshaping, will not work.

var(axis=None, skipna=True, ddof=1, split_every=False, dtype=None, out=None, naive=False)#

Return unbiased variance over requested axis.

This docstring was copied from pandas.core.frame.DataFrame.var.

Some inconsistencies with the Dask version may exist.

Normalized by N-1 by default. This can be changed using the ddof argument.

Parameters:
axis{index (0), columns (1)}

For Series this parameter is unused and defaults to 0.

skipnabool, default True

Exclude NA/null values. If an entire row/column is NA, the result will be NA.

levelint or level name, default None (Not supported in Dask)

If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series.

Deprecated since version 1.3.0: The level keyword is deprecated. Use groupby instead.

ddofint, default 1

Delta Degrees of Freedom. The divisor used in calculations is N - ddof, where N represents the number of elements.

numeric_onlybool, default None (Not supported in Dask)

Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for Series.

Deprecated since version 1.5.0: Specifying numeric_only=None is deprecated. The default value will be False in a future version of pandas.

Returns:
Series or DataFrame (if level specified)

Examples

>>> df = pd.DataFrame({'person_id': [0, 1, 2, 3],  
...                   'age': [21, 25, 62, 43],
...                   'height': [1.61, 1.87, 1.49, 2.01]}
...                  ).set_index('person_id')
>>> df  
           age  height
person_id
0           21    1.61
1           25    1.87
2           62    1.49
3           43    2.01
>>> df.var()  
age       352.916667
height      0.056367

Alternatively, ddof=0 can be set to normalize by N instead of N-1:

>>> df.var(ddof=0)  
age       264.687500
height      0.042275
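
The effect of ddof on the divisor can be checked by hand. The following is a minimal sketch using only the standard library; `variance` is a hypothetical helper written for illustration, not the routine Dask-cuDF calls.

```python
import statistics


def variance(values, ddof=1):
    """Sum of squared deviations from the mean, divided by N - ddof."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - ddof)


age = [21, 25, 62, 43]
sample_var = variance(age)              # divides by N - 1
population_var = variance(age, ddof=0)  # divides by N
print(round(sample_var, 6), population_var)

# Cross-check against the standard library's own implementations.
assert abs(sample_var - statistics.variance(age)) < 1e-9
assert abs(population_var - statistics.pvariance(age)) < 1e-9
```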
visualize(filename='mydask', format=None, optimize_graph=False, **kwargs)#

Render the computation of this object’s task graph using graphviz.

Requires graphviz to be installed.

Parameters:
filenamestr or None, optional

The name of the file to write to disk. If the provided filename doesn’t include an extension, ‘.png’ will be used by default. If filename is None, no file will be written, and we communicate with dot using only pipes.

format{‘png’, ‘pdf’, ‘dot’, ‘svg’, ‘jpeg’, ‘jpg’}, optional

Format in which to write output file. Default is ‘png’.

optimize_graphbool, optional

If True, the graph is optimized before rendering. Otherwise, the graph is displayed as is. Default is False.

color: {None, ‘order’}, optional

Options to color nodes. Provide the cmap= keyword to specify a colormap.

**kwargs

Additional keyword arguments to forward to to_graphviz.

Returns:
resultIPython.display.Image, IPython.display.SVG, or None

See dask.dot.dot_graph for more information.

See also

dask.visualize
dask.dot.dot_graph

Notes

For more information on optimization see here:

https://docs.dask.org/en/latest/optimize.html

Examples

>>> x.visualize(filename='dask.pdf')  
>>> x.visualize(filename='dask.pdf', color='order')  
where(cond, other=nan)#

Replace values where the condition is False.

This docstring was copied from pandas.core.frame.DataFrame.where.

Some inconsistencies with the Dask version may exist.

Parameters:
condbool Series/DataFrame, array-like, or callable

Where cond is True, keep the original value. Where False, replace with corresponding value from other. If cond is callable, it is computed on the Series/DataFrame and should return boolean Series/DataFrame or array. The callable must not change input Series/DataFrame (though pandas doesn’t check it).

otherscalar, Series/DataFrame, or callable

Entries where cond is False are replaced with corresponding value from other. If other is callable, it is computed on the Series/DataFrame and should return scalar or Series/DataFrame. The callable must not change input Series/DataFrame (though pandas doesn’t check it).

inplacebool, default False (Not supported in Dask)

Whether to perform the operation in place on the data.

axisint, default None (Not supported in Dask)

Alignment axis if needed. For Series this parameter is unused and defaults to 0.

levelint, default None (Not supported in Dask)

Alignment level if needed.

errorsstr, {‘raise’, ‘ignore’}, default ‘raise’ (Not supported in Dask)

Note that currently this parameter won’t affect the results and will always coerce to a suitable dtype.

  • ‘raise’ : allow exceptions to be raised.

  • ‘ignore’ : suppress exceptions. On error return original object.

Deprecated since version 1.5.0: This argument had no effect.

try_castbool, default None (Not supported in Dask)

Try to cast the result back to the input type (if possible).

Deprecated since version 1.3.0: Manually cast back if necessary.

Returns:
Same type as caller or None if inplace=True.

See also

DataFrame.mask()

Return an object of same shape as self.

Notes

The where method is an application of the if-then idiom. For each element in the calling DataFrame, if cond is True the element is used; otherwise the corresponding element from the DataFrame other is used. If the axis of other does not align with axis of cond Series/DataFrame, the misaligned index positions will be filled with False.

The signature for DataFrame.where() differs from numpy.where(). Roughly df1.where(m, df2) is equivalent to np.where(m, df1, df2).

For further details and examples see the where documentation in indexing.

The dtype of the object takes precedence. The fill value is cast to the object’s dtype, if this can be done losslessly.

Examples

>>> s = pd.Series(range(5))  
>>> s.where(s > 0)  
0    NaN
1    1.0
2    2.0
3    3.0
4    4.0
dtype: float64
>>> s.mask(s > 0)  
0    0.0
1    NaN
2    NaN
3    NaN
4    NaN
dtype: float64
>>> s = pd.Series(range(5))  
>>> t = pd.Series([True, False])  
>>> s.where(t, 99)  
0     0
1    99
2    99
3    99
4    99
dtype: int64
>>> s.mask(t, 99)  
0    99
1     1
2    99
3    99
4    99
dtype: int64
>>> s.where(s > 1, 10)  
0    10
1    10
2     2
3     3
4     4
dtype: int64
>>> s.mask(s > 1, 10)  
0     0
1     1
2    10
3    10
4    10
dtype: int64
>>> df = pd.DataFrame(np.arange(10).reshape(-1, 2), columns=['A', 'B'])  
>>> df  
   A  B
0  0  1
1  2  3
2  4  5
3  6  7
4  8  9
>>> m = df % 3 == 0  
>>> df.where(m, -df)  
   A  B
0  0 -1
1 -2  3
2 -4 -5
3  6 -7
4 -8  9
>>> df.where(m, -df) == np.where(m, df, -df)  
      A     B
0  True  True
1  True  True
2  True  True
3  True  True
4  True  True
>>> df.where(m, -df) == df.mask(~m, -df)  
      A     B
0  True  True
1  True  True
2  True  True
3  True  True
4  True  True
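
The if-then idiom described in the notes, and its relationship to mask, can be written out element by element. This is a pure-Python sketch restricted to a scalar `other`; the helper functions below are hypothetical stand-ins, not the pandas, cuDF, or Dask implementations.

```python
def where(values, cond, other):
    """Keep values[i] where cond[i] is True, otherwise use other."""
    return [v if c else other for v, c in zip(values, cond)]


def mask(values, cond, other):
    """Replace values[i] where cond[i] is True: where() with cond inverted."""
    return where(values, [not c for c in cond], other)


s = list(range(5))
cond = [v > 1 for v in s]
kept = where(s, cond, 10)
replaced = mask(s, cond, 10)
print(kept)
print(replaced)
```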
class dask_cudf.Series(dsk, name, meta, divisions)#

Bases: _Frame, Series

Attributes

attrs

Dictionary of global attributes of this dataset.

divisions

Tuple of npartitions + 1 values, in ascending order, marking the lower/upper bounds of each partition's index.

dtype

Return data type

index

Return dask Index instance

is_monotonic

Return boolean if values in the object are monotonically increasing.

is_monotonic_decreasing

Return boolean if values in the object are monotonically decreasing.

is_monotonic_increasing

Return boolean if values in the object are monotonically increasing.

known_divisions

Whether divisions are already known

loc

Purely label-location based indexer for selection by label.

nbytes

Number of bytes

ndim

Return dimensionality

npartitions

Return number of partitions

partitions

Slice dataframe by partitions

shape

Return a tuple representing the dimensionality of a Series.

size

Size of the Series or DataFrame as a Delayed object.

values

Return a dask.array of the values of this dataframe

axes

list

name

struct
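
To illustrate how divisions relate to partitions, the sketch below looks up which partition holds a given index label. It assumes that each partition covers the half-open range from its lower division to the next one, with the last partition also including the final division; `partition_of` is a hypothetical helper for illustration only.

```python
import bisect


def partition_of(divisions, label):
    """Return the position of the partition whose index range contains label."""
    npartitions = len(divisions) - 1
    if not divisions[0] <= label <= divisions[-1]:
        raise KeyError(label)
    # Each partition covers [divisions[i], divisions[i + 1]); the last one
    # also includes its upper bound, hence the clamp.
    return min(bisect.bisect_right(divisions, label) - 1, npartitions - 1)


divisions = (0, 10, 20, 30)  # 4 values describe 3 partitions
positions = [partition_of(divisions, x) for x in (0, 9, 10, 29, 30)]
print(positions)
```

A lookup like this is only possible when the divisions are known.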

Methods

abs()

Return a Series/DataFrame with absolute numeric value of each element.

add(other[, level, fill_value, axis])

Get Addition of DataFrame or Series and other, element-wise (binary operator add).

add_prefix(prefix)

Prefix labels with string prefix.

add_suffix(suffix)

Suffix labels with string suffix.

align(other[, join, axis, fill_value])

Align two objects on their axes with the specified join method.

all([axis, skipna, split_every, out])

Return whether all elements are True, potentially over an axis.

any([axis, skipna, split_every, out])

Return whether any element is True, potentially over an axis.

append(other[, interleave_partitions])

Concatenate two or more Series.

apply(func[, convert_dtype, meta, args])

Parallel version of pandas.Series.apply

astype(dtype)

Cast a pandas object to the specified dtype.

autocorr([lag, split_every])

Compute the lag-N autocorrelation.

between(left, right[, inclusive])

Return boolean Series equivalent to left <= series <= right.

bfill([axis, limit])

Synonym for DataFrame.fillna() with method='bfill'.

cat

alias of CategoricalAccessor

clear_divisions()

Forget division information

clip([lower, upper, axis])

Trim values at input threshold(s).

combine(other, func[, fill_value])

Combine the Series with a Series or scalar according to func.

combine_first(other)

Update null elements with value in the same location in 'other'.

compute(**kwargs)

Compute this dask collection

compute_current_divisions([col])

Compute the current divisions of the DataFrame.

copy([deep])

Make a copy of the dataframe

corr(other[, method, min_periods, split_every])

Compute correlation with other Series, excluding missing values.

count([split_every])

Count non-NA cells for each column or row.

cov(other[, min_periods, split_every])

Compute covariance with Series, excluding missing values.

cummax([axis, skipna, out])

Return cumulative maximum over a DataFrame or Series axis.

cummin([axis, skipna, out])

Return cumulative minimum over a DataFrame or Series axis.

cumprod([axis, skipna, dtype, out])

Return cumulative product over a DataFrame or Series axis.

cumsum([axis, skipna, dtype, out])

Return cumulative sum over a DataFrame or Series axis.

describe([split_every, percentiles, ...])

Generate descriptive statistics.

diff([periods, axis])

First discrete difference of element.

div(other[, level, fill_value, axis])

Return Floating division of series and other, element-wise (binary operator truediv).

divide(other[, level, fill_value, axis])

Return Floating division of series and other, element-wise (binary operator truediv).

dot(other[, meta])

Compute the dot product between the Series and the columns of other.

drop_duplicates([subset, split_every, ...])

Return DataFrame with duplicate rows removed.

dropna()

Return a new Series with missing values removed.

dt

alias of DatetimeAccessor

enforce_runtime_divisions()

Enforce the current divisions at runtime

eq(other[, level, fill_value, axis])

Get Equal to of DataFrame or Series and other, element-wise (binary operator eq).

explode()

Transform each element of a list-like to a row.

ffill([axis, limit])

Synonym for DataFrame.fillna() with method='ffill'.

fillna([value, method, limit, axis])

Fill NA/NaN values using the specified method.

first(offset)

Select initial periods of time series data based on a date offset.

floordiv(other[, level, fill_value, axis])

Get Integer division of DataFrame or Series and other, element-wise (binary operator floordiv).

ge(other[, level, fill_value, axis])

Get Greater than or equal to of DataFrame or Series and other, element-wise (binary operator ge).

get_partition(n)

Get a dask DataFrame/Series representing the nth partition.

groupby(*args, **kwargs)

Group Series using a mapper or by a Series of columns.

gt(other[, level, fill_value, axis])

Get Greater than of DataFrame or Series and other, element-wise (binary operator gt).

head([n, npartitions, compute])

First n rows of the dataset

idxmax([axis, skipna, split_every, numeric_only])

Return index of first occurrence of maximum over requested axis.

idxmin([axis, skipna, split_every, numeric_only])

Return index of first occurrence of minimum over requested axis.

isin(values)

Whether elements in Series are contained in values.

isna()

Detect missing values.

isnull()

DataFrame.isnull is an alias for DataFrame.isna.

iteritems()

Lazily iterate over (index, value) tuples.

kurtosis([axis, fisher, bias, nan_policy, ...])

Return unbiased kurtosis over requested axis.

last(offset)

Select final periods of time series data based on a date offset.

le(other[, level, fill_value, axis])

Get Less than or equal to of DataFrame or Series and other, element-wise (binary operator le).

lt(other[, level, fill_value, axis])

Get Less than of DataFrame or Series and other, element-wise (binary operator lt).

map(arg[, na_action, meta])

Map values of Series according to an input mapping or function.

map_overlap(func, before, after, *args, **kwargs)

Apply a function to each partition, sharing rows with adjacent partitions.

map_partitions(func, *args, **kwargs)

Apply Python function on each DataFrame partition.

mask(cond[, other])

Replace values where the condition is True.

max([axis, skipna, split_every, out, ...])

Return the maximum of the values over the requested axis.

mean([split_every])

Return the mean of the values over the requested axis.

median([method])

Return the median of the values over the requested axis.

median_approximate([method])

Return the approximate median of the values over the requested axis.

memory_usage([index, deep])

Return the memory usage of the Series.

memory_usage_per_partition([index, deep])

Return the memory usage of each partition

min([axis, skipna, split_every, out, ...])

Return the minimum of the values over the requested axis.

mod(other[, level, fill_value, axis])

Get Modulo of DataFrame or Series and other, element-wise (binary operator mod).

mode([dropna, split_every])

Return the mode(s) of the Series.

mul(other[, level, fill_value, axis])

Get Multiplication of DataFrame or Series and other, element-wise (binary operator mul).

ne(other[, level, fill_value, axis])

Get Not equal to of DataFrame or Series and other, element-wise (binary operator ne).

nlargest([n, split_every])

Return the largest n elements.

notnull()

DataFrame.notnull is an alias for DataFrame.notna.

nsmallest([n, split_every])

Return the smallest n elements.

nunique([split_every, dropna])

Return number of unique elements in the object.

nunique_approx([split_every])

Approximate number of unique rows.

persist(**kwargs)

Persist this dask collection into memory

pipe(func, *args, **kwargs)

Apply chainable functions that expect Series or DataFrames.

pow(other[, level, fill_value, axis])

Get Exponential of DataFrame or Series and other, element-wise (binary operator pow).

prod([axis, skipna, split_every, dtype, ...])

Return the product of the values over the requested axis.

product([axis, skipna, split_every, dtype, ...])

Return the product of the values over the requested axis.

quantile([q, method])

Approximate quantiles of Series

radd(other[, level, fill_value, axis])

Get Addition of DataFrame or Series and other, element-wise (binary operator radd).

random_split(frac[, random_state, shuffle])

Pseudorandomly split dataframe into different pieces row-wise

rdiv(other[, level, fill_value, axis])

Return Floating division of series and other, element-wise (binary operator rtruediv).

reduction(chunk[, aggregate, combine, meta, ...])

Generic row-wise reductions.

rename([index, inplace, sorted_index])

Alter Series index labels or name

repartition([divisions, npartitions, ...])

Repartition dataframe along new divisions

replace([to_replace, value, regex])

Replace values given in to_replace with value.

resample(rule[, closed, label])

Resample time-series data.

reset_index([drop])

Reset the index to the default index.

rfloordiv(other[, level, fill_value, axis])

Get Integer division of DataFrame or Series and other, element-wise (binary operator rfloordiv).

rmod(other[, level, fill_value, axis])

Get Modulo of DataFrame or Series and other, element-wise (binary operator rmod).

rmul(other[, level, fill_value, axis])

Get Multiplication of DataFrame or Series and other, element-wise (binary operator rmul).

rolling(window[, min_periods, center, ...])

Provides rolling transformations.

round([decimals])

Round each value in a Series to the given number of decimals.

rpow(other[, level, fill_value, axis])

Get Exponential of DataFrame or Series and other, element-wise (binary operator rpow).

rsub(other[, level, fill_value, axis])

Get Subtraction of DataFrame or Series and other, element-wise (binary operator rsub).

rtruediv(other[, level, fill_value, axis])

Get Floating division of DataFrame or Series and other, element-wise (binary operator rtruediv).

sample([n, frac, replace, random_state])

Random sample of items

sem([axis, skipna, ddof, split_every, ...])

Return unbiased standard error of the mean over requested axis.

shift([periods, freq, axis])

Shift index by desired number of periods with an optional time freq.

shuffle(on[, npartitions, max_branch, ...])

Rearrange DataFrame into new partitions

skew([axis, bias, nan_policy, out, numeric_only])

Return unbiased skew over requested axis.

squeeze()

Squeeze 1 dimensional axis objects into scalars.

std([axis, skipna, ddof, split_every, ...])

Return sample standard deviation over requested axis.

str

alias of StringAccessor

sub(other[, level, fill_value, axis])

Get Subtraction of DataFrame or Series and other, element-wise (binary operator sub).

sum([axis, skipna, split_every, dtype, out, ...])

Return the sum of the values over the requested axis.

tail([n, compute])

Last n rows of the dataset

to_backend([backend])

Move to a new DataFrame backend

to_bag([index, format])

Create a Dask Bag from a Series

to_csv(filename, **kwargs)

Store Dask DataFrame to CSV files

to_dask_array([lengths, meta])

Convert a dask DataFrame to a dask array.

to_dask_dataframe(**kwargs)

Create a dask.dataframe object from a dask_cudf object

to_delayed([optimize_graph])

Convert into a list of dask.delayed objects, one per partition.

to_frame([name])

Convert Series to DataFrame.

to_hdf(path_or_buf, key[, mode, append])

Store Dask Dataframe to Hierarchical Data Format (HDF) files

to_json(filename, *args, **kwargs)

See dd.to_json docstring for more information

to_sql(name, uri[, schema, if_exists, ...])

See dd.to_sql docstring for more information

to_string([max_rows])

Render a string representation of the Series.

to_timestamp([freq, how, axis])

Cast to DatetimeIndex of Timestamps, at beginning of period.

truediv(other[, level, fill_value, axis])

Get Floating division of DataFrame or Series and other, element-wise (binary operator truediv).

unique([split_every, split_out])

Return Series of unique values in the object.

value_counts([sort, ascending, dropna, ...])

Return a Series containing counts of unique values.

var([axis, skipna, ddof, split_every, ...])

Return unbiased variance over requested axis.

view(dtype)

Create a new view of the Series.

visualize([filename, format, optimize_graph])

Render the computation of this object's task graph using graphviz.

where(cond[, other])

Replace values where the condition is False.

abs()#

Return a Series/DataFrame with absolute numeric value of each element.

This docstring was copied from pandas.core.frame.DataFrame.abs.

Some inconsistencies with the Dask version may exist.

This function only applies to elements that are all numeric.

Returns:
abs

Series/DataFrame containing the absolute value of each element.

See also

numpy.absolute

Calculate the absolute value element-wise.

Notes

For a complex input a + bj (for example, 1.2 + 1j), the absolute value is \(\sqrt{ a^2 + b^2 }\).

Examples

Absolute numeric values in a Series.

>>> s = pd.Series([-1.10, 2, -3.33, 4])  
>>> s.abs()  
0    1.10
1    2.00
2    3.33
3    4.00
dtype: float64

Absolute numeric values in a Series with complex numbers.

>>> s = pd.Series([1.2 + 1j])  
>>> s.abs()  
0    1.56205
dtype: float64

Absolute numeric values in a Series with a Timedelta element.

>>> s = pd.Series([pd.Timedelta('1 days')])  
>>> s.abs()  
0   1 days
dtype: timedelta64[ns]

Select rows with data closest to certain value using argsort (from StackOverflow).

>>> df = pd.DataFrame({  
...     'a': [4, 5, 6, 7],
...     'b': [10, 20, 30, 40],
...     'c': [100, 50, -30, -50]
... })
>>> df  
     a    b    c
0    4   10  100
1    5   20   50
2    6   30  -30
3    7   40  -50
>>> df.loc[(df.c - 43).abs().argsort()]  
     a    b    c
1    5   20   50
0    4   10  100
2    6   30  -30
3    7   40  -50
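
Two of the results above can be reproduced with the standard library: the modulus of the complex value, and the row order produced by sorting on the distance to a target. This is a plain-Python sketch of the arithmetic, with lists standing in for the column c.

```python
import math

# The modulus of a complex number a + bj is sqrt(a^2 + b^2).
z = 1.2 + 1j
modulus = math.hypot(z.real, z.imag)
assert math.isclose(modulus, abs(z))
print(round(modulus, 5))

# Order row positions by how close column c is to the target value.
c = [100, 50, -30, -50]
target = 43
order = sorted(range(len(c)), key=lambda i: abs(c[i] - target))
print(order)
```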
add(other, level=None, fill_value=None, axis=0)#

Get Addition of DataFrame or Series and other, element-wise (binary operator add).

This docstring was copied from cudf.core.series.Series.add.

Some inconsistencies with the Dask version may exist.

Equivalent to frame + other, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters:
otherscalar, sequence, Series, or DataFrame

Any single or multiple element data structure, or list-like object.

axisint or string

Only 0 is supported for Series; 1 or ‘columns’ is supported for DataFrame.

levelint or name

Broadcast across a level, matching Index values on the passed MultiIndex level. Not yet supported.

fill_valuefloat or None, default None

Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing.

Returns:
DataFrame or Series

Result of the arithmetic operation.

Examples

DataFrame

>>> df = cudf.DataFrame(  
...     {'angles': [0, 3, 4], 'degrees': [360, 180, 360]},
...     index=['circle', 'triangle', 'rectangle']
... )
>>> df.add(1)  
        angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361

Series

>>> a = cudf.Series([1, 1, 1, None], index=['a', 'b', 'c', 'd'])  
>>> b = cudf.Series([1, None, 1, None], index=['a', 'b', 'd', 'e'])  
>>> a.add(b)  
a       2
b    <NA>
c    <NA>
d    <NA>
e    <NA>
dtype: int64
>>> a.add(b, fill_value=0)  
a       2
b       1
c       1
d       1
e    <NA>
dtype: int64
add_prefix(prefix)#

Prefix labels with string prefix.

This docstring was copied from pandas.core.frame.DataFrame.add_prefix.

Some inconsistencies with the Dask version may exist.

For Series, the row labels are prefixed. For DataFrame, the column labels are prefixed.

Parameters:
prefixstr

The string to add before each label.

Returns:
Series or DataFrame

New Series or DataFrame with updated labels.

See also

Series.add_suffix

Suffix row labels with string suffix.

DataFrame.add_suffix

Suffix column labels with string suffix.

Examples

>>> s = pd.Series([1, 2, 3, 4])  
>>> s  
0    1
1    2
2    3
3    4
dtype: int64
>>> s.add_prefix('item_')  
item_0    1
item_1    2
item_2    3
item_3    4
dtype: int64
>>> df = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [3, 4, 5, 6]})  
>>> df  
   A  B
0  1  3
1  2  4
2  3  5
3  4  6
>>> df.add_prefix('col_')  
   col_A  col_B
0      1      3
1      2      4
2      3      5
3      4      6
add_suffix(suffix)#

Suffix labels with string suffix.

This docstring was copied from pandas.core.frame.DataFrame.add_suffix.

Some inconsistencies with the Dask version may exist.

For Series, the row labels are suffixed. For DataFrame, the column labels are suffixed.

Parameters:
suffixstr

The string to add after each label.

Returns:
Series or DataFrame

New Series or DataFrame with updated labels.

See also

Series.add_prefix

Prefix row labels with string prefix.

DataFrame.add_prefix

Prefix column labels with string prefix.

Examples

>>> s = pd.Series([1, 2, 3, 4])  
>>> s  
0    1
1    2
2    3
3    4
dtype: int64
>>> s.add_suffix('_item')  
0_item    1
1_item    2
2_item    3
3_item    4
dtype: int64
>>> df = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [3, 4, 5, 6]})  
>>> df  
   A  B
0  1  3
1  2  4
2  3  5
3  4  6
>>> df.add_suffix('_col')  
   A_col  B_col
0      1      3
1      2      4
2      3      5
3      4      6
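
As a rough illustration, both add_prefix and add_suffix behave like a label-wise rename. The sketch below uses plain pandas (not Dask-cuDF) and the variable names are illustrative only; it assumes the labels are strings.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2], "B": [3, 4]})

# Built-in helpers.
prefixed = df.add_prefix("col_")
suffixed = df.add_suffix("_col")

# Equivalent renames written out by hand.
prefixed_by_rename = df.rename(columns=lambda label: f"col_{label}")
suffixed_by_rename = df.rename(columns=lambda label: f"{label}_col")

print(list(prefixed.columns), list(suffixed.columns))
```

The values are untouched in both cases; only the column labels change.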
align(other, join='outer', axis=None, fill_value=None)#

Align two objects on their axes with the specified join method.

This docstring was copied from pandas.core.series.Series.align.

Some inconsistencies with the Dask version may exist.

Join method is specified for each axis Index.

Parameters:
otherDataFrame or Series
join{‘outer’, ‘inner’, ‘left’, ‘right’}, default ‘outer’
axisallowed axis of the other object, default None

Align on index (0), columns (1), or both (None).

levelint or level name, default None (Not supported in Dask)

Broadcast across a level, matching Index values on the passed MultiIndex level.

copybool, default True (Not supported in Dask)

Always returns new objects. If copy=False and no reindexing is required then original objects are returned.

fill_valuescalar, default np.NaN

Value to use for missing values. Defaults to NaN, but can be any “compatible” value.

method{‘backfill’, ‘bfill’, ‘pad’, ‘ffill’, None}, default None (Not supported in Dask)

Method to use for filling holes in reindexed Series:

  • pad / ffill: propagate last valid observation forward to next valid.

  • backfill / bfill: use NEXT valid observation to fill gap.

limitint, default None (Not supported in Dask)

If method is specified, this is the maximum number of consecutive NaN values to forward/backward fill. In other words, if there is a gap with more than this number of consecutive NaNs, it will only be partially filled. If method is not specified, this is the maximum number of entries along the entire axis where NaNs will be filled. Must be greater than 0 if not None.

fill_axis{0 or ‘index’}, default 0 (Not supported in Dask)

Filling axis, method and limit.

broadcast_axis{0 or ‘index’}, default None (Not supported in Dask)

Broadcast values along this axis, if aligning two objects of different dimensions.

Returns:
(left, right)(Series, type of other)

Aligned objects.

Examples

>>> df = pd.DataFrame(  
...     [[1, 2, 3, 4], [6, 7, 8, 9]], columns=["D", "B", "E", "A"], index=[1, 2]
... )
>>> other = pd.DataFrame(  
...     [[10, 20, 30, 40], [60, 70, 80, 90], [600, 700, 800, 900]],
...     columns=["A", "B", "C", "D"],
...     index=[2, 3, 4],
... )
>>> df  
   D  B  E  A
1  1  2  3  4
2  6  7  8  9
>>> other  
    A    B    C    D
2   10   20   30   40
3   60   70   80   90
4  600  700  800  900

Align on columns:

>>> left, right = df.align(other, join="outer", axis=1)  
>>> left  
   A  B   C  D  E
1  4  2 NaN  1  3
2  9  7 NaN  6  8
>>> right  
    A    B    C    D   E
2   10   20   30   40 NaN
3   60   70   80   90 NaN
4  600  700  800  900 NaN

We can also align on the index:

>>> left, right = df.align(other, join="outer", axis=0)  
>>> left  
    D    B    E    A
1  1.0  2.0  3.0  4.0
2  6.0  7.0  8.0  9.0
3  NaN  NaN  NaN  NaN
4  NaN  NaN  NaN  NaN
>>> right  
    A      B      C      D
1    NaN    NaN    NaN    NaN
2   10.0   20.0   30.0   40.0
3   60.0   70.0   80.0   90.0
4  600.0  700.0  800.0  900.0

Finally, the default axis=None will align on both index and columns:

>>> left, right = df.align(other, join="outer", axis=None)  
>>> left  
     A    B   C    D    E
1  4.0  2.0 NaN  1.0  3.0
2  9.0  7.0 NaN  6.0  8.0
3  NaN  NaN NaN  NaN  NaN
4  NaN  NaN NaN  NaN  NaN
>>> right  
       A      B      C      D   E
1    NaN    NaN    NaN    NaN NaN
2   10.0   20.0   30.0   40.0 NaN
3   60.0   70.0   80.0   90.0 NaN
4  600.0  700.0  800.0  900.0 NaN
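
The examples above all use join="outer". As a hedged sketch in plain pandas (not Dask-cuDF), join="inner" keeps only the labels the two objects share, so no missing values are introduced. The frames are the same as in the examples above.

```python
import pandas as pd

df = pd.DataFrame(
    [[1, 2, 3, 4], [6, 7, 8, 9]], columns=["D", "B", "E", "A"], index=[1, 2]
)
other = pd.DataFrame(
    [[10, 20, 30, 40], [60, 70, 80, 90], [600, 700, 800, 900]],
    columns=["A", "B", "C", "D"],
    index=[2, 3, 4],
)

# With axis left at its default (None), both index and columns are aligned.
left, right = df.align(other, join="inner")

shared_rows = list(left.index)       # row labels present in both frames
shared_cols = sorted(left.columns)   # column labels present in both frames
print(shared_rows, shared_cols)
```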
all(axis=None, skipna=True, split_every=False, out=None)#

Return whether all elements are True, potentially over an axis.

This docstring was copied from pandas.core.frame.DataFrame.all.

Some inconsistencies with the Dask version may exist.

Returns True unless there is at least one element within a series or along a Dataframe axis that is False or equivalent (e.g. zero or empty).

Parameters:
axis{0 or ‘index’, 1 or ‘columns’, None}, default 0

Indicate which axis or axes should be reduced. For Series this parameter is unused and defaults to 0.

  • 0 / ‘index’ : reduce the index, return a Series whose index is the original column labels.

  • 1 / ‘columns’ : reduce the columns, return a Series whose index is the original index.

  • None : reduce all axes, return a scalar.

bool_onlybool, default None (Not supported in Dask)

Include only boolean columns. If None, will attempt to use everything, then use only boolean data. Not implemented for Series.

skipnabool, default True

Exclude NA/null values. If the entire row/column is NA and skipna is True, then the result will be True, as for an empty row/column. If skipna is False, then NA are treated as True, because these are not equal to zero.

levelint or level name, default None (Not supported in Dask)

If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series.

Deprecated since version 1.3.0: The level keyword is deprecated. Use groupby instead.

**kwargsany, default None

Additional keywords have no effect but might be accepted for compatibility with NumPy.

Returns:
Series or DataFrame

If level is specified, then, DataFrame is returned; otherwise, Series is returned.

See also

Series.all

Return True if all elements are True.

DataFrame.any

Return True if one (or more) elements are True.

Examples

Series

>>> pd.Series([True, True]).all()  
True
>>> pd.Series([True, False]).all()  
False
>>> pd.Series([], dtype="float64").all()  
True
>>> pd.Series([np.nan]).all()  
True
>>> pd.Series([np.nan]).all(skipna=False)  
True

DataFrames

Create a dataframe from a dictionary.

>>> df = pd.DataFrame({'col1': [True, True], 'col2': [True, False]})  
>>> df  
   col1   col2
0  True   True
1  True  False

Default behaviour checks if values in each column all return True.

>>> df.all()  
col1     True
col2    False
dtype: bool

Specify axis='columns' to check if values in each row all return True.

>>> df.all(axis='columns')  
0     True
1    False
dtype: bool

Or axis=None for whether every value is True.

>>> df.all(axis=None)  
False
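
One way to read the definition of all is through its relationship to any: every element is True exactly when no element is False. The following is a minimal plain-pandas sketch of that identity, with illustrative variable names.

```python
import pandas as pd

mask = pd.Series([True, False, True])

all_true = bool(mask.all())
# "No element is False" expressed with any() on the negated mask.
no_false = not bool((~mask).any())

# An empty Series has no False element, so all() is True and any() is False.
empty = pd.Series([], dtype=bool)
empty_all = bool(empty.all())
empty_any = bool(empty.any())
print(all_true, no_false, empty_all, empty_any)
```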
any(axis=None, skipna=True, split_every=False, out=None)#

Return whether any element is True, potentially over an axis.

This docstring was copied from pandas.core.frame.DataFrame.any.

Some inconsistencies with the Dask version may exist.

Returns False unless there is at least one element within a series or along a Dataframe axis that is True or equivalent (e.g. non-zero or non-empty).

Parameters:
axis{0 or ‘index’, 1 or ‘columns’, None}, default 0

Indicate which axis or axes should be reduced. For Series this parameter is unused and defaults to 0.

  • 0 / ‘index’ : reduce the index, return a Series whose index is the original column labels.

  • 1 / ‘columns’ : reduce the columns, return a Series whose index is the original index.

  • None : reduce all axes, return a scalar.

bool_onlybool, default None (Not supported in Dask)

Include only boolean columns. If None, will attempt to use everything, then use only boolean data. Not implemented for Series.

skipnabool, default True

Exclude NA/null values. If the entire row/column is NA and skipna is True, then the result will be False, as for an empty row/column. If skipna is False, then NA are treated as True, because these are not equal to zero.

levelint or level name, default None (Not supported in Dask)

If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series.

Deprecated since version 1.3.0: The level keyword is deprecated. Use groupby instead.

**kwargsany, default None

Additional keywords have no effect but might be accepted for compatibility with NumPy.

Returns:
Series or DataFrame

If level is specified, then, DataFrame is returned; otherwise, Series is returned.

See also

numpy.any

Numpy version of this method.

Series.any

Return whether any element is True.

Series.all

Return whether all elements are True.

DataFrame.any

Return whether any element is True over requested axis.

DataFrame.all

Return whether all elements are True over requested axis.

Examples

Series

For Series input, the output is a scalar indicating whether any element is True.

>>> pd.Series([False, False]).any()  
False
>>> pd.Series([True, False]).any()  
True
>>> pd.Series([], dtype="float64").any()  
False
>>> pd.Series([np.nan]).any()  
False
>>> pd.Series([np.nan]).any(skipna=False)  
True

DataFrame

Whether each column contains at least one True element (the default).

>>> df = pd.DataFrame({"A": [1, 2], "B": [0, 2], "C": [0, 0]})  
>>> df  
   A  B  C
0  1  0  0
1  2  2  0
>>> df.any()  
A     True
B     True
C    False
dtype: bool

Aggregating over the columns.

>>> df = pd.DataFrame({"A": [True, False], "B": [1, 2]})  
>>> df  
       A  B
0   True  1
1  False  2
>>> df.any(axis='columns')  
0    True
1    True
dtype: bool
>>> df = pd.DataFrame({"A": [True, False], "B": [1, 0]})  
>>> df  
       A  B
0   True  1
1  False  0
>>> df.any(axis='columns')  
0     True
1    False
dtype: bool

Aggregating over the entire DataFrame with axis=None.

>>> df.any(axis=None)  
True

any for an empty DataFrame is an empty Series.

>>> pd.DataFrame([]).any()  
Series([], dtype: bool)
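
A common use of any is as a filter. The sketch below, written against plain pandas with made-up data, keeps only the rows and columns that contain at least one non-zero value.

```python
import pandas as pd

df = pd.DataFrame({"A": [0, 1, 0], "B": [0, 0, 2], "C": [0, 0, 0]})

# Rows with at least one non-zero entry.
nonzero_rows = df[df.any(axis="columns")]

# Columns with at least one non-zero entry.
nonzero_cols = df.loc[:, df.any()]

print(list(nonzero_rows.index), list(nonzero_cols.columns))
```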
append(other, interleave_partitions=False)#

Concatenate two or more Series.

This docstring was copied from pandas.core.series.Series.append.

Some inconsistencies with the Dask version may exist.

Deprecated since version 1.4.0: Use concat() instead. For further details see Deprecated DataFrame.append and Series.append

Parameters:
to_appendSeries or list/tuple of Series (Not supported in Dask)

Series to append with self.

ignore_indexbool, default False (Not supported in Dask)

If True, the resulting axis will be labeled 0, 1, …, n - 1.

verify_integritybool, default False (Not supported in Dask)

If True, raise Exception on creating index with duplicates.

Returns:
Series

Concatenated Series.

See also

concat

General function to concatenate DataFrame or Series objects.

Notes

Iteratively appending to a Series can be more computationally intensive than a single concatenate. A better solution is to append values to a list and then concatenate the list with the original Series all at once.

Examples

>>> s1 = pd.Series([1, 2, 3])  
>>> s2 = pd.Series([4, 5, 6])  
>>> s3 = pd.Series([4, 5, 6], index=[3, 4, 5])  
>>> s1.append(s2)  
0    1
1    2
2    3
0    4
1    5
2    6
dtype: int64
>>> s1.append(s3)  
0    1
1    2
2    3
3    4
4    5
5    6
dtype: int64

With ignore_index set to True:

>>> s1.append(s2, ignore_index=True)  
0    1
1    2
2    3
3    4
4    5
5    6
dtype: int64

With verify_integrity set to True:

>>> s1.append(s2, verify_integrity=True)  
Traceback (most recent call last):
...
ValueError: Indexes have overlapping values: [0, 1, 2]
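
Because append is deprecated in favour of concat, the examples above can be rewritten as follows. This is a plain-pandas sketch only; it does not exercise the Dask-cuDF code path.

```python
import pandas as pd

s1 = pd.Series([1, 2, 3])
s2 = pd.Series([4, 5, 6])

# Equivalent of s1.append(s2): the original index labels are kept.
combined = pd.concat([s1, s2])

# Equivalent of s1.append(s2, ignore_index=True).
reindexed = pd.concat([s1, s2], ignore_index=True)

# Equivalent of verify_integrity=True: overlapping labels raise ValueError.
try:
    pd.concat([s1, s2], verify_integrity=True)
    overlap_detected = False
except ValueError:
    overlap_detected = True

print(combined.index.tolist(), reindexed.index.tolist(), overlap_detected)
```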
apply(func, convert_dtype=_NoDefault.no_default, meta=_NoDefault.no_default, args=(), **kwds)#

Parallel version of pandas.Series.apply

Parameters:
funcfunction

Function to apply

convert_dtypeboolean, default True

Try to find better dtype for elementwise function results. If False, leave as dtype=object.

metapd.DataFrame, pd.Series, dict, iterable, tuple, optional

An empty pd.DataFrame or pd.Series that matches the dtypes and column names of the output. This metadata is necessary for many algorithms in dask dataframe to work. For ease of use, some alternative inputs are also available. Instead of a DataFrame, a dict of {name: dtype} or iterable of (name, dtype) can be provided (note that the order of the names should match the order of the columns). Instead of a series, a tuple of (name, dtype) can be used. If not provided, dask will try to infer the metadata. This may lead to unexpected results, so providing meta is recommended. For more information, see dask.dataframe.utils.make_meta.

argstuple

Positional arguments to pass to function in addition to the value.

**kwds

Additional keyword arguments will be passed as keywords to the function.

Returns:
appliedSeries or DataFrame if func returns a Series.

See also

dask.Series.map_partitions

Examples

>>> import dask.dataframe as dd
>>> s = pd.Series(range(5), name='x')
>>> ds = dd.from_pandas(s, npartitions=2)

Apply a function elementwise across the Series, passing in extra arguments in args and kwargs:

>>> def myadd(x, a, b=1):
...     return x + a + b
>>> res = ds.apply(myadd, args=(2,), b=1.5)  

By default, dask tries to infer the output metadata by running your provided function on some fake data. This works well in many cases, but can sometimes be expensive, or even fail. To avoid this, you can manually specify the output metadata with the meta keyword. It can be specified in many forms; for more information, see dask.dataframe.utils.make_meta.

Here we specify the output is a Series with name 'x', and dtype float64:

>>> res = ds.apply(myadd, args=(2,), b=1.5, meta=('x', 'f8'))

In the case where the metadata doesn’t change, you can also pass in the object itself directly:

>>> res = ds.apply(lambda x: x + 1, meta=ds)
astype(dtype)#

Cast a pandas object to the specified dtype.

This docstring was copied from pandas.core.frame.DataFrame.astype.

Some inconsistencies with the Dask version may exist.

Parameters:
dtypedata type, or dict of column name -> data type

Use a numpy.dtype or Python type to cast entire pandas object to the same type. Alternatively, use {col: dtype, …}, where col is a column label and dtype is a numpy.dtype or Python type to cast one or more of the DataFrame’s columns to column-specific types.

copybool, default True (Not supported in Dask)

Return a copy when copy=True (be very careful setting copy=False as changes to values then may propagate to other pandas objects).

errors{‘raise’, ‘ignore’}, default ‘raise’ (Not supported in Dask)

Control raising of exceptions on invalid data for provided dtype.

  • raise : allow exceptions to be raised

  • ignore : suppress exceptions. On error return original object.

Returns:
castedsame type as caller

See also

to_datetime

Convert argument to datetime.

to_timedelta

Convert argument to timedelta.

to_numeric

Convert argument to a numeric type.

numpy.ndarray.astype

Cast a numpy array to a specified type.

Notes

Deprecated since version 1.3.0: Using astype to convert from timezone-naive dtype to timezone-aware dtype is deprecated and will raise in a future version. Use Series.dt.tz_localize() instead.

Examples

Create a DataFrame:

>>> d = {'col1': [1, 2], 'col2': [3, 4]}  
>>> df = pd.DataFrame(data=d)  
>>> df.dtypes  
col1    int64
col2    int64
dtype: object

Cast all columns to int32:

>>> df.astype('int32').dtypes  
col1    int32
col2    int32
dtype: object

Cast col1 to int32 using a dictionary:

>>> df.astype({'col1': 'int32'}).dtypes  
col1    int32
col2    int64
dtype: object

Create a series:

>>> ser = pd.Series([1, 2], dtype='int32')  
>>> ser  
0    1
1    2
dtype: int32
>>> ser.astype('int64')  
0    1
1    2
dtype: int64

Convert to categorical type:

>>> ser.astype('category')  
0    1
1    2
dtype: category
Categories (2, int64): [1, 2]

Convert to ordered categorical type with custom ordering:

>>> from pandas.api.types import CategoricalDtype  
>>> cat_dtype = CategoricalDtype(  
...     categories=[2, 1], ordered=True)
>>> ser.astype(cat_dtype)  
0    1
1    2
dtype: category
Categories (2, int64): [2 < 1]

Note that using copy=False and changing data on a new pandas object may propagate changes:

>>> s1 = pd.Series([1, 2])  
>>> s2 = s1.astype('int64', copy=False)  
>>> s2[0] = 10  
>>> s1  # note that s1[0] has changed too  
0    10
1     2
dtype: int64

Create a series of dates:

>>> ser_date = pd.Series(pd.date_range('20200101', periods=3))  
>>> ser_date  
0   2020-01-01
1   2020-01-02
2   2020-01-03
dtype: datetime64[ns]
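
The Notes for astype recommend Series.dt.tz_localize() rather than astype for turning timezone-naive values into timezone-aware ones. A minimal plain-pandas sketch of that recommendation, reusing the series of dates from the last example:

```python
import pandas as pd

ser_date = pd.Series(pd.date_range("20200101", periods=3))

# Attach a timezone without shifting the wall-clock values.
aware = ser_date.dt.tz_localize("UTC")

# Passing None drops the timezone again and recovers the naive series.
naive_again = aware.dt.tz_localize(None)

print(aware.dtype)
```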
property attrs#

Dictionary of global attributes of this dataset.

This docstring was copied from pandas.core.frame.DataFrame.attrs.

Some inconsistencies with the Dask version may exist.

Warning

attrs is experimental and may change without warning.

See also

DataFrame.flags

Global flags applying to this object.

autocorr(lag=1, split_every=False)#

Compute the lag-N autocorrelation.

This docstring was copied from pandas.core.series.Series.autocorr.

Some inconsistencies with the Dask version may exist.

This method computes the Pearson correlation between the Series and its shifted self.

Parameters:
lagint, default 1

Number of lags to apply before performing autocorrelation.

Returns:
float

The Pearson correlation between self and self.shift(lag).

See also

Series.corr

Compute the correlation between two Series.

Series.shift

Shift index by desired number of periods.

DataFrame.corr

Compute pairwise correlation of columns.

DataFrame.corrwith

Compute pairwise correlation between rows or columns of two DataFrame objects.

Notes

If the Pearson correlation is not well defined, NaN is returned.

Examples

>>> s = pd.Series([0.25, 0.5, 0.2, -0.05])  
>>> s.autocorr()  
0.10355...
>>> s.autocorr(lag=2)  
-0.99999...

If the Pearson correlation is not well defined, then ‘NaN’ is returned.

>>> s = pd.Series([1, 0, 0, 0])  
>>> s.autocorr()  
nan
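
Since the return value is described as the Pearson correlation between self and self.shift(lag), the result can be cross-checked by hand. This is a plain-pandas sketch; the variable names are illustrative.

```python
import pandas as pd

s = pd.Series([0.25, 0.5, 0.2, -0.05])
lag = 1

direct = s.autocorr(lag=lag)

# Same quantity written out: correlate the series with its shifted self.
# corr() ignores the NaN that shift() introduces at the start.
manual = s.corr(s.shift(lag))

print(direct, manual)
```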
between(left, right, inclusive='both')#

Return boolean Series equivalent to left <= series <= right.

This docstring was copied from pandas.core.series.Series.between.

Some inconsistencies with the Dask version may exist.

This function returns a boolean vector containing True wherever the corresponding Series element is between the boundary values left and right. NA values are treated as False.

Parameters:
leftscalar or list-like

Left boundary.

rightscalar or list-like

Right boundary.

inclusive{“both”, “neither”, “left”, “right”}

Include boundaries. Whether to set each bound as closed or open.

Changed in version 1.3.0.

Returns:
Series

Series representing whether each element is between left and right (inclusive).

See also

Series.gt

Greater than of series and other.

Series.lt

Less than of series and other.

Notes

This function is equivalent to (left <= ser) & (ser <= right)

Examples

>>> s = pd.Series([2, 0, 4, 8, np.nan])  

Boundary values are included by default:

>>> s.between(1, 4)  
0     True
1    False
2     True
3    False
4    False
dtype: bool

With inclusive set to "neither" boundary values are excluded:

>>> s.between(1, 4, inclusive="neither")  
0     True
1    False
2    False
3    False
4    False
dtype: bool

left and right can be any scalar value:

>>> s = pd.Series(['Alice', 'Bob', 'Carol', 'Eve'])  
>>> s.between('Anna', 'Daniel')  
0    False
1     True
2     True
3    False
dtype: bool
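
The equivalence stated in the Notes can be checked directly. The plain-pandas sketch below also shows inclusive="left", which closes only the left boundary; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

s = pd.Series([2, 0, 4, 8, np.nan])

closed = s.between(1, 4)
# The documented equivalent; comparisons with NaN evaluate to False.
manual = (s >= 1) & (s <= 4)

# Left boundary included, right boundary excluded: 1 <= value < 4.
left_closed = s.between(1, 4, inclusive="left")

print(closed.tolist(), left_closed.tolist())
```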
bfill(axis=None, limit=None)#

Synonym for DataFrame.fillna() with method='bfill'.

This docstring was copied from pandas.core.frame.DataFrame.bfill.

Some inconsistencies with the Dask version may exist.

Returns:
Series/DataFrame or None

Object with missing values filled or None if inplace=True.
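
No example is given for bfill, so here is a small plain-pandas sketch (not run through Dask-cuDF) of how backward filling and the limit argument interact. The data are made up for illustration.

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, np.nan, 3.0, np.nan])

# Each missing value takes the next valid observation; the trailing NaN
# has nothing after it and therefore stays missing.
filled = s.bfill()

# limit=1 fills at most one consecutive NaN before each valid value.
limited = s.bfill(limit=1)

print(filled.tolist(), limited.tolist())
```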

cat#

alias of CategoricalAccessor

clear_divisions()#

Forget division information

clip(lower=None, upper=None, axis=None)#

Trim values at input threshold(s).

This docstring was copied from pandas.core.series.Series.clip.

Some inconsistencies with the Dask version may exist.

Assigns values outside boundary to boundary values. Thresholds can be singular values or array like, and in the latter case the clipping is performed element-wise in the specified axis.

Parameters:
lowerfloat or array-like, default None

Minimum threshold value. All values below this threshold will be set to it. A missing threshold (e.g NA) will not clip the value.

upperfloat or array-like, default None

Maximum threshold value. All values above this threshold will be set to it. A missing threshold (e.g NA) will not clip the value.

axis{0 or ‘index’, 1 or ‘columns’, None}, default None

Align object with lower and upper along the given axis. For Series this parameter is unused and defaults to None.

inplacebool, default False (Not supported in Dask)

Whether to perform the operation in place on the data.

*args, **kwargs

Additional keywords have no effect but might be accepted for compatibility with numpy.

Returns:
Series or DataFrame or None

Same type as calling object with the values outside the clip boundaries replaced or None if inplace=True.

See also

Series.clip

Trim values at input threshold in series.

DataFrame.clip

Trim values at input threshold in dataframe.

numpy.clip

Clip (limit) the values in an array.

Examples

>>> data = {'col_0': [9, -3, 0, -1, 5], 'col_1': [-2, -7, 6, 8, -5]}  
>>> df = pd.DataFrame(data)  
>>> df  
   col_0  col_1
0      9     -2
1     -3     -7
2      0      6
3     -1      8
4      5     -5

Clips per column using lower and upper thresholds:

>>> df.clip(-4, 6)  
   col_0  col_1
0      6     -2
1     -3     -4
2      0      6
3     -1      6
4      5     -4

Clips using specific lower and upper thresholds per column element:

>>> t = pd.Series([2, -4, -1, 6, 3])  
>>> t  
0    2
1   -4
2   -1
3    6
4    3
dtype: int64
>>> df.clip(t, t + 4, axis=0)  
   col_0  col_1
0      6      2
1     -3     -4
2      0      3
3      6      8
4      5      3

Clips using specific lower threshold per column element, with missing values:

>>> t = pd.Series([2, -4, np.NaN, 6, 3])  
>>> t  
0    2.0
1   -4.0
2    NaN
3    6.0
4    3.0
dtype: float64
>>> df.clip(t, axis=0)  
   col_0  col_1
0      9      2
1     -3     -4
2      0      6
3      6      8
4      5      3
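
Because lower and upper both default to None, either bound can be given on its own. A minimal plain-pandas sketch with illustrative data:

```python
import pandas as pd

s = pd.Series([-3, 0, 5])

# Only a floor: values below 0 are raised to 0, nothing is capped.
floor_only = s.clip(lower=0)

# Only a cap: values above 2 are lowered to 2, nothing is raised.
cap_only = s.clip(upper=2)

print(floor_only.tolist(), cap_only.tolist())
```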
combine(other, func, fill_value=None)#

Combine the Series with a Series or scalar according to func.

This docstring was copied from pandas.core.series.Series.combine.

Some inconsistencies with the Dask version may exist.

Combine the Series and other using func to perform elementwise selection for combined Series. fill_value is assumed when value is missing at some index from one of the two objects being combined.

Parameters:
otherSeries or scalar

The value(s) to be combined with the Series.

funcfunction

Function that takes two scalars as inputs and returns an element.

fill_valuescalar, optional

The value to assume when an index is missing from one Series or the other. The default specifies to use the appropriate NaN value for the underlying dtype of the Series.

Returns:
Series

The result of combining the Series with the other object.

See also

Series.combine_first

Combine Series values, choosing the calling Series’ values first.

Examples

Consider two datasets, s1 and s2, containing the highest clocked speeds of different birds.

>>> s1 = pd.Series({'falcon': 330.0, 'eagle': 160.0})  
>>> s1  
falcon    330.0
eagle     160.0
dtype: float64
>>> s2 = pd.Series({'falcon': 345.0, 'eagle': 200.0, 'duck': 30.0})  
>>> s2  
falcon    345.0
eagle     200.0
duck       30.0
dtype: float64

Now combine the two datasets to view the highest speed of each bird across both:

>>> s1.combine(s2, max)  
duck        NaN
eagle     200.0
falcon    345.0
dtype: float64

In the previous example, the resulting value for duck is missing, because the maximum of a NaN and a float is a NaN. In the next example, we set fill_value=0, so the maximum returned is always a value taken from one of the datasets.

>>> s1.combine(s2, max, fill_value=0)  
duck       30.0
eagle     200.0
falcon    345.0
dtype: float64
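
The NaN result for duck without fill_value comes from how Python's built-in max handles NaN. The pure-Python sketch below shows that the outcome depends on argument order; the assumption here, consistent with the output shown above, is that the value from the calling Series is passed to func first.

```python
import math

nan = float("nan")

# Every comparison against NaN is False, so max() keeps whichever
# argument it saw first.
nan_first = max(nan, 30.0)
nan_second = max(30.0, nan)

print(nan_first, nan_second)
```

Supplying fill_value replaces the missing entry before func is called, which avoids this order dependence.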
combine_first(other)#

Update null elements with value in the same location in ‘other’.

This docstring was copied from pandas.core.series.Series.combine_first.

Some inconsistencies with the Dask version may exist.

Combine two Series objects by filling null values in one Series with non-null values from the other Series. Result index will be the union of the two indexes.

Parameters:
otherSeries

The value(s) to be used for filling null values.

Returns:
Series

The result of combining the provided Series with the other object.

See also

Series.combine

Perform element-wise operation on two Series using a given function.

Examples

>>> s1 = pd.Series([1, np.nan])  
>>> s2 = pd.Series([3, 4, 5])  
>>> s1.combine_first(s2)  
0    1.0
1    4.0
2    5.0
dtype: float64

Null values still persist if the location of that null value does not exist in other:

>>> s1 = pd.Series({'falcon': np.nan, 'eagle': 160.0})  
>>> s2 = pd.Series({'eagle': 200.0, 'duck': 30.0})  
>>> s1.combine_first(s2)  
duck       30.0
eagle     160.0
falcon      NaN
dtype: float64
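
As a rough mental model, combine_first can be sketched as reindexing to the union of both indexes and then filling the gaps from other. This is an illustration in plain pandas, not a description of the actual implementation.

```python
import numpy as np
import pandas as pd

s1 = pd.Series([1, np.nan])
s2 = pd.Series([3, 4, 5])

result = s1.combine_first(s2)

# Hand-written approximation: union of the indexes, then fill from s2.
union = s1.index.union(s2.index)
manual = s1.reindex(union).fillna(s2.reindex(union))

print(result.tolist(), manual.tolist())
```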
compute(**kwargs)#

Compute this dask collection

This turns a lazy Dask collection into its in-memory equivalent. For example a Dask array turns into a NumPy array and a Dask dataframe turns into a Pandas dataframe. The entire dataset must fit into memory before calling this operation.

Parameters:
schedulerstring, optional

Which scheduler to use like “threads”, “synchronous” or “processes”. If not provided, the default is to check the global settings first, and then fall back to the collection defaults.

optimize_graphbool, optional

If True [default], the graph is optimized before computation. Otherwise the graph is run as is. This can be useful for debugging.

kwargs

Extra keywords to forward to the scheduler function.

See also

dask.compute
compute_current_divisions(col=None)#

Compute the current divisions of the DataFrame.

This method triggers immediate computation. If you find yourself running this command repeatedly for the same dataframe, we recommend storing the result so you don’t have to rerun it.

If the column or index values overlap between partitions, raises ValueError. To prevent this, make sure the data are sorted by the column or index.

Parameters:
colstring, optional

Calculate the divisions for a non-index column by passing in the name of the column. If col is not specified, the index will be used to calculate divisions. In this case, if the divisions are already known, they will be returned immediately without computing.

Examples

>>> import dask
>>> ddf = dask.datasets.timeseries(start="2021-01-01", end="2021-01-07", freq="1h").clear_divisions()
>>> divisions = ddf.compute_current_divisions()
>>> print(divisions)  
(Timestamp('2021-01-01 00:00:00'),
 Timestamp('2021-01-02 00:00:00'),
 Timestamp('2021-01-03 00:00:00'),
 Timestamp('2021-01-04 00:00:00'),
 Timestamp('2021-01-05 00:00:00'),
 Timestamp('2021-01-06 00:00:00'),
 Timestamp('2021-01-06 23:00:00'))
>>> ddf.divisions = divisions
>>> ddf.known_divisions
True
>>> ddf = ddf.reset_index().clear_divisions()
>>> divisions = ddf.compute_current_divisions("timestamp")
>>> print(divisions)  
(Timestamp('2021-01-01 00:00:00'),
 Timestamp('2021-01-02 00:00:00'),
 Timestamp('2021-01-03 00:00:00'),
 Timestamp('2021-01-04 00:00:00'),
 Timestamp('2021-01-05 00:00:00'),
 Timestamp('2021-01-06 00:00:00'),
 Timestamp('2021-01-06 23:00:00'))
>>> ddf = ddf.set_index("timestamp", divisions=divisions, sorted=True)
copy(deep=False)#

Make a copy of the dataframe

This is strictly a shallow copy of the underlying computational graph. It does not affect the underlying data.

Parameters:
deepboolean, default False

The deep value must be False. It is declared as a parameter only for compatibility with third-party libraries such as cuDF.

corr(other, method='pearson', min_periods=None, split_every=False)#

Compute correlation with other Series, excluding missing values.

This docstring was copied from pandas.core.series.Series.corr.

Some inconsistencies with the Dask version may exist.

The two Series objects are not required to be the same length and will be aligned internally before the correlation function is applied.

Parameters:
otherSeries

Series with which to compute the correlation.

method{‘pearson’, ‘kendall’, ‘spearman’} or callable

Method used to compute correlation:

  • pearson : Standard correlation coefficient

  • kendall : Kendall Tau correlation coefficient

  • spearman : Spearman rank correlation

  • callable: Callable with input two 1d ndarrays and returning a float.

Warning

Note that the returned matrix from corr will have 1 along the diagonals and will be symmetric regardless of the callable’s behavior.

min_periodsint, optional

Minimum number of observations needed to have a valid result.

Returns:
float

Correlation with other.

See also

DataFrame.corr

Compute pairwise correlation between columns.

DataFrame.corrwith

Compute pairwise correlation with another DataFrame or Series.

Notes

Pearson, Kendall and Spearman correlation are currently computed using pairwise complete observations.

Examples

>>> def histogram_intersection(a, b):  
...     v = np.minimum(a, b).sum().round(decimals=1)
...     return v
>>> s1 = pd.Series([.2, .0, .6, .2])  
>>> s2 = pd.Series([.3, .6, .0, .1])  
>>> s1.corr(s2, method=histogram_intersection)  
0.3
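
The example above uses a custom callable. For the default method='pearson', the result can be cross-checked against NumPy, as in this plain-pandas sketch that reuses the same two Series:

```python
import numpy as np
import pandas as pd

s1 = pd.Series([0.2, 0.0, 0.6, 0.2])
s2 = pd.Series([0.3, 0.6, 0.0, 0.1])

# Default method is 'pearson'.
pearson = s1.corr(s2)

# np.corrcoef returns the 2x2 correlation matrix; the off-diagonal
# entry is the Pearson coefficient between the two inputs.
reference = np.corrcoef(s1.to_numpy(), s2.to_numpy())[0, 1]

print(pearson, reference)
```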
count(split_every=False)#

Count non-NA cells for each column or row.

This docstring was copied from pandas.core.frame.DataFrame.count.

Some inconsistencies with the Dask version may exist.

The values None, NaN, NaT, and optionally numpy.inf (depending on pandas.options.mode.use_inf_as_na) are considered NA.

Parameters:
axis{0 or ‘index’, 1 or ‘columns’}, default 0

If 0 or ‘index’ counts are generated for each column. If 1 or ‘columns’ counts are generated for each row.

levelint or str, optional (Not supported in Dask)

If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a DataFrame. A str specifies the level name.

numeric_onlybool, default False

Include only float, int or boolean data.

Returns:
Series or DataFrame

For each column/row the number of non-NA/null entries. If level is specified returns a DataFrame.

See also

Series.count

Number of non-NA elements in a Series.

DataFrame.value_counts

Count unique combinations of columns.

DataFrame.shape

Number of DataFrame rows and columns (including NA elements).

DataFrame.isna

Boolean same-sized DataFrame showing places of NA elements.

Examples

Constructing DataFrame from a dictionary:

>>> df = pd.DataFrame({"Person":  
...                    ["John", "Myla", "Lewis", "John", "Myla"],
...                    "Age": [24., np.nan, 21., 33, 26],
...                    "Single": [False, True, True, True, False]})
>>> df  
   Person   Age  Single
0    John  24.0   False
1    Myla   NaN    True
2   Lewis  21.0    True
3    John  33.0    True
4    Myla  26.0   False

Notice the uncounted NA values:

>>> df.count()  
Person    5
Age       4
Single    5
dtype: int64

Counts for each row:

>>> df.count(axis='columns')  
0    3
1    2
2    3
3    3
4    3
dtype: int64
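
Counting non-NA cells is the same as summing a not-null mask, which can help when reasoning about the result. A minimal plain-pandas sketch with a trimmed-down version of the data above:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"Person": ["John", "Myla", "Lewis"], "Age": [24.0, np.nan, 21.0]}
)

per_column = df.count()

# True counts as 1 and False as 0, so summing the mask gives the
# number of non-NA entries in each column.
manual = df.notna().sum()

print(per_column.tolist(), manual.tolist())
```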
cov(other, min_periods=None, split_every=False)#

Compute covariance with Series, excluding missing values.

This docstring was copied from pandas.core.series.Series.cov.

Some inconsistencies with the Dask version may exist.

The two Series objects are not required to be the same length and will be aligned internally before the covariance is calculated.

Parameters:
otherSeries

Series with which to compute the covariance.

min_periodsint, optional

Minimum number of observations needed to have a valid result.

ddofint, default 1 (Not supported in Dask)

Delta degrees of freedom. The divisor used in calculations is N - ddof, where N represents the number of elements.

New in version 1.1.0.

Returns:
float

Covariance between Series and other normalized by N-1 (unbiased estimator).

See also

DataFrame.cov

Compute pairwise covariance of columns.

Examples

>>> s1 = pd.Series([0.90010907, 0.13484424, 0.62036035])  
>>> s2 = pd.Series([0.12528585, 0.26962463, 0.51111198])  
>>> s1.cov(s2)  
-0.01685762652715874
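
The alignment and the N - 1 normalization described above can be checked by hand. The following is a minimal sketch in plain pandas on made-up data; it illustrates the documented semantics only and is not the Dask-cuDF implementation.

```python
import pandas as pd

# Two Series with partially overlapping indexes (illustrative data).
s1 = pd.Series([1.0, 2.0, 3.0, 4.0], index=["a", "b", "c", "d"])
s2 = pd.Series([2.0, 1.0, 5.0, 7.0], index=["b", "c", "d", "e"])

# Keep only the labels present in both Series ("b", "c", "d").
left, right = s1.align(s2, join="inner")
n = len(left)

# Sample covariance: sum of products of deviations, divided by N - 1.
manual_cov = ((left - left.mean()) * (right - right.mean())).sum() / (n - 1)

library_cov = s1.cov(s2)

# Only 3 aligned pairs exist, so asking for at least 4 yields NaN.
too_few = s1.cov(s2, min_periods=4)
```

Here manual_cov and library_cov agree, while too_few is NaN because fewer than min_periods observations remain after alignment.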
cummax(axis=None, skipna=True, out=None)#

Return cumulative maximum over a DataFrame or Series axis.

This docstring was copied from pandas.core.frame.DataFrame.cummax.

Some inconsistencies with the Dask version may exist.

Returns a DataFrame or Series of the same size containing the cumulative maximum.

Parameters:
axis{0 or ‘index’, 1 or ‘columns’}, default 0

The index or the name of the axis. 0 is equivalent to None or ‘index’. For Series this parameter is unused and defaults to 0.

skipnabool, default True

Exclude NA/null values. If an entire row/column is NA, the result will be NA.

*args, **kwargs

Additional keywords have no effect but might be accepted for compatibility with NumPy.

Returns:
Series or DataFrame

Return cumulative maximum of Series or DataFrame.

See also

core.window.expanding.Expanding.max

Similar functionality but ignores NaN values.

DataFrame.max

Return the maximum over DataFrame axis.

DataFrame.cummax

Return cumulative maximum over DataFrame axis.

DataFrame.cummin

Return cumulative minimum over DataFrame axis.

DataFrame.cumsum

Return cumulative sum over DataFrame axis.

DataFrame.cumprod

Return cumulative product over DataFrame axis.

Examples

Series

>>> s = pd.Series([2, np.nan, 5, -1, 0])  
>>> s  
0    2.0
1    NaN
2    5.0
3   -1.0
4    0.0
dtype: float64

By default, NA values are ignored.

>>> s.cummax()  
0    2.0
1    NaN
2    5.0
3    5.0
4    5.0
dtype: float64

To include NA values in the operation, use skipna=False

>>> s.cummax(skipna=False)  
0    2.0
1    NaN
2    NaN
3    NaN
4    NaN
dtype: float64

DataFrame

>>> df = pd.DataFrame([[2.0, 1.0],  
...                    [3.0, np.nan],
...                    [1.0, 0.0]],
...                    columns=list('AB'))
>>> df  
     A    B
0  2.0  1.0
1  3.0  NaN
2  1.0  0.0

By default, iterates over rows and finds the maximum in each column. This is equivalent to axis=None or axis='index'.

>>> df.cummax()  
     A    B
0  2.0  1.0
1  3.0  NaN
2  3.0  1.0

To iterate over columns and find the maximum in each row, use axis=1

>>> df.cummax(axis=1)  
     A    B
0  2.0  2.0
1  3.0  NaN
2  1.0  1.0
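
The "See also" entry for Expanding.max notes that it ignores NaN values. A small sketch in plain pandas, using the same Series as above, shows where the two results differ; it illustrates the pandas semantics that this docstring was copied from.

```python
import numpy as np
import pandas as pd

s = pd.Series([2, np.nan, 5, -1, 0])

# cummax keeps NaN at the position where the input is NaN.
running = s.cummax()

# The expanding window skips the NaN and reports the maximum seen so far.
expanding = s.expanding().max()
```

The two results match everywhere except at position 1, where running is NaN and expanding is 2.0.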
cummin(axis=None, skipna=True, out=None)#

Return cumulative minimum over a DataFrame or Series axis.

This docstring was copied from pandas.core.frame.DataFrame.cummin.

Some inconsistencies with the Dask version may exist.

Returns a DataFrame or Series of the same size containing the cumulative minimum.

Parameters:
axis{0 or ‘index’, 1 or ‘columns’}, default 0

The index or the name of the axis. 0 is equivalent to None or ‘index’. For Series this parameter is unused and defaults to 0.

skipnabool, default True

Exclude NA/null values. If an entire row/column is NA, the result will be NA.

*args, **kwargs

Additional keywords have no effect but might be accepted for compatibility with NumPy.

Returns:
Series or DataFrame

Return cumulative minimum of Series or DataFrame.

See also

core.window.expanding.Expanding.min

Similar functionality but ignores NaN values.

DataFrame.min

Return the minimum over DataFrame axis.

DataFrame.cummax

Return cumulative maximum over DataFrame axis.

DataFrame.cummin

Return cumulative minimum over DataFrame axis.

DataFrame.cumsum

Return cumulative sum over DataFrame axis.

DataFrame.cumprod

Return cumulative product over DataFrame axis.

Examples

Series

>>> s = pd.Series([2, np.nan, 5, -1, 0])  
>>> s  
0    2.0
1    NaN
2    5.0
3   -1.0
4    0.0
dtype: float64

By default, NA values are ignored.

>>> s.cummin()  
0    2.0
1    NaN
2    2.0
3   -1.0
4   -1.0
dtype: float64

To include NA values in the operation, use skipna=False

>>> s.cummin(skipna=False)  
0    2.0
1    NaN
2    NaN
3    NaN
4    NaN
dtype: float64

DataFrame

>>> df = pd.DataFrame([[2.0, 1.0],  
...                    [3.0, np.nan],
...                    [1.0, 0.0]],
...                    columns=list('AB'))
>>> df  
     A    B
0  2.0  1.0
1  3.0  NaN
2  1.0  0.0

By default, iterates over rows and finds the minimum in each column. This is equivalent to axis=None or axis='index'.

>>> df.cummin()  
     A    B
0  2.0  1.0
1  2.0  NaN
2  1.0  0.0

To iterate over columns and find the minimum in each row, use axis=1

>>> df.cummin(axis=1)  
     A    B
0  2.0  1.0
1  3.0  NaN
2  1.0  0.0
cumprod(axis=None, skipna=True, dtype=None, out=None)#

Return cumulative product over a DataFrame or Series axis.

This docstring was copied from pandas.core.frame.DataFrame.cumprod.

Some inconsistencies with the Dask version may exist.

Returns a DataFrame or Series of the same size containing the cumulative product.

Parameters:
axis{0 or ‘index’, 1 or ‘columns’}, default 0

The index or the name of the axis. 0 is equivalent to None or ‘index’. For Series this parameter is unused and defaults to 0.

skipnabool, default True

Exclude NA/null values. If an entire row/column is NA, the result will be NA.

*args, **kwargs

Additional keywords have no effect but might be accepted for compatibility with NumPy.

Returns:
Series or DataFrame

Return cumulative product of Series or DataFrame.

See also

core.window.expanding.Expanding.prod

Similar functionality but ignores NaN values.

DataFrame.prod

Return the product over DataFrame axis.

DataFrame.cummax

Return cumulative maximum over DataFrame axis.

DataFrame.cummin

Return cumulative minimum over DataFrame axis.

DataFrame.cumsum

Return cumulative sum over DataFrame axis.

DataFrame.cumprod

Return cumulative product over DataFrame axis.

Examples

Series

>>> s = pd.Series([2, np.nan, 5, -1, 0])  
>>> s  
0    2.0
1    NaN
2    5.0
3   -1.0
4    0.0
dtype: float64

By default, NA values are ignored.

>>> s.cumprod()  
0     2.0
1     NaN
2    10.0
3   -10.0
4    -0.0
dtype: float64

To include NA values in the operation, use skipna=False

>>> s.cumprod(skipna=False)  
0    2.0
1    NaN
2    NaN
3    NaN
4    NaN
dtype: float64

DataFrame

>>> df = pd.DataFrame([[2.0, 1.0],  
...                    [3.0, np.nan],
...                    [1.0, 0.0]],
...                    columns=list('AB'))
>>> df  
     A    B
0  2.0  1.0
1  3.0  NaN
2  1.0  0.0

By default, iterates over rows and finds the product in each column. This is equivalent to axis=None or axis='index'.

>>> df.cumprod()  
     A    B
0  2.0  1.0
1  6.0  NaN
2  6.0  0.0

To iterate over columns and find the product in each row, use axis=1

>>> df.cumprod(axis=1)  
     A    B
0  2.0  2.0
1  3.0  NaN
2  1.0  0.0
cumsum(axis=None, skipna=True, dtype=None, out=None)#

Return cumulative sum over a DataFrame or Series axis.

This docstring was copied from pandas.core.frame.DataFrame.cumsum.

Some inconsistencies with the Dask version may exist.

Returns a DataFrame or Series of the same size containing the cumulative sum.

Parameters:
axis{0 or ‘index’, 1 or ‘columns’}, default 0

The index or the name of the axis. 0 is equivalent to None or ‘index’. For Series this parameter is unused and defaults to 0.

skipnabool, default True

Exclude NA/null values. If an entire row/column is NA, the result will be NA.

*args, **kwargs

Additional keywords have no effect but might be accepted for compatibility with NumPy.

Returns:
Series or DataFrame

Return cumulative sum of Series or DataFrame.

See also

core.window.expanding.Expanding.sum

Similar functionality but ignores NaN values.

DataFrame.sum

Return the sum over DataFrame axis.

DataFrame.cummax

Return cumulative maximum over DataFrame axis.

DataFrame.cummin

Return cumulative minimum over DataFrame axis.

DataFrame.cumsum

Return cumulative sum over DataFrame axis.

DataFrame.cumprod

Return cumulative product over DataFrame axis.

Examples

Series

>>> s = pd.Series([2, np.nan, 5, -1, 0])  
>>> s  
0    2.0
1    NaN
2    5.0
3   -1.0
4    0.0
dtype: float64

By default, NA values are ignored.

>>> s.cumsum()  
0    2.0
1    NaN
2    7.0
3    6.0
4    6.0
dtype: float64

To include NA values in the operation, use skipna=False

>>> s.cumsum(skipna=False)  
0    2.0
1    NaN
2    NaN
3    NaN
4    NaN
dtype: float64

DataFrame

>>> df = pd.DataFrame([[2.0, 1.0],  
...                    [3.0, np.nan],
...                    [1.0, 0.0]],
...                    columns=list('AB'))
>>> df  
     A    B
0  2.0  1.0
1  3.0  NaN
2  1.0  0.0

By default, iterates over rows and finds the sum in each column. This is equivalent to axis=None or axis='index'.

>>> df.cumsum()  
     A    B
0  2.0  1.0
1  5.0  NaN
2  6.0  1.0

To iterate over columns and find the sum in each row, use axis=1

>>> df.cumsum(axis=1)  
     A    B
0  2.0  3.0
1  3.0  NaN
2  1.0  1.0
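
The skipna behaviour shared by cummax, cummin, cumprod and cumsum can be summarised in a few lines of plain Python. The helper below (cumulative is a hypothetical name, not part of any library) is only a sketch of the semantics shown in the examples above.

```python
import math
from operator import add


def cumulative(values, op, skipna=True):
    """Apply a running binary operation with pandas-like NaN handling."""
    out = []
    acc = None          # running result over the valid values seen so far
    poisoned = False    # set once a NaN is met and skipna is False
    for v in values:
        if isinstance(v, float) and math.isnan(v):
            if not skipna:
                poisoned = True
            out.append(math.nan)   # a NaN input always yields a NaN output
            continue
        if poisoned:
            out.append(math.nan)   # with skipna=False, NaN propagates onward
            continue
        acc = v if acc is None else op(acc, v)
        out.append(acc)
    return out


data = [2.0, math.nan, 5.0, -1.0, 0.0]
skipped = cumulative(data, add)                    # like s.cumsum()
propagated = cumulative(data, add, skipna=False)   # like s.cumsum(skipna=False)
```

Passing max, min or operator.mul instead of add gives the corresponding sketch for cummax, cummin and cumprod.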
describe(split_every=False, percentiles=None, percentiles_method='default', include=None, exclude=None, datetime_is_numeric=_NoDefault.no_default)#

Generate descriptive statistics.

This docstring was copied from pandas.core.frame.DataFrame.describe.

Some inconsistencies with the Dask version may exist.

Descriptive statistics include those that summarize the central tendency, dispersion and shape of a dataset’s distribution, excluding NaN values.

Analyzes both numeric and object series, as well as DataFrame column sets of mixed data types. The output will vary depending on what is provided. Refer to the notes below for more detail.

Parameters:
percentileslist-like of numbers, optional

The percentiles to include in the output. All should fall between 0 and 1. The default is [.25, .5, .75], which returns the 25th, 50th, and 75th percentiles.

include‘all’, list-like of dtypes or None (default), optional

A white list of data types to include in the result. Ignored for Series. Here are the options:

  • ‘all’ : All columns of the input will be included in the output.

  • A list-like of dtypes : Limits the results to the provided data types. To limit the result to numeric types submit numpy.number. To limit it instead to object columns submit the numpy.object data type. Strings can also be used in the style of select_dtypes (e.g. df.describe(include=['O'])). To select pandas categorical columns, use 'category'

  • None (default) : The result will include all numeric columns.

excludelist-like of dtypes or None (default), optional

A black list of data types to omit from the result. Ignored for Series. Here are the options:

  • A list-like of dtypes : Excludes the provided data types from the result. To exclude numeric types submit numpy.number. To exclude object columns submit the data type numpy.object. Strings can also be used in the style of select_dtypes (e.g. df.describe(exclude=['O'])). To exclude pandas categorical columns, use 'category'

  • None (default) : The result will exclude nothing.

datetime_is_numericbool, default False

Whether to treat datetime dtypes as numeric. This affects statistics calculated for the column. For DataFrame input, this also controls whether datetime columns are included by default.

New in version 1.1.0.

Returns:
Series or DataFrame

Summary statistics of the Series or Dataframe provided.

See also

DataFrame.count

Count number of non-NA/null observations.

DataFrame.max

Maximum of the values in the object.

DataFrame.min

Minimum of the values in the object.

DataFrame.mean

Mean of the values.

DataFrame.std

Standard deviation of the observations.

DataFrame.select_dtypes

Subset of a DataFrame including/excluding columns based on their dtype.

Notes

For numeric data, the result’s index will include count, mean, std, min, max as well as lower, 50 and upper percentiles. By default the lower percentile is 25 and the upper percentile is 75. The 50 percentile is the same as the median.

For object data (e.g. strings or timestamps), the result’s index will include count, unique, top, and freq. The top is the most common value. The freq is the most common value’s frequency. Timestamps also include the first and last items.

If multiple object values have the highest count, then the count and top results will be arbitrarily chosen from among those with the highest count.

For mixed data types provided via a DataFrame, the default is to return only an analysis of numeric columns. If the dataframe consists only of object and categorical data without any numeric columns, the default is to return an analysis of both the object and categorical columns. If include='all' is provided as an option, the result will include a union of attributes of each type.

The include and exclude parameters can be used to limit which columns in a DataFrame are analyzed for the output. The parameters are ignored when analyzing a Series.

Examples

Describing a numeric Series.

>>> s = pd.Series([1, 2, 3])  
>>> s.describe()  
count    3.0
mean     2.0
std      1.0
min      1.0
25%      1.5
50%      2.0
75%      2.5
max      3.0
dtype: float64

Describing a categorical Series.

>>> s = pd.Series(['a', 'a', 'b', 'c'])  
>>> s.describe()  
count     4
unique    3
top       a
freq      2
dtype: object

Describing a timestamp Series.

>>> s = pd.Series([  
...   np.datetime64("2000-01-01"),
...   np.datetime64("2010-01-01"),
...   np.datetime64("2010-01-01")
... ])
>>> s.describe(datetime_is_numeric=True)  
count                      3
mean     2006-09-01 08:00:00
min      2000-01-01 00:00:00
25%      2004-12-31 12:00:00
50%      2010-01-01 00:00:00
75%      2010-01-01 00:00:00
max      2010-01-01 00:00:00
dtype: object

Describing a DataFrame. By default only numeric fields are returned.

>>> df = pd.DataFrame({'categorical': pd.Categorical(['d','e','f']),  
...                    'numeric': [1, 2, 3],
...                    'object': ['a', 'b', 'c']
...                   })
>>> df.describe()  
       numeric
count      3.0
mean       2.0
std        1.0
min        1.0
25%        1.5
50%        2.0
75%        2.5
max        3.0

Describing all columns of a DataFrame regardless of data type.

>>> df.describe(include='all')  
       categorical  numeric object
count            3      3.0      3
unique           3      NaN      3
top              f      NaN      a
freq             1      NaN      1
mean           NaN      2.0    NaN
std            NaN      1.0    NaN
min            NaN      1.0    NaN
25%            NaN      1.5    NaN
50%            NaN      2.0    NaN
75%            NaN      2.5    NaN
max            NaN      3.0    NaN

Describing a column from a DataFrame by accessing it as an attribute.

>>> df.numeric.describe()  
count    3.0
mean     2.0
std      1.0
min      1.0
25%      1.5
50%      2.0
75%      2.5
max      3.0
Name: numeric, dtype: float64

Including only numeric columns in a DataFrame description.

>>> df.describe(include=[np.number])  
       numeric
count      3.0
mean       2.0
std        1.0
min        1.0
25%        1.5
50%        2.0
75%        2.5
max        3.0

Including only string columns in a DataFrame description.

>>> df.describe(include=[object])  
       object
count       3
unique      3
top         a
freq        1

Including only categorical columns from a DataFrame description.

>>> df.describe(include=['category'])  
       categorical
count            3
unique           3
top              d
freq             1

Excluding numeric columns from a DataFrame description.

>>> df.describe(exclude=[np.number])  
       categorical object
count            3      3
unique           3      3
top              f      a
freq             1      1

Excluding object columns from a DataFrame description.

>>> df.describe(exclude=[object])  
       categorical  numeric
count            3      3.0
unique           3      NaN
top              f      NaN
freq             1      NaN
mean           NaN      2.0
std            NaN      1.0
min            NaN      1.0
25%            NaN      1.5
50%            NaN      2.0
75%            NaN      2.5
max            NaN      3.0
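
The percentiles parameter is described above but not shown in the examples. A minimal sketch in plain pandas follows; the data are made up, and in a distributed setting the percentile values may be approximate, which is an assumption to verify for your Dask version.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5])

# Request the 10th, 50th and 90th percentiles instead of the defaults.
summary = s.describe(percentiles=[0.1, 0.5, 0.9])

labels = list(summary.index)
```

The resulting index lists the requested percentiles between min and max, labelled 10%, 50% and 90%.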
diff(periods=1, axis=0)#

First discrete difference of element.

This docstring was copied from pandas.core.frame.DataFrame.diff.

Some inconsistencies with the Dask version may exist.

Note

Pandas currently uses an object-dtype column to represent boolean data with missing values. This can cause issues for boolean-specific operations, like |. To enable boolean-specific operations, at the cost of metadata that doesn’t match pandas, use .astype(bool) after the shift.

Calculates the difference of a DataFrame element compared with another element in the DataFrame (default is element in previous row).

Parameters:
periodsint, default 1

Periods to shift for calculating difference, accepts negative values.

axis{0 or ‘index’, 1 or ‘columns’}, default 0

Take difference over rows (0) or columns (1).

Returns:
DataFrame

First differences of the Series.

See also

DataFrame.pct_change

Percent change over given number of periods.

DataFrame.shift

Shift index by desired number of periods with an optional time freq.

Series.diff

First discrete difference of object.

Notes

For boolean dtypes, this uses operator.xor() rather than operator.sub(). The result is calculated according to current dtype in DataFrame, however dtype of the result is always float64.

Examples

Difference with previous row

>>> df = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6],  
...                    'b': [1, 1, 2, 3, 5, 8],
...                    'c': [1, 4, 9, 16, 25, 36]})
>>> df  
   a  b   c
0  1  1   1
1  2  1   4
2  3  2   9
3  4  3  16
4  5  5  25
5  6  8  36
>>> df.diff()  
     a    b     c
0  NaN  NaN   NaN
1  1.0  0.0   3.0
2  1.0  1.0   5.0
3  1.0  1.0   7.0
4  1.0  2.0   9.0
5  1.0  3.0  11.0

Difference with previous column

>>> df.diff(axis=1)  
    a  b   c
0 NaN  0   0
1 NaN -1   3
2 NaN -1   7
3 NaN -1  13
4 NaN  0  20
5 NaN  2  28

Difference with 3rd previous row

>>> df.diff(periods=3)  
     a    b     c
0  NaN  NaN   NaN
1  NaN  NaN   NaN
2  NaN  NaN   NaN
3  3.0  2.0  15.0
4  3.0  4.0  21.0
5  3.0  6.0  27.0

Difference with following row

>>> df.diff(periods=-1)  
     a    b     c
0 -1.0  0.0  -3.0
1 -1.0 -1.0  -5.0
2 -1.0 -1.0  -7.0
3 -1.0 -2.0  -9.0
4 -1.0 -3.0 -11.0
5  NaN  NaN   NaN

Overflow in input dtype

>>> df = pd.DataFrame({'a': [1, 0]}, dtype=np.uint8)  
>>> df.diff()  
       a
0    NaN
1  255.0
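
For numeric data, diff can be read as the object minus a shifted copy of itself. The sketch below checks that relationship in plain pandas; it illustrates the semantics only and says nothing about how Dask-cuDF evaluates the operation across partitions.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3, 4, 5, 6],
                   "b": [1, 1, 2, 3, 5, 8]})
periods = 2

via_diff = df.diff(periods=periods)

# Subtract the frame shifted down by the same number of rows.
via_shift = df - df.shift(periods)
```

Both frames hold NaN in the first two rows and identical differences afterwards.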
div(other, level=None, fill_value=None, axis=0)#

Return floating division of series and other, element-wise (binary operator truediv).

This docstring was copied from pandas.core.series.Series.div.

Some inconsistencies with the Dask version may exist.

Equivalent to series / other, but with support to substitute a fill_value for missing data in either one of the inputs.

Parameters:
otherSeries or scalar value
levelint or name

Broadcast across a level, matching Index values on the passed MultiIndex level.

fill_valueNone or float value, default None (NaN)

Fill existing missing (NaN) values, and any new element needed for successful Series alignment, with this value before computation. If data in both corresponding Series locations is missing the result of filling (at that location) will be missing.

axis{0 or ‘index’}

Unused. Parameter needed for compatibility with DataFrame.

Returns:
Series

The result of the operation.

See also

Series.rtruediv

Reverse of the Floating division operator, see Python documentation for more details.

Examples

>>> a = pd.Series([1, 1, 1, np.nan], index=['a', 'b', 'c', 'd'])  
>>> a  
a    1.0
b    1.0
c    1.0
d    NaN
dtype: float64
>>> b = pd.Series([1, np.nan, 1, np.nan], index=['a', 'b', 'd', 'e'])  
>>> b  
a    1.0
b    NaN
d    1.0
e    NaN
dtype: float64
>>> a.divide(b, fill_value=0)  
a    1.0
b    inf
c    inf
d    0.0
e    NaN
dtype: float64
divide(other, level=None, fill_value=None, axis=0)#

Return floating division of series and other, element-wise (binary operator truediv).

This docstring was copied from pandas.core.series.Series.divide.

Some inconsistencies with the Dask version may exist.

Equivalent to series / other, but with support to substitute a fill_value for missing data in either one of the inputs.

Parameters:
otherSeries or scalar value
levelint or name

Broadcast across a level, matching Index values on the passed MultiIndex level.

fill_valueNone or float value, default None (NaN)

Fill existing missing (NaN) values, and any new element needed for successful Series alignment, with this value before computation. If data in both corresponding Series locations is missing the result of filling (at that location) will be missing.

axis{0 or ‘index’}

Unused. Parameter needed for compatibility with DataFrame.

Returns:
Series

The result of the operation.

See also

Series.rtruediv

Reverse of the Floating division operator, see Python documentation for more details.

Examples

>>> a = pd.Series([1, 1, 1, np.nan], index=['a', 'b', 'c', 'd'])  
>>> a  
a    1.0
b    1.0
c    1.0
d    NaN
dtype: float64
>>> b = pd.Series([1, np.nan, 1, np.nan], index=['a', 'b', 'd', 'e'])  
>>> b  
a    1.0
b    NaN
d    1.0
e    NaN
dtype: float64
>>> a.divide(b, fill_value=0)  
a    1.0
b    inf
c    inf
d    0.0
e    NaN
dtype: float64
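
The fill_value rule above (fill where one side is missing, keep the result missing where both are) can be spelled out step by step. This is a sketch in plain pandas using the Series from the example; the intermediate names are illustrative.

```python
import numpy as np
import pandas as pd

a = pd.Series([1, 1, 1, np.nan], index=["a", "b", "c", "d"])
b = pd.Series([1, np.nan, 1, np.nan], index=["a", "b", "d", "e"])

# Step 1: align both Series on the union of their indexes.
a2, b2 = a.align(b)

# Step 2: remember where both sides are missing.
both_missing = a2.isna() & b2.isna()

# Step 3: fill the remaining gaps, divide, then restore the missing spots.
manual = (a2.fillna(0) / b2.fillna(0)).mask(both_missing)

library = a.div(b, fill_value=0)
same_as_divide = a.divide(b, fill_value=0)
```

The manual result matches both div and divide, including the inf entries produced by dividing by a filled zero.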
property divisions#

Tuple of npartitions + 1 values, in ascending order, marking the lower/upper bounds of each partition’s index. Divisions allow Dask to know which partition will contain a given value, significantly speeding up operations like loc, merge, and groupby by not having to search the full dataset.

Example: for divisions = (0, 10, 50, 100), there are three partitions, where the index in each partition contains values [0, 10), [10, 50), and [50, 100], respectively. Dask therefore knows df.loc[45] will be in the second partition.

When every item in divisions is None, the divisions are unknown. Most operations can still be performed, but some will be much slower, and a few may fail.

It is uncommon to set divisions directly. Instead, use set_index, which sorts and splits the data as needed. See https://docs.dask.org/en/latest/dataframe-design.html#partitions.
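
The lookup that known divisions make possible can be sketched with a binary search. The helper below (partition_for is a hypothetical name) is plain Python and is not how Dask itself is implemented; it only reproduces the interval rule from the example above, where every partition is half-open except the last one.

```python
from bisect import bisect_right


def partition_for(divisions, value):
    """Return the 0-based partition whose index range contains value."""
    if any(d is None for d in divisions):
        raise ValueError("divisions are unknown")
    if not divisions[0] <= value <= divisions[-1]:
        raise KeyError(value)
    npartitions = len(divisions) - 1
    # bisect_right finds the first boundary greater than value; the last
    # partition is closed on the right, so clamp to the final partition.
    return min(bisect_right(divisions, value) - 1, npartitions - 1)


divisions = (0, 10, 50, 100)
second = partition_for(divisions, 45)    # falls in [10, 50)
last = partition_for(divisions, 100)     # upper bound belongs to [50, 100]
```

With unknown divisions no such shortcut exists, which is why some operations then need to consider every partition.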

dot(other, meta=_NoDefault.no_default)#

Compute the dot product between the Series and the columns of other.

This docstring was copied from pandas.core.series.Series.dot.

Some inconsistencies with the Dask version may exist.

This method computes the dot product between the Series and another Series, between the Series and each column of a DataFrame, or between the Series and each column of an array.

It can also be called using self @ other in Python >= 3.5.

Parameters:
otherSeries, DataFrame or array-like

The other object to compute the dot product with its columns.

Returns:
scalar, Series or numpy.ndarray

If other is a Series, the dot product of the Series and other (a scalar). If other is a DataFrame, a Series holding the dot product of the Series with each column of other. If other is an array, a numpy.ndarray holding the dot product of the Series with each column of the array.

See also

DataFrame.dot

Compute the matrix product with the DataFrame.

Series.mul

Multiplication of series and other, element-wise.

Notes

The Series and other must share the same index if other is a Series or a DataFrame.

Examples

>>> s = pd.Series([0, 1, 2, 3])  
>>> other = pd.Series([-1, 2, -3, 4])  
>>> s.dot(other)  
8
>>> s @ other  
8
>>> df = pd.DataFrame([[0, 1], [-2, 3], [4, -5], [6, 7]])  
>>> s.dot(df)  
0    24
1    14
dtype: int64
>>> arr = np.array([[0, 1], [-2, 3], [4, -5], [6, 7]])  
>>> s.dot(arr)  
array([24, 14])
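
The dot product of two Series is the sum of their element-wise products, matched by index label rather than by position. The following sketch in plain pandas illustrates both points on the data from the example above.

```python
import pandas as pd

s = pd.Series([0, 1, 2, 3])
other = pd.Series([-1, 2, -3, 4])

# Element-wise product followed by a sum gives the same scalar as dot.
manual = (s * other).sum()

# Same labels in reverse order: values are matched by label, so the
# result does not change.
reordered = other.iloc[::-1]
aligned = s.dot(reordered)
```

Both manual and aligned equal the value returned by s.dot(other).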
drop_duplicates(subset=None, split_every=None, split_out=1, shuffle_method=None, ignore_index=False, **kwargs)#

Return DataFrame with duplicate rows removed.

This docstring was copied from pandas.core.frame.DataFrame.drop_duplicates.

Some inconsistencies with the Dask version may exist.

Known inconsistencies:

keep=False will raise a NotImplementedError

Considering certain columns is optional. Indexes, including time indexes are ignored.

Parameters:
subsetcolumn label or sequence of labels, optional

Only consider certain columns for identifying duplicates, by default use all of the columns.

keep{‘first’, ‘last’, False}, default ‘first’ (Not supported in Dask)

Determines which duplicates (if any) to keep.

  • ‘first’ : Drop duplicates except for the first occurrence.

  • ‘last’ : Drop duplicates except for the last occurrence.

  • False : Drop all duplicates.

inplacebool, default False (Not supported in Dask)

Whether to modify the DataFrame rather than creating a new one.

ignore_indexbool, default False

If True, the resulting axis will be labeled 0, 1, …, n - 1.

New in version 1.0.0.

Returns:
DataFrame or None

DataFrame with duplicates removed or None if inplace=True.

See also

DataFrame.value_counts

Count unique combinations of columns.

Examples

Consider a dataset containing ramen ratings.

>>> df = pd.DataFrame({  
...     'brand': ['Yum Yum', 'Yum Yum', 'Indomie', 'Indomie', 'Indomie'],
...     'style': ['cup', 'cup', 'cup', 'pack', 'pack'],
...     'rating': [4, 4, 3.5, 15, 5]
... })
>>> df  
    brand style  rating
0  Yum Yum   cup     4.0
1  Yum Yum   cup     4.0
2  Indomie   cup     3.5
3  Indomie  pack    15.0
4  Indomie  pack     5.0

By default, it removes duplicate rows based on all columns.

>>> df.drop_duplicates()  
    brand style  rating
0  Yum Yum   cup     4.0
2  Indomie   cup     3.5
3  Indomie  pack    15.0
4  Indomie  pack     5.0

To remove duplicates on specific column(s), use subset.

>>> df.drop_duplicates(subset=['brand'])  
    brand style  rating
0  Yum Yum   cup     4.0
2  Indomie   cup     3.5

To remove duplicates and keep last occurrences, use keep.

>>> df.drop_duplicates(subset=['brand', 'style'], keep='last')  
    brand style  rating
1  Yum Yum   cup     4.0
2  Indomie   cup     3.5
4  Indomie  pack     5.0
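
Because keep=False is listed above as a known inconsistency, it can help to see what it would select. The sketch below reproduces the keep=False semantics in plain pandas by counting group sizes; whether the same expression is supported or efficient in Dask-cuDF is not verified here.

```python
import pandas as pd

df = pd.DataFrame({
    "brand": ["Yum Yum", "Yum Yum", "Indomie", "Indomie", "Indomie"],
    "style": ["cup", "cup", "cup", "pack", "pack"],
    "rating": [4, 4, 3.5, 15, 5],
})
keys = ["brand", "style"]

# Size of the group each row belongs to, aligned with the original rows.
group_sizes = df.groupby(keys)["rating"].transform("size")

# Rows whose key combination occurs exactly once.
unique_only = df[group_sizes == 1]

# pandas reference result for comparison.
expected = df.drop_duplicates(subset=keys, keep=False)
```

Only the row with brand Indomie and style cup survives, since every other key combination appears twice.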
dropna()#

Return a new Series with missing values removed.

This docstring was copied from pandas.core.series.Series.dropna.

Some inconsistencies with the Dask version may exist.

See the User Guide for more on which values are considered missing, and how to work with missing data.

Parameters:
axis{0 or ‘index’} (Not supported in Dask)

Unused. Parameter needed for compatibility with DataFrame.

inplacebool, default False (Not supported in Dask)

If True, do operation inplace and return None.

howstr, optional (Not supported in Dask)

Not in use. Kept for compatibility.

Returns:
Series or None

Series with NA entries dropped from it or None if inplace=True.

See also

Series.isna

Indicate missing values.

Series.notna

Indicate existing (non-missing) values.

Series.fillna

Replace missing values.

DataFrame.dropna

Drop rows or columns which contain NA values.

Index.dropna

Drop missing indices.

Examples

>>> ser = pd.Series([1., 2., np.nan])  
>>> ser  
0    1.0
1    2.0
2    NaN
dtype: float64

Drop NA values from a Series.

>>> ser.dropna()  
0    1.0
1    2.0
dtype: float64

Keep the Series with valid entries in the same variable.

>>> ser.dropna(inplace=True)  
>>> ser  
0    1.0
1    2.0
dtype: float64

Empty strings are not considered NA values. None is considered an NA value.

>>> ser = pd.Series([np.nan, 2, pd.NaT, '', None, 'I stay'])  
>>> ser  
0       NaN
1         2
2       NaT
3
4      None
5    I stay
dtype: object
>>> ser.dropna()  
1         2
3
5    I stay
dtype: object
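
Since inplace is marked as not supported in Dask, the usual pattern is to assign the result back to a name. The sketch below, in plain pandas, also shows that dropna selects the same rows as a boolean mask built with notna.

```python
import numpy as np
import pandas as pd

ser = pd.Series([np.nan, 2, pd.NaT, "", None, "I stay"])

# Reassign instead of relying on inplace=True.
cleaned = ser.dropna()

# Equivalent selection with an explicit mask.
masked = ser[ser.notna()]
```

The empty string at label 3 is kept in both results because it is not considered missing.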
dt#

alias of DatetimeAccessor

property dtype#

Return data type

enforce_runtime_divisions()#

Enforce the current divisions at runtime

eq(other, level=None, fill_value=None, axis=0)#

Test whether DataFrame or Series and other are equal, element-wise (binary operator eq).

This docstring was copied from cudf.core.series.Series.eq.

Some inconsistencies with the Dask version may exist.

Equivalent to frame == other, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters:
otherscalar, sequence, Series, or DataFrame

Any single or multiple element data structure, or list-like object.

axisint or string

Only 0 is supported for series, 1 or columns supported for dataframe

levelint or name

Broadcast across a level, matching Index values on the passed MultiIndex level. Not yet supported.

fill_valuefloat or None, default None

Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing.

Returns:
DataFrame or Series

Result of the comparison.

Examples

DataFrame

>>> df = cudf.DataFrame(  
...     {'angles': [0, 3, 4], 'degrees': [360, 180, 360]},
...     index=['circle', 'triangle', 'rectangle']
... )
>>> df.eq(1)  
           angles  degrees
circle      False    False
triangle    False    False
rectangle   False    False

Series

>>> a = cudf.Series([1, 1, 1, None], index=['a', 'b', 'c', 'd'])  
>>> b = cudf.Series([1, None, 1, None], index=['a', 'b', 'd', 'e'])  
>>> a.eq(b)  
a    True
b    <NA>
c    <NA>
d    <NA>
e    <NA>
dtype: bool
>>> a.eq(b, fill_value=0)  
a     True
b    False
c    False
d    False
e     <NA>
dtype: bool
explode()#

Transform each element of a list-like to a row.

This docstring was copied from pandas.core.series.Series.explode.

Some inconsistencies with the Dask version may exist.

New in version 0.25.0.

Parameters:
ignore_indexbool, default False (Not supported in Dask)

If True, the resulting index will be labeled 0, 1, …, n - 1.

New in version 1.1.0.

Returns:
Series

Exploded lists to rows; index will be duplicated for these rows.

See also

Series.str.split

Split string values on specified separator.

Series.unstack

Unstack, a.k.a. pivot, Series with MultiIndex to produce DataFrame.

DataFrame.melt

Unpivot a DataFrame from wide format to long format.

DataFrame.explode

Explode a DataFrame from list-like columns to long format.

Notes

This routine will explode list-likes including lists, tuples, sets, Series, and np.ndarray. The result dtype of the subset rows will be object. Scalars will be returned unchanged, and empty list-likes will result in a np.nan for that row. In addition, the ordering of elements in the output will be non-deterministic when exploding sets.

Reference the user guide for more examples.

Examples

>>> s = pd.Series([[1, 2, 3], 'foo', [], [3, 4]])  
>>> s  
0    [1, 2, 3]
1          foo
2           []
3       [3, 4]
dtype: object
>>> s.explode()  
0      1
0      2
0      3
1    foo
2    NaN
3      3
3      4
dtype: object
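
The duplicated index produced by explode is often followed by a renumbering step. The sketch below uses plain pandas, where reset_index(drop=True) yields labels 0 to n - 1. In Dask the reset index restarts in each partition rather than increasing globally, so treat this as an illustration of the pandas behaviour only.

```python
import pandas as pd

s = pd.Series([[1, 2, 3], "foo", [], [3, 4]])

exploded = s.explode()

# Each list element inherits the label of the row it came from.
labels = list(exploded.index)

# Renumber the rows once the original labels are no longer needed.
renumbered = exploded.reset_index(drop=True)
```

The empty list at label 2 contributes a single NaN row, so the exploded Series has seven entries.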
ffill(axis=None, limit=None)#

Synonym for DataFrame.fillna() with method='ffill'.

This docstring was copied from pandas.core.frame.DataFrame.ffill.

Some inconsistencies with the Dask version may exist.

Returns:
Series/DataFrame or None

Object with missing values filled or None if inplace=True.

fillna(value=None, method=None, limit=None, axis=None)#

Fill NA/NaN values using the specified method.

This docstring was copied from pandas.core.frame.DataFrame.fillna.

Some inconsistencies with the Dask version may exist.

Parameters:
valuescalar, dict, Series, or DataFrame

Value to use to fill holes (e.g. 0), alternately a dict/Series/DataFrame of values specifying which value to use for each index (for a Series) or column (for a DataFrame). Values not in the dict/Series/DataFrame will not be filled. This value cannot be a list.

method{‘backfill’, ‘bfill’, ‘pad’, ‘ffill’, None}, default None

Method to use for filling holes in reindexed Series:

  • pad / ffill : propagate the last valid observation forward to the next valid one.

  • backfill / bfill : use the next valid observation to fill the gap.

axis{0 or ‘index’, 1 or ‘columns’}

Axis along which to fill missing values. For Series this parameter is unused and defaults to 0.

inplacebool, default False (Not supported in Dask)

If True, fill in-place. Note: this will modify any other views on this object (e.g., a no-copy slice for a column in a DataFrame).

limitint, default None

If method is specified, this is the maximum number of consecutive NaN values to forward/backward fill. In other words, if there is a gap with more than this number of consecutive NaNs, it will only be partially filled. If method is not specified, this is the maximum number of entries along the entire axis where NaNs will be filled. Must be greater than 0 if not None.

downcastdict, default is None (Not supported in Dask)

A dict of item->dtype of what to downcast if possible, or the string ‘infer’ which will try to downcast to an appropriate equal type (e.g. float64 to int64 if possible).

Returns:
DataFrame or None

Object with missing values filled or None if inplace=True.

See also

interpolate

Fill NaN values using interpolation.

reindex

Conform object to new index.

asfreq

Convert TimeSeries to specified frequency.

Examples

>>> df = pd.DataFrame([[np.nan, 2, np.nan, 0],  
...                    [3, 4, np.nan, 1],
...                    [np.nan, np.nan, np.nan, np.nan],
...                    [np.nan, 3, np.nan, 4]],
...                   columns=list("ABCD"))
>>> df  
     A    B   C    D
0  NaN  2.0 NaN  0.0
1  3.0  4.0 NaN  1.0
2  NaN  NaN NaN  NaN
3  NaN  3.0 NaN  4.0

Replace all NaN elements with 0s.

>>> df.fillna(0)  
     A    B    C    D
0  0.0  2.0  0.0  0.0
1  3.0  4.0  0.0  1.0
2  0.0  0.0  0.0  0.0
3  0.0  3.0  0.0  4.0

We can also propagate non-null values forward or backward.

>>> df.fillna(method="ffill")  
     A    B   C    D
0  NaN  2.0 NaN  0.0
1  3.0  4.0 NaN  1.0
2  3.0  4.0 NaN  1.0
3  3.0  3.0 NaN  4.0

Replace all NaN elements in column ‘A’, ‘B’, ‘C’, and ‘D’, with 0, 1, 2, and 3 respectively.

>>> values = {"A": 0, "B": 1, "C": 2, "D": 3}  
>>> df.fillna(value=values)  
     A    B    C    D
0  0.0  2.0  2.0  0.0
1  3.0  4.0  2.0  1.0
2  0.0  1.0  2.0  3.0
3  0.0  3.0  2.0  4.0

Only replace the first NaN element.

>>> df.fillna(value=values, limit=1)  
     A    B    C    D
0  0.0  2.0  2.0  0.0
1  3.0  4.0  NaN  1.0
2  NaN  1.0  NaN  3.0
3  NaN  3.0  NaN  4.0

When filling using a DataFrame, replacement happens along the same column names and same indices:

>>> df2 = pd.DataFrame(np.zeros((4, 4)), columns=list("ABCE"))  
>>> df.fillna(df2)  
     A    B    C    D
0  0.0  2.0  0.0  0.0
1  3.0  4.0  0.0  1.0
2  0.0  0.0  0.0  NaN
3  0.0  3.0  0.0  4.0

Note that column D is not affected since it is not present in df2.

first(offset)#

Select initial periods of time series data based on a date offset.

This docstring was copied from pandas.core.frame.DataFrame.first.

Some inconsistencies with the Dask version may exist.

For a DataFrame with dates as its index, this function selects the first few rows based on a date offset.

Parameters:
offsetstr, DateOffset or dateutil.relativedelta

The offset length of the data that will be selected. For instance, ‘1M’ will display all the rows having their index within the first month.

Returns:
Series or DataFrame

A subset of the caller.

Raises:
TypeError

If the index is not a DatetimeIndex

See also

last

Select final periods of time series based on a date offset.

at_time

Select values at a particular time of the day.

between_time

Select values between particular times of the day.

Examples

>>> i = pd.date_range('2018-04-09', periods=4, freq='2D')  
>>> ts = pd.DataFrame({'A': [1, 2, 3, 4]}, index=i)  
>>> ts  
            A
2018-04-09  1
2018-04-11  2
2018-04-13  3
2018-04-15  4

Get the rows for the first 3 days:

>>> ts.first('3D')  
            A
2018-04-09  1
2018-04-11  2

Notice that data for the first 3 calendar days were returned, not the first 3 days observed in the dataset; therefore data for 2018-04-13 was not returned.

floordiv(other, level=None, fill_value=None, axis=0)#

Get integer division of DataFrame or Series and other, element-wise (binary operator floordiv).

This docstring was copied from cudf.core.series.Series.floordiv.

Some inconsistencies with the Dask version may exist.

Equivalent to frame // other, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters:
otherscalar, sequence, Series, or DataFrame

Any single or multiple element data structure, or list-like object.

axisint or string

Only 0 is supported for Series; 1 or ‘columns’ is supported for DataFrame.

levelint or name

Broadcast across a level, matching Index values on the passed MultiIndex level. Not yet supported.

fill_valuefloat or None, default None

Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing.

Returns:
DataFrame or Series

Result of the arithmetic operation.

Examples

DataFrame

>>> df = cudf.DataFrame(  
...     {'angles': [0, 3, 4], 'degrees': [360, 180, 360]},
...     index=['circle', 'triangle', 'rectangle']
... )
>>> df.floordiv(1)  
           angles  degrees
circle          0      360
triangle        3      180
rectangle       4      360

Series

>>> a = cudf.Series([1, 1, 1, None], index=['a', 'b', 'c', 'd'])  
>>> b = cudf.Series([1, None, 1, None], index=['a', 'b', 'd', 'e'])  
>>> a.floordiv(b)  
a       1
b    <NA>
c    <NA>
d    <NA>
e    <NA>
dtype: int64
>>> a.floordiv(b, fill_value=0)  
a                      1
b    9223372036854775807
c    9223372036854775807
d                      0
e                   <NA>
dtype: int64
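
The fill_value rule can also be tried without a GPU. The sketch below uses pandas, whose Series.floordiv takes the same fill_value argument; the data are illustrative, and pandas reports missing results as NaN in a float64 Series rather than the <NA> shown in the cuDF output above.

```python
import numpy as np
import pandas as pd

a = pd.Series([7, 8, 9, np.nan], index=["a", "b", "c", "d"])
b = pd.Series([2, np.nan, 4, np.nan], index=["a", "b", "d", "e"])

# Without fill_value, a label that is missing on either side yields NaN.
plain = a.floordiv(b)

# With fill_value=3, a value missing on exactly one side is replaced by 3
# before dividing. Label "e" is missing on both sides, so it stays NaN.
filled = a.floordiv(b, fill_value=3)

print(plain)
print(filled)
```

Here plain matches a // b, while filled computes 8 // 3 for "b", 9 // 3 for "c", and 3 // 4 for "d".
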
ge(other, level=None, fill_value=None, axis=0)#

Test whether DataFrame or Series is greater than or equal to other, element-wise (binary operator ge).

This docstring was copied from cudf.core.series.Series.ge.

Some inconsistencies with the Dask version may exist.

Equivalent to frame >= other, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters:
otherscalar, sequence, Series, or DataFrame

Any single or multiple element data structure, or list-like object.

axisint or string

Only 0 is supported for Series; 1 or ‘columns’ is supported for DataFrame.

levelint or name

Broadcast across a level, matching Index values on the passed MultiIndex level. Not yet supported.

fill_valuefloat or None, default None

Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing.

Returns:
DataFrame or Series

Result of the comparison.

Examples

DataFrame

>>> df = cudf.DataFrame(  
...     {'angles': [0, 3, 4], 'degrees': [360, 180, 360]},
...     index=['circle', 'triangle', 'rectangle']
... )
>>> df.ge(1)  
           angles  degrees
circle      False     True
triangle     True     True
rectangle    True     True

Series

>>> a = cudf.Series([1, 1, 1, None], index=['a', 'b', 'c', 'd'])  
>>> b = cudf.Series([1, None, 1, None], index=['a', 'b', 'd', 'e'])  
>>> a.ge(b)  
a    True
b    <NA>
c    <NA>
d    <NA>
e    <NA>
dtype: bool
>>> a.ge(b, fill_value=0)  
a   True
b    True
c    True
d   False
e    <NA>
dtype: bool
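
The alignment and fill_value rule behind the Series output above can be sketched in plain Python. The helper flex_compare is hypothetical and is not part of cuDF or Dask-cuDF; it uses dicts for labelled data and None for a missing value.

```python
import operator

def flex_compare(left, right, op, fill_value=None):
    """Hypothetical helper: emulate an aligned comparison with fill_value.

    left and right map labels to values; None stands for a missing value.
    """
    result = {}
    for label in sorted(set(left) | set(right)):
        x = left.get(label)
        y = right.get(label)
        if fill_value is not None:
            # Substitute fill_value only when exactly one side is missing.
            if x is None and y is not None:
                x = fill_value
            elif y is None and x is not None:
                y = fill_value
        # A value that is still missing on either side gives a missing result.
        result[label] = None if x is None or y is None else op(x, y)
    return result

a = {"a": 1, "b": 1, "c": 1, "d": None}
b = {"a": 1, "b": None, "d": 1, "e": None}

no_fill = flex_compare(a, b, operator.ge)
with_fill = flex_compare(a, b, operator.ge, fill_value=0)
print(no_fill)
print(with_fill)
```

Swapping operator.ge for operator.gt, operator.le, or operator.lt gives the corresponding sketch for the other comparison methods.
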
get_partition(n)#

Get a dask DataFrame/Series representing the nth partition.

Parameters:
nint

The 0-indexed partition number to select.

Returns:
Dask DataFrame or Series

The same type as the original object.

Examples

>>> import dask
>>> ddf = dask.datasets.timeseries(start="2021-01-01", end="2021-01-07", freq="1h")
>>> ddf.get_partition(0)  
Dask DataFrame Structure:
                 name     id        x        y
npartitions=1
2021-01-01     string  int64  float64  float64
2021-01-02        ...    ...      ...      ...
Dask Name: get-partition, 3 graph layers
groupby(*args, **kwargs)#

Group Series using a mapper or by a Series of columns.

This docstring was copied from pandas.core.series.Series.groupby.

Some inconsistencies with the Dask version may exist.

A groupby operation involves some combination of splitting the object, applying a function, and combining the results. This can be used to group large amounts of data and compute operations on these groups.

Parameters:
bymapping, function, label, or list of labels

Used to determine the groups for the groupby. If by is a function, it’s called on each value of the object’s index. If a dict or Series is passed, the Series or dict VALUES will be used to determine the groups (the Series’ values are first aligned; see .align() method). If a list or ndarray of length equal to the selected axis is passed (see the groupby user guide), the values are used as-is to determine the groups. A label or list of labels may be passed to group by the columns in self. Notice that a tuple is interpreted as a (single) key.

axis{0 or ‘index’, 1 or ‘columns’}, default 0 (Not supported in Dask)

Split along rows (0) or columns (1). For Series this parameter is unused and defaults to 0.

levelint, level name, or sequence of such, default None (Not supported in Dask)

If the axis is a MultiIndex (hierarchical), group by a particular level or levels. Do not specify both by and level.

as_indexbool, default True (Not supported in Dask)

For aggregated output, return object with group labels as the index. Only relevant for DataFrame input. as_index=False is effectively “SQL-style” grouped output.

sortbool, default True

Sort group keys. Get better performance by turning this off. Note this does not influence the order of observations within each group. Groupby preserves the order of rows within each group.

group_keysbool, optional

When calling apply and the by argument produces a like-indexed (i.e. a transform) result, add group keys to index to identify pieces. By default group keys are not included when the result’s index (and column) labels match the inputs, and are included otherwise. This argument has no effect if the result produced is not like-indexed with respect to the input.

Changed in version 1.5.0: Warns that group_keys will no longer be ignored when the result from apply is a like-indexed Series or DataFrame. Specify group_keys explicitly to include the group keys or not.

squeezebool, default False (Not supported in Dask)

Reduce the dimensionality of the return type if possible, otherwise return a consistent type.

Deprecated since version 1.1.0.

observedbool, default False

This only applies if any of the groupers are Categoricals. If True: only show observed values for categorical groupers. If False: show all values for categorical groupers.

dropnabool, default True

If True, and if group keys contain NA values, NA values together with row/column will be dropped. If False, NA values will also be treated as the key in groups.

New in version 1.1.0.

Returns:
SeriesGroupBy

Returns a groupby object that contains information about the groups.

See also

resample

Convenience method for frequency conversion and resampling of time series.

Notes

See the user guide for more detailed usage and examples, including splitting an object into groups, iterating through groups, selecting a group, aggregation, and more.

Examples

>>> ser = pd.Series([390., 350., 30., 20.],  
...                 index=['Falcon', 'Falcon', 'Parrot', 'Parrot'], name="Max Speed")
>>> ser  
Falcon    390.0
Falcon    350.0
Parrot     30.0
Parrot     20.0
Name: Max Speed, dtype: float64
>>> ser.groupby(["a", "b", "a", "b"]).mean()  
a    210.0
b    185.0
Name: Max Speed, dtype: float64
>>> ser.groupby(level=0).mean()  
Falcon    370.0
Parrot     25.0
Name: Max Speed, dtype: float64
>>> ser.groupby(ser > 100).mean()  
Max Speed
False     25.0
True     370.0
Name: Max Speed, dtype: float64

Grouping by Indexes

We can group by different levels of a hierarchical index using the level parameter:

>>> arrays = [['Falcon', 'Falcon', 'Parrot', 'Parrot'],  
...           ['Captive', 'Wild', 'Captive', 'Wild']]
>>> index = pd.MultiIndex.from_arrays(arrays, names=('Animal', 'Type'))  
>>> ser = pd.Series([390., 350., 30., 20.], index=index, name="Max Speed")  
>>> ser  
Animal  Type
Falcon  Captive    390.0
        Wild       350.0
Parrot  Captive     30.0
        Wild        20.0
Name: Max Speed, dtype: float64
>>> ser.groupby(level=0).mean()  
Animal
Falcon    370.0
Parrot     25.0
Name: Max Speed, dtype: float64
>>> ser.groupby(level="Type").mean()  
Type
Captive    210.0
Wild       185.0
Name: Max Speed, dtype: float64

We can also choose whether to include NA in group keys by setting the dropna parameter; the default is True.

>>> ser = pd.Series([1, 2, 3, 3], index=["a", 'a', 'b', np.nan])  
>>> ser.groupby(level=0).sum()  
a    3
b    3
dtype: int64
>>> ser.groupby(level=0, dropna=False).sum()  
a    3
b    3
NaN  3
dtype: int64
>>> arrays = ['Falcon', 'Falcon', 'Parrot', 'Parrot']  
>>> ser = pd.Series([390., 350., 30., 20.], index=arrays, name="Max Speed")  
>>> ser.groupby(["a", "b", "a", np.nan]).mean()  
a    210.0
b    350.0
Name: Max Speed, dtype: float64
>>> ser.groupby(["a", "b", "a", np.nan], dropna=False).mean()  
a    210.0
b    350.0
NaN   20.0
Name: Max Speed, dtype: float64
gt(other, level=None, fill_value=None, axis=0)#

Test whether DataFrame or Series is greater than other, element-wise (binary operator gt).

This docstring was copied from cudf.core.series.Series.gt.

Some inconsistencies with the Dask version may exist.

Equivalent to frame > other, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters:
otherscalar, sequence, Series, or DataFrame

Any single or multiple element data structure, or list-like object.

axisint or string

Only 0 is supported for Series; 1 or ‘columns’ is supported for DataFrame.

levelint or name

Broadcast across a level, matching Index values on the passed MultiIndex level. Not yet supported.

fill_valuefloat or None, default None

Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing.

Returns:
DataFrame or Series

Result of the comparison.

Examples

DataFrame

>>> df = cudf.DataFrame(  
...     {'angles': [0, 3, 4], 'degrees': [360, 180, 360]},
...     index=['circle', 'triangle', 'rectangle']
... )
>>> df.gt(1)  
           angles  degrees
circle      False     True
triangle     True     True
rectangle    True     True

Series

>>> a = cudf.Series([1, 1, 1, None], index=['a', 'b', 'c', 'd'])  
>>> b = cudf.Series([1, None, 1, None], index=['a', 'b', 'd', 'e'])  
>>> a.gt(b)  
a   False
b    <NA>
c    <NA>
d    <NA>
e    <NA>
dtype: bool
>>> a.gt(b, fill_value=0)  
a   False
b    True
c    True
d   False
e    <NA>
dtype: bool
head(n=5, npartitions=1, compute=True)#

First n rows of the dataset

Parameters:
nint, optional

The number of rows to return. Default is 5.

npartitionsint, optional

Elements are only taken from the first npartitions, with a default of 1. If there are fewer than n rows in the first npartitions, a warning is raised and whatever rows were found are returned. Pass -1 to use all partitions.

computebool, optional

Whether to compute the result, default is True.
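
The interaction between n and npartitions can be sketched in plain Python. The function below is a hypothetical stand-in, not the Dask implementation: it models a collection as a list of partitions, each a list of rows, and the warning text is invented for illustration.

```python
import warnings

def head(partitions, n=5, npartitions=1):
    """Hypothetical stand-in for head(): partitions is a list of row lists."""
    if npartitions == -1:
        npartitions = len(partitions)
    # Only the first npartitions partitions are inspected.
    rows = [row for part in partitions[:npartitions] for row in part]
    if len(rows) < n:
        warnings.warn(f"Insufficient elements for head: {n} requested, {len(rows)} found")
    return rows[:n]

parts = [[0, 1, 2], [3, 4, 5], [6, 7, 8]]

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    short = head(parts, n=5)                   # first partition holds only 3 rows

full = head(parts, n=5, npartitions=2)         # two partitions hold enough rows
everything = head(parts, n=5, npartitions=-1)  # -1 inspects every partition
print(short, full, everything, len(caught))
```

The first call returns fewer rows than requested and emits a warning; widening npartitions returns all five rows.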

idxmax(axis=None, skipna=True, split_every=False, numeric_only=_NoDefault.no_default)#

Return index of first occurrence of maximum over requested axis.

This docstring was copied from pandas.core.frame.DataFrame.idxmax.

Some inconsistencies with the Dask version may exist.

NA/null values are excluded.

Parameters:
axis{0 or ‘index’, 1 or ‘columns’}, default 0

The axis to use. 0 or ‘index’ for row-wise, 1 or ‘columns’ for column-wise.

skipnabool, default True

Exclude NA/null values. If an entire row/column is NA, the result will be NA.

numeric_onlybool, default False

Include only float, int or boolean data.

New in version 1.5.0.

Returns:
Series

Indexes of maxima along the specified axis.

Raises:
ValueError
  • If the row/column is empty

See also

Series.idxmax

Return index of the maximum element.

Notes

This method is the DataFrame version of ndarray.argmax.

Examples

Consider a dataset containing food consumption in Argentina.

>>> df = pd.DataFrame({'consumption': [10.51, 103.11, 55.48],  
...                    'co2_emissions': [37.2, 19.66, 1712]},
...                    index=['Pork', 'Wheat Products', 'Beef'])
>>> df  
                consumption  co2_emissions
Pork                  10.51         37.20
Wheat Products       103.11         19.66
Beef                  55.48       1712.00

By default, it returns the index for the maximum value in each column.

>>> df.idxmax()  
consumption     Wheat Products
co2_emissions             Beef
dtype: object

To return the index for the maximum value in each row, use axis="columns".

>>> df.idxmax(axis="columns")  
Pork              co2_emissions
Wheat Products     consumption
Beef              co2_emissions
dtype: object
idxmin(axis=None, skipna=True, split_every=False, numeric_only=_NoDefault.no_default)#

Return index of first occurrence of minimum over requested axis.

This docstring was copied from pandas.core.frame.DataFrame.idxmin.

Some inconsistencies with the Dask version may exist.

NA/null values are excluded.

Parameters:
axis{0 or ‘index’, 1 or ‘columns’}, default 0

The axis to use. 0 or ‘index’ for row-wise, 1 or ‘columns’ for column-wise.

skipnabool, default True

Exclude NA/null values. If an entire row/column is NA, the result will be NA.

numeric_onlybool, default False

Include only float, int or boolean data.

New in version 1.5.0.

Returns:
Series

Indexes of minima along the specified axis.

Raises:
ValueError
  • If the row/column is empty

See also

Series.idxmin

Return index of the minimum element.

Notes

This method is the DataFrame version of ndarray.argmin.

Examples

Consider a dataset containing food consumption in Argentina.

>>> df = pd.DataFrame({'consumption': [10.51, 103.11, 55.48],  
...                    'co2_emissions': [37.2, 19.66, 1712]},
...                    index=['Pork', 'Wheat Products', 'Beef'])
>>> df  
                consumption  co2_emissions
Pork                  10.51         37.20
Wheat Products       103.11         19.66
Beef                  55.48       1712.00

By default, it returns the index for the minimum value in each column.

>>> df.idxmin()  
consumption                Pork
co2_emissions    Wheat Products
dtype: object

To return the index for the minimum value in each row, use axis="columns".

>>> df.idxmin(axis="columns")  
Pork                consumption
Wheat Products    co2_emissions
Beef                consumption
dtype: object
property index#

Return dask Index instance

property is_monotonic#

Return boolean if values in the object are monotonically increasing.

This docstring was copied from pandas.core.series.Series.is_monotonic.

Some inconsistencies with the Dask version may exist.

Deprecated since version 1.5.0: is_monotonic is deprecated and will be removed in a future version. Use is_monotonic_increasing instead.

Returns:
bool
property is_monotonic_decreasing#

Return boolean if values in the object are monotonically decreasing.

This docstring was copied from pandas.core.series.Series.is_monotonic_decreasing.

Some inconsistencies with the Dask version may exist.

Returns:
bool
property is_monotonic_increasing#

Return boolean if values in the object are monotonically increasing.

This docstring was copied from pandas.core.series.Series.is_monotonic_increasing.

Some inconsistencies with the Dask version may exist.

Returns:
bool
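
A short pandas sketch (pandas being the source of these docstrings) shows that both properties are non-strict, so repeated values do not break monotonicity. The series below are illustrative.

```python
import pandas as pd

non_strict = pd.Series([1, 2, 2, 3])
falling = pd.Series([3, 2, 2, 1])
mixed = pd.Series([1, 3, 2])

# Repeated values are allowed in both directions.
print(non_strict.is_monotonic_increasing)
print(falling.is_monotonic_decreasing)

# A series that goes up and then down is neither.
print(mixed.is_monotonic_increasing, mixed.is_monotonic_decreasing)
```
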
isin(values)#

Whether elements in Series are contained in values.

This docstring was copied from pandas.core.series.Series.isin.

Some inconsistencies with the Dask version may exist.

Return a boolean Series showing whether each element in the Series matches an element in the passed sequence of values exactly.

Parameters:
valuesset or list-like

The sequence of values to test. Passing in a single string will raise a TypeError. Instead, turn a single string into a list of one element.

Returns:
Series

Series of booleans indicating if each element is in values.

Raises:
TypeError
  • If values is a string

See also

DataFrame.isin

Equivalent method on DataFrame.

Examples

>>> s = pd.Series(['lama', 'cow', 'lama', 'beetle', 'lama',  
...                'hippo'], name='animal')
>>> s.isin(['cow', 'lama'])  
0     True
1     True
2     True
3    False
4     True
5    False
Name: animal, dtype: bool

To invert the boolean values, use the ~ operator:

>>> ~s.isin(['cow', 'lama'])  
0    False
1    False
2    False
3     True
4    False
5     True
Name: animal, dtype: bool

Passing a single string as s.isin('lama') will raise an error. Use a list of one element instead:

>>> s.isin(['lama'])  
0     True
1    False
2     True
3    False
4     True
5    False
Name: animal, dtype: bool

Strings and integers are distinct and are therefore not comparable:

>>> pd.Series([1]).isin(['1'])  
0    False
dtype: bool
>>> pd.Series([1.1]).isin(['1.1'])  
0    False
dtype: bool
isna()#

Detect missing values.

This docstring was copied from pandas.core.frame.DataFrame.isna.

Some inconsistencies with the Dask version may exist.

Return a boolean same-sized object indicating if the values are NA. NA values, such as None or numpy.NaN, get mapped to True values. Everything else gets mapped to False values. Characters such as empty strings '' or numpy.inf are not considered NA values (unless you set pandas.options.mode.use_inf_as_na = True).

Returns:
DataFrame

Mask of bool values for each element in DataFrame that indicates whether an element is an NA value.

See also

DataFrame.isnull

Alias of isna.

DataFrame.notna

Boolean inverse of isna.

DataFrame.dropna

Omit axes labels with missing values.

isna

Top-level isna.

Examples

Show which entries in a DataFrame are NA.

>>> df = pd.DataFrame(dict(age=[5, 6, np.NaN],  
...                    born=[pd.NaT, pd.Timestamp('1939-05-27'),
...                          pd.Timestamp('1940-04-25')],
...                    name=['Alfred', 'Batman', ''],
...                    toy=[None, 'Batmobile', 'Joker']))
>>> df  
   age       born    name        toy
0  5.0        NaT  Alfred       None
1  6.0 1939-05-27  Batman  Batmobile
2  NaN 1940-04-25              Joker
>>> df.isna()  
     age   born   name    toy
0  False   True  False   True
1  False  False  False  False
2   True  False  False  False

Show which entries in a Series are NA.

>>> ser = pd.Series([5, 6, np.NaN])  
>>> ser  
0    5.0
1    6.0
2    NaN
dtype: float64
>>> ser.isna()  
0    False
1    False
2     True
dtype: bool
isnull()#

DataFrame.isnull is an alias for DataFrame.isna.

This docstring was copied from pandas.core.frame.DataFrame.isnull.

Some inconsistencies with the Dask version may exist.

Detect missing values.

Return a boolean same-sized object indicating if the values are NA. NA values, such as None or numpy.NaN, get mapped to True values. Everything else gets mapped to False values. Characters such as empty strings '' or numpy.inf are not considered NA values (unless you set pandas.options.mode.use_inf_as_na = True).

Returns:
DataFrame

Mask of bool values for each element in DataFrame that indicates whether an element is an NA value.

See also

DataFrame.isnull

Alias of isna.

DataFrame.notna

Boolean inverse of isna.

DataFrame.dropna

Omit axes labels with missing values.

isna

Top-level isna.

Examples

Show which entries in a DataFrame are NA.

>>> df = pd.DataFrame(dict(age=[5, 6, np.NaN],  
...                    born=[pd.NaT, pd.Timestamp('1939-05-27'),
...                          pd.Timestamp('1940-04-25')],
...                    name=['Alfred', 'Batman', ''],
...                    toy=[None, 'Batmobile', 'Joker']))
>>> df  
   age       born    name        toy
0  5.0        NaT  Alfred       None
1  6.0 1939-05-27  Batman  Batmobile
2  NaN 1940-04-25              Joker
>>> df.isna()  
     age   born   name    toy
0  False   True  False   True
1  False  False  False  False
2   True  False  False  False

Show which entries in a Series are NA.

>>> ser = pd.Series([5, 6, np.NaN])  
>>> ser  
0    5.0
1    6.0
2    NaN
dtype: float64
>>> ser.isna()  
0    False
1    False
2     True
dtype: bool
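
Because isnull is an alias, it should produce the same mask as isna. The pandas sketch below checks this on a small illustrative frame; it assumes Dask-cuDF keeps the same alias relationship.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [5, 6, np.nan],
                   "toy": [None, "Batmobile", "Joker"]})

# Both calls return a boolean mask of the same shape as df.
mask_isnull = df.isnull()
mask_isna = df.isna()

print(mask_isnull)
print(mask_isnull.equals(mask_isna))
```
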
iteritems()#

Lazily iterate over (index, value) tuples.

This docstring was copied from pandas.core.series.Series.iteritems.

Some inconsistencies with the Dask version may exist.

Deprecated since version 1.5.0: iteritems is deprecated and will be removed in a future version. Use .items instead.

This method returns an iterable of (index, value) tuples. This is convenient if you want to create a lazy iterator.

Returns:
iterable

Iterable of tuples containing the (index, value) pairs from a Series.

See also

Series.items

Recommended alternative.

DataFrame.items

Iterate over (column name, Series) pairs.

DataFrame.iterrows

Iterate over DataFrame rows as (index, Series) pairs.
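
Since iteritems is deprecated, the recommended items method is the one worth illustrating. The sketch below uses a pandas Series with illustrative data.

```python
import pandas as pd

s = pd.Series([10, 20, 30], index=["a", "b", "c"])

# items() yields (index, value) pairs one at a time.
for label, value in s.items():
    print(label, value)

# Materialise the pairs only when a list is actually needed.
pairs = list(s.items())
```
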

property known_divisions#

Whether divisions are already known

kurtosis(axis=0, fisher=True, bias=True, nan_policy='propagate', out=None, numeric_only=_NoDefault.no_default)#

Return unbiased kurtosis over requested axis.

This docstring was copied from pandas.core.frame.DataFrame.kurtosis.

Some inconsistencies with the Dask version may exist.

Note

This implementation follows the dask.array.stats implementation of kurtosis and calculates kurtosis without taking into account a bias term for finite sample size, which corresponds to the default settings of the scipy.stats kurtosis calculation. This differs from pandas.

Further, this method currently does not support filtering out NaN values, which is another difference from pandas.

Kurtosis obtained using Fisher’s definition of kurtosis (kurtosis of normal == 0.0). Normalized by N-1.

Parameters:
axis{index (0), columns (1)}

Axis for the function to be applied on. For Series this parameter is unused and defaults to 0.

skipnabool, default True (Not supported in Dask)

Exclude NA/null values when computing the result.

levelint or level name, default None (Not supported in Dask)

If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series.

Deprecated since version 1.3.0: The level keyword is deprecated. Use groupby instead.

numeric_onlybool, default None

Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for Series.

Deprecated since version 1.5.0: Specifying numeric_only=None is deprecated. The default value will be False in a future version of pandas.

**kwargs

Additional keyword arguments to be passed to the function.

Returns:
Series or DataFrame (if level specified)
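
The note above says the calculation applies no finite-sample bias correction, matching the defaults fisher=True and bias=True in the signature. A minimal sketch of that textbook formula in plain Python follows; the function name is hypothetical and this is not the library code.

```python
def biased_excess_kurtosis(values):
    """Fisher (excess) kurtosis from population moments, no bias correction."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n  # second central moment
    m4 = sum((x - mean) ** 4 for x in values) / n  # fourth central moment
    # Subtracting 3 makes the kurtosis of a normal distribution equal to 0.
    return m4 / m2 ** 2 - 3.0

result = biased_excess_kurtosis([1, 2, 3, 4, 5])
print(result)
```

For the values 1 to 5 the central moments are 2 and 6.8, so the result is approximately -1.3. The bias-corrected estimator that pandas uses would generally give a different number for the same data.
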
last(offset)#

Select final periods of time series data based on a date offset.

This docstring was copied from pandas.core.frame.DataFrame.last.

Some inconsistencies with the Dask version may exist.

For a DataFrame with a sorted DatetimeIndex, this function selects the last few rows based on a date offset.

Parameters:
offsetstr, DateOffset, dateutil.relativedelta

The offset length of the data that will be selected. For instance, ‘3D’ will display all the rows having their index within the last 3 days.

Returns:
Series or DataFrame

A subset of the caller.

Raises:
TypeError

If the index is not a DatetimeIndex

See also

first

Select initial periods of time series based on a date offset.

at_time

Select values at a particular time of the day.

between_time

Select values between particular times of the day.

Examples

>>> i = pd.date_range('2018-04-09', periods=4, freq='2D')  
>>> ts = pd.DataFrame({'A': [1, 2, 3, 4]}, index=i)  
>>> ts  
            A
2018-04-09  1
2018-04-11  2
2018-04-13  3
2018-04-15  4

Get the rows for the last 3 days:

>>> ts.last('3D')  
            A
2018-04-13  3
2018-04-15  4

Notice that data for the last 3 calendar days were returned, not the last 3 observed days in the dataset; therefore data for 2018-04-11 was not returned.

le(other, level=None, fill_value=None, axis=0)#

Test whether DataFrame or Series is less than or equal to other, element-wise (binary operator le).

This docstring was copied from cudf.core.series.Series.le.

Some inconsistencies with the Dask version may exist.

Equivalent to frame <= other, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters:
otherscalar, sequence, Series, or DataFrame

Any single or multiple element data structure, or list-like object.

axisint or string

Only 0 is supported for Series; 1 or ‘columns’ is supported for DataFrame.

levelint or name

Broadcast across a level, matching Index values on the passed MultiIndex level. Not yet supported.

fill_valuefloat or None, default None

Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing.

Returns:
DataFrame or Series

Result of the comparison.

Examples

DataFrame

>>> df = cudf.DataFrame(  
...     {'angles': [0, 3, 4], 'degrees': [360, 180, 360]},
...     index=['circle', 'triangle', 'rectangle']
... )
>>> df.le(1)  
           angles  degrees
circle       True    False
triangle    False    False
rectangle   False    False

Series

>>> a = cudf.Series([1, 1, 1, None], index=['a', 'b', 'c', 'd'])  
>>> b = cudf.Series([1, None, 1, None], index=['a', 'b', 'd', 'e'])  
>>> a.le(b)  
a    True
b    <NA>
c    <NA>
d    <NA>
e    <NA>
dtype: bool
>>> a.le(b, fill_value=0)  
a    True
b   False
c   False
d    True
e    <NA>
dtype: bool
property loc#

Purely label-location based indexer for selection by label.

>>> df.loc["b"]  
>>> df.loc["b":"d"]  
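
The two calls above assume a frame with a sorted label index. The pandas sketch below defines such a frame (illustrative data) and shows that a label slice includes both endpoints; a Dask-cuDF collection is lazy and may return a different container type for the single-label case.

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3, 4, 5]}, index=list("abcde"))

row = df.loc["b"]         # single label (pandas returns this row as a Series)
window = df.loc["b":"d"]  # label slice: "b" and "d" are both included

print(row)
print(window)
```
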
lt(other, level=None, fill_value=None, axis=0)#

Test whether DataFrame or Series is less than other, element-wise (binary operator lt).

This docstring was copied from cudf.core.series.Series.lt.

Some inconsistencies with the Dask version may exist.

Equivalent to frame < other, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters:
otherscalar, sequence, Series, or DataFrame

Any single or multiple element data structure, or list-like object.

axisint or string

Only 0 is supported for Series; 1 or ‘columns’ is supported for DataFrame.

levelint or name

Broadcast across a level, matching Index values on the passed MultiIndex level. Not yet supported.

fill_valuefloat or None, default None

Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing.

Returns:
DataFrame or Series

Result of the comparison.

Examples

DataFrame

>>> df = cudf.DataFrame(  
...     {'angles': [0, 3, 4], 'degrees': [360, 180, 360]},
...     index=['circle', 'triangle', 'rectangle']
... )
>>> df.lt(1)  
           angles  degrees
circle       True    False
triangle    False    False
rectangle   False    False

Series

>>> a = cudf.Series([1, 1, 1, None], index=['a', 'b', 'c', 'd'])  
>>> b = cudf.Series([1, None, 1, None], index=['a', 'b', 'd', 'e'])  
>>> a.lt(b)  
a   False
b    <NA>
c    <NA>
d    <NA>
e    <NA>
dtype: bool
>>> a.lt(b, fill_value=0)  
a   False
b   False
c   False
d    True
e    <NA>
dtype: bool
map(arg, na_action=None, meta=_NoDefault.no_default)#

Map values of Series according to an input mapping or function.

This docstring was copied from pandas.core.series.Series.map.

Some inconsistencies with the Dask version may exist.

Used for substituting each value in a Series with another value, that may be derived from a function, a dict or a Series.

Parameters:
argfunction, collections.abc.Mapping subclass or Series

Mapping correspondence.

na_action{None, ‘ignore’}, default None

If ‘ignore’, propagate NaN values, without passing them to the mapping correspondence.

metapd.DataFrame, pd.Series, dict, iterable, tuple, optional

An empty pd.DataFrame or pd.Series that matches the dtypes and column names of the output. This metadata is necessary for many algorithms in dask dataframe to work. For ease of use, some alternative inputs are also available. Instead of a DataFrame, a dict of {name: dtype} or iterable of (name, dtype) can be provided (note that the order of the names should match the order of the columns). Instead of a series, a tuple of (name, dtype) can be used. If not provided, dask will try to infer the metadata. This may lead to unexpected results, so providing meta is recommended. For more information, see dask.dataframe.utils.make_meta.

Returns:
Series

Same index as caller.

See also

Series.apply

For applying more complex functions on a Series.

DataFrame.apply

Apply a function row-/column-wise.

DataFrame.applymap

Apply a function elementwise on a whole DataFrame.

Notes

When arg is a dictionary, values in Series that are not in the dictionary (as keys) are converted to NaN. However, if the dictionary is a dict subclass that defines __missing__ (i.e. provides a method for default values), then this default is used rather than NaN.
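
The lookup rule in this note can be modelled in plain Python. The sketch below is not how pandas or Dask implement map; map_with_dict and DefaultingDict are hypothetical names introduced only for illustration, and the input deliberately contains no missing values.

```python
import math


class DefaultingDict(dict):
    """Hypothetical dict subclass that supplies a default via __missing__."""

    def __missing__(self, key):
        return "unknown"


def map_with_dict(values, mapping):
    """Look up each value; keys that cannot be resolved become NaN."""
    result = []
    for value in values:
        try:
            # Subscript access triggers __missing__ when the subclass defines it.
            result.append(mapping[value])
        except KeyError:
            result.append(math.nan)
    return result


animals = ["cat", "dog", "rabbit"]
plain = map_with_dict(animals, {"cat": "kitten", "dog": "puppy"})
defaulted = map_with_dict(animals, DefaultingDict(cat="kitten", dog="puppy"))
print(plain)
print(defaulted)
```

With the plain dict, the unmatched key ends up as NaN; with the subclass, the value returned by __missing__ is used instead.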

Examples

>>> s = pd.Series(['cat', 'dog', np.nan, 'rabbit'])  
>>> s  
0      cat
1      dog
2      NaN
3   rabbit
dtype: object

map accepts a dict or a Series. Values that are not found in the dict are converted to NaN, unless the dict has a default value (e.g. defaultdict):

>>> s.map({'cat': 'kitten', 'dog': 'puppy'})  
0   kitten
1    puppy
2      NaN
3      NaN
dtype: object

It also accepts a function:

>>> s.map('I am a {}'.format)  
0       I am a cat
1       I am a dog
2       I am a nan
3    I am a rabbit
dtype: object

To avoid applying the function to missing values (and keep them as NaN) na_action='ignore' can be used:

>>> s.map('I am a {}'.format, na_action='ignore')  
0     I am a cat
1     I am a dog
2            NaN
3  I am a rabbit
dtype: object
map_overlap(func, before, after, *args, **kwargs)#

Apply a function to each partition, sharing rows with adjacent partitions.

This can be useful for implementing windowing functions such as df.rolling(...).mean() or df.diff().

Parameters:
funcfunction

Function applied to each partition.

beforeint, timedelta or string timedelta

The rows to prepend to partition i from the end of partition i - 1.

afterint, timedelta or string timedelta

The rows to append to partition i from the beginning of partition i + 1.

args, kwargs

Positional and keyword arguments to pass to the function. Positional arguments are computed on a per-partition basis, while keyword arguments are shared across all partitions. The partition itself will be the first positional argument, with all other arguments passed after. Arguments can be Scalar, Delayed, or regular Python objects. DataFrame-like args (both dask and pandas) will be repartitioned to align (if necessary) before applying the function; see align_dataframes to control this behavior.

enforce_metadatabool, default True

Whether to enforce at runtime that the structure of the DataFrame produced by func actually matches the structure of meta. This will rename and reorder columns for each partition, and will raise an error if this doesn’t work, but it won’t raise if dtypes don’t match.

transform_divisionsbool, default True

Whether to apply the function onto the divisions and apply those transformed divisions to the output.

align_dataframesbool, default True

Whether to repartition DataFrame- or Series-like args (both dask and pandas) so their divisions align before applying the function. This requires all inputs to have known divisions. Single-partition inputs will be split into multiple partitions.

If False, all inputs must have either the same number of partitions or a single partition. Single-partition inputs will be broadcast to every partition of multi-partition inputs.

metapd.DataFrame, pd.Series, dict, iterable, tuple, optional

An empty pd.DataFrame or pd.Series that matches the dtypes and column names of the output. This metadata is necessary for many algorithms in dask dataframe to work. For ease of use, some alternative inputs are also available. Instead of a DataFrame, a dict of {name: dtype} or iterable of (name, dtype) can be provided (note that the order of the names should match the order of the columns). Instead of a series, a tuple of (name, dtype) can be used. If not provided, dask will try to infer the metadata. This may lead to unexpected results, so providing meta is recommended. For more information, see dask.dataframe.utils.make_meta.

Notes

Given positive integers before and after, and a function func, map_overlap does the following:

  1. Prepend before rows to each partition i from the end of partition i - 1. The first partition has no rows prepended.

  2. Append after rows to each partition i from the beginning of partition i + 1. The last partition has no rows appended.

  3. Apply func to each partition, passing in any extra args and kwargs if provided.

  4. Trim before rows from the beginning of all but the first partition.

  5. Trim after rows from the end of all but the last partition.
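
The five steps above can be sketched with plain lists standing in for partitions. This is a simplified model, not Dask's implementation: it assumes before and after are non-negative integers, that every neighbouring partition holds enough rows, and that func returns one output row per input row. The rolling_sum_2 helper and the split into two partitions are hypothetical.

```python
def map_overlap(partitions, func, before, after):
    """Plain-list model of the five steps listed above."""
    out = []
    last = len(partitions) - 1
    for i, part in enumerate(partitions):
        # Steps 1 and 2: borrow rows from the neighbouring partitions.
        head = partitions[i - 1][-before:] if i > 0 and before else []
        tail = partitions[i + 1][:after] if i < last and after else []
        # Step 3: apply func to the extended partition.
        result = func(head + part + tail)
        # Steps 4 and 5: trim the borrowed rows from the result.
        start = len(head)
        stop = len(result) - len(tail)
        out.append(result[start:stop])
    return out


def rolling_sum_2(rows):
    # Trailing window of size 2; the first row has no predecessor.
    return [None] + [a + b for a, b in zip(rows, rows[1:])]


parts = [[1, 2, 4], [7, 11]]
rolled = map_overlap(parts, rolling_sum_2, before=1, after=0)
print(rolled)
```

Without the borrowed row, the first element of the second partition would have no predecessor and would come out as missing; the overlap is what lets the window span the partition boundary.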

Examples

Given a DataFrame, Series, or Index, such as:

>>> import pandas as pd
>>> import dask.dataframe as dd
>>> df = pd.DataFrame({'x': [1, 2, 4, 7, 11],
...                    'y': [1., 2., 3., 4., 5.]})
>>> ddf = dd.from_pandas(df, npartitions=2)

A rolling sum with a trailing moving window of size 2 can be computed by overlapping 2 rows before each partition, and then mapping calls to df.rolling(2).sum():

>>> ddf.compute()
    x    y
0   1  1.0
1   2  2.0
2   4  3.0
3   7  4.0
4  11  5.0
>>> ddf.map_overlap(lambda df: df.rolling(2).sum(), 2, 0).compute()
      x    y
0   NaN  NaN
1   3.0  3.0
2   6.0  5.0
3  11.0  7.0
4  18.0  9.0

The pandas diff method computes a discrete difference shifted by a number of periods (can be positive or negative). This can be implemented by mapping calls to df.diff to each partition after prepending/appending that many rows, depending on sign:

>>> def diff(df, periods=1):
...     before, after = (periods, 0) if periods > 0 else (0, -periods)
...     return df.map_overlap(lambda df, periods=1: df.diff(periods),
...                           before, after, periods=periods)
>>> diff(ddf, 1).compute()
     x    y
0  NaN  NaN
1  1.0  1.0
2  2.0  1.0
3  3.0  1.0
4  4.0  1.0

If you have a DatetimeIndex, you can use a pd.Timedelta for time-based windows or any pd.Timedelta convertible string:

>>> ts = pd.Series(range(10), index=pd.date_range('2017', periods=10))
>>> dts = dd.from_pandas(ts, npartitions=2)
>>> dts.map_overlap(lambda df: df.rolling('2D').sum(),
...                 pd.Timedelta('2D'), 0).compute()
2017-01-01     0.0
2017-01-02     1.0
2017-01-03     3.0
2017-01-04     5.0
2017-01-05     7.0
2017-01-06     9.0
2017-01-07    11.0
2017-01-08    13.0
2017-01-09    15.0
2017-01-10    17.0
Freq: D, dtype: float64
map_partitions(func, *args, **kwargs)#

Apply Python function on each DataFrame partition.

Note that the index and divisions are assumed to remain unchanged.

Parameters:
funcfunction

The function applied to each partition. If this function accepts the special partition_info keyword argument, it will receive information on the partition’s relative location within the dataframe.

args, kwargs

Positional and keyword arguments to pass to the function. Positional arguments are computed on a per-partition basis, while keyword arguments are shared across all partitions. The partition itself will be the first positional argument, with all other arguments passed after. Arguments can be Scalar, Delayed, or regular Python objects. DataFrame-like args (both dask and pandas) will be repartitioned to align (if necessary) before applying the function; see align_dataframes to control this behavior.

enforce_metadatabool, default True

Whether to enforce at runtime that the structure of the DataFrame produced by func actually matches the structure of meta. This will rename and reorder columns for each partition, and will raise an error if this doesn’t work, but it won’t raise if dtypes don’t match.

transform_divisionsbool, default True

Whether to apply the function onto the divisions and apply those transformed divisions to the output.

align_dataframesbool, default True

Whether to repartition DataFrame- or Series-like args (both dask and pandas) so their divisions align before applying the function. This requires all inputs to have known divisions. Single-partition inputs will be split into multiple partitions.

If False, all inputs must have either the same number of partitions or a single partition. Single-partition inputs will be broadcast to every partition of multi-partition inputs.

metapd.DataFrame, pd.Series, dict, iterable, tuple, optional

An empty pd.DataFrame or pd.Series that matches the dtypes and column names of the output. This metadata is necessary for many algorithms in dask dataframe to work. For ease of use, some alternative inputs are also available. Instead of a DataFrame, a dict of {name: dtype} or iterable of (name, dtype) can be provided (note that the order of the names should match the order of the columns). Instead of a series, a tuple of (name, dtype) can be used. If not provided, dask will try to infer the metadata. This may lead to unexpected results, so providing meta is recommended. For more information, see dask.dataframe.utils.make_meta.

Examples

Given a DataFrame, Series, or Index, such as:

>>> import pandas as pd
>>> import dask.dataframe as dd
>>> df = pd.DataFrame({'x': [1, 2, 3, 4, 5],
...                    'y': [1., 2., 3., 4., 5.]})
>>> ddf = dd.from_pandas(df, npartitions=2)

One can use map_partitions to apply a function on each partition. Extra arguments and keywords can optionally be provided, and will be passed to the function after the partition.

Here we apply a function with arguments and keywords to a DataFrame, resulting in a Series:

>>> def myadd(df, a, b=1):
...     return df.x + df.y + a + b
>>> res = ddf.map_partitions(myadd, 1, b=2)
>>> res.dtype
dtype('float64')

Here we apply a function to a Series resulting in a Series:

>>> res = ddf.x.map_partitions(lambda x: len(x)) # ddf.x is a Dask Series Structure
>>> res.dtype
dtype('int64')

By default, dask tries to infer the output metadata by running your provided function on some fake data. This works well in many cases, but can sometimes be expensive, or even fail. To avoid this, you can manually specify the output metadata with the meta keyword. This can be specified in many forms; for more information, see dask.dataframe.utils.make_meta.

Here we specify the output is a Series with no name, and dtype float64:

>>> res = ddf.map_partitions(myadd, 1, b=2, meta=(None, 'f8'))

Here we map a function that takes in a DataFrame, and returns a DataFrame with a new column:

>>> res = ddf.map_partitions(lambda df: df.assign(z=df.x * df.y))
>>> res.dtypes
x      int64
y    float64
z    float64
dtype: object

As before, the output metadata can also be specified manually. This time we pass in a dict, as the output is a DataFrame:

>>> res = ddf.map_partitions(lambda df: df.assign(z=df.x * df.y),
...                          meta={'x': 'i8', 'y': 'f8', 'z': 'f8'})

In the case where the metadata doesn’t change, you can also pass in the object itself directly:

>>> res = ddf.map_partitions(lambda df: df.head(), meta=ddf)

Also note that the index and divisions are assumed to remain unchanged. If the function you’re mapping changes the index/divisions, you’ll need to clear them afterwards:

>>> ddf.map_partitions(func).clear_divisions()  

Your map function gets information about where it is in the dataframe by accepting a special partition_info keyword argument.

>>> def func(partition, partition_info=None):
...     pass

This will receive the following information:

>>> partition_info  
{'number': 1, 'division': 3}

For each positional or keyword argument that is a dask dataframe, you will receive the number (n), which identifies the nth partition of the dataframe, and the division (the first index value in the partition). If divisions are not known (for instance, if the index is not sorted), you will get None as the division.
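
As a rough illustration of how a mapped function might consume partition_info, the sketch below calls a function once per partition with a dictionary of the shape shown above. The driver loop is a hypothetical stand-in for Dask, and label_rows, partitions, and first_index_values are names invented for this example.

```python
def label_rows(rows, partition_info=None):
    """Tag every row with the number of the partition it came from."""
    # A default of None keeps the function callable without partition_info.
    number = None if partition_info is None else partition_info["number"]
    return [(number, row) for row in rows]


# Hypothetical stand-in for a two-partition collection with known divisions.
partitions = [[10, 20, 30], [40, 50]]
first_index_values = [0, 3]

labelled = [
    label_rows(rows, partition_info={"number": n, "division": first_index_values[n]})
    for n, rows in enumerate(partitions)
]
print(labelled)
```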

mask(cond, other=nan)#

Replace values where the condition is True.

This docstring was copied from pandas.core.frame.DataFrame.mask.

Some inconsistencies with the Dask version may exist.

Parameters:
condbool Series/DataFrame, array-like, or callable

Where cond is False, keep the original value. Where True, replace with corresponding value from other. If cond is callable, it is computed on the Series/DataFrame and should return boolean Series/DataFrame or array. The callable must not change input Series/DataFrame (though pandas doesn’t check it).

otherscalar, Series/DataFrame, or callable

Entries where cond is True are replaced with corresponding value from other. If other is callable, it is computed on the Series/DataFrame and should return scalar or Series/DataFrame. The callable must not change input Series/DataFrame (though pandas doesn’t check it).

inplacebool, default False (Not supported in Dask)

Whether to perform the operation in place on the data.

axisint, default None (Not supported in Dask)

Alignment axis if needed. For Series this parameter is unused and defaults to 0.

levelint, default None (Not supported in Dask)

Alignment level if needed.

errorsstr, {‘raise’, ‘ignore’}, default ‘raise’ (Not supported in Dask)

Note that currently this parameter won’t affect the results and will always coerce to a suitable dtype.

  • ‘raise’ : allow exceptions to be raised.

  • ‘ignore’ : suppress exceptions. On error return original object.

Deprecated since version 1.5.0: This argument had no effect.

try_castbool, default None (Not supported in Dask)

Try to cast the result back to the input type (if possible).

Deprecated since version 1.3.0: Manually cast back if necessary.

Returns:
Same type as caller or None if inplace=True.

See also

DataFrame.where()

Return an object of same shape as self.

Notes

The mask method is an application of the if-then idiom. For each element in the calling DataFrame, if cond is False the element is used; otherwise the corresponding element from the DataFrame other is used. If the axis of other does not align with axis of cond Series/DataFrame, the misaligned index positions will be filled with True.

The signature for DataFrame.where() differs from numpy.where(). Roughly df1.where(m, df2) is equivalent to np.where(m, df1, df2).

For further details and examples see the mask documentation in indexing.

The dtype of the object takes precedence. The fill value is cast to the object’s dtype, if this can be done losslessly.
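
The element-wise rule, and the way mask mirrors where, can be modelled with plain lists. This is a simplified sketch that assumes a scalar other and inputs of equal length; it performs no index alignment or dtype handling, and the function names are only stand-ins for the methods described here.

```python
def where(values, cond, other):
    """Keep each value where cond is True; otherwise use other."""
    return [v if c else other for v, c in zip(values, cond)]


def mask(values, cond, other):
    """Replace each value where cond is True; otherwise keep it."""
    return [other if c else v for v, c in zip(values, cond)]


values = list(range(5))
cond = [v > 1 for v in values]
inverse = [not c for c in cond]

kept = where(values, cond, 10)
replaced = mask(values, cond, 10)
print(kept)
print(replaced)
print(replaced == where(values, inverse, 10))
```

Masking with a condition gives the same result as calling where with the inverted condition, which is the relationship the last comparison checks.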

Examples

>>> s = pd.Series(range(5))  
>>> s.where(s > 0)  
0    NaN
1    1.0
2    2.0
3    3.0
4    4.0
dtype: float64
>>> s.mask(s > 0)  
0    0.0
1    NaN
2    NaN
3    NaN
4    NaN
dtype: float64
>>> s = pd.Series(range(5))  
>>> t = pd.Series([True, False])  
>>> s.where(t, 99)  
0     0
1    99
2    99
3    99
4    99
dtype: int64
>>> s.mask(t, 99)  
0    99
1     1
2    99
3    99
4    99
dtype: int64
>>> s.where(s > 1, 10)  
0    10
1    10
2     2
3     3
4     4
dtype: int64
>>> s.mask(s > 1, 10)  
0     0
1     1
2    10
3    10
4    10
dtype: int64
>>> df = pd.DataFrame(np.arange(10).reshape(-1, 2), columns=['A', 'B'])  
>>> df  
   A  B
0  0  1
1  2  3
2  4  5
3  6  7
4  8  9
>>> m = df % 3 == 0  
>>> df.where(m, -df)  
   A  B
0  0 -1
1 -2  3
2 -4 -5
3  6 -7
4 -8  9
>>> df.where(m, -df) == np.where(m, df, -df)  
      A     B
0  True  True
1  True  True
2  True  True
3  True  True
4  True  True
>>> df.where(m, -df) == df.mask(~m, -df)  
      A     B
0  True  True
1  True  True
2  True  True
3  True  True
4  True  True
max(axis=0, skipna=True, split_every=False, out=None, numeric_only=None)#

Return the maximum of the values over the requested axis.

This docstring was copied from pandas.core.frame.DataFrame.max.

Some inconsistencies with the Dask version may exist.

If you want the index of the maximum, use idxmax. This is the equivalent of the numpy.ndarray method argmax.

Parameters:
axis{index (0), columns (1)}

Axis for the function to be applied on. For Series this parameter is unused and defaults to 0.

skipnabool, default True

Exclude NA/null values when computing the result.

levelint or level name, default None (Not supported in Dask)

If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series.

Deprecated since version 1.3.0: The level keyword is deprecated. Use groupby instead.

numeric_onlybool, default None

Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for Series.

Deprecated since version 1.5.0: Specifying numeric_only=None is deprecated. The default value will be False in a future version of pandas.

**kwargs

Additional keyword arguments to be passed to the function.

Returns:
Series or DataFrame (if level specified)

See also

Series.sum

Return the sum.

Series.min

Return the minimum.

Series.max

Return the maximum.

Series.idxmin

Return the index of the minimum.

Series.idxmax

Return the index of the maximum.

DataFrame.sum

Return the sum over the requested axis.

DataFrame.min

Return the minimum over the requested axis.

DataFrame.max

Return the maximum over the requested axis.

DataFrame.idxmin

Return the index of the minimum over the requested axis.

DataFrame.idxmax

Return the index of the maximum over the requested axis.

Examples

>>> idx = pd.MultiIndex.from_arrays([  
...     ['warm', 'warm', 'cold', 'cold'],
...     ['dog', 'falcon', 'fish', 'spider']],
...     names=['blooded', 'animal'])
>>> s = pd.Series([4, 2, 0, 8], name='legs', index=idx)  
>>> s  
blooded  animal
warm     dog       4
         falcon    2
cold     fish      0
         spider    8
Name: legs, dtype: int64
>>> s.max()  
8
mean(split_every=False)#

Return the mean of the values over the requested axis.

This docstring was copied from pandas.core.frame.DataFrame.mean.

Some inconsistencies with the Dask version may exist.

Parameters:
axis{index (0), columns (1)}

Axis for the function to be applied on. For Series this parameter is unused and defaults to 0.

skipnabool, default True

Exclude NA/null values when computing the result.

levelint or level name, default None (Not supported in Dask)

If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series.

Deprecated since version 1.3.0: The level keyword is deprecated. Use groupby instead.

numeric_onlybool, default None

Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for Series.

Deprecated since version 1.5.0: Specifying numeric_only=None is deprecated. The default value will be False in a future version of pandas.

**kwargs

Additional keyword arguments to be passed to the function.

Returns:
Series or DataFrame (if level specified)
median(method='default')#

Return the median of the values over the requested axis.

This docstring was copied from pandas.core.series.Series.median.

Some inconsistencies with the Dask version may exist.

Parameters:
axis{index (0)} (Not supported in Dask)

Axis for the function to be applied on. For Series this parameter is unused and defaults to 0.

skipnabool, default True (Not supported in Dask)

Exclude NA/null values when computing the result.

levelint or level name, default None (Not supported in Dask)

If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a scalar.

Deprecated since version 1.3.0: The level keyword is deprecated. Use groupby instead.

numeric_onlybool, default None (Not supported in Dask)

Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for Series.

Deprecated since version 1.5.0: Specifying numeric_only=None is deprecated. The default value will be False in a future version of pandas.

**kwargs

Additional keyword arguments to be passed to the function.

Returns:
scalar or Series (if level specified)
median_approximate(method='default')#

Return the approximate median of the values over the requested axis.

Parameters:
method{‘default’, ‘tdigest’, ‘dask’}, optional

What method to use. By default, Dask’s internal custom algorithm ("dask") is used. If set to "tdigest", tdigest is used for floats and ints, falling back to "dask" otherwise.

memory_usage(index=True, deep=False)#

Return the memory usage of the Series.

This docstring was copied from pandas.core.series.Series.memory_usage.

Some inconsistencies with the Dask version may exist.

The memory usage can optionally include the contribution of the index and of elements of object dtype.

Parameters:
indexbool, default True

Specifies whether to include the memory usage of the Series index.

deepbool, default False

If True, introspect the data deeply by interrogating object dtypes for system-level memory consumption, and include it in the returned value.

Returns:
int

Bytes of memory consumed.

See also

numpy.ndarray.nbytes

Total bytes consumed by the elements of the array.

DataFrame.memory_usage

Bytes consumed by a DataFrame.

Examples

>>> s = pd.Series(range(3))  
>>> s.memory_usage()  
152

Not including the index gives the size of the rest of the data, which is necessarily smaller:

>>> s.memory_usage(index=False)  
24
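
The 24 bytes reported here are consistent with three 8-byte integers. The check below uses the standard-library array module as a stand-in for the data buffer; that substitution is an assumption made for illustration, not a description of how pandas stores its values.

```python
from array import array

# Type code 'q' is a signed 8-byte integer, used here as a stand-in for int64.
data = array("q", range(3))
data_bytes = data.itemsize * len(data)
print(data_bytes)
```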

The memory footprint of object values is ignored by default:

>>> s = pd.Series(["a", "b"])  
>>> s.values  
array(['a', 'b'], dtype=object)
>>> s.memory_usage()  
144
>>> s.memory_usage(deep=True)  
244
memory_usage_per_partition(index=True, deep=False)#

Return the memory usage of each partition

Parameters:
indexbool, default True

Specifies whether to include the memory usage of the index in returned Series.

deepbool, default False

If True, introspect the data deeply by interrogating object dtypes for system-level memory consumption, and include it in the returned values.

Returns:
Series

A Series whose index is the partition number and whose values are the memory usage of each partition in bytes.

min(axis=0, skipna=True, split_every=False, out=None, numeric_only=None)#

Return the minimum of the values over the requested axis.

This docstring was copied from pandas.core.frame.DataFrame.min.

Some inconsistencies with the Dask version may exist.

If you want the index of the minimum, use idxmin. This is the equivalent of the numpy.ndarray method argmin.

Parameters:
axis{index (0), columns (1)}

Axis for the function to be applied on. For Series this parameter is unused and defaults to 0.

skipnabool, default True

Exclude NA/null values when computing the result.

levelint or level name, default None (Not supported in Dask)

If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series.

Deprecated since version 1.3.0: The level keyword is deprecated. Use groupby instead.

numeric_onlybool, default None

Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for Series.

Deprecated since version 1.5.0: Specifying numeric_only=None is deprecated. The default value will be False in a future version of pandas.

**kwargs

Additional keyword arguments to be passed to the function.

Returns:
Series or DataFrame (if level specified)

See also

Series.sum

Return the sum.

Series.min

Return the minimum.

Series.max

Return the maximum.

Series.idxmin

Return the index of the minimum.

Series.idxmax

Return the index of the maximum.

DataFrame.sum

Return the sum over the requested axis.

DataFrame.min

Return the minimum over the requested axis.

DataFrame.max

Return the maximum over the requested axis.

DataFrame.idxmin

Return the index of the minimum over the requested axis.

DataFrame.idxmax

Return the index of the maximum over the requested axis.

Examples

>>> idx = pd.MultiIndex.from_arrays([  
...     ['warm', 'warm', 'cold', 'cold'],
...     ['dog', 'falcon', 'fish', 'spider']],
...     names=['blooded', 'animal'])
>>> s = pd.Series([4, 2, 0, 8], name='legs', index=idx)  
>>> s  
blooded  animal
warm     dog       4
         falcon    2
cold     fish      0
         spider    8
Name: legs, dtype: int64
>>> s.min()  
0
mod(other, level=None, fill_value=None, axis=0)#

Get Modulo of DataFrame or Series and other, element-wise (binary operator mod).

This docstring was copied from cudf.core.series.Series.mod.

Some inconsistencies with the Dask version may exist.

Equivalent to frame % other, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters:
otherscalar, sequence, Series, or DataFrame

Any single or multiple element data structure, or list-like object.

axisint or string

Only 0 is supported for Series; 1 or columns is supported for DataFrame.

levelint or name

Broadcast across a level, matching Index values on the passed MultiIndex level. Not yet supported.

fill_valuefloat or None, default None

Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing.

Returns:
DataFrame or Series

Result of the arithmetic operation.

Examples

DataFrame

>>> df = cudf.DataFrame(  
...     {'angles': [0, 3, 4], 'degrees': [360, 180, 360]},
...     index=['circle', 'triangle', 'rectangle']
... )
>>> df.mod(1)  
        angles  degrees
circle          0        0
triangle        0        0
rectangle       0        0

Series

>>> a = cudf.Series([1, 1, 1, None], index=['a', 'b', 'c', 'd'])  
>>> b = cudf.Series([1, None, 1, None], index=['a', 'b', 'd', 'e'])  
>>> a.mod(b)  
a       0
b    <NA>
c    <NA>
d    <NA>
e    <NA>
dtype: int64
>>> a.mod(b, fill_value=0)  
a             0
b    4294967295
c    4294967295
d             0
e          <NA>
dtype: int64
mode(dropna=True, split_every=False)#

Return the mode(s) of the Series.

This docstring was copied from pandas.core.series.Series.mode.

Some inconsistencies with the Dask version may exist.

The mode is the value that appears most often. There can be multiple modes.

Always returns Series even if only one value is returned.

Parameters:
dropnabool, default True

Don’t consider counts of NaN/NaT.

Returns:
Series

Modes of the Series in sorted order.
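
As a plain-Python analogy (not the pandas or Dask implementation), the standard library's statistics.multimode also returns every most-common value, so a tie yields more than one mode. Sorting its result mimics the sorted order described above.

```python
from statistics import multimode

values = [2, 4, 4, 1, 2, 7]
# multimode returns every most-common value in first-seen order, so sort it.
modes = sorted(multimode(values))
print(modes)

# A single most-common value still comes back inside a list.
single = sorted(multimode([1, 1, 2]))
print(single)
```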

mul(other, level=None, fill_value=None, axis=0)#

Get Multiplication of DataFrame or Series and other, element-wise (binary operator mul).

This docstring was copied from cudf.core.series.Series.mul.

Some inconsistencies with the Dask version may exist.

Equivalent to frame * other, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters:
otherscalar, sequence, Series, or DataFrame

Any single or multiple element data structure, or list-like object.

axisint or string

Only 0 is supported for Series; 1 or columns is supported for DataFrame.

levelint or name

Broadcast across a level, matching Index values on the passed MultiIndex level. Not yet supported.

fill_valuefloat or None, default None

Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing.
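
The fill_value rule can be modelled with dictionaries keyed by index label, where None marks a missing value. This is a sketch of the described semantics only, not cuDF's implementation; binary_op is a hypothetical helper, and the inputs mirror the Series used in this method's examples.

```python
import operator


def binary_op(left, right, op, fill_value=None):
    """Align two label-to-value dicts and combine them element-wise."""
    out = {}
    for key in sorted(set(left) | set(right)):
        lhs, rhs = left.get(key), right.get(key)
        if lhs is None and rhs is None:
            # Missing on both sides stays missing, even with a fill_value.
            out[key] = None
        elif fill_value is None and (lhs is None or rhs is None):
            out[key] = None
        else:
            lhs = fill_value if lhs is None else lhs
            rhs = fill_value if rhs is None else rhs
            out[key] = op(lhs, rhs)
    return out


a = {"a": 1, "b": 1, "c": 1, "d": None}
b = {"a": 1, "b": None, "d": 1, "e": None}

unfilled = binary_op(a, b, operator.mul)
filled = binary_op(a, b, operator.mul, fill_value=0)
print(unfilled)
print(filled)
```

Label e stays missing in both calls because neither input has a value for it, while b, c, and d are computed once the single missing side is replaced by the fill value.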

Returns:
DataFrame or Series

Result of the arithmetic operation.

Examples

DataFrame

>>> df = cudf.DataFrame(  
...     {'angles': [0, 3, 4], 'degrees': [360, 180, 360]},
...     index=['circle', 'triangle', 'rectangle']
... )
>>> df.multiply(1)  
        angles  degrees
circle          0      360
triangle        3      180
rectangle       4      360

Series

>>> a = cudf.Series([1, 1, 1, None], index=['a', 'b', 'c', 'd'])  
>>> b = cudf.Series([1, None, 1, None], index=['a', 'b', 'd', 'e'])  
>>> a.multiply(b)  
a       1
b    <NA>
c    <NA>
d    <NA>
e    <NA>
dtype: int64
>>> a.multiply(b, fill_value=0)  
a       1
b       0
c       0
d       0
e    <NA>
dtype: int64
property nbytes#

Number of bytes

property ndim#

Return dimensionality

ne(other, level=None, fill_value=None, axis=0)#

Get Not equal to of DataFrame or Series and other, element-wise (binary operator ne).

This docstring was copied from cudf.core.series.Series.ne.

Some inconsistencies with the Dask version may exist.

Equivalent to frame != other, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters:
otherscalar, sequence, Series, or DataFrame

Any single or multiple element data structure, or list-like object.

axisint or string

Only 0 is supported for Series; 1 or columns is supported for DataFrame.

levelint or name

Broadcast across a level, matching Index values on the passed MultiIndex level. Not yet supported.

fill_valuefloat or None, default None

Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing.

Returns:
DataFrame or Series

Result of the comparison.

Examples

DataFrame

>>> df = cudf.DataFrame(  
...     {'angles': [0, 3, 4], 'degrees': [360, 180, 360]},
...     index=['circle', 'triangle', 'rectangle']
... )
>>> df.ne(1)  
        angles  degrees
circle       True     True
triangle     True     True
rectangle    True     True

Series

>>> a = cudf.Series([1, 1, 1, None], index=['a', 'b', 'c', 'd'])  
>>> b = cudf.Series([1, None, 1, None], index=['a', 'b', 'd', 'e'])  
>>> a.ne(b)  
a    False
b    <NA>
c    <NA>
d    <NA>
e    <NA>
dtype: bool
>>> a.ne(b, fill_value=0)  
a   False
b    True
c    True
d    True
e    <NA>
dtype: bool
nlargest(n=5, split_every=None)#

Return the largest n elements.

This docstring was copied from pandas.core.series.Series.nlargest.

Some inconsistencies with the Dask version may exist.

Parameters:
nint, default 5

Return this many descending sorted values.

keep{‘first’, ‘last’, ‘all’}, default ‘first’ (Not supported in Dask)

When there are duplicate values that cannot all fit in a Series of n elements:

  • first : return the first n occurrences in order of appearance.

  • last : return the last n occurrences in reverse order of appearance.

  • all : keep all occurrences. This can result in a Series of size larger than n.

Returns:
Series

The n largest values in the Series, sorted in decreasing order.

See also

Series.nsmallest

Get the n smallest elements.

Series.sort_values

Sort Series by values.

Series.head

Return the first n rows.

Notes

Faster than .sort_values(ascending=False).head(n) for small n relative to the size of the Series object.
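
The note above concerns pandas. For partitioned data, one plausible strategy is to take the n largest values of each partition and then the n largest of those candidates, which avoids sorting everything. The sketch below illustrates that idea with the standard library; it is an assumption made for illustration, not a description of what Dask or Dask-cuDF do, and partitioned_nlargest is a hypothetical helper.

```python
import heapq


def partitioned_nlargest(partitions, n):
    """Select the n largest values without sorting all of the data."""
    # Each partition contributes at most n candidates.
    candidates = []
    for part in partitions:
        candidates.extend(heapq.nlargest(n, part))
    # The final answer is the n largest of those candidates.
    return heapq.nlargest(n, candidates)


partitions = [[59000000, 65000000, 434000], [337000, 11300, 5200]]
top3 = partitioned_nlargest(partitions, 3)
print(top3)
```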

Examples

>>> countries_population = {"Italy": 59000000, "France": 65000000,  
...                         "Malta": 434000, "Maldives": 434000,
...                         "Brunei": 434000, "Iceland": 337000,
...                         "Nauru": 11300, "Tuvalu": 11300,
...                         "Anguilla": 11300, "Montserrat": 5200}
>>> s = pd.Series(countries_population)  
>>> s  
Italy       59000000
France      65000000
Malta         434000
Maldives      434000
Brunei        434000
Iceland       337000
Nauru          11300
Tuvalu         11300
Anguilla       11300
Montserrat      5200
dtype: int64

The n largest elements where n=5 by default.

>>> s.nlargest()  
France      65000000
Italy       59000000
Malta         434000
Maldives      434000
Brunei        434000
dtype: int64

The n largest elements where n=3. Default keep value is ‘first’ so Malta will be kept.

>>> s.nlargest(3)  
France    65000000
Italy     59000000
Malta       434000
dtype: int64

The n largest elements where n=3 and keeping the last duplicates. Brunei will be kept since it is the last with value 434000 based on the index order.

>>> s.nlargest(3, keep='last')  
France      65000000
Italy       59000000
Brunei        434000
dtype: int64

The n largest elements where n=3 with all duplicates kept. Note that the returned Series has five elements due to the three duplicates.

>>> s.nlargest(3, keep='all')  
France      65000000
Italy       59000000
Malta         434000
Maldives      434000
Brunei        434000
dtype: int64
notnull()#

DataFrame.notnull is an alias for DataFrame.notna.

This docstring was copied from pandas.core.frame.DataFrame.notnull.

Some inconsistencies with the Dask version may exist.

Detect existing (non-missing) values.

Return a boolean same-sized object indicating if the values are not NA. Non-missing values get mapped to True. Characters such as empty strings '' or numpy.inf are not considered NA values (unless you set pandas.options.mode.use_inf_as_na = True). NA values, such as None or numpy.NaN, get mapped to False values.

Returns:
DataFrame

Mask of bool values for each element in DataFrame that indicates whether an element is not an NA value.

See also

DataFrame.notnull

Alias of notna.

DataFrame.isna

Boolean inverse of notna.

DataFrame.dropna

Omit axes labels with missing values.

notna

Top-level notna.

Examples

Show which entries in a DataFrame are not NA.

>>> df = pd.DataFrame(dict(age=[5, 6, np.NaN],  
...                    born=[pd.NaT, pd.Timestamp('1939-05-27'),
...                          pd.Timestamp('1940-04-25')],
...                    name=['Alfred', 'Batman', ''],
...                    toy=[None, 'Batmobile', 'Joker']))
>>> df  
   age       born    name        toy
0  5.0        NaT  Alfred       None
1  6.0 1939-05-27  Batman  Batmobile
2  NaN 1940-04-25              Joker
>>> df.notna()  
     age   born  name    toy
0   True  False  True  False
1   True   True  True   True
2  False   True  True   True

Show which entries in a Series are not NA.

>>> ser = pd.Series([5, 6, np.NaN])  
>>> ser  
0    5.0
1    6.0
2    NaN
dtype: float64
>>> ser.notna()  
0     True
1     True
2    False
dtype: bool
property npartitions: int#

Return number of partitions

nsmallest(n=5, split_every=None)#

Return the smallest n elements.

This docstring was copied from pandas.core.series.Series.nsmallest.

Some inconsistencies with the Dask version may exist.

Parameters:
nint, default 5

Return this many ascending sorted values.

keep{‘first’, ‘last’, ‘all’}, default ‘first’ (Not supported in Dask)

When there are duplicate values that cannot all fit in a Series of n elements:

  • first : return the first n occurrences in order of appearance.

  • last : return the last n occurrences in reverse order of appearance.

  • all : keep all occurrences. This can result in a Series of size larger than n.

Returns:
Series

The n smallest values in the Series, sorted in increasing order.

See also

Series.nlargest

Get the n largest elements.

Series.sort_values

Sort Series by values.

Series.head

Return the first n rows.

Notes

Faster than .sort_values().head(n) for small n relative to the size of the Series object.
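
The performance note can be illustrated without pandas. The following is a pure-Python sketch, not the pandas, cuDF, or Dask implementation: it assumes the standard-library heapq.nsmallest as a stand-in for a bounded selection, and shows that it returns the same values as a full sort followed by a slice while doing less work when n is small.

```python
import heapq
import random

random.seed(0)
values = [random.randrange(1_000_000) for _ in range(100_000)]
n = 5

# Full sort, then slice: O(N log N) work regardless of n.
by_sort = sorted(values)[:n]

# Bounded selection: roughly O(N log n), cheaper when n is much smaller than N.
by_heap = heapq.nsmallest(n, values)
```

Both lists hold the same five values in increasing order; only the amount of work differs.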

Examples

>>> countries_population = {"Italy": 59000000, "France": 65000000,  
...                         "Brunei": 434000, "Malta": 434000,
...                         "Maldives": 434000, "Iceland": 337000,
...                         "Nauru": 11300, "Tuvalu": 11300,
...                         "Anguilla": 11300, "Montserrat": 5200}
>>> s = pd.Series(countries_population)  
>>> s  
Italy       59000000
France      65000000
Brunei        434000
Malta         434000
Maldives      434000
Iceland       337000
Nauru          11300
Tuvalu         11300
Anguilla       11300
Montserrat      5200
dtype: int64

The n smallest elements where n=5 by default.

>>> s.nsmallest()  
Montserrat    5200
Nauru        11300
Tuvalu       11300
Anguilla     11300
Iceland     337000
dtype: int64

The n smallest elements where n=3. Default keep value is ‘first’ so Nauru and Tuvalu will be kept.

>>> s.nsmallest(3)  
Montserrat   5200
Nauru       11300
Tuvalu      11300
dtype: int64

The n smallest elements where n=3 and keeping the last duplicates. Anguilla and Tuvalu will be kept since they are the last with value 11300 based on the index order.

>>> s.nsmallest(3, keep='last')  
Montserrat   5200
Anguilla    11300
Tuvalu      11300
dtype: int64

The n smallest elements where n=3 with all duplicates kept. Note that the returned Series has four elements due to the three duplicates.

>>> s.nsmallest(3, keep='all')  
Montserrat   5200
Nauru       11300
Tuvalu      11300
Anguilla    11300
dtype: int64
nunique(split_every=None, dropna=True)#

Return number of unique elements in the object.

This docstring was copied from pandas.core.series.Series.nunique.

Some inconsistencies with the Dask version may exist.

Excludes NA values by default.

Parameters:
dropnabool, default True

Don’t include NaN in the count.

Returns:
int

See also

DataFrame.nunique

Method nunique for DataFrame.

Series.count

Count non-NA/null observations in the Series.

Examples

>>> s = pd.Series([1, 3, 5, 7, 7])  
>>> s  
0    1
1    3
2    5
3    7
4    7
dtype: int64
>>> s.nunique()  
4
nunique_approx(split_every=None)#

Approximate number of unique rows.

This method uses the HyperLogLog algorithm for cardinality estimation to compute the approximate number of unique rows. The approximate error is 0.406%.

Parameters:
split_everyint, optional

Group partitions into groups of this size while performing a tree-reduction. If set to False, no tree-reduction will be used. Default is 8.

Returns:
a float representing the approximate number of elements
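
To give a feel for how a HyperLogLog estimate behaves, here is a minimal pure-Python sketch. It is not Dask's implementation: the function name hll_estimate, the use of SHA-1 as the hash, and the register count 2**12 are all assumptions chosen for illustration. The usual HyperLogLog standard error is about 1.04 / sqrt(m) for m registers, so this sketch should land within a few percent; a figure of 0.406% would correspond to m = 2**16, which is an inference rather than something stated above.

```python
import hashlib
import math


def hll_estimate(items, p=12):
    """Estimate the number of distinct items with 2**p HyperLogLog registers."""
    m = 1 << p
    registers = [0] * m
    for item in items:
        # 64-bit hash of the item (SHA-1 is used here only for determinism).
        digest = hashlib.sha1(repr(item).encode()).digest()
        h = int.from_bytes(digest[:8], "big")
        index = h >> (64 - p)                 # first p bits pick a register
        rest = h & ((1 << (64 - p)) - 1)      # remaining 64 - p bits
        # Rank: 1-based position of the leftmost 1-bit in the remaining bits.
        rank = (64 - p) - rest.bit_length() + 1
        registers[index] = max(registers[index], rank)

    alpha = 0.7213 / (1 + 1.079 / m)          # bias constant, valid for m >= 128
    raw = alpha * m * m / sum(2.0 ** -r for r in registers)
    zeros = registers.count(0)
    if raw <= 2.5 * m and zeros:
        # Small-range correction (linear counting).
        return m * math.log(m / zeros)
    return raw


data = list(range(50_000)) * 2                # 100,000 rows, 50,000 distinct
estimate = hll_estimate(data)
relative_error = abs(estimate - 50_000) / 50_000
```

Because each register only keeps a maximum, registers computed on separate partitions can be merged with an element-wise max, which is what makes the estimate suitable for a tree reduction.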
property partitions#

Slice dataframe by partitions

This allows partition-wise slicing of a Dask DataFrame. You can use normal NumPy-style slicing, but instead of slicing the elements of an array you slice along partitions. For example, df.partitions[:5] produces a new Dask DataFrame containing the first five partitions. Valid indexers are integers, sequences of integers, slices, or boolean masks.

Returns:
A Dask DataFrame

Examples

>>> df.partitions[0]  
>>> df.partitions[:3]  
>>> df.partitions[::10]  
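
The four kinds of indexer can be pictured with a plain list standing in for the partitions. This is only a sketch of the selection semantics; select_partitions is a hypothetical helper, not part of Dask or Dask-cuDF.

```python
def select_partitions(partitions, indexer):
    """Return the partitions picked by an int, slice, index list, or boolean mask."""
    if isinstance(indexer, int):
        return [partitions[indexer]]
    if isinstance(indexer, slice):
        return partitions[indexer]
    indexer = list(indexer)
    if indexer and all(isinstance(i, bool) for i in indexer):
        # Boolean mask: keep partitions whose flag is True.
        return [part for part, keep in zip(partitions, indexer) if keep]
    # Sequence of integer positions.
    return [partitions[i] for i in indexer]


parts = ["p0", "p1", "p2", "p3"]
first = select_partitions(parts, 0)
head = select_partitions(parts, slice(None, 3))
picked = select_partitions(parts, [0, 2])
masked = select_partitions(parts, [True, False, False, True])
```

In every case the result is again a collection of whole partitions, which is why the property returns a Dask DataFrame rather than individual rows.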
persist(**kwargs)#

Persist this dask collection into memory

This turns a lazy Dask collection into a Dask collection with the same metadata, but now with the results fully computed or actively computing in the background.

The behavior of this function differs significantly depending on the active task scheduler. If the task scheduler supports asynchronous computing, as the dask.distributed scheduler does, then persist will return immediately and the return value’s task graph will contain Dask Future objects. However, if the task scheduler only supports blocking computation, then the call to persist will block and the return value’s task graph will contain concrete Python results.

This function is particularly useful when using distributed systems, because the results will be kept in distributed memory, rather than returned to the local process as with compute.

Parameters:
schedulerstring, optional

Which scheduler to use like “threads”, “synchronous” or “processes”. If not provided, the default is to check the global settings first, and then fall back to the collection defaults.

optimize_graphbool, optional

If True [default], the graph is optimized before computation. Otherwise the graph is run as is. This can be useful for debugging.

**kwargs

Extra keywords to forward to the scheduler function.

Returns:
New dask collections backed by in-memory data

See also

dask.persist
pipe(func, *args, **kwargs)#

Apply chainable functions that expect Series or DataFrames.

This docstring was copied from pandas.core.frame.DataFrame.pipe.

Some inconsistencies with the Dask version may exist.

Parameters:
funcfunction

Function to apply to the Series/DataFrame. args, and kwargs are passed into func. Alternatively a (callable, data_keyword) tuple where data_keyword is a string indicating the keyword of callable that expects the Series/DataFrame.

argsiterable, optional

Positional arguments passed into func.

kwargsmapping, optional

A dictionary of keyword arguments passed into func.

Returns:
objectthe return type of func.

See also

DataFrame.apply

Apply a function along input axis of DataFrame.

DataFrame.applymap

Apply a function elementwise on a whole DataFrame.

Series.map

Apply a mapping correspondence on a Series.

Notes

Use .pipe when chaining together functions that expect Series, DataFrames or GroupBy objects. Instead of writing

>>> func(g(h(df), arg1=a), arg2=b, arg3=c)  

You can write

>>> (df.pipe(h)  
...    .pipe(g, arg1=a)
...    .pipe(func, arg2=b, arg3=c)
... )  

If you have a function that takes the data as (say) the second argument, pass a tuple indicating which keyword expects the data. For example, suppose func takes its data as arg2:

>>> (df.pipe(h)  
...    .pipe(g, arg1=a)
...    .pipe((func, 'arg2'), arg1=a, arg3=c)
...  )  
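
The tuple form can be understood from a small stand-alone sketch. The pipe function below is a simplified illustration written for this page, not the pandas or Dask source, and scale and shift are hypothetical helpers operating on plain lists.

```python
def pipe(obj, func, *args, **kwargs):
    """Call func with obj as first argument, or as the keyword named in a tuple."""
    if isinstance(func, tuple):
        func, data_keyword = func
        if data_keyword in kwargs:
            raise ValueError(
                f"{data_keyword} is both the pipe target and a keyword argument"
            )
        kwargs[data_keyword] = obj
        return func(*args, **kwargs)
    return func(obj, *args, **kwargs)


def scale(data, factor):
    # Takes the data as its first argument.
    return [x * factor for x in data]


def shift(amount, data):
    # Takes the data as its second argument, so it needs the tuple form.
    return [x + amount for x in data]


step1 = pipe([1, 2, 3], scale, factor=10)
step2 = pipe(step1, (shift, "data"), amount=1)
```

Here step2 is computed by calling shift(amount=1, data=step1), so the data arrives through the keyword named in the tuple.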
pow(other, level=None, fill_value=None, axis=0)#

Get Exponential of DataFrame or Series and other, element-wise (binary operator pow).

This docstring was copied from cudf.core.series.Series.pow.

Some inconsistencies with the Dask version may exist.

Equivalent to frame ** other, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters:
otherscalar, sequence, Series, or DataFrame

Any single or multiple element data structure, or list-like object.

axisint or string

Only 0 is supported for series, 1 or columns supported for dataframe

levelint or name

Broadcast across a level, matching Index values on the passed MultiIndex level. Not yet supported.

fill_valuefloat or None, default None

Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing.

Returns:
DataFrame or Series

Result of the arithmetic operation.

Examples

DataFrame

>>> df = cudf.DataFrame(  
...     {'angles': [0, 3, 4], 'degrees': [360, 180, 360]},
...     index=['circle', 'triangle', 'rectangle']
... )
>>> df.pow(1)  
           angles  degrees
circle          0      360
triangle        3      180
rectangle       4      360

Series

>>> a = cudf.Series([1, 1, 1, None], index=['a', 'b', 'c', 'd'])  
>>> b = cudf.Series([1, None, 1, None], index=['a', 'b', 'd', 'e'])  
>>> a.pow(b)  
a       1
b    <NA>
c    <NA>
d    <NA>
e    <NA>
dtype: int64
>>> a.pow(b, fill_value=0)  
a       1
b       1
c       1
d       0
e    <NA>
dtype: int64
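
The interaction between index alignment and fill_value in the Series example can be traced with plain dictionaries. This is a sketch of the semantics only: align_binop is a hypothetical helper, None stands in for a missing value, and nothing here reflects how cuDF performs the operation on the GPU.

```python
import operator


def align_binop(left, right, op, fill_value=None):
    """Apply op label by label over the union of labels, honouring fill_value."""
    result = {}
    for label in sorted(set(left) | set(right)):
        x = left.get(label)    # None if the label is absent or the value is missing
        y = right.get(label)
        if x is None and y is None:
            # Missing on both sides: the result stays missing even with fill_value.
            result[label] = None
            continue
        if fill_value is not None:
            x = fill_value if x is None else x
            y = fill_value if y is None else y
        result[label] = None if x is None or y is None else op(x, y)
    return result


a = {"a": 1, "b": 1, "c": 1, "d": None}
b = {"a": 1, "b": None, "d": 1, "e": None}
plain = align_binop(a, b, operator.pow)
filled = align_binop(a, b, operator.pow, fill_value=0)
```

With fill_value=0, label d becomes 0 ** 1 = 0 while b and c become 1 ** 0 = 1, and label e stays missing because neither input has a value there.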
prod(axis=None, skipna=True, split_every=False, dtype=None, out=None, min_count=None, numeric_only=None)#

Return the product of the values over the requested axis.

This docstring was copied from pandas.core.frame.DataFrame.prod.

Some inconsistencies with the Dask version may exist.

Parameters:
axis{index (0), columns (1)}

Axis for the function to be applied on. For Series this parameter is unused and defaults to 0.

skipnabool, default True

Exclude NA/null values when computing the result.

levelint or level name, default None (Not supported in Dask)

If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series.

Deprecated since version 1.3.0: The level keyword is deprecated. Use groupby instead.

numeric_onlybool, default None

Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for Series.

Deprecated since version 1.5.0: Specifying numeric_only=None is deprecated. The default value will be False in a future version of pandas.

min_countint, default 0

The required number of valid values to perform the operation. If fewer than min_count non-NA values are present the result will be NA.

**kwargs

Additional keyword arguments to be passed to the function.

Returns:
Series or DataFrame (if level specified)

See also

Series.sum

Return the sum.

Series.min

Return the minimum.

Series.max

Return the maximum.

Series.idxmin

Return the index of the minimum.

Series.idxmax

Return the index of the maximum.

DataFrame.sum

Return the sum over the requested axis.

DataFrame.min

Return the minimum over the requested axis.

DataFrame.max

Return the maximum over the requested axis.

DataFrame.idxmin

Return the index of the minimum over the requested axis.

DataFrame.idxmax

Return the index of the maximum over the requested axis.

Examples

By default, the product of an empty or all-NA Series is 1

>>> pd.Series([], dtype="float64").prod()  
1.0

This can be controlled with the min_count parameter

>>> pd.Series([], dtype="float64").prod(min_count=1)  
nan

Thanks to the skipna parameter, min_count handles all-NA and empty series identically.

>>> pd.Series([np.nan]).prod()  
1.0
>>> pd.Series([np.nan]).prod(min_count=1)  
nan
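
The rule behind these four results fits in a few lines of plain Python. The function below is a hypothetical re-statement of the skipna and min_count semantics for a flat list of floats, not the pandas or Dask code path.

```python
import math


def product(values, skipna=True, min_count=0):
    """Product of values with pandas-like skipna / min_count handling."""
    valid = [v for v in values if not math.isnan(v)]
    if not skipna and len(valid) < len(values):
        # A NaN that is not skipped propagates to the result.
        return math.nan
    if len(valid) < min_count:
        # Too few non-NA values to report a product.
        return math.nan
    # The empty product is the multiplicative identity.
    return math.prod(valid, start=1.0)


empty_default = product([])
empty_strict = product([], min_count=1)
all_na_default = product([math.nan])
all_na_strict = product([math.nan], min_count=1)
```

After NaN values are dropped, an all-NA input and an empty input both have zero valid values, which is why min_count treats them identically.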
product(axis=None, skipna=True, split_every=False, dtype=None, out=None, min_count=None, numeric_only=None)#

Return the product of the values over the requested axis.

This docstring was copied from pandas.core.frame.DataFrame.prod.

Some inconsistencies with the Dask version may exist.

Parameters:
axis{index (0), columns (1)}

Axis for the function to be applied on. For Series this parameter is unused and defaults to 0.

skipnabool, default True

Exclude NA/null values when computing the result.

levelint or level name, default None (Not supported in Dask)

If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series.

Deprecated since version 1.3.0: The level keyword is deprecated. Use groupby instead.

numeric_onlybool, default None

Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for Series.

Deprecated since version 1.5.0: Specifying numeric_only=None is deprecated. The default value will be False in a future version of pandas.

min_countint, default 0

The required number of valid values to perform the operation. If fewer than min_count non-NA values are present the result will be NA.

**kwargs

Additional keyword arguments to be passed to the function.

Returns:
Series or DataFrame (if level specified)

See also

Series.sum

Return the sum.

Series.min

Return the minimum.

Series.max

Return the maximum.

Series.idxmin

Return the index of the minimum.

Series.idxmax

Return the index of the maximum.

DataFrame.sum

Return the sum over the requested axis.

DataFrame.min

Return the minimum over the requested axis.

DataFrame.max

Return the maximum over the requested axis.

DataFrame.idxmin

Return the index of the minimum over the requested axis.

DataFrame.idxmax

Return the index of the maximum over the requested axis.

Examples

By default, the product of an empty or all-NA Series is 1

>>> pd.Series([], dtype="float64").prod()  
1.0

This can be controlled with the min_count parameter

>>> pd.Series([], dtype="float64").prod(min_count=1)  
nan

Thanks to the skipna parameter, min_count handles all-NA and empty series identically.

>>> pd.Series([np.nan]).prod()  
1.0
>>> pd.Series([np.nan]).prod(min_count=1)  
nan
quantile(q=0.5, method='default')#

Approximate quantiles of Series

Parameters:
qlist/array of floats, default 0.5 (50%)

Iterable of numbers ranging from 0 to 1 for the desired quantiles

method{‘default’, ‘tdigest’, ‘dask’}, optional

Which method to use. By default, Dask’s internal custom algorithm ('dask') is used. If set to 'tdigest', tdigest is used for floats and ints, falling back to 'dask' otherwise.

radd(other, level=None, fill_value=None, axis=0)#

Get Addition of DataFrame or Series and other, element-wise (binary operator radd).

This docstring was copied from cudf.core.series.Series.radd.

Some inconsistencies with the Dask version may exist.

Equivalent to other + frame, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters:
otherscalar, sequence, Series, or DataFrame

Any single or multiple element data structure, or list-like object.

axisint or string

Only 0 is supported for series, 1 or columns supported for dataframe

levelint or name

Broadcast across a level, matching Index values on the passed MultiIndex level. Not yet supported.

fill_valuefloat or None, default None

Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing.

Returns:
DataFrame or Series

Result of the arithmetic operation.

Examples

DataFrame

>>> df = cudf.DataFrame(  
...     {'angles': [0, 3, 4], 'degrees': [360, 180, 360]},
...     index=['circle', 'triangle', 'rectangle']
... )
>>> df.radd(1)  
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361

Series

>>> a = cudf.Series([1, 1, 1, None], index=['a', 'b', 'c', 'd'])  
>>> b = cudf.Series([1, None, 1, None], index=['a', 'b', 'd', 'e'])  
>>> a.radd(b)  
a       2
b    <NA>
c    <NA>
d    <NA>
e    <NA>
dtype: int64
>>> a.radd(b, fill_value=0)  
a       2
b       1
c       1
d       1
e    <NA>
dtype: int64
random_split(frac, random_state=None, shuffle=False)#

Pseudorandomly split dataframe into different pieces row-wise

Parameters:
fraclist

List of floats that should sum to one.

random_stateint or np.random.RandomState

If int create a new RandomState with this as the seed. Otherwise draw from the passed RandomState.

shufflebool, default False

If set to True, the dataframe is shuffled (within partition) before the split.

See also

dask.DataFrame.sample

Examples

50/50 split

>>> a, b = df.random_split([0.5, 0.5])  

80/10/10 split, consistent random_state

>>> a, b, c = df.random_split([0.8, 0.1, 0.1], random_state=123)  
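
One way to picture a row-wise pseudorandom split is to draw a uniform number per row and compare it with the cumulative fractions. The sketch below does this for a plain sequence; it is an assumption-laden illustration (the helper name and the use of Python's random module are mine), not Dask's implementation, which works partition by partition.

```python
import random
from bisect import bisect_right
from itertools import accumulate


def random_split(rows, frac, seed=None):
    """Assign each row to one of len(frac) pieces with probabilities frac."""
    rng = random.Random(seed)
    bounds = list(accumulate(frac))          # e.g. [0.8, 0.9, 1.0] up to rounding
    pieces = [[] for _ in frac]
    for row in rows:
        # Count how many cumulative bounds the draw has passed; clip for safety.
        i = min(bisect_right(bounds, rng.random()), len(frac) - 1)
        pieces[i].append(row)
    return pieces


a, b, c = random_split(range(1000), [0.8, 0.1, 0.1], seed=123)
a2, b2, c2 = random_split(range(1000), [0.8, 0.1, 0.1], seed=123)
```

The piece sizes are only approximately 80/10/10 because each row is assigned independently, and reusing the same seed reproduces the same split.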
rdiv(other, level=None, fill_value=None, axis=0)#

Return Floating division of series and other, element-wise (binary operator rtruediv).

This docstring was copied from pandas.core.series.Series.rdiv.

Some inconsistencies with the Dask version may exist.

Equivalent to other / series, but with support to substitute a fill_value for missing data in either one of the inputs.

Parameters:
otherSeries or scalar value
levelint or name

Broadcast across a level, matching Index values on the passed MultiIndex level.

fill_valueNone or float value, default None (NaN)

Fill existing missing (NaN) values, and any new element needed for successful Series alignment, with this value before computation. If data in both corresponding Series locations is missing the result of filling (at that location) will be missing.

axis{0 or ‘index’}

Unused. Parameter needed for compatibility with DataFrame.

Returns:
Series

The result of the operation.

See also

Series.truediv

Element-wise Floating division, see Python documentation for more details.

Examples

>>> a = pd.Series([1, 1, 1, np.nan], index=['a', 'b', 'c', 'd'])  
>>> a  
a    1.0
b    1.0
c    1.0
d    NaN
dtype: float64
>>> b = pd.Series([1, np.nan, 1, np.nan], index=['a', 'b', 'd', 'e'])  
>>> b  
a    1.0
b    NaN
d    1.0
e    NaN
dtype: float64
>>> a.divide(b, fill_value=0)  
a    1.0
b    inf
c    inf
d    0.0
e    NaN
dtype: float64
reduction(chunk, aggregate=None, combine=None, meta=_NoDefault.no_default, token=None, split_every=None, chunk_kwargs=None, aggregate_kwargs=None, combine_kwargs=None, **kwargs)#

Generic row-wise reductions.

Parameters:
chunkcallable

Function to operate on each partition. Should return a pandas.DataFrame, pandas.Series, or a scalar.

aggregatecallable, optional

Function to operate on the concatenated result of chunk. If not specified, defaults to chunk. Used to do the final aggregation in a tree reduction.

The input to aggregate depends on the output of chunk. If the output of chunk is a:

  • scalar: Input is a Series, with one row per partition.

  • Series: Input is a DataFrame, with one row per partition. Columns are the rows in the output series.

  • DataFrame: Input is a DataFrame, with one row per partition. Columns are the columns in the output dataframes.

Should return a pandas.DataFrame, pandas.Series, or a scalar.

combinecallable, optional

Function to operate on intermediate concatenated results of chunk in a tree-reduction. If not provided, defaults to aggregate. The input/output requirements should match that of aggregate described above.

metapd.DataFrame, pd.Series, dict, iterable, tuple, optional

An empty pd.DataFrame or pd.Series that matches the dtypes and column names of the output. This metadata is necessary for many algorithms in dask dataframe to work. For ease of use, some alternative inputs are also available. Instead of a DataFrame, a dict of {name: dtype} or iterable of (name, dtype) can be provided (note that the order of the names should match the order of the columns). Instead of a series, a tuple of (name, dtype) can be used. If not provided, dask will try to infer the metadata. This may lead to unexpected results, so providing meta is recommended. For more information, see dask.dataframe.utils.make_meta.

tokenstr, optional

The name to use for the output keys.

split_everyint, optional

Group partitions into groups of this size while performing a tree-reduction. If set to False, no tree-reduction will be used, and all intermediates will be concatenated and passed to aggregate. Default is 8.

chunk_kwargsdict, optional

Keyword arguments to pass on to chunk only.

aggregate_kwargsdict, optional

Keyword arguments to pass on to aggregate only.

combine_kwargsdict, optional

Keyword arguments to pass on to combine only.

kwargs

All remaining keywords will be passed to chunk, combine, and aggregate.

Examples

>>> import pandas as pd
>>> import dask.dataframe as dd
>>> df = pd.DataFrame({'x': range(50), 'y': range(50, 100)})
>>> ddf = dd.from_pandas(df, npartitions=4)

Count the number of rows in a DataFrame. To do this, count the number of rows in each partition, then sum the results:

>>> res = ddf.reduction(lambda x: x.count(),
...                     aggregate=lambda x: x.sum())
>>> res.compute()
x    50
y    50
dtype: int64

Count the number of rows in a Series with elements greater than or equal to a value (provided via a keyword).

>>> def count_greater(x, value=0):
...     return (x >= value).sum()
>>> res = ddf.x.reduction(count_greater, aggregate=lambda x: x.sum(),
...                       chunk_kwargs={'value': 25})
>>> res.compute()
25

Aggregate both the sum and count of a Series at the same time:

>>> def sum_and_count(x):
...     return pd.Series({'count': x.count(), 'sum': x.sum()},
...                      index=['count', 'sum'])
>>> res = ddf.x.reduction(sum_and_count, aggregate=lambda x: x.sum())
>>> res.compute()
count      50
sum      1225
dtype: int64

Doing the same, but for a DataFrame. Here chunk returns a DataFrame, meaning the input to aggregate is a DataFrame with an index with non-unique entries for both ‘x’ and ‘y’. We groupby the index, and sum each group to get the final result.

>>> def sum_and_count(x):
...     return pd.DataFrame({'count': x.count(), 'sum': x.sum()},
...                         columns=['count', 'sum'])
>>> res = ddf.reduction(sum_and_count,
...                     aggregate=lambda x: x.groupby(level=0).sum())
>>> res.compute()
   count   sum
x     50  1225
y     50  3725
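
The roles of chunk, combine, aggregate, and split_every can also be seen in a stand-alone sketch that uses lists of numbers as partitions. The function tree_reduction below is a hypothetical simplification: it passes plain lists of intermediate results to combine and aggregate, whereas Dask concatenates them into Series or DataFrames as described above.

```python
def tree_reduction(partitions, chunk, aggregate, combine=None, split_every=8):
    """Reduce partitions with chunk, then combine groups of split_every results."""
    if split_every is not False and split_every < 2:
        raise ValueError("split_every must be False or an integer >= 2")
    combine = aggregate if combine is None else combine

    # Step 1: apply chunk to every partition independently.
    intermediates = [chunk(part) for part in partitions]

    # Step 2: repeatedly combine groups of split_every intermediates.
    if split_every is not False:
        while len(intermediates) > split_every:
            intermediates = [
                combine(intermediates[i:i + split_every])
                for i in range(0, len(intermediates), split_every)
            ]

    # Step 3: the final aggregation sees the remaining intermediates at once.
    return aggregate(intermediates)


# Ten partitions holding the values 0..49, five values each.
partitions = [list(range(start, start + 5)) for start in range(0, 50, 5)]


def count_ge_25(part):
    return sum(x >= 25 for x in part)


tree_result = tree_reduction(partitions, count_ge_25, aggregate=sum, split_every=3)
flat_result = tree_reduction(partitions, count_ge_25, aggregate=sum, split_every=False)
```

With split_every=3 the ten intermediates are combined into four, then two, before the final aggregation; with split_every=False all ten go to aggregate directly. Both paths give the same count here because summation is associative.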
rename(index=None, inplace=_NoDefault.no_default, sorted_index=False)#

Alter Series index labels or name

Function / dict values must be unique (1-to-1). Labels not contained in a dict / Series will be left as-is. Extra labels listed don’t throw an error.

Alternatively, change Series.name with a scalar value.

Parameters:
indexscalar, hashable sequence, dict-like or callable, optional

If dict-like or callable, the transformation is applied to the index. Scalar or hashable sequence-like will alter the Series.name attribute.

inplaceboolean, default False

Whether to return a new Series or modify this one inplace.

sorted_indexbool, default False

If true, the output Series will have known divisions inferred from the input series and the transformation. Ignored for non-callable/dict-like index or when the input series has unknown divisions. Note that this may only be set to True if you know that the transformed index is monotonically increasing. Dask will check that transformed divisions are monotonic, but cannot check all the values between divisions, so incorrectly setting this can result in bugs.

Returns:
renamedSeries
repartition(divisions=None, npartitions=None, partition_size=None, freq=None, force=False)#

Repartition dataframe along new divisions

Parameters:
divisionslist, optional

The “dividing lines” used to split the dataframe into partitions. For divisions=[0, 10, 50, 100], there would be three output partitions, where the new index contained [0, 10), [10, 50), and [50, 100), respectively. See https://docs.dask.org/en/latest/dataframe-design.html#partitions. Only used if npartitions and partition_size aren’t specified. For convenience, if given an integer this will defer to npartitions, and if given a string it will defer to partition_size (see below).

npartitionsint, optional

Approximate number of partitions of output. Only used if partition_size isn’t specified. The number of partitions used may be slightly lower than npartitions depending on data distribution, but will never be higher.

partition_size: int or string, optional

Max number of bytes of memory for each partition. Use numbers or strings like 5MB. If specified npartitions and divisions will be ignored. Note that the size reflects the number of bytes used as computed by pandas.DataFrame.memory_usage, which will not necessarily match the size when storing to disk.

Warning

This keyword argument triggers computation to determine the memory size of each partition, which may be expensive.

freqstr, pd.Timedelta

A period on which to partition timeseries data like '7D' or '12h' or pd.Timedelta(hours=12). Assumes a datetime index.

forcebool, default False

Allows the expansion of the existing divisions. If False then the new divisions’ lower and upper bounds must be the same as the old divisions’.

Notes

Exactly one of divisions, npartitions, partition_size, or freq should be specified. A ValueError will be raised when that is not the case.

Also note that len(divisions) is equal to npartitions + 1. This is because divisions represents the upper and lower bounds of each partition. The first item is the lower bound of the first partition, the second item is the lower bound of the second partition and the upper bound of the first partition, and so on. The second-to-last item is the lower bound of the last partition, and the last (extra) item is the upper bound of the last partition.
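
The relationship between a divisions list and the partition an index value falls into can be sketched with the standard-library bisect module. The helper assign_partition is hypothetical, and treating the final division as an inclusive upper bound is an assumption of this sketch.

```python
from bisect import bisect_right


def assign_partition(divisions, value):
    """Return the 0-based partition number that an index value belongs to."""
    npartitions = len(divisions) - 1
    if not divisions[0] <= value <= divisions[-1]:
        raise ValueError(f"{value!r} lies outside the divisions")
    # Each division is the lower bound of the partition that starts there.
    # Assumption: the last division is an inclusive upper bound, so clip.
    return min(bisect_right(divisions, value) - 1, npartitions - 1)


divisions = [0, 10, 50, 100]
npartitions = len(divisions) - 1
placements = [assign_partition(divisions, v) for v in (0, 9, 10, 49, 50, 99)]
```

Four dividing lines give three partitions, and a value equal to an interior division (10 or 50) belongs to the partition that starts at that division.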

Examples

>>> df = df.repartition(npartitions=10)  
>>> df = df.repartition(divisions=[0, 5, 10, 20])  
>>> df = df.repartition(freq='7d')  
replace(to_replace=None, value=None, regex=False)#

Replace values given in to_replace with value.

This docstring was copied from pandas.core.frame.DataFrame.replace.

Some inconsistencies with the Dask version may exist.

Values of the DataFrame are replaced with other values dynamically.

This differs from updating with .loc or .iloc, which require you to specify a location to update with some value.

Parameters:
to_replacestr, regex, list, dict, Series, int, float, or None

How to find the values that will be replaced.

  • numeric, str or regex:

    • numeric: numeric values equal to to_replace will be replaced with value

    • str: string exactly matching to_replace will be replaced with value

    • regex: regexes matching to_replace will be replaced with value

  • list of str, regex, or numeric:

    • First, if to_replace and value are both lists, they must be the same length.

    • Second, if regex=True then all of the strings in both lists will be interpreted as regexes; otherwise they will match directly. This doesn’t matter much for value since there are only a few possible substitution regexes you can use.

    • str, regex and numeric rules apply as above.

  • dict:

    • Dicts can be used to specify different replacement values for different existing values. For example, {'a': 'b', 'y': 'z'} replaces the value ‘a’ with ‘b’ and ‘y’ with ‘z’. To use a dict in this way, the optional value parameter should not be given.

    • For a DataFrame a dict can specify that different values should be replaced in different columns. For example, {'a': 1, 'b': 'z'} looks for the value 1 in column ‘a’ and the value ‘z’ in column ‘b’ and replaces these values with whatever is specified in value. The value parameter should not be None in this case. You can treat this as a special case of passing two lists except that you are specifying the column to search in.

    • For a DataFrame nested dictionaries, e.g., {'a': {'b': np.nan}}, are read as follows: look in column ‘a’ for the value ‘b’ and replace it with NaN. The optional value parameter should not be specified to use a nested dict in this way. You can nest regular expressions as well. Note that column names (the top-level dictionary keys in a nested dictionary) cannot be regular expressions.

  • None:

    • This means that the regex argument must be a string, compiled regular expression, or list, dict, ndarray or Series of such elements. If value is also None then this must be a nested dictionary or Series.

See the examples section for examples of each of these.

valuescalar, dict, list, str, regex, default None

Value to replace any values matching to_replace with. For a DataFrame a dict of values can be used to specify which value to use for each column (columns not in the dict will not be filled). Regular expressions, strings and lists or dicts of such objects are also allowed.

inplacebool, default False (Not supported in Dask)

Whether to modify the DataFrame rather than creating a new one.

limitint, default None (Not supported in Dask)

Maximum size gap to forward or backward fill.

regexbool or same types as to_replace, default False

Whether to interpret to_replace and/or value as regular expressions. If this is True then to_replace must be a string. Alternatively, this could be a regular expression or a list, dict, or array of regular expressions in which case to_replace must be None.

method{‘pad’, ‘ffill’, ‘bfill’} (Not supported in Dask)

The method to use for replacement when to_replace is a scalar, list or tuple and value is None.

Changed in version 0.23.0: Added to DataFrame.

Returns:
DataFrame

Object after replacement.

Raises:
AssertionError
  • If regex is not a bool and to_replace is not None.

TypeError
  • If to_replace is not a scalar, array-like, dict, or None

  • If to_replace is a dict and value is not a list, dict, ndarray, or Series

  • If to_replace is None and regex is not compilable into a regular expression or is a list, dict, ndarray, or Series.

  • When replacing multiple bool or datetime64 objects and the arguments to to_replace does not match the type of the value being replaced

ValueError
  • If a list or an ndarray is passed to to_replace and value but they are not the same length.

See also

DataFrame.fillna

Fill NA values.

DataFrame.where

Replace values based on boolean condition.

Series.str.replace

Simple string replacement.

Notes

  • Regex substitution is performed under the hood with re.sub. The rules for substitution for re.sub are the same.

  • Regular expressions will only substitute on strings, meaning you cannot provide, for example, a regular expression matching floating point numbers and expect the columns in your frame that have a numeric dtype to be matched. However, if those floating point numbers are strings, then you can do this.

  • This method has a lot of options. You are encouraged to experiment and play with this method to gain intuition about how it works.

  • When dict is used as the to_replace value, it is like key(s) in the dict are the to_replace part and value(s) in the dict are the value parameter.
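
The first two notes can be made concrete with re.sub on a plain list. The helper regex_replace below is a hypothetical sketch of the described behavior (substitution via re.sub, applied to strings only), not the pandas implementation.

```python
import re


def regex_replace(values, pattern, repl):
    """Apply re.sub to string values; leave non-string values untouched."""
    compiled = re.compile(pattern)
    return [
        compiled.sub(repl, value) if isinstance(value, str) else value
        for value in values
    ]


mixed = regex_replace(["bat", "foo", "bait", 1.5], r"^ba.$", "new")
as_float = regex_replace([1.5], r"^\d+\.\d+$", "num")
as_text = regex_replace(["1.5"], r"^\d+\.\d+$", "num")
```

The float 1.5 is never matched by the pattern, but the string "1.5" is, which mirrors the note about numeric dtypes.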

Examples

Scalar `to_replace` and `value`

>>> s = pd.Series([1, 2, 3, 4, 5])  
>>> s.replace(1, 5)  
0    5
1    2
2    3
3    4
4    5
dtype: int64
>>> df = pd.DataFrame({'A': [0, 1, 2, 3, 4],  
...                    'B': [5, 6, 7, 8, 9],
...                    'C': ['a', 'b', 'c', 'd', 'e']})
>>> df.replace(0, 5)  
   A  B  C
0  5  5  a
1  1  6  b
2  2  7  c
3  3  8  d
4  4  9  e

List-like `to_replace`

>>> df.replace([0, 1, 2, 3], 4)  
   A  B  C
0  4  5  a
1  4  6  b
2  4  7  c
3  4  8  d
4  4  9  e
>>> df.replace([0, 1, 2, 3], [4, 3, 2, 1])  
    A  B  C
0  4  5  a
1  3  6  b
2  2  7  c
3  1  8  d
4  4  9  e
>>> s.replace([1, 2], method='bfill')  
0    3
1    3
2    3
3    4
4    5
dtype: int64

dict-like `to_replace`

>>> df.replace({0: 10, 1: 100})  
        A  B  C
0   10  5  a
1  100  6  b
2    2  7  c
3    3  8  d
4    4  9  e
>>> df.replace({'A': 0, 'B': 5}, 100)  
        A    B  C
0  100  100  a
1    1    6  b
2    2    7  c
3    3    8  d
4    4    9  e
>>> df.replace({'A': {0: 100, 4: 400}})  
        A  B  C
0  100  5  a
1    1  6  b
2    2  7  c
3    3  8  d
4  400  9  e

Regular expression `to_replace`

>>> df = pd.DataFrame({'A': ['bat', 'foo', 'bait'],  
...                    'B': ['abc', 'bar', 'xyz']})
>>> df.replace(to_replace=r'^ba.$', value='new', regex=True)  
        A    B
0   new  abc
1   foo  new
2  bait  xyz
>>> df.replace({'A': r'^ba.$'}, {'A': 'new'}, regex=True)  
        A    B
0   new  abc
1   foo  bar
2  bait  xyz
>>> df.replace(regex=r'^ba.$', value='new')  
        A    B
0   new  abc
1   foo  new
2  bait  xyz
>>> df.replace(regex={r'^ba.$': 'new', 'foo': 'xyz'})  
        A    B
0   new  abc
1   xyz  new
2  bait  xyz
>>> df.replace(regex=[r'^ba.$', 'foo'], value='new')  
        A    B
0   new  abc
1   new  new
2  bait  xyz

Compare the behavior of s.replace({'a': None}) and s.replace('a', None) to understand the peculiarities of the to_replace parameter:

>>> s = pd.Series([10, 'a', 'a', 'b', 'a'])  

When a dict is passed as to_replace, the values in the dict play the role of the value parameter. s.replace({'a': None}) is equivalent to s.replace(to_replace={'a': None}, value=None, method=None):

>>> s.replace({'a': None})  
0      10
1    None
2    None
3       b
4    None
dtype: object

When value is not explicitly passed and to_replace is a scalar, list or tuple, replace uses the method parameter (default ‘pad’) to do the replacement. This is why the ‘a’ values are replaced by 10 in rows 1 and 2 and by ‘b’ in row 4.

>>> s.replace('a')  
0    10
1    10
2    10
3     b
4     b
dtype: object

On the other hand, if None is explicitly passed for value, it will be respected:

>>> s.replace('a', None)  
0      10
1    None
2    None
3       b
4    None
dtype: object

Changed in version 1.4.0: Previously the explicit None was silently ignored.
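
The ‘pad’ fallback used when value is omitted amounts to a forward fill over the matched positions. The sketch below shows that idea in plain Python; `replace_with_pad` is a hypothetical helper, and leaving a leading match unchanged is a choice of this sketch rather than documented behavior.

```python
def replace_with_pad(values, to_replace):
    """Replace matches with the most recent non-matching value."""
    out = []
    have_previous = False
    previous = None
    for v in values:
        if v == to_replace and have_previous:
            out.append(previous)
        else:
            out.append(v)
            if v != to_replace:
                previous, have_previous = v, True
    return out


padded = replace_with_pad([10, "a", "a", "b", "a"], "a")
leading = replace_with_pad(["a", 1, "a"], "a")
print(padded)
```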

resample(rule, closed=None, label=None)#

Resample time-series data.

This docstring was copied from pandas.core.frame.DataFrame.resample.

Some inconsistencies with the Dask version may exist.

Convenience method for frequency conversion and resampling of time series. The object must have a datetime-like index (DatetimeIndex, PeriodIndex, or TimedeltaIndex), or the caller must pass the label of a datetime-like series/index to the on/level keyword parameter.

Parameters:
ruleDateOffset, Timedelta or str

The offset string or object representing target conversion.

axis{0 or ‘index’, 1 or ‘columns’}, default 0 (Not supported in Dask)

Which axis to use for up- or down-sampling. For Series this parameter is unused and defaults to 0. Must be DatetimeIndex, TimedeltaIndex or PeriodIndex.

closed{‘right’, ‘left’}, default None

Which side of bin interval is closed. The default is ‘left’ for all frequency offsets except for ‘M’, ‘A’, ‘Q’, ‘BM’, ‘BA’, ‘BQ’, and ‘W’ which all have a default of ‘right’.

label{‘right’, ‘left’}, default None

Which bin edge label to label bucket with. The default is ‘left’ for all frequency offsets except for ‘M’, ‘A’, ‘Q’, ‘BM’, ‘BA’, ‘BQ’, and ‘W’ which all have a default of ‘right’.

convention{‘start’, ‘end’, ‘s’, ‘e’}, default ‘start’ (Not supported in Dask)

For PeriodIndex only, controls whether to use the start or end of rule.

kind{‘timestamp’, ‘period’}, optional, default None (Not supported in Dask)

Pass ‘timestamp’ to convert the resulting index to a DatetimeIndex or ‘period’ to convert it to a PeriodIndex. By default the input representation is retained.

loffsettimedelta, default None (Not supported in Dask)

Adjust the resampled time labels.

Deprecated since version 1.1.0: You should add the loffset to the df.index after the resample. See below.

baseint, default 0 (Not supported in Dask)

For frequencies that evenly subdivide 1 day, the “origin” of the aggregated intervals. For example, for ‘5min’ frequency, base could range from 0 through 4. Defaults to 0.

Deprecated since version 1.1.0: The new arguments that you should use are ‘offset’ or ‘origin’.

onstr, optional (Not supported in Dask)

For a DataFrame, column to use instead of index for resampling. Column must be datetime-like.

levelstr or int, optional (Not supported in Dask)

For a MultiIndex, level (name or number) to use for resampling. level must be datetime-like.

originTimestamp or str, default ‘start_day’ (Not supported in Dask)

The timestamp on which to adjust the grouping. The timezone of origin must match the timezone of the index. If string, must be one of the following:

  • ‘epoch’: origin is 1970-01-01

  • ‘start’: origin is the first value of the timeseries

  • ‘start_day’: origin is the first day at midnight of the timeseries

New in version 1.1.0.

  • ‘end’: origin is the last value of the timeseries

  • ‘end_day’: origin is the ceiling midnight of the last day

New in version 1.3.0.

offsetTimedelta or str, default is None (Not supported in Dask)

An offset timedelta added to the origin.

New in version 1.1.0.

group_keysbool, optional (Not supported in Dask)

Whether to include the group keys in the result index when using .apply() on the resampled object. Not specifying group_keys will retain values-dependent behavior from pandas 1.4 and earlier (see pandas 1.5.0 Release notes for examples). In a future version of pandas, the behavior will default to the same as specifying group_keys=False.

New in version 1.5.0.

Returns:
pandas.core.Resampler

Resampler object.

See also

Series.resample

Resample a Series.

DataFrame.resample

Resample a DataFrame.

groupby

Group DataFrame by mapping, function, label, or list of labels.

asfreq

Reindex a DataFrame with the given frequency without grouping.

Notes

See the user guide for more.

To learn more about the offset strings, please see this link.

Examples

Start by creating a series with 9 one minute timestamps.

>>> index = pd.date_range('1/1/2000', periods=9, freq='T')  
>>> series = pd.Series(range(9), index=index)  
>>> series  
2000-01-01 00:00:00    0
2000-01-01 00:01:00    1
2000-01-01 00:02:00    2
2000-01-01 00:03:00    3
2000-01-01 00:04:00    4
2000-01-01 00:05:00    5
2000-01-01 00:06:00    6
2000-01-01 00:07:00    7
2000-01-01 00:08:00    8
Freq: T, dtype: int64

Downsample the series into 3 minute bins and sum the values of the timestamps falling into a bin.

>>> series.resample('3T').sum()  
2000-01-01 00:00:00     3
2000-01-01 00:03:00    12
2000-01-01 00:06:00    21
Freq: 3T, dtype: int64

Downsample the series into 3 minute bins as above, but label each bin using the right edge instead of the left. Note that the value at the timestamp used as a bin's label is not included in the bin that it labels. For example, in the original series the timestamp 2000-01-01 00:03:00 holds the value 3, but the summed value in the resampled bin labeled 2000-01-01 00:03:00 does not include 3 (if it did, the summed value would be 6, not 3). To include this value, close the right side of the bin interval, as illustrated in the example below this one.

>>> series.resample('3T', label='right').sum()  
2000-01-01 00:03:00     3
2000-01-01 00:06:00    12
2000-01-01 00:09:00    21
Freq: 3T, dtype: int64

Downsample the series into 3 minute bins as above, but close the right side of the bin interval.

>>> series.resample('3T', label='right', closed='right').sum()  
2000-01-01 00:00:00     0
2000-01-01 00:03:00     6
2000-01-01 00:06:00    15
2000-01-01 00:09:00    15
Freq: 3T, dtype: int64
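
The effect of closed and label in the three downsampling examples can be reproduced with integer arithmetic. This is a minimal sketch, assuming timestamps are expressed as whole minutes since 00:00; `resample_sum` is a hypothetical helper, not part of pandas, Dask or cuDF.

```python
from collections import defaultdict


def resample_sum(samples, width, closed="left", label="left"):
    """Sum (minute, value) pairs into bins of `width` minutes."""
    totals = defaultdict(int)
    for minute, value in samples:
        if closed == "left":
            # Bin is [left, left + width): floor to a multiple of width.
            left = (minute // width) * width
        else:
            # Bin is (left, left + width]: ceil to a multiple, then step back.
            left = -(-minute // width) * width - width
        key = left if label == "left" else left + width
        totals[key] += value
    return dict(sorted(totals.items()))


samples = [(m, m) for m in range(9)]  # minute offset -> value, as in `series`
default_bins = resample_sum(samples, 3)
right_label = resample_sum(samples, 3, label="right")
right_closed = resample_sum(samples, 3, closed="right", label="right")
print(default_bins, right_label, right_closed)
```

The keys of each result are bin labels in minutes after 00:00, so they line up with the 00:00, 00:03, 00:06 and 00:09 labels in the outputs above.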

Upsample the series into 30 second bins.

>>> series.resample('30S').asfreq()[0:5]   # Select first 5 rows  
2000-01-01 00:00:00   0.0
2000-01-01 00:00:30   NaN
2000-01-01 00:01:00   1.0
2000-01-01 00:01:30   NaN
2000-01-01 00:02:00   2.0
Freq: 30S, dtype: float64

Upsample the series into 30 second bins and fill the NaN values using the ffill method.

>>> series.resample('30S').ffill()[0:5]  
2000-01-01 00:00:00    0
2000-01-01 00:00:30    0
2000-01-01 00:01:00    1
2000-01-01 00:01:30    1
2000-01-01 00:02:00    2
Freq: 30S, dtype: int64

Upsample the series into 30 second bins and fill the NaN values using the bfill method.

>>> series.resample('30S').bfill()[0:5]  
2000-01-01 00:00:00    0
2000-01-01 00:00:30    1
2000-01-01 00:01:00    1
2000-01-01 00:01:30    2
2000-01-01 00:02:00    2
Freq: 30S, dtype: int64

Pass a custom function via apply

>>> def custom_resampler(arraylike):  
...     return np.sum(arraylike) + 5
...
>>> series.resample('3T').apply(custom_resampler)  
2000-01-01 00:00:00     8
2000-01-01 00:03:00    17
2000-01-01 00:06:00    26
Freq: 3T, dtype: int64

For a Series with a PeriodIndex, the keyword convention can be used to control whether to use the start or end of rule.

Resample a year by quarter using ‘start’ convention. Values are assigned to the first quarter of the period.

>>> s = pd.Series([1, 2], index=pd.period_range('2012-01-01',  
...                                             freq='A',
...                                             periods=2))
>>> s  
2012    1
2013    2
Freq: A-DEC, dtype: int64
>>> s.resample('Q', convention='start').asfreq()  
2012Q1    1.0
2012Q2    NaN
2012Q3    NaN
2012Q4    NaN
2013Q1    2.0
2013Q2    NaN
2013Q3    NaN
2013Q4    NaN
Freq: Q-DEC, dtype: float64

Resample quarters by month using ‘end’ convention. Values are assigned to the last month of the period.

>>> q = pd.Series([1, 2, 3, 4], index=pd.period_range('2018-01-01',  
...                                                   freq='Q',
...                                                   periods=4))
>>> q  
2018Q1    1
2018Q2    2
2018Q3    3
2018Q4    4
Freq: Q-DEC, dtype: int64
>>> q.resample('M', convention='end').asfreq()  
2018-03    1.0
2018-04    NaN
2018-05    NaN
2018-06    2.0
2018-07    NaN
2018-08    NaN
2018-09    3.0
2018-10    NaN
2018-11    NaN
2018-12    4.0
Freq: M, dtype: float64

For DataFrame objects, the keyword on can be used to specify the column instead of the index for resampling.

>>> d = {'price': [10, 11, 9, 13, 14, 18, 17, 19],  
...      'volume': [50, 60, 40, 100, 50, 100, 40, 50]}
>>> df = pd.DataFrame(d)  
>>> df['week_starting'] = pd.date_range('01/01/2018',  
...                                     periods=8,
...                                     freq='W')
>>> df  
   price  volume week_starting
0     10      50    2018-01-07
1     11      60    2018-01-14
2      9      40    2018-01-21
3     13     100    2018-01-28
4     14      50    2018-02-04
5     18     100    2018-02-11
6     17      40    2018-02-18
7     19      50    2018-02-25
>>> df.resample('M', on='week_starting').mean()  
               price  volume
week_starting
2018-01-31     10.75    62.5
2018-02-28     17.00    60.0

For a DataFrame with MultiIndex, the keyword level can be used to specify on which level the resampling needs to take place.

>>> days = pd.date_range('1/1/2000', periods=4, freq='D')  
>>> d2 = {'price': [10, 11, 9, 13, 14, 18, 17, 19],  
...       'volume': [50, 60, 40, 100, 50, 100, 40, 50]}
>>> df2 = pd.DataFrame(  
...     d2,
...     index=pd.MultiIndex.from_product(
...         [days, ['morning', 'afternoon']]
...     )
... )
>>> df2  
                      price  volume
2000-01-01 morning       10      50
           afternoon     11      60
2000-01-02 morning        9      40
           afternoon     13     100
2000-01-03 morning       14      50
           afternoon     18     100
2000-01-04 morning       17      40
           afternoon     19      50
>>> df2.resample('D', level=0).sum()  
            price  volume
2000-01-01     21     110
2000-01-02     22     140
2000-01-03     32     150
2000-01-04     36      90

If you want to adjust the start of the bins based on a fixed timestamp:

>>> start, end = '2000-10-01 23:30:00', '2000-10-02 00:30:00'  
>>> rng = pd.date_range(start, end, freq='7min')  
>>> ts = pd.Series(np.arange(len(rng)) * 3, index=rng)  
>>> ts  
2000-10-01 23:30:00     0
2000-10-01 23:37:00     3
2000-10-01 23:44:00     6
2000-10-01 23:51:00     9
2000-10-01 23:58:00    12
2000-10-02 00:05:00    15
2000-10-02 00:12:00    18
2000-10-02 00:19:00    21
2000-10-02 00:26:00    24
Freq: 7T, dtype: int64
>>> ts.resample('17min').sum()  
2000-10-01 23:14:00     0
2000-10-01 23:31:00     9
2000-10-01 23:48:00    21
2000-10-02 00:05:00    54
2000-10-02 00:22:00    24
Freq: 17T, dtype: int64
>>> ts.resample('17min', origin='epoch').sum()  
2000-10-01 23:18:00     0
2000-10-01 23:35:00    18
2000-10-01 23:52:00    27
2000-10-02 00:09:00    39
2000-10-02 00:26:00    24
Freq: 17T, dtype: int64
>>> ts.resample('17min', origin='2000-01-01').sum()  
2000-10-01 23:24:00     3
2000-10-01 23:41:00    15
2000-10-01 23:58:00    45
2000-10-02 00:15:00    45
Freq: 17T, dtype: int64

If you want to adjust the start of the bins with an offset Timedelta, the two following lines are equivalent:

>>> ts.resample('17min', origin='start').sum()  
2000-10-01 23:30:00     9
2000-10-01 23:47:00    21
2000-10-02 00:04:00    54
2000-10-02 00:21:00    24
Freq: 17T, dtype: int64
>>> ts.resample('17min', offset='23h30min').sum()  
2000-10-01 23:30:00     9
2000-10-01 23:47:00    21
2000-10-02 00:04:00    54
2000-10-02 00:21:00    24
Freq: 17T, dtype: int64
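
The first label in each of the origin and offset outputs above follows from placing bin edges at origin + offset + k * rule. The sketch below works that arithmetic out with the standard library only; `bin_start` is a hypothetical helper written for illustration, not the pandas implementation.

```python
from datetime import datetime, timedelta


def bin_start(ts, freq, origin, offset=timedelta(0)):
    """Left edge of the bin holding `ts`, with edges at origin + offset + k * freq."""
    anchor = origin + offset
    k = (ts - anchor) // freq  # floor division of timedeltas yields an int
    return anchor + k * freq


first = datetime(2000, 10, 1, 23, 30)  # first timestamp of `ts`
freq = timedelta(minutes=17)
start_day = datetime(2000, 10, 1)      # midnight of the first day

default_edge = bin_start(first, freq, start_day)
epoch_edge = bin_start(first, freq, datetime(1970, 1, 1))
fixed_edge = bin_start(first, freq, datetime(2000, 1, 1))
start_edge = bin_start(first, freq, first)
offset_edge = bin_start(
    first, freq, start_day, timedelta(hours=23, minutes=30)
)
print(default_edge, epoch_edge, fixed_edge, start_edge, offset_edge)
```

Here origin='start' and offset='23h30min' agree because both put a bin edge exactly on the first timestamp.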

If you want to take the largest Timestamp as the end of the bins:

>>> ts.resample('17min', origin='end').sum()  
2000-10-01 23:35:00     0
2000-10-01 23:52:00    18
2000-10-02 00:09:00    27
2000-10-02 00:26:00    63
Freq: 17T, dtype: int64

In contrast with the start_day, you can use end_day to take the ceiling midnight of the largest Timestamp as the end of the bins and drop the bins not containing data:

>>> ts.resample('17min', origin='end_day').sum()  
2000-10-01 23:38:00     3
2000-10-01 23:55:00    15
2000-10-02 00:12:00    45
2000-10-02 00:29:00    45
Freq: 17T, dtype: int64

To replace the deprecated base argument, use offset instead. In this example, offset='2min' is equivalent to base=2:

>>> ts.resample('17min', offset='2min').sum()  
2000-10-01 23:16:00     0
2000-10-01 23:33:00     9
2000-10-01 23:50:00    36
2000-10-02 00:07:00    39
2000-10-02 00:24:00    24
Freq: 17T, dtype: int64

To replace the use of the deprecated loffset argument:

>>> from pandas.tseries.frequencies import to_offset  
>>> loffset = '19min'  
>>> ts_out = ts.resample('17min').sum()  
>>> ts_out.index = ts_out.index + to_offset(loffset)  
>>> ts_out  
2000-10-01 23:33:00     0
2000-10-01 23:50:00     9
2000-10-02 00:07:00    21
2000-10-02 00:24:00    54
2000-10-02 00:41:00    24
Freq: 17T, dtype: int64
reset_index(drop=False)#

Reset the index to the default index.

Note that unlike in pandas, the reset dask.dataframe index will not be monotonically increasing from 0. Instead, it will restart at 0 for each partition (e.g. index1 = [0, ..., 10], index2 = [0, ...]). This is due to the inability to statically know the full length of the index.

For DataFrame with multi-level index, returns a new DataFrame with labeling information in the columns under the index names, defaulting to ‘level_0’, ‘level_1’, etc. if any are None. For a standard index, the index name will be used (if set), otherwise a default ‘index’ or ‘level_0’ (if ‘index’ is already taken) will be used.

Parameters:
dropboolean, default False

Do not try to insert index into dataframe columns.
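
The per-partition restart described above can be pictured with plain lists standing in for partitions. This is an illustrative sketch only; `reset_index_per_partition` is a hypothetical helper and the real operation works on DataFrame partitions.

```python
def reset_index_per_partition(partitions):
    """Give every partition its own fresh 0..len-1 index, independently."""
    return [list(enumerate(rows)) for rows in partitions]


partitions = [["a", "b", "c"], ["d", "e"], ["f"]]
reset = reset_index_per_partition(partitions)
index_values = [i for part in reset for i, _ in part]
print(index_values)  # the index restarts at 0 in each partition
```

Because no partition knows the lengths of the ones before it, the resulting index contains duplicates and should not be used as a unique row identifier.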

rfloordiv(other, level=None, fill_value=None, axis=0)#

Get Integer division of DataFrame or Series and other, element-wise (binary operator rfloordiv).

This docstring was copied from cudf.core.series.Series.rfloordiv.

Some inconsistencies with the Dask version may exist.

Equivalent to other // frame, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters:
otherscalar, sequence, Series, or DataFrame

Any single or multiple element data structure, or list-like object.

axisint or string

Only 0 is supported for series, 1 or columns supported for dataframe

levelint or name

Broadcast across a level, matching Index values on the passed MultiIndex level. Not yet supported.

fill_valuefloat or None, default None

Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing.

Returns:
DataFrame or Series

Result of the arithmetic operation.
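
The "r" prefix means the operands are reflected: rfloordiv computes other // frame rather than frame // other. The sketch below shows the same convention using Python's own reflected-operator protocol; the `Frame` class is a hypothetical stand-in for a column of integers, not a cuDF or Dask type.

```python
class Frame:
    """Minimal stand-in for a column of integers."""

    def __init__(self, values):
        self.values = list(values)

    def floordiv(self, other):
        # frame // other
        return Frame(v // other for v in self.values)

    def rfloordiv(self, other):
        # other // frame: the operands are swapped
        return Frame(other // v for v in self.values)

    __floordiv__ = floordiv
    __rfloordiv__ = rfloordiv


frame = Frame([2, 3, 4])
forward = (frame // 7).values
reflected = (7 // frame).values
via_method = frame.rfloordiv(7).values
print(forward, reflected, via_method)
```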

Examples

DataFrame

>>> df = cudf.DataFrame(  
...     {'angles': [0, 3, 4], 'degrees': [360, 180, 360]},
...     index=['circle', 'triangle', 'rectangle']
... )
>>> df.rfloordiv(1)  
                        angles  degrees
circle     9223372036854775807        0
triangle                     0        0
rectangle                    0        0

Series

>>> a = cudf.Series([1, 1, 1, None], index=['a', 'b', 'c', 'd'])  
>>> b = cudf.Series([1, None, 1, None], index=['a', 'b', 'd', 'e'])  
>>> a.rfloordiv(b)  
a       1
b    <NA>
c    <NA>
d    <NA>
e    <NA>
dtype: int64
>>> a.rfloordiv(b, fill_value=0)  
a                      1
b                      0
c                      0
d    9223372036854775807
e                   <NA>
dtype: int64
rmod(other, level=None, fill_value=None, axis=0)#

Get Modulo of DataFrame or Series and other, element-wise (binary operator rmod).

This docstring was copied from cudf.core.series.Series.rmod.

Some inconsistencies with the Dask version may exist.

Equivalent to other % frame, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters:
otherscalar, sequence, Series, or DataFrame

Any single or multiple element data structure, or list-like object.

axisint or string

Only 0 is supported for series, 1 or columns supported for dataframe

levelint or name

Broadcast across a level, matching Index values on the passed MultiIndex level. Not yet supported.

fill_valuefloat or None, default None

Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing.

Returns:
DataFrame or Series

Result of the arithmetic operation.

Examples

DataFrame

>>> df = cudf.DataFrame(  
...     {'angles': [0, 3, 4], 'degrees': [360, 180, 360]},
...     index=['circle', 'triangle', 'rectangle']
... )
>>> df.rmod(1)  
               angles  degrees
circle     4294967295        1
triangle            1        1
rectangle           1        1

Series

>>> a = cudf.Series([1, 1, 1, None], index=['a', 'b', 'c', 'd'])  
>>> b = cudf.Series([1, None, 1, None], index=['a', 'b', 'd', 'e'])  
>>> a.rmod(b)  
a       0
b    <NA>
c    <NA>
d    <NA>
e    <NA>
dtype: int64
>>> a.rmod(b, fill_value=0)  
a             0
b             0
c             0
d    4294967295
e          <NA>
dtype: int64
rmul(other, level=None, fill_value=None, axis=0)#

Get Multiplication of DataFrame or Series and other, element-wise (binary operator rmul).

This docstring was copied from cudf.core.series.Series.rmul.

Some inconsistencies with the Dask version may exist.

Equivalent to other * frame, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters:
otherscalar, sequence, Series, or DataFrame

Any single or multiple element data structure, or list-like object.

axisint or string

Only 0 is supported for series, 1 or columns supported for dataframe

levelint or name

Broadcast across a level, matching Index values on the passed MultiIndex level. Not yet supported.

fill_valuefloat or None, default None

Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing.

Returns:
DataFrame or Series

Result of the arithmetic operation.

Examples

DataFrame

>>> df = cudf.DataFrame(  
...     {'angles': [0, 3, 4], 'degrees': [360, 180, 360]},
...     index=['circle', 'triangle', 'rectangle']
... )
>>> df.rmul(1)  
           angles  degrees
circle          0      360
triangle        3      180
rectangle       4      360

Series

>>> a = cudf.Series([1, 1, 1, None], index=['a', 'b', 'c', 'd'])  
>>> b = cudf.Series([1, None, 1, None], index=['a', 'b', 'd', 'e'])  
>>> a.rmul(b)  
a       1
b    <NA>
c    <NA>
d    <NA>
e    <NA>
dtype: int64
>>> a.rmul(b, fill_value=0)  
a       1
b       0
c       0
d       0
e    <NA>
dtype: int64
rolling(window, min_periods=None, center=False, win_type=None, axis=_NoDefault.no_default)#

Provides rolling transformations.

Parameters:
windowint, str, offset

Size of the moving window. This is the number of observations used for calculating the statistic. When not using a DatetimeIndex, the window size must not be so large as to span more than one adjacent partition. If using an offset or offset alias like ‘5D’, the data must have a DatetimeIndex.

Changed in version 0.15.0: Now accepts offsets and string offset aliases

min_periodsint, default None

Minimum number of observations in window required to have a value (otherwise result is NA).

centerboolean, default False

Set the labels at the center of the window.

win_typestring, default None

Provide a window type. The recognized window types are identical to pandas.

axisint, str, None, default 0

This parameter is deprecated with pandas>=2.1.

Returns:
a Rolling object on which to call a method to compute a statistic
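
How window and min_periods interact can be sketched on a plain list. This is an illustration under stated assumptions, not the Dask implementation: `rolling_sum` is a hypothetical helper, the window is trailing (center=False), and min_periods is assumed to default to the window size, as pandas does for integer windows.

```python
def rolling_sum(values, window, min_periods=None):
    """Trailing-window sum; None where too few observations are available."""
    if min_periods is None:
        min_periods = window  # assumption: mirrors pandas for integer windows
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        out.append(sum(chunk) if len(chunk) >= min_periods else None)
    return out


full = rolling_sum([1, 2, 3, 4, 5], window=3)
partial = rolling_sum([1, 2, 3, 4, 5], window=3, min_periods=1)
print(full, partial)
```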
round(decimals=0)#

Round each value in a Series to the given number of decimals.

This docstring was copied from pandas.core.series.Series.round.

Some inconsistencies with the Dask version may exist.

Parameters:
decimalsint, default 0

Number of decimal places to round to. If decimals is negative, it specifies the number of positions to the left of the decimal point.

*args, **kwargs

Additional arguments and keywords have no effect but might be accepted for compatibility with NumPy.

Returns:
Series

Rounded values of the Series.
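
The meaning of a negative decimals value can be seen with Python's built-in `round`, used here only as a stand-in for the element-wise Series operation. The sample values are made up for illustration.

```python
values = [1234.5678, 17.0, 0.04]

two_places = [round(v, 2) for v in values]  # two digits after the point
tens = [round(v, -1) for v in values]       # one position left of the point
print(two_places, tens)
```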

See also

numpy.around

Round values of an np.array.

DataFrame.round

Round values of a DataFrame.

Examples

>>> s = pd.Series([0.1, 1.3, 2.7])  
>>> s.round()  
0    0.0
1    1.0
2    3.0
dtype: float64
rpow(other, level=None, fill_value=None, axis=0)#

Get Exponential of DataFrame or Series and other, element-wise (binary operator rpow).

This docstring was copied from cudf.core.series.Series.rpow.

Some inconsistencies with the Dask version may exist.

Equivalent to other ** frame, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters:
otherscalar, sequence, Series, or DataFrame

Any single or multiple element data structure, or list-like object.

axisint or string

Only 0 is supported for series, 1 or columns supported for dataframe

levelint or name

Broadcast across a level, matching Index values on the passed MultiIndex level. Not yet supported.

fill_valuefloat or None, default None

Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing.

Returns:
DataFrame or Series

Result of the arithmetic operation.

Examples

DataFrame

>>> df = cudf.DataFrame(  
...     {'angles': [0, 3, 4], 'degrees': [360, 180, 360]},
...     index=['circle', 'triangle', 'rectangle']
... )
>>> df.rpow(1)  
           angles  degrees
circle          1        1
triangle        1        1
rectangle       1        1

Series

>>> a = cudf.Series([1, 1, 1, None], index=['a', 'b', 'c', 'd'])  
>>> b = cudf.Series([1, None, 1, None], index=['a', 'b', 'd', 'e'])  
>>> a.rpow(b)  
a       1
b    <NA>
c    <NA>
d    <NA>
e    <NA>
dtype: int64
>>> a.rpow(b, fill_value=0)  
a       1
b       0
c       0
d       1
e    <NA>
dtype: int64
rsub(other, level=None, fill_value=None, axis=0)#

Get Subtraction of DataFrame or Series and other, element-wise (binary operator rsub).

This docstring was copied from cudf.core.series.Series.rsub.

Some inconsistencies with the Dask version may exist.

Equivalent to other - frame, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters:
otherscalar, sequence, Series, or DataFrame

Any single or multiple element data structure, or list-like object.

axisint or string

Only 0 is supported for series, 1 or columns supported for dataframe

levelint or name

Broadcast across a level, matching Index values on the passed MultiIndex level. Not yet supported.

fill_valuefloat or None, default None

Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing.

Returns:
DataFrame or Series

Result of the arithmetic operation.
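
The fill_value rule (fill a value missing on one side, but keep the result missing when both sides are missing) can be sketched with dictionaries keyed by index label. This is a plain-Python illustration, not the cuDF implementation; `reflected_op` is a hypothetical helper and None stands in for a missing value.

```python
import operator


def reflected_op(frame, other, op, fill_value=None):
    """Compute op(other, frame) label by label; None marks missing data."""
    result = {}
    for label in sorted(set(frame) | set(other)):
        left = other.get(label)   # reflected: `other` is the left operand
        right = frame.get(label)
        if left is None and right is None:
            result[label] = None  # missing on both sides stays missing
            continue
        if fill_value is not None:
            left = fill_value if left is None else left
            right = fill_value if right is None else right
        if left is None or right is None:
            result[label] = None
        else:
            result[label] = op(left, right)
    return result


a = {"a": 1, "b": 1, "c": 1, "d": None}
b = {"a": 1, "b": None, "d": 1, "e": None}
plain = reflected_op(a, b, operator.sub)
filled = reflected_op(a, b, operator.sub, fill_value=0)
print(plain, filled)
```

The two results correspond to a.rsub(b) and a.rsub(b, fill_value=0) for the Series a and b used in the examples of this section.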

Examples

DataFrame

>>> df = cudf.DataFrame(  
...     {'angles': [0, 3, 4], 'degrees': [360, 180, 360]},
...     index=['circle', 'triangle', 'rectangle']
... )
>>> df.rsub(1)  
           angles  degrees
circle          1     -359
triangle       -2     -179
rectangle      -3     -359

Series

>>> a = cudf.Series([1, 1, 1, None], index=['a', 'b', 'c', 'd'])  
>>> b = cudf.Series([1, None, 1, None], index=['a', 'b', 'd', 'e'])  
>>> a.rsub(b)  
a       0
b    <NA>
c    <NA>
d    <NA>
e    <NA>
dtype: int64
>>> a.rsub(b, fill_value=0)  
a       0
b      -1
c      -1
d       1
e    <NA>
dtype: int64
rtruediv(other, level=None, fill_value=None, axis=0)#

Get Floating division of DataFrame or Series and other, element-wise (binary operator rtruediv).

This docstring was copied from cudf.core.series.Series.rtruediv.

Some inconsistencies with the Dask version may exist.

Equivalent to other / frame, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters:
otherscalar, sequence, Series, or DataFrame

Any single or multiple element data structure, or list-like object.

axisint or string

Only 0 is supported for series, 1 or columns supported for dataframe

levelint or name

Broadcast across a level, matching Index values on the passed MultiIndex level. Not yet supported.

fill_valuefloat or None, default None

Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing.

Returns:
DataFrame or Series

Result of the arithmetic operation.

Examples

DataFrame

>>> df = cudf.DataFrame(  
...     {'angles': [0, 3, 4], 'degrees': [360, 180, 360]},
...     index=['circle', 'triangle', 'rectangle']
... )
>>> df.rtruediv(1)  
            angles   degrees
circle          inf  0.002778
triangle   0.333333  0.005556
rectangle  0.250000  0.002778

Series

>>> a = cudf.Series([1, 1, 1, None], index=['a', 'b', 'c', 'd'])  
>>> b = cudf.Series([1, None, 1, None], index=['a', 'b', 'd', 'e'])  
>>> a.rtruediv(b)  
a     1.0
b    <NA>
c    <NA>
d    <NA>
e    <NA>
dtype: float64
>>> a.rtruediv(b, fill_value=0)  
a     1.0
b     0.0
c     0.0
d     Inf
e    <NA>
dtype: float64
sample(n=None, frac=None, replace=False, random_state=None)#

Random sample of items

Parameters:
nint, optional

Number of items to return. Not supported by Dask; use frac instead.

fracfloat, optional

Approximate fraction of items to return. This sampling fraction is applied to all partitions equally. Note that this is an approximate fraction. You should not expect exactly len(df) * frac items to be returned, as the exact number of elements selected will depend on how your data is partitioned (but should be pretty close in practice).

replaceboolean, optional

Sample with or without replacement. Default = False.

random_stateint or np.random.RandomState

If an int, a new RandomState is created with this value as the seed; otherwise, samples are drawn from the passed RandomState.
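
Why frac is only approximate can be seen by sampling each partition on its own. The sketch below uses lists as partitions and assumes each partition rounds frac * len(partition) independently; that rounding rule and the helper `sample_partitions` are assumptions made for illustration, not a description of Dask internals.

```python
import random


def sample_partitions(partitions, frac, seed=0):
    """Sample each partition independently, without replacement."""
    rng = random.Random(seed)
    sampled = []
    for rows in partitions:
        k = round(frac * len(rows))  # rounded separately in every partition
        sampled.append(rng.sample(rows, k))
    return sampled


partitions = [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9, 10, 11]]
sampled = sample_partitions(partitions, frac=0.5)
total_rows = sum(len(p) for p in partitions)
total_sampled = sum(len(p) for p in sampled)
print(total_rows * 0.5, total_sampled)
```

With four partitions of three rows each, every partition contributes two rows, so the sample holds 8 rows rather than the 6 that len(df) * frac would suggest. With larger partitions the relative difference shrinks.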

sem(axis=None, skipna=True, ddof=1, split_every=False, numeric_only=None)#

Return unbiased standard error of the mean over requested axis.

This docstring was copied from pandas.core.frame.DataFrame.sem.

Some inconsistencies with the Dask version may exist.

Normalized by N-1 by default. This can be changed using the ddof argument.

Parameters:
axis{index (0), columns (1)}

For Series this parameter is unused and defaults to 0.

skipnabool, default True

Exclude NA/null values. If an entire row/column is NA, the result will be NA.

levelint or level name, default None (Not supported in Dask)

If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series.

Deprecated since version 1.3.0: The level keyword is deprecated. Use groupby instead.

ddofint, default 1

Delta Degrees of Freedom. The divisor used in calculations is N - ddof, where N represents the number of elements.

numeric_onlybool, default None

Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for Series.

Deprecated since version 1.5.0: Specifying numeric_only=None is deprecated. The default value will be False in a future version of pandas.

Returns:
Series or DataFrame (if level specified)
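
The role of ddof can be made concrete with the usual formula: the standard error of the mean is the standard deviation (with divisor N - ddof) divided by the square root of N. The following is a minimal single-column sketch in plain Python; the function `sem` here is a hypothetical stand-in, with None marking a missing value.

```python
import math


def sem(values, ddof=1, skipna=True):
    """Standard error of the mean of one column of numbers."""
    data = [v for v in values if v is not None]
    if not skipna and len(data) != len(values):
        return None  # a missing value propagates when skipna is False
    n = len(data)
    mean = sum(data) / n
    variance = sum((v - mean) ** 2 for v in data) / (n - ddof)
    return math.sqrt(variance / n)


unbiased = sem([1, 2, 3, 4])
population = sem([1, 2, 3, 4], ddof=0)
skipped = sem([1, 2, None, 3, 4])
not_skipped = sem([1, 2, None, 3, 4], skipna=False)
print(unbiased, population, skipped, not_skipped)
```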
property shape#

Return a tuple representing the dimensionality of a Series.

The single element of the tuple is a Delayed result.

Examples

>>> series.shape  
(dd.Scalar<size-ag..., dtype=int64>,)
shift(periods=1, freq=None, axis=0)#

Shift index by desired number of periods with an optional time freq.

This docstring was copied from pandas.core.frame.DataFrame.shift.

Some inconsistencies with the Dask version may exist.

When freq is not passed, shift the index without realigning the data. If freq is passed (in this case, the index must be date or datetime, or it will raise a NotImplementedError), the index will be increased using the periods and the freq. freq can be inferred when specified as “infer” as long as either freq or inferred_freq attribute is set in the index.

Parameters:
periodsint

Number of periods to shift. Can be positive or negative.

freqDateOffset, tseries.offsets, timedelta, or str, optional

Offset to use from the tseries module or time rule (e.g. ‘EOM’). If freq is specified then the index values are shifted but the data is not realigned. That is, use freq if you would like to extend the index when shifting and preserve the original data. If freq is specified as “infer” then it will be inferred from the freq or inferred_freq attributes of the index. If neither of those attributes exist, a ValueError is thrown.

axis{0 or ‘index’, 1 or ‘columns’, None}, default None

Shift direction. For Series this parameter is unused and defaults to 0.

fill_valueobject, optional (Not supported in Dask)

The scalar value to use for newly introduced missing values. The default depends on the dtype of self. For numeric data, np.nan is used. For datetime, timedelta, or period data, etc. NaT is used. For extension dtypes, self.dtype.na_value is used.

Changed in version 1.1.0.

Returns:
DataFrame

Copy of input object, shifted.
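
The two modes described above (moving the data while the index stays put, versus moving the index while the data stays put when freq is given) can be sketched with a list of values and a list of dates. The helpers `shift_values` and `shift_index` are hypothetical and exist only for this illustration.

```python
from datetime import date, timedelta


def shift_values(values, periods=1, fill_value=None):
    """Move values by `periods` positions; the index is left untouched."""
    n = len(values)
    if periods >= 0:
        k = min(periods, n)
        return [fill_value] * k + values[: n - k]
    k = min(-periods, n)
    return values[k:] + [fill_value] * k


def shift_index(index, periods, step):
    """Move every index entry by periods * step; the values are left untouched."""
    return [d + periods * step for d in index]


col1 = [10, 20, 15, 30, 45]
index = [date(2020, 1, 1) + timedelta(days=i) for i in range(5)]

shifted = shift_values(col1, periods=3)
shifted_filled = shift_values(col1, periods=3, fill_value=0)
shifted_back = shift_values(col1, periods=-2)
moved_index = shift_index(index, periods=3, step=timedelta(days=1))
print(shifted, shifted_filled, shifted_back, moved_index[0], moved_index[-1])
```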

See also

Index.shift

Shift values of Index.

DatetimeIndex.shift

Shift values of DatetimeIndex.

PeriodIndex.shift

Shift values of PeriodIndex.

tshift

Shift the time index, using the index’s frequency if available.

Examples

>>> df = pd.DataFrame({"Col1": [10, 20, 15, 30, 45],  
...                    "Col2": [13, 23, 18, 33, 48],
...                    "Col3": [17, 27, 22, 37, 52]},
...                   index=pd.date_range("2020-01-01", "2020-01-05"))
>>> df  
            Col1  Col2  Col3
2020-01-01    10    13    17
2020-01-02    20    23    27
2020-01-03    15    18    22
2020-01-04    30    33    37
2020-01-05    45    48    52
>>> df.shift(periods=3)  
            Col1  Col2  Col3
2020-01-01   NaN   NaN   NaN
2020-01-02   NaN   NaN   NaN
2020-01-03   NaN   NaN   NaN
2020-01-04  10.0  13.0  17.0
2020-01-05  20.0  23.0  27.0
>>> df.shift(periods=1, axis="columns")  
            Col1  Col2  Col3
2020-01-01   NaN    10    13
2020-01-02   NaN    20    23
2020-01-03   NaN    15    18
2020-01-04   NaN    30    33
2020-01-05   NaN    45    48
>>> df.shift(periods=3, fill_value=0)  
            Col1  Col2  Col3
2020-01-01     0     0     0
2020-01-02     0     0     0
2020-01-03     0     0     0
2020-01-04    10    13    17
2020-01-05    20    23    27
>>> df.shift(periods=3, freq="D")  
            Col1  Col2  Col3
2020-01-04    10    13    17
2020-01-05    20    23    27
2020-01-06    15    18    22
2020-01-07    30    33    37
2020-01-08    45    48    52
>>> df.shift(periods=3, freq="infer")  
            Col1  Col2  Col3
2020-01-04    10    13    17
2020-01-05    20    23    27
2020-01-06    15    18    22
2020-01-07    30    33    37
2020-01-08    45    48    52
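
The periods and fill_value behaviour in the examples above can be sketched in plain Python. This is only an illustration of the semantics on a single column, not the pandas or Dask implementation; the helper name shift and the use of None in place of NaN are assumptions made for the sketch.

```python
def shift(values, periods=1, fill_value=None):
    """Shift a list by `periods` positions, padding vacated slots with fill_value."""
    n = len(values)
    if periods == 0:
        return list(values)
    if abs(periods) >= n:
        # Every original value is shifted out of range.
        return [fill_value] * n
    if periods > 0:
        # Positive periods move values towards the end.
        return [fill_value] * periods + list(values[: n - periods])
    # Negative periods move values towards the start.
    return list(values[-periods:]) + [fill_value] * (-periods)


col1 = [10, 20, 15, 30, 45]
shifted = shift(col1, periods=3)
filled = shift(col1, periods=3, fill_value=0)
backward = shift(col1, periods=-1)
```

The index is left untouched here, which corresponds to calling shift without freq.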
shuffle(on, npartitions=None, max_branch=None, shuffle_method=None, ignore_index=False, compute=None)#

Rearrange DataFrame into new partitions

Uses hashing of on to map rows to output partitions. After this operation, rows with the same value of on will be in the same partition.

Parameters:
onstr, list of str, or Series, Index, or DataFrame

Column(s) or index to be used to map rows to output partitions

npartitionsint, optional

Number of partitions of output. Partition count will not be changed by default.

max_branch: int, optional

The maximum number of splits per input partition. Used within the staged shuffling algorithm.

shuffle_method: {‘disk’, ‘tasks’, ‘p2p’}, optional

Either 'disk' for single-node operation or 'tasks' and 'p2p' for distributed operation. If not specified, the method is inferred from your current scheduler.

ignore_index: bool, default False

Ignore index during shuffle. If True, performance may improve, but index values will not be preserved.

compute: bool

Whether or not to trigger an immediate computation. Defaults to False.

Notes

This does not preserve a meaningful index/partitioning scheme. This is not deterministic if done in parallel.

Examples

>>> df = df.shuffle(df.columns[0])  
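
The idea of mapping rows to output partitions by hashing on can be sketched without Dask. The following is a conceptual illustration only, not the algorithm Dask uses; hash_partition and the sample rows are hypothetical, and integer keys are used because Python hashes small integers deterministically.

```python
from collections import defaultdict


def hash_partition(rows, key, npartitions):
    """Group rows into buckets by hashing the value stored under `key`."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[hash(row[key]) % npartitions].append(row)
    return dict(partitions)


rows = [
    {"user": 1, "amount": 10},
    {"user": 2, "amount": 5},
    {"user": 1, "amount": 7},
    {"user": 3, "amount": 2},
]
parts = hash_partition(rows, "user", npartitions=2)

# For each user, collect the set of partitions that hold its rows.
user_partitions = {
    user: {pid for pid, part in parts.items() for r in part if r["user"] == user}
    for user in {r["user"] for r in rows}
}
```

Each user ends up in exactly one partition, which is the guarantee described above for rows sharing the same value of on.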
property size#

Size of the Series or DataFrame as a Delayed object.

Examples

>>> series.size  
dd.Scalar<size-ag..., dtype=int64>
skew(axis=0, bias=True, nan_policy='propagate', out=None, numeric_only=_NoDefault.no_default)#

Return unbiased skew over requested axis.

This docstring was copied from pandas.core.frame.DataFrame.skew.

Some inconsistencies with the Dask version may exist.

Note

This implementation follows the dask.array.stats implementation of skewness and calculates skewness without taking into account a bias term for finite sample size, which corresponds to the default settings of the scipy.stats skewness calculation. However, Pandas corrects for this, so the values differ by a factor of (n * (n - 1)) ** 0.5 / (n - 2), where n is the number of samples.

Further, this method currently does not support filtering out NaN values, which is another difference from Pandas.

Normalized by N-1.

Parameters:
axis{index (0), columns (1)}

Axis for the function to be applied on. For Series this parameter is unused and defaults to 0.

skipnabool, default True (Not supported in Dask)

Exclude NA/null values when computing the result.

levelint or level name, default None (Not supported in Dask)

If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series.

Deprecated since version 1.3.0: The level keyword is deprecated. Use groupby instead.

numeric_onlybool, default None

Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for Series.

Deprecated since version 1.5.0: Specifying numeric_only=None is deprecated. The default value will be False in a future version of pandas.

**kwargs

Additional keyword arguments to be passed to the function.

Returns:
Series or DataFrame (if level specified)
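
The relationship between the two skewness conventions mentioned in the note can be checked with a small standard-library sketch. The function names are hypothetical and the sample data is made up; the code only illustrates the stated factor (n * (n - 1)) ** 0.5 / (n - 2).

```python
import math


def biased_skew(values):
    """Skewness without a finite-sample correction (ratio of central moments)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


def bias_corrected_skew(values):
    """Apply the finite-sample correction factor quoted in the note."""
    n = len(values)
    return biased_skew(values) * math.sqrt(n * (n - 1)) / (n - 2)


data = [1.0, 2.0, 3.0, 10.0]
g1 = biased_skew(data)
G1 = bias_corrected_skew(data)
ratio = G1 / g1
```

For n = 4 the ratio is sqrt(12) / 2, so the two conventions can differ noticeably on small samples.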
squeeze()#

Squeeze 1 dimensional axis objects into scalars.

This docstring was copied from pandas.core.series.Series.squeeze.

Some inconsistencies with the Dask version may exist.

Series or DataFrames with a single element are squeezed to a scalar. DataFrames with a single column or a single row are squeezed to a Series. Otherwise the object is unchanged.

This method is most useful when you don’t know if your object is a Series or DataFrame, but you do know it has just a single column. In that case you can safely call squeeze to ensure you have a Series.

Parameters:
axis{0 or ‘index’, 1 or ‘columns’, None}, default None (Not supported in Dask)

A specific axis to squeeze. By default, all length-1 axes are squeezed. For Series this parameter is unused and defaults to None.

Returns:
DataFrame, Series, or scalar

The projection after squeezing axis or all the axes.

See also

Series.iloc

Integer-location based indexing for selecting scalars.

DataFrame.iloc

Integer-location based indexing for selecting Series.

Series.to_frame

Inverse of DataFrame.squeeze for a single-column DataFrame.

Examples

>>> primes = pd.Series([2, 3, 5, 7])  

Slicing might produce a Series with a single value:

>>> even_primes = primes[primes % 2 == 0]  
>>> even_primes  
0    2
dtype: int64
>>> even_primes.squeeze()  
2

Squeezing objects with more than one value in every axis does nothing:

>>> odd_primes = primes[primes % 2 == 1]  
>>> odd_primes  
1    3
2    5
3    7
dtype: int64
>>> odd_primes.squeeze()  
1    3
2    5
3    7
dtype: int64

Squeezing is even more effective when used with DataFrames.

>>> df = pd.DataFrame([[1, 2], [3, 4]], columns=['a', 'b'])  
>>> df  
   a  b
0  1  2
1  3  4

Slicing a single column will produce a DataFrame with the columns having only one value:

>>> df_a = df[['a']]  
>>> df_a  
   a
0  1
1  3

So the columns can be squeezed down, resulting in a Series:

>>> df_a.squeeze('columns')  
0    1
1    3
Name: a, dtype: int64

Slicing a single row from a single column will produce a single scalar DataFrame:

>>> df_0a = df.loc[df.index < 1, ['a']]  
>>> df_0a  
   a
0  1

Squeezing the rows produces a single scalar Series:

>>> df_0a.squeeze('rows')  
a    1
Name: 0, dtype: int64

Squeezing all axes will project directly into a scalar:

>>> df_0a.squeeze()  
1
std(axis=None, skipna=True, ddof=1, split_every=False, dtype=None, out=None, numeric_only=_NoDefault.no_default)#

Return sample standard deviation over requested axis.

This docstring was copied from pandas.core.frame.DataFrame.std.

Some inconsistencies with the Dask version may exist.

Normalized by N-1 by default. This can be changed using the ddof argument.

Parameters:
axis{index (0), columns (1)}

For Series this parameter is unused and defaults to 0.

skipnabool, default True

Exclude NA/null values. If an entire row/column is NA, the result will be NA.

levelint or level name, default None (Not supported in Dask)

If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series.

Deprecated since version 1.3.0: The level keyword is deprecated. Use groupby instead.

ddofint, default 1

Delta Degrees of Freedom. The divisor used in calculations is N - ddof, where N represents the number of elements.

numeric_onlybool, default None

Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for Series.

Deprecated since version 1.5.0: Specifying numeric_only=None is deprecated. The default value will be False in a future version of pandas.

Returns:
Series or DataFrame (if level specified)

Notes

To have the same behaviour as numpy.std, use ddof=0 (instead of the default ddof=1)

Examples

>>> df = pd.DataFrame({'person_id': [0, 1, 2, 3],  
...                   'age': [21, 25, 62, 43],
...                   'height': [1.61, 1.87, 1.49, 2.01]}
...                  ).set_index('person_id')
>>> df  
           age  height
person_id
0           21    1.61
1           25    1.87
2           62    1.49
3           43    2.01

The standard deviation of the columns can be found as follows:

>>> df.std()  
age       18.786076
height     0.237417

Alternatively, ddof=0 can be set to normalize by N instead of N-1:

>>> df.std(ddof=0)  
age       16.269219
height     0.205609
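
The effect of ddof on the divisor can be reproduced with a short standard-library sketch. The helper name sample_std is hypothetical; the input is the age column from the example above.

```python
import math


def sample_std(values, ddof=1):
    """Standard deviation with divisor N - ddof."""
    n = len(values)
    mean = sum(values) / n
    squared_deviations = sum((v - mean) ** 2 for v in values)
    return math.sqrt(squared_deviations / (n - ddof))


ages = [21, 25, 62, 43]
std_sample = sample_std(ages)              # divisor N - 1
std_population = sample_std(ages, ddof=0)  # divisor N
```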
str#

alias of StringAccessor

sub(other, level=None, fill_value=None, axis=0)#

Get Subtraction of DataFrame or Series and other, element-wise (binary operator sub).

This docstring was copied from cudf.core.series.Series.sub.

Some inconsistencies with the Dask version may exist.

Equivalent to frame - other, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters:
otherscalar, sequence, Series, or DataFrame

Any single or multiple element data structure, or list-like object.

axisint or string

Only 0 is supported for series, 1 or columns supported for dataframe

levelint or name

Broadcast across a level, matching Index values on the passed MultiIndex level. Not yet supported.

fill_valuefloat or None, default None

Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing.

Returns:
DataFrame or Series

Result of the arithmetic operation.

Examples

DataFrame

>>> df = cudf.DataFrame(  
...     {'angles': [0, 3, 4], 'degrees': [360, 180, 360]},
...     index=['circle', 'triangle', 'rectangle']
... )
>>> df.sub(1)  
        angles  degrees
circle         -1      359
triangle        2      179
rectangle       3      359

Series

>>> a = cudf.Series([1, 1, 1, None], index=['a', 'b', 'c', 'd'])  
>>> b = cudf.Series([1, None, 1, None], index=['a', 'b', 'd', 'e'])  
>>> a.sub(b)  
a       0
b    <NA>
c    <NA>
d    <NA>
e    <NA>
dtype: int64
>>> a.sub(b, fill_value=0)  
a       0
b       1
c       1
d      -1
e    <NA>
dtype: int64
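
The fill_value rule described above (fill a value missing on one side, keep the result missing when both sides are missing) can be sketched with plain dictionaries. This is an illustration of the semantics only, not cuDF code; sub_with_fill is a hypothetical helper and None stands in for a missing value.

```python
def sub_with_fill(left, right, fill_value=None):
    """Label-aligned subtraction with optional filling of one-sided gaps."""
    result = {}
    for label in sorted(set(left) | set(right)):
        a = left.get(label)
        b = right.get(label)
        if a is None and b is None:
            # Missing on both sides: the result stays missing.
            result[label] = None
            continue
        a = fill_value if a is None else a
        b = fill_value if b is None else b
        result[label] = None if a is None or b is None else a - b
    return result


a = {"a": 1, "b": 1, "c": 1, "d": None}
b = {"a": 1, "b": None, "d": 1, "e": None}
plain = sub_with_fill(a, b)
filled = sub_with_fill(a, b, fill_value=0)
```

With fill_value=0, label d becomes 0 - 1 = -1, while label e stays missing because neither input has a value there.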
sum(axis=None, skipna=True, split_every=False, dtype=None, out=None, min_count=None, numeric_only=None)#

Return the sum of the values over the requested axis.

This docstring was copied from pandas.core.frame.DataFrame.sum.

Some inconsistencies with the Dask version may exist.

This is equivalent to the method numpy.sum.

Parameters:
axis{index (0), columns (1)}

Axis for the function to be applied on. For Series this parameter is unused and defaults to 0.

skipnabool, default True

Exclude NA/null values when computing the result.

levelint or level name, default None (Not supported in Dask)

If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series.

Deprecated since version 1.3.0: The level keyword is deprecated. Use groupby instead.

numeric_onlybool, default None

Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for Series.

Deprecated since version 1.5.0: Specifying numeric_only=None is deprecated. The default value will be False in a future version of pandas.

min_countint, default 0

The required number of valid values to perform the operation. If fewer than min_count non-NA values are present the result will be NA.

**kwargs

Additional keyword arguments to be passed to the function.

Returns:
Series or DataFrame (if level specified)

See also

Series.sum

Return the sum.

Series.min

Return the minimum.

Series.max

Return the maximum.

Series.idxmin

Return the index of the minimum.

Series.idxmax

Return the index of the maximum.

DataFrame.sum

Return the sum over the requested axis.

DataFrame.min

Return the minimum over the requested axis.

DataFrame.max

Return the maximum over the requested axis.

DataFrame.idxmin

Return the index of the minimum over the requested axis.

DataFrame.idxmax

Return the index of the maximum over the requested axis.

Examples

>>> idx = pd.MultiIndex.from_arrays([  
...     ['warm', 'warm', 'cold', 'cold'],
...     ['dog', 'falcon', 'fish', 'spider']],
...     names=['blooded', 'animal'])
>>> s = pd.Series([4, 2, 0, 8], name='legs', index=idx)  
>>> s  
blooded  animal
warm     dog       4
         falcon    2
cold     fish      0
         spider    8
Name: legs, dtype: int64
>>> s.sum()  
14

By default, the sum of an empty or all-NA Series is 0.

>>> pd.Series([], dtype="float64").sum()  # min_count=0 is the default  
0.0

This can be controlled with the min_count parameter. For example, if you’d like the sum of an empty series to be NaN, pass min_count=1.

>>> pd.Series([], dtype="float64").sum(min_count=1)  
nan

Thanks to the skipna parameter, min_count handles all-NA and empty series identically.

>>> pd.Series([np.nan]).sum()  
0.0
>>> pd.Series([np.nan]).sum(min_count=1)  
nan
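
The interaction between skipna and min_count can be sketched with the standard library. The helper nansum is hypothetical and only mirrors the behaviour shown in the examples above for float data.

```python
import math


def nansum(values, min_count=0):
    """Sum the non-NaN values; return NaN if fewer than min_count are present."""
    valid = [v for v in values if not math.isnan(v)]
    if len(valid) < min_count:
        return math.nan
    return float(sum(valid))


empty_default = nansum([])
empty_strict = nansum([], min_count=1)
all_nan_default = nansum([math.nan])
all_nan_strict = nansum([math.nan], min_count=1)
```

Because NaN values are dropped before counting, an all-NaN input and an empty input are treated the same way.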
tail(n=5, compute=True)#

Last n rows of the dataset

Caveat: this only checks the last n rows of the last partition.
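
A consequence of this caveat is that fewer than n rows may come back when the last partition is short. The sketch below models partitions as plain lists; the tail helper is hypothetical and is not the Dask implementation.

```python
def tail(partitions, n=5):
    """Return up to n trailing rows, looking only at the last partition."""
    return partitions[-1][-n:]


partitions = [[1, 2, 3, 4], [5, 6]]
last_three = tail(partitions, n=3)
```

Here only two rows are returned, even though the dataset as a whole holds six.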

to_backend(backend: str | None = None, **kwargs)#

Move to a new DataFrame backend

Parameters:
backendstr, Optional

The name of the new backend to move to. The default is the current “dataframe.backend” configuration.

Returns:
DataFrame, Series or Index
to_bag(index=False, format='tuple')#

Create a Dask Bag from a Series

to_csv(filename, **kwargs)#

Store Dask DataFrame to CSV files

One filename per partition will be created. You can specify the filenames in a variety of ways.

Use a globstring:

>>> df.to_csv('/path/to/data/export-*.csv')  

The * will be replaced by the increasing sequence 0, 1, 2, …

/path/to/data/export-0.csv
/path/to/data/export-1.csv

Use a globstring and a name_function= keyword argument. The name_function function should expect an integer and produce a string. Strings produced by name_function must preserve the order of their respective partition indices.

>>> from datetime import date, timedelta
>>> def name(i):
...     return str(date(2015, 1, 1) + i * timedelta(days=1))
>>> name(0)
'2015-01-01'
>>> name(15)
'2015-01-16'
>>> df.to_csv('/path/to/data/export-*.csv', name_function=name)  
/path/to/data/export-2015-01-01.csv
/path/to/data/export-2015-01-02.csv
...

You can also provide an explicit list of paths:

>>> paths = ['/path/to/data/alice.csv', '/path/to/data/bob.csv', ...]  
>>> df.to_csv(paths) 

You can also provide a directory name:

>>> df.to_csv('/path/to/data') 

The files will be numbered 0, 1, 2, (and so on) suffixed with ‘.part’:

/path/to/data/0.part
/path/to/data/1.part
Parameters:
dfdask.DataFrame

Data to save

filenamestring or list

Absolute or relative filepath(s). Prefix with a protocol like s3:// to save to remote filesystems.

single_filebool, default False

Whether to save everything into a single CSV file. Under the single file mode, each partition is appended at the end of the specified CSV file.

encodingstring, default ‘utf-8’

A string representing the encoding to use in the output file.

modestr, default ‘w’

Python file mode. The default is ‘w’ (or ‘wt’), for writing a new file or overwriting an existing file in text mode. ‘a’ (or ‘at’) will append to an existing file in text mode or create a new file if it does not already exist. See open().

name_functioncallable, default None

Function accepting an integer (partition index) and producing a string to replace the asterisk in the given filename globstring. Should preserve the lexicographic order of partitions. Not supported when single_file is True.

compressionstring, optional

A string representing the compression to use in the output file, allowed values are ‘gzip’, ‘bz2’, ‘xz’, only used when the first argument is a filename.

computebool, default True

If True, immediately executes. If False, returns a set of delayed objects, which can be computed at a later time.

storage_optionsdict

Parameters passed on to the backend filesystem class.

header_first_partition_onlybool, default None

If set to True, only write the header row in the first output file. By default, headers are written to all partitions under the multiple file mode (single_file is False) and written only once under the single file mode (single_file is True). It must be True under the single file mode.

compute_kwargsdict, optional

Options to be passed in to the compute method

kwargsdict, optional

Additional parameters to pass to pandas.DataFrame.to_csv().

Returns:
The names of the file written if they were computed right away.
If not, the delayed tasks associated with writing the files.
Raises:
ValueError

If header_first_partition_only is set to False or name_function is specified when single_file is True.

See also

fsspec.open_files
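
How a globstring and name_function combine into per-partition paths can be sketched as follows. The helper expand_globstring is hypothetical and only approximates the documented behaviour; the explicit ordering check is an assumption added to illustrate the requirement that names preserve partition order.

```python
from datetime import date, timedelta


def name(i):
    """Map a partition index to an ISO date string."""
    return str(date(2015, 1, 1) + i * timedelta(days=1))


def expand_globstring(pattern, npartitions, name_function=str):
    """Replace '*' in pattern with one generated name per partition."""
    names = [name_function(i) for i in range(npartitions)]
    if names != sorted(names):
        raise ValueError("name_function must preserve partition order")
    return [pattern.replace("*", n) for n in names]


default_paths = expand_globstring("/path/to/data/export-*.csv", 2)
dated_paths = expand_globstring("/path/to/data/export-*.csv", 2, name)
```

ISO dates sort lexicographically in chronological order, which is why they are a convenient choice for name_function.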
to_dask_array(lengths=None, meta=None)#

Convert a dask DataFrame to a dask array.

Parameters:
lengthsbool or Sequence of ints, optional

How to determine the chunks sizes for the output array. By default, the output array will have unknown chunk lengths along the first axis, which can cause some later operations to fail.

  • True : immediately compute the length of each partition

  • Sequence : a sequence of integers to use for the chunk sizes on the first axis. These values are not validated for correctness, beyond ensuring that the number of items matches the number of partitions.

metaobject, optional

An optional meta parameter can be passed for dask to override the default metadata on the underlying dask array.

Returns:
to_dask_dataframe(**kwargs)#

Create a dask.dataframe object from a dask_cudf object

to_delayed(optimize_graph=True)#

Convert into a list of dask.delayed objects, one per partition.

Parameters:
optimize_graphbool, optional

If True [default], the graph is optimized before converting into dask.delayed objects.

Examples

>>> partitions = df.to_delayed()  
to_frame(name=None)#

Convert Series to DataFrame.

This docstring was copied from pandas.core.series.Series.to_frame.

Some inconsistencies with the Dask version may exist.

Parameters:
nameobject, optional

The passed name should substitute for the series name (if it has one).

Returns:
DataFrame

DataFrame representation of Series.

Examples

>>> s = pd.Series(["a", "b", "c"],  
...               name="vals")
>>> s.to_frame()  
  vals
0    a
1    b
2    c
to_hdf(path_or_buf, key, mode='a', append=False, **kwargs)#

Store Dask Dataframe to Hierarchical Data Format (HDF) files

This is a parallel version of the Pandas function of the same name. Please see the Pandas docstring for more detailed information about shared keyword arguments.

This function differs from the Pandas version by saving the many partitions of a Dask DataFrame in parallel, either to many files, or to many datasets within the same file. You may specify this parallelism with an asterisk * within the filename or datapath, and an optional name_function. The asterisk will be replaced with an increasing sequence of integers starting from 0 or with the result of calling name_function on each of those integers.

This function only supports the Pandas 'table' format, not the more specialized 'fixed' format.

Parameters:
pathstring, pathlib.Path

Path to a target filename. Supports strings, pathlib.Path, or any object implementing the __fspath__ protocol. May contain a * to denote many filenames.

keystring

Datapath within the files. May contain a * to denote many locations

name_functionfunction

A function to convert the * in the above options to a string. Should take in a number from 0 to the number of partitions and return a string. (see examples below)

computebool

Whether or not to execute immediately. If False then this returns a dask.Delayed value.

lockbool, Lock, optional

Lock to use to prevent concurrency issues. By default a threading.Lock, multiprocessing.Lock or SerializableLock will be used depending on your scheduler if a lock is required. See dask.utils.get_scheduler_lock for more information about lock selection.

schedulerstring

The scheduler to use, like “threads” or “processes”

**other:

See pandas.to_hdf for more information

Returns:
filenameslist

Returned if compute is True. List of file names that each partition is saved to.

delayeddask.Delayed

Returned if compute is False. Delayed object to execute to_hdf when computed.

See also

read_hdf
to_parquet

Examples

Save Data to a single file

>>> df.to_hdf('output.hdf', '/data')            

Save data to multiple datapaths within the same file:

>>> df.to_hdf('output.hdf', '/data-*')          

Save data to multiple files:

>>> df.to_hdf('output-*.hdf', '/data')          

Save data to multiple files, using the multiprocessing scheduler:

>>> df.to_hdf('output-*.hdf', '/data', scheduler='processes') 

Specify custom naming scheme. This writes files as ‘2000-01-01.hdf’, ‘2000-01-02.hdf’, ‘2000-01-03.hdf’, etc.

>>> from datetime import date, timedelta
>>> base = date(year=2000, month=1, day=1)
>>> def name_function(i):
...     ''' Convert integer 0 to n to a string '''
...     return str(base + timedelta(days=i))
>>> df.to_hdf('*.hdf', '/data', name_function=name_function) 
to_json(filename, *args, **kwargs)#

See dd.to_json docstring for more information

to_sql(name: str, uri: str, schema=None, if_exists: str = 'fail', index: bool = True, index_label=None, chunksize=None, dtype=None, method=None, compute=True, parallel=False, engine_kwargs=None)#

See dd.to_sql docstring for more information

to_string(max_rows=5)#

Render a string representation of the Series.

This docstring was copied from pandas.core.series.Series.to_string.

Some inconsistencies with the Dask version may exist.

Parameters:
bufStringIO-like, optional (Not supported in Dask)

Buffer to write to.

na_repstr, optional (Not supported in Dask)

String representation of NaN to use, default ‘NaN’.

float_formatone-parameter function, optional (Not supported in Dask)

Formatter function to apply to columns’ elements if they are floats, default None.

headerbool, default True (Not supported in Dask)

Add the Series header (index name).

indexbool, optional (Not supported in Dask)

Add index (row) labels, default True.

lengthbool, default False (Not supported in Dask)

Add the Series length.

dtypebool, default False (Not supported in Dask)

Add the Series dtype.

namebool, default False (Not supported in Dask)

Add the Series name if not None.

max_rowsint, optional

Maximum number of rows to show before truncating. If None, show all.

min_rowsint, optional (Not supported in Dask)

The number of rows to display in a truncated repr (when number of rows is above max_rows).

Returns:
str or None

String representation of Series if buf=None, otherwise None.

to_timestamp(freq=None, how='start', axis=0)#

Cast to DatetimeIndex of Timestamps, at beginning of period.

This docstring was copied from pandas.core.series.Series.to_timestamp.

Some inconsistencies with the Dask version may exist.

Parameters:
freqstr, default frequency of PeriodIndex

Desired frequency.

how{‘s’, ‘e’, ‘start’, ‘end’}

Convention for converting period to timestamp; start of period vs. end.

copybool, default True (Not supported in Dask)

Whether or not to return a copy.

Returns:
Series with DatetimeIndex
truediv(other, level=None, fill_value=None, axis=0)#

Get Floating division of DataFrame or Series and other, element-wise (binary operator truediv).

This docstring was copied from cudf.core.series.Series.truediv.

Some inconsistencies with the Dask version may exist.

Equivalent to frame / other, but with support to substitute a fill_value for missing data in one of the inputs.

Parameters:
otherscalar, sequence, Series, or DataFrame

Any single or multiple element data structure, or list-like object.

axisint or string

Only 0 is supported for series, 1 or columns supported for dataframe

levelint or name

Broadcast across a level, matching Index values on the passed MultiIndex level. Not yet supported.

fill_valuefloat or None, default None

Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing.

Returns:
DataFrame or Series

Result of the arithmetic operation.

Examples

DataFrame

>>> df = cudf.DataFrame(  
...     {'angles': [0, 3, 4], 'degrees': [360, 180, 360]},
...     index=['circle', 'triangle', 'rectangle']
... )
>>> df.truediv(1)  
        angles  degrees
circle        0.0    360.0
triangle      3.0    180.0
rectangle     4.0    360.0

Series

>>> a = cudf.Series([1, 1, 1, None], index=['a', 'b', 'c', 'd'])  
>>> b = cudf.Series([1, None, 1, None], index=['a', 'b', 'd', 'e'])  
>>> a.truediv(b)  
a     1.0
b    <NA>
c    <NA>
d    <NA>
e    <NA>
dtype: float64
>>> a.truediv(b, fill_value=0)  
a     1.0
b     Inf
c     Inf
d     0.0
e    <NA>
dtype: float64
unique(split_every=None, split_out=1)#

Return Series of unique values in the object. Includes NA values.

Returns:
uniquesSeries
value_counts(sort=None, ascending=False, dropna=True, normalize=False, split_every=None, split_out=1)#

Return a Series containing counts of unique values.

This docstring was copied from pandas.core.series.Series.value_counts.

Some inconsistencies with the Dask version may exist.

The resulting object will be in descending order so that the first element is the most frequently-occurring element. Excludes NA values by default.

Parameters:
normalizebool, default False

If True then the object returned will contain the relative frequencies of the unique values.

sortbool, default True

Sort by frequencies.

ascendingbool, default False

Sort in ascending order.

binsint, optional (Not supported in Dask)

Rather than count values, group them into half-open bins, a convenience for pd.cut, only works with numeric data.

dropnabool, default True

Don’t include counts of NaN.

Returns:
Series

See also

Series.count

Number of non-NA elements in a Series.

DataFrame.count

Number of non-NA elements in a DataFrame.

DataFrame.value_counts

Equivalent method on DataFrames.

Examples

>>> index = pd.Index([3, 1, 2, 3, 4, np.nan])  
>>> index.value_counts()  
3.0    2
1.0    1
2.0    1
4.0    1
dtype: int64

With normalize set to True, returns the relative frequency by dividing all values by the sum of values.

>>> s = pd.Series([3, 1, 2, 3, 4, np.nan])  
>>> s.value_counts(normalize=True)  
3.0    0.4
1.0    0.2
2.0    0.2
4.0    0.2
dtype: float64

bins

Bins can be useful for going from a continuous variable to a categorical variable; instead of counting unique occurrences of values, divide the index into the specified number of half-open bins.

>>> s.value_counts(bins=3)  
(0.996, 2.0]    2
(2.0, 3.0]      2
(3.0, 4.0]      1
dtype: int64

dropna

With dropna set to False we can also see NaN index values.

>>> s.value_counts(dropna=False)  
3.0    2
1.0    1
2.0    1
4.0    1
NaN    1
dtype: int64
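
The normalize and dropna options can be illustrated with collections.Counter. This is a simplified sketch, not the pandas or Dask implementation; the value_counts helper is hypothetical and None stands in for NaN.

```python
from collections import Counter


def value_counts(values, normalize=False, dropna=True):
    """Return (value, count) pairs in descending order of count."""
    kept = [v for v in values if not (dropna and v is None)]
    ordered = Counter(kept).most_common()
    if normalize:
        total = len(kept)
        return [(value, count / total) for value, count in ordered]
    return ordered


s = [3, 1, 2, 3, 4, None]
counts = value_counts(s)
frequencies = value_counts(s, normalize=True)
with_missing = value_counts(s, dropna=False)
```

With normalize=True each count is divided by the number of retained values, so the frequencies sum to 1.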
property values#

Return a dask.array of the values of this dataframe

Warning: This creates a dask.array without precise shape information. Operations that depend on shape information, like slicing or reshaping, will not work.

var(axis=None, skipna=True, ddof=1, split_every=False, dtype=None, out=None, naive=False)#

Return unbiased variance over requested axis.

This docstring was copied from pandas.core.frame.DataFrame.var.

Some inconsistencies with the Dask version may exist.

Normalized by N-1 by default. This can be changed using the ddof argument.

Parameters:
axis{index (0), columns (1)}

For Series this parameter is unused and defaults to 0.

skipnabool, default True

Exclude NA/null values. If an entire row/column is NA, the result will be NA.

levelint or level name, default None (Not supported in Dask)

If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series.

Deprecated since version 1.3.0: The level keyword is deprecated. Use groupby instead.

ddofint, default 1

Delta Degrees of Freedom. The divisor used in calculations is N - ddof, where N represents the number of elements.

numeric_onlybool, default None (Not supported in Dask)

Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for Series.

Deprecated since version 1.5.0: Specifying numeric_only=None is deprecated. The default value will be False in a future version of pandas.

Returns:
Series or DataFrame (if level specified)

Examples

>>> df = pd.DataFrame({'person_id': [0, 1, 2, 3],  
...                   'age': [21, 25, 62, 43],
...                   'height': [1.61, 1.87, 1.49, 2.01]}
...                  ).set_index('person_id')
>>> df  
           age  height
person_id
0           21    1.61
1           25    1.87
2           62    1.49
3           43    2.01
>>> df.var()  
age       352.916667
height      0.056367

Alternatively, ddof=0 can be set to normalize by N instead of N-1:

>>> df.var(ddof=0)  
age       264.687500
height      0.042275
view(dtype)#

Create a new view of the Series.

This docstring was copied from pandas.core.series.Series.view.

Some inconsistencies with the Dask version may exist.

This function will return a new Series with a view of the same underlying values in memory, optionally reinterpreted with a new data type. The new data type must preserve the same size in bytes so as not to cause index misalignment.

Parameters:
dtypedata type

Data type object or one of their string representations.

Returns:
Series

A new Series object as a view of the same data in memory.

See also

numpy.ndarray.view

Equivalent numpy function to create a new view of the same data in memory.

Notes

Series are instantiated with dtype=float64 by default. While numpy.ndarray.view() will return a view with the same data type as the original array, Series.view() (without specified dtype) will try using float64 and may fail if the original data type size in bytes is not the same.

Examples

>>> s = pd.Series([-2, -1, 0, 1, 2], dtype='int8')  
>>> s  
0   -2
1   -1
2    0
3    1
4    2
dtype: int8

The 8 bit signed integer representation of -1 is 0b11111111, but the same bytes represent 255 if read as an 8 bit unsigned integer:

>>> us = s.view('uint8')  
>>> us  
0    254
1    255
2      0
3      1
4      2
dtype: uint8

The views share the same underlying values:

>>> us[0] = 128  
>>> s  
0   -128
1     -1
2      0
3      1
4      2
dtype: int8
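
The same reinterpretation of bytes can be reproduced with the standard library, using array and memoryview in place of a Series. This is an analogy for illustration, not how pandas or cuDF implement view.

```python
import array

# Five signed 8-bit integers, matching the example above.
s = array.array("b", [-2, -1, 0, 1, 2])

# Reinterpret the same buffer as unsigned 8-bit integers without copying.
us = memoryview(s).cast("B")
unsigned_values = list(us)

# Writing through the view changes the original array as well.
us[0] = 128
first_signed = s[0]
```

The byte 0b10000000 reads as 128 when unsigned and as -128 when signed, which is why the original array changes in this way.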
visualize(filename='mydask', format=None, optimize_graph=False, **kwargs)#

Render the computation of this object’s task graph using graphviz.

Requires graphviz to be installed.

Parameters:
filenamestr or None, optional

The name of the file to write to disk. If the provided filename doesn’t include an extension, ‘.png’ will be used by default. If filename is None, no file will be written, and we communicate with dot using only pipes.

format{‘png’, ‘pdf’, ‘dot’, ‘svg’, ‘jpeg’, ‘jpg’}, optional

Format in which to write output file. Default is ‘png’.

optimize_graphbool, optional

If True, the graph is optimized before rendering. Otherwise, the graph is displayed as is. Default is False.

color: {None, ‘order’}, optional

Options to color nodes. Provide cmap= keyword for additional colormap

**kwargs

Additional keyword arguments to forward to to_graphviz.

Returns:
resultIPython.display.Image, IPython.display.SVG, or None

See dask.dot.dot_graph for more information.

See also

dask.visualize
dask.dot.dot_graph

Notes

For more information on optimization see here:

https://docs.dask.org/en/latest/optimize.html

Examples

>>> x.visualize(filename='dask.pdf')  
>>> x.visualize(filename='dask.pdf', color='order')  
where(cond, other=nan)#

Replace values where the condition is False.

This docstring was copied from pandas.core.frame.DataFrame.where.

Some inconsistencies with the Dask version may exist.

Parameters:
condbool Series/DataFrame, array-like, or callable

Where cond is True, keep the original value. Where False, replace with corresponding value from other. If cond is callable, it is computed on the Series/DataFrame and should return boolean Series/DataFrame or array. The callable must not change input Series/DataFrame (though pandas doesn’t check it).

otherscalar, Series/DataFrame, or callable

Entries where cond is False are replaced with corresponding value from other. If other is callable, it is computed on the Series/DataFrame and should return scalar or Series/DataFrame. The callable must not change input Series/DataFrame (though pandas doesn’t check it).

inplacebool, default False (Not supported in Dask)

Whether to perform the operation in place on the data.

axisint, default None (Not supported in Dask)

Alignment axis if needed. For Series this parameter is unused and defaults to 0.

levelint, default None (Not supported in Dask)

Alignment level if needed.

errorsstr, {‘raise’, ‘ignore’}, default ‘raise’ (Not supported in Dask)

Note that currently this parameter won’t affect the results and will always coerce to a suitable dtype.

  • ‘raise’ : allow exceptions to be raised.

  • ‘ignore’ : suppress exceptions. On error return original object.

Deprecated since version 1.5.0: This argument had no effect.

try_castbool, default None (Not supported in Dask)

Try to cast the result back to the input type (if possible).

Deprecated since version 1.3.0: Manually cast back if necessary.

Returns:
Same type as caller or None if inplace=True.

See also

DataFrame.mask()

Return an object of same shape as self.

Notes

The where method is an application of the if-then idiom. For each element in the calling DataFrame, if cond is True the element is used; otherwise the corresponding element from the DataFrame other is used. If the axis of other does not align with axis of cond Series/DataFrame, the misaligned index positions will be filled with False.

The signature for DataFrame.where() differs from numpy.where(). Roughly df1.where(m, df2) is equivalent to np.where(m, df1, df2).

For further details and examples see the where documentation in indexing.

The dtype of the object takes precedence. The fill value is cast to the object’s dtype, if this can be done losslessly.

Examples

>>> s = pd.Series(range(5))  
>>> s.where(s > 0)  
0    NaN
1    1.0
2    2.0
3    3.0
4    4.0
dtype: float64
>>> s.mask(s > 0)  
0    0.0
1    NaN
2    NaN
3    NaN
4    NaN
dtype: float64
>>> s = pd.Series(range(5))  
>>> t = pd.Series([True, False])  
>>> s.where(t, 99)  
0     0
1    99
2    99
3    99
4    99
dtype: int64
>>> s.mask(t, 99)  
0    99
1     1
2    99
3    99
4    99
dtype: int64
>>> s.where(s > 1, 10)  
0    10
1    10
2     2
3     3
4     4
dtype: int64
>>> s.mask(s > 1, 10)  
0     0
1     1
2    10
3    10
4    10
dtype: int64
>>> df = pd.DataFrame(np.arange(10).reshape(-1, 2), columns=['A', 'B'])  
>>> df  
   A  B
0  0  1
1  2  3
2  4  5
3  6  7
4  8  9
>>> m = df % 3 == 0  
>>> df.where(m, -df)  
   A  B
0  0 -1
1 -2  3
2 -4 -5
3  6 -7
4 -8  9
>>> df.where(m, -df) == np.where(m, df, -df)  
      A     B
0  True  True
1  True  True
2  True  True
3  True  True
4  True  True
>>> df.where(m, -df) == df.mask(~m, -df)  
      A     B
0  True  True
1  True  True
2  True  True
3  True  True
4  True  True
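
The if-then idiom described in the Notes can also be illustrated without pandas. The following is a minimal plain-Python sketch. The helpers where and mask are hypothetical stand-ins that operate on lists; they are not the pandas or Dask implementations, and they ignore index alignment and dtype handling.

```python
import math


def _pick(other, i):
    # `other` may be a scalar or a list aligned with the values.
    return other[i] if isinstance(other, list) else other


def where(values, cond, other=math.nan):
    # Keep the value where cond is True, otherwise take `other`.
    return [v if c else _pick(other, i)
            for i, (v, c) in enumerate(zip(values, cond))]


def mask(values, cond, other=math.nan):
    # Inverse of where: replace the value where cond is True.
    return [_pick(other, i) if c else v
            for i, (v, c) in enumerate(zip(values, cond))]


values = list(range(10))
m = [v % 3 == 0 for v in values]
negated = [-v for v in values]

kept = where(values, m, negated)
same = mask(values, [not c for c in m], negated)
print(kept)
print(kept == same)
```

As in the DataFrame example, where(m, other) and mask(~m, other) select the same elements.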
dask_cudf.concat(dfs, axis=0, join='outer', interleave_partitions=False, ignore_unknown_divisions=False, ignore_order=False, **kwargs)#

Concatenate DataFrames along rows.

  • When axis=0 (default), concatenate DataFrames row-wise:

    • If all divisions are known and ordered, concatenate the DataFrames and keep their divisions. When the divisions are not ordered, specifying interleave_partitions=True allows the partitions to be interleaved.

    • If any division is unknown, concatenate the DataFrames and reset the divisions of the result to unknown (None).

  • When axis=1, concatenate DataFrames column-wise:

    • Allowed if all divisions are known.

    • If any division is unknown, a ValueError is raised.

Parameters:
dfslist

List of dask.DataFrames to be concatenated

axis{0, 1, ‘index’, ‘columns’}, default 0

The axis to concatenate along

join{‘inner’, ‘outer’}, default ‘outer’

How to handle indexes on other axis

interleave_partitionsbool, default False

Whether to concatenate DataFrames ignoring the order of their divisions. If True, the partitions are interleaved according to their divisions.

ignore_unknown_divisionsbool, default False

By default a warning is raised if any input has unknown divisions. Set to True to disable this warning.

ignore_orderbool, default False

Whether to ignore order when doing the union of categoricals.

Notes

This differs from pd.concat when concatenating Categoricals with different categories. Pandas currently coerces those to objects before concatenating. Coercing to objects is very expensive for large arrays, so dask preserves the Categoricals by taking the union of the categories.

Examples

If all divisions are known and ordered, divisions are kept.

>>> import dask.dataframe as dd
>>> a                                               
dd.DataFrame<x, divisions=(1, 3, 5)>
>>> b                                               
dd.DataFrame<y, divisions=(6, 8, 10)>
>>> dd.concat([a, b])                               
dd.DataFrame<concat-..., divisions=(1, 3, 6, 8, 10)>

Unable to concatenate if divisions are not ordered.

>>> a                                               
dd.DataFrame<x, divisions=(1, 3, 5)>
>>> b                                               
dd.DataFrame<y, divisions=(2, 3, 6)>
>>> dd.concat([a, b])                               
ValueError: All inputs have known divisions which cannot be concatenated
in order. Specify interleave_partitions=True to ignore order

Specify interleave_partitions=True to ignore the division order.

>>> dd.concat([a, b], interleave_partitions=True)   
dd.DataFrame<concat-..., divisions=(1, 2, 3, 5, 6)>

If any division is unknown, the divisions of the result will be unknown:

>>> a                                               
dd.DataFrame<x, divisions=(None, None)>
>>> b                                               
dd.DataFrame<y, divisions=(1, 4, 10)>
>>> dd.concat([a, b])                               
dd.DataFrame<concat-..., divisions=(None, None, None, None)>

By default concatenating with unknown divisions will raise a warning. Set ignore_unknown_divisions=True to disable this:

>>> dd.concat([a, b], ignore_unknown_divisions=True)
dd.DataFrame<concat-..., divisions=(None, None, None, None)>

Different categoricals are unioned

>>> dd.concat([
...     dd.from_pandas(pd.Series(['a', 'b'], dtype='category'), 1),
...     dd.from_pandas(pd.Series(['a', 'c'], dtype='category'), 1),
... ], interleave_partitions=True).dtype
CategoricalDtype(categories=['a', 'b', 'c'], ordered=False, categories_dtype=object)
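
The division rules for axis=0 shown in these examples can be modelled in a few lines of plain Python. This is a simplified sketch inferred from the examples above, not Dask's actual code: the function concat_divisions is hypothetical, it takes only tuples of divisions, and it does not emit the warning for unknown divisions.

```python
def concat_divisions(all_divisions, interleave_partitions=False):
    """Return the divisions of a row-wise concatenation (simplified model)."""
    # Any unknown input makes every division of the result unknown.
    if any(d[0] is None for d in all_divisions):
        npartitions = sum(len(d) - 1 for d in all_divisions)
        return (None,) * (npartitions + 1)

    # Ordered: each input ends before the next one starts.
    ordered = all(
        left[-1] < right[0]
        for left, right in zip(all_divisions, all_divisions[1:])
    )
    if ordered:
        result = []
        for d in all_divisions[:-1]:
            result.extend(d[:-1])  # drop each input's closing boundary
        result.extend(all_divisions[-1])
        return tuple(result)

    if interleave_partitions:
        # Merge all boundaries into one sorted, de-duplicated sequence.
        return tuple(sorted(set().union(*all_divisions)))

    raise ValueError(
        "All inputs have known divisions which cannot be concatenated "
        "in order. Specify interleave_partitions=True to ignore order"
    )


print(concat_divisions([(1, 3, 5), (6, 8, 10)]))
print(concat_divisions([(1, 3, 5), (2, 3, 6)], interleave_partitions=True))
print(concat_divisions([(None, None), (1, 4, 10)]))
```

The three calls reproduce the divisions shown in the ordered, interleaved, and unknown-division examples respectively.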