Welcome to dask-cudf’s documentation!#
Dask-cuDF (pronounced “DASK KOO-dee-eff”) is an extension library for the Dask parallel computing framework that provides a cuDF-backed distributed dataframe with the same API as Dask dataframes.
If you are familiar with Dask and pandas or cuDF, then Dask-cuDF should feel familiar to you. If not, we recommend starting with 10 minutes to Dask followed by 10 minutes to cuDF and Dask-cuDF.
When running on multi-GPU systems, Dask-CUDA is recommended to simplify cluster setup while taking full advantage of the available GPU and networking hardware.
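As a minimal sketch (assuming dask-cuda is installed alongside Dask-cuDF), a local multi-GPU cluster can be started like so:

from dask.distributed import Client
from dask_cuda import LocalCUDACluster

# Start one worker per GPU visible on this machine
cluster = LocalCUDACluster()
client = Client(cluster)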
Using Dask-cuDF#
When installed, Dask-cuDF registers itself as a dataframe backend for Dask. This means that in many cases, using cuDF-backed dataframes requires only small changes to an existing workflow. The minimal change is to select cuDF as the dataframe backend in Dask's configuration. To do so, we must set the option dataframe.backend to cudf. From Python, this can be achieved like so:
import dask
dask.config.set({"dataframe.backend": "cudf"})
Alternatively, you can set DASK_DATAFRAME__BACKEND=cudf in the environment before running your code.
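As a rough sketch, the same effect can be had from within Python by populating os.environ, assuming Dask has not yet been imported in the process (Dask reads configuration from the environment at import time):

import os

# Must happen before dask is imported so the setting is picked up
os.environ["DASK_DATAFRAME__BACKEND"] = "cudf"

import dask.dataframe as dd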
Dataframe creation from on-disk formats#
If your workflow creates Dask dataframes from on-disk formats (for example using dask.dataframe.read_parquet()), then setting the backend may well be enough to migrate your workflow. For example, consider reading a dataframe from parquet:
import dask.dataframe as dd
# By default, we obtain a pandas-backed dataframe
df = dd.read_parquet("data.parquet", ...)
To obtain a cuDF-backed dataframe, we must set the dataframe.backend configuration option:
import dask
import dask.dataframe as dd
dask.config.set({"dataframe.backend": "cudf"})
# This gives us a cuDF-backed dataframe
df = dd.read_parquet("data.parquet", ...)
This code will use cuDF's GPU-accelerated parquet reader to read partitions of the data.
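To confirm that the resulting partitions are cuDF dataframes, you can inspect one of them. The snippet below is a minimal sketch, assuming a file named data.parquet exists:

import dask
import dask.dataframe as dd

dask.config.set({"dataframe.backend": "cudf"})
df = dd.read_parquet("data.parquet")

# Each computed partition is a cudf.DataFrame rather than a pandas.DataFrame
print(type(df.partitions[0].compute()))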
Dataframe creation from in-memory formats#
If you already have a dataframe in memory and want to convert it to a cuDF-backed one, there are two options, depending on whether the dataframe is already a Dask one. If you have a Dask dataframe, you can call dask.dataframe.to_backend() passing "cudf" as the backend; if you have a pandas dataframe, you can either call dask.dataframe.from_pandas() followed by to_backend(), or first convert the dataframe with cudf.from_pandas() and then parallelise it with dask_cudf.from_cudf().
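Both routes are sketched below; the column names and partition count are illustrative:

import pandas as pd
import cudf
import dask.dataframe as dd
import dask_cudf

pdf = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})

# Option 1: parallelise with Dask first, then move the partitions to cuDF
ddf = dd.from_pandas(pdf, npartitions=2)
gddf = ddf.to_backend("cudf")

# Option 2: convert to cuDF first, then parallelise with dask_cudf
gdf = cudf.from_pandas(pdf)
gddf = dask_cudf.from_cudf(gdf, npartitions=2)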
API Reference#
Generally speaking, Dask-cuDF tries to offer exactly the same API as Dask itself. There are, however, some minor differences, mostly because cuDF does not perfectly mirror the pandas API, or because cuDF provides additional configuration flags (these mostly occur in data reading and writing interfaces).
As a result, straightforward workflows can be migrated without too much trouble, but more complex ones that utilise more features may need a bit of tweaking. The API documentation describes details of the differences and all functionality that Dask-cuDF supports.