cudf.DataFrame.to_parquet

DataFrame.to_parquet(path, engine='cudf', compression='snappy', index=None, partition_cols=None, partition_file_name=None, partition_offsets=None, statistics='ROWGROUP', metadata_file_path=None, int96_timestamps=False, row_group_size_bytes=134217728, row_group_size_rows=None, max_page_size_bytes=None, max_page_size_rows=None, storage_options=None, return_metadata=False, *args, **kwargs)

Write a DataFrame to the parquet format.

Parameters
path : str or list of str

File path or Root Directory path. Will be used as Root Directory path while writing a partitioned dataset. Use list of str with partition_offsets to write parts of the dataframe to different files.
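For example, a minimal local write; the file and column names here are illustrative, not part of the API:

>>> import cudf
>>> df = cudf.DataFrame({"key": ["a", "a", "b"], "value": [1, 2, 3]})
>>> df.to_parquet("frame.parquet")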

compression : {‘snappy’, ‘ZSTD’, None}, default ‘snappy’

Name of the compression to use. Use None for no compression.

index : bool, default None

If True, include the dataframe’s index(es) in the file output. If False, they will not be written to the file. If None, the index(es) are saved as with True, except that any RangeIndex is stored as a range in the metadata rather than as values, which requires little space and is faster. Other indexes will be included as columns in the file output.
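A sketch of the difference, reusing the df above (file names illustrative):

>>> df.to_parquet("keep_index.parquet", index=True)   # index written as values
>>> df.to_parquet("drop_index.parquet", index=False)  # index omitted entirely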

partition_cols : list, optional, default None

Column names by which to partition the dataset. Columns are partitioned in the order they are given.

partition_file_name : str, optional, default None

File name to use for partitioned datasets. Different partitions will be written to different directories, but all files will have this name. If nothing is specified, a random uuid4 hex string will be used for each file.
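A sketch of a partitioned write, reusing the df above; dataset_root is an illustrative directory name:

>>> df.to_parquet("dataset_root", partition_cols=["key"], partition_file_name="part.parquet")

With the frame above this produces one directory per partition value, such as dataset_root/key=a/part.parquet and dataset_root/key=b/part.parquet.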

partition_offsets : list, optional, default None

Offsets to partition the dataframe by. Should be used when path is a list of str. Should be a list of integers of size len(path) + 1.
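A sketch, reusing the three-row df above; the offsets [0, 2, 3] send rows 0–1 to the first file and row 2 to the second (file names illustrative):

>>> df.to_parquet(["part0.parquet", "part1.parquet"], partition_offsets=[0, 2, 3])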

statistics : {‘ROWGROUP’, ‘PAGE’, ‘COLUMN’, ‘NONE’}, default ‘ROWGROUP’

Level at which column statistics should be included in the file.
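For example, requesting per-page statistics (file name illustrative):

>>> df.to_parquet("stats.parquet", statistics="PAGE")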

metadata_file_path : str, optional, default None

If specified, this function will return a binary blob containing the footer metadata of the written parquet file. The returned blob will have the chunk.file_path field set to metadata_file_path for each chunk. When used with partition_offsets, it should be a list of size len(path).
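A sketch of capturing the footer metadata (paths illustrative); blobs collected this way from several writes can later be combined into a shared _metadata file, e.g. with cudf.io.merge_parquet_filemetadata:

>>> blob = df.to_parquet("frame.parquet", metadata_file_path="frame.parquet")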

int96_timestamps : bool, default False

If True, write timestamps in int96 format. This will convert timestamps from timestamp[s], timestamp[ms], timestamp[us], and timestamp[ns] to the int96 format, which stores the Julian day number together with the number of nanoseconds since midnight of that day. If False, timestamps will not be altered.
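A sketch with a timestamp column (names illustrative), for consumers that expect the legacy int96 encoding:

>>> tdf = cudf.DataFrame({"t": cudf.Series([0, 1], dtype="datetime64[ns]")})
>>> tdf.to_parquet("ts.parquet", int96_timestamps=True)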

row_group_size_bytes : integer, default 134217728

Maximum size of each row group of the output. If None, 134217728 (128 MB) will be used.

row_group_size_rows : integer or None, default None

Maximum number of rows of each row group of the output. If None, 1000000 will be used.

max_page_size_bytes : integer or None, default None

Maximum uncompressed size of each page of the output. If None, 524288 (512 KB) will be used.

max_page_size_rows : integer or None, default None

Maximum number of rows of each page of the output. If None, 20000 will be used.
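The four size parameters above can be combined. A sketch with illustrative values; smaller row groups and pages trade file size for finer-grained reads:

>>> df.to_parquet(
...     "tuned.parquet",
...     row_group_size_rows=500000,
...     max_page_size_rows=10000,
... )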

storage_options : dict, optional, default None

Extra options that make sense for a particular storage connection, e.g. host, port, username, password, etc. For HTTP(S) URLs the key-value pairs are forwarded to urllib.request.Request as header options. For other URLs (e.g. starting with “s3://” or “gcs://”) the key-value pairs are forwarded to fsspec.open. Please see fsspec and urllib for more details.
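A sketch of a cloud write; the bucket name is hypothetical and an installed S3 filesystem backend for fsspec (such as s3fs) is assumed:

>>> df.to_parquet(
...     "s3://my-bucket/frame.parquet",
...     storage_options={"anon": False},
... )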

return_metadata : bool, default False

Return parquet metadata for written data. The returned metadata will include the file path metadata (relative to root_path). To request the metadata binary blob when using partition_cols, pass return_metadata=True instead of specifying metadata_file_path.
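A sketch combining this with partition_cols, reusing the df above (directory name illustrative):

>>> blob = df.to_parquet("dataset_root", partition_cols=["key"], return_metadata=True)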

**kwargs

Additional parameters will be passed to execution engines other than cudf.