libcudf  24.04.00
cudf::io Namespace Reference

IO interfaces. More...

Namespaces

 orc
 Orc I/O interfaces.
 
 parquet
 Parquet I/O interfaces.
 

Classes

class  arrow_io_source
 Implementation class for reading from an Apache Arrow file. The file could be a memory-mapped file or other implementation supported by Arrow. More...
 
class  avro_reader_options
 Settings to use for read_avro(). More...
 
class  avro_reader_options_builder
 Builder to build options for read_avro(). More...
 
class  csv_reader_options
 Settings to use for read_csv(). More...
 
class  csv_reader_options_builder
 Builder to build options for read_csv(). More...
 
class  csv_writer_options
 Settings to use for write_csv(). More...
 
class  csv_writer_options_builder
 Builder to build options for write_csv(). More...
 
class  data_sink
 Interface class for storing the output data from the writers. More...
 
class  datasource
 Interface class for providing input data to the readers. More...
 
struct  schema_element
 Allows specifying the target types for nested JSON data via json_reader_options' set_dtypes method. More...
 
class  json_reader_options
 Input arguments to the read_json interface. More...
 
class  json_reader_options_builder
 Builds settings to use for read_json(). More...
 
class  json_writer_options
 Settings to use for write_json(). More...
 
class  json_writer_options_builder
 Builder to build options for write_json(). More...
 
class  orc_reader_options
 Settings to use for read_orc(). More...
 
class  orc_reader_options_builder
 Builds settings to use for read_orc(). More...
 
class  orc_writer_options
 Settings to use for write_orc(). More...
 
class  orc_writer_options_builder
 Builds settings to use for write_orc(). More...
 
class  chunked_orc_writer_options
 Settings to use for write_orc_chunked(). More...
 
class  chunked_orc_writer_options_builder
 Builds settings to use for write_orc_chunked(). More...
 
class  orc_chunked_writer
 Chunked ORC writer class that writes an ORC file in chunked/streamed form. More...
 
struct  raw_orc_statistics
 Holds column names and buffers containing raw file-level and stripe-level statistics. More...
 
struct  minmax_statistics
 Base class for column statistics that include optional minimum and maximum. More...
 
struct  sum_statistics
 Base class for column statistics that include an optional sum. More...
 
struct  integer_statistics
 Statistics for integral columns. More...
 
struct  double_statistics
 Statistics for floating point columns. More...
 
struct  string_statistics
 Statistics for string columns. More...
 
struct  bucket_statistics
 Statistics for boolean columns. More...
 
struct  decimal_statistics
 Statistics for decimal columns. More...
 
struct  timestamp_statistics
 Statistics for timestamp columns. More...
 
struct  column_statistics
 Contains per-column ORC statistics. More...
 
struct  parsed_orc_statistics
 Holds column names and parsed file-level and stripe-level statistics. More...
 
struct  orc_column_schema
 Schema of an ORC column, including the nested columns. More...
 
struct  orc_schema
 Schema of an ORC file. More...
 
class  orc_metadata
 Information about content of an ORC file. More...
 
class  parquet_reader_options
 Settings for read_parquet(). More...
 
class  parquet_reader_options_builder
 Builds parquet_reader_options to use for read_parquet(). More...
 
class  chunked_parquet_reader
 Chunked Parquet reader class that reads a Parquet file iteratively into a series of tables, chunk by chunk. More...
 
class  parquet_writer_options
 Settings for write_parquet(). More...
 
class  parquet_writer_options_builder
 Class to build parquet_writer_options. More...
 
class  chunked_parquet_writer_options
 Settings for write_parquet_chunked(). More...
 
class  chunked_parquet_writer_options_builder
 Builds options for chunked_parquet_writer_options. More...
 
class  parquet_chunked_writer
 Chunked Parquet writer class that handles options and writes tables in chunks. More...
 
struct  parquet_column_schema
 Schema of a parquet column, including the nested columns. More...
 
struct  parquet_schema
 Schema of a parquet file. More...
 
class  parquet_metadata
 Information about content of a parquet file. More...
 
class  writer_compression_statistics
 Statistics about compression performed by a writer. More...
 
struct  column_name_info
 Detailed name (and optionally nullability) information for output columns. More...
 
struct  table_metadata
 Table metadata returned by IO readers. More...
 
struct  table_with_metadata
 Table with table metadata used by io readers to return the metadata by value. More...
 
struct  host_buffer
 Non-owning view of a host memory buffer. More...
 
struct  source_info
 Source information for read interfaces. More...
 
struct  sink_info
 Destination information for write interfaces. More...
 
class  column_in_metadata
 Metadata for a column. More...
 
class  table_input_metadata
 Metadata for a table. More...
 
struct  partition_info
 Information used while writing partitioned datasets. More...
 
class  reader_column_schema
 Schema element for the reader. More...
 

Typedefs

using no_statistics = std::monostate
 Monostate type alias for the statistics variant.
 
using date_statistics = minmax_statistics< int32_t >
 Statistics for date(time) columns.
 
using binary_statistics = sum_statistics< int64_t >
 Statistics for binary columns. More...
 

Enumerations

enum class  json_recovery_mode_t { FAIL , RECOVER_WITH_NULL }
 Control the error recovery behavior of the json parser. More...
 
enum class  compression_type {
  NONE , AUTO , SNAPPY , GZIP ,
  BZIP2 , BROTLI , ZIP , XZ ,
  ZLIB , LZ4 , LZO , ZSTD
}
 Compression algorithms. More...
 
enum class  io_type {
  FILEPATH , HOST_BUFFER , DEVICE_BUFFER , VOID ,
  USER_IMPLEMENTED
}
 Data source or destination types. More...
 
enum class  quote_style { MINIMAL , ALL , NONNUMERIC , NONE }
 Behavior when handling quotations in field data. More...
 
enum  statistics_freq { STATISTICS_NONE = 0 , STATISTICS_ROWGROUP = 1 , STATISTICS_PAGE = 2 , STATISTICS_COLUMN = 3 }
 Column statistics granularity type for parquet/orc writers. More...
 
enum class  column_encoding {
  USE_DEFAULT = -1 , DICTIONARY , PLAIN , DELTA_BINARY_PACKED ,
  DELTA_LENGTH_BYTE_ARRAY , DELTA_BYTE_ARRAY , DIRECT , DIRECT_V2 ,
  DICTIONARY_V2
}
 Valid encodings for use with column_in_metadata::set_encoding(). More...
 
enum  dictionary_policy { NEVER = 0 , ADAPTIVE = 1 , ALWAYS = 2 }
 Control use of dictionary encoding for parquet writer. More...
 

Functions

table_with_metadata read_avro (avro_reader_options const &options, rmm::mr::device_memory_resource *mr=rmm::mr::get_current_device_resource())
 Reads an Avro dataset into a set of columns. More...
 
table_with_metadata read_csv (csv_reader_options options, rmm::cuda_stream_view stream=cudf::get_default_stream(), rmm::mr::device_memory_resource *mr=rmm::mr::get_current_device_resource())
 Reads a CSV dataset into a set of columns. More...
 
void write_csv (csv_writer_options const &options, rmm::cuda_stream_view stream=cudf::get_default_stream(), rmm::mr::device_memory_resource *mr=rmm::mr::get_current_device_resource())
 Writes a set of columns to CSV format. More...
 
table_with_metadata read_json (json_reader_options options, rmm::cuda_stream_view stream=cudf::get_default_stream(), rmm::mr::device_memory_resource *mr=rmm::mr::get_current_device_resource())
 Reads a JSON dataset into a set of columns. More...
 
void write_json (json_writer_options const &options, rmm::cuda_stream_view stream=cudf::get_default_stream(), rmm::mr::device_memory_resource *mr=rmm::mr::get_current_device_resource())
 Writes a set of columns to JSON format. More...
 
rmm::host_async_resource_ref set_host_memory_resource (rmm::host_async_resource_ref mr)
 Set the rmm resource to be used for host memory allocations by cudf::detail::hostdevice_vector. More...
 
rmm::host_async_resource_ref get_host_memory_resource ()
 Get the rmm resource being used for host memory allocations by cudf::detail::hostdevice_vector. More...
 
table_with_metadata read_orc (orc_reader_options const &options, rmm::cuda_stream_view stream=cudf::get_default_stream(), rmm::mr::device_memory_resource *mr=rmm::mr::get_current_device_resource())
 Reads an ORC dataset into a set of columns. More...
 
void write_orc (orc_writer_options const &options, rmm::cuda_stream_view stream=cudf::get_default_stream())
 Writes a set of columns to ORC format. More...
 
raw_orc_statistics read_raw_orc_statistics (source_info const &src_info, rmm::cuda_stream_view stream=cudf::get_default_stream())
 Reads raw file-level and stripe-level statistics of an ORC dataset. More...
 
parsed_orc_statistics read_parsed_orc_statistics (source_info const &src_info, rmm::cuda_stream_view stream=cudf::get_default_stream())
 Reads and parses file-level and stripe-level statistics of an ORC dataset. More...
 
orc_metadata read_orc_metadata (source_info const &src_info, rmm::cuda_stream_view stream=cudf::get_default_stream())
 Reads metadata of an ORC dataset. More...
 
table_with_metadata read_parquet (parquet_reader_options const &options, rmm::cuda_stream_view stream=cudf::get_default_stream(), rmm::mr::device_memory_resource *mr=rmm::mr::get_current_device_resource())
 Reads a Parquet dataset into a set of columns. More...
 
std::unique_ptr< std::vector< uint8_t > > write_parquet (parquet_writer_options const &options, rmm::cuda_stream_view stream=cudf::get_default_stream())
 Writes a set of columns to parquet format. More...
 
std::unique_ptr< std::vector< uint8_t > > merge_row_group_metadata (std::vector< std::unique_ptr< std::vector< uint8_t >>> const &metadata_list)
 Merges multiple raw metadata blobs that were previously created by write_parquet into a single metadata blob. More...
 
parquet_metadata read_parquet_metadata (source_info const &src_info)
 Reads metadata of a Parquet dataset. More...
 
template<typename T >
constexpr auto is_byte_like_type ()
 Returns true if the type is byte-like, meaning it is reasonable to pass as a pointer to bytes. More...
 

Variables

constexpr size_t default_stripe_size_bytes = 64 * 1024 * 1024
 64MB default orc stripe size
 
constexpr size_type default_stripe_size_rows = 1000000
 1M rows default orc stripe rows
 
constexpr size_type default_row_index_stride = 10000
 10K rows default orc row index stride
 
constexpr size_t default_row_group_size_bytes = 128 * 1024 * 1024
 128MB per row group
 
constexpr size_type default_row_group_size_rows = 1000000
 1 million rows per row group
 
constexpr size_t default_max_page_size_bytes = 512 * 1024
 512KB per page
 
constexpr size_type default_max_page_size_rows = 20000
 20k rows per page
 
constexpr int32_t default_column_index_truncate_length = 64
 truncate to 64 bytes
 
constexpr size_t default_max_dictionary_size = 1024 * 1024
 1MB dictionary size
 
constexpr size_type default_max_page_fragment_size = 5000
 5000 rows per page fragment
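
The size labels above use binary units (1MB = 1024 * 1024 bytes). As a rough illustration of how the defaults relate to one another, the sketch below mirrors a few of them in a local namespace and derives some ratios. The namespace `io_defaults_sketch`, the helper `ceil_div`, and the derived constant names are hypothetical and exist only for this example; `size_type` is assumed to be a 32-bit signed integer. The ratios are plain arithmetic on the listed values, not a description of how the writers actually split data.

```cpp
#include <cstddef>
#include <cstdint>

// Local mirror of some defaults listed above. This does NOT include or
// replace the libcudf headers; it only reproduces the documented values.
namespace io_defaults_sketch {

using size_type = std::int32_t;  // assumption: stand-in for cudf::size_type

constexpr std::size_t default_stripe_size_bytes    = 64 * 1024 * 1024;
constexpr std::size_t default_row_group_size_bytes = 128 * 1024 * 1024;
constexpr std::size_t default_max_page_size_bytes  = 512 * 1024;
constexpr std::size_t default_max_dictionary_size  = 1024 * 1024;
constexpr size_type default_row_group_size_rows    = 1000000;
constexpr size_type default_max_page_size_rows     = 20000;
constexpr size_type default_max_page_fragment_size = 5000;

// The "MB"/"KB" labels are binary units.
static_assert(default_stripe_size_bytes == 67108864, "64 MiB");
static_assert(default_row_group_size_bytes == 134217728, "128 MiB");
static_assert(default_max_page_size_bytes == 524288, "512 KiB");
static_assert(default_max_dictionary_size == 1048576, "1 MiB");

// Hypothetical helper: integer division rounded up.
constexpr std::size_t ceil_div(std::size_t a, std::size_t b)
{
  return (a + b - 1) / b;
}

// Ratios between the defaults (illustrative arithmetic only).
constexpr std::size_t pages_per_row_group_by_bytes =
  ceil_div(default_row_group_size_bytes, default_max_page_size_bytes);
constexpr size_type pages_per_row_group_by_rows =
  default_row_group_size_rows / default_max_page_size_rows;
constexpr size_type fragments_per_page =
  default_max_page_size_rows / default_max_page_fragment_size;

}  // namespace io_defaults_sketch
```

With these values, a full-size row group spans 256 maximum-size pages by bytes or 50 by rows, and a maximum-size page spans 4 page fragments.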
 

Detailed Description

IO interfaces.

Function Documentation

◆ get_host_memory_resource()

rmm::host_async_resource_ref cudf::io::get_host_memory_resource ( )

Get the rmm resource being used for host memory allocations by cudf::detail::hostdevice_vector.

Returns
The rmm resource used for host-side allocations

◆ set_host_memory_resource()

rmm::host_async_resource_ref cudf::io::set_host_memory_resource ( rmm::host_async_resource_ref  mr)

Set the rmm resource to be used for host memory allocations by cudf::detail::hostdevice_vector.

hostdevice_vector is a utility class that uses a pair of host-side and device-side buffers to bounce state between the CPU and the GPU. The resource set with this function (typically a pinned memory allocator) is what it uses to allocate space for its host-side buffer.

Parameters
mr  The rmm resource to be used for host-side allocations
Returns
The previous resource that was in use
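
Because the setter returns the previous resource, a caller can install a resource temporarily and restore the old one afterwards. The sketch below shows that swap-and-restore pattern with stand-in types so that it compiles without rmm or libcudf. Everything in the namespace `host_mr_sketch` is hypothetical: `resource_ref` is a placeholder for `rmm::host_async_resource_ref`, the two free functions only imitate the documented get/set contract, and `scoped_host_resource` is an illustrative guard that is not part of the library.

```cpp
#include <utility>

namespace host_mr_sketch {

// Hypothetical stand-in for rmm::host_async_resource_ref; it carries only an
// id so that the example can tell resources apart.
struct resource_ref {
  int id;
};

// Storage for the "current" resource in this sketch (id 0 is the default).
inline resource_ref& current()
{
  static resource_ref r{0};
  return r;
}

// Imitates the documented getter: returns the resource currently in use.
inline resource_ref get_host_memory_resource() { return current(); }

// Imitates the documented setter: installs mr and returns the previous one.
inline resource_ref set_host_memory_resource(resource_ref mr)
{
  return std::exchange(current(), mr);
}

// Illustrative RAII guard: installs a resource on construction and restores
// the previous one on destruction.
class scoped_host_resource {
 public:
  explicit scoped_host_resource(resource_ref mr)
    : previous_{set_host_memory_resource(mr)}
  {
  }
  ~scoped_host_resource() { set_host_memory_resource(previous_); }

  scoped_host_resource(scoped_host_resource const&)            = delete;
  scoped_host_resource& operator=(scoped_host_resource const&) = delete;

 private:
  resource_ref previous_;
};

}  // namespace host_mr_sketch
```

A guard of this kind keeps the change local to one scope. With the real API the same idea would apply to the value returned by set_host_memory_resource(), subject to the lifetime rules of the actual resource types.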