Files
file	normalize.hpp

Functions
std::unique_ptr< cudf::column >	nvtext::normalize_spaces (cudf::strings_column_view const &input, rmm::cuda_stream_view stream=cudf::get_default_stream(), rmm::mr::device_memory_resource *mr=rmm::mr::get_current_device_resource())
	Returns a new strings column by normalizing the whitespace in each string in the input column.

std::unique_ptr< cudf::column >	nvtext::normalize_characters (cudf::strings_column_view const &input, bool do_lower_case, rmm::cuda_stream_view stream=cudf::get_default_stream(), rmm::mr::device_memory_resource *mr=rmm::mr::get_current_device_resource())
	Normalizes the characters of each string for tokenizing.

Function Documentation

◆ normalize_characters()

std::unique_ptr<cudf::column> nvtext::normalize_characters	(	cudf::strings_column_view const &	input,
		bool	do_lower_case,
		rmm::cuda_stream_view	stream = `cudf::get_default_stream()`,
		rmm::mr::device_memory_resource *	mr = `rmm::mr::get_current_device_resource()`
	)

Normalizes the characters of each string for tokenizing.

This uses the normalizer built into the nvtext::subword_tokenize function, which includes:

adding padding around punctuation (Unicode category starts with "P") as well as certain ASCII symbols like "^" and "$"
adding padding around the CJK Unicode block characters
changing whitespace (e.g. "\t", "\n", "\r") to a single space " "
removing control characters (Unicode categories "Cc" and "Cf")

The padding process adds a single space before and after the character. Details on Unicode categories can be found here: https://unicodebook.readthedocs.io/unicode.html#categories

If do_lower_case is true, lower-casing also removes accents. Accents cannot be removed from upper-case characters without lower-casing, and lower-casing cannot be performed without also removing accents. However, if an accented character is already lower-case, only the accent is removed.

s = ["éâîô\teaio", "ĂĆĖÑÜ", "ACENU", "$24.08", "[a,bb]"]
s1 = normalize_characters(s,true)
s1 is now ["eaio eaio", "acenu", "acenu", " $ 24 . 08", " [ a , bb ] "]
s2 = normalize_characters(s,false)
s2 is now ["éâîô eaio", "ĂĆĖÑÜ", "ACENU", " $ 24 . 08", " [ a , bb ] "]
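
The padding, whitespace, and control-character rules above can be illustrated with a small host-side sketch. This is not the libcudf implementation: normalize_ascii_sketch is a hypothetical helper that handles a single std::string and only ASCII input. It assumes that every printable ASCII non-alphanumeric character is padded (the text above only says "certain ASCII symbols"), and it copies non-ASCII bytes unchanged, so accent stripping and CJK padding are not modeled.

```cpp
#include <string>

// Hypothetical host-side helper for illustration only; not a libcudf API.
// ASCII-only: bytes >= 0x80 (UTF-8 multi-byte sequences) are copied unchanged,
// so accent stripping, CJK padding, and non-ASCII punctuation are not modeled.
std::string normalize_ascii_sketch(std::string const& in, bool do_lower_case)
{
  std::string out;
  for (unsigned char ch : in) {
    bool const is_space   = ch == ' ' || ch == '\t' || ch == '\n' || ch == '\r';
    bool const is_control = !is_space && (ch < 0x20 || ch == 0x7F);
    // Assumption: every printable ASCII non-alphanumeric character is padded.
    bool const is_punct = (ch >= 33 && ch <= 47) || (ch >= 58 && ch <= 64) ||
                          (ch >= 91 && ch <= 96) || (ch >= 123 && ch <= 126);
    if (is_space) {
      out.push_back(' ');  // "\t", "\n", "\r" all become a plain space
    } else if (is_control) {
      continue;            // control characters are removed
    } else if (is_punct) {
      out.push_back(' ');  // pad with one space before and one after
      out.push_back(static_cast<char>(ch));
      out.push_back(' ');
    } else if (do_lower_case && ch >= 'A' && ch <= 'Z') {
      out.push_back(static_cast<char>(ch - 'A' + 'a'));
    } else {
      out.push_back(static_cast<char>(ch));
    }
  }
  return out;
}
```

Under these assumptions, the sketch reproduces the ASCII rows of the example above: "$24.08" becomes " $ 24 . 08" and "[a,bb]" becomes " [ a , bb ] " for either value of do_lower_case, while "ACENU" becomes "acenu" only when do_lower_case is true.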

A null input element at row i produces a corresponding null entry for row i in the output column.

This function requires about 16x the number of character bytes in the input strings column as working memory. For example, an input column holding 1 GB of character data needs roughly 16 GB of working memory.

Parameters

input	The input strings to normalize
do_lower_case	If true, upper-case characters are converted to lower-case and accents are stripped from those characters. If false, accented and upper-case characters are not transformed.
stream	CUDA stream used for device memory operations and kernel launches
mr	Memory resource to allocate any returned objects

Returns: Normalized strings column

◆ normalize_spaces()

std::unique_ptr<cudf::column> nvtext::normalize_spaces	(	cudf::strings_column_view const &	input,
		rmm::cuda_stream_view	stream = `cudf::get_default_stream()`,
		rmm::mr::device_memory_resource *	mr = `rmm::mr::get_current_device_resource()`
	)

Returns a new strings column by normalizing the whitespace in each string in the input column.

Normalizing a string replaces each run of one or more whitespace characters (code point <= ' ') with a single space ' ' and trims whitespace from the beginning and end of the string.

Example:
s = ["a b", "  c  d\n", "e \t f "]
t = normalize_spaces(s)
t is now ["a b","c d","e f"]
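
The per-string rule described above can be sketched on the host as follows. This is an illustrative reference for one std::string, not the GPU implementation in libcudf; normalize_spaces_sketch is a hypothetical name. It treats every byte <= ' ' as whitespace, which leaves UTF-8 multi-byte characters untouched because all of their bytes are >= 0x80.

```cpp
#include <string>

// Hypothetical host-side helper for illustration only; not a libcudf API.
std::string normalize_spaces_sketch(std::string const& in)
{
  std::string out;
  bool pending_space = false;  // a whitespace run is waiting to be emitted
  for (unsigned char ch : in) {
    if (ch <= ' ') {
      // Remember the run only if something was already written,
      // which drops leading whitespace.
      pending_space = !out.empty();
      continue;
    }
    if (pending_space) {
      out.push_back(' ');  // the whole run collapses to one space
      pending_space = false;
    }
    out.push_back(static_cast<char>(ch));
  }
  // A run at the end of the string is never emitted, so the result is trimmed.
  return out;
}
```

Deferring the space until the next non-whitespace character arrives handles collapsing and trimming in a single pass, with no second pass to strip the ends.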