getml.preprocessors

Contains routines for preprocessing data frames.

CategoryTrimmer `dataclass`

CategoryTrimmer(
    max_num_categories: int = 999, min_freq: int = 30
)

Bases: _Preprocessor

Reduces the cardinality of high-cardinality categorical columns.

PARAMETER	DESCRIPTION
`max_num_categories`	The maximum cardinality allowed. If a column's cardinality exceeds this value, only the most frequent categories are kept and all others are trimmed. TYPE: `int` DEFAULT: `999`
`min_freq`	The minimum frequency required for a category to be included. TYPE: `int` DEFAULT: `30`

Example

category_trimmer = getml.preprocessors.CategoryTrimmer()

pipe = getml.Pipeline(
    population=population_placeholder,
    peripheral=[order_placeholder, trans_placeholder],
    preprocessors=[category_trimmer],
    feature_learners=[feature_learner_1, feature_learner_2],
    feature_selectors=feature_selector,
    predictors=predictor,
    share_selected_features=0.5
)
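
To make the interplay of the two parameters concrete, here is a minimal pure-Python sketch of the trimming idea. It is an illustration only, not getML's implementation: the function name `trim_categories`, the `"(trimmed)"` placeholder and the assumption that a category must satisfy both criteria to be kept are all hypothetical.

```python
from collections import Counter


def trim_categories(values, max_num_categories=999, min_freq=30,
                    placeholder="(trimmed)"):
    """Keep only frequent categories; replace the rest with a placeholder.

    Assumption: a category is kept only if it is among the
    `max_num_categories` most frequent ones AND occurs at least
    `min_freq` times.
    """
    counts = Counter(values)
    kept = {
        category
        for category, freq in counts.most_common(max_num_categories)
        if freq >= min_freq
    }
    return [v if v in kept else placeholder for v in values]


column = ["a"] * 5 + ["b"] * 3 + ["c"] * 1
trimmed = trim_categories(column, max_num_categories=2, min_freq=2)
print(Counter(trimmed))
```

In this toy column, "c" is trimmed because it is neither frequent enough nor among the two most frequent categories.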

EmailDomain `dataclass`

EmailDomain()

Bases: _Preprocessor

The EmailDomain preprocessor extracts the domain from e-mail addresses.

For instance, if the e-mail address is 'some.guy@domain.com', the preprocessor will automatically extract '@domain.com'.

The preprocessor will be applied to all text columns that were assigned one of the subroles include.email or only.email.

It is recommended that you assign only.email, because it is unlikely that the e-mail address itself is interesting.

Example

my_data_frame.set_subroles("email", getml.data.subroles.only.email)

domain = getml.preprocessors.EmailDomain()

pipe = getml.Pipeline(
    population=population_placeholder,
    peripheral=[order_placeholder, trans_placeholder],
    preprocessors=[domain],
    feature_learners=[feature_learner_1, feature_learner_2],
    feature_selectors=feature_selector,
    predictors=predictor,
    share_selected_features=0.5
)
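
The extraction described above can be sketched in plain Python. This is a simplified illustration, not the code getML runs: the helper name `extract_domain` is hypothetical, and returning `None` for strings without an `@` is an assumption, since the handling of malformed addresses is not described here.

```python
def extract_domain(address):
    """Return the domain part of an e-mail address, including the '@'."""
    at = address.rfind("@")
    if at == -1:
        # Assumption: strings without '@' yield no domain.
        return None
    return address[at:]


print(extract_domain("some.guy@domain.com"))
```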

Imputation `dataclass`

Imputation(add_dummies: bool = False)

Bases: _Preprocessor

The Imputation preprocessor replaces all NULL values in a numerical column with the mean of the remaining (non-NULL) values in that column.

It can optionally add a dummy column that signifies whether the original value was imputed.

PARAMETER	DESCRIPTION
`add_dummies`	Whether you want to add dummy variables that signify whether the original value was imputed. TYPE: `bool` DEFAULT: `False`

Example

imputation = getml.preprocessors.Imputation()

pipe = getml.Pipeline(
    population=population_placeholder,
    peripheral=[order_placeholder, trans_placeholder],
    preprocessors=[imputation],
    feature_learners=[feature_learner_1, feature_learner_2],
    feature_selectors=feature_selector,
    predictors=predictor,
    share_selected_features=0.5
)
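
As a rough illustration of mean imputation with optional dummy columns, consider the following pure-Python sketch. The names `is_null` and `impute_mean` are hypothetical, NULL is modelled as `None` or NaN, and the fallback to `0.0` for an all-NULL column is an assumption; none of this is taken from getML's implementation.

```python
import math


def is_null(x):
    """Treat None and NaN as NULL."""
    return x is None or (isinstance(x, float) and math.isnan(x))


def impute_mean(column, add_dummies=False):
    """Replace NULL entries with the mean of the non-NULL entries.

    Returns (imputed_column, dummies); dummies is None unless
    add_dummies is True.
    """
    present = [x for x in column if not is_null(x)]
    # Assumption: fall back to 0.0 when there is nothing to average.
    mean = sum(present) / len(present) if present else 0.0
    imputed = [mean if is_null(x) else x for x in column]
    dummies = None
    if add_dummies:
        # 1.0 marks a value that was imputed, 0.0 an original value.
        dummies = [1.0 if is_null(x) else 0.0 for x in column]
    return imputed, dummies


imputed, dummies = impute_mean([1.0, None, 3.0, float("nan")], add_dummies=True)
print(imputed, dummies)
```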

Mapping `dataclass`

Mapping(
    aggregation: Iterable[MappingAggregations] = default,
    min_freq: int = 30,
    multithreading: bool = True,
)

Bases: _Preprocessor

A mapping preprocessor maps categorical values, discrete values and individual words in a text field to numerical values. These numerical values are retrieved by aggregating targets in the relational neighbourhood.

You are particularly encouraged to use the mapping preprocessor in combination with FastProp.

Refer to the User guide for more information.

Enterprise edition

This feature is exclusive to the Enterprise edition and is not available in the Community edition. Discover the benefits of the Enterprise edition and compare the features of both editions.

For licensing information and technical support, please contact us.

ATTRIBUTE	DESCRIPTION
`agg_sets`	Class variable holding the available aggregation sets for the mapping preprocessor. Value: `MAPPING`. TYPE: `MappingAggregationsSets`

PARAMETER	DESCRIPTION
`aggregation`	The aggregation functions to apply to the targets. Each must be an aggregation supported by the Mapping preprocessor (`MAPPING_AGGREGATIONS`). TYPE: `Iterable[MappingAggregations]` DEFAULT: `default`
`min_freq`	The minimum number of targets required for a value to be included in the mapping. Range: [0, ∞) TYPE: `int` DEFAULT: `30`
`multithreading`	Whether you want to apply multithreading. TYPE: `bool` DEFAULT: `True`

Example

mapping = getml.preprocessors.Mapping()

pipe = getml.Pipeline(
    population=population_placeholder,
    peripheral=[order_placeholder, trans_placeholder],
    preprocessors=[mapping],
    feature_learners=[feature_learner_1, feature_learner_2],
    feature_selectors=feature_selector,
    predictors=predictor,
    share_selected_features=0.5
)
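
The core idea of a mapping, replacing a categorical value with an aggregate of the targets observed alongside it, can be sketched on a single flat table. This is a deliberately simplified illustration: getML aggregates targets over the relational neighbourhood, whereas this sketch uses one table, uses the mean as the only aggregation, and relies on a hypothetical helper `build_mapping`.

```python
from collections import defaultdict


def build_mapping(categories, targets, min_freq=30):
    """Map each category to the mean of its targets.

    Categories with fewer than `min_freq` associated targets are
    left out of the mapping.
    """
    grouped = defaultdict(list)
    for category, target in zip(categories, targets):
        grouped[category].append(target)
    return {
        category: sum(values) / len(values)
        for category, values in grouped.items()
        if len(values) >= min_freq
    }


categories = ["x", "x", "y", "x", "y", "z"]
targets = [1, 0, 1, 1, 1, 0]
mapping = build_mapping(categories, targets, min_freq=2)
print(mapping)
```

Here "z" occurs only once and is therefore excluded, which mirrors the role of `min_freq` in guarding against mappings built from too few observations.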

Seasonal `dataclass`

Seasonal(
    disable_year: bool = False,
    disable_month: bool = False,
    disable_weekday: bool = False,
    disable_hour: bool = False,
    disable_minute: bool = False,
)

Bases: _Preprocessor

The Seasonal preprocessor extracts seasonal data from time stamps.

The preprocessor automatically iterates through all time stamps in any data frame and extracts seasonal parameters.

These include:

year
month
weekday
hour
minute

The algorithm also evaluates the potential usefulness of any extracted seasonal parameter. Parameters that are unlikely to be useful are not included.

PARAMETER	DESCRIPTION
`disable_year`	Prevents the Seasonal preprocessor from extracting the year from time stamps. TYPE: `bool` DEFAULT: `False`
`disable_month`	Prevents the Seasonal preprocessor from extracting the month from time stamps. TYPE: `bool` DEFAULT: `False`
`disable_weekday`	Prevents the Seasonal preprocessor from extracting the weekday from time stamps. TYPE: `bool` DEFAULT: `False`
`disable_hour`	Prevents the Seasonal preprocessor from extracting the hour from time stamps. TYPE: `bool` DEFAULT: `False`
`disable_minute`	Prevents the Seasonal preprocessor from extracting the minute from time stamps. TYPE: `bool` DEFAULT: `False`

Example

seasonal = getml.preprocessors.Seasonal()

pipe = getml.Pipeline(
    population=population_placeholder,
    peripheral=[order_placeholder, trans_placeholder],
    preprocessors=[seasonal],
    feature_learners=[feature_learner_1, feature_learner_2],
    feature_selectors=feature_selector,
    predictors=predictor,
    share_selected_features=0.5
)
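
The following sketch shows what extracting seasonal parameters from a single time stamp might look like, using Python's standard `datetime` module. The function `seasonal_parts` is hypothetical, the weekday encoding (Monday = 0) is Python's convention and not necessarily getML's, and the usefulness check described above is not reproduced.

```python
from datetime import datetime


def seasonal_parts(ts, disable_year=False, disable_month=False,
                   disable_weekday=False, disable_hour=False,
                   disable_minute=False):
    """Extract the enabled seasonal components from a datetime."""
    parts = {}
    if not disable_year:
        parts["year"] = ts.year
    if not disable_month:
        parts["month"] = ts.month
    if not disable_weekday:
        parts["weekday"] = ts.weekday()  # Python convention: Monday = 0
    if not disable_hour:
        parts["hour"] = ts.hour
    if not disable_minute:
        parts["minute"] = ts.minute
    return parts


stamp = datetime(2021, 3, 14, 15, 9)
parts = seasonal_parts(stamp)
without_time = seasonal_parts(stamp, disable_hour=True, disable_minute=True)
print(parts)
print(without_time)
```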

Substring `dataclass`

Substring(begin: int, length: int, unit: str = '')

Bases: _Preprocessor

The Substring preprocessor extracts substrings from categorical columns and unused string columns.

The preprocessor will be applied to all categorical and text columns that were assigned one of the subroles include.substring or only.substring.

To further limit the scope of a substring preprocessor, you can also assign a unit.

PARAMETER	DESCRIPTION
`begin`	Index of the beginning of the substring (starting from 0). TYPE: `int`
`length`	The length of the substring. TYPE: `int`
`unit`	The unit of all columns to which the preprocessor should be applied. These columns must also have the subrole substring. If it is left empty, then the preprocessor will be applied to all columns with the subrole `include.substring` or `only.substring`. TYPE: `str` DEFAULT: `''`

Example

my_df.set_subroles("col1", getml.data.subroles.include.substring)

my_df.set_subroles("col2", getml.data.subroles.include.substring)
my_df.set_unit("col2", "substr14")

# Will be applied to col1 and col2
substr13 = getml.preprocessors.Substring(0, 3)

# Will only be applied to col2
substr14 = getml.preprocessors.Substring(0, 3, "substr14")

pipe = getml.Pipeline(
    population=population_placeholder,
    peripheral=[order_placeholder, trans_placeholder],
    preprocessors=[substr13, substr14],
    feature_learners=[feature_learner_1, feature_learner_2],
    feature_selectors=feature_selector,
    predictors=predictor,
    share_selected_features=0.5
)
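
To illustrate the semantics of `begin`, `length` and `unit`, here is a small pure-Python sketch. Both helpers are hypothetical and are not part of getML; the `units` dictionary is assumed to contain only columns that already carry a substring subrole.

```python
def substring(value, begin, length):
    """Return `length` characters of `value`, starting at index `begin`."""
    return value[begin:begin + length]


def selected_columns(units, unit=""):
    """Pick the columns a preprocessor with the given unit would apply to.

    `units` maps column names to their unit. An empty `unit` selects
    every column; otherwise only columns with a matching unit.
    """
    return [name for name, col_unit in units.items()
            if unit == "" or col_unit == unit]


units = {"col1": "", "col2": "substr14"}
print(substring("getml", 0, 3))
print(selected_columns(units))
print(selected_columns(units, "substr14"))
```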

TextFieldSplitter `dataclass`

TextFieldSplitter()

Bases: _Preprocessor

A TextFieldSplitter splits columns with role text into relational bag-of-words representations to allow the feature learners to learn patterns based on the presence of certain words within the text fields.

Text fields will be split on whitespace or any of the following characters:

; , . ! ? - | " \t \v \f \r \n % ' ( ) [ ] { }

Refer to the User Guide for more information.

Example

text_field_splitter = getml.preprocessors.TextFieldSplitter()

pipe = getml.Pipeline(
    population=population_placeholder,
    peripheral=[order_placeholder, trans_placeholder],
    preprocessors=[text_field_splitter],
    feature_learners=[feature_learner_1, feature_learner_2],
    feature_selectors=feature_selector,
    predictors=predictor,
    share_selected_features=0.5
)
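
The splitting rule above can be approximated with Python's `re` module. This is a sketch under assumptions: the helper `split_text_field` is hypothetical, tokens are returned as a plain list instead of a relational table, and no case normalization is applied because none is described here.

```python
import re

# The separator characters listed above; whitespace is added via \s.
SEPARATORS = ";,.!?-|\"\t\v\f\r\n%'()[]{}"
PATTERN = re.compile("[" + re.escape(SEPARATORS) + r"\s]+")


def split_text_field(text):
    """Split a text field into words, dropping empty tokens."""
    return [token for token in PATTERN.split(text) if token]


tokens = split_text_field("Hello, world! It's a (small) test-case.")
print(tokens)
```

Note that the apostrophe and the hyphen are separators, so "It's" and "test-case" are each split into two tokens.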
