Skip to content

getml.preprocessors

Contains routines for preprocessing data frames.

CategoryTrimmer dataclass

CategoryTrimmer(
    max_num_categories: int = 999, min_freq: int = 30
)

Bases: _Preprocessor

Reduces the cardinality of high-cardinality categorical columns.

PARAMETER DESCRIPTION
max_num_categories

The maximum cardinality allowed. If the cardinality is higher than that only the most frequent categories will be kept, all others will be trimmed.

TYPE: int DEFAULT: 999

min_freq

The minimum frequency required for a category to be included.

TYPE: int DEFAULT: 30

Example
category_trimmer = getml.preprocessors.CategoryTrimmer()

pipe = getml.Pipeline(
    population=population_placeholder,
    peripheral=[order_placeholder, trans_placeholder],
    preprocessors=[category_trimmer],
    feature_learners=[feature_learner_1, feature_learner_2],
    feature_selectors=feature_selector,
    predictors=predictor,
    share_selected_features=0.5
)

EmailDomain dataclass

EmailDomain()

Bases: _Preprocessor

The EmailDomain preprocessor extracts the domain from e-mail addresses.

For instance, if the e-mail address is '[email protected]', the preprocessor will automatically extract '@domain.com'.

The preprocessor will be applied to all text columns that were assigned one of the subroles include.email or only.email.

It is recommended that you assign only.email, because it is unlikely that the e-mail address itself is interesting.

Example
my_data_frame.set_subroles("email", getml.data.subroles.only.email)

domain = getml.preprocessors.EmailDomain()

pipe = getml.Pipeline(
    population=population_placeholder,
    peripheral=[order_placeholder, trans_placeholder],
    preprocessors=[domain],
    feature_learners=[feature_learner_1, feature_learner_2],
    feature_selectors=feature_selector,
    predictors=predictor,
    share_selected_features=0.5
)

Imputation dataclass

Imputation(add_dummies: bool = False)

Bases: _Preprocessor

The Imputation preprocessor replaces all NULL values in numerical columns with the mean of the remaining columns.

Optionally, it can additionally add a dummy column that signifies whether the original value was imputed.

PARAMETER DESCRIPTION
add_dummies

Whether you want to add dummy variables that signify whether the original value was imputed.

TYPE: bool DEFAULT: False

Example
imputation = getml.preprocessors.Imputation()

pipe = getml.Pipeline(
    population=population_placeholder,
    peripheral=[order_placeholder, trans_placeholder],
    preprocessors=[imputation],
    feature_learners=[feature_learner_1, feature_learner_2],
    feature_selectors=feature_selector,
    predictors=predictor,
    share_selected_features=0.5
)

Mapping dataclass

Mapping(
    aggregation: Iterable[MappingAggregations] = default,
    min_freq: int = 30,
    multithreading: bool = True,
)

Bases: _Preprocessor

A mapping preprocessor maps categorical values, discrete values and individual words in a text field to numerical values. These numerical values are retrieved by aggregating targets in the relational neighbourhood.

You are particularly encouraged to use the mapping preprocessor in combination with FastProp.

Refer to the User guide for more information.

Enterprise edition

This feature is exclusive to the Enterprise edition and is not available in the Community edition. Discover the benefits of the Enterprise edition and compare their features.

For licensing information and technical support, please contact us.

ATTRIBUTE DESCRIPTION
agg_sets

It is a class variable holding the available aggregation sets for the mapping preprocessor. Value: MAPPING.

TYPE: MappingAggregationsSets

PARAMETER DESCRIPTION
aggregation

The aggregation function to use over the targets.

Must be an aggregation supported by Mapping preprocessor (MAPPING_AGGREGATIONS).

TYPE: Iterable[MappingAggregations] DEFAULT: default

min_freq

The minimum number of targets required for a value to be included in the mapping. Range: [0, ∞]

TYPE: int DEFAULT: 30

multithreading

Whether you want to apply multithreading.

TYPE: bool DEFAULT: True

Example
mapping = getml.preprocessors.Mapping()

pipe = getml.Pipeline(
    population=population_placeholder,
    peripheral=[order_placeholder, trans_placeholder],
    preprocessors=[mapping],
    feature_learners=[feature_learner_1, feature_learner_2],
    feature_selectors=feature_selector,
    predictors=predictor,
    share_selected_features=0.5
)

Seasonal dataclass

Seasonal(
    disable_year: bool = False,
    disable_month: bool = False,
    disable_weekday: bool = False,
    disable_hour: bool = False,
    disable_minute: bool = False,
)

Bases: _Preprocessor

The Seasonal preprocessor extracts seasonal data from time stamps.

The preprocessor automatically iterates through all time stamps in any data frame and extracts seasonal parameters.

These include:

  • year
  • month
  • weekday
  • hour
  • minute

The algorithm also evaluates the potential usefulness of any extracted seasonal parameter. Parameters that are unlikely to be useful are not included.

PARAMETER DESCRIPTION
disable_year

Prevents the Seasonal preprocessor from extracting the year from time stamps.

TYPE: bool DEFAULT: False

disable_month

Prevents the Seasonal preprocessor from extracting the month from time stamps.

TYPE: bool DEFAULT: False

disable_weekday

Prevents the Seasonal preprocessor from extracting the weekday from time stamps.

TYPE: bool DEFAULT: False

disable_hour

Prevents the Seasonal preprocessor from extracting the hour from time stamps.

TYPE: bool DEFAULT: False

disable_minute

Prevents the Seasonal preprocessor from extracting the minute from time stamps.

TYPE: bool DEFAULT: False

Example
seasonal = getml.preprocessors.Seasonal()

pipe = getml.Pipeline(
    population=population_placeholder,
    peripheral=[order_placeholder, trans_placeholder],
    preprocessors=[seasonal],
    feature_learners=[feature_learner_1, feature_learner_2],
    feature_selectors=feature_selector,
    predictors=predictor,
    share_selected_features=0.5
)

Substring dataclass

Substring(begin: int, length: int, unit: str = '')

Bases: _Preprocessor

The Substring preprocessor extracts substrings from categorical columns and unused string columns.

The preprocessor will be applied to all categorical and text columns that were assigned one of the subroles include.substring or only.substring.

To further limit the scope of a substring preprocessor, you can also assign a unit.

PARAMETER DESCRIPTION
begin

Index of the beginning of the substring (starting from 0).

TYPE: int

length

The length of the substring.

TYPE: int

unit

The unit of all columns to which the preprocessor should be applied. These columns must also have the subrole substring.

If it is left empty, then the preprocessor will be applied to all columns with the subrole include.substring or only.substring.

TYPE: str DEFAULT: ''

Example
my_df.set_subroles("col1", getml.data.subroles.include.substring)

my_df.set_subroles("col2", getml.data.subroles.include.substring)
my_df.set_unit("col2", "substr14")

# Will be applied to col1 and col2
substr13 = getml.preprocessors.Substring(0, 3)

# Will only be applied to col2
substr14 = getml.preprocessors.Substring(0, 3, "substr14")

pipe = getml.Pipeline(
    population=population_placeholder,
    peripheral=[order_placeholder, trans_placeholder],
    preprocessors=[substr13],
    feature_learners=[feature_learner_1, feature_learner_2],
    feature_selectors=feature_selector,
    predictors=predictor,
    share_selected_features=0.5
)

TextFieldSplitter dataclass

TextFieldSplitter()

Bases: _Preprocessor

A TextFieldSplitter splits columns with role text into relational bag-of-words representations to allow the feature learners to learn patterns based on the prescence of certain words within the text fields.

Text fields will be split on a whitespace or any of the following characters:

; , . ! ? - | " \t \v \f \r \n % ' ( ) [ ] { }
Refer to the User Guide for more information.

Example
text_field_splitter = getml.preprocessors.TextFieldSplitter()

pipe = getml.Pipeline(
    population=population_placeholder,
    peripheral=[order_placeholder, trans_placeholder],
    preprocessors=[text_field_splitter],
    feature_learners=[feature_learner_1, feature_learner_2],
    feature_selectors=feature_selector,
    predictors=predictor,
    share_selected_features=0.5
)