getml.preprocessors
Contains routines for preprocessing data frames.
CategoryTrimmer
dataclass
Bases: _Preprocessor
Reduces the cardinality of high-cardinality categorical columns.
PARAMETER | DESCRIPTION |
---|---|
max_num_categories | The maximum cardinality allowed. If the cardinality is higher than that, only the most frequent categories will be kept; all others will be trimmed. |
min_freq | The minimum frequency required for a category to be included. |
Example
category_trimmer = getml.preprocessors.CategoryTrimmer()
pipe = getml.Pipeline(
    population=population_placeholder,
    peripheral=[order_placeholder, trans_placeholder],
    preprocessors=[category_trimmer],
    feature_learners=[feature_learner_1, feature_learner_2],
    feature_selectors=feature_selector,
    predictors=predictor,
    share_selected_features=0.5
)
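To make the trimming rule concrete, here is a minimal plain-Python sketch of how the two parameters could interact. It is not the getML implementation: the helper `trim_categories`, the `"(trimmed)"` placeholder, and the assumption that both thresholds are applied together are all illustrative.

```python
from collections import Counter


def trim_categories(values, max_num_categories, min_freq, placeholder="(trimmed)"):
    """Keep only frequent categories; replace all others with a placeholder.

    Hypothetical helper: the placeholder token and the way the two
    thresholds are combined are assumptions made for illustration.
    """
    counts = Counter(values)
    # most_common(n) returns the n categories with the highest counts.
    kept = {
        category
        for category, freq in counts.most_common(max_num_categories)
        if freq >= min_freq
    }
    return [value if value in kept else placeholder for value in values]


colors = ["red"] * 5 + ["blue"] * 3 + ["green"] * 2 + ["teal"]
trimmed = trim_categories(colors, max_num_categories=2, min_freq=3)
print(Counter(trimmed))
```

With these inputs, "red" and "blue" survive, while "green" and "teal" are merged into the placeholder.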
EmailDomain
dataclass
EmailDomain()
Bases: _Preprocessor
The EmailDomain preprocessor extracts the domain from e-mail addresses.
For instance, if the e-mail address is 'some.guy@domain.com', the preprocessor will automatically extract '@domain.com'.
The preprocessor will be applied to all text columns that were assigned one of the subroles include.email or only.email.
It is recommended that you assign only.email, because it is unlikely that the e-mail address itself is interesting.
Example
my_data_frame.set_subroles("email", getml.data.subroles.only.email)
domain = getml.preprocessors.EmailDomain()
pipe = getml.Pipeline(
    population=population_placeholder,
    peripheral=[order_placeholder, trans_placeholder],
    preprocessors=[domain],
    feature_learners=[feature_learner_1, feature_learner_2],
    feature_selectors=feature_selector,
    predictors=predictor,
    share_selected_features=0.5
)
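The extraction described above amounts to keeping everything from the "@" sign onward. A minimal sketch in plain Python, assuming that the last "@" marks the domain and that strings without "@" yield no domain (`extract_email_domain` is a hypothetical helper, not part of getML):

```python
def extract_email_domain(address):
    """Return the domain part of an e-mail address, including the '@'."""
    at = address.rfind("@")
    # Assumption: values without an '@' are treated as having no domain.
    return address[at:] if at != -1 else None


domain_part = extract_email_domain("some.guy@domain.com")
print(domain_part)
```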
Imputation
dataclass
Imputation(add_dummies: bool = False)
Bases: _Preprocessor
The Imputation preprocessor replaces all NULL values in a numerical column with the mean of the remaining (non-NULL) values in that column.
Optionally, it can additionally add a dummy column that signifies whether the original value was imputed.
PARAMETER | DESCRIPTION |
---|---|
add_dummies | Whether you want to add dummy variables that signify whether the original value was imputed. TYPE: bool |
Example
imputation = getml.preprocessors.Imputation()
pipe = getml.Pipeline(
    population=population_placeholder,
    peripheral=[order_placeholder, trans_placeholder],
    preprocessors=[imputation],
    feature_learners=[feature_learner_1, feature_learner_2],
    feature_selectors=feature_selector,
    predictors=predictor,
    share_selected_features=0.5
)
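As an illustration of mean imputation with an optional dummy column, the following plain-Python sketch may help. It assumes NULL is represented by `None` and that the dummy is 1.0 for imputed rows; `impute_mean` is a hypothetical helper, not the getML API.

```python
def impute_mean(column, add_dummies=False):
    """Replace None with the mean of the observed values in the column.

    Returns (imputed_column, dummy_column); dummy_column is None unless
    add_dummies is set. The 1.0/0.0 dummy encoding is an assumption.
    """
    observed = [value for value in column if value is not None]
    if not observed:
        raise ValueError("cannot impute a column without any non-NULL value")
    mean = sum(observed) / len(observed)
    imputed = [mean if value is None else value for value in column]
    dummy = None
    if add_dummies:
        dummy = [1.0 if value is None else 0.0 for value in column]
    return imputed, dummy


imputed, dummy = impute_mean([1.0, None, 3.0], add_dummies=True)
print(imputed, dummy)
```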
Mapping
dataclass
Mapping(
aggregation: Iterable[
MappingAggregations
] = MAPPING.default,
min_freq: int = 30,
multithreading: bool = True,
)
Bases: _Preprocessor
A mapping preprocessor maps categorical values, discrete values and individual words in a text field to numerical values. These numerical values are retrieved by aggregating targets in the relational neighbourhood.
You are particularly encouraged to use the mapping preprocessor in combination with FastProp.
Refer to the User guide for more information.
Enterprise edition
This feature is exclusive to the Enterprise edition and is not available in the Community edition. Discover the benefits of the Enterprise edition and compare the features of both editions.
For licensing information and technical support, please contact us.
ATTRIBUTE | DESCRIPTION |
---|---|
agg_sets | A class variable holding the available aggregation sets for the mapping preprocessor. |
PARAMETER | DESCRIPTION |
---|---|
aggregation | The aggregation functions to use over the targets. Each must be an aggregation supported by the Mapping preprocessor. TYPE: Iterable[MappingAggregations] |
min_freq | The minimum number of targets required for a value to be included in the mapping. Range: [0, ∞) TYPE: int |
multithreading | Whether you want to apply multithreading. TYPE: bool |
Example
mapping = getml.preprocessors.Mapping()
pipe = getml.Pipeline(
    population=population_placeholder,
    peripheral=[order_placeholder, trans_placeholder],
    preprocessors=[mapping],
    feature_learners=[feature_learner_1, feature_learner_2],
    feature_selectors=feature_selector,
    predictors=predictor,
    share_selected_features=0.5
)
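The core idea of a mapping, replacing each value by an aggregate of the targets observed with it, can be sketched on a single flat table. This is a deliberate simplification: it ignores the relational neighbourhood, uses the mean as the only aggregation, and `build_mapping` is a hypothetical helper rather than anything exposed by getML.

```python
from collections import defaultdict
from statistics import mean


def build_mapping(categories, targets, min_freq=30):
    """Map each category to the mean target observed for it.

    Categories backed by fewer than min_freq targets are left out,
    mirroring the role of the min_freq parameter described above.
    """
    grouped = defaultdict(list)
    for category, target in zip(categories, targets):
        grouped[category].append(target)
    return {
        category: mean(values)
        for category, values in grouped.items()
        if len(values) >= min_freq
    }


categories = ["a", "a", "b", "a", "b", "c"]
targets = [1.0, 0.0, 1.0, 1.0, 1.0, 0.0]
mapping = build_mapping(categories, targets, min_freq=2)
# Values without a mapping (here "c") become None in this sketch.
mapped = [mapping.get(category) for category in categories]
print(mapping)
```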
Seasonal
dataclass
Seasonal(
disable_year: bool = False,
disable_month: bool = False,
disable_weekday: bool = False,
disable_hour: bool = False,
disable_minute: bool = False,
)
Bases: _Preprocessor
The Seasonal preprocessor extracts seasonal data from time stamps.
The preprocessor automatically iterates through all time stamps in any data frame and extracts seasonal parameters.
These include:
- year
- month
- weekday
- hour
- minute
The algorithm also evaluates the potential usefulness of any extracted seasonal parameter. Parameters that are unlikely to be useful are not included.
PARAMETER | DESCRIPTION |
---|---|
disable_year | Prevents the Seasonal preprocessor from extracting the year from time stamps. TYPE: bool |
disable_month | Prevents the Seasonal preprocessor from extracting the month from time stamps. TYPE: bool |
disable_weekday | Prevents the Seasonal preprocessor from extracting the weekday from time stamps. TYPE: bool |
disable_hour | Prevents the Seasonal preprocessor from extracting the hour from time stamps. TYPE: bool |
disable_minute | Prevents the Seasonal preprocessor from extracting the minute from time stamps. TYPE: bool |
Example
seasonal = getml.preprocessors.Seasonal()
pipe = getml.Pipeline(
    population=population_placeholder,
    peripheral=[order_placeholder, trans_placeholder],
    preprocessors=[seasonal],
    feature_learners=[feature_learner_1, feature_learner_2],
    feature_selectors=feature_selector,
    predictors=predictor,
    share_selected_features=0.5
)
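For intuition, the extracted components and the effect of the `disable_*` flags can be sketched with the standard library. This is only an illustration: `seasonal_components` is a hypothetical helper, the weekday encoding (Monday = 0) is Python's rather than necessarily getML's, and the usefulness check mentioned above is not modelled.

```python
from datetime import datetime


def seasonal_components(
    ts,
    disable_year=False,
    disable_month=False,
    disable_weekday=False,
    disable_hour=False,
    disable_minute=False,
):
    """Return the seasonal components of a time stamp as a dict."""
    parts = {}
    if not disable_year:
        parts["year"] = ts.year
    if not disable_month:
        parts["month"] = ts.month
    if not disable_weekday:
        parts["weekday"] = ts.weekday()  # Monday = 0 in Python's convention
    if not disable_hour:
        parts["hour"] = ts.hour
    if not disable_minute:
        parts["minute"] = ts.minute
    return parts


components = seasonal_components(datetime(2024, 3, 15, 13, 45), disable_minute=True)
print(components)
```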
Substring
dataclass
Bases: _Preprocessor
The Substring preprocessor extracts substrings from categorical columns and unused string columns.
The preprocessor will be applied to all categorical and text columns that were assigned one of the subroles include.substring or only.substring.
To further limit the scope of a substring preprocessor, you can also assign a unit.
PARAMETER | DESCRIPTION |
---|---|
begin | Index of the beginning of the substring (starting from 0). |
length | The length of the substring. |
unit | The unit of all columns to which the preprocessor should be applied. These columns must also have the substring subrole. If it is left empty, the preprocessor will be applied to all columns with the substring subrole. |
Example
my_df.set_subroles("col1", getml.data.subroles.include.substring)
my_df.set_subroles("col2", getml.data.subroles.include.substring)
my_df.set_unit("col2", "substr14")
# Will be applied to col1 and col2
substr13 = getml.preprocessors.Substring(0, 3)
# Will only be applied to col2
substr14 = getml.preprocessors.Substring(0, 3, "substr14")
pipe = getml.Pipeline(
    population=population_placeholder,
    peripheral=[order_placeholder, trans_placeholder],
    preprocessors=[substr13],
    feature_learners=[feature_learner_1, feature_learner_2],
    feature_selectors=feature_selector,
    predictors=predictor,
    share_selected_features=0.5
)
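The `begin`/`length` semantics and the scoping by unit can be mimicked in a few lines of plain Python. Both helpers below are hypothetical, and the dictionary of column units is an assumed stand-in for columns that already carry a substring subrole.

```python
def extract_substring(value, begin, length):
    """Return `length` characters of `value`, starting at 0-based index `begin`."""
    return value[begin:begin + length]


def columns_in_scope(column_units, unit=""):
    """Select the columns a substring preprocessor would be applied to.

    column_units maps the names of columns that have a substring subrole
    to their unit. An empty unit selects all of them.
    """
    if not unit:
        return sorted(column_units)
    return sorted(name for name, u in column_units.items() if u == unit)


column_units = {"col1": "", "col2": "substr14"}
print(columns_in_scope(column_units))              # both columns
print(columns_in_scope(column_units, "substr14"))  # only col2
print(extract_substring("DE-10115", 0, 2))
```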
TextFieldSplitter
dataclass
TextFieldSplitter()
Bases: _Preprocessor
A TextFieldSplitter splits columns with role text into relational bag-of-words representations, allowing the feature learners to learn patterns based on the presence of certain words within the text fields.
Text fields will be split on a whitespace or any of the following characters:
; , . ! ? - | " \t \v \f \r \n % ' ( ) [ ] { }
Example
text_field_splitter = getml.preprocessors.TextFieldSplitter()
pipe = getml.Pipeline(
    population=population_placeholder,
    peripheral=[order_placeholder, trans_placeholder],
    preprocessors=[text_field_splitter],
    feature_learners=[feature_learner_1, feature_learner_2],
    feature_selectors=feature_selector,
    predictors=predictor,
    share_selected_features=0.5
)
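The splitting rule listed above can be approximated with a regular expression. The sketch below is an illustration only: `split_text_field` is a hypothetical helper, and dropping empty tokens and leaving case untouched are assumptions rather than documented getML behaviour.

```python
import re

# The separator characters listed above; whitespace is added via \s.
SEPARATORS = ";,.!?-|\"\t\v\f\r\n%'()[]{}"
_SPLIT_PATTERN = re.compile("[" + re.escape(SEPARATORS) + r"\s]+")


def split_text_field(text):
    """Split a text field into words on whitespace and separator characters."""
    # Assumption: empty tokens produced by leading/trailing separators are dropped.
    return [word for word in _SPLIT_PATTERN.split(text) if word]


words = split_text_field("Hello, world! (getML's pre-processor)")
print(words)
```

In a relational bag-of-words representation, each of these words would become its own row linked back to the original text field.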