Preprocessing
We categorize as preprocessing all operations on data frames that are not directly related to the relational data model. While feature learning and propositionalization deal with relational data structures and result in a single-table representation thereof, preprocessing covers all operations that work on single tables. This includes numerical transformations, encoding techniques, and alternative representations.
getML's preprocessors allow you to extract domains from email addresses (EmailDomain), impute missing values (Imputation), map categorical columns to a continuous representation (Mapping), extract seasonal components from time stamps (Seasonal), extract substrings from string-based columns (Substring), and split up text columns (TextFieldSplitter). Preprocessing operations in getML are efficient and add little overhead to a pipeline: preprocessors operate on an abstract level without modifying your original data, are evaluated lazily, and require minimal set-up.
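Setting one up typically takes a single line. A minimal sketch, assuming the default parameters are acceptable for your data:

import getml

# A sketch only: each preprocessor is constructed with its defaults; the
# resulting list would be passed to a Pipeline via its preprocessors argument.
preprocessors = [
    getml.preprocessors.EmailDomain(),        # extract domains from email addresses
    getml.preprocessors.Imputation(),         # impute missing numerical values
    getml.preprocessors.Seasonal(),           # extract hour, weekday, month, ...
    getml.preprocessors.TextFieldSplitter(),  # split text columns into words
]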
Here is a small example that shows the Seasonal preprocessor in action.
import getml
getml.project.switch("seasonal")
traffic = getml.datasets.load_interstate94()
# traffic explicitly holds seasonal components (hour, day, month, ...)
# extracted from column ds; we copy traffic and delete all those components
traffic2 = traffic.drop(["hour", "weekday", "day", "month", "year"])
start_test = getml.data.time.datetime(2018, 3, 14)
split = getml.data.split.time(
    population=traffic,
    test=start_test,
    time_stamp="ds",
)
time_series1 = getml.data.TimeSeries(
    population=traffic,
    split=split,
    time_stamps="ds",
    horizon=getml.data.time.hours(1),
    memory=getml.data.time.days(7),
    lagged_targets=True,
)
time_series2 = getml.data.TimeSeries(
    population=traffic2,
    split=split,
    time_stamps="ds",
    horizon=getml.data.time.hours(1),
    memory=getml.data.time.days(7),
    lagged_targets=True,
)
fast_prop = getml.feature_learning.FastProp(
    loss_function=getml.feature_learning.loss_function.SquareLoss,
)
pipe1 = getml.pipeline.Pipeline(
    data_model=time_series1.data_model,
    feature_learners=[fast_prop],
    predictors=[getml.predictors.XGBoostRegressor()],
)
pipe2 = getml.pipeline.Pipeline(
    data_model=time_series2.data_model,
    preprocessors=[getml.preprocessors.Seasonal()],
    feature_learners=[fast_prop],
    predictors=[getml.predictors.XGBoostRegressor()],
)
# pipe1 includes no preprocessor but receives the data frame with the components
pipe1.fit(time_series1.train)
# pipe2 includes the preprocessor; receives data w/o components
pipe2.fit(time_series2.train)
month_based1 = pipe1.features.filter(lambda feat: "month" in feat.sql)
month_based2 = pipe2.features.filter(
    lambda feat: "COUNT( DISTINCT t2.\"strftime('%m'" in feat.sql
)
print(month_based1[1].sql)
# Output:
# DROP TABLE IF EXISTS "FEATURE_1_10";
#
# CREATE TABLE "FEATURE_1_10" AS
# SELECT COUNT( t2."month" ) - COUNT( DISTINCT t2."month" ) AS "feature_1_10",
# t1.rowid AS "rownum"
# FROM "POPULATION__STAGING_TABLE_1" t1
# LEFT JOIN "POPULATION__STAGING_TABLE_2" t2
# ON 1 = 1
# WHERE t2."ds, '+1.000000 hours'" <= t1."ds"
# AND ( t2."ds, '+7.041667 days'" > t1."ds" OR t2."ds, '+7.041667 days'" IS NULL )
# GROUP BY t1.rowid;
print(month_based2[0].sql)
# Output:
# DROP TABLE IF EXISTS "FEATURE_1_5";
#
# CREATE TABLE "FEATURE_1_5" AS
# SELECT COUNT( t2."strftime('%m', ds )" ) - COUNT( DISTINCT t2."strftime('%m', ds )" ) AS "feature_1_5",
# t1.rowid AS "rownum"
# FROM "POPULATION__STAGING_TABLE_1" t1
# LEFT JOIN "POPULATION__STAGING_TABLE_2" t2
# ON 1 = 1
# WHERE t2."ds, '+1.000000 hours'" <= t1."ds"
# AND ( t2."ds, '+7.041667 days'" > t1."ds" OR t2."ds, '+7.041667 days'" IS NULL )
# GROUP BY t1.rowid;
If you compare the two features above, you will notice that they are exactly the same: COUNT - COUNT(DISTINCT) on the month component, conditional on the time-based restrictions introduced through memory and horizon. The only difference is that pipe1's feature uses the pre-existing month column, while pipe2's feature computes the month on the fly (strftime('%m', ds)) because the Seasonal preprocessor extracted it.
Pipelines can include more than one preprocessor.
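For illustration, a hypothetical pipeline might combine Seasonal with Imputation, reusing the data model, feature learner, and predictor from the example above:

# A sketch only: both preprocessors are applied before feature learning.
pipe3 = getml.pipeline.Pipeline(
    data_model=time_series2.data_model,
    preprocessors=[
        getml.preprocessors.Seasonal(),
        getml.preprocessors.Imputation(),
    ],
    feature_learners=[fast_prop],
    predictors=[getml.predictors.XGBoostRegressor()],
)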
While most of getML's preprocessors are straightforward, two of them deserve a more detailed introduction: Mapping and TextFieldSplitter.
Mappings
Enterprise edition
This feature is exclusive to the Enterprise edition and is not available in the Community edition. Discover the benefits of the Enterprise edition and compare the features of both editions. For licensing information and technical support, please contact us.
Mappings are an alternative representation for categorical columns, text columns, and (quasi-categorical) discrete numerical columns. Each distinct value (category) of a categorical column is mapped to a continuous spectrum by calculating the average target value over the subset of rows belonging to the respective category. For columns from peripheral tables, the average target values are propagated back by traversing the relational structure.
Mappings are a simple and interpretable alternative representation for categorical data. By introducing a continuous representation, mappings allow getML's feature learning algorithms to apply arbitrary aggregations to categorical columns. Further, mappings enable substantial efficiency gains when learning patterns from categorical data. You can control the extent to which mappings are utilized through the min_freq parameter, which specifies the minimum number of matching rows a category requires to be included in a mapping.
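For example, a Mapping that only considers categories matching at least 30 rows could be set up as follows (a minimal sketch; the threshold is illustrative):

import getml

# A sketch: categories matching fewer than 30 rows receive no mapping.
mapping = getml.preprocessors.Mapping(min_freq=30)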
Here is an example mapping from the CORA notebook:
DROP TABLE IF EXISTS "CATEGORICAL_MAPPING_1_1_1";
CREATE TABLE "CATEGORICAL_MAPPING_1_1_1"(key TEXT NOT NULL PRIMARY KEY, value NUMERIC);
INSERT INTO "CATEGORICAL_MAPPING_1_1_1"(key, value)
VALUES('Case_Based', 0.7109826589595376),
('Rule_Learning', 0.07368421052631578),
('Reinforcement_Learning', 0.0576923076923077),
('Theory', 0.0547945205479452),
('Genetic_Algorithms', 0.03157894736842105),
('Neural_Networks', 0.02088772845953003),
('Probabilistic_Methods', 0.01293103448275862);
Handling of free-form text
getML provides the role text to annotate free-form text fields within relational data structures. Learning from text columns works as follows: First, a vocabulary is built for each text column, with its size controlled by the feature learner's text-mining hyperparameter vocab_size. getML then deals with words that belong to the vocabulary through one of two approaches: text fields can either be integrated into features by learning conditions based on the mere presence (or absence) of certain words in those fields (the default), or they can be split into a relational bag-of-words representation by means of the TextFieldSplitter preprocessor. Opting for the second approach is as easy as adding the TextFieldSplitter to the list of preprocessors on your Pipeline.
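A minimal sketch, where data_model and fast_prop stand in for your own data model and feature learner:

# A sketch only: adding the TextFieldSplitter to the preprocessors list.
pipe = getml.pipeline.Pipeline(
    data_model=data_model,
    preprocessors=[getml.preprocessors.TextFieldSplitter()],
    feature_learners=[fast_prop],
    predictors=[getml.predictors.XGBoostRegressor()],
)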
The resulting bag of words can be viewed as another one-to-many relationship within our data model, where each row holding a text field is related to n peripheral rows (n being the number of words in the text field). Consider the following example, where a text field is split into a relational bag of words.
One row of a table with a text field

| rownum | text field |
|---|---|
| 52 | The quick brown fox jumps over the lazy dog |
The (implicit) peripheral table that results from splitting

| rownum | words |
|---|---|
| 52 | the |
| 52 | quick |
| 52 | brown |
| 52 | fox |
| 52 | jumps |
| 52 | over |
| 52 | the |
| 52 | lazy |
| 52 | dog |
As text fields now represent another relation, getML's feature learning algorithms are able to learn structural logic from text fields' contents by applying aggregations over the resulting bag of words itself (e.g. COUNT WHERE words IN ('quick', 'jumps')). Further, by utilizing mappings, any aggregation applicable to a (mapped) categorical column can be applied to the bag of words as well.
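A hypothetical preprocessors list combining both techniques:

# A sketch only: TextFieldSplitter produces the relational bag of words,
# Mapping provides a continuous representation of the resulting words.
preprocessors = [
    getml.preprocessors.TextFieldSplitter(),
    getml.preprocessors.Mapping(),
]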
Note that the splitting of text fields can be computationally expensive. If performance suffers too much, you may resort to the default behavior by removing the TextFieldSplitter from your Pipeline.