getml.pipeline
Contains handlers for all steps involved in a data science project after data preparation:
- Automated feature learning
- Automated feature selection
- Training and evaluation of machine learning (ML) algorithms
- Deployment of the fitted models
Example
We assume that you have already set up your preprocessors (refer to preprocessors), your feature learners (refer to feature_learning) as well as your feature selectors and predictors (refer to predictors, which can be used for prediction and feature selection).

You might also want to refer to DataFrame, View, DataModel, Container, Placeholder and StarSchema.
If you want to create features for a time series problem, the easiest way to do so is to use the TimeSeries abstraction.

Note that this example is taken from the robot notebook.
# All rows before row 10500 will be used for training.
split = getml.data.split.time(data_all, "rowid", test=10500)

time_series = getml.data.TimeSeries(
    population=data_all,
    time_stamps="rowid",
    split=split,
    lagged_targets=False,
    memory=30,
)

pipe = getml.Pipeline(
    data_model=time_series.data_model,
    feature_learners=[...],
    predictors=...,
)

pipe.check(time_series.train)
pipe.fit(time_series.train)
pipe.score(time_series.test)

# To generate predictions on new data,
# it is sufficient to use a Container.
# You don't have to recreate the entire
# TimeSeries, because the abstract data model
# is stored in the pipeline.
container = getml.data.Container(
    population=population_new,
)

# Add the data as a peripheral table, for the
# self-join.
container.add(population=population_new)

predictions = pipe.predict(container.full)
Example
If your data can be organized in a simple star schema, you can use StarSchema. StarSchema unifies Container and DataModel.

Note that this example is taken from the loans notebook.
# First, we insert our data into a StarSchema.
# population_train and population_test are either
# DataFrames or Views. The population table
# defines the statistical population of your
# machine learning problem and contains the
# target variables.
star_schema = getml.data.StarSchema(
    train=population_train,
    test=population_test,
)

# meta, order and trans are either
# DataFrames or Views.
# Because this is a star schema,
# all joins take place on the population
# table.
star_schema.join(
    trans,
    on="account_id",
    time_stamps=("date_loan", "date"),
)

star_schema.join(
    order,
    on="account_id",
)

star_schema.join(
    meta,
    on="account_id",
)

# Now you can insert your data model,
# your preprocessors, feature learners,
# feature selectors and predictors
# into the pipeline.
# Note that the pipeline only knows
# the abstract data model, but hasn't
# seen the actual data yet.
pipe = getml.Pipeline(
    data_model=star_schema.data_model,
    preprocessors=[mapping],
    feature_learners=[fast_prop],
    feature_selectors=[feature_selector],
    predictors=predictor,
)

# Now, we pass the actual data.
# This passes 'population_train' and the
# peripheral tables (meta, order and trans)
# to the pipeline.
pipe.check(star_schema.train)
pipe.fit(star_schema.train)
pipe.score(star_schema.test)
Example
StarSchema is simpler, but cannot be used for more complex data models. The general approach is to use Container and DataModel:
# First, we insert our data into a Container.
# population_train and population_test are either
# DataFrames or Views.
container = getml.data.Container(
    train=population_train,
    test=population_test,
)

# meta, order and trans are either
# DataFrames or Views. They are given
# aliases, so we can refer to them in the
# DataModel.
container.add(
    meta=meta,
    order=order,
    trans=trans,
)

# Freezing makes the container immutable.
# This is not required, but often a good idea.
container.freeze()

# The abstract data model is constructed
# using the DataModel class. A data model
# does not contain any actual data. It just
# defines the abstract relational structure.
dm = getml.data.DataModel(
    population_train.to_placeholder("population")
)

dm.add(getml.data.to_placeholder(
    meta=meta,
    order=order,
    trans=trans,
))

dm.population.join(
    dm.trans,
    on="account_id",
    time_stamps=("date_loan", "date"),
)

dm.population.join(
    dm.order,
    on="account_id",
)

dm.population.join(
    dm.meta,
    on="account_id",
)

# Now you can insert your data model,
# your preprocessors, feature learners,
# feature selectors and predictors
# into the pipeline.
# Note that the pipeline only knows
# the abstract data model, but hasn't
# seen the actual data yet.
pipe = getml.Pipeline(
    data_model=dm,
    preprocessors=[mapping],
    feature_learners=[fast_prop],
    feature_selectors=[feature_selector],
    predictors=predictor,
)

# This passes 'population_train' and the
# peripheral tables (meta, order and trans)
# to the pipeline.
pipe.check(container.train)
pipe.fit(container.train)
pipe.score(container.test)
Technically, you don't have to use a Container. You might as well do this (in fact, a Container is just syntactic sugar for this approach):

pipe.check(
    population_train,
    {"meta": meta, "order": order, "trans": trans},
)

pipe.fit(
    population_train,
    {"meta": meta, "order": order, "trans": trans},
)

pipe.score(
    population_test,
    {"meta": meta, "order": order, "trans": trans},
)
Or you can pass the peripheral tables as a list. In that case, the order matters: it can be inferred from the __repr__() method of the pipeline, and it is usually alphabetical.

pipe.check(
    population_train,
    [meta, order, trans],
)

pipe.fit(
    population_train,
    [meta, order, trans],
)

pipe.score(
    population_test,
    [meta, order, trans],
)
delete
delete(name: str) -> None
If a pipeline named 'name' exists, it is deleted.
PARAMETER | DESCRIPTION
---|---
name | Name of the pipeline. TYPE: str
Source code in getml/pipeline/helpers2.py
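A minimal usage sketch (the pipeline name "my_pipeline" is hypothetical; exists is used as a guard so the call never hits a missing pipeline):

import getml

# Delete the pipeline only if it is actually present
# in the current project.
if getml.pipeline.exists("my_pipeline"):
    getml.pipeline.delete("my_pipeline")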
exists
exists(name: str) -> bool
Returns true if a pipeline named 'name' exists.
PARAMETER | DESCRIPTION
---|---
name | Name of the pipeline. TYPE: str

RETURNS | DESCRIPTION
---|---
bool | True if the pipeline exists, False otherwise.
Source code in getml/pipeline/helpers2.py
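A short sketch of a typical check (again, "my_pipeline" is a hypothetical name):

import getml

# exists returns a plain bool, so it can be used
# directly in conditionals.
if getml.pipeline.exists("my_pipeline"):
    print("Pipeline found in the current project.")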
list_pipelines
list_pipelines() -> List[str]
Lists all pipelines present in the Engine.
Note that this function only lists pipelines which are part of the current project. See set_project for changing projects and pipelines for more details about the lifecycle of pipelines.
To subsequently load one of them, use load.
RETURNS | DESCRIPTION
---|---
List[str] | List containing the names of all pipelines.
Source code in getml/pipeline/helpers2.py
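A brief sketch that lists all pipelines in the current project and loads the first one, if any exist:

import getml

# The returned list contains the names of all pipelines
# within the current project.
names = getml.pipeline.list_pipelines()
print(names)

if names:
    pipe = getml.pipeline.load(names[0])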
load
load(name: str) -> Pipeline
Loads a pipeline from the getML Engine into Python.
PARAMETER | DESCRIPTION
---|---
name | The name of the pipeline to be loaded. TYPE: str

RETURNS | DESCRIPTION
---|---
Pipeline | Pipeline that is a handler for the pipeline signified by name.
Source code in getml/pipeline/helpers2.py
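A minimal sketch, assuming "my_pipeline" is the name of a pipeline that has already been fitted in the current project; scoring reuses the star_schema container from the example above:

import getml

# Load the handler and reuse it, e.g. for scoring
# on the test set from the StarSchema example.
pipe = getml.pipeline.load("my_pipeline")
scores = pipe.score(star_schema.test)
print(scores)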