getml.data.Container
Container(
    population: Optional[Union[DataFrame, View]] = None,
    peripheral: Optional[Dict[str, Union[DataFrame, View]]] = None,
    split: Optional[Union[StringColumn, StringColumnView]] = None,
    deep_copy: Optional[bool] = False,
    train: Optional[Union[DataFrame, View]] = None,
    validation: Optional[Union[DataFrame, View]] = None,
    test: Optional[Union[DataFrame, View]] = None,
    **kwargs: Optional[Union[DataFrame, View]]
)
A container holds the actual data in the form of a DataFrame or a View.
The purpose of a container is twofold:
- Assigning concrete data to an abstract DataModel.
- Storing data and allowing you to reproduce previous results.
ATTRIBUTE | DESCRIPTION
---|---
population | The population table defines the statistical population of the machine learning problem and contains the target variables.
peripheral | The peripheral tables are joined onto population or other peripheral tables. Note that you can also pass them using add.
split | Contains information on how you want to split population into different subsets.
deep_copy | Whether you want to create deep copies of your tables.
train | The population table used in the train subset.
validation | The population table used in the validation subset.
test | The population table used in the test subset.
kwargs | Any other subset of the population table, passed as a keyword argument under the subset's name.
Example
A DataModel only contains abstract data. When we fit a pipeline, we need to assign concrete data.
This example is taken from the loans notebook. Note that the notebook uses the high-level StarSchema implementation; for demonstration purposes, we proceed with the low-level implementation here.
# The abstract data model is constructed
# using the DataModel class. A data model
# does not contain any actual data. It just
# defines the abstract relational structure.
dm = getml.data.DataModel(
population_train.to_placeholder("population")
)
dm.add(getml.data.to_placeholder(
meta=meta,
order=order,
trans=trans)
)
dm.population.join(
dm.trans,
on="account_id",
time_stamps=("date_loan", "date")
)
dm.population.join(
dm.order,
on="account_id",
)
dm.population.join(
dm.meta,
on="account_id",
)
# We now have abstract placeholders on something
# called "population", "meta", "order" and "trans".
# But how do we assign concrete data? By using
# a container.
container = getml.data.Container(
train=population_train,
test=population_test
)
# meta, order and trans are either
# DataFrames or Views. Their aliases need
# to match the names of the placeholders in the
# data model.
container.add(
meta=meta,
order=order,
trans=trans
)
# Freezing makes the container immutable.
# This is not required, but often a good idea.
container.freeze()
# When we call 'train', the container
# will return the train set and the
# peripheral tables.
my_pipeline.fit(container.train)
# Same for 'test'
my_pipeline.score(container.test)
Instead of passing train and test explicitly, you can also pass population together with a split column generated by the split module:
split = getml.data.split.random(
    train=0.8, test=0.2)
container = getml.data.Container(
    population=population_all,
    split=split,
)
# The remaining code is the same as in
# the example above. In particular,
# container.train and container.test
# work just like above.
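The semantics of such a random split can be sketched in plain Python. This is a toy stand-in, not the getml implementation (getml computes the split column on its engine); the function name and seed are assumptions for illustration only:

```python
import random

def random_split(n_rows, train=0.8, test=0.2, seed=42):
    """Toy sketch: assign each row to 'train' or 'test' with the given probabilities."""
    assert abs(train + test - 1.0) < 1e-9, "probabilities must sum to 1"
    rng = random.Random(seed)
    return ["train" if rng.random() < train else "test" for _ in range(n_rows)]

split = random_split(1000)
print(split[:3])
print(split.count("train"), split.count("test"))
```

The container then groups rows by the value in the split column, which is why container.train and container.test work exactly as in the explicit train=/test= example.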
Containers can also be used for storage and for reproducing your results. A recommended pattern is to assign 'baseline roles' to your data frames and then use a View to tweak them:
# Assign baseline roles
data_frame.set_role(["jk"], getml.data.roles.join_key)
data_frame.set_role(["col1", "col2"], getml.data.roles.categorical)
data_frame.set_role(["col3", "col4"], getml.data.roles.numerical)
data_frame.set_role(["col5"], getml.data.roles.target)
# Make the data frame immutable, so in-place operations are
# no longer possible.
data_frame.freeze()
# Save the data frame.
data_frame.save()
# I suspect that col1 leads to overfitting, so I will drop it.
view = data_frame.drop(["col1"])
# Insert the view into a container.
container = getml.data.Container(...)
container.add(some_alias=view)
container.save()
The advantage of this pattern is that it lets you fully retrace your pipeline without creating deep copies of the data frames every time you make a small change like the one in this example. Note that the pipeline records which container you have used.
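Why a view is cheaper than a deep copy can be sketched with a toy stand-in (pure Python, not the getml API; the Frame and View classes here are hypothetical): a view records only the transformation, so the underlying columns are stored once and every tweak remains retraceable.

```python
class Frame:
    """Toy data frame: columns are stored once, in a dict."""
    def __init__(self, columns):
        self.columns = dict(columns)  # column name -> data

class View:
    """Toy view: records a column drop instead of copying the base frame."""
    def __init__(self, base, dropped):
        self.base = base
        self.dropped = set(dropped)

    @property
    def columns(self):
        # Computed on access; the base data is never duplicated.
        return {k: v for k, v in self.base.columns.items()
                if k not in self.dropped}

df = Frame({"col1": [1, 2], "col2": [3, 4]})
view = View(df, ["col1"])
print(sorted(view.columns))             # the view hides col1
print(view.base.columns is df.columns)  # the underlying data is shared, not copied
```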
Source code in getml/data/container.py
add
add(*args, **kwargs)
Adds new peripheral data frames or views.
Source code in getml/data/container.py
freeze
freeze()
Freezes the container, so that changes are no longer possible.
This is required before you can extract data when deep_copy=True. The idea of deep_copy is to ensure that you can always retrace and reproduce your results. That is why the container needs to be immutable before it can be used.
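The role of freezing can be sketched with a minimal pure-Python stand-in (a hypothetical class, not the getml implementation): once frozen, mutating calls raise, so whatever results you produce from the container can later be traced back to exactly this state.

```python
class FrozenError(Exception):
    pass

class MiniContainer:
    """Toy container that becomes immutable after freeze()."""
    def __init__(self):
        self._frozen = False
        self._tables = {}

    def add(self, **tables):
        if self._frozen:
            raise FrozenError("container is frozen; no further changes allowed")
        self._tables.update(tables)

    def freeze(self):
        self._frozen = True

c = MiniContainer()
c.add(meta=[1, 2, 3])
c.freeze()
try:
    c.add(order=[4, 5])      # any mutation after freeze() is rejected
except FrozenError as e:
    print("blocked:", e)
```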
Source code in getml/data/container.py
save
save()
Saves the Container to disk.
Source code in getml/data/container.py
sync
sync()
Synchronizes the last change with the data to avoid warnings that the data has been changed. This is only a problem when deep_copy=False.
Source code in getml/data/container.py
to_pandas
to_pandas()
Returns the Container's contents as a dictionary of pandas.DataFrames: each key holds a data frame's name, each value the corresponding data converted to a pandas.DataFrame.
Source code in getml/data/container.py