
getml.data.Container

Container(
    population: Optional[Union[DataFrame, View]] = None,
    peripheral: Optional[
        Dict[str, Union[DataFrame, View]]
    ] = None,
    split: Optional[
        Union[StringColumn, StringColumnView]
    ] = None,
    deep_copy: Optional[bool] = False,
    train: Optional[Union[DataFrame, View]] = None,
    validation: Optional[Union[DataFrame, View]] = None,
    test: Optional[Union[DataFrame, View]] = None,
    **kwargs: Optional[Union[DataFrame, View]]
)

A container holds the actual data in the form of a DataFrame or a View.

The purpose of a container is twofold:

  • Assigning concrete data to an abstract DataModel.

  • Storing data and allowing you to reproduce previous results.

ATTRIBUTE DESCRIPTION
population

The population table defines the statistical population of the machine learning problem and contains the target variables.

peripheral

The peripheral tables are joined onto population or other peripheral tables. Note that you can also pass them using add.

split

Contains information on how you want to split population into different Subsets. Also refer to split.

deep_copy

Whether you want to create deep copies of your tables.

train

The population table used in the train Subset. You can either pass population and split or you can pass the subsets separately using train, validation, test and kwargs.

validation

The population table used in the validation Subset. You can either pass population and split or you can pass the subsets separately using train, validation, test and kwargs.

test

The population table used in the test Subset. You can either pass population and split or you can pass the subsets separately using train, validation, test and kwargs.

kwargs

The population table used in Subsets other than the predefined train, validation and test subsets. You can call these subsets anything you want to, and you can access them just like train, validation and test. You can either pass population and split or you can pass the subsets separately using train, validation, test and kwargs.

Example
# Pass the subset.
container = getml.data.Container(my_subset=my_data_frame)

# You can access the subset just like train,
# validation or test
my_pipeline.fit(container.my_subset)

Example

A DataModel only contains abstract data. When we fit a pipeline, we need to assign concrete data.

This example is taken from the loans notebook. Note that the notebook uses the high-level StarSchema implementation. For demonstration purposes, we proceed here with the low-level implementation.

# The abstract data model is constructed
# using the DataModel class. A data model
# does not contain any actual data. It just
# defines the abstract relational structure.
dm = getml.data.DataModel(
    population_train.to_placeholder("population")
)

dm.add(getml.data.to_placeholder(
    meta=meta,
    order=order,
    trans=trans)
)

dm.population.join(
    dm.trans,
    on="account_id",
    time_stamps=("date_loan", "date")
)

dm.population.join(
    dm.order,
    on="account_id",
)

dm.population.join(
    dm.meta,
    on="account_id",
)

# We now have abstract placeholders on something
# called "population", "meta", "order" and "trans".
# But how do we assign concrete data? By using
# a container.
container = getml.data.Container(
    train=population_train,
    test=population_test
)

# meta, order and trans are either
# DataFrames or Views. Their aliases need
# to match the names of the placeholders in the
# data model.
container.add(
    meta=meta,
    order=order,
    trans=trans
)

# Freezing makes the container immutable.
# This is not required, but often a good idea.
container.freeze()

# When we call 'train', the container
# will return the train set and the
# peripheral tables.
my_pipeline.fit(container.train)

# Same for 'test'
my_pipeline.score(container.test)

If you don't already have a train and test set, you can use a function from the split module.

split = getml.data.split.random(
    train=0.8, test=0.2)

container = getml.data.Container(
    population=population_all,
    split=split,
)

# The remaining code is the same as in
# the example above. In particular,
# container.train and container.test
# work just like above.

Containers can also be used for storing data and reproducing your results. A recommended pattern is to assign 'baseline roles' to your data frames and then use a View to tweak them:

# Assign baseline roles
data_frame.set_role(["jk"], getml.data.roles.join_key)
data_frame.set_role(["col1", "col2"], getml.data.roles.categorical)
data_frame.set_role(["col3", "col4"], getml.data.roles.numerical)
data_frame.set_role(["col5"], getml.data.roles.target)

# Make the data frame immutable, so in-place operations are
# no longer possible.
data_frame.freeze()

# Save the data frame.
data_frame.save()

# I suspect that col1 leads to overfitting, so I will drop it.
view = data_frame.drop(["col1"])

# Insert the view into a container.
container = getml.data.Container(...)
container.add(some_alias=view)
container.save()

The advantage of this pattern is that you can always completely retrace your pipeline without creating deep copies of the data frames every time you make a small change like the one in our example. Note that the pipeline will record which container you have used.

Source code in getml/data/container.py
def __init__(
    self,
    population: Optional[Union[DataFrame, View]] = None,
    peripheral: Optional[Dict[str, Union[DataFrame, View]]] = None,
    split: Optional[Union[StringColumn, StringColumnView]] = None,
    deep_copy: Optional[bool] = False,
    train: Optional[Union[DataFrame, View]] = None,
    validation: Optional[Union[DataFrame, View]] = None,
    test: Optional[Union[DataFrame, View]] = None,
    **kwargs: Optional[Union[DataFrame, View]],
):
    if population is not None and not isinstance(population, (DataFrame, View)):
        raise TypeError(
            "'population' must be a getml.DataFrame or a getml.data.View, got "
            + type(population).__name__
            + "."
        )

    if peripheral is not None and not _is_typed_dict(
        peripheral, str, [DataFrame, View]
    ):
        raise TypeError(
            "'peripheral' must be a dict "
            + "of getml.DataFrames or getml.data.Views."
        )

    if split is not None and not isinstance(
        split, (StringColumn, StringColumnView)
    ):
        raise TypeError(
            "'split' must be StringColumn or a StringColumnView, got "
            + type(split).__name__
            + "."
        )

    if not isinstance(deep_copy, bool):
        raise TypeError(
            "'deep_copy' must be a bool, got " + type(deep_copy).__name__ + "."
        )

    exclusive = (population is not None) ^ (
        len(_make_subsets_from_kwargs(train, validation, test, **kwargs)) != 0
    )

    if not exclusive:
        raise ValueError(
            "'population' and 'train', 'validation', 'test' as well as "
            + "other subsets signified by kwargs are mutually exclusive. "
            + "You have to pass "
            + "either 'population' or some subsets, but you cannot pass both."
        )

    if population is None and split is not None:
        raise ValueError(
            "'split' is used for splitting population DataFrames. "
            "Hence, if you supply 'split', you also have to supply "
            "a population."
        )

    if population is not None and split is None:
        logger.warning(
            "You have passed a population table without passing 'split'. "
            "You can access the entire set to pass to your pipeline "
            "using the .full attribute."
        )
        split = from_value("full")

    self._id = _make_id()

    self._population = population
    self._peripheral = peripheral or {}
    self._split = split
    self._deep_copy = deep_copy

    # HACK: Do some explicit bookkeeping on the subsets' length until we have
    # a proper endpoint for slice-based subsetting
    if split is not None:
        self._subsets = {}
        self._lengths = {}
        for name, (length, subset) in _make_subsets_from_split(
            population, split
        ).items():
            self._subsets[name] = subset
            self._lengths[name] = length
    else:
        self._subsets = _make_subsets_from_kwargs(train, validation, test, **kwargs)
        self._lengths = {
            name: subset.nrows() for name, subset in self._subsets.items()
        }

    if split is None and not _is_typed_dict(self._subsets, str, [DataFrame, View]):
        raise TypeError(
            "'train', 'validation', 'test' and all other subsets must be either a "
            "getml.DataFrame or a getml.data.View."
        )

    if deep_copy:
        self._population = _deep_copy(self._population, self._id)
        self._peripheral = {
            k: _deep_copy(v, self._id) for (k, v) in self._peripheral.items()
        }
        self._subsets = {
            k: _deep_copy(v, self._id) for (k, v) in self._subsets.items()
        }

    self._last_change = _get_last_change(
        self._population, self._peripheral, self._subsets
    )

    self._frozen_time = None
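The mutual-exclusivity check near the top of `__init__` is an XOR over two booleans: exactly one of `population` or the subset arguments may be given. A minimal standalone sketch of that check (hypothetical helper, not the actual getml code path):

```python
def check_exclusive(population, subsets):
    """Raise unless exactly one of 'population' or 'subsets' is given."""
    # XOR: True only when exactly one side is provided.
    exclusive = (population is not None) ^ (len(subsets) != 0)
    if not exclusive:
        raise ValueError(
            "'population' and subsets are mutually exclusive: "
            "pass either one, but not both."
        )

check_exclusive(population="population_df", subsets={})    # ok: population only
check_exclusive(population=None, subsets={"train": "df"})  # ok: subsets only
```

Passing both (or neither) trips the XOR and raises, which is exactly why the constructor rejects `Container(population=..., train=...)`.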

add

add(*args, **kwargs)

Adds new peripheral data frames or views.

Source code in getml/data/container.py
def add(self, *args, **kwargs):
    """
    Adds new peripheral data frames or views.
    """
    wrong_type = [item for item in args if not isinstance(item, (DataFrame, View))]

    if wrong_type:
        raise TypeError(
            "All unnamed arguments must be getml.DataFrames or getml.data.Views."
        )

    wrong_type = [
        k for (k, v) in kwargs.items() if not isinstance(v, (DataFrame, View))
    ]

    if wrong_type:
        raise TypeError(
            "You must pass getml.DataFrames or getml.data.Views, "
            f"but the following arguments were neither: {wrong_type!r}."
        )

    kwargs = {**{item.name: item for item in args}, **kwargs}

    if self._frozen_time is not None:
        raise ValueError(
            f"You cannot add data frames after the {type(self).__name__} has been frozen."
        )

    if self._deep_copy:
        kwargs = {k: _deep_copy(v, self._id) for (k, v) in kwargs.items()}

    self._peripheral = {**self._peripheral, **kwargs}

    self._last_change = _get_last_change(
        self._population, self._peripheral, self._subsets
    )

freeze

freeze()

Freezes the container, so that changes are no longer possible.

This is required before you can extract data when deep_copy=True. The idea of deep_copy is to ensure that you can always retrace and reproduce your results. That is why the container needs to be immutable before it can be used.
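The guard itself is simple: `freeze` records a timestamp, and mutating methods refuse to run once it is set. A stripped-down sketch of the pattern (illustrative class, not the getml implementation):

```python
from datetime import datetime

class FreezableContainer:
    """Illustrative only: records a freeze time and rejects later mutations."""

    def __init__(self):
        self._frozen_time = None
        self._peripheral = {}

    def add(self, **kwargs):
        # Mirrors the guard in Container.add: no mutations once frozen.
        if self._frozen_time is not None:
            raise ValueError(
                f"You cannot add data frames after the "
                f"{type(self).__name__} has been frozen."
            )
        self._peripheral.update(kwargs)

    def freeze(self):
        self._frozen_time = datetime.now().strftime("%Y-%m-%d %H:%M:%S")

container = FreezableContainer()
container.add(meta="meta_df")
container.freeze()
# container.add(order="order_df")  # would now raise ValueError
```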

Source code in getml/data/container.py
def freeze(self):
    """
    Freezes the container, so that changes are no longer possible.

    This is required before you can extract data when `deep_copy=True`. The idea of
    `deep_copy` is to ensure that you can always retrace and reproduce your results.
    That is why the container needs to be immutable before it can be
    used.
    """
    self.sync()
    self._frozen_time = datetime.now().strftime("%Y-%m-%d %H:%M:%S")

save

save()

Saves the Container to disk.

Source code in getml/data/container.py
def save(self):
    """
    Saves the Container to disk.
    """

    cmd = dict()
    cmd["type_"] = "DataContainer.save"
    cmd["name_"] = self._id

    cmd["container_"] = self._getml_deserialize()

    comm.send(cmd)

sync

sync()

Synchronizes the last change with the data to avoid warnings that the data has been changed.

This is only a problem when deep_copy=False.

Source code in getml/data/container.py
def sync(self):
    """
    Synchronizes the last change with the data to avoid warnings that the data
    has been changed.

    This is only a problem when `deep_copy=False`.
    """
    if self._frozen_time is not None:
        raise ValueError(f"{type(self).__name__} has already been frozen.")
    self._last_change = _get_last_change(
        self._population, self._peripheral, self._subsets
    )
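The bookkeeping behind `sync` can be sketched in plain Python: remember the newest change timestamp seen so far, treat the container as stale when the data moves past it, and let `sync` catch up (hypothetical names and integer timestamps, not the getml internals):

```python
def last_change(tables):
    """Newest modification timestamp across all tracked tables."""
    return max(tables.values())

class ChangeTracker:
    """Illustrative stand-in for the container's last-change bookkeeping."""

    def __init__(self, tables):
        self.tables = tables
        self._last_change = last_change(tables)

    def is_stale(self):
        # The underlying data has changed since we last looked.
        return last_change(self.tables) > self._last_change

    def sync(self):
        # Acknowledge the change so no warning needs to be raised.
        self._last_change = last_change(self.tables)

tracker = ChangeTracker({"trans": 100, "order": 120})
tracker.tables["trans"] = 130   # the underlying data was modified in place
assert tracker.is_stale()
tracker.sync()
assert not tracker.is_stale()
```

With `deep_copy=True` this situation cannot arise, because the container owns its own copies; in-place edits elsewhere never reach them.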

to_pandas

to_pandas() -> Dict[str, DataFrame]

Returns the Container's contents as a dictionary of pandas.DataFrames, mapping each data frame's name to its data converted to a pandas.DataFrame.

Source code in getml/data/container.py
def to_pandas(self) -> Dict[str, pd.DataFrame]:
    """
    Returns the `Container`'s contents as a dictionary of `pandas.DataFrame`s,
    mapping each data frame's name to its data converted to a `pandas.DataFrame`.
    """
    subsets = (
        {name: df.to_pandas() for name, df in self._subsets.items()}
        if self._subsets
        else {}
    )
    peripherals = (
        {name: df.to_pandas() for name, df in self.peripheral.items()}
        if self.peripheral
        else {}
    )
    if subsets or peripherals:
        return {**subsets, **peripherals}

    raise ValueError("Container is empty.")
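Note the merge in the return statement, `{**subsets, **peripherals}`: in Python, later entries win on a key clash, so a peripheral table that shared a name with a subset would shadow it in the result. A plain-dictionary illustration (hypothetical placeholder values):

```python
subsets = {"train": "train_df", "test": "test_df"}
peripherals = {"trans": "trans_df", "train": "peripheral_train_df"}

# Same merge as in to_pandas: later (peripheral) entries win on a name clash.
merged = {**subsets, **peripherals}
```

In practice subset names ("train", "test", ...) and peripheral names rarely collide, but the precedence is worth knowing when they do.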