getml.data.Placeholder

Placeholder(
    name: str,
    roles: Optional[
        Union[Roles, Dict[str, List[str]]]
    ] = None,
)

Abstract representation of tables and their relations.

This class is an abstract representation of the DataFrame or View. However, it does not contain any actual data.

You might also want to refer to DataModel.

ATTRIBUTE DESCRIPTION
name

The name used for this placeholder. This name will appear in the generated SQL code.

TYPE: str

roles

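The roles of the columns in this placeholder. If you pass a dictionary, the keys must be role names (such as "join_key" or "time_stamp") and the values must be lists of column names, because the dictionary is unpacked into the Roles constructor. If you pass a Roles object, it will be used as is.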

TYPE: Roles

Example

This example constructs a data model in which the 'population_table' depends on the 'peripheral_table' via the 'join_key' column. In addition, only those rows in 'peripheral_table' for which 'time_stamp' is less than or equal to the 'time_stamp' in 'population_table' are considered:

dm = getml.data.DataModel(
    population_table.to_placeholder("POPULATION")
)

dm.add(peripheral_table.to_placeholder("PERIPHERAL"))

dm.POPULATION.join(
    dm.PERIPHERAL,
    on="join_key",
    time_stamps="time_stamp"
)
If you want to add more than one peripheral table, you can use to_placeholder:
dm = getml.data.DataModel(
    population_table.to_placeholder("POPULATION")
)

dm.add(
    getml.data.to_placeholder(
        PERIPHERAL1=peripheral_table_1,
        PERIPHERAL2=peripheral_table_2,
    )
)
If the relationship between two tables is many-to-one or one-to-one, you should state this explicitly:
dm.POPULATION.join(
    dm.PERIPHERAL,
    on="join_key",
    time_stamps="time_stamp",
    relationship=getml.data.relationship.many_to_one,
)
Please also refer to relationship.

If the join keys or time stamps are named differently in the two different tables, use a tuple:

dm.POPULATION.join(
    dm.PERIPHERAL,
    on=("join_key", "other_join_key"),
    time_stamps=("time_stamp", "other_time_stamp"),
)
You can join over more than one join key:
dm.POPULATION.join(
    dm.PERIPHERAL,
    on=["join_key1", "join_key2", ("join_key3", "other_join_key3")],
    time_stamps="time_stamp",
)
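The different shapes accepted by on can be easier to follow once they are reduced to one common form. The following is a minimal sketch, assuming that every entry ends up as a (left key, right key) pair; normalize_on is a hypothetical helper written for illustration and is not the library's internal implementation:

```python
from typing import List, Tuple, Union

OnItem = Union[str, Tuple[str, str]]


def normalize_on(on: Union[OnItem, List[OnItem]]) -> List[Tuple[str, str]]:
    """Hypothetical helper: expand `on` into (left_key, right_key) pairs."""
    # A single string or tuple is treated like a one-element list.
    items = on if isinstance(on, list) else [on]
    # A bare string means the key has the same name in both tables.
    return [(item, item) if isinstance(item, str) else tuple(item) for item in items]


pairs = normalize_on(["join_key1", "join_key2", ("join_key3", "other_join_key3")])

# Split the pairs into the keys of the left and the right table.
left_keys, right_keys = zip(*pairs)
print(left_keys)
print(right_keys)
```

Splitting the pairs with zip mirrors how the join method groups keys by placeholder before checking them against each placeholder's roles.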
You can also limit the scope of your joins using memory. This can significantly speed up training time. For instance, if you only want to consider data from the last seven days, you could do something like this:
dm.POPULATION.join(
    dm.PERIPHERAL,
    on="join_key",
    time_stamps="time_stamp",
    memory=getml.data.time.days(7),
)
In some use cases, particularly those involving time series, it might be a good idea to use targets from the past. You can activate this using lagged_targets. But if you do that, you must also define a prediction horizon. For instance, if you want to predict data for the next hour, using data from the last seven days, you could do this:
dm.POPULATION.join(
    dm.PERIPHERAL,
    on="join_key",
    time_stamps="time_stamp",
    lagged_targets=True,
    horizon=getml.data.time.hours(1),
    memory=getml.data.time.days(7),
)
Please also refer to time.
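To build some intuition for how memory and horizon interact, here is a small sketch of one plausible reading: the peripheral time stamp is shifted forward by the horizon, must not lie after the population time stamp, and must not be older than the memory allows. The helpers days, hours and is_visible are stand-ins defined here for illustration, seconds are an assumed unit, and the exact boundary conditions used by the engine are not stated in this section:

```python
from typing import Optional


def hours(n: float) -> float:
    """Stand-in helper: hours expressed in seconds (assumed unit)."""
    return n * 60.0 * 60.0


def days(n: float) -> float:
    """Stand-in helper: days expressed in seconds (assumed unit)."""
    return hours(n) * 24.0


def is_visible(
    t_population: float,
    t_peripheral: float,
    horizon: float = 0.0,
    memory: Optional[float] = None,
) -> bool:
    """Assumed rule: is a peripheral row visible to a population row?"""
    shifted = t_peripheral + horizon
    # Rows that are too recent (inside the horizon) are excluded.
    if shifted > t_population:
        return False
    # Rows older than the memory are 'forgotten'.
    if memory:
        return shifted + memory > t_population
    return True


now = days(10)
window = dict(horizon=hours(1), memory=days(7))

recent_enough = is_visible(now, now - hours(2), **window)
inside_horizon = is_visible(now, now - hours(0.5), **window)
forgotten = is_visible(now, days(2), **window)
print(recent_enough, inside_horizon, forgotten)
```

Under these assumptions, only the row that is at least one hour old and less than seven days older than the shifted cut-off is used.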

If the join involves many matches, it might be a good idea to set the relationship to propositionalization. This forces the pipeline to always use a propositionalization algorithm for this join, which can significantly speed things up.

dm.POPULATION.join(
    dm.PERIPHERAL,
    on="join_key",
    time_stamps="time_stamp",
    relationship=getml.data.relationship.propositionalization,
)
Please also refer to relationship.

Some more complicated data models require more than one placeholder on the same table. In this case, you can do something like this:

dm.add(
    getml.data.to_placeholder(
        PERIPHERAL=[peripheral_table]*2,
    )
)

# We can now access our two placeholders like this:
placeholder1 = dm.PERIPHERAL[0]
placeholder2 = dm.PERIPHERAL[1]
If you want to check out a real-world example where this is necessary, refer to the CORA notebook.

Source code in getml/data/placeholder.py
def __init__(
    self, name: str, roles: Optional[Union[Roles, Dict[str, List[str]]]] = None
):
    self._name = name

    if roles is None:
        self._roles: Roles = Roles()
    elif isinstance(roles, dict):
        self._roles = Roles(**roles)
    else:
        self._roles = roles

    self.joins: List[Join] = []
    self.parent = None
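The three branches of the constructor can be tried out in isolation. The sketch below uses a simplified stand-in for Roles with only three fields (the real class is assumed to have more) and a hypothetical normalize_roles function that mirrors the branching shown above:

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Union


@dataclass
class Roles:
    """Simplified stand-in for getml.data.Roles (only three roles shown)."""

    join_key: List[str] = field(default_factory=list)
    time_stamp: List[str] = field(default_factory=list)
    target: List[str] = field(default_factory=list)


def normalize_roles(
    roles: Optional[Union[Roles, Dict[str, List[str]]]] = None,
) -> Roles:
    """Hypothetical function mirroring the constructor's three branches."""
    if roles is None:
        return Roles()  # no roles given: start with empty roles
    if isinstance(roles, dict):
        return Roles(**roles)  # dictionary keys become keyword arguments
    return roles  # already a Roles object: used as is


from_dict = normalize_roles(
    {"join_key": ["join_key"], "time_stamp": ["time_stamp"]}
)
print(from_dict.join_key)  # ['join_key']
print(from_dict.target)  # []
```

Because the dictionary is unpacked into keyword arguments, a key that is not a valid role name raises a TypeError in this sketch.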

join

join(
    right: Placeholder,
    on: OnType = None,
    time_stamps: TimeStampsType = None,
    relationship: str = many_to_many,
    memory: Optional[float] = None,
    horizon: Optional[float] = None,
    lagged_targets: bool = False,
    upper_time_stamp: Optional[str] = None,
)

Joins another placeholder to this placeholder.

PARAMETER DESCRIPTION
right

The placeholder you would like to join.

TYPE: Placeholder

on

The join keys to use. If none is passed, then everything will be joined to everything else.

TYPE: OnType DEFAULT: None

time_stamps

The time stamps used to limit the join.

TYPE: TimeStampsType DEFAULT: None

relationship

The relationship between the two tables. Must be from relationship.

TYPE: str DEFAULT: many_to_many

memory

The difference between the time stamps until data is 'forgotten'. Limiting your joins using memory can significantly speed up training time. Also refer to time.

TYPE: Optional[float] DEFAULT: None

horizon

The prediction horizon to apply to this join. Also refer to time.

TYPE: Optional[float] DEFAULT: None

lagged_targets

Whether you want to allow lagged targets. If this is set to True, you must also pass a positive, non-zero horizon.

TYPE: bool DEFAULT: False

upper_time_stamp

Name of a time stamp in right that serves as an upper limit on the join.

TYPE: Optional[str] DEFAULT: None

Source code in getml/data/placeholder.py
def join(
    self,
    right: "Placeholder",
    on: OnType = None,
    time_stamps: TimeStampsType = None,
    relationship: str = many_to_many,
    memory: Optional[float] = None,
    horizon: Optional[float] = None,
    lagged_targets: bool = False,
    upper_time_stamp: Optional[str] = None,
):
    """
    Joins another placeholder to this placeholder.

    Args:
        right:
            The placeholder you would like to join.

        on:
            The join keys to use. If none is passed, then everything
            will be joined to everything else.

        time_stamps:
            The time stamps used to limit the join.

        relationship:
            The relationship between the two tables. Must be from
            [`relationship`][getml.data.relationship].

        memory:
            The difference between the time stamps until data is 'forgotten'.
            Limiting your joins using memory can significantly speed up
            training time. Also refer to [`time`][getml.data.time].

        horizon:
            The prediction horizon to apply to this join.
            Also refer to [`time`][getml.data.time].

        lagged_targets:
            Whether you want to allow lagged targets. If this is set to True,
            you must also pass a positive, non-zero *horizon*.

        upper_time_stamp:
            Name of a time stamp in *right* that serves as an upper limit
            on the join.
    """

    if not isinstance(right, type(self)):
        msg = (
            "'right' must be a getml.data.Placeholder. "
            + "You can create a placeholder by calling .to_placeholder() "
            + "on DataFrames or Views."
        )
        raise TypeError(msg)

    if self in right.to_list():
        raise ValueError(
            "Circular references to other placeholders are not allowed."
        )

    sanitized_on = _handle_on(on)

    if isinstance(time_stamps, str):
        time_stamps = (time_stamps, time_stamps)

    keys_by_ph = list(zip(*sanitized_on))

    for i, ph in enumerate([self, right]):
        if ph.roles.join_key:
            if ph_keys := keys_by_ph[i]:
                _check_join_key(ph_keys, ph.roles, ph.name)

        if ph.roles.time_stamp and time_stamps:
            if time_stamps[i] not in ph.roles.time_stamp:
                raise ValueError(f"Not a time stamp: {time_stamps[i]}.")

    if lagged_targets and horizon in (0.0, None):
        raise ValueError(
            "Setting 'lagged_targets' to True requires a positive, non-zero "
            "'horizon' to avoid data leakage."
        )

    if horizon not in (0.0, None) and time_stamps is None:
        raise ValueError("Setting 'horizon' requires setting 'time_stamps'.")

    if memory not in (0.0, None) and time_stamps is None:
        raise ValueError(
            "Setting 'memory' (i.e. a relative look-back window) requires "
            "setting 'time_stamps'."
        )

    join = Join(
        right=right,
        on=sanitized_on,
        time_stamps=time_stamps,
        relationship=relationship,
        memory=memory,
        horizon=horizon,
        lagged_targets=lagged_targets,
        upper_time_stamp=upper_time_stamp,
    )

    if any(join == existing for existing in self.joins):
        raise ValueError(
            "A join with the following set of parameters already exists on "
            f"the placeholder {self.name!r}:"
            f"\n\n{join}\n\n"
            "Redundant joins are not allowed."
        )

    self.joins.append(join)
    right.parent = self  # type: ignore
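The rules that tie lagged_targets, horizon and memory to time_stamps can be summarized in a standalone function. This is a minimal sketch that restates the three checks from the listing above; check_time_arguments is a hypothetical name and the error messages are shortened for illustration:

```python
from typing import Optional, Tuple


def check_time_arguments(
    time_stamps: Optional[Tuple[str, str]],
    memory: Optional[float] = None,
    horizon: Optional[float] = None,
    lagged_targets: bool = False,
) -> None:
    """Hypothetical function restating the time-related checks in join."""
    # Lagged targets without a horizon would leak the value to be predicted.
    if lagged_targets and horizon in (0.0, None):
        raise ValueError("lagged_targets=True requires a non-zero horizon.")
    # Both horizon and memory are defined relative to the time stamps.
    if horizon not in (0.0, None) and time_stamps is None:
        raise ValueError("horizon requires time_stamps.")
    if memory not in (0.0, None) and time_stamps is None:
        raise ValueError("memory requires time_stamps.")


# A complete specification passes silently.
check_time_arguments(
    time_stamps=("time_stamp", "time_stamp"),
    memory=604800.0,
    horizon=3600.0,
    lagged_targets=True,
)

# Forgetting the horizon is rejected.
try:
    check_time_arguments(("time_stamp", "time_stamp"), lagged_targets=True)
except ValueError as err:
    print(err)  # lagged_targets=True requires a non-zero horizon.
```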

to_list

to_list()

Returns a list of this placeholder and all of its descendants.

Source code in getml/data/placeholder.py
def to_list(self):
    """
    Returns a list of this placeholder and all of its descendants.
    """
    return [self] + [ph for join in self.joins for ph in join.right.to_list()]
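The order in which descendants appear follows from the recursion: each placeholder comes first, followed by the complete subtree of each of its joins in turn. A minimal sketch with a stand-in Node class (not the real Placeholder, which stores Join objects rather than child nodes directly) shows the resulting depth-first order:

```python
from typing import List


class Node:
    """Minimal stand-in for a placeholder: a name plus joined children."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.children: List["Node"] = []

    def join(self, right: "Node") -> None:
        self.children.append(right)

    def to_list(self) -> List["Node"]:
        # This node first, then each child's subtree in join order.
        return [self] + [n for child in self.children for n in child.to_list()]


population = Node("POPULATION")
orders = Node("ORDERS")
items = Node("ITEMS")
customers = Node("CUSTOMERS")

population.join(orders)
orders.join(items)
population.join(customers)

names = [node.name for node in population.to_list()]
print(names)  # ['POPULATION', 'ORDERS', 'ITEMS', 'CUSTOMERS']
```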

to_dict

to_dict()

Expresses this placeholder and all of its descendants as a dictionary.

Source code in getml/data/placeholder.py
def to_dict(self):
    """
    Expresses this placeholder and all of its descendants as a dictionary.
    """
    phs = {}
    for ph in self.to_list():
        key = ph.name
        if ph.children:
            i = 2
            while key in phs:
                key = f"{ph.name}{i}"
                i += 1
        phs[key] = ph
    return phs
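The naming rule in the listing above has a detail that is easy to miss: a numeric suffix is only added for placeholders that have children, while a childless placeholder with a repeated name replaces the earlier entry. The sketch below reproduces that logic with a stand-in Node class, assuming that children refers to the placeholders joined to a node:

```python
from typing import Dict, List


class Node:
    """Minimal stand-in for a placeholder: a name plus joined children."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.children: List["Node"] = []

    def to_list(self) -> List["Node"]:
        return [self] + [n for child in self.children for n in child.to_list()]

    def to_dict(self) -> Dict[str, "Node"]:
        result: Dict[str, "Node"] = {}
        for node in self.to_list():
            key = node.name
            # Only nodes with children get a disambiguating suffix.
            if node.children:
                i = 2
                while key in result:
                    key = f"{node.name}{i}"
                    i += 1
            result[key] = node
        return result


root = Node("POPULATION")
first, second = Node("PERIPHERAL"), Node("PERIPHERAL")
leaf1, leaf2 = Node("LEAF"), Node("LEAF")
first.children.append(leaf1)
second.children.append(leaf2)
root.children.extend([first, second])

as_dict = root.to_dict()
print(list(as_dict))  # ['POPULATION', 'PERIPHERAL', 'LEAF', 'PERIPHERAL2']
```

In this sketch the entry under "LEAF" refers to the second leaf, because the later childless node overwrites the earlier one.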