getml.data.Placeholder

Placeholder(
    name: str,
    roles: Optional[
        Union[Roles, Dict[str, List[str]]]
    ] = None,
)

Abstract representation of tables and their relations.

A placeholder describes the structure of a DataFrame or View and its relations to other tables, but it does not contain any actual data.

You might also want to refer to DataModel.

ATTRIBUTE	DESCRIPTION
`name`	The name used for this placeholder. This name will appear in the generated SQL code. TYPE: `str`
`roles`	The roles of the columns in this placeholder. If you pass a dictionary, the keys must be role names (such as `join_key` or `time_stamp`) and the values must be lists of column names, because the dictionary is unpacked into the `Roles` constructor. If you pass a `Roles` object, it will be used as is. TYPE: `Roles`
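
The constructor source further down this page handles `roles` in three branches: `None`, a dictionary, or a ready-made object. A minimal sketch of that normalization follows. `RolesSketch` is a hypothetical stand-in reduced to two roles, not getML's `Roles` class, and `normalize_roles` is an illustrative helper name.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Union


@dataclass
class RolesSketch:
    """Hypothetical stand-in for getml.data.Roles, reduced to two roles."""

    join_key: List[str] = field(default_factory=list)
    time_stamp: List[str] = field(default_factory=list)


def normalize_roles(
    roles: Optional[Union[RolesSketch, Dict[str, List[str]]]] = None,
) -> RolesSketch:
    # Same three branches as Placeholder.__init__ in the source listing.
    if roles is None:
        return RolesSketch()
    if isinstance(roles, dict):
        # Keys are unpacked as keyword arguments, so they must be role names.
        return RolesSketch(**roles)
    return roles


from_dict = normalize_roles({"join_key": ["join_key"], "time_stamp": ["time_stamp"]})
```

Because the dictionary is unpacked into keyword arguments, a key that is not a role name would raise a `TypeError` in this sketch.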

Example

This example constructs a data model in which the 'population_table' depends on the 'peripheral_table' via the 'join_key' column. In addition, only those rows in 'peripheral_table' for which 'time_stamp' is less than or equal to the 'time_stamp' in 'population_table' are considered:

dm = getml.data.DataModel(
    population_table.to_placeholder("POPULATION")
)

dm.add(peripheral_table.to_placeholder("PERIPHERAL"))

dm.POPULATION.join(
    dm.PERIPHERAL,
    on="join_key",
    time_stamps="time_stamp"
)
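
To make the matching rule concrete, here is a plain-Python sketch of which peripheral rows qualify for one population row. It only illustrates the condition described above (equal join key, peripheral time stamp not later than the population time stamp). The row dictionaries and the `matching_rows` helper are hypothetical and unrelated to how the getML Engine executes joins.

```python
from typing import Dict, List

Row = Dict[str, float]


def matching_rows(population_row: Row, peripheral_rows: List[Row]) -> List[Row]:
    """Peripheral rows that share the join key and are not newer than the population row."""
    return [
        row
        for row in peripheral_rows
        if row["join_key"] == population_row["join_key"]
        and row["time_stamp"] <= population_row["time_stamp"]
    ]


population_row = {"join_key": 1, "time_stamp": 10.0}
peripheral_rows = [
    {"join_key": 1, "time_stamp": 5.0},   # kept: same key, earlier time stamp
    {"join_key": 1, "time_stamp": 10.0},  # kept: same key, equal time stamp
    {"join_key": 1, "time_stamp": 11.0},  # dropped: lies in the future
    {"join_key": 2, "time_stamp": 5.0},   # dropped: different join key
]
matches = matching_rows(population_row, peripheral_rows)
```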

If you want to add more than one peripheral table, you can use the module-level function getml.data.to_placeholder:

dm = getml.data.DataModel(
    population_table.to_placeholder("POPULATION")
)

dm.add(
    getml.data.to_placeholder(
        PERIPHERAL1=peripheral_table_1,
        PERIPHERAL2=peripheral_table_2,
    )
)

If the relationship between two tables is many-to-one or one-to-one, you should state this explicitly:

dm.POPULATION.join(
    dm.PERIPHERAL,
    on="join_key",
    time_stamps="time_stamp",
    relationship=getml.data.relationship.many_to_one,
)

Please also refer to relationship.

If the join keys or time stamps are named differently in the two different tables, use a tuple:

dm.POPULATION.join(
    dm.PERIPHERAL,
    on=("join_key", "other_join_key"),
    time_stamps=("time_stamp", "other_time_stamp"),
)

You can join over more than one join key:

dm.POPULATION.join(
    dm.PERIPHERAL,
    on=["join_key1", "join_key2", ("join_key3", "other_join_key3")],
    time_stamps="time_stamp",
)
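
The three accepted shapes of `on` (a single name, a tuple of two names, or a list mixing both) can all be read as a list of (left key, right key) pairs. The sketch below shows one way to do that conversion. `normalize_on` is a hypothetical helper written for illustration; it is not the library's internal `_handle_on`, and returning an empty list for `None` is an assumption.

```python
from typing import List, Optional, Tuple, Union

OnItem = Union[str, Tuple[str, str]]


def normalize_on(on: Optional[Union[OnItem, List[OnItem]]]) -> List[Tuple[str, str]]:
    """Turns any accepted `on` shape into a list of (left, right) key pairs."""
    if on is None:
        return []  # assumption: no join keys means no pairs
    items = on if isinstance(on, list) else [on]
    # A bare string means the key has the same name in both tables.
    return [(item, item) if isinstance(item, str) else item for item in items]


pairs = normalize_on(["join_key1", "join_key2", ("join_key3", "other_join_key3")])

# Splitting the pairs per table, similar to the zip(*...) step in the join source.
left_keys, right_keys = zip(*pairs)
```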

You can also limit the scope of your joins using memory. This can significantly speed up training time. For instance, if you only want to consider data from the last seven days, you could do something like this:

dm.POPULATION.join(
    dm.PERIPHERAL,
    on="join_key",
    time_stamps="time_stamp",
    memory=getml.data.time.days(7),
)

In some use cases, particularly those involving time series, it might be a good idea to use targets from the past. You can activate this using lagged_targets. But if you do that, you must also define a prediction horizon. For instance, if you want to predict data for the next hour, using data from the last seven days, you could do this:

dm.POPULATION.join(
    dm.PERIPHERAL,
    on="join_key",
    time_stamps="time_stamp",
    lagged_targets=True,
    horizon=getml.data.time.hours(1),
    memory=getml.data.time.days(7),
)

Please also refer to time.
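
As a rough illustration of how `horizon` and `memory` narrow the set of usable peripheral rows, the sketch below checks a single pair of time stamps. It rests on several assumptions that the text above does not confirm: time stamps and durations are expressed in seconds, the upper bound is inclusive, and the lower bound is exclusive. The `hours`, `days`, and `in_window` functions are local stand-ins, not the helpers from `getml.data.time`.

```python
from typing import Optional


def hours(num: float) -> float:
    """Local stand-in: a duration of `num` hours, in seconds (assumed unit)."""
    return num * 3600.0


def days(num: float) -> float:
    """Local stand-in: a duration of `num` days, in seconds (assumed unit)."""
    return hours(num) * 24.0


def in_window(
    population_ts: float,
    peripheral_ts: float,
    horizon: float = 0.0,
    memory: Optional[float] = None,
) -> bool:
    # Upper bound: the peripheral row must be at least `horizon` older.
    if peripheral_ts + horizon > population_ts:
        return False
    # Lower bound: rows older than horizon + memory are 'forgotten'.
    if memory is not None and peripheral_ts + horizon + memory <= population_ts:
        return False
    return True


now = days(30)
recent_enough = in_window(now, now - hours(2), horizon=hours(1), memory=days(7))
too_recent = in_window(now, now - hours(0.5), horizon=hours(1), memory=days(7))
forgotten = in_window(now, now - days(8), horizon=hours(1), memory=days(7))
```

Under these assumptions, only the row that is two hours old falls inside the window; the half-hour-old row violates the horizon and the eight-day-old row exceeds the memory.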

If the join involves many matches, it might be a good idea to set the relationship to propositionalization. This forces the pipeline to always use a propositionalization algorithm for this join, which can significantly speed things up.

dm.POPULATION.join(
    dm.PERIPHERAL,
    on="join_key",
    time_stamps="time_stamp",
    relationship=getml.data.relationship.propositionalization,
)

Please also refer to relationship.

More complicated data models sometimes require more than one placeholder on the same table. In this case, you can do something like this:

dm.add(
    getml.data.to_placeholder(
        PERIPHERAL=[peripheral_table]*2,
    )
)

# We can now access our two placeholders like this:
placeholder1 = dm.PERIPHERAL[0]
placeholder2 = dm.PERIPHERAL[1]

If you want to check out a real-world example where this is necessary, refer to the CORA notebook.

Source code in getml/data/placeholder.py

def __init__(
    self, name: str, roles: Optional[Union[Roles, Dict[str, List[str]]]] = None
):
    self._name = name

    if roles is None:
        self._roles: Roles = Roles()
    elif isinstance(roles, dict):
        self._roles = Roles(**roles)
    else:
        self._roles = roles

    self.joins: List[Join] = []
    self.parent = None

join

join(
    right: Placeholder,
    on: OnType = None,
    time_stamps: TimeStampsType = None,
    relationship: str = many_to_many,
    memory: Optional[float] = None,
    horizon: Optional[float] = None,
    lagged_targets: bool = False,
    upper_time_stamp: Optional[str] = None,
)

Joins another placeholder to this placeholder.

PARAMETER	DESCRIPTION
`right`	The placeholder you would like to join. TYPE: `Placeholder`
`on`	The join keys to use. If none is passed, then everything will be joined to everything else. TYPE: `OnType` DEFAULT: `None`
`time_stamps`	The time stamps used to limit the join. TYPE: `TimeStampsType` DEFAULT: `None`
`relationship`	The relationship between the two tables. Must be from `relationship`. TYPE: `str` DEFAULT: `many_to_many`
`memory`	The difference between the time stamps until data is 'forgotten'. Limiting your joins using memory can significantly speed up training time. Also refer to `time`. TYPE: `Optional[float]` DEFAULT: `None`
`horizon`	The prediction horizon to apply to this join. Also refer to `time`. TYPE: `Optional[float]` DEFAULT: `None`
`lagged_targets`	Whether you want to allow lagged targets. If this is set to True, you must also pass a positive, non-zero horizon. TYPE: `bool` DEFAULT: `False`
`upper_time_stamp`	Name of a time stamp in right that serves as an upper limit on the join. TYPE: `Optional[str]` DEFAULT: `None`
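
Several of these parameters constrain each other: `lagged_targets` needs a non-zero `horizon`, and both `horizon` and `memory` need `time_stamps`. The standalone sketch below restates those checks outside the class so they can be tried without a running Engine. `check_join_args` and `error_message` are hypothetical helpers, and the error texts are paraphrased rather than copied from the library.

```python
from typing import Optional, Tuple


def check_join_args(
    time_stamps: Optional[Tuple[str, str]] = None,
    memory: Optional[float] = None,
    horizon: Optional[float] = None,
    lagged_targets: bool = False,
) -> None:
    """Raises ValueError if the argument combination is inconsistent."""
    if lagged_targets and horizon in (0.0, None):
        raise ValueError("'lagged_targets' requires a positive, non-zero 'horizon'.")
    if horizon not in (0.0, None) and time_stamps is None:
        raise ValueError("'horizon' requires 'time_stamps'.")
    if memory not in (0.0, None) and time_stamps is None:
        raise ValueError("'memory' requires 'time_stamps'.")


def error_message(**kwargs) -> Optional[str]:
    """Returns the validation error for the given arguments, or None if they pass."""
    try:
        check_join_args(**kwargs)
    except ValueError as exc:
        return str(exc)
    return None


ok = error_message(time_stamps=("ts", "ts"), horizon=3600.0, lagged_targets=True)
missing_horizon = error_message(time_stamps=("ts", "ts"), lagged_targets=True)
missing_time_stamps = error_message(memory=604800.0)
```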

Source code in getml/data/placeholder.py

def join(
    self,
    right: "Placeholder",
    on: OnType = None,
    time_stamps: TimeStampsType = None,
    relationship: str = many_to_many,
    memory: Optional[float] = None,
    horizon: Optional[float] = None,
    lagged_targets: bool = False,
    upper_time_stamp: Optional[str] = None,
):
    """
    Joins another placeholder to this placeholder.

    Args:
        right:
            The placeholder you would like to join.

        on:
            The join keys to use. If none is passed, then everything
            will be joined to everything else.

        time_stamps:
            The time stamps used to limit the join.

        relationship:
            The relationship between the two tables. Must be from
            [`relationship`][getml.data.relationship].

        memory:
            The difference between the time stamps until data is 'forgotten'.
            Limiting your joins using memory can significantly speed up
            training time. Also refer to [`time`][getml.data.time].

        horizon:
            The prediction horizon to apply to this join.
            Also refer to [`time`][getml.data.time].

        lagged_targets:
            Whether you want to allow lagged targets. If this is set to True,
            you must also pass a positive, non-zero *horizon*.

        upper_time_stamp:
            Name of a time stamp in *right* that serves as an upper limit
            on the join.
    """

    if not isinstance(right, type(self)):
        msg = (
            "'right' must be a getml.data.Placeholder. "
            + "You can create a placeholder by calling .to_placeholder() "
            + "on DataFrames or Views."
        )
        raise TypeError(msg)

    if self in right.to_list():
        raise ValueError(
            "Circular references to other placeholders are not allowed."
        )

    sanitized_on = _handle_on(on)

    if isinstance(time_stamps, str):
        time_stamps = (time_stamps, time_stamps)

    keys_by_ph = list(zip(*sanitized_on))

    for i, ph in enumerate([self, right]):
        if ph.roles.join_key:
            if ph_keys := keys_by_ph[i]:
                _check_join_key(ph_keys, ph.roles, ph.name)

        if ph.roles.time_stamp and time_stamps:
            if time_stamps[i] not in ph.roles.time_stamp:
                raise ValueError(f"Not a time stamp: {time_stamps[i]}.")

    if lagged_targets and horizon in (0.0, None):
        raise ValueError(
            "Setting 'lagged_targets' to True requires a positive, non-zero "
            "'horizon' to avoid data leakage."
        )

    if horizon not in (0.0, None) and time_stamps is None:
        raise ValueError("Setting 'horizon' requires setting 'time_stamps'.")

    if memory not in (0.0, None) and time_stamps is None:
        raise ValueError(
            "Setting 'memory' (i.e. a relative look-back window) requires "
            "setting 'time_stamps'."
        )

    join = Join(
        right=right,
        on=sanitized_on,
        time_stamps=time_stamps,
        relationship=relationship,
        memory=memory,
        horizon=horizon,
        lagged_targets=lagged_targets,
        upper_time_stamp=upper_time_stamp,
    )

    if any(join == existing for existing in self.joins):
        raise ValueError(
            "A join with the following set of parameters already exists on "
            f"the placeholder {self.name!r}:"
            f"\n\n{join}\n\n"
            "Redundant joins are not allowed."
        )

    self.joins.append(join)
    right.parent = self  # type: ignore

to_list

to_list()

Returns a list of this placeholder and all of its descendants.

Source code in getml/data/placeholder.py

def to_list(self):
    """
    Returns a list of this placeholder and all of its descendants.
    """
    return [self] + [ph for join in self.joins for ph in join.right.to_list()]
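
The traversal is a depth-first walk over the join tree, and `join` relies on it to reject cycles. The sketch below reproduces both ideas with a hypothetical `Node` class that stands in for a placeholder. It keeps only a name, a list of joined nodes, and a parent, and it leaves out all other join parameters.

```python
from typing import List, Optional


class Node:
    """Hypothetical stand-in for a placeholder: a name plus the nodes joined to it."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.joined: List["Node"] = []
        self.parent: Optional["Node"] = None

    def to_list(self) -> List["Node"]:
        # Depth-first: this node first, then each joined subtree in join order.
        return [self] + [node for child in self.joined for node in child.to_list()]

    def join(self, right: "Node") -> None:
        # Reject the join if this node already appears at or below `right`.
        if self in right.to_list():
            raise ValueError("Circular references are not allowed.")
        self.joined.append(right)
        right.parent = self


population = Node("POPULATION")
peripheral = Node("PERIPHERAL")
sub = Node("SUB")
population.join(peripheral)
peripheral.join(sub)

names = [node.name for node in population.to_list()]

try:
    sub.join(population)  # would close the loop SUB -> POPULATION -> ... -> SUB
    cycle_rejected = False
except ValueError:
    cycle_rejected = True
```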

to_dict

to_dict()

Expresses this placeholder and all of its descendants as a dictionary.

Source code in getml/data/placeholder.py

def to_dict(self):
    """
    Expresses this placeholder and all of its descendants as a dictionary.
    """
    phs = {}
    for ph in self.to_list():
        key = ph.name
        if ph.children:
            i = 2
            while key in phs:
                key = f"{ph.name}{i}"
                i += 1
        phs[key] = ph
    return phs