getml.data

Contains functionalities for importing, handling, and retrieving data from the getML Engine.

All data relevant for the getML Suite has to be present in the getML Engine. The Python API itself does not store any of the data used for training or prediction. Instead, it provides a handler class, the DataFrame, for the data frame objects in the getML Engine. Using either this handler for the whole data set or the individual columns it is composed of, you can import data into the Engine, retrieve data from it, and perform operations on it. In addition to the data frame objects, the Engine also uses an abstract, lightweight version of the underlying data model, which is represented by the Placeholder.

In general, working with data within the getML Suite is organized in three steps:

Importing the data into the getML Engine.
Annotating the data by assigning roles to the individual columns.
Constructing the data model by deriving Placeholder objects from the data and joining them to represent the data schema.

Example

Creating a new data frame object in the getML Engine and importing data is done by one of the class methods from_csv, from_db, from_json, or from_pandas.

In this example, we load data directly from a public database on the internet. First, however, we have to connect the getML Engine to the database (see MySQL interface in the user guide for further details).

getml.database.connect_mysql(
    host="db.relational-data.org",
    dbname="financial",
    port=3306,
    user="guest",
    password="relational",
    time_formats=['%Y/%m/%d']
)

Using the established connection, we can tell the Engine to construct a new data frame object called df_loan, fill it with the data of the loan table contained in the MySQL database, and return a DataFrame handler associated with it.

loan = getml.DataFrame.from_db('loan', 'df_loan')

print(loan)

| loan_id      | account_id   | amount       | duration     | date          | payments      | status        |
| unused float | unused float | unused float | unused float | unused string | unused string | unused string |
-------------------------------------------------------------------------------------------------------------
| 4959         | 2            | 80952        | 24           | 1994-01-05    | 3373.00       | A             |
| 4961         | 19           | 30276        | 12           | 1996-04-29    | 2523.00       | B             |
| 4962         | 25           | 30276        | 12           | 1997-12-08    | 2523.00       | A             |
| 4967         | 37           | 318480       | 60           | 1998-10-14    | 5308.00       | D             |
| 4968         | 38           | 110736       | 48           | 1998-04-19    | 2307.00       | C             |

In order to construct the data model and for the feature learning algorithm to get the most out of your data, you have to assign roles to columns using the set_role method (see Annotating data for details).

(For demonstration purposes, we assign payments the target role. In reality, you would want to forecast the defaulting behaviour, which is encoded in the status column. See the loans notebook.)

loan.set_role(["duration", "amount"], getml.data.roles.numerical)
loan.set_role(["loan_id", "account_id"], getml.data.roles.join_key)
loan.set_role("date", getml.data.roles.time_stamp)
loan.set_role(["payments"], getml.data.roles.target)

print(loan)

| date                        | loan_id  | account_id | payments  | duration  | amount    | status        |
| time stamp                  | join key | join key   | target    | numerical | numerical | unused string |
-----------------------------------------------------------------------------------------------------------
| 1994-01-05T00:00:00.000000Z | 4959     | 2          | 3373      | 24        | 80952     | A             |
| 1996-04-29T00:00:00.000000Z | 4961     | 19         | 2523      | 12        | 30276     | B             |
| 1997-12-08T00:00:00.000000Z | 4962     | 25         | 2523      | 12        | 30276     | A             |
| 1998-10-14T00:00:00.000000Z | 4967     | 37         | 5308      | 60        | 318480    | D             |
| 1998-04-19T00:00:00.000000Z | 4968     | 38         | 2307      | 48        | 110736    | C             |

Finally, we can construct the data model by deriving a Placeholder from each DataFrame and establishing relations between them using the join method.

# But, first, we need a second data set to build a data model.
trans = getml.DataFrame.from_db(
    'trans', 'df_trans',
    roles = {getml.data.roles.numerical: ["amount", "balance"],
             getml.data.roles.categorical: ["type", "bank", "k_symbol",
                                            "account", "operation"],
             getml.data.roles.join_key: ["account_id"],
             getml.data.roles.time_stamp: ["date"]
    }
)

ph_loan = loan.to_placeholder()
ph_trans = trans.to_placeholder()

ph_loan.join(ph_trans, on="account_id",
            time_stamps="date")

The data model contained in ph_loan can now be used to construct a Pipeline.

arange

arange(
    start: Union[Real, float] = 0.0,
    stop: Optional[Union[Real, float]] = None,
    step: Union[Real, float] = 1.0,
)

Returns evenly spaced values within a given interval.

PARAMETER	DESCRIPTION
`start`	The beginning of the interval. Defaults to 0. TYPE: `Union[Real, float]` DEFAULT: `0.0`
`stop`	The end of the interval. TYPE: `Optional[Union[Real, float]]` DEFAULT: `None`
`step`	The step taken. Defaults to 1. TYPE: `Union[Real, float]` DEFAULT: `1.0`
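
The argument handling can be sketched in plain Python. The snippet is a minimal illustration that does not talk to the getML Engine; `normalize_arange_args` is a hypothetical helper and not part of the getML API. It mirrors the behavior visible in the source listing for this function: when only one value is passed, that value is treated as `stop` and the interval starts at 0.

```python
from typing import Optional, Tuple


def normalize_arange_args(
    start: float = 0.0,
    stop: Optional[float] = None,
    step: float = 1.0,
) -> Tuple[float, float, float]:
    """Hypothetical helper mirroring how arange interprets its arguments."""
    if stop is None:
        # A single argument is read as the end of the interval.
        start, stop = 0.0, start
    return float(start), float(stop), float(step)


single = normalize_arange_args(5)
explicit = normalize_arange_args(2, 10, 0.5)
print(single)
print(explicit)
```

In other words, a call with a single number behaves like a call with `start=0.0` and that number as `stop`.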

Source code in getml/data/columns/columns.py

def arange(
    start: Union[numbers.Real, float] = 0.0,
    stop: Optional[Union[numbers.Real, float]] = None,
    step: Union[numbers.Real, float] = 1.0,
):
    """
    Returns evenly spaced variables, within a given interval.

    Args:
        start:
            The beginning of the interval. Defaults to 0.

        stop:
            The end of the interval.

        step:
            The step taken. Defaults to 1.
    """
    if stop is None:
        stop = start
        start = 0.0

    if step is None:
        step = 1.0

    if not isinstance(start, numbers.Real):
        raise TypeError("'start' must be a real number")

    if not isinstance(stop, numbers.Real):
        raise TypeError("'stop' must be a real number")

    if not isinstance(step, numbers.Real):
        raise TypeError("'step' must be a real number")

    col = FloatColumnView(
        operator="arange",
        operand1=None,
        operand2=None,
    )

    col.cmd["start_"] = float(start)
    col.cmd["stop_"] = float(stop)
    col.cmd["step_"] = float(step)

    return col

rowid

rowid() -> FloatColumnView

Get the row numbers of the table.

RETURNS	DESCRIPTION
`FloatColumnView`	(numerical) column containing the row id, starting with 0

Source code in getml/data/columns/columns.py

def rowid() -> FloatColumnView:
    """
    Get the row numbers of the table.

    Returns:
            (numerical) column containing the row id, starting with 0
    """
    return FloatColumnView(operator="rowid", operand1=None, operand2=None)

list_data_frames

list_data_frames() -> Dict[str, List[str]]

Lists all available data frames of the project.

RETURNS	DESCRIPTION
`dict`	Dict containing lists of strings representing the names of the data frame objects: 'in_memory' lists those held in memory (RAM); 'on_disk' lists those stored on disk. TYPE: `Dict[str, List[str]]`

Example

d, _ = getml.datasets.make_numerical()
getml.data.list_data_frames()
d.save()
getml.data.list_data_frames()

Source code in getml/data/helpers.py

def list_data_frames() -> Dict[str, List[str]]:
    """Lists all available data frames of the project.

    Returns:
        dict:
            Dict containing lists of strings representing the names of
            the data frames objects

            - 'in_memory'
                held in memory (RAM).
            - 'on_disk'
                stored on disk.

    ??? example
        ```python
        d, _ = getml.datasets.make_numerical()
        getml.data.list_data_frames()
        d.save()
        getml.data.list_data_frames()
        ```

    """

    cmd: Dict[str, Any] = {}
    cmd["type_"] = "list_data_frames"
    cmd["name_"] = ""

    with comm.send_and_get_socket(cmd) as sock:
        msg = comm.recv_string(sock)
        if msg != "Success!":
            comm.handle_engine_exception(msg)
        json_str = comm.recv_string(sock)

    return json.loads(json_str)

delete

delete(name: str)

If a data frame named 'name' exists, it is deleted.

PARAMETER	DESCRIPTION
`name`	Name of the data frame. TYPE: `str`

Source code in getml/data/helpers2.py

def delete(name: str):
    """
    If a data frame named 'name' exists, it is deleted.

    Args:
        name:
            Name of the data frame.
    """

    if not isinstance(name, str):
        raise TypeError("'name' must be of type str")

    if exists(name):
        DataFrame(name).delete()

exists

exists(name: str)

Returns true if a data frame named 'name' exists.

PARAMETER	DESCRIPTION
`name`	Name of the data frame. TYPE: `str`
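
The check itself is a membership test over the two lists returned by list_data_frames. The following sketch reproduces that logic on a hand-written dictionary so it can run without a getML Engine; the data frame names and the helper `exists_in` are assumptions made for illustration.

```python
from typing import Dict, List

# Hypothetical listing in the shape returned by list_data_frames().
listing: Dict[str, List[str]] = {
    "in_memory": ["df_loan", "df_trans"],
    "on_disk": ["df_archive"],
}


def exists_in(listing: Dict[str, List[str]], name: str) -> bool:
    """Return True if `name` appears in memory or on disk."""
    if not isinstance(name, str):
        raise TypeError("'name' must be of type str")
    return name in (listing["in_memory"] + listing["on_disk"])


print(exists_in(listing, "df_loan"))
print(exists_in(listing, "df_archive"))
print(exists_in(listing, "df_missing"))
```

A data frame therefore counts as existing whether it is currently loaded or only saved on disk.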

Source code in getml/data/helpers2.py

def exists(name: str):
    """
    Returns true if a data frame named 'name' exists.

    Args:
        name:
            Name of the data frame.
    """
    if not isinstance(name, str):
        raise TypeError("'name' must be of type str")

    all_df = list_data_frames()

    return name in (all_df["in_memory"] + all_df["on_disk"])

load_data_frame

load_data_frame(name: str) -> DataFrame

Retrieves a DataFrame handler of data in the getML Engine.

A data frame object can be loaded regardless of whether it is held in memory. It only has to be present in the current project and thus listed in the output of list_data_frames.

PARAMETER	DESCRIPTION
`name`	Name of the data frame. TYPE: `str`

RETURNS	DESCRIPTION
`DataFrame`	Handler for the underlying data frame in the getML Engine.

Example

d, _ = getml.datasets.make_numerical(population_name = 'test')
d2 = getml.data.load_data_frame('test')
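
How the function decides between refreshing and loading can be sketched without the Engine. The helper `resolve_action` and the listing are hypothetical; they only imitate the order of the checks in the source listing for this function: memory first, then disk, otherwise an error.

```python
from typing import Dict, List


def resolve_action(listing: Dict[str, List[str]], name: str) -> str:
    """Return which step would be taken for `name` (illustrative only)."""
    if name in listing["in_memory"]:
        return "refresh"  # handler is synchronized with the in-memory object
    if name in listing["on_disk"]:
        return "load"  # object is read from disk into memory first
    raise ValueError("No data frame holding the name '" + name + "' found.")


# Hypothetical listing in the shape returned by list_data_frames().
listing = {"in_memory": ["test"], "on_disk": ["test", "old_run"]}

action_memory = resolve_action(listing, "test")
action_disk = resolve_action(listing, "old_run")
print(action_memory, action_disk)
```

A name that is present both in memory and on disk resolves to the in-memory object, because that check comes first.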

Source code in getml/data/helpers2.py

def load_data_frame(name: str) -> DataFrame:
    """Retrieves a [`DataFrame`][getml.DataFrame] handler of data in the
    getML Engine.

    A data frame object can be loaded regardless if it is held in
    memory or not. It only has to be present in the current project
    and thus listed in the output of
    [`list_data_frames`][getml.data.list_data_frames].

    Args:
        name:
            Name of the data frame.

    Returns:
            Handle the underlying data frame in the getML Engine.

    ??? example
        ```python
        d, _ = getml.datasets.make_numerical(population_name = 'test')
        d2 = getml.data.load_data_frame('test')
        ```
    """

    if not isinstance(name, str):
        raise TypeError("'name' must be of type str")

    data_frames_available = list_data_frames()

    if name in data_frames_available["in_memory"]:
        return DataFrame(name).refresh()

    if name in data_frames_available["on_disk"]:
        return DataFrame(name).load()

    raise ValueError(
        "No data frame holding the name '" + name + "' present on the getML Engine."
    )

make_target_columns

make_target_columns(
    base: Union[DataFrame, View], colname: str
) -> View

Returns a view containing binary target columns.

getML expects binary target columns for classification problems. This helper function allows you to split up a column into such binary target columns.

PARAMETER	DESCRIPTION
`base`	The original view or data frame. `base` will remain unaffected by this function; instead, you will get a view with the appropriate changes. TYPE: `Union[DataFrame, View]`
`colname`	The column you would like to split. A column named `colname` should appear on `base`. TYPE: `str`

RETURNS	DESCRIPTION
`View`	A view containing binary target columns.
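
The splitting amounts to a one-vs-rest encoding: one binary column per distinct value, named `colname=label`. The sketch below shows the idea on a plain Python list rather than on a getML column; `split_into_binary_targets` is a hypothetical helper, and sorting the labels is an assumption made only to keep the output deterministic.

```python
from typing import Dict, List, Sequence


def split_into_binary_targets(
    values: Sequence[str], colname: str
) -> Dict[str, List[float]]:
    """Build one 0/1 column per distinct value found in `values`."""
    targets: Dict[str, List[float]] = {}
    for label in sorted(set(values)):
        name = colname + "=" + label
        targets[name] = [1.0 if value == label else 0.0 for value in values]
    return targets


status = ["A", "B", "A", "D", "C"]
targets = split_into_binary_targets(status, "status")
for name, column in targets.items():
    print(name, column)
```

Each row has exactly one binary column set to 1, and the original column is no longer needed afterwards, which matches the final `drop` in the source listing.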

Source code in getml/data/helpers2.py

def make_target_columns(base: Union[DataFrame, View], colname: str) -> View:
    """
    Returns a view containing binary target columns.

    getML expects binary target columns for classification problems. This
    helper function allows you to split up a column into such binary
    target columns.

    Args:
        base:
            The original view or data frame. `base` will remain unaffected
            by this function, instead you will get a view with the appropriate
            changes.

        colname: The column you would like to split. A column named
            `colname` should appear on `base`.

    Returns:
        A view containing binary target columns.
    """
    if not isinstance(
        base[colname], (FloatColumn, FloatColumnView, StringColumn, StringColumnView)
    ):
        raise TypeError(
            "'"
            + colname
            + "' must be a FloatColumn, a FloatColumnView, "
            + "a StringColumn or a StringColumnView."
        )

    unique_values = base[colname].unique()

    if len(unique_values) > 10:
        logger.warning(
            "You are splitting the column into more than 10 target "
            + "columns. This might take a long time to fit."
        )

    view = base

    for label in unique_values:
        col = (base[colname] == label).as_num()
        name = colname + "=" + label
        view = view.with_column(col=col, name=name, role=target)

    return view.drop(colname)

to_placeholder

to_placeholder(
    *args: Union[
        DataFrame, View, List[Union[DataFrame, View]]
    ],
    **kwargs: Union[
        DataFrame, View, List[Union[DataFrame, View]]
    ]
) -> List[Placeholder]

Factory function for extracting placeholders from a DataFrame or View.

PARAMETER	DESCRIPTION
`args`	The data frames or views you would like to convert to placeholders. TYPE: `Union[DataFrame, View, List[Union[DataFrame, View]]]` DEFAULT: `()`
`kwargs`	The data frames or views you would like to convert to placeholders, passed as keyword arguments. Each keyword is used as the name of the resulting placeholder. TYPE: `Union[DataFrame, View, List[Union[DataFrame, View]]]` DEFAULT: `{}`

RETURNS	DESCRIPTION
`List[Placeholder]`	A list of placeholders.

Example

Suppose we wanted to create a DataModel:

dm = getml.data.DataModel(
    population_train.to_placeholder("population")
)

# Add placeholders for the peripheral tables.
dm.add(meta.to_placeholder("meta"))
dm.add(order.to_placeholder("order"))
dm.add(trans.to_placeholder("trans"))

But this is a bit repetitive. So instead, we can do the following:

dm = getml.data.DataModel(
    population_train.to_placeholder("population")
)

# Add placeholders for the peripheral tables.
dm.add(getml.data.to_placeholder(
    meta=meta, order=order, trans=trans))

Source code in getml/data/helpers2.py

def to_placeholder(
    *args: Union[DataFrame, View, List[Union[DataFrame, View]]],
    **kwargs: Union[DataFrame, View, List[Union[DataFrame, View]]],
) -> List[Placeholder]:
    """
    Factory function for extracting placeholders from a
    [`DataFrame`][getml.DataFrame] or [`View`][getml.data.View].

    Args:
        args:
            The data frames or views you would like to convert to placeholders.

        kwargs:
            The data frames or views you would like to convert to placeholders.

    Returns:
        A list of placeholders.

    ??? example
        Suppose we wanted to create a [`DataModel`][getml.data.DataModel]:



            dm = getml.data.DataModel(
                population_train.to_placeholder("population")
            )

            # Add placeholders for the peripheral tables.
            dm.add(meta.to_placeholder("meta"))
            dm.add(order.to_placeholder("order"))
            dm.add(trans.to_placeholder("trans"))

        But this is a bit repetitive. So instead, we can do
        the following:
        ```python
        dm = getml.data.DataModel(
            population_train.to_placeholder("population")
        )

        # Add placeholders for the peripheral tables.
        dm.add(getml.data.to_placeholder(
            meta=meta, order=order, trans=trans))
        ```
    """

    def to_ph_list(list_or_elem, key=None):
        as_list = list_or_elem if isinstance(list_or_elem, list) else [list_or_elem]
        return [elem.to_placeholder(key) for elem in as_list]

    return [elem for item in args for elem in to_ph_list(item)] + [
        elem for (k, v) in kwargs.items() for elem in to_ph_list(v, k)
    ]

load_container

load_container(container_id: str) -> Container

Loads a container and all associated data frames from disk.

PARAMETER	DESCRIPTION
`container_id`	The id of the container you would like to load. TYPE: `str`

RETURNS	DESCRIPTION
`Container`	The container with the given id.

Source code in getml/data/load_container.py

def load_container(container_id: str) -> Container:
    """
    Loads a container and all associated data frames from disk.

    Args:
        container_id:
            The id of the container you would like to load.

    Returns:
        The container with the given id.
    """

    cmd: Dict[str, Any] = {}
    cmd["type_"] = "DataContainer.load"
    cmd["name_"] = container_id

    with comm.send_and_get_socket(cmd) as sock:
        msg = comm.recv_string(sock)
        if msg != "Success!":
            comm.handle_engine_exception(msg)
        json_str = comm.recv_string(sock)

    cmd = json.loads(json_str)

    population = _load_view(cmd["population_"]) if "population_" in cmd else None

    peripheral = {k: _load_view(v) for (k, v) in cmd["peripheral_"].items()}

    subsets = {k: _load_view(v) for (k, v) in cmd["subsets_"].items()}

    split = _parse(cmd["split_"]) if "split_" in cmd else None

    deep_copy = cmd["deep_copy_"]
    frozen_time = cmd["frozen_time_"] if "frozen_time_" in cmd else None
    last_change = cmd["last_change_"]

    container = Container(
        population=population, peripheral=peripheral, deep_copy=deep_copy, **subsets
    )

    container._id = container_id
    container._frozen_time = frozen_time
    container._split = split
    container._last_change = last_change

    return container

concat

concat(
    name: str, data_frames: List[Union[DataFrame, View]]
)

Creates a new data frame by concatenating a list of existing ones.

PARAMETER	DESCRIPTION
`name`	Name of the new data frame. TYPE: `str`
`data_frames`	The data frames to concatenate. Must be non-empty. However, it can contain only one data frame. Column names and roles must match. Columns will be appended by name, not order. TYPE: `List[Union[DataFrame, View]]`

Examples:

new_df = getml.data.concat("NEW_DF_NAME", [df1, df2])
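
That columns are appended by name, not by order, is easiest to see on a small example. The sketch below models tables as plain dictionaries of lists and does not use the getML Engine; `concat_by_name` is a hypothetical helper, and the exception types it raises are assumptions of the sketch.

```python
from typing import Dict, List

Table = Dict[str, List[float]]


def concat_by_name(tables: List[Table]) -> Table:
    """Stack rows of several tables, matching columns by name."""
    if not tables:
        raise ValueError("'tables' must be non-empty")
    columns = list(tables[0])
    for table in tables[1:]:
        if set(table) != set(columns):
            raise ValueError("column names must match")
    return {col: [v for table in tables for v in table[col]] for col in columns}


df1 = {"a": [1.0, 2.0], "b": [10.0, 20.0]}
df2 = {"b": [30.0], "a": [3.0]}  # same columns, different order

combined = concat_by_name([df1, df2])
print(combined)
```

Even though `df2` lists its columns in a different order, the values end up under the matching column names.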

Source code in getml/data/concat.py

def concat(name: str, data_frames: List[Union[DataFrame, View]]):
    """
    Creates a new data frame by concatenating a list of existing ones.

    Args:
        name:
            Name of the new column.

        data_frames:
            The data frames to concatenate.
            Must be non-empty. However, it can contain only one data frame.
            Column names and roles must match.
            Columns will be appended by name, not order.

    Examples:
        ```python
        new_df = data.concat("NEW_DF_NAME", [df1, df2])
        ```
    """

    if not isinstance(name, str):
        raise TypeError("'name' must be a string.")

    if not _is_non_empty_typed_list(data_frames, (View, DataFrame)):
        raise TypeError(
            "'data_frames' must be a non-empty list of getml.data.Views "
            + "or getml.DataFrames."
        )

    cmd: Dict[str, Any] = {}

    cmd["type_"] = "DataFrame.concat"
    cmd["name_"] = name

    cmd["data_frames_"] = [df._getml_deserialize() for df in data_frames]

    comm.send(cmd)

    return DataFrame(name=name).refresh()

random

random(seed: int = 5849) -> FloatColumnView

Create random column.

The numbers will be uniformly distributed from 0.0 to 1.0. This can be used to randomly split a population table into a training and a test set.

PARAMETER	DESCRIPTION
`seed`	Seed used for the random number generator. TYPE: `int` DEFAULT: `5849`

RETURNS	DESCRIPTION
`FloatColumnView`	FloatColumnView containing random numbers

Example

population = getml.DataFrame('population')
population.add(numpy.zeros(100), 'column_01')

idx = getml.data.random(seed=42)
population_train = population[idx > 0.7]
population_test = population[idx <= 0.7]
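
The same splitting idea can be illustrated with Python's standard library. This sketch does not use the getML Engine and will not reproduce the numbers the Engine generates for a given seed; the variable names are hypothetical. It shows that a seeded uniform column yields a reproducible, disjoint partition of the rows.

```python
import random as stdlib_random

n_rows = 100

# One uniform number in [0.0, 1.0) per row, reproducible through the seed.
rng = stdlib_random.Random(42)
idx = [rng.random() for _ in range(n_rows)]

# Same thresholds as in the example above.
train_rows = [i for i, u in enumerate(idx) if u > 0.7]
test_rows = [i for i, u in enumerate(idx) if u <= 0.7]

# Re-using the seed gives the same column again.
rng_again = stdlib_random.Random(42)
idx_again = [rng_again.random() for _ in range(n_rows)]

print(len(train_rows), len(test_rows))
```

Because the numbers are uniform, the condition `idx > 0.7` selects roughly 30% of the rows, so swap the comparisons if the larger share should go to the training set.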

Source code in getml/data/columns/random.py

def random(seed: int = 5849) -> FloatColumnView:
    """
    Create random column.

    The numbers will be uniformly distributed from 0.0 to 1.0. This can be
    used to randomly split a population table into a training and a test
    set

    Args:
        seed:
            Seed used for the random number generator.

    Returns:
            FloatColumn containing random numbers

    ??? example
        ```python
        population = getml.DataFrame('population')
        population.add(numpy.zeros(100), 'column_01')

        idx = random(seed=42)
        population_train = population[idx > 0.7]
        population_test = population[idx <= 0.7]
        ```
    """

    if not isinstance(seed, numbers.Real):
        raise TypeError("'seed' must be a real number")

    col = FloatColumnView(operator="random", operand1=None, operand2=None)
    col.cmd["seed_"] = seed
    return col

OnType `module-attribute`

OnType = Optional[
    Union[
        str,
        Tuple[str, str],
        List[Union[str, Tuple[str, str]]],
    ]
]

Types that can be passed to the 'on' argument of the 'join' method.

TimeStampsType `module-attribute`

TimeStampsType = Optional[Union[str, Tuple[str, str]]]

Types of time stamps used in joins.
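
The accepted shapes can be made concrete with a small normalizer. The aliases below repeat the definitions given on this page; `normalize_on` is a hypothetical helper, and reading a single string as "the same column name in both tables" is an assumption of this sketch rather than something stated here.

```python
from typing import List, Optional, Tuple, Union

OnType = Optional[
    Union[
        str,
        Tuple[str, str],
        List[Union[str, Tuple[str, str]]],
    ]
]

TimeStampsType = Optional[Union[str, Tuple[str, str]]]


def normalize_on(on: OnType) -> List[Tuple[str, str]]:
    """Expand every accepted form into a list of (left, right) pairs."""
    if on is None:
        return []
    items = on if isinstance(on, list) else [on]
    return [(item, item) if isinstance(item, str) else item for item in items]


print(normalize_on("account_id"))
print(normalize_on(("account_id", "acc_id")))
print(normalize_on(["account_id", ("date", "trans_date")]))
```

A TimeStampsType value follows the same pattern, except that a list of several entries is not part of the alias.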
