getml.data

Contains functionalities for importing, handling, and retrieving data from the getML Engine.

All data relevant for the getML Suite has to be present in the getML Engine. The Python API itself does not store any of the data used for training or prediction. Instead, it provides a handler class, the DataFrame, for the data frame objects in the getML Engine. Using either this handler for the whole data set or the individual columns it is composed of, you can import data into the Engine, retrieve data from it, and perform operations on it. In addition to the data frame objects, the Engine also uses an abstract and lightweight version of the underlying data model, which is represented by the Placeholder.

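In general, working with data within the getML Suite is organized in three steps: importing the data into the getML Engine, annotating it by assigning roles to the columns, and constructing the data model from placeholders.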

Example

Creating a new data frame object in the getML Engine and importing data is done by one of the class methods from_csv, from_db, from_json, or from_pandas.

In this example, we load the data directly from a public database on the internet. First, however, we have to connect the getML Engine to the database (see MySQL interface in the user guide for further details).

getml.database.connect_mysql(
    host="db.relational-data.org",
    dbname="financial",
    port=3306,
    user="guest",
    password="relational",
    time_formats=['%Y/%m/%d']
)

Using the established connection, we can tell the Engine to construct a new data frame object called df_loan, fill it with the data of the loan table contained in the MySQL database, and return a DataFrame handler associated with it.

loan = getml.DataFrame.from_db('loan', 'df_loan')

print(loan)
| loan_id      | account_id   | amount       | duration     | date          | payments      | status        |
| unused float | unused float | unused float | unused float | unused string | unused string | unused string |
-------------------------------------------------------------------------------------------------------------
| 4959         | 2            | 80952        | 24           | 1994-01-05    | 3373.00       | A             |
| 4961         | 19           | 30276        | 12           | 1996-04-29    | 2523.00       | B             |
| 4962         | 25           | 30276        | 12           | 1997-12-08    | 2523.00       | A             |
| 4967         | 37           | 318480       | 60           | 1998-10-14    | 5308.00       | D             |
| 4968         | 38           | 110736       | 48           | 1998-04-19    | 2307.00       | C             |
In order to construct the data model and for the feature learning algorithm to get the most out of your data, you have to assign roles to columns using the set_role method (see Annotating data for details).

(For demonstration purposes, we assign payments the target role. In reality, you would want to forecast the defaulting behaviour, which is encoded in the status column. See the loans notebook.)

loan.set_role(["duration", "amount"], getml.data.roles.numerical)
loan.set_role(["loan_id", "account_id"], getml.data.roles.join_key)
loan.set_role("date", getml.data.roles.time_stamp)
loan.set_role(["payments"], getml.data.roles.target)

print(loan)
| date                        | loan_id  | account_id | payments  | duration  | amount    | status        |
| time stamp                  | join key | join key   | target    | numerical | numerical | unused string |
-----------------------------------------------------------------------------------------------------------
| 1994-01-05T00:00:00.000000Z | 4959     | 2          | 3373      | 24        | 80952     | A             |
| 1996-04-29T00:00:00.000000Z | 4961     | 19         | 2523      | 12        | 30276     | B             |
| 1997-12-08T00:00:00.000000Z | 4962     | 25         | 2523      | 12        | 30276     | A             |
| 1998-10-14T00:00:00.000000Z | 4967     | 37         | 5308      | 60        | 318480    | D             |
| 1998-04-19T00:00:00.000000Z | 4968     | 38         | 2307      | 48        | 110736    | C             |
Finally, we are able to construct the data model by deriving a Placeholder from each DataFrame and establishing relations between them using the join method.

# But, first, we need a second data set to build a data model.
trans = getml.DataFrame.from_db(
    'trans', 'df_trans',
    roles = {getml.data.roles.numerical: ["amount", "balance"],
             getml.data.roles.categorical: ["type", "bank", "k_symbol",
                                            "account", "operation"],
             getml.data.roles.join_key: ["account_id"],
             getml.data.roles.time_stamp: ["date"]
    }
)

ph_loan = loan.to_placeholder()
ph_trans = trans.to_placeholder()

ph_loan.join(ph_trans, on="account_id",
            time_stamps="date")

The data model contained in ph_loan can now be used to construct a Pipeline.

arange

arange(
    start: Union[Real, float] = 0.0,
    stop: Optional[Union[Real, float]] = None,
    step: Union[Real, float] = 1.0,
)

Returns evenly spaced values within a given interval.

PARAMETER DESCRIPTION
start

The beginning of the interval. Defaults to 0.

TYPE: Union[Real, float] DEFAULT: 0.0

stop

The end of the interval. If None, the value passed as start is used as the end and the interval begins at 0.

TYPE: Optional[Union[Real, float]] DEFAULT: None

step

The step taken. Defaults to 1.

TYPE: Union[Real, float] DEFAULT: 1.0

Source code in getml/data/columns/columns.py
def arange(
    start: Union[numbers.Real, float] = 0.0,
    stop: Optional[Union[numbers.Real, float]] = None,
    step: Union[numbers.Real, float] = 1.0,
):
    """
    Returns evenly spaced variables, within a given interval.

    Args:
        start:
            The beginning of the interval. Defaults to 0.

        stop:
            The end of the interval.

        step:
            The step taken. Defaults to 1.
    """
    if stop is None:
        stop = start
        start = 0.0

    if step is None:
        step = 1.0

    if not isinstance(start, numbers.Real):
        raise TypeError("'start' must be a real number")

    if not isinstance(stop, numbers.Real):
        raise TypeError("'stop' must be a real number")

    if not isinstance(step, numbers.Real):
        raise TypeError("'step' must be a real number")

    col = FloatColumnView(
        operator="arange",
        operand1=None,
        operand2=None,
    )

    col.cmd["start_"] = float(start)
    col.cmd["stop_"] = float(stop)
    col.cmd["step_"] = float(step)

    return col
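
The argument handling in the source above means that a single positional argument is treated as the end of the interval, much like Python's built-in range. The following is a minimal pure-Python sketch of that normalisation only. The helper name `normalize_arange_args` is hypothetical and not part of getML; it neither contacts the Engine nor produces the column values.

```python
import numbers
from typing import Optional, Tuple


def normalize_arange_args(
    start: float = 0.0,
    stop: Optional[float] = None,
    step: Optional[float] = 1.0,
) -> Tuple[float, float, float]:
    """Mirror the start/stop/step handling of the arange source shown above."""
    # With no explicit stop, the first argument is taken as the end of the interval.
    if stop is None:
        stop = start
        start = 0.0

    if step is None:
        step = 1.0

    for label, value in (("start", start), ("stop", stop), ("step", step)):
        if not isinstance(value, numbers.Real):
            raise TypeError(f"'{label}' must be a real number")

    return float(start), float(stop), float(step)


print(normalize_arange_args(5))         # (0.0, 5.0, 1.0)
print(normalize_arange_args(2, 10, 2))  # (2.0, 10.0, 2.0)
```

The returned triple corresponds to the start_, stop_ and step_ entries that the real function stores in the column command.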

rowid

rowid() -> FloatColumnView

Get the row numbers of the table.

RETURNS DESCRIPTION
FloatColumnView

(numerical) column containing the row id, starting with 0

Source code in getml/data/columns/columns.py
def rowid() -> FloatColumnView:
    """
    Get the row numbers of the table.

    Returns:
            (numerical) column containing the row id, starting with 0
    """
    return FloatColumnView(operator="rowid", operand1=None, operand2=None)

list_data_frames

list_data_frames() -> Dict[str, List[str]]

Lists all available data frames of the project.

RETURNS DESCRIPTION
dict

Dict containing lists of strings representing the names of the data frame objects

  • 'in_memory' held in memory (RAM).
  • 'on_disk' stored on disk.

TYPE: Dict[str, List[str]]

Example
d, _ = getml.datasets.make_numerical()
getml.data.list_data_frames()
d.save()
getml.data.list_data_frames()
Source code in getml/data/helpers.py
def list_data_frames() -> Dict[str, List[str]]:
    """Lists all available data frames of the project.

    Returns:
        dict:
            Dict containing lists of strings representing the names of
            the data frames objects

            - 'in_memory'
                held in memory (RAM).
            - 'on_disk'
                stored on disk.

    ??? example
        ```python
        d, _ = getml.datasets.make_numerical()
        getml.data.list_data_frames()
        d.save()
        getml.data.list_data_frames()
        ```

    """

    cmd: Dict[str, Any] = {}
    cmd["type_"] = "list_data_frames"
    cmd["name_"] = ""

    with comm.send_and_get_socket(cmd) as sock:
        msg = comm.recv_string(sock)
        if msg != "Success!":
            comm.handle_engine_exception(msg)
        json_str = comm.recv_string(sock)

    return json.loads(json_str)

delete

delete(name: str)

If a data frame named 'name' exists, it is deleted.

PARAMETER DESCRIPTION
name

Name of the data frame.

TYPE: str

Source code in getml/data/helpers2.py
def delete(name: str):
    """
    If a data frame named 'name' exists, it is deleted.

    Args:
        name:
            Name of the data frame.
    """

    if not isinstance(name, str):
        raise TypeError("'name' must be of type str")

    if exists(name):
        DataFrame(name).delete()

exists

exists(name: str)

Returns true if a data frame named 'name' exists.

PARAMETER DESCRIPTION
name

Name of the data frame.

TYPE: str

Source code in getml/data/helpers2.py
def exists(name: str):
    """
    Returns true if a data frame named 'name' exists.

    Args:
        name:
            Name of the data frame.
    """
    if not isinstance(name, str):
        raise TypeError("'name' must be of type str")

    all_df = list_data_frames()

    return name in (all_df["in_memory"] + all_df["on_disk"])
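
As the source above shows, exists only checks the name against both lists returned by list_data_frames. Below is a minimal sketch of that lookup on a hand-written dictionary, so that it runs without the Engine. The helper `name_is_listed` and the data frame names are made up for illustration.

```python
from typing import Dict, List


def name_is_listed(name: str, listing: Dict[str, List[str]]) -> bool:
    """Return True if `name` appears under 'in_memory' or 'on_disk'."""
    if not isinstance(name, str):
        raise TypeError("'name' must be of type str")
    return name in (listing["in_memory"] + listing["on_disk"])


# Hypothetical listing in the shape returned by list_data_frames().
listing = {"in_memory": ["df_loan"], "on_disk": ["df_loan", "df_trans"]}

print(name_is_listed("df_trans", listing))  # True
print(name_is_listed("df_order", listing))  # False
```

A data frame therefore counts as existing whether it is currently held in memory, saved to disk, or both.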

load_data_frame

load_data_frame(name: str) -> DataFrame

Retrieves a DataFrame handler of data in the getML Engine.

A data frame object can be loaded regardless of whether it is held in memory or not. It only has to be present in the current project and thus be listed in the output of list_data_frames.

PARAMETER DESCRIPTION
name

Name of the data frame.

TYPE: str

RETURNS DESCRIPTION
DataFrame

Handler for the underlying data frame in the getML Engine.

Example
d, _ = getml.datasets.make_numerical(population_name = 'test')
d2 = getml.data.load_data_frame('test')
Source code in getml/data/helpers2.py
def load_data_frame(name: str) -> DataFrame:
    """Retrieves a [`DataFrame`][getml.DataFrame] handler of data in the
    getML Engine.

    A data frame object can be loaded regardless if it is held in
    memory or not. It only has to be present in the current project
    and thus listed in the output of
    [`list_data_frames`][getml.data.list_data_frames].

    Args:
        name:
            Name of the data frame.

    Returns:
            Handler for the underlying data frame in the getML Engine.

    ??? example
        ```python
        d, _ = getml.datasets.make_numerical(population_name = 'test')
        d2 = getml.data.load_data_frame('test')
        ```
    """

    if not isinstance(name, str):
        raise TypeError("'name' must be of type str")

    data_frames_available = list_data_frames()

    if name in data_frames_available["in_memory"]:
        return DataFrame(name).refresh()

    if name in data_frames_available["on_disk"]:
        return DataFrame(name).load()

    raise ValueError(
        "No data frame holding the name '" + name + "' present on the getML Engine."
    )
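
The dispatch in the source above can be summarised as: refresh if the data frame is in memory, load it if it is only on disk, and raise otherwise. The sketch below reproduces just that decision on a plain dictionary. `choose_load_action` and the data frame names are hypothetical and serve illustration only.

```python
from typing import Dict, List


def choose_load_action(name: str, listing: Dict[str, List[str]]) -> str:
    """Return which DataFrame method load_data_frame would call for `name`."""
    if not isinstance(name, str):
        raise TypeError("'name' must be of type str")
    if name in listing["in_memory"]:
        return "refresh"  # corresponds to DataFrame(name).refresh()
    if name in listing["on_disk"]:
        return "load"  # corresponds to DataFrame(name).load()
    raise ValueError(
        "No data frame holding the name '" + name + "' present on the getML Engine."
    )


# Hypothetical listing in the shape returned by list_data_frames().
listing = {"in_memory": ["df_loan"], "on_disk": ["df_loan", "df_trans"]}

print(choose_load_action("df_loan", listing))   # refresh
print(choose_load_action("df_trans", listing))  # load
```

Note that the in-memory check comes first, so a data frame that is both in memory and on disk is refreshed rather than reloaded from disk.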

make_target_columns

make_target_columns(
    base: Union[DataFrame, View], colname: str
) -> View

Returns a view containing binary target columns.

getML expects binary target columns for classification problems. This helper function allows you to split up a column into such binary target columns.

PARAMETER DESCRIPTION
base

The original view or data frame. base remains unaffected by this function; instead, you get a view with the appropriate changes.

TYPE: Union[DataFrame, View]

colname

The column you would like to split. A column named colname must exist in base.

TYPE: str

RETURNS DESCRIPTION
View

A view containing binary target columns.

Source code in getml/data/helpers2.py
def make_target_columns(base: Union[DataFrame, View], colname: str) -> View:
    """
    Returns a view containing binary target columns.

    getML expects binary target columns for classification problems. This
    helper function allows you to split up a column into such binary
    target columns.

    Args:
        base:
            The original view or data frame. `base` will remain unaffected
            by this function, instead you will get a view with the appropriate
            changes.

        colname: The column you would like to split. A column named
            `colname` should appear on `base`.

    Returns:
        A view containing binary target columns.
    """
    if not isinstance(
        base[colname], (FloatColumn, FloatColumnView, StringColumn, StringColumnView)
    ):
        raise TypeError(
            "'"
            + colname
            + "' must be a FloatColumn, a FloatColumnView, "
            + "a StringColumn or a StringColumnView."
        )

    unique_values = base[colname].unique()

    if len(unique_values) > 10:
        logger.warning(
            "You are splitting the column into more than 10 target "
            + "columns. This might take a long time to fit."
        )

    view = base

    for label in unique_values:
        col = (base[colname] == label).as_num()
        name = colname + "=" + str(label)  # str() so that numeric labels can be concatenated
        view = view.with_column(col=col, name=name, role=target)

    return view.drop(colname)
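
To make the splitting concrete, here is a minimal pure-Python sketch of the same idea: one 0/1 column per distinct label, named colname=label. It works on a plain list instead of an Engine column. `split_into_binary_targets` is a hypothetical helper, and sorting the labels is an assumption made only to get reproducible output.

```python
from typing import Dict, List


def split_into_binary_targets(colname: str, values: List[str]) -> Dict[str, List[float]]:
    """Build one 0/1 column per distinct label, named '<colname>=<label>'."""
    targets: Dict[str, List[float]] = {}
    # sorted() is an assumption for reproducible output; the source does not
    # state the order in which unique() returns the labels.
    for label in sorted(set(values)):
        targets[colname + "=" + str(label)] = [
            1.0 if value == label else 0.0 for value in values
        ]
    return targets


status = ["A", "B", "A", "D", "C"]  # the status values from the loan example
for name, column in split_into_binary_targets("status", status).items():
    print(name, column)
```

Each resulting column can serve as a binary classification target; the original multi-class column is dropped from the view, as in the last line of the source.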

to_placeholder

to_placeholder(
    *args: Union[
        DataFrame, View, List[Union[DataFrame, View]]
    ],
    **kwargs: Union[
        DataFrame, View, List[Union[DataFrame, View]]
    ]
) -> List[Placeholder]

Factory function for extracting placeholders from a DataFrame or View.

PARAMETER DESCRIPTION
args

The data frames or views you would like to convert to placeholders.

TYPE: Union[DataFrame, View, List[Union[DataFrame, View]]] DEFAULT: ()

kwargs

The data frames or views you would like to convert to placeholders.

TYPE: Union[DataFrame, View, List[Union[DataFrame, View]]] DEFAULT: {}

RETURNS DESCRIPTION
List[Placeholder]

A list of placeholders.

Example

Suppose we wanted to create a DataModel:

dm = getml.data.DataModel(
    population_train.to_placeholder("population")
)

# Add placeholders for the peripheral tables.
dm.add(meta.to_placeholder("meta"))
dm.add(order.to_placeholder("order"))
dm.add(trans.to_placeholder("trans"))

But this is a bit repetitive. So instead, we can do the following:

dm = getml.data.DataModel(
    population_train.to_placeholder("population")
)

# Add placeholders for the peripheral tables.
dm.add(getml.data.to_placeholder(
    meta=meta, order=order, trans=trans))

Source code in getml/data/helpers2.py
def to_placeholder(
    *args: Union[DataFrame, View, List[Union[DataFrame, View]]],
    **kwargs: Union[DataFrame, View, List[Union[DataFrame, View]]],
) -> List[Placeholder]:
    """
    Factory function for extracting placeholders from a
    [`DataFrame`][getml.DataFrame] or [`View`][getml.data.View].

    Args:
        args:
            The data frames or views you would like to convert to placeholders.

        kwargs:
            The data frames or views you would like to convert to placeholders.

    Returns:
        A list of placeholders.

    ??? example
        Suppose we wanted to create a [`DataModel`][getml.data.DataModel]:



            dm = getml.data.DataModel(
                population_train.to_placeholder("population")
            )

            # Add placeholders for the peripheral tables.
            dm.add(meta.to_placeholder("meta"))
            dm.add(order.to_placeholder("order"))
            dm.add(trans.to_placeholder("trans"))

        But this is a bit repetitive. So instead, we can do
        the following:
        ```python
        dm = getml.data.DataModel(
            population_train.to_placeholder("population")
        )

        # Add placeholders for the peripheral tables.
        dm.add(getml.data.to_placeholder(
            meta=meta, order=order, trans=trans))
        ```
    """

    def to_ph_list(list_or_elem, key=None):
        as_list = list_or_elem if isinstance(list_or_elem, list) else [list_or_elem]
        return [elem.to_placeholder(key) for elem in as_list]

    return [elem for item in args for elem in to_ph_list(item)] + [
        elem for (k, v) in kwargs.items() for elem in to_ph_list(v, k)
    ]
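
The flattening logic above accepts single objects as well as lists, both positionally and by keyword. The sketch below runs the same two comprehensions against a stand-in class so that the behaviour can be tried without the Engine. `FakeFrame` is hypothetical, and its fallback to the object's own name when no name is passed is an assumption about how DataFrame.to_placeholder behaves.

```python
from typing import List, Optional


class FakeFrame:
    """Hypothetical stand-in for a DataFrame or View; not a getML class."""

    def __init__(self, default_name: str) -> None:
        self.default_name = default_name

    def to_placeholder(self, name: Optional[str] = None) -> str:
        # A real placeholder is an object; a string keeps the sketch simple.
        return "Placeholder(" + (name or self.default_name) + ")"


def to_placeholder(*args, **kwargs) -> List[str]:
    def to_ph_list(list_or_elem, key=None):
        as_list = list_or_elem if isinstance(list_or_elem, list) else [list_or_elem]
        return [elem.to_placeholder(key) for elem in as_list]

    # Positional arguments come first, then keyword arguments in call order.
    return [elem for item in args for elem in to_ph_list(item)] + [
        elem for (k, v) in kwargs.items() for elem in to_ph_list(v, k)
    ]


meta, order, trans1, trans2 = (FakeFrame(n) for n in ("m", "o", "t1", "t2"))

print(to_placeholder(meta, order=order, trans=[trans1, trans2]))
```

With keyword arguments, the keyword becomes the placeholder name, and a list passed under one keyword yields several placeholders that share that name.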

load_container

load_container(container_id: str) -> Container

Loads a container and all associated data frames from disk.

PARAMETER DESCRIPTION
container_id

The id of the container you would like to load.

TYPE: str

RETURNS DESCRIPTION
Container

The container with the given id.

Source code in getml/data/load_container.py
def load_container(container_id: str) -> Container:
    """
    Loads a container and all associated data frames from disk.

    Args:
        container_id:
            The id of the container you would like to load.

    Returns:
        The container with the given id.
    """

    cmd: Dict[str, Any] = {}
    cmd["type_"] = "DataContainer.load"
    cmd["name_"] = container_id

    with comm.send_and_get_socket(cmd) as sock:
        msg = comm.recv_string(sock)
        if msg != "Success!":
            comm.handle_engine_exception(msg)
        json_str = comm.recv_string(sock)

    cmd = json.loads(json_str)

    population = _load_view(cmd["population_"]) if "population_" in cmd else None

    peripheral = {k: _load_view(v) for (k, v) in cmd["peripheral_"].items()}

    subsets = {k: _load_view(v) for (k, v) in cmd["subsets_"].items()}

    split = _parse(cmd["split_"]) if "split_" in cmd else None

    deep_copy = cmd["deep_copy_"]
    frozen_time = cmd["frozen_time_"] if "frozen_time_" in cmd else None
    last_change = cmd["last_change_"]

    container = Container(
        population=population, peripheral=peripheral, deep_copy=deep_copy, **subsets
    )

    container._id = container_id
    container._frozen_time = frozen_time
    container._split = split
    container._last_change = last_change

    return container

concat

concat(
    name: str, data_frames: List[Union[DataFrame, View]]
)

Creates a new data frame by concatenating a list of existing ones.

PARAMETER DESCRIPTION
name

Name of the new data frame.

TYPE: str

data_frames

The data frames to concatenate. The list must be non-empty, but it may contain just a single data frame. Column names and roles must match. Columns are matched by name, not by order.

TYPE: List[Union[DataFrame, View]]

Examples:

new_df = getml.data.concat("NEW_DF_NAME", [df1, df2])
Source code in getml/data/concat.py
def concat(name: str, data_frames: List[Union[DataFrame, View]]):
    """
    Creates a new data frame by concatenating a list of existing ones.

    Args:
        name:
            Name of the new data frame.

        data_frames:
            The data frames to concatenate.
            Must be non-empty. However, it can contain only one data frame.
            Column names and roles must match.
            Columns will be appended by name, not order.

    Examples:
        ```python
        new_df = getml.data.concat("NEW_DF_NAME", [df1, df2])
        ```
    """

    if not isinstance(name, str):
        raise TypeError("'name' must be a string.")

    if not _is_non_empty_typed_list(data_frames, (View, DataFrame)):
        raise TypeError(
            "'data_frames' must be a non-empty list of getml.data.Views "
            + "or getml.DataFrames."
        )

    cmd: Dict[str, Any] = {}

    cmd["type_"] = "DataFrame.concat"
    cmd["name_"] = name

    cmd["data_frames_"] = [df._getml_deserialize() for df in data_frames]

    comm.send(cmd)

    return DataFrame(name=name).refresh()
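
Matching columns by name rather than by position can be illustrated without the Engine. The following sketch stacks plain dictionaries of lists; `concat_by_name` and the `Table` alias are hypothetical stand-ins. Roles are not modelled, and keeping the column order of the first table is an assumption.

```python
from typing import Dict, List

Table = Dict[str, List[object]]  # hypothetical stand-in: column name -> values


def concat_by_name(tables: List[Table]) -> Table:
    """Stack tables vertically, matching columns by name rather than position."""
    if not isinstance(tables, list) or not tables:
        raise TypeError("'tables' must be a non-empty list")
    names = set(tables[0])
    for table in tables[1:]:
        if set(table) != names:
            raise ValueError("Column names must match")
    # Follow the column order of the first table; values are looked up by name.
    return {name: [v for table in tables for v in table[name]] for name in tables[0]}


df1 = {"amount": [80952, 30276], "status": ["A", "B"]}
df2 = {"status": ["D"], "amount": [318480]}  # same columns, different order

print(concat_by_name([df1, df2]))
# {'amount': [80952, 30276, 318480], 'status': ['A', 'B', 'D']}
```

A list with a single table is accepted and simply yields a copy of that table, which matches the statement that the input may contain only one data frame.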

random

random(seed: int = 5849) -> FloatColumnView

Create random column.

The numbers will be uniformly distributed from 0.0 to 1.0. This can be used to randomly split a population table into a training and a test set.

PARAMETER DESCRIPTION
seed

Seed used for the random number generator.

TYPE: int DEFAULT: 5849

RETURNS DESCRIPTION
FloatColumnView

FloatColumnView containing random numbers

Example
import numpy
import getml

population = getml.DataFrame('population')
population.add(numpy.zeros(100), 'column_01')

idx = getml.data.random(seed=42)
population_train = population[idx > 0.7]
population_test = population[idx <= 0.7]
Source code in getml/data/columns/random.py
def random(seed: int = 5849) -> FloatColumnView:
    """
    Create random column.

    The numbers will be uniformly distributed from 0.0 to 1.0. This can be
    used to randomly split a population table into a training and a test
    set

    Args:
        seed:
            Seed used for the random number generator.

    Returns:
            FloatColumn containing random numbers

    ??? example
        ```python
        population = getml.DataFrame('population')
        population.add(numpy.zeros(100), 'column_01')

        idx = random(seed=42)
        population_train = population[idx > 0.7]
        population_test = population[idx <= 0.7]
        ```
    """

    if not isinstance(seed, numbers.Real):
        raise TypeError("'seed' must be a real number")

    col = FloatColumnView(operator="random", operand1=None, operand2=None)
    col.cmd["seed_"] = seed
    return col
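
The splitting idea behind a uniform random column can be tried out in plain Python. The sketch below uses Python's own generator, so the draws will differ from those of the Engine for the same seed. `split_rows` is a hypothetical helper, and which side of the threshold becomes the training set is a free choice.

```python
import random as py_random
from typing import List, Tuple


def split_rows(
    n_rows: int, threshold: float = 0.7, seed: int = 5849
) -> Tuple[List[int], List[int]]:
    """Assign each row index to train or test by comparing a uniform draw to `threshold`."""
    rng = py_random.Random(seed)
    draws = [rng.random() for _ in range(n_rows)]  # uniform in [0.0, 1.0)
    train = [i for i, u in enumerate(draws) if u <= threshold]
    test = [i for i, u in enumerate(draws) if u > threshold]
    return train, test


train, test = split_rows(1000)
print(len(train), len(test))  # roughly 700 and 300; exact counts depend on the generator
```

Because the draws depend only on the seed, repeating the call with the same seed reproduces the same split, which is the reason for passing a fixed seed.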

OnType module-attribute

OnType = Optional[
    Union[
        str,
        Tuple[str, str],
        List[Union[str, Tuple[str, str]]],
    ]
]

Types that can be passed to the 'on' argument of the 'join' method.

TimeStampsType module-attribute

TimeStampsType = Optional[Union[str, Tuple[str, str]]]

Types of time stamps used in joins.
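
To show which values fit these two aliases, the sketch below restates them and adds small runtime checks. The checker functions and the column names are hypothetical. That a tuple names one column on each side of the join is an assumption, since the text above only gives the types.

```python
from typing import List, Optional, Tuple, Union

OnType = Optional[Union[str, Tuple[str, str], List[Union[str, Tuple[str, str]]]]]
TimeStampsType = Optional[Union[str, Tuple[str, str]]]


def _is_name_or_pair(value: object) -> bool:
    """True for a column name or a pair of column names."""
    if isinstance(value, str):
        return True
    return (
        isinstance(value, tuple)
        and len(value) == 2
        and all(isinstance(v, str) for v in value)
    )


def matches_time_stamps_type(value: object) -> bool:
    return value is None or _is_name_or_pair(value)


def matches_on_type(value: object) -> bool:
    if isinstance(value, list):
        return all(_is_name_or_pair(v) for v in value)
    return matches_time_stamps_type(value)


print(matches_on_type("account_id"))                        # True
print(matches_on_type([("account_id", "acc_id"), "date"]))  # True
print(matches_on_type(42))                                  # False
print(matches_time_stamps_type(["date"]))                   # False
```

The last call shows the one structural difference between the aliases: a list is accepted for on but not for time stamps.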