getml.data.DataFrame

DataFrame(
    name: str,
    roles: Optional[
        Union[Dict[Union[Role, str], Iterable[str]], Roles]
    ] = None,
)

Handler for the data stored in the getML Engine.

The DataFrame class represents a data frame object in the getML Engine but does not contain any actual data itself. To create a data frame object in the Engine, fill it with data via the Python API, and retrieve a handler for it, use one of the from_csv, from_db, from_json, or from_pandas class methods. The Importing Data section in the user guide explains the particularities of each of these flavors of the unified import interface.

If the data frame object is already present in the Engine, either in memory as a temporary object or on disk because save was called earlier, the load_data_frame function will create a new handler without altering the underlying data. For more information about the lifecycle of the data in the getML Engine and its synchronization with the Python API, please see the corresponding User Guide.

ATTRIBUTE	DESCRIPTION
`name`	Unique identifier used to link the handler with the underlying data frame object in the Engine.
`roles`	Maps the `roles` to the column names (see `colnames`). The `roles` dictionary is expected to have the following format: `roles = {getml.data.roles.numerical: ["colname1", "colname2"], getml.data.roles.target: ["colname3"]}`. Alternatively, you can use the `Roles` class.
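
The constructor shown further down ends with a call to `_check_duplicates()`, whose body is not part of this page. As a rough illustration of the kind of problem such a check guards against, the following sketch looks for column names that appear under more than one role in a plain roles dictionary. The helper `find_duplicate_columns` is hypothetical and not part of getML; the string keys follow the role names used in the constructor source (`numerical`, `target`, and so on).

```python
from collections import Counter
from typing import Dict, Iterable, List


def find_duplicate_columns(roles: Dict[str, Iterable[str]]) -> List[str]:
    """Return the column names that are listed more than once across all roles.

    Hypothetical pre-check, not getML's own `_check_duplicates`.
    """
    counts = Counter(name for cols in roles.values() for name in cols)
    return sorted(name for name, n in counts.items() if n > 1)


# A well-formed roles dictionary: every column has exactly one role.
roles = {"numerical": ["colname1", "colname2"], "target": ["colname3"]}
print(find_duplicate_columns(roles))

# A problematic one: "colname1" is claimed by two roles.
bad = {"numerical": ["colname1"], "target": ["colname1"]}
print(find_duplicate_columns(bad))
```

Running such a check before constructing the handler keeps the error close to the place where the dictionary was written.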

Example

Creating a new data frame object in the getML Engine and importing data is done by one of the class methods from_csv, from_db, from_json, or from_pandas.

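import numpy
import pandas

import getml

random = numpy.random.RandomState(7263)

table = pandas.DataFrame()
# numpy.str was removed in NumPy 1.24; the built-in str works in all versions.
table['column_01'] = random.randint(0, 10, 1000).astype(str)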
table['join_key'] = numpy.arange(1000)
table['time_stamp'] = random.rand(1000)
table['target'] = random.rand(1000)

df_table = getml.DataFrame.from_pandas(table, name = 'table')

In addition to creating a new data frame object in the getML Engine and filling it with all the content of table, the from_pandas function also returns a DataFrame handler to the underlying data.

You don't have to create the data frame objects anew for each session. You can use their save method to write them to disk, the list_data_frames function to list all available objects in the Engine, and load_data_frame to create a DataFrame handler for a data set already present in the getML Engine (see User Guide for details).

df_table.save()

getml.data.list_data_frames()

df_table_reloaded = getml.data.load_data_frame('table')

Note

Although the Python API does not store the actual data itself, you can use the to_csv, to_db, to_json, and to_pandas methods to retrieve it.

Source code in getml/data/data_frame.py

def __init__(
    self,
    name: str,
    roles: Optional[Union[Dict[Union[Role, str], Iterable[str]], Roles]] = None,
):
    # ------------------------------------------------------------

    if not isinstance(name, str):
        raise TypeError("'name' must be str.")

    vars(self)["name"] = name

    if roles is None:
        roles = {}

    if isinstance(roles, dict):
        roles = Roles.from_dict(roles)

    # ------------------------------------------------------------

    vars(self)["_categorical_columns"] = [
        StringColumn(name=cname, role=roles_.categorical, df_name=self.name)
        for cname in roles.categorical
    ]

    vars(self)["_join_key_columns"] = [
        StringColumn(name=cname, role=roles_.join_key, df_name=self.name)
        for cname in roles.join_key
    ]

    vars(self)["_numerical_columns"] = [
        FloatColumn(name=cname, role=roles_.numerical, df_name=self.name)
        for cname in roles.numerical
    ]

    vars(self)["_target_columns"] = [
        FloatColumn(name=cname, role=roles_.target, df_name=self.name)
        for cname in roles.target
    ]

    vars(self)["_text_columns"] = [
        StringColumn(name=cname, role=roles_.text, df_name=self.name)
        for cname in roles.text
    ]

    vars(self)["_time_stamp_columns"] = [
        FloatColumn(name=cname, role=roles_.time_stamp, df_name=self.name)
        for cname in roles.time_stamp
    ]

    vars(self)["_unused_float_columns"] = [
        FloatColumn(name=cname, role=roles_.unused_float, df_name=self.name)
        for cname in roles.unused_float
    ]

    vars(self)["_unused_string_columns"] = [
        StringColumn(name=cname, role=roles_.unused_string, df_name=self.name)
        for cname in roles.unused_string
    ]

    # ------------------------------------------------------------

    self._check_duplicates()

colnames `property`

colnames: List[str]

List of the names of all columns.

RETURNS	DESCRIPTION
`List[str]`	List of the names of all columns.

columns `property`

columns: List[str]

Alias for colnames.

RETURNS	DESCRIPTION
`List[str]`	List of the names of all columns.

last_change `property`

last_change: str

A string describing the last time this data frame was changed.

memory_usage `property`

memory_usage

Convenience wrapper that returns the memory usage in MB.

roles `property`

roles

The roles of the columns included in this DataFrame.

rowid `property`

rowid

The rowids for this data frame.

shape `property`

shape

A tuple containing the number of rows and columns of the DataFrame.

add

add(
    col: Union[StringColumn, FloatColumn, ndarray],
    name: str,
    role: Optional[Role] = None,
    subroles: Optional[Union[Role, Iterable[str]]] = None,
    unit: str = "",
    time_formats: Optional[Iterable[str]] = None,
)

Adds a column to the current DataFrame.

PARAMETER	DESCRIPTION
`col`	The column or numpy.ndarray to be added. TYPE: `Union[StringColumn, FloatColumn, ndarray]`
`name`	Name of the new column. TYPE: `str`
`role`	Role of the new column. Must be from `roles`. TYPE: `Optional[Role]` DEFAULT: `None`
`subroles`	Subroles of the new column. Must be from `subroles`. TYPE: `Optional[Union[Role, Iterable[str]]]` DEFAULT: `None`
`unit`	Unit of the column. TYPE: `str` DEFAULT: `''`
`time_formats`	Formats to be used to parse the time stamps. This is only necessary if an implicit conversion from a `StringColumn` to a time stamp is taking place. The formats are allowed to contain the following special characters: `%w` - abbreviated weekday (Mon, Tue, ...); `%W` - full weekday (Monday, Tuesday, ...); `%b` - abbreviated month (Jan, Feb, ...); `%B` - full month (January, February, ...); `%d` - zero-padded day of month (01 .. 31); `%e` - day of month (1 .. 31); `%f` - space-padded day of month ( 1 .. 31); `%m` - zero-padded month (01 .. 12); `%n` - month (1 .. 12); `%o` - space-padded month ( 1 .. 12); `%y` - year without century (70); `%Y` - year with century (1970); `%H` - hour (00 .. 23); `%h` - hour (00 .. 12); `%a` - am/pm; `%A` - AM/PM; `%M` - minute (00 .. 59); `%S` - second (00 .. 59); `%s` - seconds and microseconds (equivalent to `%S.%F`); `%i` - millisecond (000 .. 999); `%c` - centisecond (0 .. 9); `%F` - fractional seconds/microseconds (000000 - 999999); `%z` - time zone differential in ISO 8601 format (Z or +NN.NN); `%Z` - time zone differential in RFC format (GMT or +NNNN); `%%` - percent sign. TYPE: `Optional[Iterable[str]]` DEFAULT: `None`
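
Before passing `time_formats`, it can help to check locally which candidate format matches a sample value. The sketch below uses Python's `datetime.strptime`, which is not the parser used by the Engine. It is therefore only a rough sanity check, and it restricts itself to specifiers whose meaning is the same in both notations (`%Y`, `%m`, `%d`, `%H`, `%M`, `%S`). The helper name `first_matching_format` is hypothetical.

```python
from datetime import datetime
from typing import Iterable, Optional


def first_matching_format(sample: str, time_formats: Iterable[str]) -> Optional[str]:
    """Return the first format that parses `sample` completely, or None.

    Uses Python's strptime as a stand-in; the Engine's parser may be
    stricter or more lenient in edge cases.
    """
    for fmt in time_formats:
        try:
            datetime.strptime(sample, fmt)
        except ValueError:
            # Either the pattern does not match or unparsed text remains.
            continue
        return fmt
    return None


candidates = ["%Y-%m-%d %H:%M:%S", "%Y-%m-%d"]

print(first_matching_format("2019-03-14 08:30:00", candidates))
print(first_matching_format("2019-03-14", candidates))
print(first_matching_format("14.03.2019", candidates))
```

A result of `None` for a representative sample suggests that another format needs to be added to the list.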

Source code in getml/data/data_frame.py

def add(
    self,
    col: Union[StringColumn, FloatColumn, np.ndarray],
    name: str,
    role: Optional[Role] = None,
    subroles: Optional[Union[Role, Iterable[str]]] = None,
    unit: str = "",
    time_formats: Optional[Iterable[str]] = None,
):
    """Adds a column to the current [`DataFrame`][getml.DataFrame].

    Args:
        col:
            The column or numpy.ndarray to be added.

        name:
            Name of the new column.

        role:
            Role of the new column. Must be from [`roles`][getml.data.roles].

        subroles:
            Subroles of the new column. Must be from [`subroles`][getml.data.subroles].

        unit:
            Unit of the column.

        time_formats:
            Formats to be used to parse the time stamps.

            This is only necessary, if an implicit conversion from
            a [`StringColumn`][getml.data.columns.StringColumn] to a time
            stamp is taking place.

            The formats are allowed to contain the following
            special characters:

            * %w - abbreviated weekday (Mon, Tue, ...)
            * %W - full weekday (Monday, Tuesday, ...)
            * %b - abbreviated month (Jan, Feb, ...)
            * %B - full month (January, February, ...)
            * %d - zero-padded day of month (01 .. 31)
            * %e - day of month (1 .. 31)
            * %f - space-padded day of month ( 1 .. 31)
            * %m - zero-padded month (01 .. 12)
            * %n - month (1 .. 12)
            * %o - space-padded month ( 1 .. 12)
            * %y - year without century (70)
            * %Y - year with century (1970)
            * %H - hour (00 .. 23)
            * %h - hour (00 .. 12)
            * %a - am/pm
            * %A - AM/PM
            * %M - minute (00 .. 59)
            * %S - second (00 .. 59)
            * %s - seconds and microseconds (equivalent to %S.%F)
            * %i - millisecond (000 .. 999)
            * %c - centisecond (0 .. 9)
            * %F - fractional seconds/microseconds (000000 - 999999)
            * %z - time zone differential in ISO 8601 format (Z or +NN.NN)
            * %Z - time zone differential in RFC format (GMT or +NNNN)
            * %% - percent sign
    """

    if isinstance(col, np.ndarray):
        self._add_numpy_array(col, name, role, subroles, unit)
        return

    col, role, subroles = _with_column(
        col, name, role, subroles, unit, time_formats
    )

    is_string = isinstance(col, (StringColumnView, StringColumn))

    if is_string:
        self._add_categorical_column(col, name, role, subroles, unit)
    else:
        self._add_column(col, name, role, subroles, unit)

copy

copy(name: str) -> DataFrame

Creates a deep copy of the data frame under a new name.

PARAMETER	DESCRIPTION
`name`	The name of the new data frame. TYPE: `str`

RETURNS	DESCRIPTION
`DataFrame`	A handle to the deep copy.

Source code in getml/data/data_frame.py

def copy(self, name: str) -> DataFrame:
    """
    Creates a deep copy of the data frame under a new name.

    Args:
        name:
            The name of the new data frame.

    Returns:
            A handle to the deep copy.
    """

    if not isinstance(name, str):
        raise TypeError("'name' must be a string.")

    cmd: Dict[str, Any] = {}

    cmd["type_"] = "DataFrame.concat"
    cmd["name_"] = name

    cmd["data_frames_"] = [self._getml_deserialize()]

    comm.send(cmd)

    return DataFrame(name=name).refresh()

delete

delete()

Permanently deletes the data frame. delete first unloads the data frame from memory and then deletes it from disk.

Source code in getml/data/data_frame.py

def delete(self):
    """
    Permanently deletes the data frame. `delete` first unloads the data frame
    from memory and then deletes it from disk.
    """
    # ------------------------------------------------------------

    self._delete()

drop

drop(
    cols: Union[
        FloatColumn,
        StringColumn,
        str,
        Union[
            Iterable[FloatColumn],
            Iterable[StringColumn],
            Iterable[str],
        ],
    ]
) -> View

Returns a new View that has one or several columns removed.

PARAMETER	DESCRIPTION
`cols`	The columns or the names thereof. TYPE: `Union[FloatColumn, StringColumn, str, Union[Iterable[FloatColumn], Iterable[StringColumn], Iterable[str]]]`

RETURNS	DESCRIPTION
`View`	A new `View` object with the specified columns removed.
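
The source of `drop` shows that it only records the dropped names on a new `View` and leaves the data frame itself untouched. The toy model below illustrates that idea in plain Python. `ToyView` is a hypothetical stand-in, not getML's `View` class, and it tracks column names only.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List, Tuple, Union


@dataclass(frozen=True)
class ToyView:
    """Toy stand-in for a view: a base plus a set of dropped column names."""

    base_colnames: Tuple[str, ...]
    dropped: FrozenSet[str] = frozenset()

    @property
    def colnames(self) -> List[str]:
        # The visible columns are computed on demand from base and dropped.
        return [c for c in self.base_colnames if c not in self.dropped]

    def drop(self, cols: Union[str, Iterable[str]]) -> "ToyView":
        names = [cols] if isinstance(cols, str) else list(cols)
        # Return a new object; the current one is never modified.
        return ToyView(self.base_colnames, self.dropped | frozenset(names))


base = ToyView(("column_01", "join_key", "time_stamp", "target"))
view = base.drop("time_stamp")

print(view.colnames)
print(base.colnames)
```

Because nothing is modified in place, this style of operation is also compatible with a frozen data frame.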

Source code in getml/data/data_frame.py

def drop(
    self,
    cols: Union[
        FloatColumn,
        StringColumn,
        str,
        Union[Iterable[FloatColumn], Iterable[StringColumn], Iterable[str]],
    ],
) -> View:
    """Returns a new [`View`][getml.data.View] that has one or several columns removed.

    Args:
        cols:
            The columns or the names thereof.

    Returns:
        A new [`View`][getml.data.View] object with the specified columns removed.
    """

    names = _handle_cols(cols)

    return View(base=self, dropped=names)

freeze

freeze()

Freezes the data frame.

After you have frozen the data frame, the data frame is immutable and in-place operations are no longer possible. However, you can still create views. In other words, operations like set_role are no longer possible, but operations like with_role are.

Source code in getml/data/data_frame.py

def freeze(self):
    """Freezes the data frame.

    After you have frozen the data frame, the data frame is immutable
    and in-place operations are no longer possible. However, you can
    still create views. In other words, operations like
    [`set_role`][getml.DataFrame.set_role] are no longer possible,
    but operations like [`with_role`][getml.DataFrame.with_role] are.
    """
    cmd: Dict[str, Any] = {}
    cmd["type_"] = "DataFrame.freeze"
    cmd["name_"] = self.name
    comm.send(cmd)

from_arrow `classmethod`

from_arrow(
    table: Table,
    name: str,
    roles: Optional[
        Union[Dict[Union[Role, str], Iterable[str]], Roles]
    ] = None,
    ignore: bool = False,
    dry: Literal[False] = False,
) -> DataFrame

from_arrow(
    table: Table,
    name: str,
    roles: Optional[
        Union[Dict[Union[Role, str], Iterable[str]], Roles]
    ] = None,
    ignore: bool = False,
    dry: Literal[True] = True,
) -> Roles

from_arrow(
    table: Table,
    name: str,
    roles: Optional[
        Union[Dict[Union[Role, str], Iterable[str]], Roles]
    ] = None,
    ignore: bool = False,
    dry: bool = False,
) -> Union[DataFrame, Roles]

Create a DataFrame from an Arrow Table.

This is one of the fastest ways to get data into the getML Engine.

PARAMETER	DESCRIPTION
`table`	The Arrow table to be read. TYPE: `Table`
`name`	Name of the data frame to be created. TYPE: `str`
`roles`	Maps the `roles` to the column names (see `colnames`). The `roles` dictionary is expected to have the following format: `roles = {getml.data.roles.numerical: ["colname1", "colname2"], getml.data.roles.target: ["colname3"]}`. Alternatively, you can use the `Roles` class. TYPE: `Optional[Union[Dict[Union[Role, str], Iterable[str]], Roles]]` DEFAULT: `None`
`ignore`	Only relevant when roles is not None. Determines what you want to do with any colnames not mentioned in roles. Do you want to ignore them (True) or read them in as unused columns (False)? TYPE: `bool` DEFAULT: `False`
`dry`	If set to True, the data will not be read. Instead, the method will return the inferred roles. TYPE: `bool` DEFAULT: `False`

RETURNS	DESCRIPTION
`Union[DataFrame, Roles]`	Handler of the underlying data, or the inferred roles if `dry` is True.
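
The interplay of `roles` and `ignore` can be pictured with a small, self-contained sketch. It is not getML's implementation: the `_prepare_roles` helper called in the source is not shown on this page. `plan_roles` and `column_kinds` are hypothetical names, and the rule that unmentioned float-like columns become `unused_float` while all others become `unused_string` is an assumption made for illustration.

```python
from typing import Dict, List


def plan_roles(
    user_roles: Dict[str, List[str]],
    column_kinds: Dict[str, str],
    ignore: bool,
) -> Dict[str, List[str]]:
    """Toy model: decide what happens to columns not mentioned in user_roles."""
    planned = {role: list(cols) for role, cols in user_roles.items()}
    if ignore:
        # Columns without an explicit role are left out entirely.
        return planned
    mentioned = {col for cols in user_roles.values() for col in cols}
    for col, kind in column_kinds.items():
        if col in mentioned:
            continue
        # Assumption: the unused role depends on the sniffed kind of the column.
        role = "unused_float" if kind == "float" else "unused_string"
        planned.setdefault(role, []).append(col)
    return planned


column_kinds = {"col1": "string", "col2": "string", "col3": "float", "col4": "float"}
user_roles = {"categorical": ["col1"], "target": ["col3"]}

print(plan_roles(user_roles, column_kinds, ignore=True))
print(plan_roles(user_roles, column_kinds, ignore=False))
```

With `ignore=True` only the explicitly listed columns survive; with `ignore=False` the remaining columns are kept under an unused role.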

Source code in getml/data/data_frame.py

@classmethod
def from_arrow(
    cls,
    table: pa.Table,
    name: str,
    roles: Optional[Union[Dict[Union[Role, str], Iterable[str]], Roles]] = None,
    ignore: bool = False,
    dry: bool = False,
) -> Union[DataFrame, Roles]:
    """Create a DataFrame from an Arrow Table.

    This is one of the fastest ways to get data into the
    getML Engine.

    Args:
        table:
            The Arrow table to be read.

        name:
            Name of the data frame to be created.

        roles:
            Maps the [`roles`][getml.data.roles] to the
            column names (see [`colnames`][getml.DataFrame.colnames]).

            The `roles` dictionary is expected to have the following format:
            ```python
            roles = {getml.data.roles.numerical: ["colname1", "colname2"],
                     getml.data.roles.target: ["colname3"]}
            ```
            Otherwise, you can use the [`Roles`][getml.data.Roles] class.

        ignore:
            Only relevant when roles is not None.
            Determines what you want to do with any colnames not
            mentioned in roles. Do you want to ignore them (True)
            or read them in as unused columns (False)?

        dry:
            If set to True, the data will not be read. Instead, the method
            will return the inferred roles.

    Returns:
            Handler of the underlying data.
    """

    # ------------------------------------------------------------

    if not isinstance(name, str):
        raise TypeError("'name' must be str.")

    # The content of roles is checked in the class constructor called below.
    if roles is not None and not isinstance(roles, (dict, Roles)):
        raise TypeError("'roles' must be a getml.data.Roles object, a dict or None.")

    if not isinstance(ignore, bool):
        raise TypeError("'ignore' must be bool.")

    if not isinstance(dry, bool):
        raise TypeError("'dry' must be bool.")

    # ------------------------------------------------------------

    sniffed_roles = sniff_schema(table.schema)

    roles = _prepare_roles(roles, sniffed_roles, ignore_sniffed_roles=ignore)

    if dry:
        return roles

    data_frame = cls(name, roles)

    return data_frame.read_arrow(table=table, append=False)

from_csv `classmethod`

from_csv(
    fnames: Union[str, Iterable[str]],
    name: str,
    num_lines_sniffed: None = None,
    num_lines_read: int = 0,
    quotechar: str = '"',
    sep: str = ",",
    skip: int = 0,
    colnames: Iterable[str] = (),
    roles: Optional[
        Union[Dict[Union[Role, str], Iterable[str]], Roles]
    ] = None,
    ignore: bool = False,
    dry: Literal[False] = False,
    verbose: bool = False,
) -> DataFrame

from_csv(
    fnames: Union[str, Iterable[str]],
    name: str,
    num_lines_sniffed: None = None,
    num_lines_read: int = 0,
    quotechar: str = '"',
    sep: str = ",",
    skip: int = 0,
    colnames: Iterable[str] = (),
    roles: Optional[
        Union[Dict[Union[Role, str], Iterable[str]], Roles]
    ] = None,
    ignore: bool = False,
    dry: Literal[True] = True,
    verbose: bool = False,
) -> Roles

from_csv(
    fnames: Union[str, Iterable[str]],
    name: str,
    num_lines_sniffed: None = None,
    num_lines_read: int = 0,
    quotechar: str = '"',
    sep: str = ",",
    skip: int = 0,
    colnames: Iterable[str] = (),
    roles: Optional[
        Union[Dict[Union[Role, str], Iterable[str]], Roles]
    ] = None,
    ignore: bool = False,
    dry: bool = False,
    verbose: bool = True,
    block_size: int = DEFAULT_CSV_READ_BLOCK_SIZE,
    in_batches: bool = False,
) -> Union[DataFrame, Roles]

Create a DataFrame from CSV files.

The getML Engine will construct a data frame object in the Engine, fill it with the data read from the CSV file(s), and return a corresponding DataFrame handle.

PARAMETER	DESCRIPTION
`fnames`	CSV file paths to be read. TYPE: `Union[str, Iterable[str]]`
`name`	Name of the data frame to be created. TYPE: `str`
`num_lines_sniffed`	Deprecated and ignored. Formerly the number of lines analyzed by the sniffer. TYPE: `None` DEFAULT: `None`
`num_lines_read`	Number of lines read from each file. Set to 0 to read in the entire file. TYPE: `int` DEFAULT: `0`
`quotechar`	The character used to wrap strings. TYPE: `str` DEFAULT: `'"'`
`sep`	The separator used for separating fields. TYPE: `str` DEFAULT: `','`
`skip`	Number of lines to skip at the beginning of each file. TYPE: `int` DEFAULT: `0`
`colnames`	The first line of a CSV file usually contains the column names. When this is not the case, you need to explicitly pass them. TYPE: `Iterable[str]` DEFAULT: `()`
`roles`	Maps the `roles` to the column names (see `colnames`). The `roles` dictionary is expected to have the following format: `roles = {getml.data.roles.numerical: ["colname1", "colname2"], getml.data.roles.target: ["colname3"]}`. Alternatively, you can use the `Roles` class. TYPE: `Optional[Union[Dict[Union[Role, str], Iterable[str]], Roles]]` DEFAULT: `None`
`ignore`	Only relevant when roles is not None. Determines what you want to do with any colnames not mentioned in roles. Do you want to ignore them (True) or read them in as unused columns (False)? TYPE: `bool` DEFAULT: `False`
`dry`	If set to True, the data will not be read. Instead, the method will return the inferred roles. TYPE: `bool` DEFAULT: `False`
`verbose`	If True and fnames are URLs, the filenames are printed to stdout during the download. TYPE: `bool` DEFAULT: `True`
`block_size`	The number of bytes read with each batch. Passed down to pyarrow. TYPE: `int` DEFAULT: `DEFAULT_CSV_READ_BLOCK_SIZE`
`in_batches`	If True, read the blocks in a streaming manner and send them to the Engine as batches. Blocks are read and sent to the Engine sequentially. While more memory efficient, streaming in batches is slower because it is inherently single-threaded. If False (default), the data is first read into Arrow with multiple threads and sent to the Engine afterwards. TYPE: `bool` DEFAULT: `False`

RETURNS	DESCRIPTION
`Union[DataFrame, Roles]`	Handler of the underlying data, or the inferred roles if `dry` is True.

Deprecated

1.5: The num_lines_sniffed parameter is deprecated.

Note

It is assumed that the first line of each CSV file contains a header with the column names.
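
When several files are passed in `fnames`, a quick local check that all of them start with the same header can catch problems before the import. The following sketch uses only the Python standard library and is independent of getML. The helper names `read_header` and `headers_agree` are hypothetical, and `sep` and `quotechar` merely mirror the parameter names of `from_csv`.

```python
import csv
import os
import tempfile
from typing import Iterable, List


def read_header(fname: str, sep: str = ",", quotechar: str = '"') -> List[str]:
    """Return the fields of the first line of a CSV file (empty list if the file is empty)."""
    with open(fname, newline="") as f:
        return next(csv.reader(f, delimiter=sep, quotechar=quotechar), [])


def headers_agree(fnames: Iterable[str], sep: str = ",", quotechar: str = '"') -> bool:
    """Check that all files share the same header line."""
    headers = [read_header(fname, sep, quotechar) for fname in fnames]
    return all(header == headers[0] for header in headers)


# Small demonstration with two temporary files that use ';' as separator.
with tempfile.TemporaryDirectory() as tmp:
    paths = []
    for i, content in enumerate(["a;b\n1;2\n", "a;b\n3;4\n"]):
        path = os.path.join(tmp, f"file{i + 1}.csv")
        with open(path, "w", newline="") as f:
            f.write(content)
        paths.append(path)
    print(headers_agree(paths, sep=";"))
```

If the files have no header line at all, the `colnames` parameter described above is the place to supply the names explicitly.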

Example

Let's assume you have two CSV files, file1.csv and file2.csv, in the current working directory. You can import their data into the getML Engine using:

import getml

df_expd = getml.data.DataFrame.from_csv(
    fnames=["file1.csv", "file2.csv"],
    name="MY DATA FRAME",
    sep=';',
    quotechar='"'
)

# However, the CSV format lacks type safety. If you want to
# build a reliable pipeline, it is a good idea
# to hard-code the roles:

roles = {"categorical": ["col1", "col2"], "target": ["col3"]}

df_expd = getml.data.DataFrame.from_csv(
    fnames=["file1.csv", "file2.csv"],
    name="MY DATA FRAME",
    sep=';',
    quotechar='"',
    roles=roles
)

# If you think that typing out all the roles by hand is too
# cumbersome, you can use a dry run:

roles = getml.data.DataFrame.from_csv(
    fnames=["file1.csv", "file2.csv"],
    name="MY DATA FRAME",
    sep=';',
    quotechar='"',
    dry=True
)

This will return the roles it would have used. You can now hard-code them.

Source code in getml/data/data_frame.py

@classmethod
def from_csv(
    cls,
    fnames: Union[str, Iterable[str]],
    name: str,
    num_lines_sniffed: None = None,
    num_lines_read: int = 0,
    quotechar: str = '"',
    sep: str = ",",
    skip: int = 0,
    colnames: Iterable[str] = (),
    roles: Optional[Union[Dict[Union[Role, str], Iterable[str]], Roles]] = None,
    ignore: bool = False,
    dry: bool = False,
    verbose: bool = True,
    block_size: int = DEFAULT_CSV_READ_BLOCK_SIZE,
    in_batches: bool = False,
) -> Union[DataFrame, Roles]:
    """Create a DataFrame from CSV files.

    The getML Engine will construct a data
    frame object in the Engine, fill it with the data read from
    the CSV file(s), and return a corresponding
    [`DataFrame`][getml.DataFrame] handle.

    Args:
        fnames:
            CSV file paths to be read.

        name:
            Name of the data frame to be created.

        num_lines_sniffed:
            Number of lines analyzed by the sniffer.

        num_lines_read:
            Number of lines read from each file.
            Set to 0 to read in the entire file.

        quotechar:
            The character used to wrap strings.

        sep:
            The separator used for separating fields.

        skip:
            Number of lines to skip at the beginning of each file.

        colnames: The first line of a CSV file
            usually contains the column names. When this is not the case,
            you need to explicitly pass them.

        roles:
            Maps the [`roles`][getml.data.roles] to the
            column names (see [`colnames`][getml.DataFrame.colnames]).

            The `roles` dictionary is expected to have the following format
            ```python
            roles = {getml.data.roles.numerical: ["colname1", "colname2"],
                     getml.data.roles.target: ["colname3"]}
            ```
            Otherwise, you can use the [`Roles`][getml.data.Roles] class.

        ignore:
            Only relevant when roles is not None.
            Determines what you want to do with any colnames not
            mentioned in roles. Do you want to ignore them (True)
            or read them in as unused columns (False)?

        dry:
            If set to True, the data will not be read. Instead, the method
            will return the inferred roles.

        verbose:
            If True, when fnames are urls, the filenames are
            printed to stdout during the download.

        block_size:
            The number of bytes read with each batch. Passed down to
            pyarrow.

        in_batches:
            If True, read blocks in a streaming manner and send those batches to
            the engine. Blocks are read and sent to the engine sequentially.
            While more memory efficient, streaming in batches is slower as
            it is inherently single-threaded. If False (default) the data is
            read with multiple threads into arrow first and sent to the engine
            afterwards.

    Returns:
            Handler of the underlying data.


    Deprecated:
        1.5: The `num_lines_sniffed` parameter is deprecated.

    Note:
        It is assumed that the first line of each CSV file
        contains a header with the column names.

    ??? example
        Let's assume you have two CSV files - *file1.csv* and
        *file2.csv* - in the current working directory. You can
        import their data into the getML Engine using.
        ```python
        df_expd = data.DataFrame.from_csv(
            fnames=["file1.csv", "file2.csv"],
            name="MY DATA FRAME",
            sep=';',
            quotechar='"'
            )

        # However, the CSV format lacks type safety. If you want to
        # build a reliable pipeline, it is a good idea
        # to hard-code the roles:

        roles = {"categorical": ["col1", "col2"], "target": ["col3"]}

        df_expd = data.DataFrame.from_csv(
            fnames=["file1.csv", "file2.csv"],
            name="MY DATA FRAME",
            sep=';',
            quotechar='"',
            roles=roles
            )

        # If you think that typing out all the roles by hand is too
        # cumbersome, you can use a dry run:

        roles = data.DataFrame.from_csv(
            fnames=["file1.csv", "file2.csv"],
            name="MY DATA FRAME",
            sep=';',
            quotechar='"',
            dry=True
        )
        ```

        This will return the roles dictionary it would have used. You
        can now hard-code this.

    """

    if num_lines_sniffed is not None:
        warnings.warn(
            "The 'num_lines_sniffed' parameter is deprecated and will be ignored.",
            DeprecationWarning,
        )

    if isinstance(fnames, str):
        fnames = [fnames]

    if not _is_non_empty_typed_list(fnames, str):
        raise TypeError("'fnames' must be either a str or a list of str.")

    if not isinstance(name, str):
        raise TypeError("'name' must be str.")

    if not isinstance(num_lines_read, numbers.Real):
        raise TypeError("'num_lines_read' must be a real number")

    if not isinstance(quotechar, str):
        raise TypeError("'quotechar' must be str.")

    if not isinstance(sep, str):
        raise TypeError("'sep' must be str.")

    if not isinstance(skip, numbers.Real):
        raise TypeError("'skip' must be a real number")

    if roles is not None and not isinstance(roles, (dict, Roles)):
        raise TypeError("'roles' must be a getml.data.Roles object, a dict or None.")

    if not isinstance(ignore, bool):
        raise TypeError("'ignore' must be bool.")

    if not isinstance(dry, bool):
        raise TypeError("'dry' must be bool.")

    if colnames:
        if not _is_iterable_not_str_of_type(colnames, str):
            raise TypeError("'colnames' must be an iterable of str")

    fnames = _retrieve_urls(fnames, verbose=verbose)

    sniffed_roles = sniff_csv(
        fnames=fnames,
        quotechar=quotechar,
        sep=sep,
        skip=int(skip),
        colnames=colnames,
    )

    roles = _prepare_roles(roles, sniffed_roles, ignore_sniffed_roles=ignore)

    if dry:
        return roles

    data_frame = cls(name, roles)

    return data_frame.read_csv(
        fnames=fnames,
        append=False,
        quotechar=quotechar,
        sep=sep,
        num_lines_read=num_lines_read,
        skip=skip,
        colnames=colnames,
        block_size=block_size,
        in_batches=in_batches,
    )

from_db `classmethod`

from_db(
    table_name: str,
    name: str,
    roles: Optional[
        Union[Dict[Union[Role, str], Iterable[str]], Roles]
    ] = None,
    ignore: bool = False,
    dry: Literal[False] = False,
    conn: Optional[Connection] = None,
) -> DataFrame

from_db(
    table_name: str,
    name: str,
    roles: Optional[
        Union[Dict[Union[Role, str], Iterable[str]], Roles]
    ] = None,
    ignore: bool = False,
    dry: Literal[True] = True,
    conn: Optional[Connection] = None,
) -> Roles

from_db(
    table_name: str,
    name: Optional[str] = None,
    roles: Optional[
        Union[Dict[Union[Role, str], Iterable[str]], Roles]
    ] = None,
    ignore: bool = False,
    dry: bool = False,
    conn: Optional[Connection] = None,
) -> Union[DataFrame, Roles]

Create a DataFrame from a table in a database.

It will construct a data frame object in the Engine, fill it with the data read from table table_name in the connected database (see database), and return a corresponding DataFrame handle.

PARAMETER	DESCRIPTION
`table_name`	Name of the table to be read. TYPE: `str`
`name`	Name of the data frame to be created. If not passed, then the table_name will be used. TYPE: `Optional[str]` DEFAULT: `None`
`roles`	Maps the `roles` to the column names (see `colnames`). The `roles` dictionary is expected to have the following format: `roles = {getml.data.roles.numerical: ["colname1", "colname2"], getml.data.roles.target: ["colname3"]}`. Alternatively, you can use the `Roles` class. TYPE: `Optional[Union[Dict[Union[Role, str], Iterable[str]], Roles]]` DEFAULT: `None`
`ignore`	Only relevant when roles is not None. Determines what you want to do with any colnames not mentioned in roles. Do you want to ignore them (True) or read them in as unused columns (False)? TYPE: `bool` DEFAULT: `False`
`dry`	If set to True, the data will not be read. Instead, the method will return the inferred roles. TYPE: `bool` DEFAULT: `False`
`conn`	The database connection to be used. If you don't explicitly pass a connection, the Engine will use the default connection. TYPE: `Optional[Connection]` DEFAULT: `None`

RETURNS	DESCRIPTION
`Union[DataFrame, Roles]`	Handler of the underlying data, or the inferred roles if `dry` is True.
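
The three signatures listed for this method differ only in the type of `dry` and in the return type. This is the usual `typing.overload` pattern for telling a type checker that the result depends on a flag. The sketch below reproduces the pattern with stand-in names: `ToyFrame`, `ToyRoles`, and `load` are hypothetical and do not talk to the Engine.

```python
from typing import Dict, List, Literal, Union, overload

# Stand-in for a roles object: role name -> column names.
ToyRoles = Dict[str, List[str]]


class ToyFrame:
    """Stand-in for a data frame handler."""

    def __init__(self, name: str, roles: ToyRoles) -> None:
        self.name = name
        self.roles = roles


@overload
def load(name: str, dry: Literal[False] = False) -> ToyFrame: ...
@overload
def load(name: str, dry: Literal[True]) -> ToyRoles: ...
@overload
def load(name: str, dry: bool) -> Union[ToyFrame, ToyRoles]: ...


def load(name: str, dry: bool = False) -> Union[ToyFrame, ToyRoles]:
    # Pretend these roles were inferred from the data source.
    roles: ToyRoles = {"unused_string": ["col1"]}
    if dry:
        # Dry run: report the roles, create nothing.
        return roles
    return ToyFrame(name, roles)


frame = load("table")
inferred = load("table", dry=True)
print(type(frame).__name__, inferred)
```

At runtime only the last definition exists; the overloads serve static type checkers and documentation.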

Example

getml.database.connect_mysql(
    host="relational.fel.cvut.cz",
    port=3306,
    dbname="financial",
    user="guest",
    password="ctu-relational"
)

loan = getml.DataFrame.from_db(
    table_name='loan', name='data_frame_loan')

Source code in getml/data/data_frame.py

@classmethod
def from_db(
    cls,
    table_name: str,
    name: Optional[str] = None,
    roles: Optional[Union[Dict[Union[Role, str], Iterable[str]], Roles]] = None,
    ignore: bool = False,
    dry: bool = False,
    conn: Optional[Connection] = None,
) -> Union[DataFrame, Roles]:
    """Create a DataFrame from a table in a database.

    It will construct a data frame object in the Engine, fill it
    with the data read from table `table_name` in the connected
    database (see [`database`][getml.database]), and return a
    corresponding [`DataFrame`][getml.DataFrame] handle.

    Args:
        table_name:
            Name of the table to be read.

        name:
            Name of the data frame to be created. If not passed,
            then the *table_name* will be used.

        roles:
            Maps the [`roles`][getml.data.roles] to the
            column names (see [`colnames`][getml.DataFrame.colnames]).

            The `roles` dictionary is expected to have the following format:
            ```python
            roles = {getml.data.role.numeric: ["colname1", "colname2"],
                     getml.data.role.target: ["colname3"]}
            ```
            Otherwise, you can use the [`Roles`][getml.data.Roles] class.

        ignore:
            Only relevant when roles is not None.
            Determines what you want to do with any colnames not
            mentioned in roles. Do you want to ignore them (True)
            or read them in as unused columns (False)?

        dry:
            If set to True, the data will not be read. Instead, the method
            will return the inferred roles.

        conn:
            The database connection to be used.
            If you don't explicitly pass a connection, the Engine
            will use the default connection.

    Returns:
            Handler of the underlying data.

    ??? example
        ```python
        getml.database.connect_mysql(
            host="relational.fel.cvut.cz",
            port=3306,
            dbname="financial",
            user="guest",
            password="ctu-relational"
        )

        loan = getml.DataFrame.from_db(
            table_name='loan', name='data_frame_loan')
        ```
    """

    # -------------------------------------------

    name = name or table_name

    # -------------------------------------------

    if not isinstance(table_name, str):
        raise TypeError("'table_name' must be str.")

    if not isinstance(name, str):
        raise TypeError("'name' must be str.")

    # The content of roles is checked in the class constructor called below.
    if roles is not None and not isinstance(roles, (dict, Roles)):
        raise TypeError(
            "'roles' must be a getml.data.Roles object, a dict or None."
        )

    if not isinstance(ignore, bool):
        raise TypeError("'ignore' must be bool.")

    if not isinstance(dry, bool):
        raise TypeError("'dry' must be bool.")

    # -------------------------------------------

    conn = conn or database.Connection()

    # ------------------------------------------------------------

    sniffed_roles = _sniff_db(table_name, conn)

    roles = _prepare_roles(roles, sniffed_roles, ignore_sniffed_roles=ignore)

    if dry:
        return roles

    # ------------------------------------------------------------

    data_frame = cls(name, roles)

    return data_frame.read_db(table_name=table_name, append=False, conn=conn)

from_dict `classmethod`

from_dict(
    data: Dict[Hashable, Iterable[Any]],
    name: str,
    roles: Optional[
        Union[Dict[Union[Role, str], Iterable[str]], Roles]
    ] = None,
    ignore: bool = False,
    dry: Literal[False] = False,
) -> DataFrame

from_dict(
    data: Dict[Hashable, Iterable[Any]],
    name: str,
    roles: Optional[
        Union[Dict[Union[Role, str], Iterable[str]], Roles]
    ] = None,
    ignore: bool = False,
    dry: Literal[True] = True,
) -> Roles

from_dict(
    data: Dict[Hashable, Iterable[Any]],
    name: str,
    roles: Optional[
        Union[Dict[Union[Role, str], Iterable[str]], Roles]
    ] = None,
    ignore: bool = False,
    dry: bool = False,
) -> Union[DataFrame, Roles]

Create a new DataFrame from a dict.

PARAMETER	DESCRIPTION
`data`	The dict containing the data. The data should be in the following format: `data = {'col1': [1.0, 2.0, 1.0], 'col2': ['A', 'B', 'C']}` TYPE: `Dict[Hashable, Iterable[Any]]`
`name`	Name of the data frame to be created. TYPE: `str`
`roles`	Maps the `roles` to the column names (see `colnames`). The `roles` dictionary is expected to have the following format: `roles = {getml.data.role.numeric: ["colname1", "colname2"], getml.data.role.target: ["colname3"]}` Otherwise, you can use the `Roles` class. TYPE: `Optional[Union[Dict[Union[Role, str], Iterable[str]], Roles]]` DEFAULT: `None`
`ignore`	Only relevant when roles is not None. Determines what you want to do with any colnames not mentioned in roles. Do you want to ignore them (True) or read them in as unused columns (False)? TYPE: `bool` DEFAULT: `False`
`dry`	If set to True, the data will not be read. Instead, the method will return the inferred roles. TYPE: `bool` DEFAULT: `False`

RETURNS	DESCRIPTION
`Union[DataFrame, Roles]`	Handler of the underlying data or, if `dry` is True, the inferred roles.
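
As the source listing for this method shows, `data` is handed to `pa.Table.from_pydict`, which expects every column to hold the same number of entries. A small pre-check can name the offending columns before the import is attempted. This is a sketch in plain Python; `column_lengths` and `check_rectangular` are hypothetical helpers and not part of getML.

```python
from typing import Any, Dict, Hashable, Iterable


def column_lengths(data: Dict[Hashable, Iterable[Any]]) -> Dict[Hashable, int]:
    """Count the entries of each column.

    Note: this materializes each column, so one-shot iterables such as
    generators are consumed by the check.
    """
    return {name: len(list(values)) for name, values in data.items()}


def check_rectangular(data: Dict[Hashable, Iterable[Any]]) -> None:
    """Raise ValueError if the columns do not all have the same length."""
    lengths = column_lengths(data)
    if len(set(lengths.values())) > 1:
        raise ValueError(f"Columns differ in length: {lengths}")


data = {"col1": [1.0, 2.0, 1.0], "col2": ["A", "B", "C"]}
check_rectangular(data)  # passes silently

ragged = {"col1": [1.0, 2.0], "col2": ["A", "B", "C"]}
try:
    check_rectangular(ragged)
    ragged_ok = True
except ValueError:
    ragged_ok = False
```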

Source code in getml/data/data_frame.py

@classmethod
def from_dict(
    cls,
    data: Dict[Hashable, Iterable[Any]],
    name: str,
    roles: Optional[Union[Dict[Union[Role, str], Iterable[str]], Roles]] = None,
    ignore: bool = False,
    dry: bool = False,
) -> Union[DataFrame, Roles]:
    """Create a new DataFrame from a dict

    Args:
        data:
            The dict containing the data.
            The data should be in the following format:
            ```python
            data = {'col1': [1.0, 2.0, 1.0], 'col2': ['A', 'B', 'C']}
            ```
        name:
            Name of the data frame to be created.

        roles:
            Maps the [`roles`][getml.data.roles] to the
            column names (see [`colnames`][getml.DataFrame.colnames]).

            The `roles` dictionary is expected to have the following format:
            ```python
            roles = {getml.data.role.numeric: ["colname1", "colname2"],
                     getml.data.role.target: ["colname3"]}
            ```
            Otherwise, you can use the [`Roles`][getml.data.Roles] class.

        ignore:
            Only relevant when roles is not None.
            Determines what you want to do with any colnames not
            mentioned in roles. Do you want to ignore them (True)
            or read them in as unused columns (False)?

        dry:
            If set to True, the data will not be read. Instead, the method
            will return the inferred roles.
    Returns:
            Handler of the underlying data.
    """

    if not isinstance(data, dict):
        raise TypeError("'data' must be dict.")

    return cls.from_arrow(
        table=pa.Table.from_pydict(data),
        name=name,
        roles=roles,
        ignore=ignore,
        dry=dry,
    )

from_json `classmethod`

from_json(
    json_str: str,
    name: str,
    roles: Optional[
        Union[Dict[Union[Role, str], Iterable[str]], Roles]
    ] = None,
    ignore: bool = False,
    dry: Literal[False] = False,
) -> DataFrame

from_json(
    json_str: str,
    name: str,
    roles: Optional[
        Union[Dict[Union[Role, str], Iterable[str]], Roles]
    ] = None,
    ignore: bool = False,
    dry: Literal[True] = True,
) -> Roles

from_json(
    json_str: str,
    name: str,
    roles: Optional[
        Union[Dict[Union[Role, str], Iterable[str]], Roles]
    ] = None,
    ignore: bool = False,
    dry: bool = False,
) -> Union[DataFrame, Roles]

Create a new DataFrame from a JSON string.

It will construct a data frame object in the Engine, fill it with the data read from the JSON string, and return a corresponding DataFrame handle.

PARAMETER	DESCRIPTION
`json_str`	The JSON string containing the data. The json_str should be in the following format: `json_str = '{"col1": [1.0, 2.0, 1.0], "col2": ["A", "B", "C"]}'` TYPE: `str`
`name`	Name of the data frame to be created. TYPE: `str`
`roles`	Maps the `roles` to the column names (see `colnames`). The `roles` dictionary is expected to have the following format: `roles = {getml.data.role.numeric: ["colname1", "colname2"], getml.data.role.target: ["colname3"]}` Otherwise, you can use the `Roles` class. TYPE: `Optional[Union[Dict[Union[Role, str], Iterable[str]], Roles]]` DEFAULT: `None`
`ignore`	Only relevant when roles is not None. Determines what you want to do with any colnames not mentioned in roles. Do you want to ignore them (True) or read them in as unused columns (False)? TYPE: `bool` DEFAULT: `False`
`dry`	If set to True, the data will not be read. Instead, the method will return the inferred roles. TYPE: `bool` DEFAULT: `False`

RETURNS	DESCRIPTION
`Union[DataFrame, Roles]`	Handler of the underlying data or, if `dry` is True, the inferred roles.
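
Because the source listing shows `json_str` being parsed with `json.loads`, the string has to be valid JSON, which means double-quoted keys and strings. One way to guarantee that is to build it with `json.dumps`. A minimal sketch using only the standard library; the column names are illustrative:

```python
import json

data = {"col1": [1.0, 2.0, 1.0], "col2": ["A", "B", "C"]}

# json.dumps always emits double-quoted keys and strings.
json_str = json.dumps(data)

# The string parses back into the original column-oriented dict.
round_trip = json.loads(json_str)

# The repr of a Python dict uses single quotes and is not valid JSON.
try:
    json.loads(str(data))
    repr_is_valid_json = True
except json.JSONDecodeError:
    repr_is_valid_json = False
```

A string produced this way can be passed as `json_str`; a string produced with `str(data)` or `repr(data)` cannot.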

Source code in getml/data/data_frame.py

@classmethod
def from_json(
    cls,
    json_str: str,
    name: str,
    roles: Optional[Union[Dict[Union[Role, str], Iterable[str]], Roles]] = None,
    ignore: bool = False,
    dry: bool = False,
) -> Union[DataFrame, Roles]:
    """Create a new DataFrame from a JSON string.

    It will construct a data frame object in the Engine, fill it
    with the data read from the JSON string, and return a
    corresponding [`DataFrame`][getml.DataFrame] handle.

    Args:
        json_str:
            The JSON string containing the data.
            The json_str should be in the following format:
            ```python
            json_str = '{"col1": [1.0, 2.0, 1.0], "col2": ["A", "B", "C"]}'
            ```
        name:
            Name of the data frame to be created.

        roles:
            Maps the [`roles`][getml.data.roles] to the
            column names (see [`colnames`][getml.DataFrame.colnames]).

            The `roles` dictionary is expected to have the following format:
            ```python
            roles = {getml.data.role.numeric: ["colname1", "colname2"],
                     getml.data.role.target: ["colname3"]}
            ```
            Otherwise, you can use the [`Roles`][getml.data.Roles] class.

        ignore:
            Only relevant when roles is not None.
            Determines what you want to do with any colnames not
            mentioned in roles. Do you want to ignore them (True)
            or read them in as unused columns (False)?

        dry:
            If set to True, the data will not be read. Instead, the method
            will return the inferred roles.

    Returns:
        Handler of the underlying data.

    """

    if not isinstance(json_str, str):
        raise TypeError("'json_str' must be str.")

    return cls.from_dict(
        data=json.loads(json_str),
        name=name,
        roles=roles,
        ignore=ignore,
        dry=dry,
    )

from_pandas `classmethod`

from_pandas(
    pandas_df: DataFrame,
    name: str,
    roles: Optional[
        Union[Dict[Union[Role, str], Iterable[str]], Roles]
    ] = None,
    ignore: bool = False,
    dry: Literal[False] = False,
) -> DataFrame

from_pandas(
    pandas_df: DataFrame,
    name: str,
    roles: Optional[
        Union[Dict[Union[Role, str], Iterable[str]], Roles]
    ] = None,
    ignore: bool = False,
    dry: Literal[True] = True,
) -> Roles

from_pandas(
    pandas_df: DataFrame,
    name: str,
    roles: Optional[
        Union[Dict[Union[Role, str], Iterable[str]], Roles]
    ] = None,
    ignore: bool = False,
    dry: bool = False,
) -> Union[DataFrame, Roles]

Create a DataFrame from a pandas.DataFrame.

It will construct a data frame object in the Engine, fill it with the data read from the pandas.DataFrame, and return a corresponding DataFrame handle.

PARAMETER	DESCRIPTION
`pandas_df`	The table to be read. TYPE: `DataFrame`
`name`	Name of the data frame to be created. TYPE: `str`
`roles`	Maps the `roles` to the column names (see `colnames`). The `roles` dictionary is expected to have the following format: `roles = {getml.data.role.numeric: ["colname1", "colname2"], getml.data.role.target: ["colname3"]}` Otherwise, you can use the `Roles` class. TYPE: `Optional[Union[Dict[Union[Role, str], Iterable[str]], Roles]]` DEFAULT: `None`
`ignore`	Only relevant when roles is not None. Determines what you want to do with any colnames not mentioned in roles. Do you want to ignore them (True) or read them in as unused columns (False)? TYPE: `bool` DEFAULT: `False`
`dry`	If set to True, the data will not be read. Instead, the method will return the inferred roles. TYPE: `bool` DEFAULT: `False`

RETURNS	DESCRIPTION
`Union[DataFrame, Roles]`	Handler of the underlying data or, if `dry` is True, the inferred roles.
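
The three signatures listed for this method differ only in the type of `dry`, which appears to follow the `typing.overload` pattern: the `Literal` type of `dry` tells a type checker whether a call returns a `DataFrame` or a `Roles` object. The pattern can be sketched with stand-in classes; `Frame`, `InferredRoles`, and `load` are hypothetical names for illustration and not getML's API.

```python
from typing import Literal, Union, overload


class Frame:
    """Stand-in for a data frame handle."""


class InferredRoles:
    """Stand-in for the inferred roles."""


@overload
def load(name: str, dry: Literal[False] = False) -> Frame: ...
@overload
def load(name: str, dry: Literal[True]) -> InferredRoles: ...
@overload
def load(name: str, dry: bool = False) -> Union[Frame, InferredRoles]: ...


def load(name: str, dry: bool = False) -> Union[Frame, InferredRoles]:
    # A dry run only reports what would be inferred; nothing is loaded.
    if dry:
        return InferredRoles()
    return Frame()


frame = load("table")            # a type checker infers Frame
roles = load("table", dry=True)  # a type checker infers InferredRoles
```

The third overload covers calls where `dry` is only known to be a `bool` at type-checking time, in which case the caller has to narrow the result with `isinstance`.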

Source code in getml/data/data_frame.py

@classmethod
def from_pandas(
    cls,
    pandas_df: pd.DataFrame,
    name: str,
    roles: Optional[Union[Dict[Union[Role, str], Iterable[str]], Roles]] = None,
    ignore: bool = False,
    dry: bool = False,
) -> Union[DataFrame, Roles]:
    """Create a DataFrame from a `pandas.DataFrame`.

    It will construct a data frame object in the Engine, fill it
    with the data read from the `pandas.DataFrame`, and
    return a corresponding [`DataFrame`][getml.DataFrame] handle.

    Args:
        pandas_df:
            The table to be read.

        name:
            Name of the data frame to be created.

        roles:
            Maps the [`roles`][getml.data.roles] to the
            column names (see [`colnames`][getml.DataFrame.colnames]).

            The `roles` dictionary is expected to have the following format:
            ```python
            roles = {getml.data.role.numeric: ["colname1", "colname2"],
                     getml.data.role.target: ["colname3"]}
            ```
            Otherwise, you can use the [`Roles`][getml.data.Roles] class.

        ignore:
            Only relevant when roles is not None.
            Determines what you want to do with any colnames not
            mentioned in roles. Do you want to ignore them (True)
            or read them in as unused columns (False)?

        dry:
            If set to True, the data will not be read. Instead, the method
            will return the inferred roles.

    Returns:
        Handler of the underlying data.
    """

    # ------------------------------------------------------------

    if not isinstance(pandas_df, pd.DataFrame):
        raise TypeError("'pandas_df' must be of type pandas.DataFrame.")

    if not isinstance(name, str):
        raise TypeError("'name' must be str.")

    # The content of roles is checked in the class constructor called below.
    if roles is not None and not isinstance(roles, (dict, Roles)):
        raise TypeError("'roles' must be a getml.data.Roles object, a dict or None.")

    if not isinstance(ignore, bool):
        raise TypeError("'ignore' must be bool.")

    if not isinstance(dry, bool):
        raise TypeError("'dry' must be bool.")

    # ------------------------------------------------------------

    if metadata := pandas_df.attrs.get("getml"):
        sniffed_roles = Roles.from_dict(metadata["roles"])
    else:
        sniffed_roles = sniff_schema(
            pa.Schema.from_pandas(pandas_df, preserve_index=False)
        )

    roles = _prepare_roles(roles, sniffed_roles, ignore_sniffed_roles=ignore)

    if dry:
        return roles

    data_frame = cls(name, roles)

    return data_frame.read_pandas(pandas_df=pandas_df, append=False)

from_parquet `classmethod`

from_parquet(
    fnames: Union[str, Iterable[str]],
    name: str,
    roles: Optional[
        Union[Dict[Union[Role, str], Iterable[str]], Roles]
    ] = None,
    ignore: bool = False,
    dry: Literal[False] = False,
    colnames: Iterable[str] = (),
) -> DataFrame

from_parquet(
    fnames: Union[str, Iterable[str]],
    name: str,
    roles: Optional[
        Union[Dict[Union[Role, str], Iterable[str]], Roles]
    ] = None,
    ignore: bool = False,
    dry: Literal[True] = True,
    colnames: Iterable[str] = (),
) -> Roles

from_parquet(
    fnames: Union[str, Iterable[str]],
    name: str,
    roles: Optional[
        Union[Dict[Union[Role, str], Iterable[str]], Roles]
    ] = None,
    ignore: bool = False,
    dry: bool = False,
    colnames: Iterable[str] = (),
) -> Union[DataFrame, Roles]

Create a DataFrame from parquet files.

This is one of the fastest ways to get data into the getML Engine.

PARAMETER	DESCRIPTION
`fnames`	The path of the parquet file(s) to be read. TYPE: `Union[str, Iterable[str]]`
`name`	Name of the data frame to be created. TYPE: `str`
`roles`	Maps the `roles` to the column names (see `colnames`). The `roles` dictionary is expected to have the following format: `roles = {getml.data.role.numeric: ["colname1", "colname2"], getml.data.role.target: ["colname3"]}` Otherwise, you can use the `Roles` class. TYPE: `Optional[Union[Dict[Union[Role, str], Iterable[str]], Roles]]` DEFAULT: `None`
`ignore`	Only relevant when roles is not None. Determines what you want to do with any colnames not mentioned in roles. Do you want to ignore them (True) or read them in as unused columns (False)? TYPE: `bool` DEFAULT: `False`
`dry`	If set to True, the data will not be read. Instead, the method will return the inferred roles. TYPE: `bool` DEFAULT: `False`
`colnames`	Not described in the docstring. According to the source listing, it is passed on to the Parquet sniffer and to `read_parquet`. TYPE: `Iterable[str]` DEFAULT: `()`

RETURNS	DESCRIPTION
`Union[DataFrame, Roles]`	Handler of the underlying data or, if `dry` is True, the inferred roles.
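
`fnames` accepts either a single path or an iterable of paths. The source listing wraps a lone string in a list first, which matters because a Python string is itself an iterable of characters. A minimal stand-alone sketch of that normalization; `normalize_fnames` is a hypothetical helper name and the file names are placeholders:

```python
from typing import Iterable, List, Union


def normalize_fnames(fnames: Union[str, Iterable[str]]) -> List[str]:
    """Return the file names as a list, wrapping a single string."""
    if isinstance(fnames, str):
        return [fnames]
    return list(fnames)


single = normalize_fnames("loans.parquet")
several = normalize_fnames(("a.parquet", "b.parquet"))

# Without the isinstance check, the string would be split into characters.
naive = list("loans.parquet")
```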

Source code in getml/data/data_frame.py

@classmethod
def from_parquet(
    cls,
    fnames: Union[str, Iterable[str]],
    name: str,
    roles: Optional[Union[Dict[Union[Role, str], Iterable[str]], Roles]] = None,
    ignore: bool = False,
    dry: bool = False,
    colnames: Iterable[str] = (),
) -> Union[DataFrame, Roles]:
    """Create a DataFrame from parquet files.

    This is one of the fastest ways to get data into the
    getML Engine.

    Args:
        fnames:
            The path of the parquet file(s) to be read.

        name:
            Name of the data frame to be created.

        roles:
            Maps the [`roles`][getml.data.roles] to the
            column names (see [`colnames`][getml.DataFrame.colnames]).

            The `roles` dictionary is expected to have the following format:
            ```python
            roles = {getml.data.role.numeric: ["colname1", "colname2"],
                     getml.data.role.target: ["colname3"]}
            ```
            Otherwise, you can use the [`Roles`][getml.data.Roles] class.

        ignore:
            Only relevant when roles is not None.
            Determines what you want to do with any colnames not
            mentioned in roles. Do you want to ignore them (True)
            or read them in as unused columns (False)?

        dry:
            If set to True, the data will not be read. Instead, the method
            will return the inferred roles.

    Returns:
        Handler of the underlying data.
    """

    # ------------------------------------------------------------

    if isinstance(fnames, str):
        fnames = [fnames]

    if not isinstance(name, str):
        raise TypeError("'name' must be str.")

    # The content of roles is checked in the class constructor called below.
    if roles is not None and not isinstance(roles, (dict, Roles)):
        raise TypeError("'roles' must be a getml.data.Roles object, a dict or None.")

    if not isinstance(ignore, bool):
        raise TypeError("'ignore' must be bool.")

    if not isinstance(dry, bool):
        raise TypeError("'dry' must be bool.")

    # ------------------------------------------------------------

    sniffed_roles = sniff_parquet(fnames, colnames)

    roles = _prepare_roles(roles, sniffed_roles, ignore_sniffed_roles=ignore)

    if dry:
        return roles

    data_frame = cls(name, roles)

    return data_frame.read_parquet(fnames=fnames, append=False, colnames=colnames)

from_pyspark `classmethod`

from_pyspark(
    spark_df: DataFrame,
    name: str,
    roles: Optional[
        Union[Dict[Union[Role, str], Iterable[str]], Roles]
    ] = None,
    ignore: bool = False,
    dry: Literal[False] = False,
) -> DataFrame

from_pyspark(
    spark_df: DataFrame,
    name: str,
    roles: Optional[
        Union[Dict[Union[Role, str], Iterable[str]], Roles]
    ] = None,
    ignore: bool = False,
    dry: Literal[True] = True,
) -> Roles

from_pyspark(
    spark_df: DataFrame,
    name: str,
    roles: Optional[
        Union[Dict[Union[Role, str], Iterable[str]], Roles]
    ] = None,
    ignore: bool = False,
    dry: bool = False,
) -> Union[DataFrame, Roles]

Create a DataFrame from a pyspark.sql.DataFrame.

It will construct a data frame object in the Engine, fill it with the data read from the pyspark.sql.DataFrame, and return a corresponding DataFrame handle.

PARAMETER	DESCRIPTION
`spark_df`	The table to be read. TYPE: `DataFrame`
`name`	Name of the data frame to be created. TYPE: `str`
`roles`	Maps the `roles` to the column names (see `colnames`). The `roles` dictionary is expected to have the following format: `roles = {getml.data.role.numeric: ["colname1", "colname2"], getml.data.role.target: ["colname3"]}` Otherwise, you can use the `Roles` class. TYPE: `Optional[Union[Dict[Union[Role, str], Iterable[str]], Roles]]` DEFAULT: `None`
`ignore`	Only relevant when roles is not None. Determines what you want to do with any colnames not mentioned in roles. Do you want to ignore them (True) or read them in as unused columns (False)? TYPE: `bool` DEFAULT: `False`
`dry`	If set to True, the data will not be read. Instead, the method will return the inferred roles. TYPE: `bool` DEFAULT: `False`

RETURNS	DESCRIPTION
`Union[DataFrame, Roles]`	Handler of the underlying data or, if `dry` is True, the inferred roles.

Source code in getml/data/data_frame.py

@classmethod
def from_pyspark(
    cls,
    spark_df: pyspark.sql.DataFrame,
    name: str,
    roles: Optional[Union[Dict[Union[Role, str], Iterable[str]], Roles]] = None,
    ignore: bool = False,
    dry: bool = False,
) -> Union[DataFrame, Roles]:
    """Create a DataFrame from a `pyspark.sql.DataFrame`.

    It will construct a data frame object in the Engine, fill it
    with the data read from the `pyspark.sql.DataFrame`, and
    return a corresponding [`DataFrame`][getml.DataFrame] handle.

    Args:
        spark_df:
            The table to be read.

        name:
            Name of the data frame to be created.

        roles:
            Maps the [`roles`][getml.data.roles] to the
            column names (see [`colnames`][getml.DataFrame.colnames]).

            The `roles` dictionary is expected to have the following format:
            ```python
            roles = {getml.data.role.numeric: ["colname1", "colname2"],
                     getml.data.role.target: ["colname3"]}
            ```

            Otherwise, you can use the [`Roles`][getml.data.Roles] class.

        ignore:
            Only relevant when roles is not None.
            Determines what you want to do with any colnames not
            mentioned in roles. Do you want to ignore them (True)
            or read them in as unused columns (False)?

        dry:
            If set to True, the data will not be read. Instead, the method
            will return the inferred roles.

    Returns:
            Handler of the underlying data.
    """

    # ------------------------------------------------------------

    if not isinstance(name, str):
        raise TypeError("'name' must be str.")

    # The content of roles is checked in the class constructor called below.
    if roles is not None and not isinstance(roles, (dict, Roles)):
        raise TypeError("'roles' must be a getml.data.Roles object, a dict or None.")

    if not isinstance(ignore, bool):
        raise TypeError("'ignore' must be bool.")

    if not isinstance(dry, bool):
        raise TypeError("'dry' must be bool.")

    # ------------------------------------------------------------

    head = spark_df.limit(2).toPandas()

    sniffed_roles = sniff_schema(pa.Schema.from_pandas(head))

    roles = _prepare_roles(roles, sniffed_roles, ignore_sniffed_roles=ignore)

    if dry:
        return roles

    data_frame = cls(name, roles)

    return data_frame.read_pyspark(spark_df=spark_df, append=False)

from_query `classmethod`

from_query(
    query: str,
    name: str,
    roles: Optional[
        Union[Dict[Union[Role, str], Iterable[str]], Roles]
    ] = None,
    ignore: bool = False,
    dry: Literal[False] = False,
    conn: Optional[Connection] = None,
) -> DataFrame

from_query(
    query: str,
    name: str,
    roles: Optional[
        Union[Dict[Union[Role, str], Iterable[str]], Roles]
    ] = None,
    ignore: bool = False,
    dry: Literal[True] = True,
    conn: Optional[Connection] = None,
) -> Roles

from_query(
    query: str,
    name: str,
    roles: Optional[
        Union[Dict[Union[Role, str], Iterable[str]], Roles]
    ] = None,
    ignore: bool = False,
    dry: bool = False,
    conn: Optional[Connection] = None,
) -> Union[DataFrame, Roles]

Create a DataFrame from a query run on a database.

It will construct a data frame object in the Engine, fill it with the data read from the query executed on the connected database (see database), and return a corresponding DataFrame handle.

PARAMETER	DESCRIPTION
`query`	The SQL query to be read. TYPE: `str`
`name`	Name of the data frame to be created. TYPE: `str`
`roles`	Maps the `roles` to the column names (see `colnames`). The `roles` dictionary is expected to have the following format: `roles = {getml.data.role.numeric: ["colname1", "colname2"], getml.data.role.target: ["colname3"]}` Otherwise, you can use the `Roles` class. TYPE: `Optional[Union[Dict[Union[Role, str], Iterable[str]], Roles]]` DEFAULT: `None`
`ignore`	Only relevant when roles is not None. Determines what you want to do with any colnames not mentioned in roles. Do you want to ignore them (True) or read them in as unused columns (False)? TYPE: `bool` DEFAULT: `False`
`dry`	If set to True, the data will not be read. Instead, the method will return the inferred roles. TYPE: `bool` DEFAULT: `False`
`conn`	The `database.Connection` to be used. If you don't explicitly pass a connection, the Engine will use the default connection. TYPE: `Optional[Connection]` DEFAULT: `None`

RETURNS	DESCRIPTION
`Union[DataFrame, Roles]`	Handler of the underlying data or, if `dry` is True, the inferred roles.

Example

getml.database.connect_mysql(
    host="relational.fel.cvut.cz",
    port=3306,
    dbname="financial",
    user="guest",
    password="ctu-relational"
)

loan = getml.DataFrame.from_query(
    query='SELECT * FROM "loan";', name='loan')

Source code in getml/data/data_frame.py

@classmethod
def from_query(
    cls,
    query: str,
    name: str,
    roles: Optional[Union[Dict[Union[Role, str], Iterable[str]], Roles]] = None,
    ignore: bool = False,
    dry: bool = False,
    conn: Optional[Connection] = None,
) -> Union[DataFrame, Roles]:
    """Create a DataFrame from a query run on a database.

    It will construct a data frame object in the engine, fill it
    with the data read from the query executed on the connected
    database (see [`database`][getml.database]), and return a
    corresponding [`DataFrame`][getml.DataFrame] handle.

    Args:
        query:
            The SQL query to be read.

        name:
            Name of the data frame to be created.

        roles:
            Maps the [`roles`][getml.data.roles] to the
            column names (see [`colnames`][getml.DataFrame.colnames]).

            The `roles` dictionary is expected to have the following format:
            ```python
            roles = {getml.data.role.numeric: ["colname1", "colname2"],
                     getml.data.role.target: ["colname3"]}
            ```

            Otherwise, you can use the [`Roles`][getml.data.Roles] class.

        ignore:
            Only relevant when roles is not None.
            Determines what you want to do with any colnames not
            mentioned in roles. Do you want to ignore them (True)
            or read them in as unused columns (False)?

        dry:
            If set to True, the data will not be read. Instead, the method
            will return the inferred roles.

        conn:
            The [`database.Connection`][getml.database.Connection] to be used.
            If you don't explicitly pass a connection, the engine
            will use the default connection.

    Returns:
            Handler of the underlying data.

    ??? example
        ```python
        getml.database.connect_mysql(
            host="relational.fel.cvut.cz",
            port=3306,
            dbname="financial",
            user="guest",
            password="ctu-relational"
        )

        loan = getml.DataFrame.from_query(
            query='SELECT * FROM "loan";', name='loan')
        ```
    """

    # -------------------------------------------

    if not isinstance(query, str):
        raise TypeError("'query' must be str.")

    if not isinstance(name, str):
        raise TypeError("'name' must be str.")

    # The content of roles is checked in the class constructor called below.
    if roles is not None and not isinstance(roles, (dict, Roles)):
        raise TypeError(
            "'roles' must be a getml.data.Roles object, a dict or None."
        )

    if not isinstance(ignore, bool):
        raise TypeError("'ignore' must be bool.")

    if not isinstance(dry, bool):
        raise TypeError("'dry' must be bool.")

    # -------------------------------------------

    conn = conn or database.Connection()

    # ------------------------------------------------------------

    sniffed_roles = _sniff_query(query, name, conn)

    roles = _prepare_roles(roles, sniffed_roles, ignore_sniffed_roles=ignore)

    if dry:
        return roles

    data_frame = cls(name, roles)

    return data_frame.read_query(query=query, append=False, conn=conn)

from_s3 `classmethod`

from_s3(
    bucket: str,
    keys: Iterable[str],
    region: str,
    name: str,
    num_lines_sniffed: int = 1000,
    num_lines_read: int = 0,
    sep: str = ",",
    skip: int = 0,
    colnames: Optional[Iterable[str]] = None,
    roles: Optional[
        Union[Dict[Union[Role, str], Iterable[str]], Roles]
    ] = None,
    ignore: bool = False,
    dry: Literal[False] = False,
) -> DataFrame

from_s3(
    bucket: str,
    keys: Iterable[str],
    region: str,
    name: str,
    num_lines_sniffed: int = 1000,
    num_lines_read: int = 0,
    sep: str = ",",
    skip: int = 0,
    colnames: Optional[Iterable[str]] = None,
    roles: Optional[
        Union[Dict[Union[Role, str], Iterable[str]], Roles]
    ] = None,
    ignore: bool = False,
    dry: Literal[True] = True,
) -> Roles

from_s3(
    bucket: str,
    keys: Iterable[str],
    region: str,
    name: str,
    num_lines_sniffed: int = 1000,
    num_lines_read: int = 0,
    sep: str = ",",
    skip: int = 0,
    colnames: Optional[Iterable[str]] = None,
    roles: Optional[
        Union[Dict[Union[Role, str], Iterable[str]], Roles]
    ] = None,
    ignore: bool = False,
    dry: bool = False,
) -> Union[DataFrame, Roles]

Create a DataFrame from CSV files located in an S3 bucket.

This classmethod will construct a data frame object in the Engine, fill it with the data read from the CSV file(s), and return a corresponding DataFrame handle.

Enterprise edition

This feature is exclusive to the Enterprise edition and is not available in the Community edition. Discover the benefits of the Enterprise edition and compare their features.

For licensing information and technical support, please contact us.

Note

Note that S3 is not supported on Windows.

PARAMETER	DESCRIPTION
`bucket`	The bucket from which to read the files. TYPE: `str`
`keys`	The list of keys (files in the bucket) to be read. TYPE: `Iterable[str]`
`region`	The region in which the bucket is located. TYPE: `str`
`name`	Name of the data frame to be created. TYPE: `str`
`num_lines_sniffed`	Number of lines analyzed by the sniffer. TYPE: `int` DEFAULT: `1000`
`num_lines_read`	Number of lines read from each file. Set to 0 to read in the entire file. TYPE: `int` DEFAULT: `0`
`sep`	The separator used for separating fields. TYPE: `str` DEFAULT: `','`
`skip`	Number of lines to skip at the beginning of each file. TYPE: `int` DEFAULT: `0`
`colnames`	The first line of a CSV file usually contains the column names. When this is not the case, you need to explicitly pass them. TYPE: `Optional[Iterable[str]]` DEFAULT: `None`
`roles`	Maps the `roles` to the column names (see `colnames`). The `roles` dictionary is expected to have the following format: `roles = {getml.data.role.numeric: ["colname1", "colname2"], getml.data.role.target: ["colname3"]}` Otherwise, you can use the `Roles` class. TYPE: `Optional[Union[Dict[Union[Role, str], Iterable[str]], Roles]]` DEFAULT: `None`
`ignore`	Only relevant when roles is not None. Determines what you want to do with any colnames not mentioned in roles. Do you want to ignore them (True) or read them in as unused columns (False)? TYPE: `bool` DEFAULT: `False`
`dry`	If set to True, the data will not be read. Instead, the method will return the inferred roles. TYPE: `bool` DEFAULT: `False`

RETURNS	DESCRIPTION
`Union[DataFrame, Roles]`	Handler of the underlying data, or the inferred roles if `dry` is True.

Example

Let's assume you have two CSV files - file1.csv and file2.csv - in the bucket. You can import their data into the getML Engine using the following commands:

getml.engine.set_s3_access_key_id("YOUR-ACCESS-KEY-ID")
getml.engine.set_s3_secret_access_key("YOUR-SECRET-ACCESS-KEY")

data_frame_expd = getml.DataFrame.from_s3(
    bucket="your-bucket-name",
    keys=["file1.csv", "file2.csv"],
    region="us-east-2",
    name="MY DATA FRAME",
    sep=';'
)

You can also set the access credentials as environment variables before you launch the getML Engine.

Also refer to the documentation on from_csv for further information on overriding the CSV sniffer for greater type safety.
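
The interplay of `roles` and `ignore` described in the parameter table can be sketched in plain Python. This is only an illustration of the documented behavior, not getML's internal `_prepare_roles`; the helper name `merge_roles` and the role strings used as dictionary keys are assumptions made for the example.

```python
from typing import Dict, List, Optional

RoleMap = Dict[str, List[str]]


def merge_roles(
    explicit: Optional[RoleMap], sniffed: RoleMap, ignore: bool = False
) -> RoleMap:
    """Hypothetical helper: combine user-supplied roles with sniffed ones."""
    if explicit is None:
        # Nothing was passed, so the sniffed roles are used as they are.
        return {role: list(cols) for role, cols in sniffed.items()}

    merged = {role: list(cols) for role, cols in explicit.items()}
    if ignore:
        # Columns not mentioned in `explicit` are dropped.
        return merged

    # Otherwise, unmentioned columns keep the (unused) role the sniffer gave them.
    mentioned = {col for cols in explicit.values() for col in cols}
    for role, cols in sniffed.items():
        leftover = [col for col in cols if col not in mentioned]
        if leftover:
            merged.setdefault(role, []).extend(leftover)
    return merged


sniffed = {
    "unused_float": ["colname1", "colname2", "colname3"],
    "unused_string": ["colname4"],
}
explicit = {"numerical": ["colname1", "colname2"], "target": ["colname3"]}

kept = merge_roles(explicit, sniffed, ignore=False)
dropped = merge_roles(explicit, sniffed, ignore=True)
```

With `ignore=False`, `colname4` survives as an unused column; with `ignore=True`, only the three explicitly mentioned columns remain.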

Source code in getml/data/data_frame.py

@classmethod
def from_s3(
    cls,
    bucket: str,
    keys: Iterable[str],
    region: str,
    name: str,
    num_lines_sniffed: int = 1000,
    num_lines_read: int = 0,
    sep: str = ",",
    skip: int = 0,
    colnames: Optional[Iterable[str]] = None,
    roles: Optional[Union[Dict[Union[Role, str], Iterable[str]], Roles]] = None,
    ignore: bool = False,
    dry: bool = False,
) -> Union[DataFrame, Roles]:
    """Create a DataFrame from CSV files located in an S3 bucket.

    This classmethod will construct a data
    frame object in the Engine, fill it with the data read from
    the CSV file(s), and return a corresponding
    [`DataFrame`][getml.DataFrame] handle.

    enterprise-adm: Enterprise edition
        This feature is exclusive to the Enterprise edition and is not available in the Community edition. Discover the [benefits of the Enterprise edition][enterprise-benefits] and [compare their features][enterprise-feature-list].

        For licensing information and technical support, please [contact us][contact-page].

    Note:
        Note that S3 is not supported on Windows.

    Args:
        bucket:
            The bucket from which to read the files.

        keys:
            The list of keys (files in the bucket) to be read.

        region:
            The region in which the bucket is located.

        name:
            Name of the data frame to be created.

        num_lines_sniffed:
            Number of lines analyzed by the sniffer.

        num_lines_read:
            Number of lines read from each file.
            Set to 0 to read in the entire file.

        sep:
            The separator used for separating fields.

        skip:
            Number of lines to skip at the beginning of each file.

        colnames:
            The first line of a CSV file
            usually contains the column names. When this is not the case,
            you need to explicitly pass them.

        roles:
            Maps the [`roles`][getml.data.roles] to the
            column names (see [`colnames`][getml.DataFrame.colnames]).

            The `roles` dictionary is expected to have the following format:
            ```python
            roles = {getml.data.role.numeric: ["colname1", "colname2"],
                     getml.data.role.target: ["colname3"]}
            ```
            Otherwise, you can use the [`Roles`][getml.data.Roles] class.

        ignore:
            Only relevant when roles is not None.
            Determines what you want to do with any colnames not
            mentioned in roles. Do you want to ignore them (True)
            or read them in as unused columns (False)?

        dry:
            If set to True, the data will not be read. Instead, the method
            will return the inferred roles.

    Returns:
            Handler of the underlying data, or the inferred roles if `dry` is True.

    ??? example
        Let's assume you have two CSV files - *file1.csv* and
        *file2.csv* - in the bucket. You can
        import their data into the getML Engine using the following
        commands:
        ```python
        getml.engine.set_s3_access_key_id("YOUR-ACCESS-KEY-ID")
        getml.engine.set_s3_secret_access_key("YOUR-SECRET-ACCESS-KEY")

        data_frame_expd = getml.DataFrame.from_s3(
            bucket="your-bucket-name",
            keys=["file1.csv", "file2.csv"],
            region="us-east-2",
            name="MY DATA FRAME",
            sep=';'
        )
        ```

        You can also set the access credentials as environment variables
        before you launch the getML Engine.

        Also refer to the documentation on [`from_csv`][getml.DataFrame.from_csv]
        for further information on overriding the CSV sniffer for greater
        type safety.

    """

    if isinstance(keys, str):
        keys = [keys]

    if not isinstance(bucket, str):
        raise TypeError("'bucket' must be str.")

    if not _is_non_empty_typed_list(keys, str):
        raise TypeError("'keys' must be either a string or a list of str")

    if not isinstance(region, str):
        raise TypeError("'region' must be str.")

    if not isinstance(name, str):
        raise TypeError("'name' must be str.")

    if not isinstance(num_lines_sniffed, numbers.Real):
        raise TypeError("'num_lines_sniffed' must be a real number")

    if not isinstance(num_lines_read, numbers.Real):
        raise TypeError("'num_lines_read' must be a real number")

    if not isinstance(sep, str):
        raise TypeError("'sep' must be str.")

    if not isinstance(skip, numbers.Real):
        raise TypeError("'skip' must be a real number")

    if roles is not None and not isinstance(roles, (dict, Roles)):
        raise TypeError("'roles' must be a getml.data.Roles object, a dict or None.")

    if not isinstance(ignore, bool):
        raise TypeError("'ignore' must be bool.")

    if not isinstance(dry, bool):
        raise TypeError("'dry' must be bool.")

    if colnames is not None and not _is_non_empty_typed_list(colnames, str):
        raise TypeError(
            "'colnames' must be either None or a non-empty list of str."
        )

    sniffed_roles = _sniff_s3(
        bucket=bucket,
        keys=keys,
        region=region,
        num_lines_sniffed=int(num_lines_sniffed),
        sep=sep,
        skip=int(skip),
        colnames=colnames,
    )

    roles = _prepare_roles(roles, sniffed_roles, ignore)

    if dry:
        return roles

    data_frame = cls(name, roles)

    return data_frame.read_s3(
        bucket=bucket,
        keys=keys,
        region=region,
        append=False,
        sep=sep,
        num_lines_read=int(num_lines_read),
        skip=int(skip),
        colnames=colnames,
    )

from_view `classmethod`

from_view(
    view: View, name: str, dry: Literal[False] = False
) -> DataFrame

from_view(
    view: View, name: str, dry: Literal[True] = True
) -> Roles

from_view(
    view: View, name: str, dry: bool = False
) -> Union[DataFrame, Roles]

Create a DataFrame from a View.

This classmethod will construct a data frame object in the Engine, fill it with the data read from the View, and return a corresponding DataFrame handle.

PARAMETER	DESCRIPTION
`view`	The view from which we want to read the data. TYPE: `View`
`name`	Name of the data frame to be created. TYPE: `str`
`dry`	If set to True, the data will not be read. Instead, the method will return the roles of the view. TYPE: `bool` DEFAULT: `False`

RETURNS	DESCRIPTION
`Union[DataFrame, Roles]`	Handler of the underlying data, or the roles of the view if `dry` is True.
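
The signatures listed for from_view follow the usual `typing.overload` pattern, in which the value of a flag determines the return type. A minimal sketch of that pattern, using hypothetical stand-ins (`ToyView`, `ToyFrame`, and a plain dict in place of `Roles`) rather than the getML classes:

```python
from typing import Dict, List, Literal, Union, overload

Roles = Dict[str, List[str]]  # hypothetical stand-in for getml.data.Roles


class ToyView:
    """Hypothetical view: roles plus the rows it would yield."""

    def __init__(self, roles: Roles, rows: List[dict]) -> None:
        self.roles = roles
        self.rows = rows


class ToyFrame:
    """Hypothetical data frame handle."""

    def __init__(self, name: str, roles: Roles) -> None:
        self.name = name
        self.roles = roles
        self.rows: List[dict] = []


@overload
def from_view(view: ToyView, name: str, dry: Literal[False] = False) -> ToyFrame: ...
@overload
def from_view(view: ToyView, name: str, dry: Literal[True]) -> Roles: ...
def from_view(view: ToyView, name: str, dry: bool = False) -> Union[ToyFrame, Roles]:
    if dry:
        # Dry run: report the roles without materializing any data.
        return view.roles
    frame = ToyFrame(name, view.roles)
    frame.rows = list(view.rows)
    return frame


view = ToyView({"numerical": ["x"]}, [{"x": 1.0}, {"x": 2.0}])
frame = from_view(view, "copy")
roles_only = from_view(view, "copy", dry=True)
```

A type checker can then infer `ToyFrame` for the first call and `Roles` for the second without a cast.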

Source code in getml/data/data_frame.py

@classmethod
def from_view(
    cls,
    view: View,
    name: str,
    dry: bool = False,
) -> Union[DataFrame, Roles]:
    """Create a DataFrame from a [`View`][getml.data.View].

    This classmethod will construct a data
    frame object in the Engine, fill it with the data read from
    the [`View`][getml.data.View], and return a corresponding
    [`DataFrame`][getml.DataFrame] handle.

    Args:
        view:
            The view from which we want to read the data.

        name:
            Name of the data frame to be created.

        dry:
            If set to True, the data will not be read. Instead, the method
            will return the roles of the view.

    Returns:
            Handler of the underlying data.


    """
    # ------------------------------------------------------------

    if not isinstance(view, View):
        raise TypeError("'view' must be getml.data.View.")

    if not isinstance(name, str):
        raise TypeError("'name' must be str.")

    if not isinstance(dry, bool):
        raise TypeError("'dry' must be bool.")

    # ------------------------------------------------------------

    if dry:
        return view.roles

    data_frame = cls(name, view.roles)

    # ------------------------------------------------------------

    return data_frame.read_view(view=view, append=False)

load

load() -> DataFrame

Loads saved data from disk.

The data frame object holding the same name as the current DataFrame instance will be loaded from disk into the getML Engine, and the current handler will be updated using refresh.

Example

First, we have to create and import data sets.

d, _ = getml.datasets.make_numerical(population_name = 'test')
getml.data.list_data_frames()

In the output of list_data_frames we can find our underlying data frame object 'test' listed under the 'in_memory' key (it was created and imported by make_numerical). This means the getML Engine so far holds it only in memory (RAM), and we still have to save it to disk in order to load it again or to prevent any loss of information between different sessions.

d.save()
getml.data.list_data_frames()
d2 = getml.DataFrame(name = 'test').load()

RETURNS	DESCRIPTION
`DataFrame`	Updated handle to the underlying data frame in the getML Engine.

Note

When invoking load all changes of the underlying data frame object that took place after the last call to the save method will be lost. Thus, this method enables you to undo changes applied to the DataFrame.

d, _ = getml.datasets.make_numerical()
d.save()

# Accidental change we want to undo
d.rm('column_01')

d.load()

If save hasn't been called on the current instance yet, or it wasn't stored to disk in a previous session, load will throw an exception:

File or directory '../projects/X/data/Y/' not found!

Alternatively, load_data_frame offers an easier way of creating DataFrame handlers to data in the getML Engine.
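
The save/load semantics described above (load restores the last saved state and discards later changes) can be illustrated with a toy model. `ToyEngine` is a hypothetical class written for this sketch; it does not reflect how the getML Engine is implemented.

```python
import copy


class ToyEngine:
    """Hypothetical stand-in: one in-memory store and one on-disk store."""

    def __init__(self) -> None:
        self.in_memory = {}
        self.on_disk = {}

    def save(self, name: str) -> None:
        # Persist a snapshot of the current in-memory state.
        self.on_disk[name] = copy.deepcopy(self.in_memory[name])

    def load(self, name: str) -> None:
        # Restore the last snapshot; fails if nothing was ever saved.
        if name not in self.on_disk:
            raise FileNotFoundError(f"'{name}' was never saved")
        self.in_memory[name] = copy.deepcopy(self.on_disk[name])


engine = ToyEngine()
engine.in_memory["test"] = {"column_01": [1, 2], "target": [0.5, 0.7]}
engine.save("test")

# Accidental change we want to undo
del engine.in_memory["test"]["column_01"]

engine.load("test")
restored_columns = sorted(engine.in_memory["test"])
```

After the call to `load`, the removed column is back because the saved snapshot replaced the modified in-memory state.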

Source code in getml/data/data_frame.py

def load(self) -> DataFrame:
    """Loads saved data from disk.

    The data frame object holding the same name as the current
    [`DataFrame`][getml.DataFrame] instance will be loaded from
    disk into the getML Engine, and the current handler will be
    updated using [`refresh`][getml.DataFrame.refresh].

    ??? example
        First, we have to create and import data sets.
        ```python
        d, _ = getml.datasets.make_numerical(population_name = 'test')
        getml.data.list_data_frames()
        ```

        In the output of [`list_data_frames`][getml.data.list_data_frames] we
        can find our underlying data frame object 'test' listed
        under the 'in_memory' key (it was created and imported by
        [`make_numerical`][getml.datasets.make_numerical]). This means the
        getML Engine so far holds it only in memory (RAM), and we
        still have to [`save`][getml.DataFrame.save] it to
        disk in order to [`load`][getml.DataFrame.load] it
        again or to prevent any loss of information between
        different sessions.
        ```python
        d.save()
        getml.data.list_data_frames()
        d2 = getml.DataFrame(name = 'test').load()
        ```

    Returns:
            Updated handle to the underlying data frame in the getML
            Engine.

    Note:
        When invoking [`load`][getml.DataFrame.load] all
        changes of the underlying data frame object that took
        place after the last call to the
        [`save`][getml.DataFrame.save] method will be
        lost. Thus, this method enables you to undo changes
        applied to the [`DataFrame`][getml.DataFrame].
        ```python
        d, _ = getml.datasets.make_numerical()
        d.save()

        # Accidental change we want to undo
        d.rm('column_01')

        d.load()
        ```
        If [`save`][getml.DataFrame.save] hasn't been called
        on the current instance yet, or it wasn't stored to disk in
        a previous session, [`load`][getml.DataFrame.load]
        will throw an exception:

            File or directory '../projects/X/data/Y/' not found!

        Alternatively, [`load_data_frame`][getml.data.load_data_frame]
        offers an easier way of creating
        [`DataFrame`][getml.DataFrame] handlers to data in the
        getML Engine.

    """

    cmd: Dict[str, Any] = {}
    cmd["type_"] = "DataFrame.load"
    cmd["name_"] = self.name
    comm.send(cmd)
    return self.refresh()

nbytes

nbytes() -> uint64

Size of the data stored in the underlying data frame in the getML Engine.

RETURNS	DESCRIPTION
`uint64`	Size of the underlying object in bytes.

Source code in getml/data/data_frame.py

def nbytes(self) -> np.uint64:
    """Size of the data stored in the underlying data frame in the getML
    Engine.

    Returns:
            Size of the underlying object in bytes.

    """

    cmd: Dict[str, Any] = {}
    cmd["type_"] = "DataFrame.nbytes"
    cmd["name_"] = self.name

    with comm.send_and_get_socket(cmd) as sock:
        msg = comm.recv_string(sock)
        if msg != "Found!":
            sock.close()
            comm.handle_engine_exception(msg)
        nbytes = comm.recv_string(sock)

    return np.uint64(nbytes)

ncols

ncols() -> int

Number of columns in the current instance.

RETURNS	DESCRIPTION
`int`	Overall number of columns

Source code in getml/data/data_frame.py

def ncols(self) -> int:
    """
    Number of columns in the current instance.

    Returns:
            Overall number of columns
    """
    return len(self.colnames)

nrows

nrows() -> int

Number of rows in the current instance.

RETURNS	DESCRIPTION
`int`	Overall number of rows

Source code in getml/data/data_frame.py

def nrows(self) -> int:
    """
    Number of rows in the current instance.

    Returns:
            Overall number of rows
    """

    cmd: Dict[str, Any] = {}
    cmd["type_"] = "DataFrame.nrows"
    cmd["name_"] = self.name

    with comm.send_and_get_socket(cmd) as sock:
        msg = comm.recv_string(sock)
        if msg != "Found!":
            sock.close()
            comm.handle_engine_exception(msg)
        nrows = comm.recv_string(sock)

    return int(nrows)

read_arrow

read_arrow(
    table: Union[RecordBatch, Table, Iterable[RecordBatch]],
    append: bool = False,
) -> DataFrame

Uploads a pyarrow.Table or pyarrow.RecordBatch to the getML Engine.

Replaces the actual content of the underlying data frame in the getML Engine with table.

PARAMETER	DESCRIPTION
`table`	The arrow tablelike to be read as a `DataFrame`. TYPE: `Union[RecordBatch, Table, Iterable[RecordBatch]]`
`append`	If a data frame object holding the same `name` is already present in the getML Engine, should the content in `table` be appended or replace the existing data? TYPE: `bool` DEFAULT: `False`

RETURNS	DESCRIPTION
`DataFrame`	Current instance.

Note

For columns containing pandas.Timestamp there can be small inconsistencies in the order of microseconds when sending the data to the getML Engine. This is due to the way the underlying information is stored.
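
One plausible source of such rounding, assuming time stamps are held as 64-bit floating point seconds since the epoch (an assumption of this sketch; the note above does not spell out the storage format), is the limited resolution of a double at that magnitude:

```python
import math

# Arbitrary nanosecond-resolution time stamp (nanoseconds since the epoch).
ts_ns = 1_700_000_000_123_456_789

# Convert to float seconds, as a float-based store would hold it.
seconds = ts_ns / 1e9

# Spacing between adjacent doubles at this magnitude, in seconds.
resolution = math.ulp(seconds)

# Convert back to nanoseconds and measure what was lost on the way.
round_trip_ns = round(seconds * 1e9)
error_ns = round_trip_ns - ts_ns
```

Here `resolution` is roughly a quarter of a microsecond, so nanosecond digits cannot survive the round trip, while whole seconds and milliseconds are unaffected.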

Source code in getml/data/data_frame.py

def read_arrow(
    self,
    table: Union[pa.RecordBatch, pa.Table, Iterable[pa.RecordBatch]],
    append: bool = False,
) -> DataFrame:
    """Uploads a `pyarrow.Table` or `pyarrow.RecordBatch` to the getML Engine.

    Replaces the actual content of the underlying data frame in
    the getML Engine with `table`.

    Args:
        table:
            The arrow tablelike to be read as a `DataFrame`.

        append:
            If a data frame object holding the same `name` is
            already present in the getML Engine, should the content in
            `table` be appended or replace the existing data?

    Returns:
            Current instance.

    Note:
        For columns containing `pandas.Timestamp` there can
        be small inconsistencies in the order of microseconds
        when sending the data to the getML Engine. This is due to
        the way the underlying information is stored.
    """

    # ------------------------------------------------------------

    inferred_schema, batches = to_arrow_batches(table)

    if not isinstance(append, bool):
        raise TypeError("'append' must be bool.")

    # ------------------------------------------------------------

    if self.ncols() == 0:
        raise Exception(
            """Reading data is only possible in a DataFrame with more than zero
            columns. You can pre-define columns during
            initialization of the DataFrame or use the classmethod
            from_pandas(...)."""
        )

    # ------------------------------------------------------------

    preprocessed_schema = preprocess_arrow_schema(inferred_schema, self.roles)
    batches = (cast_arrow_batch(batch, preprocessed_schema) for batch in batches)

    read_arrow_batches(batches, preprocessed_schema, self, append)

    return self.refresh()

read_csv

read_csv(
    fnames: Iterable[str],
    append: bool = False,
    quotechar: str = '"',
    sep: str = ",",
    num_lines_read: int = 0,
    skip: int = 0,
    colnames: Optional[Iterable[str]] = None,
    time_formats: Optional[Iterable[str]] = None,
    verbose: bool = True,
    block_size: int = DEFAULT_CSV_READ_BLOCK_SIZE,
    in_batches: bool = False,
) -> DataFrame

Read CSV files.

It is assumed that the first line of each CSV file contains a header with the column names.

PARAMETER	DESCRIPTION
`fnames`	CSV file paths to be read. TYPE: `Iterable[str]`
`append`	If a data frame object holding the same `name` is already present in the getML Engine, should the content of the CSV files in `fnames` be appended or replace the existing data? TYPE: `bool` DEFAULT: `False`
`quotechar`	The character used to wrap strings. TYPE: `str` DEFAULT: `'"'`
`sep`	The separator used for separating fields. TYPE: `str` DEFAULT: `','`
`num_lines_read`	Number of lines read from each file. Set to 0 to read in the entire file. TYPE: `int` DEFAULT: `0`
`skip`	Number of lines to skip at the beginning of each file. TYPE: `int` DEFAULT: `0`
`colnames`	The first line of a CSV file usually contains the column names. When this is not the case, you need to explicitly pass them. TYPE: `Optional[Iterable[str]]` DEFAULT: `None`
`time_formats`	The list of formats tried when parsing time stamps. The formats are allowed to contain the following special characters: %w - abbreviated weekday (Mon, Tue, ...) %W - full weekday (Monday, Tuesday, ...) %b - abbreviated month (Jan, Feb, ...) %B - full month (January, February, ...) %d - zero-padded day of month (01 .. 31) %e - day of month (1 .. 31) %f - space-padded day of month ( 1 .. 31) %m - zero-padded month (01 .. 12) %n - month (1 .. 12) %o - space-padded month ( 1 .. 12) %y - year without century (70) %Y - year with century (1970) %H - hour (00 .. 23) %h - hour (00 .. 12) %a - am/pm %A - AM/PM %M - minute (00 .. 59) %S - second (00 .. 59) %s - seconds and microseconds (equivalent to %S.%F) %i - millisecond (000 .. 999) %c - centisecond (0 .. 9) %F - fractional seconds/microseconds (000000 - 999999) %z - time zone differential in ISO 8601 format (Z or +NN.NN) %Z - time zone differential in RFC format (GMT or +NNNN) %% - percent sign TYPE: `Optional[Iterable[str]]` DEFAULT: `None`
`verbose`	If True, when `fnames` are urls, the filenames are printed to stdout during the download. TYPE: `bool` DEFAULT: `True`
`block_size`	The number of bytes read with each batch. Passed down to pyarrow. TYPE: `int` DEFAULT: `DEFAULT_CSV_READ_BLOCK_SIZE`
`in_batches`	If True, read blocks in a streamwise manner and send those batches to the engine. Blocks are read and sent to the engine sequentially. While more memory efficient, streaming in batches is slower as it is inherently single-threaded. If False (default), the data is read with multiple threads into arrow first and sent to the engine afterwards. TYPE: `bool` DEFAULT: `False`

RETURNS	DESCRIPTION
`DataFrame`	Handler of the underlying data.
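
To make the meaning of `sep`, `quotechar`, `skip`, `colnames`, and `num_lines_read` concrete, the following sketch applies the same ideas with Python's standard `csv` module. It is an analogy only: `parse_csv` is a hypothetical helper, and the exact order in which getML applies `skip` and header detection is an assumption here.

```python
import csv
from typing import Dict, Iterable, List, Optional


def parse_csv(
    text: str,
    sep: str = ",",
    quotechar: str = '"',
    skip: int = 0,
    colnames: Optional[Iterable[str]] = None,
    num_lines_read: int = 0,
) -> List[Dict[str, str]]:
    """Hypothetical helper mirroring the parameter names of read_csv."""
    # Drop the first `skip` lines before anything else is interpreted.
    lines = text.splitlines()[skip:]
    reader = csv.reader(lines, delimiter=sep, quotechar=quotechar)
    # Without explicit column names, the next line is taken as the header.
    header = list(colnames) if colnames is not None else next(reader)
    rows = [dict(zip(header, row)) for row in reader]
    # 0 means "read everything", as in the parameter table.
    return rows[:num_lines_read] if num_lines_read > 0 else rows


raw = '# export comment\nname;amount\n"Smith; J.";3.5\nDoe;4.0\n'

with_header = parse_csv(raw, sep=";", skip=1)
headerless = parse_csv(
    raw, sep=";", skip=2, colnames=["who", "value"], num_lines_read=1
)
```

The quoted field keeps its embedded separator, and passing `colnames` lets the first remaining line be treated as data instead of a header.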

Source code in getml/data/data_frame.py

def read_csv(
    self,
    fnames: Iterable[str],
    append: bool = False,
    quotechar: str = '"',
    sep: str = ",",
    num_lines_read: int = 0,
    skip: int = 0,
    colnames: Optional[Iterable[str]] = None,
    time_formats: Optional[Iterable[str]] = None,
    verbose: bool = True,
    block_size: int = DEFAULT_CSV_READ_BLOCK_SIZE,
    in_batches: bool = False,
) -> DataFrame:
    """Read CSV files.

    It is assumed that the first line of each CSV file contains a
    header with the column names.

    Args:
        fnames:
            CSV file paths to be read.

        append:
            If a data frame object holding the same ``name`` is
            already present in the getML Engine, should the content of
            the CSV files in `fnames` be appended or replace the
            existing data?

        quotechar:
            The character used to wrap strings.

        sep:
            The separator used for separating fields.

        num_lines_read:
            Number of lines read from each file.
            Set to 0 to read in the entire file.

        skip:
            Number of lines to skip at the beginning of each file.

        colnames:
            The first line of a CSV file
            usually contains the column names.
            When this is not the case, you need to explicitly pass them.

        time_formats:
            The list of formats tried when parsing time stamps.

            The formats are allowed to contain the following
            special characters:

            * %w - abbreviated weekday (Mon, Tue, ...)
            * %W - full weekday (Monday, Tuesday, ...)
            * %b - abbreviated month (Jan, Feb, ...)
            * %B - full month (January, February, ...)
            * %d - zero-padded day of month (01 .. 31)
            * %e - day of month (1 .. 31)
            * %f - space-padded day of month ( 1 .. 31)
            * %m - zero-padded month (01 .. 12)
            * %n - month (1 .. 12)
            * %o - space-padded month ( 1 .. 12)
            * %y - year without century (70)
            * %Y - year with century (1970)
            * %H - hour (00 .. 23)
            * %h - hour (00 .. 12)
            * %a - am/pm
            * %A - AM/PM
            * %M - minute (00 .. 59)
            * %S - second (00 .. 59)
            * %s - seconds and microseconds (equivalent to %S.%F)
            * %i - millisecond (000 .. 999)
            * %c - centisecond (0 .. 9)
            * %F - fractional seconds/microseconds (000000 - 999999)
            * %z - time zone differential in ISO 8601 format (Z or +NN.NN)
            * %Z - time zone differential in RFC format (GMT or +NNNN)
            * %% - percent sign

        verbose:
            If True, when `fnames` are urls, the filenames are printed to
            stdout during the download.

        block_size:
            The number of bytes read with each batch. Passed down to
            pyarrow.

        in_batches:
            If True, read blocks in a streamwise manner and send those batches to
            the engine. Blocks are read and sent to the engine sequentially.
            While more memory efficient, streaming in batches is slower as
            it is inherently single-threaded. If False (default) the data is
            read with multiple threads into arrow first and sent to the engine
            afterwards.

    Returns:
            Handler of the underlying data.

    """

    time_formats = time_formats or constants.TIME_FORMATS

    if isinstance(fnames, str):
        fnames = [fnames]

    if not _is_non_empty_typed_list(fnames, str):
        raise TypeError("'fnames' must be either a string or a list of str")

    if not isinstance(append, bool):
        raise TypeError("'append' must be bool.")

    if not isinstance(quotechar, str):
        raise TypeError("'quotechar' must be str.")

    if not isinstance(sep, str):
        raise TypeError("'sep' must be str.")

    if not isinstance(num_lines_read, numbers.Real):
        raise TypeError("'num_lines_read' must be a real number")

    if not isinstance(skip, numbers.Real):
        raise TypeError("'skip' must be a real number")

    if not _is_non_empty_typed_list(time_formats, str):
        raise TypeError("'time_formats' must be a non-empty list of str")

    if colnames:
        if not _is_iterable_not_str_of_type(colnames, str):
            raise TypeError("'colnames' must be an iterable of str")

    if self.ncols() == 0:
        raise Exception(
            """Reading data is only possible in a DataFrame with more than zero
            columns. You can pre-define columns during
            initialization of the DataFrame or use the classmethod
            from_csv(...)."""
        )

    if not _is_non_empty_typed_list(fnames, str):
        raise TypeError(
            """'fnames' must be a list containing at
            least one path to a CSV file"""
        )

    fnames_ = _retrieve_urls(fnames, verbose)

    if colnames is None:
        colnames = ()

    stream_read_csv = stream_csv if in_batches else read_csv

    readers = (
        stream_read_csv(
            Path(fname),
            roles=self.roles,
            skip_rows=skip,
            column_names=colnames,
            delimiter=sep,
            quote_char=quotechar,
            block_size=block_size,
        )
        for fname in fnames_
    )

    for batches in readers:
        first_batch = next(batches)
        schema = first_batch.schema
        read_arrow_batches(iter((first_batch, *batches)), schema, self, append)
        if not append:
            append = True

    return self

read_json

read_json(
    json_str: str,
    append: bool = False,
    time_formats: Optional[Iterable[str]] = None,
) -> DataFrame

Fill from JSON

Fills the data frame with data from a JSON string.

PARAMETER	DESCRIPTION
`json_str`	The JSON string containing the data. TYPE: `str`
`append`	If a data frame object holding the same `name` is already present in the getML Engine, should the content of `json_str` be appended or replace the existing data? TYPE: `bool` DEFAULT: `False`
`time_formats`	The list of formats tried when parsing time stamps. The formats are allowed to contain the following special characters: %w - abbreviated weekday (Mon, Tue, ...) %W - full weekday (Monday, Tuesday, ...) %b - abbreviated month (Jan, Feb, ...) %B - full month (January, February, ...) %d - zero-padded day of month (01 .. 31) %e - day of month (1 .. 31) %f - space-padded day of month ( 1 .. 31) %m - zero-padded month (01 .. 12) %n - month (1 .. 12) %o - space-padded month ( 1 .. 12) %y - year without century (70) %Y - year with century (1970) %H - hour (00 .. 23) %h - hour (00 .. 12) %a - am/pm %A - AM/PM %M - minute (00 .. 59) %S - second (00 .. 59) %s - seconds and microseconds (equivalent to %S.%F) %i - millisecond (000 .. 999) %c - centisecond (0 .. 9) %F - fractional seconds/microseconds (000000 - 999999) %z - time zone differential in ISO 8601 format (Z or +NN.NN) %Z - time zone differential in RFC format (GMT or +NNNN) %% - percent sign TYPE: `Optional[Iterable[str]]` DEFAULT: `None`

RETURNS	DESCRIPTION
`DataFrame`	Handler of the underlying data.

Note

This does not support NaN values. If you want support for NaN, use from_json instead.
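
Because standard JSON has no NaN literal, it can help to check a payload before passing it to read_json. The sketch below uses only Python's `json` and `math` modules; the column-oriented layout of `columns` and the helper `nan_positions` are assumptions made for illustration.

```python
import json
import math

# Hypothetical column-oriented payload containing one NaN.
columns = {"column_01": [2.4, float("nan"), 1.2], "names": ["a", "b", "c"]}

# By default, json.dumps emits a bare NaN token, which is not valid JSON.
lenient = json.dumps(columns)

# With allow_nan=False, the same call refuses to serialize the payload.
try:
    json.dumps(columns, allow_nan=False)
    strict_ok = True
except ValueError:
    strict_ok = False


def nan_positions(cols):
    """Hypothetical helper: row indices holding NaN, per column."""
    return {
        name: [
            i
            for i, value in enumerate(values)
            if isinstance(value, float) and math.isnan(value)
        ]
        for name, values in cols.items()
    }


offending = nan_positions(columns)
```

Serializing with `allow_nan=False` is a cheap guard: it raises before a string containing NaN ever reaches the Engine.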

Source code in getml/data/data_frame.py

def read_json(
    self,
    json_str: str,
    append: bool = False,
    time_formats: Optional[Iterable[str]] = None,
) -> DataFrame:
    """Fill from JSON

    Fills the data frame with data from a JSON string.

    Args:
        json_str:
            The JSON string containing the data.

        append:
            If a data frame object holding the same ``name`` is
            already present in the getML Engine, should the content of
            `json_str` be appended or replace the existing data?

        time_formats:
            The list of formats tried when parsing time stamps.
            The formats are allowed to contain the following
            special characters:

            * %w - abbreviated weekday (Mon, Tue, ...)
            * %W - full weekday (Monday, Tuesday, ...)
            * %b - abbreviated month (Jan, Feb, ...)
            * %B - full month (January, February, ...)
            * %d - zero-padded day of month (01 .. 31)
            * %e - day of month (1 .. 31)
            * %f - space-padded day of month ( 1 .. 31)
            * %m - zero-padded month (01 .. 12)
            * %n - month (1 .. 12)
            * %o - space-padded month ( 1 .. 12)
            * %y - year without century (70)
            * %Y - year with century (1970)
            * %H - hour (00 .. 23)
            * %h - hour (00 .. 12)
            * %a - am/pm
            * %A - AM/PM
            * %M - minute (00 .. 59)
            * %S - second (00 .. 59)
            * %s - seconds and microseconds (equivalent to %S.%F)
            * %i - millisecond (000 .. 999)
            * %c - centisecond (0 .. 9)
            * %F - fractional seconds/microseconds (000000 - 999999)
            * %z - time zone differential in ISO 8601 format (Z or +NN.NN)
            * %Z - time zone differential in RFC format (GMT or +NNNN)
            * %% - percent sign

    Returns:
            Handler of the underlying data.

    Note:
        This does not support NaN values. If you want support for NaN,
        use [`from_json`][getml.DataFrame.from_json] instead.

    """

    time_formats = time_formats or constants.TIME_FORMATS

    if self.ncols() == 0:
        raise Exception(
            """Reading data is only possible in a DataFrame with more than zero
            columns. You can pre-define columns during
            initialization of the DataFrame or use the classmethod
            from_json(...)."""
        )

    if not isinstance(json_str, str):
        raise TypeError("'json_str' must be of type str")

    if not isinstance(append, bool):
        raise TypeError("'append' must be of type bool")

    if not _is_non_empty_typed_list(time_formats, str):
        raise TypeError(
            """'time_formats' must be a list of strings
            containing at least one time format"""
        )

    cmd: Dict[str, Any] = {}
    cmd["type_"] = "DataFrame.from_json"
    cmd["name_"] = self.name

    cmd["categorical_"] = self._categorical_names
    cmd["join_keys_"] = self._join_key_names
    cmd["numerical_"] = self._numerical_names
    cmd["targets_"] = self._target_names
    cmd["text_"] = self._text_names
    cmd["time_stamps_"] = self._time_stamp_names
    cmd["unused_floats_"] = self._unused_float_names
    cmd["unused_strings_"] = self._unused_string_names

    cmd["append_"] = append
    cmd["time_formats_"] = time_formats

    with comm.send_and_get_socket(cmd) as sock:
        comm.send_string(sock, json_str)
        msg = comm.recv_string(sock)

    if msg != "Success!":
        comm.handle_engine_exception(msg)

    return self

read_parquet

read_parquet(
    fnames: Union[str, Iterable[str]],
    append: bool = False,
    verbose: bool = False,
    colnames: Iterable[str] = (),
) -> DataFrame

Read a parquet file.

PARAMETER	DESCRIPTION
`fnames`	The filepath of the parquet file(s) to be read. TYPE: `Union[str, Iterable[str]]`
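`append`	If a data frame object holding the same `name` is already present in the getML Engine, should the content of the parquet files in `fnames` be appended or replace the existing data? TYPE: `bool` DEFAULT: `False`
`verbose`	If True, when `fnames` are urls, the filenames are printed to stdout during the download. TYPE: `bool` DEFAULT: `False`
`colnames`	Names of the columns to read from the parquet file(s). If empty, the column names of the current instance are used. TYPE: `Iterable[str]` DEFAULT: `()`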

RETURNS	DESCRIPTION
`DataFrame`	Handler of the underlying data.
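
When several files are passed, the source below reads them one after another and switches `append` to `True` after the first file, so only the first file can replace existing data. A minimal, library-free sketch of that pattern follows; `load_files` and the in-memory `store` list are hypothetical stand-ins for the Engine-side data frame, not part of getML.

```python
from typing import Iterable, List


def load_files(
    store: List[int], files: Iterable[List[int]], append: bool = False
) -> List[int]:
    """Replace-then-append: only the first file may overwrite `store`."""
    for rows in files:
        if append:
            store.extend(rows)
        else:
            # First file with append=False: drop what was there before.
            store[:] = rows
            # Every later file must be appended, otherwise it would
            # wipe the rows that were just loaded.
            append = True
    return store


existing = [0]
replaced = load_files(list(existing), [[1, 2], [3]], append=False)
appended = load_files(list(existing), [[1, 2], [3]], append=True)
```

With `append=False` the pre-existing row is dropped once and both files end up in the result; with `append=True` it is kept.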

Source code in getml/data/data_frame.py

def read_parquet(
    self,
    fnames: Union[str, Iterable[str]],
    append: bool = False,
    verbose: bool = False,
    colnames: Iterable[str] = (),
) -> DataFrame:
    """Read a parquet file.

    Args:
        fnames:
            The filepath of the parquet file(s) to be read.

        append:
            If a data frame object holding the same ``name`` is
            already present in the getML Engine, should the content of
            the parquet files in `fnames` be appended or replace the
            existing data?

        verbose:
            If True, when `fnames` are urls, the filenames are printed to
            stdout during the download.

    Returns:
        Handler of the underlying data.
    """

    if isinstance(fnames, str):
        fnames = [fnames]

    if not isinstance(append, bool):
        raise TypeError("'append' must be bool.")

    if not colnames:
        colnames = self.colnames

    if self.ncols() == 0:
        raise Exception(
            """Reading data is only possible in a DataFrame with more than
            zero columns. You can pre-define columns during
            initialization of the DataFrame or use the classmethod
            from_parquet(...)."""
        )

    fnames = _retrieve_urls(fnames, verbose)

    readers = (pq.ParquetFile(fname) for fname in fnames)

    for reader in readers:
        inferred_schema = reader.schema_arrow
        preprocessed_schema = preprocess_arrow_schema(inferred_schema, self.roles)
        cast_batches = (
            cast_arrow_batch(batch, preprocessed_schema)
            for batch in reader.iter_batches(columns=colnames)
        )
        read_arrow_batches(
            cast_batches,
            preprocessed_schema,
            self,
            append,
        )
        if not append:
            append = True

    return self

read_s3

read_s3(
    bucket: str,
    keys: Iterable[str],
    region: str,
    append: bool = False,
    sep: str = ",",
    num_lines_read: int = 0,
    skip: int = 0,
    colnames: Optional[Iterable[str]] = None,
    time_formats: Optional[Iterable[str]] = None,
) -> DataFrame

Read CSV files from an S3 bucket.

It is assumed that the first line of each CSV file contains a header with the column names.

Enterprise edition

This feature is exclusive to the Enterprise edition and is not available in the Community edition. Discover the benefits of the Enterprise edition and compare their features.

For licensing information and technical support, please contact us.

Note

Note that S3 is not supported on Windows.

PARAMETER	DESCRIPTION
`bucket`	The bucket from which to read the files. TYPE: `str`
`keys`	The list of keys (files in the bucket) to be read. TYPE: `Iterable[str]`
`region`	The region in which the bucket is located. TYPE: `str`
`append`	If a data frame object holding the same `name` is already present in the getML Engine, should the content of the CSV files in `keys` be appended or replace the existing data? TYPE: `bool` DEFAULT: `False`
`sep`	The separator used for separating fields. TYPE: `str` DEFAULT: `','`
`num_lines_read`	Number of lines read from each file. Set to 0 to read in the entire file. TYPE: `int` DEFAULT: `0`
`skip`	Number of lines to skip at the beginning of each file. TYPE: `int` DEFAULT: `0`
`colnames`	The first line of a CSV file usually contains the column names. When this is not the case, you need to explicitly pass them. TYPE: `Optional[Iterable[str]]` DEFAULT: `None`
`time_formats`	The list of formats tried when parsing time stamps. The formats are allowed to contain the following special characters: %w - abbreviated weekday (Mon, Tue, ...); %W - full weekday (Monday, Tuesday, ...); %b - abbreviated month (Jan, Feb, ...); %B - full month (January, February, ...); %d - zero-padded day of month (01 .. 31); %e - day of month (1 .. 31); %f - space-padded day of month ( 1 .. 31); %m - zero-padded month (01 .. 12); %n - month (1 .. 12); %o - space-padded month ( 1 .. 12); %y - year without century (70); %Y - year with century (1970); %H - hour (00 .. 23); %h - hour (00 .. 12); %a - am/pm; %A - AM/PM; %M - minute (00 .. 59); %S - second (00 .. 59); %s - seconds and microseconds (equivalent to %S.%F); %i - millisecond (000 .. 999); %c - centisecond (0 .. 9); %F - fractional seconds/microseconds (000000 - 999999); %z - time zone differential in ISO 8601 format (Z or +NN.NN); %Z - time zone differential in RFC format (GMT or +NNNN); %% - percent sign. TYPE: `Optional[Iterable[str]]` DEFAULT: `None`

RETURNS	DESCRIPTION
`DataFrame`	Handler of the underlying data.
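
The `time_formats` list is tried format by format. The sketch below illustrates that fallback idea with Python's standard `datetime.strptime`. It is not the Engine's parser, and it only uses the directives `%Y`, `%m`, `%d`, `%H`, `%M` and `%S`, which are described with the same meaning in the list above and in Python. `parse_time_stamp` is a hypothetical helper name.

```python
from datetime import datetime
from typing import Iterable, Optional


def parse_time_stamp(value: str, time_formats: Iterable[str]) -> Optional[datetime]:
    """Return the result of the first format that parses `value`, else None."""
    for fmt in time_formats:
        try:
            return datetime.strptime(value, fmt)
        except ValueError:
            # This format does not match; fall through to the next one.
            continue
    return None


formats = ["%Y-%m-%d %H:%M:%S", "%d/%m/%Y"]
first = parse_time_stamp("2019-06-04 12:30:00", formats)
second = parse_time_stamp("24/12/2018", formats)
unparsed = parse_time_stamp("not a date", formats)
```

Because the first matching format wins, it is usually safer to list the most specific formats first.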

Source code in getml/data/data_frame.py

def read_s3(
    self,
    bucket: str,
    keys: Iterable[str],
    region: str,
    append: bool = False,
    sep: str = ",",
    num_lines_read: int = 0,
    skip: int = 0,
    colnames: Optional[Iterable[str]] = None,
    time_formats: Optional[Iterable[str]] = None,
) -> DataFrame:
    """Read CSV files from an S3 bucket.

    It is assumed that the first line of each CSV file contains a
    header with the column names.

    enterprise-adm: Enterprise edition
        This feature is exclusive to the Enterprise edition and is not available in the Community edition. Discover the [benefits of the Enterprise edition][enterprise-benefits] and [compare their features][enterprise-feature-list].

        For licensing information and technical support, please [contact us][contact-page].

    Note:
        Note that S3 is not supported on Windows.

    Args:
        bucket:
            The bucket from which to read the files.

        keys:
            The list of keys (files in the bucket) to be read.

        region:
            The region in which the bucket is located.

        append:
            If a data frame object holding the same ``name`` is
            already present in the getML Engine, should the content of
            the CSV files in `keys` be appended or replace the
            existing data?

        sep:
            The separator used for separating fields.

        num_lines_read:
            Number of lines read from each file.
            Set to 0 to read in the entire file.

        skip:
            Number of lines to skip at the beginning of each file.

        colnames:
            The first line of a CSV file
            usually contains the column names.
            When this is not the case, you need to explicitly pass them.

        time_formats:
            The list of formats tried when parsing time stamps.

            The formats are allowed to contain the following
            special characters:

            * %w - abbreviated weekday (Mon, Tue, ...)
            * %W - full weekday (Monday, Tuesday, ...)
            * %b - abbreviated month (Jan, Feb, ...)
            * %B - full month (January, February, ...)
            * %d - zero-padded day of month (01 .. 31)
            * %e - day of month (1 .. 31)
            * %f - space-padded day of month ( 1 .. 31)
            * %m - zero-padded month (01 .. 12)
            * %n - month (1 .. 12)
            * %o - space-padded month ( 1 .. 12)
            * %y - year without century (70)
            * %Y - year with century (1970)
            * %H - hour (00 .. 23)
            * %h - hour (00 .. 12)
            * %a - am/pm
            * %A - AM/PM
            * %M - minute (00 .. 59)
            * %S - second (00 .. 59)
            * %s - seconds and microseconds (equivalent to %S.%F)
            * %i - millisecond (000 .. 999)
            * %c - centisecond (0 .. 9)
            * %F - fractional seconds/microseconds (000000 - 999999)
            * %z - time zone differential in ISO 8601 format (Z or +NN.NN)
            * %Z - time zone differential in RFC format (GMT or +NNNN)
            * %% - percent sign

    Returns:
            Handler of the underlying data.

    """

    time_formats = time_formats or constants.TIME_FORMATS

    if isinstance(keys, str):
        keys = [keys]

    if not isinstance(bucket, str):
        raise TypeError("'bucket' must be str.")

    if not _is_non_empty_typed_list(keys, str):
        raise TypeError("'keys' must be either a string or a list of str")

    if not isinstance(region, str):
        raise TypeError("'region' must be str.")

    if not isinstance(append, bool):
        raise TypeError("'append' must be bool.")

    if not isinstance(sep, str):
        raise TypeError("'sep' must be str.")

    if not isinstance(num_lines_read, numbers.Real):
        raise TypeError("'num_lines_read' must be a real number")

    if not isinstance(skip, numbers.Real):
        raise TypeError("'skip' must be a real number")

    if not _is_non_empty_typed_list(time_formats, str):
        raise TypeError("'time_formats' must be a non-empty list of str")

    if colnames is not None and not _is_non_empty_typed_list(colnames, str):
        raise TypeError(
            "'colnames' must be either None or a non-empty list of str."
        )

    if self.ncols() == 0:
        raise Exception(
            """Reading data is only possible in a DataFrame with more than zero
            columns. You can pre-define columns during
            initialization of the DataFrame or use the classmethod
            from_s3(...)."""
        )

    cmd: Dict[str, Any] = {}

    cmd["type_"] = "DataFrame.read_s3"
    cmd["name_"] = self.name

    cmd["append_"] = append
    cmd["bucket_"] = bucket
    cmd["keys_"] = keys
    cmd["region_"] = region
    cmd["sep_"] = sep
    cmd["time_formats_"] = time_formats
    cmd["num_lines_read_"] = num_lines_read
    cmd["skip_"] = skip

    if colnames is not None:
        cmd["colnames_"] = colnames

    cmd["categorical_"] = self._categorical_names
    cmd["join_keys_"] = self._join_key_names
    cmd["numerical_"] = self._numerical_names
    cmd["targets_"] = self._target_names
    cmd["text_"] = self._text_names
    cmd["time_stamps_"] = self._time_stamp_names
    cmd["unused_floats_"] = self._unused_float_names
    cmd["unused_strings_"] = self._unused_string_names

    comm.send(cmd)

    return self

read_view

read_view(view: View, append: bool = False) -> DataFrame

Read the data from a View.

PARAMETER	DESCRIPTION
`view`	The view to read. TYPE: `View`
`append`	If a data frame object holding the same `name` is already present in the getML Engine, should the content of `view` be appended or replace the existing data? TYPE: `bool` DEFAULT: `False`

RETURNS	DESCRIPTION
`DataFrame`	Handler of the underlying data.

Source code in getml/data/data_frame.py

def read_view(
    self,
    view: View,
    append: bool = False,
) -> DataFrame:
    """Read the data from a [`View`][getml.data.View].

    Args:
        view:
            The view to read.

        append:
            If a data frame object holding the same ``name`` is
            already present in the getML Engine, should the content of
            `view` be appended or replace the existing data?

    Returns:
            Handler of the underlying data.

    """

    if not isinstance(view, View):
        raise TypeError("'view' must be getml.data.View.")

    if not isinstance(append, bool):
        raise TypeError("'append' must be bool.")

    view.check()

    cmd: Dict[str, Any] = {}

    cmd["type_"] = "DataFrame.from_view"
    cmd["name_"] = self.name

    cmd["view_"] = view._getml_deserialize()

    cmd["append_"] = append

    comm.send(cmd)

    return self.refresh()

read_db

read_db(
    table_name: str,
    append: bool = False,
    conn: Optional[Connection] = None,
) -> DataFrame

Fill from Database.

The DataFrame will be filled from a table in the database.

PARAMETER	DESCRIPTION
`table_name`	Table from which we want to retrieve the data. TYPE: `str`
`append`	If a data frame object holding the same `name` is already present in the getML Engine, should the content of `table_name` be appended or replace the existing data? TYPE: `bool` DEFAULT: `False`
`conn`	The database connection to be used. If you don't explicitly pass a connection, the Engine will use the default connection. TYPE: `Optional[Connection]` DEFAULT: `None`

RETURNS	DESCRIPTION
`DataFrame`	Handler of the underlying data.

Source code in getml/data/data_frame.py

def read_db(
    self, table_name: str, append: bool = False, conn: Optional[Connection] = None
) -> DataFrame:
    """
    Fill from Database.

    The DataFrame will be filled from a table in the database.

    Args:
        table_name:
            Table from which we want to retrieve the data.

        append:
            If a data frame object holding the same ``name`` is
            already present in the getML Engine, should the content of
            `table_name` be appended or replace the existing data?

        conn:
            The database connection to be used.
            If you don't explicitly pass a connection,
            the Engine will use the default connection.

    Returns:
            Handler of the underlying data.
    """

    if not isinstance(table_name, str):
        raise TypeError("'table_name' must be str.")

    if not isinstance(append, bool):
        raise TypeError("'append' must be bool.")

    if self.ncols() == 0:
        raise Exception(
            """Reading data is only possible in a DataFrame with more than zero
            columns. You can pre-define columns during
            initialization of the DataFrame or use the classmethod
            from_db(...)."""
        )

    conn = conn or database.Connection()

    cmd: Dict[str, Any] = {}

    cmd["type_"] = "DataFrame.from_db"
    cmd["name_"] = self.name
    cmd["table_name_"] = table_name

    cmd["categorical_"] = self._categorical_names
    cmd["join_keys_"] = self._join_key_names
    cmd["numerical_"] = self._numerical_names
    cmd["targets_"] = self._target_names
    cmd["text_"] = self._text_names
    cmd["time_stamps_"] = self._time_stamp_names
    cmd["unused_floats_"] = self._unused_float_names
    cmd["unused_strings_"] = self._unused_string_names

    cmd["append_"] = append

    cmd["conn_id_"] = conn.conn_id

    comm.send(cmd)

    return self

read_pandas

read_pandas(
    pandas_df: DataFrame, append: bool = False
) -> DataFrame

Uploads a pandas.DataFrame.

Replaces the actual content of the underlying data frame in the getML Engine with pandas_df.

PARAMETER	DESCRIPTION
`pandas_df`	Data the underlying data frame object in the getML Engine should obtain. TYPE: `DataFrame`
`append`	If a data frame object holding the same `name` is already present in the getML Engine, should the content of `pandas_df` be appended or replace the existing data? TYPE: `bool` DEFAULT: `False`

RETURNS	DESCRIPTION
`DataFrame`	Handler of the underlying data.

Note: For columns containing pandas.Timestamp there can occur small inconsistencies in the order of microseconds when sending the data to the getML Engine. This is due to the way the underlying information is stored.
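
One way such small differences can arise is a round trip through a floating-point representation. The sketch below assumes, purely for illustration, that a nanosecond-resolution time stamp is converted to a 64-bit float counting seconds since the UNIX epoch and back. The note above only states that the storage format is the cause, so treat this mechanism as an assumption rather than a description of the Engine.

```python
# Nanoseconds since the UNIX epoch for 2019-06-04T00:00:00.123456789Z.
original_ns = 1_559_606_400_123_456_789

# Assumed storage: seconds since the epoch as a 64-bit float.
as_seconds = original_ns / 1e9

# Converting back cannot recover every nanosecond digit, because a
# float64 has only about 15 to 17 significant decimal digits.
round_trip_ns = round(as_seconds * 1e9)
drift_ns = abs(round_trip_ns - original_ns)
```

The drift stays below one microsecond in this sketch, but it is not zero, so exact equality checks on round-tripped time stamps are best avoided.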

Source code in getml/data/data_frame.py

def read_pandas(self, pandas_df: pd.DataFrame, append: bool = False) -> DataFrame:
    """Uploads a `pandas.DataFrame`.

    Replaces the actual content of the underlying data frame in
    the getML Engine with `pandas_df`.

    Args:
        pandas_df:
            Data the underlying data frame object in the getML
            Engine should obtain.

        append:
            If a data frame object holding the same ``name`` is
            already present in the getML Engine, should the content of
            `pandas_df` be appended or replace the existing data?

    Returns:
            Handler of the underlying data.
    Note:
        For columns containing `pandas.Timestamp` there can
        occur small inconsistencies in the order of microseconds
        when sending the data to the getML Engine. This is due to
        the way the underlying information is stored.
    """

    if not isinstance(pandas_df, pd.DataFrame):
        raise TypeError("'pandas_df' must be of type pandas.DataFrame.")

    if not isinstance(append, bool):
        raise TypeError("'append' must be bool.")

    if self.ncols() == 0:
        raise Exception(
            """Reading data is only possible in a DataFrame with more than zero
            columns. You can pre-define columns during
            initialization of the DataFrame or use the classmethod
            from_pandas(...)."""
        )

    table = pa.Table.from_pandas(pandas_df[self.columns])

    return self.read_arrow(table, append=append)

read_pyspark

read_pyspark(
    spark_df: DataFrame, append: bool = False
) -> DataFrame

Uploads a pyspark.sql.DataFrame.

Replaces the actual content of the underlying data frame in the getML Engine with spark_df.

PARAMETER	DESCRIPTION
`spark_df`	Data the underlying data frame object in the getML Engine should obtain. TYPE: `DataFrame`
`append`	If a data frame object holding the same `name` is already present in the getML Engine, should the content of `spark_df` be appended or replace the existing data? TYPE: `bool` DEFAULT: `False`

RETURNS	DESCRIPTION
`DataFrame`	Handler of the underlying data.
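
According to the source below, the Spark data frame is first written to a temporary directory as parquet, and every file ending in `.parquet` is then read back. A minimal sketch of that file-collection step using only the standard library follows. The file names are made up for illustration, empty placeholder files stand in for real parquet part files, and the sorting is an addition of this sketch (the source relies on the order returned by `os.listdir`).

```python
import tempfile
from pathlib import Path
from typing import List


def collect_parquet_files(directory: Path) -> List[Path]:
    """Return the `.parquet` files in `directory`, sorted by name."""
    return sorted(p for p in directory.iterdir() if p.suffix == ".parquet")


with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp)
    # Placeholder files: two part files plus a marker file to be skipped.
    for name in ("part-00000.parquet", "part-00001.parquet", "_SUCCESS"):
        (out_dir / name).touch()
    part_names = [p.name for p in collect_parquet_files(out_dir)]
```

Only the two part files are picked up; the marker file is ignored because it does not carry the `.parquet` suffix.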

Source code in getml/data/data_frame.py

def read_pyspark(
    self, spark_df: pyspark.sql.DataFrame, append: bool = False
) -> DataFrame:
    """Uploads a `pyspark.sql.DataFrame`.

    Replaces the actual content of the underlying data frame in
    the getML Engine with `spark_df`.

    Args:
        spark_df:
            Data the underlying data frame object in the getML
            Engine should obtain.

        append:
            If a data frame object holding the same ``name`` is
            already present in the getML Engine, should the content of
            `spark_df` be appended or replace the existing data?

    Returns:
            Handler of the underlying data.
    """

    if not isinstance(append, bool):
        raise TypeError("'append' must be bool.")

    temp_dir = _retrieve_temp_dir()
    path = temp_dir / str(self.name)
    spark_df.write.mode("overwrite").parquet(str(path))

    filepaths = [
        os.path.join(path, filepath)
        for filepath in os.listdir(path)
        if filepath[-8:] == ".parquet"
    ]

    for i, filepath in enumerate(filepaths):
        self.read_parquet(filepath, append or i > 0)

    shutil.rmtree(path)

    return self

read_query

read_query(
    query: str,
    append: Optional[bool] = False,
    conn: Optional[Connection] = None,
) -> DataFrame

Fill from query

Fills the data frame with the result of a query against the database.

PARAMETER	DESCRIPTION
`query`	The query used to retrieve the data. TYPE: `str`
`append`	If a data frame object holding the same `name` is already present in the getML Engine, should the content in `query` be appended or replace the existing data? TYPE: `Optional[bool]` DEFAULT: `False`
`conn`	The database connection to be used. If you don't explicitly pass a connection, the Engine will use the default connection. TYPE: `Optional[Connection]` DEFAULT: `None`

RETURNS	DESCRIPTION
`DataFrame`	Handler of the underlying data.

Source code in getml/data/data_frame.py

def read_query(
    self,
    query: str,
    append: Optional[bool] = False,
    conn: Optional[Connection] = None,
) -> DataFrame:
    """Fill from query

    Fills the data frame with the result of a query against the database.

    Args:
        query:
            The query used to retrieve the data.

        append:
            If a data frame object holding the same ``name`` is
            already present in the getML Engine, should the content in
            `query` be appended or replace the existing data?

        conn:
            The database connection to be used.
            If you don't explicitly pass a connection,
            the Engine will use the default connection.

    Returns:
            Handler of the underlying data.
    """

    if self.ncols() == 0:
        raise Exception(
            """Reading data is only possible in a DataFrame with more than zero
            columns. You can pre-define columns during
            initialization of the DataFrame or use the classmethod
            from_query(...)."""
        )

    if not isinstance(query, str):
        raise TypeError("'query' must be of type str")

    if not isinstance(append, bool):
        raise TypeError("'append' must be of type bool")

    conn = conn or database.Connection()

    cmd: Dict[str, Any] = {}

    cmd["type_"] = "DataFrame.from_query"
    cmd["name_"] = self.name
    cmd["query_"] = query

    cmd["categorical_"] = self._categorical_names
    cmd["join_keys_"] = self._join_key_names
    cmd["numerical_"] = self._numerical_names
    cmd["targets_"] = self._target_names
    cmd["text_"] = self._text_names
    cmd["time_stamps_"] = self._time_stamp_names
    cmd["unused_floats_"] = self._unused_float_names
    cmd["unused_strings_"] = self._unused_string_names

    cmd["append_"] = append

    cmd["conn_id_"] = conn.conn_id

    comm.send(cmd)

    return self

refresh

refresh() -> DataFrame

Aligns meta-information of the current instance with the corresponding data frame in the getML Engine.

This method can be used to avoid encoding conflicts. Note that load as well as several other methods automatically call refresh.

RETURNS	DESCRIPTION
`DataFrame`	Updated handler of the underlying data frame in the getML Engine.
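
The source below treats any reply that does not start with `{` as an error message and otherwise parses it as JSON. A minimal sketch of that dispatch follows; `parse_refresh_reply` and both example payloads are hypothetical, and the real reply format is defined by the Engine.

```python
import json
from typing import Any, Dict


def parse_refresh_reply(msg: str) -> Dict[str, Any]:
    """Parse a JSON-object reply, or raise with the reply as the error text."""
    # startswith also copes with an empty reply, where msg[0] would fail.
    if not msg.startswith("{"):
        raise RuntimeError(msg)
    return json.loads(msg)


# Hypothetical payloads, for illustration only.
roles = parse_refresh_reply('{"numerical": ["votes"], "categorical": ["animal"]}')

try:
    parse_refresh_reply("Data frame 'x' not found.")
    error = None
except RuntimeError as exc:
    error = str(exc)
```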

Source code in getml/data/data_frame.py

def refresh(self) -> DataFrame:
    """Aligns meta-information of the current instance with the
    corresponding data frame in the getML Engine.

    This method can be used to avoid encoding conflicts. Note that
    [`load`][getml.DataFrame.load] as well as several other
    methods automatically call [`refresh`][getml.DataFrame.refresh].

    Returns:
            Updated handler of the underlying data frame in the getML
            Engine.

    """

    cmd: Dict[str, Any] = {}
    cmd["type_"] = "DataFrame.refresh"
    cmd["name_"] = self.name

    with comm.send_and_get_socket(cmd) as sock:
        msg = comm.recv_string(sock)

    if msg[0] != "{":
        comm.handle_engine_exception(msg)

    roles = json.loads(msg)

    self.__init__(name=cast(str, self.name), roles=roles)

    return self

remove_subroles

remove_subroles(
    cols: Union[
        str,
        FloatColumn,
        StringColumn,
        List[Union[str, FloatColumn, StringColumn]],
    ]
) -> None

Removes all subroles from one or more columns.

PARAMETER	DESCRIPTION
`cols`	The columns or the names thereof. TYPE: `Union[str, FloatColumn, StringColumn, List[Union[str, FloatColumn, StringColumn]]]`

Source code in getml/data/data_frame.py

def remove_subroles(
    self,
    cols: Union[
        str, FloatColumn, StringColumn, List[Union[str, FloatColumn, StringColumn]]
    ],
) -> None:
    """Removes all [`subroles`][getml.data.subroles] from one or more columns.

    Args:
        cols:
            The columns or the names thereof.
    """

    names = _handle_cols(cols)

    for name in names:
        self._set_subroles(name, append=False, subroles=[])

    self.refresh()

remove_unit

remove_unit(
    cols: Union[
        str,
        FloatColumn,
        StringColumn,
        List[Union[str, FloatColumn, StringColumn]],
    ]
)

Removes the unit from one or more columns.

PARAMETER	DESCRIPTION
`cols`	The columns or the names thereof. TYPE: `Union[str, FloatColumn, StringColumn, List[Union[str, FloatColumn, StringColumn]]]`

Source code in getml/data/data_frame.py

def remove_unit(
    self,
    cols: Union[
        str, FloatColumn, StringColumn, List[Union[str, FloatColumn, StringColumn]]
    ],
):
    """Removes the unit from one or more columns.

    Args:
        cols:
            The columns or the names thereof.
    """

    names = _handle_cols(cols)

    for name in names:
        self._set_unit(name, "")

    self.refresh()

save

save() -> DataFrame

Writes the underlying data in the getML Engine to disk.

RETURNS	DESCRIPTION
`DataFrame`	The current instance.

Source code in getml/data/data_frame.py

def save(self) -> DataFrame:
    """Writes the underlying data in the getML Engine to disk.

    Returns:
            The current instance.

    """

    cmd: Dict[str, Any] = {}

    cmd["type_"] = "DataFrame.save"
    cmd["name_"] = self.name

    comm.send(cmd)

    return self

set_role

set_role(
    cols: Union[
        str,
        FloatColumn,
        StringColumn,
        List[Union[str, FloatColumn, StringColumn]],
    ],
    role: str,
    time_formats: Optional[Iterable[str]] = None,
)

Assigns a new role to one or more columns.

When switching from a role based on type float to a role based on type string or vice versa, an implicit type conversion will be conducted. The time_formats argument is used to interpret Time Stamps. For more information on roles, please refer to the User Guide.

PARAMETER	DESCRIPTION
`cols`	The columns or the names of the columns. TYPE: `Union[str, FloatColumn, StringColumn, List[Union[str, FloatColumn, StringColumn]]]`
`role`	The role to be assigned. TYPE: `str`
`time_formats`	Formats to be used to parse the time stamps. This is only necessary if an implicit conversion from a StringColumn to a time stamp is taking place. TYPE: `Optional[Iterable[str]]` DEFAULT: `None`

Example

data_df = dict(
    animal=["hawk", "parrot", "goose"],
    votes=[12341, 5127, 65311],
    date=["04/06/2019", "01/03/2019", "24/12/2018"])
df = getml.DataFrame.from_dict(data_df, "animal_elections")
df.set_role(['animal'], getml.data.roles.categorical)
df.set_role(['votes'], getml.data.roles.numerical)
df.set_role(
    ['date'], getml.data.roles.time_stamp, time_formats=['%d/%m/%Y'])

df

| date                        | animal      | votes     |
| time stamp                  | categorical | numerical |
---------------------------------------------------------
| 2019-06-04T00:00:00.000000Z | hawk        | 12341     |
| 2019-03-01T00:00:00.000000Z | parrot      | 5127      |
| 2018-12-24T00:00:00.000000Z | goose       | 65311     |

Source code in getml/data/data_frame.py

def set_role(
    self,
    cols: Union[
        str, FloatColumn, StringColumn, List[Union[str, FloatColumn, StringColumn]]
    ],
    role: str,
    time_formats: Optional[Iterable[str]] = None,
):
    """Assigns a new role to one or more columns.

    When switching from a role based on type float to a role based on type
    string or vice versa, an implicit type conversion will be conducted.
    The `time_formats` argument is used to interpret [Time Stamps][annotating-data-time-stamp]. For more information on
    roles, please refer to the [User Guide][annotating-data].

    Args:
        cols:
            The columns or the names of the columns.

        role:
            The role to be assigned.

        time_formats:
            Formats to be used to parse the time stamps.
            This is only necessary if an implicit conversion from a StringColumn to
            a time stamp is taking place.

    ??? example
        ```python
        data_df = dict(
            animal=["hawk", "parrot", "goose"],
            votes=[12341, 5127, 65311],
            date=["04/06/2019", "01/03/2019", "24/12/2018"])
        df = getml.DataFrame.from_dict(data_df, "animal_elections")
        df.set_role(['animal'], getml.data.roles.categorical)
        df.set_role(['votes'], getml.data.roles.numerical)
        df.set_role(
            ['date'], getml.data.roles.time_stamp, time_formats=['%d/%m/%Y'])

        df
        ```
        ```
        | date                        | animal      | votes     |
        | time stamp                  | categorical | numerical |
        ---------------------------------------------------------
        | 2019-06-04T00:00:00.000000Z | hawk        | 12341     |
        | 2019-03-01T00:00:00.000000Z | parrot      | 5127      |
        | 2018-12-24T00:00:00.000000Z | goose       | 65311     |
        ```
    """
    # ------------------------------------------------------------

    time_formats = time_formats or constants.TIME_FORMATS

    # ------------------------------------------------------------

    names = _handle_cols(cols)

    if not isinstance(role, str):
        raise TypeError("'role' must be str.")

    if not _is_non_empty_typed_list(time_formats, str):
        raise TypeError("'time_formats' must be a non-empty list of str")

    # ------------------------------------------------------------

    for nname in names:
        if nname not in self.colnames:
            raise ValueError("No column called '" + nname + "' found.")

    if role not in self._all_roles:
        raise ValueError(
            "'role' must be one of the following values: " + str(self._all_roles)
        )

    # ------------------------------------------------------------

    for name in names:
        if self[name].role != role:
            self._set_role(name, role, time_formats)

    # ------------------------------------------------------------

    self.refresh()

set_subroles

set_subroles(
    cols: Union[
        str,
        FloatColumn,
        StringColumn,
        List[Union[str, FloatColumn, StringColumn]],
    ],
    subroles: Optional[
        Union[Subrole, Iterable[str]]
    ] = None,
    append: Optional[bool] = True,
)

Assigns one or several new subroles to one or more columns.

PARAMETER	DESCRIPTION
`cols`	The columns or the names thereof. TYPE: `Union[str, FloatColumn, StringColumn, List[Union[str, FloatColumn, StringColumn]]]`
`subroles`	The subroles to be assigned. Must be from `subroles`. TYPE: `Optional[Union[Subrole, Iterable[str]]]` DEFAULT: `None`
`append`	Whether you want to append the new subroles to the existing subroles. TYPE: `Optional[bool]` DEFAULT: `True`

Source code in getml/data/data_frame.py

def set_subroles(
    self,
    cols: Union[
        str, FloatColumn, StringColumn, List[Union[str, FloatColumn, StringColumn]]
    ],
    subroles: Optional[Union[Subrole, Iterable[str]]] = None,
    append: Optional[bool] = True,
):
    """Assigns one or several new [`subroles`][getml.data.subroles] to one or more columns.

    Args:
        cols:
            The columns or the names thereof.

        subroles:
            The subroles to be assigned.
            Must be from [`subroles`][getml.data.subroles].

        append:
            Whether you want to append the
            new subroles to the existing subroles.
    """

    names = _handle_cols(cols)

    if isinstance(subroles, str):
        subroles = [subroles]

    if not _is_non_empty_typed_list(subroles, str):
        raise TypeError("'subroles' must be either a string or a list of strings.")

    if not isinstance(append, bool):
        raise TypeError("'append' must be a bool.")

    for name in names:
        self._set_subroles(name, append, subroles)

    self.refresh()

set_unit

set_unit(
    cols: Union[
        str,
        FloatColumn,
        StringColumn,
        List[Union[str, FloatColumn, StringColumn]],
    ],
    unit: str,
    comparison_only: bool = False,
)

Assigns a new unit to one or more columns.

PARAMETER	DESCRIPTION
`cols`	The columns or the names thereof. TYPE: `Union[str, FloatColumn, StringColumn, List[Union[str, FloatColumn, StringColumn]]]`
`unit`	The unit to be assigned. TYPE: `str`
`comparison_only`	Whether you want the column to be used for comparison only. This means that the column can only be used in comparison to other columns of the same unit. An example might be a bank account number: The number in itself is hardly interesting, but it might be useful to know how often we have seen that same bank account number in another table. TYPE: `bool` DEFAULT: `False`
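
In the source below, `comparison_only=True` is implemented by appending a marker constant to the unit string before it is sent to the Engine. A minimal sketch of that convention follows; `build_unit` is a hypothetical helper, and the marker value used here is a placeholder, not necessarily the library's actual `COMPARISON_ONLY` constant.

```python
# Placeholder marker; the real value is defined inside the getml package.
COMPARISON_ONLY = ", comparison only"


def build_unit(unit: str, comparison_only: bool = False) -> str:
    """Return the unit string, tagged when it is for comparison only."""
    if not isinstance(unit, str):
        raise TypeError("Parameter 'unit' must be a str.")
    return unit + COMPARISON_ONLY if comparison_only else unit


plain = build_unit("account number")
tagged = build_unit("account number", comparison_only=True)
```

Encoding the flag in the unit string means no separate field is needed, at the cost of reserving the marker suffix.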

Source code in getml/data/data_frame.py

def set_unit(
    self,
    cols: Union[
        str, FloatColumn, StringColumn, List[Union[str, FloatColumn, StringColumn]]
    ],
    unit: str,
    comparison_only: bool = False,
):
    """Assigns a new unit to one or more columns.

    Args:
        cols:
            The columns or the names thereof.

        unit:
            The unit to be assigned.

        comparison_only:
            Whether you want the column to
            be used for comparison only. This means that the column can
            only be used in comparison to other columns of the same unit.

            An example might be a bank account number: The number in itself
            is hardly interesting, but it might be useful to know how often
            we have seen that same bank account number in another table.
    """

    names = _handle_cols(cols)

    if not isinstance(unit, str):
        raise TypeError("Parameter 'unit' must be a str.")

    if comparison_only:
        unit += COMPARISON_ONLY

    for name in names:
        self._set_unit(name, unit)

    self.refresh()

to_arrow

to_arrow() -> Table

Creates a pyarrow.Table from the current instance.

Loads the underlying data from the getML Engine and constructs a pyarrow.Table.

RETURNS	DESCRIPTION
`Table`	Pyarrow equivalent of the current instance including its underlying data.

Source code in getml/data/data_frame.py

def to_arrow(self) -> pa.Table:
    """Creates a `pyarrow.Table` from the current instance.

    Loads the underlying data from the getML Engine and constructs
    a `pyarrow.Table`.

    Returns:
            Pyarrow equivalent of the current instance including its underlying data.
    """
    return to_arrow(self)

to_arrow_stream

to_arrow_stream() -> Iterator[RecordBatchReader]

Streams the dataframe as an Apache Arrow pa.RecordBatchReader.

This method provides a way to access the dataframe as an Apache Arrow stream. Apache Arrow is a cross-language development platform for in-memory data that specifies a standardized language-independent columnar memory format. Using to_arrow_stream allows for efficient, zero-copy (or near zero-copy) data exchange with other systems that support Arrow, such as DuckDB, Pandas, Polars, and various data processing engines.

The method is a context manager (used with a with statement). This ensures that any underlying resources associated with the stream, such as network connections or temporary files, are properly initialized when entering the with block and cleaned up when exiting.

The pa.RecordBatchReader yielded by this context manager allows you to read the dataset iteratively as a sequence of Arrow RecordBatch objects. Each RecordBatch represents a chunk of the dataset's columns.

YIELDS	DESCRIPTION
`RecordBatchReader`	pa.RecordBatchReader: An iterator-like object that yields Apache Arrow `RecordBatch` instances.
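
The context-manager pattern described above can be sketched with the standard library alone. This is not getML's implementation: `open_batches` and `events` are hypothetical names, and plain Python lists stand in for Arrow `RecordBatch` objects. The sketch only illustrates that setup runs on entering the `with` block, batches are read iteratively inside it, and cleanup runs on exit.

```python
from contextlib import contextmanager
from typing import Iterator, List

events: List[str] = []  # records when the (pretend) resource is opened/closed


@contextmanager
def open_batches(rows: List[int], batch_size: int) -> Iterator[Iterator[List[int]]]:
    """Hypothetical stand-in for a stream that yields chunks of rows."""
    events.append("open")  # resource acquired on entering the with block
    try:
        yield (rows[i : i + batch_size] for i in range(0, len(rows), batch_size))
    finally:
        events.append("close")  # resource released on leaving the with block


with open_batches(list(range(5)), batch_size=2) as reader:
    # Consume the stream while the resource is still open.
    batches = list(reader)

print(batches)
print(events)
```

As with the real stream, the reader should be consumed inside the `with` block, because the resource behind it is released on exit.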

Example

Integrating with DuckDB for SQL-based analysis:

import getml
import duckdb

getml.set_project("arrow_stream")

generated, _ = getml.datasets.make_numerical()

con = duckdb.connect()

# Use the context manager to get the Arrow stream
with generated.to_arrow_stream() as arrow_stream_reader:
    # Register the Arrow stream as a duckdb relation
    con.register("generated", arrow_stream_reader)

    # Now you can query the data using SQL
    count = con.execute("SELECT COUNT(*) FROM generated").df()
    print(count)
# count_star()
# 500

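with generated.to_arrow_stream() as arrow_stream_reader:
    # A stream can only be consumed once, so register the fresh reader again
    con.register("generated", arrow_stream_reader)
    summary = con.execute("SUMMARIZE generated").df()
    print(summary)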
#   column_name   column_type                            min                            max                           q50                         q75 count null_percentage
# 0    join_key       varchar                              0                             99                          none                        none   500             0.0
# 1   column_01        double            -0.9939631176936072             0.9966827171572035         -0.012749915265777218         0.49808863342423293   500             0.0
# 2     targets        double                            0.0                          152.0            111.92857142857143          126.16666666666667   500             0.0
# 3  time_stamp  timestamp_ns  1970-01-01 00:00:00.003686084  1970-01-01 00:00:00.998338199    1970-01-01 00:00:00.497298  1970-01-01 00:00:00.747046   500             0.0

# [4 rows x 12 columns]

Source code in getml/data/data_frame.py

@contextmanager
def to_arrow_stream(self) -> Iterator[pa.RecordBatchReader]:
    """
    Streams the dataframe as an Apache Arrow `pa.RecordBatchReader`.

    This method provides a way to access the dataframe as an Apache Arrow
    stream. Apache Arrow is a cross-language development platform for
    in-memory data that specifies a standardized language-independent
    columnar memory format. Using `to_arrow_stream` allows for efficient,
    zero-copy (or near zero-copy) data exchange with other systems that
    support Arrow, such as DuckDB, Pandas, Polars, and various data
    processing engines.

    The method is a context manager (used with a `with` statement). This
    ensures that any underlying resources associated with the stream,
    such as network connections or temporary files, are properly initialized
    when entering the `with` block and cleaned up when exiting.

    The `pa.RecordBatchReader` yielded by this context manager allows you to
    read the dataset iteratively as a sequence of Arrow `RecordBatch` objects.
    Each `RecordBatch` represents a chunk of the dataset's columns.

    Yields:
        pa.RecordBatchReader: An iterator-like object that yields Apache
            Arrow `RecordBatch` instances.

    ??? example
        Integrating with DuckDB for SQL-based analysis:
        ```python
        import getml
        import duckdb

        getml.set_project("arrow_stream")

        generated, _ = getml.datasets.make_numerical()

        con = duckdb.connect()

        # Use the context manager to get the Arrow stream
        with generated.to_arrow_stream() as arrow_stream_reader:
            # Register the Arrow stream as a duckdb relation
            con.register("generated", arrow_stream_reader)

            # Now you can query the data using SQL
            count = con.execute("SELECT COUNT(*) FROM generated").df()
            print(count)
        # count_star()
        # 500

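        with generated.to_arrow_stream() as arrow_stream_reader:
            # A stream can only be consumed once, so register the fresh reader again
            con.register("generated", arrow_stream_reader)
            summary = con.execute("SUMMARIZE generated").df()
            print(summary)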
        #   column_name   column_type                            min                            max                           q50                         q75 count null_percentage
        # 0    join_key       varchar                              0                             99                          none                        none   500             0.0
        # 1   column_01        double            -0.9939631176936072             0.9966827171572035         -0.012749915265777218         0.49808863342423293   500             0.0
        # 2     targets        double                            0.0                          152.0            111.92857142857143          126.16666666666667   500             0.0
        # 3  time_stamp  timestamp_ns  1970-01-01 00:00:00.003686084  1970-01-01 00:00:00.998338199    1970-01-01 00:00:00.497298  1970-01-01 00:00:00.747046   500             0.0

        # [4 rows x 12 columns]
        ```
    """
    with to_arrow_stream(self) as stream:
        yield stream

to_csv

to_csv(
    fname: str,
    quotechar: str = '"',
    sep: str = ",",
    batch_size: int = DEFAULT_BATCH_SIZE,
    quoting_style: str = "needed",
)

Writes the underlying data into a newly created CSV file.

PARAMETER	DESCRIPTION
`fname`	The name of the CSV file. The ending ".csv" and an optional batch number will be added automatically. TYPE: `str`
`quotechar`	The character used to wrap strings. TYPE: `str` DEFAULT: `'"'`
`sep`	The character used for separating fields. TYPE: `str` DEFAULT: `','`
`batch_size`	Maximum number of lines per file. Set to 0 to write the entire data frame into a single file. TYPE: `int` DEFAULT: `DEFAULT_BATCH_SIZE`
`quoting_style`	The quoting style to use. Delegated to pyarrow. The following values are accepted: - `"needed"` (default): only enclose values in quotes when needed. - `"all_valid"`: enclose all valid values in quotes; nulls are not quoted. - `"none"`: do not enclose any values in quotes; values containing special characters (such as quotes, cell delimiters or line endings) will raise an error. TYPE: `str` DEFAULT: `'needed'`
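
The three quoting styles can be illustrated with Python's standard `csv` module. This is an analogy only: getML delegates quoting to pyarrow, not to `csv`, and the mapping below (`QUOTE_MINIMAL` for `"needed"`, `QUOTE_ALL` for `"all_valid"` on data without nulls, `QUOTE_NONE` for `"none"`) is an assumption made for illustration. `render` is a hypothetical helper.

```python
import csv
import io
from typing import List


def render(rows: List[List[str]], quoting: int) -> str:
    """Write rows to an in-memory CSV using the given stdlib quoting mode."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, quoting=quoting, lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue()


rows = [["id", "comment"], ["1", "plain"], ["2", "has, comma"]]

# Analogue of "needed": quote only fields containing special characters.
needed = render(rows, csv.QUOTE_MINIMAL)

# Rough analogue of "all_valid" for data without nulls: quote every field.
all_quoted = render(rows, csv.QUOTE_ALL)

# Analogue of "none": a field that would need quoting raises an error.
try:
    render(rows, csv.QUOTE_NONE)
    none_failed = False
except csv.Error:
    none_failed = True

print(needed)
print(all_quoted)
print(none_failed)
```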

Deprecated

1.5: The quotechar parameter is deprecated.

Source code in getml/data/data_frame.py

def to_csv(
    self,
    fname: str,
    quotechar: str = '"',
    sep: str = ",",
    batch_size: int = DEFAULT_BATCH_SIZE,
    quoting_style: str = "needed",
):
    """
    Writes the underlying data into a newly created CSV file.

    Args:
        fname:
            The name of the CSV file.
            The ending ".csv" and an optional batch number will
            be added automatically.

        quotechar:
            The character used to wrap strings.

        sep:
            The character used for separating fields.

        batch_size:
            Maximum number of lines per file. Set to 0 to write
            the entire data frame into a single file.

        quoting_style (str):
            The quoting style to use. Delegated to pyarrow.

            The following values are accepted:
            - `"needed"` (default): only enclose values in quotes when needed.
            - `"all_valid"`: enclose all valid values in quotes; nulls are not
              quoted.
            - `"none"`: do not enclose any values in quotes; values containing
              special characters (such as quotes, cell delimiters or line
              endings) will raise an error.

    Deprecated:
       1.5: The `quotechar` parameter is deprecated.
    """

    if quotechar != '"':
        warnings.warn(
            "'quotechar' is deprecated, use 'quoting_style' instead.",
            DeprecationWarning,
        )

    to_csv(self, fname, sep, batch_size, quoting_style)

to_db

to_db(table_name: str, conn: Optional[Connection] = None)

Writes the underlying data into a newly created table in the database.

PARAMETER	DESCRIPTION
`table_name`	Name of the table to be created. If a table of that name already exists, it will be replaced. TYPE: `str`
`conn`	The database connection to be used. If you don't explicitly pass a connection, the Engine will use the default connection. TYPE: `Optional[Connection]` DEFAULT: `None`

Source code in getml/data/data_frame.py

def to_db(self, table_name: str, conn: Optional[Connection] = None):
    """Writes the underlying data into a newly created table in the
    database.

    Args:
        table_name:
            Name of the table to be created.

            If a table of that name already exists, it will be
            replaced.

        conn:
            The database connection to be used.
            If you don't explicitly pass a connection,
            the Engine will use the default connection.
    """

    conn = conn or database.Connection()

    self.refresh()

    if not isinstance(table_name, str):
        raise TypeError("'table_name' must be of type str")

    cmd = {}

    cmd["type_"] = "DataFrame.to_db"
    cmd["name_"] = self.name

    cmd["table_name_"] = table_name

    cmd["conn_id_"] = conn.conn_id

    comm.send(cmd)

to_html

to_html(max_rows: int = 10)

Represents the data frame in HTML format, optimized for an IPython notebook.

PARAMETER	DESCRIPTION
`max_rows`	The maximum number of rows to be displayed. TYPE: `int` DEFAULT: `10`

Source code in getml/data/data_frame.py

def to_html(self, max_rows: int = 10):
    """
    Represents the data frame in HTML format, optimized for an
    IPython notebook.

    Args:
        max_rows:
            The maximum number of rows to be displayed.
    """

    if not _exists_in_memory(self.name):
        return _empty_data_frame().replace("\n", "<br>")

    formatted = self._format()
    formatted.max_rows = max_rows

    footer = self._collect_footer_data()

    return formatted._render_html(footer=footer)

to_json

to_json()

Creates a JSON string from the current instance.

Loads the underlying data from the getML Engine and constructs a JSON string.
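
Because the source shows `to_json` delegating to `pandas.DataFrame.to_json()` with default arguments, the result follows pandas' default column orientation: a mapping from column name to a mapping from row index to value. The sketch below mimics that layout with the standard `json` module so the shape is easy to see; the column data is made up for illustration and no getML or pandas call is involved.

```python
import json

# Made-up column data standing in for a small data frame.
columns = {"fruit": ["banana", "apple"], "price": [2.4, 3.0]}

# Column-oriented layout: {column name: {row index as string: value}}.
payload = json.dumps(
    {name: {str(i): value for i, value in enumerate(values)} for name, values in columns.items()}
)

decoded = json.loads(payload)
first_fruit = decoded["fruit"]["0"]
prices = [decoded["price"][str(i)] for i in range(2)]
print(payload)
```

Consumers that expect one record per row need to pivot this structure themselves.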

Source code in getml/data/data_frame.py

def to_json(self):
    """Creates a JSON string from the current instance.

    Loads the underlying data from the getML Engine and constructs
    a JSON string.
    """
    return self.to_pandas().to_json()

to_pandas

to_pandas() -> DataFrame

Creates a pandas.DataFrame from the current instance.

Loads the underlying data from the getML Engine and constructs a pandas.DataFrame.

RETURNS	DESCRIPTION
`DataFrame`	Pandas equivalent of the current instance including its underlying data.
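
The source of `to_pandas` also copies getML-specific metadata from the Arrow schema into `df.attrs["getml"]` by decoding a JSON payload stored under the bytes key `b"getml"`. A minimal sketch of that decoding step, using a plain dict in place of the Arrow schema metadata; the payload keys shown (`name`, `roles`) are assumptions for illustration, not a documented layout.

```python
import json

# Stand-in for table.schema.metadata: Arrow stores metadata as bytes -> bytes.
# The keys inside the payload are hypothetical.
schema_metadata = {
    b"getml": json.dumps({"name": "fruits", "roles": {"numerical": ["price"]}}).encode("utf-8")
}

# Same decoding step as in to_pandas: parse the JSON bytes into a dict.
attrs = {"getml": json.loads(schema_metadata[b"getml"])}
print(attrs["getml"]["name"])
```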

Source code in getml/data/data_frame.py

def to_pandas(self) -> pd.DataFrame:
    """Creates a `pandas.DataFrame` from the current instance.

    Loads the underlying data from the getML Engine and constructs
    a `pandas.DataFrame`.

    Returns:
            Pandas equivalent of the current instance including
            its underlying data.

    """
    table = to_arrow(self)
    df = table.to_pandas()
    df.attrs = {"getml": json.loads(table.schema.metadata[b"getml"])}
    return df

to_parquet

to_parquet(
    fname: str,
    compression: Literal[
        "brotli", "gzip", "lz4", "snappy", "zstd"
    ] = "snappy",
    coerce_timestamps: Optional[bool] = None,
) -> None

Writes the underlying data into a newly created parquet file.

PARAMETER	DESCRIPTION
`fname`	The name of the parquet file. TYPE: `str`
`compression`	The compression format to use. Supported values are "brotli", "gzip", "lz4", "snappy", "zstd" TYPE: `Literal['brotli', 'gzip', 'lz4', 'snappy', 'zstd']` DEFAULT: `'snappy'`
`coerce_timestamps`	Cast time stamps to a particular resolution. For details, refer to `pyarrow.parquet.ParquetWriter` docs. TYPE: `Optional[bool]` DEFAULT: `None`

Source code in getml/data/data_frame.py

def to_parquet(
    self,
    fname: str,
    compression: Literal["brotli", "gzip", "lz4", "snappy", "zstd"] = "snappy",
    coerce_timestamps: Optional[bool] = None,
) -> None:
    """
    Writes the underlying data into a newly created parquet file.

    Args:
        fname:
            The name of the parquet file.

        compression:
            The compression format to use.
            Supported values are "brotli", "gzip", "lz4", "snappy", "zstd"
        coerce_timestamps:
            Cast time stamps to a particular resolution.
            For details, refer to `pyarrow.parquet.ParquetWriter` docs.
    """
    to_parquet(self, fname, compression, coerce_timestamps=coerce_timestamps)

to_placeholder

to_placeholder(name: Optional[str] = None) -> Placeholder

Generates a Placeholder from the current DataFrame.

PARAMETER	DESCRIPTION
`name`	The name of the placeholder. If no name is passed, then the name of the placeholder will be identical to the name of the current data frame. TYPE: `Optional[str]` DEFAULT: `None`

RETURNS	DESCRIPTION
`Placeholder`	A placeholder carrying the roles of this data frame. It is named `name` if a name is passed and after this data frame otherwise.
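
The naming rule comes from the expression `name or self.name` in the source. A minimal sketch of that fallback, with `PlaceholderSketch` and `to_placeholder_sketch` as hypothetical stand-ins for the real classes:

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class PlaceholderSketch:
    """Hypothetical stand-in for getml.data.Placeholder."""

    name: str
    roles: Dict[str, List[str]]


def to_placeholder_sketch(
    frame_name: str, roles: Dict[str, List[str]], name: Optional[str] = None
) -> PlaceholderSketch:
    # `or` falls back to the frame name for None and for the empty string.
    return PlaceholderSketch(name=name or frame_name, roles=roles)


roles = {"numerical": ["price"]}
default_named = to_placeholder_sketch("fruits", roles)
custom_named = to_placeholder_sketch("fruits", roles, name="population")
empty_named = to_placeholder_sketch("fruits", roles, name="")
print(default_named.name, custom_named.name, empty_named.name)
```

Note that with `or`, an empty string is treated like `None`.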

Source code in getml/data/data_frame.py

def to_placeholder(self, name: Optional[str] = None) -> Placeholder:
    """Generates a [`Placeholder`][getml.data.Placeholder] from the
    current [`DataFrame`][getml.DataFrame].

    Args:
        name:
            The name of the placeholder. If no
            name is passed, then the name of the placeholder will
            be identical to the name of the current data frame.

    Returns:
            A placeholder carrying the roles of this data frame. It is named
            `name` if a name is passed and after this data frame otherwise.


    """
    self.refresh()
    return Placeholder(name=name or self.name, roles=self.roles)

to_pyspark

to_pyspark(
    spark: SparkSession, name: Optional[str] = None
) -> DataFrame

Creates a pyspark.sql.DataFrame from the current instance.

Loads the underlying data from the getML Engine and constructs a pyspark.sql.DataFrame.

PARAMETER	DESCRIPTION
`spark`	The pyspark session in which you want to create the data frame. TYPE: `SparkSession`
`name`	The name of the temporary view to be created on top of the `pyspark.sql.DataFrame`, with which it can be referred to in Spark SQL (refer to `pyspark.sql.DataFrame.createOrReplaceTempView`). If None is passed, then the name of this `DataFrame` will be used. TYPE: `Optional[str]` DEFAULT: `None`

RETURNS	DESCRIPTION
`DataFrame`	Pyspark equivalent of the current instance including its underlying data.

Source code in getml/data/data_frame.py

def to_pyspark(
    self, spark: pyspark.sql.SparkSession, name: Optional[str] = None
) -> pyspark.sql.DataFrame:
    """Creates a `pyspark.sql.DataFrame` from the current instance.

    Loads the underlying data from the getML Engine and constructs
    a `pyspark.sql.DataFrame`.

    Args:
        spark:
            The pyspark session in which you want to
            create the data frame.

        name:
            The name of the temporary view to be created on top
            of the `pyspark.sql.DataFrame`,
            with which it can be referred to
            in Spark SQL (refer to
            `pyspark.sql.DataFrame.createOrReplaceTempView`).
            If None is passed, then the name of this
            [`DataFrame`][getml.DataFrame] will be used.

    Returns:
            Pyspark equivalent of the current instance including its underlying data.

    """
    return _to_pyspark(self, name, spark)

to_s3

to_s3(
    bucket: str,
    key: str,
    region: str,
    sep: Optional[str] = ",",
    batch_size: Optional[int] = 50000,
)

Writes the underlying data into a newly created CSV file located in an S3 bucket.

Enterprise edition

This feature is exclusive to the Enterprise edition and is not available in the Community edition. Discover the benefits of the Enterprise edition and compare their features.

For licensing information and technical support, please contact us.

Note

Note that S3 is not supported on Windows.

PARAMETER	DESCRIPTION
`bucket`	The bucket to which the output is written. TYPE: `str`
`key`	The key in the S3 bucket in which you want to write the output. The ending ".csv" and an optional batch number will be added automatically. TYPE: `str`
`region`	The region in which the bucket is located. TYPE: `str`
`sep`	The character used for separating fields. TYPE: `Optional[str]` DEFAULT: `','`
`batch_size`	Maximum number of lines per file. Set to 0 to write the entire data frame into a single file. TYPE: `Optional[int]` DEFAULT: `50000`
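
The effect of `batch_size` on the number of output files can be sketched as a simple row-range computation. `batch_ranges` is a hypothetical helper; the actual splitting is done by the Engine, and the naming of the resulting files is not covered here.

```python
from typing import List, Tuple


def batch_ranges(n_rows: int, batch_size: int) -> List[Tuple[int, int]]:
    """Return half-open (start, stop) row ranges, one per output file."""
    if batch_size == 0:
        # 0 means "no batching": everything goes into a single file.
        return [(0, n_rows)]
    return [
        (start, min(start + batch_size, n_rows))
        for start in range(0, n_rows, batch_size)
    ]


ranges = batch_ranges(120_000, 50_000)
single = batch_ranges(120_000, 0)
print(ranges)
print(single)
```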

Example

getml.engine.set_s3_access_key_id("YOUR-ACCESS-KEY-ID")
getml.engine.set_s3_secret_access_key("YOUR-SECRET-ACCESS-KEY")

your_df.to_s3(
    bucket="your-bucket-name",
    key="filename-on-s3",
    region="us-east-2",
    sep=';'
)

Source code in getml/data/data_frame.py

def to_s3(
    self,
    bucket: str,
    key: str,
    region: str,
    sep: Optional[str] = ",",
    batch_size: Optional[int] = 50000,
):
    """
    Writes the underlying data into a newly created CSV file
    located in an S3 bucket.

    enterprise-adm: Enterprise edition
        This feature is exclusive to the Enterprise edition and is not available in the Community edition. Discover the [benefits of the Enterprise edition][enterprise-benefits] and [compare their features][enterprise-feature-list].

        For licensing information and technical support, please [contact us][contact-page].

    Note:
        Note that S3 is not supported on Windows.

    Args:
        bucket:
            The bucket to which the output is written.

        key:
            The key in the S3 bucket in which you want to
            write the output. The ending ".csv" and an optional
            batch number will be added automatically.

        region:
            The region in which the bucket is located.

        sep:
            The character used for separating fields.

        batch_size:
            Maximum number of lines per file. Set to 0 to write
            the entire data frame into a single file.

    ??? example
        ```python
        getml.engine.set_s3_access_key_id("YOUR-ACCESS-KEY-ID")
        getml.engine.set_s3_secret_access_key("YOUR-SECRET-ACCESS-KEY")

        your_df.to_s3(
            bucket="your-bucket-name",
            key="filename-on-s3",
            region="us-east-2",
            sep=';'
        )
        ```

    """

    self.refresh()

    if not isinstance(bucket, str):
        raise TypeError("'bucket' must be of type str")

    if not isinstance(key, str):
        raise TypeError("'key' must be of type str")

    if not isinstance(region, str):
        raise TypeError("'region' must be of type str")

    if not isinstance(sep, str):
        raise TypeError("'sep' must be of type str")

    if not isinstance(batch_size, numbers.Real):
        raise TypeError("'batch_size' must be a real number")

    cmd: Dict[str, Any] = {}

    cmd["type_"] = "DataFrame.to_s3"
    cmd["name_"] = self.name

    cmd["bucket_"] = bucket
    cmd["key_"] = key
    cmd["region_"] = region
    cmd["sep_"] = sep
    cmd["batch_size_"] = batch_size

    comm.send(cmd)

unload

unload()

Unloads the data frame from memory.

Source code in getml/data/data_frame.py

def unload(self):
    """
    Unloads the data frame from memory.
    """

    # ------------------------------------------------------------

    self._delete(mem_only=True)

where

where(
    index: Union[
        Integral,
        slice,
        BooleanColumnView,
        FloatColumnView,
        FloatColumn,
    ]
) -> View

Extract a subset of rows.

Creates a new View as a subselection of the current instance.

PARAMETER	DESCRIPTION
`index`	Indicates the rows you want to select. TYPE: `Union[Integral, slice, BooleanColumnView, FloatColumnView, FloatColumn]`

RETURNS	DESCRIPTION
`View`	A new `View` containing the selected rows.

Example

Generate example data:

data = dict(
    fruit=["banana", "apple", "cherry", "cherry", "melon", "pineapple"],
    price=[2.4, 3.0, 1.2, 1.4, 3.4, 3.4],
    join_key=["0", "1", "2", "2", "3", "3"])

fruits = getml.DataFrame.from_dict(data, name="fruits",
roles={"categorical": ["fruit"], "join_key": ["join_key"], "numerical": ["price"]})

fruits

| join_key | fruit       | price     |
| join key | categorical | numerical |
--------------------------------------
| 0        | banana      | 2.4       |
| 1        | apple       | 3         |
| 2        | cherry      | 1.2       |
| 2        | cherry      | 1.4       |
| 3        | melon       | 3.4       |
| 3        | pineapple   | 3.4       |

Apply where condition. This creates a new View called "cherries":

cherries = fruits.where(
    fruits["fruit"] == "cherry")

cherries

| join_key | fruit       | price     |
| join key | categorical | numerical |
--------------------------------------
| 2        | cherry      | 1.2       |
| 2        | cherry      | 1.4       |

Source code in getml/data/data_frame.py

def where(
    self,
    index: Union[
        numbers.Integral, slice, BooleanColumnView, FloatColumnView, FloatColumn
    ],
) -> View:
    """Extract a subset of rows.

    Creates a new [`View`][getml.data.View] as a
    subselection of the current instance.

    Args:
        index:
            Indicates the rows you want to select.

    Returns:
            A new [`View`][getml.data.View] containing the selected rows.

    ??? example
        Generate example data:
        ```python
        data = dict(
            fruit=["banana", "apple", "cherry", "cherry", "melon", "pineapple"],
            price=[2.4, 3.0, 1.2, 1.4, 3.4, 3.4],
            join_key=["0", "1", "2", "2", "3", "3"])

        fruits = getml.DataFrame.from_dict(data, name="fruits",
        roles={"categorical": ["fruit"], "join_key": ["join_key"], "numerical": ["price"]})

        fruits
        ```
        ```
        | join_key | fruit       | price     |
        | join key | categorical | numerical |
        --------------------------------------
        | 0        | banana      | 2.4       |
        | 1        | apple       | 3         |
        | 2        | cherry      | 1.2       |
        | 2        | cherry      | 1.4       |
        | 3        | melon       | 3.4       |
        | 3        | pineapple   | 3.4       |
        ```
        Apply where condition. This creates a new View called "cherries":

        ```python
        cherries = fruits.where(
            fruits["fruit"] == "cherry")

        cherries
        ```
        ```
        | join_key | fruit       | price     |
        | join key | categorical | numerical |
        --------------------------------------
        | 2        | cherry      | 1.2       |
        | 2        | cherry      | 1.4       |
        ```

    """

    return _where(self, index)

with_column

with_column(
    col: Union[
        bool,
        str,
        float,
        int,
        datetime64,
        FloatColumn,
        FloatColumnView,
        StringColumn,
        StringColumnView,
        BooleanColumnView,
    ],
    name: str,
    role: Optional[Role] = None,
    subroles: Optional[
        Union[Subrole, Iterable[str]]
    ] = None,
    unit: str = "",
    time_formats: Optional[Iterable[str]] = None,
)

Returns a new View that contains an additional column.

PARAMETER	DESCRIPTION
`col`	The column to be added. TYPE: `Union[bool, str, float, int, datetime64, FloatColumn, FloatColumnView, StringColumn, StringColumnView, BooleanColumnView]`
`name`	Name of the new column. TYPE: `str`
`role`	Role of the new column. Must be from `roles`. TYPE: `Optional[Role]` DEFAULT: `None`
`subroles`	Subroles of the new column. Must be from `subroles`. TYPE: `Optional[Union[Subrole, Iterable[str]]]` DEFAULT: `None`
`unit`	Unit of the column. TYPE: `str` DEFAULT: `''`
`time_formats`	Formats to be used to parse the time stamps. This is only necessary, if an implicit conversion from a `StringColumn` to a time stamp is taking place. The formats are allowed to contain the following special characters: `%w` - abbreviated weekday (Mon, Tue, ...); `%W` - full weekday (Monday, Tuesday, ...); `%b` - abbreviated month (Jan, Feb, ...); `%B` - full month (January, February, ...); `%d` - zero-padded day of month (01 .. 31); `%e` - day of month (1 .. 31); `%f` - space-padded day of month ( 1 .. 31); `%m` - zero-padded month (01 .. 12); `%n` - month (1 .. 12); `%o` - space-padded month ( 1 .. 12); `%y` - year without century (70); `%Y` - year with century (1970); `%H` - hour (00 .. 23); `%h` - hour (00 .. 12); `%a` - am/pm; `%A` - AM/PM; `%M` - minute (00 .. 59); `%S` - second (00 .. 59); `%s` - seconds and microseconds (equivalent to %S.%F); `%i` - millisecond (000 .. 999); `%c` - centisecond (0 .. 9); `%F` - fractional seconds/microseconds (000000 - 999999); `%z` - time zone differential in ISO 8601 format (Z or +NN.NN); `%Z` - time zone differential in RFC format (GMT or +NNNN); `%%` - percent sign. TYPE: `Optional[Iterable[str]]` DEFAULT: `None`
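
To give a feel for how a list of time formats is applied to string values, the sketch below uses Python's `datetime.strptime`. This is an analogy, not getML's parser: it only uses directives (`%Y`, `%m`, `%d`, `%H`, `%M`, `%S`) that have the same meaning in the list above and in Python, and trying the formats one after another is an assumption made for illustration. `parse_first_match` is a hypothetical helper.

```python
from datetime import datetime
from typing import Iterable, Optional


def parse_first_match(value: str, formats: Iterable[str]) -> Optional[datetime]:
    """Return the result of the first format that parses the value, else None."""
    for fmt in formats:
        try:
            return datetime.strptime(value, fmt)
        except ValueError:
            continue  # this format does not match; try the next one
    return None


time_formats = ["%Y-%m-%d %H:%M:%S", "%Y-%m-%d"]

with_time = parse_first_match("2019-03-04 12:30:00", time_formats)
date_only = parse_first_match("2019-03-04", time_formats)
unparsable = parse_first_match("not a date", time_formats)
print(with_time, date_only, unparsable)
```

Directives such as `%e`, `%n`, or `%i` from the list above have no equivalent, or a different meaning, in Python, so they cannot be tested this way.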

Source code in getml/data/data_frame.py

def with_column(
    self,
    col: Union[
        bool,
        str,
        float,
        int,
        np.datetime64,
        FloatColumn,
        FloatColumnView,
        StringColumn,
        StringColumnView,
        BooleanColumnView,
    ],
    name: str,
    role: Optional[Role] = None,
    subroles: Optional[Union[Subrole, Iterable[str]]] = None,
    unit: str = "",
    time_formats: Optional[Iterable[str]] = None,
):
    """Returns a new [`View`][getml.data.View] that contains an additional column.

    Args:
        col:
            The column to be added.

        name:
            Name of the new column.

        role:
            Role of the new column. Must be from [`roles`][getml.data.roles].

        subroles:
            Subroles of the new column. Must be from [`subroles`][getml.data.subroles].

        unit:
            Unit of the column.

        time_formats:
            Formats to be used to parse the time stamps.

            This is only necessary, if an implicit conversion from
            a [`StringColumn`][getml.data.columns.StringColumn] to a time
            stamp is taking place.

            The formats are allowed to contain the following
            special characters:

            * %w - abbreviated weekday (Mon, Tue, ...)
            * %W - full weekday (Monday, Tuesday, ...)
            * %b - abbreviated month (Jan, Feb, ...)
            * %B - full month (January, February, ...)
            * %d - zero-padded day of month (01 .. 31)
            * %e - day of month (1 .. 31)
            * %f - space-padded day of month ( 1 .. 31)
            * %m - zero-padded month (01 .. 12)
            * %n - month (1 .. 12)
            * %o - space-padded month ( 1 .. 12)
            * %y - year without century (70)
            * %Y - year with century (1970)
            * %H - hour (00 .. 23)
            * %h - hour (00 .. 12)
            * %a - am/pm
            * %A - AM/PM
            * %M - minute (00 .. 59)
            * %S - second (00 .. 59)
            * %s - seconds and microseconds (equivalent to %S.%F)
            * %i - millisecond (000 .. 999)
            * %c - centisecond (0 .. 9)
            * %F - fractional seconds/microseconds (000000 - 999999)
            * %z - time zone differential in ISO 8601 format (Z or +NN.NN)
            * %Z - time zone differential in RFC format (GMT or +NNNN)
            * %% - percent sign

    """
    col, role, subroles = _with_column(
        col, name, role, subroles, unit, time_formats
    )
    return View(
        base=self,
        added={
            "col_": col,
            "name_": name,
            "role_": role,
            "subroles_": subroles,
            "unit_": unit,
        },
    )

with_name

with_name(name: str) -> View

Returns a new View with a new name.

PARAMETER	DESCRIPTION
`name`	The name of the new view. TYPE: `str`

RETURNS	DESCRIPTION
`View`	A new view with the new name.

Source code in getml/data/data_frame.py

def with_name(self, name: str) -> View:
    """Returns a new [`View`][getml.data.View] with a new name.

    Args:
        name:
            The name of the new view.

    Returns:
        A new view with the new name.
    """
    return View(base=self, name=name)

with_role

with_role(
    cols: Union[
        str,
        FloatColumn,
        StringColumn,
        Union[
            Iterable[str],
            List[FloatColumn],
            List[StringColumn],
        ],
    ],
    role: Role,
    time_formats: Optional[Iterable[str]] = None,
)

Returns a new View with modified roles.

The difference between with_role and set_role is that with_role returns a view that is lazily evaluated when needed whereas set_role is an in-place operation. From a memory perspective, in-place operations like set_role are preferable.

When switching from a role based on type float to a role based on type string or vice versa, an implicit type conversion will be conducted. The time_formats argument is used to interpret time stamp strings during that conversion (see annotating_roles_time_stamp). For more information on roles, please refer to the User Guide.

PARAMETER	DESCRIPTION
`cols`	The columns or the names thereof. TYPE: `Union[str, FloatColumn, StringColumn, Union[Iterable[str], List[FloatColumn], List[StringColumn]]]`
`role`	The role to be assigned. TYPE: `Role`
`time_formats`	Formats to be used to parse the time stamps. This is only necessary, if an implicit conversion from a StringColumn to a time stamp is taking place. TYPE: `Optional[Iterable[str]]` DEFAULT: `None`
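
The distinction between `with_role` and `set_role` can be sketched with two small classes. `FrameSketch` and `ViewSketch` are hypothetical and hold only a role mapping; they are not getML's `DataFrame` and `View`, and the role names are used as plain strings for illustration. The sketch shows only that the `with_` variant leaves the base object untouched and resolves roles when asked, while the `set_` variant mutates the base.

```python
from typing import Dict


class ViewSketch:
    """Hypothetical lazy view: stores the base and the requested changes."""

    def __init__(self, base: "FrameSketch", overrides: Dict[str, str]) -> None:
        self.base = base
        self.overrides = overrides

    @property
    def roles(self) -> Dict[str, str]:
        # Resolved on access; nothing is copied or changed up front.
        return {**self.base.roles, **self.overrides}


class FrameSketch:
    """Hypothetical data frame holding only a column -> role mapping."""

    def __init__(self, roles: Dict[str, str]) -> None:
        self.roles = dict(roles)

    def set_role(self, col: str, role: str) -> None:
        self.roles[col] = role  # in-place change

    def with_role(self, col: str, role: str) -> ViewSketch:
        return ViewSketch(self, {col: role})  # base stays as it is


frame = FrameSketch({"price": "unused_float", "fruit": "unused_string"})
view = frame.with_role("price", "numerical")
base_role_after_with = frame.roles["price"]
view_role = view.roles["price"]

frame.set_role("price", "numerical")
base_role_after_set = frame.roles["price"]
print(base_role_after_with, view_role, base_role_after_set)
```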

Source code in getml/data/data_frame.py

def with_role(
    self,
    cols: Union[
        str,
        FloatColumn,
        StringColumn,
        Union[Iterable[str], List[FloatColumn], List[StringColumn]],
    ],
    role: Role,
    time_formats: Optional[Iterable[str]] = None,
):
    """Returns a new [`View`][getml.data.View] with modified roles.

    The difference between [`with_role`][getml.DataFrame.with_role] and
    [`set_role`][getml.DataFrame.set_role] is that
    [`with_role`][getml.DataFrame.with_role] returns a view that is lazily
    evaluated when needed whereas [`set_role`][getml.DataFrame.set_role]
    is an in-place operation. From a memory perspective, in-place operations
    like [`set_role`][getml.DataFrame.set_role] are preferable.

    When switching from a role based on type float to a role based on type
    string or vice versa, an implicit type conversion will be conducted.
    The `time_formats` argument is used to interpret the time format
    strings (see `annotating_roles_time_stamp`). For more information on
    roles, please refer to the [User Guide][annotating-data].

    Args:
        cols:
            The columns or the names thereof.

        role:
            The role to be assigned.

        time_formats:
            Formats to be used to
            parse the time stamps.
            This is only necessary if an implicit conversion from a StringColumn to
            a time stamp is taking place.
    """
    return _with_role(self, cols, role, time_formats)

with_subroles

with_subroles(
    cols: Union[
        str,
        FloatColumn,
        StringColumn,
        Union[
            Iterable[str],
            List[FloatColumn],
            List[StringColumn],
        ],
    ],
    subroles: Union[Subrole, Iterable[str]],
    append: bool = True,
)

Returns a new view with one or several new subroles on one or more columns.

The difference between with_subroles and set_subroles is that with_subroles returns a view that is lazily evaluated when needed whereas set_subroles is an in-place operation. From a memory perspective, in-place operations like set_subroles are preferable.

PARAMETER	DESCRIPTION
`cols`	The columns or the names thereof. TYPE: `Union[str, FloatColumn, StringColumn, Union[Iterable[str], List[FloatColumn], List[StringColumn]]]`
`subroles`	The subroles to be assigned. TYPE: `Union[Subrole, Iterable[str]]`
`append`	Whether you want to append the new subroles to the existing subroles. TYPE: `bool` DEFAULT: `True`
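
Example

The effect of append can be pictured with a small pure-Python toy model. This is not getML's implementation: the function merge_subroles and the placeholder subrole names are hypothetical, and reading append=False as "replace the existing subroles" is an interpretation of the parameter description above.

```python
from typing import Iterable, List


def merge_subroles(
    existing: Iterable[str], new: Iterable[str], append: bool = True
) -> List[str]:
    """Toy model of `append`: extend the existing subroles or replace them."""
    if not append:
        # Assumed behaviour: the new subroles replace the old ones.
        return list(dict.fromkeys(new))
    # Keep the existing subroles, add the new ones, drop duplicates.
    return list(dict.fromkeys([*existing, *new]))


print(merge_subroles(["subrole_a"], ["subrole_b"]))         # ['subrole_a', 'subrole_b']
print(merge_subroles(["subrole_a"], ["subrole_b"], False))  # ['subrole_b']
```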

Source code in getml/data/data_frame.py

def with_subroles(
    self,
    cols: Union[
        str,
        FloatColumn,
        StringColumn,
        Union[Iterable[str], List[FloatColumn], List[StringColumn]],
    ],
    subroles: Union[Subrole, Iterable[str]],
    append: bool = True,
):
    """Returns a new view with one or several new subroles on one or more columns.

    The difference between [`with_subroles`][getml.DataFrame.with_subroles] and
    [`set_subroles`][getml.DataFrame.set_subroles] is that
    [`with_subroles`][getml.DataFrame.with_subroles] returns a view that is lazily
    evaluated when needed whereas [`set_subroles`][getml.DataFrame.set_subroles]
    is an in-place operation. From a memory perspective, in-place operations
    like [`set_subroles`][getml.DataFrame.set_subroles] are preferable.

    Args:
        cols:
            The columns or the names thereof.

        subroles:
            The subroles to be assigned.

        append:
            Whether you want to append the
            new subroles to the existing subroles.
    """
    return _with_subroles(self, cols, subroles, append)

with_unit

with_unit(
    cols: Union[
        str,
        FloatColumn,
        StringColumn,
        Union[
            Iterable[str],
            List[FloatColumn],
            List[StringColumn],
        ],
    ],
    unit: str,
    comparison_only: bool = False,
)

Returns a view that contains a new unit on one or more columns.

The difference between with_unit and set_unit is that with_unit returns a view that is lazily evaluated when needed whereas set_unit is an in-place operation. From a memory perspective, in-place operations like set_unit are preferable.

PARAMETER	DESCRIPTION
`cols`	The columns or the names thereof. TYPE: `Union[str, FloatColumn, StringColumn, Union[Iterable[str], List[FloatColumn], List[StringColumn]]]`
`unit`	The unit to be assigned. TYPE: `str`
`comparison_only`	Whether you want the column to be used for comparison only. This means that the column can only be used in comparison to other columns of the same unit. An example might be a bank account number: The number in itself is hardly interesting, but it might be useful to know how often we have seen that same bank account number in another table. For more information on units, please refer to the User Guide. TYPE: `bool` DEFAULT: `False`
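
Example

A minimal sketch of the bank account case described above. The helper name mark_account_numbers, the column name account_number, and the unit label are all hypothetical; only the with_unit call itself follows the signature documented here.

```python
from typing import Any


def mark_account_numbers(df: Any, col: str = "account_number") -> Any:
    """Return a view in which `col` carries a comparison-only unit.

    `df` is expected to be a getml.DataFrame handler (or a View).
    """
    # "account number" is a free-form unit label chosen for this example.
    # comparison_only=True asks that the column only be compared to other
    # columns of the same unit instead of being used on its own.
    return df.with_unit(col, "account number", comparison_only=True)
```

Because the result is a view, the underlying data frame keeps its original unit; set_unit would change it in place.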

Source code in getml/data/data_frame.py

def with_unit(
    self,
    cols: Union[
        str,
        FloatColumn,
        StringColumn,
        Union[Iterable[str], List[FloatColumn], List[StringColumn]],
    ],
    unit: str,
    comparison_only: bool = False,
):
    """Returns a view that contains a new unit on one or more columns.

    The difference between [`with_unit`][getml.DataFrame.with_unit] and
    [`set_unit`][getml.DataFrame.set_unit] is that
    [`with_unit`][getml.DataFrame.with_unit] returns a view that is lazily
    evaluated when needed whereas [`set_unit`][getml.DataFrame.set_unit]
    is an in-place operation. From a memory perspective, in-place operations
    like [`set_unit`][getml.DataFrame.set_unit] are preferable.

    Args:
        cols:
            The columns or the names thereof.

        unit:
            The unit to be assigned.

        comparison_only:
            Whether you want the column to
            be used for comparison only. This means that the column can
            only be used in comparison to other columns of the same unit.

            An example might be a bank account number: The number in itself
            is hardly interesting, but it might be useful to know how often
            we have seen that same bank account number in another table.

            For more information on units, please refer to the
            [User Guide][annotating-data-units].
    """
    return _with_unit(self, cols, unit, comparison_only)
