getml.database

This module contains communication routines to access various databases.

The connect_bigquery, connect_hana, connect_greenplum, connect_mariadb, connect_mysql, connect_postgres, and connect_sqlite3 functions establish a connection between a database and the getML Engine. During data import, using either the read_db or read_query method of a DataFrame instance or the corresponding from_db class method, all data is loaded directly from the database into the Engine without ever passing through the Python interpreter.
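
For example, a typical workflow first registers a connection and then imports a table into a DataFrame. The following is a minimal sketch; the file name, the table name, and the from_db argument names are placeholders or assumptions and may need to be adapted to your setup.

```python
import getml

# Register a connection on the Engine (any of the connect_* functions
# documented below works the same way). The file name is a placeholder.
conn = getml.database.connect_sqlite3(name="my_database.db")

# Import an existing table straight into the Engine; the data never
# passes through the Python interpreter. Table and frame names are
# placeholders, and the keyword names are assumptions.
df = getml.DataFrame.from_db(table_name="MY_TABLE", name="my_table")
```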

In addition, several auxiliary functions are provided that can be handy when analyzing and interacting with the database.

Connection

Connection(conn_id: str = 'default')

A handle to a database connection on the getML Engine.

ATTRIBUTE DESCRIPTION
conn_id

The name you want to use to reference the connection. You can call it anything you want to. If a database connection with the same conn_id already exists, that connection will be removed automatically and the new connection will take its place. The default conn_id is "default", which refers to the default connection. If you do not explicitly pass a connection handle to any function that relates to a database, the default connection will be used automatically.

Source code in getml/database/connection.py
def __init__(self, conn_id: str = "default"):
    self.conn_id = conn_id
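
Example

A minimal sketch of how connection handles relate to conn_id; the file names below are placeholders.

```python
import getml

# The connect_* functions register a connection on the Engine under a
# conn_id and return a handle to it.
raw = getml.database.connect_sqlite3(name="raw_data.db", conn_id="raw")

# A Connection object is merely a lightweight handle referring to a
# connection that already exists on the Engine.
same_handle = getml.database.Connection(conn_id="raw")

# Whatever is registered under conn_id="default" is used automatically
# by database-related functions unless you pass a handle explicitly.
```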

connect_bigquery

connect_bigquery(
    database_id: str,
    project_id: str,
    google_application_credentials: Union[str, Path],
    time_formats: Optional[List[str]] = None,
    conn_id: str = "default",
) -> Connection

Creates a new BigQuery database connection.

Enterprise edition

This feature is exclusive to the Enterprise edition and is not available in the Community edition. Discover the benefits of the Enterprise edition and compare their features.

For licensing information and technical support, please contact us.

PARAMETER DESCRIPTION
database_id

The ID of the database to connect to.

TYPE: str

project_id

The ID of the project to connect to.

TYPE: str

google_application_credentials

The path of the Google application credentials. (Must be located on the machine hosting the getML Engine).

TYPE: Union[str, Path]

time_formats

The list of formats tried when parsing time stamps.

The formats are allowed to contain the following special characters:

  • %w - abbreviated weekday (Mon, Tue, ...)
  • %W - full weekday (Monday, Tuesday, ...)
  • %b - abbreviated month (Jan, Feb, ...)
  • %B - full month (January, February, ...)
  • %d - zero-padded day of month (01 .. 31)
  • %e - day of month (1 .. 31)
  • %f - space-padded day of month ( 1 .. 31)
  • %m - zero-padded month (01 .. 12)
  • %n - month (1 .. 12)
  • %o - space-padded month ( 1 .. 12)
  • %y - year without century (70)
  • %Y - year with century (1970)
  • %H - hour (00 .. 23)
  • %h - hour (00 .. 12)
  • %a - am/pm
  • %A - AM/PM
  • %M - minute (00 .. 59)
  • %S - second (00 .. 59)
  • %s - seconds and microseconds (equivalent to %S.%F)
  • %i - millisecond (000 .. 999)
  • %c - centisecond (0 .. 9)
  • %F - fractional seconds/microseconds (000000 - 999999)
  • %z - time zone differential in ISO 8601 format (Z or +NN.NN)
  • %Z - time zone differential in RFC format (GMT or +NNNN)
  • %% - percent sign

TYPE: Optional[List[str]] DEFAULT: None

conn_id

The name to be used to reference the connection. If you do not pass anything, this will create a new default connection.

TYPE: str DEFAULT: 'default'

RETURNS DESCRIPTION
Connection

The connection object.

Source code in getml/database/connect_bigquery.py
def connect_bigquery(
    database_id: str,
    project_id: str,
    google_application_credentials: Union[str, Path],
    time_formats: Optional[List[str]] = None,
    conn_id: str = "default",
) -> Connection:
    """
    Creates a new BigQuery database connection.

    enterprise-adm: Enterprise edition
    This feature is exclusive to the Enterprise edition and is not available in the Community edition. Discover the [benefits of the Enterprise edition][enterprise-benefits] and [compare their features][enterprise-feature-list].

    For licensing information and technical support, please [contact us][contact-page].

    Args:
        database_id:
            The ID of the database to connect to.

        project_id:
            The ID of the project to connect to.

        google_application_credentials:
            The path of the Google application credentials.
            (Must be located on the machine hosting the getML Engine).

        time_formats:
            The list of formats tried when parsing time stamps.

            The formats are allowed to contain the following
            special characters:

            * %w - abbreviated weekday (Mon, Tue, ...)
            * %W - full weekday (Monday, Tuesday, ...)
            * %b - abbreviated month (Jan, Feb, ...)
            * %B - full month (January, February, ...)
            * %d - zero-padded day of month (01 .. 31)
            * %e - day of month (1 .. 31)
            * %f - space-padded day of month ( 1 .. 31)
            * %m - zero-padded month (01 .. 12)
            * %n - month (1 .. 12)
            * %o - space-padded month ( 1 .. 12)
            * %y - year without century (70)
            * %Y - year with century (1970)
            * %H - hour (00 .. 23)
            * %h - hour (00 .. 12)
            * %a - am/pm
            * %A - AM/PM
            * %M - minute (00 .. 59)
            * %S - second (00 .. 59)
            * %s - seconds and microseconds (equivalent to %S.%F)
            * %i - millisecond (000 .. 999)
            * %c - centisecond (0 .. 9)
            * %F - fractional seconds/microseconds (000000 - 999999)
            * %z - time zone differential in ISO 8601 format (Z or +NN.NN)
            * %Z - time zone differential in RFC format (GMT or +NNNN)
            * %% - percent sign

        conn_id:
            The name to be used to reference the connection.
            If you do not pass anything, this will create a new default connection.

    Returns:
        The connection object.
    """

    time_formats = time_formats or constants.TIME_FORMATS

    cmd: Dict[str, Any] = {}

    cmd["database_id_"] = database_id
    cmd["project_id_"] = project_id
    cmd["google_application_credentials_"] = os.path.abspath(
        str(google_application_credentials)
    )
    cmd["name_"] = ""
    cmd["type_"] = "Database.new"
    cmd["db_"] = "bigquery"

    cmd["time_formats_"] = time_formats
    cmd["conn_id_"] = conn_id

    with comm.send_and_get_socket(cmd) as sock:
        # The API expects a password, but in this case there is none
        comm.send_string(sock, "")
        msg = comm.recv_string(sock)
        if msg != "Success!":
            comm.handle_engine_exception(msg)

    return Connection(conn_id=conn_id)
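
Example

A usage sketch under assumptions: the database ID, project ID, and credentials path are placeholders, and the custom time_formats list merely illustrates the format codes documented above.

```python
import getml

conn = getml.database.connect_bigquery(
    database_id="my_dataset",    # placeholder
    project_id="my-gcp-project", # placeholder
    # Must be located on the machine hosting the getML Engine.
    google_application_credentials="/home/user/credentials.json",
    time_formats=["%Y-%m-%dT%H:%M:%S%z", "%Y-%m-%d %H:%M:%S"],
    conn_id="bigquery",
)
```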

connect_greenplum

connect_greenplum(
    dbname: str,
    user: str,
    password: str,
    host: str,
    hostaddr: str,
    port: int = 5432,
    time_formats: Optional[List[str]] = None,
    conn_id: str = "default",
) -> Connection

Creates a new Greenplum database connection.

But first, make sure your database is running, and you can reach it from your command line.

Enterprise edition

This feature is exclusive to the Enterprise edition and is not available in the Community edition. Discover the benefits of the Enterprise edition and compare their features.

For licensing information and technical support, please contact us.

PARAMETER DESCRIPTION
dbname

The name of the database to which you want to connect.

TYPE: str

user

Username with which to log into the Greenplum database.

TYPE: str

password

Password with which to log into the Greenplum database.

TYPE: str

host

Host of the Greenplum database.

TYPE: str

hostaddr

IP address of the Greenplum database.

TYPE: str

port

Port of the Greenplum database.

The default port used by Greenplum is 5432.

If you do not know which port to use, type the following into your Greenplum client:

SELECT setting FROM pg_settings WHERE name = 'port';

TYPE: int DEFAULT: 5432

time_formats

The list of formats tried when parsing time stamps.

The formats are allowed to contain the following special characters:

  • %w - abbreviated weekday (Mon, Tue, ...)
  • %W - full weekday (Monday, Tuesday, ...)
  • %b - abbreviated month (Jan, Feb, ...)
  • %B - full month (January, February, ...)
  • %d - zero-padded day of month (01 .. 31)
  • %e - day of month (1 .. 31)
  • %f - space-padded day of month ( 1 .. 31)
  • %m - zero-padded month (01 .. 12)
  • %n - month (1 .. 12)
  • %o - space-padded month ( 1 .. 12)
  • %y - year without century (70)
  • %Y - year with century (1970)
  • %H - hour (00 .. 23)
  • %h - hour (00 .. 12)
  • %a - am/pm
  • %A - AM/PM
  • %M - minute (00 .. 59)
  • %S - second (00 .. 59)
  • %s - seconds and microseconds (equivalent to %S.%F)
  • %i - millisecond (000 .. 999)
  • %c - centisecond (0 .. 9)
  • %F - fractional seconds/microseconds (000000 - 999999)
  • %z - time zone differential in ISO 8601 format (Z or +NN.NN)
  • %Z - time zone differential in RFC format (GMT or +NNNN)
  • %% - percent sign

TYPE: Optional[List[str]] DEFAULT: None

conn_id

The name to be used to reference the connection. If you do not pass anything, this will create a new default connection.

TYPE: str DEFAULT: 'default'

RETURNS DESCRIPTION
Connection

The connection object.

Note

By selecting an existing table of your database in the from_db function, you can create a new DataFrame containing all its data. Alternatively, you can use the read_db and read_query methods to replace the content of the current DataFrame instance or append further rows based on either a table or a specific query.

You can also write your results back into the Greenplum database. By passing the name of the destination table to transform, the features generated from your raw data will be written back. Passing it to predict instead makes predictions of the target variables for new, unseen data and stores the result in the corresponding table.

Source code in getml/database/connect_greenplum.py
def connect_greenplum(
    dbname: str,
    user: str,
    password: str,
    host: str,
    hostaddr: str,
    port: int = 5432,
    time_formats: Optional[List[str]] = None,
    conn_id: str = "default",
) -> Connection:
    """Creates a new Greenplum database connection.

    But first, make sure your database is running, and you can reach it
    from your command line.

    enterprise-adm: Enterprise edition
        This feature is exclusive to the Enterprise edition and is not available in the Community edition. Discover the [benefits of the Enterprise edition][enterprise-benefits] and [compare their features][enterprise-feature-list].

        For licensing information and technical support, please [contact us][contact-page].

    Args:
        dbname:
            The name of the database to which you want to connect.

        user:
            Username with which to log into the Greenplum database.

        password:
            Password with which to log into the Greenplum database.

        host:
            Host of the Greenplum database.

        hostaddr:
            IP address of the Greenplum database.

        port:
            Port of the Greenplum database.

            The default port used by Greenplum is 5432.

            If you do not know which port to use, type the following into your
            Greenplum client:

            ```sql
            SELECT setting FROM pg_settings WHERE name = 'port';
            ```
        time_formats:
            The list of formats tried when parsing time stamps.

            The formats are allowed to contain the following
            special characters:

            * %w - abbreviated weekday (Mon, Tue, ...)
            * %W - full weekday (Monday, Tuesday, ...)
            * %b - abbreviated month (Jan, Feb, ...)
            * %B - full month (January, February, ...)
            * %d - zero-padded day of month (01 .. 31)
            * %e - day of month (1 .. 31)
            * %f - space-padded day of month ( 1 .. 31)
            * %m - zero-padded month (01 .. 12)
            * %n - month (1 .. 12)
            * %o - space-padded month ( 1 .. 12)
            * %y - year without century (70)
            * %Y - year with century (1970)
            * %H - hour (00 .. 23)
            * %h - hour (00 .. 12)
            * %a - am/pm
            * %A - AM/PM
            * %M - minute (00 .. 59)
            * %S - second (00 .. 59)
            * %s - seconds and microseconds (equivalent to %S.%F)
            * %i - millisecond (000 .. 999)
            * %c - centisecond (0 .. 9)
            * %F - fractional seconds/microseconds (000000 - 999999)
            * %z - time zone differential in ISO 8601 format (Z or +NN.NN)
            * %Z - time zone differential in RFC format (GMT or +NNNN)
            * %% - percent sign

        conn_id:
            The name to be used to reference the connection.
            If you do not pass anything, this will create a new default connection.

    Returns:
        The connection object.

    Note:
        By selecting an existing table of your database in
        [`from_db`][getml.DataFrame.from_db] function, you can create
        a new [`DataFrame`][getml.DataFrame] containing all its data.
        Alternatively you can use the
        [`read_db`][getml.DataFrame.read_db] and
        [`read_query`][getml.DataFrame.read_query] methods to replace
        the content of the current [`DataFrame`][getml.DataFrame]
        instance or append further rows based on either a table or a
        specific query.

        You can also write your results back into the Greenplum
        database. By passing the name for the destination table to
        [`transform`][getml.Pipeline.transform], the features
        generated from your raw data will be written back. Passing
        them into [`predict`][getml.Pipeline.predict], instead,
        makes predictions of the target variables to new, unseen data
        and stores the result into the corresponding table.



    """

    time_formats = time_formats or constants.TIME_FORMATS

    cmd: Dict[str, Any] = {}

    cmd["name_"] = ""
    cmd["type_"] = "Database.new"
    cmd["db_"] = "greenplum"

    cmd["host_"] = host
    cmd["hostaddr_"] = hostaddr
    cmd["port_"] = port
    cmd["dbname_"] = dbname
    cmd["user_"] = user
    cmd["time_formats_"] = time_formats
    cmd["conn_id_"] = conn_id

    with comm.send_and_get_socket(cmd) as sock:
        # The password is sent separately, so it doesn't
        # end up in the logs.
        comm.send_string(sock, password)
        msg = comm.recv_string(sock)

    if msg != "Success!":
        comm.handle_engine_exception(msg)

    return Connection(conn_id=conn_id)
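
Example

A usage sketch; all connection parameters below are placeholders for your own setup.

```python
import getml

conn = getml.database.connect_greenplum(
    dbname="analytics",      # placeholder
    user="gpadmin",          # placeholder
    password="secret",       # placeholder
    host="gp.example.com",   # placeholder
    hostaddr="172.28.40.9",  # placeholder IP of the same server
    port=5432,               # default Greenplum port
)
```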

connect_hana

connect_hana(
    user: str,
    password: str,
    host: str,
    port: int = 39017,
    default_schema: Optional[str] = "public",
    ping_interval: int = 0,
    time_formats: Optional[List[str]] = None,
    conn_id: str = "default",
) -> Connection

Creates a new HANA database connection.

Enterprise edition

This feature is exclusive to the Enterprise edition and is not available in the Community edition. Discover the benefits of the Enterprise edition and compare their features.

For licensing information and technical support, please contact us.

PARAMETER DESCRIPTION
user

Username with which to log into the HANA database.

TYPE: str

password

Password with which to log into the HANA database.

TYPE: str

host

Host of the HANA database.

TYPE: str

port

Port of the database.

TYPE: int DEFAULT: 39017

default_schema

The schema within the database you want to use unless another schema is explicitly set.

TYPE: Optional[str] DEFAULT: 'public'

ping_interval

The interval at which you want to ping the database, in seconds. Set to 0 for no pings at all.

TYPE: int DEFAULT: 0

time_formats

The list of formats tried when parsing time stamps.

The formats are allowed to contain the following special characters:

  • %w - abbreviated weekday (Mon, Tue, ...)
  • %W - full weekday (Monday, Tuesday, ...)
  • %b - abbreviated month (Jan, Feb, ...)
  • %B - full month (January, February, ...)
  • %d - zero-padded day of month (01 .. 31)
  • %e - day of month (1 .. 31)
  • %f - space-padded day of month ( 1 .. 31)
  • %m - zero-padded month (01 .. 12)
  • %n - month (1 .. 12)
  • %o - space-padded month ( 1 .. 12)
  • %y - year without century (70)
  • %Y - year with century (1970)
  • %H - hour (00 .. 23)
  • %h - hour (00 .. 12)
  • %a - am/pm
  • %A - AM/PM
  • %M - minute (00 .. 59)
  • %S - second (00 .. 59)
  • %s - seconds and microseconds (equivalent to %S.%F)
  • %i - millisecond (000 .. 999)
  • %c - centisecond (0 .. 9)
  • %F - fractional seconds/microseconds (000000 - 999999)
  • %z - time zone differential in ISO 8601 format (Z or +NN.NN)
  • %Z - time zone differential in RFC format (GMT or +NNNN)
  • %% - percent sign

TYPE: Optional[List[str]] DEFAULT: None

conn_id

The name to be used to reference the connection. If you do not pass anything, this will create a new default connection.

TYPE: str DEFAULT: 'default'

RETURNS DESCRIPTION
Connection

The connection object.

Source code in getml/database/connect_hana.py
def connect_hana(
    user: str,
    password: str,
    host: str,
    port: int = 39017,
    default_schema: Optional[str] = "public",
    ping_interval: int = 0,
    time_formats: Optional[List[str]] = None,
    conn_id: str = "default",
) -> Connection:
    """
    Creates a new HANA database connection.

    enterprise-adm: Enterprise edition
        This feature is exclusive to the Enterprise edition and is not available in the Community edition. Discover the [benefits of the Enterprise edition][enterprise-benefits] and [compare their features][enterprise-feature-list].

        For licensing information and technical support, please [contact us][contact-page].

    Args:
        user:
            Username with which to log into the HANA database.

        password:
            Password with which to log into the HANA database.

        host:
            Host of the HANA database.

        port:
            Port of the database.

        default_schema:
            The schema within the database you want to use
            unless another schema is explicitly set.

        ping_interval:
            The interval at which you want to ping the database,
            in seconds. Set to 0 for no pings at all.

        time_formats:
            The list of formats tried when parsing time stamps.

            The formats are allowed to contain the following
            special characters:

            * %w - abbreviated weekday (Mon, Tue, ...)
            * %W - full weekday (Monday, Tuesday, ...)
            * %b - abbreviated month (Jan, Feb, ...)
            * %B - full month (January, February, ...)
            * %d - zero-padded day of month (01 .. 31)
            * %e - day of month (1 .. 31)
            * %f - space-padded day of month ( 1 .. 31)
            * %m - zero-padded month (01 .. 12)
            * %n - month (1 .. 12)
            * %o - space-padded month ( 1 .. 12)
            * %y - year without century (70)
            * %Y - year with century (1970)
            * %H - hour (00 .. 23)
            * %h - hour (00 .. 12)
            * %a - am/pm
            * %A - AM/PM
            * %M - minute (00 .. 59)
            * %S - second (00 .. 59)
            * %s - seconds and microseconds (equivalent to %S.%F)
            * %i - millisecond (000 .. 999)
            * %c - centisecond (0 .. 9)
            * %F - fractional seconds/microseconds (000000 - 999999)
            * %z - time zone differential in ISO 8601 format (Z or +NN.NN)
            * %Z - time zone differential in RFC format (GMT or +NNNN)
            * %% - percent sign

        conn_id (str, optional):
            The name to be used to reference the connection.
            If you do not pass anything, this will create a new default connection.

    Returns:
        The connection object.
    """

    time_formats = time_formats or constants.TIME_FORMATS

    cmd: Dict[str, Any] = {}

    cmd["name_"] = ""
    cmd["type_"] = "Database.new"
    cmd["db_"] = "sap_hana"

    cmd["host_"] = host
    cmd["port_"] = port
    cmd["default_schema_"] = default_schema
    cmd["user_"] = user
    cmd["ping_interval_"] = ping_interval
    cmd["time_formats_"] = time_formats
    cmd["conn_id_"] = conn_id

    with comm.send_and_get_socket(cmd) as sock:
        # The password is sent separately, so it doesn't
        # end up in the logs.
        comm.send_string(sock, password)
        msg = comm.recv_string(sock)

    if msg != "Success!":
        comm.handle_engine_exception(msg)

    return Connection(conn_id=conn_id)
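
Example

A usage sketch; credentials and host are placeholders.

```python
import getml

conn = getml.database.connect_hana(
    user="SYSTEM",               # placeholder
    password="secret",           # placeholder
    host="hana.example.com",     # placeholder
    port=39017,                  # default port
    default_schema="MY_SCHEMA",  # used unless another schema is set explicitly
    ping_interval=60,            # ping every 60 seconds; 0 disables pings
)
```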

connect_mariadb

connect_mariadb(
    dbname: str,
    user: str,
    password: str,
    host: str,
    port: int = 3306,
    unix_socket: str = "/var/run/mysqld/mysqld.sock",
    time_formats: Optional[List[str]] = None,
    conn_id: str = "default",
) -> Connection

Creates a new MariaDB database connection.

But first, make sure your database is running and you can reach it via your command line.

PARAMETER DESCRIPTION
dbname

The name of the database to which you want to connect.

TYPE: str

user

Username with which to log into the MariaDB database.

TYPE: str

password

Password with which to log into the MariaDB database.

TYPE: str

host

Host of the MariaDB database.

TYPE: str

port

Port of the MariaDB database.

The default port for MariaDB is 3306.

If you do not know which port to use, type

SELECT @@port;
into your MariaDB client.

TYPE: int DEFAULT: 3306

unix_socket

The UNIX socket used to connect to the MariaDB database.

If you do not know which UNIX socket to use, type

SELECT @@socket;
into your MariaDB client.

TYPE: str DEFAULT: '/var/run/mysqld/mysqld.sock'

time_formats

The list of formats tried when parsing time stamps.

The formats are allowed to contain the following special characters:

  • %w - abbreviated weekday (Mon, Tue, ...)
  • %W - full weekday (Monday, Tuesday, ...)
  • %b - abbreviated month (Jan, Feb, ...)
  • %B - full month (January, February, ...)
  • %d - zero-padded day of month (01 .. 31)
  • %e - day of month (1 .. 31)
  • %f - space-padded day of month ( 1 .. 31)
  • %m - zero-padded month (01 .. 12)
  • %n - month (1 .. 12)
  • %o - space-padded month ( 1 .. 12)
  • %y - year without century (70)
  • %Y - year with century (1970)
  • %H - hour (00 .. 23)
  • %h - hour (00 .. 12)
  • %a - am/pm
  • %A - AM/PM
  • %M - minute (00 .. 59)
  • %S - second (00 .. 59)
  • %s - seconds and microseconds (equivalent to %S.%F)
  • %i - millisecond (000 .. 999)
  • %c - centisecond (0 .. 9)
  • %F - fractional seconds/microseconds (000000 - 999999)
  • %z - time zone differential in ISO 8601 format (Z or +NN.NN)
  • %Z - time zone differential in RFC format (GMT or +NNNN)
  • %% - percent sign

TYPE: Optional[List[str]] DEFAULT: None

conn_id

The name to be used to reference the connection. If you do not pass anything, this will create a new default connection.

TYPE: str DEFAULT: 'default'

RETURNS DESCRIPTION
Connection

The connection object.

Note

By selecting an existing table of your database in the from_db function, you can create a new DataFrame containing all its data. Alternatively, you can use the read_db and read_query methods to replace the content of the current DataFrame instance or append further rows based on either a table or a specific query.

You can also write your results back into the MariaDB database. By passing the name of the destination table to transform, the features generated from your raw data will be written back. Passing it to predict instead makes predictions of the target variables for new, unseen data and stores the result in the corresponding table.

Source code in getml/database/connect_mariadb.py
def connect_mariadb(
    dbname: str,
    user: str,
    password: str,
    host: str,
    port: int = 3306,
    unix_socket: str = "/var/run/mysqld/mysqld.sock",
    time_formats: Optional[List[str]] = None,
    conn_id: str = "default",
) -> Connection:
    """
    Creates a new MariaDB database connection.

    But first, make sure your database is running and you can reach it
    via your command line.

    Args:
        dbname:
            The name of the database to which you want to connect.

        user:
            Username with which to log into the MariaDB database.

        password:
            Password with which to log into the MariaDB database.

        host:
            Host of the MariaDB database.

        port:
            Port of the MariaDB database.

            The default port for MariaDB is 3306.

            If you do not know which port to use, type

            ```sql
            SELECT @@port;
            ```
            into your MariaDB client.

        unix_socket:
            The UNIX socket used to connect to the MariaDB database.

            If you do not know which UNIX socket to use, type

            ```sql
            SELECT @@socket;
            ```
            into your MariaDB client.

        time_formats:
            The list of formats tried when parsing time stamps.

            The formats are allowed to contain the following
            special characters:

            * %w - abbreviated weekday (Mon, Tue, ...)
            * %W - full weekday (Monday, Tuesday, ...)
            * %b - abbreviated month (Jan, Feb, ...)
            * %B - full month (January, February, ...)
            * %d - zero-padded day of month (01 .. 31)
            * %e - day of month (1 .. 31)
            * %f - space-padded day of month ( 1 .. 31)
            * %m - zero-padded month (01 .. 12)
            * %n - month (1 .. 12)
            * %o - space-padded month ( 1 .. 12)
            * %y - year without century (70)
            * %Y - year with century (1970)
            * %H - hour (00 .. 23)
            * %h - hour (00 .. 12)
            * %a - am/pm
            * %A - AM/PM
            * %M - minute (00 .. 59)
            * %S - second (00 .. 59)
            * %s - seconds and microseconds (equivalent to %S.%F)
            * %i - millisecond (000 .. 999)
            * %c - centisecond (0 .. 9)
            * %F - fractional seconds/microseconds (000000 - 999999)
            * %z - time zone differential in ISO 8601 format (Z or +NN.NN)
            * %Z - time zone differential in RFC format (GMT or +NNNN)
            * %% - percent sign

        conn_id:
            The name to be used to reference the connection.
            If you do not pass anything, this will create a new default connection.

    Returns:
        The connection object.

    Note:
        By selecting an existing table of your database in
        [`from_db`][getml.DataFrame.from_db] function, you can create
        a new [`DataFrame`][getml.DataFrame] containing all its data.
        Alternatively you can use the
        [`read_db`][getml.DataFrame.read_db] and
        [`read_query`][getml.DataFrame.read_query] methods to replace
        the content of the current [`DataFrame`][getml.DataFrame]
        instance or append further rows based on either a table or a
        specific query.

        You can also write your results back into the MariaDB
        database. By passing the name for the destination table to
        [`transform`][getml.Pipeline.transform], the features
        generated from your raw data will be written back. Passing
        them into [`predict`][getml.Pipeline.predict], instead,
        makes predictions
        of the target variables to new, unseen data and stores the result into
        the corresponding table.


    """

    time_formats = time_formats or constants.TIME_FORMATS

    cmd: Dict[str, Any] = {}

    cmd["name_"] = ""
    cmd["type_"] = "Database.new"
    cmd["db_"] = "mariadb"

    cmd["host_"] = host
    cmd["port_"] = port
    cmd["dbname_"] = dbname
    cmd["user_"] = user
    cmd["unix_socket_"] = unix_socket
    cmd["time_formats_"] = time_formats
    cmd["conn_id_"] = conn_id

    with comm.send_and_get_socket(cmd) as sock:
        # The password is sent separately, so it doesn't
        # end up in the logs.
        comm.send_string(sock, password)
        msg = comm.recv_string(sock)

    if msg != "Success!":
        comm.handle_engine_exception(msg)

    return Connection(conn_id=conn_id)
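
Example

A usage sketch; credentials are placeholders, and the port and socket values are the defaults, which you can verify with SELECT @@port; and SELECT @@socket;.

```python
import getml

conn = getml.database.connect_mariadb(
    dbname="shop",                              # placeholder
    user="reader",                              # placeholder
    password="secret",                          # placeholder
    host="mariadb.example.com",                 # placeholder
    port=3306,                                  # default MariaDB port
    unix_socket="/var/run/mysqld/mysqld.sock",  # default socket path
)
```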

connect_mysql

connect_mysql(
    dbname: str,
    user: str,
    password: str,
    host: str,
    port: int = 3306,
    unix_socket: str = "/var/run/mysqld/mysqld.sock",
    time_formats: Optional[List[str]] = None,
    conn_id: str = "default",
) -> Connection

Creates a new MySQL database connection.

But first, make sure your database is running and you can reach it via your command line.

Enterprise edition

This feature is exclusive to the Enterprise edition and is not available in the Community edition. Discover the benefits of the Enterprise edition and compare their features.

For licensing information and technical support, please contact us.

PARAMETER DESCRIPTION
dbname

The name of the database to which you want to connect.

TYPE: str

user

Username with which to log into the MySQL database.

TYPE: str

password

Password with which to log into the MySQL database.

TYPE: str

host

Host of the MySQL database.

TYPE: str

port

Port of the MySQL database.

The default port for MySQL is 3306.

If you do not know which port to use, type

SELECT @@port;
into your mysql client.

TYPE: int DEFAULT: 3306

unix_socket

The UNIX socket used to connect to the MySQL database.

If you do not know which UNIX socket to use, type

SELECT @@socket;
into your mysql client.

TYPE: str DEFAULT: '/var/run/mysqld/mysqld.sock'

time_formats

The list of formats tried when parsing time stamps.

The formats are allowed to contain the following special characters:

  • %w - abbreviated weekday (Mon, Tue, ...)
  • %W - full weekday (Monday, Tuesday, ...)
  • %b - abbreviated month (Jan, Feb, ...)
  • %B - full month (January, February, ...)
  • %d - zero-padded day of month (01 .. 31)
  • %e - day of month (1 .. 31)
  • %f - space-padded day of month ( 1 .. 31)
  • %m - zero-padded month (01 .. 12)
  • %n - month (1 .. 12)
  • %o - space-padded month ( 1 .. 12)
  • %y - year without century (70)
  • %Y - year with century (1970)
  • %H - hour (00 .. 23)
  • %h - hour (00 .. 12)
  • %a - am/pm
  • %A - AM/PM
  • %M - minute (00 .. 59)
  • %S - second (00 .. 59)
  • %s - seconds and microseconds (equivalent to %S.%F)
  • %i - millisecond (000 .. 999)
  • %c - centisecond (0 .. 9)
  • %F - fractional seconds/microseconds (000000 - 999999)
  • %z - time zone differential in ISO 8601 format (Z or +NN.NN)
  • %Z - time zone differential in RFC format (GMT or +NNNN)
  • %% - percent sign

TYPE: Optional[List[str]] DEFAULT: None

conn_id

The name to be used to reference the connection. If you do not pass anything, this will create a new default connection.

TYPE: str DEFAULT: 'default'

RETURNS DESCRIPTION
Connection

The connection object.

Note

By selecting an existing table of your database in the from_db function, you can create a new DataFrame containing all its data. Alternatively, you can use the read_db and read_query methods to replace the content of the current DataFrame instance or append further rows based on either a table or a specific query.

You can also write your results back into the MySQL database. By passing the name of the destination table to transform, the features generated from your raw data will be written back. Passing it to predict instead makes predictions of the target variables for new, unseen data and stores the result in the corresponding table.

Source code in getml/database/connect_mysql.py
def connect_mysql(
    dbname: str,
    user: str,
    password: str,
    host: str,
    port: int = 3306,
    unix_socket: str = "/var/run/mysqld/mysqld.sock",
    time_formats: Optional[List[str]] = None,
    conn_id: str = "default",
) -> Connection:
    """
    Creates a new MySQL database connection.

    But first, make sure your database is running and you can reach it
    via your command line.

    enterprise-adm: Enterprise edition
        This feature is exclusive to the Enterprise edition and is not available in the Community edition. Discover the [benefits of the Enterprise edition][enterprise-benefits] and [compare their features][enterprise-feature-list].

        For licensing information and technical support, please [contact us][contact-page].

    Args:
        dbname:
            The name of the database to which you want to connect.

        user:
            Username with which to log into the MySQL database.

        password:
            Password with which to log into the MySQL database.

        host:
            Host of the MySQL database.

        port:
            Port of the MySQL database.

            The default port for MySQL is 3306.

            If you do not know which port to use, type

            ```sql
            SELECT @@port;
            ```
            into your mysql client.

        unix_socket:
            The UNIX socket used to connect to the MySQL database.

            If you do not know which UNIX socket to use, type

            ```sql
            SELECT @@socket;
            ```
            into your mysql client.

        time_formats:
            The list of formats tried when parsing time stamps.

            The formats are allowed to contain the following
            special characters:

            * %w - abbreviated weekday (Mon, Tue, ...)
            * %W - full weekday (Monday, Tuesday, ...)
            * %b - abbreviated month (Jan, Feb, ...)
            * %B - full month (January, February, ...)
            * %d - zero-padded day of month (01 .. 31)
            * %e - day of month (1 .. 31)
            * %f - space-padded day of month ( 1 .. 31)
            * %m - zero-padded month (01 .. 12)
            * %n - month (1 .. 12)
            * %o - space-padded month ( 1 .. 12)
            * %y - year without century (70)
            * %Y - year with century (1970)
            * %H - hour (00 .. 23)
            * %h - hour (00 .. 12)
            * %a - am/pm
            * %A - AM/PM
            * %M - minute (00 .. 59)
            * %S - second (00 .. 59)
            * %s - seconds and microseconds (equivalent to %S.%F)
            * %i - millisecond (000 .. 999)
            * %c - centisecond (0 .. 9)
            * %F - fractional seconds/microseconds (000000 - 999999)
            * %z - time zone differential in ISO 8601 format (Z or +NN.NN)
            * %Z - time zone differential in RFC format (GMT or +NNNN)
            * %% - percent sign

        conn_id:
            The name to be used to reference the connection.
            If you do not pass anything, this will create a new default connection.

    Returns:
        The connection object.

    Note:
        By selecting an existing table of your database in
        [`from_db`][getml.DataFrame.from_db] function, you can create
        a new [`DataFrame`][getml.DataFrame] containing all its data.
        Alternatively you can use the
        [`read_db`][getml.DataFrame.read_db] and
        [`read_query`][getml.DataFrame.read_query] methods to replace
        the content of the current [`DataFrame`][getml.DataFrame]
        instance or append further rows based on either a table or a
        specific query.

        You can also write your results back into the MySQL
        database. By passing the name for the destination table to
        [`transform`][getml.Pipeline.transform], the features
        generated from your raw data will be written back. Passing
        them into [`predict`][getml.Pipeline.predict], instead,
        makes predictions of the target variables to new, unseen data
        and stores the result into the corresponding table.

    """

    time_formats = time_formats or constants.TIME_FORMATS

    cmd: Dict[str, Any] = {}

    cmd["name_"] = ""
    cmd["type_"] = "Database.new"
    cmd["db_"] = "mariadb"

    cmd["host_"] = host
    cmd["port_"] = port
    cmd["dbname_"] = dbname
    cmd["user_"] = user
    cmd["unix_socket_"] = unix_socket
    cmd["time_formats_"] = time_formats
    cmd["conn_id_"] = conn_id

    with comm.send_and_get_socket(cmd) as sock:
        # The password is sent separately, so it doesn't
        # end up in the logs.
        comm.send_string(sock, password)
        msg = comm.recv_string(sock)

    if msg != "Success!":
        comm.handle_engine_exception(msg)

    return Connection(conn_id=conn_id)
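
Example

A usage sketch showing a non-default conn_id; credentials are placeholders, and passing the handle on to from_db through a conn argument is an assumption about that method's signature.

```python
import getml

# Register the connection under its own name instead of replacing
# the default connection.
mysql_conn = getml.database.connect_mysql(
    dbname="shop",             # placeholder
    user="reader",             # placeholder
    password="secret",         # placeholder
    host="mysql.example.com",  # placeholder
    conn_id="mysql_shop",
)

# Presumably the handle can then be passed explicitly wherever a
# connection is accepted (argument name assumed).
df = getml.DataFrame.from_db(
    table_name="ORDERS", name="orders", conn=mysql_conn
)
```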

connect_odbc

connect_odbc(
    server_name: str,
    user: str = "",
    password: str = "",
    escape_chars: str = '"',
    double_precision: str = "DOUBLE PRECISION",
    integer: str = "INTEGER",
    text: str = "TEXT",
    time_formats: Optional[List[str]] = None,
    conn_id: str = "default",
) -> Connection

Creates a new ODBC database connection.

ODBC is a standardized format that can be used to connect to almost any database.

Before you use the ODBC connector, make sure the database is up and running and that the appropriate ODBC drivers are installed.

Enterprise edition

This feature is exclusive to the Enterprise edition and is not available in the Community edition. Discover the benefits of the Enterprise edition and compare their features.

For licensing information and technical support, please contact us.

PARAMETER DESCRIPTION
server_name

The server name, as referenced in your .odbc.ini file.

TYPE: str

user

Username with which to log into the database. You do not need to pass this, if it is already contained in your .odbc.ini.

TYPE: str DEFAULT: ''

password

Password with which to log into the database. You do not need to pass this, if it is already contained in your .odbc.ini.

TYPE: str DEFAULT: ''

escape_chars

ODBC drivers are supposed to support escaping table and column names using '"' characters irrespective of the syntax in the target database. Unfortunately, not all ODBC drivers follow this standard. This is why some tuning might be necessary.

The escape_chars value behaves as follows:

  • If you pass an empty string, schema, table and column names will not be escaped at all. This is not a problem unless some table or column names are identical to SQL keywords.

  • If you pass a single character, schema, table and column names will be enveloped in that character: "TABLE_NAME"."COLUMN_NAME" (standard SQL) or `TABLE_NAME`.`COLUMN_NAME` (MySQL/MariaDB style).

  • If you pass two characters, table, column and schema names will be enveloped between these two characters. For instance, if you pass "[]", the produced queries look as follows: [TABLE_NAME].[COLUMN_NAME] (MS SQL Server style).

  • If you pass more than two characters, the Engine will throw an exception.

TYPE: str DEFAULT: '"'

double_precision

The keyword used for double precision columns.

TYPE: str DEFAULT: 'DOUBLE PRECISION'

integer

The keyword used for integer columns.

TYPE: str DEFAULT: 'INTEGER'

text

The keyword used for text columns.

TYPE: str DEFAULT: 'TEXT'

time_formats

The list of formats tried when parsing time stamps.

The formats are allowed to contain the following special characters:

  • %w - abbreviated weekday (Mon, Tue, ...)
  • %W - full weekday (Monday, Tuesday, ...)
  • %b - abbreviated month (Jan, Feb, ...)
  • %B - full month (January, February, ...)
  • %d - zero-padded day of month (01 .. 31)
  • %e - day of month (1 .. 31)
  • %f - space-padded day of month ( 1 .. 31)
  • %m - zero-padded month (01 .. 12)
  • %n - month (1 .. 12)
  • %o - space-padded month ( 1 .. 12)
  • %y - year without century (70)
  • %Y - year with century (1970)
  • %H - hour (00 .. 23)
  • %h - hour (00 .. 12)
  • %a - am/pm
  • %A - AM/PM
  • %M - minute (00 .. 59)
  • %S - second (00 .. 59)
  • %s - seconds and microseconds (equivalent to %S.%F)
  • %i - millisecond (000 .. 999)
  • %c - centisecond (0 .. 9)
  • %F - fractional seconds/microseconds (000000 - 999999)
  • %z - time zone differential in ISO 8601 format (Z or +NN.NN)
  • %Z - time zone differential in RFC format (GMT or +NNNN)
  • %% - percent sign

TYPE: List[str] DEFAULT: None

conn_id

The name to be used to reference the connection. If you do not pass anything, this will create a new default connection.

TYPE: str DEFAULT: 'default'

RETURNS DESCRIPTION
Connection

The connection object.

Source code in getml/database/connect_odbc.py
def connect_odbc(
    server_name: str,
    user: str = "",
    password: str = "",
    escape_chars: str = '"',
    double_precision: str = "DOUBLE PRECISION",
    integer: str = "INTEGER",
    text: str = "TEXT",
    time_formats: Optional[List[str]] = None,
    conn_id: str = "default",
) -> Connection:
    """
    Creates a new ODBC database connection.

    ODBC is a standardized format that can be used to connect to almost any
    database.

    Before you use the ODBC connector, make sure the database is up and
    running and that the appropriate ODBC drivers are installed.

    enterprise-adm: Enterprise edition
        This feature is exclusive to the Enterprise edition and is not available in the Community edition. Discover the [benefits of the Enterprise edition][enterprise-benefits] and [compare their features][enterprise-feature-list].

        For licensing information and technical support, please [contact us][contact-page].

    Args:
        server_name:
            The server name, as referenced in your .odbc.ini file.

        user:
            Username with which to log into the database.
            You do not need to pass this, if it is already contained in your
            .odbc.ini.

        password:
            Password with which to log into the database.
            You do not need to pass this, if it is already contained in your
            .odbc.ini.

        escape_chars:
            ODBC drivers are supposed to support
            escaping table and column names using '"' characters irrespective of the
            syntax in the target database. Unfortunately, not all ODBC drivers
            follow this standard. This is why some
            tuning might be necessary.

            The escape_chars value behaves as follows:

            * If you pass an empty string, schema, table and column names will not
              be escaped at all. This is not a problem unless some table
              or column names are identical to SQL keywords.

            * If you pass a single character, schema, table and column names will
              be enveloped in that character: "TABLE_NAME"."COLUMN_NAME" (standard SQL)
              or `TABLE_NAME`.`COLUMN_NAME` (MySQL/MariaDB style).

            * If you pass two characters, table, column and schema names will be
              enveloped between these two characters. For instance, if you pass "[]",
              the produced queries look as follows:
              [TABLE_NAME].[COLUMN_NAME] (MS SQL Server style).

            * If you pass more than two characters, the Engine will throw an exception.

        double_precision:
            The keyword used for double precision columns.

        integer:
            The keyword used for integer columns.

        text:
            The keyword used for text columns.

        time_formats (List[str], optional):
            The list of formats tried when parsing time stamps.

            The formats are allowed to contain the following
            special characters:

            * %w - abbreviated weekday (Mon, Tue, ...)
            * %W - full weekday (Monday, Tuesday, ...)
            * %b - abbreviated month (Jan, Feb, ...)
            * %B - full month (January, February, ...)
            * %d - zero-padded day of month (01 .. 31)
            * %e - day of month (1 .. 31)
            * %f - space-padded day of month ( 1 .. 31)
            * %m - zero-padded month (01 .. 12)
            * %n - month (1 .. 12)
            * %o - space-padded month ( 1 .. 12)
            * %y - year without century (70)
            * %Y - year with century (1970)
            * %H - hour (00 .. 23)
            * %h - hour (00 .. 12)
            * %a - am/pm
            * %A - AM/PM
            * %M - minute (00 .. 59)
            * %S - second (00 .. 59)
            * %s - seconds and microseconds (equivalent to %S.%F)
            * %i - millisecond (000 .. 999)
            * %c - centisecond (0 .. 9)
            * %F - fractional seconds/microseconds (000000 - 999999)
            * %z - time zone differential in ISO 8601 format (Z or +NN.NN)
            * %Z - time zone differential in RFC format (GMT or +NNNN)
            * %% - percent sign

        conn_id:
            The name to be used to reference the connection.
            If you do not pass anything, this will create a new default connection.

    Returns:
        The connection object.

    """

    time_formats = time_formats or constants.TIME_FORMATS

    cmd: Dict[str, Any] = {}

    cmd["name_"] = ""
    cmd["type_"] = "Database.new"
    cmd["db_"] = "odbc"

    cmd["server_name_"] = server_name
    cmd["user_"] = user
    cmd["escape_chars_"] = escape_chars
    cmd["double_precision_"] = double_precision
    cmd["integer_"] = integer
    cmd["text_"] = text
    cmd["time_formats_"] = time_formats
    cmd["conn_id_"] = conn_id

    with comm.send_and_get_socket(cmd) as sock:
        # The password is sent separately, so it doesn't
        # end up in the logs.
        comm.send_string(sock, password)
        msg = comm.recv_string(sock)

    if msg != "Success!":
        comm.handle_engine_exception(msg)

    return Connection(conn_id=conn_id)
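
Example

A usage sketch; the DSN, credentials, and type keywords are placeholders for whatever your ODBC driver and target database expect.

```python
import getml

conn = getml.database.connect_odbc(
    server_name="MY_DSN",      # as referenced in your .odbc.ini
    user="reader",             # placeholder
    password="secret",         # placeholder
    escape_chars="[]",         # MS SQL Server style: [TABLE_NAME].[COLUMN_NAME]
    double_precision="FLOAT",  # keyword the target database uses for doubles
    integer="INT",
    text="VARCHAR(4000)",
)
```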

connect_postgres

connect_postgres(
    dbname: str,
    user: str,
    password: str,
    host: Optional[str] = None,
    hostaddr: Optional[str] = None,
    port: int = 5432,
    time_formats: Optional[List[str]] = None,
    conn_id: str = "default",
) -> Connection

Creates a new PostgreSQL database connection.

But first, make sure your database is running, and you can reach it via your command line.

Enterprise edition

This feature is exclusive to the Enterprise edition and is not available in the Community edition. Discover the benefits of the Enterprise edition and compare their features.

For licensing information and technical support, please contact us.

PARAMETER DESCRIPTION
dbname

The name of the database to which you want to connect.

TYPE: str

user

Username with which to log into the PostgreSQL database.

TYPE: str

password

Password with which to log into the PostgreSQL database.

TYPE: str

host

Host of the PostgreSQL database.

TYPE: Optional[str] DEFAULT: None

hostaddr

IP address of the PostgreSQL database. This should be in the standard IPv4 address format, e.g., 172.28.40.9. If your machine supports IPv6, you can also use those addresses. TCP/IP communication is always used when a nonempty string is specified for this parameter.

TYPE: Optional[str] DEFAULT: None

port

Port of the PostgreSQL database.

The default port used by PostgreSQL is 5432.

If you do not know which port to use, type the following into your PostgreSQL client:

SELECT setting FROM pg_settings WHERE name = 'port';

TYPE: int DEFAULT: 5432

time_formats

The list of formats tried when parsing time stamps.

The formats are allowed to contain the following special characters:

  • %w - abbreviated weekday (Mon, Tue, ...)
  • %W - full weekday (Monday, Tuesday, ...)
  • %b - abbreviated month (Jan, Feb, ...)
  • %B - full month (January, February, ...)
  • %d - zero-padded day of month (01 .. 31)
  • %e - day of month (1 .. 31)
  • %f - space-padded day of month ( 1 .. 31)
  • %m - zero-padded month (01 .. 12)
  • %n - month (1 .. 12)
  • %o - space-padded month ( 1 .. 12)
  • %y - year without century (70)
  • %Y - year with century (1970)
  • %H - hour (00 .. 23)
  • %h - hour (00 .. 12)
  • %a - am/pm
  • %A - AM/PM
  • %M - minute (00 .. 59)
  • %S - second (00 .. 59)
  • %s - seconds and microseconds (equivalent to %S.%F)
  • %i - millisecond (000 .. 999)
  • %c - centisecond (0 .. 9)
  • %F - fractional seconds/microseconds (000000 - 999999)
  • %z - time zone differential in ISO 8601 format (Z or +NN.NN)
  • %Z - time zone differential in RFC format (GMT or +NNNN)
  • %% - percent sign

TYPE: Optional[List[str]] DEFAULT: None

conn_id

The name to be used to reference the connection. If you do not pass anything, this will create a new default connection.

TYPE: str DEFAULT: 'default'

RETURNS DESCRIPTION
Connection

The connection object.

Note

By selecting an existing table of your database in the from_db function, you can create a new DataFrame containing all its data. Alternatively, you can use the read_db and read_query methods to replace the content of the current DataFrame instance or append further rows based on either a table or a specific query.

You can also write your results back into the PostgreSQL database. By passing the name of the destination table to transform, the features generated from your raw data will be written back. Passing it to predict instead makes predictions of the target variables for new, unseen data and stores the result in the corresponding table.

Source code in getml/database/connect_postgres.py
def connect_postgres(
    dbname: str,
    user: str,
    password: str,
    host: Optional[str] = None,
    hostaddr: Optional[str] = None,
    port: int = 5432,
    time_formats: Optional[List[str]] = None,
    conn_id: str = "default",
) -> Connection:
    """
    Creates a new PostgreSQL database connection.

    But first, make sure your database is running, and you can reach it
    via your command line.

    enterprise-adm: Enterprise edition
        This feature is exclusive to the Enterprise edition and is not available in the Community edition. Discover the [benefits of the Enterprise edition][enterprise-benefits] and [compare their features][enterprise-feature-list].

        For licensing information and technical support, please [contact us][contact-page].

    Args:
        dbname:
            The name of the database to which you want to connect.

        user:
            Username with which to log into the PostgreSQL database.

        password:
            Password with which to log into the PostgreSQL database.

        host:
            Host of the PostgreSQL database.

        hostaddr:
            IP address of the PostgreSQL database.
            This should be in the standard IPv4 address format, e.g., 172.28.40.9.
            If your machine supports IPv6, you can also use those addresses.
            TCP/IP communication is always used when a nonempty string is specified
            for this parameter.

        port:
            Port of the PostgreSQL database.

            The default port used by PostgreSQL is 5432.

            If you do not know which port to use, type the following into your
            PostgreSQL client

            ```sql
            SELECT setting FROM pg_settings WHERE name = 'port';
            ```

        time_formats:
            The list of formats tried when parsing time stamps.

            The formats are allowed to contain the following
            special characters:

            * %w - abbreviated weekday (Mon, Tue, ...)
            * %W - full weekday (Monday, Tuesday, ...)
            * %b - abbreviated month (Jan, Feb, ...)
            * %B - full month (January, February, ...)
            * %d - zero-padded day of month (01 .. 31)
            * %e - day of month (1 .. 31)
            * %f - space-padded day of month ( 1 .. 31)
            * %m - zero-padded month (01 .. 12)
            * %n - month (1 .. 12)
            * %o - space-padded month ( 1 .. 12)
            * %y - year without century (70)
            * %Y - year with century (1970)
            * %H - hour (00 .. 23)
            * %h - hour (00 .. 12)
            * %a - am/pm
            * %A - AM/PM
            * %M - minute (00 .. 59)
            * %S - second (00 .. 59)
            * %s - seconds and microseconds (equivalent to %S.%F)
            * %i - millisecond (000 .. 999)
            * %c - centisecond (0 .. 9)
            * %F - fractional seconds/microseconds (000000 - 999999)
            * %z - time zone differential in ISO 8601 format (Z or +NN.NN)
            * %Z - time zone differential in RFC format (GMT or +NNNN)
            * %% - percent sign

        conn_id:
            The name to be used to reference the connection.
            If you do not pass anything, this will create a new default connection.

    Returns:
        The connection object.

    Note:
        By selecting an existing table of your database in the
        [`from_db`][getml.DataFrame.from_db] function, you can create
        a new [`DataFrame`][getml.DataFrame] containing all its data.
        Alternatively, you can use the
        [`read_db`][getml.DataFrame.read_db] and
        [`read_query`][getml.DataFrame.read_query] methods to replace
        the content of the current [`DataFrame`][getml.DataFrame]
        instance or to append further rows based on either a table or a
        specific query.

        You can also write your results back into the PostgreSQL
        database. By passing the name for the destination table to
        [`transform`][getml.Pipeline.transform], the features
        generated from your raw data will be written back. Passing
        them into [`predict`][getml.Pipeline.predict], instead,
        makes predictions of the target variables for new, unseen data
        and stores the results in the corresponding table.
    """

    time_formats = time_formats or constants.TIME_FORMATS

    cmd: Dict[str, Any] = {}

    cmd["name_"] = ""
    cmd["type_"] = "Database.new"
    cmd["db_"] = "postgres"

    if host is not None:
        cmd["host_"] = host

    if hostaddr is not None:
        cmd["hostaddr_"] = hostaddr

    cmd["port_"] = port
    cmd["dbname_"] = dbname
    cmd["user_"] = user
    cmd["time_formats_"] = time_formats
    cmd["conn_id_"] = conn_id

    with comm.send_and_get_socket(cmd) as sock:
        # The password is sent separately, so it doesn't
        # end up in the logs.
        comm.send_string(sock, password)
        msg = comm.recv_string(sock)

    if msg != "Success!":
        comm.handle_engine_exception(msg)

    return Connection(conn_id=conn_id)
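
Example

A minimal sketch, assuming a PostgreSQL instance that is reachable from the machine hosting the getML Engine; the database name, credentials, host, and conn_id below are placeholders:

```python
import getml

# All values below are placeholders for your own setup.
conn = getml.database.connect_postgres(
    dbname="MY_DATABASE",
    user="MY_USER",
    password="MY_PASSWORD",
    host="localhost",
    port=5432,
    conn_id="MY-POSTGRES",
)

# The handle can now be passed to other functions of getml.database,
# for instance to list the tables of the database.
print(getml.database.list_tables(conn=conn))
```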

connect_sqlite3

connect_sqlite3(
    name: str = ":memory:",
    time_formats: Optional[List[str]] = None,
    conn_id: str = "default",
) -> Connection

Creates a new SQLite3 database connection.

SQLite3 is a popular embedded database that, by default, is run purely in memory. It is faster than client-server databases such as PostgreSQL, but less stable under massive parallel access, consumes more memory, and requires all contained data sets to be loaded into memory, which might fill up too much of your RAM, especially for large data sets.

PARAMETER DESCRIPTION
name

Name of the sqlite3 file. If the file does not exist, it will be created. Set to ":memory:" for a purely in-memory SQLite3 database.

TYPE: str DEFAULT: ':memory:'

time_formats

The list of formats tried when parsing time stamps.

The formats are allowed to contain the following special characters:

  • %w - abbreviated weekday (Mon, Tue, ...)
  • %W - full weekday (Monday, Tuesday, ...)
  • %b - abbreviated month (Jan, Feb, ...)
  • %B - full month (January, February, ...)
  • %d - zero-padded day of month (01 .. 31)
  • %e - day of month (1 .. 31)
  • %f - space-padded day of month ( 1 .. 31)
  • %m - zero-padded month (01 .. 12)
  • %n - month (1 .. 12)
  • %o - space-padded month ( 1 .. 12)
  • %y - year without century (70)
  • %Y - year with century (1970)
  • %H - hour (00 .. 23)
  • %h - hour (00 .. 12)
  • %a - am/pm
  • %A - AM/PM
  • %M - minute (00 .. 59)
  • %S - second (00 .. 59)
  • %s - seconds and microseconds (equivalent to %S.%F)
  • %i - millisecond (000 .. 999)
  • %c - centisecond (0 .. 9)
  • %F - fractional seconds/microseconds (000000 - 999999)
  • %z - time zone differential in ISO 8601 format (Z or +NN.NN)
  • %Z - time zone differential in RFC format (GMT or +NNNN)
  • %% - percent sign

TYPE: Optional[List[str]] DEFAULT: None

conn_id

The name to be used to reference the connection. If you do not pass anything, this will create a new default connection.

TYPE: str DEFAULT: 'default'

RETURNS DESCRIPTION
Connection

The new SQLite3 database connection.

Note

By selecting an existing table of your database in the from_db function, you can create a new DataFrame containing all its data. Alternatively, you can use the read_db and read_query methods to replace the content of the current DataFrame instance or to append further rows based on either a table or a specific query.

You can also write your results back into the SQLite3 database. By passing the name for the destination table to transform, the features generated from your raw data will be written back. Passing them into predict, instead, makes predictions of the target variables for new, unseen data and stores the results in the corresponding table.

Source code in getml/database/connect_sqlite3.py
def connect_sqlite3(
    name: str = ":memory:",
    time_formats: Optional[List[str]] = None,
    conn_id: str = "default",
) -> Connection:
    """Creates a new SQLite3 database connection.

    SQLite3 is a popular embedded database that, by default, is run
    purely in memory. It is faster than client-server databases such
    as PostgreSQL, but less stable under massive parallel access,
    consumes more memory, and requires all contained data sets to be
    loaded into memory, which might fill up too much of your RAM,
    especially for large data sets.

    Args:
        name:
            Name of the sqlite3 file.  If the file does not exist, it
            will be created. Set to ":memory:" for a purely in-memory SQLite3
            database.

        time_formats:
            The list of formats tried when parsing time stamps.

            The formats are allowed to contain the following
            special characters:

            * %w - abbreviated weekday (Mon, Tue, ...)
            * %W - full weekday (Monday, Tuesday, ...)
            * %b - abbreviated month (Jan, Feb, ...)
            * %B - full month (January, February, ...)
            * %d - zero-padded day of month (01 .. 31)
            * %e - day of month (1 .. 31)
            * %f - space-padded day of month ( 1 .. 31)
            * %m - zero-padded month (01 .. 12)
            * %n - month (1 .. 12)
            * %o - space-padded month ( 1 .. 12)
            * %y - year without century (70)
            * %Y - year with century (1970)
            * %H - hour (00 .. 23)
            * %h - hour (00 .. 12)
            * %a - am/pm
            * %A - AM/PM
            * %M - minute (00 .. 59)
            * %S - second (00 .. 59)
            * %s - seconds and microseconds (equivalent to %S.%F)
            * %i - millisecond (000 .. 999)
            * %c - centisecond (0 .. 9)
            * %F - fractional seconds/microseconds (000000 - 999999)
            * %z - time zone differential in ISO 8601 format (Z or +NN.NN)
            * %Z - time zone differential in RFC format (GMT or +NNNN)
            * %% - percent sign

        conn_id:
            The name to be used to reference the connection.
            If you do not pass anything, this will create a new default connection.

    Returns:
        The new SQLite3 database connection.

    Note:
        By selecting an existing table of your database in the
        [`from_db`][getml.DataFrame.from_db] function, you can create
        a new [`DataFrame`][getml.DataFrame] containing all its data.
        Alternatively, you can use the
        [`read_db`][getml.DataFrame.read_db] and
        [`read_query`][getml.DataFrame.read_query] methods to replace
        the content of the current [`DataFrame`][getml.DataFrame]
        instance or to append further rows based on either a table or a
        specific query.

        You can also write your results back into the SQLite3
        database. By passing the name for the destination table to
        [`transform`][getml.Pipeline.transform], the features
        generated from your raw data will be written back. Passing
        them into [`predict`][getml.Pipeline.predict], instead,
        makes predictions of the target variables for new, unseen data
        and stores the results in the corresponding table.

    """

    time_formats = time_formats or constants.TIME_FORMATS

    # We want to use the absolute path, unless it is a pure
    # in-memory version.
    name_ = name

    if name_ != ":memory:":
        name_ = os.path.abspath(name)

    cmd: Dict[str, Any] = {}

    cmd["name_"] = name_
    cmd["type_"] = "Database.new"

    cmd["db_"] = "sqlite3"
    cmd["time_formats_"] = time_formats
    cmd["conn_id_"] = conn_id

    with comm.send_and_get_socket(cmd) as sock:
        # The password is usually sent separately, so it doesn't
        # end up in the logs. However, SQLite3 does not need a
        # password, so we just send a dummy.
        comm.send_string(sock, "none")
        msg = comm.recv_string(sock)

    if msg != "Success!":
        comm.handle_engine_exception(msg, extra={"name": name_})

    return Connection(conn_id=conn_id)
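
Example

A minimal sketch; the file name MY-SQLITE.db and the conn_id are placeholders, and omitting name gives a purely in-memory database:

```python
import getml

# Open (or create) an SQLite3 file on the machine hosting the getML Engine.
file_conn = getml.database.connect_sqlite3(
    name="MY-SQLITE.db",
    conn_id="MY-SQLITE",
)

# Purely in-memory database, registered as the default connection.
mem_conn = getml.database.connect_sqlite3()
```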

copy_table

copy_table(
    source_conn: Connection,
    target_conn: Connection,
    source_table: str,
    target_table: Optional[str] = None,
)

Copies a table from one database connection to another.

PARAMETER DESCRIPTION
source_conn

The database connection to be copied from.

TYPE: Connection

target_conn

The database connection to be copied to.

TYPE: Connection

source_table

The name of the table in the source connection.

TYPE: str

target_table

The name of the table in the target connection. If you do not explicitly pass a target_table, the name will be identical to the source_table.

TYPE: Optional[str] DEFAULT: None

Example

A frequent use case for this function is to copy data from a data source into SQLite. This is a good idea because SQLite is faster than most standard, ACID-compliant databases and because it keeps you from messing up a production environment.

It is important to explicitly pass conn_id; otherwise, the source connection will be closed when you create the target connection. What you pass as conn_id is entirely up to you, as long as the conn_id of the source and the target differ. It can be any name you see fit.

source_conn = getml.database.connect_odbc(
    "MY-SERVER-NAME", conn_id="MY-SOURCE")

target_conn = getml.database.connect_sqlite3(
    "MY-SQLITE.db", conn_id="MY-TARGET")

getml.database.copy_table(
        source_conn=source_conn,
        target_conn=target_conn,
        source_table="MY-TABLE"
)
Source code in getml/database/copy_table.py
def copy_table(
    source_conn: Connection,
    target_conn: Connection,
    source_table: str,
    target_table: Optional[str] = None,
):
    """
    Copies a table from one database connection to another.

    Args:
        source_conn:
            The database connection to be copied from.

        target_conn:
            The database connection to be copied to.

        source_table:
            The name of the table in the source connection.

        target_table:
            The name of the table in the target
            connection. If you do not explicitly pass a target_table, the
            name will be identical to the source_table.

    ??? example
        A frequent use case for this function is to copy data from a data source
        into SQLite. This is a good idea because SQLite is faster than most
        standard, ACID-compliant databases and because it keeps you from messing
        up a production environment.

        It is important to explicitly pass conn_id; otherwise, the source
        connection will be closed when you create the target connection. What you
        pass as conn_id is entirely up to you, as long as the conn_id of the
        source and the target differ. It can be any name you see fit.

        ```python
        source_conn = getml.database.connect_odbc(
            "MY-SERVER-NAME", conn_id="MY-SOURCE")

        target_conn = getml.database.connect_sqlite3(
            "MY-SQLITE.db", conn_id="MY-TARGET")

        getml.database.copy_table(
                source_conn=source_conn,
                target_conn=target_conn,
                source_table="MY-TABLE"
        )
        ```

    """

    # -------------------------------------------

    target_table = target_table or source_table

    # -------------------------------------------

    cmd: Dict[str, Any] = {}

    cmd["name_"] = ""
    cmd["type_"] = "Database.copy_table"

    cmd["source_conn_id_"] = source_conn.conn_id
    cmd["target_conn_id_"] = target_conn.conn_id
    cmd["source_table_"] = source_table
    cmd["target_table_"] = target_table

    # -------------------------------------------

    comm.send(cmd)

drop_table

drop_table(name: str, conn: Optional[Connection] = None)

Drops a table from the database.

PARAMETER DESCRIPTION
name

The table to be dropped.

TYPE: str

conn

The database connection to be used. If you don't explicitly pass a connection, the Engine will use the default connection.

TYPE: Optional[Connection] DEFAULT: None

Source code in getml/database/drop_table.py
def drop_table(name: str, conn: Optional[Connection] = None):
    """
    Drops a table from the database.

    Args:
        name:
            The table to be dropped.

        conn:
            The database connection to be used.
            If you don't explicitly pass a connection,
            the Engine will use the default connection.
    """

    # -------------------------------------------

    conn = conn or Connection()

    # -------------------------------------------

    cmd: Dict[str, Any] = {}

    cmd["name_"] = name
    cmd["type_"] = "Database.drop_table"
    cmd["conn_id_"] = conn.conn_id

    # -------------------------------------------

    comm.send(cmd)
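
Example

A minimal sketch, assuming a table named MY_TABLE exists; the table name and conn_id are placeholders:

```python
import getml

# Drop MY_TABLE from the default connection.
getml.database.drop_table("MY_TABLE")

# To target another connection, pass its handle explicitly.
conn = getml.database.connect_sqlite3(conn_id="MY-SQLITE")
getml.database.drop_table("MY_TABLE", conn=conn)
```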

execute

execute(query: str, conn: Optional[Connection] = None)

Executes an SQL query on the database.

Please note that this is not meant to return results. If you want to get results, use database.get() instead.

PARAMETER DESCRIPTION
query

The SQL query to be executed.

TYPE: str

conn

The database connection to be used. If you don't explicitly pass a connection, the Engine will use the default connection.

TYPE: Optional[Connection] DEFAULT: None

Source code in getml/database/execute.py
def execute(query: str, conn: Optional[Connection] = None):
    """
    Executes an SQL query on the database.

    Please note that this is not meant to return results. If you want to
    get results, use `database.get()` instead.

    Args:
        query:
            The SQL query to be executed.

        conn:
            The database connection to be used.
            If you don't explicitly pass a connection,
            the Engine will use the default connection.
    """

    conn = conn or Connection()

    cmd: Dict[str, Any] = {}

    cmd["name_"] = conn.conn_id
    cmd["type_"] = "Database.execute"

    with comm.send_and_get_socket(cmd) as sock:
        comm.send_string(sock, query)
        msg = comm.recv_string(sock)

    if msg != "Success!":
        comm.handle_engine_exception(msg)
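
Example

A minimal sketch; the table and column definitions are placeholders:

```python
import getml

# Statements executed this way do not return results.
getml.database.execute(
    """
    CREATE TABLE IF NOT EXISTS MY_TABLE (
        id INTEGER,
        value REAL
    );
    """
)
```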

get

get(
    query: str, conn: Optional[Connection] = None
) -> DataFrame

Executes an SQL query on the database and returns the result as a pandas dataframe.

PARAMETER DESCRIPTION
query

The SQL query to be executed.

TYPE: str

conn

The database connection to be used. If you don't explicitly pass a connection, the Engine will use the default connection.

TYPE: Optional[Connection] DEFAULT: None

RETURNS DESCRIPTION
DataFrame

The result of the query as a pandas dataframe.

Source code in getml/database/get.py
def get(query: str, conn: Optional[Connection] = None) -> pd.DataFrame:
    """
    Executes an SQL query on the database and returns the result as
    a pandas dataframe.

    Args:
        query:
            The SQL query to be executed.

        conn:
            The database connection to be used.
            If you don't explicitly pass a connection,
            the Engine will use the default connection.

    Returns:
        The result of the query as a pandas dataframe.
    """

    conn = conn or Connection()

    cmd: Dict[str, Any] = {}

    cmd["name_"] = conn.conn_id
    cmd["type_"] = "Database.get"

    with comm.send_and_get_socket(cmd) as sock:
        comm.send_string(sock, query)
        msg = comm.recv_string(sock)
        if msg != "Success!":
            comm.handle_engine_exception(msg)
        json_str = comm.recv_string(sock)

    return pd.read_json(json_str)
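
Example

A minimal sketch, assuming a table named MY_TABLE exists on the default connection:

```python
import getml

# The result of the query is returned as a pandas DataFrame.
df = getml.database.get("SELECT * FROM MY_TABLE LIMIT 10;")
print(df.head())
```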

get_colnames

get_colnames(
    name: str, conn: Optional[Connection] = None
) -> List[str]

Lists the colnames of a table held in the database.

PARAMETER DESCRIPTION
name

The name of the table in the database.

TYPE: str

conn

The database connection to be used. If you don't explicitly pass a connection, the Engine will use the default connection.

TYPE: Optional[Connection] DEFAULT: None

RETURNS DESCRIPTION
List[str]

A list of strings containing the names of the columns in the table.

Source code in getml/database/get_colnames.py
def get_colnames(name: str, conn: Optional[Connection] = None) -> List[str]:
    """
    Lists the colnames of a table held in the database.

    Args:
        name:
            The name of the table in the database.

        conn:
            The database connection to be used.
            If you don't explicitly pass a connection,
            the Engine will use the default connection.

    Returns:
        A list of strings containing the names of the columns in the table.
    """

    conn = conn or Connection()

    cmd: Dict[str, Any] = {}

    cmd["name_"] = name
    cmd["type_"] = "Database.get_colnames"
    cmd["conn_id_"] = conn.conn_id

    with comm.send_and_get_socket(cmd) as sock:
        msg = comm.recv_string(sock)
        if msg != "Success!":
            comm.handle_engine_exception(msg)
        arr = json.loads(comm.recv_string(sock))

    return arr
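
Example

A minimal sketch, assuming a table named MY_TABLE exists on the default connection:

```python
import getml

# Returns the column names as a plain Python list of strings.
colnames = getml.database.get_colnames("MY_TABLE")
print(colnames)
```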

list_connections

list_connections() -> List[Connection]

Returns a list of handles to all connections that are currently active on the Engine.

RETURNS DESCRIPTION
List[Connection]

A list of Connection objects.

Source code in getml/database/list_connections.py
def list_connections() -> List[Connection]:
    """
    Returns a list of handles to all connections
    that are currently active on the Engine.

    Returns:
        A list of Connection objects.
    """

    cmd: Dict[Any, str] = {}

    cmd["name_"] = ""
    cmd["type_"] = "Database.list_connections"

    with comm.send_and_get_socket(cmd) as sock:
        msg = comm.recv_string(sock)
        if msg != "Success!":
            comm.handle_engine_exception(msg)
        arr = json.loads(comm.recv_string(sock))

    return [Connection(elem) for elem in arr]
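
Example

A minimal sketch that prints the conn_id of every active connection:

```python
import getml

for conn in getml.database.list_connections():
    print(conn.conn_id)
```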

list_tables

list_tables(conn: Optional[Connection] = None) -> List[str]

Lists all tables and views currently held in the database.

PARAMETER DESCRIPTION
conn

The database connection to be used. If you don't explicitly pass a connection, the Engine will use the default connection.

TYPE: Optional[Connection] DEFAULT: None

RETURNS DESCRIPTION
List[str]

A list of strings containing the names of the tables and views.

Source code in getml/database/list_tables.py
def list_tables(conn: Optional[Connection] = None) -> List[str]:
    """
    Lists all tables and views currently held in the database.

    Args:
        conn:
            The database connection to be used.
            If you don't explicitly pass a connection,
            the Engine will use the default connection.

    Returns:
        A list of strings containing the names of the tables and views.
    """

    conn = conn or Connection()

    cmd: Dict[str, Any] = {}

    cmd["name_"] = conn.conn_id
    cmd["type_"] = "Database.list_tables"

    with comm.send_and_get_socket(cmd) as sock:
        msg = comm.recv_string(sock)
        if msg != "Success!":
            comm.handle_engine_exception(msg)
        return json.loads(comm.recv_string(sock))
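
Example

A minimal sketch; the conn_id used for the second call is a placeholder:

```python
import getml

# Tables and views on the default connection.
print(getml.database.list_tables())

# Tables and views on an explicitly created connection.
conn = getml.database.connect_sqlite3(conn_id="MY-SQLITE")
print(getml.database.list_tables(conn=conn))
```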

read_csv

read_csv(
    name: str,
    fnames: Union[str, List[str]],
    quotechar: str = '"',
    sep: str = ",",
    num_lines_read: int = 0,
    skip: int = 0,
    colnames: Optional[List[str]] = None,
    conn: Optional[Connection] = None,
) -> None

Reads a CSV file into the database.

PARAMETER DESCRIPTION
name

Name of the table in which the data is to be inserted.

TYPE: str

fnames

The list of CSV file names to be read.

TYPE: Union[str, List[str]]

quotechar

The character used to wrap strings. Default:"

TYPE: str DEFAULT: '"'

sep

The separator used for separating fields. Default:,

TYPE: str DEFAULT: ','

num_lines_read

Number of lines read from each file. Set to 0 to read in the entire file.

TYPE: int DEFAULT: 0

skip

Number of lines to skip at the beginning of each file (Default: 0).

TYPE: int DEFAULT: 0

colnames

The first line of a CSV file usually contains the column names. When this is not the case, you need to explicitly pass them.

TYPE: Optional[List[str]] DEFAULT: None

conn

The database connection to be used. If you don't explicitly pass a connection, the Engine will use the default connection.

TYPE: Optional[Connection] DEFAULT: None

Example

Let's assume you have two CSV files - file1.csv and file2.csv. You can import their data into the database using the following commands:

stmt = getml.database.sniff_csv(
        fnames=["file1.csv", "file2.csv"],
        name="MY_TABLE",
        sep=';'
)

getml.database.execute(stmt)

getml.database.read_csv(
    fnames=["file1.csv", "file2.csv"],
    name="MY_TABLE",
    sep=';'
)
Source code in getml/database/read_csv.py
def read_csv(
    name: str,
    fnames: Union[str, List[str]],
    quotechar: str = '"',
    sep: str = ",",
    num_lines_read: int = 0,
    skip: int = 0,
    colnames: Optional[List[str]] = None,
    conn: Optional[Connection] = None,
) -> None:
    """
    Reads a CSV file into the database.

    Args:
        name:
            Name of the table in which the data is to be inserted.

        fnames:
            The list of CSV file names to be read.

        quotechar:
            The character used to wrap strings. Default:`"`

        sep:
            The separator used for separating fields. Default:`,`

        num_lines_read:
            Number of lines read from each file.
            Set to 0 to read in the entire file.

        skip:
            Number of lines to skip at the beginning of each
            file (Default: 0).

        colnames:
            The first line of a CSV file
            usually contains the column names. When this is not the case, you need to
            explicitly pass them.

        conn:
            The database connection to be used.
            If you don't explicitly pass a connection,
            the Engine will use the default connection.

    ??? example
        Let's assume you have two CSV files - *file1.csv* and
        *file2.csv*. You can import their data into the database
        using the following commands:

        ```python
        stmt = getml.database.sniff_csv(
                fnames=["file1.csv", "file2.csv"],
                name="MY_TABLE",
                sep=';'
        )

        getml.database.execute(stmt)

        getml.database.read_csv(
            fnames=["file1.csv", "file2.csv"],
            name="MY_TABLE",
            sep=';'
        )
        ```

    """

    return _read_csv(
        CSVCmdType.READ,
        name,
        fnames,
        num_lines_read,
        quotechar,
        sep,
        skip,
        colnames,
        conn,
    )

read_s3

read_s3(
    name: str,
    bucket: str,
    keys: List[str],
    region: str,
    sep: str = ",",
    num_lines_read: int = 0,
    skip: int = 0,
    colnames: Optional[List[str]] = None,
    conn: Optional[Connection] = None,
) -> None

Reads a list of CSV files located in an S3 bucket.

PARAMETER DESCRIPTION
name

Name of the table in which the data is to be inserted.

TYPE: str

bucket

The bucket from which to read the files.

TYPE: str

keys

The list of keys (files in the bucket) to be read.

TYPE: List[str]

region

The region in which the bucket is located.

TYPE: str

sep

The separator used for separating fields. Default:,

TYPE: str DEFAULT: ','

num_lines_read

Number of lines read from each file. Set to 0 to read in the entire file.

TYPE: int DEFAULT: 0

skip

Number of lines to skip at the beginning of each file (Default: 0).

TYPE: int DEFAULT: 0

colnames

The first line of a CSV file usually contains the column names. When this is not the case, you need to explicitly pass them.

TYPE: Optional[List[str]] DEFAULT: None

conn

The database connection to be used. If you don't explicitly pass a connection, the Engine will use the default connection.

TYPE: Optional[Connection] DEFAULT: None

Example

Let's assume you have two CSV files - file1.csv and file2.csv - in the bucket. You can import their data into the getML Engine using the following commands:

getml.engine.set_s3_access_key_id("YOUR-ACCESS-KEY-ID")

getml.engine.set_s3_secret_access_key("YOUR-SECRET-ACCESS-KEY")

stmt = getml.database.sniff_s3(
        bucket="your-bucket-name",
        keys=["file1.csv", "file2.csv"],
        region="us-east-2",
        name="MY_TABLE",
        sep=';'
)

getml.database.execute(stmt)

getml.database.read_s3(
    bucket="your-bucket-name",
    keys=["file1.csv", "file2.csv"],
    region="us-east-2",
    name="MY_TABLE",
    sep=';'
)
You can also set the access credentials as environment variables before you launch the getML Engine.

Source code in getml/database/read_s3.py
def read_s3(
    name: str,
    bucket: str,
    keys: List[str],
    region: str,
    sep: str = ",",
    num_lines_read: int = 0,
    skip: int = 0,
    colnames: Optional[List[str]] = None,
    conn: Optional[Connection] = None,
) -> None:
    """
    Reads a list of CSV files located in an S3 bucket.

    Args:
        name:
            Name of the table in which the data is to be inserted.

        bucket:
            The bucket from which to read the files.

        keys:
            The list of keys (files in the bucket) to be read.

        region:
            The region in which the bucket is located.

        sep:
            The separator used for separating fields. Default:`,`

        num_lines_read:
            Number of lines read from each file.
            Set to 0 to read in the entire file.

        skip:
            Number of lines to skip at the beginning of each
            file (Default: 0).

        colnames:
            The first line of a CSV file
            usually contains the column names. When this is not the case, you need to
            explicitly pass them.

        conn:
            The database connection to be used.
            If you don't explicitly pass a connection,
            the Engine will use the default connection.

    ??? example
        Let's assume you have two CSV files - *file1.csv* and
        *file2.csv* - in the bucket. You can
        import their data into the getML Engine using the following
        commands:
        ```python
        getml.engine.set_s3_access_key_id("YOUR-ACCESS-KEY-ID")

        getml.engine.set_s3_secret_access_key("YOUR-SECRET-ACCESS-KEY")

        stmt = getml.database.sniff_s3(
                bucket="your-bucket-name",
                keys=["file1.csv", "file2.csv"],
                region="us-east-2",
                name="MY_TABLE",
                sep=';'
        )

        getml.database.execute(stmt)

        getml.database.read_s3(
            bucket="your-bucket-name",
            keys=["file1.csv", "file2.csv"],
            region="us-east-2",
            name="MY_TABLE",
            sep=';'
        )
        ```
        You can also set the access credentials as environment variables
        before you launch the getML Engine.

    """
    # -------------------------------------------

    conn = conn or Connection()

    # -------------------------------------------

    cmd: Dict[str, Any] = {}

    cmd["name_"] = name
    cmd["type_"] = "Database.read_s3"

    cmd["bucket_"] = bucket
    cmd["keys_"] = keys
    cmd["num_lines_read_"] = num_lines_read
    cmd["region_"] = region
    cmd["sep_"] = sep
    cmd["skip_"] = skip
    cmd["conn_id_"] = conn.conn_id

    if colnames is not None:
        cmd["colnames_"] = colnames

    # -------------------------------------------

    comm.send(cmd)

sniff_csv

sniff_csv(
    name: str,
    fnames: Union[str, List[str]],
    num_lines_sniffed: int = 1000,
    quotechar: str = '"',
    sep: str = ",",
    skip: int = 0,
    colnames: Optional[List[str]] = None,
    conn: Optional[Connection] = None,
) -> str

Sniffs a list of CSV files.

PARAMETER DESCRIPTION
name

Name of the table in which the data is to be inserted.

TYPE: str

fnames

The list of CSV file names to be read.

TYPE: Union[str, List[str]]

num_lines_sniffed

Number of lines analyzed by the sniffer.

TYPE: int DEFAULT: 1000

quotechar

The character used to wrap strings. Default:"

TYPE: str DEFAULT: '"'

sep

The separator used for separating fields. Default:,

TYPE: str DEFAULT: ','

skip

Number of lines to skip at the beginning of each file (Default: 0).

TYPE: int DEFAULT: 0

colnames

The first line of a CSV file usually contains the column names. When this is not the case, you need to explicitly pass them.

TYPE: Optional[List[str]] DEFAULT: None

conn

The database connection to be used. If you don't explicitly pass a connection, the Engine will use the default connection.

TYPE: Optional[Connection] DEFAULT: None

RETURNS DESCRIPTION
str

Appropriate CREATE TABLE statement.

Source code in getml/database/sniff_csv.py
def sniff_csv(
    name: str,
    fnames: Union[str, List[str]],
    num_lines_sniffed: int = 1000,
    quotechar: str = '"',
    sep: str = ",",
    skip: int = 0,
    colnames: Optional[List[str]] = None,
    conn: Optional[Connection] = None,
) -> str:
    """
    Sniffs a list of CSV files.

    Args:
        name:
            Name of the table in which the data is to be inserted.

        fnames:
            The list of CSV file names to be read.

        num_lines_sniffed:
            Number of lines analyzed by the sniffer.

        quotechar:
            The character used to wrap strings. Default:`"`

        sep:
            The separator used for separating fields. Default:`,`

        skip:
            Number of lines to skip at the beginning of each
            file (Default: 0).

        colnames:
            The first line of a CSV file
            usually contains the column names. When this is not the case, you need to
            explicitly pass them.

        conn:
            The database connection to be used.
            If you don't explicitly pass a connection,
            the Engine will use the default connection.

    Returns:
        Appropriate `CREATE TABLE` statement.
    """

    return _read_csv(
        CSVCmdType.SNIFF,
        name,
        fnames,
        num_lines_sniffed,
        quotechar,
        sep,
        skip,
        colnames,
        conn,
    )
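
Example

A minimal sketch that mirrors the read_csv example above; file1.csv, file2.csv, and MY_TABLE are placeholders:

```python
import getml

# Let the sniffer propose a CREATE TABLE statement ...
stmt = getml.database.sniff_csv(
    name="MY_TABLE",
    fnames=["file1.csv", "file2.csv"],
    sep=";",
)

# ... create the table ...
getml.database.execute(stmt)

# ... and import the data.
getml.database.read_csv(
    name="MY_TABLE",
    fnames=["file1.csv", "file2.csv"],
    sep=";",
)
```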

sniff_s3

sniff_s3(
    name: str,
    bucket: str,
    keys: List[str],
    region: str,
    num_lines_sniffed: int = 1000,
    sep: str = ",",
    skip: int = 0,
    colnames: Optional[List[str]] = None,
    conn: Optional[Connection] = None,
) -> str

Sniffs a list of CSV files located in an S3 bucket.

PARAMETER DESCRIPTION
name

Name of the table in which the data is to be inserted.

TYPE: str

bucket

The bucket from which to read the files.

TYPE: str

keys

The list of keys (files in the bucket) to be read.

TYPE: List[str]

region

The region in which the bucket is located.

TYPE: str

num_lines_sniffed

Number of lines analyzed by the sniffer.

TYPE: int DEFAULT: 1000

sep

The character used for separating fields.

TYPE: str DEFAULT: ','

skip

Number of lines to skip at the beginning of each file.

TYPE: int DEFAULT: 0

colnames

The first line of a CSV file usually contains the column names. When this is not the case, you need to explicitly pass them.

TYPE: Optional[List[str]] DEFAULT: None

conn

The database connection to be used. If you don't explicitly pass a connection, the Engine will use the default connection.

TYPE: Optional[Connection] DEFAULT: None

RETURNS DESCRIPTION
str

Appropriate CREATE TABLE statement.

Example

Let's assume you have two CSV files - file1.csv and file2.csv - in the bucket. You can import their data into the getML Engine using the following commands:

getml.engine.set_s3_access_key_id("YOUR-ACCESS-KEY-ID")

getml.engine.set_s3_secret_access_key("YOUR-SECRET-ACCESS-KEY")

stmt = getml.database.sniff_s3(
        bucket="your-bucket-name",
        keys=["file1.csv", "file2.csv"],
        region="us-east-2",
        name="MY_TABLE",
        sep=';'
)
You can also set the access credentials as environment variables before you launch the getML Engine.

Source code in getml/database/sniff_s3.py
def sniff_s3(
    name: str,
    bucket: str,
    keys: List[str],
    region: str,
    num_lines_sniffed: int = 1000,
    sep: str = ",",
    skip: int = 0,
    colnames: Optional[List[str]] = None,
    conn: Optional[Connection] = None,
) -> str:
    """
    Sniffs a list of CSV files located in an S3 bucket.


    Args:
        name:
            Name of the table in which the data is to be inserted.

        bucket:
            The bucket from which to read the files.

        keys:
            The list of keys (files in the bucket) to be read.

        region:
            The region in which the bucket is located.

        num_lines_sniffed:
            Number of lines analyzed by the sniffer.

        sep:
            The character used for separating fields.

        skip:
            Number of lines to skip at the beginning of each file.

        colnames:
            The first line of a CSV file
            usually contains the column names. When this is not the case, you need to
            explicitly pass them.

        conn:
            The database connection to be used.
            If you don't explicitly pass a connection,
            the Engine will use the default connection.

    Returns:
        Appropriate `CREATE TABLE` statement.

    ??? example
        Let's assume you have two CSV files - *file1.csv* and
        *file2.csv* - in the bucket. You can
        import their data into the getML Engine using the following
        commands:
        ```python
        getml.engine.set_s3_access_key_id("YOUR-ACCESS-KEY-ID")

        getml.engine.set_s3_secret_access_key("YOUR-SECRET-ACCESS-KEY")

        stmt = getml.database.sniff_s3(
                bucket="your-bucket-name",
                keys=["file1.csv", "file2.csv"],
                region="us-east-2",
                name="MY_TABLE",
                sep=';'
        )
        ```
        You can also set the access credentials as environment variables
        before you launch the getML Engine.

    """

    conn = conn or Connection()

    cmd: Dict[str, Any] = {}

    cmd["name_"] = name
    cmd["type_"] = "Database.sniff_s3"

    cmd["bucket_"] = bucket
    cmd["keys_"] = keys
    cmd["num_lines_sniffed_"] = num_lines_sniffed
    cmd["region_"] = region
    cmd["sep_"] = sep
    cmd["skip_"] = skip
    cmd["conn_id_"] = conn.conn_id

    if colnames is not None:
        cmd["colnames_"] = colnames

    with comm.send_and_get_socket(cmd) as sock:
        msg = comm.recv_string(sock)
        if msg != "Success!":
            comm.handle_engine_exception(msg)
        return comm.recv_string(sock)