getml.database
This module contains communication routines to access various databases. The `connect_bigquery`, `connect_hana`, `connect_greenplum`, `connect_mariadb`, `connect_mysql`, `connect_postgres`, and `connect_sqlite3` functions establish a connection between a database and the getML Engine. During data import, using either the `read_db` or `read_query` method of a `DataFrame` instance or the corresponding `from_db` class method, all data is loaded directly from the database into the Engine without ever passing through the Python interpreter.
In addition, the module provides several auxiliary functions that can be handy during the analysis of and interaction with the database.
Connection
Connection(conn_id: str = 'default')
A handle to a database connection on the getML Engine.
ATTRIBUTE | DESCRIPTION
---|---
`conn_id` | The name used to reference the connection. You can call it anything you want. If a database connection with the same `conn_id` already exists, that connection is removed automatically and the new connection takes its place. The default `conn_id` is `"default"`, which refers to the default connection. If you do not explicitly pass a connection handle to a function that relates to a database, the default connection is used automatically. TYPE: `str`
connect_bigquery
connect_bigquery(
database_id: str,
project_id: str,
google_application_credentials: Union[str, Path],
time_formats: Optional[List[str]] = None,
conn_id: str = "default",
) -> Connection
Creates a new BigQuery database connection.
Enterprise edition
This feature is exclusive to the Enterprise edition and is not available in the Community edition. Discover the benefits of the Enterprise edition and compare their features.
For licensing information and technical support, please contact us.
PARAMETER | DESCRIPTION
---|---
`database_id` | The ID of the database to connect to. TYPE: `str`
`project_id` | The ID of the project to connect to. TYPE: `str`
`google_application_credentials` | The path of the Google application credentials. (Must be located on the machine hosting the getML Engine.) TYPE: `Union[str, Path]`
`time_formats` | The list of formats tried when parsing time stamps. The formats are allowed to contain special placeholder characters. TYPE: `Optional[List[str]]`
`conn_id` | The name to be used to reference the connection. If you do not pass anything, this will create a new default connection. TYPE: `str`

RETURNS | DESCRIPTION
---|---
`Connection` | The connection object.
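The formats in `time_formats` are tried in order until one of them parses a given stamp. As a rough illustration of that behavior, using Python's stdlib `datetime` rather than the getML parser (the exact placeholder set supported by the Engine may differ from strftime's):

```python
from datetime import datetime

# Try each format in order; the first one that parses wins.
# Placeholders follow the familiar strftime conventions here
# (%Y year, %m month, %d day, %H/%M/%S time of day).
formats = ["%Y-%m-%dT%H:%M:%S", "%Y-%m-%d %H:%M:%S", "%Y-%m-%d"]

def parse_time_stamp(stamp, formats):
    for fmt in formats:
        try:
            return datetime.strptime(stamp, fmt)
        except ValueError:
            continue
    return None  # an unparsable stamp yields no value

print(parse_time_stamp("2024-05-01 12:30:00", formats))
```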
connect_greenplum
connect_greenplum(
dbname: str,
user: str,
password: str,
host: str,
hostaddr: str,
port: int = 5432,
time_formats: Optional[List[str]] = None,
conn_id: str = "default",
) -> Connection
Creates a new Greenplum database connection.
But first, make sure your database is running, and you can reach it from your command line.
Enterprise edition
This feature is exclusive to the Enterprise edition and is not available in the Community edition. Discover the benefits of the Enterprise edition and compare their features.
For licensing information and technical support, please contact us.
PARAMETER | DESCRIPTION
---|---
`dbname` | The name of the database to which you want to connect. TYPE: `str`
`user` | Username with which to log into the Greenplum database. TYPE: `str`
`password` | Password with which to log into the Greenplum database. TYPE: `str`
`host` | Host of the Greenplum database. TYPE: `str`
`hostaddr` | IP address of the Greenplum database. TYPE: `str`
`port` | Port of the Greenplum database. The default port used by Greenplum is 5432. If you are unsure which port to use, you can query it from your Greenplum client. TYPE: `int`
`time_formats` | The list of formats tried when parsing time stamps. The formats are allowed to contain special placeholder characters. TYPE: `Optional[List[str]]`
`conn_id` | The name to be used to reference the connection. If you do not pass anything, this will create a new default connection. TYPE: `str`

RETURNS | DESCRIPTION
---|---
`Connection` | The connection object.
Note

By selecting an existing table of your database in the `from_db` class method, you can create a new `DataFrame` containing all its data. Alternatively, you can use the `read_db` and `read_query` methods to replace the content of the current `DataFrame` instance or to append further rows based on either a table or a specific query.

You can also write your results back into the Greenplum database. By passing the name of the destination table to `transform`, the features generated from your raw data will be written back. Passing it to `predict` instead makes predictions for the target variables on new, unseen data and stores the result in the corresponding table.
connect_hana
connect_hana(
user: str,
password: str,
host: str,
port: int = 39017,
default_schema: Optional[str] = "public",
ping_interval: int = 0,
time_formats: Optional[List[str]] = None,
conn_id: str = "default",
) -> Connection
Creates a new HANA database connection.
Enterprise edition
This feature is exclusive to the Enterprise edition and is not available in the Community edition. Discover the benefits of the Enterprise edition and compare their features.
For licensing information and technical support, please contact us.
PARAMETER | DESCRIPTION
---|---
`user` | Username with which to log into the HANA database. TYPE: `str`
`password` | Password with which to log into the HANA database. TYPE: `str`
`host` | Host of the HANA database. TYPE: `str`
`port` | Port of the database. TYPE: `int`
`default_schema` | The schema within the database to use unless another schema is explicitly set. TYPE: `Optional[str]`
`ping_interval` | The interval at which you want to ping the database, in seconds. Set to 0 for no pings at all. TYPE: `int`
`time_formats` | The list of formats tried when parsing time stamps. The formats are allowed to contain special placeholder characters. TYPE: `Optional[List[str]]`
`conn_id` | The name to be used to reference the connection. If you do not pass anything, this will create a new default connection. TYPE: `str`

RETURNS | DESCRIPTION
---|---
`Connection` | The connection object.
connect_mariadb
connect_mariadb(
dbname: str,
user: str,
password: str,
host: str,
port: int = 3306,
unix_socket: str = "/var/run/mysqld/mysqld.sock",
time_formats: Optional[List[str]] = None,
conn_id: str = "default",
) -> Connection
Creates a new MariaDB database connection.
But first, make sure your database is running and that you can reach it from your command line.
PARAMETER | DESCRIPTION
---|---
`dbname` | The name of the database to which you want to connect. TYPE: `str`
`user` | Username with which to log into the MariaDB database. TYPE: `str`
`password` | Password with which to log into the MariaDB database. TYPE: `str`
`host` | Host of the MariaDB database. TYPE: `str`
`port` | Port of the MariaDB database. The default port for MariaDB is 3306. If you are unsure which port to use, you can query it from your MariaDB client. TYPE: `int`
`unix_socket` | The UNIX socket used to connect to the MariaDB database. If you are unsure which UNIX socket to use, you can query it from your MariaDB client. TYPE: `str`
`time_formats` | The list of formats tried when parsing time stamps. The formats are allowed to contain special placeholder characters. TYPE: `Optional[List[str]]`
`conn_id` | The name to be used to reference the connection. If you do not pass anything, this will create a new default connection. TYPE: `str`

RETURNS | DESCRIPTION
---|---
`Connection` | The connection object.
Note

By selecting an existing table of your database in the `from_db` class method, you can create a new `DataFrame` containing all its data. Alternatively, you can use the `read_db` and `read_query` methods to replace the content of the current `DataFrame` instance or to append further rows based on either a table or a specific query.

You can also write your results back into the MariaDB database. By passing the name of the destination table to `transform`, the features generated from your raw data will be written back. Passing it to `predict` instead makes predictions for the target variables on new, unseen data and stores the result in the corresponding table.
connect_mysql
connect_mysql(
dbname: str,
user: str,
password: str,
host: str,
port: int = 3306,
unix_socket: str = "/var/run/mysqld/mysqld.sock",
time_formats: Optional[List[str]] = None,
conn_id: str = "default",
) -> Connection
Creates a new MySQL database connection.
But first, make sure your database is running and that you can reach it from your command line.
Enterprise edition
This feature is exclusive to the Enterprise edition and is not available in the Community edition. Discover the benefits of the Enterprise edition and compare their features.
For licensing information and technical support, please contact us.
PARAMETER | DESCRIPTION
---|---
`dbname` | The name of the database to which you want to connect. TYPE: `str`
`user` | Username with which to log into the MySQL database. TYPE: `str`
`password` | Password with which to log into the MySQL database. TYPE: `str`
`host` | Host of the MySQL database. TYPE: `str`
`port` | Port of the MySQL database. The default port for MySQL is 3306. If you are unsure which port to use, you can query it from your MySQL client. TYPE: `int`
`unix_socket` | The UNIX socket used to connect to the MySQL database. If you are unsure which UNIX socket to use, you can query it from your MySQL client. TYPE: `str`
`time_formats` | The list of formats tried when parsing time stamps. The formats are allowed to contain special placeholder characters. TYPE: `Optional[List[str]]`
`conn_id` | The name to be used to reference the connection. If you do not pass anything, this will create a new default connection. TYPE: `str`

RETURNS | DESCRIPTION
---|---
`Connection` | The connection object.
Note

By selecting an existing table of your database in the `from_db` class method, you can create a new `DataFrame` containing all its data. Alternatively, you can use the `read_db` and `read_query` methods to replace the content of the current `DataFrame` instance or to append further rows based on either a table or a specific query.

You can also write your results back into the MySQL database. By passing the name of the destination table to `transform`, the features generated from your raw data will be written back. Passing it to `predict` instead makes predictions for the target variables on new, unseen data and stores the result in the corresponding table.
connect_odbc
connect_odbc(
server_name: str,
user: str = "",
password: str = "",
escape_chars: str = '"',
double_precision: str = "DOUBLE PRECISION",
integer: str = "INTEGER",
text: str = "TEXT",
time_formats: Optional[List[str]] = None,
conn_id: str = "default",
) -> Connection
Creates a new ODBC database connection.
ODBC is a standardized interface that can be used to connect to almost any database.
Before you use the ODBC connector, make sure the database is up and running and that the appropriate ODBC drivers are installed.
Enterprise edition
This feature is exclusive to the Enterprise edition and is not available in the Community edition. Discover the benefits of the Enterprise edition and compare their features.
For licensing information and technical support, please contact us.
PARAMETER | DESCRIPTION
---|---
`server_name` | The server name, as referenced in your .odbc.ini file. TYPE: `str`
`user` | Username with which to log into the database. You do not need to pass this if it is already contained in your .odbc.ini. TYPE: `str`
`password` | Password with which to log into the database. You do not need to pass this if it is already contained in your .odbc.ini. TYPE: `str`
`escape_chars` | ODBC drivers are supposed to support escaping table and column names using '"' characters, irrespective of the syntax in the target database. Unfortunately, not all ODBC drivers follow this standard, which is why some tuning via this parameter might be necessary. TYPE: `str`
`double_precision` | The keyword used for double precision columns. TYPE: `str`
`integer` | The keyword used for integer columns. TYPE: `str`
`text` | The keyword used for text columns. TYPE: `str`
`time_formats` | The list of formats tried when parsing time stamps. The formats are allowed to contain special placeholder characters. TYPE: `Optional[List[str]]`
`conn_id` | The name to be used to reference the connection. If you do not pass anything, this will create a new default connection. TYPE: `str`

RETURNS | DESCRIPTION
---|---
`Connection` | The connection object.
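To see what `escape_chars` controls, here is a hypothetical `quote_identifier` helper (not part of getml) that wraps identifiers the way a standard-conforming ODBC driver is supposed to:

```python
def quote_identifier(name: str, escape_chars: str = '"') -> str:
    """Wrap a table or column name in the configured escape
    characters so that reserved words and spaces survive.
    Doubling an embedded quote is the usual SQL escaping rule."""
    q = escape_chars
    return q + name.replace(q, q + q) + q

print(quote_identifier("order"))          # "order"
print(quote_identifier("my table", "`"))  # `my table`
```

A driver that deviates from the standard would need a different value here, which is exactly the tuning the parameter exists for.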
connect_postgres
connect_postgres(
dbname: str,
user: str,
password: str,
host: Optional[str] = None,
hostaddr: Optional[str] = None,
port: int = 5432,
time_formats: Optional[List[str]] = None,
conn_id: str = "default",
) -> Connection
Creates a new PostgreSQL database connection.
But first, make sure your database is running and that you can reach it from your command line.
Enterprise edition
This feature is exclusive to the Enterprise edition and is not available in the Community edition. Discover the benefits of the Enterprise edition and compare their features.
For licensing information and technical support, please contact us.
PARAMETER | DESCRIPTION
---|---
`dbname` | The name of the database to which you want to connect. TYPE: `str`
`user` | Username with which to log into the PostgreSQL database. TYPE: `str`
`password` | Password with which to log into the PostgreSQL database. TYPE: `str`
`host` | Host of the PostgreSQL database. TYPE: `Optional[str]`
`hostaddr` | IP address of the PostgreSQL database. This should be in the standard IPv4 address format, e.g., 172.28.40.9. If your machine supports IPv6, you can also use those addresses. TCP/IP communication is always used when a nonempty string is specified for this parameter. TYPE: `Optional[str]`
`port` | Port of the PostgreSQL database. The default port used by PostgreSQL is 5432. If you are unsure which port to use, you can query it from your PostgreSQL client. TYPE: `int`
`time_formats` | The list of formats tried when parsing time stamps. The formats are allowed to contain special placeholder characters. TYPE: `Optional[List[str]]`
`conn_id` | The name to be used to reference the connection. If you do not pass anything, this will create a new default connection. TYPE: `str`

RETURNS | DESCRIPTION
---|---
`Connection` | The connection object.
Note

By selecting an existing table of your database in the `from_db` class method, you can create a new `DataFrame` containing all its data. Alternatively, you can use the `read_db` and `read_query` methods to replace the content of the current `DataFrame` instance or to append further rows based on either a table or a specific query.

You can also write your results back into the PostgreSQL database. By passing the name of the destination table to `transform`, the features generated from your raw data will be written back. Passing it to `predict` instead makes predictions for the target variables on new, unseen data and stores the result in the corresponding table.
connect_sqlite3
connect_sqlite3(
name: str = ":memory:",
time_formats: Optional[List[str]] = None,
conn_id: str = "default",
) -> Connection
Creates a new SQLite3 database connection.
SQLite3 is a popular in-memory database. It is faster than distributed databases like PostgreSQL, but less stable under massive parallel access. It also consumes more memory, since it requires all contained data sets to be loaded into memory, which might fill up too much of your RAM, especially for large data sets.
PARAMETER | DESCRIPTION
---|---
`name` | Name of the SQLite3 file. If the file does not exist, it will be created. Set to ":memory:" for a purely in-memory SQLite3 database. TYPE: `str`
`time_formats` | The list of formats tried when parsing time stamps. The formats are allowed to contain special placeholder characters. TYPE: `Optional[List[str]]`
`conn_id` | The name to be used to reference the connection. If you do not pass anything, this will create a new default connection. TYPE: `str`

RETURNS | DESCRIPTION
---|---
`Connection` | The new SQLite3 database connection.
Note

By selecting an existing table of your database in the `from_db` class method, you can create a new `DataFrame` containing all its data. Alternatively, you can use the `read_db` and `read_query` methods to replace the content of the current `DataFrame` instance or to append further rows based on either a table or a specific query.

You can also write your results back into the SQLite3 database. By passing the name of the destination table to `transform`, the features generated from your raw data will be written back. Passing it to `predict` instead makes predictions for the target variables on new, unseen data and stores the result in the corresponding table.
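The ":memory:" convention for `name` is SQLite's own, not something specific to getML. Python's standard-library `sqlite3` module follows the same rule, which makes for a quick self-contained illustration:

```python
import sqlite3

# ":memory:" creates a purely in-memory database -- nothing touches
# the disk. Any other string is treated as a file path and the file
# is created if it does not yet exist.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (x INTEGER)")
conn.execute("INSERT INTO t VALUES (1), (2), (3)")
total = conn.execute("SELECT SUM(x) FROM t").fetchone()[0]
print(total)  # 6
```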
copy_table
copy_table(
source_conn: Connection,
target_conn: Connection,
source_table: str,
target_table: Optional[str] = None,
)
Copies a table from one database connection to another.
PARAMETER | DESCRIPTION
---|---
`source_conn` | The database connection to be copied from. TYPE: `Connection`
`target_conn` | The database connection to be copied to. TYPE: `Connection`
`source_table` | The name of the table in the source connection. TYPE: `str`
`target_table` | The name of the table in the target connection. If you do not explicitly pass a target_table, the name will be identical to the source_table. TYPE: `Optional[str]`
Example

A frequent use case for this function is to copy data from a data source into SQLite. This is a good idea, because SQLite is faster than most standard, ACID-compliant databases, and it also keeps you from accidentally messing up a production environment.

It is important to explicitly pass `conn_id`, otherwise the source connection will be closed when you create the target connection. What you pass as `conn_id` is entirely up to you, as long as the `conn_id` of the source and the target are different. It can be any name you see fit.

```python
source_conn = getml.database.connect_odbc(
    "MY-SERVER-NAME", conn_id="MY-SOURCE")

target_conn = getml.database.connect_sqlite3(
    "MY-SQLITE.db", conn_id="MY-TARGET")

getml.database.copy_table(
    source_conn=source_conn,
    target_conn=target_conn,
    source_table="MY-TABLE",
)
```
drop_table
drop_table(name: str, conn: Optional[Connection] = None)
Drops a table from the database.
PARAMETER | DESCRIPTION
---|---
`name` | The table to be dropped. TYPE: `str`
`conn` | The database connection to be used. If you don't explicitly pass a connection, the Engine will use the default connection. TYPE: `Optional[Connection]`
execute
execute(query: str, conn: Optional[Connection] = None)
Executes an SQL query on the database.
Please note that this is not meant to return results. If you want to get results, use `database.get` instead.
PARAMETER | DESCRIPTION
---|---
`query` | The SQL query to be executed. TYPE: `str`
`conn` | The database connection to be used. If you don't explicitly pass a connection, the Engine will use the default connection. TYPE: `Optional[Connection]`
get
get(
query: str, conn: Optional[Connection] = None
) -> DataFrame
Executes an SQL query on the database and returns the result as a pandas dataframe.
PARAMETER | DESCRIPTION
---|---
`query` | The SQL query to be executed. TYPE: `str`
`conn` | The database connection to be used. If you don't explicitly pass a connection, the Engine will use the default connection. TYPE: `Optional[Connection]`

RETURNS | DESCRIPTION
---|---
`DataFrame` | The result of the query as a pandas dataframe.
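The split between `execute` (runs a statement, returns nothing) and `get` (runs a query and hands the result back) mirrors the usual DB-API pattern. A sketch of that split using Python's stdlib `sqlite3` module rather than the getML Engine:

```python
import sqlite3

# `execute` side of the pattern: DDL and INSERTs return no rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scores (name TEXT, score INTEGER)")
conn.execute("INSERT INTO scores VALUES ('a', 10), ('b', 20)")

# `get` side of the pattern: run a SELECT and fetch the result set.
rows = conn.execute(
    "SELECT name, score FROM scores ORDER BY score DESC"
).fetchall()
print(rows)  # [('b', 20), ('a', 10)]
```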
get_colnames
get_colnames(
name: str, conn: Optional[Connection] = None
) -> List[str]
Lists the colnames of a table held in the database.
PARAMETER | DESCRIPTION
---|---
`name` | The name of the table in the database. TYPE: `str`
`conn` | The database connection to be used. If you don't explicitly pass a connection, the Engine will use the default connection. TYPE: `Optional[Connection]`

RETURNS | DESCRIPTION
---|---
`List[str]` | A list of strings containing the names of the columns in the table.
list_connections
list_connections() -> List[Connection]
Returns a list of handles to all connections that are currently active on the Engine.
RETURNS | DESCRIPTION
---|---
`List[Connection]` | A list of Connection objects.
list_tables
list_tables(conn: Optional[Connection] = None) -> List[str]
Lists all tables and views currently held in the database.
PARAMETER | DESCRIPTION
---|---
`conn` | The database connection to be used. If you don't explicitly pass a connection, the Engine will use the default connection. TYPE: `Optional[Connection]`

RETURNS | DESCRIPTION
---|---
`List[str]` | A list of strings containing the names of the tables and views.
read_csv
read_csv(
name: str,
fnames: Union[str, List[str]],
quotechar: str = '"',
sep: str = ",",
num_lines_read: int = 0,
skip: int = 0,
colnames: Optional[List[str]] = None,
conn: Optional[Connection] = None,
) -> None
Reads a CSV file into the database.
PARAMETER | DESCRIPTION
---|---
`name` | Name of the table in which the data is to be inserted. TYPE: `str`
`fnames` | The list of CSV file names to be read. TYPE: `Union[str, List[str]]`
`quotechar` | The character used to wrap strings. Default: `"`. TYPE: `str`
`sep` | The separator used for separating fields. Default: `,`. TYPE: `str`
`num_lines_read` | Number of lines read from each file. Set to 0 to read in the entire file. TYPE: `int`
`skip` | Number of lines to skip at the beginning of each file. Default: 0. TYPE: `int`
`colnames` | The first line of a CSV file usually contains the column names. When this is not the case, you need to explicitly pass them. TYPE: `Optional[List[str]]`
`conn` | The database connection to be used. If you don't explicitly pass a connection, the Engine will use the default connection. TYPE: `Optional[Connection]`
Example

Let's assume you have two CSV files, file1.csv and file2.csv. You can import their data into the database using the following commands:

```python
stmt = getml.database.sniff_csv(
    fnames=["file1.csv", "file2.csv"],
    name="MY_TABLE",
    sep=';',
)

getml.database.execute(stmt)

getml.database.read_csv(
    fnames=["file1.csv", "file2.csv"],
    name="MY_TABLE",
    sep=';',
)
```
read_s3
read_s3(
name: str,
bucket: str,
keys: List[str],
region: str,
sep: str = ",",
num_lines_read: int = 0,
skip: int = 0,
colnames: Optional[List[str]] = None,
conn: Optional[Connection] = None,
) -> None
Reads a list of CSV files located in an S3 bucket.
PARAMETER | DESCRIPTION
---|---
`name` | Name of the table in which the data is to be inserted. TYPE: `str`
`bucket` | The bucket from which to read the files. TYPE: `str`
`keys` | The list of keys (files in the bucket) to be read. TYPE: `List[str]`
`region` | The region in which the bucket is located. TYPE: `str`
`sep` | The separator used for separating fields. Default: `,`. TYPE: `str`
`num_lines_read` | Number of lines read from each file. Set to 0 to read in the entire file. TYPE: `int`
`skip` | Number of lines to skip at the beginning of each file. Default: 0. TYPE: `int`
`colnames` | The first line of a CSV file usually contains the column names. When this is not the case, you need to explicitly pass them. TYPE: `Optional[List[str]]`
`conn` | The database connection to be used. If you don't explicitly pass a connection, the Engine will use the default connection. TYPE: `Optional[Connection]`
Example

Let's assume you have two CSV files, file1.csv and file2.csv, in the bucket. You can import their data into the getML Engine using the following commands:

```python
getml.engine.set_s3_access_key_id("YOUR-ACCESS-KEY-ID")
getml.engine.set_s3_secret_access_key("YOUR-SECRET-ACCESS-KEY")

stmt = getml.database.sniff_s3(
    bucket="your-bucket-name",
    keys=["file1.csv", "file2.csv"],
    region="us-east-2",
    name="MY_TABLE",
    sep=';',
)

getml.database.execute(stmt)

getml.database.read_s3(
    bucket="your-bucket-name",
    keys=["file1.csv", "file2.csv"],
    region="us-east-2",
    name="MY_TABLE",
    sep=';',
)
```
sniff_csv
sniff_csv(
name: str,
fnames: Union[str, List[str]],
num_lines_sniffed: int = 1000,
quotechar: str = '"',
sep: str = ",",
skip: int = 0,
colnames: Optional[List[str]] = None,
conn: Optional[Connection] = None,
) -> str
Sniffs a list of CSV files.
PARAMETER | DESCRIPTION
---|---
`name` | Name of the table in which the data is to be inserted. TYPE: `str`
`fnames` | The list of CSV file names to be read. TYPE: `Union[str, List[str]]`
`num_lines_sniffed` | Number of lines analyzed by the sniffer. TYPE: `int`
`quotechar` | The character used to wrap strings. Default: `"`. TYPE: `str`
`sep` | The separator used for separating fields. Default: `,`. TYPE: `str`
`skip` | Number of lines to skip at the beginning of each file. Default: 0. TYPE: `int`
`colnames` | The first line of a CSV file usually contains the column names. When this is not the case, you need to explicitly pass them. TYPE: `Optional[List[str]]`
`conn` | The database connection to be used. If you don't explicitly pass a connection, the Engine will use the default connection. TYPE: `Optional[Connection]`

RETURNS | DESCRIPTION
---|---
`str` | An appropriate `CREATE TABLE` statement.
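The returned statement is inferred from the sampled lines: the sniffer looks at the values in each column and picks a column type accordingly. A toy re-implementation (not getml's actual sniffer; the helper name is made up) shows the idea:

```python
import csv
import io

def toy_sniff_csv(name, text, sep=","):
    # Infer INTEGER / DOUBLE PRECISION / TEXT per column from the
    # sample rows and emit the kind of CREATE TABLE statement that
    # sniff_csv returns.
    rows = list(csv.reader(io.StringIO(text), delimiter=sep))
    header, body = rows[0], rows[1:]

    def col_type(values):
        try:
            for v in values:
                int(v)
            return "INTEGER"
        except ValueError:
            pass
        try:
            for v in values:
                float(v)
            return "DOUBLE PRECISION"
        except ValueError:
            return "TEXT"

    cols = ", ".join(
        f'"{h}" {col_type([row[i] for row in body])}'
        for i, h in enumerate(header)
    )
    return f'CREATE TABLE "{name}" ({cols});'

stmt = toy_sniff_csv("MY_TABLE", "id,price,label\n1,2.5,x\n2,3.0,y")
print(stmt)
```

Passing the resulting statement to `execute` creates the table, after which `read_csv` can fill it, as the example below demonstrates.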
sniff_s3
sniff_s3(
name: str,
bucket: str,
keys: List[str],
region: str,
num_lines_sniffed: int = 1000,
sep: str = ",",
skip: int = 0,
colnames: Optional[List[str]] = None,
conn: Optional[Connection] = None,
) -> str
Sniffs a list of CSV files located in an S3 bucket.
PARAMETER | DESCRIPTION
---|---
`name` | Name of the table in which the data is to be inserted. TYPE: `str`
`bucket` | The bucket from which to read the files. TYPE: `str`
`keys` | The list of keys (files in the bucket) to be read. TYPE: `List[str]`
`region` | The region in which the bucket is located. TYPE: `str`
`num_lines_sniffed` | Number of lines analyzed by the sniffer. TYPE: `int`
`sep` | The character used for separating fields. Default: `,`. TYPE: `str`
`skip` | Number of lines to skip at the beginning of each file. Default: 0. TYPE: `int`
`colnames` | The first line of a CSV file usually contains the column names. When this is not the case, you need to explicitly pass them. TYPE: `Optional[List[str]]`
`conn` | The database connection to be used. If you don't explicitly pass a connection, the Engine will use the default connection. TYPE: `Optional[Connection]`

RETURNS | DESCRIPTION
---|---
`str` | An appropriate `CREATE TABLE` statement.
Example

Let's assume you have two CSV files, file1.csv and file2.csv, in the bucket. You can sniff them with the following commands:

```python
getml.engine.set_s3_access_key_id("YOUR-ACCESS-KEY-ID")
getml.engine.set_s3_secret_access_key("YOUR-SECRET-ACCESS-KEY")

stmt = getml.database.sniff_s3(
    bucket="your-bucket-name",
    keys=["file1.csv", "file2.csv"],
    region="us-east-2",
    name="MY_TABLE",
    sep=';',
)
```