Importing data
Before you can analyze and process your data with the getML Suite, you have to import it into the getML Engine. At the end of this step, your data will reside in data frame objects in the Engine, ready to be annotated.
Note
If you have imported your data into the Engine before and want to restore it, refer to Python API: DataFrame
Unified import interface
The getML Python API provides a unified import interface requiring similar arguments and resulting in the same output format, regardless of the data source.
You can use one of the dedicated class methods (e.g. from_csv()) to construct a data frame object in the getML Engine, fill it with the provided data, and retrieve a DataFrame handle in the Python API.
Example
Example demonstrating the use of from_csv():
data = getml.data.DataFrame.from_csv(
    "path/to/my/data.csv",
    "my_data"
)
If you already have a data frame object in place, you can use the methods of the corresponding DataFrame handle (e.g. read_csv()) to either replace its content with new data or append to it. All these functions also have counterparts for exporting (e.g. to_csv()).
Data Frames
The resulting DataFrame instance in the Python API represents a handle to the corresponding data frame object in the getML Engine. The mapping between the two is based on the name of the object, which has to be unique. Similarly, the names of the columns have to be unique within the data frame they are associated with.
Handling of NULL values
Unfortunately, data sources often contain missing or corrupt entries, also called NULL values. getML is able to work with missing values everywhere except in the target variable, which must not contain any NULL values (a missing target can neither be learned from nor meaningfully predicted). Please refer to the section on join keys for details about their handling during the construction of the data model.
During import, a NULL value is automatically inserted at all occurrences of the strings "nan", "None", "NA", or an empty string, as well as at all occurrences of None and NaN.
Import Formats
CSV
The fastest way to import data into the getML Engine is to read it directly from CSV files.
Import from CSV
Using the from_csv() class method, you can create a new DataFrame based on a table stored in the provided file(s). The read_csv() method will replace the content of the current DataFrame instance or append further rows.
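For instance (a minimal sketch; the second file path is hypothetical, and we assume its columns match those of the existing data frame):
df = getml.data.DataFrame.from_csv("path/to/my/data.csv", "my_data")
# Append further rows from another file with the same column layout.
df.read_csv("path/to/more_data.csv", append=True)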
Export to CSV
In addition to reading data from a CSV file, you can also write an existing DataFrame back into one using to_csv().
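Continuing the sketch above (the output path is hypothetical):
# Write the data frame back to disk as a CSV file.
df.to_csv("path/to/my/export.csv")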
Pandas
Pandas is one of the key packages used in most Python data science projects. The associated import interface is one of the slowest, but it lets you harness the excellent data exploration and manipulation capabilities of this package.
Import from Pandas
Using the DataFrame.from_pandas() class method, you can create a new DataFrame based on the provided pandas.DataFrame. The read_pandas() method will replace the content of the current DataFrame instance or append further rows.
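A minimal sketch, assuming pandas is installed; the column contents are made up for illustration:
import pandas as pd
import getml

# Build a small pandas.DataFrame in memory.
pandas_df = pd.DataFrame({"name": ["a", "b"], "value": [1.0, 2.0]})

# Create a data frame object named "my_data" in the getML Engine from it.
df = getml.data.DataFrame.from_pandas(pandas_df, "my_data")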
Export to Pandas
In addition to reading data from a pandas.DataFrame, you can also write an existing DataFrame back into a pandas.DataFrame using DataFrame.to_pandas().
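Continuing the sketch above:
# Retrieve the contents back as a pandas.DataFrame, e.g. for plotting.
pandas_df_roundtrip = df.to_pandas()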
Note
Due to the way data is stored within the getML Engine, the dtypes of the original pandas.DataFrame cannot be restored exactly, and timestamps might show inconsistencies at microsecond precision.
JSON
JSON is a convenient but slow way to import data into the getML Engine via its Python API.
Import from JSON
Using the from_json() class method, you can create a new DataFrame based on a JSON string. The read_json() method will replace the content of the current DataFrame instance or append further rows.
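A minimal sketch; the payload is made up for illustration, and we assume the column-oriented format that maps column names to arrays of values:
json_str = '{"name": ["a", "b"], "value": [1.0, 2.0]}'
df = getml.data.DataFrame.from_json(json_str, "my_data")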
Export to JSON
In addition to reading data from a JSON string, you can also convert an existing DataFrame into one using to_json().
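Continuing the sketch above:
# Serialize the data frame back into a JSON string.
json_out = df.to_json()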
SQLite3
SQLite3 is a popular database that runs in memory. It is faster than classical relational databases like PostgreSQL, but less stable under massive parallel access, and it requires all contained datasets to be loaded into memory, which might use up too much RAM, especially for large datasets.
As with all other databases in the unified import interface of the getML Python API, you first need to connect to it using connect_sqlite3().
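For example (the file name is hypothetical):
# Connect the Engine to an SQLite3 database file.
getml.database.connect_sqlite3(name="path/to/my/database.db")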
Import from SQLite3
By selecting an existing table from your database in the DataFrame.from_db() class method, you can create a new DataFrame containing all its data. Alternatively, you can use the read_db() and read_query() methods to replace the content of the current DataFrame instance or append further rows based on either a table or a specific query.
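A sketch of both approaches; the table name MY_TABLE and the query are assumptions for illustration. The same pattern applies to every database reachable through the unified import interface:
# Import a whole table into a new data frame object.
df = getml.data.DataFrame.from_db("MY_TABLE", "my_data")

# Or replace the content of an existing handle with the result of a query.
df.read_query("SELECT * FROM MY_TABLE WHERE value > 0.5;")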
Export to SQLite3
You can also write your results back into the SQLite3 database. By providing a name for the destination table in transform(), the features generated from your raw data will be written back. Passing a table name into predict() generates predictions of the target variables for new, unseen data and stores the result in the corresponding table.
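A sketch, assuming a trained Pipeline pipe and a data container with a test subset; the table names are hypothetical:
# Write the generated features into the table "FEATURES".
pipe.transform(container.test, table_name="FEATURES")

# Write predictions for unseen data into the table "PREDICTIONS".
pipe.predict(container.test, table_name="PREDICTIONS")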
MySQL
MySQL is one of the most popular databases in use today. It can be connected to the getML Engine using the function connect_mysql(). But first, make sure your database is running, you have the corresponding hostname, port, user name, and password ready, and you can reach the database from your command line.
If you are unsure which port or socket your MySQL instance is running on, you can start the mysql command line interface:
$ mysql
Once inside the MySQL interface, use the following queries to get the required insights:
> SELECT @@port;
> SELECT @@socket;
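With those values at hand, you can establish the connection (a sketch with placeholder credentials):
getml.database.connect_mysql(
    host="localhost",
    port=3306,
    dbname="MY_DATABASE",
    user="YOUR-USERNAME",
    password="YOUR-PASSWORD"
)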
Import from MySQL
By selecting an existing table of your database in the DataFrame.from_db() class method, you can create a new DataFrame containing all its data. Alternatively, you can use the read_db() and read_query() methods to replace the content of the current DataFrame instance or append further rows based on either a table or a specific query.
Export to MySQL
You can also write your results back into the MySQL database. By providing a name for the destination table in transform(), the features generated from your raw data will be written back. Passing a table name into predict() generates predictions of the target variables for new, unseen data and stores the result in the corresponding table.
MariaDB
MariaDB is a popular open source fork of MySQL. It can be connected to the getML Engine using the function connect_mariadb(). But first, make sure your database is running, you have the corresponding hostname, port, username, and password ready, and you can reach it from your command line.
If you are unsure which port or socket your MariaDB instance is running on, you can start the mysql command line interface:
$ mysql
Once inside the MariaDB interface, use the following queries to get the required insights:
MariaDB [(none)]> SELECT @@port;
MariaDB [(none)]> SELECT @@socket;
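The connection call mirrors the MySQL case (a sketch with placeholder credentials):
getml.database.connect_mariadb(
    host="localhost",
    port=3306,
    dbname="MY_DATABASE",
    user="YOUR-USERNAME",
    password="YOUR-PASSWORD"
)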
Import from MariaDB
By selecting an existing table of your database in the DataFrame.from_db() class method, you can create a new DataFrame containing all its data. Alternatively, you can use the read_db() and read_query() methods to replace the content of the current DataFrame instance or append further rows based on either a table or a specific query.
Export to MariaDB
You can also write your results back into the MariaDB database. By providing a name for the destination table in transform(), the features generated from your raw data will be written back. Passing a table name into predict() generates predictions of the target variables for new, unseen data and stores the result in the corresponding table.
PostgreSQL
PostgreSQL is a powerful and well-established open source database system. It can be connected to the getML Engine using the function connect_postgres(). Make sure your database is running, you have the corresponding hostname, port, user name, and password ready, and you can reach it from your command line.
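A sketch of the connection call with placeholder credentials (exact parameter names may differ between getML versions):
getml.database.connect_postgres(
    host="localhost",
    port=5432,
    dbname="MY_DATABASE",
    user="YOUR-USERNAME",
    password="YOUR-PASSWORD"
)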
Import from PostgreSQL
By selecting an existing table from your database in the DataFrame.from_db() class method, you can create a new DataFrame containing all its data. Alternatively, you can use the read_db() and read_query() methods to replace the content of the current DataFrame instance or append further rows based on either a table or a specific query.
Export to PostgreSQL
You can also write your results back into the PostgreSQL database. If you provide a name for the destination table in transform(), the features generated from your raw data will be written back. Passing a table name into predict() generates predictions of the target variables for new, unseen data and stores the result in the corresponding table.
Greenplum
Enterprise edition
This feature is exclusive to the Enterprise edition and is not available in the Community edition. Discover the benefits of the Enterprise edition and compare the features of the two editions.
For licensing information and technical support, please contact us.
Greenplum is an open source database system maintained by Pivotal Software, Inc. It can be connected to the getML Engine using the function connect_greenplum(). But first, make sure your database is running, you have the corresponding hostname, port, user name, and password ready, and you can reach it from your command line.
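The connection follows the same pattern as PostgreSQL (a sketch with placeholder credentials):
getml.database.connect_greenplum(
    host="localhost",
    port=5432,
    dbname="MY_DATABASE",
    user="YOUR-USERNAME",
    password="YOUR-PASSWORD"
)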
Import from Greenplum
By selecting an existing table of your database in the from_db() class method, you can create a new DataFrame containing all its data. Alternatively, you can use the read_db() and read_query() methods to replace the content of the current DataFrame instance or append further rows based on either a table or a specific query.
Export to Greenplum
You can also write your results back into the Greenplum database. By providing a name for the destination table in Pipeline.transform(), the features generated from your raw data will be written back. Passing a table name into Pipeline.predict() generates predictions of the target variables for new, unseen data and stores the result in the corresponding table.
ODBC
Enterprise edition
This feature is exclusive to the Enterprise edition and is not available in the Community edition. Discover the benefits of the Enterprise edition and compare the features of the two editions.
For licensing information and technical support, please contact us.
ODBC (Open Database Connectivity) is an API specification, developed by Microsoft, for connecting software to a database.
In a nutshell, it works like this:
- Any database of relevance has an ODBC driver that translates calls from the ODBC API into a format the database can understand, returning the query results in a format understood by the ODBC API.
- To connect getML or other software to a database using ODBC, you first need to install the ODBC driver provided by your database vendor.
- In theory, ODBC drivers should translate queries from the SQL 99 standard into the database's specific SQL dialect, but in practice this is often ignored. Also, not all ODBC drivers support all ODBC calls.
With getML, native APIs are preferred for connecting to relational databases. ODBC is used when native APIs are not feasible due to licensing or other restrictions, especially for connecting to proprietary databases like Oracle, Microsoft SQL Server, or IBM DB2.
ODBC comes pre-installed on modern Windows operating systems. Linux and macOS can use open-source implementations such as unixODBC and iODBC; getML uses unixODBC.
An example: Microsoft SQL Server
To connect to Microsoft SQL Server using ODBC:
- If you do not have a Microsoft SQL Server instance, you can download a trial or development version.
- Download the ODBC driver for SQL Server.
- Configure the ODBC driver. Many drivers provide customized scripts for this, so manual configuration might not be necessary.
For Linux and macOS, create a .odbc.ini file in your home directory with the following contents:
[ANY-NAME-YOU-WANT]
Driver = /opt/microsoft/msodbcsql17/lib64/libmsodbcsql-17.5.so.2.1
Server = 123.45.67.890
Port = 1433
User = YOUR-USERNAME
Password = YOUR-PASSWORD
Database = YOUR-DATABASE
Language = us_english
NeedODBCTypesOnly = 1
You will need to set the following parameters:
- The first line is the server name or data source name. You can use this name to tell getML that this is the server you want to connect to.
- The Driver is the location of the ODBC driver you have just downloaded. The location or file name might be different on your system.
- The Server is the IP address of the server. If the server is on the same machine as getML, just write "localhost".
- The Port is the port on which to connect the server. The default port for SQL Server is 1433.
- User and Password are the user name and password that allow access to the server.
- The Database is the database inside the server you want to connect to.
You can now connect getML to the database:
getml.database.connect_odbc(
    server_name="ANY-NAME-YOU-WANT",
    user="YOUR-USERNAME",
    password="YOUR-PASSWORD",
    escape_chars="[]")
Important: Always pass escape_chars
Earlier we mentioned that ODBC drivers are supposed to translate standard SQL queries into the specific SQL dialects, but this requirement is often ignored.
A typical issue is escape characters, which are needed when the names of your schemas, tables, or columns are SQL keywords, such as the table named ORDER in the loans dataset. To avoid this problem, you can wrap schema, table, and column names in escape characters:
SELECT "some_column" FROM "SOME_SCHEMA"."SOME_TABLE";
The SQL standard requires that the quotation mark (") be used as the escape character. However, many SQL dialects do not follow this requirement. For example, SQL Server uses square brackets:
SELECT [some_column] FROM [SOME_SCHEMA].[SOME_TABLE];
MySQL and MariaDB, on the other hand, use the backtick:
SELECT `some_column` FROM `SOME_SCHEMA`.`SOME_TABLE`;
This is why you should always pass the appropriate escape_chars when calling connect_odbc().
Import data using ODBC
By selecting an existing table from your database in the DataFrame.from_db() class method, you can create a new DataFrame containing all its data. Alternatively, you can use the read_db() and read_query() methods to replace the content of the current DataFrame instance or append further rows based on either a table or a specific query.
Export data using ODBC
You can also write your results back into the database using ODBC. When you provide a name for the destination table in transform(), the features generated from your raw data will be written back. Passing a table name into predict() generates predictions of the target variables for new, unseen data and stores the result in the corresponding table.