The getML Suite
The getML ecosystem comprises three fundamental components: the Engine, the Python API, and the Monitor.
Engine
Written in C++, the getML Engine is the core of the Suite and does all the heavy lifting. It is responsible for data management, feature engineering, and machine learning.
Starting the Engine
Depending on how you installed the getML Suite, start the Engine with one of the following:
- In the Python API (pip-based installation):
getml.engine.launch()
- In the terminal (CLI-based installation):
./getML
- In the terminal (Docker-based installation):
docker compose up
Follow the links to learn more about each method.
Shutting down the Engine
Depending on how you started the Engine, there are different ways to shut it down:
- In the Python API:
getml.engine.shutdown()
- On the command-line interface (CLI): press Ctrl-C or run getML -stop
- For a Docker container: press Ctrl-C
- In the Monitor (Enterprise edition): click the 'Shutdown' tab in the sidebar
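For the pip-based installation, the launch and shutdown calls above can be paired so that the Engine is always stopped when a script finishes. The following is a minimal sketch, not part of the getML API: engine_session is a hypothetical helper name, and it assumes the getml package is installed and that getml.engine.launch() and getml.engine.shutdown() behave as described in this section.

```python
from contextlib import contextmanager


@contextmanager
def engine_session():
    """Launch the getML Engine and shut it down when the block exits.

    Hypothetical helper for illustration; only launch() and shutdown()
    come from the getML documentation.
    """
    import getml  # imported lazily; requires the getml package

    getml.engine.launch()
    try:
        yield getml
    finally:
        # Runs even if the body raises, so the Engine is not left running.
        getml.engine.shutdown()
```

Wrapping a script body in "with engine_session() as getml:" keeps the start and stop calls together. Note that unsaved data frames are lost when the Engine stops, so save them inside the block first.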
Logging
The Engine keeps a log about what it is currently doing.
The easiest way to view the log is to click the '<> Log' tab in the sidebar of the getML Monitor. The Engine will also output its log to the command line when it is started using the command-line interface.
Python API
Control the Engine with the getML Python API, which provides handles to the objects in the Engine and all the other tools needed for end-to-end data science projects. For an in-depth read about its individual classes and methods, check out the Python API documentation.
Note
- The classes in the Python API act as handles to objects in the getML Engine.
- When you connect to or create a project:
- The API establishes a socket connection to the Engine through a determined port.
- All subsequent commands are sent to the Engine via this connection.
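The handle pattern described in this note can be illustrated with a toy example. This is only a sketch under stated assumptions: the JSON line protocol, the toy_engine function, and the ToyHandle class are invented for illustration and do not reflect getML's actual wire format or classes. It shows the general idea that the data lives on the Engine side while the Python object only holds a connection and a cached view.

```python
import json
import socket
import threading


def toy_engine(conn):
    """Stand-in for the Engine: owns the data and answers JSON commands."""
    columns = {}  # the actual state lives on the Engine side only
    with conn, conn.makefile("r") as reader, conn.makefile("w") as writer:
        for line in reader:  # one JSON command per line
            cmd = json.loads(line)
            if cmd["type"] == "add_column":
                columns[cmd["name"]] = cmd["values"]
            # Reply with a summary so the handle can update its view.
            writer.write(json.dumps({"columns": sorted(columns)}) + "\n")
            writer.flush()


class ToyHandle:
    """Client-side handle: holds a connection and a cached view, not the data."""

    def __init__(self, conn):
        self._conn = conn
        self._reader = conn.makefile("r")
        self._writer = conn.makefile("w")
        self.columns = []

    def add(self, values, name):
        cmd = {"type": "add_column", "name": name, "values": values}
        self._writer.write(json.dumps(cmd) + "\n")
        self._writer.flush()
        reply = json.loads(self._reader.readline())
        self.columns = reply["columns"]  # refresh the cached view

    def close(self):
        self._writer.close()
        self._reader.close()
        self._conn.close()


def demo():
    api_side, engine_side = socket.socketpair()
    worker = threading.Thread(target=toy_engine, args=(engine_side,), daemon=True)
    worker.start()
    handle = ToyHandle(api_side)
    handle.add([1, 2, 3], "a")
    handle.add([4, 5, 6], "b")
    handle.close()  # closing the connection ends the toy engine loop
    worker.join(timeout=5)
    return handle.columns


if __name__ == "__main__":
    print(demo())  # → ['a', 'b']
```

The handle never stores the column values itself; it only learns which columns exist from the reply, which mirrors the division of labor described above.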
Set up a new project
Set a project in the getML Engine using set_project().
import getml
getml.engine.launch()
getml.engine.set_project("test")
Note
If the project name does not match an existing project, a new one will be created.
Managing projects
To get a list of all available projects, use list_projects().
To remove an entire project, use delete_project().
getml.engine.list_projects()
getml.engine.delete_project("test")
For more information, refer to the Managing projects section.
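Because set_project() silently creates a project when the name is unknown, a typo in the name can produce an empty project. The sketch below checks the name against list_projects() first. It assumes a running Engine and that list_projects() returns a collection of project name strings; switch_project is a hypothetical helper name, not part of the getML API.

```python
def switch_project(name):
    """Set `name` as the current project; return True if it already existed.

    Hypothetical helper: assumes list_projects() yields project names
    as strings and that the Engine is already running.
    """
    import getml  # imported lazily; requires the getml package

    existed = name in getml.engine.list_projects()
    getml.engine.set_project(name)  # creates the project if it is new
    return existed
```

A caller can use the return value to log or warn when a new project was created unexpectedly.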
DataFrames
Create a DataFrame by calling, for example:
data = getml.data.DataFrame.from_csv(
"path/to/my/data.csv",
"my_data"
)
This creates a data frame object in the getML Engine, imports the provided data, and returns a handle to the object as a DataFrame in the Python API.
Note
There are many other methods to create a DataFrame, including from_db(), from_json(), or from_pandas(). For a full list of available methods, refer to the Importing data section.
Synchronization
When you apply any method, like add(), the changes will be automatically reflected in both the Engine and Python. Under the hood, the Python API sends a command to create a new column to the getML Engine. The moment the Engine is done, it informs the Python API and the latter triggers the refresh() method to update the Python handle.
Saving
Warning
DataFrames are never saved automatically and never loaded automatically. All unsaved changes to a DataFrame will be lost when restarting the Engine.
To get a list of all your current data frames, access the container via:
getml.project.data_frames
# or
getml.data.list_data_frames()
You can save a specific data frame to disk using the .save() method on the DataFrame:
# by index
getml.project.data_frames[0].save()
# by name
getml.project.data_frames["my_data"].save()
To save all data frames associated with the current project, use the .save() method on the Container:
getml.project.data_frames.save()
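Since nothing is saved automatically, one way to avoid losing work is to save every data frame immediately before stopping the Engine. The following is a small sketch that combines two calls shown in this document; save_and_shutdown is a hypothetical helper name, and it assumes a project is set and the Engine is running.

```python
def save_and_shutdown():
    """Persist all data frames of the current project, then stop the Engine.

    Hypothetical helper: it only chains getml.project.data_frames.save()
    and getml.engine.shutdown(), both taken from this document.
    """
    import getml  # imported lazily; requires the getml package

    # Save first: unsaved changes would be lost once the Engine stops.
    getml.project.data_frames.save()
    getml.engine.shutdown()
```

The order matters here: calling shutdown() first would discard the in-memory data frames before they could be written to disk.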
Loading
To load a specific DataFrame, use load_data_frame() or DataFrame().load():
df = getml.data.load_data_frame("my_data")
# Forces the API to load the version stored on disk over the one held in memory
df = getml.data.DataFrame("my_data").load()
Use .load() on the Container to load all data frames associated with the current project:
getml.project.data_frames.load()
Note
If a DataFrame is already available in memory (for example "my_data" from above), load_data_frame() will return a handle to that data frame. If no such DataFrame is held in memory, the function will try to load the data frame from disk and then return a handle. If that is unsuccessful, an exception is thrown.
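The lookup order in this note can be written out as a short toy function. This is an illustration of the described behavior only, not getML's implementation: the two dictionaries stand in for the Engine's memory and the project folder on disk, and KeyError is a placeholder because the text does not name the exception type.

```python
def load_data_frame_sketch(name, in_memory, on_disk):
    """Mimic the described lookup order using two plain dicts.

    Toy stand-in: `in_memory` and `on_disk` are hypothetical substitutes
    for the Engine's state and the saved project data.
    """
    if name in in_memory:
        # Already loaded: return the in-memory version untouched.
        return in_memory[name]
    if name in on_disk:
        # Not in memory: load from disk first, then return it.
        in_memory[name] = on_disk[name]
        return in_memory[name]
    # Neither location has it; the real exception type is not specified.
    raise KeyError(f"No data frame named {name!r} in memory or on disk")
```

The first branch is the reason the in-memory version wins over a different version on disk, which is what DataFrame().load() is meant to override.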
Pipelines
The lifecycle of a Pipeline is straightforward and streamlined by the getML Engine, which automatically saves all changes made to a pipeline and loads all pipelines within a project. Pipelines are created within the Python API using constructors, where they are defined by a set of hyperparameters.
Note
The actual weights of the machine learning algorithms are stored exclusively in the getML Engine and are not transferred to the Python API.
Any changes made through methods such as fit() are automatically updated in both the Engine and the Python API.
By using set_project(), you can load an existing project, and all associated pipelines will be automatically loaded into memory. To view all pipelines in the current project, access the Pipelines container via getml.project.pipelines.
The function list_pipelines() lists all available pipelines within a project:
getml.pipeline.list_pipelines()
To create a corresponding handle in the Python API, use the load() function:
pipe = getml.pipeline.load(NAME_OF_THE_PIPELINE)
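The two calls above can be combined to obtain handles for every pipeline in the current project at once. This is a minimal sketch, assuming that list_pipelines() returns the names that load() accepts; load_all_pipelines is a hypothetical helper name, and a running Engine with a project set is required.

```python
def load_all_pipelines():
    """Return a dict mapping each pipeline name to its Python handle.

    Hypothetical helper: assumes every entry returned by
    list_pipelines() can be passed directly to load().
    """
    import getml  # imported lazily; requires the getml package

    return {
        name: getml.pipeline.load(name)
        for name in getml.pipeline.list_pipelines()
    }
```

Only handles are created here; as noted above, the trained weights stay in the Engine.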
Monitor
Enterprise edition
This feature is exclusive to the Enterprise edition and is not available in the Community edition. Discover the benefits of the Enterprise edition and compare the features of the two editions.
For licensing information and technical support, please contact us.
The Monitor provides information on the data imported into the Engine, as well as on the trained pipelines and their performance. It is written in Go and compiled into a binary separate from the getML Engine.
Accessing the Monitor
The Monitor runs on the same machine as the Engine and communicates with it over sockets. By default, it opens an HTTP port (1709) for browser access. To view the Monitor, enter the following address in your browser's navigation bar:
http://localhost:1709
Please note, the HTTP port is only accessible from within the host machine running the getML Suite.
The main purpose of the Monitor is to provide visual feedback to support your data science projects.
Tip
If you experience issues opening the Monitor, try the following steps:
- Manually shut down and restart the Engine using getml.engine.shutdown() and getml.engine.launch().
- Kill the associated background process in the terminal and restart the Engine.
- Close all tabs and windows where the Monitor was previously running and try again.
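Before working through these steps, it can help to check whether anything is listening on the Monitor's default port at all. The sketch below uses only the Python standard library; monitor_reachable is a hypothetical helper name, and the default port 1709 is taken from the text above. A successful connection only shows that some process accepts TCP connections on that port, not that it is the getML Monitor.

```python
import socket


def monitor_reachable(host="localhost", port=1709, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds.

    Hypothetical helper: a True result means something is listening
    on the port, not necessarily the getML Monitor.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and resolution errors.
        return False
```

If this returns False on the machine running the getML Suite, restarting the Engine as described in the first step is a reasonable next move.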
To get started, head over to the installation instructions.