Tracking with MLflow
The MLflow integration in getML provides a seamless way to track projects, pipelines, and parameters throughout the machine learning lifecycle. By automatically logging pipeline operations, model metrics, and data characteristics, this integration enhances transparency, reproducibility, and collaboration in your data science projects.
Overview
MLflow is an open-source platform designed to manage the machine learning lifecycle, including experimentation, reproducibility, and deployment. When integrated with getML, MLflow automatically captures important information about:
- Pipeline creation and operations (fit, score, predict, transform)
- Project setting and switching
- Metadata about the datasets
- Pipeline parameters and performance metrics
This integration provides a clear visual representation of your getML workflows and makes it easier to compare different pipelines and approaches.
MLflow UI Overview - Shows the main UI with MLflow experiments and runs
Note
getML projects correspond to MLflow experiments, pipelines correspond to runs, and pipeline methods (functions) correspond to sub-runs.
Setup
Setting up MLflow with getML requires minimal configuration. The integration is provided through the getml-mlflow package.
How to install
You can install getml-mlflow using pip:
pip install getml-mlflow
You can also install directly from GitHub if you need the latest development version:
pip install "git+ssh://git@github.com/getml/getml-mlflow.git"
Running the MLflow Server
To visualize your experiments, you need to run the MLflow server with its browser UI:
mlflow ui
Once the server is running, you can access the MLflow UI in your browser at:
http://localhost:5000
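If port 5000 is already in use, you can run the UI on another port with the --port flag and later point tracking_uri at it (see the configuration examples below). A minimal sketch, assuming port 5001 is free:
mlflow ui --port 5001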
Basic Setup
To enable MLflow logging with default settings:
import getml
import getml_mlflow
# Enable MLflow tracking with default settings
getml_mlflow.autolog()
# Subsequent getML operations will be logged to MLflow
Custom Configuration
For more control over what is logged, you can customize the autolog() function:
getml_mlflow.autolog(
    log_pipeline_as_artifact=True,
    log_system_metrics=False,
    tracking_uri="http://localhost:5000"
)
Default Settings
When you use getml_mlflow.autolog() without specifying any parameters, the following default settings are applied:
- Information about container dataframes is logged (rows, columns, roles, etc.)
- Pipeline parameters, tags, scores, features, and columns are logged
- DataFrames passed to or returned by functions are saved as artifacts
- Pipelines are saved as artifacts
- System metrics (CPU and memory usage) are logged during pipeline fitting
- MLflow tracking server is set to http://localhost:5000
Autolog Parameters
The autolog() function provides a wide range of parameters to customize what gets logged to MLflow; a combined example follows the tables below:
Data Container Logging
Parameter | Default | Description |
---|---|---|
log_data_information | True | Logs metadata about dataframes (rows, columns, roles) |
log_data_as_artifact | True | Saves DataFrames as parquet files in MLflow |
Function Logging
Parameter | Default | Description |
---|---|---|
log_function_parameters | True | Logs parameters passed to getML functions |
log_function_return | True | Logs return values of getML functions |
log_function_as_trace | True | Logs function calls as traces for detailed execution flow |
Pipeline Logging
Parameter | Default | Description |
---|---|---|
log_pipeline_parameters | True | Logs pipeline parameters |
log_pipeline_tags | True | Logs pipeline tags |
log_pipeline_scores | True | Logs pipeline metrics/scores |
log_pipeline_features | True | Logs features learned during pipeline fitting |
log_pipeline_columns | True | Logs columns whose importance can be calculated |
log_pipeline_targets | True | Logs pipeline targets |
log_pipeline_data_model | True | Logs the data model as HTML |
log_pipeline_as_artifact | True | Saves pipelines as MLflow artifacts |
System and Environment
Parameter | Default | Description |
---|---|---|
log_system_metrics | True | Logs system metrics (CPU, memory) during pipeline fitting |
disable | False | Disables all getML autologging if set to True |
silent | False | Suppresses all logging messages if set to True |
create_runs | True | Creates new MLflow runs automatically when logging |
extra_tags | None | Additional custom tags to log with each run |
getml_project_path | None | Path to the getML projects directory (defaults to $HOME/.getML/projects) |
tracking_uri | "http://localhost:5000" | MLflow tracking server URI |
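As an illustration, the sketch below combines several of the parameters above to keep lightweight logging while skipping the heavier artifacts; the particular combination and the tag values are only examples:
getml_mlflow.autolog(
    log_data_as_artifact=False,          # do not upload DataFrames as Parquet files
    log_pipeline_as_artifact=False,      # do not upload fitted pipelines
    log_system_metrics=False,            # skip CPU/memory sampling during fitting
    extra_tags={"team": "forecasting"},  # example custom tags attached to every run
    tracking_uri="http://localhost:5000",
)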
Logging Data Sets, Pipeline Tags, Features, and Metrics
Data Sets
When log_data_information is enabled, MLflow captures metadata about the dataframes in your containers, including:
- Number of rows and columns
- Column names and types
- Role assignments (target, categorical, numerical, etc.)
This metadata appears in the MLflow UI, providing insights into your dataset.
Container DataFrames Metadata - Shows rows & columns information with emojis for roles of columns
As seen in the UI above, emojis are used to visually distinguish the different column roles in the dataframes:
Emoji | Description |
---|---|
🗃 | Categorical columns |
🔗 | Join keys |
🔢 | Numerical columns |
🎯 | Target column(s) |
📝 | Text columns |
⏰ | Timestamp columns |
🧮 | Unused float columns |
🧵 | Unused string columns |
If log_data_as_artifact is also enabled, the actual DataFrames are saved as Parquet files, which you can download from the MLflow UI.
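Besides downloading through the UI, you can also fetch a logged DataFrame programmatically with MLflow's artifact API. The run ID and artifact path below are placeholders; use the values shown in the run's Artifacts tab:
import mlflow
import pandas as pd

mlflow.set_tracking_uri("http://localhost:5000")

# "population_train.parquet" is a placeholder - copy the actual path from the Artifacts tab
local_path = mlflow.artifacts.download_artifacts(
    run_id="2960ee40202744daa64aa83d180f0b2f",
    artifact_path="population_train.parquet",
)
df = pd.read_parquet(local_path)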
Pipeline Tags
Pipeline tags are key-value pairs that help classify and organize your pipelines. When log_pipeline_tags is enabled, these tags are automatically logged to MLflow, making it easier to filter and search for specific pipelines (an example search follows the snippet below).
# Tags will be automatically logged to MLflow
pipe = getml.pipeline.Pipeline(
    ...,
    tags={"model_type": "churn_prediction", "version": "1.0"}
)
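Because these tags become ordinary MLflow run tags, you can later filter runs by them, for example with mlflow.search_runs(); the experiment name below is simply the name of your getML project:
import mlflow

mlflow.set_tracking_uri("http://localhost:5000")

# Find all runs tagged as churn-prediction pipelines in the "my_project" experiment
runs = mlflow.search_runs(
    experiment_names=["my_project"],
    filter_string="tags.model_type = 'churn_prediction'",
)
print(runs.head())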
Evaluation Metrics
Performance metrics are crucial for comparing different models. With log_pipeline_scores enabled, all metrics calculated by the score() method are automatically logged to MLflow as metrics.
# Metrics will be automatically logged to MLflow
scores = pipe.score(container.test)
These metrics appear in the MLflow UI, making it easy to compare the performance of different pipeline versions.
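You can also compare logged scores across runs programmatically; the metric name below ("mae") is only an example and should be replaced with whichever score your pipeline actually reports:
import mlflow

mlflow.set_tracking_uri("http://localhost:5000")

# Rank runs in the "my_project" experiment by a logged score ("mae" is an example key)
runs = mlflow.search_runs(
    experiment_names=["my_project"],
    order_by=["metrics.mae ASC"],
)
print(runs.head())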
MLflow in the getML Pipeline Lifecycle
The MLflow integration is designed to capture information at each stage of the getML pipeline lifecycle:
Project Creation and Switching
When you create or switch projects, MLflow creates a corresponding experiment:
# Creates or switches to an MLflow experiment named "my_project"
getml.set_project("my_project")
Pipeline Definition and Fitting
When you define a pipeline, you're setting up its parameters:
# Just defines the pipeline parameters, no MLflow run yet
pipe = getml.pipeline.Pipeline(...)
It's only when you call fit() that the pipeline is actually created on the getML Engine side, gets an ID, and creates an MLflow run.
In addition, during pipeline fitting, MLflow logs the following:
- Parameters used for fitting
- System metrics from getML (if enabled)
- Function execution traces (if enabled)
# This creates a new MLflow run & logs fitting information to MLflow
pipe.fit(container.train)
Scoring and Prediction
When you score or make predictions with a pipeline, MLflow logs:
- Performance metrics
- Output data (if enabled)
# Scores are logged as MLflow metrics
pipe.score(container.test)
# Predictions can be logged as artifacts
predictions = pipe.predict(container.test)
Working with Artifact Pipelines
When log_pipeline_as_artifact is enabled, pipelines are saved as MLflow artifacts and can be accessed from the Artifacts tab of the MLflow UI. Artifact pipelines enable several powerful capabilities:
Downloading Artifact Pipelines
You can download an artifact pipeline from an MLflow run into a new getML project using the download_artifact_pipeline() function:
import getml
import mlflow
from mlflow.tracking import MlflowClient
import getml_mlflow
# Initialize MLflow client
client = MlflowClient("http://localhost:5000")
run_id = "2960ee40202744daa64aa83d180f0b2f"
pipeline_id = "uPe3hR"
original_project_name = "interstate94"
# Downloads the pipeline into a new project named "original_project_name-pipeline_id"
new_project, pipeline_id = getml_mlflow.marshalling.pipeline.download_artifact_pipeline(
    client,
    run_id,
    pipeline_id,
    original_project_name=original_project_name
)
The new project is named after the original project name and the pipeline ID (e.g., "interstate94-uPe3hR") and contains the downloaded pipeline. Keeping it separate lets you work with the pipeline without affecting your current project.
If the project already exists (e.g., when calling this function multiple times with the same parameters), the existing project will be overwritten with the downloaded artifacts.
Experimental feature
The function download_artifact_pipeline() is experimental and may change in future releases.
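Once the download has finished, you can switch to the new project and load the pipeline by the returned ID. This is a sketch of the manual two-step approach; it assumes getml.pipeline.load() for loading a pipeline by its ID:
# Switch to the newly created project and load the downloaded pipeline
getml.set_project(new_project)
pipe = getml.pipeline.load(pipeline_id)

# The pipeline can then be used as usual
predictions = pipe.predict(container.test)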
Switching to Artifact Pipelines
For convenience, you can also directly download the pipeline, switch to the new project, and load the pipeline in a single step with the switch_to_artifact_pipeline() function:
import mlflow
from mlflow.tracking import MlflowClient
import getml_mlflow
import getml
# Connect to MLflow
client = MlflowClient("http://localhost:5000")
# Download pipeline from run, switch to new project, and load the pipeline
pipeline = getml_mlflow.marshalling.pipeline.switch_to_artifact_pipeline(
    client,
    "2960ee40202744daa64aa83d180f0b2f",
    "uPe3hR"
)
# Pipeline is ready to use immediately
predictions = pipeline.predict(container.test)
This function combines downloading the pipeline artifact, switching to the newly created project, and loading the pipeline for immediate use. It's particularly useful when you want to quickly retrieve and use a pipeline that was logged in a previous run.
Experimental feature
The function switch_to_artifact_pipeline() is experimental and may change in future releases.
Docker Configuration for Artifact Logging
To use pipeline artifact logging when running getML Engine in a Docker container, you need to create a bind mount from the host machine into the container.
In your docker-compose.yml file:
services:
  getml:
    # ... other configuration ...
    volumes:
      - $HOME/.getML:/home/getml
Be sure to remove any existing named volume configuration like:
volumes:
  - getml:/home/getml/

volumes:
  getml:
    external: false
If you've mounted a directory other than $HOME/.getML from the host, specify it with the getml_project_path parameter when calling autolog():
import getml_mlflow
from pathlib import Path
# Specify custom project path matching your Docker volume mount
getml_mlflow.autolog(
    log_pipeline_as_artifact=True,
    getml_project_path=Path("/custom/path/to/projects")
)
Running Your Own MLflow Runs
By default, getml_mlflow.autolog() creates new MLflow runs automatically for each pipeline operation. If you want to manage runs yourself, you can disable this behavior:
# Disable automatic run creation
getml_mlflow.autolog(create_runs=False)
# Start your own run
import mlflow
mlflow.set_tracking_uri("http://localhost:5000")
mlflow.set_experiment("your_experiment_name")
with mlflow.start_run(run_name="custom_pipeline_run"):
    # All getML operations in this block will be logged to this run
    pipe = getml.pipeline.Pipeline(...)
    pipe.fit(container.train)
    pipe.score(container.test)
This approach gives you more control over how runs are organized in the MLflow UI.
Managing MLflow Experiments
Experiment Lifecycle
In MLflow, experiments provide a way to organize your runs. When using getML with MLflow integration:
- Each getML project corresponds to an MLflow experiment
- MLflow experiments can be viewed, renamed, or deleted through the UI
- When you delete an experiment in the UI, it's moved to a trash folder rather than permanently deleted
Handling Deleted Experiments
When you delete an experiment in the MLflow UI, it's moved to a trash folder but remains in the system. This can cause issues if you try to create a new experiment with the same name, as you might encounter an error like:
RestException: RESOURCE_ALREADY_EXISTS: Experiment 'experiment_name' already exists in deleted state.
To permanently delete an experiment and reuse its name:
- Delete the experiment from the trash folder:
  rm -rf mlruns/.trash/experiment_id/
- Run MLflow garbage collection:
  mlflow gc
If you're using a custom tracking URI:
MLFLOW_TRACKING_URI="http://localhost:5000" mlflow gc
After these steps, you'll be able to create a new experiment with the same name.
Conclusion
The MLflow integration in getML provides a powerful way to track, compare, and reproduce your machine learning projects. By automatically capturing key information at each stage of the pipeline lifecycle, it enhances transparency and collaboration while minimizing the overhead of manual tracking.
Whether you're working on a small personal project or collaborating in a large team, the MLflow integration helps ensure that your machine learning workflows are well-documented, reproducible, and easy to understand.