Tracking with MLflow
The MLflow integration in getML provides a seamless way to track projects, pipelines, and parameters throughout the machine learning lifecycle. By automatically logging pipeline operations, model metrics, and data characteristics, this integration enhances transparency, reproducibility, and collaboration in your data science projects.
Overview
MLflow is an open-source platform designed to manage the machine learning lifecycle, including experimentation, reproducibility, and deployment. When integrated with getML, MLflow automatically captures important information about:
- Pipeline creation and operations (fit, score, predict, transform)
- Project setting and switching
- Metadata about the datasets
- Pipeline parameters and performance metrics
This integration provides a clear visual representation of your getML workflows and makes it easier to compare different pipelines and approaches.
MLflow UI Overview - Shows the main UI with MLflow experiments and runs
Note
getML projects correspond to MLflow experiments, pipelines correspond to runs, and pipeline methods (functions) correspond to sub-runs.
Setup
Setting up MLflow with getML requires minimal configuration. The integration is provided through the getml-mlflow package.
How to install
You can install getml-mlflow using pip:
pip install getml-mlflow
You can also install directly from GitHub if you need the latest development version:
pip install "git+ssh://git@github.com/getml/getml-mlflow.git"
Running the MLflow Server
To visualize your experiments, you need to run the MLflow server with its browser UI:
mlflow ui
Once the server is running, you can access the MLflow UI in your browser at:
http://localhost:5000
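If port 5000 is already in use, you can run the UI on another port with the --port flag and later point tracking_uri at it (see the configuration examples below). A minimal sketch, assuming port 5001 is free:
mlflow ui --port 5001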
Basic Setup
To enable MLflow logging with default settings:
import getml
import getml_mlflow
# Enable MLflow tracking with default settings
getml_mlflow.autolog()
# Subsequent getML operations will be logged to MLflow
Custom Configuration
For more control over what is logged, you can customize the autolog() function:
getml_mlflow.autolog(
    log_pipeline_as_artifact=True,
    log_system_metrics=False,
    tracking_uri="http://localhost:5000"
)
Default Settings
When you use getml_mlflow.autolog() without specifying any parameters, the following default settings are applied:
- Information about container dataframes is logged (rows, columns, roles, etc.)
- Pipeline parameters, tags, scores, features, and columns are logged
- DataFrames passed to or returned by functions are saved as artifacts
- Pipelines are saved as artifacts
- System metrics (CPU and memory usage) are logged during pipeline fitting
- MLflow tracking server is set to http://localhost:5000
Autolog Parameters
The autolog() function provides a wide range of parameters to customize what gets logged to MLflow; a combined example follows the tables below:
Data Container Logging
Parameter | Default | Description |
---|---|---|
log_data_information | True | Logs metadata about dataframes (rows, columns, roles) |
log_data_as_artifact | True | Saves DataFrames as parquet files in MLflow |
Function Logging
Parameter | Default | Description |
---|---|---|
log_function_parameters | True | Logs parameters passed to getML functions |
log_function_return | True | Logs return values of getML functions |
log_function_as_trace | True | Logs function calls as traces for detailed execution flow |
Pipeline Logging
Parameter | Default | Description |
---|---|---|
log_pipeline_parameters | True | Logs pipeline parameters |
log_pipeline_tags | True | Logs pipeline tags |
log_pipeline_scores | True | Logs pipeline metrics/scores |
log_pipeline_features | True | Logs features learned during pipeline fitting |
log_pipeline_columns | True | Logs columns whose importance can be calculated |
log_pipeline_targets | True | Logs pipeline targets |
log_pipeline_data_model | True | Logs the data model as HTML |
log_pipeline_as_artifact | True | Saves pipelines as MLflow artifacts |
System and Environment
Parameter | Default | Description |
---|---|---|
log_system_metrics | True | Logs system metrics (CPU, memory) during pipeline fitting |
disable | False | Disables all getML autologging if set to True |
silent | False | Suppresses all logging messages if set to True |
create_runs | True | Creates new MLflow runs automatically when logging |
extra_tags | None | Additional custom tags to log with each run |
getml_project_path | None | Path to the getML projects directory (defaults to $HOME/.getML/projects) |
tracking_uri | "http://localhost:5000" | MLflow tracking server URI |
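As an illustration, the sketch below combines several of the parameters above to keep lightweight logging while skipping the heavier artifacts; the particular combination and the tag values are only examples:
getml_mlflow.autolog(
    log_data_as_artifact=False,          # do not upload DataFrames as Parquet files
    log_pipeline_as_artifact=False,      # do not upload fitted pipelines
    log_system_metrics=False,            # skip CPU/memory sampling during fitting
    extra_tags={"team": "forecasting"},  # example custom tags attached to every run
    tracking_uri="http://localhost:5000",
)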
Logging Data Sets, Pipeline Tags, Features, and Metrics
Data Sets
When log_data_information is enabled, MLflow captures metadata about the dataframes in your containers, including:
- Number of rows and columns
- Column names and types
- Role assignments (target, categorical, numerical, etc.)
This metadata appears in the MLflow UI, providing insights into your dataset.
Container DataFrames Metadata - Shows rows & columns information with emojis for roles of columns
As seen in the UI above, emojis are used to visually distinguish the different column roles in the dataframes:
Emoji | Description |
---|---|
🗃 | Categorical columns |
🔗 | Join keys |
🔢 | Numerical columns |
🎯 | Target column(s) |
📝 | Text columns |
⏰ | Timestamp columns |
🧮 | Unused float columns |
🧵 | Unused string columns |
If log_data_as_artifact is also enabled, the actual DataFrames are saved as Parquet files, which you can download from the MLflow UI.
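Besides downloading through the UI, you can also fetch a logged DataFrame programmatically with MLflow's artifact API. The run ID and artifact path below are placeholders; use the values shown in the run's Artifacts tab:
import mlflow
import pandas as pd

mlflow.set_tracking_uri("http://localhost:5000")

# "population_train.parquet" is a placeholder - copy the actual path from the Artifacts tab
local_path = mlflow.artifacts.download_artifacts(
    run_id="2960ee40202744daa64aa83d180f0b2f",
    artifact_path="population_train.parquet",
)
df = pd.read_parquet(local_path)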
Pipeline Tags
Pipeline tags are key-value pairs that help classify and organize your pipelines. When log_pipeline_tags is enabled, these tags are automatically logged to MLflow, making it easier to filter and search for specific pipelines (an example search follows the snippet below).
# Tags will be automatically logged to MLflow
pipe = getml.pipeline.Pipeline(
    ...,
    tags={"model_type": "churn_prediction", "version": "1.0"}
)
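Because these tags become ordinary MLflow run tags, you can later filter runs by them, for example with mlflow.search_runs(); the experiment name below is simply the name of your getML project:
import mlflow

mlflow.set_tracking_uri("http://localhost:5000")

# Find all runs tagged as churn-prediction pipelines in the "my_project" experiment
runs = mlflow.search_runs(
    experiment_names=["my_project"],
    filter_string="tags.model_type = 'churn_prediction'",
)
print(runs.head())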
Evaluation Metrics
Performance metrics are crucial for comparing different models. With log_pipeline_scores enabled, all metrics calculated by the score() method are automatically logged to MLflow as metrics.
# Metrics will be automatically logged to MLflow
scores = pipe.score(container.test)
These metrics appear in the MLflow UI, making it easy to compare the performance of different pipeline versions.
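You can also compare logged scores across runs programmatically; the metric name below ("mae") is only an example and should be replaced with whichever score your pipeline actually reports:
import mlflow

mlflow.set_tracking_uri("http://localhost:5000")

# Rank runs in the "my_project" experiment by a logged score ("mae" is an example key)
runs = mlflow.search_runs(
    experiment_names=["my_project"],
    order_by=["metrics.mae ASC"],
)
print(runs.head())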
MLflow in the getML Pipeline Lifecycle
The MLflow integration is designed to capture information at each stage of the getML pipeline lifecycle:
Project Creation and Switching
When you create or switch projects, MLflow creates a corresponding experiment:
# Creates or switches to an MLflow experiment named "my_project"
getml.set_project("my_project")
Pipeline Definition and Fitting
When you define a pipeline, you're setting up its parameters:
# Just defines the pipeline parameters, no MLflow run yet
pipe = getml.pipeline.Pipeline(...)
It's only when you call fit() that the pipeline is actually created on the getML Engine side, gets an ID, and creates an MLflow run.
In addition, during pipeline fitting, MLflow logs the following:
- Parameters used for fitting
- System metrics from getML (if enabled)
- Function execution traces (if enabled)
# This creates a new MLflow run & logs fitting information to MLflow
pipe.fit(container.train)
Scoring and Prediction
When you score or make predictions with a pipeline, MLflow logs:
- Performance metrics
- Output data (if enabled)
# Scores are logged as MLflow metrics
pipe.score(container.test)
# Predictions can be logged as artifacts
predictions = pipe.predict(container.test)
Working with Artifact Pipelines
When log_pipeline_as_artifact is enabled, pipelines are saved as MLflow artifacts and can be accessed from the Artifacts tab of the MLflow UI. Artifact pipelines enable several powerful capabilities:
Downloading Artifact Pipelines
You can download an artifact pipeline from an MLflow run into a new getML project using the download_artifact_pipeline() function:
import getml
import mlflow
from mlflow.tracking import MlflowClient
import getml_mlflow
# Initialize MLflow client
client = MlflowClient("http://localhost:5000")
run_id = "2960ee40202744daa64aa83d180f0b2f"
pipeline_id = "uPe3hR"
original_project_name = "interstate94"
# Downloads the pipeline into a new project named "original_project_name-pipeline_id"
new_project, pipeline_id = getml_mlflow.marshalling.pipeline.download_artifact_pipeline(
    client,
    run_id,
    pipeline_id,
    original_project_name=original_project_name
)
The new project is named after the original project name and the pipeline ID (e.g., "interstate94-uPe3hR") and contains the downloaded pipeline. Keeping it separate lets you work with the pipeline without affecting your current project.
If the project already exists (e.g., when calling this function multiple times with the same parameters), the existing project will be overwritten with the downloaded artifacts.
Experimental feature
The function download_artifact_pipeline() is experimental and may change in future releases.
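Once the download has finished, you can switch to the new project and load the pipeline by the returned ID. This is a sketch of the manual two-step approach; it assumes getml.pipeline.load() for loading a pipeline by its ID:
# Switch to the newly created project and load the downloaded pipeline
getml.set_project(new_project)
pipe = getml.pipeline.load(pipeline_id)

# The pipeline can then be used as usual
predictions = pipe.predict(container.test)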
Switching to Artifact Pipelines
For convenience, you can also directly download the pipeline, switch to the new project, and load the pipeline in a single step with the switch_to_artifact_pipeline() function:
import mlflow
from mlflow.tracking import MlflowClient
import getml_mlflow
import getml
# Connect to MLflow
client = MlflowClient("http://localhost:5000")
# Download pipeline from run, switch to new project, and load the pipeline
pipeline = getml_mlflow.marshalling.pipeline.switch_to_artifact_pipeline(
    client,
    "2960ee40202744daa64aa83d180f0b2f",
    "uPe3hR"
)
# Pipeline is ready to use immediately
predictions = pipeline.predict(container.test)
This function combines downloading the pipeline artifact, switching to the newly created project, and loading the pipeline for immediate use. It's particularly useful when you want to quickly retrieve and use a pipeline that was logged in a previous run.
Experimental feature
The function switch_to_artifact_pipeline() is experimental and may change in future releases.
Docker Configuration for Artifact Logging
To use pipeline artifact logging when running getML Engine in a Docker container, you need to create a bind mount from the host machine into the container.
In your docker-compose.yml file:
services:
  getml:
    # ... other configuration ...
    volumes:
      - $HOME/.getML:/home/getml
Be sure to remove any existing named volume configuration like:
volumes:
  - getml:/home/getml/

volumes:
  getml:
    external: false
If you've mounted a directory other than $HOME/.getML from the host, specify it with the getml_project_path parameter when calling autolog():
import getml_mlflow
from pathlib import Path
# Specify custom project path matching your Docker volume mount
getml_mlflow.autolog(
    log_pipeline_as_artifact=True,
    getml_project_path=Path("/custom/path/to/projects")
)
Running Your Own MLflow Runs
By default, getml_mlflow.autolog() creates new MLflow runs automatically for each pipeline operation. If you want to manage runs yourself, you can disable this behavior:
# Disable automatic run creation
getml_mlflow.autolog(create_runs=False)
# Start your own run
import mlflow
mlflow.set_tracking_uri("http://localhost:5000")
mlflow.set_experiment("your_experiment_name")
with mlflow.start_run(run_name="custom_pipeline_run"):
    # All getML operations in this block will be logged to this run
    pipe = getml.pipeline.Pipeline(...)
    pipe.fit(container.train)
    pipe.score(container.test)
This approach gives you more control over how runs are organized in the MLflow UI.
Managing MLflow Experiments
Experiment Lifecycle
In MLflow, experiments provide a way to organize your runs. When using getML with MLflow integration:
- Each getML project corresponds to an MLflow experiment
- MLflow experiments can be viewed, renamed, or deleted through the UI
- When you delete an experiment in the UI, it's moved to a trash folder rather than permanently deleted
Handling Deleted Experiments
When you delete an experiment in the MLflow UI, it's moved to a trash folder but remains in the system. This can cause issues if you try to create a new experiment with the same name, as you might encounter an error like:
RestException: RESOURCE_ALREADY_EXISTS: Experiment 'experiment_name' already exists in deleted state.
To permanently delete an experiment and reuse its name:
- Delete the experiment from the trash folder:
  rm -rf mlruns/.trash/experiment_id/
- Run MLflow garbage collection:
  mlflow gc
If you're using a custom tracking URI:
MLFLOW_TRACKING_URI="http://localhost:5000" mlflow gc
After these steps, you'll be able to create a new experiment with the same name.
Conclusion
The MLflow integration in getML provides a powerful way to track, compare, and reproduce your machine learning projects. By automatically capturing key information at each stage of the pipeline lifecycle, it enhances transparency and collaboration while minimizing the overhead of manual tracking.
Whether you're working on a small personal project or collaborating in a large team, the MLflow integration helps ensure that your machine learning workflows are well-documented, reproducible, and easy to understand.