getML on Vertex AI¶
Overview¶
This tutorial demonstrates how to use the Vertex AI SDK and the gcloud CLI to build and deploy custom containers for training and prediction of getML models.
Dataset¶
This tutorial uses the financial dataset from the CTU Prague Relational Learning Repository. It consists of multiple tables containing various features related to bank customers and their transaction histories. The target variable is whether a customer defaults on a loan.
Note: This notebook is based on Predicting the loan default risk of Czech bank customers using getML. Check it out first if you want to know more about the dataset and getML in general.
Objective¶
The goal of this tutorial is to:
- Train a getML model using relational data from multiple tables.
- Save the trained model and its serialized pre-processor.
- Build a custom getML serving container with custom prediction logic using the Custom Prediction Routine feature in the Vertex AI SDK.
- Test the built container locally.
- Upload and deploy the custom container to Vertex AI Predictions.
Note: This tutorial focuses more on deploying getML models with Vertex AI than on the design of the model itself.
Costs¶
This tutorial involves the use of billable components of Google Cloud:
- Vertex AI
- Google Cloud Storage
- Artifact Registry
TIP: Check out Vertex AI pricing, and use the Pricing Calculator to generate a cost estimate based on your projected usage.
Before you begin¶
Note: If you are running this notebook on Vertex AI Workbench, your environment already meets most requirements.
However, you need to add the storage.admin role to the *compute@developer.gserviceaccount.com service account that is assigned to this notebook by default. See the step VertexAI Workbench: Adding Role to Service Account below.
If you run this notebook locally, please consider the following requirements:
Set up Your Local Development Environment¶
If you run this notebook on your local machine make sure your environment meets this notebook's requirements:
Note: If you need to install Docker or the SDK, the links will guide you to the installation steps.
Set up Your Google Cloud Project¶
The following steps are required, regardless of your notebook environment.
IMPORTANT! If you have not used the gcloud CLI before, you need to set it up first. In your local shell, run:
gcloud init
During this process, you authenticate, obtain credentials, and can set your default project and region.
Note: All commands prefixed with ! are shell commands. The prefix ! allows for direct execution within Jupyter. However, you can also execute them in a dedicated terminal.
Determine Environment¶
We need to adapt to the environment this notebook runs in. If it runs on Vertex AI Workbench, IS_WORKBENCH_ENV is True.
import os
IS_WORKBENCH_ENV = "GOOGLE_VM_CONFIG_LOCK_FILE" in os.environ
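The check above can also be wrapped in a small helper, which makes it easy to try out with a fake environment. This is a minimal sketch; the function name is_workbench_env is an assumption for illustration and is not part of getml.vertexai.

```python
import os
from typing import Mapping


def is_workbench_env(environ: Mapping[str, str] = os.environ) -> bool:
    """Return True if the marker variable checked by this notebook is set."""
    return "GOOGLE_VM_CONFIG_LOCK_FILE" in environ


# Simulated environments, so the helper can be tried anywhere
simulated_workbench = is_workbench_env({"GOOGLE_VM_CONFIG_LOCK_FILE": "1"})
simulated_local = is_workbench_env({})
print(simulated_workbench, simulated_local)
```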
Install requirements¶
getml.vertexai is located within src.
This package contains:
- Utility functions for accessing GCP resources
- Configurations for this notebook and the training/inference containers we will create later
- Dependencies needed for the notebook and the Docker containers:
  - getml==1.4.0
  - google-cloud-aiplatform[prediction]==1.56.0
  - pyyaml==6.0.1
The Python Cloud Client Library google-cloud-aiplatform is needed to interact with Google Cloud services, including:
- Vertex AI
- Cloud Storage
The [prediction] extra includes FastAPI, which is needed for building the prediction container later on.
For more information on getML, check out the documentation.
Install getml.vertexai¶
In the Vertex AI Workbench environment, perform the following steps:
- Download the tarball version of the getml-demo repository.
- Extract the content of the project folder into the current working directory.
# type: ignore
if IS_WORKBENCH_ENV:
    # --strip-components=1 is necessary to avoid creating a directory with the name of the repository
    ! curl -L https://api.github.com/repos/getml/getml-demo/tarball/vertexai | tar --strip-components=1 -xz
    ! uv pip install --force-reinstall "."
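To illustrate what --strip-components=1 does in the tar command, here is a minimal Python sketch using only the standard library. It builds a small in-memory archive whose single top-level folder (the hypothetical name getml-demo-abc123) stands in for the folder GitHub places at the root of its tarballs, then extracts the files without that leading folder. The function name extract_stripped is an assumption for illustration and is not part of getml.vertexai.

```python
import io
import tarfile
import tempfile
from pathlib import Path, PurePosixPath


def extract_stripped(archive: tarfile.TarFile, dest: Path, strip: int = 1) -> list:
    """Extract regular files, dropping the first `strip` path components."""
    written = []
    for member in archive.getmembers():
        parts = PurePosixPath(member.name).parts[strip:]
        # Skip directories, entries that vanish after stripping, and unsafe paths
        if not parts or not member.isfile() or ".." in parts:
            continue
        target = dest.joinpath(*parts)
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(archive.extractfile(member).read())
        written.append("/".join(parts))
    return sorted(written)


# Build a tiny demo archive in memory (file names and contents are made up)
demo_files = {
    "getml-demo-abc123/README.md": b"demo",
    "getml-demo-abc123/src/pkg.py": b"print('hi')",
}
buffer = io.BytesIO()
with tarfile.open(fileobj=buffer, mode="w:gz") as tar:
    for name, data in demo_files.items():
        info = tarfile.TarInfo(name)
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))
buffer.seek(0)

with tarfile.open(fileobj=buffer, mode="r:gz") as tar, tempfile.TemporaryDirectory() as tmp:
    extracted = extract_stripped(tar, Path(tmp))
    readme_content = (Path(tmp) / "README.md").read_bytes()

print(extracted)
```

Without stripping, the files would end up inside a getml-demo-abc123 folder instead of the working directory, and the subsequent install from "." would not find the project.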
Kernel restart¶
On Workbench we also need to restart the kernel to apply all changes.
# type: ignore
if IS_WORKBENCH_ENV:
    import IPython
    app = IPython.Application.instance()
    app.kernel.do_shutdown(True)
Redefine IS_WORKBENCH_ENV after the kernel restart:
import os
IS_WORKBENCH_ENV = "GOOGLE_VM_CONFIG_LOCK_FILE" in os.environ
Import Vertex AI SDK¶
aiplatform is part of the google-cloud-aiplatform package. It provides a Python API for interacting with Vertex AI services.
from google.cloud import aiplatform
Configuration¶
Define and save configuration variables using Config, storing them in config.yaml.
This method centralizes project variable definitions within the notebook and ensures availability to Docker containers created later.
Access the configuration via the cfg instance using dot notation, e.g., cfg.REGION.
Note: If you are using Vertex AI Workbench, this notebook is associated with the Compute Engine default service account. SERVICE_ACCOUNT_NAME will be automatically filled with the corresponding *compute@developer.gserviceaccount.com name.
from getml.vertexai.config import Config
cfg = Config(
{
"GCP_PROJECT_NAME": "", # NOTE: Must be globally(!) unique on GCP
"BUCKET_NAME": "", # NOTE: Must be globally(!) unique on GCP
"BUCKET_DIR_MODEL": "model_artifact",
"BUCKET_DIR_DATASET": "datasets",
"REGION": "europe-west1", # NOTE: Adapt to your preferred region
"SERVICE_ACCOUNT_NAME": "getml-vertexai-sa", # NOTE: Gets replaced, if you run on Vertex AI Workbench
"DOCKER_REPOSITORY": "getml-vertexai-docker-repository",
"GETML_PROJECT_NAME": "Loans",
}
)
# Save configuration for later use in Docker containers
cfg.save("config.yaml")
Print all available configurations
cfg
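Besides the values defined above, later cells read derived values such as cfg.BUCKET_URI and cfg.SERVICE_ACCOUNT_EMAIL. How Config computes them is not shown in this notebook; the following is a hypothetical stand-in, assuming the usual Google Cloud naming formats (gs:// bucket URIs, NAME@PROJECT_ID.iam.gserviceaccount.com for user-created service accounts, and REGION-docker.pkg.dev as the Artifact Registry host). The class name DerivedConfig, the property DOCKER_REGISTRY_HOST, and the demo values are assumptions for illustration.

```python
class DerivedConfig:
    """Hypothetical stand-in for Config: base values plus derived properties."""

    def __init__(self, values: dict):
        self._values = dict(values)

    def __getattr__(self, name: str):
        # Only called when normal lookup fails, so the properties below win
        try:
            return self._values[name]
        except KeyError:
            raise AttributeError(name) from None

    @property
    def BUCKET_URI(self) -> str:
        return f"gs://{self.BUCKET_NAME}"

    @property
    def SERVICE_ACCOUNT_EMAIL(self) -> str:
        # Assumes GCP_PROJECT_NAME holds the project ID
        return f"{self.SERVICE_ACCOUNT_NAME}@{self.GCP_PROJECT_NAME}.iam.gserviceaccount.com"

    @property
    def DOCKER_REGISTRY_HOST(self) -> str:
        return f"{self.REGION}-docker.pkg.dev"


demo_cfg = DerivedConfig(
    {
        "GCP_PROJECT_NAME": "my-project",  # made-up demo value
        "BUCKET_NAME": "my-bucket",  # made-up demo value
        "REGION": "europe-west1",
        "SERVICE_ACCOUNT_NAME": "getml-vertexai-sa",
    }
)
print(demo_cfg.REGION, demo_cfg.BUCKET_URI, demo_cfg.SERVICE_ACCOUNT_EMAIL)
```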
Set the project and region to ensure ! gcloud commands are executed accordingly.
! gcloud config set project {cfg.GCP_PROJECT_NAME}
! gcloud config set ai/region {cfg.REGION}
Initialize the Vertex AI SDK and set the project and location defaults there as well.
This ensures that all aiplatform-related commands and functions execute in the correct project and region.
aiplatform.init(project=cfg.GCP_PROJECT_NAME, location=cfg.REGION)
Enable Necessary APIs¶
If you have just created a new project, some APIs might not be enabled yet. Use the following command to enable all the APIs needed for this tutorial:
! gcloud services enable \
iam.googleapis.com \
compute.googleapis.com \
containerregistry.googleapis.com \
artifactregistry.googleapis.com \
aiplatform.googleapis.com
Setup Service Account¶
We need a service account to give our containers the appropriate permissions to access:
- Storage Buckets (Save and load model artifacts)
- MetadataStore (Logging metrics/Experiments)
VertexAI Workbench: Adding Role to Service Account¶
If you are running this notebook on Vertex AI Workbench, it is associated with a Service Account (see SERVICE_ACCOUNT_EMAIL). To ensure proper functionality, you need to add the storage.admin role to this account.
Perform this step on the GCP Platform. Follow the link below (it should automatically open) and add the storage.admin role to the Service Account associated with this notebook.
from getml.vertexai import open_iam_permissions
if IS_WORKBENCH_ENV:
    open_iam_permissions(cfg.GCP_PROJECT_NAME)
    cfg.print(["SERVICE_ACCOUNT_EMAIL"])
    cfg.print_links(["iam_permissions"])
Local Environment: Create a Service Account¶
If this notebook runs in a local environment and you are authenticated to the gcloud CLI with your personal account, we need to create a service account.
NOTE: If you run this notebook on VertexAI Workbench, skip this step and continue with Save Service Account to JSON.
# NOTE: If the service account already exists in the project, the following error can be ignored:
# ERROR: (gcloud.iam.service-accounts.create) Resource in projects [$PROJECT_ID] is the subject of a conflict..
if not IS_WORKBENCH_ENV:
    cfg.print(["SERVICE_ACCOUNT_NAME"])
    ! gcloud iam service-accounts create {cfg.SERVICE_ACCOUNT_NAME} \
        --display-name="getML Vertex AI Service Account"
Set Permissions on Service Account¶
Once the service account is created, we need to grant it the roles aiplatform.user and storage.admin:
if not IS_WORKBENCH_ENV:
    cfg.print(["GCP_PROJECT_NAME", "SERVICE_ACCOUNT_EMAIL"])
    # Assign the Vertex AI User role
    ! gcloud projects add-iam-policy-binding {cfg.GCP_PROJECT_NAME} \
        --member="serviceAccount:{cfg.SERVICE_ACCOUNT_EMAIL}" \
        --role="roles/aiplatform.user"
    # Assign the Storage Admin role
    ! gcloud projects add-iam-policy-binding {cfg.GCP_PROJECT_NAME} \
        --member="serviceAccount:{cfg.SERVICE_ACCOUNT_EMAIL}" \
        --role="roles/storage.admin"
Save Service Account to JSON¶
We will need the service_account.json file later when we create a local endpoint to test our container.
NOTE: If too many keys have been created, the following error can occur:
ERROR: (gcloud.iam.service-accounts.keys.create) FAILED_PRECONDITION: Precondition check failed.
In this case older keys should be deleted before creating a new one.
To prevent this from happening in the first place, we check if a service_account.json is already present before we create it.
# type: ignore
from pathlib import Path
cfg.print(["SERVICE_ACCOUNT_EMAIL"])
PATH_SERVICE_ACCOUNT_CREDENTIALS = Path("service_account.json")
if not PATH_SERVICE_ACCOUNT_CREDENTIALS.exists():
    ! gcloud iam service-accounts keys create {PATH_SERVICE_ACCOUNT_CREDENTIALS.name} \
        --iam-account={cfg.SERVICE_ACCOUNT_EMAIL}
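Because an existing service_account.json is reused rather than recreated, it can be worth checking that the file on disk really belongs to the expected account. The sketch below assumes the standard fields of a Google service account key file (type and client_email). The helper name key_belongs_to and the demo email are assumptions for illustration, and the demo writes a fake key to a temporary folder instead of touching a real one.

```python
import json
import tempfile
from pathlib import Path


def key_belongs_to(path: Path, expected_email: str) -> bool:
    """Check that a key file is a service account key for the expected account."""
    key = json.loads(path.read_text())
    return key.get("type") == "service_account" and key.get("client_email") == expected_email


# Fake key file with made-up values, only for demonstration
demo_email = "getml-vertexai-sa@my-project.iam.gserviceaccount.com"
with tempfile.TemporaryDirectory() as tmp:
    demo_key_path = Path(tmp) / "service_account.json"
    demo_key_path.write_text(json.dumps({"type": "service_account", "client_email": demo_email}))
    matches = key_belongs_to(demo_key_path, demo_email)
    mismatches = key_belongs_to(demo_key_path, "someone-else@my-project.iam.gserviceaccount.com")

print(matches, mismatches)
```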
Create Cloud Storage Bucket¶
The bucket will serve as cloud storage for:
- Trained model artifacts (The result of the training container)
- Datasets (Loans dataset)
Both are included in the getML project dump, Loans.getml, which will be stored in the bucket we create now:
# NOTE: If BUCKET_URI already exists, the following error can be ignored:
# "ServiceException 409 A Cloud Storage bucket named $BUCKET_NAME already exists."
# Create the bucket
! gsutil mb -l {cfg.REGION} -p {cfg.GCP_PROJECT_NAME} {cfg.BUCKET_URI}
cfg.print(["BUCKET_URI"])
cfg.print_links(["bucket"])
Create Docker Repository on Artifact Registry¶
The Docker repository on Google Cloud's Artifact Registry will store the Docker images required for our training and prediction containers. These images will be built locally and then pushed to this repository for deployment on Vertex AI.
# NOTE: If DOCKER_REPOSITORY already exists, the following error can be ignored:
# ERROR: (gcloud.artifacts.repositories.create) ALREADY_EXISTS: the repository already exists
! gcloud artifacts repositories create {cfg.DOCKER_REPOSITORY} \
--repository-format=docker \
--location={cfg.REGION} \
--description="Docker repository for getML Vertex AI Images"
cfg.print(["DOCKER_REPOSITORY", "REGION"])
cfg.print_links(["docker_repository"])
Configure Docker¶
To be able to upload images to the repository, you need to update your Docker settings:
! gcloud auth configure-docker --quiet
! gcloud auth configure-docker --quiet {cfg.REGION}-docker.pkg.dev
Set the DOCKER_HOST environment variable to the current Docker daemon path.
This is necessary for rootless Docker setups to work with the Vertex AI SDK.
from getml.vertexai.utils import get_docker_daemon_path
os.environ["DOCKER_HOST"] = get_docker_daemon_path()
Handling "line buffering" Warnings¶
In this notebook, you may see warnings related to line buffering when using the subprocess module. These warnings do not impact the accuracy or performance and cannot be resolved within this notebook's context. Therefore, we will ignore them to keep our output clean.
Note: You might still see line buffering warnings when running ! gcloud commands. As stated, these can be safely ignored.
import warnings
warnings.filterwarnings("ignore", message="line buffering")
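The message argument of warnings.filterwarnings is a regular expression that is matched, case-insensitively, against the start of the warning text. The sketch below shows the effect of the filter in isolation; the two warning texts are made up for the demonstration.

```python
import warnings

with warnings.catch_warnings(record=True) as caught:
    # Record everything by default, then add the same filter as above
    warnings.simplefilter("always")
    warnings.filterwarnings("ignore", message="line buffering")

    # Starts with "line buffering": suppressed by the filter
    warnings.warn("line buffering (buffering=1) is not supported here", RuntimeWarning)
    # Any other message: still recorded
    warnings.warn("something unrelated happened", RuntimeWarning)

recorded_messages = [str(w.message) for w in caught]
print(recorded_messages)
```

Because the match is anchored at the beginning of the message, warnings that merely mention line buffering somewhere later in their text are not affected by this filter.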
Setup Finished¶
We have completed all setup and configuration steps and are now ready to start training our model.
Training¶
This notebook demonstrates the training of a binary classification model. It is based on the Loans notebook. Check out the link for more details on the dataset and usage of the getML Python API.
Main Objectives¶
The main objectives of the training container are to:
- Get and preprocess the Loans dataset.
- Train a getML model (pipeline) on the train set.
- Score the trained model on the test set.
- Save the project (including data and model) as an artifact on the GCS bucket.
Create Managed Dataset¶
To use experiments, a managed dataset is essential as it creates a default MetadataStore. The Experiments/MetadataStore is crucial for logging and tracking experiments, ensuring all data-related activities are properly recorded and managed within the Vertex AI ecosystem.
The managed dataset created here is primarily for demonstration purposes and to establish a MetadataStore. The actual data used to train our model is retrieved within the training Docker container. For details, see training/train.py.
from getml.vertexai import create_vertex_dataset_tabular
dataset_loans = create_vertex_dataset_tabular(
cfg=cfg, filename_csv="datasets/loans_population_test.csv"
)
Build Docker Container for Training¶
For training we just need a simple Docker container that includes:
- Python runtime (we conveniently use a public Python image as base layer)
- Python dependencies:
  - getml
  - getml-playbooks
  - google-cloud-aiplatform
print("Content of Dockerfile.train:\n")
%cat training/Dockerfile.train
Now let's build the Dockerfile.train image and push it to the Artifact Registry:
cfg.print(["DOCKER_IMAGE_URI_TRAIN"])
! docker build -f training/Dockerfile.train -t {cfg.DOCKER_IMAGE_URI_TRAIN} .
! docker push {cfg.DOCKER_IMAGE_URI_TRAIN}
Deploy Training Job¶
The gcloud ai custom-jobs create command
- wraps the train.py script into our training Docker container, then
- runs it in the Vertex AI environment on Google Cloud.
- Finally, the result is an artifact containing the getML model and data frames.
For more details about the command, check out https://cloud.google.com/sdk/gcloud/reference/ai/custom-jobs/create
cfg.print(
[
"GETML_PROJECT_NAME",
"GCP_PROJECT_NAME",
"REGION",
"SERVICE_ACCOUNT_EMAIL",
"DOCKER_IMAGE_URI_TRAIN",
"BUCKET_URI_DATASET",
]
)
# Define variables for the training job
TRAIN_DISPLAY_NAME = f"getml-train-{cfg.GETML_PROJECT_NAME}"
TRAIN_LOCAL_PACKAGE_PATH = "training"
TRAIN_SCRIPT = "train.py"
TRAIN_MACHINE_TYPE = "n1-standard-4"
TRAIN_REPLICA_COUNT = 1
# Create and run the custom training job
! gcloud ai custom-jobs create \
--project={cfg.GCP_PROJECT_NAME} \
--region={cfg.REGION} \
--display-name={TRAIN_DISPLAY_NAME} \
--service-account={cfg.SERVICE_ACCOUNT_EMAIL} \
--worker-pool-spec=machine-type={TRAIN_MACHINE_TYPE},replica-count={TRAIN_REPLICA_COUNT},executor-image-uri={cfg.DOCKER_IMAGE_URI_TRAIN},local-package-path={TRAIN_LOCAL_PACKAGE_PATH},script={TRAIN_SCRIPT}
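The --worker-pool-spec flag takes a single comma-separated list of key=value pairs, which is easy to mistype. As a minimal sketch, the string can be assembled from keyword arguments; the helper name build_worker_pool_spec and the placeholder image URI are assumptions for illustration, while the keys are the ones used in the command above.

```python
def build_worker_pool_spec(**fields) -> str:
    """Join keyword arguments into the key=value,key=value form used by gcloud."""
    return ",".join(f"{key.replace('_', '-')}={value}" for key, value in fields.items())


worker_pool_spec = build_worker_pool_spec(
    machine_type="n1-standard-4",
    replica_count=1,
    executor_image_uri="IMAGE_URI_PLACEHOLDER",  # stands in for cfg.DOCKER_IMAGE_URI_TRAIN
    local_package_path="training",
    script="train.py",
)
print(worker_pool_spec)
```

The resulting string could then be passed as --worker-pool-spec={worker_pool_spec} in the shell command.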
Result of Training Container¶
The following links contain the resources we just created, as well as the resulting artifact from the training container:
cfg.print_links(["training_jobs", "model_artifact", "experiments"])
Prediction / Inference¶
Now that we have a trained model artifact stored on GCS, we can
- build a prediction routine that loads the artifact, and
- deploy an HTTP endpoint to run predictions on our model.
Details of the prediction container¶
In essence, the container provides the HTTP route predict via FastAPI, Uvicorn, and Gunicorn.
To learn more about the Predictor class, see the documentation on custom prediction routines.
You can find all relevant files in the prediction folder.
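To give a feel for the shape of a custom prediction routine, here is a minimal, dependency-free sketch. It mirrors the method names of the Vertex AI Predictor interface (load, preprocess, predict, postprocess) without importing it, and it is not the predictor shipped in the prediction folder. The class name SketchPredictor, the dummy model, and the assumption that requests carry an instances list are all for illustration only.

```python
class SketchPredictor:
    """Illustrative predictor with the load/preprocess/predict/postprocess steps."""

    def load(self, artifacts_uri: str) -> None:
        # A real predictor would download and load the model artifact here
        self._artifacts_uri = artifacts_uri
        # Dummy model: scores each row by its length
        self._model = lambda rows: [[float(len(row))] for row in rows]

    def preprocess(self, prediction_input: dict) -> list:
        # Assumption: the request body contains an "instances" list
        return prediction_input["instances"]

    def predict(self, instances: list) -> list:
        return self._model(instances)

    def postprocess(self, prediction_results: list) -> dict:
        return {"predictions": prediction_results}


predictor = SketchPredictor()
predictor.load("gs://my-bucket/model_artifact")  # made-up URI
demo_request = {"instances": [[1, 2], [3]]}
demo_response = predictor.postprocess(predictor.predict(predictor.preprocess(demo_request)))
print(demo_response)
```

In the real container, the serving framework calls these steps in the same order for every request to the predict route.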
print("Content of Dockerfile.pred:\n")
%cat prediction/Dockerfile.pred
Now let's build the Dockerfile.pred image:
! docker build -f prediction/Dockerfile.pred \
-t {cfg.DOCKER_IMAGE_URI_PRED} .
Deploy Local Model¶
Before deploying the model to the cloud, it is advisable to build and test it locally. Once the model is confirmed to be functioning correctly, you can then proceed with the cloud deployment.
See the Google documentation for more details on the LocalModel class.
from google.cloud.aiplatform.prediction import LocalModel
local_model = LocalModel(serving_container_image_uri=cfg.DOCKER_IMAGE_URI_PRED)
cfg.print(["DOCKER_IMAGE_URI_PRED"])
Local Prediction on Test Data¶
To run a prediction using the local model, we will send a request with test data in JSON format (as string) to the local endpoint.
We have prepared some test request data in JSON format, which can be loaded using load_json_from_file().
Note: Refer to [OPTIONAL] Create Test Request Data for details on how this test data was created.
from getml.vertexai import load_json_from_file
request_json = load_json_from_file("./prediction/request_test.json")
request_json
[OPTIONAL] Create Test Request Data¶
If you would like to recreate the test data JSON or see how it is generated, uncomment the following code and check its source in src/getml/vertexai/request_data.py.
# from getml.vertexai import create_test_request
# create_test_request()
Deploy local_model to a local_endpoint¶
Now that we have the prediction container ready, as well as some test data, we can deploy a local endpoint and send test data to the predict endpoint.
See the Google documentation for more details and requirements of the LocalModel class and its deploy_to_local_endpoint method.
Verify that the training has successfully finished by checking the following links:
cfg.print_links(["training_jobs", "model_artifact"])
Wait for the training job to finish before proceeding to the next step.
NOTE: This may take a few minutes.
from getml.vertexai.utils_gcp import wait_for_training_artifact
wait_for_training_artifact(cfg)
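How wait_for_training_artifact works internally is not shown in this notebook. A generic way to wait for a condition is to poll it until it holds or a timeout expires, as in the following sketch. The names wait_until and artifact_ready are hypothetical, and the fake check below simply succeeds on its third call.

```python
import time
from typing import Callable


def wait_until(check: Callable[[], bool], timeout_s: float, interval_s: float = 1.0) -> bool:
    """Poll `check` until it returns True or `timeout_s` seconds have passed."""
    deadline = time.monotonic() + timeout_s
    while True:
        if check():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)


# Fake readiness check that succeeds on the third attempt
state = {"attempts": 0}


def artifact_ready() -> bool:
    state["attempts"] += 1
    return state["attempts"] >= 3


ready = wait_until(artifact_ready, timeout_s=5, interval_s=0.01)
timed_out = wait_until(lambda: False, timeout_s=0.05, interval_s=0.01)
print(ready, state["attempts"], timed_out)
```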
with local_model.deploy_to_local_endpoint(
    credential_path=PATH_SERVICE_ACCOUNT_CREDENTIALS.name,
    artifact_uri=cfg.ARTIFACT_URI,
) as local_endpoint:
    health_check_response = local_endpoint.run_health_check()
    print(
        "Health check response:", health_check_response, health_check_response.content
    )
    # Make a prediction
    predict_response = local_endpoint.predict(
        request=request_json,
        headers={"Content-Type": "application/json"},
    )
    print("Predict response:", predict_response, predict_response.content)
You should see an output similar to:
Health check response: <Response [200]> b'{}'
Predict response: <Response [200]> b'{"predictions": [[0.9659892320632935], [0.8711856007575989], [0.882280170917511],...
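The response content is a JSON document encoded as bytes, with one single-element list per input row. As a small sketch, the scores can be pulled out with the standard library; the sample bytes below are the first three values of the example output, and the helper name extract_scores is an assumption for illustration.

```python
import json

# First three predictions from the example output above
sample_content = (
    b'{"predictions": [[0.9659892320632935], [0.8711856007575989], [0.882280170917511]]}'
)


def extract_scores(raw: bytes) -> list:
    """Flatten the nested "predictions" list into a flat list of floats."""
    payload = json.loads(raw)
    return [row[0] for row in payload["predictions"]]


scores = extract_scores(sample_content)
print(scores)
```

With a live endpoint, the same helper could be applied to predict_response.content.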
If there is an issue, you can check the logs of the container:
local_endpoint.container.logs().decode("utf-8").strip().split("\n")
Manually Spin-Up Container and Call Endpoint with Test Data¶
Alternatively, you can manually run your Docker container. This way, you have more control over the parameters of docker run, especially the Google environment variables.
See more details about them in the Google documentation.
NOTE: You should run the docker run command in a separate terminal, not in this notebook.
from getml.vertexai import cmd_to_run_local_endpoint
cmd_to_run_local_endpoint(cfg)
Push Image to GCP / Vertex AI¶
Before we can deploy the container to the cloud, we need to push the image to the Artifact Registry.
Rebuild Prediction Container¶
To ensure compatibility with GCP (x86_64), the container image must be built for the correct architecture. Regardless of your current platform, the docker build command below enforces the linux/amd64 platform.
! docker build --platform linux/amd64 -f prediction/Dockerfile.pred \
-t {cfg.DOCKER_IMAGE_URI_PRED} .
local_model.push_image()
cfg.print_links(["image_for_predictions"])
Upload to Model Registry¶
The Model Registry serves as a centralized repository where you can manage and version your machine learning models. By uploading the model, you make it accessible for deployment and further analysis.
cfg.print(["GCP_PROJECT_NAME", "REGION", "ARTIFACT_URI"])
model = aiplatform.Model.upload(
project=cfg.GCP_PROJECT_NAME,
location=cfg.REGION,
local_model=local_model,
display_name="getML model (Loans)",
artifact_uri=f"{cfg.ARTIFACT_URI}",
description="getML model trained on the Loans dataset. Generated by demo_binary_classification.ipynb",
)
cfg.print_links(["model_registry"])
Online Prediction Endpoint¶
Endpoints are machine learning models made available for online prediction requests. Endpoints are useful for timely predictions from many users (for example, in response to an application request). You can also request batch predictions if you don't need immediate results.
Deploy Endpoint¶
NOTE: If you encounter a "FailedPrecondition" error, this is very likely related to an exception thrown within the Docker container. You should check the logs of the container to find the cause.
NOTE: Deployment of the endpoint can take a while (30 min+).
ENDPOINT_MACHINE_TYPE = "n1-standard-4"
endpoint = model.deploy(
machine_type=ENDPOINT_MACHINE_TYPE, service_account=cfg.SERVICE_ACCOUNT_EMAIL
)
Prediction on Deployed Endpoint¶
Once the endpoint is deployed, you can also make predictions using the Test your model feature in the Vertex AI console (see link below).
As the JSON request, you can use the content of the request_test.json file:
# model_id is just needed to build the link
model_id = Path(endpoint.gca_resource.deployed_models[0].model).name
cfg.print_links(["deployed_model"], model_id)
print("JSON request:", request_json)
# PROJECT_ID (int): The numerical project ID.
# ENDPOINT_ID (int): The numerical endpoint ID.
# Example URL format: https://europe-west1-aiplatform.googleapis.com/v1/projects/{PROJECT_ID}/locations/europe-west1/endpoints/{ENDPOINT_ID}:predict
! curl -X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json" \
"https://{cfg.REGION}-aiplatform.googleapis.com/v1/{endpoint.resource_name}:predict" \
-d "@prediction/request_test.json"
The result should look similar to:
{
"predictions": [
[
0.96598923206329346
],
[
0.87118560075759888
],
[
0.882280170917511
],
...
],
"deployedModelId": "5059851355955396608",
"model": "projects/956851751872/locations/europe-west1/models/8409526724114513920",
"modelDisplayName": "getML model (Loans)",
"modelVersionId": "1"
}
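The curl call above can also be prepared from Python. The sketch below only builds the request object with the standard library and does not send it, so it runs without credentials. The URL follows the format given in the comment above the curl command; the function name build_predict_request, the numeric IDs, the token string, and the empty payload are placeholders for illustration.

```python
import json
import urllib.request


def build_predict_request(
    region: str, project_id: str, endpoint_id: str, token: str, payload: dict
) -> urllib.request.Request:
    """Build (but do not send) a POST request to an endpoint's :predict route."""
    url = (
        f"https://{region}-aiplatform.googleapis.com/v1/"
        f"projects/{project_id}/locations/{region}/endpoints/{endpoint_id}:predict"
    )
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Authorization": f"Bearer {token}", "Content-Type": "application/json"},
        method="POST",
    )


# Placeholder values; the tutorial sends the content of prediction/request_test.json
demo_http_request = build_predict_request(
    region="europe-west1", project_id="123", endpoint_id="456", token="TOKEN", payload={}
)
print(demo_http_request.get_method(), demo_http_request.full_url)
```

To actually send it, the request could be passed to urllib.request.urlopen with a real access token, for example the output of gcloud auth print-access-token.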
Undeploy Endpoint¶
Remember to undeploy your cloud endpoints after testing to avoid unnecessary costs.
endpoint.undeploy_all()
Conclusion¶
In this notebook, we walked through the complete workflow of training and deploying a machine learning model using Vertex AI. We began by setting up our environment, configuring necessary project variables, and initializing Vertex AI. We then trained a binary classification model using the getML framework, logged and tracked our experiments using the MetadataStore, and saved the model artifact to Google Cloud Storage.
Next, we built and tested a custom prediction routine locally before pushing our Docker image to the Artifact Registry. We uploaded the trained model to the Vertex AI Model Registry and deployed an online prediction endpoint to serve real-time predictions. Additionally, we showed how to run the Docker container manually and noted that batch predictions are an option when immediate results are not needed.
By following these steps, you have learned how to leverage Vertex AI for end-to-end machine learning workflows, from data preprocessing and model training to deployment and prediction. This powerful combination of tools and services ensures a scalable, efficient, and well-managed approach to developing and deploying getML models on Google Cloud.