An introduction to MLOps with MLflow

Romain Avouac (Insee), Thomas Faria (Insee), Tom Seimandi (Insee)

Introduction

Who are we?

  • Data scientists at Insee
    • methodological and IT innovation teams
    • support data science projects
  • Contact us

Context

  • Difficulty of transitioning from experiments to production-grade machine learning systems

  • Leverage best practices from software engineering

    • Improve reproducibility of analysis
    • Deploy applications in a scalable way
    • Monitor running applications

The DevOps approach

  • Unify development (dev) and system administration (ops)
    • shorten development time
    • maintain software quality

The MLOps approach

  • Integrate the specificities of machine learning projects
    • Experimentation
    • Continuous improvement

MLOps: principles

  • Reproducibility

  • Versioning

  • Automation

  • Monitoring

  • Collaboration

Why MLflow?

  • Multiple frameworks implement the MLOps principles

  • Pros of MLflow

    • Open-source
    • Covers the whole ML lifecycle
    • Agnostic to the ML library used
    • We have experience with it

Training platform: the SSP Cloud

  • An open innovation production-like environment
    • Kubernetes cluster
    • S3-compatible object storage
    • Large computational resources (including GPUs)
  • Based on the Onyxia project

Outline

1️⃣ Introduction to MLflow

2️⃣ A Practical Example: NACE Code Prediction for French companies

3️⃣ Deploying a ML model as an API

4️⃣ Distributing the hyperparameter optimization

5️⃣ Maintenance of a model in production

Application 0

Preparation of the working environment (without Git)

  1. Create an account on the SSP Cloud using your professional email address
  2. Launch an MLflow service by clicking this URL
  3. Launch a Jupyter-python service by clicking this URL
  4. Open the Jupyter-python service and input the service password
  5. You’re all set!

Preparation of the working environment (with Git)

  1. It is assumed that you have a GitHub account and have already created a token. Fork the training repository by clicking here.

  2. Create an account on the SSP Cloud using your professional email address

  3. Launch an MLflow service by clicking this URL

  4. Launch a Jupyter-python service by clicking this URL

  5. Open the Jupyter-python service and input the service password

  6. In Jupyter, open a terminal and clone your forked repository (modify the first two lines):

    GIT_REPO=formation-mlops
    GIT_USERNAME=InseeFrLab
    
    git clone https://github.com/$GIT_USERNAME/$GIT_REPO.git
    cd $GIT_REPO
  7. Install the necessary packages for the training:

    pip install -r requirements.txt
    python -m nltk.downloader stopwords
  8. You’re all set!

1️⃣ Introduction to MLflow

Tracking server

  • “An API and UI for logging parameters, code versions, metrics, and artifacts”
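
For instance, a minimal sketch of the tracking API (the experiment name and logged values here are purely illustrative):

import mlflow

# The tracking server is resolved from the MLFLOW_TRACKING_URI environment
# variable, which is already set in the services used in this training
mlflow.set_experiment("my-experiment")

with mlflow.start_run():
    mlflow.log_param("dim", 25)                  # a hyperparameter
    mlflow.log_metric("accuracy", 0.82)          # an evaluation metric
    mlflow.log_artifact("confusion_matrix.png")  # any output file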

Projects

  • “A standard format for packaging reusable data science code”

Models

  • “A convention for packaging machine learning models in multiple flavors”

Model registry

  • “A centralized model store, set of APIs, and UI, to collaboratively manage the full lifecycle of an MLflow Model”
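
For example, a logged model can be promoted to the registry programmatically; a minimal sketch, with an illustrative run ID and model name:

import mlflow

run_id = "..."  # ID of the run that logged the model

# Register the model logged under this run; the registry assigns an
# auto-incremented version number under the given name
result = mlflow.register_model(model_uri=f"runs:/{run_id}/model", name="fasttext")
print(result.version)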

Application 1

Introduction to MLflow concepts

  1. In JupyterLab, open the notebook located at formation-mlops/notebooks/mlflow-introduction.ipynb
  2. Execute the notebook cell by cell. If you are finished early, explore the MLflow UI and try to build your own experiments from the example code provided in the notebook.

2️⃣ A Practical Example

Context

  • NACE

    • European standard classification of productive economic activities
    • Hierarchical structure with 4 levels and 615 codes
  • At Insee, previously handled by an outdated rule-based algorithm

  • A problem common to many National Statistical Institutes

FastText model

  • “Bag of n-grams model”: embeddings for words, but also for n-grams of words and characters

  • Very simple and fast model

OVA: One vs. All
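
As an illustration, a hedged sketch of supervised training with the fasttext package (the file name and hyperparameter values are assumptions, not the project's actual settings):

import fasttext

# Training file: one line per observation, "__label__<nace_code> <description>"
model = fasttext.train_supervised(
    input="train.txt",
    dim=100,          # dimension of the embeddings
    wordNgrams=2,     # word bigrams in addition to unigrams
    minn=3, maxn=6,   # character n-grams of 3 to 6 characters
    loss="ova",       # one-vs-all loss: one binary classifier per label
)
print(model.predict("boulangerie patisserie", k=1))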

Data used

  • A simple use-case with only 2 variables:
    • Textual description of the activity – text
    • True NACE code labelled by the rule-based engine – nace (732 categories)
  • Standard preprocessing:
    • lowercasing
    • punctuation removal
    • number removal
    • stopwords removal
    • stemming
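
A minimal sketch of such a preprocessing function (not the exact implementation used in the project's src folder):

import string

from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer

# Assumes the stopwords corpus has been downloaded (cf. application 0)
STOPWORDS = set(stopwords.words("french"))
STEMMER = SnowballStemmer("french")

def preprocess(text: str) -> str:
    text = text.lower()                                               # lowercasing
    text = text.translate(str.maketrans("", "", string.punctuation))  # punctuation removal
    text = text.translate(str.maketrans("", "", string.digits))       # number removal
    tokens = [t for t in text.split() if t not in STOPWORDS]          # stopwords removal
    return " ".join(STEMMER.stem(t) for t in tokens)                  # stemming

print(preprocess("Vendeur d'huîtres depuis 1982"))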

MLflow with a non-standard framework

  • Easy to use with a variety of machine learning frameworks (scikit-learn, Keras, PyTorch…)

# Log a fitted scikit-learn pipeline as an MLflow model
mlflow.sklearn.log_model(pipe_rf, "model")

# Load it back through the generic pyfunc interface and predict
model = mlflow.pyfunc.load_model(model_uri=f"models:/{model_name}/{version}")
y_train_pred = model.predict(X_train)

  • What if we require greater flexibility, e.g. to use a custom framework?
  • Possibility to track, register and serve your own model

MLflow with a non-standard framework

  • There are 2 main differences when using your own framework:
    • logging of parameters, metrics and artifacts
    • wrapping of your custom model so that MLflow can serve it

# Define a custom model as a subclass of mlflow.pyfunc.PythonModel
class MyModel(mlflow.pyfunc.PythonModel):

    def load_context(self, context):
        # Called when the model is loaded: restore the underlying model
        # from the artifacts logged alongside the MLflow model
        self.my_model.load_model(context.artifacts["my_model"])

    def predict(self, context, model_input):
        # Delegate predictions to the underlying custom model
        return self.my_model.predict(model_input)
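
Once wrapped, the custom model can be logged like any built-in flavor; a minimal sketch, where the artifact path and file name are illustrative:

import mlflow

# Log the wrapper together with the serialized model file; the "my_model"
# artifact key is what load_context() looks up at loading time
mlflow.pyfunc.log_model(
    artifact_path="model",
    python_model=MyModel(),
    artifacts={"my_model": "model.bin"},  # local path to the serialized model
)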

From experiment towards production

  • Notebooks are not suitable for building production-grade ML systems:
    • Limited potential for automation of ML pipelines.
    • Lack of clear and reproducible workflows.
    • Hinders collaboration and versioning among team members.
    • Insufficient modularity for managing complex ML components.

Application 2

Part 1: From notebooks to a package-like project (without Git)

  1. Launch a VSCode service by clicking this URL. Open the service and input the service password.

  2. All scripts related to our custom model are stored in the src folder. Check them out. Have a look at the MLproject file as well.

  3. Run a training of the model using MLflow. To do so, open a terminal (Terminal -> New Terminal) and run the following command:

    export MLFLOW_EXPERIMENT_NAME="nace-prediction"
    mlflow run ~/work/formation-mlops/ --env-manager=local \
        -P remote_server_uri=$MLFLOW_TRACKING_URI \
        -P experiment_name=$MLFLOW_EXPERIMENT_NAME
  4. In the UI of MLflow, look at the results of your previous run:

    • Experiments -> nace-prediction -> <run_name>
  5. You have trained the model with some default parameters. In MLproject, check the available parameters. Re-train a model with different parameters (e.g. dim = 25).

Click to see the command
mlflow run ~/work/formation-mlops/ --env-manager=local \
    -P remote_server_uri=$MLFLOW_TRACKING_URI \
    -P experiment_name=$MLFLOW_EXPERIMENT_NAME \
    -P dim=25
  6. In MLflow, compare the 2 models by plotting the accuracy against the parameter you have changed (i.e. dim)
    • Select the 2 runs -> Compare -> Scatter Plot -> Select your X and Y axis

Part 1: From notebooks to a package-like project (with Git)

  1. Launch a VSCode service by clicking this URL. Open the service and input the service password.

  2. In VSCode, open a terminal (Terminal -> New Terminal) and redo steps 6 and 7 of application 0 (clone and package installation).

  3. All scripts related to our custom model are stored in the src folder. Check them out. Have a look at the MLproject file as well.

  4. Run a training of the model using MLflow. To do so, open a terminal and run the following command:

    export MLFLOW_EXPERIMENT_NAME="nace-prediction"
    mlflow run ~/work/formation-mlops/ --env-manager=local \
        -P remote_server_uri=$MLFLOW_TRACKING_URI \
        -P experiment_name=$MLFLOW_EXPERIMENT_NAME
  5. In the UI of MLflow, look at the results of your previous run:

    • Experiments -> nace-prediction -> <run_name>
  6. You have trained the model with some default parameters. In MLproject, check the available parameters. Re-train a model with different parameters (e.g. dim = 25).

Click to see the command
mlflow run ~/work/formation-mlops/ --env-manager=local \
    -P remote_server_uri=$MLFLOW_TRACKING_URI \
    -P experiment_name=$MLFLOW_EXPERIMENT_NAME \
    -P dim=25
  7. In MLflow, compare the 2 models by plotting the accuracy against the parameter you have changed (i.e. dim)
    • Select the 2 runs -> Compare -> Scatter Plot -> Select your X and Y axis

Application 2

Part 2: Distributing and querying a custom model

  1. Explore the src/train.py file carefully. What are the main differences with application 1?
  2. Why can we say that the MLflow model embeds the preprocessing?
  3. In MLflow, register your last model as fasttext to make it easily queryable from the Python API
  4. Create a script predict_mlflow.py in the src folder of the project. This script should:
    1. Load version 1 of the fasttext model
    2. Use the model to predict the NACE codes of a given list of activity descriptions (e.g. ["vendeur d'huitres", "boulanger"]).

💡 Don’t forget to read the documentation of the predict() function of the custom class (src/fasttext_wrapper.py) to understand the expected format for the inputs!

Click to see the content of the script
import mlflow

model_name = "fasttext"
version = 1

model = mlflow.pyfunc.load_model(
    model_uri=f"models:/{model_name}/{version}"
)

list_libs = ["vendeur d'huitres", "boulanger"]

results = model.predict(list_libs, params={"k": 1})
print(results)
  5. Run your predict_mlflow.py script.
Click to see the command
python formation-mlops/src/predict_mlflow.py
  6. Make sure that the two following descriptions give the same top prediction: "COIFFEUR" and "coiffeur, & 98789".
  7. Change the value of the parameter k and try to understand how the structure of the output changes as a result.

3️⃣ Deploying a ML model as an API

Model serving

  • Once a ML model has been developed, it must be deployed to serve its end users
    • Which production infrastructure?
    • Who are the end users?
    • Batch serving vs. online serving

A standard setup

  • Production infrastructure: Kubernetes cluster

  • The model might serve various applications

    • Make the model accessible via an API
  • Online serving

    • Client applications send a request to the API and get a fast response
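
In that setting a client is just a few lines of code; a hedged sketch, where the URL, route and query parameters are hypothetical (the real ones are exposed at /docs by the API deployed in application 3):

import requests

# Hypothetical endpoint: the actual route and parameters are defined
# by the deployed API
response = requests.get(
    "https://<your_firstname>-<your_lastname>-api.lab.sspcloud.fr/predict",
    params={"description": "vendeur d'huitres", "k": 1},
)
print(response.json())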

Exposing a model through an API

Run the API in a container

  • Container: self-contained and isolated environment that encapsulates the model, its dependencies and the API code

  • Containers provide high portability and scalability for distributing the model efficiently.

  • The Dockerfile is used to configure and build the Docker image that the container runs.

Development architecture with Docker

Deploying an API on Kubernetes

  • 3 main files are needed to deploy an API:
    • deployment.yaml : defines how the API should run (container image, resources, and environment variables)
    • service.yaml : establishes a stable internal network endpoint for the API.
    • ingress.yaml : provides an entry point for external clients to access the API.

Application 3

Manually deploying a machine learning model as an API

  1. We constructed a very simplistic REST API using FastAPI. All underlying files are in the app folder. Check them out (a simplified sketch of such an app is shown after these steps).
  2. Open the Dockerfile to see how the image is built. The image is automatically rebuilt and published via GitHub Actions; if interested, have a look at .github/workflows/build_image.yml.
  3. Open the file kubernetes/deployment.yml and modify the highlighted lines accordingly:
deployment.yml
containers:
- name: api
  image: inseefrlab/formation-mlops-api:main
  imagePullPolicy: Always
  env:
  - name: MLFLOW_TRACKING_URI
    value: https://user-<namespace>-<pod_id>.user.lab.sspcloud.fr
  - name: MLFLOW_MODEL_NAME
    value: fasttext
  - name: MLFLOW_MODEL_VERSION
    value: "1"
  4. Open the file kubernetes/ingress.yml and modify (in two places) the URL of the API endpoint to be of the form <your_firstname>-<your_lastname>-api.lab.sspcloud.fr
  5. Apply the three Kubernetes manifests contained in the kubernetes/ folder in a terminal to deploy the API
kubectl apply -f formation-mlops/kubernetes/
  6. Reach your API using the URL defined in your ingress.yml file
  7. Display the documentation of your API by adding /docs to your URL
  8. Try your API out!
  9. Re-train a new model and deploy this new model in your API
Click to see the steps
  1. Train a model
  2. Register the model in MLflow
  3. Adjust your MLFLOW_MODEL_NAME or MLFLOW_MODEL_VERSION (if you didn’t modify the model name) environment variable in the deployment.yml file
  4. Apply the new Kubernetes manifests to update the API
kubectl apply -f formation-mlops/kubernetes/
  5. Refresh your API, and verify on the home page that it is now based on the new version of the model
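
For reference, a minimal sketch of what such a FastAPI app could look like; the actual implementation lives in the app folder, and the route and parameter names below are illustrative:

import os

import mlflow
from fastapi import FastAPI

app = FastAPI()

# Load the registered model once at startup, using the environment variables
# set in deployment.yml (the tracking URI is read from MLFLOW_TRACKING_URI
# by MLflow itself)
model = mlflow.pyfunc.load_model(
    model_uri=f"models:/{os.environ['MLFLOW_MODEL_NAME']}/{os.environ['MLFLOW_MODEL_VERSION']}"
)

@app.get("/predict")
def predict(description: str, k: int = 1):
    # Illustrative route: return the top-k NACE codes for a description
    return model.predict([description], params={"k": k})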

Continuous deployment of a machine-learning model as an API

⚠️ The previous applications must have been created with the Git option to be able to follow this one.

Previously, you deployed your model manually. Thanks to ArgoCD, it is possible to deploy a model continuously. This means that every modification of a file in the kubernetes/ folder will automatically trigger redeployment, synchronized with your GitHub repository. To convince yourself, follow the steps below:

  1. Launch an ArgoCD service by clicking on this URL. Open the service, enter the username (admin), and the service’s password.
  2. Repeat the first 4 steps of the manual deployment.
  3. Commit the changes made and push them to your GitHub repository.
  4. Open the template argocd/template-argocd.yml and modify the highlighted lines:
template-argocd.yml
spec:
  project: default
  source:
    repoURL: https://github.com/<your-github-id>/formation-mlops.git
    targetRevision: HEAD
    path: kubernetes
  destination:
    server: https://kubernetes.default.svc
    namespace: <your-namespace>
  5. In ArgoCD, click on New App and then Edit as a YAML. Copy and paste the content of argocd/template-argocd.yml, and click on Create.
  6. Reach your API using the URL defined in your ingress.yml file
  7. Display the documentation of your API by adding /docs to your URL
  8. Try your API out!
  9. Re-train a new model and automatically deploy this new model in your API
Click to see the steps
  1. Train a model
  2. Register the model in MLflow
  3. Adjust your MLFLOW_MODEL_NAME or MLFLOW_MODEL_VERSION (if you didn’t modify the model name) environment variable in the deployment.yml file
  4. Commit these changes and push them to your GitHub repository.
  5. Wait 5 minutes for ArgoCD to automatically synchronize the changes from your GitHub repository, or force the synchronization. Refresh your API and check on the homepage that it is now based on the new version of the model.

4️⃣ Distributing the hyperparameter optimization

Parallel training

  • With our setup, we can train models one by one and log all relevant information to the MLflow tracking server
  • What if we wanted to train multiple models at once, for example to optimize hyperparameters?

Workflow automation

  • General principles:
    • Define workflows where each step in the workflow is a container (reproducibility)
    • Model multi-step workflows as a sequence of tasks or as a directed acyclic graph
    • This makes it easy to run compute-intensive machine learning or data processing jobs in parallel

Argo workflows

  • A popular workflow engine for orchestrating parallel jobs on Kubernetes
    • open-source
    • container-native
    • available on the SSP Cloud

Hello World

apiVersion: argoproj.io/v1alpha1
kind: Workflow                  # new type of k8s spec
metadata:
  generateName: hello-world-    # name of the workflow spec
spec:
  entrypoint: whalesay          # invoke the whalesay template
  templates:
    - name: whalesay            # name of the template
      container:
        image: docker/whalesay
        command: [ cowsay ]
        args: [ "hello world" ]

What is going on?

Parameters

  • Templates can take input parameters
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: hello-world-parameters-
spec:
  entrypoint: whalesay
  arguments:
    parameters:
    - name: message
      value: hello world

  templates:
  - name: whalesay
    inputs:
      parameters:
      - name: message       # parameter declaration
    container:
      image: docker/whalesay
      command: [cowsay]
      args: ["{{inputs.parameters.message}}"]

Multi-step workflows

  • Multi-step workflows can be specified (steps or dag)
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: steps-
spec:
  entrypoint: hello-hello-hello

  # This spec contains two templates: hello-hello-hello and whalesay
  templates:
  - name: hello-hello-hello
    # Instead of just running a container
    # This template has a sequence of steps
    steps:
    - - name: hello1            # hello1 is run before the following steps
        template: whalesay
    - - name: hello2a           # double dash => run after previous step
        template: whalesay
      - name: hello2b           # single dash => run in parallel with previous step
        template: whalesay
  - name: whalesay              # name of the template
    container:
      image: docker/whalesay
      command: [ cowsay ]
      args: [ "hello world" ]

What is going on?

Further applications

  • Workflows to test registered models, or models pushed to staging/production
  • Workflows can be triggered automatically (via Argo Events for example)
  • Continuous training workflows
  • Distributed machine learning pipelines in general (data downloading, processing, etc.)

Notes

  • Python SDK for Argo Workflows
  • Kubeflow pipelines
  • Couler: unified interface for constructing and managing workflows on different workflow engines
  • Other Python-native orchestration tools: Apache Airflow, Metaflow, Prefect
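
As an illustration of such a Python SDK, the Hello World workflow above could be submitted with Hera; a hedged sketch, assuming Hera v5 and a reachable Argo server (the host is an assumption to fill in):

from hera.shared import global_config
from hera.workflows import Container, Workflow

# Assumption: address (and, if needed, authentication) of your Argo server
global_config.host = "https://<argo-server>"

# Same Hello World as the YAML spec above
with Workflow(generate_name="hello-world-", entrypoint="whalesay") as w:
    Container(name="whalesay", image="docker/whalesay",
              command=["cowsay"], args=["hello world"])

w.create()  # submit the workflow to the server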

Application 4

Part 1: Introduction to Argo Workflows

  1. Launch an Argo Workflows service by clicking this URL. Open the service and input the service password (either automatically copied or available in the README of the service)
  2. In VSCode, create a file hello_world.yaml at the root of the project with the following content:
hello_world.yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: hello-world-
  labels:
    workflows.argoproj.io/archive-strategy: "false"
  annotations:
    workflows.argoproj.io/description: |
      This is a simple hello world example.
      You can also run it in Python: https://couler-proj.github.io/couler/examples/#hello-world
spec:
  entrypoint: whalesay
  templates:
  - name: whalesay
    container:
      image: docker/whalesay:latest
      command: [cowsay]
      args: ["hello world"]
  3. Submit the Hello world workflow via a terminal in VSCode:
argo submit formation-mlops/hello_world.yaml
  4. Open the UI of Argo Workflows. Find the logs of the workflow you just launched. You should see the Docker logo.

Application 4

Part 2: Distributing the hyperparameter optimization

  1. Take a look at the argo_workflows/workflow.yml file. What do you expect will happen when we submit this workflow?
  2. Modify the highlighted line in the same manner as in application 3.
workflow.yml
parameters:
  # The MLflow tracking server is responsible for logging the hyper-parameters and model metrics
  - name: mlflow-tracking-uri
    value: https://user-<namespace>-<pod_id>.user.lab.sspcloud.fr
  - name: mlflow-experiment-name
    value: nace-prediction
  3. Submit the workflow and watch the jobs complete live in the UI.
Click to see the command
argo submit formation-mlops/argo_workflows/workflow.yml
  4. Once all jobs are completed, visualize the logs of the whole workflow.
  5. Finally, open the MLflow UI to check what has been done.

5️⃣ Machine learning in production

Observability

Application 5

Logging

Conclusion