Summary: Introduction to AI/ML Toolkits with Kubeflow (LFS147)
Understanding Kubeflow
Kubeflow is an open-source machine learning toolkit designed to simplify the deployment, scaling, and management of ML workflows on Kubernetes. This article summarizes key concepts from the Linux Foundation’s Kubeflow course, focusing on its core components and how they support the model development lifecycle.
What is Kubeflow?
Kubeflow originated from Google’s internal approach to operating TensorFlow on Kubernetes. Initially offering a streamlined method for executing TensorFlow tasks, it has evolved into a comprehensive framework supported by multiple distributions across various cloud environments. In 2023, Google donated Kubeflow to the Cloud Native Computing Foundation (CNCF), placing it under vendor-neutral governance.
The name “Kubeflow” reflects its core purpose: running machine learning workflows (particularly TensorFlow initially) on Kubernetes, creating a fusion of “Kube” and “flow.”
Open Source vs. Open Core
Understanding Kubeflow’s position in the open source ecosystem is important:
- Open Source: Software whose full source code is freely available under an OSI-approved open source license, like Terraform’s initial release under MPL 2.0
- Open Core: A business model where core functionality is open source, but premium features are proprietary
- Integrated Open Source: Multiple open source projects integrated with custom code to create a cohesive solution
Kubeflow follows the integrated open source model, combining various open source tools into a comprehensive ML platform. The donation to the CNCF ensures it remains vendor-neutral, unlike HashiCorp’s Terraform, which moved from an open source license to the source-available Business Source License (BSL).
Kubeflow Conformance
The Kubeflow community is developing conformance standards to:
- Define the core APIs and functionality that constitute “Kubeflow”
- Ensure interoperability between different distributions
- Provide certifications for distributions that meet these standards
- Enable more neutral deployment methods beyond vendor-specific distributions
This conformance program is part of Kubeflow’s graduation process within the CNCF and will help ensure consistency across implementations.
What is a Model?
A model is a computational entity that takes data as input and responds with a prediction based on learned patterns. Similar to how our brains connect new observations with existing knowledge, models process input data through patterns established during training.
Models can be of several types:

Model Features and Determinism
Features in machine learning are the specific attributes or indicators within datasets that help models recognize patterns. They’re like clues in a puzzle that the computer uses to learn. In deep learning, these features are often discovered by the network itself rather than being explicitly defined by humans.
Feature Stores like Feast (popular in the Kubeflow community) provide centralized locations where engineered features are stored, managed, and accessed. These stores ensure consistency in how features are used across different models and applications.
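A minimal sketch of reading features from a Feast online store, assuming an already configured feature repository; the feature view, feature names, and entity key below are hypothetical:

```python
# Hedged sketch: fetch the latest feature values for one entity from Feast's online store.
# The repo path, feature view ("driver_hourly_stats"), and entity key are illustrative.
from feast import FeatureStore

store = FeatureStore(repo_path=".")  # points at a directory containing feature_store.yaml

feature_vector = store.get_online_features(
    features=[
        "driver_hourly_stats:conv_rate",
        "driver_hourly_stats:avg_daily_trips",
    ],
    entity_rows=[{"driver_id": 1001}],
).to_dict()

print(feature_vector)
```

Serving features from one store like this is what keeps training-time and inference-time feature values consistent across models.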
Determinism refers to the principle that a process will produce the same output for a given input every time. In ML, deterministic models will yield identical predictions for the same input data and will develop the same weights and biases if training is replicated with identical data and parameters. This property is crucial for reproducibility and debugging.
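A minimal sketch of enforcing determinism by pinning random seeds; real frameworks may also need additional settings (for example deterministic GPU kernels) for full reproducibility:

```python
# Pinning seeds so that, with identical data and code, every run produces identical results.
import random
import numpy as np

SEED = 42
random.seed(SEED)     # Python's built-in RNG
np.random.seed(SEED)  # NumPy RNG used by many ML libraries

# The "initial weights" below come out identical on every execution
weights = np.random.normal(size=3)
print(weights)
```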
Containerization and Reproducibility
Containerization is vital for ML workflows as it:
- Ensures consistency across environments (“works on my machine” problems disappear)
- Improves security through verifiable builds (using hashing)
- Makes models more portable and deployable across different environments
- Facilitates reproducibility by capturing the exact environment in which a model runs
The serialization of models (converting models into storable formats) complements containerization: a serialized model can be stored efficiently as an artifact and deployed in a serverless pattern, much as a container image is pulled and run at deployment time.
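A minimal sketch of model serialization using scikit-learn and joblib (one common choice among several formats such as ONNX or TensorFlow SavedModel):

```python
# Train a small model, serialize it to disk, then load it back as a serving runtime would.
import joblib
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=200).fit(X, y)

# The resulting artifact could be pushed to object storage (e.g. S3/GCS) for deployment
joblib.dump(model, "model.joblib")

restored = joblib.load("model.joblib")
print(restored.predict(X[:1]))
```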
The Model Development Lifecycle
Kubeflow supports the entire model development lifecycle:

- Problem Definition & Scoping: Determining if a problem can be solved with data and establishing clear objectives
- Data Extraction: Accessing crucial data from various sources using ETL (Extract, Transform, Load) tools
- Data Analysis: Exploring datasets to understand features, identify patterns, and address data quality issues
- Data Preparation: Dividing datasets into training/validation/test sets, cleaning data, and engineering features
- Model Training: Using frameworks like TensorFlow or PyTorch to teach models to recognize patterns in data
- Model Serving: Exposing models through APIs or embedding them in applications for user access
- Model Monitoring: Tracking performance metrics, drift detection, and ensuring reliability in production
- Model Retirement: Properly decommissioning models and understanding the impact of removal
Kubeflow’s Core Mission
Kubeflow aims to bring DevOps concepts to machine learning through three key principles:
- Composability: Easily assembling and disassembling workflow components, allowing teams to select and integrate the best tools for specific stages of their ML development cycles
- Portability: Ensuring ML workflows are migratable and executable across environments, allowing development in one environment and deployment in another
- Scalability: Processing current work and dynamically allocating additional resources for new tasks, supporting both scaling up (more complex models) and scaling out (distributed training)

[Image: Kubeflow’s three guiding principles (composability, portability, and scalability). Source: kubeflow.org]
Kubeflow’s Key Components
1. Central Dashboard
A unified web interface that provides access to all Kubeflow services. The dashboard:
- Centralizes access to core services
- Enables namespace-based multi-tenancy
- Allows customization for specific organizational needs
- Integrates with Kubernetes RBAC and Istio for security
2. Kubeflow Notebooks
Provides integrated development environments (IDEs) as “Labs-as-a-Service” with benefits including:
- On-demand provisioning of customizable development environments
- Support for Jupyter, RStudio, and VS Code
- Resource isolation through Kubernetes namespaces
- Collaborative spaces for team members
- Integration with storage volumes for persistent work
- Automatic scaling and resource optimization
Limitations of Notebooks
While notebooks provide an excellent environment for experimentation, they have several limitations when used for production workloads:
Reliability Issues: Notebooks rely on a single pod, making them vulnerable to node failures. If the node requires maintenance or restarts, long-running jobs will be interrupted without automatic recovery.
Performance Constraints: Notebooks are limited to the resources of a single node, which can become a bottleneck for computationally intensive tasks. Training large models or processing big datasets may exceed single-node capacity.
Scaling Challenges: When multiple data scientists require high-resource notebooks, the cluster can quickly become saturated. Resource contention can lead to scheduling delays or prevent new notebooks from being created.
These limitations highlight why teams often need to transition from notebook-based development to more robust distributed training and pipeline solutions as projects mature.
3. Kubeflow Pipelines (KFP)
A workflow orchestration tool that enables defining and executing sequences of ML tasks as directed acyclic graphs (DAGs).
Component Types in Kubeflow Pipelines
KFP offers three types of components with different levels of flexibility and complexity:
Lightweight Python Components
- Description: Self-contained Python functions turned into pipeline steps (see the sketch after this list)
- Best for: Quick experimentation and prototyping
- Advantages:
- Fast iteration without container building
- Simple to create and modify
- Minimal setup required
- Limitations:
- All imports must be within the function body
- Limited to Python
- Packages are installed at runtime, affecting reproducibility and performance
- When to use: Early development phases, simple functions, rapid prototyping
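A minimal sketch of a lightweight Python component, assuming the KFP v2 SDK (`kfp.dsl`); the function, package, and pipeline names are illustrative:

```python
from kfp import dsl

# Lightweight component: the function body is self-contained, so the import lives
# inside it and the package is installed when the step runs.
@dsl.component(base_image="python:3.11", packages_to_install=["pandas"])
def mean_of_values(values: list) -> float:
    import pandas as pd
    return float(pd.Series(values).mean())

# A tiny pipeline wiring the component into a single-step DAG
@dsl.pipeline(name="lightweight-demo")
def demo_pipeline(values: list = [1.0, 2.0, 3.0]):
    mean_of_values(values=values)
```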
Containerized Python Components
- Description: Python functions packaged as containers by the KFP SDK (see the sketch after this list)
- Best for: Production-ready Python code that needs external imports
- Advantages:
- Can use external modules and libraries
- Better reproducibility through containerization
- No need to be a container expert
- Limitations:
- Requires container registry access
- SDK dependency for building
- Still Python-only
- When to use: Production Python components, when code needs external dependencies
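A hedged sketch of a containerized Python component, again assuming the KFP v2 SDK; the `target_image` registry path is hypothetical, and the image would be built with the SDK’s `kfp component build` command before the pipeline runs:

```python
from kfp import dsl
import pandas as pd  # module-level import is allowed: the source is packaged into the image

@dsl.component(
    base_image="python:3.11",
    target_image="registry.example.com/ml/components/preprocess:v1",  # hypothetical registry
)
def mean_of_csv(csv_path: str) -> float:
    # pandas comes from the built image rather than being installed at runtime
    return float(pd.read_csv(csv_path)["value"].mean())
```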
Container Components
- Description: Fully custom containers for maximum flexibility (see the sketch after this list)
- Best for: Complex components, non-Python code, specialized environments
- Advantages:
- Can use any programming language
- Complete control over the execution environment
- Maximum flexibility for complex tasks
- Limitations:
- Requires container expertise
- More complex to develop and maintain
- Needs careful handling of inputs and outputs
- When to use: Multi-language components, specialized processing needs, legacy code integration
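A hedged sketch of a fully custom container component using `dsl.container_component` from the KFP v2 SDK; the image and shell command are illustrative and could be replaced by any language or binary:

```python
from kfp import dsl

@dsl.container_component
def say_hello(name: str, greeting: dsl.OutputPath(str)):
    # Any image and entrypoint can be used; here a shell command writes the
    # greeting to the path KFP provides for the output artifact.
    return dsl.ContainerSpec(
        image="alpine:3.19",
        command=["sh", "-c", 'mkdir -p "$(dirname "$1")" && echo "Hello, $0!" > "$1"'],
        args=[name, greeting],
    )
```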
Other Pipeline Features
- Runs: Individual executions of pipelines with specific parameters
- Experiments (KFP): Groups of related runs for comparison and organization
- Recurring Runs: Scheduled executions of pipelines at regular intervals
- Artifacts: Outputs produced by pipeline steps (datasets, models, metrics)
- Executions: Records of individual component runs within a pipeline
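To connect these concepts, here is a hedged sketch of compiling a pipeline and starting a run with the KFP client; the endpoint, experiment name, and the `demo_pipeline` function (from the lightweight-component sketch above) are assumptions:

```python
from kfp import compiler
from kfp.client import Client

# Compile the pipeline definition into a reusable package (itself an artifact)
compiler.Compiler().compile(demo_pipeline, package_path="demo_pipeline.yaml")

# Submit a run; grouping it under an experiment keeps related runs together
client = Client(host="http://localhost:8080")  # hypothetical KFP API endpoint
run = client.create_run_from_pipeline_package(
    "demo_pipeline.yaml",
    arguments={"values": [4.0, 5.0, 6.0]},
    experiment_name="kfp-demo",
)
print(run.run_id)
```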
4. Katib for Automated Machine Learning
Katib is Kubeflow’s AutoML solution that automates hyperparameter tuning and neural architecture search.
Finding Global Minima in Complex Optimization Landscapes
One of the key challenges in machine learning is finding the optimal set of hyperparameters in a high-dimensional space where many local minima exist. To understand this challenge, Katib uses test problems like the Ackley function – a mathematical function characterized by a large number of local minima surrounding one global minimum.
The Ackley function visualizes the problem faced in hyperparameter optimization: many “valleys” (local minima) that might trap simple optimization algorithms, while the true “deepest valley” (global minimum) is what we’re actually seeking.
In simpler terms:
- A local minimum is like a small dip in a hilly landscape – it’s lower than the immediate surrounding area, but not necessarily the lowest point overall.
- The global minimum is the absolute lowest point in the entire landscape – it represents the best possible solution across all possible configurations.
It’s called “global” because it’s the best solution across the entire search space, not just in a local region. This concept is crucial in machine learning because:
- Finding the global minimum means we’ve found the truly optimal hyperparameters that will give our model the best possible performance
- Most optimization algorithms tend to naturally fall into the nearest local minimum, potentially missing better solutions elsewhere
- The difference between a good local minimum and the global minimum can significantly impact model accuracy and reliability
In machine learning, the goal is to minimize errors or losses – hence the term “minimum” – and the “global” aspect ensures we’re finding the best solution possible rather than settling for a merely decent one.
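A small Python sketch of the two-dimensional Ackley function makes this landscape concrete: the global minimum is 0 at the origin, while points near other integer coordinates sit in shallower local basins:

```python
# The 2-D Ackley function: many local minima surrounding one global minimum at (0, 0).
import math

def ackley(x: float, y: float) -> float:
    term1 = -20.0 * math.exp(-0.2 * math.sqrt(0.5 * (x**2 + y**2)))
    term2 = -math.exp(0.5 * (math.cos(2 * math.pi * x) + math.cos(2 * math.pi * y)))
    return term1 + term2 + math.e + 20.0

print(ackley(0.0, 0.0))  # ~0.0: the global minimum
print(ackley(2.0, 2.0))  # ~6.59: a point in a shallower local basin
```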
Training Algorithms and Their Hyperparameters
Different algorithms require different hyperparameters to be tuned, and Katib can optimize these across various model types:
Gradient Descent-Based Algorithms
- Learning Rate: Controls the size of steps the algorithm takes towards the minimum of the loss function. Too large a learning rate can cause overshooting, while too small can lead to slow convergence.
- Momentum: Helps to accelerate the algorithm in the right direction, smoothing out updates.
- Batch Size: Influences the amount of data used to calculate each update, affecting the stability and speed of convergence.
Decision Tree-Based Algorithms
- Maximum Depth: Limits the depth of the tree to prevent overfitting.
- Minimum Samples Split: The minimum number of samples required to split an internal node.
- Maximum Number of Features: The number of features to consider when looking for the best split.
Neural Networks
- Number of Layers and Units per Layer: Determines the neural network’s architecture, affecting its capacity to learn complex patterns.
- Activation Function: Influences how the weighted sum of the input is transformed before being passed to the next layer.
- Dropout Rate: A regularization technique that ignores randomly selected neurons during training, reducing overfitting.
Ensemble Methods (e.g., Random Forest, Gradient Boosting)
- Number of Estimators: The number of trees in the forest or the number of boosting stages.
- Learning Rate (for Boosting): Controls how much each additional tree contributes to the prediction.
- Max Features: The size of the random subsets of features to consider when splitting a node.
Experiments (AutoML)
Katib experiments are structured workflows to find optimal model configurations:
- Objective: The metric to optimize (e.g., accuracy, loss)
- Search Space: The range of hyperparameter values to explore
- Search Algorithm: The method to navigate the search space (random, grid, Bayesian)
- Trials: Individual model training runs with specific hyperparameter configurations
- Worker Jobs: The actual processes that execute trials, often using the Training Operator
Katib helps overcome the challenge of manually tuning hyperparameters, which can be time-consuming and computationally expensive. By automating this process, data scientists can focus on problem definition and feature engineering instead of parameter tweaking.
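As a hedged sketch (not a complete manifest), a Katib experiment tying these pieces together might look like the following Python dict mirroring the `kubeflow.org/v1beta1` Experiment custom resource; the names, metric, and parameter ranges are illustrative, and the trial template is omitted:

```python
# Hedged sketch of a Katib Experiment spec; in practice trialTemplate wraps the worker job.
experiment = {
    "apiVersion": "kubeflow.org/v1beta1",
    "kind": "Experiment",
    "metadata": {"name": "random-search-demo", "namespace": "kubeflow-user"},
    "spec": {
        "objective": {
            "type": "maximize",
            "objectiveMetricName": "accuracy",   # the metric Katib optimizes
        },
        "algorithm": {"algorithmName": "random"},  # e.g. random, grid, Bayesian optimization
        "maxTrialCount": 12,
        "parallelTrialCount": 3,
        "parameters": [                            # the search space
            {"name": "lr", "parameterType": "double",
             "feasibleSpace": {"min": "0.001", "max": "0.1"}},
            {"name": "batch_size", "parameterType": "int",
             "feasibleSpace": {"min": "16", "max": "128"}},
        ],
        # trialTemplate (omitted) defines the worker job each trial runs,
        # typically a Training Operator job or a plain Kubernetes Job.
    },
}
```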
5. Training Operator
A framework-agnostic way to submit training jobs that:
- Simplifies job submissions through Kubernetes manifests
- Manages pod lifecycle and recovery from failures
- Handles the complexity of framework-specific distribution patterns
The Training Operator supports multiple frameworks with specialized capabilities:
TensorFlow Training (TFJob)
- Framework: TensorFlow, developed by Google
- Characteristics: Supports deployments with parameter servers and workers
- Distinctive Capabilities:
- Compatible with TPU (Tensor Processing Units)
- Complete ecosystem including TensorFlow Lite, TFX
- Optimized for large-scale distributed computing
- Supports both eager execution and graph execution
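As an illustration, a distributed TFJob is expressed as a Kubernetes custom resource; the hedged sketch below builds it as a Python dict and submits it with the Kubernetes Python client (the image name, namespace, and replica counts are assumptions):

```python
# Hedged sketch of a TFJob (kubeflow.org/v1) with 1 parameter server and 4 workers.
from kubernetes import client, config

container = {"name": "tensorflow", "image": "registry.example.com/mnist-train:v1"}
tfjob = {
    "apiVersion": "kubeflow.org/v1",
    "kind": "TFJob",
    "metadata": {"name": "mnist-distributed", "namespace": "kubeflow-user"},
    "spec": {
        "tfReplicaSpecs": {
            "PS": {"replicas": 1, "template": {"spec": {"containers": [container]}}},
            "Worker": {"replicas": 4, "template": {"spec": {"containers": [container]}}},
        }
    },
}

config.load_kube_config()
client.CustomObjectsApi().create_namespaced_custom_object(
    group="kubeflow.org", version="v1", namespace="kubeflow-user",
    plural="tfjobs", body=tfjob,
)
```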
PyTorch Training (PyTorchJob)
- Framework: PyTorch, developed by Facebook AI Research
- Characteristics: Uses a dynamic computation approach
- Distinctive Capabilities:
- Dynamic computation graphs (can be modified on the fly)
- Highly Pythonic nature, easy to adopt
- Rich libraries: TorchVision, TorchText, TorchAudio
- User-friendly for prototyping and research
PaddlePaddle Training (PaddleJob)
- Framework: PaddlePaddle, developed by Baidu
- Characteristics: Industrial support and large-scale production focus
- Distinctive Capabilities:
- Specialized tools like Paddle Lite for mobile devices
- Optimized support for Baidu’s Kunlun chips
- High-level APIs to simplify development
- Strong support for natural language processing and computer vision
MXNet Training (MXJob)
- Framework: Apache MXNet, supported by AWS
- Characteristics: Hybrid programming model
- Distinctive Capabilities:
- Combines symbolic and imperative programming via the Gluon API
- Support for multiple programming languages (Scala, C++, R, JavaScript)
- Optimized for memory efficiency and speed
- ONNX support for model portability
XGBoost Training (XGBoostJob)
- Framework: XGBoost (eXtreme Gradient Boosting)
- Characteristics: Specializes in gradient boosting algorithms
- Distinctive Capabilities:
- Excellent handling of sparse data
- Built-in L1 and L2 regularization
- Bindings for multiple languages and support for various platforms
- Optimized tree pruning
- Built-in cross-validation
MPI Training (MPIJob)
- Framework: Message Passing Interface, a standard for process communication
- Characteristics: Facilitates “allreduce” style distributed training
- Distinctive Capabilities:
- Enables efficient communication between different processes
- Optimizes network usage for distributed models
- Eliminates the need to configure the optimal ratio between workers and parameter servers
- Ideal for high-performance computing (HPC)
- Interoperable with multiple ML frameworks
6. Model Registry Integration
While not a core Kubeflow component, model registries like MLflow are commonly integrated with Kubeflow to:
- Catalog and version models
- Track model lineage and metadata
- Facilitate deployment from artifact storage
- Manage the model lifecycle from experimentation to production
Model registries complement Kubeflow’s focus on workflow orchestration by providing specialized tools for model management and governance.
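A hedged sketch of how a model might be logged and registered with MLflow alongside a Kubeflow workflow; the tracking URI, experiment name, and registered model name are hypothetical:

```python
# Log metrics and register a model version in MLflow's model registry.
import mlflow
import mlflow.sklearn
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

mlflow.set_tracking_uri("http://mlflow.example.com")  # shared tracking server
mlflow.set_experiment("iris-demo")

X, y = load_iris(return_X_y=True)

with mlflow.start_run():
    model = LogisticRegression(max_iter=200).fit(X, y)
    mlflow.log_metric("train_accuracy", model.score(X, y))
    # registered_model_name also creates or bumps a version in the registry
    mlflow.sklearn.log_model(model, "model", registered_model_name="iris-classifier")
```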
Kubernetes Scheduler Limitations and Volcano
Standard Kubernetes Scheduler Limitations
The default Kubernetes scheduler has several limitations for ML workloads:
Single-Pod Scheduling: The scheduler processes one pod at a time without considering relationships between pods in a distributed job.
Resource Fragmentation: Without awareness of pod groups, resources can become fragmented, leading to situations where no single node has enough resources for a pod, even though the collective cluster has sufficient capacity.
No Coordination: For distributed training jobs that require all components to start simultaneously, the default scheduler provides no guarantees that all pods will be scheduled together.
Limited Hardware Awareness: The standard scheduler has limited understanding of specialized hardware topologies, which can lead to suboptimal placement of ML workloads.
Gang Scheduling with Volcano
Gang scheduling is a batch scheduling approach where a group of related tasks (a “gang”) are scheduled to run simultaneously. Volcano implements this concept for Kubernetes:
- All-or-Nothing Scheduling: Either all pods in a job are scheduled, or none are, preventing partial job execution and resource waste
- Queue Management: Jobs can be prioritized and placed in queues when resources are unavailable, avoiding gridlock
- Hardware Topology Awareness: Optimizes placement based on node interconnects and hardware specifics
- Resource Fairness: Ensures equitable resource distribution across teams and workloads
Consider a distributed TensorFlow job with 1 parameter server and 4 workers. With the default scheduler, some pods might get scheduled while others remain pending due to resource constraints. The scheduled pods would consume resources while waiting indefinitely for their counterparts. Volcano ensures that either all 5 pods get scheduled together, or the job waits in a queue until sufficient resources are available.
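A hedged sketch of that scenario as a Volcano Job (`batch.volcano.sh/v1alpha1`), expressed as a Python dict; `minAvailable: 5` is what enforces the all-or-nothing behavior, and the image and queue names are illustrative:

```python
# Gang-scheduled job: 1 "ps" pod + 4 "worker" pods are placed together or not at all.
container = {"name": "trainer", "image": "registry.example.com/mnist-train:v1"}
volcano_job = {
    "apiVersion": "batch.volcano.sh/v1alpha1",
    "kind": "Job",
    "metadata": {"name": "tf-gang-job"},
    "spec": {
        "minAvailable": 5,            # the whole gang must be schedulable
        "schedulerName": "volcano",   # hand placement to Volcano instead of the default scheduler
        "queue": "ml-team",           # jobs wait here when resources are scarce
        "tasks": [
            {"name": "ps", "replicas": 1,
             "template": {"spec": {"containers": [container]}}},
            {"name": "worker", "replicas": 4,
             "template": {"spec": {"containers": [container]}}},
        ],
    },
}
```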
Common Kubeflow Integrations
KServe
A serverless model inference platform that:
- Deploys models without requiring custom servers
- Supports major frameworks (TensorFlow, PyTorch, ONNX, etc.)
- Handles auto-scaling based on demand (scale to zero when idle)
- Provides built-in model versioning and canary deployments
- Manages preprocessing, prediction, post-processing, and monitoring
- Implements A/B testing and progressive model rollouts
KServe abstracts away the complexities that teams previously had to handle manually with custom Flask or FastAPI implementations, including model loading, API endpoints, data preprocessing, concurrency, and security.
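A hedged sketch of a KServe InferenceService, expressed as a Python dict mirroring the `serving.kserve.io/v1beta1` resource; the service name and storage URI are hypothetical:

```python
# KServe pulls the serialized model from storageUri and serves it behind an HTTP endpoint.
inference_service = {
    "apiVersion": "serving.kserve.io/v1beta1",
    "kind": "InferenceService",
    "metadata": {"name": "iris-classifier", "namespace": "kubeflow-user"},
    "spec": {
        "predictor": {
            "minReplicas": 0,  # allows scale-to-zero when the model is idle
            "model": {
                "modelFormat": {"name": "sklearn"},
                "storageUri": "gs://example-bucket/models/iris-classifier/",
            },
        }
    },
}
```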
Volcano
A batch scheduling system for high-performance workloads that:
- Implements gang scheduling to ensure all related pods start together
- Prevents gridlock by considering total resource requirements
- Optimizes for specific hardware like GPUs and specialized accelerators
- Supports job prioritization and fair-sharing through queues
- Considers network topology for optimized placement
Advantages of Kubeflow
- Framework Flexibility: Supports multiple ML frameworks and allows teams to use the right tool for each job
- Resource Optimization: Efficiently allocates computational resources, particularly for expensive GPU workloads
- Workflow Automation: Reduces manual intervention through standardized pipelines and components
- Reproducibility: Ensures consistent results through containerization and version control
- Collaboration: Enables team members to share work, models, and results through a unified platform
- Security: Integrates with Kubernetes RBAC and network policies for proper isolation
- Scalability: Handles workloads from small experiments to production-grade deployment
Distributions and Deployment
Kubeflow can be deployed through several approaches:

The future development of conformance standards will likely lead to more vendor-neutral deployment options, similar to how Kubernetes evolved from “the hard way” to more standardized deployment methods.