
Understanding Karpenter’s Node Consolidation [EKS]: A Deep Dive

This article assumes you have already installed Karpenter; its intent is only to describe some caveats around consolidation mechanisms, especially spot-to-spot consolidation, in an EKS cluster.

As we will see, spot-to-spot consolidation is not an out-of-the-box feature, and there are a few things to be aware of if you want it to work as expected.

The Story

Last week, I was troubleshooting some node consolidation issues in a Kubernetes cluster when I stumbled upon an intriguing error message while running kubectl describe node:

Normal  Unconsolidatable   14m (x61 over 16h)  karpenter  SpotToSpotConsolidation requires 15 cheaper instance type options than the current candidate to consolidate, got 6
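
If you want to spot these events across the whole cluster rather than describing nodes one by one, filtering events by reason should do the trick (the reason string comes straight from the message above):

kubectl get events -A --field-selector reason=Unconsolidatable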

One important note: by default, spot-to-spot consolidation is not enabled in Karpenter. This means that if you’re running spot instances, Karpenter won’t automatically consolidate workloads between different spot instances to optimize costs and resource usage.
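
Spot-to-spot consolidation sits behind the SpotToSpotConsolidation feature gate. If you install Karpenter with the official Helm chart, enabling it should look roughly like this (a minimal sketch of the relevant values; adapt it to however you manage your chart values):

settings:
  featureGates:
    spotToSpotConsolidation: true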

This caught my attention: the node pool had various instance types configured, yet Karpenter seemed reluctant to perform its consolidation magic. That sparked my curiosity to dive deeper into Karpenter’s consolidation mechanisms…

What’s Going On Under the Hood?

After some investigation, I found that this wasn’t just a random number Karpenter pulled out of thin air. The requirement for 15 cheaper instance options is actually a clever safeguard against what’s known as the “race to the bottom” scenario in spot instance pricing.

Think of it this way: imagine you’re trying to find the cheapest apartment in a city, but you keep moving every time you find a slightly cheaper one. Eventually, you might end up in a place that’s not really suitable for your needs. Karpenter prevents this by requiring a decent number of alternative options before making a move.

Understanding Single Node vs. Multi-Node Consolidation

Here’s where it gets interesting:

  • For single node consolidation (1:1), Karpenter needs those 15 cheaper options
  • For multi-node consolidation (many:1), this requirement doesn’t apply

Why? Because when you’re consolidating multiple nodes into one, you’re already achieving significant cost savings through better resource utilization, regardless of the instance type price.
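
In practice, this means a spot NodePool needs enough instance-type flexibility for 1:1 consolidation to ever trigger. Here is a sketch of the relevant requirements (the capacity-type key is standard Karpenter; how broad you go on categories and sizes is up to you):

requirements:
  - key: karpenter.sh/capacity-type
    operator: In
    values: ["spot"]
  # Keep categories and sizes broad so Karpenter can find at least
  # 15 cheaper alternatives when replacing a single spot node
  - key: karpenter.k8s.aws/instance-category
    operator: In
    values: ["c", "m", "r"]
  - key: karpenter.k8s.aws/instance-size
    operator: In
    values: ["large", "xlarge", "2xlarge", "4xlarge"]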

Deep Dive: Anatomy of a NodePool

Let’s look at a real-world NodePool configuration and break it down piece by piece:

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: default
spec:
  template:
    spec:
      requirements:
        - key: "karpenter.k8s.aws/instance-category"
          operator: In
          values: ["c", "m", "r", "t"]
        - key: "karpenter.k8s.aws/instance-generation"
          operator: In
          values: ["3", "5", "6", "7"]
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
        - key: karpenter.k8s.aws/instance-size
          operator: In
          values: ["large", "xlarge", "2xlarge"]
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
      expireAfter: 720h # in karpenter.sh/v1, node lifetime lives under template.spec
  limits:
    cpu: "800"     # e.g. 100 nodes × 8 vCPUs = 800 vCPUs
    memory: 1600Gi # e.g. 100 nodes × 16Gi = 1600Gi
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 20m
    budgets:
    - schedule: "0 6 * * mon-fri"
      duration: 14h
      nodes: "6"
    - nodes: "1"
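
Once the NodePool is applied, a couple of quick checks help confirm that Karpenter can actually satisfy it (assuming the manifest is saved as nodepool.yaml):

kubectl apply -f nodepool.yaml
kubectl get nodepools
kubectl get nodeclaims -o wide
kubectl describe nodepool default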

Instance Types Selection

When declaring instance types in your NodePool, it’s crucial to make full use of the built-in requirements (instance category, generation, architecture, size).

If you don’t, and instead add instance types to your NodePool at random, you’re likely to end up with NodeClaims that Karpenter can’t satisfy, effectively breaking your auto-scaling mechanism.

Again, using the built-in requirements should be enough, but here are some useful AWS CLI commands in case you need to do some additional debugging.

  1. First, check the architecture of your AMI (x86_64 or arm64):
    aws ec2 describe-images --image-ids {id-xxxxx}
  2. Based on the architecture, verify which instance types are compatible:
    aws ec2 describe-instance-types --filters Name=processor-info.supported-architecture,Values={your arch} --query "InstanceTypes[*].InstanceType" --output text
  3. Finally, confirm if these instance types are available in your target region:
    aws ec2 describe-instance-type-offerings --region {your-region}
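
For step 3, you can also narrow the check down to the exact instance types and availability zones you care about; for example (the instance type and region below are purely illustrative):

    aws ec2 describe-instance-type-offerings --location-type availability-zone --filters Name=instance-type,Values=m6i.large --region eu-west-1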
 

Disruption Management

The configuration includes some sophisticated disruption settings:

disruption:
  consolidationPolicy: WhenEmptyOrUnderutilized
  consolidateAfter: 20m
  budgets:
    - schedule: "0 6 * * mon-fri"
      duration: 14h
      nodes: "6"
    - nodes: "1"

This setup ensures:

  • Nodes are considered for consolidation when they’re empty or underutilized
  • There’s a 20-minute grace period before consolidation
  • During business hours (weekdays starting at 6 AM, for 14 hours), up to 6 nodes can be disrupted at a time
  • Outside that window, only 1 node can be disrupted at a time
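
Budgets can also be scoped to specific disruption reasons if you only want to restrict certain kinds of disruption; a minimal sketch (the node count here is illustrative):

disruption:
  budgets:
    # Only allow 2 nodes at a time to be disrupted for drift or empty-node cleanup
    - nodes: "2"
      reasons: ["Drifted", "Empty"]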

Additional Important Parameters

There are several other useful parameters worth mentioning:

  • Drifted: a disruption reason that targets nodes which have drifted from their desired state (for example after AMI or NodeClass changes)
  • Empty: a disruption reason that targets nodes with no running pods
  • expireAfter: sets a maximum lifetime for nodes (great for maintaining fresh infrastructure). For this to be relevant, you will need some dynamic mechanism that can find the recommended AMI for your architecture, for example (see the sketch below for wiring the result into the EC2NodeClass):
    ARM_AMI_ID="$(aws ssm get-parameter --name /aws/service/eks/optimized-ami/${K8S_VERSION}/amazon-linux-2-arm64/recommended/image_id --query Parameter.Value --output text)"
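
One way to use that value (a sketch, assuming you template your manifests before applying them) is to pin the EC2NodeClass to the resolved AMI ID via amiSelectorTerms:

amiSelectorTerms:
  - id: "${ARM_AMI_ID}"

Alternatively, an alias such as al2023@latest lets Karpenter track the recommended AMI for you, at the cost of less control over when nodes are considered drifted.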

 

Tips:

  1. Instance Type Strategy: When defining instance types, include a good mix of families to give Karpenter flexibility in placement decisions.
  2. Consolidation Timing: The consolidateAfter parameter is crucial – too short might cause thrashing, too long might miss optimization opportunities.
  3. Business Hours Management: The budget schedule feature is fantastic for maintaining different scaling patterns during business hours versus off-hours.

You may have noticed this block in the NodePool spec above:

nodeClassRef:
  group: karpenter.k8s.aws
  kind: EC2NodeClass
  name: default

NodeClasses are the second piece of the puzzle! In a nutshell, NodeClasses are templates for your EC2 instances. They define the fundamental characteristics of the nodes that Karpenter will create. Let me break it down with a simple example:

apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: default
spec:
  # Base image selection (Amazon Linux 2)
  amiSelectorTerms:
    - alias: al2@latest

  # IAM role the nodes will use (name follows the getting-started convention)
  role: "KarpenterNodeRole-${CLUSTER_NAME}"

  # Where to place the nodes
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: "${CLUSTER_NAME}"

  # Security settings
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: "${CLUSTER_NAME}"

  # Storage configuration
  blockDeviceMappings:
    - deviceName: /dev/xvda
      ebs:
        volumeSize: 100Gi
        volumeType: gp3
        encrypted: true

Key Components:

  1. AMI Selection: Determines the base image for your nodes
    • Options include AL2, AL2023, Bottlerocket, Windows, and custom AMIs
  2. Network Setup:
    • Which VPC subnets to use
    • Security group configurations
    • Network interfaces
  3. Storage Configuration:
    • Disk size and type
    • Encryption settings
    • Additional volumes if needed
  4. Instance Identity:
    • IAM roles
    • Tags
    • User data scripts (see the sketch after this list)
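
As hinted at in point 4, the EC2NodeClass is also where instance tags and bootstrap user data live. A small, hypothetical addition to the spec above (the tag and the script are placeholders):

spec:
  tags:
    team: platform
  userData: |
    #!/bin/bash
    echo "running extra bootstrap steps" >> /var/log/bootstrap.log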

Think of NodeClasses as the “hardware store” part of Karpenter – they define what raw materials (EC2 instances) you have available to build with. The NodePool then uses these materials to construct the actual nodes based on your workload needs.

The big advantage? You can maintain different NodeClasses for different types of workloads.

Not done yet!

With all these concepts covered and with proper instance type configuration, I could finally see spot-to-spot consolidation take place, but then I hit a new roadblock: some pods got stuck in the ContainerCreating status. This happened to me in a specific scenario:

Note: Karpenter has a troubleshooting page you might want to check.

To prevent IP exhaustion, a secondary CIDR had been set up in this EKS cluster. This approach has a clear benefit: pods get their own dedicated subnet instead of sharing IPs with the primary subnet. However, there’s a catch – when you implement a secondary CIDR, one ENI gets reserved for the node itself, making it unavailable for pod IP allocation.

Let’s look at a concrete example using a t2.xlarge instance:

  • Without secondary CIDR: All ENIs are available for pod IP allocation
    • 3 ENIs × 15 IPs per ENI = 45 potential pod IPs
  • With secondary CIDR: One ENI is reserved
    • 2 ENIs × 15 IPs per ENI = 30 potential pod IPs
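
For reference, the EKS-optimized AMIs use a slightly stricter formula (one IP per ENI is kept for the ENI’s primary address, plus 2 pods on host networking), which for a t2.xlarge gives:

max pods = ENIs × (IPs per ENI − 1) + 2
without custom networking: 3 × (15 − 1) + 2 = 44
with custom networking:    (3 − 1) × (15 − 1) + 2 = 30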

Here’s the crucial part: Karpenter isn’t aware of this secondary CIDR configuration by default. You need to set the reservedENIs setting to 1 (the default is 0) so that Karpenter subtracts one ENI from its calculations when determining the total number of pod IPs available for a given instance type.
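
If you deploy Karpenter with the official Helm chart, that would look something like this (a sketch; the same setting is exposed to the controller as the RESERVED_ENIS environment variable):

settings:
  reservedENIs: "1"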

Last But Not Least: Addressing Spot Instance Challenges

Speaking of spot instance management, if you’re concerned about the 2-minute spot instance eviction notice limitation, there’s an interesting solution worth checking out: https://www.cloudpilot.ai/. I recently had the pleasure of chatting with their team, and while I haven’t personally tested their product yet, their approach sounds promising. They seem to be using Regression Random Forests for Spot Price Prediction, which could potentially give us a longer heads-up on potential spot interruptions compared to AWS’s standard 2-minute notice. It’s definitely on my list to try out, as anything that can help us better manage spot instance volatility while maintaining cost efficiency is worth exploring.

 
