Installing KEDA (Kubernetes Event-driven Autoscaler)

KEDA adds event-driven scaling capabilities on top of the Kubernetes Horizontal Pod Autoscaler (HPA). It listens to external metrics and event sources and scales workloads (Deployments, StatefulSets, etc.) accordingly. With KEDA, you don't have to write custom controllers or rely purely on Kubernetes' default CPU/memory-based scaling.

KEDA operates in two main ways:

  1. Metric Feeding: KEDA connects to event sources (e.g., RabbitMQ, Prometheus) and exposes their data as custom metrics in Kubernetes, which are then used to scale the targeted workloads.

  2. Event Source Autoscaling: KEDA can scale workloads down to 0 when no events exist, saving compute costs when there is no demand.

KEDA augments the standard Horizontal Pod Autoscaler (HPA) with event-driven triggers.

KEDA can automatically scale workloads to 0 when no events are being produced (in contrast to Kubernetes' HPA, which typically keeps at least one replica running).
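As a minimal, illustrative sketch of scale-to-zero (the deployment name, metric, and threshold are placeholders, not part of the LumenVox configuration), such a ScaledObject could look like this:

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: example-scale-to-zero
spec:
  scaleTargetRef:
    name: example-deployment # Placeholder deployment name
  minReplicaCount: 0 # Allow KEDA to remove all replicas when idle
  maxReplicaCount: 5
  triggers:
  - type: prometheus
    metadata:
      serverAddress: http://prometheus-server.monitoring.svc.cluster.local
      threshold: "10" # Placeholder threshold
      query: sum(example_active_requests) # Placeholder query

When the query reports no activity, KEDA removes all replicas; once activity returns, it scales the workload back up and hands further scaling over to the HPA.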


NOTE: KEDA requires a minimum Kubernetes version of 1.27
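You can confirm the cluster version before installing, for example:

kubectl version

The Server Version line in the output shows the Kubernetes version the cluster is running.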


Installation

KEDA can be deployed into any Kubernetes cluster using its Helm chart, static manifests, or an operator. Here is an example Helm command:

helm repo add kedacore https://kedacore.github.io/charts
helm repo update
helm install keda kedacore/keda --namespace keda --create-namespace
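After installation, you can verify that the KEDA components are running (pod names vary slightly between chart versions, but you should see the KEDA operator and its metrics API server):

kubectl get pods -n keda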

YAML Scaling File Examples

The following configurations show how to scale a deployment based on Prometheus metrics using ScaledObjects: a Prometheus query is defined, and its result is compared against a specified threshold to determine when scaling should occur.

Each ScaledObject also defines the minimum and maximum number of replicas the workload can scale between.

In KEDA, the cooldownPeriod is the duration (in seconds) that KEDA waits after the last trigger reported the workload as active before scaling it back down. It ensures that the workload remains stable even if there are minor fluctuations in the metrics. Note that the cooldownPeriod applies to KEDA's own scale-to-zero behavior; scaling between the minimum and maximum replica counts is delegated to the underlying Horizontal Pod Autoscaler.

When used with a Prometheus trigger, the cooldownPeriod functions the same way: it governs how long KEDA waits before scaling the workload down after the Prometheus query stops reporting activity (i.e., its result drops to zero or below the activation threshold, if one is set).
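Because the examples below use minReplicaCount: 1, most scale-down decisions are made by the underlying HPA rather than by cooldownPeriod. If you need finer control over that behavior, KEDA can pass HPA scaling behavior through the ScaledObject's advanced section. The following fragment is an illustrative sketch (the values are examples, not recommendations) that would sit under spec in any of the ScaledObjects below:

  advanced:
    horizontalPodAutoscalerConfig:
      behavior:
        scaleDown:
          stabilizationWindowSeconds: 300 # Wait 5 minutes of consistently low metrics before removing replicas
          policies:
          - type: Percent
            value: 50 # Remove at most 50% of replicas per period
            periodSeconds: 60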

You can define scaling for various services, some of which can be seen below.

keda-asr-prom-scale.yaml
---
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: asr-en
spec:
  scaleTargetRef:
    kind: Deployment # Default
    name: asr-en
  pollingInterval: 10 # Default 30
  cooldownPeriod: 300 # Default 300
  minReplicaCount: 1 # Default 0
  maxReplicaCount: 10 # Default 100
  triggers:
  - type: prometheus
    metadata:
      serverAddress: http://prometheus-server.prometheus.svc.cluster.local
      metricName: asr_active_asr_requests # Grammar based ASR interactions
      threshold: "55"
      query: sum(asr_active_asr_requests{app="asr-en"})
  - type: prometheus
    metadata:
      serverAddress: http://prometheus-server.prometheus.svc.cluster.local
      metricName: asr_active_transcription_requests # Transcription based ASR interactions
      threshold: "30"
      query: sum(asr_active_transcription_requests{app="asr-en"})
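Before applying a ScaledObject, it can be useful to confirm that the Prometheus query returns data. For example, from a pod inside the cluster (or through a port-forward to the Prometheus service), you can call the Prometheus HTTP API directly:

curl -s 'http://prometheus-server.prometheus.svc.cluster.local/api/v1/query' \
  --data-urlencode 'query=sum(asr_active_asr_requests{app="asr-en"})'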
 
keda-lumenvox-api-prom-scale.yaml
---
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: lumenvox-api
spec:
  scaleTargetRef:
    kind: Deployment # Default
    name: lumenvox-api
  pollingInterval: 10 # Default 30
  cooldownPeriod: 300 # Default 300
  minReplicaCount: 1 # Default 0
  maxReplicaCount: 10 # Default 100
  triggers:
  - type: prometheus
    metadata:
      serverAddress: http://prometheus-server.monitoring.svc.cluster.local
      metricName: lumenvox_api_active_requests
      threshold: "100"
      query: sum(lumenvox_api_active_requests{app="lumenvox-api"})
 
 
keda-session-prom-scale.yaml
---
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: session
spec:
  scaleTargetRef:
    kind: Deployment # Default
    name: session
  pollingInterval: 10 # Default 30
  cooldownPeriod: 300 # Default 300
  minReplicaCount: 1 # Default 0
  maxReplicaCount: 10 # Default 100
  triggers:
  - type: prometheus
    metadata:
      serverAddress: http://prometheus-server.monitoring.svc.cluster.local
      metricName: session_active_streams
      threshold: "100"
      query: sum(session_active_streams{app="session"})
 
 
keda-grammar-prom-scale.yaml
---
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: grammar
spec:
  scaleTargetRef:
    kind: Deployment # Default
    name: grammar
  pollingInterval: 10 # Default 30
  cooldownPeriod: 300 # Default 300
  minReplicaCount: 1 # Default 0
  maxReplicaCount: 10 # Default 100
  triggers:
  - type: prometheus
    metadata:
      serverAddress: http://prometheus-server.monitoring.svc.cluster.local
      metricName: grammar_active_grammars
      threshold: "1000"
      query: sum(grammar_active_grammars{app="grammar"})
 
 
keda-vad-prom-scale.yaml
---
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: vad
spec:
  scaleTargetRef:
    kind: Deployment # Default
    name: vad
  pollingInterval: 10 # Default 30
  cooldownPeriod: 300 # Default 300
  minReplicaCount: 1 # Default 0
  maxReplicaCount: 10 # Default 100
  triggers:
  - type: prometheus
    metadata:
      serverAddress: http://prometheus-server.monitoring.svc.cluster.local
      metricName: vad_active_requests
      threshold: "100"
      query: sum(vad_active_requests{app="vad"})
 
 
keda-tts-prom-scale.yaml
---
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: neural-tts
spec:
  scaleTargetRef:
    kind: Deployment # Default
    name: neural-tts-en-us
  pollingInterval: 10 # Default 30
  cooldownPeriod: 300 # Default 300
  minReplicaCount: 1 # Default 0
  maxReplicaCount: 10 # Default 100
  triggers:
  - type: prometheus
    metadata:
      serverAddress: http://prometheus-server.monitoring.svc.cluster.local
      metricName: <tts-metric-name> # Placeholder: the Prometheus metric exposed by the TTS service
      threshold: "100" # Placeholder: adjust to your deployment
      query: sum(<tts-metric-name>{app="neural-tts-en-us"})

Applying a Manifest File

The following command applies a Kubernetes manifest file (in this case named keda-<service>-prom-scale.yaml) and creates the Kubernetes resources defined within it. In this context, the file defines a KEDA ScaledObject that configures autoscaling for a specific service (<service>) based on Prometheus metrics.

kubectl apply -f keda-<service>-prom-scale.yaml

List ScaledObjects

The following command is used to list KEDA ScaledObjects in your Kubernetes cluster. A ScaledObject is a custom resource provided by KEDA, used to define autoscaling configuration for a specific Kubernetes workload (e.g., Deployment).

This command queries the Kubernetes API to show all the ScaledObjects currently running in the cluster or in a specific namespace.

kubectl get scaledobject -n lumenvox

This gives output similar to the following:

NAME           SCALETARGETKIND      SCALETARGETNAME   MIN    MAX   TRIGGERS     AUTHENTICATION   READY    ACTIVE   FALLBACK   PAUSED    AGE
asr-en         apps/v1.Deployment   asr-en            1      10    prometheus                    True    False     False      Unknown   44h
grammar        apps/v1.Deployment   grammar           1      10    prometheus                    True    True      False      Unknown   44h
lumenvox-api   apps/v1.Deployment   lumenvox-api      1      10    prometheus                    True    False     False      Unknown   44h
session        apps/v1.Deployment   session           1      10    prometheus                    True    False     False      Unknown   44h
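Behind the scenes, KEDA creates a Horizontal Pod Autoscaler for each ScaledObject, named keda-hpa-<scaledobject-name>. These can be inspected with:

kubectl get hpa -n lumenvox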

Node Scaling

Scaling nodes in a Kubernetes cluster ensures that you have enough resources available (CPU, memory, storage, etc.) to handle growing application workloads and ensure high availability. However, it's also important to scale nodes back down to manage costs effectively. The recommended method of scaling nodes in Kubernetes depends on your use case, cluster setup, and cost/resource goals.


Note: the following are simply examples. We strongly recommend you read the Kubernetes documentation to determine the best scaling method for your own use-cases and budget. Do not simply use one of the following without carefully considering the costs and behavior.

Also note that Capacity does not claim expertise in these different approaches. We again recommend that you choose your method carefully.


Here are the commonly recommended methods for scaling nodes in Kubernetes:

1. Cluster Autoscaler (Best Practice for Node Scaling)

The Cluster Autoscaler is the most widely used and recommended method for scaling nodes in Kubernetes. It's an open-source project developed by the Kubernetes community.

How It Works:

  • Automatically adjusts the size of your Kubernetes node pool based on the pending workload.

  • It adds nodes when:

    • Pods can’t be scheduled due to insufficient resources on existing nodes (e.g., CPU, memory).

  • It removes nodes when:

    • Nodes are underutilized and the pods running on them can be scheduled on other nodes.
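As an illustration only (the cluster name and region are placeholders, and the required IAM/RBAC setup is omitted), the Cluster Autoscaler can be installed on AWS via its Helm chart with node-group auto-discovery:

helm repo add autoscaler https://kubernetes.github.io/autoscaler
helm repo update
helm install cluster-autoscaler autoscaler/cluster-autoscaler \
  --namespace kube-system \
  --set autoDiscovery.clusterName=<your-cluster-name> \
  --set awsRegion=<your-region>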

 

2. Manual Node Scaling

Manual scaling involves explicitly adding or removing nodes in your cluster. For example:

  • Adding nodes by increasing the number of virtual machines or instances in your cloud provider.

  • Removing nodes when they are no longer needed using the cloud dashboard, CLI, or API.
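For example, the managed offerings provide CLI commands for resizing node groups by hand (cluster, node-group, and pool names are placeholders):

# Amazon EKS (eksctl)
eksctl scale nodegroup --cluster=<cluster-name> --name=<nodegroup-name> --nodes=5

# Google GKE
gcloud container clusters resize <cluster-name> --node-pool=<pool-name> --num-nodes=5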

     

3. Node Autoscalers Provided by Cloud Providers

Many Kubernetes cloud platforms (e.g., AWS EKS, GCP GKE, Azure AKS, etc.) come with managed node autoscaling tools that abstract away the complexities of configuring a Cluster Autoscaler.
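For example, GKE and AKS can enable autoscaling on an existing node pool with a single command (names and limits are placeholders):

# Google GKE
gcloud container clusters update <cluster-name> --node-pool=<pool-name> \
  --enable-autoscaling --min-nodes=1 --max-nodes=10

# Azure AKS
az aks nodepool update --resource-group <resource-group> --cluster-name <cluster-name> \
  --name <pool-name> --enable-cluster-autoscaler --min-count 1 --max-count 10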

 

4. Using Karpenter (Alternative to Cluster Autoscaler)

Karpenter is an open-source project developed by AWS as an alternative to the Cluster Autoscaler. It's designed to scale nodes quickly and dynamically without relying on pre-defined capacity in node groups.
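As a rough sketch only (the schema differs between Karpenter versions and this resource also depends on a separately defined EC2NodeClass, so consult the Karpenter documentation for your version), a NodePool on AWS might look like this:

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: default
spec:
  template:
    spec:
      requirements:
      - key: kubernetes.io/arch
        operator: In
        values: ["amd64"]
      - key: karpenter.sh/capacity-type
        operator: In
        values: ["on-demand"]
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default # References a separately defined EC2NodeClass
  limits:
    cpu: "100" # Cap the total CPU Karpenter may provision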

 

5. Spot Instances for Cost-Effective Scaling

For cost optimization, you can combine autoscalers with spot/low-priority instances offered by many clouds (e.g., AWS Spot EC2, GCP Preemptible VMs, Azure Low Priority Nodes).
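Interruption-tolerant workloads can then be steered onto those nodes with standard scheduling constraints. As an illustration (the label key differs per provider and tooling; this example assumes Karpenter's capacity-type label), you might add a nodeSelector to a Deployment's pod template:

  template:
    spec:
      nodeSelector:
        karpenter.sh/capacity-type: spot # Schedule onto spot-capacity nodes (label key depends on your tooling)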

 

6. On-Premises Clusters: DIY Solutions

For on-prem or self-managed Kubernetes clusters (e.g., via kubeadm), the process can involve:

  • Manually adding/removing physical/virtual machines to support workloads.

  • Using Cluster Autoscaler on custom infrastructure (e.g., using API integrations with your VM provider).
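For kubeadm-based clusters, adding a node typically means generating a join command on the control plane and running it on the new machine:

# On the control plane: print a join command with a fresh token
kubeadm token create --print-join-command

# On the new node: run the printed command, which has the form
kubeadm join <control-plane-host>:6443 --token <token> --discovery-token-ca-cert-hash sha256:<hash>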

 

Best Practices for Scaling Nodes

  1. Use Cluster Autoscaler:

    • If running in a cloud environment, set reasonable minimum and maximum node counts on each node group or pool to bound capacity and cost.

  2. Define Proper Resource Requests and Limits:

    • Ensure every pod in your cluster has well-defined CPU and memory requests, as autoscalers rely on these values to decide scaling.

  3. Use Pod Disruption Budgets (PDBs):

    • Ensure pods of critical workloads are not disrupted during scaling events (see the example after this list).

  4. Workload-Specific Node Pools:

    • Create separate node pools for workloads with unique requirements (e.g., GPU, memory-intensive workloads).

  5. Monitor Node Usage:

    • Use monitoring tools like Prometheus, Grafana, or the cloud provider's metrics dashboards to track node utilization and autoscaler effectiveness.
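For example (item 3 above), a PodDisruptionBudget that keeps at least one lumenvox-api replica available during voluntary disruptions such as node scale-down could look like this, assuming the pods carry an app: lumenvox-api label matching the Prometheus queries shown earlier:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: lumenvox-api-pdb
  namespace: lumenvox
spec:
  minAvailable: 1 # Keep at least one replica running during voluntary disruptions
  selector:
    matchLabels:
      app: lumenvox-api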

