Installing KEDA (Kubernetes Event-driven Autoscaler)
KEDA provides event-driven scaling capabilities on top of the Kubernetes Horizontal Pod Autoscaler (HPA). It listens to external metrics and event sources and scales workloads (Deployments, StatefulSets, etc.) accordingly. With KEDA, you don't have to write custom controllers or rely purely on Kubernetes' default CPU/memory-based scaling.
KEDA operates in two main ways:
Metric Feeding: KEDA connects to event sources (e.g., RabbitMQ, Prometheus) and exposes their data as custom metrics in Kubernetes, so that workloads can be scaled against them.
Event Source Autoscaling: KEDA can scale workloads down to 0 when no events exist, saving compute costs when there is no demand.
In short, KEDA augments the standard Horizontal Pod Autoscaler (HPA) with event-driven triggers and can automatically scale workloads to 0 when no events are being produced (in contrast to Kubernetes' HPA, which typically keeps at least one replica running).
NOTE: KEDA requires a minimum Kubernetes version of 1.27
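You can confirm the Kubernetes version of your cluster before installing, for example with the following commands (the server version reported by the first command and the VERSION column of the second are what matter here):
kubectl version
kubectl get nodes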
Installation
KEDA can be deployed into any Kubernetes cluster using its Helm chart, static manifests, or an operator. Here are example Helm commands:
helm repo add kedacore https://kedacore.github.io/charts
helm repo update
helm install keda kedacore/keda --namespace keda --create-namespace
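Once the chart has been installed, you can verify that the KEDA pods are running and that its custom resource definitions were created (exact pod names can vary slightly between chart versions):
# KEDA operator, metrics API server, and admission webhook pods should be Running
kubectl get pods -n keda
# KEDA CRDs, e.g. scaledobjects.keda.sh and triggerauthentications.keda.sh
kubectl get crd | grep keda.sh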
YAML Scaling File Examples
The following configurations show how to scale a deployment based on Prometheus metrics using ScaledObjects. Each ScaledObject defines a Prometheus query and a threshold that is checked to determine when scaling should occur, along with the minimum and maximum number of replicas the deployment can be scaled to.
In KEDA, the cooldownPeriod is the duration (in seconds) that KEDA waits after the last trigger was reported active before scaling the workload back down. It ensures that the workload remains stable even if there are minor fluctuations in the metrics. When used with a Prometheus trigger, the cooldownPeriod functions the same way: it controls how long KEDA waits before scaling the workload down after the metric falls below the specified threshold.
You can define scaling for various services, some of which can be seen below.
keda-asr-prom-scale.yaml
---
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: asr-en
spec:
  scaleTargetRef:
    kind: Deployment      # Default
    name: asr-en
  pollingInterval: 10     # Default 30
  cooldownPeriod: 300     # Default 300
  minReplicaCount: 1      # Default 0
  maxReplicaCount: 10     # Default 100
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus-server.prometheus.svc.cluster.local
        metricName: asr_active_asr_requests             # Grammar based ASR interactions
        threshold: "55"
        query: sum(asr_active_asr_requests{app="asr-en"})
    - type: prometheus
      metadata:
        serverAddress: http://prometheus-server.prometheus.svc.cluster.local
        metricName: asr_active_transcription_requests   # Transcription based ASR interactions
        threshold: "30"
        query: sum(asr_active_transcription_requests{app="asr-en"})

keda-lumenvox-api-prom-scale.yaml
---
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: lumenvox-api
spec:
  scaleTargetRef:
    kind: Deployment      # Default
    name: lumenvox-api
  pollingInterval: 10     # Default 30
  cooldownPeriod: 300     # Default 300
  minReplicaCount: 1      # Default 0
  maxReplicaCount: 10     # Default 100
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus-server.monitoring.svc.cluster.local
        metricName: lumenvox_api_active_requests
        threshold: "100"
        query: sum(lumenvox_api_active_requests{app="lumenvox-api"})

keda-session-prom-scale.yaml
---
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: session
spec:
  scaleTargetRef:
    kind: Deployment      # Default
    name: session
  pollingInterval: 10     # Default 30
  cooldownPeriod: 300     # Default 300
  minReplicaCount: 1      # Default 0
  maxReplicaCount: 10     # Default 100
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus-server.monitoring.svc.cluster.local
        metricName: session_active_streams
        threshold: "100"
        query: sum(session_active_streams{app="session"})

keda-grammar-prom-scale.yaml
---
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: grammar
spec:
  scaleTargetRef:
    kind: Deployment      # Default
    name: grammar
  pollingInterval: 10     # Default 30
  cooldownPeriod: 300     # Default 300
  minReplicaCount: 1      # Default 0
  maxReplicaCount: 10     # Default 100
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus-server.monitoring.svc.cluster.local
        metricName: grammar_active_grammars
        threshold: "1000"
        query: sum(grammar_active_grammars{app="grammar"})

keda-vad-prom-scale.yaml
---
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: vad
spec:
  scaleTargetRef:
    kind: Deployment      # Default
    name: vad
  pollingInterval: 10     # Default 30
  cooldownPeriod: 300     # Default 300
  minReplicaCount: 1      # Default 0
  maxReplicaCount: 10     # Default 100
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus-server.monitoring.svc.cluster.local
        metricName: vad_active_requests
        threshold: "100"
        query: sum(vad_active_requests{app="vad"})

keda-tts-prom-scale.yaml
---
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: neural-tts
spec:
  scaleTargetRef:
    kind: Deployment      # Default
    name: neural-tts-en-us
  pollingInterval: 10     # Default 30
  cooldownPeriod: 300     # Default 300
  minReplicaCount: 1      # Default 0
  maxReplicaCount: 10     # Default 100
  triggers:
    - type: prometheus
Applying a Manifest File
The following command applies a Kubernetes manifest file (in this case named keda-<service>-prom-scale.yaml) and creates the Kubernetes resources defined within it. In this context, the file defines a KEDA ScaledObject that configures autoscaling for a specific service (<service>) based on Prometheus metrics.
kubectl apply -f keda-<service>-prom-scale.yaml
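Behind the scenes, KEDA creates a standard HPA for each ScaledObject (typically named keda-hpa-<scaledobject-name>), which you can inspect after applying the manifest:
kubectl get hpa -n lumenvox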
List ScaledObjects
The following command is used to list KEDA ScaledObjects in your Kubernetes cluster. A ScaledObject is a custom resource provided by KEDA, used to define autoscaling configuration for a specific Kubernetes workload (e.g., Deployment).
This command queries the Kubernetes API to show all ScaledObjects defined in a specific namespace (or across the whole cluster with --all-namespaces).
kubectl get scaledobject -n lumenvox
This gives output similar to the following:
NAME           SCALETARGETKIND      SCALETARGETNAME   MIN   MAX   TRIGGERS     AUTHENTICATION   READY   ACTIVE   FALLBACK   PAUSED    AGE
asr-en         apps/v1.Deployment   asr-en            1     10    prometheus                    True    False    False      Unknown   44h
grammar        apps/v1.Deployment   grammar           1     10    prometheus                    True    True     False      Unknown   44h
lumenvox-api   apps/v1.Deployment   lumenvox-api      1     10    prometheus                    True    False    False      Unknown   44h
session        apps/v1.Deployment   session           1     10    prometheus                    True    False    False      Unknown   44h
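If a ScaledObject does not show READY as True, you can inspect its conditions and events, and check the KEDA operator logs (the operator deployment name below assumes the Helm installation shown earlier):
kubectl describe scaledobject asr-en -n lumenvox
kubectl logs -n keda deployment/keda-operator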
Node Scaling
Scaling nodes in a Kubernetes cluster ensures that you have enough resources available (CPU, memory, storage, etc.) to handle growing application workloads and ensure high availability. However, it's also important to scale nodes back down to manage costs effectively. The recommended method of scaling nodes in Kubernetes depends on your use case, cluster setup, and cost/resource goals.
Note: the following are simply examples. We strongly recommend you read the Kubernetes documentation to determine the best scaling method for your own use-cases and budget. Do not simply use one of the following without carefully considering the costs and behavior.
Also note that Capacity does not claim expertise in these different approaches; we again recommend that you choose your method carefully.
Here are the commonly recommended methods for scaling nodes in Kubernetes:
1. Cluster Autoscaler (Best Practice for Node Scaling)
The Cluster Autoscaler is the most widely used and recommended method for scaling nodes in Kubernetes. It's an open-source project developed by the Kubernetes community.
How It Works:
Automatically adjusts the size of your Kubernetes node pool based on the pending workload.
It adds nodes when:
Pods can’t be scheduled due to insufficient resources on existing nodes (e.g., CPU, memory).
It removes nodes when:
Nodes are underutilized and the pods running on them can be scheduled on other nodes.
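As an illustration only, the Cluster Autoscaler can be installed from its community Helm chart. The exact values depend on your cloud provider and cluster setup; the cluster name, region, and cloud provider below are placeholders you would replace for your environment:
helm repo add autoscaler https://kubernetes.github.io/autoscaler
helm repo update
# Example values for an AWS cluster using auto-discovery of node groups
helm install cluster-autoscaler autoscaler/cluster-autoscaler \
  --namespace kube-system \
  --set cloudProvider=aws \
  --set autoDiscovery.clusterName=<your-cluster-name> \
  --set awsRegion=<your-region>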
2. Manual Node Scaling
Manual scaling involves explicitly adding or removing nodes in your cluster. For example:
Adding nodes by increasing the number of virtual machines or instances in your cloud provider.
Removing nodes when they are no longer needed using the cloud dashboard, CLI, or API.
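With managed Kubernetes offerings, manual scaling is usually a single CLI call. For example (cluster, node group, and node pool names are placeholders; location flags such as --zone may also be required):
# GKE: resize a node pool to 5 nodes
gcloud container clusters resize <cluster-name> --node-pool <pool-name> --num-nodes 5 --zone <zone>
# EKS (with eksctl): scale a managed node group
eksctl scale nodegroup --cluster <cluster-name> --name <nodegroup-name> --nodes 5
# AKS: scale a node pool
az aks nodepool scale --resource-group <resource-group> --cluster-name <cluster-name> --name <pool-name> --node-count 5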
3. Node Autoscalers Provided by Cloud Providers
Many Kubernetes cloud platforms (e.g., AWS EKS, GCP GKE, Azure AKS, etc.) come with managed node autoscaling tools that abstract away the complexities of configuring a Cluster Autoscaler.
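For example, GKE and AKS expose node-pool autoscaling directly, and EKS managed node groups offer the equivalent. The names and limits below are placeholders:
# GKE: enable autoscaling on an existing node pool
gcloud container clusters update <cluster-name> --enable-autoscaling \
  --node-pool <pool-name> --min-nodes 1 --max-nodes 10 --zone <zone>
# AKS: enable the cluster autoscaler on a node pool
az aks nodepool update --resource-group <resource-group> --cluster-name <cluster-name> \
  --name <pool-name> --enable-cluster-autoscaler --min-count 1 --max-count 10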
4. Using Karpenter (Alternative to Cluster Autoscaler)
Karpenter is an open-source project developed by AWS as an alternative to the Cluster Autoscaler. It's designed to scale nodes quickly and dynamically without relying on pre-defined capacity in node groups.
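As a rough sketch only (the exact API version and required fields, such as the node class reference, depend on your Karpenter release and cloud provider; consult the Karpenter documentation before using this), a Karpenter NodePool might look something like the following:
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: default
spec:
  template:
    spec:
      nodeClassRef:
        name: default            # references a provider node class (e.g., an AWS EC2NodeClass) defined separately
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand", "spot"]
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
  limits:
    cpu: "100"                   # cap on total CPU that this NodePool may provision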
5. Spot Instances for Cost-Effective Scaling
For cost optimization, you can combine autoscalers with spot/low-priority instances offered by many clouds (e.g., AWS Spot EC2, GCP Preemptible VMs, Azure Low Priority Nodes).
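For example, with eksctl a managed node group can be created from Spot capacity (the cluster name, instance types, and sizes below are placeholders):
eksctl create nodegroup --cluster <cluster-name> --name spot-workers \
  --spot --instance-types m5.large,m5a.large \
  --nodes-min 1 --nodes-max 10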
6. On-Premises Clusters: DIY Solutions
For on-prem or self-managed Kubernetes clusters (e.g., via kubeadm), the process can involve:
Manually adding/removing physical/virtual machines to support workloads.
Using Cluster Autoscaler on custom infrastructure (e.g., using API integrations with your VM provider).
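For example, with a kubeadm-based cluster, adding a node typically involves generating a join command on a control-plane node and running it on the new machine:
# On an existing control-plane node: print a join command with a fresh token
kubeadm token create --print-join-command
# On the new node: run the printed command (values below are placeholders)
kubeadm join <control-plane-host>:6443 --token <token> --discovery-token-ca-cert-hash sha256:<hash>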
Best Practices for Scaling Nodes
Use Cluster Autoscaler:
If running in a cloud environment, set a reasonable minNodes and maxNodes for capacity limits.
Define Proper Resource Requests and Limits:
Ensure every pod in your cluster has well-defined CPU and memory requests, as autoscalers rely on these values to decide scaling (see the sketch after this list).
Use Pod Disruption Budgets (PDBs):
Ensure pods of critical workloads are not disrupted during scaling events (a sample PodDisruptionBudget is included in the sketch after this list).
Workload-Specific Node Pools:
Create separate node pools for workloads with unique requirements (e.g., GPU, memory-intensive workloads).
Monitor Node Usage:
Use monitoring tools like Prometheus, Grafana, or the cloud provider's metrics dashboards to track node utilization and autoscaler effectiveness.
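To illustrate the resource-request and PodDisruptionBudget practices above, here is a minimal sketch. The names, namespace, and values are placeholders you would adapt to your own workloads:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-service
spec:
  replicas: 3
  selector:
    matchLabels:
      app: example-service
  template:
    metadata:
      labels:
        app: example-service
    spec:
      containers:
        - name: example-service
          image: example-service:latest
          resources:
            requests:            # used by the scheduler and autoscalers
              cpu: "500m"
              memory: "512Mi"
            limits:
              cpu: "1"
              memory: "1Gi"
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: example-service-pdb
spec:
  minAvailable: 2                # keep at least 2 pods available during voluntary disruptions (e.g., node scale-down)
  selector:
    matchLabels:
      app: example-service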