Overview
What is Armada?
Armada is a multi-Kubernetes-cluster batch job meta-scheduler designed for massive-scale workloads. Acting as a layer over Kubernetes, Armada enables organizations to distribute millions of batch jobs per day across tens of thousands of nodes spanning multiple clusters, making it an ideal solution for high-throughput computational workloads.
Armada serves as middleware that transforms Kubernetes into a powerful batch processing platform while maintaining compatibility with service workloads. It addresses the fundamental limitations of running batch workloads at scale on Kubernetes by providing:
- Multi-cluster orchestration: Schedule jobs across many Kubernetes clusters seamlessly
 - High-throughput queueing: Handle millions of queued jobs
 - Advanced batch scheduling: Fair queuing, gang scheduling, preemption, and resource limits
 - Enterprise-grade reliability: Secure, highly available components designed for production use
 
As a CNCF Sandbox project, Armada is actively maintained and used in production environments, including at G-Research where it processes millions of jobs daily.
Why Use Armada?
Kubernetes Limitations for Batch Workloads
Traditional Kubernetes faces several challenges when running batch workloads at scale:
- Single Cluster Scaling Limits: Scaling a single Kubernetes cluster beyond a certain size is challenging, typically maxing out around 5,000-15,000 nodes depending on configuration.
 - Storage Backend Constraints: Achieving very high throughput using etcd, Kubernetes' in-cluster storage backend, is challenging, and it can become a bottleneck for job queuing.
 - Inadequate Batch Scheduling: The default kube-scheduler lacks essential batch scheduling features such as fair queuing, gang scheduling, and intelligent preemption.
 
Armada's Solution
Armada overcomes these limitations by:
- Distributing across multiple clusters: managing tens of thousands of nodes spread over many Kubernetes clusters
 - Scheduling out of cluster: performing queueing and scheduling outside etcd, using storage layers specialized for high throughput
 - Providing a purpose-built batch scheduler: offering advanced scheduling features designed specifically for batch workloads
 
Key Features and Benefits
Core Scheduling Features
Fair-Use Scheduling
- Maintains fair resource share over time across users and teams
 - Based on dominant resource fairness principles
 - Includes priority factors for different queues
 - Inspired by HTCondor priority systems
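As a toy illustration of the dominant resource fairness idea behind fair-use scheduling (this is not Armada's implementation; the user names and capacities below are invented), a scheduler compares each user's dominant share, i.e. their largest fraction of any cluster resource, and favors the user with the lowest one:

```python
def dominant_share(usage, capacity):
    """Return a user's dominant resource share: the maximum over
    all resources of (user's usage / cluster capacity)."""
    return max(usage[r] / capacity[r] for r in capacity)

capacity = {"cpu": 100.0, "memory_gb": 400.0}
alice = {"cpu": 30.0, "memory_gb": 40.0}  # dominant resource: cpu -> 0.30
bob = {"cpu": 10.0, "memory_gb": 80.0}    # dominant resource: memory -> 0.20

# A DRF-style scheduler offers the next slot to the user with the
# lowest dominant share (here: bob, 0.20 < 0.30).
next_user = min({"alice": alice, "bob": bob}.items(),
                key=lambda kv: dominant_share(kv[1], capacity))[0]
print(next_user)  # -> bob
```

Armada layers per-queue priority factors on top of this kind of fairness calculation, so different teams can be weighted differently while still converging to their fair share over time.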
 
High Throughput Processing
- Handle millions of queued jobs simultaneously
 - Specialized storage layer optimized for batch workloads
 - Efficient job submission and status tracking
 
Gang Scheduling
- Atomically schedule sets of related jobs
 - Ensures all jobs in a group start together or not at all
 - Critical for distributed computing frameworks like MPI
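As a sketch of how a gang is expressed, each job in a submit file carries a shared gang identifier and the gang's size, so the scheduler places all members atomically. The annotation keys and all names below are illustrative and should be verified against your Armada version's documentation:

```yaml
# Hypothetical submit file: both jobs share a gang id with cardinality 2,
# so the scheduler starts them together or not at all.
queue: example-queue
jobSetId: gang-demo
jobs:
  - priority: 0
    annotations:
      armadaproject.io/gangId: "mpi-run-1"
      armadaproject.io/gangCardinality: "2"
    podSpec:
      restartPolicy: Never
      containers:
        - name: worker
          image: alpine:3.18
          command: ["sh", "-c", "echo worker ready && sleep 30"]
          resources:
            requests: {cpu: 500m, memory: 256Mi}
            limits: {cpu: 500m, memory: 256Mi}
  - priority: 0
    annotations:
      armadaproject.io/gangId: "mpi-run-1"   # same gang id as above
      armadaproject.io/gangCardinality: "2"
    podSpec:
      restartPolicy: Never
      containers:
        - name: worker
          image: alpine:3.18
          command: ["sh", "-c", "echo worker ready && sleep 30"]
          resources:
            requests: {cpu: 500m, memory: 256Mi}
            limits: {cpu: 500m, memory: 256Mi}
```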
 
Intelligent Preemption
- Run urgent jobs in a timely fashion
 - Balance resource allocation between users
 - Configurable preemption policies
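Preemption policy is typically expressed through priority classes in the scheduler configuration. The fragment below is purely illustrative, with assumed key and class names; consult the Armada configuration reference for the exact schema:

```yaml
# Illustrative only: exact keys vary between Armada versions.
scheduling:
  priorityClasses:
    armada-default:
      priority: 1000
      preemptible: false   # jobs in this class are not preempted
    armada-preemptible:
      priority: 1000
      preemptible: true    # may be evicted to make room for urgent work
```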
 
Enterprise-Grade Operations
Massive Scale Support
- Utilize multiple Kubernetes clusters simultaneously
 - Scale beyond single cluster limitations
 - Add and remove clusters without service disruption
 
Advanced Resource Management
- Resource and job scheduling rate limits
 - Detailed resource allocation controls
 
Comprehensive Monitoring
- Detailed analytics via Prometheus integration
 - Resource allocation and system behavior insights
 - Automatic failure detection and node removal
 
Production-Ready Features
- Secure authentication and authorization
 - High availability architecture
 - Automatic node failure handling
 
Use Cases and Success Stories
High-Performance Computing (HPC)
- Machine Learning Training: Distribute large-scale ML training jobs across multiple clusters
 - Scientific Computing: Run complex simulations and data analysis workloads
 - Financial Modeling: Execute risk calculations and quantitative analysis at scale
 
Data Processing Pipelines
- ETL Workloads: Process large datasets with parallel batch jobs
 - Data Analytics: Run distributed analytics jobs across multiple clusters
 - Backup and Archival: Coordinate large-scale data movement operations
 
CI/CD and Development
- Build Systems: Distribute compilation and testing jobs
 - Integration Testing: Run comprehensive test suites across multiple environments
 - Deployment Automation: Coordinate complex deployment workflows
 
Production Deployment at G-Research
G-Research, a leading quantitative research company, uses Armada in production to:
- Process millions of jobs per day
 - Manage tens of thousands of nodes
 - Support diverse computational workloads
 - Maintain high availability and performance
 
Comparison with Other Schedulers
vs. Native Kubernetes Scheduler
- Scale: Armada spans multiple clusters vs. single cluster limitation
 - Throughput: Millions of jobs vs. thousands with native scheduler
 - Batch Features: Purpose-built for batch vs. service-oriented design
 - Fair Scheduling: Advanced fair-use policies vs. basic priority classes
 
vs. Traditional HPC Schedulers (SLURM, PBS)
- Container Native: Built for containerized workloads vs. traditional HPC
 - Kubernetes Integration: Leverages Kubernetes ecosystem vs. isolated systems
 - Cloud Ready: Designed for cloud and hybrid environments
 - Modern APIs: REST/gRPC APIs vs. command-line interfaces
 - Rich Client Support: Client libraries available for multiple languages (Go, Java, Scala, Python, and .NET)
 
Getting Started
Warning
TODO: add links
Ready to explore Armada? Here are your next steps:
- Quick Start: Try the local installation guide to get Armada running with Kind
 - Core Concepts: Learn about jobs, queues, and scheduling
 - Production Setup: Review the operations guide for production deployment
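To give a taste of what submission looks like once Armada is running locally, a minimal job file might resemble the following sketch (the queue name, job set id, and image are placeholders, and the exact schema should be checked against the quick start guide):

```yaml
# example-job.yaml -- hypothetical minimal submit file
queue: example-queue
jobSetId: demo-job-set
jobs:
  - priority: 0
    podSpec:
      restartPolicy: Never
      containers:
        - name: hello
          image: alpine:3.18
          command: ["sh", "-c", "echo hello from Armada"]
          resources:
            requests: {cpu: 100m, memory: 64Mi}
            limits: {cpu: 100m, memory: 64Mi}
```

A file like this would be submitted with something along the lines of `armadactl submit example-job.yaml`; the quick start guide covers the exact commands.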
 
Community and Support
Warning
TODO: add links
- Documentation: Comprehensive guides and API references
 - Community Slack: Join discussions on CNCF Slack
 - GitHub: Report issues and contribute at github.com/armadaproject/armada
 - Videos: Watch overview presentations and technical deep-dives
 