Overview
What is Armada?
Armada is a multi-Kubernetes-cluster batch job meta-scheduler designed for massive-scale workloads. Acting as a layer over Kubernetes, Armada enables organizations to distribute millions of batch jobs per day across tens of thousands of nodes spanning multiple clusters, making it an ideal solution for high-throughput computational workloads.
Armada serves as middleware that transforms Kubernetes into a powerful batch processing platform while maintaining compatibility with service workloads. It addresses the fundamental limitations of running batch workloads at scale on Kubernetes by providing:
- Multi-cluster orchestration: Schedule jobs across many Kubernetes clusters seamlessly
 - High-throughput queueing: Handle millions of queued jobs
 - Advanced batch scheduling: Fair queuing, gang scheduling, preemption, and resource limits
 - Enterprise-grade reliability: Secure, highly available components designed for production use
 
As a CNCF Sandbox project, Armada is actively maintained and used in production environments, including at G-Research where it processes millions of jobs daily.
Why Use Armada?
Kubernetes Limitations for Batch Workloads
Traditional Kubernetes faces several challenges when running batch workloads at scale:
- Single Cluster Scaling Limits: Scaling a single Kubernetes cluster beyond a certain size is challenging, typically maxing out around 5,000-15,000 nodes depending on configuration.
 - Storage Backend Constraints: Achieving very high throughput using etcd, Kubernetes' in-cluster storage backend, is challenging, and it can become a bottleneck for job queuing.
 - Inadequate Batch Scheduling: The default kube-scheduler lacks essential batch scheduling features such as fair queuing, gang scheduling, and intelligent preemption.
 
Armada's Solution
Armada overcomes these limitations by:
- Distributing across multiple clusters: managing tens of thousands of nodes spread over many Kubernetes clusters
 - Scheduling out of cluster: performing queueing and scheduling outside etcd, using storage layers specialized for high throughput
 - Providing a purpose-built batch scheduler: offering advanced scheduling features designed specifically for batch workloads
 
Key Features and Benefits
Core Scheduling Features
Fair-Use Scheduling
- Maintains fair resource share over time across users and teams
 - Based on dominant resource fairness principles
 - Includes priority factors for different queues
 - Inspired by HTCondor priority systems
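As a toy illustration of the dominant resource fairness idea behind fair-use scheduling (this is not Armada's implementation; the user names and capacities below are invented), a scheduler compares each user's dominant share, i.e. their largest fraction of any cluster resource, and favors the user with the lowest one:

```python
def dominant_share(usage, capacity):
    """Return a user's dominant resource share: the maximum over
    all resources of (user's usage / cluster capacity)."""
    return max(usage[r] / capacity[r] for r in capacity)

capacity = {"cpu": 100.0, "memory_gb": 400.0}
alice = {"cpu": 30.0, "memory_gb": 40.0}  # dominant resource: cpu -> 0.30
bob = {"cpu": 10.0, "memory_gb": 80.0}    # dominant resource: memory -> 0.20

# A DRF-style scheduler offers the next slot to the user with the
# lowest dominant share (here: bob, 0.20 < 0.30).
next_user = min({"alice": alice, "bob": bob}.items(),
                key=lambda kv: dominant_share(kv[1], capacity))[0]
print(next_user)  # -> bob
```

Armada layers per-queue priority factors on top of this kind of fairness calculation, so different teams can be weighted differently while still converging to their fair share over time.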
 
High Throughput Processing
- Handle millions of queued jobs simultaneously
 - Specialized storage layer optimized for batch workloads
 - Efficient job submission and status tracking
 
Gang Scheduling
- Atomically schedule sets of related jobs
 - Ensures all jobs in a group start together or not at all
 - Critical for distributed computing frameworks like MPI
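As a sketch of how a gang is expressed, each job in a submit file carries a shared gang identifier and the gang's size, so the scheduler places all members atomically. The annotation keys and all names below are illustrative and should be verified against your Armada version's documentation:

```yaml
# Hypothetical submit file: both jobs share a gang id with cardinality 2,
# so the scheduler starts them together or not at all.
queue: example-queue
jobSetId: gang-demo
jobs:
  - priority: 0
    annotations:
      armadaproject.io/gangId: "mpi-run-1"
      armadaproject.io/gangCardinality: "2"
    podSpec:
      restartPolicy: Never
      containers:
        - name: worker
          image: alpine:3.18
          command: ["sh", "-c", "echo worker ready && sleep 30"]
          resources:
            requests: {cpu: 500m, memory: 256Mi}
            limits: {cpu: 500m, memory: 256Mi}
  - priority: 0
    annotations:
      armadaproject.io/gangId: "mpi-run-1"   # same gang id as above
      armadaproject.io/gangCardinality: "2"
    podSpec:
      restartPolicy: Never
      containers:
        - name: worker
          image: alpine:3.18
          command: ["sh", "-c", "echo worker ready && sleep 30"]
          resources:
            requests: {cpu: 500m, memory: 256Mi}
            limits: {cpu: 500m, memory: 256Mi}
```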
 
Intelligent Preemption
- Run urgent jobs in a timely fashion
 - Balance resource allocation between users
 - Configurable preemption policies
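Preemption policy is typically expressed through priority classes in the scheduler configuration. The fragment below is purely illustrative, with assumed key and class names; consult the Armada configuration reference for the exact schema:

```yaml
# Illustrative only: exact keys vary between Armada versions.
scheduling:
  priorityClasses:
    armada-default:
      priority: 1000
      preemptible: false   # jobs in this class are not preempted
    armada-preemptible:
      priority: 1000
      preemptible: true    # may be evicted to make room for urgent work
```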
 
Enterprise-Grade Operations
Massive Scale Support
- Utilize multiple Kubernetes clusters simultaneously
 - Scale beyond single cluster limitations
 - Add and remove clusters without service disruption
 
Advanced Resource Management
- Resource and job scheduling rate limits
 - Detailed resource allocation controls
 
Comprehensive Monitoring
- Detailed analytics via Prometheus integration
 - Resource allocation and system behavior insights
 - Automatic failure detection and node removal
 
Production-Ready Features
- Secure authentication and authorization
 - High availability architecture
 - Automatic node failure handling
 
Use Cases and Success Stories
High-Performance Computing (HPC)
- Machine Learning Training: Distribute large-scale ML training jobs across multiple clusters
 - Scientific Computing: Run complex simulations and data analysis workloads
 - Financial Modeling: Execute risk calculations and quantitative analysis at scale
 
Data Processing Pipelines
- ETL Workloads: Process large datasets with parallel batch jobs
 - Data Analytics: Run distributed analytics jobs across multiple clusters
 - Backup and Archival: Coordinate large-scale data movement operations
 
CI/CD and Development
- Build Systems: Distribute compilation and testing jobs
 - Integration Testing: Run comprehensive test suites across multiple environments
 - Deployment Automation: Coordinate complex deployment workflows
 
Production Deployment at G-Research
G-Research, a leading quantitative research company, uses Armada in production to:
- Process millions of jobs per day
 - Manage tens of thousands of nodes
 - Support diverse computational workloads
 - Maintain high availability and performance
 
Comparison with Other Schedulers
vs. Native Kubernetes Scheduler
- Scale: Armada spans multiple clusters vs. single cluster limitation
 - Throughput: Millions of jobs vs. thousands with native scheduler
 - Batch Features: Purpose-built for batch vs. service-oriented design
 - Fair Scheduling: Advanced fair-use policies vs. basic priority classes
 
vs. Traditional HPC Schedulers (SLURM, PBS)
- Container Native: Built for containerized workloads vs. traditional HPC
 - Kubernetes Integration: Leverages Kubernetes ecosystem vs. isolated systems
 - Cloud Ready: Designed for cloud and hybrid environments
 - Modern APIs: REST/gRPC APIs vs. command-line interfaces
 - Rich Client Support: Client libraries available for multiple languages (Go, Java, Scala, Python, and .NET)
 
Getting Started
Warning
TODO: add links
Ready to explore Armada? Here are your next steps:
- Quick Start: Try the local installation guide to get Armada running with Kind
 - Core Concepts: Learn about jobs, queues, and scheduling
 - Production Setup: Review the operations guide for production deployment
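To give a taste of what submission looks like once Armada is running locally, a minimal job file might resemble the following sketch (the queue name, job set id, and image are placeholders, and the exact schema should be checked against the quick start guide):

```yaml
# example-job.yaml -- hypothetical minimal submit file
queue: example-queue
jobSetId: demo-job-set
jobs:
  - priority: 0
    podSpec:
      restartPolicy: Never
      containers:
        - name: hello
          image: alpine:3.18
          command: ["sh", "-c", "echo hello from Armada"]
          resources:
            requests: {cpu: 100m, memory: 64Mi}
            limits: {cpu: 100m, memory: 64Mi}
```

A file like this would be submitted with something along the lines of `armadactl submit example-job.yaml`; the quick start guide covers the exact commands.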
 
Community and Support
Warning
TODO: add links
- Documentation: Comprehensive guides and API references
 - Community Slack: Join discussions on CNCF Slack
 - GitHub: Report issues and contribute at github.com/armadaproject/armada
 - Videos: Watch overview presentations and technical deep-dives
 