AgentOS - AI Agent Fleet Infrastructure

The Problem

Deploying AI agents to production was manual. Every agent needed a server provisioned, Docker configured, secrets injected, health monitoring set up, and usage tracked - by hand. For a firm that builds and operates AI agents for clients and internally, this didn't scale. We needed infrastructure that could deploy an agent in minutes, manage a fleet of them, and do it securely.

We built AgentOS: the platform that powers M87's agent fleet.

What We Built

AgentOS is a 4-layer platform for deploying and operating AI agents on dedicated cloud infrastructure. It's agent-agnostic - the agent is determined by the Docker image, not the platform. Currently running OpenClaw and Hermes agents in production, but any containerized agent can be deployed through the same interface. The platform is also hosting-agnostic, designed to provision across cloud providers through a unified API.

Layer 0: Cloud Provisioning - Automated server creation, Docker deployment, secret injection, and health verification. A single API call takes an agent from zero to running in minutes, across any supported cloud provider.

Layer 1: Operations & Management - Fleet-wide health monitoring, remote operations (restart, wipe, model swap, log retrieval), and lifecycle management - all from the control plane.

Layer 2: Orchestration (OmniAgent) - A self-hosted AI agent that operates the fleet. Deploy, destroy, restart, and check status of any agent using natural language via Telegram. The secondary control plane for operational tasks.

Layer 3: Dashboard - Real-time status monitoring, per-agent usage tracking, AI model configuration (with BYOK support), pairing management, live logs, and fleet-wide administrative controls.

The Architecture

The platform runs as a monorepo: infrastructure provisioning, the web dashboard and API, the agent runtime, and the operational OmniAgent.

Key technical decisions:

Callback-driven provisioning - agents self-report health instead of the control plane polling during startup. Faster, more reliable, less infrastructure load.
Agent-agnostic design - the platform doesn't care what agent runs. OpenClaw, Hermes, or any custom agent - it deploys a Docker image, injects config, and monitors health. Swap the image, get a different agent.
Hosting-agnostic abstraction - a provider interface that supports multiple cloud providers through the same deployment flow. Same API, different infrastructure underneath.

Background workers (BullMQ + Redis) handle health sweeps, usage sync, and remote operations with configurable concurrency and rate limiting. 19 API routes cover the full lifecycle from agent creation through destruction.

The Result

AgentOS is the infrastructure that makes M87's agent fleet possible. Without it, we can't effectively manage multiple agents across clients and internal operations. With it, we go from "I need an OpenClaw agent" or "I need a Hermes agent" to "it's running in production" in minutes.

This isn't a demo or a reference architecture. It's production infrastructure that we use every day to deploy and operate the agents that power our client work and our own operations.

AgentOS - AI Agent Fleet Infrastructure

The Problem

What We Built

The Architecture

The Result

Technologies Used

Tell us what you need.