Post

Real-Time Fleet Control with MQTT over mTLS

Real-Time Fleet Control with MQTT over mTLS

When you need to monitor and control a fleet of machines in real time — endpoints, devices, agents — HTTP polling falls apart fast. You want push, not poll; secure identity, not shared secrets; and behaviour that survives a dropped connection. This is the territory where MQTT over mutual TLS shines.

This is the architecture behind Data Citadel, an endpoint security tool I built that controlled 50+ machines holding confidential data, locking a compromised endpoint down in under ~5 minutes.

The problem

A control plane has to do two things at once: receive a stream of events from every agent, and push commands back to any agent instantly. It must keep working when an agent goes offline, and — because it’s security-critical — it must trust only agents that are who they say they are.

How to approach it

Think in three layers: agents at the edge, a broker for transport, and a control plane that decides. The design principle that matters most: evaluate locally, manage centrally.

flowchart TB
    subgraph Agent[Edge Agent]
        Mon[Monitors]
        Eval[Local policy evaluator]
        Enf[Enforcer]
        Q[Offline queue]
    end
    Mon --> Eval --> Enf
    Eval --> Q
    Q <-->|MQTT + mTLS| Broker[MQTT Broker]
    Broker --> CP[Control Plane]
    CP -->|signed policy + commands| Broker

What tech to use where

  • MQTT for transport. Lightweight pub/sub built for exactly this: many clients, unreliable networks, low overhead. Agents publish events to topics; the control plane publishes commands back. No polling.
  • Mutual TLS for identity. Don’t authenticate fleets with a shared token — one leak compromises everyone. Give each agent its own client certificate so the broker authenticates and the agent verifies the server. Identity becomes cryptographic, per-device, and revocable.
  • Local-first evaluation. Agents evaluate policy on the device so enforcement keeps working with no connectivity, queuing events to sync on reconnect. The network is an optimization, not a dependency.
  • Central, signed policy distribution. Policies are managed in one place, versioned, signed, and pushed over the same secure channel — so one update propagates to the whole fleet without redeploying agents.

Pitfalls to watch for

  • Securing the channel is the whole game. A control plane that can lock or wipe machines is a weapon if hijacked. mTLS, signed payloads, and least-privilege topics are non-negotiable.
  • Fail-secure vs fail-open. Decide explicitly what an offline agent does. For security, fail toward safe (keep enforcing the last known policy).
  • Tamper resistance. If a user can kill the agent, you have no control. Run it as a protected service and detect tampering.
  • Broker as a bottleneck/SPOF. Cluster it; plan topic scaling before the fleet grows.

Takeaways

For real-time fleet control, reach for MQTT (push over poll), mutual TLS (per-device cryptographic identity over shared secrets), and local-first evaluation (resilience over connectivity). Manage policy centrally but enforce it at the edge — that combination gives you instant control without making the network a single point of failure.

See it in practice in the Data Citadel case study.

This post is licensed under CC BY 4.0 by the author.