Pacemaker, together with Corosync, is often used to launch Lustre services and coordinate Lustre failover. However, the common approaches to employing Pacemaker/Corosync presented LLNL with two main challenges: scalability, and limited compatibility with stateless server nodes. Corosync has algorithmic limitations that constrain the normal Pacemaker/Corosync cluster size to sixteen nodes or fewer. Pacemaker's use of its main configuration file as an active store for system state means that it assumes every node has its own persistent storage, which is not the case with stateless servers.
The common solution in the Lustre world to the scalability issue is to have many Pacemaker/Corosync "clusters" within one Lustre cluster. For instance, each OSS failover pair would form an independent Pacemaker/Corosync cluster. This works, but it means that monitoring the Lustre cluster state requires looking at many Pacemaker/Corosync systems rather than a single installation. It also precludes more advanced Pacemaker abilities such as global ordering of Lustre service startup (e.g., starting the MGS before the MDS and OSS services).
We will present a solution employing the lesser-known pacemaker-remote functionality. This solution works well with stateless servers, and it has allowed LLNL to field a production cluster of 54 nodes controlled by a single Pacemaker/Corosync instance.
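To illustrate the approach, the following is a minimal configuration sketch using pcs. With pacemaker-remote, each Lustre server runs only the lightweight pacemaker_remote daemon rather than the full Pacemaker/Corosync stack, and the central cluster manages it through an ocf:pacemaker:remote resource. All hostnames and resource names below (oss1.example.com, lustre-MGS, and so on) are illustrative assumptions, not LLNL's actual configuration.

```shell
# Register a stateless Lustre server as a pacemaker-remote node.
# The server runs pacemaker_remote; the central cluster connects to it
# via an ocf:pacemaker:remote resource, so the server does not count
# against Corosync's membership limits.
pcs resource create oss1 ocf:pacemaker:remote \
    server=oss1.example.com reconnect_interval=60

# With every server in a single cluster, Pacemaker can enforce a global
# startup order across Lustre services: MGS first, then MDS, then OSS.
pcs constraint order start lustre-MGS then lustre-MDS
pcs constraint order start lustre-MDS then lustre-OSS
```

Because the cluster configuration lives in the central Pacemaker instance rather than on each server, this arrangement also sidesteps the persistent-storage assumption that makes stock Pacemaker awkward on stateless nodes.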