LUG17 has ended
Thursday, June 1 • 2:00pm - 2:30pm
Scalable high availability for Lustre with Pacemaker

Pacemaker, together with Corosync, is often used to launch Lustre services
and coordinate Lustre failover. However, the common approaches to employing
Pacemaker/Corosync presented LLNL with two main challenges: scalability, and
limited compatibility with stateless server nodes. Corosync has algorithmic
limitations that constrain the normal Pacemaker/Corosync cluster size to
sixteen nodes or fewer. Pacemaker's use of its main configuration file
as an active store for system state means that it assumes that every node
has its own persistent storage, which is not the case with stateless servers.

The common solution in the Lustre world to the scalability issue is to have
many Pacemaker/Corosync "clusters" within one Lustre cluster. For instance,
each OSS failover pair would form an independent Pacemaker/Corosync cluster.
This works, but means that monitoring the Lustre cluster state requires looking
at many Pacemaker/Corosync systems rather than a single installation. It also
precludes more advanced Pacemaker abilities such as global ordering of Lustre
service startup (i.e., start the MGS before the MDS and OSS).

We will present a solution that employs the lesser-known pacemaker-remote
functionality. This solution works well with stateless servers, and has
allowed LLNL to field a production cluster of 54 nodes controlled by a single
Pacemaker/Corosync instance.

Christopher Morrone

Computer Scientist, Lawrence Livermore National Laboratory

Thursday June 1, 2017 2:00pm - 2:30pm EDT
Alumni Hall (IMU - 1st Floor) 900 E 7th St, Bloomington, IN, 47405
