Pacemaker in Fedora 12
Good news for Fedora fans: we’ve successfully navigated the required red tape, and Pacemaker will ship in Fedora 12.
Hopefully Debian and Ubuntu will not be far behind.
Yesterday I was working on a migration bug.
It didn’t take long to identify or fix, and afterwards I was terribly pleased with myself.
The fix was simple, elegant and allowed the cluster to use migration (instead of stop then start) more often.
Why had I not seen how easy it was sooner?
Unfortunately it was because I’d ignored half the problem.
One decision I’m particularly happy I made, 6 years ago, was ensuring that Pacemaker’s Policy Engine could be used outside of a running cluster. Combined with some additional output options, this makes it possible to have a suite of regression tests (224 right at this moment) that catches this sort of idiocy before it ever affects an actual user.
The Policy Engine is by far the most complex part of the system, and it’s totally infeasible to test by hand even a small fraction of what the regression suite can (and in under 30s too!).
This is why I’m so confident when I say that each release is better than the last. Once a Policy Engine bug gets fixed, it stays fixed.
The benefits of offline testing also occur much earlier in the process. The Policy Engine keeps a rolling list of the cluster states it performed calculations on and our test reporting tool collects these. So when users report a Policy Engine bug, there is no need to reproduce the issue and afterwards we can conclusively show (using pretty dot graphs of the cluster’s old and new behavior) that the issue is resolved.
So I sat down again today and made sure I’d thought the whole problem through - so that the next version would be a complete solution. You can see my notes below if you’re interested.
And now I’m off to implement it (and some extra tests :-)
primitive(A) depends on clone(B)
| time | node-1 | node-2 | node-3 |
|---|---|---|---|
| t0 | A.stop | | |
| t1 | B.stop | B.stop | |
| t2 | B.start | B.start | |
| t3 | A.start | | |

| time | node-1 | node-2 | node-3 |
|---|---|---|---|
| t0 | B.start | B.start | |
| t1 | A.stop* | | |
| t2 | A.start** | | |
| t3 | B.stop | B.stop | |
The following constraints already exist in the system. The ‘ok’ and ‘fail’ columns refer to whether they still hold for migration.
Note ***: This is what the existing implementation does
It took a little longer than expected, but the latest 1.0 maintenance release (1.0.4) is finally available.
Apart from a number of important bug fixes, the latest release is the first to include comprehensive man-pages for all CLI tools. These are generated from the source code using help2man and so are guaranteed to be accurate.
Unfortunately for RHEL and CentOS users, those distros don’t ship help2man and so the man pages are not available on those platforms. However one can obtain the same information using the --help option.
Packages for Pacemaker 1.0 and its immediate dependencies can be downloaded for openSUSE, SLES, Fedora, RHEL, CentOS and Mandriva (Debian users read on) from the usual location: http://software.opensuse.org/download/server:/ha-clustering
and the source can be obtained from: http://hg.clusterlabs.org/pacemaker/stable-1.0/archive/Pacemaker-1.0.4.tar.bz2
General installation instructions are available at: http://clusterlabs.org/wiki/Install
| Changesets | 222 |
| Diff | 266 files changed, 12100 insertions(+), 8279 deletions(-) |
Yours truly will be on vacation for the second half of this month, so the next release will be in late July.
The build issues for RHEL-4 have finally been sorted out and binary packages are once again available from the build service.
By happy coincidence, courtesy of Martin Loschwitz from LINBIT, the 1.0.4 release sees the arrival of a fully functional, “official”, repository from which to obtain Pacemaker.
Martin’s work replaces the sort-of-worked-sort-of-didn’t packages from the openSUSE build service, which have now been disabled.
To install packages for Lenny or Sid (Etch should be available “soon”) see the following instructions:
/etc/apt/sources.list
deb http://people.debian.org/~madkiss/ha lenny main
deb http://people.debian.org/~madkiss/ha sid main
apt-get update
apt-get install pacemaker-openais
apt-get install pacemaker-heartbeat
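If you script the repository setup, it helps to avoid adding the same `deb` line twice. The following is only a sketch: `add_apt_source` is a hypothetical helper, not part of apt, and the demonstration writes to a temporary file rather than the real /etc/apt/sources.list.

```shell
#!/bin/sh
# add_apt_source FILE LINE  (hypothetical helper, for illustration only)
# Append LINE to FILE unless an identical line is already present.
add_apt_source() {
    file=$1
    line=$2
    # -q: quiet, -x: match the whole line, -F: treat LINE as a fixed string
    if ! grep -qxF -- "$line" "$file" 2>/dev/null; then
        printf '%s\n' "$line" >> "$file"
    fi
}

# Demonstration against a scratch file; calling it twice adds the line once.
tmp=$(mktemp)
add_apt_source "$tmp" 'deb http://people.debian.org/~madkiss/ha lenny main'
add_apt_source "$tmp" 'deb http://people.debian.org/~madkiss/ha lenny main'
cat "$tmp"
rm -f "$tmp"
```

On a real system you would pass /etc/apt/sources.list as the first argument (as root) and then run the apt-get commands above.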
Whenever there is doubt, there is no doubt
- Robert De Niro, Ronin
There is little point ensuring service continuity if the underlying data is toast. Pacemaker makes use of a concept called STONITH to prevent this from happening but many people don’t understand what it is or why it is so important.
STONITH is an acronym for “Shoot The Other Node In The Head” and is a form of fencing, a mechanism for isolating “bad” nodes from an otherwise correctly functioning cluster.
STONITH usually takes the form of a power switch that can be remotely controlled over the network; however, other forms are also possible.
Almost always, the answer is Yes!
At the most basic level, if having one or more services active on more than one cluster node is a problem, then you need STONITH.
About the only clusters that do not fall into this category are those where each machine has a non-overlapping data set or shares data that is kept in sync manually (using rsync or something similar).
As soon as shared storage is involved, STONITH is a must. This also applies to clusters with automated synchronization, such as DRBD.
Consider a cluster with 3 nodes (let’s call them A, B & C) connected to a SAN. If a cleaner accidentally unplugs node A, it is safe for nodes B and C to continue writing to the SAN. However, the cluster has no way to know this.
The cluster could assume that any node it can’t see is safely dead, but what if the node was simply cut-off from the network instead? Two nodes could easily end up writing to the same file, causing corruption.
It is true that, in this situation, node A would eventually lose quorum and could be made to stop anything that might be accessing the SAN. This reduces the window during which corruption can occur, but the possibility still exists.
This is the problem that STONITH is designed to handle: it creates certainty where there was doubt.
Fundamentally, STONITH provides an answer to the question: Is it safe to start cluster services yet?
The cluster could also use the SAN as another communication medium; however, this is not a generic solution as not all clusters have one (many opt for a software solution like DRBD) and, more importantly, disk-based communication is not currently supported by OpenAIS or Heartbeat.
STONITH also has a role to play in the event that a clustered service cannot be stopped. In this case, the cluster uses STONITH to force the whole node offline, thereby making it safe to start the rogue service elsewhere.
It is crucial that the STONITH device can allow the cluster to differentiate between a node failure and a network one.
The biggest mistake people make in choosing a STONITH device is to use a remote power switch (such as many onboard IPMI controllers) that shares power with the node it controls. These devices create what is known as a single point of failure (SPoF).
If the node loses power, the cluster cannot be sure if the node is really offline, or active and suffering from a network fault. The cluster will try to turn off the node, but that will fail because the STONITH device has also lost power. The only safe thing to do in this situation is block and wait for further input (such as the node coming back online).
For the same reason, anything that relies on the machine being active (such as the SSH-based “device” used during testing) is also inappropriate.
Clever algorithms exist to improve the reliability of these devices; however, they can only ever be approximations - which defeats the point of using STONITH in the first place.
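The rule running through this section can be written down as a tiny decision function. This is a sketch of the reasoning only, not Pacemaker code: `next_step` and its two arguments are hypothetical names chosen for illustration.

```shell
#!/bin/sh
# next_step PEER_VISIBLE FENCE_RESULT  (hypothetical, for illustration only)
#   PEER_VISIBLE: yes|no     can the rest of the cluster still see the node?
#   FENCE_RESULT: ok|failed  outcome of the STONITH request (ignored if visible)
next_step() {
    if [ "$1" = yes ]; then
        # The node is a healthy member; no recovery is needed.
        echo "nothing to do"
    elif [ "$2" = ok ]; then
        # STONITH confirmed the node is off, so its services cannot be running.
        echo "recover services elsewhere"
    else
        # Node failure or network failure? Without working STONITH we cannot tell.
        echo "block and wait"
    fi
}

next_step no ok       # fencing succeeded
next_step no failed   # e.g. the power switch lost power along with the node
```

The last case is why a device that shares power with its node is a poor choice: the moment it is needed most is the moment it cannot answer.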
After explaining STONITH to people, the most common objection is that they can’t afford it.
First and foremost, this is nonsense.
Sure you could go crazy and spend thousands, but Google quickly turns up devices for as low as $20. Even assuming that these devices are complete junk and you need to pay 10x that to get something reliable, we’re not exactly talking about a second mortgage here.
But for the sake of argument, let’s assume that the only suitable device for your cluster costs $3,000 (twice as much as the most expensive device from WTI).
Ask yourself (or your boss):
Does STONITH sound like a worthwhile investment now?
It’s a common question and a worthy topic for an extended article. Here are the steps I usually follow when diagnosing such issues.
Check quorum status with crm_mon --one-shot
Quorum is a property of the cluster which is attained when more than half the number of known nodes is online. Unlike Heartbeat, OpenAIS-based clusters don’t pretend to have quorum when only one of a possible two nodes is available. In such situations, the cluster’s default behavior is to ensure data integrity by stopping all services.
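The "more than half" rule is easy to get wrong at the boundary, so here it is as arithmetic. This is a sketch for illustration: `has_quorum` is a hypothetical helper that takes node counts as arguments and does not query a real cluster.

```shell
#!/bin/sh
# has_quorum ONLINE KNOWN  (hypothetical helper, for illustration only)
# Prints "yes" when more than half of the known nodes are online, else "no".
has_quorum() {
    online=$1
    known=$2
    # "More than half" means online * 2 must be strictly greater than known.
    if [ $((online * 2)) -gt "$known" ]; then
        echo yes
    else
        echo no
    fi
}

has_quorum 1 2   # one of two nodes is exactly half, so no quorum
has_quorum 2 3   # two of three nodes is a majority
```

The two-node case is the one that surprises people: losing either node leaves exactly half, which is not a majority.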
Check the current value of the no-quorum-policy option:
crm_attribute -n no-quorum-policy -G
If you don’t have quorum, you can tell the cluster to ignore the loss of quorum and start resources anyway:
crm_attribute -n no-quorum-policy -v ignore
Be careful to ensure STONITH is correctly configured before using the ignore option.
Check if the cluster is managing services:
Check the global default
crm_attribute --type rsc_defaults -n is-managed -G
Check the per-resource values
cibadmin --query --xpath '//nvpair[@name="is-managed"]'
Check the old location for the global default
crm_attribute -n is-managed-default -G
Look for any results indicating a value of false
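Scanning the XML by eye gets tedious on larger configurations, so a filter can help. The sketch below assumes the query output places each nvpair element on its own line; `unmanaged_nvpairs` is a hypothetical helper and the ids in the sample input are made up.

```shell
#!/bin/sh
# unmanaged_nvpairs  (hypothetical helper, for illustration only)
# Read CIB XML on stdin and print the nvpair lines that set is-managed to false.
# Assumes one nvpair element per line.
unmanaged_nvpairs() {
    grep 'name="is-managed"' | grep -i 'value="false"'
}

# Sample input with made-up ids; only the second line is printed.
printf '%s\n' \
    '<nvpair id="example-a" name="is-managed" value="true"/>' \
    '<nvpair id="example-b" name="is-managed" value="false"/>' |
    unmanaged_nvpairs

# Against a live cluster, the per-resource query above could be piped in:
#   cibadmin --query --xpath '//nvpair[@name="is-managed"]' | unmanaged_nvpairs
```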
Check target-role
The target-role setting controls what state the resource can achieve. The list of possible states is:
Check the global default
crm_attribute --type rsc_defaults -n target-role -G
Check the per-resource values
cibadmin --query --xpath '//nvpair[@name="target-role"]'
You can see the list of failures in the crm_mon output:
crm_mon --one-shot --operations
Another good source of information is ptest which can simulate what the cluster would try to do.
ptest --live-check -VVV
Look for anything unusual in the output such as
WARN: unpack_rsc_op: Processing failed op drbd0:1_start_0 on nagios-clu2: unknown error
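To pull just the operation and node out of warnings like this one, a small sed filter is enough. This is a sketch that assumes the message format shown above; `failed_ops` is a hypothetical helper, not a Pacemaker tool.

```shell
#!/bin/sh
# failed_ops  (hypothetical helper, for illustration only)
# Read log or ptest output on stdin and print "operation node" for each
# "Processing failed op" warning. Assumes the message format shown above.
failed_ops() {
    sed -n 's/.*Processing failed op \([^ ]*\) on \([^: ]*\):.*/\1 \2/p'
}

# Using the warning from above as sample input:
echo 'WARN: unpack_rsc_op: Processing failed op drbd0:1_start_0 on nagios-clu2: unknown error' |
    failed_ops
# prints: drbd0:1_start_0 nagios-clu2
```

The node name it prints tells you which machine's logs to check next.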
Check the logs
ssh -l root nagios-clu2 -- grep drbd0:1 /var/log/messages
If you identified any failures above, you can instruct the cluster to “forget” about them:
crm_resource --cleanup --node nagios-clu2
This results in the resource history being erased on nagios-clu2. The cluster will then attempt to start any services that were not already active.
NOTE: This will have little or no benefit if the underlying issue, the one that caused the resource to fail in the first place, has not been fixed. If the problem persists, the resource will simply return to a failed state and the cluster will still refuse to start it.
In a later article, I’ll explain how the cluster can recover from transient failures automatically by timing them out after a certain interval.
This tumbl/blog/thingy exists because I’ve finally accepted that “If we build it, they will come” is a fallacy. The internet is a big place and if you don’t speak up, you’ll get lost in the noise of those that do.
So, I’m going to try and use this place to raise awareness of a project that’s very important to me - Pacemaker - an incredibly advanced open source, high availability cluster resource manager.
For those not already certified cluster ninjas, a resource manager is the part of a cluster stack that decides who holds cluster services and what to do when a failure is detected.
Pacemaker’s key features, which I will explore in greater depth over the coming days/weeks, are:
If you’re interested in open source clustering, check us out at http://clusterlabs.org or irc://irc.freenode.net#linux-cluster
Nothing to see here yet. Just taking the software for a spin.