The Cluster Guy

ǝɹǝɥ ʇxǝʇ lnɟʇɥƃısuı

Archive

Fri, Jul 31st

Pacemaker in Fedora 12

Good news for Fedora fans: we’ve successfully navigated the required red tape and Pacemaker will ship in Fedora 12.

Hopefully Debian and Ubuntu will not be far behind.

Wed, Jul 22nd

Resource Migration and Regression Testing

Yesterday I was working on a migration bug.

It didn’t take long to identify or fix, and afterwards I was terribly pleased with myself.
The fix was simple, elegant and allowed the cluster to use migration (instead of stop then start) more often.

Why had I not seen how easy it was sooner?

Unfortunately it was because I’d ignored half the problem.

One decision I’m particularly happy with, made 6 years ago, was to ensure that Pacemaker’s Policy Engine could be used outside of a running cluster. Combined with some additional output options, this makes it possible to have a suite (224 right at this moment) of regression tests that catch this sort of idiocy before it ever affects an actual user.

The Policy Engine is by far the most complex part of the system, and it’s totally infeasible to test by hand even a small fraction of what the regression suite can (and in under 30s too!).

This is why I’m so confident when I say that each release is better than the last. Once a Policy Engine bug gets fixed, it stays fixed.

The benefits of offline testing also occur much earlier in the process. The Policy Engine keeps a rolling list of the cluster states it performed calculations on, and our test reporting tool collects these. So when users report a Policy Engine bug, there is no need to reproduce the issue, and afterwards we can conclusively show (using pretty dot graphs of the cluster’s old and new behavior) that the issue is resolved.

So I sat down again today and made sure I’d thought the whole problem through - so that the next version would be a complete solution. You can see my notes below if you’re interested.

And now I’m off to implement it (and some extra tests :-)

Migration Scenario Notes

Cluster Setup

primitive(A) depends on clone(B)

Resource Activity During Move: A(node-1 to node-2)

Nodes: node-1, node-2, node-3

  • t0: A.stop
  • t1: B.stop, B.stop
  • t2: B.start, B.start
  • t3: A.start

Resource Activity During Migration: A(node-1 to node-2)

Nodes: node-1, node-2, node-3

  • t0: B.start, B.start
  • t1: A.stop*
  • t2: A.start**
  • t3: B.stop, B.stop

  • Note *: Rewritten to be a migrate-to operation
  • Note **: Rewritten to be a migrate-from operation

Constraints

The following constraints already exist in the system. Each is marked ‘ok’ or ‘fail’ according to whether it still holds for migration.

  1. A.stop -> A.start - ok
  2. B.stop -> B.start - fail
  3. A.stop -> B.stop - ok
  4. B.start -> A.start - ok
  5. B.stop -> A.start - fail
  6. A.stop -> B.start - fail
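
The ok/fail verdicts can be checked mechanically against the two activity timelines. The Python sketch below is purely illustrative and is not how the Policy Engine evaluates constraints: it assumes each constraint simply means "the first action happens strictly before the second" and copies the times (t0 to t3) from the notes.

```python
# Time step at which each action occurs, copied from the two activity listings.
MOVE = {"A.stop": 0, "B.stop": 1, "B.start": 2, "A.start": 3}
MIGRATION = {"B.start": 0, "A.stop": 1, "A.start": 2, "B.stop": 3}

# Ordering constraints "first -> then", in the same order as the numbered list.
CONSTRAINTS = [
    ("A.stop", "A.start"),
    ("B.stop", "B.start"),
    ("A.stop", "B.stop"),
    ("B.start", "A.start"),
    ("B.stop", "A.start"),
    ("A.stop", "B.start"),
]


def check(timeline):
    """Return 'ok' or 'fail' for each constraint against a timeline."""
    return [
        "ok" if timeline[first] < timeline[then] else "fail"
        for first, then in CONSTRAINTS
    ]


# Print the verdicts for the migration case in the notes' own format.
for number, ((first, then), verdict) in enumerate(zip(CONSTRAINTS, check(MIGRATION)), 1):
    print(f"{number}. {first} -> {then} - {verdict}")
```

Running the same check against MOVE reports every constraint as ok, which is why the plain stop-then-start move never had this problem.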

Scenarios

  1. B unchanged - ok
  2. B stopping only - fail - possible after reversing constraint 5
  3. B starting only - fail - possible after reversing constraint 6***
  4. B stopping and starting - fail - constraint 2 is unfixable
  5. B restarting but only on N2 - fail - as per case 4 but even less likely

Note ***: This is what the existing implementation does

Fri, Jun 5th

Pacemaker 1.0.4 Released

It took a little longer than expected, but the latest 1.0 maintenance release (1.0.4) is finally available.

Apart from a number of important bug fixes, the latest release is the first to include comprehensive man-pages for all CLI tools. These are generated from the source code using help2man and so are guaranteed to be accurate.

Unfortunately for RHEL and CentOS users, those distros don’t ship help2man and so the man pages are not available on those platforms. However, one can obtain the same information using the --help option.

Packages for Pacemaker 1.0 and its immediate dependencies can be downloaded for openSUSE, SLES, Fedora, RHEL, CentOS and Mandriva (Debian users read on) from the usual location: http://software.opensuse.org/download/server:/ha-clustering

and the source can be obtained from: http://hg.clusterlabs.org/pacemaker/stable-1.0/archive/Pacemaker-1.0.4.tar.bz2

General installation instructions are available at: http://clusterlabs.org/wiki/Install

Release Statistics

Changesets 222
Diff 266 files changed, 12100 insertions(+), 8279 deletions(-)

Project Administrivia

Next Release

Yours truly will be on vacation for the second half of this month, so the next release will be in late July.

RHEL-4 Packages

The build issues for RHEL-4 have finally been sorted out and binary packages are once again available from the build service.

NEW! “Real” Debian Packages

By happy coincidence, courtesy of Martin Loschwitz from LINBIT, the 1.0.4 release sees the arrival of a fully functional, “official” repository from which to obtain Pacemaker.

Martin’s work replaces the sort-of-worked-sort-of-didnt packages from the openSUSE build service which have now been disabled.

To install packages for Lenny or Sid (Etch should be available “soon”), see the following instructions:

  1. Add one of the following lines to /etc/apt/sources.list:
    deb http://people.debian.org/~madkiss/ha lenny main
    deb http://people.debian.org/~madkiss/ha sid main
  2. Retrieve the package metadata with: apt-get update
  3. Install either the OpenAIS or Heartbeat version
    apt-get install pacemaker-openais
    or
    apt-get install pacemaker-heartbeat

Changes of note

  • High: ais: bnc#488291 - don’t rely on byte endianness on ptr cast
  • High: Tools: bnc#507255 - crm: import properly rsc/op_defaults
  • High: Tools: lf#2114 - crm: add support for operation instance attributes
  • High: ais: Bug lf#2126 - Messages replies cannot be routed to transient clients
  • High: attrd: Support the value++ and value+=… syntax required for failcounts
  • High: cib: Fix huge memory leak affecting heartbeat-based clusters
  • High: Core: Generate the help text directly from a tool options struct
  • High: crmd: Bug lf#2120 - All transient node attribute updates need to go via attrd
  • High: crmd: Fix another large memory leak affecting Heartbeat based clusters
  • High: PE: Bug bnc#495687 - Filesystem is not notified of successful STONITH under some conditions
  • High: PE: Make running a cluster with STONITH enabled but no STONITH resources an error and provide details on resolutions
  • High: PE: Prevent use-of-NULL when using resource ordering sets
  • High: Tools: attrd - Prevent race condition resulting in the cluster forgetting node’s wish to shut down
  • High: Tools: crm_mon - Fix smtp notifications
  • High: Tools: crm_resource - Repair the ability to query meta attributes
  • Medium: Core: Include supported stacks in version information
  • Medium: Tools: Include current stack in crm_mon output
  • Medium: PE: Correctly log the actions for resources that are being recovered
  • Medium: PE: Correctly log the occurrence of promotion events

Tue, May 26th

Highly Available Data Corruption

Whenever there is doubt, there is no doubt
- Robert De Niro, Ronin

There is little point ensuring service continuity if the underlying data is toast. Pacemaker makes use of a concept called STONITH to prevent this from happening but many people don’t understand what it is or why it is so important.

What is STONITH?

STONITH is an acronym for “Shoot The Other Node In The Head” and is a form of fencing, a mechanism for isolating “bad” nodes from an otherwise correctly functioning cluster.

STONITH usually takes the form of a power switch that can be remotely controlled over the network; however, other forms are also possible.

Do I Need STONITH?

Almost always, the answer is Yes!

At the most basic level, if having one or more services active on more than one cluster node is a problem, then you need STONITH.

About the only clusters that do not fall into this category are those where each machine has a non-overlapping data set or shares data that is kept in sync manually (using rsync or something similar).

As soon as shared storage is involved, STONITH is a must. This also applies to clusters with automated synchronization, such as DRBD.

Thought Experiment

Consider a cluster with 3 nodes (let’s call them A, B & C) connected to a SAN. If a cleaner accidentally unplugs node A, it is safe for nodes B and C to continue writing to the SAN. However, the cluster has no way to know this.

The cluster could assume that any node it can’t see is safely dead, but what if the node was simply cut off from the network instead? Two nodes could easily end up writing to the same file, causing corruption.

Quorum

It is true that, in this situation, node A would eventually lose quorum and could be made to stop anything that might be accessing the SAN. This reduces the window during which corruption can occur, but the possibility still exists.

This is the problem that STONITH is designed to handle: it creates certainty where there was doubt.

Fundamentally, STONITH provides an answer to the question: Is it safe to start cluster services yet?
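
To make the quorum arithmetic in the thought experiment concrete, here is a minimal Python sketch. It assumes the usual majority rule (a partition has quorum when it can see strictly more than half of the known nodes); the function name is made up for illustration and is not part of any cluster API.

```python
def has_quorum(known_nodes: int, visible_nodes: int) -> bool:
    """True when strictly more than half of the known nodes are visible."""
    return 2 * visible_nodes > known_nodes


# The three-node cluster from the thought experiment, split into {A} and {B, C}.
partitions = {"A": 1, "B+C": 2}
for name, size in partitions.items():
    state = "has quorum" if has_quorum(3, size) else "has lost quorum"
    print(f"partition {name}: {state}")
```

Note what the calculation does not say: B and C learn that they have quorum, but nothing about whether A has actually stopped writing. That remaining doubt is what STONITH removes.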

Disk-based Heartbeats

The cluster could also use the SAN as another communication medium; however, this is not a generic solution, as not all clusters have one (many opt for a software solution like DRBD) and, more importantly, disk-based communication is not currently supported by OpenAIS or Heartbeat.

Other Reasons to use STONITH

STONITH also has a role to play in the event that a clustered service cannot be stopped. In this case, the cluster uses STONITH to force the whole node offline, thereby making it safe to start the rogue service elsewhere.

Things to Look for When Choosing a STONITH Device

It is crucial that the STONITH device can allow the cluster to differentiate between a node failure and a network one.

The biggest mistake people make in choosing a STONITH device is to use a remote power switch (such as many onboard IPMI controllers) that shares power with the node it controls. These devices create what is known as a single point of failure (SPoF).

If the node loses power, the cluster cannot be sure if the node is really offline, or active and suffering from a network fault. The cluster will try to turn off the node, but that will fail because the STONITH device has also lost power. The only safe thing to do in this situation is block and wait for further input (such as the node coming back online).

For the same reason, anything that relies on the machine being active (such as the SSH-based “device” used during testing) is also inappropriate.

Clever algorithms exist to improve the reliability of these devices, however they can only ever be approximations - which defeats the point of using STONITH in the first place.
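
The reasoning in this section boils down to a small decision table. The Python sketch below is a deliberate simplification for illustration, not Pacemaker's actual logic: the names are invented, and it ignores details such as a visible node with failed services.

```python
from enum import Enum


class Decision(Enum):
    NOTHING_TO_DO = "node is visible, no node-level recovery needed"
    RECOVER = "node confirmed off, safe to start its services elsewhere"
    BLOCK = "node state unknown, wait for further input"


def decide(node_visible: bool, fencing_succeeded: bool) -> Decision:
    """Decide what to do about a node, given what the cluster knows for certain."""
    if node_visible:
        return Decision.NOTHING_TO_DO
    if fencing_succeeded:
        # Only a confirmed power-off turns "can't see it" into "it is dead".
        return Decision.RECOVER
    # Fencing failed (e.g. the device lost power along with the node).
    return Decision.BLOCK


print(decide(node_visible=False, fencing_succeeded=False).value)
```

A STONITH device that shares power with its node can only ever land in the last branch when the power fails, which is exactly when it is needed most.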

But I Can’t Afford STONITH

After explaining STONITH to people, the most common objection is that they can’t afford it.

First and foremost, this is nonsense.

Sure you could go crazy and spend thousands, but Google quickly turns up devices for as low as $20. Even assuming that these devices are complete junk and you need to pay 10x that to get something reliable, we’re not exactly talking about a second mortgage here.

But for the sake of argument, let’s assume that the only suitable device for your cluster costs $3,000 (twice as much as the most expensive device from WTI).

Ask yourself (or your boss):

  1. How much is that compared to the rest of your hardware budget?
  2. When data corruption occurs, what is the cost of having the servers offline while you restore sane data from a backup?
  3. Was the last backup corrupted too?
  4. Indeed, what is the cost of losing all the changes (orders?) that occurred since that last good backup?

Does STONITH sound like a worthwhile investment now?

Thu, May 14th

Tue, May 12th

Why Won’t the Cluster Start my Services?

It’s a common question and a worthy topic for an extended article. Here are the steps I usually follow when diagnosing such issues.

Is the cluster allowed to start services?

  1. Check quorum status with crm_mon --one-shot

    Quorum is a property of the cluster which is attained when more than half of the known nodes are online. Unlike Heartbeat, OpenAIS-based clusters don’t pretend to have quorum when only one of a possible two nodes is available. In such situations, the cluster’s default behavior is to ensure data integrity by stopping all services.

    Check the current value of the no-quorum-policy option:

    crm_attribute -n no-quorum-policy -G
    

    If you don’t have quorum, you can tell the cluster to ignore the loss of quorum and start resources anyway:

    crm_attribute -n no-quorum-policy -v ignore
    

    Be careful to ensure STONITH is correctly configured before using the ignore option.

  2. Check if the cluster is managing services:

    Check the global default

    crm_attribute --type rsc_defaults -n is-managed -G
    

    Check the per-resource values

    cibadmin --query --xpath '//nvpair[@name="is-managed"]' 
    

    Check the old location for the global default

    crm_attribute -n is-managed-default -G
    

    Look for any results indicating a value of false

  3. Check target-role

    The target-role setting controls what state the resource can achieve. The list of possible states is:

    1. Stopped
    2. Started
    3. Slave
    4. Master

    Look out for any places indicating a value of Stopped. In the case of master/slave resources that aren’t being promoted, a value of Started can also be problematic.

    Check the global default

    crm_attribute --type rsc_defaults -n target-role -G
    

    Check the per-resource values

    cibadmin --query --xpath '//nvpair[@name="target-role"]' 
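
When the configuration is large, scanning the query results by eye gets tedious. The following Python sketch shows one way to filter saved XML for the two settings discussed in steps 2 and 3. The sample fragment, resource names and ids are invented for illustration, and the code assumes the XML has a single root element.

```python
import xml.etree.ElementTree as ET

# Hypothetical configuration fragment; names and ids are made up.
SAMPLE = """
<resources>
  <primitive id="web">
    <meta_attributes id="web-meta">
      <nvpair id="web-managed" name="is-managed" value="false"/>
    </meta_attributes>
  </primitive>
  <primitive id="db">
    <meta_attributes id="db-meta">
      <nvpair id="db-role" name="target-role" value="Stopped"/>
    </meta_attributes>
  </primitive>
  <primitive id="ip">
    <meta_attributes id="ip-meta">
      <nvpair id="ip-role" name="target-role" value="Started"/>
    </meta_attributes>
  </primitive>
</resources>
"""

# Values that prevent the cluster from starting a resource (compared lowercase).
BLOCKING = {"is-managed": {"false"}, "target-role": {"stopped"}}


def blocking_settings(xml_text):
    """Return (nvpair id, name, value) for every setting that blocks a start."""
    root = ET.fromstring(xml_text)
    found = []
    for nvpair in root.iter("nvpair"):
        name = nvpair.get("name")
        value = nvpair.get("value", "")
        if value.lower() in BLOCKING.get(name, ()):
            found.append((nvpair.get("id"), name, value))
    return found


for nvpair_id, name, value in blocking_settings(SAMPLE):
    print(f"{nvpair_id}: {name}={value}")
```

The sketch does not flag a target-role of Started on master/slave resources, so that case still needs a manual check.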
    

Look for failures

  1. You can see the list of failures in the crm_mon output:

    crm_mon --one-shot --operations
    
  2. Another good source of information is ptest which can simulate what the cluster would try to do.

     ptest --live-check -VVV
    

    Look for anything unusual in the output such as

     WARN: unpack_rsc_op: Processing failed op drbd0:1_start_0 on nagios-clu2: unknown error
    
  3. Check the logs

    ssh -l root nagios-clu2 -- grep drbd0:1 /var/log/messages
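
The resource and node names needed for the log search can be pulled out of the warning line automatically. This Python sketch assumes the message layout shown in the example warning and an operation key of the form resource_action_interval; action names that themselves contain an underscore are not handled.

```python
import re

# Pattern for the "Processing failed op" warning shown in the ptest output.
FAILED_OP = re.compile(
    r"Processing failed op (?P<resource>.+)_(?P<action>[a-z]+)_(?P<interval>\d+)"
    r" on (?P<node>[^\s:]+): (?P<reason>.*)"
)

LINE = (
    " WARN: unpack_rsc_op: Processing failed op drbd0:1_start_0 "
    "on nagios-clu2: unknown error"
)


def parse_failed_op(line):
    """Return the fields of a failed-op warning as a dict, or None if no match."""
    match = FAILED_OP.search(line)
    return match.groupdict() if match else None


print(parse_failed_op(LINE))
```

The resource and node fields are the two values used in the grep command for the affected node's logs.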
    

Cleaning up after failures

If you identified any failures above, you can instruct the cluster to “forget” about them:

crm_resource --cleanup --node nagios-clu2

This results in the resource history being erased on nagios-clu2. The cluster will then attempt to start any services that were not already active.

NOTE: This will have little or no benefit if the underlying issue, the one that caused the resource to fail in the first place, has not been fixed. If the problem persists, the resource will simply return to a failed state and the cluster will still refuse to start it.

In a later article, I’ll explain how the cluster can recover from transient failures automatically by timing them out after a certain interval.

Thu, May 7th

raison d’etre

This tumbl/blog/thingy exists because I’ve finally accepted that “If we build it, they will come” is a fallacy.  The internet is a big place and if you don’t speak up, you’ll get lost in the noise of those that do.

So, I’m going to try and use this place to raise awareness of a project that’s very important to me - Pacemaker - an incredibly advanced open source, high availability cluster resource manager.

For those not already certified cluster ninjas, a resource manager is the part of a cluster stack that decides who holds cluster services and what to do when a failure is detected.

Pacemaker’s key features, which I will explore in greater depth over the coming days/weeks, are:

  • Recovery from node failures (obviously)
  • Built-in detection of resource failures (no need for mon)
  • Support for OpenAIS, an industry standard cluster stack
  • Support for Heartbeat, a popular alternative to OpenAIS
  • Powerful dependency model for accurately mapping your environment
  • Supports as many nodes as the cluster messaging layer will allow
  • Proven technology - ships as part of SLE10 and SLE 11 High Availability Extension

If you’re interested in open source clustering, check us out at http://clusterlabs.org or irc://irc.freenode.net#linux-cluster


Is this thing on?

Nothing to see here yet.  Just taking the software for a spin.