The Cluster Guy

ǝɹǝɥ ʇxǝʇ lnɟʇɥƃısuı

Archive

Thu, Mar 29th

Pacemaker 1.1.7 Now Available

After much hard work, the latest installment of the Pacemaker 1.1 release series is now ready for general consumption.

Changesets 513 
Diff 1171 files changed, 90472 insertions, 19368 deletions

As well as the usual round of bug fixes (see the full changelog), this new release brings:

  • Support for Corosync 2.0
  • Logging optimisations (less of it, and less work performed for logs that won’t be printed)
  • The ability to specify that A starts after ( B or C or D )
  • Support for advanced fencing topologies, e.g. kdump || (network && disk) || power
  • Resource templates and tickets have been promoted to the stable schema
  • Support for gracefully giving up resources depending on a ticket
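
The fencing topology expression in the list above can be read as an ordered set of levels: each level succeeds only if all of its devices succeed, and the next level is tried only when the previous one fails. A minimal Python sketch of that reading follows. The device names are taken from the example expression; `fence` and `try_device` are hypothetical names for illustration, not Pacemaker's API.

```python
from typing import Callable, Sequence

# Levels are tried in order; device names come from the example expression.
TOPOLOGY: Sequence[Sequence[str]] = (
    ("kdump",),            # level 1
    ("network", "disk"),   # level 2: both devices must succeed
    ("power",),            # level 3
)


def fence(topology: Sequence[Sequence[str]],
          try_device: Callable[[str], bool]) -> bool:
    """Return True as soon as every device in some level succeeds."""
    for level in topology:
        # all() stops at the first failing device in the level.
        if all(try_device(device) for device in level):
            return True
    return False


# Hypothetical situation: kdump is unavailable, network and disk fencing work.
working = {"network", "disk"}
print(fence(TOPOLOGY, lambda device: device in working))  # True
```

With only "network" working, level 2 fails as well, and the outcome depends on whether "power" succeeds.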

As per our release calendar, the next 1.1 release is planned for mid-July.

Packages for all current editions of Fedora have been built and will be appearing shortly in the update channels. Other distributions will follow when their schedules allow it.

The source tarball (tar.gz) is also available directly from GitHub.

General installation instructions are available from the ClusterLabs wiki.

Thu, Nov 24th

Pacemaker 1.0.12 Released

Thanks once again to the efforts of Keisuke MORI from NTT, the latest bug fixes have been back-ported from 1.1 and another instalment of the Pacemaker 1.0 release series is now ready for general consumption.

Changesets 96 
Diff 121 files changed, 8617 insertions(+), 988 deletions(-)

Important changes since Pacemaker-1.0.11 include:

  • cib: Call gnutls_bye() and shutdown() when disconnecting from remote TLS connections
  • cib: Remove disconnected remote connections from mainloop
  • crmd: Cancel timers for actions that were pending on dead nodes
  • crmd: Do not wait for actions that were pending on dead nodes
  • crmd: Ensure we do not attempt to perform action on failed nodes
  • PE: Correctly recognise which recurring operations are currently active
  • PE: Demote from Master does not clear previous errors
  • PE: Ensure restarts due to definition changes cause the start action to be re-issued not probes
  • PE: Ensure role is preserved for unmanaged resources
  • PE: Ensure unmanaged resources have the correct role set so the correct monitor operation is chosen
  • PE: Move master based on failure of colocated group
  • pengine: Correctly determine the state of multi-state resources with a partial operation history
  • PE: Only allocate master/slave resources once
  • Shell: implement -w, --wait option to wait for the transition to finish
  • Shell: repair template list command

You can also see the full changelog.

I have updated the release calendar and the next 1.0.x release is planned for mid-May 2012.

The source tarball is also available directly from GitHub.

Pre-built packages for Pacemaker are available immediately for current openSUSE (12.1, 11.4, 11.3) and Fedora (16, 15, 14) releases as well as EPEL-5 from the ClusterLabs Build Area.

Users of more recent distributions are encouraged to use the latest 1.1.x release, either from the 1.1 Build Area or from the distribution directly.

General installation instructions are available from the ClusterLabs wiki.

Thu, Oct 13th

New Version Control System

Since September, Pacemaker has been using Git for the 1.1 and devel trees.

There were some minor technical advantages over Mercurial (which I still personally prefer), but mostly the decision was driven by the pain associated with switching between SCMs multiple times a day.

The majority of development now happens on GitHub, which has some great features for reviewing patches and general collaboration.

The Pacemaker tree is also periodically sync’d to the Cluster Labs server in case GitHub is unavailable for any reason.

For those new to Git, GitHub has many tips for setting up Git, creating a local copy of the Pacemaker repo to work in, submitting your changes upstream (we use the Fork + Pull Model), and other assorted resources.

Be sure to configure email and user information so you get credit for your hard work too!

New Issue Tracker

Since it’s clearly not acceptable for our issue tracker to be offline for months at a time, it is time to replace the Bugzilla instance hosted by the Linux Foundation with something else.

One candidate that came close was the GitHub issue tracker, but alas it doesn’t support attachments. The end result is that we now have an instance of Bugzilla v4 at:

http://bugs.clusterlabs.org

Bug numbers start at 5000. This avoids clashing with the older ones and may enable us to import them if the old tracker ever comes back up. I would advise people to assume this won’t happen and to re-create any unresolved issues.

Mon, May 2nd

Pacemaker 1.0.11 Released

The latest installment of the Pacemaker 1.0 release series is now ready for general consumption.

Changesets 85 
Diff 500 files changed, 69642 insertions(+), 58270 deletions(-)

Thanks once again to the efforts of Keisuke MORI and NTT, the latest bug fixes have been back-ported from 1.1.

Important changes since Pacemaker-1.0.10 include:

  • cib: Repair the processing of updates sent from peer nodes
  • crmd: All pending operations should be recorded, even recurring ones with high start delays
  • crmd: Bug lf#2509 - Watch for config option changes from the CIB even if we’re not the DC
  • crmd: Bug lf#2528 - Introduce a slight delay when creating a transition to allow attrd time to perform its updates
  • crmd: Bug lf#2545 - Ensure notify variables are accurate for stop operations
  • crmd: Bug lf#2559 - Fail actions that were scheduled for a failed/fenced node
  • crmd: Cancel recurring operations while we’re still connected to the lrmd
  • crmd: Don’t abort transitions when probes are completed on a node
  • crmd: Ensure the CIB is always writable on the DC by removing a timing hole
  • crmd: Update failcount for failed promote and demote operations
  • PE: Bug lf#2495 - Prevent segfault by validating the contents of ordering sets
  • PE: Bug lf#2508 - Correctly reconstruct the status of anonymous cloned groups
  • PE: Bug lf#2544 - Prevent unstable clone placement by factoring in the current node’s score before all others
  • PE: Bug lf#2554 - target-role alone is not sufficient to promote resources
  • PE: Ensure fencing of the DC precedes the STONITH_DONE operation
  • PE: Ensure that fencing has completed for stop actions on stonith-dependent resources (lf#2551)
  • PE: Prevent clones from being stopped because resources colocated with them cannot be active
  • PE: Prevent use-after-free resulting from unintended recursion when choosing a node to promote master/slave resources
  • Shell: don’t create empty optional sections (bnc#665131)
  • Tools: Bug lf#2528 - Make progress when attrd_updater is called repeatedly within the dampen interval but with the same value
  • Tools: Prevent crm_resource commands from being lost due to the use of cib_scope_local

You can also see the full changelog.

As per our release calendar, the next 1.0.x release is planned for mid-September.

The source tarball is also available directly from Mercurial.

Pre-built packages for Pacemaker and its immediate dependencies are available immediately for openSUSE 11.2, 11.3, Fedora-13 and EPEL-5 from the ClusterLabs Build Area.

Users of more recent distributions are encouraged to use the latest 1.1.x release, either from the 1.1 Build Area or from the distribution directly.

General installation instructions are available from the ClusterLabs wiki.

Wed, Feb 23rd

Pacemaker 1.1.5 Released

The latest installment of the Pacemaker 1.1 release series is now ready for general consumption.

Changesets 184 
Diff 605 files changed, 46103 insertions(+), 26417 deletions(-)

As well as the usual round of bug fixes (see the full changelog), S.U.S.E. has implemented support for ACLs. This means that you can now delegate permission to control parts of the cluster (as defined by you) to non-root users.

ACLs are still disabled by default, but you can read their documentation, provide feedback and decide if it’s something you want to use.

As per our release calendar, the next 1.1 release is planned for mid-April and 1.0.11 should be available in March depending on how quickly we can get the bugfixes from 1.1 backported.

Pre-built packages for Pacemaker and its immediate dependencies are available immediately for openSUSE 11.3, Fedora-14 and EPEL-5 from the ClusterLabs Build Area.

The source tarball is also available directly from Mercurial.

General installation instructions are available from the ClusterLabs wiki.

Fri, Nov 12th

New Logo?

One unexpected outcome from the recent Linux Plumbers conference was the contribution of a new logo to the project by NTT.

New Logo

Quite possibly you’re now wondering how this logo relates at all to clustering and the Pacemaker project. Don’t worry, they came up with a backstory too!

In various forms of racing there is quite often someone/something setting a benchmark time or speed. This entity is often referred to as the pace-setter, pacemaker, or colloquially as a “rabbit”.

The logo is therefore a stylized pair of rabbit ears and the implication is that we’re setting new standards for cluster resource management.

As well as the logo, NTT also contributed some very professional looking banner images they’d created for a Japanese cluster site they’ve been busy building up. Even if you can’t speak Japanese, be sure to check out the shiny intro movie on the front page!

banner red

banner white

I quite like the logo and the message, but I’m interested in the community’s reaction. I’ve created an online poll, so be sure to let us know what you think.

Pacemaker Release Roundup

It may have seemed quiet since July, but things were actually so busy that I couldn’t find the time to publicize our new releases.

First up, the long awaited 1.0.10 is finally here. Thanks once again to the hard work of Keisuke MORI from NTT, 1.0.10 contains all the bug fixes from the recent 1.1.3 and 1.1.4 releases. You can preview the list of updates with the new online change log.

In addition to general bugfixes, the big news in 1.1.3 was the addition of a master control process and support for cman. Cman support allows us to run on top of a traditional RHCS cluster stack - replacing just the rgmanager component (more details on this in a subsequent post).

1.1.3 also introduced a new logging system inspired by the kernel and a PoC from Lars Ellenberg. It enables us to selectively enable logs for specific files, functions and even individual lines. Eventually this should result in less being logged by default.

The successor to 1.1.3 was all about performance. In 1.1.4 we managed to speed up the CIB and Policy Engine by about 80% each. So if you have hundreds of resources, you really want to be using this version (the changes were far too invasive to consider including in a 1.0 release).

Packages for all three releases are available from the rpm and rpm-next repositories on clusterlabs.org.

In other news, I have also recently updated the release calendar for 2011.

Thu, Oct 7th

Pacemaker, Heartbeat, Corosync, WTF?

One question I still get a lot is what all these projects are/do and how they all relate.

Here is the list of components that might make up a Pacemaker install:

  • Pacemaker - Resource manager
  • Corosync - Messaging layer
  • Heartbeat - Also a messaging layer
  • Resource Agents - Scripts that know how to control various services

Pacemaker is the thing that starts and stops services (like your database or mail server) and contains logic for ensuring both that they’re running, and that they’re only running in one location (to avoid data corruption).

But it can’t do that without the ability to talk to instances of itself on the other node(s), which is where Heartbeat and/or Corosync come in.

Think of Heartbeat and Corosync as dbus, but between nodes: somewhere that any node can throw messages and know that they’ll be received by all its peers. This bus also ensures that everyone agrees on who is (and is not) connected to the bus, and tells Pacemaker when that list changes.

For two nodes Pacemaker could just as easily use sockets, but beyond that the complexity grows quite rapidly and is very hard to get right - so it really makes sense to use existing components that have proven to be reliable.

You only need one of them though :-)

Finally, in order to avoid teaching Pacemaker about every possible service that people might want to make highly available, we make use of the OCF standard to hide the details in scripts - which we call Resource Agents. Any series of command-line actions can be easily turned into a resource agent by adding them to an existing template.

However, a collection of the most commonly useful ones is made available as part of the Resource Agents project.

And of course pre-built packages for all these come with most of the popular Linux distributions, including Fedora, openSUSE, SLES >= 10, RHEL >= 6, Debian, and Ubuntu.

Mon, Oct 4th

Large Cluster Performance

Over the last few days, I’ve spent a bunch of time improving Pacemaker’s performance in large clusters.

This involved profiling the CIB and Policy Engine, identifying and optimizing hotspots and improving algorithm designs.

Since most of my work is done in virtual machines, it wasn’t possible to use oprofile. Strictly speaking oprofile worked, but without hardware performance counters the results weren’t very helpful. I also tried gprof, but that is more about counting calls than measuring time spent.

Eventually I switched to callgrind and, combined with kcachegrind and/or a tool Tim found called Gprof2Dot, finally got the data I was looking for.

To do your own profiling, simply set PCMK_callgrind_enabled to either yes or to the name of a Pacemaker daemon you wish to profile, e.g. PCMK_callgrind_enabled=cib

Overall, the CIB (which is the main bottleneck in a large cluster) and the Policy Engine are about 70% faster.

The improvements will be available when 1.1.4 is released next month, or from our 1.1 code repository right now.

A summary of the various changes and description of future work is below. Any assistance in further optimization would be appreciated :-)

— Andrew

PE

Use case:
  • 100 nodes
  • 100 clones, clone-max=100 (10,000 effective resources)
  • 100 resource location constraints

Baseline with probes: 20-30 minutes
Baseline without probes: 28s

Phase 1

Use hashtables instead of lists for storing the available nodes for a resource.

New time without probes: 18s
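
To illustrate why this helps, here is a minimal Python sketch comparing a linear scan of a node list with a hash table keyed by node id. The `Node` class and the helper names are hypothetical and only stand in for the Policy Engine's real data structures.

```python
from dataclasses import dataclass


@dataclass
class Node:
    node_id: str
    score: int = 0


def find_in_list(nodes, node_id):
    """Linear scan: the cost of each lookup grows with the number of nodes."""
    for node in nodes:
        if node.node_id == node_id:
            return node
    return None


def build_index(nodes):
    """Hash table keyed by node id: roughly constant cost per lookup."""
    return {node.node_id: node for node in nodes}


nodes = [Node(f"node-{i}") for i in range(100)]
index = build_index(nodes)

# Both approaches find the same object; only the lookup cost differs.
print(find_in_list(nodes, "node-99") is index["node-99"])  # True
```

With 100 nodes and 10,000 effective resources, each needing node lookups, the per-lookup saving is multiplied many times over.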

Phase 2

Defer creation of deletion, promote and demote constraints until they are needed.

New time without probes: 13s

Phase 3

Use g_list_prepend() instead of g_list_append() for the list of ordering constraints.

New time without probes: 5s
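
The reason this matters is that appending to a linked list with no tail pointer has to walk the whole list first, so building a list of n items by appending costs on the order of n² steps, while prepending costs a constant amount per item. The Python sketch below models that with a hypothetical `Cell` class and counts the cells visited; it is not GLib code.

```python
class Cell:
    """One cell of a singly linked list; no tail pointer is kept anywhere."""

    def __init__(self, value, next_cell=None):
        self.value = value
        self.next = next_cell


def append(head, value):
    """Walk to the last cell, then attach. Returns (head, cells_visited)."""
    new = Cell(value)
    if head is None:
        return new, 0
    steps = 1
    cell = head
    while cell.next is not None:
        cell = cell.next
        steps += 1
    cell.next = new
    return head, steps


def prepend(head, value):
    """Attach in front of the current head without visiting existing cells."""
    return Cell(value, head), 0


def build(op, count):
    """Build a list of `count` items with `op`, summing the cells visited."""
    head, total = None, 0
    for i in range(count):
        head, steps = op(head, i)
        total += steps
    return head, total


_, append_steps = build(append, 1000)
_, prepend_steps = build(prepend, 1000)
print(append_steps, prepend_steps)  # prints: 499500 0
```

Note that prepending yields the items in reverse insertion order, so it only suits lists where order does not matter or where one reversal at the end is acceptable.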

Phase 4

New algorithm for determining which clone instances need probing.

New time with probes: 31s

Future work

  • Further improve the algorithm for determining which resources need to be probed
  • Further optimize the algorithm for enforcing ordering constraints

CIB

The CIB was harder to profile. Rather than give it one large task to chew through and see how long it took using a few printf’s to provide granularity, I had to run it through a profiler while it was operating in a real cluster and see where most of the time was being spent.

Phase 1

Remove most uses of cib_msg_copy(), reducing the amount of needless copying.

Phase speedup: 10%

Phase 2

Compression costs a LOT, so don’t do it unless we’re hitting message limits. For now, use 256k as the threshold at which compression kicks in. The previous limit was 10k; at that setting, compressing 184 of 1071 messages accounted for 23% of the total CPU used by the CIB.
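
A size-gated compression step of this kind might look like the following Python sketch. The function names are hypothetical, the use of bz2 is an assumption made for illustration, and "256k" is assumed to mean 256 KiB.

```python
import bz2

# Assumption: the post's "256k" is read as 256 KiB.
COMPRESSION_THRESHOLD = 256 * 1024


def encode_message(payload: bytes, threshold: int = COMPRESSION_THRESHOLD):
    """Return (is_compressed, data), compressing only oversized payloads."""
    if len(payload) < threshold:
        return False, payload          # small message: skip the CPU cost
    return True, bz2.compress(payload)


def decode_message(is_compressed: bool, data: bytes) -> bytes:
    """Undo encode_message() on the receiving side."""
    return bz2.decompress(data) if is_compressed else data


small = b"<cib/>"
large = b"x" * (300 * 1024)

print(encode_message(small)[0])  # False
print(encode_message(large)[0])  # True
```

Raising the threshold trades some network bandwidth for CPU time, which is the right trade when the CIB process is the bottleneck.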

Each time we validated the CIB, we were re-reading and re-parsing the RelaxNG schema, which accounted for 28% of the CIB’s CPU usage on the DC. We now read it once and cache the result for the life of the CIB process.

Phase speedup: 51%
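
The schema caching described in this phase amounts to memoizing an expensive parse. A minimal Python sketch, assuming a stand-in parser and a hypothetical schema file name (`pacemaker.rng`), could look like this:

```python
from functools import lru_cache

parse_calls = 0  # counts how often the expensive path actually runs


@lru_cache(maxsize=None)
def load_schema(path: str) -> tuple:
    """Stand-in for reading and parsing a RelaxNG schema file."""
    global parse_calls
    parse_calls += 1
    # A real implementation would read `path` and build a validator here.
    return ("parsed-schema", path)


def validate(document: str, schema_path: str = "pacemaker.rng") -> bool:
    """Placeholder validation that reuses the cached schema object."""
    schema = load_schema(schema_path)  # parsed on the first call only
    return schema is not None and bool(document)


for _ in range(1000):
    validate("<cib/>")

print(parse_calls)  # prints 1
```

The cache lives as long as the process does, matching the "life of the CIB process" behaviour described above; a long-running service whose schema can change on disk would need some form of invalidation.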

Phase 3

Push detection of group and set ordering changes to (the less busy) slave instances. This detection was costing 15% of the CIB’s total CPU time on the DC.

Phase speedup: 15%

Future work

The majority of the CPU time spent by the CIB is in post-processing:

  • Detecting what changed so we can minimize the network load: diff_xml_object, 35.5% CPU time
  • Calculating the current digest so peers can verify the diffs and detect ordering changes: calculate_xml_digest, 31% CPU time
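
To show what the digest step buys, here is a small Python sketch in which a peer compares the digest of its patched copy against the digest sent along with a diff. The helper names are hypothetical and SHA-256 is an arbitrary choice for illustration; this is not how calculate_xml_digest is implemented.

```python
import hashlib
import xml.etree.ElementTree as ET


def xml_digest(xml_text: str) -> str:
    """Hex digest of the tree's serialized form; child order affects it."""
    root = ET.fromstring(xml_text)
    return hashlib.sha256(ET.tostring(root, encoding="utf-8")).hexdigest()


def peer_accepts(local_after_patch: str, expected_digest: str) -> bool:
    """A peer keeps its patched copy only if the digests match."""
    return xml_digest(local_after_patch) == expected_digest


original = '<resources><primitive id="A"/><primitive id="B"/></resources>'
reordered = '<resources><primitive id="B"/><primitive id="A"/></resources>'

sender_digest = xml_digest(original)
print(peer_accepts(original, sender_digest))   # True
print(peer_accepts(reordered, sender_digest))  # False
```

Because the whole document is serialized and hashed each time, the cost grows with the size of the configuration, which is consistent with this step showing up so prominently in the profile.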