The Cluster Guy

ǝɹǝɥ ʇxǝʇ lnɟʇɥƃısuı

Mon, Jan 18th

Pacemaker 1.0.7 Released

The latest installment of the Pacemaker 1.0 stable series is now ready for general consumption.

In this release, we’ve made a number of improvements to clone handling - particularly the way ordering constraints are processed - as well as some really nice improvements to the shell.

The next 1.0 release is anticipated to be in mid-March. We will be switching to a bi-monthly release schedule to begin focusing on development for the next stable series (more details soon). If you have feature requests, now is the time to voice them and/or provide patches :-)

Pre-built packages for Pacemaker and its immediate dependencies are currently building and will be available for openSUSE, SLES, Fedora, RHEL and CentOS from the ClusterLabs Build Area shortly.

Debian users should check for updates in Martin’s repo over the coming days, and Ubuntu fans can visit LaunchPad for 8.04 and 9.10 packages.

The source tarball is also available directly from Mercurial.

General installation instructions are available from the ClusterLabs wiki.

Release Statistics

Changesets: 193
Diff: 220 files changed, 15933 insertions(+), 8782 deletions(-)

Changes of note since Pacemaker-1.0.6

  • High: PE: Bug 2213 - Ensure groups process location constraints so that clone-node-max works for cloned groups
  • High: PE: Bug lf#2153 - non-clones should not restart when clones stop/start on other nodes
  • High: PE: Bug lf#2209 - Clone ordering should be able to prevent startup of dependant clones
  • High: PE: Bug lf#2216 - Correctly identify the state of anonymous clones when deciding when to probe
  • High: PE: Bug lf#2225 - Operations that require fencing should wait for ‘stonith_complete’ not ‘all_stopped’.
  • High: PE: Bug lf#2225 - Prevent clone peers from stopping while another instance is (potentially) being fenced
  • High: PE: Correctly anti-colocate with a group
  • High: PE: Correctly unpack ordering constraints for resource sets to avoid graph loops
  • High: Tools: crm: load help from crm_cli.txt
  • High: Tools: crm: resource sets (bnc#550923)
  • High: Tools: crm: support for comments (LF 2221)
  • High: Tools: crm: support for description attribute in resources/operations (bnc#548690)
  • High: Tools: hb2openais: add EVMS2 CSM processing (and other changes) (bnc#548093)
  • High: Tools: hb2openais: do not allow empty rules, clones, or groups (LF 2215)
  • High: Tools: hb2openais: refuse to convert pure EVMS volumes
  • High: cib: Ensure the loop for login message terminates
  • High: cib: Finally fix reliability of receiving large messages over remote plaintext connections
  • High: cib: Fix remote notifications
  • High: cib: For remote connections, default to CRM_DAEMON_USER since that’s the only one that the cib can validate the password for using PAM
  • High: cib: Remote plaintext - Retry sending parts of the message that did not fit the first time
  • High: crmd: Ensure batch-limit is correctly enforced
  • High: crmd: Ensure we have the latest status after a transition abort
  • High (bnc#547579,547582): Tools: crm: status section editing support
  • High: shell: Add allow-migrate as allowed meta-attribute (bnc#539968)
  • Medium: Build: Do not automatically add -L/lib, it could cause 64-bit arches to break
  • Medium: PE: Bug lf#2206 - rsc_order constraints always use score at the top level
  • Medium: PE: Only complain about target-role=master for non m/s resources
  • Medium: PE: Prevent non-multistate resources from being promoted through target-role
  • Medium: PE: Provide a default action for resource-set ordering
  • Medium: PE: Silently fix requires=fencing for stonith resources so that it can be set in op_defaults
  • Medium: Tools: Bug lf#2286 - Allow the shell to accept template parameters on the command line
  • Medium: Tools: Bug lf#2307 - Provide a way to determine the nodeid of past cluster members
  • Medium: Tools: crm: add update method to template apply (LF 2289)
  • Medium: Tools: crm: direct RA interface for ocf class resource agents (LF 2270)
  • Medium: Tools: crm: direct RA interface for stonith class resource agents (LF 2270)
  • Medium: Tools: crm: do not add score which does not exist
  • Medium: Tools: crm: do not consider warnings as errors (LF 2274)
  • Medium: Tools: crm: do not remove sets which contain id-ref attribute (LF 2304)
  • Medium: Tools: crm: drop empty attributes elements
  • Medium: Tools: crm: exclude locations when testing for pathological constraints (LF 2300)
  • Medium: Tools: crm: fix exit code on single shot commands
  • Medium: Tools: crm: fix node delete (LF 2305)
  • Medium: Tools: crm: implement -F (--force) option
  • Medium: Tools: crm: rename status to cibstatus (LF 2236)
  • Medium: Tools: crm: revisit configure commit
  • Medium: Tools: crm: stay in crm if user specified level only (LF 2286)
  • Medium: Tools: crm: verify changes on exit from the configure level
  • Medium: ais: Some clients such as gfs_controld want a cluster name, allow one to be specified in corosync.conf
  • Medium: cib: Clean up logic for receiving remote messages
  • Medium: cib: Create valid notification control messages
  • Medium: cib: Indicate where the remote connection came from
  • Medium: cib: Send password prompt to stderr so that stdout can be redirected
  • Medium: cts: Fix rsh handling when stdout is not required
  • Medium: doc: Fill in the section on removing a node from an AIS-based cluster
  • Medium: doc: Update the docs to reflect the 0.6/1.0 rolling upgrade problem
  • Medium: doc: Use Publican for docbook based documentation
  • Medium: fencing: stonithd: add metadata for stonithd instance attributes (and support in the shell)
  • Medium: fencing: stonithd: ignore case when comparing host names (LF 2292)
  • Medium: tools: Make crm_mon functional with remote connections
  • Medium: xml: Add stopped as a supported role for operations
  • Medium: xml: Bug bnc#552713 - Treat node unames as text fields not IDs
  • Medium: xml: Bug lf#2215 - Create an always-true expression for empty rules when upgrading from 0.6

Fri, Jan 15th

Ubuntu looking for Pacemaker testers

Ubuntu is looking to switch its supported cluster stack to Corosync+Pacemaker and has put out a “Call for testers”.
Check out the link if this is something you’re interested in.

Tue, Jan 12th

Pre-Announce: End of Pacemaker 0.6 support is near

Unless there are violent objections, I plan to officially stop supporting 0.6 at the end of February.

Since I’ve not seen any bugs reported for some time, it seems that anyone still using 0.6 is happy with it for their workload.

Also, 1.0 has been out for over a year now and contains significant improvements over 0.6, including:

  • A unified shell that hides the XML scaffolding
  • Migration thresholds that are easy to configure and understand
  • Failures can be ignored after a specified period of time
  • Ability to specify defaults for resource and operation parameters
  • Man pages for all CLI tools
  • Up-to-date online documentation

The online documentation has more details on what’s new/different in Appendix C and detailed instructions for upgrading in Appendix E.

Mon, Nov 16th

New Documentation Formats

I’m pleased to report that the core Pacemaker documentation is now available in PDF, HTML (chunked and single page) and even TXT formats.

The old Pages.app sources have been replaced with DocBook which allows them to be:

  • published in a variety of formats
  • kept under version control
  • included in the packages
  • updated by anybody

Additionally, we’re using Publican to produce the final result, so supporting multiple languages should now be possible. Let us know if you’re interested in doing some translation :-)

The primary location for Pacemaker documentation will remain http://www.clusterlabs.org/wiki/Documentation; however, there is also an index of the generated documentation at http://www.clusterlabs.org/doc/ which includes the date and version from which it was generated.

Mon, Nov 2nd

Pacemaker 1.0.6 Released

The next installment of the Pacemaker 1.0 stable series is now ready for general consumption.

In addition to further polishing of the crm shell and CLI tools, this is the first release to support CoroSync (version 1.1.2 or greater is required).

The “Pacemaker Explained” reference has also been converted to DocBook and is included as part of the tarball (and pre-built packages if the relevant stylesheets are present at build time).

Pre-built packages for Pacemaker and its immediate dependencies will be available for openSUSE, SLES, Fedora, RHEL and CentOS from the OpenSUSE Build Service in the next couple of days, depending on how overloaded it is.

Debian users should check for updates in Martin’s repo over the coming days, and Ubuntu fans can visit LaunchPad for 8.04 and 9.10 packages.

The source tarball is also available directly from Mercurial.

General installation instructions are available from the ClusterLabs wiki.

Release Statistics

Changesets: 185
Diff: 331 files changed, 13858 insertions(+), 3277 deletions(-)

Project Administrivia

We may switch to a bi-monthly release cycle. If you have any thoughts on this (for or against), please get in touch.

Changes of note since Pacemaker-1.0.5

  • High: cib: Correctly clean up when both plaintext and tls remote ports are requested
  • High: ais: Avoid excessive load by checking for dead children every 1s (instead of 100ms)
  • High: ais: Bug lf#2199 - Prevent expected-quorum-votes from being populated with garbage
  • High: ais: Bug rh#525589 - Prevent shutdown deadlocks when running on CoroSync
  • High: ais: Gracefully handle changes to the AIS nodeid
  • High: ais: Prevent deadlock - don’t try to release IPC message if the connection failed
  • High: ais: Ubuntu needs a leading zero for directory modes
  • High: cib: For validation errors, send back the full CIB so the client can display the errors
  • High: cib: Prevent use-after-free for remote plaintext connections
  • High: cib: Repair the ability to connect to the cluster from non-cluster machines
  • High: Core: Bug lf#2169 - Allow dtd/schema validation to be disabled
  • High: crmd: Bug bnc#527530 - Wait for the transition to complete before leaving S_TRANSITION_ENGINE
  • High: crmd: Bug lf#2201 - Guard against possible cause of a segfault
  • High: crmd: Prevent use-after-free with LOG_DEBUG_3
  • High: Extras: Add sctp support to the controld RA
  • High: PE: Bug bnc#515172 - Provide better defaults for lt(e) and gt(e) comparisons
  • High: PE: Bug lf#2106 - Not all anonymous clone children are restarted after configuration change
  • High: PE: Bug lf#2170 - stop-all-resources option had no effect
  • High: PE: Bug lf#2171 - Prevent groups from starting if they depend on a complex resource which can’t
  • High: PE: Bug lf#2197 - Allow master instance placement to be influenced by colocation constraints
  • High: PE: Disable resource management if stonith-enabled=true and no stonith resources are defined
  • High: PE: Don’t include master score if it would prevent allocation
  • High: PE: Make sure promote/demote pseudo actions are created correctly
  • High: PE: Prevent target-role from promoting more than master-max instances
  • High: shell: Add allow-migrate as allowed meta-attribute (bnc#539968)
  • High: tools: bnc#547579,547582 - crm: status section editing support
  • High: Tools: crm: add semantic checks depending on the meta-data from resource agents
  • High: Tools: crm: improve processing of group edit and constraints
  • High: Tools: crm: improve the edit command
  • High: Tools: pingd - Fix a number of critical bugs (patch via Kazunori INOUE)
  • Med: xml: Mask the “symmetrical” attribute on rsc_colocation constraints (bnc#540672)
  • Medium (bnc#520707): Tools: crm: new templates ocfs2 and clvm
  • Medium (LF 2164): Tools: hb_report: expand the crm status command
  • Medium (LF 2184): Tools: crm: extend ptest command
  • Medium (LF 2185): Tools: crm: add resource promote/demote commands
  • Medium (LF 2198): Tools: crm: add node fence command
  • Medium: ais: Attempt to enable core file generation if it was disabled
  • Medium: ais: Include version details in plugin name
  • Medium: Build: Re-enable asciidoc documentation
  • Medium: Build: Shell templates aren’t documentation
  • Medium: cib: Remove delay for remote plaintext connections
  • Medium: Core: Disable syslog for any process that doesn’t want its arguments logged
  • Medium: crmd: Requery the resource metadata after every start operation
  • Medium: cts: add --benchmark for scalability tests
  • Medium: cts: Prepare for corosync testing
  • Medium: Extra: Include SNMP MIB file for crm_mon (from Michael Schwartzkopff)
  • Medium: PE: Bug lf#2178 - Indicate unmanaged clones
  • Medium: PE: Bug lf#2180 - Include node information for all failed ops
  • Medium: PE: Bug lf#2189 - Incorrect error message when unpacking simple ordering constraint
  • Medium: PE: Correctly log resources that would like to start but can’t
  • Medium: PE: Correctly log the state of orphaned clone instances
  • Medium: PE: If no migrate_(from|to) action is defined, look for migrate instead
  • Medium: PE: Only re-instate target-role if it is less than the calculated one
  • Medium: PE: Provide details for the maintenance-mode option
  • Medium: PE: Stop ptest from logging to syslog
  • Medium: Tools: attrd_updater - Suppress all logging with --quiet
  • Medium: Tools: crm: add extra flag to CibObject for invalid objects
  • Medium: Tools: crm: do return cached resources dom node
  • Medium: Tools: crm: expand template documentation
  • Medium: Tools: crm: first child of a removed parent inherits constraints
  • Medium: Tools: crm_attribute - Suppress all logging with --quiet
  • Medium: Tools: crm_shadow - log diffs to stdout instead of stderr
  • Medium: Tools: Use -q as the short form for --quiet (for consistency)

Tue, Oct 6th

Advisory: Don’t use Pacemaker on Corosync (yet)

I spent some time looking into the state of the Pacemaker/Corosync integration today and I can only recommend Pacemaker users stay on the previous version of OpenAIS (aka. Whitetank).

In a nutshell, shutdown is utterly broken.

r2140 of Corosync removed the shutdown worker thread which allowed plugins such as Pacemaker to continue sending and receiving cluster messages.
Without it, Corosync waits for Pacemaker to finish and Pacemaker waits for the messages it tried to send to arrive and be acted upon. Needless to say, no-one makes any progress.

Stay tuned: now that integration testing has started, it shouldn’t take too long to get everything sorted out.

Update

Since writing this, the necessary testing has been done and Pacemaker is now supported on Corosync, provided you have corosync >= 1.1.2 and pacemaker >= 1.0.6.
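
The version requirement above can be checked mechanically. This is only a sketch: the function names version_ge and supported_stack are made up for illustration, the version numbers have to be supplied by hand, and it assumes GNU sort (for the -V option).

```shell
# version_ge A B: succeeds when version A is at least version B.
# Assumes GNU sort, whose -V option orders dotted version strings.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# supported_stack COROSYNC_VERSION PACEMAKER_VERSION
# Succeeds only when both minimum versions from the update above are met.
supported_stack() {
    version_ge "$1" 1.1.2 && version_ge "$2" 1.0.6
}

if supported_stack 1.1.2 1.0.6; then
    echo "supported combination"
fi
```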

Mon, Sep 21st

Clusters From Scratch

The first of a new series of step-by-step guides for Pacemaker.

This installment covers installation, the creation of an active/passive cluster and its conversion to active/active.

Technologies used include:

  • Fedora 11 as the host operating system
  • OpenAIS to provide messaging and membership services
  • Pacemaker to perform resource management
  • DRBD as a cost-effective alternative to shared storage
  • OCFS2 as the cluster filesystem (in active/active mode)
  • The crm shell for displaying the configuration and making changes
  • Apache as the example service

The PDF is available from our Documentation page or directly via http://www.clusterlabs.org/mediawiki/images/9/9d/Clusters_from_Scratch_-_Apache_on_Fedora11.pdf

Future guides are anticipated to include MySQL, mail servers and asymmetrical clusters. Feedback and suggestions for additional topics are welcome.

Version Control Prompt

I find it convenient to include current SCM data before my regular Bash prompt (reduces the chance of “accidents”). Perhaps someone else will find it useful too.

function prompt-pre-exec() {
    scm=""
    repo_root=$(hg root 2>/dev/null)
    if [ -e CVS ]; then
        scm=":: cvs ::"

    elif [ -e .svn ]; then
        scm=":: svn : ${prompt_hl}r$(svn info | grep Revision | sed s/.*:\ //)${prompt_n} ($(svn info | grep Date | sed s/.*\(\//)"

    elif [ -e .gitignore ]; then
        repo_branch=`git branch --no-color 2> /dev/null | sed -e '/^[^*]/d' -e 's/* \(.*\)/\1#/'`
        scm=`git show --pretty="format: : git : ${prompt_hl}${repo_branch}%h${prompt_n} : %an, %cr\n:: %s\n" | head -n 2`

    elif [ x != "x$repo_root" ]; then
        repo_cs=$(hg id -i)
        scm=`hg log --template " : hg : ${repo_root##*/} : ${prompt_hl}${repo_cs}${prompt_n} {tags} : {author|user}, {date|age} ago\n:: {desc|firstline|strip}\n" -r ${repo_cs%%+}`
    fi

    if [ "x$scm" != x ]; then
        # Trailing \n characters don't seem to be expanded
        scm="$scm
"
    fi
    export scm    
}

if [ x"$-" = "xhimBH" ]; then
  # Execute the following function before displaying the prompt
  export PROMPT_COMMAND='prompt-pre-exec'

  # Use \[ and \] to exclude the color code from the line wrapping calculations 
  export PS1='${scm}[\@] \u@\h \[${prompt_hl}\]\w #\[${prompt_n}\] '
fi

Then to add color, simply define prompt_hl and prompt_n. I use

export prompt_n="^[^E[00m^]"      # Default color
export prompt_hl="^[^E[01;32m^]"  # Highlight codes

To enter ^[ in emacs, type Ctrl-q then Ctrl-[. Likewise ^E is Ctrl-q Ctrl-e.
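
If typing literal control characters is awkward, the escape character can also be generated with printf. This is a sketch rather than a drop-in replacement: it emits only the plain ANSI color sequences and does not reproduce the additional control characters in the definitions above.

```shell
# Build the ESC character with printf instead of typing it literally.
esc=$(printf '\033')

export prompt_n="${esc}[00m"       # Default color
export prompt_hl="${esc}[01;32m"   # Highlight codes (bold green)
```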

Thu, Sep 3rd

Configuring Heartbeat v1 Was So Simple

…because it couldn’t do anything.

People who loved how simple Heartbeat v1 was to configure often complain about how complex Pacemaker is.

But the key differences between the two configurations are driven by the very features that haresources-based clusters couldn’t provide.

Granted, we made a mess of things with the original XML syntax.

When the job of writing the CRM/Pacemaker was first pitched to me, I was promised an all-singing, all-dancing GUI that would hide the ugly XML from end users.

So we focused on power and expressiveness and little thought was given to its usability.

However, with the release of Pacemaker 1.0 in October 2008, not only has the XML been cleaned up, but Dejan has written a consolidated cluster shell that hides all the XML anyway.

As a result, almost everything you thought about Pacemaker’s complexity is no longer true, and it’s just as suited to simple two-node setups as it is to large active/active configurations.

Establishing A Baseline

Here’s a sample haresources file from the linux-ha.org website:

linuxha1 IPaddr::192.168.85.3 httpd smb

The pattern here is:

$preferred_node $script::$parameter $script $script

Which is mostly sufficient for a two-node cluster, because anything not running on $preferred_node is running on the only other machine in the cluster.
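
To make the pattern concrete, here is a small sketch that splits the sample line into its parts. The function name parse_haresources is made up for illustration, and it is a simplification: it assumes one group per line and ignores comments and continuation lines.

```shell
# Split one haresources line into its preferred node and resource list.
# Simplification: one group per line, no comments or continuation lines.
parse_haresources() {
    set -- $1                      # word splitting: node first, resources after
    echo "preferred node: $1"
    shift
    for res in "$@"; do
        script=${res%%::*}         # text before the first "::"
        case $res in
            *::*) echo "resource: $script (parameter: ${res#*::})" ;;
            *)    echo "resource: $script" ;;
        esac
    done
}

parse_haresources "linuxha1 IPaddr::192.168.85.3 httpd smb"
```

Everything after the node name is just an ordered list of scripts, which is exactly why the format cannot say anything about monitoring, dependencies or more than one fallback node.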

Deficiencies

Of course there were some obvious limitations built into Heartbeat v1 that Pacemaker was designed to address.

  • Couldn’t support more than two nodes
  • Couldn’t detect or recover from resource-level failures
  • Limited to sets of resources with a strict linear stop/start order

Power Creates Complexity, but Not That Much

Non-linear Resource Model

Consider the following environment:

  • Resource B depends on Resource A
  • Resource C depends on Resource A
  • Resource B and Resource C are independent

There is no way to truly express this with haresources.

In order to have a more powerful resource model, you need to be able to refer to resources multiple times. So now they need a name and a more flexible way to group them.
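
For comparison, once resources have names, the environment above could be expressed with two ordering constraints in the crm shell syntax introduced below. This is a sketch that assumes resources named A, B and C have already been defined; the constraint names B-after-A and C-after-A are made up.

```shell
# crm configure order B-after-A inf: A B
# crm configure order C-after-A inf: A C
```

Nothing relates B to C, so they remain independent. If B and C also had to run on the same node as A, colocation constraints would be needed as well.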

New Resource Syntax

This is Pacemaker’s XML equivalent of the IPaddr resource above, which I’ve imaginatively called IP:

    <primitive class="ocf" id="IP" provider="heartbeat" type="IPaddr">
      <instance_attributes id="IP-instance_attributes">
        <nvpair id="IP-instance_attributes-ip" name="ip" value="192.168.85.3"/>
      </instance_attributes>
    </primitive>

This is also the point at which many people run away screaming. There’s really no need for this though: almost all of it is scaffolding.

Here’s how I created the above XML:

# crm configure primitive IP ocf:heartbeat:IPaddr params ip=192.168.85.3

Which is composed of

  • primitive ::= The type of resource object that we’re creating.
  • IP ::= Our name for the resource
  • IPaddr ::= The script to call
  • ocf ::= The standard it conforms to
  • ip=192.168.85.3 ::= Parameter(s) as name/value pairs

To create the other two members of the group, I ran:

# crm configure primitive http lsb:httpd
# crm configure primitive samba lsb:smb

Admit it, that was pretty easy :-)

To group them together, as-per the haresources example, is also trivial. Just provide a name for the group and a list of members:

# crm configure group v1-group IP http samba

That’s it! Here’s the result:

# crm configure show
primitive IP ocf:heartbeat:IPaddr params ip="192.168.85.3"
primitive http lsb:httpd
primitive samba lsb:smb 
group v1-group IP http samba

Resource Recovery

In order to detect resource failure, the cluster needs to check its health periodically. But what action should it call and how often? There is nowhere in haresources to specify this. Not cleanly anyway.

By way of example, here’s what it looks like in Pacemaker if you want to monitor the IP address every 5 minutes and apache/samba once a minute:

# crm configure show
primitive IP ocf:heartbeat:IPaddr params ip="192.168.85.3" op monitor interval=5min
primitive http lsb:httpd op monitor interval=60s
primitive samba lsb:smb op monitor interval=60s
group v1-group IP http samba

More Than Two Nodes

Supporting more than two nodes means that you can no longer just specify a single preferred node like in v1. That’s not enough to tell the cluster where to put the resource after $preferred_node fails. Instead you need an ordered list.

Here’s an example of how we might specify that we prefer linuxha1 over linuxha2 over linuxha3:

# crm configure location prefer-ha1 v1-group 5000: linuxha1
# crm configure location prefer-ha2 v1-group 500:  linuxha2
# crm configure location prefer-ha3 v1-group 50:   linuxha3

The numbers (5000, 500, 50) are scores that indicate a relative preference for running on the three nodes.
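
As a rough illustration of how such scores act as an ordered list, the sketch below picks the online node with the highest score. The function pick_node is made up for this example and is a toy model only: the real policy engine weighs many other factors (stickiness, colocation, failures and so on).

```shell
# pick_node ONLINE_NODES SCORE_LIST
# ONLINE_NODES: space-separated node names
# SCORE_LIST:   space-separated node=score pairs
pick_node() {
    best=""
    best_score=""
    for pair in $2; do
        node=${pair%%=*}
        score=${pair#*=}
        case " $1 " in
            *" $node "*) ;;        # node is online, consider it
            *) continue ;;         # node is offline, skip it
        esac
        if [ -z "$best" ] || [ "$score" -gt "$best_score" ]; then
            best=$node
            best_score=$score
        fi
    done
    echo "$best"
}

scores="linuxha1=5000 linuxha2=500 linuxha3=50"
pick_node "linuxha1 linuxha2 linuxha3" "$scores"   # all nodes up
pick_node "linuxha2 linuxha3" "$scores"            # linuxha1 has failed
pick_node "linuxha3" "$scores"                     # only linuxha3 is left
```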

To finish off, here’s the Pacemaker equivalent of just the original haresources example:

primitive IP ocf:heartbeat:IPaddr params ip="192.168.85.3"
primitive http lsb:httpd
primitive samba lsb:smb

group v1-group IP http samba
location prefer-ha1 v1-group 5000: linuxha1

See, it’s really not that scary and, as we saw, easily extendable to a more complex clustering environment if needed.

Failback? Sure, but to Where?

Imagine the scenario: linuxha1 failed and the group moved to linuxha2. Then linuxha2 failed, and now the resource group is running on linuxha3. What happens when the other two nodes recover?

In v1, this is controlled by the auto_failback directive. But when there are more than two nodes, where does “back” refer to? It could mean linuxha2, since that was the last place it was running. It could also mean linuxha1, since that is the most preferred node under normal circumstances.

The sanest way to resolve this ambiguity is to invert the option and make it a preference for keeping the resource where it is.

Thus the resource-stickiness option was born and, unlike auto_failback, it can be specified per-resource (with a global default). Which makes sense, since the cost of stopping and relocating an IP address is significantly less than that of an Oracle database.
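
As a sketch of what that looks like in the crm shell, the first command sets a global default and the second overrides it for a single resource. The resource name db, the lsb:oracle init script and both values are made up for illustration.

```shell
# crm configure rsc_defaults resource-stickiness=100
# crm configure primitive db lsb:oracle meta resource-stickiness=10000
```

Roughly speaking, the stickiness is weighed against the location scores, so a larger value makes a resource more reluctant to move back to a preferred node that has recovered.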

Pacemaker even allows the administrator to have different fail-back policies apply during “core” and “non-core” hours. But more on that another day.

Tue, Aug 25th

Another Documentation Update

Quick FYI… I’ve made some more improvements to the Configuration Explained PDF http://clusterlabs.org/mediawiki/images/f/fb/Configuration_Explained.pdf

Changes include:

  • Fixed a number of date based rule examples
  • Updated details on the stonith-enabled option
  • Fixed the URL for obtaining the XSLT conversion script
  • Explanations of the possible values for the target-role and multiple-active resource options
  • Explanations of the possible values for the on-fail and requires operation options
  • Explanations of the possible values for the operation option of expressions and date_expressions
  • Rewrote section on Moving Resources due to Connectivity Changes to use the new, more reliable ping RA
  • Referenced the original definition of various resource options instead of repeating them
  • Added text for obtaining detailed information on the meaning of stonith parameters