Chapter 4. Continuous availability and manageability

4.3.2 General detection and deallocation of failing components

Runtime correctable or recoverable errors are monitored to determine whether there is a pattern of

errors. If the affected components reach a predefined error limit, the service processor initiates an

action to deconfigure the faulty hardware to avoid a potential system outage and to enhance

system availability.
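
The threshold behavior described above can be sketched as a simple per-component counter. This is an illustration only, not the service processor's actual logic: the class name, the component labels, and the limit of 3 are all assumptions made for the example.

```python
from collections import Counter


class ErrorThresholdMonitor:
    """Hypothetical model: count recoverable errors and flag components at a limit."""

    def __init__(self, error_limit):
        self.error_limit = error_limit   # assumed value; real limits are predefined by the platform
        self.counts = Counter()          # recoverable errors seen per component
        self.deconfigured = set()        # components already taken out of service

    def record_correctable_error(self, component):
        """Record one recoverable error; return True if this call triggers deconfiguration."""
        if component in self.deconfigured:
            return False
        self.counts[component] += 1
        if self.counts[component] >= self.error_limit:
            self.deconfigured.add(component)
            return True
        return False


monitor = ErrorThresholdMonitor(error_limit=3)
events = ["core0", "dimm2", "core0", "core0", "dimm2"]
# Components whose error count reached the limit while replaying the events.
triggered = [c for c in events if monitor.record_correctable_error(c)]
```

In this sketch a component is deconfigured on the error that reaches the limit, and later errors reported against it are ignored.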

Persistent deallocation

To enhance system availability, a component that is identified for deallocation or

deconfiguration on a POWER processor-based system is flagged for persistent deallocation.

Component removal can occur either dynamically (as the system is running) or at boot-time

(IPL), depending on both the type of fault and when the fault is detected.

In addition, hardware with a runtime unrecoverable fault can be deconfigured from the system after

the first occurrence. The system can be rebooted immediately after failure and resume

operation on the remaining stable hardware. This approach prevents the same faulty

hardware from affecting system operation again, and the repair action is deferred to a more

convenient, less critical time.

Persistent deallocation includes the following elements:

• Processor

• L2/L3 cache lines (cache lines are dynamically deleted)

• Memory

• Failing I/O adapters (deconfigured or bypassed)
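
One way to picture persistent deallocation is as a record that survives a reboot and is consulted when hardware is configured at IPL. The sketch below assumes a JSON file as that record. The file name, function names, and component labels are hypothetical and do not reflect how the firmware actually stores this state.

```python
import json
import os
import tempfile


def load_flagged(record_path):
    """Return the set of components flagged for persistent deallocation."""
    if not os.path.exists(record_path):
        return set()
    with open(record_path) as f:
        return set(json.load(f))


def mark_for_deallocation(record_path, component):
    """Add a component to the persistent record (a hypothetical JSON file)."""
    flagged = load_flagged(record_path)
    flagged.add(component)
    with open(record_path, "w") as f:
        json.dump(sorted(flagged), f)


def configure_at_boot(record_path, inventory):
    """At boot, configure only the components that are not flagged."""
    flagged = load_flagged(record_path)
    return [c for c in inventory if c not in flagged]


# A temporary directory stands in for nonvolatile storage in this example.
record_path = os.path.join(tempfile.mkdtemp(), "deconfig.json")
mark_for_deallocation(record_path, "core3")
mark_for_deallocation(record_path, "dimm1")
active = configure_at_boot(record_path, ["core0", "core3", "dimm0", "dimm1"])
```

Because the record is read again at every boot, a flagged component stays out of the configuration until the record is changed, which matches the idea of deferring the repair action.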

Processor instruction retry

As in POWER6, the POWER7 processor supports processor instruction retry and alternate

processor recovery for a number of core-related faults. This approach significantly

reduces exposure to both permanent and intermittent errors in the processor core.

Intermittent errors, often as a result of cosmic rays or other sources of radiation, are generally

not repeatable.

With this function, when an error is encountered in the core, in caches and certain logic

functions, the POWER7 processor automatically retries the instruction. If the source of the

error was truly transient, the instruction succeeds and the system continues as before.

On IBM systems prior to POWER6, this error would have caused a checkstop.
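
The retry behavior can be illustrated in software terms. The following sketch is an analogy only: the real mechanism is implemented in processor hardware, and the exception type, function names, and single-retry default are assumptions chosen for the example.

```python
class CoreFault(Exception):
    """Hypothetical stand-in for an error detected in the processor core."""


def run_with_retry(instruction, max_retries=1):
    """Run an instruction, retrying on a fault; return (result, attempts used)."""
    attempts = 0
    while True:
        attempts += 1
        try:
            return instruction(), attempts
        except CoreFault:
            if attempts > max_retries:
                raise  # the fault persisted, so retrying alone cannot help


def make_flaky_instruction(failures):
    """Build an instruction that faults a fixed number of times, then succeeds."""
    state = {"remaining": failures}

    def instruction():
        if state["remaining"] > 0:
            state["remaining"] -= 1
            raise CoreFault("transient error")
        return 42

    return instruction


# A truly transient fault: the first attempt fails and the retry succeeds.
result, attempts = run_with_retry(make_flaky_instruction(failures=1))
```

If the fault is still present after the allowed retries, the exception propagates, which corresponds to the hard-failure case that retrying cannot resolve.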

Alternate processor retry

Hard failures are more difficult to handle because they are permanent errors that recur each time the

instruction is repeated. Retrying the instruction does not help in this situation because the

instruction continues to fail.

As in POWER6, POWER7 processors have the ability to extract the failing instruction from the

faulty core and retry it elsewhere in the system for a number of faults, after which the failing

core is dynamically deconfigured and scheduled for replacement.
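
The sequence described above (retry on the same core, move the instruction to another core, then deconfigure the failing core) can be sketched as follows. This is a simplified software analogy under stated assumptions: the Core class, its fields, and the selection of the first available alternate core are hypothetical and are not taken from the POWER7 design.

```python
class CoreFault(Exception):
    """Hypothetical stand-in for an error detected in the processor core."""


class Core:
    """Minimal model of a core that either works or has a hard fault."""

    def __init__(self, name, faulty=False):
        self.name = name
        self.faulty = faulty
        self.configured = True

    def execute(self, instruction):
        if self.faulty:
            raise CoreFault(self.name)
        return instruction()


def execute_with_alternate(cores, primary, instruction):
    """Try the primary core twice, then retry the instruction on another core."""
    for _ in range(2):  # first attempt plus one retry on the same core
        try:
            return primary.execute(instruction)
        except CoreFault:
            pass
    # The fault repeated, so treat it as hard and look for an alternate core.
    for alt in cores:
        if alt is not primary and alt.configured:
            result = alt.execute(instruction)
            primary.configured = False  # deconfigure the failing core afterwards
            return result
    raise CoreFault("no alternate core available")


cores = [Core("core0", faulty=True), Core("core1")]
result = execute_with_alternate(cores, cores[0], lambda: 7)
```

After the call, the work has completed on the alternate core and the failing core is marked as deconfigured, so later work would no longer be scheduled on it.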

Dynamic processor deallocation

Dynamic processor deallocation enables automatic deconfiguration of processor cores when

patterns of recoverable core-related faults are detected. Dynamic processor deallocation

prevents a recoverable error from escalating to an unrecoverable system error, which might

otherwise result in an unscheduled server outage. Dynamic processor deallocation relies on

the service processor’s ability to use FFDC-generated recoverable error information to notify
