Chapter 4. Continuous availability and manageability 103

Draft Document for Review September 2, 2008 5:05 pm    4405ch04 Continuous availability and manageability.fm

L3 Array Protection

In addition to protection through ECC and Special Uncorrectable Error handling, the L3 cache

also incorporates technology to handle memory cell errors via a special cache line delete

algorithm. During system run-time, a correctable error is reported as a recoverable error to

the Service Processor. If an individual cache line reaches its predictive error threshold, it will

be dynamically deleted. The state of L3 cache line delete will be maintained in a deallocation

record, and will persist through future reboots. This ensures that cache lines varied offline by

the server will remain offline should the server be rebooted, and do not need to be

rediscovered each time. These faulty lines cannot then cause system operational problems.

A POWER6 processor-based system can dynamically delete up to 14 L3 cache lines. Again,

it is not likely that deletion of a few cache lines will adversely affect server performance. If this

total is reached, the L3 is marked for persistent deconfiguration on subsequent system

reboots until repair.
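
The cache line delete behavior described above can be sketched as a small model. This is an illustrative sketch only, not the firmware's implementation: the class and method names are hypothetical, and the per-line threshold value is an assumption, since the text does not state one. Only the limit of 14 deleted lines comes from the text.

```python
from dataclasses import dataclass, field

MAX_DELETED_LINES = 14   # from the text: up to 14 L3 cache lines can be deleted
ERROR_THRESHOLD = 3      # assumption: the text gives no threshold value


@dataclass
class L3DeallocationRecord:
    """Hypothetical persistent record; it outlives any single boot."""
    deleted_lines: set = field(default_factory=set)
    l3_deconfigured: bool = False


@dataclass
class L3Cache:
    record: L3DeallocationRecord
    error_counts: dict = field(default_factory=dict)  # per-boot counters

    def report_correctable_error(self, line: int) -> str:
        """Count a correctable error and delete the line at the threshold."""
        if line in self.record.deleted_lines:
            return "already-deleted"
        self.error_counts[line] = self.error_counts.get(line, 0) + 1
        if self.error_counts[line] < ERROR_THRESHOLD:
            return "recoverable"  # reported to the Service Processor
        self.record.deleted_lines.add(line)
        if len(self.record.deleted_lines) >= MAX_DELETED_LINES:
            # Limit reached: mark the L3 for deconfiguration until repair.
            self.record.l3_deconfigured = True
        return "line-deleted"

    def reboot(self) -> "L3Cache":
        """A reboot clears the counters but keeps the deallocation record."""
        return L3Cache(record=self.record)
```

In this model, a line deleted before a reboot is still listed in `record.deleted_lines` afterward, which mirrors the statement that deleted lines do not need to be rediscovered.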

While hardware scrubbing has been a feature in POWER main memory for many years,

POWER6 processor-based systems introduce a hardware-assisted L3 cache memory

scrubbing feature. All L3 cache memory is periodically addressed, and any address with an

ECC error is rewritten with the faulty data corrected. In this way, soft errors are automatically

removed from L3 cache memory, decreasing the chances of encountering multi-bit memory

errors.
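
The scrubbing pass described above can be illustrated with a toy model. The text does not specify the ECC code used in the L3 cache, so this sketch substitutes a simple Hamming(7,4) single-error-correcting code as a stand-in; all function names are hypothetical.

```python
DATA_POSITIONS = (3, 5, 6, 7)  # 1-based bit positions that hold data bits


def syndrome(word: int) -> int:
    """XOR of the 1-based positions of all set bits; 0 for a valid codeword."""
    s = 0
    for pos in range(1, 8):
        if (word >> (pos - 1)) & 1:
            s ^= pos
    return s


def encode(nibble: int) -> int:
    """Encode 4 data bits into a 7-bit Hamming codeword."""
    word = 0
    for i, pos in enumerate(DATA_POSITIONS):
        if (nibble >> i) & 1:
            word |= 1 << (pos - 1)
    s = syndrome(word)
    for p in (1, 2, 4):  # parity bits sit at positions 1, 2 and 4
        if s & p:
            word |= 1 << (p - 1)
    return word


def scrub(memory: list) -> int:
    """Visit every address and rewrite any word that has a single-bit error.

    Returns the number of words corrected during this pass.
    """
    corrected = 0
    for addr, word in enumerate(memory):
        s = syndrome(word)
        if s:  # a nonzero syndrome names the position of the flipped bit
            memory[addr] = word ^ (1 << (s - 1))
            corrected += 1
    return corrected
```

Because each pass removes single-bit soft errors before a second bit in the same word can flip, periodic scrubbing lowers the chance that an error grows into an uncorrectable multi-bit error.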

4.2.4 PCI Error Recovery

IBM estimates that PCI adapters can account for a significant portion of the hardware-based

errors on a large server. While servers that rely on boot-time diagnostics can identify failing

components to be replaced by hot-swap and reconfiguration, run-time errors pose a more

significant problem.

PCI adapters are generally complex designs involving extensive on-board instruction

processing, often on embedded microcontrollers. They tend to use industry-standard-grade

components of lower quality than other parts of the server. As a result, they may be more

likely to encounter internal microcode errors, and/or many of the hardware errors described

for the rest of the server.

The traditional means of handling these problems is through adapter internal error reporting

and recovery techniques in combination with operating system device driver management

and diagnostics. In some cases, an error in the adapter may cause transmission of bad data

on the PCI bus itself, resulting in a hardware-detected parity error and causing a global

machine check interrupt, eventually requiring a system reboot to continue.

In 2001, IBM introduced a methodology that uses a combination of system firmware and

Extended Error Handling (EEH) device drivers that allows recovery from intermittent PCI bus

errors. This approach works by recovering and resetting the adapter, thereby initiating system

recovery for a permanent PCI bus error. Rather than failing immediately, the faulty device is

frozen and restarted, preventing a machine check. POWER6 technology extends this

capability to PCIe bus errors, and includes expanded Linux support for EEH as well.
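
The freeze-and-restart idea behind EEH can be sketched as a small state machine. This is a conceptual illustration under stated assumptions, not the firmware or device driver interface: the class, state names, and reset limit are all hypothetical, and the text does not say how many resets are attempted before an error is treated as permanent.

```python
from enum import Enum


class SlotState(Enum):
    NORMAL = "normal"
    FROZEN = "frozen"   # I/O blocked; no machine check is raised
    FAILED = "failed"   # treated as a permanent error

MAX_RESETS = 3  # assumption: the text does not give a reset limit


class PciSlot:
    """Hypothetical model of one adapter slot under EEH-style recovery."""

    def __init__(self, name: str):
        self.name = name
        self.state = SlotState.NORMAL
        self.resets = 0

    def on_bus_error(self) -> None:
        """Firmware side: freeze the slot instead of failing the system."""
        if self.state is SlotState.NORMAL:
            self.state = SlotState.FROZEN

    def do_io(self) -> bool:
        """I/O succeeds only while the slot is in the normal state."""
        return self.state is SlotState.NORMAL

    def recover(self) -> SlotState:
        """Driver side: reset a frozen adapter and resume, up to a limit."""
        if self.state is not SlotState.FROZEN:
            return self.state
        if self.resets >= MAX_RESETS:
            self.state = SlotState.FAILED  # stop retrying; needs service
        else:
            self.resets += 1
            self.state = SlotState.NORMAL
        return self.state
```

The point of the model is that a bus error moves only the affected slot out of service while the rest of the system keeps running, and an intermittent error is cleared by a reset rather than a reboot.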
