Chapter 4. Continuous availability and manageability 103
Draft Document for Review September 2, 2008 5:05 pm    4405ch04 Continuous availability and manageability.fm
L3 Array Protection
In addition to protection through ECC and Special Uncorrectable Error handling, the L3 cache
also incorporates technology to handle memory cell errors via a special cache line delete
algorithm. During system run-time, a correctable error is reported as a recoverable error to
the Service Processor. If an individual cache line reaches its predictive error threshold, it will
be dynamically deleted. The state of L3 cache line delete will be maintained in a deallocation
record, and will persist through future reboots. This ensures that cache lines varied offline by
the server remain offline if the server is rebooted and do not need to be rediscovered each
time. As a result, these faulty lines cannot cause system operational problems.
A POWER6 processor-based system can dynamically delete up to 14 L3 cache lines. Again,
deleting a few cache lines is unlikely to adversely affect server performance. If the limit of
14 deleted lines is reached, the L3 is marked for persistent deconfiguration on subsequent
system reboots until it is repaired.
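The cache line delete behavior described above can be sketched as a small model. This is an illustration under stated assumptions, not the firmware's implementation: the class name L3CacheModel, the threshold value ERROR_THRESHOLD, and the JSON form of the deallocation record are all hypothetical. Only the limit of 14 deleted lines comes from the text.

```python
import json

MAX_DELETED_LINES = 14   # limit stated in the text for a POWER6 system
ERROR_THRESHOLD = 3      # hypothetical value; the text gives no number


class L3CacheModel:
    """Toy model of predictive L3 cache line delete (illustration only)."""

    def __init__(self, record_json=None):
        # The deallocation record persists across reboots. Here it is a JSON
        # string; how real firmware stores it is not described in the text.
        record = json.loads(record_json) if record_json else {}
        self.deleted = set(record.get("deleted", []))
        self.marked_for_deconfig = record.get("marked_for_deconfig", False)
        # An L3 marked in a previous run comes up deconfigured after reboot.
        self.deconfigured = self.marked_for_deconfig
        self.error_counts = {}

    def report_correctable_error(self, line):
        """Count one correctable error; delete the line at the threshold."""
        if self.deconfigured or self.marked_for_deconfig or line in self.deleted:
            return
        self.error_counts[line] = self.error_counts.get(line, 0) + 1
        if self.error_counts[line] >= ERROR_THRESHOLD:
            self.deleted.add(line)
            del self.error_counts[line]
            if len(self.deleted) >= MAX_DELETED_LINES:
                # Limit reached: deconfigure the L3 on the next reboot.
                self.marked_for_deconfig = True

    def save_record(self):
        """Serialize the deallocation record for the next boot."""
        return json.dumps({
            "deleted": sorted(self.deleted),
            "marked_for_deconfig": self.marked_for_deconfig,
        })
```

Passing the string returned by save_record() to a new L3CacheModel stands in for a reboot: lines deleted earlier are offline from the start and do not have to be rediscovered.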
While hardware scrubbing has been a feature in POWER main memory for many years,
POWER6 processor-based systems introduce a hardware-assisted L3 cache memory
scrubbing feature. All L3 cache memory is periodically addressed, and any address with an
ECC error is rewritten with the faulty data corrected. In this way, soft errors are automatically
removed from L3 cache memory, decreasing the chances of encountering multi-bit memory
errors.
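The scrubbing idea can be illustrated in software. The text does not describe the ECC scheme used by the L3 hardware, so the sketch below substitutes a textbook Hamming(7,4) single-error-correcting code as a stand-in; the function names and the list-based memory are assumptions made for the example.

```python
DATA_POSITIONS = (3, 5, 6, 7)    # Hamming(7,4): parity bits sit at 1, 2, 4


def syndrome(word):
    """XOR of the 1-based positions of all set bits; 0 means no error."""
    s = 0
    for pos in range(1, 8):
        if word[pos]:
            s ^= pos
    return s


def encode(nibble):
    """Encode 4 data bits as a 7-bit codeword (list index 0 is unused)."""
    word = [0] * 8
    for k, pos in enumerate(DATA_POSITIONS):
        word[pos] = (nibble >> k) & 1
    s = syndrome(word)
    for pos in (1, 2, 4):
        # Set parity bits so that the syndrome of the full word is 0.
        word[pos] = 1 if s & pos else 0
    return word


def decode(word):
    """Extract the 4 data bits from a codeword."""
    return sum(word[pos] << k for k, pos in enumerate(DATA_POSITIONS))


def scrub(memory):
    """Visit every address and rewrite any word with a single-bit error.

    Returns the list of addresses that were corrected.
    """
    corrected = []
    for addr, word in enumerate(memory):
        s = syndrome(word)
        if s:
            word[s] ^= 1         # the syndrome is the failing bit position
            corrected.append(addr)
    return corrected
```

Correcting a soft error while it is still a single-bit error matters because a second flipped bit in the same word would exceed what this kind of code can correct, which is the multi-bit case that scrubbing aims to make less likely.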
4.2.4 PCI Error Recovery
IBM estimates that PCI adapters can account for a significant portion of the hardware-based
errors on a large server. While servers that rely on boot-time diagnostics can identify failing
components to be replaced by hot-swap and reconfiguration, run-time errors pose a more
significant problem.
PCI adapters are generally complex designs involving extensive on-board instruction
processing, often on embedded microcontrollers. They tend to use industry-standard grade
components of lower quality than other parts of the server. As a result, they may be more
likely to encounter internal microcode errors, many of the hardware errors described for the
rest of the server, or both.
The traditional means of handling these problems is through adapter internal error reporting
and recovery techniques in combination with operating system device driver management
and diagnostics. In some cases, an error in the adapter may cause transmission of bad data
on the PCI bus itself, resulting in a hardware-detected parity error and causing a global
machine check interrupt, eventually requiring a system reboot to continue.
In 2001, IBM introduced a methodology that combines system firmware with Extended Error
Handling (EEH) device drivers to allow recovery from intermittent PCI bus errors. This
approach works by recovering and resetting the adapter, thereby initiating system
recovery for a permanent PCI bus error. Rather than failing immediately, the faulty device is
frozen and restarted, preventing a machine check. POWER6 technology extends this
capability to PCIe bus errors and includes expanded Linux support for EEH.
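The freeze-and-restart flow can be pictured as a small state machine. This is a conceptual sketch only and does not reflect the actual firmware or device driver interfaces: the names PciSlot, SlotState, and on_bus_error, the reset callback, and the reset budget MAX_RESETS are all hypothetical.

```python
from enum import Enum


class SlotState(Enum):
    NORMAL = "normal"
    FROZEN = "frozen"    # I/O to the adapter is blocked during recovery
    FAILED = "failed"    # treated as a permanent error


MAX_RESETS = 3           # hypothetical budget; the text gives no number


class PciSlot:
    """Toy model of EEH-style recovery for one adapter slot."""

    def __init__(self, reset_adapter):
        # reset_adapter is a callable that resets the adapter and returns
        # True if the adapter is healthy afterwards.
        self._reset_adapter = reset_adapter
        self.state = SlotState.NORMAL
        self.resets = 0

    def on_bus_error(self):
        """Freeze the slot and try to recover instead of machine-checking."""
        self.state = SlotState.FROZEN
        while self.resets < MAX_RESETS:
            self.resets += 1
            if self._reset_adapter():
                self.state = SlotState.NORMAL
                return True
        # Resets did not help: handle the error as permanent.
        self.state = SlotState.FAILED
        return False
```

In this model an intermittent error costs one reset and the slot returns to NORMAL, while an adapter that never recovers ends in FAILED without affecting the rest of the system.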