Chapter 4. Continuous availability and manageability 105
Draft Document for Review September 2, 2008 5:05 pm 4405ch04 Continuous availability and manageability.fm
4.3.1 Detecting errors
The first and most crucial component of a solid serviceability strategy is the ability to
detect errors accurately and effectively when they occur. Not every error threatens
system availability, but an error that goes undetected can cause problems because the
system has no opportunity to evaluate it and act if necessary. POWER6 processor-based
systems employ error detection mechanisms, inspired by System z servers, that extend
from processor cores and memory to power supplies and hard drives.
Service Processor
The Service Processor is a separately powered microprocessor that is independent of the main
instruction-processing complex. The Service Processor enables POWER Hypervisor and
Hardware Management Console surveillance, selected remote power control, environmental
monitoring, reset and boot features, and remote maintenance and diagnostic activities, including
console mirroring. On systems without a Hardware Management Console, the Service
Processor can place calls to report surveillance failures with the POWER Hypervisor, critical
environmental faults, and critical processing faults even when the main processing unit is
inoperable. The Service Processor provides services common to modern computers, such as:
Environmental monitoring
– The Service Processor monitors the server’s built-in temperature sensors, sending
instructions to the system fans to increase rotational speed when the ambient
temperature is above the normal operating range.
– Using an architected operating system interface, the Service Processor notifies the
operating system of potential environment-related problems (for example, with air
conditioning or air circulation around the system) so that the system administrator
can take appropriate corrective actions before a critical failure threshold is reached.
– The Service Processor can also post a warning and initiate an orderly system
shutdown for a variety of other conditions:
• When the operating temperature exceeds the critical level (for example, because of a
failure of air conditioning or air circulation around the system).
• When the system fan speed is out of operational specification (for example, because of
a fan failure). In this case, the system can increase the speed of the redundant fans to
compensate for the failure or take other actions.
• When the server input voltages are out of operational specification.
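The environmental monitoring rules above can be sketched as a small decision function. This is only an illustration of the described behavior, not Service Processor firmware: the names Limits, Action, and evaluate are hypothetical, and every threshold value is an assumed placeholder rather than a POWER6 specification. The choice to shut down only when no healthy fan remains is also an assumption; the text says only that the system can speed up redundant fans or take other actions.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    INCREASE_FAN_SPEED = "increase_fan_speed"
    WARN_OS = "warn_os"
    ORDERLY_SHUTDOWN = "orderly_shutdown"


@dataclass(frozen=True)
class Limits:
    # All values are illustrative assumptions, not POWER6 specifications.
    normal_max_c: float = 35.0   # top of the normal operating range
    critical_c: float = 45.0     # critical temperature level
    fan_min_rpm: int = 2000      # below this a fan counts as out of spec
    volt_min: float = 180.0      # lowest acceptable input voltage
    volt_max: float = 264.0      # highest acceptable input voltage


def evaluate(temp_c, fan_rpms, input_volts, limits=Limits()):
    """Return the ordered, de-duplicated actions for one sensor sample."""
    actions = []

    def add(action):
        if action not in actions:
            actions.append(action)

    # Temperature: speed up fans and warn early, shut down at the critical level.
    if temp_c > limits.critical_c:
        add(Action.WARN_OS)
        add(Action.ORDERLY_SHUTDOWN)
    elif temp_c > limits.normal_max_c:
        add(Action.INCREASE_FAN_SPEED)
        add(Action.WARN_OS)

    # Fans: compensate with the remaining fans while at least one is healthy.
    failed = [rpm for rpm in fan_rpms if rpm < limits.fan_min_rpm]
    if failed:
        add(Action.WARN_OS)
        if len(failed) < len(fan_rpms):
            add(Action.INCREASE_FAN_SPEED)
        else:
            add(Action.ORDERLY_SHUTDOWN)  # assumed policy: no healthy fan left

    # Input voltage: out of specification leads to a warning and shutdown.
    if not (limits.volt_min <= input_volts <= limits.volt_max):
        add(Action.WARN_OS)
        add(Action.ORDERLY_SHUTDOWN)

    return actions
```

For example, evaluate(38.0, [3000, 3000], 230.0) yields a fan speed increase plus an operating system warning, while a sample with one stopped fan yields a warning and a speed increase on the remaining fans.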
Mutual Surveillance
– The Service Processor monitors the operation of the POWER Hypervisor firmware
during the boot process and watches for loss of control during system operation. It also
allows the POWER Hypervisor to monitor Service Processor activity. The Service
Processor can take appropriate action, including calling for service, when it detects that the
POWER Hypervisor firmware has lost control. Likewise, the POWER Hypervisor can
request a Service Processor repair action if necessary.
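Mutual surveillance of this kind is commonly built on heartbeats: each side records when it last heard from the other and treats a long silence as loss of control. The sketch below illustrates that general technique only. The source does not describe the actual mechanism, so the Watchdog and MutualSurveillance classes, the event strings, and the 5-second timeout are all hypothetical. Time is passed in explicitly so that the example is deterministic.

```python
class Watchdog:
    """Tracks heartbeats from one peer and reports when they stop."""

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s
        self.last_beat = None  # no heartbeat seen yet

    def beat(self, now):
        # Record the time of the latest heartbeat from the peer.
        self.last_beat = now

    def lost_control(self, now):
        if self.last_beat is None:
            return False  # surveillance has not started yet
        return now - self.last_beat > self.timeout_s


class MutualSurveillance:
    """Each side watches the other; the timeout is an assumed value."""

    def __init__(self, timeout_s=5.0):
        self.sp_watches_hv = Watchdog(timeout_s)  # Service Processor's view
        self.hv_watches_sp = Watchdog(timeout_s)  # POWER Hypervisor's view

    def check(self, now):
        events = []
        if self.sp_watches_hv.lost_control(now):
            # The Service Processor acts, for example by calling for service.
            events.append("call_for_service")
        if self.hv_watches_sp.lost_control(now):
            # The hypervisor requests a Service Processor repair action.
            events.append("request_sp_repair")
        return events
```

A caller would invoke beat() each time a heartbeat arrives from the peer and check() on a periodic timer. If the hypervisor stops sending heartbeats while the Service Processor keeps sending them, only "call_for_service" is reported.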
Availability
– The auto-restart (reboot) option, when enabled, can reboot the system automatically
following an unrecoverable firmware error, firmware hang, hardware failure, or
environmentally induced (AC power) failure.
Fault Monitoring
– BIST (built-in self-test) checks processor, L3 cache, memory, and associated hardware
required for proper booting of the operating system, when the system is powered on at
the initial install or after a hardware configuration change (e.g., an upgrade). If a