Intel Processors Machine Check Architectures in Microsoft Windows
1. Introduction
Microsoft Windows generic Hardware Abstraction Layers (HALs) for Intel Architectures (halx86, halapic,
halmps, halia64) support the Machine Check
Architectures (MCA) for the Intel Pentium® Pro and Itanium processors. The HAL
enables Machine Check Exception (MCE) reporting for all implementation defined
errors.
2. Intel Pentium® Pro Processor Machine Check
The Machine Check Exception (MCE) is processor exception 18. The handler for Machine Check Exception is implemented as a task gate for maximum reliability of the exception handler. The HAL provides a generic exception handler for all errors that cause an exception. This handler reports the machine check exception code on the screen and causes the operating system to halt gracefully, reducing the possibility of persistent data corruption.
In addition, the HAL provides an MCA-specific interface that drivers can use to:
· Read the MCA banks to detect an error that does not generate an exception. One such case is when the bit controlling reporting of the machine check error for a specific bank (the MCi_CTL.EEj bit) is cleared. Some restartable errors also do not generate a Machine Check Exception and are logged in the MCA banks.
· Obtain control (possibly to log errors to NVRAM) when the Machine Check Exception handler is invoked, by providing two callback routines: ExceptionCallback and DpcCallback.
2.1 Machine Check Exception Handling
If the MCA exception handler detects only Intel Pentium® technology (style) MCE support on the platform, it does the following:
· If an MCA driver is registered with the HAL, call the MCA driver ExceptionCallback function, providing the contents of the P5_MC_ADDR and P5_MC_TYPE registers. This callback routine can log the register values in NVRAM and return.
· Call KeBugCheckEx() with the following 4 parameters to halt the system:
1. Low 32 bits of P5_MC_TYPE MSR
2. Always zero
3. High 32 bits of P5_MC_ADDR MSR
4. Low 32 bits of P5_MC_ADDR MSR
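The parameter layout listed above can be illustrated with a small sketch. It is not the HAL's code: BugcheckParams and p5_bugcheck_params are hypothetical names introduced here, and the real KeBugCheckEx() takes the four values as separate arguments rather than as a structure.

```c
#include <stdint.h>

/* Hypothetical container for the four bugcheck parameters (illustration only). */
typedef struct {
    uint32_t p1, p2, p3, p4;
} BugcheckParams;

/* Split the two Pentium-style MSR values in the order the list above gives. */
static BugcheckParams p5_bugcheck_params(uint64_t p5_mc_type, uint64_t p5_mc_addr)
{
    BugcheckParams bp;
    bp.p1 = (uint32_t)(p5_mc_type & 0xFFFFFFFFu);  /* low 32 bits of P5_MC_TYPE  */
    bp.p2 = 0;                                     /* always zero                */
    bp.p3 = (uint32_t)(p5_mc_addr >> 32);          /* high 32 bits of P5_MC_ADDR */
    bp.p4 = (uint32_t)(p5_mc_addr & 0xFFFFFFFFu);  /* low 32 bits of P5_MC_ADDR  */
    return bp;
}
```

Reading a bugcheck back is the reverse operation: parameters 3 and 4 together reconstruct the 64-bit P5_MC_ADDR value.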
If MCA support (Pentium Pro processor) on the platform is detected, the exception handler determines if the error is restartable. If not, it does the following:
· Call the MCA driver ExceptionCallback routine to give the MCA driver a chance to log the errors in NVRAM.
· Call KeBugCheckEx() with the following 4 parameters to halt the system:
1. MCA bank number that generated the Machine Check Exception
2. Address field from the MCi_ADDR MSR for this MCA bank
3. High 32 bits of the MCi_STATUS MSR for this MCA bank
4. Low 32 bits of the MCi_STATUS MSR for this MCA bank
If the error is restartable, the exception handler queues a DPC which, when called, reports the MCA bank error to the MCA driver through the DpcCallback routine.
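The Pentium Pro decision flow described in this section can be sketched as follows. This is a simplified model under stated assumptions, not the HAL implementation: all type and function names are hypothetical, the restartable flag is taken as an input because the text does not say how it is derived, and the "Address field" is assumed to be the low 32 bits of MCi_ADDR.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical outcome record; the real handler queues a DPC or calls
 * KeBugCheckEx() instead of returning a value. */
typedef enum { MCA_QUEUE_DPC, MCA_BUGCHECK } McaAction;

typedef struct {
    McaAction action;
    uint32_t  params[4];   /* meaningful only when action == MCA_BUGCHECK */
} McaOutcome;

/* Hypothetical stand-in for the driver's ExceptionCallback. */
typedef void (*SketchExceptionFn)(void *context, uint8_t bank, uint64_t status);

static int g_logged_banks = 0;   /* counts calls to the sample callback */

static void sample_exception_callback(void *context, uint8_t bank, uint64_t status)
{
    (void)context; (void)bank; (void)status;
    g_logged_banks++;            /* a real driver might log to NVRAM here */
}

static McaOutcome handle_mca_bank(int restartable, uint8_t bank,
                                  uint64_t mci_addr, uint64_t mci_status,
                                  SketchExceptionFn callback, void *context)
{
    McaOutcome out = { MCA_QUEUE_DPC, { 0, 0, 0, 0 } };

    if (restartable)
        return out;                           /* reported later via DpcCallback */

    if (callback != NULL)
        callback(context, bank, mci_status);  /* let the driver log first */

    out.action    = MCA_BUGCHECK;
    out.params[0] = bank;                                  /* MCA bank number   */
    out.params[1] = (uint32_t)(mci_addr & 0xFFFFFFFFu);    /* assumed: low half */
    out.params[2] = (uint32_t)(mci_status >> 32);          /* high 32 of status */
    out.params[3] = (uint32_t)(mci_status & 0xFFFFFFFFu);  /* low 32 of status  */
    return out;
}
```

Returning an outcome instead of halting keeps the sketch testable in user mode; only the ordering (driver callback first, then bugcheck) is taken from the text.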
3. Intel Itanium® Processor Machine Check
Machine checks, including Machine Check Aborts, cause IA64 processor
execution to vector to the Processor Abstraction Layer (PAL) PALE_CHECK code in
the IA64 ISA. When PALE_CHECK has finished processing, it passes control to the
System Abstraction Layer (SAL) SAL_ENTRY code in the IA64 ISA, which in turn
branches to the SAL MCA handler: SAL_CHECK.
Uncorrected machine checks refer to errors that cannot be
corrected at PAL or SAL layers. These may still be fully or partially
recoverable at the OS layer. At that point, the control flow differs between
corrected and uncorrected machine checks.
For corrected machine checks, the OS corrected error
interrupt handlers will be invoked some time after returning to the interrupted
process.
For uncorrected machine checks, SAL exposes an interface to
register an OS_MCA callback. After validating this entry point, SAL_CHECK branches
to it and provides an Error Record that will allow the OS to recover whenever
possible. The Error Record passed by SAL must comply, at a minimum, with the
V3.0 SAL specification (January 2001), Appendix B, “Error Record Structures”.
The HAL exposes interfaces for the OEMs to register a driver, and provides the
Error Record to the driver. This enables the OEMs to assist the generic HAL MCA
handler by attempting recovery of platform specific errors and maintaining the integrity
of the platform.
For details of the IA64 PAL, SAL and OS MCA handlers, please refer to http://www.intel.com/design/ia-64/manuals.
The IA64 Reference HAL provides an MCA-specific interface that can be used by drivers to:
· Register for delivery of an ExceptionCallback during non-corrected error processing. This callback returns an error severity value to the standard HAL OS_MCA, allowing OEM error recovery. The driver also registers a DpcCallback, which is invoked if the driver recovers during ExceptionCallback processing.
· Register for delivery of two additional DpcCallback routines. These are delivered during corrected error processing for CPU corrected errors and/or platform corrected errors.
· Read the Error Records during DpcCallback processing.
3.1 Machine Check Exception Handling
After collecting the MCA log, the standard HAL MCA handler calls
the MCA driver ExceptionCallback function, providing
the MCA record. This allows the MCA driver to process the log and make
an appropriate decision regarding the stability of the system. This
callback function returns an error severity value to let the HAL know whether it
should consider the event as fatal, recoverable or corrected by the MCA driver.
In case of a corrected event and if registered, the MCA
driver DpcCallback is then called for asynchronous
log collection by the driver.
In case of an OS_MCA uncorrected event, the HAL calls KeBugCheckEx() with the bugcheck code MACHINE_CHECK_EXCEPTION and the following 4 parameters to halt the system:
1. HAL IA64 MCA type, which can take the following values:
a. HAL_BUGCHECK_MCA_ASSERT = 1
b. HAL_BUGCHECK_MCA_GET_STATEINFO = 2
c. HAL_BUGCHECK_MCA_CLEAR_STATEINFO = 3
d. HAL_BUGCHECK_MCA_FATAL = 4
This last value is the one the MCA driver should expect; the other values are HAL internal error values.
2. MCA log address
3. MCA maximum log size
4. SAL status of the last SAL interface.
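When reading such a bugcheck, the first parameter can be mapped back to its symbolic name. The sketch below uses the four values listed above; the helper names hal_mca_type_name and hal_mca_type_is_internal are hypothetical and not part of any HAL interface.

```c
#include <stddef.h>

/* Values of bugcheck parameter 1, as listed above. */
enum {
    HAL_BUGCHECK_MCA_ASSERT          = 1,
    HAL_BUGCHECK_MCA_GET_STATEINFO   = 2,
    HAL_BUGCHECK_MCA_CLEAR_STATEINFO = 3,
    HAL_BUGCHECK_MCA_FATAL           = 4
};

/* Hypothetical helper: symbolic name for a parameter-1 value. */
static const char *hal_mca_type_name(unsigned long type)
{
    switch (type) {
    case HAL_BUGCHECK_MCA_ASSERT:          return "HAL_BUGCHECK_MCA_ASSERT";
    case HAL_BUGCHECK_MCA_GET_STATEINFO:   return "HAL_BUGCHECK_MCA_GET_STATEINFO";
    case HAL_BUGCHECK_MCA_CLEAR_STATEINFO: return "HAL_BUGCHECK_MCA_CLEAR_STATEINFO";
    case HAL_BUGCHECK_MCA_FATAL:           return "HAL_BUGCHECK_MCA_FATAL";
    default:                               return "unknown";
    }
}

/* Hypothetical helper: nonzero when the value is one of the HAL internal
 * error values rather than the expected HAL_BUGCHECK_MCA_FATAL. */
static int hal_mca_type_is_internal(unsigned long type)
{
    return type >= HAL_BUGCHECK_MCA_ASSERT &&
           type <= HAL_BUGCHECK_MCA_CLEAR_STATEINFO;
}
```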
4. MCA Interface for Drivers
The Intel generic HALs provide the following Intel Pentium® Pro and Itanium technology MCA specific interface for drivers:
· HalSetSystemInformation with the HAL_SET_INFORMATION_CLASS parameter set to HalMcaRegisterDriver. This allows a driver to register MCA callbacks with the HAL. Additionally, an Itanium driver may set the HAL_SET_INFORMATION_CLASS parameter to HalCmcRegisterDriver or HalCpeRegisterDriver for delivery of Corrected CPU errors (CMC) and Corrected Platform Errors (CPE).
· HalQuerySystemInformation with the HAL_QUERY_INFORMATION_CLASS parameter set to HalMcaLogInformation. This allows a driver to read the MCA log. Additionally, an Itanium driver may set the HAL_QUERY_INFORMATION_CLASS parameter to HalCmcLogInformation or HalCpeLogInformation to read MCA logs from Corrected CPU errors (CMC) and/or Corrected Platform Errors (CPE).
4.1 HalSetSystemInformation to register MCA Driver
NTSTATUS
HalSetSystemInformation(
    IN HAL_SET_INFORMATION_CLASS InformationClass,
    IN ULONG BufferSize,
    IN PVOID Buffer
    );
HalSetSystemInformation can be used to register an MCA driver with the HAL.
Parameters
InformationClass : Specify HalMcaRegisterDriver to register the MCA driver’s callback routines with the HAL. There are two callback routines: ExceptionCallback and DpcCallback. The ExceptionCallback routine is called by the Machine Check Exception handler during non-restartable error processing, before it bugchecks the system. The DpcCallback routine is called when the MCA error is restartable. For Itanium systems, specify HalCmcRegisterDriver to register a driver’s Corrected CPU Error DpcCallback routine, and HalCpeRegisterDriver to register a driver’s Corrected Platform Error DpcCallback routine.
BufferSize : Specifies the size in bytes of the buffer supplied by the caller.
Buffer : Pointer to a caller-supplied buffer of type MCA_DRIVER_INFO.
//
// Structure to record the callbacks from driver
//
typedef struct _MCA_DRIVER_INFO {
    PDRIVER_EXCPTN_CALLBACK ExceptionCallback; // NULL for Itanium corrected error registration
    PKDEFERRED_ROUTINE      DpcCallback;
    PVOID                   DeviceContext;
} MCA_DRIVER_INFO, *PMCA_DRIVER_INFO;
ExceptionCallback is the driver-supplied routine called when a Machine Check Exception occurs for a non-restartable error. This routine must not use any kernel services or spinlock routines; it is subject to the same constraints as a driver operating at the highest IRQL.
DpcCallback is a driver-supplied routine that is called for restartable errors that caused a Machine Check Exception. This routine is called at DISPATCH_LEVEL.
DeviceContext is the device-specific context for this MCA driver.
Include
ntddk.h
Return Value
HalSetSystemInformation returns STATUS_SUCCESS if the registration is successful.
Comments
HalSetSystemInformation must be called before an MCA driver can use any of the other interface routines.
Only one MCA driver can be registered with the HAL at any time.
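The single-registration rule can be modelled with a small user-mode mock. Everything in this sketch is a stand-in: McaDriverInfo only mirrors the shape of MCA_DRIVER_INFO, mock_register_mca_driver is not the HAL routine, and the failure value returned for a second registration is an assumption, since the text only specifies STATUS_SUCCESS for a successful registration.

```c
#include <stddef.h>

/* Stand-ins for the ntddk.h callback types, so the sketch builds outside the DDK. */
typedef void (*MockExceptionFn)(void *context, const void *exception);
typedef void (*MockDpcFn)(void *context);

typedef struct {
    MockExceptionFn ExceptionCallback;
    MockDpcFn       DpcCallback;
    void           *DeviceContext;
} McaDriverInfo;                          /* mirrors MCA_DRIVER_INFO */

enum { MOCK_SUCCESS = 0, MOCK_FAILURE = -1 };   /* hypothetical status values */

static McaDriverInfo g_registered;        /* the single registration slot */
static int           g_has_driver = 0;

/* Sample callbacks; a real ExceptionCallback must avoid kernel services. */
static void sample_exception_cb(void *context, const void *exception)
{
    (void)context; (void)exception;
}

static void sample_dpc_cb(void *context)
{
    (void)context;
}

/* Mock of HalSetSystemInformation(HalMcaRegisterDriver, size, info). */
static int mock_register_mca_driver(size_t size, const McaDriverInfo *info)
{
    if (info == NULL || size < sizeof(*info))
        return MOCK_FAILURE;              /* buffer missing or too small    */
    if (g_has_driver)
        return MOCK_FAILURE;              /* only one MCA driver at a time  */
    g_registered = *info;
    g_has_driver = 1;
    return MOCK_SUCCESS;
}
```

A driver would typically fill the structure once during initialization and treat any status other than success as "MCA reporting unavailable".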
4.2 HalQuerySystemInformation to get MCA logs
NTSTATUS
HalQuerySystemInformation(
IN HAL_QUERY_INFORMATION_CLASS InformationClass,
IN ULONG BufferSize,
OUT PVOID Buffer,
OUT PULONG ReturnedLength
);
HalQuerySystemInformation can be used to read the MCA bank status registers.
Parameters
InformationClass : Specify HalMcaLogInformation to read the current MCA error log. If any uncorrected Machine Check error is found, it is returned in the buffer. For Itanium systems, specify HalCmcLogInformation to read the current Corrected CPU Error Log and HalCpeLogInformation to read the current Corrected Platform Error Log.
BufferSize : Specifies the size in bytes of the buffer supplied by the caller.
Buffer : Points to a caller-supplied buffer of type MCA_EXCEPTION that will contain the information returned by this routine. For Itanium, the returned information will be compliant, at a minimum, with the V3.0 SAL specification (January 2001), Appendix B, “Error Record Structures”. For Pentium Pro, the information is as described below.
typedef union _MCI_STATS {
    struct {
        USHORT McaCod;
        USHORT MsCod;
        ULONG OtherInfo : 25;
        ULONG Damage : 1;
        ULONG AddressValid : 1;
        ULONG MiscValid : 1;
        ULONG Enabled : 1;
        ULONG UnCorrected : 1;
        ULONG OverFlow : 1;
        ULONG Valid : 1;
    } MciStats;
    ULONGLONG QuadPart;
} MCI_STATS, *PMCI_STATS;
typedef union _MCI_ADDR {
    struct {
        ULONG Address;
        ULONG Reserved;
    } MciAddr;
    ULONGLONG QuadPart;
} MCI_ADDR, *PMCI_ADDR;
typedef struct _MCA_EXCEPTION {
    ULONG VersionNumber;              // Version number of this record type
    MCA_EXCEPTION_TYPE ExceptionType; // MCA or MCE
    LARGE_INTEGER TimeStamp;          // exception recording timestamp
    ULONG ProcessorNumber;            // processor number
    union {
        struct {
            UCHAR BankNumber;         // bank number
            MCI_STATS Status;
            MCI_ADDR Address;
            ULONGLONG Misc;
        } Mca;
        struct {
            ULONGLONG McAddress;      // physical address for the cycle causing the error
            ULONGLONG McType;         // cycle specification causing the error
        } Mce;
    } u;
} MCA_EXCEPTION, *PMCA_EXCEPTION;
ReturnedLength : Specifies the number of bytes returned in Buffer.
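As an illustration of how the MCI_STATS fields map onto the 64-bit status value, the sketch below decodes a raw QuadPart with shifts and masks instead of bit-fields. It assumes the fields are packed from the least significant bit upward, in declaration order; DecodedStatus and decode_mci_status are hypothetical names introduced for this example.

```c
#include <stdint.h>

/* Hypothetical plain-field view of MCI_STATS (illustration only). */
typedef struct {
    uint16_t mca_cod;       /* bits 0-15  : McaCod       */
    uint16_t ms_cod;        /* bits 16-31 : MsCod        */
    uint32_t other_info;    /* bits 32-56 : OtherInfo    */
    int      damage;        /* bit 57     : Damage       */
    int      address_valid; /* bit 58     : AddressValid */
    int      misc_valid;    /* bit 59     : MiscValid    */
    int      enabled;       /* bit 60     : Enabled      */
    int      uncorrected;   /* bit 61     : UnCorrected  */
    int      overflow;      /* bit 62     : OverFlow     */
    int      valid;         /* bit 63     : Valid        */
} DecodedStatus;

/* Decode a raw status value, assuming low-to-high field packing. */
static DecodedStatus decode_mci_status(uint64_t raw)
{
    DecodedStatus d;
    d.mca_cod       = (uint16_t)(raw & 0xFFFFu);
    d.ms_cod        = (uint16_t)((raw >> 16) & 0xFFFFu);
    d.other_info    = (uint32_t)((raw >> 32) & 0x1FFFFFFu);  /* 25 bits */
    d.damage        = (int)((raw >> 57) & 1u);
    d.address_valid = (int)((raw >> 58) & 1u);
    d.misc_valid    = (int)((raw >> 59) & 1u);
    d.enabled       = (int)((raw >> 60) & 1u);
    d.uncorrected   = (int)((raw >> 61) & 1u);
    d.overflow      = (int)((raw >> 62) & 1u);
    d.valid         = (int)((raw >> 63) & 1u);
    return d;
}
```

A driver would normally check valid first and only consult the Address and Misc members of the record when address_valid or misc_valid is set.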
Include
ntddk.h
Return Value
HalQuerySystemInformation returns STATUS_SUCCESS if an error log exists.
Comments
This function returns the first error. It is the MCA driver's responsibility to call this routine again to see if any more errors are available.
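The repeated-call pattern can be sketched with a user-mode mock. The names MockLog, mock_set_pending, mock_query_mca_log and drain_mca_logs are hypothetical, as is the MOCK_NO_LOG value; the text only states that STATUS_SUCCESS is returned while an error log exists.

```c
#include <stddef.h>

/* Hypothetical, simplified log entry; the real record is MCA_EXCEPTION. */
typedef struct {
    unsigned           bank;
    unsigned long long status;
} MockLog;

enum { MOCK_LOG_SUCCESS = 0, MOCK_NO_LOG = -1 };   /* hypothetical status values */

#define MAX_PENDING 8
static MockLog g_pending[MAX_PENDING];   /* logs the mock HAL will hand out */
static size_t  g_count = 0, g_next = 0;

/* Test helper: load the mock with pending logs (at most MAX_PENDING). */
static void mock_set_pending(const MockLog *logs, size_t n)
{
    size_t i;
    g_count = (n < MAX_PENDING) ? n : MAX_PENDING;
    g_next  = 0;
    for (i = 0; i < g_count; i++)
        g_pending[i] = logs[i];
}

/* Mock of HalQuerySystemInformation(HalMcaLogInformation, ...):
 * returns one log per call while any remain. */
static int mock_query_mca_log(MockLog *out)
{
    if (g_next >= g_count)
        return MOCK_NO_LOG;
    *out = g_pending[g_next++];
    return MOCK_LOG_SUCCESS;
}

/* Keep querying until no log is left or dest is full; returns the count. */
static size_t drain_mca_logs(MockLog *dest, size_t capacity)
{
    size_t n = 0;
    while (n < capacity && mock_query_mca_log(&dest[n]) == MOCK_LOG_SUCCESS)
        n++;
    return n;
}
```

In a driver, a loop of this kind would typically run from the DpcCallback, which is called at DISPATCH_LEVEL.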