Intel Processors Machine Check Architectures in Microsoft Windows

 

1.  Introduction

Microsoft Windows generic Hardware Abstraction Layers (HALs) for Intel Architectures (halx86, halapic, halmps, halia64) support the Machine Check Architectures (MCA) for the Intel Pentium® Pro and Itanium processors. The HAL enables Machine Check Exception (MCE) reporting for all implementation-defined errors.

2. Intel Pentium® Pro Processor Machine Check

The Machine Check Exception (MCE) is processor exception 18. Its handler is implemented as a task gate for maximum reliability. The HAL provides a generic exception handler for all errors that cause an exception. This handler reports the machine check exception code on the screen and causes the operating system to halt gracefully, reducing the possibility of persistent data corruption.

 

In addition, the HAL provides an MCA-specific interface that drivers can use to:

·         Read the MCA banks to detect an error that does not generate an exception. One such case is when the bit that controls reporting of a machine check error for a specific bank (MCi_CTL.EEj) is cleared. Some restartable errors also do not generate a Machine Check Exception and are only logged in the MCA banks.

·         Obtain control when the Machine Check Exception handler is invoked (for example, to log errors to NVRAM) by providing two callback routines: ExceptionCallback and DpcCallback.

 

2.1    Machine Check Exception Handling

If the MCA exception handler detects only Intel Pentium® technology (style) MCE support on the platform, it does the following:

·         If an MCA driver is registered with the HAL, call the MCA driver ExceptionCallback function, passing the contents of the P5_MC_ADDR and P5_MC_TYPE registers. This callback routine can log the register values in NVRAM and return.

·         Call KeBugCheckEx() with the following 4 parameters to halt the system:

1.        Low  32 bits of P5_MC_TYPE MSR

2.        Always zero

3.        High 32 bits of P5_MC_ADDR MSR

4.        Low 32 bits of P5_MC_ADDR MSR
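
The parameter layout above can be sketched in a few lines of C. This is a user-mode illustration, not HAL source: the P5BugCheckParams structure and the p5_bugcheck_params function are hypothetical names, and the MSR values are passed in as plain 64-bit integers rather than read from hardware.

```c
#include <stdint.h>

/* Hypothetical container for the four KeBugCheckEx parameters (not a WDK type). */
typedef struct {
    uint32_t param1;  /* low 32 bits of P5_MC_TYPE */
    uint32_t param2;  /* always zero */
    uint32_t param3;  /* high 32 bits of P5_MC_ADDR */
    uint32_t param4;  /* low 32 bits of P5_MC_ADDR */
} P5BugCheckParams;

/* Split the two 64-bit MSR values into the parameters listed above. */
static P5BugCheckParams p5_bugcheck_params(uint64_t p5_mc_addr, uint64_t p5_mc_type)
{
    P5BugCheckParams p;
    p.param1 = (uint32_t)(p5_mc_type & 0xFFFFFFFFu);
    p.param2 = 0;
    p.param3 = (uint32_t)(p5_mc_addr >> 32);
    p.param4 = (uint32_t)(p5_mc_addr & 0xFFFFFFFFu);
    return p;
}
```

When reading the values back from a bugcheck screen, the inverse applies: parameters 3 and 4 together form the 64-bit P5_MC_ADDR value, that is, (param3 << 32) | param4.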


 

If MCA support (Pentium Pro processor) on the platform is detected, the exception handler determines if the error is restartable. If not, it does the following:

·         Call the MCA driver ExceptionCallback routine to give the MCA driver a chance to log the errors in NVRAM.

·         Call KeBugCheckEx() with the following 4 parameters to halt the system:

1.        MCA Bank number that generated Machine Check exception

2.        Address field from MCi_ADDR MSR for this MCA bank

3.        High 32 bits of MCi_STATUS MSR for this MCA bank

4.        Low 32 bits of MCi_STATUS MSR for this MCA bank

 

If the error is restartable, the exception handler queues a DPC. When the DPC runs, it reports the MCA bank error to the MCA driver through the DpcCallback routine.
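
The two paths for Pentium Pro MCA can be modeled as a small decision function. This is a sketch under stated assumptions, not the HAL implementation: BankSnapshot, HandlerResult, and model_mca_handler are hypothetical names, the test for whether an error is restartable is not described in this document and is therefore passed in as a flag, and the "Address field" is assumed to be the low 32 bits of MCi_ADDR, following the MCI_ADDR layout in section 4.2.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical snapshot of the bank that raised the exception. */
typedef struct {
    uint8_t  bank;        /* MCA bank number */
    uint64_t mci_status;  /* raw MCi_STATUS MSR value */
    uint64_t mci_addr;    /* raw MCi_ADDR MSR value */
} BankSnapshot;

typedef enum { ACTION_BUGCHECK, ACTION_QUEUE_DPC } HandlerAction;

typedef struct {
    HandlerAction action;
    bool     exception_callback_called;  /* driver got a chance to log to NVRAM */
    uint64_t params[4];                  /* bugcheck parameters; zero when a DPC is queued */
} HandlerResult;

/* Model of the decision described above. "restartable" is an input because
 * this document does not say how the HAL derives it. */
static HandlerResult model_mca_handler(const BankSnapshot *b, bool restartable,
                                       bool driver_registered)
{
    HandlerResult r = { ACTION_QUEUE_DPC, false, { 0, 0, 0, 0 } };

    if (restartable) {
        /* The queued DPC later reports the bank error through DpcCallback. */
        return r;
    }

    /* Non-restartable: let the driver log first, then halt the system. */
    r.exception_callback_called = driver_registered;
    r.action    = ACTION_BUGCHECK;
    r.params[0] = b->bank;
    r.params[1] = b->mci_addr & 0xFFFFFFFFu;    /* assumed: Address is the low 32 bits */
    r.params[2] = b->mci_status >> 32;          /* high 32 bits of MCi_STATUS */
    r.params[3] = b->mci_status & 0xFFFFFFFFu;  /* low 32 bits of MCi_STATUS */
    return r;
}
```

The point of the model is the ordering on the non-restartable path: the driver's ExceptionCallback runs before the bugcheck, so it is the driver's only opportunity to persist the error.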

3. Intel Itanium® Processor Machine Check

Machine checks, including Machine Check Aborts, cause IA64 processor execution to vector to the Processor Abstraction Layer (PAL) PALE_CHECK code in the IA64 ISA. When PALE_CHECK has finished processing, it passes control to the System Abstraction Layer (SAL) SAL_ENTRY code in the IA64 ISA, which in turn branches to the SAL MCA handler, SAL_CHECK.

 

Uncorrected machine checks are errors that cannot be corrected at the PAL or SAL layers. They may still be fully or partially recoverable at the OS layer. From this point, the control flow differs between corrected and uncorrected machine checks.

 

For corrected machine checks, the OS corrected error interrupt handlers will be invoked some time after returning to the interrupted process.

 

For uncorrected machine checks, SAL exposes an interface to register an OS_MCA callback. After validating this entry point, SAL_CHECK branches to it and provides an Error Record that will allow the OS to recover whenever possible. The Error Record passed by SAL must comply, at a minimum, with the V3.0 SAL specification (January 2001), Appendix B, “Error Record Structures”. The HAL exposes interfaces for the OEMs to register a driver, and provides the Error Record to the driver. This enables the OEMs to assist the generic HAL MCA handler by attempting recovery of platform specific errors and maintaining the integrity of the platform.

 

For details on the IA64 PAL, SAL, and OS MCA handlers, refer to:

·          http://www.intel.com/design/ia-64/manuals.

               

The IA64 Reference HAL provides an MCA specific interface that can be used by drivers to:

 

·         Register for delivery of an ExceptionCallback during non-corrected error processing. This callback returns an error severity value to the standard HAL OS_MCA, allowing OEM error recovery. The driver also registers a DpcCallback, which will be performed should the driver recover during ExceptionCallback processing.

·         Register for delivery of two additional DpcCallback routines. These are delivered during corrected error processing for CPU corrected errors and/or platform corrected errors.

·         Read the Error Records during DpcCallback processing.

 


3.1    Machine Check Exception Handling

After collecting the MCA log, the standard HAL MCA handler calls the MCA driver ExceptionCallback function and passes it the MCA record. This allows the MCA driver to process the log and assess the stability of the system. The callback function returns an error severity value that tells the HAL whether to treat the event as fatal, recoverable, or corrected by the MCA driver. For a corrected event, the MCA driver DpcCallback, if registered, is then called so that the driver can collect the log asynchronously.

 

In case of an OS_MCA uncorrected event, the HAL calls KeBugCheckEx() with the bugcheck code MACHINE_CHECK_EXCEPTION and the following 4 parameters to halt the system:

1.        HAL IA64 MCA type, which can be one of the following values:

a.        HAL_BUGCHECK_MCA_ASSERT = 1,

b.       HAL_BUGCHECK_MCA_GET_STATEINFO  = 2,

c.        HAL_BUGCHECK_MCA_CLEAR_STATEINFO = 3,

d.       HAL_BUGCHECK_MCA_FATAL = 4.

The MCA driver should expect this last value; the other values indicate HAL internal errors.

2.        MCA log address

3.        MCA maximum log size

4.        SAL status of the last SAL interface.
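
As an illustration of how a crash-analysis tool might interpret the first bugcheck parameter, the following sketch maps it to the names listed above. The four constants and their values come from this document; the enum tag and the two helper functions are hypothetical names introduced for the example.

```c
#include <stddef.h>

/* Values taken from the list above; the enum tag is a hypothetical name. */
enum HalIa64McaBugcheckType {
    HAL_BUGCHECK_MCA_ASSERT          = 1,
    HAL_BUGCHECK_MCA_GET_STATEINFO   = 2,
    HAL_BUGCHECK_MCA_CLEAR_STATEINFO = 3,
    HAL_BUGCHECK_MCA_FATAL           = 4
};

/* Return the symbolic name for bugcheck parameter 1, or NULL if unrecognized. */
static const char *mca_bugcheck_type_name(unsigned long long param1)
{
    switch (param1) {
    case HAL_BUGCHECK_MCA_ASSERT:          return "HAL_BUGCHECK_MCA_ASSERT";
    case HAL_BUGCHECK_MCA_GET_STATEINFO:   return "HAL_BUGCHECK_MCA_GET_STATEINFO";
    case HAL_BUGCHECK_MCA_CLEAR_STATEINFO: return "HAL_BUGCHECK_MCA_CLEAR_STATEINFO";
    case HAL_BUGCHECK_MCA_FATAL:           return "HAL_BUGCHECK_MCA_FATAL";
    default:                               return NULL;
    }
}

/* Nonzero when the value is one of the HAL internal error codes (1 to 3). */
static int mca_bugcheck_is_hal_internal(unsigned long long param1)
{
    return param1 >= HAL_BUGCHECK_MCA_ASSERT &&
           param1 <= HAL_BUGCHECK_MCA_CLEAR_STATEINFO;
}
```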

 

4.  MCA Interface for Drivers

The Intel generic HALs provide the following MCA-specific interfaces for Intel Pentium® Pro and Itanium drivers:

·         HalSetSystemInformation with the HAL_SET_INFORMATION_CLASS parameter set to HalMcaRegisterDriver. This allows a driver to register MCA callbacks with the HAL. Additionally, an Itanium driver may set the HAL_SET_INFORMATION_CLASS parameter to HalCmcRegisterDriver or HalCpeRegisterDriver for delivery of Corrected CPU errors (CMC) and Corrected Platform Errors (CPE).

·         HalQuerySystemInformation with the HAL_QUERY_INFORMATION_CLASS parameter set to HalMcaLogInformation. This allows a driver to read the MCA log. Additionally the Itanium driver may use a HAL_QUERY_INFORMATION_CLASS parameter set to HalCmcLogInformation or HalCpeLogInformation to read MCA logs from Corrected CPU errors (CMC) and/or Corrected Platform Errors (CPE).

 

4.1    HalSetSystemInformation to register MCA Driver

NTSTATUS
HalSetSystemInformation(
    IN HAL_SET_INFORMATION_CLASS InformationClass,
    IN ULONG  BufferSize,
    IN PVOID  Buffer
    );

 

HalSetSystemInformation can be used to register an MCA driver with the HAL.


Parameters

InformationClass : Specify HalMcaRegisterDriver to register the MCA driver’s callback routines with the HAL. There are two callback routines: ExceptionCallback and DpcCallback. The ExceptionCallback routine is called while the Machine Check Exception handler processes a non-restartable error, before it bugchecks the system. The DpcCallback routine is called when the MCA error is restartable. For Itanium systems, specify HalCmcRegisterDriver to register a driver’s Corrected CPU Error DpcCallback routine, and HalCpeRegisterDriver to register a driver’s Corrected Platform Error DpcCallback routine.

BufferSize : Specifies the size in bytes of the buffer supplied by the caller.

Buffer : Pointer to a caller-supplied buffer of type MCA_DRIVER_INFO

//
// Structure to record the callbacks from driver
//
typedef struct _MCA_DRIVER_INFO {
    PDRIVER_EXCPTN_CALLBACK ExceptionCallback;  // NULL for Itanium corrected error registration
    PKDEFERRED_ROUTINE      DpcCallback;
    PVOID                   DeviceContext;
} MCA_DRIVER_INFO, *PMCA_DRIVER_INFO;

 

ExceptionCallback is the driver-supplied routine to be called when a Machine Check Exception occurs for a non-restartable error. This routine must not use any kernel services or spinlock routines. It is subject to the same constraints as a driver operating at the highest IRQL.

DpcCallback is a driver-supplied routine that is called for restartable errors that caused a Machine Check Exception. This routine is called at DISPATCH_LEVEL.

DeviceContext is the device-specific context for this MCA driver.

 

Include

ntddk.h

Return Value

HalSetSystemInformation returns STATUS_SUCCESS if the registration is successful.

Comments

HalSetSystemInformation must be called before an MCA driver can use any of the other interface routines. Only one MCA driver can be registered with the HAL at any time.
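
The registration sequence can be illustrated with a user-mode mock. This is a sketch only: the typedefs below are simplified stand-ins so that the example compiles without ntddk.h, MockHalSetSystemInformation and MockHalMcaRegisterDriver are hypothetical substitutes for the real routine and class value, the callback signatures are simplified, and the status returned for a second registration is an assumption (this document only specifies STATUS_SUCCESS on success).

```c
#include <stddef.h>

/* Stand-ins so the sketch compiles outside the WDK. In a real driver these
 * come from ntddk.h, and the callback signatures here are simplified. */
typedef int NTSTATUS;
#define STATUS_SUCCESS      ((NTSTATUS)0)
#define STATUS_UNSUCCESSFUL ((NTSTATUS)0xC0000001)
typedef void (*PDRIVER_EXCPTN_CALLBACK)(void *Context, void *BankLog);
typedef void (*PKDEFERRED_ROUTINE)(void *Dpc, void *Context, void *Arg1, void *Arg2);

typedef struct _MCA_DRIVER_INFO {
    PDRIVER_EXCPTN_CALLBACK ExceptionCallback;
    PKDEFERRED_ROUTINE      DpcCallback;
    void                   *DeviceContext;
} MCA_DRIVER_INFO;

enum { MockHalMcaRegisterDriver = 1 };  /* mock value, not the ntddk.h value */

/* Mock HAL state: at most one MCA driver may be registered. */
static MCA_DRIVER_INFO g_driver;
static int g_driver_registered;

/* Mock of HalSetSystemInformation. The failure status is an assumption. */
static NTSTATUS MockHalSetSystemInformation(int InformationClass,
                                            unsigned long BufferSize, void *Buffer)
{
    if (InformationClass != MockHalMcaRegisterDriver || Buffer == NULL ||
        BufferSize < sizeof(MCA_DRIVER_INFO) || g_driver_registered) {
        return STATUS_UNSUCCESSFUL;
    }
    g_driver = *(MCA_DRIVER_INFO *)Buffer;
    g_driver_registered = 1;
    return STATUS_SUCCESS;
}

/* Driver-side callbacks (empty placeholders for the example). */
static void MyExceptionCallback(void *Context, void *BankLog)
{
    (void)Context; (void)BankLog;
}

static void MyDpcCallback(void *Dpc, void *Context, void *Arg1, void *Arg2)
{
    (void)Dpc; (void)Context; (void)Arg1; (void)Arg2;
}

/* What a driver would do once, during initialization. */
static NTSTATUS RegisterMcaDriver(void *DeviceContext)
{
    MCA_DRIVER_INFO info;
    info.ExceptionCallback = MyExceptionCallback;
    info.DpcCallback       = MyDpcCallback;
    info.DeviceContext     = DeviceContext;
    return MockHalSetSystemInformation(MockHalMcaRegisterDriver,
                                       (unsigned long)sizeof(info), &info);
}
```

In a real driver the same three fields would be filled in and passed to HalSetSystemInformation with HalMcaRegisterDriver; the driver should check the returned status before relying on either callback.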

 

4.2    HalQuerySystemInformation to get MCA logs

NTSTATUS
HalQuerySystemInformation(
    IN HAL_QUERY_INFORMATION_CLASS InformationClass,
    IN ULONG  BufferSize,
    OUT PVOID  Buffer,
    OUT PULONG  ReturnedLength
    );

 

HalQuerySystemInformation can be used to read the MCA bank status registers.


Parameters

InformationClass : Specify HalMcaLogInformation to read the current MCA error log. If any uncorrected Machine Check error is found, it is returned in the buffer. For Itanium systems, specify HalCmcLogInformation to read the current Corrected CPU Error Log and HalCpeLogInformation to read the current Corrected Platform Error Log.

 

BufferSize : Specifies the size in bytes of the buffer supplied by the caller.

Buffer : Points to a caller-supplied buffer of type MCA_EXCEPTION that will contain the information returned by this routine. For Itanium, the returned information will be compliant, at a minimum, with the V3.0 SAL specification (January 2001), Appendix B, “Error Record Structures”. For Pentium Pro, the information is as described below.

 

typedef union _MCI_STATS {
    struct {
        USHORT  McaCod;
        USHORT  MsCod;
        ULONG   OtherInfo     : 25;
        ULONG   Damage        : 1;
        ULONG   AddressValid  : 1;
        ULONG   MiscValid     : 1;
        ULONG   Enabled       : 1;
        ULONG   UnCorrected   : 1;
        ULONG   OverFlow      : 1;
        ULONG   Valid         : 1;
    } MciStats;

    ULONGLONG   QuadPart;
} MCI_STATS, *PMCI_STATS;

 

typedef union _MCI_ADDR {
    struct {
        ULONG   Address;
        ULONG   Reserved;
    } MciAddr;

    ULONGLONG   QuadPart;
} MCI_ADDR, *PMCI_ADDR;


 

 

typedef struct _MCA_EXCEPTION {
    ULONG               VersionNumber;    // Version number of this record type
    MCA_EXCEPTION_TYPE  ExceptionType;    // MCA or MCE
    LARGE_INTEGER       TimeStamp;        // exception recording timestamp
    ULONG               ProcessorNumber;  // processor number

    union {
        struct {
            UCHAR       BankNumber;       // bank number
            MCI_STATS   Status;
            MCI_ADDR    Address;
            ULONGLONG   Misc;
        } Mca;

        struct {
            ULONGLONG   McAddress;        // physical address for the cycle causing the error
            ULONGLONG   McType;           // cycle specification causing the error
        } Mce;
    } u;
} MCA_EXCEPTION, *PMCA_EXCEPTION;

 

ReturnedLength : Specifies the number of bytes returned in Buffer.
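
For Pentium Pro records, the 64-bit Status value can also be decoded with shifts and masks, which avoids depending on how a particular compiler lays out bit-fields. The sketch below assumes that the MCI_STATS bit-fields are allocated starting from the least significant bit, so that McaCod occupies bits 0 to 15 of QuadPart and Valid is bit 63. DecodedMciStatus and decode_mci_status are hypothetical names used only for this example.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical decoded view of MCI_STATS.QuadPart. */
typedef struct {
    uint16_t mca_cod;        /* bits 0-15  */
    uint16_t ms_cod;         /* bits 16-31 */
    uint32_t other_info;     /* bits 32-56 (25 bits) */
    bool     damage;         /* bit 57 */
    bool     address_valid;  /* bit 58 */
    bool     misc_valid;     /* bit 59 */
    bool     enabled;        /* bit 60 */
    bool     uncorrected;    /* bit 61 */
    bool     overflow;       /* bit 62 */
    bool     valid;          /* bit 63 */
} DecodedMciStatus;

/* Extract each field, assuming least-significant-bit-first allocation. */
static DecodedMciStatus decode_mci_status(uint64_t quad)
{
    DecodedMciStatus d;
    d.mca_cod       = (uint16_t)(quad & 0xFFFFu);
    d.ms_cod        = (uint16_t)((quad >> 16) & 0xFFFFu);
    d.other_info    = (uint32_t)((quad >> 32) & 0x1FFFFFFu);
    d.damage        = ((quad >> 57) & 1u) != 0;
    d.address_valid = ((quad >> 58) & 1u) != 0;
    d.misc_valid    = ((quad >> 59) & 1u) != 0;
    d.enabled       = ((quad >> 60) & 1u) != 0;
    d.uncorrected   = ((quad >> 61) & 1u) != 0;
    d.overflow      = ((quad >> 62) & 1u) != 0;
    d.valid         = ((quad >> 63) & 1u) != 0;
    return d;
}
```

A driver would typically check valid first and consult the Address and Misc fields of the record only when address_valid and misc_valid are set.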

Include

ntddk.h

Return Value

·         HalQuerySystemInformation returns STATUS_SUCCESS if an error log exists.

Comments

This function returns the first error. It is the MCA driver's responsibility to call this routine again to check whether more errors are available.
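
The repeated-call pattern described above can be sketched as a drain loop. This is a user-mode model, not driver code: MOCK_MCA_RECORD is a simplified stand-in for MCA_EXCEPTION, MockLogError and MockQueryMcaLog are hypothetical substitutes for the hardware and for HalQuerySystemInformation with HalMcaLogInformation, and the STATUS_NOT_FOUND value returned when no log remains is an assumption (the loop only relies on the status being something other than STATUS_SUCCESS).

```c
#include <stddef.h>

typedef int NTSTATUS;
#define STATUS_SUCCESS   ((NTSTATUS)0)
#define STATUS_NOT_FOUND ((NTSTATUS)0xC0000225)  /* assumed "no more logs" status */

/* Simplified stand-in for MCA_EXCEPTION. */
typedef struct {
    unsigned char      BankNumber;
    unsigned long long Status;
} MOCK_MCA_RECORD;

#define MOCK_MAX_PENDING 8
static MOCK_MCA_RECORD g_pending[MOCK_MAX_PENDING];
static size_t g_pending_count;  /* records logged so far */
static size_t g_next;           /* next record to hand out */

/* Test helper: pretend the hardware logged an error. Returns 0 when full. */
static int MockLogError(unsigned char bank, unsigned long long status)
{
    if (g_pending_count == MOCK_MAX_PENDING) {
        return 0;
    }
    g_pending[g_pending_count].BankNumber = bank;
    g_pending[g_pending_count].Status     = status;
    g_pending_count++;
    return 1;
}

/* Mock query: hands out one record per call, like the routine above. */
static NTSTATUS MockQueryMcaLog(unsigned long BufferSize, void *Buffer,
                                unsigned long *ReturnedLength)
{
    if (Buffer == NULL || ReturnedLength == NULL ||
        BufferSize < sizeof(MOCK_MCA_RECORD) || g_next == g_pending_count) {
        return STATUS_NOT_FOUND;
    }
    *(MOCK_MCA_RECORD *)Buffer = g_pending[g_next++];
    *ReturnedLength = (unsigned long)sizeof(MOCK_MCA_RECORD);
    return STATUS_SUCCESS;
}

/* Keep querying until the call stops returning STATUS_SUCCESS.
 * Stores up to maxBanks bank numbers and returns the number of records read. */
static unsigned DrainMcaLog(unsigned char *banksOut, unsigned maxBanks)
{
    MOCK_MCA_RECORD rec;
    unsigned long len = 0;
    unsigned count = 0;

    while (MockQueryMcaLog((unsigned long)sizeof(rec), &rec, &len) == STATUS_SUCCESS) {
        if (count < maxBanks) {
            banksOut[count] = rec.BankNumber;
        }
        count++;
    }
    return count;
}
```

A real driver would run the equivalent loop from its DpcCallback or from a periodic poll, processing each MCA_EXCEPTION record before asking for the next one.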