MCA Management Application


SUMMARY:

The "MCA Management Application" is a command-line application that queries, retrieves, parses and displays machine check error records on IA64 class computer systems running a 64-bit (IA64) version of Windows XP or Windows Server 2003 Family. It provides the same support for machine check exception information on X86 and X86-64 platforms running 32-bit (X86) versions of Windows Server 2003 Family and future releases of Windows. The sample shows system developers how to use the MCA infrastructure on Windows Server 2003 and Windows XP platforms to identify hardware problems. More sophisticated system management applications can extend the basic functionality of this sample to also log, analyze and report these hardware problems.

DETAILED INFORMATION:

Machine Check Architecture (MCA) is intended mainly to improve the RAS (reliability, availability and serviceability) features of large enterprise-class computer systems. Management applications therefore need to make intensive use of MCA technology to identify, analyze and solve hardware problems more efficiently. This sample application demonstrates the current OS MCA support on 32-bit and 64-bit versions of Windows Server 2003 Family and the 64-bit version of Windows XP, which can be used to achieve these maintenance goals.

The application primarily uses Windows Management Instrumentation (WMI) to query and retrieve MCA Error Records from the OS. MCA Error Records on 64-bit (IA64) versions of Windows Server 2003 Family and Windows XP follow a standard format defined in the Intel SAL 3.0 Specification. System applications can take advantage of this standardized hardware error reporting to manage their IA64 server systems more efficiently and accurately. See <mce.h> for the MCA Error Record definitions.

On X86 and X86-64 systems running 32-bit (X86) versions of Windows Server 2003 Family and future Windows releases, WMI returns Machine Check Exception information rather than the MCA Error Records used on IA64 class systems. The Machine Check Exception information is very similar on X86 and X86-64, and its definitions can also be found in <mce.h>.

Machine check errors can be classified into two main categories:

- Fatal errors
- Corrected errors

Fatal errors are hardware failures that cannot be recovered from, so the system bugchecks (blue-screen) and a reboot is required. The machine check error record that encapsulates the error information is made available to applications through WMI and the System Event-Log upon reboot. Therefore, management applications must query the OS via WMI after every reboot to check for possible fatal machine check errors.

Corrected errors are hardware failures that are corrected by the hardware and/or firmware (PAL/SAL). These errors are reported to the OS for informational purposes, so that they can be used to diagnose the failing hardware and prevent future problems. They do not cause the system to bugcheck, so management applications can retrieve the corrected machine check error records in real time through WMI.

This sample demonstrates how to query, retrieve, parse and display fatal and corrected machine check errors.

SYSTEM REQUIREMENTS:

Software:
This sample can be built for and run on both 32-bit and 64-bit versions of Windows Server 2003 Family and the 64-bit version of Windows XP, as long as the hardware requirements explained below are satisfied. Windows 2000, Windows XP (32-bit) and all earlier releases of Windows are not supported.

Hardware:
This sample can only be run on the following hardware platforms:   

BUILDING THE SAMPLE:

The sample can be built through the DDK build environment. Select the appropriate build window for your platform and run build in the "src\kernel\mca\mcamgmt" directory. When the build operation completes successfully, an executable file called "mcamgmt.exe" is created.

CODE TOUR:

File                 Description
Common.cpp           Common functions that are needed by both fatal and corrected error retrieval.
Common.h             Header file for Common.cpp
CorrectedEngine.cpp  Corrected error retrieval functions.
CorrectedEngine.h    Header file for CorrectedEngine.cpp
FatalEngine.cpp      Fatal error retrieval functions.
FatalEngine.h        Header file for FatalEngine.cpp
Mca.cpp              User interaction, argument parsing, and general control function of the application.
Mca.h                Header file for Mca.cpp
MCAObjectSink.cpp    Implementation for the MCAObjectSink class used for WMI notifications.
MCAObjectSink.h      Header file for MCAObjectSink.cpp

USAGE:

The sample can be run according to the following command-line usage:

mcamgmt [ /fatal | {/corrected <TimeOut>} | /? | /usage ]

/fatal                Queries the system for a fatal error (machine check abort).
/corrected <TimeOut>  Queries the system for a corrected error (CMC and CPE). The <TimeOut> parameter specifies the number of minutes to wait for an error to occur. If no corrected error is retrieved in <TimeOut> minutes, then the application exits.
/? or /usage          Shows the command-line usage of the tool.
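
The command-line forms above could be recognized with a small parser like the following. This is a minimal sketch, not the code in Mca.cpp: the names Mode, Options and parseArgs are hypothetical, and it assumes that the switches are matched exactly as written (case-sensitive) and that <TimeOut> is a non-negative decimal integer.

```cpp
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical names for illustration; the sample's real parser may differ.
enum class Mode { Fatal, Corrected, Usage };

struct Options {
    Mode mode;
    unsigned long timeoutMinutes;  // only meaningful when mode == Mode::Corrected
};

// Parses the arguments that follow the program name.
// Returns false when they do not match the documented usage.
bool parseArgs(const std::vector<std::string>& args, Options& out) {
    if (args.size() == 1 && args[0] == "/fatal") {
        out.mode = Mode::Fatal;
        out.timeoutMinutes = 0;
        return true;
    }
    if (args.size() == 1 && (args[0] == "/?" || args[0] == "/usage")) {
        out.mode = Mode::Usage;
        out.timeoutMinutes = 0;
        return true;
    }
    if (args.size() == 2 && args[0] == "/corrected") {
        const std::string& text = args[1];
        // Assumption: <TimeOut> must consist of decimal digits only.
        if (text.empty() ||
            text.find_first_not_of("0123456789") != std::string::npos) {
            return false;
        }
        unsigned long minutes = 0;
        try {
            minutes = std::stoul(text);
        } catch (const std::out_of_range&) {
            return false;  // value does not fit in unsigned long
        }
        out.mode = Mode::Corrected;
        out.timeoutMinutes = minutes;
        return true;
    }
    return false;  // anything else is a usage error
}
```

A caller would typically build the vector from argv + 1 and print the usage text whenever parseArgs returns false or the resulting mode is Mode::Usage.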

WHAT THE SAMPLE DOES NOT DEMONSTRATE:

This sample does not demonstrate how to parse the MCA Error Record or MCA Exception data into a more readable and friendly description, how to log this data to a file or any other data repository, or how to analyze the data to predict and prevent future system failures. However, these more sophisticated features can be built on the basic infrastructure provided in this sample.
 

NOTES:

Keep in mind that for this sample to retrieve, parse and display an MCA Error Record or MCA Exception, a machine check error must actually occur on your system. Otherwise, you will not get any data from the system. For corrected error retrieval, the application simply exits after <TimeOut> minutes if no corrected machine check error occurs during that time. For fatal error retrieval, the application simply reports that no error record is present if no fatal machine check error occurred before the reboot.
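
The bounded wait used for corrected errors can be pictured with the sketch below. It is only an illustration of the timeout behavior: the sample itself receives errors through WMI notifications (see MCAObjectSink), whereas this sketch swaps in a simple polling loop so that it stays self-contained. The function name waitForCorrectedError and the errorAvailable callback are hypothetical.

```cpp
#include <algorithm>
#include <chrono>
#include <functional>
#include <thread>

// Hypothetical helper, not part of the sample. Polls errorAvailable until it
// returns true or until timeout has elapsed. Returns true if an error was
// seen before the deadline, false if the wait timed out.
bool waitForCorrectedError(const std::function<bool()>& errorAvailable,
                           std::chrono::milliseconds timeout,
                           std::chrono::milliseconds pollInterval) {
    using Clock = std::chrono::steady_clock;
    const Clock::time_point deadline = Clock::now() + timeout;

    for (;;) {
        if (errorAvailable()) {
            return true;  // a corrected error arrived in time
        }
        const Clock::time_point now = Clock::now();
        if (now >= deadline) {
            return false;  // timed out; the caller exits without a record
        }
        // Never sleep past the deadline.
        const std::chrono::milliseconds remaining =
            std::chrono::duration_cast<std::chrono::milliseconds>(deadline - now);
        std::this_thread::sleep_for(std::min(pollInterval, remaining));
    }
}
```

With a <TimeOut> value given in minutes, a call might look like waitForCorrectedError(check, std::chrono::minutes(timeOut), std::chrono::seconds(1)), where check is whatever predicate reports that a record has been delivered.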

You may use the latest Hardware Compatibility Test (HCT) Kit provided by Microsoft to inject fatal and corrected machine check errors into IA64 systems. The HCT Kit includes a test tool (mcatest.exe) for verifying the MCA support of the hardware and firmware of IA64 class server systems. This test tool injects fatal and corrected machine check errors into the system and verifies the corresponding data records upon retrieval. You can use this tool to inject these errors and then use this sample to retrieve and display their data records.

REFERENCE:

For more information about the Machine Check Architecture (MCA) and Windows support for this technology, please visit the following websites.

Intel System Abstraction Layer (SAL) Specification 3.0 (see Section 4 - Machine Checks)
    http://www.intel.com/design/Itanium/Downloads/245359.htm

MCA Support in 64-bit Windows
    http://www.microsoft.com/hwdev/platform/64bit/MCAsupport.asp

MCA Implementation guide for 64-bit Windows
    http://www.microsoft.com/hwdev/platform/64bit/MCAimpguide.asp

Windows Logo program for 64-bit hardware (Latest HCT Server Tests)
    http://www.microsoft.com/hwdq/hwtest/devices/systems.asp?area=syssrv-srvr

Design guidelines for 64-bit systems
    http://www.microsoft.com/hwdev/platform/pcdesign/desguide/serverdg.asp#Design3


Copyright (c) 2002 Microsoft Corporation. All rights reserved.