The "MCA Management Application" is a command-line application that queries, retrieves, parses and displays machine check error records on IA64 class computer systems running a 64-bit (IA64) version of Windows XP or Windows Server 2003 Family. It provides the same support for machine check exception information on X86 and X86-64 platforms running 32-bit (X86) versions of Windows Server 2003 Family and future releases of Windows. The sample shows system developers how to use the MCA infrastructure on Windows Server 2003 and Windows XP platforms to identify hardware problems. More sophisticated system management applications can extend the basic functionality of this sample to log, analyze and report these hardware problems.
Machine Check Architecture (MCA) is intended to improve the RAS
(reliability, availability and serviceability) features of large enterprise-class
computer systems. Management applications therefore need to use MCA
technology extensively to identify, analyze and solve hardware problems more
efficiently. This sample application demonstrates the current OS MCA support on
32-bit and 64-bit versions of Windows Server 2003 Family and the 64-bit version of
Windows XP, which can be used to achieve these maintenance goals.
The application primarily uses Windows Management Instrumentation (WMI) to query
and retrieve MCA Error Records from the OS. MCA Error Records on 64-bit (IA64)
versions of Windows Server 2003 Family and Windows XP are standardized and are
defined in the "Intel SAL 3.0 Specification". System applications can take
advantage of this standardization of hardware error reporting to manage their
IA-64 server systems more efficiently and accurately. See <mce.h> for the MCA Error
Record definitions.
On X86 and X86-64 systems running 32-bit (X86) versions of Windows Server 2003
Family and future Windows releases, WMI returns Machine Check Exception
information rather than the MCA Error Records returned on IA64 class systems.
The Machine Check Exception information is very similar on X86 and X86-64, and
its definitions can also be found in <mce.h>.
Machine check errors can be classified into two main categories:
- Fatal errors
- Corrected errors
Fatal errors are hardware failures that cannot be recovered from, so a reboot
is required after the system bugchecks (blue-screen). The machine check error
record that encapsulates the error information is made available to applications
through WMI and the System Event-Log upon reboot. Therefore, management
applications must query the OS via WMI after every reboot to check for possible
fatal machine check errors.
Corrected errors are hardware failures that are corrected by the hardware and/or
firmware (PAL/SAL). These errors are reported to the OS for informational
purposes; the reports can be used to diagnose the failing hardware and prevent
future problems. These errors do not cause the system to bugcheck, so management
applications can retrieve corrected machine check error records in real time
through WMI.
This sample demonstrates how to query, retrieve, parse and display fatal and
corrected machine check errors.
Software:
This sample can be built for and run on both 32-bit and 64-bit versions of
Windows Server 2003 Family and the 64-bit version of Windows XP, as long as the
hardware requirements explained below are satisfied. Windows 2000, Windows XP
(32-bit) and all earlier releases of Windows are not supported.
Hardware:
This sample can only be run on the following hardware platforms:
The sample can be built in the DDK build environment. Select the appropriate build window for your platform and run build in the "src\kernel\mca\mcamgmt" directory. When the build completes successfully, an executable file named "mcamgmt.exe" is created.
File | Description |
Common.cpp | Common functions that are needed by both fatal and corrected error retrieval. |
Common.h | Header file for Common.cpp |
CorrectedEngine.cpp | Corrected error retrieval functions. |
CorrectedEngine.h | Header file for CorrectedEngine.cpp |
FatalEngine.cpp | Fatal error retrieval functions. |
FatalEngine.h | Header file for FatalEngine.cpp |
Mca.cpp | User interaction, argument parsing, and general control function of the application. |
Mca.h | Header file for Mca.cpp |
MCAObjectSink.cpp | Implementation for the MCAObjectSink class used for WMI notifications. |
MCAObjectSink.h | Header file for MCAObjectSink.cpp |
The sample can be run according to the following command-line usage:
mcamgmt [ /fatal | {/corrected <TimeOut>} | /? | /usage ]
/fatal | Queries the system for a fatal error (machine check abort). |
/corrected <TimeOut> | Queries the system for a corrected error (CMC and CPE). The <TimeOut> parameter specifies the number of minutes to wait for an error to occur. If no corrected error is retrieved in <TimeOut> minutes, then the application exits. |
/? or /usage | Shows the command-line usage of the tool. |
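The usage above can be sketched as a small argument parser. This is a minimal illustration, not the code in Mca.cpp: the names Mode, Options and ParseArgs are hypothetical, and two behaviors are assumptions made for the sketch, namely that switches are matched case-sensitively and that <TimeOut> is an unsigned decimal number of at most nine digits.

```cpp
#include <string>
#include <vector>

// Hypothetical names for illustration; the sample's own parsing is in Mca.cpp.
enum class Mode { Fatal, Corrected, Usage };

struct Options {
    Mode mode = Mode::Usage;
    unsigned long timeoutMinutes = 0;  // only meaningful for Mode::Corrected
};

// Parses the arguments that follow the program name.
// Returns true and fills 'out' when they match the documented usage,
// and returns false for anything else.
bool ParseArgs(const std::vector<std::string>& args, Options& out) {
    if (args.size() == 1) {
        if (args[0] == "/fatal") {
            out.mode = Mode::Fatal;
            out.timeoutMinutes = 0;
            return true;
        }
        if (args[0] == "/?" || args[0] == "/usage") {
            out.mode = Mode::Usage;
            out.timeoutMinutes = 0;
            return true;
        }
        return false;
    }
    if (args.size() == 2 && args[0] == "/corrected") {
        const std::string& text = args[1];
        // Assumption: <TimeOut> is an unsigned decimal number of minutes.
        // The nine-digit cap keeps std::stoul from overflowing.
        if (text.empty() || text.size() > 9 ||
            text.find_first_not_of("0123456789") != std::string::npos) {
            return false;
        }
        out.mode = Mode::Corrected;
        out.timeoutMinutes = std::stoul(text);
        return true;
    }
    return false;
}
```

A caller would typically build the vector from argv + 1 through argv + argc and print the usage text whenever ParseArgs returns false.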
This sample does not demonstrate how to parse the MCA Error Record or MCA
Exception data into a more readable and friendly description, how to log this
data to a file or any other data repository, or how to analyze the data to
predict and prevent future system failures. However, these more sophisticated
features can be built on top of the basic infrastructure provided in this sample.
For this sample to retrieve, parse and display an MCA Error Record or MCA Exception, a machine check error must actually occur on your system; otherwise, there is no data to retrieve. For corrected errors, the application exits after <TimeOut> minutes if no corrected machine check error occurs in that time. For fatal errors, the application reports that no error record is present if no fatal machine check error occurred before the reboot.
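The corrected-error behavior, waiting up to <TimeOut> minutes and exiting if nothing arrives, can be sketched as a wait with a deadline. Note the substitution: the sample receives WMI notifications through the MCAObjectSink class, whereas this portable sketch polls a caller-supplied function instead. WaitForRecord, tryGet and pollInterval are hypothetical names that do not appear in the sample.

```cpp
#include <cassert>
#include <chrono>
#include <string>
#include <thread>

// Calls 'tryGet' repeatedly until it returns true (a record was stored in
// 'out') or until 'timeout' has elapsed. Returns false on timeout.
// Assumption: 'tryGet' is a stand-in for whatever delivers a record; the
// real sample is notified through WMI rather than polling.
template <class TryGet>
bool WaitForRecord(TryGet tryGet,
                   std::chrono::milliseconds timeout,
                   std::chrono::milliseconds pollInterval,
                   std::string& out) {
    assert(pollInterval.count() > 0);
    // steady_clock is used so that wall-clock adjustments cannot
    // shorten or lengthen the wait.
    const auto deadline = std::chrono::steady_clock::now() + timeout;
    for (;;) {
        if (tryGet(out)) {
            return true;   // a record arrived before the deadline
        }
        if (std::chrono::steady_clock::now() >= deadline) {
            return false;  // timed out; the caller simply exits
        }
        std::this_thread::sleep_for(pollInterval);
    }
}
```

A caller holding the timeout in minutes could pass std::chrono::minutes(timeOut) directly, since it converts implicitly to milliseconds.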
You can use the latest Hardware Compatibility Test (HCT) Kit provided by Microsoft to inject fatal and corrected machine check errors into IA64 systems. The HCT Kit includes a test tool (mcatest.exe) for verifying the MCA support of the hardware and firmware of IA64 class server systems. This test tool injects fatal and corrected machine check errors into the system and verifies the corresponding data records upon retrieval. You can use this tool to inject these errors and then use this sample to retrieve and display their data records.
For more information about the Machine Check Architecture (MCA) and Windows support for this technology, please visit the following websites.
Name | Link |
Intel System Abstraction Layer (SAL) Specification 3.0 (Specifically, check out Section 4 - Machine Checks) | |
MCA Support in 64-bit Windows | http://www.microsoft.com/hwdev/platform/64bit/MCAsupport.asp |
MCA Implementation guide for 64-bit Windows | http://www.microsoft.com/hwdev/platform/64bit/MCAimpguide.asp |
Windows Logo program for 64-bit hardware (Latest HCT Server Tests) | http://www.microsoft.com/hwdq/hwtest/devices/systems.asp?area=syssrv-srvr |
Design guidelines for 64-bit systems | http://www.microsoft.com/hwdev/platform/pcdesign/desguide/serverdg.asp#Design3 |
Copyright (c) 2002 Microsoft Corporation. All rights reserved.