Using CPUID to detect CPU specs, reliable solution? - cpu

I'm trying to gather information about my CPU with __cpuid(). Though it is getting it right on my computer, when I run my program on my colleague's computer it is detecting the Intel Core2 Quad Q6600 to be hyper-threaded, though according to the specifications on Intel's own site it isn't.
__cpuid() is also detecting the wrong number of "logical cores", as can be seen here:
Programmatically detect number of physical processors/cores or if hyper-threading is active on Windows, Mac and Linux, where it is claimed that the Intel Xeon E5520 has 16 logical cores and 8 physical.
I tried running the code found in that thread on my own computer, an Intel i7 2600K, and it gives me the same numbers as for the Xeon.
So how reliable is __cpuid() really? From my own experience it doesn't seem to be that reliable. Have I got something very fundamental wrong?

There is almost certainly a gap in the [x2]APIC ids on your processor, meaning that some values of the APIC ids don't map to any logical processors. You should use the 0xB leaf of cpuid to find out. You can look at the reference Intel code and algorithm (https://software.intel.com/en-us/articles/intel-64-architecture-processor-topology-enumeration/) for the steps, but it boils down to calling with EAX=0xB, ECX=0 and obtaining in EBX the number of logical processors (threads) per core, and then calling cpuid again with EAX=0xB, ECX=1 and getting in EBX the number of logical processors per processor package.
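For illustration, here is a minimal sketch of those two calls using MSVC's __cpuid/__cpuidex intrinsics (an Intel CPU with leaf 0xB support is assumed; Intel's reference code does considerably more, including the gap-aware APIC id decoding):
#include <intrin.h>
#include <stdio.h>

int main(void)
{
    int regs[4]; /* EAX, EBX, ECX, EDX */

    __cpuid(regs, 0);
    if (regs[0] < 0xB) {                        /* highest basic leaf must be >= 0xB */
        printf("Leaf 0xB not supported; fall back to an older method\n");
        return 1;
    }

    __cpuidex(regs, 0xB, 0);                    /* sub-leaf 0: SMT level */
    int threadsPerCore = regs[1] & 0xFFFF;      /* EBX[15:0] */

    __cpuidex(regs, 0xB, 1);                    /* sub-leaf 1: core level */
    int logicalPerPackage = regs[1] & 0xFFFF;   /* EBX[15:0] */

    printf("logical processors per core:    %d\n", threadsPerCore);
    printf("logical processors per package: %d\n", logicalPerPackage);
    printf("physical cores per package:     %d\n",
           threadsPerCore ? logicalPerPackage / threadsPerCore : 0);
    return 0;
}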
The old method of using leaf 0x1 cannot account for APIC id gaps. Alas that's the sample code still given on the MSDN Visual C++ 2013 reference page (http://msdn.microsoft.com/en-us/library/hskdteyh.aspx), and it is incorrect for processors made in 2010 and thereafter, as you found out whether by using the code from MSDN or similarly incorrect code from elsewhere. The Wikipedia page on cpuid, which I've recently updated after struggling myself to understand the issue, now has a worked out example in the section on "Intel thread/core and cache topology" for enumerating topology on a processor with APIC id gaps, with additional details, including how to determine which bits of the APIC ids are actually used and which are "dead".
Given the code sample currently offered by Microsoft on their __cpuid() page, this is basically the same question as Logical CPU count return 16 instead of 4 because it's rooted in the same interpretation error of the Intel specs. As an explanation for MSDN's poor show, the code they offer worked okay before 2010 or so; Intel used to provide a similar method before the x2APIC was introduced as you can see in this old video/article: https://software.intel.com/en-us/articles/hyper-threading-technology-and-multi-core-processor-detection If you look at the various versions of the MSDN page on __cpuid, their code sample has basically remained the same since 2008...
As for the single hyper-threading detection bit, that is a longer story, already answered by me at Why does Hyper-threading get reported as supported on processors without it?. In a nutshell, that rather legacy bit tells you whether the processor package supports more than one logical processor, be it via hyper-threading or multi-core technology. The name of the bit is thus rather misleading.
Also, I suggest changing the title of your question to "Using CPUID to detect CPU topology, reliable solution?" because I found your question completely by accident: I was searching for Sandy Bridge cpuid dumps on Google when I came across it.

CPUID can be trusted; you just need to use it correctly, which in this case means enumerating the topology correctly. You get 16 logical processors because the field that number is read from reports the maximum the package can support, not how many are actually present. The value retrieved for cores is actually the logical count.
The code in that topic is very basic and meant as a starting point; on my system (i7 2720QM) I also get incorrect data, but using my own code that walks the topology as per the Intel CPUID mappings, I get correct results.
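As a quick illustration of that maximum-versus-actual point, the following sketch (MSVC intrinsics assumed) reads the field in question, CPUID.1:EBX[23:16], which is the maximum number of addressable logical-processor IDs in the package rather than the populated count - likely where the misleading 16 comes from on the Sandy Bridge i7 in the question:
#include <intrin.h>
#include <stdio.h>

int main(void)
{
    int regs[4]; /* EAX, EBX, ECX, EDX */
    __cpuid(regs, 1);
    int maxAddressable = (regs[1] >> 16) & 0xFF;  /* EBX[23:16]: a maximum, not the actual count */
    int httFlag        = (regs[3] >> 28) & 1;     /* EDX[28]: "more than one logical processor" bit */
    printf("HTT flag: %d, max addressable logical IDs per package: %d\n",
           httFlag, maxAddressable);
    return 0;
}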

Related

What does the following assembly instruction mean "mov rax,qword ptr gs:[20h]" [duplicate]

So I know what the following registers and their uses are supposed to be:
CS = Code Segment (used for IP)
DS = Data Segment (used for MOV)
ES = Destination Segment (used for MOVS, etc.)
SS = Stack Segment (used for SP)
But what are the following registers intended to be used for?
FS = "File Segment"?
GS = ???
Note: I'm not asking about any particular operating system -- I'm asking about what they were intended to be used for by the CPU, if anything.
There is what they were intended for, and what they are used for by Windows and Linux.
The original intention behind the segment registers was to allow a program to access many different (large) segments of memory that were intended to be independent and part of a persistent virtual store. The idea was taken from the 1966 Multics operating system, that treated files as simply addressable memory segments. No BS "Open file, write record, close file", just "Store this value into that virtual data segment" with dirty page flushing.
Our current 2010 operating systems are a giant step backwards, which is why they are called "Eunuchs". You can only address your process space's single segment, giving a so-called "flat (IMHO dull) address space". The segment registers on the x86-32 machine can still be used for real segment registers, but nobody has bothered (Andy Grove, former Intel president, had a rather famous public fit last century when he figured out, after all those Intel engineers had spent energy and his money implementing this feature, that nobody was going to use it. Go, Andy!)
AMD in going to 64 bits decided they didn't care if they eliminated Multics as a choice (that's the charitable interpretation; the uncharitable one is they were clueless about Multics) and so disabled the general capability of segment registers in 64 bit mode. There was still a need for threads to access thread local store, and each thread needed a pointer ... somewhere in the immediately accessible thread state (e.g., in the registers) ... to thread local store. Since Windows and Linux both used FS and GS (thanks Nick for the clarification) for this purpose in the 32 bit version, AMD decided to let the 64 bit segment registers (GS and FS) be used essentially only for this purpose (I think you can make them point anywhere in your process space; I don't know if the application code can load them or not). Intel in their panic to not lose market share to AMD on 64 bits, and Andy being retired, decided to just copy AMD's scheme.
It would have been architecturally prettier IMHO to make each thread's memory map have an absolute virtual address (e.g, 0-FFF say) that was its thread local storage (no [segment] register pointer needed!); I did this in an 8 bit OS back in the 1970s and it was extremely handy, like having another big stack of registers to work in.
So, the segment registers are now kind of like your appendix. They serve a vestigial purpose. To our collective loss.
Those that don't know history aren't doomed to repeat it; they're doomed to doing something dumber.
The registers FS and GS are segment registers. They have no processor-defined purpose, but instead are given purpose by the OSes running them. FS and GS are commonly used by OS kernels to access thread-specific memory: in 64-bit Windows the GS register points to operating-system-defined structures and is used to manage thread-specific memory, while the Linux kernel uses GS to access cpu-specific memory.
FS is used to point to the thread information block (TIB) in Windows processes.
One typical example is structured exception handling (SEH), which stores a pointer to a callback function at FS:[0x00].
GS is commonly used as a pointer to thread local storage (TLS).
One example you might have seen before is stack canary protection (stackguard); in gcc you might see something like this:
mov eax,gs:0x14
mov DWORD PTR [ebp-0xc],eax
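As a concrete Windows-side counterpart, here is a minimal sketch (MSVC and <intrin.h> assumed; 0x18 and 0x30 are the TIB's documented self-pointer offsets) that reads the TIB through FS or GS with compiler intrinsics and compares it against NtCurrentTeb():
#include <windows.h>
#include <intrin.h>
#include <stdio.h>

int main(void)
{
#ifdef _M_X64
    void *tib = (void *)__readgsqword(0x30);   /* GS:[0x30] = TIB self pointer on x64 */
#else
    void *tib = (void *)__readfsdword(0x18);   /* FS:[0x18] = TIB self pointer on x86 */
#endif
    printf("TIB via segment register: %p\n", tib);
    printf("TIB via NtCurrentTeb():   %p\n", (void *)NtCurrentTeb());
    return 0;
}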
TL;DR;
What is the “FS”/“GS” register intended for?
Simply to access data beyond the default data segment (DS). Exactly like ES.
The Long Read:
So I know what the following registers and their uses are supposed to be:
[...]
Well, almost, but DS is not 'some' Data Segment, it is the default one, where all operations take place by default (*1). This is where all default variables are located - essentially data and bss. It's in some ways part of the reason why x86 code is rather compact. All essential data, which is what is most often accessed (plus code and stack), is within 16 bit shorthand distance.
ES is used to access everything else (*2), everything beyond the 64 KiB of DS. Like the text of a word processor, the cells of a spreadsheet, or the picture data of a graphics program and so on. Contrary to what is often assumed, this data isn't accessed as frequently, so needing a prefix hurts less than longer address fields would.
Similarly, it's only a minor annoyance that DS and ES might have to be loaded (and reloaded) when doing string operations - this at least is offset by one of the best character handling instruction sets of its time.
What really hurts is when user data exceeds 64 KiB and operations have to be carried out on it. While some operations are simply done on a single data item at a time (think A=A*2), most require two (A=A*B) or three data items (A=B*C). If these items reside in different segments, ES will be reloaded several times per operation, adding quite some overhead.
In the beginning, with small programs from the 8 bit world (*3) and equally small data sets, it wasn't a big deal, but it soon became a major performance bottleneck - and more so a true pain in the ass for programmers (and compilers). With the 386 Intel finally delivered relief by adding two more segments, so any series of unary, binary or ternary operations, with elements spread out in memory, could take place without reloading ES all the time.
For programming (at least in assembly) and compiler design, this was quite a gain. Of course, there could have been even more, but with three the bottleneck was basically gone, so no need to overdo it.
Naming-wise, the letters F/G are simply alphabetic continuations after E. At least from the point of view of CPU design, nothing more is associated with them.
*1 - The usage of ES as the string destination is an exception, as two segment registers are simply needed. Without that, string instructions wouldn't be very useful - or would always need a segment prefix, which would kill one of their surprising features: the use of (non-repetitive) string instructions gives extreme performance due to their single-byte encoding.
*2 - So in hindsight 'Everything Else Segment' would have been a way better naming than 'Extra Segment'.
*3 - It's always important to keep in mind that the 8086 was only meant as a stopgap measure until the 8800 was finished, and was mainly intended for the embedded world to keep 8080/85 customers on board.
According to the Intel Manual, in 64-bit mode these registers are intended to be used as additional base registers in some linear address calculations. I pulled this from section 3.7.4.1 (pg. 86 in the 4 volume set). Usually when the CPU is in this mode, linear address is the same as effective address, because segmentation is often not used in this mode.
So in this flat address space, FS & GS play a role in addressing not just local data but certain operating system data structures (pg. 2793, section 3.2.4); thus these registers are intended to be used by the operating system, in whatever way its designers determine.
There is some interesting trickery when using overrides in both 32 & 64-bit modes but this involves privileged software.
From the perspective of "original intentions," that's tough to say other than they are just extra registers. When the CPU is in real address mode, this is like the processor is running as a high speed 8086 and these registers have to be explicitly accessed by a program. For the sake of true 8086 emulation you'd run the CPU in virtual-8086 mode and these registers would not be used.
The FS and GS segment registers were very useful in 16-bit real mode or 16-bit protected mode under 80386 processors, when there were just 64KB segments, for example in MS-DOS.
When the 80386 processor was introduced in 1985, PC computers with 640KB RAM under MS-DOS were common. RAM was expensive and PCs were mostly running under MS-DOS in real mode with a maximum of that amount of RAM.
So, by using FS and GS, you could effectively address two more 64KB memory segments from your program without the need to change the DS or ES registers whenever you needed to address segments other than those loaded in DS or ES. Essentially, Raffzahn has already replied that these registers are useful when working with elements spread out in memory, to avoid reloading other segment registers like ES all the time. But I would like to emphasize that this is only relevant for 64KB segments in real mode or 16-bit protected mode.
The 16-bit protected mode was a very interesting mode that provided a feature not seen since then. The segments could have lengths in the range from 1 to 65536 bytes. The range checking (the checking of the segment size) on each memory access was implemented by the CPU, which raised an interrupt on accessing memory beyond the size of the segment specified in the selector table for that segment. That prevented buffer overruns at the hardware level. You could allocate a segment of its own for each memory block (with a certain limitation on the total number). There were compilers like Borland Pascal 7.0 that produced programs running under MS-DOS in 16-bit protected mode, known as the DOS Protected Mode Interface (DPMI), using its own DOS extender.
The 80286 processor had 16-bit protected mode, but not the FS/GS registers. So a program first had to check whether it was running on an 80386 before using these registers, even in 16-bit real mode. Please see an example of the use of the FS and GS registers in a program for MS-DOS real mode.

Find out CPU model given a user-mode crash dump

I have a crash dump for my app. My app fails for some user saying "invalid instruction" while trying to execute some SSSE3 instruction I have there.
In WinDBG, how do I find out the CPU model, so I can find out its instruction set, and either support the instruction set, or update minimum hardware requirements of the app?
Here’s the output of !cpuid:
CP F/M/S Manufacturer MHz
0 16,4,3 <unavailable> 3000
1 16,4,3 <unavailable> 3000
2 16,4,3 <unavailable> 3000
3 16,4,3 <unavailable> 3000
The rest of the commands Google says might help (!errrec, !cpuinfo, !sysinfo) print "No export found".
You definitely aren't getting much information here. Although dumps don't usually have all of the raw CPU information, you should at least be seeing a manufacturer string. Oh well, let's look at what you do have to work with here...
The CP column gives the logical processor number, so you know you're dealing with a system that has 4 logical processors. Could be a quad-core, or perhaps a dual-core with HyperThreading.
F/M/S is Family/Model/Stepping, which are a common and fairly standard way to identify processors. As AMD says:
The processor Family identifies one or more processors as belonging to a group that possesses some common definition for software or hardware purposes. The Model specifies one instance of a processor family. The Stepping identifies a particular version of a specific model. Therefore, Family, Model and Stepping, when taken together, form a unique identification or signature for a processor.
It is most helpful if you have the manufacturer when you go looking for these things because the family numbers are pretty messy, but thankfully it's pretty clear that a family number of 16 (10 in hexadecimal) corresponds to an AMD processor (which should have a manufacturer string of "AuthenticAMD"). Specifically, it is the AMD K10, which is the Barcelona microarchitecture. That means there's no HyperThreading—this is just a native quad-core system.
We can narrow it down further by looking at the model. There were lots of different models based around the Barcelona core, branded variously as Athlon II, Opteron, Phenom, Phenom II, Sempron, Turion, and V-Series. Yours is model 4. This is where it gets kind of tricky, because I don't know of a good resource that lists the model numbers and steppings for various CPUs. You have to go directly to the manufacturer and wade through their manuals. For example, here is AMD's Revision Guide for the 10h Family. If you go to the "Processor Identification" section (it appears as a bookmark in the PDF for me), you see something that looks promising, but the information is certainly not presented in an easily digestible form. You get long hexadecimal values, out of which you have to extract individual bit fields corresponding to the Family (bits 8-11), Model (bits 4-7), and Stepping (bits 0-3).
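For what it's worth, here is a small sketch of that bit extraction from a raw CPUID leaf-1 EAX value (the 0x00100F43 constant is just an example value consistent with the dump's 16,4,3; the extended-family/model rules follow the vendors' documented conventions):
#include <stdio.h>

int main(void)
{
    unsigned eax = 0x00100F43;                    /* example raw value: family 10h, model 4, stepping 3 */
    unsigned stepping  =  eax        & 0xF;       /* bits 0-3   */
    unsigned model     = (eax >> 4)  & 0xF;       /* bits 4-7   */
    unsigned family    = (eax >> 8)  & 0xF;       /* bits 8-11  */
    unsigned extModel  = (eax >> 16) & 0xF;       /* bits 16-19 */
    unsigned extFamily = (eax >> 20) & 0xFF;      /* bits 20-27 */

    unsigned displayFamily = (family == 0xF) ? family + extFamily : family;
    unsigned displayModel  = (family == 0xF || family == 0x6)
                           ? (extModel << 4) | model : model;

    printf("Family %u, Model %u, Stepping %u\n", displayFamily, displayModel, stepping);
    return 0;
}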
I didn't do all the grunt work to be certain, I'm just making a quick guess that this is an AMD Phenom II X4. The X4 fits with the quad-core, and from a cursory glance, it appears that the Phenom IIs are model 4.
Anyway, you could have probably stopped a while back, because the microarchitecture tells you everything you need to know. This is an AMD Barcelona core, which doesn't support Supplemental SSE3 (SSSE3) instructions (three S's—not to be confused with SSE3; the naming conventions are ridiculous). SSSE3 was invented by Intel, released with the Core 2 microarchitecture.
AMD didn't implement them until Bobcat/Bulldozer. Bulldozer was the subsequent generation, family 21 (15h), for desktops and servers, while Bobcat was the low-power core for AMD's APUs.
SSSE3 didn't really offer that many new instructions. Only 16, primarily intended for working with packed integers, and most of them aren't very exciting. The dump should also tell you the opcode of the instruction that caused the crash. If not, you'll have to go back and figure it out from the code's byte address. This will tell you exactly which instruction is the problem. I'm guessing that you are using PSHUFB to shuffle bytes in-place, which is the one SSSE3 instruction that is actually pretty useful. One common use I've seen is a fast population count algorithm (although there are other implementations that don't require SSSE3 and are almost as fast, if not faster).
Assuming that you're compiling with MSVC, I'm actually sort of surprised to see it emitting this instruction. In order to get it, you'd have to tell the compiler to target AVX, which would stop your code from running on anything older than Sandy Bridge/Bulldozer. I'm sure if you don't want to bump up your minimum system requirements, you can figure out an alternative sequence of instructions to do the same thing. pshufd, or movaps + shufps are possible candidates for workarounds.
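If you do decide to keep the SSSE3 path, a runtime check is cheap; here is a minimal sketch using MSVC's __cpuid (SSSE3 is reported in CPUID.1:ECX bit 9):
#include <intrin.h>
#include <stdio.h>

static int cpu_has_ssse3(void)
{
    int regs[4];                   /* EAX, EBX, ECX, EDX */
    __cpuid(regs, 1);
    return (regs[2] >> 9) & 1;     /* CPUID.1:ECX bit 9 = SSSE3 */
}

int main(void)
{
    if (cpu_has_ssse3())
        printf("SSSE3 available: take the pshufb path\n");
    else
        printf("No SSSE3: take the pshufd/shufps fallback\n");
    return 0;
}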
The commands !sysinfo, !cpuinfo and !errrec are kernel dump commands, defined in the kdexts extension, so they are not available in user-mode debugging and probably won't work well if you load that extension explicitly.
The only idea I had to get more information from the dump was .dumpdebug, which will output a stream called SystemInfoStream that looks like this:
Stream 7: type SystemInfoStream (7), size 00000038, RVA 000000BC
ProcessorArchitecture 0000 (PROCESSOR_ARCHITECTURE_INTEL)
ProcessorLevel 0006
ProcessorRevision 2A07
NumberOfProcessors 04
... (OS specifics) ...
Unfortunately, that's exactly the same as displayed by !cpuid, so there's really no more information contained in the dump.

Confusion of Assembly and hardware level memory fetching, processing, segmentation, offsets, scope of memory addressing, etc

I am very perplexed having studied Assembly for some time, and reviewing many great tutorials on it.
It is, I must say, surprisingly difficult to fully understand the whole scheme of its usefulness, aside from memorizing a few instructions to do some things you don't completely understand.
I seek to be an operating system developer and designer, so I have to know low-level hardware data processing, memory management, processor fetching, decoding, and memory segmentation, memory usage, bit and byte usage, call stack and hardware stacks, and the mechanics of a machine-level program from the hardware itself.
Here are my main questions I am confused about:
The processor fetches bytes from RAM. When writing a bootloader you "jump" to an address before writing instructions. The first instruction executed after jumping to the address in memory, such as a move/data copy (a MOV AL, MOV BL kind of instruction), retrieves data in the CPU's pipeline which is not directly used in memory. But how can the processor generate a code data segment in its pipeline if the instruction is loaded/fetched from memory? Or do I have it all wrong here? What are the basic steps the microprocessor goes through in a bootloader, and how does the CPU generate code data from a pipeline without using memory, if instructions are all supposedly fetched from memory (e.g. code segments in Assembly, but data segments and text segments are all instructions for the processor)?
Also, my next main question is probably very easy to answer for some more experienced than me:
Why is memory/RAM on x86 and other architectures stored as "segments" with offsets? To me this is more complex than it needs to be. Why can't all memory be linear, addressed, fetched, stored, and computed, and moved in and out of the registers to the memory cells in a more straightforward manner? Would that not make the illustration and understanding of the architecture easier to understand, and more direct than having multitudes of registers process bi-dimensional segmentations of memory-based data storage and accessing?
It's more than "assembly" vs "high level language".
The real issue is "Real" vs. "Protected" (virtual memory) modes.
And unfortunately, most x86 assembly examples happen to be DOS examples. Which, IMHO, have little/no relevance to contemporary 32/64 bit virtual memory architectures (including, but not limited to, x86).
Excellent primer:
Programming from the Ground Up
PS:
Address space is effectively linear, even for x86, on most modern OS's (including Windows, Linux and Mac OS). x86 segment registers are largely anachronisms from the DOS era.
If you're interested, here's a good overview of the Linux boot process:
http://www.ibm.com/developerworks/library/l-linuxboot/index.html

Trace of CPU Instruction Reordering

I have studied a few things about instruction re-ordering by processors and Tomasulo's algorithm.
In an attempt to understand this topic bit more I want to know if there is ANY way to (get the trace) see the actual dynamic reordering done for a given program?
I want to give an input program and see the "out of order instruction execution trace" of my program.
I have access to an IBM-P7 machine and an Intel Core2Duo laptop. Also please tell me if there is an easy alternative.
You have no access to the actual reordering done inside the CPU (there is no publicly known way to enable such tracing). But there are some emulators of reordering, and some of them can give you useful hints.
For modern Intel CPUs (Core 2, Nehalem, Sandy Bridge and Ivy Bridge) there is the "Intel(R) Architecture Code Analyzer" (IACA) from Intel. Its homepage is http://software.intel.com/en-us/articles/intel-architecture-code-analyzer/
This tool allows you to see how a linear fragment of code will be split into micro-operations and how they will be scheduled onto execution ports. The tool has some limitations and is only an inexact model of the CPU's u-op reordering and execution.
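For reference, typical usage looks something like the following sketch (iacaMarks.h ships with the tool; the exact command-line flags vary between IACA versions, so treat the invocation below as an example only):
#include "iacaMarks.h"

void scale(float *a, const float *b, int n)
{
    for (int i = 0; i < n; i++) {
        IACA_START               /* marks the start of the analyzed region */
        a[i] = b[i] * 2.0f;
    }
    IACA_END                     /* marks the end of the analyzed region */
}

/* After compiling, run the analyzer on the object file, e.g.
   iaca -arch SNB scale.o
   to see the micro-op breakdown and the port assignments for the marked loop. */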
There are also some "external" tools for emulating x86/x86_64 CPU internals; I can recommend PTLsim (or the derived MARSSx86):
PTLsim models a modern superscalar out of order x86-64 compatible processor core at a configurable level of detail ranging ... down to RTL level models of all key pipeline structures. In addition, all microcode, the complete cache hierarchy, memory subsystem and supporting hardware devices are modeled with true cycle accuracy.
But PTLsim models some "PTL" cpu, not real AMD or Intel CPU. The good news is that this PTL is Out-Of-Order, based on ideas from real cores:
The basic microarchitecture of this model is a combination of design features from the Intel Pentium 4, AMD K8 and Intel Core 2, but incorporates some ideas from IBM Power4/Power5 and Alpha EV8.
Also, the thesis at http://es.cs.uni-kl.de/publications/datarsg/Senf11.pdf says that the JavaHASE applet is capable of emulating different simple CPUs and even supports a Tomasulo example.
Unfortunately, unless you work for one of these companies, the answer is no. Intel/AMD processors don't even schedule the (macro) instructions you give them. They first convert those instructions into micro operations and then schedule those. What these micro instructions are and the entire process of instruction reordering is a closely guarded secret, so they don't exactly want you to know what is going on.

Alignment requirements for atomic x86 instructions vs. MS's InterlockedCompareExchange documentation?

Microsoft offers the InterlockedCompareExchange function for performing atomic compare-and-swap operations. There is also an _InterlockedCompareExchange intrinsic.
On x86 these are implemented using the lock cmpxchg instruction.
However, reading through the documentation on these three approaches, they don't seem to agree on the alignment requirements.
Intel's reference manual says nothing about alignment (other than that if alignment checking is enabled and an unaligned memory reference is made, an exception is generated)
I also looked up the lock prefix, which specifically states that
The integrity of the LOCK prefix is not affected by the alignment of the memory field.
(emphasis mine)
So Intel seems to say that alignment is irrelevant. The operation will be atomic no matter what.
The _InterlockedCompareExchange intrinsic documentation also says nothing about alignment, however the InterlockedCompareExchange function states that
The parameters for this function must be aligned on a 32-bit boundary; otherwise, the function will behave unpredictably on multiprocessor x86 systems and any non-x86 systems.
So what gives?
Are the alignment requirements for InterlockedCompareExchange just to make sure the function will work even on pre-486 CPUs where the cmpxchg instruction isn't available?
That seems likely based on the above information, but I'd like to be sure before I rely on it. :)
Or is alignment required by the ISA to guarantee atomicity, and I'm just looking the wrong places in Intel's reference manuals?
x86 does not require alignment for a lock cmpxchg instruction to be atomic. However, alignment is necessary for good performance.
This should be no surprise: backward compatibility means that software written with a manual from 14 years ago will still run on today's processors. Modern CPUs even have a performance counter specifically for split-lock detection because it's so expensive. (The core can't just hold onto exclusive access to a single cache line for the duration of the operation; it has to do something like a traditional bus lock.)
Why exactly Microsoft documents an alignment requirement is not clear. It's certainly necessary for supporting RISC architectures, but the specific claim of unpredictable behaviour on multiprocessor x86 might not even be valid. (Unless they mean unpredictable performance, rather than a correctness problem.)
Your guess of applying only to pre-486 systems without lock cmpxchg might be right; a different mechanism would be needed there which might have required some kind of locking around pure loads or pure stores. (Also note that 486 cmpxchg has a different and currently-undocumented opcode (0f a7) from modern cmpxchg (0f b1) which was new with 586 Pentium; Windows might have only used cmpxchg on P5 Pentium and later, I don't know.) That could maybe explain weirdness on some x86, without implying weirdness on modern x86.
Intel® 64 and IA-32 Architectures Software Developer’s Manual
Volume 3 (3A): System Programming Guide
January 2013
8.1.2.2 Software Controlled Bus Locking
To explicitly force the LOCK semantics, software can use the LOCK prefix with the following instructions when they are used to modify a memory location. [...]
• The exchange instructions (XADD, CMPXCHG, and CMPXCHG8B).
• The LOCK prefix is automatically assumed for XCHG instruction.
• [...]
[...] The integrity of a bus lock is not affected by the alignment of the memory field. The LOCK semantics are followed for as many bus cycles as necessary to update the entire operand. However, it is recommended that locked accesses be aligned on their natural boundaries for better system performance:
• Any boundary for an 8-bit access (locked or otherwise).
• 16-bit boundary for locked word accesses.
• 32-bit boundary for locked doubleword accesses.
• 64-bit boundary for locked quadword accesses.
Fun fact: cmpxchg without a lock prefix is still atomic wrt. context switches, so is usable for multi-threading on a single-core system.
Even misaligned it's still atomic wrt. interrupts (either completely before or completely after), and only memory reads by other devices (e.g. DMA) could see tearing. But such accesses could also see the separation between load and store, so even if old Windows did use that for a more efficient InterlockedCompareExchange on single-core systems, it still wouldn't require alignment for correctness, only performance. If this can be used for hardware access, Windows probably wouldn't do that.
If the library function needed to do a pure load separate from the lock cmpxchg this might make sense, but it doesn't need to do that. (If not inlined, the 32-bit version would have to load its args from the stack, but that's private, not access to the shared variable.)
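For completeness, a minimal sketch of the API in question (Windows assumed); a plain LONG is naturally 4-byte aligned, which satisfies the documented requirement and keeps the lock cmpxchg within one cache line:
#include <windows.h>
#include <stdio.h>

static volatile LONG g_state = 0;   /* naturally aligned on a 32-bit boundary */

static int try_claim(void)
{
    /* Atomically change 0 -> 1; the return value is what was there before. */
    LONG prev = InterlockedCompareExchange(&g_state, 1, 0);
    return prev == 0;               /* nonzero means this caller won the race */
}

int main(void)
{
    printf("first claim:  %d\n", try_claim());   /* 1: we set it */
    printf("second claim: %d\n", try_claim());   /* 0: it was already set */
    return 0;
}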
The PDF you are quoting from is from 1999 and CLEARLY outdated.
The up-to-date Intel documentation, specifically Volume-3A tells a different story.
For example, on a Core-i7 processor, you STILL have to make sure your data doesn't span cache lines, or else the operation is NOT guaranteed to be atomic.
In Volume 3A, System Programming, for x86/x64, Intel clearly states:
8.1.1 Guaranteed Atomic Operations
The Intel486 processor (and newer processors since) guarantees that the following basic memory operations will always be carried out atomically:
• Reading or writing a byte
• Reading or writing a word aligned on a 16-bit boundary
• Reading or writing a doubleword aligned on a 32-bit boundary
The Pentium processor (and newer processors since) guarantees that the following additional memory operations will always be carried out atomically:
• Reading or writing a quadword aligned on a 64-bit boundary
• 16-bit accesses to uncached memory locations that fit within a 32-bit data bus
The P6 family processors (and newer processors since) guarantee that the following additional memory operation will always be carried out atomically:
• Unaligned 16-, 32-, and 64-bit accesses to cached memory that fit within a cache line
Accesses to cacheable memory that are split across cache lines and page boundaries are not guaranteed to be atomic by the Intel Core 2 Duo, Intel® Atom™, Intel Core Duo, Pentium M, Pentium 4, Intel Xeon, P6 family, Pentium, and Intel486 processors. The Intel Core 2 Duo, Intel Atom, Intel Core Duo, Pentium M, Pentium 4, Intel Xeon, and P6 family processors provide bus control signals that permit external memory subsystems to make split accesses atomic; however, nonaligned data accesses will seriously impact the performance of the processor and should be avoided.
See this SO question: natural alignment is important for performance, and it is required on the x64 architecture (so it's not just PRE-x86 systems, but POST-x86 ones too -- x64 may still be a bit of a niche case, but it's growing in popularity after all ;-). That may be why Microsoft documents it as required (it's hard to find docs on whether MS has decided to FORCE the alignment issue by enabling alignment checking -- that may vary by Windows version; by claiming in the docs that alignment is required, MS keeps the freedom to force it in some version of Windows even if it did not force it on others).
Microsoft's Interlocked APIs also applied to ia64 (while it still existed). There was no lock prefix on ia64, only the cmpxchg.acq and cmpxchg.rel instructions (or fetchadd and other similar beasties), and these all required alignment if I recall correctly.

Resources