Find out CPU model given a user-mode crash dump - windows

I have a crash dump for my app. My app fails for some user with an "invalid instruction" error while trying to execute an SSSE3 instruction it contains.
In WinDBG, how do I find out the CPU model, so I can find out its instruction set, and either support the instruction set, or update minimum hardware requirements of the app?
Here’s the output of !cpuid:
CP  F/M/S   Manufacturer   MHz
 0  16,4,3  <unavailable>  3000
 1  16,4,3  <unavailable>  3000
 2  16,4,3  <unavailable>  3000
 3  16,4,3  <unavailable>  3000
The other commands Google suggests might help (!errrec, !cpuinfo, !sysinfo) all print "No export found".

You definitely aren't getting much information here. Although dumps don't usually have all of the raw CPU information, you should at least be seeing a manufacturer string. Oh well, let's look at what you do have to work with here...
The CP column gives the logical processor number, so you know you're dealing with a system that has 4 logical processors. Could be a quad-core, or perhaps a dual-core with HyperThreading.
F/M/S is Family/Model/Stepping, which is a common and fairly standard way to identify processors. As AMD says:
The processor Family identifies one or more processors as belonging to a group that possesses some common definition for software or hardware purposes. The Model specifies one instance of a processor family. The Stepping identifies a particular version of a specific model. Therefore, Family, Model and Stepping, when taken together, form a unique identification or signature for a processor.
It helps a lot to have the manufacturer when you go looking these up, because the family numbers are pretty messy and not unique across vendors, but thankfully it's pretty clear that a family number of 16 (10 in hexadecimal) corresponds to an AMD processor (which would have a manufacturer string of "AuthenticAMD"). Specifically, it is the AMD K10, i.e. the Barcelona microarchitecture. That also means there's no HyperThreading here: this is a native quad-core system.
We can narrow it down further by looking at the model. There were lots of different models based around the Barcelona core, branded variously as Athlon II, Opteron, Phenom, Phenom II, Sempron, Turion, and V-Series. Yours is model 4. This is where it gets tricky, because I don't know of a good resource that lists the model numbers and steppings for the various CPUs; you have to go directly to the manufacturer and wade through their manuals. For example, here is AMD's Revision Guide for the 10h Family. If you go to the "Processor Identification" section (it appears as a bookmark in the PDF for me), you'll see something that looks promising, but the information is certainly not presented in an easily digestible form. You get long hexadecimal values, out of which you have to extract the individual bit fields for the Family (bits 8-11), Model (bits 4-7), and Stepping (bits 0-3).
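To make those bit fields concrete, here is a small sketch of decoding one of those raw hexadecimal values; the field layout is the standard CPUID leaf-1 encoding, and the sample value is simply one that is consistent with the 16,4,3 reported in your dump:

    #include <cstdio>

    // Decode Family/Model/Stepping from a raw CPUID leaf-1 EAX value, such as
    // the hex numbers listed in AMD's revision guide. Stepping is in bits 0-3,
    // base model in bits 4-7, base family in bits 8-11, extended model in
    // bits 16-19 and extended family in bits 20-27.
    void print_fms(unsigned eax)
    {
        unsigned stepping = eax & 0xF;
        unsigned model    = (eax >> 4) & 0xF;
        unsigned family   = (eax >> 8) & 0xF;

        // The extended fields only kick in when the base family is 0xF, which
        // is how a part like this one ends up reporting family 15 + 1 = 16.
        if (family == 0xF) {
            family += (eax >> 20) & 0xFF;
            model  |= ((eax >> 16) & 0xF) << 4;
        }
        printf("%#010x -> Family %u, Model %u, Stepping %u\n",
               eax, family, model, stepping);
    }

    int main()
    {
        print_fms(0x00100F43);   // illustrative value matching 16,4,3
        return 0;
    }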
I didn't do all the grunt work to be certain; my quick guess is that this is an AMD Phenom II X4. The X4 fits with the quad-core, and from a cursory glance, it appears that the Phenom IIs are model 4.
Anyway, you could probably have stopped a while back, because the microarchitecture tells you everything you need to know. This is an AMD Barcelona core, which doesn't support Supplemental SSE3 (SSSE3) instructions (three S's, not to be confused with SSE3; the naming conventions are ridiculous). SSSE3 was introduced by Intel with the Core 2 microarchitecture.
AMD didn't implement them until Bobcat/Bulldozer. Bulldozer was the subsequent generation, family 21 (15h), for desktops and servers, while Bobcat was the low-power core used in AMD's APUs.
SSSE3 didn't really offer that many new instructions: only 16, primarily intended for working with packed integers, and most of them aren't very exciting. The dump should also tell you the opcode of the instruction that caused the crash; if not, you'll have to go back and figure it out from the faulting code address. That will tell you exactly which instruction is the problem. I'm guessing that you are using PSHUFB to shuffle bytes in place, which is the one SSSE3 instruction that is actually pretty useful. One common use I've seen is a fast population count algorithm (although there are other implementations that don't require SSSE3 and are almost as fast, if not faster).
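As an illustration of that trick (purely an example of the kind of code that could be involved, not necessarily what your app does), here is a sketch of the PSHUFB-based popcount: each nibble of the input indexes a 16-entry table of bit counts via _mm_shuffle_epi8, the intrinsic that maps to PSHUFB:

    #include <emmintrin.h>   // SSE2 intrinsics
    #include <tmmintrin.h>   // SSSE3: _mm_shuffle_epi8 (PSHUFB)
    #include <cstdint>

    // Count the set bits in 16 bytes using the nibble-lookup trick.
    uint64_t popcount_16bytes(__m128i v)
    {
        const __m128i lut  = _mm_setr_epi8(0,1,1,2, 1,2,2,3, 1,2,2,3, 2,3,3,4);
        const __m128i mask = _mm_set1_epi8(0x0F);

        __m128i lo  = _mm_and_si128(v, mask);                    // low nibbles
        __m128i hi  = _mm_and_si128(_mm_srli_epi16(v, 4), mask); // high nibbles
        __m128i cnt = _mm_add_epi8(_mm_shuffle_epi8(lut, lo),    // per-byte counts
                                   _mm_shuffle_epi8(lut, hi));

        // Horizontal sum of the 16 byte counts (PSADBW against zero).
        __m128i sum = _mm_sad_epu8(cnt, _mm_setzero_si128());
        return (uint64_t)_mm_cvtsi128_si32(sum)
             + (uint64_t)_mm_extract_epi16(sum, 4);
    }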
Assuming that you're compiling with MSVC, I'm actually somewhat surprised to see it emitting this instruction. To get it, you'd have to tell the compiler to target AVX, which would stop your code from running on anything older than Sandy Bridge/Bulldozer. If you don't want to bump up your minimum system requirements, I'm sure you can figure out an alternative sequence of instructions that does the same thing; pshufd, or movaps + shufps, are possible candidates for workarounds.
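If you'd rather keep the SSSE3 path and only fall back on older machines, a minimal runtime check is possible; this sketch uses MSVC's __cpuid intrinsic, and SSSE3 is reported in bit 9 of ECX from leaf 1:

    #include <intrin.h>

    // Returns true if the CPU reports SSSE3 support (CPUID.01H:ECX bit 9).
    bool cpu_has_ssse3()
    {
        int regs[4];
        __cpuid(regs, 0);            // leaf 0: EAX = highest supported leaf
        if (regs[0] < 1)
            return false;
        __cpuid(regs, 1);            // leaf 1: feature flags
        return (regs[2] & (1 << 9)) != 0;
    }

At startup you could use a check like that to pick between the SSSE3 routine and a plain SSE2 fallback, instead of raising the minimum requirements.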

The commands !sysinfo, !cpuinfo and !errrec are kernel-dump commands, defined in the kdexts extension, so they are not available in user-mode debugging and probably won't work well even if you load that extension explicitly.
The only other idea I had for getting more information from the dump is .dumpdebug, which outputs a stream called SystemInfoStream that looks like this:
Stream 7: type SystemInfoStream (7), size 00000038, RVA 000000BC
ProcessorArchitecture 0000 (PROCESSOR_ARCHITECTURE_INTEL)
ProcessorLevel 0006
ProcessorRevision 2A07
NumberOfProcessors 04
... (OS specifics) ...
Unfortunately, that's exactly the same information that !cpuid displays, so there's really nothing more contained in the dump.

Related

Clock Cycles for the invlpg instruction

I was reading some documentation about the invlpg instruction for Intel Pentium processors and it says that it takes 25 clock cycles. I thought that this depended on the implementation (the particular CPU) and not the actual instruction set architecture? Or is the fact that this instruction must take 25 clock cycles to run also part of the instruction set specification?
The documentation is saying that it took 25 clock cycles on the Pentium. The number of clock cycles the instruction takes on other CPUs may be more or fewer. The performance of instructions is not part of the instruction set specification.
That number is not part of any official ISA documentation; it's just performance data that someone annotated into an old (then-current) copy of Intel's ISA docs.
It's from one specific microarchitecture, presumably the P5 Pentium, which was current back when Tripod was a widely used web host and which that guide labels itself as documenting. (These days there are Pentium/Celeron CPUs that are just cut-down versions of i3/i5/i7 of the same generation, with stuff like AVX and BMI1/2 disabled, but Pentium used to refer to the P5 microarchitecture.)
It's not from Intel's documentation; it was added by whoever compiled that HTML. The formatting is similar to modern versions of Intel's vol. 2 x86 SDM instruction-set reference manual. You can find HTML extracts of that at https://github.com/HJLebbink/asm-dude/wiki/INVLPG and https://www.felixcloutier.com/x86/invlpg for example. The encoding / mnemonic / description table at the top has identical formatting in your Tripod link, but the actual text is somewhat different. Also, the text for inc (current Intel vs. Tripod) is word-for-word identical.
So yes, this is based on an old PDF-to-HTML conversion of Intel's vol. 2 manual, with P5 cycle counts and instruction-pairing info added (inc pairs in the U or V pipe on that dual-issue in-order pipeline, which doesn't break instructions down into uops), and with the FLAGS-updating sections turned into tables.
That instruction-pairing and cycle-count info is totally irrelevant when tuning for modern microarchitectures like Skylake and Zen, but you can find it in Agner Fog's instruction tables: his spreadsheet has a sheet for P5, as well as for later Intel, AMD, and VIA microarchitectures. (Also see his optimization guide and microarch PDF for background info to help you make sense of uops / ports / latency / throughput numbers.) Agner doesn't test most kernel instructions, though, so invlpg isn't in his list.
http://faydoc.tripod.com/cpu/index.htm is obviously not an official Intel source, and I don't know where the author got their numbers from. Maybe they tested themselves, or maybe the numbers came from Intel: Intel has sometimes published timing data for some microarchitectures, e.g. as part of their optimization manual, which is totally separate from the x86 ISA manuals and is not something you can rely on for correctness. Other people have published their own test results too.
Another good source for experimental test results of instruction performance (uops for which ports, latency, and throughput) is https://uops.info/. Their testing for invlpg m8 shows it has a back-to-back throughput of ~194 cycles in practice on Skylake-client, ~157 on Nehalem, and ~126.25 on Zen+ and Zen2, to pick some random examples. But it may interleave better with other instructions, taking "only" 47 front-end uops on recent Intel CPUs and thus can issue in under 12 cycles if the back-end has room in the ROB / RS, maybe letting later instructions execute while the invlpg operation is in progress. (Although if it takes over 100 cycles for its uops to retire, that will often stall OoO exec at some point for a fraction of the total time.)
Remember that instruction performance can't be characterized by a single number on out-of-order CPUs; it's not one-dimensional. Perf analysis is not as simple as adding up a cycle cost for every instruction in a loop: you have to analyze how they can overlap with each other. Or, for complex cases like invlpg, measure.

Difference between LLCMisses and CacheMisses on Hardware Counters

What is the difference between LLCMisses and CacheMisses?
The value returned for both counters should generally be the same.
The counters available in BenchmarkDotNet are those provided by the Windows ETW infrastructure. Unfortunately, so far as I am aware, Microsoft does not offer any specific information about any of them, but we can reasonably infer quite a bit from the ones we see.
On the Intel systems I have seen full PMC source listings for, the list ends with 8 entries with sequential IDs. The first seven of those eight (UnhaltedCoreCycles, InstructionRetired, UnhaltedReferenceCycles, LLCReference, LLCMisses, BranchInstructionRetired, BranchMispredictsRetired) pretty much exactly match the names and order of the seven Intel architectural performance event counters (see the Performance Monitoring chapter of the Intel Software Developer's Manual for details).
The last of the 8, LbrInserts, likely refers to the Intel Last Branch Record performance monitoring functionality. So it appears reasonable to presume these sources directly map to those specific x86 counters, and they will not be present on architectures without them.
Of the other 5 sources listed, TotalIssues returns the same values as InstructionRetired; BranchInstructions matches BranchInstructionRetired; CacheMisses matches LLCMisses; BranchMispredictions matches BranchMispredictsRetired; and TotalCycles matches UnhaltedCoreCycles.
Presumably, other CPU architectures have their own architecture-specific sources defined, mapped to different architecture-specific counters; e.g. BranchMispredictions on ARM might map to the BR_MIS_PRED counter, which does not have exactly the same semantics as Intel's Branch Mispredicts Retired but still represents the concept of branch misprediction.
So the actual answer is: if you are distributing software with a predefined counter choice, pick LLCMisses if you want the specific meaning of the Intel counter, and pick CacheMisses if you just want the general concept of a cache miss, so that it might also work on other architectures with different performance counters. If you're just running it locally, it doesn't really matter which you pick.

Why isn't the distinction between CPUs more ubiquitous?

I know that every program one writes has to eventually boil down to machine code - that's what compilers produce, that's what executable files consist of, and that's the only language that processors understand. I also know that different processors may have different instruction sets (I know 65c816 assembly, and I imagine it's vastly different from today's computers).
Here's what I'm not getting, though: If there exist different instruction sets, then why do we not seem to have to care about that every time we use software?
If a program was compiled for one particular CPU, it might not run on another - and yet, I never see notices like "Intel users download this version, AMD users download this one". I never have to even be aware of what CPU I'm on, every executable just seems to... work. The same goes for compilers, apparently - there isn't a separate version of, say, GCC, for every processor there is, right?
I'm aware that the differences in instruction sets are much more subtle than they used to be, but even then there should at least be a bit of a distinction. I'm wondering why there doesn't seem to be any.
What is it that I'm not understanding?
There actually are sometimes different versions for Intel vs. AMD, and even for different generations of Intel or AMD. That's not common (especially in the kind of software people usually use) because it's not user-friendly (most people don't even know what a CPU is or does, let alone what kind they have exactly), so what often happens instead is that either the multiple versions are all built into the same executable and selected at runtime, or a "least common denominator" subset of instructions is used (when performance is not really a concern). There are significant differences between AMD and Intel, though; the most significant one is in which instruction sets they support. AMD always lags behind Intel in implementing Intel's new instructions (new sets come out regularly), and Intel usually does not implement AMD's extensions: AMD64 is the big exception (Intel accepted about 99% of it, with small changes), a couple of other instructions here and there were borrowed, but XOP will never happen even though it's awesome, and 3DNow! was never adopted by Intel. As an example of software that does not package the code for different "extended instruction sets" in the same executable, see y-cruncher.
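To make the "selected at runtime" idea concrete, here is a minimal sketch of dispatching through a function pointer chosen once at startup. The function names are made up, and the SSE4.1 bit is just an example of a feature a program might key off:

    #include <intrin.h>
    #include <cstdio>

    // Two interchangeable implementations; in real code the "fast" one would
    // actually use the newer instructions (these bodies are placeholders).
    static int sum_generic(const int* p, int n)
    {
        int s = 0;
        for (int i = 0; i < n; ++i) s += p[i];
        return s;
    }
    static int sum_fast(const int* p, int n) { return sum_generic(p, n); }

    typedef int (*sum_fn)(const int*, int);

    // Pick an implementation based on a CPUID feature bit
    // (SSE4.1 is CPUID.01H:ECX bit 19).
    static sum_fn pick_sum()
    {
        int r[4];
        __cpuid(r, 1);
        return (r[2] & (1 << 19)) ? sum_fast : sum_generic;
    }

    static const sum_fn sum_impl = pick_sum();

    int main()
    {
        int data[] = { 1, 2, 3, 4 };
        printf("%d\n", sum_impl(data, 4));
        return 0;
    }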
To get back to the first point, though: for some high-performance software (I can't name any off the top of my head, but I've seen it) you may see different versions, each specifically tailored to get maximum performance on one specific microarchitecture. For example, P4 (NetBurst) and Core 2 are two very different beasts (that's mostly P4's fault for being crazy); even though Core 2 is backwards compatible and could run the same code unmodified, that is not necessarily a good idea from a performance perspective.
There are no separate Intel/AMD versions, because they use the same ISA family: x86.
However, there are applications where you do have to look out for different versions when you download them. There are still instruction set architectures that are quite different and can make a program behave differently. For example, if you write a network-based application on a PowerPC (big-endian) system, you can get away with forgetting the little-to-big-endian conversion; compile the same code for x86, which is little-endian, and the application will most likely produce garbage on the network side.
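A tiny sketch of that pitfall (the value and buffer are made up for illustration): serializing an integer byte by byte in network (big-endian) order produces the same bytes on every architecture, while copying its raw in-memory representation does not:

    #include <cstdint>
    #include <cstring>
    #include <cstdio>

    int main()
    {
        uint32_t value = 0x12345678;

        // Explicit big-endian ("network order") serialization: identical
        // bytes whether the host is x86, PowerPC, ARM, ...
        unsigned char wire[4];
        wire[0] = (value >> 24) & 0xFF;
        wire[1] = (value >> 16) & 0xFF;
        wire[2] = (value >> 8)  & 0xFF;
        wire[3] =  value        & 0xFF;

        // Naive copy of the in-memory representation: the byte order depends
        // on the host CPU (reversed on little-endian x86).
        unsigned char raw[4];
        memcpy(raw, &value, 4);

        printf("portable: %02X %02X %02X %02X\n", wire[0], wire[1], wire[2], wire[3]);
        printf("raw copy: %02X %02X %02X %02X\n", raw[0], raw[1], raw[2], raw[3]);
        return 0;
    }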
There is also the difference in how many instructions there are, e.g. RISC vs CISC.
In the end there are a lot of differences to look out for, though in most programming languages you don't have to worry too much about them, as the compiler/interpreter will handle most things for you. If you work at a lower level, then you have to know what you're doing on each architecture.
Also, if you compile for ARM, you won't be able to run the program on any other kind of machine, like your x86 PC. It will not work at all.
That's because the opcodes differ; take the mov instruction: on x86 one of its opcodes is 0x88, while on ARM it might be 0x13, and so on.
The distinction is in fact quite dramatic, except in the case of Intel vs. AMD: AMD makes its processors compatible with Intel machine code, on purpose of course.
Today there is also a move to JIT compiling (Java, .NET, etc.). In this case the executable file doesn't contain machine code; it contains an intermediate language that is compiled, just before it is executed, into the machine code of the running processor. This allows the executable to be completely independent of the processor architecture.
AMD is an Intel clone, or vice versa depending on your view of the situation. Either way, there is so much in common that programs can be compiled to run on either (within reason; you can't go back 10 years, for example, and you can't make a 32-bit processor understand 64-bit-specific instructions). The next step is that the motherboards have to do similar things, and they do; there may be some Intel- or AMD-specific chipset support items, but beyond that you get into generic peripherals that can be found on either platform, or that are widespread enough on one or the other that the operating systems and/or applications support them.

Using CPUID to detect CPU specs, reliable solution?

I'm trying to gather information about my CPU with __cpuid(). It gets things right on my computer, but when I run my program on my colleague's computer it detects his Intel Core 2 Quad Q6600 as hyper-threaded, even though according to the specifications on Intel's own site it isn't.
__cpuid() is also detecting the wrong number of "logical cores", as can be seen in Programmatically detect number of physical processors/cores or if hyper-threading is active on Windows, Mac and Linux, where it claims that the Intel Xeon E5520 has 16 logical cores and 8 physical.
I tried running the code found in that thread on my own computer, an Intel i7 2600K, and it gives me the same numbers as for the Xeon.
So how reliable is __cpuid() really? From my own experience it doesn't seem to be that reliable. Have I got something very fundamental wrong?
There is almost certainly a gap in the [x2]APIC ids on your processor, meaning that some values of the APIC id don't map to any logical processors. You should use the 0xB leaf of cpuid to find out. You can look at the reference Intel code and algorithm (https://software.intel.com/en-us/articles/intel-64-architecture-processor-topology-enumeration/) for the steps, but it boils down to calling cpuid with EAX=0xB, ECX=0 and reading from EBX the number of logical processors (threads) per core, then calling cpuid again with EAX=0xB, ECX=1 and reading from EBX the number of logical processors per processor package.
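For illustration, here is a minimal sketch of that 0xB enumeration using MSVC's __cpuidex intrinsic. A robust implementation should also walk the sub-leaves and check the level-type field in ECX, as the Intel reference code does; this is just the happy path:

    #include <intrin.h>
    #include <cstdio>

    int main()
    {
        int r[4];

        __cpuid(r, 0);                   // leaf 0: EAX = highest standard leaf
        if (r[0] < 0x0B) {
            printf("CPUID leaf 0xB not supported on this CPU\n");
            return 1;
        }

        // Sub-leaf 0 (SMT level): EBX[15:0] = logical processors per core.
        __cpuidex(r, 0x0B, 0);
        int threads_per_core = r[1] & 0xFFFF;

        // Sub-leaf 1 (core level): EBX[15:0] = logical processors per package.
        __cpuidex(r, 0x0B, 1);
        int logical_per_package = r[1] & 0xFFFF;

        printf("logical per core:    %d\n", threads_per_core);
        printf("logical per package: %d\n", logical_per_package);
        if (threads_per_core > 0)
            printf("physical cores:      %d\n", logical_per_package / threads_per_core);
        return 0;
    }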
The old method of using leaf 0x1 cannot account for APIC id gaps. Alas, that's the sample code still given on the MSDN Visual C++ 2013 reference page (http://msdn.microsoft.com/en-us/library/hskdteyh.aspx), and it is incorrect for processors made in 2010 and later, as you found out, whether by using the code from MSDN or similarly incorrect code from elsewhere. The Wikipedia page on cpuid, which I've recently updated after struggling with the issue myself, now has a worked example in the section on "Intel thread/core and cache topology" for enumerating topology on a processor with APIC id gaps, with additional details, including how to determine which bits of the APIC ids are actually used and which are "dead".
Given the code sample currently offered by Microsoft on their __cpuid() page, this is basically the same question as Logical CPU count return 16 instead of 4, because it's rooted in the same misinterpretation of the Intel specs. As an explanation for MSDN's poor showing: the code they offer worked okay before 2010 or so, and Intel used to provide a similar method before the x2APIC was introduced, as you can see in this old video/article: https://software.intel.com/en-us/articles/hyper-threading-technology-and-multi-core-processor-detection If you look at the various versions of the MSDN page on __cpuid, their code sample has basically remained the same since 2008...
As for the single hyper-threading detection bit, that is a longer story, already answered by me at Why does Hyper-threading get reported as supported on processors without it?. In a nutshell, that rather legacy bit tells you whether the processor package supports more than one logical processor, be it via hyperthreading or multi-core technology. The name of the bit is thus rather misleading.
Also, I suggest changing the title of your question to "Using CPUID to detect CPU topology, reliable solution?", because I found it completely by accident while searching Google for Sandy Bridge cpuid dumps.
CPUID can be trusted; you just need to use it correctly, which in this case means enumerating the topology correctly. You get 16 logical processors because the field it's read from represents the maximum the package can support, not how many there actually are. The value retrieved for cores is actually the logical count.
The code in that topic is very basic and meant as a starting point; on my system (i7 2720QM) I also get invalid data from it, but using my own code, which checks the topology as per the Intel CPUID mappings, I get correct results.

Trace of CPU Instruction Reordering

I have studied a few things about instruction re-ordering by processors and Tomasulo's algorithm.
In an attempt to understand this topic a bit more, I want to know if there is ANY way to get a trace of the actual dynamic reordering done for a given program.
I want to give an input program and see the "out of order instruction execution trace" of my program.
I have access to an IBM-P7 machine and an Intel Core2Duo laptop. Also please tell me if there is an easy alternative.
You have no access to the actual reordering done inside the CPU (there is no publicly known way to enable such tracing). But there are some emulators of reordering, and some of them can give you useful hints.
For modern Intel CPUs (Core 2, Nehalem, Sandy Bridge and Ivy Bridge) there is the "Intel(R) Architecture Code Analyzer" (IACA) from Intel. Its homepage is http://software.intel.com/en-us/articles/intel-architecture-code-analyzer/
This tool lets you see how a linear fragment of code will be split into micro-operations and how those will be scheduled onto execution ports. It has some limitations and is only an approximate model of the CPU's uop reordering and execution.
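If I remember the workflow correctly, you bracket the loop you care about with the marker macros from the iacaMarks.h header that ships with the tool, compile that file to an object file, and point the iaca command-line tool at the object (the markers are magic byte sequences, so don't leave them in production builds). A rough sketch:

    // Assumes iacaMarks.h from the IACA download is on the include path.
    #include "iacaMarks.h"

    void scale(float* dst, const float* src, int n)
    {
        for (int i = 0; i < n; ++i) {
            IACA_START;              // begin the analyzed region (inside the loop)
            dst[i] = src[i] * 2.0f;
        }
        IACA_END;                    // end the analyzed region (after the loop)
    }

IACA then reports how it thinks the marked block decodes into uops and which ports they would be assigned to on the microarchitecture you select.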
There are also some "external" tools for emulating x86/x86_64 CPU internals; I can recommend PTLsim (or the derived MARSSx86):
PTLsim models a modern superscalar out of order x86-64 compatible processor core at a configurable level of detail ranging ... down to RTL level models of all key pipeline structures. In addition, all microcode, the complete cache hierarchy, memory subsystem and supporting hardware devices are modeled with true cycle accuracy.
But PTLsim models a generic "PTL" CPU, not a real AMD or Intel CPU. The good news is that this PTL core is out-of-order, based on ideas from real cores:
The basic microarchitecture of this model is a combination of design features from the Intel Pentium 4, AMD K8 and Intel Core 2, but incorporates some ideas from IBM Power4/Power5 and Alpha EV8.
Also, the thesis at http://es.cs.uni-kl.de/publications/datarsg/Senf11.pdf says that the JavaHASE applet is capable of emulating different simple CPUs and even includes a Tomasulo example.
Unfortunately, unless you work for one of these companies, the answer is no. Intel/AMD processors don't even schedule the (macro-)instructions you give them; they first convert those instructions into micro-operations and then schedule those. What these micro-operations are, and the entire process of instruction reordering, is a closely guarded secret, so the vendors don't exactly want you to know what is going on.
