How to disable prefetcher in Atom N270 processor

I am trying to disable hardware prefetching on my system with Atom processors (N270).
I am following the method from this link: How do I programatically disable hardware prefetching in core2duo?
I am able to execute:
./rdmsr 0x1a0
366b52488
However, attempting to write a modified value gives an error message:
./wrmsr -p0 0x1a0 0x366d52688
wrmsr: Cpu 0 can't set MSR from 0x1a0 to 0x366d52688
Although I am able to set bit 0 and bit 3, no other bits can be modified:
./wrmsr -p0 0x1a0 0x366b52489
As per this link, disable prefetcher in i3/i7, the hardware prefetcher in Nehalem, Westmere, Sandy Bridge, Ivy Bridge, Haswell, and Broadwell can be disabled via the MSR at address 0x1a4.
On the Atom processor, reading 0x1a4 is not permitted:
./rdmsr 0x1a4
rdmsr: Cpu 0 can't read MSR from 0x000001a4
I am wondering how it is possible that there is no information available on how to disable the hardware prefetcher in Atom processors, given that the Atom N270 and the Core 2 Duo were released in the same year (2008) and Intel has disclosed how to disable the hardware prefetcher on the Core 2 Duo.
Any link to a document on how to disable the prefetcher in Atom processors would be a great help. Thank you in advance.

The only reliable source for information like this is the Intel Architecture Software Developer's Manual. There is an entire chapter dedicated to MSRs (in the most recent release it is Chapter 35).
The 0x1a0 and 0x1a4 MSR addresses that you found are model dependent (that's why they are called Model Specific Registers), meaning they might only be available on some models and might be removed in future ones.
If you go to the chapter "MSRS IN THE 45 NM AND 32 NM INTEL® ATOM™ PROCESSOR FAMILY", which matches your Atom N270, you won't be able to find any MSR related to prefetcher settings. So it is not available in official Intel CPU releases (though it might be found on some engineering samples).
There might be two reasons why it's not available: either it's not a highly demanded feature, so removing it saves some silicon gates; or Intel thinks this feature is best left at its default and not made configurable to the public (it could be misused by a vendor or user and lead to poor performance in certain conditions).
BTW, information about the 0x1a4 MSR can be found in IA SDM Chapter 35.5, and about 0x1a0 in Chapter 35.2.
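If you want to do the same thing from your own program rather than through the rdmsr/wrmsr utilities, the Linux msr driver exposes the same interface. A minimal sketch, assuming the msr kernel module is loaded (modprobe msr) and root privileges:

#include <stdio.h>
#include <stdint.h>
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
    uint64_t val;
    int fd = open("/dev/cpu/0/msr", O_RDONLY);
    if (fd < 0) { perror("open /dev/cpu/0/msr"); return 1; }
    /* The MSR address is used as the file offset; each read is 8 bytes. */
    if (pread(fd, &val, sizeof val, 0x1a0) != (ssize_t)sizeof val) {
        perror("pread");   /* a faulting MSR access surfaces as EIO */
        close(fd);
        return 1;
    }
    printf("MSR 0x1a0 = %#llx\n", (unsigned long long)val);
    close(fd);
    return 0;
}

Writes work the same way with pwrite on a descriptor opened O_WRONLY; a write the CPU rejects (like your attempt to flip reserved bits) should come back as EIO, which is what wrmsr reports as the "can't set MSR" failure.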

Related

Find out CPU model given a user-mode crash dump

I have a crash dump for my app. My app fails for some users, saying “invalid instruction” while trying to execute an SSSE3 instruction I have there.
In WinDBG, how do I find out the CPU model, so I can find out its instruction set, and either support the instruction set, or update minimum hardware requirements of the app?
Here’s the output of !cpuid:
CP F/M/S Manufacturer MHz
0 16,4,3 <unavailable> 3000
1 16,4,3 <unavailable> 3000
2 16,4,3 <unavailable> 3000
3 16,4,3 <unavailable> 3000
The rest of the commands Google says might help (!errrec, !cpuinfo, !sysinfo) print “No export found”.
You definitely aren't getting much information here. Although dumps don't usually have all of the raw CPU information, you should at least be seeing a manufacturer string. Oh well, let's look at what you do have to work with here...
The CP column gives the logical processor number, so you know you're dealing with a system that has 4 logical processors. Could be a quad-core, or perhaps a dual-core with HyperThreading.
F/M/S is Family/Model/Stepping, a common and fairly standard way to identify processors. As AMD says:
The processor Family identifies one or more processors as belonging to a group that possesses some common definition for software or hardware purposes. The Model specifies one instance of a processor family. The Stepping identifies a particular version of a specific model. Therefore, Family, Model and Stepping, when taken together, form a unique identification or signature for a processor.
It is most helpful if you have the manufacturer when you go looking for these things because the family numbers are pretty messy, but thankfully it's pretty clear that a family number of 16 (10 in hexadecimal) corresponds to an AMD processor (which should have a manufacturer string of "AuthenticAMD"). Specifically, it is the AMD K10, which is the Barcelona microarchitecture. That means there's no HyperThreading—this is just a native quad-core system.
We can narrow it down further by looking at the model. There were lots of different models based around the Barcelona core, branded variously as Athlon II, Opteron, Phenom, Phenom II, Sempron, Turion, and V-Series. Yours is model 4. This is where it gets kind of tricky, because I don't know of a good resource that lists the model numbers and steppings for various CPUs. You have to go directly to the manufacturer and wade through their manuals. For example, here is AMD's Revision Guide for the 10h Family. If you go to the "Processor Identification" section (it appears as a bookmark in the PDF for me), you see something that looks promising, but the information is certainly not presented in an easily digestible form. You get long hexadecimal values, out of which you have to extract the individual bits corresponding to the Family (bits 8-11), Model (bits 4-7), and Stepping (bits 0-3).
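If you'd rather not extract those bits by hand, here is a minimal sketch of the same decoding, assuming GCC or Clang with <cpuid.h> (MSVC users would use the __cpuid intrinsic instead):

#include <stdio.h>
#include <cpuid.h>

int main(void)
{
    unsigned eax, ebx, ecx, edx;
    __cpuid(1, eax, ebx, ecx, edx);   /* leaf 1: processor signature in EAX */

    unsigned stepping =  eax        & 0xF;
    unsigned model    = (eax >> 4)  & 0xF;
    unsigned family   = (eax >> 8)  & 0xF;
    /* The extended fields extend the signature when the base family is
     * 0xF (Intel additionally uses the extended model for family 0x6). */
    if (family == 0xF) {
        family += (eax >> 20) & 0xFF;
        model  |= ((eax >> 16) & 0xF) << 4;
    }
    printf("Family %u, Model %u, Stepping %u\n", family, model, stepping);
    return 0;
}

On the machine that produced the dump, this would print Family 16, Model 4, Stepping 3.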
I didn't do all the grunt work to be certain, I'm just making a quick guess that this is an AMD Phenom II X4. The X4 fits with the quad-core, and from a cursory glance, it appears that the Phenom IIs are model 4.
Anyway, you could have probably stopped a while back, because the microarchitecture tells you everything you need to know. This is an AMD Barcelona core, which doesn't support Supplemental SSE3 (SSSE3) instructions (three S's—not to be confused with SSE3; the naming conventions are ridiculous). SSSE3 was invented by Intel, released with the Core 2 microarchitecture.
AMD didn't implement them until Bobcat/Bulldozer. Bulldozer was the subsequent generation, family 21 (15h), for desktops and servers, while Bobcat was the low-power core for AMD's APUs.
SSSE3 didn't really offer that many new instructions: only 16, primarily intended for working with packed integers, and most of them aren't very exciting. The dump should also tell you the opcode of the instruction that caused the crash; if not, you'll have to go back and figure it out from the code's byte address. This will tell you exactly which instruction is the problem. I'm guessing that you are using PSHUFB to shuffle bytes in-place, which is the one SSSE3 instruction that is actually pretty useful. One common use I've seen is a fast population count algorithm (although there are other implementations that don't require SSSE3 and are almost as fast, if not faster).
Assuming that you're compiling with MSVC, I'm actually sort of surprised to see it emitting this instruction. In order to get it, you'd have to tell the compiler to target AVX, which would stop your code from running on anything older than Sandy Bridge/Bulldozer. I'm sure if you don't want to bump up your minimum system requirements, you can figure out an alternative sequence of instructions to do the same thing. pshufd, or movaps + shufps are possible candidates for workarounds.
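Whichever replacement sequence you end up with, the defensive fix is to test for SSSE3 at run time and dispatch accordingly. A minimal sketch, assuming GCC or Clang with <cpuid.h> (SSSE3 is ECX bit 9 of CPUID leaf 1; MSVC's __cpuid intrinsic exposes the same registers):

#include <stdio.h>
#include <cpuid.h>

static int has_ssse3(void)
{
    unsigned eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 0;                    /* leaf 1 unsupported: very old CPU */
    return (ecx & bit_SSSE3) != 0;   /* bit_SSSE3 is 1 << 9 */
}

int main(void)
{
    puts(has_ssse3() ? "SSSE3 available: take the pshufb path"
                     : "no SSSE3: take the pshufd/shufps fallback");
    return 0;
}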
The commands !sysinfo, !cpuinfo and !errrec are kernel dump commands, defined in the kdexts extension, so they are not available in user-mode debugging and probably won't work well even if you load that extension explicitly.
The only other idea I had for getting more information from the dump was .dumpdebug, which outputs a stream called SystemInfoStream that looks like this:
Stream 7: type SystemInfoStream (7), size 00000038, RVA 000000BC
ProcessorArchitecture 0000 (PROCESSOR_ARCHITECTURE_INTEL)
ProcessorLevel 0006
ProcessorRevision 2A07
NumberOfProcessors 04
... (OS specifics) ...
Unfortunately, that's exactly the same as displayed by !cpuid, so there's really no more information contained in the dump.

Using CPUID to detect CPU specs, reliable solution?

I'm trying to gather information about my CPU with __cpuid(). Though it gets things right on my computer, when I run my program on my colleague's computer it detects his Intel Core 2 Quad Q6600 as hyper-threaded, though according to the specifications on Intel's own site it isn't.
__cpuid() also detects the wrong number of "logical cores", as can be seen here:
Programmatically detect number of physical processors/cores or if hyper-threading is active on Windows, Mac and Linux, where it claims that the Intel Xeon E5520 has 16 logical cores and 8 physical.
I tried running the code found in that thread on my own computer, an Intel i7 2600K, and it gives me the same numbers as for the Xeon.
So how reliable is __cpuid() really? From my own experience it doesn't seem to be that reliable. Have I got something very fundamental wrong?
There is almost certainly a gap in the [x2]APIC ids on your processor, meaning that some values of the APIC id don't map to any logical processor. You should use the 0xB leaf of cpuid to find out. You can look at the reference Intel code and algorithm (https://software.intel.com/en-us/articles/intel-64-architecture-processor-topology-enumeration/) for the steps, but it boils down to calling cpuid with EAX=0xB, ECX=0 and obtaining in EBX the number of logical processors (threads) per core, then calling cpuid again with EAX=0xB, ECX=1 and getting in EBX the number of logical processors per processor package.
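A minimal sketch of that two-call sequence, assuming GCC or Clang with <cpuid.h> and a CPU that implements leaf 0xB (real code should first verify the leaf exists by checking the highest supported leaf and that sub-leaf 0 returns a non-zero EBX):

#include <stdio.h>
#include <cpuid.h>

int main(void)
{
    unsigned eax, ebx, ecx, edx;

    /* Sub-leaf 0: EBX[15:0] = logical processors per core. */
    __cpuid_count(0xB, 0, eax, ebx, ecx, edx);
    unsigned threads_per_core = ebx & 0xFFFF;

    /* Sub-leaf 1: EBX[15:0] = logical processors per package. */
    __cpuid_count(0xB, 1, eax, ebx, ecx, edx);
    unsigned threads_per_pkg = ebx & 0xFFFF;

    printf("%u threads/core, %u threads/package, %u cores/package\n",
           threads_per_core, threads_per_pkg,
           threads_per_pkg / threads_per_core);
    return 0;
}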
The old method of using leaf 0x1 cannot account for APIC id gaps. Alas, that's the sample code still given on the MSDN Visual C++ 2013 reference page (http://msdn.microsoft.com/en-us/library/hskdteyh.aspx), and it is incorrect for processors made in 2010 and later, as you found out, whether by using the code from MSDN or similarly incorrect code from elsewhere. The Wikipedia page on cpuid, which I've recently updated after struggling myself to understand the issue, now has a worked-out example in the section on "Intel thread/core and cache topology" for enumerating topology on a processor with APIC id gaps, with additional details, including how to determine which bits of the APIC ids are actually used and which are "dead".
Given the code sample currently offered by Microsoft on their __cpuid() page, this is basically the same question as Logical CPU count return 16 instead of 4 because it's rooted in the same interpretation error of the Intel specs. As an explanation for MSDN's poor show, the code they offer worked okay before 2010 or so; Intel used to provide a similar method before the x2APIC was introduced as you can see in this old video/article: https://software.intel.com/en-us/articles/hyper-threading-technology-and-multi-core-processor-detection If you look at the various versions of the MSDN page on __cpuid, their code sample has basically remained the same since 2008...
As for the single hyperthreaded detection bit, that is a longer story, already answered by me at Why does Hyper-threading get reported as supported on processors without it?. In a nutshell, that rather legacy bit tells you whether the processor package supports more than one logical processor, be it via hyperthreading or multi-core technology. The name of the bit is thus rather misleading.
Also, I suggest changing the title of your question to "Using CPUID to detect CPU topology, reliable solution?" because I found your question completely by accident: I was searching for Sandy Bridge cpuid dumps on Google when I came across it.
CPUID can be trusted; you just need to use it correctly, which in this case means enumerating the topology correctly. You get 16 logical processors because the field it is read from represents the maximum the package can support, not how many there actually are. The value retrieved for cores is actually the logical count.
The code in the topic is very basic and meant as a starting point. On my system (i7 2720QM) I also get incorrect data, but using my own code that checks the topology as per Intel's CPUID mappings, I get correct results.

Trace of CPU Instruction Reordering

I have studied a few things about instruction re-ordering by processors and Tomasulo's algorithm.
In an attempt to understand this topic a bit more, I want to know if there is ANY way to see (i.e., get a trace of) the actual dynamic reordering done for a given program.
I want to give an input program and see the "out of order instruction execution trace" of my program.
I have access to an IBM-P7 machine and an Intel Core2Duo laptop. Also please tell me if there is an easy alternative.
You have no access to the actual reordering done inside the CPU (there is no publicly known way to enable tracing). But there are some emulators of reordering, and some of them can give you useful hints.
For modern Intel CPUs (Core 2, Nehalem, Sandy Bridge, and Ivy Bridge) there is the "Intel(R) Architecture Code Analyzer" (IACA) from Intel. Its homepage is http://software.intel.com/en-us/articles/intel-architecture-code-analyzer/
This tool lets you see how a linear fragment of code will be split into micro-operations and how those will be scheduled onto execution ports. The tool has some limitations and is only an inexact model of the CPU's u-op reordering and execution.
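For reference, feeding code to IACA involves no tracing at all: you bracket the region of interest with markers from the iacaMarks.h header that ships with the tool, compile, and point the analyzer at the object file. A rough sketch (the dot-product loop is made up; exact command-line flags vary between IACA versions, something like iaca -64 -arch SNB dot.o):

#include "iacaMarks.h"   /* shipped with the IACA distribution */

/* The markers bracket one loop iteration, the usual idiom for
 * throughput analysis; the instrumented object file is meant to be
 * analyzed by IACA, not executed. */
float dot(const float *a, const float *b, int n)
{
    float sum = 0.0f;
    for (int i = 0; i < n; i++) {
        IACA_START
        sum += a[i] * b[i];
    }
    IACA_END
    return sum;
}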
There are also some "external" tools for emulating x86/x86_64 CPU internals; I can recommend PTLsim (or the derived MARSSx86):
PTLsim models a modern superscalar out of order x86-64 compatible processor core at a configurable level of detail ranging ... down to RTL level models of all key pipeline structures. In addition, all microcode, the complete cache hierarchy, memory subsystem and supporting hardware devices are modeled with true cycle accuracy.
But PTLsim models a generic "PTL" CPU, not a real AMD or Intel CPU. The good news is that this PTL core is out-of-order, based on ideas from real cores:
The basic microarchitecture of this model is a combination of design features from the Intel Pentium 4, AMD K8 and Intel Core 2, but incorporates some ideas from IBM Power4/Power5 and Alpha EV8.
Also, the thesis at http://es.cs.uni-kl.de/publications/datarsg/Senf11.pdf says that the JavaHASE applet is capable of emulating different simple CPUs and even includes a Tomasulo example.
Unfortunately, unless you work for one of these companies, the answer is no. Intel/AMD processors don't even schedule the (macro) instructions you give them; they first convert those instructions into micro-operations and then schedule those. What these micro-instructions are, and the entire process of instruction reordering, is a closely guarded secret, so they don't exactly want you to know what is going on.

Can I prefetch specific data to a specific cache level in a CUDA kernel?

I understand that Fermi GPUs support prefetching to the L1 or L2 cache. However, in the CUDA reference manual I cannot find anything about it.
Does CUDA allow my kernel code to prefetch specific data to a specific level of cache?
Well, not at the instruction level, but detailed information about prefetching in GPUs can be found here:
Many-Thread Aware Prefetching Mechanisms for GPGPU Applications
(a paper from the ACM/IEEE International Symposium on Microarchitecture, 2010)
You can find the instruction reference in NVIDIA's PTX ISA reference document; the relevant instructions are prefetch and prefetchu.
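From CUDA C you reach those PTX instructions through inline assembly. A minimal sketch (the kernel and the 256-element look-ahead distance are invented for illustration; the prefetch is only a hint, and its effect is architecture dependent):

// Bring a line that will be needed soon into L2 while computing on
// the current element.
__global__ void scale(float *out, const float *in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i + 256 < n)
        asm volatile("prefetch.global.L2 [%0];" :: "l"(in + i + 256));
    if (i < n)
        out[i] = 2.0f * in[i];
}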

Alignment requirements for atomic x86 instructions vs. MS's InterlockedCompareExchange documentation?

Microsoft offers the InterlockedCompareExchange function for performing atomic compare-and-swap operations. There is also an _InterlockedCompareExchange intrinsic.
On x86 these are implemented using the lock cmpxchg instruction.
However, reading through the documentation on these three approaches, they don't seem to agree on the alignment requirements.
Intel's reference manual says nothing about alignment (other than that if alignment checking is enabled and an unaligned memory reference is made, an exception is generated)
I also looked up the lock prefix, which specifically states that
The integrity of the LOCK prefix is not affected by the alignment of the memory field.
(emphasis mine)
So Intel seems to say that alignment is irrelevant. The operation will be atomic no matter what.
The _InterlockedCompareExchange intrinsic documentation also says nothing about alignment, however the InterlockedCompareExchange function states that
The parameters for this function must be aligned on a 32-bit boundary; otherwise, the function will behave unpredictably on multiprocessor x86 systems and any non-x86 systems.
So what gives?
Are the alignment requirements for InterlockedCompareExchange just to make sure the function will work even on pre-486 CPUs, where the cmpxchg instruction isn't available?
That seems likely based on the above information, but I'd like to be sure before I rely on it. :)
Or is alignment required by the ISA to guarantee atomicity, and I'm just looking in the wrong places in Intel's reference manuals?
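For concreteness, here is the kind of code I'm relying on: the classic compare-and-swap retry loop (the add_clamped helper is invented for illustration; the LONG global has natural 32-bit alignment, which satisfies the documented requirement):

#include <windows.h>
#include <stdio.h>

static volatile LONG counter = 0;   /* a plain LONG is naturally 32-bit aligned */

/* Atomically add delta to *p, clamping the result to max. */
void add_clamped(volatile LONG *p, LONG delta, LONG max)
{
    LONG old, desired;
    do {
        old = *p;
        desired = old + delta;
        if (desired > max)
            desired = max;
        /* InterlockedCompareExchange returns the value *p held before the
         * attempt; the swap took effect only if that value equals old. */
    } while (InterlockedCompareExchange(p, desired, old) != old);
}

int main(void)
{
    add_clamped(&counter, 10, 5);
    printf("%ld\n", counter);   /* prints 5 */
    return 0;
}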
x86 does not require alignment for a lock cmpxchg instruction to be atomic. However, alignment is necessary for good performance.
This should be no surprise: backward compatibility means that software written against a manual from 14 years ago will still run on today's processors. Modern CPUs even have a performance counter specifically for split-lock detection because it's so expensive. (The core can't just hold onto exclusive access to a single cache line for the duration of the operation; it has to do something like a traditional bus lock.)
Why exactly Microsoft documents an alignment requirement is not clear. It's certainly necessary for supporting RISC architectures, but the specific claim of unpredictable behaviour on multiprocessor x86 might not even be valid. (Unless they mean unpredictable performance, rather than a correctness problem.)
Your guess that it applies only to pre-486 systems without lock cmpxchg might be right; a different mechanism would be needed there, which might have required some kind of locking around pure loads or pure stores. (Also note that the 486 cmpxchg has a different, currently undocumented opcode (0f a7) from the modern cmpxchg (0f b1), which was new with the 586 Pentium; Windows might have only used cmpxchg on P5 Pentium and later, I don't know.) That could maybe explain weirdness on some x86 systems, without implying weirdness on modern x86.
Intel® 64 and IA-32 Architectures Software Developer’s Manual
Volume 3 (3A): System Programming Guide
January 2013
8.1.2.2 Software Controlled Bus Locking
To explicitly force the LOCK semantics, software can use the LOCK prefix with the following instructions when they are used to modify a memory location. [...]
• The exchange instructions (XADD, CMPXCHG, and CMPXCHG8B).
• The LOCK prefix is automatically assumed for XCHG instruction.
• [...]
[...] The integrity of a bus lock is not affected by the alignment of the memory field. The LOCK semantics are followed for as many bus cycles as necessary to update the entire operand. However, it is recommended that locked accesses be aligned on their natural boundaries for better system performance:
• Any boundary for an 8-bit access (locked or otherwise).
• 16-bit boundary for locked word accesses.
• 32-bit boundary for locked doubleword accesses.
• 64-bit boundary for locked quadword accesses.
Fun fact: cmpxchg without a lock prefix is still atomic wrt. context switches, so is usable for multi-threading on a single-core system.
Even misaligned it's still atomic wrt. interrupts (either completely before or completely after), and only memory reads by other devices (e.g. DMA) could see tearing. But such accesses could also see the separation between load and store, so even if old Windows did use that for a more efficient InterlockedCompareExchange on single-core systems, it still wouldn't require alignment for correctness, only performance. If this can be used for hardware access, Windows probably wouldn't do that.
If the library function needed to do a pure load separate from the lock cmpxchg this might make sense, but it doesn't need to do that. (If not inlined, the 32-bit version would have to load its args from the stack, but that's private, not access to the shared variable.)
The PDF you are quoting from is from 1999 and CLEARLY outdated.
The up-to-date Intel documentation, specifically Volume 3A, tells a different story.
For example, on a Core i7 processor, you STILL have to make sure your data doesn't span cache lines, or else the operation is NOT guaranteed to be atomic.
In Volume 3A, System Programming, Intel clearly states for x86/x64:
8.1.1 Guaranteed Atomic Operations
The Intel486 processor (and newer processors since) guarantees that the following basic memory operations will always be carried out atomically:
• Reading or writing a byte
• Reading or writing a word aligned on a 16-bit boundary
• Reading or writing a doubleword aligned on a 32-bit boundary
The Pentium processor (and newer processors since) guarantees that the following additional memory operations will always be carried out atomically:
• Reading or writing a quadword aligned on a 64-bit boundary
• 16-bit accesses to uncached memory locations that fit within a 32-bit data bus
The P6 family processors (and newer processors since) guarantee that the following additional memory operation will always be carried out atomically:
• Unaligned 16-, 32-, and 64-bit accesses to cached memory that fit within a cache line
Accesses to cacheable memory that are split across cache lines and page boundaries are not guaranteed to be atomic by the Intel Core 2 Duo, Intel® Atom™, Intel Core Duo, Pentium M, Pentium 4, Intel Xeon, P6 family, Pentium, and Intel486 processors. The Intel Core 2 Duo, Intel Atom, Intel Core Duo, Pentium M, Pentium 4, Intel Xeon, and P6 family processors provide bus control signals that permit external memory subsystems to make split accesses atomic; however, nonaligned data accesses will seriously impact the performance of the processor and should be avoided.
See this SO question: natural alignment is important for performance, and is required on the x64 architecture (so it's not just PRE-x86 systems but POST-x86 ones too; x64 may still be a bit of a niche case, but it's growing in popularity, after all). That may be why Microsoft documents it as required. It's hard to find documentation on whether MS has decided to FORCE the alignment issue by enabling alignment checking (that may vary by Windows version); by claiming in the docs that alignment is required, MS keeps the freedom to force it in some versions of Windows even if it did not force it in others.
Microsoft's Interlocked APIs also applied to IA-64 (while it still existed). There was no lock prefix on IA-64, only the cmpxchg.acq and cmpxchg.rel instructions (plus fetchadd and other similar beasties), and these all required alignment, if I recall correctly.
