Where to find opcode documentation for my CPU and also some other CPUs?

I could find datasheets documenting the opcodes and their meanings for some popular (and old) microprocessors on the internet. For example, here are the links for the 4004 and 8085:
Intel 4004: https://web.archive.org/web/20110601032753/http://www.intel.com/Assets/PDF/DataSheet/4004_datasheet.pdf
Intel 8085: https://ia801807.us.archive.org/3/items/intel-8085-datasheet/8085_datasheet.pdf
I really want to learn about the new technologies implemented in recent CPUs (released by Intel and AMD) and their opcodes, but I could not find any documentation, especially for CPUs from AMD.
Intel does have some documentation for their latest CPUs at
https://www.intel.com/content/www/us/en/products/docs/processors/core/core-technical-resources.html , but I couldn't find documentation of the opcodes of their CPUs there. And I couldn't find any documentation from AMD beyond the standard specifications of their CPUs.
I really want documentation of the opcodes of my CPU: Ryzen 5 3500U

Related

How to measure the ACTUAL number of clock cycles elapsed on modern x86?

On recent x86, RDTSC returns some pseudo-counter that measures time instead of clock cycles.
Given this, how do I measure actual clock cycles for the current thread/program?
Platform-wise, I prefer Windows, but a Linux answer works too.
This is not simple. Such a thing is described in the Intel® 64 and IA-32 Architectures Developer's Manual: Vol. 3B:
Here is the behaviour:
For Pentium M processors; for Pentium 4 processors, Intel Xeon processors; and for P6 family processors: the time-stamp counter increments with every internal processor clock cycle. The internal processor clock cycle is determined by the current core-clock to bus-clock ratio. Intel® SpeedStep® technology transitions may also impact the processor clock.
For Pentium 4 processors, Intel Xeon processors; for Intel Core Solo and Intel Core Duo processors; for the Intel Xeon processor 5100 series and Intel Core 2 Duo processors; for Intel Core 2 and Intel Xeon processors; for Intel Atom processors: the time-stamp counter increments at a constant rate. That rate may be set by the maximum core-clock to bus-clock ratio of the processor or may be set by the maximum resolved frequency at which the processor is booted. The maximum resolved frequency may differ from the processor base frequency. On certain processors, the TSC frequency may not be the same as the frequency in the brand string.
Here is the advice for your use case:
To determine average processor clock frequency, Intel recommends the use of performance monitoring logic to count processor core clocks over the period of time for which the average is required. See Section 18.17, “Counting Clocks on systems with Intel Hyper-Threading Technology in Processors Based on Intel NetBurst® Microarchitecture,” and Chapter 19, “Performance-Monitoring Events,” for more information.
The bad news is that, AFAIK, performance counters are often not portable between AMD and Intel processors, so you will need to check the AMD documentation to see which performance counters to use. There are also complications: you cannot easily measure the number of cycles taken by arbitrary code. For example, the processor can be halted or enter a sleep state for a short period of time (see C-states), or the OS can be executing protected code that cannot be profiled without high privileges (for the sake of security). This method is fine as long as you need to measure the number of cycles of numerically intensive code that runs for a relatively long time (at least several dozen cycles). On top of all that, the documentation and usage of MSRs is pretty complex and comes with some restrictions.
Performance counters like CPU_CLK_UNHALTED.THREAD and CPU_CLK_UNHALTED.REF_TSC seem a good start for what you want to measure. Using a library to read such performance counters is generally a very good idea (unless you like having a headache for at least a few days). PAPI might be enough to do the job.
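As a rough illustration of what these two counters give you once they have been read (for example through a library such as PAPI), the sketch below derives the average unhalted core frequency from their ratio. It assumes that CPU_CLK_UNHALTED.REF_TSC ticks at the TSC rate while the core is not halted. The function name, the sample counts and the 1.8 GHz TSC frequency are made up for illustration and are not measurements.

```python
def average_core_frequency_hz(core_cycles, ref_cycles, tsc_hz):
    """Estimate the average frequency while the core was not halted.

    core_cycles: delta of CPU_CLK_UNHALTED.THREAD over the measured region
    ref_cycles:  delta of CPU_CLK_UNHALTED.REF_TSC over the same region
    tsc_hz:      TSC frequency in Hz (assumed to be known, e.g. from the OS)
    """
    if ref_cycles <= 0:
        raise ValueError("ref_cycles must be positive")
    # Assumption: REF_TSC advances at the TSC rate, so the ratio of the two
    # counters scales the TSC frequency to the actual average core frequency.
    return core_cycles / ref_cycles * tsc_hz


# Hypothetical counter deltas, not taken from real hardware.
core = 3_000_000
ref = 1_800_000
freq = average_core_frequency_hz(core, ref, tsc_hz=1_800_000_000)
print(f"{core} core cycles at an average of {freq / 1e9:.2f} GHz")
```

The core cycle count itself is the direct answer to the question; the ratio is only useful as a sanity check that frequency scaling is what makes the count differ from a TSC delta.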
Here are some interesting related posts:
Lost Cycles on Intel? An inconsistency between rdtsc and CPU_CLK_UNHALTED.REF_TSC
How to read performance counters by rdpmc instruction?

Which mobile windows devices don't support AVX2

I understand that Intel's AVX2 extension has been on the market since 2011 and is therefore pretty much standard in modern devices.
However, for some decision making we need to find out, roughly, the share of existing mobile Windows devices that don't support AVX2 (or its successor, AVX-512).
It is rather well documented which CPUs, Intel and AMD, actually support the extension, so that is not what I am asking.
How do I find which mobile Windows devices on the market, including those from recent years, have processors that don't yet support the AVX2 instruction set?
You're incorrect about the dates, and about being "pretty much standard", unfortunately. It could have been by now if Intel hadn't disabled it for market-segmentation reasons in their low-end CPUs. (To be slightly fair, that may have let them sell chips with defects in one half of a 256-bit execution unit, improving yields).
All AMD CPUs aimed at mobile/laptop use (not Geode), including low-power CPUs since Jaguar, have had AVX since Bulldozer. Their low-power CPUs decode 256-bit instructions to two 128-bit uops, same as they did in Bulldozer-family and Zen1. (Which meant it wasn't always worth using in Bulldozer-family, but it wasn't a lot slower than carefully-tuned SSE, and sometimes still faster, and meant software had that useful baseline. And 128-bit AVX instructions are great everywhere, often saving instructions by being 3 operand.) Intel used the same decode into 2 halves strategy in Gracemont as the E-cores for Alder Lake, like they did for SSE in P6 CPUs before Core 2, like Pentium III and Pentium M.
AVX was new in Sandy Bridge (2011) and Bulldozer (2011), AVX2 was new in Haswell (2013) and Excavator (2015).
Pentium/Celeron versions of Skylake / Coffee Lake etc. (lower end than i3) have AVX disabled, along with AVX2/FMA/BMI1/2. BMI1 and 2 include some instructions that use VEX encodings on general-purpose integer registers, which seems to indicate that Intel disables decoding of VEX prefixes entirely as part of binning a silicon chip for use in low-end SKUs.
The first Pentium/Celeron CPUs with AVX1/2/FMA are Ice Lake / Tiger Lake based. There are currently Alder Lake based Pentiums with AVX2, like 2c4t (2 P cores) Pentium Gold G7400 and Pentium Gold 8505 (mobile 1 P core, 4 E cores). So 7xxx and 8xxx and higher Pentiums should have AVX1 / AVX2 / FMA, but earlier ones mostly not. One of the latest without AVX is desktop Pentium Gold G6405, 2c4t Comet Lake launched in Q1 2021. (The mobile version, 6405U, launched in Q4'19). There's also an "Amber Lake Y" Pentium Gold 6500Y with AVX2, launched Q1'21.
Low-power CPUs in the Silvermont family (up to 2019's Tremont) don't support AVX at all.
These are common in "netbook" and low budget laptops, as well as low-power servers / NAS. (The successor, Gracemont, has AVX1/2/FMA, so it can work as the E-cores in Alder Lake.)
These CPUs are badged as Pentium J-series and N-series. For example, Intel Pentium Processor N6415 launched in 2021, with 4 cores, aimed at "PC/Client/Tablet" use cases. These are Elkhart Lake (Tremont cores), with only SSE4.2.
The "Atom" brand name is still used on server versions of these, including chips with 20 cores.
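For checking a single machine rather than a market segment, a minimal sketch is shown below. It assumes a Linux system, where /proc/cpuinfo exposes a "flags" line per logical CPU; on Windows a different mechanism would be needed. The function name and the sample text are hypothetical.

```python
def simd_support(cpuinfo_text):
    """Report which of a few ISA extensions appear in the first 'flags' line."""
    wanted = ("avx", "avx2", "fma", "bmi1", "bmi2")
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            # Compare whole tokens: a substring test would wrongly
            # report "avx" as present whenever "avx2" is listed.
            present = set(value.split())
            return {name: name in present for name in wanted}
    return {name: False for name in wanted}


# Hypothetical excerpt shaped like a CPU with AVX but without AVX2.
sample = "model name\t: Hypothetical CPU\nflags\t\t: fpu sse4_2 avx popcnt\n"
print(simd_support(sample))
```

On a real Linux machine the text would come from reading /proc/cpuinfo.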

Clock Cycles for the invlpg instruction

I was reading some documentation about the invlpg instruction for Intel Pentium processors and it says that it takes 25 clock cycles. I thought that this depended on the implementation (the particular CPU) and not the actual instruction set architecture? Or is the fact that this instruction must take 25 clock cycles to run also part of the instruction set specification?
The documentation is saying that it took 25 clock cycles on the Pentium. The number of clock cycles the instruction takes on other CPUs may be more or fewer. The performance of instructions is not part of the instruction set specification.
That number is not part of any official ISA documentation, it's just performance data that someone annotated into an old (then-current) copy of Intel's ISA docs.
It's from some random microarchitecture, presumably P5 Pentium that was relevant back when Tripod was a widely used web host, and which that guide labels itself as documenting. (These days there are Pentium/Celeron CPUs that are just cut-down versions of i3/i5/i7 of the same generation, with stuff like AVX and BMI1/2 disabled. But Pentium used to refer to the P5 microarchitecture.)
It's not from Intel's documentation; it was added by whoever compiled that HTML. The formatting is similar to modern versions of Intel's vol.2 x86 SDM instruction-set reference manual. You can find HTML extracts of that at https://github.com/HJLebbink/asm-dude/wiki/INVLPG and https://www.felixcloutier.com/x86/invlpg for example. The encoding / mnemonic / description table at the top has identical formatting in your Tripod link, but the actual text is somewhat different. Also, the text for inc (current Intel vs. tripod) is word for word identical.
So yes, this is based on an old PDF->HTML of Intel's vol.2 manual, with P5 cycles and instruction-pairing info added (inc pairs in the U or V pipe on that dual-issue in-order pipeline that doesn't break instructions down into uops). Also with FLAGS updating section turned into tables.
That instruction-pairing and cycle-count info is totally irrelevant when tuning for modern microarchitectures like Skylake and Zen, but you can find it in Agner Fog's instruction tables: his spreadsheet has a sheet for P5, as well as for later Intel, AMD, and Via microarchitectures. (Also see his optimization guide and microarch pdf for background info to help you make sense of uops / ports / latency / throughput info.) Agner doesn't test most kernel instructions so invlpg isn't in his list.
http://faydoc.tripod.com/cpu/index.htm is obviously not an official Intel source. IDK where the author of this got their info from. Maybe they tested themselves. Or Intel has sometimes published some timing numbers for some microarchitectures, e.g. as part of their optimization manual. This is totally separate from the x86 ISA manuals, and is not something you can rely on for correctness. And other people have published their test results.
Another good source for experimental test results of instruction performance (uops for which ports, latency, and throughput) is https://uops.info/. Their testing for invlpg m8 shows it has a back-to-back throughput of ~194 cycles in practice on Skylake-client, ~157 on Nehalem, and ~126.25 on Zen+ and Zen2, to pick some random examples. But it may interleave better with other instructions, taking "only" 47 front-end uops on recent Intel CPUs and thus can issue in under 12 cycles if the back-end has room in the ROB / RS, maybe letting later instructions execute while the invlpg operation is in progress. (Although if it takes over 100 cycles for its uops to retire, that will often stall OoO exec at some point for a fraction of the total time.)
Remember that instruction performance can't be characterized by a single number on out-of-order CPUs; it's not one-dimensional. Perf analysis is not as simple as adding up cycle costs for all instructions in a loop; you have to analyze how they can overlap with each other. Or, for complex cases like invlpg, measure.
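The "under 12 cycles" figure for issuing invlpg follows from simple arithmetic, sketched below. The issue width of 4 uops per cycle is an assumption made here for illustration; the 47 uops and the ~194-cycle back-to-back throughput are the Skylake-client numbers quoted above.

```python
def min_issue_cycles(uops, issue_width):
    """Lower bound on the cycles the front end needs to issue `uops`."""
    return uops / issue_width


ISSUE_WIDTH = 4  # assumed uops issued per cycle, not stated in the text above

issue = min_issue_cycles(47, ISSUE_WIDTH)
print(issue)        # 11.75
print(issue / 194)  # fraction of the measured back-to-back cost
```

The gap between the two numbers is why a single cycle count cannot describe the instruction: the front-end cost and the time until the operation completes are separate dimensions.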

How to disable prefetcher in Atom N270 processor

I am trying to disable hardware prefetching on my system, which has an Atom processor (N270).
I am following the method from the linked question How do I programatically disable hardware prefetching in core2duo ?
I am able to execute,
./rdmsr 0x1a0
366b52488
however, this gives error message
./wrmsr -p0 0x1a0 0x366d52688
wrmsr: Cpu 0 can't set MSR from 0x1a0 to 0x366d52688
Although I am able to set bit 0 and bit 3, no other bits can be modified.
./wrmsr -p0 0x1a0 0x366b52489
As per the linked question disable prefetcher in i3/i7, the hardware prefetcher in Nehalem, Westmere, Sandy Bridge, Ivy Bridge, Haswell, and Broadwell can be disabled via the MSR at address 0x1a4.
On the Atom processor, reading 0x1a4 is not permitted.
./rdmsr 0x1a4
rdmsr: Cpu 0 can't read MSR from 0x000001a4
I am wondering how it is possible that there is no information available on how to disable the hardware prefetcher in Atom processors,
although the Atom N270 and Core 2 Duo processors were released in the same year (2008) and Intel has disclosed how to disable the hardware prefetcher in Core 2 Duo.
Any link to a document on how I can disable the prefetcher in Atom processors would be a great help. Thank you in advance.
The only reliable source for information like this is the Intel Architecture Software Developer's Manual. There is an entire chapter dedicated to MSRs (in the most recent release it's Chapter 35).
The 0x1a4 or 0x1a0 MSR address that you found is machine dependent (that's why it's called a Model Specific Register), meaning it might only be available on some models, or might be removed in future models.
If you go to the chapter "MSRS IN THE 45 NM AND 32 NM INTEL® ATOM™ PROCESSOR FAMILY", which matches your Atom N270, you won't find any MSR related to prefetcher settings. So it is not available in official Intel CPU releases (though it might be found in some engineering samples).
There might be two reasons why it's not available: either it is not a highly requested feature, so removing it could save some silicon gates; or Intel thinks this feature is best left at its default and not configurable by the public (it is subject to misuse by vendors or users and can lead to poor performance under certain conditions).
BTW, information about the 0x1a4 MSR address can be found in IA SDM Chapter 35.5, and about 0x1a0 in Chapter 35.2.
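To see exactly which bits the rejected write in the question tried to change, the values can be compared bit by bit. This is a minimal sketch using only the three values quoted in the question; the helper name is made up, and the code makes no claim about what each bit means on the N270.

```python
def changed_bits(old, new):
    """Return the bit positions (0 = least significant) that differ."""
    diff = old ^ new
    return [bit for bit in range(diff.bit_length()) if (diff >> bit) & 1]


current = 0x366b52488   # value read with rdmsr 0x1a0 in the question
rejected = 0x366d52688  # value that wrmsr refused to write
accepted = 0x366b52489  # value that wrmsr accepted

print(changed_bits(current, rejected))  # [9, 21, 22]
print(changed_bits(current, accepted))  # [0]
```

Writing one changed bit at a time would show which individual bits the processor refuses, instead of having a combined write rejected as a whole.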

How many CPUs and cores are really in a multicore processor?

I have an Intel Core i7 processor (CPU name: Intel(R) Core(TM) i7-4500U CPU @ 1.80GHz, CPU type: Intel Core Haswell processor).
I am puzzled by the output of the cpuid command, as it shows 4 CPUs, each having 2 cores!
Do I really have 4 CPUs?
The output includes 4 CPUs (cpu0 to cpu3):
(multi-processing synth): multi-core (c=2), hyper-threaded (t=2)
I ask because I want to use hardware performance counters to test my app, but I am confused about how many cores I have to monitor and profile.
Your Intel i7-4500U is a dual-core CPU with Hyper-Threading support, so you see 4 logical cores.
The U stands for ultra-low power, so this is a CPU designed for long battery life in slim ultrabooks.
First, as mentioned before, your system is a dual-core with Hyperthreading (Hyperthreading means each core can execute from two simultaneous hardware threads). Therefore, your OS sees 4 "logical CPUs" even though there's only one "physical CPU". Read more below:
If you're on Linux, look at /proc/cpuinfo using cat or less as follows:
cat /proc/cpuinfo
That will list all the info you need. However, to answer your question and make sense of that information, you need to know that there is a difference between a 'logical CPU' and a 'physical CPU'. A physical CPU is the actual hardware, made by Intel for example, that is installed in your system. A logical CPU is what is seen by the OS, and basically refers to a 'hardware thread' or one processor core. So, let's say you have one physical CPU with 4 cores and each core supports one thread (hardware thread): your OS will then see 4 CPUs, and those will be listed in /proc/cpuinfo with different 'processor' numbers but the same 'physical id', because they all belong to the same physical processor.
As another example, let's say that each of the cores above supports two threads (again, hardware threads, not software threads). Then your OS will see 8 CPUs. If you have a dual-socket (multi-node) server with two physical CPUs and all of the above, then your OS will see 16 CPUs; each group of 8 will share the same 'physical id'.
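The counting described above can be sketched in a few lines. This assumes the x86 Linux layout of /proc/cpuinfo, with one blank-line-separated block per logical CPU and 'processor', 'physical id' and 'core id' fields; the function name and the sample text are hypothetical.

```python
def count_topology(cpuinfo_text):
    """Return (physical packages, cores, logical CPUs) from cpuinfo-style text."""
    packages, cores, logical = set(), set(), 0
    for block in cpuinfo_text.strip().split("\n\n"):
        fields = {}
        for line in block.splitlines():
            key, sep, value = line.partition(":")
            if sep:
                fields[key.strip()] = value.strip()
        if "processor" not in fields:
            continue
        logical += 1
        package = fields.get("physical id", "0")
        packages.add(package)
        # A core is identified by its package together with its core id.
        cores.add((package, fields.get("core id", fields["processor"])))
    return len(packages), len(cores), logical


# Hypothetical excerpt shaped like a dual-core, hyper-threaded CPU.
sample = "\n\n".join(
    f"processor\t: {cpu}\nphysical id\t: 0\ncore id\t\t: {core}"
    for cpu, core in enumerate([0, 1, 0, 1])
)
print(count_topology(sample))  # (1, 2, 4)
```

For the i7-4500U in the question, the result would be expected to have the same shape: one package, two cores, four logical CPUs.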
Info about your system is here: http://ark.intel.com/products/75460/Intel-Core-i7-4500U-Processor-4M-Cache-up-to-3_00-GHz

Resources