Which mobile Windows devices don't support AVX2?

I understand that Intel's AVX2 extension has been on the market since 2011 and is therefore pretty much standard in modern devices.
However, for some decision-making we need to find out, roughly, the share of existing mobile Windows devices which don't support AVX2 (nor its successor, AVX-512).
It is rather well documented which CPUs, Intel and AMD, actually support the extension, so that is not what I am asking for.
How do I find out which mobile Windows devices exist on the market, including from recent years, that have processors which don't yet support the AVX2 instruction set?

You're incorrect about the dates, and about it being "pretty much standard", unfortunately. It could have been by now if Intel hadn't disabled it for market-segmentation reasons in their low-end CPUs. (To be slightly fair, that may have let them sell chips with defects in one half of a 256-bit execution unit, improving yields.)
All AMD CPUs aimed at mobile/laptop use (not counting Geode), including the low-power cores since Jaguar, have had AVX since Bulldozer. Their low-power cores decode 256-bit instructions into two 128-bit uops, the same way the Bulldozer family and Zen 1 did. (Which meant AVX wasn't always worth using on the Bulldozer family, but it wasn't a lot slower than carefully-tuned SSE, sometimes still faster, and it meant software had that useful baseline. And 128-bit AVX instructions are great everywhere, often saving instructions by being 3-operand.) Intel used the same decode-into-two-halves strategy in Gracemont, the E-core in Alder Lake, like they did for SSE in P6 CPUs before Core 2, such as Pentium III and Pentium M.
AVX was new in Sandy Bridge (2011) and Bulldozer (2011), AVX2 was new in Haswell (2013) and Excavator (2015).
Pentium/Celeron versions of Skylake / Coffee Lake etc. (lower end than i3) have AVX disabled, along with AVX2/FMA/BMI1/2. BMI1 and 2 include some instructions that use VEX encodings on general-purpose integer registers, which seems to indicate that Intel disables decoding of VEX prefixes entirely as part of binning a silicon chip for use in low-end SKUs.
The first Pentium/Celeron CPUs with AVX1/2/FMA are Ice Lake / Tiger Lake based. There are currently Alder Lake based Pentiums with AVX2, like 2c4t (2 P cores) Pentium Gold G7400 and Pentium Gold 8505 (mobile 1 P core, 4 E cores). So 7xxx and 8xxx and higher Pentiums should have AVX1 / AVX2 / FMA, but earlier ones mostly not. One of the latest without AVX is desktop Pentium Gold G6405, 2c4t Comet Lake launched in Q1 2021. (The mobile version, 6405U, launched in Q4'19). There's also an "Amber Lake Y" Pentium Gold 6500Y with AVX2, launched Q1'21.
Low-power CPUs in the Silvermont family (up to 2019's Tremont) don't support AVX at all.
These are common in "netbook" and low-budget laptops, as well as low-power servers / NAS boxes. (The successor, Gracemont, has AVX1/2/FMA, so it can work as the E-core in Alder Lake.)
These CPUs are badged as Pentium J-series and N-series. For example, the Intel Pentium Processor N6415 launched in 2021, with 4 cores, aimed at "PC/Client/Tablet" use cases. These are Elkhart Lake (Tremont cores), with only SSE4.2.
The "Atom" brand name is still used on server versions of these, including chips with 20 cores.


How to measure the ACTUAL number of clock cycles elapsed on modern x86?

On recent x86, RDTSC returns some pseudo-counter that measures time instead of clock cycles.
Given this, how do I measure actual clock cycles for the current thread/program?
Platform-wise, I prefer Windows, but a Linux answer works too.
This is not simple. The behaviour is described in the Intel® 64 and IA-32 Architectures Software Developer's Manual, Vol. 3B:
For Pentium M processors; for Pentium 4 processors, Intel Xeon processors; and for P6 family processors: the time-stamp counter increments
with every internal processor clock cycle. The internal processor clock cycle is determined by the current core-clock to bus-clock ratio. Intel®
SpeedStep® technology transitions may also impact the processor clock.
For Pentium 4 processors, Intel Xeon processors; for Intel Core Solo
and Intel Core Duo processors; for the Intel Xeon processor 5100 series and Intel Core 2 Duo processors; for Intel Core 2 and Intel Xeon processors; for Intel Atom processors: the time-stamp counter increments at a constant rate. That rate may be set by the maximum core-clock to bus-clock ratio of the processor or may be set by the maximum resolved frequency at which the processor is booted. The maximum resolved frequency may differ from the processor base frequency. On certain processors, the TSC frequency may not be the same as the frequency in the brand string.
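As an aside, you can check which behaviour your CPU advertises: CPUID leaf 0x80000007 reports an invariant TSC in EDX bit 8, on both Intel and AMD. A minimal sketch using GCC/Clang's <cpuid.h>:

```c
/* Minimal sketch: check for the invariant/constant-rate TSC described in
 * the manual text above.  CPUID leaf 0x80000007, EDX bit 8 = Invariant TSC. */
#include <cpuid.h>
#include <stdio.h>

int main(void) {
    unsigned eax = 0, ebx = 0, ecx = 0, edx = 0;
    if (__get_cpuid(0x80000007, &eax, &ebx, &ecx, &edx) && (edx & (1u << 8)))
        puts("Invariant TSC: RDTSC ticks at a constant rate (measures time)");
    else
        puts("No invariant TSC reported");
    return 0;
}
```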
Here is the advice for your use case:
To determine average processor clock frequency, Intel recommends the use of performance monitoring logic to count processor core clocks over the period of time for which the average is required. See Section 18.17, "Counting Clocks on systems with Intel Hyper-Threading Technology in Processors Based on Intel NetBurst® Microarchitecture," and Chapter 19, "Performance-Monitoring Events," for more information.
The bad news is that, AFAIK, performance counters are often not portable between AMD and Intel processors, so you will certainly need to check the AMD documentation to find which counters to use there. There are also complications: you cannot easily measure the number of cycles taken by arbitrary code. For example, the processor can be halted or enter a sleep state for a short period of time (see C-states), or the OS can execute some protected code that cannot be profiled without high privileges (for the sake of security). This method is fine as long as you need to measure the number of cycles of a numerically-intensive piece of code that takes a relatively long time (at least several dozen cycles). On top of all that, the documentation and usage of MSRs is pretty complex and comes with some restrictions.
Performance counters like CPU_CLK_UNHALTED.THREAD and CPU_CLK_UNHALTED.REF_TSC seem like a good start for what you want to measure. Using a library to read such performance counters is generally a very good idea (unless you like having a headache for at least a few days). PAPI might be enough to do the job.
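If the Linux route is acceptable, the interface that libraries like PAPI sit on top of is perf_event_open(). A minimal sketch that counts unhalted core clock cycles for the current thread (error handling kept minimal; running unprivileged may require lowering /proc/sys/kernel/perf_event_paranoid):

```c
/* Minimal sketch (Linux): count actual core clock cycles for the current
 * thread with perf_event_open().  PERF_COUNT_HW_CPU_CYCLES maps to an
 * unhalted-core-cycles event, so unlike RDTSC it follows the real core
 * frequency. */
#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <sys/ioctl.h>
#include <sys/types.h>
#include <unistd.h>
#include <string.h>
#include <stdio.h>
#include <stdint.h>

static long perf_event_open(struct perf_event_attr *attr, pid_t pid,
                            int cpu, int group_fd, unsigned long flags) {
    return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
}

int main(void) {
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_HW_CPU_CYCLES;   /* core cycles, not ref cycles */
    attr.disabled = 1;
    attr.exclude_kernel = 1;                  /* user-space cycles only */
    attr.exclude_hv = 1;

    int fd = (int)perf_event_open(&attr, 0 /* this thread */, -1, -1, 0);
    if (fd < 0) { perror("perf_event_open"); return 1; }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

    volatile double x = 1.0;                  /* toy workload to measure */
    for (int i = 0; i < 1000000; i++) x *= 1.0000001;

    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    uint64_t cycles = 0;
    if (read(fd, &cycles, sizeof(cycles)) == -1) perror("read");
    printf("core clock cycles: %llu\n", (unsigned long long)cycles);
    close(fd);
    return 0;
}
```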
Here are some interesting related posts:
Lost Cycles on Intel? An inconsistency between rdtsc and CPU_CLK_UNHALTED.REF_TSC
How to read performance counters by rdpmc instruction?

Can I compile Go programs on Xeon Phi (Knight's Landing) processors?

I'm a hobbyist who likes to run my own programs in Go, and as Xeon Phi processors get older they're also becoming extremely cheap: so cheap that I can build a dual-socket machine from 2015/16 for under $1000.
I'm trying to find out if I can run Go programs on these. From what I've seen, this thread says they won't run (and suggests trying gccgo), because the chip partially runs on an x87 ISA. Confusingly, the Go release notes say they're dropping x87 support in 1.16, implying it was supported in the past. I've also seen in other threads that all programs will run on the compatibility layer, but that's an extremely slow layer which only has access to a small portion of the CPU's cache.
I feel like I'm moving farther and farther out of my element, so I was wondering if someone who's used Xeon Phi knows whether it will run Go code? Or just in general, after booting up Ubuntu (or FreeBSD, something I've seen done and which is listed in motherboard specs), what sort of things aren't going to work, and what will?
I appreciate any and all help!
You're basing your Knight's Landing worries on this quote about Knight's Corner:
The Knight's Corner processor is based on an x86-64 foundation, yes, but it in fact has its own floating-point instruction set—no x87, no AVX, no SSE, no MMX... Oh, and then you can throw all that away when Knight's Landing (KNL) comes out.
By "throw all that away", they mean all the worries and incompatibilities. KNL is based on Silvermont and is fully x86-64 compatible (including x87, SSE, and SSE2 for both standard ways of doing FP math). It also supports AVX-512F, AVX-512ER, and a few other AVX-512 extensions, along with AVX and AVX2 and SSE up to SSE4.2. A lot like a Skylake-server CPU, except a different set of AVX-512 extensions.
The point of this is exactly to solve the problem you're worried about: so any legacy binary can run on KNL. To get good performance out of it, you want to be running code vectorized with AVX-512 vectors in the loops that do the heavy lifting, but all the surrounding code and other programs in the rest of the Linux distro or whatever can be running ordinary bog-standard code that uses whatever x87 and/or SSE.
Knight's Corner (first-gen commercial Xeon Phi) has its own variant / precursor of AVX-512 in a core based on P5-Pentium, and no other FP hardware.
Knight's Landing (second-gen commercial Xeon Phi) is based on Silvermont, with AVX-512, and is the first that can act as a "host" processor (bootable) instead of just a coprocessor.
This "host" mode is another reason for including enough hardware to decode and execute x87 and SSE: if you're running a whole system on KNL, you're much more likely to want to execute some legacy binaries for non-perf-sensitive tasks, not only binaries compiled specifically for it.
Its x87 performance is not great, though: about one scalar fmul per 2 clocks (https://agner.org/optimize), vs. 2-per-clock SSE mulsd (0.5c reciprocal throughput). The same 0.5c throughput applies to other SSE/AVX math, including AVX-512 vfmadd132ps with zmm registers to do 16 single-precision fused-multiply-add operations in one instruction.
So hopefully Go's compiler doesn't use x87 much. The normal way to do scalar math in 64-bit mode (that C compilers and their math libraries use) is SSE, in XMM registers. x86-64 C compilers only use x87 for types like long double.
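You can see that split for yourself by compiling a trivial function and inspecting the assembly (a sketch; exact output depends on the compiler and ABI; MSVC, for example, treats long double the same as double):

```c
/* Minimal sketch: on x86-64, compilers use SSE (xmm registers) for float
 * and double, and only fall back to x87 for long double.  Compile with
 * `gcc -O2 -S` and look at the generated assembly. */
double mul_double(double a, double b) {
    return a * b;            /* typically compiles to SSE2 mulsd */
}

long double mul_long_double(long double a, long double b) {
    return a * b;            /* typically compiles to x87 fmulp */
}
```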
Yes:
Xeon Phi is a series of x86 manycore processors designed and made by Intel. It is intended for use in supercomputers, servers, and high-end workstations. Its architecture allows use of standard programming languages and application programming interfaces (APIs) such as...
See also https://en.wikipedia.org/wiki/Xeon_Phi
If you can compile Go on an x86 processor, then you will be able to compile it on this specific x86 processor, which is manufactured by Intel.
Xeon is not Itanium :)
On such systems you would also be able to compile Go itself; you would just need to provide a suitable C compiler...
What makes you think you would otherwise not be able to compile Go on, say... an Atari, or perhaps an Arduino?
If you can elaborate on that perhaps I can improve my terrible answer further.

Are caches of different level operating in the same frequency domain?

Larger caches usually have longer bitlines or wordlines, and thus most likely higher access latency and cycle time.
So, do L2 caches work in the same clock domain as L1 caches? How about the L3 cache (slices), since it is now non-inclusive and shared among all the cores?
And related questions are:
Are all functional units in a core in the same clock domain?
Are the uncore parts all in the same clock domain?
Are cores in the multi-core system synchronous?
I believe clock domain crossing would introduce extra latency. Do most parts of a CPU chip work in the same clock domain?
The private L1i/d caches are always part of each core, not on a separate clock, in modern CPUs¹. L1d is very tightly coupled with load execution units, and the L1dTLB. This is pretty universally true across architectures. (VIPT Cache: Connection between TLB & Cache?).
On CPUs with per-core private L2 cache, it's also part of the core, in the same frequency domain. This keeps L2 latency very low by keeping timing (in core clock cycles) fixed, and not requiring any async logic to transfer data across clock domains. This is true on Intel and AMD x86 CPUs, and I assume most other designs.
Footnote 1: Decades ago, when even having the L1 caches on-chip was a stretch for transistor budgets, sometimes just the comparators and maybe tags were on-chip, so that part could go fast while starting to set up the access to the data on external SRAM. (Or if not external, sometimes a separate die (piece of silicon) in the same plastic / ceramic package, so the wires could be very short and not exposed as external pins that might need ESD protection, etc).
Or for example, early Pentium II ran its off-die / on-package L2 cache at half core clock speed (down from full speed in the PPro). (But all in the same "frequency domain"; this was before DVFS dynamic voltage/frequency scaling for power management.) L1i/d was tightly integrated into the core like it still is today; you have to go farther back to find CPUs with off-die L1, like maybe early classic RISC CPUs.
The rest of this answer is mostly about Intel x86 CPUs, because from your mention of L3 slices I think that's what you're imagining.
How about L3 cache (slices) since they are now non-inclusive and shared among all the cores?
Of mainstream Intel CPUs (P6 / SnB-family), only Skylake-X has non-inclusive L3 cache. Intel since Nehalem has used inclusive last-level cache so its tags can be a snoop filter. See Which cache mapping technique is used in intel core i7 processor?. But SKX changed from a ring to a mesh, and made L3 non-inclusive / non-exclusive.
On Intel desktop/laptop CPUs (dual/quad), all cores (including their L1+L2 caches) are in the same frequency domain. The uncore (the L3 cache + ring bus) is in a separate frequency domain, but I think normally runs at the speed of the cores. It might clock higher than the cores if the GPU is busy but the cores are all idle.
The memory clock stays high even when the CPU clocks down. (Still, single-core bandwidth can suffer if the CPU decides to clock down from 4.0 to 2.7GHz because it's running memory-bound code on the only active core. Single-core bandwidth is limited by max_concurrency / latency, not by DRAM bandwidth itself if you have dual-channel DDR4 or DDR3. Why is Skylake so much better than Broadwell-E for single-threaded memory throughput? I think this is because of increased uncore latency.)
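As a rough worked example of that concurrency limit (numbers purely illustrative, not measured): with about 10 outstanding line-fill buffers per core, 64-byte lines, and ~80 ns effective memory latency, one core can sustain at most about 10 × 64 B / 80 ns ≈ 8 GB/s, well below what dual-channel DDR4 can supply to multiple cores, and it gets worse if extra uncore latency lengthens each round trip.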
The wikipedia Uncore article mentions overclocking it separately from the cores to reduce L3 / memory latency.
On Haswell and later Xeons (E5 v3), the uncore (ring bus and L3 slices) and each individual core have separate frequency domains. (Source: Frank Denneman's NUMA Deep Dive Part 2: System Architecture. It has a typo, saying Haswell (v4) when Haswell is actually Xeon E[357]-xxxx v3, but other sources such as the paper Comparisons of core and uncore frequency scaling modes in quantum chemistry application GAMESS confirm that Haswell does have those features.) Uncore Frequency Scaling (UFS) and Per Core Power States (PCPS) were both new in Haswell.
On Xeons before Haswell, the uncore runs at the speed of the current fastest core on that package. On a dual-socket NUMA setup, this can badly bottleneck the other socket, by making it slow keeping up with snoop requests. See John "Dr. Bandwidth" McCalpin's post on this Intel forum thread:
On the Xeon E5-26xx processors, the "uncore" (containing the L3 cache, ring interconnect, memory controllers, etc), runs at a speed that is no faster than the fastest core, so the "package C1E state" causes the uncore to also drop to 1.2 GHz. When in this state, the chip takes longer to respond to QPI snoop requests, which increases the effective local memory latency seen by the processors and DMA engines on the other chip!
... On my Xeon E5-2680 chips, the "package C1E" state increases local latency on the other chip by almost 20%
The "package C1E state" also reduces sustained bandwidth to memory located on the "idle" chip by up to about 25%, so any NUMA placement errors generate even larger performance losses.
Dr. Bandwidth ran a simple infinite-loop pinned to a core on the other socket to keep it clocked up, and was able to measure the difference.
Quad-socket-capable Xeons (E7-xxxx) have a small snoop filter cache in each socket. Dual-socket systems simply spam the other socket with every snoop request, using a good fraction of the QPI bandwidth even when they're accessing their own local DRAM after an L3 miss.
I think Broadwell and Haswell Xeon can keep their uncore clock high even when all cores are idle, exactly to avoid this bottleneck.
Dr. Bandwidth says he disables package C1E state on his Haswell Xeons, but that probably wasn't necessary. He also posted some stuff about using Uncore perf counters to measure uncore frequency to find out what your CPU is really doing, and about BIOS settings that can affect the uncore frequency decision-making.
More background: I found https://www.anandtech.com/show/8423/intel-xeon-e5-version-3-up-to-18-haswell-ep-cores-/4 about some changes like new snoop mode options (which hop on the ring bus sends snoops to the other core), but it doesn't mention clocks.
A larger cache may have a higher access time, but it can still sustain one access per cycle per port by fully pipelining it. However, it may also constrain the maximum supported frequency.
In modern Intel processors, the L1i/L1d and L2 caches and all functional units of a core are in the same frequency domain. On client processors, all cores of the same socket are also in the same frequency domain because they share the same frequency regulator. On server processors (starting with Haswell, I think), each core is in a separate frequency domain.
In modern Intel processors (since Nehalem, I think), the uncore (which includes the L3) is in a separate frequency domain. One interesting case is when a socket is used in a dual-NUMA-node configuration. In this case, I think the uncore partitions of the two NUMA nodes would still be in the same frequency domain.
There is special circuitry used to cross frequency domains, and all cross-domain communication has to pass through it. So yes, I think it incurs a small performance overhead.
There are other frequency domains. In particular, each DRAM channel operates in its own frequency domain. I don't know whether current processors support running different channels at different frequencies.
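Tangentially, to the question's premise that larger caches have higher access latency: a dependent pointer-chase lets you see those latency steps from user space. A minimal sketch (Linux/POSIX timing; sizes and iteration counts are illustrative, and results are in nanoseconds rather than cycles, so the frequency-domain effects discussed above are mixed in):

```c
/* Minimal sketch: dependent pointer-chase to watch load-to-use latency grow
 * as the working set spills from L1d into L2, L3, and DRAM. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static volatile size_t sink;   /* defeat dead-code elimination */

static double chase_ns(size_t n_elems, size_t iters) {
    size_t *next = malloc(n_elems * sizeof *next);
    if (!next) return 0.0;
    for (size_t i = 0; i < n_elems; i++) next[i] = i;
    /* Sattolo's algorithm: one random cycle, so every load depends on the
     * previous one and the hardware prefetchers can't run ahead. */
    for (size_t i = n_elems - 1; i > 0; i--) {
        size_t j = (size_t)rand() % i;          /* fine for a demo */
        size_t tmp = next[i]; next[i] = next[j]; next[j] = tmp;
    }
    struct timespec t0, t1;
    size_t idx = 0;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < iters; i++) idx = next[idx];
    clock_gettime(CLOCK_MONOTONIC, &t1);
    sink = idx;
    free(next);
    return ((t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec))
           / (double)iters;
}

int main(void) {
    size_t sizes_kib[] = {16, 128, 1024, 8192, 65536};  /* roughly L1 .. DRAM */
    for (size_t i = 0; i < sizeof sizes_kib / sizeof sizes_kib[0]; i++) {
        size_t elems = sizes_kib[i] * 1024 / sizeof(size_t);
        printf("%6zu KiB working set: ~%.1f ns per dependent load\n",
               sizes_kib[i], chase_ns(elems, 20u * 1000 * 1000));
    }
    return 0;
}
```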

Query Intel CPU details of execution unit, port, etc

Is it possible to query the number of execution units/ports per core, and similar information, on Intel CPUs?
I have an assembly program, and I noticed that the performance is quite different on different CPUs. For example, on a Core i5 4570, some functions consistently take 25% more cycles to complete than on a Core i7 4970HQ. They are both Haswell-based, from the same generation. No memory movement is involved in the part of the program benchmarked. So I am thinking maybe the difference comes from details such as the number of execution units, number of ports, etc. The benchmark measures single-core CPU cycles, so frequency/HT etc. do not come into play.
Am I right to assume such an explanation of the performance difference? If yes, where can I find that kind of information for specific CPUs? And is it possible to query it dynamically? If so, I could dispatch dynamically based on that information, distribute uops more evenly, and use similar techniques to optimize the program for multiple CPUs.
Did you time reference cycles (RDTSC) instead of core clock cycles (with perf counters)? That would explain your observations.
Turbo makes a big difference, and the ratio between max turbo and max sustained / rated clock speed (i.e. reference cycle tick rate) is different on different CPUs. e.g. see my answer on this related question
The lower the CPU's TDP, the bigger the ratio between sustained and peak. The Haswell wikipedia article has tables:
84W desktop i5 4570: sustained 3.2GHz = RDTSC frequency, max turbo 3.6GHz (the speed the core was probably actually running at for most of your benchmark, if it had time to ramp up from low-power idle speed).
47W laptop i7-4960HQ: 2.6GHz sustained = RDTSC frequency vs. 3.8GHz max turbo.
Time your code with performance counters, and look at the "core clock cycles" count. (And lots of other neat stuff).
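For example, timing a loop with __rdtsc() gives you ticks of that fixed reference clock, not core cycles; comparing its result against a core-cycle counter (e.g. Linux perf stat -e cycles, or the CPU_CLK_UNHALTED events mentioned in the other Q&A on this page) makes the turbo ratio visible. A minimal sketch (GCC/Clang; MSVC has __rdtsc in <intrin.h>):

```c
/* Minimal sketch: __rdtsc() counts reference cycles at a fixed frequency
 * (the rated/sustained clock), not actual core clock cycles.  On a CPU with
 * a high turbo-to-rated ratio, the same work shows up as fewer TSC ticks,
 * which can masquerade as an IPC difference between machines. */
#include <x86intrin.h>
#include <stdio.h>
#include <stdint.h>

int main(void) {
    volatile double x = 1.0;               /* toy serial workload */
    uint64_t t0 = __rdtsc();
    for (int i = 0; i < 100000000; i++) x *= 1.0000001;
    uint64_t t1 = __rdtsc();
    printf("TSC (reference) ticks: %llu\n", (unsigned long long)(t1 - t0));
    printf("compare against `perf stat -e cycles` for core clock cycles\n");
    return 0;
}
```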
Every Haswell core is identical, from 5-watt Core M CPUs to high-power quad cores to 18-core Xeons (which actually have a per-core power budget more like a laptop CPU); it's only the L3 caches, the number of cores (and the interconnect), and support or not for HT and/or Turbo that differ. Basically everything outside the cores themselves can be different, including the GPU. They don't disable execution ports, and even the L1/L2 caches are identical. I think disabling execution ports would require significant redesigns in the out-of-order scheduler and stuff like that.
More importantly, every port has at least one execution unit that isn't found on any other port: p0 has the divider, p1 has the integer multiply unit, p5 has the shuffle unit, and p6 is the only port that can execute predicted-taken branches. Actually, p2 and p3 are identical load ports (and can handle store-address uops)...
See Agner Fog's microarch pdf for more about Haswell internals, and also David Kanter's writeup with diagrams of the different blocks.
(However, it's not strictly true that the entire core is identical: Haswell Pentium/Celeron CPUs don't support AVX/AVX2, or BMI/BMI2. I think they do that by disabling decode of VEX prefixes in the decoders. This is still the case for Skylake Pentiums/Celerons, so thanks Intel for delaying the time when we can assume support for new instruction sets. Presumably they do this so CPUs with defects in only the upper or lower half of their vector execution units can still be sold as Celeron or Pentium, just like CPUs with a defect in some of their L3 can be sold as i5 instead of i7.)

CPU frequency impact on built-in graphics card

I'm working on building a Dynamic Voltage Frequency Scaling (DVFS) algorithm for a video decoding application running on an Intel Core i7-6500U CPU (Skylake). The application supports both software and hardware decoder modules, and the software decoder is working as expected. It controls the operational frequency of the CPU, which in turn controls the operational voltage, thereby reducing the overall energy consumption.
My question is regarding the hardware decoder available in the Intel Skylake processor (Intel HD Graphics 520), which performs the hardware decoding. The experimental results for the two decoders suggest that the energy-consumption reduction is much smaller for the hardware decoder than for the software decoder when using the DVFS algorithm.
Does the CPU frequency level, adjusted in software before passing the video frame to the hardware decoder, actually have an impact on the energy consumption of the hardware decoder?
Does the Intel HD Graphics 520 GPU, on the same chip as the CPU, have any impact on the CPU's operational frequency and voltage level?
Why did you need to implement your own DVFS in the first place? Didn't Skylake's self-regulating mode work well? (where you let the CPU's hardware power management controller make all the frequency decisions, instead of just choosing whether to turbo or not).
Setting the CPU core clock speeds should have little to no effect on the GPU's DVFS. It's in a separate domain, and not linked to any of the cores (which can each choose their clocks individually). As you can see on Wikipedia, that SKL model can scale its GPU clocks from 300MHz to 1050MHz, and is probably doing so automatically if you're using an OS running Intel's normal graphics drivers.
For more about how Skylake power management works under the hood, see Efraim Rotem's (Lead Client Power Architect) IDF2015 talk (audio+slides, very good stuff). The title is Skylake Deep Dive: A New Architecture to Manage Power Performance and Energy Efficiency.
There's a link to the list of IDF2015 sessions in the x86 tag wiki.
