I am evaluating the feasibility of moving a simulation to an FX 8350.
Is there a document on the availability (and, if it's not available, a workaround) of the RDRAND instruction from Intel's Bull Mountain DRNG on the AMD FX 8350?
Thanks in advance
No, RDRAND is not present in the AMD FX 8350, nor in any other Piledriver-based chip. AMD first added RDRAND in later cores (the Excavator-based APUs), and it is present in everything from Zen onward.
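As a workaround, a program can probe CPUID at run time (leaf 1, ECX bit 30) and fall back to another entropy source when RDRAND is absent. A minimal sketch in C, assuming GCC/Clang on x86-64 and a Linux-style /dev/urandom fallback (compile with gcc -mrdrnd so the intrinsic is available):

    /* RDRAND availability check with a fallback.
     * Assumes GCC/Clang on x86-64; compile with: gcc -mrdrnd rdrand_check.c */
    #include <cpuid.h>
    #include <stdint.h>
    #include <stdio.h>

    static int have_rdrand(void)
    {
        unsigned eax, ebx, ecx, edx;
        if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
            return 0;
        return (ecx >> 30) & 1;            /* CPUID.1:ECX.RDRAND[bit 30] */
    }

    static uint64_t random_u64(void)
    {
        if (have_rdrand()) {
            unsigned long long v;
            while (!__builtin_ia32_rdrand64_step(&v))
                ;                          /* retry on the rare underflow */
            return v;
        }
        /* Fallback for Piledriver-era chips such as the FX 8350. */
        uint64_t v = 0;
        FILE *f = fopen("/dev/urandom", "rb");
        if (f) {
            if (fread(&v, sizeof v, 1, f) != 1)
                v = 0;
            fclose(f);
        }
        return v;
    }

    int main(void)
    {
        printf("RDRAND supported: %d\n", have_rdrand());
        printf("sample: %llu\n", (unsigned long long)random_u64());
        return 0;
    }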
I understand that Intel's AVX2 extension has been on the market since 2011 and is therefore pretty much standard in modern devices.
However, for some decision-making we need to find out, roughly, the share of existing mobile Windows devices that don't support AVX2 (nor its successor, AVX-512).
It is rather well documented which CPUs, from Intel and AMD, actually support the extension, so that is not what I am asking for.
How do I find out which mobile Windows devices on the market, including from recent years, have processors that don't yet support the AVX2 instruction set?
You're incorrect about the dates, and about it being "pretty much standard", unfortunately. It could have been by now if Intel hadn't disabled it for market-segmentation reasons in their low-end CPUs. (To be slightly fair, that may have let them sell chips with defects in one half of a 256-bit execution unit, improving yields.)
All AMD CPUs aimed at mobile/laptop use (not Geode) have had AVX since Bulldozer, including the low-power cores since Jaguar. Their low-power CPUs decode 256-bit instructions into two 128-bit uops, the same way the Bulldozer family and Zen 1 did. (That meant AVX wasn't always worth using on the Bulldozer family, but it wasn't much slower than carefully tuned SSE, was sometimes still faster, and it gave software that useful baseline. And 128-bit AVX instructions are great everywhere, often saving instructions by being 3-operand.) Intel used the same decode-into-two-halves strategy in Gracemont, the E-cores for Alder Lake, like they did for SSE in P6-family CPUs before Core 2, such as the Pentium III and Pentium M.
AVX was new in Sandy Bridge (2011) and Bulldozer (2011), AVX2 was new in Haswell (2013) and Excavator (2015).
Pentium/Celeron versions of Skylake / Coffee Lake etc. (lower end than i3) have AVX disabled, along with AVX2/FMA/BMI1/2. BMI1 and 2 include some instructions that use VEX encodings on general-purpose integer registers, which seems to indicate that Intel disables decoding of VEX prefixes entirely as part of binning a silicon chip for use in low-end SKUs.
The first Pentium/Celeron CPUs with AVX1/2/FMA are Ice Lake / Tiger Lake based. There are currently Alder Lake based Pentiums with AVX2, like 2c4t (2 P cores) Pentium Gold G7400 and Pentium Gold 8505 (mobile 1 P core, 4 E cores). So 7xxx and 8xxx and higher Pentiums should have AVX1 / AVX2 / FMA, but earlier ones mostly not. One of the latest without AVX is desktop Pentium Gold G6405, 2c4t Comet Lake launched in Q1 2021. (The mobile version, 6405U, launched in Q4'19). There's also an "Amber Lake Y" Pentium Gold 6500Y with AVX2, launched Q1'21.
Low-power CPUs in the Silvermont family (up to 2019's Tremont) don't support AVX at all.
These are common in "netbook" and low budget laptops, as well as low-power servers / NAS. (The successor, Gracemont, has AVX1/2/FMA, so it can work as the E-cores in Alder Lake.)
These CPUs are badged as Pentium J-series and N-series. For example, the Intel Pentium Processor N6415, launched in 2021 with 4 cores, is aimed at "PC/Client/Tablet" use cases. These are Elkhart Lake (Tremont cores), with only SSE4.2.
The "Atom" brand name is still used on server versions of these, including chips with 20 cores.
Some CPUs have bugs at the architecture level (such as these), and it's possible that some programs developed for those CPUs also have bugs that are masked by the CPU's own bugs. If so, such a program wouldn't work on a 'perfect' emulator. Do PC emulators include these bugs? For example, Bochs is known to be pretty accurate; does it handle them 'properly', as a real CPU would?
P.S. This has already received two downvotes. What's wrong with it?
Such emulators exist: the CPU design process requires extremely accurate emulators with a precise microarchitecture model. CPU designers need them for debugging and for estimating the theoretical performance of a future chip, and their managers can calm investors down before the chip is ready by showing some of the expected functionality. Such emulators are strictly confidential.
Also, the RTL freeze at design closure gives birth to a lot of errata, or chains of errata. To simplify bring-up of the future chip, firmware developers may maintain special tools that emulate the functional behavior of the expected CPU with all known errata implemented. But those are proprietary as well.
But really, you have to be clear about what the words "emulator" and "accurate" mean in this case.
Bochs, like QEMU, is a functional ISA model: its purpose is to provide a workable architectural profile for running binaries of the target ISA, and emulation speed is a primary goal. No microarchitecture is modeled: no pipeline, no cache model, no performance monitors, and so on.
To get a sense of how accurate Bochs is, look at its implementation of the CPUID performance-monitoring and cache-topology leaves:
cpu/cpuid.cc
When you specify some CPU in Bochs, Skylake for example, Bochs knows nothing about it except the CPUID values that belong to that CPU, in other words its feature sets: AVX2, FMA, XSAVE, etc.
Also, Bochs does not implement precise model/family CPUID values: look at the implementation of the CPUID version-information leaf (grep for the get_cpu_version_information function): it is a hardcoded value.
So there are no CPU errata in Bochs.
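To see what a real CPU (or an emulator) actually reports for that leaf, you can decode CPUID leaf 1 yourself. A small sketch in C, assuming GCC/Clang on x86; errata documents identify affected parts by exactly these family/model/stepping values:

    /* Decode CPUID leaf 1, the "version information" leaf that Bochs hardcodes.
     * Assumes GCC/Clang on x86. */
    #include <cpuid.h>
    #include <stdio.h>

    int main(void)
    {
        unsigned eax, ebx, ecx, edx;
        if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
            return 1;

        unsigned stepping   = eax & 0xF;
        unsigned model      = (eax >> 4) & 0xF;
        unsigned family     = (eax >> 8) & 0xF;
        unsigned ext_model  = (eax >> 16) & 0xF;
        unsigned ext_family = (eax >> 20) & 0xFF;

        /* Extended fields only apply to family 0x6 / 0xF parts. */
        if (family == 0x6 || family == 0xF)
            model += ext_model << 4;
        if (family == 0xF)
            family += ext_family;

        printf("family 0x%X, model 0x%X, stepping %u\n", family, model, stepping);
        return 0;
    }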
I have a MacBook Pro (mid 2014) with Intel Iris graphics, an Intel Core i5 processor and 16 GB of RAM. I am planning to learn some ray-traced 3D, but I am not sure if my laptop can render fast without any Nvidia hardware.
So I would appreciate it if someone could tell me whether I can use CUDA, and if not, explain in a very easy way how to enable OpenCL in After Effects. I am also looking for any beginner tutorial on how to create or build with OpenCL.
CUDA works only on Nvidia hardware, but there may be some libraries that convert it to run on CPU cores (not the iGPU).
AMD is working on "hipify"-ing old CUDA kernels, translating them to HIP or similar code so they become more portable.
OpenCL works everywhere, as long as both the hardware and the OS support it. AMD, Nvidia, Intel, Xilinx, Altera, Qualcomm, MediaTek, Marvell, Texas Instruments and others support it. Maybe even Raspberry Pi boards will support it in the future.
The OpenCL documentation on stackoverflow.com is still under development, but there are some sites:
AMD's tutorial
AMD's parallel programming guide for OpenCL
Nvidia's learning material
Intel's HD Graphics coding tutorial
An overview of hardware, benchmarks and parallel programming subjects
Blog
Scratchapixel's ray-tracing tutorial (I read it, then wrote a teraflops GPU version of it)
If it is Iris Graphics 6100:
Your integrated GPU has 48 execution units, each with 8 ALUs that can do add, multiply and many other operations, and its clock can rise to 1 GHz. That gives a maximum of 48 * 8 * 2 (1 add + 1 multiply) * 1 GHz = 768 giga floating-point operations per second, but only if each ALU can do an addition and a multiplication concurrently. 768 GFLOPS is more than a low-end discrete GPU such as AMD's R7 240. (As of 19.10.2017, AMD's low end is the RX 550 at 1200 GFLOPS, faster than Intel's Iris Plus 650, which is nearly 900 GFLOPS.) Ray tracing re-accesses a lot of geometry data, so a device ideally has its own memory (as Nvidia and AMD discrete cards do), leaving the CPU free to do its own work.
How you install OpenCL on a computer varies by OS and hardware, but building software on a computer with OpenCL installed follows the same general steps (a minimal host-code sketch in C follows the list):
Query platforms. The result can be AMD, Intel or Nvidia, duplicates of these (because of overlapping installations of the wrong drivers), or experimental platforms predating support for newer OpenCL versions.
Query the devices of a platform (or of all platforms). This gives the individual devices (and their duplicates, if there are driver errors or other things to fix).
Create a context (or several) using a platform.
Using a context (so everything within it gets implicit synchronization):
Build programs from kernel strings. A CPU usually takes less time than a GPU to build a program. (There is a binary-load option to shortcut this.)
Build kernels (as objects now) from the programs.
Create buffers from host-side buffers, or as OpenCL-managed buffers.
Create a command queue (or several).
Just before computing (or an array of computations):
Select buffers for a kernel as its arguments.
Enqueue buffer write (or map/unmap) operations on the "input" buffers.
Compute:
Enqueue an ND-range kernel (specifying which kernel runs and with how many threads).
Enqueue buffer read (or map/unmap) operations on the "output" buffers.
Don't forget to synchronize with the host using clFinish() if you haven't used a blocking clEnqueueReadBuffer.
Use your accelerated data.
After OpenCL is no longer needed:
Make sure all command queues are empty and have finished their kernel work.
Release everything in the opposite order of creation.
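Here is the minimal host-code sketch promised above, following those steps end to end. It is an illustrative skeleton under some assumptions, not production code: the kernel name scale and the data are made up, most error checking is omitted, and clCreateCommandQueue is the older (pre-OpenCL-2.0) call, which is what macOS's OpenCL 1.2 stack provides anyway.

    /* Minimal OpenCL host-code skeleton following the steps above.
     * Assumptions: an OpenCL runtime is installed; the "scale" kernel and the
     * data are made-up examples; most error checking is omitted for brevity. */
    #ifdef __APPLE__
    #include <OpenCL/opencl.h>
    #else
    #include <CL/cl.h>
    #endif
    #include <stdio.h>

    static const char *kSrc =
        "__kernel void scale(__global float *buf, float f) {\n"
        "    size_t i = get_global_id(0);\n"
        "    buf[i] *= f;\n"
        "}\n";

    int main(void)
    {
        cl_int err;
        cl_platform_id platform;
        cl_device_id device;

        /* 1. Query a platform and one of its devices. */
        clGetPlatformIDs(1, &platform, NULL);
        clGetDeviceIDs(platform, CL_DEVICE_TYPE_DEFAULT, 1, &device, NULL);

        /* 2. Create a context and a command queue on that device. */
        cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, &err);
        cl_command_queue q = clCreateCommandQueue(ctx, device, 0, &err);

        /* 3. Build the program from the kernel string, then the kernel object. */
        cl_program prog = clCreateProgramWithSource(ctx, 1, &kSrc, NULL, &err);
        clBuildProgram(prog, 1, &device, NULL, NULL, NULL);
        cl_kernel k = clCreateKernel(prog, "scale", &err);

        /* 4. Create a buffer and write the input data into it. */
        float data[1024];
        for (int i = 0; i < 1024; ++i) data[i] = (float)i;
        cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE, sizeof(data), NULL, &err);
        clEnqueueWriteBuffer(q, buf, CL_TRUE, 0, sizeof(data), data, 0, NULL, NULL);

        /* 5. Set kernel arguments and enqueue the ND-range. */
        float factor = 2.0f;
        clSetKernelArg(k, 0, sizeof(cl_mem), &buf);
        clSetKernelArg(k, 1, sizeof(float), &factor);
        size_t global = 1024;
        clEnqueueNDRangeKernel(q, k, 1, NULL, &global, NULL, 0, NULL, NULL);

        /* 6. Read the result back; the blocking read is also a sync point. */
        clEnqueueReadBuffer(q, buf, CL_TRUE, 0, sizeof(data), data, 0, NULL, NULL);
        clFinish(q);
        printf("data[10] = %f\n", data[10]);

        /* 7. Release everything in roughly the opposite order of creation. */
        clReleaseMemObject(buf);
        clReleaseKernel(k);
        clReleaseProgram(prog);
        clReleaseCommandQueue(q);
        clReleaseContext(ctx);
        return 0;
    }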
If you need to accelerate open-source software, you can replace a hot, parallelizable loop with a simple OpenCL kernel, if it doesn't already have other acceleration support. For example, you could accelerate the air-pressure and heat-advection parts of The Powder Toy sandbox simulator.
Yes, you can, because OpenCL is supported natively by macOS.
From your question it appears you are not seeking advice on programming, which would have been the appropriate subject for Stack Overflow. The first search hit on Google explains how to turn on OpenCL accelerated effects in After Effects (Project Settings dialog -> Video Rendering and Effects), but I have no experience with that myself.
I'm working on building a Dynamic Voltage and Frequency Scaling (DVFS) algorithm for a video decoding application running on an Intel Core i7-6500U CPU (Skylake). The application is to support both software and hardware decoder modules, and the software decoder is working as expected: it controls the operating frequency of the CPU, which in turn controls the operating voltage, thereby reducing the overall energy consumption.
My question is about the hardware decoder available in the Intel Skylake processor (Intel HD Graphics 520), which performs the hardware decoding. The experimental results for the two decoders suggest that the energy-consumption reduction is much smaller for the hardware decoder than for the software decoder when using the DVFS algorithm.
Does the CPU frequency level, adjusted in software before a video frame is handed to the hardware decoder, actually have an impact on the energy consumption of the hardware decoder?
Does the Intel HD Graphics 520 GPU, being on the same chip as the CPU, have any impact on the CPU's operating frequency and voltage level?
Why did you need to implement your own DVFS in the first place? Didn't Skylake's self-regulating mode work well? (where you let the CPU's hardware power management controller make all the frequency decisions, instead of just choosing whether to turbo or not).
Setting the CPU core clock speeds should have little to no effect on the GPU's DVFS. It's in a separate domain, and not linked to any of the cores (which can each choose their clocks individually). As you can see on Wikipedia, that SKL model can scale its GPU clocks from 300MHz to 1050MHz, and is probably doing so automatically if you're using an OS running Intel's normal graphics drivers.
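If you want to see this separation for yourself, you can watch the two clock domains independently. A small sketch in C, under the assumption of Linux with the i915 graphics driver and cpufreq enabled (the sysfs paths shown are the usual ones, but may differ on your system):

    /* Print the current iGPU and CPU core clocks.
     * Assumes Linux with the i915 driver and cpufreq; paths may differ. */
    #include <stdio.h>

    static void print_file(const char *label, const char *path)
    {
        FILE *f = fopen(path, "r");
        if (!f) { printf("%s: <not available>\n", label); return; }
        long v = 0;
        if (fscanf(f, "%ld", &v) == 1)
            printf("%s: %ld\n", label, v);
        fclose(f);
    }

    int main(void)
    {
        print_file("GPU freq (MHz)", "/sys/class/drm/card0/gt_cur_freq_mhz");
        print_file("CPU0 freq (kHz)", "/sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq");
        return 0;
    }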
For more about how Skylake power management works under the hood, see Efraim Rotem's (Lead Client Power Architect) IDF2015 talk (audio+slides, very good stuff). The title is Skylake Deep Dive: A New Architecture to Manage Power Performance and Energy Efficiency.
There's a link to the list of IDF2015 sessions in the x86 tag wiki.
Parallelized code (OpenMP), compiled with GCC on an Intel machine (Linux), runs much faster on an Intel computer than on an AMD one with twice as many cores. I see that all the cores are in use, but it takes about 10 times more CPU time on the AMD. I had heard about the "cripple AMD" function in the Intel compiler, but I am using GCC! Thanks in advance
Intel has Hyper-Threading Technology in their modern processor cores, which essentially means that you have multiple hardware contexts running on a single core simultaneously. Are you taking this into account when you make the comparison?
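As a quick sanity check (a sketch, not from the original thread), the snippet below prints how many logical processors the OpenMP runtime sees and how many threads a parallel region actually gets, which makes it easier to compare SMT/Hyper-Threading and thread-count effects between the two machines. Build with gcc -fopenmp, and optionally control placement with OMP_NUM_THREADS and OMP_PROC_BIND.

    /* Report what the OpenMP runtime will actually use on this machine. */
    #include <omp.h>
    #include <stdio.h>

    int main(void)
    {
        printf("omp_get_num_procs()   = %d (logical CPUs seen)\n", omp_get_num_procs());
        printf("omp_get_max_threads() = %d (threads a parallel region will use)\n",
               omp_get_max_threads());
        #pragma omp parallel
        {
            #pragma omp single
            printf("actual team size      = %d\n", omp_get_num_threads());
        }
        return 0;
    }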