AVX512 vbroadcastss throwing illegal instruction on i9-10920X

I have a customer who is reporting an "illegal instruction" exception crash for an AVX512 instruction on an i9-10920X, which (according to Intel's page) should fully support AVX512. Any ideas what could be causing this? Could AVX512 be disabled in the BIOS while the app still detects it as present, or something like that?
Here's the call stack, in case it's of interest:
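One thing worth verifying on the customer's machine is what the CPU and OS actually report at runtime, since AVX-512 instructions fault with an illegal-instruction exception unless both the CPUID feature bit and the OS-enabled XCR0 state bits are present. Below is a minimal sketch of such a check, assuming MSVC and its <intrin.h> intrinsics (not the app's actual detection code):

#include <intrin.h>
#include <immintrin.h>
#include <cstdio>

// Sketch: AVX-512F is only safe to execute when the CPU reports the feature
// AND the OS has enabled the corresponding XSAVE state components in XCR0.
static bool avx512f_usable()
{
    int regs[4];

    // CPUID.1:ECX.OSXSAVE[bit 27] -- XGETBV is available and the OS uses XSAVE.
    __cpuid(regs, 1);
    if (!(regs[2] & (1 << 27)))
        return false;

    // Make sure leaf 7 exists before querying it.
    __cpuid(regs, 0);
    if (regs[0] < 7)
        return false;

    // CPUID.(EAX=7,ECX=0):EBX.AVX512F[bit 16] -- the CPU implements AVX-512F.
    __cpuidex(regs, 7, 0);
    if (!(regs[1] & (1 << 16)))
        return false;

    // XCR0 bits 1,2 (SSE/AVX state) and 5,6,7 (opmask, ZMM_Hi256, Hi16_ZMM)
    // must all be set; otherwise AVX-512 instructions raise #UD, which
    // Windows reports as an illegal-instruction exception.
    unsigned long long xcr0 = _xgetbv(0); // XCR0
    return (xcr0 & 0xE6) == 0xE6;
}

int main()
{
    std::printf("AVX-512F usable here: %s\n", avx512f_usable() ? "yes" : "no");
}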

Related

What happens when the Visual C++ compiler encounters AVX instructions while the "Enhanced Instruction Set" flag is disabled?

In Visual Studio 2013, there is a flag located in the project Configuration Properties > C/C++ > Code Generation page called Enable Enhanced Instruction Set. It can be set to SSE, SSE2, AVX, AVX2 or IA32.
IA32 is described as "No Enhanced Instructions".
If I compile a project that uses AVX instructions while the flag is set to IA32, it still builds, and running the program also works fine.
What does the compiler do in this scenario? Does it keep the AVX instructions as they are because they are "hardcoded" (intrinsics) rather than compiler-generated? Or does it replace these AVX SIMD instructions with SISD instructions?
It's worth considering that the Windows x64 ABI does not use the old x87 FPU instructions, so if you have 64-bit output enabled, the IA-32 setting is ignored and SSE2 is used as the default. If you use AVX2 intrinsics in an app compiled for AVX (or an earlier instruction set), the AVX2 instructions will still be emitted into the binary, which will cause the app to crash hard on a CPU that doesn't support AVX2. See for example this:
https://godbolt.org/z/6VZhp-
Even though we are asking for x87, f() gets compiled using SSE2 (hence movss vs. fld), and g() gets compiled using AVX2 instructions (e.g. vmovss).
So in effect, Visual Studio allows you to insert AVX2 intrinsics into an SSE binary, but you will have to do your own cpuid checks to make sure the CPU supports those features before running that code.
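As a sketch of what such a cpuid guard can look like in practice (the function names here are illustrative, and a production check would also verify OSXSAVE/XCR0 as in the check above), the AVX2 path is only ever called after the feature bit has been confirmed:

#include <intrin.h>
#include <immintrin.h>
#include <cstddef>

// Hypothetical kernels. The AVX2 version may be compiled into the binary
// regardless of /arch, but it must only be *called* after the runtime check.
static void add_avx2(float* dst, const float* a, const float* b, std::size_t n)
{
    std::size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);
        __m256 vb = _mm256_loadu_ps(b + i);
        _mm256_storeu_ps(dst + i, _mm256_add_ps(va, vb));
    }
    for (; i < n; ++i) dst[i] = a[i] + b[i];
}

static void add_scalar(float* dst, const float* a, const float* b, std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i) dst[i] = a[i] + b[i];
}

static bool cpu_has_avx2()
{
    int regs[4];
    __cpuid(regs, 0);
    if (regs[0] < 7) return false;        // leaf 7 not implemented
    // A complete check also verifies OSXSAVE and the YMM bits of XCR0.
    __cpuidex(regs, 7, 0);
    return (regs[1] & (1 << 5)) != 0;     // CPUID.7.0:EBX.AVX2[bit 5]
}

void add(float* dst, const float* a, const float* b, std::size_t n)
{
    static const bool use_avx2 = cpu_has_avx2(); // decide once
    (use_avx2 ? add_avx2 : add_scalar)(dst, a, b, n);
}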

Restoring SIMD registers on Windows x64 during unwinding

On Windows x86-64, how are callee-saved SIMD registers restored during unwinding? If one was writing assembly code, how would one ensure that this happened properly?

How do you enable Clang Address Sanitizer in Xcode?

As announced at WWDC 2015, Clang Address Sanitizer is being brought to Xcode and OS X.
Session 413: Advanced Debugging and the Address Sanitizer
How do you enable Clang Address Sanitizer for your Xcode project?
Address Sanitizer has been added as a new feature in Xcode 7.
Use the Runtime Sanitization > Enable Address Sanitizer flag in your scheme to enable the option.
git will then show this change in your .xcscheme file:
enableAddressSanitizer = "YES"
From the New Features in Xcode 7 document:
Address Sanitizer. Xcode 7 can build your app with instrumentation designed to catch and debug memory corruption using the address sanitizer.
Objective-C and C code is susceptible to memory corruption issues such as stack and heap buffer overruns and use-after-free issues. When these memory violations occur, your app can crash unpredictably or display odd behavior. Memory corruption issues are difficult to track down because the crashes and odd behavior are often hard to reproduce and the cause can be far from the origin of the problem.
You enable the address sanitizer in the build scheme. Once enabled, added instrumentation is built into the app to catch memory violations immediately, enabling you to inspect the problem right at the place where it occurs. Other diagnostic information is provided as well, such as the relationship between the faulty address and a valid object on the heap and allocation/deallocation information, which helps you pinpoint and fix the problem quickly.
Address sanitizer is efficient—fast enough to be used regularly, as well as with interactive applications. It is supported on OS X, in the Simulator, and on iOS devices.
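As an illustration of the kind of memory violation described above, here is a deliberately broken snippet. Built with the sanitizer enabled in the scheme (or, outside Xcode, with something like clang++ -fsanitize=address -g), it aborts with a stack-buffer-overflow report instead of silently misbehaving:

#include <cstdio>

int main()
{
    int buf[4] = {0, 1, 2, 3};
    int sum = 0;
    // Deliberate off-by-one: i == 4 reads past the end of buf. Without the
    // sanitizer this often appears to "work"; with it, the process stops
    // with a stack-buffer-overflow report pointing at this line.
    for (int i = 0; i <= 4; ++i)
        sum += buf[i];
    std::printf("sum = %d\n", sum);
    return 0;
}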

intel kernel builder, failed to create context for intel opencl cpu device

I'm trying to use the Intel Kernel Builder. I installed the Intel OpenCL SDK, launched the program, and typed some simple code like this:
__kernel void hello(__global int* a, __global int* b, __global int* c) {
int gid = get_global_id(0);
c[gid] = a[gid] + b[gid];
}
I pressed the compile/build button, but it showed me an error. The error message is:
Using default instruction set architecture.
Failed to create context for Intel OpenCL CPU device...
Compilation failed!
I found someone suffering from a similar problem:
http://software.intel.com/en-us/forums/topic/392622
But there's no proper answer to that question.
I don't think the code itself is the problem, so I erased the code, left only a single character, and compiled again. The error message was the same as before. I have no idea what the matter is.
My environment is Windows 7 Professional 64-bit, an Intel Core 2 Quad Q8200 (which supports SSE4.1), OpenCL-1.2-3.0.67279 intel_sdk_2013_x64, and Kernel Builder version 3.0.0.1.
I have never successfully run OpenCL on this machine. When I tried, a compilation error occurred with a similar message: it failed to create a context. The code itself should be fine, since I've run it on another machine and OS (Linux).
If there's anyone who uses this program, please help me.
Thanks in advance.
According to this site, Intel's OpenCL SDK does not support Core 2 Duo/Quad CPUs (because they do not support SSE 4.2):
http://software.intel.com/en-us/articles/intel-sdk-for-opencl-applications-2013-release-notes#_System_Requirements_1
So the compiler cannot create the context for this device. You may be able to work around this by explicitly choosing the target architecture in the options dialog; I'm not sure this will work, and the code will certainly not run on that CPU.
If you want to use OpenCL on your machine, try the AMD APP SDK, which supports any x86 CPU with at least SSE2:
http://developer.amd.com/tools/heterogeneous-computing/amd-accelerated-parallel-processing-app-sdk/
That, however, will not work with Intel's KernelBuilder.
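If you want to confirm from the host side whether any CPU device is exposed at all before blaming the kernel, a small enumeration program makes the situation obvious. This is only a sketch using the standard OpenCL C API, and it assumes an OpenCL SDK's headers and ICD loader are installed:

#include <CL/cl.h>
#include <cstdio>

int main()
{
    cl_uint num_platforms = 0;
    clGetPlatformIDs(0, nullptr, &num_platforms);
    if (num_platforms == 0) {
        std::puts("No OpenCL platforms installed.");
        return 1;
    }

    cl_platform_id platforms[16];
    if (num_platforms > 16) num_platforms = 16;
    clGetPlatformIDs(num_platforms, platforms, nullptr);

    for (cl_uint p = 0; p < num_platforms; ++p) {
        char name[256] = {};
        clGetPlatformInfo(platforms[p], CL_PLATFORM_NAME, sizeof(name), name, nullptr);

        // CL_DEVICE_NOT_FOUND here means the runtime sees no usable CPU device,
        // which is exactly the situation that makes context creation fail.
        cl_uint num_cpu = 0;
        cl_int err = clGetDeviceIDs(platforms[p], CL_DEVICE_TYPE_CPU,
                                    0, nullptr, &num_cpu);
        std::printf("Platform \"%s\": %u CPU device(s)\n",
                    name, err == CL_SUCCESS ? num_cpu : 0u);
    }
    return 0;
}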

What is the difference between hardware and software breakpoints?

What is the difference between hardware and software breakpoints?
Are hardware breakpoints faster than software breakpoints? If so, how, and why would we need software breakpoints at all?
This article provides a good discussion of pros and cons:
http://www.nynaeve.net/?p=80
To answer your question directly: software breakpoints are more flexible, because hardware breakpoints are limited in functionality and highly architecture-dependent. One example given in the article is that x86 hardware has a limit of 4 hardware breakpoints.
Hardware breakpoints are faster because they have dedicated registers and less overhead than software breakpoints.
Hardware breakpoints are actually comparators, comparing the current PC with the address in the comparator (when enabled). They are the best option when setting breakpoints and are typically set via the debug probe (using JTAG, SWD, ...). The downside of hardware breakpoints: they are limited. CPUs have only a limited number of hardware breakpoint comparators, and the number depends on the CPU: ARM7/ARM9 cores have 2, modern ARM devices (Cortex-M0/M3/M4) between 2 and 6, and x86 usually 4.
Software breakpoints are in fact set by replacing the instruction to be breakpointed with a breakpoint instruction. The breakpoint instruction is present in most CPUs, and usually as short as the shortest instruction, so only one byte on x86 (0xcc, INT 3). On Cortex-M CPUs, instructions are 2 or 4 bytes, so the breakpoint instruction is a 2 byte instruction.
Software breakpoints can easily be set if the program is located in RAM (such as on a PC). A lot of embedded systems have the program located in flash memory. There it is not so easy to exchange the instruction, as the flash needs to be reprogrammed, so hardware breakpoints are used primarily. Most debug probes support only hardware breakpoints if the program is located in flash memory. However, some (such as SEGGER's J-Link) allow reprogramming the flash memory with the breakpoint instruction, and so allow an unlimited number of (software) breakpoints even when debugging a program located in flash.
More info about software breakpoints in flash memory
You can go through the GDB Internals documentation; it explains HW and SW breakpoints very well.
HW breakpoints require support from the MCU. ARM controllers have special registers into which you can write an address; whenever the PC (program counter) matches that register, the CPU halts. JTAG is usually required to write into those special registers.
SW breakpoints are implemented in GDB by inserting a trap, an illegal divide, or some other instruction that will cause an exception, and then when it’s encountered, gdb will take the exception and stop the program. When the user says to continue, gdb will restore the original instruction, single-step, re-insert the trap, and continue on.
There are a lot of advantages in using HW debuggers over SW debuggers especially if you are dealing with interrupts and memory bus devices. AFAIK interrupts cannot be debugged with software debuggers.
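For a concrete picture of how the software-breakpoint mechanism in the GDB description above can be implemented on Linux/x86-64, here is a minimal ptrace-based sketch. It is only illustrative: error handling, the RIP adjustment after the trap, and the single-step/re-insert dance on "continue" are omitted, and the struct and field names are made up.

#include <sys/ptrace.h>
#include <sys/types.h>
#include <cstdint>

// Sketch of GDB-style software breakpoints via ptrace on Linux/x86-64.
struct SoftwareBreakpoint {
    pid_t          pid;     // traced (stopped) inferior
    std::uintptr_t addr;    // address of the instruction to patch
    long           saved{}; // original word at addr

    void set()
    {
        // Read the original word, then overwrite its low byte with 0xCC (INT3).
        saved = ptrace(PTRACE_PEEKTEXT, pid, (void*)addr, nullptr);
        long patched = (saved & ~0xFFL) | 0xCC;
        ptrace(PTRACE_POKETEXT, pid, (void*)addr, (void*)patched);
    }

    void clear()
    {
        // Put the original instruction back before stepping over it.
        ptrace(PTRACE_POKETEXT, pid, (void*)addr, (void*)saved);
    }
};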
In addition to the answers above, it is also important to note that while software breakpoints overwrite specific instructions in the program to know where to stop, the more limited number of hardware breakpoints are actually part of the processor.
Justin Seitz in his book Gray Hat Python points out that the important difference here is that by overwriting instructions, software breakpoints actually change the CRC of the file, and so any sort of program such as a piece of malware which calculates its CRC can change its behavior in response to breakpoints being set, whereas with hardware breakpoints it is less obvious that the debugger is stopping and stepping through certain chunks of code.
In brief, hardware breakpoints make use of dedicated registers and hence are limited in number. They can be set on both volatile and non-volatile memory.
Software breakpoints are set by replacing the opcode of an instruction in RAM with a breakpoint instruction. They can be set only in RAM (flash memory is not practical to rewrite on the fly) and are not limited in number.
This article provides a good explanation of breakpoints.
Software breakpoints place an instruction in RAM that is executed like a trap when your program reaches that address, while hardware breakpoints use a CPU register to implement the breakpoint itself. That is why hardware breakpoints are much faster, and also why we still need software breakpoints: hardware breakpoints are limited by the number of registers the processor dedicates to breakpoints.
I learned this at work today :)
Watchpoints are where it makes a huge difference
This is a case where hardware handling is much faster:
watch var
rwatch var
awatch var
When you enter those commands in GDB 7.7 on x86-64, it says:
Hardware watchpoint 2: var
This hardware capability for x86 is mentioned at: http://en.wikipedia.org/wiki/X86_debug_register
It is likely implemented with dedicated address comparators in the debug hardware that check every memory access.
The "software" alternative is to single step the program, which is very slow.
Compare that to regular breakpoints, where at least the software implementation injects an int3 instruction at the breaking point and lets the program run, so you only pay overhead when a breakpoint is hit.
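A minimal program to try this out (a hypothetical example; build with g++ -g -O0, run it under gdb, and issue watch var before continuing):

#include <cstdio>

int var = 0;   // global, so "watch var" resolves without a running frame

int main()
{
    for (int i = 0; i < 3; ++i) {
        var += i;                       // each write here triggers the watchpoint
        std::printf("var = %d\n", var);
    }
    return 0;
}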
Here is a quote from the Intel System Debugger help doc:
Hardware vs. Software Breakpoints
The debugger can use both hardware and software breakpoints; each of these has strengths and weaknesses:
Hardware breakpoints are implemented using the DRx architectural breakpoint registers described in the Intel SDM. They have the advantage of being usable directly at reset, being non-volatile, and being usable with flash or other read-only memory. The downside is that they are a finite resource.
Software breakpoints require modifying system memory, as they are implemented by replacing the opcode at the desired location with a special instruction. This makes them an unlimited resource, but the memory dependency means you cannot install them prior to a module being loaded in memory, and if the target software overwrites that memory then they will become invalid.
In general, any debug feature that must be enabled by the debugger does not persist after a reset, and may be impacted after other architectural mode transitions such as SMM entry/exit or VM entry/exit. Specific examples include:
CPU reset will clear all debug features, except for reset break. This means, for example, that user-specified breakpoints will be invalid until the target halts once after reset. Note that this halt can be due to either a reset-break or a user-initiated halt. In either case the debugger will restore the necessary debug features.
SMM entry/exit will disable/re-enable breakpoints; this means you cannot specify a breakpoint in SMRAM while halted outside of SMRAM. If you wish to break within SMRAM, you must first halt at the SMM entry-break and manually apply the breakpoint. Alternatively you can patch the BIOS to re-enable breakpoints when entering SMM, but this requires the ability to modify the BIOS and cannot be used in production code.
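To make the DRx mechanism from the quote concrete: a user-mode debugger on Windows x64 can program the debug registers of a suspended target thread through the documented thread-context API. Below is a minimal sketch for an execute breakpoint in slot 0; the thread handle needs THREAD_GET_CONTEXT and THREAD_SET_CONTEXT access, and the function name is illustrative.

#include <windows.h>

// Sketch: arm a hardware *execute* breakpoint in DR0 for a suspended thread.
bool SetHardwareExecBreakpoint(HANDLE thread, DWORD64 address)
{
    CONTEXT ctx = {};
    ctx.ContextFlags = CONTEXT_DEBUG_REGISTERS;
    if (!GetThreadContext(thread, &ctx))
        return false;

    ctx.Dr0 = address;                  // linear address to break on
    ctx.Dr7 &= ~(0xFULL << 16);         // R/W0 = 00, LEN0 = 00: break on execution
    ctx.Dr7 |= 1;                       // L0: locally enable breakpoint slot 0

    return SetThreadContext(thread, &ctx) != 0;
}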
