SSE4 instructions in VS2005?

I need to use the popcnt instruction in a project that is compiled using Visual Studio 2005.
The intrinsic __popcnt() only works with VS2008, and the compiler doesn't seem to recognize the instruction even when I write it in an __asm {} block.
Is there any way to do this?

Okay, this is a wild guess thing but ... assuming you've set up VS2005 like this to do assembly language, then you could get a hold of the SSE4.1 manual from Intel and code a macro for each SSE4.1 opcode that you needed as per this thread at masm32.com (which discusses a similar issue w.r.t. SSE2.)
For example, here's some code out of one of the downloads from the masm32 link:
;SSE2 macros for MASM 6.14 by daydreamer aka Magnus Svensson
ADDPD MACRO M1,M2
db 066h
ADDPS M1,M2
ENDM
ADDSD MACRO M1,M2
DB 0F2H
ADDPS M1,M2
ENDM

As a small note, you can use _emit (the pseudoinstruction that MSVC's inline assembler provides for this) to put raw bytes into __asm blocks in VC++. This is easier in a lot of cases than linking with MASM-produced objects. I used this in the past when SSE3 was new (and the opcodes were not supported in VS 2003).
All of the opcodes are well documented by Intel.
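A minimal sketch of the byte-emitting approach applied to popcnt, assuming a 32-bit x86 build with MSVC. The function name popcount32 is made up for illustration. The bytes F3 0F B8 C1 are the encoding of popcnt eax, ecx from Intel's manual. The CPU still has to support POPCNT (CPUID leaf 1, ECX bit 23), otherwise the instruction raises an illegal-instruction exception. On any other compiler or target the sketch falls back to a plain bit-counting loop so that it still builds.

```c
/* popcount32 is a hypothetical helper name, not part of any library. */
unsigned int popcount32(unsigned int value)
{
#if defined(_MSC_VER) && defined(_M_IX86)
    unsigned int result;
    __asm {
        mov ecx, value
        /* popcnt eax, ecx, emitted as raw bytes: F3 0F B8 C1 */
        _emit 0xF3
        _emit 0x0F
        _emit 0xB8
        _emit 0xC1
        mov result, eax
    }
    return result;
#else
    /* Portable fallback (assumption: only here so the sketch builds
       on compilers without MSVC-style inline assembly). */
    unsigned int count = 0;
    while (value != 0) {
        value &= value - 1;   /* clear the lowest set bit */
        ++count;
    }
    return count;
#endif
}
```

Calling popcount32(0xFF) should give 8 on either path. Keeping the fallback around is also a reasonable choice for machines that lack POPCNT.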

Related

What is the difference between __i686.get_pc_thunk and __x86.get_pc_thunk?

These helper functions are used by GCC and Clang in 32-bit x86 position-independent code to get the current execution address into a register, for example:
call __i686.get_pc_thunk.bx
addl $_GLOBAL_OFFSET_TABLE_, %ebx
movl $2, 4(%esp)
leal .LC0@GOTOFF(%ebx), %eax
movl %eax, (%esp)
call dlopen@PLT
It seems the implementations are equivalent:
__x86.get_pc_thunk.bx:
movl (%esp), %ebx
ret
__i686.get_pc_thunk.bx:
movl (%esp), %ebx
ret
Is there any difference besides the name change (seems i686 is older)? And is there a reason for the i686 prefix instead of i386?
So, after some digging in commit history and bug trackers I think I mostly figured it out.
A long time ago, glibc had its own handling of PIC code, which used a call/pop pattern to get the GOT address.
Around 2002, __i686.get_pc_thunk.*, which did a similar task, was added to GCC, initially as an internal symbol.
Shortly afterwards it ended up in glibc too, probably to avoid code duplication when being compiled with GCC.
However, when built for Pentium Pro or later (-march=i686), GCC defined the preprocessor macro __i686=1, which broke glibc's compilation of the stub code: the preprocessor rewrote the __i686 part of the symbol name to 1. The problem was discovered quite early, but for several years glibc used various workarounds to handle it.
In 2011 (GCC 4.7?) the name was changed to __x86.get_pc_thunk.* and glibc added some checks to use a matching name. Eventually support for old GCC versions was dropped together with the old name. Both GCC and glibc only use __x86.get_pc_thunk.* now (although GCC can also generate the inline call/pop version).
So, in summary:
There is no actual difference between the two; the name change is simply historical, due to a predefined macro collision.
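The macro collision can be reproduced in miniature. This is only a sketch: DEMO_I686 is a made-up stand-in for the __i686 macro that GCC predefined, and the two helper functions are hypothetical names. Stringizing the symbol name once without expansion and once with expansion shows what the preprocessor does to the first piece of the name.

```c
#include <string.h>

/* DEMO_I686 is a hypothetical stand-in for the __i686 macro that GCC
   predefined as 1 when targeting -march=i686. */
#define DEMO_I686 1

#define STRINGIZE_RAW(x) #x                 /* no expansion of x */
#define STRINGIZE(x) STRINGIZE_RAW(x)       /* x is expanded first */

/* The symbol name as the author wrote it. */
const char *thunk_name_as_written(void)
{
    return STRINGIZE_RAW(DEMO_I686.get_pc_thunk.bx);
}

/* The symbol name after the preprocessor has expanded macros in it. */
const char *thunk_name_after_cpp(void)
{
    return STRINGIZE(DEMO_I686.get_pc_thunk.bx);
}
```

The second function no longer returns a name containing DEMO_I686: the identifier before the first dot has been replaced by 1, which is the same thing that happened to __i686.get_pc_thunk.bx in glibc's preprocessed assembly.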
References:
https://gcc.gnu.org/git/?p=gcc.git&a=search&h=HEAD&st=commit&s=get_pc_thunk
https://sourceware.org/git/?p=glibc.git&a=search&st=commit&s=get_pc_thunk
https://sourceware.org/bugzilla/show_bug.cgi?id=411
https://sourceware.org/bugzilla/show_bug.cgi?id=4507
Just a different name choice, not significant AFAIK.
i686 is the standard name for 32-bit code using new PPro instructions like CMOV and FCOMI, and 586 CMPXCHG8B and CPUID. Modern GNU/Linux distros typically configure gcc to use that as the default target for -m32 32-bit code, instead of truly baseline i386. e.g. gcc -v will show i686-linux-gnu for a 32-bit build of GCC.
Usually clang uses call next_insn / pop reg to read EIP into a register. (Fun fact: that actually doesn't break return-address prediction on CPUs other than original Pentium-Pro or Via Nano3000: http://blog.stuffedcow.net/2018/04/ras-microbenchmarks/#call0 - CPUs special case call rel32=0 as not being a real call and don't put the return address into the predictor stack.)
The get_pc_thunk.bx includes the name of the register to return in. 32-bit PIC code used to only ever use EBX as the GOT pointer register, but GCC can now pick any convenient register and emit a thunk function for it, like ....get_pc_thunk.ax so leaf functions don't have to save/restore EBX.
PIE does make executables slower, by maybe 15% for 32-bit code vs. a couple percent for 64-bit code. x86-64 has RIP-relative addressing which avoids the need for these thunks. IMO a 32-bit PIE isn't worth the price, unless you really need more hardening against ROP and Spectre attacks by having ASLR of the main executable.

Debugging 0xc000001d exception with WinDbg

I'm trying to find a root cause of the "Illegal instruction" exception (0xc000001d) with WinDbg.
The project was built with VC++2015. I've got two memory dumps from two test runs.
So far I have found the following, which is true for both dumps:
the exception points to the "movq mmword ptr [ecx], xmm0" instruction
xmm0 contains zeros
the exception occurs in an object constructor
the address is inside DS
the address belongs to a heap entry which looks valid
the address points into the object being constructed, so it looks like it is trying to store zero into the obj.m_data member, which looks valid too
I have no idea where to go further, so I'd appreciate any directions.
UPD:
...
movq xmm0,mmword ptr [esi]
lea ecx,[edi+94h]
movq mmword ptr [ecx],xmm0 ; << this causes the exception
An illegal-instruction exception is raised when the operating system handles a fault from a CPU that has failed to decode an instruction. This can occur if an instruction-set extension is not supported by the CPU or by the operating system.
See the MSDN report on an illegal instruction with AVX: in that case, the bug in MSVC 2013 occurred when the CPU supported AVX but the operating system did not.
movq to or from an xmm register is an SSE2 instruction. The CPUs which are failing don't appear to support SSE2, which is a likely cause of this issue.
When I came across the AVX issue, I used a tool supplied by Intel to identify whether AVX was used; its CPU test decided that AVX was not supported.
I am not aware of an equivalent tool from AMD, and I would be wary of relying on one, as it may be the operating system support that is missing.
Update
Why does an instruction fail if the operating system does not support it? AVX is an example. The Wikipedia article on AVX states:
AVX adds new register-state through the 256-bit wide YMM register file, so explicit operating system support is required to properly save and restore AVX's expanded registers between context switches.
Any change to the work or memory needed by the operating system probably requires explicit opt-in. In the case of AVX, the extra registers changed the amount of data stored for a context switch.
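The distinction between CPU support and operating system support can be checked at run time. The following is a rough sketch, not taken from the question's project: the function names are hypothetical, it uses __cpuid and _xgetbv on MSVC and __get_cpuid plus an inline xgetbv on GCC/Clang, and it simply reports 0 on non-x86 targets. SSE2 is CPUID leaf 1, EDX bit 26. AVX needs CPUID leaf 1 ECX bit 28 (AVX) and bit 27 (OSXSAVE), plus XCR0 bits 1 and 2 set by the operating system.

```c
#if defined(_MSC_VER) && (defined(_M_IX86) || defined(_M_X64))
#include <intrin.h>
#include <immintrin.h>
#define DEMO_HAVE_CPUID 1
static void demo_cpuid(unsigned int leaf, unsigned int regs[4])
{
    int r[4];
    int i;
    __cpuid(r, (int)leaf);
    for (i = 0; i < 4; ++i)
        regs[i] = (unsigned int)r[i];
}
static unsigned long long demo_xgetbv0(void)
{
    return _xgetbv(0);
}
#elif defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
#include <cpuid.h>
#define DEMO_HAVE_CPUID 1
static void demo_cpuid(unsigned int leaf, unsigned int regs[4])
{
    if (!__get_cpuid(leaf, &regs[0], &regs[1], &regs[2], &regs[3]))
        regs[0] = regs[1] = regs[2] = regs[3] = 0;
}
static unsigned long long demo_xgetbv0(void)
{
    unsigned int lo, hi;
    __asm__ volatile("xgetbv" : "=a"(lo), "=d"(hi) : "c"(0));
    return ((unsigned long long)hi << 32) | lo;
}
#else
#define DEMO_HAVE_CPUID 0   /* not x86: both checks report 0 */
#endif

/* Hypothetical helper: 1 if the CPU reports SSE2, else 0. */
int cpu_has_sse2(void)
{
#if DEMO_HAVE_CPUID
    unsigned int regs[4];          /* EAX, EBX, ECX, EDX */
    demo_cpuid(1, regs);
    return (int)((regs[3] >> 26) & 1u);
#else
    return 0;
#endif
}

/* Hypothetical helper: 1 only if the CPU has AVX and the OS has
   enabled saving of XMM and YMM state in XCR0. */
int os_enables_avx(void)
{
#if DEMO_HAVE_CPUID
    unsigned int regs[4];
    demo_cpuid(1, regs);
    if (((regs[2] >> 27) & 1u) == 0) return 0;   /* OSXSAVE missing */
    if (((regs[2] >> 28) & 1u) == 0) return 0;   /* AVX missing */
    return (demo_xgetbv0() & 0x6u) == 0x6u;      /* XCR0 bits 1 and 2 */
#else
    return 0;
#endif
}
```

Running checks like these on one of the failing machines would show whether the movq fault comes from a missing SSE2 feature bit. XGETBV is only executed after OSXSAVE has been confirmed, since it would itself fault otherwise.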

regarding gcc produced assembly code (assembly code not in order?)

I'm using gcc for a 64-bit MIPS machine.
I noticed something interesting in a piece of the generated assembly code. The details are below:
00000001200a4348 <get_pa_txr_index+0x50> 2ca2001f sltiu v0,a1,31
00000001200a434c <get_pa_txr_index+0x54> 14400016 bnez v0,00000001200a43a8 <get_pa_txr_index+0xb0>
00000001200a4350 <get_pa_txr_index+0x58> 64a2000e daddiu v0,a1,14
00000001200a43a8 <get_pa_txr_index+0xb0> 000210f8 dsll v0,v0,0x3
00000001200a43ac <get_pa_txr_index+0xb4> 0062102d daddu v0,v1,v0
00000001200a43b0 <get_pa_txr_index+0xb8> dc440008 ld a0,8(v0)
00000001200a43b4 <get_pa_txr_index+0xbc> df9955c0 ld t9,21952(gp)
00000001200a43b8 <get_pa_txr_index+0xc0> 0320f809 jalr t9
00000001200a43bc <get_pa_txr_index+0xc4> 00000000 nop
Normally the bnez will jump straight to 0xb0. But I'm sure the block after 0xb0 must use a1 as a parameter.
As we can see, though, a1 never shows up in the block after 0xb0.
Instead, a1 is used at 0x58, which is right after the bnez (0x54).
So is it possible that the instructions at 0x54 and 0x58 get executed at the same time? A superscalar processor executes more than one instruction during a clock cycle by simultaneously dispatching multiple instructions to redundant functional units on the processor.
My question is: how does the gcc compiler know my CPU has this capability? What kind of technique is gcc using? Which optimization option makes gcc generate this kind of assembly code?
thanks.
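This feature is usually called a branch delay slot. On MIPS, the instruction immediately after a branch (here the daddiu at 0x58) executes before the branch takes effect, whether or not the branch is taken. This is part of the architecture rather than a superscalar feature of a particular CPU, so the compiler can rely on it. Finding an instruction with which to fill a branch delay slot is usually done during the scheduling phase of the backend of an optimizing compiler.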
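The effect on the listing in the question can be modelled in a few lines of C. This is only an illustration of the delay-slot ordering, with a made-up function name; it covers the four instructions at 0x50, 0x54, 0x58 and 0xb0 and nothing else from the real function.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of the instructions at +0x50, +0x54, +0x58 and
   +0xb0. Returns whether the branch is taken and stores the value v0
   holds afterwards. */
bool delay_slot_model(uint64_t a1, uint64_t *v0_out)
{
    uint64_t v0 = (a1 < 31) ? 1 : 0;   /* sltiu v0,a1,31                */
    bool taken = (v0 != 0);            /* bnez v0,...: condition read   */
    v0 = a1 + 14;                      /* daddiu v0,a1,14: delay slot,
                                          runs whether or not taken     */
    if (taken)
        v0 <<= 3;                      /* dsll v0,v0,0x3 at the target  */
    *v0_out = v0;
    return taken;
}
```

With a1 = 1 the branch is taken and v0 ends up as (1 + 14) << 3 = 120, which is how a1 reaches the block at 0xb0 without being named there.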

what is GCC compiler option to get Segment Override prefix in x86

I have a memory layout (in increasing address order) like this:
Code Section (0-4k), Data Section(4k-8k), Stack Section(8k-12k), CustomData Section(12k-16k).
I have put some special arrays and structs in the CustomData Section.
As far as I know, the Data Segment (#DS) selector is used for any data access in compiler-generated code.
So the Data Section (4k-8k) will use #DS by default for all operations, except some string operations where ES may be used. For example:
mov $0xc00,%eax
addl $0xd, (%eax)
But I want to use the Extra Segment (#ES) selector for CustomData access. I would define a new GDT entry for ES with a different base and limit, like this:
mov $0x3400,%eax
addl $0xd, %es:(%eax)
So my question is:
Does GCC have any x86 compiler flag that tells the compiler to use #ES for accesses to the CustomData Section?
That is, a flag that makes it generate code using #ES for the CustomData Section?
Thanks in advance !!
While the question is asking for an option to make gcc's code-gen use an es prefix for accessing a custom section, IF you wanted to do it in hand-written code, the AT&T syntax already allows e.g. %es:(%eax).
Note, this may break rep-string instructions which gcc sometimes inlines; fs or gs would be the only sane choices, and are still usable even in x86-64.
(Making this comment community wiki based on useful critique by Peter Cordes.)
Quoting the example from the clang language-extension docs:
#define GS_RELATIVE __attribute__((address_space(256)))
int foo(int GS_RELATIVE *P) {
return *P;
}
Which compiles to (on X86-32):
_foo:
movl 4(%esp), %eax # load the arg
movl %gs:(%eax), %eax # use it with its prefix
ret
address-space 256 is gs, 257 is fs, and 258 is ss.
The docs don't mention es; compilers normally assume that es=ds so they can freely inline rep movs or rep stos for memcpy / memset if they choose to do so based on tuning options. Library implementations of memcpy or memset may also use rep stos/movs on some CPUs. Related: Why is std::fill(0) slower than std::fill(1)?
Obviously this is really low-level stuff, and only makes sense if you've already set the GS or FS base address. (wrfsbase). Beware that i386 Linux uses gs for thread-local storage in user-space, while x86-64 Linux uses fs.
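I'm not aware of extensions like this for ICC or MSVC. GCC 6 and later provide the similar __seg_fs and __seg_gs named address spaces on x86, again without an es variant.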
Well, there is __thread in GNU C, which will use %gs: or %fs: prefixes as appropriate for the target platform. How does the gcc `__thread` work?.
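A small sketch of the __thread route, assuming GCC or Clang on a target with thread-local storage support. The variable and function names are hypothetical. The C source never mentions a segment register; inspecting the output of gcc -S on x86 Linux is the way to see whether the accesses carry a %gs: or %fs: prefix.

```c
/* GNU C thread-local variable: each thread gets its own copy. */
static __thread int tls_counter;

/* Increments this thread's copy and returns the new value. */
int bump_tls_counter(void)
{
    return ++tls_counter;
}

/* Returns the address of this thread's copy, which the compiler
   derives from the thread pointer rather than a fixed address. */
int *tls_counter_address(void)
{
    return &tls_counter;
}
```

This only covers thread-local data that the toolchain lays out itself; it does not let you place arbitrary data behind a segment of your own choosing.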

Visual Studio 2010 x64 __setReg Equivalent Compiler Intrinsic

I have an application I have written in C where I really need to modify the value of one of the processor registers before calling a function. Normally I would do this with inline assembly, but as we all know that has been removed for 64 bit applications. I also cannot do this in a separate .asm file that is compiled with ml64 due to certain project constraints. So basically I need to execute the equivalent of the following code inline:
_asm mov r10d, 0xDEADBEEF
Does anyone know of a creative method or some other compiler intrinsic for x64 that will allow you to modify the value of a register inline?
Unfortunately, after looking at possible workarounds, it seems that Hans was right and it's simply not possible to modify the contents of a register inline. No compiler intrinsic exists to do it, and the only alternatives are to write the entire function in 64-bit assembly as a separate .asm file and compile it with ml64, or to do as Alexey suggested: allocate an executable block of memory beforehand and write the opcodes to it. You can then create a function pointer and just call this code directly. So for example, if I wanted to do the equivalent of:
mov r10d, ecx
ret
Just create an array to store the opcodes:
BYTE copyValueToR10[] = "\x44\x8B\xD1\xC3";
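You can then VirtualAlloc memory for this tiny function, copy the opcodes into it, and make it executable. Note that PAGE_EXECUTE alone does not allow the write, so either allocate with PAGE_EXECUTE_READWRITE or allocate writable memory and switch it to an executable protection with VirtualProtect after copying. Next just create a function pointer and you're good to go. Definitely a dirty way to do it, but given the constraints of not having inline asm or wanting to compile using ml64, this seems to be the only other way to do it.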
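A hedged sketch of those steps, assuming a 64-bit x86 Windows build. The names make_set_r10 and SetR10Fn are invented for illustration. It allocates writable memory, copies the four opcode bytes, then switches the page to PAGE_EXECUTE_READ with VirtualProtect. On any other platform the sketch just returns NULL so that it still compiles.

```c
#include <stddef.h>
#include <string.h>

#if defined(_WIN64) && (defined(_M_X64) || defined(__x86_64__))
#include <windows.h>
#define DEMO_WIN_X64 1
#else
#define DEMO_WIN_X64 0
#endif

/* mov r10d, ecx ; ret  (the opcode bytes from the answer) */
static const unsigned char copyValueToR10[] = { 0x44, 0x8B, 0xD1, 0xC3 };

/* Hypothetical function-pointer type: first integer argument arrives
   in ecx under the Windows x64 calling convention. */
typedef void (*SetR10Fn)(unsigned int);

/* Hypothetical helper: returns a callable copy of the stub, or NULL. */
SetR10Fn make_set_r10(void)
{
#if DEMO_WIN_X64
    DWORD oldProtect;
    void *mem = VirtualAlloc(NULL, sizeof copyValueToR10,
                             MEM_COMMIT | MEM_RESERVE, PAGE_READWRITE);
    if (mem == NULL)
        return NULL;
    memcpy(mem, copyValueToR10, sizeof copyValueToR10);
    if (!VirtualProtect(mem, sizeof copyValueToR10,
                        PAGE_EXECUTE_READ, &oldProtect)) {
        VirtualFree(mem, 0, MEM_RELEASE);
        return NULL;
    }
    FlushInstructionCache(GetCurrentProcess(), mem, sizeof copyValueToR10);
    return (SetR10Fn)mem;
#else
    return NULL;   /* assumption: stub is only meaningful on Windows x64 */
#endif
}
```

Usage would look like SetR10Fn f = make_set_r10(); if (f) f(0xDEADBEEF);. One caveat: r10 is a volatile (caller-saved) register in the Windows x64 calling convention, so nothing guarantees that the compiler leaves it untouched between the stub returning and the next call.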
