Do C/C++ compilers such as GCC generally optimize modulo by a constant power of 2?

Let's say I have something like:
#define SIZE 32
/* ... */
unsigned x;
/* ... */
x %= SIZE;
Would the x % 32 generally be reduced to x & 31 by most C/C++ compilers such as GCC?

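Yes, any respectable compiler should perform this optimization. Specifically, for an unsigned operand a, the operation a % X, where X is a constant power of two, becomes the equivalent of a & (X-1). For a signed operand the compiler can still avoid a division, but it needs a few extra instructions because the result of % takes the sign of the dividend.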
GCC will even do this with optimizations turned off:
Example (gcc -c -O0 version 3.4.4 on Cygwin):
unsigned int test(unsigned int a) {
return a % 32;
}
Result (objdump -d):
00000000 <_test>:
0: 55 push %ebp
1: 89 e5 mov %esp,%ebp
3: 8b 45 08 mov 0x8(%ebp),%eax
6: 5d pop %ebp
7: 83 e0 1f and $0x1f,%eax ;; Here
a: c3 ret
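The equivalence the answer relies on can be checked directly in C. The following is a minimal sketch, not part of the original answer; the function names are made up for illustration. It shows that the remainder and the mask agree for unsigned values and disagree for negative signed values, which is why the plain `and` only appears for unsigned operands.

```c
/* Unsigned remainder by a power of two: always equal to masking the low bits. */
unsigned mod32_unsigned(unsigned x)  { return x % 32u; }
unsigned mask32_unsigned(unsigned x) { return x & 31u; }

/* Signed remainder: C99 and later truncate toward zero, so a negative
   dividend gives a negative (or zero) remainder. */
int mod32_signed(int x)  { return x % 32; }

/* Masking a negative value keeps the low five bits of its two's complement
   representation (assumed here), which is a different number. */
int mask32_signed(int x) { return x & 31; }
```

For example, mod32_signed(-7) is -7 while mask32_signed(-7) is 25, so a compiler has to emit a sign fix-up around the mask in the signed case.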

Related

What do the constraints "Rah" and "Ral" mean in extended inline assembly?

This question is inspired by a question asked by someone on another forum. In the following code what does the extended inline assembly constraint Rah and Ral mean. I haven't seen these before:
#include<stdint.h>
void tty_write_char(uint8_t inchar, uint8_t page_num, uint8_t fg_color)
{
asm (
"int $0x10"
:
: "b" ((uint16_t)page_num<<8 | fg_color),
"Rah"((uint8_t)0x0e), "Ral"(inchar));
}
void tty_write_string(const char *string, uint8_t page_num, uint8_t fg_color)
{
while (*string)
tty_write_char(*string++, page_num, fg_color);
}
/* Use the BIOS to print the first command line argument to the console */
int main(int argc, char *argv[])
{
if (argc > 1)
tty_write_string(argv[1], 0, 0);
return 0;
}
In particular are the use of Rah and Ral as constraints in this code:
asm (
"int $0x10"
:
: "b" ((uint16_t)page_num<<8 | fg_color),
"Rah"((uint8_t)0x0e), "Ral"(inchar));
The GCC documentation doesn't have an l or h constraint under either the simple constraints or the x86 machine constraints. R is any legacy register and a is the AX/EAX/RAX register.
What am I not understanding?

What you are looking at is code that is intended to be run in real mode on an x86 based PC with a BIOS. Int 0x10 is a BIOS service that has the ability to write to the console. In particular Int 0x10/AH=0x0e is to write a single character to the TTY (terminal).
That in itself doesn't explain what the constraints mean. To understand the constraints Rah and Ral you have to understand that this code isn't being compiled by a standard version of GCC or Clang. It is being compiled by a GCC port called ia16-gcc. It is a special port that targets the 8086, 80186 and 80286 and compatible processors. It doesn't generate 386 instructions or use 32-bit registers in code generation. This experimental version of GCC is meant to target 16-bit environments like DOS (FreeDOS, MS-DOS) and ELKS.
The documentation for ia16-gcc is hard to find online in HTML format but I have produced a copy for the recent GCC 6.3.0 versions of the documentation on GitHub. The documentation was produced by building ia16-gcc from source and using make to generate the HTML. If you review the machine constraints for Intel IA-16—config/ia16 you should now be able to see what is going on:
Ral The al register.
Rah The ah register.
This version of GCC no longer understands the R constraint by itself. The inline assembly you are looking at matches the parameters for Int 0x10/AH=0x0e:
VIDEO - TELETYPE OUTPUT
AH = 0Eh
AL = character to write
BH = page number
BL = foreground color (graphics modes only)
Return:
Nothing
Desc: Display a character on the screen, advancing the cursor
and scrolling the screen as necessary
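To make the register layout concrete, here is a small sketch in plain C (no BIOS call involved) of how the operands in the question map onto the 16-bit registers listed above. The helper names pack_bx and pack_ax_teletype are hypothetical and only mirror the expression `(uint16_t)page_num<<8 | fg_color` from the question.

```c
#include <stdint.h>

/* BX as the "b" operand sees it: page number in BH (high byte),
   foreground color in BL (low byte). */
uint16_t pack_bx(uint8_t page_num, uint8_t fg_color)
{
    return (uint16_t)((uint16_t)page_num << 8 | fg_color);
}

/* AX as Int 0x10 sees it for teletype output: AH = 0x0e, AL = character.
   In the question these two halves are supplied separately via Rah and Ral. */
uint16_t pack_ax_teletype(uint8_t ch)
{
    return (uint16_t)(0x0e00u | ch);
}
```

With page 1 and color 7 the BX value is 0x0107, and writing 'A' gives an AX value of 0x0e41.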
Other Information
The documentation does list all the constraints that are available for the IA16 target:
Intel IA-16—config/ia16/constraints.md
a
The ax register. Note that for a byte operand,
this constraint means that the operand can go into either al or ah.
b
The bx register.
c
The cx register.
d
The dx register.
S
The si register.
D
The di register.
Ral
The al register.
Rah
The ah register.
Rcl
The cl register.
Rbp
The bp register.
Rds
The ds register.
q
Any 8-bit register.
T
Any general or segment register.
A
The dx:ax register pair.
j
The bx:dx register pair.
l
The lower half of pairs of 8-bit registers.
u
The upper half of pairs of 8-bit registers.
k
Any 32-bit register group with access to the two lower bytes.
x
The si and di registers.
w
The bx and bp registers.
B
The bx, si, di and bp registers.
e
The es register.
Q
Any available segment register—either ds or es (unless one or both have been fixed).
Z
The constant 0.
P1
The constant 1.
M1
The constant -1.
Um
The constant -256.
Lbm
The constant 255.
Lor
Constants 128 … 254.
Lom
Constants 1 … 254.
Lar
Constants -255 … -129.
Lam
Constants -255 … -2.
Uo
Constants 0xXX00 except -256.
Ua
Constants 0xXXFF.
Ish
A constant usable as a shift count.
Iaa
A constant multiplier for the aad instruction.
Ipu
A constant usable with the push instruction.
Imu
A constant usable with the imul instruction except 257.
I11
The constant 257.
N
Unsigned 8-bit integer constant (for in and out instructions).
There are many new constraints and some repurposed ones.
In particular the a constraint for the AX register doesn't work like other versions of GCC that target 32-bit and 64-bit code. The compiler is free to choose either AH or AL with the a constraint if the values being passed are 8 bit values. This means it is possible for the a constraint to appear twice in an extended inline assembly statement.
You could have compiled your code to a DOS EXE with this command:
ia16-elf-gcc -mcmodel=small -mregparmcall -march=i186 \
-Wall -Wextra -std=gnu99 -O3 int10h.c -o int10h.exe
This targets the 80186. You can generate 8086-compatible code by omitting -march=i186. The generated code for main would look something like:
00000000 <main>:
0: 83 f8 01 cmp ax,0x1
3: 7e 1d jle 22 <tty_write_string+0xa>
5: 56 push si
6: 89 d3 mov bx,dx
8: 8b 77 02 mov si,WORD PTR [bx+0x2]
b: 8a 04 mov al,BYTE PTR [si]
d: 20 c0 and al,al
f: 74 0d je 1e <tty_write_string+0x6>
11: 31 db xor bx,bx
13: b4 0e mov ah,0xe
15: 46 inc si
16: cd 10 int 0x10
18: 8a 04 mov al,BYTE PTR [si]
1a: 20 c0 and al,al
1c: 75 f7 jne 15 <main+0x15>
1e: 31 c0 xor ax,ax
20: 5e pop si
21: c3 ret
22: 31 c0 xor ax,ax
24: c3 ret
When run with the command line int10h.exe "Hello, world!", it should print:
Hello, world!
Special Note: The IA16 port of GCC is very experimental and does have some code generation bugs especially when higher optimization levels are used. I wouldn't use it for mission critical applications at this point in time.

Understanding non-contiguous stack addressing in gcc x86 disassembly

I am using this C source code to compile with gcc (Ubuntu 7.4.0-1ubuntu1~18.04.1) 7.4.0.
/////////////////////////////////////////////////////////////////////////////////////////////
// Name: megabeets_0x1.c
// Description: Simple crackme intended to teach radare2 framework capabilities.
// Compilation: $ gcc megabeets_0x1.c -o megabeets_0x1 -fno-stack-protector -m32 -z execstac
//
// Author: Itay Cohen (@megabeets)
// Website: https://www.megabeets.net
/////////////////////////////////////////////////////////////////////////////////////////////
#include <stdio.h>
#include <string.h>
void rot13 (char *s) {
if (s == NULL)
return;
int i;
for (i = 0; s[i]; i++) {
if (s[i] >= 'a' && s[i] <= 'm') { s[i] += 13; continue; }
if (s[i] >= 'A' && s[i] <= 'M') { s[i] += 13; continue; }
if (s[i] >= 'n' && s[i] <= 'z') { s[i] -= 13; continue; }
if (s[i] >= 'N' && s[i] <= 'Z') { s[i] -= 13; continue; }
}
}
int beet(char *name)
{
char buf[128];
strcpy(buf, name);
char string[] = "Megabeets";
rot13(string);
return !strcmp(buf, string);
}
int main(int argc, char *argv[])
{
printf("\n .:: Megabeets ::.\n");
printf("Think you can make it?\n");
if (argc >= 2 && beet(argv[1]))
{
printf("Success!\n\n");
}
else
printf("Nop, Wrong argument.\n\n");
return 0;
}
gcc command used
gcc megabeets_0x1.c -o test32 -fno-stack-protector -z execstack -m32 -no-pie -fno-pic
The disassembly of function beet generated using objdump looks like the following:
080485a8 <beet>:
80485a8: 55 push ebp
80485a9: 89 e5 mov ebp,esp
80485ab: 81 ec 98 00 00 00 sub esp,0x98
80485b1: 83 ec 08 sub esp,0x8
80485b4: ff 75 08 push DWORD PTR [ebp+0x8]
80485b7: 8d 85 78 ff ff ff lea eax,[ebp-0x88]
80485bd: 50 push eax
80485be: e8 6d fd ff ff call 8048330 <strcpy@plt>
80485c3: 83 c4 10 add esp,0x10
80485c6: c7 85 6e ff ff ff 4d mov DWORD PTR [ebp-0x92],0x6167654d
80485cd: 65 67 61
80485d0: c7 85 72 ff ff ff 62 mov DWORD PTR [ebp-0x8e],0x74656562
80485d7: 65 65 74
80485da: 66 c7 85 76 ff ff ff mov WORD PTR [ebp-0x8a],0x73
80485e1: 73 00
80485e3: 83 ec 0c sub esp,0xc
80485e6: 8d 85 6e ff ff ff lea eax,[ebp-0x92]
80485ec: 50 push eax
80485ed: e8 94 fe ff ff call 8048486 <rot13>
80485f2: 83 c4 10 add esp,0x10
80485f5: 83 ec 08 sub esp,0x8
80485f8: 8d 85 6e ff ff ff lea eax,[ebp-0x92]
80485fe: 50 push eax
80485ff: 8d 85 78 ff ff ff lea eax,[ebp-0x88]
8048605: 50 push eax
8048606: e8 15 fd ff ff call 8048320 <strcmp@plt>
804860b: 83 c4 10 add esp,0x10
804860e: 85 c0 test eax,eax
8048610: 0f 94 c0 sete al
8048613: 0f b6 c0 movzx eax,al
8048616: c9 leave
8048617: c3 ret
I have a few doubts regarding this disassembly:
1. After pushing ebp and moving esp to ebp, the stack pointer is decreased by 0x98 and then by 0x8, totalling 0xA0, which leaves the stack frame aligned to 16 bytes. Why didn't the compiler subtract 0xA0 from esp directly instead of using two consecutive subtractions?
2. As can be seen from the C code, the variable buf in function beet is 128 bytes. But in this disassembly buf starts at ebp-0x88, which means 136 bytes for the buffer. Why are 136 bytes allocated instead of 128?
3. Before calling functions like strcpy or rot13, a seemingly random number of bytes is subtracted from esp, and after the call returns a different number of bytes is added back to esp (which I guess is to clear the arguments passed to those functions). For example, before calling rot13, 0xc is subtracted from esp, but after it returns 0x10 is added instead of 0xc.
These seemingly random adjustments of esp and the pushes make the data non-contiguous, resulting in lower utilization of stack memory. Is there any particular reason behind this behaviour?
After searching on google or stackoverflow I couldn't find any answer to these doubts.
Thank you
NOTE:
GCC code optimization results almost same disassembly.

1. Subtracting 0x98 from the stack leaves it 16-byte aligned. The additional 8 bytes are there to prepare for pushing the parameters to strcpy, so that the stack is 16-byte aligned again before the call.
2. It does allocate 128 bytes for buf. The additional bytes between buf and ebp are either for alignment or for compiler temporaries or some other purpose of the compiler. Perhaps there is space for the return value here. In any case, the compiler doesn’t end up needing to use the space. If you enable optimization, it probably wouldn’t be there.
3. As in #1, the stack pointer is adjusted before pushing the parameters for each call so that the stack is 16-byte aligned before the call.
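The alignment argument can be followed step by step with a little arithmetic. The sketch below replays the esp adjustments from the disassembly of beet on a made-up starting address (0x1000 is hypothetical; only its 16-byte alignment matters) and assumes, as the answer does, that esp is 16-byte aligned right before the caller's call instruction.

```c
/* esp modulo 16 at the moment "call strcpy" executes. */
unsigned esp_mod16_at_strcpy_call(void)
{
    unsigned esp = 0x1000; /* hypothetical aligned value before "call beet" */
    esp -= 4;              /* return address pushed by call */
    esp -= 4;              /* push ebp */
    esp -= 0x98;           /* sub esp,0x98: esp is 16-byte aligned again here */
    esp -= 0x8;            /* sub esp,0x8: padding ahead of two arguments */
    esp -= 4;              /* push DWORD PTR [ebp+0x8] */
    esp -= 4;              /* push eax */
    return esp % 16;
}

/* esp modulo 16 at the moment "call rot13" executes, starting from the
   aligned frame that remains after "add esp,0x10". */
unsigned esp_mod16_at_rot13_call(void)
{
    unsigned esp = 0x1000 - 4 - 4 - 0x98; /* aligned frame, as above */
    esp -= 0xc;                           /* sub esp,0xc: padding */
    esp -= 4;                             /* push eax: the single argument */
    return esp % 16;
}
```

Both functions return 0. In each case padding plus pushed arguments add up to 0x10, which is also why the cleanup after every call is add esp,0x10 regardless of how many arguments were pushed.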

Understanding GCC's alloca() alignment and seemingly missed optimization

Consider the following toy example that allocates memory on the stack by means of the alloca() function:
#include <alloca.h>
void foo() {
volatile int *p = alloca(4);
*p = 7;
}
Compiling the function above using gcc 8.2 with -O3 results in the following assembly code:
foo:
pushq %rbp
movq %rsp, %rbp
subq $16, %rsp
leaq 15(%rsp), %rax
andq $-16, %rax
movl $7, (%rax)
leave
ret
Honestly, I would have expected a more compact assembly code.
16-byte alignment for allocated memory
The instruction andq $-16, %rax in the code above results in rax containing the (only) 16-byte-aligned address between the addresses rsp and rsp + 15 (both inclusive).
This alignment enforcement is the first thing I don't understand: Why does alloca() align the allocated memory to a 16-byte boundary?
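For readers unfamiliar with the idiom, the leaq 15(%rsp), %rax / andq $-16, %rax pair is the usual "round up to a multiple of 16" computation. A minimal C sketch of the same arithmetic follows; align_up_16 is a name chosen here for illustration.

```c
#include <stdint.h>

/* Round addr up to the next multiple of 16 (unchanged if already aligned).
   Adding 15 and clearing the low four bits mirrors lea 15(%rsp) / and $-16. */
uintptr_t align_up_16(uintptr_t addr)
{
    return (addr + 15) & ~(uintptr_t)15;
}
```

An already aligned address such as 0x1000 maps to itself, while 0x1001 through 0x100f all map to 0x1010.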
Possible missed optimization?
Let's consider anyway that we want the memory allocated by alloca() to be 16-byte aligned. Even so, in the assembly code above, keeping in mind that GCC assumes the stack to be aligned to a 16-byte boundary at the moment of performing the function call (i.e., call foo), if we pay attention to the status of the stack inside foo() just after pushing the rbp register:
Size Stack RSP mod 16 Description
-----------------------------------------------------------------------------------
------------------
| . |
| . |
| . |
------------------........0 at "call foo" (stack 16-byte aligned)
8 bytes | return address |
------------------........8 at foo entry
8 bytes | saved RBP |
------------------........0 <----- RSP is 16-byte aligned!!!
I think that by taking advantage of the red zone (i.e., no need to modify rsp) and the fact that rsp already contains a 16-byte aligned address, the following code could be used instead:
foo:
pushq %rbp
movq %rsp, %rbp
movl $7, -16(%rbp)
leave
ret
The address contained in the register rbp is 16-byte aligned, therefore rbp - 16 will also be aligned to a 16-byte boundary.
Even better, the creation of the new stack frame can be optimized away, since rsp is not modified:
foo:
movl $7, -8(%rsp)
ret
Is this just a missed optimization or I am missing something else here?

This is a (partially) missed optimization in gcc. Clang does it as expected.
I said partially because, if you know you will be using gcc, you can use builtin functions (use conditional compilation to select between gcc and other compilers to keep the code portable).
__builtin_alloca_with_align is your friend ;)
Here is an example (changed so the compiler will not reduce function call to single ret):
#include <alloca.h>
volatile int* p;
void foo()
{
p = alloca(4) ;
*p = 7;
}
void zoo()
{
// alignment is 16 bits, not bytes
p = __builtin_alloca_with_align(4,16) ;
*p = 7;
}
int main()
{
foo();
zoo();
}
Disassembled code (with objdump -d -w --insn-width=12 -M intel)
Clang will produce the following code (clang -O3 test.c) - both functions look alike
0000000000400480 <foo>:
400480: 48 8d 44 24 f8 lea rax,[rsp-0x8]
400485: 48 89 05 a4 0b 20 00 mov QWORD PTR [rip+0x200ba4],rax # 601030 <p>
40048c: c7 44 24 f8 07 00 00 00 mov DWORD PTR [rsp-0x8],0x7
400494: c3 ret
00000000004004a0 <zoo>:
4004a0: 48 8d 44 24 fc lea rax,[rsp-0x4]
4004a5: 48 89 05 84 0b 20 00 mov QWORD PTR [rip+0x200b84],rax # 601030 <p>
4004ac: c7 44 24 fc 07 00 00 00 mov DWORD PTR [rsp-0x4],0x7
4004b4: c3 ret
GCC produces this one (gcc -g -O3 -fno-stack-protector):
0000000000000620 <foo>:
620: 55 push rbp
621: 48 89 e5 mov rbp,rsp
624: 48 83 ec 20 sub rsp,0x20
628: 48 8d 44 24 0f lea rax,[rsp+0xf]
62d: 48 83 e0 f0 and rax,0xfffffffffffffff0
631: 48 89 05 e0 09 20 00 mov QWORD PTR [rip+0x2009e0],rax # 201018 <p>
638: c7 00 07 00 00 00 mov DWORD PTR [rax],0x7
63e: c9 leave
63f: c3 ret
0000000000000640 <zoo>:
640: 48 8d 44 24 fc lea rax,[rsp-0x4]
645: c7 44 24 fc 07 00 00 00 mov DWORD PTR [rsp-0x4],0x7
64d: 48 89 05 c4 09 20 00 mov QWORD PTR [rip+0x2009c4],rax # 201018 <p>
654: c3 ret
As you can see, zoo now looks as expected and is similar to the clang code.
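The conditional compilation mentioned above could look roughly like the following sketch. The macro name ALLOCA_INT and the function store_and_load are made up for this example. It assumes a compiler that either provides __has_builtin (recent GCC and Clang do) or falls back to plain alloca, and it assumes 32 bits (4 bytes) is sufficient alignment for int on the target.

```c
#include <alloca.h>

/* Prefer the builtin with an explicit alignment (given in bits) when the
   compiler reports it; otherwise fall back to plain alloca(). */
#if defined(__has_builtin)
#  if __has_builtin(__builtin_alloca_with_align)
#    define ALLOCA_INT() ((int *)__builtin_alloca_with_align(sizeof(int), 32))
#  endif
#endif
#ifndef ALLOCA_INT
#  define ALLOCA_INT() ((int *)alloca(sizeof(int)))
#endif

/* Allocate one int on the stack, store a value through a volatile pointer
   and read it back. The memory is released when the function returns. */
int store_and_load(int value)
{
    volatile int *p = ALLOCA_INT();
    *p = value;
    return *p;
}
```

The nested #if is deliberate: writing `defined(__has_builtin) && __has_builtin(...)` on one line fails to preprocess on compilers that lack __has_builtin.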

The x86-64 System V ABI requires VLAs (C99 Variable Length Arrays) to be 16-byte aligned, same for automatic / static arrays that are >= 16 bytes.
It looks like gcc is treating alloca as a VLA, and failing to do constant-propagation into an alloca that only runs once per function call. (Or that it internally uses alloca for VLAs.)
A generic alloca / VLA can't use the red-zone, in case the runtime value is larger than 128 bytes. GCC also makes a stack frame with RBP instead of saving the allocation size and doing an add rsp, rdx later.
So the asm looks exactly like what it would if the size was a function arg or other runtime variable instead of a constant. That's what led me to this conclusion.
Also alignof(max_align_t) == 16, but alloca and malloc can satisfy the requirement to return memory usable for any object without 16-byte alignment for objects smaller than 16 bytes. None of the standard types have alignment requirements wider than their size in x86-64 SysV.
You're right, it should be able to optimize it to this:
void foo() {
alignas(16) int dummy[1];
volatile int *p = dummy; // alloca(4)
*p = 7;
}
and compile it to the movl $7, -8(%rsp) ; ret you suggested.
The alignas(16) might be optional here for alloca.
If you really need gcc to emit better code when constant propagation makes the arg to alloca a compile-time constant, you could consider simply using a VLA in the first place. GNU C++ supports C99-style VLAs in C++ mode, but ISO C++ (and MSVC) don't.
Or possibly use if(__builtin_constant_p(size)) { VLA version } else { alloca version }, but scoping of VLAs means you can't return a VLA from the scope of an if that detects that we're being inlined with a compile-time constant size. So you'd have to duplicate the code that needs the pointer.

Why is it necessary to use edi constraint in this inline assembly?

CentOS 6.5 64-bit VPS, 500 MB RAM, gcc 4.8.2
I have the following function that works only if I use edi as the constraint to hold the string pointer. If I try to use any other register or constraint (g, q, etc.), it segfaults.
BUT this problem only occurs when both link time optimization and -O3 are used together. With -O2 it's fine. If I don't use -flto, it's fine. But with both together, the only register I can use that doesn't crash is edi.
gcc -flto
CFLAGS=-I. -flto -std=gnu11 -msse4.2 -fno-builtin-printf -Wall -Winline -Wstrict-aliasing -g -pg -O3 -lrt -lpthread
It seems like there might be some sort of register clobbering going on or something else. I'm really at a loss to understand why and how to fix this. Another interesting aspect is the generated assembly puts rdi into rdx before using the pointer but if I try to use either register as the input constraint... it segfaults! If it fails under aggressive compiling options it suggests to me either the compiler is stuffing up somehow, or more likely I'm doing something wrong.
char *sse4_strCRLF(char *str)
{
__m128i M = _mm_set1_epi8(13);
char *res;
__asm__ __volatile__(
"xor %0,%0\n\t"
"sub $1, %1\n\t"
"1:" "sub $15,%1\n\t"
".align 16\n\t"
"2:" "add $16, %1\n\t"
"pcmpistri $0x08,(%1),%2\n\t"
"ja 2b\n\t"
"jnc 2f\n\t"
"cmpb $10,1(%1,%%rcx)\n\t"
"jne 1b\n\t"
"add %%rcx,%1\n\t"
"mov %1,%0\n\t"
"2:"
:"=q"(res)
:"edi"(str),"x"(M) //<-- if use anything except edi, it segfaults
:"rcx"
);
return (char*) res;
}
Disassembled output:
00000000000002e0 <sse4_strCRLF>:
2e0: 55 push rbp
2e1: 48 89 e5 mov rbp,rsp
2e4: e8 00 00 00 00 call 2e9 <sse4_strCRLF+0x9>
2e9: 66 0f 6f 05 00 00 00 00 movdqa xmm0,[rip+0x0] # 2f1 <sse4_strCRLF+0x11>
2f1: 48 89 fa mov rdx,rdi //<--- puts rdi into rdx!
2f4: 48 31 c0 xor rax,rax
2f7: 48 83 ea 01 sub rdx,0x1
2fb: 48 83 ea 0f sub rdx,0xf
2ff: 90 nop
300: 48 83 c2 10 add rdx,0x10
304: 66 0f 3a 63 02 08 pcmpistri xmm0,[rdx],0x8
30a: 77 f4 ja 300 <sse4_strCRLF+0x20>
30c: 73 0d jae 31b <sse4_strCRLF+0x3b>
30e: 80 7c 0a 01 0a cmp byte[rdx+rcx*1+0x1],0xa
313: 75 e6 jne 2fb <sse4_strCRLF+0x1b>
315: 48 01 ca add rdx,rcx
318: 48 89 d0 mov rax,rdx
31b: 5d pop rbp
31c: c3 ret

@David Wohlferd gave me the answer. I was making 2 dumb mistakes due to ignorance and assumptions. The code below is modified so that the input char pointer is not modified by the routine: it is copied into a register, and that register is used instead. I was also mistakenly thinking I could name a particular register directly in a constraint, as opposed to using the constraint letters a, b, etc.
gcc still seems to be fussy about what constraints I use. For example, if I use a and b for res and str respectively, it compiles fine but segfaults when run. But using S and D seems to work fine.
@David Wohlferd, I'd like to credit you as the answerer but I don't think I can do that for a comment.
char *sse4_strCRLF(char *str)
{
__m128i M = _mm_set1_epi8(13);
char *res;
__asm__ __volatile__(
"xor %0,%0\n\t"
"mov %1,%%rdx\n\t"
"sub $1,%%rdx\n\t"
"1:" "sub $15,%%rdx\n\t"
".align 16\n\t"
"2:" "add $16, %%rdx\n\t"
"pcmpistri $0x08,(%%rdx),%2\n\t"
"ja 2b\n\t"
"jnc 2f\n\t"
"cmpb $10,1(%%rdx,%%rcx)\n\t"
"jne 1b\n\t"
"add %%rcx,%%rdx\n\t"
"mov %%rdx,%0\n\t"
"2:"
:"=S"(res)
:"D"(str),"x"(M)
:"rcx","rdx"
);
return (char*) res;
}
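The first mistake described above, modifying a register that was only declared as an input, can be illustrated with a much smaller routine. This is a sketch and not a rewrite of sse4_strCRLF; the function name advance is made up. GCC assumes input operands are unchanged after the asm statement, so an operand that the asm both reads and writes has to be declared read-write with "+". The inline assembly is guarded for GCC-compatible compilers on x86-64 (an assumption of this example), with a plain C fallback elsewhere.

```c
#include <stddef.h>

/* Advance p by n bytes. */
const char *advance(const char *p, size_t n)
{
#if defined(__GNUC__) && defined(__x86_64__)
    /* "+r"(p): p is read and written by the asm, so the compiler knows its
       register no longer holds the old value afterwards.
       "r"(n):  n is only read; the compiler picks any general register.
       "cc":    add changes the flags. */
    __asm__ ("add %1, %0" : "+r"(p) : "r"(n) : "cc");
#else
    p += n; /* portable fallback */
#endif
    return p;
}
```

Letting the compiler choose registers through "r" also sidesteps the second mistake: a constraint string is a list of constraint letters, not a register name such as edi.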

Arithmetic shift using intel intrinsics

I have a set of bits, for example: 1000 0000 0000 0000 which is 16 bits and therefore a short. I would like to use an arithmetic shift so that I use the MSB to assign the rest of the bits:
1111 1111 1111 1111
and if I started off with 0000 0000 0000 0000:
after arithmetically shifting I would still have: 0000 0000 0000 0000
How can I do this without resorting to assembly? I looked at the intel intrinsics guide and it looks like I need to do this using AVX extensions but they look at data types larger than my short.

As mattnewport states in his answer, C code can do the job efficiently though with the possibility of 'implementation defined' behavior. This answer shows how to avoid the implementation defined behavior while preserving efficient code generation.
Because the question is about shifting a 16-bit operand, the concern over the implementation defined decision whether to sign extend or zero fill can be avoided by first sign extending the operand to 32-bits. Then the 32-bit value can be right shifted as unsigned, and finally truncated back to 16-bits.
Mattnewport's code actually sign extends the 16-bit operand to an int (32-bit or 64-bit depending on the compiler model) before shifting. This is because the language specification (C99 6.5.7 bitwise shift operators) requires a first step: "integer promotions are performed on each of the operands". Likewise, the result of mattnewport's code is an int because "the type of the result is that of the promoted left operand". Because of this, the code variation that avoids the implementation defined shift generates the same number of instructions as mattnewport's original code.
To avoid the implementation defined right shift of a negative value, the implicit promotion to signed int is replaced with an explicit conversion to unsigned int, while maintaining the same code efficiency. For shift counts from 0 to 16 the low 16 bits of the unsigned result equal the arithmetic shift result. Note that the final conversion back to int16_t is itself implementation defined when the value does not fit (C99 6.3.1.3); GCC documents it as reduction modulo 2^N, which is the truncation this code relies on.
The idea can be extended to cover 32-bit operands and works efficiently when 64-bit native integer support is present. Here is an example:
// use standard 'll' for long long print format
#define __USE_MINGW_ANSI_STDIO 1
#include <stdio.h>
#include <stdint.h>
// Code provided by mattnewport
int16_t aShiftRight16x (int16_t val, int count)
{
return val >> count;
}
// This variation avoids implementation defined behavior
int16_t aShiftRight16y (int16_t val, int count)
{
uint32_t uintVal = val;
uint32_t uintResult = uintVal >> count;
return (int16_t) uintResult;
}
// A 32-bit arithmetic right shift without implementation defined behavior
int32_t aShiftRight32 (int32_t val, int count)
{
uint64_t uint64Val = val;
uint64_t uint64Result = uint64Val >> count;
return (int32_t) uint64Result;
}
int main (void)
{
int16_t val16 = 0x8000;
int32_t val32 = 0x80000000;
int count;
for (count = 0; count <= 15; count++)
printf ("%04hX %04hX %08X\n", aShiftRight16x (val16, count),
aShiftRight16y (val16, count),
aShiftRight32 (val32, count));
return 0;
}
Here is gcc 4.8.1 x64 code generation:
0000000000000030 <aShiftRight16x>:
30: 0f bf c1 movsx eax,cx
33: 89 d1 mov ecx,edx
35: d3 f8 sar eax,cl
37: c3 ret
0000000000000040 <aShiftRight16y>:
40: 0f bf c1 movsx eax,cx
43: 89 d1 mov ecx,edx
45: d3 e8 shr eax,cl
47: c3 ret
0000000000000050 <aShiftRight32>:
50: 48 63 c1 movsxd rax,ecx
53: 89 d1 mov ecx,edx
55: 48 d3 e8 shr rax,cl
58: c3 ret
Here is MS visual studio x64 code generation:
aShiftRight16x:
00: 0F BF C1 movsx eax,cx
03: 8B CA mov ecx,edx
05: D3 F8 sar eax,cl
07: C3 ret
aShiftRight16y:
10: 0F BF C1 movsx eax,cx
13: 8B CA mov ecx,edx
15: D3 E8 shr eax,cl
17: C3 ret
aShiftRight32:
20: 48 63 C1 movsxd rax,ecx
23: 8B CA mov ecx,edx
25: 48 D3 E8 shr rax,cl
28: C3 ret
program output:
8000 8000 80000000
C000 C000 C0000000
E000 E000 E0000000
F000 F000 F0000000
F800 F800 F8000000
FC00 FC00 FC000000
FE00 FE00 FE000000
FF00 FF00 FF000000
FF80 FF80 FF800000
FFC0 FFC0 FFC00000
FFE0 FFE0 FFE00000
FFF0 FFF0 FFF00000
FFF8 FFF8 FFF80000
FFFC FFFC FFFC0000
FFFE FFFE FFFE0000
FFFF FFFF FFFF0000

I'm not sure why you are looking for intrinsics for this. Why not just use ordinary C++ right shift? This behavior is implementation defined but AFAIK on Intel platforms it will always sign extend.
int16_t val = 1 << 15; // 1000 0000 0000 0000
int16_t shiftVal = val >> 15; // 1111 1111 1111 1111
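If the implementation-defined shift is a concern, the same result can be written so that no negative value is ever shifted. The following is a sketch, not taken from either answer; the names sign_fill16 and asr16 are made up, and asr16 assumes a shift count in the range 0 to 15.

```c
#include <stdint.h>

/* All ones if the sign bit of v is set, all zeros otherwise. */
int16_t sign_fill16(int16_t v)
{
    return (int16_t)(v < 0 ? -1 : 0);
}

/* Arithmetic right shift of a 16-bit value, for count in 0..15.
   For negative v, ~v is non-negative, so shifting it is fully defined;
   complementing the result again fills the vacated bits with ones. */
int16_t asr16(int16_t v, int count)
{
    return (int16_t)(v < 0 ? ~(~v >> count) : v >> count);
}
```

For the value from the question, asr16(INT16_MIN, 15) gives -1 (bit pattern 1111 1111 1111 1111), and a zero input stays zero. Optimizing compilers typically recognize these patterns, but that is not guaranteed.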
