Arithmetic shift using intel intrinsics - performance

I have a set of bits, for example: 1000 0000 0000 0000 which is 16 bits and therefore a short. I would like to use an arithmetic shift so that I use the MSB to assign the rest of the bits:
1111 1111 1111 1111
and if I started off with 0000 0000 0000 0000:
after arithmetically shifting I would still have: 0000 0000 0000 0000
How can I do this without resorting to assembly? I looked at the intel intrinsics guide and it looks like I need to do this using AVX extensions but they look at data types larger than my short.

As mattnewport states in his answer, C code can do the job efficiently though with the possibility of 'implementation defined' behavior. This answer shows how to avoid the implementation defined behavior while preserving efficient code generation.
Because the question is about shifting a 16-bit operand, the concern over the implementation defined decision whether to sign extend or zero fill can be avoided by first sign extending the operand to 32-bits. Then the 32-bit value can be right shifted as unsigned, and finally truncated back to 16-bits.
Mattnewport's code actually sign extends the 16-bit operand to an int (32-bit or 64-bit depending on the compiler model) before shifting. This is because the language specification (C99 6.5.7 bitwise shift operators) requires a first step: integer promotions are performed on each of the operands. Likewise, the result of mattnewport's code is an int because The type of the result is that of the promoted left operand. Because of this, the code variation that avoids implementation defined behavior generates the same number of instructions as mattnewport's original code.
To avoid implementation defined behavior, the implicit promotion to signed int is replaced with an explicit promotion to unsigned int. This eliminates any possibility of implementation defined behavior while maintaining the same code efficiency.
The idea can be extended to cover 32-bit operands and works efficiently when 64-bit native integer support is present. Here is an example:
// use standard 'll' for long long print format
#define __USE_MINGW_ANSI_STDIO 1
#include <stdio.h>
#include <stdint.h>
// Code provided by mattnewport
int16_t aShiftRight16x (int16_t val, int count)
{
return val >> count;
}
// This variation avoids implementation defined behavior
int16_t aShiftRight16y (int16_t val, int count)
{
uint32_t uintVal = val;
uint32_t uintResult = uintVal >> count;
return (int16_t) uintResult;
}
// A 32-bit arithmetic right shift without implementation defined behavior
int32_t aShiftRight32 (int32_t val, int count)
{
uint64_t uint64Val = val;
uint64_t uint64Result = uint64Val >> count;
return (int32_t) uint64Result;
}
int main (void)
{
int16_t val16 = 0x8000;
int32_t val32 = 0x80000000;
int count;
for (count = 0; count <= 15; count++)
printf ("%04hX %04hX %08X\n", aShiftRight16x (val16, count),
aShiftRight16y (val16, count),
aShiftRight32 (val32, count));
return 0;
}
Here is gcc 4.8.1 x64 code generation:
0000000000000030 <aShiftRight16x>:
30: 0f bf c1 movsx eax,cx
33: 89 d1 mov ecx,edx
35: d3 f8 sar eax,cl
37: c3 ret
0000000000000040 <aShiftRight16y>:
40: 0f bf c1 movsx eax,cx
43: 89 d1 mov ecx,edx
45: d3 e8 shr eax,cl
47: c3 ret
0000000000000050 <aShiftRight32>:
50: 48 63 c1 movsxd rax,ecx
53: 89 d1 mov ecx,edx
55: 48 d3 e8 shr rax,cl
58: c3 ret
Here is MS visual studio x64 code generation:
aShiftRight16x:
00: 0F BF C1 movsx eax,cx
03: 8B CA mov ecx,edx
05: D3 F8 sar eax,cl
07: C3 ret
aShiftRight16y:
10: 0F BF C1 movsx eax,cx
13: 8B CA mov ecx,edx
15: D3 E8 shr eax,cl
17: C3 ret
aShiftRight32:
20: 48 63 C1 movsxd rax,ecx
23: 8B CA mov ecx,edx
25: 48 D3 E8 shr rax,cl
28: C3 ret
program output:
8000 8000 80000000
C000 C000 C0000000
E000 E000 E0000000
F000 F000 F0000000
F800 F800 F8000000
FC00 FC00 FC000000
FE00 FE00 FE000000
FF00 FF00 FF000000
FF80 FF80 FF800000
FFC0 FFC0 FFC00000
FFE0 FFE0 FFE00000
FFF0 FFF0 FFF00000
FFF8 FFF8 FFF80000
FFFC FFFC FFFC0000
FFFE FFFE FFFE0000
FFFF FFFF FFFF0000

I'm not sure why you are looking for intrinsics for this. Why not just use ordinary C++ right shift? This behavior is implementation defined but AFAIK on Intel platforms it will always sign extend.
int16_t val = 1 << 15; // 1000 0000 0000 0000
int16_t shiftVal = val >> 15; // 1111 1111 1111 1111

Related

What do the constraints "Rah" and "Ral" mean in extended inline assembly?

This question is inspired by a question asked by someone on another forum. In the following code what does the extended inline assembly constraint Rah and Ral mean. I haven't seen these before:
#include<stdint.h>
void tty_write_char(uint8_t inchar, uint8_t page_num, uint8_t fg_color)
{
asm (
"int $0x10"
:
: "b" ((uint16_t)page_num<<8 | fg_color),
"Rah"((uint8_t)0x0e), "Ral"(inchar));
}
void tty_write_string(const char *string, uint8_t page_num, uint8_t fg_color)
{
while (*string)
tty_write_char(*string++, page_num, fg_color);
}
/* Use the BIOS to print the first command line argument to the console */
int main(int argc, char *argv[])
{
if (argc > 1)
tty_write_string(argv[1], 0, 0);
return 0;
}
In particular are the use of Rah and Ral as constraints in this code:
asm (
"int $0x10"
:
: "b" ((uint16_t)page_num<<8 | fg_color),
"Rah"((uint8_t)0x0e), "Ral"(inchar));
The GCC Documentation doesn't have an l or h constraint for either simple constraints or x86/x86 machine constraints. R is any legacy register and a is the AX/EAX/RAX register.
What am I not understanding?
What you are looking at is code that is intended to be run in real mode on an x86 based PC with a BIOS. Int 0x10 is a BIOS service that has the ability to write to the console. In particular Int 0x10/AH=0x0e is to write a single character to the TTY (terminal).
That in itself doesn't explain what the constraints mean. To understand the constraints Rah and Ral you have to understand that this code isn't being compiled by a standard version of GCC/CLANG. It is being compiled by a GCC port called ia16-gcc. It is a special port that targets 8086/80186 and 80286 and compatible processors. It doesn't generate 386 instructions or use 32-bit registers in code generation. This experimental version of GCC is to target 16-bit environments like DOS (FreeDOS, MSDOS), and ELKS.
The documentation for ia16-gcc is hard to find online in HTML format but I have produced a copy for the recent GCC 6.3.0 versions of the documentation on GitHub. The documentation was produced by building ia16-gcc from source and using make to generate the HTML. If you review the machine constraints for Intel IA-16—config/ia16 you should now be able to see what is going on:
Ral The al register.
Rah The ah register.
This version of GCC doesn't understand the R constraint by itself anymore. The inline assembly you are looking at matches that of the parameters for Int 0x10/Ah=0xe:
VIDEO - TELETYPE OUTPUT
AH = 0Eh
AL = character to write
BH = page number
BL = foreground color (graphics modes only)
Return:
Nothing
Desc: Display a character on the screen, advancing the cursor
and scrolling the screen as necessary
Other Information
The documentation does list all the constraints that are available for the IA16 target:
Intel IA-16—config/ia16/constraints.md
a
The ax register. Note that for a byte operand,
this constraint means that the operand can go into either al or ah.
b
The bx register.
c
The cx register.
d
The dx register.
S
The si register.
D
The di register.
Ral
The al register.
Rah
The ah register.
Rcl
The cl register.
Rbp
The bp register.
Rds
The ds register.
q
Any 8-bit register.
T
Any general or segment register.
A
The dx:ax register pair.
j
The bx:dx register pair.
l
The lower half of pairs of 8-bit registers.
u
The upper half of pairs of 8-bit registers.
k
Any 32-bit register group with access to the two lower bytes.
x
The si and di registers.
w
The bx and bp registers.
B
The bx, si, di and bp registers.
e
The es register.
Q
Any available segment register—either ds or es (unless one or both have been fixed).
Z
The constant 0.
P1
The constant 1.
M1
The constant -1.
Um
The constant -256.
Lbm
The constant 255.
Lor
Constants 128 … 254.
Lom
Constants 1 … 254.
Lar
Constants -255 … -129.
Lam
Constants -255 … -2.
Uo
Constants 0xXX00 except -256.
Ua
Constants 0xXXFF.
Ish
A constant usable as a shift count.
Iaa
A constant multiplier for the aad instruction.
Ipu
A constant usable with the push instruction.
Imu
A constant usable with the imul instruction except 257.
I11
The constant 257.
N
Unsigned 8-bit integer constant (for in and out instructions).
There are many new constraints and some repurposed ones.
In particular the a constraint for the AX register doesn't work like other versions of GCC that target 32-bit and 64-bit code. The compiler is free to choose either AH or AL with the a constraint if the values being passed are 8 bit values. This means it is possible for the a constraint to appear twice in an extended inline assembly statement.
You could have compiled your code to a DOS EXE with this command:
ia16-elf-gcc -mcmodel=small -mregparmcall -march=i186 \
-Wall -Wextra -std=gnu99 -O3 int10h.c -o int10h.exe
This targets the 80186. You can generate 8086 compatible code by omitting the -march=i186 The generated code for main would look something like:
00000000 <main>:
0: 83 f8 01 cmp ax,0x1
3: 7e 1d jle 22 <tty_write_string+0xa>
5: 56 push si
6: 89 d3 mov bx,dx
8: 8b 77 02 mov si,WORD PTR [bx+0x2]
b: 8a 04 mov al,BYTE PTR [si]
d: 20 c0 and al,al
f: 74 0d je 1e <tty_write_string+0x6>
11: 31 db xor bx,bx
13: b4 0e mov ah,0xe
15: 46 inc si
16: cd 10 int 0x10
18: 8a 04 mov al,BYTE PTR [si]
1a: 20 c0 and al,al
1c: 75 f7 jne 15 <main+0x15>
1e: 31 c0 xor ax,ax
20: 5e pop si
21: c3 ret
22: 31 c0 xor ax,ax
24: c3 ret
When run with the command line int10h.exe "Hello, world!" should print:
Hello, world!
Special Note: The IA16 port of GCC is very experimental and does have some code generation bugs especially when higher optimization levels are used. I wouldn't use it for mission critical applications at this point in time.

Why do I find some never called instructions nopl, nopw after ret or jmp in GCC compiled code? [duplicate]

I've been working with C for a short while and very recently started to get into ASM. When I compile a program:
int main(void)
{
int a = 0;
a += 1;
return 0;
}
The objdump disassembly has the code, but nops after the ret:
...
08048394 <main>:
8048394: 55 push %ebp
8048395: 89 e5 mov %esp,%ebp
8048397: 83 ec 10 sub $0x10,%esp
804839a: c7 45 fc 00 00 00 00 movl $0x0,-0x4(%ebp)
80483a1: 83 45 fc 01 addl $0x1,-0x4(%ebp)
80483a5: b8 00 00 00 00 mov $0x0,%eax
80483aa: c9 leave
80483ab: c3 ret
80483ac: 90 nop
80483ad: 90 nop
80483ae: 90 nop
80483af: 90 nop
...
From what I learned nops do nothing, and since after ret wouldn't even be executed.
My question is: why bother? Couldn't ELF(linux-x86) work with a .text section(+main) of any size?
I'd appreciate any help, just trying to learn.
First of all, gcc doesn't always do this. The padding is controlled by -falign-functions, which is automatically turned on by -O2 and -O3:
-falign-functions
-falign-functions=n
Align the start of functions to the next power-of-two greater than n, skipping up to n bytes. For instance,
-falign-functions=32 aligns functions to the next 32-byte boundary, but -falign-functions=24 would align to the next 32-byte boundary only
if this can be done by skipping 23 bytes or less.
-fno-align-functions and -falign-functions=1 are equivalent and mean that functions will not be aligned.
Some assemblers only support this flag when n is a power of two; in
that case, it is rounded up.
If n is not specified or is zero, use a machine-dependent default.
Enabled at levels -O2, -O3.
There could be multiple reasons for doing this, but the main one on x86 is probably this:
Most processors fetch instructions in aligned 16-byte or 32-byte blocks. It can be
advantageous to align critical loop entries and subroutine entries by 16 in order to minimize
the number of 16-byte boundaries in the code. Alternatively, make sure that there is no 16-byte boundary in the first few instructions after a critical loop entry or subroutine entry.
(Quoted from "Optimizing subroutines in assembly
language" by Agner Fog.)
edit: Here is an example that demonstrates the padding:
// align.c
int f(void) { return 0; }
int g(void) { return 0; }
When compiled using gcc 4.4.5 with default settings, I get:
align.o: file format elf64-x86-64
Disassembly of section .text:
0000000000000000 <f>:
0: 55 push %rbp
1: 48 89 e5 mov %rsp,%rbp
4: b8 00 00 00 00 mov $0x0,%eax
9: c9 leaveq
a: c3 retq
000000000000000b <g>:
b: 55 push %rbp
c: 48 89 e5 mov %rsp,%rbp
f: b8 00 00 00 00 mov $0x0,%eax
14: c9 leaveq
15: c3 retq
Specifying -falign-functions gives:
align.o: file format elf64-x86-64
Disassembly of section .text:
0000000000000000 <f>:
0: 55 push %rbp
1: 48 89 e5 mov %rsp,%rbp
4: b8 00 00 00 00 mov $0x0,%eax
9: c9 leaveq
a: c3 retq
b: eb 03 jmp 10 <g>
d: 90 nop
e: 90 nop
f: 90 nop
0000000000000010 <g>:
10: 55 push %rbp
11: 48 89 e5 mov %rsp,%rbp
14: b8 00 00 00 00 mov $0x0,%eax
19: c9 leaveq
1a: c3 retq
This is done to align the next function by 8, 16 or 32-byte boundary.
From “Optimizing subroutines in assembly language” by A.Fog:
11.5 Alignment of code
Most microprocessors fetch code in aligned 16-byte or 32-byte blocks. If an importantsubroutine entry or jump label happens to be near the end of a 16-byte block then themicroprocessor will only get a few useful bytes of code when fetching that block of code. Itmay have to fetch the next 16 bytes too before it can decode the first instructions after thelabel. This can be avoided by aligning important subroutine entries and loop entries by 16.
[...]
Aligning a subroutine entry is as simple as putting as many
NOP
's as needed before thesubroutine entry to make the address divisible by 8, 16, 32 or 64, as desired.
As far as I remember, instructions are pipelined in cpu and different cpu blocks (loader, decoder and such) process subsequent instructions. When RET instructions is being executed, few next instructions are already loaded into cpu pipeline. It's a guess, but you can start digging here and if you find out (maybe the specific number of NOPs that are safe, share your findings please.

Understanding non-contiguous stack addressing in gcc x86 disassembly

I am using this C source code to compile with gcc (Ubuntu 7.4.0-1ubuntu1~18.04.1) 7.4.0.
/////////////////////////////////////////////////////////////////////////////////////////////
// Name: megabeets_0x1.c
// Description: Simple crackme intended to teach radare2 framework capabilities.
// Compilation: $ gcc megabeets_0x1.c -o megabeets_0x1 -fno-stack-protector -m32 -z execstac
//
// Author: Itay Cohen (#megabeets)
// Website: https://www.megabeets.net
/////////////////////////////////////////////////////////////////////////////////////////////
#include <stdio.h>
#include <string.h>
void rot13 (char *s) {
if (s == NULL)
return;
int i;
for (i = 0; s[i]; i++) {
if (s[i] >= 'a' && s[i] <= 'm') { s[i] += 13; continue; }
if (s[i] >= 'A' && s[i] <= 'M') { s[i] += 13; continue; }
if (s[i] >= 'n' && s[i] <= 'z') { s[i] -= 13; continue; }
if (s[i] >= 'N' && s[i] <= 'Z') { s[i] -= 13; continue; }
}
}
int beet(char *name)
{
char buf[128];
strcpy(buf, name);
char string[] = "Megabeets";
rot13(string);
return !strcmp(buf, string);
}
int main(int argc, char *argv[])
{
printf("\n .:: Megabeets ::.\n");
printf("Think you can make it?\n");
if (argc >= 2 && beet(argv[1]))
{
printf("Success!\n\n");
}
else
printf("Nop, Wrong argument.\n\n");
return 0;
}
gcc command used
gcc megabeets_0x1.c -o test32 -fno-stack-protector -z execstack -m32 -no-pie -fno-pic
The disassembly of function beet generated using objdump looks like the following:
080485a8 <beet>:
80485a8: 55 push ebp
80485a9: 89 e5 mov ebp,esp
80485ab: 81 ec 98 00 00 00 sub esp,0x98
80485b1: 83 ec 08 sub esp,0x8
80485b4: ff 75 08 push DWORD PTR [ebp+0x8]
80485b7: 8d 85 78 ff ff ff lea eax,[ebp-0x88]
80485bd: 50 push eax
80485be: e8 6d fd ff ff call 8048330 <strcpy#plt>
80485c3: 83 c4 10 add esp,0x10
80485c6: c7 85 6e ff ff ff 4d mov DWORD PTR [ebp-0x92],0x6167654d
80485cd: 65 67 61
80485d0: c7 85 72 ff ff ff 62 mov DWORD PTR [ebp-0x8e],0x74656562
80485d7: 65 65 74
80485da: 66 c7 85 76 ff ff ff mov WORD PTR [ebp-0x8a],0x73
80485e1: 73 00
80485e3: 83 ec 0c sub esp,0xc
80485e6: 8d 85 6e ff ff ff lea eax,[ebp-0x92]
80485ec: 50 push eax
80485ed: e8 94 fe ff ff call 8048486 <rot13>
80485f2: 83 c4 10 add esp,0x10
80485f5: 83 ec 08 sub esp,0x8
80485f8: 8d 85 6e ff ff ff lea eax,[ebp-0x92]
80485fe: 50 push eax
80485ff: 8d 85 78 ff ff ff lea eax,[ebp-0x88]
8048605: 50 push eax
8048606: e8 15 fd ff ff call 8048320 <strcmp#plt>
804860b: 83 c4 10 add esp,0x10
804860e: 85 c0 test eax,eax
8048610: 0f 94 c0 sete al
8048613: 0f b6 c0 movzx eax,al
8048616: c9 leave
8048617: c3 ret
I have few doubts regarding this disassembly,
After pushing ebp and moving esp to ebp, stack pointer is decreased by 0x98 first time, then by 0x8, totalling to 0xA0 which results stack frame aligned to 16 bytes. Why didn't compiler do a direct subtraction of 0xA0 from esp instead of 2 subsequent subtraction?
As can be seen from the C code, variable buf in function beet is 128 bytes. But in this disassembly buf is pointed by ebp-0x88 which means 136 bytes for buffer. Why 136 bytes allocated instead of 128 bytes?
Before calling functions like strcpy or rot13, random number of bytes first subtracted from esp before calling and after execution completion of these functions another random number of bytes is added to esp(which I guess to clear the arguments sent to those functions).
Example- Before calling rot13, 0xc is subtracted from esp, after completion 0x10 added instead of 0xc.
So, these random shifting of esp and pushing data results non-contiguous data, resulting in lower utilization of stack memory. Is there any particular reason behind this behaviour ?
After searching on google or stackoverflow I couldn't find any answer to these doubts.
Thank you
NOTE:
GCC code optimization results almost same disassembly.
Subtracting 0x98 from the stack leaves it 16-byte aligned. The additional 8 bytes is to prepare for pushing the parameters to strcpy, so that the stack is 16-byte aligned again before the call.
It does allocate 128 bytes for buf. The additional bytes between buf and ebp are either for alignment or for compiler temporaries or some other purpose of the compiler. Perhaps there is space for the return value here. In any case, the compiler doesn’t end up needing to use the space. If you enable optimization, it probably wouldn’t be there.
As in #1, the stack pointer is adjusted before pushing the parameters for each call so that the stack is 16-byte aligned before the call.

Why `sizeof var` does not show true size?

Suppose we use avr-gcc to compile code which has the following structure:
typedef struct {
uint8_t bLength;
uint8_t bDescriptorType;
int16_t wString[];
} S_string_descriptor;
We initialize it globally like this:
const S_string_descriptor sn_desc PROGMEM = {
1 + 1 + sizeof L"1234" - 2, 0x03, L"1234"
};
Let's check what is generated from it:
000000ac <__trampolines_end>:
ac: 0a 03 fmul r16, r18
ae: 31 00 .word 0x0031 ; ????
b0: 32 00 .word 0x0032 ; ????
b2: 33 00 .word 0x0033 ; ????
b4: 34 00 .word 0x0034 ; ????
...
So, indeed string content follows the first two elements of the structure, as required.
But if we try to check sizeof sn_desc, result is 2.
Definition of the variable is done in compile-time, sizeof is also a compile-time operator. So, why sizeof var does not show true size of var? And where this behavior of the compiler (i.e., adding arbitrary data to a structure) is documented?
sn_desc is a 2-byte pointer into flash. It is meant to be used with LPM et alia in order to retrieve the actual data. There is no way to get the actual size of this data; store it separately.

Do C/C++ compilers such as GCC generally optimize modulo by a constant power of 2?

Let's say I have something like:
#define SIZE 32
/* ... */
unsigned x;
/* ... */
x %= SIZE;
Would the x % 32 generally be reduced to x & 31 by most C/C++ compilers such as GCC?
Yes, any respectable compiler should perform this optimization. Specifically, a % X operation, where X is a constant power of two will become the equivalent of an & (X-1) operation.
GCC will even do this with optimizations turned off:
Example (gcc -c -O0 version 3.4.4 on Cygwin):
unsigned int test(unsigned int a) {
return a % 32;
}
Result (objdump -d):
00000000 <_test>:
0: 55 push %ebp
1: 89 e5 mov %esp,%ebp
3: 8b 45 08 mov 0x8(%ebp),%eax
6: 5d pop %ebp
7: 83 e0 1f and $0x1f,%eax ;; Here
a: c3 ret

Resources