Align C function to "odd" address - gcc

I know from
C Function alignment in GCC
that I can align functions using
__attribute__((optimize("align-functions=32")))
Now, what if I want a function to start at an "odd" address, as in, I want it to start at an address of the form 32(2k+1), where k is any integer?
I would like the function to start at address (decimal) 32 or 96 or 160, but not 0 or 64 or 128.
Context: I'm doing a research project on code caches, and I want a function aligned in one level of cache but misaligned in another.

GCC doesn't have options to do that.
Instead, compile to asm and do some text manipulation on that output, e.g. gcc -O3 -S foo.c, then run some script on foo.s to insert odd-alignment padding before some function labels, before assembling the final executable with gcc -o benchmark foo.s.
One simple (but wasteful) way, costing between 32 and 95 bytes of padding, is this:
.balign 64 # byte-align by 64
.space 32 # emit 32 bytes (of zeros)
starts_half_way_into_a_cache_line:
testfunc1:
Tweaking GCC/clang output after compilation is in general a good way to explore what gcc should have done. All references to other code/data inside and outside the function use symbol names; nothing depends on relative distances between functions or absolute addresses until after you assemble (and link), so editing the asm source at this point is totally safe. (Another answer proposes copying final machine code around; that's very fragile, see the comments under it.)
An automated text-manipulation script will let you run your experiment on larger amounts of code. It can be as simple as
awk '/^testfunc.*:/ { print ".p2align 6; .skip 32"; print $0 }' foo.s
to do this before every label that matches the pattern ^testfunc.*. (Assuming no leading underscore name mangling.)
Or even use sed, which has a convenient -i option to edit "in place" (by renaming the output file over the original); perl has something similar. Fortunately, compiler output is pretty formulaic: for a given compiler it should be an easy pattern-matching problem.
Keep in mind that the effects of code-alignment aren't always purely local. Branches in one function can alias (in the branch-predictor) with branches from another function depending on alignment details.
It can be hard to know exactly why a change affects performance, especially for a change early in a function that shifts branch addresses in the rest of the function by a couple of bytes. You're not talking about changes like that, though, just shifting the whole function around. But it will change alignment relative to other functions, so tests that call multiple functions alternating with each other, or where the functions call each other, can be affected.
Other effects of alignment include uop-cache packing on modern x86, as well as fetch-block boundaries (beyond the obvious effect of leaving unused space in an I-cache line).
Ideally you'd only insert 0..63 bytes to reach a desired position relative to a 64-byte boundary. This section is a failed attempt at getting that to work.
.p2align and .balign (see footnote 1) support an optional 3rd arg which specifies a maximum amount of padding, so we're close to being able to do it with GAS directives. We can maybe build on that to detect whether we're at an odd or even 32-byte boundary by checking whether it inserted any padding or not. (Assuming we're only talking about 2 cases, not the 4 cases of 16-byte relative to 64-byte alignment, for example.)
# DOESN'T WORK, and maybe not fixable
1: # local label
.balign 64,,31 # pad with up to 31 bytes to reach 64-byte alignment
2:
.balign 32 # byte-align by 32, maybe to the position we want, maybe not
.ifne 2b - 1b
# there is space between labels 2 and 1 so that balign reached a 64-byte boundary
.space 32
.endif # else it was already an odd boundary
But unfortunately this doesn't work: Error: non-constant expression in ".if" statement. If the code between the 1: and 2: labels has fixed size, like .long 0xdeadbeef, it will assemble just fine. So apparently GAS won't let you query with a .if how much padding an alignment directive inserted.
Footnote 1: .align is either .p2align (power of 2) or .balign (byte) depending on which target you're assembling for. Instead of remembering which is which on which target, I'd recommend always using .p2align or .balign, not .align.

As this question is tagged assembly, here are two spots in my (NASM 8086) sources that "anti-align" the following instructions and data. (Here just relative to even-address alignment, i.e. 2-byte alignment.) Both were based on the calculation done by NASM's align macro.
https://hg.ulukai.org/ecm/ldebug/file/683a1d8ccef9/source/debug.asm#l1161
times 1 - (($ - $$) & 1) nop ; align in-code parameter
call entry_to_code_sel, exc_code
https://hg.ulukai.org/ecm/ldebug/file/683a1d8ccef9/source/debug.asm#l7062
; $ - $$ = offset into section
; % 2 = 1 if odd offset, 0 if even
; 2 - = 1 if odd, 2 if even
; % 2 = 1 if odd, 0 if even
; resb (2 - (($-$$) % 2)) % 2
; $ - $$ = offset into section
; % 2 = 1 if odd offset, 0 if even
; 1 - = 0 if odd, 1 if even
resb 1 - (($-$$) % 2) ; make line_out aligned
trim_overflow: resb 1 ; actually part of line_out to avoid overflow of trimputs loop
line_out: resb 263
resb 1 ; reserved for terminating zero
line_out_end:
Here is a simpler way to achieve anti-alignment:
align 2
nop
This is more wasteful though: it may use up 2 bytes if the desired anti-alignment would already be satisfied before this sequence. My prior examples never reserve more space than necessary.
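For reference, here is the same arithmetic written out in C (just an illustration; off stands in for NASM's $ - $$, and both helper names are made up):

#include <stdio.h>

/* Padding needed so the *next* byte lands on an odd offset:
   1 byte if off is even, 0 bytes if it is already odd. */
static unsigned pad_to_odd(unsigned off)
{
    return 1u - (off & 1u);
}

/* What NASM's align macro computes: padding up to the next
   multiple of align (align is assumed to be a power of two). */
static unsigned pad_to_align(unsigned off, unsigned align)
{
    return (align - (off % align)) % align;
}

int main(void)
{
    for (unsigned off = 0; off < 4; off++)
        printf("off=%u  pad_to_odd=%u  pad_to_align(,2)=%u\n",
               off, pad_to_odd(off), pad_to_align(off, 2));
    return 0;
}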

I believe GCC only lets you align on powers of 2.
If you want to get around this for testing, you could compile your functions as position-independent code (-fPIC or -fPIE) and then write a separate loader that manually copies the function into an area that was mmap'd as read/write. You can then change the permissions to make it executable. Of course, for a proper performance comparison, you would want to make sure the aligned code that you are comparing it against was also compiled with -fPIC/-fPIE.
I can probably give you some example code if you need it, just let me know.
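To illustrate the idea, here is a minimal sketch of such a loader (Linux/POSIX; payload and func_len are made up for the example, and it assumes the copied function is completely self-contained, with no calls, global data accesses or other outgoing references - exactly the fragility warned about in the first answer):

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

/* Must be fully self-contained for the copy to remain valid. */
static int payload(int x) { return x * x + 1; }

int main(void)
{
    size_t func_len = 128;   /* assumption: a generous upper bound on payload's size */
    size_t buf_len  = 4096;

    unsigned char *buf = mmap(NULL, buf_len, PROT_READ | PROT_WRITE,
                              MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED) { perror("mmap"); return 1; }

    /* mmap returns a page-aligned (hence 64-byte-aligned) pointer, so buf + 32
       is an address of the form 64k + 32: 32-byte aligned but not 64-byte aligned. */
    unsigned char *dst = buf + 32;
    memcpy(dst, (const void *)payload, func_len);

    if (mprotect(buf, buf_len, PROT_READ | PROT_EXEC) != 0) { perror("mprotect"); return 1; }

    int (*fn)(int) = (int (*)(int))dst;
    printf("payload(6) = %d, running at %p\n", fn(6), (void *)dst);
    munmap(buf, buf_len);
    return 0;
}

Any function with calls, PLT references or global data accesses will break as soon as it is moved, so this really only works for small, self-contained test kernels.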

Related

How to select jump instructions in one pass?

Variable length instruction sets often provide multiple jump instructions with differently sized displacements to optimize the size of the code. For example, the PDP-11 has both
0004XX BR FOO # branch always, 8 bit displacement
and
000167 XXXXX JMP FOO # jump relative, 16 bit displacement
for relative jumps. For a more recent example, the 8086 has both
EB XX JMP FOO # jump short, 8 bit displacement
and
E9 XX XX JMP FOO # jmp near, 16 bit displacement
available.
An assembler should not burden the programmer with guessing the right kind of jump to use; it should be able to infer the shortest possible encoding such that all jumps reach their targets. This optimisation is even more important when you consider cases where a jump with a big displacement has to be emulated. For example, the 8086 lacks conditional jumps with 16 bit displacements. For code like this:
74 XX JE FOO # jump if equal, 8 bit displacement
the assembler should emit code like this if FOO is too far away:
75 03 JNE AROUND # jump if not equal
E9 XX XX JMP FOO # jump near, 16 bit displacement
AROUND: ...
Since this code is larger and takes more than twice as long as a plain JE, it should only be emitted if absolutely necessary.
An easy way to do this is to first try to assemble all jumps with the smallest possible displacements. If any jump doesn't fit, the assembler performs another pass where all jumps that didn't fit previously are replaced with longer jumps. This is iterated until all jumps fit (sketched in code below). While easy to implement, this algorithm runs in O(n²) time and is a bit hackish.
Is there a better, preferably linear time algorithm to determine the ideal jump instruction?
As outlined in another question, this start-small algorithm is not necessarily optimal given suitably advanced constructs. I would like to assume the following niceness constraints:
conditional inclusion of code and macro expansion has already occurred by the time instructions are selected and cannot be influenced by these decisions
if the size of a data object is affected by the distance between two explicit or implicit labels, the size grows monotonically with distance. For example, it is legal to write
.space foo-bar
if bar does not occur after foo.
in violation of the previous constraint, alignment directives may exist
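For illustration, here is a minimal C sketch of the start-small relaxation described above (the item/branch representation and the 2-byte/5-byte sizes are 8086-flavoured assumptions, not any real assembler's internals):

#include <stdbool.h>
#include <stdlib.h>

enum kind { FIXED, BRANCH };

struct item {
    enum kind kind;
    int size;        /* current encoded size in bytes */
    int target;      /* index of the target item, for branches */
    bool is_long;    /* already widened to the long form? */
};

/* One relaxation pass: recompute offsets, then widen every short branch whose
   displacement no longer fits in a signed byte.  Returns true if anything grew. */
static bool relax_once(struct item *it, int n, int *off)
{
    int pc = 0;
    for (int i = 0; i < n; i++) { off[i] = pc; pc += it[i].size; }
    off[n] = pc;

    bool changed = false;
    for (int i = 0; i < n; i++) {
        if (it[i].kind != BRANCH || it[i].is_long)
            continue;
        int disp = off[it[i].target] - (off[i] + 2);  /* rel8 counts from the end of a 2-byte jump */
        if (disp < -128 || disp > 127) {
            it[i].is_long = true;
            it[i].size = 5;          /* e.g. JNE around + near JMP, as in the question */
            changed = true;
        }
    }
    return changed;
}

/* Iterate to a fixed point.  Sizes only ever grow, so this terminates;
   the worst case is the O(n^2) behaviour complained about above. */
static void relax(struct item *it, int n)
{
    int *off = malloc((size_t)(n + 1) * sizeof *off);
    if (!off) return;
    while (relax_once(it, n, off))
        ;
    free(off);
}

int main(void)
{
    /* Tiny example: a branch over 200 bytes of fixed code must be widened. */
    struct item prog[] = {
        { BRANCH,   2, 2, false },
        { FIXED,  200, 0, false },
        { FIXED,    1, 0, false },
    };
    relax(prog, 3);
    return prog[0].size == 5 ? 0 : 1;
}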

How to write asm code "bsrl" in golang

I need to write some asm code in golang. I read this question, Is it possible to include inline assembly in Google Go code?, but I don't see how to write it.
Could anyone help me? Thanks.
asm ("bsrl %1, %0;"
:"=r"(bits) /* output */
:"r"(value) ); /* input */
All the answers on the question you found say it's not possible to use inline-asm in Go, with any syntax. GNU C inline-asm syntax isn't going to help.
But fortunately, you don't need inline asm for bsr (which finds the bit-index of the highest set bit). Go 1.9 has intrinsic / built-in functions for bitwise operations that are close enough that they should compile efficiently.
Use math/bits.LeadingZeros32 to get lzcnt(x), which is 31-bsr(x) for non-zero x. This may cost extra instructions, especially on CPUs which only support bsr, not lzcnt (e.g. Intel pre-Haswell).
Or use Len32(x) - 1
Len32(x) returns the number of bits required to represent x. It returns 0 for x=0, and presumably it returns 1 for x=1, so it's bsr(x) + 1, with defined behaviour for 0 (thus potentially costing extra instructions). Hopefully Len32(x) - 1 can compile directly to a bsr.
Of course, if what you really wanted was lzcnt, then use LeadingZeros32 in the first place.
Note that bsr leaves the destination register unmodified for input = 0. Intel's docs only say the destination is undefined, so compilers probably don't take advantage of this behaviour, which AMD documents and Intel's hardware does provide.
At least in theory, though, Len32(x) - 1 could compile to a single bsr instruction if the compiler can prove that x is non-zero.
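If it helps to see those identities written out, here is a C analogue (using GCC/Clang's __builtin_clz, which, like bsr's architectural result, is undefined for an input of 0 - this is only an illustration of the arithmetic, not Go code):

#include <stdio.h>

/* For non-zero x:  bsr(x) == 31 - lzcnt(x) == bit_length(x) - 1 */
static unsigned bsr32(unsigned x) { return 31u - (unsigned)__builtin_clz(x); }
static unsigned len32(unsigned x) { return x ? 32u - (unsigned)__builtin_clz(x) : 0u; }

int main(void)
{
    unsigned x = 0x00008000u;                          /* bit 15 is the highest set bit */
    printf("bsr(x)          = %u\n", bsr32(x));        /* 15 */
    printf("31 - lzcnt(x)   = %u\n", 31u - (unsigned)__builtin_clz(x));  /* 15 */
    printf("bit_length(x)-1 = %u\n", len32(x) - 1u);   /* 15, like bits.Len32(x) - 1 in Go */
    return 0;
}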

Any byte sequences that can not be present in valid x86 code?

Any byte sequences that can not be present in valid x86 code?
I'm looking for a byte sequence (or sequences), to inject into an x86 program compiled using GCC, that cannot show up in the binary as a by-product of compilation.
The reason is that I want these byte sequences to act as "labels", so that I can recognize them later during inspection.
Is it possible to construct patterns of bytes, so that, searching through the binary, these patterns will not show up except with very small probability (I prefer probability zero). In other words, I want to minimize the number of false positives!
There are sequences that today are not a valid encoding of any instruction.
Rather than digging through the opcode tables in Intel's Manual Volume 2, you can exploit two facts of the x86 architecture:
The maximum instruction length is 15 bytes.
You can repeat prefixes.
These should also be more stable across generations than reserved opcodes.
The sequence 666666666666666666666666666666 (15 operand-size override prefixes, but any prefix will do) will generate an #UD exception because it is invalid.
For what it's worth, there is a specific instruction that fulfills the role of invalid instruction: ud2.
Its presence in a binary module is possible, but it's more idiomatic than an arbitrary invalid encoding and it is standard; for example, Linux uses it to mark a bug, since if a ud2 is reached in the execution flow, the code leading to it cannot be valid.
That said, if I got you right, that's not going to be useful to you.
You want to skip the process of decoding the instructions and scan the code section of the binary instead.
There is no guarantee that the code section will contain only code; for example, ARM compilers generate literal pools - though that's definitely uncommon on x86.
However, compilers usually align functions to a specific boundary (usually 16 bytes); this can be done in several ways - like stretching the previous function or with mere padding.
This padding can be a sequence of bytes of any value - hence arbitrary bytes can be present in the code section.
Long story short, there is no universal byte sequence that appears with probability zero in the code section.
Everything that is not in the execution flow can have any value.
We will deal with probability later; for now, let's assume the 66..66h sequence appears rarely enough in an executable.
You can't just use it directly, as 66..66h can be part of two instructions and thus be a valid sequence:
mov rax, 6666666666666666h
db 66h, 66h, 66h , 66h
db 66h, 66h, 66h
nop
is valid.
This is due to the immediate operands of instructions - the biggest immediate can be 8 bytes in length (as of today), so the sequence must be lengthened to 15 + 8 = 23 bytes.
If you really want to be safe against future features, you can use a sequence of 14 + 15 = 29 bytes (based on the 15-byte instruction length limit).
It's possible to find 23/29 bytes of value 66h in the code section or in the whole binary.
But how probable is that?
If the bytes in a binary were uniformly random then the probability would be astronomically small: 256^-23 = 2^-184.
Well, the point is that the bytes in a binary are not uniformly random.
You can open a file with an embedded icon to confirm that.
You can make the probability arbitrarily small by stretching the sequence - it's up to you to find a compromise between the length and an acceptable number of false positives.
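To make the trade-off concrete, scanning for such a marker is only a few lines; this sketch uses 29 as the conservative length argued for above and scans the whole file rather than just the code section:

#include <stdio.h>

int main(int argc, char **argv)
{
    if (argc < 2) { fprintf(stderr, "usage: %s <binary>\n", argv[0]); return 2; }
    FILE *f = fopen(argv[1], "rb");
    if (!f) { perror("fopen"); return 2; }

    const int needed = 29;   /* marker length chosen above */
    int run = 0, c;
    long pos = 0;
    while ((c = fgetc(f)) != EOF) {
        run = (c == 0x66) ? run + 1 : 0;
        if (run == needed)
            printf("marker ends at file offset 0x%lx\n", pos);
        pos++;
    }
    fclose(f);
    return 0;
}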
It's unclear what you want to do, but here is some advice:
Most, if not all, build tools support generating a map file.
It is a file with all the symbols/names and their addresses.
If you could use actual labels (with a prefix and a random suffix) you'd collect them easily after the build.
Most output formats can be enriched with meta-information.
You can add an ELF/PE section with a table of offsets to the locations you want to mark.
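As a sketch of that last suggestion (GCC syntax; the section name .my_markers and the struct layout are invented for the example):

#include <stdio.h>

struct marker {
    const char *name;
    void (*addr)(void);
};

static void interesting_function(void) { puts("hello"); }

/* A table of "things to find later" emitted into its own section, instead of
   hiding magic byte sequences in the code; "used" keeps it even if nothing
   references it.  A tool or post-link script can read .my_markers directly. */
static const struct marker markers[]
    __attribute__((section(".my_markers"), used)) = {
    { "interesting_function", interesting_function },
};

int main(void)
{
    printf("%s lives at %p\n", markers[0].name, (void *)markers[0].addr);
    return 0;
}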

Appending 0 before the hexa number

I have been instructed by my teacher to put a 0 before hex numbers when writing instructions, as some assemblers look for a 0 before the number to differentiate it from a label. I am confused: if the number already starts with a 0, what should be done in such a case?
For Example,
AND BL, 0FH
Is there a need to add a 0 before that hex number or not? Please help me out. Thanks.
EDIT:
Sorry if I wasn't clear enough before. What I meant was that in the above example a 0 is already present; do I need to convert it to
AND BL, 00FH
Except for the special cases like 0 or 1, I tend to encode my hex numbers with the full complement of digits just so it's easier to see what the intent is:
mov al, 09h
mov ax, 0123h
and so on.
For cases where the number starts with an alpha character (like deadbeef), I prefix it with an extra 0.
But no, it's not usually (a) necessary to do this if your hex number already begins with a digit.
In any case, I'd be putting most numbers into an equ statement rather than sprinkling magic numbers throughout the code. Would you rather see:
mov ax, 80
or:
mov ax, lines_per_screen
(a) Of course, it depends on your assembler but, from memory, all the ones I've used work this way.
No, there's no need (and including more than one leading 0 is fairly unusual).
Your example is an apt one though -- without the leading 0 to tell it this was a number, the assembler would normally interpret FH as a symbol rather than a number.

Usefulness of LOOPNE

I am unable to understand the usefulness of LOOPNE. Even if LOOPNE were not there and only LOOP was there, it would have done the same thing here. Please help me out.
MOV CX, 80
BACK: MOV AH,1
INT 21H
CMP AL, ' '
LOOPNE BACK
CMP is more or less a SUB instruction that doesn't store the result; it just sets flags such as ZF (the zero flag).
LOOPNE has 2 conditions to loop: cx > 0 and ZF = 0
LOOP has 1 condition to loop: cx > 0
So a normal LOOP would go through all characters, whereas LOOPNE will go through all characters or stop as soon as a space is encountered, whichever comes first.
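Rendered in C, the fragment above does roughly this (getchar stands in for the INT 21h / AH=1 character read; no EOF handling, to stay close to the asm):

#include <stdio.h>

int main(void)
{
    int cx = 80;
    int al;
    do {
        al = getchar();                 /* BACK: MOV AH,1 / INT 21h */
        cx--;                           /* LOOPNE decrements CX ...              */
    } while (cx != 0 && al != ' ');     /* ... and loops while CX != 0 and ZF == 0 */
    printf("stopped with CX = %d\n", cx);
    return 0;
}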
LOOPNE loops when a comparison fails, and when there is a remaining nonzero iteration count (after decrementing it). This is arguably very convenient for finding an element in a linear list of known length.
There is little use for it in modern x86 CPUs.
The LOOPNE instruction is likely implemented internally in the CPU by microinstructions and thus effectively equivalent to JNE/DEC CX/JNE.
Because the CPU designers invest vast amounts of effort to optimize compare/branch/register arithmetic, the equivalent instruction sequence is likely, on a highly pipelined CPU, to execute virtually just as fast. LOOPNE itself may actually execute slower; you'll only know by timing it. And the fact that you are confused about what it does makes it a source of coding errors.
I presently code the equivalent instruction sequence, because I got bit by a misunderstanding once. I'm not confused about CMP and JNE.
