I am trying to encode a jecxz instruction within inline assembly. The jecxz should jump to the next instruction (i.e., the nop).
int main() {
asm("lea -24(%rdi), %rcx");
asm("jecxz $0x00");
asm("nop");
}
But I am getting the following error.
gcc -o t main.c
main.c: Assembler messages:
main.c:7: Error: operand type mismatch for `jecxz'
What needs to be fixed here?
The most compatible solution is to write the line as follows:
asm("jecxz nextline; nextline:");
Regarding the asm("jecxz .+3") solution:
In 16-bit mode, a jcxz is encoded as e3 XX and a jecxz is encoded as 67 e3 XX
In 32-bit mode, a jecxz is encoded as e3 XX and a jcxz is encoded as 67 e3 XX
In 64-bit mode, a jrcxz is encoded as e3 XX and a jecxz is encoded as 67 e3 XX (jcxz is not available)
(Where XX is a signed-byte offset from the end of the instruction to the jump target)
So then, the line asm("jecxz .+3"); would assemble to 67 e3 00 in 16-bit and 64-bit code, and e3 01 in 32-bit code. The 32-bit case would be incorrect, as it would jump one byte past the end of the instruction, given that the 32-bit form is only two bytes wide.
If we use a label, we cover all three cases.
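For illustration, if the label immediately follows the instruction, the displacement byte is simply zero in every mode (derived from the encodings listed above, not an actual assembler listing):
16-bit: 67 e3 00    jecxz nextline
32-bit: e3 00       jecxz nextline
64-bit: 67 e3 00    jecxz nextline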
As per Michael Petch's comment, the correct usage is
asm("jecxz .+3");
which encodes the relative distance to the next immediate instruction.
I got an inconsistent result from an instruction.
I don't know why this happens, so I suspect the %es register is doing something weird, but I'm not sure.
Look at the code snippet below.
08048400 <main>:
8048400: bf 10 84 04 08 mov $HERE,%edi
8048405: 26 8b 07 mov %es:(%edi),%eax # <----- Result 1
8048408: bf 00 84 04 08 mov $main,%edi
804840d: 26 8b 07 mov %es:(%edi),%eax # <----- Result 2
08048410 <HERE>:
8048410: 11 11 adc %edx,(%ecx)
8048412: 11 11 adc %edx,(%ecx)
Result 1:
%eax : 0x11111111
Seeing this result, I guessed that mov %es:(%edi),%eax behaves like mov (%edi),%eax,
because 0x11111111 is stored at HERE.
Result 2:
%eax : 0x048410cc
However, Result 2 was quite different.
I expected %eax to be 0x048410bf, because that value is stored at main.
But the result was different, as you can see.
Question:
Why does this inconsistency happen?
By the way, the value of %es was always 0x7b during the execution of both instructions.
es is a red herring. The difference you see is 1 byte at main, cc vs. bf. That is because you set a software breakpoint at main and your debugger temporarily overwrote your actual code with an int3 instruction, which has machine code cc.
Do not set a breakpoint at a location you intend to read from, or use a hardware breakpoint instead, which does not modify the code.
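For example, in gdb:
(gdb) hbreak main
hbreak sets a hardware-assisted breakpoint (using a debug register where the hardware supports it) instead of patching the code with cc, so the bytes at main read back unmodified.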
I'm digging around in libc and found an interesting asm sequence that I'm trying to understand. glibc-2.27/malloc/malloc.c has:
static __thread tcache_perthread_struct *tcache = NULL;
...
# define MAYBE_INIT_TCACHE() \
if (__glibc_unlikely (tcache == NULL)) \
....
void *
__libc_malloc (size_t bytes) {
...
MAYBE_INIT_TCACHE()
gcc translates it to:
96a97: 48 8b 2d da 42 35 00 mov 0x3542da(%rip),%rbp # 3ead78 <.got+0x18>
...
96aa6: 64 48 8b 4d 00 mov %fs:0x0(%rbp),%rcx
At runtime, mov 0x3542da(%rip),%rbp yields a negative value, e.g.:
(gdb) p $rbp
$1 = (void *) 0xfffffffffffffec0
The %fs segment is loaded in __libc_setup_tls via the arch_prctl syscall (as I learned in another thread), and there seems to be a loop over program headers of type PT_TLS that probably determines the aggregated sizes of the TLS variables marked with gcc's __thread keyword. The __thread-marked variables seem to be accessed below the struct pthread TCB using negative indexes.
The negative indexes of the TLS variables seem to be stored in the global offset table; in the above example:
0x3542da(%rip) ... # 3ead78 <.got+0x18>
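As a small illustration of what I mean (my own sketch, not the glibc code; the variable name is made up):
__thread int tvar;   /* hypothetical __thread variable, analogous to tcache */

int get_tvar(void)
{
    /* local-exec model:   movl %fs:tvar@tpoff, %eax  (negative offset below the TCB)
       initial-exec model: the @gottpoff offset is first loaded from the GOT and
       then dereferenced via %fs, like the two instructions in the dump above */
    return tvar;
}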
Question:
Is there a description of which components (libc, ld, gcc) are involved in calculating the GOT TLS indexes, and how it is done in detail? I guess there may be a pre-calculated layout, but how are libraries that are loaded via libdl handled? etc...
I have the following assembly code
.machine power8
.abiversion 2
.section ".toc","aw"
.section .text
GLOBAL(myfunc)
myfunc:
stdu 1,-240(1)
mflr 0
std 0, 0*8(1)
mfcr 8
std 8, 1*8(1)
std 2, 2*8(1)
# Save all non-volatile registers R14-R31
std 14, 4*8(1)
...
# Save all the non-volatile FPRs
...
stwu 1, -48(1)
bl function_call
nop
addi 1, 1, 48
ld 0, 0*8(1)
mtlr 0
ld 8, 1*8(1)
ld 2, 2*8(1)
...
# epilogue, restore stack frame
This works fine with a static build, but a shared build gives a segmentation fault in
00000157.plt_call.__tls_get_addr_opt##GLIBC_2.22. Should the shared build be handled differently on POWER8 with respect to the TOC?
The calling convention is the same between POWER8 and previous processors. However, there have been changes with regard to TOC pointer (r2) handling between ABIv1 and ABIv2.
In ABIv2, the caller does not establish the TOC pointer in r2; the called function should do this for global entry points (i.e., where the TOC pointer may not be the same as that used in the callee). To do this, ABIv2 functions will have a prologue that sets r2:
0000000000000000 <foo>:
0: 00 00 4c 3c addis r2,r12,0
4: 00 00 42 38 addi r2,r2,0
This depends on r12 containing the address of the function's global entry point (those 0 values will be replaced with actual offsets at final link time).
I don't see any code setting r12 appropriately in your example. Are you sure you're complying with the v2 ABI there?
The ABIv2 spec is available here: https://members.openpowerfoundation.org/document/dl/576 Section 2.3.2 will be the most relevant for this issue.
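For illustration, a global-entry prologue for myfunc could look roughly like this (a sketch assuming GNU as and the ELFv2 ABI; the .TOC. offsets are resolved at final link time):
myfunc:
0:      addis 2, 12, (.TOC.-0b)@ha     # establish this function's TOC pointer from r12
        addi  2, 2, (.TOC.-0b)@l
        .localentry myfunc, .-myfunc   # local entry point: callers sharing the TOC skip the two instructions above
        stdu 1, -240(1)                # existing prologue continues as before
        ...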
I'm trying to determine the Visual Studio version (2002/2003, 2005, 2008, 2010, 2012, 2013, 2015) from a .obj file generated with the link-time code generation option.
The file I have, generated with MSVC2012, has the following COFF header contents:
File Header
+0 00 00 Machine - Unknown Machine
+2 FF FF NumberOfSections
+4 01 00 4C 01 TimeDateStamp
+8 70 94 F9 55 PointerToSymbolTable
+12 38 FE B3 0C NumberOfSymbols
+16 A5 D9 SizeOfOptionalHeader
+18 AB 4D Characteristics
Optional Header
+20 AC 9B Magic
+22 D6 B6 Linker Version Major/Minor
It seems that the initial 4 bytes being 00,00,FF,FF mark it as an LTCG object, and what follows is proprietary. None of the usual file header members make "sense" (maybe the timestamp is OK, I didn't check).
Does anyone know offhand if any part of this header is compiler-specific? All I need to determine is the MSVC major version used to compile the object...
It appears that there is a version, coded as <MAJOR:16:LE> 0x80 <MINOR:16:LE>, stored shortly after the header. E.g.:
17.00.61030 -> 0x11.0xEE66 -> 11 00 80 66 EE
19.00.23026 -> 0x13.0x59F2 -> 13 00 80 F2 59
What's needed is to figure out how to get to it reliably by offsets from preceding data.
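A rough heuristic sketch for locating it (assumptions: the marker appears in the portion of the file read into buf, the major version is in the 13..19 range covering VS2002 through VS2015, and find_msvc_version is a made-up helper name):
#include <stdio.h>

/* Scan buf for <MAJOR:16:LE> 0x80 <MINOR:16:LE>; purely heuristic, the
   pattern could also match unrelated bytes. */
static int find_msvc_version(const unsigned char *buf, size_t len)
{
    for (size_t i = 0; i + 5 <= len; i++) {
        unsigned major = buf[i] | (buf[i + 1] << 8);
        unsigned minor = buf[i + 3] | (buf[i + 4] << 8);
        if (buf[i + 2] == 0x80 && major >= 13 && major <= 19) {
            printf("possible compiler version: %u.%u\n", major, minor);
            return 0;
        }
    }
    return -1; /* marker not found */
}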
This is a related question, with no resolution...
TL;DR:
You can't get the compiler version with this file format, I guess...
Complete answer:
It looks like some variation of the "anonymous file format", described in winnt.h by the various ANON_OBJECT_HEADER_XXX structures (replace XXX with V2 or BIGOBJ).
Here is a copy of the ANON_OBJECT_HEADER_BIGOBJ found in winnt.h:
typedef struct ANON_OBJECT_HEADER_BIGOBJ {
/* same as ANON_OBJECT_HEADER_V2 */
WORD Sig1; // Must be IMAGE_FILE_MACHINE_UNKNOWN
WORD Sig2; // Must be 0xffff
WORD Version; // >= 2 (implies the Flags field is present)
WORD Machine; // Actual machine - IMAGE_FILE_MACHINE_xxx
DWORD TimeDateStamp;
CLSID ClassID; // CLSID is a 16 bytes struct (not original comment)
DWORD SizeOfData; // Size of data that follows the header
DWORD Flags; // 0x1 -> contains metadata
DWORD MetaDataSize; // Size of CLR metadata
DWORD MetaDataOffset; // Offset of CLR metadata
/* bigobj specifics */
DWORD NumberOfSections; // extended from WORD
DWORD PointerToSymbolTable;
DWORD NumberOfSymbols;
} ANON_OBJECT_HEADER_BIGOBJ;
The description matches:
Sig1 : 00 00
Sig2 : FF FF
Version : >=2
Machine : 0x14c
The other header structures (i.e., ANON_OBJECT_HEADER and ANON_OBJECT_HEADER_V2) are basically the same, but with fewer fields.
For the Version field, I found some information here:
http://www.geoffchappell.com/studies/msvc/link/dump/infiles/obj.htm
It looks like the Version field is "1" for anonymous files, and it seems the anonymous files and the so-called "import files" share the same characteristics, except that Version = 0 for the import file format (admittedly, I do not really know what that is).
But yeah, just by looking at the header, it seems that we have no information about which compiler version was used. And even then, when looking at .obj files generated with the /GL switch, they do not exactly follow this format, and I didn't find much information about them. I'd be glad if someone proved me wrong.
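For completeness, telling such an object apart from a regular COFF object is straightforward (a sketch based on the struct above; it only checks the signature and does not recover the compiler version):
#include <stdint.h>
#include <stdio.h>

/* buf points to the start of the .obj file contents */
static void classify_obj(const uint8_t *buf, size_t len)
{
    if (len >= 8 && buf[0] == 0x00 && buf[1] == 0x00 &&
        buf[2] == 0xFF && buf[3] == 0xFF) {
        unsigned version = buf[4] | (buf[5] << 8);   /* 1 or 2 in practice */
        unsigned machine = buf[6] | (buf[7] << 8);   /* e.g. 0x14c for x86 */
        printf("anonymous object, Version=%u, Machine=0x%X\n", version, machine);
    } else {
        printf("regular COFF object (Machine field at offset 0)\n");
    }
}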
In an attempt to make GCC not generate a load-modify-store operation every time I do |= or &=, I have defined the following macros:
#define bset(base, offset, mask) bmanip(set, base, offset, mask)
#define bclr(base, offset, mask) bmanip(clr, base, offset, mask)
#define bmanip(op, base, offset, mask) \
asm("pshx");\
asm("ldx " #base);\
asm("b" #op " " #offset ",x " #mask);\
asm("pulx")
And they work great; the disassembled binary is perfect.
The problem comes when I use more than one in sequence:
inline void spi_init()
{
bset(_io_ports, M6811_DDRD, 0x38);
bset(_io_ports, M6811_PORTD, 0x20);
bset(_io_ports, M6811_SPCR, (M6811_SPE | M6811_DWOM | M6811_MSTR));
}
This results in:
00002227 <spi_init>:
2227: 3c pshx
2228: fe 10 00 ldx 0x1000 <_io_ports>
222b: 1c 09 38 bset 0x9,x, #0x38
222e: 38 pulx
222f: 3c pshx
2230: fe 10 00 ldx 0x1000 <_io_ports>
2233: 1c 08 20 bset 0x8,x, #0x20
2236: 38 pulx
2237: 3c pshx
2238: fe 10 00 ldx 0x1000 <_io_ports>
223b: 1c 28 70 bset 0x28,x, #0x70
223e: 38 pulx
223f: 39 rts
Is there any way to get GCC (3.3.6-m68hc1x-20060122) to automatically optimize out the redundant stack operations?
gcc will always emit the assembly instructions you tell it to emit. So instead of explicitly writing code to load registers with the value you want to manipulate, you instead want to tell gcc to do this on your behalf. You can do this with register constraints.
Unfortunately the 6811 code generator doesn't seem to be a standard part of gcc --- I don't spot the documentation in the manual. So I can't point you at platform-specific bit of the docs. But the generic bit you need to read is here: http://gcc.gnu.org/onlinedocs/gcc-4.8.1/gcc/Extended-Asm.html#Extended-Asm
The syntax is freaky, but the summary is:
asm("instructions" : outputs : inputs);
...where inputs and outputs are lists of constraints, which tell gcc what value to put where. The classic example is:
asm("fsinx %1,%0" : "=f" (result) : "f" (angle));
f indicates that the named value needs to go into a floating point register; = indicates it's an output; then the names of the registers are substituted into the instruction.
So, you'll probably want something like this:
asm("b" #op " " #offset ",%0 " #mask : "=Z" (i) : "0" (i));
...where i is a variable containing the value you want to modify. Z you'll need to look up in the 6811 gcc docs --- it's a constraint which represents a register which is valid for the asm instruction which is being generated. The 0 indicates that the input shares a register with output 0, and is used for read/write values.
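As a generic illustration of the matching constraint (an x86 example rather than 6811, simply because it is easy to test; the idea carries over):
int bump(int value)
{
    /* "0" ties the input to output operand 0, so gcc keeps the value in one
       register and the instruction both reads and writes it */
    asm("incl %0" : "=r" (value) : "0" (value));
    return value;
}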
Because you've told gcc what register you want i to be in, it can integrate this knowledge into its register allocator and find the least-cost way to get i where you need it, with the least amount of code. (Sometimes no additional code.)
gcc inline assembly is deeply contorted and weird, but pretty powerful. It's worth spending some time to thoroughly understand the constraint system to get the best use out of it.
(Incidentally, I don't know 6811 code, but have you forgotten to put the result of the op somewhere? I'd expect to see an stx to match the ldx.)
Update: Oh, I see what bset is doing now --- it's writing the result back to a memory location, right? That's still doable but it's a bit more painful. You need to tell gcc that you're modifying that memory location, so that it knows not to rely on any cached value. You'll need to have an output parameter with constraint m which represents that location. Check the docs.
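A generic sketch of that idea (again on x86 rather than the 6811, just to show the shape of the constraint; the 6811 operand syntax and valid addressing modes will differ):
volatile unsigned char port;   /* hypothetical stand-in for an I/O register */

static void set_bits(unsigned char mask)
{
    /* "+m" exposes the memory location as a read/write operand, so gcc knows
       it is modified and will not rely on a stale cached value */
    asm volatile ("orb %1, %0" : "+m" (port) : "iq" (mask));
}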