addiu instruction encoding (MIPS, GCC)

Here is the opcode of an addiu instruction (16-bit instructions, GCC option -mmicromips):
full instruction: addiu sp,sp,-280
opcode, hex: 4F75
opcode, binary: 0100 1111 0111 0101, which I read as 01001 (instruction) 11101 (sp is $29) 110101
My purpose is to detect all instructions of this kind (addiu sp,sp,)
and then to decode the immediate, in the above case -280 (to follow the sp).
What I don't understand is the encoding of -280.
Linked to: How to get a call stack backtrace? (GCC, MIPS, no frame pointer)

microMIPS has a specialized ADDIUSP instruction, which the assembler chose to use here. The first 6 bits are the opcode 010011, the next 9 bits are the encoded immediate 110111010 = 0x1BA, and the LSB is reserved as 1.
The encoding of the immediate uses scaling by 4 and sign extension. Given that 0x1BA read as a 9-bit signed value is -70, the immediate is -70 * 4 = -280.
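For illustration, here is a minimal decoder in C following the field layout described above. It handles only the simple case; the microMIPS spec defines special-case encodings for some immediate field values that are not covered here.

#include <stdint.h>
#include <stdio.h>

/* Decode a 16-bit microMIPS ADDIUSP: bits 15..10 = opcode 010011,
 * bits 9..1 = immediate field, bit 0 = 1. The immediate is
 * sign-extended from 9 bits and scaled by 4. */
static int decode_addiusp(uint16_t insn, int32_t *imm)
{
    if ((insn >> 10) != 0x13 || (insn & 1) != 1)
        return 0;                         /* not ADDIUSP */
    int32_t field = (insn >> 1) & 0x1FF;  /* 9-bit immediate field */
    if (field & 0x100)                    /* sign-extend from 9 bits */
        field -= 0x200;
    *imm = field * 4;                     /* scale by 4 */
    return 1;
}

int main(void)
{
    int32_t imm;
    if (decode_addiusp(0x4F75, &imm))
        printf("addiu sp,sp,%d\n", (int)imm);  /* addiu sp,sp,-280 */
    return 0;
}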

Related

Dumping W32pServiceTable

I want to see which function in the win32k.sys driver handles a specific syscall number.
I attach WinDbg to a GUI process, since win32k.sys is a session-space driver.
Then I shift the first DWORD value right by 4 bits, add the base address of W32pServiceTable, and use the u command to show the function in WinDbg, but the address isn't valid. I checked KiSystemCall64 and it seems to be doing the same thing.
!process 0 0 winlogon.exe
.process /p (PROCESS addr)
.reload
Answer: the DWORD value from the table is loaded with this instruction:
movsxd r11,dword ptr [r10+rax*4]
W32pServiceTable DWORD values have bit 31 set to 1, so movsxd sets the upper 32 bits of the r11 register to 1; adding r11 to the table base address then leads to the correct function.
These values are negative, so you need to preserve that when you shift off the low bits. For example:
0: kd> dd win32k!W32pServiceTable L1
fffff88b`d1568000 ff8c8340
0: kd> u win32k!W32pServiceTable + ffffffff`fff8c834 L1
win32k!NtUserGetThreadState:
fffff88b`d14f4834 4883ec28 sub rsp,28h
Also, WinDbg is very picky/weird/broken/unpredictable when it comes to sign extension, so you need to be careful about how you do this. For example, this doesn't work:
0: kd> u win32k!W32pServiceTable + fff8c834 L1
fffff88c`d14f4834 ?? ???
That's due to WinDbg zero-extending the value. But this does:
0: kd> u win32k!W32pServiceTable + (fff8c834) L1
win32k!NtUserGetThreadState:
fffff88b`d14f4834 4883ec28 sub rsp,28h
The () causes WinDbg to sign-extend instead of zero-extend.
Lastly, this happens even on the normal service table; it's not just a Win32k thing.
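For reference, here is the same computation sketched in C, using the table base and entry dumped above. (On x64 the low 4 bits of each entry carry the in-stack argument count, which is why the value is shifted right by 4.)

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    /* Values taken from the WinDbg session above. */
    uint64_t table_base = 0xfffff88bd1568000ULL; /* win32k!W32pServiceTable */
    uint32_t entry      = 0xff8c8340;            /* first DWORD of the table */

    /* movsxd r11, dword ptr [...]: sign-extend the 32-bit entry to 64 bits. */
    int64_t offset = (int64_t)(int32_t)entry;

    /* Shift off the low 4 bits arithmetically so the sign is preserved
     * (implementation-defined in C, but arithmetic on GCC/Clang/MSVC). */
    offset >>= 4;

    /* Table base plus signed offset gives the handler address. */
    printf("%#llx\n", (unsigned long long)(table_base + offset));
    /* -> 0xfffff88bd14f4834, win32k!NtUserGetThreadState */
    return 0;
}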

How is thread-local storage via the gcc __thread keyword implemented on x86_64?

I'm digging around in libc and found an interesting asm sequence that I'm trying to understand. glibc-2.27/malloc/malloc.c has:
static __thread tcache_perthread_struct *tcache = NULL;
...
# define MAYBE_INIT_TCACHE() \
if (__glibc_unlikely (tcache == NULL)) \
....
void *
__libc_malloc (size_t bytes) {
...
MAYBE_INIT_TCACHE()
gcc translates it to:
96a97: 48 8b 2d da 42 35 00 mov 0x3542da(%rip),%rbp # 3ead78 <.got+0x18>
...
96aa6: 64 48 8b 4d 00 mov %fs:0x0(%rbp),%rcx
At runtime, mov 0x3542da(%rip),%rbp yields a negative value, i.e.:
(gdb) p $rbp
$1 = (void *) 0xfffffffffffffec0
The %fs segment is set up in __libc_setup_tls via the arch_prctl syscall (as I learned in another thread), and there seems to be a loop over program headers of type PT_TLS that probably determines the aggregated size of the TLS variables marked with gcc's __thread keyword. The __thread-marked variables seem to be accessed below the struct pthread TCB using negative offsets.
These negative TLS offsets seem to be stored in the global offset table, in the above example i.e.
0x3542da(%rip) ... # 3ead78 <.got+0x18>
Question:
Is there a description of which components (libc, ld, gcc) are involved in calculating the TLS offsets stored in the GOT, and how it is done in detail? I guess there is maybe a pre-calculated layout, but how are libraries handled that are loaded via libdl? etc...
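As a quick experiment (not a full answer), the negative-offset layout can be observed directly. This is a sketch assuming x86_64 Linux with glibc, where the first word of the TCB (%fs:0) points to the TCB itself:

#include <stdio.h>
#include <stdint.h>

static __thread int tls_var;

int main(void)
{
    uintptr_t fs_base;
    /* On x86_64 glibc, %fs:0 holds a pointer to the TCB itself,
     * so reading it yields the thread pointer. */
    __asm__("mov %%fs:0, %0" : "=r"(fs_base));

    long offset = (long)((uintptr_t)&tls_var - fs_base);
    printf("fs base = %#lx, &tls_var = %p, offset = %ld\n",
           (unsigned long)fs_base, (void *)&tls_var, offset);
    /* For initial-exec TLS in the main executable the offset is
     * negative, matching the 0xfffffffffffffec0 seen above. */
    return 0;
}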

POWER8 assembly code: shared-build issue with save and restore of the TOC

I have the following assembly code
.machine power8
.abiversion 2
.section ".toc","aw"
.section .text
GLOBAL(myfunc)
myfunc:
stdu 1,-240(1)
mflr 0
std 0, 0*8(1)
mfcr 8
std 8, 1*8(1)
std 2, 2*8(1)
# Save all non-volatile registers R14-R31
std 14, 4*8(1)
...
# Save all the non-volatile FPRs
...
stwu 1, -48(1)
bl function_call
nop
addi 1, 1, 48
ld 0, 0*8(1)
mtlr 0
ld 8, 1*8(1)
ld 2, 2*8(1)
...
# epilogue, restore stack frame
This works fine with a static build, but a shared build gives a segmentation fault in
00000157.plt_call.__tls_get_addr_opt##GLIBC_2.22. Should the shared build be handled differently on POWER8 with respect to the TOC?
The calling convention is the same between POWER8 and previous processors. However, there have been changes in the handling of the TOC pointer (r2) between ABIv1 and ABIv2.
In ABIv2, the caller does not establish the TOC pointer in r2; the called function should do this at its global entry point (i.e., when it may be entered with a TOC pointer different from its own). To do this, ABIv2 functions have a prologue that sets r2:
0000000000000000 <foo>:
0: 00 00 4c 3c addis r2,r12,0
4: 00 00 42 38 addi r2,r2,0
This depends on r12 containing the address of the function's global entry point (the 0 values shown will be replaced with actual offsets at final link time).
I don't see any code setting r12 appropriately in your example. Are you sure you're complying with the v2 ABI there?
The ABIv2 spec is available here: https://members.openpowerfoundation.org/document/dl/576 Section 2.3.2 will be the most relevant for this issue.

Getting GCC to optimize hand assembly

In an attempt to make GCC not generate a load-modify-store operation every time I do |= or &=, I have defined the following macros:
#define bset(base, offset, mask) bmanip(set, base, offset, mask)
#define bclr(base, offset, mask) bmanip(clr, base, offset, mask)
#define bmanip(op, base, offset, mask) \
asm("pshx");\
asm("ldx " #base);\
asm("b" #op " " #offset ",x " #mask);\
asm("pulx")
And they work great; the disassembled binary is perfect.
The problem comes when I use more than one in sequence:
inline void spi_init()
{
bset(_io_ports, M6811_DDRD, 0x38);
bset(_io_ports, M6811_PORTD, 0x20);
bset(_io_ports, M6811_SPCR, (M6811_SPE | M6811_DWOM | M6811_MSTR));
}
This results in:
00002227 <spi_init>:
2227: 3c pshx
2228: fe 10 00 ldx 0x1000 <_io_ports>
222b: 1c 09 38 bset 0x9,x, #0x38
222e: 38 pulx
222f: 3c pshx
2230: fe 10 00 ldx 0x1000 <_io_ports>
2233: 1c 08 20 bset 0x8,x, #0x20
2236: 38 pulx
2237: 3c pshx
2238: fe 10 00 ldx 0x1000 <_io_ports>
223b: 1c 28 70 bset 0x28,x, #0x70
223e: 38 pulx
223f: 39 rts
Is there any way to get GCC (3.3.6-m68hc1x-20060122) to automatically optimize out the redundant stack operations?
gcc will always emit the assembly instructions you tell it to emit. So instead of explicitly writing code to load registers with the value you want to manipulate, tell gcc to do this on your behalf. You can do this with register constraints.
Unfortunately the 6811 code generator doesn't seem to be a standard part of gcc; I don't spot the documentation in the manual. So I can't point you at the platform-specific bits of the docs. But the generic part you need to read is here: http://gcc.gnu.org/onlinedocs/gcc-4.8.1/gcc/Extended-Asm.html#Extended-Asm
The syntax is freaky, but the summary is:
asm("instructions" : outputs : inputs);
...where inputs and outputs are lists of constraints, which tell gcc what value to put where. The classic example is:
asm("fsinx %1,%0" : "=f" (result) : "f" (angle));
f indicates that the named value needs to go into a floating point register; = indicates it's an output; then the names of the registers are substituted into the instruction.
So, you'll probably want something like this:
asm("b" #op " " #offset ",%0 " #mask : "=Z" (i) : "0" (i));
...where i is a variable containing the value you want to modify. Z you'll need to look up in the 6811 gcc docs; it's a constraint representing a register that is valid for the asm instruction being generated. The 0 indicates that the input shares a register with output 0, and is used for read/write values.
Because you've told gcc what register you want i to be, it can integrate this knowledge into its register allocator and find the least-cost way to get i where you need it with the least amount of code. (Sometimes no additional code.)
gcc inline assembly is deeply contorted and weird, but pretty powerful. It's worth spending some time to thoroughly understand the constraint system to get the best use out of it.
(Incidentally, I don't know 6811 code, but have you forgotten to put the result of the op somewhere? I'd expect to see an stx to match the ldx.)
Update: Oh, I see what bset is doing now; it's writing the result back to a memory location, right? That's still doable, but it's a bit more painful. You need to tell gcc that you're modifying that memory location, so that it knows not to rely on any cached value. You'll need an output parameter with the constraint m representing that location. Check the docs.
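A rough sketch of what that could look like (hypothetical: the bset2 name, the exact operand template, and whether the m68hc11 backend accepts an m-constrained operand here all need checking against its docs):

/* Read-modify-write a memory location in one asm statement. The "+m"
 * constraint tells gcc the location is both read and written, so it
 * won't rely on a cached value, and it substitutes the addressing
 * mode for %0 itself; no manual pshx/ldx/pulx needed. */
#define bset2(loc, mask) \
    __asm__ volatile ("bset %0 #" #mask : "+m" (loc))

/* Hypothetical usage, with _io_ports a volatile I/O register array
 * as in the question: */
/*   bset2(_io_ports[M6811_DDRD], 0x38);   */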

ARM Linux kernel head-common.S

I was looking at head-common.S, at __mmap_switched:
.long init_thread_union + THREAD_START_SP # sp //for stack pointer
THREAD_START_SP is defined as THREAD_SIZE (8192) - 8 in thread_info.h,
i.e. the stack size is set to 8 KB (8192) minus 8 bytes.
Why minus 8 bytes?
I suspect it's related to DA (decrement after), right?
The 8-byte alignment is a requirement of the AAPCS.
In the AAPCS, chapter 5.2.1 The Stack:
The stack must also conform to the following constraint at a public interface:
SP mod 8 = 0. The stack must be double-word aligned.
http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.subset.swdev.abi/index.html
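To make the arithmetic concrete (a trivial sketch; the macro values mirror the definition quoted from thread_info.h):

#include <stdio.h>

#define THREAD_SIZE     8192                /* 8 KB kernel stack */
#define THREAD_START_SP (THREAD_SIZE - 8)   /* initial sp, 8 bytes below the top */

int main(void)
{
    /* 8184 is still a multiple of 8, so the initial stack pointer
     * satisfies the AAPCS "SP mod 8 = 0" constraint. */
    printf("THREAD_START_SP = %d (mod 8 = %d)\n",
           THREAD_START_SP, THREAD_START_SP % 8);
    return 0;
}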
