Linux kernel ARM exception stack init - linux-kernel

I am using Linux kernel 3.0.35 on Freescale i.MX6 (ARM Cortex-A9). After running into a kernel OOPS I tried to understand the exception stack initialization. Here is what I have uncovered so far.
In cpu_init() in arch/arm/kernel/setup.c, I see the exception stack getting initialized:
struct stack {
u32 irq[3];
u32 abt[3];
u32 und[3];
} ____cacheline_aligned;
static struct stack stacks[NR_CPUS];
void cpu_init(void)
{
struct stack *stk = &stacks[cpu];
...<snip>
/*
* setup stacks for re-entrant exception handlers
*/
__asm__ (
"msr cpsr_c, %1\n\t"
"add r14, %0, %2\n\t"
"mov sp, r14\n\t"
"msr cpsr_c, %3\n\t"
"add r14, %0, %4\n\t"
"mov sp, r14\n\t"
"msr cpsr_c, %5\n\t"
"add r14, %0, %6\n\t"
"mov sp, r14\n\t"
"msr cpsr_c, %7"
:
: "r" (stk),
PLC (PSR_F_BIT | PSR_I_BIT | IRQ_MODE),
"I" (offsetof(struct stack, irq[0])),
PLC (PSR_F_BIT | PSR_I_BIT | ABT_MODE),
"I" (offsetof(struct stack, abt[0])),
PLC (PSR_F_BIT | PSR_I_BIT | UND_MODE),
"I" (offsetof(struct stack, und[0])),
PLC (PSR_F_BIT | PSR_I_BIT | SVC_MODE)
: "r14");
I see that each stack has room for only three words. That is how the macro vector_stub in arch/arm/kernel/entry-armv.S uses it. It saves R0, LR (parent PC) and SPSR (parent CPSR) into those three words. Then it jumps to __irq_svc. That starts with a macro svc_entry which creates a stack frame
.macro svc_entry, stack_hole=0
UNWIND(.fnstart )
UNWIND(.save {r0 - pc} )
sub sp, sp, #(S_FRAME_SIZE + \stack_hole - 4)
That is also how I see the disassembled code from KGDB:
Dump of assembler code for function __irq_svc:
0xc01402c0 <+0>: 44 d0 4d e2 sub sp, sp, #68 ; 0x44
0xc01402c4 <+4>: 04 00 1d e3 tst sp, #4
0xc01402c8 <+8>: 04 d0 4d 02 subeq sp, sp, #4
0xc01402cc <+12>: fe 1f 8d e8 stm sp, {r1, r2, r3, r4, r5, r6, r7, r8, r9, r10, r11, r12}
0xc01402d0 <+16>: 0e 00 90 e8 ldm r0, {r1, r2, r3}
0xc01402d4 <+20>: 30 50 8d e2 add r5, sp, #48 ; 0x30
0xc01402d8 <+24>: 00 40 e0 e3 mvn r4, #0
0xc01402dc <+28>: 44 00 8d e2 add r0, sp, #68 ; 0x44
0xc01402e0 <+32>: 04 00 80 02 addeq r0, r0, #4
0xc01402e4 <+36>: 04 10 2d e5 push {r1} ; (str r1, [sp, #-4]!)
0xc01402e8 <+40>: 0e 10 a0 e1 mov r1, lr
0xc01402ec <+44>: 1f 00 85 e8 stm r5, {r0, r1, r2, r3, r4}
0xc01402f0 <+48>: ad 96 a0 e1 lsr r9, sp, #13
0xc01402f4 <+52>: 89 96 a0 e1 lsl r9, r9, #13
0xc01402f8 <+56>: 04 80 99 e5 ldr r8, [r9, #4]
0xc01402fc <+60>: 01 70 88 e2 add r7, r8, #1
0xc0140300 <+64>: 04 70 89 e5 str r7, [r9, #4]
0xc0140304 <+68>: 54 50 9f e5 ldr r5, [pc, #84] ; 0xc0140360
0xc0140308 <+72>: 00 50 95 e5 ldr r5, [r5]
0xc014030c <+76>: 0c 60 95 e5 ldr r6, [r5, #12]
0xc0140310 <+80>: 4c e0 9f e5 ldr lr, [pc, #76] ; 0xc0140364
0xc0140314 <+84>: 07 0b c6 e3 bic r0, r6, #7168 ; 0x1c00
0xc0140318 <+88>: 1d 00 50 e3 cmp r0, #29
0xc014031c <+92>: 00 00 50 31 cmpcc r0, r0
0xc0140320 <+96>: 0e 00 50 11 cmpne r0, lr
0xc0140324 <+100>: 00 00 50 21 cmpcs r0, r0
0xc0140328 <+104>: 0d 10 a0 11 movne r1, sp
0xc014032c <+108>: 28 e0 4f 12 subne lr, pc, #40 ; 0x28
0xc0140330 <+112>: 32 eb ff 1a bne 0xc013b000 <asm_do_IRQ>
0xc0140334 <+116>: 04 80 89 e5 str r8, [r9, #4]
0xc0140338 <+120>: 00 00 99 e5 ldr r0, [r9]
0xc014033c <+124>: 00 00 38 e3 teq r8, #0
0xc0140340 <+128>: 00 00 a0 13 movne r0, #0
0xc0140344 <+132>: 02 00 10 e3 tst r0, #2
0xc0140348 <+136>: 06 00 00 1b blne 0xc0140368 <svc_preempt>
0xc014034c <+140>: 40 40 9d e5 ldr r4, [sp, #64] ; 0x40
0xc0140350 <+144>: 04 f0 6f e1 msr SPSR_fsxc, r4
0xc0140354 <+148>: 1f f0 7f f5 clrex
0xc0140358 <+152>: ff ff dd e8 ldm sp, {r0, r1, r2, r3, r4, r5, r6, r7, r8, r9, r10, r11, r12, sp, lr, pc}^
End of assembler dump.
During an exception, SP is the banked R13. If I am following correctly, there is no room for this frame on that stack. That means I must have missed something. Is there some other place where the exception stacks are initialized?

tl;dr - We switch modes to supervisor and use that stack.
You are missing the key point of where control is handed to the CPU via the vector table and the mode is switched. See: entry-armv.S and __vectors_start. The vector stubs is the code where control is initially sent after the branch in the main vector table. The vector_stub macro saves three items; a corrected lr, r0 and the spsr of the excepted mode (as you noted).
The point you miss is, after this all exceptions switch to SVC_MODE and as such use the current tasks stack, which also has the thread_info structure. mode switching is a tough concept to get in ARM system level assembler. Registers that were previously set are now completely different. Pay attention to msr and cps type instructions. Things can change completely after them; I have been confused by this dozens of times.
The spsr is used as an index into a vector_stub table, which will normally jump to either __irq_svc or __irq_usr. Just scroll down to look at the bottom of the entry-arm.S which you already found.
Related: Physical address of ARM-Linux vector table

Related

where are the shared libs in macOS?

I am trying to find the libdyld.dylib file in macOS but I can't find it. according to the lldb debugger, it supposes to be at /usr/lib/system/libdyld.dylib but it is not there... I read this apple support but it got me more confused...
I understand that there is dyld that load the code from somewhere... but from where ? Where is the code of this lib comes from?
using macOS 12.5 Monterey.
mac M1.
update:
I looked into /usr/lib/dyld (there is no /System/Library/dyld file in my system). The code that I see in the lldb when stepping into lib function is different from the code I see when disassembling the same function in the /usr/lib/dyld. E.g let's take dlopen - the debugger (lldb) shows that there are 2 implementations
(lldbinit) image lookup -n dlopen
1 match found in /usr/lib/dyld:
Address: dyld[0x0000000000025954] (dyld.__TEXT.__text + 149844)
Summary: dyld`dyld4::APIs::dlopen(char const*, int)
1 match found in /usr/lib/system/libdyld.dylib:
Address: libdyld.dylib[0x000000018033329c] (libdyld.dylib.__TEXT.__text + 1532)
Summary: libdyld.dylib`dlopen
but when stepping into the dlopen function it choose the one in /usr/lib/system/libdyld.dylib and not in /usr/lib/dyld:
(lldbinit) image lookup -v -a $pc
Address: libdyld.dylib[0x000000018033329c] (libdyld.dylib.__TEXT.__text + 1532)
Summary: libdyld.dylib`dlopen
Module: file = "/usr/lib/system/libdyld.dylib", arch = "arm64e"
Symbol: id = {0x000001a3}, range = [0x0000000182f5b29c-0x0000000182f5b2d0), name="dlopen"
Also the asm is differnet. When stepping into dlopen with lldb I see the next instructions:
dlopen # /usr/lib/system/libdyld.dylib:
-> 0x182f5b29c (0x18033329c): e2 03 01 aa mov x2, x1
0x182f5b2a0 (0x1803332a0): e1 03 00 aa mov x1, x0
0x182f5b2a4 (0x1803332a4): 68 7b 2c b0 adrp x8, 364397
0x182f5b2a8 (0x1803332a8): 00 39 43 f9 ldr x0, [x8, #0x670]
0x182f5b2ac (0x1803332ac): 10 00 40 f9 ldr x16, [x0]
0x182f5b2b0 (0x1803332b0): f1 03 00 aa mov x17, x0
0x182f5b2b4 (0x1803332b4): 51 7f ec f2 movk x17, #0x63fa, lsl #48
0x182f5b2b8 (0x1803332b8): 30 1a c1 da autda x16, x17
0x182f5b2bc (0x1803332bc): 03 0e 47 f8 ldr x3, [x16, #0x70]!
0x182f5b2c0 (0x1803332c0): e4 03 10 aa mov x4, x16
0x182f5b2c4 (0x1803332c4): f0 03 04 aa mov x16, x4
0x182f5b2c8 (0x1803332c8): 30 e6 f7 f2 movk x16, #0xbf31, lsl
and when disassembling dlopen in dyld I see the next instructions:
(jtool2 -d /usr/lib/dyld | less)
__ZN5dyld44APIs6dlopenEPKci:
25954 0xd503237f PACIBSP
25958 0xa9bd57f6 STP X22, X21, [SP, #-48]!
2595c 0xa9014ff4 STP X20, X19, [SP, #16]
25960 0xa9027bfd STP X29, X30, [SP, #32]
25964 0x910083fd ADD X29, SP, #32
25968 0xaa0203f3 _MOV_R X19, X2 R19 = R2 (0x0)
2596c 0xaa0103f5 _MOV_R X21, X1 R21 = R1 (0x0)
25970 0xaa0003f6 _MOV_R X22, X0 R22 = R0 (0x0)
25974 0xaa1e03f4 _MOV_R X20, X30 R20 = R30 (0x0)
25978 0xdac143f4 PACIA X20, X31
2597c 0xf9400408 _LDR X8, [X0, #8] ...R8 = *(R0 + 8) = *0x8 = 0x780000002
25980 0xb9403508 _LDR W8, [X8, #52] ...R8 = *(R8 + 52) = *0x3c = 0x6000000000000
25984 0x7100091f CMP W8, #2
25988 0xfa400824 CCMP
2598c 0x540000e0 B.EQ 0x259a8
25990 0xaa1503e0 _MOV_R X0, X21 R0 = R21 (0x0)
So it is a still a mistory, where does this dlopen code comes from ?
Since 11 version, Apple made some efforts to make harder to reverse optimize their shared libs.
Long story short, they merged most libs and frameworks into a single binary, which is loaded into memory on system start.
You can find it here: /System/Library/dyld/ (folder), there may be several file versions for Intel and arm archs.
All such system libraries referenced from mach-o section of the binary you run are mapped then directly from the loaded dyld cache, so Apple does not need libs to be on filesystem anymore. They made some efforts for compatibility, so for most apps it still looks like they are present on a disk though.
However, as Apple have to publish parts of their sources due to using a lot of opensource stuff, folks found the code responsible for the dyld cache and created several extractors, like this one:
https://github.com/keith/dyld-shared-cache-extractor
(you can even install it with brew)
So if you need to look inside some library - you will need to install extractor, perform extraction, and then you will have what you want.

How to Modify a String in Arm Assembly [duplicate]

So I'm having trouble with my program. It's supposed to read in a text file
that has a number on each line. It then stores that in an array, sorts it using selection sort, and then outputs it to a new file. The reading of and writing to the file work perfectly fine but my code for the sort isn't working properly. When I run the program, it only seems to store some of the numbers
in the array and then a bunch of zeroes.
So if my input is 112323, 32, 12, 19, 2, 1, 23. The output is 0,0,0,0, 2,1,23. I'm pretty sure the problem's with how I'm storing and loading from the array
onto the registers because assuming that part works, I can't find any reason why the selection sort algorithm shouldn't work.
Ok thanks to your help, I figured out that I needed to change the load and store instruction so that it matches the specifier used (ldr -> ldrb and str -> strb). But I need to make a sorting algorithm that works for 32 bit numbers so which combination of specifiers and load/store instructions would allow me to do that? Or would I have to load/store 8 bits a time? And if so, how would I do that?
.data
.balign 4
readfile: .asciz "myfile.txt"
.balign 4
readmode: .asciz "r"
.balign 4
writefile: .asciz "output.txt"
.balign 4
writemode: .asciz "w"
.balign 4
return: .word 0
.balign 4
scanformat: .asciz "%d"
.balign 4
printformat: .asciz "%d\n"
.balign 4
a: .space 32
.text
.global main
.global fopen
.global fprintf
.global fclose
.global fscanf
.global printf
main:
ldr r1, =return
str lr, [r1]
ldr r0, =readfile
ldr r1, =readmode
bl fopen
mov r4, r0
mov r5, #0
ldr r6, =a
loop:
cmp r5, #7
beq sort
mov r0, r4
ldr r1, =scanformat
mov r2, r6
bl fscanf
add r5, r5, #1
add r6, r6, #1
b loop
sort:
mov r5,#0 /*array parser for first loop*/
mov r6, #0 /* #stores index of minimum*/
mov r7, #0 /* #temp*/
mov r8, #0 /*# array parser for second loop*/
mov r9, #7 /*# stores length of array*/
ldr r10, =a /*# the array*/
mov r11, #0 /*#used to obtain offset for min*/
mov r12, #0 /*# used to obtain offset for second parser access*/
loop3:
cmp r5, r9 /*# check if first parser reached end of array*/
beq write /* #if it did array is sorted write it to file*/
mov r6, r5 /*#set the min index to the current position*/
mov r8, r6 /*#set the second parser to where first parser is at*/
b loop4 /*#start looking for min in this subarray*/
loop4:
cmp r8, r9 /* #if reached end of list min is found*/
beq increment /* #get out of this loop and increment 1st parser**/
lsl r7, r6, #3 /*multiplies min index by 8 */
ADD r7, r10, r7 /* adds offset to r10 address storing it in r7 */
ldr r11, [r7] /* loads value of min in r11 */
lsl r7, r8, #3 /* multiplies second parse index by 8 */
ADD r7, r10, r7 /* adds offset to r10 address storing in r7 */
ldr r12, [r7] /* loads value of second parse into r12 */
cmp r11, r12 /* #compare current min to the current position of 2nd parser !!!!!*/
movgt r6, r8 /*# set new min to current position of second parser */
add r8, r8, #1 /*increment second parser*/
b loop4 /*repeat */
increment:
lsl r11, r5, #3 /* multiplies first parse index by 8 */
ADD r11, r10, r11 /* adds offset to r10 address stored in r11*/
ldr r8, [r11] /* loads value in memory address in r11 to r8*/
lsl r12, r6, #3 /*multiplies min index by 8 */
ADD r12, r10, r12 /*ads offset to r10 address stored in r12 */
ldr r7, [r12] /* loads value in memory address in r12 to r7 */
str r8, [r12] /* # stores value of first parser where min was !!!!!*/
str r7, [r11] /*# store value of min where first parser was !!!!!*/
add r5, r5, #1 /*#increment the first parser*/
ldr r0,=printformat
mov r1, r7
bl printf
b loop3 /*#go to loop1*/
write:
mov r0, r4
bl fclose
ldr r0, =writefile
ldr r1, =writemode
bl fopen
mov r4, r0
mov r5, #0
ldr r6, =a
loop2:
cmp r5, #7
beq end
mov r0, r4
ldr r1, =printformat
ldrb r2, [r6]
bl fprintf
add r5, r5, #1
add r6, r6, #1
b loop2
end:
mov r0, r4
bl fclose
ldr r0, =a
ldr r0, [r0]
ldr lr, =return
ldr lr, [lr]
bx lr
I figured out that I needed to change the load and store instruction
so that it matches the specifier used (ldr -> ldrb and str -> strb).
But I need to make a sorting algorithm that works for 32 bit numbers
so which combination of specifiers and load/store instructions would
allow me to do that?
If you want to read 32b (4 bytes) values from memory, you have to have 4 bytes values in memory to begin with. Well that should not be surprising :)
Eg if your input is numbers 1, 2, 3, 4, each number is 32b value than in memory that would be
0x00000000: 01 00 00 00 | 02 00 00 00 <- 32b values of 1 & 2
0x00000008: 03 00 00 00 | 04 00 00 00 <- 32b values of 3 & 4
In such case ldr would read 32b each time and you would get 1, 2, 3, 4 with each read in register.
Now, you have in memory byte values (based on your statement that `ldrb` gives right result), eg
0x00000000: 01
0x00000001: 02
0x00000002: 03
0x00000003: 04
or same in one line
0x00000000: 01 02 03 04
So reading 8bit by ldrb gives you numbers 1, 2, 3, 4
But ldr would do read 32b value from memory (all 4 bytes at once) and you would get 32b value 0x04030201 in register.
Note: examples for little-endian systems

Errata in "Practical Reverse Engineering"?

I've just started the book Practical Reverse Engineering by Bruce Dang et alia, and am confused about a portion of the "walk-through" at the end of chapter one. This is the relevant portion of code:
65: ...
66: loc_10001d16:
67: mov eax, [ebp-118h]
68: mov ecx, [ebp-128h]
69: jmp short loc_10001d2a (line 73)
70: loc_10001d24:
71: mov eax, [ebp+0ch]
72: mov ecx, [ebp+0ch]
73: loc_10001d2a:
74: cmp eax, ecx
75: pop esi
76: jnz short loc_10001D38 (line 82)
77: xor eax, eax
78: pop edi
79: mov esp, ebp
80: pop ebp
81: retn 0ch
82: ...
And the authors' commentary:
"After the loop exits, execution resumes at line 66. Lines 67–68 save the matching PROCESSENTRY32’s th32ParentProcessID/th32ProcessID in EAX/ECX and
continue execution at 73. Notice that Line 66 is also a jump target in line 43.
Lines 70–74 read the fdwReason parameter of DllMain (EBP+C) and check
whether it is 0 (DLL_PROCESS_DETACH). If it is, the return value is set to 0 and
it returns; otherwise, it goes to line 82."
This is not how I interpreted the code when reading it; surely any jump to loc_10001d24 (line 70) will cause the function to terminate with return value 0 unconditionally, and not only if the value at ebp+0x0c is 0? (I assume that poping into esi does not affect the eflags register, and that the jump in line 76 conditions on the result of cmp eax, ecx in line 74?) This is also consistent with earlier portions in the code, which jump to loc_10001d24 if various called functions return with values indicating failure.
In addition, I thought the point of the section starting at line 66 was to also return with value 0 if PROCESSENTRY32 (a structure defined earlier, starting at position ebp-0x130 in memory) has equal th32ParentProcessID (ebp-0x118 in memory) and th32ProcessID (ebp-0x128 in memory) entries; is this correct? The authors' commentary did not seem to indicate this.
As a more general question, even just chapter 1 of the book has seemed to have had quite a large number of typos; does anyone know of a webpage collecting errata from the book anywhere?
Yes, ECX and EAX are both loaded from the same memory location, so unless something else has a pointer to it and is changing it asynchronously, cmp x,x / jne will always be not-taken. Unlike floating-point, ever possible integer is equal to itself.
And you're correct, pop doesn't change EFLAGS, as per Intel's manuals: https://www.felixcloutier.com/x86/pop.
To check whether a memory location is zero, you can load it into a reg for test eax,eax / jnz
or cmp dword ptr [ebp + 0xc], 0 / jne.
(JNE and JNZ are the same instruction; the different mnemonics let you express the semantic meanings of equality or directly being zero based on ZF being set according to the value itself.)
Lines 70–74 read the fdwReason parameter of DllMain (EBP+C) and check whether it is 0 (DLL_PROCESS_DETACH)
This is bogus. If the book is full of stuff like that, that doesn't sound like a good book.
The cmp eax,ecx only makes any sense when reached from the path that loaded 2 different values. (And couldn't use test for that, x & y != 0 doesn't tell you whether they were equal.) This seems unlikely to be real compiler output.
This is the full listing. It's part of malware found in the wild:
01: ; BOOL __stdcall DllMain(HINSTANCE hinstDLL, DWORD fdwReason, LPVOID lpvReserved)
02: _DllMain#12 proc near
03: 55 push ebp
04: 8B EC mov ebp, esp
05: 81 EC 30 01 00+ sub esp, 130h
06: 57 push edi
07: 0F 01 4D F8 sidt fword ptr [ebp-8]
08: 8B 45 FA mov eax, [ebp-6]
09: 3D 00 F4 03 80 cmp eax, 8003F400h
10: 76 10 jbe short loc_10001C88 (line 18)
11: 3D 00 74 04 80 cmp eax, 80047400h
12: 73 09 jnb short loc_10001C88 (line 18)
13: 33 C0 xor eax, eax
14: 5F pop edi
15: 8B E5 mov esp, ebp
16: 5D pop ebp
17: C2 0C 00 retn 0Ch
18: loc_10001C88:
19: 33 C0 xor eax, eax
20: B9 49 00 00 00 mov ecx, 49h
21: 8D BD D4 FE FF+ lea edi, [ebp-12Ch]
22: C7 85 D0 FE FF+ mov dword ptr [ebp-130h], 0
23: 50 push eax
24: 6A 02 push 2
25: F3 AB rep stosd
26: E8 2D 2F 00 00 call CreateToolhelp32Snapshot
27: 8B F8 mov edi, eax
28: 83 FF FF cmp edi, 0FFFFFFFFh
29: 75 09 jnz short loc_10001CB9 (line 35)
30: 33 C0 xor eax, eax
31: 5F pop edi
32: 8B E5 mov esp, ebp
33: 5D pop ebp
34: C2 0C 00 retn 0Ch
35: loc_10001CB9:
36: 8D 85 D0 FE FF+ lea eax, [ebp-130h]
37: 56 push esi
38: 50 push eax
39: 57 push edi
40: C7 85 D0 FE FF+ mov dword ptr [ebp-130h], 128h
41: E8 FF 2E 00 00 call Process32First
42: 85 C0 test eax, eax
43: 74 4F jz short loc_10001D24 (line 70)
44: 8B 35 C0 50 00+ mov esi, ds:_stricmp
45: 8D 8D F4 FE FF+ lea ecx, [ebp-10Ch]
46: 68 50 7C 00 10 push 10007C50h
47: 51 push ecx
48: FF D6 call esi
49: 83 C4 08 add esp, 8
50: 85 C0 test eax, eax
51: 74 26 jz short loc_10001D16 (line 66)
52: loc_10001CF0:
53: 8D 95 D0 FE FF+ lea edx, [ebp-130h]
54: 52 push edx
55: 57 push edi
56: E8 CD 2E 00 00 call Process32Next
57: 85 C0 test eax, eax
58: 74 23 jz short loc_10001D24 (line 70)
59: 8D 85 F4 FE FF+ lea eax, [ebp-10Ch]
60: 68 50 7C 00 10 push 10007C50h
61: 50 push eax
62: FF D6 call esi
63: 83 C4 08 add esp, 8
64: 85 C0 test eax, eax
65: 75 DA jnz short loc_10001CF0 (line 52)
66: loc_10001D16:
67: 8B 85 E8 FE FF+ mov eax, [ebp-118h]
68: 8B 8D D8 FE FF+ mov ecx, [ebp-128h]
69: EB 06 jmp short loc_10001D2A (line 73)
70: loc_10001D24:
71: 8B 45 0C mov eax, [ebp+0Ch]
72: 8B 4D 0C mov ecx, [ebp+0Ch]
73: loc_10001D2A:
74: 3B C1 cmp eax, ecx
75: 5E pop esi
76: 75 09 jnz short loc_10001D38 (line 82)
77: 33 C0 xor eax, eax
78: 5F pop edi
79: 8B E5 mov esp, ebp
80: 5D pop ebp
81: C2 0C 00 retn 0Ch
82: loc_10001D38:
83: 8B 45 0C mov eax, [ebp+0Ch]
84: 48 dec eax
85: 75 15 jnz short loc_10001D53 (line 93)
86: 6A 00 push 0
87: 6A 00 push 0
88: 6A 00 push 0
89: 68 D0 32 00 10 push 100032D0h
90: 6A 00 push 0
91: 6A 00 push 0
92: FF 15 20 50 00+ call ds:CreateThread
93: loc_10001D53:
94: B8 01 00 00 00 mov eax, 1
95: 5F pop edi
96: 8B E5 mov esp, ebp
97: 5D pop ebp
98: C2 0C 00 retn 0Ch
99: _DllMain#12 endp
So lines 70-74 make no sense on their own, but do serve the original purpose - if either Process32First()/Process32Next() returns FALSE then the code jumps here and eventually exits with 0.
And if the desired process was found then eax/ecx are set to ParentProcessID/ProcessID respectively so the function will continue.
Anyway, there's also lines 83-85 which the books states:
...with lpStartAddress as 0x100032D0. This block can be decompiled as follows:
if (fdwReason == DLL_PROCESS_DETACH) { return FALSE; }
if (fdwReason == DLL_THREAD_ATTACH || fdwReason == DLL_THREAD_DETACH) { return TRUE; }
CreateThread(0, 0, (LPTHREAD_START_ROUTINE) 0x100032D0, 0, 0, 0);
return TRUE;
Lines 83-85 actually check if fdwReason equals DLL_PROCESS_ATTACH or not (bypassing the call to CreateThread if not, which makes perfect sense), and there's no special case for DLL_PROCESS_DETACH.
I'll say that the book certainly lacks proper structure, some things the book takes for granted, other maybe mundane things are emphasizes. Still a very good resource.
Oh well, who said this was easy.

Why do I find some never called instructions nopl, nopw after ret or jmp in GCC compiled code? [duplicate]

I've been working with C for a short while and very recently started to get into ASM. When I compile a program:
int main(void)
{
int a = 0;
a += 1;
return 0;
}
The objdump disassembly has the code, but nops after the ret:
...
08048394 <main>:
8048394: 55 push %ebp
8048395: 89 e5 mov %esp,%ebp
8048397: 83 ec 10 sub $0x10,%esp
804839a: c7 45 fc 00 00 00 00 movl $0x0,-0x4(%ebp)
80483a1: 83 45 fc 01 addl $0x1,-0x4(%ebp)
80483a5: b8 00 00 00 00 mov $0x0,%eax
80483aa: c9 leave
80483ab: c3 ret
80483ac: 90 nop
80483ad: 90 nop
80483ae: 90 nop
80483af: 90 nop
...
From what I learned nops do nothing, and since after ret wouldn't even be executed.
My question is: why bother? Couldn't ELF(linux-x86) work with a .text section(+main) of any size?
I'd appreciate any help, just trying to learn.
First of all, gcc doesn't always do this. The padding is controlled by -falign-functions, which is automatically turned on by -O2 and -O3:
-falign-functions
-falign-functions=n
Align the start of functions to the next power-of-two greater than n, skipping up to n bytes. For instance,
-falign-functions=32 aligns functions to the next 32-byte boundary, but -falign-functions=24 would align to the next 32-byte boundary only
if this can be done by skipping 23 bytes or less.
-fno-align-functions and -falign-functions=1 are equivalent and mean that functions will not be aligned.
Some assemblers only support this flag when n is a power of two; in
that case, it is rounded up.
If n is not specified or is zero, use a machine-dependent default.
Enabled at levels -O2, -O3.
There could be multiple reasons for doing this, but the main one on x86 is probably this:
Most processors fetch instructions in aligned 16-byte or 32-byte blocks. It can be
advantageous to align critical loop entries and subroutine entries by 16 in order to minimize
the number of 16-byte boundaries in the code. Alternatively, make sure that there is no 16-byte boundary in the first few instructions after a critical loop entry or subroutine entry.
(Quoted from "Optimizing subroutines in assembly
language" by Agner Fog.)
edit: Here is an example that demonstrates the padding:
// align.c
int f(void) { return 0; }
int g(void) { return 0; }
When compiled using gcc 4.4.5 with default settings, I get:
align.o: file format elf64-x86-64
Disassembly of section .text:
0000000000000000 <f>:
0: 55 push %rbp
1: 48 89 e5 mov %rsp,%rbp
4: b8 00 00 00 00 mov $0x0,%eax
9: c9 leaveq
a: c3 retq
000000000000000b <g>:
b: 55 push %rbp
c: 48 89 e5 mov %rsp,%rbp
f: b8 00 00 00 00 mov $0x0,%eax
14: c9 leaveq
15: c3 retq
Specifying -falign-functions gives:
align.o: file format elf64-x86-64
Disassembly of section .text:
0000000000000000 <f>:
0: 55 push %rbp
1: 48 89 e5 mov %rsp,%rbp
4: b8 00 00 00 00 mov $0x0,%eax
9: c9 leaveq
a: c3 retq
b: eb 03 jmp 10 <g>
d: 90 nop
e: 90 nop
f: 90 nop
0000000000000010 <g>:
10: 55 push %rbp
11: 48 89 e5 mov %rsp,%rbp
14: b8 00 00 00 00 mov $0x0,%eax
19: c9 leaveq
1a: c3 retq
This is done to align the next function by 8, 16 or 32-byte boundary.
From “Optimizing subroutines in assembly language” by A.Fog:
11.5 Alignment of code
Most microprocessors fetch code in aligned 16-byte or 32-byte blocks. If an importantsubroutine entry or jump label happens to be near the end of a 16-byte block then themicroprocessor will only get a few useful bytes of code when fetching that block of code. Itmay have to fetch the next 16 bytes too before it can decode the first instructions after thelabel. This can be avoided by aligning important subroutine entries and loop entries by 16.
[...]
Aligning a subroutine entry is as simple as putting as many
NOP
's as needed before thesubroutine entry to make the address divisible by 8, 16, 32 or 64, as desired.
As far as I remember, instructions are pipelined in cpu and different cpu blocks (loader, decoder and such) process subsequent instructions. When RET instructions is being executed, few next instructions are already loaded into cpu pipeline. It's a guess, but you can start digging here and if you find out (maybe the specific number of NOPs that are safe, share your findings please.

Ineffective stack management and registers allocation

Consider the following code:
extern unsigned int foo(char c, char **p, unsigned int *n);
unsigned int test(const char *s, char **p, unsigned int *n)
{
unsigned int done = 0;
while (*s)
done += foo(*s++, p, n);
return done;
}
Output in Assembly:
00000000 <test>:
0: b5f8 push {r3, r4, r5, r6, r7, lr}
2: 0005 movs r5, r0
4: 000e movs r6, r1
6: 0017 movs r7, r2
8: 2400 movs r4, #0
a: 7828 ldrb r0, [r5, #0]
c: 2800 cmp r0, #0
e: d101 bne.n 14 <test+0x14>
10: 0020 movs r0, r4
12: bdf8 pop {r3, r4, r5, r6, r7, pc}
14: 003a movs r2, r7
16: 0031 movs r1, r6
18: f7ff fffe bl 0 <foo>
1c: 3501 adds r5, #1
1e: 1824 adds r4, r4, r0
20: e7f3 b.n a <test+0xa>
C code compiled using arm-none-eabi-gcc versions: 4.9.1, 5.4.0, 6.3.0 and 7.1.0 on
Linux host. Assembly output is the same for all GCC versions.
CFLAGS := -Os -march=armv6-m -mcpu=cortex-m0plus -mthumb
My understanding of the execution flow is following:
Push R3-R7 + LR onto the stack (totally unclear)
Move R0 to R5 (this is clear)
Move R1 to R6 and R2 to R7 (totally unclear)
Dereference R5 into R0 (This is clear)
Compare R0 with 0 (This is clear)
If R0 != 0 go to line 14: - Restore R1 from R6 and R2 from R7 and call foo(),
If R0 == 0 stay at line 10, restore R3 - R7 + PC from stack (totally unclear)
Increment R5 (clear)
accumulate result from foo() (clear)
Branch back to line a: (clear)
My own Assembly. Not extensively tested, but definitely I would not need more than R4 + LR to be pushed onto the stack:
EDIT: According to the provided answers, my example from below will fail due to R1 and R2 not being persistent through call to foo()
51 unsigned int __attribute__((naked)) test_asm(const char *s, char **p, unsigned int *n)
52 {
53 // r0 - *s (move ptr to r3 and dereference it to r0)
54 // r1 - **p
55 // r2 - *n
56 asm volatile(
57 " push {r4, lr} \n\t"
58 " movs r4, #0 \n\t"
59 " movs r3, r0 \n\t"
60 "1: \n\t"
61 " ldrb r0, [r3, #0] \n\t"
62 " cmp r0, #0 \n\t"
63 " beq 2f \n\t"
64 " bl foo \n\t"
65 " add r4, r4, r0 \n\t"
66 " add r3, #1 \n\t"
67 " b 1b \n\t"
68 "2: \n\t"
69 " movs r0, r4 \n\t"
70 " pop {r4, pc} \n\t"
71 );
72 }
Questions:
Why GCC stores so many registers for such trivial function?
Why it pushes R3 while it is written in ABI that R0-R3 are argument registers
and supposed to be a caller save and should be safely used inside called function
in this case test()
Why it copy R1 to R6 and R2 to R7 while the prototype of extern function almost
ideally matches the test() function. So R1 and R2 are already ready to be passed
to foo() routine. My understanding is that only R0 need to be dereferenced before
call to foo()
LR must be saved since test is not a leaf function. r5-r7 are used by the function to store values that are used across function calls and since they are not scratch they must be saved. r3 is pushed to align the stack.
Adding an extra register to push is a fast and compact way to align the stack.
r1 and r2 may be trashed by the call to foo and since the values initially stored in these registers are needed after the call they must be stored in a location that survive calls.

Resources