How to write and execute PURE machine code manually without containers like EXE or ELF?

I just need a hello world demo to see how machine code actually works.
Although Windows EXE and Linux ELF files are close to machine code, they are not PURE machine code.
How can I write and execute PURE machine code?

You can write in PURE machine code manually WITHOUT ASSEMBLY
Linux/ELF: https://github.com/XlogicX/m2elf. This is still a work in progress, I just started working on this yesterday.
Source file for "Hello World" would look like this:
b8 21 0a 00 00 #moving "!\n" into eax
a3 0c 10 00 06 #moving eax into first memory location
b8 6f 72 6c 64 #moving "orld" into eax
a3 08 10 00 06 #moving eax into next memory location
b8 6f 2c 20 57 #moving "o, W" into eax
a3 04 10 00 06 #moving eax into next memory location
b8 48 65 6c 6c #moving "Hell" into eax
a3 00 10 00 06 #moving eax into next memory location
b9 00 10 00 06 #moving pointer to start of memory location into ecx
ba 10 00 00 00 #moving string size into edx
bb 01 00 00 00 #moving "stdout" number to ebx
b8 04 00 00 00 #moving "print out" syscall number to eax
cd 80 #calling the linux kernel to execute our print to stdout
b8 01 00 00 00 #moving "sys_exit" call number to eax
cd 80 #executing it via linux sys_call
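To make the source format above concrete, here is a minimal sketch of how such a listing could be turned into raw bytes. It is not how m2elf itself is implemented; the function name parse_hex_source is made up for illustration, and the only assumption is the format shown above (space-separated hex bytes, with everything after # treated as a comment).

```python
def parse_hex_source(text: str) -> bytes:
    """Turn lines of space-separated hex bytes (with '#' comments) into raw bytes."""
    out = bytearray()
    for line in text.splitlines():
        code = line.split("#", 1)[0]     # drop the trailing comment, if any
        out.extend(bytes.fromhex(code))  # fromhex ignores the spaces between bytes
    return bytes(out)


# A few lines in the same format as the listing above.
source = """
bb 01 00 00 00 #moving "stdout" number to ebx
b8 04 00 00 00 #moving "print out" syscall number to eax
cd 80 #calling the linux kernel to execute our print to stdout
"""

machine_code = parse_hex_source(source)
print(machine_code.hex(" "))
```

The result is only the instruction bytes. Something (an ELF wrapper, a loader, or a boot sector) still has to get those bytes into executable memory.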
WIN/MZ/PE:
shellcode2exe.py (takes asciihex shellcode and creates a legit MZ PE exe file) script location:
https://web.archive.org/web/20140725045200/http://zeltser.com/reverse-malware/shellcode2exe.py.txt
dependency:
https://github.com/radare/toys/tree/master/InlineEgg
extract
python setup.py build
sudo python setup.py install

Real Machine Code
What you need to run the test: Linux x86 or x64 (in my case I am using Ubuntu x64)
Let's Start
This Assembly (x86) moves the value 666 into the eax register:
movl $666, %eax
ret
Let's make the binary representation of it:
Opcode movl (movl is a mov with operand size 32) in binary is = 1011
Instruction width in binary is = 1
Register eax in binary is = 000
Number 666 in signed 32 bits binary is = 00000000 00000000 00000010 10011010
666 converted to little endian is = 10011010 00000010 00000000 00000000
Instruction ret (return) in binary is = 11000011
So finally our pure binary instructions will look like this:
1011(movl)1(width)000(eax)10011010000000100000000000000000(666)
11000011(ret)
Putting it all together:
1011100010011010000000100000000000000000
11000011
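The hand encoding above can be double-checked with a short Python sketch. The helper names (encode_mov_eax_imm32, to_bits) are made up for this illustration; the bit fields are exactly the ones listed above (opcode 1011, width 1, register 000, then the 32-bit value in little endian).

```python
def encode_mov_eax_imm32(value: int) -> bytes:
    """Encode 'movl $value, %eax' from the bit fields described above."""
    opcode = (0b1011 << 4) | (0b1 << 3) | 0b000  # 1011 (mov) 1 (width) 000 (eax)
    return bytes([opcode]) + value.to_bytes(4, "little", signed=True)


RET = bytes([0b11000011])  # the 'ret' instruction


def to_bits(data: bytes) -> str:
    """Render bytes as a string of 1's and 0's, most significant bit first."""
    return "".join(f"{byte:08b}" for byte in data)


program = encode_mov_eax_imm32(666) + RET
print(to_bits(program[:5]))  # the mov instruction
print(to_bits(program[5:]))  # the ret instruction
print(program.hex(" "))      # the same bytes in hex
```

In hex the whole program is six bytes: b8 9a 02 00 00 c3.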
To execute it, the binary code has to be placed in a memory page with execute permission. We can do that using the following C code:
#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>

/* Allocate size bytes of executable memory. */
unsigned char *alloc_exec_mem(size_t size)
{
    void *ptr;

    ptr = mmap(0, size, PROT_READ | PROT_WRITE | PROT_EXEC,
               MAP_PRIVATE | MAP_ANON, -1, 0);
    if (ptr == MAP_FAILED) {
        perror("mmap");
        exit(1);
    }
    return ptr;
}

/* Read up to buffer_size bytes, encoded as 1's and 0's, into buffer. */
void read_ones_and_zeros(unsigned char *buffer, size_t buffer_size)
{
    unsigned char byte = 0;
    int bit_index = 0;
    int c;

    while ((c = getchar()) != EOF) {
        if (isspace(c)) {
            continue;
        } else if (c != '0' && c != '1') {
            fprintf(stderr, "error: expected 1 or 0!\n");
            exit(1);
        }
        byte = (byte << 1) | (c == '1');
        bit_index++;
        if (bit_index == 8) {
            if (buffer_size == 0) {
                fprintf(stderr, "error: buffer full!\n");
                exit(1);
            }
            *buffer++ = byte;
            --buffer_size;
            byte = 0;
            bit_index = 0;
        }
    }
    if (bit_index != 0) {
        fprintf(stderr, "error: left-over bits!\n");
        exit(1);
    }
}

int main()
{
    typedef int (*func_ptr_t)(void);
    func_ptr_t func;
    unsigned char *mem;
    int x;

    mem = alloc_exec_mem(1024);
    func = (func_ptr_t) mem;
    read_ones_and_zeros(mem, 1024);
    x = (*func)();
    printf("function returned %d\n", x);
    return 0;
}
Source: https://www.hanshq.net/files/ones-and-zeros_42.c
We can compile it using:
gcc source.c -o binaryexec
To execute it:
./binaryexec
Then we pass the first sets of instructions:
1011100010011010000000100000000000000000
press enter
and pass the return instruction:
11000011
press enter
finally press ctrl+d to signal end of input; the program then runs the code and prints:
function returned 666

The applications we usually write run on top of an operating system and are managed by it.
The operating system itself runs directly on the machine, so I think that is the PURE machine code you are asking about.
So, you need to study how an operating system works.
Here is some NASM assembly code for a boot sector that prints "hello world" with no operating system underneath.
org 0x7C00 ; the BIOS loads the boot sector at 0000:7C00
xor ax, ax
mov ds, ax
mov si, msg
boot_loop:lodsb
or al, al
jz go_flag
mov ah, 0x0E
int 0x10
jmp boot_loop
go_flag:
jmp go_flag
msg db 'hello world', 13, 10, 0
times 510-($-$$) db 0
db 0x55
db 0xAA
And you can find more resources here: http://wiki.osdev.org/Main_Page.
END.
If you have nasm installed and have a floppy drive, you can run:
nasm boot.asm -f bin -o boot.bin
dd if=boot.bin of=/dev/fd0
Then you can boot from this floppy and you will see the message.
(NOTE: you must put the floppy first in your computer's boot order.)
In fact, I suggest you run that code in a full virtual machine such as Bochs or VirtualBox,
because it is hard to find a machine with a floppy drive.
So, the steps are:
First, install a full virtual machine.
Second, create a virtual floppy with the command: bximage
Third, write the bin file to that virtual floppy.
Last, start your virtual machine from that virtual floppy.
NOTE: https://wiki.osdev.org has some basic information about this topic.
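As a rough alternative to the bximage and dd steps, the virtual floppy can also be built with a few lines of Python. This is only a sketch under assumptions: make_floppy_image is a made-up helper, the image is a flat 1.44 MB file (2880 sectors of 512 bytes), and the demo boot sector is a stand-in that just loops forever, not the hello world code above.

```python
SECTOR_SIZE = 512
FLOPPY_SIZE = 2880 * SECTOR_SIZE  # flat 1.44 MB floppy image


def make_floppy_image(boot_sector: bytes) -> bytes:
    """Place a 512-byte boot sector at the start of a zero-filled floppy image."""
    if len(boot_sector) != SECTOR_SIZE:
        raise ValueError("boot sector must be exactly 512 bytes")
    if boot_sector[510:512] != b"\x55\xaa":
        raise ValueError("boot sector must end with the 55 AA signature")
    return boot_sector + bytes(FLOPPY_SIZE - SECTOR_SIZE)


# Stand-in boot sector: 'jmp $' (EB FE), zero padding up to offset 510,
# then the signature, mirroring the 'times 510-($-$$) db 0' line above.
code = bytes([0xEB, 0xFE])
demo_sector = code + bytes(510 - len(code)) + b"\x55\xaa"

image = make_floppy_image(demo_sector)
print(len(image))
```

In practice you would read the 512 bytes of boot.bin produced by nasm instead of demo_sector and write the result to a file that the virtual machine uses as its floppy.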

It sounds like you're looking for the old 16-bit DOS .COM file format. The bytes of a .COM file are loaded at offset 100h in the program segment (limiting them to a maximum size of 64k - 256 bytes), and the CPU simply starts executing at offset 100h. There are no headers or any required information of any kind, just raw CPU instructions.
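To illustrate how little there is to a .COM file, here is a sketch that builds one byte by byte. The helper name build_hello_com and the message are my own choices; the code relies on the standard DOS service int 21h with AH=09h (print a $-terminated string at DS:DX) and on the 100h load offset described above.

```python
LOAD_OFFSET = 0x100  # DOS loads the file at offset 100h of the program segment
CODE_SIZE = 8        # the four instructions below take 8 bytes in total


def build_hello_com(message: bytes = b"Hello, World!\r\n") -> bytes:
    """Return the raw bytes of a tiny DOS .COM program that prints message."""
    msg_addr = LOAD_OFFSET + CODE_SIZE  # the string sits right after the code
    code = bytes([
        0xB4, 0x09,                            # mov ah, 09h  (print string)
        0xBA, msg_addr & 0xFF, msg_addr >> 8,  # mov dx, msg_addr (little endian)
        0xCD, 0x21,                            # int 21h      (call DOS)
        0xC3,                                  # ret          (exit via PSP offset 0)
    ])
    assert len(code) == CODE_SIZE
    return code + message + b"$"               # DOS strings end with '$'


image = build_hello_com()
print(image.hex(" "))
```

Writing image to a file with a .com extension should give something DOSBox can run; the file contains nothing but these instruction bytes and the string.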

The OS does not run the instructions; the CPU does (except in the case of a virtual machine OS, which do exist; I'm thinking of Forth and such things). The OS does, however, require some meta-information to know that a file does in fact contain executable code and what environment it expects. ELF is not just near machine code. It is machine code, together with some information that tells the OS how to get the CPU to actually execute it.
If you want something simpler than ELF but still *nix, have a look at the a.out format. Traditionally, *nix C compilers (still) write their executable to a file called a.out if no output name is specified.

The next program is a Hello World program I wrote in 16-bit machine code (Intel 8086). If you want to learn machine code, I suggest that you learn assembly first, because every line of assembly code is converted to one line of machine code. As far as I know, I am one of the few people in the world still programming in machine code instead of assembly.
BTW, to run it, save the file with a ".com" extension and run it in DOSBox!
So, this is a Hello World program.

When targeting an embedded system, you can make a binary image of the ROM or RAM that is strictly the instructions and associated data from the program. Often you can write that binary into a flash/ROM and run it.
Operating systems want to know more than that, and developers often want to leave more than that in their file so they can debug or do other things with it later (for example, disassemble with some recognizable symbol names). Also, embedded or on an operating system, you may need to separate .text from .data from .bss from .rodata, etc. File formats like ELF provide a mechanism for that, and the preferred use case is to load that ELF with some sort of loader, be it the operating system or something programming the ROM and RAM of a microcontroller.
.exe has some header info as well. As mentioned, .com didn't; it was loaded at address 0x100 and branched there.
To create a raw binary from an executable, for example a gcc-created ELF file, you can do something like
objcopy file.elf -O binary file.bin
If the program is segmented (.text, .data, etc) and those segments are not back to back the binary can get quite large. Again using embedded as an example if the rom is at 0x00000000 and data or bss is at 0x20000000 even if your program only has 4 bytes of data objcopy will create a 0x20000004 byte file filling in the gap between .text and .data (as it should because that is what you asked it to do).
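The size arithmetic can be sketched in a few lines. This is not how objcopy is implemented; flat_binary_size is a made-up helper that only assumes the behaviour described above: the raw image runs from the lowest loaded address to the highest, gaps included.

```python
def flat_binary_size(sections):
    """sections: list of (load_address, size_in_bytes) pairs.

    Returns the size of a flat image that spans from the first loaded byte
    to the last one, with the gaps in between filled by padding.
    """
    loaded = [(addr, size) for addr, size in sections if size > 0]
    start = min(addr for addr, _ in loaded)
    end = max(addr + size for addr, size in loaded)
    return end - start


# ROM code at 0x00000000 (0x100 bytes, an arbitrary size) plus 4 bytes of
# data at 0x20000000, as in the example above.
size = flat_binary_size([(0x00000000, 0x100), (0x20000000, 4)])
print(hex(size))
```

Only 0x104 of those bytes carry information; the rest is padding.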
What is it you are trying to do? Reading an ELF, Intel HEX or SREC file is quite trivial, and from that you can see all the bits and bytes of the binary. Disassembling the ELF (or whatever format you have) will also show you that in a human-readable form: objdump -D file.elf > file.list

With pure machine code, you can use any language that is able to write files.
Even Visual Basic .NET can write 8-, 16-, 32- and 64-bit values, converting between the integer types as it writes.
You can even have VB write out machine code in a loop as needed,
for something like setpixel, where x and y change and you have your ARGB colors.
Or, create your VB.NET program normally in Windows and use NGEN.exe to make a native code file of your program. It creates pure machine code specific to IA-32 all in one shot, throwing the JIT debugger aside.

These are nice responses, but knowing why someone would want to do this might guide the answer better. I think the most important reason is to get full control of the machine, especially over its cache writing, for maximum performance, and to prevent any OS from sharing the processor or virtualizing your code (thus slowing it down) or, especially these days, snooping on your code as well. As far as I can tell, assembler doesn't handle these issues, and M$/Intel and other companies treat this like an infringement or "for hackers." This is very wrong-headed, however. If your assembler code is handed over to an OS or proprietary hardware, true optimization (potentially at GHz frequencies) will be out of reach. This is a very important issue for science and technology, as our computers cannot be used to their full potential without hardware optimization, and are often computing several orders of magnitude below it. There is probably some workaround or some open-source hardware that enables this, but I have yet to find it. Penny for anyone's thoughts.

On Windows--at least 32bit Windows--you can execute RAW INSTRUCTIONS using a .com file.
For instance, if you take this string and save it in notepad with a .com extension:
X5O!P%#AP[4\PZX54(P^)7CC)7}$EICAR-STANDARD-ANTIVIRUS-TEST-FILE!$H+H*
It will print a string and set off your antivirus software.

Related

Fully understanding how .exe file is executed

Goal
I want to understand how executables work. I hope that understanding one very specific example in full detail will enable me to do so. My final (perhaps too ambitious) goal is to take a hello-world .exe file (compiled with a C compiler and linked) and understand in full detail how it is loaded into memory and executed by an x86 processor. If I succeed in doing so, I want to write an article and/or make a video about it, since I have not found anything like this on the internet.
Specific questions I want to ask are marked in bold. Of course any further suggestions and sources doing something similar are very welcome. Thanks a lot in advance for any help!
What I need
This Answer gives an overview of the process that C code goes through until it gets into physical memory as a program. I'm not sure yet how much I want to look into how the C code is compiled. Is there a way to view the assembly code a C compiler generates before assembling it? I may decide it's worth the effort to understand the processes of loading and linking. In the meantime, the most important parts I need to understand are
the PE executable file format
the relation between assembler code and x86 byte-code
the process of loading (i.e. how the process RAM is prepared for execution using information from the executable file).
I have a very basic understanding of the PE format (this understanding will be outlined in the section "What I have learned so far") and I think the sources given there should be sufficient; I just need to look into it some more until I know enough to understand a basic Hello-World program. Further sources on this topic are of course very welcome.
The translation of byte-code into assembler code (disassembly) seems to be quite difficult for x86. Nonetheless, I would love to learn more about it. How would you go about disassembling a short byte code segment?
I'm still looking for a way to view the contents of a process' memory (the virtual memory assigned to it). I've already looked into Windows kernel32.dll functions such as ReadProcessMemory but couldn't get it to work yet. It also seems strange to me that there don't seem to be (free) tools available for this. Together with an understanding of loading, I may then be able to understand how a process is run from RAM. I'm also looking for debugging tools for assembly programmers that allow viewing the entire process virtual memory contents. My current starting point for this search is this question. Do you have further advice on how I can see and understand loading and process execution from RAM?
What I have learned so far
The rest of this StackOverflow question describes what I have learned so far in some detail, giving various sources. It is meant to be reproducible and to help anyone trying to understand this. However, I still have some questions about the example I have looked at so far.
PE format
In Windows, an executable file follows the PE format. The official documentation and this article give a good overview of the format. The format describes what the individual bytes in an .exe file mean. The beginning is a DOS program (included for legacy reasons) that I will not worry about. Then comes a bunch of headers, which give information about the executable. The actual file contents are split into sections that have names, such as '.rdata'. After the file headers, there are also section headers, which tell you which parts of the file belong to which section and what each section does (e.g. whether it contains executable code).
The headers and sections can be parsed using tools such as dumpbin (microsoft tool to look at binary files). For comparison with dumpbin output, the hex code of a file can be viewed directly with a Hex editor or even using the Powershell (command Format-Hex -Path <Path to file>).
Specific example
I performed these steps for a very simple program, which does nothing. This is the code:
; NASM assembler program. Does nothing. Stores string in code section.
; Adapted from stackoverflow.com/a/1029093/9988487
global _main
section .text
_main:
hlt
db 'Hello, World', 10
I assembled it with NASM (command nasm -fwin32 filename.asm) and linked it with the linker that comes with VS2019 (link /subsystem:console /nodefaultlib /entry:main test.obj). This is adapted from this answer, which demonstrates how to make a hello-world program for Windows using a WinAPI call. The program runs on Windows 10 and terminates with no output. It takes about 2 sec to run, which seems very long and makes me think there may be some error somewhere?
I then looked at the dumpbin output:
D:\ASM>dumpbin test.exe /ALL
Microsoft (R) COFF/PE Dumper Version 14.22.27905.0
Copyright (C) Microsoft Corporation. All rights reserved.
Dump of file test.exe
PE signature found
File Type: EXECUTABLE IMAGE
FILE HEADER VALUES
14C machine (x86)
2 number of sections
5E96C000 time date stamp Wed Apr 15 10:04:16 2020
0 file pointer to symbol table
0 number of symbols
E0 size of optional header
102 characteristics
Executable
32 bit word machine
OPTIONAL HEADER VALUES
10B magic # (PE32)
14.22 linker version
200 size of code
200 size of initialized data
0 size of uninitialized data
1000 entry point (00401000)
1000 base of code
2000 base of data
400000 image base (00400000 to 00402FFF)
1000 section alignment
200 file alignment
<further header values omitted ...>
SECTION HEADER #1
.text name
E virtual size
1000 virtual address (00401000 to 0040100D)
200 size of raw data
200 file pointer to raw data (00000200 to 000003FF)
0 file pointer to relocation table
0 file pointer to line numbers
0 number of relocations
0 number of line numbers
60000020 flags
Code
Execute Read
RAW DATA #1
00401000: F4 48 65 6C 6C 6F 2C 20 57 6F 72 6C 64 0A ôHello, World.
SECTION HEADER #2
.rdata name
58 virtual size
2000 virtual address (00402000 to 00402057)
200 size of raw data
400 file pointer to raw data (00000400 to 000005FF)
0 file pointer to relocation table
0 file pointer to line numbers
0 number of relocations
0 number of line numbers
40000040 flags
Initialized Data
Read Only
RAW DATA #2
00402000: 00 00 00 00 00 C0 96 5E 00 00 00 00 0D 00 00 00 .....À.^........
00402010: 3C 00 00 00 1C 20 00 00 1C 04 00 00 00 00 00 00 <.... ..........
00402020: 00 10 00 00 0E 00 00 00 2E 74 65 78 74 00 00 00 .........text...
00402030: 00 20 00 00 1C 00 00 00 2E 72 64 61 74 61 00 00 . .......rdata..
00402040: 1C 20 00 00 3C 00 00 00 2E 72 64 61 74 61 24 7A . ..<....rdata$z
00402050: 7A 7A 64 62 67 00 00 00 zzdbg...
Debug Directories
Time Type Size RVA Pointer
-------- ------- -------- -------- --------
5E96C000 coffgrp 3C 0000201C 41C
Summary
1000 .rdata
1000 .text
The file header field "characteristics" is a combination of flags. In particular 102h = 1 0000 0010b and the two set flags (according to the PE format doc) are IMAGE_FILE_EXECUTABLE_IMAGE and IMAGE_FILE_BYTES_REVERSED_HI. The latter has description
IMAGE_FILE_BYTES_REVERSED_HI:
Big endian: the MSB precedes the LSB in memory. This flag is deprecated and should be zero.
I ask myself: Why does a modern assembler and a modern linker produce a deprecated flag?
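One way to check which flags are set in a value like 102h is to decode it bit by bit. The sketch below uses the characteristics flag values as I understand them from the PE format documentation; treat the table as an assumption to verify against the official doc. The helper name decode_characteristics is made up.

```python
# Flag values as listed in the PE format documentation (worth double-checking).
PE_CHARACTERISTICS = {
    0x0001: "IMAGE_FILE_RELOCS_STRIPPED",
    0x0002: "IMAGE_FILE_EXECUTABLE_IMAGE",
    0x0004: "IMAGE_FILE_LINE_NUMS_STRIPPED",
    0x0008: "IMAGE_FILE_LOCAL_SYMS_STRIPPED",
    0x0010: "IMAGE_FILE_AGGRESSIVE_WS_TRIM",
    0x0020: "IMAGE_FILE_LARGE_ADDRESS_AWARE",
    0x0080: "IMAGE_FILE_BYTES_REVERSED_LO",
    0x0100: "IMAGE_FILE_32BIT_MACHINE",
    0x0200: "IMAGE_FILE_DEBUG_STRIPPED",
    0x0400: "IMAGE_FILE_REMOVABLE_RUN_FROM_SWAP",
    0x0800: "IMAGE_FILE_NET_RUN_FROM_SWAP",
    0x1000: "IMAGE_FILE_SYSTEM",
    0x2000: "IMAGE_FILE_DLL",
    0x4000: "IMAGE_FILE_UP_SYSTEM_ONLY",
    0x8000: "IMAGE_FILE_BYTES_REVERSED_HI",
}


def decode_characteristics(value: int) -> list:
    """Return the names of all flags whose bit is set in value, lowest bit first."""
    return [name for bit, name in sorted(PE_CHARACTERISTICS.items()) if value & bit]


flags = decode_characteristics(0x102)
print(flags)
```

With this table, 102h comes out as IMAGE_FILE_EXECUTABLE_IMAGE (0002h) plus IMAGE_FILE_32BIT_MACHINE (0100h), which agrees with the "Executable" and "32 bit word machine" lines in the dumpbin output above. IMAGE_FILE_BYTES_REVERSED_HI would be 8000h.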
There are 2 sections in the file. The section .text was defined in the assembler code (and is the only one containing executable code, as specified in its header). I don't know what the second section '.rdata' (name seems to refer to "readable data") is or does here. Why was it created? How could I find out?
Disassembly
I used dumpbin to disassemble the .exe file (command dumpbin test.exe /DISASM). It gets the hlt correct, but the 'Hello, World.' string is (perhaps unfortunately) interpreted as executable commands. I guess the disassembler can hardly be blamed for this. However, if I understand correctly (I have no practical experience in assembly programming), putting data into a code section is not unheard of (it was done in several examples that I found while looking into assembly programming). Is there a better way to disassemble this that would reproduce my assembly code better? Also, do compilers sometimes put data into code sections in this way?

In some respects this is a massively broad question that may not survive for that reason. The information is all out there on the internet, keep looking, it is not complicated, and not worthy of a paper or video.
So you have a rough idea that a compiler takes a program written in one language and converts it to another language be that assembly language or machine code or whatever.
Then there are file formats and there are many different ones that we all use the term "binary" for but again, different formats. Ideally they contain, using some form of encoding, the machine code and data or information about the data.
I am going to use ARM for now: its fixed-length instructions are easy to disassemble and read, etc.
#define ONE 1
unsigned int x;
unsigned int y = 5;
const unsigned int z = 7;
unsigned int fun ( unsigned int a )
{
return(a+ONE);
}
and GNU gcc/binutils, because it is very well known and widely used, and you can use it to make programs on your wintel machine. I run Linux, so you will see ELF, not exe, but that is just a file format as far as what you are asking is concerned.
arm-none-eabi-gcc -O2 -c so.c -save-temps -o so.o
This toolchain (a chain of tools that are linked, for example compiler -> assembler -> linker) is Unix style and modular. You are going to have an assembler for a target anyway, so I am not sure why you would want to re-invent that, and it is so much easier to debug a compiler by looking at the assembly output than by trying to go straight to machine code. But there are folks that like to climb the mountain just because it is there rather than go around, and some tools go straight to machine code just because it's there.
This specific compiler has this save-temps feature. gcc itself is a front-end program that preps for the real compiler and then, if asked to (that is, if you don't say not to), calls the assembler and linker.
cat so.i
# 1 "so.c"
# 1 "<built-in>"
# 1 "<command-line>"
# 1 "so.c"
unsigned int x;
unsigned int y = 5;
const unsigned int z = 7;
unsigned int fun ( unsigned int a )
{
return(a+1);
}
So at this point defines and includes are taken care of, and it is one big file to be sent to the compiler.
The compiler does its thing and turns it into assembly language
cat so.s
.cpu arm7tdmi
.eabi_attribute 20, 1
.eabi_attribute 21, 1
.eabi_attribute 23, 3
.eabi_attribute 24, 1
.eabi_attribute 25, 1
.eabi_attribute 26, 1
.eabi_attribute 30, 2
.eabi_attribute 34, 0
.eabi_attribute 18, 4
.file "so.c"
.text
.align 2
.global fun
.arch armv4t
.syntax unified
.arm
.fpu softvfp
.type fun, %function
fun:
# Function supports interworking.
# args = 0, pretend = 0, frame = 0
# frame_needed = 0, uses_anonymous_args = 0
# link register save eliminated.
add r0, r0, #1
bx lr
.size fun, .-fun
.global z
.global y
.comm x,4,4
.section .rodata
.align 2
.type z, %object
.size z, 4
z:
.word 7
.data
.align 2
.type y, %object
.size y, 4
y:
.word 5
.ident "GCC: (GNU) 9.3.0"
which then gets put into an object file, in this case, binutils, linux default, etc
file so.o
so.o: ELF 32-bit LSB relocatable, ARM, EABI5 version 1 (SYSV), not stripped
It is using an elf file format which is easy to find info on, easy to write programs to parse, etc.
I can disassemble this. Note that because I am using the disassembler, it tries to disassemble everything, even things that are not machine code. Sticking to 32-bit ARM instructions, it can grind through that, and where there are real instructions they are shown. The instructions used here are aligned and fixed length, so you can disassemble linearly. You cannot do that with any hope of success on a variable-length instruction set like x86; there you need to disassemble in execution order, and you often miss some code due to the nature of the program.
arm-none-eabi-objdump -D so.o
so.o: file format elf32-littlearm
Disassembly of section .text:
00000000 <fun>:
0: e2800001 add r0, r0, #1
4: e12fff1e bx lr
Disassembly of section .data:
00000000 <y>:
0: 00000005 andeq r0, r0, r5
Disassembly of section .rodata:
00000000 <z>:
0: 00000007 andeq r0, r0, r7
Disassembly of section .comment:
00000000 <.comment>:
0: 43434700 movtmi r4, #14080 ; 0x3700
4: 4728203a ; <UNDEFINED> instruction: 0x4728203a
8: 2029554e eorcs r5, r9, lr, asr #10
c: 2e332e39 mrccs 14, 1, r2, cr3, cr9, {1}
10: Address 0x0000000000000010 is out of bounds.
Disassembly of section .ARM.attributes:
00000000 <.ARM.attributes>:
0: 00002941 andeq r2, r0, r1, asr #18
4: 61656100 cmnvs r5, r0, lsl #2
8: 01006962 tsteq r0, r2, ror #18
c: 0000001f andeq r0, r0, pc, lsl r0
10: 00543405 subseq r3, r4, r5, lsl #8
14: 01080206 tsteq r8, r6, lsl #4
18: 04120109 ldreq r0, [r2], #-265 ; 0xfffffef7
1c: 01150114 tsteq r5, r4, lsl r1
20: 01180317 tsteq r8, r7, lsl r3
24: 011a0119 tsteq r10, r9, lsl r1
28: Address 0x0000000000000028 is out of bounds.
And yes, the tool put extra stuff in there, but note primarily what I created: some code, some uninitialized read/write data, some initialized read/write data and some initialized read-only data. The toolchain authors can use whatever names they want; they don't even have to use the term section. But from decades of history, communication and terminology, .text is generally used for code (as in read-only machine code AND related data), .bss for zeroed read/write data (although I have seen other names), .data for initialized read/write data and, in this generation of this tool, .rodata for read-only initialized data (technically that could land in .text).
And note that they all have an address of zero. They are not linked yet.
Now this is ugly but to avoid adding any more code and if the tool lets me do it, let's link it to make a completely unusable binary (no bootstrap, etc, etc):
arm-none-eabi-ld -Ttext=0x1000 -Tdata=0x2000 so.o -o so.elf
arm-none-eabi-ld: warning: cannot find entry symbol _start; defaulting to 0000000000001000
arm-none-eabi-objdump -D so.elf
so.elf: file format elf32-littlearm
Disassembly of section .text:
00001000 <fun>:
1000: e2800001 add r0, r0, #1
1004: e12fff1e bx lr
Disassembly of section .data:
00002000 <y>:
2000: 00000005 andeq r0, r0, r5
Disassembly of section .rodata:
00001008 <z>:
1008: 00000007 andeq r0, r0, r7
Disassembly of section .bss:
00002004 <x>:
2004: 00000000 andeq r0, r0, r0
And now it is linked. The read only items .text and .rodata landed in the .text address space in the order found in the file. The read/write items landed in the .data address space in the order found in the file.
Yes, where was .bss in the object? It is in there, it has no actual data as in bytes that are part of the object, instead it has a name and size and that it is .bss. And for whatever reason the tool does show it from the linked binary.
So, back to the term binary. The so.elf binary has the bytes that go in memory and make up the program, but also file format infrastructure, plus a symbol table to make disassembly and debugging easier, plus other stuff. ELF is a flexible file format: GNU can use it and you get one result, while some other tool or version of a tool can use it and produce a different file. And obviously two compilers can generate different machine code from the same source program, and not just due to optimizations; the job is to make a functional program in the target language, and functional is the opinion of the compiler/tool author.
What about a memory image type file:
arm-none-eabi-objcopy so.elf so.bin -O binary
hexdump -C so.bin
00000000 01 00 80 e2 1e ff 2f e1 07 00 00 00 00 00 00 00 |....../.........|
00000010 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
*
00001000 05 00 00 00 |....|
00001004
The way the objcopy tool works is that it starts with the first defined loadable byte (or whatever term you want to use) and ends with the last one, using (zero) padding in between so that the file matches the memory image from an address perspective. The asterisk in the hexdump output stands for repeated lines, here essentially zero padding. We started .text at 0x1000 and .data at 0x2000, so the first byte of this file (offset 0) is the beginning of .text, and 0x1000 bytes later, at offset 0x1000 in the file, is the read/write data that we know goes to 0x2000 in memory. Also note that the .bss zeros are not in the output. The bootstrap is expected to zero those.
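To connect the hexdump back to the disassembly, here is a small sketch that reads the first twelve bytes of so.bin as little-endian 32-bit words. The bytes are copied from the hexdump above; the 0x1000 base address is the -Ttext value used at link time.

```python
import struct

# First 12 bytes of so.bin, copied from the hexdump above.
raw = bytes.fromhex("01 00 80 e2 1e ff 2f e1 07 00 00 00")

# Three unsigned 32-bit words, least significant byte first.
words = struct.unpack("<3I", raw)

BASE = 0x1000  # file offset 0 corresponds to address 0x1000 in memory
for index, word in enumerate(words):
    print(f"{BASE + 4 * index:8x}: {word:08x}")
```

The three words are e2800001 (add r0, r0, #1), e12fff1e (bx lr) and 00000007 (the constant z), the same values objdump showed at 0x1000, 0x1004 and 0x1008.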
There is no information in this file about where in memory its data goes, etc. And think about it a bit: what if I define one section with one byte that goes to 0x00000000 and another with one byte that goes to 0x80000000, and output this file? Yes, that is a 0x80000001 byte file, even though there are only two useful bytes of relevant information. A 2GB file to hold two bytes. This is why you don't want to output this file format until you have sorted out your linker script and tools.
Same data and two other equally old school formats with a little history of intel vs motorola
arm-none-eabi-objcopy so.elf so.hex -O ihex
cat so.hex
:08100000010080E21EFF2FE158
:0410080007000000DD
:0420000005000000D7
:0400000300001000E9
:00000001FF
arm-none-eabi-objcopy so.elf so.srec -O srec
cat so.srec
S00A0000736F2E7372656338
S10B1000010080E21EFF2FE154
S107100807000000D9
S107200005000000D3
S9031000EC
Now these contain the relevant bytes, plus addresses, but not much other information. They take more than two bytes of file for every byte of data, but compared to a huge file with padding that is a worthy trade-off. Both of these formats can still be found in use today, not as much as in the old days, but still there.
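The Intel HEX records shown above are simple enough to take apart by hand. Here is a minimal sketch (parse_ihex_record is a made-up helper, not part of binutils) that assumes the usual record layout: a colon, a byte count, a 16-bit address, a record type, the data, and a checksum that makes all the bytes sum to zero modulo 256.

```python
def parse_ihex_record(line: str):
    """Split one Intel HEX record into (record_type, address, data)."""
    line = line.strip()
    if not line.startswith(":"):
        raise ValueError("record must start with ':'")
    raw = bytes.fromhex(line[1:])
    if sum(raw) & 0xFF != 0:
        raise ValueError("bad checksum")
    count = raw[0]
    if len(raw) != count + 5:  # count + address (2) + type (1) + checksum (1)
        raise ValueError("length does not match byte count")
    address = int.from_bytes(raw[1:3], "big")
    record_type = raw[3]       # 00 = data, 01 = end of file
    data = raw[4:4 + count]
    return record_type, address, data


# The records from so.hex above.
records = [
    ":08100000010080E21EFF2FE158",
    ":0410080007000000DD",
    ":0420000005000000D7",
    ":0400000300001000E9",
    ":00000001FF",
]
for record in records:
    rtype, address, data = parse_ihex_record(record)
    print(f"type {rtype:02x} address {address:04x} data {data.hex()}")
```

The three data records carry the same bytes as the raw binary, but each one states its own address (0x1000, 0x1008, 0x2000), so no padding is needed.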
There are countless other binary file formats; a tool like objcopy has a decent list of formats it can generate, as do other linkers and/or tools out there.
What is relevant about all of this is that there is a binary file format of some form that contains the bytes we need to run the program.
What format and what addresses you might ask...That is part of the operating system or the system design. In the case of Windows there are specific file formats and variations perhaps of those formats that are supported by the windows operating system, the specific version you are using. Windows has determined what the address space looks like. Operating systems like this take advantage of the MMU both for virtualizing addresses and protection. Having a virtual address space means every program can live in the same space. All programs can have an address that is zero based for example....
test.c
int main ( void )
{
return 1;
}
hello.c
int main ( void )
{
return 2;
}
gcc test.c -o test
objdump -D test
Disassembly of section .text:
00000000004003e0 <_start>:
4003e0: 31 ed xor %ebp,%ebp
4003e2: 49 89 d1 mov %rdx,%r9
4003e5: 5e pop %rsi
...
gcc hello.c -o hello
objdump -D hello
Disassembly of section .text:
00000000004003e0 <_start>:
4003e0: 31 ed xor %ebp,%ebp
4003e2: 49 89 d1 mov %rdx,%r9
Same address. How is that possible; won't they sit on top of each other? No, because of virtual memory: each program gets its own virtual address space. And note this is built for a specific Linux on a specific day, etc. The toolchain has a default linker script (notice I didn't specify how to link) for this platform, chosen when the compiler was built for this target/platform.
arm-none-eabi-gcc -O2 test.c -c -o test.o
arm-none-eabi-ld test.o -o test.elf
arm-none-eabi-ld: warning: cannot find entry symbol _start; defaulting to 0000000000008000
arm-none-eabi-objdump -D test.elf
test.elf: file format elf32-littlearm
Disassembly of section .text:
00008000 <main>:
8000: e3a00001 mov r0, #1
8004: e12fff1e bx lr
Same source code, same compiler, but built for a different target and system, so a different address.
So for Windows there are definitely going to be rules for the supported binary formats and rules for address spaces that can be used, and how to define those spaces in the file.
Then it is a simple matter for the operating system's loader to read the binary file and put the loadable items into memory at those addresses (in the virtual space that the OS has created for this specific program). It is very possible that a feature of the loader is to zero .bss for you, since the information is there. The low-level programmer needs to know whether that happens, in order to deal with zeroing .bss or not.
If not, you will see it and may need to create a solution. Unfortunately, this is where you get deeper into tool-specific items. While C may be somewhat standardized, there are tool-specific things that are not, or at least are standardized only by the tool/authors, and there is no reason to assume those carry over to other tools.
.globl _start
_start:
ldr sp,sp_init
bl fun
b .
.word __bss_start__
.word __bss_end__
sp_init:
.word 0x8000
Everything about assembly language is tool specific. The mnemonics, for sanity reasons, will no doubt resemble the IP/processor vendor's documentation, which uses the syntax of the tool the vendor paid to have developed. But beyond that, assembly language is wholly defined by the tool, not the target. x86, because of its age and other things, is really bad about that, and this is not the Intel vs AT&T thing, just in general. GNU assembler is well known for, I would assume perhaps intentionally, not making its languages compatible with other assembly languages. The above is GNU assembler for ARM.
Using the fun() function above, C says it should be main() but the tool doesn't care I am already typing enough here.
add a simple ram based linker script
MEMORY
{
ram : ORIGIN = 0x1000, LENGTH = 0x1000
}
SECTIONS
{
.text : { *(.text*) } > ram
.rodata : { *(.rodata*) } > ram
.bss : {
__bss_start__ = .;
*(.bss*)
} > ram
__bss_end__ = .;
}
build it all
arm-none-eabi-as start.s -o start.o
arm-none-eabi-gcc -O2 -c so.c -o so.o
arm-none-eabi-ld -T sram.ld start.o so.o -o so.elf
examine
arm-none-eabi-nm so.elf
0000102c B __bss_end__
00001028 B __bss_start__
00001018 T fun
00001014 t sp_init
00001000 T _start
00001028 B x
00001024 D y
00001020 R z
arm-none-eabi-objdump -D so.elf
so.elf: file format elf32-littlearm
Disassembly of section .text:
00001000 <_start>:
1000: e59fd00c ldr sp, [pc, #12] ; 1014 <sp_init>
1004: eb000003 bl 1018 <fun>
1008: eafffffe b 1008 <_start+0x8>
100c: 00001028 andeq r1, r0, r8, lsr #32
1010: 0000102c andeq r1, r0, r12, lsr #32
00001014 <sp_init>:
1014: 00008000 andeq r8, r0, r0
00001018 <fun>:
1018: e2800001 add r0, r0, #1
101c: e12fff1e bx lr
Disassembly of section .rodata:
00001020 <z>:
1020: 00000007 andeq r0, r0, r7
Disassembly of section .data:
00001024 <y>:
1024: 00000005 andeq r0, r0, r5
Disassembly of section .bss:
00001028 <x>:
1028: 00000000 andeq r0, r0, r0
So now it is possible to add a memory zeroing loop to the bootstrap, based on the start and end addresses. Do not use C or memset for this: write the bootstrap in asm so you don't create chicken and egg problems.
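To make the idea concrete, here is a rough Python model of what such a loop does, using the __bss_start__ and __bss_end__ values from the nm output above. It is only a model for reasoning about the addresses, not a replacement for the asm bootstrap. The function name zero_bss, the bytearray standing in for ram, and the assumption that .bss is word aligned are mine.

```python
RAM_ORIGIN = 0x1000   # from the linker script
BSS_START = 0x1028    # __bss_start__ from nm
BSS_END = 0x102C      # __bss_end__ from nm


def zero_bss(memory, ram_origin, bss_start, bss_end):
    """Clear [bss_start, bss_end) one 32-bit word at a time, like a str loop would.

    Assumes both addresses are 4-byte aligned (an assumption, not checked by ld).
    """
    addr = bss_start
    while addr < bss_end:
        offset = addr - ram_origin
        memory[offset:offset + 4] = b"\x00\x00\x00\x00"
        addr += 4
    return memory


# Pretend ram holds garbage at power-up, then run the "bootstrap" step.
ram = bytearray(b"\xAA" * 0x1000)
zero_bss(ram, RAM_ORIGIN, BSS_START, BSS_END)
x_value = int.from_bytes(ram[0x28:0x2C], "little")  # the variable x lives at 0x1028
print(hex(x_value))
```

The loop only needs the two linker-provided addresses, which is why the bootstrap keeps them as .word entries next to the code.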
Fortunately or unfortunately, the linker script is tool specific and the assembly language is tool specific, and they need to work together if you are letting the tools do the work for you (the sane way to do it; have fun figuring out where .bss is otherwise).
This can be done on an operating system, but on, say, a microcontroller everything has to live in non-volatile storage (flash). It is possible to have firmware that is downloaded from elsewhere into ram (like your mouse firmware sometimes, sometimes keyboard, etc), but assume flash. So how do you deal with .data?
MEMORY
{
rom : ORIGIN = 0x0000, LENGTH = 0x1000
ram : ORIGIN = 0x1000, LENGTH = 0x1000
}
SECTIONS
{
.text : { *(.text*) } > rom
.rodata : { *(.rodata*) } > rom
.data : {
*(.data*)
} > ram AT > rom
.bss : {
__bss_start__ = .;
*(.bss*)
} > ram
__bss_end__ = .;
}
With gnu ld this basically says that .data's home is in ram, but the output binary formats will put it in flash/rom
arm-none-eabi-objcopy so.elf so.srec -O srec
cat so.srec
S00A0000736F2E7372656338
S11300000CD09FE5030000EBFEFFFFEA04100000A4
S11300100810000000800000010080E21EFF2FE1B4
S107002007000000D1 <- z variable at address 0020
S107002405000000CF <- y variable at 0024
S9030000FC
You have to play with the linker script more to get the tool to tell you both the ram and flash starting addresses and the ending addresses or length, then add code in the bootstrap (asm, not C) to copy .data from flash to ram.
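If you want to see where objcopy actually put z and y, the srec lines above can be picked apart with a few lines of Python. This is a sketch assuming the usual Motorola S-record layout (type digit, byte count, big-endian address, data, then a checksum that is the ones' complement of the low byte of the sum of count, address and data). parse_srec is a hypothetical helper name, not part of any tool.

```python
# Number of address bytes for each record type in the assumed S-record layout.
ADDRESS_BYTES = {"0": 2, "1": 2, "2": 3, "3": 4, "5": 2, "7": 4, "8": 3, "9": 2}


def parse_srec(record):
    """Return (type, address, data) for one S-record line, verifying the checksum."""
    record = record.strip()
    if len(record) < 4 or record[0] != "S":
        raise ValueError("not an S-record")
    kind = record[1]
    raw = bytes.fromhex(record[2:])
    count, body, checksum = raw[0], raw[1:-1], raw[-1]
    if count != len(raw) - 1:
        raise ValueError("byte count does not match record length")
    if (~(count + sum(body))) & 0xFF != checksum:
        raise ValueError("checksum mismatch")
    n = ADDRESS_BYTES[kind]
    return kind, int.from_bytes(body[:n], "big"), body[n:]


header = parse_srec("S00A0000736F2E7372656338")
z_record = parse_srec("S107002007000000D1")
y_record = parse_srec("S107002405000000CF")
end_record = parse_srec("S9030000FC")

for kind, address, data in (z_record, y_record):
    print(f"S{kind} at 0x{address:04x}: {int.from_bytes(data, 'little')}")
```

Both values land right after .rodata in the rom address space (0x0020 and 0x0024). That is the load address the bootstrap has to copy from.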
Also note here per another one of your many questions.
.word __bss_start__
.word __bss_end__
sp_init:
.word 0x8000
These items are technically data, but they live in .text first and foremost because they were defined in code that was assumed to be .text (I didn't need to state that in the asm, but could have). You will see this in x86 as well, but on fixed length instruction sets like arm, mips, risc-v, etc, you can't put any old immediate/constant/linked value you want in the instruction itself, so you put it nearby in a "pool" and do a pc relative read to get it. You will see this for linking externals too:
extern unsigned int x;
int main ( void )
{
return x;
}
arm-none-eabi-gcc -O2 -c test.c -o test.o
arm-none-eabi-objdump -D test.o
test.o: file format elf32-littlearm
Disassembly of section .text.startup:
00000000 <main>:
0: e59f3004 ldr r3, [pc, #4] ; c <main+0xc>
4: e5930000 ldr r0, [r3]
8: e12fff1e bx lr
c: 00000000 andeq r0, r0, r0 <--- the code gets the address of the
variable from here and then reads it from memory
once linked
Disassembly of section .text:
00008000 <main>:
8000: e59f3004 ldr r3, [pc, #4] ; 800c <main+0xc>
8004: e5930000 ldr r0, [r3]
8008: e12fff1e bx lr
800c: 00018010 andeq r8, r1, r0, lsl r0
Disassembly of section .data:
00018010 <x>:
18010: 00000005 andeq r0, r0, r5
for x86
gcc -c -O2 test.c -o test.o
objdump -D test.o
test.o: file format elf64-x86-64
Disassembly of section .text.startup:
0000000000000000 <main>:
0: 8b 05 00 00 00 00 mov 0x0(%rip),%eax # 6 <main+0x6>
6: c3 retq
00000000004003e0 <main>:
4003e0: 8b 05 4a 0c 20 00 mov 0x200c4a(%rip),%eax # 601030 <x>
4003e6: c3 retq
If you squint, is it really different? There is data nearby that the processor reads to load into a register and/or use. Either way, due to the nature of the instruction sets, the linker modifies the instruction, the nearby pool data, or both.
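The "squint" comparison can be checked with arithmetic on the bytes from the two linked listings above. This is a minimal Python sketch under two assumptions: in A32 state the pc reads as the instruction address plus 8, and on x86-64 a rip relative displacement is added to the address of the next instruction. Both helper names are made up for this example.

```python
def arm_ldr_literal_address(insn_addr, word):
    """Address read by an A32 `ldr rX, [pc, #imm12]` located at insn_addr."""
    if (word >> 16) & 0xF != 0xF:
        raise ValueError("base register is not pc")
    offset = word & 0xFFF
    if not (word >> 23) & 1:      # U bit clear means the offset is subtracted
        offset = -offset
    return insn_addr + 8 + offset  # pc reads as insn_addr + 8 in A32 state


def x86_rip_relative_address(insn_addr, insn_bytes):
    """Address read by `mov disp32(%rip), %eax`, encoded as 8b 05 disp32."""
    if insn_bytes[:2] != b"\x8b\x05" or len(insn_bytes) != 6:
        raise ValueError("not the 8b 05 disp32 form")
    disp = int.from_bytes(insn_bytes[2:6], "little", signed=True)
    return insn_addr + len(insn_bytes) + disp  # rip is the NEXT instruction


pool_main = arm_ldr_literal_address(0x8000, 0xE59F3004)   # ldr r3, [pc, #4]
pool_start = arm_ldr_literal_address(0x1000, 0xE59FD00C)  # ldr sp, [pc, #12]
x_addr = x86_rip_relative_address(0x4003E0, bytes.fromhex("8b054a0c2000"))
print(hex(pool_main), hex(pool_start), hex(x_addr))
```

On arm the linker filled in the pool word at 0x800c (it now holds 0x18010, the address of x). On x86 the linker filled in the displacement inside the instruction itself. Same job, different place.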
last one:
arm-none-eabi-gcc -S test.c
cat test.s
.cpu arm7tdmi
.eabi_attribute 20, 1
.eabi_attribute 21, 1
.eabi_attribute 23, 3
.eabi_attribute 24, 1
.eabi_attribute 25, 1
.eabi_attribute 26, 1
.eabi_attribute 30, 6
.eabi_attribute 34, 0
.eabi_attribute 18, 4
.file "test.c"
.text
.align 2
.global main
.arch armv4t
.syntax unified
.arm
.fpu softvfp
.type main, %function
main:
# Function supports interworking.
# args = 0, pretend = 0, frame = 0
# frame_needed = 1, uses_anonymous_args = 0
# link register save eliminated.
str fp, [sp, #-4]!
add fp, sp, #0
ldr r3, .L3
ldr r3, [r3]
mov r0, r3
add sp, fp, #0
# sp needed
ldr fp, [sp], #4
bx lr
.L4:
.align 2
.L3:
.word x
.size main, .-main
.ident "GCC: (GNU) 9.3.0"
So can you see the assembly language? Yes, some tools will let you save the intermediate files and/or generate the assembly output when compiling.
Can you have data in the code? Yes, there are times and reasons to have data values in the .text area. It is not just target specific: you will see this for various reasons, and some toolchains put read only data there.
There are many file formats. The ones used by modern operating systems have features not just for defining the bytes that make up the machine code and data values, but also for symbols and other debug information.
The file format and memory space for a program are operating system specific, not language specific nor even target specific (Linux, Windows and MacOS on the same laptop are not expected to have the same rules despite the exact same target computer). A native toolchain for that platform has a default linker script and whatever other information is required to build usable/working programs for that target, including the supported file format.
The machine code and data items can be represented in different file formats in different ways. Whether or not the operating system or loader of the target system can use a given format depends on that target system.
Programs have bugs and nuances. File formats have versions and inconsistencies. You might find some elf file format reader only to find that it doesn't work, or prints out strange stuff, when fed a perfectly good elf file that works on some system. Why are some flags being set? Perhaps those bytes got re-used, or the flag got repurposed, or the data structure changed, or a tool is using it differently or in a non-standard way (think mov 20h,ax), and another tool that is not compatible can't understand it, or gets lucky and gets close enough.
Asking "why" questions at Stack Overflow is not very useful. The odds of finding the individual who wrote the thing are very low. You have better odds asking at the place you got the tool from and following that trail, hoping the person is still alive and willing to be bothered. And 99.999 (lots of 9s) % of the time there is no global set of godly rules that the thing was written under/for. Generally some dude just felt like it, and that is why they did what they did: no real reason, laziness, a bug, intentionally trying to break someone else's tool. All the way up to a large committee of people with opinions who voted on it on a particular day in a particular room, and that's why (and we know what we get when we design by committee or try to write specs that nobody conforms to).
I know you are on Windows, and I don't have a Windows machine handy and am on Linux. But the gnu/binutils and clang/llvm tools are readily available and have a rich set of tools like readelf, nm, objdump, etc. that assist in examining things. A good toolchain is going to have that at least internally, so the developers can debug the output of the tool to a certain quality level. The gnu folks made their tools available to everyone, and while it takes time to sort through them and their features, they are very powerful for the things you are trying to understand.
You are NOT going to find a good x86 disassembler; they are all crap simply because of the nature of the beast. It is a variable length instruction set, so by definition, unless you are executing the code, you can't sort it out correctly. You must disassemble in execution order from a known good entry point to have half a chance, and even then, for various reasons, there are code paths you cannot see that way (think jump tables for example, or dlls or so files). The BEST solution is to have a very accurate/perfect emulator/simulator, run the code, perform all the actions/gyrations needed to cover all the code paths, and have that tool record which bytes are instructions and which are data, and where each instruction or each linear section without a branch is located.
The good side of this is that a lot of code today is compiled using tools that are not trying to hide anything. In the old days you would see hand written asm that intentionally tried to prevent disassembly, or that was hard to disassemble due to other factors (hand editing a binary rom image for a video game the day before the trade show; go disassemble some of the classic roms).
mov r0,#0
cmp r0,#0
jz somewhere
.word 0x12345678
A disassembler isn't going to figure this out. Some might add a case for that, and then you do this:
mov r0,#0
nop
nop
xor r0,#1
nop
nop
xor r0,#3
xor r0,#2
cmp r0,#0
jz somewhere
.word 0x12345678
and it thinks that data is an instruction. For variable length instruction sets that is super hard for a disassembler to resolve. A decent one will at least detect collisions, where the non opcode part of an instruction is branched to and/or the opcode part of an instruction shows up later as additional bytes in some other instruction. The tool can't resolve it; a human has to.
This applies even to arm and mips with their mix of 32 and 16 bit instructions, risc-v with its variable sized instructions, etc.
Very often gnu's disassembler will get tripped up with x86.
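Here is a toy illustration of the variable length problem. It is nowhere near a real disassembler: the table knows only four x86 opcodes (b8 = mov eax, imm32 at 5 bytes; 90 = nop; c3 = ret; cc = int3), and linear_sweep is a made-up name. It just walks the bytes front to back from a chosen starting offset, which is the naive strategy that gets tripped up.

```python
# Toy table: first opcode byte -> (text, total instruction length in bytes).
# Deliberately tiny; real x86 decoding needs prefixes, ModRM, SIB and much more.
TOY_TABLE = {
    0xB8: ("mov eax, imm32", 5),
    0x90: ("nop", 1),
    0xC3: ("ret", 1),
    0xCC: ("int3", 1),
}


def linear_sweep(code, start):
    """Decode front to back from `start`, trusting each length blindly."""
    listing = []
    pos = start
    while pos < len(code):
        text, length = TOY_TABLE.get(code[pos], ("(unknown)", 1))
        if pos + length > len(code):
            listing.append((pos, "(truncated)"))
            break
        listing.append((pos, text))
        pos += length
    return listing


code = bytes.fromhex("b8909090c3c3")
from_entry = linear_sweep(code, 0)   # start at the real first byte
from_middle = linear_sweep(code, 1)  # start one byte in, e.g. after a bad guess

for offset, text in from_entry:
    print(f"entry  +{offset}: {text}")
for offset, text in from_middle:
    print(f"middle +{offset}: {text}")
```

Both listings are perfectly valid decodes of the same six bytes. Nothing in the bytes themselves says which one the programmer meant, which is why you need a known good entry point.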

I don't think I'll be able to answer everything. I am a beginner too, so some of what I say may not be exact. But I'll try my best, and I think I can offer a few things.
No, compilers do not put data in code sections (correct me if I am wrong). There is the .data section (for initialized data) and the .bss section (for uninitialized data).
I think it is better to show you an example of a program that prints hello world. It is for Linux, because that is much simpler and I don't know how to do it on Windows. It is in x64, but x86 is similar; mainly the syscall numbers and the registers differ (x64 is 64 bits and x86 is 32 bits).
BITS 64 ;not obligatory but I prefer
section .data
msg db "hello world" ;the message
len equ $-msg ;the length of msg
section .text
global _start
_start: ;the entry point
mov rax, 1 ;syscall 1 to print something
mov rdi, 1 ;1 for stdout
mov rsi, msg ;the message
mov rdx, len ;length in rdx
syscall
mov rax, 60 ;exit syscall
mov rdi, 0 ;exit with 0
syscall
(You can use https://tio.run/#assembly-nasm if you don't want to use a VM. If you are using Windows, I advise you to look at WSL + vscode: you will have Linux inside Windows, and vscode has an extension for accessing the files from Windows.)
If you want to disassemble the code or look at memory, you can use gdb or radare2 on Linux. For Windows, there are other tools such as Ghidra, IDA and OllyDbg.
I don't know of any way to make the compiler generate better assembly code, but that doesn't mean one doesn't exist.
I have never made anything for windows. However, to link my object file, I use ld (I don't know if it will be helpful).
ld object.o -o compiledprogram
I don't have time to continue writing right now, so I can't recommend any courses yet. I'll look later.
Hope it has helped you.

Answers to questions in your text:
1. You can step through a process's execution and inspect its memory with a debugger. I used OllyDbg for learning assembly; it's a free and powerful debugger.
2. The process is loaded by the Windows kernel after NtCreateUserProcess is called, so I think you would need kernel debugging to see how it is done.
3. Code that is debugged in OllyDbg is automatically disassembled.
4. You can put read-only data in ".text" section. You can change section flags to make it writable, then code and data can be mixed. Some compilers may merge ".text" and ".rdata" sections.
I would recommend that you read about PE imports, exports, relocations and resources, in that order. If you want to see the simplest possible i386 PE hello world, you can check my hello_world_pe_i386_dynamic.exe program here: https://github.com/pajacol/hello-world. I wrote it entirely in a binary file editor. It contains only the required data structures. This executable is position independent and can be loaded at any address without relocations.

Calls to Addresses in the Middle of Routines

I am tracing wireshark-2.6.10 using Pin. At several points during the initialization, I can see some calls, such as this:
00000000004e9400 <__libc_csu_init@@Base>:
...
4e9449: 41 ff 14 dc callq *(%r12,%rbx,8)
...
The target of this call is 0x197db0, shown here:
0000000000197cb0 <_start@@Base>:
...
197db0: 55 push %rbp
197db1: 48 89 e5 mov %rsp,%rbp
197db4: 5d pop %rbp
197db5: e9 66 ff ff ff jmpq 197d20 <_start@@Base+0x70>
197dba: 66 0f 1f 44 00 00 nopw 0x0(%rax,%rax,1)
...
Pin says that this is in the middle of the containing routine, i.e., _start@@Base. But, when I reach this target using gdb, I see the following output:
>│0x5555556ebdb0 <frame_dummy> push %rbp
│0x5555556ebdb1 <frame_dummy+1> mov %rsp,%rbp
│0x5555556ebdb4 <frame_dummy+4> pop %rbp
│0x5555556ebdb5 <frame_dummy+5> jmpq 0x5555556ebd20 <register_tm_clones>
│0x5555556ebdba <frame_dummy+10> nopw 0x0(%rax,%rax,1)
│0x5555556ebdc0 <main_window_update()> xor %edi,%edi
Note that if I subtract the bias value, the runtime target address is consistent with the compile time value (i.e., 0x5555556ebdb0 - 0x555555554000 = 0x197db0). It seems that there exists a pseudo-routine called frame_dummy inside _start@@Base. How is that possible? How can I extract the addresses for these pseudo-routines beforehand (i.e., before execution)?
UPDATE:
These types of calls to the middle of functions were not present in GIMP and Anjuta (which are written almost purely in C and were built from source), but they are present in Inkscape and Wireshark (which are written in C++ and were installed from packages, although I do not think that the language is the cause).
At first, it seemed that this situation occurs only during the initialization and before calling the main() function. But, at least in wireshark-2.6.10 this occurs at least in one place after main() starts. Here, we have wireshark-qt.cpp: Lines 522-524 (which is part of main()).
/* Get the compile-time version information string */
comp_info_str = get_compiled_version_info(get_wireshark_qt_compiled_info,
get_gui_compiled_info);
This is a call to get_compiled_version_info(). In assembly, the function is called at address 0x5555556e74c2 (0x1934c2 without bias), as shown below:
>│0x5555556e74c2 <main(int, char**)+178> callq 0x5555556f5870 <get_compiled_version_info>
│0x5555556e74c7 <main(int, char**)+183> lea 0x4972(%rip),%rdi # 0x5555556ebe40 <get_wireshark_runtime_info(_GString*)>
│0x5555556e74ce <main(int, char**)+190> mov %rax,%r13
Again, the target is in the middle of another function, _ZN7QStringD1Ev@@Base:
00000000001980f0 <_ZN7QStringD1Ev@@Base>:
...
1a1870: 41 54 push %r12
...
This is the output of gdb (0x5555556f5870 - 0x555555554000 = 0x1a1870):
>│0x5555556f5870 <get_compiled_version_info> push %r12
│0x5555556f5872 <get_compiled_version_info+2> mov %rdi,%r12
│0x5555556f5875 <get_compiled_version_info+5> push %rbp
│0x5555556f5876 <get_compiled_version_info+6> lea 0x349445(%rip),%rdi # 0x555555a3ecc2
As can be seen, the debugger recognizes that this address is the start address of get_compiled_version_info(). This is because it has access to debug_info. In all cases that I found, the symbols for these pseudo-routines were removed from the original binary (because .symtab was removed from the binary). But the strange thing is that the function is located inside _ZN7QStringD1Ev@@Base. Therefore, Pin considers get_compiled_version_info() to be inside _ZN7QStringD1Ev@@Base.

How is that possible?
The frame_dummy is a bona-fide C function. If Pin thinks it's in the middle of _start, it's probably because:
_start is an assembly function, and
its .st_size is set incorrectly in the symbol table.
You can confirm this by looking at readelf -Ws a.out | egrep ' (_start|frame_dummy)'.
You are probably using a binary linked against a fairly old GLIBC.
GLIBC used to generate C runtime startup files (whence _start comes from) by using gcc -S to create assembly from C source, then splitting and editing the assembly with sed. Getting .size directive wrong was one problem with that approach, and it is no longer used on x86_64 as of 2012 (commit).
How can I extract the addresses for these pseudo-routines, beforehand (i.e., before execution)?
Pin doesn't magically create these pseudo-routines, they must be visible in the readelf -Ws output of the original binary.
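As a rough way to do that check offline, the readelf -Ws output can be scanned for every symbol whose [value, value + size) range covers a given address. The Python below is a sketch: the two sample lines are invented (the symbol indexes and both sizes are hypothetical; only the two start addresses and the bias come from the question), and parse_symbols / containing are made-up helper names. Feed it the real readelf output of your binary instead.

```python
# HYPOTHETICAL readelf -Ws excerpt. Only the Value column matches the question;
# the Num, Size and Ndx fields are invented to illustrate an oversized st_size.
SAMPLE = """\
   Num:    Value          Size Type    Bind   Vis      Ndx Name
    41: 0000000000197cb0   400 FUNC    GLOBAL DEFAULT   14 _start
    42: 0000000000197db0    10 FUNC    LOCAL  DEFAULT   14 frame_dummy
"""


def parse_symbols(text):
    """Return (name, value, size) for each symbol line in readelf -Ws style text."""
    symbols = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 8 or not fields[0].rstrip(":").isdigit():
            continue  # header, blank line, or a symbol with no name
        value = int(fields[1], 16)
        size = int(fields[2], 0)  # decimal, or 0x-prefixed hex for large sizes
        symbols.append((fields[7], value, size))
    return symbols


def containing(symbols, addr):
    """Names of all symbols whose address range covers addr."""
    return [name for name, value, size in symbols if value <= addr < value + size]


BIAS = 0x555555554000                    # load bias reported in the question
static_addr = 0x5555556EBDB0 - BIAS      # runtime call target -> file address
owners = containing(parse_symbols(SAMPLE), static_addr)
print(hex(static_addr), owners)
```

If more than one name comes back for an address, the ranges overlap, and a tool that picks the first or the largest match will report the target as being "in the middle" of the wrong routine.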

gcc x86-32 stack alignment and calling printf

To the best of my knowledge, x86-64 requires the stack to be 16-byte aligned before a call, while gcc with -m32 doesn't require this for main.
I have the following testing code:
.data
intfmt: .string "int: %d\n"
testint: .int 20
.text
.globl main
main:
mov %esp, %ebp
push testint
push $intfmt
call printf
mov %ebp, %esp
ret
Build with as --32 test.S -o test.o && gcc -m32 test.o -o test. I am aware that syscall write exists, but to my knowledge it cannot print ints and floats the way printf can.
After entering main, a 4 byte return address is on the stack. Interpreting this code naively, the two push instructions each put 4 bytes on the stack, so the call needs another 4 bytes pushed for the stack to be aligned.
Here is the objdump of the binary generated by gas and gcc:
0000053d <main>:
53d: 89 e5 mov %esp,%ebp
53f: ff 35 1d 20 00 00 pushl 0x201d
545: 68 14 20 00 00 push $0x2014
54a: e8 fc ff ff ff call 54b <main+0xe>
54f: 89 ec mov %ebp,%esp
551: c3 ret
552: 66 90 xchg %ax,%ax
554: 66 90 xchg %ax,%ax
556: 66 90 xchg %ax,%ax
558: 66 90 xchg %ax,%ax
55a: 66 90 xchg %ax,%ax
55c: 66 90 xchg %ax,%ax
55e: 66 90 xchg %ax,%ax
I am very confused about the push instructions generated.
If two 4 byte values are pushed, how is alignment achieved?
Why is 0x2014 pushed instead of 0x14? What is 0x201d?
What does call 54b even achieve? Output of hd matches objdump. Why is this different in gdb? Is this the dynamic linker?
B+>│0x5655553d <main> mov %esp,%ebp │
│0x5655553f <main+2> pushl 0x5655701d │
│0x56555545 <main+8> push $0x56557014 │
│0x5655554a <main+13> call 0xf7e222d0 <printf> │
│0x5655554f <main+18> mov %ebp,%esp │
│0x56555551 <main+20> ret
Resources on what goes on when a binary is actually executed are appreciated, since I don't know what's actually going on and the tutorials I've read don't cover it. I'm in the process of reading through How programs get run: ELF binaries.

The i386 System V ABI does guarantee / require 16 byte stack alignment before a call, like I said at the top of my answer that you linked. (Unless you're calling a private helper function, in which case you can make up your own rules for alignment, arg-passing, and which registers are clobbered for that function.)
Functions are allowed to crash or misbehave if you violate this ABI requirement, but are not required to. e.g. scanf in x86-64 Ubuntu glibc (as compiled by recent gcc) only recently started doing that: scanf Segmentation faults when called from a function that doesn't change RSP
Functions can depend on stack alignment for performance (to align a double or array of doubles to avoid cache-line splits when accessing them).
Usually the only case where a function depends on stack alignment for correctness is when compiled to use SSE/SSE2, so it can use 16-byte alignment-required loads/stores to copy a struct or array (movaps or movdqa), or to actually auto-vectorize a loop over a local array.
I think Ubuntu doesn't compile their 32-bit libraries with SSE (except functions like memcpy that use runtime dispatching), so they can still work on ancient CPUs like Pentium II. Multiarch libraries on an x86-64 system should assume SSE2, but with 4-byte pointers it's less likely that 32-bit functions would have 16 byte structs to copy.
Anyway, whatever the reason, obviously printf in your 32-bit build of glibc doesn't actually depend on 16-byte stack alignment for correctness, so it doesn't fault even when you misalign the stack.
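The alignment arithmetic from the question can be written down in a couple of lines. This Python sketch assumes main's own caller followed the ABI, so that esp was a multiple of 16 right before the call that entered main, and therefore esp % 16 == 12 on entry once the 4 byte return address is pushed. The function names are made up for illustration.

```python
def esp_mod16_before_call(pushed_bytes, entry_mod16=12):
    """esp % 16 just before a call, after pushing pushed_bytes since function entry.

    entry_mod16=12 assumes the caller had esp 16-byte aligned before calling us
    and the 4-byte return address has since been pushed.
    """
    return (entry_mod16 - pushed_bytes) % 16


def padding_needed(pushed_bytes):
    """Extra bytes to subtract from esp so that esp % 16 == 0 at the call."""
    return esp_mod16_before_call(pushed_bytes)


question_case = esp_mod16_before_call(8)  # push testint + push $intfmt
fix_for_question = padding_needed(8)
with_saved_ebp = padding_needed(12)       # push %ebp first, then the two args
print(question_case, fix_for_question, with_saved_ebp)
```

So the code in the question calls printf with esp 4 bytes off. Either a sub $4, %esp before the two pushes or a saved %ebp at the top of main would bring it back to a multiple of 16.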
Why is 0x2014 pushed instead of 0x14? What is 0x201d?
0x14 (decimal 20) is the value in memory at that location. It will be loaded at runtime, because you used push r/m32, not push $20 (or an assemble time constant like .equ testint, 20 or testint = 20).
You used gcc -m32 to make a PIE (Position Independent Executable), which is relocated at runtime, because that's the default on Ubuntu's gcc.
0x2014 is the offset relative to the start of the file. If you disassemble at runtime after running the program, you'll see a real address.
Same for call 54b. It's presumably a call to the PLT (which is near the start of the file / text segment, hence the low address).
If you disassembled with objdump -drwC, you'd see symbol relocation info. (I like -Mintel as well, but beware it's MASM-like, not NASM).
You can link with gcc -m32 -no-pie to make classic position-dependent executables. I'd definitely recommend that especially for 32-bit code, and especially if you're compiling C, use gcc -m32 -no-pie -fno-pie to get non-PIE code-gen as well as linking into a non-PIE executable. (see 32-bit absolute addresses no longer allowed in x86-64 Linux? for more about PIEs.)
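The relationship between the objdump and gdb listings in the question can be verified with plain arithmetic. This is a small Python sketch using only numbers from those two listings; the helper names are mine. It also decodes the rel32 field of the e8 call. That field comes out as -4, so the encoded target is the displacement field itself (main+0xe). One plausible reading, which the source does not confirm, is that these bytes are a placeholder that gets patched at load time. The gdb listing, where the same instruction targets printf directly, is at least consistent with that.

```python
def load_base(runtime_addr, file_addr):
    """Base address the PIE was relocated to, from one known address pair."""
    return runtime_addr - file_addr


def call_rel32_target(insn_addr, insn_bytes):
    """Return (target, rel32) for a 5-byte `e8 rel32` near call."""
    if len(insn_bytes) != 5 or insn_bytes[0] != 0xE8:
        raise ValueError("not an e8 rel32 call")
    rel = int.from_bytes(insn_bytes[1:], "little", signed=True)
    return insn_addr + 5 + rel, rel  # relative to the next instruction


base = load_base(0x5655553D, 0x53D)    # main in gdb vs. main in objdump
testint_runtime = base + 0x201D         # operand of pushl in the file
intfmt_runtime = base + 0x2014          # operand of push $ in the file
target, rel = call_rel32_target(0x54A, bytes.fromhex("e8fcffffff"))
print(hex(base), hex(testint_runtime), hex(intfmt_runtime), hex(target), rel)
```

The two pushed operands match what gdb shows (0x5655701d and 0x56557014), so 0x2014 and 0x201d really are just file-relative addresses plus the load base.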

Can GDB parse global data from xx.so without executable?

I have a shared library (hlapi.so) running on a Linux system. This hlapi.so has many modules (I mean .c files). One of them is named hlapi.c, which defines two global variables like this:
static int hlapiInitialized = FALSE;
static struct hlapi_data app_sp;
Of course there is much other code in this hlapi.c module. The hlapi.so is released to a customer, who builds their own application (named appbasehlapi) on top of our hlapi.so.
Now I have a core dump whose backtrace, as parsed by the customer, shows the crash is in our code. But the customer can only provide us with the core dump file; the appbasehlapi executable will not be shared with us. So all I have is the core dump file and hlapi.so.
In order to debug this core, I load the core dump file with the command
gdb --core=mycoredumpfile
and then in gdb, I use
set solib-search-path .
to specify the folder which contains hlapi.so so that gdb can load symbols from hlapi.so. And then I use:
print hlapiInitialized
print app_sp
to print the global data in our module. But the output values look very abnormal.
My question is whether I can inspect global data defined in hlapi.so via gdb without the executable, and whether the output I got from gdb can be trusted.
I would appreciate any comments.
BTW, hlapi.so is built with the gcc options "-g -fPIC".

I investigated the question for a while, and I believe GDB can parse the global variables without the executable.
In my test, the following code is in hlapi.cpp:
static int hlapiInitialized = 0;
void hlapiInit()
{
if (hlapiInitialized == 0)
{
// do something else
}
hlapiInitialized = 1;
}
objdump shows the following assembly for it:
00000000000009a2 <_Z9hlapiInitv>:
9a2: 55 push %rbp
9a3: 48 89 e5 mov %rsp,%rbp
9a6: c7 05 98 06 20 00 01 movl $0x1,0x200698(%rip) # 201048 <_ZL16hlapiInitialized>
9ad: 00 00 00
9b0: 90 nop
9b1: 5d pop %rbp
9b2: c3 retq
While the application was running, I generated a core dump of it. In gdb, before specifying the solib-search-path, I get:
(gdb) disas hlapiInit
No symbol table is loaded. Use the "file" command.
Once the search path is specified, the output is:
(gdb) disas hlapiInit
Dump of assembler code for function hlapiInit():
0x00007ffff7bd59a2 <+0>: push %rbp
0x00007ffff7bd59a3 <+1>: mov %rsp,%rbp
0x00007ffff7bd59a6 <+4>: movl $0x1,0x200698(%rip) # 0x7ffff7dd6048 <_ZL16hlapiInitialized>
0x00007ffff7bd59b0 <+14>: nop
0x00007ffff7bd59b1 <+15>: pop %rbp
0x00007ffff7bd59b2 <+16>: retq
End of assembler dump.
Comparing the output from hlapi.so with the output from the core file shows that once the shared library has been loaded into the process, the address of the global variable is relocated, and the relocated address is clearly visible. So once gdb has the symbol info of the shared library, it can map the variables.
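The mapping between the two listings can be double checked by hand. The Python sketch below uses only addresses from the objdump and gdb output above. The 10 byte instruction length is read off the encoding (c7 05, a 4 byte displacement, a 4 byte immediate), and the helper name is my own.

```python
def rip_relative_target(insn_addr, insn_len, disp32):
    """Address referenced by a rip-relative operand (rip = next instruction)."""
    return insn_addr + insn_len + disp32


MOVL_LEN = 10  # c7 05 + disp32 + imm32, as shown by objdump at 9a6

# Address of hlapiInitialized inside the .so file image.
file_target = rip_relative_target(0x9A6, MOVL_LEN, 0x200698)

# Where the library was mapped in the crashed process.
so_base = 0x7FFFF7BD59A2 - 0x9A2

# The same variable as seen in the core, computed two independent ways.
runtime_from_base = so_base + file_target
runtime_from_insn = rip_relative_target(0x7FFFF7BD59A6, MOVL_LEN, 0x200698)
print(hex(file_target), hex(so_base), hex(runtime_from_base))
```

Both routes give the address gdb annotates in its disassembly. If print hlapiInitialized shows something abnormal, it is worth comparing it against x/wx at that computed address in the core.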

x86_64: Is it possible to "in-line substitute" PLT/GOT references?

I'm not sure what a good subject line for this question is, but here we go:
In order to force code locality/compactness for a critical section of code, I'm looking for a way to call a function in an external (dynamically-loaded) library through a "jump slot" (an ELF R_X86_64_JUMP_SLOT relocation) directly at the call site - what the linker ordinarily puts into PLT / GOT, but have these inlined right at the call site.
If I emulate the call like:
#include <stdio.h>
int main(int argc, char **argv)
{
asm ("push $1f\n\t"
"jmp *0f\n\t"
"0: .quad %P0\n"
"1:\n\t"
: : "i"(printf), "D"("Hello, World!\n"));
return 0;
}
Here the .quad reserves the space for a 64bit word, and the call itself works. (Please, no comments about this being a lucky coincidence because it breaks certain ABI rules. None of that is the subject of this question, and in my case it can be worked around/addressed in other ways; I'm trying to keep this example brief.)
It creates the following assembly:
0000000000000000 <main>:
0: bf 00 00 00 00 mov $0x0,%edi
1: R_X86_64_32 .rodata.str1.1
5: 68 00 00 00 00 pushq $0x0
6: R_X86_64_32 .text+0x19
a: ff 24 25 00 00 00 00 jmpq *0x0
d: R_X86_64_32S .text+0x11
...
11: R_X86_64_64 printf
19: 31 c0 xor %eax,%eax
1b: c3 retq
But (due to using printf as the immediate, I guess ... ?) the target address here is still that of the PLT hook - the same R_X86_64_64 reloc. Linking the object file against libc into an actual executable results in:
0000000000400428 <printf@plt>:
400428: ff 25 92 04 10 00 jmpq *1049746(%rip) # 5008c0 <_GLOBAL_OFFSET_TABLE_+0x20>
[ ... ]
0000000000400500 <main>:
400500: bf 0c 06 40 00 mov $0x40060c,%edi
400505: 68 19 05 40 00 pushq $0x400519
40050a: ff 24 25 11 05 40 00 jmpq *0x400511
400511: [ .quad 400428 ]
400519: 31 c0 xorl %eax, %eax
40051b: c3 retq
[ ... ]
DYNAMIC RELOCATION RECORDS
OFFSET TYPE VALUE
[ ... ]
00000000005008c0 R_X86_64_JUMP_SLOT printf
I.e. this still gives the two-step redirection, first transfer execution to the PLT hook, then jump into the library entry point.
Is there a way to instruct the compiler/assembler/linker to - in this example - "inline" the jump slot target at address 0x400511?
I.e. replace the "local" (resolved at program link time by ld) R_X86_64_64 reloc with the "remote" (resolved at program load time by ld.so) R_X86_64_JUMP_SLOT one (and force non-lazy-load for this section of code) ? Maybe linker mapfiles might make this possible - if so, how?
Edit:
To make this clear, the question is about how to achieve this in a dynamically-linked executable / for an external function that's only available in a dynamic library. Yes, it's true static linking resolves this in a simpler way, but:
There are systems (like Solaris) where static libraries are generally not shipped by the vendor
There are libraries that aren't available as either source code or static versions
Hence static linking is not helpful here :(
Edit2:
I've found that in some architectures (SPARC, notably, see section on SPARC relocations in the GNU as manual), GNU is able to create certain types of relocation references for the linker in-place using modifiers. The quoted SPARC one would use %gdop(symbolname) to make the assembler emit instructions to the linker stating "create that relocation right here". Intel's assembler on Itanium knows the @fptr(symbol) link-relocation operator for the same kind of thing (see also section 4 in the Itanium psABI). But does an equivalent mechanism - something to instruct the assembler to emit a specific linker relocation type at a specific position in the code - exist for x86_64?
I've also found that the GNU assembler has a .reloc directive which supposedly is to be used for this purpose; still, if I try:
#include <stdio.h>
int main(int argc, char **argv)
{
asm ("push %%rax\n\t"
"lea 1f(%%rip), %%rax\n\t"
"xchg %%rax, (%%rsp)\n\t"
"jmp *0f\n\t"
".reloc 0f, R_X86_64_JUMP_SLOT, printf\n\t"
"0: .quad 0\n"
"1:\n\t"
: : "D"("Hello, World!\n"));
return 0;
}
I get an error from the linker (note that 7 == R_X86_64_JUMP_SLOT):
error: /tmp/cc6BUEZh.o: unexpected reloc 7 in object file
The assembler creates an object file for which readelf says:
Relocation section '.rela.text.startup' at offset 0x5e8 contains 2 entries:
Offset Info Type Symbol's Value Symbol's Name + Addend
0000000000000001 000000050000000a R_X86_64_32 0000000000000000 .rodata.str1.1 + 0
0000000000000017 0000000b00000007 R_X86_64_JUMP_SLOT 0000000000000000 printf + 0
This is what I want - but the linker doesn't take it.
The linker does accept just using R_X86_64_64 instead above; doing that creates the same kind of binary as in the first case, redirecting to printf@plt, not the "resolved" one.
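The Info column in that readelf output packs the symbol index and the relocation type into one 64 bit value, which is where the "7" in the linker error comes from. Below is a minimal Python sketch of the split as the ELF64 ELF64_R_SYM / ELF64_R_TYPE macros define it. The name table lists only the four x86_64 types that appear in this question, and split_r_info is a made-up helper.

```python
# Subset of x86_64 relocation type numbers (only those seen in this question).
R_X86_64_NAMES = {
    1: "R_X86_64_64",
    7: "R_X86_64_JUMP_SLOT",
    10: "R_X86_64_32",
    11: "R_X86_64_32S",
}


def split_r_info(info):
    """Split an Elf64 r_info into (symbol index, relocation type)."""
    return info >> 32, info & 0xFFFFFFFF


first = split_r_info(0x000000050000000A)   # the .rodata.str1.1 entry
second = split_r_info(0x0000000B00000007)  # the printf entry emitted by .reloc

for sym, rtype in (first, second):
    print(f"symbol #{sym}: type {rtype} = {R_X86_64_NAMES.get(rtype, '?')}")
```

So the assembler did emit exactly the requested type against symbol #11 (printf). It is the static linker that refuses to see that type in a relocatable object.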

This optimization has since been implemented in GCC. It can be enabled with the -fno-plt option and the noplt function attribute:
Do not use the PLT for external function calls in position-independent code. Instead, load the callee address at call sites from the GOT and branch to it. This leads to more efficient code by eliminating PLT stubs and exposing GOT loads to optimizations. On architectures such as 32-bit x86 where PLT stubs expect the GOT pointer in a specific register, this gives more register allocation freedom to the compiler. Lazy binding requires use of the PLT; with -fno-plt all external symbols are resolved at load time.
Alternatively, the function attribute noplt can be used to avoid calls through the PLT for specific external functions.
In position-dependent code, a few targets also convert calls to functions that are marked to not use the PLT to use the GOT instead.

In order to inline the call you would need a code (.text) relocation whose result is the final address of the function in the dynamically loaded shared library. No such relocation exists (and modern static linkers don't allow them) on x86_64 using a GNU toolchain for GNU/Linux, therefore you cannot inline the entire call as you wish to do.
The closest you can get is a direct call through the GOT (avoids PLT):
.section .rodata
.LC0:
.string "Hello, World!\n"
.text
.globl main
.type main, @function
main:
pushq %rbp
movq %rsp, %rbp
movl $.LC0, %eax
movq %rax, %rdi
call *printf@GOTPCREL(%rip)
nop
popq %rbp
ret
.size main, .-main
This should generate a R_X86_64_GLOB_DAT relocation against printf in the GOT to be used by the sequence above. You need to avoid C code because in general the compiler may use any number of caller-saved registers in the prologue and epilogue, and this forces you to save and restore all such registers around the asm function call or risk corrupting those registers for later use in the wrapper function. Therefore it is easier to write the wrapper in pure assembly.
Another option is to compile with -Wl,-z,now -Wl,-z,relro which ensures the PLT and PLT-related GOT entries are resolved at startup to increase code locality and compactness. With full RELRO you'll only have to run code in the PLT and access data in the GOT, two things which should already be somewhere in the cache hierarchy of the logical core. If full RELRO is enough to meet your needs then you wouldn't need wrappers and you would have added security benefits.
The best options are really static linking or LTO if they are available to you.

You can statically link the executable. Just add -static to the final link command, and all your indirect jumps will be replaced by direct calls.
