I am currently writing a small debugger in assembly on the Windows platform.
I open the debuggee process as follows:
invoke CreateProcess, addr buffer, NULL, NULL, NULL, FALSE, DEBUG_PROCESS+DEBUG_ONLY_THIS_PROCESS, NULL, NULL, addr startinfo, addr pi
It works well; I can get EIP by looking at the context of the debuggee, and so I can read the first byte of the instruction that is about to be executed.
However, I need the size in bytes of the previous instruction that was executed.
Instructions are not fixed-size: sometimes an instruction is just 1 byte, other times 6 bytes or more.
I tried subtracting the previous EIP from the current EIP to get the size of the instruction that was executed, but that doesn't work across a jmp or a call, because execution continues at a different address.
I planned to build a map of all opcodes and compare against it, but that seems like a huge amount of work.
If you have any idea how to get the size in bytes of the previously executed instruction (maybe by looking into a cache or something like that), please let me know.
Best regards
TL;DR
Keep it simple: single-step, decode only the branch instructions, and use EIP - last EIP unless the last instruction was a branch (in that case, use the decoding to find its length).
If an unknown instruction is found, back off and don't provide its size.
It's impossible to decode an x86 instruction stream backward because x86 encoding is not symmetric (w.r.t. address growth); to see this, consider mov eax, 90909090h or similar.
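To make that concrete, here is a minimal sketch (in C, with the instruction bytes as data) of why a backward scan is ambiguous; the bytes are real x86 encodings:

/* Five bytes that encode ONE instruction when decoded forward:
   B8 90 90 90 90  =  mov eax, 90909090h
   Scanning backward, though, each trailing 0x90 byte is also a valid
   one-byte nop, so the instruction boundary cannot be recovered. */
unsigned char stream[] = { 0xB8, 0x90, 0x90, 0x90, 0x90 };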
So you need to disassemble each instruction as you single step through the program (a debugger needs this anyway) and record its size.
The control transfer instructions are significantly fewer than the total number of instructions, so you could decode just those and use the EIP - EIP' trick (where EIP' is the EIP of the previous instruction) otherwise.
Intel processors support Last Branch Recording, but it requires OS support and you'd need to post-process the data anyway; it seems too burdensome.
A similar argument can be made for the Intel Processor Trace technology.
I can't think of any event for the performance counters (granted that you can use them) that would yield the number of bytes of an instruction.
Actually, in the back-end the concept of an "instruction" is reduced to a sequence of uOPs (probably with a bit to mark the last uOP of an instruction), and the front-end is mostly decoupled from the architectural value of eip (it works almost always with a speculative value of eip), so it may be several instructions ahead of the back-end.
I believe each uOP probably has a field recording how to update eip at retirement, but not the size of the instruction in bytes.
Similarly, in the front-end the instruction length in bytes is recorded only in the pre-decode stage; after that I think it's discarded (I can't think of any use for it).
Instructions in the L1 instruction cache are not yet decoded, so even if there was a way to inspect their content and metadata there would be nothing there.
The usual way this is done is by making a trace: single-step through the program, disassemble the instruction at eip (see the sketch below), record its size, resume the program, repeat until a stop condition.
This gives you a list of addresses and instruction sizes.
If you find an instruction you can't decode, you either don't record a size for it or try to estimate it with some heuristic (its length must be less than 16 bytes, and you could in theory combine the data with the count from a PMC like BR_INST_RETIRED.ALL_BRANCHES).
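For reference, a minimal sketch of such a trace loop using the Win32 debug API (x86 build, so CONTEXT has an Eip field), assuming the target was started with DEBUG_ONLY_THIS_PROCESS as in the question. Error handling and the branch bookkeeping are omitted; only the API calls shown are real, the rest is simplified:

#include <windows.h>
#include <stdio.h>

void trace_loop(void)
{
    DEBUG_EVENT ev;
    DWORD prevEip = 0;

    while (WaitForDebugEvent(&ev, INFINITE)) {
        if (ev.dwDebugEventCode == EXCEPTION_DEBUG_EVENT) {
            HANDLE hThread = OpenThread(THREAD_GET_CONTEXT | THREAD_SET_CONTEXT,
                                        FALSE, ev.dwThreadId);
            CONTEXT ctx;
            ctx.ContextFlags = CONTEXT_CONTROL;   /* covers Eip and EFlags */
            GetThreadContext(hThread, &ctx);

            /* EIP - EIP' is the previous instruction's size, unless the
               previous instruction was a branch (decode those separately). */
            if (prevEip != 0)
                printf("%08lx -> %08lx (delta %lu)\n",
                       prevEip, ctx.Eip, ctx.Eip - prevEip);
            prevEip = ctx.Eip;

            ctx.EFlags |= 0x100;                  /* TF: single-step next insn */
            SetThreadContext(hThread, &ctx);
            CloseHandle(hThread);
        }
        ContinueDebugEvent(ev.dwProcessId, ev.dwThreadId, DBG_CONTINUE);
        if (ev.dwDebugEventCode == EXIT_PROCESS_DEBUG_EVENT)
            break;
    }
}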
It's possible to detect the size of an instruction at runtime but that's totally not feasible in this context.
Related
Under Visual Studio masm:
mov ecx, 270       ; number of pops to attempt
l1: pop eax        ; pop values this program never pushed
loop l1            ; decrement ecx, repeat while ecx != 0
push eax           ; push one value back
The point of this code is to find out whether ESP has an initial value, and if so, what it is. I pop immediately after the program starts and experiment with how many pops it takes before a push creates a memory-access error. The result of the experiment is somewhat unstable, even with exactly the same value for ecx. Generally, values greater than 512 always (in my limited number of experiments) create an error, values less than 128 are always "safe", and values around 250 to 400 sometimes create an error. It seems that there is no fixed initial value for ESP; if there were, my experiment should produce stable results.
OK, I ran 127 another 10 times and now it starts to crash. I am experimenting with more values.
Let's say we're on Windows x86, at an ordinary moment of starting a program like my experiment's program. How does Windows determine the initial value of ESP? Is this difficult to determine (I could imagine simply putting the last address of the stack segment in ESP)? Is there a common practice for how to do this?
The initial value is wherever the OS put the stack in the process's virtual address space. In modern operating systems it's randomized (ASLR).
What is above the top of the stack at _start is architecture-dependent. On Windows, you get an actual return address to something that will exit the current thread. On Linux, you get the command line and the environment variables. In any case, popping stuff from the stack that you didn't push is not going to be ABI compliant and will get you into trouble. The only rules that remain at that point are the security rules.
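A quick way to see the randomization, as a minimal sketch: the address of a local variable in main is a good proxy for the initial stack pointer, and with ASLR enabled it changes between runs.

#include <stdio.h>

int main(void)
{
    int probe;  /* lives near the top of the stack at program start */
    /* With ASLR enabled, this prints a different address on each run. */
    printf("stack is near %p\n", (void *)&probe);
    return 0;
}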
AFAIK they are never used and CS=DS=SS nowadays. However, if I were to set these registers, would anything change, or does the processor ignore them? I've found really conflicting information on this question, and I don't understand why they would still be there if they are ignored. Help, please.
Yes, the segment registers do still affect code execution.
The question and some of the comments don't seem to distinguish between the selector value and the base address. To clearly understand some of the apparently conflicting information you're reading on this topic, you need to make sure you recognize which one is being discussed.
The CS selector cannot be 0. It must refer to a valid code segment descriptor in the GDT or LDT. The L bit of the code segment descriptor controls whether the current process runs in 64-bit mode or in 32-bit compatibility mode.
CS (the selector) cannot be equal to DS or SS. CS must refer to a code segment, whereas DS and SS must refer to data segments (possibly the same one). The DS and SS selectors are allowed to be 0 (which would cause a GP fault in 32-bit mode).
The main aspect of segment registers that no longer has an effect is the base address and segment limit: the base addresses of CS, DS, ES, and SS are all treated as 0 (FS and GS keep a usable base), and there are no segment-limit checks in 64-bit code.
This is the reason you see people saying that they are ignored.
As Margaret mentioned, the current privilege level (CPL) is in the low 2 bits of the CS and SS selector registers and also in the DPL bits of the descriptors in the GDT. These bits should be either 0 or 3, since no current operating systems use rings 1 and 2, as far as I know.
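If you want to see this for yourself, here's a minimal sketch that reads the CS selector and extracts the CPL from its low 2 bits. It assumes GCC/Clang inline assembly (MSVC x64 has no inline asm); reading a segment register into a general-purpose register is unprivileged:

#include <stdio.h>

int main(void)
{
    unsigned long cs;
    /* mov from a segment register to a GPR is allowed in user mode */
    __asm__ volatile ("mov %%cs, %0" : "=r"(cs));
    /* The low 2 bits of the selector hold the CPL: 3 in user mode. */
    printf("CS selector = %#lx, CPL = %lu\n", cs, cs & 3);
    return 0;
}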
One other minor point is that certain faults caused by memory accesses are reported as stack faults instead of GP faults, if the memory access is performed using the SS segment (because RBP or RSP is used as a base register in the instruction operand).
For many years x86 CPUs supported the rdtsc instruction, which reads the "time stamp counter" of the current CPU. The exact definition of this counter has changed over time, but on recent CPUs it is a counter that increments at a fixed frequency with respect to wall clock time, so it is very useful as building block for a fast, accurate clock or measuring the time taken by small segments of code.
One important fact about the rdtsc instruction is that it isn't ordered in any special way with the surrounding code. Like most instructions, it can be freely reordered with respect to other instructions it isn't in a dependency relationship with. This is actually "normal", and for most instructions it's just a mostly invisible way of making the CPU faster (that's just a long-winded way of saying out-of-order execution).
For rdtsc it is important because it means you might not be timing the code you expect to be timing. For example, given the following sequence1:
rdtsc
mov ecx, eax
mov rdi, [rdi]
mov rdi, [rdi]
rdtsc
You might expect rdtsc to measure the latency of the two pointer-chasing loads mov rdi, [rdi]. In practice, however, even if both of these loads take a long time (100s of cycles if they miss in the cache), you'll get a fairly small reading for the rdtsc pair. The problem is that the second rdtsc doesn't wait for the loads to finish; it just executes out of order, so you aren't timing the interval you think you are. Perhaps both rdtsc instructions actually execute before the first load even starts, depending on how rdi was calculated in the code prior to this example.
So far, this is sounding more like an answer to a question nobody asked than a real question, but I'm getting there.
You have two basic use-cases for rdtsc:
As a quick timestamp, in which case you usually don't care exactly how it reorders with the surrounding code, since you probably don't have an instruction-level concept of where the timestamp should be taken anyway.
As a precise timing mechanism, e.g., in a micro-benchmark. In this case you'll usually protect your rdtsc from re-ordering with the lfence instruction. For the example above, you might do something like:
lfence
rdtsc
lfence
mov ecx, eax
...
lfence
rdtsc
To ensure the timed instructions (...) don't escape outside of the timed region, and also to ensure instructions from outside don't come into the timed region (probably less of a problem, but they may compete for resources with the code you want to measure).
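In C you'd typically write this pattern with compiler intrinsics rather than raw assembly. A minimal sketch (the GCC/Clang header is assumed; MSVC exposes the same intrinsics via <intrin.h>):

#include <stdio.h>
#include <stdint.h>
#include <x86intrin.h>  /* __rdtsc() and _mm_lfence() on GCC/Clang */

int main(void)
{
    _mm_lfence();                 /* keep earlier work out of the region */
    uint64_t t0 = __rdtsc();
    _mm_lfence();                 /* keep the timed code after the read  */

    /* ... code to measure goes here ... */

    _mm_lfence();                 /* wait for the timed code to execute  */
    uint64_t t1 = __rdtsc();

    printf("%llu reference cycles\n", (unsigned long long)(t1 - t0));
    return 0;
}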
Years later, Intel looked down upon us poor programmers and came up with a new instruction: rdtscp. Like rdtsc it returns a reading of the time stamp counter, and this guy does something more: it reads a core-specific MSR value atomically with the timestamp reading. On most OSes this contains a core ID value. I think the idea is that this value can be used to properly adjust the returned value to real time on CPUs that may have different TSC offsets per core.
Great.
The other thing rdtscp introduced was half-fencing in terms of out-of-order execution:
From the manual:
The RDTSCP instruction is not a serializing instruction, but it does
wait until all previous instructions have executed and all previous
loads are globally visible.1 But it does not wait for previous stores
to be globally visible, and subsequent instructions may begin
execution before the read operation is performed.
So it's like putting an lfence before the rdtscp, but not after. What is the point of this half-fencing behavior? If you want a general timestamp and don't care about instruction ordering, the unfenced behavior is what you want. If you want to use this for timing short code sections, the half-fencing behavior is useful only for the second (final) reading, but not for the initial reading, since the fence is on the "wrong" side (in practice you want fences on both sides, but having them on the inside is probably the most important).
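To illustrate, a sketch of the one place the half-fence does fit (same intrinsics as above; __rdtscp's built-in wait replaces the leading lfence for the closing read):

#include <stdint.h>
#include <x86intrin.h>  /* __rdtsc(), __rdtscp(), _mm_lfence() */

static inline uint64_t region_start(void)
{
    _mm_lfence();                 /* fence on both sides of the opening read */
    uint64_t t = __rdtsc();
    _mm_lfence();
    return t;
}

static inline uint64_t region_end(void)
{
    unsigned int aux;             /* receives the core-specific MSR value */
    uint64_t t = __rdtscp(&aux);  /* already waits for prior instructions */
    _mm_lfence();                 /* keep subsequent instructions out */
    return t;
}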
What purpose does such half-fencing serve?
1 I'm ignoring the upper 32-bits of the counter in this case.
What confuses me is that every address holds a sequence of 1s and 0s. So how does the CPU differentiate, let's say, 00000100 (an integer) from 00000100 (a CPU instruction)?
First of all, different commands have different values (opcodes). That's how the CPU knows what to do.
Finally, the question remains: what's a command, and what's data?
Modern PCs use the von Neumann architecture (https://en.wikipedia.org/wiki/John_von_Neumann), where data and opcodes are stored in the same memory space. (There are architectures separating the two, such as the Harvard architecture.)
Explaining everything in detail would be totally beyond the scope of Stack Overflow; most likely the character limit per post would not be sufficient.
To answer the question in as few words as possible (everyone actually working at this level would kill me for the shortcuts in this explanation):
Data in the memory is stored at certain addresses.
Each CPU instruction (in this simplified model) basically consists of 3 different addresses (NOT values, just addresses!):
an address saying what to do
an address of a value
an address of an additional value
So, assuming an addition of 5 + 7 should be performed and you have 3 addresses available in memory, the application would store (I used "verbs" for the instructions):
Address | Stored Value
1       | ADD
2       | 5
3       | 7
Finally the CPU receives the instruction 1 2 3, which then means ADD 5 7 (these things are order-sensitive: [command] [v1] [v2])... and now things are getting complicated.
The CPU will move these values (actually not the values, just the addresses of the values) into its registers and then process them. Which registers are chosen depends on the data type, data size, and opcode.
In the case of the command #1 #2 #3, the CPU will first read these memory addresses and then know that ADD 5 7 is desired.
Based on the opcode for ADD, the CPU will:
Put address #2 into r1
Put address #3 into r2
Read the value stored in memory at the address held in r1
Read the value stored in memory at the address held in r2
Add both values
Write the result somewhere in memory
Store the address of the result into r3
Store the address held in r3 into the memory location whose address is held in r1 (i.e., into address #2)
Note that this is simplified. Actually the CPU needs exact instructions on whether it's handling a value or an address. In assembly this is done by using
eax (meaning the value stored in register eax)
[eax] (meaning the value stored in memory at the address stored in register eax)
In this model the CPU cannot perform calculations on values stored in memory (load/store architectures really can't; x86 does allow a memory operand for many instructions), so it is quite busy moving values from memory to registers and from registers to memory.
i.e. If you have
eax = 0x2
and in memory
0x2 = 110011
and the instruction
MOV ebx, [eax]
this means: move the value currently stored at the address that is currently stored in eax into the register ebx. So finally
ebx = 110011
(This happens EVERY TIME the CPU does a single calculation: memory -> register -> memory.)
Finally, the requesting application can read its predefined memory address #2, which now holds the address #2568, and then knows that the outcome of the calculation is stored at address #2568. Reading that address will yield the value 12 (5 + 7).
This is just a tiny, tiny example of what's going on. For a more detailed introduction, refer to http://www.cs.virginia.edu/~evans/cs216/guides/x86.html
One cannot really grasp the amount of data movement and calculation done for a simple addition of 2 values. Doing what a CPU does (on paper) would take you several minutes just to calculate "5+7", since there is no "5" and no "7" - everything is hidden behind an address in memory, pointing to some bits, resulting in different values depending on what the bits at address 0x1 are instructing...
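To tie the pieces above together, here's a minimal fetch-decode-execute loop in C. The "machine" is invented for illustration, but it shows opcodes and data living in the same memory, distinguished only by what the program counter points at:

#include <stdio.h>

enum { HALT = 0, ADD = 1 };

int main(void)
{
    /* memory[0..3] is "code", memory[4..5] is "data" - only by convention */
    int memory[8] = { ADD, 4, 5, HALT, 5, 7, 0, 0 };
    int pc = 0;   /* program counter: decides which cells count as code */

    for (;;) {
        int opcode = memory[pc];
        if (opcode == HALT)
            break;
        if (opcode == ADD) {
            /* the operands are ADDRESSES of values, as in the example above */
            memory[6] = memory[memory[pc + 1]] + memory[memory[pc + 2]];
            pc += 3;  /* step past the opcode and its two operands */
        }
    }
    printf("result: %d\n", memory[6]);   /* prints 12 */
    return 0;
}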
Short form: The CPU does not know what's stored there, but the instructions tell the CPU how to interpret it.
Let's have a simplified example.
If the CPU is told to add a word (let's say, a 32-bit integer) stored at location X, it fetches the content of that address and adds it.
If the program counter reaches the same location, the CPU will again fetch this word and execute it as a command.
The CPU (other than security stuff like the NX bit) is blind to whether it's data or code.
The only way data doesn't accidentally get executed as code is by carefully organizing the code so that execution never reaches a location holding data.
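A one-byte illustration of that blindness: the same value is "data" or "code" purely depending on whether the program counter ever reaches it.

/* 0xC3 is just the number 195 when an instruction loads it as data,
   but if the program counter ever lands on this byte, the CPU decodes
   the very same bits as the one-byte x86 `ret` instruction. */
unsigned char ambiguous = 0xC3;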
When a program is started, the processor starts executing it at a predefined spot. The author of a program written in machine language will have intentionally put the beginning of their program there. From there, each instruction always ends up setting the next location the processor will execute to somewhere there is an instruction. This continues to be the case for all of the instructions that make up the program, unless there is a serious bug in the code.
There are two main ways instructions can set where the processor goes next: jumps/branches, and not explicitly specifying. If the instruction doesn't explicitly specify where to go next, the CPU defaults to the location directly after the current instruction. Contrast that with jumps and branches, which have space to encode the address of the next instruction. Jumps always jump to the place specified. Branches check whether a condition is true. If it is, the CPU jumps to the encoded location; if the condition is false, it simply goes to the instruction directly after the branch.
Additionally, a machine language program should never write data to a location that is meant for instructions, or some instruction at some future point in the program could try to run what was overwritten with data. Having that happen could cause all sorts of bad things. The data there could hold an "opcode" that doesn't match anything the processor knows how to do, or the data there could tell the computer to do something completely unintended. Either way, you're in for a bad day. Be glad that your compiler never messes up and accidentally inserts something that does this.
Unfortunately, sometimes the programmer using the compiler messes up and does something that tells the CPU to write data outside of the area allocated for data. (A common way this happens in C/C++ is to allocate an array L items long and use an index >= L when writing data; see the sketch below.) Having data written to an area set aside for code is what buffer overflow vulnerabilities are made of. Some program may have a bug that lets a remote machine trick the program into writing data (which the remote machine sent) beyond the end of an area set aside for data and into an area set aside for code. Then, at some later point, the processor executes that "data" (which, remember, was sent from a remote computer). If the remote attacker was smart, they carefully crafted the "data" that went past the boundary to be valid instructions that do something malicious (to give them more access, destroy data, send back sensitive data from memory, etc.).
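The C bug described above can look as harmless as this (a hypothetical sketch, names invented):

#include <string.h>

/* buf is 8 bytes, but nothing stops input from being longer: the copy
   runs past the end of the array and overwrites adjacent memory (on the
   stack, that can include the saved return address). */
void vulnerable(const char *input)
{
    char buf[8];
    strcpy(buf, input);   /* no bounds check: the classic overflow */
}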
This is because an ISA must define what the valid instructions are and how their operands are encoded: memory addresses, registers, and literals.
See this for more general info on how an ISA is designed:
https://en.wikipedia.org/wiki/Instruction_set
In short, the operating system sets up where execution starts; from then on, in the case of x64, a special register called rip (the instruction pointer) holds the address of the next instruction to be executed. The CPU automatically reads the data at this address, decodes and executes it, and increments rip by the number of bytes of the instruction.
Generally, the OS can mark regions of memory (pages) as holding executable code or not. If an error or exploit tries to modify executable memory, an error should occur; similarly, if the CPU finds itself trying to execute non-executable memory, it will/should also signal an error and the program will be terminated. Now you're into the wonderful world of software viruses!
I'm reading the book "Write Great Code: Understanding the Machine" by Randall Hyde. It's a great and clear text, but here I'm completely stuck on his explanation of, for example, the mov instruction.
He dissects the steps for the mov(srcReg,destMem) instruction as follows:
1. Fetch the instruction's opcode from memory.
2. Update the EIP register with the address of the byte following the opcode.
3. Decode the instruction's opcode to see what instruction it specifies.
4. Fetch the displacement associated with the memory operand from the memory location immediately following the opcode.
5. Update EIP to point at the first byte beyond the operand that follows the opcode.
6. If the mov instruction uses a complex addressing mode (for example, the indexed addressing mode), compute the effective address of the destination memory location.
7. Fetch the data from srcReg.
8. Store the fetched value into the destination memory location.
I'm lost in steps 4-6. My exact questions are:
Step 4: Why do I need this displacement, and how and why will I use it later?
Step 5: I understand that in step 2 EIP must "point" at the next byte, where the next instruction to be executed is stored. But I don't understand why EIP needs to point one byte beyond the operand. I believed that EIP was concerned only with instructions/opcodes, not data.
Step 6: What exactly is an effective address? Are there other types of addresses?
Step 4:
Some opcodes reference memory relative to the opcode's location. For example, a function might have a constant or static piece of data. If it does, the code may opt to place that right before the function starts (or right after it ends) and refer to it by saying "get the memory from 46 bytes earlier". That's the displacement: an offset, encoded in the instruction, from the contents of a register (in this case, EIP), used for referencing data relative to the register's contents. (On 32-bit x86 a displacement can also simply be an absolute address or an offset from another base register.)
Step 5
The operands for opcodes are normally stored right after the opcode. So you might have some memory arranged like so: a b c. a is an opcode, b is the operand for a, and c is the next opcode.
If you only move EIP to the end of a (so it references b), then in the next instruction cycle the computer will assume that b is the next opcode to execute. b isn't supposed to be an opcode, though; it's an operand. The computer can't tell the difference between an opcode and an operand: it just assumes whatever EIP points to is an instruction and executes it. That's why EIP needs to be moved past the operand too.
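As a concrete sketch of that a-b layout, using a real x86 encoding: mov eax, 11223344h is the opcode byte B8 followed by a four-byte immediate operand, so EIP has to advance by 5 to land on the next opcode, c.

/* a = the opcode B8 (mov eax, imm32); b = the 4 operand bytes
   (0x11223344 stored little-endian). EIP must skip all 5 bytes. */
unsigned char a_then_b[] = { 0xB8, 0x44, 0x33, 0x22, 0x11 };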
Step 6
An "effective" address is just an absolute one (relative to the start of memory) while the "complex" address the book refers to is relative to something else (often the contents of a register).
Step 4 showed that an opcode might not refer to an absolute memory address. It could easily refer to a relative one. In fact, programs very frequently refer to addresses that are relative to some register. For example, if you wrote some_struct.data in C and compiled it for an x86 processor, it would load the address of some_struct into a register (say, EAX), then hard-code data's offset from the base of some_struct into the operand. So if there are 5 bytes of data between the start of the struct and the start of the data element, then the instruction might look like load [EAX + 5] -> EBX which means "take what's in EAX, add 5, fetch the data from that address and put it in EBX".
The thing is, the memory doesn't really understand relative addresses like this. It only understands absolute ones. So in order to access a relative address, the processor has to first add that 5 to whatever's in EAX to compute an absolute address. Then it can send that address to the memory controller and have it understood.
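For the curious, the encoding of that load shows exactly where the displacement from step 4 lives (these are real x86 bytes):

/* mov ebx, [eax+5] is three bytes: 8B = opcode (mov r32, r/m32),
   58 = ModRM byte (EBX destination, EAX base, 8-bit displacement follows),
   05 = the displacement itself. The CPU adds that 5 to EAX at runtime
   to form the effective (absolute) address. */
unsigned char mov_ebx_eax5[] = { 0x8B, 0x58, 0x05 };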
There are two basic types of relative addresses I've worked with (there are more I haven't).
Register relative: The processor takes the contents of a register and uses that as the address in memory. Depending on the opcode and processor support, it may also add an operand to the register. Step 4 was dealing with this kind of addressing, with EIP as the register the address was relative to.
Memory relative: Sometimes referred to as "indirect". The processor starts out with a register relative address, then automatically fetches the data at that address and treats it as the real address.
Wikipedia describes lots of other addressing modes on their addressing modes page.
Memory relative took me a while to understand. Say you did a memory relative load where the register contains 10 and the offset is 5. The processor will add them together (10 + 5 = 15). Then, it'll go to that address (15 in this case) and grab whatever's there. If address 15 happens to contain the value 60, then 60 will be treated as the actual address and the processor will load the contents of address 60. If you're familiar with a language with pointers (e.g. C), memory relative is like a pointer-to-a-pointer.
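In C, that two-step lookup is exactly a pointer to a pointer; a minimal sketch of the example above:

#include <stdio.h>

int main(void)
{
    int value = 60;       /* the data ultimately wanted                 */
    int *p = &value;      /* like the address found at address 15       */
    int **pp = &p;        /* like register (10) + offset (5) = 15       */
    printf("%d\n", **pp); /* two loads: *pp fetches p, then *p fetches value */
    return 0;
}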