Error code 487 (ERROR_INVALID_ADDRESS) when using VirtualAllocEx - winapi

I'm trying to use VirtualAllocEx(). When I set dwSize (the third parameter) to a number larger than about 63 MB, the call fails and GetLastError() returns error code 487. However, it works with smaller sizes such as 4 MB.
Here is part of my code:
VirtualAllocEx(peProcessInformation.hProcess,
(LPVOID)(INH.OptionalHeader.ImageBase),
dwImageSize,
MEM_RESERVE | MEM_COMMIT,
PAGE_EXECUTE_READWRITE);
In the case that I used a 4MB EXE file, the LPVOID return value is 0x00400000, but in other cases (20MB or bigger file) it returns 0x00000000.
Is there a maximum value for the dwSize parameter?
Is there any other solution for my problem, such as another function?

My guess from your code is that you're trying to load a DLL or EXE into memory manually using something like this technique - is that right? I'll address this at the end (pun intended) but first a quick explanation of why VirtualAllocEx is failing.
Why is VirtualAllocEx giving this error?
The problem with allocating memory at a specific address is that there needs to be enough room at that address to allocate the memory size you request. This is why, generally, when you request memory you let the OS decide where to put it. (Plus letting the OS / malloc library decide can lead to other benefits, such as decreased fragmentation etc - out of scope for this answer.)
The problem you're getting is not that VirtualAllocEx is incapable of allocating 64MB rather than 4MB. VirtualAllocEx can allocate (nearly) as much memory as you want it to. The problem is that at the address you specify, in your process, there isn't 64MB of unallocated memory.
Consider hypothetical addresses 0-15 (0x0 - 0xF), where - marks empty memory and x marks allocated memory:
0 1 2 3 4 5 6 7 8 9 A B C D E F
x x - x - - - - x - - - - - - -
This is your process's memory space. Now, you want to allocate 4 bytes at address 0x4. Easy - 0x4 to 0x7 are free, so you allocate and get (new allocation marked with X):
0 1 2 3 4 5 6 7 8 9 A B C D E F
x x - x X X X X x - - - - - - -
Fantastic. But now suppose that instead you wanted to allocate 6 bytes. There aren't six free bytes at address 0x4: there's some memory being used at 0x8:
0 1 2 3 4 5 6 7 8 9 A B C D E F
x x - x - - - - x - - - - - - -
1 2 3 4 bang!
You can't do it. The problem isn't that the memory allocator can't handle allocating 6 bytes, but that the memory isn't free for it to do it. Nor, most likely, can it shuffle the memory around - in a normal non-GC program you can't move memory to make space, because you might, say, leave dangling pointers which don't know the contents of the memory they were pointing at has changed address. The only thing to do is either fail and not allocate memory at all, or allocate where it has free space, say at 0x9 or 0xA.
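The 16-slot diagram above can be turned into a small simulation. This is only a sketch of the idea, not how Windows implements VirtualAllocEx; ToySpace, toy_space_example and toy_alloc_at are hypothetical names invented for the illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SPACE_SIZE 16

/* Toy address space: true means the slot is already allocated. */
typedef struct {
    bool used[SPACE_SIZE];
} ToySpace;

/* Returns a space laid out like the diagram: x x - x - - - - x - - - - - - - */
ToySpace toy_space_example(void) {
    ToySpace s = {{false}};
    s.used[0x0] = s.used[0x1] = s.used[0x3] = s.used[0x8] = true;
    return s;
}

/* Try to allocate `size` slots starting exactly at `addr`.
   Succeeds only if every slot in [addr, addr + size) is free;
   on failure the space is left unchanged. */
bool toy_alloc_at(ToySpace *s, size_t addr, size_t size) {
    assert(s != NULL);
    if (size == 0 || addr >= SPACE_SIZE || size > SPACE_SIZE - addr)
        return false;
    for (size_t i = addr; i < addr + size; i++)
        if (s->used[i])
            return false;   /* something is already in the way */
    for (size_t i = addr; i < addr + size; i++)
        s->used[i] = true;
    return true;
}
```

With this layout, asking for 4 slots at 0x4 succeeds, asking for 6 slots at 0x4 fails because 0x8 is taken, and asking for 6 slots at 0x9 succeeds.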
You might wonder why VirtualAllocEx reports ERROR_INVALID_ADDRESS when it returns NULL: most likely, it is because you specified an address it couldn't allocate at; thus, even though there is some free memory at that address (maybe), there isn't enough and the address isn't valid. This is hinted at in the documentation:
Attempting to commit a specific address range by specifying MEM_COMMIT
without MEM_RESERVE and a non-NULL lpAddress fails unless the entire
range has already been reserved. The resulting error code is
ERROR_INVALID_ADDRESS.
This isn't quite your situation: you're specifying both flags at once, but if the method can't reserve then it effectively falls into this situation. It can't reserve the entire range at that address, so it gives error code ERROR_INVALID_ADDRESS.
Loading DLL or EXE images
So, what should you do with your problem, which I am guessing from your question and code is loading a DLL or EXE image in memory?
Here you need a bit of background on image locations in an EXE file. Generally, an EXE is loaded into memory at the process's virtual address location 0x400000. It's optional: your linker can ask it be put wherever, but this value is common. Similarly, DLLs have a common default location: 0x10000000. So, for one EXE and one DLL, you're fine: the image loader can almost certainly load them at their requested locations.
What happens when you have two DLLs, both asking to be located at 0x10000000?
The answer is image rebasing. The image location is a preference, not a requirement. Code inside the image that depends on being loaded at a specific address can be adjusted by the image loader, and so the second DLL might be loaded not at 0x10000000, but somewhere else - say, 0x10800000. That's an address difference of 0x800000, and so the loader actually patches up a bunch of addresses and code inside the DLL so all the bits that thought they should refer to 0x10000000 now refer to 0x10800000.
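The patching step is, at its core, "add the base difference to every stored absolute address". Here is a minimal sketch of that idea, assuming a flat list of offsets that each hold a 32-bit absolute address. The function toy_rebase and its parameters are hypothetical; a real PE image stores this information in base relocation blocks that need proper parsing.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified relocation: for each offset in `fixups`, the image
   holds a 32-bit absolute address that assumed `preferred_base`. Add the
   difference between the actual and preferred base to each one. */
void toy_rebase(uint8_t *image, size_t image_size,
                const size_t *fixups, size_t fixup_count,
                uint32_t preferred_base, uint32_t actual_base) {
    assert(image != NULL && (fixups != NULL || fixup_count == 0));
    uint32_t delta = actual_base - preferred_base;  /* wraps modulo 2^32 */
    for (size_t i = 0; i < fixup_count; i++) {
        size_t off = fixups[i];
        if (off > image_size || image_size - off < sizeof(uint32_t))
            continue;                   /* skip fixups outside the image */
        uint32_t addr;
        memcpy(&addr, image + off, sizeof addr);   /* unaligned-safe read */
        addr += delta;
        memcpy(image + off, &addr, sizeof addr);
    }
}
```

For an image built for 0x10000000 but loaded at 0x10800000, a stored address of 0x10001234 becomes 0x10801234.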
This is really, really common, and every time you load an EXE this will be done to several DLLs. It is so common that Microsoft have a little optimisation tool called rebase, (for "rebasing", that is, adjusting the base address) and when you distribute your EXE and your own DLLs with it, you can use this to make sure each DLL has a different base address, each of which is located so that when Windows loads your EXE and the DLLs they will already have the right addresses and it is unlikely to have to rebase - perform the above operation - on any of them. For some applications this can make a noticeable improvement in starting time. (In modern versions of Windows, sometimes DLLs are moved around anyway - this is for address space layout randomization, a security technique to deliberately make sure code is not at the same address each time it's run.)
(One other thing is that some DLL and EXE compression tools strip out the data that is used for this relocation. That's fine, because it makes the EXE smaller... right up until it needs to be relocated, and because the data is missing it can't, and so can't be loaded at all. Or you can build with a fixed base, and it will magically work right until it doesn't. Don't do this to your EXEs or DLLs.)
So, what should you do when you try to manually load a DLL into memory, and there isn't enough space for it at the address it asks to be loaded at? Easy - it's not a fatal error, just load it somewhere else, and then perform the rebasing yourself. I would suggest if you have problems to ask a new SO question, but to give you a starting point you can use the RebaseImage function, or if you can't use it or want to do it yourself, I found this code which from a quick overview seems to perform this manually. No guarantees about its correctness.
TLDR
Your process address space doesn't have 64MB of empty space at the address you specify, so you can't allocate 64MB of memory there. Instead, allocate it somewhere else and patch / rebase your loaded DLL or EXE image to match the new address.

Pass NULL as the address parameter:
void* pImageBase = VirtualAllocEx(peProcessInformation.hProcess,
NULL,
dwImageSize,
MEM_RESERVE | MEM_COMMIT,
PAGE_EXECUTE_READWRITE);
If lpAddress is NULL, the function determines where to allocate the
region.
Note that for the rest of the code, you should use pImageBase, not INH.OptionalHeader.ImageBase.

Related

x86 store when data is in 2 different blocks

Suppose linux-32: the alignment rules say, for example, that doubles (8 bytes) must be aligned to 4 bytes. This means that, if we assume 64-byte cache blocks (a typical value for modern processors), we can have a double aligned at the 60th position, which means that this double will be in 2 different cache blocks.
It could even happen that both parts of the double were in 2 different cache blocks located in 2 different 4KB pages.
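To make the arithmetic concrete, here is a small sketch that checks whether an object straddles a cache block or a page. The helper crosses_boundary is a hypothetical name introduced for this illustration; it compares the block of the first byte with the block of the last byte.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True if the object occupying [addr, addr + size) spans more than one
   block of `block` bytes. `block` must be a power of two, size >= 1. */
bool crosses_boundary(uint64_t addr, uint64_t size, uint64_t block) {
    assert(size >= 1 && block != 0 && (block & (block - 1)) == 0);
    uint64_t first = addr & ~(block - 1);               /* block of first byte */
    uint64_t last  = (addr + size - 1) & ~(block - 1);  /* block of last byte  */
    return first != last;
}
```

An 8-byte double at offset 60 crosses a 64-byte block boundary, one at offset 56 does not, and one at offset 4092 crosses both a 64-byte block and a 4096-byte page.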
After this brief introduction to put the question in context, I have a couple of doubts:
1- When programming in assembler for maximum performance, it is recommended to prevent these things from happening by using alignment directives, right? Or, for some reason I'm not aware of, does aligning the double so that it sits in only 1 block make no difference to performance?
2- How will the store instruction be decoded in the mentioned case (suppose a modern Intel microarchitecture)? I mean, I know that a normal x86 store instruction is decoded into a micro-fused pair of str-addr and str-data, but in this case, where 2 different cache blocks (and maybe even 2 different 4KB pages) are involved, will it be decoded into 2 micro-fused pairs of str-addr and str-data (one for the first 4 bytes of the double and another for the last 4 bytes)? Or will it be decoded into a single micro-fused pair, with both the str-addr and the str-data having to do twice the work before finally being able to leave the execution port?
Yes, of course you should align a double whenever possible, like compilers do except when forced by ABI struct-layout rules to misalign them. (The ABI was designed when i386 was current so a double always required 2 loads anyway.)
The current version of the i386 System V ABI requires 16-byte stack alignment, so local doubles (that have to get spilled at all instead of kept in regs) can be aligned, and malloc has to return memory suitable for any type, and alignof(max_align_t) = 16 on 32-bit Linux (8 on 32-bit Windows) so 32-bit malloc will always give you at least 16 (or 8)-byte aligned memory. And of course in static storage you control the alignment with align (NASM) or .p2align (GAS) directives.
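The answer names the assembler directives; for comparison, the same control is available from C11 through alignas. The sketch below is an illustration only: struct padded, aligned_global and is_aligned are hypothetical names, and the 64-byte figure is the cache block size assumed in the question.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

/* A struct whose double is forced onto its own 64-byte boundary. */
struct padded {
    char tag;
    alignas(64) double value;   /* cannot straddle two 64-byte blocks */
};

/* A file-scope double with 8-byte alignment requested explicitly. */
alignas(8) double aligned_global = 0.0;

/* True if `p` is a multiple of `alignment` (a power of two). */
int is_aligned(const void *p, size_t alignment) {
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return ((uintptr_t)p & (alignment - 1)) == 0;
}
```

The cost is padding: the member value starts at offset 64, so the struct is much larger than the 9 bytes of payload it carries.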
For the perf downsides of cacheline splits and page splits, see How can I accurately benchmark unaligned access speed on x86_64
re: decoding: The address isn't known at decode time, so obviously any effects of a line split or page split are resolved later. For stores, probably no effect until the store-buffer entry has to commit to L1d cache. Are two store buffer entries needed for split line/page stores on recent Intel? - probably no, allocating a 2nd entry after executing the store-address uop is implausible.
For loads, re-running the load through the execution unit to get the other half (or whatever uneven split), using internal line-split buffers to combine data. (Not re-dispatching from the RS, just internally handled in the load port. But the RS does aggressively replay uops waiting for the result of a load.)
Re-running the store-data uop for a misaligned store seems unlikely, too. I don't think we see extra counts for uops_dispatched_port.port_4 perf events.

What will happen if the highest bit of a pointer is used as a flag bit? Can someone give an example to explain it?

I am reading the book <windows via c/c++>. In Chapter 13 - Windows Memory Architecture -
Getting a Larger User-Mode Partition in x86 Windows
I came across this:
In early versions of Windows, Microsoft didn't allow applications to
access their address space above 2 GB. So some creative developers
decided to leverage this and, in their code, they would use the high
bit in a pointer as a flag that had meaning only to their
applications. Then when the application accessed the memory address,
code executed that cleared the high bit of the pointer before the
memory address was used. Well, as you can imagine, when an application
runs in a user-mode environment greater than 2 GB, the application
fails in a blaze of fire.
I can't understand that. Can someone give an example to explain it for me? Thanks.
To access ~2GB of memory, you only need a 31 bit address. However, on 32 bit systems, addresses are 32 bit long and hence, pointers are 32 bit long.
As the book describes, in early versions of Windows developers could only use 2GB of memory; therefore, the highest bit in each 32-bit pointer could be used for other purposes, as it was ALWAYS zero. However, before using the address, this extra bit had to be cleared again, presumably so the program didn't crash because it tried to access an address higher than 2GB.
The code probably looked something like this:
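int val = 1;
int* p = &val;
// ...
// Use the highest bit of p as a flag, for some purpose. Bitwise operators
// don't apply to pointers directly, so go through an integer type
// (uintptr_t comes from <stdint.h>):
uintptr_t tagged = (uintptr_t)p | (1UL << 31);
// ...
// Then, before using the address in some way, the bit has to be cleared again:
p = (int*)(tagged & ~(uintptr_t)(1UL << 31));
*p = 3;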
Now, if you can be certain that your pointers will only ever point to an address where the most significant bit (MSB) is zero, i.e. in a ~2GB address space, this is fine. However, if the address space is increased, some pointers will have a 1 in their MSB and by clearing it, you set your pointer to an incorrect address in memory. If you then try to read from or write to that address, you will have undefined behavior and your program will most likely fail in a blaze of fire.
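The failure can be shown without touching real pointers by simulating 32-bit addresses as plain integers. This is a sketch under that assumption; TAG_BIT, set_tag, strip_tag and survives_round_trip are hypothetical names for the illustration.

```c
#include <assert.h>
#include <stdint.h>

#define TAG_BIT (UINT32_C(1) << 31)

static_assert(TAG_BIT == UINT32_C(0x80000000), "TAG_BIT must be bit 31");

/* Store a flag in the highest bit of a (simulated) 32-bit address. */
uint32_t set_tag(uint32_t addr)   { return addr | TAG_BIT; }

/* Clear the highest bit, as the old applications did before dereferencing. */
uint32_t strip_tag(uint32_t addr) { return addr & ~TAG_BIT; }

/* The round trip is lossless only when the original address was below 2 GB. */
int survives_round_trip(uint32_t addr) {
    return strip_tag(set_tag(addr)) == addr;
}
```

An address such as 0x00401000 comes back unchanged, while 0x80401000 comes back as 0x00401000, which is a different location entirely.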

How does a CPU know if an address in RAM contains an integer, a pre-defined CPU instruction, or any other kind of data?

The reason this gets me confused is that all addresses hold a sequence of 1's and 0's. So how does the CPU differentiate, let's say, 00000100(integer) from 00000100(CPU instruction)?
First of all, different commands have different values (opcodes). That's how the CPU knows what to do.
Finally, the questions remains: What's a command, what's data?
Modern PCs work with the von Neumann architecture ( https://en.wikipedia.org/wiki/John_von_Neumann), where data and opcodes are stored in the same memory space. (There are architectures that separate these two kinds of data, such as the Harvard architecture.)
Explaining everything in detail would be well beyond the scope of Stack Overflow; most likely the number of characters allowed per post would not be sufficient.
To answer the question with as few words as possible (Everyone actually working on this level would kill me for the shortcuts in the explanation):
Data in the memory is stored at certain addresses.
Each CPU instruction basically consists of 3 different addresses (NOT values - just addresses!):
Address of what to do
Address of a value
Address of an additional value
So, assuming an addition should be performed, and you have 3 addresses available in memory, the application would store the following (in the case of 5+7; I used "verbs" for the instructions):
Address | Stored Value
1 | ADD
2 | 5
3 | 7
Finally the CPU receives the instruction 1 2 3, which then means ADD 5 7 (These things are order-sensitive! [Command] [v1] [v2])... And now things are getting complicated.
The CPU will move these values (actually not the values, just the addresses of the values) into its registers and then process them. The exact registers chosen depend on data type, data size and opcode.
In the case of the command #1 #2 #3, the CPU will first read these memory addresses, then knowing that ADD 5 7 is desired.
Based on the opcode for ADD the CPU will know:
Put Address #2 into r1
Put Address #3 into r2
Read Memory-Value Stored at the address stored in r1
Read Memory-Value stored at the address stored in r2
Add both values
Write result somewhere in memory
Store Address of where I put the result into r3
Store Address stored in r3 into the Memory-Address stored in r1.
Note that this is simplified. Actually the CPU needs exact instructions on whether its handling a value or address. In Assembly this is done by using
eax (means value stored in register eax)
[eax] (means value stored in memory at the address stored in the register eax)
The CPU cannot perform calculations on values stored in the memory, so it is quite busy moving values From memory to registers and from registers to memory.
i.e. If you have
eax = 0x2
and in memory
0x2 = 110011
and the instruction
MOV ebx, [eax]
this means: move the value, currently stored at the address, that is currently stored in eax into the register ebx. So finally
ebx = 110011
(This is happening EVERYTIME the CPU does a single calculation!. Memory -> Register -> Memory)
Finally, the demanding application can read its predefined memory address #2,
resulting in address #2568, and then knows that the outcome of the calculation is stored at address #2568. Reading that address will result in the value 12 (5+7).
This is just a tiny tiny example of whats going on. For a more detailed introduction about this, refer to http://www.cs.virginia.edu/~evans/cs216/guides/x86.html
One cannot really grasp the amount of data movement and calculation done for a simple addition of 2 values. Doing what a CPU does (on paper) would take you several minutes just to calculate "5+7", since there is no "5" and no "7" - everything is hidden behind an address in memory, pointing to some bits, resulting in different values depending on what the bits at address 0x1 are instructing...
Short form: The CPU does not know what's stored there, but the instructions tell the CPU how to interpret it.
Let's have a simplified example.
If the CPU is told to add a word (let's say, a 32-bit integer) stored at the location X, it fetches the content of that address and adds it.
If the program counter reaches the same location, the CPU will again fetch this word and execute it as a command.
The CPU (other than security stuff like the NX bit) is blind to whether it's data or code.
The only way data doesn't accidentally get executed as code is by carefully organizing the code to never refer to a location holding data with an instruction meant to operate on code.
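A toy machine makes this visible. The sketch below is a made-up two-byte instruction format, not any real ISA; toy_run and the OP_ constants are hypothetical. The same byte (value 4) is used as an integer when an instruction reads it and as the ADD opcode when the program counter reaches it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { OP_HALT = 0x00, OP_LOAD = 0x01, OP_ADD = 0x04 };

/* Run a toy accumulator machine (hypothetical, not a real ISA).
   Each instruction is two bytes: opcode, then the address of its operand.
   Returns the accumulator when HALT, an unknown opcode, or the end of
   memory is reached. */
uint8_t toy_run(const uint8_t *mem, size_t size) {
    assert(mem != NULL);
    uint8_t acc = 0;
    size_t pc = 0;                       /* program counter */
    while (pc + 1 < size) {
        uint8_t opcode  = mem[pc];       /* this byte is treated as code */
        uint8_t operand = mem[pc + 1];
        if (operand >= size)
            break;
        if (opcode == OP_LOAD)
            acc = mem[operand];          /* that byte is treated as data */
        else if (opcode == OP_ADD)
            acc = (uint8_t)(acc + mem[operand]);
        else
            break;                       /* OP_HALT or unknown opcode */
        pc += 2;
    }
    return acc;
}
```

With memory {0x01, 0x02, 0x04, 0x02, 0x00, 0x00}, the first instruction loads the byte at address 2 as the number 4; the program counter then reaches address 2 and the very same byte is executed as ADD, leaving 8 in the accumulator.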
When a program is started, the processor starts executing it at a predefined spot. The author of a program written in machine language will have intentionally put the beginning of their program there. From there, that instruction will always end up setting the next location the processor will execute to somewhere there is an instruction. This continues to be the case for all of the instructions that make up the program, unless there is a serious bug in the code.
There are two main ways instructions can set where the processor goes next: jumps/branches, and not explicitly specifying. If the instruction doesn't explicitly specify where to go next, the CPU defaults to the location directly after the current instruction. Contrast that to jumps and branches, which have space to specifically encode the address of the next instruction's address. Jumps always jump to the place specified. Branches check if a condition is true. If it is, the CPU will jump to the encoded location. If the condition is false, it will simply go to the instruction directly after the branch.
Additionally, a machine language program should never write data to a location that is for instructions, or some other instruction at some future point in the program could try to run what was overwritten with data. Having that happen could cause all sorts of bad things to happen. The data there could have an "opcode" that doesn't match anything the processor knows how to do. Or, the data there could tell the computer to do something completely unintended. Either way, you're in for a bad day. Be glad that your compiler never messes up and accidentally inserts something that does this.
Unfortunately, sometimes the programmer using the compiler messes up, and does something that tells the CPU to write data outside of the area they allocated for data. (A common way this happens in C/C++ is to allocate an array L items long, and use an index >=L when writing data.) Having data written to an area set aside for code is what buffer overflow vulnerabilities are made of. Some program may have a bug that lets a remote machine trick the program into writing data (which the remote machine sent) beyond the end of an area set aside for data, and into an area set aside for code. Then, at some later point, the processor executes that "data" (which, remember, was sent from a remote computer). If the remote computer/attacker was smart, they carefully crafted the "data" that went past the boundary to be valid instructions that do something malicious. (To give them more access, destroy data, send back sensitive data from memory, etc).
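The usual defence against the array case is to check the index before writing. A minimal sketch, with checked_write being a hypothetical helper name:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Write `value` to buf[index] only if the index is inside the array.
   Returns false, and writes nothing, for an index >= len. */
bool checked_write(int *buf, size_t len, size_t index, int value) {
    assert(buf != NULL);
    if (index >= len)
        return false;    /* would land outside the area set aside for data */
    buf[index] = value;
    return true;
}
```

For an array of 4 items, index 3 is accepted and index 4 is rejected instead of overwriting whatever happens to follow the array.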
This is because an ISA must take into account what a valid set of instructions is and how to encode data: memory addresses, registers and literals.
See this for more general info on how an ISA is designed:
https://en.wikipedia.org/wiki/Instruction_set
In short, the operating system tells it where the next instruction is. In the case of x64 there is a special register called rip (instruction pointer) which holds the address of the next instruction to be executed. It will automatically read the data at this address, decode and execute it, and automatically increment rip by the number of bytes of the instruction.
Generally, the OS can mark regions of memory (pages) as holding executable code or not. If an error or exploit tries to modify executable memory an error should occur, similarly if the CPU finds itself trying to execute non-executable memory it will/should also signal an error and terminate the program. Now you're into the wonderful world of software viruses!

Memory, Stack and 64 bit

On an x86 system a memory location can hold 4 bytes (32 / 8) of data, therefore a single memory address in a 64-bit system can hold 8 bytes. When examining the stack in GDB, though, this doesn't appear to be the case. Example:
0x7fff5fbffa20: 0x00007fff5fbffa48 0x0000000000000000
0x7fff5fbffa30: 0x00007fff5fbffa48 0x00007fff857917e1
If I have this right then each hexadecimal pair (48) is a byte, thus the first memory address
0x7fff5fbffa20: is actually holding 16 bytes of data and not 8.
This has had me really confused and has for a while, so absolutely any input is vastly appreciated.
Short answer: on both x86 and x64 the minimum addressable entity is a byte: each "memory location" contains one byte, in each case. What you are seeing from GDB is only formatting: it is dumping 16 contiguous bytes, as the address increasing from ....20 to ....30, (on the left) indicates.
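This can be checked from C by measuring the distance between neighbouring 64-bit values in bytes. The sketch below is an illustration; byte_distance is a hypothetical helper name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Number of bytes between the addresses of two objects, measured by
   stepping through memory one byte (one address) at a time. */
ptrdiff_t byte_distance(const void *from, const void *to) {
    assert(from != NULL && to != NULL);
    return (const unsigned char *)to - (const unsigned char *)from;
}
```

For an array of two uint64_t values, like one row of the GDB dump, the second value starts 8 addresses after the first and the whole row covers 16 addresses, which matches the step from ....20 to ....30.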
Long answer: 32-bit or 64-bit is used to indicate many things in an architecture: almost always, it is the address size (how many bits are in an address = how much memory you can directly address; again, in bytes of memory). It also usually indicates the size of the registers, and also (but not always) the native word size.
That means that usually, even if you can address a single byte, the machine works "better" using data of different (longer) size. What "better" means is beyond the question; a little background, however, is good to understand some misconceptions about word size in the question.

Examining Erlang crash dumps - how to account for all memory?

I've been poring over this Erlang crash dump where the VM has run out of heap memory. The problem is that there is no obvious culprit allocating all that memory.
Using some serious black awk magic I've summed up the fields Stack+heap, OldHeap, Heap unused and OldHeap unused for each process and ranked them by memory usage. The problem is that this number doesn't come even close to the number that is representing the total memory for all the processes processes_used according to the Erlang crash dump guide.
I've already tried the Crashdump Viewer and either I'm missing something or there isn't much help there for my kind of problem.
The number I get is 525 MB whereas the processes_used value is at 1348 MB. Where can I find the rest of the memory?
Edit: The Heap unused and OldHeap unused shouldn't have been included since they are a sub-part of Stack+Heap and OldHeap, that plus the fact that the number displayed for Stack+Heap and OldHeap are listed as number of words, not bytes, was the problem.
There is a module called crashdump_viewer which is great for these kinds of analysis.
Another thing to keep in mind is that Heap+Stack is afaik in words, not bytes, which would mean that you have to multiply Heap+Stack by 4 on 32-bit and by 8 on 64-bit. I can't find a reference in the manual for this, but Processes talks about it a bit.
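The conversion itself is a single multiplication. A minimal sketch, assuming the word size is 4 or 8 bytes as described above; words_to_bytes is a hypothetical helper name:

```c
#include <assert.h>
#include <stdint.h>

/* Convert a size reported in machine words to bytes.
   word_size is 4 on a 32-bit emulator and 8 on a 64-bit one. */
uint64_t words_to_bytes(uint64_t words, unsigned word_size) {
    assert(word_size == 4 || word_size == 8);
    return words * word_size;
}
```

A per-process total summed in words therefore has to go through this conversion before it can be compared with a figure reported in bytes.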

Resources