What do the different columns in the "!heap -flt -s xxxx" windbg command represent - debugging

I've been doing some work on high memory issues, and I've been doing a lot of heap analysis in windbg, and I was curious what the different columns really mean in "!heap -flt -s xxxx" command.
I read What do the 'size' numbers mean in the windbg !heap output?, and I looked in my "Windows Internals" book, but I still had a bunch of questions. So the columns and my questions are below.
**HEAP_ENTRY** - What does this pointer really point to? How is it different than UserPtr?
**Size** - What does this size mean? How is it different than UserSize?
**Prev** - This just appears to be the negative offset to get to the previous heap entry. Still not sure exactly how it's used.
**Flags** - Is there any documentation on these flags?
**UserPtr** - What is the user pointer? In all cases I've seen it's always 8 bytes higher than the HEAP_ENTRY, but I don't really know what it points to.
**UserSize** - This appears to be the size of the actual allocation.
**state** - This just tells you what state of this heap entry is (free, busy, etc....)
Example:
HEAP_ENTRY Size Prev Flags UserPtr UserSize - state
0015eeb0 0044 0000 [07] 0015eeb8 00204 - (busy)

HEAP_ENTRY
Heaps store allocated blocks in contiguous Segments of memory, each allocated block starts with a 8-bytes header followed by the actual allocated data. The HEAP_ENTRY column is the address of the beginning of the header of the allocated block.
Size
The heap manager handles blocks in multiple of 8 bytes. The column is the number of 8 bytes chunk allocated. In your sample, 0044 means that the block takes 0x220 bytes (0x44*8).
Prev
Multiply per 8 to have the negative offset in bytes to the previous heap block.
Flags
This is a bitmask that encodes the following information
0x01 - HEAP_ENTRY_BUSY
0x02 - HEAP_ENTRY_EXTRA_PRESENT
0x04 - HEAP_ENTRY_FILL_PATTERN
0x08 - HEAP_ENTRY_VIRTUAL_ALLOC
0x10 - HEAP_ENTRY_LAST_ENTRY
UserPtr
This is the pointer returned to the application by the HeapAlloc (callbed by malloc/new) function. Since the header is always 8 bytes long, it is always HEAP_ENTRY +8.
UserSize
This is the size passed the HeapAlloc function.
state
This is a decoding of the Flags column, telling if the entry is busy, freed, last of its segment, …
Be aware that in Windows 7/2008 R2, heaps are by default using a front-end named LFH (Low fragmented heap) that uses the default heap manager to allocate chunks in which it dispatched user allocated data. For these heaps, UserPtr and UserSize will not point to real user data.
The output of !heap -s displays which heaps are LFH enabled.

From looking at the !heap documentation in the Debugging Tools for Windows help file and the heap docs on MSDN and a great excerpt from Advanced Windows Debugging, here's what I've been able to put together:
HEAP_ENTRY: pointer to entry within the heap. As you found, there is an 8 byte header which contains the data for the HEAP_ENTRY structure. The size of the HEAP_ENTRY structure is 8 bytes which defines the "heap granularity" size. This is used for determining the...
SIZE: size of the entry in terms of the granularity (i.e. the allocation size / 8)
FLAGS: these are defined in winbase.h with explanations found the in MSDN link.
USERPTR: the actual pointer to the allocated (or freed) object

Well, the main difference between HEAP_ENTRY and UserPtr is the due to the fact that heaps have to be indexed, allocated, filled with metadata (like the allocated length made available to user)... otherwise, how could you free(p) something without providing how many bytes were allocated? Same thing with the two size fields: one thing is how big the structure indexing the heap is, one thing is how big is the memory region made available to the user.
The FLAGS, in turn, basically specify which properties of the allocated memory block, if it is committed or just reserved, and, I guess, used by the kernel to rearrange or share memory regions if needed (but as nithins specifies they are documented in MSDN).
The PREV ptr is used to keep track of all the allocated regions and the first pointer is stored in the PEB structure so both user-space and kernel-space code is aware of the allocated heap pools.

Related

Relation between size of address bus and memory size; memory Segmentation in 8086

My question is related to memory segmentation in 8086. I learnt that,
8086 has a 20 bit address bus. And so it can address 2^20 different addresses. Which means it has an memory size of 2^20, i.e, 1MB.
I have a few doubts:
What I understand from the fact that 8086 has a 20 bit address bus is that it could have 2^20 different combinations of 0s and 1s, each of which represents one physical address. What I don't understand is that how does 2^20 different address locations mean 1 MB of addressable memory? How is total number of different addresses locations related to memory size (in Megabytes)?
Also, correct me if I'm wrong, the 16 bit segment registers in 8086 hold the starting address of the different segments in the memory (Code, Stack, Data, Extra).My question is, aren't the addresses in memory of 20 bits? Then how can the 16 bit register hold 20 bit addresses? If it contains the upper 16 bit of the 20 bit address, how does the processor make out to which exact address location it has to point?
P.S: I am a beginner is micro-processors and total reliant on self study, so kindly excuse if my questions seem a bit silly.
Thanks in advance.
For this question, its important to remember there is a different between the number of possible memory addresses and the amount of actual memory (RAM) installed in the system. For the 8086, memory addresses are 20-bits long as you note, so that means there are 2^20 possible memory addresses (which is exactly 1 MiB in size since 1 MiB is 1024 or 2^10 KiB and 1 KiB is 1024 or 2^10 Bytes). This does NOT mean the system has 1 MiB worth of RAM necessarily, it very likely has less but the most addresses the 8086 could possibly address is 1 MiB; so if nothing but RAM was in the address space, the most RAM it could possibly have is 1 MiB. Frequently, you might have gaps in the address space not filled with anything, some of the address space is used for ROM or other peripherals. So, that size of the address space is 1 MiB but that does not mean there is 1 MiB of RAM/memory in the system.
Correct, the segment registers are all 16-bits for the 8086. A memory address is created by combining the appropriate segment register with the argument (the argument being the result of whatever the addressing mode being used by the instruction) by adding the argument to the segment register's value shifted by 4 bits. So, if for example the ss is 0x1111, sp is at 0x2222 and you preform a push ax instruction, the 20-bit address to which the value is pushed is (ss << 4) + sp or 0x11110 + 0x02222 = 0x13332. More information can be found on Wikipedia under the Real Mode section: https://en.wikipedia.org/wiki/X86_memory_segmentation

How is byte addressing implemented in modern computers?

I have trouble understanding how in say a 32-bit computer byte addressing is achieved:
Is the ram itself byte addressable meaning the first byte has address 0 and the second 1 etc? In this case, wouldn't is take 4 read cycles to read a 32-bit word and waste the width of the data bus?
Or does the ram consist of 32-bit words meaning address 0 points to the first 4 bytes and address 2 points to bytes 5 to 8? In this case I would expect the ram interface to make byte addressing possible (from the cpu's point of view)
Think of RAM as 8 bit wide structure with N entries. N is often the size quoted when referring to memory (256 MB - 256M entries, 2GB - 2G entries etc, B is for bytes). When you access this memory, the smallest unit you can address is one of these entries which is 8 bits (1 byte). Since you can only access it at byte level, we call it byte addressable memory.
Now coming to your question about accessing this memory, we do not just access a byte. Most of the time, memory accesses are sent through caches which are there to reduce memory access latency. Caches store data at a higher granularity than a byte or word, normally it is multiple of words. In doing so, caches explore a property called "locality". Locality means, there is a high chance that we either access this data item or a near by data item very soon. So fetching not just the byte, but all the adjacent bytes is not a waste. Think of it as an investment for future, saves you multiple data fetches that you would have done otherwise.
Memory addresses in RAM start with 0th address and they are accessed using the registers with capacity of 8 bit register or 32 bit registers. Based on these registers the value from specific address is accessed by the CPU. If you really need to understand how it works, you will need to run couple of programs using Assembly language to navigate in the physical memory by reading the values directly using registers and register move commands.

Calculating Cache Memory Hit and Miss, and Calculating Rows in Cache

I am studying an old exam for an upcoming exam, and the final questions consist of what the title describes. Now, I am familiar with assembly language instructions and I somewhat know what the code means. But, what the exam question actually wants me to do is confusing. I would really appreciate if someone could explain this question.
The question:
I am given a cache-memory which has room for 512 bytes and every row is 8 bytes long. The memory is direct-mapped and an "address" is 32 bits long. Also, the cache-memory is empty from the start.
After that, I get some instructions and am supposed to explain if it becomes a cache-hit or cache-miss. It should also be assumed that the instructions are all sequential and all data that is added/modified in an instruction still exists for the next instruction.
The instructions I get are
movia r8, 0xBEDA12C4
ldw r10, 0( r8 )
ldw r11, 8( r8 )
stw r10, 16( r8 )
ldw r10, 24(r8)
ldw r18, 32(r8)
Now I would really appreciate if someone could explain the details to me:
The cache-memory has room for a total of 512 bytes. What is this? Is it the total memory the cache is able to store? Also, I heard from somewhere that this is how you calculate rows in cache. For example, 512 bytes of memory and every row is 16 bytes. 512/16 = 32 rows in cache. For this example 512/8 = 64 rows. Which one is it? What does this mean!?
It also states that every row is 16 bytes long. I've seen the example with TAG, ROW, BYTE where they try to illustrate the cache. But how do I understand the 16 bytes per row? At least it doesn't seem to take part of the length on TAG, ROW, BYTE. What is this for?
Direct-mapped cache. I understand this somewhat. It's just a big row of slots of order which are empty or not, yeah? I found some information on this here.
http://www.cs.umd.edu/class/sum2003/cmsc311/Notes/Memory/direct.html
*Updated link: https://web.archive.org/web/20150213025748/http://www.cs.umd.edu/class/sum2003/cmsc311/Notes/Memory/direct.html
Now to the main part. How do I calculate for each instruction if it will be a cache miss or hit? My guess is that the first instruction ought to be a miss, since the question said that the cache memory is empty from the start. The second instruction also must be a cache miss but from this point on I am not sure how to calculate if the instruction generates a cache hit or miss. To be honest, I am not even sure what a hit would be.
I would really appreciate if someone could show me how to calculate each step and how I know whether an instruction creates a cache hit or miss. The instructions we get for calculating this are really confusing. Thank you so much!
Generally you have to look at it as at a separate memory space, with only 512 bytes, addressable, readable and writable as arrays 8 bytes each. If you need byte 2, the address will be 0, you read the whole array and select byte 3 from it. If you need byte 8, the address will be 1, and you select byte 0 from the array. Such small memory have one huge advantage - it is fast. It alone can store the contents of some larger memory space, only first 512 bytes. If you store something to address 1 of larger memory space, it will go to that smaller memory instead, the address will become 0 and offset 1, internally for that small amount of memory. If you access beyond that, for example, 1000, you will have to wait more. In this case it would be just memory mapped "registers" - it would be actually faster and better in some cases, than "cache" - unfortunately for some reason processor makers generally won't let you use the cache in that way (probably marketing and support reasons - to sell other products as a separate market share, with higher price).
If you add some more space to each array to store some other value, you can store a part of address there. Without hardware support you could store there virtually anything, that second part is called tag. Now if you have some address fffff000, you can read the second space (assuming that you have the commands to do so), from address 0 - for simplicity and speed you can obtain the address from the primary memory space by masking all the bits except bits 3..8 and 0-2 (which are used to obtain offset in 8 byte array), and check the tag part from that address. One bit in that tag may be used to indicate whether there is something stored there, the other bits may be used to store the part of address from main memory. If you want to save something cached there, you set the bit indicating that the array is not "empty", and assign the upper bits of the main address there, and copy the 8 bytes from the main memory. Next time, before reading something within that range in memory, you read the tag part of the smaller memory array first, then decide whether to read from slow main memory, or from that smaller but faster part (and it would be cache hit).
If you write something with an address of (+-)x512 bytes in main memory, you would have to read the already mentioned array of 8 bytes, copy it into main memory, whole 8 bytes, and write what you want into the very same cell, and then modify the address with a new value. But you would lose the previous copy of your data in the smaller memory area (but faster). If you need the previous value again (any of those 8 bytes), you would have to copy it again from main memory (cache miss).
The same goes for all other arrays of that "cache" memory. So we have a sequence of cache checking, writing, reading and copying the data to or from main memory.
That is called 1 way associativity, for 2 ways there would be one more array (same) of 512 bytes, which can store different addresses though (with the step of 512 from main memory), the tags of those 2 arrays may be checked simultaneously, and if some array has the copy of that memory range it can return it instead of reading it from main memory. Without tag checking (extra cycles for that), the "cache" is essentially a small amount of memory.

Is VirtualAlloc alignment consistent with size of allocation?

When using the VirtualAlloc API to allocate and commit a region of virtual memory with a power of two size of the page boundary such as:
void* address = VirtualAlloc(0, 0x10000, MEM_COMMIT, PAGE_READWRITE); // Get 64KB
The address seems to always be in 64KB alignment, not just the page boundary, which in my case is 4KB.
The question is: Is this alignment reliable and prescribed, or is it just coincidence? The docs state that it is guaranteed to be on a page boundary, but does not address the behavior I'm seeing. I ask because I'd later like to take an arbitrary pointer (provided by a pool allocator that uses this chunk) and determine which 64KB chunk it belongs to by something similar to:
void* chunk = (void*)((uintptr_t)ptr & 0xFFFF0000);
The documentation for VirtualAlloc describes the behavior for 2 scenarios: 1) Reserving memory and 2) Committing memory:
If the memory is being reserved, the specified address is rounded down to the nearest multiple of the allocation granularity.
If the memory is already reserved and is being committed, the address is rounded down to the next page boundary.
In other words, memory is allocated (reserved) in multiples of the allocation granularity and committed in multiples of a page size. If you are reserving and committing memory in a single step, it will be be aligned at a multiple of the allocation granularity. When committing already reserved memory it will be aligned at a page boundary.
To query a system's page size and allocation granularity, call GetSystemInfo. The SYSTEM_INFO structure's dwPageSize and dwAllocationGranularity will hold the page size and allocation granularity, respectively.
This is entirely normal. 64KB is the value of SYSTEM_INFO.dwAllocationGranularity. It is a simple counter-measure against address space fragmentation, 4KB pages are too small. The memory manager will still sub-divide 64KB chunks as needed if you change page protection of individual pages within the chunk.
Use HeapAlloc() to sub-allocate. The heap manager has specific counter-measures against fragmentation.

Memory, Stack and 64 bit

On a x86 system a memory location can hold 4 bytes (32 / 8) of data, therefore a single memory address in a 64 bit system can hold 8 bytes per memory address. When examining the stack in GDB though this doesn't appear to be the case, example:
0x7fff5fbffa20: 0x00007fff5fbffa48 0x0000000000000000
0x7fff5fbffa30: 0x00007fff5fbffa48 0x00007fff857917e1
If I have this right then each hexadecimal pair (48) is a byte, thus the first memory address
0x7fff5fbffa20: is actually holding 16 bytes of data and not 8.
This has had me really confused and has for a while, so absolutely any input is vastly appreciated.
Short answer: on both x86 and x64 the minimum addressable entity is a byte: each "memory location" contains one byte, in each case. What you are seeing from GDB is only formatting: it is dumping 16 contiguous bytes, as the address increasing from ....20 to ....30, (on the left) indicates.
Long answer: 32bit or 64bit is used to indicate many things, in an architecture: almost always, is the addressable size (how many bits are in an address = how much memory you can directly address - again, bytes of memory). It also usually indicates the dimension of registers, and also (but not always) the native word size.
That means that usually, even if you can address a single byte, the machine works "better" using data of different (longer) size. What "better" means is beyond the question; a little background, however, is good to understand some misconceptions about word size in the question.

Resources