Suppose you have a 32-byte direct-mapped cache with 16 bytes per block, and it reads the address 0xd. What are the contents of the cache? I thought the fetcher would add 0xd to the cache in set 1 (so the second one), byte 6. But in my simulator, 0xc is then a hit, so obviously this is wrong! Any help is appreciated.
To determine which set an address is found in, you use:
(addr >> log2(block size)) & (number of sets - 1)
(Assumes block size and number of sets are both a power of 2, which they pretty much always are, so &(n-1) is the same as %n, a modulo)
For your case, it would be: (0xd >> 4) & 1, which is 0.
The block in set 0 contains bytes 0x0 through 0xf, so a reference to 0xc is a hit.
The reason why it contains bytes 0x0 through 0xf is that computer memory systems are usually designed to handle naturally aligned data more quickly than nonaligned data. A block of size N is naturally aligned if the address which starts the block has zeros in the low log2(N) bits of the address. So when address 0xd is referenced, the aligned block that contains it (with starting address 0x0) is loaded into the cache.
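To make the arithmetic concrete, here's a minimal C sketch (my own illustration, not from the question's simulator) that splits an address into offset, set index, and tag for this 32-byte cache with 16-byte blocks:

```c
#include <stdio.h>

#define BLOCK_SIZE 16   /* bytes per block             */
#define NUM_SETS    2   /* 32-byte cache / 16 per block */

int main(void)
{
    unsigned addr   = 0xd;
    unsigned offset = addr & (BLOCK_SIZE - 1);              /* byte within the block */
    unsigned set    = (addr / BLOCK_SIZE) & (NUM_SETS - 1); /* which set             */
    unsigned tag    = addr / (BLOCK_SIZE * NUM_SETS);       /* remaining high bits   */

    printf("addr 0x%x -> set %u, offset %u, tag 0x%x\n", addr, set, offset, tag);
    /* prints: addr 0xd -> set 0, offset 13, tag 0x0 */
    return 0;
}
```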
Related
I've heard that an intermediate page table contains the address of the next page table. But I've seen that it contains fewer bits than are actually required to address main memory, i.e. fewer bits than are needed to address the physical address space. So does that mean some bits are padded with 0s? Would that imply every page will start at some xxxx (followed by some number of 0s)?
Since pages have a minimum size and alignment the lower bits of a pointer to a page will always be zero and do not need to be stored.
For example, with a page size of 4096 (0x1000) bytes, addresses of these pages will always be a multiple of 4096 bytes (e.g. 0x1000, 0x2000, 0x728373000). Notice that the bottom 12 bits are always zero!
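A tiny sketch (mine, not from the question) of dropping and restoring those implicit zero bits with 4KB pages:

```c
#include <stdint.h>
#include <stdio.h>

#define PAGE_SHIFT 12   /* 4096-byte pages: the low 12 bits are always zero */

int main(void)
{
    uint64_t addr = 0x728373000ULL;     /* page-aligned address         */
    uint64_t pfn  = addr >> PAGE_SHIFT; /* what the page table stores   */
    uint64_t back = pfn << PAGE_SHIFT;  /* re-append the implicit zeros */

    printf("addr=0x%llx pfn=0x%llx restored=0x%llx\n",
           (unsigned long long)addr, (unsigned long long)pfn,
           (unsigned long long)back);
    return 0;
}
```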
I'm sorry if I made an error in posting this. Please let me know if I need to change anything.
I've received my computer architecture homework back and I missed this question. My professor's explanation didn't make sense to me, and I disagree with what he told me, so I am here asking what you guys think.
Here is the question:
A computer uses 16-bit memory addresses. Main memory is 512KB, and the cache is 1KB with 32B per block. Given each of the following mapping functions, calculate the number of bits in each field of the memory address.
Here is how I worked through the direct mapping part of the problem:
Cache memory: 1KB (2^10), 16-bit memory addresses (1 word = 2B) -> 1024B/2B = 512 words, 16 words per block (32B) -> 512/16 = 32 cache memory blocks.
Main memory: 512 KB (2^19), 16-bit memory addresses (1 word = 2B) -> 524288B/2B = 256K words, 16 words per block (32B) -> 256K/16 = 16384 or 16K main memory blocks.
I understand the word tag as such: 32B per block allows for 16 16-bit memory addresses per block. This (I believe) supports that: 1 word = 16 bits = 2B -> 32B/2B = 16 words in each block. Since 16 = 2^4, that means 4 bits for determining which word in the block, leaving 12 bits for the tag and block fields of the memory address.
Now, in order to map 16K main memory blocks directly into 32 cache memory blocks, there will have to be 512 main memory blocks mapped to each cache memory block (16K blocks / 32 cache blocks = 512).
Here is where I am confused. Doesn't this require 9 tag bits, as 2^9 = 512 (main memory blocks possibly mapped into one cache memory block)?
For the block bits, which point to a particular block in the cache, this requires 5 bits: 2^5 = 32 blocks in cache memory.
This would require 18 bits in the memory address.
Here is my professor's answer for this question:
2^5 = 32 -> 5 Word bits
(1KB)/(32B) = 32 blocks -> 5 Block bits
16 – 5 – 5 = 6 Tag bits
I did not realize I could simply subtract the required block and word bits to get the tag bits. But it still doesn't make sense to me. 2^6 = 64 blocks per cache block. 64*32 gives 2048. I can't wrap my head around this. Can someone please help?
Okay, the terminology that I learnt is slightly different, but the principle should be the same for this explanation.
So a cache will have multiple sets (sort of like cells). Each set will have 1 cache line (containing 1 block of data) under direct mapping, or multiple cache lines (each containing 1 block of data) under n-way associative mapping.
In mapping the main memory blocks to the cache, the main memory address (16 bits) is divided into 3 fields: tag bits, index bits and offset bits. A memory cell is 1 byte, and a block is made up of a number of cells.
Offset bits are used to access the individual bytes of a memory block. Think of it as the offset on top of the block base address to get the byte you want (I assume your memory is byte-addressable rather than word-addressable, as only being able to access whole 2B words would be inflexible). Your prof/textbook calls this the word field. Hence, if a block has 32 bytes, log2(block size) = 5 bits are needed to access the individual cells in the mapped block.
Index bits (called block bits too in a direct mapped cache, as the number of sets equals the number of blocks in the cache) are used to identify which set/cache line/cache block the main memory block is mapped to in the cache. There are 1KB/32B = 32 cache blocks in the cache. As direct mapping is used, each set contains only 1 cache block, and therefore there will be 32 sets in this cache. Thus, to access the correct set in the cache, 5 bits are needed, and therefore index bits = 5 bits.
The tag is used to determine whether the data block in the cache is the correct one we are looking for from main memory. As the main memory address is 16 bits and we already know the index and offset fields, it is easy to deduce that the tag will need 16 - 5 - 5 = 6 bits. How we determine the tag is not really a concern here, as the block size and cache size (and hence the number of sets in the cache) are given.
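As a sanity check of the professor's 5/5/6 split, here's a small sketch (my own, with an arbitrary example address) that pulls the three fields out of a 16-bit address:

```c
#include <stdint.h>
#include <stdio.h>

/* 32B blocks -> 5 offset bits; 32 cache blocks -> 5 index bits;
 * 16 - 5 - 5 = 6 tag bits. */
int main(void)
{
    uint16_t addr   = 0xABCD;               /* arbitrary example address */
    unsigned offset =  addr        & 0x1F;  /* low 5 bits                */
    unsigned index  = (addr >> 5)  & 0x1F;  /* next 5 bits               */
    unsigned tag    =  addr >> 10;          /* remaining 6 bits          */

    printf("offset=%u index=%u tag=%u\n", offset, index, tag);
    return 0;
}
```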
I've been doing some work on high memory usage issues, and I've been doing a lot of heap analysis in windbg, and I was curious what the different columns really mean in the "!heap -flt -s xxxx" command.
I read What do the 'size' numbers mean in the windbg !heap output?, and I looked in my "Windows Internals" book, but I still had a bunch of questions. So the columns and my questions are below.
**HEAP_ENTRY** - What does this pointer really point to? How is it different than UserPtr?
**Size** - What does this size mean? How is it different than UserSize?
**Prev** - This just appears to be the negative offset to get to the previous heap entry. Still not sure exactly how it's used.
**Flags** - Is there any documentation on these flags?
**UserPtr** - What is the user pointer? In all cases I've seen it's always 8 bytes higher than the HEAP_ENTRY, but I don't really know what it points to.
**UserSize** - This appears to be the size of the actual allocation.
**state** - This just tells you what the state of this heap entry is (free, busy, etc.)
Example:
```
HEAP_ENTRY  Size  Prev  Flags  UserPtr   UserSize  - state
0015eeb0    0044  0000  [07]   0015eeb8  00204     - (busy)
```
HEAP_ENTRY
Heaps store allocated blocks in contiguous Segments of memory; each allocated block starts with an 8-byte header followed by the actual allocated data. The HEAP_ENTRY column is the address of the beginning of the header of the allocated block.
Size
The heap manager handles blocks in multiples of 8 bytes. The column is the number of 8-byte chunks allocated. In your sample, 0044 means that the block takes 0x220 bytes (0x44*8).
Prev
Multiply by 8 to get the negative offset in bytes to the previous heap block.
Flags
This is a bitmask that encodes the following information
0x01 - HEAP_ENTRY_BUSY
0x02 - HEAP_ENTRY_EXTRA_PRESENT
0x04 - HEAP_ENTRY_FILL_PATTERN
0x08 - HEAP_ENTRY_VIRTUAL_ALLOC
0x10 - HEAP_ENTRY_LAST_ENTRY
UserPtr
This is the pointer returned to the application by the HeapAlloc function (called by malloc/new). Since the header is always 8 bytes long, it is always HEAP_ENTRY + 8.
UserSize
This is the size passed to the HeapAlloc function.
state
This is a decoding of the Flags column, telling if the entry is busy, freed, last of its segment, …
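Putting the columns together, here's a rough sketch (my own; the real _HEAP_ENTRY layout in ntdll is more involved) that decodes the sample row above using the rules from this answer:

```c
#include <stdint.h>
#include <stdio.h>

/* Flag bits as listed above */
#define HEAP_ENTRY_BUSY          0x01
#define HEAP_ENTRY_EXTRA_PRESENT 0x02
#define HEAP_ENTRY_FILL_PATTERN  0x04
#define HEAP_ENTRY_VIRTUAL_ALLOC 0x08
#define HEAP_ENTRY_LAST_ENTRY    0x10

int main(void)
{
    /* values from the sample row: 0015eeb0 0044 0000 [07] 0015eeb8 00204 */
    uintptr_t heap_entry = 0x0015eeb0;
    unsigned  size_units = 0x0044;          /* counted in 8-byte granules */
    unsigned  flags      = 0x07;

    printf("block size: 0x%x bytes\n", size_units * 8);   /* 0x220 */
    printf("UserPtr   : 0x%lx\n", (unsigned long)(heap_entry + 8));
    printf("state     : %s\n", (flags & HEAP_ENTRY_BUSY) ? "busy" : "free");
    return 0;
}
```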
Be aware that in Windows 7/2008 R2, heaps by default use a front end named LFH (Low Fragmentation Heap) that uses the default heap manager to allocate chunks into which it dispatches user-allocated data. For these heaps, UserPtr and UserSize will not describe the real user data.
The output of !heap -s displays which heaps are LFH enabled.
From looking at the !heap documentation in the Debugging Tools for Windows help file and the heap docs on MSDN and a great excerpt from Advanced Windows Debugging, here's what I've been able to put together:
HEAP_ENTRY: pointer to entry within the heap. As you found, there is an 8 byte header which contains the data for the HEAP_ENTRY structure. The size of the HEAP_ENTRY structure is 8 bytes which defines the "heap granularity" size. This is used for determining the...
SIZE: size of the entry in terms of the granularity (i.e. the allocation size / 8)
FLAGS: these are defined in winbase.h with explanations found in the MSDN link.
USERPTR: the actual pointer to the allocated (or freed) object
Well, the main difference between HEAP_ENTRY and UserPtr is due to the fact that heaps have to be indexed, allocated, and filled with metadata (like the allocated length made available to the user)... otherwise, how could you free(p) something without providing how many bytes were allocated? Same thing with the two size fields: one is how big the structure indexing the heap is, the other is how big the memory region made available to the user is.
The FLAGS, in turn, basically specify the properties of the allocated memory block: whether it is committed or just reserved, and, I guess, they are used by the kernel to rearrange or share memory regions if needed (but as nithins specifies, they are documented in MSDN).
The PREV ptr is used to keep track of all the allocated regions, and the first pointer is stored in the PEB structure so that both user-space and kernel-space code are aware of the allocated heap pools.
http://www.alex-ionescu.com/?p=50
I read the above post. The author explains why Windows x64 supports only 44-bit virtual memory addresses, using a singly linked list as the example.
```c
struct { // 8-byte header
    ULONGLONG Depth:16;
    ULONGLONG Sequence:9;
    ULONGLONG NextEntry:39; // 16 + 9 + 39 = 64 bits total
} Header8;
```
The first sacrifice to make was to reduce the space for the sequence number to 9 bits instead of 16 bits, reducing the maximum sequence number the list could achieve. This still only left 39 bits for the pointer — a mediocre improvement over 32 bits. By forcing the structure to be 16-byte aligned when allocated, 4 more bits could be won, since the bottom bits could now always be assumed to be 0.
I can't understand this. What does "By forcing the structure to be 16-byte aligned when allocated, 4 more bits could be won, since the bottom bits could now always be assumed to be 0" mean?
16 is 0010000 in binary
32 is 0100000 in binary
64 is 1000000 in binary
etc
You can see that for all numbers that are multiples of 16, the last four bits are always zero.
So, instead of storing these bits you can leave them out and add them back in when it's time to use the pointer.
For a 2^N-byte aligned pointer, its address is always divisible by 2^N, which means that the lower N bits are always zero. You can store additional information in them:

```c
uintptr_t encode(uintptr_t ptr, uintptr_t payload)  { return ptr | payload; }
uintptr_t decode_ptr(uintptr_t data)                { return data & ~mask;  }
uintptr_t decode_payload(uintptr_t data)            { return data & mask;   }
```

where mask is (1 << N) - 1, i.e. a number with the low N bits set.
This trick is often used to save space in low-level code (payload can be GC flags, type tag, etc.)
In effect you're not storing a pointer, but a number from which a pointer can be extracted. Of course, care should be taken to not dereference the number as a pointer without decoding.
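A complete, compilable version of the trick (my own sketch) with N = 4, i.e. the 16-byte alignment from the Header8 example:

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

enum { N = 4 };                          /* 16-byte alignment */
static const uintptr_t mask = ((uintptr_t)1 << N) - 1;

static uintptr_t encode(uintptr_t ptr, uintptr_t payload) { return ptr | payload; }
static uintptr_t decode_ptr(uintptr_t data)     { return data & ~mask; }
static uintptr_t decode_payload(uintptr_t data) { return data & mask;  }

int main(void)
{
    void *p = aligned_alloc(1 << N, 64);        /* C11: low 4 bits of p are zero */
    uintptr_t data = encode((uintptr_t)p, 0x9); /* stash a 4-bit payload         */

    printf("pointer survives: %d, payload: 0x%lx\n",
           decode_ptr(data) == (uintptr_t)p,
           (unsigned long)decode_payload(data));
    free(p);
    return 0;
}
```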
The usual answers to why data is aligned are that aligned data can be accessed more efficiently and that alignment simplifies the design of the CPU.
A relevant question and its answers are here. And another source is here. But neither resolves my question.
Suppose a CPU has an access granularity of 4 bytes. That means the CPU reads 4 bytes at a time. The materials I listed above both say that if I access misaligned data, say at address 0x1, then the CPU has to do 2 accesses (one from addresses 0x0, 0x1, 0x2 and 0x3, one from addresses 0x4, 0x5, 0x6 and 0x7) and combine the results. I can't see why. Why can't the CPU just read data from 0x1, 0x2, 0x3, 0x4 when I ask for address 0x1? It will not degrade the performance and incur much complexity in circuitry.
Thank you in advance!
It will not degrade the performance and incur much complexity in circuitry.
It's the false assumptions we take as fact that really throw off further understanding.
Your comment in the other question used much more appropriate wording ("I don't think it would degrade"...)
Did you consider that the memory architecture uses many memory chips in parallel in order to maximize the bandwidth? And that a particular data item is in only one chip? You can't just read whichever chip happens to be most convenient and expect it to have the data you want.
Right now, the CPU and memory can be wired together such that bits 0-7 are wired only to chip 0, 8-15 to chip 1, 16-23 to chip 2, 24-31 to chip 3. And for all integers N, memory location 4N is stored in chip 0, 4N+1 in chip 1, etc. And it is the Nth byte in each of those chips.
Let's look at the memory addresses stored at each offset of each memory chip:

```
          memory chip
offset     0     1     2     3
  0        0     1     2     3
  1        4     5     6     7
  2        8     9    10    11
  N       4N  4N+1  4N+2  4N+3
```
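To make that mapping concrete, a trivial sketch (my own) computing which chip and offset hold each byte address:

```c
#include <stdio.h>

/* Four chips in parallel: byte addr lives in chip (addr % 4) at offset (addr / 4). */
int main(void)
{
    for (unsigned addr = 0; addr < 8; addr++)
        printf("byte %u -> chip %u, offset %u\n", addr, addr % 4, addr / 4);
    return 0;
}
```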
So if you load from memory bytes 0-3, N=0, each chip reports its internal byte 0, the bits all end up in the right places, and everything is great.
Now, if you try to load a word starting at memory location 1, what happens?
First, let's look at the way it is done. Memory bytes 1-3, which are stored in memory chips 1-3 at offset 0, end up in bits 8-31, because that's where those memory chips are attached, even though you asked for them to be in bits 0-23. This isn't a big deal because the CPU can swizzle them internally, using the same circuitry used for logical shift left. Then on the next transaction memory byte 4, which is stored in memory chip 0 at offset 1, gets read into bits 0-7 and swizzled into bits 24-31 where you wanted it to be.
Notice something here. The word you asked for is split across offsets: the first memory transaction read from offset 0 of three chips, the second memory transaction read from offset 1 of the other chip. Here's where the problem lies. You have to tell the memory chips the offset so they can send you the right data back, and the offset is ~40 bits wide and the signals are VERY high speed. Right now there is only one set of offset signals that connects to all the memory chips; to do a single transaction for an unaligned memory access, you would need an independent offset (this is called the address bus, BTW) running to each memory chip. For a 64-bit processor, you'd change from one address bus to eight, an increase of almost 300 pins. In a world where CPUs use between 700 and 1300 pins, this can hardly be called "not much increase in circuitry". Not to mention the huge increase in noise and crosstalk from that many extra high-speed signals.
Ok, it isn't quite that bad, because there can only be a maximum of two different offsets out on the address bus at once, and one is always the other plus one. So you could get away with one extra wire to each memory chip, saying in effect either (read the offset listed on the address bus) or (read the offset following) which is two states. But now there's an extra adder in each memory chip, which means it has to calculate the offset before actually doing the memory access, which slows down the maximum clock rate for memory. Which means that aligned access gets slower if you want unaligned access to be faster. Since 99.99% of access can be made aligned, this is a net loss.
So that's why unaligned access gets split into two steps. Because the address bus is shared by all the bytes involved. And this is actually a simplification, because when you have different offsets, you also have different cache lines involved, so all the cache coherency logic would have to double to handle twice the communication between CPU cores.
In my opinion that's a very simplistic assumption. The circuitry could involve many layers of pipelining and caching optimisation to ensure that certain bits of memory are read. Also, the memory reads are delegated to memory subsystems that may be built from components with orders-of-magnitude differences in performance and design complexity, and that read in ways quite different from what you might think.
However I do add the caveat that I'm not a cpu or memory designer so I could be talking a crock.
The answer to your question is in the question itself.
The CPU has access granularity of 4 bytes. So it can only slurp up data in chunks of 4 bytes.
If you had accessed the address 0x0, the CPU would give you the 4 bytes from 0x0 to 0x3.
When you issue an instruction to access data from address 0x1, the CPU takes that as a request for 4 bytes of data starting at 0x1 (i.e. 0x1 to 0x4). This can't be interpreted in any other way, essentially because of the granularity of the CPU. Hence, the CPU slurps up data from 0x0 to 0x3 and from 0x4 to 0x7 (ergo, 2 accesses), then puts the data from 0x1 to 0x4 together as the final result.
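In software terms, the combining step looks roughly like this sketch (mine; a little-endian emulation of the two aligned accesses described above):

```c
#include <stdint.h>
#include <stdio.h>

/* Assemble an aligned 32-bit word from four bytes, little-endian. */
static uint32_t aligned_load(const uint8_t *mem, uint32_t addr)
{
    return (uint32_t)mem[addr]
         | (uint32_t)mem[addr + 1] << 8
         | (uint32_t)mem[addr + 2] << 16
         | (uint32_t)mem[addr + 3] << 24;
}

/* Emulate a CPU with 4-byte granularity: an unaligned load is two
 * aligned loads plus a combine/swizzle. */
static uint32_t load32(const uint8_t *mem, uint32_t addr)
{
    uint32_t base  = addr & ~3u;       /* round down to the word boundary */
    uint32_t shift = (addr & 3u) * 8;  /* how far into the word we start  */

    if (shift == 0)
        return aligned_load(mem, base);          /* aligned: one access */

    uint32_t lo = aligned_load(mem, base);       /* first access        */
    uint32_t hi = aligned_load(mem, base + 4);   /* second access       */
    return (lo >> shift) | (hi << (32 - shift)); /* combine the pieces  */
}

int main(void)
{
    uint8_t mem[8] = {0x00, 0x11, 0x22, 0x33, 0x44, 0x55, 0x66, 0x77};
    printf("0x%08x\n", load32(mem, 1));   /* bytes 0x1..0x4 -> 0x44332211 */
    return 0;
}
```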
Addressing 4 bytes with the first byte misaligned at 0x1 rather than 0x0 means the access does not start on a word boundary and spills over into the next adjacent word. The first access grabs the 3 bytes up to the word boundary (assuming a 32-bit word), and then the second access grabs byte 0x4, completing the 4-byte 32-bit word of the memory addressing implementation. The object code or assembler effectively does the second access and the concatenation transparently for the programmer. It is best to keep to word boundaries when possible, usually in units of 4 bytes.