Page boundaries when implementing a memory pool - Windows

I have decided to reinvent the wheel for the millionth time and write my own memory pool. My only question is about page size boundaries.
Let's say a GetSystemInfo() call tells me that the page size is 4096 bytes. Now, I want to preallocate a memory area of 1MB (it could be smaller, or larger), and divide this area into 128-byte blocks. HeapAlloc()/VirtualAlloc() will have an overhead between 8 and 16 bytes, I guess. It might be more; I've read posts talking about 60 bytes.
The question is, do I need to take care that none of my 128-byte blocks crosses a page boundary?
Do I simply allocate 1MB in one chunk and divide it into my block size?
Or should I allocate many blocks of, say, 4000 bytes (to take the HeapAlloc() overhead into account), sub-divide these 4000 bytes into 128-byte blocks (4000 / 128 = 31 blocks of 128 bytes each) and not use the remaining bytes at all (4000 - 31x128 = 32 bytes in this example)?

Having a block cross a page boundary isn't a huge deal. It just means that if you try to access that block and it's completely swapped out, you'll get two page faults instead of one. The more important thing to worry about is the alignment of the block.
If you're using your small block to hold a structure that contains native types longer than 1 byte, you'll want to align it, otherwise you face potentially abysmal performance that will outweigh any performance gains you may have made by pooling.
The Windows pooling function ExAllocatePool describes its behaviour as follows:
If NumberOfBytes is PAGE_SIZE or greater, a page-aligned buffer is
allocated. Memory allocations of PAGE_SIZE or less do not cross page
boundaries. Memory allocations of less than PAGE_SIZE are not
necessarily page-aligned but are aligned to 8-byte boundaries in
32-bit systems and to 16-byte boundaries in 64-bit systems.
That's probably a reasonable model to follow.
I'm generally of the opinion that larger is better when it comes to a pool. Within reason, of course, and depending on how you are going to use it. I don't see anything wrong with allocating 1 MB at a time (I've made pools that grow in 100 MB chunks). You want it to be worthwhile to have the pool in the first place, that is, to have enough data in the same contiguous region of memory that you can take full advantage of cache locality.
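To make that concrete, here is a minimal sketch (not the poster's code) of a fixed-size-block pool on Windows; it assumes VirtualAlloc, a 1 MB region and 128-byte blocks threaded onto a free list:

#include <windows.h>
#include <stddef.h>

#define POOL_SIZE  (1u << 20)   /* 1 MB region                        */
#define BLOCK_SIZE 128u         /* block size; divides the 4 KB page  */

typedef struct Block { struct Block *next; } Block;

static unsigned char *g_pool;
static Block *g_free_list;

/* Reserve and commit one page-aligned region, then chain the blocks. */
static int pool_init(void)
{
    g_pool = (unsigned char *)VirtualAlloc(NULL, POOL_SIZE,
                                           MEM_RESERVE | MEM_COMMIT,
                                           PAGE_READWRITE);
    if (!g_pool)
        return 0;
    for (size_t off = 0; off < POOL_SIZE; off += BLOCK_SIZE) {
        Block *b = (Block *)(g_pool + off);
        b->next = g_free_list;
        g_free_list = b;
    }
    return 1;
}

static void *pool_alloc(void)        /* O(1): pop a block off the free list */
{
    Block *b = g_free_list;
    if (b)
        g_free_list = b->next;
    return b;
}

static void pool_free(void *p)       /* O(1): push the block back on the list */
{
    Block *b = (Block *)p;
    b->next = g_free_list;
    g_free_list = b;
}

Because VirtualAlloc returns a page-aligned region and 128 divides 4096, every block is 128-byte aligned and no block straddles a page boundary.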

I've found out that if I use _aligned_malloc(), I don't need to worry about whether a sub-block spanning two pages makes any difference or not. An answer by Freddie to another thread (How to Allocate memory from a new virtual page in C?) also helped. Thanks Harry Johnston, I just wanted to use it as a memory pool object.
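For reference, a minimal sketch of that approach, assuming the MSVC CRT's _aligned_malloc/_aligned_free:

#include <malloc.h>   /* _aligned_malloc, _aligned_free (MSVC CRT) */

/* 1 MB pool whose start is aligned to the 128-byte block size, so every
   128-byte block carved out of it is itself 128-byte aligned.          */
static unsigned char *make_pool(void)
{
    return (unsigned char *)_aligned_malloc(1u << 20, 128);
    /* release later with _aligned_free(pool); */
}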

Related

How is byte addressing implemented in modern computers?

I have trouble understanding how byte addressing is achieved in, say, a 32-bit computer:
Is the RAM itself byte addressable, meaning the first byte has address 0 and the second 1, etc.? In that case, wouldn't it take 4 read cycles to read a 32-bit word and waste the width of the data bus?
Or does the RAM consist of 32-bit words, meaning address 0 points to the first 4 bytes and address 1 points to bytes 5 to 8? In that case I would expect the RAM interface to make byte addressing possible (from the CPU's point of view).
Think of RAM as an 8-bit-wide structure with N entries. N is often the size quoted when referring to memory (256 MB - 256M entries, 2 GB - 2G entries, etc.; B is for bytes). When you access this memory, the smallest unit you can address is one of these entries, which is 8 bits (1 byte). Since you can only access it at byte level, we call it byte-addressable memory.
Now, coming to your question about accessing this memory: we do not just access a byte. Most of the time, memory accesses are sent through caches, which are there to reduce memory access latency. Caches store data at a coarser granularity than a byte or word, normally a multiple of words. In doing so, caches exploit a property called "locality". Locality means there is a high chance that we access either this data item or a nearby data item again very soon. So fetching not just the byte but all the adjacent bytes is not a waste. Think of it as an investment for the future; it saves you multiple data fetches that you would have done otherwise.
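As a toy illustration of that locality argument (a sketch, not part of the original answer): summing a buffer sequentially reuses each fetched cache line for many accesses, while a 64-byte stride pays a new line fill on almost every access, so its cost per access is far higher.

#include <stdio.h>
#include <stdlib.h>

#define N (64u * 1024u * 1024u)   /* 64 MB working set, larger than the caches */

int main(void)
{
    unsigned char *buf = calloc(N, 1);
    unsigned long sum = 0;
    if (!buf)
        return 1;

    /* Sequential: 64 consecutive byte reads are served by one cache-line fill. */
    for (size_t i = 0; i < N; ++i)
        sum += buf[i];

    /* Strided by 64 bytes: every read lands on a different cache line. */
    for (size_t i = 0; i < N; i += 64)
        sum += buf[i];

    printf("%lu\n", sum);   /* keep the loops from being optimized away */
    free(buf);
    return 0;
}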
Memory addresses in RAM start at address 0, and the CPU accesses them through registers, whether 8-bit or 32-bit; based on those registers, the value at a specific address is read by the CPU. If you really need to understand how it works, write a couple of programs in assembly language that navigate memory, reading values directly using registers and register move instructions.

Why does the speed of memcpy() drop dramatically every 4KB?

I tested the speed of memcpy() and noticed that the speed drops dramatically at i*4KB. The result is as follows: the Y-axis is the speed (MB/second) and the X-axis is the size of the buffer for memcpy(), increasing from 1KB to 2MB. Subfigure 2 and Subfigure 3 detail the 1KB-150KB and 1KB-32KB ranges.
Environment:
CPU : Intel(R) Xeon(R) CPU E5620 @ 2.40GHz
OS : 2.6.35-22-generic #33-Ubuntu
GCC compiler flags : -O3 -msse4 -DINTEL_SSE4 -Wall -std=c99
I guess it must be related to caches, but I can't find a reason from the following cache-unfriendly cases:
Why is my program slow when looping over exactly 8192 elements?
Why is transposing a matrix of 512x512 much slower than transposing a matrix of 513x513?
The performance degradation in those two cases is caused by unfriendly loops which read scattered bytes into the cache, wasting the rest of each cache line.
Here is my code:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>

void memcpy_speed(unsigned long buf_size, unsigned long iters){
    struct timeval start, end;
    unsigned char * pbuff_1;
    unsigned char * pbuff_2;
    pbuff_1 = malloc(buf_size);
    pbuff_2 = malloc(buf_size);
    gettimeofday(&start, NULL);
    for(int i = 0; i < iters; ++i){
        memcpy(pbuff_2, pbuff_1, buf_size);
    }
    gettimeofday(&end, NULL);
    /* bytes copied / 1.024^2 / elapsed microseconds ~= MiB per second */
    printf("%5.3f\n", ((buf_size*iters)/(1.024*1.024))/((end.tv_sec -
           start.tv_sec)*1000*1000+(end.tv_usec - start.tv_usec)));
    free(pbuff_1);
    free(pbuff_2);
}
UPDATE
Considering the suggestions from @usr, @ChrisW and @Leeor, I redid the test more precisely and the graph below shows the results. The buffer size goes from 26KB to 38KB, and I tested it in 64B steps (26KB, 26KB+64B, 26KB+128B, ..., 38KB). Each test loops 100,000 times in about 0.15 seconds. The interesting thing is that the drop occurs not only exactly at 4KB boundaries, but also at 4*i+2 KB, with a much smaller amplitude.
PS
@Leeor offered a way to remove the drop: adding a 2KB dummy buffer between pbuff_1 and pbuff_2. It works, but I am not sure about Leeor's explanation.
Memory is usually organized in 4k pages (although there's also support for larger sizes). The virtual address space your program sees may be contiguous, but that's not necessarily the case in physical memory. The OS, which maintains a mapping of virtual to physical addresses (in the page map), would usually try to keep the physical pages together as well, but that's not always possible and they may be fragmented (especially after long usage, when they may occasionally be swapped out).
When your memory stream crosses a 4k page boundary, the CPU needs to stop and go fetch a new translation - if it has already seen the page, it may be cached in the TLB and the access is optimized to be the fastest, but if this is the first access (or if you have too many pages for the TLBs to hold on to), the CPU will have to stall the memory access and start a page walk over the page map entries - that's relatively long, as each level is in fact a memory read by itself (on virtual machines it's even longer, as each level may need a full page walk on the host).
Your memcpy function may have another issue - when first allocating memory, the OS would just add the pages to the page map, but mark them as unaccessed and unmodified due to internal optimizations. The first access may therefore not only invoke a page walk, but possibly also an assist telling the OS that the page is going to be used (and stored into, for the target buffer pages), which would take an expensive transition into some OS handler.
In order to eliminate this noise, allocate the buffers once, perform several repetitions of the copy, and calculate the amortized time. That, on the other hand, would give you "warm" performance (i.e. after the caches have warmed up), so you'll see the cache sizes reflected in your graphs. If you want a "cold" effect while not suffering from paging latencies, you might want to flush the caches between iterations (just make sure you don't time that).
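If you do want the "cold" numbers, a sketch of that flush idea, assuming an x86 compiler that provides _mm_clflush (the flushing itself must stay outside the timed region):

#include <stddef.h>
#include <emmintrin.h>   /* _mm_clflush, _mm_mfence (SSE2) */

/* Evict a buffer from all cache levels, one 64-byte line at a time. */
static void flush_buffer(const unsigned char *buf, size_t size)
{
    for (size_t i = 0; i < size; i += 64)
        _mm_clflush(buf + i);
    _mm_mfence();   /* ensure the flushes complete before the next timed copy */
}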
EDIT
I reread the question, and you seem to be doing a correct measurement. The problem with my explanation is that it should show a gradual increase after 4k*i, since on every such drop you pay the penalty again, but then you should enjoy the free ride until the next 4k. It doesn't explain why there are such "spikes" after which the speed returns to normal.
I think you are facing an issue similar to the critical-stride issue linked in your question - when your buffer size is a nice round 4k, both buffers will align to the same sets in the cache and thrash each other. Your L1 is 32k, so it doesn't seem like an issue at first, but assuming the data L1 is 8-way associative it in fact wraps around to the same sets every 4k, and you have 2*4k blocks with the exact same alignment (assuming the allocation was done contiguously), so they overlap on the same sets. It's enough that the LRU doesn't work exactly as you expect and you'll keep having conflicts.
To check this, I'd try to malloc a dummy buffer between pbuff_1 and pbuff_2, make it 2k large and hope that it breaks the alignment.
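The suggested experiment, as a sketch (the names are mine; whether the 2 KB spacer actually shifts pbuff_2 relative to pbuff_1 depends on the allocator):

#include <stdlib.h>

/* Allocate the two copy buffers with a 2 KB dummy block between them,
   hoping to break their mutual 4 KB alignment.                        */
static void alloc_offset_buffers(size_t buf_size, unsigned char **p1,
                                 unsigned char **p2, void **spacer)
{
    *p1     = malloc(buf_size);
    *spacer = malloc(2 * 1024);   /* dummy buffer; free it together with the others */
    *p2     = malloc(buf_size);
}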
EDIT2:
Ok, since this works, it's time to elaborate a little. Say you assign two 4k arrays at ranges 0x1000-0x1fff and 0x2000-0x2fff. Set 0 in your L1 will contain the lines at 0x1000 and 0x2000, set 1 will contain 0x1040 and 0x2040, and so on. At these sizes you don't have any issue with thrashing yet; they can all coexist without overflowing the associativity of the cache. However, every time you perform an iteration you have a load and a store accessing the same set - I'm guessing this may cause a conflict in the HW. Worse - you'll need multiple iterations to copy a single line, meaning that you have a congestion of 8 loads + 8 stores (fewer if you vectorize, but still a lot), all directed at the same poor set; I'm pretty sure there are a bunch of collisions hiding there.
I also see that the Intel optimization guide has something to say specifically about that (see 3.6.8.2):
4-KByte memory aliasing occurs when the code accesses two different
memory locations with a 4-KByte offset between them. The 4-KByte
aliasing situation can manifest in a memory copy routine where the
addresses of the source buffer and destination buffer maintain a
constant offset and the constant offset happens to be a multiple of
the byte increment from one iteration to the next.
...
loads have to wait until stores have been retired before they can
continue. For example at offset 16, the load of the next iteration is
4-KByte aliased current iteration store, therefore the loop must wait
until the store operation completes, making the entire loop
serialized. The amount of time needed to wait decreases with larger
offset until offset of 96 resolves the issue (as there is no pending
stores by the time of the load with same address).
I expect it's because:
When the block size is a 4KB multiple, then malloc allocates new pages from the O/S.
When the block size is not a 4KB multiple, then malloc allocates a range from its (already allocated) heap.
When the pages are allocated from the O/S then they are 'cold': touching them for the first time is very expensive.
My guess is that, if you do a single memcpy before the first gettimeofday then that will 'warm' the allocated memory and you won't see this problem. Instead of doing an initial memcpy, even writing one byte into each allocated 4KB page might be enough to pre-warm the page.
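A sketch of that pre-warming idea (my own helper name; it assumes a 4096-byte page and simply touches one byte per page before the timed region):

#include <stddef.h>

/* Write one byte into every 4 KB page so the OS maps the pages
   before the timed memcpy loop runs.                           */
static void prewarm(unsigned char *buf, size_t size)
{
    for (size_t off = 0; off < size; off += 4096)
        buf[off] = 0;
    if (size)
        buf[size - 1] = 0;   /* also touch the final, possibly partial, page */
}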
Usually when I want a performance test like yours I code it as:
// Run it once to pre-warm the cache
runTest();
// Repeat and time
startTimer();
for (int i = count; i; --i)
    runTest();
stopTimer();
// Use a larger count if the duration is less than a few seconds.
// Repeat the whole test 3 times to ensure that the results are consistent.
Since you are looping many times, I think arguments about pages not being mapped are irrelevant. In my opinion, what you are seeing is the effect of the hardware prefetcher being unwilling to cross a page boundary so as not to cause (potentially unnecessary) page faults.

Memory benchmark plot: understanding cache behaviour

I've tried every kind of reasoning I could possibly come up with, but I don't really understand this plot.
It basically shows the performance of reading and writing arrays of different sizes with different strides.
I understand that for a small stride like 4 bytes I read every cell in the cache, so I get good performance. But what happens when I have the 2 MB array and a 4k stride? Or the 4M array and a 4k stride? Why is the performance so bad? Finally, why is the performance decent when I have a 1MB array and the stride is 1/8 of the size, worse when the stride is 1/4 of the size, and then super good at half the size?
Please help me, this thing is driving me mad.
At this link, the code: https://dl.dropboxusercontent.com/u/18373264/membench/membench.c
Your code loops for a given time interval instead of a constant number of accesses, so you're not comparing the same amount of work, and not all cache sizes/strides enjoy the same number of repetitions (so they get a different chance at caching).
Also note that the second (internal) for loop will probably get optimized away, since you don't use temp anywhere.
EDIT:
Another effect in place here is TLB utilization:
On a 4k-page system, as you grow your stride while it's still < 4k, you'll enjoy less and less utilization of each page (finally reaching one access per page at the 4k stride), meaning growing access times as you'll have to access the 2nd-level TLB on each access (possibly even serializing your accesses, at least partially).
Since you normalize your iteration count by the stride size, you'll have in general (size / stride) accesses in your innermost loop, but * stride outside. However, the number of unique pages you access differs - for a 2M array and a 2k stride, you'll have 1024 accesses in the inner loop, but only 512 unique pages, so 512*2k accesses to the L2 TLB. At the 4k stride, there would still be 512 unique pages, but 512*4k L2 TLB accesses.
For the 1M array case, you'll have 256 unique pages overall, so the 2k stride would have 256*2k L2 TLB accesses, and the 4k stride would again have twice as many.
This explains both why there's a gradual performance drop on each line as you approach 4k, and why each doubling of the array size doubles the time for the same stride. The smaller array sizes may still partially enjoy the L1 TLB, so you don't see the same effect (although I'm not sure why 512k is there).
Now, once you start growing the stride above 4k, you suddenly start benefiting again since you're actually skipping whole pages. An 8K stride would access only every other page, taking half the overall TLB accesses of 4k for the same array size, and so on.
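A sketch of the access pattern being discussed (hypothetical helper, not the benchmark's actual code): one read every stride bytes, so the number of distinct 4 KB pages touched per pass is size/4096 while the stride is at most 4 KB, and size/stride once the stride exceeds the page size.

#include <stddef.h>

/* Walk `size` bytes with the given stride. Below a 4 KB stride several
   reads share each page; at 4 KB every read is on its own page; above
   4 KB whole pages are skipped, so TLB pressure drops again.           */
static unsigned long strided_pass(const unsigned char *arr,
                                  size_t size, size_t stride)
{
    unsigned long sum = 0;
    for (size_t i = 0; i < size; i += stride)
        sum += arr[i];
    return sum;
}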

How much memory do I need to have for 100 million records

How much memory do I need to load 100 million records into memory? Suppose each record needs 7 bytes. Here is my calculation:
each record = <int> <short> <byte>
4 + 2 + 1 = 7 bytes
needed memory in GB = 7 * 100,000,000 / 1,000,000,000 = 0.7 GB
Do you see any problem with this calculation?
With 100,000,000 records, you need to allow for overhead. Exactly what and how much overhead you'll have will depend on the language.
In C/C++, for example, fields in a structure or class are aligned onto specific boundaries. Details may vary depending on the compiler, but in general ints must begin at an address that is a multiple of 4, shorts at a multiple of 2, and chars can begin anywhere.
So assuming that your 4+2+1 means an int, a short, and a char, then if you arrange them in that order the useful data takes 7 bytes, but the next instance of the structure must begin at a 4-byte boundary (a struct as a whole is aligned to its most strictly aligned member, 4 bytes here), so you get 1 pad byte at the end of each instance and the struct occupies 8 bytes.
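A quick way to check that layout (a sketch; the struct name is mine): sizeof and offsetof report exactly the padding the compiler inserted.

#include <stdio.h>
#include <stddef.h>

struct record {
    int   i;   /* offset 0, 4 bytes */
    short s;   /* offset 4, 2 bytes */
    char  c;   /* offset 6, 1 byte, then 1 trailing pad byte */
};

int main(void)
{
    printf("sizeof(struct record) = %zu\n", sizeof(struct record));      /* typically 8 */
    printf("offsetof(record, c)   = %zu\n", offsetof(struct record, c)); /* typically 6 */
    return 0;
}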
Every time you allocate memory there's some overhead for the allocation block. The heap allocator has to keep track of how much memory was allocated and sometimes where the next block is. If you allocate all 100,000,000 records as one big "new" or "malloc", this overhead is trivial. But if you allocate each one individually, then every record carries the overhead. Exactly how much that is depends on the runtime, but on one system I used it was about 8 bytes per allocation. If that's the case, then here you'd need 16 bytes for each record: 8 bytes for the block header, 7 for data, 1 for padding. So it could easily take double what you expect.
Other languages will have different overhead. The easiest thing to do is probably to find out empirically: look up the system call that tells you how much memory you're using, check that value, allocate a million instances, check it again and take the difference.
If you really need just 7 bytes per structure, then you are almost right.
For memory measurements we usually use factors of 1024, so you would need
700,000,000 B / 1024² ≈ 667.57 MiB, i.e. 700,000,000 B / 1024³ ≈ 0.652 GiB

why is data structure alignment important for performance?

Can someone give me a short and plausible explanation for why the compiler adds padding to data structures in order to align its members? I know that it's done so that the CPU can access the data more efficiently, but I don't understand why this is so.
And if this is only CPU related, why is a double 4 byte aligned in Linux and 8 byte aligned in Windows?
Alignment helps the CPU fetch data from memory in an efficient manner: fewer cache misses/flushes, fewer bus transactions, etc.
Some memory types (e.g. RDRAM, DRAM, etc.) need to be accessed in a structured manner (aligned "words" and in "burst transactions", i.e. many words at a time) in order to yield efficient results. This is due to many things, amongst which:
setup time: the time it takes for the memory devices to access the memory locations
bus arbitration overhead, i.e. many devices might want access to the memory device
"Padding" is used to correct the alignment of data structures in order to optimize transfer efficiency.
In other words, accessing a "misaligned" structure will yield lower overall performance. A good example of such a pitfall: suppose a data structure is misaligned and requires the CPU/memory controller to perform 2 bus transactions (instead of 1) in order to fetch it; performance is consequently lower.
The CPU fetches data from memory in groups of 4 bytes (it actually depends on the hardware - it's 8 or other values for some types of hardware - but let's stick with 4 to keep it simple).
All is well if the data begins at an address that is divisible by 4: the CPU goes to the memory address and loads the data.
Now suppose the data begins at an address not divisible by 4 - say, for the sake of simplicity, at address 1. The CPU must fetch the data starting at address 0 and then apply some shifting to drop the byte at address 0 in order to get at the actual data at byte 1. This takes time and therefore lowers performance, so it is much more efficient to have all data addresses aligned.
A cache line is a basic unit of caching. Typically it is 16-64 bytes or more.
Pentium IV: 64 bytes; Pentium Pro/II: 32 bytes; Pentium I: 32 bytes; 486: 16 bytes.
myrandomreader:
; ...
; ten instructions to generate next pseudo-random
; address in ESI from previous address
; ...
MOV EAX, DS:[ESI] ; X: 4-byte read from the pseudo-random address
LOOP myrandomreader
For a memory read straddling two cache lines:
(for an L1 cache miss) the processor must wait for the whole of cache line 1 to be read from L2 into L1 before it can request the second cache line, causing a short execution stall
(for an L2 cache miss) the processor must wait for two burst reads from the L3 cache (if present) or from main memory to complete, rather than one
These processor stalls add up: a random 4-byte read will straddle a cache-line boundary about 5% of the time for 64-byte cache lines (it straddles whenever it starts in the last 3 bytes of a line, 3/64 ≈ 5%), 10% of the time for 32-byte lines and 20% for 16-byte ones.
There may be additional execution overheads for some instructions on misaligned data even when it lies within a cache line. This is discussed on the Intel website for some SSE instructions.
If you are defining the structures yourself, it may make sense to list all the <32-bit data fields together in a struct so that padding overhead is reduced, or alternatively to review whether it is better to turn packing on or off for a particular structure, as in the sketch below.
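For example, a sketch of turning packing on for a single structure (MSVC, GCC and Clang all accept #pragma pack; the packed form saves the padding but may cost misaligned accesses):

#include <stdio.h>

struct padded { char c; int i; };   /* typically 8 bytes: 3 pad bytes after c */

#pragma pack(push, 1)
struct packed { char c; int i; };   /* 5 bytes: no padding, i is misaligned   */
#pragma pack(pop)

int main(void)
{
    /* e.g. prints "8 5" on common 32/64-bit compilers */
    printf("%zu %zu\n", sizeof(struct padded), sizeof(struct packed));
    return 0;
}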
On MIPS and many other platforms you don't get the choice and must align - you get a kernel exception if you don't!
Alignment may also matter especially if you are doing I/O on the bus, if you are using atomic operations such as atomic increment/decrement, or if you wish to be able to port your code to non-Intel platforms.
On Intel-only (!) code, a common practice is to define one set of packed structures for network and disk, and another padded set for in-memory use, and to have routines to convert data between these formats (also consider "endianness" for the disk and network formats).
In addition to jldupont's answer, some architectures have load and store instructions (those used to read/write to and from memory) that only operate on word aligned boundaries - so, to load a non-aligned word from memory would take two load instructions, a shift instruction, and then a mask instruction - much less efficient!
