I'm attempting to use VsPerfCmd.exe to profile branch misprediction and last level cache misses in an instrumented native application.
The setup works as it says on the tin, but the results I'm getting don't seem sensible. For instance, a function that always touches a 24 MB data set is reported to cause only ~700 cache misses over ~2000 calls. To put this into perspective: the function linearly traverses two arrays of 1024*1024 12-byte elements each. For every element, it randomly decides whether it needs information from an element 1024 indices before or after it. That means in order to not generate any cache misses, the CPU would always have to keep at least three windows of 1024*12 bytes from each of these arrays in cache. Furthermore, after every iteration the process yields the CPU using sleep() for about 8 milliseconds. I can't imagine any hardware prefetcher doing that good a job.
How would this silly amount of data not generate more last level cache misses than VsPerfCmd says? Even though my i7 has 8MB of shared L3 cache, this seems highly unlikely. Can anyone share their opinions on what might be going on here? Of course "VsPerfCmd.exe sucks" would be a valid answer but if someone is going to say that, I'd like to at least hear of a similar experience someone had as a basis for this assertion.
Answering my own question -
So, after trying to verify the VsPerfCmd results using Intel VTune Amplifier XE™ (this is not advertising, I just like typing out product names like that because it amuses me how they can be so silly), I can definitely say that they are garbage.
That's just a rough comparison, as I haven't found out how to get the number of times a function was called from VTune, but approximately 900 calls resulted in about 1,040,000 last-level cache misses, according to VTune.
Contrasting that with the ~2000 calls profiled with VsPerfCmd and the reported ~700 LLC misses, it's safe to assume that the VTune results are much more reasonable.
Of course I can't say anything more specific than "VsPerfCmd was very likely wrong" - the whys and hows of this phenomenon remain unclear. Should anyone who knows more feel an urge to elaborate on this, shoot me a comment!
First off - the hardware LLC miss counter (let's call it that) does not in reality count LLC misses in your particular application. What it does is count all LLC misses and compare the count with a preset threshold (called SAV - sample after value; it is usually on the order of thousands or even millions). When the current count reaches the SAV, an interrupt is raised and whatever IP is being executed at that point is saved in the trace alongside the counter and a timestamp (this sampling is what keeps the trace size reasonable). If this IP points to an instruction in your module, then all of those cache misses are attributed to your module/function/instruction. So the resulting picture that you see is not exact, but rather statistically correct. I've not worked with VsPerfCmd, but what could help is for you to check the SAV it sets for LLC misses. If it's orders of magnitude larger than what VTune sets, then it could be that you're actually comparing 700,000 LLC misses with 1,040,000 LLC misses, which would make much more sense.
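To make that concrete, here is a tiny sketch of how a sampled count scales back into an event estimate. The SAV of 1000 is purely hypothetical - check what VsPerfCmd actually configures:

/* Hypothetical numbers: if the 700 "misses" VsPerfCmd reports are really
 * 700 samples taken with a sample-after value (SAV) of 1000, then the
 * underlying event count would be roughly samples * SAV. */
#include <stdio.h>

int main(void)
{
    unsigned long long samples = 700;    /* what the tool reported */
    unsigned long long sav     = 1000;   /* assumed sample-after value */
    printf("estimated LLC misses ~= %llu\n", samples * sav);  /* 700000 */
    return 0;
}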
And then there's the subject of your application's workload and working set. 3 x 1024 x 12 B is only 36 KB, which is nothing for an 8 MB LLC. If the algorithm is uniformly jumping back and forth, not always forwards or backwards, then only a small portion of those 24 MB is being used frequently at any one time, which means the hottest data will most likely fit in the LLC as well. Additionally, the CPU only sees memory in chunks called cache lines, which are 64 bytes long. So whenever your algorithm jumps forwards or backwards to access the next 12 bytes, the 52 neighboring bytes are loaded into L1 along with them, so if the next step after the jump is *(ptr++), it will not result in a cache miss. Yielding the CPU for 8 milliseconds should not affect cache performance unless you suspect that the thread scheduled for the next quantum is doing something else which is also memory-intensive, which would cause your cached data to be evicted. Otherwise, if it's just some OS thread progressing in the background touching a few bytes, massive cache eviction should not be happening.
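For illustration, here is a rough C model of the access pattern described in the question (names and layout are my guesses, not the OP's code); it makes the working-set arithmetic above concrete:

/* 12-byte element, as described in the question */
typedef struct { int a, b, c; } Elem;

/* Each iteration touches element i plus either element i-1024 or i+1024,
 * so the data that is "hot" at any moment is roughly a window of ~2048
 * elements per array: 2 arrays * 2048 * 12 B ~= 48 KB, tiny next to an
 * 8 MB LLC. And since a 64-byte line holds ~5 such elements, even a cold
 * linear sweep misses only about once per 5 elements. */
long walk(const Elem *x, const Elem *y, long n, const unsigned char *coin)
{
    long sum = 0;
    for (long i = 1024; i < n - 1024; i++) {
        long j = coin[i] ? i - 1024 : i + 1024;  /* random back/forward jump */
        sum += x[i].a + x[j].b + y[i].c + y[j].a;
    }
    return sum;
}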
I have an x86-64 Linux program which I am attempting to optimize via perf. The perf report shows the hottest instructions are scalar conversions from double to long with a memory argument, for example:
cvttsd2si (%rax, %rdi, 8), %rcx
which corresponds to C code like:
extern double *array;
long val = (long)array[idx];
(This is an unusual bottleneck but the code itself is very unusual.)
To inform optimizations I want to know if these instructions are hot because of the load from memory, or because of the arithmetic conversion itself. What's the best way to answer this question? What other data should I collect and how should I proceed to optimize this?
Some things I have looked at already. CPU counter results show 1.5% cache misses per instruction:
30686845493287 cache-references
2140314044614 cache-misses # 6.975 % of all cache refs
52970546 page-faults
1244774326560850 instructions
194784478522820 branches
2573746947618 branch-misses # 1.32% of all branches
52970546 faults
Top-down performance monitors show we are primarily backend-bound:
frontend bound retiring bad speculation backend bound
10.1% 25.9% 4.5% 59.5%
Ad-hoc measurement with top shows all CPUs pegged at 100% suggesting we are not waiting on memory.
A final note of interest: when run on AWS EC2, the code is dramatically slower (44%) on AMD vs Intel with the same core count. (Tested on Ice Lake 8375C vs EPYC 7R13). What could explain this discrepancy?
Thank you for any help.
To inform optimizations I want to know if these instructions are hot because of the load from memory, or because of the arithmetic conversion itself. What's the best way to answer this question?
I think there are two main reasons for this instruction to be slow. 1. There is a dependency chain and the latency of this instruction is a problem, since the processor is waiting on it to execute other instructions. 2. There is a cache miss (saturating the memory with such an instruction is improbable unless many cores are doing memory-based operations).
First of all, tracking what is going on for a specific instruction is hard (especially if the instruction is not executed very often). You need to use precise events to track the root of the problem, that is, events for which the exact instruction addresses that caused the event are available. Only a (small) subset of all events are precise ones.
Regarding (1), the latency of the instruction should be about 12 cycles on both architectures (and although it might be slightly higher on the AMD processor, I do not expect a 44% difference). The target processors are able to execute multiple instructions at the same time in a given cycle. Instructions are executed on different ports and are also pipelined. The port usage matters for understanding what is going on. This means all the instructions in the hot loop matter: you cannot isolate this specific instruction. Modern processors are insanely complex, so a basic analysis can be tricky. On Ice Lake processors, you can measure the average port usage with events like UOPS_DISPATCHED.PORT_XXX where XXX can be 0, 1, 2_3, 4_9, 5, 6, 7_8. Only the first three matter for this instruction. The EXE_ACTIVITY.XXX events may also be useful. You should check whether a port is saturated and which one. AFAIK, none of these events are precise, so you can only analyse a block of code (typically the hot loop). On Zen 3, the ports are FP23 and FP45. IDK what the useful events are on this architecture (I am not very familiar with it).
On Ice Lake, you can check the FRONTEND_RETIRED.LATENCY_GE_XXX events where XXX is a power-of-two integer (these should be precise ones, so you can see whether this instruction is the one impacting the events). This helps you see whether the front-end or the back-end is the limiting factor.
Regarding (2), you can check the latency of the memory accesses as well as the number of L1/L2/L3 cache hits/misses. On Ice Lake, you can use events like MEM_LOAD_RETIRED.XXX where XXX can be, for example, L1_MISS, L1_HIT, L2_MISS, L2_HIT, L3_MISS and L3_HIT. Still on Ice Lake, it may be useful to track the latency of the memory operations with MEM_TRANS_RETIRED.LOAD_LATENCY_GT_XXX where XXX is again a power-of-two integer.
You can also use LLVM-MCA to statically simulate the scheduling of the loop instructions on the target architecture (it does not consider branches). This makes it fairly easy to understand in depth what the scheduler can do.
What could explain this discrepancy?
The latency and reciprocal throughput should be about the same on the two platforms, or at least close. That being said, for the same core count the two certainly do not operate at the same frequency. If the difference is not coming from that, then I doubt this instruction alone is actually the problem (tricky scheduling issues, wrong/inaccurate profiling results, etc.).
CPU counter results show 1.5% cache misses per instruction
The thing is, the cache-misses event is certainly not very informative here. Indeed, it references last-level cache (L3) misses, so it does not give any information about L1/L2 misses (the events mentioned above do).
how should I proceed to optimize this?
If the code is latency bound, the solution is first to break any dependency chain in this loop. Unrolling the loop and rewriting it to make it more SIMD-friendly can help a lot to improve performance (the reciprocal throughput of this instruction is about 1 cycle as opposed to 12 cycles for the latency, so there is room for improvement in this case).
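As an illustration of the "break the dependency chain" idea, here is a hypothetical loop shape (the question does not show the real loop, so this is not the OP's code), unrolled with several independent accumulators:

/* Hypothetical loop shape. With a single accumulator, each addition depends
 * on the previous one, forming a loop-carried chain; unrolling with several
 * independent accumulators breaks that chain so the conversions and adds can
 * overlap and approach throughput limits rather than latency limits. */
long sum_converted(const double *array, const long *idx, long n)
{
    long a0 = 0, a1 = 0, a2 = 0, a3 = 0;
    long i = 0;
    for (; i + 4 <= n; i += 4) {
        a0 += (long)array[idx[i + 0]];
        a1 += (long)array[idx[i + 1]];
        a2 += (long)array[idx[i + 2]];
        a3 += (long)array[idx[i + 3]];
    }
    for (; i < n; i++)
        a0 += (long)array[idx[i]];
    return a0 + a1 + a2 + a3;
}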
If the code is memory bound, then you should care about data locality. Data should fit in the L1 cache if possible. There are many tips for doing so, but it is hard to guide you without more context. They include, for example, sorting data, reordering loop iterations, and using smaller data types.
There are many possible sources of weird, unexpected behaviour that can occur. If such a thing happens, then it is nearly impossible to understand what is going on without the exact code being executed. All details matter in this case.
Problem (think of the mark phase of a GC)
I have a graph of “objects” that I need to walk, visiting all objects.
I can store in each object if it has been visited.
All the objects are stored in memory and linked together using normal pointers.
The objects are not all the same size.
Sometimes there is not enough ram in the system to hold all the objects in memory at the same time, and I wish to avoid “page thrashing”.
I also wish to avoid TLB faults
Other times, there is more than enough ram.
I do not mind writing low-level code.
I do not mind different code for windows and linux.
The code must run in "user space" without needing non-standard permissions.
I don't care about the order I visit the nodes in.
I am going to ask more detailed questions about possible solutions, linking back to this question.
Page faults aren't necessarily bad, as long as they're not stalling your progress.
This means that if you have a node Node* p with two candidate successors p->left and p->right, it can be useful to pick the nearest (in terms of (char*)p - (char*)p->next) and pre-fetch the other (e.g. with PrefetchVirtualMemory).
How efficient this will be cannot be predicted; it greatly depends on your graph topology. But the prefetch is virtually free when you have enough RAM.
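A minimal sketch of that idea, assuming a hypothetical Node with left/right pointers, both non-null, and Windows 8+ for PrefetchVirtualMemory (error handling omitted):

#include <windows.h>

typedef struct Node { struct Node *left, *right; /* ...payload... */ } Node;

/* advisory hint: ask the OS to start paging this range in */
static void hint_prefetch(const void *p, SIZE_T bytes)
{
    WIN32_MEMORY_RANGE_ENTRY range;
    if (!p) return;
    range.VirtualAddress = (PVOID)p;
    range.NumberOfBytes  = bytes;
    PrefetchVirtualMemory(GetCurrentProcess(), 1, &range, 0);
}

/* visit the nearer successor next, prefetch the farther one */
static Node *pick_next(Node *p)
{
    ptrdiff_t dl = (char *)p->left  - (char *)p;
    ptrdiff_t dr = (char *)p->right - (char *)p;
    if (dl < 0) dl = -dl;
    if (dr < 0) dr = -dr;
    if (dl <= dr) { hint_prefetch(p->right, sizeof(Node)); return p->left;  }
    else          { hint_prefetch(p->left,  sizeof(Node)); return p->right; }
}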
Closer to the CPU, there's cache prefetching. Same idea, different storage.
Use 2M hugepages for address ranges that are full of "hot" data that the kernel can't usefully swap out any / many 4k chunks of. This will reduce TLB misses, but costs extra physical memory if there are any 4k chunks of a hugepage that aren't hot.
Linux does this transparently for anonymous pages (https://www.kernel.org/doc/Documentation/vm/transhuge.txt), but you can use madvise(MADV_HUGEPAGE) on pages you know are worth it, to encourage the kernel to defrag physical memory even if that's not the default in /sys/kernel/mm/transparent_hugepage/defrag. (You can look at /proc/PID/smaps to see how many transparent hugepages are in use for any given mapping.)
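A minimal sketch, assuming Linux with transparent hugepage support and a hypothetical allocator for a region you know is hot:

#include <stdlib.h>
#include <sys/mman.h>

#define HUGE_2M (2UL * 1024 * 1024)

/* allocate a 2M-aligned region and ask the kernel to back it with THP */
void *alloc_hot_region(size_t bytes)
{
    void *p = NULL;
    size_t rounded = (bytes + HUGE_2M - 1) & ~(HUGE_2M - 1);
    if (posix_memalign(&p, HUGE_2M, rounded) != 0)
        return NULL;
    madvise(p, rounded, MADV_HUGEPAGE);   /* advisory; safe to ignore failure */
    return p;
}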
Based on what you posted in your answer: An ordered set of nodesToVisit would give you the most locality, but might be too expensive to maintain. Multiple accesses within the same 64-byte cache line are much cheaper than coming back to it later after it's been evicted from L3 cache and has to come from DRAM again.
If you have lots of addresses to visit in your Set, doing one pass of a radix-sort into 2M buckets would give you locality within one hugepage. 2M is also smaller than L3 cache size, so you'll probably get some cache hits when visiting multiple objects in the same cache line, even if you don't hit them back to back.
Depending on how big your Set is, throwing around that many pointers even to partial-sort them might not be worth the memory traffic that takes. But there's probably some sweet spot of taking a window of data and at least partially sorting it. Using the pointers before they are evicted from cache is nice.
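A minimal sketch of the grouping idea, using qsort keyed on the 2 MiB bucket rather than an actual radix pass, just to show the locality effect (the flat pointer array is a stand-in for whatever your Set iterates over):

#include <stdint.h>
#include <stdlib.h>

/* order pointers by which 2 MiB region (address >> 21) they fall into;
 * order within a bucket doesn't matter */
static int by_2m_bucket(const void *a, const void *b)
{
    uintptr_t x = (uintptr_t)*(void *const *)a >> 21;
    uintptr_t y = (uintptr_t)*(void *const *)b >> 21;
    return (x > y) - (x < y);
}

void group_by_hugepage(void **ptrs, size_t n)
{
    qsort(ptrs, n, sizeof *ptrs, by_2m_bucket);
}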
SW prefetch can trigger a page-walk to avoid a TLB miss, so you could _mm_prefetch(_MM_HINT_T2) one address from the next 2M bucket before starting on the current bucket. See also Prefetching Examples?. I haven't tested this, but it might work well. It won't help with page faults: prefetch from an unmapped page won't cause a page fault, and you don't want to trigger an actual PF until you're ready to touch the page.
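A sketch of that prefetch, continuing the sorted-pointer idea from above (visit() is a placeholder for your marking code):

#include <stddef.h>
#include <stdint.h>
#include <xmmintrin.h>   /* _mm_prefetch */

void visit(void *obj);   /* hypothetical: whatever visiting/marking does */

void walk_sorted(void **ptrs, size_t n)
{
    size_t i = 0;
    while (i < n) {
        uintptr_t bucket = (uintptr_t)ptrs[i] >> 21;
        /* find where the next 2 MiB group starts */
        size_t j = i;
        while (j < n && ((uintptr_t)ptrs[j] >> 21) == bucket)
            j++;
        /* start the TLB/page walk for the next group early */
        if (j < n)
            _mm_prefetch((const char *)ptrs[j], _MM_HINT_T2);
        for (; i < j; i++)
            visit(ptrs[i]);
    }
}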
MSalter's suggestion to ask the OS to prefetch and wire the next page is interesting (I think madvise(MADV_WILLNEED) is the Linux equivalent), but a system call will be slow for no benefit if the page was already mapped+wired into the HW page table. There's no x86 asm instruction that just asks if a page is mapped without faulting if it isn't, so I can't think of a way to efficiently choose not to call it. And BTW, I think Linux breaks up transparent hugepages into 4k regular pages for paging in/out. But don't write a big loop that just does _mm_prefetch() or madvise on all the 4k pages in a 2M block; that probably sucks. The prefetcht2 part would probably just result in excess prefetch requests being dropped.
Use perf counters to look at cache hit/miss rates. On Intel CPUs, the mem_load_retired.l1_miss and/or .l2_miss event should show you whether you're getting cache hits on accessing the Set itself, as well as on accessing dereferencing those pointers. Those counters are precise events, so they should map accurately to asm load instructions. (e.g. perf record -e mem_load_retired.l2_miss ./my_program / perf report on Linux).
We remove one item at a time from nodesToVisit
I don't know much about GC design, but can't you use a sequence number or tagged-pointer or something to avoid modifying the Set data structure itself every GC pass? If your minimum object alignment is 4 bytes, you have 2 bits to play with at the bottom of every pointer. ANDing them off before dereferencing is very cheap.
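A minimal sketch of the low-bit tagging, assuming objects are at least 4-byte aligned so the bottom two bits of every stored pointer are free:

#include <stdint.h>

#define MARK_BIT ((uintptr_t)1)

/* strip the tag bits before dereferencing */
static inline void *untag(void *p)     { return (void *)((uintptr_t)p & ~(uintptr_t)3); }
static inline int   is_marked(void *p) { return ((uintptr_t)p & MARK_BIT) != 0; }
static inline void *with_mark(void *p) { return (void *)((uintptr_t)p | MARK_BIT); }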
x86-64 with full 64-bit pointers currently requires the top 16 to be the sign-extension of the low 48. So you could use bits there (16 bits, or maybe just the top byte) if you re-canonicalize pointers. (redo sign extension, or just zero the high 16 bits if you want to assume user-space pointers; Linux uses a high-half kernel VM layout so user-space addresses are always in the low half of virtual address space. IDK what Windows does.)
On x86-64, you might consider using the x32 ABI (32-bit pointers in long mode) if 4GiB of address space is enough, especially if you're hitting physical memory limits and swapping. Smaller pointers mean smaller data structures, thus half the cache footprint.
Some Linux systems are built without kernel support for x32, though, only classic x86-64 and usually 32-bit mode. But if it works on your systems, consider gcc -mx32.
These are my first thoughts about a possible solution, they are clearly not optimal. I will delete this answer if someone posts a better answer.
The basic method:
Assume we have a Set<NodePointer> nodesToVisit that contains all nodes we have not yet visited.
We remove one item at a time from nodesToVisit,
and if it has not been visited before we add all “pointers to other nodes” to nodesToVisit.
Improvements:
But we can clearly do better, by ordering nodesToVisit based on address, so that we are more likely to visit nodes that are contained in pages we have recently accessed. This could be as simple as having a second Set<NodePointer> nodesToVisitLater, and putting any node that has an address a long way from the current node into it.
Or we could skip over any nodes that are contained in pages not currently resident in memory, visiting those nodes after we have visited all nodes that are currently in memory.
(The "set" could just be a stack, as visiting a node more than once is a "no-op".)
https://patents.google.com/patent/US7653797B1/en seems to be related, but I have not read it yet.
https://hosking.github.io/links/Cher+2004ASPLOS.pdf
https://people.cs.umass.edu/~emery/pubs/cramm.pdf
https://people.cs.umass.edu/~emery/pubs/f034-hertz.pdf
https://people.cs.umass.edu/~emery/pubs/04-16.pdf
Given a cache size with constant capacity and associativity, for a given code to determine average of array elements, would a cache with higher block size be preferred?
[from comments]
Examine the code given below to compute the average of an array:
total = 0;
for (j = 0; j < k; j++) {
    sub_total = 0;  /* Nested loops to avoid overflow */
    for (i = 0; i < N; i++) {
        sub_total += A[j*N + i];
    }
    total += sub_total / N;
}
average = total / k;
Related: in the more general case of typical access patterns with some but limited spatial locality, larger lines help up to a point. These "Memory Hierarchy: Set-Associative Cache" (PowerPoint) slides by Hong Jiang and/or Yifeng Zhu (U. Maine) have a graph of AMAT (Average Memory Access Time) vs. block size showing a curve, and also breaking it down into miss penalty vs. miss rate (for a simple model, I think, with a simple in-order CPU that sucks at hiding memory latency, e.g. maybe not even pipelining multiple independent misses (miss under miss)).
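For context, the AMAT model such slides use is the standard textbook one:

    AMAT = hit time + miss rate * miss penalty

Larger blocks tend to lower the miss rate (more spatial locality captured per fill) but raise the per-miss penalty (a longer burst transfer per fill), which is what produces a curve with a sweet spot rather than a monotonic trend.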
There is a lot of good stuff in those slides, including a compiler-optimization section that mentions loop interchange (to fix nested loops with column-major vs. row-major order) and even cache blocking for more reuse. A lot of stuff on the Internet is crap, but I looked through these slides and they have some solid info on how caches are designed and what the tradeoffs are. The performance-analysis stuff is only really accurate for simple CPUs, not modern out-of-order CPUs that can overlap some computation with cache-miss latency, so that more, shorter misses is different from fewer, longer misses.
Specific answer to this question:
So the only workload you care about is a linear traversal of your elements? That makes cache line size nearly irrelevant for performance, assuming good hardware prefetching. (So larger lines mean less HW complexity and power usage for the same performance.)
With software prefetch, larger lines mean less prefetch overhead (although depending on the CPU design, that may not hurt performance if you still max out memory bandwidth.)
Without any prefetching, a larger line/block size would mean more hits following every demand-miss. A single traversal of an array has perfect spatial locality and no temporal locality. (Actually not quite perfect spatial locality at the start/end, if the array isn't aligned to the start of a cache line, and/or ends in the middle of a line.)
If a miss has to wait until the entire line is present in cache before the load that caused the miss can be satisfied, this slightly reduces the advantage of larger blocks. (But most of the latency of a cache miss is in the signalling and request overhead, not in waiting for the burst transfer to complete after it's already started.)
A larger block size means fewer requests in flight with the same bandwidth and latency, and limited concurrency is a real limiting factor in memory bandwidth in real CPUs. (See the latency-bound platforms part of this answer about x86 memory bandwidth: many-core Xeons with higher latency to L3 cache have lower single-threaded bandwidth than a dual or quad-core of the same clock speed. Each core only has 10 line-fill buffers to track outstanding L1 misses, and bandwidth = concurrency / latency.)
If your cache-miss handling has an early restart design, even that bit of extra latency can be avoided. (That's very common, but Paul says it's theoretically possible for a CPU design not to have it.) The load that caused the miss gets its data as soon as it arrives. The rest of the cache line fill happens "in the background", and hopefully later loads can also be satisfied from the partially-received cache line.
Critical word first is a related feature, where the needed word is sent first (for use with early restart), and the burst transfer then wraps around to transfer the earlier words of the block. In this case, the critical word will always be the first word, so no special hardware support is needed beyond early restart. (The U. Maine slides I linked above mention early restart / critical word first and point out that it decreases the miss penalty for large cache lines.)
An out-of-order execution CPU (or software pipelining on an in-order CPU) could give you the equivalent of HW prefetch by having multiple demand-misses outstanding at once. If the CPU "sees" the loads to another cache line while a miss to the current cache line is still outstanding, the demand-misses can be pipelined, again hiding some of the difference between larger or smaller lines.
If lines are too small, you'll run into a limit on how many outstanding misses for different lines your L1D can track. With larger lines or smaller out-of-order windows, you might have some "slack" when there's no outstanding request for the next cache line, so you're not maxing out the bandwidth. And you pay for it with bubbles in the pipeline when you get to the end of a cache line and the start of the next line hasn't arrived yet, because it started too late (while ALU execution units were using data from too close to the end of the current cache line.)
Related: these slides don't say much about the tradeoff of larger vs. smaller lines, but look pretty good.
The simplistic answer is that larger cache blocks would be preferred since the workload has no (data) temporal locality (no data reuse), perfect spatial locality (excluding the potentially inadequate alignment of the array for the first block and the insufficient size of the array for the last block, every part of every block of data will be used), and a single access stream (no potential for conflict misses).
A more nuanced answer would consider several factors. The size and alignment of the array: the fraction of the first and last cache blocks that will be unused, and what fraction of the memory transfer time that represents; for a 1 GiB array, even 4 KiB blocks would waste less than 0.0008% of the memory bandwidth. The ability of the system to use critical word first: if the array is of modest size and there is no support for early use of data as it becomes available rather than waiting for the entire block to be filled, then the start-up overhead will remove much of the prefetching advantage of larger cache blocks. The use of prefetching: software or hardware prefetching reduces the benefit of large cache blocks, and this workload is extremely friendly to prefetching. The configuration of the memory system: e.g., using DRAM with an immediate page-close controller policy would increase the benefit of larger cache blocks because each access would involve a row activate and row close, often to the same DRAM bank, preventing latency overlap. Whether the same block size is used for instructions and page table accesses, and whether these accesses share the cache: instruction accesses provide a second "stream" which can introduce conflict misses, and with shared caching of a two-level hierarchical page table, TLB misses would access two cache blocks. Whether simple way prediction is used: a larger block would increase prediction accuracy, reducing misprediction overhead. And perhaps other factors.
From your example code we can't say either way, as long as the hardware prefetcher can keep up a memory stream at maximum memory throughput.
In a random-access scenario a shorter cache line might be preferable, as you then don't need to fill the whole line. But the total amount of cached memory would go down, as you would need more circuitry for tags and potentially more time for comparing.
So a compromise must be made. Intel has chosen 64 bytes per line (and fetches 2 lines), while others have chosen 32 bytes per line.
Suppose I'm copying data between two arrays that are 1024000+1 bytes apart. Since the offset is not a multiple of word size, I'll need to do some misaligned accesses - either loads or stores (for the moment, let's forget that it's possible to avoid misaligned accesses entirely with some ORing and bit shifting). Which of misaligned loads or misaligned stores will be more expensive?
This is a hypothetical situation, so I can't just benchmark it :-) I'm more interested in what factors will lead to performance difference, if any. A pointer to some further reading would be great.
Thanks!
A misaligned write will need to read two destination words, merge in the new data, and write two words. This would be combined with an aligned read. So, 3R + 2W.
A misaligned read will need to read two source words, and merge the data (shift and bitor). This would be combined with an aligned write. So, 2R + 1W.
So, the misaligned read is a clear winner.
Of course, as you say there are more efficient ways to do this that avoid any mis-aligned operations except at the ends of the arrays.
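For completeness, here is a sketch of the shift-and-OR approach the question alludes to, using only aligned loads and aligned stores. It assumes a little-endian machine, a word-aligned destination, and a source buffer that extends one word past the copied range; head/tail handling is glossed over:

#include <stddef.h>
#include <stdint.h>

/* dst is word-aligned; the source data starts `off` bytes (0 < off < 8) past
 * the word-aligned pointer src_aligned. Each destination word is rebuilt
 * from two neighbouring aligned source words. */
void copy_shifted(uint64_t *dst, const uint64_t *src_aligned,
                  unsigned off, size_t nwords)
{
    unsigned lo = 8 * off, hi = 64 - lo;
    uint64_t prev = src_aligned[0];
    for (size_t i = 0; i < nwords; i++) {
        uint64_t next = src_aligned[i + 1];   /* aligned read */
        dst[i] = (prev >> lo) | (next << hi); /* merge = shift + OR */
        prev = next;
    }
}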
Actually that depends greatly on the CPU you are using. On newer Intel CPUs there is no penalty for loading and storing unaligned words (at least none that you can notice). Only if you load and store 16-byte or 32-byte unaligned chunks may you see a small performance degradation.
How much data? Are we talking about two things unaligned at the ends of a large block of data (in the noise), or one item (word, etc.) that is unaligned (100% of the data)?
Are you using a memcpy() to move this data, etc?
I'm more interested in what factors will lead to performance difference, if any.
Memories, modules, chips, on-die blocks, etc. are usually organized with a fixed access size; at least somewhere along the way there is a fixed access size. Let's just say 64 bits wide, not an uncommon size these days. So at that layer, wherever it is, you can only write or read in aligned 64-bit units.
If you think about a write vs. a read: with a read you send out an address, that has to go to the memory, and data comes back - a full round trip has to happen. With a write, everything you need to know to perform the write goes on the outbound path, so it is not uncommon to have a fire-and-forget type deal where the memory controller takes the address and data and tells the processor the write has finished even though the information has not yet reached the memory. It does take time, but not as long as a read (not talking about flash/PROMs, just RAM here), since a read requires both paths. So for aligned, full-width accesses a write CAN BE faster; some systems may wait for the data to make it all the way to the memory and then return a completion, which is perhaps about the same amount of time as the read. It depends on your system though; the memory technology can make one or the other faster or slower right at the memory itself. Now, the first write after nothing has been happening can do this fire-and-forget thing, but the second or third or fourth or 16th in a row eventually fills up a buffer somewhere along the path, and the processor has to wait for the oldest one to make it all the way to the memory before the most recent one has a place in the queue. So for bursty stuff writes may be faster than reads, but for large movements of data they approach each other.
Now, alignment. The whole memory width will be read on a read, in this case let's say 64 bits. If you were only really interested in 8 of those bits, then somewhere between the memory and the processor the other 56 bits are discarded; where depends on the system. Writes that are not a whole, aligned memory width mean that you have to read the width of the memory, let's say 64 bits, modify the new bits, say 8 bits, then write the whole 64 bits back: a read-modify-write. A read only needs a read; such a write needs a read-modify-write. The farther away from the memory the read-modify-write has to happen, the longer it takes and the slower it is; and no matter what, the read-modify-write can't be any faster than the read alone, so the read will be faster. The trimming of bits off the read generally won't take any extra time, so reading a byte compared to reading 16 or 32 or 64 bits from the same location, so long as the busses and destination are that width all the way, takes the same time from the same location, in general, or should.
Unaligned accesses simply multiply the problem. Say, worst case, you want to read 16 bits such that 8 bits are in one 64-bit location and the other 8 are in the next 64-bit location: you need to read 128 bits to satisfy that 16-bit read. How exactly that happens and how much of a penalty it is depends on your system. Some busses set up the transfer over X number of clocks, but the data is one clock per bus width after that, so a 128-bit read might be only one clock longer (than the dozens to hundreds of clocks it takes to read 64 bits), or, worst case, it could take twice as long in order to get the 128 bits needed for this 16-bit read. A write is a read-modify-write, so take the read time, then modify the two 64-bit items, then write them back; same deal, could be X+1 clocks in each direction or could be as bad as 2X clocks in each direction.
Caches help and hurt. A nice thing about using caches is that you can smooth out the transfers to the slow memory; you can let the cache worry about making sure all memory accesses are aligned and all writes are whole 64-bit writes, etc. How that happens, though, is that the cache performs same- or larger-sized reads. So reading 8 bits may result in one or many 64-bit reads of the slow memory for the first byte; if you then read the next byte location and that location is in the same cache line, it doesn't go out to slow memory, it reads from the cache, which is much faster, and so on until you cross over into another cache line or other reads cause that cache line to be evicted. If the location being written is in cache, then the read-modify-write happens in the cache; if it's not in cache, then it depends on the system - a write doesn't necessarily mean the read-modify-write causes a cache line fill, it could happen on the back side as if the cache were not there. Now, if you modified one byte in the cache line, that line eventually has to be written back; it simply cannot be discarded, so you have one to a few widths of the memory to write back as a result. Your modification was fast, but eventually the write happens to the slow memory and that affects the overall performance.
You could have situations where you do a (byte) read and the cache line, being bigger than the external memory width, makes that read slower than if the cache weren't there, but then you do a byte write to some item in that cache line and that is fast since it is in the cache. So you might have experiments that happen to show writes are faster.
A painful case would be reading, say, 16 bits unaligned such that they not only cross a 64-bit memory width boundary but also cross a cache line boundary, so that two cache lines have to be read; instead of reading 128 bits, that might mean 256 or 512 or 1024 bits have to be read just to get your 16.
The memory sticks in your computer, for example, are actually multiple memories: say maybe eight 8-bit-wide chips to make a 64-bit overall width, or sixteen 4-bit-wide chips to make an overall 64-bit width, etc. That doesn't necessarily mean you can isolate writes to one lane - maybe; I don't know those modules very well, but there are systems where you can/could do this - but I would consider those systems to be 8 or 4 bits wide as far as the smallest addressable size, not 64 bits, as far as this discussion goes. ECC makes things worse though. First you need an extra memory chip or more, basically more width, 72 bits to support 64 for example. You must do full writes with ECC, as the whole 72 bits (let's say) has to be self-checking, so you can't do fractions. If there is a correctable (single-bit) error, the read suffers no real penalty; it gets the corrected 64 bits (somewhere in the path where this checking happens). Ideally you want the system to write back that corrected value, but that is not how all systems work, so a read could turn into a read-modify-write, aligned or not. The primary penalty is that if you were able to do fractional writes before, you can't now: with ECC they have to be whole-width writes.
Now, back to my question: let's say you use memcpy to move this data. Many C libraries are tuned to do aligned transfers, at least where possible. If the source and destination are unaligned in different ways, that can be bad and you might want to manage part of the copy yourself. Say they are unaligned in the same way: the memcpy will copy the unaligned bytes first until it gets to an aligned boundary, then it shifts into high gear, copying aligned blocks until it gets near the end, then it downshifts and copies the last few bytes, if any, in an unaligned fashion. So if the memory copy you are talking about is thousands of bytes and the only unaligned stuff is near the ends, then yes, it will cost you some extra reads, as much as two extra cache line fills, but that may be in the noise. Even at smaller sizes, even if aligned on, say, 32-bit boundaries, if you are not moving whole cache lines or whole memory widths there may still be an extra cache line involved; aligned or not, you might only suffer an extra cache line's worth of reading and later writing...
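A stripped-down sketch of that head / aligned-body / tail shape (real memcpy implementations are far more elaborate, and the word-sized copies are shown with plain casts for clarity):

#include <stddef.h>
#include <stdint.h>

void copy_like_memcpy(unsigned char *dst, const unsigned char *src, size_t n)
{
    /* head: byte copies until dst reaches an 8-byte boundary */
    while (n && ((uintptr_t)dst & 7)) { *dst++ = *src++; n--; }
    /* body: aligned 8-byte copies (src is aligned too if both buffers
     * were misaligned by the same amount) */
    while (n >= 8) {
        *(uint64_t *)dst = *(const uint64_t *)src;
        dst += 8; src += 8; n -= 8;
    }
    /* tail: whatever is left */
    while (n--) *dst++ = *src++;
}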
The pure traditional, non-cached memory view of this, all other things held constant, is as Doug wrote. Unaligned reads across one of these boundaries, like the 16 bits across two 64-bit words, cost you an extra read: 2R vs. 1R. A similar write costs you 2R+2W vs. 1W, much more expensive. Caches and other things just complicate the problem greatly, making the answer "it depends"... You need to know your system pretty well and what other stuff is going on around it, if any. Caches help and hurt: with any cache a test can be crafted to show the cache makes things slower, and with the same system a test can be written to show the cache makes things faster.
For further reading, go look at the databooks/datasheets/technical reference manuals or whatever the vendor calls their docs for the various parts. For ARM, get the AXI/AMBA documentation on their busses and the cache documentation for their cache (the PL310, for example). Information on DDR memory, the individual chips used in the modules you plug into your computer, is all out there: lots of timing diagrams, etc. (Note: just because you think you are buying gigahertz memory, you are not; DRAM has not gotten faster in 10 years or more. It is pretty slow, around 133 MHz; it is just that the bus is faster and can queue more transfers. It still takes hundreds to thousands of processor cycles for a DDR memory cycle; read one byte that misses all the caches and your processor waits an eternity.) So memory interface docs for the processors and docs on various memories, etc. may help, along with textbooks on caches in general.
(apologies for the somewhat lengthy intro)
During development of an application which prefaults an entire large file (>400MB) into the buffer cache for speeding up the actual run later, I tested whether reading 4MB at a time still had any noticeable benefits over reading only 1MB chunks at a time. Surprisingly, the smaller requests actually turned out to be faster. This seemed counter-intuitive, so I ran a more extensive test.
The buffer cache was purged before running the tests (just for laughs, I did one run with the file in the buffers, too. The buffer cache delivers upwards of 2GB/s regardless of request size, though with a surprising +/- 30% random variance).
All reads used overlapped ReadFile with the same target buffer (the handle was opened with FILE_FLAG_OVERLAPPED and without FILE_FLAG_NO_BUFFERING). The harddisk used is somewhat elderly but fully functional, NTFS has a cluster size of 8kB. The disk was defragmented after an initial run (6 fragments vs. unfragmented, zero difference). For better figures, I used a larger file too, below numbers are for reading 1GB.
The results were really surprising:
4MB  x 256   : 5ms per request,    completion 25.8s @ ~40 MB/s
1MB  x 1024  : 11.7ms per request, completion 23.3s @ ~43 MB/s
32kB x 32768 : 12.6ms per request, completion 15.5s @ ~66 MB/s
16kB x 65536 : 12.8ms per request, completion 13.5s @ ~75 MB/s
So, this suggests that submitting tens of thousands of requests two clusters in length is actually better than submitting a few hundred large, contiguous reads. The submit time (time before ReadFile returns) goes up substantially as the number of requests goes up, but asynchronous completion time nearly halves.
Kernel CPU time is around 5-6% in every case (on a quad core, so one should really say 20-30%) while the asynchronous reads are completing, which is a surprising amount of CPU - apparently the OS does some non-negligible amount of busy waiting, too. 30% CPU for 25 seconds at 2.6 GHz - that's quite a few cycles for doing "nothing".
Any idea how this can be explained? Maybe someone here has a deeper insight of the inner workings of Windows overlapped IO? Or, is there something substantially wrong with the idea that you can use ReadFile for reading a megabyte of data?
I can see how an IO scheduler would be able to optimize multiple requests by minimizing seeks, especially when requests are random access (which they aren't!). I can also see how a harddisk would be able to perform a similar optimization given a few requests in the NCQ.
However, we're talking about ridiculous numbers of ridiculously small requests -- which nevertheless outperform what appears to be sensible by a factor of 2.
Sidenote: The clear winner is memory mapping. I'm almost inclined to add "unsurprisingly" because I am a big fan of memory mapping, but in this case, it actually does surprise me, as the "requests" are even smaller and the OS should be even less able to predict and schedule the IO. I didn't test memory mapping at first because it seemed counter-intuitive that it might be able to compete even remotely. So much for your intuition, heh.
Mapping/unmapping a view repeatedly at different offsets takes practically zero time. Using a 16MB view and faulting every page with a simple for() loop reading a single byte per page completes in 9.2 secs @ ~111 MB/s. CPU usage is under 3% (one core) at all times. Same computer, same disk, same everything.
It also appears that Windows loads 8 pages into the buffer cache at a time, although only one page is actually created. Faulting every 8th page runs at the same speed and loads the same amount of data from disk, but shows lower "physical memory" and "system cache" metrics and only 1/8 of the page faults. Subsequent reads prove that the pages are nevertheless definitively in the buffer cache (no delay, no disk activity).
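For reference, a bare-bones sketch of that prefaulting approach (names and the 16 MB view size come from the description above; error handling and non-page-multiple tails are omitted):

#include <windows.h>

static void prefault_via_mapping(HANDLE file, unsigned long long fileSize)
{
    HANDLE mapping = CreateFileMappingW(file, NULL, PAGE_READONLY, 0, 0, NULL);
    const SIZE_T viewSize = 16 * 1024 * 1024;   /* 16 MB view, remapped as we go */
    volatile unsigned char sink = 0;

    for (unsigned long long off = 0; off < fileSize; off += viewSize) {
        SIZE_T len = (SIZE_T)(fileSize - off < viewSize ? fileSize - off : viewSize);
        const unsigned char *view = (const unsigned char *)MapViewOfFile(
            mapping, FILE_MAP_READ, (DWORD)(off >> 32), (DWORD)(off & 0xFFFFFFFFu), len);
        for (SIZE_T i = 0; i < len; i += 4096)   /* touch one byte per page */
            sink ^= view[i];
        UnmapViewOfFile(view);
    }
    CloseHandle(mapping);
    (void)sink;
}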
(Possibly very, very distantly related to Memory-Mapped File is Faster on Huge Sequential Read?)
Update:
Using FILE_FLAG_SEQUENTIAL_SCAN seems to somewhat "balance" reads of 128k, improving performance by 100%. On the other hand, it severely impacts reads of 512k and 256k (you have to wonder why?) and has no real effect on anything else. The MB/s graph of the smaller block sizes arguably seems a little more "even", but there is no difference in runtime.
I may have found an explanation for smaller block sizes performing better, too. As you know, asynchronous requests may run synchronously if the OS can serve the request immediately, i.e. from the buffers (and for a variety of version-specific technical limitations).
When accounting for actual asynchronous vs. "immediate" asynchronous reads, one notices that upwards of 256k, Windows runs every asynchronous request asynchronously. The smaller the blocksize, the more requests are being served "immediately", even when they are not available immediately (i.e. ReadFile simply runs synchronously). I cannot make out a clear pattern (such as "the first 100 requests" or "more than 1000 requests"), but there seems to be an inverse correlation between request size and synchronicity. At a blocksize of 8k, every asynchronous request is served synchronously.
Buffered synchronous transfers are, for some reason, twice as fast as asynchronous transfers (no idea why), hence the smaller the request sizes, the faster the overall transfer, because more transfers are done synchronously.
For memory mapped prefaulting, FILE_FLAG_SEQUENTIAL_SCAN causes a slightly different shape of the performance graph (there is a "notch" which is moved a bit backwards), but the total time taken is exactly identical (again, this is surprising, but I can't help it).
Update 2:
Unbuffered IO makes the performance graphs for the 1M, 4M, and 512k request testcases somewhat higher and more "spiky", with maximums in the 90s of MB/s but harsh minimums too; the overall runtime for 1GB is within +/- 0.5s of the buffered run (the requests with smaller buffer sizes complete significantly faster, however, but that is because with more than 2558 in-flight requests, ERROR_WORKING_SET_QUOTA is returned). Measured CPU usage is zero in all unbuffered cases, which is unsurprising, since any IO that happens runs via DMA.
Another very interesting observation with FILE_FLAG_NO_BUFFERING is that it significantly changes API behaviour. CancelIO does not work any more, at least not in a sense of cancelling IO. With unbuffered in-flight requests, CancelIO will simply block until all requests have finished. A lawyer would probably argue that the function cannot be held liable for neglecting its duty, because there are no more in-flight requests left when it returns, so in some way it has done what was asked -- but my understanding of "cancel" is somewhat different.
With buffered, overlapped IO, CancelIO will simply cut the rope, all in-flight operations terminate immediately, as one would expect.
Yet another funny thing is that the process is unkillable until all requests have finished or failed. This kind of makes sense if the OS is doing DMA into that address space, but it's a stunning "feature" nevertheless.
I'm not a filesystem expert, but I think there are a couple of things going on here. First off, w.r.t. your comment about memory mapping being the winner: this isn't totally surprising, since the NT cache manager is based on memory mapping - by doing the memory mapping yourself, you're duplicating the cache manager's behavior without the additional memory copies.
When you read sequentially from the file, the cache manager attempts to pre-fetch the data for you - so it's likely that you are seeing the effect of readahead in the cache manager. At some point the cache manager stops prefetching reads (or rather at some point the prefetched data isn't sufficient to satisfy your reads and so the cache manager has to stall). That may account for the slowdown on larger I/Os that you're seeing.
Have you tried adding FILE_FLAG_SEQUENTIAL_SCAN to your CreateFile flags? That instructs the prefetcher to be even more aggressive.
This may be counter-intuitive, but traditionally the fastest way to read data off the disk is to use asynchronous I/O and FILE_FLAG_NO_BUFFERING. When you do that, the I/O goes directly from the disk driver into your I/O buffers with nothing to get in the way (assuming that the segments of the file are contiguous - if they're not, the filesystem will have to issue several disk reads to satisfy the application read request). Of course it also means that you lose the built-in prefetch logic and have to roll your own. But with FILE_FLAG_NO_BUFFERING you have complete control of your I/O pipeline.
One other thing to remember: when you're doing asynchronous I/O, it's important to ensure that you always have an I/O request outstanding - otherwise you lose potential time between when the last I/O completes and the next I/O is started.
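To make that concrete, here is a bare-bones sketch of the pattern: unbuffered, overlapped reads with a fixed queue depth, each slot reissued as soon as it completes. Error handling, the non-sector-multiple tail of the file, and files smaller than QUEUE_DEPTH*CHUNK are deliberately ignored; QUEUE_DEPTH and CHUNK are made-up illustrative values, not tuned numbers.

#include <windows.h>

#define QUEUE_DEPTH 8
#define CHUNK (1u << 20)   /* 1 MB; must be a multiple of the sector size */

void drain_file(const wchar_t *path)
{
    HANDLE f = CreateFileW(path, GENERIC_READ, FILE_SHARE_READ, NULL, OPEN_EXISTING,
                           FILE_FLAG_OVERLAPPED | FILE_FLAG_NO_BUFFERING, NULL);
    LARGE_INTEGER size;
    GetFileSizeEx(f, &size);

    OVERLAPPED ov[QUEUE_DEPTH];
    void *buf[QUEUE_DEPTH];
    unsigned long long next = 0, done = 0, total = (unsigned long long)size.QuadPart;

    /* prime the queue with QUEUE_DEPTH outstanding reads */
    for (int i = 0; i < QUEUE_DEPTH; i++) {
        ZeroMemory(&ov[i], sizeof ov[i]);
        ov[i].hEvent = CreateEventW(NULL, TRUE, FALSE, NULL);
        /* VirtualAlloc returns page-aligned (hence sector-aligned) buffers */
        buf[i] = VirtualAlloc(NULL, CHUNK, MEM_COMMIT | MEM_RESERVE, PAGE_READWRITE);
        ov[i].Offset = (DWORD)next;  ov[i].OffsetHigh = (DWORD)(next >> 32);
        ReadFile(f, buf[i], CHUNK, NULL, &ov[i]);
        next += CHUNK;
    }
    /* wait for each slot in turn; consume it and immediately reissue it */
    while (done < total) {
        for (int i = 0; i < QUEUE_DEPTH && done < total; i++) {
            DWORD got = 0;
            GetOverlappedResult(f, &ov[i], &got, TRUE);  /* block until this slot completes */
            done += got;
            /* ...consume buf[i] here... */
            if (next < total) {
                ResetEvent(ov[i].hEvent);
                ov[i].Offset = (DWORD)next;  ov[i].OffsetHigh = (DWORD)(next >> 32);
                ReadFile(f, buf[i], CHUNK, NULL, &ov[i]);
                next += CHUNK;
            }
        }
    }
    CloseHandle(f);
}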