Unexpected page handling (also, VirtualLock = no-op?) - Windows

This morning I stumbled across a surprising number of page faults where I did not expect them. Yes, I probably should not worry, but it still strikes me as odd, because in my understanding they should not happen. And I'd prefer it if they didn't.
The application (under WinXP Pro 32bit) reserves a larger section (1GB) of address space with VirtualAlloc(MEM_RESERVE) and later allocates moderately large blocks (20-50MB) of memory with VirtualAlloc(MEM_COMMIT). This is done in a worker thread ahead of time, the intent being to stall the main thread as little as possible. Obviously, you can never ensure that no page faults happen unless the memory region is currently locked, but a few of them are certainly tolerable (and unavoidable). Surprisingly, every single page faults. Always.
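(For reference, a minimal sketch of the reserve-ahead / commit-later pattern described above; the 32MB block size is only illustrative.)
#include <windows.h>

int main()
{
    const SIZE_T regionSize = 1024u * 1024 * 1024;   // 1 GB of address space
    const SIZE_T blockSize  = 32u   * 1024 * 1024;   // one moderately large block

    // Reserve address space only; no physical storage is committed yet.
    BYTE* region = (BYTE*) VirtualAlloc(NULL, regionSize, MEM_RESERVE, PAGE_NOACCESS);
    if (region == NULL) return 1;

    // Later, typically from a worker thread: commit one block inside the reservation.
    BYTE* block = (BYTE*) VirtualAlloc(region, blockSize, MEM_COMMIT, PAGE_READWRITE);
    if (block == NULL) return 2;

    VirtualFree(region, 0, MEM_RELEASE);   // releases the whole reservation
    return 0;
}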
The assumption was thus that the system only creates pages lazily after allocating them, which somehow makes sense too (although the documentation suggests something different). Fair enough, my bad.
The obvious workaround is therefore VirtualLock/VirtualUnlock, which forces the system to create those pages, as they must exist after VirtualLock returns. Surprisingly, still every single page faults.
So I wrote a little test program which performed all of the above steps in sequence, sleeping 5 seconds between each, to rule out a problem elsewhere in the code (a sketch of the sequence follows the list of results). The results were:
MEM_RESERVE 1GB ---> success, zero CPU, zero time, nothing happens
MEM_COMMIT 1 GB ---> success, zero CPU, zero time, working set increases by 2MB, 512 page faults (corresponding to 8 bytes of metadata allocated in user space per page)
for(... += 128kB) { VirtualLock(128kB); VirtualUnlock(128kB); } ---> success, zero CPU, zero time, nothing happens
for(... += 4096) *addr = 0; ---> 262144 page faults, about 0.25 seconds (~95% kernel time). 1GB increase for both "working set" and "physical" inside Process Explorer
VirtualFree ---> zero CPU, zero time, both "working set" and "physical" instantly go * poof *.
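(Roughly, that test sequence looks like the following sketch; the Sleep calls, error checks and the Process Explorer observations are left out.)
#include <windows.h>

int main()
{
    const SIZE_T size = 1024u * 1024 * 1024;                    // 1 GB

    BYTE* p = (BYTE*) VirtualAlloc(NULL, size, MEM_RESERVE, PAGE_NOACCESS);
    VirtualAlloc(p, size, MEM_COMMIT, PAGE_READWRITE);          // commit the whole region

    for (SIZE_T off = 0; off < size; off += 128 * 1024)         // lock/unlock in 128kB chunks
    {
        VirtualLock(p + off, 128 * 1024);
        VirtualUnlock(p + off, 128 * 1024);
    }

    for (SIZE_T off = 0; off < size; off += 4096)               // touch every page
        p[off] = 0;

    VirtualFree(p, 0, MEM_RELEASE);
    return 0;
}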
My expectation was that since each page had been locked once, it must physically exist at least after that. It might of course still be moved in and out of the WS as the quota is exceeded (merely changing one reference as long as sufficient RAM is available). Yet, neither the execution time, nor the working set, nor the physical memory metrics seem to support this. Rather, it looks as if each accessed page is created only upon faulting, even if it had been locked previously. Of course I can touch every page manually in a worker thread, but surely there must be a cleaner way?
Am I making a wrong assumption about what VirtualLock should do or am I not understanding something right about virtual memory? Any idea about how to tell the OS in a "clean, legitimate, working" way that I'll be wanting memory, and I'll be wanting it for real?
UPDATE:
In reaction to Harry Johnston's suggestion, I tried the somewhat problematic approach of actually calling VirtualLock on a gigabyte of memory. For this to succeed, you must first set the process' working set size accordingly, since the default quotas are 200k/1M, which means VirtualLock cannot possibly lock a region larger than 200k (or rather, it cannot lock more than 200k altogether, and that is minus what is already locked for I/O or for some other reason).
After setting a minimum working set size of 1GB and a maximum of 2GB, all the page faults happen the moment VirtualAlloc(MEM_COMMIT) is called. "Virtual size" in Process Explorer jumps up by 1GB instantly. So far, it looked really, really good.
However, looking closer, "Physical" remains as it is; actual memory is really only used the moment you touch it.
VirtualLock remains a no-op (fault-wise), but raising the minimum working set size kind of got closer to the goal.
There are two problems with tampering with the WS size, however. First, you're generally not meant to have a gigabyte of minimum working set in a process, because the OS tries hard to keep that amount of memory locked. This would be acceptable in my case (it's actually more or less just what I ask for).
The bigger problem is that SetProcessWorkingSetSize needs the PROCESS_SET_QUOTA access right, which is no problem as "administrator", but it fails when you run the program as a restricted user (for a good reason), and it triggers the "allow possibly harmful program?" alert of some well-known Russian antivirus software (for no good reason, but alas, you can't turn it off).

Technically VirtualLock is a hint, and so the OS is allowed to ignore it. It's backed by the NtLockVirtualMemory syscall, which on ReactOS/Wine is implemented as a no-op; Windows proper, however, does back the syscall with real work (MiLockVadRange).
VirtualLock isn't guaranteed to succeed. Calls to this function require the SE_LOCK_MEMORY_PRIVILEGE to work, and the addresses must fulfil security and quota restrictions. Additionally, after a VirtualUnlock the kernel is no longer obliged to keep your page in memory, so a page fault after that is a valid action.
And as Raymond Chen points out, when you unlock the memory it can formally release the page. This means that the next VirtualLock on the next page might obtain that very same page again, so when you touch the original page you'll still get a page-fault.

VirtualLock remains a no-op (fault-wise)
I tried to reproduce this, but it worked as one might expect. Running the example code shown at the bottom of this post:
start application (523 page faults)
adjust the working set size (21 page faults)
VirtualAlloc with MEM_COMMIT 2500 MB of RAM (2 page faults)
VirtualLock all of that (about 641,250 page faults)
perform writes to all of this RAM in an infinite loop (zero page faults)
This all works pretty much as expected. 2500 MB of RAM is 640,000 pages. The numbers add up. Also, as far as the OS-wide RAM counters go, commit charge goes up at VirtualAlloc, while physical memory usage goes up at VirtualLock.
So VirtualLock is most definitely not a no-op on my Win7 x64 machine. If I don't do it, the page faults, as expected, shift to where I start writing to the RAM. They still total just over 640,000. Plus, the first time the memory is written to takes longer.
Rather, as it looks, each single accessed page is created upon faulting, even if it had been locked previously.
This is not wrong. There is no guarantee that accessing a locked-then-unlocked page won't fault. You lock it, it gets mapped to physical RAM. You unlock it, and it's free to be unmapped instantly, making a fault possible. You might hope it will stay mapped, but no guarantees...
For what it's worth, on my system with a few gigabytes of physical RAM free, it works the way you were hoping for: even if I follow my VirtualLock with an immediate VirtualUnlock and set the minimum working set size back to something small, no further page faults occur.
Here's what I did. I ran the test program (below) with and without the code that immediately unlocks the memory and restores a sensible minimum working set size, and then forced physical RAM to run out in each scenario. Before forcing low RAM, neither program gets any page faults. After forcing low RAM, the program that keeps the memory locked retains its huge working set and has no further page faults. The program that unlocked the memory, however, starts getting page faults.
This is easiest to observe if you suspend the process first, since otherwise the constant memory writes keep it all in the working set even if the memory isn't locked (obviously a desirable thing). But suspend the process, force low RAM, and watch the working set shrink only for the program that has unlocked the RAM. Resume the process, and witness an avalanche of page faults.
In other words, at least in Win7 x64 everything works exactly as you expected it to, using the code supplied below.
There are two problems with tampering with the WS size, however. First, you're generally not meant to have a gigabyte of minimum working set in a process
Well... if you want to VirtualLock, you are already tampering with it. The only thing that SetProcessWorkingSetSize does is allow you to tamper with it. It doesn't degrade performance by itself; it's VirtualLock that does - but only if the system actually runs low on physical RAM.
Here's the complete program:
#include <stdio.h>
#include <tchar.h>
#include <Windows.h>
#include <iostream>
using namespace std;
int _tmain(int argc, _TCHAR* argv[])
{
    SIZE_T chunkSize = 2500LL * 1024LL * 1024LL; // 2,621,440,000 bytes = 640,000 pages
    int sleep = 5000;
    Sleep(sleep);

    cout << "Setting working set size... ";
    if (!SetProcessWorkingSetSize(GetCurrentProcess(), chunkSize + 5001001L, chunkSize * 2))
        return -1;
    cout << "done" << endl;
    Sleep(sleep);

    cout << "VirtualAlloc... ";
    UINT8* data = (UINT8*) VirtualAlloc(NULL, chunkSize, MEM_COMMIT, PAGE_READWRITE);
    if (data == NULL)
        return -2;
    cout << "done" << endl;
    Sleep(sleep);

    cout << "VirtualLock... ";
    if (VirtualLock(data, chunkSize) == 0)
        return -3;
    //if (VirtualUnlock(data, chunkSize) == 0) // enable or disable to experiment with unlocks
    //    return -3;
    //if (!SetProcessWorkingSetSize(GetCurrentProcess(), 5001001L, chunkSize * 2))
    //    return -1;
    cout << "done" << endl;
    Sleep(sleep);

    cout << "Writes to the memory... ";
    while (true)
    {
        int* end = (int*) (data + chunkSize);
        for (int* d = (int*) data; d < end; d++)
            *d = (int) d;
        cout << "done ";
    }

    return 0;
}
Note that this code puts the thread to sleep after VirtualLock. According to a 2007 post by Raymond Chen, the OS is free to page it all out of physical RAM at this point and until the thread wakes up again. Note also that MSDN claims otherwise, saying that this memory will not be paged out, regardless of whether all threads are sleeping or not. On my system, the pages certainly remain in physical RAM while the only thread is sleeping. I suspect Raymond's advice applied in 2007, but is no longer true in Win7.

I don't have enough reputation to comment, so I'll have to add this as an answer.
Note that this code puts the thread to sleep after VirtualLock. According to a 2007 post by Raymond Chen, the OS is free to page it all out of physical RAM at this point and until the thread wakes up again [...] I suspect Raymond's advice applied in 2007, but is no longer true in Win7.
What romkyns said has been confirmed by Raymond Chen in 2014. That is, when you lock memory with VirtualLock, it will remain locked even if all your threads are blocked. He also says that the fact that pages remain locked may be just an implementation detail and not contractual.
This is probably not the case, because according to MSDN it is contractual:
Pages that a process has locked remain in physical memory until the process unlocks them or terminates. These pages are guaranteed not to be written to the pagefile while they are locked.

Related

What will happen if the highest bit of the pointer is used as a sign bit? Can someone give an example to explain it?

I am reading the book Windows via C/C++, in Chapter 13 - Windows Memory Architecture -
Getting a Larger User-Mode Partition in x86 Windows
and I came across this:
In early versions of Windows, Microsoft didn't allow applications to
access their address space above 2 GB. So some creative developers
decided to leverage this and, in their code, they would use the high
bit in a pointer as a flag that had meaning only to their
applications. Then when the application accessed the memory address,
code executed that cleared the high bit of the pointer before the
memory address was used. Well, as you can imagine, when an application
runs in a user-mode environment greater than 2 GB, the application
fails in a blaze of fire.
I can't understand that; can someone give an example to explain it for me? Thanks.
To access ~2GB of memory, you only need a 31-bit address. However, on 32-bit systems addresses are 32 bits long and hence pointers are 32 bits long.
As the book describes, in early versions of Windows developers could only use 2GB of memory, therefore the highest bit in each 32-bit pointer could be used for other purposes, as it was ALWAYS zero. However, before using the address, this extra bit had to be cleared again, otherwise the program would try to access an address above 2GB and crash.
The code probably looked something like this:
#include <stdint.h>   // for uintptr_t, used in the pointer <-> integer casts below

int val = 1;
int* p = &val;
// ...
// Using the highest bit of p to store a flag, for some purpose
// (bitwise operators don't apply to pointers directly, hence the casts)
p = (int*) ((uintptr_t) p | ((uintptr_t) 1 << 31));
// ...
// Then before using the address in some way, the bit has to be cleared again:
p = (int*) ((uintptr_t) p & ~((uintptr_t) 1 << 31));
*p = 3;
Now, if you can be certain that your pointers will only ever point to an address where the most significant bit (MSB) is zero, i.e. in a ~2GB address space, this is fine. However, if the address space is increased, some pointers will have a 1 in their MSB, and by clearing it you set your pointer to an incorrect address in memory (for instance, a pointer to 0x80001000 silently becomes a pointer to 0x00001000). If you then try to read from or write to that address, you will have undefined behavior and your program will most likely fail in a blaze of fire.

Is it possible to know the address of a cache miss?

Whenever a cache miss occurs, is it possible to know the address of that missed cache line? Are there any hardware performance counters in modern processors that can provide such information?
Yes, on modern Intel hardware there are precise memory sampling events that track not only the address of the instruction, but the data address as well. These events also include a great deal of other information, such as which level of the cache hierarchy satisfied the memory access, the total latency and so on.
You can use perf mem to sample this information and produce a report.
For example, the following program:
#include <stddef.h>

#define SIZE (100 * 1024 * 1024)

int p[SIZE] = {1};

void do_writes(volatile int *p) {
    for (size_t i = 0; i < SIZE; i += 5) {
        p[i] = 42;
    }
}

void do_reads(volatile int *p) {
    volatile int sink;
    for (size_t i = 0; i < SIZE; i += 5) {
        sink = p[i];
    }
}

int main(int argc, char **argv) {
    do_writes(p);
    do_reads(p);
}
compiled with:
g++ -g -O1 -march=native perf-mem-test.cpp -o perf-mem-test
and run with:
sudo perf mem record -U ./perf-mem-test && sudo perf mem report
Produces a report of memory accesses sorted by latency like this:
The Data Symbol column shows which address the load was targeting - most here show up as something like p+0xa0658b4, which means an offset of 0xa0658b4 from the start of p, which makes sense as the code is reading and writing p. The list is sorted by "local weight", which is the access latency in reference cycles1.
Note that the information recorded is only a sample of memory accesses: recording every miss would usually be way too much information. Furthermore, it only records loads with a latency of 30 cycles or more by default, but you can apparently tweak this with command line arguments.
If you're only interested in accesses that miss in all levels of cache, you're looking for the "Local RAM hit" lines2. Perhaps you can restrict your sampling to only cache misses - I'm pretty sure the Intel memory sampling stuff supports that, and I think you can tell perf mem to look at only misses.
Finally, note that here I'm using the -U argument after record, which instructs perf mem to only record userspace events. By default it will include kernel events, which may or may not be useful for you. For the example program, there are many kernel events associated with copying the p array from the binary into writable process memory.
Keep in mind that I specifically arranged my program such that the global array p ended up in the initialized .data section (the binary is ~400 MB!), so that it shows up with the right symbol in the listing. The vast majority of the time your process is going to be accessing dynamically allocated or stack memory, which will just give you a raw address. Whether you can map this back to a meaningful object depends on if you track enough information to make that possible.
1 I think it's in reference cycles, but I could be wrong and the kernel may have already converted it to nanoseconds?
2 The "Local" and "hit" part here refer to the fact that we hit the RAM attached to the current core, i.e., we didn't have go to the RAM associated with another socket in a multi-socket NUMA configuration.
If you want to know the exact virtual or physical address of every cache miss on a particular processor, that would be very hard and sometimes impossible. But you are more likely to be interested in expensive memory access patterns; those patterns that incur large latencies because they miss in one or more levels of the cache subsystem. Note that it is important to keep in mind that a cache miss on one processor might be a cache hit on another depending on design details of each processor and depending also on the operating system.
There are several ways to find such patterns, two are commonly used. One is to use a simulator such as gem5 or Sniper. Another is to use hardware performance events. Events that represent cache misses are available but they do not provide any details on why or where a miss occurred. However, using a profiler, you can approximately associate cache misses as reported by the corresponding hardware performance events with the instructions that caused them which in turn can be mapped back to locations in the source code using debug information. Examples of such profilers include Intel VTune Amplifier and AMD CodeXL. The results produced by simulators and profilers may not be accurate and so you have to be careful when interpreting them.

Why does the speed of memcpy() drop dramatically every 4KB?

I tested the speed of memcpy() and noticed that the speed drops dramatically at i*4KB. The result is as follows: the Y-axis is the speed (MB/second) and the X-axis is the size of the buffer for memcpy(), increasing from 1KB to 2MB. Subfigure 2 and Subfigure 3 detail the 1KB-150KB and 1KB-32KB ranges.
Environment:
CPU : Intel(R) Xeon(R) CPU E5620 @ 2.40GHz
OS : 2.6.35-22-generic #33-Ubuntu
GCC compiler flags : -O3 -msse4 -DINTEL_SSE4 -Wall -std=c99
I guess it must be related to caches, but I can't find a reason from the following cache-unfriendly cases:
Why is my program slow when looping over exactly 8192 elements?
Why is transposing a matrix of 512x512 much slower than transposing a matrix of 513x513?
The performance degradation in those two cases is caused by unfriendly loops which read scattered bytes into the cache, wasting the rest of the space of each cache line.
Here is my code:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>

void memcpy_speed(unsigned long buf_size, unsigned long iters){
    struct timeval start, end;
    unsigned char * pbuff_1;
    unsigned char * pbuff_2;

    pbuff_1 = malloc(buf_size);
    pbuff_2 = malloc(buf_size);

    gettimeofday(&start, NULL);
    for(int i = 0; i < iters; ++i){
        memcpy(pbuff_2, pbuff_1, buf_size);
    }
    gettimeofday(&end, NULL);

    printf("%5.3f\n", ((buf_size*iters)/(1.024*1.024))/((end.tv_sec - \
           start.tv_sec)*1000*1000+(end.tv_usec - start.tv_usec)));

    free(pbuff_1);
    free(pbuff_2);
}
UPDATE
Considering suggestions from @usr, @ChrisW and @Leeor, I redid the test more precisely and the graph below shows the results. The buffer size runs from 26KB to 38KB, and I tested it every 64B (26KB, 26KB+64B, 26KB+128B, ..., 38KB). Each test loops 100,000 times in about 0.15 seconds. The interesting thing is that the drop not only occurs exactly at 4KB boundaries, but also shows up at 4*i+2 KB, with a much smaller amplitude.
PS
@Leeor offered a way to eliminate the drop: adding a 2KB dummy buffer between pbuff_1 and pbuff_2. It works, but I am not sure about Leeor's explanation.
Memory is usually organized in 4k pages (although there's also support for larger sizes). The virtual address space your program sees may be contiguous, but it's not necessarily the case in physical memory. The OS, which maintains a mapping of virtual to physical addresses (in the page map) would usually try to keep the physical pages together as well but that's not always possible and they may be fractured (especially on long usage where they may be swapped occasionally).
When your memory stream crosses a 4k page boundary, the CPU needs to stop and go fetch a new translation - if it already saw the page, it may be cached in the TLB, and the access is optimized to be the fastest, but if this is the first access (or if you have too many pages for the TLBs to hold on to), the CPU will have to stall the memory access and start a page walk over the page map entries - that's relatively long as each level is in fact a memory read by itself (on virtual machines it's even longer as each level may need a full pagewalk on the host).
Your memcpy function may have another issue - when first allocating memory, the OS would just add the pages to the pagemap, but mark them as unaccessed and unmodified due to internal optimizations. The first access may not only invoke a page walk, but possibly also an assist telling the OS that the page is going to be used (and, for the target buffer pages, written to), which would take an expensive transition to some OS handler.
In order to eliminate this noise, allocate the buffers once, perform several repetitions of the copy, and calculate the amortized time. That, on the other hand, would give you "warm" performance (i.e. after having the caches warmed up), so you'll see the cache sizes reflected in your graphs. If you want to get a "cold" effect while not suffering from paging latencies, you might want to flush the caches between iterations (just make sure you don't time that).
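(A rough sketch of such a flush helper, assuming SSE2 and 64-byte cache lines; call it on both buffers, outside the timed region.)
#include <emmintrin.h>   /* _mm_clflush, _mm_mfence (SSE2) */

static void flush_buffer(const unsigned char *buf, unsigned long size)
{
    for (unsigned long i = 0; i < size; i += 64)   /* assume 64-byte cache lines */
        _mm_clflush(buf + i);
    _mm_mfence();                                  /* wait for the flushes to complete */
}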
EDIT
Rereading the question, you seem to be doing a correct measurement. The problem with my explanation is that it should show a gradual increase after 4k*i, since on every such drop you pay the penalty again, but then should enjoy the free ride until the next 4k. It doesn't explain why there are such "spikes", after which the speed returns to normal.
I think you are facing an issue similar to the critical stride issue linked in your question - when your buffer size is a nice round 4k, both buffers will align to the same sets in the cache and thrash each other. Your L1 is 32k, so it doesn't seem like an issue at first, but assuming the data L1 has 8 ways it's in fact a 4k wrap-around to the same sets, and you have 2*4k blocks with the exact same alignment (assuming the allocation was done contiguously), so they overlap on the same sets. It's enough that the LRU doesn't work exactly as you expect and you'll keep having conflicts.
To check this, I'd try to malloc a dummy buffer between pbuff_1 and pbuff_2, make it 2k large and hope that it breaks the alignment.
EDIT2:
Ok, since this works, it's time to elaborate a little. Say you assign two 4k arrays at ranges 0x1000-0x1fff and 0x2000-0x2fff. Set 0 in your L1 will contain the lines at 0x1000 and 0x2000, set 1 will contain 0x1040 and 0x2040, and so on. At these sizes you don't have any issue with thrashing yet, they can all coexist without overflowing the associativity of the cache. However, every time you perform an iteration you have a load and a store accessing the same set - I'm guessing this may cause a conflict in the HW. Worse - you'll need multiple iterations to copy a single line, meaning that you have a congestion of 8 loads + 8 stores (less if you vectorize, but still a lot), all directed at the same poor set; I'm pretty sure there are a bunch of collisions hiding there.
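(To make the set arithmetic concrete: assuming a 32k, 8-way L1 with 64-byte lines, there are 64 sets, so the set index is just address bits 6-11 and any two addresses 4k apart share a set. A tiny check:)
#include <stdio.h>

int main(void)
{
    /* 32k / 8 ways / 64-byte lines = 64 sets, so set = (addr >> 6) & 63 */
    unsigned long a = 0x1040, b = 0x2040;   /* 4k apart */
    printf("set(a)=%lu set(b)=%lu\n", (a >> 6) & 63, (b >> 6) & 63);   /* both print 1 */
    return 0;
}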
I also see that Intel optimization guide has something to say specifically about that (see 3.6.8.2):
4-KByte memory aliasing occurs when the code accesses two different
memory locations with a 4-KByte offset between them. The 4-KByte
aliasing situation can manifest in a memory copy routine where the
addresses of the source buffer and destination buffer maintain a
constant offset and the constant offset happens to be a multiple of
the byte increment from one iteration to the next.
...
loads have to wait until stores have been retired before they can
continue. For example at offset 16, the load of the next iteration is
4-KByte aliased current iteration store, therefore the loop must wait
until the store operation completes, making the entire loop
serialized. The amount of time needed to wait decreases with larger
offset until offset of 96 resolves the issue (as there is no pending
stores by the time of the load with same address).
I expect it's because:
When the block size is a 4KB multiple, then malloc allocates new pages from the O/S.
When the block size is not a 4KB multiple, then malloc allocates a range from its (already allocated) heap.
When the pages are allocated from the O/S then they are 'cold': touching them for the first time is very expensive.
My guess is that, if you do a single memcpy before the first gettimeofday then that will 'warm' the allocated memory and you won't see this problem. Instead of doing an initial memcpy, even writing one byte into each allocated 4KB page might be enough to pre-warm the page.
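(A rough sketch of that pre-warm step, inserted into memcpy_speed right after the two malloc calls and before the first gettimeofday; assumes 4KB pages.)
/* touch one byte in every 4KB page of both buffers to fault them in up front */
for (unsigned long off = 0; off < buf_size; off += 4096) {
    pbuff_1[off] = 0;
    pbuff_2[off] = 0;
}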
Usually when I want a performance test like yours I code it as:
// Run it once to pre-warm the cache
runTest();
// Repeat
startTimer();
for (int i = count; i; --i)
    runTest();
stopTimer();
// use a larger count if the duration is less than a few seconds
// repeat the test 3 times to ensure that results are consistent
Since you are looping many times, I think arguments about pages not being mapped are irrelevant. In my opinion what you are seeing is the effect of hardware prefetcher not willing to cross page boundary in order not to cause (potentially unnecessary) page faults.

Determining precisely the unused RAM available in Win32

I use this routine to fill unused RAM with zeros.
It causes a crash on some computers, and this adjustment is coarse:
size = size - (size /10);
Is there a more accurate way to determine the amount of unused RAM to be filled with zeros?
#include <windows.h>
#include <psapi.h>
#include <stdio.h>

DWORDLONG getTotalSystemMemory(){
    PROCESS_MEMORY_COUNTERS lMemInfo;
    BOOL success = GetProcessMemoryInfo(
        GetCurrentProcess(),
        &lMemInfo,
        sizeof(lMemInfo)
    );
    MEMORYSTATUSEX statex;
    statex.dwLength = sizeof(statex);
    GlobalMemoryStatusEx(&statex);
    wprintf(L"Mem: %d\n", lMemInfo.WorkingSetSize);
    return statex.ullAvailPhys - lMemInfo.WorkingSetSize;
}

void Zero(){
    int size = getTotalSystemMemory();//-(1024*140000)
    size = size - (size /10);
    //if(size>1073741824) size=1073741824; //2^32-1
    wprintf(L"Mem: %d\n", size);
    BYTE* ar = new BYTE[size];
    RtlSecureZeroMemory(ar,size);
    delete[] ar;
}
This program does not do what you think it does. In fact, it is counterproductive. Fortunately, it is also unnecessary.
First of all, the program is unnecessary: Windows already has a thread whose sole job is to zero out free pages, uncreatively known as the zero page thread. This blog entry goes into quite a bit of detail on how it works. Therefore, the way to fill free memory with zeroes is to do nothing, because there is already somebody filling free memory with zeroes.
Second, the program does not do what you think it does because when an application allocates memory, the kernel makes sure that the memory is full of zeroes before giving it to the application. (If there are not enough pre-zeroed pages available, the kernel will zero out the pages right there.) Therefore, your program which writes out zeroes is just writing zeroes on top of zeroes.
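(If you want to convince yourself of that, here's a minimal sketch: commit some fresh pages and scan them; they come back zero-filled without the program ever writing to them.)
#include <windows.h>
#include <stdio.h>

int main()
{
    const SIZE_T size = 64u * 1024 * 1024;   // 64 MB of freshly committed pages
    BYTE* p = (BYTE*) VirtualAlloc(NULL, size, MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE);
    if (p == NULL) return 1;

    SIZE_T nonZero = 0;
    for (SIZE_T i = 0; i < size; i++)        // scan the freshly committed memory
        if (p[i] != 0) nonZero++;

    printf("non-zero bytes: %u\n", (unsigned) nonZero);   // prints 0
    VirtualFree(p, 0, MEM_RELEASE);
    return 0;
}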
Third, the program is counterproductive because it is not limiting itself to memory that is free. It is zeroing out a big chunk of memory that might have been busy. This may force other applications to give up their active memory so that it can be given to you.
The program is also counterproductive because even if it manages only to grab free memory, it dirties the memory (by writing zeroes) before freeing it. Returning dirty pages to the kernel puts them on the "dirty free memory" list, which means that the zero page thread has to go and zero them out again. (Redundantly, in this case, but the kernel doesn't bother checking whether a freed page is full of zeros before zeroing it out. Checking whether a page is full of zeroes is about as expensive as just zeroing it out anyway.)
It is unclear what the purpose of your program is. Why does it matter that free memory is full of zeroes or not?

Which (OS X) dtrace probe fires when a page is faulted in from disk?

I'm writing up a document about page faulting and am trying to get some concrete numbers to work with, so I wrote up a simple program that reads 12*1024*1024 bytes of data. Easy:
#include <stdio.h>

int main()
{
    FILE *in = fopen("data.bin", "rb");
    int i;
    int total = 0;

    for (i = 0; i < 1024*1024*12; i++)
        total += fgetc(in);

    printf("%d\n", total);
    fclose(in);
    return 0;
}
So yes, it goes through and reads the entire file. The issue is that I need the dtrace probe that is going to fire 1536 times during this process (12M/8k). Even if I count all of the fbt:mach_kernel:vm_fault*: probes and all of the vminfo::: probes, I don't hit 500, so I know I'm not finding the right probes.
Anyone know where I can find the dtrace probes that fire when a page is faulted in from disk?
UPDATE:
On the off chance that the issue was that there was some intelligent pre-fetching going on in the stdio functions, I tried the following:
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>

int main()
{
    int in = open("data.bin", O_RDONLY | O_NONBLOCK);
    int i;
    int total = 0;
    char buf[128];

    for (i = 0; i < 1024*1024*12; i++)
    {
        read(in, buf, 1);
        total += buf[0];
    }

    printf("%d\n", total);
    close(in);
    return 0;
}
This version takes MUCH longer to run (42s real time, 10s of which was user and the rest was system time - page faults, I'm guessing) but still generates one fifth as many faults as I would expect.
For the curious, the time increase is not due to loop overhead and casting (char to int.) The code version that does just these actions takes .07 seconds.
Not a direct answer, but it seems you are equating disk reads and page faults. They are not necessarily the same. In your code you are reading data from a file into a small user memory chunk, so the I/O system can read the file into the buffer/VM cache in any way and size it sees fit. I might be wrong here, I don't know how Darwin does this.
I think the more reliable test would be to mmap(2) the whole file into process memory and then go touch each page in that space.
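(Something along these lines, a sketch only; assumes 4KB pages and the same data.bin as above.)
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main()
{
    int fd = open("data.bin", O_RDONLY);
    struct stat st;
    fstat(fd, &st);

    unsigned char *p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED) return 1;

    long total = 0;
    for (off_t off = 0; off < st.st_size; off += 4096)   /* touch one byte per page */
        total += p[off];

    printf("%ld\n", total);
    munmap(p, st.st_size);
    close(fd);
    return 0;
}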
I was down the same rathole recently. I don't have my DTrace scripts or test programs available just now, but I will give you the following advice:
1.) Get your hands on OS X Internals by Amit Singh and read section 8.3 on virtual memory (this will get you in the right frame of reference for selecting DTrace probes).
2.) Get your hands on Solaris Performance and Tools by Brendan Gregg / Jim Mauro. Read the section on virtual memory and pay close attention to the example DTrace scripts that make use of the vminfo provider.
3.) OS X is definitely prefetching large chunks of pages from the filesystem, and your test program is playing right into this optimization (since you're reading sequentially). Interestingly, this is not the case for Solaris. Try randomly accessing the big array to defeat the prefetch.
The assumption that the operating system will fault in each and every page that's being touched as a separate operation (and that therefore, if you touch N pages, you'll see the DTrace probe fire N times) is flawed; most UN*Xes will perform some sort of readahead or pre-faulting, and you're very unlikely to get exactly the same number of probe firings as you have pages. This is so even if you use mmap() directly.
The exact ratio may also depend on the filesystem, as readahead and page clustering implementations and thresholds are unlikely to be the same for all of them.
You probably can force a per-page fault policy if you use mmap directly and then apply madvise(MADV_DONTNEED) or similar and/or purge the entire range with msync(MS_INVALIDATE).
