Direct Mapped Cache, Filling out Contents of Main Memory - caching

I am preparing for an exam in Computer Architecture. I understand how caching works and how data gets copied from Main Memory to the Cache depending on the Address.
However, I am unable to figure out how the contents of main memory are filled out depending on the address. For instance, in the image linked to below, I can easily fill out the contents of the cache, but I do not understand how I should fill the cells that are pointed at with the arrows.
My Professor did not really talk much about that part and I can't complete the actual problem without further help. Please help me understand how I should fill the contents of the Main Memory!

There are a few ways the contents of main memory get filled in practice.
When a dirty cache entry is evicted, its block is written back to main memory.
When your program loads data into memory, for example from a file. Say you have an image you want to process: you load it into memory once and work on it there, rather than accessing the file every time, which is slower.
Direct Memory Access (DMA) operations - these read/write memory without CPU involvement. For example, a DMA transfer can fill an input buffer which the CPU then processes.
You could have multiple programs working together, one producing data and the other consuming it. For example, your first program may be doing a matrix multiplication and the second is using that result. So the first program writes its result to main memory and passes a reference to the second program to use it.
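For the exam part itself, the mapping between a main-memory address and a cache line in a direct-mapped cache is just arithmetic, so you can also work backwards from a line's tag and index to the memory block it holds. Below is a minimal sketch, assuming a hypothetical cache of 4 one-word (4-byte) blocks; substitute the block size and number of lines from your own exercise.

    #include <stdio.h>
    #include <stdint.h>

    /* Hypothetical parameters -- replace with the values from your exercise. */
    #define BLOCK_SIZE 4   /* bytes per block  -> 2 offset bits */
    #define NUM_SETS   4   /* cache lines      -> 2 index bits  */

    int main(void) {
        uint32_t addr = 0x2C;   /* example main-memory address */

        uint32_t offset = addr % BLOCK_SIZE;
        uint32_t index  = (addr / BLOCK_SIZE) % NUM_SETS;
        uint32_t tag    = addr / (BLOCK_SIZE * NUM_SETS);

        printf("addr 0x%02X -> tag %u, index %u, offset %u\n",
               addr, tag, index, offset);

        /* Going the other way: a cache line with a given tag and index
           maps back to exactly one block of main memory. */
        uint32_t block_base = (tag * NUM_SETS + index) * BLOCK_SIZE;
        printf("tag %u, index %u -> memory block starting at 0x%02X\n",
               tag, index, block_base);
        return 0;
    }

The second computation is the one you need for the arrows in the diagram: each cache line's tag and index together identify the unique main-memory block whose contents it holds.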

Related

Does GetWriteWatch work with Memory-Mapped Files?

I'm working with memory-mapped files (MMF) with very large datasets (depending on the input file): each file is ~50GB and there are around 40 files open at the same time. Of course this varies; I can also have smaller or larger files, so the system should scale itself.
The MMF is acting as a backing buffer, so as long as I have enough free memory no paging should occur. The problem is that the Windows memory manager and my application are two autonomous processes. Under good conditions everything works fine, but the memory manager is obviously too slow once I enter low-memory conditions: memory is full and the system starts to page (which is good), but I keep allocating memory because I get no information about the paging.
In the end I reach a state where the system stalls: the memory manager pages while I keep allocating.
So I came to the point where I need to advise the memory manager myself, check current memory conditions and trigger the paging on my own. For that reason I wanted to use GetWriteWatch to inspect which memory regions I can flush.
Interestingly, GetWriteWatch does not work in my situation; it returns -1 without filling the structures. So my question is: does GetWriteWatch work with MMFs?
Does GetWriteWatch work with Memory-Mapped Files?
I don't think so.
GetWriteWatch only accepts memory allocated via the VirtualAlloc function with the MEM_WRITE_WATCH flag.
File mappings are created with the MapViewOfFile* functions, which do not have this flag.
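For comparison, here is a minimal sketch of the case where GetWriteWatch is documented to work: a region committed by VirtualAlloc with MEM_WRITE_WATCH. The region size and page-array size are arbitrary placeholders.

    #include <windows.h>
    #include <stdio.h>

    int main(void) {
        SIZE_T size = 1 << 20;   /* 1 MiB region, placeholder size */

        /* GetWriteWatch only tracks regions allocated with MEM_WRITE_WATCH. */
        void *base = VirtualAlloc(NULL, size,
                                  MEM_RESERVE | MEM_COMMIT | MEM_WRITE_WATCH,
                                  PAGE_READWRITE);
        if (!base) return 1;

        ((char *)base)[0] = 1;   /* dirty the first page */

        PVOID pages[16];
        ULONG_PTR count = 16;
        DWORD granularity;
        UINT rc = GetWriteWatch(0, base, size, pages, &count, &granularity);
        printf("GetWriteWatch rc=%u, written pages=%lu\n",
               rc, (unsigned long)count);

        /* A view created with MapViewOfFile has no MEM_WRITE_WATCH flag,
           so calling GetWriteWatch on such a view fails. */

        VirtualFree(base, 0, MEM_RELEASE);
        return 0;
    }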

Does mmap directly access the page cache, or a copy of the page cache?

To ask the question another way, can you confirm that when you mmap() a file that you do in fact access the exact physical pages that are already in the page cache?
I ask because I’m doing testing on a 192 core machine with 1TB of RAM, on a 400GB data file that is pre-cached into the page cache prior to the test (by just dropping the cache, then doing md5sum on the file).
Initially, I had all 192 threads each mmap the file separately, on the assumption that they would all get (basically) the same memory region back (or perhaps the same memory region but somehow mapped multiple times). Accordingly, I assumed two threads using two different mappings to the same file would both have direct access to the same pages. (Let’s ignore NUMA for this example, though obviously it’s significant at higher thread counts.)
However, in practice I found performance would get terrible at higher thread counts when each thread separately mmapped the file. When we removed that and instead just did a single mmap that was passed into the thread (such that all threads just directly access the same memory region), then performance improved dramatically.
That’s all great, but I’m trying to figure out why. If in fact mmapping a file just grants direct access to the existing page cache, then I would think that it shouldn’t matter how many times you map it — it should all go to the exact same place.
But given that there was such a performance cost, it seemed to me that in fact each mmap was being independently and redundantly populated (perhaps by copying from the page cache, or perhaps by reading again from disk).
Can you comment on why I was seeing such different performance between shared access to the same memory, versus mmapping the same file?
Thanks, I appreciate your help!
I think I found my answer, and it deals with the page directory. The answer is yes, two mmapped regions of the same file will access the same underlying page cache data. However, each mapping needs to independently map each of the virtual pages to the physical pages -- meaning 2x as many entries in the page directory to access the same RAM.
Basically, each mmap() creates a new range in virtual memory. Every page of that range corresponds to a page of physical memory, and that mapping is stored in a hierarchical page directory -- with one entry per 4KB page. So every mmap() of a large region generates a huge number of entries in the page directory.
My guess is it doesn't actually define them all up front, which is why mmap() is instant to call even for a giant file. But over time it probably has to establish those entries as there are faults on the mmapped range, meaning over the course of time it gets filled out. This extra work to populate the page directory is probably why threads using different mmaps are slower than threads sharing the same mmap. And I bet the kernel needs to erase all those entries when unmapping the range -- which is why munmap() is so slow.
(There's also the translation lookaside buffer, but that's per-CPU, and so small I don't think that matters much here.)
Anyway, it sounds like re-mapping the same region just adds extra overhead, for what seems to me like no gain.
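For illustration, here is a minimal sketch of the faster variant described above: the file is mapped once and the same pointer is handed to every worker thread, so all threads fault on and reuse one set of page-table entries. The thread count and the checksum workload are placeholders.

    #include <fcntl.h>
    #include <pthread.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    #define NTHREADS 8   /* placeholder thread count */

    struct task { const unsigned char *base; size_t start, len; unsigned long sum; };

    static void *worker(void *arg) {
        struct task *t = arg;
        unsigned long sum = 0;
        /* Every thread reads through the same single mapping. */
        for (size_t i = 0; i < t->len; i++)
            sum += t->base[t->start + i];
        t->sum = sum;
        return NULL;
    }

    int main(int argc, char **argv) {
        if (argc < 2) return 1;
        int fd = open(argv[1], O_RDONLY);
        struct stat st;
        fstat(fd, &st);

        /* One mapping, created once, shared by all threads. */
        unsigned char *base = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
        if (base == MAP_FAILED) return 1;

        pthread_t tid[NTHREADS];
        struct task tasks[NTHREADS];
        size_t chunk = st.st_size / NTHREADS;
        unsigned long total = 0;
        for (int i = 0; i < NTHREADS; i++) {
            tasks[i].base  = base;
            tasks[i].start = (size_t)i * chunk;
            tasks[i].len   = (i == NTHREADS - 1) ? (size_t)st.st_size - tasks[i].start : chunk;
            tasks[i].sum   = 0;
            pthread_create(&tid[i], NULL, worker, &tasks[i]);
        }
        for (int i = 0; i < NTHREADS; i++) {
            pthread_join(tid[i], NULL);
            total += tasks[i].sum;
        }
        printf("total: %lu\n", total);

        munmap(base, st.st_size);
        close(fd);
        return 0;
    }

The per-thread-mmap variant would move the mmap() call into worker(); the data read is the same, but each thread then has to populate its own range of page-table entries.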

How smart is mmap?

mmap can be used to share read-only memory between processes, reducing the memory foot print:
process P1 mmaps a file, uses the mapped memory -> data gets loaded into RAM
process P2 mmaps a file, uses the mapped memory -> OS re-uses the same memory
But how about this:
process P1 mmaps a file, loads it into memory, then exits.
another process P2 mmaps the same file, accesses the memory that is still hot from P1's access.
Is the data loaded again from disk? Is the OS smart enough to re-use the virtual memory even if "mmap count" dropped to zero temporarily?
Does the behaviour differ between different OS? (I'm mostly interested in Linux/OS X)
EDIT: In case the OS is not smart enough -- would it help if there is one "background process", keeping the file mmaped, so it never leaves the address space of at least one process?
I am of course interested in performance when I mmap and munmap the same file successively and rapidly, possibly (but not necessarily) within the same process.
EDIT2: I see answers describing completely irrelevant points at great length. To reiterate the point -- can I rely on Linux/OS X to not re-load data that already resides in memory, from previous page hits within mmaped memory segments, even though the particular region is no longer mmaped by any process?
The presence or absence of the contents of a file in memory is much less coupled to mmap system calls than you think. When you mmap a file, it doesn't necessarily load it into memory. When you munmap it (or if the process exits), it doesn't necessarily discard the pages.
There are many different things that could trigger the contents of a file to be loaded into memory: mapping it, reading it normally, executing it, attempting to access memory that is mapped to the file. Similarly, there are different things that could cause the file's contents to be removed from memory, mostly related to the OS deciding it wants the memory for something more important.
In the two scenarios from your question, consider inserting a step between steps 1 and 2:
1.5. another process allocates and uses a large amount of memory -> the mmaped file is evicted from memory to make room.
In this case the file's contents will probably have to get reloaded into memory if they are mapped again and used again in step 2.
versus:
1.5. nothing happens -> the contents of the mmaped file hang around in memory.
In this case the file's contents don't need to be reloaded in step 2.
In terms of what happens to the contents of your file, your two scenarios aren't much different. It's something like this step 1.5 that would make a much more important difference.
As for a background process that is constantly accessing the file in order to ensure it's kept in memory (for example, by scanning the file and then sleeping for a short amount of time in a loop), this would of course keep the file in memory. But you're probably better off just letting the OS make its own decision about when to evict the file and when not to.
The second process likely finds the data from the first process in the buffer cache. So in most cases the data will not be loaded again from disk. But since the buffer cache is a cache, there are no guarantees that the pages don't get evicted in between.
You could start a third process and use mmap(2) and mlock(2) to fix the pages in ram. But this will probably cause more trouble than it is worth.
Linux replaced the traditional UNIX buffer cache with a page cache, but the principle is still the same. The Mac OS X equivalent is called the Unified Buffer Cache (UBC).
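If you did want to try the third-process approach mentioned above, a minimal sketch would look something like this; it assumes the process has a large enough RLIMIT_MEMLOCK (or runs as root) to lock the whole file.

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(int argc, char **argv) {
        if (argc < 2) return 1;
        int fd = open(argv[1], O_RDONLY);
        struct stat st;
        fstat(fd, &st);

        void *p = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED) { perror("mmap"); return 1; }

        /* mlock faults every page in and pins it: the page cache pages
           backing this mapping can no longer be evicted. */
        if (mlock(p, st.st_size) != 0) { perror("mlock"); return 1; }

        pause();   /* keep the process (and therefore the lock) alive */
        return 0;
    }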

Memory Mapped File v/s Normal File IO

When we talk about memory-mapped files, it is generally mentioned that a portion of a file can be mapped to a process's address space and we can do random access on it using pointers etc. I have also read in many places that I should have sufficient memory to accommodate the whole file in memory. Now these two statements are a bit confusing to me, because if we need sufficient memory for the complete file anyway, what would be the advantage? I know about the benefits of avoiding the extra kernel-space copy of contents, and the speed gain since data is not read block-by-block or byte-by-byte as with streams.
You don't need to have memory for the entire file - mmap loads lazily, so the benefit is that you can work with a large file without having to use a lot of RAM. Another neat trick is that you can iterate over the file backwards without having to chunk it yourself.
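As an illustration of both points, here is a minimal sketch that maps a file read-only and walks it backwards; pages are only brought in as they are touched, and the checksum loop is just a placeholder workload.

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(int argc, char **argv) {
        if (argc < 2) return 1;
        int fd = open(argv[1], O_RDONLY);
        struct stat st;
        fstat(fd, &st);

        /* The whole file is addressable, but pages are only read from disk
           (or found in the page cache) when they are first touched. */
        const unsigned char *data = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
        if (data == MAP_FAILED) return 1;

        /* Walking backwards is just pointer arithmetic -- no seeking,
           no manual chunking of the file. */
        unsigned long checksum = 0;
        for (off_t i = st.st_size - 1; i >= 0; i--)
            checksum += data[i];
        printf("checksum: %lu\n", checksum);

        munmap((void *)data, st.st_size);
        close(fd);
        return 0;
    }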

How is fseek() implemented in the filesystem?

This is not a pure programming question; however, it impacts the performance of programs using fseek(), hence it is important to know how it works. A little disclaimer so that it doesn't get closed.
I am wondering how efficient it is to insert data in the middle of a file. Suppose I have a file with 1MB of data and I insert something at the 512KB offset. How efficient would that be compared to appending my data at the end of the file? Just to make the example complete, let's say I want to insert 16KB of data.
I understand the answer varies depending on the filesystem; however, I assume that the techniques used in common filesystems are quite similar, and I just want to get the right notion of it.
(disclaimer: I want just to add some hints to this interesting discussion)
IMHO there are some things to take into account:
1) fseek is not a primary system service, but a library function. To evaluate its performance we must consider how the file stream library is implemented. In general, the file I/O library adds a layer of buffering in user space, so the performance of fseek may be quite different depending on whether the target position is inside or outside the current buffer. Also, the system services that the I/O library uses may vary a lot; for instance, on some systems the library makes extensive use of file memory mapping where possible.
2) As you said, different filesystems may behave in very different ways. In particular, I would expect that a transactional filesystem must do something very smart, and perhaps expensive, to be prepared for a possible rollback of an aborted write operation in the middle of a file.
3) Modern OSes have very aggressive caching algorithms. An "fseeked" file is likely to be present in cache already, so operations become much faster. But they may degrade a lot if the overall filesystem activity produced by other processes becomes significant.
Any comments?
fseek(...) is a library call, not an OS system call. It is the run-time library that takes care of the actual overhead involved in making a system call to the OS; fseek only indirectly results in a system call, if at all (this highlights the distinction between a library call and a system call). fseek(...) is a standard input/output function regardless of the underlying system... however... and this is a big however...
The OS will more than likely have cached the file in kernel memory, i.e. it holds the direct offsets to the locations on disk where the bits are stored. Somewhere in the kernel's upper layers there is a snapshot of what the file is composed of, irrespective of its contents (the kernel does not care either way, as long as the 'pointers' into the on-disk structure for that offset are valid).
When fseek(...) occurs there can be a lot of overhead: the kernel delegates the task of reading from the disk, and depending on how fragmented the file is, the data could theoretically be "all over the place". Gathering it into one contiguous view of the data is a significant cost from the user-land perspective (the C code doing the fseek(...)), and hence inserting into the middle of a file, where the kernel has to adjust the locations/offsets into the actual disk layout for the data, would be slower than appending to the end of the file.
The reason is quite simple: when appending, the kernel knows what the last offset was, and simply replaces the EOF marker with more data. Behind the scenes it allocates another disk buffer with the adjusted offset to the location on disk following the old EOF, and the append completes there.
Let us take the ext2 FS and the Linux OS as an example. I don't think there will be a significant performance difference between an insert and an append. In both cases the file's inode and offset table must be read, the relevant disk sector mapped into memory, the data updated, and at some later point the data written back to disk. What will make a big performance difference in this example is good temporal and spatial locality when accessing parts of the file, since this will reduce the number of load/store combos.
As a previous answer says, you may be able to speed up both operations if you deal with data writes that are exact multiples of the FS block size; in this case you could skip the load stage and just insert the new blocks into the file's inode data structure. This would not be practical, as you would need low-level access to the FS driver, and using it would be very restrictive and not portable.
One observation I have made about fseek on Solaris is that each call to it resets the FILE's read buffer. The next read will then always read a full block (8K by default). So if you do a lot of random access with small reads, it's a good idea to do it unbuffered (setvbuf with a NULL buffer) or even use direct syscalls (lseek+read, or better pread, which is only 1 syscall instead of 2). I suppose this behaviour is similar on other OSes.
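Here is a minimal sketch of the two alternatives mentioned; the file name and offset are placeholders.

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void) {
        char small[16];

        /* Variant 1: stdio, but with buffering switched off so a seek plus
           a small read does not trigger a full block read. */
        FILE *f = fopen("data.bin", "rb");
        if (!f) return 1;
        setvbuf(f, NULL, _IONBF, 0);
        fseek(f, 123456, SEEK_SET);
        fread(small, 1, sizeof small, f);
        fclose(f);

        /* Variant 2: pread, one syscall per access instead of lseek+read. */
        int fd = open("data.bin", O_RDONLY);
        if (fd < 0) return 1;
        pread(fd, small, sizeof small, 123456);
        close(fd);
        return 0;
    }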
You can insert data into the middle of a file efficiently only if the data size is a multiple of the FS sector size, but OSes don't provide such functions, so you would have to use a low-level interface to the FS driver.
Inserting data in the middle of the file is less efficient than appending to the end, because when inserting you have to move the data after the insertion point to make room for the data being inserted. Moving this data involves reading it from disk, writing the data to be inserted, and then writing the old data after the inserted data. So you have at least one extra read and write when inserting.
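A minimal sketch of what that costs in the common case where the filesystem offers no insert-in-place: the whole tail of the file has to be read and rewritten. File name, offset and error handling are simplified placeholders.

    #include <stdio.h>
    #include <stdlib.h>

    /* Inserts `len` bytes at `offset` by rewriting the tail of the file. */
    static int insert_at(const char *path, long offset, const char *buf, size_t len) {
        FILE *f = fopen(path, "rb+");
        if (!f) return -1;

        /* 1. Read everything after the insertion point (the extra read). */
        fseek(f, 0, SEEK_END);
        long size = ftell(f);
        long tail_len = size - offset;
        char *tail = malloc(tail_len);
        fseek(f, offset, SEEK_SET);
        fread(tail, 1, tail_len, f);

        /* 2. Write the new data at the insertion point. */
        fseek(f, offset, SEEK_SET);
        fwrite(buf, 1, len, f);

        /* 3. Write the old tail back after it (the extra write). */
        fwrite(tail, 1, tail_len, f);

        free(tail);
        fclose(f);
        return 0;
    }

    int main(void) {
        /* Appending, by contrast, is a single positioned write at EOF. */
        return insert_at("data.bin", 512 * 1024, "NEW DATA", 8);
    }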
