Creating a Wise Ordering of Writes cache on Windows

I'm considering creating a wise ordering of writes cache (WOW cache) for a database system I'm building, but I'm having trouble mapping regions of the file to physical disk regions.
I've discovered that you can use DeviceIoControl to get the virtual cluster allocations for a file, but I'm unsure how to relate these to physical disk locations.
Is this possible on Windows? If not, is there a workaround or a more applicable write-caching algorithm?

Try the following.
First, open a handle to the volume using the CreateFile routine (use \\?\C: as the name of volume C:).
Use DeviceIoControl with IOCTL_VOLUME_GET_VOLUME_DISK_EXTENTS to retrieve the physical location of a volume on one or more disks.
Thus, you will have the physical location of the volume within a disk (or across several disks in more complex setups) and the virtual cluster allocation for the file within the volume. Combine these two pieces of information to get the physical location of the file within a disk.
P.S. You may also need some auxiliary information, such as the disk sector size and the volume cluster size.
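Below is a minimal sketch in C of that sequence (the fixed-size extent buffer and the reduced error handling are simplifications for illustration; a spanned or striped volume may need a larger buffer and a retry if the call reports the buffer is too small):

/* Open the C: volume and list its extents on the underlying physical disk(s).
   Combine this with the file's cluster map (e.g. from FSCTL_GET_RETRIEVAL_POINTERS)
   to translate a file offset into a physical disk offset. */
#include <windows.h>
#include <winioctl.h>
#include <stdio.h>

int main(void)
{
    /* Open a handle to the volume itself, not to a file on it. */
    HANDLE hVol = CreateFileW(L"\\\\?\\C:", 0,
                              FILE_SHARE_READ | FILE_SHARE_WRITE,
                              NULL, OPEN_EXISTING, 0, NULL);
    if (hVol == INVALID_HANDLE_VALUE) {
        printf("CreateFile failed: %lu\n", GetLastError());
        return 1;
    }

    /* Room for a handful of extents; simple volumes normally have just one. */
    BYTE buffer[sizeof(VOLUME_DISK_EXTENTS) + 8 * sizeof(DISK_EXTENT)];
    DWORD bytes = 0;
    if (!DeviceIoControl(hVol, IOCTL_VOLUME_GET_VOLUME_DISK_EXTENTS,
                         NULL, 0, buffer, sizeof(buffer), &bytes, NULL)) {
        printf("DeviceIoControl failed: %lu\n", GetLastError());
        CloseHandle(hVol);
        return 1;
    }

    VOLUME_DISK_EXTENTS *extents = (VOLUME_DISK_EXTENTS *)buffer;
    for (DWORD i = 0; i < extents->NumberOfDiskExtents; ++i) {
        DISK_EXTENT *e = &extents->Extents[i];
        /* For a simple single-extent volume:
           physical byte = StartingOffset + (file LCN * cluster size) + offset in cluster. */
        printf("Disk %lu: volume starts at byte %lld, length %lld\n",
               e->DiskNumber,
               (long long)e->StartingOffset.QuadPart,
               (long long)e->ExtentLength.QuadPart);
    }

    CloseHandle(hVol);
    return 0;
}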

Related

What happens to the physical memory contents of a process during context switch

Let's say that process A's virtual address V1 maps to physical address P1 (V1->P1).
During a context switch, the page table of process A is swapped out for process B's page table.
Let's say that process B's virtual address V2 also maps to physical address P1 (V2->P1), with its own contents in that memory area.
Now what has happened to the physical memory contents that V1 was pointing to?
Are they saved somewhere when the context switch takes place? If so, what if process A had written contents close in size to the available physical memory (RAM)? Where would the contents be saved then?
There are many ways an OS can handle the scenario described in the question, which is essentially how to deal with running out of free RAM. Depending on the CPU architecture and the goals of the OS, here are some ways of handling this issue.
One solution is to simply kill processes when they attempt to malloc (or use some similar mechanism) and there are no free pages available. This effectively avoids the problem posed in the original question. On the surface this seems like a bad idea, but it has the advantages of simplifying kernel code and potentially speeding up context switches. In fact, for some applications, if the running code had to use swap space on non-volatile storage to accommodate pages that cannot fit into RAM, the performance hit would be so large that the system has effectively failed anyway. Besides, not all computers even have non-volatile storage to use for swap space!
As already alluded to, the alternative is to use non-volatile storage to hold pages that cannot fit into RAM. Specific implementations vary with the needs of the system. Here are some possible ways the mappings V1->P1 and V2->P1 can both exist.
1 - There is often no strict requirement that the OS maintain both a V1->P1 and a V2->P1 mapping. So long as the contents of the virtual space stay the same, the physical address backing it is transparent to the running program. If both programs need to run concurrently, the OS can stop the program using V2, move the contents of P1 to a new region, say P2, remap V2 to P2, and then resume the program using V2. This assumes free RAM exists to map to, of course.
2 - The OS can simply choose not to map the full virtual address space of a program into RAM-backed physical memory. Suppose not all of V1's address space is directly mapped into physical memory. When the program using V1 touches an unmapped section, the OS catches the exception (page fault) this triggers. If available RAM is running low, the OS can then use swap space on non-volatile storage: it frees up some RAM by pushing the contents of a physical region not currently in use (such as the P1 space) out to swap, loads the requested page into the freed RAM, sets up the virtual-to-physical mapping, and returns execution to the program using V1. A toy sketch of this mechanism appears at the end of this answer.
The advantage of this approach is that the OS can allocate more memory than it has RAM. Additionally, in many situations programs tend to repeatedly access a small area of memory, so not having the entire virtual address region paged into RAM may not incur that big a performance penalty. The main downsides are that it is more complex to code, it can make context switches slower, and accessing non-volatile storage is extremely slow compared to RAM.
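As a rough, purely illustrative sketch of option 2, here is a toy, self-contained C program (not real kernel code; every name in it is invented for the example) modelling page-table entries that point either at a RAM frame or at a swap slot, with a fault handler that writes a victim frame out to "swap" before mapping the requested page in:

/* Toy model of demand paging with a swap area. Purely illustrative:
   real MMUs, page tables, and eviction policies are far more involved. */
#include <stdio.h>
#include <string.h>

#define NUM_FRAMES 2     /* pretend RAM holds only two page frames   */
#define NUM_PAGES  4     /* each toy process has four virtual pages  */
#define PAGE_SIZE  16

typedef struct {
    int present;         /* 1 = resident in a RAM frame              */
    int frame;           /* valid when present                       */
    int swap_slot;       /* where the page lives when not present    */
} pte_t;

static char ram[NUM_FRAMES][PAGE_SIZE];
static char swap_space[NUM_PAGES * 2][PAGE_SIZE];  /* "disk" backing store       */
static pte_t *frame_owner[NUM_FRAMES];             /* reverse map: frame -> PTE  */
static int next_victim = 0;                        /* trivial round-robin policy */

/* Bring the faulting page into RAM, evicting a victim page to swap if needed. */
static void handle_fault(pte_t *pte)
{
    int frame = next_victim;
    next_victim = (next_victim + 1) % NUM_FRAMES;

    pte_t *victim = frame_owner[frame];
    if (victim) {
        memcpy(swap_space[victim->swap_slot], ram[frame], PAGE_SIZE); /* save "P1" to swap */
        victim->present = 0;
    }
    memcpy(ram[frame], swap_space[pte->swap_slot], PAGE_SIZE);        /* load the new page */
    pte->present = 1;
    pte->frame = frame;
    frame_owner[frame] = pte;
}

/* Touch one byte of a virtual page, faulting it in first if necessary. */
static char read_byte(pte_t *table, int page, int offset)
{
    if (!table[page].present)
        handle_fault(&table[page]);
    return ram[table[page].frame][offset];
}

int main(void)
{
    pte_t proc_a[NUM_PAGES] = {{0}}, proc_b[NUM_PAGES] = {{0}};
    for (int i = 0; i < NUM_PAGES; ++i) {
        proc_a[i].swap_slot = i;              /* A's pages back onto slots 0..3 */
        proc_b[i].swap_slot = NUM_PAGES + i;  /* B's pages back onto slots 4..7 */
        swap_space[i][0] = 'A';
        swap_space[NUM_PAGES + i][0] = 'B';
    }
    /* Both processes touch more pages than there are frames; contents survive
       because an evicted frame is written back to swap before being reused. */
    printf("%c %c %c %c\n",
           read_byte(proc_a, 0, 0), read_byte(proc_b, 0, 0),
           read_byte(proc_a, 1, 0), read_byte(proc_a, 0, 0));
    return 0;
}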

API for monitoring individual file I/O performance on Windows

What Windows API can I use to monitor I/O performance metrics for a specific file or set of files? Performance counters seem to offer only higher-level objects such as LogicalDisk and PhysicalDisk. I'm looking for something like what Windows Resource Monitor shows under Disk -> Disk Activity, i.e. read/write bytes per second and response time.
I did a quick search for "Perfmon individual files" and didn't see anything promising.
But I'm not sure measuring the performance of individual files will be all that meaningful. I/O activity is coalesced in several places in the I/O stack, with the result that at various levels the OS can't distinguish I/O for one file from I/O for another.
Assuming the app isn't doing any buffering/caching on its own, the first place coalescing can happen is in the buffering done by the C (or similar) runtime libraries. Another place is in the file system (I'm assuming NTFS). I/O for file directories can be coalesced across multiple files in the same directory. I/O can also be coalesced based on the file system's block size: if multiple MFT entries share a block, they can all be read/written at once. NTFS also implements caching and other I/O optimizations (read-ahead). The performance of the cache can be affected by other processes running at the same time, either by accessing the same file(s) you want to measure (helping to keep the file in cache) or by accessing other files (helping to evict your file from cache).
Coalescing also happens below the file system at the logical disk level. Single I/Os may service multiple files.
At the disk driver level, single I/O requests may again involve multiple files. Additionally, the driver (or, more likely, the drive firmware) can reorder disk I/Os based on what it knows about the drive "geometry" to gain additional throughput at the (possible) expense of response time. In this case, I/O to your files may suffer compared to what it would see if other processes weren't doing I/O at the same time.
Many disks implement caching in DRAM. This cache will also be affected by other processes in the same way the Windows cache is, again affecting measured performance due to other processes' activity.
If you still want to measure, though, one way to circumvent the limitations in Perfmon is to put files or sets of files on different drives. The drives don't necessarily have to be different physical drives; they could be VHDs or some other kind of virtual disk on a physical disk. I know the Volume Snapshot Service (VSS) SDK has a little utility to create virtual drives out of files.
But putting your files on their own physical disks will probably give much more consistent results.
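If you do go the separate-volume route, the standard LogicalDisk counters effectively become counters for your file set. Here is a minimal sketch in C using the PDH API to sample them; the drive letter D: and the particular counters chosen are assumptions for the example:

/* Sample per-volume read throughput and latency via the PDH API.
   Build with pdh.lib; assumes the files of interest are alone on D:. */
#include <windows.h>
#include <pdh.h>
#include <stdio.h>

#pragma comment(lib, "pdh.lib")

int main(void)
{
    PDH_HQUERY query;
    PDH_HCOUNTER readBps, readLatency;
    PDH_FMT_COUNTERVALUE value;

    if (PdhOpenQuery(NULL, 0, &query) != ERROR_SUCCESS)
        return 1;

    /* Per-volume counters; they describe "your" files only because those
       files are the sole occupants of the D: volume. */
    PdhAddEnglishCounterW(query, L"\\LogicalDisk(D:)\\Disk Read Bytes/sec", 0, &readBps);
    PdhAddEnglishCounterW(query, L"\\LogicalDisk(D:)\\Avg. Disk sec/Read", 0, &readLatency);

    PdhCollectQueryData(query);   /* first sample establishes a baseline */
    Sleep(1000);
    PdhCollectQueryData(query);   /* second sample yields the rates      */

    PdhGetFormattedCounterValue(readBps, PDH_FMT_DOUBLE, NULL, &value);
    printf("Read bytes/sec: %.0f\n", value.doubleValue);

    PdhGetFormattedCounterValue(readLatency, PDH_FMT_DOUBLE, NULL, &value);
    printf("Avg sec/read  : %.6f\n", value.doubleValue);

    PdhCloseQuery(query);
    return 0;
}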

Whole memory cycle in executing a program

I have been thinking about how all the information (data) is passed around while executing a program or query.
The outline below expands my assumptions:
All data are stored in a disk storage.
The whole platter of the disk is divided into many sectors, and sectors are divided into blocks. Blocks are divided into pages, and pages are contained in a page table with a sequence id.
The most frequently used data are stored in cache for faster access.
If data is not found in the cache, the program then checks main memory, and if a page fault occurs, it goes to disk storage.
Virtual memory is used as an address mapping from RAM to disk storage.
Do you think I am missing anything here? Is my assumption correct regarding how memory management works? I will appreciate any helpful comments. Thank you.
I think you are mixing too many things together.
All data are stored in a disk storage.
In most disk based operating systems, all user data (and sometimes kernel data) is stored on disk (somewhere) and mapped to memory.
The whole platter of the disk is divided into many sectors, and sectors are divided into blocks. Blocks are divided into pages, and pages are contained in a page table with a sequence id.
No.
Most disks these days use logical I/O so that the software only sees blocks, not tracks, sectors, and platters (as in ye olde days).
Blocks exist only on disk. Pages exist only in memory. Blocks are not divided into pages.
The most frequently used data are stored in cache for faster access.
There are two common caches, and I cannot tell which one you are referring to. One is the CPU cache (hardware); the other is the set of software caches maintained by the operating system.
If data is not found in the cache, the program then checks main memory, and if a page fault occurs, it goes to disk storage.
No.
This sounds like you are referring to the CPU cache. Page faults are triggered when the page table shows that the referenced page is not in memory, not when data misses the CPU cache.
Virtual memory is used as an address mapping from RAM to disk storage.
Logical memory mapping is used to map logical pages to physical page frames. Virtual memory is used to map logical pages to disk storage.

Tachyon Doesn't Seem to be Aware of Available Memory

Just to see if Tachyon would give me an error about the configured memory being more than what is available, I set:
# Some value over combined available mem and disk space.
export TACHYON_WORKER_MEMORY_SIZE=1000GB
And observed the allocation in the web UI without error.
Is some of the info going to be pushed to disk when available RAM is exceeded?
What happens when it exceeds disk space? Dropped file errors or system failure?
This is the expected (if perhaps unhelpful) behaviour, and ultimately it comes down to the fact that Tachyon uses Linux ramfs as its in-memory storage.
As this article explains:
ramfs file systems cannot be limited in size like a disk-based file system, which is limited by its capacity. ramfs will continue using memory storage until the system runs out of RAM and likely crashes or becomes unresponsive.
Note that Tachyon will enforce the size constraint based on the size you give it. However, as you've found, you can allocate more RAM than is actually available and Tachyon won't check this, so you may want to go ahead and file a bug report.
To answer your specific questions:
No, excess data will not be pushed to disk automatically.
When RAM is full, the behaviour is OS-dependent.
Note that the setting you are referring to only controls the in-memory space; if you want to use local disks in addition to RAM, you need to use Tachyon's Tiered Storage.

Question about hard drive 'seek' and 'read' in Windows OS

Does anyone know, when calling 'seek' and 'read', how the hard drive is physically affected?
To be more specific: I know that the hard drive has some kind of magnetic needle that is used to read the data from the magnetic platters. So my question is, when is the needle actually moved to the reading location?
Is it moved when we call the 'seek' Windows API method (whether or not an actual read is performed), or does 'seek' just remember a virtual pointer, with the physical movement of the needle performed only when the 'read' method is called?
Edit: Assume that the data requested from the hard drive doesn't exist in any of the caches (hard-drive cache, OS cache, RAM, or whatever else it could be).
I wanted to break out this question from your post:
When is the needle actually moved to the reading location?
I think the simple answer is "whenever data is requested that is not already present in any of a number of caches". The problem with predicting hard drive movement is that you have to consider all of the different places that cache data read from the hard drive. If the data is present in those caches and accessible in the context requesting the data, the cache will be used instead of actually reading the hard drive. Here are just some of the places that can and do cache hard drive data:
Hard Drive's internal cache
OS level caches
Program level caches
API level cache
In the case where none of the data is present, it will likely be read from the hard drive during a read call. A seek call is unlikely to cause the hard drive head to move, because you're not changing a physical hard drive pointer, only a virtual pointer to a position in the file within your program.
The hard drive head (needle) starts moving, and the disk starts spinning up (unless it is already spinning), at the read operation. There is no head movement or spin-up at the seek operation.
Please note that the head may move non-sequentially above the disk even if you are reading a file sequentially, i.e. the read of the 2nd, 3rd, etc. 512-byte block may cause the head to move far away even when there are no intervening seeks. This can be because the file is fragmented on the filesystem, or because the firmware remaps sector numbers (i.e. logical sector 5 is not physically between logical sectors 4 and 6) to compensate for bad-block errors.
The assumption in the question, "Assume that the data requested from the hard drive doesn't exist in any of the caches (hard-drive cache, OS cache, RAM or whatever else it could be)", is hard to guarantee and relatively rare in practice. Even in this case, there is only a loose association between user-mode file I/O operations and physical storage device operations.
There are many user-mode file I/O functions in various Windows libraries. Some of the oldest are the C library low-level I/O functions. There are also the C library stream I/O functions, the C++ iostreams classes, and the managed I/O classes. There are other I/O interfaces as well that are part of other packages.
In general, all the user-mode I/O libraries are built on top of the Win32 file I/O functions, including CreateFile(), SetFilePointer(), ReadFile(), and WriteFile().
Unless a file is opened in unbuffered mode, the operating system can cache the file's contents. This is done system-wide, not on a per-file basis. So even if your program has not previously read or written a file, I/O to that file may already be cached and not result in any physical storage device I/O.
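To make the distinction concrete, here is a minimal sketch in C using the Win32 calls named above (the path is just an illustrative assumption). The seek step is pure bookkeeping on the handle's file position; the read step is the point at which device I/O can occur, though even then the request may be satisfied from the system cache or the drive's own cache:

#include <windows.h>
#include <stdio.h>

int main(void)
{
    HANDLE h = CreateFileW(L"C:\\data\\example.bin", GENERIC_READ,
                           FILE_SHARE_READ, NULL, OPEN_EXISTING,
                           FILE_ATTRIBUTE_NORMAL, NULL);
    /* Adding FILE_FLAG_NO_BUFFERING above reduces (but does not eliminate)
       caching effects, at the cost of requiring sector-aligned offsets,
       buffer addresses, and read lengths. */
    if (h == INVALID_HANDLE_VALUE) {
        printf("CreateFile failed: %lu\n", GetLastError());
        return 1;
    }

    LARGE_INTEGER pos;
    pos.QuadPart = 1024 * 1024;        /* the "seek": no device I/O expected here */
    SetFilePointerEx(h, pos, NULL, FILE_BEGIN);

    char buffer[4096];
    DWORD bytesRead = 0;
    /* The "read": the data must now come from somewhere, whether the system
       cache, the drive's internal cache, or, in the worst case, the platters. */
    if (ReadFile(h, buffer, sizeof(buffer), &bytesRead, NULL))
        printf("Read %lu bytes\n", bytesRead);

    CloseHandle(h);
    return 0;
}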
There are many factors that determine how file I/Os map to actual I/O operations on a physical device. These include library-level buffering, OS caching, device driver caching, hardware-level caching, device block size, file size, hardware block/sector remapping, and other factors.
The short story here is that you cannot assume that individual file level read or seek operations correspond to physical device operations, such as disk head seeking.
This gets even trickier when writes are considered. Writes are often accompanied by a flush, which the application developer assumes will push the data all the way to the physical media. Developers often assume that when a flush call returns success, the data is guaranteed to be persistent on the storage device. This is far from true, as devices and drivers often ignore flush calls.
There is more complexity with solid state drives, which are not mechanical and therefore have no 'seek' operation. Here, other physical characteristics manifest themselves, such as the necessity to erase blocks before they are written to.
