In order to support data mangling I need to write a custom device driver inserting a short amount of code at latest possible moment before actual write to SD (mmc driver) and, specularly, at the earliest possible moment after data is read back from SD.
I am aware all I/O is done using DMA transfer directly from/to disk cache structures, this means I will have to allocate a new buffer, transcode buffer to temp, point DMA to temp and start transfer. Reverse path on read.
Ideally I should use standard kernel crypto facilities (dm-crypt and LUKS), but my linux device is a small embedded ARM device which slows to a crawl with standard encryption, so I'm willing to trade some security for speed and settle for a "smart-obfuscation" instead of true crypto.
I need to find the point where to insert my code. In that point I need to have access to the data buffer, the sector number where buffer will be written/read and be able to redirect DMA transfer to a temp buffer.
kernel/drivers/mmc/core/core.c seems to have only routines dealing with card as a whole (reser, reset, ...) and not for actual data handling.
I have been unable to find the right place (to date) can someone point me to the right file, please?
EDIT:
As pointed out in a comment I don't really need to change data at the "absolute last moment", but that seemed the best solution because:
Mangling will no change data length.
Mangling depends on actual logical sector.
Data in disk cache should remain readable and usable.
Only data going to SD needs to be mangled (no mangling for data in Flash).
I will need to do the same modification to a desktop PC to be able to read/write SDs used in the embedded system.
Overhead should be kept as low as possible (embedded has low mem and computational power).
Any (roughly) equivalent solution can be evaluated.
I am also willing to forgo DMA usage and force PIO-mode for SD if that makes things easier; this would lift requirement of sector copying as requested mangling can be done "on the fly" while transferring data from buffer to peripheral.
I'm passing 1-2 MB of data from one process to another, using a plain old file. Is it significantly slower than going through RAM entirely?
Before answering yes, please keep in mind that in modern Linux at least, when writing a file it is actually written to RAM, and then a daemon syncs the data to disk from time to time. So in that way, if process A writes a 1-2 MB into a file, then process B reads them within 1-2 seconds, process B would simply read the cached memory. It gets even better than that, because in Linux, there is a grace period of a few seconds before a new file is written to the hard disk, so if the file is deleted, it's not written at all to the hard disk. This makes passing data through files as fast as passing them through RAM.
Now that is Linux, is it so in Windows?
Edit: Just to lay out some assumptions:
The OS is reasonably new - Windows XP or newer for desktops, Windows Server 2003 or newer for servers.
The file is significantly smaller than available RAM - let's say less than 1% of available RAM.
The file is read and deleted a few seconds after it has been written.
When you read or write to a file Windows will often keep some or all of the file resident in memory (in the Standby List). So that if it is needed again, it is just a soft-page fault to map it into the processes' memory space.
The algorithm for what pages of a file will be kept around (and for how long) isn't publicly documented. So the short answer is that if you are lucky some or all of it may still be in memory. You can use the SysInternals tool VMmap to see what of your file is still in memory during testing.
If you want to increase your chances of the data remaining resident, then you should use Memory Mapped Files to pass the data between the two processes.
Good reading on Windows memory management:
Mysteries of Windows Memory Management Revealed
You can use FILE_ATTRIBUTE_TEMPORARY to hint that this data is never needed on disk:
A file that is being used for temporary storage. File systems avoid writing data back to mass storage if sufficient cache memory is available, because typically, an application deletes a temporary file after the handle is closed. In that scenario, the system can entirely avoid writing the data. Otherwise, the data is written after the handle is closed.
(i.e. you need use that flag with CreateFile, and DeleteFile immediately after closing that handle).
But even if the file remains cached, you still have to copy it twice: from your process A to the cache (the WriteFile call), and from cache to the proces B (ReadFile call).
Using memory mapped files (MMF, as josh poley already suggested) has the primary advantage of avoiding one copy: the same physical memory pages are mapped into both processes.
A MMF can be backed by virtual memory, which means basically that it always stays in memory unless swapping becomes necessary.
The major downside is that you can't easily grow the memory mapping to changing demands, you are stuck with the initial size.
Whether that matters for an 1-2 MB data transfer depends mostly on how you acquire and what you do with the data, in many scenarios the additional copy doesn't really matter.
mmap can be used to share read-only memory between processes, reducing the memory foot print:
process P1 mmaps a file, uses the mapped memory -> data gets loaded into RAM
process P2 mmaps a file, uses the mapped memory -> OS re-uses the same memory
But how about this:
process P1 mmaps a file, loads it into memory, then exits.
another process P2 mmaps the same file, accesses the memory that is still hot from P1's access.
Is the data loaded again from disk? Is the OS smart enough to re-use the virtual memory even if "mmap count" dropped to zero temporarily?
Does the behaviour differ between different OS? (I'm mostly interested in Linux/OS X)
EDIT: In case the OS is not smart enough -- would it help if there is one "background process", keeping the file mmaped, so it never leaves the address space of at least one process?
I am of course interested in performance when I mmap and munmap the same file successively and rapidly, possibly (but not necessarily) within the same process.
EDIT2: I see answers describing completely irrelevant points at great length. To reiterate the point -- can I rely on Linux/OS X to not re-load data that already resides in memory, from previous page hits within mmaped memory segments, even though the particular region is no longer mmaped by any process?
The presence or absence of the contents of a file in memory is much less coupled to mmap system calls than you think. When you mmap a file, it doesn't necessarily load it into memory. When you munmap it (or if the process exits), it doesn't necessarily discard the pages.
There are many different things that could trigger the contents of a file to be loaded into memory: mapping it, reading it normally, executing it, attempting to access memory that is mapped to the file. Similarily, there are different things that could cause the file's contents to be removed from memory, mostly related to the OS deciding it wants the memory for something more important.
In the two scenarios from your question, consider inserting a step between steps 1 and 2:
1.5. another process allocates and uses a large amount of memory -> the mmaped file is evicted from memory to make room.
In this case the file's contents will probably have to get reloaded into memory if they are mapped again and used again in step 2.
versus:
1.5. nothing happens -> the contents of the mmaped file hang around in memory.
In this case the file's contents don't need to be reloaded in step 2.
In terms of what happens to the contents of your file, your two scenarios aren't much different. It's something like this step 1.5 that would make a much more important difference.
As for a background process that is constantly accessing the file in order to ensure it's kept in memory (for example, by scanning the file and then sleeping for a short amount of time in a loop), this would of course force the file to remain in memory. but you're probably better off just letting the OS make its own decision about when to evict the file and when not to evict it.
The second process likely finds the data from the first process in the buffer cache. So in most cases the data will not be loaded again from disk. But since the buffer cache is a cache, there are no guarantees that the pages don't get evicted inbetween.
You could start a third process and use mmap(2) and mlock(2) to fix the pages in ram. But this will probably cause more trouble than it is worth.
Linux substituted the UNIX buffer cache for a page cache. But the principle is still the same. The Mac OS X equivalent is called Unified Buffer Cache (UBC).
I am looking to optimize my disk IO, and am looking around to try to find out what the disk cache size is. system_profiler is not telling me, where else can I look?
edit: my program is processing entire volumes: I'm doing a secure-wipe, so I loop through all of the blocks on the volume, reading, randomizing the data, writing... if I read/write 4k blocks per IO operation the entire job is significantly faster than r/w a single block per operation. so my question stems from my search to find the ideal size of a r/w operation (ideal in terms of performance:speed). please do not point out that for a wipe-program I don't need the read operation, just assume that I do. thx.
Mac OS X uses a Unified Buffer Cache. What that means is that in the kernel VM objects and files are them at some level, same thing, and the size of the available memory for caching is entirely dependent on the VM pressure in the rest of the system. It also means the read and write caching is unified, if an item in the read cache is written to it just gets marked dirty and then will be written to disk when changes are committed.
So the disk cache may be very small or gigabytes large, and dynamically changes as the system is used. Because of this trying to determine the cache size and optimize based on it is a losing fight. You are much better off looking at doing things that inform the cache how to operate better, like checking with the underlying device's optimal IO size is, or identifying data that should not be cached and using F_NOCACHE.
I have a small question:
For example I'm using System.IO.File.Copy() method from .NET Framework. This method is a managed wrapper for CopyFile() function from WinAPI. But how CopyFile function works? It is interacts with HDD's firmware or maybe some other operations are performed through Assembler or maybe something other...
How does it look like from the highest level to the lowest?
Better to start at the bottom and work your way up.
Disk drives are organized, at the lowest level, in to a collection of Sectors, Tracks, and Heads. Sectors are segments of a track, Tracks are area on the disks itself, represented by the heads position as the platters spins underneath it, and the head is the actual element that reads the data from the platter.
Since Tracks are measured based on the distance that a head is from the center of a disk, you can see how towards the center of the disk the "length" of a track is short than one at the outer edge of the disk.
Sectors are pieces of a track, typically of a fixed length. So, an inner track will hold fewer sectors than an outer track.
Much of this disk geometry is handled by the drive controllers themselves nowadays, though in the past this organization was managed directly by the operating systems and the disk drivers.
The drive electronics and disk drivers cooperate to try and represent the disk as a sequential series of fixed length blocks.
So, you can see that if you have a 10MB drive, and you use 512 byte disk blocks, then that drive would have a capacity of 20,480 "blocks".
This block organization is the foundation upon which everything else is built. Once you have this capability, you can tell the disk, via the disk driver and drive controller, to go to a specific block on the disk, and read/write that block with new data.
A file system organizes this heap of blocks in to it's own structure. The FS must track which blocks are being used, and by which files.
Most file systems have a fixed location "where they start", that is, some place that upon start up they can go to try and find out information about the disk layout.
Consider a crude file system that doesn't have directories, and support files that have 8 letter names and 3 letter extension, plus 1 byte of status information, and 2 bytes for block number where the file starts on the disk. We can also assume that the system has a hard limit of 1024 files. Finally, it must know which blocks on the disk are being used. For that it will use 1 bit per block.
This information is commonly called the "file system metadata". When a disk is "formatted", nowadays it's simply a matter of writing new file system metadata. In the old days, it was a matter of actually writing sector marks and other information on blank magnetic media (commonly known as a "low level format"). Today, most drives already have a low level format.
For our crude example, we must allocate space for the directory, and space for the "Table of Contents", the data that says which blocks are being used.
We'll also say that the file system must start at block 16, so that the OS can use the first 16 blocks for, say, a "boot sector".
So, at block 16, we need to store 14 bytes (each file entry) * 1024 (number of files) = 12K. Divide that by 512 (block size) is 24 blocks. For our 10MB drive, it has 20,480 blocks. 20,480 / 8 (8 bits/byte) is 2,560 bytes / 512 = 5 blocks.
Of the 20,480 block available on the disk, the file system metadata is 29 blocks. Add in the 16 for the OS, that 45 blocks out of the 20,480, leaving 20,435 "free blocks".
Finally, each of the data blocks reserves the last 2 bytes to point to the next block in the file.
Now, to read a file, you look up the file name in the directory blocks. From there, you find the offset to the first data block for the file. You read that data block, grab the last two bytes. If those two byte are 00 00, then that's the end of the file. Otherwise, take that number, load that data block, and keep going until the entire file is read.
The file system code hides the details of the pointers at the end, and simply loads blocks in to memory, for use by the program. If the program does a read(buffer, 10000), you can see how this will translate in to reading several blocks of data from the disk until the buffer has been filled, or the end of file is reached.
To write a file, the system must first find a free space in the directory. Once it has that, it then finds a free block in the TOC bitmap. Finally, it takes the data, write the directory entry, sets its first block to the available block from the bitmap, toggles the bit on the bitmap, and then takes the data and writes it to the correct block. The system will buffer this information so that it ideally only has to write the blocks once, when they're full.
As it writes the blocks, it continues to consume bits from the TOC, and chains the blocks together as it goes.
Beyond that, a "file copy" is a simple process, from a system leverage the file system code and disk drivers. The file copy simply reads a buffer in, fills it up, writes the buffer out.
The file system has to maintain all of the meta data, keep track of where you are reading from a file, or where you are writing. For example, if you read only 100 bytes from a file, obviously the system will need to read the entire 512 byte datablock, and then "know" it's on byte 101 for when you try to read another 100 bytes from the file.
Also, I hope it's obvious, this is a really, really crude file system layout, with lots of issues.
But the fundamentals are there, and all file systems work in some manner similar to this, but the details vary greatly (most modern file systems don't have hard limits any more, as a simple example).
This is a question demanding or a really long answer, but I'm trying to make it brief.
Basically, the .NET Framework wraps some "native" calls, calls that are processed in lower-level libraries. These lower-level calls are often wrapped in a buffer logic to hide complicated stuff like synchronizing file contents from you.
Below, there is the native level, interacting with the OS' kernel. The kernel, the core of any operating system, then translates your high-level instruction to something your hardware can understand. Windows and Linux are for example both using a Hardware Abstraction Layer, a system that hides hardware specific details behind a generic interface. Writing a driver for a specific device is then only the task of implementing all methods certain device has to provide.
Before anything gets called on your hardware, the filesystem gets involved, and the filesystem for itself also buffers and caches a lot, but again transparently, so you don't even notice that. The last element in the call-queue is the device itself, and again, most devices conform to some standard ( like SATA or IDE ) and can thus be interfaced in a similar manner.
I hope this helps :-)
The .NET framework invokes the Windows API.
The Windows API has functions for managing files across various file systems.
Then it depends on the file system in question. Remember, it's not necessarily a "normal" file system over a HDD; It could even be a shell extension that just emulates a drive and keeps the data in you gmail account, or whatever. The point is that the same file manipulation functions in the Windows API are used as an abstraction over many possible lower layers of data.
So the answer really depends on the kind of file system you're interested in.