NVMeOF/RDMA sync file modifications - caching

I just set up the NVMeOF/RDMA environment to play around. I have a target node which NVMe SSD is accessed by some client nodes. However, when I delete a file say test on one client node, the rest nodes cannot see this operation and can still read the content of test as normal. I know that RDMA bypasses the kernel, so I guess this is because of the cache? I then have tried to clean up the cache using these commands:
sudo sync; echo 3 | sudo tee /proc/sys/vm/drop_caches
sudo sync; echo 1 | sudo tee /proc/sys/vm/drop_caches
sudo sync; echo 2 | sudo tee /proc/sys/vm/drop_caches
Unfortunately, other nodes still keep this file.
So actually I have two questions:
Does it exactly due to the cache? How does it work?
What is the correct way to clean up the cache so that other nodes can see the deletion without re-mount?
Any help will be greatly appreciated!

Relatively short answer
Like Boris said, you don't want to do that (distributed consistency on storage is a hard problem), and you need something else to do what you want. Flushing caches may not work because you've got multiple distinct views of the system + caching behaviors
Longer answer:
As Boris mentioned, NVMeoF is a block protocol. This means that at a broad level (with some hand-waving) all it can do is read and write blocks at a particular address. In practice, we usually have layers above the NVMe/NVMeoF communication layer like file systems that handle this abstraction.
I can't tell right now if you're using a file system or if you reading/writing the device directly, but in either case you are at least partially correct that the page cache may be getting in the way, even with RDMA.
Now, if you are using local file systems on your client nodes, you quickly get inconsistent views. The filesystem (and consequently overall operating system and its view of the state of the page cache and block storage) has no idea anyone else wrote anything. So even if you write and sync on one client, you may have to bypass the page cache on another (e.g. use O_DIRECT reads, which have their own sets of complexities) and make sure you target something that eventually refers to the same block addresses that were written on the NVMe target from your other client.
In theory, this will let you read data written by another if everything lines up correctly, in practice though this can cause confusion, especially if a file system or application on one client writes something at a location, and the other client attempts to read or write that location unknowingly. Now you have a consistency problem.

NVMeoF (with RDMA or any other transport) is a block level storage protocol and not a file level storage protocol. Thus, there is no guarantee to the atomicity of file operations across nodes in NVMeoF systems. Even if one node deletes a file, there is no guarantee that:
The delete operation was actually translated to block erase operations and sent to the storage server;
Even if the storage server deleted the blocks, there is no guarantee that other clients that have cached this data will not continue to read it. Moreover, another client can overwrite the deleted file.
Overall, I think that to have any guarantees at the file level, you should consider a distribute file systems, rather than NVMeoF.
What is the correct way to clean up the cache so that other nodes can see the deletion without re-mount?
There is no good way to do it. Flushing the cache on all nodes and only then reading may work, but it depends on the file system.

Related

Write to and read from free disk space using Windows API

Is it possible to write to free clusters on disk or read data from them using Windows APIs? I found Defrag API: https://learn.microsoft.com/en-gb/windows/desktop/FileIO/defragmenting-files
FSCTL_GET_VOLUME_BITMAP can be used to obtain allocation state of each cluster, FSCTL_MOVE_FILE can be used to move clusters around. But I couldn't find a way of reading data from free clusters or writing data to them.
Update: one of the workarounds which comes to mind is creating a small new file, writing some data to it, then relocating it to desired position and deleting the file (the data will remain in freed cluster). But that still doesn't solve reading problem.
What I'm trying to do is some sort of transparent cache, so user could still use his NTFS partition as usual and still see these clusters as free space, but I could store some data in them. Data safety is not of concern, it can be overwritten by user actions and will just be regenerated / redownloaded later when clusters become free again.
There is no easy solution in this way.
First of all, you should create own partition of the drive. It prevents from an accidental access to your data from OS or any process. Then call CreateFileA() with name of the partition. You will get raw access to the data. Please bear in mind that the function will fail for any partition accessed by OS.
You can perform the same trick with a physical drive too.
The docs
One way could be to open the volume directly via using CreateFile with the volumes UNC path as filename arguement (e.g.: \\.\C:).
You now can directly read and write to the volume.
So you maybe can achieve your desired goal with:
get the cluster size in bytes with GetDiskFreeSpace
get the map of free clusters with DeviceIoControl and FSCTL_GET_VOLUME_BITMAP
open the volume with CreateFile with its UNC path \\.\F:
(take a careful look into the documentation, especially the Remarks sections part about opening drives and volumes)
seek to the the offset of a free cluster (clusterindex * clusterByteSize) by using SetFilePointer
write/read your data with WriteFile/ReadFile on the handle, retreived by above CreateFile
(Also note that read/write access has to be sector aligned, otherwise the ReadFile/WriteFile calls fail)
Please note:
this is only meant as a starting point for your own research. This is not a bullet proof cooking receipt.
Backup your data before messing with the file system!!!
Also keep in mind that the free cluster bitmap will be outdated as soon as you get it (especially if using the system volume).
So I would strongly advise against use of such techniques in production or customer environments.

What to do after a "rm -R /*"

I was working on my website under root and I commit the worst thing that a linux user can do : rm -R /* instead of rm -R ./*.
I've stopped the process when I saw that it was taking too long...
I manage to reinstall lubuntu with an usb key, is this a good idea or are there other ways to reverse this big mistake ?
Thanks to any answer
Short answer: no.
Long answer: depends on the filesystem and on how rm is implemented. It's possible that rm merely unlinks the file; the inode (marked "deleted") and data may still remain. And even if the inode is hard-deleted, the data may remain. But in either case: there is a risk that your actions since that time have already written data over your old data or over the location of the soft-deleted inode. This can happen even with temporary files, or file descriptors (such as for sockets or processes) or pagefile [well, unless that thing has its own partition].
I wouldn't recommend trying to relink soft-deleted inodes, or infer from your data how to reconstruct hard-deleted inodes. Sure, maybe for irreplaceable memories this would be worth it (take the drive to a data forensics specialist), but there's near-guaranteed corruption somewhere on the disk. I would certainly not attempt to run a production system off a disk recovered like that.
I recommend one of the following:
restoring from your regularly-scheduled backup
wiping everything and starting again (you have all your website files stored under source control and stored remotely, right?)
redeploying your Docker image (this was an immutable deployment, right?)

Hadoop - data block caching techniques

I am running some experiments to benchmark the time it takes (by map-reduce) to read and process data stored on HDFS with varying parameters. I use pig script to launch map-reduce jobs. Since I am working with the same set of files frequently, my results may get affected because of file/block caching.
I want to understand the various caching techniques employed in a map-reduce environment.
Lets say that a file foo (contains some data to be procesed) stored on HDFS occupies 1 HDFS block and it gets stored in machine STORE. During a map-reduce task, machine COMPUTE reads that block over network and processes it. Caching can happen at two levels:
Cached in memory of machine STORE (in-memory file cache)
Cached in memory/disk of machine COMPUTE.
I am pretty sure that #1 caching happens. I want to ensure whether something like #2 happens? From the post here, it looks like there is no client level caching going on since it is very unlikely that the block cached by COMPUTE will be needed again in the same machine before the cache is flushed.
Also, is the hadoop distributed cache used only to distribute any application specific files (not task specific input data files) to all task tracker nodes? Or is the task specific input file data (like the foo file block) cached in the distributed cache? I assume local.cache.size and related parameters only control the distributed cache.
Please clarify.
The only caching that is ever applied within HDFS is the OS caching to minimize disk accesses.
So if you access a block from a datanode, it is likely to be cached if nothing else is going on there.
On your client side, this depends on what you do with the block. If you directly write it to disk, it is also very likely that your client OS caches it.
The distributed cache is just for jars and files that need to be distributed across the cluster where your job launches tasks. The name is thus a bit misleading, as it "caches" nothing.

Move or copy and truncate a file that is in use

I want to be able to (programmatically) move (or copy and truncate) a file that is constantly in use and being written to. This would cause the file being written to would never be too big.
Is this possible? Either Windows or Linux is fine.
To be specific what I'm trying to do is log video with FFMPEG and create hour long videos.
It is possible in both Windows and Linux, but it would take cooperation between the applications involved. If the application that is writing the new data to the file is not aware of what the other application is doing, it probably would not work (well ... there is some possibility ... back to that in a moment).
In general, to get this to work, you would have to open the file shared. For example, if using the Windows API CreateFile, both applications would likely need to specify FILE_SHARE_READ and FILE_SHARE_WRITE. This would allow both (multiple) applications to read and write the file "concurrently".
Beyond sharing the file, though, it would also be necessary to coordinate the operations between the applications. You would need to use some kind of locking mechanism (either by locking some part of the file or some shared mutex/semaphore). Note that if you use file locking, you could lock some known offset in the file to act as a "semaphore" (it can even be a byte value beyond the physical end of the file). If one application were appending to the file at the same exact time that the other application were truncating it, then it would lead to unpredictable results.
Back to the comment about both applications needing to be aware of each other ... It is possible that if both applications opened the file exclusively and kept retrying the operations until they succeeded, then perform the operation, then close the file, it would essentially allow them to work without "knowledge" of each other. However, that would probably not work very well and not be very efficient.
Having said all that, you might want to consider alternatives for efficiency reasons. For example, if it were possible to have the writing application write to new files periodically, it might be more efficient than having to "move" the data constantly out of one file to another. Also, if you needed to maintain some portion of the file (e.g., move out the first 100 MB to another file and then move the second 100 MB to the beginning) that could be a fairly expensive operation as well.
logrotate would be a good option is linux, comes stock on just about any distro. I'm sure there's a similar windows service out there somewhere

Transaction implementation for a simple file

I'm a part of a team writing an application for embedded systems. The application often suffers from data corruption caused by power shortage. I thought that implementing some kind of transactions would stop this from happening. One scenario would include copying the area of a file before writing to some additional storage (transaction log). What are other possibilities?
Databases use a variety of techniques to assure that the state is properly persisted.
The DBMS often retains a replicated control file -- several synchronized copies on several devices. Two is enough. More if your're paranoid. The control file provides a few key parameters used to locate the other files and their expected states. The control file can include a "database version number".
Each file has a "version number" in several forms. A lot of times it's in plain form plus in some XOR-complement so that the two version numbers can be trivially checked to have the correct relationship, and match the control file version number.
All transactions are written to a transaction journal. The transaction journal is then written to the database files.
Before writing to database files, the original data block is copied to a "before image journal", or rollback segment, or some such.
When the block is written to the file, the sequence numbers are updated, and the block is removed from the transaction journal.
You can read up on RDBMS techniques for reliability.
There's a number of ways to do this; generally the only assumption required is that small writes (<4k) are atomic. For example, here's how CouchDB does it:
A 4k header contains, amongst other things, the file offset of the root of the BTree containing all the data.
The file is append-only. When updates are required, write the update to the end of the file, followed by any modified BTree nodes, up to and including the root. Then, flush the data, and write the new address of the root node to the header.
If the program dies while writing an update but before writing the header, the extra data at the end of the file is discarded. If it fails after writing the header, the write is complete and all is well. Because the file is append-only, these are the only failure scenarios. This also has the advantage of providing multi-version concurrency control with no read locks.
When the file grows too long, simply read out all the 'live' data and write it to a new file, then delete the original.
You can avoid implementing such transaction logs yourself by using existing transaction managers around file-systems, e.g. XADisk.
The old link is no longer available, a github repo is here.

Resources