True file descriptor clone - linux-kernel

Why is there no true file descriptor clone mechanism, one that gives the duplicate an independent file offset when that is possible, as it is for regular disk files?
POSIX:
After a successful return from one of these system calls, the old and
new file descriptors may be used interchangeably. They refer to the
same open file description (see open(2)) and thus share file offset
and file status flags; for example, if the file offset is modified by
using lseek(2) on one of the descriptors, the offset is also changed
for the other.
Windows:
The duplicate handle refers to the same object as the original handle. Therefore, any changes to the object are reflected through both handles. For example, if you duplicate a file handle, the current file position is always the same for both handles. For file handles to have different file positions, use the CreateFile function to create file handles that share access to the same file.
Reasons for having a clone primitive:
When manipulating a file archive, I want each file in the archive to be accessible independently. The file archive should behave somewhat like a virtual filesystem.
File type checking. Being able to clone file offsets makes it possible to read a small portion of the file without affecting the original position.

You should consider the following: a file descriptor is merely an index into the per-process array of "file" (literally, that is what they are called) object pointers on the kernel side. So when you duplicate a file descriptor, the kernel simply copies the file pointer from one slot in the array to another and increments the reference count on the pointed-to object.
Thus, your issue is not with file descriptor duplication, but with management of the file offsets. The easy answer for this: do it yourself. That is, associate the current file offset with each file descriptor on the application side explicitly.
Of course, the most basic file access system calls, read() and write(), use the kernel-maintained file offset, if it is available (and it is only available when you are dealing with "normal" random-access files). But more advanced file access system calls expect the desired file offset to be supplied by the application on each invocation. Those include pread()/pwrite(), preadv()/pwritev() and aio_read()/aio_write() (the latter is probably the best approach for writing parallel-access applications like the one you described).
On Windows, ReadFile()/WriteFile(), ReadFileScatter()/WriteFileGather() and ReadFileEx()/WriteFileEx() analogously expect to be passed the file offset on every invocation (via the lpOverlapped argument).

Related

What happens when the FILE stream file is overwritten

What happens if a file is overwritten between fopen() and fgets()? I have a program that is failing with the following stack trace:
0x00007f9d63629850 (Linux)
0x00007f9d6253e8ab (/lib64/libc-2.11.3.so) __memchr
0x00007f9d62523996 (/lib64/libc-2.11.3.so) _IO_getline_info_internal
0x00007f9d6252d0cd (/lib64/libc-2.11.3.so) __GI_fgets_unlocked
I have reason to believe the file being read might be being overwritten between fopen() and fgets(). How plausible is this?
We are on SUSE 11.4 with glibc 2.11.3 with all updates and patches applied to glibc.
This depends on how the file is overwritten on disk.
If the file is overwritten in place (same inode), your program will read the new data. fopen does not automatically fill the buffer before the first read, so fgets will fill the buffer and obtain the new data (after a concurrent update).
This applies predominantly to local file systems. Most network file systems have optimizations which, in the absence of file locks, may still return the old data.
Without file locking, concurrent modification can result in read operations which observe intermediate states.
If the file is replaced (either removed and recreated, or with a rename operation), the old data will be read.

How to read a file even after closing the handle in Windows?

I need to read a file in Windows and avoid any OS-level locks so users can delete the file even while my application is reading from it.
Typical read operations in C++, Python, Java, etc. produce the same expected sequence of Winapi calls when observed through Procmon:
CreateFile
ReadFile (multiple times until "END OF FILE" is reached)
CloseFile
If I try to delete the file via Explorer between steps 1 and 3 (basically after CreateFile and before CloseFile), I'll get a "file in use" error.
However, I noticed that when Dropbox reads files to upload to the server, the sequence is:
CreateFile
CloseFile
ReadFile
Repeat steps 1-3
Since ReadFile is called after CloseFile, I can still delete the file even while Dropbox is reading it.
I can't figure out how Winapi allows for ReadFile after CloseFile is called.
I've attached a screenshot of Procmon that shows Dropbox's behavior.
Anybody know how this is done?
This create-close-read sequence is most likely explained by the file being mapped into a memory region in the application's address space. When the application reads from that region, the operating system fetches the necessary data from the file and delivers it to the application (the data appear in the memory region where the file is mapped).
Memory-mapped files are backed by file mapping objects (called sections in the kernel world). Given a file handle, you can create such an object via CreateFileMapping and map it into your address space through MapViewOfFile (Ex variants also exist). The file mapping object keeps an extra reference to the file it maps. So, after creating the file mapping object for a file, you can close the file handle and still read the file through the mapping.

Why isn't copy operation implemented in kernel?

It's my understanding that most file I/O operations are implemented in the kernel: create, read, update, delete, move, remove. File copy, however, is not exposed as a kernel-level API.
To detect a file copy in the kernel, one would need a heuristic approach (discussion on this approach), e.g. detecting file reads, file creates and file writes from the same user with the same file name but different paths.
Why copy is a user land operation?
First, because caring about whether or not two different files have the same content, where one file's content is copied directly from the other, is a user-space concern that has no logical reason to exist inside a kernel.
At best.
Bytes are bytes.
Second, how would the kernel distinguish copying a file from any other transfer between two file descriptors? See the man page for sendfile(). Why should the kernel care whether the calling user invoked sendfile() to send the contents of a file to a TCP socket to who-knows-where, or to another file?
Third, even if the kernel tracked copying a file, what on God's good Earth would it do with such data?
If you care about such file copy events, set up auditing.

Storing data associated with opened file in linux kernel space

I want to observe some operations performed on files (and do it from kernel level). I have to attach some data to an opened file descriptor (for instance, a single int value). More precisely, for each file opened in do_sys_open I decide whether to track the file or not, and I have to store this decision somewhere for later use.
There is a private_data field in struct file; it seems good enough for my needs, but I suppose it is also used by other modules.
So, how can I store some data associated with an opened file descriptor (one value for every opened file)? Any suggestions?

Is appending to a file atomic with Windows/NTFS?

If I'm writing a simple text log file from multiple processes, can they overwrite/corrupt each other's entries?
(Basically, this question Is file append atomic in UNIX? but for Windows/NTFS.)
You can get atomic append on local files. Open the file with FILE_APPEND_DATA access (documented in the WDK). When you omit FILE_WRITE_DATA access, all writes ignore the current file pointer and are performed at end-of-file. Alternatively, you may use FILE_WRITE_DATA access and request an append for individual writes via the OVERLAPPED structure (Offset = FILE_WRITE_TO_END_OF_FILE and OffsetHigh = -1, documented in the WDK).
The append behavior is properly synchronized between writes via different handles. I use this regularly for logging by multiple processes. I write a BOM at offset 0 at every open, and all other writes are appended. Timestamps are not a problem; they can be sorted when needed.
Even if append is atomic (which I don't believe it is), it may not give you the results you want. For example, assuming a log includes a timestamp, it seems reasonable to expect more recent logs to be appended after older logs. With concurrency, this guarantee doesn't hold - if multiple processes are waiting to write to the same file, any one of them might get the write lock - not just the oldest one waiting. Thus, logs can be written out of sequence.
If this is not desirable behaviour, you can avoid it by publishing log entries from all processes to a shared queue, such as a named pipe. A single process then writes from this queue to the log file. This avoids the concurrency issues, ensures that logs are written in order, and works even when file appends are not atomic, since the file is written to directly by only one process.
From this MSDN page on creating and opening Files:
An application also uses CreateFile to specify whether it wants to share the file for reading, writing, both, or neither. This is known as the sharing mode. An open file that is not shared (dwShareMode set to zero) cannot be opened again, either by the application that opened it or by another application, until its handle has been closed. This is also referred to as exclusive access.
and:
If you specify an access or sharing mode that conflicts with the modes specified in the previous call, CreateFile fails.
So if you use CreateFile rather than, say, File.Open, which doesn't offer the same level of control over file access, you should be able to open the file in such a way that it cannot be corrupted by other processes.
You'll obviously have to add code to your processes to cope with the case where they can't get exclusive access to the log file.
No, it isn't. If you need this, there is Transactional NTFS in Windows Vista/7.