Is basic_managed_mapped_file.flush() immediately written to disk? - boost

I use basic_managed_mapped_file and I want to backup the file while the program is running.
How can I make sure the data is written to disk for the backup?

The answer is logically "yes": the operating system will make sure the data gets written, I believe even if your process were to crash right afterwards.
However, if you:
must be sure that the data hits the disk before doing anything else,
need to ensure that data hits the disk in a particular order (e.g. journaling/intent logging), or
need to be sure that data is safely written in the face of e.g. power failure,
then you will need to add a disk sync call on most OSes. If you require this level of control (and, worse, in a portable fashion), the topic quickly becomes hard, and I defer to
"Eat My Data: How Everybody Gets File IO Wrong" by Stewart Smith
I've also mirrored the video/slides just in case (see here)
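For illustration, the same distinction in Go (not boost; a Unix-only sketch using golang.org/x/sys/unix, with a made-up file name, and assuming the file is at least one 4096-byte page long): after mutating a shared mapping, an msync(MS_SYNC) call does not return until writeback of the dirty pages has been issued, whereas without it the OS merely writes them back "eventually".

package main

import (
    "log"
    "os"

    "golang.org/x/sys/unix"
)

func main() {
    f, err := os.OpenFile("data.bin", os.O_RDWR, 0)
    if err != nil {
        log.Fatal(err)
    }
    defer f.Close()

    // Map one page of the file, shared with other processes.
    data, err := unix.Mmap(int(f.Fd()), 0, 4096,
        unix.PROT_READ|unix.PROT_WRITE, unix.MAP_SHARED)
    if err != nil {
        log.Fatal(err)
    }
    defer unix.Munmap(data)

    data[0] = 42 // mutate through the mapping

    // Without this call the change reaches the disk "eventually";
    // with MS_SYNC we block until the writeback has been issued.
    if err := unix.Msync(data, unix.MS_SYNC); err != nil {
        log.Fatal(err)
    }
}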

Related

Why do software updates exist?

I know this may sound crazy but hear me out...
Say you have a game and you want to update it (add new features / redecorate for seasonal themes / add LTMs etc.) Now, instead of editing your code and then waiting days for your app market provider (Google/Microsoft/Apple etc.) to approve the update and roll out the changes, why not:
Put all of your code into a database
Remove all of your existing code from your code files
Add code which can run code from a database (reads it in and eval()s it)
This way, there'd be no need for software updates unless you wanted to change your database-related code, and you could simply update your database to change what the app does when it's live.
My Question: Why hasn't this been done?
For example:
Fortnite (a real game) often has LTMs (Limited Time Modes) which are available for a few weeks and are then removed. Generally, the software updates are ~ 5GB and take a lot of time unless your broadband is fast. If the code was fetched from a database and then executed, there'd be no need for these updates and the changes could be instantaneous.
EDIT: (In response to the close votes)
I'm looking for facts and statistics to back up reasons rather than pure opinions. Answers like "I think this would be good/bad ..." aren't needed (that's what comments are for); answers like "This would be good/bad, as this fact shows that ..." are much better and desired.
There are a few challenges with your suggested approach.
Putting everything in the database will increase the size of the DB, but that doesn't matter much.
If all the code is in the DB, someone could decompile your software and obtain a way to connect to your code.
eval() carries too much performance overhead; precompiled code is optimized for its respective runtime.
Multiple versions: what do you do when you want to ship multiple versions of your software?
The main reason updates exist is that they are easy to maintain, flexible, and let the developer ship the fastest, most optimized build.
Put all of your code into a database
Remove all of your existing code from your code files
Add code which can run code from a database (reads it in and eval()s it)
My Question: Why hasn't this been done?
This is exactly how every game works already.
Each time you launch a game, an executable binary game engine (which you describe in step 3) already reads the rest of your code (often some embedded language like Lua) from a "database" (the file system) and "evals" (interprets) it to run the game, along with assets like level geometry, textures, sounds and music.
You're talking about introducing a layer of abstraction (a real database) between the engine and its data to hide some of your assets, but the database stores its data on the file system, so you really haven't gained anything: you've just changed the way the data is encoded at rest and queried at runtime, and introduced a ton of overhead in both cases.
On the other hand, you're intentionally cheating your way through the app review process this way, and whatever real technical problems you would have are moot, because your app would not be allowed in the app store. The entire point of the app review process is to prevent people from shipping unverified, unreviewed code to users, and if your program is obviously designed to circumvent this, your app will be rejected.
Fortnite (a real game) often has LTMs (Limited Time Modes) which are available for a few weeks and are then removed. Generally, the software updates are ~ 5GB and take a lot of time unless your broadband is fast.
Fortnite will have a small binary executable that is the game engine. Updates to this binary will account for a fraction of a percent of that 5GB. The rest will be some kind of interpreted/embedded language describing the game's levels (also a tiny fraction), and then assets, which account for everything else (geometry, textures, sound, music).
If the code was fetched from a database and then executed, there'd be no need for these updates and the changes could be instantaneous
This makes no sense. If you move that entire 5GB from the file system into a database, you still have to transfer around 5GB worth of database updates. 5GB of data in a database still lives as 5GB of data on the file system; it's just that you can't access it directly anymore. You have to transfer the same amount of data regardless of how you store it.

Windows: redirect ReadFile to run a process and pipe its stdout

I was wondering how hard it would be to create a set-up under Windows where a regular ReadFile on certain files is redirected by the file system to actually run (e.g. ShellExecute) those files, with the new process's stdout then used as the file content streamed back to the caller's ReadFile...
What I envision the set-up to look like, is that you can configure it to denote a certain folder as 'special', and that this extra functionality is then only available on that folder's content (so it doesn't need to be disk-wide). It might be accessible under a new drive letter, or a path parallel to the source folder; the location it is hooked up to is irrelevant to me.
To those of you that wonder if this is a classic xy problem: it might very well be ;) It's just that this idea has intrigued me, and I want to know what possibilities there are. In my particular case I want to employ it to #include content in my C++ code base, where the actual content included is being made up on the spot, different on each compile round. I could of course also create a script to create such content to include, call it as a pre-build step and leave it at that, but why choose the easy route.
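For what it's worth, that "easy route" pre-build step could be as small as this hypothetical Go sketch (the output file name and generated contents are made up):

package main

import (
    "fmt"
    "log"
    "os"
    "time"
)

func main() {
    // Regenerate a header with fresh content before every compile.
    body := fmt.Sprintf("// auto-generated, do not edit\n"+
        "static const char* kBuildStamp = %q;\n",
        time.Now().Format(time.RFC3339))
    if err := os.WriteFile("generated_include.h", []byte(body), 0o644); err != nil {
        log.Fatal(err)
    }
}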
Maybe there are already ready-made solutions for this? I did an extensive Google search but came up empty-handed. But then, I'm not sure I know all the keywords needed to do a good search...
When coding up something myself, I think a minifilter driver might be needed to intercept the ReadFile calls, but it would then have to run user-mode apps from kernel space, not a happy marriage I assume. Or I could use an existing file system driver framework that allows for user-mode parts, but I found the price of existing solutions too steep for my taste (several thousand dollars).
And I also assume that a standard file system (minifilter) driver might be required to return a consistent file size for such files, although the actual data size returned through ReadFile would of course differ on each call. Not to mention negating any buffering that takes place.
All in all, I think a create-it-yourself solution will take quite some effort, especially when you have never done Windows driver development in your life :) Although I see myself as quite capable of reading up on it, the time investment would be prohibitive, I think.
Another approach might be to hook the ReadFile calls in the process doing the ReadFile, via IAT hooking or code injection. But I want this solution to work more 'out-of-the-box', i.e. all ReadFile requests for these special files trigger the correct behavior, regardless of origin. In my case I'd need to intercept my C++ compiler's (G++) behavior, but that one is invoked on the fly by the IDE, so I see no easy way to detect its startup and hook it quickly before it does its ReadFiles. And besides, I only want certain files to be special in this regard; intercepting all ReadFiles of a certain process is overkill.
You want something like FUSE (which I have used with profit many times), but for Windows. Apparently there's Dokan; I've never used it, but it seems well known enough (and, at the very least, it can serve as inspiration for seeing "how it's done").

Is there an OS-agnostic way to verify that a file isn't being written to or opened by another process?

I'm wondering if there is a way to validate at runtime that a file isn't being written to or held open by another process. Preferably a way that would work on all OSes.
Not in general.
The most ubiquitous general application-level mechanism for detecting and preventing use or alteration of a file that is being used by another process is file locking.
One reason there isn't a cross-platform solution is that some operating systems (most Unix variants and Linux, for example) provide cooperative locking, where file locks are merely advisory.
So, on those platforms, you can only guarantee knowledge of another process using a file when that process is known in advance to use a specific type of advisory lock.
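To make "advisory" concrete, here is a Unix-only Go sketch (made-up file name; it uses flock(2) via the syscall package, so it will not build on Windows). A cooperating reader can detect the lock, but nothing stops it from ignoring the lock and reading anyway:

package main

import (
    "fmt"
    "os"
    "syscall"
)

func main() {
    f, err := os.Open("shared.txt")
    if err != nil {
        panic(err)
    }
    defer f.Close()

    // A cooperating process asks for the lock: this fails with EWOULDBLOCK
    // while some other process holds an exclusive flock on the file...
    err = syscall.Flock(int(f.Fd()), syscall.LOCK_SH|syscall.LOCK_NB)
    fmt.Println("lock attempt:", err)

    // ...but the lock is only advisory: reading works regardless.
    buf := make([]byte, 16)
    n, _ := f.Read(buf)
    fmt.Printf("read %d bytes despite the lock\n", n)
}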
Most of those platforms do have mandatory locking available. It is set on a per-file basis as part of the file attributes. There are some problems with this (e.g. race conditions).
So no: the underlying mechanisms that could provide the verification you seek differ greatly between platforms. It would probably be very troublesome to provide a reliable cross-platform mechanism in Go that is guaranteed to work on a variety of popular platforms where other processes are, or can be, uncooperative.
References
https://www.kernel.org/doc/Documentation/filesystems/mandatory-locking.txt
https://unix.stackexchange.com/questions/244543/mandatory-locking-in-unix
This won't answer your question, but since we might be dealing with an XY problem here, I'd like to look at it from a point of view other than locking or otherwise detecting that the file is not being written to: the update-then-rename-over approach, which is the only sensible way to do atomic updates to files, and which is sadly not very well known among (novice) programmers.
Since the filesystem is inherently racy, ensuring proper "database-like" work with files, where everyone sees a consistent state of the file's contents, requires either locking or atomic updates, or both.
To update the file's contents in an atomic way, you do this:
Read the file's data.
Open a temporary file (on the same filesystem).
Write the updated data into it.
Rename the new file over the old one.
Renaming is guaranteed to be atomic on all contemporary commodity OSes so that when a process tries to open a file, it opens either an old copy or the new one but not something in between.
On POSIX systems, Go's os.Rename() has always been atomic, since it ends up calling rename(2); on Windows, this was fixed in Go 1.5.
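A minimal Go sketch of steps 1-4 (the function and package names, and the choice to fsync before renaming, are mine, not part of the answer):

package fileutil

import (
    "os"
    "path/filepath"
)

// atomicWriteFile replaces path with data by writing a temporary file
// on the same filesystem and renaming it over the original.
func atomicWriteFile(path string, data []byte) error {
    tmp, err := os.CreateTemp(filepath.Dir(path), filepath.Base(path)+".tmp-*")
    if err != nil {
        return err
    }
    defer os.Remove(tmp.Name()) // harmless once the rename has succeeded
    if _, err := tmp.Write(data); err != nil {
        tmp.Close()
        return err
    }
    if err := tmp.Sync(); err != nil { // push the bytes to disk before renaming
        tmp.Close()
        return err
    }
    if err := tmp.Close(); err != nil {
        return err
    }
    return os.Rename(tmp.Name(), path) // atomic on POSIX; on Windows since Go 1.5
}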
Note that this approach merely provides consistency of the file's contents, in the sense that no two processes will ever end up updating it at the same time, but it does not ensure "serialized" updates, which are only possible through locking or other "side-channel" signaling.
That is, with atomic updates, it's still possible to have this situation:
Processes A and B read the file's data.
They both modify it and do atomic updates.
The file's contents will be consistent, but the surviving state will be that of whichever process called the OS's renaming API function last.
So if you need serialization, you need locking.
I'm afraid no cross-platform file-locking solution exists for Go (approaches to locking differ greatly even across Unix-y systems, let alone Windows; see this for an entertaining read), but one way to do it is to use platform-specific locking of the temporary file created in step (2) above.
The approach to update a file then changes to:
Open a temporary file with a well-known name.
Say, if the file to update is named "foo.state", call it "foo.state.lock".
Lock it using any platform-specific locking.
If locking fails, this means another process is updating the file, so back out or wait; which one really depends on what you're after.
Once the lock is held, read the file's data.
Modify it, then rewrite the locked temporary file with the new data.
Rename the temporary file over the original one.
Close the temp. file and release the lock.
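A Unix-only Go sketch of this locked update (flock(2) via the syscall package, so it will not build on Windows; it deviates from the steps in one detail, locking a separate persistent "foo.state.lock" file rather than the temporary itself, so the locked name does not disappear in the rename; atomicWriteFile is the sketch from the previous answer):

package fileutil

import (
    "os"
    "syscall"
)

func updateLocked(path string, modify func([]byte) []byte) error {
    lock, err := os.OpenFile(path+".lock", os.O_CREATE|os.O_RDWR, 0o644)
    if err != nil {
        return err
    }
    defer lock.Close() // closing the descriptor also releases the flock

    // Block until we hold the exclusive advisory lock; use LOCK_NB
    // instead if you would rather back out while another updater runs.
    if err := syscall.Flock(int(lock.Fd()), syscall.LOCK_EX); err != nil {
        return err
    }

    old, err := os.ReadFile(path)
    if err != nil && !os.IsNotExist(err) {
        return err
    }
    return atomicWriteFile(path, modify(old))
}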

Can a read() by one process see a partial write() by another?

If one process does a write() of size (and alignment) S (e.g. 8KB), then is it possible for another process to do a read (also of size and alignment S and the same file) that sees a mix of old and new data?
The writing process adds a checksum to each data block, and I'd like to know whether I can use a reading process to verify the checksums in the background. If the reader can see a partial write, then it will falsely indicate corruption.
What standards or documents apply here? Is there a portable way to avoid problems here, preferably without introducing lots of locking?
When a function is guaranteed to complete without any chance of another process/thread/anything seeing things in a half-finished state, it's said to be atomic: it either has or hasn't happened, there is no part way. While I can't speak to Windows, there are very few file operations in POSIX (which is what Linux/BSD/etc. attempt to stick to) that are guaranteed to be atomic. Reading and writing are not guaranteed to be atomic.
While it would be pretty unlikely for you to write 2 bytes to a file and have another process see only one of those bytes, if by dumb luck your write straddled two different pages in memory and the VM system had to do something to prepare the second page, it's possible a second process would see one byte without the other. Usually, if things are page-aligned in your file they will be page-aligned in memory too, but again, you can't rely on that.
Here's a list someone made of what is atomic in POSIX; it's pretty short, and I can't vouch for its accuracy (I can't think of why unlink isn't listed, for example).
I'd also caution you against testing what appears to work and running with it: the moment you start accessing files over a network file system (NFS on Unix, or SMB mounts on Windows), a lot of things that seemed atomic before no longer are.
If you want to have a second process calculating checksums while a first process is writing the file, you may want to open a pipe between the two and have the first process write a copy of everything down the pipe to the checksumming process. That may be faster than dealing with locking.
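A single-process Go sketch of that pipe idea (file name and block contents are made up): every byte written to the data file is also streamed to a checksumming goroutine, so the checksummer never observes a torn block.

package main

import (
    "crypto/sha256"
    "fmt"
    "io"
    "os"
)

func main() {
    f, err := os.Create("data.bin")
    if err != nil {
        panic(err)
    }
    defer f.Close()

    pr, pw := io.Pipe()
    done := make(chan []byte)
    go func() {
        h := sha256.New()
        io.Copy(h, pr) // reads until the writer closes the pipe
        done <- h.Sum(nil)
    }()

    // The writer's single write path: file and checksummer see the same bytes.
    w := io.MultiWriter(f, pw)
    w.Write([]byte("block 1 ..."))
    w.Write([]byte("block 2 ..."))
    pw.Close()

    fmt.Printf("checksum: %x\n", <-done)
}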

What's the best Erlang approach to determining a process's identity from its process id?

When I'm debugging, I'm usually looking at about 5000 processes, each of which could be one of about 100 gen_servers, fsms, etc. If I want to know WHAT an erlang process is, I can do:
process_info(pid(0,1,0), initial_call).
And get a result like:
{initial_call,{proc_lib,init_p,5}}
...which is all but useless.
More recently, I hit upon the idea (brace yourselves) of registering each process with a name that told me WHO that process represented. For example, player_1150 is the player process that represents player 1150. Yes, I end up making a couple million atoms over the course of a week-long run. (And I would love to hear comments on the drawbacks of boosting the limit to 10,000,000 atoms when my system runs with about 8GB of real memory unused, if there are any.) Doing this meant that I could, at the console of a live system, query all processes for how long their message queue was, find the top offenders, then check to see if those processes were registered and print out the atom they were registered with.
I've hit a snag with this: I'm moving processes from one node to another. Now a player process can have 3 different names: player_1158, player_1158_deprecating, player_1158_replacement. And I have to make absolutely sure I register and unregister these names with precision timing, to make sure that a process is always named, that the appropriate names always exist, AND that I don't try to register a name that some dying process already holds. There is some slop room, since this is only used for console debugging of a live system. Nonetheless, the moment I started feeling like this mechanism was affecting how I develop the system (the one that moves processes around), I felt it was time to do something else.
There are two ideas on the table for me right now. The first is an ets table that associates process ids with their descriptions:
ets:insert(proc_descriptions, {self(), {player, 1158}}).
I don't really like that one because I have to manually keep the table clean. When a player exits (or crashes), someone is responsible for making sure that its entry is removed from the ets table.
The second alternative is to use the process dictionary, storing similar information. If my exploration of a live system leads me to wonder who a process is, I can just look at its process dictionary using process_info.
I realize that neither of these solutions is functionally clean, but given that the system itself is never, EVER the consumer of these data, I'm not too worried about it. I need certain debugging tools to work quickly and easily, so the behavior described is not open for debate. Are there any convincing arguments to go one way or the other (other than the academic "don't use the _, it's evil" canned garbage)? I'd be happy to hear other suggestions and their justifications.
You should try out gproc, it's a very convenient application for keeping process metadata.
A process can be registered with several names, and you can associate arbitrary properties with a process (where the key and value can be any Erlang term). Also, gproc monitors the registered processes and unregisters them automatically if they crash.
If you're debugging gen_servers and gen_fsms while they're still running, I would implement the handle_info functions for these behaviors. When you send each process a {get_info, ReplyPid} tuple, the process in question can send back a term describing its own state, what it is, etc. That way you don't have to keep track of this information outside of the process itself.
Isac mentions there is already a built-in way to do this.
