Graceful File Reading without Locking - windows

Whiteboard Overview
[Two whiteboard diagrams: "Internal" and "Global"]
Additional Information
I should mention that each user (of the client boxes) will be working straight off the /Foo share. Due to the nature of the business, users will never need to see or work on each other's documents concurrently, so conflicts of this nature will never be a problem. Access needs to be as simple as possible for them, which probably means mapping a drive to their respective /Foo/username sub-directory.
Additionally, no one but my applications (in-house and the ones on the server) will be using the FTP directory directly.
Possible Implementations
Unfortunately, it doesn't look like I can use off the shelf tools such as WinSCP because some other logic needs to be intimately tied into the process.
I figure there are two simple ways for me to accomplish the above on the in-house side.
Method one (slow):
Walk the /Foo directory tree every N minutes.
Diff with previous tree using a combination of timestamps (can be faked by file copying tools, but not relevant in this case) and checksums.
Merge changes with off-site FTP server.
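As a rough illustration of method one, here is a minimal Python sketch of the walk-and-diff step. The function names (`file_digest`, `snapshot`, `diff`) are made up for this example, SHA-256 is an arbitrary checksum choice, and the merge with the FTP server is left out entirely.

```python
import hashlib
import os


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk_size), b""):
            h.update(block)
    return h.hexdigest()


def snapshot(root, previous=None):
    """Map each relative path under root to (size, mtime_ns, digest).

    If `previous` holds an entry with the same size and mtime, its digest
    is reused so that unchanged files are not read again.
    """
    previous = previous or {}
    result = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            rel = os.path.relpath(full, root)
            st = os.stat(full)
            old = previous.get(rel)
            if old and old[0] == st.st_size and old[1] == st.st_mtime_ns:
                digest = old[2]
            else:
                digest = file_digest(full)
            result[rel] = (st.st_size, st.st_mtime_ns, digest)
    return result


def diff(old, new):
    """Return (added, modified, deleted) as sorted lists of relative paths."""
    added = sorted(p for p in new if p not in old)
    deleted = sorted(p for p in old if p not in new)
    modified = sorted(p for p in new if p in old and new[p][2] != old[p][2])
    return added, modified, deleted
```

Reusing the previous digest when size and mtime match is what keeps the walk from re-reading every file each time; it also means a tool that rewrites a file without changing either value would go unnoticed, which matches the "not relevant in this case" caveat above.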
Method two:
Register for directory change notifications (e.g., using ReadDirectoryChangesW from the WinAPI, or FileSystemWatcher if using .NET).
Log changes.
Merge changes with off-site FTP server every N minutes.
I'll probably end up using something like the second method due to performance considerations.
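The "log changes, merge every N minutes" part of method two could look roughly like the Python sketch below. It assumes some watcher (ReadDirectoryChangesW, FileSystemWatcher, or anything else) calls `record` for each event; the `ChangeLog` class and the event names are hypothetical and not part of any of those APIs.

```python
import threading


class ChangeLog:
    """Coalesces per-path change events between merge runs."""

    def __init__(self):
        self._lock = threading.Lock()
        self._pending = {}  # relative path -> "upload" or "delete"

    def record(self, path, kind):
        """Record one event; kind is 'created', 'modified' or 'deleted'."""
        if kind not in ("created", "modified", "deleted"):
            raise ValueError(f"unknown event kind: {kind}")
        action = "delete" if kind == "deleted" else "upload"
        with self._lock:
            # Only the latest action per path matters for the next merge.
            self._pending[path] = action

    def drain(self):
        """Return the pending actions and reset the log."""
        with self._lock:
            batch, self._pending = self._pending, {}
        return batch
```

A timer would call `drain()` every N minutes and hand the batch to the uploader. Because only the last action per path is kept, a file saved ten times in one interval is uploaded once; a file created and deleted within one interval yields a "delete" for something the server never saw, so the remote delete should tolerate a missing file.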
Problem
Since this synchronization must take place during business hours, the first problem that arises is during the off-site upload stage.
While I'm transferring a file off-site, I effectively need to prevent the users from writing to the file while I'm reading from it (e.g., by opening it with CreateFile and a share mode of FILE_SHARE_READ only, or something similar). The upstream bandwidth at their office is small relative to the file sizes they'll be working with, so it's quite possible that they'll come back to the file and attempt to modify it while I'm still reading from it.
Possible Solution
The easiest solution to the above problem would be to create a copy of the file(s) in question elsewhere on the file-system and transfer those "snapshots" without disturbance.
The files (some will be binary) that these guys will be working with are relatively small, probably ≤20 MB, so copying (and therefore temporarily locking) them will be almost instant. The chances of them attempting to write to the file in the same instant that I'm copying it should be close to nil.
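A minimal Python sketch of the snapshot idea, under the assumption that comparing size and modification time before and after the copy is a good enough signal that nobody wrote to the file in between. `take_snapshot` and its parameters are names invented for this example; it takes no lock at all, it only detects a concurrent change and retries.

```python
import os
import shutil


def take_snapshot(src, staging_dir, root, retries=3):
    """Copy src into staging_dir, preserving its path relative to root.

    Returns the snapshot path, or None if the file kept changing.
    """
    rel = os.path.relpath(src, root)
    dst = os.path.join(staging_dir, rel)
    os.makedirs(os.path.dirname(dst), exist_ok=True)
    for _ in range(retries):
        before = os.stat(src)
        shutil.copy2(src, dst)  # copies data and timestamps
        after = os.stat(src)
        unchanged = (before.st_size, before.st_mtime_ns) == (
            after.st_size,
            after.st_mtime_ns,
        )
        if unchanged:
            return dst
    # The source changed during every attempt; discard the torn copy.
    os.remove(dst)
    return None
```

The upload then reads only from the staging copy, so a slow transfer never holds the user's file open.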
This solution seems kind of ugly, though, and I'm pretty sure there's a better way to handle this type of problem.
One thing that comes to mind is something like a file system filter that takes care of the replication and synchronization at the IRP level, kind of like what some A/Vs do. This is overkill for my project, however.
Questions
This is the first time that I've had to deal with this type of problem, so perhaps I'm overthinking it.
I'm interested in clean solutions that don't require going overboard with the complexity of their implementations. Perhaps I've missed something in the WinAPI that handles this problem gracefully?
I haven't decided what I'll be writing this in, but I'm comfortable with: C, C++, C#, D, and Perl.

After the discussions in the comments my proposal would be like so:
Create a partition on your data server, about 5GB for safety.
Create a Windows Service project in C# that monitors your data drive / location.
When a file has been modified, create a local copy of it on the new partition, preserving the same directory structure.
Create another service that would do the following:
Monitor bandwidth usage.
Monitor file creations on the temporary partition.
Transfer several files at a time (using threads) to your FTP server, respecting current bandwidth usage by decreasing / increasing the number of worker threads depending on network traffic.
Remove the files that have been transferred successfully from the partition.
So basically you have your drives:
C: Windows Installation
D: Share Storage
X: Temporary Partition
Then you would have following services:
LocalMirrorService - Watches D: and copies to X: with the dir structure
TransferClientService - Moves files from X: to ftp server, removes from X:
It also uses multiple threads to move several files at once and monitors bandwidth.
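The bandwidth side of TransferClientService could be approached with a simple pacing object like the Python sketch below. It caps throughput at a fixed rate rather than measuring network traffic or resizing a thread pool, so it covers only part of what is described above; `Throttle` and `throttled_copy` are hypothetical names, and the clock and sleep functions are injectable purely to make the behaviour easy to check.

```python
import time


class Throttle:
    """Paces a byte stream to at most `rate` bytes per second."""

    def __init__(self, rate, clock=time.monotonic, sleep=time.sleep):
        self.rate = rate
        self._clock = clock
        self._sleep = sleep
        self._start = clock()
        self._sent = 0

    def consume(self, nbytes):
        """Account for nbytes and sleep if we are ahead of schedule."""
        self._sent += nbytes
        earliest = self._start + self._sent / self.rate
        delay = earliest - self._clock()
        if delay > 0:
            self._sleep(delay)


def throttled_copy(src, dst, throttle, chunk_size=64 * 1024):
    """Copy file object src to dst in chunks, pacing with throttle."""
    total = 0
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            return total
        dst.write(chunk)
        throttle.consume(len(chunk))
        total += len(chunk)
```

If the upload goes through Python's `ftplib`, the same object can be driven from the `callback` argument of `FTP.storbinary`, which is invoked for each block after it is sent, e.g. `callback=lambda block: throttle.consume(len(block))`. Sharing one `Throttle` across worker threads would need a lock around `consume`.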
I would bet that this is the idea you had in mind, but it seems like a reasonable approach as long as you're really good with application development and you're able to create a solid system that handles most issues.
When a user edits a document in Microsoft Word, for instance, the file will change on the share and may be copied to X: even though the user is still working on it. Windows should have an API to check whether the file handle is still held open by the user; if it is, you can create a hook that watches for the user actually closing the document, so that all their edits are complete, and only then migrate the file to drive X:.
That said, if the user is working on the document and their PC crashes for some reason, the file handle may not be released until the document is opened at a later date, which would cause issues.
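A different and cruder way to avoid copying half-edited documents, which does not inspect file handles at all, is to wait until a file has been quiet for a while. The Python sketch below is that substitute heuristic, not the handle check described above; the function names and the 30-second default are arbitrary assumptions.

```python
import os
import time


def is_settled(path, quiet_seconds=30.0, now=None):
    """Return True if path was last modified at least quiet_seconds ago."""
    now = time.time() if now is None else now
    return now - os.stat(path).st_mtime >= quiet_seconds


def ready_to_mirror(paths, quiet_seconds=30.0, now=None):
    """Filter paths down to those that exist and have settled."""
    now = time.time() if now is None else now
    return [
        p for p in paths
        if os.path.exists(p) and is_settled(p, quiet_seconds, now)
    ]
```

This sidesteps the stale-handle problem after a crash, at the cost of delaying every copy by the quiet period and of possibly copying a document that is still open but idle.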

For anyone in a similar situation (I'm assuming the person who asked the question implemented a solution long ago), I would suggest an implementation of rsync.
rsync.net's Windows Backup Agent does what is described in method 1, and can be run as a service as well (see "Advanced Usage"). Though I'm not entirely sure if it has built-in bandwidth limiting...
Another (probably better) solution that does have bandwidth limiting is Duplicati. It also properly backs up currently-open or locked files. Uses SharpRSync, a managed rsync implementation, for its backend. Open source too, which is always a plus!

Related

What is an alternative to mandatory file locking on macOS?

I'm writing an app for macOS with the primary goal of managing arbitrary user files in a certain manner. 'Management' includes arbitrarily reading/writing/updating these files. Management is not internally a discrete event, and may consist of several idle-periods. However, it must appear so to the user.
Note: The term 'user' includes any and all user-activity (i.e. via Finder) or user-initiated processes (i.e. other apps opened by the user; though not running as root, similar to the privileges of my own application).
My app does not store these files in an owned container (e.g. sandboxed app container), but rather runs continuously in the background keeping track of these files, monitoring for changes and managing them as necessary.
The duration of this 'management' may vary from a few milliseconds to a few hours.
I'm trying to write a construct (i.e. class / struct) to encapsulate references to these 'hot' files (i.e. files under management). During management, the user must not be capable of reading/writing-to/deleting these files, unless the app is explicitly quit (through normal quit / forced quit, regardless).
Is there any way I can "lock" a file, as to prevent user reading/writing/updating and/or even modification of permissions?
Here are two possible solutions:
Copy the file to an undisclosed location, manage it, and overwrite the old file. This is undesirable for multiple reasons: copying is expensive and impractical for large files, user is not explicitly aware of management, does nothing to prevent other processes from seeing the file as "free".
Modify file permissions. I'm not sure if this is even possible (please let me know in detail if it is!), but if my process could modify file permissions as to prevent user-access, it would solve the essence of my problem. However, if anything were to prevent my app from 'unlocking' these files (be it through a crash/force-quit etc.), it would leave the files inaccessible to the user.
A third, though not really a solution, would be to simply not attempt to 'lock' any of these files. I could just monitor the files continuously, and alert the user of any failure. I really don't want to do this, hence the question.
The second solution seems quite promising. I can't, however, find any high-level APIs that let me interface with the file ACLs (access-control-lists). I'm not even sure whether I'm correct in my understanding of how it would work, so feel free to build upon that thought and turn it into a concrete answer.
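To make the second idea concrete, here is a small Python sketch using plain POSIX mode bits rather than ACLs. It opens the file first and then clears the permission bits, relying on the fact that permissions are checked when a file is opened, so the already-open handle keeps working. `held_file` is a hypothetical name, and this is only an illustration of the mechanism: a user who owns the file can still chmod it back, and a crash between the two `chmod` calls leaves the file inaccessible, exactly as feared above.

```python
import os
import stat
from contextlib import contextmanager


@contextmanager
def held_file(path):
    """Open path, strip its permission bits while in use, then restore them."""
    f = open(path, "r+b")
    original = stat.S_IMODE(os.fstat(f.fileno()).st_mode)
    try:
        os.chmod(path, 0)  # new open() calls by non-root users now fail
        try:
            yield f
        finally:
            os.chmod(path, original)
    finally:
        f.close()
```

Persisting `original` somewhere before the first `chmod` (so a relaunch can repair leftovers) would be one way to soften the crash case.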
I'm also curious as to how Finder seems to know whether files are being used by other processes. Again, I think I know but I'm not entirely sure, so better ask it here with the main question.

Why is project and file saving/management so awkward in programming as compared to other digital media?

TLDR: What is the reason for the complex file management systems in place, such as Github repositories, when working in Visual Studio?
This has been bothering me for a while. I've finished a diploma course in Digital Media, and have started another course in programming. One thing that stuck out immediately after coming from 3D art is how incredibly awkward and obtuse basic file management is when working with Visual Studio. Presumably the same issues arise with other development environments, as if they didn't I can't imagine why anybody would ever use VS.
For example, let's say I want to work on a project in 3ds Max. It's stored on a shared network drive, so I don't want to risk two people accessing it at the same time and saving over each other's work. I simply grab the folder or file that I want to use, copy and paste it with a new name, and then I'm good to go.
Saving things with a new name is easy, just save as, rename it. I can work from network drives, local drives, portable drives. The file can come from anywhere and be saved anywhere. Everything is fast, painless, and clear.
If I was to try and do the same thing in VS, for starters, it wouldn't let me build the program while saved to the network, so I'd have to copy it over to a local drive. Presumably this is to prevent the "multiple people accessing, saving over each other" issues that are easily avoided by just renaming the thing.
If I wanted to iteration save, that is, to frequently save the project with a version number name to allow easy rollbacks and troubleshooting, I'm not even sure how I'd do it. Renaming projects/solutions has proven so hard to do that I've had to delete projects and make them again with a new name, rather than try and figure out how to do it properly.
There are all sorts of complex file management systems that VS seems to require for any large project work, all of which would be completely unnecessary if you could just copy, paste, rename and save-as with any real ease.
I'm obviously rather new to this, and I'm certain that there is an important reason why it's so awkward to manage files, I just don't know what that reason is. I feel like I'd have a far better understanding of how all these file management systems actually work if I knew why they existed in the first place. At the moment, just trying to be able to work from a network drive is taking up hours when it would be a non-issue in every other digital media field I've worked with.

Possible to selective sync dropbox or other cloud storage from multi-platform command line?

Going to be working with a medium sized remote group on a large (but independent) project that will be generating many GB to TB of data.
To keep users from having to store 500GB of data on their personal machines, and to keep everyone in sync, we need a command-line/python utility to control selective syncing of dependencies on multiple operating systems: or at least osx and linux.
For example, someone who needs to work on the folder:
startrek/startrekiii
May require the folders:
startrek/nimoy/common
startrek/nimoy/[user]
startrek/shatner/common
startrek/shatner/[user]
but not:
startrek/startrekii, startrek/nimoy/[some_other_user], etc
From their command line (or a UI) they would run:
sync startrekiii
And they'd also receive startrek/nimoy/common, etc
Likewise, we'll have an unsync command: as long as those dependent folders are not in use by another sync, they will be unsynced and removed from the user's HD.
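Independent of which storage backend ends up doing the transfer, the sync/unsync bookkeeping described above is essentially reference counting. The Python sketch below shows that logic only; the `DEPENDENCIES` manifest and the `SyncState` class are hypothetical, the per-user folders are left out for brevity, and the returned lists are what a real tool would then download or delete.

```python
from collections import Counter

# Hypothetical manifest: project folder -> folders it depends on.
DEPENDENCIES = {
    "startrek/startrekiii": [
        "startrek/nimoy/common",
        "startrek/shatner/common",
    ],
    "startrek/startrekii": ["startrek/shatner/common"],
}


class SyncState:
    """Tracks which folders are needed by the currently synced projects."""

    def __init__(self, dependencies):
        self._deps = dependencies
        self._refs = Counter()  # folder -> number of syncs that need it
        self._active = set()

    def sync(self, project):
        """Mark project as synced; return folders that must be downloaded."""
        if project in self._active:
            return []
        self._active.add(project)
        to_fetch = []
        for folder in [project, *self._deps.get(project, [])]:
            self._refs[folder] += 1
            if self._refs[folder] == 1:
                to_fetch.append(folder)
        return to_fetch

    def unsync(self, project):
        """Unmark project; return folders no other sync still needs."""
        if project not in self._active:
            return []
        self._active.remove(project)
        to_remove = []
        for folder in [project, *self._deps.get(project, [])]:
            self._refs[folder] -= 1
            if self._refs[folder] == 0:
                del self._refs[folder]
                to_remove.append(folder)
        return to_remove
```

With this manifest, syncing startrekiii and then startrekii fetches startrek/shatner/common only once, and unsyncing startrekiii leaves it in place because startrekii still needs it.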
Of cloud sync/storage solutions, Dropbox seems to offer the most granular control over this, allowing you to sync specific folders and subfolders. However, from everything I can find, this granular control is strictly limited to their UI.
We're completely open to alternative solutions if you have them, we just need something as easily deployable as possible and don't have the budget for Aspera or something to that effect.
Two other important notes:
Because of one very central part of our pipeline which pulls files from those dependent folders (over which we have limited API control), the paths need to be consistent on their respective platform. So ~/Dropbox/startrek/nimoy can never be ~/Dropbox/startrek/startrekiii/nimoy
Many of the people using this will be artists and otherwise non-technical people, whose experience with csh or bash extends only to simple things like changing directories and moving files around.
Has anyone found a way to hack into Dropbox's selective sync, and/or know of a better alternative?

Possible to bypass caching and download/open file to RAM?

Preamble:
Recently I came across an interesting story about people who seem to be sending emails with documents that contain child pornography. This is an example (this one is a JPEG, but I'm hearing about it being done with PDFs, which generally can't be previewed):
https://www.youtube.com/watch?v=zislzpkpvZc
This can pose a real threat to people in investigative journalism, because even if you delete the file after it's been opened in Temp, the file may still be recovered by forensics software. Even just having opened the file already puts you in the realm of committing a felony.
This can also pose a real problem to security consultants for a group. Let's say person A emails criminal files, and person B is suspicious of the email and forwards it to the security manager for their program. In order to analyze the file, the consultant may have to download it to a hard drive, even if they load it in a VM or sandbox. Even if they figure out what it is, they are still in this legal landmine area where bad timing could land them in jail for 20 years. Thinking about this, if the file were only ever to enter RAM, then upon a power down all traces of this opened file would disappear.
Question: I have an OK understanding about how computer architecture works, but this problem presented earlier made me start wondering. Is there a limitation, at the OS, hardware, or firmware level, that prevents a program from opening a stream of downloading information directly to the RAM? If not let's say you try to open a pdf, is it possible for the file it's opening to instead be passed to the program as a stream of downloading bytes that could then rewrite/otherwise make retention of the final file on the hdd impossible?
Unfortunately I can only give a Linux/Unix based answer to this, but hopefully it is helpful and extends to Windows too.
There are many ways to pass data between programs without writing to the hard disk, it is usually more of a question of whether the software applications support it (web browser and pdf reader for your example). Streams can be passed via pipes and sockets, but the problem here is that it may be more convenient for the receiving program to seek back in the stream at certain points rather than store all the data in memory. This may be a more efficient use of resources too. Hence many programs do not do this. Indeed a pipe can be made to look like a file, but if the application tries to seek backward, it will cause an error.
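The seek limitation is easy to see in a few lines of Python. In this sketch a pipe stands in for the download stream, and `read_all_into_memory` (a name made up here) drains it into an in-memory buffer that the consumer can seek freely. Note that holding bytes in process memory does not by itself guarantee they never reach disk, since the OS may swap memory out.

```python
import io
import os


def read_all_into_memory(stream, chunk_size=64 * 1024):
    """Drain a non-seekable stream into a seekable in-memory buffer."""
    buffer = io.BytesIO()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        buffer.write(chunk)
    buffer.seek(0)
    return buffer


# Demo: a pipe stands in for the download stream.
read_fd, write_fd = os.pipe()
with os.fdopen(write_fd, "wb") as writer:
    writer.write(b"%PDF-1.4 example bytes")
with os.fdopen(read_fd, "rb") as reader:
    pipe_is_seekable = reader.seekable()  # False: pipes cannot seek
    document = read_all_into_memory(reader)

document.seek(5)         # seeking works on the in-memory copy
tail = document.read(3)  # b"1.4"
```

The trade-off is the one mentioned above: the whole document now occupies memory, which is why many applications prefer a temporary file.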
If there was more demand for streaming data to applications, it would probably be seen in more cases though as there are no major barriers. Currently it is more common just to store pdfs in a temporary file if they are viewed in a plugin and not downloaded. Video can be different though.
An alternative is to use a RAM drive, it is common for a Linux system to have at least one set up by default (tmpfs), although it seems for Windows that you have to install additional software. Using one of these removes the above limitations and it is fairly easy to set a web browser to use it for temporary files.

How does the DropBox Mac client work?

I've been looking at the DropBox Mac client and I'm currently researching implementing a similar interface for a different service.
How exactly do they interface with Finder like this? I highly doubt these objects represented in the folder are actual documents downloaded on every load. They must be downloaded dynamically as they are needed. So how can you display these items in Finder without having actual file system objects?
Does anyone know how this is achieved in Mac OS X?
Or any pointers to Apple APIs or other open source projects that have a similar integration with Finder?
Dropbox is not powered by either MacFUSE or WebDAV, although those might be perfectly fine solutions for what you're trying to accomplish.
If it were powered by those things, it wouldn't work when you weren't connected, as both of those rely on the server to store the actual information and Dropbox does not. If I quit Dropbox (done via the menu item) and disconnect from the net, I can still use the files. That's because the files are actually stored here on my hard drive.
It also means that the files don't need to be "downloaded on every load," since they are actually stored on my machine here. Instead, only the deltas are sent over the wire, and the Dropbox application (running in the background) patches the files appropriately. Going the other way, the Dropbox application watches for the files in the Dropbox folder, and when they change, it sends the appropriate deltas to the server, which propagates them to any other clients.
This setup has some decided advantages: it works when offline, it is an order of magnitude faster, and it is transparent to other apps, since they just see files on the disk. However, I have no idea how it deals with merge conflicts (which could easily arise with one or more clients offline), which are not an issue if the server is the only copy and every edit changes that central copy.
Where Dropbox really shines is that they have an additional trick that badges the items in the Dropbox folder with their current sync status. But that's not what you're asking about here.
As far as the question at hand, you should definitely look into MacFUSE and WebDAV, which might be perfect solutions to your problem. But the Dropbox way of doing things, with a background application changing actual files on the disk, might be a better tradeoff.
Dropbox is likely using FSEvents to watch for changes to the file system. It's a great API and can even bundle up changes that happened while your app was not running. It's the same API that Spotlight uses. The menubar app likely does the actual observing itself (since restarting it can fix uploads being hung, for instance).
There's no way they're using MacFUSE, as that would require installing the MacFUSE kernel extension to make Dropbox work, and since I definitely didn't install it, I highly doubt they're using it.
Two suggestions:
MacFUSE
WebDAV
The former will allow you to write an app that appears as a filesystem and does all the right things; the latter will allow you to move everything server-side and let the user just mount your service as a file share.
Dropbox on the client is written in Python.
The client seems to use a sqlite3 database to index files.
I suppose Dropbox splits a file into chunks to reduce bandwidth usage.
By the way, if two people have the same file, even if they do not know each other, the server can optimize and avoid transferring the file more than once, only copying it on the server side.
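The chunking and deduplication guessed at above can be illustrated with content-addressed blocks. This Python sketch is only a model of the general technique, not Dropbox's actual protocol: the 4 MiB block size, the use of SHA-256, and the dict standing in for the server are all assumptions.

```python
import hashlib

CHUNK_SIZE = 4 * 1024 * 1024  # assumed block size, not Dropbox's actual value


def chunk_digests(data, chunk_size=CHUNK_SIZE):
    """Split bytes into fixed-size chunks; return [(digest, chunk), ...]."""
    pairs = []
    for i in range(0, len(data), chunk_size):
        chunk = data[i:i + chunk_size]
        pairs.append((hashlib.sha256(chunk).hexdigest(), chunk))
    return pairs


def upload(data, server_store, chunk_size=CHUNK_SIZE):
    """Store only chunks the server lacks; return (manifest, bytes_sent)."""
    manifest, sent = [], 0
    for digest, chunk in chunk_digests(data, chunk_size):
        if digest not in server_store:
            server_store[digest] = chunk
            sent += len(chunk)
        manifest.append(digest)
    return manifest, sent


def download(manifest, server_store):
    """Reassemble a file from its list of chunk digests."""
    return b"".join(server_store[d] for d in manifest)
```

A file is then just an ordered list of digests. Uploading a file someone else already stored sends no chunk data at all, and editing one block of a large file sends only that block.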
To me it feels like a heavily modified revision control system. It has all the features: updates files based on deltas, options to recover or restore old revisions of files. It almost feels like they are using git (GitFS?), or some filesystem they designed.
You could also give File Conveyor a try. It's a Python daemon capable of instantly detecting FS changes (on Linux through inotify, on OS X through FSEvents), processing the files and syncing them to one or more destinations.
Supported protocols: FTP, SFTP, Amazon S3 (CloudFront is also supported), Rackspace Cloud Files. Can easily be extended. Uses django-storages.
"Processing files" means, e.g., optimizing images or transcoding videos (this was originally conceived for sending static assets to a CDN in the context of speeding up websites).

Resources