I'm contemplating using launchd to watch a file structure on my computer, using WatchPaths to tell whether one of these directories changes, though I'll need to create a new property list file for each directory. My point of contention is how scalable this is: will I notice a drop in my computer's performance if I'm watching 10, 100, 1000 or more directories, or are the resources consumed by watching this many paths concentrated in memory rather than processing?
These jobs are going to be used to handle when a file is deleted or renamed, and will update a manifest in the root of this file structure so my application will know what files are where without walking the tree; I'm trying to make the application a little more responsive and aware. Should these jobs be daemons or agents? I've presumed agents, because I don't see how this structure could be modified without a user being logged in, although these jobs would have no need to create a GUI.
Will launchd be scalable enough to handle an arbitrarily sized file structure in this manner?
Are there other options? (Portability would be nice.)
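For what it's worth, the manifest bookkeeping itself is cheap regardless of which watcher you end up with. A minimal sketch in Python (the manifest filename and JSON format are my own assumptions, not anything launchd-specific):

```python
import json
import os

MANIFEST_NAME = ".manifest.json"  # hypothetical name, stored at the tree root

def rebuild_manifest(root):
    """Walk the tree once and write a manifest of relative path -> mtime."""
    entries = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name == MANIFEST_NAME:
                continue
            full = os.path.join(dirpath, name)
            rel = os.path.relpath(full, root)
            entries[rel] = os.path.getmtime(full)
    with open(os.path.join(root, MANIFEST_NAME), "w") as f:
        json.dump(entries, f)
    return entries

def apply_event(root, event, old_path, new_path=None):
    """Update the manifest in place for a single delete or rename event,
    so the tree never has to be re-walked."""
    manifest_path = os.path.join(root, MANIFEST_NAME)
    with open(manifest_path) as f:
        entries = json.load(f)
    if event == "deleted":
        entries.pop(old_path, None)
    elif event == "renamed":
        entries[new_path] = entries.pop(old_path, None)
    with open(manifest_path, "w") as f:
        json.dump(entries, f)
    return entries
```

The full walk happens once at startup; after that, each event is a dictionary update and a small file write, so the per-event cost stays flat no matter how deep the tree is.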
Detecting whether files have been renamed, created, or otherwise modified is better accomplished using FSEvents or kqueues.
I'm writing an app for macOS with the primary goal of managing arbitrary user files in a certain manner. 'Management' includes arbitrarily reading/writing/updating these files. Management is not internally a discrete event, and may consist of several idle-periods. However, it must appear so to the user.
Note: The term 'user' includes any and all user-activity (i.e. via Finder) or user-initiated processes (i.e. other apps opened by the user; though not running as root, similar to the privileges of my own application).
My app does not store these files in an owned container (e.g. sandboxed app container), but rather runs continuously in the background keeping track of these files, monitoring for changes and managing them as necessary.
The duration of this 'management' may vary from a few milliseconds to a few hours.
I'm trying to write a construct (i.e. class / struct) to encapsulate references to these 'hot' files (i.e. files under management). During management, the user must not be capable of reading/writing-to/deleting these files, unless the app is explicitly quit (through normal quit / forced quit, regardless).
Is there any way I can "lock" a file, as to prevent user reading/writing/updating and/or even modification of permissions?
Here are two possible solutions:
Copy the file to an undisclosed location, manage it, and overwrite the old file. This is undesirable for multiple reasons: copying is expensive and impractical for large files, user is not explicitly aware of management, does nothing to prevent other processes from seeing the file as "free".
Modify file permissions. I'm not sure if this is even possible (please let me know in detail if it is!), but if my process could modify file permissions so as to prevent user access, it would solve the essence of my problem. However, if anything were to prevent my app from 'unlocking' these files (be it a crash, force-quit, etc.), it would leave the files inaccessible to the user.
A third, though not really a solution, would be to simply not attempt to 'lock' any of these files. I could just monitor the files continuously, and alert the user of any failure. I really don't want to do this, hence the question.
The second solution seems quite promising. I can't, however, find any high-level APIs that let me interface with the file ACLs (access-control-lists). I'm not even sure whether I'm correct in my understanding of how it would work, so feel free to build upon that thought and turn it into a concrete answer.
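On solution two: before reaching for ACLs, note that the POSIX permission bits, which macOS also honors, are a blunter tool but trivial to drive from any language. A sketch in Python (the same calls exist in C as chmod(2)/stat(2)); it inherits exactly the crash problem described above, so you'd want to persist the saved mode somewhere recoverable before locking:

```python
import os
import stat

def lock_file(path):
    """Strip all permission bits from a file we own, so ordinary user
    processes can no longer read, write, or delete its contents.
    Returns the original mode so it can be restored later."""
    original_mode = stat.S_IMODE(os.stat(path).st_mode)
    os.chmod(path, 0)  # no read/write/execute for anyone (root excepted)
    return original_mode

def unlock_file(path, original_mode):
    """Restore the permission bits captured by lock_file()."""
    os.chmod(path, original_mode)
```

This only works for files your process's user owns, and root (or the owner) can always chmod the file back, so it is a deterrent against casual access rather than a hard guarantee. For real ACLs on macOS you would drop down to the acl(3) family, which has no high-level Cocoa wrapper that I know of.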
I'm also curious as to how Finder seems to know whether files are being used by other processes. Again, I think I know but I'm not entirely sure, so better ask it here with the main question.
I'm going to be working with a medium-sized remote group on a large (but independent) project that will be generating many GB to TB of data.
To keep users from having to store 500 GB of data on their personal machines, and to keep everyone in sync, we need a command-line/Python utility to control selective syncing of dependencies on multiple operating systems, or at least OS X and Linux.
So, for example, someone who needs to work on the folder:
startrek/startrekiii
May require the folders:
startrek/nimoy/common
startrek/nimoy/[user]
startrek/shatner/common
startrek/shatner/[user]
but not:
startrek/startrekii, startrek/nimoy/[some_other_user], etc
From their command line (or a UI) they would run:
sync startrekiii
And they'd also receive startrek/nimoy/common, etc
Likewise, we'll have an unsync command that removes those folders from the user's HD, as long as the dependent folders are not in use by another sync.
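The sync/unsync bookkeeping is essentially reference counting over a dependency map. A rough sketch in Python (the dependency table and class names are hypothetical):

```python
from collections import Counter

class SyncSet:
    """Tracks which folders are on disk and why, so unsync only removes
    folders that no remaining synced target still depends on."""

    def __init__(self, dependencies):
        # dependencies: target folder -> list of folders it also needs
        self.dependencies = dependencies
        self.refcounts = Counter()

    def sync(self, target):
        """Returns the folders that must actually be fetched for this target."""
        needed = [target] + self.dependencies.get(target, [])
        to_fetch = [f for f in needed if self.refcounts[f] == 0]
        for f in needed:
            self.refcounts[f] += 1
        return to_fetch

    def unsync(self, target):
        """Returns the folders now safe to delete from the user's disk."""
        needed = [target] + self.dependencies.get(target, [])
        to_remove = []
        for f in needed:
            self.refcounts[f] -= 1
            if self.refcounts[f] == 0:
                to_remove.append(f)
        return to_remove
```

The actual fetching and deleting would be whatever transport you settle on (rsync, a cloud API, etc.); this layer just decides what moves.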
Of the cloud sync/storage solutions, Dropbox seems to offer the most granular control over this, allowing you to sync specific folders and subfolders; however, from everything I can find, this granular control is strictly limited to their UI.
We're completely open to alternative solutions if you have them, we just need something as easily deployable as possible and don't have the budget for Aspera or something to that effect.
Two other important notes:
Because of one very central part of our pipeline which pulls files from those dependent folders (over which we have limited API control), the paths need to be consistent on their respective platform. So ~/Dropbox/startrek/nimoy can never be ~/Dropbox/startrek/startrekiii/nimoy
Many of the people using this will be artists and other non-technical people, the extent of whose experience using csh or bash is simple things like changing directories and moving files around.
Has anyone found a way to hack into Dropbox's selective sync, and/or know of a better alternative?
OK, I wrote an application that uses the Adobe ActiveX control to display PDF files.
The Adobe ActiveX control loads files only from the file system, so I need to feed a file path to this control.
The problem is that I don't want to store PDF files on the file system, even temporarily! I want to keep my PDF files only in memory, and I want to use the Adobe ActiveX control.
So I need:
1) A way to fake a file on the file system, so this control would "think" there is a file but would actually load it from memory
2) A way to create a file on the file system that would be "visible" to only one application, so my PDF control could load it and other users wouldn't even see it
3) Something else
PS: I'm not asking you to "finish my homework"; I'm just asking whether there is a way to do this.
You can almost do it (means: no you can't, but you can do something that comes close).
Creating a file with FILE_ATTRIBUTE_TEMPORARY does in principle create a file, temporarily. However, as long as there is sufficient buffer cache (which is normally the case unless your file is tens to hundreds of megabytes), the system will not write to disk. This is not just something that happens accidentally, but the actual specified behaviour of this flag.
Further, specifying 0 as the share mode and FILE_FLAG_DELETE_ON_CLOSE will prevent any other process from opening your file for as long as you keep it open, even if someone knows it's there, and the file will "disappear" when you close it. Even if your application crashes, the OS will clean up behind you (important if DRM is the reason). If you're in super-paranoia mode and worried about the system bluescreening while your file exists, you can additionally schedule a pending move, which will, in case of a system crash, remove the file during boot.
Lastly, given NTFS, you can create an alternate stream with a random, preferably unique name (e.g. the SHA-1 of the document, or a UUID) on any file or even directory. Alternate streams on directories are ... a kind of nasty hack, but entirely legal, and they work just fine and don't appear in Explorer. This will not really make your file invisible, but nearly so (in almost every practical aspect, anyway). If you're a good citizen, you will want to use the system temp folder for such a thing, not the program folder or some other place that you shouldn't write to.
Creating an alternate stream is dead easy too, just use the normal file or directory name and append a colon (:) and the name of the stream you want. No extra API needed.
Other than that, it gets kind of hard. You can of course always create something like a ramdisk (would be tough to hide it, though), or try to use one of the stream-from-memory functions to fool an application into reading from a memory buffer on the allegation of a file... but that's not trivial stuff.
If something needs to be on a file system to pass to another application, you cannot hide it or limit it to certain processes. Anything your app can see, anything else at the same privilege level can also see and access. You may be able to lock it, but how depends on what you want to protect against.
Remember that the user's PC is theirs, not yours so they have full access to everything on it.
You can create a virtual disk and limit access to it to only specific application. Do to this you would have to write a file system driver or a filesystem filter driver. Both work in kernel mode and are tricky to write and maintain. Our company offers components that let you avoid writing drivers yourself and write business logic in user-mode (we provide drivers in those products).
Your most obvious option is to get rid of Adobe Reader control and use some third-party component that displays PDFs and can load them from memory.
But in general, a smart hacker would be able to capture your data unless you (a) use a non-standard data format, and/or (b) stream the data from the server dynamically rather than keeping the complete data on the computer. These are not bulletproof solutions either, but they make a hacker's work much harder.
Whiteboard Overview
The images below are 1000 x 750 px, ~130 kB JPEGs hosted on ImageShack.
Internal
Global
Additional Information
I should mention that each user (of the client boxes) will be working straight off the /Foo share. Due to the nature of the business, users will never need to see or work on each other's documents concurrently, so conflicts of this nature will never be a problem. Access needs to be as simple as possible for them, which probably means mapping a drive to their respective /Foo/username sub-directory.
Additionally, no one but my applications (in-house and the ones on the server) will be using the FTP directory directly.
Possible Implementations
Unfortunately, it doesn't look like I can use off the shelf tools such as WinSCP because some other logic needs to be intimately tied into the process.
I figure there are two simple ways for me to accomplishing the above on the in-house side.
Method one (slow):
Walk the /Foo directory tree every N minutes.
Diff with the previous tree using a combination of timestamps (which can be faked by file-copying tools, but that's not relevant in this case) and checksumming.
Merge changes with off-site FTP server.
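Method one is easy to prototype. A sketch of the snapshot-and-diff steps in Python, purely for illustration (a production version would hash in blocks rather than reading whole files into memory):

```python
import hashlib
import os

def snapshot(root):
    """Map each file's relative path to (mtime, SHA-256) for later diffing."""
    state = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            rel = os.path.relpath(full, root)
            with open(full, "rb") as f:
                digest = hashlib.sha256(f.read()).hexdigest()
            state[rel] = (os.path.getmtime(full), digest)
    return state

def diff(previous, current):
    """Classify changes between two snapshots of the tree."""
    added = [p for p in current if p not in previous]
    removed = [p for p in previous if p not in current]
    # Checksums catch modifications even when timestamps were faked by copy tools.
    modified = [p for p in current
                if p in previous and previous[p][1] != current[p][1]]
    return added, removed, modified
```

In practice you would compare mtimes first and only re-hash files whose timestamps moved, which is what makes this approach tolerable every N minutes.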
Method two:
Register for directory change notifications (e.g., using ReadDirectoryChangesW from the WinAPI, or FileSystemWatcher if using .NET).
Log changes.
Merge changes with off-site FTP server every N minutes.
I'll probably end up using something like the second method due to performance considerations.
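Whichever method you pick, the logged changes are worth coalescing before each merge pass, since a single save in many editors produces a burst of notifications for the same path. A sketch (the action names are hypothetical):

```python
def coalesce(events):
    """Reduce a time-ordered list of (action, path) events to the net
    action per path, so each merge pass does the minimum FTP work.
    Actions are 'created', 'modified', 'deleted'."""
    net = {}
    for action, path in events:
        previous = net.get(path)
        if action == "deleted":
            if previous == "created":
                # Created and deleted between merges: nothing to upload.
                del net[path]
            else:
                net[path] = "deleted"
        elif action == "created":
            net[path] = "created"
        else:  # modified
            # A modify right after a create is still just one upload.
            net[path] = previous if previous == "created" else "modified"
    return net
```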
Problem
Since this synchronization must take place during business hours, the first problem that arises is during the off-site upload stage.
While I'm transferring a file off-site, I effectively need to prevent the users from writing to the file (e.g., use CreateFile with FILE_SHARE_READ or something) while I'm reading from it. The internet upstream speeds at their office are nowhere near symmetrical to the file sizes they'll be working with, so it's quite possible that they'll come back to the file and attempt to modify it while I'm still reading from it.
Possible Solution
The easiest solution to the above problem would be to create a copy of the file(s) in question elsewhere on the file-system and transfer those "snapshots" without disturbance.
The files (some will be binary) that these guys will be working with are relatively small, probably ≤20 MB, so copying (and therefore temporarily locking) them will be almost instant. The chances of them attempting to write to the file in the same instant that I'm copying it should be close to nil.
This solution seems kind of ugly, though, and I'm pretty sure there's a better way to handle this type of problem.
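For reference, the snapshot approach can be expressed in a few lines; a sketch in Python rather than the WinAPI (the staging location is an assumption; `shutil.copy2` preserves timestamps):

```python
import os
import shutil
import tempfile

def snapshot_for_upload(path):
    """Copy the file into a private staging directory so the FTP upload
    reads a stable snapshot while users keep editing the original.
    The caller deletes the snapshot once the transfer succeeds."""
    staging_dir = tempfile.mkdtemp(prefix="sync-staging-")  # hypothetical location
    snapshot_path = os.path.join(staging_dir, os.path.basename(path))
    shutil.copy2(path, snapshot_path)  # brief read lock only; users can write again
    return snapshot_path
```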
One thing that comes to mind is something like a file system filter that takes care of the replication and synchronization at the IRP level, kind of like what some A/Vs do. This is overkill for my project, however.
Questions
This is the first time that I've had to deal with this type of problem, so perhaps I'm thinking too much into it.
I'm interested in clean solutions that don't require going overboard with the complexity of their implementations. Perhaps I've missed something in the WinAPI that handles this problem gracefully?
I haven't decided what I'll be writing this in, but I'm comfortable with: C, C++, C#, D, and Perl.
After the discussions in the comments my proposal would be like so:
Create a partition on your data server, about 5GB for safety.
Create a Windows Service project in C# that would monitor your data drive / location.
When a file is modified, create a local copy of the file, preserving the same directory structure, and place it on the new partition.
Create another service that would do the following:
Monitor Bandwidth Usages
Monitor file creations on the temporary partition.
Transfer several files at a time (use threading) to your FTP server, abiding by the bandwidth usage at the current time, decreasing / increasing the number of worker threads depending on network traffic.
Remove the files from the partition that have successfully transferred.
So basically you have your drives:
C: Windows Installation
D: Share Storage
X: Temporary Partition
Then you would have following services:
LocalMirrorService - Watches D: and copies to X: with the dir structure
TransferClientService - Moves files from X: to ftp server, removes from X:
Also uses multiple threads to move several files at once, and monitors bandwidth.
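The "abiding by bandwidth" part can be a token bucket that every worker thread draws from before sending a chunk. A sketch in Python rather than C# (the injectable clock exists purely to make the throttling testable; chunk sizes should stay at or below one second's budget):

```python
import threading
import time

class TokenBucket:
    """Thread-safe token bucket: tokens accrue at rate_bytes_per_sec, and
    workers block in consume() until enough tokens have accumulated.
    Capacity is capped at one second of budget, so callers should send
    chunks no larger than the rate."""

    def __init__(self, rate_bytes_per_sec, clock=time.monotonic, sleep=time.sleep):
        self.rate = rate_bytes_per_sec
        self.clock = clock
        self.sleep = sleep
        self.tokens = 0.0
        self.last = clock()
        self.lock = threading.Lock()

    def consume(self, nbytes):
        while True:
            with self.lock:
                now = self.clock()
                self.tokens = min(self.rate,
                                  self.tokens + (now - self.last) * self.rate)
                self.last = now
                if self.tokens >= nbytes:
                    self.tokens -= nbytes
                    return
                wait = (nbytes - self.tokens) / self.rate
            self.sleep(wait)
```

Each upload worker calls `bucket.consume(len(chunk))` before every socket write; lowering `rate` at runtime is how the service would react to daytime network traffic.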
I would bet that this is the idea you had in mind, but it seems like a reasonable approach as long as you're really good with your application development and you're able to create a solid system that handles most issues.
When a user edits a document in Microsoft Word, for instance, the file will change on the share and may be copied to X: even though the user is still working on it. Within Windows there is an API to check whether a file handle is still open by the user; if it is, you can create a hook to watch for when the user actually closes the document, so that all their edits are complete, and then migrate it to drive X:.
That being said, if the user is working on the document and their PC crashes for some reason, the file handle may not be released until the document is opened again at a later date, thus causing issues.
For anyone in a similar situation (I'm assuming the person who asked the question implemented a solution long ago), I would suggest an implementation of rsync.
rsync.net's Windows Backup Agent does what is described in method 1, and can be run as a service as well (see "Advanced Usage"). Though I'm not entirely sure if it has built-in bandwidth limiting...
Another (probably better) solution that does have bandwidth limiting is Duplicati. It also properly backs up currently-open or locked files. Uses SharpRSync, a managed rsync implementation, for its backend. Open source too, which is always a plus!
I've been looking at the DropBox Mac client and I'm currently researching implementing a similar interface for a different service.
How exactly do they interface with Finder like this? I highly doubt the objects represented in the folder are actual documents downloaded on every load; they must be downloaded dynamically as they are needed. So how can you display these items in Finder without having actual file system objects?
Does anyone know how this is achieved in Mac OS X?
Or any pointers to Apple APIs or other open-source projects that have a similar integration with Finder?
Dropbox is not powered by either MacFUSE or WebDAV, although those might be perfectly fine solutions for what you're trying to accomplish.
If it were powered by those things, it wouldn't work when you weren't connected, as both of those rely on the server to store the actual information and Dropbox does not. If I quit Dropbox (done via the menu item) and disconnect from the net, I can still use the files. That's because the files are actually stored here on my hard drive.
It also means that the files don't need to be "downloaded on every load," since they are actually stored on my machine here. Instead, only the deltas are sent over the wire, and the Dropbox application (running in the background) patches the files appropriately. Going the other way, the Dropbox application watches for the files in the Dropbox folder, and when they change, it sends the appropriate deltas to the server, which propagates them to any other clients.
This setup has some decided advantages: it works when offline, it is an order of magnitude faster, and it is transparent to other apps, since they just see files on the disk. However, I have no idea how it deals with merge conflicts (which could easily arise with one or more clients offline), which are not an issue if the server is the only copy and every edit changes that central copy.
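The "only the deltas are sent" behavior can be approximated with fixed-size block hashing; real rsync-style tools use rolling checksums so that insertions don't shift every subsequent block, but this simplified sketch shows the idea:

```python
import hashlib

BLOCK_SIZE = 4  # tiny for illustration; real tools use KB-sized blocks

def block_hashes(data):
    return [hashlib.sha256(data[i:i + BLOCK_SIZE]).hexdigest()
            for i in range(0, len(data), BLOCK_SIZE)]

def delta(old, new):
    """Return the (block_index, bytes) pairs the receiver is missing."""
    old_hashes = block_hashes(old)
    changes = []
    for i, h in enumerate(block_hashes(new)):
        if i >= len(old_hashes) or old_hashes[i] != h:
            changes.append((i, new[i * BLOCK_SIZE:(i + 1) * BLOCK_SIZE]))
    return changes

def apply_delta(old, new_len, changes):
    """Rebuild the new file from the old copy plus the transferred blocks."""
    blocks = [old[i:i + BLOCK_SIZE] for i in range(0, len(old), BLOCK_SIZE)]
    blocks = blocks[: (new_len + BLOCK_SIZE - 1) // BLOCK_SIZE]
    for i, data in changes:
        if i < len(blocks):
            blocks[i] = data
        else:
            blocks.append(data)
    return b"".join(blocks)[:new_len]
```

Only the changed blocks cross the wire; the background application patches its local copy, which is why editing a few bytes of a large file syncs almost instantly.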
Where Dropbox really shines is that they have an additional trick that badges the items in the Dropbox folder with their current sync status. But that's not what you're asking about here.
As far as the question at hand, you should definitely look into MacFUSE and WebDAV, which might be perfect solutions to your problem. But the Dropbox way of doing things, with a background application changing actual files on the disk, might be a better tradeoff.
Dropbox is likely using FSEvents to watch for changes to the file system. It's a great API and can even bundle up changes that happened while your app was not running. It's the same API that Spotlight uses. The menubar app likely does the actual observing itself (since restarting it can fix uploads being hung, for instance).
There's no way they're using MacFUSE, as that would require installing the MacFUSE kernel extension to make Dropbox work, and since I definitely didn't install it, I highly doubt they're using it.
Two suggestions:
MacFUSE
WebDAV
The former will allow you to write an app that appears as a filesystem and does all the right things; the latter will allow you to move everything server-side and let the user just mount your service as a file share.
Dropbox on the client is written in python.
The client seems to use a sqlite3 database to index files.
I suppose Dropbox splits a file into chunks, to reduce bandwidth usage.
By the way, if two people have the same file, even if they do not know each other, the server can optimize and avoid transferring the file multiple times, only copying it on the server side.
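Chunking and that cross-user dedup fall out naturally if the server stores chunks keyed by their hash (content-addressed storage): a chunk the server has already seen costs nothing to "upload". A sketch (all names hypothetical; this is speculation about the technique, not Dropbox's actual implementation):

```python
import hashlib

CHUNK_SIZE = 4  # tiny for illustration; real systems use MB-sized chunks

class ChunkServer:
    """Stores chunks keyed by SHA-256, so identical chunks, even from
    unrelated users, are stored and transferred only once."""

    def __init__(self):
        self.chunks = {}

    def has(self, digest):
        return digest in self.chunks

    def put(self, data):
        digest = hashlib.sha256(data).hexdigest()
        self.chunks[digest] = data
        return digest

def upload(server, data):
    """Returns the file's chunk-digest list and how many bytes actually moved."""
    digests, bytes_sent = [], 0
    for i in range(0, len(data), CHUNK_SIZE):
        chunk = data[i:i + CHUNK_SIZE]
        digest = hashlib.sha256(chunk).hexdigest()
        if not server.has(digest):
            server.put(chunk)
            bytes_sent += len(chunk)
        digests.append(digest)
    return digests, bytes_sent
```

A file is then just an ordered list of digests in the index (which would fit neatly in that sqlite3 database), and the second user uploading an identical file transfers only the digest list.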
To me it feels like a heavily modified revision control system. It has all the features: updates files based on deltas, options to recover or restore old revisions of files. It almost feels like they are using git (GitFS?), or some filesystem they designed.
You could also give File Conveyor a try. It's a Python daemon capable of instantly detecting FS changes (on Linux through inotify, on OS X through FSEvents), processing the files and syncing them to one or more destinations.
Supported protocols: FTP, SFTP, Amazon S3 (CloudFront is also supported), Rackspace Cloud Files. Can easily be extended. Uses django-storages.
"processing files": e.g. optimizing images, transcoding videos — this was originally conceived to be used for sending static assets to a CDN in the context of speeding up websites)