Possible to selective sync dropbox or other cloud storage from multi-platform command line? - shell

Going to be working with a medium-sized remote group on a large (but independent) project that will be generating many GB to TB of data.
To keep users from having to store 500 GB of data on their personal machines, and to keep everyone in sync, we need a command-line/Python utility to control selective syncing of dependencies on multiple operating systems, or at least OS X and Linux.
For example, someone who needs to work on the folder:
startrek/startrekiii
May require the folders:
startrek/nimoy/common
startrek/nimoy/[user]
startrek/shatner/common
startrek/shatner/[user]
but not:
startrek/startrekii, startrek/nimoy/[some_other_user], etc
From their command line (or a UI) they would run:
sync startrekiii
And they'd also receive startrek/nimoy/common, etc
Likewise, we'll have an unsync command: as long as the dependent folders are not in use by another sync, they will be unsynced and removed from the user's HD.
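To make the desired behaviour concrete, here is a rough Python sketch of the dependency resolution we have in mind (the dependency table and the `{user}` substitution are purely illustrative, not an existing tool):

```python
# Hypothetical dependency table: each project folder lists the folders it
# needs; "{user}" is replaced with the current user's name at sync time.
DEPENDENCIES = {
    "startrek/startrekiii": [
        "startrek/nimoy/common",
        "startrek/nimoy/{user}",
        "startrek/shatner/common",
        "startrek/shatner/{user}",
    ],
}

def folders_to_sync(project, user):
    """Return the project folder plus every dependent folder it requires."""
    deps = [d.format(user=user) for d in DEPENDENCIES.get(project, [])]
    return [project] + deps

# e.g. folders_to_sync("startrek/startrekiii", "alice") yields the project
# folder plus nimoy/common, nimoy/alice, shatner/common, shatner/alice.
```

The unsync command would be the inverse: drop a project, then remove any dependent folder no longer referenced by a remaining synced project.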
Of the cloud sync/storage solutions, Dropbox seems to offer the most granular control over this, allowing you to sync specific folders and subfolders; however, from everything I can find, this granular control is strictly limited to their UI.
We're completely open to alternative solutions if you have them, we just need something as easily deployable as possible and don't have the budget for Aspera or something to that effect.
Two other important notes:
Because of one very central part of our pipeline which pulls files
from those dependent folders (over which we have limited API
control), the paths need to be consistent on their respective
platform. So ~/Dropbox/startrek/nimoy can never be ~/Dropbox/startrek/startrekiii/nimoy
Many of the people using this will be artists and otherwise non-technical people, whose experience with csh or bash extends only to simple things like changing directories and moving files around.
Has anyone found a way to hack into Dropbox's selective sync, and/or know of a better alternative?

Related

Using rsync with MS SharePoint

I've searched long and hard, but nothing is available as far as I can see.
TLDR: How can I use rsync with a SharePoint installation? (Or something like rsync)
Long description
We have a large install base of Macs (~50%), Windows (~40%), and Linux (~10%), so our environment is pretty heterogeneous. Doing experimental work, we produce a considerable number of experimental datasets that we need to share and, more importantly, back up.
Right now we use external hard drives to store these files and folders, since our computers cannot hold this amount of data (50 GB and up per dataset, for instance). And when we need to share, we "physically" share. We mainly use rsync with some kind of backend (what kind is not important), but this solution requires computers to be left turned on to act as servers.
For reasons that I will not bother you with, we cannot leave a computer on after work.
Having OneDrive for Business seemed a very promising technology to use, since we have more than 1 TB per user. We could start syncing our datasets from our computers and hard drives, and we could share even when computers are turned off.
We are aware that we may hit some drawbacks, such as not being able to actually share, or limits on the number of objects (files/directories), but we will handle those later.
I prefer rsync, but right now we're open to any solution.
OneDrive for Business has a download that will allow you to synchronize a directory locally. https://onedrive.live.com/about/en-us/download/
For a Linux platform, you should be able to use onedrive-d found here:
https://github.com/xybu/onedrive-d
I know that it's an old question, but it's unanswered. Maybe a solution could be https://rclone.org/. Rclone is a command line program to sync files and directories to and from the cloud.
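For instance, a selective sync of a single subfolder is one rclone invocation: `rclone sync remote:path local-path`. A minimal sketch of driving it from Python (the remote name `onedrive` and the paths are assumptions; the remote must first be set up with `rclone config`):

```python
import subprocess

def build_rclone_sync(remote, folder, local_root):
    # rclone sync <remote>:<folder> <local_root>/<folder>
    # pulls down just that subfolder, leaving everything else unsynced.
    return ["rclone", "sync", f"{remote}:{folder}", f"{local_root}/{folder}"]

cmd = build_rclone_sync("onedrive", "datasets/experiment01", "/home/alice/data")
# subprocess.run(cmd, check=True)  # uncomment once the remote is configured
```

Running the same command in the other direction (local path first) pushes changes back up, which covers the backup use case as well.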

Is it possible to explore an SVN repo as an ordinary folder in Windows (for example, mount it as a remote drive)?

So, I need to make a file storage for our team. I also have an SVN server. The ability to roll back and to track who created or deleted a file is very necessary and important for our project.
Any ideas? Maybe without SVN. I can connect using WebDAV but only in read-only mode (because there is no LOCKS support in it).
You can set up the SVN server to allow exactly that.
Read the chapter in the SVN book about WebDAV and Autoversioning
So, what you want is the ability to roll back changes, and limit who can make the changes, but without the bother of checking in and out files?
Maybe Subversion isn't for you. I've done similar sharing with Dropbox, and there's now Box.net, which is supposed to be like Dropbox on steroids. Dropbox (and I assume Box.net too) has some features that are very nice:
You can set up folder sharing between particular teams. That way, you can say who can and cannot access these files.
Dropbox automatically saves each and every version of a file, so you can always go back to previous versions -- even if that file has been deleted.
Files are stored locally. All a user has to know is to save a particular file in a particular folder, and everyone has access to it. I've successfully used Dropbox to collaborate with managers who make the Pointy-Haired Boss in Dilbert look like a high-tech genius.
There's also Skydrive and Google Drive, but I don't find them as universal as Dropbox or as easy to use. It's possible to use Dropbox without ever going to the Dropbox website. To the non-geek, it appears to be magic as files I've written and edited appear on their drive. It took me a few weeks to train one person that he didn't have to email me his document when he made changes because I already had it.
Dropbox gives you 2 GB of space for free, which doesn't sound like a lot. However, my first hard drive was a whopping 20 MB, which was twice the size of the standard 10 MB drive at the time. If you're not storing a lot of multimedia presentations or doing a lot of Photoshop work, 2 GB might be more than enough for your project.
I know Windows 7 and later have some sort of versioning system built in. I know this because anytime someone mentions that Mac OS X has Time Machine, some Wingeek pipes in stating that Windows has the same thing, only better! Unfortunately, Windows is not my forte, so I don't know too much about this specific feature. I believe the default is once per day, but it can be changed. This might be the perfect solution if everyone is on Windows.
Subversion can do autoversioning as Stefan stated. Considering his position in the Subversion community (especially his work on TortoiseSVN), he knows his stuff. Unfortunately I don't know too much about it since I've never used or seen this feature implemented. It's probably due to the fact that I work mainly with developers who know what a version control system is, and therefore have no need for something that does the versioning for them.
Also, don't forget to check whether you can use your corporate SharePoint, which does something very much like what you want. I am not too impressed with SharePoint, but if the facility is there and your company can give you the support, it is something you probably want to look into.

Graceful File Reading without Locking

Whiteboard Overview
(Two whiteboard diagrams, "Internal" and "Global", were attached here as 1000 × 750 px, ~130 kB JPEGs hosted on ImageShack.)
Additional Information
I should mention that each user (of the client boxes) will be working straight off the /Foo share. Due to the nature of the business, users will never need to see or work on each other's documents concurrently, so conflicts of this nature will never be a problem. Access needs to be as simple as possible for them, which probably means mapping a drive to their respective /Foo/username sub-directory.
Additionally, no one but my applications (in-house and the ones on the server) will be using the FTP directory directly.
Possible Implementations
Unfortunately, it doesn't look like I can use off the shelf tools such as WinSCP because some other logic needs to be intimately tied into the process.
I figure there are two simple ways for me to accomplishing the above on the in-house side.
Method one (slow):
Walk the /Foo directory tree every N minutes.
Diff with previous tree using a combination of timestamps (can be faked by file copying tools, but not relevant in this case) and check-summation.
Merge changes with off-site FTP server.
Method two:
Register for directory change notifications (e.g., using ReadDirectoryChangesW from the WinAPI, or FileSystemWatcher if using .NET).
Log changes.
Merge changes with off-site FTP server every N minutes.
I'll probably end up using something like the second method due to performance considerations.
Problem
Since this synchronization must take place during business hours, the first problem that arises is during the off-site upload stage.
While I'm transferring a file off-site, I effectively need to prevent the users from writing to the file (e.g., use CreateFile with FILE_SHARE_READ or something) while I'm reading from it. The internet upstream speeds at their office are nowhere near symmetrical to the file sizes they'll be working with, so it's quite possible that they'll come back to the file and attempt to modify it while I'm still reading from it.
Possible Solution
The easiest solution to the above problem would be to create a copy of the file(s) in question elsewhere on the file-system and transfer those "snapshots" without disturbance.
The files (some will be binary) that these guys will be working with are relatively small, probably ≤20 MB, so copying (and therefore temporarily locking) them will be almost instant. The chances of them attempting to write to the file in the same instant that I'm copying it should be close to nil.
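The snapshot-copy idea is simple to express. A rough Python sketch, assuming a plain `shutil.copy2` is an acceptable way to take the copy:

```python
import os
import shutil
import tempfile

def snapshot_for_upload(path):
    """Copy the file to a private temp location so the uploader can read
    the copy at its leisure while users keep editing the original."""
    tmpdir = tempfile.mkdtemp(prefix="upload-")
    dest = os.path.join(tmpdir, os.path.basename(path))
    shutil.copy2(path, dest)  # copy2 also preserves the timestamps
    return dest
```

The original is only locked for the duration of the copy, which for files of 20 MB or less is near-instant.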
This solution seems kind of ugly, though, and I'm pretty sure there's a better way to handle this type of problem.
One thing that comes to mind is something like a file system filter that takes care of the replication and synchronization at the IRP level, kind of like what some A/Vs do. This is overkill for my project, however.
Questions
This is the first time that I've had to deal with this type of problem, so perhaps I'm thinking too much into it.
I'm interested in clean solutions that don't require going overboard with the complexity of their implementations. Perhaps I've missed something in the WinAPI that handles this problem gracefully?
I haven't decided what I'll be writing this in, but I'm comfortable with: C, C++, C#, D, and Perl.
After the discussions in the comments my proposal would be like so:
Create a partition on your data server, about 5GB for safety.
Create a Windows Service Project in C# that would monitor your data drive / location.
When a file has been modified, create a local copy of it, preserving the directory structure, and place it on the new partition.
Create another service that would do the following:
Monitor Bandwidth Usages
Monitor file creations on the temporary partition.
Transfer several files at a time (use threading) to your FTP server, abiding by the current bandwidth usage, decreasing / increasing the worker threads depending on network traffic.
Remove the files from the partition that have successfully transferred.
So basically you have your drives:
C: Windows Installation
D: Share Storage
X: Temporary Partition
Then you would have following services:
LocalMirrorService - Watches D: and copies to X: with the dir structure
TransferClientService - Moves files from X: to ftp server, removes from X:
It would also use multiple threads to move several files at once and monitor bandwidth.
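The core of the LocalMirrorService, copying one changed file across while recreating its directory structure, might look roughly like this (sketched in Python for brevity; names and paths are illustrative, and a real Windows service would be the C# project described above):

```python
import os
import shutil

def mirror(changed_path, share_root, staging_root):
    """Replicate one changed file from the share (D:) to the staging
    partition (X:), recreating its relative directory structure."""
    rel = os.path.relpath(changed_path, share_root)
    dest = os.path.join(staging_root, rel)
    os.makedirs(os.path.dirname(dest), exist_ok=True)
    shutil.copy2(changed_path, dest)
    return dest
```

The TransferClientService then only ever touches X:, so a slow upload never holds a lock on a file the users are editing.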
I would bet that this is the idea you had in mind, but it seems like a reasonable approach as long as you're really good with application development and able to create a solid system that handles most issues.
When a user edits a document in Microsoft Word, for instance, the file will change on the share and may be copied to X: even though the user is still working on it. Within Windows there is an API to check whether the file handle is still open; if it is, you can hook in and wait until the user actually closes the document, so that all their edits are complete, before migrating it to drive X:.
That being said, if the user is working on the document and their PC crashes for some reason, the file handle may not be released until the document is opened again at a later date, which can cause issues.
For anyone in a similar situation (I'm assuming the person who asked the question implemented a solution long ago), I would suggest an implementation of rsync.
rsync.net's Windows Backup Agent does what is described in method 1, and can be run as a service as well (see "Advanced Usage"). Though I'm not entirely sure if it has built-in bandwidth limiting...
Another (probably better) solution that does have bandwidth limiting is Duplicati. It also properly backs up currently-open or locked files. Uses SharpRSync, a managed rsync implementation, for its backend. Open source too, which is always a plus!

is there any sync algorithm/reference available for syncing a directory?

I'm planning to write a program to sync a folder in real time across multiple computers over the internet. I'm wondering if there is any sync algorithm for handling file sync conflicts, i.e., computer A tries to save a file while computer B has removed that file.
The example you gave is exactly why synchronization is considered a hard problem.
Computer A has deleted a file which computer B still has. Now: how do you know if the file was added on B, and should be copied to A, or deleted on A, and should be deleted on B? You don't, really. Many synchronization systems have the possibility of conflicting changes which need to be resolved by a human.
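The one piece of state that helps is a record of what existed at the last successful sync: if the file existed then, its absence on one side is a deletion; if it didn't, its presence on one side is an addition. A minimal sketch of that decision table (a generic technique, not any particular tool's algorithm):

```python
def classify(on_a, on_b, last_synced):
    """Decide what to do with a path present on only one side.

    on_a / on_b: whether the path currently exists on each machine.
    last_synced: whether it existed at the last successful sync.
    """
    if on_a and not on_b:
        return "delete from A" if last_synced else "copy A -> B"
    if on_b and not on_a:
        return "delete from B" if last_synced else "copy B -> A"
    return "in sync or conflict"
```

Even with this, a file that was *modified* on A and deleted on B since the last sync is still a genuine conflict that needs a human.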
Many tools have already been built to do synchronization, including:
version control systems, like CVS, Subversion, Mercurial, git, Perforce, etc.
standalone, one-directional synchronization programs. They cannot merge changes made on both sides, but they can make the destination directory look exactly like the source directory. This is better than a complete copy because it's faster, but it's basically the same thing. Examples include rsync, and ROBOCOPY and XCOPY /MIR on Windows.
easy to use internet synchronization tools that synchronize folders on multiple machines. Examples include Windows Live Folder and Dropbox. These apps often resolve conflicts by making extra copies of both versions in subdirectories so that you can sort it out later. They really assume that there will be very very few conflicts.
built-in synchronization in sophisticated applications, for example, email/contact/calendar synchronization in Microsoft Exchange, Lotus Notes, etc.
You might also want to look into Unison. It is a multidirectional file synchronization tool that uses the rsync algorithm to make sure that only the changed parts of files are sent.

How does the DropBox Mac client work?

I've been looking at the DropBox Mac client and I'm currently researching implementing a similar interface for a different service.
How exactly do they interface with Finder like this? I highly doubt the objects represented in the folder are actual documents downloaded on every load; they must be downloaded dynamically as they are needed. So how can you display these items in Finder without having actual file system objects?
Does anyone know how this is achieved in Mac OS X?
Or any pointers to Apple APIs or other open source projects that have similar integration with Finder?
Dropbox is not powered by either MacFUSE or WebDAV, although those might be perfectly fine solutions for what you're trying to accomplish.
If it were powered by those things, it wouldn't work when you weren't connected, as both of those rely on the server to store the actual information and Dropbox does not. If I quit Dropbox (done via the menu item) and disconnect from the net, I can still use the files. That's because the files are actually stored here on my hard drive.
It also means that the files don't need to be "downloaded on every load," since they are actually stored on my machine here. Instead, only the deltas are sent over the wire, and the Dropbox application (running in the background) patches the files appropriately. Going the other way, the Dropbox application watches for the files in the Dropbox folder, and when they change, it sends the appropriate deltas to the server, which propagates them to any other clients.
This setup has some decided advantages: it works when offline, it is an order of magnitude faster, and it is transparent to other apps, since they just see files on the disk. However, I have no idea how it deals with merge conflicts (which could easily arise with one or more clients offline), which are not an issue if the server is the only copy and every edit changes that central copy.
Where Dropbox really shines is that they have an additional trick that badges the items in the Dropbox folder with their current sync status. But that's not what you're asking about here.
As far as the question at hand, you should definitely look into MacFUSE and WebDAV, which might be perfect solutions to your problem. But the Dropbox way of doing things, with a background application changing actual files on the disk, might be a better tradeoff.
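The delta idea can be illustrated with a grossly simplified, fixed-block sketch. Real delta transfer (rsync's algorithm) uses a rolling checksum at every byte offset so it also matches data that has shifted; this toy version only matches block-aligned content:

```python
import hashlib

BLOCK = 4  # unrealistically small block size, to keep the example visible

def block_hashes(data):
    """Hash each fixed-size block of the receiver's existing copy."""
    return [hashlib.sha256(data[i:i + BLOCK]).hexdigest()
            for i in range(0, len(data), BLOCK)]

def delta(old, new):
    """For each block of `new`, emit either ('have', index-into-old) for a
    block the receiver already holds, or ('send', literal bytes)."""
    known = {h: i for i, h in enumerate(block_hashes(old))}
    out = []
    for i in range(0, len(new), BLOCK):
        chunk = new[i:i + BLOCK]
        h = hashlib.sha256(chunk).hexdigest()
        out.append(("have", known[h]) if h in known else ("send", chunk))
    return out
```

Only the `('send', …)` entries cross the wire, which is why editing one paragraph of a large file costs almost nothing in bandwidth.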
Dropbox is likely using FSEvents to watch for changes to the file system. It's a great API and can even bundle up changes that happened while your app was not running. It's the same API that Spotlight uses. The menubar app likely does the actual observing itself (since restarting it can fix uploads being hung, for instance).
There's no way they're using MacFUSE, as that would require installing the MacFUSE kernel extension to make Dropbox work, and since I definitely didn't install it, I highly doubt they're using it.
Two suggestions:
MacFUSE
WebDAV
The former will allow you to write an app that appears as a filesystem and does all the right things; the latter will allow you move everything server-side and let the user just mount your service as a file share.
Dropbox on the client is written in python.
The client seems to use a sqlite3 database to index files.
I suppose Dropbox splits a file into chunks, to reduce bandwidth usage.
By the way, if two people have the same file, even if they do not know each other, the server can optimize and avoid transferring the file again, simply copying it on the server side.
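To illustrate those last two points, here is a toy sketch (not Dropbox's actual schema, which is unknown to me) of a sqlite3 file index keyed by content hash; indexing by hash is also what makes the cross-user dedup trick possible:

```python
import hashlib
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE files (
    path  TEXT PRIMARY KEY,
    mtime REAL,
    hash  TEXT   -- content hash: identical files share one stored blob
)""")

def index_file(path, mtime, content):
    """Record a file's metadata and content hash in the index."""
    digest = hashlib.sha256(content).hexdigest()
    db.execute("INSERT OR REPLACE INTO files VALUES (?, ?, ?)",
               (path, mtime, digest))
    return digest

def already_stored(digest):
    """Server-side dedup check: has any indexed file got this content?"""
    row = db.execute("SELECT 1 FROM files WHERE hash = ? LIMIT 1",
                     (digest,)).fetchone()
    return row is not None
```

If `already_stored` returns true when a second user uploads, the server can just add a row pointing at the existing blob instead of transferring the bytes again.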
To me it feels like a heavily modified revision control system. It has all the features: updates files based on deltas, options to recover or restore old revisions of files. It almost feels like they are using git (GitFS?), or some filesystem they designed.
You could also give File Conveyor a try. It's a Python daemon capable of instantly detecting FS changes (on Linux through inotify, on OS X through FSEvents), processing the files and syncing them to one or more destinations.
Supported protocols: FTP, SFTP, Amazon S3 (CloudFront is also supported), Rackspace Cloud Files. Can easily be extended. Uses django-storages.
"processing files": e.g. optimizing images, transcoding videos — this was originally conceived to be used for sending static assets to a CDN in the context of speeding up websites)
