Is there any sync algorithm/reference available for syncing a directory?

I'm planning to write a program to sync a folder in real time across multiple computers over the internet. I'm wondering if there is any sync algorithm that handles file sync conflicts, i.e., computer A tries to save a file while computer B has removed the file.

The example you gave is exactly why synchronization is considered a hard problem.
Computer A has deleted a file which computer B still has. Now: how do you know if the file was added on B, and should be copied to A, or deleted on A, and should be deleted on B? You don't, really. Many synchronization systems have the possibility of conflicting changes which need to be resolved by a human.
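One common way to narrow the ambiguity (not something this answer describes, so treat it as an assumption) is to remember what the folder looked like at the last successful sync and compare both machines against that record. A minimal sketch, where `base`, `a` and `b` are hypothetical dictionaries mapping a path to a content hash:

```python
def decide(path, base, a, b):
    """Pick an action for one path.

    base: state recorded at the last successful sync
    a, b: current state on each machine
    A missing key means the file does not exist there.
    """
    old, va, vb = base.get(path), a.get(path), b.get(path)
    if va == vb:
        return "in sync"
    if va == old:
        # Only B changed since the last sync (edit, create, or delete).
        return "delete on A" if vb is None else "copy B -> A"
    if vb == old:
        # Only A changed since the last sync.
        return "delete on B" if va is None else "copy A -> B"
    # Both sides changed in different ways: this is the case a human must resolve.
    return "conflict"
```

The case from the question (A saves a new version while B deletes the file) still comes out as a conflict; the record only removes the cases where one side did nothing.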
Many tools have already been built to do synchronization, including:
version control systems, like CVS, Subversion, Mercurial, git, Perforce, etc.
standalone, one-directional synchronization programs. They cannot handle changes on both sides, but they can make the destination directory look exactly like the source directory. This is better than a complete copy because it's faster, but it's basically the same thing. Examples include rsync and, on Windows, ROBOCOPY /MIR and XCOPY.
easy-to-use internet synchronization tools that synchronize folders on multiple machines. Examples include Windows Live Folder and Dropbox. These apps often resolve conflicts by making extra copies of both versions in subdirectories so that you can sort it out later. They essentially assume that there will be very few conflicts.
built-in synchronization in sophisticated applications, for example, email/contact/calendar synchronization in Microsoft Exchange, Lotus Notes, etc.

You might also want to look into Unison. It is a multidirectional file synchronization tool that uses the rsync algorithm to make sure that only the changed parts of files are sent.
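To give a feel for how "only the changed parts" can be found, here is a simplified sketch of the weak rolling checksum used by the rsync algorithm. The function names are made up for illustration, and a real implementation pairs this cheap checksum with a stronger hash to confirm matches:

```python
M = 1 << 16  # both running sums are kept modulo 2**16

def weak_sum(block):
    """Compute the (a, b) checksum pair for one block of bytes."""
    n = len(block)
    a = sum(block) % M
    b = sum((n - i) * x for i, x in enumerate(block)) % M
    return a, b

def roll(a, b, out_byte, in_byte, n):
    """Slide an n-byte window one byte to the right in constant time."""
    a = (a - out_byte + in_byte) % M
    b = (b - n * out_byte + a) % M
    return a, b
```

Because `roll` is cheap, the receiver's block checksums can be looked up at every byte offset of the sender's file, so a block is recognized even when data has been inserted before it.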

Related

Using rsync with MS SharePoint

I've searched far and wide, but nothing is available, as far as I can see.
TLDR: How can I use rsync with a SharePoint installation? (Or something like rsync)
Long description
We have a large install base of Macs (~50%), Windows (~40%), and Linux (~10%), so our environment is pretty heterogeneous. Because our work is experimental, we produce a considerable number of experimental datasets that we need to share and, more importantly, back up.
Right now we use external hard drives to store these files and folders, since our computers cannot hold this amount of data (50GB++, for instance, per dataset). And when we need to share, we "physically" share. We mainly use rsync with some kind of backend (what kind is not important), but this solution requires computers to be left turned on to act as servers.
For reasons that I will not bother you with, we cannot leave a computer on after work.
OneDrive for Business seemed a very promising technology to use, since we have more than 1TB per user. We could start syncing our datasets from our computers and hard drives, and we could share even when computers are turned off.
We are aware that we may hit some drawbacks, such as not being able to actually share, or limits on the number of objects (files/directories), but we will handle them later.
I prefer rsync, but right now we're open to any solution.
OneDrive for Business has a download that will allow you to synchronize a directory locally. https://onedrive.live.com/about/en-us/download/
For a Linux platform, you should be able to use onedrive-d found here:
https://github.com/xybu/onedrive-d
I know that it's an old question, but it's unanswered. Maybe a solution could be https://rclone.org/. Rclone is a command line program to sync files and directories to and from the cloud.

Possible to selective sync dropbox or other cloud storage from multi-platform command line?

Going to be working with a medium sized remote group on a large (but independent) project that will be generating many GB to TB of data.
To keep users from having to store 500GB of data on their personal machines, and to keep everyone in sync, we need a command-line/Python utility to control selective syncing of dependencies on multiple operating systems, or at least OS X and Linux.
For example, someone who needs to work on the folder:
startrek/startrekiii
May require the folders:
startrek/nimoy/common
startrek/nimoy/[user]
startrek/shatner/common
startrek/shatner/[user]
but not:
startrek/startrekii, startrek/nimoy/[some_other_user], etc
From their command line (or a UI) they would run:
sync startrekiii
And they'd also receive startrek/nimoy/common, etc
Likewise, we'll have an unsync command that unsyncs those dependent folders and removes them from the user's HD, as long as they are not in use by another sync.
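The bookkeeping behind such a sync/unsync pair can be sketched with reference counting. This is only an illustration: the `DEPS` table and the `SyncSet` class are hypothetical, and the actual download and removal would be done by whatever storage tool ends up being used.

```python
from collections import Counter

# Hypothetical dependency table: project folder -> folders it needs.
DEPS = {
    "startrek/startrekiii": ["startrek/nimoy/common", "startrek/shatner/common"],
    "startrek/startrekii": ["startrek/nimoy/common"],
}

class SyncSet:
    def __init__(self, deps):
        self.deps = deps
        self.active = set()    # projects the user has explicitly synced
        self.refs = Counter()  # how many active projects need each folder

    def _folders(self, project):
        return [project, *self.deps.get(project, [])]

    def sync(self, project):
        """Return the folders that must be newly downloaded."""
        if project in self.active:
            return []
        self.active.add(project)
        fresh = []
        for folder in self._folders(project):
            self.refs[folder] += 1
            if self.refs[folder] == 1:
                fresh.append(folder)
        return fresh

    def unsync(self, project):
        """Return the folders that no longer have any user and can be removed."""
        if project not in self.active:
            return []
        self.active.discard(project)
        unused = []
        for folder in self._folders(project):
            self.refs[folder] -= 1
            if self.refs[folder] == 0:
                del self.refs[folder]
                unused.append(folder)
        return unused
```

With both projects synced, unsyncing startrek/startrekiii would leave startrek/nimoy/common in place because startrek/startrekii still refers to it.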
Of the cloud sync/storage solutions, Dropbox seems to offer the most granular control over this, allowing you to sync specific folders and subfolders. However, from everything I can find, this granular control is strictly limited to its UI.
We're completely open to alternative solutions if you have them, we just need something as easily deployable as possible and don't have the budget for Aspera or something to that effect.
Two other important notes:
Because of one very central part of our pipeline which pulls files from those dependent folders (over which we have limited API control), the paths need to be consistent on their respective platform. So ~/Dropbox/startrek/nimoy can never be ~/Dropbox/startrek/startrekiii/nimoy
Many of the people using this will be artists and otherwise non-technical people, whose experience with csh or bash extends only to simple things like changing directories and moving files around.
Has anyone found a way to hack into Dropbox's selective sync, and/or know of a better alternative?

Installer vs. zip or executable exe?

In most of the games and programs you download, you just get the installer.
Some .exe files can be run directly, though (probably because they don't have many source files to extract, huh?).
I was wondering: what's the difference between an installer that just extracts the files and a zip (rar, iso...) file that you could download, depending on your internet speed, in as little as a few seconds? And where does a, say, 200 MB installer fetch, let's say, 5 GB of files from while offline?
I've never heard about this, and I'm learning to program, so I'd appreciate if you could answer me properly.
What you're really asking is:
How does an installer work?
A bit of background.
In the Before Times, man did not have such things as "installers." Software was run directly off of floppy disks (and none of that rigid 3.5" crap, I'm talking disks that flopped), like God intended.
Then came the first home computers with persistent hard drives. For the first time, it made sense to copy a program off a disk and have it stick around.
But programs still worked the way "portable" applications do today: you copied them as-is and ran them as-is.
Then operating systems began to get more complicated.
Windows introduced this notion of a registry: a central location where program and operating system configuration could be stored. Software authors began using this registry. Its arcane architecture and user-hostile editing utility (the infamous regedit.exe) made it the perfect place to store shareware information -- how many days you have left on your trial, for example.
This happened around the same time that programs began to be too large to fit -- uncompressed -- on a single floppy disk. A way was needed to split a program onto multiple disks. Since it wasn't very user-friendly to require the user to have e.g. a ZIP extractor installed (remember, this was before ubiquitous Internet), Windows programs began to be shipped with installers. You can think of these as basically portable versions of WinZIP whose sole purpose was to reassemble and extract a compressed file.
These days, installers serve a number of other purposes:
providing a convenient user interface
prompting the user to accept a click-through end-user license agreement (EULA)
prompting the user for CD keys (though this is being phased out for many systems in favor of digital distribution)
asking the user to register their software
and so on. They may also serve as DRM vehicles, validating CDs and decrypting data to prevent villainous individuals (yarr) from brrreakin' ye olde DMCA.
At their heart, they aren't any more complex than in the Windows 95 days -- a glorified unzip program.
Sidenote: Where does the installer get 5GB of data from 200MB of archives if not the Internet?
That's high, though there are plenty of ways you could get that compression ratio. Imagine a complex game whose world is defined in verbose XML -- that's readily compressible. You could even get that back in the old WinZIP days.
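You can get a feel for how well repetitive markup compresses with a few lines of Python. The XML below is synthetic, invented purely for this illustration, and far more regular than real game data would be:

```python
import zlib

# Made-up, highly repetitive "game world" description.
xml = "".join(
    f'<tile x="{x}" y="{y}"><terrain>grass</terrain><passable>true</passable></tile>\n'
    for x in range(200)
    for y in range(200)
).encode()

packed = zlib.compress(xml, 9)  # DEFLATE, the same family of compression ZIP uses
ratio = len(xml) / len(packed)
print(f"{len(xml)} bytes -> {len(packed)} bytes (about {ratio:.0f}:1)")
```

The exact ratio depends on the data; textures, audio and video that are already compressed will barely shrink at all.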
A zip file can only hold files; you unzip it and get those files as-is.
An installer, however, can be a very complicated program. It can create the needed file and folder structures, register the required DLLs on your system, let you choose which features to install, check your system for compatibility, and act as a wizard that guides you, step by step, through a custom install of your application.
An Installer (esp. Windows Installer) can make automatic Registry entries, as well as unpack and write files to a directory. With the Zip, you have to manually extract the files, and get no automatic registry edits.
The advantage to a zip is that it guarantees (most of the time) that the application is portable, that all necessary files are included in the unzipped directory.
The advantage of an installer is pretty obvious: automated, UI.
As for the 200 MB -> 5 GB: compressing the files into an exe can add another layer of better compression than simply throwing the files into a zipped folder. However, 200 MB -> 5 GB is a pretty big jump; not impossible, just pretty big. Installers that do need large external (online) downloads typically let you know beforehand that they are about to download a large chunk of data and that you should not disconnect from the internet during the install.
An installer or EXE can easily be infected by a virus, whereas a ZIP archive is less likely to be affected. A ZIP is also more flexible because you can protect it with your own password.
Another benefit is that ZIP compresses the files too.
I hope that makes sense.

Is it possible to explore an SVN repo as an ordinary folder in Windows (for example, mount it as a remote drive)?

So, I need to set up file storage for our team. I also have an SVN server. The ability to roll back changes and to track who created or deleted a file is necessary and important for our project.
Any ideas? Maybe without SVN. I can connect using WebDAV, but only in read-only mode (because it has no LOCK support).
You can set up the SVN server to allow exactly that.
Read the chapter in the SVN book about WebDAV and Autoversioning
So, what you want is the ability to roll back changes, and limit who can make the changes, but without the bother of checking in and out files?
Maybe Subversion isn't for you. I've done similar sharing with Dropbox, and there's now Box.net, which is supposed to be like Dropbox on steroids. Dropbox (and I assume Box.net too) has some very nice features:
You can setup folder sharing between particular teams. That way, you can say who can and cannot access these files.
Dropbox automatically saves each and every version of a file, so you can always go back to previous versions -- even if that file has been deleted.
Files are stored locally. All a user has to know is to save a particular file in a particular folder, and everyone has access to it. I've successfully used Dropbox to collaborate with managers that make the Pointy-Haired Boss in Dilbert look like a high-tech genius.
There's also SkyDrive and Google Drive, but I don't find them as universal as Dropbox or as easy to use. It's possible to use Dropbox without ever going to the Dropbox website. To the non-geek, it appears to be magic as files I've written and edited appear on their drive. It took me a few weeks to train one person that he didn't have to email me his document when he made changes because I already had it.
Dropbox gives you 2 GB of space for free, which doesn't sound like a lot. However, my first hard drive was a whopping 20 MB, which was twice the size of the standard 10 MB drive at that time. If you're not storing a lot of multimedia presentations or doing a lot of Photoshop, 2 GB might be more than enough for your project.
I know Windows 7 and later has some sort of versioning system built into it. I know this because any time someone mentions that Mac OS X has Time Machine, some Wingeek pipes in stating that Windows has the same thing, only better! Unfortunately, Windows is not my forte, so I don't know too much about this specific feature. I believe the default is once per day, but it can be changed. This might be the perfect solution if everyone is on Windows.
Subversion can do autoversioning as Stefan stated. Considering his position in the Subversion community (especially his work on TortoiseSVN), he knows his stuff. Unfortunately I don't know too much about it since I've never used or seen this feature implemented. It's probably due to the fact that I work mainly with developers who know what a version control system is, and therefore have no need for something that does the versioning for them.
Also, don't forget to check whether you can use your corporate SharePoint, which does something very much like what you want. I am not too impressed with SharePoint, but if the facility is there, and your company can give you the support, it is something you probably want to look into.

Graceful File Reading without Locking

Whiteboard Overview
[Two whiteboard diagrams, labeled "Internal" and "Global", were attached here as images hosted on ImageShack (1000 x 750 px, ~130 kB JPEGs).]
Additional Information
I should mention that each user (of the client boxes) will be working straight off the /Foo share. Due to the nature of the business, users will never need to see or work on each other's documents concurrently, so conflicts of this nature will never be a problem. Access needs to be as simple as possible for them, which probably means mapping a drive to their respective /Foo/username sub-directory.
Additionally, no one but my applications (in-house and the ones on the server) will be using the FTP directory directly.
Possible Implementations
Unfortunately, it doesn't look like I can use off the shelf tools such as WinSCP because some other logic needs to be intimately tied into the process.
I figure there are two simple ways for me to accomplish the above on the in-house side.
Method one (slow):
Walk the /Foo directory tree every N minutes.
Diff with the previous tree using a combination of timestamps (which can be faked by file copying tools, but that's not relevant in this case) and checksums.
Merge changes with off-site FTP server.
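Method one can be sketched in a few lines of Python (not one of the languages listed in the question, but the idea carries over directly). The function names are illustrative, and the merge with the FTP server is left out:

```python
import hashlib
import os

def snapshot(root, previous=None):
    """Map each relative path under root to (size, mtime_ns, sha256)."""
    previous = previous or {}
    state = {}
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            full = os.path.join(dirpath, name)
            rel = os.path.relpath(full, root)
            st = os.stat(full)
            old = previous.get(rel)
            if old and old[:2] == (st.st_size, st.st_mtime_ns):
                digest = old[2]  # size and timestamp unchanged: reuse the old checksum
            else:
                with open(full, "rb") as fh:
                    digest = hashlib.sha256(fh.read()).hexdigest()
            state[rel] = (st.st_size, st.st_mtime_ns, digest)
    return state

def diff(old, new):
    """Return sorted lists of (added, removed, changed) relative paths."""
    added = sorted(new.keys() - old.keys())
    removed = sorted(old.keys() - new.keys())
    changed = sorted(p for p in new.keys() & old.keys() if new[p][2] != old[p][2])
    return added, removed, changed
```

Passing the previous snapshot into `snapshot` is what keeps the walk affordable: only files whose size or timestamp moved get re-hashed.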
Method two:
Register for directory change notifications (e.g., using ReadDirectoryChangesW from the WinAPI, or FileSystemWatcher if using .NET).
Log changes.
Merge changes with off-site FTP server every N minutes.
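For method two, the logged changes are worth collapsing before each merge so that a file saved ten times in the window is uploaded once. A rough sketch, where the event kinds ("created", "modified", "deleted") are hypothetical stand-ins for whatever the notification API reports:

```python
def coalesce(events):
    """Collapse an ordered list of (path, kind) events into one net action per path."""
    net = {}
    for path, kind in events:
        prev = net.get(path)
        if kind == "deleted":
            if prev == "created":
                # Created and deleted inside the same window: nothing to send.
                net.pop(path)
            else:
                net[path] = "deleted"
        elif kind == "created":
            # Re-created after a delete: the remote copy just needs replacing.
            net[path] = "modified" if prev == "deleted" else "created"
        else:  # "modified"
            net[path] = "created" if prev == "created" else "modified"
    return net
```

The result is the list of actions to replay against the FTP server every N minutes.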
I'll probably end up using something like the second method due to performance considerations.
Problem
Since this synchronization must take place during business hours, the first problem that arises is during the off-site upload stage.
While I'm transferring a file off-site, I effectively need to prevent the users from writing to the file (e.g., by opening it with CreateFile and a share mode of FILE_SHARE_READ or something) while I'm reading from it. The upstream bandwidth at their office is small relative to the file sizes they'll be working with, so it's quite possible that they'll come back to the file and attempt to modify it while I'm still reading from it.
Possible Solution
The easiest solution to the above problem would be to create a copy of the file(s) in question elsewhere on the file-system and transfer those "snapshots" without disturbance.
The files (some will be binary) that these guys will be working with are relatively small, probably ≤20 MB, so copying (and therefore temporarily locking) them will be almost instant. The chances of them attempting to write to the file in the same instant that I'm copying it should be close to nil.
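A portable variant of the snapshot idea, sketched in Python: instead of locking the file, it checks whether the source changed while it was being copied and lets the caller retry. This detects a concurrent write rather than preventing one, so it is a substitute for the share-mode lock mentioned above, not an implementation of it; `snapshot_copy` and the staging directory are illustrative names.

```python
import os
import shutil
import tempfile

def snapshot_copy(src, staging_dir):
    """Copy src into staging_dir.

    Returns the path of the copy, or None if src changed during the copy
    (in which case the caller should try again later).
    """
    before = os.stat(src)
    fd, dst = tempfile.mkstemp(dir=staging_dir)
    os.close(fd)
    shutil.copy2(src, dst)
    after = os.stat(src)
    if (before.st_size, before.st_mtime_ns) != (after.st_size, after.st_mtime_ns):
        os.remove(dst)  # the copy may be torn, so discard it
        return None
    return dst
```

The upload then reads from the staged copy, and the user's file is only touched for the duration of the local copy.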
This solution seems kind of ugly, though, and I'm pretty sure there's a better way to handle this type of problem.
One thing that comes to mind is something like a file system filter that takes care of the replication and synchronization at the IRP level, kind of like what some A/Vs do. This is overkill for my project, however.
Questions
This is the first time that I've had to deal with this type of problem, so perhaps I'm overthinking it.
I'm interested in clean solutions that don't require going overboard with the complexity of their implementations. Perhaps I've missed something in the WinAPI that handles this problem gracefully?
I haven't decided what I'll be writing this in, but I'm comfortable with: C, C++, C#, D, and Perl.
After the discussions in the comments my proposal would be like so:
Create a partition on your data server, about 5GB for safety.
Create a Windows Service project in C# that would monitor your data drive / location.
When a file has been modified then create a local copy of the file, containing the same directory structure and place on the new partition.
Create another service that would do the following:
Monitor bandwidth usage.
Monitor file creations on the temporary partition.
Transfer several files at a time (using threading) to your FTP server, respecting the current bandwidth usage by decreasing or increasing the number of worker threads depending on network traffic.
Remove the files from the partition that have successfully transferred.
So basically you have your drives:
C: Windows Installation
D: Share Storage
X: Temporary Partition
Then you would have following services:
LocalMirrorService - Watches D: and copies to X: with the dir structure
TransferClientService - Moves files from X: to ftp server, removes from X:
It also uses multiple threads to move several files at once and monitors bandwidth.
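The thread adjustment in TransferClientService could be as simple as the rule below. It is shown in Python for brevity rather than C#, and the thresholds, the worker limit and the idea of measuring "fraction of the uplink in use" are all assumptions made for the sketch:

```python
def next_worker_count(current, used_fraction, low=0.5, high=0.8, max_workers=8):
    """Pick the number of upload threads for the next interval.

    used_fraction: share of the uplink currently in use, from 0.0 to 1.0.
    """
    if used_fraction > high:
        return max(1, current - 1)            # link is busy: back off, but keep one worker
    if used_fraction < low:
        return min(max_workers, current + 1)  # link is idle: add a worker
    return current                            # in between: leave things alone
```

Changing the count by one step at a time keeps the service from oscillating when traffic is bursty.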
I would bet that this is the idea you already had in mind, but it seems like a reasonable approach as long as you're really good at application development and you're able to create a solid system that handles most issues.
When a user edits a document in Microsoft Word, for instance, the file will change on the share and may be copied to X: even though the user is still working on it. Windows should provide an API to check whether the file handle is still held open by the user; if it is, you can create a hook to watch for when the user actually closes the document, so that all their edits are complete, and only then migrate the file to drive X:.
That said, if the user is working on the document and their PC crashes for some reason, the file handle may not be released until the document is opened again at a later date, which would cause issues.
For anyone in a similar situation (I'm assuming the person who asked the question implemented a solution long ago), I would suggest an implementation of rsync.
rsync.net's Windows Backup Agent does what is described in method 1, and can be run as a service as well (see "Advanced Usage"). Though I'm not entirely sure if it has built-in bandwidth limiting...
Another (probably better) solution that does have bandwidth limiting is Duplicati. It also properly backs up currently open or locked files. It uses SharpRSync, a managed rsync implementation, for its backend. It's open source too, which is always a plus!

Resources