How does Dropbox synchronization work? - algorithm

I want to know how Dropbox is able to synchronize large data files without replacing or re-uploading the whole file to the Dropbox servers.
Example: an encrypted zip archive
Suppose I have a 1 GB encrypted zip archive that is fully synchronized between my computer and the Dropbox servers.
On my computer I add a file of about 5 MB to that zip archive and save it.
Dropbox is able to synchronize the zip archive without re-uploading the whole file; instead it just uploads the small change I made.
TrueCrypt containers also work this way.
Any keywords, ideas, topics, reviews, links, or code are greatly appreciated.

Dropbox uses the rsync algorithm to generate a delta file with the difference from file A1 to file A2. Only the delta (usually much smaller than A2) is uploaded to the Dropbox servers, since Dropbox already has file A1. The delta file can then be applied to file A1, turning it into file A2.
You can learn more about the algorithm here.
http://en.wikipedia.org/wiki/Rdiff-backup#Variations
The source code for the library behind the delta creation can be found here.
http://librsync.sourceforge.net/
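For intuition, here is a minimal Python sketch of the idea (not Dropbox's or librsync's actual code): the server keeps per-block signatures of file A1, the client describes A2 as copy/literal instructions, and only the literal bytes have to be uploaded. The real rsync algorithm additionally uses a cheap rolling checksum so matching blocks are found even when an insertion shifts all later offsets; this sketch only matches blocks at fixed boundaries.

    import hashlib
    import os

    BLOCK_SIZE = 4096  # librsync derives the block size from the file length; fixed here for simplicity

    def signature(old_data: bytes) -> dict:
        """Per-block strong hashes of the file the server already has (A1)."""
        sigs = {}
        for i in range(0, len(old_data), BLOCK_SIZE):
            sigs.setdefault(hashlib.sha1(old_data[i:i + BLOCK_SIZE]).digest(), i)
        return sigs

    def delta(sigs: dict, new_data: bytes) -> list:
        """Describe the new file (A2) as instructions:
        ('copy', offset, length)  -> data the server already has in A1
        ('literal', bytes)        -> data that actually has to be uploaded."""
        ops = []
        for i in range(0, len(new_data), BLOCK_SIZE):
            block = new_data[i:i + BLOCK_SIZE]
            offset = sigs.get(hashlib.sha1(block).digest())
            ops.append(('copy', offset, len(block)) if offset is not None
                       else ('literal', block))
        return ops

    def patch(old_data: bytes, ops: list) -> bytes:
        """Server side: rebuild A2 from A1 plus the delta."""
        out = bytearray()
        for op in ops:
            if op[0] == 'copy':
                _, offset, length = op
                out += old_data[offset:offset + length]
            else:
                out += op[1]
        return bytes(out)

    if __name__ == '__main__':
        a1 = os.urandom(25600)                        # "archive" already on the server
        a2 = a1 + b'new entry appended to the zip'    # small change, e.g. adding a file
        ops = delta(signature(a1), a2)
        assert patch(a1, ops) == a2
        literal = sum(len(op[1]) for op in ops if op[0] == 'literal')
        print(f'uploaded {literal} literal bytes instead of {len(a2)}')

Running this shows that only the final, changed block travels as literal data; everything else is expressed as references to blocks the server already stores.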

My first thought (it's late, sorry!) is that it might be hashing at the block level.
For example, it might generate a hash for each 64 KB segment and then upload the whole segment for each portion whose hash differs.
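A hypothetical sketch of that guess in Python (the 64 KiB segment size and the choice of SHA-256 are assumptions for illustration, not Dropbox's actual parameters):

    import hashlib

    SEGMENT = 64 * 1024  # the 64 KB segment size guessed above

    def segment_hashes(data: bytes) -> list:
        """SHA-256 of every fixed-size segment."""
        return [hashlib.sha256(data[i:i + SEGMENT]).hexdigest()
                for i in range(0, len(data), SEGMENT)]

    def changed_segments(known_hashes: list, new_data: bytes) -> list:
        """Return (segment index, segment bytes) for every segment whose hash
        differs from what the server already knows; only these get uploaded."""
        return [(i, new_data[i * SEGMENT:(i + 1) * SEGMENT])
                for i, h in enumerate(segment_hashes(new_data))
                if i >= len(known_hashes) or known_hashes[i] != h]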

Related

Detect incompatible file location (iCloud, Dropbox, shared folders) for custom file format

I'm designing a custom file format. It will be either a monolithic file or a folder with smaller files. The file is rather large in total and there is no need to load everything into memory at once; doing so would also make it slower than necessary. One of the files may or may not be a database file; being able to run SQL queries against it would be useful.
The user can have many such files. The user might want to share files with others, even if it takes some time to upload or download them.
Conceptually I run into issues with shared network folders, Dropbox, iCloud, etc. Such services can lead to sync issues if the file is not loaded entirely into memory, and the database file can get corrupted.
One solution is to prohibit storing the file on such services, either by using a user/library folder or by forcing the user to pick a local folder.
Using a folder in the library means recreating a file navigation system like Finder, and it also limits the user's choice of where the files end up. Limiting the location to a local folder seems the better choice.
Is there a way to programmatically detect if a folder is local?

Creating a variable zip archive on the fly, estimating file size for content-length

I'm maintaining a site where users can place pictures and other files in a kind of shopping cart. After selecting all the content they wish to download, the user can check out. Until now an archive was generated beforehand and the user got an email with a link to the file once generation had finished.
I've changed this now by using Web API and a push stream to generate the archive directly on the fly. My code offers either a zip, a zip64, or a .tar.gz dynamically, depending on the estimated file size and operating system. For performance reasons compression is set to best speed ('none' would make the zip archives incompatible with Mac OS, and the gzip library I'm using doesn't offer it).
This is working great so far; however, the user no longer gets a progress bar while downloading the file because I'm not setting the Content-Length. What are the best ways to get around this? I've tried to guess the resulting file size, but either the browsers cancel the download too early, or they stop at 99.x% and wait for the missing bytes caused by the difference between the estimated and actual file size.
My first thought was to always guess the resulting file size a little too big and fill the rest with zeros.
I've seen many file hosters that let you select files from a folder and put them into a zip file, and all of them show the correct (?) file size and a progress bar. Any best practices? Thanks!
These are just some thoughts, maybe you can use them :)
With Web API/HTTP the normal way to go is that the response contains the length of the file. Since the response is only received after the call has finished, the time spent generating the file will not show any progress bar in the browser beyond a Windows wait cursor.
What you could do is use a two-step approach:
1. Generating the zip file: create a duplex-like channel using SignalR to give feedback on the file generation.
2. Downloading the zip file: after the file is generated you should know its size, and the browser will show a progress bar while downloading.
It looks like this problem should have been addressed with chunk extensions, but it seems that idea never got further than a draft.
So I guess you are stuck with either no progress or sending the file size up front.
Generating a zip archive of an exact, pre-computed size seems trickier than just adding zero padding.
Another option might be to pre-generate the zip file, without storing it, just to determine the size.
But I am wondering: why not just use tar? It has no compression, so it is easy to determine its final size up front from the sizes of the individual files, and it is supported by both OS X and Linux. Windows should be able to handle uncompressed zip archives, so a similar trick might work there as well.
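To illustrate how predictable an uncompressed tar is, here is a sketch of the size calculation (assuming plain ustar headers as written by Python's tarfile, regular files only, and member names of at most 100 bytes; the file names and sizes are made up). It computes the exact Content-Length before a single byte is streamed:

    import os
    import tarfile
    import tempfile

    BLOCK = 512           # tar works in 512-byte blocks
    RECORD = 20 * BLOCK   # the archive end is padded to a 10 KiB record

    def predicted_tar_size(file_sizes):
        """Exact size of an uncompressed ustar archive for regular files with
        member names <= 100 bytes (no pax/GNU extension headers assumed)."""
        size = 0
        for n in file_sizes:
            size += BLOCK                       # one header block per member
            size += -(-n // BLOCK) * BLOCK      # data rounded up to 512 bytes
        size += 2 * BLOCK                       # end-of-archive marker
        return -(-size // RECORD) * RECORD      # pad to the record size

    if __name__ == '__main__':
        with tempfile.TemporaryDirectory() as d:
            paths = [os.path.join(d, 'photo1.jpg'), os.path.join(d, 'photo2.jpg')]
            for p, n in zip(paths, (123_456, 7_890)):
                with open(p, 'wb') as f:
                    f.write(os.urandom(n))

            out = os.path.join(d, 'cart.tar')
            # USTAR format keeps headers at exactly one block per member.
            with tarfile.open(out, 'w', format=tarfile.USTAR_FORMAT) as t:
                for p in paths:
                    t.add(p, arcname=os.path.basename(p))

            expected = predicted_tar_size(os.path.getsize(p) for p in paths)
            assert os.path.getsize(out) == expected
            print('Content-Length can be sent up front:', expected)

The same kind of arithmetic works for a "stored" (uncompressed) zip, but the local headers, data descriptors and central directory make the bookkeeping fiddlier, which is why tar is the easier candidate here.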

Fastest way to move files within remote computer from Cocoa application?

I have files stored in a shared directory on one computer and a Cocoa application running on another computer on the same LAN.
I want the application to move files within the shared directory.
I'm using -[NSFileManager copyItemAtPath:toPath:error:], but sometimes it seems extremely slow, regardless of file size. Why would that operation take so much longer than doing it directly on the computer hosting the shared directory?
I'd guess (I don't know for sure) that NSFileManager first downloads the file to copy and then re-uploads the downloaded data under a different name, and the last thing it does is remove the original file. The downloading and uploading of course take time.
The reason for this procedure is that most protocols don't have a 'copy' command, so the client has to do all the work itself using the procedure described above.
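To illustrate the difference (sketched in Python rather than Cocoa, with a hypothetical mount point), a rename within the same mounted share is normally a single metadata operation on the server, while copy-then-delete streams every byte through the client:

    import os
    import shutil

    SHARE = '/Volumes/shared'                      # hypothetical mount point of the share
    src = os.path.join(SHARE, 'incoming', 'movie.mov')
    dst = os.path.join(SHARE, 'archive', 'movie.mov')

    def move_by_rename(src, dst):
        """One metadata operation on the server; no file data crosses the
        network, so it is fast regardless of file size (only works within
        the same volume/share)."""
        os.rename(src, dst)

    def move_by_copy(src, dst):
        """What a copy-then-delete approach effectively does: every byte is
        read over the network by the client and written back, then the
        original is removed."""
        shutil.copy2(src, dst)
        os.remove(src)

If a true move within the share is what you want, a move/rename API (rather than copy followed by delete) should avoid the round trip entirely.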

What is the best/quickest method to upload a very large folder to my Server?

I have a large directory that I need to upload to a new host's server, but because I have never transferred such a large directory (32 GB), I am wondering whether there is something I'm missing.
Now, I am assuming that the best way is to compress it into a zip file, upload it to the server and then extract it. But for some reason, my zip file is still about 32 GB!
I have already started uploading the files, and it has literally taken about 30 hours to upload just 3 GB! Obviously this is too long, so I wondered whether there is a better method?
Upload speed is determined by your internet connection speed. Try to find a different location with a faster connection, such as your work, your school, or an internet cafe.
You can test your upload speed here: http://speedtest.net/
Pack everything into one large zip file, upload it, and unpack it remotely. That is faster than uploading file by file over FTP.

Backup configuration files

I need to be able to store configuration files on machines that get shut down by having the power pulled ;). I'm using the basic WinAPI (WriteFile) to store the configuration data. This works unless the machine is unplugged; sometimes the file isn't saved at all.
I was thinking of two solutions:
1) The Transactional NTFS API (e.g. CreateFileTransacted()), but this only works on Vista and requires NTFS, so I can't use it in most cases.
2) Keeping backup copies of the configuration file in the %APPDATA% directory, say 20 of them, and restoring one on application startup when a damaged configuration file is detected.
If you know of any other solution to my problem (the main problem being the machine being turned off by unplugging it), please let me know. Thank you.
You don't really need 20 backup copies; you only need one, the last copy. Now, if your client actually asks for a basic versioning system for config files, that is another story. But just to have a good config file you only need one backup.
Now, here's what I used to do for my embedded projects:
Calculate a hash of the config file and store it in the file. The easiest way is to append it to the end of the file as a comment. I used to use CRC32 for this, but these days I would use SHA-1. This can even be done automatically by a config-uploader tool just prior to transmitting/storing the config file.
When opening the config file, extract the hash and compare it with the value calculated from the file (calculated, obviously, after the hash has been removed from the file). If the hash is not there, the file is incomplete. If the hash does not match, the file is corrupted. In either case, use the older file.
Once a valid and correct config file has been verified, it can replace the older config file. Use the OS's rename operation for this; it is usually atomic on most modern filesystems, so a failed rename will not clobber the old file.
This is the most robust system I've used in my years of experience. It's basically what BitTorrent does.
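A minimal Python sketch of that scheme (the file name and the comment syntax for the hash are assumptions for illustration; os.replace is the atomic-rename step, implemented with MoveFileEx on Windows and rename() on POSIX):

    import hashlib
    import os

    CONFIG = 'settings.conf'   # hypothetical config file name

    def save_config(text, path=CONFIG):
        """Write the config with its SHA-1 appended as a trailing comment, then
        atomically rename over the old file, so a power cut can never leave a
        half-written file sitting where the last good one used to be."""
        digest = hashlib.sha1(text.encode('utf-8')).hexdigest()
        tmp = path + '.tmp'
        with open(tmp, 'w', encoding='utf-8') as f:
            f.write(text)
            f.write('\n# sha1=' + digest + '\n')
            f.flush()
            os.fsync(f.fileno())     # make sure the bytes are on disk before the rename
        os.replace(tmp, path)        # atomic rename

    def load_config(path=CONFIG):
        """Return the config text if the stored hash checks out, otherwise None,
        in which case the caller falls back to the previous backup copy."""
        try:
            with open(path, encoding='utf-8') as f:
                data = f.read()
        except OSError:
            return None
        body, sep, tail = data.rpartition('\n# sha1=')
        if not sep:
            return None              # no hash: the file is incomplete
        if tail.strip() != hashlib.sha1(body.encode('utf-8')).hexdigest():
            return None              # hash mismatch: the file is corrupted
        return body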
