Architecture used by dropbox - client-server

I want to know whether a client-server or a peer-to-peer architecture is used by Dropbox.
Here is where my doubt lies:
Suppose we have two systems which are synced via Dropbox.
System1: dropbox > Folder_A > file_1
System2: dropbox > Folder_A > file_1
Initially both are synced. Now suppose the user on System1 adds a file_2 in Folder_A. This file gets uploaded to the Dropbox server. My question is: how does the server notify System2 about file_2?
I see the Observer pattern being used here. But is a pull or a push mechanism used?
Point 1: Does the Dropbox client on System2 poll the Dropbox server at some interval to get updates?
Point 2: Does the Dropbox server push the file to System2 itself?
Point 3: Are all systems, including the Dropbox central system, considered peers, so that a peer-to-peer network is formed and the Dropbox central peer controls which file is sent to which system?
PS: My question is not specific to Dropbox but applies to all file sync services; I just used Dropbox as a reference.

I suspect a pull mechanism is used, as push runs into too many firewall issues; this article strongly suggests that pull is used. Of course, the easiest way to be sure would be to look at a Wireshark trace.

Dropbox uses a push mechanism, implemented as long polling. The Dropbox client keeps a persistent TCP connection to a notification server, which is used to learn about changes performed elsewhere. The client requests new changes; if there are none, the server responds after about 60 seconds, and a new request is sent as soon as a response arrives. Changes on the central storage are therefore advertised as soon as they are performed.
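To make the long-polling loop concrete, here is a minimal client-side sketch. The endpoint, parameter names, and the sync_down helper are made up for illustration; they are not Dropbox's real API.

```python
import requests  # third-party HTTP client (pip install requests)

NOTIFY_URL = "https://notify.example.com/changes"  # hypothetical notification endpoint

def sync_down(cursor):
    """Hypothetical helper: fetch everything changed since `cursor` from the
    storage API, apply it locally, and return the new cursor."""
    return cursor

def watch_for_changes(cursor):
    # The server holds each request open and answers either when something
    # changed elsewhere or after roughly 60 seconds with "no changes".
    while True:
        resp = requests.get(NOTIFY_URL, params={"cursor": cursor}, timeout=90)
        if resp.json().get("changes"):
            cursor = sync_down(cursor)
        # Either way, loop and reissue the request immediately.
```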

I don't know much, but I think Dropbox (and other similar services) use a push mechanism, because with a pull mechanism there would be a lot of unnecessary calls. Please correct me if I am wrong.

Related

How to implement synchronization of browser-based online games when users refresh their browser

In implementing a browser-based simple game involving multiple users, I have the server save the game state at certain sync points (not time-based but event-specific). I identify each state by an integer.
When a user refreshes his browser, the server provides the latest state and restores the content in the browser. However, in those few seconds while the browser is loading the latest content after browser-refresh, the state could change again. I do not know how to handle this situation because sending the next state will again raise the same issue.
I want a seamless refresh so none of the other players are impacted when one user refreshes his browser (or for that matter leaves and comes back).
The implementation language is not relevant. I use websockets to communicate between the browser and the server. The server is the intermediary for all communication between users (I am not using WebRTC data channels). What is the best way to sync the application content in multiple browsers?
This is indeed a programming-based question though no code is provided.
Forget the fact that your client exists in a browser. Let's just talk about replication.
The usual approach in databases is to separate snapshots from write-ahead logs (WAL). When you bring a new client up, you select a snapshot and transfer that. Then, when the client is ready, it asks for the WAL entries from that snapshot forward. The same mechanism is used after crashes: the last available snapshot is loaded, then the WAL is replayed, then the database comes up.
I would suggest the same strategy. This does require efficient storage of snapshots, some kind of log, and some kind of replay mechanism, which is a lot of easy-to-mess-up code. If you can use something existing, that would be good.
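To make the snapshot-plus-log idea concrete before looking at existing tools, here is a minimal sketch in Python. The event shape and the reducer are made-up placeholders; a real game would have its own.

```python
def apply_event(state, event):
    """Hypothetical reducer: here an event is just a {"key": ..., "value": ...} update."""
    new_state = dict(state)
    new_state[event["key"]] = event["value"]
    return new_state

class GameStateStore:
    """Server side: one snapshot plus an ordered log of events applied after it."""
    def __init__(self, initial_state):
        self.snapshot = dict(initial_state)  # state as of snapshot_id
        self.snapshot_id = 0                 # the integer sync point from the question
        self.log = []                        # (event_id, event) pairs after the snapshot

    def record(self, event):
        event_id = self.snapshot_id + len(self.log) + 1
        self.log.append((event_id, event))
        return event_id

    def bootstrap(self):
        """What a freshly loaded client receives first."""
        return self.snapshot_id, dict(self.snapshot)

    def events_since(self, last_seen_id):
        """What a refreshing client asks for over its websocket: everything it missed."""
        return [(i, e) for i, e in self.log if i > last_seen_id]

# Client side: start from the snapshot, then replay whatever happened while loading.
def rebuild(snapshot, events):
    state = dict(snapshot)
    for _, event in events:
        state = apply_event(state, event)
    return state
```

The key point is that the client keeps asking events_since(latest id it has) until it is caught up, so a change that lands while the page is still loading is simply replayed afterwards rather than causing the refresh to start over.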
The first thing that I looked into was using Emscripten to compile Redis to JS, and then trying to use Redis' built-in asynchronous replication to replicate to your browser. That may be possible, but the fact that Redis is single-threaded and is designed as a client-server system is probably a showstopper.
The next best option that I found is https://isomorphic-git.org/. Here is how it could be used to build what you need. You simply maintain your current state in a git repository, and keep a WAL-style log of everything that you've done to it. When a client connects, it clones the repository. Once that is done, it connects to the websocket, says which commit it is at, and you send it the log from that point forward. Locally, in the browser, it runs those git commands. If the client simply loses its connection and then rejoins, it can do a git pull and then follow the same strategy.
This will be a bunch of work for you. But a lot less work than implementing everything from scratch.

Client-server synchronization

I found a nice answer by S. Lott about what I've been searching for:
Client-server synchronization pattern / algorithm?
But my question now is: what if the client's clock is wrong?
Here's my problem:
Let's say the client's clock is 1 hour behind the server's. The client then changes a file, so its last write time is now 1 hour behind the server's time. When the user starts the program which synchronizes the file, the server says about the changed file: "Oh, that file you have there is 1 hour older than mine, so let's replace it." But that's wrong, because the user's file is actually newer, so it should be uploaded to the server.
I need a system that checks whether the file is newer on the server or on the client, and that still works if the clocks are wrong or different.
Any ideas?
By the way, I am trying to write a cloud program.
If you're resolving conflicts manually (which would make sense for most applications), this can probably be done better with versioning rather than timestamps. When a client modifies a file, set a flag. When synchronizing, check the flag and versions.
If the client flag is set and the client and server versions are the same, send the client file to the server.
If the client flag is not set and the server version is newer, send the server file to the client.
If the client flag is set and the server version is newer, a conflict occurred and should be resolved.
The versions are per-file and should be sent along with the files.
Reset all client flags after synchronization.
This 'flag' can just be a check whether the last modified time on the file is different from the time that file was received from the server (we can store this time separately right after getting the file from the server).
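Here is a small sketch of that decision logic, assuming each file carries a per-file version number and a client-side dirty flag; the names are illustrative, not a real API.

```python
from dataclasses import dataclass

@dataclass
class FileMeta:
    version: int   # per-file version, bumped by the server each time it accepts a change
    dirty: bool    # client-side "modified since last sync" flag

def sync_action(client: FileMeta, server_version: int) -> str:
    if client.dirty and client.version == server_version:
        return "upload"      # client changed, server did not
    if not client.dirty and server_version > client.version:
        return "download"    # server changed, client did not
    if client.dirty and server_version > client.version:
        return "conflict"    # both changed since the common version
    return "in-sync"
```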
Alternatively, you could sync the time.
Here's one possible solution:
When receiving files from the server, first get the current time from the server, then offset the timestamp of each file received on the client side by the difference between the server and client time. When sending files to the server, you can do something similar, offsetting in the opposite direction.
But this seems more complex than necessary.
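For what it's worth, a rough sketch of that offset bookkeeping might look like this; the helper names are made up, and os.utime simply rewrites the file's timestamps.

```python
import os

def clock_offset(server_now: float, client_now: float) -> float:
    """Seconds the client clock lags (+) or leads (-) the server clock."""
    return server_now - client_now

def stamp_downloaded_file(path: str, server_mtime: float, offset: float) -> None:
    # Express the server-reported mtime in client time before writing it to disk.
    local_mtime = server_mtime - offset
    os.utime(path, (local_mtime, local_mtime))

def mtime_for_upload(path: str, offset: float) -> float:
    # Going the other way: report the local mtime translated into server time.
    return os.path.getmtime(path) + offset
```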

Heroku - letting users download files from tmp

Let me start by saying I understand that heroku's dynos are temporary and unreliable. I only need them to persist for at most 5 minutes, and from what I've read that generally won't be an issue.
I am making a tool that gathers files from websites and zips them up for download. My tool does everything and creates the zip - I'm just stuck at the last part: providing the user with a way to download the file. I've tried direct links to the file location and HTTP GET requests, and Heroku didn't like either. I really don't want to have to set up AWS just to host a file that only needs to persist for a couple of minutes. Is there another way to download files stored in /tmp?
As far as I know, you have absolutely no guarantee that a request goes to the same dyno as the previous request.
The best way to do this would probably be to either host the file somewhere else, like S3, or to send it immediately in the same request.
If you're generating the file in a background worker, then it most definitely won't work. Every process runs on a separate dyno.
See How Heroku Works for more information on their backend.
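If the web process itself builds the zip, "send it immediately in the same request" could look roughly like this. This sketch assumes a Python/Flask app; the route name and file-selection rule are placeholders.

```python
import os, tempfile, zipfile
from flask import Flask, send_file   # assumes Flask; any framework with a "send file" response works

app = Flask(__name__)

@app.route("/export")
def export():
    # Build the archive and return it in the *same* request, so it never needs
    # to survive on the dyno or be found again by a later request on another dyno.
    tmp = tempfile.gettempdir()
    zip_path = os.path.join(tmp, "export.zip")
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for name in os.listdir(tmp):
            if name.endswith(".txt"):          # placeholder selection rule
                zf.write(os.path.join(tmp, name), arcname=name)
    return send_file(zip_path, as_attachment=True, download_name="export.zip")
```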

algorithm for synchronizing text between client/server

What is a low-latency, low-bandwidth algorithm for synchronizing, say, a text file between a client and a server?
Is there a design where the client sends a delta between its current state and its last state ACK'd by the server? I am thinking of Quake 3 networking.
EDIT 1:
More specifically, how would a diff/delta algorithm behave in a client/server environment?
For example, is it more expensive to calculate a diff on the client side, send it to the server, and have the server interpret it, update its store, and send an ACK to the client? Or is it cheaper to have a replication model where the client sends its full state and the server simply stores it?
EDIT 2:
100 KB text file. Something small, not too large.
You mean like a diff?
Store the server side's version of the file on the client. Whenever you need to synchronize, run a diff (you can either write your own or use a library). Then send the difference over to the server and have the server patch its local version.
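A minimal sketch of that flow using Python's standard difflib: the client diffs its edited text against the last server-acknowledged baseline and sends only the edit script, and the server applies it to its own copy of the same baseline.

```python
from difflib import SequenceMatcher

def make_delta(old: str, new: str):
    """Client side: edit script turning old into new (only the non-equal opcodes)."""
    ops = SequenceMatcher(None, old, new).get_opcodes()
    return [(tag, i1, i2, new[j1:j2]) for tag, i1, i2, j1, j2 in ops if tag != "equal"]

def apply_delta(old: str, delta) -> str:
    """Server side: rebuild the new text from old plus the edit script."""
    out, cursor = [], 0
    for tag, i1, i2, text in delta:
        out.append(old[cursor:i1])       # unchanged run before this edit
        if tag in ("replace", "insert"):
            out.append(text)
        cursor = i2                      # "delete" just skips old[i1:i2]
    out.append(old[cursor:])
    return "".join(out)

baseline = "the quick brown fox"
edited   = "the quick red fox jumps"
assert apply_delta(baseline, make_delta(baseline, edited)) == edited
```

For a ~100 KB file you would probably diff line by line rather than character by character, but the shape of the protocol is the same.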
If the client also edits the text and has an undo/redo feature, then the undo stack can be used as the delta. For large texts and small changes, using the undo stack should be more efficient than running a diff.
For text you can use a delta algorithm; take a look, for example, at how rsync works.
Google uses a different approach to update Chrome; you can "google" it to see.
Edit: If it were the server generating one change and replicating it to lots of clients, the diff should be computed on the server. From the question's edits, I understand that a client (or many clients) will produce the changes and want them replicated to the server.
Well... I'd take into account four things:
network performance
number of clients
number of changes expected
performance of the server and of the client
Too many clients sending their data and having the diffs computed on the server is almost a DoS.
I'd only compute diffs on the server if there were few clients, high server performance, and low client performance.
Otherwise, I'd compute them on the clients.

Best approach to collecting log files from remote machines?

I have over 500 machines distributed across a WAN covering three continents. Periodically, I need to collect text files which are on the local hard disk of each blade. Each server is running Windows Server 2003 and the files are on a share which can be accessed remotely as \\server\Logs. Each machine holds many files which can be several MB each, and their size can be reduced by zipping.
Thus far I have tried using PowerShell scripts and a simple Java application to do the copying. Both approaches take several days to collect the 500 GB or so of files. Is there a better solution which would be faster and more efficient?
I guess it depends on what you do with them... if you are going to parse them for metrics data into a database, it would be faster to have that parsing utility installed on each of those machines, parsing and loading into your central database at the same time.
Even if all you are doing is compressing and copying to a central location, set up those commands in a .cmd file and schedule it to run on each of the servers automatically. Then you will have distributed the work amongst all those servers, rather than forcing your one local system to do all the work. :-)
The first improvement that comes to mind is to not ship entire log files, but only the records from after the last shipment. This of course is assuming that the files are being accumulated over time and are not entirely new each time.
You could implement this in various ways: if the files have date/time stamps you can rely on, running them through a filter that removes the older records from consideration and dumps the remainder would be sufficient. If there is no such discriminator available, I would keep track of the last byte/line sent and advance to that location prior to shipping.
Either way, the goal is to only ship new content. In our own system, logs are shipped via a service that replicates them as they are written. That required a small service sitting in the log-writing path, but it reduced the latency in capturing logs and cut bandwidth use immensely.
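A small sketch of the "ship only what's new" bookkeeping, tracking the last shipped byte offset per file; the state-file name and location are arbitrary.

```python
import json, os

STATE_FILE = "ship_state.json"   # remembers how far each log has already been shipped

def load_offsets():
    try:
        with open(STATE_FILE) as f:
            return json.load(f)
    except FileNotFoundError:
        return {}

def new_content(log_path, offsets):
    """Return only the bytes appended since the last shipment."""
    start = offsets.get(log_path, 0)
    if os.path.getsize(log_path) < start:   # log was rotated or truncated; start over
        start = 0
    with open(log_path, "rb") as f:
        f.seek(start)
        data = f.read()
    offsets[log_path] = start + len(data)
    return data

def save_offsets(offsets):
    with open(STATE_FILE, "w") as f:
        json.dump(offsets, f)
```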
Each server should probably:
manage its own log files (start new logs before uploading and delete sent logs after uploading)
name the files (or prepend metadata) so the server knows which client sent them and what period they cover
compress log files before shipping (compress + FTP + uncompress is often faster than FTP alone)
push log files to a central location (FTP is faster than SMB; the Windows FTP command can be automated with "-s:scriptfile")
notify you when it cannot push its log for any reason
do all the above on a staggered schedule (to avoid overloading the central server)
Perhaps use the server's last IP octet multiplied by a constant to offset in minutes from midnight? (A sketch of this follows the lists below.)
The central server should probably:
accept log files sent and queue them for processing
gracefully handle receiving the same log file twice (should it ignore or reprocess?)
uncompress and process the log files as necessary
delete/archive processed log files according to your retention policy
notify you when a server has not pushed its logs lately
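For the staggered schedule mentioned above, one cheap way to derive the per-server offset from its last IP octet might look like this; the constant and the way the local IP is obtained are just examples.

```python
import socket
from datetime import datetime, timedelta

def staggered_start(minutes_per_octet: int = 5) -> datetime:
    """Offset tonight's upload from midnight by (last IP octet * constant) minutes."""
    ip = socket.gethostbyname(socket.gethostname())   # crude; you may need the real NIC's address
    last_octet = int(ip.split(".")[-1])               # 0-255, so at most ~21 hours at 5 min/octet
    midnight = datetime.now().replace(hour=0, minute=0, second=0, microsecond=0)
    return midnight + timedelta(minutes=last_octet * minutes_per_octet)
```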
We have a similar product on a smaller scale here. Our solution is to have the machines generating the log files push them to a NAS on a daily basis in a randomly staggered pattern. This solved a lot of the problems of a more pull-based method, including bunched-up read/write times that kept a server busy for days.
It doesn't sound like the storage server's bandwidth would be saturated, so you could pull from several clients at different locations in parallel. The main question is: what is the bottleneck that slows the whole process down?
I would do the following:
Write a program to run on each server, which will do the following:
Monitor the logs on the server
Compress them at a particular defined schedule
Pass information to the analysis server.
Write another program which sits on the core server and does the following:
Pulls compressed files when the network/cpu is not too busy.
(This can be multi-threaded.)
This uses the information passed to it from the end computers to determine which log to get next.
Uncompress and upload to your database continuously.
This should give you a solution which provides up to date information, with a minimum of downtime.
The downside will be relatively consistent network/computer use, but tbh that is often a good thing.
It will also allow easy management of the system, to detect any problems or issues which need resolving.
NetBIOS copies are not as fast as, say, FTP. The problem is that you don't want an FTP server on each server. If you can't process the log files locally on each server, another solution is to have all the servers upload the log files via FTP to a central location, which you can then process. For instance:
Set up an FTP server as a central collection point. Schedule tasks on each server to zip up the log files and FTP the archives to your central FTP server. You can write a program which automates the scheduling of the tasks remotely using a tool like schtasks.exe:
KB 814596: How to use schtasks.exe to Schedule Tasks in Windows Server 2003
You'll likely want to stagger the uploads back to the FTP server.
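On each machine, the scheduled task could boil down to something like this sketch; the host, credentials, and directory are placeholders, and on Windows Server 2003 the same thing could be a .cmd script driving zip and ftp -s:scriptfile instead.

```python
import ftplib, os, shutil, socket

FTP_HOST = "logs.example.com"             # placeholder central collection point
FTP_USER, FTP_PASS = "loguser", "secret"  # placeholder credentials

def ship_logs(log_dir: str) -> None:
    # Zip the whole log directory, named after this machine so the collector
    # knows who sent it, push it over FTP, then remove the local archive.
    archive = shutil.make_archive(f"{socket.gethostname()}-logs", "zip", log_dir)
    with ftplib.FTP(FTP_HOST) as ftp:
        ftp.login(FTP_USER, FTP_PASS)
        with open(archive, "rb") as f:
            ftp.storbinary(f"STOR {os.path.basename(archive)}", f)
    os.remove(archive)
```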
