Which protocol should I use to exchange files between multiple nodes? - go

I have multiple nodes. Each node is just a Linux or Windows server. I also have one master node, which manages the file-sharing process.
[Image: diagram of the communication process]
So I am trying to choose a protocol for this system that I can implement myself (or for which I can simply use an existing implementation). I need a file-sharing protocol, by which I mean one that covers checksum verification, Internet bandwidth management, and management of the data exchange process.
A file is just binary data. File size is approximately 1-10 MB. The system holds approximately 1 million files. 90% of all requests are write requests.

Web servers are designed to serve files (amongst other things).
I would recommend using HTTP with https://golang.org/pkg/net/http/#FileServer, which requires just a few lines of code to set up.
If you need secure transmission, use HTTPS, which also works with FileServer: https://golang.org/pkg/net/http/#ListenAndServeTLS
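A minimal sketch of the FileServer setup mentioned above. The helper name newFileHandler, the command-line argument, and port 8080 are assumptions made for illustration; http.FileServer, http.Dir, and http.ListenAndServe are the standard library calls. It only covers serving (reading) files, so the write-heavy part of the workload would need a separate upload handler.

```go
package main

import (
	"log"
	"net/http"
	"os"
)

// newFileHandler returns a handler that serves the files under dir.
// The function name is hypothetical; it only wraps the standard FileServer.
func newFileHandler(dir string) http.Handler {
	return http.FileServer(http.Dir(dir))
}

func main() {
	if len(os.Args) < 2 {
		log.Println("usage: fileserver <directory>")
		return
	}
	// Port 8080 is an arbitrary choice for this sketch.
	// For HTTPS, http.ListenAndServeTLS additionally takes a certificate
	// file and a key file.
	log.Fatal(http.ListenAndServe(":8080", newFileHandler(os.Args[1])))
}
```

Running it with a directory argument makes every file in that directory available over HTTP GET on the chosen port.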

Related

Apache Camel - SFTP latency

I'm using Apache Camel to interact with several SFTP endpoints; for each one, I perform the following pipeline:
retrieve the list of existing files
validate those files against a given set of rules
download remote files, in case of successful validation
Everything works like a charm (for about a hundred different endpoints) and the URI used to retrieve the list of files is something like this: sftp://${HOST}:${PORT}/${DIR}?username=${USER}&download=false&recursive=true&disconnect=true&sendEmptyMessageWhenIdle=true
The problem is that, for one of those SFTP endpoints, the Camel SFTP component alternates between the following behaviors:
it immediately returns 0 remote files
it takes a couple of minutes to list the remote content (around 250 files, from 2 KB to 2 MB each)
In addition, in the latter case, it takes around 30 seconds to download only 10 KB of data.
Since this is happening on this specific SFTP only, I suppose it doesn't directly depend on Camel, which works fine for all other endpoints.
So, my questions are:
what can affect such a connection, leading to an unreasonable delay (there are no network issues, nor huge data to fetch)?
supposing it depends on the remote SFTP endpoint, why should the aforementioned Camel URI immediately return 0 files, since lots of files exist in the SFTP?
Thanks for any feedback.
Let's assume there is no bug in the Camel SFTP component of your version.
what can affect such a connection, leading to an unreasonable delay
(there are no network issues, nor huge data to fetch)?
Given that your app can immediately return 0 remote files, it is relatively unlikely that the problem lies between your app and the target server. On the server side, it could be:
Too many folders to traverse
The server responding slowly to each call
Some other problem on the server side
For the first case (too many folders to traverse), consider ignoring folders that are not needed and adjusting other options (e.g. stepwise).
supposing it depends on the remote SFTP endpoint, why should the
aforementioned Camel URI immediately return 0 files, since lots of
files exist in the SFTP?
The server side could be running multiple SFTP server nodes, some of which are empty because file system synchronization failed. When the server side's gateway redirects the client to an empty SFTP node, that node returns 0 remote files and the client reports the result as-is.

File sync between n web servers in cluster

There are n nodes in a web cluster. Files may be uploaded to any node and then must be distributed to every other node. This distribution does not have to happen in a transaction (in fact it must not, since distributed transactions don't scale) and some latency is acceptable, although it must be minimal. Conflicts can be resolved arbitrarily (typically last write wins) provided that the resolution is also distributed to all nodes so that eventually all nodes have the same set of files. Nodes can be added and removed dynamically without having to reconfigure existing nodes. There must be no single point of failure and no additional boxes required to solve this (such as RabbitMQ).
I am thinking along the lines of using consul.io for dynamic configuration so that each node can refer to consul to determine what other nodes are available and writing a daemon (Golang) that monitors the relevant folders and communicates with other nodes using ZeroMQ.
Feels like I would be re-inventing the wheel though. This is a common problem and I expect there are solutions available already that I don't know about? Or perhaps my approach is wrong and there is another way to solve this?
Yes, there has been some stuff going on with distributed synchronization lately:
You could use syncthing (open source) or BitTorrent Sync.
Syncthing is node-based, i.e. you add nodes to a cluster and choose which folders to synchronize.
BTSync is folder-based, i.e. you obtain a "secret" for a folder and can synchronize with everyone in the swarm for that folder.
From my experience, BTSync has better discovery and connectivity, but the whole synchronization process is closed source and nobody really knows what happens. Syncthing is written in Go, but sometimes has trouble discovering peers.
Both syncthing and BTSync use LAN discovery via broadcast and a tracker for discovery, AFAIK.
EDIT: Or, if you're really cool, use IPFS to host the latest version, IPNS to "name" that and mount the IPNS on the servers. You can set the IPFS bootstrap list to some of your servers, which would even make you independent of external trackers. :)

algorithm for synchronizing text between client/server

What is a low-latency, low-bandwidth algorithm for synchronizing, say, a text file between a client and a server?
Is there a design where the client sends a delta between its current state and its last ACK'd state from the server? I am thinking of Quake3 networking.
EDIT 1:
More specifically, how would a diff/delta algorithm behave in a client/server environment?
e.g. Is it more expensive to calculate a diff on the client side, send to server, server interprets and updates its store, sends ACK to client? Or is it cheaper to have a replication model where client sends its full state and server stores it..?
EDIT 2:
100 KB text file. Something small, not too large.
You mean like a diff?
Store the server-side version of the file in the client. Whenever you need to synchronize, run a diff (you can either write your own or use a library). Then send the difference over to the server and have the server patch its local version.
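As a rough illustration of the diff-and-patch flow, here is a deliberately simple delta: it is not a full diff algorithm, it only strips the longest common prefix and suffix and sends the differing middle as one edit. The Delta type and the function names computeDelta and applyDelta are made up for this sketch, and the comparison is byte-wise.

```go
package main

import "fmt"

// Delta describes one contiguous edit: replace old[Start:End] with Insert.
type Delta struct {
	Start, End int
	Insert     string
}

// computeDelta finds the longest common prefix and suffix of oldText and
// newText (compared byte by byte) and returns the differing middle.
func computeDelta(oldText, newText string) Delta {
	p := 0
	for p < len(oldText) && p < len(newText) && oldText[p] == newText[p] {
		p++
	}
	s := 0
	for s < len(oldText)-p && s < len(newText)-p &&
		oldText[len(oldText)-1-s] == newText[len(newText)-1-s] {
		s++
	}
	return Delta{Start: p, End: len(oldText) - s, Insert: newText[p : len(newText)-s]}
}

// applyDelta patches the server's copy with a delta produced by the client.
func applyDelta(oldText string, d Delta) string {
	return oldText[:d.Start] + d.Insert + oldText[d.End:]
}

func main() {
	server := "hello world"       // last version the server acknowledged
	client := "hello brave world" // client's edited version
	d := computeDelta(server, client)
	fmt.Printf("delta to send: %+v\n", d)
	fmt.Println("server in sync:", applyDelta(server, d) == client)
}
```

This works well for a single localized edit. Several scattered edits would produce one large middle section, which is where a real diff library pays off.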
If a client also edits text, and has an undo/redo feature then undo stack can be used for delta. For large texts and small changes using undo stack should be more efficient than running a diff.
For text you can use a delta algorithm; take a look, for example, at how rsync works.
Google uses a different approach to update Chrome; you can "google" it to see.
Edit: If the server were generating a change and replicating it to lots of clients, the diff should be computed on the server. From the question's edits, I understand that a client (or many clients) will produce the changes and want them replicated to the server.
Well... I'd take 4 things into account:
network performance
number of clients
number of changes expected
performance of the server and of the client
With too many clients sending changes and the server doing the diff work, it's almost a DoS.
I'd only do that on the server if there were few clients, high server performance and low client performance.
Otherwise, I'd only do that on the clients.

Testing file transfer speed across LAN/WAN

Is there a utility for Windows that allows you to test different aspects of file transfer operations across a LAN or a WAN?
Example...
How long does it take to move a file of a known size (500 MB or 1 GB) from Server A (on site) to Server B (on site) or to Server C (off site-Satellite location)?
D-ITG will allow you to test many aspects of your links. It does not necessarily allow you to transfer a file directly, but it allows you to control almost all aspects of the transmission of data across the wire.
If all you are interested in is bulk transfer time (and not all the nitty-gritty details) you could just use a basic FTP application and time the transfer.
Probably nothing you've not already figured out. You could get some coarse grain metrics using a batch file to coordinate:
start monitoring
copy file
stop monitoring
Copy file might just be initiating a file copy between two nodes on the LAN, or it might initiate an FTP copy between two nodes on the WAN.
Monitoring could be as basic as writing the current time to output or file, or it could be as complex as adding performance counter metrics from the network adapter on the two machines.
A commercial WAN emulator would also give you the information you're looking for. I've used the Shunra Appliance successfully in the past. It's pretty expensive, so I'd really only recommend it if critical business success is riding on understanding how application behavior could change based on network conditions and it is something you could incorporate into regular testing activities.

Best approach to collecting log files from remote machines?

I have over 500 machines distributed across a WAN covering three continents. Periodically, I need to collect text files which are on the local hard disk on each blade. Each server is running Windows Server 2003 and the files are mounted on a share which can be accessed remotely as \\server\Logs. Each machine holds many files which can be several MB each and the size can be reduced by zipping.
Thus far I have tried using PowerShell scripts and a simple Java application to do the copying. Both approaches take several days to collect the 500 GB or so of files. Is there a better solution which would be faster and more efficient?
I guess it depends what you do with them ... if you are going to parse them for metrics data into a database, it would be faster to have that parsing utility installed on each of those machines to parse and load into your central database at the same time.
Even if all you are doing is compressing and copying to a central location, set up those commands in a .cmd file and schedule it to run on each of the servers automatically. Then you will have distributed the work amongst all those servers, rather than forcing your one local system to do all the work. :-)
The first improvement that comes to mind is to not ship entire log files, but only the records from after the last shipment. This of course is assuming that the files are being accumulated over time and are not entirely new each time.
You could implement this in various ways: if the files have date/time stamps you can rely on, running them through a filter that removes the older records from consideration and dumps the remainder would be sufficient. If there is no such discriminator available, I would keep track of the last byte/line sent and advance to that location prior to shipping.
Either way, the goal is to only ship new content. In our own system logs are shipped via a service that replicates the logs as they are written. That required a small service that handled the log files to be written, but reduced latency in capturing logs and cut bandwidth use immensely.
Each server should probably:
manage its own log files (start new logs before uploading and delete sent logs after uploading)
name the files (or prepend metadata) so the server knows which client sent them and what period they cover
compress log files before shipping (compress + FTP + uncompress is often faster than FTP alone)
push log files to a central location (FTP is faster than SMB, and the Windows FTP command can be automated with "-s:scriptfile")
notify you when it cannot push its log for any reason
do all the above on a staggered schedule (to avoid overloading the central server)
Perhaps use the server's last IP octet multiplied by a constant to offset in minutes from midnight?
The central server should probably:
accept log files sent and queue them for processing
gracefully handle receiving the same log file twice (should it ignore or reprocess?)
uncompress and process the log files as necessary
delete/archive processed log files according to your retention policy
notify you when a server has not pushed its logs lately
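For the "same log file twice" point, one possible approach (an assumption, not something the answer prescribes) is to key each received file by a content checksum and ignore repeats. The Receiver type and its methods are hypothetical; the set is kept in memory here, whereas a real collector would persist it and guard it against concurrent use.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// Receiver remembers the checksums of log files it has already accepted.
type Receiver struct {
	seen map[string]bool
}

// NewReceiver returns a Receiver with an empty checksum set.
func NewReceiver() *Receiver {
	return &Receiver{seen: make(map[string]bool)}
}

// Accept returns true if content is new and false if an identical file
// (same SHA-256 checksum) was accepted before.
func (r *Receiver) Accept(content []byte) bool {
	sum := sha256.Sum256(content)
	key := hex.EncodeToString(sum[:])
	if r.seen[key] {
		return false
	}
	r.seen[key] = true
	return true
}

func main() {
	r := NewReceiver()
	upload := []byte("2009-01-01 12:00:00 service started\n")
	fmt.Println("first upload accepted:", r.Accept(upload))
	fmt.Println("repeat upload accepted:", r.Accept(upload))
}
```

The same checksum can double as an integrity check if the sending server transmits it alongside the file.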
We have a similar product on a smaller scale here. Our solution is to have the machines generating the log files push them to a NAS on a daily basis in a randomly staggered pattern. This solved a lot of the problems of a more pull-based method, including bunched-up read-write times that kept a server busy for days.
It doesn't sound like the storage server's bandwidth would be saturated, so you could pull from several clients at different locations in parallel. The main question is, what is the bottleneck that slows the whole process down?
I would do the following:
Write a program to run on each server, which will do the following:
Monitor the logs on the server
Compress them at a particular defined schedule
Pass information to the analysis server.
Write another program which sits on the core server which does the following:
Pulls compressed files when the network/cpu is not too busy.
(This can be multi-threaded.)
This uses the information passed to it from the end computers to determine which log to get next.
Uncompress and upload to your database continuously.
This should give you a solution which provides up to date information, with a minimum of downtime.
The downside will be relatively consistent network/computer use, but tbh that is often a good thing.
It will also allow easy management of the system, to detect any problems or issues which need resolving.
NetBIOS copies are not as fast as, say, FTP. The problem is that you don't want an FTP server on each server. If you can't process the log files locally on each server, another solution is to have all the servers upload the log files via FTP to a central location, which you can process from. For instance:
Set up an FTP server as a central collection point. Schedule tasks on each server to zip up the log files and FTP the archives to your central FTP server. You can write a program which automates the scheduling of the tasks remotely using a tool like schtasks.exe:
KB 814596: How to use schtasks.exe to Schedule Tasks in Windows Server 2003
You'll likely want to stagger the uploads back to the FTP server.
