Exchange files (up to many GB) - spring-boot

For my project, I have to create a file manager which aims at storing many files (from many locations) and exposing URL to download them.
In a micro-service ecosystem (I am used to use spring boot), I wonder what is the best way to exchange such files, I mean sending files to file manager?
On a one hand, I always thought it is better to exchange them asynchronously, so HTTP does not seem a good choice. But maybe I am wrong.
Is it a good choice to split files into fragments (in order to reduce number of bytes for each part) and send each of them through something like RabbitMQ or Kafka? Or should I rather transfer entire files on a NAS or through FTP and let file manager handling them? Or something else, like for example storing bytes in a temp database (maybe not a good choice)...
The problem of fragmentation is I have to implement a logic for keeping sort of each fragments which complicates processing of queues of topics.

IMO, never send actual files through a message broker.
First, setup some object storage system, for example S3 (with AWS or locally with Ceph), then send the path to the file as a string with the producer, then have the consumer read that path, and download the file.
If you want to collect files off of NAS or FTP, then Apache NiFi is one tool that has connectors to systems like that.

Based on my professional experience working with distributed systems (JMS based), to transfer huge content between participants:
a fragment approach should be used for request - reply model + control signals (has next, fragment counter)
delta approach for updates.
To avoid corrupt data, a hash function result can also be transmitted and checked in both scenarios.
But as mentioned in this e-mail thread, a better approach is to use FTP for this kind of scenarios:
RabbitMQ should actually not be used for big file transfers or only
with great care and fragmenting the files into smaller separate
messages.
When running a single broker instance, you'd still be safe, but in a
clustered setup, very big messages will break the cluster.
Clustered nodes are connected via 1 tcp connection, which must also
transport a (erlang) heartbeat. If your big message takes more time to
transfer between nodes than the heartbeat timeout (anywhere between
~20-45 seconds if I'm correct), the cluster will break and your
message is lost.
The preferred architecture for file transfer over amqp is to just send
a message with a link to a downloadable resource and let the file
transfer be handle by specialized protocol like ftp :-)
Hope it helps.

Related

Preventing data loss in client authoritative database writes

A project I'm working on requires users to insert themselves into a list on a server. We expect a few hundred users over a weekend and while very unlikely, a collision could happen in which two users submit the list concurrently and one of them is lost. The server has no validation, it simply allows you to get and put data.
I was pointed in the direction of "optimistic locking" but I'm having trouble grasping when exactly the data should be validated and how it prevents this from happening. If one of the clients reads the data, adds itself and then checks again to ensure that the data is the same with the use of an index or timestamp, how does this prevent the other client from doing the same and then one overwriting the other?
I'm trying to understand the flow in the context of two clients getting data and putting data.
The point of optimistic locking is that the decision to accept or reject a write is taken on the server, and is protected against concurrency by a pessimistic transaction or some sort of hardware protection, such as compare-and-swap. So a client requests a write together with some sort of timestamp or version identifier, and the server only accepts the write if the timestamp is still accurate. If it isn't the client gets some sort of rejection code and will have to try again. If it is, the client gets told that its write succeeded.
This is not the only way to handle receiving data from multiple clients. One popular alternative is to use a reliable messaging system - for example the Java Messaging Service specifies an interface for such systems for which you can find open source implementations. Clients write into the messaging system and can go away as soon as their message is accepted. The server reads requests from the messaging system and acts on them. If the server or the network goes down it's no big deal: the messages will still be there to be read when they come back (typically they are written to disk and have the same level of protection as database data although if you look at a reliable message queue implementation you may find that it is not, in fact, built on top of a standard database table).
One example of a writeup of the details of optimistic locking is the HTTP server Etag specification e.g. https://en.wikipedia.org/wiki/HTTP_ETag

producer consumer in golang - concurrency vs parallelism?

I am working on backend architecture which is purely in Golang. I have an API which is used to upload a file to golang server and then I am transferring the file to cloud storage(from the golang server itself). Now, I want both the transfers to be independent, so that, the end user should not has to wait for the response after uploading a file.
End User -> Golang Server ->[Concurrency/Parallelism] -> Cloud Storage
Now, I thought of two ways:
Create a goroutine as soon as the user finishes the upload and transfer the file to cloud.
Insert the file handler into a queue, and a different process would read this queue and transfer the file to cloud storage (Multiple producers - Single Consumer model).
I found examples of doing this using goroutine and channels but I think that would create as many goroutines as much there are uploads. I want to use the second option but not able to understand of how to go about it in golang?
Also, do suggest if I am using wrong approach and there is some other efficient method of doing this.
Update
Details about the requirement and constraint:
1. I am using AWS S3 as cloud storage. If at some point, the upload from Go server to Amazon S3 fails, the file handler should be kept as in to keep record of the failed upload.(I am not prioritising this, I might change this based on clients feedback)
2. The file will be deleted from the Go server as soon as the upload completes successfully to Amazon S3, so as to avoid repetitive uploads. Also, if a file is uploaded with same name, it will be replaced at Amazon S3.
3. As pointed out in comments, I can use channel as the queue. Is it possible to design the above architecture using Go's Channels and goroutines?
A User uploading a file could tolerate an error, and try again. But the danger exists when an uploaded file exists only on the machine it was uploaded to, and something goes wrong before it gets uploaded to cloud storage. In that case, the file would be lost, and it would be a bummer for the User.
This is solved by good architecture. It's a first-in, first out queue pattern.
A favorite Go implementation of this pattern is go-workers perhaps backed by a Redis database.
Assume there are n number of servers running your service at any given time. Assume that your backend code compiles two separate binaries, a server binary and a worker binary.
Ideally, the machines accepting file uploads would all mount a shared Network File System such that:
User uploads a file to a server
a. server adds a record into the work queue, which contains a unique ID from the Redis storage.
b. This unique ID is used to create the filename, and the file is piped directly from the User upload to temporary storage on NFS server. Note that the file never resides on the storage of the machine running the server.
File is uploaded to cloud storage by a worker
a. worker picks up the next to-do record from the work queue, which has a unique ID
b. Using the unique ID to find the file on NFS server, the worker uploads the file to cloud storage
c. When successful, worker updates the record in the work queue to reflect success
d. worker deletes the file on NFS server
By monitoring the server traffic and work queue size as two separate metrics, it can be determined how many servers ought to run the server/worker services respectively.
Marcio Castilho had written a good article on a similar problem. It can be found at Handling one million requests per minutes with golang.
He shows the mistakes that he made and the steps which he took to correct them. Good source to learn the use of channels, goroutines and concurrency in general.
go-workers mentioned by charneykaye is also excellent source.

how to read MQTT mosquitto server persisted DB file

I am using mosquitto server for MQTT protocol.
Using persistence setting in a configuration file with -c option, I am able to save the data.
However the file generated is binary one.
How would one be able to read that file?
Is there any specific tool available for that?
Appreciate your views.
Thanks!
Amit
Why do you want to read it?
The data is only kept there while messages (QOS1 or QOS2) are in flight to ensure they are not lost in transit while waiting for a response from the subscribed client.
Data may also be kept for clients that are disconnected but have persistent subscriptions (cleanSession=false) until that client reconnects.
If you are looking to persist all messages for later consumption you will have to write a client to subscribe and store this data in a DB of your choosing. One possible option to do this quickly and simply is Node-RED, but there are others and some brokers even have plugins for this e.g. HiveMQ.
If you really want to read it then you will probably have to write your own tool to do this based on the Mosquitto src code

How can I transfer bytes in chunks to clients?

SignalR loses many messages when I transfer chunks of bytes from client over server to client (or client to server; or server to client).
I read the file into a stream and sent it over a hub or persistent connection to other client. This runs very fast, but there are always messages dropped or lost.
How can I transfer large files (in chunks or not) from client to client without losing messages?
As #dfowler points out, it's not the right technology for the job. What I would recommend doing is sending a message that there is a file to be downloaded that includes the link and then you can download that file using standard GET requests against either static files or some web service written with ASP.NET WebAPI.
SignalR isn't for file transfer, it's for sending messages.
Why isn't it the right technology? If a client needs to send some data to a signalR hub it should be able to over the signalR connection without requiring additional stuff.
In fact it works fine when sending a byte array, at least for me, however I encountered similar problems when transferring chunks.
Perhaps you can do some tests to check if the order in which you send the chunks is the same as the order they are received.
UPDATE
I did a test myself and in my case the order was indeed the problem. Modified the hub method receiving the chunks to accept an order parameter which I then use to reconstruct the byte array at the end and it works fine. Having said this I however now understand that this wouldn't work well with large file transfers.
In my case I don't need to transfer very large amounts of data, I just wanted to give my UI an indication of progress, requiring the data to be sent in chunks.

Sending files over MSMQ

In a retail scenario where each stores report their daily transaction to the backend system at the end of the day. Today a file consisting of the daily transactions and some other meta information is transferred from the stores to the backend using FTP. I’m currently investigating replacing FTP with something else. MSMQ has been suggested as an alternative transport mechanism. So my question is, do we need to write a custom windows service that sticks the daily transactions file into a message object and sends it on its way or is there any out the box mechanism in MSMQ to handle this?
Also, since the files we want to transfer can reach 5-6 Mb for large stores should we rule out MSMQ? In that case is there any other suggested technologies we should investigate?
Cheers!
NServiceBus provides a nice abstraction over MSMQ for situations like this. You get the reliable messaging aspects of MSMQ, along with a very nice programming model for defining your messages.
MSMQ is limited to a 4MB message size, however, and there are two ways you could deal with this in NServiceBus:
NServiceBus has a concept called the Data Bus, which takes the large attachments in your messages and transmits them reliably using another method. This is handled by the infrastructure and as far as your message handlers are concerned, the data is just there.
You could break up the payload into smaller atomic messages and send them as normal messages. The NServiceBus infrastructure would ensure that they all arrive at their destination and are processed. I would recommend this method unless it's absolutely critical that the entire huge data dump is processed as one atomic transaction.
One other thing to note is that the fact that you do nightly dumps is probably a limitation of a previous system. With NServiceBus it may be possible to change the system so that these bits of information are sent in a more immediate fashion, which will result in much more up-to-date data all the time, which may be a big win for the business.
You can look at IBM Sterling Managed File Transfer and WebSphere MQ Managed File Transfer products.
You can consider WebSphere MQ MFT if you require both messaging and file transfer capabilities. On the other hand if your requirement is just file transfer then you can look at Sterling MFT.
Sending files over a messaging transport is not trivial. If you put the entire file into a single message you can have the atomicity you need but tuning the messaging provider for wide variance in message sizes can be challenging. If all the files are of about the same size, one per message is about the simplest solution.
On the other hand, you can split the files into multiple messages but then you've got to reassemble them, in the right order, include a protocol to detect and resend missing segments, integrity-check the file received against the file sent, etc. You also probably want to check that the files on either end did not change during the transmission.
With any of these systems you also need the system to be smart enough to manage the disposition of sending and receiving files under normal and exception conditions, log the transfers, etc.
So when considering whether to move to messaging the two best options are either to move natively to messaging and give up files altogether, or to use an enterprise managed file transfer solution that runs atop the messaging provider that you choose. None of the off-the-shelf MFT products will cost as much in the long run as developing it yourself if you wish to do it right with robust exception handling and reporting.
If the stores are on separate networks and communicating over the internet, then MSMQ is not really an option. NServiceBus provides a concept of a gateway, which allows to asynchronously transport MSMQ messages over HTTP or HTTPS.

Resources