producer consumer in golang - concurrency vs parallelism?

I am working on a backend architecture which is purely in Golang. I have an API which is used to upload a file to the Golang server, and then I transfer the file to cloud storage (from the Golang server itself). Now, I want both transfers to be independent, so that the end user does not have to wait for a response after uploading a file.
End User -> Golang Server ->[Concurrency/Parallelism] -> Cloud Storage
Now, I thought of two ways:
Create a goroutine as soon as the user finishes the upload and transfer the file to the cloud.
Insert the file handler into a queue, and a different process would read this queue and transfer the file to cloud storage (Multiple producers - Single Consumer model).
I found examples of doing this using goroutines and channels, but I think that would create as many goroutines as there are uploads. I want to use the second option, but I am not able to understand how to go about it in Golang.
Also, do suggest if I am using wrong approach and there is some other efficient method of doing this.
Update
Details about the requirement and constraint:
1. I am using AWS S3 as cloud storage. If at some point the upload from the Go server to Amazon S3 fails, the file handler should be kept as is, to keep a record of the failed upload. (I am not prioritising this; I might change it based on client feedback.)
2. The file will be deleted from the Go server as soon as the upload to Amazon S3 completes successfully, so as to avoid repeated uploads. Also, if a file is uploaded with the same name, it will be replaced at Amazon S3.
3. As pointed out in the comments, I can use a channel as the queue. Is it possible to design the above architecture using Go's channels and goroutines?

A User uploading a file could tolerate an error, and try again. But the danger exists when an uploaded file exists only on the machine it was uploaded to, and something goes wrong before it gets uploaded to cloud storage. In that case, the file would be lost, and it would be a bummer for the User.
This is solved by good architecture. It's a first-in, first-out queue pattern.
A favorite Go implementation of this pattern is go-workers perhaps backed by a Redis database.
Assume there are n number of servers running your service at any given time. Assume that your backend code compiles two separate binaries, a server binary and a worker binary.
Ideally, the machines accepting file uploads would all mount a shared Network File System such that:
User uploads a file to a server
a. server adds a record into the work queue, which contains a unique ID from the Redis storage.
b. This unique ID is used to create the filename, and the file is piped directly from the User upload to temporary storage on the NFS server. Note that the file never resides on the storage of the machine running the server.
File is uploaded to cloud storage by a worker (a minimal sketch of this worker loop follows at the end of this answer)
a. worker picks up the next to-do record from the work queue, which has a unique ID
b. Using the unique ID to find the file on the NFS server, the worker uploads the file to cloud storage
c. When successful, worker updates the record in the work queue to reflect success
d. worker deletes the file on the NFS server
By monitoring the server traffic and work queue size as two separate metrics, it can be determined how many servers ought to run the server/worker services respectively.
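A library like go-workers adds retries, scheduling and visibility on top of a loop like this; a hand-rolled, minimal sketch of the worker steps (a-d) above might look as follows, assuming Redis holds the work queue and assuming placeholder values for the queue key, NFS mount path and bucket name:
```go
package main

import (
	"context"
	"log"
	"os"
	"path/filepath"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3/s3manager"
	"github.com/redis/go-redis/v9"
)

func main() {
	ctx := context.Background()
	rdb := redis.NewClient(&redis.Options{Addr: "localhost:6379"}) // placeholder address
	uploader := s3manager.NewUploader(session.Must(session.NewSession()))

	for {
		// (a) Block until the next to-do record (a unique ID) appears on the queue.
		res, err := rdb.BLPop(ctx, 0, "upload-queue").Result()
		if err != nil {
			log.Printf("queue read failed: %v", err)
			continue
		}
		id := res[1]

		// (b) The unique ID is also the filename on the shared NFS mount.
		path := filepath.Join("/mnt/nfs/uploads", id) // placeholder mount point
		f, err := os.Open(path)
		if err != nil {
			log.Printf("open %s: %v", path, err)
			continue
		}

		_, err = uploader.Upload(&s3manager.UploadInput{
			Bucket: aws.String("my-bucket"), // placeholder bucket
			Key:    aws.String(id),
			Body:   f,
		})
		f.Close()
		if err != nil {
			// Push the ID back so the upload is retried later; the file stays on NFS.
			log.Printf("upload %s: %v", id, err)
			rdb.RPush(ctx, "upload-queue", id)
			continue
		}

		// (c) Mark success (here, in a simple status hash) and (d) remove the temp file.
		rdb.HSet(ctx, "upload-status", id, "done")
		os.Remove(path)
	}
}
```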

Marcio Castilho has written a good article on a similar problem. It can be found at Handling 1 million requests per minute with Go.
He shows the mistakes that he made and the steps which he took to correct them. Good source to learn the use of channels, goroutines and concurrency in general.
go-workers, mentioned by charneykaye, is also an excellent source.
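For the in-process option the question asks about, here is a minimal sketch of a bounded worker pool with a buffered channel as the queue, so the number of consumer goroutines stays fixed no matter how many uploads arrive; the queue size, worker count and the uploadToCloud function are placeholder choices:
```go
package main

import (
	"log"
	"net/http"
)

// job describes one uploaded file waiting to be transferred to cloud storage.
type job struct {
	Path string // where the upload handler stored the file locally
}

// jobs is the bounded queue shared by all request handlers (producers)
// and the fixed pool of workers (consumers).
var jobs = make(chan job, 100)

// uploadToCloud is a placeholder for the real S3 transfer.
func uploadToCloud(j job) error {
	log.Printf("uploading %s to cloud storage", j.Path)
	return nil
}

func worker(id int) {
	for j := range jobs {
		if err := uploadToCloud(j); err != nil {
			log.Printf("worker %d: upload %s failed: %v", id, j.Path, err)
			continue
		}
		// Delete the local copy only after a successful transfer.
	}
}

func uploadHandler(w http.ResponseWriter, r *http.Request) {
	// ... save the request body to a local file, e.g. /tmp/upload-123 (placeholder) ...
	jobs <- job{Path: "/tmp/upload-123"} // enqueue and return immediately
	w.WriteHeader(http.StatusAccepted)
}

func main() {
	// Start a fixed number of consumers instead of one goroutine per upload.
	for i := 0; i < 4; i++ {
		go worker(i)
	}
	http.HandleFunc("/upload", uploadHandler)
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```
Note that when the buffered channel is full, the handler blocks until a worker frees a slot, which acts as simple back-pressure.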

Related

Microservice failure Scenario

I am working on a microservice architecture. One of my services is exposed to a source system, which is used to post the data. This microservice publishes the data to Redis (I am using Redis pub/sub), and it is then consumed by a couple of other microservices.
Now, if another microservice is down and not able to process the data from Redis pub/sub, then I have to retry with the published data when that microservice comes up. The source cannot push the data again, and manual intervention is not possible, so I thought of three approaches:
1. Additionally, using Redis itself for storing and retrieving the data.
2. Using a database to store the data before publishing. I have many source and target microservices which use Redis pub/sub. With this approach, every time I would have to insert the request in the DB first and then its response status. I would also have to use a shared database; this approach adds a couple more exception-handling cases and does not look very efficient to me.
3. Using Kafka in place of Redis pub/sub. As the traffic is low I used Redis pub/sub, and it is not feasible to change.
In both of the above cases, I have to use a scheduler, and I have a duration before which I have to retry, else the subsequent request will fail.
Is there any other way to handle the above cases?
For point 2:
- Store the data in the DB.
- Create a daemon process which will process the data from the table.
- This daemon process can be configured as per your needs.
- The daemon process will poll the DB and publish the data, if any. Also, it will delete the data once published.
Not in a microservice architecture, but I have seen this approach working efficiently while communicating with 3rd-party services.
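A minimal sketch of such a daemon in Go, assuming a Postgres "outbox" table and Redis pub/sub; the table name, channel, poll interval and connection strings are placeholders:
```go
package main

import (
	"context"
	"database/sql"
	"log"
	"time"

	_ "github.com/lib/pq" // placeholder driver; use whatever DB holds the outbox
	"github.com/redis/go-redis/v9"
)

func main() {
	ctx := context.Background()
	db, err := sql.Open("postgres", "postgres://user:pass@localhost/app?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	rdb := redis.NewClient(&redis.Options{Addr: "localhost:6379"})

	for range time.Tick(5 * time.Second) { // configurable poll interval
		rows, err := db.Query(`SELECT id, payload FROM outbox ORDER BY id LIMIT 100`)
		if err != nil {
			log.Printf("poll: %v", err)
			continue
		}
		type rec struct {
			id      int64
			payload string
		}
		var batch []rec
		for rows.Next() {
			var r rec
			if err := rows.Scan(&r.id, &r.payload); err == nil {
				batch = append(batch, r)
			}
		}
		rows.Close()

		for _, r := range batch {
			// Publish to the pub/sub channel; delete the row only on success,
			// so failed publishes are retried on the next tick.
			if err := rdb.Publish(ctx, "events", r.payload).Err(); err != nil {
				log.Printf("publish %d: %v", r.id, err)
				break
			}
			if _, err := db.Exec(`DELETE FROM outbox WHERE id = $1`, r.id); err != nil {
				log.Printf("delete %d: %v", r.id, err)
			}
		}
	}
}
```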
At the very outset, as you mentioned, we do indeed seem to have only three possibilities.
This is one of those situations where you want to get a handshake from the service after pushing and after processing. To accomplish that, using a middleware queuing system would be the right choice.
Although a bit more complex to set up, what you can do is use Kafka for streaming this. Configuring the producer and consumer groups properly can help you do the job smoothly (a minimal sketch of the consumer side follows below).
Using a DB for storage would be overkill in a situation where the data only needs to be processed, not permanently persisted.
BUT, alternatively, storing the data in Redis and reading it in a cron job/scheduled job would make your job much simpler. Once the job runs successfully, you can remove the data from the cache and thus save Redis memory.
If you can comment further on the architecture and the implementation, I can update my answer accordingly. :)
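A minimal sketch of the consumer-group side mentioned above, using the segmentio/kafka-go client (broker address, topic and group ID are assumptions); because the offset is committed only after successful processing, a consumer that was down simply resumes from its last committed offset when it comes back, which gives you the retry behaviour without the source re-pushing:
```go
package main

import (
	"context"
	"log"

	"github.com/segmentio/kafka-go"
)

// process is a placeholder for whatever the target microservice does with the data.
func process(data []byte) error {
	log.Printf("processing %s", data)
	return nil
}

func main() {
	ctx := context.Background()
	r := kafka.NewReader(kafka.ReaderConfig{
		Brokers: []string{"localhost:9092"}, // placeholder broker
		Topic:   "source-data",              // placeholder topic
		GroupID: "target-service",           // each target microservice uses its own group
	})
	defer r.Close()

	for {
		// FetchMessage does not advance the group offset by itself.
		m, err := r.FetchMessage(ctx)
		if err != nil {
			log.Printf("fetch: %v", err)
			continue
		}
		if err := process(m.Value); err != nil {
			// Not committing means the message is re-delivered after a restart or rebalance.
			log.Printf("process failed, will retry: %v", err)
			continue
		}
		// Commit only after successful processing (the "handshake").
		if err := r.CommitMessages(ctx, m); err != nil {
			log.Printf("commit: %v", err)
		}
	}
}
```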

Exchange files (up to many GB)

For my project, I have to create a file manager which aims at storing many files (from many locations) and exposing URLs to download them.
In a micro-service ecosystem (I am used to using Spring Boot), I wonder what the best way is to exchange such files, I mean to send files to the file manager.
On the one hand, I have always thought it is better to exchange them asynchronously, so HTTP does not seem a good choice. But maybe I am wrong.
Is it a good choice to split files into fragments (in order to reduce the number of bytes for each part) and send each of them through something like RabbitMQ or Kafka? Or should I rather transfer entire files to a NAS or through FTP and let the file manager handle them? Or something else, like storing the bytes in a temporary database (maybe not a good choice)...
The problem with fragmentation is that I have to implement logic for keeping the fragments in order, which complicates the processing of the queues or topics.
IMO, never send actual files through a message broker.
First, set up some object storage system, for example S3 (with AWS, or locally with Ceph), then send the path to the file as a string with the producer, and have the consumer read that path and download the file (a minimal sketch follows below).
If you want to collect files off of a NAS or FTP, then Apache NiFi is one tool that has connectors to systems like that.
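A minimal sketch of that idea, using Kafka as an example broker (segmentio/kafka-go) and S3 for the bytes; the bucket, topic, object key and local filename are assumptions: the producer uploads the file to object storage and publishes only the path, and the file-manager consumer downloads it from that path.
```go
package main

import (
	"context"
	"log"
	"os"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3/s3manager"
	"github.com/segmentio/kafka-go"
)

func main() {
	ctx := context.Background()

	// 1. Put the file itself into object storage, never into the broker.
	f, err := os.Open("report.pdf") // placeholder local file
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	uploader := s3manager.NewUploader(session.Must(session.NewSession()))
	key := "incoming/report.pdf" // placeholder object key
	if _, err := uploader.Upload(&s3manager.UploadInput{
		Bucket: aws.String("file-manager-bucket"), // placeholder bucket
		Key:    aws.String(key),
		Body:   f,
	}); err != nil {
		log.Fatal(err)
	}

	// 2. Publish only the path/key through the broker.
	w := &kafka.Writer{
		Addr:  kafka.TCP("localhost:9092"), // placeholder broker
		Topic: "file-events",               // placeholder topic
	}
	defer w.Close()
	if err := w.WriteMessages(ctx, kafka.Message{
		Key:   []byte(key),
		Value: []byte("s3://file-manager-bucket/" + key),
	}); err != nil {
		log.Fatal(err)
	}
	// The file-manager consumer reads this message, downloads the object from
	// that path, and exposes its own download URL for it.
}
```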
Based on my professional experience working with distributed systems (JMS-based), to transfer huge content between participants:
- a fragment approach should be used for the request-reply model, plus control signals (has-next, fragment counter)
- a delta approach should be used for updates.
To avoid corrupt data, a hash of the content can also be transmitted and checked in both scenarios (a minimal sketch follows after the quote below).
But as mentioned in this e-mail thread, a better approach is to use FTP for these kinds of scenarios:
RabbitMQ should actually not be used for big file transfers or only
with great care and fragmenting the files into smaller separate
messages.
When running a single broker instance, you'd still be safe, but in a
clustered setup, very big messages will break the cluster.
Clustered nodes are connected via 1 TCP connection, which must also
transport an (Erlang) heartbeat. If your big message takes more time to
transfer between nodes than the heartbeat timeout (anywhere between
~20-45 seconds if I'm correct), the cluster will break and your
message is lost.
The preferred architecture for file transfer over AMQP is to just send
a message with a link to a downloadable resource and let the file
transfer be handled by a specialized protocol like FTP :-)
Hope it helps.
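For completeness, if you do go the fragmentation route mentioned earlier, a minimal sketch of the per-fragment metadata (counter, has-next control signal) and integrity check might look like this; all field names are assumptions:
```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// Fragment is one piece of a large file sent as a separate message.
type Fragment struct {
	FileID  string // identifies which file this fragment belongs to
	Counter int    // position of the fragment, used to reassemble in order
	HasNext bool   // control signal: false on the last fragment
	Hash    string // SHA-256 of Data, checked on the consumer side
	Data    []byte
}

// NewFragment builds a fragment and stamps it with a content hash.
func NewFragment(fileID string, counter int, hasNext bool, data []byte) Fragment {
	sum := sha256.Sum256(data)
	return Fragment{
		FileID:  fileID,
		Counter: counter,
		HasNext: hasNext,
		Hash:    hex.EncodeToString(sum[:]),
		Data:    data,
	}
}

// Verify recomputes the hash on the consumer side to detect corrupt data.
func (f Fragment) Verify() bool {
	sum := sha256.Sum256(f.Data)
	return hex.EncodeToString(sum[:]) == f.Hash
}

func main() {
	frag := NewFragment("report.pdf", 0, true, []byte("first chunk"))
	fmt.Println("fragment ok:", frag.Verify())
}
```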

Spring boot applications high availability

We have a microservice which is developed using Spring Boot. A couple of the functionalities it implements are:
1) A scheduler that triggers, at a specified time, a file download using WebHDFS and processes it; once the data is processed, it sends an email to users with a data-processing summary.
2) Reading messages from Kafka and, once the data is read, sending an email to users.
We are now planning to make this application highly available, in either an active-active or active-passive setup. The problem we are facing is that if both instances of the application are running, then both of them will try to download the file / read the data from Kafka, process it and send the emails. How can this be avoided? I mean, how do we ensure that only one instance triggers the download and processes it?
Please let me know if there is a known solution for this kind of scenario, as it seems to be common in most projects. Is a master-slave/leader-election approach the correct solution?
Thanks
Let the service download the file, extract the information and publish it via Kafka.
Check beforehand if the information was already processed by querying Kafka or a local DB.
You could also publish a DataProcessed event that triggers the EmailService, which sends the corresponding e-mail.
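A minimal sketch of the "check beforehand" idea (written in Go here for brevity), assuming a shared table with a unique key per processed item; only the instance that wins the insert goes on to process the data and send the email, which avoids leader election for this particular problem:
```go
package main

import (
	"database/sql"
	"log"

	_ "github.com/lib/pq" // placeholder driver/DB
)

// claim records that this instance is handling the given item (a file name,
// or a Kafka topic/partition/offset). A UNIQUE constraint on item_key makes
// the insert succeed for exactly one instance.
func claim(db *sql.DB, itemKey string) (bool, error) {
	res, err := db.Exec(
		`INSERT INTO processed_items (item_key) VALUES ($1) ON CONFLICT DO NOTHING`,
		itemKey,
	)
	if err != nil {
		return false, err
	}
	n, err := res.RowsAffected()
	return n == 1, err
}

func main() {
	db, err := sql.Open("postgres", "postgres://user:pass@localhost/app?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}

	ok, err := claim(db, "webhdfs:/data/2024-01-01/report.csv") // placeholder key
	if err != nil {
		log.Fatal(err)
	}
	if !ok {
		log.Println("another instance already processed this item, skipping")
		return
	}
	// ... download, process, publish a DataProcessed event, send the e-mail ...
}
```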

Concurrent batch jobs writing logs to database

My production system has nearly 170 Ctrl-M jobs (essentially cron jobs) running every day. These jobs are woven together (by creating dependencies) to perform ETL operations. E.g. Ctrl-M (a scheduler, like cron) almost always starts with a shell script, which then executes a bunch of Python, Hive scripts or map-reduce jobs in a specific order.
I am trying to implement logging in each of these processes to be able to better monitor the tasks and the pipelines as a whole. The logs would be used to build a monitoring dashboard.
Currently I have implemented logging using a central wrapper which is called by each of the processes to log information. This wrapper in turn opens a Teradata connection EACH time and calls a Teradata stored procedure to write into a Teradata table.
This works fine for now. But in my case, multiple concurrent processes (spawning even more parallel child processes) run at the same time, and I have started experiencing dropped connections while doing some load testing. Below are the approaches I have been thinking about:
1. Make the processes write to some kind of message queue (e.g. AWS SQS). A listener would pick data from these message queues asynchronously and then batch-write it to Teradata.
2. Use files or some other structure to perform batch writes to the Teradata DB.
I would definitely like to hear your thoughts on these or any other better approaches. Eventually the end point of logging will be shifted to Redshift, hence I am thinking along the lines of AWS SQS queues.
Thanks in advance.
I think Kinesis Firehose is the perfect solution for this. Setting up a Firehose delivery stream is incredibly quick and easy to configure, very inexpensive, and it will stream your data to an S3 bucket of your choice and optionally stream your logs directly to Redshift.
If Redshift is your end goal (or even just S3), Kinesis Firehose couldn't make it easier (a minimal sketch follows after the quoted description below).
https://aws.amazon.com/kinesis/firehose/
Amazon Kinesis Firehose is the easiest way to load streaming data into
AWS. It can capture and automatically load streaming data into Amazon
S3 and Amazon Redshift, enabling near real-time analytics with
existing business intelligence tools and dashboards you’re already
using today. It is a fully managed service that automatically scales
to match the throughput of your data and requires no ongoing
administration. It can also batch, compress, and encrypt the data
before loading it, minimizing the amount of storage used at the
destination and increasing security. You can easily create a Firehose
delivery stream from the AWS Management Console, configure it with a
few clicks, and start sending data to the stream from hundreds of
thousands of data sources to be loaded continuously to AWS – all in
just a few minutes.
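A minimal sketch of sending one log record to a Firehose delivery stream with the AWS SDK for Go; the stream name, region and record format are assumptions, and Firehose then batches the records and loads them into S3/Redshift on its side:
```go
package main

import (
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/firehose"
)

func main() {
	sess := session.Must(session.NewSession(&aws.Config{
		Region: aws.String("us-east-1"), // placeholder region
	}))
	fh := firehose.New(sess)

	// One JSON log line per record; batching, compression and delivery to
	// S3/Redshift are handled by the Firehose service itself.
	line := `{"job":"etl_load_customers","status":"OK","duration_ms":5321}` + "\n"
	_, err := fh.PutRecord(&firehose.PutRecordInput{
		DeliveryStreamName: aws.String("ctrlm-job-logs"), // placeholder stream name
		Record:             &firehose.Record{Data: []byte(line)},
	})
	if err != nil {
		log.Fatal(err)
	}
}
```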

how to read MQTT mosquitto server persisted DB file

I am using the Mosquitto server for the MQTT protocol.
Using the persistence setting in a configuration file passed with the -c option, I am able to save the data.
However, the generated file is a binary one.
How would one be able to read that file?
Is there any specific tool available for that?
Appreciate your views.
Thanks!
Amit
Why do you want to read it?
The data is only kept there while messages (QoS 1 or QoS 2) are in flight, to ensure they are not lost in transit while waiting for a response from the subscribed client.
Data may also be kept for clients that are disconnected but have persistent subscriptions (cleanSession=false), until those clients reconnect.
If you are looking to persist all messages for later consumption, you will have to write a client to subscribe and store this data in a DB of your choosing (a minimal sketch follows below). One possible option to do this quickly and simply is Node-RED, but there are others, and some brokers even have plugins for this, e.g. HiveMQ.
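A minimal sketch of such a client using the Eclipse Paho Go client; the broker address, topic filter and the store step are assumptions:
```go
package main

import (
	"log"

	mqtt "github.com/eclipse/paho.mqtt.golang"
)

// store is a placeholder for writing the message into the DB of your choosing.
func store(topic string, payload []byte) {
	log.Printf("storing %s: %s", topic, payload)
}

func main() {
	opts := mqtt.NewClientOptions().
		AddBroker("tcp://localhost:1883"). // placeholder broker address
		SetClientID("archiver").
		SetCleanSession(false) // persistent subscription, survives disconnects

	client := mqtt.NewClient(opts)
	if token := client.Connect(); token.Wait() && token.Error() != nil {
		log.Fatal(token.Error())
	}

	// Subscribe with QoS 1 so the broker queues messages while this client is offline.
	token := client.Subscribe("#", 1, func(c mqtt.Client, msg mqtt.Message) {
		store(msg.Topic(), msg.Payload())
	})
	if token.Wait() && token.Error() != nil {
		log.Fatal(token.Error())
	}

	select {} // block forever
}
```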
If you really want to read it, then you will probably have to write your own tool to do this, based on the Mosquitto source code.
