I am using a library called cryptofeed to access streaming real-time market data from crypto exchanges. There is a lot of data once I open the websocket connection. I want to store this data on S3 as a data lake, but I worry that creating a file out of the streaming data and sending it to S3 will be too slow -- i.e., creating the file and uploading it to S3 will be slower than the incoming stream, eventually causing my application to run out of memory. Is this a problem? If so, what can I do? Should I use something like Kafka or Kinesis as a buffer for this streaming data? If so, how do I know that sending the data to Kafka or Kinesis will be fast enough to keep up with the incoming stream?
This is the library: https://github.com/bmoscon/cryptofeed
I wrote the library in question. I would recommend buffering the data in something like Redis; you can also use Kafka. Check out the backends module in the library -- it provides interfaces that will store the data for you in Kafka, Redis, and other destinations. You could also buffer to a file and write it out periodically; check out the AsyncFile implementation (also in the library).
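For illustration, here is a minimal sketch of wiring the Redis backend into a feed handler. The exact class and argument names (TradeRedis, symbols, host/port, etc.) vary between cryptofeed releases, so treat this as a sketch and check the backends module of your installed version:

    # Sketch: stream Coinbase trades into Redis via cryptofeed's Redis backend.
    # Class and argument names may differ between cryptofeed versions -- check
    # cryptofeed/backends/redis.py in your installed release.
    from cryptofeed import FeedHandler
    from cryptofeed.defines import TRADES
    from cryptofeed.exchanges import Coinbase
    from cryptofeed.backends.redis import TradeRedis

    def main():
        f = FeedHandler()
        # Trades are pushed into Redis as they arrive; a separate consumer can
        # drain Redis in batches at its own pace.
        f.add_feed(Coinbase(channels=[TRADES],
                            symbols=['BTC-USD'],
                            callbacks={TRADES: TradeRedis(host='127.0.0.1', port=6379)}))
        f.run()

    if __name__ == '__main__':
        main()

A separate consumer process can then drain Redis in batches and upload the batched files to S3 at whatever pace the upload path can sustain.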
Related
My web application ingests data from a third-party source and aims to display this data to web clients over a websocket in real time. The third-party clients push data into the backend over an HTTP endpoint. The data store inside the Golang backend is temporary in nature, just a global variable: a slice keeping the third-party client message content.
Since the global slice can be written to (by the third-party ingestion endpoint) and read from (by threads that send the ingested data to web-app websockets) at any point in time, reads and writes to the message store are protected with a mutex. The slice could grow and get rearranged in memory at any time.
A "long-term mutex" lock issue arises here. There's a thread that needs to:
read data from the memory state
write the data to a particular web client websocket (potentially a lengthy operation)
Are there any general patterns that elegantly deal with this type of problem?
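The question is about Go, but the shape of the store is language-agnostic. Here is a hypothetical Python sketch of the described setup in which the slow websocket write happens on a snapshot taken under the lock, rather than while the lock is held (all names are illustrative):

    # Hypothetical sketch of the described message store: writers append under a
    # lock, readers copy a snapshot under the lock, and the slow websocket send
    # happens *after* the lock is released, so it is never held during network I/O.
    import threading

    class MessageStore:
        def __init__(self):
            self._lock = threading.Lock()
            self._messages = []            # equivalent of the global slice

        def append(self, msg):
            with self._lock:               # short critical section: just the append
                self._messages.append(msg)

        def snapshot(self):
            with self._lock:               # short critical section: just the copy
                return list(self._messages)

    def push_to_client(store, websocket_send):
        # websocket_send stands in for the (possibly lengthy) client write.
        batch = store.snapshot()           # lock is released before any network I/O
        for msg in batch:
            websocket_send(msg)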
For my project, I have to create a file manager that stores many files (from many locations) and exposes URLs to download them.
In a microservice ecosystem (I usually use Spring Boot), I wonder what the best way is to exchange such files, i.e., to send files to the file manager.
On the one hand, I have always thought it is better to exchange them asynchronously, so HTTP does not seem like a good choice. But maybe I am wrong.
Is it a good choice to split files into fragments (in order to reduce the number of bytes in each part) and send each of them through something like RabbitMQ or Kafka? Or should I rather transfer entire files to a NAS or through FTP and let the file manager handle them? Or something else, for example storing the bytes in a temporary database (probably not a good choice)?
The problem with fragmentation is that I have to implement logic to keep track of the order of the fragments, which complicates processing of the queues or topics.
IMO, never send actual files through a message broker.
First, set up an object storage system, for example S3 (on AWS, or locally with Ceph). Then send the path to the file as a string from the producer, have the consumer read that path, and download the file.
If you want to collect files off a NAS or an FTP server, Apache NiFi is one tool that has connectors to systems like that.
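As a rough illustration of the pattern in the first paragraph, here is a hedged Python sketch (the bucket name, topic name, and broker address are made up, and the asker's stack is Spring Boot, but the idea is the same in any language): the producer uploads the file to object storage and publishes only its key; the consumer reads the key and downloads the file.

    # Sketch: pass a file *reference* through the broker, not the file bytes.
    # Assumes boto3 and kafka-python; bucket/topic/broker values are examples.
    import boto3
    from kafka import KafkaProducer, KafkaConsumer

    BUCKET = 'file-manager-bucket'   # example bucket name
    TOPIC = 'file-events'            # example topic name

    def publish_file(path, key):
        s3 = boto3.client('s3')
        s3.upload_file(path, BUCKET, key)             # the file goes to object storage
        producer = KafkaProducer(bootstrap_servers='localhost:9092')
        producer.send(TOPIC, key.encode('utf-8'))     # only the key goes through Kafka
        producer.flush()

    def consume_files(download_dir='/tmp'):
        s3 = boto3.client('s3')
        consumer = KafkaConsumer(TOPIC, bootstrap_servers='localhost:9092')
        for record in consumer:
            key = record.value.decode('utf-8')
            s3.download_file(BUCKET, key, f'{download_dir}/{key}')  # fetch by reference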
Based on my professional experience working with distributed systems (JMS based), to transfer large content between participants:
a fragment approach should be used for the request-reply model, plus control signals (a has-next flag and a fragment counter)
a delta approach should be used for updates.
To avoid corrupted data, a hash of the content can also be transmitted and checked in both scenarios.
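A minimal sketch of the fragment-plus-hash idea (the field names are illustrative, not a specific JMS API): the sender splits the payload into numbered fragments with a has-next flag, and the receiver reassembles them and verifies a checksum.

    # Illustrative sketch of the fragment approach: numbered fragments with a
    # has-next control signal, plus a hash of the whole payload for integrity.
    import hashlib

    CHUNK_SIZE = 256 * 1024   # example fragment size

    def make_fragments(payload: bytes):
        digest = hashlib.sha256(payload).hexdigest()
        chunks = [payload[i:i + CHUNK_SIZE] for i in range(0, len(payload), CHUNK_SIZE)]
        for counter, chunk in enumerate(chunks):
            yield {
                'fragment_counter': counter,                # control signal
                'has_next': counter < len(chunks) - 1,      # control signal
                'sha256': digest,                           # checked after reassembly
                'data': chunk,
            }

    def reassemble(fragments):
        ordered = sorted(fragments, key=lambda f: f['fragment_counter'])
        payload = b''.join(f['data'] for f in ordered)
        if hashlib.sha256(payload).hexdigest() != ordered[0]['sha256']:
            raise ValueError('corrupted transfer: hash mismatch')
        return payload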
But as mentioned in this e-mail thread, a better approach is to use FTP for this kind of scenario:
RabbitMQ should actually not be used for big file transfers, or only with great care and by fragmenting the files into smaller separate messages.
When running a single broker instance, you'd still be safe, but in a clustered setup, very big messages will break the cluster. Clustered nodes are connected via one TCP connection, which must also transport an (Erlang) heartbeat. If your big message takes more time to transfer between nodes than the heartbeat timeout (anywhere between ~20-45 seconds if I'm correct), the cluster will break and your message is lost.
The preferred architecture for file transfer over AMQP is to just send a message with a link to a downloadable resource and let the file transfer be handled by a specialized protocol like FTP :-)
Hope it helps.
I need to get some real-time data from a third-party provider, transform it, and push it to the browser via websockets.
The whole procedure should not take more than 200 ms from the time I receive the data until the time the browser gets it.
I am thinking of going from Pub/Sub into Dataflow and back into Pub/Sub, where a websocket server will subscribe and push the messages to the browsers.
Is this approach correct, or is Dataflow not designed for something like this?
Dataflow is designed for reliable streaming aggregation and analytics; it is not designed for guaranteed sub-second latencies through the system. The core primitives like windowing and triggering allow for reliable processing of streams over defined windows of data despite late data and potential machine or pipeline errors. The main use case we have optimized for is, for example, aggregating and outputting statistics over a stream of data: outputting reliable statistics for each window, logging to disk for fault tolerance, and waiting before triggering, if necessary, to accommodate late data. As such, we have not optimized for the end-to-end latency you require.
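To make the windowing and triggering point concrete, here is a rough Apache Beam (Python SDK) sketch of the kind of windowed aggregation Dataflow is optimized for; the trigger and lateness settings are illustrative only and are not from the original answer.

    # Rough sketch of the windowed-aggregation use case Dataflow targets:
    # fixed one-minute windows that fire at the watermark and again for late data.
    # Settings are illustrative; API details may differ between SDK versions.
    import apache_beam as beam
    from apache_beam.transforms import window, trigger

    def windowed_counts(events):
        # events: a streaming PCollection of (key, value) pairs.
        return (
            events
            | 'Window' >> beam.WindowInto(
                window.FixedWindows(60),                               # 1-minute windows
                trigger=trigger.AfterWatermark(late=trigger.AfterCount(1)),
                accumulation_mode=trigger.AccumulationMode.ACCUMULATING,
                allowed_lateness=300)                                  # accept 5 min of late data
            | 'Sum' >> beam.CombinePerKey(sum)                         # per-window aggregate
        )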
We are receiving data as HTTP POST messages from a number of servers. We want to receive the messages, do some pre-processing, and then write them to HDFS. What are the best options for operating on real-time data streams?
Some options I have read about: Flume, Kafka, Spark Streaming. How do I connect the pieces?
It's hard to say because the question is too general. I can briefly describe our pipeline, because we do the exact same thing. We have a few NodeJS HTTP servers that send all incoming requests to Kafka. Then we use Samza to preprocess the data: Samza reads the messages from Kafka and writes them back to Kafka (to another topic). Finally, we use Camus to transfer the data from Kafka to HDFS (Camus is deprecated by now); you can also use Kafka Connect to transfer data from Kafka to HDFS.
Both Samza and Kafka are (or were) LinkedIn projects, so it is easy to set up this architecture, and Samza takes advantage of some Kafka features.
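As a small illustration of the ingestion step (the answer above uses NodeJS; this is a hedged Python equivalent, and the topic name and broker address are made up): an HTTP endpoint that simply forwards every POST body to Kafka, leaving preprocessing to the downstream consumers.

    # Sketch of the ingestion step: an HTTP endpoint that forwards every POST
    # body straight to a Kafka topic for downstream preprocessing.
    # Topic name and broker address are examples.
    from flask import Flask, request
    from kafka import KafkaProducer

    app = Flask(__name__)
    producer = KafkaProducer(bootstrap_servers='localhost:9092')

    @app.route('/ingest', methods=['POST'])
    def ingest():
        # Hand the raw payload to Kafka; preprocessing happens downstream
        # (Samza, Spark Streaming, etc.), not in the HTTP handler.
        producer.send('raw-events', request.get_data())
        return '', 202   # accepted for asynchronous processing

    if __name__ == '__main__':
        app.run(port=8080)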
I am using the Mosquitto server for the MQTT protocol.
Using the persistence setting in a configuration file (loaded with the -c option), I am able to save the data.
However, the generated file is a binary one.
How would one be able to read that file?
Is there any specific tool available for that?
Appreciate your views.
Thanks!
Amit
Why do you want to read it?
The data is only kept there while messages (QOS1 or QOS2) are in flight to ensure they are not lost in transit while waiting for a response from the subscribed client.
Data may also be kept for clients that are disconnected but have persistent subscriptions (cleanSession=false) until that client reconnects.
If you are looking to persist all messages for later consumption, you will have to write a client to subscribe and store this data in a DB of your choosing. One possible option to do this quickly and simply is Node-RED, but there are others, and some brokers even have plugins for this, e.g. HiveMQ.
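For example, a minimal Python client along those lines, using paho-mqtt and SQLite (the broker address, topic filter, and database path are made up; the callback signatures follow the paho-mqtt 1.x API):

    # Sketch of a subscriber that persists every received message to SQLite.
    # Broker address, topic filter and database path are examples; callbacks
    # use the paho-mqtt 1.x signature.
    import sqlite3
    import paho.mqtt.client as mqtt

    db = sqlite3.connect('mqtt_messages.db')
    db.execute('CREATE TABLE IF NOT EXISTS messages (topic TEXT, payload BLOB)')

    def on_connect(client, userdata, flags, rc):
        client.subscribe('#')                  # subscribe to every topic

    def on_message(client, userdata, msg):
        db.execute('INSERT INTO messages VALUES (?, ?)', (msg.topic, msg.payload))
        db.commit()

    client = mqtt.Client()
    client.on_connect = on_connect
    client.on_message = on_message
    client.connect('localhost', 1883)
    client.loop_forever()                      # callbacks run in this thread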
If you really want to read it, then you will probably have to write your own tool to do this, based on the Mosquitto source code.