Real time transformations from/to pubsub and websocket push to client - websocket

I need to get some real time data from a third party provider, transform them and push the to the browser via websockets.
The whole procedure should not take more than 200ms from the time I received the data till the time the browser gets them.
I am thinking in using pub/sub to dataflow to pub/sub again where a websocket server will subscribe and push the messages to the browsers.
Is this approach correct or dataflow is not designed for something like this?

Dataflow is designed for reliable streaming aggregation and analytics and is not designed for guaranteed sub-second latencies through the system. The core primitives like windowing and triggering allow for reliable processing of streams over defined windows of data despite late data and potential machine or pipeline errors. The main use case we have optimized for is for example, aggregating and outputting statistics over a stream of data, outputting reliable statistics for each window while doing logging to disk for fault-tolerance and waiting if necessary before triggering, to accommodate late data. As such, what we have not optimized for is the end-to-end latency you require.

Related

Microservice failure Scenario

I am working on Microservice architecture. One of my service is exposed to source system which is used to post the data. This microservice published the data to redis. I am using redis pub/sub. Which is further consumed by couple of microservices.
Now if the other microservice is down and not able to process the data from redis pub/sub than I have to retry with the published data when microservice comes up. Source can not push the data again. As source can not repush the data and manual intervention is not possible so I tohught of 3 approaches.
Additionally Using redis data for storing and retrieving.
Using database for storing before publishing. I have many source and target microservices which use redis pub/sub. Now If I use this approach everytime i have to insert the request in DB first than its response status. Now I have to use shared database, this approach itself adding couple of more exception handling cases and doesnt look very efficient to me.
Use kafka inplace if redis pub/sub. As traffic is low so I used Redis pub/sub and not feasible to change.
In both of the above cases, I have to use scheduler and I have a duration before which I have to retry else subsequent request will fail.
Is there any other way to handle above cases.
For the point 2,
- Store the data in DB.
- Create a daemon process which will process the data from the table.
- This Daemon process can be configured well as per our needs.
- Daemon process will poll the DB and publish the data, if any. Also, it will delete the data once published.
Not in micro service architecture, But I have seen this approach working efficiently while communicating 3rd party services.
At the very outset, as you mentioned, we do indeed seem to have only three possibilities
This is one of those situations where you want to get a handshake from the service after pushing and after processing. In order to accomplish the same, using a middleware queuing system would be a right shot.
Although a bit more complex to accomplish, what you can do is use Kafka for streaming this. Configuring producer and consumer groups properly can help you do the job smoothly.
Using a DB to store would be a overkill, considering the situation where you "this data is to be processed and to be persisted"
BUT, alternatively, storing data to Redis and reading it in a cron-job/scheduled job would make your job much simpler. Once the job is run successfully, you may remove the data from cache and thus save Redis Memory.
If you can comment further more on the architecture and the implementation, I can go ahead and update my answer accordingly. :)

MQTT vs Socket.IO on Network bandwidth usage

I need solution upstream much data every seconds.
200kBytes per seconds via wireless (WiFi) or Ethenet.
I selected MQTT, because It is useful for connections with remote locations where a small code footprint is required and/or network bandwidth is at a premium.
MQTT is better that Socket.io in network bandwidth usage?
Or, MQTT is good solution for upload/publish realtime.
MQTT can be used for charting system same as socket.io(WebSocket)?
Socket.io does several things at once. This answer focuses on your note about the underlying protocol, WebSockets, though of course you could use those without Socket.io.
WebSockets vs. MQTT is an apples-to-baskets comparison, as each can work without the other or together. MQTT can work alone as an alternative to HTTP. WebSockets is an additional protocol on top of HTTP, and can keep a long-running connection open, so that a stream of messages can be sent over a long period without having to set up a new connection for each request. That connection can carry MQTT, or non-MQTT data like JSON objects, and has the benefit of providing a reliable two-way link whose messages arrive in order.
MQTT also has less overhead, for different reasons: it is designed with a publish-subscribe model (Pub-Sub Model) and optimizes for delivering data over narrow, slow, or unreliable connections. Though it omits many of the headers that accompany an HTTP message in favor of a few densely-coded bytes, the real difference is in speed of delivery. A top option for constrained embedded devices, though they are usually sending small messages and trying to conserve data/processing/power.
So they have different strengths, and can even be combined. MQTT-via-WebSockets is a common approach for using MQTT inside a webapp, though plain MQTT is the norm in lower-end devices (which may be hard-pressed to send that much data anyway). I suggest MQTT for sending from device to server, or WebSockets-MQTT for quickly receiving device data in the browser or ensuring the order of messages that are sent at high rates. An important exception would be for streaming - there have only been isolated reports of it over MQTT, while Socket.io reports it as a top feature. The balance will depend on what systems you have on both ends and what sort of charting is involved.

Should a websocket connection be general or specific?

Should a websocket connection be general or specific?
e.g. If I was building a stock trading system, I'd likely to have real time stock prices, real time trade information, real time updates to the order book, perhaps real time chat to enable traders to collude and manipulate the market. Should I have one websocket to handle all the above data flow or is it better to have several websocket to handle different topics?
It all depends. Let's look at your options, assuming your stock trader, your chat, and your order book are built as separate servers/micro-services.
One WebSocket for each server
You can have each server running their own WebSocket server, streaming events relevant to that server.
Pros
It is a simple approach. Each server is independent.
Cons
Scales poorly. The number of open TCP connections will come at a price as the number of concurrent users increases. Increased complexity when you need to replicate the servers for redundancy, as all replicas needs to broadcast the same events. You also have to build your own fallback for recovering from client data going stale due to lost WebSocket connection. Need to create event handlers on the client for each type of event. Might have to add version handling to prevent data races if initial data is fetched over HTTP, while events are sent on the separate WebSocket connection.
Publish/Subscribe event streaming
There are many publish/subscribe solutions available, such as Pusher, PubNub or SocketCluster. The idea is often that your servers publish events on a topic/subject to a message queue, which is listened to by WebSocket servers that forwards the events to the connected clients.
Pros
Scales more easily. The server only needs to send one message, while you can add more WebSocket servers as the number of concurrent users increases.
Cons
You most likely still have to handle recovery from events lost during disconnect. Still might require versioning to handle data races. And still need to write handlers for each type of event.
Realtime API gateway
This part is more shameless, as it covers Resgate, an open source project I've been involved in myself. But it also applies to solutions such as Firebase. With the term "realtime API gateway", I mean an API gateway that not only handles HTTP requests, but operates bidirectionally over WebSocket as well.
With web clients, you are seldom interested in events - you are interested in change of state. Events are just means to either describe the changes. By fetching the data through a gateway, it can keep track on which resources the client is currently interested in. It will then keep the client up to date for as long as the data is being used.
Pros
Scales well. Client requires no custom code for event handling, as the system updates the client data for you. Handles recovery from lost connections. No data races. Simple to work with.
Cons
Primarily for client rendered web sites (using React, Vue, Angular, etc), as it works poorly with sites with server-rendered pages. Harder to apply to already existing HTTP API's.

When does server-side push technology become necessary?

Right now, I'm pulling some 10 kB sensor data from a server via a single plain old HTTP request every 5 minutes. In the future, I might want to increase the frequency to make one request every 30 seconds.
When does server-side push technology become necessary?
Obviously, the precise answer depends on the server - but what's the general approach to the issue? Using push technology definitely seems advantageous. However, the would have to be some major code rewriting. Additionally, I feel like 30 seconds is still a long enough interval and the overhead (e.g. cookies in HTTP headers, ...) shouldn't cause too much surplus traffic.
Push technology is useful for any of the following situations:
You need the client to have low latency when there is new data (e.g. not wait for the next polling interval, but to find out within seconds or even ms when there is new data).
When you're interested in minimizing overhead on your server or bandwidth usage or power consumption on the client, yet the client needs to know in a fairly timely fashion when new data is available. Doing frequent polling from a mobile client chews up bandwidth and battery.
When data availability is unpredictable and a regular polling would usually result in no data being available. If every poll results in data being collected and the timeliness of a moderate polling interval is sufficient for your application, then there are not big gains by switching to a push notification mechanism. When the poll usually results in an empty request, that's when polling becomes very inefficient.
If you're sending data regularly and you are trying to minimize bandwidth. A single webSocket packet is way more efficient than an HTTP request as the HTTP request includes headers, cookies, etc... that need not be send with a single webSocket packet once the webSocket connection has already been established.
Some other references on the topic:
What are the pitfalls of using Websockets in place of RESTful HTTP?
Ajax vs Socket.io
websocket vs rest API for real time data?
Cordova: Sockets, PushNotifications, or repeatedly polling server?
HTML5 WebSocket: A Quantum Leap in Scalability for the Web

Factors Affected for Low Performance of middleware Messaging Softwares

I am planning to inegrate messaging middleware in my web application. Right now I am tesing different messaging middleware software like RabbitMQ,JMS, HornetQ, etc..
Examples provided with this softwares are working but its not giving as desired results.
So, I want to know that which are the factors which are responsible to improve peformance that one should keep in eyes?
Which are the areas, a developer should take care of to improve the performance of middleware messaging software?
I'm the project lead for HornetQ but I will try to give you a generic answer that could be applied to any message system you choose.
A common question that I see is people asking why a single producer / single consumer won't give you the expected performance.
When you send a message, and are asking confirmation right away, you need to wait:
The message transfer from client to server
The message being persisted on the disk
The server acknowledging receipt of the message by sending a callback to the client
Similarly when you are receiving a message, you ACK to the server:
The ACK is sent from client to server
The ACK is persisted
The server sends back a callback saying that the callback was achieved
And if you need confirmation for all your message-sends and mesage-acks you need to wait these steps as you have a hardware involved on persisting the disk and sending bits on the network.
Message Systems will try to scale up with many producers and many consumers. That is if many are producing they should all use the resources available at the server shared for all the consumers.
There are ways to speed up a single producer or single consumer:
One is by using transactions. So, you minimize the blocks and syncs you perform on disk while persisting at the server and roundtrips on the network. (This is actually the same on any database)
Another one, is by using Callbacks instead of blocking at the consumer. (JMS 2 is proposing a Callback similar to the ConfirmationHandler on HornetQ).
Also: most providers I know will have a performance section on their docs with requirements and suggestions for that specific product. You should look individually at each product

Resources