Best practice for IoT stream data processing

Assume there are hundreds or thousands of IoT devices publishing data to an MQTT broker cluster via the MQTT protocol. Behind the broker I have a data processing module that subscribes to the data from the broker and maintains a status table for all these devices. The number of devices is still rising, so I have to scale out both the broker cluster and the data processing module accordingly. Brokers such as Kafka/RabbitMQ/HiveMQ can be scaled out very easily, but for the data processing module I'm not sure whether there is a best practice, or a framework/architecture that achieves this easily.
I assume I have to create many daemon processes with hundreds or thousands of threads listening on the MQTT broker. The question is: how do I scale out these services dynamically?
Thanks.

One way of doing this would be using Node.js as it uses an event-driven approach and you don't have to deal with threads, etc.
I found this library for Node.js which is specific to MQTT:
https://www.npmjs.com/package/mqtt
You can use this to subscribe to different topics.
You may also find this project interesting:
http://nodered.org/
Another option is Apache Kafka, which treats scalability as a core feature. However, Kafka does not support MQTT out of the box and has its own conventions, so you need some sort of adapter to make them work together. For that, take a look at this:
using mqtt protocol with kafka as a message broker
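To make the event-driven idea concrete, here is a minimal sketch of a status table that is updated by one handler call per incoming message, with no thread per device. It is written in Python rather than Node.js, and the topic layout `devices/<device_id>/status`, the JSON payload, and the `StatusTable` class are all assumptions for illustration. Wiring `on_message` to an actual MQTT client callback is left out.

```python
import json
import time


class StatusTable:
    """In-memory status table keyed by device ID (hypothetical layout)."""

    def __init__(self):
        self._rows = {}

    def on_message(self, topic, payload, now=None):
        """Handle one message; return True if the table was updated."""
        # Assumed topic layout: devices/<device_id>/status
        parts = topic.split("/")
        if len(parts) != 3 or parts[0] != "devices" or parts[2] != "status":
            return False  # not a status topic, ignore it
        try:
            state = json.loads(payload)  # assumed: payload is JSON
        except ValueError:
            return False  # malformed payload, keep the previous row
        last_seen = now if now is not None else time.time()
        self._rows[parts[1]] = {"state": state, "last_seen": last_seen}
        return True

    def get(self, device_id):
        """Return the latest row for a device, or None if unseen."""
        return self._rows.get(device_id)
```

Because each call only touches the row of one device, several such processes could each subscribe to a different subset of topics and keep their own part of the table.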

Related

Simple Server to PUSH lots of data to Browser?

I'm building a Web Application that consumes data pushed from Server.
Each message is JSON and could be large (hundreds of kilobytes). Messages are sent a couple of times per minute, and the order doesn't matter.
The Server should be able to persist messages that have not been delivered yet, potentially storing a couple of megabytes per client for a couple of days, until the client comes back online. There's a limit on the storage size for unsent messages, say 20 MB per client, and old undelivered messages get deleted when this limit is exceeded.
The Server should be able to handle around 1 thousand simultaneous connections. How could this be implemented simply?
Possible Solutions
I was thinking of storing messages as files on disk and having the browser poll every 1 sec to check for new messages, serving them with Nginx or something like that. Are there any configs / modules for Nginx for such use cases?
Or maybe it's better to use an MQTT server or some message queue like RabbitMQ with some browser adapter?

Actually, MQTT supports the concept of sessions that persist across client connections, but the client must first connect and request a "non-clean" session. After that, if the client is disconnected, the broker will hold all the QoS=1 or 2 messages destined for that client until it reconnects.
With MQTT v3.x, technically, the server is supposed to hold all the messages for all these disconnected clients forever! Each message maxes out at a 256 MB payload, but the server is supposed to hold everything you give it. This created a big problem for servers, which MQTT v5 came in to fix. Most real-world brokers also have configurable settings around this.
But MQTT shines if the connections are over unreliable networks (wireless, cell modems, etc) that may drop and reconnect unexpectedly.
If the clients are connected over fairly reliable networks, AMQP with RabbitMQ is considerably more flexible, since clients can create and manage the individual queues. But the neat thing is that you can mix the two protocols using RabbitMQ, as it has an MQTT plugin. So, smaller clients on an unreliable network can connect via MQTT, and other clients can connect via AMQP, and they can all communicate with each other.
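As a side note on where the 256 MB ceiling mentioned above comes from: the MQTT fixed header stores the packet's "Remaining Length" in at most four bytes, each carrying 7 data bits plus a continuation bit. The sketch below encodes that field; the function name is just for illustration.

```python
MAX_REMAINING_LENGTH = 128 ** 4 - 1  # 268,435,455 bytes, i.e. 256 MiB minus 1


def encode_remaining_length(n):
    """Encode n with MQTT's variable-length scheme (7 bits per byte)."""
    if not 0 <= n <= MAX_REMAINING_LENGTH:
        raise ValueError("remaining length out of range")
    out = bytearray()
    while True:
        n, digit = divmod(n, 128)
        if n > 0:
            digit |= 0x80  # high bit set: more length bytes follow
        out.append(digit)
        if n == 0:
            return bytes(out)
```

The remaining length covers the variable header as well as the payload, so the largest possible payload is slightly below that number.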

MQTT is most likely not what you are looking for. The protocol is meant to be lightweight and as the comments pointed out, the protocol specifies that there may only exist "Control Packets of size up to 268,435,455 (256 MB)" source. Clearly, this is much too small for your use case.
Moreover, if a client isn't connected (and subscribed on that particular topic) at the time of the message being published, the message will never be delivered. EDIT: As @Brits pointed out, this only applies to QoS 0 pubs/subs.
Like JD Allen mentioned, you need a queuing service like RabbitMQ or AMQ. There are countless other such services/libraries/packages in existence, so please investigate more.
If you want to roll your own, it might be worth considering using AWS SQS and wrapping some of your own application logic around it. That'll likely be a bit hacky though, so take that suggestion with a grain of salt.
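If you do roll your own, the storage rule from the question (a size cap per client, with the oldest undelivered messages dropped first) could look roughly like this. It is only an in-memory sketch: the `ClientBuffer` class is hypothetical, and persisting the buffer to disk or to a queue service is deliberately left out.

```python
from collections import deque


class ClientBuffer:
    """Undelivered messages for one client, capped by total size in bytes."""

    def __init__(self, max_bytes=20 * 1024 * 1024):
        self.max_bytes = max_bytes
        self._messages = deque()
        self._size = 0

    def push(self, message):
        """Queue a message (bytes); evict the oldest ones if over the cap."""
        if len(message) > self.max_bytes:
            return False  # a single message larger than the cap never fits
        self._messages.append(message)
        self._size += len(message)
        while self._size > self.max_bytes:
            dropped = self._messages.popleft()  # oldest message goes first
            self._size -= len(dropped)
        return True

    def drain(self):
        """Return all pending messages, oldest first, and clear the buffer."""
        pending = list(self._messages)
        self._messages.clear()
        self._size = 0
        return pending
```

A server would keep one such buffer per client and call `drain()` when that client reconnects.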

Why should I use JMS for MOM

I am really curious about this topic.
I am building a communication mechanism for internal systems, and it may also need to connect to some external clients. The internal modules are themselves distributed systems.
I need to create an ESB between those modules. The system should provide high performance for millions of subscribers.
Both publish/subscribe and point-to-point communication are needed.
When I first started thinking about the implementation, I planned to put a REST API in front, which would communicate with a JMS bus. The JMS bus would provide communication between the internal systems.
Unfortunately, according to my investigation, using JMS can cause critical problems (performance, scalability, ...), and it looks like JMS is unnecessary: I could create adapters over the internal modules and have them communicate through REST services.
Does anyone have any idea why I should use JMS for internal communication?

Both REST and JMS/MQ enable communication between remote (and local) systems. The scenarios below may help you decide:
Some reasons for using JMS in your case:
Your producer emits messages at a much higher rate than the consumer can process them, so persistent messaging helps. This also implies you are fine with the transaction/message being processed later.
Not all systems are up all the time.
You need a publish/subscribe mechanism (topics).
Messages are not critical, and old messages can be discarded when load is high.
Reasons for using a REST API (without any JMS behind it):
1. You want an immediate response that the transaction is completed, for example a hotel booking.
2. All systems must be up all the time for processing to complete.

You would want to use JMS (or enterprise messaging) when you don't want to depend on all the systems being available. If one of your internal systems is down for some reason, a REST API call to that system fails, but a JMS interface does not, because you are communicating with the MOM rather than with the system itself.
Some MOM products also support protocols other than JMS, so different runtimes can communicate with the MOM.

What is the best way to deliver real-time messages to Client that can not be requested

We need to deliver real-time messages to our clients, but their servers are behind a proxy and we cannot initiate a connection to them, so a webhook variant won't work.
What is the best way to deliver real-time messages considering that:
client that is behind a proxy
client can be off for a long period of time, and all messages must be delivered
the protocol/way must be common enough, so that even a PHP developer could easily use it
I have in mind three variants:
WebSocket - the client opens a websocket connection, and we send the messages that were stored in the DB along with the messages coming in real time.
RabbitMQ - all messages are stored in a durable, persistent queue. What if the partner does not read from the queue for some time?
HTTP GET - the partner pulls messages in batches. With this approach it is hard to pick an optimal polling interval.
Any suggestions would be appreciated. Thanks!

Since you seem to have to store messages when your peer is not connected, the question applies to any other solution equally: what if the peer is not connected and messages are queueing up?
RabbitMQ is great if you want loose coupling: separating the producer and the consumer sides. The broker will store messages for you if no consumer is connected. This can indeed fill up memory and/or disk space on the broker after some time; when that happens, RabbitMQ raises a resource alarm and blocks publishers until space is freed.
In general, RabbitMQ is a great tool for messaging-based architectures like the one you describe:
Load balancing: you can use multiple publishers and/or consumers, thus sharing load.
Flexibility: you can configure multiple exchanges/queues/bindings if your business logic needs it. You can easily change routing on the broker without reconfiguring multiple publisher/consumer applications.
Flow control: RabbitMQ also gives you some built-in methods for flow control - if a consumer is too slow to keep up with publishers, RabbitMQ will slow down publishers.
You can refactor the architecture later easily. You can set up multiple brokers and link them via shovel/federation. This is very useful if you need your app to work via multiple data centers.
You can easily spot if one side is slower than the other, since queues will start growing if your consumers can't read fast enough from a queue.
High availability and fault tolerance. RabbitMQ is very good at these (thanks to Erlang).
So I'd recommend it over the other two (which might be good for a small-scale app, but you might outgrow them quickly if requirements change and you need to scale things up).
Edit: something I missed: if it's not vital to deliver all messages, you can configure queues with a TTL (a message is discarded after a timeout) or with a length limit (this caps the number of messages in the queue; once it is reached, the oldest messages are dropped by default, or new messages are rejected if the queue's overflow behaviour is set to reject-publish).
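As a rough sketch, the TTL and length limit are set through optional queue arguments when the queue is declared. The helper below only builds that argument dictionary; the function name, its parameters, and the queue name in the comment are made up for illustration, while the `x-...` keys are the ones RabbitMQ defines.

```python
def bounded_queue_arguments(ttl_seconds=None, max_messages=None, reject_new=False):
    """Build optional RabbitMQ queue arguments for a TTL and/or length limit."""
    args = {}
    if ttl_seconds is not None:
        # Per-queue message TTL, expressed in milliseconds.
        args["x-message-ttl"] = int(ttl_seconds * 1000)
    if max_messages is not None:
        args["x-max-length"] = int(max_messages)
        # "drop-head" (RabbitMQ's default) discards the oldest messages,
        # "reject-publish" refuses new ones once the limit is reached.
        args["x-overflow"] = "reject-publish" if reject_new else "drop-head"
    return args


# With the pika client this would typically be passed as (not executed here):
#   channel.queue_declare(queue="partner-events", durable=True,
#                         arguments=bounded_queue_arguments(ttl_seconds=3600))
```

Which overflow mode fits depends on whether the partner cares more about the newest or the oldest messages after a long outage.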

ZeroMQ PUB/SUB and TCP transport

On Windows, I have to build a relatively simple topology in ZeroMQ.
I have a process (let's call it a bridge) that receives data from outside and introduces it into the ZeroMQ topology. I'd like to use a set of publishers (something like ipc:///bridge/entity1, ipc:///bridge/entity2, ipc:///bridge/entity3 and so on) but afaik ZeroMQ does not support the IPC transport on Windows (its ipc transport is built on Unix domain sockets).
So I have to move to a TCP transport, but I don't want to use one port for each entity: I'd like to use something like tcp:///bridge:12345/entity1, tcp:///bridge:12345/entity2 and so on.
However AFAIK, this is not possible with ZeroMQ.
Can you please point me to the right direction?

That's right, it's not possible to bind several ZeroMQ sockets to a single port.
Your problem can probably be solved with a single PUB socket that publishes messages to different topics, and subscribers that select their topics with zmq_setsockopt(ZMQ_SUBSCRIBE, ...). Since ZeroMQ 3.x, topic filtering is done on the PUB side, so there is no redundant data transmission (related question: ZeroMQ filtering at publisher)
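One detail worth knowing when mapping entities to topics: a ZMQ_SUBSCRIBE filter is a plain byte-prefix match on the start of the message, so a subscription to entity1 would also receive entity10. The pure-Python sketch below only imitates that matching rule to show the pitfall; the helper names and the "/" delimiter are assumptions, not part of ZeroMQ.

```python
def is_delivered(subscriptions, message):
    """Imitate SUB-side matching: deliver if any filter is a prefix."""
    # No subscriptions means nothing is delivered; an empty filter b""
    # is a prefix of every message and therefore matches everything.
    return any(message.startswith(prefix) for prefix in subscriptions)


def topic_for(entity):
    """Build a topic prefix that ends with a delimiter (assumed: '/')."""
    return entity.encode("utf-8") + b"/"


def frame(entity, body):
    """Prepend the delimited topic to the message body."""
    return topic_for(entity) + body


# With pyzmq the subscriber side would typically call (not executed here):
#   sub.setsockopt(zmq.SUBSCRIBE, topic_for("entity1"))
```

Ending every topic with a delimiter keeps entity1 and entity10 apart while still using one PUB socket on one TCP port.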

Solution/Architecture: queues or something else?

I have multiple frontends to my service written in Node.js and workers written in Ruby. Now the question is how to make them communicate. I need to maintain a dynamic pool of workers to handle load (spawning more workers when load rises), and messages are quite big, ~2-3 MB, because I'm sending images uploaded by users through the Node.js frontends to the workers. Because I want nice scaling I thought about a queuing solution, but I didn't find any existing solution (or I misunderstood the guides) that provides:
Fallback mechanisms. The solutions I've found so far have a single point of failure (the message broker), and there is no way to provide fallbacks.
Serialization, so that tasks are not lost when the broker fails.
Ability to pass big messages.
Easy API for Ruby and Node.js
Some API to track queue size so I could rearrange workers pool.
Preferably lightweight.
Maybe my approach is wrong? Maybe I shouldn't use queues but some other way? Or is there a queueing solution that fits the requirements above?

You do need a queue to scale, and you can monitor that queue to decide when to spawn "workers".
Apache ActiveMQ is very robust and exposes a REST API. A Ruby client is also available to access the queue.
Interesting article on RESTful queue using Apache ActiveMQ
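The "monitor the queue to spawn workers" part can be as simple as a rule that maps the current queue depth to a target pool size. This is a sketch under assumed numbers: the function name, the backlog per worker, and the pool bounds are placeholders, and reading the actual depth from the broker is not shown.

```python
import math


def desired_workers(queue_depth, per_worker_backlog=10, min_workers=1, max_workers=20):
    """Return how many workers should run for the given queue depth."""
    if per_worker_backlog <= 0:
        raise ValueError("per_worker_backlog must be positive")
    # One worker per `per_worker_backlog` queued messages, rounded up.
    needed = math.ceil(queue_depth / per_worker_backlog)
    # Clamp to the allowed pool size.
    return max(min_workers, min(max_workers, needed))
```

A supervisor process would poll the queue size periodically and start or stop Ruby workers until the pool matches the returned value.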

In the end I went with a ZeroMQ queue solution. It is a very fast, robust and lightweight implementation. I had to write my own broker, but that's the only downside of this solution.

Redis publish/subscribe should do the trick:
http://redis.io/topics/pubsub
