I want to process data in parallel using a cluster of ServiceMix / ActiveMQ / Camel. It seems I can achieve that by first splitting the data up, then distributing it via multiple JMS messages and an ActiveMQ network of brokers.
The hard part, though, is that I need to aggregate all results from all nodes at the end; all results would have to end up on a single node, and I'm currently unsure how to do that.
So the overall flow looks like this:
(1) Retrieve data.
(2) Split it up into manageable chunks.
(3) Post chunks of data into a distributed JMS queue, via an ActiveMQ cluster.
(4) Data is processed on all nodes.
Now the part I don't know how to handle:
(5) Aggregate processed data from all nodes
(6) Last processing step with the aggregated results.
                                          >>> [Process data (node 1)] >>>
[Retrieve DATA] >>>[vm://]>>> [SPLIT] >>>[activemq://]>>> [Process data (node 2)] >>>[activemq://]>>> [AGGREGATE] >>>[vm://]>>> [FINALIZE DATA]
                                          >>> [Process data (node 3)] >>>
How do I achieve that, given that an ActiveMQ broker network happily distributes everything? Deploy the final aggregating route on only one node? I don't like that, since it would create a single point of failure (SPOF) …
Thanks!
Well, it sounds like you could use an exclusive consumer on the aggregate stage. You should be able to run that aggregate route on all nodes.
Disclaimer: I'm not sure how this solution behaves on a network of brokers, but you can give it a shot and see if it helps.
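A minimal sketch of what that could look like in Camel's Spring XML DSL (queue, header, and bean names are invented for the example; the destination option comes from ActiveMQ's exclusive-consumer feature, and as said above it is untested on a network of brokers):

```xml
<!-- Deployed on EVERY node: the broker elects one exclusive consumer,
     the others sit as hot standbys, so there is no single point of failure -->
<route>
  <from uri="activemq:queue:processed.results?destination.consumer.exclusive=true"/>
  <aggregate strategyRef="resultAggregationStrategy" completionSize="100">
    <correlationExpression>
      <simple>${header.JobId}</simple>
    </correlationExpression>
    <to uri="vm:finalize"/>
  </aggregate>
</route>
```

The completion condition (here a plain completionSize) and the JobId correlation header are placeholders; you would replace them with whatever your split step actually stamps on the chunks.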
This use case sounds like the Composed Message Processor EIP:
http://camel.apache.org/composed-message-processor.html
Camel supports aggregating with the splitter alone, which makes it even easier; see the section about the splitter-only variant at the link above.
And with competing consumers you can have multiple nodes process the data in parallel: http://camel.apache.org/competing-consumers.html
You should then do request/reply over JMS so the reply messages are sent back to a queue where they will be aggregated: http://camel.apache.org/request-reply.html
And make sure to study the information about request/reply over JMS, as there are several options that can make this faster: http://camel.apache.org/jms
And for ActiveMQ to distribute the load, you can set the brokers up in a network of brokers (NoB): http://activemq.apache.org/networks-of-brokers.html
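Put together, the splitter-only variant of the composed message processor might look roughly like this in Spring XML DSL (endpoint names, the tokenizer, and bean names are assumptions for the sketch, not something from the question):

```xml
<!-- Coordinator: split, send each chunk out as request/reply (InOut),
     and let the splitter itself aggregate the replies as they come back -->
<route>
  <from uri="vm:start"/>
  <split strategyRef="resultAggregationStrategy">
    <tokenize token="\n"/>
    <to uri="activemq:queue:work" pattern="InOut"/>
  </split>
  <to uri="vm:finalize"/>
</route>

<!-- Worker nodes (competing consumers): every node runs this route and the
     broker network distributes the chunks among them -->
<route>
  <from uri="activemq:queue:work?concurrentConsumers=5"/>
  <bean ref="chunkProcessor"/>
</route>
```

Because the splitter aggregates its own replies, the aggregation naturally happens on whichever node did the split, which also sidesteps the "which node aggregates" question.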
Related
We have one topic with one partition due to message-ordering requirements. We have two consumers running on different servers with the same set of configurations (groupId, consumerId, consumerGroup), i.e.:
1 Topic -> 1 Partition -> 2 Consumers
The same code is deployed on both servers. We noticed that when a message arrives, both consumers consume it rather than only one processing it. The reason for running consumers on two separate servers is that if one server crashes, at least the other can continue processing messages. But it looks like when both are up, both consume the messages. The Kafka docs say that if we have more consumers than partitions, some stay idle, but we don't see that happening. Is there anything we are missing on the configuration side apart from consumerId and groupId? Thanks
As Gary Russel said, as long as the two consumer instances each have their own consumer group, they will both consume every event that is written to the topic. Just put them into the same consumer group: you can provide a consumer group id in consumer.properties.
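For example (the group id value itself is arbitrary; what matters is that it is identical on both servers):

```properties
# consumer.properties, deployed unchanged on BOTH servers.
# Same group.id => one consumer group => each message goes to only one member.
# With a single partition, the second consumer stays idle as a warm standby
# and takes over if the active one dies.
group.id=order-processors
```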
I am trying out a scenario in which I have a spout that reads data from a message broker and emits the message as a tuple to a bolt for some processing.
The bolt, after processing, converts it into separate messages, and each sub-message has to be sent to a different broker, which can be hosted on a different machine.
Assume I have a finite number of recipients (in my case there are 3 message brokers for output).
So, Bolt1 can, after processing, drop the messages directly to these 3 message brokers.
Now, if I use a single bolt here that drops the messages to these three brokers by itself, and say one of them fails (due to unavailability etc.), I call the collector's fail method.
Once fail is called on the bolt, the fail method in my spout gets invoked.
At that point, I believe I will have to process the entire message again (I have to make sure every message is processed), even though 2 out of 3 messages were successfully delivered.
Alternatively, even if I emit these 3 sub-messages to different bolts, I think the spout will still have to process the entire message again.
This is because I append a unique GUID to the message when first emitting it in the spout's nextTuple() method.
Is there a way to ensure that only the failed sub-message is reprocessed, and not the entire message?
Thanks
Storm (the low-level Java API) provides only "at-least-once" processing guarantees, i.e., there is no support to avoid duplicate processing in case of failure.
If you need exactly-once processing, you can use Trident on top of Storm. However, even Trident cannot give you exactly-once semantics if you emit data to an external system that cannot detect and discard duplicates. This is not Storm-specific but a general problem: other systems like Apache Flink, Apache Spark Streaming, or S-Store (a recent research prototype system from MIT / Stonebraker) "suffer" from the exact same problem.
Maybe the best approach is to try out Trident and evaluate whether it meets your requirements.
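One common workaround is to make the external system itself idempotent, keyed on the GUID the spout already attaches. A minimal, self-contained sketch of the idea (class and method names are invented; a real sink would use e.g. a unique-key constraint or upsert in the target store rather than an in-memory set):

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of an idempotent sink: Storm replays the whole tuple tree on failure,
// but if the sink remembers the GUID the spout attached, a replayed sub-message
// that was already written is silently dropped, so at-least-once delivery
// becomes effectively-once writes.
public class IdempotentSink {
    private final Set<String> seen = ConcurrentHashMap.newKeySet();
    private final Map<String, String> store = new ConcurrentHashMap<>();

    /** Returns true if the payload was written, false if it was a duplicate replay. */
    public boolean write(String guid, String payload) {
        if (!seen.add(guid)) {
            return false;                 // already delivered once: drop the replay
        }
        store.put(guid, payload);         // stands in for the external system's write
        return true;
    }

    public int size() {
        return store.size();
    }
}
```

In a real deployment the dedup state must live in the external system itself (so it survives restarts), not in process memory; the sketch only illustrates the contract.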
I've come across a curious detail in the legacy integration solution based on WebSphere MQ 7.0.1.3 and WebSphere Message Broker 7.0.0.7. There are 2 message flows:
The 1st flow is a case of the MQ Request-Reply pattern. After an MQPut it has an MQGet node that gets the message by correlation ID from queue "MQ_BIS_IN".
The 2nd flow is a kind of a one-way router that starts with a MQInput node (without any filters) that listens on the queue "MQ_GW_IN".
Interestingly, "MQ_BIS_IN" is an alias for "MQ_GW_IN" queue. My first thought was that the 2 flows would interfere in a bad way, basically the "omnivorous" MQInput would ruin the Request-Reply thing. But they seem to somehow get along.
I am going to reproduce this configuration in a test environment to determine if their behaviour is stable under load. Nevertheless, does anybody know if there are some rules of precedence between concurrent read operation from the same queue? Does it matter that there's an alias to the queue?
Both the MQInput and the MQGet node can be configured to look only for particular message IDs or correlation IDs, to pick up the items on the queue in a determined order, or to pick up only complete groups of messages, so there doesn't need to be a conflict here.
I have a setup with two queues (no exchanges), let's say queue A and queue B.
One parser puts messages on queue A, that are consumed by ElasticSearch RabbitMQ river.
What I want is to move messages from queue A to queue B once the ES river sends an ack to queue A, so that I can do further processing on the acked messages, being sure that ES has already processed them.
Is there any way in RabbitMQ to do this? If not, is there any other setup that can guarantee me that a message is only in queue B after being processed by ES?
Thanks in advance
I don't think this is supported by either AMQP or the RabbitMQ extensions.
You could drop the river and let your consumer also publish to elasticsearch.
Since the queues are normally empty, you can just retry reading the entries from Elasticsearch a few times (with exponential backoff); even if Elasticsearch loses the initial race, your consumer will back off a bit and you can then perform the task. This might require tuning the prefetch_size/count in your clients.
What is the best way to aggregate messages from many different sources (actually queues/topics) into a single queue/topic and then consume them? I am trying to design an application to receive messages from different topics in JMS using WebLogic.
You could write your own "aggregator" as a stand-alone Java application:
For each queue/topic have a reader in its own thread.
Each reader resends its received messages on an "aggregate queue".
Have another thread to listen on the "aggregate queue".
As a variation, you could use an in-JVM queue (like java.util.concurrent.ArrayBlockingQueue) as the "aggregate queue". This is faster, does not require another MQ queue, and needs no network bandwidth, but it is not persistent.
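The in-JVM variation can be sketched like this (a self-contained simulation: the "readers" consume from in-memory lists standing in for JMS queues; real readers would be JMS consumers, each in its own thread):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// One reader thread per source feeds a shared in-memory "aggregate queue";
// a single drain at the end plays the role of the listener thread.
public class InMemoryAggregator {

    public static List<String> aggregate(List<List<String>> sources) {
        // Capacity must cover the whole backlog in this simplified sketch,
        // because nothing drains the queue while the readers are running.
        BlockingQueue<String> aggregateQueue = new ArrayBlockingQueue<>(1024);

        List<Thread> readers = new ArrayList<>();
        for (List<String> source : sources) {        // one reader per queue/topic
            Thread reader = new Thread(() -> {
                for (String msg : source) {
                    try {
                        aggregateQueue.put(msg);     // blocks if the queue is full
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                        return;
                    }
                }
            });
            reader.start();
            readers.add(reader);
        }
        for (Thread reader : readers) {
            try {
                reader.join();                       // wait until all readers finish
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }

        List<String> aggregated = new ArrayList<>();
        aggregateQueue.drainTo(aggregated);          // the "listener" side
        return aggregated;
    }
}
```

Note the trade-off mentioned above: everything still sitting in aggregateQueue is gone if the JVM crashes.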
Another idea is to use a "Message driven bean (MDB)" for each incoming queue/topic:
Again, each of these MDBs just reads the message and resends it to the "aggregate queue".
Have another MDB listening on the "aggregate queue".
A few suggestions on quality requirements, which I believe you have to consider; they are closely related to your technical solution.
Is message loss acceptable?
Client acknowledgement could be considered.
For example, with a memory queue in the middle (incoming queue1...n -> ArrayBlockingQueue in memory -> outgoing queue), the data in the ArrayBlockingQueue will be lost if the app crashes.
Is message duplication acceptable on the single outgoing queue?
I would suggest yes.
Set the PossibleDuplicateFlag at the applicable level to make the client aware of it.
How many messages arrive per second on the different incoming queues?
One queue session has only a single thread, so performance has to be considered in advance.