Are Spark Streaming RDDs always processed in order? - spark-streaming

I'm writing a Spark Streaming application that uses RabbitMQ to consume events. One feature of RabbitMQ that I intend to make use of is bulk ack of messages, i.e. no need to ack one-by-one, but only ack the last event in a batch and that would ack the entire batch.
Before I commit to doing so, I'd like to know if Spark Streaming always processes RDDs in the same order they arrive in, i.e. if RDD1 arrives before RDD2, is it true that RDD2 will never be scheduled/processed before RDD1 is finished?
This is crucial to the ack logic, since if RDD2 can potentially be processed while RDD1 is still being processed, then acking the last event in RDD2 would also ack all events in RDD1, even though they may not have been completely processed yet.

By default, only after all the retries etc. related to batch X are done will batch X+1 be started.
ref
Additional information: This is true in the default configuration. You may find references to an undocumented hidden configuration called spark.streaming.concurrentJobs elsewhere in the mailing list. Setting that to more than 1 to get more concurrency (between output ops) breaks the above guarantee.
ref
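To make the implication concrete, here is a minimal sketch of the pattern this ordering enables. The TaggedMessage wrapper, the ackUpTo helper and the receiver wiring are hypothetical placeholders; in a real application the RabbitMQ channel handling would live on the receiver side.

import java.io.Serializable;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class OrderedBulkAck {

    // Hypothetical wrapper carrying the RabbitMQ delivery tag alongside the payload.
    public static class TaggedMessage implements Serializable {
        public final long deliveryTag;
        public final String body;
        public TaggedMessage(long deliveryTag, String body) {
            this.deliveryTag = deliveryTag;
            this.body = body;
        }
    }

    // Hypothetical helper; a real application would call
    // channel.basicAck(deliveryTag, true) on the channel that received the messages.
    static void ackUpTo(long deliveryTag) { }

    public static void bulkAckOutputOp(JavaDStream<TaggedMessage> messages) {
        messages.foreachRDD((JavaRDD<TaggedMessage> rdd) -> {
            if (rdd.isEmpty()) {
                return;
            }
            // 1. Process the whole batch first; if this fails, the batch is retried
            //    and nothing below runs, so nothing gets acknowledged.
            rdd.foreach(msg -> { /* handle msg.body */ });

            // 2. Only after the batch succeeded, ack everything up to the highest
            //    delivery tag seen in this batch (RabbitMQ bulk ack, multiple=true).
            long lastTag = rdd.map(m -> m.deliveryTag).reduce((a, b) -> Math.max(a, b));
            ackUpTo(lastTag);
        });
    }

    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("ordered-bulk-ack").setMaster("local[2]");
        // Leave this at its default of 1; raising it breaks the ordering guarantee above.
        conf.set("spark.streaming.concurrentJobs", "1");
        JavaStreamingContext ssc = new JavaStreamingContext(conf, Durations.seconds(5));
        // A real job would create the DStream from a RabbitMQ receiver here, pass it to
        // bulkAckOutputOp(...), and then call ssc.start() and ssc.awaitTermination().
    }
}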

Related

Kafka streams commit offset semantics

I just wanted to confirm something which I think is between the lines of the documentation. Would it be correct to say that a commit in Kafka Streams is independent of whether the offset/message has been processed by the entire set of processing nodes of the application topology, and depends solely on the commit interval? In other words, whereas in a typical Kafka consumer application one would commit when a message is fully processed rather than merely fetched, in Kafka Streams simply being fetched is enough for the commit interval to kick in and commit that message/offset, even if that offset/message has not yet been processed by the entire set of processing nodes of the application topology?
Or are messages eligible to be committed only once the entire set of processing nodes of the topology has processed them and they are ready to go out to a topic or external system?
In short, the question can be summed up as: when are offsets/messages eligible to be committed in Kafka Streams? Is it conditional? If so, what is the condition?
You have to understand that a Kafka Streams program, i.e., its Topology, may contain multiple sub-topologies (https://docs.confluent.io/current/streams/architecture.html#stream-partitions-and-tasks). Sub-topologies are connected to each other via topics.
A record can be committed if it's fully processed by a sub-topology. In this case, the record's intermediate output is written into the topic that connects two sub-topologies before committing happens. The downstream sub-topology would read from the "connecting topic" and commit offsets for this topic.
Committing indeed happens based on commit.interval.ms only. If a fetch returns, let's say, 100 records (offsets 0 to 99), and 30 records have been processed by the sub-topology when commit.interval.ms hits, Kafka Streams first makes sure that the output of those 30 messages is flushed to Kafka (i.e., Producer.flush()) and afterwards commits offset 30 -- the other 70 messages are just in an internal buffer of Kafka Streams and will be processed after the commit. If the buffer is empty, a new fetch is sent. Each thread tracks commit.interval.ms independently and commits all its tasks once the commit interval has passed.
Because committing happens on a sub-topology basis, it can happen that an input topic record is committed while the output topic does not yet have the result data, because the intermediate results have not yet been processed by a downstream sub-topology.
You can inspect the structure of your program via Topology#describe() to see what sub-topologies your program has.
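If it helps to see the sub-topology split, here is a minimal sketch (topic names and the repartitioning step are hypothetical) where selectKey() forces a repartition topic, so the program ends up as two sub-topologies connected by that internal topic, which describe() will show:

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.Produced;

public class DescribeTopology {
    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> input = builder.stream("input-topic");

        // selectKey() changes the key, so Kafka Streams inserts a repartition topic
        // before the aggregation; that topic is what connects the two sub-topologies.
        input.selectKey((key, value) -> value)
             .groupByKey()
             .count()
             .toStream()
             .to("output-topic", Produced.with(Serdes.String(), Serdes.Long()));

        Topology topology = builder.build();
        // Prints each sub-topology with its source and sink topics, including the
        // internal repartition topic between them.
        System.out.println(topology.describe());
    }
}

Running this only prints the TopologyDescription; no broker connection is needed for describe().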
Whether using Streams or just a simple consumer, the key thing is that auto-commit happens in the polling thread, not a separate thread: the offset of a batch of messages is only committed on a subsequent poll, and commit.interval.ms just defines the minimum time between commits, i.e., a large value means that a commit won't happen on every poll.
The implication is that as long as you are not spawning additional threads, you will only ever be committing offsets for messages that have been completely processed, whatever that processing involves.
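As a plain-consumer illustration of that point (topic, group id and broker address are hypothetical; for the plain consumer the property is auto.commit.interval.ms rather than Streams' commit.interval.ms), a minimal single-threaded poll loop looks like this. The offsets of one batch can only be auto-committed from inside a later poll(), i.e. after this loop has finished processing the batch:

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class AutoCommitLoop {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "demo-group");
        props.put("enable.auto.commit", "true");
        props.put("auto.commit.interval.ms", "5000"); // minimum time between auto-commits
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("demo-topic"));
            while (true) {
                // Auto-commit of the previous batch's offsets happens inside poll(),
                // on this same thread, once the commit interval has elapsed.
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    // Fully process each record before the next poll() can commit it.
                    System.out.printf("offset=%d value=%s%n", record.offset(), record.value());
                }
            }
        }
    }
}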

Exactly-once guarantee in Storm Trident in network partitioning and/or failure scenarios

So, Apache Storm + Trident provide exactly-once semantics. Imagine I have the following topology:
TridentSpout -> SumMoneyBolt -> SaveMoneyBolt -> Persistent Storage.
SumMoneyBolt sums monetary values in memory, then passes the result to SaveMoneyBolt, which should save the final value to a remote storage/database.
Now it is very important that we calculate these values and store only once to the database. We do not want to accidentally double count the money.
So how does Storm with Trident handle network partitioning and/or failure scenarios when the write request to the database has been successfully sent, the database has successfully received the request and logged the transaction, but the SaveMoneyBolt has either died or been partitioned from the network before receiving the database response?
I assume that if SaveMoneyBolt had died, Trident would retry the batch, but we cannot afford double counting.
How are such scenarios handled?
Thanks.
Trident gives a unique transaction id (txid) for each batch. If a batch is retried, it will have the same txid. Batch updates are also ordered, i.e. the state update for a batch will not happen until the update for the previous batch is complete. So by storing the txid along with the values in the state, Trident can de-duplicate the updates and provide exactly-once semantics.
Trident comes with a few built-in Map state implementations which handle all this automatically.
For more information, take a look at the docs:
http://storm.apache.org/releases/1.0.1/Trident-tutorial.html
http://storm.apache.org/releases/current/Trident-state.html
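As a rough illustration of that mechanism, here is a minimal sketch of the money-summing case in the style of the Trident tutorial's word count. Field names, the FixedBatchSpout test data and the in-memory MemoryMapState are placeholders, and package names assume a recent Apache Storm release; persistentAggregate stores the running totals in a Map state that records the batch txid, which is what de-duplicates a retried batch. For real persistence you would swap in one of the transactional state implementations from the Trident-state docs:

import org.apache.storm.trident.TridentState;
import org.apache.storm.trident.TridentTopology;
import org.apache.storm.trident.operation.builtin.Sum;
import org.apache.storm.trident.testing.FixedBatchSpout;
import org.apache.storm.trident.testing.MemoryMapState;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Values;

public class MoneySumTopology {
    public static TridentTopology build() {
        // Test spout emitting (account, amount) tuples in small batches.
        FixedBatchSpout spout = new FixedBatchSpout(
                new Fields("account", "amount"), 3,
                new Values("acct-1", 100L),
                new Values("acct-2", 50L),
                new Values("acct-1", 25L));
        spout.setCycle(false);

        TridentTopology topology = new TridentTopology();
        TridentState sums = topology
                .newStream("money-spout", spout)
                .groupBy(new Fields("account"))
                .persistentAggregate(new MemoryMapState.Factory(), // swap for a transactional, DB-backed state in production
                                     new Fields("amount"),
                                     new Sum(),
                                     new Fields("total"));
        // The state stores the txid next to each value, so a retried batch with the
        // same txid does not double-count the amounts.
        return topology;
    }
}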

Emitting Tuple to different Bolts

I am working through a scenario in which I have a Spout that reads data from a Message Broker and emits the message as a tuple to a Bolt for some processing.
The Bolt, post processing, converts it into separate messages, and each sub-message has to be sent to a different broker, and these brokers can be hosted on different machines.
Assume I have a finite set of recipients (in my case there are 3 Message Brokers for output).
So, post processing, Bolt1 could drop the messages directly to these 3 Message Brokers itself.
Now, if I use a single bolt here which drops the messages to these three brokers by itself, and let's say one of them fails (due to unavailability etc.), I call the collector's fail method.
Once the fail method is called on the bolt, the fail method in my Spout gets invoked.
Here, I believe I will have to process the entire message again (I have to make sure every message is processed), even though 2 out of 3 messages were successfully delivered.
Alternatively, even if I emit these 3 sub-messages to different bolts, I think the Spout will still have to process the entire message again.
This is because I am appending a unique GUID to the message when emitting it the first time in the spout's nextTuple() method.
Is there a way to ensure that only the failed sub-message is reprocessed, and not the entire one?
Thanks
Storm (the low-level Java API) provides only "at-least-once" processing guarantees, i.e., there is no support to avoid duplicate processing in case of failure.
If you need exactly-once processing, you can use Trident on top of Storm. However, even Trident cannot give exactly-once if you emit data to an external system (if the external system cannot detect and delete duplicates). This is not a Storm-specific but a general problem. Other systems like Apache Flink, Apache Spark Streaming, or S-Store (a recent research prototype system from MIT -> Stonebraker) "suffer" from the exact same problem.
Maybe the best approach would be to try out Trident to evaluate if it can meet your requirements.
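For reference, here is a minimal sketch of the bolt layout described in the question (stream names, field names and the split logic are hypothetical; signatures assume a recent Apache Storm 2.x release). Note that because all three sub-messages are anchored to the same input tuple, failing any one of them replays the whole original message, which is exactly the at-least-once behaviour described above:

import java.util.Map;
import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

public class SplitterBolt extends BaseRichBolt {
    private OutputCollector collector;

    @Override
    public void prepare(Map<String, Object> topoConf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void execute(Tuple input) {
        String message = input.getStringByField("message");
        // Hypothetical split into three broker-specific sub-messages.
        String[] parts = message.split("\\|", 3);

        // Each sub-message goes to its own stream (one per downstream broker bolt)
        // and is anchored to the original tuple for at-least-once tracking.
        collector.emit("broker1", input, new Values(parts[0]));
        collector.emit("broker2", input, new Values(parts.length > 1 ? parts[1] : ""));
        collector.emit("broker3", input, new Values(parts.length > 2 ? parts[2] : ""));
        collector.ack(input);
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declareStream("broker1", new Fields("submessage"));
        declarer.declareStream("broker2", new Fields("submessage"));
        declarer.declareStream("broker3", new Fields("submessage"));
    }
}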

read messages from JMS MQ or In-Memory Message store by count

I want to read messages from JMS MQ or In-memory message store based on count.
For example, I want to start reading the messages when the message count is 10; until then, I want the message processor to be idle.
I want this to be done using WSO2 ESB.
Can someone please help me?
Thanks.
I'm not familiar with wso2, but from an MQ perspective, the way to do this would be to trigger the application to run once there are 10 messages on the queue. There are trigger settings for this, specifically TRIGTYPE(DEPTH).
To expand on Morag's answer, I doubt that WSO2 has built-in triggers that would monitor the queue for depth before reading messages. I suspect it just listens on a queue and processes messages as they arrive. I also doubt that you can use MQ's triggering mechanism to directly execute the flow conveniently based on depth. So although triggering is a great answer, you need a bit of glue code to make that work.
Conveniently, there's a tutorial that provides almost all the information necessary to do this. Please see Mission:Messaging: Easing administration and debugging with circular queues for details. That article has the scripts necessary to make the Q program work with MQ triggering. You just need to make a couple changes:
Instead of sending a command to Q to delete messages, send a command to move them.
Ditch the math that calculates how many messages to delete and either move them in batches of 10, or else move all messages until the queue drains. In the latter case, make sure to tell Q to wait for any stragglers.
Here's what it looks like when completed: the incoming messages land on some queue other than the WSO2 input queue. That queue is triggered based on depth so that the Q program (SupportPac MA01) copies the messages to the real WSO2 input queue. After the messages are copied, the glue code resets the trigger. This continues until there are fewer than 10 messages on the queue, at which time the cycle idles.
I got this working by pushing the messages to a DB and fetching them as per the required count; take a look at my answer for the details.

activemessaging with stomp and activemq.prefetchSize=1

I have a situation where I have a single activemq broker with 2 queues, Q1 and Q2. I have two ruby-based consumers using activemessaging. Let's call them C1 and C2. Both consumers subscribe to each queue. I'm setting activemq.prefetchSize=1 when subscribing to each queue. I'm also setting ack=client.
Consider the following sequence of events:
1) A message that triggers a long-running job is published to queue Q1. Call this M1.
2) M1 is dispatched to consumer C1, kicking off a long operation.
3) Two messages that trigger short jobs are published to queue Q2. Call these M2 and M3.
4) M2 is dispatched to C2 which quickly runs the short job.
5) M3 is dispatched to C1, even though C1 is still running M1. It's able to dispatch to C1 because prefetchSize=1 is set on the queue subscription, not on the connection. So the fact that a Q1 message has already been dispatched doesn't stop one Q2 message from being dispatched.
Since activemessaging consumers are single-threaded, the net result is that M3 sits and waits on C1 for a long time until C1 finishes processing M1. So, M3 is not processed for a long time, despite the fact that consumer C2 is sitting idle (since it quickly finishes with message M2).
Essentially, whenever a long Q1 job is run and then a whole bunch of short Q2 jobs are created, exactly one of the short Q2 jobs gets stuck on a consumer waiting for the long Q1 job to finish.
Is there a way to set prefetchSize at the connection level rather than at the subscription level? I really don't want any messages dispatched to C1 while it is processing M1. The other alternative is that I could create a consumer dedicated to processing Q1 and then have other consumers dedicated to processing Q2. But, I'd rather not do that since Q1 messages are infrequent--Q1's dedicated consumers would sit idle most of the day tying up memory.
The activemq.prefetchSize is only available on a SUBSCRIBE message, not a CONNECT, according to the ActiveMQ docs for their extended stomp headers (http://activemq.apache.org/stomp.html). Here is the relevant info:
verb: SUBSCRIBE
header: activemq.prefetchSize
type: int
description: Specifies the maximum number of pending messages that will be dispatched to the client. Once this maximum is reached no more messages are dispatched until the client acknowledges a message. Set to 1 for very fair distribution of messages across consumers where processing messages can be slow.
My reading of and experience with this is that since M1 has not been ack'd (because you have client ack turned on), M1 should be the one message allowed by prefetchSize=1 set on the subscription. I am surprised to hear that it didn't work, but perhaps I need to run a more detailed test. Your settings should be correct for the behavior you want.
I have heard of flakiness from others about the activemq dispatch, so it is possible this is a bug with the version you are using.
One suggestion I would have is to either sniff the network traffic to see if the M1 is getting ack'd for some reason, or throw some puts statements into the ruby stomp gem to watch the communication (this is what I usually end up doing when debugging stomp problems).
If I get a chance to try this out, I'll update my comment with my own results.
One suggestion: It is very possible that multiple long processing messages could be sent, and if the number of long processing messages exceeds your number of processes, you'll be in this fix where quick processing messages are waiting.
I tend to have at least one dedicated process that just does quick jobs, or to put it another way, dedicate a set number of processes to just the longer jobs. Having all poller consumer processes listen to both long and short destinations can end up with sub-optimal results no matter what dispatch does. Processor groups are the way to configure a consumer to listen to a subset of destinations: http://code.google.com/p/activemessaging/wiki/Configuration
processor_group name, *list_of_processors
A processor group is a way to run the poller to only execute a subset of the processors by passing the name of the group in the poller command line arguments.
You specify the name of the processor as its underscored lowercase version. So if you have a FooBarProcessor and BarFooProcessor in a processor group, it would look like this:
ActiveMessaging::Gateway.define do |s|
...
s.processor_group :my_group, :foo_bar_processor, :bar_foo_processor
end
The processor group is passed into the poller like the following:
./script/poller start -- process-group=my_group
I'm not sure if ActiveMessaging supports this, but you could unsubscribe your other consumers when the long-processing message arrives and then re-subscribe them after it gets processed.
It should give you the desired effect.
