Sending exception back to client thrown by a flume sink - hadoop

I am planning to use Flume with HTTPSource to upload data to HDFS. The sink will be configured to save data to a Hive/HBase table. If there is any exception or error while writing data to HDFS, can it be thrown back to the client?
HTTPSourceHandler throws an exception if it is unable to parse the data or unable to send it to the memory channel, but can an exception thrown by a sink be sent back to the client?

Generally, sources work as data producers and sinks as data consumers. This means sinks do not put any data into the channel, and sources do not take any data from it. Nevertheless, I think you could create custom components that act as both source and sink (never tested, just figuring out how such a thing could be done); in that case you could have two channels, one for each direction, and implement some kind of back communication.
In any case, if you expect to send back HTTP responses for every possible error in the workflow from the source to the sink, I would advise you to forget about it: once the source has put the data into the channel, there is no guarantee that the sink processes it immediately. It could take 1 second or 1 minute, because the channel behaves as a queue and may already hold a lot of earlier data. You do not want to implement that kind of synchronous communication, because new data arriving at the Flume agent would have to wait a long time.

Related

Spring Cloud Stream and sending messages to queue in batches

I'm sending messages to my message queue like this
messages.forEach(message ->
    sources.output().send(MessageBuilder.withPayload(message).build()));
Those messages come from an external source and there could be thousands of them.
I've seen the Splitter, but it requires an input channel and an output channel, whereas my messages are going into the queue for the first time: I'm only producing messages, not consuming them. I'm also not sure how the Aggregator would work, or whether it would be too complex for such a simple scenario.
So basically I'd like to be able to send those messages in batches, rather than one by one.
How could that be accomplished?
For something simple, you can collect the data (messages or just payloads) into a List, create a single Message with that List as its payload, and send it.
For a more configurable approach, you can also use the Spring Integration Aggregator.
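The simple approach can be sketched in plain Java as below. This is only an illustration under stated assumptions: BatchPartitioner and its partition method are hypothetical names, the batch size is arbitrary, and the actual send call is left as a comment that reuses the sources.output() call from the question.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Spring Cloud Stream.
class BatchPartitioner {

    // Splits items into consecutive batches of at most batchSize elements.
    static <T> List<List<T>> partition(List<T> items, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int start = 0; start < items.size(); start += batchSize) {
            int end = Math.min(start + batchSize, items.size());
            // Copy the slice so each batch is independent of the source list.
            batches.add(new ArrayList<>(items.subList(start, end)));
        }
        return batches;
    }

    public static void main(String[] args) {
        List<String> messages = List.of("a", "b", "c", "d", "e");
        for (List<String> batch : partition(messages, 2)) {
            // In the application, one message per batch would be sent here, e.g.
            // sources.output().send(MessageBuilder.withPayload(batch).build());
            System.out.println(batch);
        }
    }
}
```

The consumer then receives a List payload and has to iterate over it, so both sides need to agree on the batched format.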

Storm bolt following a kafka bolt

I have a Storm topology where I have to send output to kafka as well as update a value in redis. For this I have a Kafkabolt as well as a RedisBolt.
Below is what my topology looks like -
tp.setSpout("kafkaSpout", kafkaSpout, 3);
tp.setBolt("EvaluatorBolt", evaluatorBolt, 6).shuffleGrouping("kafkaStream");
tp.setBolt("ResultToRedisBolt",ResultsToRedisBolt,3).shuffleGrouping("EvaluatorBolt","ResultStream");
tp.setBolt("ResultToKafkaBolt", ResultsToKafkaBolt, 3).shuffleGrouping("EvaluatorBolt","ResultStream");
The problem is that both of the end bolts (Redis and Kafka) listen to the same stream from the preceding bolt (ResultStream), so each can fail independently. What I really need is to update the value in Redis only if the result has been successfully published to Kafka. Is there a way to get an output stream from a KafkaBolt that carries the messages successfully published to Kafka? I could then listen to that stream in my RedisBolt and act accordingly.
It is not currently possible, unless you modify the bolt code. You would likely be better off changing your design slightly, since doing extra processing after the tuple is written to Kafka has some drawbacks. If you write the tuple to Kafka and you fail to write to Redis, you will get duplicates in Kafka, since the processing will start over at the spout.
It might be better, depending on your use case, to write the result to Kafka, and then have another topology read the result from Kafka and write to Redis.
If you still need to be able to emit new tuples from the bolt, it should be pretty easy to implement. The bolt recently got the ability to add a custom Producer callback, so we could extend that mechanism.
See the discussion at https://github.com/apache/storm/pull/2790#issuecomment-411709331 for context.

AWS Lambda processing stream from DynamoDB

I'm trying to create a Lambda function that consumes a stream from a DynamoDB table. What is the best practice for handling data that was not processed because of errors during execution? For example, if my Lambda failed and I lost part of the stream, what is the best way to reprocess the lost data?
This is handled for you. DynamoDB Streams, like Kinesis Streams, will resend records until they have been successfully processed. When you are using Lambda to process the stream, that means successfully exiting the function. If there is an error and the function exits unexpectedly, the DynamoDB stream will simply resend the record that was being processed.
The good thing is that you are guaranteed at-least-once processing; however, there are some things you need to look out for. Like Kinesis Streams, DynamoDB Streams are guaranteed to process records in order. As a side effect, when a record fails to process, it is retried until it is successfully processed or it expires from the stream (possibly days), and no records behind it in the stream are processed in the meantime.
How you solve this depends on the needs of your application. If you need at-least-once processing but don't need to guarantee that all records are processed in order, I would just drop the records into an SQS queue and do the processing off the queue. SQS queues will also retry records that aren't successfully processed; however, unlike DynamoDB and Kinesis Streams, records will not block each other in the queue. If you encounter an error when transferring a record from the DynamoDB Stream to the SQS queue, you can just retry; however, this may introduce duplicates in the SQS queue.
If order is critical or duplicates can't be tolerated, you can use an SQS FIFO queue. SQS FIFO queues are similar to (Standard) SQS queues, except that they are guaranteed to deliver messages to the consumer in order and have a deduplication window (5 minutes) during which any duplicates added to the queue are discarded.
In both cases, when using SQS queues to process messages, you can set up a Dead Letter Queue to which messages are automatically sent if they fail to be processed N times.
TL;DR: Use SQS queues.
Updating this thread as all the existing answers are stale.
AWS Lambda now supports DLQs for synchronous stream reads from a DynamoDB table stream.
With this feature in mind, here is the flow that I would recommend:
Configure the event source mapping to include the DLQ ARN and set the retry-attempts count. After that many retries, the batch metadata is moved to the DLQ.
Set up an alarm on DLQ message visibility to get alerted about impacted records.
The DLQ message can be used to retrieve the impacted stream record using the KCL.
ProTip: you can use the "Bisect on Function Error" attribute (BisectBatchOnFunctionError on the event source mapping) to enable batch splitting. With this option, Lambda is able to narrow down the impacted record.
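To make the batch-splitting idea concrete, here is a small standalone simulation in Java. It is not Lambda's implementation, only a sketch of the general bisecting technique; the class BisectOnError, its methods, and the Predicate standing in for the function handler (true means the record was processed) are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical simulation of "split the batch on error"; not AWS code.
class BisectOnError {

    // Returns the records that still fail once they are alone in a batch.
    static <T> List<T> isolateFailures(List<T> batch, Predicate<T> handler) {
        List<T> failed = new ArrayList<>();
        bisect(batch, handler, failed);
        return failed;
    }

    private static <T> void bisect(List<T> batch, Predicate<T> handler, List<T> failed) {
        if (batch.isEmpty()) {
            return;
        }
        // A batch fails as a whole as soon as one record fails.
        boolean allOk = true;
        for (T entry : batch) {
            if (!handler.test(entry)) {
                allOk = false;
                break;
            }
        }
        if (allOk) {
            return;
        }
        if (batch.size() == 1) {
            // A single failing record cannot be split further.
            failed.add(batch.get(0));
            return;
        }
        // Retry each half separately. Healthy records may be processed
        // more than once, so the handler should be idempotent.
        int mid = batch.size() / 2;
        bisect(batch.subList(0, mid), handler, failed);
        bisect(batch.subList(mid, batch.size()), handler, failed);
    }

    public static void main(String[] args) {
        List<Integer> batch = List.of(1, 2, 3, 4, 5, 6, 7, 8);
        // Assume record 5 is the "poison" record.
        System.out.println(isolateFailures(batch, r -> r != 5));
    }
}
```

Because healthy records in a failing batch are retried along with the bad one, this approach only makes sense together with at-least-once-safe processing.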
DynamoDB Streams invokes the Lambda function for each event until the function successfully processes it (that is, until the code calls the success callback).
If an error occurs during execution, you need to handle it in code; otherwise the Lambda won't continue with the remaining messages in the stream.
If you need to process a message separately because of an error, you can push it to a dead letter queue (with Amazon SQS) and continue with the remaining items in the stream. You can then have separate logic to process the messages in this queue.

How to create unique messages to rabbitmq queue - spring-amqp

I am putting a message containing string data onto a RabbitMQ queue.
Message publishing is called as part of a service, and the service can be called with the same data (which goes to the queue) multiple times, so duplicated data in the queue is very likely.
This causes issues because the consumer code inserts the data into a table where it is the primary key. The consumer runs on 4 different nodes simultaneously, so different consumers can end up consuming the same data (from different messages).
I want to know if RabbitMQ publishing has any way to avoid message duplication.
I read that 'define a property "x-unique-message-code" to compare them is an easy and simple way', but I don't know how to do it.
I am using spring-amqp.
Any help is highly appreciated.
Thank you
There is a good article from RabbitMQ about reliability: https://www.rabbitmq.com/reliability.html
There is a note like:
In the event of network failure (or a node crashing), messages can be duplicated, and consumers must be prepared to handle them. If possible, the simplest way to handle this is to ensure that your consumers handle messages in an idempotent way rather than explicitly deal with deduplication.
For this purpose, the message you publish can be given a messageId property.
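A minimal sketch of an idempotent consumer keyed on such a message id is shown below, in plain Java without any RabbitMQ dependency. The class IdempotentConsumer and its handle method are hypothetical, and the in-memory set is only an assumption for illustration: with consumers on several nodes, the set of seen ids would have to live in a shared store, or the database's primary key constraint could play that role.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Consumer;

// Hypothetical wrapper; not part of spring-amqp.
class IdempotentConsumer {

    // Ids of messages already handled by this process (in-memory only).
    private final Set<String> seen = ConcurrentHashMap.newKeySet();
    private final Consumer<String> delegate;

    IdempotentConsumer(Consumer<String> delegate) {
        this.delegate = delegate;
    }

    // Returns true if the payload was processed, false if it was a duplicate.
    boolean handle(String messageId, String payload) {
        // add() returns false when the id is already present.
        if (!seen.add(messageId)) {
            return false;
        }
        try {
            delegate.accept(payload);
            return true;
        } catch (RuntimeException e) {
            // Forget the id so a redelivery can be processed again.
            seen.remove(messageId);
            throw e;
        }
    }

    public static void main(String[] args) {
        IdempotentConsumer consumer = new IdempotentConsumer(System.out::println);
        consumer.handle("id-1", "first delivery");
        consumer.handle("id-1", "redelivery of the same message");
    }
}
```

If the payload itself is the primary key, the same string could serve as the id, so that two messages carrying the same data are treated as duplicates.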

Exactly-once guarantee in Storm Trident in network partitioning and/or failure scenarios

So, Apache Storm + Trident provide the exactly-once semantics. Imagine I have the following topology:
TridentSpout -> SumMoneyBolt -> SaveMoneyBolt -> Persistent Storage.
SumMoneyBolt sums monetary values in memory, then passes the result to SaveMoneyBolt, which should save the final value to a remote storage/database.
Now it is very important that we calculate these values and store only once to the database. We do not want to accidentally double count the money.
So how does Storm with Trident handle network partitioning and/or failure scenarios when the write request to the database has been successfully sent, the database has successfully received the request, logged the transaction, and while responding to the client, the SaveMoneyBolt has either died or partitioned from the network before having received the database response?
I assume that if SaveMoneyBolt had died, Trident would retry the batch, but we cannot afford double counting.
How are such scenarios handled?
Thanks.
Trident assigns a unique transaction id (txid) to each batch. If a batch is retried, it has the same txid. Batch updates are also ordered, i.e. the state update for a batch does not happen until the update for the previous batch is complete. So by storing the txid along with the values in the state, Trident can de-duplicate the updates and provide exactly-once semantics.
Trident comes with a few built-in Map state implementations which handle all of this automatically.
For more information take a look at the docs :
http://storm.apache.org/releases/1.0.1/Trident-tutorial.html
http://storm.apache.org/releases/current/Trident-state.html
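The txid idea can be sketched without Storm as follows. This is a simplified illustration, not Trident's code: TransactionalSumState and its methods are hypothetical, a HashMap stands in for the remote database, and it assumes that a replayed batch carries exactly the same contents and that the value and txid are written together atomically.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory stand-in for a transactional state store.
class TransactionalSumState {

    // The stored value together with the txid of the batch that last updated it.
    private static final class Entry {
        final long txid;
        final long value;

        Entry(long txid, long value) {
            this.txid = txid;
            this.value = value;
        }
    }

    private final Map<String, Entry> store = new HashMap<>();

    // Adds delta to the value for key unless this txid has already been applied.
    void applyBatch(long txid, String key, long delta) {
        Entry current = store.get(key);
        if (current != null && current.txid == txid) {
            // Replay of the batch that was applied last: skip to avoid double counting.
            return;
        }
        long base = (current == null) ? 0L : current.value;
        store.put(key, new Entry(txid, base + delta));
    }

    long get(String key) {
        Entry current = store.get(key);
        return (current == null) ? 0L : current.value;
    }

    public static void main(String[] args) {
        TransactionalSumState state = new TransactionalSumState();
        state.applyBatch(1, "account", 100);
        state.applyBatch(1, "account", 100); // retried batch, same txid
        state.applyBatch(2, "account", 50);
        System.out.println(state.get("account"));
    }
}
```

In the failure scenario from the question, the retried batch arrives with the txid that the database has already recorded, so the second write is recognized as a replay and skipped.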
