How to detect that Spark stream processed all messages? - spark-streaming

At some point of time I know that no more messages are coming to the stream. How can I determine that all messages are processed by the Spark stream?
I want to use that information to (1) shut down the cluster, (2) to send a "job done" event to other parties.
Details:
At the moment I don't count the incoming messages yet.
A processed message results in a file in an S3 bucket.

Related

"Amazon Kinesis Data Streams" not completely realtime stream?

I expect Kinesis data stream (and succeeding process such as KDS firehose) to send data sequentially in real time.
However, when I check the data sequence, it does not seem what I expected.
I get audio data by pyaudio, and send data to KDS by while loop. Its sequence is continuous.
However when I check the cloudwatch log of lambda function (that is triggered when one data arrived at KDS), its sequence number is not continuous.
Could someone tell me how it happens and how to make it sequential?

AWS Lambda processing stream from DynamoDB

I'm trying to create a lambda function that is consuming a stream from dynamoDB table. However I was wondering which is the best practice to handle data that may not have been processed for some errors during the execution? For example my lambda failed and I lost part of the stream, which is the best way to reprocess the lost data?
This is handled for you. DynamoDB Streams, like Kinesis Streams, will resend records until they have been successfully processed. When you are using Lambda to process the stream, that means successfully exiting the function. If there is an error and the function exits unexpectedly, the DynamoDB stream will simply resend the record that was being processing.
The good thing is you are guaranteed at-least-once processing however, there are some things you need to look out for. Like Kinesis Streams, DynamoDB Streams are guaranteed to processes records in order. As a side effect of this, when a record fails to process, it is retried until it is successfully processed or it expires from the stream (possibly days) before processing any records behind it in the stream.
How you solve for this depends on the needs of your application. If you need at-least-once processing but don't need to guarantee that all records are processed in order, I would just drop the records into an SQS queue and do the processing off of the queue. SQS queues will also retry records that aren't successfully processed however, unlike DynamoDB and Kinesis Streams, records will not block each other in the queue. If you encounter an error when transferring a record from the DynamoDB Stream to the SQS Queue, you can just retry however, this may introduce duplicates in the SQS Queue.
If order is critical or duplicates can't be tolerated, you can use a SQS FIFO Queue. SQS FIFO Queues are similar to (Standard) SQS Queues except they they are guaranteed to deliver messages to the consumer in order and have a deduplication window (5 mins) where any duplicates added to the queue within that window will be discarded.
In both cases, when using SQS queues to process messages, you can setup a Dead Letter Queue where messages can automatically be sent if they fail to be processed N number of times.
TLDR: Use SQS Queues.
Updating this thread as all the existing answers are stale.
AWS Lambda now supports the DLQs for synchronous steam read from DynamoDB table stream.
With this feature in context, here is the flow that I would recommend:
Configure the event source mapping to include the DLQ arns and set the retry-attempts count. After these many retry, the batch metadata would then be moved to DLQs.
Set-up alarm on DLQ message visibility to get alert on impacted records.
DLQ message can be used to retrieve the impacted stream record using KCL library
ProTip: you can use attribute "Bisect on Function Error" to enable batch splitting. With this option, lambda would be able to narrow down on the impacted record.
DynamoDB Streams invokes the Lambda function for each event untill it successfully processes it (Untill the code calls success callback).
In an error situation while executing, you need to handle it in code unless otherwise the Lambda won't continue with the remaining messages in the stream.
If there is a situation where you need to process the message separate due to an error, you can use the dead letter queue (with Amazon SQS) to push the message and continue with the remaining items in the stream. You can have a separate logic to process the messages in this queue.

AWS SQS - Queue not delivering any messages until Visibility Timeout expires for one message

EDIT: Solved this one while I was writing it up :P -- I love those kind of solutions. I figured I'd post it anyway, maybe someone else will have the same problem and find my solution. Don't care about points/karma, etc. I just already wrote the whole thing up, so figured I'd post it and the solution.
I have an SQS FIFO queue. It is using a dead letter queue. Here is how it had been configured:
I have a single producer microservice, and I have 10 ECS images that are running as consumers.
It is important that we process the messages close to the time they are delivered in the queue for business reasons.
We're using a fairly recent version of the AWS SDK Golang client package for both producer and consumer code (if important, I can go look up the version, but it is not terribly outdated).
I capture the logs for the producer so I know exactly when messages were put in the queue and what the messages were.
I capture aggregate logs for all the consumers, so I have a full view of all 10 consumers and when messages were received and processed.
Here's what I see under normal conditions looking at the logs:
Message put in the queue at time x
Message received by one of the 10 consumers at time x
Message processed by consumer successfully
Message deleted from queue by consumer at time x + (0-2 seconds)
Repeat ad infinitum for up to about 700 messages / day at various times per day
But the problem I am seeing now is that some messages are not being processed in a timely manner. Occasionally we fail processing a message deliberately b/c of the state of the system for that message (e.g. maybe users still logged in, so it should back off and retry...which it does). The problem is if the consumer fails a message it is causing the queue to stop delivering any other messages to any other consumers.
"Failure to process a message" here just means the message was received, but the consumer declared it a failure, so we just log an error, and do not proceed to delete it from the queue. Thus, the visibility timeout (here 5m) will expire and it will be re-delivered to another consumer and retried up to 10 times, after which it will go to the dead letter queue.
After delving into the logs and analyzing it, here's what I'm seeing:
Process begins like above (message produced, consumed, deleted).
New message received at time x by consumer
Consumer fails -- logs error and just returns (does not delete)
Same message is received again at time x + 5m (visibility timeout)
Consumer fails -- logs error and just returns (does not delete)
Repeat up to 10x -- message goes to dead-letter queue
New message received but it is now 50 minutes late!
Now all messages that were put in the queue between steps 2-7 are 50 minutes late (5m visibility timeout * 10 retries)
All the docs I've read tells me the queue should not behave this way, but I've verified it several times in our logs. Sadly, we don't have a paid AWS support plan, or I'd file a ticket with them. But just consider the fact that we have 10 separate consumers all reading from the same queue. They only read from this queue. We don't have any other queues it is using.
For de-duplication we are using the automated hash of the message body. Messages are small JSON documents.
My expectation would be if we have a single bad message that causes a visibility timeout, that the queue would still happily deliver any other messages it has available while there are available consumers.
OK, so turns out I missed this little nugget of info about FIFO queues in the documentation:
https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/FIFO-queues.html
When you receive a message with a message group ID, no more messages
for the same message group ID are returned unless you delete the
message or it becomes visible.
I was indeed using the same Message Group ID. Hadn't given it a second thought. Just be aware, if you do that and any one of your messages fails to process, it will back up all other messages in the queue, until the time that the message is finally dealt with. The solution for me was to change the message group id. There is some business logic id I can postfix on it that will work for me.

Sending exception back to client thrown by a flume sink

I am planning to use Flume with HTTPSource to upload data to HDFS. The sink will be configured to save data to Hive/Hbase table. If there is any excpetion/error writing data to HDFS, can it be thrown back to the client?
HTTPSourceHandler throws exception if it is unable to parse the data or if unable to send data to memory channel, but can an exception thrown by a sink be sent back to client?
Generally, sources work as data producers and sinks as data consumers. This means the sinks will not put any data into the channel, and the sources will not get any data from the channel. Nevertheless, I think you can create (never tested, just figuring out how to do such a thing) custom sources and sinks that work both as sources and sinks; in that case you could have 2 channels, one for each direction, and perform some kind of back communication.
In any case, if you are expecting to send back Http responses about all the possible errors regarding the workflow from the source to the sink, I would say you to forget about that: once the data has been put into the channel by the source, there is no guarantee such a data is inmediately process by the sink; it could take 1 second or 1 minute to be processed (the channel, which behaves as a queue, may have a lot of previous data). I mean, you do not want to implement that kind of synchronous communications because the new data arriving to the Flume agent will have to wait a lot.

Logstash availability when elasticsearch dies/can't write to disk

I have rsyslog forwarding logs to logstash via TCP. If logstash is not available rsyslog will build up queues.
In the event that logstash is available, but elasticsearch is dead or for some reason cannot write to the file system.
Is there a way for logstash to reject further TCP messages.
Thanks
According to life of an event description:
An output can fail or have problems because of some downstream cause, such as full disk, permissions problems, temporary network failures, or service outages. Most outputs should keep retrying to ship any events that were involved in the failure.
If an output is failing, the output thread will wait until this output is healthy again and able to successfully send the message. Therefore, the output queue will stop being read from by this output and will eventually fill up with events and block new events from being written to this queue.
A full output queue means filters will block trying to write to the output queue. Because filters will be stuck, blocked writing to the output queue, they will stop reading from the filter queue which will eventually cause the filter queue (input -> filter) to fill up.
A full filter queue will cause inputs to block when writing to the filters. This will cause each input to block, causing each input to stop processing new data from wherever that input is getting new events.
This means that if the elasticsearch output starts to fail then the entire pipeline will be blocked which is what you want in your case. Are you seeing something different?

Resources