Logstash availability when elasticsearch dies/can't write to disk - elasticsearch

I have rsyslog forwarding logs to logstash via TCP. If logstash is not available rsyslog will build up queues.
If logstash is available but elasticsearch is dead, or for some reason cannot write to the file system, is there a way for logstash to reject further TCP messages?
Thanks

According to the "life of an event" description:
An output can fail or have problems because of some downstream cause, such as full disk, permissions problems, temporary network failures, or service outages. Most outputs should keep retrying to ship any events that were involved in the failure.
If an output is failing, the output thread will wait until this output is healthy again and able to successfully send the message. Therefore, the output queue will stop being read from by this output and will eventually fill up with events and block new events from being written to this queue.
A full output queue means filters will block trying to write to the output queue. Because filters will be stuck, blocked writing to the output queue, they will stop reading from the filter queue which will eventually cause the filter queue (input -> filter) to fill up.
A full filter queue will cause inputs to block when writing to the filters. This will cause each input to block, causing each input to stop processing new data from wherever that input is getting new events.
This means that if the elasticsearch output starts to fail then the entire pipeline will be blocked which is what you want in your case. Are you seeing something different?
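The blocking chain described above can be illustrated with a toy model. This is only a sketch of the idea, not Logstash code: the simulate function, the queue capacities, and the one-event-per-tick scheduling are all assumptions made for illustration.

```python
from collections import deque


def simulate(events, filter_capacity, output_capacity, output_healthy):
    """Toy model of input -> filter queue -> output queue backpressure.

    Returns how many events the input accepted before it had to block.
    Capacities and the per-tick scheduling are illustrative assumptions.
    """
    filter_queue = deque()
    output_queue = deque()
    accepted = 0
    for event in events:
        # The output only drains its queue while the downstream is healthy.
        if output_healthy and output_queue:
            output_queue.popleft()
        # The filter moves one event forward if the output queue has room.
        if filter_queue and len(output_queue) < output_capacity:
            output_queue.append(filter_queue.popleft())
        # The input accepts a new event only if the filter queue has room.
        if len(filter_queue) < filter_capacity:
            filter_queue.append(event)
            accepted += 1
        else:
            break  # input is blocked; the sender sees backpressure
    return accepted
```

With a failing output, the input stops accepting events once both queues are full; with a healthy output it keeps accepting everything it is given.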

Related

Splitting SQS Lambda batch into partial success/partial failure

The AWS SQS -> Lambda integration allows you to process incoming messages in a batch, where you configure the maximum number you can receive in a single batch. If you throw an exception during processing to indicate failure, none of the messages are deleted from the incoming queue, and they can be picked up by another Lambda for processing once the visibility timeout has passed.
Is there any way to keep the batch processing, for performance reasons, but allow some messages from the batch to succeed (and be deleted from the inbound queue) and only leave some of the batch un-deleted?
The problem with manually re-enqueueing the failed messages to the queue is that you can get into an infinite loop where those items perpetually fail and get re-enqueued and fail again. Since they are being resent to the queue their retry count gets reset every time which means they'll never fail out into a dead letter queue. You also lose the benefits of the visibility timeout. This is also bad for monitoring purposes since you'll never be able to know if you're in a bad state unless you go manually check your logs.
A better approach would be to manually delete the successful items and then throw an exception to fail the rest of the batch. The successful items will be removed from the queue, all the items that actually failed will hit their normal visibility timeout periods and retain their receive count values, and you'll be able to actually use and monitor a dead letter queue. This is also overall less work than the other approach.
Considerations
Only override the default behavior if there has been a partial batch failure. If all the items succeeded, let the default behavior take its course
Since you're tracking the failures of each queue item, you'll need to catch and log each exception as they come in so that you can see what's going on later
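A minimal sketch of this "delete the successes, then fail the batch" approach. The process_message function and the injected delete_batch callable are hypothetical names used for illustration; in a real function delete_batch would typically wrap boto3's delete_message_batch, which accepts at most 10 entries per call.

```python
import logging

logger = logging.getLogger(__name__)


def process_message(body):
    """Hypothetical business logic; raises an exception on failure."""
    if body == "bad":
        raise ValueError("cannot process message")


def handle_batch(event, delete_batch):
    """Process an SQS batch; on partial failure delete successes and raise.

    delete_batch is an assumed callable taking a list of
    {"Id": ..., "ReceiptHandle": ...} entries, for example a wrapper around
    sqs.delete_message_batch(QueueUrl=..., Entries=entries).
    """
    succeeded, failed = [], []
    for record in event["Records"]:
        try:
            process_message(record["body"])
            succeeded.append(record)
        except Exception:
            # Log every failure so it can be investigated later.
            logger.exception("Failed to process %s", record["messageId"])
            failed.append(record)

    if not failed:
        return  # whole batch succeeded; let the default deletion happen

    if succeeded:
        delete_batch([
            {"Id": r["messageId"], "ReceiptHandle": r["receiptHandle"]}
            for r in succeeded
        ])
    # Failing the invocation leaves the undeleted messages on the queue.
    raise RuntimeError(f"{len(failed)} of {len(event['Records'])} messages failed")
```

Injecting the delete callable keeps the sketch independent of any particular client setup and makes the partial-failure path easy to exercise locally.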
I recently encountered this problem, and the best way to handle it without writing any deletion code on our side is to use the FunctionResponseTypes property of EventSourceMapping. With it, we just have to return the list of failed message IDs, and the event source takes care of deleting the successful messages.
Please check out "Using SQS and Lambda".
CloudFormation template to configure the event source for Lambda (the FunctionResponseTypes entry is the important part):
"FunctionEventSourceMapping": {
    "Type": "AWS::Lambda::EventSourceMapping",
    "Properties": {
        "BatchSize": "100",
        "Enabled": "True",
        "EventSourceArn": {"Fn::GetAtt": ["SQSQueue", "Arn"]},
        "FunctionName": "FunctionName",
        "MaximumBatchingWindowInSeconds": "100",
        "FunctionResponseTypes": ["ReportBatchItemFailures"]
    }
}
Once the event source is configured this way, we just have to return a response in the following format from our Lambda:
{"batchItemFailures": [{"itemIdentifier": "85f26da9-fceb-4252-9560-243376081199"}]}
Provide the list of failed message IDs in the batchItemFailures list.
If your Lambda runtime is Python, return a dict in the format above; for a Java-based runtime you can use aws-lambda-java-events.
Sample Python code
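A minimal sketch of a handler that returns this format, assuming the event source mapping has ReportBatchItemFailures enabled. The process_message function is a hypothetical stand-in for the real per-message work.

```python
def process_message(body):
    """Hypothetical per-message logic; raises an exception on failure."""
    if body == "bad":
        raise ValueError("cannot process message")


def lambda_handler(event, context):
    """Report only the failed message IDs back to the event source."""
    failures = []
    for record in event["Records"]:
        try:
            process_message(record["body"])
        except Exception:
            # Only the IDs listed here stay on the queue for a retry.
            failures.append({"itemIdentifier": record["messageId"]})
    # An empty list means the whole batch succeeded.
    return {"batchItemFailures": failures}
```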
Advantages of this approach are
You don't have to add any code to manually delete the message from SQS queue
You don't have to include any third-party library or boto just to delete messages from the queue, which helps reduce your final artifact size.
Keep it simple and stupid
On a side note, make sure your Lambda has the required permissions on SQS to get and delete messages.
Thanks
One option is to manually send the failed messages back to the queue and then reply to SQS with a success, so that there are no duplicates.
You could do something like setting up a fail count, so that if all messages failed you can simply return a failed status for all messages, otherwise if the fail count is < 10 (10 being the max batch size you can get from SQS -> Lambda event) then you can individually send back the failed messages to the queue, and then reply with a success message.
Additionally, to avoid any possible infinite retry loop, add a property to the event such as a "retry" count before sending it back to the queue, and drop the event when "retry" is greater than X.
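The retry counter could be sketched roughly as follows. It assumes the message body is a JSON object; MAX_RETRIES (the "X" above) and the injected send callable, which might wrap an SQS send_message call, are illustrative names.

```python
import json

MAX_RETRIES = 3  # assumed value for "X"


def requeue_or_drop(body, send):
    """Re-send a failed message with an incremented "retry" count.

    Returns True if the message was re-sent, False if it was dropped
    because it has already been retried MAX_RETRIES times.
    """
    message = json.loads(body)
    retries = message.get("retry", 0) + 1
    if retries > MAX_RETRIES:
        return False  # give up instead of looping forever
    message["retry"] = retries
    send(json.dumps(message))
    return True
```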

AWS Lambda processing stream from DynamoDB

I'm trying to create a Lambda function that consumes a stream from a DynamoDB table. However, I was wondering what the best practice is for handling data that may not have been processed because of errors during execution. For example, if my Lambda failed and I lost part of the stream, what is the best way to reprocess the lost data?
This is handled for you. DynamoDB Streams, like Kinesis Streams, will resend records until they have been successfully processed. When you are using Lambda to process the stream, that means successfully exiting the function. If there is an error and the function exits unexpectedly, the DynamoDB stream will simply resend the record that was being processed.
The good thing is that you are guaranteed at-least-once processing; however, there are some things you need to look out for. Like Kinesis Streams, DynamoDB Streams are guaranteed to process records in order. As a side effect, when a record fails to process, it is retried until it is successfully processed or it expires from the stream (possibly days later), and no records behind it in the stream are processed in the meantime.
How you solve for this depends on the needs of your application. If you need at-least-once processing but don't need to guarantee that all records are processed in order, I would just drop the records into an SQS queue and do the processing off of the queue. SQS queues will also retry records that aren't successfully processed; however, unlike DynamoDB and Kinesis Streams, records will not block each other in the queue. If you encounter an error when transferring a record from the DynamoDB Stream to the SQS Queue, you can just retry; however, this may introduce duplicates in the SQS Queue.
If order is critical or duplicates can't be tolerated, you can use an SQS FIFO Queue. SQS FIFO Queues are similar to (Standard) SQS Queues except that they are guaranteed to deliver messages to the consumer in order and have a deduplication window (5 mins) where any duplicates added to the queue within that window will be discarded.
In both cases, when using SQS queues to process messages, you can setup a Dead Letter Queue where messages can automatically be sent if they fail to be processed N number of times.
TLDR: Use SQS Queues.
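As a rough sketch of the FIFO variant, the function below turns DynamoDB stream records into entries suitable for an SQS send_message_batch call. Using the record's eventID as the deduplication ID and a hash of the item keys as the message group ID is an assumption about how you might want ordering and deduplication to work; to_fifo_entries is a hypothetical helper name.

```python
import hashlib
import json


def to_fifo_entries(stream_event):
    """Build SQS FIFO batch entries from a DynamoDB stream event.

    Assumptions: records for the same item key must stay ordered, so the
    hashed key becomes the MessageGroupId; the stream's eventID is used as
    the MessageDeduplicationId so a retried transfer is discarded within
    the deduplication window. send_message_batch accepts at most 10
    entries per call, so larger batches need to be chunked by the caller.
    """
    entries = []
    for index, record in enumerate(stream_event["Records"]):
        keys = json.dumps(record["dynamodb"]["Keys"], sort_keys=True)
        entries.append({
            "Id": str(index),
            "MessageBody": json.dumps(record["dynamodb"], sort_keys=True),
            "MessageGroupId": hashlib.sha256(keys.encode("utf-8")).hexdigest(),
            "MessageDeduplicationId": record["eventID"],
        })
    return entries
```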
Updating this thread as all the existing answers are stale.
AWS Lambda now supports DLQs for synchronous stream reads from a DynamoDB table stream.
With this feature in context, here is the flow that I would recommend:
Configure the event source mapping to include the DLQ ARNs and set the retry-attempts count. After that many retries, the batch metadata is moved to the DLQ.
Set up an alarm on DLQ message visibility to get alerted about impacted records.
The DLQ message can be used to retrieve the impacted stream record using the KCL library.
ProTip: you can use attribute "Bisect on Function Error" to enable batch splitting. With this option, lambda would be able to narrow down on the impacted record.
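The settings above can be sketched as the keyword arguments for boto3's create_event_source_mapping call. The helper name build_stream_mapping, the default of 3 retries, and the LATEST starting position are assumptions for illustration.

```python
def build_stream_mapping(function_name, stream_arn, dlq_arn, max_retries=3):
    """Build keyword arguments for lambda create_event_source_mapping.

    The retry count and starting position are illustrative defaults.
    """
    return {
        "FunctionName": function_name,
        "EventSourceArn": stream_arn,
        "StartingPosition": "LATEST",
        # Stop retrying a failing batch after this many attempts.
        "MaximumRetryAttempts": max_retries,
        # Split a failing batch in two to narrow down the bad record.
        "BisectBatchOnFunctionError": True,
        # Batch metadata for exhausted retries is sent to this queue.
        "DestinationConfig": {"OnFailure": {"Destination": dlq_arn}},
    }


# Possible usage, assuming boto3 is installed and credentials are configured:
#   import boto3
#   boto3.client("lambda").create_event_source_mapping(
#       **build_stream_mapping("my-function", stream_arn, dlq_arn))
```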
DynamoDB Streams invokes the Lambda function for each event until it successfully processes it (until the code calls the success callback).
If an error occurs during execution, you need to handle it in code; otherwise the Lambda won't continue with the remaining messages in the stream.
If there is a situation where you need to process a message separately due to an error, you can use a dead letter queue (with Amazon SQS) to push the message and continue with the remaining items in the stream. You can have separate logic to process the messages in this queue.

Re-queue a Logstash event

Is it possible to re-queue a Logstash event to be processed again in a bit of time?
In order to illustrate what I mean, I will explain my use case: I have a custom Logstash filter that extracts the application version from logs at the start of an application, and then appends the correct version to every log event. However, in the very beginning, race conditions can occur where an application version has not yet been written to a file, and yet the Logstash filter tries to read in the data anyway (since it is processing log lines concurrently). This results in an application version that is null. In case it matters, Logstash gets its input from Filebeat.
I would like to re-queue these events to be re-processed entirely a couple seconds (or milliseconds) from now, when the application version has been saved to the disk.
However this leads me to a broader question, which is, can you tell a Logstash event to be re-queued, or is there an alternative solution to this scenario?
Thanks for the help!
Process the data and append it to a new file, then use that file for further processing.
Logstash Processor 1: get the data, process it, and append it to a file.
Logstash Processor 2: get the data from the second file and do whatever you want to do.

Logstash: Output: Success or Failure Condition handling and Email trigge

Task:
Pulling data from SQLServer and Pushing records to Elasticsearch.
Achieving this through triggering the logstash cmd after certain upstream conditional triggers finished.
Planning to do this by cmd.exe in c#.net process. Any better way to achieve?
Scenario to handle:
Need to send email if data transfer completed successfully.
Need to send email if unsuccessful and also perform some event.
Unsuccessful Condition: Could be anything like server not available/disk full.
Also can we capture the last record transferred to Elasticsearch in the same consecutive request in case of failure and trigger some event?
The first two are very important.
I am also facing an issue when ES is stopped: the logs/command window output show "dead ES instance", but LS doesn't stop and keeps waiting for ES. How can I make it terminate when there is no response after, say, 5 attempts by LS to reach ES?
Logstash usually isn't a batch command to be triggered.
If your data is in SQLServer, try connecting with the JDBC input.

Best practices to handle errors in NIFI

I'm using NiFi, and I have data flows where I use the following processors:
ExecuteScript
RouteOnAttribute
FetchDistributedMapCache
InvokeHTTP
EvaluateJsonPath
and a two-level process group hierarchy: NIFI FLOW >>> Process group 1 >>> Process group 2. My question is how to handle errors in this case. I have created an output port for each processor to send errors outside the process group, and in the NiFi flow I have added a funnel for each error type and then put all the caught errors in HBase so I can do some reporting later on. As you can imagine, this adds multiple relationships, and my simple dataflow is starting to become less readable.
My questions are: what are the best practices for handling errors in processors, and what is the best approach to error reporting using NiFi (email or PDF)?
It depends on the errors you routinely encounter. Some processors may fail to perform a task (an expected but not desired outcome), and route the failed flowfile to REL_FAILURE, a specific relationship which can be connected to a processor to handle these failures, or back to the same processor to be retried. Others (or the same processors in different scenarios) may encounter exceptions, which are unexpected occurrences which cannot be resolved by the processor.
An example of this is PutKafka vs. EncryptContent. If the remote Kafka system is temporarily unavailable, the processor would fail to send the flowfile content. However, retrying after some delay period could be successful if the remote system is once again available. However, decrypting cipher text with the wrong key will always throw an exception, no matter how many times it is attempted or how long the retry delay is.
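The distinction between retryable failures and unrecoverable exceptions can be sketched in plain Python. This is a generic illustration and not NiFi API: TransientError, PermanentError, and run_with_retry are hypothetical names, and the returned "success"/"failure" labels only mimic the idea of routing to a relationship.

```python
import time


class TransientError(Exception):
    """A failure that may succeed later, e.g. a remote system being down."""


class PermanentError(Exception):
    """A failure that will never succeed, e.g. decrypting with the wrong key."""


def run_with_retry(task, max_attempts=3, delay_seconds=0.0, sleep=time.sleep):
    """Run task, retrying transient failures and failing fast on permanent ones.

    Returns ("success", result) or ("failure", error).
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return "success", task()
        except PermanentError as error:
            return "failure", error  # retrying cannot help
        except TransientError as error:
            last_error = error
            if attempt < max_attempts:
                sleep(delay_seconds)  # wait before the next attempt
    return "failure", last_error
```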
Many users route the errors to PutEmail processor and report them to a specific user/group who can evaluate the errors and monitor the data flow if necessary. You can also use "Reporting Tasks" to monitor metrics or ingest provenance data as operational data and route that to email/offline storage, etc. to run analytics on it.
