Shut down Logstash when using pipeline-to-pipeline communication

I would like to refactor my Logstash pipelines using one of the pipeline-to-pipeline architecture patterns (the forked path pattern). The input of the upstream pipeline is a one-time query to an Elasticsearch cluster, using the elasticsearch input plugin. The number of events sent downstream is therefore finite.
However, when the downstream pipelines have consumed and processed all the inputs, Logstash does not shut down, as it does when I don't use pipeline-to-pipeline communication.
Is this the expected behaviour? Is there a way to shut down Logstash once all the downstream pipelines have processed all the events coming from upstream?
Thanks
(I originally asked this question on discuss.elastic.co, but I got no reply)
Update 26.06.2020
To date, I have not found a working solution. I asked the same question at the latest Elastic{ON}, but they couldn't give me an answer. So here I am again, asking the community!
Looking at the Logstash code, I don't see any parameter to request Logstash termination once (1) the upstream pipeline has sent out all the events and (2) all downstream pipelines have consumed all the sent events.
If this is the intended behaviour, do you know what the rationale behind it is, and how I can work around it?
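For reference, here is a minimal sketch of the forked path setup described above, assuming a pipelines.yml layout; the pipeline IDs, hosts, index, query and the field used in the fork condition are placeholders, not taken from the question:

```yaml
# pipelines.yml -- forked path pattern: one upstream pipeline reading from
# Elasticsearch, two downstream pipelines selected by a conditional.
- pipeline.id: upstream
  config.string: |
    input {
      elasticsearch {
        hosts => ["http://localhost:9200"]   # placeholder
        index => "source-index"              # placeholder
        query => '{ "query": { "match_all": {} } }'
      }
    }
    output {
      if [level] == "error" {                # placeholder fork condition
        pipeline { send_to => ["errors"] }
      } else {
        pipeline { send_to => ["regular"] }
      }
    }
- pipeline.id: errors
  config.string: |
    input  { pipeline { address => "errors" } }
    output { elasticsearch { hosts => ["http://localhost:9200"] index => "errors-index" } }
- pipeline.id: regular
  config.string: |
    input  { pipeline { address => "regular" } }
    output { elasticsearch { hosts => ["http://localhost:9200"] index => "regular-index" } }
```

With a layout like this, the upstream pipeline's one-time query eventually completes, yet Logstash as a whole keeps running, which is the behaviour the question describes.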

Related

Why use Beats if I can post directly to Elasticsearch?

Recently I have been reading up on the Elastic Stack and came across Beats, which are basically lightweight shippers.
So the question is: if my service can post directly to Elasticsearch, do I actually need Beats? From what I have seen, it's essentially just a proxy (?)
Hopefully my question is clear enough.
Not sure which Beat you are specifically referring to, but let's take Filebeat as an example.
Suppose application logs need to be indexed into Elasticsearch. The options are:
Option 1: Post the logs directly to Elasticsearch.
Option 2: Save the logs to a file, then use Filebeat to index them.
Option 3: Publish the logs to a message broker such as RabbitMQ or Kafka, then use Logstash input plugins to read from the broker and index into Elasticsearch.
Benefits of Option 2:
Filebeat ensures that each log message is delivered at least once. It achieves this by storing the delivery state of each event in its registry file. If the configured output is blocked and has not acknowledged all events, Filebeat keeps retrying until the output confirms that it has received them.
Before shipping data to Elasticsearch, we can do some additional processing or filtering, for example dropping log lines that contain certain text, or adding fields (e.g. an application name on every log line, so that multiple applications can share a single index and consumers can filter by application name).
Essentially, Beats provide a reliable way of indexing data without much overhead on the system, since they are lightweight shippers.
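As a rough illustration of Option 2, here is a minimal filebeat.yml sketch; the paths, field values and the drop condition are placeholders, not taken from the question:

```yaml
filebeat.inputs:
  - type: filestream                 # "log" on older Filebeat versions
    paths:
      - /var/log/my-app/*.log        # placeholder path
    fields:
      app_name: my-app               # extra field so several apps can share one index
    fields_under_root: true

processors:
  - drop_event:                      # example filter: drop noisy lines
      when:
        contains:
          message: "healthcheck"     # placeholder match text

output.elasticsearch:
  hosts: ["http://localhost:9200"]   # placeholder
```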
Option 3 provides the same benefits as Option 2. It is more useful when we want to ship logs directly to an external system instead of storing them in a local file, for example for applications deployed in Docker/Kubernetes, where we may not have much access to the local filesystem or enough space to store files there.
Beats are good as lightweight agents for collecting streaming data like log files, OS metrics, etc., where you need some sort of agent to collect and send it. If you have a service that wants to put data into Elasticsearch, then yes, by all means it can use the REST or Java (etc.) APIs directly.
Filebeat offers a way to centralize live logs from multiple servers.
Let's say you are running multiple instances of an application on different servers and they are writing logs.
You can ship all these logs to a single Elasticsearch index and analyze or visualize them from there.
A single static file doesn't need Filebeat to be moved into Elasticsearch.

Handling Kafka client updates in Kubernetes

I have a Kafka cluster running on AWS MSK, with Kafka producer and consumer Go clients running in Kubernetes. The producer is responsible for sending the stream of data to Kafka. I need help solving the following problems:
Let's say there is a code change in the producer and I have to redeploy it in Kubernetes. How can I do that? Since the data is generated continuously, I cannot simply stop the already running producer and deploy the updated one; I would lose the data produced during the update.
Sometimes the client crashes due to a panic (Go) in the code, but since it is running as a pod, Kubernetes restarts it. I am not sure whether that is a good thing or a bad thing.
Thanks
For your first question, I would suggest a rolling update of your Deployment in the cluster (see the sketch at the end of this answer).
For the second, automatic restarts are the normal behaviour of Deployments in Kubernetes. You could add an external monitoring solution that un-deploys your application or stops routing requests to it when it keeps panicking.
It would help if you could explain why exactly you need that kind of behaviour.
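A minimal sketch of a Deployment configured for a rolling update; the name, labels and image are placeholders. With maxUnavailable: 0, the old pods keep producing until the replacement pods are Ready:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: kafka-producer                 # placeholder name
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0                # never take an old pod down before a new one is Ready
      maxSurge: 1                      # bring up one extra pod at a time during the update
  selector:
    matchLabels:
      app: kafka-producer
  template:
    metadata:
      labels:
        app: kafka-producer
    spec:
      containers:
        - name: producer
          image: registry.example.com/kafka-producer:v2   # placeholder image/tag
```

Note that a rolling update only avoids data loss if the producer itself drains its in-flight messages on shutdown, for example by handling SIGTERM and flushing/closing the Kafka client before exiting.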

How to capture errors from NiFi logs specific to my application when multiple NiFi applications are running

We have multiple teams' NiFi applications running on the same NiFi machine. Is there any way to capture the logs specific to my application? Also, by default the nifi-app.log file makes it difficult to track down issues, and the bulletin board shows the error messages for only 5 minutes. How can I capture the errors and send a mail alert in NiFi?
Please help me to get through this. Thanks in advance!
There are a couple ways to approach this. One is to route failure relationships from processors to a PutEmail processor which can send an alert on errors. Another is to use a custom reporting task to alert a monitoring service when a certain number of flowfiles are in an error queue.
Finally, we have heard that in multitenant environments, log parsing is difficult. While NiFi aims to reduce or completely eliminate the need to visually inspect logs by providing the data provenance feature, in the event you do need to inspect the logs, we recommend searching the log by processor ID to isolate relevant messages. You can also use NiFi itself to ingest those same logs and perform parsing and filtering activities if desired. Future versions may improve this experience.
By parsing the NiFi log you can separate out the entries specific to your team's applications, using the process group ID together with the NiFi REST API. Check the link below for a NiFi template and Python code that address this:
https://link.medium.com/L6IY1wTimV
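Along the same lines, here is a rough Python sketch that pulls recent bulletins for a single process group via the NiFi REST API; the base URL, process group ID and the response field names are assumptions and should be checked against your NiFi version's REST API documentation:

```python
# Pull recent bulletins for one process group via the NiFi REST API and print
# the error-level ones. URL, group ID and field names are placeholders /
# assumptions -- verify them against your NiFi version.
import requests

NIFI_URL = "http://localhost:8080/nifi-api"                 # placeholder
PROCESS_GROUP_ID = "0a1b2c3d-0123-4567-89ab-cdef01234567"   # placeholder

resp = requests.get(
    f"{NIFI_URL}/flow/bulletin-board",
    params={"groupId": PROCESS_GROUP_ID, "limit": 100},
    timeout=10,
)
resp.raise_for_status()

for item in resp.json().get("bulletinBoard", {}).get("bulletins", []):
    bulletin = item.get("bulletin", {})
    if bulletin.get("level") == "ERROR":
        print(bulletin.get("timestamp"), bulletin.get("sourceName"), bulletin.get("message"))
```

The output of a script like this could feed a PutEmail flow or any external alerting tool.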
You can route all the errors in a process group to the same processor; it could be a regular UpdateAttribute or a custom processor. That processor adds the path and all the relevant information, then sends the flowfile on to a general error/log flow, which checks the error information inside the flowfile and decides whether to send an email, to whom, and so on.
With this approach the system stays simple and entirely inside NiFi, so you don't add more layers of complexity, and you only have one processor managing errors per process group.
This is how we manage errors at my company.

When does Logstash send an ACK to the input source?

I have read about Filebeat's at-least-once delivery commitment, and what I understood is that until Filebeat receives the ACK for a sent log line, that line will be sent again (in case of a Filebeat restart).
Now suppose that in my solution I am using Filebeat, Logstash, and one other component that Logstash uses for filtering, and after filtering Logstash sends the line to Elasticsearch.
Here are the points where we can lose data:
Filebeat shuts down without receiving an ACK from Logstash. In this case we know the line will be sent again by Filebeat.
Filebeat has sent a line, Logstash has applied the filtering with the external component, and then Logstash/Elasticsearch crashes while the line is being sent to Elasticsearch. Will we lose this data?
My question is:
Logstash basically processes data in the following sequence:
INPUT --> FILTER --> OUTPUT
So I want to know at which step Logstash sends the ACK to Filebeat; I basically want to understand how and when the ACKs are sent. I searched Google and the official Elastic websites but didn't find the details.
Can somebody help me understand this?
Thanks in advance.
The input will ACK when it pushes the events onto the internal queue for the pipeline workers. That's when the input plugin's thread considers the event to be completed.
What happens with the pipeline workers depends on your configuration. If you have persistent queues configured and enabled, those events will be picked up again once Logstash restarts and no data should be lost (if it is, that's a bug). If you don't have persistent queues, then that data will be lost.
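A minimal logstash.yml sketch for enabling the persistent queue mentioned above; the size and path are placeholders to be tuned for your setup:

```yaml
# logstash.yml -- keep in-flight events on disk so they survive a restart
queue.type: persisted                   # default is "memory"
queue.max_bytes: 1gb                    # disk space the queue may use before back-pressure kicks in
path.queue: /var/lib/logstash/queue     # where the queue pages are stored (placeholder)
```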

Monitoring the health of Logstash

I am going to be using Logstash to send a high volume of events to a broker. I have monitoring of the broker to check its health status, but I can't find much information on how to see whether the Logstash process is healthy, or which indicators point to a failing process.
For those who use Logstash: what are some ways you monitor it?
You can have a cronjob inject a heartbeat message and route such messages to some kind of monitoring system. If you already use Elasticsearch you could use it for this as well and write a script to ensure that you have reasonably recent heartbeat messages from all hosts that should be sending messages, but I'd prefer using e.g. Nagios or lovebeat-go.
This could be used to monitor the health of a single Logstash instance (i.e. you inject the heartbeat message into the same instance that feeds the monitoring software) but you could just as well use it to check the overall health of the whole pipeline.
Update: This got built into Logstash in 2015. See the announcement of the Logstash heartbeat plugin.
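A minimal sketch of the heartbeat approach using the heartbeat input plugin mentioned in the update; the interval, marker field, hosts and index are placeholders:

```
input {
  heartbeat {
    interval  => 10                                       # emit a synthetic event every 10 seconds
    message   => "ok"
    add_field => { "monitor" => "logstash-heartbeat" }    # placeholder marker field
  }
}
output {
  if [monitor] == "logstash-heartbeat" {
    elasticsearch {
      hosts => ["http://localhost:9200"]                  # placeholder
      index => "heartbeats"                               # placeholder
    }
  }
}
```

A monitoring job (Nagios, lovebeat-go, or a simple script against Elasticsearch) can then alert when no recent heartbeat document exists for a host.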
If you're trying to monitor Logstash as a shipper, it's easy to write a script that compares the contents of the .sincedb* file to the actual files on disk to make sure they're in sync (a rough sketch follows below).
As an indexer, I'd probably skip ahead and query Elasticsearch for the number of documents being inserted.
@magnus' idea of a latency check is also good. I've used the log's timestamp and compared it to Elasticsearch's timestamp to compute the latency.
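A rough sketch of the sincedb comparison mentioned above; the sincedb column layout differs between versions of the file input plugin, and the paths here are placeholders, so treat this as a starting point only:

```python
# Compare the byte offsets recorded in a Logstash file-input sincedb with the
# current size of the watched files. Assumes the older four-column
# "inode major_dev minor_dev position" sincedb layout -- adjust for your version.
import glob
import os

SINCEDB_PATH = os.path.expanduser("~/.sincedb_example")   # placeholder
WATCHED_GLOB = "/var/log/my-app/*.log"                     # placeholder

# Map inode -> (path, current size) for the files Logstash is watching.
sizes = {}
for path in glob.glob(WATCHED_GLOB):
    st = os.stat(path)
    sizes[st.st_ino] = (path, st.st_size)

with open(SINCEDB_PATH) as f:
    for line in f:
        parts = line.split()
        if len(parts) < 4:
            continue
        inode, position = int(parts[0]), int(parts[3])
        if inode in sizes:
            path, size = sizes[inode]
            print(f"{path}: {size - position} bytes not yet shipped")
```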

Resources