I have read about Filebeat's at-least-once delivery guarantee. As I understand it, until Filebeat receives an ACK for a log line it has sent, that line will be sent again (for example, after a Filebeat restart).
Now suppose my solution uses Filebeat, Logstash, and one other component that Logstash uses for filtering. After filtering, Logstash sends the line to Elasticsearch.
Here are the points where we could lose data:
Filebeat shuts down without receiving an ACK from Logstash. In this case we know the line will be sent again by Filebeat.
Filebeat sends a line, Logstash filters it using the external component, and then Logstash or Elasticsearch crashes just as Logstash tries to send the line to Elasticsearch. Will we lose this data?
My question is:
Logstash processes data in the following sequence:
INPUT --> FILTER --> OUTPUT
At which step does Logstash send the ACK to Filebeat? I want to understand how and when ACKs are sent. I searched Google and the official ELK websites but could not find detailed information.
Can somebody help me understand these details?
Thanks in advance.
The input will ACK when it pushes the events to the internal queue for the pipeline workers. That's when the input plugin thread considers the event complete.
What happens after that depends on the queue configuration. If you have persistent queues configured and enabled, the queued events will be picked up again once Logstash restarts, and no data should be lost (if it is, that's a bug). If you don't have persistent queues, the events that were still in the queue will be lost.
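The difference between the two cases can be illustrated with a toy model. This is only a sketch of the behaviour described above, not Logstash's implementation: `MemoryQueue`, `DiskQueue`, `input_receive` and `simulate_crash` are hypothetical names invented for the illustration. The input stage ACKs as soon as the event is queued, and a crash happens before the filter and output stages run.

```python
import json
import os
from collections import deque


class MemoryQueue:
    """Toy in-memory queue: its contents vanish when the process crashes."""

    def __init__(self):
        self._events = deque()

    def push(self, event):
        self._events.append(event)

    def crash_and_restart(self):
        # Nothing held in memory survives a crash.
        self._events = deque()

    def pending(self):
        return list(self._events)


class DiskQueue:
    """Toy disk-backed queue (hypothetical): each event is appended to a file."""

    def __init__(self, path):
        self._path = path

    def push(self, event):
        with open(self._path, "a", encoding="utf-8") as fh:
            fh.write(json.dumps(event) + "\n")
            fh.flush()
            os.fsync(fh.fileno())  # make sure the event reaches the disk

    def crash_and_restart(self):
        # The file is still on disk after a crash, so there is nothing to do.
        pass

    def pending(self):
        if not os.path.exists(self._path):
            return []
        with open(self._path, encoding="utf-8") as fh:
            return [json.loads(line) for line in fh if line.strip()]


def input_receive(queue, event):
    """Toy input stage: queue the event, then ACK to the sender."""
    queue.push(event)
    return "ACK"


def simulate_crash(queue):
    """ACK one event, crash before filter/output run, report what is left."""
    ack = input_receive(queue, {"message": "line 1"})
    queue.crash_and_restart()
    return ack, queue.pending()
```

In both cases the sender has already received its ACK, so it will not resend the line. With `MemoryQueue` nothing is left to process after the restart, whereas with `DiskQueue` the event is still pending.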
I would like to refactor my Logstash pipelines by using one of the pipeline-to-pipeline architecture patterns (the forked path pattern). The input of the upstream pipeline is a one-time query to an Elasticsearch cluster, using the elasticsearch input plugin. The number of events sent downstream is therefore finite.
However, when the downstream pipelines have consumed and processed all the events, Logstash does not shut down, as it does when I don't use pipeline-to-pipeline communication.
Is this the expected behaviour? Is there a way to shut down Logstash when all the downstream pipelines have processed all the events coming from upstream?
Thanks
(I originally asked this question on discuss.elastic.co, but I got no reply)
Update 26.06.2020
To date, I haven't found a working solution. I asked the same question at the latest Elastic{ON}, but they couldn't give me an answer. So here I am, asking the community again!
Looking at the Logstash code, I don't see any parameter to request Logstash termination once (1) the upstream pipeline has sent out all the events and (2) all downstream pipelines have consumed all the sent events.
If this is the intended behaviour, do you know the rationale behind it, and how I can work around it?
I am using the following pipeline to forward data:
Auditbeat ---> logstash ---> ES
Suppose the Logstash machine goes down. I want to know how Auditbeat handles the situation.
I would like to know the specifics, such as:
1. Is there a retry mechanism?
2. How long will it retry?
3. What happens to the audit logs? Will they be lost?
The reason I ask question 3 is that we enabled Auditbeat by disabling the auditd service (which was generating the audit logs under /var/log/audit/audit.log). So if Logstash goes down, no data is forwarded, and hence there is a chance of data loss. Please clarify.
If Auditbeat stores the data while Logstash is down, where does it do so, and how much memory or disk space is allocated for this?
Thanks in advance.
Auditbeat has an internal queue that stores events before sending them to the configured output. By default this is a memory queue that stores up to 4096 events.
If the queue is full, no more events will be stored until the output comes back and starts receiving data from Auditbeat again, so there is a risk of data loss here.
You can change the number of events that the memory queue stores.
There is also the option to use a file queue, which saves the events to disk before sending them to the configured output, but this feature is still in beta.
You can read about the internal queue in the documentation.
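As a rough illustration of why a full queue puts events at risk, here is a toy model of the behaviour described above. It is not Auditbeat's code, and `BoundedQueue`, `offer` and `drain` are hypothetical names. The queue accepts events up to a fixed limit, refuses new ones while it is full, and empties again once the output is reachable.

```python
from collections import deque


class BoundedQueue:
    """Toy bounded event queue; the default limit mirrors the 4096 mentioned above."""

    def __init__(self, max_events=4096):
        self.max_events = max_events
        self.rejected = 0  # events that arrived while the queue was full
        self._events = deque()

    def offer(self, event):
        """Store the event if there is room; return False if the queue is full."""
        if len(self._events) >= self.max_events:
            self.rejected += 1
            return False
        self._events.append(event)
        return True

    def drain(self, send):
        """Send queued events in order; stop at the first failed send."""
        sent = 0
        while self._events:
            if not send(self._events[0]):
                break  # output is still down, keep the event queued
            self._events.popleft()
            sent += 1
        return sent

    def __len__(self):
        return len(self._events)
```

While `send` keeps failing (the output is down), `drain` removes nothing, and every `offer` beyond `max_events` is counted in `rejected`. In this model, those rejected events are the ones at risk.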
I am currently using Filebeat to forward logs to Logstash and then to Elasticsearch.
Now I am thinking about forwarding logs to Logstash with rsyslog instead. The benefit would be that I would not need to install and configure Filebeat on every server, and I could also forward logs in JSON format, which is easy to parse and filter.
I can use TCP/UDP to forward logs from rsyslog to Logstash.
I want to know the further benefits and drawbacks of rsyslog compared with Filebeat, in terms of performance, reliability and ease of use.
When you couple Beats with Logstash you have something called "back pressure management" - Beats will stop flooding the Logstash server with messages in case something goes wrong on the network, for instance.
Another advantage of using Beats is that in Logstash you can have persistent queues, which prevent you from losing log messages if your Elasticsearch cluster goes down, because Logstash will persist messages on disk. Be careful, though: Logstash can't ensure you won't lose messages if you are using UDP. This link will be helpful.
Rsyslog has in-memory and disk queues, which should take care of buffering messages.
Rsyslog queue-modes
The use case is this:
I have several Java applications running, each of which interacts with its own specific set of Elasticsearch indices. For instance, application A queries and updates Elasticsearch indices A, B and C, while application B uses indices A, C and D.
A common interface is required to manage all these data streams. I'm currently evaluating Kafka and Fluentd for this purpose.
Can someone explain which would be better suited to this situation? I've looked at the features of both Kafka and Fluentd, and I don't really understand what difference the choice would make here.
Thanks a lot.
kafka provides publish/subscribe messaging as a distributed commit log. Usually you install kafka on each host where you need to produce some data to be forwarded somewhere else and all those hosts will together form a cluster. The good thing here is that if for some reason network connectivity becomes unstable or goes down, your application can continue to produce data/logs and they won't be lost. Whereas if your application directly sends logs to some remote centralized logging host, you might lose some logs during the time the network goes down.
fluentd is a centralized log collector which is commonly installed on one host (or more if you need horizontal scaling). It connects to remote data sources, applies filtering and sends unified log data to remote data sinks.
From the fluentd docs, you can see that fluentd can consume data from kafka and produce data towards kafka as well. This alone should hint that fluentd and kafka are on different layers since the former uses the latter.
It would be more logical to compare fluentd and logstash actually. As far as fluentd is concerned, kafka is just another data source and/or data sink, but they are different beasts altogether.
If you want the best of both worlds, use kafka as input/output data pipes from/to your apps and fluentd (or logstash) as your centralized logging system reading from those kafka topics.
If you want to read more on the topic, you can read how fluentd and kafka complement each other very well; that is, they are not competing against each other.
From: The Life Blood Of Your Data Pipeline
Kafka is primarily related to holding log data rather than moving log
data. Thus, Kafka producers need to write the code to put data in
Kafka, and Kafka consumers need to write the code to pull data out of
Kafka.
Fluentd has both input and output plugins for Kafka so that data
engineers can write less code to get data in and out of Kafka. We have
many users that use Fluentd as a Kafka producer and/or consumer.
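To make the "commit log" idea from the answer above more concrete, here is a toy in-memory model. It is only an illustration and has nothing to do with Kafka's real client API: `CommitLog`, `append` and `poll` are hypothetical names. Producers only append, the log holds the records, and each consumer (for example a log collector) reads from its own offset.

```python
class CommitLog:
    """Toy append-only log with an independent read offset per consumer."""

    def __init__(self):
        self._records = []
        self._offsets = {}  # consumer name -> next offset to read

    def append(self, record):
        """Producer side: store the record and return its offset."""
        self._records.append(record)
        return len(self._records) - 1

    def poll(self, consumer, max_records=10):
        """Consumer side: return the next batch and advance that consumer's offset."""
        start = self._offsets.get(consumer, 0)
        batch = self._records[start:start + max_records]
        self._offsets[consumer] = start + len(batch)
        return batch
```

Because records stay in the log after being read, a second consumer that starts later still sees everything from offset 0, which is the "holding data" role the quote describes.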
I am going to be using Logstash to send a high volume of events to a broker. I monitor the broker's health status, but I can't find much information on how to tell whether the Logstash process is healthy, or whether there are indicators of a failing process.
For those who use Logstash: how do you monitor it?
You can have a cronjob inject a heartbeat message and route such messages to some kind of monitoring system. If you already use Elasticsearch you could use it for this as well and write a script to ensure that you have reasonably recent heartbeat messages from all hosts that should be sending messages, but I'd prefer using e.g. Nagios or lovebeat-go.
This could be used to monitor the health of a single Logstash instance (i.e. you inject the heartbeat message into the same instance that feeds the monitoring software) but you could just as well use it to check the overall health of the whole pipeline.
Update: This got built into Logstash in 2015. See the announcement of the Logstash heartbeat plugin.
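The freshness check on heartbeat messages could look roughly like the sketch below. It assumes you have already collected the timestamp of the most recent heartbeat per host (for example by querying Elasticsearch). The function name `stale_hosts`, its parameters and the five-minute threshold are all made up for the illustration.

```python
from datetime import datetime, timedelta, timezone


def stale_hosts(last_seen, expected_hosts, now, max_age=timedelta(minutes=5)):
    """Return the hosts whose latest heartbeat is missing or older than max_age.

    last_seen maps host name -> datetime of the newest heartbeat seen from it.
    """
    stale = []
    for host in sorted(expected_hosts):
        seen = last_seen.get(host)
        if seen is None or now - seen > max_age:
            stale.append(host)
    return stale


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    last_seen = {
        "web-1": now - timedelta(minutes=1),
        "web-2": now - timedelta(minutes=30),
    }
    # "web-3" never sent a heartbeat, "web-2" is too old.
    print(stale_hosts(last_seen, {"web-1", "web-2", "web-3"}, now))
```

A monitoring system such as Nagios could then alert whenever the returned list is non-empty.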
If you're trying to monitor logstash as a shipper, it's easy to write a script that would compare the contents of the .sincedb* file to the actual file on disk to make sure they're in sync.
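A minimal sketch of such a script follows. It assumes each sincedb line consists of whitespace-separated fields beginning with inode, major device number, minor device number and byte offset, with any extra fields in newer versions ignored. Please verify this against the sincedb files your Logstash version actually writes. The function names are hypothetical.

```python
import os


def parse_sincedb(text):
    """Return {inode: byte_offset} from the contents of a sincedb file.

    Assumed line format: "<inode> <major> <minor> <offset> [extra fields...]".
    """
    offsets = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        offsets[int(fields[0])] = int(fields[3])
    return offsets


def shipping_lag(log_path, offsets):
    """Bytes of log_path not yet read, or None if the file is not tracked."""
    st = os.stat(log_path)
    position = offsets.get(st.st_ino)
    if position is None:
        return None
    # A negative value would suggest the file was truncated or rotated.
    return st.st_size - position
```

A lag that stays at zero or hovers near it means the shipper is keeping up. A lag that keeps growing suggests the shipper is stuck or too slow.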
As an indexer, I'd probably skip ahead and query ElasticSearch for the number of documents being inserted.
@magnus' idea for a latency check is also good. I've used the log's timestamp and compared it to ElasticSearch's timestamp to compute the latency.
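The latency computation itself is simple. The sketch below assumes both timestamps are available as ISO 8601 strings with an explicit UTC offset. How you obtain the indexing timestamp depends on your setup, and `ingest_latency_seconds` is a hypothetical helper name.

```python
from datetime import datetime


def ingest_latency_seconds(log_timestamp, indexed_timestamp):
    """Seconds between the time in the log line and the time it was indexed.

    Both arguments are ISO 8601 strings with an explicit offset, for example
    "2015-03-01T12:00:00+00:00". Mixing offset-aware and naive timestamps
    raises TypeError when they are subtracted.
    """
    logged = datetime.fromisoformat(log_timestamp)
    indexed = datetime.fromisoformat(indexed_timestamp)
    return (indexed - logged).total_seconds()
```

Tracking this value over time, for instance as a percentile per host, shows whether the pipeline is falling behind even when every process appears to be running.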