Understanding the logstash retry policy - elasticsearch

I have Kibana and an Elasticsearch instance running on one machine. Logstash and Filebeat are running on another machine.
The flow is working perfectly fine, but there is one behaviour I need to understand. I brought Elasticsearch down and made Logstash pump some logs towards it. Since Elasticsearch was down, I expected the data to be lost. But when I brought the Elasticsearch service back up, Kibana was able to show the logs that were sent while Elasticsearch was down.
When I searched online, I learned that Logstash retries the connection if Elasticsearch is down.
How can I configure this retry behaviour?

The reason is that the elasticsearch output implements exponential backoff using two parameters called:
retry_initial_interval
retry_max_interval
If a bulk call fails, Logstash will wait for retry_initial_interval seconds and try again. If it still fails, it will wait for 2 * retry_initial_interval and try again, and so on, until the wait time reaches retry_max_interval, at which point it will keep retrying every retry_max_interval seconds indefinitely.
Note that this retry policy only works when ES is unreachable. If there's another error, such as a mapping error (HTTP 400) or a conflict (HTTP 409), the bulk call will not be retried.
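For illustration, here is a minimal sketch of an elasticsearch output with both settings spelled out; the host and the numbers are only assumed examples, not recommendations:

    output {
      elasticsearch {
        hosts => ["http://localhost:9200"]   # assumed ES endpoint
        retry_initial_interval => 2          # first wait after a failed bulk request, in seconds
        retry_max_interval     => 64         # cap on the exponential backoff, in seconds
      }
    }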

Related

Logstash restart slow with large persistent queue

I have a Logstash server that somehow stopped listening on its syslog input (it didn't crash, which is odd enough in itself, but that's a case for another question). It was configured with a max queue of 100 GB and, after some time (31 GB of queue), I decided to restart it.
After restarting logstash it gets "stuck" on
[INFO ][logstash.agent ] Successfully started Logstash API endpoint {:port=>9600}
and it doesn't start sending messages to Elasticsearch (neither the ones already on the queue nor newly received ones).
If I delete the queue folder and restart Logstash, then I get new events, but obviously I lose all the old ones.
Why is Logstash taking so long to process the persistent queue when it's big? Are there any settings I should tune to make the pipeline flow?
I have saved the old persistent queue for looking deeper into it. Any pointers?
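For reference, these are the logstash.yml settings that typically govern how a large persistent queue is drained; the values shown are only illustrative assumptions, not a recommendation:

    queue.type: persisted
    queue.max_bytes: 100gb          # matches the 100 GB limit mentioned above
    queue.checkpoint.writes: 1024   # acked events between checkpoints
    pipeline.workers: 8             # more workers help drain a backlog, CPU permitting
    pipeline.batch.size: 125        # events pulled per worker per batch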

Elasticsearch - missing data

I have been planning to use ELK for our production environment and seem to be running into a weird problem:
While loading a sample of the production log file, I realized that there is a huge mismatch between the number of events published by Filebeat and what we see in Kibana. My first suspect was Filebeat, but I could verify that all the events were successfully received by Logstash.
I also checked Logstash (by enabling debug mode) and could see that all the events were received and processed (I am using the date and json filters) and that they were processed successfully.
But when I do a search in Kibana, I only see a fraction of the logs that were actually published (e.g. only 16,000 out of 350K). There is no exception or error in either the Logstash or Elasticsearch logs.
I have tried zapping the entire data set by doing the following so far (a command sketch follows this list):
Stopped all processes for ES, Logstash and Kibana.
Deleted all the index files, cleared the cache, deleted mappings.
Stopped Filebeat and deleted its registry file (since it's running on Windows).
Restarted Elasticsearch, Logstash and Filebeat (in that order).
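A sketch of the commands behind those steps, assuming the default logstash-* index names and Elasticsearch listening on localhost:9200:

    # with ES still running, delete the indexed data and Kibana's saved mappings/objects
    curl -XDELETE 'http://localhost:9200/logstash-*'
    curl -XDELETE 'http://localhost:9200/.kibana'
    # on the Windows machine, stop Filebeat and delete its registry file
    # (the registry path depends on how Filebeat was installed, so it is not shown here)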
But I get the same results: only 2 out of 8 records (in the shortened file) and even fewer when I use the full file.
I tried increasing the time window in Kibana to 10 years (:)) to see if the events were being pushed to the wrong year, but got nothing.
I have read almost all the threads related to missing data, but nothing seems to work.
Any pointers would help!
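One way to narrow this down is to ask Elasticsearch directly how many documents it holds, independent of Kibana's index pattern and time filter (again assuming logstash-* indices on localhost:9200):

    curl 'http://localhost:9200/_cat/indices/logstash-*?v'   # doc count per index
    curl 'http://localhost:9200/logstash-*/_count?pretty'    # total document count

If the count here already falls short of what Filebeat published, the events are being dropped before indexing (for example by a failing date or json filter, or a mapping conflict); if it matches, the problem is on the Kibana side.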

Could not push logs to Elasticsearch, resetting connection and trying again. read timeout reached

I am trying to set up EFK (Elasticsearch, Fluentd, Kibana) on a Kubernetes cluster, so I used the following controller and service YAML files:
fluentd-es.yaml
https://github.com/kubernetes/kubernetes/blob/release-1.2/cluster/saltbase/salt/fluentd-es/fluentd-es.yaml
es-controller.yaml, es-service.yaml, kibana-controller.yaml and kibana-service.yaml
https://github.com/kubernetes/kubernetes/tree/master/cluster/addons/fluentd-elasticsearch
After running them, I got the log output below, and the Kibana dashboard was unable to show me logs and charts (it keeps loading forever, as in the next image).
fluentd log snapshot:
elasticsearch log snapshot:
kibana log snapshot:
You have two issues:
The ES connection is timing out after retrying, so make sure you are defining the right ES config in fluentd.conf.
It is also throwing a BufferQueueLimitError, which occurs when the queue fills up because of the connection timeouts. To fix this you should define:
buffer_type memory
buffer_chunk_limit **m
buffer_queue_limit **
flush_interval ***s
disable_retry_limit false
retry_wait **s
Refer to:
https://docs.fluentd.org/v0.12/articles/buffer-plugin-overview#secondary-output
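For illustration, a minimal sketch of where those settings live in a fluentd v0.12 elasticsearch match block; the host, port and numeric values are only assumed placeholders:

    <match **>
      type elasticsearch
      host elasticsearch-logging    # assumed service name for the ES cluster
      port 9200
      buffer_type memory
      buffer_chunk_limit 8m         # example value
      buffer_queue_limit 64         # example value
      flush_interval 5s             # example value
      disable_retry_limit false
      retry_wait 10s                # example value
    </match>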
The logs are pretty much telling you: there's a connection problem to Elasticsearch.

elasticsearch and logstash shutting down prematurely

I am pretty new to this and currently have a single Unix (CentOS) server running Logstash, Elasticsearch and Kibana. The data is consumed from a RabbitMQ exchange and works pretty well, but for some reason after a few hours the Kibana dashboard becomes inactive, the Elasticsearch node goes inactive and Logstash stops consuming. I initially set it up to start each process manually (e.g. ./elasticsearch) and wonder if setting it up as a service would prevent this from occurring.
I want to ensure that the setup runs continuously without any interruptions.
http://192.xxx.xxx.xxx:9200/_plugin/head/
Any suggestions and links are appreciated.
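If the components were installed from the official packages, a minimal sketch of registering them as services instead of launching ./elasticsearch by hand would look like this (CentOS 6 uses SysV init, CentOS 7 uses systemd; Kibana 4 may need its own init script depending on version):

    # CentOS 6 (SysV init)
    sudo chkconfig --add elasticsearch && sudo service elasticsearch start
    sudo chkconfig --add logstash && sudo service logstash start
    # CentOS 7 (systemd)
    sudo systemctl enable elasticsearch logstash
    sudo systemctl start elasticsearch logstash

Running them as services also gives you restarts on boot and proper log files, which makes it easier to see why a process went inactive.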

Why is elasticsearch crashing from inactive shards and logstash failing on bulk actions?

I am currently testing an ELK stack on one Ubuntu 14.04 box. It has 6 GB of RAM and 1 TB of storage. This is pretty modest, but for the amount of data I am getting it should be plenty, right? I followed this guide: elk stack guide. In summary, I have Kibana 4, Logstash 1.5, and Elasticsearch 1.4.4 all running on one box, with an nginx server acting as a reverse proxy so I can access Kibana from outside. The main difference from the guide is that instead of syslogs, I am taking JSON input from a logstash-forwarder, sending about 300 events/minute.
Once started, everything is fine -- the logs show up in Kibana and there are no errors. After about 3 hours, Elasticsearch crashes. I get a
Discover: Cannot read property 'indexOf' of undefined
error on the site. Logs can be seen on pastebin. It seems that shards become inactive and Elasticsearch updates the index_buffer_size.
If I refresh the Kibana UI, it starts working again for my JSON logs. However, if I test a different log source (using the TCP input instead of lumberjack), I get similar errors to the above, except that I stop processing logs: anywhere from 10 minutes to an hour in, I do not process any more logs and I cannot stop Logstash unless I perform a kill -KILL.
Killing logstash (pid 13333) with SIGTERM
Waiting logstash (pid 13333) to die...
Waiting logstash (pid 13333) to die...
Waiting logstash (pid 13333) to die...
Waiting logstash (pid 13333) to die...
Waiting logstash (pid 13333) to die...
logstash stop failed; still running.
Logstash error log shows . Logstash .log file is empty...
For the TCP input, I get about 1500 events every 15 minutes, which Logstash inserts in a bulk request.
Any ideas here?
EDIT: I also observed that when starting my Elasticsearch process, the index buffer for my shards is set to a lower size...
[2015-05-08 19:19:44,302][DEBUG][index.engine.internal ] [Eon] [logstash- 2015.05.05][0] updating index_buffer_size from [64mb] to [4mb]
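That log line is Elasticsearch shrinking the indexing buffer of a shard it considers inactive down to the per-shard minimum; the overall budget is set in elasticsearch.yml. The value below is illustrative only:

    # elasticsearch.yml (ES 1.x era setting)
    indices.memory.index_buffer_size: 10%   # share of heap reserved for indexing buffers
    # the [64mb] -> [4mb] message above is an idle shard's buffer dropping to its per-shard floor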
#jeffrey, I have the same problem with the DNS filter.
I did two things.
I installed dnsmasq as a DNS caching resolver. It helps if you have a high-latency or heavily loaded DNS server.
Second, I increased the number of Logstash worker threads. Just use the -w option.
The trick with threads works without dnsmasq; the trick with dnsmasq without extra threads does not.
I found out my problem. The problem was in my Logstash configuration -- the DNS filter (http://www.logstash.net/docs/1.4.2/filters/dns) for some reason caused Logstash to crash/hang. Once I took out the DNS filter, everything worked fine. Perhaps there was some error in the way the DNS filter was configured.
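For context, a sketch of the kind of DNS filter involved and the worker flag mentioned above; the field name source_host is only an assumed example:

    filter {
      dns {
        reverse => [ "source_host" ]   # field holding an IP to reverse-resolve
        action  => "replace"           # overwrite the field with the resolved name
      }
    }

    # start Logstash with more filter worker threads
    bin/logstash -w 4 -f logstash.conf

Since each DNS lookup blocks a filter worker, a slow resolver can stall the whole pipeline, which is consistent with both the dnsmasq workaround and the hang disappearing once the filter was removed.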
