Elasticsearch Bulk Write is slow using Scan and Scroll - elasticsearch

I am currently running into an issue on which I am really stuck.
I am working on a problem where I have to export Elasticsearch documents and write them to CSV. The documents range from 50,000 to 5 million.
I am experiencing serious performance issues and I get the feeling that I am missing something here.
Right now I have a dataset of 400,000 documents which I am trying to scan and scroll over, and which would ultimately be formatted and written to CSV. But the time taken just to output them is 20 minutes! That is insane.
Here is my script:
import elasticsearch
import elasticsearch.exceptions
import elasticsearch.helpers as helpers
import time

es = elasticsearch.Elasticsearch(['http://XX.XXX.XX.XXX:9200'], retry_on_timeout=True)
scanResp = helpers.scan(client=es, scroll="5m", index='MyDoc', doc_type='MyDoc', timeout="50m", size=1000)

start_time = time.time()
for resp in scanResp:
    data = resp
    print data.values()[3]
print("--- %s seconds ---" % (time.time() - start_time))
I am using a hosted AWS m3.medium server for Elasticsearch.
Can anyone please tell me what I might be doing wrong here?

A simple solution to output ES data to CSV is to use Logstash with an elasticsearch input and a csv output with the following es2csv.conf config:
input {
  elasticsearch {
    host => "localhost"
    port => 9200
    index => "MyDoc"
  }
}
filter {
  mutate {
    remove_field => [ "@version", "@timestamp" ]
  }
}
output {
  csv {
    fields => ["field1", "field2", "field3"]   # specify the field names you want
    path => "/path/to/your/file.csv"
  }
}
You can then export your data easily with bin/logstash -f es2csv.conf
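If only a few fields are actually needed in the CSV, it may also help to filter them at the source so less data is transferred per scroll page. A minimal sketch, assuming a plugin version whose query option accepts a full query body and reusing the hypothetical field1/field2/field3 names from above:
input {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "MyDoc"
    # hypothetical query body: only return the fields the CSV actually needs
    query => '{ "query": { "match_all": {} }, "_source": ["field1", "field2", "field3"] }'
  }
}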

Related

Timeout reached in KV filter with value entry too large

I'm trying to build a new ELK project. I'm a newbie here so I'm not sure what I'm missing. I'm trying to move very large logs to ELK and while doing so, it's timing out in the KV filter with the error "Timeout reached in KV filter with value entry too large".
My Logstash config is in the format below:
grok {
  match => [ "message", "(?<timestamp>%{MONTHDAY:monthday} %{MONTH:month} %{YEAR:year} %{TIME:time}) \[%{LOGLEVEL:loglevel}\] %{DATA:requestId} \(%{DATA:thread}\) %{JAVAFILE:className}: %{GREEDYDATA:logMessage}" ]
}
kv {
  source => "logMessage"
}
Is there a way I can skip the kv filter when the log lines are huge? If so, can someone guide me on how that can be done?
Thank you
I have tried multiple things but nothing seemed to work.
I solved this by using dissect.
The config was something along the lines of:
dissect {
  mapping => {
    "message" => "%{[@metadata][timestamp]} %{+[@metadata][timestamp]} %{+[@metadata][timestamp]} %{+[@metadata][timestamp]} %{loglevel} %{requestId} %{thread} %{classname} %{logMessage}"
  }
}
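As an alternative to replacing kv entirely, the kv stage can also be guarded with a conditional so that oversized messages simply skip it, which is what the original question asked about. A minimal sketch, assuming the logMessage field produced by the grok above and a hypothetical length cutoff of 10,000 characters:
filter {
  # hypothetical guard: only run kv on reasonably sized messages
  if [logMessage] and [logMessage] !~ /^.{10000,}/ {
    kv {
      source => "logMessage"
    }
  }
}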

How to choose optimal Logstash pipeline batch size and delay? (Logstash 6.4.3)

Introduction
We have a Logstash instance that receives logs from our Java microservices, and lately the machine has been at 100% utilization.
I noticed that very low values were used for the pipeline batch size, workers, and delay, as well as RAM.
My feeling was that I could improve performance by increasing the batch size into the thousands, increasing the delay into the seconds, and increasing the RAM.
It seems to have worked and we have gone from a Logstash that was continuously crashing at 100% (or close to it) to being at (or below) 70%. This is a virtual server running in VMware with only 1 core assigned, so resources are a bit limited.
Question
How do I optimize further (without messing with the microservices or limiting the number of incoming messages)?
How do I find the optimal values for delay and batch size?
Also, even though we have 1 core, I have the feeling that having more than 1 worker helps, but I'm not sure about that (due to IO delays).
Current config
ELK (Elastic, Logstash, Kibana) 6.4
logstash.yml contains
pipeline:
  batch:
    size: 2048
    delay: 5000
pipeline.workers: 4
Elastic jvm.options
-Xms4g
-Xmx10g
Logstash jvm.options
-Xms4g
-Xmx10g
Logstash config:
input {
  tcp {
    port => 8999
    codec => json
  }
}
filter {
  geoip {
    source => "req.xForwardedFor"
  }
}
filter {
  kv {
    include_keys => [ "freeTextSearch", "entityId", "businessId" ]
    recursive => "true"
    field_split => ","
  }
}
filter {
  mutate {
    split => { "req.user" => "," }
    split => { "req.application" => "," }
    split => { "req.organization" => "," }
    split => { "app.profiles" => "," }
    copy => { "app.name" => "appLicationName" }
  }
}
filter {
  fingerprint {
    target => "[@metadata][uuid]"
    method => "UUID"
  }
}
filter {
  if [app] {
    ruby {
      init => '
        BODY_PATH = "[app]"
        BODY_STRING = "[name]"
      '
      code => '
        body_val = event.get(BODY_PATH)
        if body_val.is_a?(String)
          event.set(BODY_PATH, {BODY_STRING => body_val, "[olderApp]" => "true"})
        end
      '
    }
  }
}
output {
  stdout {
    codec => rubydebug {
      metadata => true
    }
  }
  if [stackTrace] {
    email {
      address => 'smtp.internal.email'
      to => 'Warnings<warning@server.internal.org>'
      from => 'Warnings<warning@server.internal.org>'
      subject => '%{message}'
      template_file => "C:\logstash\emailtemplate.mustache"
      port => 25
    }
  }
  elasticsearch {
    hosts => ["localhost:8231"]
    sniffing => true
    manage_template => false
    index => "sg-logs"
    document_id => "%{[@metadata][uuid]}"
  }
}
Update
I switched to the persistent queue, which has improved things quite a bit in terms of performance. I ran the scripts that used to freeze our Logstash and it no longer seems to be breaking, though it took quite a bit of work.
Switched to pipelines.yml
I switched to pipelines.yml since I noticed that the queue settings were not working. I also had to pass the YML through a validator.
---
- path.config: "../configsg/"
  pipeline.batch.size: 1000
  pipeline.id: persisted-queue-pipeline
  pipeline.workers: 2
  queue.type: persisted
  queue.max_bytes: 2000mb
  queue.drain: true
Modified the bat file to clean the data/queue folder
I noticed Logstash wasn't processing correctly when there was leftover data inside the data/queue folder. I added a bat file to clean/move this data during Logstash restarts, etc. I need to think about how to handle this in the future.
Folder: logstash-6.4.3\data\queue
Here is my bat file that is called by a Windows service during starts/restarts.
echo Date format = %date%
echo dd = %date:~0,2%
echo mm = %date:~3,2%
echo yyyy = %date:~6,8%
echo.
echo Time format = %time%
echo hh = %time:~0,2%
echo mm = %time:~3,2%
echo ss = %time:~6,2%
cd ..
cd data/queue
move ./persisted-queue-pipeline ../persist-queue-backup-%date:~0,2%_%date:~3,2%_%date:~6,8%-%time:~0,2%_%time:~3,2%_%time:~6,2%.txt
cd ../../bin
logstash.bat
Here are some tips from the Logstash team about optimization: link
I would also suggest taking a look at multi-pipeline setups. From your config, it sounds to me like the filters may be causing the backpressure. If you can divide your input (by port), you can set up multiple pipelines to handle the backpressure.
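For illustration, a minimal pipelines.yml sketch of that split, with hypothetical pipeline ids, ports and config paths; each config file would contain its own tcp input and the relevant filters:
---
- pipeline.id: tcp-8999-pipeline
  path.config: "../configsg/tcp8999.conf"
  pipeline.workers: 1
  queue.type: persisted
- pipeline.id: tcp-9000-pipeline
  path.config: "../configsg/tcp9000.conf"
  pipeline.workers: 1
  queue.type: persisted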

Create a new index in elasticsearch for each log file by date

Currently
I have completed the above task by using one log file and passing the data with Logstash to one index in Elasticsearch:
yellow open logstash-2016.10.19 5 1 1000807 0 364.8mb 364.8mb
What I actually want to do
If I have the following log files, which are named according to year, month and date:
MyLog-2016-10-16.log
MyLog-2016-10-17.log
MyLog-2016-10-18.log
MyLog-2016-11-05.log
MyLog-2016-11-02.log
MyLog-2016-11-03.log
I would like to tell logstash to read by Year,Month and Date and create the following indexes :
yellow open MyLog-2016-10-16.log
yellow open MyLog-2016-10-17.log
yellow open MyLog-2016-10-18.log
yellow open MyLog-2016-11-05.log
yellow open MyLog-2016-11-02.log
yellow open MyLog-2016-11-03.log
Please could I have some guidance on how I need to go about doing this?
Thank you
It is as simple as that:
output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "MyLog-%{+YYYY-MM-dd}.log"
  }
}
If the lines in the file contain datetime info, you should be using the date{} filter to set @timestamp from that value. If you do this, you can use the output format that @Renaud provided, "MyLog-%{+YYYY.MM.dd}".
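A minimal sketch of that date filter, assuming a hypothetical timestamp field and pattern that you would adjust to your actual log format:
filter {
  date {
    # hypothetical field name and pattern; adjust to the real log format
    match => [ "timestamp", "yyyy-MM-dd HH:mm:ss" ]
    target => "@timestamp"
  }
}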
If the lines don't contain the datetime info, you can use the input's path for your index name, e.g. "%{path}". To get just the basename of the path:
mutate {
  gsub => [ "path", ".*/", "" ]
}
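A sketch of the corresponding output, assuming the mutate above has already reduced path to the bare file name; note that Elasticsearch index names must be lowercase, so a mutate { lowercase => [ "path" ] } step may be needed as well:
output {
  elasticsearch {
    hosts => ["localhost:9200"]
    # "path" now holds just the file name, e.g. mylog-2016-10-16.log after lowercasing
    index => "%{path}"
  }
}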
Won't this configuration in the output section be sufficient for your purpose?
output {
  elasticsearch {
    embedded => false
    host => "localhost"
    port => 9200
    protocol => "http"
    cluster => 'elasticsearch'
    index => "syslog-%{+YYYY.MM.dd}"
  }
}

Stop pushing data into Elasticsearch initiated by the Logstash "exec" plugin

I am very new to Elasticsearch and stuck on a problem. I have made a Logstash configuration file named test.conf, which is as follows:
input {
  exec {
    command => "free"
    interval => 1
  }
}
output {
  elasticsearch {
    host => "localhost"
    protocol => "http"
  }
}
Now I execute this config file so that it starts pushing data into Elasticsearch every second, with the following command:
$ /opt/logstash/bin/logstash -f test.conf
I am using Kibana to display the data inserted into Elasticsearch.
Since data keeps getting added to Elasticsearch every second, I can't figure out how to stop this data insertion job. Please help me out.

Email alert after threshold crossed, logstash?

I am using Logstash, Elasticsearch and Kibana to analyze my logs.
I am alerting via the email output in Logstash when a particular string appears in the log:
email {
match => [ "Session Detected", "logline,*Session closed*" ]
...........................
}
This works fine.
Now, I want to alert on the count of a field (when a threshold is crossed):
e.g. if user is a field, I want to alert when the number of unique users goes above 5.
Can this be done via the email output in Logstash?
Please help.
EDIT:
As @Alcanzar suggested, I did this:
config file:
if [server] == "Server2" and [logtype] == "ABClog" {
  grok {
    match => ["message", "%{TIMESTAMP_ISO8601:timestamp} %{HOSTNAME:server-name} abc\[%{INT:id}\]: \(%{USERNAME:user}\) CMD \(%{GREEDYDATA:command}\)"]
  }
  metrics {
    meter => ["%{user}"]
    add_tag => "metric"
  }
}
So according to the above, for server2 and abclog I have a grok pattern for parsing my file, and I want the metric applied to the user field parsed by grok.
I did that in the config file as above, but I get strange behaviour when I check the Logstash console with -vv.
So if there are 9 log lines in the file, it parses those 9 first; after that it starts the metrics part, but there the message field is not the log line from the log file but the user name of my PC, so it gives _grokparsefailure. Something like this:
output received {
  :event=>{"@version"=>"1", "@timestamp"=>"2014-06-17T10:21:06.980Z", "message"=>"my-pc-name",
    "root.count"=>2, "root.rate_1m"=>0.0, "root.rate_5m"=>0.0, "root.rate_15m"=>0.0,
    "abc.count"=>2, "abc.rate_1m"=>0.0, "abc.rate_5m"=>0.0, "abc.rate_15m"=>0.0,
    "tags"=>["metric", "_grokparsefailure"]}, :level=>:debug, :file=>"(eval)", :line=>"137"
}
Any help is appreciated.
I believe what you need is http://logstash.net/docs/1.4.1/filters/metrics.
You'd want to use a metrics tag to calculate the rate of your event, and then use the thing.rate_1m or thing.rate_5m in an if statement around your email output.
For example:
filter {
  if [message] =~ /whatever_message_you_want/ {
    metrics {
      meter => "user"
      add_tag => "metric"
    }
  }
}
output {
  if "metric" in [tags] and [user.rate_1m] > 1 {
    email { ... }
  }
}
Aggregating on the Logstash side is fairly limited. It also increases the state size, so memory consumption may grow. Alerts that run on the Elasticsearch layer offer more freedom and possibilities.
Logz.io alerts on top of ELK are described in the blog post below: http://logz.io/blog/introducing-alerts-for-elk/
