Logstash + Elasticsearch throughput

We're trying to process 5K msgs/sec with 2 identical machines, but it seems like we max out Logstash or Elasticsearch.
Each machine has:
64 GB RAM, ≈3 GHz Xeon CPU
Logstash 1.5 installed
Elasticsearch 1.7.8 installed in cluster mode with the second machine
Logstash is configured to receive messages from a 16-node Kafka cluster and send them to Elasticsearch.
The data is CSV and contains 22 fields. Is that a normal throughput?
Here's the config:
input {
  kafka {
    type => "api"
    zk_connect => "node1:2181,node2:2181,node3:2181"
    codec => "plain"
    topic_id => "api_events"
    consumer_threads => 8
    queue_size => 10000
    rebalance_backoff_ms => 10000
    rebalance_max_retries => 10
  }
}
filter {
  csv {
    separator => "::"
    columns => [
      "hostname",
      "status",
      "body_bytes_sent",
      "request_time",
      "http_x_forwarded_for",
      "uri",
      "arg_key",
      "http_user_agent",
      "http_deviceid",
      "http_country_code",
      "http_language_code",
      "http_platform",
      "http_versioncode",
      "request_method",
      "http_x_forwarded_proto",
      "upstream_cache_status",
      "upstream_response_time",
      "upstream_header_time",
      "upstream_status",
      "bytes_sent",
      "time_local",
      "upstream_addr"
    ]
    remove_field => [ "message" ]
  }
  mutate {
    convert => {
      "body_bytes_sent" => "integer"
      "request_time" => "float"
      "upstream_response_time" => "float"
      "upstream_header_time" => "float"
      "bytes_sent" => "integer"
    }
  }
}
output {
  elasticsearch {
    cluster => "MyCluster"
    protocol => "node"
    index => "api-%{+YYYY.MM.dd}"
    host => "elasticnode1"
    flush_size => 50000
    workers => 4
  }
}

I'm surprised this question has not been answered. The question was asked with rather old versions of Logstash; there have been multiple improvements and refactorings in Logstash since, so some of the parameters will be different.
5k messages a second, at least on the face of it, sounds pretty low to achieve, but of course it depends on quite a few things which are not stated. For example, how big is each message? How many partitions is it listening to? Is each partition in the input actually receiving messages at high throughput, or is only the aggregate throughput high?
I would suggest starting with a small batch size (say 500) with 1 worker and slowly increasing the batch size until you see no improvement, and then increasing the workers to make use of the cores on the machine. It is possible you're not getting each batch full enough per request per worker. The following article shows how to profile and measure how the in-flight requests are doing with real arriving data:
https://www.elastic.co/guide/en/logstash/current/tuning-logstash.html
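As a starting point for that tuning loop, the relevant settings live in logstash.yml (a sketch; the values below are only illustrative first guesses, not recommendations):
# logstash.yml -- illustrative starting point for the tuning loop described above
pipeline.workers: 1        # begin with a single worker
pipeline.batch.size: 500   # grow this until throughput stops improving
pipeline.batch.delay: 50   # ms to wait for a batch to fill before flushing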
Of course the other side of this is the Elasticsearch side. It is worth verifying that Elasticsearch is not lagging in processing all the concurrent requests. How many Elasticsearch nodes are there, and what are the numbers of client/data nodes? There are a number of things to look out for on Elasticsearch as well when doing "heavy" indexing:
https://www.elastic.co/guide/en/elasticsearch/reference/current/tune-for-indexing-speed.html
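One of the tips in that guide is to relax the refresh interval while bulk indexing. A sketch against the question's daily api-* indices (the 30s value is arbitrary, not a recommendation):
PUT api-*/_settings
{
  "index": {
    "refresh_interval": "30s"
  }
}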
I hope this helps you and others looking to do this today.

Related

How to push huge data in less time from a database table to Elasticsearch using Logstash

I'm processing 500,000 records from a Postgres database into Elasticsearch using Logstash, but it takes 40 minutes to complete the process. I want to reduce the processing time. I have changed pipeline.batch.size: 1000 and pipeline.batch.delay: 50 in the logstash.yml file and increased the heap space from 1 GB to 2 GB in the jvm.options file, but the records are still processed in the same time.
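For reference, the changes described above would look roughly like this (a sketch reconstructed from the question's own values; setting -Xms equal to -Xmx is an assumption here):
# logstash.yml
pipeline.batch.size: 1000
pipeline.batch.delay: 50

# jvm.options (heap raised from 1 GB to 2 GB)
-Xms2g
-Xmx2g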
Conf file
input {
  jdbc {
    jdbc_driver_library => "C:\Users\Downloads\elk stack/postgresql-42.3.1.jar"
    jdbc_driver_class => "org.postgresql.Driver"
    jdbc_connection_string => "jdbc:postgresql://localhost:5432/postgres"
    jdbc_user => "postgres"
    jdbc_password => "postgres123"
    statement => "SELECT * FROM jolap.order_desk_activation"
  }
}
output {
  elasticsearch {
    hosts => ["http://localhost:9200/"]
    index => "test-powerbi-transformed"
    document_type => "_doc"
  }
  stdout {}
}
The problem is not the Logstash pipeline or the batch size. As suggested above, you need to get the volume in less time.
This can be achieved using "parallel hints", which make the query much faster, as the query starts using the core processors of the DB infrastructure (don't forget to consult your DBA before applying this). Once you start getting the volume of records in less time, you can scale your Logstash or tweak the pipeline settings.
Refer to this link.
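Separately from the query-side tuning suggested above, the Logstash jdbc input also has paging options that are often used for large result sets. A sketch on top of the question's input (the page size is illustrative, not tuned):
input {
  jdbc {
    # ... same connection settings as in the question ...
    statement => "SELECT * FROM jolap.order_desk_activation"
    jdbc_paging_enabled => true    # fetch the result set in pages
    jdbc_page_size => 50000        # rows per page (illustrative)
  }
}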

How to choose optimal Logstash pipeline batch size and delay? (Logstash 6.4.3)

Introduction
We have a Logstash instance that is receiving our logs from Java microservices, and lately the machine has been at 100% utilization.
I noticed that very low values were used for pipeline batch size, workers, and delay, as well as RAM.
My feeling was that I could improve performance by increasing the batch size into the thousands, increasing the delay into the seconds, and increasing the RAM.
It seems to have worked, and we have gone from a Logstash that was crashing at 100% continuously (or close to it) to being at (or below) 70%. This is a virtual server running in VMware with only 1 core assigned, so resources are a bit limited.
Question
How do I optimize further (without messing with the microservices or limiting the number of incoming messages)?
How do I find the optimal values for delay and batch size?
Also, even though we have 1 core, I have the feeling that having more than 1 worker helps, but I'm not sure about that (due to IO delays).
Current config
ELK (Elastic, Logstash, Kibana) 6.4
logstash.yml contains
pipeline:
  batch:
    size: 2048
    delay: 5000
pipeline.workers: 4
Elastic jvm.options
-Xms4g
-Xmx10g
Logstash jvm.options
-Xms4g
-Xmx10g
Logstash config:
input {
  tcp {
    port => 8999
    codec => json
  }
}
filter {
  geoip {
    source => "req.xForwardedFor"
  }
}
filter {
  kv {
    include_keys => [ "freeTextSearch", "entityId", "businessId" ]
    recursive => "true"
    field_split => ","
  }
}
filter {
  mutate {
    split => { "req.user" => "," }
    split => { "req.application" => "," }
    split => { "req.organization" => "," }
    split => { "app.profiles" => "," }
    copy => { "app.name" => "appLicationName" }
  }
}
filter {
  fingerprint {
    target => "[@metadata][uuid]"
    method => "UUID"
  }
}
filter {
  if [app] {
    ruby {
      init => '
        BODY_PATH = "[app]"
        BODY_STRING = "[name]"
      '
      code => '
        body_val = event.get(BODY_PATH)
        if body_val.is_a?(String)
          event.set(BODY_PATH, {BODY_STRING => body_val, "[olderApp]" => "true"})
        end
      '
    }
  }
}
output {
  stdout {
    codec => rubydebug {
      metadata => true
    }
  }
  if [stackTrace] {
    email {
      address => 'smtp.internal.email'
      to => 'Warnings<warning@server.internal.org>'
      from => 'Warnings<warning@server.internal.org>'
      subject => '%{message}'
      template_file => "C:\logstash\emailtemplate.mustache"
      port => 25
    }
  }
  elasticsearch {
    hosts => ["localhost:8231"]
    sniffing => true
    manage_template => false
    index => "sg-logs"
    document_id => "%{[@metadata][uuid]}"
  }
}
Update
I switched to the persistent queue, which has improved things quite a bit in terms of performance. I ran the scripts that used to freeze our Logstash and it seems not to be breaking anymore, though it took quite a bit of work.
Switched to pipelines.yml
I switched to pipelines.yml since I noticed that the queue settings were not working. I also had to pass the YAML through a validator.
---
- path.config: "../configsg/"
  pipeline.batch.size: 1000
  pipeline.id: persisted-queue-pipeline
  pipeline.workers: 2
  queue.type: persisted
  queue.max_bytes: 2000mb
  queue.drain: true
Modified the bat file to clean the data/queue folder
I noticed Logstash wasn't processing correctly when there was leftover data inside the data/queue folder. I added a bat file to clean/move this data during Logstash restarts, etc. I need to think about how to handle this in the future.
Folder: logstash-6.4.3\data\queue
Here is my bat file, which is called by a Windows service during starts/restarts.
echo Date format = %date%
echo dd = %date:~0,2%
echo mm = %date:~3,2%
echo yyyy = %date:~6,8%
echo.
echo Time format = %time%
echo hh = %time:~0,2%
echo mm = %time:~3,2%
echo ss = %time:~6,2%
cd ..
cd data/queue
move ./persisted-queue-pipeline ../persist-queue-backup-%date:~0,2%_%date:~3,2%_%date:~6,8%-%time:~0,2%_%time:~3,2%_%time:~6,2%.txt
cd ../../bin
logstash.bat
Here are some tips from the Logstash team about optimization: link
I would also suggest taking a look at multi-pipeline setups. From your config, it sounds to me like the filters may be causing the backpressure. If you can divide your input (by port), you can set up multiple pipelines to handle the backpressure; see the sketch below.
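A hypothetical pipelines.yml for that kind of split (the pipeline IDs, the second port, and the config paths are made up for illustration):
- pipeline.id: tcp-8999-main
  path.config: "../configsg/main.conf"        # input { tcp { port => 8999 ... } }
  pipeline.workers: 1
  queue.type: persisted
- pipeline.id: tcp-9000-secondary
  path.config: "../configsg/secondary.conf"   # input { tcp { port => 9000 ... } }
  pipeline.workers: 1
  queue.type: persisted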

Logstash with Elasticsearch: I am using Logstash to ingest the data into the ES index, but now I want Logstash to run 24/7

#file:db.conf
input {
  jdbc {
    jdbc_driver_library => ""
    jdbc_driver_class => "Java::oracle.jdbc.driver.OracleDriver"
    jdbc_connection_string => "jdbc:oracle:thin:@abcd.klm.uvw:1521/qtp1"
    jdbc_user => "user_wew"
    jdbc_password => "password_wew"
    statement => "select col1, col2, col3, col4, col5, col6, countid, max(version) as mv from master_object_table where version > :sql_last_value group by countid"
    schedule => "* * * * *"
    last_run_metadata_path => "C:/ES1/ELK_stack_7.4.2/logstash-7.4.2/logstash-7.4.2/Master_refresh_a.txt"
    use_column_value => true
    tracking_column => "version"
  }
}
filter {
  mutate {
    convert => {
      "countid" => "string"
    }
  }
}
output {
  elasticsearch {
    hosts => "localhost:9200"
    index => "refresh_index_a"
    document_id => "%{countid}"
    #document_type => "_doc"
  }
  file {
    path => "C:\\ES1\\ELK_stack_7.4.2\\logstash-7.4.2\\logstash-7.4.2\\bin\\logstashESRecordsIngestionDetails_refresh_a.txt"
    codec => rubydebug
  }
  stdout { codec => rubydebug }
}
Above is my Logstash config file. I want to run this Logstash 24/7, and if the machine it is running on shuts down, how can I manage that, since this Logstash is ingesting live data into the ES index? Please suggest. Is there any way for the Logstash on another node to continue the work if one server goes down?
As per the documentation
Logstash is horizontally scalable and can form groups of nodes running
the same pipeline. Logstash’s adaptive buffering capabilities will
facilitate smooth streaming even through variable throughput loads. If
the Logstash layer becomes an ingestion bottleneck, simply add more
nodes to scale out. Here are a few general recommendations:
Beats should load balance across a group of Logstash nodes.
A minimum of two Logstash nodes are recommended for high availability.
It’s common to deploy just one Beats input per Logstash node, but multiple
Beats inputs can also be deployed per Logstash node to expose
independent endpoints for different data sources.
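To illustrate the first recommendation, a Beats-side sketch that load balances across two Logstash nodes could look like the following (the hostnames and port are placeholders):
# filebeat.yml -- hypothetical load balancing across two Logstash nodes
output.logstash:
  hosts: ["logstash-node1:5044", "logstash-node2:5044"]
  loadbalance: true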

Kibana Timelion is not graphing data from the index

I'm setting up a graph to display Cisco NetFlow v9 data using ELK stack 7.7.0. Data from the routers reaches Logstash, then goes to Elasticsearch, and finally to Kibana.
In Kibana I'm using Timelion to graph incoming bytes on a router interface. For that purpose I created the index cisconetflow and picked the field "in_bytes" for graphing. The Timelion expression looks like this:
.es(q='netflow.in_bytes',index=cisconetflow*)
But once I press the Update and Refresh buttons I get no errors, yet nothing happens: no data is displayed in the graph.
If I only include the index in the Timelion expression, it shows some hits.
At the same time I'm running a debug on Logstash and I can see that the NetFlow data is present:
"host" => "172.16.8.57",
"#timestamp" => 2020-05-25T20:12:38.000Z,
"netflow" => {
"in_bytes" => 1638,
"flowset_id" => 256,
"input_snmp" => 1,
"protocol" => 17,
"l4_src_port" => 9131,
"ipv4_src_addr" => "192.168.1.70",
"version" => 9,
"src_tos" => 0,
"l4_dst_port" => 9131,
"ipv4_dst_addr" => "239.255.250.250",
"dst_as" => 0,
"flow_seq_num" => 23193,
"output_snmp" => 0,
"in_pkts" => 7,
"src_as" => 0
},
Same on the Kibana Discover dashboard: I see NetFlow data coming in, and the netflow.in_bytes field is displayed as available.
So, any clue on what I'm missing to get the data in the chart?
Thanks.
OK, after researching I found I was missing the timefield and metric parameters in the expression; now I see traffic for the required field.
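For reference, an expression along those lines (the sum aggregation and the @timestamp time field are assumptions based on the question) would look something like:
.es(index=cisconetflow*, timefield='@timestamp', metric='sum:netflow.in_bytes')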

Logstash reading a very large number of static XML files (file input plugin)

I have many static XML files, about 1 million, in one directory. I want to read and parse those files with Logstash and output them to Elasticsearch.
I have the following input config (I tried many ways and this is my latest version):
input {
  file {
    path => "/opt/lun/data-unzip/ftp/223/*.xml*"
    exclude => "*.zip"
    type => "223-purplan"
    start_position => beginning
    discover_interval => "3"
    max_open_files => "128"
    close_older => "3"
    codec => multiline {
      pattern => "xml version"
      negate => true
      what => "previous"
      max_lines => "9999"
      max_bytes => "100 MiB"
    }
  }
}
My server runs CentOS 6.8 with the following hardware:
80 GB memory
Intel(R) Xeon(R) CPU E5620 @ 2.40GHz
with 16 CPUs
Logstash (5.1.2) and Elasticsearch (5.1.2) are installed on this server.
This config works very slowly: about 4 files per second.
How can I make the parsing faster?
There are a few ways that could increase Logstash's processing speed, but it's really hard to point out which one should be done. You could try increasing pipeline.workers, pipeline.batch.size, and pipeline.batch.delay in order to tune the pipeline performance.
There are also a few troubleshooting steps to quickly diagnose and resolve Logstash performance issues. You could also try isolating your input by removing all filters and sending all the documents to /dev/null to ensure that there is no bottleneck in processing or outputting your documents.
Try adding this line to your file:
sincedb_path => "/dev/null"
You might also want to have a look at the Tuning and Profiling Logstash Performance & this blog post. Hope it helps!
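Putting those suggestions together, a minimal isolation test (a sketch reusing the question's file input; it measures raw read throughput with no filters) could look like:
input {
  file {
    path => "/opt/lun/data-unzip/ftp/223/*.xml*"
    sincedb_path => "/dev/null"      # don't persist read positions between test runs
    start_position => "beginning"
  }
}
# no filter block: if this alone runs fast, the bottleneck is in the filters or output
output {
  file { path => "/dev/null" }       # discard events to measure pure input throughput
}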
