Logstash output performance - Elasticsearch

I am using Elasticsearch 5.1.1 and Logstash 5.1.1. I imported 3 million rows from SQL Server into Elasticsearch via Logstash in 2 hours.
I have a single Windows machine with 4 GB RAM and a Core i3. Are there any additional configurations I should add to speed up the import?
I tried changing the logstash.yml settings via https://www.elastic.co/guide/en/logstash/current/logstash-settings-file.html
but it had no effect.
Logstash configuration:
input {
  jdbc {
    jdbc_driver_library => "D:\Usefull_Jars\sqljdbc4-4.0.jar"
    jdbc_driver_class => "com.microsoft.sqlserver.jdbc.SQLServerDriver"
    jdbc_connection_string => "jdbc:sqlserver://192.168.5.14:1433;databaseName=DataSource;integratedSecurity=false;user=****;password=****;"
    jdbc_user => "****"
    jdbc_password => "****"
    statement => "SELECT * FROM RawData"
    jdbc_fetch_size => 1000
  }
}
output {
  elasticsearch {
    hosts => "localhost"
    index => "testdata"
    document_type => "testfeed"
    document_id => "%{id}"
    flush_size => 512
  }
}
logstash.yml
pipeline:
  batch:
    size: 125
    delay: 2
#
# Or as flat keys:
# pipeline.batch.size: 125
# pipeline.batch.delay: 5
# ------------ Pipeline Settings --------------
# Set the number of workers that will, in parallel, execute the filters+outputs
# stage of the pipeline.
# This defaults to the number of the host's CPU cores.
pipeline.workers: 5
# How many workers should be used per output plugin instance
pipeline.output.workers: 5
# How many events to retrieve from inputs before sending to filters+workers
pipeline.batch.size: 125
# How long to wait before dispatching an undersized batch to filters+workers
# Value is in milliseconds.
# pipeline.batch.delay: 5
# ------------ Queuing Settings --------------
#
# Internal queuing model, "memory" for legacy in-memory based queuing and
# "persisted" for disk-based acked queueing. Defaults is memory
#
# queue.type: memory
#
# If using queue.type: persisted, the directory path where the data files will be stored.
# Default is path.data/queue
#
# path.queue:
#
# If using queue.type: persisted, the page data files size. The queue data consists of
# append-only data files separated into pages. Default is 250mb
#
# queue.page_capacity: 250mb
#
# If using queue.type: persisted, the maximum number of unread events in the queue.
# Default is 0 (unlimited)
#
# queue.max_events: 0
#
# If using queue.type: persisted, the total capacity of the queue in number of bytes.
# If you would like more unacked events to be buffered in Logstash, you can increase the
# capacity using this setting. Please make sure your disk drive has capacity greater than
# the size specified here. If both max_bytes and max_events are specified, Logstash will pick
# whichever criteria is reached first
# Default is 1024mb or 1gb
#
# queue.max_bytes: 1024mb
#
# If using queue.type: persisted, the maximum number of acked events before forcing a checkpoint
# Default is 1024, 0 for unlimited
#
# queue.checkpoint.acks: 1024
#
# If using queue.type: persisted, the maximum number of written events before forcing a checkpoint
# Default is 1024, 0 for unlimited
#
# queue.checkpoint.writes: 1024
#
# If using queue.type: persisted, the interval in milliseconds when a checkpoint is forced on the head page
# Default is 1000, 0 for no periodic checkpoint.
#
# queue.checkpoint.interval: 1000
Thanks in advance ....
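A rough sketch of settings that are often raised for this kind of bulk JDBC import (the values below are illustrative assumptions for a small 4 GB machine, not tested figures; jdbc_paging_enabled and jdbc_page_size are options of the jdbc input that the config above does not use yet):
input {
  jdbc {
    # ... same connection settings as above ...
    jdbc_fetch_size => 10000       # fewer round trips to SQL Server
    jdbc_paging_enabled => true    # run the SELECT in pages instead of one huge result set
    jdbc_page_size => 100000
  }
}
and in logstash.yml:
pipeline.workers: 4        # roughly the number of CPU threads on the machine
pipeline.batch.size: 1000  # larger bulk requests toward Elasticsearch
pipeline.batch.delay: 50
Whether these help depends on whether SQL Server, Logstash, or Elasticsearch is the bottleneck on this machine, so measure each change separately.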

Related

Deleted logs are not rewritten to Elasticsearch

I'm using Logstash to read log files and send them to Elasticsearch. It works fine in streaming mode, creating a different index every day and writing logs in real time.
The problem is that yesterday at 3pm I accidentally deleted the index. It was recreated automatically and logs continued to be written. However, I have lost the logs from 12am to 3pm.
In order to rewrite the logs from the beginning, I deleted the sincedb file and also added ignore_older => 0 to the Logstash configuration. After that, I deleted the index again. But it keeps streaming only new data, ignoring the old data.
My current Logstash configuration:
input {
  file {
    path => ["/someDirectory/Logs/20221220-00001.log"]
    start_position => "beginning"
    tags => ["prod"]
    ignore_older => 0
    sincedb_path => "/dev/null"
    type => "cowrie"
  }
}
filter {
  grok {
    match => ["path", "/var/www/cap/cap-server/Logs/%{GREEDYDATA:index_name}" ]
  }
}
output {
  elasticsearch {
    hosts => "IP:9200"
    user => "elastic"
    password => "xxxxxxxx"
    index => "logstash-log-%{index_name}"
  }
}
I would appreciate any help.
I'm also attaching the Elasticsearch configuration:
# Lock the memory on startup:
#
#bootstrap.memory_lock: true
#
# Make sure that the heap size is set to about half the memory available
# on the system and that the owner of the process is allowed to use this
# limit.
#
# Elasticsearch performs poorly when the system is swapping the memory.
#
# ---------------------------------- Network -----------------------------------
#
# By default Elasticsearch is only accessible on localhost. Set a different
# address here to expose this node on the network:
#
network.host: 0.0.0.0
#
# By default Elasticsearch listens for HTTP traffic on the first free port it
# finds starting at 9200. Set a specific HTTP port here:
#
http.port: 9200
#
# For more information, consult the network module documentation.
#
# --------------------------------- Discovery ----------------------------------
#
# Pass an initial list of hosts to perform discovery when this node is started:
# The default list of hosts is ["127.0.0.1", "[::1]"]
#
#discovery.seed_hosts: ["host1", "host2"]
#
# Bootstrap the cluster using an initial set of master-eligible nodes:
#
#cluster.initial_master_nodes: ["node-1", "node-2"]
#
# For more information, consult the discovery and cluster formation module documentation.
#
# ---------------------------------- Various -----------------------------------
#
# Require explicit names when deleting indices:
#
discovery.type: single-node
xpack.security.enabled: true
xpack.security.transport.ssl.enabled: true
#action.destructive_requires_name: true
Note that after all configuration changes, Logstash and Elasticsearch were restarted.
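As a point of comparison, here is a sketch of the usual way to force a full re-read (an illustration, not a verified fix for this exact case): stop Logstash, delete the index, and restart with a file input along these lines. One detail worth checking is ignore_older, which is expressed in seconds; depending on the plugin version, a value of 0 can make the input skip any file last modified more than 0 seconds ago, i.e. exactly the old data you want back.
input {
  file {
    path => ["/someDirectory/Logs/20221220-00001.log"]
    start_position => "beginning"   # only applies to files without a sincedb entry
    sincedb_path => "/dev/null"     # do not persist read positions, so a restart re-reads from the start
    # ignore_older removed (or set to a generous value, e.g. 2592000 seconds = 30 days)
    tags => ["prod"]
    type => "cowrie"
  }
}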

How to push huge data in less time from a database table to Elasticsearch using Logstash

I'm processing 500,000 records from a Postgres database into Elasticsearch using Logstash, but it takes 40 minutes to complete the process. I want to reduce the processing time, so I changed pipeline.batch.size: 1000 and pipeline.batch.delay: 50 in the logstash.yml file and increased the heap space from 1 GB to 2 GB in the jvm.options file, but the records are still processed in the same time.
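For reference, the changes described above correspond to settings like the following (the exact jvm.options lines are an assumption based on the description of the heap increase):
# logstash.yml
pipeline.batch.size: 1000
pipeline.batch.delay: 50
# jvm.options (Logstash heap)
-Xms2g
-Xmx2g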
Conf file
input {
  jdbc {
    jdbc_driver_library => "C:\Users\Downloads\elk stack/postgresql-42.3.1.jar"
    jdbc_driver_class => "org.postgresql.Driver"
    jdbc_connection_string => "jdbc:postgresql://localhost:5432/postgres"
    jdbc_user => "postgres"
    jdbc_password => "postgres123"
    statement => "SELECT * FROM jolap.order_desk_activation"
  }
}
output {
  elasticsearch {
    hosts => ["http://localhost:9200/"]
    index => "test-powerbi-transformed"
    document_type => "_doc"
  }
  stdout {}
}
The problem is not the Logstash pipeline or the batch size. As suggested above, you need to retrieve the data volume in less time.
This can be achieved using parallel hints, which make the query much faster because it starts using more of the database server's processor cores (don't forget to consult your DBA before applying this). Once you start getting the records in less time, you can scale Logstash or tweak the pipeline settings.
Refer to this link.
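For illustration only (this expands on the hint idea and is not part of the original answer): on databases that support optimizer hints, such as Oracle, a parallel hint is embedded directly in the statement passed to the jdbc input. Plain PostgreSQL, which is used here, does not support Oracle-style hints; its parallelism is governed by planner settings such as max_parallel_workers_per_gather (or the pg_hint_plan extension), so confirm the right mechanism with your DBA. On the Logstash side, the jdbc input's fetch and paging options can also cut extraction time:
input {
  jdbc {
    # ... connection settings as in the question ...
    # Oracle-style statement-level parallel hint (illustrative; not valid on plain PostgreSQL):
    statement => "SELECT /*+ PARALLEL(4) */ * FROM jolap.order_desk_activation"
    jdbc_fetch_size => 10000       # rows fetched per round trip
    jdbc_paging_enabled => true    # split the query into pages
    jdbc_page_size => 50000
  }
}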

How to choose optimal Logstash pipeline batch size and delay? (Logstash 6.4.3)

Introduction
We have a Logstash instance that receives our logs from Java microservices, and lately the machine has been at 100% utilization.
I noticed that very low values were being used for pipeline batch size, workers, and delay, as well as RAM.
My feeling was that I could improve performance by increasing the batch size into the thousands, increasing the delay into the seconds, and increasing the RAM.
It seems to have worked, and we have gone from a Logstash that was continuously crashing at (or close to) 100% to sitting at (or below) 70%. This is a virtual server running in VMware with only 1 core assigned, so resources are a bit limited.
Question
How do I optimize further (without touching the microservices or limiting the number of incoming messages)?
How do I find the optimal values for delay and batch size?
Also, even though we have 1 core, I have the feeling that having more than 1 worker helps, but I'm not sure about that (due to IO delays).
Current config
ELK (Elasticsearch, Logstash, Kibana) 6.4
logstash.yml contains
pipeline:
  batch:
    size: 2048
    delay: 5000
pipeline.workers: 4
Elasticsearch jvm.options
-Xms4g
-Xmx10g
Logstash jvm.options
-Xms4g
-Xmx10g
Logstash config:
input {
  tcp {
    port => 8999
    codec => json
  }
}
filter {
  geoip {
    source => "req.xForwardedFor"
  }
}
filter {
  kv {
    include_keys => [ "freeTextSearch", "entityId", "businessId" ]
    recursive => "true"
    field_split => ","
  }
}
filter {
  mutate {
    split => { "req.user" => "," }
    split => { "req.application" => "," }
    split => { "req.organization" => "," }
    split => { "app.profiles" => "," }
    copy => { "app.name" => "appLicationName" }
  }
}
filter {
  fingerprint {
    target => "[@metadata][uuid]"
    method => "UUID"
  }
}
filter {
  if [app] {
    ruby {
      init => '
        BODY_PATH = "[app]"
        BODY_STRING = "[name]"
      '
      code => '
        body_val = event.get(BODY_PATH)
        if body_val.is_a?(String)
          event.set(BODY_PATH, {BODY_STRING => body_val, "[olderApp]" => "true"})
        end
      '
    }
  }
}
output {
  stdout {
    codec => rubydebug {
      metadata => true
    }
  }
  if [stackTrace] {
    email {
      address => 'smtp.internal.email'
      to => 'Warnings<warning@server.internal.org>'
      from => 'Warnings<warning@server.internal.org>'
      subject => '%{message}'
      template_file => "C:\logstash\emailtemplate.mustache"
      port => 25
    }
  }
  elasticsearch {
    hosts => ["localhost:8231"]
    sniffing => true
    manage_template => false
    index => "sg-logs"
    document_id => "%{[@metadata][uuid]}"
  }
}
Update
I switched to the persistent queue, which has improved things quite a bit in terms of performance. I ran the scripts that used to freeze our Logstash and it no longer seems to break, though it took quite a bit of work.
Switched to pipelines.yml
I switched to pipelines.yml since I noticed that the queue settings were not working. I also had to pass the YAML through a validator.
---
-
  path.config: "../configsg/"
  pipeline.batch.size: 1000
  pipeline.id: persisted-queue-pipeline
  pipeline.workers: 2
  queue.type: persisted
  queue.max_bytes: 2000mb
  queue.drain: true
Modified the .bat file to clean the data/queue folder
I noticed Logstash wasn't processing correctly when there was leftover data inside the data/queue folder. I added a .bat file to clean/move this data during Logstash restarts etc. I need to think about how to handle this in the future.
Folder: logstash-6.4.3\data\queue
Here is my .bat file, which is called by a Windows service during starts/restarts.
echo Date format = %date%
echo dd = %date:~0,2%
echo mm = %date:~3,2%
echo yyyy = %date:~6,8%
echo.
echo Time format = %time%
echo hh = %time:~0,2%
echo mm = %time:~3,2%
echo ss = %time:~6,2%
cd ..
cd data/queue
move ./persisted-queue-pipeline ../persist-queue-backup-%date:~0,2%_%date:~3,2%_%date:~6,8%-%time:~0,2%_%time:~3,2%_%time:~6,2%.txt
cd ../../bin
logstash.bat
Here are some tips from the Logstash team about optimization: link.
I would also suggest taking a look at multi-pipeline setups. From your config, it sounds to me like the filter stages may be causing the backpressure. If you can divide your input (by port), you can set up multiple pipelines to handle the backpressure, as sketched below.
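A rough sketch of that idea (the pipeline ids, second port, and config paths below are made up for illustration, not taken from the question): pipelines.yml declares one pipeline per input port, each pointing at its own config file, so a slow filter chain on one stream no longer backs up the other.
# pipelines.yml (sketch)
- pipeline.id: app-logs-8999
  path.config: "../configsg/app_logs.conf"     # tcp input on port 8999 plus the heavy filters
  pipeline.workers: 1
  queue.type: persisted
- pipeline.id: access-logs-9000
  path.config: "../configsg/access_logs.conf"  # tcp input on a second port with lighter filters
  pipeline.workers: 1
  queue.type: persisted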

No alert received from elastalert-test-rule or when executing the rule

I have done the setup on Windows 10 and get the output below when executing elastalert-test-rule for my rule.
elastalert-test-rule example_rules\example_frequency.yaml --config config.yaml
Would have written the following documents to writeback index (default is elastalert_status):
elastalert_status - {'rule_name': 'Example frequency rule', 'endtime': datetime.datetime(2020, 4, 19, 18, 49, 10, 397745, tzinfo=tzutc()), 'starttime': datetime.datetime(2019, 4, 17, 3, 13, 10, 397745, tzinfo=tzutc()), 'matches': 4, 'hits': 4, '@timestamp': datetime.datetime(2020, 4, 19, 18, 55, 56, 314841, tzinfo=tzutc()), 'time_taken': 405.48910188674927}
However, no alert is triggered.
Please find below the contents of config.yaml and example_frequency.yaml.
config.yaml
# This is the folder that contains the rule yaml files
# Any .yaml file will be loaded as a rule
rules_folder: example_rules
# How often ElastAlert will query Elasticsearch
# The unit can be anything from weeks to seconds
run_every:
  seconds: 5
# ElastAlert will buffer results from the most recent
# period of time, in case some log sources are not in real time
buffer_time:
  minutes: 15
# The Elasticsearch hostname for metadata writeback
# Note that every rule can have its own Elasticsearch host
es_host: 127.0.0.1
# The Elasticsearch port
es_port: 9200
# The AWS region to use. Set this when using AWS-managed elasticsearch
#aws_region: us-east-1
# The AWS profile to use. Use this if you are using an aws-cli profile.
# See http://docs.aws.amazon.com/cli/latest/userguide/cli-chap-getting-started.html
# for details
#profile: test
# Optional URL prefix for Elasticsearch
#es_url_prefix: elasticsearch
# Connect with TLS to Elasticsearch
#use_ssl: True
# Verify TLS certificates
#verify_certs: True
# GET request with body is the default option for Elasticsearch.
# If it fails for some reason, you can pass 'GET', 'POST' or 'source'.
# See http://elasticsearch-py.readthedocs.io/en/master/connection.html?highlight=send_get_body_as#transport
# for details
#es_send_get_body_as: GET
# Option basic-auth username and password for Elasticsearch
#es_username: someusername
#es_password: somepassword
# Use SSL authentication with client certificates client_cert must be
# a pem file containing both cert and key for client
#verify_certs: True
#ca_certs: /path/to/cacert.pem
#client_cert: /path/to/client_cert.pem
#client_key: /path/to/client_key.key
# The index on es_host which is used for metadata storage
# This can be a unmapped index, but it is recommended that you run
# elastalert-create-index to set a mapping
writeback_index: elastalert_status
writeback_alias: elastalert_alerts
# If an alert fails for some reason, ElastAlert will retry
# sending the alert until this time period has elapsed
alert_time_limit:
  days: 2
# Custom logging configuration
# If you want to setup your own logging configuration to log into
# files as well or to Logstash and/or modify log levels, use
# the configuration below and adjust to your needs.
# Note: if you run ElastAlert with --verbose/--debug, the log level of
# the "elastalert" logger is changed to INFO, if not already INFO/DEBUG.
#logging:
# version: 1
# incremental: false
# disable_existing_loggers: false
# formatters:
# logline:
# format: '%(asctime)s %(levelname)+8s %(name)+20s %(message)s'
#
# handlers:
# console:
# class: logging.StreamHandler
# formatter: logline
# level: DEBUG
# stream: ext://sys.stderr
#
# file:
# class : logging.FileHandler
# formatter: logline
# level: DEBUG
# filename: elastalert.log
#
# loggers:
# elastalert:
# level: WARN
# handlers: []
# propagate: true
#
# elasticsearch:
# level: WARN
# handlers: []
# propagate: true
#
# elasticsearch.trace:
# level: WARN
# handlers: []
# propagate: true
#
# '': # root logger
# level: WARN
# handlers:
# - console
# - file
# propagate: false
example_frequency.yaml
# Alert when the rate of events exceeds a threshold
# (Optional)
# Elasticsearch host
# es_host: elasticsearch.example.com
# (Optional)
# Elasticsearch port
# es_port: 14900
# (OptionaL) Connect with SSL to Elasticsearch
#use_ssl: True
# (Optional) basic-auth username and password for Elasticsearch
#es_username: someusername
#es_password: somepassword
# (Required)
# Rule name, must be unique
name: Example frequency rule
# (Required)
# Type of alert.
# the frequency rule type alerts when num_events events occur with timeframe time
type: frequency
# (Required)
# Index to search, wildcard supported
index: com-*
# (Required, frequency specific)
# Alert when this many documents matching the query occur within a timeframe
num_events: 1
# (Required, frequency specific)
# num_events must occur within this amount of time to trigger an alert
timeframe:
  days: 365
# (Required)
# A list of Elasticsearch filters used for find events
# These filters are joined with AND and nested in a filtered query
# For more info: http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/query-dsl.html
filter:
- term:
    "log_json.response.statusCode": "404"
# (Required)
# The alert is use when a match is found
alert:
- "email"
# (required, email specific)
# a list of email addresses to send alerts to
email:
- "username@mydomain.com"
realert:
  minutes: 0
What am I missing to receive alerts? I also don't see any errors on the console.
The SMTP configuration is missing, which is why no alert is being sent.
Please try to include smtp_host, smtp_port, smtp_ssl and smtp_auth_file in your example_frequency.yaml.
Refer to the documentation for the Email alerter.
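A minimal sketch of those settings (the host, port, and file names are placeholder assumptions; adjust them to your SMTP server as described in the ElastAlert email alerter documentation):
# additions to example_frequency.yaml (illustrative values)
smtp_host: smtp.mydomain.com
smtp_port: 587
smtp_ssl: false                  # true if your server requires SSL/TLS on connect
smtp_auth_file: smtp_auth.yaml   # a YAML file containing 'user' and 'password' keys
from_addr: elastalert@mydomain.com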

Logstash reading a very large number of static XML files (file input plugin)

I have about 1 million static XML files in one directory. I want to read and parse those files with Logstash and output them to Elasticsearch.
I have the following input config (I have tried many ways; this is my latest version):
input {
  file {
    path => "/opt/lun/data-unzip/ftp/223/*.xml*"
    exclude => "*.zip"
    type => "223-purplan"
    start_position => beginning
    discover_interval => "3"
    max_open_files => "128"
    close_older => "3"
    codec => multiline {
      pattern => "xml version"
      negate => true
      what => "previous"
      max_lines => "9999"
      max_bytes => "100 MiB"
    }
  }
}
My server runs CentOS 6.8 with the following hardware:
80 GB memory
Intel(R) Xeon(R) CPU E5620 @ 2.40GHz
16 CPUs
Logstash (5.1.2) and Elasticsearch (5.1.2) are installed on this server.
This config works very slowly, about 4 files per second.
How can I make the parsing faster?
There are a few ways to speed up Logstash processing, but it's hard to say which one will help here. You could try increasing pipeline.workers, pipeline.batch.size, and pipeline.batch.delay to tune the pipeline performance; a sketch of those settings follows below.
There are also a few troubleshooting steps to quickly diagnose and resolve Logstash performance issues. You could also try optimizing your inputs by removing all filters and sending all the documents to /dev/null, to confirm whether processing or outputting the documents is the bottleneck.
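As a starting point for that tuning (the numbers below are illustrative assumptions for a 16-core machine, not benchmarks; measure each change):
# logstash.yml (sketch)
pipeline.workers: 16       # defaults to the number of CPU cores
pipeline.batch.size: 500   # events each worker collects before running filters and outputs
pipeline.batch.delay: 50   # milliseconds to wait before flushing an undersized batch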
Try adding this line to your file input:
sincedb_path => "/dev/null"
You might also want to have a look at the Tuning and Profiling Logstash Performance & this blog post. Hope it helps!
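For the /dev/null check mentioned above, a minimal sketch (the config file name is a placeholder): temporarily replace the elasticsearch output with a dots codec on stdout and discard the output, so you measure pure input and filter throughput.
output {
  # throughput test only: one dot per event, nothing is indexed
  stdout { codec => dots }
}
# run it as, for example:  bin/logstash -f test_throughput.conf > /dev/null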
