Logstash with Elasticsearch: I am using Logstash to ingest data into an ES index, but now I want Logstash to run 24/7

#file:db.conf
input {
  jdbc {
    jdbc_driver_library => ""
    jdbc_driver_class => "Java::oracle.jdbc.driver.OracleDriver"
    jdbc_connection_string => "jdbc:oracle:thin:@abcd.klm.uvw:1521/qtp1"
    jdbc_user => "user_wew"
    jdbc_password => "password_wew"
    statement => "select col1, col2, col3, col4, col5, col6, countid, max(version) as mv from master_object_table where version > :sql_last_value group by countid"
    schedule => "* * * * *"
    last_run_metadata_path => "C:/ES1/ELK_stack_7.4.2/logstash-7.4.2/logstash-7.4.2/Master_refresh_a.txt"
    use_column_value => true
    tracking_column => "version"
  }
}
filter {
  mutate {
    convert => {
      "countid" => "string"
    }
  }
}
output {
  elasticsearch {
    hosts => "localhost:9200"
    index => "refresh_index_a"
    document_id => "%{countid}"
    #document_type => "_doc"
  }
  file {
    path => "C:\\ES1\\ELK_stack_7.4.2\\logstash-7.4.2\\logstash-7.4.2\\bin\\logstashESRecordsIngestionDetails_refresh_a.txt"
    codec => rubydebug
  }
  stdout { codec => rubydebug }
}
Above is my Logstash config file. I want to run this Logstash pipeline 24/7, and since it is ingesting live data into the ES index, how can I manage the situation where the machine running Logstash shuts down? Please suggest. Is there any way for Logstash on another node to continue the work if one server goes down?

As per the documentation:
Logstash is horizontally scalable and can form groups of nodes running
the same pipeline. Logstash’s adaptive buffering capabilities will
facilitate smooth streaming even through variable throughput loads. If
the Logstash layer becomes an ingestion bottleneck, simply add more
nodes to scale out. Here are a few general recommendations:
Beats should load balance across a group of Logstash nodes.
A minimum of two Logstash nodes are recommended for high availability.
It’s common to deploy just one Beats input per Logstash node, but multiple
Beats inputs can also be deployed per Logstash node to expose
independent endpoints for different data sources.
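The documentation quote above covers scaling out; for surviving a single-machine shutdown, a commonly used option is to run Logstash as a system service (for example via systemd on Linux or a service wrapper such as NSSM on Windows) and enable persistent queues so that events already read by the pipeline survive a crash or restart. Below is a minimal sketch of the logstash.yml settings, assuming Logstash 7.x; the path and size are placeholders, not recommendations:

queue.type: persisted                  # buffer in-flight events on disk instead of only in memory
path.queue: "C:/ES1/logstash-queue"    # placeholder path; must exist and be writable by Logstash
queue.max_bytes: 1gb                   # cap on disk space the queue may use

For the scheduled jdbc input itself, the last_run_metadata_path file you already configure holds the last :sql_last_value, so in principle a restarted Logstash (or a standby node that can read the same metadata file, e.g. from shared storage) resumes from where the previous run stopped, and writing with document_id => "%{countid}" keeps any re-processed rows from creating duplicate documents in the index.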

Related

How to push huge data in less time from a database table to Elasticsearch using Logstash

I'm processing 500,000 records from a Postgres database into Elastic using Logstash, but it is taking 40 minutes to complete the process. I want to reduce the processing time. I have changed pipeline.batch.size: 1000 and pipeline.batch.delay: 50 in the logstash.yml file and increased the heap space from 1 GB to 2 GB in the jvm.options file, but it still processes the records in the same time.
Conf file
input {
  jdbc {
    jdbc_driver_library => "C:\Users\Downloads\elk stack/postgresql-42.3.1.jar"
    jdbc_driver_class => "org.postgresql.Driver"
    jdbc_connection_string => "jdbc:postgresql://localhost:5432/postgres"
    jdbc_user => "postgres"
    jdbc_password => "postgres123"
    statement => "SELECT * FROM jolap.order_desk_activation"
  }
}
output {
  elasticsearch {
    hosts => ["http://localhost:9200/"]
    index => "test-powerbi-transformed"
    document_type => "_doc"
  }
  stdout {}
}
The problem is not the Logstash pipeline or the batch size. As suggested above, you first need to get the volume out of the database in less time.
This can be achieved using "parallel hints", which make the query much faster because it starts using the cores of the DB infrastructure (don't forget to consult your DBA before applying this). Once you start getting the volume of records in less time, you can scale your Logstash nodes or tweak the pipeline settings.
Refer to this link.
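Complementary to the database-side advice above (and not the parallel-hints approach itself), a Logstash-side knob worth trying is the fetch-size and paging options of the jdbc input. The sketch below only adds jdbc_fetch_size, jdbc_paging_enabled and jdbc_page_size to the input already posted; the values are illustrative assumptions, not measured recommendations:

input {
  jdbc {
    jdbc_driver_library => "C:\Users\Downloads\elk stack/postgresql-42.3.1.jar"
    jdbc_driver_class => "org.postgresql.Driver"
    jdbc_connection_string => "jdbc:postgresql://localhost:5432/postgres"
    jdbc_user => "postgres"
    jdbc_password => "postgres123"
    statement => "SELECT * FROM jolap.order_desk_activation"
    jdbc_fetch_size => 10000        # hint for rows fetched per round trip (drivers may only honor it under certain settings)
    jdbc_paging_enabled => true     # split the result set into pages instead of one huge result
    jdbc_page_size => 100000        # rows per page (illustrative value)
  }
}

Note that inline parallel hints are an Oracle-style feature; PostgreSQL has no built-in query hints, and its query parallelism is governed by server settings such as max_parallel_workers_per_gather rather than by anything in the statement.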

Migrating 3 million records from Oracle to Elastic search using logstash

We are trying to migrate around 3 million records from Oracle to Elasticsearch using Logstash.
We are applying a couple of jdbc_streaming filters as part of our Logstash script, one to load connecting nested objects and another to run a hierarchical query to load data into another nested object in the index.
We are able to index 0.4 million records in 24 hours. The total size occupied by 0.4 million records is around 300 MB.
We tried multiple approaches to migrate the data quickly from Oracle into Elastic but were not able to achieve the desired results.
Please find below the approaches we tried:
1. In the Logstash script, we used the jdbc_fetch_size, jdbc_page_size, jdbc_paging_enabled and clean_run parameters, set pipeline workers to 20 and pipeline batch size to 125 in the logstash.yml file.
2. On the Elastic side, we set the number of replicas to 0 and the refresh interval to -1, tried increasing the value of the indices.memory.index_buffer_size parameter, and increased the number of watcher queues in the elasticsearch.yml file.
We basically googled and followed various suggestions from this site and others, but nothing seems to have worked out so far.
We are using a single-node Elastic setup, and neither the DB nor the Elastic node is on the machine from which we are running the Logstash script.
Please find below the Logstash config file:
input {
  jdbc {
    jdbc_driver_library => "LIB"
    jdbc_driver_class => "Java::oracle.jdbc.driver.OracleDriver"
    jdbc_connection_string => "connection url"
    jdbc_user => "user"
    jdbc_password => "pwd"
    statement => "select * from "
  }
}
filter {
  jdbc_streaming {
    jdbc_driver_library => "LIB"
    jdbc_driver_class => "Java::oracle.jdbc.driver.OracleDriver"
    jdbc_connection_string => "connection url"
    jdbc_user => "user"
    jdbc_password => "pwd"
    #statement => "select claimnumber,claimtype,is_active from claim where policynumber = :policynumber"
    parameters => {"policynumber" => "policynumber"}
    target => "nested node"
  }
  stdout { codec => json }
}
filter {
  jdbc_streaming {
    jdbc_driver_library => "LIB"
    jdbc_driver_class => "Java::oracle.jdbc.driver.OracleDriver"
    jdbc_connection_string => "connection url"
    jdbc_user => "user"
    jdbc_password => "pwd"
    statement => "select listagg(column name,'/' ) within group(order by column name) from
                  where LEVEL > 1
                  start with =:
                  connect by prior = "
    parameters => {"p1" => "p1"}
    target => "nested node1"
  }
}
output {
  elasticsearch {
    hosts => [""]
    index => "<index_name>"
    document_id => "%{doc_id}"
  }
}
Can you please help us identify the bottlenecks and also make suggestions on how to increase indexing performance?
Thank you.
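An observation added here rather than taken from the thread: each jdbc_streaming filter runs one lookup query per event, so two such filters over 3 million rows mean millions of extra round trips to Oracle, and that is often where the time goes. The jdbc_streaming filter keeps a local lookup cache, so if many events share the same policynumber, enlarging that cache may cut down repeated queries. Below is a sketch using the filter's use_cache, cache_size and cache_expiration options; the connection placeholders mirror the config above and the values are illustrative assumptions:

filter {
  jdbc_streaming {
    jdbc_driver_library => "LIB"
    jdbc_driver_class => "Java::oracle.jdbc.driver.OracleDriver"
    jdbc_connection_string => "connection url"
    jdbc_user => "user"
    jdbc_password => "pwd"
    statement => "select claimnumber, claimtype, is_active from claim where policynumber = :policynumber"
    parameters => {"policynumber" => "policynumber"}
    target => "nested node"
    use_cache => true            # reuse lookup results for parameter values seen before
    cache_size => 50000          # how many lookups to keep cached (illustrative value)
    cache_expiration => 30.0     # seconds before a cached entry expires (illustrative value)
  }
}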

How to turn off the pre-check of how many rows are in the result set in the Logstash JDBC input

I'm trying to turn off the pre-select that Logstash does to determine the count of rows, but the Exasol DB does not support limits in aggregations. Is there any way to turn it off in Logstash?
input {
  jdbc {
    jdbc_driver_library => "/opt/jdbc/exajdbc6.0.15.jar"
    jdbc_driver_class => "com.exasol.jdbc.EXADriver"
    jdbc_user => "am_mon"
    jdbc_password => "XXXXX"
    jdbc_connection_string => "jdbc:exa:xxx.xx.xx.xx..xx:xxxx"
    jdbc_default_timezone => "Europe/Berlin"
    # schedule => "05 7 * * *"
    statement => "select local_date, LOCAL_HOUR, events from DWH_MON.V.M_EVENTS"
  }
}
Logstash Error Log:
[2019-06-07T12:28:00,834][ERROR][logstash.inputs.jdbc ] Java::JavaSql::SQLException: LIMIT not allowed in aggregated selects [line 1, column 127] (Session: 1635677142479452406): SELECT count(*) AS "COUNT" FROM (select local_date, LOCAL_HOUR, events from DWH_MON.V.M_EVENTS limit 1) AS "T1" LIMIT 1
[2019-06-07T12:28:00,838][WARN ][logstash.inputs.jdbc ] Exception when executing JDBC query {:exception=>#}
As Logstash wants to see how many rows are to be expected, it appends limit 1, but Exasol can't process a limit on aggregations.
It's a problem with Logstash, I guess. The LIMIT 1 part is unnecessary and should not be there in the first place.
You may try to use an SQL pre-processor to identify such queries and remove the LIMIT manually, but maybe it's easier to patch Logstash itself.

Logstash not updating last run metadata file

In my Logstash I want to download the most recent data from a database using :sql_last_value in a query and the tracking_column option in the conf file. I've set
last_run_metadata_path because I have 2 pipelines for the same table, but Logstash saved the last date only once, or stopped saving new dates, and now I can see in the logs that it runs queries with the same :sql_last_value from the metadata file.
This is what my conf file looks like; it has many jdbc inputs, and one of them is below:
jdbc {
  jdbc_driver_library => "/opt/logstash/lib/ojdbc8.jar"
  jdbc_driver_class => "Java::oracle.jdbc.driver.OracleDriver"
  jdbc_connection_string => ""
  jdbc_user => ""
  jdbc_password => ""
  schedule => "*/15 * * * *"
  statement_filepath => "/etc/logstash/queries/UAT/transactions_UAT.sql"
  use_column_value => true
  tracking_column => "sys_created_on"
  tracking_column_type => "timestamp"
  last_run_metadata_path => "/etc/logstash/conf.d/lastrun_metadata/transactions_uat_metadata"
  tags => ["transactions_uat"]
}
Content of the metadata file:
--- 2018-05-26 08:41:55.000000000 -04:00
I can see in the logs that Logstash always uses the same date from the metadata file and never updates it:
select * from snc_uat.syslog_transaction0007
where "sys_created_on" >= TIMESTAMP '2018-05-26 08:41:55.000000 -04:00'
Logstash is working and is downloading recent data, but it unnecessarily processes data that already exists. Why is Logstash not updating the metadata?
This is because your comparison operator is greater than or equal to, i.e. >=. Please change it to > and it will fix your problem.
Hope it helps.
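For illustration, the change goes in the SQL file referenced by statement_filepath. The actual contents of transactions_UAT.sql are not shown in the question, so the query below is an assumption reconstructed from the logged statement; only the operator changes:

-- /etc/logstash/queries/UAT/transactions_UAT.sql (assumed contents)
select * from snc_uat.syslog_transaction0007
where "sys_created_on" > :sql_last_value   -- strictly greater than, so the last saved timestamp is not re-read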

Logstash+Elasticsearch throughput

We're trying to process 5K msgs/sec with 2 identical machines, but it seems like we max out Logstash or Elasticsearch.
Each has:
64Gb RAM, ≈3Ghz Xeon CPU
Logstash 1.5 installed
Elasticsearch 1.7.8 installed in cluster mode with the second machine.
Logstash is configured to receive messages from a 16-node Kafka cluster and send them to Elasticsearch.
The data is CSV and contains 22 fields. Is that a normal throughput?
Here's the config:
input {
  kafka {
    type => "api"
    zk_connect => "node1:2181,node2:2181,node3:2181"
    codec => "plain"
    topic_id => "api_events"
    consumer_threads => 8
    queue_size => 10000
    rebalance_backoff_ms => 10000
    rebalance_max_retries => 10
  }
}
filter {
  csv {
    separator => "::"
    columns => [
      "hostname",
      "status",
      "body_bytes_sent",
      "request_time",
      "http_x_forwarded_for",
      "uri",
      "arg_key",
      "http_user_agent",
      "http_deviceid",
      "http_country_code",
      "http_language_code",
      "http_platform",
      "http_versioncode",
      "request_method",
      "http_x_forwarded_proto",
      "upstream_cache_status",
      "upstream_response_time",
      "upstream_header_time",
      "upstream_status",
      "bytes_sent",
      "time_local",
      "upstream_addr"
    ]
    remove_field => [
      "message"
    ]
  }
  mutate {
    convert => {
      "body_bytes_sent" => "integer"
      "request_time" => "float"
      "upstream_response_time" => "float"
      "upstream_header_time" => "float"
      "bytes_sent" => "integer"
    }
  }
}
output {
  elasticsearch {
    cluster => "MyCluster"
    protocol => "node"
    index => "api-%{+YYYY.MM.dd}"
    host => "elasticnode1"
    flush_size => 50000
    workers => 4
  }
}
I'm surprised this question has not been answered. The question was asked with rather old versions of Logstash; there have been multiple improvements and refactorings in Logstash since, so some of the parameters will be different.
5k messages a second, at least on the face of it, sounds like a pretty low target, but of course it depends on quite a few things which are not stated. For example, how big is each message? How many partitions is it listening to? Is each partition in the input actually receiving messages at a high throughput, or is only the aggregate throughput high?
I would suggest starting with a small batch size (say 500) with 1 worker and slowly increasing the batch size until you see no improvement, and then increasing the workers to make use of the cores on the machine. It is possible you're not getting each batch full enough per request per worker. The following article shows how to profile and measure how the in-flight requests are doing with real arriving data:
https://www.elastic.co/guide/en/logstash/current/tuning-logstash.html
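As a concrete starting point for the experiment described above (a sketch only; on Logstash 1.5 these were command-line flags rather than logstash.yml settings, and the values are simply the suggested starting values, not tuned numbers), the current equivalents in logstash.yml would look like:

pipeline.workers: 1        # begin with one worker, later raise toward the number of cores
pipeline.batch.size: 500   # events per batch per worker; increase until throughput stops improving
pipeline.batch.delay: 50   # milliseconds to wait for a batch to fill before flushing it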
Of course, the other side of this is Elasticsearch itself. It is worth verifying that Elasticsearch is not lagging in processing all the concurrent requests. How many Elasticsearch nodes are there, and what are the numbers of client/data nodes? There are a number of things to look out for on Elasticsearch as well when doing "heavy" indexing:
https://www.elastic.co/guide/en/elasticsearch/reference/current/tune-for-indexing-speed.html
I hope this helps you and others looking to do this today.

Resources