How to turn off the pre-check of how many rows are in the result set in the Logstash jdbc input

I'm trying to turn off the pre-select count query that Logstash runs to determine the number of rows, because Exasol does not allow a LIMIT in aggregated selects. Is there any way to turn this off in Logstash?
input {
  jdbc {
    jdbc_driver_library => "/opt/jdbc/exajdbc6.0.15.jar"
    jdbc_driver_class => "com.exasol.jdbc.EXADriver"
    jdbc_user => "am_mon"
    jdbc_password => "XXXXX"
    jdbc_connection_string => "jdbc:exa:xxx.xx.xx.xx..xx:xxxx"
    jdbc_default_timezone => "Europe/Berlin"
    # schedule => "05 7 * * *"
    statement => "select local_date, LOCAL_HOUR, events from DWH_MON.V.M_EVENTS"
  }
}
Logstash Error Log:
[2019-06-07T12:28:00,834][ERROR][logstash.inputs.jdbc ] Java::JavaSql::SQLException: LIMIT not allowed in aggregated selects [line 1, column 127] (Session: 1635677142479452406): SELECT count(*) AS "COUNT" FROM (select local_date, LOCAL_HOUR, events from DWH_MON.V.M_EVENTS limit 1) AS "T1" LIMIT 1
[2019-06-07T12:28:00,838][WARN ][logstash.inputs.jdbc ] Exception when executing JDBC query {:exception=>#}
Because Logstash wants to know how many rows to expect, it wraps the statement in a count query with LIMIT 1, but Exasol cannot process a LIMIT on an aggregated select.

This looks like a problem in Logstash itself, I guess. The LIMIT 1 part is unnecessary and should not be there in the first place.
You could try to put an SQL pre-processor in front of the database that identifies such count queries and removes the LIMIT, but it may be easier to patch Logstash itself.
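As a rough illustration of what such a pre-processing step would have to do, wherever it is hooked in (a proxy driver, or a local patch of the plugin's query handling), here is a minimal sketch in plain Ruby; the rewrite_count_query helper is hypothetical and only demonstrates the string transformation, not a real Logstash or Sequel extension point:
# Hypothetical helper: strip the trailing LIMIT 1 that Exasol rejects from a generated count query.
# This only sketches the string rewrite; it is not wired into Logstash.
def rewrite_count_query(sql)
  if sql =~ /\ASELECT count\(\*\)/i
    sql.sub(/\s+LIMIT\s+1\s*\z/i, '')   # drop only the outer, trailing LIMIT 1
  else
    sql
  end
end

puts rewrite_count_query(
  %(SELECT count(*) AS "COUNT" FROM (select local_date, LOCAL_HOUR, events from DWH_MON.V.M_EVENTS limit 1) AS "T1" LIMIT 1)
)
# => SELECT count(*) AS "COUNT" FROM (select ... limit 1) AS "T1"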

Related

How to push huge data in less time from a database table to Elasticsearch using Logstash

I'm pushing 500,000 records from a Postgres database to Elasticsearch using Logstash, but it takes 40 minutes to complete the process. I want to reduce the processing time. I have set pipeline.batch.size: 1000 and pipeline.batch.delay: 50 in the logstash.yml file and increased the heap space from 1 GB to 2 GB in the jvm.options file, but the records are still processed in the same time.
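For reference, the changes described above correspond to settings like the following (a sketch; only the relevant lines of each file are shown):
# logstash.yml
pipeline.batch.size: 1000
pipeline.batch.delay: 50

# jvm.options (heap raised from 1 GB to 2 GB)
-Xms2g
-Xmx2g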
Conf file
input {
  jdbc {
    jdbc_driver_library => "C:\Users\Downloads\elk stack/postgresql-42.3.1.jar"
    jdbc_driver_class => "org.postgresql.Driver"
    jdbc_connection_string => "jdbc:postgresql://localhost:5432/postgres"
    jdbc_user => "postgres"
    jdbc_password => "postgres123"
    statement => "SELECT * FROM jolap.order_desk_activation"
  }
}
output {
  elasticsearch {
    hosts => ["http://localhost:9200/"]
    index => "test-powerbi-transformed"
    document_type => "_doc"
  }
  stdout {}
}
The problem is not the Logstash pipeline or the batch size. As suggested above, you need to get the volume out of the database in less time.
This can be achieved with parallel query hints, which speed up the query by letting it use more of the database infrastructure's processor cores (don't forget to consult your DBA before applying this). Once you are getting the volume of records in less time, you can scale Logstash or tweak the pipeline settings.
Refer to this link.
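Beyond the query itself, the JDBC input's own fetch and paging options are also worth checking. Here is a minimal sketch of the input from the question with those options added; the values are illustrative, not tuned for this dataset:
input {
  jdbc {
    jdbc_driver_library => "C:\Users\Downloads\elk stack/postgresql-42.3.1.jar"
    jdbc_driver_class => "org.postgresql.Driver"
    jdbc_connection_string => "jdbc:postgresql://localhost:5432/postgres"
    jdbc_user => "postgres"
    jdbc_password => "postgres123"
    # Stream rows from the driver in larger chunks instead of the driver default
    jdbc_fetch_size => 10000
    # Page through the table instead of pulling it in a single statement
    jdbc_paging_enabled => true
    jdbc_page_size => 50000
    statement => "SELECT * FROM jolap.order_desk_activation"
  }
}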

Migrating 3 million records from Oracle to Elastic search using logstash

We are trying to migrate around 3 million records from Oracle to Elasticsearch using Logstash.
We are applying a couple of jdbc_streaming filters as part of our Logstash script: one to load connected nested objects and another to run a hierarchical query to load data into another nested object in the index.
We are able to index 0.4 million records in 24 hours; these 0.4 million records occupy around 300 MB.
We tried multiple approaches to migrate the data quickly from Oracle into Elasticsearch but were not able to achieve the desired results.
Please find below the approaches we tried:
1. In the Logstash script, we used the jdbc_fetch_size, jdbc_page_size, jdbc_paging_enabled and clean_run parameters, set pipeline workers to 20, and set the pipeline batch size to 125 in the logstash.yml file.
2. On the Elasticsearch side, we set the number of replicas to 0 and the refresh interval to -1, tried increasing the value of the indices.memory.index_buffer_size parameter, and increased the number of watcher queues in the elasticsearch.yml file.
(A sketch of where the Logstash-side settings live is shown after this list.)
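For orientation, here is a minimal sketch of where the Logstash-side settings from point 1 live; the jdbc_* options go into the jdbc input block of the pipeline config shown further below, and the page/fetch sizes are illustrative since the question does not state the values used. The Elasticsearch-side settings (number of replicas, refresh interval) are per-index settings, while indices.memory.index_buffer_size is a node setting in elasticsearch.yml.
# logstash.yml (values from point 1 above)
pipeline.workers: 20
pipeline.batch.size: 125

# inside the jdbc input block of the pipeline config (illustrative values)
jdbc_paging_enabled => true
jdbc_page_size => 50000
jdbc_fetch_size => 1000
clean_run => false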
We have basically searched around and followed various suggestions from this site and others, but nothing seems to have worked so far.
We are using a single-node Elasticsearch setup, and neither the DB nor the Elasticsearch node is on the machine from which we are running the Logstash script.
Please find below the logstash config file
input {
  jdbc {
    jdbc_driver_library => "LIB"
    jdbc_driver_class => "Java::oracle.jdbc.driver.OracleDriver"
    jdbc_connection_string => "connection url"
    jdbc_user => "user"
    jdbc_password => "pwd"
    statement => "select * from "
  }
}
filter {
  jdbc_streaming {
    jdbc_driver_library => "LIB"
    jdbc_driver_class => "Java::oracle.jdbc.driver.OracleDriver"
    jdbc_connection_string => "connection url"
    jdbc_user => "user"
    jdbc_password => "pwd"
    #statement => "select claimnumber,claimtype,is_active from claim where policynumber = :policynumber"
    parameters => {"policynumber" => "policynumber"}
    target => "nested node"
  }
  stdout { codec => json }
}
filter {
  jdbc_streaming {
    jdbc_driver_library => "LIB"
    jdbc_driver_class => "Java::oracle.jdbc.driver.OracleDriver"
    jdbc_connection_string => "connection url"
    jdbc_user => "user"
    jdbc_password => "pwd"
    statement => "select listagg(column name,'/' ) within group(order by column name) from
                  where LEVEL > 1
                  start with =:
                  connect by prior = "
    parameters => {"p1" => "p1"}
    target => "nested node1"
  }
}
output {
  elasticsearch {
    hosts => [""]
    index => "<index_name>"
    document_id => "%{doc_id}"
  }
}
Can you please help us identify the bottlenecks and suggest how to increase indexing performance?
Thank you.

Logstash with Elasticsearch: I am using Logstash to ingest data into an ES index, but now I want Logstash to run 24/7

#file:db.conf
input {
  jdbc {
    jdbc_driver_library => ""
    jdbc_driver_class => "Java::oracle.jdbc.driver.OracleDriver"
    jdbc_connection_string => "jdbc:oracle:thin:@abcd.klm.uvw:1521/qtp1"
    jdbc_user => "user_wew"
    jdbc_password => "password_wew"
    statement => "select col1, col2, col3, col4, col5, col6, countid, max(version) as mv from master_object_table where version > :sql_last_value group by countid"
    schedule => "* * * * *"
    last_run_metadata_path => "C:/ES1/ELK_stack_7.4.2/logstash-7.4.2/logstash-7.4.2/Master_refresh_a.txt"
    use_column_value => true
    tracking_column => "version"
  }
}
filter {
  mutate {
    convert => {
      "countid" => "string"
    }
  }
}
output {
  elasticsearch {
    hosts => "localhost:9200"
    index => "refresh_index_a"
    document_id => "%{countid}"
    #document_type => "_doc"
  }
  file {
    path => "C:\\ES1\\ELK_stack_7.4.2\\logstash-7.4.2\\logstash-7.4.2\\bin\\logstashESRecordsIngestionDetails_refresh_a.txt"
    codec => rubydebug
  }
  stdout { codec => rubydebug }
}
Above is my Logstash config file. I want to run this Logstash pipeline 24/7, and since it is ingesting live data into the ES index, I also need to handle the case where the machine it runs on shuts down. Is there any way for Logstash on another node to continue the work if one server goes down? Please suggest.
As per the documentation:
Logstash is horizontally scalable and can form groups of nodes running the same pipeline. Logstash's adaptive buffering capabilities will facilitate smooth streaming even through variable throughput loads. If the Logstash layer becomes an ingestion bottleneck, simply add more nodes to scale out. Here are a few general recommendations:
Beats should load balance across a group of Logstash nodes.
A minimum of two Logstash nodes are recommended for high availability.
It's common to deploy just one Beats input per Logstash node, but multiple Beats inputs can also be deployed per Logstash node to expose independent endpoints for different data sources.

LOGSTASH - Issue with JDBC input connected to HSQL DB database when selecting ARRAY columns

I'm having trouble importing HSQLDB database content using Logstash's JDBC input plugin.
The problem occurs when I try to fetch a column that is of type ARRAY.
Please note that if I try to fetch non-array columns, it works just fine.
I get the following error message from Logstash:
[WARN ][logstash.inputs.jdbc ] Exception when executing JDBC query {:exception=>#<Sequel::DatabaseError: Java::OrgLogstash::MissingConverterException: Missing Converter handling for full class name=org.hsqldb.jdbc.JDBCArray, simple name=JDBCArray>}
[INFO ][logstash.pipeline ] Pipeline has terminated {:pipeline_id=>"hsql", :thread=>"#<Thread:0x7b626752 run>"}
Please find below the input part of the Logstash conf file (PLATFORM_DESTINATION_CANDIDATES is the name of a column in a table.)
input {
  jdbc {
    jdbc_driver_library => "hsqldb_2.5.0.jar"
    jdbc_driver_class => "org.hsqldb.jdbc.JDBCDriver"
    jdbc_connection_string => "jdbc:hsqldb:hsql://localhost/probe"
    jdbc_user => "SA"
    statement => "SELECT PLATFORM_DESTINATION_CANDIDATES FROM PUBLIC.MESSAGES_SENT"
    connection_retry_attempts => 10
  }
}
Did any of you encounter this kind of problem, and how did you solve it?
Thanks.
OS: Windows 10
Logstash version: 6.3.1
HSQLDB driver version: 2.5.0 (LINK)
I do not know if it is the best solution, but I managed to solve my issue. Here is how.
I replaced the line:
statement => "SELECT PLATFORM_DESTINATION_CANDIDATES FROM PUBLIC.MESSAGES_SENT"
with:
statement => "SELECT concat_ws('', PLATFORM_DESTINATION_CANDIDATES, '') AS str_platforms FROM PUBLIC.MESSAGES_SENT"
This puts into the field str_platforms, of type string, data that looks like: ARRAY[1,2,3,4]
With the following ruby filter, I then remove the unwanted characters (ARRAY[ and ]) from the field:
ruby {
  code => "event.set('listRxUnits', event.get('str_platforms').split('ARRAY[')[1].split(']')[0])"
}
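If the values are also wanted as an actual array rather than a comma-separated string, a mutate split step could follow; this is only a sketch, assuming the stripped field holds comma-separated values as in the ARRAY[1,2,3,4] example above:
filter {
  mutate {
    # Turn the comma-separated string "1,2,3,4" into an array field
    split => { "listRxUnits" => "," }
  }
}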

Logstash not updating last run metadata file

In my Logstash pipeline I want to download the most recent data from a database, using :sql_last_value in the query and the tracking_column option in the conf file. I've set last_run_metadata_path because I have 2 pipelines for the same table, but Logstash saved the last date only once or stopped saving new dates, and now I can see in the logs that it runs queries with the same :sql_last_value from the metadata file.
This is how my conf file looks; it has many jdbc inputs and one of them is below:
jdbc {
  jdbc_driver_library => "/opt/logstash/lib/ojdbc8.jar"
  jdbc_driver_class => "Java::oracle.jdbc.driver.OracleDriver"
  jdbc_connection_string => ""
  jdbc_user => ""
  jdbc_password => ""
  schedule => "*/15 * * * *"
  statement_filepath => "/etc/logstash/queries/UAT/transactions_UAT.sql"
  use_column_value => true
  tracking_column => 'sys_created_on'
  tracking_column_type => "timestamp"
  last_run_metadata_path => "/etc/logstash/conf.d/lastrun_metadata/transactions_uat_metadata"
  tags => ["transactions_uat"]
}
Content of the metadata file:
--- 2018-05-26 08:41:55.000000000 -04:00
I can see in the logs that Logstash always uses the same date from the metadata file and never updates it:
select * from snc_uat.syslog_transaction0007
where "sys_created_on" >= TIMESTAMP '2018-05-26 08:41:55.000000 -04:00'
Logstash is working and is downloading recent data but unnecessarily processes data that already exists. Why is Logstash not updating metadata?
This is because your comparison operator is greater than or equal to (>=); change it to greater than (>) and it will fix your problem.
Hope it helps.
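Assuming the statement file follows the pattern visible in the logs, the change in /etc/logstash/queries/UAT/transactions_UAT.sql would look roughly like this:
-- use a strict comparison so the last ingested row is not re-selected
select * from snc_uat.syslog_transaction0007
where "sys_created_on" > :sql_last_value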
