Migrating 3 million records from Oracle to Elasticsearch using Logstash

We are trying to migrate around 3 million records from Oracle to Elasticsearch using Logstash.
We are applying a couple of jdbc_streaming filters as part of our Logstash script: one to load connected nested objects and another to run a hierarchical query that loads data into another nested object in the index.
We are able to index 0.4 million records in 24 hours. The total size occupied by 0.4 million records is around 300 MB.
We tried multiple approaches to migrate the data quickly from Oracle into Elasticsearch but were not able to achieve the desired results.
Please find below the approaches we tried:
1. In the Logstash script, we used the jdbc_fetch_size, jdbc_page_size, jdbc_paging_enabled and clean_run parameters, set the pipeline workers to 20, and set the pipeline batch size to 125 in the logstash.yml file.
2. On the Elasticsearch side, we set the number of replicas to 0 and the refresh interval to -1, tried increasing the value of the indices.memory.index_buffer_size parameter, and increased the number of watcher queues in the elasticsearch.yml file. (A sketch of these settings follows this list.)
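For reference, here is a minimal sketch of where the settings above live. The page/fetch sizes and the index name are placeholders, not the actual values we used:

# logstash.yml
pipeline.workers: 20
pipeline.batch.size: 125

# jdbc input options in the pipeline .conf (example values)
jdbc_paging_enabled => true
jdbc_page_size => 100000
jdbc_fetch_size => 10000
clean_run => true

# index settings applied before the bulk load (placeholder index name)
PUT <index_name>/_settings
{ "index": { "number_of_replicas": 0, "refresh_interval": "-1" } }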
We basically searched online and followed various suggestions from this site and others, but nothing seems to have worked so far.
We are using a single-node Elasticsearch setup, and neither the DB nor the Elasticsearch node is on the machine from which we are running the Logstash script.
Please find below the Logstash config file:
input {
  jdbc {
    jdbc_driver_library => "LIB"
    jdbc_driver_class => "Java::oracle.jdbc.driver.OracleDriver"
    jdbc_connection_string => "connection url"
    jdbc_user => "user"
    jdbc_password => "pwd"
    statement => "select * from "
  }
}
filter {
  jdbc_streaming {
    jdbc_driver_library => "LIB"
    jdbc_driver_class => "Java::oracle.jdbc.driver.OracleDriver"
    jdbc_connection_string => "connection url"
    jdbc_user => "user"
    jdbc_password => "pwd"
    #statement => "select claimnumber,claimtype,is_active from claim where policynumber = :policynumber"
    parameters => {"policynumber" => "policynumber"}
    target => "nested node"
  }
}
filter {
  jdbc_streaming {
    jdbc_driver_library => "LIB"
    jdbc_driver_class => "Java::oracle.jdbc.driver.OracleDriver"
    jdbc_connection_string => "connection url"
    jdbc_user => "user"
    jdbc_password => "pwd"
    statement => "select listagg(column name,'/' ) within group(order by column name) from
                  where LEVEL > 1
                  start with =:
                  connect by prior = "
    parameters => {"p1" => "p1"}
    target => "nested node1"
  }
}
output {
  elasticsearch {
    hosts => [""]
    index => "<index_name>"
    document_id => "%{doc_id}"
  }
  stdout { codec => json }
}
Can you please help us identify the bottlenecks and also make suggestions on how to increase indexing performance?
Thank You

Related

How to push huge data in less time from a database table to Elasticsearch using Logstash

I'm processing 500,000 records from a Postgres database into Elasticsearch using Logstash, but it is taking 40 minutes to complete the process. I want to reduce the processing time. I have changed pipeline.batch.size: 1000 and pipeline.batch.delay: 50 in the logstash.yml file and increased the heap space from 1 GB to 2 GB in the jvm.options file, but it is still processing the records in the same time.
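For reference, a sketch of the changes described above, assuming the default Logstash file locations and that both the minimum and maximum heap were raised to 2 GB:

# logstash.yml
pipeline.batch.size: 1000
pipeline.batch.delay: 50

# jvm.options
-Xms2g
-Xmx2g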
Conf file
input {
  jdbc {
    jdbc_driver_library => "C:\Users\Downloads\elk stack/postgresql-42.3.1.jar"
    jdbc_driver_class => "org.postgresql.Driver"
    jdbc_connection_string => "jdbc:postgresql://localhost:5432/postgres"
    jdbc_user => "postgres"
    jdbc_password => "postgres123"
    statement => "SELECT * FROM jolap.order_desk_activation"
  }
}
output {
  elasticsearch {
    hosts => ["http://localhost:9200/"]
    index => "test-powerbi-transformed"
    document_type => "_doc"
  }
  stdout {}
}
The problem is not the Logstash pipeline or the batch size. As suggested above, you need to get the volume in less time.
This can be achieved using parallel hints, which make the query much faster because the query starts using more of the DB infrastructure's processor cores (don't forget to consult your DBA before applying this). Once you start getting the volume of records in less time, you can scale your Logstash setup or tweak the pipeline settings.
Refer to this link.
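As an illustration only, an Oracle-style parallel hint would sit inside the jdbc input's statement, roughly as sketched below; the table name, alias and degree of parallelism are placeholders rather than values from the question, and the exact hint syntax depends on the database in use:

statement => "SELECT /*+ PARALLEL(t, 4) */ * FROM some_schema.some_table t"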

Logstash with Elasticsearch: I am using Logstash to ingest data into an ES index, but now I want Logstash to run 24/7

#file:db.conf
input {
  jdbc {
    jdbc_driver_library => ""
    jdbc_driver_class => "Java::oracle.jdbc.driver.OracleDriver"
    jdbc_connection_string => "jdbc:oracle:thin:#abcd.klm.uvw:1521/qtp1"
    jdbc_user => "user_wew"
    jdbc_password => "password_wew"
    statement => "select col1, col2, col3, col4, col5, col6, countid,max(version) as mv from master_object_table where version >:sql_last_value group by countid"
    schedule => "* * * * *"
    last_run_metadata_path => "C:/ES1/ELK_stack_7.4.2/logstash-7.4.2/logstash-7.4.2/Master_refresh_a.txt"
    use_column_value => true
    tracking_column => "version"
  }
}
filter {
  mutate {
    convert => {
      "countid" => "string"
    }
  }
}
output {
  elasticsearch {
    hosts => "localhost:9200"
    index => "refresh_index_a"
    document_id => "%{countid}"
    #document_type="_doc"
  }
  file {
    path => "C:\\ES1\\ELK_stack_7.4.2\\logstash-7.4.2\\logstash-7.4.2\\bin\\logstashESRecordsIngestionDetails_refresh_a.txt"
    codec => rubydebug
  }
  stdout { codec => rubydebug }
}
Above is my Logstash config file. I want to run this Logstash 24/7, and if the machine on which this Logstash is running shuts down, how can I manage that, since this Logstash is ingesting live data into the ES index? Please suggest. Is there any way that, if one server goes down, the Logstash on another node will continue the work?
As per the documentation:
Logstash is horizontally scalable and can form groups of nodes running the same pipeline. Logstash's adaptive buffering capabilities will facilitate smooth streaming even through variable throughput loads. If the Logstash layer becomes an ingestion bottleneck, simply add more nodes to scale out. Here are a few general recommendations:
- Beats should load balance across a group of Logstash nodes.
- A minimum of two Logstash nodes are recommended for high availability.
- It's common to deploy just one Beats input per Logstash node, but multiple Beats inputs can also be deployed per Logstash node to expose independent endpoints for different data sources.
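To illustrate the first recommendation, a Beats shipper can be pointed at a group of Logstash nodes and balance load across them; the hostnames below are placeholders, not values from the question:

# filebeat.yml (sketch)
output.logstash:
  hosts: ["logstash-node1:5044", "logstash-node2:5044"]
  loadbalance: true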

How to turn off the pre-check of how many rows are in the result set in the Logstash jdbc input

I'm trying to turn off the pre-select that Logstash does to determine the count of rows, but the Exasol DB does not support any limit in any aggregation. Is there any way to turn it off in Logstash?
input {
  jdbc {
    jdbc_driver_library => "/opt/jdbc/exajdbc6.0.15.jar"
    jdbc_driver_class => "com.exasol.jdbc.EXADriver"
    jdbc_user => "am_mon"
    jdbc_password => "XXXXX"
    jdbc_connection_string => "jdbc:exa:xxx.xx.xx.xx..xx:xxxx"
    jdbc_default_timezone => "Europe/Berlin"
    # schedule => "05 7 * * *"
    statement => "select local_date, LOCAL_HOUR, events from DWH_MON.V.M_EVENTS"
  }
}
Logstash Error Log:
[2019-06-07T12:28:00,834][ERROR][logstash.inputs.jdbc ] Java::JavaSql::SQLException: LIMIT not allowed in aggregated selects [line 1, column 127] (Session: 1635677142479452406): SELECT count(*) AS "COUNT" FROM (select local_date, LOCAL_HOUR, events from DWH_MON.V.M_EVENTS limit 1) AS "T1" LIMIT 1
[2019-06-07T12:28:00,838][WARN ][logstash.inputs.jdbc ] Exception when executing JDBC query {:exception=>#}
As Logstash wants to see how many rows are to be expected, it uses LIMIT 1, but Exasol can't process any LIMIT on aggregations.
It's a problem with Logstash, I guess. The LIMIT 1 part is unnecessary and should not be there in the first place.
You may try to use an SQL pre-processor to identify such queries and remove the LIMIT manually, but maybe it's easier to patch Logstash itself.

Logstash - Got error: "An unknown error occurred sending a bulk request to Elasticsearch"

I am trying to move SQL Server table records to Elasticsearch via Logstash; it's basically a synchronization. But I am getting an unknown error from Logstash. I have provided my configuration file as well as the error log.
Configuration:
input {
  jdbc {
    #https://www.elastic.co/guide/en/logstash/current/plugins-inputs-jdbc.html#plugins-inputs-jdbc-record_last_run
    jdbc_connection_string => "jdbc:sqlserver://localhost-serverdb;database=Application;user=dev;password=system23$"
    jdbc_driver_class => "com.microsoft.sqlserver.jdbc.SQLServerDriver"
    jdbc_user => nil
    # The path to our downloaded jdbc driver
    jdbc_driver_library => "C:\Program Files (x86)\sqljdbc6.2\enu\sqljdbc4-3.0.jar"
    # The name of the driver class for SqlServer
    jdbc_driver_class => "com.microsoft.sqlserver.jdbc.SQLServerDriver"
    # executes every minute
    schedule => "* * * * *"
    # executes at the 0th minute of every hour
    #schedule => "0 * * * *"
    last_run_metadata_path => "C:\Software\ElasticSearch\logstash-6.4.0\.logstash_jdbc_last_run"
    #record_last_run => false
    #clean_run => true
    # Query for testing purposes
    statement => "Select * from tbl_UserDetails"
  }
}
output {
  elasticsearch {
    hosts => ["10.187.144.113:9200"]
    index => "tbl_UserDetails"
    # document_id is a unique id; this has to be provided during sync, else we may get duplicate entries in the Elasticsearch index
    document_id => "%{Login_User_Id}"
  }
}
Error Log:
[2018-09-18T21:04:32,171][ERROR][logstash.outputs.elasticsearch] An unknown error occurred sending a bulk request to Elasticsearch. We will retry indefinitely {
  :error_message=>"\"\\xF0\" from ASCII-8BIT to UTF-8",
  :error_class=>"LogStash::Json::GeneratorError",
  :backtrace=>[
    "C:/Software/ElasticSearch/logstash-6.4.0/logstash-core/lib/logstash/json.rb:27:in `jruby_dump'",
    "C:/Software/ElasticSearch/logstash-6.4.0/vendor/bundle/jruby/2.3.0/gems/logstash-output-elasticsearch-9.2.0-java/lib/logstash/outputs/elasticsearch/http_client.rb:119:in `block in bulk'",
    "org/jruby/RubyArray.java:2486:in `map'",
    "C:/Software/ElasticSearch/logstash-6.4.0/vendor/bundle/jruby/2.3.0/gems/logstash-output-elasticsearch-9.2.0-java/lib/logstash/outputs/elasticsearch/http_client.rb:119:in `block in bulk'",
    "org/jruby/RubyArray.java:1734:in `each'",
    "C:/Software/ElasticSearch/logstash-6.4.0/vendor/bundle/jruby/2.3.0/gems/logstash-output-elasticsearch-9.2.0-java/lib/logstash/outputs/elasticsearch/http_client.rb:117:in `bulk'",
    "C:/Software/ElasticSearch/logstash-6.4.0/vendor/bundle/jruby/2.3.0/gems/logstash-output-elasticsearch-9.2.0-java/lib/logstash/outputs/elasticsearch/common.rb:275:in `safe_bulk'",
    "C:/Software/ElasticSearch/logstash-6.4.0/vendor/bundle/jruby/2.3.0/gems/logstash-output-elasticsearch-9.2.0-java/lib/logstash/outputs/elasticsearch/common.rb:180:in `submit'",
    "C:/Software/ElasticSearch/logstash-6.4.0/vendor/bundle/jruby/2.3.0/gems/logstash-output-elasticsearch-9.2.0-java/lib/logstash/outputs/elasticsearch/common.rb:148:in `retrying_submit'",
    "C:/Software/ElasticSearch/logstash-6.4.0/vendor/bundle/jruby/2.3.0/gems/logstash-output-elasticsearch-9.2.0-java/lib/logstash/outputs/elasticsearch/common.rb:38:in `multi_receive'",
    "org/logstash/config/ir/compiler/OutputStrategyExt.java:114:in `multi_receive'",
    "org/logstash/config/ir/compiler/AbstractOutputDelegatorExt.java:97:in `multi_receive'",
    "C:/Software/ElasticSearch/logstash-6.4.0/logstash-core/lib/logstash/pipeline.rb:372:in `block in output_batch'",
    "org/jruby/RubyHash.java:1343:in `each'",
    "C:/Software/ElasticSearch/logstash-6.4.0/logstash-core/lib/logstash/pipeline.rb:371:in `output_batch'",
    "C:/Software/ElasticSearch/logstash-6.4.0/logstash-core/lib/logstash/pipeline.rb:323:in `worker_loop'",
    "C:/Software/ElasticSearch/logstash-6.4.0/logstash-core/lib/logstash/pipeline.rb:285:in `block in start_workers'"]}
[2018-09-18T21:05:00,140][INFO ][logstash.inputs.jdbc ] (0.008273s) Select * from tbl_UserDetails
Logstash version: 6.4.0
Elasticsearch version: 6.3.1
Thanks in advance.
You have a character '\xF0' in the database which is causing this issue. This '\xF0' character might be the first byte of a multibyte character, but since Ruby here is trying to decode it using ASCII-8BIT, it treats each byte as a separate character.
You may try using columns_charset to set the proper charset: https://www.elastic.co/guide/en/logstash/current/plugins-inputs-jdbc.html#plugins-inputs-jdbc-columns_charset
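A minimal sketch of what that could look like in the jdbc input; the column name and charset are placeholders, since the affected column is not identified in the question:

input {
  jdbc {
    # ... existing connection settings ...
    # hypothetical column name and its source charset
    columns_charset => { "user_name" => "ISO-8859-1" }
  }
}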
The above issue is resolved.
Thanks for your support, guys.
The change I made was in the input -> jdbc section, where I added the two properties below:
input {
  jdbc {
    tracking_column => "login_user_id"
    use_column_value => true
  }
}
and under output -> elasticsearch I changed these two properties:
output {
  elasticsearch {
    document_id => "%{login_user_id}"
    document_type => "user_details"
  }
}
The main takeaway here is that all the values should be specified in lowercase.

Logstash not updating last run metadata file

In my Logstash setup I want to download the most recent data from a database, using :sql_last_value in a query and the tracking_column option in the conf file. I've set last_run_metadata_path because I have 2 pipelines for the same table, but Logstash saved the last date only once or stopped saving new dates, and now I can see in the logs that it runs queries with the same :sql_last_value from the metadata file.
This is what my conf file looks like; it has many jdbc inputs, and one of them is below:
jdbc {
  jdbc_driver_library => "/opt/logstash/lib/ojdbc8.jar"
  jdbc_driver_class => "Java::oracle.jdbc.driver.OracleDriver"
  jdbc_connection_string => ""
  jdbc_user => ""
  jdbc_password => ""
  schedule => "*/15 * * * *"
  statement_filepath => "/etc/logstash/queries/UAT/transactions_UAT.sql"
  use_column_value => true
  tracking_column => 'sys_created_on'
  tracking_column_type => "timestamp"
  last_run_metadata_path => "/etc/logstash/conf.d/lastrun_metadata/transactions_uat_metadata"
  tags => ["transactions_uat"]
}
Content of the metadata file:
--- 2018-05-26 08:41:55.000000000 -04:00
I can see in the logs that Logstash always uses the same date from the metadata file and never updates it:
select * from snc_uat.syslog_transaction0007
where "sys_created_on" >= TIMESTAMP '2018-05-26 08:41:55.000000 -04:00'
Logstash is working and is downloading recent data, but it unnecessarily processes data that already exists. Why is Logstash not updating the metadata?
This is because your comparison operator is greater than or equal to, i.e. >=. Please change it to > and it will fix your problem.
Hope it helps.
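As an illustration, assuming the SQL in transactions_UAT.sql mirrors the query visible in the logs above (the actual file contents are not shown in the question), the change would look roughly like this:

select * from snc_uat.syslog_transaction0007
where "sys_created_on" > :sql_last_value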
