I have a strange problem. I have 5 Logstash jobs that load data from DB2 into Elasticsearch using JDBC. The document ID is a number stored as a string, and the jobs are partitioned by its leading digit: IDs starting with 0 and 1 go to the first Logstash job, IDs starting with 2 and 3 to the second, IDs starting with 4 and 5 to the third, and so on. Each job is supposed to load around 30M records of ~2k each, but each one loads only 24M records, without any errors. After that point the job still reports no errors, but _stats on Elasticsearch shows an increasing deleted count. When I check individual records from the Logstash jobs, I see that the document ID does not exist in Elasticsearch, yet the record is not loaded and counts towards the deletes.
When I run the same record through a completely new Logstash instance, the record gets inserted into Elasticsearch.
I suspect there is a 50GB limit somewhere that blocks these inserts from a single Logstash instance. I am very confused. Can someone help?
The Elasticsearch instance is running in a cloud environment and I don't have a dedicated cluster. I just have an index with around 12 shards of 30GB each allocated.
input {
  jdbc {
    #input configurations
    jdbc_driver_library => "db2jcc4-10.5.jar"
    jdbc_driver_class => "com.ibm.db2.jcc.DB2Driver"
    jdbc_connection_string => "jdbc:db2://..."
    columns_charset => {}
    jdbc_fetch_size => 2000
    jdbc_page_size => 5000000
    jdbc_paging_enabled => false
    lowercase_column_names => true
    jdbc_user => "*****"
    jdbc_password => "*****"
    statement_filepath => "part_a.sql"
    use_column_value => true
    schedule => "*/5 * * * *"
    sql_log_level => warn
    tracking_column => "lst_updt_ts"
    tracking_column_type => "timestamp"
    record_last_run => true
    last_run_metadata_path => "last_run_a.conf"
  }
}
filter {
  mutate {
    convert => {
      ...
    }
    rename => {
      ...
    }
  }
}
output {
  elasticsearch {
    #output configurations
    hosts => "https://...:8443"
    index => "demographics"
    document_type => "demographics_doc"
    document_id => "%{record_id}"
    manage_template => false
    action => "update"
    doc_as_upsert => true
    failure_type_logging_whitelist => []
    user => "..."
    password => "..."
  }
  stdout {
    codec => rubydebug
  }
}
Related
I am trying to keep Elasticsearch indexes up to date with data stored in a SQL database, so that every row added to the DB is automatically added to Elasticsearch.
I tried to use the primary key of the DB table as the Elasticsearch _id field, so that each time the schedule triggers Logstash (once a minute), documents that are already in Elasticsearch don't get re-added.
This is my Logstash .conf file:
input {
  jdbc {
    jdbc_connection_string => "JDBC-Connection-String"
    jdbc_driver_class => "com.microsoft.sqlserver.jdbc.SQLServerDriver"
    jdbc_user => "JDBC-Connection-User"
    jdbc_driver_library => "JDBC-Driver-Path"
    statement => "SELECT MyCol1 MyCol2 FROM MyTable"
    use_column_value => true
    tracking_column => "MyCol1"
    tracking_column_type => "numeric"
    clean_run => true
    schedule => "*/1 * * * *"
  }
}
output {
  elasticsearch {
    hosts => "http://localhost:9200"
    index => "MyIndex"
    document_id => "%{MyCol1}"
  }
  stdout { }
}
After Logstash finishes, I find only 1 document in Elasticsearch, with "_id": "%{MyCol1}". Why can't Logstash resolve the id value properly?
P.S. MyCol1 is the primary key of MyTable.
A few things to keep in mind:
The field referenced in document_id must be returned by the query.
Field names are case sensitive, so use the exact name.
Set clean_run => false.
Use :sql_last_value in the query, on the column that identifies new records.
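A minimal sketch of how those points might be applied to the configuration in the question. It assumes that MyCol1 only ever increases, that the missing comma between MyCol1 and MyCol2 in the original statement was unintended, and that the jdbc input is left at its default of lowercase_column_names => true, in which case the event field is mycol1 rather than MyCol1. The lowercase index name is also a change from the question, since Elasticsearch rejects index names that contain uppercase letters.

```
input {
  jdbc {
    jdbc_connection_string => "JDBC-Connection-String"
    jdbc_driver_class => "com.microsoft.sqlserver.jdbc.SQLServerDriver"
    jdbc_user => "JDBC-Connection-User"
    jdbc_driver_library => "JDBC-Driver-Path"
    # Only fetch rows newer than the last tracked value (assumes MyCol1 only grows)
    statement => "SELECT MyCol1, MyCol2 FROM MyTable WHERE MyCol1 > :sql_last_value ORDER BY MyCol1"
    use_column_value => true
    # Column names are lowercased by default, so the tracked field is "mycol1"
    tracking_column => "mycol1"
    tracking_column_type => "numeric"
    # Keep the tracked value between runs
    clean_run => false
    schedule => "*/1 * * * *"
  }
}
output {
  elasticsearch {
    hosts => "http://localhost:9200"
    index => "myindex"
    # Must match the (lowercased) field name produced by the query
    document_id => "%{mycol1}"
  }
  stdout { }
}
```

If a document with the literal _id "%{mycol1}" still shows up, the field is most likely missing from the event under that exact name; the stdout output can be used to check what the events actually contain.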
I'm using Logstash to send data from a MySQL database to Elasticsearch.
Each time Logstash runs, the number of documents stays the same but the index size increases.
First run: count 333, size 206kb
Now: count 333, size 1.6MB
input {
  jdbc {
    jdbc_connection_string => "jdbc:mysql://***rds.amazonaws.com:3306/"
    jdbc_user => "***"
    jdbc_password => "***"
    jdbc_driver_library => "***\mysql-connector-java-5.1.46/mysql-connector-java-5.1.46-bin.jar"
    jdbc_driver_class => "com.mysql.jdbc.Driver"
    statement => "SELECT id,title,url, FROM tableName"
    schedule => "*/2 * * * *"
  }
}
filter {
  json {
    source => "texts"
    target => "texts"
  }
  mutate { remove_field => [ "@version", "@timestamp" ] }
}
output {
  stdout {
    codec => json_lines
  }
  amazon_es {
    hosts => ["***es.amazonaws.com"]
    document_id => "%{id}"
    index => "texts"
    region => "***"
    aws_access_key_id => '***'
    aws_secret_access_key => '***'
  }
}
Apparently you're always sending the same data over and over. In ES, each time you update a document (i.e. by using the same ID), the older version gets deleted and stays in the index for a while (until the underlying index segments get merged).
Between each run, you can issue the following command:
curl -XGET ***es.amazonaws.com/_cat/indices?v
In the response you get, check the docs.deleted column and you'll see that the number of deleted documents increases.
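One way to avoid resending the same rows on every run, sketched here under assumptions that the question does not confirm: id is taken to be an auto-incrementing numeric key, and the column list is limited to the columns that survive in the original statement. With use_column_value and tracking_column, the jdbc input remembers the highest id it has seen and substitutes it for :sql_last_value on the next run, so unchanged rows are no longer re-indexed and no new deleted versions accumulate.

```
input {
  jdbc {
    jdbc_connection_string => "jdbc:mysql://***rds.amazonaws.com:3306/"
    jdbc_user => "***"
    jdbc_password => "***"
    jdbc_driver_library => "***\mysql-connector-java-5.1.46/mysql-connector-java-5.1.46-bin.jar"
    jdbc_driver_class => "com.mysql.jdbc.Driver"
    # Fetch only rows with an id above the last one seen (assumes id only grows)
    statement => "SELECT id, title, url FROM tableName WHERE id > :sql_last_value ORDER BY id"
    use_column_value => true
    tracking_column => "id"
    tracking_column_type => "numeric"
    schedule => "*/2 * * * *"
  }
}
```

Rows that are modified in place keep their id, so this variant would not pick them up; tracking a last-modified timestamp column instead would be needed for that.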
Using Logstash, I would like to know how to send data to ES without creating duplicates. That is, I want to send only data that is not yet present in the ES instance, and not data that is already there.
Today I delete all the data in the specific index in ES and then resend everything that is in the database. This prevents duplicates, but it is not ideal since I have to delete the data manually.
This is the .config I am currently using:
input {
  jdbc {
    jdbc_driver_library => "/Users/Carl/Progs/logstash-6.3.0/mysql-connector-java/mysql-connector-java-5.1.46-bin.jar"
    jdbc_driver_class => "com.mysql.jdbc.Driver"
    jdbc_connection_string => "jdbc:mysql://*****"
    jdbc_user => "****"
    jdbc_password => "*****"
    schedule => "0 * * * *"
    statement => "SELECT * FROM carl.customer"
  }
}
filter {
  mutate { convert => { "long" => "float" } }
}
output {
  #stdout { codec => json_lines }
  elasticsearch {
    hosts => "localhost"
    index => "customers"
  }
}
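A possible approach, in the same spirit as the other configurations in this thread: give each document a stable _id derived from the row's primary key, so that resending a row overwrites the existing document instead of creating a duplicate. The column name id is hypothetical, since the question does not show the schema of carl.customer; the table's actual primary key would go in its place.

```
output {
  elasticsearch {
    hosts => "localhost"
    index => "customers"
    # "id" is a hypothetical primary-key column of carl.customer
    document_id => "%{id}"
  }
}
```

With a fixed _id, a row that is sent again replaces the document it produced earlier, so the index should no longer need to be emptied before each run.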
I have Logstash pulling records from PostgreSQL and creating documents in Elasticsearch, but whenever I update a record in Postgres, the change is not reflected in Elasticsearch. Here are my input and output configs; let me know if I am missing anything:
input {
  jdbc {
    # Postgres jdbc connection string to our database, mydb
    jdbc_connection_string => "jdbc:postgresql://127.0.0.1:5009/data"
    # The user we wish to execute our statement as
    jdbc_user => "data"
    jdbc_password => "data"
    # The path to our downloaded jdbc driver
    jdbc_driver_library => "/postgresql-9.4.1209.jar"
    # The name of the driver class for Postgresql
    jdbc_driver_class => "org.postgresql.Driver"
    #sql_log_level => "debug"
    jdbc_paging_enabled => "true"
    jdbc_page_size => "5000"
    schedule => "* * * * *"
    # our query
    clean_run => true
    last_run_metadata_path => "/logstash/.test_metadata"
    #use_column_value => true
    #tracking_column => id
    statement => "SELECT id,name,update_date from data where update_date > :sql_last_value"
  }
}
output {
  elasticsearch {
    hosts => ["127.0.0.1"]
    index => "test_data"
    action => "index"
    document_type => "data"
    document_id => "%{id}"
    upsert => ' {
      "name" : "%{data.name}",
      "update_date" : "%{data.update_date}"
    } '
  }
}
I think you need to track a date/timestamp column instead of the id column. If you have an UPDATE_DATE column that changes on each update, that would be good.
Your SELECT statement will only grab new records (i.e. id > last_id) and if you update a record, its id won't change, hence that updated record won't be picked up by the jdbc input the next time it runs.
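A sketch of what tracking the timestamp column could look like for the table in the question. It assumes update_date is set on every insert and update; clean_run => false and the ORDER BY clause are additions that are not in the original configuration, and document_type, upsert and the paging options are left out for brevity.

```
input {
  jdbc {
    jdbc_connection_string => "jdbc:postgresql://127.0.0.1:5009/data"
    jdbc_user => "data"
    jdbc_password => "data"
    jdbc_driver_library => "/postgresql-9.4.1209.jar"
    jdbc_driver_class => "org.postgresql.Driver"
    schedule => "* * * * *"
    # Keep the last seen update_date between runs instead of resetting it
    clean_run => false
    last_run_metadata_path => "/logstash/.test_metadata"
    # Track the timestamp column rather than the id
    use_column_value => true
    tracking_column => "update_date"
    tracking_column_type => "timestamp"
    statement => "SELECT id, name, update_date FROM data WHERE update_date > :sql_last_value ORDER BY update_date"
  }
}
output {
  elasticsearch {
    hosts => ["127.0.0.1"]
    index => "test_data"
    action => "index"
    # Same id as the database row, so an updated row overwrites its document
    document_id => "%{id}"
  }
}
```

Because the document _id stays tied to the row id, a row whose update_date changes is fetched again and replaces the document that was indexed for it earlier.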
I am trying to create a data pipeline in which the Logstash jdbc input plugin fetches data with a SQL query every 5 minutes and the Elasticsearch output plugin writes that data to an Elasticsearch server. I want the output plugin to partially update existing documents on the Elasticsearch server. My Logstash configuration file looks like this:
input {
  jdbc {
    jdbc_driver_library => "/Users/hello/logstash-2.3.2/lib/mysql-connector-java-5.1.34.jar"
    jdbc_driver_class => "com.mysql.jdbc.Driver"
    jdbc_connection_string => "jdbc:mysql://localhost:13306/mysqlDB"
    jdbc_user => "root"
    jdbc_password => "1234"
    last_run_metadata_path => "/Users/hello/.logstash_last_run_display"
    statement => "SELECT * FROM checkout WHERE checkout_no between :sql_last_value + 1 and :sql_last_value + 5 ORDER BY checkout_no ASC"
    schedule => "*/5 * * * *"
    use_column_value => true
    tracking_column => "checkout_no"
  }
}
output {
  stdout { codec => json_lines }
  elasticsearch {
    action => "update"
    index => "ecs"
    document_type => "checkout"
    document_id => "%{checkout_no}"
    hosts => ["localhost:9200"]
  }
}
The problem is that the Elasticsearch output plugin does not appear to call the partial update API, such as /{index}/{type}/{id}/_update. The manual just lists actions such as index, delete, create, and update, but it doesn't say which REST API URL each action calls, i.e. whether the update action calls /{index}/{type}/{id}/_update or the /{index}/{type}/{id} API (upsert). I would like to call the partial update API from the Elasticsearch output plugin. Is that possible?
Setting both doc_as_upsert => true and action => "update" works in my production script.
output {
  elasticsearch {
    hosts => ["es_host"]
    document_id => "%{id}" # !!! the id here MUST be the same
    index => "logstash-my-index"
    timeout => 30
    workers => 1
    doc_as_upsert => true
    action => "update"
  }
}
It is possible. The Elasticsearch output plugin has a series of upsert options that correspond to the ones in Elasticsearch update API:
upsert itself: https://www.elastic.co/guide/en/logstash/current/plugins-outputs-elasticsearch.html#plugins-outputs-elasticsearch-upsert
scripted_upsert: https://www.elastic.co/guide/en/logstash/current/plugins-outputs-elasticsearch.html#plugins-outputs-elasticsearch-scripted_upsert
doc_as_upsert: https://www.elastic.co/guide/en/logstash/current/plugins-outputs-elasticsearch.html#plugins-outputs-elasticsearch-doc_as_upsert
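As an illustration of the first option, here is a sketch that reuses the index and id field from the question; the content of the upsert document (the created_by field) is made up for the example. With action => "update", the event is sent as a partial document that is merged into the existing one, and upsert supplies the JSON to index when no document with that id exists yet.

```
output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "ecs"
    document_id => "%{checkout_no}"
    # Partial update: event fields are merged into the existing document
    action => "update"
    # JSON string indexed as the new document when the id does not exist yet
    # (the created_by field is a made-up example)
    upsert => '{"created_by": "logstash"}'
  }
}
```

Setting doc_as_upsert => true instead of upsert would use the event itself as the initial document, which is usually the simpler choice when the first version of a document should just be the row as fetched.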