I am trying to keep Elasticsearch indexes in sync with the data stored in a SQL database, so that every row added to the DB is automatically added to Elasticsearch.
I tried to use the primary key of the table as the _id field in Elasticsearch so that, every time the schedule launches Logstash (once a minute), documents that are already in Elasticsearch don't get re-added.
This is my Logstash .conf file:
input {
jdbc {
jdbc_connection_string => "JDBC-Connection-String"
jdbc_driver_class => "com.microsoft.sqlserver.jdbc.SQLServerDriver"
jdbc_user => "JDBC-Connection-User"
jdbc_driver_library => "JDBC-Driver-Path"
statement => "SELECT MyCol1 MyCol2 FROM MyTable"
use_column_value => true
tracking_column => "MyCol1"
tracking_column_type => "numeric"
clean_run => true
schedule => "*/1 * * * *"
}
}
output {
elasticsearch {
hosts => "http://localhost:9200"
index => "MyIndex"
document_id => "%{MyCol1}"
}
stdout { }
}
After Logstash finishes, I find only one document in Elasticsearch, with "_id": "%{MyCol1}". Why can't Logstash resolve the id value properly?
P.S. MyCol1 is the primary key of MyTable.
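A few things to keep in mind:
The field referenced in document_id must be one of the columns returned by the query.
Field names are case sensitive, so use the exact name as it appears in the event. The jdbc input lowercases column names by default (lowercase_column_names => true), so MyCol1 arrives as mycol1.
Set clean_run => false so that the last tracked value is kept between runs.
Use :sql_last_value in the WHERE clause of the statement, compared against the tracking column, so that only new records are selected.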
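As a rough sketch, the points above could be applied to the configuration from the question as follows. It assumes the jdbc input's default lowercase_column_names => true, which is why the column is referenced as mycol1. The comma in the SELECT list, the WHERE clause and the lowercase index name (Elasticsearch index names must be lowercase) are additions made for illustration, and the connection placeholders are copied from the question unchanged.

```
input {
  jdbc {
    jdbc_connection_string => "JDBC-Connection-String"
    jdbc_driver_class => "com.microsoft.sqlserver.jdbc.SQLServerDriver"
    jdbc_user => "JDBC-Connection-User"
    jdbc_driver_library => "JDBC-Driver-Path"
    # Return the id column explicitly and fetch only rows newer than the last tracked value
    statement => "SELECT MyCol1, MyCol2 FROM MyTable WHERE MyCol1 > :sql_last_value ORDER BY MyCol1"
    use_column_value => true
    # Lowercase on the assumption that lowercase_column_names is left at its default (true)
    tracking_column => "mycol1"
    tracking_column_type => "numeric"
    # Keep the stored :sql_last_value between runs
    clean_run => false
    schedule => "*/1 * * * *"
  }
}
output {
  elasticsearch {
    hosts => "http://localhost:9200"
    index => "myindex"
    # Must match the event field name exactly, including case
    document_id => "%{mycol1}"
  }
  # Prints each event so the actual field names can be checked
  stdout { codec => rubydebug }
}
```

The rubydebug output is a quick way to confirm which field names the events really carry before relying on them in document_id.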
I have a strange problem. I have 5 Logstash jobs that load data into Elasticsearch from DB2 using JDBC. The document ID is a number stored as a string: IDs starting with 0 and 1 go to the first Logstash job, IDs starting with 2 and 3 to the second, IDs starting with 4 and 5 to the third, and so on. Each job is supposed to bring in around 30M records, and each record is ~2k in size. However, each job loads only 24M records, without any errors. After that, the job still shows no errors, but _stats on Elasticsearch shows an increasing deleted count. Looking through the Logstash jobs, I see that a specific document ID does not exist in Elasticsearch, yet it is not getting loaded and is counted towards deletes.
When I run the same record through a totally new Logstash instance, the record gets inserted into Elasticsearch.
I suspect there is a 50GB limit somewhere that is blocking these inserts from a single Logstash instance. I am very confused. Can someone help?
The Elasticsearch instance is running in a cloud environment and I don't have a dedicated cluster. I just have an index with around 12 shards of 30GB each allocated.
input {
jdbc {
#input configurations
jdbc_driver_library => "db2jcc4-10.5.jar"
jdbc_driver_class => "com.ibm.db2.jcc.DB2Driver"
jdbc_connection_string => "jdbc:db2://..."
columns_charset => {}
jdbc_fetch_size => 2000
jdbc_page_size => 5000000
jdbc_paging_enabled => false
lowercase_column_names => true
jdbc_user => "*****"
jdbc_password => "*****"
statement_filepath => "part_a.sql"
use_column_value => true
schedule => "*/5 * * * *"
sql_log_level => warn
tracking_column => "lst_updt_ts"
tracking_column_type => "timestamp"
record_last_run => true
last_run_metadata_path => "last_run_a.conf"
}
}
filter {
mutate {
convert => {
...
}
rename => {
...
}
}
}
output {
elasticsearch {
#output configurations
hosts => "https://...:8443"
index => "demographics"
document_type => "demographics_doc"
document_id => "%{record_id}"
manage_template => false
action => "update"
doc_as_upsert => true
failure_type_logging_whitelist => []
user => "..."
password => "..."
}
stdout {
codec => rubydebug
}
}
JDBC Plugin Polymorphic Index
Hi, we have a table that's polymorphic to items and we'd like to find a way to update different indexes in one logstash config.
Table Structure
Below is an example table. The item_type column denotes the type (such as Pen, Post, Collection), the item_id is a foreign key to the item in our DB, and the score is calculated on a cron and updated every once in a while, which updates our updated_at column.
popularity_scores
Process
Using logstash jdbc plugins, we'd like to query the data, then push it to ES. However, I don't see a way (other than a logstash config and sql query for each item type) to dynamically push updates to indexes. In a perfect world, we'd like to take input from the table above (see input code below)
input
input {
jdbc {
jdbc_driver_library => "/usr/share/logstash/bin/mysql-connector-java-8.0.15.jar"
jdbc_driver_class => "com.mysql.cj.jdbc.Driver"
# useCursorFetch needed cause jdbc_fetch_size not working??
# https://discuss.elastic.co/t/logstash-jdbc-plugin/84874/2
# https://stackoverflow.com/a/10772407
jdbc_connection_string => "jdbc:mysql://${CP_LS_SQL_HOST}:${CP_LS_SQL_PORT}/${CP_LS_SQL_DB}?useCursorFetch=true&autoReconnect=true&failOverReadOnly=false&maxReconnects=10"
statement => "select * from view_elastic_popularity_all where updated_at > :sql_last_value"
jdbc_user => "${CP_LS_SQL_USER}"
jdbc_password => "${CP_LS_SQL_PASSWORD}"
jdbc_fetch_size => "${CP_LS_FETCH_SIZE}"
last_run_metadata_path => "/usr/share/logstash/cp/last_run_files/last_run_popularity_live"
jdbc_page_size => '10000'
use_column_value => true
tracking_column => 'updated_at'
tracking_column_type => 'timestamp'
schedule => "* * * * *"
}
}
Then run update queries to ES via an output plugin (see output code below)
output
output {
elasticsearch {
index => "HOW_DO_WE_DYNAMICALLY_SET_INDEX_BASED_ON_ITEM_TYPE?"
document_id => "%{id}"
hosts => ["${CP_LS_ES_HOST}:${CP_LS_ES_PORT}"]
user => "${CP_LS_ES_USER}"
password => "${CP_LS_ES_PASSWORD}"
}
}
Help?
We can't be the first company with this problem. How would we structure the output?
You can set the index name dynamically by referencing a field of the event, in the same way that you set the document_id dynamically.
output {
elasticsearch {
index => "%{item_type}"
document_id => "%{id}"
hosts => ["${CP_LS_ES_HOST}:${CP_LS_ES_PORT}"]
user => "${CP_LS_ES_USER}"
password => "${CP_LS_ES_PASSWORD}"
}
}
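One caveat is worth sketching here: Elasticsearch index names must be lowercase, while the item_type values mentioned in the question (Pen, Post, Collection) are capitalized. A possible approach, assuming each event carries an item_type field, is to copy it into a @metadata field, lowercase it there, and reference that field in index. The field name target_index is made up for this example.

```
filter {
  # Copy the type into @metadata so the helper field is not indexed with the document
  mutate {
    copy => { "item_type" => "[@metadata][target_index]" }
  }
  # Separate block: within a single mutate, copy runs after lowercase
  mutate {
    lowercase => [ "[@metadata][target_index]" ]
  }
}
output {
  elasticsearch {
    # Resolves to e.g. "pen", "post" or "collection"
    index => "%{[@metadata][target_index]}"
    document_id => "%{id}"
    hosts => ["${CP_LS_ES_HOST}:${CP_LS_ES_PORT}"]
    user => "${CP_LS_ES_USER}"
    password => "${CP_LS_ES_PASSWORD}"
  }
}
```

Keeping the lowercased value in @metadata leaves the original item_type field untouched in the stored document.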
I am trying to index data from a MySQL DB into Elasticsearch using Logstash. Logstash runs without errors, but the problem is that it indexes only one row from my SELECT query.
Below are the versions of the software I am using:
elastic search : 2.4.1
logstash: 5.1.1
mysql: 5.7.17
jdbc_driver_library: mysql-connector-java-5.1.40-bin.jar
I am not sure if this is because the Logstash and Elasticsearch versions are different.
Below is my pipeline configuration:
input {
jdbc {
jdbc_driver_library => "mysql-connector-java-5.1.40-bin.jar"
jdbc_driver_class => "com.mysql.jdbc.Driver"
jdbc_connection_string => "jdbc:mysql://localhost:3306/mydb"
jdbc_user => "user"
jdbc_password => "password"
schedule => "* * * * *"
statement => "SELECT * FROM employee"
use_column_value => true
tracking_column => "id"
}
}
output {
elasticsearch {
index => "logstash"
document_type => "sometype"
document_id => "%{uid}"
hosts => ["localhost:9200"]
}
}
It looks like the tracking_column (id) you're using in the jdbc input and the field referenced in document_id (uid) are different. If the rows returned by the query have no uid column, the %{uid} reference cannot be resolved and stays as literal text, so every row is written to the same document. Try using the same column for both: rows are then tracked by id and pushed into ES under the same id, which is also easier to follow:
document_id => "%{id}" <-- make sure you've got the exact spelling
Also try adding the following line to your jdbc input after tracking_column:
tracking_column_type => "numeric"
Additionally, to make sure that no existing .logstash_jdbc_last_run file is picked up when you run Logstash, include the following line as well:
clean_run => true
So this is how your jdbc input should look:
jdbc {
jdbc_driver_library => "mysql-connector-java-5.1.40-bin.jar"
jdbc_driver_class => "com.mysql.jdbc.Driver"
jdbc_connection_string => "jdbc:mysql://localhost:3306/mydb"
jdbc_user => "user"
jdbc_password => "password"
schedule => "* * * * *"
statement => "SELECT * FROM employee"
use_column_value => true
tracking_column => "id"
tracking_column_type => "numeric"
clean_run => true
}
Other than that, the conf seems to be fine. If you only want to pick up records newly added to your database table, add a condition on :sql_last_value to the statement. Hope it helps!
I have Logstash pulling records from PostgreSQL and creating documents in Elasticsearch, but whenever I update a record in Postgres, the change is not reflected in Elasticsearch. Here are my input and output configs; let me know if I am missing anything:
input {
jdbc {
# Postgres jdbc connection string to our database, mydb
jdbc_connection_string => "jdbc:postgresql://127.0.0.1:5009/data"
# The user we wish to execute our statement as
jdbc_user => "data"
jdbc_password=>"data"
# The path to our downloaded jdbc driver
jdbc_driver_library => "/postgresql-9.4.1209.jar"
# The name of the driver class for Postgresql
jdbc_driver_class => "org.postgresql.Driver"
#sql_log_level => "debug"
jdbc_paging_enabled => "true"
jdbc_page_size => "5000"
schedule => "* * * * *"
# our query
clean_run => true
last_run_metadata_path => "/logstash/.test_metadata"
#use_column_value => true
#tracking_column => id
statement => "SELECT id,name,update_date from data where update_date > :sql_last_value"
}
}
output {
elasticsearch{
hosts => ["127.0.0.1"]
index => "test_data"
action => "index"
document_type => "data"
document_id => "%{id}"
upsert => ' {
"name" : "%{data.name}",
"update_date" : "%{data.update_date}"
} '
}
}
I think you need to track a date/timestamp column instead of the id column. If you have an UPDATE_DATE column that changes on each update, that would be good.
Your SELECT statement will only grab new records (i.e. id > last_id) and if you update a record, its id won't change, hence that updated record won't be picked up by the jdbc input the next time it runs.
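A minimal sketch of the timestamp-tracking idea, reusing the connection settings from the question. It assumes update_date is a timestamp column that changes on every update; the ORDER BY clause and the removal of clean_run are choices made for this example, not something stated in the question.

```
input {
  jdbc {
    jdbc_connection_string => "jdbc:postgresql://127.0.0.1:5009/data"
    jdbc_user => "data"
    jdbc_password => "data"
    jdbc_driver_library => "/postgresql-9.4.1209.jar"
    jdbc_driver_class => "org.postgresql.Driver"
    schedule => "* * * * *"
    # Track the column that changes on every update instead of the id
    use_column_value => true
    tracking_column => "update_date"
    tracking_column_type => "timestamp"
    # clean_run is left at its default (false) so the saved value is reused after a restart
    last_run_metadata_path => "/logstash/.test_metadata"
    # Ascending order so the last row processed carries the newest update_date
    statement => "SELECT id, name, update_date FROM data WHERE update_date > :sql_last_value ORDER BY update_date ASC"
  }
}
```

With document_id => "%{id}" kept in the output, an updated row is fetched again on the next run and written to the same document, replacing the older version.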
I am trying to create a data pipeline in which the Logstash jdbc plugin fetches data with a SQL query every 5 minutes and the Elasticsearch output plugin writes that data to the Elasticsearch server. I want the output plugin to partially update existing documents in Elasticsearch. My Logstash configuration file looks like this:
input {
jdbc {
jdbc_driver_library => "/Users/hello/logstash-2.3.2/lib/mysql-connector-java-5.1.34.jar"
jdbc_driver_class => "com.mysql.jdbc.Driver"
jdbc_connection_string => "jdbc:mysql://localhost:13306/mysqlDB"
jdbc_user => "root"
jdbc_password => "1234"
last_run_metadata_path => "/Users/hello/.logstash_last_run_display"
statement => "SELECT * FROM checkout WHERE checkout_no between :sql_last_value + 1 and :sql_last_value + 5 ORDER BY checkout_no ASC"
schedule => "*/5 * * * *"
use_column_value => true
tracking_column => "checkout_no"
}
}
output {
stdout { codec => json_lines }
elasticsearch {
action => "update"
index => "ecs"
document_type => "checkout"
document_id => "%{checkout_no}"
hosts => ["localhost:9200"]
}
}
The problem is that the Elasticsearch output plugin does not appear to call the partial update API, i.e. /{index}/{type}/{id}/_update. The manual just lists actions such as index, delete, create and update, but it doesn't say which REST API URL each action calls, e.g. whether the update action calls /{index}/{type}/{id}/_update or the /{index}/{type}/{id} API (upsert). I would like to call the partial update API from the Elasticsearch output plugin. Is that possible?
Setting both doc_as_upsert => true and action => "update" works in my production script.
output {
elasticsearch {
hosts => ["es_host"]
document_id => "%{id}" # !!! the id here MUST be the same
index => "logstash-my-index"
timeout => 30
workers => 1
doc_as_upsert => true
action => "update"
}
}
It is possible. The Elasticsearch output plugin has a series of upsert options that correspond to the ones in Elasticsearch update API:
upsert itself: https://www.elastic.co/guide/en/logstash/current/plugins-outputs-elasticsearch.html#plugins-outputs-elasticsearch-upsert
scripted_upsert: https://www.elastic.co/guide/en/logstash/current/plugins-outputs-elasticsearch.html#plugins-outputs-elasticsearch-scripted_upsert
doc_as_upsert: https://www.elastic.co/guide/en/logstash/current/plugins-outputs-elasticsearch.html#plugins-outputs-elasticsearch-doc_as_upsert
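As an illustration of the first of these options, here is a hedged sketch based on the output block from the question. With action => "update", the event's fields are sent as a partial document, and upsert supplies a JSON body that is used only when no document with that id exists yet. The status field and its value are hypothetical and only stand in for whatever defaults a new document should get.

```
output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "ecs"
    document_type => "checkout"
    document_id => "%{checkout_no}"
    # Partial update of the existing document with the event's fields
    action => "update"
    # JSON string used to create the document if the id is not found (hypothetical content)
    upsert => '{ "status": "new" }'
  }
}
```

If the new document should simply be the event itself, doc_as_upsert => true is the shorter route and no upsert body is needed.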