
JDBC Plugin Polymorphic Index
Hi, we have a table that is polymorphic over several item types, and we'd like to find a way to update the different indexes from one Logstash config.
Table Structure
Below is an example table. The item_type column denotes the type (such as Pen, Post, Collection), item_id is a foreign key to the item in our DB, and score is recalculated by a cron every once in a while, which updates our updated_at column.
popularity_scores
Process
Using the Logstash JDBC input plugin, we'd like to query the data and then push it to ES. However, I don't see a way (other than a separate Logstash config and SQL query for each item type) to dynamically push updates to different indexes. In a perfect world, we'd take input from the table above (see the input config below)
input
input {
  jdbc {
    jdbc_driver_library => "/usr/share/logstash/bin/mysql-connector-java-8.0.15.jar"
    jdbc_driver_class => "com.mysql.cj.jdbc.Driver"
    # useCursorFetch needed because jdbc_fetch_size not working??
    # https://discuss.elastic.co/t/logstash-jdbc-plugin/84874/2
    # https://stackoverflow.com/a/10772407
    jdbc_connection_string => "jdbc:mysql://${CP_LS_SQL_HOST}:${CP_LS_SQL_PORT}/${CP_LS_SQL_DB}?useCursorFetch=true&autoReconnect=true&failOverReadOnly=false&maxReconnects=10"
    statement => "select * from view_elastic_popularity_all where updated_at > :sql_last_value"
    jdbc_user => "${CP_LS_SQL_USER}"
    jdbc_password => "${CP_LS_SQL_PASSWORD}"
    jdbc_fetch_size => "${CP_LS_FETCH_SIZE}"
    last_run_metadata_path => "/usr/share/logstash/cp/last_run_files/last_run_popularity_live"
    jdbc_page_size => '10000'
    use_column_value => true
    tracking_column => 'updated_at'
    tracking_column_type => 'timestamp'
    schedule => "* * * * *"
  }
}
Then push those updates to ES via an output plugin (see the output config below)
output
output {
  elasticsearch {
    index => "HOW_DO_WE_DYNAMICALLY_SET_INDEX_BASED_ON_ITEM_TYPE?"
    document_id => "%{id}"
    hosts => ["${CP_LS_ES_HOST}:${CP_LS_ES_PORT}"]
    user => "${CP_LS_ES_USER}"
    password => "${CP_LS_ES_PASSWORD}"
  }
}
Help?
We can't be the first company with this problem. How would we structure the output?

You can dynamically set the name of the index by using a field from the event, in the same way that you dynamically set the document_id.
output {
  elasticsearch {
    index => "%{item_type}"
    document_id => "%{id}"
    hosts => ["${CP_LS_ES_HOST}:${CP_LS_ES_PORT}"]
    user => "${CP_LS_ES_USER}"
    password => "${CP_LS_ES_PASSWORD}"
  }
}
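One caveat with index => "%{item_type}": Elasticsearch index names must be lowercase, so if the item_type values are capitalized (Pen, Post, Collection) you would want to normalize the field first. A minimal sketch using a mutate filter (the filter block is an addition, not part of the original config):
filter {
  mutate {
    # Elasticsearch index names must be lowercase; normalize item_type
    # before the output plugin interpolates it into the index name.
    lowercase => [ "item_type" ]
  }
}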

Related

ElasticSearch 7 not taking _id from JDBC on Logstash

I am trying to update Elasticsearch indexes with data stored in a SQL database, so that every row added to the DB is automatically added to Elasticsearch.
I tried to set the primary key of the DB as the _id field in Elasticsearch, so that every time the schedule launches Logstash (once a minute), the documents that are already in Elasticsearch don't get re-added.
This is my Logstash .conf file:
input {
  jdbc {
    jdbc_connection_string => "JDBC-Connection-String"
    jdbc_driver_class => "com.microsoft.sqlserver.jdbc.SQLServerDriver"
    jdbc_user => "JDBC-Connection-User"
    jdbc_driver_library => "JDBC-Driver-Path"
    statement => "SELECT MyCol1 MyCol2 FROM MyTable"
    use_column_value => true
    tracking_column => "MyCol1"
    tracking_column_type => "numeric"
    clean_run => true
    schedule => "*/1 * * * *"
  }
}
output {
  elasticsearch {
    hosts => "http://localhost:9200"
    index => "MyIndex"
    document_id => "%{MyCol1}"
  }
  stdout { }
}
After Logstash finishes I find only 1 document with "_id": "%{MyCol1}" in Elasticsearch. Why can't Logstash pick up the id value properly?
P.S. MyCol1 is the primary key of MyTable.
A few things to keep in mind:
The value used in document_id must be part of the query.
The field name is case sensitive, so use the exact name.
Set clean_run => false.
Use :sql_last_value in the statement against the column you want tracked, so that only new records are picked up.
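Putting those points together, a corrected pipeline for the config above might look like the sketch below. This is an illustration under assumptions, not a drop-in fix: note the comma added in the SELECT list, and that the jdbc input lowercases column names by default, so the event field becomes mycol1.
input {
  jdbc {
    jdbc_connection_string => "JDBC-Connection-String"
    jdbc_driver_class => "com.microsoft.sqlserver.jdbc.SQLServerDriver"
    jdbc_user => "JDBC-Connection-User"
    jdbc_driver_library => "JDBC-Driver-Path"
    # comma between the columns, and only fetch rows newer than the last tracked value
    statement => "SELECT MyCol1, MyCol2 FROM MyTable WHERE MyCol1 > :sql_last_value"
    use_column_value => true
    tracking_column => "mycol1"          # lowercase: column names are lowercased by default
    tracking_column_type => "numeric"
    clean_run => false                   # keep the saved :sql_last_value between runs
    schedule => "*/1 * * * *"
  }
}
output {
  elasticsearch {
    hosts => "http://localhost:9200"
    index => "myindex"                   # index names must be lowercase
    document_id => "%{mycol1}"           # must reference a field that actually exists in the event
  }
  stdout { }
}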

logstash input plugin for postgresql issue - duplication (ignoring last run state)

I am using the jdbc plugin to fetch data from a PostgreSQL DB. It seems to work fine for a full export and I am able to pull the data, but it is not respecting the saved state: on every run all of the data is queried again and there are a lot of duplicates.
I checked the .logstash_jdbc_last_run file and the metadata state is updated as expected, yet the plugin still imports the entire table on every run. Is anything wrong in the config?
input {
  jdbc {
    jdbc_connection_string => "jdbc:postgresql://x.x.x.x:5432/dodb"
    jdbc_user => "myuser"
    jdbc_password => "passsword"
    jdbc_validate_connection => true
    jdbc_driver_library => "/opt/postgresql-9.4.1207.jar"
    jdbc_driver_class => "org.postgresql.Driver"
    statement => "select id,timestamp,distributed_query_id,distributed_query_task_id, "columns"->>'uid' as uid, "columns"->>'name' as name from distributed_query_result;"
    schedule => "* * * * *"
    use_column_value => true
    tracking_column => "id"
    tracking_column_type => "numeric"
    clean_run => true
  }
}
output {
  kafka {
    topic_id => "psql-logs"
    bootstrap_servers => "x.x.x.x:9092"
    codec => "json"
  }
}
Any help is appreciated! Thanks in advance. I used the doc below for reference.
https://www.elastic.co/guide/en/logstash/current/plugins-inputs-jdbc.html
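As the previous answer hints, the likely culprits here are clean_run => true, which resets the saved :sql_last_value on every start, and a statement with no :sql_last_value filter, so each scheduled run re-reads the whole table. A sketch of the incremental variant follows; the WHERE clause and the trimmed column list are assumptions, and the JSON "columns" extraction from the original statement is omitted for brevity.
input {
  jdbc {
    jdbc_connection_string => "jdbc:postgresql://x.x.x.x:5432/dodb"
    jdbc_user => "myuser"
    jdbc_password => "passsword"
    jdbc_validate_connection => true
    jdbc_driver_library => "/opt/postgresql-9.4.1207.jar"
    jdbc_driver_class => "org.postgresql.Driver"
    # only fetch rows with an id greater than the last one seen (hypothetical incremental query)
    statement => "select id, timestamp, distributed_query_id, distributed_query_task_id from distributed_query_result where id > :sql_last_value"
    schedule => "* * * * *"
    use_column_value => true
    tracking_column => "id"
    tracking_column_type => "numeric"
    clean_run => false   # preserve .logstash_jdbc_last_run between runs
  }
}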

Logstash is indexing only one row of select query from mysql to elastic search

I am trying to index data from a MySQL DB into Elasticsearch using Logstash. Logstash runs without errors, but the problem is that it indexes only one row from my SELECT query.
Below are the versions of the software I am using:
elastic search : 2.4.1
logstash: 5.1.1
mysql: 5.7.17
jdbc_driver_library: mysql-connector-java-5.1.40-bin.jar
I am not sure if this is because the Logstash and Elasticsearch versions are different.
Below is my pipeline configuration:
input {
  jdbc {
    jdbc_driver_library => "mysql-connector-java-5.1.40-bin.jar"
    jdbc_driver_class => "com.mysql.jdbc.Driver"
    jdbc_connection_string => "jdbc:mysql://localhost:3306/mydb"
    jdbc_user => "user"
    jdbc_password => "password"
    schedule => "* * * * *"
    statement => "SELECT * FROM employee"
    use_column_value => true
    tracking_column => "id"
  }
}
output {
  elasticsearch {
    index => "logstash"
    document_type => "sometype"
    document_id => "%{uid}"
    hosts => ["localhost:9200"]
  }
}
It seems like the tracking_column (id) you're using in the jdbc plugin and the document_id (uid) in the output are different. What if you make them the same? It would be easier to fetch all the records by id and push them into ES using that same id, which is also more understandable:
document_id => "%{id}" <-- make sure you've got the exact spelling
Also, please try adding the following line to your jdbc input after tracking_column:
tracking_column_type => "numeric"
Additionally, to make sure the .logstash_jdbc_last_run file doesn't carry over state when you run Logstash, include the following line as well:
clean_run => true
So this is how your jdbc input should look:
jdbc {
  jdbc_driver_library => "mysql-connector-java-5.1.40-bin.jar"
  jdbc_driver_class => "com.mysql.jdbc.Driver"
  jdbc_connection_string => "jdbc:mysql://localhost:3306/mydb"
  jdbc_user => "user"
  jdbc_password => "password"
  schedule => "* * * * *"
  statement => "SELECT * FROM employee"
  use_column_value => true
  tracking_column => "id"
  tracking_column_type => "numeric"
  clean_run => true
}
Other than that the conf seems fine, unless you want to use :sql_last_value so that only newly added records in your database table are picked up. Hope it helps!
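For reference, the :sql_last_value variant of that input might look like the sketch below (assuming id is an auto-incrementing key; clean_run goes back to false here so the saved value survives between runs):
input {
  jdbc {
    jdbc_driver_library => "mysql-connector-java-5.1.40-bin.jar"
    jdbc_driver_class => "com.mysql.jdbc.Driver"
    jdbc_connection_string => "jdbc:mysql://localhost:3306/mydb"
    jdbc_user => "user"
    jdbc_password => "password"
    schedule => "* * * * *"
    # only fetch rows added since the last run
    statement => "SELECT * FROM employee WHERE id > :sql_last_value"
    use_column_value => true
    tracking_column => "id"
    tracking_column_type => "numeric"
    clean_run => false   # keep the saved :sql_last_value between runs
  }
}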

Update document issue

I have a Logstash pipeline pulling records from PostgreSQL and creating documents in Elasticsearch, but whenever I update a record in Postgres the change is not reflected in Elasticsearch. Here are my input and output configs; let me know if I am missing anything.
input {
  jdbc {
    # Postgres jdbc connection string to our database, mydb
    jdbc_connection_string => "jdbc:postgresql://127.0.0.1:5009/data"
    # The user we wish to execute our statement as
    jdbc_user => "data"
    jdbc_password => "data"
    # The path to our downloaded jdbc driver
    jdbc_driver_library => "/postgresql-9.4.1209.jar"
    # The name of the driver class for Postgresql
    jdbc_driver_class => "org.postgresql.Driver"
    #sql_log_level => "debug"
    jdbc_paging_enabled => "true"
    jdbc_page_size => "5000"
    schedule => "* * * * *"
    # our query
    clean_run => true
    last_run_metadata_path => "/logstash/.test_metadata"
    #use_column_value => true
    #tracking_column => id
    statement => "SELECT id,name,update_date from data where update_date > :sql_last_value"
  }
}
output {
  elasticsearch {
    hosts => ["127.0.0.1"]
    index => "test_data"
    action => "index"
    document_type => "data"
    document_id => "%{id}"
    upsert => ' {
      "name" : "%{data.name}",
      "update_date" : "%{data.update_date}"
    } '
  }
}
I think you need to track a date/timestamp column instead of the id column. If you have an UPDATE_DATE column that changes on each update, that would be good.
Your SELECT statement will only grab new records (i.e. id > last_id) and if you update a record, its id won't change, hence that updated record won't be picked up by the jdbc input the next time it runs.
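A sketch of what that could look like against the config above, tracking update_date and upserting on the Elasticsearch side. Assumptions: update_date changes on every UPDATE, and doc_as_upsert stands in for the hand-written upsert body.
input {
  jdbc {
    jdbc_connection_string => "jdbc:postgresql://127.0.0.1:5009/data"
    jdbc_user => "data"
    jdbc_password => "data"
    jdbc_driver_library => "/postgresql-9.4.1209.jar"
    jdbc_driver_class => "org.postgresql.Driver"
    jdbc_paging_enabled => "true"
    jdbc_page_size => "5000"
    schedule => "* * * * *"
    clean_run => false                      # keep the saved state between runs
    last_run_metadata_path => "/logstash/.test_metadata"
    use_column_value => true
    tracking_column => "update_date"        # track the timestamp, not the id
    tracking_column_type => "timestamp"
    statement => "SELECT id, name, update_date FROM data WHERE update_date > :sql_last_value"
  }
}
output {
  elasticsearch {
    hosts => ["127.0.0.1"]
    index => "test_data"
    document_id => "%{id}"
    action => "update"
    doc_as_upsert => true                   # index the doc if it does not exist yet, otherwise partial-update it
  }
}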

Logstash configuration: how to call the partial-update API from ElasticSearch output plugin?

I am trying to create a data pipeline where the Logstash jdbc input plugin fetches some data with a SQL query every 5 minutes and the Elasticsearch output plugin puts the data from the input plugin into the Elasticsearch server. I want this output plugin to partially update existing documents in the Elasticsearch server. My Logstash configuration file looks like:
input {
  jdbc {
    jdbc_driver_library => "/Users/hello/logstash-2.3.2/lib/mysql-connector-java-5.1.34.jar"
    jdbc_driver_class => "com.mysql.jdbc.Driver"
    jdbc_connection_string => "jdbc:mysql://localhost:13306/mysqlDB"
    jdbc_user => "root"
    jdbc_password => "1234"
    last_run_metadata_path => "/Users/hello/.logstash_last_run_display"
    statement => "SELECT * FROM checkout WHERE checkout_no between :sql_last_value + 1 and :sql_last_value + 5 ORDER BY checkout_no ASC"
    schedule => "*/5 * * * *"
    use_column_value => true
    tracking_column => "checkout_no"
  }
}
output {
  stdout { codec => json_lines }
  elasticsearch {
    action => "update"
    index => "ecs"
    document_type => "checkout"
    document_id => "%{checkout_no}"
    hosts => ["localhost:9200"]
  }
}
The problem is that the Elasticsearch output plugin appears not to call the partial-update API such as /{index}/{type}/{id}/_update. The manual just lists actions such as index, delete, create, and update, but it doesn't mention which REST API URL each action calls, i.e. whether the update action calls the /{index}/{type}/{id}/_update API or the /{index}/{type}/{id} API (upsert). I would like to call the partial-update API from the Elasticsearch output plugin. Is it possible?
Setting both doc_as_upsert => true and action => "update" works in my production script.
output {
  elasticsearch {
    hosts => ["es_host"]
    document_id => "%{id}" # !!! the id here MUST be the same
    index => "logstash-my-index"
    timeout => 30
    workers => 1
    doc_as_upsert => true
    action => "update"
  }
}
It is possible. The Elasticsearch output plugin has a series of upsert options that correspond to the ones in the Elasticsearch Update API (a short sketch follows the list):
upsert itself: https://www.elastic.co/guide/en/logstash/current/plugins-outputs-elasticsearch.html#plugins-outputs-elasticsearch-upsert
scripted_upsert: https://www.elastic.co/guide/en/logstash/current/plugins-outputs-elasticsearch.html#plugins-outputs-elasticsearch-scripted_upsert
doc_as_upsert: https://www.elastic.co/guide/en/logstash/current/plugins-outputs-elasticsearch.html#plugins-outputs-elasticsearch-doc_as_upsert
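Applied to the checkout example above, the doc_as_upsert route would look roughly like this sketch; upsert or scripted_upsert could be used instead when the initial document should differ from the event itself.
output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "ecs"
    document_id => "%{checkout_no}"
    action => "update"        # issue partial updates instead of full index requests
    doc_as_upsert => true     # create the document from the event if it does not exist yet
  }
}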
