Only send new data to ES using Logstash, preventing duplication - elasticsearch

Using Logstash, I would like to know how to send data to ES without creating duplicates. In other words, I only want to send data that is not yet present in the ES instance, not data that is already there.
Today I delete all the data in the specific index in ES and then resend everything that is in the database. This prevents duplicates, but it is not ideal since I have to delete the data manually.
This is the .config I am currently using:
input {
  jdbc {
    jdbc_driver_library => "/Users/Carl/Progs/logstash-6.3.0/mysql-connector-java/mysql-connector-java-5.1.46-bin.jar"
    jdbc_driver_class => "com.mysql.jdbc.Driver"
    jdbc_connection_string => "jdbc:mysql://*****"
    jdbc_user => "****"
    jdbc_password => "*****"
    schedule => "0 * * * *"
    statement => "SELECT * FROM carl.customer"
  }
}
filter {
  mutate { convert => { "long" => "float" } }
}
output {
  #stdout { codec => json_lines }
  elasticsearch {
    hosts => "localhost"
    index => "customers"
  }
}
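For reference, the approach used in the related answers below is to give each document a deterministic document_id and to fetch only rows changed since the last run via :sql_last_value. A minimal sketch, not a drop-in config: it assumes carl.customer has a unique primary key column named id and an updated_at column that is bumped on every change, both of which are assumptions about the schema:
input {
  jdbc {
    jdbc_driver_library => "/Users/Carl/Progs/logstash-6.3.0/mysql-connector-java/mysql-connector-java-5.1.46-bin.jar"
    jdbc_driver_class => "com.mysql.jdbc.Driver"
    jdbc_connection_string => "jdbc:mysql://*****"
    jdbc_user => "****"
    jdbc_password => "*****"
    schedule => "0 * * * *"
    # Only fetch rows changed since the previous run (assumes an updated_at column exists)
    statement => "SELECT * FROM carl.customer WHERE updated_at > :sql_last_value"
    use_column_value => true
    tracking_column => "updated_at"
    tracking_column_type => "timestamp"
  }
}
filter {
  mutate { convert => { "long" => "float" } }
}
output {
  elasticsearch {
    hosts => "localhost"
    index => "customers"
    # Re-indexing the same row overwrites the existing document instead of duplicating it
    # (assumes id is the table's primary key)
    document_id => "%{id}"
  }
}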

Related

Trying to get the data from oracle database through logstash but data is not coming to elasticsearch

I am trying to get data from an Oracle database into Elasticsearch through Logstash, but the data is not arriving. I am not sure what I missed; I don't see any errors in the Logstash log file. Below is my Logstash conf file.
input {
  jdbc {
    jdbc_validate_connection => "true"
    jdbc_connection_string => "jdbc:oracle:thin:@//server:1521/db"
    jdbc_user => "user"
    jdbc_password => "pass"
    jdbc_driver_library => "/etc/logstash/files/ojdbc7.jar"
    jdbc_driver_class => "Java::oracle.jdbc.driver.OracleDriver"
    jdbc_paging_enabled => "true"
    schedule => "* * * * *"
    statement_filepath => "/etc/logstash/files/keycount.sql"
    use_column_value => "true"
    tracking_column => "timestamp"
    last_run_metadata_path => "/etc/logstash/files/.logstash_jdbc_last_run"
  }
}
output {
  elasticsearch {
    hosts => "localhost:9200"
    index => "keyinventory-%{+YYYY}"
  }
  stdout {
    codec => rubydebug
  }
}
Please, someone, help me.

Document count is same but index size is growing every logstash run

I'm sending the data contained in a MySQL database to Elasticsearch using Logstash, but each time Logstash runs, the number of documents stays the same while the index size increases.
First run: count: 333 | size in bytes: 206kb
Now: count: 333 | size in bytes: 1.6MB
input {
  jdbc {
    jdbc_connection_string => "jdbc:mysql://***rds.amazonaws.com:3306/"
    jdbc_user => "***"
    jdbc_password => "***"
    jdbc_driver_library => "***\mysql-connector-java-5.1.46/mysql-connector-java-5.1.46-bin.jar"
    jdbc_driver_class => "com.mysql.jdbc.Driver"
    statement => "SELECT id, title, url FROM tableName"
    schedule => "*/2 * * * *"
  }
}
filter {
  json {
    source => "texts"
    target => "texts"
  }
  mutate { remove_field => [ "@version", "@timestamp" ] }
}
output {
  stdout {
    codec => json_lines
  }
  amazon_es {
    hosts => ["***es.amazonaws.com"]
    document_id => "%{id}"
    index => "texts"
    region => "***"
    aws_access_key_id => '***'
    aws_secret_access_key => '***'
  }
}
Apparently you're always sending the same data over and over. In ES, each time you update a document (i.e. index it again with the same ID), the older version is marked as deleted and stays in the index for a while (until the underlying index segments get merged).
Between each run, you can issue the following command:
curl -XGET ***es.amazonaws.com/_cat/indices?v
In the response you get, check the docs.deleted column and you'll see that the number of deleted documents increases.
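If the growing index size itself is a problem, the space held by deleted documents can be reclaimed explicitly with a force merge (a sketch, assuming the index is named texts as in the config above); otherwise it is reclaimed automatically when segments merge:
curl -XPOST '***es.amazonaws.com/texts/_forcemerge?only_expunge_deletes=true'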

Logstash reading SQL Server data real time

Is there any way I can configure Logstash so that it picks up delta records in real time automatically? If not, is there any open-source plugin/tool available to achieve this? Thanks for the help.
Try the configuration below for the MSSQL server. You need to schedule it by adding a schedule period and a statement containing the query that fetches the data from your database.
input {
  jdbc {
    jdbc_connection_string => "jdbc:sqlserver://localhost:1433;databaseName=test"
    # The user we wish to execute our statement as
    jdbc_user => "sa"
    jdbc_password => "sasa"
    # The path to our downloaded jdbc driver
    jdbc_driver_library => "C:\Users\abhijitb\.m2\repository\com\microsoft\sqlserver\mssql-jdbc\6.2.2.jre8\mssql-jdbc-6.2.2.jre8.jar"
    jdbc_driver_class => "com.microsoft.sqlserver.jdbc.SQLServerDriver"
    #clean_run => true
    schedule => "* * * * *"
    # query
    statement => "SELECT * FROM Student WHERE studentid > :sql_last_value"
    use_column_value => true
    tracking_column => "studentid"
  }
}
output {
  #stdout { codec => json_lines }
  elasticsearch {
    hosts => "localhost:9200"
    index => "student"
    document_type => "data"
    document_id => "%{studentid}"
  }
}
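If the table has no ever-increasing ID but rows carry a last-modified timestamp, the same idea can track that column instead. A sketch; the modified_date column is an assumption about the schema, and the rest mirrors the config above:
input {
  jdbc {
    jdbc_connection_string => "jdbc:sqlserver://localhost:1433;databaseName=test"
    jdbc_user => "sa"
    jdbc_password => "sasa"
    jdbc_driver_library => "C:\Users\abhijitb\.m2\repository\com\microsoft\sqlserver\mssql-jdbc\6.2.2.jre8\mssql-jdbc-6.2.2.jre8.jar"
    jdbc_driver_class => "com.microsoft.sqlserver.jdbc.SQLServerDriver"
    schedule => "* * * * *"
    # Assumes Student has a modified_date column that is updated on every change
    statement => "SELECT * FROM Student WHERE modified_date > :sql_last_value"
    use_column_value => true
    tracking_column => "modified_date"
    tracking_column_type => "timestamp"
  }
}
# The output stays the same as above; document_id => "%{studentid}" keeps re-runs from duplicating documents.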

Issue connecting to elastic search from logstash in input

I want to import data from Oracle and would like to pass one of the fields of the imported data to Elasticsearch to fetch some other details.
For example: if I get an employee ID from the Oracle DB for, say, 100 rows, I want to pass all 100 employee IDs to Elasticsearch and get the employee name and salary.
I am able to retrieve the data from Oracle now, but I am unable to connect to Elasticsearch. I am also not sure what the better approach would be.
I am using Logstash 2.3.3 and the Elasticsearch Logstash filter plugin.
input {
  jdbc {
    jdbc_connection_string => "jdbc:oracle:thin:@<dbhost>:<port>:<sid>"
    # The user we wish to execute our statement as
    jdbc_user => "user"
    jdbc_password => "pass"
    # The path to our downloaded jdbc driver
    jdbc_driver_library => "<path>"
    # The name of the driver class for oracle
    jdbc_driver_class => "Java::oracle.jdbc.driver.OracleDriver"
    # our query
    statement => "SELECT empId, desg from Employee"
  }
  elasticsearch {
    hosts => "https://xx.corp.com:9200"
    index => "empdetails"
  }
}
output {
  stdout { codec => json_lines }
}
I am getting the error below from Elasticsearch.
A plugin had an unrecoverable error. Will restart this plugin.
Plugin: ["https://xx.corp.com:9200"], index=>"empdetails ", query=>"empId:'1001'", codec=>"UTF-8">, scan=>true, size=>1000, scroll=>"1m", docinfo=>false, docinfo_target=>"@metadata", docinfo_fields=>["_index", "_type", "_id"], ssl=>false>
Error: [401] {:level=>:error}
You need to use the elasticsearch filter, not the elasticsearch input:
input {
  jdbc {
    jdbc_connection_string => "jdbc:oracle:thin:@<dbhost>:<port>:<sid>"
    # The user we wish to execute our statement as
    jdbc_user => "user"
    jdbc_password => "pass"
    # The path to our downloaded jdbc driver
    jdbc_driver_library => "<path>"
    # The name of the driver class for oracle
    jdbc_driver_class => "Java::oracle.jdbc.driver.OracleDriver"
    # our query
    statement => "SELECT empId, desg from Employee"
  }
}
filter {
  elasticsearch {
    hosts => ["xx.corp.com:9200"]
    query => "empId:%{empId}"
    user => "admin"
    password => "admin"
    sort => "empName:desc"
    fields => {
      "empName" => "empName"
      "salary" => "salary"
    }
  }
}
output {
  stdout { codec => json_lines }
}
As a result, each record fetched via JDBC will be enriched by the corresponding data found in ES.
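If you also want to persist the enriched events back to Elasticsearch instead of only printing them, the output can be swapped for something along these lines (a sketch; the empdetails-enriched index name is made up, and it assumes empId is unique so re-runs overwrite rather than duplicate):
output {
  elasticsearch {
    hosts => "https://xx.corp.com:9200"
    user => "admin"
    password => "admin"
    # hypothetical target index
    index => "empdetails-enriched"
    # assumes empId is unique per employee
    document_id => "%{empId}"
  }
  stdout { codec => json_lines }
}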

multiple inputs on logstash jdbc

I am using the Logstash JDBC input to keep things synced between MySQL and Elasticsearch. It's working fine for one table, but now I want to do it for multiple tables. Do I need to run multiple instances in the terminal,
logstash agent -f /Users/logstash/logstash-jdbc.conf
each with a select query, or is there a better way of doing it so that multiple tables are kept updated?
My config file:
input {
  jdbc {
    jdbc_driver_library => "/Users/logstash/mysql-connector-java-5.1.39-bin.jar"
    jdbc_driver_class => "com.mysql.jdbc.Driver"
    jdbc_connection_string => "jdbc:mysql://localhost:3306/database_name"
    jdbc_user => "root"
    jdbc_password => "password"
    schedule => "* * * * *"
    statement => "select * from table1"
  }
}
output {
  elasticsearch {
    index => "testdb"
    document_type => "table1"
    document_id => "%{table_id}"
    hosts => "localhost:9200"
  }
}
You can definitely have a single config with multiple jdbc inputs and then parametrize the index and document_type in your elasticsearch output depending on which table the event is coming from.
input {
  jdbc {
    jdbc_driver_library => "/Users/logstash/mysql-connector-java-5.1.39-bin.jar"
    jdbc_driver_class => "com.mysql.jdbc.Driver"
    jdbc_connection_string => "jdbc:mysql://localhost:3306/database_name"
    jdbc_user => "root"
    jdbc_password => "password"
    schedule => "* * * * *"
    statement => "select * from table1"
    type => "table1"
  }
  jdbc {
    jdbc_driver_library => "/Users/logstash/mysql-connector-java-5.1.39-bin.jar"
    jdbc_driver_class => "com.mysql.jdbc.Driver"
    jdbc_connection_string => "jdbc:mysql://localhost:3306/database_name"
    jdbc_user => "root"
    jdbc_password => "password"
    schedule => "* * * * *"
    statement => "select * from table2"
    type => "table2"
  }
  # add more jdbc inputs to suit your needs
}
output {
  elasticsearch {
    index => "testdb"
    document_type => "%{type}" # <- use the type from each input
    hosts => "localhost:9200"
  }
}
This will not create duplicate data, and it is compatible with Logstash 6.x.
# YOUR_DATABASE_NAME : test
# FIRST_TABLE : place
# SECOND_TABLE : things
# SET_DATA_INDEX : test_index_1, test_index_2
input {
  jdbc {
    # The path to our downloaded jdbc driver
    jdbc_driver_library => "/mysql-connector-java-5.1.44-bin.jar"
    jdbc_driver_class => "com.mysql.jdbc.Driver"
    # MySQL jdbc connection string to our database, YOUR_DATABASE_NAME
    jdbc_connection_string => "jdbc:mysql://localhost:3306/test"
    # The user we wish to execute our statement as
    jdbc_user => "root"
    jdbc_password => ""
    schedule => "* * * * *"
    statement => "SELECT @slno:=@slno+1 aut_es_1, es_qry_tbl.* FROM (SELECT * FROM `place`) es_qry_tbl, (SELECT @slno:=0) es_tbl"
    type => "place"
    add_field => { "queryFunctionName" => "getAllDataFromFirstTable" }
    use_column_value => true
    tracking_column => "aut_es_1"
  }
  jdbc {
    # The path to our downloaded jdbc driver
    jdbc_driver_library => "/mysql-connector-java-5.1.44-bin.jar"
    jdbc_driver_class => "com.mysql.jdbc.Driver"
    # MySQL jdbc connection string to our database, YOUR_DATABASE_NAME
    jdbc_connection_string => "jdbc:mysql://localhost:3306/test"
    # The user we wish to execute our statement as
    jdbc_user => "root"
    jdbc_password => ""
    schedule => "* * * * *"
    statement => "SELECT @slno:=@slno+1 aut_es_2, es_qry_tbl.* FROM (SELECT * FROM `things`) es_qry_tbl, (SELECT @slno:=0) es_tbl"
    type => "things"
    add_field => { "queryFunctionName" => "getAllDataFromSecondTable" }
    use_column_value => true
    tracking_column => "aut_es_2"
  }
}
# install the uuid plugin: 'bin/logstash-plugin install logstash-filter-uuid'
# The uuid filter allows you to generate a UUID and add it as a field to each processed event.
filter {
  mutate {
    add_field => {
      "[@metadata][document_id]" => "%{aut_es_1}%{aut_es_2}"
    }
  }
  uuid {
    target => "uuid"
    overwrite => true
  }
}
output {
  stdout { codec => rubydebug }
  if [type] == "place" {
    elasticsearch {
      hosts => "localhost:9200"
      index => "test_index_1_12"
      #document_id => "%{aut_es_1}"
      document_id => "%{[@metadata][document_id]}"
    }
  }
  if [type] == "things" {
    elasticsearch {
      hosts => "localhost:9200"
      index => "test_index_2_13"
      document_id => "%{[@metadata][document_id]}"
      # document_id => "%{aut_es_2}"
      # you can set document_id; otherwise ES will generate a unique id.
    }
  }
}
If you need to run more than one pipeline in the same process, Logstash provides a way to do this through a configuration file called pipelines.yml (see Multiple Pipelines in the docs).
Using multiple pipelines is especially useful if your current configuration has event flows that don't share the same inputs, filters, and outputs and are being separated from each other using tags and conditionals.
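For example, a minimal pipelines.yml (the pipeline ids and config paths are placeholders) running the two table configs as separate pipelines could look like:
# config/pipelines.yml
- pipeline.id: table1
  path.config: "/etc/logstash/conf.d/table1.conf"
- pipeline.id: table2
  path.config: "/etc/logstash/conf.d/table2.conf"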
More helpful resources
