Logstash jdbc-input-plugin configuration for initial sql_last_value - Oracle

I synchronise data between an Oracle database and an Elasticsearch instance.
The database table "SYNC_TABLE" has the following columns: "ID" (NUMBER), "LAST_MODIFICATION" (TIMESTAMP) and "TEXT" (VARCHAR2).
I use Logstash with the jdbc-input-plugin to perform the data synchronisation on a regular basis.
This is the Logstash configuration file:
input {
jdbc {
jdbc_driver_library => "ojdbc6.jar"
jdbc_driver_class => "Java::oracle.jdbc.driver.OracleDriver"
jdbc_connection_string => "jdbc:oracle:thin:@localhost:1521:XE"
jdbc_user => "******"
jdbc_password => "******"
schedule => "* * * * *"
statement => "SELECT * from SYNC_TABLE where LAST_MODIFICATION >= :sql_last_value"
tracking_column => "LAST_MODIFICATION"
tracking_column_type => "timestamp"
use_column_value => true
}
}
output {
elasticsearch {
index => "SYNC_TABLE"
document_type => "SYNCED_DATA"
document_id => "%{ID}"
hosts => "localhost:9200"
}
stdout { codec => rubydebug }
}
I'd like to import all the data on the first run and then synchronise only the diff between the last run and the current time.
So I expect Logstash to make the following queries:
SELECT * from SYNC_TABLE where LAST_MODIFICATION >= '1 January 1970 00:00'
and then regularly
SELECT * from SYNC_TABLE where LAST_MODIFICATION >= 'time of last run'
The documentation says that the initial value for sql_last_value should be 1 January 1970, but I see in my logs that it takes the current timestamp instead.
This is the first query:
SELECT * from SYNC_TABLE where LAST_MODIFICATION >= TIMESTAMP '2017-08-14 09:17:00.481000 +00:00'
Is there any mistake in the Logstash configuration file that makes Logstash use the current timestamp instead of the default ('1 January 1970 00:00')?

The problem was in the .logstash_jdbc_last_run file, which contained the sql_last_value from previous runs.
I removed this file and restarted Logstash.
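For completeness, the plugin also has settings to control that state file directly. A minimal sketch, assuming the same connection settings as above (the path shown is only an example, and clean_run / last_run_metadata_path are standard jdbc input options):
input {
jdbc {
# ... same connection, schedule and statement settings as above ...
# reset the stored sql_last_value on startup, forcing a full import
clean_run => true
# keep the state file in an explicit location instead of the default $HOME/.logstash_jdbc_last_run
last_run_metadata_path => "/tmp/sync_table_last_run"
}
}
With clean_run => true the next run starts again from the 1 January 1970 default; remove the option (or set it back to false) afterwards so normal incremental runs resume.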

Related

Logstash pagination quits early

I'm having a problem that I could not crack by googling. We are doing a load with the JDBC plugin, using explicit pagination. When the pipeline runs, it loads about 3.2 million records and then quits without errors, as if it had finished successfully, but it should load around 6.4 million records. Here is our configuration:
input {
jdbc {
id => "NightlyRun"
jdbc_connection_string => "*******"
jdbc_driver_class => "Driver"
jdbc_user => "${USER}"
jdbc_password => "${PASS}"
lowercase_column_names => "false"
jdbc_paging_enabled => true
jdbc_page_size => 50000
jdbc_paging_mode => "explicit"
schedule => "5 2 * * *"
statement_filepath => "/usr/share/logstash/sql-files/sqlQuery1.sql"
}
}
output {
elasticsearch {
hosts => ["${ELASTIC_HOST}:9200"]
index => "index"
user => "logstash"
password => "${PASSWORD}"
document_id => "%{NUMBER}-%{value}"
}
}
And sql query we use:
declare @PageSize int
declare @Offset integer
set @PageSize=:size
set @Offset=:offset;
WITH cte AS
(
SELECT
id
FROM
entry
ORDER BY CREATE_TIMESTAMP
OFFSET @Offset ROWS
FETCH NEXT @PageSize ROWS ONLY
)
select * from entry
join cte on entry.id = cte.id
A select count(*) from entry returns the expected 6.4 million records, but Logstash loads only 3.2 million before quitting. How can I ensure Logstash loads all the records?
I tried running the query in the database with the offset set to 3200000 and the page size to 50000; the database returns results, so it is not likely a database issue.
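One thing worth double checking with jdbc_paging_mode => "explicit": the plugin substitutes :size and :offset itself on every page, so the statement has to page over a fully deterministic order, otherwise rows that share a CREATE_TIMESTAMP value can be skipped or duplicated across page boundaries. A minimal sketch of such a statement, assuming SQL Server and the table/column names from the question (the id tiebreaker is my addition, not part of the original query):
-- sqlQuery1.sql: self-contained explicit paging over a deterministic order
SELECT e.*
FROM entry e
ORDER BY e.CREATE_TIMESTAMP, e.id   -- tiebreaker keeps page boundaries stable
OFFSET :offset ROWS
FETCH NEXT :size ROWS ONLY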

Logstash :sql_last_value is showing a wrong junk date (showing a date 6 months old as the last run time)

I am observing a very strange issue.
I am using Logstash + JDBC to load data from an Oracle DB into Elasticsearch.
Below is what my config file looks like:
input{
jdbc{
clean_run => "false"
jdbc_driver_library => "<path_to_ojdbc8-12.1.0.jar>"
jdbc_driver_class => "Java::oracle.jdbc.driver.OracleDriver"
jdbc_connection_string => "<connection_string>"
jdbc_user => "<usename>"
jdbc_password_filepath => ".\pwd.txt"
statement=> "SELECT * FROM customers WHERE CUSTOMER_NAME LIKE 'PE%' AND UPD_DATE > :sql_last_value "
schedule=>"*/1 * * * * "
use_column_value => true
tracking_column_type => "timestamp"
tracking_column => "upd_date"
last_run_metadata_path =>"<path to logstash_metadata>"
record_last_run => true
}
}
filter {
mutate {
copy => { "id" => "[@metadata][_id]"}
remove_field => ["@version","@timestamp"]
}
}
output {
elasticsearch{
hosts => ["<host>"]
index => "<index_name>"
document_id=>"%{[@metadata][_id]}"
user => "<user>"
password => "<pwd>"
}
stdout{
codec => dots
}
}
Now I am triggering this file every minute, today being March 8th 2021.
When I load for the first time, all is good: :sql_last_value is '1970-01-01 00:00:00.000000 +00:00'.
But after this first load, logstash_metadata should ideally show '2021-03-08 <HH:MM:ss>'. Strangely, it instead gets updated to 2020-09-11 01:05:09.000000000 Z in logstash_metadata (:sql_last_value).
As you can see, the difference is about 180 days.
I tried multiple times but it still updates in the same way, and because of this my incremental load is getting messed up.
My logstash Version is 7.10.2
Help is much appreciated!
NOTE: I am not using pagination, as the number of results in the result set is always very low for my query.
The recorded date is the date of the last processed row.
Looking at your query, you don't have a specific order for the records read from the DB.
The Logstash jdbc input plugin wraps your query in one that orders rows by [1], 1 being the ordinal of the column it orders by.
So, to process records in the correct order and get the latest upd_date value, you need upd_date to be the first column in the select statement.
input{
jdbc{
clean_run => "false"
jdbc_driver_library => "<path_to_ojdbc8-12.1.0.jar>"
jdbc_driver_class => "Java::oracle.jdbc.driver.OracleDriver"
jdbc_connection_string => "<connection_string>"
jdbc_user => "<usename>"
jdbc_password_filepath => ".\pwd.txt"
statement=> "SELECT c.UPD_DATE, c.CUSTOMER_NAME, c.<Other field>
FROM customers c
WHERE c.CUSTOMER_NAME LIKE 'PE%' AND c.UPD_DATE > :sql_last_value
ORDER BY c.UPD_DATE ASC"
schedule=>"*/1 * * * * "
use_column_value => true
tracking_column_type => "timestamp"
tracking_column => "upd_date"
last_run_metadata_path =>"<path to logstash_metadata>"
record_last_run => true
}
}
Also note that this approach will exhaust the table the first time Logstash runs, even if you set jdbc_page_size. If that is what you want, that's fine.
But if you want Logstash to run one batch of X rows every minute and then stop until the next execution, you must combine jdbc_page_size with a query that limits rows, so that Logstash retrieves exactly the number of records you want, in the correct order. In SQL Server it works like this:
input{
jdbc{
jdbc_driver_library => ...
jdbc_driver_class => ...
jdbc_connection_string => ...
jdbc_user => ...
jdbc_password_filepath => ...
statement=> "SELECT TOP 10000 c.UPD_DATE, c.CUSTOMER_NAME
FROM customers c
WHERE c.CUSTOMER_NAME LIKE 'PE%' AND c.UPD_DATE > :sql_last_value
ORDER BY c.UPD_DATE ASC"
schedule=>"*/1 * * * * "
use_column_value => true
tracking_column_type => "timestamp"
tracking_column => "upd_date"
jdbc_page_size => 10000
last_run_metadata_path =>"<path to logstash_metadata>"
record_last_run => true
}
}
For Oracle you'll have to adapt the query depending on the version: either FETCH FIRST x ROWS ONLY with Oracle 12c, or ROWNUM for older versions, as sketched below.
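A rough sketch of both variants, reusing the query from above (the 10000 limit is just an example and should match jdbc_page_size):
-- Oracle 12c and later
SELECT c.UPD_DATE, c.CUSTOMER_NAME
FROM customers c
WHERE c.CUSTOMER_NAME LIKE 'PE%' AND c.UPD_DATE > :sql_last_value
ORDER BY c.UPD_DATE ASC
FETCH FIRST 10000 ROWS ONLY
-- Oracle 11g and earlier: order first, then filter on ROWNUM
SELECT * FROM (
  SELECT c.UPD_DATE, c.CUSTOMER_NAME
  FROM customers c
  WHERE c.CUSTOMER_NAME LIKE 'PE%' AND c.UPD_DATE > :sql_last_value
  ORDER BY c.UPD_DATE ASC
) WHERE ROWNUM <= 10000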
In any case, I suggest you take a look at the logs to check the queries logstash runs.

Elasticsearch is not creating an index received from Logstash Output file

I have an Ubuntu 20.04 VM with Elasticsearch, Logstash and Kibana (all release 7.7.0). What I'm trying to do is (among other things) have Logstash receive Syslog and NetFlow traps from Cisco devices, forward them to Elasticsearch, and from there to Kibana for visualization.
I created a Logstash config file (cisco.conf) where input and output sections look like this:
input {
udp {
port => 5003
type => "syslog"
}
udp {
port => 2055
codec => netflow {
include_flowset_id => true
enable_metric => true
versions => [5, 9]
}
}
}
output {
stdout { codec => rubydebug }
if [type] == "syslog" {
elasticsearch {
hosts => ["localhost:9200"]
manage_template => false
index => "ciscosyslog-%{+YYYY.MM.dd}"
}
}
if [type] == "netflow" {
elasticsearch {
hosts => ["localhost:9200"]
manage_template => false
index => "cisconetflow-%{+YYYY.MM.dd}"
}
}
}
The problem: the ciscosyslog index is created in Elasticsearch without any issue:
$ curl 'localhost:9200/_cat/indices?v'
health status index uuid pri rep docs.count docs.deleted store.size pri.store.size
yellow open ciscosyslog-2020.05.21 BRshOOnoQ5CsdVn3l0Z3kw 1 1 1438 0 338.4kb 338.4kb
green open .async-search dpd-HWYJSyW653u7BAhQVg 1 0 2 0 34.1kb 34.1kb
green open .kibana_1 xA5PIwKsTHCeOFyj9_NIQA 1 0 111 8 231.9kb 231.9kb
yellow open ciscosyslog-2020.05.22 kB4vJAooT3-fbIg0dKKt8w 1 1 566 0 159.2kb 159.2kb
However, the cisconetflow index is not created, as seen in the table above.
I ran Logstash in debug mode and I can see NetFlow messages arriving from the Cisco devices:
[WARN ] 2020-05-22 17:57:04.999 [[main]>worker1] Dissector - Dissector mapping, field not found in event {"field"=>"message", "event"=>{"host"=>"10.200.8.57", "@timestamp"=>2020-05-22T21:57:04.000Z, "@version"=>"1", "netflow"=>{"l4_src_port"=>443, "version"=>9, "l4_dst_port"=>41252, "src_tos"=>0, "dst_as"=>0, "protocol"=>6, "in_bytes"=>98, "flowset_id"=>256, "src_as"=>0, "ipv4_dst_addr"=>"10.200.8.57", "input_snmp"=>1, "output_snmp"=>4, "ipv4_src_addr"=>"104.244.42.133", "in_pkts"=>1, "flow_seq_num"=>17176}}}
[WARN ] 2020-05-22 17:57:04.999 [[main]>worker1] Dissector - Dissector mapping, field not found in event {"field"=>"message", "event"=>{"host"=>"10.200.8.57", "@timestamp"=>2020-05-22T21:57:04.000Z, "@version"=>"1", "netflow"=>{"l4_src_port"=>443, "version"=>9, "l4_dst_port"=>39536, "src_tos"=>0, "dst_as"=>0, "protocol"=>6, "in_bytes"=>79, "flowset_id"=>256, "src_as"=>0, "ipv4_dst_addr"=>"10.200.8.57", "input_snmp"=>1, "output_snmp"=>4, "ipv4_src_addr"=>"104.18.252.222", "in_pkts"=>1, "flow_seq_num"=>17176}}}
{
"host" => "10.200.8.57",
"#timestamp" => 2020-05-22T21:57:04.000Z,
"#version" => "1",
"netflow" => {
"l4_src_port" => 57654,
"version" => 9,
"l4_dst_port" => 443,
"src_tos" => 0,
"dst_as" => 0,
"protocol" => 6,
"in_bytes" => 7150,
"flowset_id" => 256,
"src_as" => 0,
"ipv4_dst_addr" => "104.244.39.20",
"input_snmp" => 4,
"output_snmp" => 1,
"ipv4_src_addr" => "172.16.1.21",
"in_pkts" => 24,
"flow_seq_num" => 17176
}
But at this point I can't tell whether Logstash is not delivering the information to ES or ES is failing to create the index. The current facts are:
a) Netflow traffic is present at Logstash input
b) ES is creating only one of the two indexes received from Logstash.
Thanks.
You have conditionals in your output that use the type field. Your first input adds this field with the correct value, but your second input does not set the field, so it will never match your conditional.
Add the line type => "netflow" to your second input, as you did with your first one.
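A minimal sketch of the second input with that field added (everything else unchanged from the question):
udp {
port => 2055
type => "netflow"
codec => netflow {
include_flowset_id => true
enable_metric => true
versions => [5, 9]
}
}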

Using the betweenDate operation filter for the SoftLayer Ruby API includes values beyond my endDate

I'm trying to retrieve the invoices for a single month (beginning of one month and ending at the beginning of the next month). However, I get results for the first day of the ending month in my result set, which I'm not expecting.
For example, this will return invoices for December 1st:
account = SoftLayer::Service.new("...")
billing_invoice_service = softlayer_client.service_named("Billing_Invoice");
object_filter = SoftLayer::ObjectFilter.new
object_filter.set_criteria_for_key_path('invoices.createDate',
'operation' => 'betweenDate',
'options' => [{
'name' => 'startDate',
'value' => ["11/01/2015 00:00:00"]
},
{
'name' => 'endDate',
'value' => ["12/01/2015 00:00:00"]
}
]
)
invoices = account.result_limit(0,5000).object_filter(object_filter).object_mask("mask[id,closedDate,createDate]").getInvoices
If I run with the below filter I get no results for December 1st:
account = SoftLayer::Service.new("...")
billing_invoice_service = softlayer_client.service_named("Billing_Invoice");
object_filter = SoftLayer::ObjectFilter.new
object_filter.set_criteria_for_key_path('invoices.createDate',
'operation' => 'betweenDate',
'options' => [{
'name' => 'startDate',
'value' => ["12/01/2015 00:00:00"]
},
{
'name' => 'endDate',
'value' => ["12/01/2015 00:00:00"]
}
]
)
invoices = account.result_limit(0,5000).object_filter(object_filter).object_mask("mask[id,closedDate,createDate]").getInvoices
So I'm not sure why I get results for December 1st in my first filter when I specify an ending time of 00:00:00. Thank you.
Edit: Here is a tail of the results from the first filter above (minus the id):
...
{"closedDate"=>"2015-11-30T21:52:17+05:30",
"createDate"=>"2015-11-30T21:52:16+05:30"},
{"closedDate"=>"2015-11-30T23:22:14+05:30",
"createDate"=>"2015-11-30T23:22:13+05:30"},
{"closedDate"=>"2015-12-01T01:43:59+05:30",
"createDate"=>"2015-12-01T01:43:56+05:30"},
{"closedDate"=>"2015-12-01T01:45:36+05:30",
"createDate"=>"2015-12-01T01:45:34+05:30"},
{"closedDate"=>"2015-12-01T02:05:20+05:30",
"createDate"=>"2015-12-01T02:05:16+05:30"},
{"closedDate"=>"2015-12-01T02:12:22+05:30",
"createDate"=>"2015-12-01T02:12:22+05:30"},
{"closedDate"=>"2015-12-01T02:13:06+05:30",
"createDate"=>"2015-12-01T02:13:04+05:30"},
{"closedDate"=>"2015-12-01T02:13:07+05:30",
"createDate"=>"2015-12-01T02:13:04+05:30"},
{"closedDate"=>"2015-12-01T02:13:07+05:30",
"createDate"=>"2015-12-01T02:13:05+05:30"},
{"closedDate"=>"2015-12-01T02:13:08+05:30",
"createDate"=>"2015-12-01T02:13:06+05:30"},
{"closedDate"=>"2015-12-01T02:13:07+05:30",
"createDate"=>"2015-12-01T02:13:06+05:30"},
{"closedDate"=>"2015-12-01T02:21:34+05:30",
"createDate"=>"2015-12-01T02:21:32+05:30"},
{"closedDate"=>"2015-12-01T02:38:12+05:30",
"createDate"=>"2015-12-01T02:38:10+05:30"},
{"closedDate"=>"2015-12-01T03:36:07+05:30",
"createDate"=>"2015-12-01T03:36:06+05:30"},
{"closedDate"=>"2015-12-01T04:09:57+05:30",
"createDate"=>"2015-12-01T04:09:55+05:30"},
{"closedDate"=>"2015-12-01T04:37:45+05:30",
"createDate"=>"2015-12-01T04:37:43+05:30"},
{"closedDate"=>"2015-12-01T06:35:34+05:30",
"createDate"=>"2015-12-01T06:35:33+05:30"},
{"closedDate"=>"2015-12-01T07:00:09+05:30",
"createDate"=>"2015-12-01T07:00:06+05:30"},
{"closedDate"=>"2015-12-01T08:00:32+05:30",
"createDate"=>"2015-12-01T08:00:30+05:30"}]
The issue might be due to the timezone: the filter does not take your current timezone into account, it only filters on the data as it is stored in the database; when the data is displayed, it is converted to your current timezone. I suggest you change your end date value to account for the timezone difference between the data stored in SoftLayer and your current timezone.
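For example, if the invoice data is stored in US Central time (an assumption on my part, -06:00) while your results are displayed at +05:30, you would pull the end of the window back by that difference, roughly like this:
object_filter = SoftLayer::ObjectFilter.new
object_filter.set_criteria_for_key_path('invoices.createDate',
  'operation' => 'betweenDate',
  'options' => [
    { 'name' => 'startDate', 'value' => ["11/01/2015 00:00:00"] },
    # midnight Dec 1st at +05:30 expressed in -06:00, i.e. 11.5 hours earlier
    { 'name' => 'endDate', 'value' => ["11/30/2015 12:30:00"] }
  ]
)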

Puppet: cron schedule - "is not a valid hour" error

I need to schedule a cron job to run daily every 30 minutes from hour 0 to 6 and from 13 to 23. I tried this code:
cron { "MyJob":
ensure => present,
command => "my-cron-command",
user => 'root',
hour => "0-6,13-23",
minute => '*/30',
environment => "MY_ENV"
}
This fails with
0-6,13-23 is not a valid hour
What hour format should I use? Do I need any other changes in cron clause?
Close, but no cigar.
cron { "MyJob":
ensure => present,
command => "my-cron-command",
user => 'root',
hour => [ "0-6", "13-23" ],
minute => '*/30',
environment => "MY_ENV"
}
If you are putting more than one value in any attribute, put them in an array. So the hour will be ['0-6', '13-23'].
You can also just list the hours:
cron { "MyJob":
ensure => present,
command => "my-cron-command",
user => 'root',
hour => [0,1,2,3,4,5,6,13,14,15,16,17,18,19,20,21,22,23],
minute => '*/30',
environment => "MY_ENV"
}
This works, but hour => "0-6,13-23" does not.
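For reference, the array form ends up as an ordinary comma-separated cron field, so the generated crontab entry should look roughly like this (exact rendering may vary by Puppet version):
# minute  hour       dom  month  dow  command
*/30      0-6,13-23  *    *      *    my-cron-command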
