Logstash "sql_last_value" value comparizon is now working properly with GT condition - jdbc

I'd like to build a data pipeline composed of RDS (Aurora DB), Logstash, and AWS OpenSearch.
To keep data consistency in my OpenSearch index, I'd like to filter out duplicated values with the query.
For that, I wrote a Logstash config file like this:
input {
  jdbc {
    jdbc_driver_library => "/home/ubuntu/logstash-7.16.2/bin/mysql-connector-java-8.0.27.jar"
    jdbc_driver_class => "com.mysql.cj.jdbc.Driver"
    jdbc_connection_string => "jdbc:mysql://~~~~?useSSL=false"
    jdbc_user => "root"
    jdbc_password => "~~~~~~"
    jdbc_paging_enabled => true
    tracking_column => "updated_at"
    use_column_value => true
    record_last_run => true
    tracking_column_type => "timestamp"
    schedule => "*/10 * * * * *"
    statement => "select * from my_table where updated_at > :sql_last_value order by updated_at ASC"
    jdbc_default_timezone => "Asia/Seoul"
  }
}
output {
  opensearch {
    hosts => "https://~~~~~:443"
    user => "admin"
    password => "~~~~~"
    index => "index"
    ecs_compatibility => disabled
    ssl_certificate_verification => false
  }
}
And these are the queries generated by Logstash:
[2022-12-29T16:48:40,299][INFO ][logstash.inputs.jdbc ][main][0bb20d034a10be3c1a48635cda2cc7dfcb97e29fb63940352f5380ec253dfe48] (0.001157s) SELECT version()
[2022-12-29T16:48:40,302][INFO ][logstash.inputs.jdbc ][main][0bb20d034a10be3c1a48635cda2cc7dfcb97e29fb63940352f5380ec253dfe48] (0.001163s) SELECT version()
[2022-12-29T16:48:40,306][INFO ][logstash.inputs.jdbc ][main][0bb20d034a10be3c1a48635cda2cc7dfcb97e29fb63940352f5380ec253dfe48] (0.001221s) SELECT count(*) AS `count` FROM (select * from my_table where updated_at > '2022-12-28 22:31:05' order by updated_at ASC) AS `t1` LIMIT 1
[2022-12-29T16:48:40,309][INFO ][logstash.inputs.jdbc ][main][0bb20d034a10be3c1a48635cda2cc7dfcb97e29fb63940352f5380ec253dfe48] (0.001166s) SELECT * FROM (select * from my_table where updated_at > '2022-12-28 22:31:05' order by updated_at ASC) AS `t1` LIMIT 100000 OFFSET 0
[2022-12-29T16:48:50,172][INFO ][logstash.inputs.jdbc ][main][0bb20d034a10be3c1a48635cda2cc7dfcb97e29fb63940352f5380ec253dfe48] (0.001303s) SELECT version()
[2022-12-29T16:48:50,174][INFO ][logstash.inputs.jdbc ][main][0bb20d034a10be3c1a48635cda2cc7dfcb97e29fb63940352f5380ec253dfe48] (0.001152s) SELECT version()
[2022-12-29T16:48:50,178][INFO ][logstash.inputs.jdbc ][main][0bb20d034a10be3c1a48635cda2cc7dfcb97e29fb63940352f5380ec253dfe48] (0.001382s) SELECT count(*) AS `count` FROM (select * from my_table where updated_at > '2022-12-28 22:31:05' order by updated_at ASC) AS `t1` LIMIT 1
[2022-12-29T16:48:50,182][INFO ][logstash.inputs.jdbc ][main][0bb20d034a10be3c1a48635cda2cc7dfcb97e29fb63940352f5380ec253dfe48] (0.001153s) SELECT * FROM (select * from my_table where updated_at > '2022-12-28 22:31:05' order by updated_at ASC) AS `t1` LIMIT 100000 OFFSET 0
It looks like everything is fine.
sql_last_value was updated whenever new data was inserted.
However, the documents indexed into OpenSearch also include rows whose updated_at is equal to sql_last_value.
For example,
sql_last_value = 2022.12.08 12:12:12
first data : 2022.12.08 12:12:11
second data : 2022.12.08 12:12:12
third data : 2022.12.08 12:12:13
In the above case, the second and third rows were selected by my query.
Also, sql_last_value was updated to 2022.12.08 12:12:13 for the next query.
Also, the updated_at column is created by the Sequelize module in NestJS (it is a timestamp type).
What is the problem with my config file?

I don't fully understand why this works, but I solved the problem.
The problem was my statement query: select * from my_table where updated_at > :sql_last_value order by updated_at ASC.
I edited it to select * from my_table where updated_at > :sql_last_value, without the order by clause.
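For reference, a minimal sketch of the revised input block (connection and credential placeholders elided, as in the original config):
input {
  jdbc {
    # ... same driver, connection, and credential settings as above ...
    jdbc_paging_enabled => true
    tracking_column => "updated_at"
    use_column_value => true
    record_last_run => true
    tracking_column_type => "timestamp"
    schedule => "*/10 * * * * *"
    # order by removed; only the strict greater-than comparison remains
    statement => "select * from my_table where updated_at > :sql_last_value"
    jdbc_default_timezone => "Asia/Seoul"
  }
}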

Related

Logstash not loading the exact number of records into Elasticsearch, and results keep changing on every hit

Problem statement: Logstash is not loading all records from the database into Elasticsearch correctly, and every time I hit the same API I get different results. (Sometimes the results are correct, but they change on every hit and show only a subset of records under the salutations nested field.) The Logstash behavior looks sporadic, and the loaded results are not consistent, especially in the one-to-many scenario.
http://localhost:9200/staffsalutation/_search
I am observing a weird behaviour of Logstash 7.8.0 while loading records from 2 tables, with the query and configuration below.
Query :
select s.update_time, s.staff_id as staff_id, birth_date, first_name, last_name, gender, hire_date,
st.title AS title_nm, st.from_date AS title_frm_dt, st.to_date AS title_to_dt
from staff s
LEFT JOIN salutation st ON s.staff_id = st.staff_id
order by s.update_time
input {
  jdbc {
    jdbc_connection_string => "jdbc:postgresql://localhost:5432/postgres"
    jdbc_driver_library => "C:\\Users\\NS\\.m2\\repository\\org\\postgresql\\postgresql\\42.2.11\\postgresql-42.2.11.jar"
    jdbc_user => "postgres"
    jdbc_password => "postgres"
    jdbc_driver_class => "org.postgresql.Driver"
    schedule => "* * * * *"
    statement => "select e.update_time, e.emp_no as staff_id, birth_date, first_name, last_name, gender, hire_date, t.title AS title_nm, t.from_date AS title_frm_dt, t.to_date AS title_to_dt
                  from employees e
                  LEFT JOIN titles t ON e.emp_no = t.emp_no
                  order by e.update_time"
    add_field => { "doctype" => "employee" }
    tracking_column_type => "timestamp"
    use_column_value => true
    tracking_column => "update_time"
    jdbc_fetch_size => "50000"
  }
}
filter {
  aggregate {
    task_id => "%{staff_id}"
    code => "
      map['staff_id'] = event.get('staff_id')
      map['birth_date'] = event.get('birth_date')
      map['first_name'] = event.get('first_name')
      map['last_name'] = event.get('last_name')
      map['gender'] = event.get('gender')
      map['hire_date'] = event.get('hire_date')
      map['salutations'] ||= []
      map['salutations'] << {
        'title_nm' => event.get('title_nm'),
        'title_frm_dt' => event.get('title_frm_dt'),
        'title_to_dt' => event.get('title_to_dt')
      }
      event.cancel()
    "
    push_previous_map_as_event => true
    timeout => 30
  }
}
output {
  elasticsearch {
    document_id => "%{staff_id}"
    index => "staffsalutation"
  }
  file {
    path => "test.log"
    codec => line
  }
}
Found the solution!
You need to use an order by clause in the query so that records are sorted by emp_no, and
Logstash can then aggregate dependent entities like titles (the one-to-many case).
select e.update_time, e.emp_no as staff_id, birth_date, first_name, last_name, gender, hire_date,
       t.title AS title_nm, t.from_date AS title_frm_dt, t.to_date AS title_to_dt
from employees e
LEFT JOIN titles t ON e.emp_no = t.emp_no
order by e.emp_no
Since aggregation is used here, a single thread must process the records, or else it will cause aggregation issues (that is where the random results come from on multiple calls to the search URL above). It looks like a performance hit, since only one worker thread processes the records, but it can be mitigated by invoking multiple Logstash config files with disjoint sets of records, e.g. the first 100 emp_no values in one file and the second hundred in another, so that Logstash can execute them in parallel.
So execute it like below:
logstash -f logstash_config.conf -w 1
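If you prefer configuration over a command-line flag, the same constraint can be set in logstash.yml (a sketch; pipeline.workers is the setting behind the -w flag):
# logstash.yml
# Equivalent to -w 1: a single worker thread, so the aggregate filter
# sees all rows for a given staff_id in order.
pipeline.workers: 1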

Elasticsearch, Kibana - Unable to get entire data into Elasticsearch from an Oracle database using the JDBC plugin

I have an Oracle database on a server, but when I try to index the entire data into Elasticsearch, the total count in the database does not match the total hits in Elasticsearch. I don't know why. A primary reason could be: 1) not selecting a time range appropriate to the index's reference timestamp.
I have cross-checked the time range and am still missing a few thousand records out of roughly 20 million.
input {
  jdbc {
    jdbc_connection_string => "jdbc:oracle:thin:@ip:port:ORCL"
    jdbc_user => "system"
    jdbc_password => "oracle"
    jdbc_validate_connection => true
    jdbc_driver_library => "---\ojdbc6.jar"
    jdbc_driver_class => "Java::oracle.jdbc.driver.OracleDriver"
    statement => "SELECT DISTNAME ||'-'|| DNO as Distributor, DISTNAME ||'-'|| DNO || '-' || STATE_NAME || '-' || DCNAME as Distributor_2,
                  col1, col2, col3 ---- from EPMUY_ELASTIC"
  }
}
output {
  elasticsearch {
    hosts => ["http://localhost:9200"]
    index => "index"
    user => "user"
    password => "password"
  }
  stdout {
    codec => rubydebug
  }
}
Note - I didn't create the index mapping myself; I let Kibana decide the index mapping (dynamic: true).
On the server I checked the total count of records in the database, which is 18331793
(select count(*) from table),
and my query in the Logstash pipeline returns the same count.
But in Kibana I am only seeing 18,303,447 records.
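One way to cross-check without Kibana's time filter is to ask Elasticsearch for the raw document count directly (a sketch, using the index name from the config above):
curl -s 'http://localhost:9200/index/_count'
If this count matches the database, the mismatch is in the Kibana time range rather than in indexing.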

push data from the last updated_on date

So I'm using the below query in my Logstash jdbc input.
statement => "SELECT * from mydata WHERE updated_on > :sql_last_value ORDER BY updated_on"
use_column_value =>true
tracking_column =>updated_on
tracking_column_type => "#timestamp"
or
statement => "SELECT * from mydata WHERE :sql_last_value > updated_on ORDER BY updated_on"
use_column_value =>true
tracking_column =>updated_on
tracking_column_type => "#timestamp"
Here, my :sql_last_value is considered to be the last run time of the configuration file.
example:
"updated_on": "2019-09-26T08:11:00.000Z",
"#timestamp": "2019-09-26T08:17:52.974Z"
Here my sql_last_value corresponds to #timestamp, but I want it to consider updated_on instead.
How do I change it to consider the last updated_on date instead of the execution time?
So this is your current configuration:
statement => "SELECT * from agedata WHERE updated_on > :sql_last_value ORDER BY updated_on"
use_column_value => true
tracking_column => updated_on
tracking_column_type => "timestamp"
What this says is basically that the sql_last_value variable will store/remember the last value of the updated_on column from the last run, since use_column_value is true (it is not the last run time, as you suggest; that would be the behavior if use_column_value were false).
So your configuration is already doing what you expect.
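For what it's worth, the tracked value is persisted between runs in a metadata file, so it survives restarts; a minimal sketch (the path shown is an example, and last_run_metadata_path is optional, defaulting to $HOME/.logstash_jdbc_last_run):
jdbc {
  # ... connection settings ...
  statement => "SELECT * from mydata WHERE updated_on > :sql_last_value ORDER BY updated_on"
  use_column_value => true
  tracking_column => "updated_on"
  tracking_column_type => "timestamp"
  # file where the last updated_on value is stored between runs
  last_run_metadata_path => "/var/lib/logstash/.jdbc_last_run"
}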

Knowing when a table was updated in Oracle without a full scan

I'm building an Oracle connector that reads data periodically from a couple of very big tables, some of which are divided into partitions.
I'm trying to figure out which tables were updated since the last time they were read, to avoid unnecessary queries. I have the last ora_rowscn or updated_at, and the only methods I've found require a full table scan to see if there are new or updated rows in the table.
Is there a way to tell if a row was inserted or updated without the full scan?
A couple of ideas:
1. Create a table to store the last DML time by table_name, and then create a simple trigger on the table to update that meta table (see the sketch after this list).
2. Create a Materialized View Log on the table and use the data from the log to determine the changes.
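A minimal sketch of idea 1, with hypothetical names (last_dml as the meta table, my_big_table as the monitored table):
-- meta table: one row per monitored table
CREATE TABLE last_dml (
  table_name  VARCHAR2(128) PRIMARY KEY,
  last_change TIMESTAMP
);

-- statement-level trigger: fires once per INSERT/UPDATE/DELETE statement,
-- so it adds no per-row overhead
CREATE OR REPLACE TRIGGER trg_my_big_table_dml
AFTER INSERT OR UPDATE OR DELETE ON my_big_table
BEGIN
  MERGE INTO last_dml d
  USING (SELECT 'MY_BIG_TABLE' AS table_name FROM dual) s
  ON (d.table_name = s.table_name)
  WHEN MATCHED THEN UPDATE SET d.last_change = SYSTIMESTAMP
  WHEN NOT MATCHED THEN INSERT (table_name, last_change)
    VALUES (s.table_name, SYSTIMESTAMP);
END;
/
The connector then checks last_dml first and only queries tables whose last_change has advanced since the previous read.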
If there are archive logs for the search period, you can use the LogMiner utility. For example, suppose this insert was executed:
insert into "ASOUP"."US"("KEY_COLUMN","COD_ROAD","COD_COMPUTER","COD_STATION_OPER","NUMB_TRAIN","STAT_CREAT","NUMB_SOSTAVA","STAT_APPOINT","COD_OPER","DIRECT_1","DIRECT_2","DATE_OPER","PARK","PATH","LOCOMOT","LATE","CAUSE_LATE","COD_CONNECT","CATEGORY","TIME") values ('42018740','988','0','9200','2624','8642','75','9802','1','8891','0',TO_DATE('18-Dec-2018', 'DD-Mon-RRRR'),'0','0','0','0','0','0',NULL,TO_DATE('18-Dec-2018', 'DD-Mon-RRRR'));
select name, first_time, next_time
from v$archived_log
where first_time > sysdate - 3/24;
/oracle/app/oracle/product/11.2/redolog/edcu/1_48060_769799469.dbf 18-Dec-2018 09:03:06 18-Dec-2018 10:22:00
/oracle/app/oracle/product/11.2/redolog/edcu/1_48061_769799469.dbf 18-Dec-2018 10:22:00 18-Dec-2018 10:30:02
/oracle/app/oracle/product/11.2/redolog/edcu/1_48062_769799469.dbf 18-Dec-2018 10:30:02 18-Dec-2018 10:56:07
Run the LogMiner utility:
EXECUTE DBMS_LOGMNR.add_logfile(LOGFILENAME => '/oracle/app/oracle/product/11.2/redolog/edcu/1_48060_769799469.dbf', OPTIONS => DBMS_LOGMNR.NEW);
EXECUTE DBMS_LOGMNR.add_logfile(LOGFILENAME => '/oracle/app/oracle/product/11.2/redolog/edcu/1_48061_769799469.dbf', OPTIONS => DBMS_LOGMNR.addfile);
EXECUTE DBMS_LOGMNR.add_logfile(LOGFILENAME => '/oracle/app/oracle/product/11.2/redolog/edcu/1_48062_769799469.dbf', OPTIONS => DBMS_LOGMNR.addfile);
EXECUTE DBMS_LOGMNR.START_LOGMNR(OPTIONS => DBMS_LOGMNR.DICT_FROM_ONLINE_CATALOG);
SELECT scn,ROW_ID,to_char(timestamp,'DD-MM-YYYY HH24:MI:SS'),
table_name,seg_name,operation, sql_redo,sql_undo
FROM v$logmnr_contents
where seg_owner='ASOUP' and table_name='US'
SCN:        1398405575908
ROW_ID:     AAA3q2AAoAACFweABi
TIMESTAMP:  18-12-2018 09:03:15
TABLE_NAME: US
SEG_NAME:   US,ADCU201902
OPERATION:  INSERT
SQL_REDO:   insert into "ASOUP"."US"("KEY_COLUMN","COD_ROAD","COD_COMPUTER","COD_STATION_OPER","NUMB_TRAIN","STAT_CREAT","NUMB_SOSTAVA","STAT_APPOINT","COD_OPER","DIRECT_1","DIRECT_2","DATE_OPER","PARK","PATH","LOCOMOT","LATE","CAUSE_LATE","COD_CONNECT","CATEGORY","TIME") values ('42018727','988','0','8800','4404','1','895','8800','1','8838','0',TO_DATE('18-Dec-2018', 'DD-Mon-RRRR'),'4','2','0','0','0','0',NULL,TO_DATE('18-Dec-2018', 'DD-Mon-RRRR'));
SQL_UNDO:   delete from "ASOUP"."US" where "KEY_COLUMN" = '42018727' and "COD_ROAD" = '988' and "COD_COMPUTER" = '0' and "COD_STATION_OPER" = '8800' and "NUMB_TRAIN" = '4404' and "STAT_CREAT" = '1' and "NUMB_SOSTAVA" = '895' and "STAT_APPOINT" = '8800' and "COD_OPER" = '1' and "DIRECT_1" = '8838' and "DIRECT_2" = '0' and "DATE_OPER" = TO_DATE('18-Dec-2018', 'DD-Mon-RRRR') and "PARK" = '4' and "PATH" = '2' and "LOCOMOT" = '0' and "LATE" = '0' and "CAUSE_LATE" = '0' and "COD_CONNECT" = '0' and "CATEGORY" IS NULL and "TIME" = TO_DATE('18-Dec-2018', 'DD-Mon-RRRR') and ROWID = 'AAA3q2AAoAACFweABi';
You can then see the inserted row without a full scan:
select * from asoup.us where ROWID = 'AAA3q2AAoAACFweABi';

CakePHP distinct date_format

I need some help with CakePHP 3 and MySQL DISTINCT.
My desired result in SQL:
select distinct(date_format(torokuDate,'%Y')) as year
from kani_tbl
where torokuDate >= '2000-01-01'
order by torokuDate ASC
limit 1;
But I get the wrong result:
SELECT (date_format((torokuDate), '%Y')) AS `year`
FROM kani_tbl
WHERE torokuDate > :c0
GROUP BY torokuDate
ORDER BY torokuDate ASC
LIMIT 1
My model source:
$query = $this->find();
$time = $query->func()->date_format([
    'torokuDate' => 'identifier',
    "'%Y'" => 'literal'
]);
$yearList = $query->select(['year' => $time])
    ->distinct('torokuDate')
    ->from('kani_tbl')
    ->order(['torokuDate' => 'ASC'])
    ->where(['torokuDate >' => '2000-01-01'])
    ->limit(1);
    // ->hydrate(false)
    // ->toArray();
var_dump($yearList);
Please help me add the distinct field to the MySQL command.
To match your expected query, you need to apply distinct to the result of date_format(torokuDate, '%Y'), not to the torokuDate field. So your code should be like this:
$yearList = $query->select(['year' => $time])
    ->distinct()
    ->from('kani_tbl')
    ->order(['torokuDate' => 'ASC'])
    ->where(['torokuDate >' => '2000-01-01'])
    ->limit(1);
How to use distinct method: https://api.cakephp.org/3.4/class-Cake.Database.Query.html#_distinct
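With no column argument, distinct() applies to the whole select list, so the generated SQL should be close to the desired query (a sketch of the expected statement, not verified against CakePHP's exact rendering):
SELECT DISTINCT (date_format((torokuDate), '%Y')) AS `year`
FROM kani_tbl
WHERE torokuDate > :c0
ORDER BY torokuDate ASC
LIMIT 1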
