Logstash not loading the exact number of records into Elasticsearch, and results keep changing on every hit

Problem statement: Logstash is not loading all records from the database into Elasticsearch correctly, and every time I hit the same API I get different results. (Sometimes the results are correct, but they change on every hit and show only a subset of records under the salutations nested field.) The loading behaviour looks sporadic and the results are not consistent, especially in a one-to-many scenario.
http://localhost:9200/staffsalutation/_search
I am observing this weird behaviour with Logstash 7.8.0 while loading records from 2 tables, with the query and configuration below.
Query :
select s.update_time, s.staff_id as staff_id, birth_date, first_name, last_name, gender, hire_date,
st.title AS title_nm, st.from_date AS title_frm_dt, st.to_date AS title_to_dt
from staff s
LEFT JOIN salutation st ON s.staff_id = st.staff_id
order by s.update_time
input {
  jdbc {
    jdbc_connection_string => "jdbc:postgresql://localhost:5432/postgres"
    jdbc_driver_library => "C:\\Users\\NS\\.m2\\repository\\org\\postgresql\\postgresql\\42.2.11\\postgresql-42.2.11.jar"
    jdbc_user => "postgres"
    jdbc_password => "postgres"
    jdbc_driver_class => "org.postgresql.Driver"
    schedule => "* * * * *"
    statement => "select e.update_time, e.emp_no as staff_id, birth_date, first_name, last_name, gender, hire_date, t.title AS title_nm, t.from_date AS title_frm_dt, t.to_date AS title_to_dt
                  from employees e
                  LEFT JOIN titles t
                  ON e.emp_no = t.emp_no
                  order by e.update_time"
    add_field => { "doctype" => "employee" }
    tracking_column_type => "timestamp"
    use_column_value => true
    tracking_column => "update_time"
    jdbc_fetch_size => "50000"
  }
}
filter {
  aggregate {
    task_id => "%{staff_id}"
    code => "
      map['staff_id'] = event.get('staff_id')
      map['birth_date'] = event.get('birth_date')
      map['first_name'] = event.get('first_name')
      map['last_name'] = event.get('last_name')
      map['gender'] = event.get('gender')
      map['hire_date'] = event.get('hire_date')
      map['salutations'] ||= []
      map['salutations'] << {
        'title_nm' => event.get('title_nm'),
        'title_frm_dt' => event.get('title_frm_dt'),
        'title_to_dt' => event.get('title_to_dt')
      }
      event.cancel()
    "
    push_previous_map_as_event => true
    timeout => 30
  }
}
output {
  elasticsearch {
    document_id => "%{staff_id}"
    index => "staffsalutation"
  }
  file {
    path => "test.log"
    codec => line
  }
}

Found the solution!
You need to use an ORDER BY clause in the query so that records are sorted by emp_no, letting
Logstash aggregate the dependent entities such as titles (the one-to-many side).
select e.update_time, e.emp_no as staff_id, birth_date, first_name,
       last_name, gender, hire_date, t.title AS title_nm,
       t.from_date AS title_frm_dt, t.to_date AS title_to_dt
from employees e
LEFT JOIN titles t ON e.emp_no = t.emp_no
order by e.emp_no
Since the aggregate filter is used, a single worker thread must process the records, otherwise the aggregation breaks (that is where the random results on repeated calls to the search URL above come from). This looks like a performance hit, since only one worker thread processes all records, but it can be mitigated by running multiple Logstash config files over disjoint sets of records (e.g. the first 100 emp_no values in one file and the second hundred in another) so that Logstash can execute them in parallel.
So execute it like below:
logstash -f logstash_config.conf -w 1
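The sorting requirement can be illustrated with a toy model (a hypothetical sketch, not the aggregate filter's actual implementation): with push_previous_map_as_event, a new document is pushed whenever the task_id changes, so rows for the same staff_id must arrive contiguously.

```python
# Toy model of push_previous_map_as_event: a map is flushed as soon as the
# incoming task_id (staff_id) differs from the current one.

def aggregate(rows):
    """rows: list of (staff_id, title) tuples in arrival order."""
    docs = []
    current_id, salutations = None, []
    for staff_id, title in rows:
        if staff_id != current_id:
            if current_id is not None:
                docs.append({"staff_id": current_id, "salutations": salutations})
            current_id, salutations = staff_id, []
        salutations.append(title)
    if current_id is not None:
        docs.append({"staff_id": current_id, "salutations": salutations})
    return docs

# Sorted by staff_id: one document per employee with all titles aggregated.
print(aggregate([(1, "Engineer"), (1, "Senior Engineer"), (2, "Manager")]))

# Sorted by update_time instead: staff 1's rows are split into two documents.
# Since both share document_id 1, the later partial document overwrites the
# earlier one in Elasticsearch, leaving only a subset of salutations.
print(aggregate([(1, "Engineer"), (2, "Manager"), (1, "Senior Engineer")]))
```

With more than one worker thread the interleaving of rows across threads has the same effect as unsorted input, which is why -w 1 is required.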

Related

Logstash "sql_last_value" comparison is not working properly with the GT condition

I'd like to build a data pipeline composed of RDS (Aurora DB), Logstash and AWS OpenSearch.
To keep my OpenSearch index consistent, I'd like to remove duplicated values with a query.
For that, I wrote a Logstash config file like this.
input {
  jdbc {
    jdbc_driver_library => "/home/ubuntu/logstash-7.16.2/bin/mysql-connector-java-8.0.27.jar"
    jdbc_driver_class => "com.mysql.cj.jdbc.Driver"
    jdbc_connection_string => "jdbc:mysql://~~~~?useSSL=false"
    jdbc_user => "root"
    jdbc_password => "~~~~~~"
    jdbc_paging_enabled => true
    tracking_column => "updated_at"
    use_column_value => true
    record_last_run => true
    tracking_column_type => "timestamp"
    schedule => "*/10 * * * * *"
    statement => "select * from my_table where updated_at > :sql_last_value order by updated_at ASC"
    jdbc_default_timezone => "Asia/Seoul"
  }
}
output {
  opensearch {
    hosts => "https://~~~~~:443"
    user => "admin"
    password => "~~~~~"
    index => "index"
    ecs_compatibility => disabled
    ssl_certificate_verification => false
  }
}
And these are the queries generated by Logstash.
[2022-12-29T16:48:40,299][INFO ][logstash.inputs.jdbc ][main][0bb20d034a10be3c1a48635cda2cc7dfcb97e29fb63940352f5380ec253dfe48] (0.001157s) SELECT version()
[2022-12-29T16:48:40,302][INFO ][logstash.inputs.jdbc ][main][0bb20d034a10be3c1a48635cda2cc7dfcb97e29fb63940352f5380ec253dfe48] (0.001163s) SELECT version()
[2022-12-29T16:48:40,306][INFO ][logstash.inputs.jdbc ][main][0bb20d034a10be3c1a48635cda2cc7dfcb97e29fb63940352f5380ec253dfe48] (0.001221s) SELECT count(*) AS `count` FROM (select * from my_table where updated_at > '2022-12-28 22:31:05' order by updated_at ASC) AS `t1` LIMIT 1
[2022-12-29T16:48:40,309][INFO ][logstash.inputs.jdbc ][main][0bb20d034a10be3c1a48635cda2cc7dfcb97e29fb63940352f5380ec253dfe48] (0.001166s) SELECT * FROM (select * from my_table where updated_at > '2022-12-28 22:31:05' order by updated_at ASC) AS `t1` LIMIT 100000 OFFSET 0
[2022-12-29T16:48:50,172][INFO ][logstash.inputs.jdbc ][main][0bb20d034a10be3c1a48635cda2cc7dfcb97e29fb63940352f5380ec253dfe48] (0.001303s) SELECT version()
[2022-12-29T16:48:50,174][INFO ][logstash.inputs.jdbc ][main][0bb20d034a10be3c1a48635cda2cc7dfcb97e29fb63940352f5380ec253dfe48] (0.001152s) SELECT version()
[2022-12-29T16:48:50,178][INFO ][logstash.inputs.jdbc ][main][0bb20d034a10be3c1a48635cda2cc7dfcb97e29fb63940352f5380ec253dfe48] (0.001382s) SELECT count(*) AS `count` FROM (select * from my_table where updated_at > '2022-12-28 22:31:05' order by updated_at ASC) AS `t1` LIMIT 1
[2022-12-29T16:48:50,182][INFO ][logstash.inputs.jdbc ][main][0bb20d034a10be3c1a48635cda2cc7dfcb97e29fb63940352f5380ec253dfe48] (0.001153s) SELECT * FROM (select * from my_table where updated_at > '2022-12-28 22:31:05' order by updated_at ASC) AS `t1` LIMIT 100000 OFFSET 0
It looks like everything is fine.
sql_last_value was updated whenever new data was inserted.
However, the values queried into OpenSearch also contain rows whose updated_at is equal to sql_last_value.
For example:
sql_last_value = 2022.12.08 12:12:12
first row:  2022.12.08 12:12:11
second row: 2022.12.08 12:12:12
third row:  2022.12.08 12:12:13
In the above case, the second and third rows were selected by my query.
Also, sql_last_value was updated to 2022.12.08 12:12:13 by the next query.
The updated_at column is created by the Sequelize module in NestJS (it is of timestamp type).
What is the problem with my config file?
I don't understand why this works, but I solved the problem.
The problem was my statement query, select * from my_table where updated_at > :sql_last_value order by updated_at ASC.
I edited it to select * from my_table where updated_at > :sql_last_value, without the ORDER BY clause.
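Why removing ORDER BY appeared to help is unclear, but one plausible reason the strict > comparison can still return rows that look equal to sql_last_value is sub-second precision: the logged queries show sql_last_value truncated to whole seconds, so a row stored with fractional seconds in the same second sorts strictly after it. A small sketch of that boundary (hypothetical data, not the plugin's actual code):

```python
from datetime import datetime

def rows_after(rows, sql_last_value):
    """Model of: SELECT * FROM my_table WHERE updated_at > :sql_last_value."""
    return [r for r in rows if r > sql_last_value]

# Rows as stored in the database, some with sub-second precision.
rows = [
    datetime(2022, 12, 8, 12, 12, 11, 500000),
    datetime(2022, 12, 8, 12, 12, 12, 250000),
    datetime(2022, 12, 8, 12, 12, 13),
]

# Tracking value truncated to whole seconds, as in the logged query.
sql_last_value = datetime(2022, 12, 8, 12, 12, 12)

# The 12:12:12.250 row sorts strictly after the truncated 12:12:12 value,
# so a row that "equals" sql_last_value at second precision is selected again.
print(rows_after(rows, sql_last_value))
```

Timezone conversion via jdbc_default_timezone can shift the comparison in a similar way, so it is worth checking both when equal-looking rows reappear.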

Elasticsearch, Kibana - Unable to get entire data into Elasticsearch from an Oracle database using the JDBC plugin

I have an Oracle database on a server, but when I try to index the entire data into Elasticsearch, the total count in the database does not match the total hits in Elasticsearch, and I don't know why. A primary reason could be: 1) not selecting a time range appropriate to your index's reference timestamp.
I have cross-checked the time range and am still missing a few thousand records out of roughly 20 million.
input {
  jdbc {
    jdbc_connection_string => "jdbc:oracle:thin:@ip:port:ORCL"
    jdbc_user => "system"
    jdbc_password => "oracle"
    jdbc_validate_connection => true
    jdbc_driver_library => "---\ojdbc6.jar"
    jdbc_driver_class => "Java::oracle.jdbc.driver.OracleDriver"
    statement => "SELECT DISTNAME ||'-'|| DNO as Distributor, DISTNAME ||'-'|| DNO || '-' || STATE_NAME || '-' || DCNAME as Distributor_2,
                  col1, col2, col3 ---- from EPMUY_ELASTIC"
  }
}
output {
  elasticsearch {
    hosts => ["http://localhost:9200"]
    index => "index"
    user => "user"
    password => "password"
  }
  stdout {
    codec => rubydebug
  }
}
Note - I didn't create the index mapping myself; I let Kibana decide the index mapping (dynamic: true).
On the server I checked the total count of records in the database, which is 18,331,793
(select count(*) from table),
and I get the same count from the query in my Logstash pipeline.
But in Kibana I only see 18,303,447 records.
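One way to narrow down a gap like this (18,331,793 - 18,303,447 = 28,346 missing documents) is to compare per-bucket counts from the database (COUNT(*) ... GROUP BY a key range) against a range aggregation on the index, so the missing rows can be localised and inspected. A minimal sketch, with made-up bucket labels and counts:

```python
def missing_ranges(db_counts, es_counts):
    """Return the buckets where the database and index counts disagree,
    with the size of each gap. Bucket keys are whatever GROUP BY / range
    aggregation was used on both sides (e.g. primary-key ranges)."""
    return {bucket: db_counts[bucket] - es_counts.get(bucket, 0)
            for bucket in db_counts
            if db_counts[bucket] != es_counts.get(bucket, 0)}

# Hypothetical per-range counts from the database and from Elasticsearch.
db = {"0-1M": 1_000_000, "1M-2M": 1_000_000, "2M-3M": 331_793}
es = {"0-1M": 1_000_000, "1M-2M": 971_654, "2M-3M": 331_793}

print(missing_ranges(db, es))  # the gap is confined to one bucket
```

Once the gap is localised, the usual suspects are documents rejected by dynamic mapping conflicts (these show up as errors in the Logstash log) or Kibana's time picker filtering out documents; the _count API on the index gives an unfiltered total to compare against.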

push data from the last updated_on date

So I'm using the below query in my Logstash JDBC input.
statement => "SELECT * from mydata WHERE updated_on > :sql_last_value ORDER BY updated_on"
use_column_value => true
tracking_column => updated_on
tracking_column_type => "#timestamp"
or
statement => "SELECT * from mydata WHERE :sql_last_value > updated_on ORDER BY updated_on"
use_column_value => true
tracking_column => updated_on
tracking_column_type => "#timestamp"
Here, my :sql_last_value is taken as the last run time of the config file.
Example:
"updated_on": "2019-09-26T08:11:00.000Z",
"#timestamp": "2019-09-26T08:17:52.974Z"
Here my sql_last_value corresponds to #timestamp,
but I want it to consider updated_on instead.
How do I change it to use the last updated_on date instead of the execution time?
So this is your current configuration:
statement => "SELECT * from agedata WHERE updated_on > :sql_last_value ORDER BY updated_on"
use_column_value => true
tracking_column => updated_on
tracking_column_type => "timestamp"
What it says is basically that the sql_last_value variable will store the last value of the updated_on column from the previous run, since use_column_value is true (it is not the last run time as you suggest; that would be the behaviour if use_column_value were false).
So your configuration is already doing what you expect.
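The distinction can be sketched as a toy model (a hypothetical helper, not the plugin's actual code):

```python
from datetime import datetime

def next_sql_last_value(rows, run_time, use_column_value):
    """Model of how the jdbc input advances :sql_last_value after a run.

    rows: the tracking-column (updated_on) values seen in this run.
    run_time: wall-clock time when the run executed.
    """
    if use_column_value:
        # Remember the last value of the tracking column itself.
        return max(rows) if rows else None
    # Otherwise remember when the run happened.
    return run_time

rows = [datetime(2019, 9, 26, 8, 10), datetime(2019, 9, 26, 8, 11)]
run_time = datetime(2019, 9, 26, 8, 17, 52)

print(next_sql_last_value(rows, run_time, use_column_value=True))   # 08:11, the last updated_on
print(next_sql_last_value(rows, run_time, use_column_value=False))  # 08:17:52, the run time
```

With the asker's settings (use_column_value => true, tracking_column => updated_on), the first branch applies, which is exactly the behaviour they are asking for.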

Linq select a record based on child table data

I am trying to select orders based on their current status, which is stored in another table, but I keep getting all orders instead of orders filtered by current status. When status is not empty, the query should filter on the latest status of each order, i.e. sort the status records in descending order of date and take the first one.
private IQueryable<order_tbl> CreateQuery(string status, int? orderId, DateTime? orderDate) {
    var query = from o in context.order_tbl select o;
    if (orderId.HasValue) query = query.Where(w => w.id.Equals(orderId.Value));
    if (orderDate.HasValue) query = query.Where(w => w.OrderDate.Equals(orderDate.Value));
    if (!string.IsNullOrEmpty(status)) {
        query = (from q in query
                 from s in q.order_status
                     .OrderByDescending(o => o.DateStatusUpdated)
                     .Take(1)
                     .Where(w => w.Status.Equals(status))
                 select q);
    }
    return query;
}
There are more fields in the tables which I omitted for brevity.
order_tbl

id   date       customerId
1    2/1/2018   6
2    2/3/2018   5
3    2/6/2018   3

order_status

id   orderId   DateStatusUpdated   status
1    1         2/1/2018            open
2    1         2/2/2018            filled
3    2         2/3/2018            open
4    2         2/4/2018            filled
5    3         2/6/2018            open
When searching only on 'open', the query returns orders 1, 2 and 3 instead of just order 3. What is wrong with the status part of the query?
This answer pointed me in the right direction: LINQ Query - Only get Order and MAX Date from Child Collection.
I modified my query as below.
if (!string.IsNullOrEmpty(status)) {
    query = query
        .SelectMany(s => s.order_status
            .OrderByDescending(o => o.DateStatusUpdated)
            .Take(1)
        )
        .Where(w => w.Status.Equals(status))
        .Select(s => s.order_tbl);
}
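The fixed query's logic (take each order's latest status row first, then filter on it) can be sketched outside LINQ against the sample tables above; the function name and dict layout here are illustrative, not part of the original code:

```python
def orders_with_current_status(orders, statuses, wanted):
    """For each order keep only its latest status row, then filter on it."""
    result = []
    for order_id in orders:
        rows = [s for s in statuses if s["orderId"] == order_id]
        if not rows:
            continue
        # Latest status = max by date (ISO strings sort chronologically).
        latest = max(rows, key=lambda s: s["date"])
        if latest["status"] == wanted:
            result.append(order_id)
    return result

orders = [1, 2, 3]
statuses = [
    {"orderId": 1, "date": "2018-02-01", "status": "open"},
    {"orderId": 1, "date": "2018-02-02", "status": "filled"},
    {"orderId": 2, "date": "2018-02-03", "status": "open"},
    {"orderId": 2, "date": "2018-02-04", "status": "filled"},
    {"orderId": 3, "date": "2018-02-06", "status": "open"},
]

print(orders_with_current_status(orders, statuses, "open"))  # → [3]
```

The original LINQ failed because the status filter was applied across all status rows rather than only to the single latest row per order; ordering and Take(1) must happen before the status comparison, as both the fix and this sketch do.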

Oracle markup into temporary table

I'm using ctx_doc.markup to highlight search results and insert them into a temporary table, then I retrieve the results from the temporary table. All of this runs in one transaction. However, the results get deleted from the temporary table (or are never inserted?) before I can retrieve them. If I use a normal table it works fine. Here's the query I'm using:
BEGIN
  FOR cur_rec IN (SELECT id FROM contents WHERE CONTAINS(text, 'test', 1) > 0)
  LOOP
    CTX_DOC.markup(
      index_name => 'I_CONTENTS_TEXT',
      textkey    => TO_CHAR(cur_rec.id),
      text_query => 'test',
      restab     => 'CONTENTS_MARKUP',
      query_id   => cur_rec.id,
      plaintext  => FALSE,
      tagset     => 'HTML_NAVIGATE');
  END LOOP;
END;
