Pull interlinked records based on rank and latest timestamp - oracle

I have a table like below.
myTable:
---------------------------------------------------------------------------------
id | ref | type | status | update_dt
---------------------------------------------------------------------------------
id1 | m1123 | 10 | 1 | 03-NOV-22 10.44.64.104000000 AM
id1 | m2123 | 10 | 2 | 03-NOV-22 10.44.64.104000000 AM
id1 | s1123 | 20 | | 03-NOV-22 10.44.64.104000000 AM
id1 | s2123 | 20 | | 03-NOV-22 10.44.54.104000000 AM
id1 | p1123 | 30 | | 03-NOV-22 10.44.54.104000000 AM
id2 | m1234 | 10 | | 02-NOV-22 10.44.64.104000000 AM
id2 | s1234 | 20 | | 02-NOV-22 10.44.54.104000000 AM
id2 | s2234 | 20 | | 02-NOV-22 10.44.54.104000000 AM
id3 | m1345 | 10 | 1 | 01-NOV-22 10.44.64.104000000 AM
id3 | s1345 | 20 | | 01-NOV-22 10.44.64.104000000 AM
id3 | s2345 | 20 | | 01-NOV-22 10.44.54.104000000 AM
---------------------------------------------------------------------------------
My requirement looks pretty complex to me, and I have gotten partway there but not all the way. Here are my requirements:
1. From the table, I have to pull records of type 10 and 20 alone, with type 10 records having a status of either null or 1.
2. For the type 10 comparison, I need to convert update_dt to epoch seconds and pull all the type 10 records above a specific epoch.
3. type 10 records are linked to type 20 records by the id; they have the same id.
4. For all the records pulled in step 2, I need to pull their corresponding type 20 records, but only the latest one based on update_dt.
5. If multiple records of type 20 have the same update_dt from step 4, any one of them can be picked.
With the above requirements, for a sample epoch that corresponds to Nov 1 2022 11 AM (1667300400), I need to get a result like:
-----------------------------------------------------------------------------------------------
ref1 | ref2 | ref1_update_dt | ref2_update_dt
-----------------------------------------------------------------------------------------------
m1123 | s1123 | 03-NOV-22 10.44.64.104000000 AM | 03-NOV-22 10.44.64.104000000 AM
m1234 | s2234 | 02-NOV-22 10.44.64.104000000 AM | 02-NOV-22 10.44.54.104000000 AM
-----------------------------------------------------------------------------------------------
I tried the below, but didn't quite get there.
WITH cte_latest AS
(
    SELECT
        t1.ref ref1,
        t2.ref ref2,
        t1.update_dt ref1_update_dt,
        t2.update_dt ref2_update_dt,
        RANK() OVER (ORDER BY t2.update_dt DESC) rank_temp
    FROM
        myTable t1
        JOIN myTable t2 ON t1.id = t2.id
    WHERE
        t1.type = 10
        AND (t1.status IS NULL OR t1.status = 1)
        AND t2.type = 20
        AND (CAST(t1.update_dt AS DATE) - TO_DATE('01/01/1970', 'DD/MM/YYYY')) * 24 * 60 * 60 > '1667300400'
)
SELECT
    ref1,
    ref2,
    ref1_update_dt,
    ref2_update_dt
FROM
    cte_latest
WHERE
    rank_temp = 1
ORDER BY
    ref1_update_dt;
Please help.

RANK will return the same number when there are multiple type 20 records that have the same update_dt, so you will want to use ROW_NUMBER instead. That ensures each type 20 row gets a unique number to break any ties, per rule #5.
You will also need to partition the ROW_NUMBER by the id of the type 10 records, which causes the numbering to reset at 1 for each type 10 record id. Without partitioning, every row in the result set would get a unique number:
ROW_NUMBER() OVER (PARTITION BY t1.id ORDER BY t2.update_dt DESC)
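Putting that together, a minimal sketch of the full query might look like this (it keeps your table and column names, compares the epoch value as a number rather than a string, and assumes update_dt can be treated as UTC for the epoch conversion):
WITH cte_latest AS
(
    SELECT
        t1.ref ref1,
        t2.ref ref2,
        t1.update_dt ref1_update_dt,
        t2.update_dt ref2_update_dt,
        -- number the type 20 rows per type 10 id, newest first;
        -- ties on update_dt are broken arbitrarily, per rule #5
        ROW_NUMBER() OVER (PARTITION BY t1.id ORDER BY t2.update_dt DESC) rn
    FROM
        myTable t1
        JOIN myTable t2 ON t1.id = t2.id
    WHERE
        t1.type = 10
        AND (t1.status IS NULL OR t1.status = 1)
        AND t2.type = 20
        -- days since 1970-01-01, times 86400, gives epoch seconds
        AND (CAST(t1.update_dt AS DATE) - DATE '1970-01-01') * 24 * 60 * 60 > 1667300400
)
SELECT
    ref1,
    ref2,
    ref1_update_dt,
    ref2_update_dt
FROM
    cte_latest
WHERE
    rn = 1
ORDER BY
    ref1_update_dt;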

Related

Presto, how to duplicate a record based on a time validity interval

I'm trying to do the transformation below in Presto:
From:
id | valid_from | valid_until | value
12 | 2021/02/17 | 2021/05/17 | 150
To:
id | date | value
12 | 2021/02/17 | 150
12 | 2021/03/17 | 150
12 | 2021/04/17 | 150
12 | 2021/05/17 | 150
Is it possible?
Thanks,
You can use the sequence function to generate an array between valid_from and valid_until with the desired step and then unnest it:
select id, format_datetime(date, 'yyyy/MM/dd') as date, value
from my_table
cross join unnest(
    sequence(parse_datetime(valid_from, 'yyyy/MM/dd'),
             parse_datetime(valid_until, 'yyyy/MM/dd'),
             interval '1' month)
) t(date)
See the docs for the sequence function: https://prestodb.io/docs/current/functions/array.html

When I select, only one column is checked without duplicates

I have 2 tables like this:
first table
+------------+---------------+--------+
| pk | user_one |user_two|
+------------+---------------+--------+
second table
+------------+---------------+--------+----------------+----------------+
| pk | sender |receiver|fk of firsttable|content |
+------------+---------------+--------+----------------+----------------+
The first and second tables have a one-to-many (1:N) relation.
There are many records in the second table:
| pk | sender|receiver|fk of firsttable|content |
|120 |car224 |car223 |1 |test message1 to 223
|121 |car224 |car223 |1 |test message2 to 223
|122 |car224 |car225 |21 |test message1 to 225
|123 |car224 |car225 |21 |test message2 to 225
|124 |car224 |car225 |21 |test message3 to 225
|125 |car224 |car225 |21 |test message4 to 225
I need to find rows where fk has the same value, and among them I want the row with the largest pk.
I've changed the column names above to make it easier to understand.
Here is the actual SQL I've tried so far:
select *
from (select rownum rn,
             mr.mrno,
             mr.user_one,
             mr.user_two,
             m.mno,
             m.content
      from tbl_messagerelation mr,
           tbl_message m
      where (mr.user_one = 'car224' or mr.user_two = 'car224')
        and m.rowid in (select max(rowid)
                        from tbl_message
                        group by m.mno)
        and rownum <= 1*20)
where rn > (1-1) * 20
And this is the result:
+---------+-------+----------+----------+-------------------------+----------------------+
| rn | mrno | user_one | user_two | mno(pk of second table) | content |
+---------+-------+----------+----------+-------------------------+----------------------+
| 1 | 1 | car224 | car223 | 125 | test message4 to 225 |
| 2 | 21 | car224 | car225 | 125 | test message4 to 225 |
+---------+-------+----------+----------+-------------------------+----------------------+
My desired result is something like this:
+---------+---------+----------+--------------------+----------------------+
| fk | sender | receiver | pk of second table | content |
+---------+---------+----------+--------------------+----------------------+
| 1 | car224 | car223 | 121 | test message2 to 223 |
| 21      | car224  | car225   | 125                | test message4 to 225 |
+---------+---------+----------+--------------------+----------------------+
Your table description, when compared to your query, confuses me. However, what I could understand is that you are probably looking for row_number().
One important piece of advice: use standard explicit JOIN syntax rather than the outdated comma-separated syntax. The join keys were not clear to me, so replace the ? placeholders appropriately in your final query.
select *
from (
    select mr.*, m.*,
           row_number() over (partition by m.fk order by m.pk desc) as rn
    from tbl_messagerelation mr
    join tbl_message m on mr.? = m.?
)
where rn = 1
Or perhaps you don't need that join at all
select *
from (
    select m.*,
           row_number() over (partition by m.fk order by m.pk desc) as rn
    from tbl_message m
)
where rn = 1

Hive Contiguous Date Ranges

I am using Hive and I would like to take a table with a historical list of customers, subscription events, and subscription types and summarize by contiguous runs of subscription types for each customer.
Example Input (db.cust_hist):
customer_id | eff_dt | exp_dt | sub_cd | sub_type
---------------------------------------------------------
1 | 02/01/2015 | 03/01/2015 | active | A
1 | 03/01/2015 | 04/01/2015 | active | A
1 | 03/15/2015 | 12/31/9999 | cancel | A
1 | 04/01/2015 | 05/01/2015 | active | A
1 | 05/01/2015 | 06/01/2015 | active | A
1 | 02/01/2015 | 03/01/2015 | active | B
1 | 03/01/2015 | 04/01/2015 | active | B
The sub_cd in this case refers to the type of event that is effective over the date range for that row. For example, the user canceled their A subscription type on 3/15 and resumed on 4/01.
The output I'm trying to get looks like this (db.cust_snapshot):
customer_id | eff_dt | exp_dt | sub_type
------------------------------------------------
1 | 02/01/2015 | 03/15/2015 | A
1 | 04/01/2015 | 06/01/2015 | A
1 | 02/01/2015 | 04/01/2015 | B
and reflects the gap in coverage.
From what I have read in this link from BetterAtOracle (specific to SQL), which does a very good job of laying things out, I need to use row numbers and a lagging window, but I can't seem to apply it to my situation in Hive (perhaps because of the 12/31/9999 notation/subscription code?).
I tried:
SELECT customer_id
     , eff_dt
     , exp_dt
     , sub_cd
     , sub_type
     , CASE WHEN DATEDIFF(TO_DATE(eff_dt),
                          TO_DATE(LAG(exp_dt) OVER (PARTITION BY customer_id, sub_type ORDER BY eff_dt))) <= 1
            THEN NULL
            ELSE ROW_NUMBER() OVER (PARTITION BY customer_id, sub_type ORDER BY eff_dt)
       END AS grp
FROM db.cust_hist
ORDER BY TO_DATE(eff_dt)
As you can see, I haven't applied the subscription event code. This sort of gets me there as I can start to see different groups based on subscription type, but I feel like I'm stuck from here on out.
Any help or pointers would be greatly appreciated. Before this task, I never understood the true power of ranks, rows, lag, and other window functions!
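Since no answer is posted here, below is a minimal sketch of one way to attack this in Hive, under two assumptions: a run is only ever broken by a cancel row (no silent gaps between consecutive active rows), and the cancel row's eff_dt is where the previous run ends. It counts the cancel rows seen so far within each (customer_id, sub_type) to assign a run id, then aggregates each run:
WITH evt AS (
    -- normalize the MM/dd/yyyy strings into sortable yyyy-MM-dd strings
    SELECT customer_id, sub_type, sub_cd,
           from_unixtime(unix_timestamp(eff_dt, 'MM/dd/yyyy'), 'yyyy-MM-dd') AS eff_d,
           from_unixtime(unix_timestamp(exp_dt, 'MM/dd/yyyy'), 'yyyy-MM-dd') AS exp_d
    FROM db.cust_hist
),
runs AS (
    -- rows get a new run id once a cancel appears strictly before them
    SELECT evt.*,
           COALESCE(SUM(CASE WHEN sub_cd = 'cancel' THEN 1 ELSE 0 END)
                        OVER (PARTITION BY customer_id, sub_type ORDER BY eff_d
                              ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING),
                    0) AS run_id
    FROM evt
)
SELECT customer_id,
       MIN(CASE WHEN sub_cd = 'active' THEN eff_d END) AS eff_dt,
       -- a run ending in a cancel expires on the cancel date,
       -- otherwise on the last active row's exp_dt
       COALESCE(MAX(CASE WHEN sub_cd = 'cancel' THEN eff_d END),
                MAX(CASE WHEN sub_cd = 'active' THEN exp_d END)) AS exp_dt,
       sub_type
FROM runs
GROUP BY customer_id, sub_type, run_id;
Against the sample input this produces the three db.cust_snapshot rows above; data containing gaps that are not marked by a cancel event would need the LAG-based island detection from the linked article instead.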

Ignore nulls in SQL

I'm using Oracle SQL and I need some help with a query. I have the following table:
Time | Type | Value_A | Value_b | Loaded time sequence
11:00:37 | A | Null | 30 | 1
11:00:37 | A | 40 | Null | 2
11:00:37 | B | Null | 20 | 3
11:00:37 | B | Null | 50 | 4
11:00:38 | C | 50 | Null | 5
11:00:38 | D | Null | 30 | 6
11:00:38 | D | 10 | Null | 7
11:00:38 | D | Null | 5 | 8
For each Type I want to take the last loaded values of Value_A and Value_B which are not NULL. Here is the expected output table:
Time | Type | Value_A | Value_b
11:00:37 | A | 40 | 30
11:00:37 | B | Null | 50
11:00:38 | C | 50 | Null
11:00:38 | D | 10 | 5
Please advise.
Your test data suggests that TIME and TYPE are linked, and that the values always rise with time; in that case this solution will work:
select Time
     , Type
     , max(Value_A) as Value_A
     , max(Value_B) as Value_B
from your_table
group by Time
       , Type;
However, I think your posted data is unlikely to be representative, so you'll need a more sophisticated solution. This solution uses the LAST_VALUE() analytic function:
select distinct Time
     , Type
     , last_value(Value_A ignore nulls)
         over (partition by time, type order by Loaded_time_sequence
               ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) as Value_A
     , last_value(Value_B ignore nulls)
         over (partition by time, type order by Loaded_time_sequence
               ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) as Value_B
from t23;
Here is the SQL Fiddle (although the site appears to be broken at the moment).
This may still not be the most correct answer. It depends on the actual data. For instance, what should happen when you have an entry for TYPE=A at TIME=11:00:39 ?
I would think it would be as simple as aggregating with MAX():
SELECT time, type, MAX(value_a) AS value_a, MAX(value_b) AS value_b
FROM mytable
GROUP BY time, type;
If time is a TIMESTAMP then you may want to group by CAST(time AS DATE), which drops the fractional seconds (TRUNC has no seconds format model).

Get records from multiple Hive tables without join

I have 2 tables:
Table1 desc:
count int
Table2 desc:
count_val int
I get the fields count and count_val from the above tables and insert them into another audit table (table3).
Table3 desc:
count int
count_val int
I am trying to log the record count of these 2 tables into the audit table for each job run.
Any suggestions are appreciated. Thanks!
If you want just aggregations (like sums), the solution comes with the use of UNION:
INSERT INTO TABLE audit
SELECT
    SUM(count),
    SUM(count_val)
FROM (
    SELECT
        t1.count,
        0 as count_val
    FROM table1 t1
    UNION ALL
    SELECT
        0 as count,
        t2.count_val
    FROM table2 t2
) unioned;
Otherwise a join is required, because you have to somehow match up your rows; that's how relational algebra (the theory behind SQL) works. With no join key, which count would pair with which count_val?
==table1==
| count|
|------|
| 12 |
| 751 |
| 167 |
===table2===
| count_val|
|----------|
| 1991 |
| 321 |
| 489 |
| 7201 |
| 3906 |
===audit===
| count | count_val|
|-------|----------|
| ??? | ??? |
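That said, if what you want to log is just the number of records in each table, each COUNT(*) aggregate yields exactly one row, so the two can be paired with a cross join and no matching problem arises. A sketch, reusing the table names above:
INSERT INTO TABLE audit
SELECT c1.cnt, c2.cnt_val
FROM (SELECT COUNT(*) AS cnt FROM table1) c1
CROSS JOIN (SELECT COUNT(*) AS cnt_val FROM table2) c2;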
