Hive - Remove duplicates, keeping newest record - all of it [duplicate] - hadoop

There have been a few questions like this, with no answer, like this one here.
I thought I would post another in hopes of getting one.
I have a Hive table with duplicate rows. Consider the following example:
ID Date value1 value2
1001 20160101 alpha beta
1001 20160201 delta gamma
1001 20160115 rho omega
1002 20160101 able charlie
1002 20160101 able charlie
When complete, I only want two records. Specifically, these two:
ID Date value1 value2
1001 20160201 delta gamma
1002 20160101 able charlie
Why those two? For ID=1001, I want the latest date along with the rest of the data in that row. For ID=1002 it's really the same answer, except the two records with that ID are complete duplicates, and I only want one of them.
So, any suggestions on how to do this? A simple "group by" on the ID with 'max' on the date won't work, as that ignores the other columns. I cannot put 'max' on those as well, because it would pull each column's max independently across all the records (it would pull 'rho' from an older record), which is not good.
I hope my explanation is clear, and I appreciate any insight.
Thank you

Try this:
WITH temp_cte AS (
SELECT mt.ID AS ID
, mt.Date AS Date
, mt.value1 AS value1
, mt.value2 AS value2
, ROW_NUMBER() OVER (PARTITION BY mt.ID ORDER BY mt.Date DESC) AS row_num
FROM my_table mt
)
SELECT tc.ID AS ID
, tc.Date AS Date
, tc.value1 AS value1
, tc.value2 AS value2
FROM temp_cte tc
WHERE tc.row_num = 1
;
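Note that for ID=1002, the two rows are exact duplicates, so ROW_NUMBER arbitrarily assigns them 1 and 2 within the partition; exactly one of them survives the row_num = 1 filter, which is what you want here.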
Or you can do MAX() and join the table to itself where ID = ID and max_date = Date. HTH.
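A minimal sketch of that self-join alternative (untested, against the same my_table; the DISTINCT is what collapses the exact duplicates for ID=1002):
WITH max_dates AS (
SELECT ID
, MAX(Date) AS max_date
FROM my_table
GROUP BY ID
)
SELECT DISTINCT mt.ID
, mt.Date
, mt.value1
, mt.value2
FROM my_table mt
JOIN max_dates md
ON md.ID = mt.ID
AND md.max_date = mt.Date
;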
Edit March 2022:
Since ROW_NUMBER numbers every row, and we only care about the single row per ID with the max date, there is a better way to do this that I discovered:
WITH temp_cte AS (
SELECT mt.ID AS ID
, MAX(NAMED_STRUCT('Date', mt.Date, 'Value1', mt.value1, 'Value2', mt.Value2)) AS my_struct
FROM my_table mt
GROUP BY mt.ID
)
SELECT tt.ID AS ID
, tt.my_struct.Date AS Date
, tt.my_struct.Value1 AS Value1
, tt.my_struct.Value2 AS Value2
FROM temp_cte tt
;
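This works because Hive compares structs field by field in declaration order, so MAX picks the struct with the greatest Date, and Value1/Value2 simply ride along. Putting Date first in the NAMED_STRUCT is therefore essential.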

Related

Lag() Function in SQLiteStudio

I want to return the last transaction date per CustomerID, and I am using SQLiteStudio 3.2.1. My table looks like this:
CustomerID Date TransactionID Amount
1 2000-07-01 1 20.00
2 2000-07-04 2 40.00
1 2002-08-01 3 20.00
1 2007-01-01 4 60.00
2 2010-05-09 5 70.00
1 2012-06-25 6 35.00
This is what I would like the end result to look like:
CustomerID Date TransactionID Amount Last Transaction Date
1 2000-07-01 1 20.00 NULL
2 2000-07-04 2 40.00 NULL
1 2002-08-01 3 20.00 2000-07-01
1 2007-01-01 4 60.00 2002-08-01
2 2010-05-09 5 70.00 2000-07-04
1 2012-06-25 6 35.00 2007-01-01
I was attempting to use the following code:
SELECT CustomerID, Date, Amount, LAG(Date,1) OVER (PARTITIONED BY CustomerID ORDER BY Date)
FROM table
However, the LAG function is not supported in SQLiteStudio (or maybe I am missing something?), and the SQL editor is not recognizing the PARTITION BY clause either. Is there a way to use the LAG function or the PARTITION BY clause in the SQL Function Editor? Any help would be greatly appreciated! Thanks!
Also: does anyone have any resources for aggregate function creation in the SQL Function Editor for SQLiteStudio? I know it takes the three parameters of "Initialization code", "Per step code", and "Final step implementation code", but I am looking for examples of the syntax/requirements for these three parameters in SQLiteStudio. (Thanks again!)
Your partition clause, as pasted above, has a typo: it should be PARTITION BY, not PARTITIONED BY. If this is the only problem, then just fix the typo:
SELECT CustomerID, Date, Amount,
LAG(Date) OVER (PARTITION BY CustomerID
ORDER BY Date) AS "Last Transaction Date"
FROM yourTable
ORDER BY Date;
If the above still does not work, then perhaps your version of SQLite does not support LAG. One workaround in this case would be to use a correlated subquery in place of LAG:
SELECT CustomerID, Date, Amount,
(SELECT t2.Date
FROM yourTable t2
WHERE t2.CustomerID = t1.CustomerID AND
t2.TransactionID < t1.TransactionID
ORDER BY t2.TransactionID DESC
LIMIT 1) AS "Last Transaction Date"
FROM yourTable t1
ORDER BY Date;
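For what it's worth, SQLite itself gained window functions (including LAG and OVER (PARTITION BY ...)) in version 3.25.0, so a SQLiteStudio build that bundles a newer SQLite engine should accept the first query as written. Note also that the correlated subquery steps back by TransactionID rather than Date; for this table the two orderings agree, since TransactionID increases with Date.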

Oracle: Update values in table with aggregated values from same table

I am looking for a possibly better approach to this.
I have created a temp table in Oracle 11.2 that I'm using to pre-calculate values that I will need in other selects, instead of regenerating them with each select.
create global temporary table temp_foo (
DT timestamp(6), --only the date part will be used in this example but for later things I will need the time
Something varchar2(100),
Customer varchar2(100),
MinDate timestamp(6),
MaxDate timestamp(6),
Filecount int,
Errorcount int,
AvgFilecount int,
constraint PK_foo primary key (DT, Customer)
) on commit preserve rows;
I then insert some fixed values for everything except AvgFilecount. AvgFilecount should contain the average of the Filecount over the 3 previous records (going by the date in DT). It doesn't matter that the result will be converted to an int; I don't need the decimal places.
DT | Customer | Filecount | AvgFilecount
2019-04-30 | x | 10 | avg(2+3+9)
2019-04-29 | x | 2 | based on values before this
2019-04-28 | x | 3 | based on values before this
2019-04-27 | x | 9 | based on values before this
I thought about using a normal UPDATE statement, as this should be faster than looping through the values. I should mention that there are no gaps in the DT field, but obviously there is a first record, for which I won't find any previous records. If I looped through, I could easily calculate AvgFilecount with (the record before the previous record/2 + previous record)/3, which I cannot do with UPDATE, as I cannot guarantee the order in which rows are processed. So I'm fine with just taking the last 3 records (going by DT) and calculating it from there.
What I thought would be an easy update is giving me headaches. I mostly do SQL Server, where I would just join the 3 other records, but it seems to be a bit different in Oracle. I found https://stackoverflow.com/a/2446834/4040068 and wanted to use the second approach from that answer.
update
(select curr.DT, curr.Customer, curr.Filecount, curr.AvgFilecount as OLD, (coalesce(Minus1.Filecount, 0) + coalesce(Minus2.Filecount, 0) + coalesce(Minus3.Filecount, 0)) / 3 as NEW
from temp_foo curr
left join temp_foo Minus1 ON Minus1.Customer = curr.Customer and trunc(Minus1.DT) = trunc(curr.DT-1)
left join temp_foo Minus2 ON Minus2.Customer = curr.Customer and trunc(Minus2.DT) = trunc(curr.DT-2)
left join temp_foo Minus3 ON Minus3.Customer = curr.Customer and trunc(Minus3.DT) = trunc(curr.DT-3)
order by 1, 2
)
set OLD = NEW;
Which gives me an
ORA-01779: cannot modify a column which maps to a non key-preserved
table
01779. 00000 - "cannot modify a column which maps to a non key-preserved table"
*Cause: An attempt was made to insert or update columns of a join view which
map to a non-key-preserved table.
*Action: Modify the underlying base tables directly.
I thought this should work, as both join conditions are on primary key columns and thus unique (though since the joins compare trunc(DT) rather than DT itself, perhaps Oracle cannot use the primary key to prove the view is key-preserved). I am currently implementing the first approach in the above-mentioned answer, but it is getting quite big, and it feels like there should be a better solution to this.
Other things I thought about trying:
using a nested subselect (nested because Oracle doesn't have top(n) and I need to sort the subselect) to select the previous 3 records ordered by DT, then an outer select with rownum <= 3, and then I could just use AVG(). However, I was told subselects can be quite slow and that joins perform better in Oracle. I don't know if that is really the case; I haven't done any testing.
Edit: My insert right now looks like this. I am already aggregating the Filecount for a day as there can be multiple records per DT per Customer per Something.
insert into temp_foo (DT, Something, Customer, Filecount)
select dates.DT, tbl1.Something, tbl1.Customer, coalesce(sum(tbl3.Filecount),0)
from table(Function_Returning_Daterange(NULL, NULL)) dates
cross join
(SELECT Something,
Code,
Value,
Customer
FROM Table2 tbl2
WHERE (Something = 'Value')) tbl1
left outer join Table3 tbl3
on tbl3.Customer = tbl1.Customer
and trunc(tbl3.MinDate) = trunc(dates.DT)
group by dates.DT, tbl1.Something, tbl1.Customer;
You could use an analytic average with a window clause:
select dt, customer, filecount,
avg(filecount) over (partition by customer order by dt
rows between 3 preceding and 1 preceding) as avgfilecount
from tmp_foo
order by dt desc;
DT CUSTOMER FILECOUNT AVGFILECOUNT
---------- -------- ---------- ------------
2019-04-30 x 10 4.66666667
2019-04-29 x 2 6
2019-04-28 x 3 9
2019-04-27 x 9
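The frame rows between 3 preceding and 1 preceding deliberately excludes the current row, so each average covers only the up-to-three previous days; the oldest day has no preceding rows, hence its NULL.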
and then do the update part with a merge statement:
merge into tmp_foo t
using (
select dt, customer,
avg(filecount) over (partition by customer order by dt
rows between 3 preceding and 1 preceding) as avgfilecount
from tmp_foo
) s
on (s.dt = t.dt and s.customer = t.customer)
when matched then update set t.avgfilecount = s.avgfilecount;
4 rows merged.
select dt, customer, filecount, avgfilecount
from tmp_foo
order by dt desc;
DT CUSTOMER FILECOUNT AVGFILECOUNT
---------- -------- ---------- ------------
2019-04-30 x 10 4.66666667
2019-04-29 x 2 6
2019-04-28 x 3 9
2019-04-27 x 9
Your edit shows the original insert statement; it might be possible to add the analytic calculation to that and avoid the separate update step.
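A sketch of that idea (untested; it assumes the tbl1 subquery exposes Customer, as corrected above), wrapping the aggregate select in an outer query that adds the analytic column:
insert into temp_foo (DT, Something, Customer, Filecount, AvgFilecount)
select DT, Something, Customer, Filecount,
       avg(Filecount) over (partition by Customer order by DT
                            rows between 3 preceding and 1 preceding) as AvgFilecount
from (
    select dates.DT, tbl1.Something, tbl1.Customer,
           coalesce(sum(tbl3.Filecount), 0) as Filecount
    from table(Function_Returning_Daterange(NULL, NULL)) dates
    cross join (SELECT Something, Code, Value, Customer
                FROM Table2 tbl2
                WHERE (Something = 'Value')) tbl1
    left outer join Table3 tbl3
      on tbl3.Customer = tbl1.Customer
     and trunc(tbl3.MinDate) = trunc(dates.DT)
    group by dates.DT, tbl1.Something, tbl1.Customer
);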
Also, if you want the first two date values to be calculated as if the 'missing' extra days before them had zero counts, you could use sum and division instead of avg:
select dt, customer, filecount,
sum(filecount) over (partition by customer order by dt
rows between 3 preceding and 1 preceding)/3 as avgfilecount
from tmp_foo
order by dt desc;
DT CUSTOMER FILECOUNT AVGFILECOUNT
---------- -------- ---------- ------------
2019-04-30 x 10 4.66666667
2019-04-29 x 2 4
2019-04-28 x 3 3
2019-04-27 x 9
It depends what you expect those last calculated values to be.

Consolidate rows

I'm trying to cut down on the rows a report has. There are 2 assets returned by this query, but I want them to show up on one row.
Basically, if dc.name LIKE '%CT/PT%' then I want it to be on the same row as the asset. SP.SVC_PT_ID is the common field to join them.
There will be times when there is no dc.name LIKE '%CT/PT%'; however, I still want the DV.MFG_SERIAL_NUM to be populated, just with a NULL to the right.
Select SP.SVC_PT_ID, SP.DEVICE_ID, DV.MFG_SERIAL_NUM, dc.name,
substr(dc.name,26)
From EIP.SVC_PT_DEVICE_REL SP,
eip.device_class dc,
EIP.DEVICE DV
Where SP.EFF_START_TIME < To_date('20170930', 'YYYYMMDD') + 1
and SP.EFF_END_TIME is null
and dc.id = DV.device_class_id
and DV.ID = SP.device_id
ORDER BY SP.SVC_PT_ID, DV.MFG_SERIAL_NUM;
I'm not sure I understand what you are saying; a test case would certainly help. You said that the query you posted returns two rows (it would help if we saw which ones), but you want them displayed as in the image you attached to the message.
Generally speaking, you can do that using an aggregate function (such as MAX) on certain column(s), along with the GROUP BY clause that contains the rest of them.
Just for example:
select svc_pt_id, max(ctpt_name) ctpt_name, sum(ctpt_multipler) ctpt_multipler
from ...
group by svc_pt_id
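Applied to the query in the question, that might look like this (a sketch, untested; it assumes the non-CT/PT row carries the asset's serial number):
Select SP.SVC_PT_ID,
       MAX(CASE WHEN dc.name NOT LIKE '%CT/PT%' THEN SP.DEVICE_ID END) AS DEVICE_ID,
       MAX(CASE WHEN dc.name NOT LIKE '%CT/PT%' THEN DV.MFG_SERIAL_NUM END) AS MFG_SERIAL_NUM,
       MAX(CASE WHEN dc.name LIKE '%CT/PT%' THEN substr(dc.name, 26) END) AS ctpt_value
From EIP.SVC_PT_DEVICE_REL SP,
     eip.device_class dc,
     EIP.DEVICE DV
Where SP.EFF_START_TIME < To_date('20170930', 'YYYYMMDD') + 1
  and SP.EFF_END_TIME is null
  and dc.id = DV.device_class_id
  and DV.ID = SP.device_id
Group by SP.SVC_PT_ID
Order by SP.SVC_PT_ID;
Service points with no CT/PT row simply get NULL in ctpt_value, as requested.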
As I said, a test case would help people who want to answer the question. True, someone might have understood it far better than I did and will provide assistance nevertheless.
EDIT: after you posted sample data (which, by the way, don't match the screenshot you posted previously), maybe something like this might do the job: use an analytic function to check whether the name contains CT/PT; if so, take that row's data; otherwise, display both rows.
SQL> with test as (
2 select 14 svc_pt_id, 446733 device_id, 'Generic Electric' name from dual union
3 select 14, 456517, 'Generic CT/PT, Multiplier' from dual
4 ),
5 podaci as
6 (select svc_pt_id, device_id, name,
7 rank() over (partition by svc_pt_id
8 order by case when instr(name, 'CT/PT') > 0 then 1
9 else 2
10 end) rnk
11 from test
12 )
13 select svc_pt_id, device_id, name
14 from podaci
15 where rnk = 1;
SVC_PT_ID DEVICE_ID NAME
---------- ---------- -------------------------
14 456517 Generic CT/PT, Multiplier
SQL>
My TEST table (created by the WITH factoring clause) would be the result of your current query.

Aggregate only new rows from source table

I have a source table with a timestamp column (YYYY.MM.DD HH24:MI:SS) and a target table with rows aggregated on a daily basis (date column: YYYY.MM.DD).
My problem is: how do I bring new data from the source to the target and aggregate it?
I tried:
select
a.Sales,
trunc(a.timestamp,'DD') as TIMESTAMP,
count(1) as COUNT
from
tbl_Source a
where trunc(a.timestamp,'DD') > nvl((select MAX(b.TIME_TO_DAY)from tbl_target b), to_date('01.01.1975 00:00:00','dd.mm.yyyy hh24:mi:ss'))
group by a.sales,
trunc(a.Timestamp,'DD')
The problem with that: when I have a row with timestamp '2013.11.15 00:01:32' and the max day in the target is the 14th of November, it will only aggregate the 15th. If I used >= instead of >, some rows would get loaded twice.
It looks like you are looking for a merge statement: if the day is already present in tbl_target, then update the count; otherwise insert the record.
merge into tbl_target dest
using
(
select sales, trunc(timestamp) as theday , count(*) as sales_count
from tbl_Source
where trunc(timestamp) >= ( select nvl(max(time_to_day),to_date('01.01.1975','dd.mm.yyyy')) from tbl_target )
group by sales, trunc(timestamp)
) src
on (src.theday = dest.time_to_day)
when matched then update set
dest.sales_count = src.sales_count
when not matched then
insert (time_to_day, sales_count)
values (src.theday, src.sales_count)
;
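Note how this handles the boundary day: the day equal to max(time_to_day) is re-aggregated from the source (hence the >=), and the matched branch overwrites its count rather than inserting a second row, so nothing is double-counted.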
As far as I understand your question: you need to get everything since the last load into the target table.
The problem here: you need that exact cut-off time, but it was truncated during the load.
If my guesses are correct, you cannot do anything except store the date of the load as an additional column, because there is no way to get it back from the data presented here.
About your query:
count(*) and count(1) are the same in performance (proved many times, at least in versions 10-11) - do not write count(1); it looks really ugly
do not use nvl; use coalesce instead - it is much faster
I would write your query like this:
with t as (select max(b.time_to_day) mx from tbl_target b)
select a.sales,trunc(a.timestamp,'dd') as timestamp,count(*) as count
from tbl_source a,t
where trunc(a.timestamp,'dd') > t.mx or t.mx is null
group by a.sales,trunc(a.timestamp,'dd')
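The cross join to the single-row CTE t makes the max available to every source row, and the or t.mx is null branch covers the case of an empty target table.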
Does this fit your needs?
WHERE trunc(a.timestamp,'DD') > nvl((select MAX(b.TIME_TO_DAY) + 1 - 1/(24*60*60) from tbl_target b), to_date('01.01.1975 00:00:00','dd.mm.yyyy hh24:mi:ss'))
i.e. instead of 2013-11-15 00:00:00 compare to 2013-11-15 23:59:59
Update:
This one?
WHERE trunc(a.timestamp,'DD') BETWEEN nvl((select MAX(b.TIME_TO_DAY) from ...) AND nvl((select MAX(b.TIME_TO_DAY) + 1 - 1/(24*60*60) from ...)

Grouping data by date ranges

I wonder how I can select a range of data depending on the date range.
I have this data in my payment table, with dates in format dd/mm/yyyy:
Id Date Amount
1 4/1/2011 300
2 10/1/2011 200
3 27/1/2011 100
4 4/2/2011 300
5 22/2/2011 400
6 1/3/2011 500
7 1/1/2012 600
The closing date is on the 27th of every month, so I would like to group all the data from the 27th until the 26th of the next month into one group.
That is to say, I would like the output to look like this:
Group 1
1 4/1/2011 300
2 10/1/2011 200
Group 2
1 27/1/2011 100
2 4/2/2011 300
3 22/2/2011 400
Group 3
1 1/3/2011 500
Group 4
1 1/1/2012 600
It's not clear what the context of your question is. Are you querying a database?
If so, you are asking about datetimes, but it seems you have a column in string format.
First of all, convert your data to a datetime data type (or some equivalent; which DB engine are you using?), and then use a grouping criterion like this:
GROUP BY datepart(month, dateadd(day, -26, [datefield])), DATEPART(year, dateadd(day, -26, [datefield]))
EDIT:
So, you are in Linq?
Different language, same logic:
.GroupBy(x => DateTime
.ParseExact(x.Date, "dd/MM/yyyy", CultureInfo.InvariantCulture) // assuming your date field is of string data type
.AddDays(-26)
.ToString("yyyyMM"));
If you are going to do this frequently, it would be worth investing in a table that assigns a unique identifier to each month and the start and end dates:
CREATE TABLE MonthEndings
(
MonthID INTEGER NOT NULL PRIMARY KEY,
StartDate DATE NOT NULL,
EndDate DATE NOT NULL
);
INSERT INTO MonthEndings VALUES(201101, '27/12/2010', '26/01/2011');
INSERT INTO MonthEndings VALUES(201102, '27/01/2011', '26/02/2011');
INSERT INTO MonthEndings VALUES(201103, '27/02/2011', '26/03/2011');
INSERT INTO MonthEndings VALUES(201112, '27/11/2011', '26/12/2011');
INSERT INTO MonthEndings VALUES(201201, '27/12/2011', '26/01/2012');
You can then group accurately using:
SELECT M.MonthID, P.Id, P.Date, P.Amount
FROM Payments AS P
JOIN MonthEndings AS M ON P.Date BETWEEN M.StartDate and M.EndDate
ORDER BY M.MonthID, P.Date;
Any group headings etc are best handled out of the DBMS - the SQL gets you the data in the correct sequence, and the software retrieving the data presents it to the user.
If you can't translate SQL to LINQ, that makes two of us. Sorry, I have never used LINQ, so I've no idea what is involved.
SELECT *, CASE WHEN datepart(day,date)<27 THEN datepart(month,date)
ELSE datepart(month,date) % 12 + 1 END as group_name
FROM payment
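One caveat (my addition): grouping by month alone puts 4/1/2011 and 1/1/2012 into the same group. Reusing the dateadd(day, -26, ...) idea from the first answer keeps the years apart; a sketch, assuming the date column is (or has been converted to) a datetime:
SELECT *,
       datepart(year, dateadd(day, -26, date)) * 100
         + datepart(month, dateadd(day, -26, date)) AS group_name
FROM payment;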
