Impala Query to get next date - hadoop

I have 2 Impala tables.
1st table T1 (additional columns are there but I am interested in only date and day type as weekday):
date day_type
04/01/2020 Weekday
04/02/2020 Weekday
04/03/2020 Weekday
04/04/2020 Weekend
04/05/2020 Weekend
04/06/2020 Weekday
2nd table T2:
process date status
A 04/01/2020 finished
A 04/02/2020 finished
A 04/03/2020 finished
A 04/03/2020 run_again
Using Impala queries I have to get the maximum date from second table T2 and get its status. According to the above table 04/03 is the maximum date.
If the status is finished on 04/03, then my query should return the next available weekday date from T1 which is 04/06/2020.
But if the status is run_again, then the query should return the same date.
In the above table, 04/03 has run_again and when my query runs the output should be 04/03/2020 and not 04/06/2020.
Please note that more than one status is possible for a date. For example, 04/03/2020 can have one row with finished as status and another with run_again as status. In this case run_again should be prioritized and the query should give 04/03/2020 as the output date.
What I tried so far:
I ran a subquery on the second table and got the maximum date and its status. I then tried a CASE expression in my main query with T1 as a subselect inside it, but it's not working.
Is it possible to achieve this through Impala query?

One way to do this is to create a CTE from table T1 instead of a correlated subquery. Something like:
WITH T3 as (
select t.date date, min(x.date) next_workday
from T1 t join T1 x
on t.date < x.date
where x.day_type = 'Weekday'
group by t.date
)
select T2.process, T2.date run_date, T2.status,
case when T2.status = 'finished' then T3.next_workday
else T3.date
end next_run_date
from T2 join T3
on T2.date = T3.date
order by T2.process, T2.date;
+---------+------------+-----------+---------------+
| process | run_date | status | next_run_date |
+---------+------------+-----------+---------------+
| A | 2020-04-01 | finished | 2020-04-02 |
| A | 2020-04-02 | finished | 2020-04-03 |
| A | 2020-04-03 | run again | 2020-04-03 |
+---------+------------+-----------+---------------+
You can then select max from the result instead of ordering.

There might be multiple solutions and even some better ones considering performance but this is my approach. Hope it helps.
select case when status = 'run_again' then t2_date else t1_date end as needed_date
from t2
cross join (
  select t1_date
  from t1
  where t1.day_type = 'Weekday'
    and t1_date > (select max(t2_date) from t2)
  order by t1.t1_date
  limit 1
) a
where t2_date = (select max(t2_date) from t2);
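Neither query above spells out the "run_again wins when a date has several statuses" rule, so here is a minimal sketch of one way to fold it in. It runs against SQLite through Python's sqlite3 module rather than Impala, so treat it as an illustration of the logic, not tested Impala syntax. The column name d, the ISO date strings and the function name next_run_date are assumptions made for the example.

```python
import sqlite3

# Sample rows from the question, with dates rewritten as ISO strings
# (an assumption) so that plain string comparison orders them correctly.
T1_ROWS = [
    ("2020-04-01", "Weekday"), ("2020-04-02", "Weekday"),
    ("2020-04-03", "Weekday"), ("2020-04-04", "Weekend"),
    ("2020-04-05", "Weekend"), ("2020-04-06", "Weekday"),
]
T2_ROWS = [
    ("A", "2020-04-01", "finished"), ("A", "2020-04-02", "finished"),
    ("A", "2020-04-03", "finished"), ("A", "2020-04-03", "run_again"),
]

# One aggregate row for the latest date in t2:
# - the inner MAX(CASE ...) is 1 if any row on that date says run_again
# - MIN(t1.d) is the first weekday strictly after that date
QUERY = """
SELECT CASE
         WHEN MAX(CASE WHEN t2.status = 'run_again' THEN 1 ELSE 0 END) = 1
           THEN m.max_d
         ELSE MIN(t1.d)
       END AS next_run_date
FROM (SELECT MAX(d) AS max_d FROM t2) m
JOIN t2 ON t2.d = m.max_d
LEFT JOIN t1 ON t1.day_type = 'Weekday' AND t1.d > m.max_d
GROUP BY m.max_d
"""


def next_run_date(t1_rows, t2_rows):
    """Load the rows into an in-memory database and return the next run date."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t1 (d TEXT, day_type TEXT)")
    conn.execute("CREATE TABLE t2 (process TEXT, d TEXT, status TEXT)")
    conn.executemany("INSERT INTO t1 VALUES (?, ?)", t1_rows)
    conn.executemany("INSERT INTO t2 VALUES (?, ?, ?)", t2_rows)
    row = conn.execute(QUERY).fetchone()
    conn.close()
    return row[0] if row else None


if __name__ == "__main__":
    print(next_run_date(T1_ROWS, T2_ROWS))      # latest date has run_again
    print(next_run_date(T1_ROWS, T2_ROWS[:3]))  # latest date only finished
```

Because the join repeats the weekday rows once per status row, only MAX and MIN are used on top of it; both are unaffected by the duplication.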

Related

Oracle: Update values in table with aggregated values from same table

I am looking for a possibly better approach to this.
I have created a temp table in Oracle 11.2 that I'm using to pre calculate values that I will need in other selects instead of always generating them again with each select.
create global temporary table temp_foo (
DT timestamp(6), --only the date part will be used in this example but for later things I will need the time
Something varchar2(100),
Customer varchar2(100),
MinDate timestamp(6),
MaxDate timestamp(6),
Filecount int,
Errorcount int,
AvgFilecount int,
constraint PK_foo primary key (DT, Customer)
) on commit preserve rows;
I then first insert some fixed values for everything except AvgFilecount. AvgFilecount should contain the average of the Filecount for the 3 previous records (going by the date in DT). It doesn't matter that the result will be converted to an int; I don't need the decimal places.
DT | Customer | Filecount | AvgFilecount
2019-04-30 | x | 10 | avg(2+3+9)
2019-04-29 | x | 2 | based on values before this
2019-04-28 | x | 3 | based on values before this
2019-04-27 | x | 9 | based on values before this
I thought about using a normal UPDATE statement, as this should be faster than looping through the values. I should mention that there are no gaps in the DT field, but obviously there is a first record for which I won't find any previous records. If I looped through, I could easily calculate AvgFilecount with (the record before previous record/2 + previous record)/3, which I cannot do with UPDATE because I cannot guarantee the order in which the rows are processed. So I'm fine with just taking the last 3 records (going by DT) and calculating it from there.
What I thought would be an easy update is giving me headaches. I'm mostly doing SQL Server, where I would just join the 3 other records, but it seems to be a bit different in Oracle. I have found https://stackoverflow.com/a/2446834/4040068 and wanted to use the second approach in the answer.
update
(select curr.DT, curr.temp_foo, curr.Filecount, curr.AvgFilecount as OLD, (coalesce(Minus1.Filecount, 0) + coalesce(Minus2.Filecount, 0) + coalesce(Minus3.Filecount, 0)) / 3 as NEW
from temp_foo curr
left join temp_foo Minus1 ON Minus1.Customer = curr.Customer and trunc(Minus1.DT) = trunc(curr.DT-1)
left join temp_foo Minus2 ON Minus2.Customer = curr.Customer and trunc(Minus2.DT) = trunc(curr.DT-2)
left join temp_foo Minus3 ON Minus3.Customer = curr.Customer and trunc(Minus3.DT) = curr.DT-3
order by 1, 2
)
set OLD = NEW;
Which gives me an
ORA-01779: cannot modify a column which maps to a non key-preserved table
01779. 00000 - "cannot modify a column which maps to a non key-preserved table"
*Cause: An attempt was made to insert or update columns of a join view which map to a non-key-preserved table.
*Action: Modify the underlying base tables directly.
I thought this should work as both join conditions are in the primary key and thus unique. I am currently implementing the first approach in the above mentioned answer but it is getting quite big and it feels like there should be a better solution to this.
Other things I thought about trying:
using a nested subselect (nested because Oracle doesn't know top(n) and I need to sort the subselect) to select the previous 3 records ordered by DT, and then the outer select with rownum <= 3, so I could just use AVG(). However, I was told subselects can be quite slow and joins are better in Oracle performance-wise. Dunno if that is really the case, haven't done any testing.
Edit: My insert right now looks like this. I am already aggregating the Filecount for a day as there can be multiple records per DT per Customer per Something.
insert into temp_foo (DT, Something, Customer, Filecount)
select dates.DT, tbl1.Something, tbl1.Customer, coalesce(sum(tbl3.Filecount),0)
from table(Function_Returning_Daterange(NULL, NULL)) dates
cross join
(SELECT Something,
Code,
Value
FROM Table2 tbl2
WHERE (Something = 'Value')) tbl1
left outer join Table3 tbl3
on tbl3.Customer = tbl1.Customer
and trunc(tbl3.MinDate) = trunc(dates.DT)
group by dates.DT, tbl1.Something, tbl1.Customer;
You could use an analytic average with a window clause:
select dt, customer, filecount,
avg(filecount) over (partition by customer order by dt
rows between 3 preceding and 1 preceding) as avgfilecount
from tmp_foo
order by dt desc;
DT CUSTOMER FILECOUNT AVGFILECOUNT
---------- -------- ---------- ------------
2019-04-30 x 10 4.66666667
2019-04-29 x 2 6
2019-04-28 x 3 9
2019-04-27 x 9
and then do the update part with a merge statement:
merge into tmp_foo t
using (
select dt, customer,
avg(filecount) over (partition by customer order by dt
rows between 3 preceding and 1 preceding) as avgfilecount
from tmp_foo
) s
on (s.dt = t.dt and s.customer = t.customer)
when matched then update set t.avgfilecount = s.avgfilecount;
4 rows merged.
select dt, customer, filecount, avgfilecount
from tmp_foo
order by dt desc;
DT CUSTOMER FILECOUNT AVGFILECOUNT
---------- -------- ---------- ------------
2019-04-30 x 10 4.66666667
2019-04-29 x 2 6
2019-04-28 x 3 9
2019-04-27 x 9
You haven't shown your original insert statement; it might be possible to add the analytic calculation to that, and avoid the separate update step.
Also, if you want the first two date values to be calculated as if the 'missing' extra days before them had zero counts, you could use sum and division instead of avg:
select dt, customer, filecount,
sum(filecount) over (partition by customer order by dt
rows between 3 preceding and 1 preceding)/3 as avgfilecount
from tmp_foo
order by dt desc;
DT CUSTOMER FILECOUNT AVGFILECOUNT
---------- -------- ---------- ------------
2019-04-30 x 10 4.66666667
2019-04-29 x 2 4
2019-04-28 x 3 3
2019-04-27 x 9
It depends what you expect those last calculated values to be.
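To make the difference between the two variants concrete, here is a small plain-Python sketch of the same window arithmetic (rows between 3 preceding and 1 preceding). It is only an illustration of what the analytic functions compute, not a replacement for them. The function name trailing_average and its parameters are made up for the example.

```python
def trailing_average(counts, width=3, pad_missing_with_zero=False):
    """Average of the `width` values before each position.

    `counts` must be ordered oldest first. With pad_missing_with_zero=False
    the divisor is the number of rows actually present (like avg(...) over
    the window); with True the divisor is always `width` (like sum(...)/3).
    The first position has an empty window and yields None, like SQL NULL.
    """
    result = []
    for i in range(len(counts)):
        window = counts[max(0, i - width):i]  # up to `width` earlier values
        if not window:
            result.append(None)
        elif pad_missing_with_zero:
            result.append(sum(window) / width)
        else:
            result.append(sum(window) / len(window))
    return result


# Filecount values from the question, oldest first (2019-04-27 .. 2019-04-30).
FILECOUNTS = [9, 3, 2, 10]

if __name__ == "__main__":
    print(trailing_average(FILECOUNTS))
    print(trailing_average(FILECOUNTS, pad_missing_with_zero=True))
```

The two calls differ only for the rows that have fewer than three predecessors, which is exactly where the two result sets above differ.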

In HiveQL, what is the most elegant/performant way of calculating an average value if some of the data is implicitly not present?

In HiveQL, what is the most elegant and performant way of calculating an average value when there are 'gaps' in the data, with implicit repeated values between them? For example, consider a table with the following data:
+----------+----------+----------+
| Employee | Date | Balance |
+----------+----------+----------+
| John | 20181029 | 1800.2 |
| John | 20181105 | 2937.74 |
| John | 20181106 | 3000 |
| John | 20181110 | 1500 |
| John | 20181119 | -755.5 |
| John | 20181120 | -800 |
| John | 20181121 | 1200 |
| John | 20181122 | -400 |
| John | 20181123 | -900 |
| John | 20181202 | -1300 |
+----------+----------+----------+
If I try to calculate a simple average of the November rows, it will return ~722.78, but the average should take into account that the days not shown have the same balance as the previous record. In the above data, John had 1800.2 between 20181101 and 20181104, for example.
Assuming that the table always has exactly one row for each date/balance and given that I cannot change how this data is stored (and probably shouldn't since it would be a waste of storage to write rows for days with unchanged balances), I've been tinkering with getting the average from a select with subqueries for all the days in the queried month, returning a NULL for the absent days, and then using case to get the balance from the previous available date in reverse order. All of this just to avoid writing temporary tables.
Step 1: Original Data
The 1st step is to recreate a table with the original data. Let's say the original table is called daily_employee_balance.
daily_employee_balance
use default;
drop table if exists daily_employee_balance;
create table if not exists daily_employee_balance (
employee_id string,
employee string,
iso_date date,
balance double
);
Insert Sample Data in original table daily_employee_balance
insert into table daily_employee_balance values
('103','John','2018-10-25',1800.2),
('103','John','2018-10-29',1125.7),
('103','John','2018-11-05',2937.74),
('103','John','2018-11-06',3000),
('103','John','2018-11-10',1500),
('103','John','2018-11-19',-755.5),
('103','John','2018-11-20',-800),
('103','John','2018-11-21',1200),
('103','John','2018-11-22',-400),
('103','John','2018-11-23',-900),
('103','John','2018-12-02',-1300);
Step 2: Dimension Table
You will need a dimension table that holds a calendar (a table with all the possible dates); call it dimension_date. Having a calendar table is standard industry practice, and you can probably find sample data for one online.
use default;
drop table if exists dimension_date;
create external table dimension_date(
date_id int,
iso_date string,
year string,
month string,
month_desc string,
end_of_month_flg string
);
Insert some sample data for entire month of Nov 2018:
insert into table dimension_date values
(6880,'2018-11-01','2018','2018-11','November','N'),
(6881,'2018-11-02','2018','2018-11','November','N'),
(6882,'2018-11-03','2018','2018-11','November','N'),
(6883,'2018-11-04','2018','2018-11','November','N'),
(6884,'2018-11-05','2018','2018-11','November','N'),
(6885,'2018-11-06','2018','2018-11','November','N'),
(6886,'2018-11-07','2018','2018-11','November','N'),
(6887,'2018-11-08','2018','2018-11','November','N'),
(6888,'2018-11-09','2018','2018-11','November','N'),
(6889,'2018-11-10','2018','2018-11','November','N'),
(6890,'2018-11-11','2018','2018-11','November','N'),
(6891,'2018-11-12','2018','2018-11','November','N'),
(6892,'2018-11-13','2018','2018-11','November','N'),
(6893,'2018-11-14','2018','2018-11','November','N'),
(6894,'2018-11-15','2018','2018-11','November','N'),
(6895,'2018-11-16','2018','2018-11','November','N'),
(6896,'2018-11-17','2018','2018-11','November','N'),
(6897,'2018-11-18','2018','2018-11','November','N'),
(6898,'2018-11-19','2018','2018-11','November','N'),
(6899,'2018-11-20','2018','2018-11','November','N'),
(6900,'2018-11-21','2018','2018-11','November','N'),
(6901,'2018-11-22','2018','2018-11','November','N'),
(6902,'2018-11-23','2018','2018-11','November','N'),
(6903,'2018-11-24','2018','2018-11','November','N'),
(6904,'2018-11-25','2018','2018-11','November','N'),
(6905,'2018-11-26','2018','2018-11','November','N'),
(6906,'2018-11-27','2018','2018-11','November','N'),
(6907,'2018-11-28','2018','2018-11','November','N'),
(6908,'2018-11-29','2018','2018-11','November','N'),
(6909,'2018-11-30','2018','2018-11','November','Y');
Step 3: Fact Table
Create a fact table from the original table. In normal practice, you ingest the data into HDFS/Hive, then process the raw data and create a table with historical data that you keep inserting into incrementally. You can look more into data warehousing to get the proper definition, but I call this a fact table: f_employee_balance.
This will re-create the original table with missing dates and populate the missing balance with earlier known balance.
--inner query to get all the possible dates
--outer self join query will populate the missing dates and balance
drop table if exists f_employee_balance;
create table f_employee_balance
stored as orc tblproperties ("orc.compress"="SNAPPY") as
select q1.employee_id, q1.iso_date,
nvl(last_value(r.balance, true) --initial dates to be populated with 0 balance
over (partition by q1.employee_id order by q1.iso_date rows between unbounded preceding and current row),0) as balance,
month, year from (
select distinct
r.employee_id,
d.iso_date as iso_date,
d.month, d.year
from daily_employee_balance r, dimension_date d )q1
left outer join daily_employee_balance r on
(q1.employee_id = r.employee_id) and (q1.iso_date = r.iso_date);
Step 4: Analytics
The query below will give you the true average by month:
select employee_id, monthly_avg, month, year from (
select employee_id,
row_number() over (partition by employee_id,year,month) as row_num,
avg(balance) over (partition by employee_id,year,month) as monthly_avg, month, year from
f_employee_balance)q1
where row_num = 1
order by year, month;
Step 5: Conclusion
You could have just combined step 3 and 4 together; this would save you from creating extra table. When you are in the big data world, you don't worry much about wasting extra disk space or development time. You can easily add another disk or node and automate the process using workflows. For more information, please look into data warehousing concept and hive analytical queries.
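As a cross-check of what the number should look like for the sample table in the question, here is a rough Python sketch of the same idea: walk the calendar days of the month and carry the last recorded balance forward. Unlike step 3 above, it carries a balance in from before the month starts and skips days that precede any known balance instead of counting them as 0. That choice is an assumption based on the question's remark about 20181101 to 20181104. The names CHANGES and monthly_average are invented for the example.

```python
from datetime import date, timedelta

# Rows from the question's table: (date of the record, balance).
CHANGES = [
    (date(2018, 10, 29), 1800.2),
    (date(2018, 11, 5), 2937.74),
    (date(2018, 11, 6), 3000.0),
    (date(2018, 11, 10), 1500.0),
    (date(2018, 11, 19), -755.5),
    (date(2018, 11, 20), -800.0),
    (date(2018, 11, 21), 1200.0),
    (date(2018, 11, 22), -400.0),
    (date(2018, 11, 23), -900.0),
    (date(2018, 12, 2), -1300.0),
]


def monthly_average(changes, year, month):
    """Average daily balance for one month, forward-filling missing days."""
    changes = sorted(changes)
    day = date(year, month, 1)
    # First day of the following month (rolls over to January after December).
    next_month = date(year + month // 12, month % 12 + 1, 1)
    total, days = 0.0, 0
    idx, current = 0, None
    while day < next_month:
        # Advance to the most recent record on or before this day.
        while idx < len(changes) and changes[idx][0] <= day:
            current = changes[idx][1]
            idx += 1
        if current is not None:  # assumption: skip days before any record
            total += current
            days += 1
        day += timedelta(days=1)
    return total / days if days else None


if __name__ == "__main__":
    print(round(monthly_average(CHANGES, 2018, 11), 2))
```

For the question's data this comes out at about 922.77 for November, compared with the ~722.78 you get from averaging only the rows that are present.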

Populating future dates in oracle table

I have attached tables, product and date.
Let's say my product table has data up to yesterday, i.e. 05/31/2018.
I am trying to populate the season table, where I could do the calculation up to 5/31/2018 as value = (value on same day last year/previous day last year) with Ch(a) and P(pen); however, the data set only ran to 5/31/2018. My aim is to get the data/calculation for 06/1/2018 through 12/31/2018 as well. How do I get the data for these future dates, given that I have the data needed to calculate them in the prod table?
I'd appreciate it if you can help.
Thank you!
You can generate a series of dates using a CONNECT BY subquery like this one:
SELECT Start_date + level - 1 as my_date
FROM (
SELECT date '2018-01-01' as Start_date FROM dual
)
CONNECT BY Start_date + level - 1 <= date '2018-01-05'
Demo: http://www.sqlfiddle.com/#!4/072359/1
| MY_DATE |
|----------------------|
| 2018-01-01T00:00:00Z |
| 2018-01-02T00:00:00Z |
| 2018-01-03T00:00:00Z |
| 2018-01-04T00:00:00Z |
| 2018-01-05T00:00:00Z |
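If it helps to see the same idea outside Oracle, here is a minimal Python sketch that builds an inclusive series of dates, analogous to Start_date + level - 1 in the query above. The function name date_series is just a label chosen for the example.

```python
from datetime import date, timedelta


def date_series(start, end):
    """Return every date from start to end, inclusive (empty if end < start)."""
    return [start + timedelta(days=offset)
            for offset in range((end - start).days + 1)]


if __name__ == "__main__":
    # Same range as the demo query above.
    for d in date_series(date(2018, 1, 1), date(2018, 1, 5)):
        print(d.isoformat())
```

Calling it with date(2018, 6, 1) and date(2018, 12, 31) would give the future dates asked about in the question, which can then be joined to last year's rows.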

oracle query to get max hour every day, and corresponding row values

I'm having a hard time creating a query to do the following:
I have this table, called LOG:
ID | INSERT_TIME | LOG_VALUE
----------------------------------------
1 | 2013-04-29 18:00:00.000 | 160473
2 | 2013-04-29 21:00:00.000 | 154281
3 | 2013-04-30 09:00:00.000 | 186552
4 | 2013-04-30 14:00:00.000 | 173145
5 | 2013-04-30 14:30:00.000 | 102235
6 | 2013-05-01 11:00:00.000 | 201541
7 | 2013-05-01 23:00:00.000 | 195234
What I want to do is build a query that returns, for each day, the last values inserted (using the max value of INSERT_TIME). I'm only interested in the date part of that column, and in the column LOG_VALUE. So, this would be my resultset after running the query:
2013-04-29 154281
2013-04-30 102235
2013-05-01 195234
I guess that I need to use GROUP BY over the INSERT_TIME column, along with MAX() function, but by doing that, I can't seem to get the LOG_VALUE. Can anyone help me on this, please?
(I'm on Oracle 10g)
SELECT trunc(insert_time),
log_value
FROM (
SELECT insert_time,
log_value,
rank() over (partition by trunc(insert_time)
order by insert_time desc) rnk
FROM log)
WHERE rnk = 1
is one option. This uses the analytic function rank to identify the row with the latest insert_time on each day.
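Another option, sketched here for comparison, avoids analytic functions by joining the table to a subquery that finds MAX(insert_time) per day. The sketch runs on SQLite through Python's sqlite3 module, not Oracle: substr(insert_time, 1, 10) stands in for trunc(insert_time) because the timestamps are stored as text here, which is an assumption of this example. The function name last_value_per_day is made up.

```python
import sqlite3

# Rows from the LOG table in the question.
LOG_ROWS = [
    (1, "2013-04-29 18:00:00.000", 160473),
    (2, "2013-04-29 21:00:00.000", 154281),
    (3, "2013-04-30 09:00:00.000", 186552),
    (4, "2013-04-30 14:00:00.000", 173145),
    (5, "2013-04-30 14:30:00.000", 102235),
    (6, "2013-05-01 11:00:00.000", 201541),
    (7, "2013-05-01 23:00:00.000", 195234),
]


def last_value_per_day(rows):
    """Return (day, log_value) for the latest insert_time of each day."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE log (id INTEGER, insert_time TEXT, log_value INTEGER)"
    )
    conn.executemany("INSERT INTO log VALUES (?, ?, ?)", rows)
    result = conn.execute(
        """
        SELECT substr(l.insert_time, 1, 10) AS log_day, l.log_value
        FROM log l
        JOIN (SELECT MAX(insert_time) AS last_time
              FROM log
              GROUP BY substr(insert_time, 1, 10)) m
          ON l.insert_time = m.last_time
        ORDER BY log_day
        """
    ).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for day, value in last_value_per_day(LOG_ROWS):
        print(day, value)
```

As with rank, two rows sharing the exact same latest insert_time on a day would both be returned.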

SQL Server: Combine multiple rows into one row

I've looked at a few other similar questions, but none of them fits the particular situation I find myself in.
I am a relative beginner at SQL.
I am writing a query to create a report. I have read-only access to this DB. I am trying to combine three rows into one row. Any method that only requires read access will work.
That being said, the three rows I have, were obtained by a very long sub-query. Here is the outer shell:
SELECT Availability,
Start_Date,
End_Date
FROM (
-- long subquery goes here (it is several UNION ALLs)
...
) AS dual
Here are the rows:
Availability | Start_Date | End_Date
-------------------------------------
99.983 | NULL | NULL
NULL | 1/10/2013 | NULL
NULL | NULL | 1/28/2013
What I am trying to do is combine the three rows into one row, like so:
Availability | Start_Date | End_Date
-------------------------------------
99.983 | 1/10/2013 | 1/28/2013
I am aware that I could use COALESCE() to put them in one column, but I would prefer to keep the three separate columns.
I can't create or use stored procedures.
Is it possible to do this? Can I have an example for the general case?
Have you tried using an aggregate function:
SELECT max(Availability) Availability,
max(Start_Date) Start_Date,
max(End_Date) End_Date
FROM (
-- long subquery goes here (it is several UNION ALLs)
...
) AS dual
If you have additional columns, then you would need to add a GROUP BY clause.
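For the general case with a GROUP BY, here is a small sketch using SQLite through Python's sqlite3 module rather than SQL Server. It works because aggregate functions such as MAX ignore NULLs, so each column collapses to its single non-NULL value per group. The report_id column and the second group of rows are hypothetical, added only to show the grouping.

```python
import sqlite3

# (report_id, availability, start_date, end_date); report_id is hypothetical.
# The first three rows mirror the question, the last three are invented.
PART_ROWS = [
    (1, 99.983, None, None),
    (1, None, "1/10/2013", None),
    (1, None, None, "1/28/2013"),
    (2, 98.5, None, None),
    (2, None, "2/01/2013", None),
    (2, None, None, "2/15/2013"),
]


def combine_rows(rows):
    """Collapse the partial rows of each report_id into one row."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE parts (report_id INTEGER, availability REAL, "
        "start_date TEXT, end_date TEXT)"
    )
    conn.executemany("INSERT INTO parts VALUES (?, ?, ?, ?)", rows)
    combined = conn.execute(
        """
        SELECT report_id,
               MAX(availability) AS availability,
               MAX(start_date)   AS start_date,
               MAX(end_date)     AS end_date
        FROM parts
        GROUP BY report_id
        ORDER BY report_id
        """
    ).fetchall()
    conn.close()
    return combined


if __name__ == "__main__":
    for row in combine_rows(PART_ROWS):
        print(row)
```

This only gives a sensible result when each column has at most one non-NULL value per group; otherwise MAX picks the largest one.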
