In Oracle (PROD), we will be creating views on table(s) and the users will be querying the views to fetch data for each reporting period (a single month, eg: between '01-DEC-2015' and '31-DEC-2015'). We created a view as
CREATE OR REPLACE VIEW VW_TABLE1 AS
SELECT ACCNT_NBR, BIZ_DATE, MAX(COL1) COL1, MAX(COL2) COL2
FROM TABLE1_D
WHERE BIZ_DATE IN (SELECT BIZ_DATE FROM TABLE2_M GROUP BY BIZ_DATE)
GROUP BY ACCNT_NBR, BIZ_DATE;
The issue here is that TABLE1_D (a daily table, with data from Dec 2015 to Feb 2016) has records for multiple dates per month. For Dec 2015, say, it has records for 01-DEC-2015, 02-DEC-2015, ..., 29-DEC-2015, 30-DEC-2015 (the dates may not be continuous, since rows are loaded by business date), with each day having close to 2,500,000 records.
TABLE2_M is a monthly table with a single date per month (e.g. 30-DEC-2015 for Dec 2015) and around 4,000 records for each date.
When we query the view as
SELECT * FROM VW_TABLE1 WHERE BIZ_DATE BETWEEN '01-DEC-2015' AND '31-DEC-2015'
it returns the aggregated data from TABLE1_D for 30-DEC-2015, as expected. I thought the grouping on BIZ_DATE in TABLE1_D was unnecessary, since only one BIZ_DATE would come out of the inner query.
I checked this by removing BIZ_DATE from the final GROUP BY, assuming the inner query would yield data for a single day.
So I took two rows (for 30-DEC-2015 and 30-JAN-2016) from both tables, loaded them into SIT for testing, and created the view as
CREATE VIEW VW_TABLE1 AS
SELECT ACCNT_NBR, MAX(BIZ_DATE) BIZ_DATE, MAX(COL1) COL1, MAX(COL2) COL2
FROM TABLE1_D
WHERE BIZ_DATE IN (SELECT BIZ_DATE FROM TABLE2_M GROUP BY BIZ_DATE)
GROUP BY ACCNT_NBR;
In SIT, a SELECT with BETWEEN for each month (or = on the exact month date) gave correct data; i.e., when I used BETWEEN for a single month, it produced that month's data.
SELECT * FROM VW_TABLE1 WHERE BIZ_DATE BETWEEN '01-DEC-2015' AND '31-DEC-2015';
SELECT * FROM VW_TABLE1 WHERE BIZ_DATE = '30-DEC-2015';
With this, I modified the view DDL in PROD to match SIT. But surprisingly, the same SELECT (the second one, with = '30-DEC-2015'; the first was aborted because it ran too long against the data volume)
returned no data. I suspect the inner query is returning all dates from 30-DEC-2015 through 30-JAN-2016, so MAX(BIZ_DATE) is being derived from 30-JAN-2016. (TABLE2_M doesn't have Feb 2016 data.)
I verified whether there were any Oracle version differences between SIT and PROD; v$version shows the same release (11.2.0.4.0) in both. Can you please explain this behavior: the same query on the same view DDL in different environments returning different results with the same data?
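For what it's worth, the symptom can be reproduced with a handful of rows in any SQL engine. A minimal SQLite sketch (invented data) showing how MAX(BIZ_DATE) without BIZ_DATE in the GROUP BY collapses months, so a single-month filter on the view can find nothing:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE table1_d (accnt_nbr INTEGER, biz_date TEXT, col1 INTEGER)")
cur.execute("CREATE TABLE table2_m (biz_date TEXT)")
cur.executemany("INSERT INTO table1_d VALUES (?, ?, ?)",
                [(1, "2015-12-30", 10), (1, "2016-01-30", 20)])
cur.executemany("INSERT INTO table2_m VALUES (?)",
                [("2015-12-30",), ("2016-01-30",)])

# The modified view variant: BIZ_DATE removed from the GROUP BY.
view_sql = """
    SELECT accnt_nbr, MAX(biz_date) AS biz_date, MAX(col1) AS col1
    FROM table1_d
    WHERE biz_date IN (SELECT biz_date FROM table2_m)
    GROUP BY accnt_nbr
"""
# Both month-end dates pass the IN filter, so account 1 collapses to a
# single row dated 2016-01-30; filtering for December then returns nothing.
rows = cur.execute(
    "SELECT * FROM (" + view_sql + ") WHERE biz_date = '2015-12-30'"
).fetchall()
print(rows)  # []
```

The moment more than one month's date passes the inner-query filter, the per-account MAX(BIZ_DATE) belongs to the latest month only, which matches the empty PROD result.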
I have a situation where there is a row in a table for each time a customer visits. What I'm trying to do is find those customers who have visited within any given 30-day window and select those visits.
The main focus is on three columns in the table: ROW_ID, CUSTOMER_ID, and VISIT_DATE (a DATE column).
What I'm trying to get is cases where a customer has visited multiple times within a 30-day span. For example, if CUSTOMER_ID #5 visits on 10/8/2019 and again on 11/1/2019, I would want to see both rows.
We could try using exists logic to handle the requirement:
SELECT ROW_ID, CUSTOMER_ID, VISIT_DATE
FROM yourTable t1
WHERE EXISTS (SELECT 1 FROM yourTable t2
WHERE t2.CUSTOMER_ID = t1.CUSTOMER_ID AND
t2.ROW_ID <> t1.ROW_ID AND
ABS(t2.VISIT_DATE - t1.VISIT_DATE) <= 30);
The logic behind the above query reads cleanly as: return any customer record where there is another record for the same customer such that the two (different) records are within 30 days of each other.
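A runnable sketch of this EXISTS pattern in SQLite (invented data; note the correlation must be t2.CUSTOMER_ID = t1.CUSTOMER_ID, and julianday() stands in for Oracle's direct date subtraction):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE visits (row_id INTEGER, customer_id INTEGER, visit_date TEXT)")
cur.executemany("INSERT INTO visits VALUES (?, ?, ?)", [
    (1, 5, "2019-10-08"),   # within 30 days of row 2 -> both should appear
    (2, 5, "2019-11-01"),
    (3, 7, "2019-01-01"),   # lone visit -> excluded
])
rows = cur.execute("""
    SELECT row_id, customer_id, visit_date
    FROM visits t1
    WHERE EXISTS (
        SELECT 1 FROM visits t2
        WHERE t2.customer_id = t1.customer_id
          AND t2.row_id <> t1.row_id
          AND ABS(julianday(t2.visit_date) - julianday(t1.visit_date)) <= 30
    )
    ORDER BY row_id
""").fetchall()
print(rows)  # rows 1 and 2 only
```

Both of customer 5's visits qualify (24 days apart), while customer 7's single visit has no companion row and is filtered out.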
I have two hive tables (t1 and t2) that I would like to compare. The second table has 5 additional columns that are not in the first table. Other than the five disjoint fields, the two tables should be identical. I am trying to write a query to check this. Here is what I have so far:
SELECT * FROM t1
UNION ALL
select * from t2
GROUP BY some_value
HAVING count(*) == 2
If the tables are identical, this should return 0 records. However, since the second table contains 5 extra fields, I need to change the second select statement to reflect this. There are almost 60 column names so I would really hate to write it like this:
SELECT * FROM t1
UNION ALL
select field1, field2, field3,...,fieldn from t2
GROUP BY some_value
HAVING count(*) == 2
I have looked around and I know there is no SELECT * EXCEPT syntax, but is there a way to do this query without having to explicitly name each column that I want included in the final result?
You should have used UNION DISTINCT for the logic you are applying.
However, the number and names of columns returned by each select_statement have to be the same, otherwise a schema error is thrown.
You could have a look at this Python program that handles such comparisons of Hive tables (comparing all the rows and all the columns), and would show you in a webpage the differences that might appear: https://github.com/bolcom/hive_compared_bq
To skip the 5 extra fields, you could use the "--ignore-columns" option.
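As a sketch of the "build the column list programmatically" idea (the usual workaround for the missing SELECT * EXCEPT), here is a SQLite version; in Hive you would obtain the column names from DESCRIBE instead of PRAGMA, but the shape of the solution is the same:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE t1 (a INTEGER, b TEXT)")
cur.execute("CREATE TABLE t2 (a INTEGER, b TEXT, extra1 INTEGER, extra2 INTEGER)")
cur.execute("INSERT INTO t1 VALUES (1, 'x')")
cur.execute("INSERT INTO t2 VALUES (1, 'x', 99, 100)")

def columns(table):
    # PRAGMA table_info returns (cid, name, type, ...) per column.
    return [row[1] for row in cur.execute(f"PRAGMA table_info({table})")]

# Keep only the columns t2 shares with t1, instead of typing ~60 names.
shared = [col for col in columns("t2") if col in set(columns("t1"))]
col_list = ", ".join(shared)

# Symmetric difference over the shared columns: empty means the tables match.
diff = (
    cur.execute(f"SELECT {col_list} FROM t1 EXCEPT SELECT {col_list} FROM t2").fetchall()
    + cur.execute(f"SELECT {col_list} FROM t2 EXCEPT SELECT {col_list} FROM t1").fetchall()
)
print(diff)  # []
```

Generating the column list from table metadata keeps the comparison maintainable as the schemas evolve.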
I have a Hive table named "sales" with below structure:
id,ptype,amount,time,date
1,a,12,2240,2013-12-25
1,a,4,1830,2013-12-25
1,b,2,1920,2013-12-25
1,b,3,2023,2013-12-25
2,a,5,1220,2013-12-25
2,a,1,1320,2013-12-25
Below are my queries for the different variables:
Q1: select id,sum(amount) as s_amt from sales group by id;
Q2: select id, sum(amount) as s_a_amt from sales where ptype='a' group by id;
Q3: select id, sum(amount) as s_b_amt from sales where ptype='b' group by id;
As far as I have learned, in Hive we can apply UNION ALL only when the queries have the same column names and schema. Below is the end result I want to achieve with a Hive query:
id,s_amt,s_a_amt,s_b_amt
1,21,16,5
2,6,6,0
Below is one query that I tried, and it executed successfully. But designing the same query for more than 300 variables would be a very painful task. Is there a more efficient approach, considering we have more than 300 variables? I appreciate your comments!
select t.id,max(t.s_amt) as s_amt,max(t.s_a_amt) as s_a_amt, max(t.s_b_amt) as s_b_amt
from
(select s1.id,sum(amount) as s_amt,0 as s_a_amt,0 as s_b_amt from sales s1 group by id union all
select s2.id, 0 as s_amt, sum(amount) as s_a_amt, 0 as s_b_amt from sales s2 where ptype='a' group by id union all
select s3.id, 0 as s_amt,0 as s_a_amt, sum(amount) as s_b_amt from sales s3 where ptype='b' group by id) t
group by t.id;
The ideal solution would be a Materialized Query Table (MQT), as IBM calls it.
Summary tables are a special form of MQT, and that is exactly what you need. Quick definition: as the name suggests, an MQT is a simple summary table, materialized on disk.
With MQT support, all you have to do is the below (DB2 syntax):
CREATE TABLE MQTA AS (
  SELECT id, SUM(amount) AS s_a_amt FROM sales WHERE ptype = 'a' GROUP BY id
)
DATA INITIALLY DEFERRED
REFRESH DEFERRED
MAINTAINED BY USER;
DATA INITIALLY DEFERRED says not to insert summary records into the summary table at creation time. REFRESH DEFERRED says the data in the table can be refreshed at any time using the REFRESH TABLE statement. MAINTAINED BY USER says the refresh of this table has to be taken care of by the user; MAINTAINED BY SYSTEM is another option, in which the system automatically updates the summary table when the base table sees inserts/deletes/updates.
You can query the MQT directly like a simple SELECT; all the heavy lifting of summarizing records has already run before you query the MQT, so it is much faster.
But AFAIK Hive doesn't support MQTs or summary tables.
Now that you know the concept, you can simply simulate it. Create a summary table and insert the summary records (the REFRESH TABLE concept). Load summary values periodically, controlling the load with some kind of last-load-date field so you pick up only the records added after the last refresh. You can do this with scheduled jobs running Hive scripts.
INSERT INTO TABLE PTYPE_AMOUNT_MQT
SELECT *
FROM
 (SELECT s1.id, SUM(amount) AS s_amt, 0 AS s_a_amt, 0 AS s_b_amt FROM sales s1 WHERE record_create_date > last_refresh_date GROUP BY id
  UNION ALL
  SELECT s2.id, 0 AS s_amt, SUM(amount) AS s_a_amt, 0 AS s_b_amt FROM sales s2 WHERE ptype = 'a' AND record_create_date > last_refresh_date GROUP BY id
  UNION ALL
  SELECT s3.id, 0 AS s_amt, 0 AS s_a_amt, SUM(amount) AS s_b_amt FROM sales s3 WHERE ptype = 'b' AND record_create_date > last_refresh_date GROUP BY id) t;
It is always good to have audit fields like record_create_date and time. The last_refresh_date is the last time your job ran.
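A minimal sketch of this simulated-MQT refresh in SQLite (the record_create_date audit field and the last_refresh_date bookkeeping are the assumptions described above):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE sales (id INTEGER, ptype TEXT, amount INTEGER,"
            " record_create_date TEXT)")
cur.execute("CREATE TABLE ptype_amount_mqt (id INTEGER, s_a_amt INTEGER)")
cur.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)",
                [(1, "a", 12, "2013-12-25"), (1, "a", 4, "2013-12-26")])

last_refresh_date = "2013-12-24"  # last time the scheduled refresh job ran

# The REFRESH TABLE simulation: aggregate only rows created since the last
# refresh and append the summary records to the summary table.
cur.execute("""
    INSERT INTO ptype_amount_mqt
    SELECT id, SUM(amount)
    FROM sales
    WHERE ptype = 'a' AND record_create_date > ?
    GROUP BY id
""", (last_refresh_date,))
rows = cur.execute("SELECT * FROM ptype_amount_mqt").fetchall()
print(rows)  # [(1, 16)]
```

Queries then hit the small pre-aggregated table instead of re-scanning the base table.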
The solution should be:
select id, sum(amount) s_amt,
SUM (CASE WHEN ptype='a' THEN amount
ELSE 0
END) sum_a_amt,
SUM (CASE WHEN ptype='b' THEN amount
ELSE 0
END) sum_b_amt
from sales
group by id;
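To confirm, here is that conditional-aggregation query run in SQLite against the sample "sales" data from the question; the desired pivot falls out directly:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE sales (id INTEGER, ptype TEXT, amount INTEGER,"
            " time TEXT, date TEXT)")
cur.executemany("INSERT INTO sales VALUES (?, ?, ?, ?, ?)", [
    (1, "a", 12, "2240", "2013-12-25"),
    (1, "a", 4,  "1830", "2013-12-25"),
    (1, "b", 2,  "1920", "2013-12-25"),
    (1, "b", 3,  "2023", "2013-12-25"),
    (2, "a", 5,  "1220", "2013-12-25"),
    (2, "a", 1,  "1320", "2013-12-25"),
])
rows = cur.execute("""
    SELECT id,
           SUM(amount) AS s_amt,
           SUM(CASE WHEN ptype = 'a' THEN amount ELSE 0 END) AS s_a_amt,
           SUM(CASE WHEN ptype = 'b' THEN amount ELSE 0 END) AS s_b_amt
    FROM sales
    GROUP BY id
    ORDER BY id
""").fetchall()
print(rows)  # [(1, 21, 16, 5), (2, 6, 6, 0)]
```

One pass over the table produces all the per-type sums, so adding the other ~300 variables means adding one CASE column each rather than one UNION ALL branch each.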
Please try it and tell me if it works; I cannot test it right now...
Hive has recently added GROUPING SETS as a new feature (https://issues.apache.org/jira/browse/HIVE-3471). It could be a lot easier (to write or read) than an MQT. But not everyone knows about this feature, and the use of CASE expressions, as Arnaud has illustrated, is more common in practice.
I have data in the table POL_INFO (pol_num, pol_sym, pol_mod, eff_date). I need to pull data from it on a quarterly basis using EFF_DATE.
I'm not sure what you want to query, so here's an example that will hopefully get you started; it counts rows by quarter based on eff_date:
SELECT TO_CHAR(eff_date, 'YYYYQ'), COUNT(*)
FROM my_table
GROUP BY TO_CHAR(eff_date, 'YYYYQ')
The query relies on the TO_CHAR date format code Q, which returns the calendar quarter (Jan-Mar = quarter 1, Apr-Jun = quarter 2, etc.).
Finally, be warned that a WHERE clause filtering on TO_CHAR(eff_date, ...) is not optimizable (it cannot use an index on eff_date). If you have millions of rows you'll want a different approach, such as range predicates on eff_date itself.
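TO_CHAR with the Q format code is Oracle-specific; the same quarterly grouping can be expressed portably by deriving the quarter from the month. A SQLite sketch (the data is made up for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE pol_info (pol_num INTEGER, eff_date TEXT)")
cur.executemany("INSERT INTO pol_info VALUES (?, ?)", [
    (1, "2015-01-15"),  # Q1
    (2, "2015-02-20"),  # Q1
    (3, "2015-07-01"),  # Q3
])
# quarter = (month + 2) / 3 with integer division, mimicking 'YYYYQ'.
rows = cur.execute("""
    SELECT strftime('%Y', eff_date)
           || ((CAST(strftime('%m', eff_date) AS INTEGER) + 2) / 3) AS yyyyq,
           COUNT(*)
    FROM pol_info
    GROUP BY yyyyq
    ORDER BY yyyyq
""").fetchall()
print(rows)  # [('20151', 2), ('20153', 1)]
```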
Here's my original question:
merging two data sets
Unfortunately I omitted some intricacies that I'd like to elaborate on here.
So I have two tables, events_source_1 and events_source_2. I have to produce a resultant data set from those tables (which I'd then insert into a third table, but that's irrelevant).
events_source_1 contains historic event data, and I have to get the most recent event; for that I'm doing the following:
select event_type, b, c, max(event_date), null next_event_date
from events_source_1
group by event_type, b, c
events_source_2 contains the future event data, and I have to do the following:
select event_type,b,c,null event_date, next_event_date
from events_source_2
where b>sysdate;
How do I add an outer join to fill the void, i.e. so that when the same (event_type, b, c) is found in events_source_2, next_event_date is filled with the first date found?
I greatly appreciate your help in advance.
Hope I got your question right. This should return the latest event_date from events_source_1 per (event_type, b, c) and add the lowest event_date from events_source_2.
Select es1.event_type, es1.b, es1.c,
       Max(es1.event_date) As event_date,
       Min(es2.event_date) As next_event_date
From events_source_1 es1
Left Join events_source_2 es2 On ( es2.event_type = es1.event_type
                                   And es2.b = es1.b
                                   And es2.c = es1.c )
Group By es1.event_type, es1.b, es1.c
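A small runnable sketch of this LEFT JOIN plus MAX/MIN aggregation, with SQLite standing in for Oracle (data invented): the latest historic event per key, with the earliest future event attached where one exists and NULL otherwise.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE events_source_1 (event_type TEXT, b INTEGER, c INTEGER, event_date TEXT)")
cur.execute("CREATE TABLE events_source_2 (event_type TEXT, b INTEGER, c INTEGER, event_date TEXT)")
cur.executemany("INSERT INTO events_source_1 VALUES (?, ?, ?, ?)", [
    ("E1", 1, 1, "2020-01-01"),
    ("E1", 1, 1, "2020-02-01"),   # latest historic event for (E1, 1, 1)
    ("E2", 2, 2, "2020-03-01"),   # no future event -> NULL next_event_date
])
cur.execute("INSERT INTO events_source_2 VALUES ('E1', 1, 1, '2020-06-01')")
rows = cur.execute("""
    SELECT es1.event_type, es1.b, es1.c,
           MAX(es1.event_date) AS event_date,
           MIN(es2.event_date) AS next_event_date
    FROM events_source_1 es1
    LEFT JOIN events_source_2 es2
      ON es2.event_type = es1.event_type AND es2.b = es1.b AND es2.c = es1.c
    GROUP BY es1.event_type, es1.b, es1.c
    ORDER BY es1.event_type
""").fetchall()
print(rows)
# [('E1', 1, 1, '2020-02-01', '2020-06-01'), ('E2', 2, 2, '2020-03-01', None)]
```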
You could turn the query where you need to select a max with a GROUP BY into a virtual table, and then do the full outer join as I provided in the answer to the prior question.
Add something like this to the top of the query:
with past_source as (
  select event_type, b, c, max(event_date) as event_date
  from event_source_1
  group by event_type, b, c
)
Then you can use past_source as if it were an actual table, and continue your select right after the closing parens on the with clause shown.
I ended up doing a two-step process: the first step populates the data from event table 1, and the second step MERGEs the data between the target (the data set from step 1) and the other source. Please forgive me, but I had to obfuscate table names and omit some columns in the code below for legal reasons. Here's the SQL:
INSERT INTO EVENTS_TARGET (VEHICLE_ID,EVENT_TYPE_ID,CLIENT_ID,EVENT_DATE,CREATED_DATE)
select VEHICLE_ID, EVENT_TYPE_ID, DEALER_ID,
max(EVENT_INITIATED_DATE) EVENT_DATE, sysdate CREATED_DATE
FROM events_source_1
GROUP BY VEHICLE_ID, EVENT_TYPE_ID, DEALER_ID, sysdate;
Here's the second step:
MERGE INTO EVENTS_TARGET tgt
USING (
  SELECT ee.VEHICLE_ID, ee.POTENTIAL_EVENT_TYPE_ID, ee.CLIENT_ID, ee.POTENTIAL_EVENT_DATE
  FROM EVENTS_SOURCE_2 ee
  WHERE ee.POTENTIAL_EVENT_DATE > SYSDATE) src
ON (tgt.VEHICLE_ID = src.VEHICLE_ID AND tgt.CLIENT_ID = src.CLIENT_ID AND tgt.EVENT_TYPE_ID = src.POTENTIAL_EVENT_TYPE_ID)
WHEN MATCHED THEN
  UPDATE SET tgt.NEXT_EVENT_DATE = src.POTENTIAL_EVENT_DATE
WHEN NOT MATCHED THEN
  INSERT (tgt.VEHICLE_ID, tgt.EVENT_TYPE_ID, tgt.CLIENT_ID, tgt.NEXT_EVENT_DATE, tgt.CREATED_DATE)
  VALUES (src.VEHICLE_ID, src.POTENTIAL_EVENT_TYPE_ID, src.CLIENT_ID, src.POTENTIAL_EVENT_DATE, SYSDATE);
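For what it's worth, the matched-then-update / not-matched-then-insert behaviour of Oracle's MERGE can be emulated in engines without MERGE. A hedged SQLite sketch using INSERT ... ON CONFLICT (column names borrowed from the question; the composite key is an assumption):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
# Assumed key: (vehicle_id, event_type_id, client_id), matching the ON clause.
cur.execute("""CREATE TABLE events_target (
    vehicle_id INTEGER, event_type_id INTEGER, client_id INTEGER,
    next_event_date TEXT,
    PRIMARY KEY (vehicle_id, event_type_id, client_id))""")
cur.execute("INSERT INTO events_target VALUES (1, 10, 100, NULL)")

future_events = [
    (1, 10, 100, "2020-06-01"),  # key exists  -> WHEN MATCHED: update
    (2, 20, 200, "2020-07-01"),  # key missing -> WHEN NOT MATCHED: insert
]
cur.executemany("""
    INSERT INTO events_target (vehicle_id, event_type_id, client_id, next_event_date)
    VALUES (?, ?, ?, ?)
    ON CONFLICT (vehicle_id, event_type_id, client_id)
    DO UPDATE SET next_event_date = excluded.next_event_date
""", future_events)
rows = cur.execute("SELECT * FROM events_target ORDER BY vehicle_id").fetchall()
print(rows)
# [(1, 10, 100, '2020-06-01'), (2, 20, 200, '2020-07-01')]
```

The upsert collapses the two MERGE branches into a single statement, at the cost of requiring a unique key on the match columns.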