Sybase expert help: GROUP BY aggregate performance problems

Hey I have the following tables and SQL:
T1: ID, col2,col3 - PK(ID) - 23mil rows
T2: ID, col2,col3 - PK(ID) - 23mil rows
T3: ID, name,value - PK(ID,name) -66mil rows
1) The SQL below returns the 10k-row result set very fast, no problems.
select top 10000 T1.col2, T2.col2, T3.name, T3.value
from T1, T2, T3
where T1.ID = T2.ID and T1.ID *= T3.ID and T3.name in ('ABC','XYZ')
and T2.col1 = 'SOMEVALUE'
2) The below sql took FOREVER.
select top 10000 T1.col2, T2.col2,
ABC = min(case when T3.name='ABC' then T3.value end),
XYZ = min(case when T3.name='XYZ' then T3.value end)
from T1, T2, T3
where T1.ID = T2.ID and T1.ID *= T3.ID and T3.name in ('ABC','XYZ')
and T2.col1 = 'SOMEVALUE'
group by T1.col2, T2.col2
The only difference in the showplan between those 2 queries is the section below, for query 2). I don't understand it 100%: is it selecting the ENTIRE result set, WITHOUT the top 10000, into the temp table and then doing a group by on it? Is that why it's slow?
STEP 1
The type of query is SELECT (into Worktable1).
GROUP BY
Evaluate Grouped MINIMUM AGGREGATE.
FROM TABLE ...etc..
TO TABLE
Worktable1.
STEP 2
The type of query is SELECT.
FROM TABLE
Worktable1.
Nested iteration.
Table Scan.
Forward scan.
Positioning at start of table.
Using I/O Size 16 Kbytes for data pages.
With MRU Buffer Replacement Strategy for data pages.
My questions are:
1) Why is query 2) so slow?
2) How do I fix it while keeping the query logic the same, preferably still as just one SELECT as before?
thank you

Although possibly a generic answer, I'd say to put an index on the columns you're grouping by.
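A minimal sketch of what that could look like in ASE (index names are made up, and whether these actually help will depend on your joins and predicates):
create index T1_col2_idx on T1 (col2)
go
create index T2_col2_idx on T2 (col2)
go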
Edit / Revise: Here's my theory after re-looking at the issue. The SELECT statement in a query is effectively evaluated last, which makes sense, as it is the statement that retrieves the values you want from the data set specified below it. In your query, the whole data set (millions of records) has to be evaluated for the MIN expressions you specified; two separate aggregate functions are applied to the entire data set, since you have two MIN columns in the select list. Only after the data set has been filtered and the MIN columns have been determined are the top 10000 rows selected.
In a nutshell, you're applying two aggregate functions to millions of records. This will take a significant amount of time, especially with no indexes.
The solution for you would be to use a derived table. I haven't compiled the code below, but it's close to what you would use. It will only take the MIN values over the 10,000 records rather than the whole data set.
I.e.
Select my_derived_table.t1col2, my_derived_table.t2col2,
ABC = min(case when my_derived_table.t3name = 'ABC' then my_derived_table.t3value end),
XYZ = min(case when my_derived_table.t3name = 'XYZ' then my_derived_table.t3value end)
FROM
(Select top 10000 T1.col2 as t1col2,
T2.col2 as t2col2,
t3.name as t3name,
t3.value as t3value
from T1, T2, T3
where T1.ID = T2.ID
and T1.ID *= T3.ID
and T3.name in ('ABC','XYZ')
and T2.col1 = 'SOMEVALUE') my_derived_table
group by my_derived_table.t1col2, my_derived_table.t2col2

Related

Re-writing a join query

I have a question concerning Hive. Let me explain to you the scenario :
I am using a Hive action on Oozie; I have a query which is doing
successive LEFT JOINs on different tables;
Total number of rows to be inserted is about 35 million;
First, the job was crashing due to lack of memory, so I set "set hive.auto.convert.join=false"; the query then executed correctly, but it took 4 hours to finish;
I tried to rewrite the order of the LEFT JOINs, putting the large tables at the end, but I got the same result: about 4 hours to execute;
Here is what the query looks like:
INSERT OVERWRITE TABLE final_table
SELECT
T1.Id,
T1.some_field_name,
T1.another_field_name,
T2.also_another_field_name
FROM table1 T1
LEFT JOIN table2 T2 ON ( T2.Id = T1.Id ) -- T2 is the smallest table
LEFT JOIN table3 T3 ON ( T3.Id = T1.Id )
LEFT JOIN table4 T4 ON ( T4.Id = T1.Id ) -- T4 is the biggest table
So, knowing the structure of the query, is there a way to rewrite it so that I can avoid so many JOINs?
Thanks in advance
PS: Even vectorization gave me the same timing
Too long for a comment, will be deleted later.
(1) Your current query won't compile.
(2) You are not selecting anything from T3 and T4, which makes no sense.
(3) Changing the order of tables is not likely to have any impact with cost based optimizer.
(4) Basically I would suggest collecting statistics on the tables, specifically on the id columns, but in your case I have a feeling that id is not unique in more than one table.
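If you want to try (4), a minimal sketch of collecting statistics in Hive (using the table names from the query; add a partition spec if the tables are partitioned) would be:
ANALYZE TABLE table1 COMPUTE STATISTICS;
ANALYZE TABLE table1 COMPUTE STATISTICS FOR COLUMNS id;
-- repeat for table2, table3 and table4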
Add to your post the result of the following query:
select *
, case when cnt_1 = 0 then 1 else cnt_1 end
* case when cnt_2 = 0 then 1 else cnt_2 end
* case when cnt_3 = 0 then 1 else cnt_3 end
* case when cnt_4 = 0 then 1 else cnt_4 end as product
from (select id
,count(case when tab = 1 then 1 end) as cnt_1
,count(case when tab = 2 then 1 end) as cnt_2
,count(case when tab = 3 then 1 end) as cnt_3
,count(case when tab = 4 then 1 end) as cnt_4
from ( select 1 as tab,id from table1
union all select 2 as tab,id from table2
union all select 3 as tab,id from table3
union all select 4 as tab,id from table4
) t
group by id
having greatest (cnt_1,cnt_2,cnt_3,cnt_4) >= 10
) t
order by product desc
limit 10
;

How to use select with if condition with different tables in Oracle rather than writing a function or stored procedure?

My requirement is to get a report from a complex query using an if condition.
If a flag = 0 I must perform one set of select statements; if the flag = 1 I must perform another set of select statements from another table.
Is there any way I can achieve this in a single query rather than writing a function or stored procedure?
Eg:
In SQL I do this
if flag = 0
select var1, var2 from table1
else
select var1, var2, var3, var4 from table2
Is this possible?
There is no if in SQL - there is the case expression, but it is not quite the same thing.
If you have two tables, t1 and t2, and flag is in a scalar table t3 ("scalar" meaning exactly one column, flag, with exactly one row, whose value is either 0 or 1), you can do what you want, but only if t1 and t2 have the same number of columns with the same data types (and, although the syntax doesn't require it, this only makes sense if the columns in t1 and t2 have the same business meaning). Or, at least, if you plan to select only some columns from t1 or from t2, the columns you select from either table should be equal in number, have the same data types, and preferably the same business meaning.
For example: t1 and t2 may be employee tables, perhaps for two companies that just merged. If they both include first_name, last_name, date_of_birth and you just want to select these three columns from either t1 or t2 based on the flag value (even if t1 has other columns, not present in t2), you can do it. Same if t1 or t2 or both is not a single table, but the result of a more complicated query. The principle is the same.
The way you can do it is with a UNION ALL, like this:
select t1.col1, t1.col2, ...
from t1 cross join t3
where t3.flag = 0
UNION ALL
select t2.col1, t2.col2, ...
from t2 cross join t3
where t3.flag = 1
;
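For instance, with the employee example above (column names are purely illustrative), that pattern becomes:
select t1.first_name, t1.last_name, t1.date_of_birth
from t1 cross join t3
where t3.flag = 0
UNION ALL
select t2.first_name, t2.last_name, t2.date_of_birth
from t2 cross join t3
where t3.flag = 1
;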

hive agg asking for column in group by

I have a basic query (rewritten with vague names), and I do not understand why Hive is asking for the t2.description column in the case statement to be added to the group by. I appeased it and put it in, but of course I get a null value in that column for every row... If I take out the case statement and query the raw data I get all the lovely descriptions; only when I want to add some logic with the case statement does it fail. I am new to Hive and understand it is not ANSI SQL, but I did not imagine it would be this finicky.
select
t1.columnid as column_id,
(case when t2.description in ('description1','description2','description3') then t2.description else null end) as label_description
from table1 t1
left outer join table2 t2 on (t1.inresult = t2.inresult)
group by
t1.columnid
It's often difficult to understand the actual problem from the error logs produced by Hive's SQL parser. The problem here is that you are selecting two columns but only applying the GROUP BY to one of them. To make this query executable you must do one of the following:
Group by both column 1 and column 2
select t1.columnid as column_id,
       (case when t2.description in ('description1','description2','description3')
             then t2.description else null end) as label_description
from table1 t1
left outer join table2 t2 on (t1.inresult = t2.inresult)
group by t1.columnid,
         (case when t2.description in ('description1','description2','description3')
               then t2.description else null end);
Do not use a GROUP BY statement
select t1.columnid as column_id,
       (case when t2.description in ('description1','description2','description3')
             then t2.description else null end) as label_description
from table1 t1
left outer join table2 t2 on (t1.inresult = t2.inresult)
Apply an aggregate function to column 2
select t1.columnid as column_id,
       MIN(case when t2.description in ('description1','description2','description3')
                then t2.description else null end) as label_description
from table1 t1
left outer join table2 t2 on (t1.inresult = t2.inresult)
group by t1.columnid
For Hive, if you are using a GROUP BY then every column you select must either appear in the GROUP BY clause or be wrapped in an aggregate function such as MAX, MIN or SUM.

Oracle query taking too much time when I use rownum

If I execute the query below, it returns results very fast.
select * from
(select * from t1, t2 t3, t4 where ...(inner/outer join) group by ...)
order by create_date desc
However, if I use ROWNUM as below, it takes too much time.
select * from
(select * from
(select * from t1, t2 t3, t4 where ...(inner/outer join) group by ...)
order by create_date desc)
where rownum = 1
Could you please let me know why it is taking so much time? How can I get the latest-date record?
Do you actually see all of the result rows for the first query, or do you only see the first few rows without waiting for the last one?
I think that in the second query the inner query has to run to completion first, and only after that is the condition "rownum = 1" checked against all of the result rows.
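A minimal sketch of another common way to fetch just the latest-date row (not taken from the answer above; your_joined_query below is a stand-in for the original join/group-by) is to rank with an analytic function and filter on the rank:
select *
from (
  select t.*,
         row_number() over (order by create_date desc) as rn
  from your_joined_query t   -- stand-in for the original joined/grouped subquery
)
where rn = 1;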

Does Oracle re-hash the driving table for each join on the same table columns?

Say you've got the following query on 9i:
SELECT /*+ USE_HASH(t2 t3) */
* FROM
table1 t1 -- this has lots of rows
LEFT JOIN table2 t2 ON t1.col1 = t2.col1
AND t1.col2 = t2.col2
LEFT JOIN table3 t3 ON t1.col1 = t3.col1
AND t1.col2 = t3.col2
Due to 9i not having RIGHT OUTER HASH JOIN, it needs to hash table1 for both joins. Does it re-hash table1 between joining t2 and t3 (even though it's using the same join columns), or does it keep the same hash information for both joins?
It would need to rehash since the second hash would be table3 against the join of table1/table2 rather than against table1. Or vice versa.
For example, say TABLE1 had 100 rows, table2 had 50 and table3 had 10.
Joining table1 to table2 may give 500 rows. It then joins that result set to table3 to give (perhaps) 700 rows.
It won't do a join of table1 to table2, then a join of table1 to table3, then a join of those two intermediate results.
Look at the plan, it'll tell you the answer.
An example might be something like (I've just made this up):
SELECT
HASH JOIN
HASH JOIN
TABLE FULL SCAN table1
TABLE FULL SCAN table2
TABLE FULL SCAN table3
This sample plan involves a scan through table1, hashing its contents as it goes; then a scan through table2, hashing the result of that join into a second hash; and finally a scan through table3.
There are other plans it could choose from.
If table1 is the biggest table, and the optimizer knows this (due to stats), it probably won't drive from it though.
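To see which plan the optimizer actually chooses for the hinted query, a minimal sketch (assuming 9i Release 2 or later, where DBMS_XPLAN is available; on earlier releases you can query PLAN_TABLE directly):
explain plan for
SELECT /*+ USE_HASH(t2 t3) */ *
FROM table1 t1
LEFT JOIN table2 t2 ON t1.col1 = t2.col1 AND t1.col2 = t2.col2
LEFT JOIN table3 t3 ON t1.col1 = t3.col1 AND t1.col2 = t3.col2;

select * from table(dbms_xplan.display);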
