hive agg asking for column in group by - hadoop

I have a basic query (rewritten with vague names), and I do not understand why Hive is asking for the t2.description column in the case statement to be added to the group by. I appeased it and put it in, but of course I get a null value for that column for every row... If I take out the case statement and query the raw data I get all the lovely descriptions; it only fails when I add some logic with the case statement. I am new to Hive and understand it is not ANSI SQL, but I did not imagine it to be this finicky.
select
t1.columnid as column_id,
(case when t2.description in ('description1','description2','description3') then t2.description else null end) as label_description
from table1 t1
left outer join table2 t2 on (t1.inresult = t2.inresult)
group by
t1.columnid

It's often difficult to understand the actual problem from the error messages produced by Hive's SQL parser. The problem here is that you are selecting two columns but grouping by only one of them. To make this query executable you must do one of the following:
Group by both column 1 and column 2
select t1.columnid as column_id,
(case when t2.description in ('description1','description2','description3') then t2.description else null end) as label_description
from table1 t1
left outer join table2 t2 on (t1.inresult = t2.inresult)
group by t1.columnid,
(case when t2.description in ('description1','description2','description3') then t2.description else null end);
Do not use a GROUP BY statement
select t1.columnid as column_id,
(case when t2.description in ('description1','description2','description3') then t2.description else null end) as label_description
from table1 t1
left outer join table2 t2 on (t1.inresult = t2.inresult)
Apply an aggregate function to column 2
select t1.columnid as column_id,
MIN(case when t2.description in ('description1','description2','description3') then t2.description else null end) as label_description
from table1 t1
left outer join table2 t2 on (t1.inresult = t2.inresult)
group by t1.columnid
For Hive, if you are using a GROUP BY then every column you select must either appear in the GROUP BY clause or be wrapped in an aggregate function such as MAX, MIN or SUM.
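To see how options 1 and 3 differ in the rows they return, here is a minimal sketch using Python's built-in sqlite3 as a stand-in for Hive. The sample rows are made up for illustration, and SQLite is more permissive than Hive about ungrouped columns, so this does not reproduce the original error; it only shows the shape of the two results.

```python
import sqlite3

# SQLite stands in for Hive here; the sample data is invented for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (columnid INTEGER, inresult TEXT);
    CREATE TABLE table2 (inresult TEXT, description TEXT);
    INSERT INTO table1 VALUES (1, 'a'), (1, 'b'), (2, 'c');
    INSERT INTO table2 VALUES ('a', 'description1'), ('b', 'other'), ('c', 'other');
""")

LABEL = """CASE WHEN t2.description IN ('description1', 'description2', 'description3')
                THEN t2.description ELSE NULL END"""

# Option 1: group by both expressions, giving one row per (column_id, label) pair.
grouped_by_both = conn.execute(f"""
    SELECT t1.columnid AS column_id, {LABEL} AS label_description
    FROM table1 t1
    LEFT OUTER JOIN table2 t2 ON t1.inresult = t2.inresult
    GROUP BY t1.columnid, {LABEL}
""").fetchall()

# Option 3: aggregate the label, giving one row per column_id.
# MIN skips NULLs, so a matching description wins over a non-matching one.
aggregated = conn.execute(f"""
    SELECT t1.columnid AS column_id, MIN({LABEL}) AS label_description
    FROM table1 t1
    LEFT OUTER JOIN table2 t2 ON t1.inresult = t2.inresult
    GROUP BY t1.columnid
    ORDER BY column_id
""").fetchall()

print(grouped_by_both)
print(aggregated)
```

Grouping by both expressions keeps the NULL label as its own group, while the aggregate collapses each column_id to a single row.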

Related

how to select specific columns from three different tables in Oracle SQL

I am trying to select values from three different tables.
When I select all columns it works well, but if I select specific columns, the error SQL Error [42000]: JDBC-8027:Column name is ambiguous. appears.
This is the query that selects everything and works well:
SELECT
*
FROM (SELECT x.*, B.*,C.* , COUNT(*) OVER (PARTITION BY x.POLICY_NO) policy_no_count
FROM YIP.YOUTH_POLICY x
LEFT JOIN
YIP.YOUTH_POLICY_AREA B
ON x.POLICY_NO = B.POLICY_NO
LEFT JOIN
YIP.YOUTH_SMALL_CATEGORY C
ON B.SMALL_CATEGORY_SID = C.SMALL_CATEGORY_SID
ORDER BY x.POLICY_NO);
and this is the query that raises the error:
SELECT DISTINCT
x.POLICY_NO,
x.POLICY_TITLE,
policy_no_count ,
B.SMALL_CATEGORY_SID,
C.SMALL_CATEGORY_TITLE
FROM (SELECT x.*, B.*,C.* , COUNT(*) OVER (PARTITION BY x.POLICY_NO) policy_no_count
FROM YIP.YOUTH_POLICY x
LEFT JOIN
YIP.YOUTH_POLICY_AREA B
ON x.POLICY_NO = B.POLICY_NO
LEFT JOIN
YIP.YOUTH_SMALL_CATEGORY C
ON B.SMALL_CATEGORY_SID = C.SMALL_CATEGORY_SID
ORDER BY x.POLICY_NO);
What I am trying to do: when a POLICY_NO value appears in 18 or more rows (policy_no_count > 17 in the query below), I want to change the C.SMALL_CATEGORY_TITLE values to "ZZ" and the B.SMALL_CATEGORY_SID values to null.
That is why I wrapped one select inside another, like this:
SELECT DISTINCT
x.POLICY_NO,
CASE WHEN (policy_no_count > 17) THEN 'ZZ' ELSE C.SMALL_CATEGORY_TITLE END AS C.SMALL_CATEGORY_TITLE,
CASE WHEN (policy_no_count > 17) THEN NULL ELSE B.SMALL_CATEGORY_SID END AS B.SMALL_CATEGORY_SID,
x.POLICY_TITLE
FROM (SELECT x.*, B.*,C.* , COUNT(*) OVER (PARTITION BY x.POLICY_NO) policy_no_count
FROM YIP.YOUTH_POLICY x
LEFT JOIN
YIP.YOUTH_POLICY_AREA B
ON x.POLICY_NO = B.POLICY_NO
LEFT JOIN
YIP.YOUTH_SMALL_CATEGORY C
ON B.SMALL_CATEGORY_SID = C.SMALL_CATEGORY_SID
ORDER BY x.POLICY_NO);
If I use that query, I get SQL Error [42000]: JDBC-8006:Missing FROM keyword. at line 3, column 80 of null.
I know I should solve it step by step. Is there any way to select specific columns?
That's most probably because of SELECT x.*, B.*,C.* in the subquery. Avoid asterisks: explicitly name all the columns you need, and then pay attention to possible duplicate column names; if you have them, use column aliases.
For example, if that select (which is in a subquery) evaluates to
select x.id, x.name, b.id, b.name
then the outer query doesn't know which id you want, because two columns are named id (and two are named name), so you'd have to
select x.id as x_id,
x.name as x_name,
b.id as b_id,
b.name as b_name
from ...
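and, in the outer query, select not just id but e.g. x_id.
Note also that the table aliases x, B and C exist only inside the subquery, so the outer query cannot use those prefixes, and a column alias must be a plain identifier: AS C.SMALL_CATEGORY_TITLE is not valid, which is the likely cause of the Missing FROM keyword error.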
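The aliasing advice above can be tried out with a small sketch. It uses Python's built-in sqlite3 rather than the asker's database, and the tables x and b with their id and name columns are hypothetical, chosen to mirror the example select.

```python
import sqlite3

# Hypothetical tables mirroring the x.id / b.id example; not the asker's schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE x (id INTEGER, name TEXT);
    CREATE TABLE b (id INTEGER, name TEXT);
    INSERT INTO x VALUES (1, 'policy one');
    INSERT INTO b VALUES (1, 'area one');
""")

# Every column in the subquery gets a unique alias, so the outer query
# can refer to each one by a plain, unambiguous name.
rows = conn.execute("""
    SELECT x_id, x_name, b_name
    FROM (SELECT x.id   AS x_id,
                 x.name AS x_name,
                 b.id   AS b_id,
                 b.name AS b_name
          FROM x
          LEFT JOIN b ON x.id = b.id)
""").fetchall()

print(rows)
```

The outer select uses only the aliases; it never mentions x. or b., since those table aliases are not visible outside the subquery.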

getting error- ORA-00905: missing keyword

Select * from table1 t1 left outer join table2 t2 on t1.id=t2.id and
case
when t1.id in (select t2.id from table2)
then t1.valid_to_ts > sysdate and t2.valid_to_ts>sysdate
else
t1.valid_to_ts>sysdate.
getting error-
ORA-00905: missing keyword
You can use UNIONs to implement the logic of the case:
SELECT *
FROM table1 t1 LEFT JOIN table2 t2 ON t1.id = t2.id
WHERE t2.id IS NULL
AND t1.valid_to_ts > SYSDATE
UNION ALL
SELECT *
FROM table1 t1 LEFT JOIN table2 t2 ON t1.id = t2.id
WHERE t2.id IS NOT NULL
AND t1.valid_to_ts > SYSDATE
AND t2.valid_to_ts > SYSDATE;
As Álvaro González said in a comment, you can't use a case expression as a flow control operator. You can use a case expression in a where or on clause, but only by making it generate a value which you then compare with something else, which can be awkward, so it's usually better not to. You can usually replace the logic you're trying to implement with simple Boolean logic.
In this case you're already joining to table2 so your subquery isn't needed; you can use a where clause to see if a matching record was found - preferably checking for a not-null column other than ID, so this example will only work if valid_to_ts is not nullable:
select * from table1 t1
left outer join table2 t2
on t1.id = t2.id
where t1.valid_to_ts > sysdate
and (t2.valid_to_ts is null or t2.valid_to_ts > sysdate)
If t2.valid_to_ts is nullable then you should use a different not-nullable column instead; unless you want to include values with no valid-to date - but you aren't doing that for t1.valid_to_ts.
If you try to check that in the on clause then you won't filter out IDs which do exist in table2 but with an earlier date.
db<>fiddle, including an on-clause version which gets the wrong result (as far as I can tell from your starting query anyway).
For the demo I've assumed that the _ts columns are dates, since you're comparing with sysdate; if they are actually timestamps (as the suffix might suggest) then you could compare with systimestamp instead.
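The difference between filtering in the where clause and in the on clause can be reproduced with a small sketch. It uses Python's built-in sqlite3 instead of Oracle, ISO date strings instead of real dates, and a fixed NOW value standing in for sysdate; the sample rows are made up.

```python
import sqlite3

NOW = "2024-06-01"  # fixed stand-in for SYSDATE (assumption for the demo)

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (id INTEGER, valid_to_ts TEXT);
    CREATE TABLE table2 (id INTEGER, valid_to_ts TEXT NOT NULL);
    INSERT INTO table1 VALUES (1, '2025-01-01'), (2, '2025-01-01'),
                              (3, '2020-01-01'), (4, '2025-01-01');
    INSERT INTO table2 VALUES (1, '2025-01-01'), (2, '2020-01-01');
""")

# Where-clause version: id 2 is dropped because its table2 row has expired.
where_ids = [r[0] for r in conn.execute("""
    SELECT t1.id FROM table1 t1
    LEFT OUTER JOIN table2 t2 ON t1.id = t2.id
    WHERE t1.valid_to_ts > :now
      AND (t2.valid_to_ts IS NULL OR t2.valid_to_ts > :now)
    ORDER BY t1.id
""", {"now": NOW})]

# On-clause version: the expired table2 row simply fails to join,
# so id 2 survives with NULLs on the table2 side.
on_ids = [r[0] for r in conn.execute("""
    SELECT t1.id FROM table1 t1
    LEFT OUTER JOIN table2 t2 ON t1.id = t2.id AND t2.valid_to_ts > :now
    WHERE t1.valid_to_ts > :now
    ORDER BY t1.id
""", {"now": NOW})]

print(where_ids, on_ids)
```

Here id 4 has no table2 row at all and is kept by both versions; only id 2, which exists in table2 with an earlier date, tells the two apart.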

Improve performance of stored procedure where only select query is used

In our environment one procedure is taking a long time to execute. I have checked the procedure, and below is the summary -
The procedure contains only select blocks (around 24). Before each select we check whether data exists. If it does, we select the data; otherwise we do something else. For example:
-- Select block 1 --
IF EXISTS (SELECT 1 FROM table1 t1
INNER JOIN table2 t2
ON t1.col1=t2.col1
WHERE t1.col2='someValue' AND t2.col2='someValue'
)
BEGIN
SELECT t1.col1,t2.col2,t2.col3 FROM table1 t1
INNER JOIN table2 t2
ON t1.col1=t2.col1
WHERE t1.col2='someValue' AND t2.col2='someValue'
END
ELSE
BEGIN
SELECT 'DEFAULT1', 'DEFAULT2', 'DEFAULT3'
END
-- Select block 2 --
IF EXISTS (SELECT 1 FROM table1 t1
INNER JOIN table2 t2
ON t1.col1=t2.col1
WHERE t1.col5='someValue' AND t2.col5='someValue'
)
BEGIN
SELECT t1.col5,t2.col6,t2.col7 FROM table1 t1
INNER JOIN table2 t2
ON t1.col1=t2.col1
WHERE t1.col5='someValue' AND t2.col5='someValue'
END
ELSE
BEGIN
SELECT 'DEFAULT1', 'DEFAULT2', 'DEFAULT3'
END
I have come to the conclusion that if we can somehow combine the queries used within the IF EXISTS blocks into one query, and set some variables so that we can identify which where condition returns true, that could reduce the execution time and improve performance.
Is my thinking correct? Is there any way to do that? Can you suggest any other options?
We are using Microsoft SQL Server 2005.
[Edited: Added] - The select statements do not all return the same column types; they are different. And all select statements are required. If there are 24 if blocks, the procedure should return 24 result sets.
[Added]
I would like to ask one more thing, which one of the below runs faster -
SELECT 1 FROM table1 t1
INNER JOIN table2 t2
ON t1.col1=t2.col1
WHERE t1.col2='someValue' AND t2.col2='someValue'
SELECT COUNT(1) FROM table1 t1
INNER JOIN table2 t2
ON t1.col1=t2.col1
WHERE t1.col2='someValue' AND t2.col2='someValue'
SELECT TOP 1 1 FROM table1 t1
INNER JOIN table2 t2
ON t1.col1=t2.col1
WHERE t1.col2='someValue' AND t2.col2='someValue'
Thanks.
Kartic
To improve the performance of the select queries, create an index on the columns used in the where clauses. You are filtering with
WHERE t1.col2='someValue' AND t2.col2='someValue'
WHERE t1.col5='someValue' AND t2.col5='someValue'
so create database indexes on col2 and col5.
Temp table
You can use a temp table to store the result. Since you are running the same join 24 times, first store the result of the query below in the temp table (correct the syntax as required):
insert into temp_table (t1_col2, t2_col2, t1_col5, t2_col5)
SELECT t1.col2, t2.col2, t1.col5, t2.col5 FROM table1 t1
INNER JOIN table2 t2
ON t1.col1=t2.col1
Now use the temp table for checking
-- Select block 1 --
IF EXISTS (SELECT 1 FROM temp_table
WHERE t1_col2='someValue' AND t2_col2='someValue'
)
BEGIN
SELECT t1.col1,t2.col2,t2.col3 FROM table1 t1
INNER JOIN table2 t2
ON t1.col1=t2.col1
WHERE t1.col2='someValue' AND t2.col2='someValue'
END
-- Select block 2 --
IF EXISTS (SELECT 1 FROM temp_table
WHERE t1_col5='someValue' AND t2_col5='someValue'
)
BEGIN
SELECT t1.col5,t2.col6,t2.col7 FROM table1 t1
INNER JOIN table2 t2
ON t1.col1=t2.col1
WHERE t1.col5='someValue' AND t2.col5='someValue'
END
The current structure is not very efficient - you effectively have to execute each "if" statement (which will be expensive), and then repeat the same where clause (the expensive bit) if the "if" returns true. And you do this 24 times. Worst case (all the queries return data), you're doubling the time for the query.
You say you've checked for indexing - given that each query appears to be subtly different, it would be worth double checking this.
The obvious thing is to refactor the application to execute the 24 select statements, and deal with the fact that sometimes, they don't return any data. That's a fairly large refactoring, and I assume you've considered that...
If you can't do that, consider a less ambitious (though nastier) refactoring. Instead of checking whether data exists, and either returning it or an equivalent default result set, write it as a union:
SELECT t1.col1,t2.col2,t2.col3 FROM table1 t1
INNER JOIN table2 t2
ON t1.col1=t2.col1
WHERE t1.col2='someValue' AND t2.col2='someValue'
UNION
SELECT 'DEFAULT1', 'DEFAULT2', 'DEFAULT3'
This reduces the number of times you're hitting the where clause, but means your client application must filter out the "default" data.
To answer your final question, I'd run it through the query optimizer and look at the execution plan - but I'd imagine that the first version is fastest - the query can complete as soon as it finds the first record that matches the where criteria. The second version must find all records that match and count them; the final version must find all records and select the first one.
You could outer-join the results of a query to a row of default values, then fall back to the defaults when the query's results are empty:
SELECT
col1 = COALESCE(query.col1, defaults.col1),
col2 = COALESCE(query.col2, defaults.col2),
col3 = COALESCE(query.col3, defaults.col3)
FROM
(SELECT 'DEFAULT1', 'DEFAULT2', 'DEFAULT3') AS defaults (col1, col2, col3)
LEFT JOIN
(
SELECT t1.col1, t2.col2, t2.col3
FROM table1 t1
INNER JOIN table2 t2
ON t1.col1=t2.col1
WHERE t1.col2='someValue' AND t2.col2='someValue'
) query
ON 1=1 -- i.e. join all the rows unconditionally
;
The method may not suit you in exactly that form if the subquery may actually return NULLs that must not be replaced with default values. In that case, have the subqueries return a flag column (just any value). If that column evaluates to NULL in the final query, that can only mean that the subquery hasn't returned rows. You can use that fact in a CASE expression like this:
SELECT
col1 = CASE WHEN query.HasRows IS NULL THEN defaults.col1 ELSE query.col1 END,
col2 = CASE WHEN query.HasRows IS NULL THEN defaults.col2 ELSE query.col2 END,
col3 = CASE WHEN query.HasRows IS NULL THEN defaults.col3 ELSE query.col3 END
FROM
(SELECT 'DEFAULT1', 'DEFAULT2', 'DEFAULT3') AS defaults (col1, col2, col3)
LEFT JOIN
(
SELECT HasRows = 1, t1.col1, t2.col2, t2.col3
FROM table1 t1
INNER JOIN table2 t2
ON t1.col1=t2.col1
WHERE t1.col2='someValue' AND t2.col2='someValue'
) query
ON 1=1
;
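The flag-column variant can be exercised end to end with a small sketch. It runs on Python's built-in sqlite3 rather than SQL Server, so the T-SQL `col1 = ...` aliases and the `AS defaults (col1, col2, col3)` column list are rewritten with plain `AS` aliases; the sample rows and the has_rows name are assumptions for the demo.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (col1 INTEGER, col2 TEXT);
    CREATE TABLE table2 (col1 INTEGER, col2 TEXT, col3 TEXT);
    INSERT INTO table1 VALUES (1, 'a');
    INSERT INTO table2 VALUES (1, 'a', NULL);  -- col3 is a genuine NULL
""")

# A one-row defaults set is left-joined to the real query; has_rows is NULL
# only when the real query returned nothing, so genuine NULLs are preserved.
QUERY = """
SELECT
    CASE WHEN q.has_rows IS NULL THEN d.col1 ELSE q.col1 END AS col1,
    CASE WHEN q.has_rows IS NULL THEN d.col2 ELSE q.col2 END AS col2,
    CASE WHEN q.has_rows IS NULL THEN d.col3 ELSE q.col3 END AS col3
FROM (SELECT 'DEFAULT1' AS col1, 'DEFAULT2' AS col2, 'DEFAULT3' AS col3) AS d
LEFT JOIN (
    SELECT 1 AS has_rows, t1.col1, t2.col2, t2.col3
    FROM table1 t1
    INNER JOIN table2 t2 ON t1.col1 = t2.col1
    WHERE t1.col2 = :v AND t2.col2 = :v
) AS q ON 1 = 1
"""

found = conn.execute(QUERY, {"v": "a"}).fetchall()      # real row, NULL kept
missing = conn.execute(QUERY, {"v": "zzz"}).fetchall()  # falls back to defaults

print(found)
print(missing)
```

With COALESCE instead of the flag, the NULL in col3 of the matching row would have been replaced by 'DEFAULT3', which is the case the flag column is meant to avoid.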

How to optimize this SELECT with sub query Oracle

Here is my query,
SELECT ID As Col1,
(
SELECT VID FROM TABLE2 t
WHERE (a.ID=t.ID or a.ID=t.ID2)
AND t.STARTDTE =
(
SELECT MAX(tt.STARTDTE)
FROM TABLE2 tt
WHERE (a.ID=tt.ID or a.ID=tt.ID2) AND tt.STARTDTE < SYSDATE
)
) As Col2
FROM TABLE1 a
Table1 has 48850 records and Table2 has 15944098 records.
I have separate indexes in TABLE2 on ID,ID & STARTDTE, STARTDTE, ID, ID2 & STARTDTE.
The query is still too slow. How can this be improved? Please help.
I'm guessing that the OR in the inner queries is interfering with the optimizer's ability to use indexes. I also wouldn't recommend a solution that scans all of TABLE2, given its size.
This is why in this case I would suggest using a function that efficiently retrieves the information you are looking for (two index scans per call):
CREATE OR REPLACE FUNCTION getvid(p_id table1.id%TYPE)
RETURN table2.vid%TYPE IS
l_result table2.vid%TYPE;
BEGIN
SELECT vid
INTO l_result
FROM (SELECT vid, startdte
FROM (SELECT vid, startdte
FROM table2 t
WHERE t.id = p_id
AND t.startdte < SYSDATE
ORDER BY t.startdte DESC)
WHERE rownum = 1
UNION ALL
SELECT vid, startdte
FROM (SELECT vid, startdte
FROM table2 t
WHERE t.id2 = p_id
AND t.startdte < SYSDATE
ORDER BY t.startdte DESC)
WHERE rownum = 1
ORDER BY startdte DESC)
WHERE rownum = 1;
RETURN l_result;
END;
Your SQL would become:
SELECT ID As Col1,
getvid(a.id) vid
FROM TABLE1 a
Make sure you have indexes on both table2(id, startdte DESC) and table2(id2, startdte DESC). The order of the index is very important.
Possibly try the following, though untested.
WITH max_times AS
(SELECT a.ID, MAX(t.STARTDTE) AS Startdte
FROM TABLE1 a, TABLE2 t
WHERE (a.ID=t.ID OR a.ID=t.ID2)
AND t.STARTDTE < SYSDATE
GROUP BY a.ID)
SELECT b.ID As Col1, tt.VID
FROM TABLE1 b
LEFT OUTER JOIN max_times mt
ON (b.ID = mt.ID)
LEFT OUTER JOIN TABLE2 tt
ON ((mt.ID=tt.ID OR mt.ID=tt.ID2)
AND mt.startdte = tt.startdte)
You can look at analytic functions to avoid having to hit the second table twice. Something like this might work:
SELECT id AS col1, vid
FROM (
SELECT t1.id, t2.vid, RANK() OVER (PARTITION BY t1.id ORDER BY
CASE WHEN t2.startdte < TRUNC(SYSDATE) THEN t2.startdte ELSE null END
DESC NULLS LAST) AS rn
FROM table1 t1
JOIN table2 t2 ON t1.id IN (t2.id, t2.id2)
)
WHERE rn = 1;
The inner select gets the id and vid values from the two tables with a simple join on id or id2. The rank function calculates a ranking for each matching row in the second table based on the startdte. It's complicated a bit by you wanting to filter on that date, so I've used a case to effectively ignore any dates today or later by changing the evaluated value to null, and in this instance that means the order by in the over clause needs nulls last so they're ignored.
I'd suggest you run the inner select on its own first - maybe with just a couple of id values for brevity - to see what it's doing, and what ranks are being allocated.
The outer query is then just picking the top-ranked result for each id.
You may still get duplicates though; if table2 has more than one row for an id with the same startdte they'll get the same rank, but then you may have had that situation before. You may need to add more fields to the order by to break ties in a way that makes sense to you.
But this is largely speculation without being able to see where your existing query is actually slow.

left outer join on nullable field with between in join condition (Oracle)

I have two tables: table1 with fields c1 and dt (nullable), and table2 with fields start_dt, end_dt and wk_id. I need to perform a left outer join between table1 and table2 to pick up the wk_id for which dt falls between start_dt and end_dt. I applied the following condition, but some wk_id values that shouldn't be NULL come back NULL and some rows get repeated.
where nvl(t1.dt,'x') between nvl(t2.start_dt(+), 'x') and nvl(t2.end_dt(+), 'x');
What is wrong with the condition?
select *
from table1 t1
left join table2 t2
on t1.dt between t2.start_dt and t2.end_dt
I recommend you try the new ANSI join syntax.
Also, are you just using 'x' as an example? Or are the dt columns really stored as strings?
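As a rough check of how the ANSI version treats a NULL dt, here is a minimal sketch on Python's built-in sqlite3 rather than Oracle. The dates are stored as ISO strings and the sample rows are invented for the demo.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (c1 TEXT, dt TEXT);
    CREATE TABLE table2 (start_dt TEXT, end_dt TEXT, wk_id INTEGER);
    INSERT INTO table1 VALUES ('a', '2024-01-03'), ('b', NULL), ('c', '2024-02-01');
    INSERT INTO table2 VALUES ('2024-01-01', '2024-01-07', 1),
                              ('2024-01-08', '2024-01-14', 2);
""")

# The range test lives in the ON clause, so every table1 row is kept once;
# rows with a NULL dt or a dt outside every range get a NULL wk_id.
rows = conn.execute("""
    SELECT t1.c1, t2.wk_id
    FROM table1 t1
    LEFT JOIN table2 t2
      ON t1.dt BETWEEN t2.start_dt AND t2.end_dt
    ORDER BY t1.c1
""").fetchall()

print(rows)
```

No nvl wrapping is needed here: a NULL dt never satisfies BETWEEN, so the row is preserved by the left join with wk_id left as NULL instead of matching by accident.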
It seems you are missing the part "table1 left outer join table2 on table1.some_field = table2.some_field"
Something like this:
select t1.c1, t1.dt, t2.start_dt, t2.end_dt, t2.wk_id
from table1 t1 left outer join table2 t2
on t1.some_field1 = t2.some_field1
where nvl(t1.dt,'x')
between nvl(t2.start_dt, 'x') and
nvl(t2.end_dt, 'x')
