Transforming hive IN subselect query combined with WHERE replacement - hadoop

I know that one needs to replace IN query with semi-left-join (e.g. Hive doesn't support in, exists. How do I write the following query?), but I don't know how to combine it with a WHERE clause:
SELECT *
from foo
WHERE userId IN
(SELECT distinct(userId) FROM foo WHERE x=true ORDER BY RAND() LIMIT 100);
thanks.
EDIT: Changed query. Intention is to create a random sample of entries (statistics wise).

(Posting alternative approach for completeness.)
To sample a set of records from a table, you can use Hive's TABLESAMPLE syntax. For example, too select a random sample of 100 distinct userId's you would use:
SELECT userId
FROM (SELECT DISTINCT(userId) as userId FROM foo) f
TABLESAMPLE(100 ROWS);
The syntax allows you to specify your sample size in different ways. The following is also valid:
SELECT userId
FROM (SELECT DISTINCT(userId) as userId FROM foo) f
TABLESAMPLE(1 PERCENT);
For more details, check out the manual page for this topic.
Once you have your sample of userId's, you can use Manuel Aldana's earlier answer to select the corresponding records from your original table.

select id from foo
left semi join
(SELECT id_2 FROM bar WHERE x=true RAND() LIMIT 100) x
ON foo.id=x.id_2
Should be like this.
I just don't understand this part : x=true RAND()
Also, this doesn't handle nulls just like your query.

Related

Oracle tuning for query with query annidate

i am trying to better a query. I have a dataset of ticket opened. Every ticket has different rows, every row rappresent an update of the ticket. There is a field (dt_update) that differs it every row.
I have this indexs in the st_remedy_full_light.
IDX_ASSIGNMENT (ASSIGNMENT)
IDX_REMEDY_INC_ID (REMEDY_INC_ID)
IDX_REMDULL_LIGHT_DTUPD (DT_UPDATE)
Now, the query is performed in 8 second. Is high for me.
WITH last_ticket AS
( SELECT *
FROM st_remedy_full_light a
WHERE a.dt_update IN
( SELECT MAX(dt_update)
FROM st_remedy_full_light
WHERE remedy_inc_id = a.remedy_inc_id
)
)
SELECT remedy_inc_id, ASSIGNMENT FROM last_ticket
This is the plan
How i could to better this query?
P.S. This is just a part of a big query
Additional information:
- The table st_remedy_full_light contain 529.507 rows
You could try:
WITH last_ticket AS
( SELECT remedy_inc_id, ASSIGNMENT,
rank() over (partition by remedy_inc_id order by dt_update desc) rn
FROM st_remedy_full_light a
)
SELECT remedy_inc_id, ASSIGNMENT FROM last_ticket
where rn = 1;
The best alternative query, which is also much easier to execute, is this:
select remedy_inc_id
, max(assignment) keep (dense_rank last order by dt_update)
from st_remedy_full_light
group by remedy_inc_id
This will use only one full table scan and a (hash/sort) group by, no self joins.
Don't bother about indexed access, as you'll probably find a full table scan is most appropriate here. Unless the table is really wide and a composite index on all columns used (remedy_inc_id,dt_update,assignment) would be significantly quicker to read than the table.

HIVE equivalent of FIRST and LAST

I have a table with 3 columns:
table1: ID, CODE, RESULT, RESULT2, RESULT3
I have this SAS code:
data table1
set table1;
BY ID, CODE;
IF FIRST.CODE and RESULT='A' THEN OUTPUT;
ELSE IF LAST.CODE and RESULT NE 'A' THEN OUTPUT;
RUN;
So we are grouping the data by ID and CODE, and then writing to the dataset if certain conditions are met. I want to write a hive query to replicate this. This is what I have:
proc sql;
create table temp as
select *, row_number() over (partition by ID, CODE) as rowNum
from table1;
create table temp2 as
select a.ID, a.CODE, a.RESULT, a.RESULT2, a.RESULT3
from temp a
inner join (select ID, CODE, max(rowNum) as maxRowNum
from temp
group by ID, CODE) b
on a.ID=b.ID and a.CODE=b.CODE
where (a.rowNum=1 and a.RESULT='A') or (a.rowNum=b.maxRowNum and a.RESULT NE 'A');
quit;
There are two issues I see with this.
1) The row that is first or last in each BY group is entirely dependant on the order of rows in table1 in SAS, we aren't ordering by anything. I don't think row order is preserved when translating to a hive query.
2) The SAS code is taking the first row in each BY GROUP or the last, not both. I think that my HIVE query is taking both, resulting in more rows than I want.
Any suggestions or insight on how to improve my query is appreciated. Is it even possible to replicate this SAS code in HIVE?
The SAS code has a by statement (BY ID CODE;), which tells SAS that the set dataset is sorted at those levels. So, not a random selection for first. and last..
That said, we can replicate this in HIVE by using the first_value and last_value window functions.
FIRST.CODE should replicate to
first_value(code) over (partition by Id order by code)fcode
Similarly, LAST.CODE would be
last_value(code) over (partition by Id order by code)lcode
Once you have the fcode and lcode columns, use case when statements for the result column criteria. Like,
case when (code=fcode and result='A') or (code=lcode and result<>'A')
then 1 else 0 end as op_flag
Then the fetch the table with where op_flag = 1
SAMPLE
select id, code, result from (
select *,
first_value(code) over (partition by id order by code)fcode,
last_value(code) over (partition by id order by code)lcode
from footab) f
where (code=fcode and result='A') or (code=lcode and result<>'A')
Regarding point 1) the BY group processing requires the input data to be sorted or indexed on BY variables, so though the code contains no ordering, the source data is processed in order. If the input data was not indexed/sorted, SAS will throw error.
Regarding this, possible differences are on rows with same values of BY variables, especially if the RESULT is different.
In SAS, I would pre-sort data by ID, CODE, RESULT, then use BY ID CODE in order to not be influenced by order of rows.
Regarding 2) FIRST and LAST can be both true in SAS. Since your condition for first and last on RESULT is different, I guess this is not a source of differences.
I guess you could add another field as
row_number() over (partition by ID, CODE desc) as rowNumDesc
to detect last row with rowNumDesc = 1 (so that you skip the join).
EDIT:
I think the two programs above both include random selection of rows for groups with same values of ID and CODE variables, especially with same values of RESULT. But you should get same number of rows from both. If not, just debug it.
However the random aspect in SAS code/storage is based on physical order of rows, while the ROW_NUMBERs randomness within a group will be influenced by the implementation of the function in the engine.

Comparing Similar Hive Tables

I have two hive tables (t1 and t2) that I would like to compare. The second table has 5 additional columns that are not in the first table. Other than the five disjoint fields, the two tables should be identical. I am trying to write a query to check this. Here is what I have so far:
SELECT * FROM t1
UNION ALL
select * from t2
GROUP BY some_value
HAVING count(*) == 2
If the tables are identical, this should return 0 records. However, since the second table contains 5 extra fields, I need to change the second select statement to reflect this. There are almost 60 column names so I would really hate to write it like this:
SELECT * FROM t1
UNION ALL
select field1, field2, field3,...,fieldn from t2
GROUP BY some_value
HAVING count(*) == 2
I have looked around and I know there is no select * EXCEPT syntax, but is there a way to do this query without having to explicity name each column that I want included in the final result?
You should have used UNION DISTINCT for the logic you are applying.
However, the number and names of columns returned by each select_statement have to be the same otherwise a schema error is thrown.
You could have a look at this Python program that handles such comparisons of Hive tables (comparing all the rows and all the columns), and would show you in a webpage the differences that might appear: https://github.com/bolcom/hive_compared_bq
To skip the 5 extra fields, you could use the "--ignore-columns" option.

Selecting data from one table or another in multiple queries PL/SQL

The easiest way to ask my question is with a Hypothetical Scenario.
Lets say we have 3 tables. Singapore_Prices, Produce_val, and Bosses_unreasonable_demands.
So Prices is a pretty simple table. Item column containing a name, and a Price column containing a number.
Produce_Val is also simple 2 column table. Type column containing what type the produce is (Fruit or veggie) and then Name column (Tomato, pineapple, etc.)
The Bosses_unreasonable_demands only contains one column, Fruit, which CAN contain the names of some fruits.
OK? Ok.
SO, My boss wants me to write a query that returns the prices for every fruit in his unreasonable demands table. Simple enough. BUT, if he doesn't have any entries in his table, he just wants me to output the prices of ALL fruits that exist in produce_val.
Now, assuming I don't know where the DBA who designed this silly hypothetical system lives (and therefore can't get him to fix this), our query would look like this:
if <Logic to determine if Bosses demands are empty>
Then
select Item, Price
from Singapore_Prices
where Item in (select Fruit from Bosses_Unreasonable_demands)
Else
select Item, Price
from Singapore_Prices
where Item in (select Name from Produce_val where type = 'Fruit')
end if;
(Well, we'd select those into a variable, and then output the variable, probably with bulk-collect shenanigans, but that's not important)
Which works. It is entirely functional, and won't be slow, even if we extend it out to 2000 other stores other than Singapore. (Well, no slower than anything else that touches 2000 some tables) BUT, I'm still doing two different select statements that are practically identical. My Comp Sci teacher rolls in their grave every time my fingers hit ctrl-V. I can cut this code in half and only do one select statement. I KNOW I can.
I just have no earthly idea how. I can't use cursors as an in statement, I can't use nested tables or varrays, I can't use cleverly crafted strings, I... I just... I don't know. I don't know how to do this. Is there a way? Does it exist?
Or do I have to copy/paste forever?
Your best bet would be dynamic SQL, because you can't parameterize table or column names.
You will have a SQL query template, have a logic to determine tables and columns that you want to query, then blend them together and execute.
Another aproach, (still a lot of ctrl-v like code) is to use set construction UNION ALL:
select 1st query where boss_condition
union all
select 2nd query where not boss_condition
Try this:
SELECT *
FROM (SELECT s.*, 'BOSS' AS FRUIT_SOURCE
FROM BOSSES_UNREASONABLE_DEMANDS b
INNER JOIN SINGAPORE_FRUIT_LIST s
ON s.ITEM = b.FRUIT
CROSS JOIN (SELECT COUNT(*) AS BOSS_COUNT
FROM BOSSES_UNREASONABLE_DEMANDS)) x
UNION ALL
(SELECT s.*, 'NORMAL' AS FRUIT_SOURCE
FROM PRODUCE_VAL p
INNER JOIN SINGAPORE_FRUIT_LIST s
ON (s.ITEM = p.NAME AND
s.TYPE = 'Fruit')
CROSS JOIN (SELECT COUNT(*) AS BOSS_COUNT
FROM BOSSES_UNREASONABLE_DEMANDS)) n
WHERE (BOSS_COUNT > 0 AND FRUIT_SOURCE = 'BOSS') OR
(BOSS_COUNT = 0 AND FRUIT_SOURCE = 'NORMAL')
Share and enjoy.
I think you can use nested tables. Assume you have a schema-level nested table type FRUIT_NAME_LIST (defined using CREATE TYPE).
SELECT fruit
BULK COLLECT INTO my_fruit_name_list
FROM bosses_unreasonable_demands
;
IF my_fruit_name_list.count = 0 THEN
SELECT name
BULK COLLECT INTO my_fruit_name_list
FROM produce_val
WHERE type='Fruit'
;
END IF;
SELECT item, price
FROM singapore_prices
WHERE item MEMBER OF my_fruit_name_list
;
(or, WHERE item IN (SELECT column_value FROM TABLE(CAST(my_fruit_name_list AS fruit_name_list)) if you like that better)

Oracle Error : Maximum number of expressions in a list is 1000

I am working in C#.Net and Oracle. i am passing a string to a query. i had used this code for concating all the item id's
List<string> listRetID = new List<string>();
foreach (DataRow row in dtNew.Rows)
{
listRetID.Add(row[3].ToString());
}
This concatination goes above 10,000. so i am getting the error message like this..
ORA-01795: maximum number of expressions in a list is 1000
How to fix this..
The documentation states:
A comma-delimited list of expressions can contain no more than 1000
expressions. A comma-delimited list of sets of expressions can contain
any number of sets, but each set can contain no more than 1000
expressions.
Presumably you're using this string as the contents of in IN (...) restriction, in which case there isn't really anything you can do - this just won't work. A common way to work around this is to generate a dummy table as a subquery or common table expression (CTE) and joining to that, but I'm not sure how you'd translate your List - possibly similar to whatever you're doing with your IN clause. You'd want to end up with your query looking something like:
with tmp_tab as (
select <val1 from list> as val from dual
union all select <val2 from list from dual
union all select <val3 from list from dual
...
)
select <something>
from <your table> yt
join tmp_tab tt on yt.<field> = tt.val
But that requires generating the entire (huge) query including the CTE each time you run it, and there's no opportunity to use bind variables.
You might find something like this approach more palatable.
You can have 10 lists of 1000 items instead of 1 list of 10000 items.
WHERE some_column IN (1,2,...,1000)
OR some_column IN (1001,1002,...2000) -- etc.
Not a C# guy but I would just split the list listRetID in multiple lists or create a list of lists
Then loop through that list of lists and perform the query on each element of the list.
What is the intent of your query?
It looks like you are selecting rows that have some column equal to the 3rd column of one of the records of some query.
The correct way of doing this is either an SQL join or a subquery. There is absolutely no need to bring this into C# code. For example, using a subquery you can write something like this:
SELECT *
FROM atable
WHERE afield IN (
SELECT field3
FROM someothertable)

Resources