I have a table abcd in Oracle DB
+-------------+----------+
| abcd.speed | abcd.ab |
+-------------+----------+
| 4.0 | 2 |
| 4.0 | 2 |
| 7.0 | 2 |
| 7.0 | 2 |
| 8.0 | 1 |
+-------------+----------+
And I'm using a query like this:
select min(speed) keep (dense_rank last order by abcd.ab NULLS FIRST) MOD from abcd;
I'm trying to convert the code to Hive, but it looks like keep is not available in Hive.
Could you suggest an equivalent statement?
select -max(struct(ab,-speed)).col2 as mod
from abcd
;
+------+
| mod |
+------+
| 4.0 |
+------+
Let's start by explaining min(speed) keep (dense_rank last order by abcd.ab NULLS FIRST):
Find the row(s) with the max value of ab.
For this/those row(s), find the min value of speed.
We are using 2 tricks here.
The 1st is based on the ability to get the max value of a struct.
max(struct(c1,c2,c3,...)) returns the same result as if you have sorted the structs by c1, then by c2, then by c3 etc. and then chose the last element.
The 2nd trick is to use -speed (which is the same as -1*speed).
Finding the max of -speed and then negating that value (which gives us speed) is the same as finding the min of speed.
If we ordered the structs, they would look like this (since 2 is bigger than 1 and -4 is bigger than -7):
+----+--------+
| ab | -speed |
+----+--------+
| 1 | -8.0 |
| 2 | -7.0 |
| 2 | -7.0 |
| 2 | -4.0 |
| 2 | -4.0 |
+----+--------+
The last struct in this case is struct(2,-4.0), so this is the result of the max function.
The field names of a struct are col1, col2, col3 etc., so
struct(2,-4.0).col2 is -4.0, and preceding it with a minus sign (which is the same as multiplying it by -1), as in -struct(2,-4.0).col2, gives 4.0.
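The two tricks can also be checked outside Hive. The following is only a rough Python sketch: it assumes that Python tuples, which compare field by field, behave like the structs described above, and it does not model NULL handling (the sample data has no NULLs).

```python
# Rows of the sample table as (speed, ab) pairs.
rows = [(4.0, 2), (4.0, 2), (7.0, 2), (7.0, 2), (8.0, 1)]

# Tuples compare field by field, like the structs: first by ab, then by -speed.
best_ab, neg_speed = max((ab, -speed) for speed, ab in rows)

# Negating again turns "max of -speed" back into "min of speed".
mod = -neg_speed
print(best_ab, mod)  # 2 4.0
```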
I have a query like this:
SELECT * FROM TEST1 LEFT OUTER JOIN TEST2 on TEST1.ID=TEST2.ID
UNION ALL
SELECT * FROM TEST3 LEFT OUTER JOIN TEST4 on TEST3.ID=TEST4.ID;
The behavior I see is that it first joins the TEST1 and TEST2 tables (billions of rows) and stores the output in the temp tablespace. Then it joins TEST3 and TEST4 and saves the output in the same temp table. Finally, it selects the records from there to display the result.
I see this behavior in both Redshift and Oracle. I was wondering why it stores the result in temporary segments after getting the result of the first SELECT. It is time-consuming and eats up temp space. Can't it start displaying the result as soon as the 1st SELECT finishes and then move on to the 2nd one (instead of storing it)?
This answer is somewhat speculative, because I don't have an Oracle doc reference. By inspection, we can imagine instead that you wanted to run the following query:
SELECT * FROM TEST1 JOIN TEST2
UNION ALL
SELECT * FROM TEST3 JOIN TEST4
ORDER BY some_col;
It should be clear that to apply an operation like ORDER BY, which works on the whole result set, all the records returned from the union query need to be in one logical place. A temp table would seem to work.
That you are not using ORDER BY does not appear to affect the workflow which Oracle is using.
I can also add another reason why Oracle is insisting on using a temp table here. Suppose it would be possible to write both halves of the union directly to the buffer. But what would happen if, at a later date, the size of the total union query suddenly exceeded what the buffer can hold? The answer is that your database would crash. So, using a temp table is a safe bet which should generally always work.
How do you observe this behaviour? Are you by any chance performing an INSERT or CREATE TABLE? That would explain your observation, because in that case all rows are required at the end.
The same could be observed if your client has a fetch all rows option set.
But in the normal case, where the client is interested in only the first few rows, Oracle quickly returns the first available rows (one array size) from the first join and ignores the second one.
You may perform this little Gedankenexperiment:
create table test1 as
select rownum id,
lpad('x',1023,'X') pad
from dual connect by level <= 1000000;
Create tables TEST2 to TEST4 analogously.
Now run your query (adapted to valid syntax):
SELECT * FROM TEST1 CROSS JOIN TEST2
UNION ALL
SELECT * FROM TEST3 CROSS JOIN TEST4;
For me, this returns the first page in SQL Developer in about 30 seconds, which somewhat disproves your claim.
Simply calculate the TEMP space required for two 10**6 * 10**6 Cartesian joins with a row length of 1K: it is far above my TEMP configuration.
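As a back-of-the-envelope check of that estimate, here is a small Python calculation. It assumes roughly 1024 bytes per row, as stated above; a joined row carrying both pad columns would be even larger, so treat the result as a lower bound.

```python
rows_per_table = 10**6
row_bytes = 1024   # assumed row length of about 1K (the pad column)
joins = 2          # the two branches of the UNION ALL

rows_per_join = rows_per_table * rows_per_table   # Cartesian product
bytes_per_join = rows_per_join * row_bytes
total_bytes = joins * bytes_per_join

print(f"{rows_per_join:.0e} rows per join")
print(f"{total_bytes / 1024**5:.2f} PiB in total")
```

Under these assumptions the result is on the order of a pebibyte, so the first page cannot have come out of a fully materialized TEMP result.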
One possible way to observe what Oracle is actually doing is to run the query with the /*+ gather_plan_statistics */ hint.
Then get the SQL_ID of the statement and check the actual rows (A-Rows) in the plan:
select * from table(dbms_xplan.display_cursor('a9y62gxagups6',null,'ALLSTATS LAST'));
SQL_ID a9y62gxagups6, child number 0
-------------------------------------
SELECT /*+ gather_plan_statistics */ * FROM TEST1 CROSS JOIN TEST2
UNION ALL SELECT * FROM TEST3 CROSS JOIN TEST4
Plan hash value: 1763392637
--------------------------------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Starts | E-Rows | A-Rows | A-Time | Buffers | Reads | Writes | OMem | 1Mem | Used-Mem |
--------------------------------------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | | 50 |00:00:28.52 | 166K| 166K| 142K| | | |
| 1 | UNION-ALL | | 1 | | 50 |00:00:28.52 | 166K| 166K| 142K| | | |
| 2 | MERGE JOIN CARTESIAN| | 1 | 1000G| 50 |00:00:28.52 | 166K| 166K| 142K| | | |
| 3 | TABLE ACCESS FULL | TEST1 | 1 | 1000K| 1 |00:00:00.02 | 4 | 28 | 0 | | | |
| 4 | BUFFER SORT | | 1 | 1000K| 50 |00:00:28.49 | 166K| 166K| 142K| 1255M| 11M| 97M (0)|
| 5 | TABLE ACCESS FULL | TEST2 | 1 | 1000K| 1000K|00:00:03.66 | 166K| 166K| 0 | | | |
| 6 | MERGE JOIN CARTESIAN| | 0 | 1000G| 0 |00:00:00.01 | 0 | 0 | 0 | | | |
| 7 | TABLE ACCESS FULL | TEST3 | 0 | 1000K| 0 |00:00:00.01 | 0 | 0 | 0 | | | |
| 8 | BUFFER SORT | | 0 | 1000K| 0 |00:00:00.01 | 0 | 0 | 0 | 1103M| 10M| |
| 9 | TABLE ACCESS FULL | TEST4 | 0 | 1000K| 0 |00:00:00.01 | 0 | 0 | 0 | | | |
--------------------------------------------------------------------------------------------------------------------------------------
You can see that Oracle
1) fully scanned TEST2 (plan Id 5)
2) got one row from TEST1 (Id 3)
3) in order to return the first 50 rows (Id 0)
4) left TEST3 and TEST4 untouched (Ids 7 and 9)
You can simply adapt the example to your inner join to see similar results.
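The pattern in the plan (the second branch has Starts = 0) resembles lazy evaluation. The following Python sketch is only an analogy, not a description of Oracle internals: UNION ALL is modeled as a concatenation of two lazy row sources, and the names cross_join and started are made up for illustration.

```python
from itertools import chain, islice

started = []  # records which branches actually began producing rows


def cross_join(name, left, right):
    """Lazily yield the Cartesian product of two iterables."""
    started.append(name)  # runs only when the first row is requested
    for l in left:
        for r in right:
            yield (l, r)


# UNION ALL modeled as plain concatenation of two lazy row sources.
union_all = chain(
    cross_join("TEST1 x TEST2", range(1000), range(1000)),
    cross_join("TEST3 x TEST4", range(1000), range(1000)),
)

first_page = list(islice(union_all, 50))  # the client fetches only 50 rows
print(len(first_page), started)
```

Only the first branch is ever started, which matches tables 3 and 4 being untouched in the plan above.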
I have two tables: one (NLIST) contains about 17K records, the other (FNAMES) about 57K.
I would like to join the two by comparing the records using the Levenshtein distance.
Here is the example for the content of tables:
Table NLIST:
+------+-------------+
| ID | S_NAME |
+------+-------------+
| 1 | Avi |
| 2 | Moshe |
| 3 | David |
....
Table FNAMES:
+------+-------------+
| ID | NICKNAMES |
+------+-------------+
| 1 | Avile |
| 2 | Dudi |
| 3 | Moshiko |
| 4 | Avi |
| 5 | DAVE |
....
The above tables are just examples. In the real case the names column can include more than one word.
The required result should be:
+------+-------------+--------+
| ID | NICKNAMES | S_NAME |
+------+-------------+--------+
| 1 | Avile | Avi |
| 2 | Dudi | David |
| 3 | Moshiko | Moshe |
| 4 | Avi | Avi |
| 5 | DAVE | David |
...
Here is the code I use:
select FNAMES.NICKNAMES, NLIST.S_NAME
from FNAMES
LEFT OUTER JOIN NLIST
ON(true)
WHERE levenshtein (FNAMES.NICKNAMES, NLIST.S_NAME) <=4
The above code ran for a very long time, so I stopped it.
How can I make it run in a reasonable time?
In addition, I think the Levenshtein distance depends on the length of the words. How can I find the optimal value for the distance (in this case I chose 4 arbitrarily)?
Hive table performance depends on several factors:
Query engine
File format
Vectorization: set hive.vectorized.execution.enabled = true; set hive.vectorized.execution.reduce.enabled = true;
If you have a good server, you can try Impala, which is typically faster than Hive.
You can also fine-tune Impala, which can give you an edge in running this query faster. See: Tuning Impala for Performance
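Independently of engine tuning, the question's point that the distance depends on word length can be illustrated with a length-normalized threshold. The Python sketch below is not Hive code, and best_match and max_ratio are hypothetical names. It relies on one fact: the Levenshtein distance is never smaller than the difference in length, so pairs whose lengths differ too much can be skipped without computing the distance.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            cur.append(min(prev[j] + 1,          # delete ca
                           cur[j - 1] + 1,       # insert cb
                           prev[j - 1] + cost))  # substitute or match
        prev = cur
    return prev[-1]


def best_match(nickname, names, max_ratio=0.6):
    """Return the name with the smallest length-normalized distance, or None.

    max_ratio is a hypothetical tuning knob, not a recommended value.
    """
    best, best_ratio = None, None
    a = nickname.lower()
    for name in names:
        b = name.lower()
        longest = max(len(a), len(b), 1)
        # The distance is at least the length difference, so this pair
        # cannot pass the threshold and the expensive call is skipped.
        if abs(len(a) - len(b)) > max_ratio * longest:
            continue
        ratio = levenshtein(a, b) / longest
        if ratio <= max_ratio and (best_ratio is None or ratio < best_ratio):
            best, best_ratio = name, ratio
    return best


names = ["Avi", "Moshe", "David"]
for nick in ["Avile", "Avi", "DAVE"]:
    print(nick, "->", best_match(nick, names))
```

Whether a relative threshold like this suits the real data (multi-word names, 17K x 57K pairs) would have to be checked on a sample.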
I'm trying to use the recursive SQL feature but can't get it to do what I want, not even close. I've coded the logic as an unrolled loop and am asking whether it can be converted into a single recursive SQL query, rather than the table-update style I've used.
http://sqlfiddle.com/#!4/b7217/1
There are six players to be ranked. They have id, group id, score and rank.
Initial state
+----+--------+-------+--------+
| id | grp_id | score | rank |
+----+--------+-------+--------+
| 1 | 1 | 100 | (null) |
| 2 | 1 | 90 | (null) |
| 3 | 1 | 70 | (null) |
| 4 | 2 | 95 | (null) |
| 5 | 2 | 70 | (null) |
| 6 | 2 | 60 | (null) |
+----+--------+-------+--------+
I want to take the person with the highest initial score and give them rank 1. Then I apply 10 bonus points to the score of everyone who has the same group id. Take the next highest, assign rank 2, distribute bonus points and so on until there are no players left.
User id breaks ties.
The bonus points change the ranking. id=4 initially appears to be in second place with 95, behind the leader with 100, but with the 10-point bonus, id=2 moves up and takes the spot.
Final state
+-----+---------+--------+------+
| ID | GRP_ID | SCORE | RANK |
+-----+---------+--------+------+
| 1 | 1 | 100 | 1 |
| 2 | 1 | 100 | 2 |
| 4 | 2 | 95 | 3 |
| 3 | 1 | 90 | 4 |
| 5 | 2 | 80 | 5 |
| 6 | 2 | 80 | 6 |
+-----+---------+--------+------+
This is quite a bit late, but I'm not sure this can be done using a recursive CTE. I did, however, come up with a solution using the MODEL clause:
WITH SAMPLE (ID,GRP_ID,SCORE,RANK) AS (
SELECT 1,1,100,NULL FROM DUAL UNION
SELECT 2,1,90,NULL FROM DUAL UNION
SELECT 3,1,70,NULL FROM DUAL UNION
SELECT 4,2,95,NULL FROM DUAL UNION
SELECT 5,2,70,NULL FROM DUAL UNION
SELECT 6,2,60,NULL FROM DUAL)
SELECT ID,GRP_ID,SCORE,RANK FROM SAMPLE
MODEL
DIMENSION BY (ID,GRP_ID)
MEASURES (SCORE,0 RANK,0 LAST_RANKED_GRP,0 ITEM_COUNT,0 HAS_RANK)
RULES
ITERATE (1000) UNTIL (ITERATION_NUMBER = ITEM_COUNT[1,1]) --ITERATE ONCE FOR EACH ITEM TO BE RANKED
(
RANK[ANY,ANY] = CASE WHEN SCORE[CV(),CV()] = MAX(SCORE) OVER (PARTITION BY HAS_RANK) THEN RANK() OVER (ORDER BY SCORE DESC,ID) ELSE RANK[CV(),CV()] END, --IF THE CURRENT ITEM SCORE IS EQUAL TO THE MAX SCORE OF UNRANKED, ASSIGN A RANK
LAST_RANKED_GRP[ANY,ANY] = FIRST_VALUE(GRP_ID) OVER (ORDER BY RANK DESC),
SCORE[ANY,ANY] = CASE WHEN RANK[CV(),CV()] = 0 AND CV(GRP_ID) = LAST_RANKED_GRP[CV(),CV()] THEN SCORE[CV(),CV()]+10 ELSE SCORE[CV(),CV()] END,
ITEM_COUNT[ANY,ANY] = COUNT(*) OVER (),
HAS_RANK[ANY,ANY] = CASE WHEN RANK[CV(),CV()] <> 0 THEN 1 ELSE 0 END --TO SEPARATE RANKED/UNRANKED ITEMS
)
ORDER BY RANK;
It's not very pretty, and I suspect there is a better way to go about this, but it does give the expected output.
Caveats:
You'd have to increase the iteration count if you have more than 1000 rows (the ITERATE limit used above).
This does a full re-ranking based on the score after each iteration. So if we took your sample data, but changed the initial score of item 2 to 95 rather than 90: after ranking item 1 and giving the 10 point bonus to item 2, it now has a score of 105. So we rank it as 1st and move item 1 down to 2nd. You'd have to make a few modifications if this is not the desired behavior.
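For comparison, the process as described in the question (no re-ranking of already ranked players) can be sketched procedurally. This Python sketch is only a reference for checking SQL output, not a replacement for the MODEL query. The function name rank_with_bonus is made up, and it assumes that "user id breaks ties" means the lower id wins.

```python
def rank_with_bonus(players, bonus=10):
    """players: dict id -> (grp_id, score).

    Returns a list of (id, grp_id, score, rank) in ranking order.
    """
    scores = {pid: score for pid, (grp, score) in players.items()}
    groups = {pid: grp for pid, (grp, score) in players.items()}
    unranked = set(players)
    result = []
    rank = 0
    while unranked:
        # Highest current score wins; the lower id breaks ties (assumption).
        winner = min(unranked, key=lambda pid: (-scores[pid], pid))
        unranked.remove(winner)
        rank += 1
        result.append((winner, groups[winner], scores[winner], rank))
        # Everyone still unranked in the winner's group gets the bonus.
        for pid in unranked:
            if groups[pid] == groups[winner]:
                scores[pid] += bonus
    return result


sample = {1: (1, 100), 2: (1, 90), 3: (1, 70),
          4: (2, 95), 5: (2, 70), 6: (2, 60)}
ranking = rank_with_bonus(sample)
for row in ranking:
    print(row)
```

On the sample data this reproduces the final state table from the question.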
I am using Oracle.
I am currently working on 2 tables which both have the same column names. Is there any way in which I can combine the 2 tables together as they are?
Simple example to show what I mean:
TABLE 1:
| COLUMN 1 | COLUMN 2 | COLUMN 3 |
----------------------------------------
| a | 1 | w |
| b | 2 | x |
TABLE 2:
| COLUMN 1 | COLUMN 2 | COLUMN 3 |
----------------------------------------
| c | 3 | y |
| d | 4 | z |
RESULT THAT I WANT:
| COLUMN 1 | COLUMN 2 | COLUMN 3 |
----------------------------------------
| a | 1 | w |
| b | 2 | x |
| c | 3 | y |
| d | 4 | z |
Any help would be greatly appreciated. Thank you in advance!
You can use the union set operator to get the result of two queries as a single result set:
select column1, column2, column3
from table1
union all
select column1, column2, column3
from table2
union on its own implicitly removes duplicates; union all preserves them.
The column names don't need to be the same; you just need the same number of columns with the same data types, in the same order.
(This is not what is usually meant by a join, so the title of your question is a bit misleading; I'm basing this on the example data and output you showed.)
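The difference between union and union all can be modeled with plain Python lists. This is only an illustration: the second table is altered here (a hypothetical change) so that it repeats one row of the first, because the sample data from the question has no duplicates.

```python
table1 = [("a", 1, "w"), ("b", 2, "x")]
# Hypothetical variation of TABLE 2: ("b", 2, "x") repeats a row of table1.
table2 = [("c", 3, "y"), ("b", 2, "x")]

# union all: simple concatenation, duplicates are preserved.
union_all = table1 + table2

# union: duplicates are removed. The order shown here is a Python detail;
# SQL does not guarantee any order without ORDER BY.
union = list(dict.fromkeys(table1 + table2))

print(len(union_all), len(union))  # 4 3
```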
This is a bit hard to explain in words ... I'm trying to calculate a sum of grouped distinct values in a matrix. Let's say I have the following data returned by a SQL query:
------------------------------------------------
| Group | ParentID | ChildID | ParentProdCount |
| A | 1 | 1 | 2 |
| A | 1 | 2 | 2 |
| A | 1 | 3 | 2 |
| A | 1 | 4 | 2 |
| A | 2 | 5 | 3 |
| A | 2 | 6 | 3 |
| A | 2 | 7 | 3 |
| A | 2 | 8 | 3 |
| B | 3 | 9 | 1 |
| B | 3 | 10 | 1 |
| B | 3 | 11 | 1 |
------------------------------------------------
There's some other data in the query, but it's irrelevant. ParentProdCount is specific to the ParentID.
Now, I have a matrix in the MS Report Designer in which I'm trying to calculate a sum for ParentProdCount (grouped by "Group"). If I just add the expression
=Sum(Fields!ParentProdCount.Value)
I get a result 20 for Group A and 3 for Group B, which is incorrect. The correct values should be 5 for group A and 1 for group B. This wouldn't happen if there wasn't ChildID involved, but I have to use some other child-specific data in the same matrix.
I tried to nest FIRST() and SUM() aggregate functions but apparently it's not possible to have nested aggregation functions, even when they have scopes defined.
I'm pretty sure there is some way to calculate the grouped distinct sum without needing to create another SQL query. Anyone got an idea how to do that?
OK, I got this sorted out by adding a ROW_NUMBER() function to my SQL query:
SELECT Group, ParentID, ROW_NUMBER() OVER (PARTITION BY ParentID ORDER BY ChildID ASC) AS Position, ChildID, ParentProdCount FROM Table
and then I replaced the SSRS SUM function with
=SUM(IIF(Fields!Position.Value = 1, Fields!ParentProdCount.Value, 0))
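The logic of that expression (count ParentProdCount only on the first child row of each parent) can be checked on the sample data. This is a small Python sketch of the idea, not SSRS code; the variable names are illustrative.

```python
from collections import defaultdict

# (group, parent_id, child_id, parent_prod_count), as in the question.
rows = [
    ("A", 1, 1, 2), ("A", 1, 2, 2), ("A", 1, 3, 2), ("A", 1, 4, 2),
    ("A", 2, 5, 3), ("A", 2, 6, 3), ("A", 2, 7, 3), ("A", 2, 8, 3),
    ("B", 3, 9, 1), ("B", 3, 10, 1), ("B", 3, 11, 1),
]

totals = defaultdict(int)
seen_parents = set()
# Sorting by (parent_id, child_id) mirrors PARTITION BY ParentID ORDER BY ChildID.
for group, parent_id, child_id, count in sorted(rows, key=lambda r: (r[1], r[2])):
    if parent_id not in seen_parents:  # equivalent to Position = 1
        seen_parents.add(parent_id)
        totals[group] += count

print(dict(totals))  # {'A': 5, 'B': 1}
```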
Put a grouping on ParentID and sum over that group,
e.g.:
if the group on ParentID is named "ParentIDGroup",
then
the column sum of ParentProdCount = SUM(Fields!ParentProdCount.Value,"ParentIDGroup")