whats the purpose of using over and rank keywords in hive sql? - hadoop

What is the meaning/purpose of using over and rank keywords in hive sql?
select rank() over (order by net_worth desc) as rank, name, net_worth from wealth order by rank, name;
+------+---------+---------------+
| rank | name | net_worth |
+------+---------+---------------+
| 1 | Solomon | 2000000000.00 |
| 2 | Croesus | 1000000000.00 |
| 2 | Midas | 1000000000.00 |
| 4 | Crassus | 500000000.00 |
| 5 | Scrooge | 80000000.00 |
+------+---------+---------------+

The OVER clause is powerful in that you can have aggregates over different ranges ("windowing"), whether you use a GROUP BY or not
OVER clause defines a window or user-specified set of rows within a query result set. A window function then computes a value for each row in the window. You can use the OVER clause with functions to compute aggregated values such as moving averages, cumulative aggregates, running totals, or a top N per group results
Over clause can be used in association with aggregate function and ranking function. The over clause determine the partitioning and ordering of the records before associating with aggregate or ranking function.
suppose you use only rank() function then how sql will understand on which bases rank will be calculated. example table have 3 columns name, net_worth and net_profit. Name with highest net_profit will be first rank. so you have to tell the sql that calculate rank on the bases of highest net_profit.

over() works on a "window" of attributes.
In your example, select rank() over (order by net_worth desc), you have instructed to rank the table with net_worth column in descending order. Due to that reason, ranking is done on descending order of net_worth.
over() is powerful it was used along with partition by.
Have a look at this article, which provides good examples to understand the concepts.
If you have sales table with Territory & Sales Amount, you can provide rank on order of Sales Amount Or create a partition for Territory and rank the Sales amount with in a Territory.
Have a look at this article to get understanding on WindowingAndAnalytics. It will explain how to use aggregate functions in HiveQL.

Related

Column that sums values once per unique ID, while filtering on type (Oracle Fusion Transportation Intelligence)

I realize that this has been discussed before but haven't seen a solution in a simple CASE expression for adding a column in Oracle FTI - which is as far as my experience goes at the moment unfortunately. My end goal is to have an total Weight for each Category only counting the null type entries and only one Weight per ID (Don't know why null was chosen as the default Type). I need to break the data apart by Type for a total Cost column which is working fine so I didn't include that in the example data below, but because I have to break the data up by Type, I am having trouble eliminating redundant values in my Total Weight results.
My original column which included redundant weights was as follows:
SUM(CASE Type
WHEN null
THEN 'Weight'
ELSE null
END)
Some additional info:
Each ID can have multiple Types (additionally each ID may not always have A or B but should always have null)
Each ID can only have one Weight (But when broken apart by type the value just repeats and messes up results)
Each ID can only have one Category (This doesn't really matter since I already separate this out with a Category column in the results)
Example Data:
ID |Categ. |Type | Weight
1 | Old | A | 1600
1 | Old | B | 1600
1 | Old |(null) | 1600
2 | Old | B | 400
2 | Old |(null) | 400
2 | Old |(null) | 400
3 | New | A | 500
3 | New | B | 500
3 | New |(null) | 500
4 | New | A | 500
4 | New |(null) | 500
4 | New |(null) | 500
Desired Results:
Categ. | Total Weight
Old | 2000
New | 1000
I was trying to figure out how to include a DISTINCT based on ID in the column, but when I put DISTINCT in front of CASE it just eliminates redundant weights so I would just get 500 for Total Weight New.
Additionally, I thought it would be possible to divide the weight by the count of weights before aggregating them, but this didn't seem to work either:
SUM(CASE Type
WHEN null
THEN 'Weight'/COUNT(CASE Type
WHEN null
THEN 'Weight'
ELSE null
END)
ELSE null
END)
I am very appreciative of any help that can be offered, please let me know if there is a simple way to create a column that achieves the desired results. As it may be apparent, I am pretty new to Oracle, so please let me know if there is any additional information that is needed.
Thanks,
Will
You don't need a case statement here. You were on the right track with distinct, but you also need to use an inline view (a subquery in the from the caluse).
The subquery in the from clause, selecting all distinct combinations of (id, categ, weight), allows you to then select from the result set, whereby you select only categ, sum of weight, grouping by categ. The subquery in the from clause has no repeated weights for a given id (unlike the table itself, which is why this is needed).
This would have to be done a little differently if an id were ever to have more than one category, but you noted that an id only ever has one category.
select categ,
sum(weight)
from (select distinct id,
categ,
weight
from tbl)
group by categ;
Fiddle: http://sqlfiddle.com/#!4/11a56/1/0

how do retrieve specific row in Hive?

I have a dataset looks like this:
---------------------------
cust | cost | cat | name
---------------------------
1 | 2.5 | apple | pkLady
---------------------------
1 | 3.5 | apple | greenGr
---------------------------
1 | 1.2 | pear | yelloPear
----------------------------
1 | 4.5 | pear | greenPear
-------------------------------
my hive query should now compare the cheapest price of each item the customer bought. So I want now to get the 2.5 and 1.2 into one row to get its difference. Since I am new to Hive I don't now how to ignore everything else until I reach next category of item while I still kept the cheapest price in the previous category.
you can use like below:
select cat,min(cost) from table group by cost;
Given your options (brickhouse UDFs, hive windowing functions or a self-join) in Hive, a self-join is the worst way to do this.
select *
, (cost - min(cost) over (partition by cust)) cost_diff
from table
You could create a subquery containing the minimum cost for each customer, and then join it to the original table:
select
mytable.*,
minCost.minCost,
cost - minCost as costDifference
from mytable
inner join
(select
cust,
min(cost) as minCost
from mytable
group by cust) minCost
on mytable.cust = minCost.cust
I created an interactive SQLFiddle example using MySQL, but it should work just fine in Hive.
I think this is really a SQL question rather than a Hive question: If you just want the cheapest cost per customer you can do
select cust, min(cost)
group by cust
Otherwise if you want the cheapest cost per customer per category you can do:
select cust, cat, min(cost)
from yourtable
groupby cust, cat

Oracle Domain index and sorting

The following query performs very poorly, due to the "order by". My goal is to get only a small subset of the resultset (using ROWNUM, for example). However, when I add "order by" it goes through the entire resultset performing an index lookup for each record, which makes it extremely slow. Without sorting the query is about 100 times faster when I limit the resultset to, for example, 1000 records.
QUERY:
SELECT text_field
from mytable where
contains(text_field,'ABC', 1)>0
order by another_field;
THIS IS HOW I CREATED THE INDEX:
CREATE INDEX myindex ON mytable (text_field) INDEXTYPE IS ctxsys.context FILTER BY another_field
EXECUTION PLAN:
---------------------------------------------------------------
| Id | Operation | Name |
---------------------------------------------------------------
| 0 | SELECT STATEMENT | |
| 1 | SORT ORDER BY | |
| 2 | TABLE ACCESS BY INDEX ROWID| MYTABLE |
|* 3 | DOMAIN INDEX | MYINDEX |
---------------------------------------------------------------
I also used CTXCAT instead of CONTEXT, and no improvement. I think the problem is, when I want the results sorted (only top 1000), it performs an index lookup for each record in the "entire" resultset. Is there a way to avoid that?
Thank you.
To have the ordering applied before the rownum filter, you need to use an in-line view:
SELECT text_file
from (
SELECT text_field
from mytable where
contains(text_field,'ABC', 1)>0
order by another_field
)
where rownum <= 1000;
With your index in place Oracle should optimise this to do as little work as possible. You should see 'sort order by stopkey' and 'count stopkey' steps in the plan, which is Oracle being clever and knowing it only needs to get 1000 values from the index.
If you don't use the in-line view but just add the rownum to your original query it will still optimise it but as you state it will order the first 1000 random (or indeterminate, anyway) rows it finds, because of the sequence of operations it performs.

Will this type of pagination scale?

I need to paginate on a set of models that can/will become large. The results have to be sorted so that the latest entries are the ones that appear on the first page (and then, we can go all the way to the start using 'next' links).
The query to retrieve the first page is the following, 4 is the number of entries I need per page:
SELECT "relationships".* FROM "relationships" WHERE ("relationships".followed_id = 1) ORDER BY created_at DESC LIMIT 4 OFFSET 0;
Since this needs to be sorted and since the number of entries is likely to become large, am I going to run into serious performance issues?
What are my options to make it faster?
My understanding is that an index on 'followed_id' will simply help the where clause. My concern is on the 'order by'
Create an index that contains these two fields in this order (followed_id, created_at)
Now, how large is the large we are talking about here? If it will be of the order of millions.. How about something like the one that follows..
Create an index on keys followed_id, created_at, id (This might change depending upon the fields in select, where and order by clause. I have tailor-made this to your question)
SELECT relationships.*
FROM relationships
JOIN (SELECT id
FROM relationships
WHERE followed_id = 1
ORDER BY created_at
LIMIT 10 OFFSET 10) itable
ON relationships.id = itable.id
ORDER BY relationships.created_at
An explain would yield this:
+----+-------------+---------------+------+---------------+-------------+---------+------+------+-----------------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+---------------+------+---------------+-------------+---------+------+------+-----------------------------------------------------+
| 1 | PRIMARY | NULL | NULL | NULL | NULL | NULL | NULL | NULL | Impossible WHERE noticed after reading const tables |
| 2 | DERIVED | relationships | ref | sample_rel2 | sample_rel2 | 5 | | 1 | Using where; Using index |
+----+-------------+---------------+------+---------------+-------------+---------+------+------+-----------------------------------------------------+
If you examine carefully, the sub-query containing the order, limit and offset clauses will operate on the index directly instead of the table and finally join with the table to fetch the 10 records.
It makes a difference when at one point your query makes a call like limit 10 offset 10000. It will retrieve all the 10000 records from the table and fetch the first 10. This trick should restrict the traversal to just the index.
An important note: I tested this in MySQL. Other database might have subtle differences in behavior, but the concept holds good no matter what.
you can index these fields. but it depends:
you can assume (mostly) that the created_at is already ordered. So that might by unnecessary. But that more depends on you app.
anyway you should index followed_id (unless its the primary key)

How should I range partition an index with a varchar2 column in Oracle? Is it a bad idea?

I am using Oracle 10g Enterprise edition.
A table in our Oracle database stores the soundex value representation of another text column. We are using a custom soundex implementation in which the soundex values are longer than are generated by traditional soundex algorithms (such as the one Oracle uses). That's really beside the point.
Basically I have a varchar2 column that has values containing a single character followed by a dynamic number of numeric values (e.g. 'A12345', 'S382771', etc). The table is partitioned by another column, but I'd like to add a partitioned index to the soundex column since it is often searched. When trying to add a range partitioned index using the first character of the soundex column it worked great:
create index IDX_NAMES_SOUNDEX on NAMES_SOUNDEX (soundex)
global partition by range (soundex) (
partition IDX_NAMES_SOUNDEX_PART_A values less than ('B'), -- 'A%'
partition IDX_NAMES_SOUNDEX_PART_B values less than ('C'), -- 'B%'
...
);
However, I in order to more evenly distribute the size of the partitions, I want to define some partitions by the first two chars, like so:
create index IDX_NAMES_SOUNDEX on NAMES_SOUNDEX (soundex)
global partition by range (soundex) (
partition IDX_NAMES_SOUNDEX_PART_A5 values less than ('A5'), -- 'A0% - A4%'
partition IDX_NAMES_SOUNDEX_PART_A values less than ('B'), -- 'A4% - A9%'
partition IDX_NAMES_SOUNDEX_PART_B values less than ('C'), -- 'B%'
...
);
I'm not sure how to properly range partition using varchar2 columns. I'm sure this is a less than ideal choice, so perhaps someone can recommend a better solution. Here's a distribution of the soundex data in my table:
-----------------------------------
| SUBSTR(SOUNDEX,1,1) | COUNT |
-----------------------------------
| A | 6476349 |
| B | 854880 |
| D | 520676 |
| F | 1200045 |
| G | 280647 |
| H | 3048637 |
| J | 711031 |
| K | 1336522 |
| L | 348743 |
| M | 3259464 |
| N | 1510070 |
| Q | 276769 |
| R | 1263008 |
| S | 3396223 |
| V | 533844 |
| W | 555007 |
| Y | 348504 |
| Z | 1079179 |
-----------------------------------
As you can see, the distribution is not evenly spread, which is why I want to define range partitions using the first two characters instead of just the first character.
Suggestions?
Thanks!
What exactly is your question?
Don't you know how you can split your table in n equal parts to avoid skew?
You can do that with analytic function percentile_disc().
Here an SQL PLUS example with n=100, I admit that it isn't very sophisticated but it will do the job.
set pages 0
set lines 200
drop table random_strings;
create table random_strings
as
select upper(dbms_random.string('A', 12)) rndmstr
from dual
connect by level < 1000;
spool parts
select 'select '||level||'/100,percentile_disc('||level||
'/100) within group (order by RNDMSTR) from random_strings;'
sql_statement
from dual
connect by level <= 100
/
spool off
This will output in file parts.lst:
select 1/100,percentile_disc(1/100) within group (order by RNDMSTR) from random_strings;
select 2/100,percentile_disc(2/100) within group (order by RNDMSTR) from random_strings;
select 3/100,percentile_disc(3/100) within group (order by RNDMSTR) from random_strings;
...
select 100/100,percentile_disc(100/100) within group (order by RNDMSTR) from random_strings;
Now you can run script parts.lst to get the partition values. Each partition will contain 1% of the data initially.
Script parts.lst will output:
,01 AJUDRRSPGMNP
,02 AOMJZQPZASQZ
,03 AWDQXVGLLUSJ
,04 BIEPUHAEMELR
....
,99 ZTMHDWTXUJAR
1 ZYVJLNATVLOY
Is the table is being searched by the partitioning key in addition to the SOUNDEX value? Or is it being searched just by the SOUNDEX column?
If you are just trying to achieve an even distribution of data among partitions, have you considered using hash partitions rather than range partitions? Assuming you choose a power of 2 for the number of partitions, that should give you a pretty even distribution of data between partitions.
Talk to me!
Can you tell me what your reason is for partitioning this table? It sounds like it is an OLTP table and may not need to be partition. We don’t want to partition just to say we are partitioned. Tell me what you are trying to accomplish by partitioning this table and I can help you pick a correct partitioning scheme. Partitioning does not equal faster queries. It actually can cause your queries to be slower in some cases.
I see some of your additional thoughts above and I don’t believe you need to partition your table. If your queries are going to be doing aggregates on entire partitions then you may want to partition. If you are going to have hundreds of millions of rows of data you may want to partition to help with DBA maintenance. If you just want you queries to run fast then the primary key index will suffice. Please let me know
Just create a global index on your desired columns.

Resources