Emit a group only when condition meets - hadoop

I have requirement to emit all records corresponds to a group, only when a condition is met. Below is the sample data set with alias name as "SAMPLE_DATA".
Col-1 | Col-2 | Col-3
-------------------------
2 | 4 | 1
2 | 5 | 2
3 | 3 | 1
3 | 2 | 2
4 | 5 | 1
4 | 6 | 2
SAMPLE_DATA_GRP = GROUP SAMPLE_DATA BY Col-1;
RESULT = FOREACH SAMPLE_DATA_GRP {
max_value = MAX(SAMPLE_DATA.Col-2);
IF(max_value >= 5)
GENERATE ALL RECORDS IN THAT GROUP;
}
RESULT should be:
Col-1 | Col-2 | Col-3
-------------------------
2 | 4 | 1
2 | 5 | 2
---- ---- ---
4 | 5 | 1
4 | 6 | 2
Two groups got generated. First group is generate because max value of 4,5 is "5"(which meets our condition >=5). Same for second group (6 >= 5).
As I would be performing this operation on big dataset operations like distinct and join would be overkill. For this reason I have come up with pseudo code with one grouping to perform this operation.
Hope I have provided enough information. Thanks in advance.
I would be performing this operation on a huge data set. Doing operation like distinct and join would be overkill on the system. For this reason I have come up with this grouping approach.

Please try the below code and see..
This solution is little lengthy ,but it will work
numbers = LOAD '/home/user/inputfiles/c1.txt' USING PigStorage(',') AS(c1:int,c2:int,c3:int);
num_grp = GROUP numbers by c1;
num_each = FOREACH num_grp
{
max_each = MAX(numbers.c2);
generate flatten(group) as temp_c1, (max_each >= 5 ?1 :0) as indicator;
};
num_each_filtered = filter num_each BY indicator == 1;
num_joined = join numbers BY c1,num_each_filtered by tem_c1;
num_output = FOREACH num_joined GENERATE c1,c2,c3;
dump num_output;
O/p:
Col-1 | Col-2 | Col-3
-------------------------
2 | 4 | 1
2 | 5 | 2
---- ---- ---
4 | 5 | 1
4 | 6 | 2

Related

MySQL - Top 5 rank best seller plans or courses

I sell subscriptions of my online course, as well as the courses in retail.
I would bring the "top 5" of best selling plans / courses. For this, I have a table called "subscriptionPlan", which stores the purchased plan ID, or in the case of a course, the course ID, and the amount spent on this transaction. Example:
table subscriptionPlan
sbpId | subId | plaId | couId | sbpAmount
1 | 1 | 1 | 1 | 499.99
2 | 2 | 1 | 2 | 499.99
3 | 3 | 2 | 0 | 899.99
4 | 4 | 1 | 1 | 499.99
Just for educational purposes, plaId = 1 is a plan called "Single Sale" that I created, to maintain the integrity of the DB. When the couId isn't empty, you also have bought a separate course, not a plan where you can attend any course.
My need is: List the top 5 sales. If it is a plan, display the plan name (plan table, column plaTitle). If it is a course, display its name (table course, colna couTitle). This logic that I can't code. I was able to rank a top 5 of PLANS, but it groups the courses, since the GROUP BY is by the ID of the plan. I believe the prank is here, maybe creating an IF / ELSE in this GROUPBY, but I don't know how to do this.
The query that i code, to rank my top 5 plans is:
SELECT sp.plaId, sp.couId, p.plaTitle, p.plaPermanent, c.couTitle, SUM(sbpAmount) AS sbpTotalAmount
FROM subscriptionPlan sp
LEFT JOIN plan p ON sp.plaId = p.plaId
LEFT JOIN course c ON sp.couId = c.couId
GROUP BY sp.plaId
ORDER BY sbpTotalAmount DESC
LIMIT 5
The result that i expected was:
plaId | couId | plaTitle | couTitle | plaPermanent | sbpTotalAmount
1 | 1 | Venda avulsa | Curso 01 | true | 999.98
2 | 0 | Acesso total | null | false | 899.99
3 | 2 | Venda avulsa | Curso 02 | true | 499.99
How could I get into this query formula?
When grouping you can use:
Simple columns, or
Any [complex] expression.
In your case, it seems you need to group by an expression, such as:
GROUP BY CASE WHEN sp.plaId = 1 THEN -1 ELSE sp.couId END
In this case I chose -1 as the grouping for the "Single Plan". You can replace the value for any other that doesn't match any couId.

Oracle Recursive Subquery Factoring convert

I'm trying to use this recursive SQL feature but can't get it to do what I want, not even close. I've coded up the logic in an unrolled loop, asking if it can be converted into a single recursive SQL query, not the table update style I've used.
http://sqlfiddle.com/#!4/b7217/1
There are six players to be ranked. They have id, group id, score and rank.
Initial state
+----+--------+-------+--------+
| id | grp_id | score | rank |
+----+--------+-------+--------+
| 1 | 1 | 100 | (null) |
| 2 | 1 | 90 | (null) |
| 3 | 1 | 70 | (null) |
| 4 | 2 | 95 | (null) |
| 5 | 2 | 70 | (null) |
| 6 | 2 | 60 | (null) |
+----+--------+-------+--------+
I want to take the person with the highest initial score and give them rank 1. Then I apply 10 bonus points to the score of everyone who has the same group id. Take the next highest, assign rank 2, distribute bonus points and so on until there are no players left.
User id breaks ties.
The bonus points changes the ranking. id=4 initially appears to be second placed with 95, behind the leader with 100 but with the 10 pts bonus, id=2 moves up and takes the spot.
Final state
+-----+---------+--------+------+
| ID | GRP_ID | SCORE | RANK |
+-----+---------+--------+------+
| 1 | 1 | 100 | 1 |
| 2 | 1 | 100 | 2 |
| 4 | 2 | 95 | 3 |
| 3 | 1 | 90 | 4 |
| 5 | 2 | 80 | 5 |
| 6 | 2 | 80 | 6 |
+-----+---------+--------+------+
This is a quite a bit late, but I'm not sure this can be done using Recursive CTE. I did however come up with a solution using the MODEL clause:
WITH SAMPLE (ID,GRP_ID,SCORE,RANK) AS (
SELECT 1,1,100,NULL FROM DUAL UNION
SELECT 2,1,90,NULL FROM DUAL UNION
SELECT 3,1,70,NULL FROM DUAL UNION
SELECT 4,2,95,NULL FROM DUAL UNION
SELECT 5,2,70,NULL FROM DUAL UNION
SELECT 6,2,60,NULL FROM DUAL)
SELECT ID,GRP_ID,SCORE,RANK FROM SAMPLE
MODEL
DIMENSION BY (ID,GRP_ID)
MEASURES (SCORE,0 RANK,0 LAST_RANKED_GRP,0 ITEM_COUNT,0 HAS_RANK)
RULES
ITERATE (1000) UNTIL (ITERATION_NUMBER = ITEM_COUNT[1,1]) --ITERATE ONCE FOR EACH ITEM TO BE RANKED
(
RANK[ANY,ANY] = CASE WHEN SCORE[CV(),CV()] = MAX(SCORE) OVER (PARTITION BY HAS_RANK) THEN RANK() OVER (ORDER BY SCORE DESC,ID) ELSE RANK[CV(),CV()] END, --IF THE CURRENT ITEM SCORE IS EQUAL TO THE MAX SCORE OF UNRANKED, ASSIGN A RANK
LAST_RANKED_GRP[ANY,ANY] = FIRST_VALUE(GRP_ID) OVER (ORDER BY RANK DESC),
SCORE[ANY,ANY] = CASE WHEN RANK[CV(),CV()] = 0 AND CV(GRP_ID) = LAST_RANKED_GRP[CV(),CV()] THEN SCORE[CV(),CV()]+10 ELSE SCORE[CV(),CV()] END,
ITEM_COUNT[ANY,ANY] = COUNT(*) OVER (),
HAS_RANK[ANY,ANY] = CASE WHEN RANK[CV(),CV()] <> 0 THEN 1 ELSE 0 END --TO SEPARATE RANKED/UNRANKED ITEMS
)
ORDER BY RANK;
It's not very pretty, and I suspect there is a better way to go about this, but it does give the expected output.
Caveats:
You'd have to increase the iteration count if you have more than that number of rows.
This does a full re-ranking based on the score after each iteration. So if we took your sample data, but changed the initial score of item 2 to 95 rather than 90: after ranking item 1 and giving the 10 point bonus to item 2, it now has a score of 105. So we rank it as 1st and move item 1 down to 2nd. You'd have to make a few modifications if this is not the desired behavior.

Vbscript basic functions

I am new to programming and computer science. HTML is all I know and I have been facing problems with vbscript.
This program (my first in vbscript) was given by my teacher. But I really do not understand anything. I referred to my book but in vain.
I am not even sure if this is the right SE to post the question.
Please help.
What you have there is a loop with another nested loop, both of which print some text to the screen (document.write("...")).
The outer loop
For i = 1 To 5 Step 1
...
Next
iterates from 1 to 5 in steps of 1 (which is redundant, since 1 is the default step size, so you could just omit the Step 1). If you printed the value of i inside the loop
For i = 1 To 5 Step 1
document.Write(i & "<br>")
Next
You'd get the following output:
1
2
3
4
5
In your code sample you just print <br>, though, so each cycle of the outer loop just prints a line break.
In addition to printing line breaks in the outer loop you also have a nested loop, which for each cycle of the outer loop iterates from 1 to the current value of i, again in steps of 1.
For j = 1 To i Step 1
...
Next
So in the first cycle of the outer loop (i=1) the inner loop iterates from 1 to 1, in the second cycle of the outer loop (i=2) it iterates from 1 to 2, and so on.
For i = 1 To 5 Step 1
document.Write(i & "<br>")
For j = 1 To i Step 1
document.Write("*")
Next
Next
Since the inner loop prints an asterisk with each cycle you get i asterisks per line before the inner loop ends, the outer loop then goes into the next cycle and prints a line break, thus ending the current output line.
A good (although somewhat tedious) way to get an understanding of how the loops work is to note the current value of each variable as well as the current output line in a table on a sheet of paper, e.g. like this:
code line | instruction | i | j | output line
----------+------------------------+-------+-------+------------
1 | For i = 1 To 5 Step 1 | 1 | Empty |
2 | document.Write("<br>") | 1 | Empty | <br>
3 | For j = 1 To i Step 1 | 1 | 1 |
4 | document.Write("*") | 1 | 1 | *
5 | Next | 1 | 1 | *
6 | Next | 1 | 1 | *
1 | For i = 1 To 5 Step 1 | 2 | 1 | *
2 | document.Write("<br>") | 2 | 1 | *<br>
3 | For j = 1 To i Step 1 | 2 | 1 |
4 | document.Write("*") | 2 | 1 | *
5 | Next | 2 | 1 | *
3 | For j = 1 To i Step 1 | 2 | 2 | *
4 | document.Write("*") | 2 | 2 | **
5 | Next | 2 | 2 | **
6 | Next | 2 | 2 | **
1 | For i = 1 To 5 Step 1 | 3 | 2 | **
2 | document.Write("<br>") | 3 | 2 | **<br>
3 | For j = 1 To i Step 1 | 3 | 1 |
4 | document.Write("*") | 3 | 1 | *
... | ... | ... | ... | ...

Oracle hierarchical query join to selection of all decendants

I am trying to retrieve a list of each descendant with each item.
I am not sure I am making sense, I will try and explain.
Example Data:
ID | PID
--------
1 | 0
2 | 1
3 | 1
4 | 1
5 | 2
6 | 2
7 | 5
8 | 3
etc...
The desired results are:
ID | Decendant
--------------
1 | 1
1 | 2
1 | 3
1 | 4
...
2 | 2
2 | 5
2 | 6
2 | 7
3 | 3
3 | 8
etc...
This is currently being achieved by using a cursor to move through the data and inserting each descendant into a table and then selecting from them.
I was wondering if there was a better way to do these, there must be a way to right a query that would bring back the desired results.
If any one has ideas, or has figured this out before it would be very appreciated. Ordering is not important, nor is the 1 - 1, 2 -2 reference. It would be cool to have it, but not crucial.
select connect_by_root(id) as ID, id as Decendant
from table1
connect by prior id = pid
order by 1, 2
fiddle
Here is my attempt! Not sure, if I got you right!
select pid ,connect_By_root(id) as descendant from process
connect by id = prior pid
union all
select distinct pid,pid from
process
order by pid,descendant

Sum of the grouped distinct values

This is a bit hard to explain in words ... I'm trying to calculate a sum of grouped distinct values in a matrix. Let's say I have the following data returned by a SQL query:
------------------------------------------------
| Group | ParentID | ChildID | ParentProdCount |
| A | 1 | 1 | 2 |
| A | 1 | 2 | 2 |
| A | 1 | 3 | 2 |
| A | 1 | 4 | 2 |
| A | 2 | 5 | 3 |
| A | 2 | 6 | 3 |
| A | 2 | 7 | 3 |
| A | 2 | 8 | 3 |
| B | 3 | 9 | 1 |
| B | 3 | 10 | 1 |
| B | 3 | 11 | 1 |
------------------------------------------------
There's some other data in the query, but it's irrelevant. ParentProdCount is specific to the ParentID.
Now, I have a matrix in the MS Report Designer in which I'm trying to calculate a sum for ParentProdCount (grouped by "Group"). If I just add the expression
=Sum(Fields!ParentProdCount.Value)
I get a result 20 for Group A and 3 for Group B, which is incorrect. The correct values should be 5 for group A and 1 for group B. This wouldn't happen if there wasn't ChildID involved, but I have to use some other child-specific data in the same matrix.
I tried to nest FIRST() and SUM() aggregate functions but apparently it's not possible to have nested aggregation functions, even when they have scopes defined.
I'm pretty sure there is some way to calculate the grouped distinct sum without needing to create another SQL query. Anyone got an idea how to do that?
Ok I got this sorted out by adding a ROW_NUMBER() function my SQL query:
SELECT Group, ParentID, ROW_NUMBER() OVER (PARTITION BY ParentID ORDER BY ChildID ASC) AS Position, ChildID, ParentProdCount FROM Table
and then I replaced the SSRS SUM function with
=SUM(IIF(Position = 1, ParentProdCount.Value, 0))
Put a grouping over the ParentID and use a summation over that group,
eg:
if group over ParentID = "ParentIDGroup"
then
column sum of ParentPrdCount = SUM(Fields!ParentProdCount.Value,"ParentIDGroup")

Resources