Poor performance on hash joins with Pig on Tez - hadoop

I have a series of Pig scripts that are transforming hundreds of millions of records from multiple data sources that need to be joined together. Towards the end of each script, I reach a point where JOIN performance becomes terribly slow. Looking at the DAG in the Tez View, I see that it is split into relatively few tasks (typically 100-200), but each task takes multiple hours to complete. The task description shows that it's doing a HASH_JOIN.
Interestingly, I only run into this bottleneck when running on the Tez execution engine. On MapReduce it can still take a while, but nothing like the agonizing crawl I get on Tez. However, falling back to MapReduce is not a workable option, as I've hit a separate MapReduce issue that I've asked about in another question here.
Here's a sample of my code (apologies, I've had to make the code very generic to be able to post on the interwebs). I'm wondering what I can do to remove this bottleneck -- would specifying parallelism help? Is there something wrong with my approach? (One possible direction is sketched after the script.)
-- Incoming data:
-- A: hundreds of millions of rows, 19 fields
-- B: hundreds of millions of rows, 3 fields
-- C: hundreds of millions of rows, 5 fields
-- D: a few thousand rows, 5 fields
J = -- This reduces the size of A, but still probably in the hundreds of millions
FILTER A
BY qualifying == 1;
K = -- This is a one-to-one join that doesn't explode the number of rows in J
JOIN J BY Id
, B BY Id;
L =
FOREACH K
GENERATE J1 AS L1
, J2 AS L2
, J3 AS L3
, J4 AS L4
, J5 AS L5
, J6 AS L6
, J7 AS L7
, J8 AS L8
, B1 AS L9
, B2 AS L10
;
M = -- Reduces the size of C to around one hundred million rows
FILTER C
BY Code matches 'Code-.+';
M_WithYear =
FOREACH M
GENERATE *
, (int)REGEX_EXTRACT(Code, 'Code-.+-([0-9]+)', 1) AS year:int
;
SPLIT M_WithYear
INTO M_annual IF year <= (int)'$currentYear' -- roughly 75% of the data from M
, M_lifetime IF Code == 'Code-Lifetime'; -- roughly 25% of the data from M
-- Transformations for M_annual
N =
JOIN M_annual BY Id, D BY Id USING 'replicated';
O = -- This is where performance falls apart
JOIN N BY (Id, year, M7) -- M7 matches L7
, L BY (Id, year, L7);
P =
FOREACH O
GENERATE N1 AS P1
, N2 AS P2
, N3 AS P3
, N4 AS P4
, N5 AS P5
, N6 AS P6
, N7 AS P7
, N8 AS P8
, N9 AS P9
, L1 AS P10
, L2 AS P11
;
-- Transformations N-P above repeated for M_lifetime
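A possible direction (a sketch under assumptions, not a confirmed fix): if a handful of (Id, year, M7) values dominate the data -- NULL or default keys are common culprits -- the default hash-partitioned join funnels those hot keys into a few long-running tasks, which matches the symptom above. Pig's skewed join plus an explicit PARALLEL clause spreads heavy keys across more tasks than Tez chooses on its own; the PARALLEL value below is an arbitrary placeholder to tune against your cluster:
O = -- Hypothetical variant of the join where performance falls apart
JOIN N BY (Id, year, M7)
, L BY (Id, year, L7)
USING 'skewed' PARALLEL 400;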

Related

Data manipulation in APEX 20 oracle

Currently I am performing a dynamic action (executing server-side code) in APEX (v20): I select values from two different tables (XYZ and ABC), perform a calculation, insert the results into another table (ABC_TEMP), and create a report view from that.
Below is what I am performing.
BEGIN
INSERT INTO ABC_TEMP (
A1, --> VARCHAR2(4000)
B1, --> VARCHAR2(4000)
C1, --> NUMBER
D1, --> NUMBER
E1, --> NUMBER
F, --> NUMBER
G1, --> NUMBER
H3, --> NUMBER
I3, --> NUMBER
J, --> NUMBER
K, --> NUMBER
L, --> NUMBER
timestamp -->timestamp(6)
)
VALUES (
:A_SELECT,
:B_SELECT,
:C_SELECT,
(SELECT D2 FROM XYZ WHERE B2 = :B_SELECT AND C2 = :C_SELECT),
(SELECT E2 FROM XYZ WHERE B2 = :B_SELECT AND C2 = :C_SELECT),
(SELECT SUM(D2 + E2) FROM XYZ WHERE B2 = :B_SELECT AND C2 = :C_SELECT),
(SELECT G2 FROM XYZ WHERE B2 = :B_SELECT AND C2 = :C_SELECT),
(SELECT H2 FROM ABC WHERE A2 = :A_SELECT AND P2 = 'mock1' AND SE = 'mock2' AND Q2 = 'val1'),
(SELECT H2 FROM ABC WHERE A2 = :A_SELECT AND P2 = 'mock1' AND SE = 'mock2' AND Q2 = 'val2'),
(:J), --> This value is derived from `ABC_TEMP` table only by dividing I3 BY F
(:K), --> This value is derived from `ABC_TEMP` table only by dividing H3 BY G1
(:L), --> this value is derived from low of J & K column
(CURRENT_TIMESTAMP)
);
END;
My question is: how do I set the values of columns J, K, and L in the same query, given that they are derived from the very table I am inserting into, with calculations on top of values selected in the same statement?
If this is not possible, what other approaches are there?
Literally copy/paste those columns' source (select statements) and divide them:
values
(:A_SELECT,
...
-- for J, literally copy/paste I3 / F
(SELECT H2 FROM ABC WHERE A2 = :A_SELECT AND P2 = 'mock1' AND SE = 'mock2' AND Q2 = 'val2') /
(SELECT SUM(D2 + E2) FROM XYZ WHERE B2 = :B_SELECT AND C2 = :C_SELECT)
-- the same goes for other columns
);
Does it work? Sure:
SQL> select (select sal from emp where rownum = 1) /
2 (select empno from emp where rownum = 1) as j
3 from dual;
J
----------
,108562899
SQL>
Is it optimized? Of course not; you'll be running the same query twice (once for the "original" column and once for the "derived" one).
Can you optimize it? Maybe. Try creating a bunch of CTEs (one for each subquery you used) and then reusing them for the derived columns.
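A sketch of that idea (column list abbreviated, CTE names made up, and it assumes each subquery returns a single row -- note that an INSERT can take a WITH clause once you switch from VALUES to SELECT):
INSERT INTO ABC_TEMP (A1, D1, E1, F, I3, J, timestamp)
WITH xyz_vals AS (
    SELECT D2, E2, D2 + E2 AS f_val  -- F; SUM not needed for a single row
    FROM XYZ
    WHERE B2 = :B_SELECT AND C2 = :C_SELECT
), abc_vals AS (
    SELECT H2 AS i3_val              -- I3
    FROM ABC
    WHERE A2 = :A_SELECT AND P2 = 'mock1' AND SE = 'mock2' AND Q2 = 'val2'
)
SELECT :A_SELECT, x.D2, x.E2, x.f_val, a.i3_val,
       a.i3_val / x.f_val,           -- J, computed once from the CTE values
       CURRENT_TIMESTAMP
FROM xyz_vals x
CROSS JOIN abc_vals a;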
On the other hand, why would you store such values in the table at all? That's redundant. Omit (drop) columns J, K and L from the table and compute their values whenever needed, e.g.
select a1,
c1,
i3 / f as J --> this
from abc_temp
where ...
Or, you could even create a view using the same SELECT (I posted above) and select values from the view.
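For example (a sketch; the view name is made up, and it assumes "low of J & K" means the smaller of the two, i.e. LEAST):
CREATE OR REPLACE VIEW abc_temp_v AS
SELECT a1, b1, c1, d1, e1, f, g1, h3, i3,
       i3 / f AS j,                  -- derived instead of stored
       h3 / g1 AS k,
       LEAST(i3 / f, h3 / g1) AS l
FROM abc_temp;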

Finding maximal sum of happiness

I have a problem to solve, and I do not see any optimal solution :/ The problem is:
I have n workers and k jobs. Each job is to be done by a specified number of workers, and each worker has his own level of happiness for each job. I have to make a work schedule so that the workers are as happy as possible.
So, I have an array a1 of int[n,k] (k >= n). The k-th column of the i-th row contains the preference (a number from 0 to 10) of the i-th worker for the k-th job. I also have an array a2 of int[k], where the i-th element contains the number of people who will be doing that job. Each worker is to do the same number of jobs. I have to find the maximal possible sum of happiness, knowing that n >= max(a2).
My solution is to use recursion: select the first combination of jobs for the first worker, add the preferences to the sum, check if the sum is higher than the maximum already found, and if it is, go on to the next worker. When backtracking, check the next combination for the first worker, etc. This works fine for a small number of workers, but has too high a computational complexity to solve bigger problems. Have you got any idea for a better solution?
PS. A guy from another site recommended using the Hungarian Algorithm, but it assumes that n == k, and I do not know how to make it work with n <= k.
PS2 an exaple:
a1:
          job1  job2  job3  job4
worker1      1     3     4     2
worker2      9     8     1     2
worker3      6     7     8     9
a2:
          job1  job2  job3  job4
count        1     2     2     1
example solution:
worker1: job2, job3 (7)
worker2: job1, job2 (17)
worker3: job3, job4 (17)
sum: 41
This looks like the Transportation Problem to me. It can be solved using the Hungarian Algorithm, though. First, let's set up the matrix for the Hungarian Algorithm.
The Hungarian Algorithm finds a minimum-sum assignment, so to make it solve a maximum-sum problem you first have to invert all of your happiness values.
J1 J2 J3 J4
W1 1 3 4 2
W2 9 8 1 2
W3 6 7 8 9
Subtract each value from the greatest value in the matrix.
The greatest value in this matrix is 9.
J1 J2 J3 J4
W1 9-1 9-3 9-4 9-2
W2 9-9 9-8 9-1 9-2
W3 9-6 9-7 9-8 9-9
J1 J2 J3 J4
W1 8 6 5 7
W2 0 1 8 7
W3 3 2 1 0
Now, as you noted, the Hungarian Algorithm only works on square matrices. To make it work on a rectangular matrix, we must make it square. We can do this by adding dummy rows or columns filled with zeroes.
J1 J2 J3 J4
W1 8 6 5 7
W2 0 1 8 7
W3 3 2 1 0
WD 0 0 0 0
Now that we have it in a usable form, we can solve for the minimum sum. I'm going to skip to the solution as instructions on how to use the Hungarian Algorithm are readily available elsewhere.
W1 -> J3
W2 -> J1
W3 -> J4
WD -> J2 (Except this is a dummy row so it doesn't count.)
We have now assigned one job to each of our workers. This is where your second array comes into play.
J1 J2 J3 J4
1 2 2 1
We have assigned a worker to jobs 1, 3, and 4, so we will subtract 1 from their respective values.
J1 J2 J3 J4
0 2 1 0
Since we no longer need anyone to do jobs 1 or 4, we can remove their columns from our happiness matrix as well.
J2 J3
W1 6 5
W2 1 8
W3 2 1
We still have jobs to do though, so we go through the process again.
Add dummy columns to make the matrix square.
J2 J3 JD
W1 6 5 0
W2 1 8 0
W3 2 1 0
and solve. Remember that the columns are for jobs 2 and 3, not 1 and 2.
W1 -> JD
W2 -> J2
W3 -> J3
We have now gone through the algorithm twice, and have assigned five jobs.
W1 -> J3
W2 -> J1, J2
W3 -> J4, J3
We would now go through the entire process again. Since there is only one more job to assign, and one person to assign it to (W1 has been assigned only one job, but all workers must be assigned the same number), we can skip straight to our final solution.
W1 -> J3, J2
W2 -> J1, J2
W3 -> J4, J3
and the happiness values for this are:
W1 -> 4 + 3 = 7
W2 -> 9 + 8 = 17
W3 -> 9 + 8 = 17
for a total of 41.
Another way to use the Hungarian algorithm directly would be to make a2[i] copies of the column for job i, giving sum(a2) columns; that works out if sum(a2) is a multiple of n, so that each worker's row can be duplicated sum(a2)/n times. If k << n, then you're probably better off formulating it as a min-cost circulation problem.
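As a concrete sketch of that duplication idea, using SciPy's linear_sum_assignment (an implementation of the Hungarian method) on the question's example; everything beyond the two input arrays is an assumption:
import numpy as np
from scipy.optimize import linear_sum_assignment

a1 = np.array([[1, 3, 4, 2],    # happiness of worker i for job j
               [9, 8, 1, 2],
               [6, 7, 8, 9]])
a2 = [1, 2, 2, 1]               # workers required per job

n, k = a1.shape
jobs_per_worker = sum(a2) // n  # assumes sum(a2) is a multiple of n

# Duplicate job j's column a2[j] times and each worker's row
# jobs_per_worker times, giving a square sum(a2) x sum(a2) matrix.
cols = [j for j in range(k) for _ in range(a2[j])]
rows = [i for i in range(n) for _ in range(jobs_per_worker)]
cost = -a1[np.ix_(rows, cols)]  # negate so minimizing maximizes happiness

r, c = linear_sum_assignment(cost)
print("max happiness:", -cost[r, c].sum())  # 41 for this example
for ri, ci in zip(r, c):
    print(f"worker{rows[ri] + 1} -> job{cols[ci] + 1}")
# Caveat: with duplicated columns a worker could in principle be assigned
# the same job twice; it does not happen in this example, but a full
# solution should detect and repair that case.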

Join two sorted files using Hive/Hadoop

I have two sorted files and I need to join them using Hive or Hadoop and aggregate by a key.
File A is sorted by (A.X, A.Y) and file B is sorted by (B.X, B.Y). I can do the join using Hive, create an intermediate result, and then execute another query to sum the values. What is the best way to perform this operation: a MapReduce job or Hive? File B is much smaller than file A. Can I use in my favor the fact that files A and B are sorted?
FILE A      FILE B   INTERMEDIATE_FILE   FINAL_FILE
X Y  Z      X Y      X Y  Z              X Y
1 V1 10     1 V1     1 V1 10             1 30 (20 + 10)
1 V1 20     2 V2     1 V1 20             2 50 (50)
1 V2 30     3 V1     2 V2 50             3 130 (60 + 70)
2 V1 40              3 V1 60
2 V2 50              3 V1 70
3 V1 60
3 V1 70
4 V1 80
Thanks
You can join the data using the 'merge' option in Pig.
Example:
data_a = load '$input1' as (X, Y, Z);
data_b = load '$input2' as (P, Q);
join_data = join data_a by (X,Y), data_b by (P,Q) using 'merge';
Perform your aggregation logic on the join_data relation.
This is a sort-merge join operation. The join can be done in the map phase by opening both files and walking through them. Pig refers to this as a merge join because it is a sort-merge join, but the sort has already been done.
Source: Programming Pig by Alan Gates.
I created an identity mapper/reducer job, and then executed another job using CompositeInputFormat. In the map phase, I did the calculation using a pattern called the "in-mapper combiner", so this second job has no reducer. I think this solution is going to scale linearly: if I double the size of my cluster, the job should finish in half the time.
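A minimal sketch of the in-mapper combiner pattern as a Hadoop Streaming mapper in Python (the field layout is hypothetical: tab-separated X, Y, Z lines coming out of the join). Instead of emitting one record per input line, the mapper accumulates partial sums per key in memory and emits each key once at the end, which is what cuts the shuffle volume:
import sys
from collections import defaultdict

def main():
    sums = defaultdict(int)
    for line in sys.stdin:
        x, _y, z = line.rstrip("\n").split("\t")
        sums[x] += int(z)             # combine in memory, don't emit yet
    for x, total in sums.items():
        print(f"{x}\t{total}")        # one record per key per mapper

if __name__ == "__main__":
    main()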

Similar queries have way different execution times

I had the following query:
SELECT nvl(sum(adjust1),0)
FROM (
SELECT
ManyOperationsOnFieldX adjust1,
a, b, c, d, e
FROM (
SELECT
a, b, c, d, e,
SubStr(balance, INSTR(balance, '[&&2~', 1, 1)) X
FROM
table
WHERE
a >= To_Date('&&1','YYYYMMDD')
AND a < To_Date('&&1','YYYYMMDD')+1
)
)
WHERE
b LIKE ...
AND e IS NULL
AND adjust1>0
AND (b NOT IN ('...','...','...'))
OR (b = '...' AND c <> NULL)
I tried to change it to this:
SELECT nvl(sum(adjust1),0)
FROM (
SELECT
ManyOperationsOnFieldX adjust1
FROM (
SELECT
SubStr(balance, INSTR(balance, '[&&2~', 1, 1)) X
FROM
table
WHERE
a >= To_Date('&&1','YYYYMMDD')
AND a < To_Date('&&1','YYYYMMDD')+1
AND b LIKE '..'
AND e IS NULL
AND (b NOT IN ('..','..','..'))
OR (b='..' AND c <> NULL)
)
)
WHERE
adjust1>0
My intention was to have all the filtering in the innermost query, and only pass to the outer ones the field X, which is the one I have to operate on a lot. However, the first (original) query takes a couple of seconds to execute, while the second one won't even finish; I waited for almost 20 minutes and still didn't get an answer.
Is there an obvious reason for this to happen that I might be overlooking?
These are the plans for each of them:
SELECT STATEMENT optimizer=all_rows (cost = 973 Card = 1 bytes = 288)
SORT (aggregate)
PARTITION RANGE (single) (cost=973 Card = 3 bytes = 864)
TABLE ACCESS (full) OF "table" #3 TABLE Optimizer = analyzed(cost=973 Card = 3 bytes=564)
SELECT STATEMENT optimizer=all_rows (cost = 750.354 Card = 1 bytes = 288)
SORT (aggregate)
PARTITION RANGE (ALL) (cost=750.354 Card = 64.339 bytes = 18.529.632)
TABLE ACCESS (full) OF "table" #3 TABLE Optimizer = analyzed(cost=750.354 Card = 64.339 bytes=18.529.632)
Your two queries are not equivalent: the logical operator AND is evaluated before the operator OR:
SQL> WITH data AS
2 (SELECT rownum id
3 FROM dual
4 CONNECT BY level <= 10)
5 SELECT *
6 FROM data
7 WHERE id = 2
8 AND id = 3
9 OR id = 5;
ID
----------
5
So your first query means: Give me the big SUM over this partition when the data is this way.
Your second query means: give me the big SUM over (this partition when the data is this way) or (when the data is this other way [no partition elimination hence big full scan])
Be careful when mixing the logical operators AND and OR. My advice would be to use brackets so as to avoid any confusion.
It is all about your OR... Try this:
SELECT nvl(sum(adjust1),0)
FROM (
SELECT
ManyOperationsOnFieldX adjust1
FROM (
SELECT
SubStr(balance, INSTR(balance, '[&&2~', 1, 1)) X
FROM
table
WHERE
a >= To_Date('&&1','YYYYMMDD')
AND a < To_Date('&&1','YYYYMMDD')+1
AND (
b LIKE '..'
AND e IS NULL
AND (b NOT IN ('..','..','..'))
OR (b='..' AND c <> NULL)
)
)
)
WHERE
adjust1>0
Because you have the OR inline with the rest of your AND conditions with no parentheses, the second version isn't limiting the data checked to just the rows that fall within the date filter. For more info, see the documentation on condition precedence.

What is the best way to distribute n forms in c categories between u users? [closed]

I have asked this question on cstheory too.
I have a form-distribution problem. There are n forms in c categories (each form in exactly one category), and there are u users, each of whom can receive forms from at least one category (but possibly more than one).
The goal is to distribute the forms among the users so that each user receives the same number of forms. I would also prefer the categories to be used equally.
For example:
If categories are:
C1 : 20 forms
C2 : 3 forms
C3 : 8 forms
C4 : 2 forms
And users are:
U1 with access to C1 and C2
U2 with access to C2
U3 with access to C3
U4 with access to C1 and C3
U5 with access to C2 and C4
The answer should be:
U1: 1 x C1 + 1 x C2 | 2 x C1 (preferred)
U2: 2 x C2
U3: 2 x C3
U4: 1 x C1 + 1 x C3 | 2 x C1 (preferred) | 2 x C3
U5: 2 x C4
And 23 forms remain.
Do you have any suggestions on how I can write such an algorithm?
There is a second part to the question: some categories may have a SHOULD CONTRIBUTE option. If it is set, all remaining forms in that category are distributed among the users who have access to it. For example, if C1 has this option enabled, the answer should be:
U1: 1 x C1 + 1 x C2 + 9 C1
U2: 2 x C2
U3: 2 x C3
U4: 2 x C3 (to minimize remaining forms in C3 category) + 10 C1
U5: 2 x C4
and remaining forms would be 0 in C1, 0 in C2, 4 in C3 and 0 in C4.
I think it's kind of a bin-packing problem, but I am not sure, and I don't know how to solve it! :(
Note: the answers above are not necessarily the best answers; they are just what I think!
It seems to me that if you fix a number N of forms per user and ask the question "can we give N forms to each user?", then you can turn this into a maximum-flow problem (http://en.wikipedia.org/wiki/Maximum_flow_problem), where each user can receive flow/forms from their subset of categories, and there is an outflow of capacity N from each user. Also, if you can solve this problem for N, you can solve it for all lesser values of N.
So you could solve the first problem by running max-flow lg (maximum N) times, using a binary chop to find out what the best possible value of N is. Since you can solve it by max flow, you can also solve it by linear programming. Doing it this way, perhaps just for the critical value of N, might allow you to favour some assignments over others, or perhaps to see where there are neighbouring feasible solutions, and then see if you can mix them to use categories equally.
Example - Create a source, and link it to each of the categories Ci, with the capacity of the link being the number of forms available in that category, so C1 gets a link from the source of capacity 20. Create links with their source's capacity between users and categories, where the user has access to the category, so U1 gets links to C1 and C2, but U2 only gets a link to C2. Now create links of capacity N from each user to a single sink. If there is an assignment of forms to users that gives every user N forms, then this will produce a maximum flow that fills every link from user to sink, and you can look at the flows between users and categories to see how to assign forms.
You could start off with N = 3, because user 2 only has access to a total of 3 forms, so the answer can't be greater than that. That won't work because you have said the right answer has N = 2, so the max flow won't fill all the N = 3 capacity links. So your program tries again at 3/2 = 1 and finds a solution (you have provided a solution for N = 2, so there must be one for N = 1). Now the program knows there is a solution for N = 1 but not one for N = 3, so it tries one halfway between at N = (1 + 3) / 2 = 2, and finds your solution. There is one for N = 2 but not for N = 3, so the N = 2 solution is the best you can do.
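A sketch of this formulation plus the binary chop, using networkx and the numbers from the question (the graph shape follows the paragraph above; node names are made up):
import networkx as nx

forms = {"C1": 20, "C2": 3, "C3": 8, "C4": 2}
access = {"U1": ["C1", "C2"], "U2": ["C2"], "U3": ["C3"],
          "U4": ["C1", "C3"], "U5": ["C2", "C4"]}

def feasible(n):
    """Can every user receive exactly n forms?"""
    g = nx.DiGraph()
    for cat, cap in forms.items():
        g.add_edge("src", cat, capacity=cap)   # supply per category
    for user, cats in access.items():
        for cat in cats:
            g.add_edge(cat, user)              # no capacity attr = unbounded
        g.add_edge(user, "sink", capacity=n)   # demand per user
    value, _ = nx.maximum_flow(g, "src", "sink")
    return value == n * len(access)

# Binary chop on N; upper bound is the poorest user's reachable forms.
lo, hi = 0, min(sum(forms[c] for c in cats) for cats in access.values())
while lo < hi:
    mid = (lo + hi + 1) // 2
    if feasible(mid):
        lo = mid
    else:
        hi = mid - 1
print("best N:", lo)  # prints 2 for the question's numbers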
