Join two sorted files using Hive/Hadoop - performance

I have two sorted files that I need to join using Hive or Hadoop and then aggregate by a key.
File A is sorted by (A.X, A.Y) and file B is sorted by (B.X, B.Y). I can do the join in Hive, produce an intermediate result, and then run another query to sum the values. What is the best way to perform this operation: a MapReduce job or Hive? File B is much smaller than file A. Can I take advantage of the fact that both files are sorted?
FILE A
X Y  Z
1 V1 10
1 V1 20
1 V2 30
2 V1 40
2 V2 50
3 V1 60
3 V1 70
4 V1 80

FILE B
X Y
1 V1
2 V2
3 V1

INTERMEDIATE_FILE
X Y  Z
1 V1 10
1 V1 20
2 V2 50
3 V1 60
3 V1 70

FINAL_FILE
X Y
1 30  (20 + 10)
2 50  (50)
3 130 (60 + 70)
Thanks

You can join the data using the 'merge' option in Pig.
Example:
data_a = load '$input1' as (X, Y, Z);
data_b = load '$input2' as (P, Q);
join_data = join data_a by (X,Y), data_b by (P,Q) using 'merge';
Perform your aggregation logic on the join_data relation.
This is a sort-merge join operation. The join can be done in the map phase by opening both files and walking through them. Pig refers to this as a merge join because it is a sort-merge join, but the sort has already been done.
Source: Programming Pig by Alan Gates.
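For intuition, here is a minimal single-process Python sketch of the same sort-merge idea (the file names, the whitespace-separated layout, and the assumptions that file B's (X, Y) keys are unique and that plain string comparison matches the files' sort order are all illustrative, not from the question): because both inputs are already sorted on (X, Y), one pass over each file joins them and sums Z per X.

from collections import defaultdict

def rows(path, with_z):
    # yield ((X, Y), Z) for file A, or ((X, Y), None) for file B
    with open(path) as f:
        for line in f:
            parts = line.split()
            key = (parts[0], parts[1])
            yield (key, int(parts[2])) if with_z else (key, None)

def merge_join_sum(path_a, path_b):
    totals = defaultdict(int)                       # X -> sum of Z over joined rows
    b_iter = rows(path_b, with_z=False)
    b_key = next(b_iter, (None, None))[0]
    for key, z in rows(path_a, with_z=True):
        while b_key is not None and b_key < key:
            b_key = next(b_iter, (None, None))[0]   # advance B until it catches up with A
        if b_key == key:                            # (X, Y) present in both files
            totals[key[0]] += z
    return dict(totals)

# print(merge_join_sum('file_a.txt', 'file_b.txt'))
# for the sample data this yields {'1': 30, '2': 50, '3': 130}, i.e. FINAL_FILE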

I created an identity mapper/reducer job, and then executed another job using CompositeInputFormat. In the map phase I did the calculation, using a pattern called the "in-mapper combiner", so this second job has no reducer. I think this solution is going to scale linearly: if I double the size of my cluster, the job should finish in roughly half the time.
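For what it's worth, here is a rough Python sketch of the in-mapper combining pattern (just the shape of the idea, not the Java MapReduce/CompositeInputFormat code this answer describes): the mapper keeps an in-memory dictionary of partial sums and emits one record per key only when its split is exhausted, which is why no combiner or reducer is needed as long as each key is confined to a single input split.

class SumJoinMapper:
    def __init__(self):
        self.partials = {}                       # X -> running sum of Z

    def map(self, joined_record):
        # joined_record is one row of the map-side join, e.g. ('1', 'V1', 10)
        x, _y, z = joined_record
        self.partials[x] = self.partials.get(x, 0) + z

    def close(self):
        # emit once per key at the end of the split instead of once per input record
        for x, total in self.partials.items():
            yield x, total

mapper = SumJoinMapper()
for record in [('1', 'V1', 10), ('1', 'V1', 20), ('2', 'V2', 50)]:
    mapper.map(record)
print(dict(mapper.close()))                      # {'1': 30, '2': 50}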

Related

Select only first rows in each h2o dataframe group_by group (for merging)?

Is there a way to select only the first row in each h2o dataframe group_by group?
The reason for doing this is to merge some columns of an h2o dataframe into a group_by'ed version of that dataframe that was created to get some stats based on particular groupings in the original.
For example, suppose we had two dataframes like
df1
receipt_key b c item_id
------------------------
a1 1 2 1
a2 3 4 1
and
df2
receipt_key e f item_id
--------------------------
a1 5 6 1
a1 7 8 2
a2 9 10 1
I would like to join them such that I end up with the dataframe
df3
receipt_key b c e f item_id
-----------------------------
a1 1 2 5 6 1
a2 3 4 9 10 1
I have tried doing something like df2.group_by('receipt_key').max('item_id') to merge into df1, but doing so leaves only the item_id column in the group's get_frame() dataframe (and even listing all of the columns in df2 to max() on would not give the right values, besides being cumbersome for my actual use case, which has many more columns in df2).
Any ideas on how this could be done? Would simply deleting duplicates be sufficient to get the desired dataframe (though there appear to be barriers to doing this in h2o, see https://0xdata.atlassian.net/browse/PUBDEV-3292)?
here you go:
import h2o
h2o.init()
df1 = h2o.H2OFrame({'receipt_key': ['a1', 'a2'], 'b': [1, 3], 'c': [2, 4], 'item_id': [1, 1]})
df1['receipt_key'] = df1['receipt_key'].asfactor()
df2 = h2o.H2OFrame({'receipt_key': ['a1', 'a1', 'a2'], 'e': [5, 7, 9], 'f': [6, 8, 10], 'item_id': [1, 2, 1]})
df2['receipt_key'] = df2['receipt_key'].asfactor()
df3 = df1.merge(df2)   # merge joins on the columns the frames share: receipt_key and item_id
df_subset = df3[['receipt_key', 'b', 'c', 'e', 'f', 'item_id']]
print(df_subset)
receipt_key b c e f item_id
a1 1 2 5 6 1
a2 3 4 9 10 1

Grouping connected pairs of values

I have a list containing unique pairs of values x and y; for example:
x y
-- --
1 A
2 A
3 A
4 B
5 A
5 C
6 D
7 D
8 C
8 E
9 B
9 F
10 C
10 G
I want to divide this list of pairs as follows:
Group 1
1 A
2 A
3 A
5 A
5 C
8 C
10 C
8 E
10 G
Group 2
4 B
9 B
9 F
Group 3
6 D
7 D
Group 1 contains
all pairs where y = 'A' (1-A, 2-A, 3-A, 5-A)
any additional pairs where x = any of the x's above (5-C)
any additional pairs where y = any of the y's above (8-C, 10-C)
any additional pairs where x = any of the x's above (8-E, 10-G)
The pairs in Group 2 can't be reached in such a manner from any pairs in Group 1, nor can the pairs in Group 3 be reached from either Group 1 or Group 2.
As suggested in Group 1, the chain of connections can be arbitrarily long.
I'm exploring solutions using Perl, but any sort of algorithm, including pseudocode, would be fine. For simplicity, assume that all of the data can fit in data structures in memory.
[UPDATE] Because I need to apply this approach to 5.3 billion pairs, scalability is important to me.
Pick a starting point. Find all points reachable from it, removing them from the master list. Repeat for each newly added point, until no more can be reached. Then move on to the next group, starting with another remaining point. Continue until there are no remaining points.
pool = [(1 A), (2 A), (3 A), (4 B), ... (10 G)]
group_list = []

while pool is not empty
    group = [ pool[0] ]                  // start with the next available point
    remove pool[0] from pool
    pos = 0
    while pos < size(group)              // while the group still has unprocessed points
        group_point = group[pos]         // grab the next unprocessed point
        pos += 1
        for point in pool                // find all remaining points reachable from it
            if point and group_point have a coordinate in common
                remove point from pool
                add point to group
    // we've reached closure with that starting point
    add group to group_list

return group_list
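For reference, a direct Python translation of this pseudocode might look like the following (the representation of pairs as (x, y) tuples is just one possible choice):

def group_pairs(pairs):
    pool = list(pairs)
    group_list = []
    while pool:
        group = [pool.pop(0)]                    # start with the next available point
        pos = 0
        while pos < len(group):                  # while the group has unprocessed points
            gx, gy = group[pos]
            pos += 1
            still_unreached = []
            for point in pool:                   # split pool into reachable / not reachable
                if point[0] == gx or point[1] == gy:
                    group.append(point)
                else:
                    still_unreached.append(point)
            pool = still_unreached
        group_list.append(group)                 # closure reached for this starting point
    return group_list

pairs = [(1, 'A'), (2, 'A'), (3, 'A'), (4, 'B'), (5, 'A'), (5, 'C'), (6, 'D'),
         (7, 'D'), (8, 'C'), (8, 'E'), (9, 'B'), (9, 'F'), (10, 'C'), (10, 'G')]
for group in group_pairs(pairs):
    print(group)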
You can think of the letters and numbers as nodes of a graph, and the pairs as edges. Divide this graph into its connected components, which can be done in linear time with a standard graph traversal.
The connected component with 'A' forms group 1. The other connected components form the other groups.
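As a sketch of this idea, one way to build the connected components is with a union-find (disjoint-set) structure; with 5.3 billion pairs you would want a disk-backed or distributed variant, but the logic is the same. The ('x', ...) / ('y', ...) prefixes are only there to keep the two kinds of nodes distinct.

from collections import defaultdict

def connected_groups(pairs):
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]   # path halving keeps trees shallow
            node = parent[node]
        return node

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb

    for x, y in pairs:                            # each pair is an edge between its two nodes
        union(('x', x), ('y', y))

    groups = defaultdict(list)
    for x, y in pairs:                            # pairs in one component form one group
        groups[find(('x', x))].append((x, y))
    return list(groups.values())

pairs = [(1, 'A'), (2, 'A'), (3, 'A'), (4, 'B'), (5, 'A'), (5, 'C'), (6, 'D'),
         (7, 'D'), (8, 'C'), (8, 'E'), (9, 'B'), (9, 'F'), (10, 'C'), (10, 'G')]
for group in connected_groups(pairs):
    print(group)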

Subset a data frame in R based on above and below a threshold value

I searched a lot to find a similar post to mine, but no luck yet.
I have one column of data, shown below (extracted from an original big file with many columns).
C1
0
1
2
3
4
3
3
2
1
From this data I want to generate a new column C2 that indicates where the C1 values are above or below a threshold relative to the max value.
In this case max(C1) is 4, so with a threshold of 2 the new data should look like this:
C1 C2
0 0
1 0
2 1
3 1
4 1
3 1
3 1
2 1
1 0
Note: my data always has an increasing trend up to some point and a decreasing trend after that.
I know how to do a simple subset on a particular column, but I am not seeing the logic for subsetting when there is an increasing and then a decreasing trend.
Thanks in advance.
I would use the plyr package in R with an ifelse statement as part of the mutate function. I will write my code and then explain. I assume you already have the C1 vector in a data frame named df.
install.packages('plyr')
library(plyr)
df2 <- mutate(df, C2 = ifelse(C1 >= 2, 1, 0))  # 1 where C1 is at or above the threshold of 2, else 0
The mutate function creates a new column defined by whatever expression you wish. In this case I used the ifelse function, which works like Excel's IF() function and takes: a condition, what happens if it is true, and what happens if it is false.
Hope that helps =)

Stata - How to Generate Random Integers

I am learning Stata and want to know how to generate random integers (without replacement). If I had 10 total rows, I would want each row to have a unique integer from 1 to 10 assigned to it. In R, one could simply do:
sample(1:10, 10)
But it seems more difficult to do in Stata. From this Stata page, I saw:
generate ui = floor((b-a+1)*runiform() + a)
If I substitute a=1 and b=10, I get something close to what I want, but it samples with replacement.
After getting that part figured out, how would I handle the following wrinkle: my data come in pairs. For example, in the 10 observations, there are 5 groups of 2. Each group of 2 has a unique identifier. How would I arrange the groups (and not the observations) in random order? The data would look something like this:
obs group mem value
1 A x 9345
2 A y 129
3 B x 251
4 B y 373
5 C x 788
6 C y 631
7 D x 239
8 D y 481
9 E x 224
10 E y 585
obs is the observation number. group is the group the observation (row) belongs to. mem is the member identifier in the group. Each group has one x and one y in it.
First question:
You could just shuffle observation identifiers.
set obs 10
gen y = _n
gen rnd = runiform()
sort rnd
Or in Mata
jumble(1::10)
Second question: Several ways. Here's one.
gen rnd = runiform()
bysort group (rnd): replace rnd = rnd[1]
sort rnd
General comment: For reproducibility, set the random number seed beforehand.
set seed 2803
or whatever.

Hungarian algorithm - assign systematically

I'm implementing the Hungarian algorithm in a project. I managed to get it working up to what is called step 4 on Wikipedia. I can get the computer to create enough zeroes so that the minimal number of covering lines equals the number of rows/columns, but I'm stuck when it comes to actually assigning the right agent to the right job. I see how I could do the assignment by hand, but that's more trial and error; i.e., I do not see the systematic method, which is of course essential for the computer to get it to work.
Say we have this matrix in the end:
a b c d
0 30 0 0 0
1 0 35 5 0
2 60 5 0 0
3 0 50 35 40
The zeroes we have to take so that each agent is assigned to a job are (a, 3), (b, 0), (c, 2) and (d, 1). What is the system behind choosing these? My code now picks (b, 0) first, and ignores row 0 and column b from then on. However, it then picks (a, 1), and with that value picked there is no assignment possible for row 3 anymore.
Any hints are appreciated.
Well, I did manage to solve it in the end. The method I used was to check whether there are any columns/rows with only one zero. In that case, that agent must take that job, and that column and row have to be ignored from then on. Then repeat, so that every agent gets a job.
In my example, (b, 0) would be the first choice. After that we have:
a b c d
0 x x x x
1 0 x 5 0
2 60 x 0 0
3 0 x 35 40
Using the method again, we can do (a, 3), etc. I'm not sure whether it has been proven that this is always correct, but it seems it is.
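Here is a small Python sketch of that single-zero heuristic (my own illustration, not a full Hungarian implementation; it is greedy and does not backtrack, so it can stall on matrices where every remaining row and column has two or more zeros):

def assign_by_single_zeros(matrix):
    # Repeatedly take any row or column that contains exactly one uncovered zero,
    # assign that cell, and ignore its row and column from then on.
    n = len(matrix)
    free_rows = set(range(n))
    free_cols = set(range(n))
    assignment = {}                               # row -> column

    def zeros_in_row(r):
        return [c for c in free_cols if matrix[r][c] == 0]

    def zeros_in_col(c):
        return [r for r in free_rows if matrix[r][c] == 0]

    progress = True
    while progress and free_rows:
        progress = False
        for r in list(free_rows):                 # rows with a single uncovered zero
            zs = zeros_in_row(r)
            if len(zs) == 1:
                assignment[r] = zs[0]
                free_rows.discard(r)
                free_cols.discard(zs[0])
                progress = True
        for c in list(free_cols):                 # columns with a single uncovered zero
            zs = zeros_in_col(c)
            if len(zs) == 1:
                assignment[zs[0]] = c
                free_rows.discard(zs[0])
                free_cols.discard(c)
                progress = True
    return assignment

# The matrix from the question; columns a-d are indices 0-3.
m = [[30, 0, 0, 0],
     [0, 35, 5, 0],
     [60, 5, 0, 0],
     [0, 50, 35, 40]]
print(sorted(assign_by_single_zeros(m).items()))
# [(0, 1), (1, 3), (2, 2), (3, 0)], i.e. (b, 0), (d, 1), (c, 2), (a, 3)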
