Using Pig, best way to count numbers within tuples - hadoop

I'm working with tuples of data:
dump c;
(20 5 5)
(1 1 1 5 10)
The output I'm trying to achieve is to count the occurrences of each number in total, like this:
(1,3)
(5,3)
(10,1)
(20,1)
I attempted this command, and it was unsuccessful:
d = FOREACH c GENERATE COUNT($0);
I currently do not have a schema for c (not sure whether that matters at this point):
describe c;
Schema for c unknown.
Looking for suggestions.

Input Tuple:
(20 5 5)
(1 1 1 5 10)
You can get the counts by tokenizing each line and then grouping on the tokens.
A = LOAD 'file' using TextLoader() as (line:chararray);
B = FOREACH A GENERATE FLATTEN(TOKENIZE(line)) as (line:chararray);
C = GROUP B BY line;
D = FOREACH C GENERATE group,COUNT(B);
dump D;
Output:
(1,3)
(5,3)
(10,1)
(20,1)
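For comparison only (this is not part of the original answer), the same tokenize-flatten-group-count pipeline can be sketched in plain Python, with the input lines hardcoded:

```python
from collections import Counter

# Each string stands in for one line of the input file, e.g. "20 5 5".
lines = ["20 5 5", "1 1 1 5 10"]

# FLATTEN(TOKENIZE(line)): split every line into individual tokens.
tokens = [tok for line in lines for tok in line.split()]

# GROUP ... BY + COUNT: tally occurrences of each token.
counts = Counter(tokens)

for number, count in sorted(counts.items(), key=lambda kv: int(kv[0])):
    print((number, count))
```

This prints the same (number, count) pairs as the Pig script's dump D.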

Related

How to create a repeated sequence in Informatica?

How can I generate a repeated sequence using an Informatica mapping?
Src file
A
B
C
D
E
F
G
H
I
J
Trg file
A 1
B 1
C 2
D 2
E 3
F 3
G 4
H 4
I 5
J 5
Thank you in advance.
You can use a Sequence Generator, and then an Expression that divides the value of NEXTVAL by 2:
OUT: ROUND(NEXTVAL / 2)
In the Sequence Generator you could set "Start Value" to 1 and check "Reset" so that the mapping always starts with 1 1 2 2 3 3 if that's what you need.
You should be able to achieve this using variable ports in an Expression transformation, as long as your input rows are sorted in the correct order, e.g. (pseudocode, with v_RowCount and v_Seq both initialized to 0):
v_RowCount = v_RowCount + 1
v_Seq = if v_RowCount Mod 2 = 1 then (v_Seq + 1) else v_Seq
(Output port) out_Seq = v_Seq
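To illustrate the repeated-sequence idea outside Informatica, here is a small Python sketch (the helper name and the repeat parameter are hypothetical, not from the thread). The counter advances on every odd row number, which is equivalent to ROUND(row / 2) with round-half-up:

```python
def repeated_seq(n_rows, repeat=2):
    """Emit a sequence where each value repeats `repeat` times: 1 1 2 2 3 3 ..."""
    out = []
    seq = 0
    for row_count in range(1, n_rows + 1):
        # Advance the sequence on rows 1, 1+repeat, 1+2*repeat, ...,
        # mirroring the variable-port logic in the Expression transformation.
        if row_count % repeat == 1:
            seq += 1
        out.append(seq)
    return out

print(repeated_seq(10))  # [1, 1, 2, 2, 3, 3, 4, 4, 5, 5]
```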

Select only first rows in each h2o dataframe group_by group (for merging)?

Is there a way to select only the first row in each h2o dataframe group_by group?
The reason for doing this is to merge some columns of an h2o dataframe into a group_by'ed version of that dataframe, which was created to get some stats based on particular groupings in the original.
For example, suppose we had two dataframes like
df1
receipt_key  b  c  item_id
--------------------------
a1           1  2  1
a2           3  4  1
and
df2
receipt_key  e  f   item_id
---------------------------
a1           5  6   1
a1           7  8   2
a2           9  10  1
I would like to join them so that I end up with the dataframe
df3
receipt_key  b  c  e  f   item_id
---------------------------------
a1           1  2  5  6   1
a2           3  4  9  10  1
I have tried something like df2.group_by('receipt_key').max('item_id') to merge into df1, but that leaves only the item_id column in the group's get_frame() dataframe. Listing all of the df2 columns in max() would not give the right values either, and would be cumbersome for my actual use case, which has many more columns in df2.
Any ideas on how this could be done? Would simply deleting duplicates be sufficient to get the desired dataframe (though there appear to be barriers to doing this in h2o; see https://0xdata.atlassian.net/browse/PUBDEV-3292)?
Here you go:
import h2o
h2o.init()
df1 = h2o.H2OFrame({'receipt_key': ['a1', 'a2'] , 'b':[1,3] , 'c':[2,4], 'item_id': [1,1]})
df1['receipt_key'] = df1['receipt_key'].asfactor()
df2 = h2o.H2OFrame({'receipt_key': ['a1', 'a1','a2'] , 'e':[5,7,9] , 'f':[6,8,10], 'item_id': [1,2,1]})
df2['receipt_key'] = df2['receipt_key'].asfactor()
df3 = df1.merge(df2)
df_subset = df3[['receipt_key','b','c','e','f','item_id']]
print(df_subset)
receipt_key  b  c  e  f   item_id
a1           1  2  5  6   1
a2           3  4  9  10  1
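If you do literally need the first row per receipt_key (rather than the merge above), here is a minimal pure-Python sketch of the dedup idea, assuming the rows are already in the order you want; the sample data mirrors df2:

```python
rows = [
    {'receipt_key': 'a1', 'e': 5, 'f': 6, 'item_id': 1},
    {'receipt_key': 'a1', 'e': 7, 'f': 8, 'item_id': 2},
    {'receipt_key': 'a2', 'e': 9, 'f': 10, 'item_id': 1},
]

seen = set()
first_rows = []
for row in rows:
    # Keep only the first row encountered for each receipt_key.
    if row['receipt_key'] not in seen:
        seen.add(row['receipt_key'])
        first_rows.append(row)

print(first_rows)
```

In h2o itself, deduplication support is what the linked PUBDEV-3292 ticket tracks.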

How to count number of occurrences in a sorted text file

I have a sorted text file with the following format:
Company1 Company2 Date TransactionAmount
A B 1/1/19 20000
A B 1/4/19 200000
A B 1/19/19 324
A C 2/1/19 3456
A C 2/1/19 663633
A D 1/6/19 3632
B C 1/9/19 84335
B C 1/23/19 253
B C 1/13/19 850
B D 1/1/19 234
B D 1/8/19 635
C D 1/9/19 749
C D 1/10/19 203200
Ultimately I want a Python dictionary so that each pair maps to a list containing the number of transactions and the total amount of all transactions. For instance, (A,B) would map to [3,220324].
The file has ~250,000 lines in this format and each pair may have 1 transaction up to ~10 or so transactions. There are also tens of thousands of pairs of companies.
Here's the only way I've thought of implementing it.
my_dict = {}
file = open("my_file.txt").readlines()[1:]
for i in file:
    i = i.split()
    pair = (i[0], i[1])
    amt = int(i[3])
    if pair in my_dict:
        exist = my_dict[pair]
        exist[0] += 1
        exist[1] += amt
        my_dict[pair] = exist
    else:
        my_dict[pair] = [1, amt]
I feel like there is a faster way to do this. Any ideas?
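One common speedup for this pattern (a suggestion, not from the original thread) is collections.defaultdict, which removes the per-line membership test; iterating the file object lazily instead of calling readlines() also avoids holding every line in memory. A sketch with the sample rows hardcoded as strings:

```python
from collections import defaultdict

# Stand-ins for lines of the file (header already skipped).
lines = [
    "A B 1/1/19 20000",
    "A B 1/4/19 200000",
    "A B 1/19/19 324",
    "A C 2/1/19 3456",
]

totals = defaultdict(lambda: [0, 0])  # pair -> [transaction count, total amount]
for line in lines:
    parts = line.split()
    entry = totals[(parts[0], parts[1])]  # created on first access
    entry[0] += 1
    entry[1] += int(parts[3])

print(totals[('A', 'B')])  # [3, 220324]
```

With a real file you would write `for line in f:` instead of materializing the list.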

Efficient way of finding rows in which A>B

Suppose M is a matrix where each row represents a randomized sequence of a pool of N objects, e.g.,
1 2 3 4
3 4 1 2
2 1 3 4
How can I efficiently find all the rows in which a number A comes before a number B?
e.g., A=1 and B=2; I want to retrieve the first and the second rows (in which 1 comes before 2)
There you go:
[iA jA] = find(M.'==A);
[iB jB] = find(M.'==B);
sol = find(iA<iB)
Note that this works because, according to the problem specification, every number is guaranteed to appear once in each row.
To find rows of M with a given prefix (as requested in the comments): let prefix be a vector with the sought prefix (for example, prefix = [1 2]):
find(all(bsxfun(@eq, M(:,1:numel(prefix)).', prefix(:))))
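For illustration only, the first-occurrence comparison behind this answer can be sketched in plain Python; list.index plays the role of find(..., 1):

```python
M = [[1, 2, 3, 4],
     [3, 4, 1, 2],
     [2, 1, 3, 4]]
A, B = 1, 2

# list.index gives the position of the first occurrence in each row;
# keep the rows where A's position precedes B's (1-based, as in MATLAB).
rows = [i + 1 for i, row in enumerate(M) if row.index(A) < row.index(B)]
print(rows)  # [1, 2]
```

As in the MATLAB version, this relies on every number appearing exactly once per row.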
Something like the following code should work; it checks whether A comes before B in each row.
temp = [1 2 3 4;
        3 4 1 2;
        2 1 3 4];
A = 1;
B = 2;
orderMatch = zeros(1,size(temp,1));
for i = 1:size(temp,1)
    match1 = temp(i,:) == A;
    match2 = temp(i,:) == B;
    aIndex = find(match1,1);
    bIndex = find(match2,1);
    if aIndex < bIndex
        orderMatch(i) = 1;
    end
end
solution = find(orderMatch);
This will produce orderMatch = [1 1 0], because the first two rows have 1 before 2 but the third row does not; solution then contains the row indices [1 2].
UPDATE
Added a find call on orderMatch to return row indices, as suggested by Luis.

Group by multiple fields and output tuple

I have a feed in the following format:
Hour Key ID Value
1 K1 001 3
1 K1 002 2
2 K1 005 4
1 K2 002 1
2 K2 003 5
2 K2 004 6
and I want to group the feed by (Hour, Key) then sum the Value but keep ID as a tuple:
({1, K1}, {001, 002}, 5)
({2, K1}, {005}, 4)
({1, K2}, {002}, 1)
({2, K2}, {003, 004}, 11)
I know how to use FLATTEN to generate the sum of the Value but don't know how to output ID as a tuple. This is what I have so far:
A = LOAD 'data' AS (Hour:chararray, Key:chararray, ID:chararray, Value:int);
B = GROUP A BY (Hour, Key);
C = FOREACH B GENERATE
FLATTEN(group) AS (Hour, Key),
SUM(A.Value) AS Value
;
Will you explain how to do this? Appreciate it!
You just need the bag projection operator, '.'. This creates a new bag whose tuples contain just the element(s) you specify; in your case, use A.ID. In fact, you are already using this operator to provide the input to SUM: its input is a bag of single-element tuples, which you create by projecting the Value field.
A = LOAD 'data' AS (Hour:chararray, Key:chararray, ID:chararray, Value:int);
B = GROUP A BY (Hour, Key);
C = FOREACH B GENERATE
FLATTEN(group) AS (Hour, Key),
A.ID,
SUM(A.Value) AS Value
;
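For reference only (not from the original answer), the grouping that this Pig script performs can be sketched in plain Python, collecting the IDs alongside the running sum:

```python
from collections import defaultdict

# (Hour, Key, ID, Value) records from the sample feed.
records = [
    ('1', 'K1', '001', 3),
    ('1', 'K1', '002', 2),
    ('2', 'K1', '005', 4),
    ('1', 'K2', '002', 1),
    ('2', 'K2', '003', 5),
    ('2', 'K2', '004', 6),
]

# GROUP A BY (Hour, Key): one entry per key pair, holding the ID bag and the sum.
groups = defaultdict(lambda: {'ids': [], 'total': 0})
for hour, key, id_, value in records:
    g = groups[(hour, key)]
    g['ids'].append(id_)   # A.ID projection
    g['total'] += value    # SUM(A.Value)

for (hour, key), g in sorted(groups.items()):
    print((hour, key), tuple(g['ids']), g['total'])
```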
