How to get count along with rest of the fields in Pig?

I have the following data set:
f1,f2,f3,f4,f5,f6
I am looking for the count of f6 along with the rest of the fields:
f1,f2,f3,f4,f5,5
f1,f2,f3,f4,f5,3
and so on.
I tried this code, but it takes too long to execute:
A = LOAD 'file a';
B = GROUP A BY f6;
C = FOREACH B GENERATE FLATTEN(group) AS f6, FLATTEN(f1), FLATTEN(f2), FLATTEN(f3), FLATTEN(f4), FLATTEN(f5), COUNT(f6);
Is there any better way to achieve what I am looking for?
If I simply try to get the count without FLATTEN, the fields end up in a bag, but I want the final output as a tuple.
So trying this gives me output as a bag:
C = FOREACH B GENERATE FLATTEN(group) AS f6, A.f1, A.f2, A.f3, A.f4, A.f5, COUNT(f6);
All inputs are appreciated.
Cheers

It is also possible to flatten the projection which was grouped.
A = LOAD 'file a';
B = GROUP A BY f6;
C = FOREACH B GENERATE FLATTEN(A), COUNT(A) as f6_count;
EDIT 1:
The key is using FLATTEN(A) instead of FLATTEN(group).
FLATTEN(A) will produce a tuple with all of the columns from the original relation (f1, f2, f3, f4, f5, f6), including those that were not used in the GROUP BY statement, and will get rid of the bag.
FLATTEN(group) will return only the columns used in the GROUP BY, in this case f6.
The advantage of this approach is that it is very efficient and requires a single MapReduce job to execute. Any solution that involves a JOIN operation adds an extra MapReduce job.
As a general rule of thumb in Pig, Hive, and MapReduce, GROUP BY and JOIN operations are usually executed as separate MR jobs, and reducing the number of MR jobs leads to improved performance.
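Putting it together, a minimal sketch of the whole script (the file name, the comma delimiter, and the chararray types are assumptions for illustration, not from the question):
```
A = LOAD 'file_a' USING PigStorage(',')
    AS (f1:chararray, f2:chararray, f3:chararray,
        f4:chararray, f5:chararray, f6:chararray);
B = GROUP A BY f6;
-- FLATTEN(A) restores one row per original tuple (all six columns);
-- COUNT(A) appends the size of that row's group to each row.
C = FOREACH B GENERATE FLATTEN(A), COUNT(A) AS f6_count;
DUMP C;
-- e.g. (f1,f2,f3,f4,f5,f6,5)
```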

Related

Elasticsearch: storing hierarchical data and querying it

Let me break down the problem; it will take some time.
Consider that you have entities A, B, and C in your system.
A is the parent of everything.
B is a child of A.
C can be a child of A or B. Please note there are some more entities like D, E, and F which are the same as C, so let's consider only C for the time being.
So basically it's a tree-like structure:
```
        A
       / \
      /   \
     B     C   (there are similar elements like D, E, F)
     |
     |
     C
```
Now, we are using Elasticsearch as a secondary DB to store this. In the primary database the structure is completely different: since A, B, and C have dynamic fields, they are different tables and we join them to get the data. But from a business perspective, this is the design.
Now, when we try to flatten it and store it in ES, consider the following data set.
We have an entity A1 which has two children, C1 and B1; B1 has a further child, C2.
```
    A    B     C
1   A1   null  null
2   A1   null  C1
3   A1   B1    null
4   A1   B1    C2
```
Now, what can you query?
The user says he wants all columns of A, B, and C where the value of column A is A1. After adding some null-removing rules, we can give him rows 2, 3, and 4.
Now the problem: the user says he wants all As where the value of A is A1. We will return him rows 1, 2, 3, 4 (or 2, 3, 4), so he will see values like
```
A
A1
A1
A1
```
but logically he should see only one value, A1, since that is the only unique value, and ES doesn't have the ability to group things.
So how did we solve this?
We solved the problem by creating multiple indices plus one nested index.
When we need to group by, we go to the nested index; the other indices work as flat indices.
So we have different indices, like one for A and B, one for A or B and C, and so on. Since we have more entities, this led to the creation of 5 indices.
As the data keeps increasing, it is becoming difficult to maintain 5 indices, and indexing them from scratch takes too much time.
To solve this, we started looking at other options and are testing CrateDB. But first we are still trying to figure out whether there is any way to do this in ES, since we need many ES features such as percolation, Watcher, etc. Any clues on that?
Please also note that we need to apply pagination as well; that's why a single nested index will not work.

How to specify keys in Apache Spark while joining two datasets

I am loading two files as below:
f1 = sc.textFile("s3://testfolder1/file1")
f2 = sc.textFile("s3://testfolder2/file2")
This load operation gives me a list of tuples; one tuple is created for each row.
The schemas for file1 and file2 are as below:
f1 (a,b,c,d,e,f,g,h,i)
f2 (x,y,z,a,b,c,f,r,u)
I want to join these two datasets based on fields a, b, and c. I did some research and found the method below, which might be useful:
rdd.keyBy(func)
However, I can't find an easy way to specify the keys and join the two datasets.
Can anyone demonstrate how to do it without using DataFrames? Use of Spark SQL is okay, but if it can be done without Spark SQL that would be best.
This load operation gives me a list of tuples
No, it will give you an RDD[String].
You can take the string, and convert it to anything you want.
For your use case, you can convert each line to ((a,b,c),(d,e,f,g,h,i)):
val f1 = sc.textFile("s3://testfolder1/file1").map { line =>
  // bind the nine fields by position (YOUR_DELIMITER is a placeholder)
  val a :: b :: c :: d :: e :: f :: g :: h :: i :: other = line.split(YOUR_DELIMITER).toList
  ((a, b, c), (d, e, f, g, h, i))
}
val f2 = sc.textFile("s3://testfolder2/file2").map { line =>
  // in file2 the join keys a, b, c are the 4th-6th fields
  val x :: y :: z :: a :: b :: c :: f :: r :: u :: other = line.split(YOUR_DELIMITER).toList
  ((a, b, c), (x, y, z, f, r, u))
}
and then, f1.join(f2) should just work.
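Since the question mentions rdd.keyBy, here is an equivalent sketch using it (the comma delimiter and the field positions are assumptions based on the schemas above):
```
// keep whole rows and derive the composite key from each one
val rows1 = sc.textFile("s3://testfolder1/file1").map(_.split(","))
val rows2 = sc.textFile("s3://testfolder2/file2").map(_.split(","))

val keyed1 = rows1.keyBy(r => (r(0), r(1), r(2)))  // a,b,c are fields 1-3 of file1
val keyed2 = rows2.keyBy(r => (r(3), r(4), r(5)))  // a,b,c are fields 4-6 of file2

// RDD[((a,b,c), (Array[String], Array[String]))]
val joined = keyed1.join(keyed2)
```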

Pig Latin join by field

I have a Pig Latin-related problem:
I have this data below (in one row):
A = LOAD 'records' AS (f1:chararray, f2:chararray,f3:chararray, f4:chararray,f5:chararray, f6:chararray);
DUMP A;
(FITKA,FINVA,FINVU,FEEVA,FETKA,FINVA)
Now I have another dataset:
B = LOAD 'values' AS (f1:chararray, f2:chararray);
Dump B;
(FINVA,0.454535)
(FITKA,0.124411)
(FEEVA,0.123133)
And I would like to join those two datasets: take the corresponding value from dataset B and place it beside the value from dataset A. The expected output is below:
FITKA 0.123133, FINVA 0.454535 and so on ..
(They can also be like: FITKA, 0.123133, FINVA, 0.454535 and so on .. )
And then I would be able to multiply the values (0.123133 x 0.454535, and so on) because they are on the same row now, and this is what I want.
Of course I can join column by column, but then the values end up at the end of the row and I have to clean that up with another FOREACH ... GENERATE. I want a simpler solution without too many joins, which may cause performance issues.
Dataset A is text (a sentence, in a way).
So what are my options to achieve this?
Any help would be nice.
A sentence can be represented as a tuple that contains a bag of (word, count) tuples.
Therefore, I suggest you change the way you store your data to the following format:
sentence:tuple(words:bag{wordcount:tuple(word, count)})
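One way to move toward that layout in plain Pig is to explode the sentence into one word per row and join against the values. This is a sketch; the double type for the value column and the TOBAG/FLATTEN restructuring are my assumptions, not part of the original answer:
```
A = LOAD 'records' AS (f1:chararray, f2:chararray, f3:chararray,
                       f4:chararray, f5:chararray, f6:chararray);
-- one row per word of the sentence
words = FOREACH A GENERATE FLATTEN(TOBAG(f1, f2, f3, f4, f5, f6)) AS word;
B = LOAD 'values' AS (word:chararray, val:double);
joined = JOIN words BY word, B BY word;
pairs  = FOREACH joined GENERATE words::word AS word, B::val AS val;
DUMP pairs;
-- (FITKA,0.124411)
-- (FINVA,0.454535)
-- ...
```
Grouping pairs afterwards (e.g. GROUP pairs ALL) collects the (word, value) tuples back into a single bag, which matches the suggested sentence:tuple(words:bag{...}) shape.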

Need to pull distinct tuples from a relation in PIG regardless of order i.e. (1,2)=(2,1)

I am very new to Pig and Hadoop, so please forgive me if this is very basic. I have a relation with a list of users and followers (like Twitter) in the format (userA,userB), meaning that userB follows userA. My assignment (yes, this is homework) is to find people who follow each other. I have done this, but I have twice as many tuples as I need, since the relation contains both (userA,userB) and (userB,userA). It doesn't matter which of the two tuples I end up with; I just need to eliminate one of them. The DISTINCT keyword won't do me any good since the order is reversed.
Without seeing your code, it seems you could try sorting the fields of the tuple before de-duplicating, like this:
X = FOREACH A GENERATE (f1 < f2 ? f1 : f2), (f1 < f2 ? f2 : f1);
Y = DISTINCT X;
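A quick illustration with hypothetical data (comparison on chararrays in Pig is lexicographic):
```
-- suppose A contains (alice,bob) and (bob,alice)
A = LOAD 'follows' AS (f1:chararray, f2:chararray);
X = FOREACH A GENERATE (f1 < f2 ? f1 : f2), (f1 < f2 ? f2 : f1);
-- both rows become (alice,bob)
Y = DISTINCT X;   -- a single (alice,bob) remains
```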

How to create a set of values after the GROUP function in Pig (Hadoop)

Let's say I have this set of values in file.txt:
a,b,c
a,b,d
k,l,m
k,l,n
k,l,o
And my code is:
file = LOAD 'file.txt' using PigStorage(',');
events = foreach file generate session_id, user_id, code, type;
gr = group events by (session_id, user_id);
and I get this set of values:
((a,b),{(a,b,c),(a,b,d)})
((k,l),{(k,l,m),(k,l,n),(k,l,o)})
And I'd like to have:
(a,b,(c,d))
(k,l,(m,n,o))
Have you got any idea how to do it?
Regards
Pawel
Note: you are inconsistent in your question. You say session_id, user_id, code, type in the FOREACH line, but your PigStorage does not provide a schema. Also, that FOREACH has four fields, while your sample data only has three columns. I'll assume that type doesn't exist in order to answer your question.
After your gr relation, you are left with the group-by key (in this case (session_id, user_id)) in an automatically generated tuple called group.
So, first step: gr2 = FOREACH gr GENERATE FLATTEN(group);
This will give you the tuples (a,b) and (k,l). You need to use FLATTEN because group is a tuple and you are asking for session_id and user_id to be individual columns. FLATTEN does that for you.
Ok, so now modify the gr2 line to also use a projection to tease out the third value:
gr2 = FOREACH gr GENERATE FLATTEN(group), events.code;
events.code creates a bag out of all the code values. events is the name of the bag of grouped tuples (it's named after the original relation).
This should give you:
(a,b,{(c),(d)})
(k,l,{(m),(n),(o)})
It's very important to note that the values in the list are in a bag, not a tuple as you asked for. Keeping them in a bag is the right idea, because a bag is a variable-length list while a tuple is not.
Additional advice: understanding how GROUP BY outputs data is something I see a lot of people struggle with when first using Pig. If you think my answer doesn't make much sense, I'd recommend spending some time really getting to understand GROUP BY. Understanding it, rather than treating it as magic, will pay off in the long run.
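Putting the steps together, the full script would look like this (a sketch assuming comma-separated columns and dropping the nonexistent type field, per the note above):
```
file   = LOAD 'file.txt' USING PigStorage(',')
         AS (session_id:chararray, user_id:chararray, code:chararray);
events = FOREACH file GENERATE session_id, user_id, code;
gr     = GROUP events BY (session_id, user_id);
gr2    = FOREACH gr GENERATE FLATTEN(group), events.code;
DUMP gr2;
-- (a,b,{(c),(d)})
-- (k,l,{(m),(n),(o)})
```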
