Pig Script: Null values after join and GENERATE - hadoop

I have a strange dilemma in my Pig Script. I am joining multiple tables and one of the last joins is as follows:
a = JOIN O_1 by ((long)OpropID, (long)OAID) LEFT, property by ((long)GPropID, (long)prop_AID);
If I filter my result by specific data points, I get the proper results for those fields from the property table (the right table in the join). Even without the filter, the result set is correct; I'm only filtering it to test the results.
b = filter a by OpropID==12 and OAID==10;
dump b;
However, if I create a subsequent GENERATE statement immediately after the join, the same fields (the last two in the example below) return NULL results:
c = FOREACH a GENERATE gID, p_AID, OpropID, OAID, GPropID, prop_AID;
I've tried using $16, $17 instead of the field names; I've also used property::GPropID or property::prop_AID to no avail.
Any help at this point would be appreciated.

Related

Propel, selecting just one column

I am trying to run a query in propel that runs an aggregate function (SUM).
My Code
$itemQuery = SomeEntity::Create();
$itemQuery->withColumn('SUM(SomeColumn)', 'someColumn')
->groupBy('SomeForeignKey');
Problem
In theory it should return the sum for every group of items, but Propel tries to fetch all columns and also appends a bunch of other columns to the GROUP BY clause. This results in unexpected grouping, so the sum is incorrect.
Is there any way to make Propel fetch just the column I am running the aggregate function on, so that the GROUP BY statement works as well?
You need to add a select statement for the column and the foreign key:
$itemQuery = SomeEntity::Create();
$itemQuery->select(array('SomeColumn', 'SomeForeignKey'));
$itemQuery->withColumn('SUM(SomeColumn)', 'someColumn');
$itemQuery->groupBy('SomeForeignKey');

Pig latin join by field

I have a Pig latin related problem:
I have this data below (in one row):
A = LOAD 'records' AS (f1:chararray, f2:chararray,f3:chararray, f4:chararray,f5:chararray, f6:chararray);
DUMP A;
(FITKA,FINVA,FINVU,FEEVA,FETKA,FINVA)
Now I have another dataset:
B = LOAD 'values' AS (f1:chararray, f2:chararray);
Dump B;
(FINVA,0.454535)
(FITKA,0.124411)
(FEEVA,0.123133)
I would like to join these two datasets: look up each value from dataset A in dataset B and place the corresponding number beside it. The expected output is:
FITKA 0.124411, FINVA 0.454535 and so on ..
(It could also look like: FITKA, 0.124411, FINVA, 0.454535 and so on ..)
Then I would be able to multiply the values (0.124411 x 0.454535 .. and so on), because they would be on the same row, which is what I want.
Of course I can join column by column, but then the values are appended at the end of the row and I have to clean that up with another FOREACH ... GENERATE. I would prefer a simpler solution without so many joins, which may cause performance issues.
Dataset A is text (a sentence, in a way).
So what are my options to achieve this?
Any help would be nice.
A sentence can be represented as a tuple that contains a bag of (word, count) tuples.
Therefore, I suggest you change the way you store your data to the following format:
sentence:tuple(words:bag{wordcount:tuple(word, count)})
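To make the intended result concrete, here is a minimal plain-Python model (not Pig) of the lookup and multiplication the question describes. The data is taken from the question; skipping words that have no entry in dataset B (FINVU, FETKA) is an assumption, since the question does not say how they should be handled.

```python
import math

# Dataset A: one "sentence" row; dataset B: word -> value (both from the question).
sentence = ("FITKA", "FINVA", "FINVU", "FEEVA", "FETKA", "FINVA")
values = {"FINVA": 0.454535, "FITKA": 0.124411, "FEEVA": 0.123133}

# Place each word beside its value; words missing from B get None.
pairs = [(word, values.get(word)) for word in sentence]

# Assumption: words without a value are skipped when multiplying.
product = math.prod(v for _, v in pairs if v is not None)
```

Keeping the words as a collection of (word, value) pairs, rather than as six separate columns, is what lets a single lookup replace the column-by-column joins.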

Pig Latin using two data sources in one FILTER statement

In my Pig script, I am reading data from more than 5 data sources (Hive tables), where one is the main source data and the rest are dimension tables. I am trying to filter the main data source relation (or alias) with respect to some value in one of the dimension relations.
E.g.
-- main_data is main data source and dept_data is department data
filtered_data1 = FILTER main_data BY deptID == dept_data.departmentID;
filtered_data2 = FOREACH filtered_data1 GENERATE $0, $1, $3, $7;
In my Pig script there are at least 20 places where I need to match some value between multiple data sources and produce a new relation. But I am getting this error:
ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1066: Unable to open iterator for alias filtered_data1.
Backend error : Scalar has more than one row in the output. 1st : ( ..... ) 2nd : ( .... )
Details at logfile: /root/pig_1403263965493.log
I also tried the "relation::field" approach, to no avail. As an alternative, I am joining these two relations (data sources) to get the filtered data, but I feel this will slow down execution and dump an unnecessarily large amount of data.
Please guide me on how to use two or more data sources in one FILTER statement, something like the following in SQL, so that I can avoid JOIN statements and do it all from the FILTER statement itself.
Where A.deptID = B.departmentID And A.sectionID = C.sectionID And A.cityID = D.cityID
If you want to match records from different tables by a single ID, you would pretty much have to use a join, as such:
Where A::deptID = B::departmentID And A::sectionID = C::sectionID And A::cityID = D::cityID
If you just want to keep the records that occur in all other tables, you could probably go for an INTERSECT and then a
FILTER BY someID IN someIDList
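As a rough illustration of what the question is after, here is a plain-Python model (not Pig) of filtering one dataset by the IDs present in another. The sample rows are hypothetical. The error in the question arises because `dept_data.departmentID` inside a FILTER is treated as a scalar, which only works when `dept_data` has exactly one row; matching against many rows is a membership test, which is what a join expresses.

```python
# Hypothetical sample rows; only the field names come from the question.
main_data = [
    {"name": "alice", "deptID": 1},
    {"name": "bob", "deptID": 2},
    {"name": "carol", "deptID": 9},
]
dept_data = [{"departmentID": 1}, {"departmentID": 2}, {"departmentID": 3}]

# Collect the dimension IDs once, then keep the main rows whose ID is among them.
dept_ids = {row["departmentID"] for row in dept_data}
filtered_data1 = [row for row in main_data if row["deptID"] in dept_ids]
```

This is the same work an inner join on the ID performs, which is why the join cannot simply be replaced by a FILTER that names a second relation.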

Hadoop Buffering vs Streaming

Could someone please explain to me what is the difference between Hadoop Streaming vs Buffering?
Here is the context I have read in Hive :
In every map/reduce stage of the join, the last table in the sequence is streamed through the reducers whereas the others are buffered. Therefore, it helps to reduce the memory needed in the reducer for buffering the rows for a particular value of the join key by organizing the tables such that the largest tables appear last in the sequence. e.g. in:
SELECT a.val, b.val, c.val FROM a JOIN b ON (a.key = b.key1) JOIN c ON (c.key = b.key1)
In a reduce-side join, the values from multiple tables are usually tagged so that the reducer can tell which table each value came from.
Consider a case of two tables:
On each reduce call, the mixed values associated with both tables are iterated.
During iteration, the values for one tag/table are stored locally in an ArrayList. (This is buffering.)
As the rest of the values are streamed through and values for the other tag/table are detected, the values of the first tag are fetched from the saved ArrayList. The two tagged values are joined and written to the output collector.
Contrast this with buffering the larger table instead: the ArrayList could then outgrow the memory of the container's JVM and cause an OutOfMemoryError (OOM).
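void reduce(TextPair key, Iterator<TextPair> values, OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
    // Buffer for table1 (the table that is held in memory)
    ArrayList<Text> table1Values = new ArrayList<Text>();
    // Tag of table1: the tag carried by the first key of the group.
    // Defensive copy, in case the framework reuses the key object.
    Text table1Tag = new Text(key.getSecond());
    TextPair value = null;
    while (values.hasNext()) {
        value = values.next();
        if (value.getSecond().equals(table1Tag)) {
            // Copy the Text: Hadoop reuses the value object between calls to next()
            table1Values.add(new Text(value.getFirst()));
        } else {
            // Rows of the other table are streamed: joined with the buffer and emitted at once
            for (Text val : table1Values) {
                output.collect(key.getFirst(), new Text(val.toString() + "\t" + value.getFirst().toString()));
            }
        }
    }
}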
You can use the below hint to specify which of the joined tables would be streamed on reduce side:
SELECT /*+ STREAMTABLE(a) */ a.val, b.val, c.val FROM a JOIN b ON (a.key = b.key1) JOIN c ON (c.key = b.key1)
Hadoop Streaming in general refers to using custom-made Python or shell scripts to perform your map-reduce logic (for example, via the Hive TRANSFORM keyword).
Hadoop buffering, in this context, refers to the phase of a map-reduce job for a Hive join query in which records are read into the reducers after having been sorted and grouped coming out of the mappers. The author is explaining why you should order the join clauses in a Hive query so that the largest tables come last: it helps optimize Hive's join implementation.
They are completely different concepts.
In response to your comments:
In Hive's join implementation, records from multiple tables must be sorted by the join key and then collated in the proper order. The reducer reads them grouped by table, and it can only start joining once it has seen rows from every table. The groups from the earlier tables need to be buffered (kept in memory) because they cannot be processed until the last table is seen. The last table can be streamed (each row processed as it is read), since the other tables' groups are already in memory and the join can start.
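The difference between the buffered and the streamed table can be sketched in a few lines of plain Python. This is only a model of the idea, not Hive's actual implementation: the function name, the tags and the sample rows are all hypothetical, and it assumes the rows of the buffered table arrive first within a key group.

```python
from typing import Iterable, Iterator, List, Tuple


def reduce_side_join(tagged_rows: Iterable[Tuple[str, str]],
                     buffered_tag: str) -> Iterator[Tuple[str, str]]:
    """Join the rows of one key group; rows tagged `buffered_tag` must come first."""
    buffered: List[str] = []  # held in memory for the whole group: "buffering"
    for tag, value in tagged_rows:
        if tag == buffered_tag:
            buffered.append(value)
        else:
            # "Streaming": each row is joined as it arrives and never stored.
            for left in buffered:
                yield (left, value)


# One key group: two rows from the small table, three from the large one.
group = [("small", "s1"), ("small", "s2"),
         ("large", "L1"), ("large", "L2"), ("large", "L3")]
joined = list(reduce_side_join(group, buffered_tag="small"))
```

Only the two rows of the small table are ever held in the list, however many rows the large table contributes, which is why placing the largest table last keeps reducer memory low.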

Max/Min for whole sets of records in PIG

I have a set of records that I am loading from a file and the first thing I need to do is get the max and min of a column.
In SQL I would do this with a subquery like this:
select c.state, c.population,
(select max(c.population) from state_info c) as max_pop,
(select min(c.population) from state_info c) as min_pop
from state_info c
I assume there must be an easy way to do this in PIG as well but I'm having trouble finding it. It has a MAX and MIN function but when I tried doing the following it didn't work:
records=LOAD '/Users/Winter/School/st_incm.txt' AS (state:chararray, population:int);
with_max = FOREACH records GENERATE state, population, MAX(population);
I had better luck adding an extra column with the same value to each row, grouping on that column, and then getting the max of that new group. This seems like a convoluted way of getting what I want, so I thought I'd ask if anyone knows a simpler way.
Thanks in advance for the help.
As you said, you need to group all the data together, but no extra column is required if you use GROUP ALL.
Pig
records = LOAD 'states.txt' AS (state:chararray, population:int);
records_group = GROUP records ALL;
with_max = FOREACH records_group
GENERATE
FLATTEN(records.(state, population)), MAX(records.population);
Input
CA 10
VA 5
WI 2
Output
(CA,10,10)
(VA,5,10)
(WI,2,10)
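For readers less familiar with Pig, the following plain-Python model (not Pig) shows the same computation on the sample input above, extended with the minimum that the question also asks for: the aggregates are computed once over the whole dataset and then attached to every row. The variable names are illustrative only.

```python
# Sample input from the answer above.
records = [("CA", 10), ("VA", 5), ("WI", 2)]

# Equivalent of GROUP ... ALL followed by MAX/MIN: one pass over every row.
populations = [population for _, population in records]
max_pop, min_pop = max(populations), min(populations)

# Equivalent of FLATTEN: attach the aggregates to each original row.
with_max_min = [(state, population, max_pop, min_pop)
                for state, population in records]
```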
