I am working through an example given in Programming Pig; take a look at the analyze_stock.pig example.
I am basically confused about how relational operators work on bags. I have read that relational operators can only work on relations.
daily = load 'NYSE_daily' as (exchange:chararray, symbol:chararray,
date:chararray, open:float, high:float, low:float,
close:float, volume:int, adj_close:float);
grpd = group daily by symbol;
After running these two statements, if I run
describe grpd
the output I get is:
{group: chararray,daily: {(exchange: chararray,symbol: chararray,date: chararray,open: float,high: float,low: float,close: float,volume: int,adj_close: float)}}
This clearly shows that daily is a bag
The next statement in the script is
analyzed = foreach grpd {
sorted = order daily by date;
generate group, analyze(sorted);
};
Here order (a relational operator) is being applied to daily, which, based on the describe output above, is a bag.
I realize that my concepts are probably a little weak here; I would appreciate it if someone could help me out.
Remember that a bag is an unordered collection of tuples. Also remember that the records in a Pig relation are tuples, which means that a relation is actually just a big bag. Conceptually, then, anything you can do to a whole relation you can also do to a smaller bag inside a single record. This is done using a nested foreach.
In practice, they are not identical -- when dealing with a bag, there are no map-reduce cycles involved; it's more like using a UDF. Consequently, not every operator can be used this way. Note that the source I linked to is out of date on this point: you can now use, e.g., GROUP BY as well.
In the following code:
dsBodyStartStopSort =
order dsBodyStartStop
by terminal_id, point_date_stamp, point_time_stamp, event_code
;
dsBodyStartStopRank =
rank dsBodyStartStopSort
;
store dsBodyStartStopSort
into 'xStartStopSort.csv'
using PigStorage(';')
;
I know that if I don't do that RANK in the middle, the sort order will make it to the STORE command. And that is guaranteed by Pig.
And it appears from the testing I've done that doing RANK does not mess up the sort order--but is that guaranteed? I don't want to just be running on luck.
More generally, what is Pig's rule for preserving sort once it's done? Is it until some reduce event occurs? Will it work across FILTER? Certainly not GROUP? Just wondering if there is a well defined set of rules on when and how Pig guarantees or does not guarantee order.
To summarize: 1) Is order preserved across RANK? 2) How is order preserved generally?
The best piece of documentation I found on the topic:
Bags are disordered unless you explicitly apply a nested ORDER BY
operation as demonstrated below. A nested FOREACH will preserve
ordering, letting you order by one combination of fields then project
out just the values you'd like to concatenate.
From looking at unofficial examples and comments, I conclude the following:
If you do an order right before a rank, it should preserve the order. Personally I prefer to just use RANK xxx BY a,b,c; and only use ORDER afterwards if it is really needed.
If you do an order right before a LIMIT, it should feed LIMIT with the top lines. However, the output would be sorted rather than in the original order.
I obtained data with tuples in Pig:
0,(0),(zero)
1,(1,2),(first,second)
Can I turn it into this?
0,0,zero
1,1,first
1,2,second
To start off, I'm going to correct your terminology: you should be treating (0) and (1,2) as bags, not tuples. Tuples are intended to be fixed-length data structures that represent some sort of entity. Say (name, address, year of birth), for example. If you have a list of similar objects, like {(apple), (orange), (banana)}, you want a bag.
There is no built-in behavior that allows you to "zip" up multiple bags/lists. The reason for this is that, from a design perspective, Pig treats bags as unordered lists, hence the term "bag", not "list". This assumption really helps out with parallelism, since you don't have to be concerned with order. Therefore, it's really hard to match up 1 with first.
What you could try to do is write an eval function UDF that takes in two bags as parameters, zips up the two lists, and returns a single bag with the elements paired up.
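A minimal sketch of such a UDF is below, assuming the two bags happen to arrive in a matching order (for example because each was built with a nested ORDER BY); ZipBags is a hypothetical name, not something that ships with Pig.

// Hypothetical ZipBags UDF: pairs the i-th tuple of the first bag with the
// i-th tuple of the second bag. Only meaningful if both bags preserve a
// matching order, since Pig bags are unordered in general.
import java.io.IOException;
import java.util.Iterator;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.BagFactory;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

public class ZipBags extends EvalFunc<DataBag> {
    private static final BagFactory BAG_FACTORY = BagFactory.getInstance();
    private static final TupleFactory TUPLE_FACTORY = TupleFactory.getInstance();

    @Override
    public DataBag exec(Tuple input) throws IOException {
        if (input == null || input.size() != 2) {
            return null;                              // expect exactly two bags
        }
        DataBag left = (DataBag) input.get(0);        // e.g. {(1),(2)}
        DataBag right = (DataBag) input.get(1);       // e.g. {(first),(second)}

        DataBag out = BAG_FACTORY.newDefaultBag();
        Iterator<Tuple> l = left.iterator();
        Iterator<Tuple> r = right.iterator();

        // Pair tuples positionally; stop at the end of the shorter bag.
        // Assumes each inner tuple has a single field, as in the data above.
        while (l.hasNext() && r.hasNext()) {
            Tuple pair = TUPLE_FACTORY.newTuple(2);
            pair.set(0, l.next().get(0));
            pair.set(1, r.next().get(0));
            out.add(pair);
        }
        return out;
    }
}

You would register the jar, call the UDF inside a FOREACH, and FLATTEN the result to get one output row per pair, which is the shape shown in the question.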
How do I take a join of two record sets using MapReduce? Most of the solutions, including those posted on SO, suggest that I emit the records based on a common key, add them to, say, a HashMap in the reducer, and then take a cross product (e.g. Join of two datasets in Mapreduce/Hadoop).
This solution is very good and works for the majority of cases, but my issue is rather different. I am dealing with data that has billions of records, and taking a cross product of two sets is impossible because in many cases the HashMap will end up holding a few million objects, so I encounter a heap space error.
I need a much more efficient solution. The whole point of MR is to deal with very large amounts of data, and I want to know if there is any solution that can help me avoid this issue.
I don't know if this is still relevant for anyone, but I am facing a similar issue these days. My intention is to use a key-value store, most likely Cassandra, and use it for the cross product. This means:
When running on a line of type A, look for the key in Cassandra. If it exists, merge the A elements into the existing value (the B elements). If not, create the key and add the A elements as the value.
When running on a line of type B, look for the key in Cassandra. If it exists, merge the B elements into the existing value (the A elements). If not, create the key and add the B elements as the value.
This would require an additional server for Cassandra, and probably some disk space, but since I'm running in the cloud (Google's bdutil Hadoop framework), I don't think it should be much of a problem.
You should look into how Pig does skewed joins. The idea is that if your data contains too many values with the same key (even if the data as a whole is not skewed), you can create artificial keys and spread the key distribution. This makes sure that each reducer gets fewer records than it otherwise would. For example, if you were to prefix "1" to 50% of the records with key "K1" and "2" to the other 50%, you would end up with half the records on one reducer (as 1K1) and the other half on another (as 2K1).
If the distribution of the key values is not known beforehand, you could use some kind of sampling algorithm.
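As a rough sketch of that trick applied to a two-dataset join: the skewed side gets a random prefix, and the other side has to be replicated into every bucket so each salted key still meets its join partners. The bucket count, class names, and comma-separated input format are illustrative assumptions.

// Sketch of "artificial keys" (key salting) for a skewed reduce-side join.
import java.io.IOException;
import java.util.Random;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class SaltedJoinMappers {

    private static final int SALT_BUCKETS = 2;   // the "1" and "2" of the example above

    // Mapper for the large, skewed input: scatter each record into one random bucket.
    public static class SkewedSideMapper extends Mapper<LongWritable, Text, Text, Text> {
        private final Random random = new Random();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            String key = line.toString().split(",")[0];     // assume the join key is the first column
            int salt = 1 + random.nextInt(SALT_BUCKETS);     // 1 or 2
            context.write(new Text(salt + key), line);       // "K1" becomes "1K1" or "2K1"
        }
    }

    // Mapper for the other input: emit the record once per bucket so it meets
    // both "1K1" and "2K1" on the reduce side.
    public static class ReplicatedSideMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            String key = line.toString().split(",")[0];
            for (int salt = 1; salt <= SALT_BUCKETS; salt++) {
                context.write(new Text(salt + key), line);
            }
        }
    }
}

This is roughly what Pig's skewed join automates for you after sampling the key distribution.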
It is a common task to perform some evaluation on pairs of items:
Examples: de-duplication, collaborative filtering, similar items, etc.
This is basically a self-join or a cross product over the same source of data.
To do a self join, you can follow the "reduce-side join" pattern. The mapper emits the join/foreign key as key, and the record as the value.
So, let's say we wanted to do a self-join on "city" (the middle column) on the following data:
don,baltimore,12
jerry,boston,19
bob,baltimore,99
cameron,baltimore,13
james,seattle,1
peter,seattle,2
The mapper would emit the key->value pairs:
(baltimore -> don,12)
(boston -> jerry,19)
(baltimore -> bob,99)
(baltimore -> cameron,13)
(seattle -> james,1)
(seattle -> peter,2)
In the reducer, we'll get this:
(baltimore -> [(don,12), (bob,99), (cameron,13)])
(boston -> [(jerry,19)])
(seattle -> [(james,1), (peter,2)])
From here, you can do the inner join logic, if you so choose. To do this, you'd match up every item with every other item: load the data into an array list, then do an N x N loop over the items to compare each one to the others.
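Here is a minimal sketch of that mapper and reducer for the city data above; the class names and the comma-separated record layout are just illustrative choices.

// Sketch of a reduce-side self-join on the middle column, for lines like "don,baltimore,12".
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class CitySelfJoin {

    public static class CityMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            String[] fields = line.toString().split(",");              // name,city,score
            // Emit city -> "name,score", e.g. baltimore -> don,12
            context.write(new Text(fields[1]), new Text(fields[0] + "," + fields[2]));
        }
    }

    public static class CityReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text city, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            // Buffer every record for this city, then pair each with every other one
            // (the N x N loop). This is the in-memory buffering the caveat below warns about.
            List<String> records = new ArrayList<String>();
            for (Text value : values) {
                records.add(value.toString());
            }
            for (int i = 0; i < records.size(); i++) {
                for (int j = 0; j < records.size(); j++) {
                    if (i != j) {
                        context.write(city, new Text(records.get(i) + " | " + records.get(j)));
                    }
                }
            }
        }
    }
}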
Realize that reduce-side joins are expensive. They send pretty much all of the data to the reducers if you don't filter anything out. Also, be careful of loading the data up into memory in the reducers-- you may blow your heap on a hot join key by loading all of the data in an array list.
The above is a bit different than the typical reduce-side join. The idea is the same when joining two data sets: the foreign key is the key, and the record is the value. The only difference is that the values could be coming from two or more data sets. You can use MultipleInputs to have different mappers parse different input sets, then have the reducer collect data from both.
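For the two-dataset case, a sketch of that wiring might look like the following; the class names, the first-column join key, and the "A|"/"B|" tags are my own illustrative choices.

// Sketch: reduce-side join of two datasets with MultipleInputs. Each mapper tags its
// records so the reducer can tell which dataset a value came from, then pairs A with B.
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class TwoDatasetJoin {

    public static class DatasetAMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            String key = line.toString().split(",")[0];     // assume the join key is the first column
            context.write(new Text(key), new Text("A|" + line));
        }
    }

    public static class DatasetBMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            String key = line.toString().split(",")[0];
            context.write(new Text(key), new Text("B|" + line));
        }
    }

    public static class JoinReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            List<String> aSide = new ArrayList<String>();
            List<String> bSide = new ArrayList<String>();
            for (Text value : values) {
                String record = value.toString();
                if (record.startsWith("A|")) {
                    aSide.add(record.substring(2));
                } else {
                    bSide.add(record.substring(2));
                }
            }
            // Inner join: every A record paired with every B record for this key.
            for (String a : aSide) {
                for (String b : bSide) {
                    context.write(key, new Text(a + " | " + b));
                }
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "reduce-side join");
        job.setJarByClass(TwoDatasetJoin.class);
        MultipleInputs.addInputPath(job, new Path(args[0]), TextInputFormat.class, DatasetAMapper.class);
        MultipleInputs.addInputPath(job, new Path(args[1]), TextInputFormat.class, DatasetBMapper.class);
        job.setReducerClass(JoinReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileOutputFormat.setOutputPath(job, new Path(args[2]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}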
Cross product in the case where you don't have any constraints is a nightmare. I.e.,
select * from tablea, tableb;
There are a number of ways to do this. None of them are particularly efficient. If you want this type of behavior, leave me a comment and I'll spend more time explaining a way to do this.
If you can figure out some sort of join key which is a fundamental key to similarity, you are much better off.
Plug for my book: MapReduce Design Patterns. It should be published in a few months, but if you are really interested I can email you the chapter on joins.
One typically uses the reducer to perform whatever logic is required on the join. The trick is to map the dataset twice, possibly adding some marker to the value indicating which run it is. Then a self join is no different from any other kind of join.
I have a question about configuring a map-side inner join for multiple mappers in Hadoop.
Suppose I have two very large data sets A and B, and I use the same partition and sort algorithm to split them into smaller parts. For A, assume I have a(1) to a(10), and for B I have b(1) to b(10). It is assured that a(1) and b(1) contain the same keys, a(2) and b(2) have the same keys, and so on. I would like to set up 10 mappers, specifically mapper(1) to mapper(10). To my understanding, a map-side join is a pre-processing task prior to the mapper; therefore, I would like to join a(1) and b(1) for mapper(1), a(2) and b(2) for mapper(2), and so on.
After reading some reference materials, it is still not clear to me how to configure these ten mappers. I understand that using CompositeInputFormat I would be able to join two files, but it seems that this configures only one mapper and joins the 20 files pair after pair (in 10 sequential tasks). How do I configure all ten mappers and join the ten pairs at the same time, in genuine Map/Reduce fashion (10 tasks in parallel)? To my understanding, ten mappers would require ten CompositeInputFormat settings because the files to join are all different. I strongly believe this is practical and doable, but I couldn't figure out what exact commands I should use.
Any hint and suggestion is highly welcome and appreciated.
Shi
Thanks a lot for the replies, David and Thomas!
I appreciate your emphasis on the prerequisites for a map-side join. Yes, I am aware of the sort requirement, the API, etc. After reading your comments, I think my actual problem is finding the correct expression for joining multiple splits of two files with CompositeInputFormat. For example, I have dataA and dataB sorted and reduced into 2 files each:
/A/dataA-r-00000
/A/dataA-r-00001
/B/dataB-r-00000
/B/dataB-r-00001
The expression command I am using now is:
inner(tbl(org.apache.hadoop.mapred.KeyValueTextInputFormat,"/A/dataA-r-00000"),tbl(org.apache.hadoop.mapred.KeyValueTextInputFormat,"/B/dataB-r-00000"))
It works, but as you mentioned, it only starts two mappers (because the inner join prevents splitting) and could be very inefficient if the files are big. If I want to use more mappers (say another 2 mappers to join dataA-r-00001 and dataB-r-00001), how should I construct the expression? Is it something like:
String joinexpression = "inner(tbl(org.apache.hadoop.mapred.KeyValueTextInputFormat,'/A/dataA-r-00000'),tbl(org.apache.hadoop.mapred.KeyValueTextInputFormat,'/B/dataB-r-00000'), tbl(org.apache.hadoop.mapred.KeyValueTextInputFormat,'/A/dataA-r-00001'),tbl(org.apache.hadoop.mapred.KeyValueTextInputFormat,'/B/dataB-r-00001'))" ;
But I think that would be mistaken, because the command above actually performs an inner join of four files (which will result in nothing in my case, because file *r-00000 and *r-00001 have non-overlapping keys).
Or I could just use the two dirs as inputs, like:
String joinexpression = "inner(tbl(org.apache.hadoop.mapred.KeyValueTextInputFormat,'/A/'),tbl(org.apache.hadoop.mapred.KeyValueTextInputFormat,'/B/'))" ;
Will the inner join match up the pairs automatically according to the file endings, say "00000" to "00000" and "00001" to "00001"? I am stuck at this point because I need to construct the expression and pass it to
conf.set("mapred.join.expr", joinexpression);
So, in short, how should I build the proper expression if I want to use more mappers to join multiple pairs of files simultaneously?
There are map-side and reduce-side joins.
You proposed to use a map-side join, which is executed inside a mapper and not before it.
Both sides must have the same key and value types. So you can't join a LongWritable and a Text, although they might have the same value.
There are a few more subtle things to note:
The input files have to be sorted, so they will most likely be reducer output.
You can control the number of mappers in the join's map phase by setting the number of reducers in the jobs that sorted the datasets.
The whole procedure basically works like this: you have dataset A and dataset B, and both share the same key, let's say a LongWritable.
Run two jobs that sort the two datasets by their keys; both jobs HAVE TO set the number of reducers to an equal number, say 2.
This will result in 2 sorted files for each dataset.
Now you set up the job that joins the datasets; this job will spawn 2 mappers (it could be more if you set the number of reducers higher in the previous jobs).
Do whatever you like in the reduce step.
If the number of files to be joined is not equal, it will result in an exception during job setup.
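As a rough sketch of what that join job's configuration might look like with the old mapred API (the paths /A and /B stand for the sorted reducer outputs; the mapper body and the tab-separated output are my own illustrative choices):

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.KeyValueTextInputFormat;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextOutputFormat;
import org.apache.hadoop.mapred.join.CompositeInputFormat;
import org.apache.hadoop.mapred.join.TupleWritable;

public class MapSideJoinDriver {

    // The composite record: value.get(0) is the record from /A, value.get(1) from /B.
    public static class CompositeJoinMapper extends MapReduceBase
            implements Mapper<Text, TupleWritable, Text, Text> {
        public void map(Text key, TupleWritable value,
                        OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
            output.collect(key, new Text(value.get(0) + "\t" + value.get(1)));
        }
    }

    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(MapSideJoinDriver.class);
        conf.setJobName("map-side inner join");

        // Compose the join expression over the two directories; the framework pairs the
        // corresponding partitions (part-00000 with part-00000, part-00001 with
        // part-00001, ...), so one mapper is spawned per pair of files.
        conf.setInputFormat(CompositeInputFormat.class);
        conf.set("mapred.join.expr", CompositeInputFormat.compose(
                "inner", KeyValueTextInputFormat.class, new Path("/A"), new Path("/B")));

        conf.setMapperClass(CompositeJoinMapper.class);
        conf.setNumReduceTasks(0);                    // map-only here; add a reducer if you need one
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(Text.class);
        conf.setOutputFormat(TextOutputFormat.class);
        FileOutputFormat.setOutputPath(conf, new Path("/joined"));

        JobClient.runJob(conf);
    }
}

Composing the expression over the two directories should let the framework pair up the corresponding partitions for you, rather than you listing every file pair by hand, which also addresses the question above about joining dataA-r-00001 with dataB-r-00001.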
Setting up a join is kind of painful, mainly because you have to use the old API for mapper and reducer if your version is less than 0.21.x.
This document describes very well how it works; scroll all the way to the bottom. Sadly, this documentation is missing from the latest Hadoop docs.
Another good reference is "Hadoop the Definitive Guide", which explains all of this in more detail and with examples.
I think you're missing the point. You don't control the number of mappers. It's the number of reducers that you have control over. Simply emit the correct keys from your mapper. Then run 10 reducers.