Join of two datasets in MapReduce/Hadoop

Does anyone know how to implement the Natural-Join operation between two datasets in Hadoop?
More specifically, here's exactly what I need to do:
I have two sets of data:
Point information, stored as (tile_number, point_id:point_info). This is a 1:n key-value relationship, meaning that for every tile_number there may be several point_id:point_info pairs.
Line information, stored as (tile_number, line_id:line_info). This is again a 1:m key-value relationship, and for every tile_number there may be more than one line_id:line_info pair.
As you can see, the tile_numbers are the same between the two datasets. What I really need is to join these two datasets on each tile_number. In other words, for every tile_number we have n point_id:point_info pairs and m line_id:line_info pairs. What I want to do is join all point_id:point_info pairs with all line_id:line_info pairs for every tile_number.
In order to clarify, here's an example:
For point pairs:
(tile0, point0)
(tile0, point1)
(tile1, point1)
(tile1, point2)
for line pairs:
(tile0, line0)
(tile0, line1)
(tile1, line2)
(tile1, line3)
what I want is as following:
for tile 0:
(tile0, point0:line0)
(tile0, point0:line1)
(tile0, point1:line0)
(tile0, point1:line1)
for tile 1:
(tile1, point1:line2)
(tile1, point1:line3)
(tile1, point2:line2)
(tile1, point2:line3)

Use a mapper that outputs tiles as keys and points/lines as values. You have to differentiate between the point output values and the line output values. For instance you can use a special character (even though a binary approach would be much better). A sketch of such a mapper follows the example output below.
So the map output will be something like:
tile0, _point0
tile1, _point0
tile2, _point1
...
tileX, *lineL
tileY, *lineK
...
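For concreteness, here is a minimal sketch of such a tagging mapper, assuming both datasets are tab-separated text files and that the input file name tells you which dataset a record comes from; the class name, tag characters, and file-name convention are illustrative, not part of the original answer:
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class TaggingMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // Expected record shape: "tile0<TAB>point0:point_info" or "tile0<TAB>line0:line_info"
        String[] parts = line.toString().split("\t", 2);
        if (parts.length < 2) {
            return; // skip malformed records
        }
        String tile = parts[0];
        String payload = parts[1];
        // Tag the value so the reducer can tell points from lines.
        String fileName = ((FileSplit) context.getInputSplit()).getPath().getName();
        String tag = fileName.startsWith("point") ? "_" : "*";
        context.write(new Text(tile), new Text(tag + payload));
    }
}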
Then, at the reducer, your input will have this structure:
tileX, [*lineK, ... , _pointP, ...., *lineM, ..., _pointR]
and you will have to take the values, separate the points from the lines, do a cross product, and output each pair of the cross product, like this:
tileX (lineK, pointP)
tileX (lineK, pointR)
...
If you can already easily differentiate between the point values and the line values (depending on your application specifications), you don't need the special characters (*, _).
Regarding the cross-product which you have to do in the reducer:
You first iterate through the entire values list and separate them into 2 lists:
List<String> points;
List<String> lines;
Then do the cross-product using 2 nested for loops.
Then iterate through the resulting list and for each element output:
tile(current key), element_of_the_resulting_cross_product_list
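Putting that together, here is a minimal sketch of such a reducer, assuming the "_"/"*" tagging convention from above (the class name is illustrative):
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class CrossProductReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text tile, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        List<String> points = new ArrayList<>();
        List<String> lines = new ArrayList<>();
        // One pass over the values to separate tagged points from tagged lines.
        for (Text value : values) {
            String v = value.toString();
            if (v.startsWith("_")) {
                points.add(v.substring(1));
            } else {
                lines.add(v.substring(1));
            }
        }
        // Cross product: pair every point with every line for this tile.
        for (String point : points) {
            for (String line : lines) {
                context.write(tile, new Text(point + ":" + line));
            }
        }
    }
}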

So basically you have two options here: a reduce-side join or a map-side join.
Here your group key is "tile". In a single reducer you are going to get all the output from the point pairs and the line pairs, but you will have to cache either the point pairs or the line pairs in an array. If either set (points or lines) is so large that it cannot fit in memory for a single group key (each unique tile), then this method will not work for you. Remember that you don't have to hold both sets for a single group key ("tile") in memory; one of them is sufficient.
If both sets for a single group key are large, then you will have to try a map-side join. It has some peculiar requirements, but you can fulfill them by pre-processing your data with a few map/reduce jobs that run an equal number of reducers for both datasets.

Related

Can Map count occurrences more than once?

I read in a tutorial that Map counts every word in a dictionary like this:
('house', 1).
Then in a huge text it may find the word 'house' many times. Hence, the Reduce function will take as many (house, 1) pairs as the Map function produced, and it will iterate over them, giving ('house', 100) if it found the word 100 times in a document.
Is this how it works? Why doesn't the Map function store ('house', 2) the second time it finds the word 'house'?
The Mapper is called on every item in your input and then it emits a series of intermediate key/value pairs.
Those key/value pairs look like this: (feature, partial aggregate value), or (house, 1) in your example. Afterwards, all the emitted values for a given key are grouped together, like this: (feature, (value1, value2, ...)), or (house, (1, 1, 1, 1, 1)).
In the end, the Reducer computes the final aggregate result from all the intermediate values for that feature. So (feature, (value1, value2, ...)) becomes (feature, totalValue), or (house, (1, 1, 1, 1, 1)) becomes (house, 5).
The Mapper does not count how many times a feature (or word, in your example) occurs; it simply emits it as (feature, value). It is the job of the Reducer to compute the final aggregate for the feature. Otherwise, what would be the purpose of the Reducer?
I need to specify that I am currently learning about Hadoop and the MapReduce programming model. Thus, if I am wrong, correct me.
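For reference, here is a minimal word-count sketch of this split of responsibilities (the standard Hadoop pattern, not code from the question):
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {
    // Mapper: only emits (word, 1); it never counts repeated words itself.
    public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);   // e.g. (house, 1)
                }
            }
        }
    }

    // Reducer: receives (word, (1, 1, ...)) and sums the ones.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));   // e.g. (house, 5)
        }
    }
}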

Hadoop Mapreduce count distinct vector elements for big data

I have data consisting of n-length vectors of integer/real numbers. The data is typically at the GB level and the feature size of a vector is more than 100. I want to count the distinct elements of every vector feature. For example, if I have data like:
1.13211 22.33 1.00 ... 311.66
1.13211 44.44 4.52 ... 311.66
1.55555 22.33 5.11 ... 311.66
I want a result like (2, 2, 3, ..., 1), just one vector: there are 2 distinct values in the first feature of the vectors, 2 distinct values in the second feature, and so on.
The way I am thinking of doing it with MapReduce is to send values from the mapper as ("$val$+{feature_vector_num}", 1), for example (1.13211+1, 1) or (22.33+2, 1), and in the reducer just sum them up, with probably a second mapper and reducer to wrap up all the reducer results from the previous step.
The problem is that, if I have data of size N, with my solution the size sent to the reducer will be |V| * N in the worst case (|V| is the length of the feature vector), and this is also the number of reducers and the number of keys at the same time. Therefore for big data this is quite a bad solution.
Do you have any suggestions?
Thanks
Without considering any implementation detail (MapReduce or not), I would do it in 2 steps with a hashtable per feature (probably in Redis).
The first step would list all values and corresponding counts.
The second would then run through each vector and check whether each element is unique or not in the hashtable. If you have some margin for error and want a light memory footprint, I would even go with a Bloom filter.
The two steps are trivially parallelized.
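As a rough illustration of that per-feature hashtable idea, here is a plain in-memory sketch (no Redis, no Bloom filter, no MapReduce), assuming the data fits in memory and is whitespace-separated; the class name is illustrative:
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;

public class DistinctPerFeature {
    public static void main(String[] args) throws IOException {
        // One set per feature position; each set collects the distinct values seen.
        List<HashSet<String>> distinct = new ArrayList<>();
        for (String line : Files.readAllLines(Paths.get(args[0]))) {
            String[] features = line.trim().split("\\s+");
            while (distinct.size() < features.length) {
                distinct.add(new HashSet<>());
            }
            for (int i = 0; i < features.length; i++) {
                distinct.get(i).add(features[i]);
            }
        }
        // Prints one distinct count per feature, comma-separated.
        StringBuilder out = new StringBuilder();
        for (HashSet<String> values : distinct) {
            if (out.length() > 0) {
                out.append(",");
            }
            out.append(values.size());
        }
        System.out.println(out);
    }
}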
I would agree with lejlot that 1 GB would be much more optimally solved using other means (e.g. in-memory algorithms such as a hash map) and not with m/r.
However, in case your problem is 2-3+ orders of magnitude larger, or if you just want to practice with m/r, here is one of the possible solutions:
Phase 1
Mapper
Params:
Input key: irrelevant (for TextInputFormat I think it is LongWritable, which represents a position in the file, but you can just use Writable)
Input value: a single line with vector components separated by space (1.13211 22.33 1.00 ... 311.66)
Output key: a pair <IntWritable, DoubleWritable>
where IntWritable holds an index of the component, and DoubleWritable holds a value of the component.
Google for hadoop examples, specifically, SecondarySort.java which demonstrates how to implement a pair of IntWritable. You just need to rewrite this using DoubleWritable as a second component.
Output value: irrelevant, you can use NullWritable
Mapper Function
Tokenize the value
For each token, emit <IntWritable, DoubleWritable> key (you can create a custom writable pair class for that) and NullWritable value
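A minimal sketch of that composite key and the Phase 1 mapper might look like the following; IntDoublePair and ComponentMapper are hypothetical names, and in a real project each class would live in its own file:
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapreduce.Mapper;

// Composite key: (component index, component value).
class IntDoublePair implements WritableComparable<IntDoublePair> {
    private int index;
    private double value;

    public IntDoublePair() { }
    public IntDoublePair(int index, double value) { this.index = index; this.value = value; }

    public int getFirst() { return index; }
    public double getSecond() { return value; }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeInt(index);
        out.writeDouble(value);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        index = in.readInt();
        value = in.readDouble();
    }

    // Sort by index first, then by value.
    @Override
    public int compareTo(IntDoublePair other) {
        int cmp = Integer.compare(index, other.index);
        return cmp != 0 ? cmp : Double.compare(value, other.value);
    }

    @Override
    public int hashCode() { return 31 * index + Double.hashCode(value); }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof IntDoublePair)) {
            return false;
        }
        IntDoublePair p = (IntDoublePair) o;
        return index == p.index && value == p.value;
    }
}

// Phase 1 mapper: emit one (index, value) composite key per vector component.
class ComponentMapper extends Mapper<LongWritable, Text, IntDoublePair, NullWritable> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String[] tokens = line.toString().trim().split("\\s+");
        for (int i = 0; i < tokens.length; i++) {
            context.write(new IntDoublePair(i, Double.parseDouble(tokens[i])),
                    NullWritable.get());
        }
    }
}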
Reducer
The framework will call your reducer with an <IntWritable, DoubleWritable> pair as the key, only one time for each key variation, effectively performing the dedupe. For example, the key <1, 1.13211> will come only once.
Params
Input Key: Pair <IntWritable, DoubleWritable>
Input Value: Irrelevant (Writable or NullWritable)
Output Key: IntWritable (component index)
Output Value: IntWritable (count corresponding to the index)
Reducer Setup
initialize int[] counters array of size equal to your vector dimension.
Reducer Function
get an index from key.getFirst()
increment count for the index: counters[index]++
Reducer Cleanup
for each count in the counters array, emit the index in the array as the key and the value of the counter as the value.
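Continuing the sketch above, the Phase 1 reducer could look like this, assuming the vector dimension is passed through the job configuration under a hypothetical "vector.dimension" property:
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Reducer;

// Phase 1 reducer: each distinct (index, value) key arrives exactly once,
// so counting keys per index yields the distinct count per component.
public class DistinctCountReducer
        extends Reducer<IntDoublePair, NullWritable, IntWritable, IntWritable> {
    private int[] counters;

    @Override
    protected void setup(Context context) {
        int dimension = context.getConfiguration().getInt("vector.dimension", 0);
        counters = new int[dimension];
    }

    @Override
    protected void reduce(IntDoublePair key, Iterable<NullWritable> values, Context context) {
        counters[key.getFirst()]++;   // one call per distinct (index, value) pair
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        for (int i = 0; i < counters.length; i++) {
            context.write(new IntWritable(i), new IntWritable(counters[i]));
        }
    }
}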
Phase 2
This one is trivial and only needed if you have multiple reducers in the first phase. In this case the counts calculated above are partial.
You need to combine the outputs of your multiple reducers into a single output.
You need to set up a single-reducer job, where your reducer will just accumulate counts for corresponding indices.
Mapper
NO-OP
Reducer
Params
Input key: IntWritable (position)
Input value: IntWritable (partial count)
Output key: IntWritable (position)
Output value: IntWritable (total count)
Reducer Function
for each input key
int counter = 0
iterate over the values
counter += value
emit input key (as a key) and counter (as a value)
The resulting output file "part-r-00000" should have N records, where each record is a pair of values (position and distinct count) sorted by position.
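A sketch of that Phase 2 reducer (the mapper can be the identity Mapper class); the class name is illustrative:
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Reducer;

// Phase 2 reducer: sums the partial counts emitted by the Phase 1 reducers.
public class SumCountsReducer
        extends Reducer<IntWritable, IntWritable, IntWritable, IntWritable> {
    @Override
    protected void reduce(IntWritable index, Iterable<IntWritable> partialCounts, Context context)
            throws IOException, InterruptedException {
        int total = 0;
        for (IntWritable count : partialCounts) {
            total += count.get();
        }
        context.write(index, new IntWritable(total));
    }
}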

Reduce properties which I'm not sure about

I'm a beginner in writing map-reduces and I'm not sure about some reduce function properties.
So, reduce gets (key, list of values) as an input parameter...
is it guaranteed that the list of input values always contains at least 2 members? In other words, would a unique key emitted by the mapper never be passed to the reducer?
or, if there is just one item in the input list, is it guaranteed that the key is unique?
can reduce emit more values than the input values list size?
I have a large list of strings. I need to find all of them which are not unique. Can I do it with just one map/reduce? The only way I see is to count all the unique strings with one map/reduce and then select those which are not unique with another map/reduce.
Thanks
The list of input values to the reduce() method may have one or more, but not zero members.
All of the values mapped from/to a unique key value are passed as a list to the reduce along with the key value. If that list contains one member then you can assume that that key value was mapped to only one value (or once, if you're counting)
Your reducer can write any number, including zero, of key value pairs for a given input key and list of values. The types of the input key/values may be different from the types of the output key/value pairs.
You can solve your problem with one map/reduce step
So, the problem with the strings, in pseudocode:
map(string s) {
    emit(s, 0);
}
reduce(string key, list values) {
    if (values.size() > 1) { emit(key, 1); return; }
    if (values.contains(1)) { emit(key, 1); return; }
}
right?
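For reference, a minimal Hadoop sketch of that single-pass idea (hypothetical class names, assuming one string per input line; the mapper emits each string once and the reducer emits only keys seen more than once):
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class NonUniqueStrings {
    public static class EmitMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            context.write(value, ONE);   // one record per input string
        }
    }

    public static class NonUniqueReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int count = 0;
            for (IntWritable v : values) {
                count += v.get();
                if (count > 1) {   // not unique: emit once and stop
                    context.write(key, new IntWritable(1));
                    return;
                }
            }
        }
    }
}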

Hadoop : Number of input records for reducer

Is there any way by which each reducer process could determine the number of elements or records it has to process?
Short answer - ahead of time no, the reducer has no knowledge of how many values are backed by the iterable. The only way you can do this is to count as you iterate, but you can't then re-iterate over the iterable again.
Long answer - backing the iterable is actually a sorted byte array of the serialized key / value pairs. The reducer has two comparators - one to sort the key/value pairs in key order, then a second to determine the boundary between keys (known as the key grouper). Typically the key grouper is the same as the key ordering comparator.
When iterating over the values for a particular key, the underlying context examines the next key in the array and compares it to the previous key using the grouping comparator. If the comparator determines they are equal, then iteration continues. Otherwise iteration for this particular key ends. So you can see that you cannot determine ahead of time how many values you will be passed for any particular key.
You can actually see this in action if you create a composite key, say a Text/IntWritable pair. In the compareTo method, sort first by the Text and then by the IntWritable field. Next create a Comparator to be used as the group comparator, which only considers the Text part of the key. Now as you iterate over the values in the reducer, you should be able to observe the IntWritable part of the key changing with each iteration.
Some code i've used before to demonstrates this scenario can be found on this pastebin
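In case that link is unavailable, here is a minimal sketch of the same idea (the class names are illustrative, not the pastebin code):
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

// Composite key: a natural Text key plus a secondary IntWritable.
class TextIntPair implements WritableComparable<TextIntPair> {
    private final Text first = new Text();
    private final IntWritable second = new IntWritable();

    public void set(String f, int s) { first.set(f); second.set(s); }
    public Text getFirst() { return first; }
    public IntWritable getSecond() { return second; }

    @Override
    public void write(DataOutput out) throws IOException {
        first.write(out);
        second.write(out);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        first.readFields(in);
        second.readFields(in);
    }

    // Full sort order: Text first, then IntWritable.
    @Override
    public int compareTo(TextIntPair o) {
        int cmp = first.compareTo(o.first);
        return cmp != 0 ? cmp : second.compareTo(o.second);
    }

    @Override
    public int hashCode() { return first.hashCode() * 163 + second.hashCode(); }

    @Override
    public boolean equals(Object o) {
        return o instanceof TextIntPair
                && first.equals(((TextIntPair) o).first)
                && second.equals(((TextIntPair) o).second);
    }
}

// Grouping comparator: considers only the Text part, so all IntWritable
// variants of the same Text are fed to a single reduce() call.
// Wire it up with job.setGroupingComparatorClass(TextGroupingComparator.class).
class TextGroupingComparator extends WritableComparator {
    protected TextGroupingComparator() {
        super(TextIntPair.class, true);
    }

    @Override
    @SuppressWarnings("rawtypes")
    public int compare(WritableComparable a, WritableComparable b) {
        return ((TextIntPair) a).getFirst().compareTo(((TextIntPair) b).getFirst());
    }
}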
Your reducer class must extend the MapReduce Reducer class:
Reducer<KEYIN,VALUEIN,KEYOUT,VALUEOUT>
and then must implement the reduce method using the KEYIN/VALUEIN arguments specified in the extended Reducer class:
reduce(KEYIN key, Iterable<VALUEIN> values,
org.apache.hadoop.mapreduce.Reducer.Context context)
The values associated with a given key can be counted via
int count = 0;
Iterator<VALUEIN> it = values.iterator();
while (it.hasNext()) {
    it.next();
    count++;
}
Though I'd propose doing this counting alongside your other processing, so as not to make two passes through your value set.
EDIT
Here's an example vector of vectors that will dynamically grow as you add to it (so you won't have to statically declare your arrays, and hence don't need the size of the values set). This works best for non-regular data (i.e. the number of columns is not the same for every row of your input CSV file), but has the most overhead.
Vector<Vector<String>> table = new Vector<>();
Iterator<Text> it = values.iterator();
while (it.hasNext()) {
    Text t = it.next();
    String[] cols = t.toString().split(",");
    Vector<String> row = new Vector<>(); // this new vector will be our row
    for (int i = 0; i < cols.length; i++) {
        if (StringUtils.isNotEmpty(cols[i])) {
            row.addElement(cols[i]); // add a new column for every value in the CSV row
        }
    }
    table.addElement(row);
}
Then you can access the Mth column of the Nth row via
table.get(N).get(M);
Now, if you knew the number of columns was fixed, you could modify this to use a Vector of arrays, which would probably be a little faster and more space efficient.

How can I use the map datatype in Apache Pig?

I'd like to use Apache Pig to build a large key -> value mapping, look things up in the map, and iterate over the keys. However, there does not even seem to be syntax for doing these things; I've checked the manual, wiki, sample code, Elephant book, Google, and even tried parsing the parser source. Every single example loads map literals from a file... and then never uses them. How can you use Pig's maps?
First, there doesn't seem to be a way to load a 2-column CSV file into a map directly. If I have a simple map.csv:
1,2
3,4
5,6
And I try to load it as a map:
m = load 'map.csv' using PigStorage(',') as (M: []);
dump m;
I get three empty tuples:
()
()
()
So I try to load tuples and then generate the map:
m = load 'map.csv' using PigStorage(',') as (key:chararray, val:chararray);
b = foreach m generate [key#val];
ERROR 1000: Error during parsing. Encountered " "[" "[ "" at line 1, column 24.
...
Many variations on the syntax also fail (e.g., generate [$0#$1]).
OK, so I munge my map into Pig's map literal format as map.pig:
[1#2]
[3#4]
[5#6]
And load it up:
m = load 'map.pig' as (M: []);
Now let's load up some keys and try lookups:
k = load 'keys.csv' as (key);
dump k;
3
5
1
c = foreach k generate m#key; /* Or m[key], or... what? */
ERROR 1000: Error during parsing. Invalid alias: m in {M: map[ ]}
Hrm, OK, maybe since there are two relations involved, we need a join:
c = join k by key, m by /* ...um, what? */ $0;
dump c;
ERROR 1068: Using Map as key not supported.
c = join k by key, m by m#key;
dump c;
Error 1000: Error during parsing. Invalid alias: m in {M: map[ ]}
Fail. How do I refer to the key (or value) of a map? The map schema syntax doesn't seem to let you even name the key and value (the mailing list says there's no way to assign types).
Finally, I'd just like to be able to find all the keys in my map:
d = foreach m generate ...oh, forget it.
Is Pig's map type half-baked? What am I missing?
Currently Pig maps need the key to be a chararray (string) literal that you supply, not a variable that contains a string. So in map#key, the key has to be a constant string that you supply (e.g. map#'keyvalue').
The typical use case for this is to load a complex data structure in which one of the elements is a key-value pair; later, in a FOREACH statement, you can refer to a particular value based on the key you are interested in.
http://pig.apache.org/docs/r0.9.1/basic.html#map-schema
In Pig version 0.10.0 there is a new function available called "TOMAP" (http://pig.apache.org/docs/r0.10.0/func.html#tomap) that converts its odd (chararray) parameters to keys and even parameters to values. Unfortunately I haven't found it to be that useful, though, since I typically deal with arbitrary dicts of varying lengths and keys.
I would find a TOMAP function that took a tuple as a single argument, instead of a variable number of parameters, to be much more useful.
This isn't a complete solution to your problem, but the availability of TOMAP gives you some more options for constructing a real solution.
Great question!
I personally do not like Maps in Pig. They have a place in traditional programming languages like Java, C#, etc., where it's really handy and fast to look up a key in a map. On the other hand, Maps in Pig have very limited features.
As you rightly pointed out, one cannot look up a variable key in a Map in Pig. The key needs to be a constant. e.g. myMap#'keyFoo' is allowed but myMap#$SOME_VARIABLE is not allowed.
If you think about it, you do not need a Map in Pig. One usually loads the data from some source, transforms it, joins it with some other dataset, filters it, transforms it, and so on. JOIN actually does a good job of looking up variable keys in the data.
e.g. data1 has 2 columns A and B and data2 has 3 columns X, Y, Z. If you join data1 BY A with data2 BY Z, JOIN does the work of a Map (from traditional language) which maps value of column Z to value of column B (via column A). So data1 essentially represents a Map A -> B.
So why do we need Map in Pig?
Usually Hadoop data is a dump of different data sources built with traditional languages. If the original data sources contain Maps, the HDFS data will contain a corresponding Map.
How can one handle the Map data?
There are really 2 use cases:
Map keys are constants.
e.g. HTTP request header data contains time, server, and clientIp as the keys in a Map. To access the value of a particular key, one can access it with a constant key,
e.g. header#'clientIp'.
Map keys are variables.
In these cases, you would most probably want to JOIN the Map keys with some other data set. I usually convert the Map to a Bag using the UDF MapToBag, which converts map data into a Bag of 2-field tuples (key, value). Once the map data is converted to a Bag of tuples, it's easy to join it with other data sets.
I hope this helps.
1) If you want to load map data it should be like "[programming#SQL,rdbms#Oracle]"
2) If you want to load tuple data it should be like "(first_name_1234,middle_initial_1234,last_name_1234)"
3) If you want to load bag data it should be like "{(project_4567_1),(project_4567_2),(project_4567_3)}"
my file pigtest.csv looks like this:
1234|emp_1234#company.com|(first_name_1234,middle_initial_1234,last_name_1234)|{(project_1234_1),(project_1234_2),(project_1234_3)}|[programming#SQL,rdbms#Oracle]
4567|emp_4567#company.com|(first_name_4567,middle_initial_4567,last_name_4567)|{(project_4567_1),(project_4567_2),(project_4567_3)}|[programming#Java,OS#Linux]
my schema:
a = LOAD 'pigtest.csv' using PigStorage('|') AS (employee_id:int, email:chararray, name:tuple(first_name:chararray, middle_name:chararray, last_name:chararray), project_list:bag{project: tuple(project_name:chararray)}, skills:map[chararray]) ;
b = FOREACH a GENERATE employee_id, email, name.first_name, project_list, skills#'programming' ;
dump b;
I think you need to think in terms of relations, where the map is just one field of one record. Then you can apply some operations on the relations, like joining the two relations, data and mapping:
Input
$ cat data.txt
1
2
3
4
5
$ cat mapping.txt
1 2
2 4
3 6
4 8
5 10
Pig
mapping = LOAD 'mapping.txt' AS (key:CHARARRAY, value:CHARARRAY);
data = LOAD 'data.txt' AS (value:CHARARRAY);
-- list keys
mapping_keys =
FOREACH mapping
GENERATE key;
DUMP mapping_keys;
-- join mapping to data
mapped_data =
JOIN mapping BY key, data BY value;
DUMP mapped_data;
Output
> # keys
(1)
(2)
(3)
(4)
(5)
> # mapped data
(1,2,1)
(2,4,2)
(3,6,3)
(4,8,4)
(5,10,5)
This answer could also help you if you just want to do a simple look up:
pass-a-relation-to-a-pig-udf-when-using-foreach-on-another-relation
You can load up any data and then convert and store it in key-value format to read back later:
data = LOAD 'somedata.csv' USING PigStorage(',');
STORE data INTO 'folder' USING PigStorage('#');
and then read it back as map data.
