Can Map count occurrences more than once? - hadoop

I read in a tutorial that Map counts every word in a dictionary like this:
('house', 1).
Then, in a huge text, it may find the word 'house' many times. The Reduce function then takes all the ('house', 1) pairs emitted by the Map function and iterates over them, producing ('house', 100) if the word occurred 100 times in the document.
Is this how it works? Why doesn't the Map function store ('house', 2) the second time it finds the word 'house'?

The Mapper is called on every item in your input and then it emits a series of intermediate key/value pairs.
Those key/value pairs look like this: (feature, partial aggregate value), or (house, 1) in your example. Afterwards, all the emitted values for a given key are grouped together like this: (feature, (value1, value2, etc.)), or (house, (1, 1, 1, 1, 1)).
In the end, the Reducer computes the final aggregate result from all the intermediate values for that feature. So (feature, (value1, value2, etc.)) becomes (feature, totalValue), or (house, (1, 1, 1, 1, 1)) becomes (house, 5).
The Mapper does not count how many times that feature (or word in your example) occurs; it simply emits one (feature, value) pair per occurrence. It is the job of the Reducer to compute the final aggregate for the feature. Otherwise, what would be the purpose of the Reducer?
I need to specify that I am currently learning about Hadoop and the MapReduce programming model. Thus, if I am wrong, correct me.
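To make the flow above concrete, here is a minimal sketch in plain Python that imitates the three stages (map, group by key, reduce) in memory. It is not Hadoop code; the function names map_words, shuffle, reduce_counts and word_count are made up for this illustration.

```python
from collections import defaultdict


def map_words(line):
    """Mapper: emit one (word, 1) pair per occurrence; it keeps no running count."""
    for word in line.split():
        yield (word, 1)


def shuffle(pairs):
    """Group all emitted values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(key, values):
    """Reducer: sum the partial values collected for one key."""
    return (key, sum(values))


def word_count(lines):
    """Run the three stages in sequence over an in-memory list of lines."""
    pairs = [pair for line in lines for pair in map_words(line)]
    grouped = shuffle(pairs)
    return dict(reduce_counts(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["house on the hill", "the house"]))
```

The mapper never sees more than one line at a time, which is why it cannot emit ('house', 2) on its own: only the grouping step brings all the 1s for 'house' together.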

Related

Number of Reducers and output order

When I call job.setNumReduceTasks(1);, I get the output sorted by key. However, the output is not sorted by key when I remove this call.
So, should we expect to get sorted output from the reducers when we have more than one reduce task?
Thanks.
Output is sorted by key within a single Reducer. However, the default Partitioner assigns keys to Reducers using a hash function, so while each output file will be sorted when using multiple Reducers, one file will not be a sorted continuation of the previous one. For example:
We have a word count job with three Reducers. The Mapper outputs:
(A,1)
(zebra,1)
(bat,1)
(zebra,1)
(frog,1)
(A,1)
The Partitioner looks like the following:
public int getPartition(K key, V value, int numReduceTasks) {
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
}
and so it could allocate the keys in the following way:
REDUCER 1    REDUCER 2    REDUCER 3
(A,1)        (frog,1)     (bat,1)
(A,1)
(zebra,1)
(zebra,1)
Notice that the keys are not split alphabetically: Reducer 1 does not hold just A-F, Reducer 2 just G-M, and Reducer 3 just N-Z. That's why the overall output won't be sorted, although the data will be sorted within each Reducer's output.
This makes sense as otherwise we could end up with a big skew. Say for example you're running a MapReduce job on some customer services data where the ID always starts with C - you wouldn't want everything to go to the same Reducer.
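As a rough illustration of the above, the following plain-Python sketch mimics the hash partitioning and shows that each reducer's output is sorted while keys are spread without regard to alphabetical order. The helper java_string_hash follows the formula of Java's String.hashCode() and is only a stand-in: Hadoop key types such as Text have their own hashCode(), so the real assignment of keys to reducers will differ. The names run_reducers and get_partition are made up for this example.

```python
from collections import defaultdict


def java_string_hash(s):
    """32-bit signed hash using the formula of Java's String.hashCode() (stand-in only)."""
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    return h - (1 << 32) if h >= (1 << 31) else h


def get_partition(key, num_reduce_tasks):
    """Same arithmetic as the getPartition() shown above."""
    return (java_string_hash(key) & 0x7FFFFFFF) % num_reduce_tasks


def run_reducers(mapper_output, num_reduce_tasks):
    """Return one list of (key, count) per reducer, each sorted by key."""
    partitions = defaultdict(lambda: defaultdict(int))
    for key, value in mapper_output:
        partitions[get_partition(key, num_reduce_tasks)][key] += value
    return [sorted(partitions[p].items()) for p in range(num_reduce_tasks)]


if __name__ == "__main__":
    mapper_output = [("A", 1), ("zebra", 1), ("bat", 1),
                     ("zebra", 1), ("frog", 1), ("A", 1)]
    for number, part in enumerate(run_reducers(mapper_output, 3), start=1):
        print("REDUCER", number, part)
```

With a single reducer the same function returns one fully sorted list, which matches what setNumReduceTasks(1) gives you.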

Hadoop Mapreduce count distinct vector elements for big data

I have data consisting of n-length vectors of integer/real numbers. The data is typically at the GB level, and a vector has more than 100 features. I want to count the distinct elements of every vector feature. For example, if I have data like:
1.13211 22.33 1.00 ... 311.66
1.13211 44.44 4.52 ... 311.66
1.55555 22.33 5.11 ... 311.66
I want a result like (2,2,3,...,1), just one vector, since there are 2 distinct values in the first feature of a vector, 2 distinct values in the second feature, and so on.
The way I think to do it with MapReduce is to emit ("$val$+{feature_vector_num}", 1) from the mapper, for example (1.13211+1, 1) or (22.33+2, 1). The reducer then just sums them up, and probably a second mapper and reducer wrap up all the reducer results from the previous step.
The problem is that, if I have data of size N, then with my solution the size of the data sent to the reducers will be |V| * N in the worst case (|V| is the length of the feature vector), and this is also the number of reducers and the number of keys at the same time. Therefore, for big data, this is quite a bad solution.
Do you have any suggestions?
Thanks
Without considering any implementation detail (MapReduce or not), I would do it in 2 steps with a hashtable per feature (probably in Redis).
The first step would list all values and corresponding counts.
The second would then run through each vector and check whether the element is unique or not in the hashtable. If you have some margin for error and want a light memory footprint, I would even go with a Bloom filter.
The two steps are trivially parallelized.
I would agree with lejlot that 1GB would be much more optimally solved by other means (e.g. in-memory algorithms such as a hash map) and not with m/r.
However, if your problem is 2-3+ orders of magnitude larger, or if you just want to practice with m/r, here is one possible solution:
Phase 1
Mapper
  Params:
    Input key: irrelevant (for TextInputFormat I think it is a LongWritable that represents a position in the file, but you can just use Writable)
    Input value: a single line with vector components separated by spaces (1.13211 22.33 1.00 ... 311.66)
    Output key: a pair <IntWritable, DoubleWritable>, where IntWritable holds the index of the component and DoubleWritable holds the value of the component.
      Google for the Hadoop examples, specifically SecondarySort.java, which demonstrates how to implement a pair of IntWritable. You just need to rewrite it using DoubleWritable as the second component.
    Output value: irrelevant, you can use NullWritable
  Mapper Function
    Tokenize the value
    For each token, emit an <IntWritable, DoubleWritable> key (you can create a custom writable pair class for that) and a NullWritable value
Reducer
  The framework will call your reducer with <IntWritable, DoubleWritable> pairs as keys, only once for each distinct key, which effectively deduplicates them. For example, the <1, 1.13211> key will come only once.
  Params
    Input Key: Pair <IntWritable, DoubleWritable>
    Input Value: Irrelevant (Writable or NullWritable)
    Output Key: IntWritable (component index)
    Output Value: IntWritable (count corresponding to the index)
  Reducer Setup
    initialize an int[] counters array of size equal to your vector dimension.
  Reducer Function
    get the index from key.getFirst()
    increment the count for the index: counters[index]++
  Reducer Cleanup
    for each entry in the counters array, emit the array index as the key and the counter as the value.
Phase 2
This one is trivial and only needed if you have multiple reducers in the first phase. In this case the counts calculated above are partial.
You need to combine the outputs of your multiple reducers into a single output.
You need to set up a single-reducer job, where your reducer will just accumulate counts for corresponding indices.
Mapper
  NO-OP
Reducer
  Params
    Input key: IntWritable (position)
    Input value: IntWritable (partial count)
    Output key: IntWritable (position)
    Output value: IntWritable (total count)
  Reducer Function
    for each input key
      int counter = 0
      iterate over the values
        counter += value
      emit input key (as a key) and counter (as a value)
The resulting output file "part-r-00000" should have one record per vector component, where each record is a pair of values (position and distinct count), sorted by position.
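The two phases can be tried out on a small scale with the following plain-Python sketch. It is an in-memory imitation, not Hadoop code: the function names, the partition() helper and the choice of three reducers are assumptions made for the illustration, and the sample rows are the ones from the question with the elided middle columns left out.

```python
from collections import defaultdict


def phase1_map(line):
    """Emit ((component_index, value), None) for every component of one vector."""
    for index, token in enumerate(line.split()):
        yield ((index, float(token)), None)


def partition(key, num_reducers):
    """Hypothetical partitioner: any deterministic function of the key would do."""
    return hash(key) % num_reducers


def phase1_reduce(distinct_keys, dimension):
    """One reducer: each distinct key arrives once, so count keys per index."""
    counters = [0] * dimension
    for index, _value in distinct_keys:
        counters[index] += 1
    # Emitted in cleanup: (index, partial count)
    return list(enumerate(counters))


def phase2_reduce(partial_counts, dimension):
    """Single reducer: add up the partial counts for each index."""
    totals = [0] * dimension
    for index, count in partial_counts:
        totals[index] += count
    return totals


def count_distinct(lines, dimension, num_reducers=3):
    # Shuffle: identical keys collapse, and each key goes to exactly one reducer.
    reducer_keys = defaultdict(set)
    for line in lines:
        for key, _ in phase1_map(line):
            reducer_keys[partition(key, num_reducers)].add(key)
    partials = []
    for r in range(num_reducers):
        partials.extend(phase1_reduce(reducer_keys[r], dimension))
    return phase2_reduce(partials, dimension)


if __name__ == "__main__":
    rows = ["1.13211 22.33 1.00 311.66",
            "1.13211 44.44 4.52 311.66",
            "1.55555 22.33 5.11 311.66"]
    print(count_distinct(rows, dimension=4))
```

Because every (index, value) key lands on exactly one reducer, the partial counts never overlap, so simply summing them in the second phase is safe.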

Reduce properties which I'm not sure about

I'm a beginner in writing map-reduces and I'm not sure about some reduce function properties.
So, reduce gets (key, list of values) as an input parameter...
Is it guaranteed that the list of input values always contains at least 2 members? So, a unique key emitted by the mapper would never be passed to the reducer?
Or, if there is just one item in the input list, is it guaranteed that the key is unique?
Can reduce emit more values than the size of the input values list?
I have a large list of strings. I need to find all of them that are not unique. Can I do it with just one map/reduce? The only way I see is to count all the unique strings with one map/reduce and then select those that are not unique with another map/reduce.
Thanks
The list of input values to the reduce() method may have one or more, but not zero members.
All of the values mapped to a given key are passed as a list to reduce() along with the key. If that list contains one member, then you can assume that the key was mapped to only one value (or occurred once, if you're counting).
Your reducer can write any number, including zero, of key value pairs for a given input key and list of values. The types of the input key/values may be different from the types of the output key/value pairs.
You can solve your problem with one map/reduce step
So, for the problem with the strings, in pseudocode:
map(string s) {
    emit(s, 0);
}
reduce(string key, list values) {
    // more than one value means the string occurred more than once
    if (values.size() > 1) { emit(key, 1); return; }
    if (values.contains(1)) { emit(key, 1); return; }
}
right?
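For what it's worth, here is a small plain-Python imitation of the single map/reduce pass discussed above. It only keeps the size check on the value list, and the names map_string, reduce_string and find_non_unique are invented for this sketch. It also shows a reducer emitting nothing at all for some keys.

```python
from collections import defaultdict


def map_string(s):
    """Mapper: emit the string itself as the key; the value is a placeholder."""
    yield (s, 0)


def reduce_string(key, values):
    """Reducer: emit the key only if it was seen more than once, otherwise nothing."""
    if len(values) > 1:
        yield key


def find_non_unique(strings):
    # Shuffle: collect all values emitted for the same key.
    grouped = defaultdict(list)
    for s in strings:
        for key, value in map_string(s):
            grouped[key].append(value)
    result = []
    for key in sorted(grouped):
        result.extend(reduce_string(key, grouped[key]))
    return result


if __name__ == "__main__":
    print(find_non_unique(["a", "b", "a", "c", "b", "a"]))
```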

Join of two datasets in Mapreduce/Hadoop

Does anyone know how to implement the Natural-Join operation between two datasets in Hadoop?
More specifically, here's what I exactly need to do:
I have two sets of data:
Point information, which is stored as (tile_number, point_id:point_info). These are 1:n key-value pairs, meaning that for every tile_number there might be several point_id:point_info values.
Line information, which is stored as (tile_number, line_id:line_info). These are again 1:m key-value pairs, and for every tile_number there might be more than one line_id:line_info value.
As you can see, the tile_numbers are the same between the two datasets. What I really need is to join these two datasets on tile_number. In other words, for every tile_number we have n point_id:point_info values and m line_id:line_info values, and I want to join every point_id:point_info with every line_id:line_info for that tile_number.
In order to clarify, here's an example:
For point pairs:
(tile0, point0)
(tile0, point1)
(tile1, point1)
(tile1, point2)
for line pairs:
(tile0, line0)
(tile0, line1)
(tile1, line2)
(tile1, line3)
what I want is as following:
for tile 0:
(tile0, point0:line0)
(tile0, point0:line1)
(tile0, point1:line0)
(tile0, point1:line1)
for tile 1:
(tile1, point1:line2)
(tile1, point1:line3)
(tile1, point2:line2)
(tile1, point2:line3)
Use a mapper that outputs tiles as keys and points/lines as values. You have to differentiate between the point output values and the line output values, for instance with a special prefix character (even though a binary approach would be much better).
So the map output will be something like:
tile0, _point0
tile1, _point0
tile2, _point1
...
tileX, *lineL
tileY, *lineK
...
Then, at the reducer, your input will have this structure:
tileX, [*lineK, ... , _pointP, ...., *lineM, ..., _pointR]
and you will have to take the values, separate the points from the lines, do a cross product, and output each pair of the cross product, like this:
tileX (lineK, pointP)
tileX (lineK, pointR)
...
If you can already easily differentiate between the point values and the line values (depending on your application specifications) you don't need the special characters (*,_)
Regarding the cross-product which you have to do in the reducer:
You first iterate through the entire values list and separate the values into two lists:
List<String> points;
List<String> lines;
Then do the cross-product using 2 nested for loops.
Then iterate through the resulting list and for each element output:
tile(current key), element_of_the_resulting_cross_product_list
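The steps above can be sketched in plain Python as follows. This is an in-memory imitation of a reduce-side join rather than Hadoop code; the function names are made up, the "_" and "*" prefixes are the ones suggested in this answer, and the output format point:line follows the example in the question.

```python
from collections import defaultdict
from itertools import product

POINT_TAG = "_"  # prefix marking a point value
LINE_TAG = "*"   # prefix marking a line value


def map_points(records):
    """Mapper for the point dataset: tag every value as a point."""
    for tile, point in records:
        yield (tile, POINT_TAG + point)


def map_lines(records):
    """Mapper for the line dataset: tag every value as a line."""
    for tile, line in records:
        yield (tile, LINE_TAG + line)


def reduce_join(tile, values):
    """Reducer: split the tagged values into two lists, then emit their cross product."""
    points = [v[1:] for v in values if v.startswith(POINT_TAG)]
    lines = [v[1:] for v in values if v.startswith(LINE_TAG)]
    for point, line in product(points, lines):
        yield (tile, point + ":" + line)


def join(point_records, line_records):
    # Shuffle: bring all tagged values for the same tile together.
    grouped = defaultdict(list)
    for tile, value in list(map_points(point_records)) + list(map_lines(line_records)):
        grouped[tile].append(value)
    result = []
    for tile in sorted(grouped):
        result.extend(reduce_join(tile, grouped[tile]))
    return result


if __name__ == "__main__":
    points = [("tile0", "point0"), ("tile0", "point1"),
              ("tile1", "point1"), ("tile1", "point2")]
    lines = [("tile0", "line0"), ("tile0", "line1"),
             ("tile1", "line2"), ("tile1", "line3")]
    for pair in join(points, lines):
        print(pair)
```

Note that this version keeps both lists for a tile in memory, which is fine for a sketch but matters once a single tile has very many points or lines.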
So basically you have two options here: a reduce-side join or a map-side join.
Here your group key is "tile". In a single reducer you are going to get all the output from the point pairs and the line pairs, but you will have to cache either the point pairs or the line pairs in an array. If both sets (points and lines) for a single group key (each unique tile) are so large that neither fits in memory, then this method will not work for you. Remember that you don't have to hold both sets for a single group key ("tile") in memory; one is sufficient.
If both sets for a single group key are large, then you will have to try a map-side join. It has some peculiar requirements, but you can fulfill them by pre-processing your data with map/reduce jobs that run an equal number of reducers for both datasets.

How can I get an integer index for a key in hadoop?

Intuitively, hadoop is doing something like this to distribute keys to mappers (in python-esque pseudocode):
# data is a dict with many key-value pairs
keys = list(data.keys())
key_set_size = len(keys) / num_mappers
index = 0
for i in range(num_mappers):
    end_index = index + key_set_size
    send_to_mapper(keys[int(index):int(end_index)], i)
    index = end_index
# And something vaguely similar for the reducer (but not exactly).
It seems like somewhere hadoop knows the index of each key it is passing around, since it distributes them evenly among the mappers (or reducers). My question is: how can I access this index? I'm looking for a range of integers [0, n) mapping to all my n keys; this is what I mean by an "index".
I'm interested in the ability to get the index from within either the mapper or reducer.
After doing more research on this question, I don't believe it is possible to do exactly what I want. Hadoop does not seem to have such an index that is user-visible after all, although it does try to distribute work evenly among the mappers (so such an index is theoretically possible).
Actually, each individual reducer gets back a list of items that correspond to its reduce key. So do you want the offset of an item within its reduce key in your reducer, or do you want the overall offset of the item in the global list of all lines being processed? To get an index in your mapper, you can simply prepend a line number to each line of the file before the file gets to the mapper. This will tell you the "global index". However, keep in mind that with 1 000 000 items, item 662 345 could be processed before item 10 000.
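A minimal sketch of the line-number idea, in plain Python and outside of Hadoop: a pre-processing step prefixes every line with its position, and the mapper later splits that prefix off again. The tab separator and the function names prepend_line_numbers and map_with_index are assumptions made for this example.

```python
def prepend_line_numbers(lines):
    """Pre-processing step: prefix each line with its 0-based global index and a tab."""
    for index, line in enumerate(lines):
        yield f"{index}\t{line}"


def map_with_index(numbered_line):
    """Mapper: recover the global index from the prefix of one input line."""
    index, line = numbered_line.split("\t", 1)
    return (int(index), line)


if __name__ == "__main__":
    numbered = list(prepend_line_numbers(["first line", "second line", "third line"]))
    # Process the lines out of order on purpose: the index survives anyway.
    for item in reversed(numbered):
        print(map_with_index(item))
```

Since the index travels with the line, it stays correct even when the lines are processed in a different order than they appear in the file.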
If you are using the new MR API, then org.apache.hadoop.mapreduce.lib.partition.HashPartitioner is the default partitioner; otherwise org.apache.hadoop.mapred.lib.HashPartitioner is. You can call getPartition() on either HashPartitioner to get the partition number for a key (which you referred to as the index).
Note that the HashPartitioner class is only used to distribute the keys to the Reducer. When it comes to a mapper, each input split is processed by a map task and the keys are not distributed.
Here is the code from HashPartitioner for the getPartition(). You can write a simple Java program for the same.
public int getPartition(K key, V value, int numReduceTasks) {
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
}
Edit: Including another way to get the index.
The following code should also work. It is to be included in the Mapper or Reducer class (old API).
private int partition;

public void configure(JobConf job) {
    // "mapred.task.partition" is the ID of this task within the job
    partition = job.getInt("mapred.task.partition", 0);
}
