Hadoop : Number of input records for reducer - hadoop

Is there anyway by which each reducer process could determine the number of elements or records it has to process ?

Short answer - ahead of time no, the reducer has no knowledge of how many values are backed by the iterable. The only way you can do this is to count as you iterate, but you can't then re-iterate over the iterable again.
Long answer - backing the iterable is actually a sorted byte array of the serialized key / value pairs. The reducer has two comparators - one to sort the key/value pairs in key order, then a second to determine the boundary between keys (known as the key grouper). Typically the key grouper is the same as the key ordering comparator.
When iterating over the values for a particular key, the underlying context examines the next key in the array, and compares to the previous key using the grouping comparator. If the comparator determines they are equal, then iteration continues. Otherwise iteration for this particular key ends. So you can see that you cannot ahead of time determine how may values you will be passed for any particular key.
You can actually see this in action if you create a composite key, say a Text/IntWritable pair. For the compareTo method sort by first the Text, then the IntWritable field. Next create a Comparator to be used as the group comparator, which only considers the Text part of the key. Now as you iterate over the values in the reducer, you should be able to observe IntWritable part of the key changing with each iteration.
Some code i've used before to demonstrates this scenario can be found on this pastebin

Your reducer class must extend the MapReducer Reduce class:
Reducer<KEYIN,VALUEIN,KEYOUT,VALUEOUT>
and then must implement the reduce method using the KEYIN/VALUEIN arguments specified in the extended Reduce class
reduce(KEYIN key, Iterable<VALUEIN> values,
org.apache.hadoop.mapreduce.Reducer.Context context)
The values associated with a given key can be counted via
int count = 0;
Iterator<VALUEIN> it = values.iterator();
while(it.hasNext()){
it.Next();
count++;
}
Though I'd propose doing this counting along side your other processing as to not make two passes through your value set.
EDIT
Here's an example vector of vectors that will dynamically grow as you add to it (so you won't have to statically declare your arrays, and hence don't need the size of the values set). This will work best for non-regular data (IE the number of columns is not the same for every row in your input csv file), but will have the most overhead.
Vector table = new Vector();
Iterator<Text> it = values.iterator();
while(it.hasNext()){
Text t = it.Next();
String[] cols = t.toString().split(",");
int i = 0;
Vector row = new Vector(); //new vector will be our row
while(StringUtils.isNotEmpty(cols[i])){
row.addElement(cols[i++]); //here were adding a new column for every value in the csv row
}
table.addElement(row);
}
Then you can access the Mth column of the Nth row via
table.get(N).get(M);
Now, if you knew the # of columns would be set, you could modify this to use a Vector of arrays which would probably be a little faster/more space efficient.

Related

Which container to use for given situation?

I am doing a problem and i need to do this task.
I want to add pairs (p1,q1),(p2,q2)..(pn,qn) in such way that
(i) Duplicate pair added only once(like in set).
(ii) I store count how many time each pair are added to set.For ex : (7,2) pair
will present in set only once but if i add 3 times count will 3.
Which container is efficient for this problem in c++?
Little example will be great!
Please ask if you cant understand my problem and sorry for bad English.
How about a std::map<Key, Value> to map your pairs (Key) to their count and as you insert, increment a counter (Value).
using pairs_to_count = std::map<std::pair<T1, T2>, size_t>;
std::pair<T1, T2> p1 = // some value;
std::pair<T1, T2> p2 = // some other value;
pairs_to_count[p1]++;
pairs_to_count[p1]++;
pairs_to_count[p2]++;
pairs_to_count[p2]++;
pairs_to_count[p2]++;
In this code, the operator[] will automatically add a key in the map if it does not exist yet. At that moment, it will initialize the key's corresponding value to zero. But as you insert, even the first time, that value is incremented.
Already after the first insertion, the count of 1 correctly reflects the number of insertion. That value gets incremented as you insert more.
Later, retrieving the count is a matter of calling operator[] again to get value associated with a given key.
size_t const p2_count = pairs_to_count[p2]; // equals 3

Hadoop Mapreduce count distinct vector elements for big data

I have data consisting of n-length vector of integer/real numbers. Data is typically in GB level and feature size of a vector is more than 100. I want to count distinct elements of every vector feature. For example if I have data like:
1.13211 22.33 1.00 ... 311.66
1.13211 44.44 4.52 ... 311.66
1.55555 22.33 5.11 ... 311.66
I want the result like (2,2,3,...,1) just one vector. Since there is 2 distinct value in first feature of a vector, 2 distinct value in second feature and etc.
The way I think to do it with mapreduce is , to send the values from mapper ("$val$+{feature_vector_num}",1). For example like (1.13211+1,1) or (2.33+2,1). And in reducer just sum them up and probably the second mapper and reducer to wrap up the all reducer results from previous step.
The problem is that, if I have data of size N, with my solution, its size sent to reducer will be
|V| * N in worst case,(|V| is the length of feature vector) and this is also the number of reducers and number of keys at the same time. Therefore for big data, this is quite a bad solution.
Do you have any suggessions?
Thanks
Without considering any implementation detail (MapReduce or not), I would do it in 2 steps with a hashtable per feature (probably in Redis).
The first step would list all values and corresponding counts.
The second would then run through each vector and see if the element is unique or not in the hastable. If you have some margin for error, and want a light memory footprint, I would even go with a bloom filter.
The two steps are trivially parallelized.
I would agree with lejlot is that 1GB would be much more optimally solvable using other means (e.g. memory algorithms such as hash map) and not with m/r.
However in case if your problem is 2-3+ orders of magnitude larger, or if you just want to practice with m/r, here is one of the possible solutions:
Phase 1
Mapper
Params:
Input key: irrelevant (for TextInputFormat I think it is LongWritable that
represents a position in a file but you can just use Writable)
Input value: a single line with vector components separated by space (1.13211 22.33 1.00 ... 311.66)
Output key: a pair <IntWritable, DoubleWritable>
where IntWritable holds an index of the component, and DoubleWritable holds a value of the component.
Google for hadoop examples, specifically, SecondarySort.java which demonstrates how to implement a pair of IntWritable. You just need to rewrite this using DoubleWritable as a second component.
Output value: irrelevant, you can use NullWritable
Mapper Function
Tokenize the value
For each token, emit <IntWritable, DoubleWritable> key (you can create a custom writable pair class for that) and NullWritable value
Reducer
The framework will call your reducer with <IntWritable, DoubleWritable> pair as keys, only one time for each key variation, effectively making dedupe. For example, <1, 1.13211> key will come only once.
Params
Input Key: Pair <IntWritable, DoubleWritable>
Input Value: Irrelevant (Writable or NullWritable)
Output Key: IntWritable (component index)
Output Value: IntWritable (count corresponding to the index)
Reducer Setup
initialize int[] counters array of size equal to your vector dimension.
Reducer Function
get an index from key.getFirst()
increment count for the index: counters[index]++
Reducer Cleanup
for each count in counters array emit, index of the array as a key, and value of the counter.
Phase 2
This one is trivial and only needed if you have multiple reducers in the first phase. In this case the counts calculated above are partial.
You need to combine the outputs of your multiple reducers into a single output.
You need to set up a single-reducer job, where your reducer will just accumulate counts for corresponding indices.
Mapper
NO-OP
Reducer
Params
Input key: IntWritable (position)
Input value: IntWritable (partial count)
Output key: IntWritable (position)
Output value: IntWritable (total count)
Reducer Function
for each input key
int counter = 0
iterate over the values
counter += value
emit input key (as a key) and counter (as a value)
The resulting output file "part-r-00000" should have N records, where each record is a pair of values (position and distinct count) sorted by position.

Reduce properties which I'm not sure about

I'm a beginner in writing map-reduces and I'm not sure about some reduce function properties.
So, reduce gets (key, list of values) as an input parameter...
is it guaranteed that the list of input values always contains at least 2 members? So, an unique key emitted by the mapper would never be passed to the reducer?
or, if there is just one item in the input list, is it guaranteed that the key is unique?
can reduce emit more values then the input values list size?
I have a large list of strings. I need to find all of them which are not unique. Can I make it with just one map/reduce? The only way I see is to count all the unique strings by one map/reduce and then select those which are not unique by the another map/reduce
Thanks
The list of input values to the reduce() method may have one or more, but not zero members.
All of the values mapped from/to a unique key value are passed as a list to the reduce along with the key value. If that list contains one member then you can assume that that key value was mapped to only one value (or once, if you're counting)
Your reducer can write any number, including zero, of key value pairs for a given input key and list of values. The types of the input key/values may be different from the types of the output key/value pairs.
You can solve your problem with one map/reduce step
So, the problem with the strings, pseudocode:
map(string s) {
emit(s, 0);
}
reduce(string key, list values) {
if (valies.size() > 1) { emit(key, 1); return; }
if (valuse.contains(1)) { emit(key, 1); return; }
}
right?

Count frequency of items in array - without two for loops

Need to know is there a way to count the frequency of items in a array without using two loops. This is without knowing the size of the array. If I know the size of the array I can use switch without looping. But I need more versatile than that. I think modifying the quicksort may give better results.
Array[n];
TwoDArray[n][2];
First loop will go on Array[], while second loop is to find the element and increase it count in two-d array.
max = 0;
for(int i=0;i<Array.length;i++){
found= false;
for(int j=0;j<TwoDArray[max].length;j++){
if(TwoDArray[j][0]==Array[i]){
TwoDArray[j][1]+=;
found = true;
break;
}
}
if(found==false){
TwoDArray[max+1][0]=Array[i];
TwoDArray[max+1][1]=1;
max+=;
}
If you can comment or provide better solution would be very helpful.
Use map or hash table to implement this. Insert key as the array item and value as the frequency.
Alternatively you can use array too if the range of array elements are not too large. Increase the count of value at indexes corresponding to the array element.
I would build a map keyed by the item in the array and with a value that is the count of that item. One pass over the array to build the map that contains the counts. For each item, look it's count up in the map, increment the count, and put the new count back into the map.
The map put and get operations can be constant time (e.g., if you use a hash map implementation with a good hash function and properly sized backing store). This means you can compute the frequencies in time proportional to the number of elements in your array.
I'm not saying this is better than using a map or hash table (especially not when there are lots of duplicates, though in that case you can get close to O(n) sorting with certain techniques, so this is not too bad either), it's just an alternative.
Sort the array
Use a (single) for-loop to iterate through the sorted array
If you find the same element as the previous one, increment the current count
If you find a different element, store the previous element and its count and set the count to 1
At the end of the loop, store the previous element and its count

How can I get an integer index for a key in hadoop?

Intuitively, hadoop is doing something like this to distribute keys to mappers, using python-esque pseudocode.
# data is a dict with many key-value pairs
keys = data.keys()
key_set_size = len(keys) / num_mappers
index = 0
mapper_keys = []
for i in range(num_mappers):
end_index = index + key_set_size
send_to_mapper(keys[int(index):int(end_index)], i)
index = end_index
# And something vaguely similar for the reducer (but not exactly).
It seems like somewhere hadoop knows the index of each key it is passing around, since it distributes them evenly among the mappers (or reducers). My question is: how can I access this index? I'm looking for a range of integers [0, n) mapping to all my n keys; this is what I mean by an "index".
I'm interested in the ability to get the index from within either the mapper or reducer.
After doing more research on this question, I don't believe it is possible to do exactly what I want. Hadoop does not seem to have such an index that is user-visible after all, although it does try to distribute work evenly among the mappers (so such an index is theoretically possible).
Actually, your reducer (each individual one) gets an array of items back that correspond to the reduce key. So do you want the offset of items within the reduce key in your reducer, or do you want the overall offset of the particular item in the global array of all lines being processed? To get an indeex in your mapper, you can simply prepend a line number to each line of the file before the file gets to the mapper. This will tell you the "global index". However keep in mind that with 1 000 000 items, item 662 345 could be processed before item 10 000.
If you are using the new MR API then the org.apache.hadoop.mapreduce.lib.partition.HashPartitioner is the default partitioner or else org.apache.hadoop.mapred.lib.HashPartitioner is the default partitioner. You can call the getPartition() on either of the HashPartitioner to get the partition number for the key (which you mentioned as index).
Note that the HashPartitioner class is only used to distribute the keys to the Reducer. When it comes to a mapper, each input split is processed by a map task and the keys are not distributed.
Here is the code from HashPartitioner for the getPartition(). You can write a simple Java program for the same.
public int getPartition(K key, V value, int numReduceTasks) {
return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
}
Edit: Including another way to get the index.
The following code from should also work. To be included in the map or the reduce function.
public void configure(JobConf job) {
partition = job.getInt( "mapred.task.partition", 0);
}

Resources