How to pass initial value to a reduce phase in riak? - ruby

I am trying to write a Riak MapReduce job using riak-ruby-client. JavaScript's reduce function looks like this:
arr.reduce(callback[, initialValue]);
I am doing something like this:
map_reduce = Riak::MapReduce.new(Ripple.client)
map_reduce.add(bucket) # I have passed a valid bucket
callback = "function(previous, current){ return previous + current; }"
results = map_reduce.map(map_func).reduce(callback, 1, :keep => true).run # 1 is meant to be the initial value, as in the JavaScript reduce function
But Riak does not treat 1 as the initial value here. Can someone tell me how to pass an initial value to a reduce phase?

A reduce phase function takes two arguments. The first is a list of inputs, containing the output of previous reduce phase iteration(s) as well as a batch of output from the preceding map/input phase. The second argument is a static configuration parameter that is passed in every time the reduce phase function executes. This is described in greater detail in the Riak MapReduce documentation.
As reduce phase functions need to be commutative, associative, and idempotent, it is not possible to reliably identify which invocation is the first one, and it is therefore not possible to set an initial value.
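To see why, here is a minimal Python sketch (not the Riak API and not riak-ruby-client code) of what goes wrong if a seed value is applied on every invocation of a reduce function that Riak is free to re-run on partial results:

# Hypothetical illustration: a summing reduce that applies a seed each time it runs.
def reduce_phase(inputs, seed=0):
    return [seed + sum(inputs)]

map_output = [1, 2, 3, 4, 5, 6]

# One invocation over all inputs with a seed of 1: 1 + 21 = 22
once = reduce_phase(map_output, seed=1)

# A Riak-style re-reduce over two batches applies the seed twice:
# 1 + (1 + 1 + 2 + 3) + (4 + 5 + 6) = 23, a different answer.
first_batch = reduce_phase(map_output[:3], seed=1)
twice = reduce_phase(first_batch + map_output[3:], seed=1)

print(once, twice)  # [22] [23]

Because the number of invocations is not under your control, any per-invocation seed changes the result, which is why a reduce phase only accepts a static argument rather than an initial value.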

Related

Spark RDD Lineage and Storage

inputRDD = sc.textFile("log.txt")
errorsRDD = inputRDD.filter(lambda x: "error" in x)
warningsRDD = inputRDD.filter(lambda x: "warning" in x)
badLinesRDD = errorsRDD.union(warningsRDD)
badLinesCount = badLinesRDD.count()
warningCount = warningsRDD.count()
In the code above none of the transformations are evaluated until the second-to-last line, where the objects in badLinesRDD are counted. So when badLinesRDD.count() is run it will compute the first four RDDs up to the union and return the result. But when warningsRDD.count() is run it will only compute the RDDs from the top three lines and return a result, correct?
Also, when these RDD transformations are computed once an action is called on them, where are the objects from the last RDD transformation (the union) stored? Do they get stored in memory on each of the DataNodes where the filter transformation was run in parallel?
Unless task output is persisted explicitly (with cache or persist, for example) or implicitly (shuffle write), and there is enough free space, every action will execute the complete lineage.
So when you call warningsRDD.count() it will load the file (sc.textFile("log.txt")) and then filter (inputRDD.filter(lambda x: "warning" in x)).
Also when these RDD transformations are computed when an action is called on them where are the objects from the last RDD transformation, which is union, stored?
Assuming data is not persisted, nowhere. All task outputs are discarded after data is passed to the next stage or output. If data is persisted it depends on the settings (disk, on-heap, off-heap, DFS).
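As a minimal PySpark sketch of avoiding that recomputation (same names and file as in the question; the only addition is the cache() call), you could persist the shared RDD before triggering both actions:

from pyspark import SparkContext

sc = SparkContext(appName="lineage-example")

# cache() marks inputRDD for in-memory storage the first time it is materialized
inputRDD = sc.textFile("log.txt").cache()
errorsRDD = inputRDD.filter(lambda x: "error" in x)
warningsRDD = inputRDD.filter(lambda x: "warning" in x)
badLinesRDD = errorsRDD.union(warningsRDD)

badLinesCount = badLinesRDD.count()   # reads the file once and populates the cache
warningCount = warningsRDD.count()    # re-runs only the filter, against the cached data

Without the cache() call, both count() actions would re-read log.txt and recompute their part of the lineage from scratch.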

Minimizing shuffle when mapper output is mostly sorted

I have a map-reduce process in which the mapper takes input from a file that is sorted by key. For example:
1 ...
2 ...
2 ...
3 ...
3 ...
3 ...
4 ...
Then it gets transformed and 99.9% of the keys stay in the same order in relation to one another and 99% of the remainder are close. So the following might be the output of running the map task on the above data:
a ...
c ...
c ...
d ...
e ...
d ...
e ...
Thus, if you could make sure that a reducer took in a range of inputs and put that reducer in the same node where most of the inputs were already located, the shuffle would require very little data transfer. For example, suppose that I partitioned the data so that a-d were taken care of by one reducer and e-g by the next. Then if a-d could be run on the same node that had handled the mapping of 1-4, only two records for e would need to be sent over the network.
How do I construct a system that takes advantage of this property of my data? I have both Hadoop and Spark available and do not mind writing custom partitioners and the like. However, the full workload is such a classic example of MapReduce that I'd like to stick with a framework which supports that paradigm.
Hadoop mail archives mention consideration of such an optimization. Would one need to modify the framework itself to implement it?
From the Spark perspective there is no direct support for this: the closest is mapPartitions with preservesPartitioning=true. However, that will not directly help in your case, because preservesPartitioning is only valid when the keys are not changed.
/**
 * Return a new RDD by applying a function to each partition of this RDD.
 *
 * `preservesPartitioning` indicates whether the input function preserves the partitioner, which
 * should be `false` unless this is a pair RDD and the input function doesn't modify the keys.
 */
def mapPartitions[U: ClassTag](
    f: Iterator[T] => Iterator[U], preservesPartitioning: Boolean = false): RDD[U] = {
  val func = (context: TaskContext, index: Int, iter: Iterator[T]) => f(iter)
  new MapPartitionsRDD(this, sc.clean(func), preservesPartitioning)
}
If you were able to know definitively that none of the keys would move outside of their original partitions, the above would work. But the values on the boundaries would likely not cooperate.
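For concreteness, a hedged PySpark sketch of that mapPartitions route (the file name, the four-partition layout, and transform_value() are all hypothetical; it assumes tab-separated key/value lines and that the per-record work leaves every key in its original partition):

from pyspark import SparkContext

sc = SparkContext(appName="preserve-partitioning")

def transform_value(value):
    return value.upper()   # stand-in for the real per-record work

# assumes each line is "key<TAB>value"
pairs = sc.textFile("sorted_input.txt") \
          .map(lambda line: tuple(line.split("\t", 1))) \
          .partitionBy(4)

def per_partition(records):
    for key, value in records:
        yield key, transform_value(value)

# preservesPartitioning=True tells Spark the keys were not touched, so the
# reduceByKey below can reuse the existing partitioner instead of reshuffling.
transformed = pairs.mapPartitions(per_partition, preservesPartitioning=True)
counts = transformed.reduceByKey(lambda a, b: a + b)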
What is the scale of the data compared to the migrating keys? You may consider adding a postprocessing step: first construct a partition for all migrating keys, have your mapper output a special key value for keys that need to migrate, and then postprocess the results to append them to the standard partitions. That is extra hassle, so you would need to weigh the tradeoff of an extra step and the added pipeline complexity.
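A rough Hadoop-streaming-style sketch of that tagging idea (the boundary lists, the transform() placeholder, and the assumption of tab-separated key/value input are all hypothetical):

#!/usr/bin/env python
import sys
import bisect

input_boundaries = ["5"]    # hypothetical: input keys 1-4 -> partition 0, 5+ -> partition 1
output_boundaries = ["e"]   # hypothetical: output keys a-d -> partition 0, e+ -> partition 1

def partition(key, boundaries):
    return bisect.bisect(boundaries, key)

def transform(key, value):
    return key, value       # stand-in for the real record transformation

for line in sys.stdin:
    old_key, value = line.rstrip("\n").split("\t", 1)
    new_key, new_value = transform(old_key, value)
    if partition(new_key, output_boundaries) == partition(old_key, input_boundaries):
        # the common case: the record stays on its "home" reducer
        print("%s\t%s" % (new_key, new_value))
    else:
        # the rare migrating record is routed to a dedicated key/partition;
        # a postprocessing step appends these to the proper outputs later
        print("__migrate__\t%s\t%s" % (new_key, new_value))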

Generate sequence number using Map Reduce

I have a problem where I need to generate sequence numbers starting from 1 across a whole file.
For example, let's say I have a BIG file as follows:
abc,123
abb,111
ccc,122
... N such lines
Now my output should be as follows:
1,abc,123
2,abb,111
3,ccc,122
... and so on.
The problem with doing this using MapReduce is that every split of the file is processed in parallel by a different map function, so the sequence cannot be maintained. Please don't tell me to use a single reducer; I don't want to use a single reducer because I want to do this in parallel as a typical MapReduce job. So is there a good way to do this using MapReduce?
You can do this, but it is slightly tricky. You need an identifier that differs for every reducer. Hadoop streaming exposes the task's partition number to each task through the "mapred_task_partition" environment variable (note that the job ID from "mapred_job_id", e.g. "job_201302272236_0001", is the same for all reducers of a job, so it cannot tell them apart on its own).
For example, the first reducer reads "0" from that variable, the second reducer reads "1", and so on.
Using this, you can construct a prefix for the lines output by each reducer. For example, if you know that each reducer outputs a maximum of 1000 lines, you can have the first reducer number its lines 0-999, the second 1000-1999, and so on.
Sample code for the above algorithm using Python (Hadoop streaming):
import os, sys

# maximum number of lines any single reducer is allowed to emit
NUM = 1000

# each reducer learns its partition number (0, 1, 2, ...) from the environment
reducer_id = int(os.environ['mapred_task_partition'])

count = 0
for line in sys.stdin:
    print(str(reducer_id * NUM + count) + "," + line.rstrip("\n"))
    count += 1

On what basis mapreduce framework decides whether to launch a combiner or not

As per the definition, "The Combiner may be called 0, 1, or many times on each key between the mapper and reducer."
I want to know on what basis the MapReduce framework decides how many times the combiner will be launched.
Simply put, it is the number of spills to disk. Sorting happens after the MapOutputBuffer has filled up, and the combining takes place at the same time.
You can tune the number of spills to disk with the parameters io.sort.mb, io.sort.spill.percent, io.sort.record.percent - those are also explained in the documentation (books and online resources).
Example for specific numbers of combiner runs:
0 -> no combiner was defined
1 -> a combiner was defined and the MapOutputBuffer filled up once
>1 -> a combiner was defined and the MapOutputBuffer filled up more than once
Note that even if the MapOutputBuffer never fills up completely, this buffer must be flushed at the end of the map stage and thus triggers the combiner to run at least once (if defined).
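As a hedged mrjob sketch tying this together (the buffer values are illustrative numbers, and io.sort.mb / io.sort.spill.percent are the pre-Hadoop-2 property names mentioned above; newer releases call them mapreduce.task.io.sort.mb and mapreduce.map.sort.spill.percent):

from mrjob.job import MRJob

class WordCount(MRJob):

    # spill-related knobs; a larger buffer means fewer spills and fewer combiner runs
    JOBCONF = {
        'io.sort.mb': '256',
        'io.sort.spill.percent': '0.90',
    }

    def mapper(self, _, line):
        for word in line.split():
            yield word, 1

    def combiner(self, word, counts):
        # may run 0, 1, or many times per key, so it must be safe to re-apply
        yield word, sum(counts)

    def reducer(self, word, counts):
        yield word, sum(counts)

if __name__ == '__main__':
    WordCount.run()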
First of all, Thomas Jungblut's answer is great and I gave it my upvote. The only thing I want to add is that the Combiner will always be run at least once per Mapper if defined, unless the mapper output is empty or consists of a single pair. So having the combiner not execute for a mapper is possible, but highly unlikely.
The source code that contains the logic to invoke the combiner based on this condition is at lines 1950-1955 of MapTask.java: https://github.com/apache/hadoop/blob/0b8a7c18ddbe73b356b3c9baf4460659ccaee095/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapred/MapTask.java
if (combinerRunner == null || numSpills < minSpillsForCombine) {
  Merger.writeFile(kvIter, writer, reporter, job);
} else {
  combineCollector.setWriter(writer);
  combinerRunner.combine(kvIter, combineCollector);
}
So, during the merge, the combiner runs only if:
a combiner is defined, and
the number of spills is at least minSpillsForCombine. minSpillsForCombine is driven by the property "mapreduce.map.combine.minspills", whose default value is 3.

MRJob and mapreduce task partitioning over Hadoop

I am trying to perform a mapreduce job using the Python MRJob lib and am having some issues getting it to properly distribute across my Hadoop cluster. I believe I am simply missing a basic principle of mapreduce. My cluster is a small, one master one slave test cluster. The basic idea is that I'm just requesting a series of web pages with parameters, doing some analysis on them and returning back some properties on the web page.
The input to my map function is simply a list of URLs with parameters such as the following:
http://guelph.backpage.com/automotive/?layout=bla&keyword=towing
http://guelph.backpage.com/whatever/?p=blah
http://semanticreference.com/search.html?go=Search&q=red
http://copiahcounty.wlbt.com/h/events?ename=drupaleventsxmlapi&s=rrr
http://sweetrococo.livejournal.com/34076.html?mode=ffff
Such that the key-value pairs for the initial input are just key:None, val:URL.
The following is my map function:
def mapper(self, key, url):
    '''Yield domain as the key, and (url, query parameter) tuple as the value'''
    parsed_url = urlparse(url)
    domain = parsed_url.scheme + "://" + parsed_url.netloc + "/"
    if self.myclass.check_if_param(parsed_url):
        parsed_url_query = parsed_url.query
        url_q_dic = parse_qs(parsed_url_query)
        for query_param, query_val in url_q_dic.iteritems():
            # yielding a tuple in mrjob will yield a list
            yield domain, (url, query_param)
Pretty simple: I'm just checking to make sure the URL has a parameter, then yielding the URL's domain as the key and a tuple of the URL and the query parameter as the value, which MRJob kindly transforms into a list to pass to the reducer. The reducer is the following:
def reducer(self, domain, url_query_params):
    final_list = []
    for url_query_param in url_query_params:
        url_to_list_props = url_query_param[0]
        param_to_list_props = url_query_param[1]
        # set our target that we will request and do some analysis on
        self.myclass.set_target(url_to_list_props, param_to_list_props)
        # perform a bunch of requests and do analysis on the URL requested
        props_list = self.myclass.get_props()
        for prop in props_list:
            final_list.append(prop)
    # index this stuff to a central db
    MapReduceIndexer(domain, final_list).add_prop_info()
    yield domain, final_list
My problem is that only one reducer task is run. I would expect the number of reducer tasks to be equal to the number of unique keys emitted by the mapper. The end result with the above code is that I have one reducer which runs on the master, while the slave sits idle and does nothing, which is obviously not ideal. I notice in my output that a few mapper tasks are started, but always only one reducer task. Other than that, the task runs smoothly and all works as expected.
My question is... what the heck am I doing wrong? Am I misunderstanding the reduce step or screwing up my key-value pairs somewhere? Why are there not multiple reducers running on this job?
Update: OK, so from the answer given I increased mapred.reduce.tasks to a higher value (it was at the default, which I now realize is 1). This was indeed why I was getting one reducer. I now see 3 reduce tasks being performed simultaneously. I now have an import error on my slave that needs to be resolved, but at least I am getting somewhere...
The number of reducers is totally unrelated to the form of your input data; it is determined by the job configuration (mapred.reduce.tasks). For MRJob it also looks like you need its bootstrap options so that your dependencies are available on every node.
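As a hedged sketch of the fix described in the update (the class name is hypothetical, the mapper mirrors the one in the question minus the custom analysis class, and the value 3 is just an illustrative hint to Hadoop):

from mrjob.job import MRJob
from urlparse import urlparse, parse_qs

class URLParamJob(MRJob):

    # ask Hadoop for several reduce tasks instead of the default of 1
    JOBCONF = {'mapred.reduce.tasks': '3'}

    def mapper(self, _, url):
        parsed = urlparse(url)
        domain = parsed.scheme + "://" + parsed.netloc + "/"
        for param in parse_qs(parsed.query):
            yield domain, (url, param)

    def reducer(self, domain, url_query_params):
        yield domain, list(url_query_params)

if __name__ == '__main__':
    URLParamJob.run()

With more than one reduce task requested, each unique domain key is hashed to one of the reducers, so the work can spread across both nodes.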
