I have a problem where I need to generate sequence numbers, starting from 1, across a whole file.
For example, let's say I have a BIG file as follows:
abc,123
abb,111
ccc,122
... (N such lines)
Now my output should be as follows:
1,abc,123
2,abb,111
3,ccc,122
... and so on.
The problem with doing this in MapReduce is that every split of the file is processed in parallel by a different map function, so the sequence cannot be maintained. Please don't tell me to use a single reducer. I want to do this in parallel using a typical MapReduce job. Is there a good way to do this with MapReduce?
You can do this, but it's slightly tricky. You need to work with the "mapred_job_id" environment variable, which gives you the ID of the job the reducer is running in.
For example, when you read in the "mapred_job_id" variable, you may get something like this: "job_201302272236_0001". You can take the last part of that job ID, which is "0001".
Using this, you can construct a prefix for each of the lines output by the reducer. For example, if you know that each reducer outputs a maximum of 1000 lines, you can have the output of this reducer be 1000-1999. The second reducer would have a job ID "job_201302272236_0002", so it would take 2000-2999. One caveat: a job ID identifies the whole job, and all of its tasks share it. If the suffix does not differ between your reducers, read the per-task partition index instead, which Hadoop streaming exposes as the "mapred_task_partition" environment variable.
Sample code for the above algorithm using Python (Hadoop streaming):
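import os, sys

# Assumed upper bound on the number of lines a single reducer emits
NUM = 1000

jobID = os.environ['mapred_job_id']
# The suffix is a string such as "0001"; convert it before doing arithmetic
reducerID = int(jobID.split("_")[-1])
count = 0
for line in sys.stdin:
    # Strip the trailing newline so print does not emit a blank line
    print(str((reducerID * NUM) + count) + "," + line.rstrip("\n"))
    count += 1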
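To make the range arithmetic from the reply concrete, here is a minimal sketch that can be run locally without Hadoop. The helper name id_range and the block size of 1000 are assumptions for illustration only; the ID strings are the ones quoted above.

```python
def id_range(id_string, block_size):
    """Return the (first, last) sequence numbers reserved for one reducer.

    id_string is an ID whose last "_"-separated part is a numeric suffix,
    e.g. "job_201302272236_0001". block_size is the assumed maximum number
    of lines that a single reducer will emit.
    """
    suffix = int(id_string.split("_")[-1])  # "0001" -> 1
    first = suffix * block_size
    last = first + block_size - 1
    return first, last


first_block = id_range("job_201302272236_0001", 1000)
second_block = id_range("job_201302272236_0002", 1000)
```

Note that the blocks only give globally unique, ordered numbers. Unless every reducer emits exactly block_size lines, there will be gaps between blocks.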
I am using NiFi 1.9.2
I am reading a text file which happens to be a csv file. I have the Contents of the file in the Contents of a flowFile.
Contents are
a,b,c
d,e,f
g,h,i
I want to prepend a line number to all records in the flowfile and get
1,a,b,c
2,d,e,f
3,g,h,i
each time I feed a file through this processor
I can achieve something close by using the ReplaceText processor with Properties as follows:
Search Value : (?m)(^.*$)
Replacement Value : ${nextInt()},$1
But because nextInt() persists its value over the lifetime of the running NiFi instance I get
0,a,b,c
1,d,e,f
2,g,h,i
for 1st execution
3,a,b,c
4,d,e,f
5,g,h,i
for the next execution etc
Additionally, from the NiFi expression-language-guide, the "counter is shared across all NiFi components, so calling this function multiple times from one Processor will not guarantee sequential values within the context of a particular Processor."
Is there a way to ensure the line numbers always start at 0 for each execution of this processor for the lifetime of the NiFi instance, and are always sequential?
What is the range of the counter?
Can I get the counter to start at 1?
You can split the content into individual lines, then use fragment.index to prepend the counter to each line. After that you can merge them again.
The flow: GenerateFlowFile -> SplitText -> ReplaceText -> MergeContent
Don't forget to add a new line (Shift+Enter) to the Demarcator property of MergeContent.
You can use ${fragment.index:minus(1)} if you want to count from zero.
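For anyone who wants to check the expected result outside NiFi, here is a rough Python sketch of what the split / prepend / merge flow does to the content. It is only a simulation of the logic, not NiFi code, and the function name number_lines is made up for this example.

```python
def number_lines(content, start=1):
    """Mimic the flow: split into lines, prefix each with its index, re-join.

    'start' plays the role of the first fragment index: 1 by default,
    or 0 to get the effect of subtracting one from each index.
    """
    lines = content.splitlines()  # the "split" step
    numbered = ["%d,%s" % (index, line)  # the "replace text" step
                for index, line in enumerate(lines, start)]
    return "\n".join(numbered)  # the "merge" step, newline as demarcator


csv_text = "a,b,c\nd,e,f\ng,h,i"
one_based = number_lines(csv_text)
zero_based = number_lines(csv_text, start=0)
```

Because the index is derived from the position within each file, the numbering restarts for every file that goes through, unlike nextInt().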
I have a map-reduce process in which the mapper takes input from a file that is sorted by key. For example:
1 ...
2 ...
2 ...
3 ...
3 ...
3 ...
4 ...
Then it gets transformed and 99.9% of the keys stay in the same order in relation to one another and 99% of the remainder are close. So the following might be the output of running the map task on the above data:
a ...
c ...
c ...
d ...
e ...
d ...
e ...
Thus, if you could make sure that a reducer took in a range of inputs and put that reducer in the same node where most of the inputs were already located, the shuffle would require very little data transfer. For example, suppose that I partitioned the data so that a-d were taken care of by one reducer and e-g by the next. Then if a-d could be run on the same node that had handled the mapping of 1-4, only two records for e would need to be sent over the network.
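As a rough illustration of the idea, the sketch below range-partitions the example keys and counts how many records would have to leave the node that produced them. It assumes, purely for the example, that reducer i runs on node i; the function names and the boundary list are hypothetical and not part of Hadoop or Spark.

```python
from bisect import bisect_right


def range_partition(key, boundaries):
    """Return the reducer index for a key.

    boundaries holds the first key of every reducer after the first,
    in sorted order. With ["e"], keys below "e" go to reducer 0 and
    keys from "e" upward go to reducer 1.
    """
    return bisect_right(boundaries, key)


def count_network_records(map_outputs_by_node, boundaries):
    """Count records whose reducer is not on the node that mapped them."""
    moved = 0
    for node, keys in map_outputs_by_node.items():
        for key in keys:
            if range_partition(key, boundaries) != node:
                moved += 1
    return moved


# Node 0 mapped input keys 1-4 and produced the output from the example.
outputs = {0: ["a", "c", "c", "d", "e", "d", "e"]}
moved = count_network_records(outputs, ["e"])
```

With these inputs only the two "e" records are counted as needing a network transfer, which matches the example in the question.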
How do I construct a system that takes advantage of this property of my data? I have both Hadoop and Spark available and do not mind writing custom partitioners and the like. However, the full workload is such a classic example of MapReduce that I'd like to stick with a framework which supports that paradigm.
Hadoop mail archives mention consideration of such an optimization. Would one need to modify the framework itself to implement it?
From the Spark perspective there is no direct support for this: the closest is mapPartitions with preservesPartitioning = true. However, that will not directly help in your case, because it requires that the keys not be changed.
/**
 * Return a new RDD by applying a function to each partition of this RDD.
 *
 * `preservesPartitioning` indicates whether the input function preserves the partitioner, which
 * should be `false` unless this is a pair RDD and the input function doesn't modify the keys.
 */
def mapPartitions[U: ClassTag](
    f: Iterator[T] => Iterator[U], preservesPartitioning: Boolean = false): RDD[U] = {
  val func = (context: TaskContext, index: Int, iter: Iterator[T]) => f(iter)
  new MapPartitionsRDD(this, sc.clean(func), preservesPartitioning)
}
If you were able to know definitively that none of the keys would move outside of their original partitions the above would work. But the values on the boundaries would likely not cooperate.
What is the scale of the data compared to the migrating keys? You may consider adding a postprocessing step. First construct a partition for all migrating keys. Your mapper would output a special key value for keys needing to migrate. Then postprocess the results to append them to the standard partitions. That is extra hassle, so you would need to weigh the benefit against the extra step and the added pipeline complexity.
I am trying to write a Riak map reduce using riak-ruby-client. The JavaScript reduce function looks like this:
arr.reduce(callback,[initialValue]);
I am doing something like this:
map_reduce = Riak::MapReduce.new(Ripple.client)
map_reduce.add(bucket) # I have passed a valid bucket
callback = "function(previous, current){return previous + current;}"
results = map_reduce.map(map_func).reduce(callback,1,:keep=>true).run # 1 is the initial value, as in the JavaScript reduce function
But Riak does not treat 1 as the initial value here. Can someone tell me how to pass an initial value to the reduce phase?
A reduce phase function takes two arguments. The first one is a list of inputs, containing the output from previous reduce phase iteration(s) as well as a batch of output from the preceding map/input phase. The second argument is a configuration parameter that is passed in every time the reduce phase function executes. This is described in greater detail in the Riak MapReduce documentation.
As reduce phase functions need to be commutative, associative, and idempotent, it is not possible to reliably identify which is the first iteration and it is therefore not possible to set an initial value.
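A small simulation may help show why an initial value cannot work. The sketch below is plain Python, not Riak's API: run_reduce_phase is a made-up stand-in that feeds the reduce function a batch of new inputs together with its own previous output, which is the behaviour described above. The batch sizes are arbitrary.

```python
def reduce_sum(values):
    """A well-behaved reduce: output can safely be fed back in as input."""
    return [sum(values)]


def reduce_sum_with_initial(values):
    """A reduce that tries to add an 'initial value' of 1 on every call."""
    return [1 + sum(values)]


def run_reduce_phase(reduce_fn, inputs, batch_size):
    """Call reduce_fn once per batch, passing earlier output along with it."""
    previous = []
    for start in range(0, len(inputs), batch_size):
        batch = inputs[start:start + batch_size]
        previous = reduce_fn(previous + batch)
    return previous


data = [1, 2, 3, 4, 5]
stable_small = run_reduce_phase(reduce_sum, data, 2)
stable_large = run_reduce_phase(reduce_sum, data, 5)
skewed_small = run_reduce_phase(reduce_sum_with_initial, data, 2)
skewed_large = run_reduce_phase(reduce_sum_with_initial, data, 5)
```

The plain sum gives the same answer for any batching, while the version with an "initial value" adds 1 once per invocation, so its result depends on how the inputs happened to be batched.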
As per the definition, "The Combiner may be called 0, 1, or many times on each key between the mapper and reducer."
I want to know on what basis the MapReduce framework decides how many times the combiner will be launched.
Simply the number of spills to disk. Sorting happens once the MapOutputBuffer has filled up, and the combining takes place at the same time.
You can tune the number of spills to disk with the parameters io.sort.mb, io.sort.spill.percent, io.sort.record.percent - those are also explained in the documentation (books and online resources).
Example for specific numbers of combiner runs:
0 -> no combiner was defined
1 -> a combiner was defined and the MapOutputBuffer filled up once
>1 -> a combiner was defined and the MapOutputBuffer filled up more than once
Note that even if the MapOutputBuffer never fills up completely, this buffer must be flushed at the end of the map stage and thus triggers the combiner to run at least once (if defined).
First of all, Thomas Jungblut's answer is great and I gave it my upvote. The only thing I want to add is that the Combiner will always be run at least once per Mapper if defined, unless the mapper output is empty or is a single pair. So it is possible, but highly unlikely, for the combiner not to be executed in the mapper.
Here is the source code that decides, based on a condition, whether to invoke the combiner.
Line 1950 - Line 1955 https://github.com/apache/hadoop/blob/0b8a7c18ddbe73b356b3c9baf4460659ccaee095/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapred/MapTask.java
if (combinerRunner == null || numSpills < minSpillsForCombine) {
  Merger.writeFile(kvIter, writer, reporter, job);
} else {
  combineCollector.setWriter(writer);
  combinerRunner.combine(kvIter, combineCollector);
}
So at this point the Combiner runs if:
It is defined, and
The number of spills is at least minSpillsForCombine. minSpillsForCombine is driven by the property "mapreduce.map.combine.minspills", whose default value is 3.
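The condition in the excerpt can be restated as a tiny predicate. This is just a Python paraphrase of the quoted Java branch for illustration; the function and parameter names are invented, and the default of 3 is the one mentioned above.

```python
def combiner_runs_in_excerpt(combiner_defined, num_spills,
                             min_spills_for_combine=3):
    """Mirror the quoted branch: take the combine path only when a
    combiner exists AND there are at least min_spills_for_combine spills.

    This is the negation of
    (combinerRunner == null || numSpills < minSpillsForCombine).
    """
    return combiner_defined and num_spills >= min_spills_for_combine


no_combiner = combiner_runs_in_excerpt(False, 10)
too_few_spills = combiner_runs_in_excerpt(True, 2)
enough_spills = combiner_runs_in_excerpt(True, 3)
```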
I am trying to perform a mapreduce job using the Python MRJob lib and am having some issues getting it to properly distribute across my Hadoop cluster. I believe I am simply missing a basic principle of mapreduce. My cluster is a small, one master one slave test cluster. The basic idea is that I'm just requesting a series of web pages with parameters, doing some analysis on them and returning back some properties on the web page.
The input to my map function is simply a list of URLs with parameters such as the following:
http://guelph.backpage.com/automotive/?layout=bla&keyword=towing
http://guelph.backpage.com/whatever/?p=blah
http://semanticreference.com/search.html?go=Search&q=red
http://copiahcounty.wlbt.com/h/events?ename=drupaleventsxmlapi&s=rrr
http://sweetrococo.livejournal.com/34076.html?mode=ffff
Such that the key-value pairs for the initial input are just key:None, val:URL.
The following is my map function:
def mapper(self, key, url):
    '''Yield domain as the key, and (url, query parameter) tuple as the value'''
    parsed_url = urlparse(url)
    domain = parsed_url.scheme + "://" + parsed_url.netloc + "/"
    if self.myclass.check_if_param(parsed_url):
        parsed_url_query = parsed_url.query
        url_q_dic = parse_qs(parsed_url_query)
        for query_param, query_val in url_q_dic.iteritems():
            #yielding a tuple in mrjob will yield a list
            yield domain, (url, query_param)
Pretty simple, I'm just checking to make sure the URL has a parameter and yielding the URL's domain as key and a tuple giving me the URL and the query parameter as value which MRJob kindly transforms into a list to pass to the reducer, which is the following:
def reducer(self, domain, url_query_params):
    final_list = []
    for url_query_param in url_query_params:
        url_to_list_props = url_query_param[0]
        param_to_list_props = url_query_param[1]
        #set our target that we will request and do some analysis on
        self.myclass.set_target(url_to_list_props, param_to_list_props)
        #perform a bunch of requests and do analysis on the URL requested
        props_list = self.myclass.get_props()
        for prop in props_list:
            final_list.append(prop)
    #index this stuff to a central db
    MapReduceIndexer(domain, final_list).add_prop_info()
    yield domain, final_list
My problem is that only one reducer task is run. I would expect the number of reducer tasks to be equal to the number of unique keys emitted by the mapper. The end result with the above code is that I have one reducer which runs on the master, but the slave sits idly and does nothing, which is obviously not ideal. I notice that in my output a few mapper tasks are started, but always only 1 reducer task. Other than that, the task runs smoothly and all works as expected.
My question is... what the heck am I doing wrong? Am I misunderstanding the reduce step or screwing up my key-value pairs somewhere? Why are there not multiple reducers running on this job?
Update: OK so from the answer given I increased mapred.reduce.tasks to higher (it was the default which I now realize is 1). This was indeed why I was getting 1 reducer. I now see 3 reduce tasks being performed simultaneously. I now have an import error on my slave that needs to be resolved but at least I am getting somewhere...
The number of reducers is totally unrelated to the form of your input data. For MRJob it looks like you need bootstrap options
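To see why the number of reduce tasks is a setting rather than something derived from the keys, here is a minimal sketch of hash partitioning. It is not Hadoop's or MRJob's code: zlib.crc32 is only a stand-in for the key hash, and the function names and example domains are made up.

```python
import zlib
from collections import defaultdict


def partition(key, num_reducers):
    """Pick a reduce task for a key: hash it, then take it modulo the
    configured number of reduce tasks."""
    key_hash = zlib.crc32(key.encode("utf-8")) & 0x7FFFFFFF
    return key_hash % num_reducers


def assign(keys, num_reducers):
    """Group keys by the reduce task that would receive them."""
    buckets = defaultdict(list)
    for key in keys:
        buckets[partition(key, num_reducers)].append(key)
    return dict(buckets)


domains = [
    "http://guelph.backpage.com/",
    "http://semanticreference.com/",
    "http://copiahcounty.wlbt.com/",
    "http://sweetrococo.livejournal.com/",
]
one_reducer = assign(domains, 1)
three_reducers = assign(domains, 3)
```

With one reduce task every key lands in the same bucket no matter how many distinct keys there are, which is the behaviour described in the question. Raising the configured count spreads the keys across at most that many tasks.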