Hadoop MapReduce global counter

I need a global counter in my application. When all of the reduce tasks have finished, I must print the global counter.
I have found a solution here. However, I wonder whether I can use a global counter with Hadoop Streaming or Pipes, since I write my application in C++.

You can use the stderr output of the streaming process.
I found this JIRA issue:
https://issues.apache.org/jira/browse/HADOOP-1328
It has a few patches attached; I guess you can find in them how the global counters are done.
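With Streaming (and therefore with a C++ mapper or reducer), the usual mechanism is the reporter convention: the task writes specially formatted lines to standard error, and Hadoop parses them and updates the named counter, which you can read once the job has finished. A minimal sketch of the line format, with placeholder group and counter names (the same convention comes up again for Python further down this page):

reporter:counter:MyGroup,MyCounter,1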

Related

How to terminate a MapReduce job after checking for a condition?

I already found this solution. But, as the answer says, it is unsafe to do so. Is there any safer way to do this using the new MapReduce library (org.apache.hadoop.mapreduce)?
I wanted to terminate a MapReduce job that runs in a loop, so I solved this problem by using counters as follows:
public static enum SOLUTION_FLAG {
    SOLUTION_FOUND
}
I took help from this site:
How to use the counters in Hadoop?
Based on the value of the flag, I decide whether I can skip the task, and when a job ends, at the end of each loop I check the value of this flag.
Let me know if I'm doing it correctly.
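For reference, here is a minimal sketch of how that could look end to end with the new API. buildIterationJob is a hypothetical helper, the enum mirrors the one above, and Job.getInstance is the 2.x form (older releases use new Job(conf, name)):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Counters;
import org.apache.hadoop.mapreduce.Job;

public class IterativeDriver {

    public static enum SOLUTION_FLAG {
        SOLUTION_FOUND
    }

    // A task (mapper or reducer) reports the flag like this:
    //     context.getCounter(SOLUTION_FLAG.SOLUTION_FOUND).increment(1);

    public static void main(String[] args) throws Exception {
        boolean solutionFound = false;
        while (!solutionFound) {
            Job job = buildIterationJob(new Configuration());
            job.waitForCompletion(true);

            // Counters are aggregated by the framework and can be read in the
            // driver once the job has completed.
            Counters counters = job.getCounters();
            long flag = counters.findCounter(SOLUTION_FLAG.SOLUTION_FOUND).getValue();
            solutionFound = (flag > 0);
        }
    }

    // Hypothetical helper: configures one iteration (mapper, reducer, paths, ...).
    private static Job buildIterationJob(Configuration conf) throws IOException {
        Job job = Job.getInstance(conf, "iteration");
        job.setJarByClass(IterativeDriver.class);
        // ... set mapper, reducer, input and output paths here ...
        return job;
    }
}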

Java Vs Scripting for HDFS map/reduce

I am a DB person, so Java is new to me. I am looking for a scripting language for working with HDFS, maybe Python. But I see in one of the previous questions you mentioned that the "heartbeat" between the NameNode and DataNode will not happen if we use a scripting language. Why? I could not understand this. We write our application logic to process the data either in scripts or in Java code, so how does that matter for the "heartbeat"?
Any ideas on this?
Python is a good choice for Hadoop if you already know how to code in it. I've used PHP and Perl with success. This part of the Hadoop framework is called Streaming.
As for the "heartbeat", I believe you are thinking of counters. They are user-defined "variables" that can only be incremented. Hadoop will terminate a task attempt if no counters are incremented for 10 minutes. However, you shouldn't worry about this, as there are system counters that are incremented automatically for you. If you do have a job that takes very long, you can still use counters from Python (Hadoop Streaming) by sending a line like this to the standard error output:
reporter:counter:MyGroup,MyCounter,1
For more info on counters with Hadoop Streaming, see this.

How to define a shared (global) variable in Hadoop?

I need a shared (global) variable which is accessible to all mappers and reducers. Mappers only read its value, but reducers change some values in it to be used in the next iteration. I know DistributedCache is a technique for this, but it only supports reading a shared value.
This is exactly what ZooKeeper was built for. ZooKeeper can keep up with lots of reads from mappers/reducers, and still be able to write something now and then.
The other option would be to set values in the configuration object. However, this only persists globally for a single job; you'd have to manage the passing of this value across jobs yourself. Also, you can't change it while the job is running.
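For the configuration-object option, a minimal sketch (the property name my.shared.value is made up for illustration): the driver sets the value before the job is submitted, and every task reads it in setup(); tasks cannot write it back, so passing an updated value into the next iteration has to happen in the driver between jobs.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;

public class SharedValueExample {

    public static class ReadingMapper extends Mapper<LongWritable, Text, Text, Text> {
        private long sharedValue;

        @Override
        protected void setup(Context context) {
            // Every mapper and reducer sees the value the driver put in the job conf.
            sharedValue = context.getConfiguration().getLong("my.shared.value", 0L);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.setLong("my.shared.value", 42L);   // must be set before job submission
        Job job = Job.getInstance(conf, "shared-value-example");
        job.setJarByClass(SharedValueExample.class);
        job.setMapperClass(ReadingMapper.class);
        // ... input/output formats and paths would be configured here ...
        job.waitForCompletion(true);
    }
}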

shared variable in map reduce

I need a variable that is shared between reduce tasks, which each reduce task can read and write atomically.
The reason I need such a variable is to give a unique identifier to each file created by a reduce task (the number of files created by the reduce tasks is not deterministic).
Thanks
In my understanding, ZooKeeper is specifically built to maintain atomic access to cluster-wide variables.
I would recommend using FileSystem.createNewFile().
Have a look here:
http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/fs/FileSystem.html#createNewFile%28org.apache.hadoop.fs.Path%29
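A sketch of using it that way, under the assumption that a scratch directory such as /tmp/reducer-ids exists: createNewFile() returns false if the path already exists, so the first reducer to create /tmp/reducer-ids/id-N claims N. If two tasks race on the very same path, the loser may see an exception instead of false (the class shown here is the 2.x one and may differ on older releases), hence the catch.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileAlreadyExistsException;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class UniqueIdClaimer {

    // Hypothetical helper: claim the next free numeric id by creating a marker file.
    public static int claimUniqueId(Configuration conf) throws IOException {
        FileSystem fs = FileSystem.get(conf);
        int id = 0;
        while (true) {
            try {
                if (fs.createNewFile(new Path("/tmp/reducer-ids/id-" + id))) {
                    return id;          // we created the marker, so this id is ours
                }
            } catch (FileAlreadyExistsException race) {
                // another reduce task beat us to this id; fall through and try the next
            }
            id++;
        }
    }
}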
All the output files produced by the reducers already have unique names, such as part-r-00001.
There is a partition number you can read in case you need that number from your code.
Centralized counters that must be guaranteed unique break a lot of the scalability of Hadoop.
So if you need something different, I would use something like a SHA-1 of the reducer's task id to get something that is unique across multiple jobs.
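A minimal sketch of that last suggestion combined with the partition number; mapreduce.task.partition is the property name on recent releases (older ones use mapred.task.partition), so treat the exact name as version-dependent:

import java.io.IOException;
import java.security.MessageDigest;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class UniqueNameReducer extends Reducer<Text, Text, Text, Text> {

    private String uniqueId;   // use this when naming side files in reduce()/cleanup()

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // Partition number of this reduce task (the NNNNN in part-r-NNNNN).
        int partition = context.getConfiguration().getInt("mapreduce.task.partition", -1);

        // The task attempt id includes the job id, so a hash of it is unique across jobs.
        String attemptId = context.getTaskAttemptID().toString();
        uniqueId = sha1Hex(attemptId) + "-" + partition;
    }

    private static String sha1Hex(String s) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-1");
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest(s.getBytes("UTF-8"))) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}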

ruby: How do i get the number of subprocess(fork) running

I want to limit the subprocess count to 3. Once it hits 3, I wait until one of the processes stops and then start a new one. I'm using Kernel.fork to start the processes.
How do I get the number of running subprocesses? Or is there a better way to do this?
A good question, but I don't think there's such a method in Ruby, at least not in the standard library. There are lots of gems out there....
This problem, though, sounds like a job for the Mutex class. Look up the section on condition variables here for how to use Ruby's mutexes.
I usually have a Queue of tasks to be done, and then have a couple of threads consuming tasks until they receive an item indicating the end of work. There's an example in "Programming Ruby" under the Thread library. (I'm not sure if I should copy and paste the example to Stack Overflow - sorry)
My solution was to use trap("CLD") to catch SIGCLD whenever a child process ended, and decrement a global counter of running processes.
