Measure Total Runtime of a Hadoop MapReduce Job

I want to measure the total execution time of my MapReduce program, from the moment I launch it from the terminal with hadoop jar until it completes. Can I do this by calling Java's System.currentTimeMillis() at the start and end of the driver's main class and then just computing end - start?
I'm not sure this will report the right execution time in fully distributed mode. Is there a better way? I don't want to use any external tools; I want the execution time as a value my program produces. Thanks a lot :)

For anybody who has the same question: I already got the answer through my own experiment. So the answer is yes, you can use System.currentTimeMillis() at the start and end of the main class.
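As a minimal sketch of that approach (the class and job names are illustrative, and the job setup is elided): the timing works because waitForCompletion() blocks until the distributed job finishes, so end - start covers the whole run.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class TimedDriver {
    public static void main(String[] args) throws Exception {
        long start = System.currentTimeMillis();

        Job job = Job.getInstance(new Configuration(), "timed job");
        // ... set mapper, reducer, input and output paths here ...
        job.waitForCompletion(true); // blocks until the job finishes

        long end = System.currentTimeMillis();
        System.out.println("Total runtime: " + (end - start) + " ms");
    }
}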

Related

How to implement a time limit in Pari/GP

I have 32-bit Pari/GP and would like to write code that runs for a limited amount of time, then automatically stops if nothing has been produced by then. Here is a pseudocode example:
command
...
run command
if run time over 3 minutes
automatically interrupt the program and stop running
I know there is a basic way to do this; I just never found it in the PARI/GP guide. Any help? Thanks.
alarm(5); for(i = 1, 10^10, continue)  \\ aborts with an alarm error after 5 seconds
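For the three-minute limit in the question, the same idea should work as alarm(180); recent PARI/GP versions also offer a two-argument form, alarm(180, command()), which evaluates the expression and aborts it with a t_ERROR after 180 seconds (command() here stands in for whatever you want to run).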

How to terminate a MapReduce job after checking for a condition?

I already found this solution, but as the answer there says, it's unsafe to do that. Is there a safer way to do this using the new MapReduce API (org.apache.hadoop.mapreduce)?
Since I wanted to terminate a MapReduce job that runs in a loop, I solved this problem by using counters, as follows:
public static enum SOLUTION_FLAG {
    SOLUTION_FOUND;
}
I took help from this question:
How to use the counters in Hadoop?
From the value of the flag I decided whether I could skip the task, and at the end of each loop iteration, when the job ended, I checked the value of this flag.
Let me know if I'm doing it correctly.
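A minimal sketch of this counter pattern (LoopMapper, isSolution() and createIterationJob() are illustrative placeholders, not code from the post): the mapper increments the flag counter when the condition is met, and the driver inspects the counter after each iteration to decide whether to leave the loop.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;

// Mapper side: raise the flag counter when the condition is met.
public class LoopMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        if (isSolution(value)) {
            context.getCounter(SOLUTION_FLAG.SOLUTION_FOUND).increment(1);
        }
        // ... write the normal map output here ...
    }
    private boolean isSolution(Text value) { return false; } // stub for your own test
}

// Driver side: run one job per iteration and stop once the flag is set.
boolean done = false;
while (!done) {
    Job job = createIterationJob(); // stub for your per-iteration job setup
    job.waitForCompletion(true);
    long found = job.getCounters()
                    .findCounter(SOLUTION_FLAG.SOLUTION_FOUND).getValue();
    done = found > 0;
}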

Java vs. scripting for HDFS map/reduce

I am a DB person, so Java is new to me. I am looking for a scripting language for working with HDFS; maybe Python is what I'm looking for. But I saw in one of the previous questions that you mentioned the "heartbeat" between the name node and data node will not happen if we use a scripting language. I could not understand why. We write our application logic for processing the data in scripts or Java code either way, so how does that matter for the "heartbeat"?
Any idea on this?
Python is a good choice for Hadoop if you already know how to code in it. I've used PHP and Perl with success. This part of the Hadoop framework is called Streaming.
As for the "heartbeat", I believe you are thinking of counters. They are user-defined "variables" that can only be incremented. Hadoop will terminate a task attempt if no counters are incremented for 10 minutes. However, you shouldn't worry about this, as there are system counters that are automatically incremented for you. If you do have a job that takes very long, you can still use counters with Python (Hadoop Streaming) by sending something like this to the standard error output:
reporter:counter:MyGroup,MyCounter,1
For more info on counters with Hadoop Streaming, see this.
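As a rough sketch of that reporter protocol (streaming tasks can be written in any language that reads stdin and writes stdout; Java is used here only for illustration, and the group and counter names are the placeholders from the line above): the mapper emits its output on stdout and counter updates on stderr.

import java.io.BufferedReader;
import java.io.InputStreamReader;

// Toy streaming mapper: echoes its input and bumps a counter every
// 1000 records so Hadoop sees progress on long-running tasks.
public class StreamingCounterMapper {
    public static void main(String[] args) throws Exception {
        BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
        String line;
        long n = 0;
        while ((line = in.readLine()) != null) {
            System.out.println(line); // normal map output goes to stdout
            if (++n % 1000 == 0) {
                // counter updates go to stderr in the reporter format
                System.err.println("reporter:counter:MyGroup,MyCounter,1000");
            }
        }
    }
}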

How to find the CPU time taken by a Map/Reduce task in Hadoop

I am writing a Hadoop scheduler. My scheduling requires finding the CPU time taken by each Map/Reduce task.
I know that:
The TaskInProgress class maintains the execStartTime and execFinishTime values, which are wall-clock times of when the process started and finished; they do not accurately indicate the CPU time consumed by the task.
Each task is executed in a new JVM, and I could use the OperatingSystemMXBean.getProcessCpuTime() method, but again the description of the method tells me: "Returns the CPU time used by the process on which the Java virtual machine is running in nanoseconds". I am not entirely clear whether this is what I want.
I am using a library that records resource metrics like CPU usage, idle time, swap usage and memory usage:
http://code.google.com/p/hadoop-toolkit/
You have to extract a patch and apply it to a 0.20.2 tag version.
"I am not entirely clear if this is what I want."
I am pretty sure that this method returns the wall clock time as well.
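For reference, here is a minimal sketch of reading that value (the class name is illustrative). Note that getProcessCpuTime() lives on the com.sun.management subinterface of OperatingSystemMXBean, and its Javadoc describes the result as CPU time used by the whole JVM process, in nanoseconds; since each task runs in its own JVM here, that may still be a usable per-task approximation.

import java.lang.management.ManagementFactory;

public class CpuTimeProbe {
    public static void main(String[] args) {
        // getProcessCpuTime() is declared on com.sun.management.OperatingSystemMXBean,
        // not on the java.lang.management interface of the same name.
        com.sun.management.OperatingSystemMXBean os =
                (com.sun.management.OperatingSystemMXBean)
                        ManagementFactory.getOperatingSystemMXBean();
        long cpuNanos = os.getProcessCpuTime(); // CPU time of this JVM in ns, or -1 if unsupported
        System.out.println("Process CPU time so far: " + cpuNanos + " ns");
    }
}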
Just for posterity: I solved this problem by making a change in src/mapred/org/apache/hadoop/mapred/TaskLog.java (Hadoop 0.20.203) at line 572:
mergedCmd.append("exec setsid 'time' "); // wrap the task command in 'time'
The CPU time will then be written to logs/userlogs/JOBID/TASKID/stderr. I also wrote a script that reaps the cumulative CPU time: https://gist.github.com/1984365
Before running the job, you need to make sure you run:
rm -rf logs/userlogs/*
so that the script works.

Is Pig needed for what I want to do?

I have a question for you, or rather a request for clarification...
I developed a program that uses Hadoop MapReduce to take just one column from a dataset (a CSV file) and process that data with some functions. The program is finished, so the real question is:
Is it a good idea to redevelop this program in Pig? Note that while processing the file I don't use FILTER, COUNT, or any other Pig built-in function...
Am I right in saying that porting this Hadoop MapReduce program to Pig makes no sense, because all my functions would need to be rewritten as Pig User Defined Functions (UDFs)?
In my opinion there is actually no real need to rewrite your very small script in Pig Latin. It becomes more interesting as the script grows, because Pig Latin requires fewer lines of code than Java and is more readable, especially to non-Java developers :)
