How to terminate a MapReduce job after checking for a condition? - hadoop

I already found this solution. But, as the answer says, it's unsafe to do so. Is there any safer way to do this using the new MapReduce library (org.apache.hadoop.mapreduce)?

I wanted to terminate a MapReduce job that runs in a loop, so I solved this problem by using counters as follows:
public static enum SOLUTION_FLAG {
    SOLUTION_FOUND;
}
I took help from this site:
How to use the counters in Hadoop?
Based on the value of this flag, I decided whether I could skip the task; at the end of each loop iteration, when a job ends, I checked the value of this flag.
Let me know if I'm doing it correctly.
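For reference, here is a minimal sketch of the counter-based loop termination described above, using the new API (org.apache.hadoop.mapreduce). The buildJob() helper and the job configuration details are placeholders, not part of the original question:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class IterativeDriver {

    // The flag counter shared between the tasks and the driver.
    public static enum SOLUTION_FLAG {
        SOLUTION_FOUND
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        int iteration = 0;
        boolean done = false;

        while (!done) {
            // buildJob(...) is a hypothetical helper that sets the mapper, reducer
            // and input/output paths for one iteration.
            Job job = buildJob(conf, iteration);
            job.waitForCompletion(true);

            // Inside a mapper or reducer, the flag is raised with:
            // context.getCounter(SOLUTION_FLAG.SOLUTION_FOUND).increment(1);
            long found = job.getCounters()
                            .findCounter(SOLUTION_FLAG.SOLUTION_FOUND)
                            .getValue();
            done = (found > 0);
            iteration++;
        }
    }

    private static Job buildJob(Configuration conf, int iteration) throws Exception {
        Job job = Job.getInstance(conf, "iteration-" + iteration);
        // ... job setup for this iteration goes here ...
        return job;
    }
}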

Related

Measure Total Runtime of Hadoop Mapreduce Job

I want to measure the total execution time of my MapReduce program, from when I run it from the terminal using hadoop jar etc. until it completes. Can I do it by using Java's System.currentTimeMillis() at the start and the end of the driver's main class and then just computing end - start?
I'm not sure it will report the right execution time in fully distributed mode. Is there a better way? I don't want to use any tools; I want the execution time as a return value. Thanks a lot :)
For anybody who has the same question: I already got the answer through my own experiment. So the answer is yes, you can use System.currentTimeMillis() at the start and end of the main class.
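For completeness, a minimal sketch of this approach; the job setup is a placeholder:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class TimedDriver {
    public static void main(String[] args) throws Exception {
        long start = System.currentTimeMillis();

        Job job = Job.getInstance(new Configuration(), "timed job");
        // ... set mapper, reducer and input/output paths here ...
        job.waitForCompletion(true);

        long end = System.currentTimeMillis();
        // Wall-clock time of the whole run, from submission to completion.
        System.out.println("Total runtime: " + (end - start) + " ms");
    }
}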

Adding counter to reducer in Hadoop Map-Reduce

I'm trying to write a map reduce job and want to add a counter to my reducer. However, when I run the job, the counter does not seem to appear in the output.
Currently I am using the line (Java):
context.getCounter(ReducerCounters.COUNTDISTINCT).increment(1);
To be honest, I'm not 100% sure it's possible to use a counter like this in a reducer. If anyone knows whether this is possible or not, please let me know. I can't seem to find any reliable examples of a counter in a reducer online.
Many thanks.
Following are my imports:
import org.apache.hadoop.mapreduce.Counter;
import org.apache.hadoop.mapreduce.Counters;
This is present in my driver class as an enum.
public enum ReducerCounters {
    COUNTDISTINCT
}
And this is present in my driver's run function.
job.waitForCompletion(true);
Counters cn = job.getCounters();
// Find the specific counters that you want to print
Counter c1 = cn.findCounter(ReducerCounters.COUNTDISTINCT);
System.out.println("Displaying just the value " + c1.getValue());
Also make sure you are using the latest hadoop-client version in your POM file. Don't use the last released hadoop-core version, since it is no longer updated. I just tested it: counters work in both my mapper and my reducer.
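For reference, a minimal reducer sketch showing where the increment goes; the key/value types and the one-increment-per-distinct-key logic are illustrative assumptions, not the original poster's code:
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class DistinctCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Each reduce() call sees one distinct key, so increment the counter once per key.
        // ReducerCounters is the enum declared in the driver class shown above
        // (import or qualify it accordingly).
        context.getCounter(ReducerCounters.COUNTDISTINCT).increment(1);
        context.write(key, new IntWritable(1));
    }
}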

Hive query: Is there a way to use UDTF with `cluster by`?

Solved:
It turns out to be a mistake in my UDTF. I found a fix, but I don't quite understand why it worked. When I was implementing the UDTF, Eclipse suggested that initialize is deprecated, but I got an error if I skipped it, so I implemented it anyway. I put a variable initialization in that method, guessing that init would only be done once. The jar worked for simpler scenarios, but if I used the UDTF output with a UDF, and then used the UDF output to do something, like the cheating cluster by or an insert, I got the previously mentioned error. An engineer friend of mine found out that initialize actually gets executed more than once. So I put the initialization in process instead, with an if that checks whether the variable is null and initializes it if it is. Then everything works fine, and my cheat also worked. Still, if someone can give me an explanation, I would be most grateful.
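A hedged sketch of the fix described above: since initialize() may be called more than once, the state is built lazily in process(). The class name, the field, the output column and buildState() are placeholders, not the actual UDTF:
import java.util.Arrays;
import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDTF;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;

public class MyUdtf extends GenericUDTF {

    private Object expensiveState;   // whatever the UDTF needs to build once

    @Override
    public StructObjectInspector initialize(ObjectInspector[] argOIs) throws UDFArgumentException {
        // Do NOT build expensiveState here: initialize() can run more than once.
        return ObjectInspectorFactory.getStandardStructObjectInspector(
                Arrays.asList("key"),
                Arrays.asList((ObjectInspector) PrimitiveObjectInspectorFactory.javaStringObjectInspector));
    }

    @Override
    public void process(Object[] args) throws HiveException {
        // Lazily initialize on first use, as the fix describes.
        if (expensiveState == null) {
            expensiveState = buildState();   // hypothetical helper
        }
        forward(new Object[] { "some/key/value/string" });
    }

    @Override
    public void close() throws HiveException {
    }

    private Object buildState() {
        return new Object();
    }
}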
Following is my original question:
I know I'm not supposed to use cluster by after a UDTF, so select myudtf("stringValue") cluster by rand() wouldn't work.
But my UDTF outputs 7000+ rows (and growing) every hour, so I really need to distribute the subsequent processing across all of my Hadoop cluster's slave nodes.
I imagine I can't get that without using cluster by rand(), so I tried the following cheat:
First I wrap the result up in another table, select key from (select myudtf("stringValue") as key) t limit 1; and it gives the correct result,
OK
some/key/value/string
Time taken: 0.035 seconds, Fetched: 1 row(s)
Then I add the cluster by part, select key from (select myudtf("stringValue") as key) t cluster by rand() limit 1, and I get an error:
WARNING: Hive-on-MR is deprecated in Hive ...
....
Task with the most failures(4):
-----
Task ID:
task_....
URL:
http:....
....
-----
Diagnostic Messages for this Task:
Error: tried to access class sun.security.ssl.SSLSessionContextImpl from class sun.security.ssl.SSLSessionContextImplConstructorAccess
FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 HDFS Read: 0 HDFS Write: 0 FAIL
Total MapReduce CPU Time Spent: 0 msec
I did this trying to trick Hive into treating the temporary table t as a "normal" table to which I can apply cluster by, hoping it would distribute the workload to all the Hadoop slaves, but unfortunately Hive is clever enough to see through my poorly attempted trick.
So, could someone please help me clarify my misconceptions, or give me a hint about the correct way to do this?
FYI, I asked a highly experienced engineer at my company for help, and he thinks it may be a deeper system-level bug. He tried to trace the problem for twenty-something minutes before he left work; he did find some library version issues but couldn't fix the problem after all. ...And I just guess it must be something I did wrong.

Mapreduce - Right way to confirm whether the file is split or not

We have a lot of XML files, and we want to process one XML file per mapper task, for the obvious reason that it makes the processing (parsing) simpler.
We wrote a MapReduce program to achieve that by overriding the isSplitable method of the input format class, and it seems to be working fine.
However, we wanted to confirm that one mapper is used to process each XML file. Is there a way to confirm this by looking at the logs produced by the driver program, or any other way?
Thanks
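For reference, the kind of override the question describes usually looks like the following; this is a sketch based on TextInputFormat, not the asker's actual class:
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class WholeXmlInputFormat extends TextInputFormat {

    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        // Never split: one XML file -> one split -> one mapper.
        return false;
    }
}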
To answer your question: just check the mapper count.
It should be equal to the number of input files.
Example:
/ds/input
/file1.xml
/file2.xml
/file3.xml
Then the mapper count should be 3.
Here is the command.
mapred job -counter job_1449114544347_0001 org.apache.hadoop.mapreduce.JobCounter TOTAL_LAUNCHED_MAPS
You can get many details using the mapred job -counter command. You can check videos 54 and 55 from this playlist; they cover counters in detail.
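Alternatively (my assumption, not part of the answer above), the same built-in counter can be read programmatically from the driver after the job finishes:
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.JobCounter;

public class MapperCountCheck {

    // Call this with the completed Job object from your driver.
    public static long launchedMaps(Job job) throws Exception {
        return job.getCounters()
                  .findCounter(JobCounter.TOTAL_LAUNCHED_MAPS)
                  .getValue();
    }
}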

Hadoop MapReduce global counter

I need a global counter in my application. When all of the reduce tasks are finished, I must print the global counter.
I have found a solution here. However, I wonder whether I can use a global counter with Hadoop Streaming or Pipes, since I write my application in C++.
You can use the stderr output of the streaming process: in Hadoop Streaming, a task can update a counter by writing a line of the form reporter:counter:<group>,<counter>,<amount> to stderr.
I found this Jira issue:
https://issues.apache.org/jira/browse/HADOOP-1328
It has a few patches; I guess you can find in those how to do the global counters.
