Has anyone used mapred.job.tracker=local in a Hadoop streaming job? - hadoop

For the last few weeks we have been using Hadoop streaming to calculate some reports every day. Recently we made a change to our program: if the input size is smaller than 10MB, we set mapred.job.tracker=local in the JobConf, so the job runs locally.
But last night many jobs failed, with status 3 returned by runningJob.getJobState().
I don't know why, and there is nothing in stderr.
I can't find anything related to this question on Google. So I'm wondering: should I use mapred.job.tracker=local in production at all? Maybe it's just a debugging facility that Hadoop supplies for development.
Does anyone know anything about this? Any information is appreciated. Thank you.
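As a minimal sketch of the size-based switch described above (the 10MB threshold is from the question; the helper and the cluster host:port are hypothetical):

```python
import os

LOCAL_THRESHOLD_BYTES = 10 * 1024 * 1024  # 10MB, as in the question

def total_input_size(paths):
    """Sum the sizes of all input files (hypothetical helper)."""
    return sum(os.path.getsize(p) for p in paths)

def choose_job_tracker(paths, cluster_tracker="jobtracker.example.com:8021"):
    """Return the mapred.job.tracker value: 'local' for small inputs,
    otherwise the real JobTracker host:port."""
    if total_input_size(paths) < LOCAL_THRESHOLD_BYTES:
        return "local"
    return cluster_tracker
```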

I believe setting mapred.job.tracker=local has nothing to do with your error, since local is the default value.
This config parameter defines the host and port that the MapReduce job tracker runs at. If it is set to be "local", then jobs are run in-process as a single map and reduce task.
Refer here.
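For reference, this is how the property looks in mapred-site.xml (a sketch; it can equally be set programmatically on the JobConf, and the host:port form is what a real cluster would use instead of "local"):

```xml
<!-- mapred-site.xml: "local" runs jobs in-process as a single map and reduce task;
     a host:port value points at a real JobTracker. -->
<property>
  <name>mapred.job.tracker</name>
  <value>local</value>
</property>
```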

Related

How to get the scheduler of an already finished job in Yarn Hadoop?

So I'm in this situation where I'm modifying mapred-site.xml and the scheduler-specific configuration files for Hadoop, and I just want to make sure that the modifications I have made to the default scheduler (FIFO) have actually taken effect.
How can I check which scheduler was applied to a job, or to a queue of jobs already submitted to Hadoop, using the job ID?
Sorry if this doesn't make much sense, but I've looked around quite extensively to wrap my head around it and read a lot of documentation, and yet I still cannot seem to find this fundamental piece of information.
I'm simply trying word count as a job, changing scheduler settings in mapred-site.xml and yarn-site.xml.
For instance, I'm changing the property "yarn.resourcemanager.scheduler.class" to "org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler" based on this link: see this
I'm also moving appropriate jar files specific to the schedulers to the correct directory.
For your reference, I'm using the "yarn" runtime mode, and Cloudera and Hadoop 2.
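For reference, the yarn-site.xml change mentioned above would look like this (a sketch; the class name is the one from the question):

```xml
<!-- yarn-site.xml: switch the ResourceManager to the CapacityScheduler -->
<property>
  <name>yarn.resourcemanager.scheduler.class</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler</value>
</property>
```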
Thanks a ton for your help

IPython Notebook with Spark on EC2 : Initial job has not accepted any resources

I am trying to run the simple WordCount job in an IPython notebook with Spark connected to an AWS EC2 cluster. The program works perfectly when I use Spark in local standalone mode, but runs into a problem when I try to connect it to the EC2 cluster.
I have taken the following steps
I have followed instructions given in this Supergloo blogpost.
No errors occur until the last line, where I try to write the output to a file. [Because of Spark's lazy evaluation, this is when the program really starts to execute.]
This is where I get the error
[Stage 0:> (0 + 0) / 2]16/08/05 15:18:03 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
Actually there is no error; we get this warning and the program goes into an indefinite wait state. Nothing happens until I kill the IPython notebook.
I have seen this Stack Overflow post and have reduced the number of cores to 1 and the memory to 512MB by passing these options after the main command:
--total-executor-cores 1 --executor-memory 512m
The screen capture from the SparkUI is as follows
sparkUI
This clearly shows that the cluster's resources are not being fully utilized.
Finally, I see from this StackOverflow post that
The spark-ec2 script configures the Spark cluster in EC2 as standalone, which means it cannot work with remote submits. I struggled with this same error you describe for days before figuring out that it is not supported. The error message is unfortunately misleading.
So you have to copy your stuff and log into the master to execute your Spark task.
If this is indeed the case, then there is nothing more to be done; but since this statement was made in 2014, I am hoping that in the last two years the script has been rectified or a workaround has appeared. If there is any workaround, I would be grateful if someone could point it out to me.
Thank you for your reading till this point and for any suggestions offered.
You cannot submit jobs except on the master, as you have seen, unless you set up a REST-based Spark job server.

Hadoop Mapper not running my class

Using Hadoop version 0.20, I am creating a chain of jobs, job1 and job2 (the mappers of which are in x.jar; there is no reducer), with a dependency between them, and submitting to the Hadoop cluster using JobControl. Note that I have called setJarByClass, and getJar returns the correct jar file when checked before submission.
The submission goes through and there seem to be no errors in the user logs or the jobtracker. But I don't see my Mapper getting executed (no sysouts or log output); instead the default output seems to be arriving in the output folder (the input file is read and written out as-is). I am able to run the job directly using x.jar, but I am really out of clues as to why it is not running with JobControl.
Please help !
This issue bugged me for quite a few days. Finally I found that it was UsedGenericOptionsParser that caused the issue. Set this to true and everything starts working fine.

Hadoop Job Scheduling query

I am a beginner to Hadoop.
As I understand it, the Hadoop framework runs jobs in FIFO order (the default scheduling).
Is there any way to tell the framework to run a job at a particular time?
That is, is there any way to configure it to run a job daily at, say, 3 PM?
Any input on this is greatly appreciated.
Thanks, R
What about calling the job from an external Java scheduling framework, like Quartz? Then you can run the job whenever you want.
You might consider using Oozie (http://yahoo.github.com/oozie/). Among other things, it allows:
Frequency execution: the Oozie workflow specification supports both data and time triggers. Users can specify an execution frequency and can wait for data arrival to trigger an action in the workflow.
It is independent of any other Hadoop schedulers and should work with any of them, so probably nothing in your Hadoop configuration will change.
How about having a script to execute your Hadoop job and then using the at command to run it at a specified time? If you want the job to run regularly, you could set up a cron job to execute your script.
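As a sketch of the cron approach (assuming a hypothetical wrapper script at /usr/local/bin/run-hadoop-job.sh that invokes hadoop jar with your job), a crontab entry to run the job daily at 3 PM would be:

```
# min hour day-of-month month day-of-week  command
0 15 * * * /usr/local/bin/run-hadoop-job.sh
```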
I'd use a commercial scheduling app if cron does not cut it, and/or a custom workflow solution. We use a solution called JAMS, but keep in mind it's .NET-oriented.

Hidden features of Hadoop MapReduce

What are the hidden features of Hadoop MapReduce that every developer should be aware of?
One hidden feature per answer, please.
Here are some tips and tricks http://allthingshadoop.com/2010/04/28/map-reduce-tips-tricks-your-first-real-cluster/
One item from there specifically that every developer should be aware of:
In your Java code there is a little trick to help the cluster recognize tasks that are not dead but just working hard. During execution of a task there is no built-in reporting that the job is running as expected if it is not writing output. So if your tasks spend a long time doing work, it is possible the cluster will see the task as failed (based on the mapred.task.tracker.expiry.interval setting).
Have no fear: there is a way to tell the cluster that your task is doing just fine. You have two ways to do this: you can either report status or increment a counter. Either one lets the tasktracker know the task is OK, and this in turn gets seen by the jobtracker. Both options are explained in the JavaDoc http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapred/Reporter.html
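The same trick is available from Hadoop Streaming jobs, where a task reports status and counters by writing specially formatted lines to stderr. A minimal sketch of an identity mapper that does this (the group/counter names and the reporting interval are arbitrary choices for illustration):

```python
import sys

def report_counter(group, counter, amount=1):
    """Hadoop Streaming counter update: a specially formatted line on stderr."""
    sys.stderr.write("reporter:counter:%s,%s,%d\n" % (group, counter, amount))

def report_status(message):
    """Hadoop Streaming status update; lets the tasktracker know we are alive."""
    sys.stderr.write("reporter:status:%s\n" % message)

def run_mapper(stdin=sys.stdin, stdout=sys.stdout):
    for n, line in enumerate(stdin, 1):
        # ... long-running per-record work would go here ...
        stdout.write(line)  # identity mapper, for illustration
        if n % 10000 == 0:
            # Periodically tell the cluster this task is working, not hung
            report_status("processed %d records" % n)
            report_counter("MyJob", "RecordsProcessed", 10000)

if __name__ == "__main__":
    run_mapper()
```

Either form of report resets the expiry timer, so a slow but healthy task is not killed.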
