hadoop 2.2.0 job -list throws NPE

I compiled hadoop 2.2.0 x64 and am running it on a cluster. When I do hadoop job -list or hadoop job -list all, it throws an NPE like this:
14/01/28 17:18:39 INFO Configuration.deprecation: session.id is deprecated. Instead, use dfs.metrics.session-id
14/01/28 17:18:39 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
Exception in thread "main" java.lang.NullPointerException
at org.apache.hadoop.mapreduce.tools.CLI.listJobs(CLI.java:504)
at org.apache.hadoop.mapreduce.tools.CLI.run(CLI.java:312)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
at org.apache.hadoop.mapred.JobClient.main(JobClient.java:1237)
Also, the Hadoop web UIs, such as the JobHistory server (which I have turned on), show no running or finished jobs, even though I was running jobs.
Please help me solve this problem.

I encountered this when trying to migrate MapReduce over to YARN. It turned out I was missing the directive in mapred-site.xml instructing MapReduce to use YARN:
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
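With this property in place, job listing goes through the YARN ResourceManager rather than the old JobTracker code path. As a quick cross-check (a sketch, assuming a Hadoop 2.x client is on the PATH), the non-deprecated equivalents of the command above are:
mapred job -list
yarn application -list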

Related

Running Spark on Yarn Client

I have recently set up a multinode Hadoop HA (NameNode & ResourceManager) cluster (3 nodes). The installation is complete and all daemons run as expected.
Daemons on NN1:
2945 JournalNode
3137 DFSZKFailoverController
6385 Jps
3338 NodeManager
22730 QuorumPeerMain
2747 DataNode
3228 ResourceManager
2636 NameNode
Daemons on NN2:
19620 Jps
3894 QuorumPeerMain
16966 ResourceManager
16808 NodeManager
16475 DataNode
16572 JournalNode
17101 NameNode
16702 DFSZKFailoverController
Daemons on DN1:
12228 QuorumPeerMain
29060 NodeManager
28858 DataNode
29644 Jps
28956 JournalNode
I am interested in running Spark jobs on my YARN setup.
I have installed Scala and Spark on NN1, and I can successfully start Spark by issuing the following command:
$ spark-shell
Now, I have no knowledge of Spark, and I would like to know how I can run Spark on YARN. I have read that we can run it as either yarn-client or yarn-cluster.
Should I install Spark and Scala on all nodes in the cluster (NN2 & DN1) to run Spark on YARN in client or cluster mode? If not, how can I submit Spark jobs from the NN1 (primary NameNode) host?
I have copied the Spark assembly JAR to HDFS, as suggested in a blog I read:
-rw-r--r-- 3 hduser supergroup 187548272 2016-04-04 15:56 /user/spark/share/lib/spark-assembly.jar
I also created a SPARK_JAR variable in my .bashrc file. I tried to submit the Spark job as yarn-client but ended up with the error below. I have no idea whether I am doing it all correctly or whether other settings need to be done first.
[hduser#ptfhadoop01v spark-1.6.0]$ ./bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn --deploy-mode client --driver-memory 4g --executor-memory 2g --executor-cores 2 --queue thequeue lib/spark-examples*.jar 10
16/04/04 17:27:50 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/04/04 17:27:51 WARN SparkConf:
SPARK_WORKER_INSTANCES was detected (set to '2').
This is deprecated in Spark 1.0+.
Please instead use:
- ./spark-submit with --num-executors to specify the number of executors
- Or set SPARK_EXECUTOR_INSTANCES
- spark.executor.instances to configure the number of instances in the spark config.
16/04/04 17:27:54 WARN Client: SPARK_JAR detected in the system environment. This variable has been deprecated in favor of the spark.yarn.jar configuration variable.
16/04/04 17:27:54 WARN Client: SPARK_JAR detected in the system environment. This variable has been deprecated in favor of the spark.yarn.jar configuration variable.
16/04/04 17:27:57 ERROR SparkContext: Error initializing SparkContext.
org.apache.spark.SparkException: Yarn application has already ended! It might have been killed or unable to launch application master.
at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.waitForApplication(YarnClientSchedulerBackend.scala:124)
at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:64)
at org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:144)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:530)
at org.apache.spark.examples.SparkPi$.main(SparkPi.scala:29)
at org.apache.spark.examples.SparkPi.main(SparkPi.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
16/04/04 17:27:58 WARN MetricsSystem: Stopping a MetricsSystem that is not running
Exception in thread "main" org.apache.spark.SparkException: Yarn application has already ended! It might have been killed or unable to launch application master.
at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.waitForApplication(YarnClientSchedulerBackend.scala:124)
at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:64)
at org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:144)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:530)
at org.apache.spark.examples.SparkPi$.main(SparkPi.scala:29)
at org.apache.spark.examples.SparkPi.main(SparkPi.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
[hduser#ptfhadoop01v spark-1.6.0]$
Please help me resolve this and explain how to run Spark on YARN in client or cluster mode.
Now, I have no knowledge of Spark, and I would like to know how I can run Spark on YARN. I have read that we can run it as either yarn-client or yarn-cluster.
It's highly recommended that you read the official documentation of Spark on YARN at http://spark.apache.org/docs/latest/running-on-yarn.html.
You can use spark-shell with --master yarn to connect to YARN. You need the proper configuration files, e.g. yarn-site.xml, on the machine you run spark-shell from.
Should I install Spark and Scala on all nodes in the cluster (NN2 & DN1) to run Spark on YARN in client or cluster mode?
No. You don't have to install anything on the other nodes, since Spark will distribute the necessary files for you.
If not, how can I submit Spark jobs from the NN1 (primary NameNode) host?
Start with spark-shell --master yarn and see if you can execute the following code:
(0 to 5).toDF.show
If you see a table-like output, you're done. Else, provide the error(s).
I also created a SPARK_JAR variable in my .bashrc file. I tried to submit the Spark job as yarn-client but ended up with the error below. I have no idea whether I am doing it all correctly or whether other settings need to be done first.
Remove the SPARK_JAR variable. Don't use it, as it's not needed and might cause trouble. Read the official documentation at http://spark.apache.org/docs/latest/running-on-yarn.html to understand the basics of Spark on YARN and beyond.
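As a rough sketch, this is the same submission with the deprecated environment variables flagged in the warnings unset; the --queue flag is dropped here on the assumption that no queue named thequeue actually exists in your scheduler (keep it if one does):
# deprecated environment variables flagged in the warnings above
unset SPARK_JAR SPARK_WORKER_INSTANCES
# --queue thequeue dropped on the assumption that no such queue exists; keep it if it does
./bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn --deploy-mode client --driver-memory 4g --executor-memory 2g --executor-cores 2 lib/spark-examples*.jar 10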
Adding this property to hdfs-site.xml solved the issue:
<property>
<name>dfs.client.failover.proxy.provider.mycluster</name>
<value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>
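For context, this property only takes effect if the logical nameservice it names (mycluster) is defined alongside it. A minimal sketch of the surrounding HA entries in hdfs-site.xml, assuming two NameNodes registered as nn1 and nn2 on hypothetical hosts nn1-host and nn2-host:
<property>
<name>dfs.nameservices</name>
<value>mycluster</value>
</property>
<property>
<name>dfs.ha.namenodes.mycluster</name>
<value>nn1,nn2</value>
</property>
<!-- nn1-host/nn2-host are placeholders; use your actual NameNode hostnames -->
<property>
<name>dfs.namenode.rpc-address.mycluster.nn1</name>
<value>nn1-host:8020</value>
</property>
<property>
<name>dfs.namenode.rpc-address.mycluster.nn2</name>
<value>nn2-host:8020</value>
</property>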
In client mode you would run something like the following for a simple word count example:
spark-submit --class org.sparkexample.WordCount --master yarn-client wordcount-sample-plain-1.0-SNAPSHOT.jar input.txt output.txt
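For cluster mode, where the driver itself runs inside YARN rather than on the submitting host, a sketch of the equivalent command (using the same illustrative WordCount jar as above) would be:
spark-submit --class org.sparkexample.WordCount --master yarn --deploy-mode cluster wordcount-sample-plain-1.0-SNAPSHOT.jar input.txt output.txt
Note that in cluster mode the driver's console output ends up in the YARN container logs rather than on the submitting terminal.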
I think you got the spark-submit command wrong there. There is no --master yarn set up.
I would highly recommend using an automated provisioning tool to set up your cluster quickly instead of a manual approach.
Refer to the Cloudera or Hortonworks tools. You can use them to get set up in no time and submit jobs easily without doing all this configuration manually.
Reference: https://hortonworks.com/products/hdp/

INFO Configuration.deprecation: session.id is deprecated. Instead, use dfs.metrics.session-id

I am trying to set up Hadoop 2.6.2. Almost everything has been set up.
My Ubuntu version: 15.10
My hadoop path is /usr/local/hadoop/hadoop-2.6.2
Java path is /usr/local/java/jdk1.8.0_65
I have set the Java and Hadoop paths in /etc/profile.
I have edited 4 files inside hadoop-2.6.2/etc/hadoop: core-site.xml, hadoop-env.sh, hdfs-site.xml and mapred-site.xml
But when I try to execute the following command from the Hadoop site:
bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.2.jar grep input output 'dfs[a-z.]+'
it gives me the following error:
INFO Configuration.deprecation: session.id is deprecated. Instead, use dfs.metrics.session-id
15/11/25 07:57:09 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
java.net.ConnectException: Call From jass-VirtualBox/127.0.1.1 to localhost:9000 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
What can be the reason?
I had the same problem, but on Ubuntu 14.04 LTS.
I solved it with the following commands:
sbin/stop-dfs.sh
bin/hdfs namenode -format
sbin/start-dfs.sh
The first command stops the HDFS daemons.
The second formats the file system (note that this erases any existing HDFS data).
The third starts the HDFS daemons again.
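If the connection refused error persists after reformatting, it is also worth confirming (a sketch, assuming a single-node setup) that core-site.xml points at the same address the client is calling, localhost:9000 in the error above, and that jps actually shows a NameNode process listening there:
<!-- core-site.xml; address must match what the client tries to reach -->
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>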

MapReduce job is stuck on a multi node Hadoop-2.7.1 cluster

I have successfully run Hadoop 2.7.1 on a multi node cluster (1 namenode and 4 datanodes). But when I run a MapReduce job (the WordCount example from the Hadoop website), it always gets stuck at this point:
[~#~ hadoop-2.7.1]$ bin/hadoop jar WordCount.jar WordCount /user/inputdata/ /user/outputdata
15/09/30 17:54:56 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/09/30 17:54:57 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
15/09/30 17:54:58 WARN mapreduce.JobResourceUploader: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
15/09/30 17:54:59 INFO input.FileInputFormat: Total input paths to process : 1
15/09/30 17:55:00 INFO mapreduce.JobSubmitter: number of splits:1
15/09/30 17:55:00 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1443606819488_0002
15/09/30 17:55:00 INFO impl.YarnClientImpl: Submitted application application_1443606819488_0002
15/09/30 17:55:00 INFO mapreduce.Job: The url to track the job: http://~~~~:8088/proxy/application_1443606819488_0002/
15/09/30 17:55:00 INFO mapreduce.Job: Running job: job_1443606819488_0002
Do I have to specify memory settings for YARN?
NOTE: The DataNode hardware is really old (each node has 1 GB of RAM).
Appreciate your help.
Thank you.
The data nodes' memory (1 GB) is far too scarce to prepare even one container to run a mapper/reducer/AM in.
You could try lowering the container memory allocation values below in yarn-site.xml so that containers can be created on these nodes:
yarn.scheduler.minimum-allocation-mb
yarn.scheduler.maximum-allocation-mb
Also try reducing the values of the properties below in your job configuration (see the sketch after this list):
mapreduce.map.memory.mb
mapreduce.reduce.memory.mb
mapreduce.map.java.opts
mapreduce.reduce.java.opts
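For example, here is a sketch with illustrative values sized for 1 GB nodes; the exact numbers are assumptions and will need tuning for your cluster. In yarn-site.xml:
<!-- illustrative values for very small (1 GB) nodes -->
<property>
<name>yarn.scheduler.minimum-allocation-mb</name>
<value>128</value>
</property>
<property>
<name>yarn.scheduler.maximum-allocation-mb</name>
<value>512</value>
</property>
and in the job configuration:
<property>
<name>mapreduce.map.memory.mb</name>
<value>256</value>
</property>
<property>
<name>mapreduce.reduce.memory.mb</name>
<value>256</value>
</property>
<!-- heap sizes kept below the memory.mb values so containers are not killed for exceeding their allocation -->
<property>
<name>mapreduce.map.java.opts</name>
<value>-Xmx200m</value>
</property>
<property>
<name>mapreduce.reduce.java.opts</name>
<value>-Xmx200m</value>
</property>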

Application status in hadoop cluster manager is undefined

I am trying to run a HelloWorld program in Pig or Hive on my single-node cluster setup. All my applications hang, including my word count example in Pig. My Hive command below also hangs:
SELECT COUNT(*) FROM u_data;
My Hive log stops as follows: set mapreduce.job.reduces=<number>
Starting Job = job_1433035285759_0006, Tracking URL = http://der-Inspiron-3521:8088/proxy/application_1433035285759_0006/
Kill Command = /home/der/utility/hadoop-2.7.0/bin/hadoop job -kill job_1433035285759_0006
I followed the link. The application shows:
status = undefined
Here's my map-reduce configuration file:
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>mapreduce.jobhistory.address</name>
<value>localhost:10020</value>
<description>Host and port for Job History Server (default 0.0.0.0:10020)</description>
</property>
<property>
<name>mapreduce.jobtracker.http.address</name>
<value>localhost:50030</value>
<description>The host and port that the MapReduce job tracker runs
at. If "local", then jobs are run in-process as a single map
and reduce task.
</description>
</property>
</configuration>
My Hadoop server is version 2.7.0.
I tried Hadoop 2.6.0 but had other issues.
My Pig script hangs at the MapReduce launching stage:
2015-05-30 19:34:59,641 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - detailed locations: M: lines[1,8],words[-1,-1],wordcount[4,12],grouped[3,10] C: wordcount[4,12],grouped[3,10] R: wordcount[4,12]
2015-05-30 19:34:59,666 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete
2015-05-30 19:34:59,666 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1433039621140_0001]

Nutch 1.7 with Hadoop 2.6.0 "Wrong FS" Error

We have been trying to use Nutch 1.7 with Hadoop 2.6.0.
After installation, when we try to submit a job to Nutch, we receive the following error:
INFO jvm.JvmMetrics: Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized Exception in thread "main" java.lang.IllegalArgumentException: Wrong FS: hdfs://master:9000/user/ubuntu/crawl/crawldb/436075385, expected: file:///
Job is submitted using the following command:
./crawl urls crawl_results 1
Also, we have checked that the fs.default.name setting in core-site.xml uses the hdfs protocol:
<property>
<name>fs.default.name</name>
<value>hdfs://master:9000</value>
</property>
It happens when the crawl command is sent to Nutch, after it reads the input URLs from the file and attempts to insert the data into the crawl db.
Any insights would be appreciated.
Thanks in advance.
