Submit a Spark job on a Yarn cluster from a remote client - hadoop

I want to submit a Spark job on a remote YARN cluster using the spark-submit command. My client is a Windows machine and the cluster is composed of a master and 4 slaves. I copied the Hadoop config files from my cluster to the remote machine, namely core-site.xml and yarn-site.xml, and set the HADOOP_CONF_DIR variable in spark-env.sh to point to them.
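For reference, the relevant line in my spark-env.sh looks roughly like this (the path here is illustrative; it points at the directory holding the copied XML files):
export HADOOP_CONF_DIR=/path/to/copied-hadoop-conf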
However, when I submit a job using the following command:
spark-submit --jars hdfs:///user/kmansour/elevation/geotrellis-1.2.1-assembly.jar \
--class tutorial.CalculateFlowDirection hdfs:///user/kmansour/elevation/demo_2.11-0.2.0.jar hdfs:///user/kmansour/elevation/TIF/DTM_1m_19_E_17_108_*.tif \
--deploy-mode cluster \
--master yarn
I get stuck with:
INFO yarn.Client: Application report for application_1519070657292_0088 (state: ACCEPTED)
until I get this:
diagnostics: Application application_1519070657292_0088 failed 2 times due to AM Container for appattempt_1519070657292_0088_000002 exited with exitCode: 10
For more detailed output, check application tracking page:http://node1:8088/cluster/app/application_1519070657292_0088Then, click on links to logs of each attempt.
Diagnostics: Exception from container-launch.
Container id: container_1519070657292_0088_02_000001
Exit code: 10
Stack trace: ExitCodeException exitCode=10:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:585)
at org.apache.hadoop.util.Shell.run(Shell.java:482)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:776)
at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:212)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
When I check out the application tracking page, I get this on stderr:
18/03/13 14:48:05 INFO util.SignalUtils: Registered signal handler for TERM
18/03/13 14:48:05 INFO util.SignalUtils: Registered signal handler for HUP
18/03/13 14:48:05 INFO util.SignalUtils: Registered signal handler for INT
18/03/13 14:48:06 INFO yarn.ApplicationMaster: Preparing Local resources
18/03/13 14:48:08 INFO yarn.ApplicationMaster: ApplicationAttemptId: appattempt_1519070657292_0088_000002
18/03/13 14:48:08 INFO spark.SecurityManager: Changing view acls to: kmansour
18/03/13 14:48:08 INFO spark.SecurityManager: Changing modify acls to: kmansour
18/03/13 14:48:08 INFO spark.SecurityManager: Changing view acls groups to:
18/03/13 14:48:08 INFO spark.SecurityManager: Changing modify acls groups to:
18/03/13 14:48:08 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(kmansour); groups with view permissions: Set(); users with modify permissions: Set(kmansour); groups with modify permissions: Set()
18/03/13 14:48:08 INFO yarn.ApplicationMaster: Waiting for Spark driver to be reachable.
18/03/13 14:50:15 ERROR yarn.ApplicationMaster: Failed to connect to driver at 132.156.9.98:50687, retrying ...
18/03/13 14:50:15 ERROR yarn.ApplicationMaster: Uncaught exception:
org.apache.spark.SparkException: Failed to connect to driver!
at org.apache.spark.deploy.yarn.ApplicationMaster.waitForSparkDriver(ApplicationMaster.scala:577)
at org.apache.spark.deploy.yarn.ApplicationMaster.runExecutorLauncher(ApplicationMaster.scala:433)
at org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:256)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$main$1.apply$mcV$sp(ApplicationMaster.scala:764)
at org.apache.spark.deploy.SparkHadoopUtil$$anon$2.run(SparkHadoopUtil.scala:67)
at org.apache.spark.deploy.SparkHadoopUtil$$anon$2.run(SparkHadoopUtil.scala:66)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:66)
at org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:762)
at org.apache.spark.deploy.yarn.ExecutorLauncher$.main(ApplicationMaster.scala:785)
at org.apache.spark.deploy.yarn.ExecutorLauncher.main(ApplicationMaster.scala)
18/03/13 14:50:15 INFO yarn.ApplicationMaster: Final app status: FAILED, exitCode: 10, (reason: Uncaught exception: org.apache.spark.SparkException: Failed to connect to driver!)
18/03/13 14:50:16 INFO yarn.ApplicationMaster: Unregistering ApplicationMaster with FAILED (diag message: Uncaught exception: org.apache.spark.SparkException: Failed to connect to driver!)
18/03/13 14:50:16 INFO yarn.ApplicationMaster: Deleting staging directory hdfs://132.156.9.142:8020/user/kmansour/.sparkStaging/application_1519070657292_0088
18/03/13 14:50:16 INFO util.ShutdownHookManager: Shutdown hook called
The IP address of my master node is 132.156.9.142 and the IP address of my client is 132.156.9.98. The log shows me that the application master is attempting to connect to the driver on the client, even though I explicitly stated --deploy-mode cluster.
Shouldn't the driver be on a node in the cluster?
This is the content of my config files:
spark-defaults.conf:
spark.eventLog.enabled true
spark.eventLog.dir hdfs://132.156.9.142:8020/events
spark.history.fs.logDirectory hdfs://132.156.9.142:8020/events
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.driver.cores 2
spark.driver.memory 5g
spark.executor.extraJavaOptions -XX:+PrintGCDetails -Dkey=value -Dnumbers="one two three"
spark.executor.instances 4
spark.executor.cores 2
spark.executor.memory 6g
spark.yarn.am.memory 2g
spark.yarn.jars hdfs://node1:8020/jars/*.jar
yarn-site.xml:
<configuration>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>node1</value>
</property>
<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>8192</value>
</property>
<property>
<name>yarn.scheduler.minimum-allocation-mb</name>
<value>1024</value>
</property>
<property>
<name>yarn.scheduler.maximum-allocation-mb</name>
<value>7168</value>
</property>
<property>
<name>yarn.nodemanager.resource.cpu-vcores</name>
<value>2</value>
</property>
<property>
<name>yarn.nodemanager.pmem-check-enabled</name>
<value>false</value>
</property>
<property>
<name>yarn.nodemanager.vmem-check-enabled</name>
<value>false</value>
</property>
<property>
<name>yarn.nodemanager.vmem-pmem-ratio</name>
<value>5</value>
</property>
</configuration>
core-site.xml:
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://132.156.9.142:8020</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>C:\Users\kmansour\Documents\hadoop-2.7.4\tmp</value>
</property>
</configuration>
I am very new to all this and perhaps my reasoning is flawed; any input or suggestions would help.

You need to change the order of the parameters passed to spark-submit. In your command:
spark-submit --jars hdfs:///user/kmansour/elevation/geotrellis-1.2.1-assembly.jar \
--class tutorial.CalculateFlowDirection hdfs:///user/kmansour/elevation/demo_2.11-0.2.0.jar hdfs:///user/kmansour/elevation/TIF/DTM_1m_19_E_17_108_*.tif \
--deploy-mode cluster \
--master yarn
Spark is launched in the default mode (probably yarn-client), and your --deploy-mode and --master are passed as application parameters, because they come after the jar file location. Change it to:
spark-submit --jars hdfs:///user/kmansour/elevation/geotrellis-1.2.1-assembly.jar \
--deploy-mode cluster \
--master yarn \
--class tutorial.CalculateFlowDirection hdfs:///user/kmansour/elevation/demo_2.11-0.2.0.jar hdfs:///user/kmansour/elevation/TIF/DTM_1m_19_E_17_108_*.tif
and you will get true yarn-cluster mode.
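Once the flags are actually picked up, the driver and ApplicationMaster run together in a container on the cluster. If the job still fails, you can pull the AM/driver logs with the YARN CLI, e.g. for the application id from your report (assuming log aggregation is enabled):
yarn logs -applicationId application_1519070657292_0088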

Related

Hadoop 3.3.0 on WSL 2 - Failed to start name node, but data node works fine

I'm running Hadoop 3.3.0 on WSL 2 in Windows 11. I followed this guide to set up my configuration: Install Hadoop 3.3.0 on WSL 2
When I start namenode and datanode with:
start-dfs.sh
It shows no error, but in jps the namenode is not running:
8805 DataNode
9034 SecondaryNameNode
9212 Jps
Looking in the namenode log file, it shows that the namenode failed to start:
2023-01-02 20:50:51,714 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Stopping NameNode metrics system...
2023-01-02 20:50:51,715 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: NameNode metrics system stopped.
2023-01-02 20:50:51,715 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: NameNode metrics system shutdown complete.
2023-01-02 20:50:51,719 ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: Failed to start namenode.
java.io.IOException: Could not parse line: Filesystem 1024-blocks Used Available Capacity Mounted on
at org.apache.hadoop.fs.DF.parseOutput(DF.java:195)
at org.apache.hadoop.fs.DF.getFilesystem(DF.java:76)
at org.apache.hadoop.hdfs.server.namenode.NameNodeResourceChecker$CheckedVolume.<init>(NameNodeResourceChecker.java:70)
at org.apache.hadoop.hdfs.server.namenode.NameNodeResourceChecker.addDirToCheck(NameNodeResourceChecker.java:166)
at org.apache.hadoop.hdfs.server.namenode.NameNodeResourceChecker.<init>(NameNodeResourceChecker.java:135)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startCommonServices(FSNamesystem.java:1266)
at org.apache.hadoop.hdfs.server.namenode.NameNode.startCommonServices(NameNode.java:862)
at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:783)
at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:1014)
at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:987)
at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1756)
at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1821)
2023-01-02 20:50:51,722 INFO org.apache.hadoop.util.ExitUtil: Exiting with status 1: java.io.IOException: Could not parse line: Filesystem 1024-blocks Used Available Capacity Mounted on
2023-01-02 20:50:51,725 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: SHUTDOWN_MSG:
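From the stack trace, the failure happens in Hadoop's DF helper (org.apache.hadoop.fs.DF), which, as far as I can tell, shells out to df -k and parses its output. On WSL 2 I can see the exact lines it has to parse with something like:
df -k /tmp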
Here is my configuration file core-site.xml:
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
And for hdfs-site.xml:
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>
I've been stuck with this problem for several days, thank you for all the help!

NoClassDefFoundError while running HBase, no error in zookeeper

I've created a standalone hadoop cluster using this tutorial. Then I installed HBase over hadoop by following this tutorial.
I ran Hadoop with:
cd /usr/local/hadoop/sbin/
./start-all.sh
And HBase with:
cd /usr/local/hbase/bin
./start-hbase.sh
Then when I do jps, I get:
3761 Jps
835 NameNode
966 DataNode
3480 HMaster
3608 HRegionServer
1465 ResourceManager
1610 NodeManager
3418 HQuorumPeer
1150 SecondaryNameNode
But after some time it shows:
1779 SecondaryNameNode
1557 DataNode
2870 HQuorumPeer
2200 NodeManager
2061 ResourceManager
3246 Jps
1423 NameNode
So that's a pretty strong indicator that something is wrong. I checked the ZooKeeper logs in /usr/local/hbase/logs/hbase-hduser-zookeeper-stal.log and they showed:
2019-04-29 07:54:45,677 INFO [main] server.ZooKeeperServer: Server environment:java.io.tmpdir=/tmp
2019-04-29 07:54:45,677 INFO [main] server.ZooKeeperServer: Server environment:java.compiler=<NA>
2019-04-29 07:54:45,677 INFO [main] server.ZooKeeperServer: Server environment:os.name=Linux
2019-04-29 07:54:45,678 INFO [main] server.ZooKeeperServer: Server environment:os.arch=amd64
2019-04-29 07:54:45,678 INFO [main] server.ZooKeeperServer: Server environment:os.version=4.15.0-47-generic
2019-04-29 07:54:45,678 INFO [main] server.ZooKeeperServer: Server environment:user.name=hduser
2019-04-29 07:54:45,678 INFO [main] server.ZooKeeperServer: Server environment:user.home=/home/hduser
2019-04-29 07:54:45,678 INFO [main] server.ZooKeeperServer: Server environment:user.dir=/home/hduser
2019-04-29 07:54:45,782 INFO [main] server.ZooKeeperServer: tickTime set to 3000
2019-04-29 07:54:45,782 INFO [main] server.ZooKeeperServer: minSessionTimeout set to -1
2019-04-29 07:54:45,782 INFO [main] server.ZooKeeperServer: maxSessionTimeout set to 90000
2019-04-29 07:54:46,780 INFO [main] server.NIOServerCnxnFactory: binding to port 0.0.0.0/0.0.0.0:2181
which doesn't seem like any error whatsoever.
So, I checked HBase's errors in /usr/local/hbase/logs/hbase-hduser-master-stal.log and I got:
2019-04-29 07:55:11,513 ERROR [main] master.HMasterCommandLine: Master exiting
java.lang.RuntimeException: Failed construction of Master: class org.apache.hadoop.hbase.master.HMaster.
at org.apache.hadoop.hbase.master.HMaster.constructMaster(HMaster.java:3100)
at org.apache.hadoop.hbase.master.HMasterCommandLine.startMaster(HMasterCommandLine.java:236)
at org.apache.hadoop.hbase.master.HMasterCommandLine.run(HMasterCommandLine.java:140)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.hadoop.hbase.util.ServerCommandLine.doMain(ServerCommandLine.java:149)
at org.apache.hadoop.hbase.master.HMaster.main(HMaster.java:3111)
Caused by: java.lang.NoClassDefFoundError: org/apache/htrace/SamplerBuilder
at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:644)
at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:628)
at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:149)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:93)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2701)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2683)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:372)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
at org.apache.hadoop.hbase.util.CommonFSUtils.getRootDir(CommonFSUtils.java:362)
at org.apache.hadoop.hbase.util.CommonFSUtils.isValidWALRootDir(CommonFSUtils.java:411)
at org.apache.hadoop.hbase.util.CommonFSUtils.getWALRootDir(CommonFSUtils.java:387)
at org.apache.hadoop.hbase.regionserver.HRegionServer.initializeFileSystem(HRegionServer.java:704)
at org.apache.hadoop.hbase.regionserver.HRegionServer.<init>(HRegionServer.java:613)
at org.apache.hadoop.hbase.master.HMaster.<init>(HMaster.java:489)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at org.apache.hadoop.hbase.master.HMaster.constructMaster(HMaster.java:3093)
... 5 more
Caused by: java.lang.ClassNotFoundException: org.apache.htrace.SamplerBuilder
at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 25 more
There was a similar question, which was answered by:
HBase 2.1.0 release uses HTrace, that is an incubating Apache
Foundation project.
There is a folder for 3rd-party libraries in HBase lib folder,
client-facing-thirdparty. You need to copy
htrace-core-3.1.0-incubating.jar from there to the HBase lib
directory. (see reference)
There is also another solution at Cloudera Community that changes a
configuration instead of adding the library manually.
The first solution includes:
The HMaster refuse to start due to the error below:
Java.lang.RuntimeException: Failed construction of Master: class
org.apache.hadoop.hbase.master.HMaster Caused by:
java.lang.ClassNotFoundException: org.apache.htrace.SamplerBuilder
This is because in hbase 2.0, we have 2 different version of
htrace-core.x.x.x.incubating.jar
cd /usr/local/hbase/lib/client-facing-thirdparty/:
htrace-core-3.1.0-incubating.jar
htrace-core-4.2.0-incubating.jar
Currently, only version 3.1.0 has the required class SamplerBuilder.
We need to remove version 4.2.0:
mv htrace-core-4.2.0-incubating.jar htrace-core-4.2.0-incubating.jar.bak
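In command form, the two quoted suggestions amount to roughly this (paths assume the /usr/local/hbase layout used above):
cd /usr/local/hbase/lib/client-facing-thirdparty
cp htrace-core-3.1.0-incubating.jar /usr/local/hbase/lib/
mv htrace-core-4.2.0-incubating.jar htrace-core-4.2.0-incubating.jar.bak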
But when I cd to /usr/local/hbase/lib/client-facing-thirdparty and do ls -a, I get:
. audience-annotations-0.5.0.jar findbugs-annotations-1.3.9-1.jar log4j-1.2.17.jar slf4j-log4j12-1.7.25.jar
.. commons-logging-1.2.jar htrace-core4-4.2.0-incubating.jar slf4j-api-1.7.25.jar
As one can see, there is only one htrace file, not two. So I downloaded htrace-3.1.0 from here, copied it into /usr/local/hbase/lib/client-facing-thirdparty, and renamed htrace-core4-4.2.0-incubating.jar to htrace-core4-4.2.0-incubating.jar.bak. Then I restarted Hadoop and HBase. Still no change; now jps doesn't even show HMaster and HRegionServer.
HBase configuration files:
<configuration>
<property>
<name>hbase.unsafe.stream.capability.enforce</name>
<value>false</value>
</property>
<property>
<name>zookeeper.znode.parent</name>
<value>/hbase</value>
</property>
<property>
<name>hbase.rootdir</name>
<value>hdfs://localhost:9000/user/hduser/hbase</value>
</property>
<property>
<name>hbase.cluster.distributed</name>
<value>true</value>
</property>
<property>
<name>hbase.zookeeper.quorum</name>
<value>localhost</value>
</property>
<property>
<name>hbase.master</name>
<value>localhost:60010</value>
</property>
<property>
<name>hbase.zookeeper.property.clientPort</name>
<value>2181</value>
</property>
<property>
<name>hbase.zookeeper.property.dataDir</name>
<value>hdfs://localhost:9000/user/hduser/zookeeper</value>
</property>
<property>
<name>hbase.tmp.dir</name>
<value>/hbase/tmp</value>
<description>Temporary directory on the local filesystem.</description>
</property>
</configuration>
And hbase-env.sh looks like:
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export HBASE_REGIONSERVERS=/usr/local/hbase/conf/regionservers
export HBASE_MANAGES_ZK=true
export HBASE_PID_DIR=/var/hbase/pids
export HBASE_OPTS="$HBASE_OPTS -XX:+UseConcMarkSweepGC"
So, what should I do now? Any help is appreciated.

ERROR in datanode execution while running Hadoop first time in Windows 10

I am trying to run Hadoop 3.1.1 on my Windows 10 machine. I modified all of these files:
hdfs-site.xml
mapred-site.xml
core-site.xml
yarn-site.xml
Then, I executed the following command:
C:\hadoop-3.1.1\bin> hdfs namenode -format
The format ran correctly, so I changed to C:\hadoop-3.1.1\sbin to execute the following command:
C:\hadoop-3.1.1\sbin> start-dfs.cmd
The command prompt opens 2 new windows: one for datanode and another for namenode.
The namenode window keeps running:
2018-09-02 21:37:06,232 INFO ipc.Server: IPC Server Responder: starting
2018-09-02 21:37:06,232 INFO ipc.Server: IPC Server listener on 9000: starting
2018-09-02 21:37:06,247 INFO namenode.NameNode: NameNode RPC up at: localhost/127.0.0.1:9000
2018-09-02 21:37:06,247 INFO namenode.FSNamesystem: Starting services required for active state
2018-09-02 21:37:06,247 INFO namenode.FSDirectory: Initializing quota with 4 thread(s)
2018-09-02 21:37:06,247 INFO namenode.FSDirectory: Quota initialization completed in 3 milliseconds
name space=1
storage space=0
storage types=RAM_DISK=0, SSD=0, DISK=0, ARCHIVE=0, PROVIDED=0
2018-09-02 21:37:06,279 INFO blockmanagement.CacheReplicationMonitor: Starting CacheReplicationMonitor with interval 30000 milliseconds
While the datanode gives the following error:
ERROR: datanode.DataNode: Exception in secureMain
org.apache.hadoop.util.DiskChecker$DiskErrorException: Too many failed volumes - current valid volumes: 0, volumes configured: 1, volumes failed: 1, volume failures tolerated: 0
at org.apache.hadoop.hdfs.server.datanode.checker.StorageLocationChecker.check(StorageLocationChecker.java:220)
at org.apache.hadoop.hdfs.server.datanode.DataNode.makeInstance(DataNode.java:2762)
at org.apache.hadoop.hdfs.server.datanode.DataNode.instantiateDataNode(DataNode.java:2677)
at org.apache.hadoop.hdfs.server.datanode.DataNode.createDataNode(DataNode.java:2719)
at org.apache.hadoop.hdfs.server.datanode.DataNode.secureMain(DataNode.java:2863)
at org.apache.hadoop.hdfs.server.datanode.DataNode.main(DataNode.java:2887)
2018-09-02 21:37:04,250 INFO util.ExitUtil: Exiting with status 1: org.apache.hadoop.util.DiskChecker$DiskErrorException: Too many failed volumes - current valid volumes: 0, volumes configured: 1, volumes failed: 1, volume failures tolerated: 0
2018-09-02 21:37:04,250 INFO datanode.DataNode: SHUTDOWN_MSG:
And then the datanode shuts down! I tried several ways to overcome this error, but this is the first time I am installing Hadoop on Windows and I can't work out what to do next!
I got things working after I removed the file-system reference for the datanode in hdfs-site.xml. I found that this let the software create and initialise its own datanode, which then popped up in sbin. After that I could use HDFS without a hitch. Here is what worked for me for Hadoop 3.1.3 on Windows:
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:///C:/Users/myusername/hadoop/hadoop-3.1.3/data/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>datanode</value>
</property>
</configuration>
Cheers,
MV
I had the same problem and what worked for me was editing hdfs-site.xml as follows:
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:///C:/Hadoop/hadoop-3.1.2/data/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>/C:/Hadoop/hadoop-3.1.2/data/datanode</value>
</property>
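After changing these directories you will most likely need to re-format the namenode and restart HDFS, roughly (paths taken from the config above):
C:\Hadoop\hadoop-3.1.2\bin> hdfs namenode -format
C:\Hadoop\hadoop-3.1.2\sbin> start-dfs.cmd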

spark in running but yarn can't find it

I have a problem with my YARN cluster.
I run the HDFS namenode, HDFS datanode, and YARN on localhost, and then also run a Spark master and a Spark worker on localhost, like this:
$ jps
5809 Main
53730 ResourceManager
53540 SecondaryNameNode
53125 NameNode
56710 Master
54009 NodeManager
56809 Worker
53308 DataNode
56911 Jps
I can see that the Spark worker is linked to the Spark master through http://127.0.0.1:8080.
(screenshot: Spark web UI showing the worker registered with the master)
But in the YARN web UI at http://127.0.0.1:8088, there is nothing on the "Nodes of the cluster" page.
(screenshot: YARN web UI with an empty node list)
My conf/spark-env.sh is
export SCALA_HOME="/opt/scala-2.11.8/"
export JAVA_HOME="/opt/jdk1.8.0_101/"
export HADOOP_HOME="/opt/hadoop-2.7.3/"
export HADOOP_CONF_DIR="/opt/hadoop-2.7.3/etc/hadoop/"
export SPARK_MASTER_IP=127.0.0.1
export SPARK_LOCAL_DIRS="/opt/spark-2.0.0-bin-hadoop2.7/"
export SPARK_DRIVER_MEMORY=1G
And conf/spark-defaults.conf is
spark.master spark://127.0.0.1:7077
spark.yarn.submit.waitAppCompletion false
spark.yarn.access.namenodes hdfs://127.0.0.1:8032
And yarn-site.xml is
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>127.0.0.1</value>
</property>
<property>
<name>yarn.resourcemanager.scheduler.address</name>
<value>${yarn.resourcemanager.hostname}:8030</value>
</property>
<property>
<name>yarn.resourcemanager.resource-tracker.address</name>
<value>${yarn.resourcemanager.hostname}:8035</value>
</property>
<property>
<name>yarn.resourcemanager.admin.address</name>
<value>${yarn.resourcemanager.hostname}:8033</value>
</property>
<property>
<name>yarn.resourcemanager.webapp.address</name>
<value>${yarn.resourcemanager.hostname}:8088</value>
</property>
<property>
<name>yarn.resourcemanager.address</name>
<value>${yarn.resourcemanager.hostname}:8032</value>
</property>
</configuration>
When I submit an application using
spark-submit --master yarn --deploy-mode cluster test.py
I get output like this:
16/10/12 16:19:30 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/10/12 16:19:30 INFO client.RMProxy: Connecting to ResourceManager at /127.0.0.1:8032
16/10/12 16:19:30 INFO yarn.Client: Requesting a new application from cluster with 1 NodeManagers
16/10/12 16:19:30 INFO yarn.Client: Verifying our application has not requested more than the maximum memory capability of the cluster (8192 MB per container)
16/10/12 16:19:30 INFO yarn.Client: Will allocate AM container, with 1408 MB memory including 384 MB overhead
16/10/12 16:19:30 INFO yarn.Client: Setting up container launch context for our AM
16/10/12 16:19:30 INFO yarn.Client: Setting up the launch environment for our AM container
16/10/12 16:19:30 INFO yarn.Client: Preparing resources for our AM container
16/10/12 16:19:31 WARN yarn.Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
16/10/12 16:19:32 INFO yarn.Client: Uploading resource file:/opt/spark-2.0.0-bin-hadoop2.7/spark-3cdb2435-d6a0-4ce0-a54a-f2849d5f4909/__spark_libs__2140674596658903486.zip -> hdfs://127.0.0.1:9000/user/fuxiuyin/.sparkStaging/application_1476256306830_0002/__spark_libs__2140674596658903486.zip
16/10/12 16:19:33 INFO yarn.Client: Uploading resource file:/home/fuxiuyin/PycharmProjects/spark-test/test.py -> hdfs://127.0.0.1:9000/user/fuxiuyin/.sparkStaging/application_1476256306830_0002/test.py
16/10/12 16:19:33 INFO yarn.Client: Uploading resource file:/opt/spark-2.0.0-bin-hadoop2.7/python/lib/pyspark.zip -> hdfs://127.0.0.1:9000/user/fuxiuyin/.sparkStaging/application_1476256306830_0002/pyspark.zip
16/10/12 16:19:33 INFO yarn.Client: Uploading resource file:/opt/spark-2.0.0-bin-hadoop2.7/python/lib/py4j-0.10.1-src.zip -> hdfs://127.0.0.1:9000/user/fuxiuyin/.sparkStaging/application_1476256306830_0002/py4j-0.10.1-src.zip
16/10/12 16:19:33 INFO yarn.Client: Uploading resource file:/opt/spark-2.0.0-bin-hadoop2.7/spark-3cdb2435-d6a0-4ce0-a54a-f2849d5f4909/__spark_conf__3570291475444079549.zip -> hdfs://127.0.0.1:9000/user/fuxiuyin/.sparkStaging/application_1476256306830_0002/__spark_conf__.zip
16/10/12 16:19:33 INFO spark.SecurityManager: Changing view acls to: fuxiuyin
16/10/12 16:19:33 INFO spark.SecurityManager: Changing modify acls to: fuxiuyin
16/10/12 16:19:33 INFO spark.SecurityManager: Changing view acls groups to:
16/10/12 16:19:33 INFO spark.SecurityManager: Changing modify acls groups to:
16/10/12 16:19:33 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(fuxiuyin); groups with view permissions: Set(); users with modify permissions: Set(fuxiuyin); groups with modify permissions: Set()
16/10/12 16:19:33 INFO yarn.Client: Submitting application application_1476256306830_0002 to ResourceManager
16/10/12 16:19:33 INFO impl.YarnClientImpl: Submitted application application_1476256306830_0002
16/10/12 16:19:33 INFO yarn.Client: Application report for application_1476256306830_0002 (state: ACCEPTED)
16/10/12 16:19:33 INFO yarn.Client:
client token: N/A
diagnostics: N/A
ApplicationMaster host: N/A
ApplicationMaster RPC port: -1
queue: default
start time: 1476260373944
final status: UNDEFINED
tracking URL: http://localhost:8088/proxy/application_1476256306830_0002/
user: fuxiuyin
16/10/12 16:19:33 INFO util.ShutdownHookManager: Shutdown hook called
16/10/12 16:19:33 INFO util.ShutdownHookManager: Deleting directory /opt/spark-2.0.0-bin-hadoop2.7/spark-3cdb2435-d6a0-4ce0-a54a-f2849d5f4909
It succeeds, but in the YARN web UI the app is never running; it always stays in ACCEPTED.
It looks like no Spark node runs this app.
Can anyone tell me what's wrong?
Thanks~
You can specify one type of cluster:
YARN (cluster or client mode)
Spark standalone
Mesos
You have started a Spark standalone server and you're connecting to that cluster manager. If you want to run Spark on YARN, you should specify the YARN master - just --master yarn.
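Concretely, that means making sure the standalone master is not the one being used, for example (a sketch based on the conf/spark-defaults.conf shown above - comment out the standalone master so that --master yarn on the command line is the only master in play):
# spark.master spark://127.0.0.1:7077
spark-submit --master yarn --deploy-mode cluster test.py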
Edit:
Please add the logs and your spark-submit command. Please also post how you are launching YARN. If the first attempt was wrong, then it means you have a configuration problem.
Edit number 2: It seems that YARN doesn't have enough resources to process your application. Please check your config, e.g. check whether increasing yarn.nodemanager.resource.memory-mb will help. You can also go to the Spark Web UI - http://application-master-ip:4040 - and see information from the Spark context.
Also, you can check whether you can deploy the application to Spark Standalone (which you are also starting), just by setting --master spark://... as in your configuration. Then you will know whether the problem is with YARN or with Spark.
BTW, you can skip running Spark Standalone if you're submitting to YARN :) and the memory used by the standalone workers can then be used by YARN.
:) Thanks everyone, I'm so sorry to waste your time. When I checked the resources at http://localhost:8088/ I noticed this:
I just stopped the server, deleted the tmp directory and the logs directory, and then it worked.
Thank you again
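For anyone hitting the same thing, a rough sketch of what that amounted to (this assumes the default hadoop.tmp.dir under /tmp; after deleting it the namenode has to be re-formatted):
stop-yarn.sh && stop-dfs.sh
rm -rf /tmp/hadoop-*
rm -rf $HADOOP_HOME/logs/*
hdfs namenode -format
start-dfs.sh && start-yarn.sh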

Unable to import data using Sqoop: exitCode=255

I am a noob in Hadoop/Spark. I have set up a Hadoop/Spark cluster (1 namenode, 2 datanodes). Now I am trying to import data from a DB (MySQL) into HDFS using Sqoop, but it always fails:
16/07/27 16:50:04 INFO mapreduce.Job: Running job: job_1469629483256_0004
16/07/27 16:50:11 INFO mapreduce.Job: Job job_1469629483256_0004 running in uber mode : false
16/07/27 16:50:11 INFO mapreduce.Job: map 0% reduce 0%
16/07/27 16:50:13 INFO ipc.Client: Retrying connect to server: datanode1_hostname/172.31.58.123:59676. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=3, sleepTime=1000 MILLISECONDS)
16/07/27 16:50:14 INFO ipc.Client: Retrying connect to server: datanode1_hostname/172.31.58.123:59676. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=3, sleepTime=1000 MILLISECONDS)
16/07/27 16:50:15 INFO ipc.Client: Retrying connect to server: datanode1_hostname/172.31.58.123:59676. Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=3, sleepTime=1000 MILLISECONDS)
16/07/27 16:50:18 INFO mapreduce.Job: Job job_1469629483256_0004 failed with state FAILED due to: Application application_1469629483256_0004 failed 2 times due to AM Container for appattempt_1469629483256_0004_000002 exited with exitCode: 255
For more detailed output, check application tracking page:http://ip-172-31-55-182.ec2.internal:8088/cluster/app/application_1469629483256_0004Then, click on links to logs of each attempt.
Diagnostics: Exception from container-launch.
Container id: container_1469629483256_0004_02_000001
Exit code: 255
Stack trace: ExitCodeException exitCode=255:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:545)
at org.apache.hadoop.util.Shell.run(Shell.java:456)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:722)
at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:212)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Container exited with a non-zero exit code 255
Failing this attempt. Failing the application.
16/07/27 16:50:18 INFO mapreduce.Job: Counters: 0
16/07/27 16:50:18 WARN mapreduce.Counters: Group FileSystemCounters is deprecated. Use org.apache.hadoop.mapreduce.FileSystemCounter instead
16/07/27 16:50:18 INFO mapreduce.ImportJobBase: Transferred 0 bytes in 16.2369 seconds (0 bytes/sec)
16/07/27 16:50:18 WARN mapreduce.Counters: Group org.apache.hadoop.mapred.Task$Counter is deprecated. Use org.apache.hadoop.mapreduce.TaskCounter instead
16/07/27 16:50:18 INFO mapreduce.ImportJobBase: Retrieved 0 records.
16/07/27 16:50:18 ERROR tool.ImportTool: Error during import: Import job failed!
I am able to manually write to HDFS:
hdfs dfs -put <local file path> <hdfs path>
But when I run the Sqoop import command:
sqoop import --connect jdbc:mysql://<host>/<db_name> --username <USERNAME> --password <PASSWORD> --table <TABLE_NAME> --enclosed-by '\"' --fields-terminated-by , --escaped-by \\ -m 1 --target-dir <hdfs location>
Can anyone please tell me what I am doing wrong?
Here is the list of things that I have already tried:
Shutting down cluster, formatting HDFS, then restarting cluster (didn't help)
Made sure that HDFS is not in SAFE MODE
All the nodes have this in their /etc/hosts:
127.0.0.1 localhost
172.31.55.182 namenode_hostname
172.31.58.123 datanode1_hostname
172.31.58.122 datanode2_hostname
# The following lines are desirable for IPv6 capable hosts
::1 ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
ff02::3 ip6-allhosts
Configuration Files:
All Nodes: $HADOOP_CONF_DIR/core-site.xml:
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://ip-172-31-55-182.ec2.internal:9000</value>
</property>
</configuration>
All Nodes: $HADOOP_CONF_DIR/yarn-site.xml:
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>ip-172-31-55-182.ec2.internal</value>
</property>
</configuration>
All Nodes: $HADOOP_CONF_DIR/mapred-site.xml:
<configuration>
<property>
<name>mapreduce.jobtracker.address</name>
<value>ip-172-31-55-182.ec2.internal:54311</value>
</property>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
NameNode Specific Configurations
$HADOOP_CONF_DIR/hdfs-site.xml:
<configuration>
<property>
<name>dfs.replication</name>
<value>2</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:///mnt/hadoop_data/hdfs/namenode</value>
</property>
<property>
<name>dfs.datanode.address</name>
<value>0.0.0.0:50010</value>
</property>
<property>
<name>dfs.datanode.http.address</name>
<value>0.0.0.0:50075</value>
</property>
<property>
<name>dfs.datanode.https.address</name>
<value>0.0.0.0:50475</value>
</property>
<property>
<name>dfs.datanode.ipc.address</name>
<value>0.0.0.0:50020</value>
</property>
</configuration>
$HADOOP_CONF_DIR/masters:
ip-172-31-55-182.ec2.internal
$HADOOP_CONF_DIR/slaves:
ip-172-31-58-123.ec2.internal
ip-172-31-58-122.ec2.internal
DataNode Specific Configurations
$HADOOP_CONF_DIR/hdfs-site.xml:
<configuration>
<property>
<name>dfs.replication</name>
<value>2</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:///mnt/hadoop_data/hdfs/datanode</value>
</property>
<property>
<name>dfs.datanode.address</name>
<value>0.0.0.0:50010</value>
</property>
<property>
<name>dfs.datanode.http.address</name>
<value>0.0.0.0:50075</value>
</property>
<property>
<name>dfs.datanode.https.address</name>
<value>0.0.0.0:50475</value>
</property>
<property>
<name>dfs.datanode.ipc.address</name>
<value>0.0.0.0:50020</value>
</property>
</configuration>
From where are you trying to import the data? I mean, from which machine are you trying to connect? Check the masters and slaves files on both the namenode and the datanodes.
Try to ping the IP addresses from the different servers and check that they show as up.
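For example, from the client machine (addresses taken from the /etc/hosts above):
ping 172.31.58.123
ping 172.31.58.122
hdfs dfsadmin -report    # should list both datanodes as live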
Make these changes, restart your cluster, and try again.
Edit the parts marked with a comment (#) below, then remove the comments:
/etc/hosts file on client node:
127.0.0.1 localhost yourcomputername #get computername by "hostname -f" command and replace here
172.31.55.182 namenode_hostname ip-172-31-55-182.ec2.internal
172.31.58.123 datanode1_hostname ip-172-31-58-123.ec2.internal
172.31.58.122 datanode2_hostname ip-172-31-58-122.ec2.internal
/etc/hosts file on cluster nodes:
198.22.23.212 youcomputername #change to public ip of client node, change computername same as client node
172.31.55.182 namenode_hostname ip-172-31-55-182.ec2.internal
172.31.58.123 datanode1_hostname ip-172-31-58-123.ec2.internal
172.31.58.122 datanode2_hostname ip-172-31-58-122.ec2.internal
I am terminating this cluster and starting from scratch.
