Spark History Server does not start on Ambari cluster - Hadoop

We start the Spark History Server as follows:
/usr/hdp/2.6.0.3-8/spark2/sbin/start-history-server.sh
from the log
spark-root-org.apache.spark.deploy.history.HistoryServer-1-master01
we get
WARN AbstractLifeCycle: FAILED ServerConnector#14a54ef6{HTTP/1.1}
{0.0.0.0:18081}: java.net.BindException: Address already in use
java.net.BindException: Address already in use
Please advise what the solution is in order to start the Spark History Server.

You need to kill the process holding the port open (for example, a zombie History Server), or change the port in Ambari to something else.
A combination of netstat -an, ps -ef, and lsof will help you find which process holds the port.
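For example, a minimal sketch of freeing port 18081 (the port from the log above); spark.history.ui.port is the standard Spark property for the History Server UI port and should be adjustable under the Spark2 configs in Ambari:
# find which process holds the History Server port
sudo lsof -i :18081
sudo netstat -anp | grep 18081
# if it is a stale History Server, kill it and start again
kill <pid-from-lsof>
/usr/hdp/2.6.0.3-8/spark2/sbin/start-history-server.sh
# alternatively, change spark.history.ui.port in Ambari to a free port and restart the service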

Related

Hadoop Data node IP isn't a real VM

I'm currently running a Hadoop setup with a NameNode (master-node - 10.0.1.86) and a DataNode (node1 - 10.0.1.85) using two CentOS VMs.
When I run a hive query that starts a mapReduce job, I get the following error:
"Application application_1515705541639_0001 failed 2 times due to
Error launching appattempt_1515705541639_0001_000002. Got exception:
java.net.NoRouteToHostException: No Route to Host from
localhost.localdomain/127.0.0.1 to 10.0.2.62:48955 failed on socket
timeout exception: java.net.NoRouteToHostException: No route to host;
For more details see: http://wiki.apache.org/hadoop/NoRouteToHost"
Where on earth is this IP of 10.0.2.62 coming from? This IP does not exist on my network; you cannot reach it through ping or telnet.
I have gone through all my config files on both master-node and node1 and I cannot find where it is picking up this IP. I've stopped/started both HDFS and YARN and rebooted both VMs. Both /etc/hosts files are how they should be. Any general direction on where to look next would be appreciated, I am stumped!
I didn't have any luck discovering where this rogue IP was coming from. I ended up assigning the VM the IP address that the master-node was looking for, and sure enough everything works fine.
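For anyone hitting a similar rogue-address problem, a minimal troubleshooting sketch; the paths assume a typical Hadoop install with its configuration under /etc/hadoop, and 10.0.2.62 is the address from the error above:
# which addresses did the NodeManagers register with the ResourceManager?
yarn node -list
# which addresses did the DataNodes register with the NameNode?
hdfs dfsadmin -report
# search the configuration and hosts file for the unexpected address
grep -r "10.0.2.62" /etc/hadoop /etc/hosts
# on each VM, confirm what the hostname resolves to
hostname -f && hostname -i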

Hadoop HA Namenode goes down with the Error: flush failed for required journal (JournalAndStream(mgr=QJM to [< ip >:8485, < ip >:8485, < ip >:8485]))

The Hadoop NameNode goes down almost once every day.
FATAL namenode.FSEditLog (JournalSet.java:mapJournalsAndReportErrors(398)) -
**Error: flush failed for required journal** (JournalAndStream(mgr=QJM to [< ip >:8485, < ip >:8485, < ip >:8485], stream=QuorumOutputStream starting at txid <>))
java.io.IOException: Timed out waiting 20000ms for a quorum of nodes to respond.
at org.apache.hadoop.hdfs.qjournal.client.AsyncLoggerSet.waitForWriteQuorum(AsyncLoggerSet.java:137)
at org.apache.hadoop.hdfs.qjournal.client.QuorumOutputStream.flushAndSync(QuorumOutputStream.java:107)
at org.apache.hadoop.hdfs.server.namenode.EditLogOutputStream.flush(EditLogOutputStream.java:113)
at
Can someone suggest what I need to look into to resolve this issue?
I am using VMs for the journal nodes and master nodes. Does that cause any issues?
From the error you pasted, it appears the NameNode could not get a timely response from a quorum of JournalNodes. What was going on at the time of this event?
Since you mention that your nodes are VMs, I would guess you overloaded the hypervisor, or it had trouble with the traffic from the NN to the JN and ZooKeeper quorum.
In my case, this issue was caused by a difference in the system time between the nodes of the cluster.
To keep the system time in sync, we can execute the commands below on each node.
sudo service ntpd stop
sudo ntpdate pool.ntp.org # Run this command multiple times
sudo service ntpd start
If Hue is down, run the command below on the Hue server machine:
sudo service hue start
If namenode is down, start the namenode.
Recurring fix
Add a crontab entry for the root user on all the nodes of the environment that re-runs the time sync (see the sketch below),
or
install VM tools to keep the system time in sync.
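A minimal sketch of such a crontab entry, assuming ntpdate is installed at /usr/sbin/ntpdate and reusing pool.ntp.org from the commands above; the hourly schedule is an arbitrary choice:
sudo crontab -e
# then add a line like this; -u uses an unprivileged source port so the sync can run while ntpd is active
0 * * * * /usr/sbin/ntpdate -u pool.ntp.org >/dev/null 2>&1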

Need to Install Mesos to get Mesos Slave?

I'm trying to get this question solved.
To get a Mesos slave, do we have to install Mesos and then start the slave setup, or is there something else?
I also have a problem with the Mesos master: I ran the command
./bin/mesos-master.sh --ip=*** --work_dir=/var/lib/mesos
but it did not continue to run, so I stopped it. When I ran the same command again, I got the error shown:
Failed to initialize, bind: Address already in use [98]
Which part did I do wrong?
You have to run mesos-master first, and then you can connect a mesos-slave running on a different node to the master. You can refer to the Getting Started guide of Mesos. Only one slave can connect to the master on the same port. If you get "bind: Address already in use", you can try running the slave on another port by passing the --port=value parameter, replacing value with a port number.
to start mesos master on localhost:
./bin/mesos-master.sh --ip=127.0.0.1 --work_dir=/var/lib/mesos
to start and connect slave to master
./bin/mesos-slave.sh --master=127.0.0.1:5050
to start and connect another slave to the same master, you have to use another port, since the default port 5051 is already used by the first connected slave. Use the --port=value argument to start the slave on another port:
./bin/mesos-slave.sh --master=127.0.0.1:5050 --port=5053
You may get a permission denied error. If so, use sudo to access the given port:
sudo ./bin/mesos-slave.sh --master=127.0.0.1:5050 --port=5053
You can run one more slave, but you have to specify an IP and a different work_dir using
./mesos-slave.sh --master=<ipaddr>:<port> --ip=<ip of slave> --work_dir=<work_dir other than that of a running slave> --port=<another_port>
Edit your /etc/hosts and add more local IPs with the following entries:
127.0.0.2 slave2
127.0.0.3 slave3
Then you can replace --ip=<ip of slave> with --ip=slave2 or --ip=slave3.
You may have to replace <another_port> with a port like 5052 or 5053, or any other available port, if you already have a running slave, since that slave will be using the default port.
Running only a mesos-slave on a host is simple: install the mesos package and run only the mesos-slave process with the correct flags. It is not a problem if the master is also installed, but be careful to run only the correct number of masters for the quorum.
Something is already running on the port where you are trying to start the mesos-master (which has a web interface).
Check what program runs on the Mesos default port, or use another port. More information about the command-line options is available here: Mesos configuration
To see what's using port 5050 or 5051 use either one of these commands:
sudo fuser -v 5050/tcp
sudo lsof -i | grep 5050
Both commands will give you the PID of the process that holds the port. Either kill it or specify a new port for Mesos by starting it with the correct port option:
./bin/mesos-master.sh --ip=*** --work_dir=/var/lib/mesos --port=FREE_PORT
Where do you specify the ZooKeepers for the mesos master and slaves? The following flags are required to start mesos-master (see the link I gave you):
--advertise_ip, --advertise_port, --quorum, --work_dir, --zk
What is your current full configuration for the mesos master? You can find the related files at /etc/mesos/, /etc/mesos-master/, /etc/mesos-slave/, /etc/defaults/mesos, /etc/defaults/mesos-master, /etc/defaults/mesos-slave. If you copy-paste the lines from them and the Mesos log here, we might be able to give you more help.
Also, please explain the cluster you would like to set up (number of hosts, masters, slaves) and we can help there as well.
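For reference, a minimal sketch of a master/slave pair registered through ZooKeeper; the ZooKeeper host zk1 and the IPs 10.0.0.1 / 10.0.0.2 are placeholders, and --quorum=1 assumes a single master:
# on the master host (placeholder IP 10.0.0.1)
./bin/mesos-master.sh --zk=zk://zk1:2181/mesos --quorum=1 --ip=10.0.0.1 --work_dir=/var/lib/mesos
# on each slave host (placeholder IP 10.0.0.2)
./bin/mesos-slave.sh --master=zk://zk1:2181/mesos --ip=10.0.0.2 --work_dir=/var/lib/mesos-slave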
Execute the command below:
sudo netstat -peanut
Then check which process is using port 5050 or 5051.
Kill those processes using their PIDs.
Start the mesos master and slave again.
This happened to me when I killed the mesos-slave accidentally and then restarted it, but it failed with an address-bind issue.

Unable to start a node manager on master

I am setting up a Hadoop YARN cluster and I am using one machine as both a master and a slave. When I start YARN using the following command, it starts the nodemanager on the slaves but not on the master node.
sbin/yarn-daemons.sh start nodemanager
I have a master which is also a slave, and then another two slaves within the cluster; the nodemanagers on those slaves start properly.
The error I get:
org.apache.hadoop.yarn.exceptions.YarnRuntimeException: java.net.BindException: Problem binding to [0.0.0.0:8040] java.net.BindException: Address already in use; For more details see: http://wiki.apache.org/hadoop/BindException
Output of some of the commands:
cat /etc/services | grep 8040
ampify 8040/tcp # Ampify Messaging Protocol
ampify 8040/udp # Ampify Messaging Protocol
lsof -i tcp:8040
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
java 28021 df 195u IPv6 3580602 0t0 TCP server1.mydomain.com:ampify (LISTEN)
Under the default configuration that Hadoop ships, port 8040 is the port that the NodeManager uses for the localizer. This is basically a server endpoint responsible for bringing the files required to run a container onto the local node. (For example, this can be a MapReduce job's jar file or distributed cache files.)
Assuming that there is another server on the machine (here shown as Ampify) legitimately bound to port 8040, and you don't want to stop that service, it is possible to reconfigure the port used by the NodeManager for the localizer. Set the yarn.nodemanager.localizer.address property in your yarn-site.xml file. This is documented here:
http://hadoop.apache.org/docs/r2.4.1/hadoop-yarn/hadoop-yarn-common/yarn-default.xml
Pulling that from the XML source in the Hadoop tree, here is the documentation for the property:
<property>
  <description>Address where the localizer IPC is.</description>
  <name>yarn.nodemanager.localizer.address</name>
  <value>${yarn.nodemanager.hostname}:8040</value>
</property>
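For example, to move the localizer to another free port (8041 is just an arbitrary choice here), something like the following in yarn-site.xml, followed by a NodeManager restart, should do it:
<property>
  <name>yarn.nodemanager.localizer.address</name>
  <!-- 8041 is an arbitrary free port; pick anything not shown by lsof or listed in /etc/services -->
  <value>${yarn.nodemanager.hostname}:8041</value>
</property>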
The above error means you are trying to start a process on port 8040, which is already occupied by another instance.
To get rid of this error, you need to kill the process which is currently listening on port 8040. Your lsof output says the PID is 28021. Kill the process using the following command and start again:
kill -9 28021
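Then restart the NodeManager on that host; with a standard Hadoop 2.x layout, the single-node variant of the start script is:
sbin/yarn-daemon.sh start nodemanager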

Job Tracker web interface

I followed the tutorial http://hadoop.apache.org/docs/r2.4.1/hadoop-project-dist/hadoop-common/SingleCluster.html and installed Hadoop 2.4.1 as a pseudo-distributed cluster. I created an Ubuntu VM using OracleVM and installed Hadoop as mentioned in the link. It was set up fine and I was able to run the examples. However, the job tracker URL is not working: :50030 gives "page not found". I also tried netstat on the server and there is no process listening on port 50030. Do I need to start any other service? What are the possible reasons?
You need to execute this:
$HADOOP_HOME/sbin/mr-jobhistory-daemon.sh start historyserver
Otherwise the job history server (the closest equivalent of the JobTracker UI in Hadoop 2.x) won't start.
(In my case, $HADOOP_HOME is in /usr/local/hadoop)
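After starting it, a quick way to confirm it is up, assuming the default history-server web port of 19888:
jps | grep JobHistoryServer      # the JobHistoryServer process should be listed
curl -I http://localhost:19888/  # default MapReduce job history web UI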
Check the value of mapred.job.tracker.http.address in mapred-site.xml
If the port is different, use that.
Also check if jobtracker is running. Check the jobtracker logs.
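A quick way to do that check, assuming mapred-site.xml lives under $HADOOP_HOME/etc/hadoop (adjust the path to your install):
grep -A 1 mapred.job.tracker.http.address $HADOOP_HOME/etc/hadoop/mapred-site.xml
jps    # lists the running Hadoop daemons; note that Hadoop 2.x has no JobTracker process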
You need to open the following URL in your browser:
http://localhost:50030/
That is the JobTracker web UI.
