Mesos slave node unable to restart

I've set up a Mesos cluster using the CloudFormation templates from Mesosphere. Things worked fine after the cluster launched.
I recently noticed that none of the slave nodes are listed in the Mesos dashboard. The EC2 console shows the slaves are running and passing health checks. I restarted the nodes in the cluster, but that didn't help.
I SSH'ed into one of the slaves and noticed the mesos-slave service is not running. Running sudo systemctl status dcos-mesos-slave.service confirmed it, and I could not get the service to start.
I looked in /var/log/mesos/, ran tail -f mesos-slave.xxx.invalid-user.log.ERROR.20151127-051324.31267, and saw the following:
F1127 05:13:24.242182 31270 slave.cpp:4079] CHECK_SOME(state::checkpoint(path, bootId.get())): Failed to create temporary file: No space left on device
But the output of df -h and free shows there is plenty of disk space and memory left.
Which leads me to wonder: why is it complaining about no disk space?

OK, I figured it out.
When Mesos runs for a long time or under frequent load, the /tmp folder runs out of space, since Mesos uses /tmp/mesos/ as its work_dir. The catch is that a filesystem can only hold a certain number of file references (inodes), so "No space left on device" can mean you are out of inodes even when df -h shows free blocks. In my case, the slaves were collecting a large number of file chunks from image pulls in /var/lib/docker/tmp.
To resolve this issue:
1) Remove files under /tmp
2) Set a different work_dir location (see the sketch below)
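For example, a minimal sketch of both steps, assuming a systemd-managed DC/OS agent; the /etc/default/mesos-slave path and /var/lib/mesos work_dir are assumptions and vary by install, so adjust them to your setup:
# "No space left on device" can mean inode exhaustion even when df -h shows free blocks
df -i
# Step 1: clear the agent's default work_dir under /tmp (verify the path before deleting anything)
sudo rm -rf /tmp/mesos/*
# Step 2: point the agent at a roomier work_dir; Mesos reads MESOS_WORK_DIR as the --work_dir flag
echo 'MESOS_WORK_DIR=/var/lib/mesos' | sudo tee -a /etc/default/mesos-slave
sudo systemctl restart dcos-mesos-slave.service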

It is good practice to run
docker rmi -f $(docker images | grep "<none>" | awk '{print $3}')
This frees space by deleting untagged Docker images.
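The same cleanup can also be done with Docker's built-in dangling-image filter, which avoids parsing the table output:
docker rmi -f $(docker images -q -f dangling=true)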

Related

Find unreachable/deactivated Mesos slave agents

I deployed a DC/OS cluster on AWS EC2 instances with a couple of mesos-slave agents. A few of them were unexpectedly terminated, and the Mesos master marked them "unreachable". I would like to change their status from "Unreachable" to "Gone". To do that, DC/OS provides the following command:
dcos node decommission <mesos-id>
However, I am unable to find the mesos-id of the unreachable agents. Neither the mesos-master nor the DC/OS GUI/logs show any information for these nodes.
My question is: how do I get a list of all the unreachable (or deactivated) mesos-slave agents?
Thanks in anticipation.
To get a history of agents marked as unreachable, use this command:
grep unreachable /var/log/mesos/*.INFO.*
or
gawk 'match($0, /.*Marking agent (.*) \(.*\) unreachable.*/, a) {print a[1]}' /var/log/mesos/*.INFO.*|sort|uniq
But if you only want to reset the metrics reported in the web UI, you need to restart the mesos-master service (take a look at https://mesos.apache.org/documentation/latest/monitoring/).
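Putting the two together, a rough sketch (untested) that feeds the extracted agent IDs into the dcos node decommission command from the question, assuming the dcos CLI is configured on the machine holding the master logs:
for id in $(gawk 'match($0, /.*Marking agent (.*) \(.*\) unreachable.*/, a) {print a[1]}' /var/log/mesos/*.INFO.* | sort | uniq); do
  dcos node decommission "$id"
done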

Hadoop HA Namenode goes down with the Error: flush failed for required journal (JournalAndStream(mgr=QJM to [< ip >:8485, < ip >:8485, < ip >:8485]))

The Hadoop Namenode goes down almost once every day with the following fatal error:
FATAL namenode.FSEditLog (JournalSet.java:mapJournalsAndReportErrors(398)) -
Error: flush failed for required journal (JournalAndStream(mgr=QJM to [< ip >:8485, < ip >:8485, < ip >:8485], stream=QuorumOutputStream starting at txid <>))
java.io.IOException: Timed out waiting 20000ms for a quorum of nodes to respond.
at org.apache.hadoop.hdfs.qjournal.client.AsyncLoggerSet.waitForWriteQuorum(AsyncLoggerSet.java:137)
at org.apache.hadoop.hdfs.qjournal.client.QuorumOutputStream.flushAndSync(QuorumOutputStream.java:107)
at org.apache.hadoop.hdfs.server.namenode.EditLogOutputStream.flush(EditLogOutputStream.java:113)
at
Can someone suggest what I need to look into to resolve this issue?
I am using VMs for the journal nodes and master nodes. Does that cause any issue?
From the error you pasted, it appears your journal nodes could not respond to the NN in a timely manner. What was going on at the time of this event?
Since you mention that your nodes are VMs, I would guess you overloaded the hypervisor, or it had trouble with the traffic between the NN, the JNs and the ZK quorum.
In my case, this issue was caused by a difference in system time between the nodes of the cluster.
To keep the system time in sync, execute the commands below on each node.
sudo service ntpd stop
sudo ntpdate pool.ntp.org # Run this command multiple times
sudo service ntpd start
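If you want to see how far a node has drifted before forcing a sync, ntpdate has a query-only mode that does not change the clock:
ntpdate -q pool.ntp.org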
If Hue is down, run the command below on the Hue server machine:
sudo service hue start
If namenode is down, start the namenode.
Recurring fix
Add a crontab entry for the root user on all the nodes of the environment (sketched below).
or
Install VM tools to keep the system time in sync.
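A minimal sketch of such a crontab entry, assuming ntpdate is installed at /usr/sbin/ntpdate and an hourly re-sync is acceptable (adjust both to your environment):
sudo crontab -e
# then add a line like:
0 * * * * /usr/sbin/ntpdate pool.ntp.org > /dev/null 2>&1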

Redis on Windows cluster setup

I have downloaded MSOpenTech Redis version 3.x, which includes the long-awaited clustering feature. My Redis database is working and I can start my cluster on the minimum 3 nodes required (in cluster mode). Does anyone know how to configure the cluster (it seems no one knows)?
Installing Linux and running the native Linux version is not an option for me sadly.
Any help would be greatly appreciated.
You can follow the Redis Cluster Tutorial, and to create the cluster you can use the redis-trib.rb Ruby script, for which you need to install Ruby for Windows.
For example:
> C:\Ruby22\Bin\ruby.exe redis-trib.rb create --replicas 1 192.168.1.1:7000 192.168.1.1:7001 192.168.1.1:7002 192.168.1.1:7003 192.168.1.1:7004 192.168.1.1:7005
I did not have the option to install Ruby on Windows, but found that the manual steps worked for me. The Ruby script does a lot of checking that things are set up correctly and is the preferred setup route. So beware, here be dragons.
Set each node to run in cluster mode. Edit the redis.windows-service.conf file and uncomment:
cluster-enabled yes
cluster-config-file nodes-6379.conf
cluster-node-timeout 15000
Then restart the service.
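If Redis was installed as a Windows service, it can be restarted from an elevated PowerShell prompt; the service name Redis below is an assumption (the MSI default), so check Get-Service if yours differs:
Restart-Service -Name Redis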
Open a PowerShell window, change to the Redis install folder, and start redis-cli. e.g.
cd "C:\Program Files\Redis"
.\redis-cli.exe
Now you can join the other nodes. Run CLUSTER MEET IPADDRESS PORT for each node other than the instance you happen to be on. e.g.
CLUSTER MEET 10.10.0.2 6379
After a few seconds, running
CLUSTER NODES
should list all the nodes connected, but all will be set as MASTER.
On each of the other nodes, run CLUSTER REPLICATE MASTERNODEID, where MASTERNODEID is the hash-looking value next to the node declared "myself" on your master when running CLUSTER NODES. e.g.
CLUSTER REPLICATE b7c767ab3ab7c4a926ac2fed937cf140b96764a7
Now allocate slots to each master. My setup has three instances, with only one master.
for ($slot = 0; $slot -le 16383; $slot++) {
    .\redis-cli.exe -h REDMST CLUSTER ADDSLOTS $slot
}
Reconnect with redis-cli and try saving some data. e.g.
SET foo bar
OK
GET foo
"bar"
Phew! I got most of this from reading https://www.javacodegeeks.com/2015/09/redis-clustering.html#InstallingRedis, which is not Windows-specific.
For the Windows version: open a command window and type the command below.
C:\ProgramFiles\redis>FOR /L %i IN (0,1,16383) DO ( redis-cli.exe -p 6380 CLUSTER ADDSLOTS %i )
6380 is the port of the master node.
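Either way, a quick sanity check is to ask any node for the cluster state:
redis-cli.exe -p 6380 CLUSTER INFO
Look for cluster_state:ok and cluster_slots_assigned:16384 in the output.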

Cloudera installation dfs.datanode.max.locked.memory issue on LXC

I have created a VirtualBox Ubuntu 14.04 LTS environment on my Mac machine.
Inside the Ubuntu VM, I've created a cluster of three LXC containers: one for the master and the other two for slaves.
On the master, I started the installation of CDH5 using the following link: http://archive.cloudera.com/cm5/installer/latest/cloudera-manager-installer.bin
I have also made the necessary changes in /etc/hosts, including FQDNs and hostnames, and created a passwordless user named "ubuntu".
While setting up CDH5, I keep hitting the following error on the datanodes: the max locked memory size (dfs.datanode.max.locked.memory) of 922746880 bytes is more than the datanode's available RLIMIT_MEMLOCK ulimit of 65536 bytes.
Exception in secureMain: java.lang.RuntimeException: Cannot start datanode because the configured max locked memory size (dfs.datanode.max.locked.memory) of 922746880 bytes is more than the datanode's available RLIMIT_MEMLOCK ulimit of 65536 bytes.
at org.apache.hadoop.hdfs.server.datanode.DataNode.startDataNode(DataNode.java:1050)
at org.apache.hadoop.hdfs.server.datanode.DataNode.<init>(DataNode.java:411)
at org.apache.hadoop.hdfs.server.datanode.DataNode.makeInstance(DataNode.java:2297)
at org.apache.hadoop.hdfs.server.datanode.DataNode.instantiateDataNode(DataNode.java:2184)
at org.apache.hadoop.hdfs.server.datanode.DataNode.createDataNode(DataNode.java:2231)
at org.apache.hadoop.hdfs.server.datanode.DataNode.secureMain(DataNode.java:2407)
at org.apache.hadoop.hdfs.server.datanode.DataNode.main(DataNode.java:2431)
Krunal,
This solution will probably be too late for you, but maybe it can help somebody else, so here it is. Make sure your ulimit is set correctly. But in case it's a config issue:
Go to:
/run/cloudera-scm-agent/process/
and find the latest config dir, in this case:
1016-hdfs-DATANODE
Search for the parameter in this dir:
grep -rnw . -e "dfs.datanode.max.locked.memory"
./hdfs-site.xml:163: <name>dfs.datanode.max.locked.memory</name>
and edit the value to the one it is expecting, in your case (65536).
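As a quick check (run it as the user the datanode runs as; the limit a supervised process actually gets can differ from your shell's), compare the memlock limit with the configured value:
ulimit -l                                                   # max locked memory for the current shell, in KB
grep -A1 "dfs.datanode.max.locked.memory" ./hdfs-site.xml   # configured value, in bytes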
I solved it by opening a separate tab in Cloudera and setting the value from there.

Installing Hadoop over 5 hard drives on a desktop

I have been working on installing Hadoop. I followed the instructions from a Udemy course and installed Hadoop in pseudo-distributed mode on my laptop. It was fairly straightforward.
After that, I started to wonder if I could set up Hadoop on a desktop computer. So I went out and bought an empty case and put in a 64-bit, 8-core AMD processor, along with a 50GB SSD and 4 inexpensive 500GB hard drives. I installed Ubuntu 14.04 on the SSD and put virtual machines on the other drives.
I'm envisioning using my SSD as the master and using my 4 hard drives as nodes. Again, everything is living in the same case.
Unfortunately, I've been searching everywhere and can't find any tutorials, guides, books, etc. that describe setting up Hadoop in this manner. It seems like almost everything I've found that details installing Hadoop is either a simple pseudo-distributed setup (which I've already done), or the instructions jump straight to large-scale commercial deployments. I'm still learning the basics, clearly, but I'd like to play in this sort of in-between place.
Has anyone done this before, and/or come across any documentation / tutorials / etc that describe how to set Hadoop up in this way? Many thanks in advance for the help.
You can run Hadoop in different VMs located on different drives in the same system.
But you need to use the same configuration on all the master and slave nodes.
Also ensure that all the VMs have different IP addresses.
You can get different IP addresses by connecting your machine to the LAN, or you may need to change some settings in the VMs in order to get different IP addresses.
If you have done the Hadoop installation in pseudo-distributed mode, then follow the steps below; this may help you.
MULTINODE:
Configure the hosts in the network using the following settings in the hosts file. This has to be done on all machines [on the namenode too].
sudo vi /etc/hosts
add the following lines in the file:
yourip1 master
yourip2 slave01
yourip3 slave02
yourip4 slave03
yourip5 slave04
[Save and exit – type ESC then :wq ]
Change the hostname for the namenode and datanodes.
sudo vi /etc/hostname
For the master machine [namenode]: master
For the other machines: slave01, slave02, slave03 and slave04
Restart the machines in order to get the settings related to the network applied.
sudo shutdown -r now
Copy the SSH keys from the master node to all datanodes; this lets you access the machines without being asked for a password every time.
ssh-copy-id -i ~/.ssh/id_rsa.pub hduser@slave01
ssh-copy-id -i ~/.ssh/id_rsa.pub hduser@slave02
ssh-copy-id -i ~/.ssh/id_rsa.pub hduser@slave03
ssh-copy-id -i ~/.ssh/id_rsa.pub hduser@slave04
Now we are going to configure the Hadoop settings, so navigate to the configuration folder.
cd ~/hadoop/etc
Edit the slaves file within the Hadoop configuration directory.
vi ~/hadoop/conf/slaves
And add the below:
master
slave01
slave02
slave03
slave04
Now update localhost to master in core-site.xml, hdfs-site.xml, mapred-site.xml and yarn-site.xml.
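For example, in core-site.xml the change looks roughly like this; the 9000 port is only the common default from pseudo-distributed tutorials, so keep whatever port your existing config uses:
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://master:9000</value>
</property>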
Now copy the files in the hadoop/etc/hadoop folder from the master to the slave machines.
Then format the namenode (this only needs to be done on the master).
And start the Hadoop services.
I have given you some clues for how to configure a Hadoop multinode cluster.
Never tried it, but if you type ifconfig it may give you the same IP address on all the VMs on the hard drives, so this may not be the best option to go with.
You can try creating a Hadoop cluster on Amazon EC2 for free using this step-by-step guide HERE.
Or the video guide HERE.
Hope it helps!
