logouts while running hadoop under ubuntu 16.04 - hadoop

I am having some trouble running Hadoop jobs, in both pseudo-distributed and cluster mode, under Ubuntu 16.04.
While running a vanilla Hadoop/HDFS installation, my hadoop user gets
logged out and all of the processes run by this user are closed.
I don't see anything in the logs (/var/log/systemd, journalctl or
dmesg) that explains why the user gets logged out.
It seems I am not the only one who has this or a similar issue:
https://stackoverflow.com/questions/38288162/in-ubuntu-16-04-running-hadoop-jar-laptop-gets-rebooted
Note: creating a dedicated hadoop user didn't actually solve the problem in my case, but it did limit the logouts to that user.
https://askubuntu.com/questions/784591/ubuntu-16-04-kills-session-when-resource-usage-is-extremely-high
Is it possible that some problem around the UserGroupInformation class
(which can, under some circumstances, trigger a logout), perhaps combined with changes in systemd in Ubuntu 16.04, causes this behavior?
The last lines of hadoop log that I get before the logout:
...
16/07/13 16:45:37 DEBUG ipc.ProtobufRpcEngine: Call: getJobReport took 4ms
16/07/13 16:45:37 DEBUG security.UserGroupInformation: PrivilegedAction
as:hduser (auth:SIMPLE)
from:org.apache.hadoop.mapreduce.Job.updateStatus(Job.java:320)
16/07/13 16:45:37 DEBUG ipc.Client: IPC Client (1360814716) connection to
laptop/127.0.1.1:37339 from hduser sending #375
16/07/13 16:45:37 DEBUG ipc.Client: IPC Client (1360814716) connection to
laptop/127.0.1.1:37339 from hduser got value #375
16/07/13 16:45:37 DEBUG ipc.ProtobufRpcEngine: Call: getJobReport took 2ms
Terminated
hduser@laptop:~$ 16/07/13 16:45:37 DEBUG ipc.Client: stopping client from
cache: org.apache.hadoop.ipc.Client@4e7ab839
exit
journalctl:
Jul 12 16:06:44 laptop systemd-logind[978]: Removed session 7.
Jul 12 16:06:44 laptop systemd-logind[978]: Removed session 6.
Jul 12 16:06:44 laptop systemd-logind[978]: Removed session 5.
Jul 12 16:06:44 laptop systemd-logind[978]: Removed session 8.
syslog:
Jul 12 16:06:43 laptop systemd[4172]: Stopped target Default.
Jul 12 16:06:43 laptop systemd[4172]: Reached target Shutdown.
Jul 12 16:06:44 laptop systemd[4172]: Starting Exit the Session...
Jul 12 16:06:44 laptop systemd[4172]: Stopped target Basic System.
Jul 12 16:06:44 laptop systemd[4172]: Stopped target Sockets.
Jul 12 16:06:44 laptop systemd[4172]: Stopped target Paths.
Jul 12 16:06:44 laptop systemd[4172]: Stopped target Timers.
Jul 12 16:06:44 laptop systemd[4172]: Received SIGRTMIN+24 from PID
10101 (kill).
Jul 12 16:06:44 laptop systemd[1]: Stopped User Manager for UID 1001.
Jul 12 16:06:44 laptop systemd[1]: Removed slice User Slice of hduser.

I also had this problem. It took me a while, but I found the solution here: https://unix.stackexchange.com/questions/293069/all-services-of-a-user-are-killed-when-running-multiple-services-under-this-user
Basically, some Hadoop processes simply exit, which is normal. But systemd seems to kill all of a user's processes when it sees one of that user's processes die.
The fix is to add
[Login]
KillUserProcesses=no
to /etc/systemd/logind.conf and reboot.
I tried several Ubuntu versions while debugging the problem, and the fix seems to work only on Ubuntu 16.04.
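One way to apply that fix from the shell (a sketch; it assumes the stock logind.conf, where the option is present but commented out under the [Login] section):
sudo sed -i 's/^#\?KillUserProcesses=.*/KillUserProcesses=no/' /etc/systemd/logind.conf
grep KillUserProcesses /etc/systemd/logind.conf   # verify the line now reads KillUserProcesses=no
sudo reboot                                       # logind picks the setting up after a reboot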

I had the same issue. I was using Apache Apex, which is Hadoop-native. Whenever I killed an Apex application, my system would log me out.
Solution: replace the kill binary (/bin/kill) of Ubuntu 16 with the kill binary from Ubuntu 14.
Everything now works as smoothly for me as it did before the upgrade.
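A rough sketch of that swap (the source path for the 14.04 binary is a placeholder; keep a backup of the original):
sudo cp /bin/kill /bin/kill.ubuntu16.bak          # back up the 16.04 binary
sudo cp /path/from/ubuntu-14.04/kill /bin/kill    # placeholder path to a kill taken from a 14.04 machine
sudo chown root:root /bin/kill
sudo chmod 755 /bin/kill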

I had the same problem too. Eventually I found that /bin/kill in Ubuntu 16.04 has a bug when killing a process group, and working around it solves the problem.
If pid is less than -1, then sig is sent to every process in the process group whose ID is -pid.
Because of the bug in procps-ng 3.3.10, killing a process group, which is what bin/yarn application -kill AppID does under the hood, ends up logging the user out.
The problem was solved after replacing /bin/kill with a new kill compiled from procps-ng 3.3.12.
tar xJf procps-ng-3.3.12.tar.xz
cd procps-ng-3.3.12
./configure
make                                              # build kill and libprocps before copying anything
sudo cp .libs/kill /bin/kill                      # the real (libtool-built) binary lives under .libs/
sudo chown root:root /bin/kill
sudo cp proc/.libs/libprocps.so.6.0.0 /lib/x86_64-linux-gnu/
sudo chown root:root /lib/x86_64-linux-gnu/libprocps.so.6.0.0

Related

HDFS fails to start with Hadoop 3.2 : bash v3.2+ is required

I'm building a small Hadoop cluster composed of 2 nodes: 1 master + 1 worker. I'm using the latest version of Hadoop (3.2) and everything is executed by the root user. During installation, I was able to run hdfs namenode -format. The next step is to start the HDFS daemons with start-dfs.sh.
$ start-dfs.sh
Starting namenodes on [master]
bash v3.2+ is required. Sorry.
Starting datanodes
bash v3.2+ is required. Sorry.
Starting secondary namenodes [master]
bash v3.2+ is required. Sorry.
Here's the generated logs in the journal:
$ journalctl --since "1 min ago"
-- Logs begin at Thu 2019-08-29 11:12:27 CEST, end at Thu 2019-08-29 11:46:40 CEST. --
Aug 29 11:46:40 master su[3329]: (to root) root on pts/0
Aug 29 11:46:40 master su[3329]: pam_unix(su-l:session): session opened for user root by root(uid=0)
Aug 29 11:46:40 master su[3329]: pam_unix(su-l:session): session closed for user root
Aug 29 11:46:40 master su[3334]: (to root) root on pts/0
Aug 29 11:46:40 master su[3334]: pam_unix(su-l:session): session opened for user root by root(uid=0)
Aug 29 11:46:40 master su[3334]: pam_unix(su-l:session): session closed for user root
Aug 29 11:46:40 master su[3389]: (to root) root on pts/0
Aug 29 11:46:40 master su[3389]: pam_unix(su-l:session): session opened for user root by root(uid=0)
Aug 29 11:46:40 master su[3389]: pam_unix(su-l:session): session closed for user root
As I'm using Zsh (with Oh-my-Zsh), I logged into a bash console to give it a try. Sadly, I get the same result. In fact, this error happens for all sbin/start-*.sh scripts. However, the hadoop and yarn commands work like a charm.
Since I didn't find much information on this error on the Internet, here I am. Would be glad to have any advice!
Other technical details
Operating system info:
$ lsb_release -d
Description: Debian GNU/Linux 10 (buster)
$ uname -srm
Linux 4.19.0-5-amd64 x86_64
Available Java versions (tried with both):
$ update-alternatives --config java
There are 2 choices for the alternative java (providing /usr/bin/java).
Selection Path Priority Status
------------------------------------------------------------
0 /usr/lib/jvm/java-11-openjdk-amd64/bin/java 1111 auto mode
* 1 /usr/lib/jvm/adoptopenjdk-8-hotspot-amd64/bin/java 1081 manual mode
2 /usr/lib/jvm/java-11-openjdk-amd64/bin/java 1111 manual mode
Some ENV variables you might be interested in:
$ env
USER=root
LOGNAME=root
HOME=/root
PATH=/root/bin:/usr/local/bin:/usr/local/hadoop/bin:/usr/local/hadoop/sbin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
SHELL=/usr/bin/zsh
TERM=rxvt-unicode
JAVA_HOME=/usr/lib/jvm/adoptopenjdk-8-hotspot-amd64
HADOOP_HOME=/usr/local/hadoop
HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop
ZSH=/root/.oh-my-zsh
Output of the Hadoop executable:
$ hadoop version
Hadoop 3.2.0
Source code repository https://github.com/apache/hadoop.git -r e97acb3bd8f3befd27418996fa5d4b50bf2e17bf
Compiled by sunilg on 2019-01-08T06:08Z
Compiled with protoc 2.5.0
From source with checksum d3f0795ed0d9dc378e2c785d3668f39
This command was run using /usr/local/hadoop/share/hadoop/common/hadoop-common-3.2.0.jar
My Zsh and Bash installation:
$ zsh --version
zsh 5.7.1 (x86_64-debian-linux-gnu)
$ bash --version
GNU bash, version 5.0.3(1)-release (x86_64-pc-linux-gnu)
# only available in a console using *bash*
$ echo ${BASH_VERSINFO[@]}
5 0 3 1 release x86_64-pc-linux-gnu
TL;DR: use a different user (e.g. hadoop) instead of root.
I found the solution, but not a deep understanding of what is going on. Sad as that may be, here's the solution I found:
Running with root user:
$ start-dfs.sh
Starting namenodes on [master]
bash v3.2+ is required. Sorry.
Starting datanodes
bash v3.2+ is required. Sorry.
Starting secondary namenodes [master_bis]
bash v3.2+ is required. Sorry
Then I created a hadoop user and gave this user privileges on the Hadoop installation (R/W access). After logging in with this new user, I get the following output for the command that was causing me trouble:
$ start-dfs.sh
Starting namenodes on [master]
Starting datanodes
Starting secondary namenodes [master_bis]
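For reference, creating that user and granting it access can be as simple as the following sketch (the user name hadoop and the /usr/local/hadoop path from the environment dump above are assumptions; adjust to your layout):
sudo adduser hadoop                               # dedicated non-root user for Hadoop
sudo chown -R hadoop:hadoop /usr/local/hadoop     # give it R/W access to the installation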
Moreover, I noticed that processes created by start-yarn.sh were not listed in the output of jps while using Java 11. Switching to Java 8 solved my problem (don't forget to update all $JAVA_HOME variables, both in /etc/environment and in hadoop-env.sh).
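For example, using the JDK path from the environment dump above (adjust to your own installation; the hadoop-env.sh location assumes the HADOOP_CONF_DIR shown earlier):
# /etc/environment
JAVA_HOME="/usr/lib/jvm/adoptopenjdk-8-hotspot-amd64"
# /usr/local/hadoop/etc/hadoop/hadoop-env.sh
export JAVA_HOME=/usr/lib/jvm/adoptopenjdk-8-hotspot-amd64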
Success \o/. However, I'd be glad to understand why the root user cannot do this. I know it's a bad habit to use root, but in an experimental environment there is no interest in keeping a clean, close-to-production setup. Any information about this will be kindly appreciated :).
Try
chsh -s /bin/bash
to change the default shell back to bash.

Zookeeper startup issues/confusion

Apart from the issue I am already having, I installed ZooKeeper BEFORE I installed HBase (HBase is still not installed), after I saw a video on it. While installing it, I faced numerous issues, which I've now overcome, but I am left with one challenging one, probably the last one I will have to deal with. So, the installation part has gone through well. I start ZooKeeper with the following command: sudo /home/hduser/zookeeper/bin/zkServer.sh start and (I am OK with it because) this is the result:
ZooKeeper JMX enabled by default
Using config: /home/hduser/zookeeper/bin/../conf/zoo.cfg
Starting zookeeper ... STARTED
YES! IT'S STARTED (after almost 50 minutes of digging on the internet). But nevertheless, when I run jps, this is what I get:
8499 SecondaryNameNode
8162 NameNode
8983 NodeManager
9370 Jps
8313 DataNode
8672 ResourceManager
Exactly!! No QuorumPeerMain! BUT wait.. When I sudo jps, I get this:
8499 -- process information unavailable
9243 QuorumPeerMain
8162 -- process information unavailable
8983 -- process information unavailable
9429 Jps
8313 -- process information unavailable
8672 -- process information unavailable
You see there? There's QuorumPeerMain (never mind that it says "process information unavailable" against the other, perfectly familiar processes), running as process 9243.
Can you tell me why that's happening?
Also, because of this discrepancy (or inconvenience), do you think the HBase installation will be an issue?
I don't think it should matter, but this is a Linux Mint machine (Sarah).
Thanks in advance!
The QuorumPeerMain process is only visible in the sudo jps output because you are running ZooKeeper with sudo /home/hduser/zookeeper/bin/zkServer.sh. If you run ZooKeeper without sudo, it will show up in the plain jps output.
Also, since you started ZooKeeper with sudo, the ZooKeeper directories now contain files owned by root; you have to change the owner of those directories back before you can run it as a normal user.
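For example (a sketch; hduser matches the home directory in the question, while the hadoop group is an assumption, so adjust it to whatever group your user actually belongs to):
sudo chown -R hduser:hadoop /home/hduser/zookeeper
/home/hduser/zookeeper/bin/zkServer.sh start      # started without sudo, so QuorumPeerMain now shows up in plain jps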
Once you make the above changes, the HBase installation should not cause any problems.

cassandra operations queue is full

I'm running datastax enterprise 4.5.1, with opscenter 5.1.1. These were installed from the standalone linux installers on Ubuntu 14.04 LTS.
$ cqlsh
Connected to Test Cluster at localhost:9160.
[cqlsh 4.1.1 | Cassandra 2.0.8.39 | CQL spec 3.1.1 | Thrift protocol 19.39.0]
In the datastax-agent log, I have been seeing tons of these WARN messages:
WARN [Thread-11] 2015-04-23 13:13:49,005 7647864 operations dropped so far.
WARN [Thread-11] 2015-04-23 13:13:49,005 Cassandra operation queue is full, discarding cassandra operation
and, similarly, these:
WARN [rollup-snapshot] 2015-04-30 16:20:40,432 Cassandra operation queue is full, discarding cassandra operation
WARN [rollup-snapshot] 2015-04-30 16:20:40,432 9 operations dropped so far.
Can someone give me an idea of what causes these? The node seems to be operating OK, with no obvious errors in system.log to correlate. In the datastax-agent-env.sh file, I've set JVM_OPTS="$JVM_OPTS -Xmx256M", but that doesn't eliminate the problem.
Try these changes in the agent configuration:
The following settings go in the agent's address.yaml file. The agent process will need to be restarted for these settings to take effect.
thrift_max_conns: 10
async_pool_size: 10
async_queue_size: 20000
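The restart itself can look like this (the datastax-agent service name is an assumption; with the standalone installer you may need to use the installer's own control script instead):
sudo service datastax-agent restart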
https://support.datastax.com/hc/en-us/articles/204225789-Ops-Center-is-not-showing-any-metrics-in-the-UI-dashboard

Restarting Amazon EMR cluster

I'm using Amazon EMR (Hadoop 2 / AMI version 3.3.1) and I would like to change the default configuration (for example, the replication factor). For the change to take effect I need to restart the cluster services, and that's where my problems start.
How do I do that? The script I found at ./.versions/2.4.0/sbin/stop-dfs.sh doesn't work, and the slaves file ./.versions/2.4.0/etc/hadoop/slaves is empty anyway. There are some scripts in init.d:
$ ls -l /etc/init.d/hadoop-*
-rwxr-xr-x 1 root root 477 Nov 8 02:19 /etc/init.d/hadoop-datanode
-rwxr-xr-x 1 root root 788 Nov 8 02:19 /etc/init.d/hadoop-httpfs
-rwxr-xr-x 1 root root 481 Nov 8 02:19 /etc/init.d/hadoop-jobtracker
-rwxr-xr-x 1 root root 477 Nov 8 02:19 /etc/init.d/hadoop-namenode
-rwxr-xr-x 1 root root 1632 Oct 27 21:12 /etc/init.d/hadoop-state-pusher-control
-rwxr-xr-x 1 root root 484 Nov 8 02:19 /etc/init.d/hadoop-tasktracker
but if, for example, I stop the namenode, something starts it again immediately. I looked for documentation; Amazon provides a 600-page user guide, but it's mostly about how to use the cluster and not much about maintenance.
EMR 3.x.x uses traditional SysV init scripts for managing services; ls /etc/init.d/ will show you the list of such services. You can restart a service like so:
sudo service hadoop-namenode restart
But if I for example stop the namenode something will start it again
immediately.
However, EMR also has a process called service-nanny that monitors Hadoop-related services and ensures all of them are always running. This is the mystery process that brings the namenode back.
So, to truly restart a service, you need to stop service-nanny for a while and then stop or restart the necessary processes. Once you bring service-nanny back, it will resume doing its job. You might run commands like:
sudo service service-nanny stop
sudo service hadoop-namenode restart
sudo service service-nanny start
Note that this behavior is different on the 4.x.x and 5.x.x AMIs, where upstart is used to stop/start applications and service-nanny no longer brings applications back.
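On those newer releases you would use upstart directly, for example (the job name hadoop-hdfs-namenode is an assumption; check /etc/init/ on your cluster for the actual names):
sudo stop hadoop-hdfs-namenode
sudo start hadoop-hdfs-namenode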

hbase 0.90.5 does not work after replacing hadoop*.jar in hbase/lib/

I have Debian 6.0.3 and a problem with those best friends, HBase and Hadoop.
Step by step, I want a working configuration of HBase (standalone for the first step) and Hadoop:
wget http://www.sai.msu.su/apache//hbase/hbase-0.90.5/hbase-0.90.5.tar.gz
tar xzfv hbase-0.90.5.tar.gz
sudo mv hbase-0.90.5 /usr/local/
sudo ln -s hbase-0.90.5/ hbase
sudo chown -R hduser:hadoop hbase*
lrwxrwxrwx 1 hduser hadoop 13 Jan 21 10:11 hbase -> hbase-0.90.5/
drwxr-xr-x 8 hduser hadoop 4096 Jan 21 10:11 hbase-0.90.5
dan@master:/usr/local/hbase$ su hduser
hduser@master:/usr/local/hbase$ bin/start-hbase.sh
starting master, logging to /usr/local/hbase/bin/../logs/hbase-hduser-master-master.out
hduser@master:/usr/local/hbase$ bin/hbase shell
HBase Shell; enter 'help<RETURN>' for list of supported commands.
Type "exit<RETURN>" to leave the HBase Shell
Version 0.90.5, r1212209, Fri Dec 9 05:40:36 UTC 2011
hbase(main):001:0> list
TABLE
0 row(s) in 0.8560 seconds
But after copying the Hadoop core 1.0 jar into the HBase lib/ folder in place of the old one, I got:
hduser@master:/usr/local/hbase$ bin/stop-hbase.sh
hduser@master:/usr/local/hbase$ cp ../hadoop/hadoop-core-1.0.0.jar lib/
hduser@master:/usr/local/hbase$ rm lib/hadoop-core-0.20-append-r1056497.jar
hduser@master:/usr/local/hbase$ bin/start-hbase.sh
starting master, logging to /usr/local/hbase/bin/../logs/hbase-hduser-master-master.out
hduser@master:/usr/local/hbase$ bin/hbase shell
HBase Shell; enter 'help<RETURN>' for list of supported commands.
Type "exit<RETURN>" to leave the HBase Shell
Version 0.90.5, r1212209, Fri Dec 9 05:40:36 UTC 2011
hbase(main):001:0> list
TABLE
ERROR: org.apache.hadoop.hbase.ZooKeeperConnectionException: HBase is able to connect to ZooKeeper but the connection closes immediately. This could be a sign that the server has too many connections (30 is the default). Consider inspecting your ZK server logs for that error and then make sure you are reusing HBaseConfiguration as often as you can. See HTable's javadoc for more information.
Why do I need ZooKeeper in standalone mode after replacing hadoop-core*.jar?
How can I fix it?
Have you configured hbase-env.sh so that HBase manages ZooKeeper itself?
Have you configured the ZooKeeper quorum in hbase-site.xml?
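A minimal sketch of what those two settings look like for a standalone setup (the localhost value and the conf/ paths are assumptions; adjust to your installation):
# conf/hbase-env.sh
export HBASE_MANAGES_ZK=true
<!-- conf/hbase-site.xml -->
<property>
  <name>hbase.zookeeper.quorum</name>
  <value>localhost</value>
</property>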
I had the same problem, and solved it by configuring YARN and MapReduce.
Try this post.
