Starting Hadoop h2o IO error sending batch UDP bytes - h2o

When starting Hadoop h2o (YARN h2o) with the following command:
hadoop jar ./h2o-3.18.0.4-cdh5.13/h2odriver.jar -nodes 10 -mapperXmx 5g -output junk/tmp1
I sometimes have trouble bringing up the H2O cluster. This is the error I see on the console:
ERROR: Timed out waiting for H2O cluster to come up (120 seconds)
ERROR: (Try specifying the -timeout option to increase the waiting time limit)
Attempting to clean up hadoop job...
Killed.
This is the error I see in YARN logs:
03-09 14:50:35.118 x.x.x.56:54321 37628 #49:54321 ERRR: Got IO error when sending batch UDP bytes: java.net.ConnectException: Connection refused

Here are some things to try:
Increase the timeout with the -timeout option
After increasing the timeout, does it still work sometimes and not other times? If yes, then you may have a host-level networking issue within your Hadoop cluster.
Run yarn logs -applicationId <application_id> and look at which of the hosts has the issue. Is there a pattern in the IP addresses you can spot?
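For example (the -timeout value is in seconds; the output directory and application id below are placeholders):
hadoop jar ./h2o-3.18.0.4-cdh5.13/h2odriver.jar -nodes 10 -mapperXmx 5g -timeout 600 -output junk/tmp2
yarn logs -applicationId <application_id>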

Related

HDFS Clustering in swarm

In a normal Docker environment, HDFS-clustered images like hadoop-master and hadoop-slave work fine. But when I try to run these images in swarm mode, I face connectivity issues. Is clustered HDFS compatible with Docker swarm?
The service that I deployed keeps restarting and exiting every 2-3 seconds.
Can someone explain in detail how to implement HDFS clustering in swarm mode?
When I run docker logs <container_id>, I get
start sshd...
/bin/sh: 0: Can't open /bin/which
/etc/init.d/ssh: 424: .: Can't open /lib/lsb/init-functions.d/20-left-info-blocks
start serf...
Error connecting to Serf agent: dial tcp 127.0.0.1:7373: connection refused
Obviously, you have neither /bin/which nor LSB init support installed in the image.
Install all the prerequisites.
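On a Debian/Ubuntu-based image, /bin/which is provided by the debianutils package and /lib/lsb/init-functions by lsb-base, so a sketch of the missing Dockerfile step might look like this (package names assume a Debian-based base image):
# Install the utilities the init scripts expect (Debian/Ubuntu package names)
RUN apt-get update && \
    apt-get install -y debianutils lsb-base && \
    rm -rf /var/lib/apt/lists/*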

Hadoop HA Namenode goes down with the Error: flush failed for required journal (JournalAndStream(mgr=QJM to [< ip >:8485, < ip >:8485, < ip >:8485]))

The Hadoop NameNode goes down about once every day.
FATAL namenode.FSEditLog (JournalSet.java:mapJournalsAndReportErrors(398)) -
**Error: flush failed for required journal** (JournalAndStream(mgr=QJM to [< ip >:8485, < ip >:8485, < ip >:8485], stream=QuorumOutputStream starting at txid <>))
java.io.IOException: Timed out waiting 20000ms for a quorum of nodes to respond.
at org.apache.hadoop.hdfs.qjournal.client.AsyncLoggerSet.waitForWriteQuorum(AsyncLoggerSet.java:137)
at org.apache.hadoop.hdfs.qjournal.client.QuorumOutputStream.flushAndSync(QuorumOutputStream.java:107)
at org.apache.hadoop.hdfs.server.namenode.EditLogOutputStream.flush(EditLogOutputStream.java:113)
at
Can someone suggest what I need to look into to resolve this issue?
I am using VMs for the journal nodes and master nodes. Does that cause any issues?
From the error you pasted, it appears the NameNode could not get a timely response from a quorum of journal nodes. What was going on at the time of this event?
Since you mention that your nodes are VMs, I would guess you overloaded the hypervisor, or it had trouble passing traffic from the NN to the JN and ZK quorum.
In my case, this issue was caused by a difference in system time between the nodes of the cluster.
To bring the system time back in sync, execute the commands below on each node.
sudo service ntpd stop
sudo ntpdate pool.ntp.org # Run this command multiple times
sudo service ntpd start
If Hue is down, run the command below on the Hue server machine:
sudo service hue start
If the NameNode is down, start the NameNode.
Recurring fix
Add a crontab entry for the root user on all the nodes of the environment (a sample entry is sketched below),
or
Install VM tools to keep the system time in sync.
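A minimal sketch of such an entry for root's crontab, assuming ntpdate is installed and the nodes can reach pool.ntp.org (the hourly schedule is just an example):
# crontab -e (as root): re-sync the clock against pool.ntp.org every hour
0 * * * * /usr/sbin/ntpdate pool.ntp.org >/dev/null 2>&1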

H2O: unable to connect to h2o cluster through python

I have a 5-node Hadoop cluster running HDP 2.3.0. I set up an H2O cluster on YARN as described here.
On running the following command
hadoop jar h2odriver_hdp2.2.jar water.hadoop.h2odriver -libjars ../h2o.jar -mapperXmx 512m -nodes 3 -output /user/hdfs/H2OTestClusterOutput
I get the following output:
H2O cluster (3 nodes) is up
(Note: Use the -disown option to exit the driver after cluster formation)
(Press Ctrl-C to kill the cluster)
Blocking until the H2O cluster shuts down...
When I try to execute the command
h2o.init(ip="10.113.57.98", port=54321)
The process remains stuck at this stage. On trying to connect to the web UI at ip:54321, the browser tries endlessly to load the H2O admin page, but nothing ever displays.
On forcefully terminating the init process I get the following error:
No instance found at ip and port: 10.113.57.98:54321. Trying to start local jar...
However, if I use H2O from Python without setting up an H2O cluster, everything runs fine.
I executed all commands as the root user. The root user has permission to read and write to the /user/hdfs HDFS directory.
I'm not sure whether this is a permissions error or whether the port is not accessible.
Any help would be greatly appreciated.
It looks like you are using H2O2 (H2O Classic). I recommend upgrading your H2O to the latest (H2O 3). There is a build specifically for HDP2.3 here: http://www.h2o.ai/download/h2o/hadoop
Running H2O3 is a little cleaner too:
hadoop jar h2odriver.jar -nodes 1 -mapperXmx 6g -output hdfsOutputDirName
Also, 512 MB per node is tiny - what is your use case? I would give the nodes more memory.
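Once the driver reports the cluster is up, connecting from Python is just a matter of pointing h2o.init at one of the cluster nodes. A minimal sketch, reusing the IP from the question and assuming port 54321 is reachable from the client machine:
import h2o

# Attach to the existing H2O cluster instead of starting a local single-node instance
h2o.init(ip="10.113.57.98", port=54321)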

Error with flume and remote hdfs sink

I'm trying to run Flume with an HDFS sink. HDFS is running properly on a different machine, and I can even interact with HDFS from the Flume machine, but when I run Flume and send events to it, I get the following error:
2013-05-26 14:22:11,399 (SinkRunner-PollingRunner-DefaultSinkProcessor) [WARN - org.apache.flume.sink.hdfs.HDFSEventSink.process(HDFSEventSink.java:456)] HDFS IO error
java.io.IOException: Callable timed out after 25000 ms
at org.apache.flume.sink.hdfs.HDFSEventSink.callWithTimeout(HDFSEventSink.java:352)
at org.apache.flume.sink.hdfs.HDFSEventSink.append(HDFSEventSink.java:727)
at org.apache.flume.sink.hdfs.HDFSEventSink.process(HDFSEventSink.java:430)
at org.apache.flume.sink.DefaultSinkProcessor.process(DefaultSinkProcessor.java:68)
at org.apache.flume.SinkRunner$PollingRunner.run(SinkRunner.java:147)
at java.lang.Thread.run(Thread.java:679)
Caused by: java.util.concurrent.TimeoutException
at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:258)
at java.util.concurrent.FutureTask.get(FutureTask.java:119)
at org.apache.flume.sink.hdfs.HDFSEventSink.callWithTimeout(HDFSEventSink.java:345)
... 5 more
Again, connectivity is not an issue, since I can interact with HDFS using the Hadoop command line (the Flume machine is NOT a datanode).
The weirdest part is that after killing Flume I can see that the tmp file is created in HDFS, but it's empty (and the .tmp extension remains).
Any ideas as to why this could be happening? Thanks a lot!
Check three things. First, make sure your firewall is off, i.e. iptables is stopped. Second, check that the property agent.sinks.hdfs-sink.hdfs.path is set to hdfs://PUBLIC_IP:8020/user/hdfs/flume using the public IP, not the private IP.
Third, set
agent.sinks.hdfs-sink.hdfs.callTimeout = 180000, because the default is 10000 ms, which gives HDFS very little time to respond.
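A sketch of how those settings might look in the Flume agent configuration file (the agent and sink names follow the property keys above; the IP is a placeholder):
# flume.conf - HDFS sink settings (agent/sink names as used above)
agent.sinks.hdfs-sink.type = hdfs
agent.sinks.hdfs-sink.hdfs.path = hdfs://PUBLIC_IP:8020/user/hdfs/flume
# Allow up to 3 minutes for HDFS calls instead of the 10-second default
agent.sinks.hdfs-sink.hdfs.callTimeout = 180000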
Thanks,
Shilpa

Hadoop reduce task stuck at 0%

I'm following a guide to set up pseudo-distributed mode. I ran start-all.sh and all six daemons are up; then I launch my WordCount example, which runs fine in standalone mode, but it gets stuck at map 100%, reduce 0%.
Looking at the jobtracker, the reduce task is at status reduce > copy.
The only error log is in secondarynamenode.log:
2013-02-27 23:29:59,555 ERROR org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:my_user_name cause:java.net.ConnectException: Connection refused
2013-02-27 23:29:59,555 ERROR org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode: Exception in doCheckpoint:
2013-02-27 23:29:59,555 ERROR org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode: java.net.ConnectException: Connection refused
I can ssh to localhost without a password. The Hadoop version is 1.1.1. I launch the jar file from the command line.
I really have no idea what's wrong; any help?
Thanks in advance.
How much data are you running the word count on? If you are running on a large data set in standalone mode without using a combiner, then it's going to cause some trouble. Try
job.setMapperClass(<Mapper_Class>);
job.setCombinerClass(<Reducer_Class>);
job.setReducerClass(<Reducer_Class>);
in the main method of your driver program. This might help you out.
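For reference, a minimal sketch of a complete WordCount driver with the combiner wired in. TokenizerMapper and IntSumReducer stand in for your own Mapper and Reducer classes, and WordCountDriver is a placeholder class name; this uses the org.apache.hadoop.mapreduce API available in Hadoop 1.1.1:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "word count");           // Hadoop 1.x Job constructor
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(TokenizerMapper.class);        // your Mapper class
        job.setCombinerClass(IntSumReducer.class);         // combiner reuses the reducer to shrink shuffle data
        job.setReducerClass(IntSumReducer.class);          // your Reducer class
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}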
