Hadoop example application hangs on Windows single node - hadoop

I have installed single node Hadoop on Windows and it is apprently working.
Unfortunately, I can't run test application on it.
When I do
hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.9.0.jar grep input output 'dfs[a-z.]+'
as described on it's page, I get it not returning to command prompt. On referenced job page I see
YarnApplicationState: ACCEPTED: waiting for AM container to be
allocated, launched and register with RM.
Diagnostics: [Mon Apr 23 00:46:44 +0300 2018] Application is added to
the scheduler and is not yet activated. Skipping AM assignment as
cluster resource is empty. Details : AM Partition =
<DEFAULT_PARTITION>; AM Resource Request = <memory:2048, vCores:1>;
Queue Resource Limit for AM = <memory:0, vCores:0>; User AM Resource
Limit of the queue = <memory:0, vCores:0>; Queue AM Resource Usage =
<memory:0, vCores:0>;
What does it mean and how to push it?

Related

H2O cluster startup frequently timing out

Trying to start an h2o cluster on (MapR) hadoop via python
# startup hadoop h2o cluster
import os
import subprocess
import h2o
import shlex
import re
from Queue import Queue, Empty
from threading import Thread
def enqueue_output(out, queue):
"""
Function for communicating streaming text lines from seperate thread.
see https://stackoverflow.com/questions/375427/non-blocking-read-on-a-subprocess-pipe-in-python
"""
for line in iter(out.readline, b''):
queue.put(line)
out.close()
# clear legacy temp. dir.
hdfs_legacy_dir = '/mapr/clustername/user/mapr/hdfsOutputDir'
if os.path.isdir(hdfs_legacy_dir ):
print subprocess.check_output(shlex.split('rm -r %s'%hdfs_legacy_dir ))
# start h2o service in background thread
local_h2o_start_path = '/home/mapr/h2o-3.18.0.2-mapr5.2/'
startup_p = subprocess.Popen(shlex.split('/bin/hadoop jar {}h2odriver.jar -nodes 4 -mapperXmx 6g -timeout 300 -output hdfsOutputDir'.format(local_h2o_start_path)),
shell=False,
stdout=subprocess.PIPE, stderr=subprocess.PIPE)
# setup message passing queue
q = Queue()
t = Thread(target=enqueue_output, args=(startup_p.stdout, q))
t.daemon = True # thread dies with the program
t.start()
# read line without blocking
h2o_url_out = ''
while True:
try: line = q.get_nowait() # or q.get(timeout=.1)
except Empty:
continue
else: # got line
print line
# check for first instance connection url output
if re.search('Open H2O Flow in your web browser', line) is not None:
h2o_url_out = line
break
if re.search('Error', line) is not None:
print 'Error generated: %s' % line
sys.exit()
print 'Connection url output line: %s' % h2o_url_out
h2o_cnxn_ip = re.search('(?<=Open H2O Flow in your web browser: http:\/\/)(.*?)(?=:)', h2o_url_out).group(1)
print 'H2O connection ip: %s' % h2o_cnxn_ip
frequently throws a timeout error
Waiting for H2O cluster to come up...
H2O node 172.18.4.66:54321 requested flatfile
H2O node 172.18.4.65:54321 requested flatfile
H2O node 172.18.4.67:54321 requested flatfile
ERROR: Timed out waiting for H2O cluster to come up (300 seconds)
Error generated: ERROR: Timed out waiting for H2O cluster to come up (300 seconds)
Shutting down h2o cluster
Looking at the docs (http://docs.h2o.ai/h2o/latest-stable/h2o-docs/faq/general-troubleshooting.html) (and just doing a wordfind for the word "timeout"), was unable to find anything that helped the problem (eg. extending the timeout time via hadoop jar h2odriver.jar -timeout <some time> did nothing but extend the time until the timeout error popped up).
Have noticed that this happens often when there is another instance of an h2o cluster already up and running (which I don't understand since I would think that YARN could support multiple instances), yet also sometimes when there is no other cluster initialized.
Anyone know anything else that can be tried to solve this problem or get more debugging info beyond the error message being thrown by h2o?
UPDATE:
Trying to recreate the problem from the commandline, getting
[me#mnode01 project]$ /bin/hadoop jar /home/me/h2o-3.20.0.5-mapr5.2/h2odriver.jar -nodes 4 -mapperXmx 6g -timeout 300 -output hdfsOutputDir
Determining driver host interface for mapper->driver callback...
[Possible callback IP address: 172.18.4.62]
[Possible callback IP address: 127.0.0.1]
Using mapper->driver callback IP address and port: 172.18.4.62:29388
(You can override these with -driverif and -driverport/-driverportrange.)
Memory Settings:
mapreduce.map.java.opts: -Xms6g -Xmx6g -XX:PermSize=256m -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Dlog4j.defaultInitOverride=true
Extra memory percent: 10
mapreduce.map.memory.mb: 6758
18/08/15 09:18:46 INFO client.MapRZKBasedRMFailoverProxyProvider: Updated RM address to mnode03.cluster.local/172.18.4.64:8032
18/08/15 09:18:48 INFO mapreduce.JobSubmitter: number of splits:4
18/08/15 09:18:48 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1523404089784_7404
18/08/15 09:18:48 INFO security.ExternalTokenManagerFactory: Initialized external token manager class - com.mapr.hadoop.yarn.security.MapRTicketManager
18/08/15 09:18:48 INFO impl.YarnClientImpl: Submitted application application_1523404089784_7404
18/08/15 09:18:48 INFO mapreduce.Job: The url to track the job: https://mnode03.cluster.local:8090/proxy/application_1523404089784_7404/
Job name 'H2O_66888' submitted
JobTracker job ID is 'job_1523404089784_7404'
For YARN users, logs command is 'yarn logs -applicationId application_1523404089784_7404'
Waiting for H2O cluster to come up...
H2O node 172.18.4.65:54321 requested flatfile
H2O node 172.18.4.67:54321 requested flatfile
H2O node 172.18.4.66:54321 requested flatfile
ERROR: Timed out waiting for H2O cluster to come up (300 seconds)
ERROR: (Try specifying the -timeout option to increase the waiting time limit)
Attempting to clean up hadoop job...
Killed.
18/08/15 09:23:54 INFO client.MapRZKBasedRMFailoverProxyProvider: Updated RM address to mnode03.cluster.local/172.18.4.64:8032
----- YARN cluster metrics -----
Number of YARN worker nodes: 6
----- Nodes -----
Node: http://mnode03.cluster.local:8044 Rack: /default-rack, RUNNING, 0 containers used, 0.0 / 7.0 GB used, 0 / 2 vcores used
Node: http://mnode05.cluster.local:8044 Rack: /default-rack, RUNNING, 0 containers used, 0.0 / 10.4 GB used, 0 / 2 vcores used
Node: http://mnode06.cluster.local:8044 Rack: /default-rack, RUNNING, 0 containers used, 0.0 / 10.4 GB used, 0 / 2 vcores used
Node: http://mnode01.cluster.local:8044 Rack: /default-rack, RUNNING, 0 containers used, 0.0 / 5.0 GB used, 0 / 2 vcores used
Node: http://mnode04.cluster.local:8044 Rack: /default-rack, RUNNING, 1 containers used, 7.0 / 10.4 GB used, 1 / 2 vcores used
Node: http://mnode02.cluster.local:8044 Rack: /default-rack, RUNNING, 1 containers used, 2.0 / 8.7 GB used, 1 / 2 vcores used
----- Queues -----
Queue name: root.default
Queue state: RUNNING
Current capacity: 0.00
Capacity: 0.00
Maximum capacity: -1.00
Application count: 0
Queue 'root.default' approximate utilization: 0.0 / 0.0 GB used, 0 / 0 vcores used
----------------------------------------------------------------------
WARNING: Job memory request (26.4 GB) exceeds queue available memory capacity (0.0 GB)
WARNING: Job virtual cores request (4) exceeds queue available virtual cores capacity (0)
ERROR: Only 3 out of the requested 4 worker containers were started due to YARN cluster resource limitations
----------------------------------------------------------------------
For YARN users, logs command is 'yarn logs -applicationId application_1523404089784_7404'
and noticing the later outputs
WARNING: Job memory request (26.4 GB) exceeds queue available memory capacity (0.0 GB)
WARNING: Job virtual cores request (4) exceeds queue available virtual cores capacity (0)
ERROR: Only 3 out of the requested 4 worker containers were started due to YARN cluster
I am confused by the reported 0GB mem. and 0 vcores becuase there are no other applications running on the cluster and looking at the cluster details in the YARN RM web UI shows
(using image, since could not find unified place in log files for this info and why the mem. availability is so uneven despite having no other running applications, I do not know). At this point, should mention that don't have much experience tinkering with / examining YARN configs, so it's difficult for me to find relevant information at this point.
Could it be that I am starting h2o cluster with -mapperXmx=6g, but (as shown in the image) one of the nodes only has 5g mem. available, so if this node is randomly selected to contribute to the initialized h2o application, it does not have enough memory to support the requested mapper mem.? Changing the startup command to /bin/hadoop jar /home/me/h2o-3.20.0.5-mapr5.2/h2odriver.jar -nodes 4 -mapperXmx 5g -timeout 300 -output hdfsOutputDir and start/stopping multiple times without error seems to support this theory (though need to check further to determine if I'm interpreting things correctly).
This is most likely because your Hadoop cluster is busy, and there just isn't space to start new yarn containers.
If you ask for N nodes, then you either get all N nodes, or the launch process times out like you are seeing. You can optionally use the -timeout command line flag to increase the timeout.

Chronos insufficient resources warning

I'm trying to run Chronos on Mesos, but all my jobs are stuck in a queueing state.
systemctl status chronos -l shows:
Mar 20 20:21:08 core-mq3 chronos[17940]: [2017-03-20 20:21:08,985] WARN Insufficient resources remaining for task 'ct:1490040556081:0:JobName:', will append to queue. (Needed: [cpus: 0.5 mem: 256.0 disk: 256.0], Found: [cpus: 1.8 mem: 11034.0 disk: 60398.8,cpus: 2.0 mem: 6542.0 disk: 60399.0]) (org.apache.mesos.chronos.scheduler.mesos.MesosJobFramework:155)
So, it is refusing the offers even though all the resources are more than required.
This was a red herring. There was a constraint that the agent did not fulfill, which is why it couldn't run the task.
Running curl GET <chronos>/scheduler/jobs/search?name=<job> gave me all the details of the job, which I used to verify that the constraint was not being fulfilled.

Error in Accumulo's tablet server when scanning for data

I have a bunch of tables in Accumulo with one master and 2 tablet servers containing a bunch of tables storing millions of records. The problem is that whenever I scan the tables to get a few records out, the tablet server logs keep throwing this error
2015-11-12 04:38:56,107 [hdfs.DFSClient] WARN : Failed to connect to /192.168.250.12:50010 for block, add to deadNodes and continue. java.io.IOException: Got error, status message opReadBlock BP-1881591466-192.168.1.111-1438767154643:blk_1073773956_33167 received exception java.io.IOException: Offset 16320 and length 20 don't match block BP-1881591466-192.168.1.111-1438767154643:blk_1073773956_33167 ( blockLen 0 ), for OP_READ_BLOCK, self=/192.168.250.202:55915, remote=/192.168.250.12:50010, for file /accumulo/tables/1/default_tablet/F0000gne.rf, for pool BP-1881591466-192.168.1.111-1438767154643 block 1073773956_33167
java.io.IOException: Got error, status message opReadBlock BP-1881591466-192.168.1.111-1438767154643:blk_1073773956_33167 received exception java.io.IOException: Offset 16320 and length 20 don't match block BP-1881591466-192.168.1.111-1438767154643:blk_1073773956_33167 ( blockLen 0 ), for OP_READ_BLOCK, self=/192.168.250.202:55915, remote=/192.168.250.12:50010, for file /accumulo/tables/1/default_tablet/F0000gne.rf, for pool BP-1881591466-192.168.1.111-1438767154643 block 1073773956_33167
at org.apache.hadoop.hdfs.protocol.datatransfer.DataTransferProtoUtil.checkBlockOpStatus(DataTransferProtoUtil.java:140)
at org.apache.hadoop.hdfs.RemoteBlockReader2.checkSuccess(RemoteBlockReader2.java:456)
at org.apache.hadoop.hdfs.RemoteBlockReader2.newBlockReader(RemoteBlockReader2.java:424)
at org.apache.hadoop.hdfs.BlockReaderFactory.getRemoteBlockReader(BlockReaderFactory.java:818)
at org.apache.hadoop.hdfs.BlockReaderFactory.getRemoteBlockReaderFromTcp(BlockReaderFactory.java:697)
at org.apache.hadoop.hdfs.BlockReaderFactory.build(BlockReaderFactory.java:355)
at org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:618)
at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:844)
at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:896)
at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:697)
at java.io.DataInputStream.readShort(DataInputStream.java:312)
at org.apache.accumulo.core.file.rfile.bcfile.Utils$Version.<init>(Utils.java:264)
at org.apache.accumulo.core.file.rfile.bcfile.BCFile$Reader.<init>(BCFile.java:823)
at org.apache.accumulo.core.file.blockfile.impl.CachableBlockFile$Reader.init(CachableBlockFile.java:246)
at org.apache.accumulo.core.file.blockfile.impl.CachableBlockFile$Reader.getBCFile(CachableBlockFile.java:257)
at org.apache.accumulo.core.file.blockfile.impl.CachableBlockFile$Reader.access$100(CachableBlockFile.java:137)
at org.apache.accumulo.core.file.blockfile.impl.CachableBlockFile$Reader$MetaBlockLoader.get(CachableBlockFile.java:209)
at org.apache.accumulo.core.file.blockfile.impl.CachableBlockFile$Reader.getBlock(CachableBlockFile.java:313)
at org.apache.accumulo.core.file.blockfile.impl.CachableBlockFile$Reader.getMetaBlock(CachableBlockFile.java:368)
at org.apache.accumulo.core.file.blockfile.impl.CachableBlockFile$Reader.getMetaBlock(CachableBlockFile.java:137)
at org.apache.accumulo.core.file.rfile.RFile$Reader.<init>(RFile.java:843)
at org.apache.accumulo.core.file.rfile.RFileOperations.openReader(RFileOperations.java:79)
at org.apache.accumulo.core.file.DispatchingFileFactory.openReader(DispatchingFileFactory.java:69)
at org.apache.accumulo.tserver.tablet.Compactor.openMapDataFiles(Compactor.java:279)
at org.apache.accumulo.tserver.tablet.Compactor.compactLocalityGroup(Compactor.java:322)
at org.apache.accumulo.tserver.tablet.Compactor.call(Compactor.java:214)
at org.apache.accumulo.tserver.tablet.Tablet._majorCompact(Tablet.java:1976)
at org.apache.accumulo.tserver.tablet.Tablet.majorCompact(Tablet.java:2093)
at org.apache.accumulo.tserver.tablet.CompactionRunner.run(CompactionRunner.java:44)
at org.apache.htrace.wrappers.TraceRunnable.run(TraceRunnable.java:57)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at org.apache.accumulo.fate.util.LoggingRunnable.run(LoggingRunnable.java:35)
at java.lang.Thread.run(Thread.java:745)
I think it is more of a HDFS related issue as opposed to an Accumulo one, so I checked the logs of the datanode and found the same message,
Offset 16320 and length 20 don't match block BP-1881591466-192.168.1.111-1438767154643:blk_1073773956_33167 ( blockLen 0 ), for OP_READ_BLOCK, self=/192.168.250.202:55915, remote=/192.168.250.12:50010, for file /accumulo/tables/1/default_tablet/F0000gne.rf, for pool BP-1881591466-192.168.1.111-1438767154643 block 1073773956_33167
But as INFO in the logs. What I don't understand is that why am I getting this error.
I can see that the pool name of the file (BP-1881591466-192.168.1.111-1438767154643) that I am trying to access contains a IP address (192.168.1.111) which does not match the IP address of any of the servers (self and remote). Actually, 192.168.1.111 was the old IP address of the Hadoop Master server, but I had changed it. I use domain names instead of IP addresses so the only place where I made the changes were in the host files of the machines in the cluster. None of the Hadoop/Accumulo configurations use IP addresses. Does anyone know what the issue is here? I have spent days on it and still am not able to figure it out.
The error you are receiving indicates that Accumulo cannot read part of one of its files from HDFS. The NameNode is reporting that a block is located on a particular DataNode (in your case, 192.168.250.12). However, when Accumulo attempts to read from that DataNode, it fails.
This likely indicates a corrupt block in HDFS, or a temporary network issue. You can try to run hadoop fsck / (the exact command may vary, depending on version) to perform a health check of HDFS.
Also, the IP address mismatch in the DataNode appears to indicate that the DataNode is confused about the HDFS pool it is a part of. You should restart that DataNode after double-checking its configuration, DNS, and /etc/hosts for any anomolies.

'yarn application -list' doesnt show any results

I have run some Spark applications on a YARN cluster. The application shows up in the "All applications" page in the YARN UI http://host:8088/cluster but the yarn application -list command doesnt give any results. What could be the cause of this ?
When you use "-list" option without "-appTypes" or "-appStates" options, it applies default filtering for "application-types" and "states" (check the highlighted section below). If none of your applications match the default filtering, then you will not get any result.
Total number of applications (application-types: [] and states: [SUBMITTED, ACCEPTED, RUNNING]):0
If you see the help for "-list", it states the following:
"List applications. Supports optional use of -appTypes to filter applications based on application type, and -appStates to filter applications based on application state".
This seems to be bit misleading.
If you don't specify "-appStates", by default it takes states as "SUBMITTED", "ACCEPTED" and "RUNNING", for filtering. Please check the code below from "listApplications()" method of "org.apache.hadoop.yarn.client.cli.ApplicationCLI.java".
private void listApplications()
{
............
if (allAppStates) {
for (YarnApplicationState appState : YarnApplicationState.values()) {
appStates.add(appState);
}
} else {
if (appStates.isEmpty()) {
appStates.add(YarnApplicationState.RUNNING);
appStates.add(YarnApplicationState.ACCEPTED);
appStates.add(YarnApplicationState.SUBMITTED);
}
}
............
}
As per the code above, following logic is applied:
Call "yarn application -list": Shows all applications in states "SUBMITTED", "ACCEPTED" and "RUNNING"
For e.g. for me output is below (there are zero applications in default states)
CMD> yarn application -list
Total number of applications (application-types: [] and states: [SUBMITTED, ACCEPTED, RUNNING]):0
Call "yarn application -list -appStates ALL": Shows all the applications (in any state)
For e.g. for me output is below (there are totally 268 applications, also check the filtering criteria applied to "states"):
CMD> yarn application -list -appStates ALL
ALL Total number of applications (application-types: [] and states: [NEW, NEW_SAVING
, SUBMITTED, ACCEPTED, RUNNING, FINISHED, FAILED, KILLED]):268
Call "yarn application -list -appStates FINISHED": Shows all the applications (in FINISHED state)
For e.g. for me output is below (there are 136 applications in FINISHED state):
CMD> yarn application -list -appStates FINISHED
Total number of applications (application-types: [] and states: [FINISHED]):136
It turns out that I had enabled Log aggregation in YARN but had set the yarn.nodemanager.remote-app-log-dir to a custom hdfs directory (/tmp/yarnlogs), So logs were actually getting aggregated at /tmp/yarnlogs in HDFS, but the yarn command was still searching for logs at the default location on HDFS (/tmp/logs). So changing the property to its default value fixed it for me.
NOTE:
If the log aggregation directory is misconfigured ,it also causes an a error while trying to access job history from the web UI, that looks like :
Log aggregation has not completed or is not enabled

How to fix "Task attempt_201104251139_0295_r_000006_0 failed to report status for 600 seconds."

I wrote a mapreduce job to extract some info from a dataset. The dataset is users' rating about movies. The number of users is about 250K and the number of movies is about 300k. The output of map is <user, <movie, rating>*> and <movie,<user,rating>*>. In the reducer, I will process these pairs.
But when I run the job, the mapper completes as expected, but reducer always complain that
Task attempt_* failed to report status for 600 seconds.
I know this is due to failed to update status, so I added a call to context.progress() in my code like this:
int count = 0;
while (values.hasNext()) {
if (count++ % 100 == 0) {
context.progress();
}
/*other code here*/
}
Unfortunately, this does not help. Still many reduce tasks failed.
Here is the log:
Task attempt_201104251139_0295_r_000014_1 failed to report status for 600 seconds. Killing!
11/05/03 10:09:09 INFO mapred.JobClient: Task Id : attempt_201104251139_0295_r_000012_1, Status : FAILED
Task attempt_201104251139_0295_r_000012_1 failed to report status for 600 seconds. Killing!
11/05/03 10:09:09 INFO mapred.JobClient: Task Id : attempt_201104251139_0295_r_000006_1, Status : FAILED
Task attempt_201104251139_0295_r_000006_1 failed to report status for 600 seconds. Killing!
BTW, the error happened in reduce to copy phase, the log says:
reduce > copy (28 of 31 at 26.69 MB/s) > :Lost task tracker: tracker_hadoop-56:localhost/127.0.0.1:34385
Thanks for the help.
The easiest way will be to set this configuration parameter:
<property>
<name>mapred.task.timeout</name>
<value>1800000</value> <!-- 30 minutes -->
</property>
in mapred-site.xml
The easiest another way is to set in your Job Configuration inside the program
Configuration conf=new Configuration();
long milliSeconds = 1000*60*60; <default is 600000, likewise can give any value)
conf.setLong("mapred.task.timeout", milliSeconds);
**before setting it please check inside the Job file(job.xml) file in jobtracker GUI about the correct property name whether its mapred.task.timeout or mapreduce.task.timeout
.
.
.
while running the job check in the Job file again whether that property is changed according to the setted value.
In newer versions, the name of the parameter has been changed to mapreduce.task.timeout as described in this link (search for task.timeout). In addition, you can also disable this timeout as described in the above link:
The number of milliseconds before a task will be terminated if it
neither reads an input, writes an output, nor updates its status
string. A value of 0 disables the timeout.
Below is an example setting in the mapred-site.xml:
<property>
<name>mapreduce.task.timeout</name>
<value>0</value> <!-- A value of 0 disables the timeout -->
</property>
If you have hive query and its timing out , you can set above configurations in following way:
set mapred.tasktracker.expiry.interval=1800000;
set mapred.task.timeout= 1800000;
From https://issues.apache.org/jira/browse/HADOOP-1763
causes might be :
1. Tasktrackers run the maps successfully
2. Map outputs are served by jetty servers on the TTs.
3. All the reduce tasks connects to all the TT where maps are run.
4. since there are lots of reduces wanting to connect the map output server, the jetty servers run out of threads (default 40)
5. tasktrackers continue to make periodic heartbeats to JT, so that they are not dead, but their jetty servers are (temporarily) down.

Resources