Running Hadoop examples halts in pseudo-distributed mode - hadoop

Everything runs well in standalone mode. When I move to pseudo-distributed mode, HDFS works well: I can put files onto HDFS and browse them. I also checked that there is one DataNode in the live nodes list.
However, when I run bin/hadoop jar hadoop-*-examples.jar grep input output 'dfs[a-z.]+', the program just hangs there without producing any error. And from http://ereg.adobe.com:50070/dfsnodelist.jsp?whatNodes=LIVE I can see that nothing has ever been run on that DataNode.
I followed the configuration in the tutorial for the XML conf files. So does anyone have any idea about what other mistakes I might have made? By the way, I'm running all of this on Mac OS X.

By halt, do you mean it hangs, or that it just silently returns? For MapReduce issues, you should check the JobTracker's web page (at port 50030) to see the status of the submitted job.

Related

Hadoop standalone - hdfs commands are slow

I'm doing development/research in an Ubuntu 14.04 VM with Hadoop 2.6.2, and I'm constantly held back because any command I issue to HDFS takes about 15 seconds to run. I've tried digging around, but I am unable to locate the source of the problem, or even to tell whether this is expected behavior.
I followed the directions on Apache's website and got it up and running just fine in /opt/hadoop-2.6.2/.
The following is a simple test command that I'm using to evaluate if I have solved the problem.
/opt/hadoop-2.6.2/bin/hdfs dfs -ls /
I have inspected the logs and found no errors or strange warnings. A recommendation I found online was to set the logger to output to the console:
HADOOP_ROOT_LOGGER=DEBUG,console /opt/hadoop-2.6.2/bin/hdfs dfs -ls /
Doing this yields something of interest: you can watch it hang between the following two lines.
16/01/15 11:59:02 DEBUG impl.MetricsSystemImpl: UgiMetrics, User and group related metrics
16/01/15 11:59:17 DEBUG util.KerberosName: Kerberos krb5 configuration not found, setting default realm to empty
Thoughts: When I first saw this I assumed that it was hanging on authentication, but not only do I not have Kerberos installed, the default configuration in core-site.xml shows the authentication mode set to "simple". This makes me wonder why it would be looking for anything Kerberos-related to begin with. I attempted to specifically disable it in the XML, and the lag/slowness didn't go away. I get the feeling the delay is because it's timing out waiting for something. Does anyone else have any ideas?
I just went ahead and installed Kerberos anyway, just to see if it would work. The large delays have disappeared now that /etc/krb5.conf is present. I wonder if I could have just created the file with nothing in it. Hrmmm...
sudo apt-get install krb5-kdc krb5-admin-server
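A hypothetical follow-up experiment, based on the debug output above: the stall looks like the JVM's Kerberos code probing for a missing /etc/krb5.conf, so a minimal stub file might be enough without installing the full KDC packages. This sketch writes the stub to the current directory (EXAMPLE.COM is a placeholder realm); copy it to /etc/krb5.conf as root to actually test the theory.

```shell
# Minimal krb5.conf stub with only a default realm defined (placeholder
# realm; this is an untested guess, not a verified fix).
cat > ./krb5.conf <<'EOF'
[libdefaults]
    default_realm = EXAMPLE.COM
EOF
```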

Hadoop on Mesos fails with "Could not find or load main class org.apache.hadoop.mapred.MesosExecutor"

I have a Mesos cluster set up -- I have verified that the master can see the slaves -- but when I attempt to run a Hadoop job, all tasks wind up with a status of LOST. The same error is present in all the slave stderr logs:
Error: Could not find or load main class org.apache.hadoop.mapred.MesosExecutor
and that is the only line in the stderr logs.
Following the instructions on http://mesosphere.io/learn/run-hadoop-on-mesos/, I have put a modified Hadoop distribution on HDFS which each slave can access.
In the lib directory of the Hadoop distribution, I have added hadoop-mesos-0.0.4.jar and mesos-0.14.2.jar.
I have verified that each slave does in fact download this Hadoop distribution, and that hadoop-mesos-0.0.4.jar contains the class org.apache.hadoop.mapred.MesosExecutor, so I cannot figure out why the class cannot be found.
I am using Hadoop from CDH4.4.0 and mesos-0.15.0-rc4.
Does anyone have any suggestions as to what might be the problem? I would normally start by suspecting a CLASSPATH problem, but in this case the mesos-slave is downloading, unpacking, and attempting to run a Hadoop TaskTracker, so I would imagine any CLASSPATH would be set up by the mesos-slave.
In the stdout of the slave logs, the environment is printed. There is a MESOS_HADOOP_HOME which is empty. Should this be set to something? If it is supposed to be set to the downloaded Hadoop distribution, I cannot set it in advance because the Hadoop distribution is downloaded to a new location every time.
In the event that is related (some permissions issue maybe), when attempting to browse slave logs via the master UI, I get the error Error browsing path: ....
The user running mesos-slave can browse to the correct directory when I do so manually.
I found the problem. bin/hadoop of the downloaded Hadoop distribution attempts to find its location by running which $0. However, that will find a current Hadoop installation if one exists (i.e. /usr/lib/hadoop), and will load the jars under that installation's lib directory instead of the downloaded one's lib directory.
I had to modify bin/hadoop of the downloaded distribution to find its own location with dirname $0 instead of which $0.
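A minimal demonstration of the difference (the paths under /tmp are illustrative): `which $0` searches the PATH and can return a pre-existing system install such as /usr/lib/hadoop/bin/hadoop, while `dirname $0` always resolves relative to the script that is actually executing.

```shell
# Build a stand-in for the downloaded distribution's bin/hadoop that
# locates its own home with dirname, as in the fix described above.
mkdir -p /tmp/demo/downloaded/bin
cat > /tmp/demo/downloaded/bin/show-home <<'EOF'
#!/bin/sh
bin=$(dirname "$0")              # the directory of the running script
echo "$(cd "$bin"/.. && pwd)"    # its parent = the distribution home
EOF
chmod +x /tmp/demo/downloaded/bin/show-home
/tmp/demo/downloaded/bin/show-home   # prints /tmp/demo/downloaded
```

With `which $0` instead, the same script would report whatever hadoop happened to be first on the PATH, which is exactly the bug described.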

How to check if my hadoop is running in pseudo distributed mode?

I installed Hadoop quite a while ago, but I have somehow forgotten whether I installed it in pseudo-distributed mode or not. How can I check, while my Hadoop is running?
To know whether you are running Hadoop in standalone or pseudo-distributed mode, verify your configuration files. The information below might help.
The tool jps lists all running Java processes. From the console you can run
$ jps
and check whether JobTracker, TaskTracker and the HDFS daemons are running.
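A rough sketch of that check as a one-liner: count the Hadoop 1.x-era daemon names in the jps output. In pseudo-distributed mode each daemon runs as a separate JVM; in standalone mode none of them run.

```shell
# Count Hadoop daemons among running Java processes (daemon names are the
# Hadoop 1.x set; newer versions use ResourceManager/NodeManager instead).
count_daemons() {
  grep -cE 'NameNode|DataNode|SecondaryNameNode|JobTracker|TaskTracker'
}

# On a live machine you would pipe jps into it:
#   jps | count_daemons
# Example with captured jps output from a pseudo-distributed node:
printf '4321 NameNode\n4567 DataNode\n4890 JobTracker\n5012 TaskTracker\n5123 Jps\n' \
  | count_daemons    # prints 4 (the Jps process itself is not counted)
```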
Check your configuration files:
Go to the directory where the Hadoop configuration files are kept (/etc/hadoop on Ubuntu).
Look at the slaves and masters files: if both contain only localhost (or the local IP), it is pseudo-distributed. If the slaves file is empty, it is standalone.
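The rule above can be sketched as a small script (assumes a Hadoop 1.x-style conf layout with a slaves file; the conf path varies by install, so it is taken as a parameter here):

```shell
# Decide the mode from the slaves file: empty -> standalone; every entry
# is localhost -> pseudo-distributed; anything else -> fully distributed.
detect_mode() {   # $1 = Hadoop conf directory
  if [ ! -s "$1/slaves" ]; then
    echo standalone
  elif ! grep -qvE '^(localhost|127\.0\.0\.1)$' "$1/slaves"; then
    echo pseudo-distributed
  else
    echo fully-distributed
  fi
}

detect_mode "${HADOOP_CONF_DIR:-/etc/hadoop}"
```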

Need help adding multiple DataNodes in pseudo-distributed mode (one machine), using Hadoop-0.18.0

I am a student, interested in Hadoop and started to explore it recently.
I tried adding an additional DataNode in the pseudo-distributed mode but failed.
I am following the Yahoo developer tutorial, and so the version of Hadoop I am using is hadoop-0.18.0.
I tried to start up using 2 methods I found online:
Method 1 (link)
I have a problem with this line
bin/hadoop-daemon.sh --script bin/hdfs $1 datanode $DN_CONF_OPTS
--script bin/hdfs doesn't seem to be valid in the version I am using. I changed it to --config $HADOOP_HOME/conf2 with all the configuration files in that directory, but when the script is run it gives the error:
Usage: Java DataNode [-rollback]
Any idea what the error means? The log files are created, but the DataNode did not start.
Method 2 (link)
Basically, I duplicated the conf folder to a conf2 folder, making the necessary changes documented on the website to hadoop-site.xml and hadoop-env.sh. Then I ran the command
./hadoop-daemon.sh --config ..../conf2 start datanode
it gives the error:
datanode running as process 4190. stop it first.
So I guess this is the first DataNode that was started, and the command failed to start another DataNode.
Is there anything I can do to start an additional DataNode in the Yahoo VM Hadoop environment? Any help/advice would be greatly appreciated.
Hadoop's start/stop scripts use /tmp as the default directory for storing the PIDs of already-started daemons. In your situation, when you start the second datanode, the startup script finds the /tmp/hadoop-someuser-datanode.pid file from the first datanode and assumes that the datanode daemon is already started.
The plain solution is to set the HADOOP_PID_DIR environment variable to something other than /tmp. Also, do not forget to update all the network port numbers in conf2.
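A simplified sketch of the check that hadoop-daemon.sh performs (the real script's variable names differ, but the PID files follow the hadoop-&lt;user&gt;-&lt;daemon&gt;.pid naming convention), showing why a separate HADOOP_PID_DIR lets a second datanode start:

```shell
# Refuse to start if a PID file exists and that process is still alive,
# roughly what hadoop-daemon.sh does before launching a daemon.
check_datanode() {
  PID_DIR=${HADOOP_PID_DIR:-/tmp}
  pidfile="$PID_DIR/hadoop-${USER:-unknown}-datanode.pid"
  if [ -f "$pidfile" ] && kill -0 "$(cat "$pidfile")" 2>/dev/null; then
    echo "datanode running as process $(cat "$pidfile"). Stop it first."
    return 1
  fi
}

# Simulate the first datanode having written its PID under /tmp:
echo $$ > "/tmp/hadoop-${USER:-unknown}-datanode.pid"
check_datanode                      # refused: the PID file is present

# Point the second datanode at its own PID directory and it can start:
export HADOOP_PID_DIR=/tmp/hadoop-dn2-pids
mkdir -p "$HADOOP_PID_DIR"
check_datanode && echo "ok to start the second datanode"
```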
The smarter solution is to start a second VM with a Hadoop environment and join the two in a single cluster. That is the way Hadoop is intended to be used.

How to tell if I am about to run Hadoop streaming job on a cluster or in "local" mode?

Hadoop streaming will run the process in "local" mode when there is no Hadoop instance running on the box. I have a shell script that controls a set of Hadoop streaming jobs in sequence, and I need to condition the copying of files from HDFS to local disk on whether the jobs have been running locally or not. Is there a standard way to accomplish this test? I could do a "ps aux | grep something", but that seems ad hoc.
Hadoop streaming will run the process in "local" mode when there is no hadoop instance running on the box.
Can you please point to a reference for this?
A regular or a streaming job will run the way it is configured, so we know ahead of time which mode a job will run in. Check the documentation for configuring Hadoop on a single node and on a cluster in different modes.
Rather than trying to detect at run time which mode the process is operating in, it is probably better to wrap the tool you are developing in a bash script that explicitly selects local vs. cluster operation. The O'Reilly Hadoop book describes how to explicitly choose local mode using a configuration file override:
hadoop v2.MaxTemperatureDriver -conf conf/hadoop-local.xml input/ncdc/micro max-temp
where hadoop-local.xml is an XML file configured for local operation.
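For reference, a minimal local-mode override might look like this (a sketch using Hadoop 1.x property names, not necessarily the book's exact file):

```xml
<?xml version="1.0"?>
<configuration>
  <!-- use the local filesystem rather than HDFS -->
  <property>
    <name>fs.default.name</name>
    <value>file:///</value>
  </property>
  <!-- run MapReduce in-process instead of against a JobTracker -->
  <property>
    <name>mapred.job.tracker</name>
    <value>local</value>
  </property>
</configuration>
```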
I haven't tried this yet, but I think you can just read out the mapred.job.tracker configuration setting.
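One way to sketch that from the shell (assumes the classic mapred-site.xml layout with &lt;name&gt; and &lt;value&gt; on their own lines; the conf path is illustrative): when mapred.job.tracker is "local", or unset, the job runs in local mode.

```shell
# Extract the mapred.job.tracker value from a mapred-site.xml file.
get_job_tracker() {   # $1 = path to mapred-site.xml
  sed -n '/<name>mapred.job.tracker<\/name>/,/<\/value>/ s:.*<value>\(.*\)</value>.*:\1:p' "$1"
}

JT=$(get_job_tracker "${HADOOP_CONF_DIR:-/etc/hadoop}/mapred-site.xml" 2>/dev/null)
case "$JT" in
  local|'') echo "local mode (mapred.job.tracker is local or unset)" ;;
  *)        echo "cluster mode, JobTracker at $JT" ;;
esac
```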