Why doesn't Hadoop work with Windows 7 under Cygwin? - hadoop

I am trying to install and start Hadoop 1.1.2 under Cygwin on Windows 7. I get the following error when attempting to run a simple job:
bin/hadoop jar hadoop-*-examples.jar pi 10 100
13/04/26 17:56:10 WARN hdfs.DFSClient: DataStreamer Exception: org.apache.hadoop.ipc.RemoteException: java.io.IOException: File /user/username/PiEstimator_TMP_3_141592654/in/part0 could only be replicated to 0 nodes, instead of 1
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1639)
at org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:736)
at sun.reflect.GeneratedMethodAccessor6.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:601)
My configuration is as follows:
mapred-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>localhost:9001</value>
</property>
</configuration>
hdfs-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>
core-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>

The answer is in the logs. There will be a specific exception detailed there, most likely a file-access problem requiring you to chmod -R 755 some directory and its contents.
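As a concrete starting point, here is a sketch of digging the real exception out of the logs; both the log path and the /tmp/hadoop-username data directory are assumptions for a default 1.1.2 tarball install, so adjust them for your setup:

```shell
# Search the DataNode log for the underlying exception (log path assumed
# for a Hadoop 1.1.2 tarball unpacked in the home directory).
grep -n "Exception\|ERROR" ~/hadoop-1.1.2/logs/*datanode*.log | tail -20

# If it turns out to be a permissions problem, open up the offending
# directory (note: the -R flag comes before the mode and path).
chmod -R 755 /tmp/hadoop-username
```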

This error is not because you are trying to run Hadoop on Windows. It's because there is some problem with your DataNode. Along with the point Chris Gerken made, there could be other reasons as well. I answered a similar question recently; you should have a look at it: Upload data to HDFS running in Amazon EC2 from local non-Hadoop Machine
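One quick way to confirm the DataNode is the culprit is to ask HDFS how many live nodes it sees. This is a sketch using standard Hadoop 1.x commands, run from the Hadoop install directory:

```shell
# Report cluster health; on a healthy pseudo-distributed setup this should
# show one available (non-dead) DataNode.
bin/hadoop dfsadmin -report

# Also check that a DataNode JVM is actually running.
jps | grep DataNode
```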

Related

Nutch and HBase configuration error

I am trying to get nutch and hbase working based on this docker image: https://hub.docker.com/r/cogfor/nutch/
I am getting an exception when I try to inject a URL file:
InjectorJob: starting at 2017-12-19 20:49:45
InjectorJob: Injecting urlDir: urls
Exception in thread "main" java.lang.NoClassDefFoundError:
org/apache/hadoop/hbase/HBaseConfiguration
at org.apache.gora.hbase.store.HBaseStore.initialize(HBaseStore.java:114)
at org.apache.gora.store.DataStoreFactory.initializeDataStore(DataStoreFactory.java:102)
at org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:161)
at org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:135)
at org.apache.nutch.storage.StorageUtils.createWebStore(StorageUtils.java:78)
at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:218)
at org.apache.nutch.crawl.InjectorJob.inject(InjectorJob.java:252)
at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:275)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.nutch.crawl.InjectorJob.main(InjectorJob.java:284)
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.hbase.HBaseConfiguration
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
I know there is some misconfiguration between Nutch/HBase/Hadoop.
My gora.properties has:
gora.datastore.default=org.apache.gora.hbase.store.HBaseStore
My hbase-site.xml has:
<configuration>
<property>
<name>hbase.rootdir</name>
<value>file:///data</value>
</property>
<property>
<name>hbase.cluster.distributed</name>
<value>false</value>
</property>
</configuration>
And my nutch-site.xml has:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>http.agent.name</name>
<value>My Spider</value>
</property>
<property>
<name>storage.data.store.class</name>
<value>org.apache.gora.hbase.store.HBaseStore</value>
<description>Default class for storing data</description>
</property>
<property>
<name>plugin.includes</name>
<value>protocol-httpclient|urlfilter-regex|parse-(text|tika|js)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
<property>
<name>db.ignore.external.links</name>
<value>true</value>
</property>
<property>
<name>parser.character.encoding.default</name>
<value>utf-8</value>
</property>
<property>
<name>http.content.limit</name>
<value>6553600</value>
</property>
</configuration>
This same error is reported multiple times on S.O., but none of the solutions worked for me. The $HBASE_HOME and $HADOOP_CLASSPATH env variables are set to:
root@a5fb7fefc53e:/nutch_source/runtime/local/bin# echo $HADOOP_CLASSPATH
/opt/hbase-0.98.21-hadoop2/lib/hbase-client-0.98.21-hadoop2.jar:/opt/hbase-0.98.21-hadoop2/lib/hbase-common-0.98.12-hadoop2.jar:/opt/hbase-0.98.21-hadoop2/lib/protobuf-java-2.5.0.jar:/opt/hbase-0.98.21-hadoop2/lib/guava-12.0.1.jar:/opt/hbase-0.98.21-hadoop2/lib/zookeeper-3.4.6.jar:/opt/hbase-0.98.21-hadoop2/lib/hbase-protocol-0.98.12-hadoop2.jar
root@a5fb7fefc53e:/nutch_source/runtime/local/bin# echo $HBASE_HOME
/opt/hbase-0.98.21-hadoop2
I verified all those files exist.
Can someone please help me figure out what I am missing?
The issue is mentioned in the documentation (https://wiki.apache.org/nutch/Nutch2Tutorial)
"N.B. It's possible to encounter the following exception: java.lang.NoClassDefFoundError: org/apache/hadoop/hbase/HBaseConfiguration; this is caused by the fact that sometimes the hbase TEST jar is deployed in the lib dir. To resolve this just copy the lib over from your installed HBase dir into the build lib dir. (This issue is currently in progress)."
All that needs to be done is this:
cp -R /root/hbase/lib/* /root/nutch/lib/
and nutch will start working fine.
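After the copy, it may be worth confirming that a non-test HBase jar actually landed in Nutch's lib directory; the missing HBaseConfiguration class should be provided by one of the hbase-* jars (paths here are the ones from the answer above):

```shell
# List the HBase jars now on Nutch's classpath, excluding the test jars
# that the tutorial warns about.
ls /root/nutch/lib | grep "^hbase-" | grep -v tests
```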

Hadoop Multi-Node Cluster Installation on Ubuntu Issue - Troubleshoot

I have three Ubuntu 12.04 LTS computers that I want to install Hadoop on in a Master/Slave configuration as described here. It says to first install Hadoop as a single node and then proceed to multi-node. The single node installation works perfectly fine. I made the required changes to the /etc/hosts file and configured everything just as the guide says, but when I start the Hadoop cluster on the master, I get an error.
My computers are aptly named ironman, superman and batman, with batman (who else?) being the master node. When I do sudo bin/start-dfs.sh, the following shows up.
When I enter the password, I get this:
When I try sudo bin/start-all.sh, I get this:
I can ssh into the different terminals, but there's something that's not quite right. I checked the logs on superman/slave terminal and it says that it can't connect to batman:54310 and some zzz message. I figured my /etc/hosts is wrong but in fact, it is:
I tried to open port 54310 by changing iptables, but the output screens shown here are after I made the changes. I'm at my wit's end. Please tell me where I'm going wrong. Please do let me know if you need any more information and I will update the question accordingly. Thanks!
UPDATE: Here are my conf files.
core-site.xml (Please note that I had batman:54310 here earlier instead of the IP address. I only changed it because I thought I'd make the binding more explicit.)
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>hadoop.tmp.dir</name>
<value>/app/hadoop/tmp</value>
<description>A base for other temporary directories.</description>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://130.65.153.195:54310</value>
<description>The name of the default file system. A URI whose
scheme and authority determine the FileSystem implementation. The
uri's scheme determines the config property (fs.SCHEME.impl) naming
the FileSystem implementation class. The uri's authority is used to
determine the host, port, etc. for a filesystem.</description>
</property>
</configuration>
mapred-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>130.65.153.195:54311</value>
<description>The host and port that the MapReduce job tracker runs
at. If "local", then jobs are run in-process as a single map
and reduce task.
</description>
</property>
</configuration>
hdfs-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>dfs.replication</name>
<value>3</value>
<description>Default block replication.
The actual number of replications can be specified when the file is created.
The default is used if replication is not specified in create time.
</description>
</property>
</configuration>
My conf/masters file is simply batman and my conf/slaves file is just:
batman
superman
ironman
Hope this clarifies things.
First things first: Make sure you can ping the master from each slave and each slave from the master. Log in to each machine individually and ping the other two hosts. Make sure they are reachable via their hostnames. It is possible that you have not added the /etc/hosts entries on the slaves.
Secondly, you need to setup passwordless SSH access. You can use ssh-keygen -t rsa and ssh-copy-id for this. This will help remove the password prompts. It is a good idea to create a separate user for this (and not use root).
If this doesn't help, please post your log output.

hadoop: having more than one reducers under pseudo distributed environment?

I am a newbie to hadoop. I have successfully configured a Hadoop setup in pseudo-distributed mode. I want to have multiple reducers, using the option -D mapred.reduce.tasks=2 (with hadoop-streaming); however, there's still only one reducer.
According to Google, I'm sure that mapred.LocalJobRunner limits the number of reducers to 1. But I wonder, is there any workaround to have more reducers?
my hadoop configuration files:
[admin@localhost string-count-hadoop]$ cat ~/hadoop-1.1.2/conf/core-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/home/admin/hadoop-data/tmp</value>
</property>
</configuration>
[admin@localhost string-count-hadoop]$ cat ~/hadoop-1.1.2/conf/mapred-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>localhost:9001</value>
</property>
</configuration>
[admin@localhost string-count-hadoop]$ cat ~/hadoop-1.1.2/conf/hdfs-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>dfs.name.dir</name>
<value>/home/admin/hadoop-data/name</value>
</property>
<property>
<name>dfs.data.dir</name>
<value>/home/admin/hadoop-data/data</value>
</property>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>
the way I start job:
[admin@localhost string-count-hadoop]$ cat hadoop-startjob.sh
#!/bin/sh
~/hadoop-1.1.2/bin/hadoop jar ~/hadoop-1.1.2/contrib/streaming/hadoop-streaming-1.1.2.jar \
-D mapred.job.name=string-count \
-D mapred.reduce.tasks=2 \
-mapper mapper \
-file mapper \
-reducer reducer \
-file reducer \
-input $1 \
-output $2
[admin@localhost string-count-hadoop]$ ./hadoop-startjob.sh /z/programming/testdata/items_sequence /z/output
packageJobJar: [mapper, reducer] [] /tmp/streamjob837249979139287589.jar tmpDir=null
13/07/17 20:21:10 INFO util.NativeCodeLoader: Loaded the native-hadoop library
13/07/17 20:21:10 WARN snappy.LoadSnappy: Snappy native library not loaded
13/07/17 20:21:10 INFO mapred.FileInputFormat: Total input paths to process : 1
13/07/17 20:21:11 WARN mapred.LocalJobRunner: LocalJobRunner does not support symlinking into current working dir.
...
...
Try modifying core-site.xml's property
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
</property>
to,
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000/</value>
</property>
Put an extra / after 9000 and restart all the daemons.

hbase 0.95.1 fails on hadoop-2.0.5 alpha

I installed hadoop-2.0.5-alpha, hbase-0.95.1-hadoop2, and zookeeper-3.4.5. Hadoop and ZooKeeper are running fine; HDFS and MR2 work great. But HBase will not boot. Has anyone seen this error before? I'll post my config and logs below. Thanks in advance for your help.
hbase-site.xml :
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>hbase.zookeeper.quorum</name>
<value>master</value>
<description>The directory shared by RegionServers.
</description>
</property>
<property>
<name>hbase.zookeeper.property.dataDir</name>
<value>master</value>
<description>Property from ZooKeeper's config zoo.cfg.
The directory where the snapshot is stored.
</description>
</property>
<property>
<name>hbase.rootdir</name>
<value>hdfs://master:8020/hbase</value>
<description>The directory shared by RegionServers.
</description>
</property>
<property>
<name>hbase.cluster.distributed</name>
<value>true</value>
<description>The mode the cluster will be in. Possible values are
false: standalone and pseudo-distributed setups with managed Zookeeper
true: fully-distributed with unmanaged Zookeeper Quorum (see hbase-env.sh)
</description>
</property>
</configuration>
hbase-xxxx-master-master.log :
2013-07-02 14:33:14,791 FATAL [master:master:60000] master.HMaster: Unhandled
exception. Starting shutdown.
java.io.IOException: Failed on local exception: com.google.protobuf.InvalidProtocolBufferException: Message missing required fields: callId, status; Host Details : local host is: "master/192.168.255.130"; destination host is: "master":8020;
at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:760)
at org.apache.hadoop.ipc.Client.call(Client.java:1168)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:202)
at com.sun.proxy.$Proxy10.setSafeMode(Unknown Source)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:164)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:83)
at com.sun.proxy.$Proxy10.setSafeMode(Unknown Source)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.setSafeMode(ClientNamenodeProtocolTranslatorPB.java:514)
at org.apache.hadoop.hdfs.DFSClient.setSafeMode(DFSClient.java:1896)
at org.apache.hadoop.hdfs.DistributedFileSystem.setSafeMode(DistributedFileSystem.java:660)
at org.apache.hadoop.hbase.util.FSUtils.isInSafeMode(FSUtils.java:421)
at org.apache.hadoop.hbase.util.FSUtils.waitOnSafeMode(FSUtils.java:828)
at org.apache.hadoop.hbase.master.MasterFileSystem.checkRootDir(MasterFileSystem.java:464)
at org.apache.hadoop.hbase.master.MasterFileSystem.createInitialFileSystemLayout(MasterFileSystem.java:153)
at org.apache.hadoop.hbase.master.MasterFileSystem.<init>(MasterFileSystem.java:137)
at org.apache.hadoop.hbase.master.HMaster.finishInitialization(HMaster.java:728)
at org.apache.hadoop.hbase.master.HMaster.run(HMaster.java:546)
at java.lang.Thread.run(Thread.java:662)
Make sure you have built HBase properly (keeping all the hadoop-2.0.5 dependencies in mind). Verify that the hadoop-core jar in the hbase/lib directory is the same as the hadoop jar inside your Hadoop installation. Check the version of hadoop in your pom.xml and build HBase accordingly.
If you still face any issue you can try the patch from HBASE-7904 and rebuild your HBase.
HTH
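A quick way to compare the Hadoop jars HBase ships with against your installed Hadoop; the install directories below are assumptions based on the versions mentioned in the question:

```shell
# The hadoop-* jars bundled with HBase...
ls ~/hbase-0.95.1-hadoop2/lib/ | grep "^hadoop-"

# ...should match the version of the Hadoop you are actually running.
ls ~/hadoop-2.0.5-alpha/share/hadoop/common/ | grep "^hadoop-common"
```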
There may be a compatibility issue when installing HBase with Hadoop 2.x; please check.

Localhost-only pseudo-distributed hadoop installation

I am trying to make a pseudo-distributed Hadoop installation on my Gentoo machine. I want nothing to be visible from the outside network - e.g. jobtracker and namenode web interfaces - localhost:50030 and localhost:50070. However, I noticed that I can access these from within my home network.
How do I restrict all daemons to listen to localhost only?
I've used the configuration suggested by Hadoop:
core-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://127.0.0.1:9000</value>
</property>
</configuration>
mapred-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
<property>
<name>mapred.job.tracker</name>
<value>127.0.0.1:9001</value>
</property>
</configuration>
I also enforced IPv4 (taken from this guide):
hadoop-env.sh
export HADOOP_OPTS=-Djava.net.preferIPv4Stack=true
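The fs.default.name and mapred.job.tracker settings above only bind the RPC ports; the web interfaces on 50030/50070 are controlled by separate address properties that default to 0.0.0.0. A sketch of pinning them to loopback, using what I believe are the Hadoop 1.x property names (verify them against your version's defaults before relying on this):

```xml
<!-- hdfs-site.xml: bind the NameNode web UI to loopback only -->
<property>
<name>dfs.http.address</name>
<value>127.0.0.1:50070</value>
</property>

<!-- mapred-site.xml: bind the JobTracker web UI to loopback only -->
<property>
<name>mapred.job.tracker.http.address</name>
<value>127.0.0.1:50030</value>
</property>
```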

Resources