Zeppelin won't start with SSL enabled - hadoop

I'm using Ambari 2.4.0.1 with HDP 2.5 and trying to configure Zeppelin to use SSL. When I set the zeppelin.ssl property to "true" I always get this error when starting the server:
ERROR [2017-01-24 02:13:43,456] ({main} ZeppelinServer.java[main]:118) - Error while running jettyServer
java.io.FileNotFoundException: /etc/zeppelin/2.5.3.0-37/0/null (No such file or directory)
at java.io.FileInputStream.open0(Native Method)
at java.io.FileInputStream.open(FileInputStream.java:195)
at java.io.FileInputStream.<init>(FileInputStream.java:138)
at org.eclipse.jetty.util.resource.FileResource.getInputStream(FileResource.java:290)
at org.eclipse.jetty.util.security.CertificateUtils.getKeyStore(CertificateUtils.java:43)
at org.eclipse.jetty.util.ssl.SslContextFactory.loadKeyStore(SslContextFactory.java:871)
at org.eclipse.jetty.util.ssl.SslContextFactory.doStart(SslContextFactory.java:273)
at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:68)
at org.eclipse.jetty.util.component.ContainerLifeCycle.start(ContainerLifeCycle.java:132)
at org.eclipse.jetty.util.component.ContainerLifeCycle.doStart(ContainerLifeCycle.java:114)
at org.eclipse.jetty.server.SslConnectionFactory.doStart(SslConnectionFactory.java:64)
at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:68)
at org.eclipse.jetty.util.component.ContainerLifeCycle.start(ContainerLifeCycle.java:132)
at org.eclipse.jetty.util.component.ContainerLifeCycle.doStart(ContainerLifeCycle.java:114)
at org.eclipse.jetty.server.AbstractConnector.doStart(AbstractConnector.java:256)
at org.eclipse.jetty.server.AbstractNetworkConnector.doStart(AbstractNetworkConnector.java:81)
at org.eclipse.jetty.server.ServerConnector.doStart(ServerConnector.java:236)
at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:68)
at org.eclipse.jetty.server.Server.doStart(Server.java:366)
at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:68)
at org.apache.zeppelin.server.ZeppelinServer.main(ZeppelinServer.java:116)
I have no idea what file it's trying to find in /etc/zeppelin/2.5.3.0-37/0/; the literal "null" at the end of the path suggests some property is resolving to null.
The zeppelin.ssl.keystore.path is set to conf/keystore and the keystore file is at that location. It's a relative path under /usr/hdp/current/zeppelin-server, and the conf dir there is actually a symlink to /etc/zeppelin/2.5.3.0-37/0/
I have client auth set to false, but set the truststore path nonetheless, and that didn't seem to make any difference.
If I toggle the zeppelin.ssl setting to "false" the server starts normally.
Any thoughts on what could be going on?

OK: in Ambari, the tooltip for the keystore path field says it should be a relative path, relative to the Zeppelin home directory. But just now, on a whim, I changed it to the absolute path, and now my server starts in SSL mode. I don't know whether the docs are wrong or it's a code bug, but it works with an absolute path, so at least I have a path forward.
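For reference, the working values I ended up with in Ambari looked roughly like this (a sketch; the absolute keystore path is the resolved target of the conf symlink on my cluster, so adjust for yours):
zeppelin.ssl = true
zeppelin.ssl.keystore.path = /etc/zeppelin/2.5.3.0-37/0/keystore
zeppelin.ssl.client.auth = false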

Related

Change server location for HDFS

I'm trying to follow the tutorial here: https://www.quickprogrammingtips.com/big-data/how-to-install-hadoop-on-mac-os-x-el-capitan.html, but I get a strange error when I run the line
sbin/start-dfs.sh
It doesn't raise any complaints when I run the script, but the namenode is not actually started. When I went to inspect the logs, I saw this error:
2020-01-30 13:30:52,700 INFO org.apache.hadoop.http.HttpServer2: HttpServer.start() threw a non Bind IOException
java.net.BindException: Port in use: censoredsite.com:0
at org.apache.hadoop.http.HttpServer2.openListeners(HttpServer2.java:995)
at org.apache.hadoop.http.HttpServer2.start(HttpServer2.java:932)
at org.apache.hadoop.hdfs.server.namenode.NameNodeHttpServer.start(NameNodeHttpServer.java:171)
at org.apache.hadoop.hdfs.server.namenode.NameNode.startHttpServer(NameNode.java:834)
at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:692)
at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:898)
at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:877)
at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1603)
at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1671)
Caused by: java.net.BindException: Can't assign requested address
at sun.nio.ch.Net.bind0(Native Method)
at sun.nio.ch.Net.bind(Net.java:433)
at sun.nio.ch.Net.bind(Net.java:425)
at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:223)
at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:74)
at org.mortbay.jetty.nio.SelectChannelConnector.open(SelectChannelConnector.java:216)
at org.apache.hadoop.http.HttpServer2.openListeners(HttpServer2.java:990)
Which was preceded by this line earlier:
2020-01-30 13:30:52,359 INFO org.apache.hadoop.hdfs.DFSUtil: Starting Web-server for hdfs at: http://censoredsite.com/archive:50070
It seems that the web server address for HDFS has somehow been set to something it shouldn't be. I searched around online, but I couldn't find what this value should properly be (I assume localhost?) or how to actually change it in the config files.
The other interesting thing is that this "censoredsite" is actually a, uh... lewd site I used to visit a few years ago. I have absolutely no idea how it managed to get into my HDFS configuration details; it's pretty worrying that it somehow worked its way into my computer. Does anyone know how to explicitly change the location used by org.apache.hadoop.hdfs.DFSUtil? Thanks.
It sounds like it ended up in your /etc/hosts file as a site mapping...
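A quick way to check (grep for whatever hostname shows up in your log; "censoredsite.com" here stands in for yours):
grep censoredsite.com /etc/hosts
Remove or comment out any line that maps it, then restart the namenode.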
The way to change the address, though, is the dfs.namenode.http-address property in hdfs-site.xml:
https://hadoop.apache.org/docs/r2.7.3/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml
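A minimal hdfs-site.xml entry would look something like this (localhost:50070 matches your assumption and the default NameNode HTTP port for Hadoop 2.x; adjust as needed):
<property>
  <name>dfs.namenode.http-address</name>
  <value>localhost:50070</value>
</property>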
Alternatively, install Hadoop in a VM, or download one of the Cloudera quickstart VMs, where it's all pre-configured.

Hadoop single-node starting issue

I'm trying to bring up the Hadoop standalone server (on AWS) by executing the start-dfs.sh script, but I got the error below:
Starting namenodes on [ip-xxx-xx-xxx-xx]
ip-xxx-xx-xxx-xx: Permission denied (publickey).
Starting datanodes
localhost: Permission denied (publickey).
Exception in thread "main" java.lang.UnsupportedClassVersionError: org/apache/hadoop/hdfs/tools/GetConf : Unsupported major.minor version 52.0
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClass(ClassLoader.java:808)
at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:442)
at java.net.URLClassLoader.access$100(URLClassLoader.java:64)
at java.net.URLClassLoader$1.run(URLClassLoader.java:354)
at java.net.URLClassLoader$1.run(URLClassLoader.java:348)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:347)
at java.lang.ClassLoader.loadClass(ClassLoader.java:430)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:323)
at java.lang.ClassLoader.loadClass(ClassLoader.java:363)
at sun.launcher.LauncherHelper.checkAndLoadMain(LauncherHelper.java:482)
The installed Java version is javac 1.7.0_181, and Hadoop is 3.0.3.
Below are the relevant exports in my profile file:
export JAVA_HOME=/usr
export PATH=$PATH:$JAVA_HOME/bin
export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin
export HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop
#export PATH=$PATH:$HADOOP_CONF_DIR
export SCALA_HOME=/usr/local/scala
export PATH=$PATH:$SCALA_HOME/bin
What is the issue? Is there anything I'm missing?
Thanks.
The Permission denied (publickey) errors mean the start scripts cannot SSH into the nodes, so set up passwordless SSH:
1. Run ssh-keygen.
2. It will ask for the folder location where it will copy the keys; I entered /home/hadoop/.ssh/id_rsa.
3. It will ask for a passphrase; keep it empty for simplicity.
4. Run cat /home/hadoop/.ssh/id_rsa.pub >> /home/hadoop/.ssh/authorized_keys to copy the newly generated public key to the auth file in your user's home/.ssh directory.
5. ssh localhost should no longer ask for a password.
6. Run start-dfs.sh (now it should work!).
Separately, the UnsupportedClassVersionError is a Java version mismatch: class file version 52.0 means the code was compiled for Java 8, while javac 1.7.0_181 is Java 7. Hadoop 3.0.3 needs a Java 8 runtime, so upgrade Java and point JAVA_HOME at the Java 8 install.
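For convenience, the same steps as one sequence (a sketch assuming the hadoop user's home is /home/hadoop, as in the steps above):
ssh-keygen -t rsa -P "" -f /home/hadoop/.ssh/id_rsa    # empty passphrase
cat /home/hadoop/.ssh/id_rsa.pub >> /home/hadoop/.ssh/authorized_keys
chmod 600 /home/hadoop/.ssh/authorized_keys            # sshd rejects overly permissive auth files
ssh localhost                                          # should log in without a password
start-dfs.sh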

Spark/Hadoop/Yarn cluster communication requires external ip?

I deployed Spark (1.3.1) with yarn-client on a Hadoop (2.6) cluster using bdutil. By default, the instances are created with ephemeral external IPs, and so far Spark works fine. Because of some security concerns, and assuming the cluster is accessed internally only, I removed the external IPs from the instances; after that, spark-shell would not even run. It seemed unable to communicate with Yarn/Hadoop and just got stuck indefinitely. Only after I added the external IPs back did spark-shell start working properly.
My question is: are external IPs required on the nodes to run Spark over YARN, and why? If yes, are there any concerns regarding security, etc.? Thanks!
Short Answer
You need external IP addresses to access GCS, and default bdutil settings set GCS as the default Hadoop filesystem, including for control files. Use ./bdutil -F hdfs ... deploy to use HDFS as the default instead.
Security shouldn't be a concern when using external IP addresses unless you've added too many permissive rules to your firewall rules in your GCE network config.
EDIT: At the moment there appears to be a bug where we set spark.eventLog.dir to a GCS path even if the default_fs is hdfs. I filed https://github.com/GoogleCloudPlatform/bdutil/issues/35 to track this. In the meantime, just manually edit /home/hadoop/spark-install/conf/spark-defaults.conf on your master (you might need sudo -u hadoop vim.tiny /home/hadoop/spark-install/conf/spark-defaults.conf to have edit permissions on it) to set spark.eventLog.dir to hdfs:///spark-eventlog-base or something else in HDFS, and run hadoop fs -mkdir -p hdfs:///spark-eventlog-base to get it working; the same steps are sketched below.
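Concretely, the workaround amounts to the following (paths exactly as in the edit above):
# in /home/hadoop/spark-install/conf/spark-defaults.conf
spark.eventLog.dir hdfs:///spark-eventlog-base
# then create the directory
hadoop fs -mkdir -p hdfs:///spark-eventlog-base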
Long Answer
By default, bdutil also configures Google Cloud Storage as the "default Hadoop filesystem", which means that control files used by Spark and YARN require access to Google Cloud Storage. Additionally, external IPs are required in order to access Google Cloud Storage.
I did manage to partially repro your case after manually configuring intra-network SSH; during startup I actually see the following:
15/06/26 17:23:05 INFO yarn.Client: Preparing resources for our AM container
15/06/26 17:23:05 INFO gcs.GoogleHadoopFileSystemBase: GHFS version: 1.4.0-hadoop2
15/06/26 17:23:26 WARN http.HttpTransport: exception thrown while executing request
java.net.SocketTimeoutException: connect timed out
at java.net.PlainSocketImpl.socketConnect(Native Method)
at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
at java.net.Socket.connect(Socket.java:579)
at sun.security.ssl.SSLSocketImpl.connect(SSLSocketImpl.java:625)
at sun.net.NetworkClient.doConnect(NetworkClient.java:175)
at sun.net.www.http.HttpClient.openServer(HttpClient.java:432)
at sun.net.www.http.HttpClient.openServer(HttpClient.java:527)
at sun.net.www.protocol.https.HttpsClient.<init>(HttpsClient.java:275)
at sun.net.www.protocol.https.HttpsClient.New(HttpsClient.java:371)
at sun.net.www.protocol.https.AbstractDelegateHttpsURLConnection.getNewHttpClient(AbstractDelegateHttpsURLConnection.java:191)
at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:933)
at sun.net.www.protocol.https.AbstractDelegateHttpsURLConnection.connect(AbstractDelegateHttpsURLConnection.java:177)
at sun.net.www.protocol.https.HttpsURLConnectionImpl.connect(HttpsURLConnectionImpl.java:153)
at com.google.api.client.http.javanet.NetHttpRequest.execute(NetHttpRequest.java:93)
at com.google.api.client.http.HttpRequest.execute(HttpRequest.java:965)
at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:410)
at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:343)
at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.execute(AbstractGoogleClientRequest.java:460)
at com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl.getBucket(GoogleCloudStorageImpl.java:1557)
at com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl.getItemInfo(GoogleCloudStorageImpl.java:1512)
at com.google.cloud.hadoop.gcsio.CacheSupplementedGoogleCloudStorage.getItemInfo(CacheSupplementedGoogleCloudStorage.java:516)
at com.google.cloud.hadoop.gcsio.GoogleCloudStorageFileSystem.getFileInfo(GoogleCloudStorageFileSystem.java:1016)
at com.google.cloud.hadoop.gcsio.GoogleCloudStorageFileSystem.exists(GoogleCloudStorageFileSystem.java:382)
at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.configureBuckets(GoogleHadoopFileSystemBase.java:1639)
at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem.configureBuckets(GoogleHadoopFileSystem.java:71)
at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.configure(GoogleHadoopFileSystemBase.java:1587)
at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.initialize(GoogleHadoopFileSystemBase.java:776)
at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.initialize(GoogleHadoopFileSystemBase.java:739)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2596)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2630)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2612)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:169)
at org.apache.spark.deploy.yarn.Client.prepareLocalResources(Client.scala:216)
at org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:384)
at org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:102)
at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:58)
at org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:141)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:381)
at org.apache.spark.repl.SparkILoop.createSparkContext(SparkILoop.scala:1016)
at $line3.$read$$iwC$$iwC.<init>(<console>:9)
at $line3.$read$$iwC.<init>(<console>:18)
at $line3.$read.<init>(<console>:20)
at $line3.$read$.<init>(<console>:24)
at $line3.$read$.<clinit>(<console>)
at $line3.$eval$.<init>(<console>:7)
at $line3.$eval$.<clinit>(<console>)
at $line3.$eval.$print(<console>)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1338)
at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:856)
at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:901)
at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:813)
at org.apache.spark.repl.SparkILoopInit$$anonfun$initializeSpark$1.apply(SparkILoopInit.scala:123)
at org.apache.spark.repl.SparkILoopInit$$anonfun$initializeSpark$1.apply(SparkILoopInit.scala:122)
at org.apache.spark.repl.SparkIMain.beQuietDuring(SparkIMain.scala:324)
at org.apache.spark.repl.SparkILoopInit$class.initializeSpark(SparkILoopInit.scala:122)
at org.apache.spark.repl.SparkILoop.initializeSpark(SparkILoop.scala:64)
at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1$$anonfun$apply$mcZ$sp$5.apply$mcV$sp(SparkILoop.scala:973)
at org.apache.spark.repl.SparkILoopInit$class.runThunks(SparkILoopInit.scala:157)
at org.apache.spark.repl.SparkILoop.runThunks(SparkILoop.scala:64)
at org.apache.spark.repl.SparkILoopInit$class.postInitialization(SparkILoopInit.scala:106)
at org.apache.spark.repl.SparkILoop.postInitialization(SparkILoop.scala:64)
at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:990)
at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:944)
at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:944)
at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$process(SparkILoop.scala:944)
at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1058)
at org.apache.spark.repl.Main$.main(Main.scala:31)
at org.apache.spark.repl.Main.main(Main.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:569)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:166)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:189)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:110)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
As expected, simply calling org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start makes it try to contact Google Cloud Storage, and it fails because there's no GCS access without external IPs.
To get around this, you can simply use -F hdfs when creating your cluster to use HDFS as your default filesystem; in that case everything should work intra-cluster even without external IP addresses. In that mode, you can even continue to use GCS whenever you have external IP addresses assigned, by specifying full gs://bucket/object paths as your Hadoop arguments. However, note that once you've removed the external IP addresses, you won't be able to use GCS unless you also configure a proxy server and funnel all data through your proxy; the GCS config for that is fs.gs.proxy.address.
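If you do go the proxy route, the setting belongs in core-site.xml; a sketch (the proxy endpoint below is hypothetical):
<property>
  <name>fs.gs.proxy.address</name>
  <!-- hypothetical proxy host:port; point this at your actual proxy -->
  <value>my-proxy-host:3128</value>
</property>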
In general, there's no need to worry about security just because of having external IP addresses unless you've opened up new permissive rules in your "default" network firewall rules in Google Compute Engine.

Hadoop 2.2.0 - RecoveryInProgressException while appending content to an existing file

I am not able to append content to an existing file in HDFS. The exception is thrown at the following line:
outputStream = hdfs.append(dirPath);
where dirPath is "hdfs://master:54310/test/Readme.txt".
Please note that I am running Hadoop on a single node for development.
The exception log is given below:
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.protocol.RecoveryInProgressException): Failed to close file /test/Readme.txt. Lease recovery is in progress. Try again later.
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.recoverLeaseInternal(FSNamesystem.java:2310)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFileInternal(FSNamesystem.java:2153)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFileInt(FSNamesystem.java:2386)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFile(FSNamesystem.java:2347)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.append(NameNodeRpcServer.java:508)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.append(ClientNamenodeProtocolServerSideTranslatorPB.java:320)
at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java:59572)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2048)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2044)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2042)
Also note that at the same line, I sometimes get this exception:
failed to create file /test/Readme.txt for DFSClient_NONMAPREDUCE_187487992_15 on client 127.0.0.1 because current leaseholder is trying to recreate file
Could anyone please elaborate as to why these exceptions are thrown?
Even though dfs.replication is set to 1 in hdfs-site.xml, the replication value still ends up as 3 (please suggest why this happens; I have restarted a couple of times). I had to set the replication to 1 from the code to make it work, using:
hdfs.setReplication(dirPath, (short) 1);
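For reference, a minimal sketch of the full append sequence that works on my setup (the filesystem URI and path come from the question; the appended text is just an example):
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class AppendExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem hdfs = FileSystem.get(URI.create("hdfs://master:54310"), conf);
        Path dirPath = new Path("/test/Readme.txt");

        // Force replication 1 so the append pipeline can be satisfied on a single-node cluster
        hdfs.setReplication(dirPath, (short) 1);

        FSDataOutputStream outputStream = hdfs.append(dirPath);
        outputStream.writeBytes("appended content\n");
        outputStream.close();
        hdfs.close();
    }
}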

nutch2.0 hadoop Input path does not exist

Exception in thread "main" org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: hdfs://yuqing-namenode:9000/user/yuqing/2
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:235)
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:252)
at org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:962)
at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:979)
at org.apache.hadoop.mapred.JobClient.access$600(JobClient.java:174)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:897)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:850)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:850)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:500)
at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:530)
at org.apache.nutch.util.NutchJob.waitForCompletion(NutchJob.java:50)
at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:219)
at org.apache.nutch.crawl.Crawler.runTool(Crawler.java:68)
at org.apache.nutch.crawl.Crawler.run(Crawler.java:136)
at org.apache.nutch.crawl.Crawler.run(Crawler.java:250)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.nutch.crawl.Crawler.main(Crawler.java:257)
When I remove the Hadoop config files from the Nutch conf directory, the first line of the error becomes:
Exception in thread "main" org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: file:/home/yuqing/workspace/nutch2.0/2
I once ran Nutch 2.0 successfully with HBase, but now the fully distributed setup does not work.
HBase in fully distributed mode runs normally; I can operate it from the shell.
Next I created a folder in nutch2.0, and then the crawler could run, but the console output seemed abnormal.
Now I have to go have a meal.
Looks like you don't have an input path, exactly as Hadoop said.
Check that hdfs dfs -ls /user/yuqing/2 returns something (2 should be a file or directory).
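A quick way to check and, if needed, create the input path (taken from the error message):
hdfs dfs -ls /user/yuqing/2        # verify the input path exists
hdfs dfs -mkdir -p /user/yuqing/2  # create it if it is missing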
As for the second part: when you remove the Hadoop configs, the Hadoop library uses its internal defaults (you can find them in the distribution under names like *-default.xml, e.g. core-default.xml), and Hadoop runs in 'local' mode. In 'local' mode all paths are local (in the local filesystem).
So when you reference a file in 'hdfs' mode, e.g. hdfs dfs -ls /some/file, Hadoop will look the file up in HDFS (hdfs://namenode.ip/some/file), but in local mode the file will be resolved against the local filesystem (typically file:/home/user/some/file).
You can see that in your output: file:/home/yuqing/workspace/nutch2.0/2
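Which filesystem is used by default is controlled by fs.defaultFS (fs.default.name in older Hadoop releases) in core-site.xml; a sketch using the namenode from the error message:
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://yuqing-namenode:9000</value>
</property>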
