nutch2.0 hadoop Input path does not exist - hadoop

Exception in thread "main" org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: hdfs://yuqing-namenode:9000/user/yuqing/2
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:235)
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:252)
at org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:962)
at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:979)
at org.apache.hadoop.mapred.JobClient.access$600(JobClient.java:174)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:897)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:850)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:850)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:500)
at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:530)
at org.apache.nutch.util.NutchJob.waitForCompletion(NutchJob.java:50)
at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:219)
at org.apache.nutch.crawl.Crawler.runTool(Crawler.java:68)
at org.apache.nutch.crawl.Crawler.run(Crawler.java:136)
at org.apache.nutch.crawl.Crawler.run(Crawler.java:250)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.nutch.crawl.Crawler.main(Crawler.java:257)
When I remove the config file of hadoop from nutch conf, the first line of error become:
Exception in thread "main" org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: file:/home/yuqing/workspace/nutch2.0/2
Once I run Nutch2.0 success with hbase, but now the full distribution is not work.
Hbase in full distribution runs normal, I can op it in shell.
next I create a folder in nutch2.0, then the crawler can running, but output of console seems unnormal.
Now I have to have a meal.

Looks like you don't have input path. Excactly as hadoop said.
Check, that hdfs dfs -ls /user/yuqing/2 return something (2 should be file or directory)
As for the second part, when you remove hadoop configs, hadoop library uses internal configs (you can find them in distribution with names *-default.xml, f.e. core-default.xml), and hadoop functions in 'local' mode. In 'local' mode all paths are local (in local filesystem).
So, when you reference file in 'hdfs' mode, f.e. hdfs dfs -ls /some/file, hadoop will lookup file in hdfs (hdfs://namenode.ip/some/file), but in local mode file will be searched in relative (typically file:/home/user/some/file).
You can see that in your output: file:/home/yuqing/workspace/nutch2.0/2

Related

SQOOP Import Fails, File Not Found Exception

I am new to hadoop architectures system and installed components using web search. For this i installed Hadoop, sqoop, hive. Here is directory structure for my installations (my local ubuntu machine instead and any vm, Each my installation is on separate directory):-
/usr/local/hadoop
/usr/local/sqoop
/usr/local/hive
By looking at error i tried to resolve it and so i copied sqoop (local machine /usr/local/sqoop) folder to hdfs directory (hdfs://localhost:54310/usr/local/sqoop). This resolved my problem. I want to know certain things from this:-
Before coping my sqoop to hdfs, is my installation right?
Is it necessary to copy sqoop directory from ext file system to hdfs file system.
16/07/02 13:22:15 ERROR tool.ImportTool: Encountered IOException running import job: java.io.FileNotFoundException: File does not exist: hdfs://localhost:54310/usr/local/sqoop/lib/avro-mapred-1.7.5-hadoop2.jar
at org.apache.hadoop.hdfs.DistributedFileSystem$18.doCall(DistributedFileSystem.java:1122)
at org.apache.hadoop.hdfs.DistributedFileSystem$18.doCall(DistributedFileSystem.java:1114)
at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1114)
at org.apache.hadoop.mapreduce.filecache.ClientDistributedCacheManager.getFileStatus(ClientDistributedCacheManager.java:288)
at org.apache.hadoop.mapreduce.filecache.ClientDistributedCacheManager.getFileStatus(ClientDistributedCacheManager.java:224)
at org.apache.hadoop.mapreduce.filecache.ClientDistributedCacheManager.determineTimestamps(ClientDistributedCacheManager.java:93)
at org.apache.hadoop.mapreduce.filecache.ClientDistributedCacheManager.determineTimestampsAndCacheVisibilities(ClientDistributedCacheManager.java:57)
at org.apache.hadoop.mapreduce.JobSubmitter.copyAndConfigureFiles(JobSubmitter.java:269)
at org.apache.hadoop.mapreduce.JobSubmitter.copyAndConfigureFiles(JobSubmitter.java:390)
at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:483)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1296)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1293)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:1293)
at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1314)
at org.apache.sqoop.mapreduce.ImportJobBase.doSubmitJob(ImportJobBase.java:196)
at org.apache.sqoop.mapreduce.ImportJobBase.runJob(ImportJobBase.java:169)
at org.apache.sqoop.mapreduce.ImportJobBase.runImport(ImportJobBase.java:266)
at org.apache.sqoop.manager.SqlManager.importTable(SqlManager.java:673)
at org.apache.sqoop.manager.MySQLManager.importTable(MySQLManager.java:118)
at org.apache.sqoop.tool.ImportTool.importTable(ImportTool.java:497)
at org.apache.sqoop.tool.ImportTool.run(ImportTool.java:605)
at org.apache.sqoop.Sqoop.run(Sqoop.java:143)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.sqoop.Sqoop.runSqoop(Sqoop.java:179)
at org.apache.sqoop.Sqoop.runTool(Sqoop.java:218)
at org.apache.sqoop.Sqoop.runTool(Sqoop.java:227)
at org.apache.sqoop.Sqoop.main(Sqoop.java:236)
No problem with the installation, there is no need to copy all the files from sqoop directory just copy the sqoop library files into hdfs.
Create a directory structure in hdfs which is same as $SQOOP_HOME/lib.
Example:hdfs dfs -mkdir -p /usr/lib/sqoop
Copy all sqoop library files from $SQOOP_HOME/lib to hdfs lib
Example:hdfs dfs -put /usr/lib/sqoop/lib/* /usr/lib/sqoop/lib

Google Cloud connector for Hadoop doesn't work with Pig

I'm using Hadoop with HDFS 2.7.1.2.4 and Pig 0.15.0.2.4 (Hortonworks HDP 2.4) and trying to use Google Cloud Storage Connector for Spark and Hadoop (bigdata-interop on GitHub). It works correctly when I try, say,
hadoop fs -ls gs://bucket-name
But when I try the following in Pig (in mapreduce mode):
data = LOAD 'gs://softline/o365.avro' USING AvroStorage();
data = STORE data INTO 'gs://softline/o366.avro' USING AvroStorage();
Pig fails with the following errors:
org.apache.pig.backend.executionengine.ExecException: ERROR 2118: Wrong FS scheme: hdfs, in path: hdfs://hdp.slweb.ru:8020/user/root, expected scheme: gs
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.getSplits(PigInputFormat.java:279)
at org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:301)
at org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:318)
at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:196)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1290)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1287)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:1287)
at org.apache.hadoop.mapreduce.lib.jobcontrol.ControlledJob.submit(ControlledJob.java:335)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.apache.pig.backend.hadoop23.PigJobControl.submit(PigJobControl.java:128)
at org.apache.pig.backend.hadoop23.PigJobControl.run(PigJobControl.java:194)
at java.lang.Thread.run(Thread.java:745)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher$1.run(MapReduceLauncher.java:276)
Caused by: java.lang.IllegalArgumentException: Wrong FS scheme: hdfs, in path: hdfs://hdp.slweb.ru:8020/user/root, expected scheme: gs
at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.checkPath(GoogleHadoopFileSystemBase.java:741)
at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem.checkPath(GoogleHadoopFileSystem.java:90)
at org.apache.hadoop.fs.FileSystem.makeQualified(FileSystem.java:466)
at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.makeQualified(GoogleHadoopFileSystemBase.java:701)
at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem.getGcsPath(GoogleHadoopFileSystem.java:163)
at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.setWorkingDirectory(GoogleHadoopFileSystemBase.java:1094)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.getSplits(PigInputFormat.java:235)
... 18 more
If needed I could post the logs of GC connectors.
Hame somebody used Pig with this connectors? Any help would be appeciated.
TL;DR explicitly set workmapreduce.job.working.dir=/user/root/ when starting the pig job
If a working directory has not been explicitly set during job submission then Hadoop will set the working directory to be the working directory of the default filesystem. When using HDFS as your default FS the working directory will generally be something like 'hdfs://namenode:port/user/<your username>'.
When PigInputFormat#getSplits is called, it fetches the FileSystem associated with the path of the input that it is operating on. In this case the filesystem is an instance of GoogleHadoopFileSystem. Pig then inspects the path of its input and if the path is non-local calls FileSystem#setWorkingDirectory(job.getWorkingDirectory()). The problem here is that the job's working directory is 'hdfs://namenode:port/user/<your username>' which GoogleHadoopFileSystem will reject as a path to set as its own working directory (as it only supports 'gs://' paths).

Hadoop Mapreduce Error Input path does not exist: hdfs://localhost:54310/user/hduser/input"

I have installed hadoop 2.6 in Ubuntu Linux 15.04 and its running fine. But, when I am running a sample test mapreduce program, its giving the following error:
org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: hdfs://localhost:54310/user/hduser/input.
Kindly help me. Below is the complete details of the error.
hduser#krishadoop:/usr/local/hadoop/sbin$ hadoop jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.0.jar wordcount input output
Picked up JAVA_TOOL_OPTIONS: -javaagent:/usr/share/java/jayatanaag.jar
15/08/24 15:22:37 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/08/24 15:22:38 INFO Configuration.deprecation: session.id is deprecated. Instead, use dfs.metrics.session-id
15/08/24 15:22:38 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
15/08/24 15:22:39 INFO mapreduce.JobSubmitter: Cleaning up the staging area file:/app/hadoop/tmp/mapred/staging/hduser1122930879/.staging/job_local1122930879_0001
org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: hdfs://localhost:54310/user/hduser/input
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:321)
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:264)
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:385)
at org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:597)
at org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:614)
at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:492)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1296)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1293)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:1293)
at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1314)
at org.apache.hadoop.examples.WordCount.main(WordCount.java:87)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:71)
at org.apache.hadoop.util.ProgramDriver.run(ProgramDriver.java:144)
at org.apache.hadoop.examples.ExampleDriver.main(ExampleDriver.java:74)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
Seems like you mentioned a wrong input path. Hadoop is searching for an input path at /user/hduser/input. Hadoop also follows unix like tree structure. If you simply mention a directory input it will be taken as /user/{username}/input.
hadoop fs -mkdir -p /user/hduser/input
hadoop fs -put <datafile> /user/hduser/input
If you see this path (file) physically and still getting the error, you may have confused with local file system and Hadoop Distributed File System(HDFS). In order to run this map-reduce, this file should be located in HDFS (locating only inside local file system will not do it.).
You can import local file system files into HDFS by this command.
hadoop fs -put <local_file_path> <HDFS_diresctory>
You confirm that the file that you imported exists in HDFS by this command.
hadoop fs -ls <HDFS_path>
You must create and upload your input before executing your hadoop job. For example, if you need to upload input.txt file, you should do the following:
$HADOOP_HOME/bin/hdfs dfs -mkdir /user/hduser/input
$HADOOP_HOME/bin/hdfs dfs -copyFromLocal $HADOOP_HOME/input.txt /user/hduser/input/input.txt
The first line creates the directory, and the other upload your input file into hdfs (hadoop fylesystem).
When you compile any jar file using input and output file/directory, you should make sure that the input file is already created(in the specified path) and output file does not exist.
If you want to give a text file as input file, first copy a text file from local file system to hdfs and compiling it by using the following commands
hadoop fs -copyFromLocal /input.txt /user/hduser/input.txt
/usr/local/hadoop/sbin$ yarn jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.0.jar wordcount /user/hduser/input.txt /output
/input.txt may be replaced with address of any text file.
You need to start Pig in local mode and not cluster node:
pig -x local
Program is not able to find the Hadoop path for the inputs. It is searching in the local system files rather than Hadoop's DFS.
This problem will go away when your program is able to locate the HDFS location. We need to let the program understand the HDFS location given in the configuration file. To do that, add these lines in your program code.
Configuration conf = new Configuration();
conf.addResource(new Path("/usr/local/hadoop/hadoop-2.7.3/etc/hadoop/core-site.xml"));
conf.addResource(new Path("/usr/local/hadoop/hadoop-2.7.3/etc/hadoop/hdfs-site.xml"));
you should make the directory in HDFS:
for instance, "hadoop fs -mkdir /input_dir"
Then when you run your MapReduce program. You should mention the absolute path of input directory, so the format should be:
hadoop jar jarFileName.jar className /input_dir /outputdir right
The following is wrong because it is relative path
hadoop jar jarFileName.jar className input_dir outputdir wrong
If you find /bin/bash: /bin/java: No such file or directory in log, try setting JAVA_HOME in /etc/hadoop/hadoop-env.sh

error while running nutch on hadoop multi cluster environment

I am running nutch on hadoop multi cluster environment.
Hadoop is throwing an error when nutch is being executed using the following command
$ bin/hadoop jar /home/nutch/nutch/runtime/deploy/nutch-1.5.1.job org.apache.nutch.crawl.Crawl urls -dir urls -depth 1 -topN 5
Error:
Exception in thread "main" java.io.IOException: Not a file:
hdfs://master:54310/user/nutch/urls/crawldb
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:170)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:515)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:753)
at com.bdc.dod.dashboard.BDCQueryStatsViewer.run(BDCQueryStatsViewer.java:829)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at com.bdc.dod.dashboard.BDCQueryStatsViewer.main(BDCQueryStatsViewer.java:796)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:585)
at org.apache.hadoop.util.RunJar.main(RunJar.java:155)
I tried with possible ways of solving this and fixed all the issues like setting http.agent.name in /local/conf path etc. And I installed earlier and it was smooth.
Can anybody suggest a solution?
By the way, I followed link for installing and running.
I could solve this issue. when copying files from local file system to HDFS destination filesystem, it used to be like this: bin/hadoop dfs -put ~/nutch/urls urls.
However it should be "bin/hadoop dfs -put ~/nutch/urls/* urls", here urls/* will allow sub directories.

Hadoop EOF exception after Map step

I'm runnig Hadoop in pseudo-distributed mode. My new MR jobs throws EOF exception after Map step is complete in DataInputStream.readFully() method. The input/output formats are the same - SequenceFile. The similiar job runs fine, but the new one fails.
Stacktrace:
java.io.EOFException
at java.io.DataInputStream.readFully(DataInputStream.java:180)
at java.io.DataInputStream.readFully(DataInputStream.java:152)
at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1520)
at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1492)
at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1479)
at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1474)
at org.apache.hadoop.mapreduce.lib.input.SequenceFileRecordReader.initialize(SequenceFileRecordReader.java:50)
at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.initialize(MapTask.java:451)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:646)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:323)
at org.apache.hadoop.mapred.Child$4.run(Child.java:266)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInf
I've fixed this issue, it was called by few corrupted Sequence files in HDFS input folder

Resources