org.apache.hadoop.mapred.InvalidInputException: Input path does not exist - hadoop

I have set up Apache Nutch with a single-node Hadoop installation. When I execute the crawl command it starts crawling, but after a few minutes it throws an exception:
cause: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist (please refer to image 1)
This is the invalid path according to the exception:
hdfs://localhost:54310/user/duleendra/TestCrawl/segments/drwxrwxrwx/crawl_generate
Actually there is no such path in HDFS.
Where does this drwxrwxrwx come from?
In HDFS I can see the following path:
hdfs://localhost:54310/user/duleendra/TestCrawl/segments/20150506222506/crawl_generate
(please refer to image 2 as well).
Have I missed anything?
Thanks
Duleendra

I believe this is a bug that shows up on Unix-based systems like OS X and FreeBSD; Nutch's crawl will not work on them. Try Ubuntu.
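The symptom is consistent with a crawl script that picks the latest segment by parsing hadoop fs -ls output by column position, which breaks when the listing format differs between platforms. A hypothetical illustration of that failure mode (this is a sketch, not the actual bin/crawl script):

# hadoop fs -ls prints lines such as:
#   drwxrwxrwx   - duleendra supergroup  0 2015-05-06 22:25 /user/duleendra/TestCrawl/segments/20150506222506
# grabbing the wrong column yields the permissions field as the "segment" name:
SEGMENT=$(hadoop fs -ls TestCrawl/segments/ | tail -1 | awk '{print $1}')    # -> drwxrwxrwx
# grabbing the last column and taking its basename yields the real segment:
SEGMENT=$(hadoop fs -ls TestCrawl/segments/ | grep '/segments/' | tail -1 | awk '{print $NF}' | xargs -n1 basename)    # -> 20150506222506

A segment name of drwxrwxrwx would then produce exactly the bogus .../segments/drwxrwxrwx/crawl_generate path seen in the exception.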

Related

NiFi ListHDFS cannot find directory, FileNotFoundException

I have a pipeline in NiFi of the form listHDFS -> moveHDFS. Attempting to run the pipeline, we see the following error log:
13:29:21 HST DEBUG 01631000-d439-1c41-9715-e0601d3b971c
ListHDFS[id=01631000-d439-1c41-9715-e0601d3b971c] Returning CLUSTER State: StandardStateMap[version=43, values={emitted.timestamp=1525468790000, listing.timestamp=1525468790000}]
13:29:21 HST DEBUG 01631000-d439-1c41-9715-e0601d3b971c
ListHDFS[id=01631000-d439-1c41-9715-e0601d3b971c] Found new-style state stored, latesting timestamp emitted = 1525468790000, latest listed = 1525468790000
13:29:21 HST DEBUG 01631000-d439-1c41-9715-e0601d3b971c
ListHDFS[id=01631000-d439-1c41-9715-e0601d3b971c] Fetching listing for /hdfs/path/to/dir
13:29:21 HST ERROR 01631000-d439-1c41-9715-e0601d3b971c
ListHDFS[id=01631000-d439-1c41-9715-e0601d3b971c] Failed to perform listing of HDFS due to File /hdfs/path/to/dir does not exist: java.io.FileNotFoundException: File /hdfs/path/to/dir does not exist
Changing the listHDFS path to /tmp seems to run OK, which makes me think the problem is with my permissions on the directory I'm trying to list. However, changing the NiFi user to a user that can access that directory (e.g. via hadoop fs -ls /hdfs/path/to/dir) by setting the bootstrap.properties value run.as=myuser and restarting (see https://nifi.apache.org/docs/nifi-docs/html/administration-guide.html#bootstrap_properties) still produces the same problem for the directory. The literal directory string being used that is not working is:
"/etl/ucera_internal/datagov_example/raw-ingest-tracking/version-1/ingest"
Does anyone know what is happening here? Thanks.
** Note: The Hadoop cluster I am accessing does not have Kerberos enabled (it is a secured MapR Hadoop cluster).
Update: It appears that the MapR Hadoop implementation is different enough that it requires special steps for NiFi to work properly on it (see https://community.mapr.com/thread/10484 and http://hariology.com/integrating-mapr-fs-and-apache-nifi/). I may not get a chance to work on this problem for some time (as certain requirements have changed), so I am dumping the links here for others who may run into it in the meantime.
Could you make sure you have entered the correct path? The directory needs to exist in HDFS.
It seems the ListHDFS processor is not able to find the directory you have configured in its Directory property, and the logs are not showing any permission-denied issues.
If the logs do show permission denied, you can either change the NiFi run-as user in bootstrap.conf (NiFi then needs a restart for the change to apply) or change the permissions on the directory so that NiFi has access to it.
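As a concrete way to run those checks from a shell on the cluster (the directory is the one from the question; run.as=myuser is just a placeholder):

# check that the directory exists and is readable by the OS user NiFi runs as
hadoop fs -ls /etl/ucera_internal/datagov_example/raw-ingest-tracking/version-1/ingest
# "No such file or directory"  -> fix the Directory property on the ListHDFS processor
# "Permission denied"          -> either run NiFi as a user that can read it
#                                 (conf/bootstrap.conf: run.as=myuser, then restart NiFi)
#                                 or open the directory up to the NiFi user, e.g.:
hadoop fs -chmod -R o+rx /etl/ucera_internal/datagov_example/raw-ingest-tracking/version-1/ingest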

"No such file or directory" in hadoop while executing WordCount program using jar command

I am new to Hadoop and am trying to run the WordCount program.
Things I did so far:
Set up the Hadoop single-node cluster, following the link below:
http://www.bogotobogo.com/Hadoop/BigData_hadoop_Install_on_ubuntu_single_node_cluster.php
Wrote the word count program, following the link below:
https://kishorer.in/2014/10/22/running-a-wordcount-mapreduce-example-in-hadoop-2-4-1-single-node-cluster-in-ubuntu-14-04-64-bit/
The problem is when I execute the last command to run the program:
hadoop jar wordcount.jar /usr/local/hadoop/input /usr/local/hadoop/output
Following is the error I get -
The directory seems to be present.
The file is also present in the directory, with contents.
Finally, on a side note, I also tried the following directory structure in the jar command.
No avail! :/
I would really appreciate if someone could guide me here!
Regards,
Paul Alwin
Your first image is using input from the local Hadoop installation directory, /usr
If you want to use that data on your local filesystem, you can specify file:///usr/...
Otherwise, if you're running pseudo-distributed mode, HDFS has been set up, and /usr does not exist in HDFS unless you explicitly created it there.
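A sketch of both variants, assuming the same wordcount.jar and the input files living under /usr/local/hadoop/input on the local disk:

# run against the local filesystem explicitly:
hadoop jar wordcount.jar file:///usr/local/hadoop/input file:///usr/local/hadoop/output
# or copy the data into HDFS first and pass HDFS paths:
hdfs dfs -mkdir -p /user/$USER/input
hdfs dfs -put /usr/local/hadoop/input/* /user/$USER/input/
hadoop jar wordcount.jar /user/$USER/input /user/$USER/output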
Based on the stack trace, I believe the error comes from the /app/hadoop/ staging directory path not existing, or its permissions not allowing your current user to run commands against that path.
Suggestion: Hortonworks and Cloudera offer pre-built VirtualBox images and lots of tutorial resources. Most companies will have Hadoop from one of those vendors, so in my opinion it's better to get familiar with one of those than to mess around with installing Hadoop yourself from scratch.

xmx1000m is not recognized as an internal or external command: pig on windows

I am trying to set up Pig on Windows 7. I already have a Hadoop 2.7 single-node cluster running on Windows 7.
To set up Pig, I have taken the following steps so far:
Downloaded the tar: http://mirror.metrocast.net/apache/pig/
Extracted tar to: C:\Users\zeba\Desktop\pig
Have set the Environment (User) Variable to:
PIG_HOME = C:\Users\zeba\Desktop\pig
PATH = C:\Users\zeba\Desktop\pig\bin
PIG_CLASSPATH = C:\Users\zeba\Desktop\hadoop\conf
I also changed HADOOP_BIN_PATH in pig.cmd to %HADOOP_HOME%\libexec, as suggested in (Apache pig on windows gives "hadoop-config.cmd' is not recognized as an internal or external command" error when running "pig -x local"), since I was getting the same error.
When I enter pig, I encounter the following error:
xmx1000m is not recognized as an internal or external command
Please help!
The error went away by installing pig-0.17.0. I was working with pig-0.16.0 previously.
Finally I got it. I changed HADOOP_BIN_PATH in pig.cmd to "%HADOOP_HOME%\hadoop-2.9.2\libexec"; as you can see, "hadoop-2.9.2" is the subdirectory where "libexec" from my Hadoop version is located.
Fix your "HADOOP_HOME" according to the given image: don't point it at the bin folder, only provide the Hadoop installation path.
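Pulling the answers together, the settings that ended up mattering look roughly like this (the paths are the asker's; your Hadoop layout and version may differ):

HADOOP_HOME = C:\Users\zeba\Desktop\hadoop   (the Hadoop install root, not its bin folder)
PIG_HOME = C:\Users\zeba\Desktop\pig
PATH = ...;%PIG_HOME%\bin

and in pig.cmd, HADOOP_BIN_PATH should point at the folder that actually contains hadoop-config.cmd for your install, e.g. %HADOOP_HOME%\libexec.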

Error: -copyFromLocal: java.net.UnknownHostException

I am new to Java, Hadoop, etc.
I am having a problem when trying to copy a file to HDFS.
It says: "-copyFromLocal: java.net.UnknownHostException: quickstart.cloudera (...)"
How can I solve this? It is an exercise. You can see the problem in the images below.
Image with the problem
Image 2 with the error
Thank you very much.
As the error says, you need to supply the HDFS folder path as the destination, so the command should look like:
hadoop fs -copyFromLocal words.txt /HDFS/Folder/Path
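For example, on the Cloudera quickstart VM the HDFS home directory is typically /user/cloudera, so a concrete (assumed) destination could look like:

hadoop fs -mkdir -p /user/cloudera/exercise
hadoop fs -copyFromLocal words.txt /user/cloudera/exercise/
hadoop fs -ls /user/cloudera/exercise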
Almost all errors you get while working with Hadoop surface as Java exceptions, since MapReduce is mostly written in Java, but that doesn't mean there is a Java error in your code.

unable to set up pseudo distributed hadoop cluster

I am using CentOS 7. I downloaded and untarred Hadoop 2.4.0 and followed the instructions in the link Hadoop 2.4.0 setup.
I ran the following command:
./hdfs namenode -format
I got this error:
Error: Could not find or load main class org.apache.hadoop.hdfs.server.namenode.NameNode
I see a number of posts with the same error but no accepted answers, and I have tried them all without any luck.
This error can occur if the necessary jarfiles are not readable by the user running the "./hdfs" command or are misplaced so that they can't be found by hadoop/libexec/hadoop-config.sh.
Check the permissions on the jarfiles under: hadoop-install/share/hadoop/*:
ls -l share/hadoop/*/*.jar
and if necessary, chmod them as the owner of the respective files to ensure they're readable. Something like chmod 644 should be sufficient to at least check if that fixes the initial problem. For the more permanent fix, you'll likely want to run the hadoop commands as the same user that owns all the files.
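A quick way to run both checks, using a placeholder install path:

# verify the Hadoop jars are readable by the user running ./hdfs
ls -l /opt/hadoop-2.4.0/share/hadoop/hdfs/*.jar
# make them readable if they are not (run this as the files' owner)
chmod 644 /opt/hadoop-2.4.0/share/hadoop/hdfs/*.jar
# confirm the launcher actually puts them on the classpath
/opt/hadoop-2.4.0/bin/hadoop classpath | tr ':' '\n' | grep hdfs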
I followed the link Setup hadoop 2.4.0 and was able to get past the error message.
It seems the documentation on the Hadoop site is not complete.
