Why would Hadoop hftp serve directories but not files? - hadoop

I'm trying to move files from one cluster to another using distcp, using the hftp protocol as specified in their instructions.
I can read directories over hftp, but when I attempt to get a file I get a 500 (internal server error). To eliminate the possibility of network and firewall issues, I'm using hadoop fs -ls and hadoop fs -cat commands on the source server in order to attempt to figure out this issue.
This provides a directory of the files:
hadoop fs -ls logfiles/day_id=19991231/hour_id=1999123123
-rw-r--r-- 3 username supergroup 812 2012-12-16 17:21 logfiles/day_id=19991231/hour_id=1999123123/000008_0
This gives me a "file not found" error, which it should because the file isn't there:
hadoop fs -cat hftp://hserver.domain.com:50070/user/username/logfiles/day_id=19991231/hour_id=1999123123/000008_0x
cat: `hftp://hserver.domain.com:50070/user/username/logfiles/day_id=19991231/hour_id=1999123123/000008_0x': No such file or directory
This line gives me a 500 internal server error. The file is confirmed on the server.
hadoop fs -cat hftp://hserver.domain.com:50070/user/username/logfiles/day_id=19991231/hour_id=1999123123/000008_0
cat: HTTP_OK expected, received 500
Here is a stack trace of what distcp logs when I attempt this:
java.io.IOException: HTTP_OK expected, received 500
at org.apache.hadoop.hdfs.HftpFileSystem$RangeHeaderUrlOpener.connect(HftpFileSystem.java:365)
at org.apache.hadoop.hdfs.ByteRangeInputStream.openInputStream(ByteRangeInputStream.java:119)
at org.apache.hadoop.hdfs.ByteRangeInputStream.getInputStream(ByteRangeInputStream.java:103)
at org.apache.hadoop.hdfs.ByteRangeInputStream.read(ByteRangeInputStream.java:187)
at java.io.DataInputStream.read(DataInputStream.java:83)
at org.apache.hadoop.tools.DistCp$CopyFilesMapper.copy(DistCp.java:424)
at org.apache.hadoop.tools.DistCp$CopyFilesMapper.map(DistCp.java:547)
at org.apache.hadoop.tools.DistCp$CopyFilesMapper.map(DistCp.java:314)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:393)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:327)
at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1332)
at org.apache.hadoop.mapred.Child.main(Child.java:262)
Can someone tell me why hftp is failing to serve files?

I ran into the same issue and eventually found a solution.
Everythning is explained here in details: http://www.swiss-scalability.com/2015/01/hadoop-hftp-returns-error-httpok.html
But in a nutshell, we're probably binding the NameNode RPC on a wildcard address (i.e. dfs.namenode.rpc-address point to the IP of the interface, and not 0.0.0.0).
Does not work with HFTP:
<property>
<name>dfs.namenode.rpc-address</name>
<value>0.0.0.0:8020</value>
</property>
Works with HFTP:
<property>
<name>dfs.namenode.rpc-address</name>
<value>10.0.1.2:8020</value>
</property>

Related

How to figure out hdfs URI - java.io.IOException: Incomplete HDFS URI, no host

How can I figure out the URI my hdfs dfs commands are connecting to?
Is there any configuration file that stores the URI or any command that can be used to display it?
I looked into the documention of FileSystemShell and the dfsadmin documentation without success. (Also, I do not have access to most of dfsadmin commands.)
When I call a command with hdfs:///user/myUserName/... it throws the exception:
Exception in thread "main" java.io.IOException: Incomplete HDFS URI, no host: hdfs:///user/myUserName/test.avro
at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:136)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2591)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:89)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2625)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2607)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:368)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
at org.apache.avro.mapred.FsInput.<init>(FsInput.java:38)
at org.apache.avro.tool.Util.openSeekableFromFS(Util.java:110)
at org.apache.avro.tool.DataFileGetSchemaTool.run(DataFileGetSchemaTool.java:47)
at org.apache.avro.tool.Main.run(Main.java:87)
at org.apache.avro.tool.Main.main(Main.java:76)
Simple commands like hdfs dfs -ls are working fine.
Using Hadoop 3.1.0.
If you're able to access the file core-site.xml, then you can look for the value assigned to property fs.defaultFS
$ grep -A 2 defaultFS /etc/hadoop/conf/core-site.xml
<name>fs.defaultFS</name>
<value>hdfs://bigdataserver-2.internal.cloudapp.net:8020</value>
</property>
Note: I use Cloudera and core-site.xml is where i get the detail. For Hadoop, you might be having core-default.xml
Check this out: https://hadoop.apache.org/docs/r2.8.5/hadoop-project-dist/hadoop-common/core-default.xml

How can I deal with the java.net.ConnectException in Hadoop?

I was trying to copy some local file to the HDFS, with this script:
bin/hadoop fs -copyFromLocal '/home/czy/IdeaProjects/HadoopInAction/FirstHadoop/src/main/resources/crossView.txt' /user/czy
It came like this:
copyFromLocal: Call From ubuntu/127.0.1.1 to localhost:8020 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
But when I used a script like this(without hdfs://localhost/), it worked well:
bin/hadoop fs -copyFromLocal '/home/czy/IdeaProjects/HadoopInAction/FirstHadoop/src/main/resources/crossView.txt' /user/czy
Why did this happen? Why do I received to localhost:8020 failed when I configured namenode port to 9000?
Check your core-site.xml file for valid parameters, server and port names
Check below link..it might be useful...
https://datashine.wordpress.com/2014/09/06/java-net-connectexception-connection-refused-for-more-details-see-httpwiki-apache-orghadoopconnectionrefused/

Hadoop Mapreduce Error Input path does not exist: hdfs://localhost:54310/user/hduser/input"

I have installed hadoop 2.6 in Ubuntu Linux 15.04 and its running fine. But, when I am running a sample test mapreduce program, its giving the following error:
org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: hdfs://localhost:54310/user/hduser/input.
Kindly help me. Below is the complete details of the error.
hduser#krishadoop:/usr/local/hadoop/sbin$ hadoop jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.0.jar wordcount input output
Picked up JAVA_TOOL_OPTIONS: -javaagent:/usr/share/java/jayatanaag.jar
15/08/24 15:22:37 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/08/24 15:22:38 INFO Configuration.deprecation: session.id is deprecated. Instead, use dfs.metrics.session-id
15/08/24 15:22:38 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
15/08/24 15:22:39 INFO mapreduce.JobSubmitter: Cleaning up the staging area file:/app/hadoop/tmp/mapred/staging/hduser1122930879/.staging/job_local1122930879_0001
org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: hdfs://localhost:54310/user/hduser/input
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:321)
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:264)
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:385)
at org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:597)
at org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:614)
at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:492)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1296)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1293)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:1293)
at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1314)
at org.apache.hadoop.examples.WordCount.main(WordCount.java:87)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:71)
at org.apache.hadoop.util.ProgramDriver.run(ProgramDriver.java:144)
at org.apache.hadoop.examples.ExampleDriver.main(ExampleDriver.java:74)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
Seems like you mentioned a wrong input path. Hadoop is searching for an input path at /user/hduser/input. Hadoop also follows unix like tree structure. If you simply mention a directory input it will be taken as /user/{username}/input.
hadoop fs -mkdir -p /user/hduser/input
hadoop fs -put <datafile> /user/hduser/input
If you see this path (file) physically and still getting the error, you may have confused with local file system and Hadoop Distributed File System(HDFS). In order to run this map-reduce, this file should be located in HDFS (locating only inside local file system will not do it.).
You can import local file system files into HDFS by this command.
hadoop fs -put <local_file_path> <HDFS_diresctory>
You confirm that the file that you imported exists in HDFS by this command.
hadoop fs -ls <HDFS_path>
You must create and upload your input before executing your hadoop job. For example, if you need to upload input.txt file, you should do the following:
$HADOOP_HOME/bin/hdfs dfs -mkdir /user/hduser/input
$HADOOP_HOME/bin/hdfs dfs -copyFromLocal $HADOOP_HOME/input.txt /user/hduser/input/input.txt
The first line creates the directory, and the other upload your input file into hdfs (hadoop fylesystem).
When you compile any jar file using input and output file/directory, you should make sure that the input file is already created(in the specified path) and output file does not exist.
If you want to give a text file as input file, first copy a text file from local file system to hdfs and compiling it by using the following commands
hadoop fs -copyFromLocal /input.txt /user/hduser/input.txt
/usr/local/hadoop/sbin$ yarn jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.0.jar wordcount /user/hduser/input.txt /output
/input.txt may be replaced with address of any text file.
You need to start Pig in local mode and not cluster node:
pig -x local
Program is not able to find the Hadoop path for the inputs. It is searching in the local system files rather than Hadoop's DFS.
This problem will go away when your program is able to locate the HDFS location. We need to let the program understand the HDFS location given in the configuration file. To do that, add these lines in your program code.
Configuration conf = new Configuration();
conf.addResource(new Path("/usr/local/hadoop/hadoop-2.7.3/etc/hadoop/core-site.xml"));
conf.addResource(new Path("/usr/local/hadoop/hadoop-2.7.3/etc/hadoop/hdfs-site.xml"));
you should make the directory in HDFS:
for instance, "hadoop fs -mkdir /input_dir"
Then when you run your MapReduce program. You should mention the absolute path of input directory, so the format should be:
hadoop jar jarFileName.jar className /input_dir /outputdir right
The following is wrong because it is relative path
hadoop jar jarFileName.jar className input_dir outputdir wrong
If you find /bin/bash: /bin/java: No such file or directory in log, try setting JAVA_HOME in /etc/hadoop/hadoop-env.sh

Cannot load a file from Hadoop HDFS from Pig Latin

I am having trouble trying to load a csv from file. I keep on getting the following error:
Input(s):
Failed to read data from "hdfs://localhost:9000/user/der/1987.csv"
Output(s):
Failed to produce result in "hdfs://localhost:9000/user/der/totalmiles3"
Looking at my Hadoop hdfs installed in my local machine I see the file. In fact the file is located at multiple locations such as /, /user/ , etc.
hdfs dfs -ls /user/der
Found 1 items
-rw-r--r-- 1 der supergroup 127162942 2015-05-28 12:42
/user/der/1987.csv
My pig scripts is as follows:
records = LOAD '1987.csv' USING PigStorage(',') AS
(Year, Month, DayofMonth, DayOfWeek, DepTime, CRSDepTime, ArrTime,
CRSArrTime, UniqueCarrier, FlightNum, TailNum,ActualElapsedTime,
CRSElapsedTime,AirTime,ArrDelay, DepDelay, Origin, Dest,
Distance:int, TaxIn, TaxiOut, Cancelled,CancellationCode,
Diverted, CarrierDelay, WeatherDelay, NASDelay, SecurityDelay,
lateAircraftDelay);
milage_recs= GROUP records ALL;
tot_miles = FOREACH milage_recs GENERATE SUM(records.Distance);
STORE tot_miles INTO 'totalmiles3';
I ran pig with the -x local option. I was able to read the files from my local hard disk with the -x local option. Got the right answer and the tail -f on Hadoop namenode did not scroll which proves I ran the files all locally on hard disk:
pig -x local totalmiles.pig
Now I am getting errors. It seems the hadoop name server is getting request because I used tail -f and see the logs scroll.
pig totalmiles.pig
records = LOAD '/user/der/1987.csv' USING PigStorage(',') AS
I get the following error:
Failed Jobs:
JobId Alias Feature Message Outputs
job_local602774674_0001 milage_recs,records,tot_miles
GROUP_BY,COMBINER Message: ENOENT: No such file or directory
at
org.apache.hadoop.io.nativeio.NativeIO$POSIX.chmodImpl(Native Method)
at
org.apache.hadoop.io.nativeio.NativeIO$POSIX.chmod(NativeIO.java:230)
at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.j
ava:724)
at
org.apache.hadoop.fs.FilterFileSystem.setPermission(FilterFileSystem.java: 502)
at org.apache.hadoop.fs.FileSystem.mkdirs(FileSys tem.java:600)
at
org.apache.hadoop.mapreduce.JobResourceUploader.uploadFiles(JobResourceUpl
oader.java:94)
at
org.apache.hadoop.mapreduce.JobSubmitter.copyAndConfigureFiles(JobSubmitte
r.java:98)
at org .apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:193)
...blah...
Input(s):
Failed to read data from "/user/der/1987.csv"
Output(s):
Failed to produce result in "hdfs://localhost:9000/user/der/totalmiles3"
I used the hdfs to check for permissions by mkdir and that seems ok:
hdfs dfs -mkdir /user/der/temp2
hdfs dfs -ls /user/der
Found 3 items
-rw-r--r-- 1 der supergroup 127162942 2015-05-28 12:42
/user/der/1987.csv
drwxr-xr-x - der supergroup 0 2015-05-28 16:21
/user/der/temp2
drwxr-xr-x - der supergroup 0 2015-05-28 15:57
/user/der/test
I tried the pig with mapreduce option and still get the same type of error:
pig -x mapreduce totalmiles.pig
5-05-28 20:58:44,608 [JobControl] INFO
org.apache.hadoop.mapreduce.lib.jobc
ontrol.ControlledJob - PigLatin:totalmiles.pig while
submitting
ENOENT: No such file or directory
at
org.apache.hadoop.io.nativeio.NativeIO$POSIX.chmodImpl(Na at
org.apache.hadoop.io.nativeio.NativeIO$POSIX.chmod(NativeIO.java:230)
at
org.apache.hadoop.fs.RawLocalFileSystem.setPermissi at
org.apache.hadoop.fs.FilterFileSystem.setPermission(FilterFileSy
at org.apache.hadoop.fs.FileSystem.mkdirs(FileSystem.java:600)
at
org.apache.hadoop.mapreduce.JobResourceUploader.uploadFiles(Job
at org.apache.hadoop.mapreduce.JobSubmitter.copyAndConfigureFiles(Jo
at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobS
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1290)
My core-site.xml has the temp dir as follows:
<property>
<name>hadoop.tmp.dir</name>
<value>/usr/local/hadoop</value>
<description>A base for other temporary directories.
</description>
</property>
and my hdfs-site.xml as the namenode and datanode as follows:
<property>
<name>dfs.namenode.name.dir</name>
<value>file:/usr/local/hadoop/dfs/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:/usr/local/hadoop/dfs/datanode</value>
</property>
I've gotten a bit further in debugging the issue. It seems my namenode is misconfigured as I cannot reformat it:
[hadoop hdfs formatting gets error failed for Block pool ]
We have to give the hadoop file path as : /user/der/1987.csv
records = LOAD '/user/der/1987.csv' USING PigStorage(',') AS
(Year, Month, DayofMonth, DayOfWeek, DepTime, CRSDepTime, ArrTime,
CRSArrTime, UniqueCarrier, FlightNum, TailNum,ActualElapsedTime,
CRSElapsedTime,AirTime,ArrDelay, DepDelay, Origin, Dest,
Distance:int, TaxIn, TaxiOut, Cancelled,CancellationCode,
Diverted, CarrierDelay, WeatherDelay, NASDelay, SecurityDelay,
lateAircraftDelay);
If its for testing, you can have the file : 1987.csv in the path from where you are executing the pig script, i.e. have 1987.csv and the .pig file in the same location.

Issue in Connecting to HDFS Namenode

After a new hadoop single node installation , I got following error in hadoop-root-datanode-localhost.localdomain.log
2014-06-18 23:43:23,594 ERROR org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:root cause:java.net.ConnectException: Call to localhost/127.0.0.1:54310 failed on connection exception: java.net.ConnectException: Connection refused
2014-06-18 23:43:23,595 INFO org.apache.hadoop.mapred.JobTracker: Problem connecting to HDFS Namenode... re-trying java.net.ConnectException: Call to localhost/127.0.0.1:54310 failed on connection exception: java.net.ConnectException: Connection refusedat org.apache.hadoop.ipc.Client.wrapException(Client.java:1142)
Any idea.?
JPS is not giving any ouput
Core site.xml is updated
<configuration>
<property>
<name>hadoop.tmp.dir</name>
<value>/opt/surya/hadoop-1.2.1/tmp</value>
<description>A base for other temporary directories.</description>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:54310</value>
<description>The name of the default file system. A URI whose
scheme and authority determine the FileSystem implementation. The
uri's scheme determines the config property (fs.SCHEME.impl) naming
the FileSystem implementation class. The uri's authority is used to
determine the host, port, etc. for a filesystem.</description>
</property>
</configuration>
Also , on format using hadoop namenode -format
got below aborted error
Re-format filesystem in /tmp/hadoop-root/dfs/name ? (Y or N) y
Format aborted in /tmp/hadoop-root/dfs/name
You need to run hadoop namenode -format as the hdfs-superuser. Probably the "hdfs" user itself.
The hint can be seen here:
UserGroupInformation: PriviledgedActionException as:root cause:java
Another thing to consider: You really want to move your hdfs root to something other than /tmp. You will risk losing your hdfs contents when /tmp is cleaned (which could happen any time)
UPDATE based on OP comments.
RE: JobTracker unable to contact NameNode: Please do not skip steps.
First make sure you format the NameNode
Then start the NameNode and DataNodes
Run some basic HDFS commands such as
hdfs dfs -put
and
hdfs dfs -get
Then you can start the JobTracker and TaskTracker
Then (and not earlier) you can try to run some MapReduce job (which uses hdfs)
1) Please run "jps" in console and show what it outputs
2) Please provide core-site.xml (I think you might have wrong fs.default.name)
Concerning this error:
Re-format filesystem in /tmp/hadoop-root/dfs/name ? (Y or N) y
Format aborted in /tmp/hadoop-root/dfs/name
You need to use a capital Y, not a lowercase y in order for it to accept the input and actually do the formatting.

Resources