WebHDFS OPEN command returns empty results - hadoop

I created a simple file in HDFS at the path /user/admin/foo.txt
I can see the contents of this file in Hue.
Now I issue the command
curl -i http://namenode:50070/webhdfs/v1/user/admin/foo.txt?op=OPEN
I get the response
HTTP/1.1 307 TEMPORARY_REDIRECT
Cache-Control: no-cache
Expires: Tue, 24 Nov 2015 16:20:15 GMT
Date: Tue, 24 Nov 2015 16:20:15 GMT
Pragma: no-cache
Expires: Tue, 24 Nov 2015 16:20:15 GMT
Date: Tue, 24 Nov 2015 16:20:15 GMT
Pragma: no-cache
Location: http://datanode:50075/webhdfs/v1/user/admin/foo.txt?op=OPEN&namenoderpcaddress=nameservice1&offset=0
Content-Type: application/octet-stream
Content-Length: 0
Server: Jetty(6.1.26.cloudera.4)
Why is the Content-Length 0? I was hoping that this would return the contents of the file.

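Execute the request against the URL from the Location header (quote it, because the shell would otherwise interpret the & characters):
curl -i "http://datanode:50075/webhdfs/v1/user/admin/foo.txt?op=OPEN&namenoderpcaddress=nameservice1&offset=0"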
As for the explanation - when using WebHDFS to open a file you have to do the following:
You don't know which node the file resides on, so you ask the namenode.
The namenode returns you a datanode containing the file.
You can then open the file itself by talking directly to the datanode.
So this behavior is expected. See https://hadoop.apache.org/docs/r1.0.4/webhdfs.html for more information.
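The two-step exchange described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the helper names (namenode_open_url, datanode_location, read_from_datanode) are hypothetical, the host name and port are taken from the question, and the cluster is assumed to accept unauthenticated requests.

```python
import http.client
import urllib.request
from urllib.parse import urlsplit


def namenode_open_url(host, path, port=50070):
    """Build the WebHDFS OPEN URL that is sent to the namenode."""
    return f"http://{host}:{port}/webhdfs/v1{path}?op=OPEN"


def datanode_location(namenode_url):
    """Step 1: ask the namenode and return the Location header of its 307 reply."""
    parts = urlsplit(namenode_url)
    # http.client does not follow redirects, so the 307 stays visible here.
    conn = http.client.HTTPConnection(parts.hostname, parts.port, timeout=10)
    try:
        conn.request("GET", f"{parts.path}?{parts.query}")
        response = conn.getresponse()
        response.read()  # the redirect itself has no body (Content-Length: 0)
        if response.status != 307:
            raise RuntimeError(f"expected 307, got {response.status}")
        return response.getheader("Location")
    finally:
        conn.close()


def read_from_datanode(location):
    """Step 2: fetch the file bytes from the datanode URL."""
    with urllib.request.urlopen(location, timeout=10) as response:
        return response.read()


# Example (needs a reachable cluster, so it is left commented out):
# url = namenode_open_url("namenode", "/user/admin/foo.txt")
# print(read_from_datanode(datanode_location(url)))
```

With curl, the -L option follows the redirect automatically, so both steps happen in a single command.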

Related

WebHDFS FileNotFoundException rest api

I am posting this question as a continuation of post webhdfs rest api throwing file not found exception
I have an image file I would like to OPEN through the WebHDFS rest api.
The file exists in HDFS and has appropriate permissions.
I can LISTSTATUS that file and get an answer:
curl -i "http://namenode:50070/webhdfs/v1/tmp/file.png?op=LISTSTATUS"
HTTP/1.1 200 OK
Date: Fri, 17 Jul 2020 22:47:29 GMT
Cache-Control: no-cache
Expires: Fri, 17 Jul 2020 22:47:29 GMT
Date: Fri, 17 Jul 2020 22:47:29 GMT
Pragma: no-cache
X-FRAME-OPTIONS: SAMEORIGIN
Content-Type: application/json
Transfer-Encoding: chunked
{"FileStatuses":{"FileStatus":[
{"accessTime":1594828591740,"blockSize":134217728,"childrenNum":0,"fileId":11393739,"group":"hdfs","length":104811,"modificationTime":1594828592000,"owner":"XXXX","pathSuffix":"XXXX","permission":"644","replication":3,"storagePolicy":0,"type":"FILE"}
]}}
Content-Type: application/octet-stream
Content-Length: 0
So the api can properly read the metadata, but I cannot get that file to OPEN:
curl -i "http://namenode:50070/webhdfs/v1/tmp/file.png?op=OPEN"
HTTP/1.1 307 Temporary Redirect
Date: Fri, 17 Jul 2020 22:23:17 GMT
Cache-Control: no-cache
Expires: Fri, 17 Jul 2020 22:23:17 GMT
Date: Fri, 17 Jul 2020 22:23:17 GMT
Pragma: no-cache
X-FRAME-OPTIONS: SAMEORIGIN
Location: http://datanode1:50075/webhdfs/v1/tmp/file.png?op=OPEN&namenoderpcaddress=namenode:8020&offset=0
Content-Type: application/octet-stream
Content-Length: 0
{"RemoteException":{"exception":"FileNotFoundException","javaClassName":"java.io.FileNotFoundException","message":"Path is not a file: /tmp/file.png......
So, according to webhdfs rest api throwing file not found exception, I can see that the request is passed off from the namenode to the datanode1.
Datanode1 is in my hosts file; I can connect to it and check the status of WebHDFS from there:
<property>
<name>dfs.webhdfs.enabled</name>
<value>true</value>
<final>true</final>
</property>
It is allowed, same on the namenode.
I also went to look at the hdfs logs on /var/log/hadoop/hdfs/*.{log,out} to see if I could find errors triggered when I curled, but nothing seems to happen. I see no entry pertaining to my file or webhdfs query. I tried that on the namenode and datanode1.
As a last-ditch effort I tried to relax the permissions (not ideal) from 644 (seen in the LISTSTATUS output above) to 666:
hdfs dfs -chmod 666 /tmp/file.png
curl -i "http://namenode:50070/webhdfs/v1/tmp/file.png?op=LISTSTATUS"
HTTP/1.1 403 Forbidden
Date: Fri, 17 Jul 2020 23:06:18 GMT
Cache-Control: no-cache
Expires: Fri, 17 Jul 2020 23:06:18 GMT
Date: Fri, 17 Jul 2020 23:06:18 GMT
Pragma: no-cache
X-FRAME-OPTIONS: SAMEORIGIN
Content-Type: application/json
Transfer-Encoding: chunked
{"RemoteException":{"exception":"AccessControlException","javaClassName":"org.apache.hadoop.security.AccessControlException","message":"Permission denied: user=XXXX, access=READ_EXECUTE, inode=\"/tmp/file.png\":XXXX:hdfs:drw-rw-rw-"}}
So it seems the change took effect, but relaxing the permissions somehow produced a permission error that I didn't get before. It is not as if I removed the x flag; it wasn't there to begin with. Does access=READ_EXECUTE require both r and x?
Now I am at a loss as to why I can see but not read this file with HDFS. Can someone please help me troubleshoot this?
Looking closer at your last error,
... inode=\"/tmp/file.png\":XXXX:hdfs:drw-rw-rw-"}
it seems to indicate that file.png is actually a directory (note the leading d) and not a file. This is consistent with the error you're getting from the OPEN request: ..."message":"Path is not a file: /tmp/file.png....
You can double check that simply by doing $ hdfs dfs -ls /tmp/file.png/.
Getting back to your access error, you do need an "execute" (x) permission to list the files in a directory.
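As a rough illustration of the reasoning above, the permission string in the error can also be checked programmatically. The sketch below assumes the message keeps the inode="path":owner:group:mode layout shown in the question; the function name parse_inode is hypothetical.

```python
import json
import re

# Layout assumed from the error in the question:
#   inode="<path>":<owner>:<group>:<mode>
INODE_PATTERN = re.compile(
    r'inode="(?P<path>[^"]*)":(?P<owner>[^:]*):(?P<group>[^:]*):'
    r'(?P<mode>[d-][rwxtT-]{9})'
)


def parse_inode(response_body):
    """Extract path, mode and a directory flag from a RemoteException body."""
    message = json.loads(response_body)["RemoteException"]["message"]
    match = INODE_PATTERN.search(message)
    if match is None:
        return None
    mode = match.group("mode")
    return {
        "path": match.group("path"),
        "mode": mode,
        # A leading "d" marks a directory, "-" a regular file.
        "is_directory": mode.startswith("d"),
    }


# The 403 response body from the question, copied verbatim.
BODY = r'{"RemoteException":{"exception":"AccessControlException","javaClassName":"org.apache.hadoop.security.AccessControlException","message":"Permission denied: user=XXXX, access=READ_EXECUTE, inode=\"/tmp/file.png\":XXXX:hdfs:drw-rw-rw-"}}'

print(parse_inode(BODY))
```

For the response in the question, the is_directory flag comes out true, which matches the "Path is not a file" error.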

webhdfs is redirecting to localhost:50075

I am trying to create a file on a remote HDFS from a non-Hadoop environment.
For this purpose I am using the pywebhdfs API, and I am also running the command with curl.
https://pythonhosted.org/pywebhdfs/
I used this documentation as a reference, and I am able to execute all other methods except create_file().
While using create_file(), I get an error like: 'Couldn't connect to host'
Command: curl -i -X PUT -L "http://xxx.xxx.xxx.xxx:50070/webhdfs/v1/test1/?op=CREATE" -T sample.txt
Response:
HTTP/1.1 100 Continue
HTTP/1.1 307 TEMPORARY_REDIRECT
Cache-Control: no-cache
Expires: Tue, 30 Oct 2018 12:04:04 GMT
Date: Tue, 30 Oct 2018 12:04:04 GMT
Pragma: no-cache
Expires: Tue, 30 Oct 2018 12:04:04 GMT
Date: Tue, 30 Oct 2018 12:04:04 GMT
Pragma: no-cache
Content-Type: application/octet-stream
Location: http://localhost:50075/webhdfs/v1/test1/?op=CREATE&namenoderpcaddress=xxx.xxx.xxx.xxx:9000&overwrite=false
Content-Length: 0
Server: Jetty(6.1.26)
curl: (7) couldn't connect to host
The Location header points to localhost here. I followed the earlier post
webhdfs always redirect to localhost:50075
but without success.
I tried changing the IP in hdfs-site.xml and in the /etc/hosts file, but that did not help either.
Can anyone tell me how to fix this?
Thanks in advance.

webhdfs rest api throwing file not found exception

I am trying to open an HDFS file that is present on a CDH4 cluster from a CDH5 machine, using WebHDFS from the command line as below:
curl -i -L "http://namenodeIpofCDH4:50070/webhdfs/v1/user/quad/source/JSONML.java?user.name=quad&op=OPEN"
I am getting a "File Not Found Exception" even though the file JSONML.java is present at the mentioned path on the namenode as well as the datanode. The trace is as follows:
HTTP/1.1 307 TEMPORARY_REDIRECT
Cache-Control: no-cache
Expires: Thu, 01-Jan-1970 00:00:00 GMT
Date: Mon, 22 Feb 2016 13:25:35 GMT
Pragma: no-cache
Date: Mon, 22 Feb 2016 13:25:35 GMT
Pragma: no-cache
Set-Cookie: hadoop.auth="u=quad&p=quad&t=simple&e=1456183535737&s=KdZYcA5iwJeIU2F9ZJfLSaT4qMY=";Path=/
Location: http://n3.quadratics.com:50075/webhdfs/v1/user/quad/source/JSONML.java?op=OPEN&user.name=quad&namenoderpcaddress=n1.quadratics.com:8020&offset=0
Content-Type: application/octet-stream
Content-Length: 0
Server: Jetty(6.1.26.cloudera.4)
HTTP/1.1 404 Not Found
Cache-Control: no-cache
Expires: Mon, 22 Feb 2016 13:26:28 GMT
Date: Mon, 22 Feb 2016 13:26:28 GMT
Pragma: no-cache
Expires: Mon, 22 Feb 2016 13:26:28 GMT
Date: Mon, 22 Feb 2016 13:26:28 GMT
Pragma: no-cache
Content-Type: application/json
Transfer-Encoding: chunked
Server: Jetty(6.1.26.cloudera.4)
{"RemoteException":{"exception":"FileNotFoundException","javaClassName":"java.io.FileNotFoundException","message":"File does not exist: /user/quad/source/JSONML.java\n\tat org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:66)\n\tat org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:56)\n\tat org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsUpdateTimes(FSNamesystem.java:1932)\n\tat org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:1873)\n\tat org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1853)\n\tat org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1825)\n\tat org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:559)\n\tat org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.getBlockLocations(AuthorizationProviderProxyClientProtocol.java:87)\n\tat org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:363)\n\tat org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)\n\tat org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:619)\n\tat org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1060)\n\tat org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2044)\n\tat org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2040)\n\tat java.security.AccessController.doPrivileged(Native Method)\n\tat javax.security.auth.Subject.doAs(Subject.java:415)\n\tat org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)\n\tat org.apache.hadoop.ipc.Server$Handler.run(Server.java:2038)\n"}}
But I don't get any error, and I do get the status of the same file, when I use the command below:
curl -i -L "http://namenodeIpofCDH4:50070/webhdfs/v1/user/quad/source/JSONML.java?user.name=quad&op=GETFILESTATUS"
I get the output response as below:
HTTP/1.1 200 OK
Cache-Control: no-cache
Expires: Thu, 01-Jan-1970 00:00:00 GMT
Date: Mon, 22 Feb 2016 13:38:48 GMT
Pragma: no-cache
Date: Mon, 22 Feb 2016 13:38:48 GMT
Pragma: no-cache
Set-Cookie: hadoop.auth="u=quad&p=quad&t=simple&e=1456184328134&s=sE6esO8J39O+itl+ggNzX4/WzjQ=";Path=/
Content-Type: application/json
Transfer-Encoding: chunked
Server: Jetty(6.1.26.cloudera.4)
{"FileStatus":{"accessTime":1456147448567,"blockSize":134217728,"group":"quad","length":14849,"modificationTime":1456143798039,"owner":"quad","pathSuffix":"","permission":"644","replication":3,"type":"FILE"}}
Any ideas about why opening the file fails, and how to fix it, would be greatly appreciated.
I saw a similar error when I had misconfigured my /etc/hosts.
The OPEN command above returns a redirect that contains a hostname. The local host (the machine running curl) then tries to resolve this hostname using its local DNS setup.
It then looks for the file at the IP address that the hostname resolves to, which is not necessarily the machine you issued the command to.
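To see where the client would actually be sent, one can resolve the hostname from the Location header on the machine that runs curl. The following is a minimal sketch; redirect_target is a hypothetical helper, and the resolver is passed in as a parameter only so that the function can be exercised without network access.

```python
import socket
from urllib.parse import urlsplit


def redirect_target(location, resolve=socket.gethostbyname):
    """Return (hostname, ip) for a redirect URL; ip is None if lookup fails."""
    host = urlsplit(location).hostname
    try:
        address = resolve(host)
    except OSError:  # socket.gaierror is a subclass of OSError
        address = None
    return host, address


# Location header copied from the 307 response in the question.
LOCATION = (
    "http://n3.quadratics.com:50075/webhdfs/v1/user/quad/source/JSONML.java"
    "?op=OPEN&user.name=quad&namenoderpcaddress=n1.quadratics.com:8020&offset=0"
)

# Run this on the client; the printed address depends on local /etc/hosts and DNS.
print(redirect_target(LOCATION))
```

If the printed address belongs to a different cluster (or the lookup fails), the second request is not reaching the datanode that holds the file.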

Azure storage - Images url become binary?

I use Azure Storage to upload my images,
but some images are served as a binary download,
like: http://fungogo.blob.core.windows.net/asdf0/18263359_e0d9199e-b2d3-11e5-b71b-46c19c40c550.jpg
and some are displayed as images in the browser,
like: https://fungogo.blob.core.windows.net/images/14600328358_a00eaa35c5_o.jpg
I want the image to be displayed in the browser instead of being downloaded. How can I fix it?
I believe the issue is with the Content-Type. The first block below contains the headers for the failing link. You can see that it is listed as image/jpg/jpeg.
HTTP/1.1 200 OK
Content-Length: 137496
Content-Type: image/jpg/jpeg
Content-MD5: zPyz4CSRnPhQtW7PT1w9LQ==
Last-Modified: Wed, 03 Feb 2016 11:06:14 GMT
ETag: 0x8D32C8A086B77F5
Server: Windows-Azure-Blob/1.0 Microsoft-HTTPAPI/2.0
x-ms-request-id: e25e0d4c-0001-0049-5ff0-5ea774000000
x-ms-version: 2009-09-19
x-ms-lease-status: unlocked
x-ms-blob-type: BlockBlob
Date: Thu, 04 Feb 2016 02:08:52 GMT
The response headers for the working link have a Content-Type as image/jpeg.
HTTP/1.1 200 OK
Content-Length: 1689160
Content-Type: image/jpeg
Content-MD5: iAhgwODEpi7EaTAyUCMY1Q==
Last-Modified: Mon, 01 Feb 2016 08:18:24 GMT
ETag: 0x8D32AE0413D02B6
Server: Windows-Azure-Blob/1.0 Microsoft-HTTPAPI/2.0
x-ms-request-id: 9792c6d9-0001-0045-2cf0-5e4985000000
x-ms-version: 2009-09-19
x-ms-lease-status: unlocked
x-ms-blob-type: BlockBlob
Date: Thu, 04 Feb 2016 02:06:50 GMT
If you want to update the Content-Type of a lot of files at once, you can look at the example in the answer for this SO link at Set Content-type of media files stored on Blob
If you're wondering about the difference between jpg and jpeg, you can look at this SO link JPG vs. JPEG image formats
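A quick way to spot such values is to check that a Content-Type has the type/subtype shape. The sketch below is only a rough shape check, not a full media-type parser, and looks_like_media_type is a hypothetical helper name.

```python
import re

# Characters allowed in a type or subtype name (simplified; "/" is excluded).
_TOKEN = r"[A-Za-z0-9!#$&^_.+-]+"
_MEDIA_TYPE = re.compile(rf"^{_TOKEN}/{_TOKEN}$")


def looks_like_media_type(content_type):
    """Return True if the value has exactly the form type/subtype."""
    # Drop optional parameters such as "; charset=utf-8" before checking.
    essence = content_type.split(";", 1)[0].strip()
    return _MEDIA_TYPE.match(essence) is not None


for value in ("image/jpg/jpeg", "image/jpeg"):
    print(value, looks_like_media_type(value))
```

The value from the failing link has two slashes and is rejected, while the value from the working link passes.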

webhdfs always redirect to localhost:50075

I have an HDFS cluster (Hadoop 2.7.1) with one namenode, one secondary namenode, and 3 datanodes.
When I enabled WebHDFS and tested it, I found that it always redirects to "localhost:50075", which is not configured as a datanode.
csrd@secondarynamenode:~/lybica-hdfs-viewer$ curl -i -L "http://10.56.219.30:50070/webhdfs/v1/demo.zip?op=OPEN"
HTTP/1.1 307 TEMPORARY_REDIRECT
Cache-Control: no-cache
Expires: Tue, 01 Dec 2015 03:29:21 GMT
Date: Tue, 01 Dec 2015 03:29:21 GMT
Pragma: no-cache
Expires: Tue, 01 Dec 2015 03:29:21 GMT
Date: Tue, 01 Dec 2015 03:29:21 GMT
Pragma: no-cache
Location: http://localhost:50075/webhdfs/v1/demo.zip?op=OPEN&namenoderpcaddress=10.56.219.30:9000&offset=0
Content-Type: application/octet-stream
Content-Length: 0
Server: Jetty(6.1.26)
curl: (7) Failed to connect to localhost port 50075: Connection refused
The etc/hadoop/slaves is configured as:
10.56.219.32
10.56.219.33
10.56.219.34
Is there any configuration for this?
Thanks!
Well, it was an /etc/hosts mistake.
The /etc/hosts on the datanodes was:
127.0.0.1 localhost datanode-1
Changing it to:
127.0.0.1 datanode-1 localhost
fixed the problem.
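In an /etc/hosts line, the first name after the address is the canonical host name and the rest are aliases. One plausible reading of the fix (an assumption, not something confirmed in the thread) is that the datanode was reporting the canonical name of its loopback line. The sketch below shows how the two lines above differ; canonical_name is a hypothetical helper.

```python
def canonical_name(hosts_text, ip):
    """Return the first name listed for ip in /etc/hosts-style text, or None."""
    for line in hosts_text.splitlines():
        # Strip comments, then split into: address, canonical name, aliases...
        fields = line.split("#", 1)[0].split()
        if len(fields) >= 2 and fields[0] == ip:
            return fields[1]
    return None


before = "127.0.0.1 localhost datanode-1"
after = "127.0.0.1 datanode-1 localhost"

print(canonical_name(before, "127.0.0.1"))
print(canonical_name(after, "127.0.0.1"))
```

With the original line the canonical name is localhost, and after the change it is datanode-1.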
You need to have this entry in hdfs-site.xml:
<property>
<name>dfs.datanode.http.address</name>
<value>0.0.0.0:50075</value>
</property>
The address part of the value should be 0.0.0.0 on a cluster. You need to restart the cluster after updating the hdfs-site.xml file and deploying it on all nodes in the cluster.
