webhdfs rest api throwing file not found exception

I am trying to open an HDFS file that is present on a CDH4 cluster from a CDH5 machine, using WebHDFS from the command line as below:
curl -i -L "http://namenodeIpofCDH4:50070/webhdfs/v1/user/quad/source/JSONML.java?user.name=quad&op=OPEN"
I am getting a "File Not Found Exception" even though the file JSONML.java is present at the mentioned path on the namenode as well as the datanode. The trace is as follows:
HTTP/1.1 307 TEMPORARY_REDIRECT
Cache-Control: no-cache
Expires: Thu, 01-Jan-1970 00:00:00 GMT
Date: Mon, 22 Feb 2016 13:25:35 GMT
Pragma: no-cache
Date: Mon, 22 Feb 2016 13:25:35 GMT
Pragma: no-cache
Set-Cookie: hadoop.auth="u=quad&p=quad&t=simple&e=1456183535737&s=KdZYcA5iwJeIU2F9ZJfLSaT4qMY=";Path=/
Location: http://n3.quadratics.com:50075/webhdfs/v1/user/quad/source/JSONML.java?op=OPEN&user.name=quad&namenoderpcaddress=n1.quadratics.com:8020&offset=0
Content-Type: application/octet-stream
Content-Length: 0
Server: Jetty(6.1.26.cloudera.4)
HTTP/1.1 404 Not Found
Cache-Control: no-cache
Expires: Mon, 22 Feb 2016 13:26:28 GMT
Date: Mon, 22 Feb 2016 13:26:28 GMT
Pragma: no-cache
Expires: Mon, 22 Feb 2016 13:26:28 GMT
Date: Mon, 22 Feb 2016 13:26:28 GMT
Pragma: no-cache
Content-Type: application/json
Transfer-Encoding: chunked
Server: Jetty(6.1.26.cloudera.4)
{"RemoteException":{"exception":"FileNotFoundException","javaClassName":"java.io.FileNotFoundException","message":"File does not exist: /user/quad/source/JSONML.java\n\tat org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:66)\n\tat org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:56)\n\tat org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsUpdateTimes(FSNamesystem.java:1932)\n\tat org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:1873)\n\tat org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1853)\n\tat org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1825)\n\tat org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:559)\n\tat org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.getBlockLocations(AuthorizationProviderProxyClientProtocol.java:87)\n\tat org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:363)\n\tat org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)\n\tat org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:619)\n\tat org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1060)\n\tat org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2044)\n\tat org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2040)\n\tat java.security.AccessController.doPrivileged(Native Method)\n\tat javax.security.auth.Subject.doAs(Subject.java:415)\n\tat org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)\n\tat org.apache.hadoop.ipc.Server$Handler.run(Server.java:2038)\n"}}
But I don't get any error, and I do get the status of the above file, when I use the command below:
curl -i -L "http://namenodeIpofCDH4:50070/webhdfs/v1/user/quad/source/JSONML.java?user.name=quad&op=GETFILESTATUS"
I get the output response as below:
HTTP/1.1 200 OK
Cache-Control: no-cache
Expires: Thu, 01-Jan-1970 00:00:00 GMT
Date: Mon, 22 Feb 2016 13:38:48 GMT
Pragma: no-cache
Date: Mon, 22 Feb 2016 13:38:48 GMT
Pragma: no-cache
Set-Cookie: hadoop.auth="u=quad&p=quad&t=simple&e=1456184328134&s=sE6esO8J39O+itl+ggNzX4/WzjQ=";Path=/
Content-Type: application/json
Transfer-Encoding: chunked
Server: Jetty(6.1.26.cloudera.4)
{"FileStatus":{"accessTime":1456147448567,"blockSize":134217728,"group":"quad","length":14849,"modificationTime":1456143798039,"owner":"quad","pathSuffix":"","permission":"644","replication":3,"type":"FILE"}}
Any ideas about why opening the file fails, and how to fix it, would be greatly appreciated.

I saw a similar error when I had misconfigured my /etc/hosts.
The OPEN command above returns a redirect whose Location header contains a hostname. Your local machine resolves that hostname using its own DNS setup.
It will then look for the file at whatever IP address the hostname resolves to, which is not necessarily the host you issued the command to.
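A quick way to check for this is to pull the redirect target out of the namenode's response and see how it resolves locally. A minimal sketch, reusing the hostnames from the trace above:

# Grab the Location header from the OPEN redirect
curl -s -i "http://namenodeIpofCDH4:50070/webhdfs/v1/user/quad/source/JSONML.java?user.name=quad&op=OPEN" | grep -i '^Location:'

# Check how the redirect hostname resolves on this machine;
# it should be the datanode's real IP, not 127.0.0.1 or a stale entry
getent hosts n3.quadratics.com
grep n3.quadratics.com /etc/hosts

If getent returns the wrong address, fixing the /etc/hosts entry (or DNS) for the datanode hostname is the cure.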

Related

WebHDFS FileNotFoundException rest api

I am posting this question as a continuation of the post webhdfs rest api throwing file not found exception.
I have an image file I would like to OPEN through the WebHDFS REST API.
The file exists in HDFS and has appropriate permissions.
I can LISTSTATUS that file and get an answer:
curl -i "http://namenode:50070/webhdfs/v1/tmp/file.png?op=LISTSTATUS"
HTTP/1.1 200 OK
Date: Fri, 17 Jul 2020 22:47:29 GMT
Cache-Control: no-cache
Expires: Fri, 17 Jul 2020 22:47:29 GMT
Date: Fri, 17 Jul 2020 22:47:29 GMT
Pragma: no-cache
X-FRAME-OPTIONS: SAMEORIGIN
Content-Type: application/json
Transfer-Encoding: chunked
{"FileStatuses":{"FileStatus":[
{"accessTime":1594828591740,"blockSize":134217728,"childrenNum":0,"fileId":11393739,"group":"hdfs","length":104811,"modificationTime":1594828592000,"owner":"XXXX","pathSuffix":"XXXX","permission":"644","replication":3,"storagePolicy":0,"type":"FILE"}
]}}
So the API can properly read the metadata, but I cannot get that file to OPEN:
curl -i "http://namenode:50070/webhdfs/v1/tmp/file.png?op=OPEN"
HTTP/1.1 307 Temporary Redirect
Date: Fri, 17 Jul 2020 22:23:17 GMT
Cache-Control: no-cache
Expires: Fri, 17 Jul 2020 22:23:17 GMT
Date: Fri, 17 Jul 2020 22:23:17 GMT
Pragma: no-cache
X-FRAME-OPTIONS: SAMEORIGIN
Location: http://datanode1:50075/webhdfs/v1/tmp/file.png?op=OPEN&namenoderpcaddress=namenode:8020&offset=0
Content-Type: application/octet-stream
Content-Length: 0
{"RemoteException":{"exception":"FileNotFoundException","javaClassName":"java.io.FileNotFoundException","message":"Path is not a file: /tmp/file.png......
So, according to webhdfs rest api throwing file not found exception, I can see that the request is passed off from the namenode to datanode1.
datanode1 is in my hosts file; I can connect to it and check the status of WebHDFS from there:
<property>
<name>dfs.webhdfs.enabled</name>
<value>true</value>
<final>true</final>
</property>
It is enabled there, and the same on the namenode.
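As a sanity check alongside reading the XML, the effective value can be queried from the configuration visible on each host:

hdfs getconf -confKey dfs.webhdfs.enabled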
I also went to look at the hdfs logs on /var/log/hadoop/hdfs/*.{log,out} to see if I could find errors triggered when I curled, but nothing seems to happen. I see no entry pertaining to my file or webhdfs query. I tried that on the namenode and datanode1.
As a last-ditch effort I tried to relax the permissions (not ideal) from 644 (seen in the LISTSTATUS output above) to 666:
hdfs dfs -chmod 666 /tmp/file.png
curl -i "http://namenode:50070/webhdfs/v1/tmp/file.png?op=LISTSTATUS"
HTTP/1.1 403 Forbidden
Date: Fri, 17 Jul 2020 23:06:18 GMT
Cache-Control: no-cache
Expires: Fri, 17 Jul 2020 23:06:18 GMT
Date: Fri, 17 Jul 2020 23:06:18 GMT
Pragma: no-cache
X-FRAME-OPTIONS: SAMEORIGIN
Content-Type: application/json
Transfer-Encoding: chunked
{"RemoteException":{"exception":"AccessControlException","javaClassName":"org.apache.hadoop.security.AccessControlException","message":"Permission denied: user=XXXX, access=READ_EXECUTE, inode=\"/tmp/file.png\":XXXX:hdfs:drw-rw-rw-"}}
So it seems it did the switch, but somehow I got a permission issue after relaxing the permissions that I didn't get before? It is not as if I removed the x flag; it wasn't there to begin with. Does access=READ_EXECUTE require both r and x?
Now I am at a loss as to why I can see but not read this file with HDFS. Can someone please help me troubleshoot this?
Looking closer at your last error,
... inode=\"/tmp/file.png\":XXXX:hdfs:drw-rw-rw-"}
it seems to indicate that file.png is actually a directory (leading d symbol) and not a file. This is consistent with the error you're getting from the OPEN call: "message":"Path is not a file: /tmp/file.png....
You can double check that simply by doing $ hdfs dfs -ls /tmp/file.png/.
Getting back to your access error, you do need an "execute" (x) permission to list the files in a directory.
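If that is the case, here is a sketch of how you might confirm and recover; the inner filename below is hypothetical:

# Confirm /tmp/file.png is a directory by listing its children
hdfs dfs -ls /tmp/file.png/

# Directories need the execute bit to be listed or traversed
hdfs dfs -chmod 755 /tmp/file.png

# Then OPEN an actual file inside it (part-00000 is a made-up example)
curl -i -L "http://namenode:50070/webhdfs/v1/tmp/file.png/part-00000?op=OPEN"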

webhdfs is redirecting to localhost:50075

I am trying to create a file on a remote HDFS from a non-Hadoop environment.
For this purpose, I am using the pywebhdfs API, and I'm running the command using curl.
https://pythonhosted.org/pywebhdfs/
I used this documentation as a reference; I am able to execute all methods except create_file().
While using create_file(), I am getting an error like 'Couldn't connect to host'.
Command: curl -i -X PUT -L "http://xxx.xxx.xxx.xxx:50070/webhdfs/v1/test1/?op=CREATE" -T sample.txt
Response:
HTTP/1.1 100 Continue
HTTP/1.1 307 TEMPORARY_REDIRECT
Cache-Control: no-cache
Expires: Tue, 30 Oct 2018 12:04:04 GMT
Date: Tue, 30 Oct 2018 12:04:04 GMT
Pragma: no-cache
Expires: Tue, 30 Oct 2018 12:04:04 GMT
Date: Tue, 30 Oct 2018 12:04:04 GMT
Pragma: no-cache
Content-Type: application/octet-stream
Location: http://localhost:50075/webhdfs/v1/test1/?op=CREATE&namenoderpcaddress=xxx.xxx.xxx.xxx:9000&overwrite=false
Content-Length: 0
Server: Jetty(6.1.26)
curl: (7) couldn't connect to host
The Location header shows localhost here. I took reference from the earlier post
webhdfs always redirect to localhost:50075
but I didn't get success.
I tried changing the IP in hdfs-site.xml and the /etc/hosts file, but no success at all.
Can anyone tell me how to fix this?
Thanks in advance.
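For what it's worth, this redirect usually means the datanode registered with the namenode under a hostname that resolves to localhost. A sketch of the usual fix, assuming the datanode's real hostname is dn1.example.com (a hypothetical name): on the datanode, make sure its own hostname is not mapped to 127.0.0.1 in /etc/hosts, and pin the hostname it reports in hdfs-site.xml:

<property>
<name>dfs.datanode.hostname</name>
<value>dn1.example.com</value>
</property>

Then restart the datanode so it re-registers. Until then, a manual workaround is the documented two-step create: issue the first PUT without -T, take the Location it returns, substitute the datanode's real address for localhost, and send the data there:

curl -i -X PUT "http://xxx.xxx.xxx.xxx:50070/webhdfs/v1/test1/sample.txt?op=CREATE"
curl -i -X PUT -T sample.txt "http://dn1.example.com:50075/webhdfs/v1/test1/sample.txt?op=CREATE&namenoderpcaddress=xxx.xxx.xxx.xxx:9000&overwrite=false"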

WebHDFS OPEN command returns empty results

I created a simple file in HDFS at the path /user/admin/foo.txt
I can see the contents of this file in Hue.
Now I issue the command:
curl -i http://namenode:50070/webhdfs/v1/user/admin/foo.txt?op=OPEN
I get the response
HTTP/1.1 307 TEMPORARY_REDIRECT
Cache-Control: no-cache
Expires: Tue, 24 Nov 2015 16:20:15 GMT
Date: Tue, 24 Nov 2015 16:20:15 GMT
Pragma: no-cache
Expires: Tue, 24 Nov 2015 16:20:15 GMT
Date: Tue, 24 Nov 2015 16:20:15 GMT
Pragma: no-cache
Location: http://datanode:50075/webhdfs/v1/user/admin/foo.txt?op=OPEN&namenoderpcaddress=nameservice1&offset=0
Content-Type: application/octet-stream
Content-Length: 0
Server: Jetty(6.1.26.cloudera.4)
Why is the Content-Length 0? I was hoping that this would list the contents of the file.
Execute:
curl -i "http://datanode:50075/webhdfs/v1/user/admin/foo.txt?op=OPEN&namenoderpcaddress=nameservice1&offset=0"
(Note the quotes: without them the shell treats each & as a background operator and the query parameters are lost.)
As for the explanation: when using WebHDFS to open a file, the following happens:
1. You don't know which node the file resides on, so you ask the namenode.
2. The namenode returns a datanode containing the file.
3. You then open the file itself by talking directly to that datanode.
So this activity is expected. See https://hadoop.apache.org/docs/r1.0.4/webhdfs.html for more information.
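If you'd rather not follow the redirect by hand, curl can do it for you with -L; again, keep the URL quoted so the shell doesn't swallow the & in the redirect target:

curl -i -L "http://namenode:50070/webhdfs/v1/user/admin/foo.txt?op=OPEN"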

Cached content expiring too frequently on fastly CDN

I'm using Fastly as a CDN in front of my Heroku application, and am seeing many requests that I expect to be cached make it through to the origin.
An example of this behavior is two requests to the URL:
https://nuu-acceptance-herokuapp-com.global.ssl.fastly.net/attachments/f092ff0398b3bace19fae21b17a22320c3da5428/store/fit/240/160/28515a2fa2e47b59f13b2044ea5b9a7c8c9587ceca7d7dfadb28f08730f7/file.jpg. Here are two responses from the requests, which occurred fifteen minutes apart:
RESPONSE 1:
----------
Strict-Transport-Security: max-age=31536000
Content-Encoding: gzip
X-Content-Type-Options: nosniff
Age: 0
Transfer-Encoding: chunked
X-Cache: MISS
X-Cache-Hits: 0
Content-Disposition: inline; filename="file.jpg"
Connection: keep-alive
Via: 1.1 vegur
Via: 1.1 varnish
X-Request-Id: bc766069-c2ca-4a66-ba88-a8d76da72e2d
X-Served-By: cache-sjc3124-SJC
X-Runtime: 3.711698
Last-Modified: Tue, 23 Jun 2015 18:44:27 GMT
Server: Cowboy
X-Timer: S1435085062.909546,VS0,VE4437
Date: Tue, 23 Jun 2015 18:44:27 GMT
Vary: Accept-Encoding
Content-Type: image/jpeg
Access-Control-Allow-Origin: *
Cache-Control: public, must-revalidate, max-age=31536000
Set-Cookie: __profilin=; path=/; max-age=0; expires=Thu, 01 Jan 1970 00:00:00 -0000; secure
Accept-Ranges: bytes
Expires: Wed, 22 Jun 2016 18:44:27 GMT
----------
RESPONSE 2:
----------
Strict-Transport-Security: max-age=31536000
Content-Encoding: gzip
X-Content-Type-Options: nosniff
Age: 0
Transfer-Encoding: chunked
X-Cache: MISS
X-Cache-Hits: 0
Content-Disposition: inline; filename="file.jpg"
Connection: keep-alive
Via: 1.1 vegur
Via: 1.1 varnish
X-Request-Id: 60ee54b0-9509-42c5-9b03-c0f5854c5524
X-Served-By: cache-sjc3135-SJC
X-Runtime: 0.251021
Last-Modified: Tue, 23 Jun 2015 18:57:44 GMT
Server: Cowboy
X-Timer: S1435085863.749442,VS0,VE560
Date: Tue, 23 Jun 2015 18:57:44 GMT
Vary: Accept-Encoding
Content-Type: image/jpeg
Access-Control-Allow-Origin: *
Cache-Control: public, must-revalidate, max-age=31536000
Accept-Ranges: bytes
Expires: Wed, 22 Jun 2016 18:57:44 GMT
Both are cache misses, even though I expect this content to be cached for a year, and it appears that the same Fastly POP (SJC) handled both requests, though via different cache nodes. Can anyone point me to what I might be doing wrong? I'm seeing this behavior across many files served by Fastly: it serves them from cache intermittently, but there are cache misses much more often than I expect.
I'd appreciate any help that anyone could give me with this - thanks!
If you look at the HTTP headers of your responses, you will see that the first one contains a Set-Cookie header. Responses with cookies will not be cached by Fastly, so nothing gets stored and subsequent requests miss as well. You can remove the cookie in your app or within your Fastly configuration.
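A quick way to confirm which responses carry the cookie before touching the Fastly config:

curl -sI "https://nuu-acceptance-herokuapp-com.global.ssl.fastly.net/attachments/f092ff0398b3bace19fae21b17a22320c3da5428/store/fit/240/160/28515a2fa2e47b59f13b2044ea5b9a7c8c9587ceca7d7dfadb28f08730f7/file.jpg" | grep -iE 'set-cookie|x-cache'

If Set-Cookie shows up, stripping it at the origin (here, disabling the __profilin cookie outside development) or unsetting it in your Fastly configuration should let the object cache.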

Setting Access-Control-Allow-Origin on Cloudfront

I am having problems serving static assets to Firefox using AWS CloudFront.
Chrome works perfectly, but Firefox is returning a CORS error.
If I execute curl, I get:
HTTP/1.1 200 OK
Content-Type: application/x-font-opentype
Content-Length: 39420
Connection: keep-alive
Date: Mon, 11 Aug 2014 21:53:50 GMT
Cache-Control: public, max-age=31557600
Expires: Sun, 09 Aug 2015 01:28:02 GMT
Last-Modified: Fri, 08 Aug 2014 19:28:05 GMT
ETag: "9df744bdf9372cf4cff87bb3e2d68fc8"
Accept-Ranges: bytes
Server: AmazonS3
Age: 2743
X-Cache: Hit from cloudfront
Via: 1.1 c445b20dfbf3128d810e975e5d84e2cd.cloudfront.net (CloudFront)
X-Amz-Cf-Id: ...
The response, I think, is missing the header:
Access-Control-Allow-Origin: *
Can anyone help me? Why is it a problem on Firefox and not Chrome? How can I solve it?
Have you configured your distribution to support CORS by setting the Origin header to be forwarded?
Reference:
http://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/header-caching.html#header-caching-web-cors
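One thing worth knowing: S3 only returns CORS headers when the request carries an Origin header, so a plain curl will never show Access-Control-Allow-Origin even when everything is configured correctly. A sketch of how to check through CloudFront the way Firefox does (the distribution domain and path below are placeholders):

curl -sI -H "Origin: https://www.example.com" "https://dXXXXXXXXXXXX.cloudfront.net/fonts/myfont.otf" | grep -i 'access-control'

If the header appears with Origin set but not without, the browser difference is expected: Firefox enforces CORS for web fonts, while Chrome at the time did not do so consistently.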
