Why does Scrapy give 404 for images that are available?

Here is an example of an image URL that I add to the image_urls field:
http://static.zara.net/photos//2014/I/0/2/p/5875/309/800/2/w/1920/5875309800_1_1_1.jpg
Yet I get this warning and the image is not uploaded.
[zara_com] WARNING: File (code: 404): Error downloading image from http://static.zara.net/photos//2014/I/0/2/p/5875/309/800/2/w/1920/5875309800_1_1_1.jpg> referred in
Though an image like this one:
http://static.zara.net/photos//2014/V/1/3/p/1280/303/105/2/w/1920/1280303105_2_1_1.jpg
is uploaded normally.
What might be the problem? What should I check?

As far as I can see, they seem to be filtering requests made with the default scrapy user agent:
'User-Agent': 'Scrapy/0.24.2 (+http://scrapy.org)'
When I changed the USER_AGENT setting in settings.py of my project, it started returning 200 on all requests. The strange thing is that, before that change, it returned 404 even for the image which you said is uploaded normally.
P.S. It's not good practice to scrape content from a site if they don't allow it, but then again they are not disallowing it in their robots.txt. Still, you should probably enable the RobotsTxtMiddleware and the AutoThrottle extension to make sure you are playing fairly.
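For reference, a minimal sketch of the relevant settings.py entries (the user-agent string below is just an example browser UA, not anything specific to this site):

# settings.py -- sketch of the settings discussed above.
# Replace the default Scrapy user agent with a realistic browser one,
# since the image host appears to reject the default "Scrapy/..." UA.
USER_AGENT = (
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/37.0.2062.120 Safari/537.36"
)

# Play fairly: respect robots.txt and throttle the request rate.
ROBOTSTXT_OBEY = True
AUTOTHROTTLE_ENABLED = True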

Google Drive API Console: Error saving Drive UI integration page

I have a webapp in production that interacts with Google Drive through the Google Drive API.
I need to change some settings of the Drive integration, but I can't save them.
When I save the Drive UI integration page, I receive this error:
There's a problem at our end.
Please try again. If the problem persists, please let us know using
the "Send feedback" link below. Thanks!
(Inspecting the Network console: there is an Internal Server Error on a POST call.)
I have tried to send feedback for months: nobody answers and the bug is still there.
I also tried creating another project: I can save the first time, but then the bug returns.
What can I do? Does anyone have the same problem?
Is there a way to get a reply from Google? Is there some workaround?
Thank you.
I think the problem must be the Client ID.
Before adding the Client ID, go to Credentials -> OAuth 2.0 Client IDs,
then select and edit your Client ID. After that, add your production site URL to Authorized JavaScript origins and Authorized redirect URIs.
Then enter your Client ID on the Drive UI integration page.
While trying to get the Drive UI configured myself, I noticed a couple of errors (which don't come with any specific error messages):
When adding an Open URL, it has to be a valid domain. For instance, I tried to test it with localhost, to no avail. Something like https://devbox.app.com worked, but something like https://localhost:8888 does not. Even though https://localhost is a valid JavaScript origin in the client_id configuration (at least for the app I am working on, not sure about other apps), localhost doesn't work as an Open URL.
When adding the MIME types, they need to be in the format */* and can include custom MIME types like application/custom+xml and application/custom-name+json. I'm not sure about custom types that aren't based on a standard format like XML or JSON, and not sure about wildcards either.
When adding file extensions, do not include the '.', just the name of the extension.
For the app icon, the upload only failed when the image wasn't the exact required dimensions; I ended up editing some icons in Photoshop to change the pixel dimensions as a quick workaround during development.
That got it to save for me, and I tested it with a file that had a custom MIME type (application/custom-name+xml, specifically) and a custom file extension!

How do we know if there is an 'image not found' error?

I have a job to check some 300 URLs to verify that they load correctly. They are the production URLs of a company. One of the checks we have to do is to determine whether any image or any text has not loaded properly.
For images that fail to load we usually get a cross mark. But since we want to automate this task, what piece of code or debugging tool can tell us that there is an 'image not found' error on a particular URL?
We are also checking for 404 and 5xx HTTP errors, but we are hoping to catch the image error through the debugging window as well.
Let me know in case more information is required.
Thanks
Dhanya
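Not a full answer, but as an illustration of one possible approach: a small Python sketch (assuming the requests and beautifulsoup4 packages are available) that fetches each page, extracts the <img> tags, and reports any image URL that does not return a successful status. The URLs in the list are placeholders.

# check_images.py -- sketch: report images that fail to load on a set of pages.
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

PAGE_URLS = [
    "https://example.com/page1",  # placeholder: replace with the production URLs
    "https://example.com/page2",
]

def broken_images(page_url):
    """Return (image_url, status_code) pairs for images that do not load."""
    page = requests.get(page_url, timeout=10)
    page.raise_for_status()
    soup = BeautifulSoup(page.text, "html.parser")
    bad = []
    for img in soup.find_all("img", src=True):
        img_url = urljoin(page_url, img["src"])
        # HEAD is usually enough; some servers only allow GET, so fall back.
        resp = requests.head(img_url, allow_redirects=True, timeout=10)
        if resp.status_code == 405:
            resp = requests.get(img_url, stream=True, timeout=10)
        if resp.status_code >= 400:
            bad.append((img_url, resp.status_code))
    return bad

for url in PAGE_URLS:
    for img_url, status in broken_images(url):
        print(f"{url}: image not found ({status}): {img_url}")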

MediaWiki InstantCommons file download error

My goal: I'd like to use an image from commons.mediawiki.org within a MediaWiki installation.
First I was trying to debug my InstantCommons configuration: referring to files on commons.mediawiki.org failed for some reason. After activating various debugging options, I learned that although the general image download succeeded, some kind of thumbnail follow-up request issued by the MediaWiki installation failed, which resulted in an overall error from the ForeignAPIRepo module.
As I cannot deal with this error right now, I thought I'd try something else as a kind of fallback: download the image by specifying the image URL on the image upload page. The idea is to let MediaWiki download the image and include it as regular wiki content. This way I would have to add license details manually and add a few comments, but that would be better than having no image.
But trying this, I strangely get an error: it says "Fehler beim Senden der Anfrage", which means "Error while sending the request". Yet the internal request seems to succeed according to the logs. Here is what MediaWiki was logging:
[fileupload] Temporary file created "/tmp/URLdafce5345aa3-1"
[fileupload] Starting download from "https://upload.wikimedia.org/wikipedia/commons/c/c7/Broccoli%2C_Champignons%2C_Karotten_%2810581663524%29.jpg" <followRedirects>
[fileupload] <Error, collected 1 error(s) on the way, integer value set>
+------+---------------------------+------------------------------------------+
| 1 | http-request-error | |
+------+---------------------------+------------------------------------------+
[fileupload] Download by URL completed with HTTP status 200
Comment: All other log messages do not indicate anything that looks like an error or is related to the task of downloading the image, so I skipped them here.
The URL is correct, the image can be downloaded from the URL, MediaWiki receives a response code of 200, but instead of processing the response it indicates an error. Why? For http and https URLs I get the same result in the log.
Has anybody encountered this problem before in MediaWiki installations? Does anyone have any idea what the reason for this behaviour could be?
Comment: The wiki is version 1.25.2, a standard installation including SWM, on an up-to-date standard Ubuntu Linux OS. Nothing exotic, nothing modified in any way.
Comment: Yes, I could upgrade to the latest version, but I'm not sure that would really solve the problem: I know this feature did work in some other MediaWiki installations I set up some time ago. Does anyone have a clue why the feature could fail here? Has anyone encountered something like this before?
Edit: I experimented with downloading from another MediaWiki instance of exactly the same version - 1.25.2 - in my local network. This did not succeed either, but I get a different error message (translated): "The file .... could not be stored at ...". The "funny" part: although the error message indicated otherwise, the file was downloaded successfully and stored as expected. It has the correct user rights, as one would expect, but the log messages indicate that there are bugs in MediaWiki in this area ("PHP Notice: Undefined property: UploadFromUrl::$nbytes"). Maybe the upload implementation is buggy somehow and the problems I am running into are typical?
There are multiple bugs with HTTPS support in MediaWiki, php-curl, etc. See https://www.mediawiki.org/wiki/InstantCommons#HTTPS for debugging information; there is no magic bullet.
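As a first sanity check, it can help to verify from the wiki host itself that the Commons URL is reachable over HTTPS at all, independently of MediaWiki and php-curl. A minimal sketch in Python (assuming the requests package is available on the host); if this fails, the problem is more likely the host's network or CA bundle than MediaWiki itself:

# https_check.py -- sketch: fetch the Commons file directly, bypassing MediaWiki.
import requests

url = ("https://upload.wikimedia.org/wikipedia/commons/c/c7/"
       "Broccoli%2C_Champignons%2C_Karotten_%2810581663524%29.jpg")
resp = requests.get(url, timeout=30)
print(resp.status_code, resp.headers.get("Content-Type"), len(resp.content))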

How can I scrape an image that doesn't have an extension?

Sometimes I come across an image that I can't scrape and save. An example of this is:
https://s3.amazonaws.com/plumdistrict.com-production/perks/12321/image/original.?1325898487
When I hit the URL from Internet Explorer I see the image, but when I try to get it with the code below, GetResponse fails with "System.Net.WebException: The remote server returned an error: (403) Forbidden":
string url = "https://s3.amazonaws.com/plumdistrict.com-production/perks/12321/image/original.?1325898487";
WebRequest request = WebRequest.Create(url);
// This call throws the WebException with (403) Forbidden.
WebResponse response = request.GetResponse();
Any ideas on how to get this image?
Edit:
I am able to save images that do have extensions. For example, I can scrape the following image just fine:
https://s3.amazonaws.com/plumdistrict.com-production/perks/12659/image/original.jpg?1326828951
Although HTTP is originally supposed to be stateless, there are a lot of implementations that rely on it not being stateless. I could configure my webserver to only accept requests for "http://mydomain.com/sexy_avatar.jpg" if you provide a cookie proving you are logged in. If not, I send you a 303 redirect to "http://mydomain.com/avatar_for_public_use.jpg".
Amazon could be doing the same. Try to load the web page using Chrome, and look at the Network view in developer mode (CTRL+SHIFT+J) to see all headers supplied to the website. Maybe you even need to do a full navigation in the same session before you are allowed to see the image. This is certainly the case in many web applications I have developed :-)
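To illustrate the idea (sketched in Python with requests for brevity; the same headers can be set on the .NET request as well), the snippet below sends browser-like headers, including a Referer, and saves the bytes. The header values are only examples of what you might copy from Chrome's Network view.

# fetch_image.py -- sketch: replay browser-like headers when fetching the image.
import requests

url = ("https://s3.amazonaws.com/plumdistrict.com-production/"
       "perks/12321/image/original.?1325898487")

headers = {
    # Example values; copy the real ones from Chrome's Network view.
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/37.0 Safari/537.36",
    "Referer": "https://plumdistrict.com/",  # hypothetical referring page
}

resp = requests.get(url, headers=headers, timeout=30)
resp.raise_for_status()
with open("original.jpg", "wb") as f:
    f.write(resp.content)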
Well, it looks like it's being generated from a script (possibly being retrieved from a database). The server should be sending a file/content type to go along with that... but it doesn't seem to be, which I believe is a violation of standards.
My Linux box knows full well that that's a JPEG image once it's on my hard drive, because it examines file headers rather than relying on extensions. Perhaps there is a tool to do the same in Windows?
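For what it's worth, here is a small sketch of that idea in Python: identify the type from the file's leading magic bytes instead of its (missing) extension. The signatures for JPEG, PNG and GIF are standard.

# sniff_type.py -- sketch: guess an image type from magic bytes, not the extension.
def sniff_image_type(path):
    with open(path, "rb") as f:
        head = f.read(12)
    if head.startswith(b"\xff\xd8\xff"):
        return "jpeg"
    if head.startswith(b"\x89PNG\r\n\x1a\n"):
        return "png"
    if head.startswith((b"GIF87a", b"GIF89a")):
        return "gif"
    return "unknown"

print(sniff_image_type("original."))  # prints "jpeg" for a JPEG saved without extension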
Edit: Actually, on further contemplation, it seems odd that you'd get a 403 for that. Perhaps the server is actually blocking you from retrieving the file in that manner.

GET request to mp3 in S3 bucket failing to download file with 206 partial content?

I have an mp3 file in an S3 bucket. I am fetching this file via ajax GET request for html5 audio playback. Intermittently, the get request will fail to download the file and thus the track will not play. The request returns "206 partial content." Oddly, it will work several times before failing and then continuing to fail.
If I disable caching in my browser (chrome), the file will download and play appropriately.
Have I configured s3 incorrectly? How can I get this mp3 file to download and play consistently?
The specific file is located here: https://s3.amazonaws.com/1m40s_dev/assets/music/walden.mp3
Thanks!
I've found this often relates to the MIME type set on the S3 hosted file.
Setting the correct MIME type seems to fix things.
On a side note, I struggled with a single binary file always breaking in IE. Its MIME type was application/octet-stream. I changed the MIME to binary/octet-stream and that seemed to fix downloads from IE. Not sure why.
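For illustration, one way to correct the MIME type on an object that is already in the bucket, sketched with boto3 (the bucket and key are taken from the question's URL; since S3 objects are immutable, the usual trick is to copy the object onto itself with replaced metadata):

# fix_mime.py -- sketch: set Content-Type on an existing S3 object via self-copy.
import boto3

s3 = boto3.client("s3")
bucket = "1m40s_dev"
key = "assets/music/walden.mp3"

s3.copy_object(
    Bucket=bucket,
    Key=key,
    CopySource={"Bucket": bucket, "Key": key},
    ContentType="audio/mpeg",
    MetadataDirective="REPLACE",  # required so the new Content-Type is applied
)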
Using Amazon CloudFront solved the problem.
I solved this by appending a timestamp to the end of the mp3 URL on page load. This forced a new download of the content each time and eliminated the caching error.
This feels more like a workaround than a fix. I still don't know the root cause of the issue, but if you find yourself having a similar problem and just need to move on, add a timestamp or random number as a parameter at the end of the URL:
.../assets/music/walden.mp3?[timestamp]
One other workaround I've found: if you're using Rails, turning off Turbolinks makes this go away in Chrome. I'll add more to my answer as I discover more.
