Scrapy: Sitemap spider and gzipped files

I tried running the sitemap spider, but it refused to crawl gzipped sitemaps. It gave the following error:
[scrapy] WARNING: Ignoring non-XML sitemap
Is there a setting that needs to be enabled to allow parsing of gzipped sitemaps?
I'm using Scrapy version 0.15.

Scrapy should automatically unzip the gzipped content.
See the responsible code in contrib/spiders/sitemap.py:
if isinstance(response, XmlResponse):
    body = response.body
elif is_gzipped(response):
    body = gunzip(response.body)
else:
    log.msg("Ignoring non-XML sitemap: %s" % response, log.WARNING)
    return
I think either the XML is malformed, or the file isn't gzipped with the proper headers. I suggest trying the same spider on a sitemap whose formatting you're sure of.
If you want, I can run a test of my own; if you can provide me with your current code, I'll be able to give you a better answer :-).
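For a quick test, something like this minimal spider should do (a sketch against the 0.x contrib API; the sitemap URL is a placeholder you'd swap for one whose formatting you trust):
from scrapy.contrib.spiders import SitemapSpider

class SitemapTestSpider(SitemapSpider):
    name = 'sitemap_test'
    # Point this at a sitemap (.xml or .xml.gz) you know is well-formed.
    sitemap_urls = ['https://www.example.com/sitemap.xml.gz']

    def parse(self, response):
        # SitemapSpider routes every URL found in the sitemap here by default;
        # just log it so you can see whether the sitemap was parsed at all.
        self.log("Crawled %s" % response.url)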

You might want to take note of this commit Scrapy's author made yesterday:
SitemapSpider: added support for sitemap urls ending in .xml and .xml.gz, even if they have a wrong content type
You could try cloning the latest version and re-running your spider.

I solved the problem by adding this to the settings.py file:
DOWNLOADER_MIDDLEWARES = {
    'scrapy.contrib.downloadermiddleware.httpcompression.HttpCompressionMiddleware': None
}
Apparently this is a Scrapy bug: https://github.com/scrapy/scrapy/issues/951
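If you want to check which case you're hitting before disabling the middleware, a small diagnostic helps: when HttpCompressionMiddleware has already inflated the body (because the server also sent Content-Encoding: gzip), the gzip magic bytes are gone and gunzip'ing it a second time fails. A sketch you could drop into a spider callback (the helper name is made up):
def looks_gzipped(response):
    # Gzip data starts with the two magic bytes 0x1f 0x8b; if they are missing,
    # the body was already decompressed upstream.
    return response.body[:2] == b'\x1f\x8b'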

Related

XML Sitemap not working with Google Search Console?

I have an XML sitemap located at https://store.usbswiper.com/sitemap_index.xml, which as you can see loads just fine.
However, Google Search Console is telling me it can't fetch the sitemap.
When I use this validator, it reports a successful validation.
I have checked the robots.txt, and it's not blocking anything. It specifies the sitemap URL correctly as well.
Any info on why Google Search Console is giving me this "couldn't fetch" message would be greatly appreciated.
EDIT: When I first ran that validator it gave me this error:
Incorrect http header content-type: "" (expected: "application/xml")
I added a robots.txt, and when I ran the validator again before posting this thread, it validated successfully. I've just tried again and it's failing again with the same message. I don't understand why it works sometimes and not others. Search Console hasn't successfully loaded the sitemap at all, no matter what the validator is doing.
Add the full URL in Search Console; for example, add https://example.net/sitemap.xml rather than just sitemap.xml.
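Since the validator complained about an empty Content-Type, it's also worth checking what the server actually returns for the sitemap; intermittent differences here would explain why validation only succeeds sometimes. A quick check (a sketch using Python's requests, with the URL from the question):
import requests

resp = requests.get("https://store.usbswiper.com/sitemap_index.xml", timeout=10)
print(resp.status_code)                  # expect 200
print(resp.headers.get("Content-Type"))  # expect application/xml (or text/xml)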

MPDF in Laravel can't output (inline) pdf

I am running the code below in Laravel 5.5 with mPDF 8.0:
$mpdf = new \Mpdf\Mpdf();
$mpdf->WriteHTML('Hello World');
$mpdf->Output("test","I");
It outputs gibberish/garbage, apparently showing the PDF file in raw form.
Some findings
If I use $mpdf->Output($reportPath, 'F'); (saving it to a file) and then open that file, it opens as expected.
If I place die(); after $mpdf->Output("test","I"); it shows the document.
My suspicion is that it has something to do with Content-Type: application/pdf not being set by default, but I have also tried calling header("Content-type:application/pdf"); before Output, to no avail. The response header still shows Content-Type: text/html; charset=UTF-8 in the Network tab of Chrome (I also tried Firefox).
Some back-story
It used to work fine on PHP 7.3, but I had to update to PHP 7.4 due to a library requirement and a multiple-applications-on-a-single-server scenario.
I also started using a sub-domain for my application instead of placing the directories after the domain.
I'm looking for
A solution that doesn't require me to place die(); after the output call.
Or some clue as to why this has started happening, and/or why I need to place die(); after Output.
Any other solution.
The goal is to provide a reference for people encountering the same issue in the future, since I have spent hours on this and haven't found anything that specifically addresses it.
OK, so I found out that I can't just rely on $this->mpdf->Output('test.pdf',"I") to send the result to the browser (though it previously worked with this very line).
For some reason the response started going out with a Content-Type: text/html header, so I had to set the header explicitly.
Solution
I did it as below:
return response($this->mpdf->Output('test.pdf',"I"),200)->header('Content-Type','application/pdf');

MediaWiki InstantCommons file download error

My goal: I'd like to use an image from commons.wikimedia.org within a MediaWiki installation.
First I was trying to debug my InstantCommons configuration: referring to files on commons.wikimedia.org failed for some reason. After activating various debugging options I learned that although the general image download succeeded, some kind of thumbnail follow-up request issued by the MediaWiki installation failed, which resulted in an overall error from the ForeignAPIRepo module.
As I cannot deal with this error right now, I thought I'd try something else as a fallback: download the Commons image by specifying the image URL on the upload page. The idea is to let MediaWiki download the image and include it as regular wiki content. This way I would need to add license details and a few comments manually, but that would be better than having no image.
But trying this I strangely get an error: it says "Fehler beim Senden der Anfrage", which means "Error while sending the request". Yet the internal request seems to succeed in the logs. Here is what MediaWiki was logging:
[fileupload] Temporary file created "/tmp/URLdafce5345aa3-1"
[fileupload] Starting download from "https://upload.wikimedia.org/wikipedia/commons/c/c7/Broccoli%2C_Champignons%2C_Karotten_%2810581663524%29.jpg" <followRedirects>
[fileupload] <Error, collected 1 error(s) on the way, integer value set>
+------+---------------------------+------------------------------------------+
| 1 | http-request-error | |
+------+---------------------------+------------------------------------------+
[fileupload] Download by URL completed with HTTP status 200
Comment: All other log messages do not indicate anything that looks like an error or is related to the task of downloading the image, so I skipped them here.
The URL is correct, the image can be downloaded from the URL, MediaWiki receives a response code of 200, but instead of processing the response it indicates an error. Why? For http and https URLs I get the same result in the log.
Has anybody encountered this problem before in MediaWiki installations? Does anyone have any idea what the reason for this behaviour could be?
Comment: The wiki is version 1.25.2, a standard installation (including SWM) on an up-to-date standard Ubuntu Linux OS. Nothing exotic, nothing modified in any way.
Comment: Yes, I could upgrade to the latest version, but I'm not sure that would really solve the problem: I know that this feature did work in some other MediaWiki installations I set up some time ago. Does anyone have a clue why this feature could fail here? Has anyone encountered something like this before?
Edit: I experimented with downloading from another MediaWiki instance of exactly the same version, 1.25.2, on my local network. This did not succeed either, but I got a different error message (translated): "The file .... could not be stored at ...". The "funny" part: although the error message indicates otherwise, the file has been downloaded successfully and stored as expected. It has the correct user rights, as one would expect, but log messages indicate that there are bugs in MediaWiki in this area ("PHP Notice: Undefined property: UploadFromUrl::$nbytes"). Maybe the upload implementation is buggy somehow and the problems I am running into are typical?
There are multiple bugs with HTTPS support in MediaWiki, php-curl, etc. See https://www.mediawiki.org/wiki/InstantCommons#HTTPS for debugging information; there is no magic bullet.
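As a first step it can help to confirm, from the wiki host itself, that a plain HTTPS fetch of the failing URL works outside of MediaWiki/php-curl; if this fails too, the problem is TLS or networking rather than MediaWiki. A sketch (Python used here purely as a neutral client; the URL is the one from the log above):
import requests

url = ("https://upload.wikimedia.org/wikipedia/commons/c/c7/"
       "Broccoli%2C_Champignons%2C_Karotten_%2810581663524%29.jpg")
resp = requests.get(url, timeout=30)
print(resp.status_code, len(resp.content))  # expect 200 and a non-zero byte count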

ruby/bash: How do I download a large file using the "If-Range" and "Range" headers?

I've been trying to use mechanize to download mp3 files, but the server always returns a 404.
Looking at the headers my browser sends (checked on Chrome and FF), I noticed that the If-Range and Range headers are used to initiate a successful download, so I'm guessing the server is rejecting any request that doesn't specify them.
What is the right way to download files in this way, using ruby (Net::HTTP) or bash (curl or wget)?
404 is file not found. Are you sure your URL is correct? If it is correct then you should be able to use wget <full url and file name> to test it.
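If the server really does require ranged requests, the headers themselves are easy to reproduce; with curl the equivalent is curl -H "Range: bytes=0-" -o out.mp3 <url>. A sketch of the same idea (shown here in Python; the URL is a placeholder):
import requests

url = "https://example.com/path/to/file.mp3"   # placeholder
headers = {
    "Range": "bytes=0-",                        # request the whole file as a ranged download
    # "If-Range": "<etag-or-last-modified>",    # optional: only honour Range if the resource is unchanged
}
with requests.get(url, headers=headers, stream=True, timeout=30) as resp:
    resp.raise_for_status()                     # expect 206 Partial Content (or plain 200)
    with open("out.mp3", "wb") as f:
        for chunk in resp.iter_content(chunk_size=65536):
            f.write(chunk)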

GET request to mp3 in S3 bucket failing to download file with 206 partial content?

I have an mp3 file in an S3 bucket. I am fetching this file via an ajax GET request for HTML5 audio playback. Intermittently, the GET request fails to download the file and thus the track will not play. The request returns "206 Partial Content". Oddly, it will work several times before failing and then continues to fail.
If I disable caching in my browser (chrome), the file will download and play appropriately.
Have I configured s3 incorrectly? How can I get this mp3 file to download and play consistently?
The specific file is located here: https://s3.amazonaws.com/1m40s_dev/assets/music/walden.mp3
thanks!
I've found this often relates to the MIME type set on the S3 hosted file.
Setting the correct MIME type seems to fix things.
On a side note, I struggled with a single binary file always breaking in IE. Its MIME type was application/octet-stream. I changed the MIME to binary/octet-stream and that seemed to fix downloads from IE. Not sure why.
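If the object was uploaded without an explicit Content-Type, it can be fixed in place with a self-copy that rewrites the metadata. A sketch with boto3 (bucket and key are inferred from the URL in the question, so adjust as needed):
import boto3

s3 = boto3.client("s3")
bucket = "1m40s_dev"
key = "assets/music/walden.mp3"

s3.copy_object(
    Bucket=bucket,
    Key=key,
    CopySource={"Bucket": bucket, "Key": key},
    ContentType="audio/mpeg",        # correct MIME type for mp3
    MetadataDirective="REPLACE",     # required, otherwise the new Content-Type is ignored
)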
Using Amazon CloudFront solved the problem.
I solved this by appending a timestamp to the end of the mp3 url on page load. This forced a new download of the content each time and eliminated the caching error.
This feels more like a workaround than a fix. I still don't know the root cause of the issue, but if you find yourself having a similar problem and just need to move on, add a timestamp or random number as a param at the end of the URL:
.../assets/music/walden.mp3?[timestamp]
One other workaround I've found: if you're using Rails, turning off Turbolinks makes this go away in Chrome. I'll add more to my answer as I discover more.
