Scrapy persistent cache - caching

We need to be able to re-crawl historical data. Imagine today is the 23rd of June. We crawl a website today, but after a few days we realize we have to re-crawl it, "seeing" it exactly as it was on the 23rd. That means including all possible redirects, GET and POST requests, etc. ALL the pages the spider sees should be exactly as they were on the 23rd, no matter what.
Use-case: if there is a change on the website and our spider is unable to crawl something, we want to be able to go back "in the past" and re-run the spider after we fix it.
Generally, this should be quite easy: subclass Scrapy's standard cache storage, force it to use dates for subfolders, and end up with something like this:
cache/spider_name/2015-06-23/HERE ARE THE CACHED DIRS
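Something along these lines is what I have in mind, as a minimal sketch only: it assumes Scrapy's built-in FilesystemCacheStorage (the module path differs in older Scrapy versions), and CRAWL_DATE is a custom setting name made up for this example, not a standard Scrapy setting.

import os
from datetime import date

from scrapy.extensions.httpcache import FilesystemCacheStorage


class DatedFilesystemCacheStorage(FilesystemCacheStorage):
    """Cache storage whose root includes a crawl date, so old snapshots can be replayed."""

    def __init__(self, settings):
        super().__init__(settings)
        # Default to today's date for a fresh crawl; pass CRAWL_DATE
        # (e.g. "2015-06-23") to replay a historical snapshot.
        crawl_date = settings.get('CRAWL_DATE') or date.today().isoformat()
        # Note: this puts the date above the per-spider folder
        # (cache/2015-06-23/spider_name/...); overriding _get_request_path
        # instead would put it below, as in the layout shown above.
        self.cachedir = os.path.join(self.cachedir, crawl_date)

It would be enabled with HTTPCACHE_STORAGE = 'myproject.httpcache.DatedFilesystemCacheStorage' (the module path here is hypothetical).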
But when I was experimenting with this, I realized the spider sometimes crawls the live website. That is, it doesn't take some pages from the cache (even though the appropriate files exist on disk) but fetches them from the live website instead. This happened with captcha pages in particular, but maybe with some others too.
How can we force Scrapy to always take the page from the cache and never hit the live website at all? Ideally, it should even work with no internet connection.
Update: we've used the Dummy policy and HTTPCACHE_EXPIRATION_SECS = 0
Thank you!

To do exactly what you want, you should add this to your settings:
HTTPCACHE_IGNORE_MISSING = True
With this enabled, requests not found in the cache will be ignored instead of downloaded.
When you set:
HTTPCACHE_EXPIRATION_SECS = 0
it only assures you that cached requests will never expire; if a page isn't in your cache, it will still be downloaded.
You can check the documentation.
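Putting it all together, a settings.py sketch for fully offline replay looks like this (these are the standard Scrapy HTTP-cache settings; the module paths below are for Scrapy 1.x):

HTTPCACHE_ENABLED = True
HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
HTTPCACHE_POLICY = 'scrapy.extensions.httpcache.DummyPolicy'  # ignore HTTP caching headers entirely
HTTPCACHE_EXPIRATION_SECS = 0    # cached responses never expire
HTTPCACHE_IGNORE_MISSING = True  # requests missing from the cache are ignored, never downloaded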

Related

Send an entire web app as 1 HTTP response (html, js, css, images, ...)

Traditionally a browser will parse HTML and then send further requests to the server for all related data. This seems inefficient to me, since it might require a large number of requests, even though my server already knows that a browser that wants to use this web application will need all of its resources.
I know that JS and CSS could be inlined, but that complicates server-side code, and image data as base64 bloats the payload... I'm also aware that rendering can start before all assets are downloaded, which would potentially no longer work (depending on the implementation). I still feel that streaming an entire application in one go should be faster on slow connections than making tens of separate requests.
Ideally I would like the server to stream an entire directory into one HTTP response.
Does any model for this exist?
Does the reasoning make sense?
ps: If browser support for this is completely lacking, I'm wondering about a 2-step approach: download a small piece of JavaScript which downloads a compressed web app file, extracts it, and plugs the resources into the page. Is anyone already doing something like this?
Update
I found one: http://blog.another-d-mention.ro/programming/read-load-files-from-zip-in-javascript/
I started to research related issues in order to find the way to get the best results with what seems possible without changing web standards, and I wondered about caching. If I could send the last-modified date of every subresource of a page along with the initial HTML page, a browser could avoid sending If-Modified-Since requests once it has loaded every resource at least once. This would in effect be better than sending all resources with the initial request, since that would be beneficial only on the first load and detrimental on subsequent loads, where it is better for browsers to use their cache (as Barmar pointed out).
Now it turns out that even with a web extension you cannot get hold of the If-Modified-Since header, so you certainly can't tell the browser to use the cached version instead of contacting the server.
I then found this post from Facebook on how they tried to reduce traffic by hashing their static files and giving them a 1-year expiry date. This would mean that the URL guarantees the content of the file. They still saw plenty of unnecessary If-Modified-Since requests, and they managed to convince Firefox and Chrome to change the behaviour of their reload buttons to no longer reload static resources. For Firefox this requires a new cache-control: immutable header; for Chrome it doesn't.
I then remembered that I had seen something like that before, and it turns out there is a solution for this problem which is more convenient than hashing the contents of resources and serving them from a database for at least ten years. It is to just put a new version number in the filename. The even more convenient solution would be to just add a version query string, but it turns out that that doesn't always work.
Admittedly, changing your filenames all the time is a nuisance, because files referencing these files also need to change. However, the files on disk don't actually need to change. If you control the server, it might be as simple as writing a redirect rule to make sure that logo.vXXXX.png will be redirected to logo.png (where XXXX is the last-modified timestamp in seconds since the epoch)[1]. Now let your template system generate the timestamp automatically, like WordPress's wp_enqueue_script does. WordPress actually satisfies itself with the query string technique. Now you can set the expiration date far in the future and use the immutable cache header. If browsers respect the cache control, you can safely ignore ETags and If-Modified-Since headers, since they are now completely redundant.
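As an illustration only (the framework and every name below are my own choices for the sketch, not something from this post), here is roughly what that redirect rule plus the template helper could look like in a small Flask app:

import os
import re

from flask import Flask, send_from_directory

app = Flask(__name__)
STATIC_DIR = 'static'


def versioned_url(filename):
    """Build e.g. '/assets/logo.v1435017600.png' from the file's last-modified time."""
    mtime = int(os.path.getmtime(os.path.join(STATIC_DIR, filename)))
    stem, ext = os.path.splitext(filename)
    return '/assets/{}.v{}{}'.format(stem, mtime, ext)


@app.route('/assets/<path:versioned>')
def serve_versioned(versioned):
    # Strip the '.vXXXX' marker so logo.v1435017600.png maps back to logo.png.
    real_name = re.sub(r'\.v\d+(?=\.[^.]+$)', '', versioned)
    response = send_from_directory(STATIC_DIR, real_name)
    # Far-future, immutable caching: the URL changes whenever the file does.
    response.headers['Cache-Control'] = 'public, max-age=31536000, immutable'
    return response

Templates would then emit versioned_url('logo.png') instead of a hard-coded path.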
This solution guarantees that the browser will never ask for cache validation, and yet you will never see a stale resource, without having to decide on an expiry date in advance.
It doesn't answer the original question here about how to avoid making multiple requests to fetch a page's resources on a clean cache, but ever after (as long as the browser cache doesn't get cleared), you're good! I suppose that's good enough for me.
[1] You can even avoid the server overhead of checking the timestamp of every resource each time a page references it by using your application's version number instead. In debug mode, for development, one can use the timestamp to avoid having to bump the version on every modification of the file.

Website is different when I upload on FTP

I started developing only a few weeks ago and I bought a domain, but when I upload the files to the live site, the website looks different from what I have uploaded. This gets fixed when I clear my cache. The problem is that my visitors see the page one way, and after I update it they still see the previous version!
Is there any possible solution for this? I don't want my visitors to have to clear their cache every time I make a change to my website!
This is quite probably due to CSS caching: the browser is loading a cached version. You can control how long resources are cached in a few ways; ETags and .htaccess rules (on Apache) are the most common.
A very simple trick is to add a GET-style parameter at the end of your stylesheet link URL (where you load your main style in the head of the document), just like this:
main.css?v=2
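If you'd rather not bump that number by hand, a tiny helper can derive it from the file's modification time (an illustrative Python sketch; css_href and the paths are made up for the example):

import os

def css_href(path='static/main.css'):
    # The URL changes whenever the file changes, so browsers re-fetch only then.
    version = int(os.path.getmtime(path))
    return '/{}?v={}'.format(path, version)

# e.g. css_href() -> '/static/main.css?v=1435017600'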

Ajax Website, problems with History and Seo

I have a few problems I could use some input on.
I have a website where all the content is loaded with Ajax, and it works quite well. There are a few issues with that approach though, mostly UX issues.
Users cannot copy the URL of loaded content, since the address bar always shows only the default URL.
SEO will take a hit, since the content cannot be crawled; the sitemap is only about 2 pages, even though a normal user browsing the site will see a lot more.
Browser history, back and forward, does not work. Hitting the back button goes to the main page.
Now, I have searched and read a lot.
Google has a hack that seems to allow the site to be crawled IF you use # in your URL; it does not work with an empty URL, which leads me to...
Manipulating the browser history with pushState/popState.
Now, I have tried getting it to work, but I just can't get my head around which approach is best. Should I redo all my Ajax?
Right now I have 2 div boxes, and I switch between them with loaded content to get that nice, sweet transition between pages. My front page is basically just 2 empty divs, nothing else. It works, but I get the feeling it is a pretty bad way to do it. Thoughts?
If anyone knows some good guides, feel free to share them; as I said, I have read a lot, but I might have missed some golden ones out there.
Google does execute some JavaScript when indexing and ranking pages. However, text which is not immediately visible to users is demoted when establishing content relevancy.
Manipulating the browser history with pushState/popState.
It is very unlikely Google will trust your content if you need to use those tricks. And content which is not trusted is not ranked.
UPDATE: Manipulating browser history with pushState is ok.
Moreover, if your URLs change all the time, Google won't appreciate it, unless you manage to set canonical links.

Check on which pages an image is used?

Is there a certain way to check which pages on a website use a specific image?
Say I have some image which I don't use on a page anymore, so I'd like to delete it from my server. But I'm not entirely sure whether it's being used on other pages; is there a way to check if it's still being shown somewhere else?
You can hook your website up to Google Webmaster Tools and wait a little; after a while, 404 errors will appear there. This way you can track unused resources and dead ends.
This includes images.
There is a better way if you have direct access to the web server.
Visit every page on your website, or let Google crawl it.
You can later sort the files by the time they were last accessed; the ones that have not been accessed lately are not used (serving a file updates its access time, not its modification time).
You have to make sure the images actually get requested from the pages, so I would use a history-less, cache-less session.
How do I sort the files according to their timestamps in Unix?
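As an aside (not from the original thread): ls -ltu lists files sorted by last access time, newest first, and ls -ltur puts the least recently accessed first. Scripted, under an assumed document root named htdocs, the same check might look like this:

from pathlib import Path

# List files under the web root, least recently accessed first, so candidates
# for deletion float to the top.  Filesystems mounted with noatime do not
# update access times, in which case this approach won't work.
files = [p for p in Path('htdocs').rglob('*') if p.is_file()]
for p in sorted(files, key=lambda p: p.stat().st_atime):
    print(p.stat().st_atime, p)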

How to force a page to be removed from the search engine index?

Situation: Google has indexed a page on a forum. The thread has since been deleted. How (and whether) can I make Google and other search engines delete the cached copy? I doubt they would have anything against that, since the linked page no longer exists, and keeping the index updated and valid should be in their best interest.
Is this possible or do I have to wait months for an index update? Or will the page now stay there forever?
I am not the owner of the site in question, so I can't change robots.txt, for example. I would like to force the update as a "third party".
I also noticed that a new page I created on that site two days ago is already in the cache. Given that, can I estimate how long it will take for an invalid page on this domain to be dropped?
EDIT: So I did the test. It took Google just under 2 months to drop the page. Quite a long time...
It's damn near impossible to get it removed; however, replacing the page with entirely blank content will ensure that you nuke the page's ranking when it is re-spidered.
You can't really make Google delete anything, except perhaps in extreme circumstances. You can adjust your robots.txt file to promote a revisit interval that might update things sooner, but if it is a low-traffic site, you might not get a revisit very soon.
EDIT:
Since you are not the site owner, modifying the page's meta tags with "revisit-after" tags, as discussed here, is not an option for you.
You can't make search engines remove the link, but don't worry: the link will soon be removed, since it is no longer active. You need not wait months for this to happen.
If your site is registered with Google Webmaster Tools, you can request that pages be removed from the index. It works; I tried and used it in the past.
EDIT: Since you are not the owner, I am afraid that this solution would not work.
