Using cached web data from the Internet (Google Cache, Wayback Machine, etc.) - caching

I want to use Google Cache to view the pages of other websites without going to them directly.
If I fire a query like http://webcache.googleusercontent.com/search?q=cache:<URL without scheme>, I can get the data.
I found/assume the following things (Question 0: please correct me if any of them are wrong):
Google may or may not have cached a page, depending on the site's policy.
The live website will still be contacted anyway if any JavaScript has to run.
Google only stores the first 101 KB of the text.
Question 1: I know Google Cache only shows the most recently crawled page, but any idea how old this data could be?
Question 2: Is there any issue if I plan to go to Google Cache for all the hits I make to that website (assuming the website is cached and I am fine with a slightly old page)?
Question 3: The Wayback Machine provides the data, but there is a huge delay between crawling and showing it. Is there any directory where we can get recently archived data (like the Wayback Machine and Google Cache)?

I know Google cache only shows the recently crawled page but any idea of how old this data could be?
Use the cache: operator in the URL
Is there any issue if I plan to go to Google cache for all the hits I make to that website (assuming that the website is cached and I am fine with little old page)?
Owners may request removal of content from the cache
Is there any directory where we can get recently archived data?
Use the tbs=qdr: query parameter in the URL
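
Both URL forms above can be built mechanically. A small Python sketch, with purely illustrative function names, assuming Google does not block or 404 the automated request:

from urllib.parse import quote

# Sketch only: builds the two URL forms discussed above; the function names
# are illustrative, not an official API.
def google_cache_url(url):
    # Google Cache URL for a page, with the scheme stripped as noted above.
    bare = url.split("://", 1)[-1]
    return "http://webcache.googleusercontent.com/search?q=cache:" + quote(bare, safe="/:?=&")

def recent_results_url(site, period="d"):
    # Google search limited to a site and a recency window via tbs=qdr:
    # (h = past hour, d = past day, w = past week, m = past month).
    return "https://www.google.com/search?q=" + quote("site:" + site) + "&tbs=qdr:" + period

print(google_cache_url("https://example.com/some/page"))
print(recent_results_url("example.com", "w"))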

For Question 3, while it used to be the case that all Wayback Machine web captures were at least 6 months old, that was already becoming untrue in 2012 and is very untrue now in 2016. We have a ton of fresh content.

Related

How often does Google refresh its cached websites?

If you type in cache:www.92spoons.com, for example, into the Google search engine, it shows you a snapshot of the page from a time when Google snapshotted the site. I was just wondering, how often does Google refresh its cached data? It looks like, as of now, it was refreshed about 3 days ago. Also, do all sites' cached data update at the same time?
This is based on how often the website is changed. For example, Wikipedia may be updated several times a day, but 92spoons.com may be updated every few days. (source)
This can also be affected by popularity. You can visit this website, which should allow you to refresh the cache. (source)

How to modify an old Joomla website to remove a dangerous link flagged by Google

A client told me his old website running on Joomla was flagged by Google for having links to a malicious website. The website was blocked with the typical red security warning in Google Chrome. I redirected the website to a temporary page, but my client wants to bring back the old website while we work on something new.
However, my local machine and server are running Windows Server. I have the original files of the website and the database. Is there a quick way I could remove the links (the Google tool only mentions the website "mosaictriad.com") from the Joomla pages from my machine? I've tried doing a Ctrl+F for mosaictriad.com in the SQL file but didn't find anything.
Thanks for your opinion on what I should do next; the objective is simply to quickly clear the website of the security warning and send it back to the people managing his old server.
PS: I don't have direct access to his server, only the files associated with his Joomla website.
Additional details given by Google:
Some pages on this website redirect visitors to dangerous websites that install malware on visitors' computers, including: mosaictriad.com.
Dangerous websites have been sending visitors to this website, including: navis.be and umblr.com.
Yes, there is a way. You need to register in Google Webmaster Tools, register your site, add the sitelinks, and ask Google to rescan your website. They will remove the warning within 24 hours if the scan result is negative for malware.
Running a virus scanner on your local machine over the files may detect some malicious files.
Alternatively, restore the website to a temporary folder on the web and use a commercial scanning service to help identify and clean the website. I use and recommend myjoomla.com but there are other services such as sucuri.net.
I think your strategy is wrong - you should quickly clean up the website (try overwriting the core files with files from a fresh Joomla install) and then secure it. Once you do that, you should contact Google through Webmaster Tools with a reconsideration request (this typically takes a few days to process if it's the first offense). Once Google approves your reconsideration request, the red flag should be removed and the website should be accessible to everyone.
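
On the "Ctrl+F didn't find anything" point: injected links in compromised Joomla sites are often base64-encoded or otherwise obfuscated, so a plain text search can miss them. Still, a recursive search over the local copy of the files and the SQL dump is a quick first check. A minimal sketch, where SITE_ROOT is just a placeholder for wherever the files live locally:

import os

NEEDLE = b"mosaictriad.com"
SITE_ROOT = r"C:\sites\old-joomla-site"  # placeholder path to the local copy of the files

# Walk every file (templates, modules, the SQL dump) and report any that
# contain the flagged domain in plain text.
for dirpath, _dirs, files in os.walk(SITE_ROOT):
    for name in files:
        path = os.path.join(dirpath, name)
        try:
            with open(path, "rb") as fh:
                if NEEDLE in fh.read():
                    print("Found reference in:", path)
        except OSError:
            pass  # skip unreadable files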

Scrapy persistent cache

We need to be able to re-crawl historical data. Imagine today is the 23rd of June. We crawl a website today, but after a few days we realize we have to re-crawl it, "seeing" it exactly as it was on the 23rd. That means including all possible redirects, GET and POST requests, etc. ALL the pages the spider sees should be exactly as they were on the 23rd, no matter what.
Use case: if there is a change in the website and our spider is unable to crawl something, we want to be able to go back "in the past" and re-run the spider after we fix it.
Generally, this should be quite easy - subclass Scrapy's standard cache, force it to use dates for subfolders, and end up with something like:
cache/spider_name/2015-06-23/HERE ARE THE CACHED DIRS
but when I was experimenting with this, I realized the spider sometimes crawls the live website. That is, it doesn't take some pages from the cache (though the appropriate files exist on disk) but instead takes them from the live website. This happened with pages with CAPTCHAs in particular, but maybe with some others too.
How can we force Scrapy to always take the page from the cache, not hitting the live website at all? Ideally, it should even work with no internet connection.
Update: we've used the Dummy policy and HTTPCACHE_EXPIRATION_SECS = 0
Thank you!
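
One way to get the dated-subfolder layout described above is to subclass Scrapy's FilesystemCacheStorage and insert a date component into the cache path. This is only a sketch: _get_request_path is an internal Scrapy method, and its layout may differ between versions.

import os
from datetime import date

from scrapy.extensions.httpcache import FilesystemCacheStorage

class DatedFilesystemCacheStorage(FilesystemCacheStorage):
    # Store cached responses under cache/<spider>/<YYYY-MM-DD>/... so a crawl
    # can later be replayed as the site looked on that day.
    def _get_request_path(self, spider, request):
        # Default layout is cache/<spider>/<key prefix>/<key>; insert a date
        # folder right after the spider name.
        default = super()._get_request_path(spider, request)
        prefix_dir, key = os.path.split(default)
        spider_dir, prefix = os.path.split(prefix_dir)
        return os.path.join(spider_dir, date.today().isoformat(), prefix, key)

# settings.py (module path is hypothetical):
# HTTPCACHE_STORAGE = 'myproject.cache.DatedFilesystemCacheStorage'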
To do exactly what you want, you should add this to your settings:
HTTPCACHE_IGNORE_MISSING = True
If enabled, requests not found in the cache will be ignored instead of downloaded.
When you set:
HTTPCACHE_EXPIRATION_SECS = 0
it only assures you that cached requests never expire, but if a page isn't in your cache, it will still be downloaded.
You can check the documentation.
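
Putting it together, the cache-only behaviour described in this answer comes down to a handful of settings. A minimal sketch of the relevant part of settings.py (DummyPolicy is Scrapy's default cache policy, shown here explicitly):

# settings.py - sketch of the cache-only configuration discussed above
HTTPCACHE_ENABLED = True
HTTPCACHE_POLICY = 'scrapy.extensions.httpcache.DummyPolicy'  # the "Dummy policy" mentioned in the question
HTTPCACHE_EXPIRATION_SECS = 0     # cached entries never expire
HTTPCACHE_IGNORE_MISSING = True   # never fall back to the live site for pages missing from the cache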

How can I get the Google cache age of any URL or web page, Part II

I'm trying to get a Google cache of a LinkedIn page.
I've seen several threads (e.g.: How can I get the Google cache age of any URL or web page?) saying you can just prepend "http://webcache.googleusercontent.com/search?q=cache:" to the URL, and that seems to work for pages where Google is already displaying links to the cached version.
But the drop-down link has been deactivated for several pages I'm trying to access, and in those cases the above solution just gets me a 404.
Any ideas how to get around this?
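
One possible fallback when the Google cache 404s is the Wayback Machine's availability API, which returns the closest archived snapshot for a URL if one exists. A minimal sketch using the requests library (the LinkedIn URL is just a placeholder, and the page may simply not be archived):

import requests  # third-party package

def closest_wayback_snapshot(url):
    # Return the URL of the closest archived snapshot, or None if there isn't one.
    resp = requests.get("https://archive.org/wayback/available",
                        params={"url": url}, timeout=10)
    resp.raise_for_status()
    closest = resp.json().get("archived_snapshots", {}).get("closest")
    return closest["url"] if closest else None

print(closest_wayback_snapshot("https://www.linkedin.com/company/example"))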

Bulk import + export URL rewrites for Magento

I found a "bulk import and export url rewrites extension" for Magento when looking on the internet on how to bulk redirect urls from my current live site to the new urls based on the new site which is on a development server still.
I’ve asked my programmer to help me out and they’ve sent me 2 CSV files, one with all request and target urls from the current live site (these are often different as well, probably due to earlier redirects), and one similar to that for the new site. The current live site comes with 2500 urls, the future site with 3500 (probably because some old, inactive and unnecessary categories are still in the new site as well).
I was thinking to paste the current site’s urls into an Excel sheet and to then insert the future urls per url. A lot of work… Then I thought: can’t I limit my work to the approx. 300 urls that Google has indexed (which can be found through Google Webmaster Tools as you probably know)?
What would you recommend? Would there be advantages to using such an extension? Which advantages would that be? (if you keep in mind that my programmer would upload all of my redirects into a .htaccess file for me?)
Thanks in advance.
Kind regards, Bob.
Axel is giving HORRIBLE advice. 301 redirects tell Google to transfer the old PageRank and authority of a page to the new page in the redirect. Moreover, if other websites linked to those pages, you don't want a bunch of dead links; someone might just remove them.
Even worse, if you don't handle the 404s correctly, Google can and will penalize you.
ALWAYS set up 301 redirects when changing platforms; only someone who either doesn't understand or doesn't care about SEO would suggest otherwise.
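
Since the two exported CSV files already pair old and new URLs, the .htaccess rules can be generated rather than typed by hand. A minimal sketch, assuming (and this is an assumption, not the extension's actual format) that each row holds an old path followed by the corresponding new path:

import csv

# Turn an "old_path,new_path" CSV (assumed layout) into Apache "Redirect 301"
# lines that can be pasted into .htaccess.
with open("url_mapping.csv", newline="", encoding="utf-8") as src, \
     open("redirects.htaccess", "w", encoding="utf-8") as out:
    for row in csv.reader(src):
        old_path, new_path = row[0].strip(), row[1].strip()
        if old_path and old_path != new_path:  # skip blanks and unchanged URLs
            out.write("Redirect 301 {} {}\n".format(old_path, new_path))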
