How to prevent Googlebot from overwhelming a site?

I'm running a site with a lot of content, but little traffic, on a middle-of-the-road dedicated server.
Occasionally, Googlebot will stampede us, resulting in Apache maxing out its memory, and causing the server to crash.
How can I avoid this?

* Register at Google Webmaster Tools, verify your site, and throttle Googlebot down.
* Submit a sitemap.
* Read the Google guidelines (in particular the If-Modified-Since HTTP header).
* Use robots.txt to restrict the bot's access to some parts of the website.
* Make a script that changes robots.txt each $[period of time], to make sure the bot is never able to crawl too many pages at the same time while still being able to crawl all the content overall (see the sketch below).
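
For that last point, here is a rough sketch of such a rotation script, assuming a Python cron job; DOCROOT and the section paths are made-up placeholders you would adapt to your own layout:

    #!/usr/bin/env python3
    """Hypothetical sketch: rotate robots.txt so Googlebot only sees one
    section of the site per period, as suggested above."""
    import itertools
    from pathlib import Path

    DOCROOT = Path("/var/www/html")          # assumption: Apache document root
    SECTIONS = [                             # hypothetical content sections
        ["/archive/", "/gallery/"],
        ["/articles/", "/forums/"],
        ["/downloads/", "/tags/"],
    ]

    def write_robots(blocked_paths):
        """Write a robots.txt that disallows the given paths for all bots."""
        lines = ["User-agent: *"]
        lines += [f"Disallow: {p}" for p in blocked_paths]
        (DOCROOT / "robots.txt").write_text("\n".join(lines) + "\n")

    def rotate():
        """Block every section except the one whose turn it is, tracked by a
        counter file, so the whole site is crawlable over a full cycle."""
        counter_file = DOCROOT / ".robots_rotation"
        turn = int(counter_file.read_text()) if counter_file.exists() else 0
        allowed = turn % len(SECTIONS)
        blocked = list(itertools.chain.from_iterable(
            s for i, s in enumerate(SECTIONS) if i != allowed))
        write_robots(blocked)
        counter_file.write_text(str(turn + 1))

    if __name__ == "__main__":
        rotate()   # run from cron, e.g. once a week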

You can set how your site is crawled using Google's Webmaster Tools. Specifically, take a look at this page: Changing Google's crawl rate.
You can also restrict the pages that Googlebot crawls using a robots.txt file. There is a Crawl-delay directive, but it appears that it is not honored by Google.
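
For reference, a minimal robots.txt along those lines might look like this (the paths are placeholders; Crawl-delay is shown only to note that Googlebot ignores it, while some other crawlers do read it):

    User-agent: *
    Disallow: /search/
    Disallow: /admin/
    # Read by some crawlers, but not honored by Googlebot
    Crawl-delay: 10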

Register your site with Google Webmaster Tools, which lets you set how often, and at how many requests per second, Googlebot should try to index your site. Google Webmaster Tools can also help you create a robots.txt file to reduce the load on your site.

Note that you can set the crawl speed via Google Webmaster Tools (under Site Settings), but they only honour the setting for six months! So you have to log in every six months to set it again.
This setting has since been changed by Google: the setting is now only saved for 90 days (3 months, not 6).

You can configure the crawl speed in Google's Webmaster Tools.

To limit the crawl rate:
1. On the Search Console home page, click the site that you want.
2. Click the gear icon, then click Site Settings.
3. In the Crawl rate section, select the option you want and then limit the crawl rate as desired.
The new crawl rate will be valid for 90 days.

Prevent Google Web Preview bot

I noticed today in the web server logs that we sometimes get bursts (450 requests in 2 seconds) of requests from a user agent containing Google Web Preview. Looking at other Stack Overflow questions, it seems this is probably related to the preview functionality on the search page, or maybe to the saved/most-used links at the bottom of a user's Chrome tabs.
I've already blocked these particular URLs in robots.txt, so it's obviously ignoring that. It seems from this 2010 Instant Previews page that you can add a nosnippet tag and Google will then not try to fetch the preview. However, adding nosnippet wouldn't actually stop the request (as they'd still have to fetch the page to parse out the tag).
Short of blocking Google's IP addresses, which I don't want to do, is there a decent way to stop Google hammering the server periodically?
I think you have probably already done this, but when I run into this issue I make a buffer page, put the links I don't want rendered on it (e.g. a link to the admin panel), and use noindex on that page.
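
If part of the site is served through WSGI (mod_wsgi, gunicorn, etc.), another option is to refuse the preview fetches for a few expensive URLs by user agent instead of blocking Google's IPs. The sketch below is illustrative only; the paths and the 403 status are assumptions, not something recommended in this thread:

    """Sketch: WSGI middleware that refuses 'Google Web Preview' requests
    for a few heavy URLs, leaving normal Googlebot crawling untouched."""

    HEAVY_PATHS = ("/reports/", "/export/")   # hypothetical expensive endpoints

    class BlockWebPreview:
        def __init__(self, app):
            self.app = app

        def __call__(self, environ, start_response):
            ua = environ.get("HTTP_USER_AGENT", "")
            path = environ.get("PATH_INFO", "")
            if "Google Web Preview" in ua and path.startswith(HEAVY_PATHS):
                # Cheap refusal instead of rendering the full page
                start_response("403 Forbidden", [("Content-Type", "text/plain")])
                return [b"Preview fetches are not served for this URL.\n"]
            return self.app(environ, start_response)

You would wrap your existing app with it, e.g. application = BlockWebPreview(application).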

How to get Google's Last Crawl Date for a given URL of mine via the API

Given any URL of a site I "own", in Google Search Console I can see this information:
I am particularly interested in the "last crawl date".
How do I get the same information with the API (Search Console API or Webmaster Tools API)?
You cannot. Not via the Google Search Console API, the Webmaster Tools API, or any other Google API for that matter. How Google can design their APIs so poorly is beyond me. Providing access to 100% of the features that you can access through the UI of the same service is the #1 most basic requirement of an API, and they fail even at that.
There's this workaround (requesting https://webcache.googleusercontent.com/search?q=cache:<YOUR_URL>... and scraping the response contents), but you'll start getting "429 too many requests" pretty soon, so it's basically useless unless you only need to make, I don't know, maybe a request every few days.
In practice, there doesn't seem to be any other way than logging the crawler's visits yourself (recognizing it from the user-agent string, validating the IP maybe with a reverse lookup or just against a list).
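
A minimal sketch of that do-it-yourself approach in Python, assuming you can hook request handling or post-process your access log: check the user agent, then validate the IP with a reverse DNS lookup followed by a forward lookup. The host suffixes are the ones Googlebot is generally documented to use; treat the whole thing as illustrative:

    import socket

    def is_real_googlebot(ip, user_agent):
        """True only if the UA claims Googlebot AND the IP resolves back to
        a Google crawler hostname that resolves forward to the same IP."""
        if "Googlebot" not in user_agent:
            return False
        try:
            host, _, _ = socket.gethostbyaddr(ip)          # reverse lookup
            if not host.endswith((".googlebot.com", ".google.com")):
                return False
            return ip in socket.gethostbyname_ex(host)[2]  # forward-confirm
        except (socket.herror, socket.gaierror):
            return False

    # e.g. record last_crawl[url] = timestamp whenever
    # is_real_googlebot(ip, ua) is True for a request to that URL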

How do browsers (Firefox more specifically) know which cookies are tracking cookies?

I came across a situation where Firefox in incognito mode blocks some of the cookies on my site, more specifically Google Analytics cookies like _ga, _gid, etc. Searching the internet, I came across this article. So browsers like Firefox somehow identify these cookies as tracking cookies. But how? How does it know which cookies are tracking and which are not? I need to know this because the next time I set cookies on my server I don't want them to be blocked by browsers.
In the context of the article, it just means blocking referral links. For instance, it blocks sending the referral information from, say, Facebook to other sites.
Other sites use the referral information to decide whom to pay to get more traffic, and things like that.
There are like 100 different versions of the idea of "tracking", though.
As the article points out, your ISP always knows every DNS lookup you do and every call to an IP, so they always know ALL your traffic and are "tracking" it.
There's also "ad tracking", where all those Google calls send out what the crawler says is on the page in order to create targeted ads and all that.
I think, based on what you wrote, you're just talking about tracking links, which is just scrubbing the referral part of the link.
You'd have to be more specific if that's not what you're looking at.

Add blocked (via robots.txt) URLs in Sitemap?

In my sitemap there are some links which I don't want Google to index, so I blocked them using robots.txt.
Now Google Webmaster Tools is showing warnings. Will this adversely impact my website in Google?
It's better to remove those URLs from the XML sitemap; a sitemap should only list pages that you actually want crawled and indexed.
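
If the sitemap is generated, one way to keep it clean is to drop anything robots.txt disallows before publishing it. A small sketch using only the Python standard library; the file names and domain are placeholders:

    """Sketch: remove robots.txt-disallowed URLs from an XML sitemap."""
    import urllib.robotparser
    import xml.etree.ElementTree as ET

    NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

    rp = urllib.robotparser.RobotFileParser()
    rp.set_url("https://www.example.com/robots.txt")    # placeholder domain
    rp.read()

    tree = ET.parse("sitemap.xml")
    root = tree.getroot()
    for url_el in list(root.findall(f"{NS}url")):
        loc = url_el.findtext(f"{NS}loc")
        if loc and not rp.can_fetch("Googlebot", loc):
            root.remove(url_el)                         # blocked, so drop it

    tree.write("sitemap.clean.xml", xml_declaration=True, encoding="utf-8")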

Google Not Indexing AJAX URLs

I have submitted a sitemap for my AJAX web application to Google via their Webmaster Tools. The submitted URLs are of the form:
http://www.mysite.com/#!myscreen;id=object-id
http://www.mysite.com/#!myotherscreen;id=another-id
However, even though more than a week has passed since sitemap submission, Google has not indexed the URLs. Google states that the sitemap has been processed, states that 60 URLs have been detected, states that no errors occurred, but does not index any of the URLs.
I have already implemented the AJAX crawlability contract on the server side, where requests containing an _escaped_fragment_ are responded to with a snapshot.
Any help/info regarding why Google is not indexing the URLs would be greatly appreciated.
See GWT SE friendly application
Suggestions include following the guide at http://code.google.com/web/ajaxcrawling/.
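
For context, a minimal sketch of what that (now deprecated) contract looks like on the server side: when the crawler requests ?_escaped_fragment_=..., serve a pre-rendered snapshot of the corresponding #! state. This assumes the escaped fragment is the only query parameter (as in the URLs above), and render_snapshot is a placeholder for whatever snapshot generation you already have:

    from urllib.parse import unquote
    from wsgiref.simple_server import make_server

    PREFIX = "_escaped_fragment_="

    def render_snapshot(state):
        """Placeholder: return static HTML for the given #! state."""
        return f"<html><body><h1>Snapshot for {state}</h1></body></html>"

    def app(environ, start_response):
        query = environ.get("QUERY_STRING", "")
        if query.startswith(PREFIX):
            # /?_escaped_fragment_=myscreen;id=object-id  <->  /#!myscreen;id=object-id
            body = render_snapshot(unquote(query[len(PREFIX):]))
            start_response("200 OK", [("Content-Type", "text/html")])
            return [body.encode("utf-8")]
        start_response("200 OK", [("Content-Type", "text/html")])
        return [b"<html><body><script>/* AJAX app bootstraps here */</script></body></html>"]

    if __name__ == "__main__":
        make_server("", 8000, app).serve_forever()
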
Nowadays, you don't need to do anything specific for Google anymore; the AJAX crawling scheme has been deprecated by Google.
Just make sure that your website is easy to use for your users, and Google will be able to crawl it properly.
If you want to go the extra mile, however, you can check that article:
* https://moz.com/blog/optimizing-angularjs-single-page-applications-googlebot-crawlers
