How to make robots.txt and sitemap.xml only accessible to search engine bots

Several websites, like Quora, Stack Exchange, and Stack Overflow itself (https://stackoverflow.com/sitemap.xml), make these files accessible only to search engine crawlers (Google, Yahoo, Bing, etc.).
How can I do the same for my website's robots.txt and sitemap.xml?
What user agents do these crawlers use, and where can I find a list?
Google and Bing crawlers do not use static IPs; their IPs are dynamic and numerous. How do big sites like Stack Overflow manage whitelisting crawler IPs?
How does content on big sites get indexed by Google almost instantly? For example, this question will be indexed shortly after it is published, whereas my website usually takes 2-7 days to get indexed.
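One documented way to do this without maintaining IP whitelists is to verify crawlers with a reverse DNS lookup on the requesting IP, followed by a forward lookup to confirm the hostname resolves back to that IP; Google and Bing both describe this check for their bots. A minimal sketch, assuming a Node/TypeScript server and an illustrative list of crawler host suffixes:

```typescript
import { lookup, reverse } from "node:dns/promises";

// Host suffixes used by the major crawlers (illustrative, not exhaustive).
const BOT_HOST_SUFFIXES = [
  ".googlebot.com",   // Googlebot
  ".google.com",      // other Google crawlers
  ".search.msn.com",  // Bingbot
];

async function isVerifiedSearchBot(ip: string): Promise<boolean> {
  try {
    // 1. Reverse DNS: the IP must map to a hostname owned by the search engine.
    const hostnames = await reverse(ip);
    const host = hostnames.find(h =>
      BOT_HOST_SUFFIXES.some(suffix => h.endsWith(suffix))
    );
    if (!host) return false;

    // 2. Forward DNS: the hostname must resolve back to the same IP,
    //    otherwise the PTR record could be spoofed.
    const { address } = await lookup(host);
    return address === ip;
  } catch {
    return false; // no PTR record, lookup failure, etc.
  }
}

// Usage idea: in whatever serves robots.txt and sitemap.xml, call
// isVerifiedSearchBot(clientIp) and return 404 when it resolves to false.
```

Since each check costs a couple of DNS round trips, the result is usually cached per IP.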

Related

How to find what site is calling images from my server

I host a CDN for static resources. While monitoring, some 404 errors came up on old images.
I suspect some of our partners still reference the images on their sites, and I would like to find out which, so I can contact the appropriate one.
There is no Referer header in the requests, the IPs are from residential ISPs, and the User-Agent strings are from mostly up-to-date browsers, so I think the users are legitimate.
I tried to Google the URLs, or parts of them, but no luck so far.
I can't crawl the partners' websites as I don't have a list of the domains they use.
How can I find out which site is still calling these old images?

How to find out whether a customer's visit came from the Google results page

As we are moving from classic Google Analytics to Universal Google Analytics for a marketing requirement, I need to find out where the customer is coming from. If they are coming from a marketing campaign, we have the utm_source parameter in the URL, so I can identify the visit. But if the customer comes from the Google results, no extra parameters are added to the URL.
Because of this, I am unable to differentiate whether the customer came from the Google results or from a direct URL visit. My idea is to use the HTTP Referer header, but this has to be handled by the server on every page load, which adds unnecessary load.
Universal Google Analytics does not support the _utmz cookie; it is only supported in classic Google Analytics. So is there any better way to differentiate a visit from the Google results from a direct URL visit?
I think your idea to use the referrer is as solid as it gets. You do not need any server round trips, since you can access the referrer via JavaScript using document.referrer: if it is empty you have a direct type-in/bookmark; otherwise you can check it against a list of search engine hostnames. This might not match Google Analytics attribution 100%, but it should give you a usable approximation (it will obviously only work on the landing page; after that the referrer is your own site).
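A minimal client-side sketch of that approach, with an illustrative (and certainly incomplete) list of search engine hostnames:

```typescript
// Classify the landing-page visit from document.referrer on the client.
// The hostname fragments below are illustrative, not a complete list.
const SEARCH_ENGINE_HOSTS = ["google.", "bing.", "yahoo.", "duckduckgo."];

function classifyVisit(): "direct" | "search" | "referral" {
  if (!document.referrer) return "direct"; // type-in or bookmark
  const host = new URL(document.referrer).hostname;
  return SEARCH_ENGINE_HOSTS.some(s => host.includes(s)) ? "search" : "referral";
}

// Campaign traffic can still be detected from the utm_source parameter.
const isCampaign = new URLSearchParams(location.search).has("utm_source");
console.log("visit type:", isCampaign ? "campaign" : classifyVisit());
```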

Add blocked (via robots.txt) URLs in Sitemap?

In my sitemap there are some links which I don't want Google to index, so I blocked them using robots.txt.
Now Google Webmaster Tools is showing warnings. Will they adversely impact my website in Google?
It's better to remove these URLs from the XML sitemap.
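For illustration, the warning comes from a combination like the following (paths are hypothetical): the sitemap advertises a URL that robots.txt tells Googlebot not to fetch, so either drop the Disallow rule or, as suggested, remove the URL from the sitemap.

```
# robots.txt
User-agent: *
Disallow: /private/
```
```
<!-- sitemap.xml: this entry triggers the "blocked by robots.txt" warning -->
<url>
  <loc>https://www.example.com/private/page.html</loc>
</url>
```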

Google Not Indexing AJAX URLs

I have submitted a sitemap for my AJAX web application to Google via their Webmaster Tools. The submitted URLs are of the form:
http://www.mysite.com/#!myscreen;id=object-id
http://www.mysite.com/#!myotherscreen;id=another-id
However, even though more than a week has passed since sitemap submission, Google has not indexed the URLs. Google states that the sitemap has been processed, that 60 URLs have been detected, and that no errors occurred, but it does not index any of the URLs.
I have already implemented the AJAX crawlability contract on the server side, where requests containing an _escaped_fragment_ are responded to with a snapshot.
Any help/info regarding why Google is not indexing the URLs would be greatly appreciated.
See GWT SE friendly application
Suggestions include following the guide at http://code.google.com/web/ajaxcrawling/.
Nowadays, you don't need to do anything specific for Google anymore; the AJAX crawling scheme has been deprecated by Google.
Just make sure that your website is easy to use for your users, and Google will be able to properly crawl it.
If you want to go the extra mile, however, you can check this article:
* https://moz.com/blog/optimizing-angularjs-single-page-applications-googlebot-crawlers
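For reference only, since the scheme is deprecated: a minimal sketch of the _escaped_fragment_ handling the question describes, assuming a Node/TypeScript server and a hypothetical getSnapshot() pre-renderer.

```typescript
import { createServer } from "node:http";

// Hypothetical pre-renderer: in reality this would return a pre-rendered HTML
// snapshot of the requested application state.
async function getSnapshot(fragment: string): Promise<string> {
  return `<!doctype html><html><body>Snapshot for ${fragment}</body></html>`;
}

// Under the deprecated scheme, Google rewrote
// http://www.mysite.com/#!myscreen;id=object-id to
// http://www.mysite.com/?_escaped_fragment_=myscreen;id=object-id and the
// server was expected to answer that URL with the snapshot.
createServer(async (req, res) => {
  const url = new URL(req.url ?? "/", "http://www.mysite.com");
  const fragment = url.searchParams.get("_escaped_fragment_");
  res.setHeader("Content-Type", "text/html");
  if (fragment !== null) {
    res.end(await getSnapshot(fragment)); // crawler gets the snapshot
  } else {
    res.end("<!doctype html><html><body>AJAX application shell</body></html>");
  }
}).listen(8080);
```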

How to prevent Googlebot from overwhelming site?

I'm running a site with a lot of content, but little traffic, on a middle-of-the-road dedicated server.
Occasionally, Googlebot will stampede us, resulting in Apache maxing out its memory, and causing the server to crash.
How can I avoid this?
* register at Google Webmaster Tools, verify your site, and throttle Googlebot down
* submit a sitemap
* read the Google guidelines (e.g., support the If-Modified-Since HTTP header)
* use robots.txt to restrict the bot's access to some parts of the website
* make a script that changes robots.txt each $[period of time] to make sure the bot is never able to crawl too many pages at the same time, while making sure it can crawl all the content overall (a rough sketch follows this list)
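A rough sketch of that last rotation idea, assuming a hypothetical list of site sections, a hypothetical robots.txt path, and a daily schedule:

```typescript
import { writeFile } from "node:fs/promises";

// Hypothetical site sections and robots.txt location.
const SECTIONS = ["/archive/", "/gallery/", "/forum/", "/tags/"];
const ROBOTS_PATH = "/var/www/html/robots.txt";

// Allow one section per day and disallow the rest, so the bot can reach
// everything over time without crawling the whole site at once.
async function rotateRobots(): Promise<void> {
  const allowedIndex = Math.floor(Date.now() / 86_400_000) % SECTIONS.length;
  const disallowLines = SECTIONS
    .filter((_, i) => i !== allowedIndex)
    .map(section => `Disallow: ${section}`)
    .join("\n");
  await writeFile(ROBOTS_PATH, `User-agent: *\n${disallowLines}\n`);
}

rotateRobots().catch(console.error); // run from cron, e.g. once a day
```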
You can set how your site is crawled using Google's Webmaster Tools. Specifically, take a look at this page: Changing Google's crawl rate.
You can also restrict the pages that Googlebot crawls using a robots.txt file. There is a crawl-delay setting available, but it appears that it is not honored by Google.
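For illustration, a robots.txt combining both restrictions (paths are hypothetical); Google ignores Crawl-delay, though some other crawlers honor it:

```
User-agent: *
Disallow: /search/
Disallow: /print/
# Ignored by Googlebot, but honored by some other crawlers
Crawl-delay: 10
```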
Register your site with Google Webmaster Tools, which lets you set how often, and at how many requests per second, Googlebot should try to index your site. Google Webmaster Tools can also help you create a robots.txt file to reduce the load on your site.
Note that you can set the crawl speed via Google Webmaster Tools (under Site Settings), but they only honour the setting for six months! So you have to log in every six months to set it again.
This setting has since been changed by Google; it is now only saved for 90 days (3 months, not 6).
You can configure the crawl rate in Google's Webmaster Tools (Search Console).
To limit the crawl rate:
1. On the Search Console Home page, click the site that you want.
2. Click the gear icon, then click Site Settings.
3. In the Crawl rate section, select the option you want and then limit the crawl rate as desired.
The new crawl rate will be valid for 90 days.