Understanding Googlebot's AJAX crawling

I've been through Google's documentation and countless blog posts on this subject, and depending on the date and source, there seems to be some conflicting information. Please shine your wisdom upon this lowly peasant, and all will be good.
I'm building a website pro bono where a large part of the audience is in African countries with poor internet connectivity, and the client can't afford any decent infrastructure. So I've decided to serve everything as static HTML files, and if JavaScript is available, I load page content directly into the DOM when a user clicks a nav link, to avoid the overhead of loading a whole new page.
My client-side routes look like this:
//domain.tld/#!/page
My first question: does Googlebot translate that into
//domain.tld/_escaped_fragment_/page or //domain.tld/?_escaped_fragment_=/page?
I've made a simple server-side router in PHP that builds the requested pages for Googlebot, and my plan was to redirect //d.tld/_escaped_fragment_/page to //d.tld/router/page.
But when using Google's "Fetch as Googlebot" (for the first time, I might add), it doesn't seem to recognize any links on the page. It just returns "Success" and shows me the HTML of the main page. (Update: when pointing Fetch as Googlebot at //d.tld/#!/page, it just returns the content of the main page, without doing any _escaped_fragment_ magic.) Which leads me to my second question:
Do I need to follow a particular syntax when using hashbang links for Googlebot to crawl them?
My links look like this in the HTML:
<a href="/#!/page">Page Headline</a>
Update 1: When I ask Fetch as Googlebot to get //d.tld/#!/page, this shows up in the access log: "GET /_escaped_fragment_/page HTTP/1.1" 301 502 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)". But it doesn't seem to follow the 301 I set up, and it shows the main page instead. Should I use a 302 instead? This is the rule I'm using: RedirectMatch 301 /_escaped_fragment_/(.*) /router/$1
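For what it's worth, a minimal PHP sketch of handling the query-string form (//d.tld/?_escaped_fragment_=/page) directly in the front controller, where render_page() is a hypothetical helper standing in for the /router/page logic:

<?php
// Sketch only: when Googlebot requests the ?_escaped_fragment_= form of a
// hashbang URL (e.g. /?_escaped_fragment_=/page), serve the pre-built snapshot.
if (isset($_GET['_escaped_fragment_'])) {
    $page = trim($_GET['_escaped_fragment_'], '/');    // "/page" -> "page"
    $page = preg_replace('/[^a-z0-9_-]/i', '', $page); // basic sanitising
    echo render_page($page); // hypothetical helper doing what /router/page does
    exit;
}
// ...otherwise serve the normal static page...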
Update 2: I've changed my plans and will account for Googlebot as part of my non-JavaScript fallback tactic. So now all the links point to the router (/router/page) and are changed to /#!/page/ on load with JavaScript. I'm keeping the question open for a bit, in case someone has a brilliant solution that might help others.

Related

SEO 301 redirect limits

I'm thinking of using a structure like this for accessing some hypothetical page:
/foo/ID/some-friendly-string
The key part here is "ID", which identifies the page, so everything that's not the ID is only relevant to SEO. I also want everything that isn't "/foo/ID/some-friendly-string" to redirect to the original link. E.g.:
/foo/ID ---> /foo/ID/some-friendly-string
/foo/ID/some-friendly-string-blah ---> /foo/ID/some-friendly-string
But what if these links somehow get "polluted" somewhere on the internet and spiders start accessing them via "/foo/ID/some-friendly-string-blah-blah-pollution" URLs? I don't even know if this can happen, but if, say, some bad person decided to post thousands of such "different" links on some well-known forums, then Google would find thousands of "different" URLs 301-redirecting to the same page.
In such a case, would there be some sort of penalty, or is it all the same to Google as long as the endpoint is unique and there are no content duplicates?
I might be a little paranoid about this, but it's just my nature to investigate exploitable situations :)
Thanks for your thoughts.
Your approach of using 301 redirects is correct.
301 redirects are very useful if people access your site through several different URLs.
For instance, your page for a given ID can be accessed in multiple ways, say:
http://yoursite.com/foo/ID
http://yoursite.com/foo/ID/some-friendly-string (preferred)
http://yoursite.com/foo/ID/some-friendly-string-blah
http://yoursite.com/some-friendly-string-blah-blah-pollution
It is a good idea to pick one of those URLs (you have decided to be http://yoursite.com/foo/ID/some-friendly-string) as your preferred URL and use 301 redirects to send traffic from the other URLs to your preferred one.
I would also recommend adding a canonical link to the HEAD section of the page, e.g.
<link rel="canonical" href="http://yourwebsite.com/foo/ID/some-friendly-string"/>
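As a rough sketch of the redirect side of this in PHP ($id and $slug are assumed to come from your router, and lookup_slug() is a hypothetical helper that returns the canonical friendly string for an ID):

<?php
// Sketch only: 301-redirect any /foo/ID/whatever URL to the preferred
// /foo/ID/some-friendly-string form.
$canonical = lookup_slug($id); // hypothetical: returns e.g. "some-friendly-string"
if ($slug !== $canonical) {
    header('Location: /foo/' . $id . '/' . $canonical, true, 301);
    exit;
}
// ...otherwise render the page, including the rel="canonical" tag above...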
You can get more details on 301 redirects in:
Google Webmaster Tools - Configuration > Change of Address
Google Webmaster Tools Documentation - 301 redirects
I hope that will help you out with your decisions on redirects.
EDIT
I forgot to mention a very good example, namely Stack Overflow. The URL of this question is
http://stackoverflow.com/questions/14318239/seo-301-redirect-limits, but you can access it with http://stackoverflow.com/questions/14318239/blahblah and you will get redirected to the original URL.

Can we identify when search engines like Googlebot hit a particular URL?

My Problem:
My client's site displays a lot of products, which adds page weight and load time. So I decided to load more products via AJAX, and it works well. But it hurts SEO: no products or deals have been indexed. (I suggested the client submit products via Google Base, but the client doesn't like that idea; he wants Google to crawl the site directly, and he also wants a shorter page load time.)
Question:
Can we distinguish, on the server, between a Googlebot crawling request and a normal browser request (e.g. a Mozilla user agent)?
What I've tried:
I tried to identify the user agent from the requests, but that isn't working (or I might be missing something?). Does anyone have a solution to this problem: reducing page load time using AJAX while still getting Googlebot to crawl the website?
You should just search Stack Overflow for "Google AJAX SEO". There are a number of questions around this.
In short, Google has a specification to make AJAX sites crawlable: https://developers.google.com/webmasters/ajax-crawling/docs/getting-started?hl=sv-SE
You can also look into pushState as an SEO option.
One tactic used to solve this is to harness the pagination function of whatever framework or CMS you are using. You load one page of content and display pagination links in your view, then use JavaScript to hide the pagination links, fetch the content of the linked pagination page via Ajax, and append it to the current page. Take a look at how infinite scroll works for inspiration:
http://www.infinite-scroll.com/
Basically you need to at least load links to the pages that contain the other content so that search engines can crawl it, but you can hide these links from users who have JavaScript enabled.
But to better answer your question, it is possible to redirect robots using htaccess:
redirect all bots using htaccess apache
But as far as I understand it, it is better for SEO to have the content, or links to it, actually available on the page.
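That said, to illustrate the user-agent check the question asks about, here is a minimal PHP sketch (keep in mind the User-Agent header is trivially spoofed, and serving bots different content than users risks being treated as cloaking):

<?php
// Sketch only: naive check for Googlebot via the User-Agent header.
// Google also documents verifying crawlers with a reverse DNS lookup,
// since the header alone can't be trusted.
$ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
$isGooglebot = stripos($ua, 'Googlebot') !== false;

if ($isGooglebot) {
    // e.g. render the full product list server-side so it can be indexed
} else {
    // e.g. render the lightweight page that loads more products via AJAX
}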

How does Googlebot know a webserver is not cloaking when it requests a `?_escaped_fragment_=` URL?

With regard to Google's AJAX crawling spec, if the server returns one thing (namely, a JavaScript-heavy page) for a #! URL and something else (namely, an "HTML snapshot" of the page) to Googlebot when the #! is replaced with ?_escaped_fragment_=, that feels like cloaking to me. After all, how is Googlebot sure that the server is returning good-faith equivalents for both the #! and ?_escaped_fragment_= URLs? Yet this is what the AJAX crawling spec actually tells webmasters to do. Am I missing something? How is Googlebot sure that the server is returning the same content in both cases?
The crawler does not know. But it never knows, even for sites that return plain old HTML: it is extremely easy to write code that cloaks the site based on the HTTP headers used by crawlers or on known crawler IP addresses.
See this related question: How does Google Know you are Cloaking?
Most of it seems like conjecture, but it seems likely there are various checks in place, ranging from spoofing normal browser headers to an actual real person looking at the page.
Continuing the conjecture, it certainly wouldn't be beyond the capabilities of Google's programmers to write a crawler that retrieves what the user actually sees - after all, they have their own browser that does just that. It would be prohibitively CPU-expensive to do that all the time, but it probably makes sense for the occasional spot check.

prevent crawler from following POST form action

I have simple form on my site:
<form method="POST" action="Home/Import"> ... </form>
I get tons of error reports because of crawlers sending HEAD requests to Home/Import.
Notice the form is POST.
Questions
Why do crawlers try to crawl those actions?
Anything I can do to prevent it? (I already have Home in robots.txt)
What is a good way to deal with those invalid (but correct) HEAD requests?
Details:
I use Post-Redirect-Get pattern, if that matters.
Platform: ASP.NET MVC 3.0 (C#) on IIS 7.5
1) A crawler typically makes HEAD requests to get the MIME type of the response.
2) A HEAD request shouldn't invoke the action handler for a POST. If I saw that I was getting a lot of HEAD requests to a resource I don't want the crawler to crawl, I would give it a link I do want it to crawl. Most crawlers read robots.txt.
You can disable HEAD requests at the web server level. For Apache:
<LimitExcept GET POST>
deny from all
</LimitExcept>
You can also handle this at the robots.txt level by adding:
Disallow: /Home/Import
HEAD requests are used to get information about a page (like last-modified time, size, etc.) without fetching the whole page; it is an efficiency thing. Your script should not be giving errors because of HEAD requests - those errors are probably caused by a lack of validation in your code. Your code could check whether the request HTTP method is HEAD and do something different.
Four years later, but still answering question #1: Google does indeed try to crawl POST forms, both by just sending a GET to the URL and by sending actual POST requests. See their blog post on this. The why is in the nature of the web: bad web developers hide their content links behind POST search forms. To reach that content, search engines have to improvise.
About #2: The reliability of robots.txt varies.
And about #3: the ultra-clean way would probably be HTTP status 405 Method Not Allowed, if HEAD requests in particular are your problem. Not sure browsers will like this, though.
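For illustration only (the question is about ASP.NET MVC, but the idea is the same in any server-side stack), a minimal PHP sketch of that check might look like this:

<?php
// Sketch only: answer HEAD requests on a POST-only endpoint with 405 and
// advertise the allowed method (http_response_code() needs PHP 5.4+).
if ($_SERVER['REQUEST_METHOD'] === 'HEAD') {
    header('Allow: POST');
    http_response_code(405); // Method Not Allowed
    exit;
}
// ...handle the POST (Post-Redirect-Get) as usual...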

With Google's #! mess, what effect would a redirect on the converted URL have?

So Google takes:
http://www.mysite.com/mypage/#!pageState
and converts it to:
http://www.mysite.com/mypage/?_escaped_fragment_=pageState
...So... would it be fair game to redirect that with a 301 status to something like:
http://www.mysite.com/mypage/pagestate/
and then return an HTML snapshot?
My thought is that if you have an existing HTML structure and you just want to add AJAX as a progressive enhancement, this would be a fair way to do it, if Google just skipped over _escaped_fragment_ and indexed the redirected URL. Then your AJAX links are configured by JavaScript, and underneath them are the regular links that go to your regular site structure.
So when a user comes in on a static URL (i.e. http://www.mysite.com/mypage/pagestate/ ), the first link he clicks takes him to the AJAX interface if he has JavaScript, and then it's all AJAX.
On a side note, does anyone know if Yahoo/MSN are on board with this 'spec' (loosely used)? I can't seem to find anything that says for sure.
If you redirect the "?_escaped_fragment_" URL, it will likely result in the final URL being indexed (which might result in a suboptimal user experience, depending on how you have your site set up). There might be a reason to do it like that, but it's hard to say in general.
As far as I know, other search engines are not yet following the AJAX-crawling proposal.
You've pretty much got it. I recently did some tests and experimented with sites like Twitter (which uses #!) to see how they handle this. From what I can tell they handle it like you're describing.
If this is your primary URL
http://www.mysite.com/mypage/#!pageState
Google/Facebook will go to
http://www.mysite.com/mypage/?_escaped_fragment_=pageState
You can set up a server-side 301 redirect to a prettier URL, perhaps something like
http://www.mysite.com/mypage/pagestate/
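A minimal sketch of that server-side 301 in PHP, assuming it runs for requests to /mypage/:

<?php
// Sketch only: 301-redirect the ?_escaped_fragment_= form of the URL to the
// snapshot URL, e.g. /mypage/?_escaped_fragment_=pageState -> /mypage/pagestate/
if (isset($_GET['_escaped_fragment_'])) {
    $state = strtolower($_GET['_escaped_fragment_']);
    header('Location: /mypage/' . rawurlencode($state) . '/', true, 301);
    exit;
}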
On these HTML snapshot pages you can add a client-side redirect to send most people back to the dynamic version of the page. This ensures most people share the dynamic URL. For example, if you try to go to http://twitter.com/brettdewoody it'll redirect you to the dynamic (https://twitter.com/#!/brettdewoody) version of the page.
To answer your last question, both Google and Facebook use the _escaped_fragment_ method right now.
