Do hashbang URLs make a website difficult for Google to crawl?

Our agency built a dynamic website that uses a lot of AJAX interactions and #! (hashbang) URLs: http://www.gunlawsbystate.com/
It's a long book which you can scroll through, and the URL in the address bar changes dynamically. We have to support IE, so please don't advise using pushState; hashbang is the only option for us for now.
There's a navigation in the left sidebar which contains links to all chapters in the book.
An example of a link:
http://www.gunlawsbystate.com/#!/federal-properety/national-parks-and-wildlife-refuges/
We expect Google to crawl this:
http:// www.gunlawsbystate.com/?_escaped_fragment_=/federal-properety/national-parks-and-wildlife-refuges/
which is a complete HTML snapshot of the section. (It also contains links to the subsections, e.g. www.gunlawsbystate.com/#!/federal-properety/national-parks-and-wildlife-refuges/ii-change-in-the-law/ => www.gunlawsbystate.com/?_escaped_fragment_=/federal-properety/national-parks-and-wildlife-refuges/ii-change-in-the-law/ ).
It all looks complete according to Google's specification ( developers.google.com/webmasters/ajax-crawling/docs/specification ).
The site has been running for about 3 months now. The homepage gets re-indexed every 10-15 days.
The problem is that for some reason Google doesn't crawl hashbang URLs properly. It seems like Google just "doesn't like" those URLs.
www.google.ru/search?&q=site%3Agunlawsbystate.com :
Just 67 pages are indexed. Notice that most of the pages Google indexed have "normal" URLs (mostly WordPress blog posts, categories and tags) and just 5-10% of the result pages are hashbang URLs, although there are more than 400 book sections with unique content which Google should really like if it crawls them properly.
Could someone give me some advice on why Google does not crawl our book pages properly? Any help will be appreciated.
P.S. Sorry for the non-clickable links; Stack Overflow doesn't let me post more than 2.
UPD. The sitemap was submitted to Google a while ago. Google Webmaster Tools says that 518 URLs were submitted and just 62 URLs are indexed. Also, on the 'Index Status' page of Webmaster Tools I see 1196 pages Ever crawled and 1071 pages Not selected. This suggests that for some reason Google doesn't index the #! pages even though it visits them frequently.

You are missing a few things.
First, you need a meta tag to tell Google that the hashbang URLs can be accessed via a different URL.
<meta name="fragment" content="!">
Next, you need to serve a mapped version of each of the URLs to Googlebot.
When Google visits:
http://www.gunlawsbystate.com/#!/federal-regulation/airports-and-aircraft/ii-boarding-aircraft/
It will instead crawl:
http://www.gunlawsbystate.com/?_escaped_fragment_=/federal-regulation/airports-and-aircraft/ii-boarding-aircraft/
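Here is a minimal sketch of that mapping in Python, in case it helps to check what the crawler should be requesting. The function name `to_escaped_fragment` is my own, and the list of escaped characters (%00..20, %23, %25..26, %2B, %7F..FF) is taken from my reading of the AJAX crawling specification, so treat it as an assumption and verify it against the spec.

```python
from urllib.parse import urlsplit, urlunsplit


def _escape_fragment_value(value: str) -> str:
    """Percent-escape the bytes listed in the AJAX crawling spec (assumed list)."""
    out = []
    for byte in value.encode("utf-8"):
        # Control chars and space, '#', '%', '&', '+', and everything from 0x7F up.
        if byte <= 0x20 or byte in (0x23, 0x25, 0x26, 0x2B) or byte >= 0x7F:
            out.append(f"%{byte:02X}")
        else:
            out.append(chr(byte))
    return "".join(out)


def to_escaped_fragment(pretty_url: str) -> str:
    """Map a #! URL to the ?_escaped_fragment_= URL a crawler would request."""
    parts = urlsplit(pretty_url)
    if not parts.fragment.startswith("!"):
        return pretty_url  # not a hashbang URL, nothing to map
    param = "_escaped_fragment_=" + _escape_fragment_value(parts.fragment[1:])
    # Append to an existing query string with '&', otherwise start a new one.
    query = f"{parts.query}&{param}" if parts.query else param
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, ""))


print(to_escaped_fragment(
    "http://www.gunlawsbystate.com/#!/federal-regulation/airports-and-aircraft/ii-boarding-aircraft/"
))
```

Note that the leading slash after `#!` is part of the fragment, so it is carried over into the `_escaped_fragment_` value.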
For that to work you need something on the server, such as PHP or ASP, to serve up the correct page. ASP.NET routing would also work if you can get the plumbing right. There are also services which will create these "snapshot" versions for you, in which case your meta tag will point to their servers.
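The server-side part can be sketched like this (Python rather than PHP or ASP, just to keep it short). `SNAPSHOTS`, `escaped_fragment` and `choose_response` are hypothetical names, and the snapshot HTML is placeholder content; a real site would render or load the actual section.

```python
from urllib.parse import parse_qs, urlsplit

# Hypothetical pre-rendered snapshots, keyed by the decoded fragment value.
SNAPSHOTS = {
    "/federal-regulation/airports-and-aircraft/ii-boarding-aircraft/":
        "<h1>II. Boarding aircraft</h1>",
}

# Placeholder for the normal page that boots the AJAX application.
APP_SHELL = "<div id='app'></div>"


def escaped_fragment(request_url):
    """Return the decoded _escaped_fragment_ value, or None if it is absent."""
    query = parse_qs(urlsplit(request_url).query, keep_blank_values=True)
    values = query.get("_escaped_fragment_")
    return values[0] if values else None


def choose_response(request_url):
    """Pick (status, html): a snapshot for crawler requests, the app otherwise."""
    fragment = escaped_fragment(request_url)
    if fragment is None:
        return 200, APP_SHELL  # regular visitor, serve the AJAX app
    snapshot = SNAPSHOTS.get(fragment)
    if snapshot is None:
        return 404, "<h1>Not found</h1>"  # unknown section
    return 200, snapshot


print(choose_response(
    "http://www.gunlawsbystate.com/?_escaped_fragment_=/federal-regulation/airports-and-aircraft/ii-boarding-aircraft/"
))
```

The point of the design is that the snapshot is selected purely from the query parameter, since the `#!` part never reaches the server.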

The AJAX crawling scheme has since been deprecated by Google, so Google no longer accesses the content under hashbang URLs this way.
From what I have read, Google now avoids escaped-fragment URLs and suggests creating separate pages rather than using hashbangs.
So I think pushState is the other option that can be used in this case.

Related

MEAN-SEO not working as expected

I have a project in meanjs.
It has html5mode disabled, so my URLs look like this:
http://localhost:3000/#!/products
I am trying to implement AJAX snapshots in order to allow Google's crawlers to see content generated by JavaScript on the client side.
I installed a module called MEAN-SEO:
http://blog.meanjs.org/post/78474995741/mean-seo
Now when I access the following URL:
http://localhost:3000/?_escaped_fragment_=
I am redirected to:
http://localhost:3000/?_escaped_fragment_=/#!/
And when I click on "products" or when I access directly, I am redirected to:
http://localhost:3000/?_escaped_fragment_=/#!/products
After reading the Google specification detailed here https://developers.google.com/webmasters/ajax-crawling/docs/getting-started , what I need to get is something without hashbangs, like the following:
http://localhost:3000/?_escaped_fragment_=/products
What am I doing wrong?
Kind Regards.

Any specific reasons why you want html5mode off?
Here is something a lot of people have missed: Search engines (both Google and Bing) can now handle AJAX based content.
Their crawlers now understand pushState, so if you just turn html5mode on you don't need any special handling to get your SEO working. You can load your content via AJAX, set title tags and meta tags with JavaScript and so on, and the crawlers will understand your content the same as if you had rendered things server-side. There is no need to do HTML snapshotting or _escaped_fragment_ handling for SEO anymore.
This has been announced on their developer blogs but unfortunately most of the documentation hasn't been updated with this information, so it's gone under the radar for a lot of people.
One word of warning though: Facebook does not handle pushState, so if you want to support the Facebook crawler you still need to handle that separately.

AJAX Crawling with question mark instead of hashbang

Where I'm at: I've read Google's documentation regarding its AJAX crawling, and I've searched around a bit on this website and others, but I'm quite confused, as it seems that all the questions address the same issue: AJAX crawling with hashbangs.
I've developed an app which, among other purposes, lets the user search for locations worldwide, using an AJAX search box quite similar to Google's, but my app exclusively uses the question mark in AJAX URLs instead of the hashbang. Due to compatibility issues, changing it to the hashbang is not an option.
Not only am I confused by the fact that I could not find anyone else using the question mark instead of the hashbang, I'm also wondering if there is any documentation regarding my issue: how to let Googlebot crawl all my AJAX content when I'm using the question mark instead of a hashbang in my AJAX app.

The AJAX crawling scheme was created explicitly for applications and websites using the hashbang (#!) in their URL structure, because the fragment part of a URL only exists on the client side; the URL rewriting in the spec, i.e. from #! to ?_escaped_fragment_=, is meant to solve that.
Since most of the web already makes use of JavaScript in one way or another, we (Google) needed a better solution, so we started executing JavaScript in the pages we crawl and effectively rendering every page, just like a normal browser would. To quote our blog post, Understanding web pages better:
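To illustrate why the fragment is a client-side-only thing, here is a small sketch of what ends up on the HTTP request line. `request_target` is a hypothetical helper and the example.com URLs are made up for illustration.

```python
from urllib.parse import urlsplit


def request_target(url: str) -> str:
    """Return the path and query a client sends; the fragment is never sent."""
    parts = urlsplit(url)
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    return target  # parts.fragment is deliberately left out


print(request_target("http://example.com/#!/products"))
print(request_target("http://example.com/?_escaped_fragment_=/products"))
```

The first call yields just `/`, so the server cannot tell which section was requested, while the second keeps the section in the query string where the server can see it.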
In order to solve this problem, we decided to try to understand pages by executing JavaScript. It’s hard to do that at the scale of the current web, but we decided that it’s worth it. We have been gradually improving how we do this for some time. In the past few months, our indexing system has been rendering a substantial number of web pages more like an average user’s browser with JavaScript turned on.
You can also see what we "see" using Fetch as Google in Search Console (formerly Webmaster Tools); read more about the feature in our post titled Rendering pages with Fetch as Google.
Before you do anything else, please try to fetch a few pages from your site with Fetch as Google. You might not have to do anything at all, it might actually work out of the box. And the good news is that it's not only Google that's rendering pages!

SEO with AngularJS and an ASP.NET RESTful service

I have developed a website using AngularJS and Web API.
The problem is that the AJAX-rendered content is not crawlable by Google, and no one can find the website using Google search.
I have read many articles regarding this issue, including:
an article that collects links to explanations,
Google's AJAX crawling protocol, and also Stack Overflow questions, but I couldn't find a proper solution. The ones that mention ASP.NET solutions are talking about MVC, while I only need simple REST via Web API; the other articles don't cover ASP.NET at all.
Is there any simple explanation?

I'm the one who asked this same question long ago, so I will answer from my experience:
Firstly, if all your content is accessible via unique URIs (including the hashbang if you use it), modern search engines should index it just fine. In fact, Google can index JavaScript-generated content now. You can check this via Google Webmaster Tools and see how your site is indexed.
Secondly, there are libraries that help you serve pre-rendered content to search engines if you need to, but in my case I didn't bother much with them since Google is indexing JS nicely.

I've seen others ask this question, and maybe I'm missing something or this is outdated, but I don't see why AngularJS needs to be an issue with SEO.
Say you have a landing page and it has a bunch of links. Assuming you're using html5 mode in AngularJS (and I'm not sure that's 100% necessary) and something like ng-route then the links on the landing page can work both as "angular" (JavaScript) links and "old school" (full page load) links.
If you're a human user you can click a link and it will do angular magic and adjust the content without loading the full page. Ok, all fine.
But if you instead copy the link and paste it in a new tab or new browser, it will still work - assuming you've set up routes correctly.
I'm not an SEO expert by any stretch of the imagination, but as I understand it, having links that load pages and having those pages have real and useful content is the core of SEO, and done this way, AngularJS should work fine. The key thing to check is if you copy and paste the link (not just click it) that it works.

URL Re-writes and Google Indexing

I was asked to perform some URL re-writes for a new site with numerous dynamic pages and this has all worked fine.
However when I look at the URLs that Google has indexed, it has indexed the 'non-rewrite' url, so all the '?', '&' etc are being used.
What do you have to do to force Google to index your re-written URLs?
I just assumed it would do this automatically and never expected it to be an issue.
All help is gratefully appreciated.
Thanks.

Steps
1) Make sure that expired pages are no longer publicly accessible
2) Anything you do not want bots to index should be flagged with the appropriate robots meta tags ("noindex", plus "nofollow" if its links should not be followed)
3) Submit a new sitemap to your Google Webmaster Tools account
4) Make sure your website returns a 404 error when a page isn't found. It is always a good idea to make a splash page for the 404 error which links back to your home page. (This is accomplished in different ways across different server-side languages.)
Google will automatically remove indexed pages if they no longer exist, so be patient.
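As one example of step 4, here is a minimal sketch using Python's built-in `http.server`. The `PAGES` table, `respond` and `main` are hypothetical names, and the port is arbitrary; other server-side stacks will do this differently.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical table of pages that still exist on the site.
PAGES = {"/": "<h1>Home</h1>"}

# Splash page for missing URLs, linking back to the home page.
NOT_FOUND_BODY = '<h1>Page not found</h1><p><a href="/">Back to the home page</a></p>'


def respond(path):
    """Return (status, html) for a request path."""
    if path in PAGES:
        return 200, PAGES[path]
    return 404, NOT_FOUND_BODY  # a real 404 status, not a 200 "soft" error page


class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Ignore the query string when looking up the page.
        status, body = respond(self.path.split("?", 1)[0])
        payload = body.encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def main():
    """Start the demo server on localhost:8000 (call this to try it out)."""
    HTTPServer(("127.0.0.1", 8000), Handler).serve_forever()
```

The important detail is that the splash page is sent with the 404 status code, so crawlers can tell the page is gone even though visitors see a friendly page.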

Google AdSense bot's algorithm and behavior

I am interested in the algorithm and behavior of the Google AdSense bot on a website. I have not worked with AdSense and I do not have an account, so I need your help to understand:
1) Gbot downloads all pages from the website from time to time. Am I right?
2) Gbot does not understand dynamic content (loaded by AJAX). So I must generate static content and return it within the HTML page, and these pages must show identical content to all users and to Gbot?
3) Because of (1) and (2) I cannot use only the root path http://example.com with some "main" widget. I must generate unique pages, for example http://example.com/thread?id=101 ?
4) Gbot downloads the pages (1) to grab (index) keywords from them and then stores this information on its servers, for example as key/value pairs (where the key is the page path and the value is a tag cloud). Am I right?
5) When a user opens the website in a browser, the embedded AdSense HTML code loads some JavaScript. As I understand from googling, this JavaScript does not index the page, but calls Google's server (with some parameter key==page_path) and gets the appropriate ad links, then shows these ad links in its frame. Is this the right behavior? Or does the JavaScript do some local indexing of the page's content?
6) How do Gbot and the AdSense JavaScript work with cookies? As I understand it, AdSense can use cookies to show appropriate ad links. If that is right, please give me some use cases ;)
I know that the "true" algorithm is known only to engineers at Google. But some of you have experience with AdSense and the AdSense HTML/JavaScript. Please correct my picture of it ;)
Thank you very much for any advice!
P.S. This question is very important to me. It is not a question for fun, so please do not close it ;)

1) Yes if Googlebot can access the pages and if it knows about the pages through a link, XMLSitemaps, Google +1, etc.
2) Googlebot will now make AJAX / XHR requests to understand AJAX content (http://googlewebmastercentral.blogspot.com/2011/11/get-post-and-safely-surfacing-more-of.html).
Yes, you should show the same content to Googlebot as you would to users, otherwise this would be considered cloaking, which is against their guidelines.
3) This question isn't clear. But basically it's preferable to have the URL change, because Google will then know how to index the content separately. If you're using AJAX then you might want to consider permalinks like you suggested, or you can use HTML5 pushState.
4) Yes Google will index the words on the page. I'm not certain they store it as a key/value pair. I'm not even sure if they're still using Big Table (http://labs.google.com/papers/bigtable.html) ... but it's likely they use Big Table or a similar system to store the inverted index.
5) The Adsense code is embedded Javascript ... for new webpages that Google hasn't seen before, it tries to deliver the most relevant ads based on the information it's found on the web about the site or possibly through anchor text of links pointing to that page. However, to get a more accurate understanding of the content of the page, Google will send an adsense specific bot to crawl your page ... sometimes you'll see it come very fast, even as soon as you load the page for the first time. It uses a different user agent than the traditional Googlebot ... you can find all the User Agents from Google here (http://www.google.com/support/webmasters/bin/answer.py?answer=1061943)
6) Google's crawlers don't accept cookies and won't pass cookies back to your server. The massively distributed nature of Google's crawlers makes maintaining cookies or sessions extremely difficult.
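On point 4, the difference between "page path => tag cloud" and an inverted index can be shown with a toy example. This is only an illustration of the data structure, not how Google actually stores anything; the page texts are made up and the paths reuse the example URL pattern from the question.

```python
import re
from collections import defaultdict


def build_inverted_index(pages):
    """Map each word to the set of page paths that contain it."""
    index = defaultdict(set)
    for path, text in pages.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(path)
    return index


# Hypothetical crawled pages: path => extracted text.
pages = {
    "/thread?id=101": "Gun laws in national parks",
    "/thread?id=102": "National wildlife refuges",
}

index = build_inverted_index(pages)
print(sorted(index["national"]))  # ['/thread?id=101', '/thread?id=102']
```

So the key is the word and the value is the list of pages, which is the reverse of the key/value layout guessed in the question.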
