Indexing Hash Bang #! Content Using Google Search Appliance (GSA) - google-search-appliance

Has anyone had success indexing content that contains #! (Hashbang) in the URL? If so, how did you do it?
We have a third party help center of ours that we are hosting that requires the use of #! in the URL, however, we need the ability to index this content within our GSA. We are using version 7.0.14.G.238 of our GSA
Here's an example of one of our help articles with a hashbang in the URL:
/templates/selfservice/example/#!portal/201500000001006/article/201500000006039/Resume-and-Cover-Letter-Reviews
I understand #! requires JavaScript, not the most friendly SEO in the world and many popular sites (Facebook, Twitter, etc.) deprecated the use of it.

While some Javascript content is indexed, if you want to make sure there is absolutely content in the index for this site you have two options. Either make sure the site is non-JS friendly which is supported in a lot of JS frontend sites, or alternatively use a content feed to push the data into the GSA instead. Turn off JS in your browser and access the site and see if content links are created.
If you have access to the database, you could just send the content straight in, however read up here: http://www.google.com/support/enterprise/static/gsa/docs/admin/72/gsa_doc_set/feedsguide/ on feeds which can send data straight in, or possibly read up on connectors in general https://support.google.com/gsa/topic/2721859?hl=en&ref_topic=2707841

Related

MEAN-SEO not working as expected

I have a project in meanjs.
It has html5mode disabled so my URLS are like that:
http://localhost:3000/#!/products
I am trying to implement AJAX snapshoots in order to allow Google Crawlers to see content generated by javascript on client side.
I installed a module called MEAN-SEO:
http://blog.meanjs.org/post/78474995741/mean-seo
Now when I access the following URL:
http://localhost:3000/?_escaped_fragment_=
I am redirected to:
http://localhost:3000/?_escaped_fragment_=/#!/
And when I click on "products" or when I access directly, I am redirected to:
http://localhost:3000/?_escaped_fragment_=/#!/products
After reading the Google specification detailed here https://developers.google.com/webmasters/ajax-crawling/docs/getting-started , what I need is to get is something without hashbangs, like the following:
http://localhost:3000/?_escaped_fragment_=/products
What I am doing wrong?
Kind Regards.
Any specific reasons why you want html5mode off?
Here is something a lot of people have missed: Search engines (both Google and Bing) can now handle AJAX based content.
Their crawlers now understands pushstates, so if you just turn html5mode on you don't need any special handling to get your SEO working. You can load your content via AJAX, you can set title tags and meta tags with javascript and so on and so forth, and the crawlers will understand your content the same as if you had rendered things server-side. There is no need to do html-snapshotting or escaped_fragment handling for SEO anymore.
This has been announced on their developer blogs but unfortunately most of the documentation hasn't been updated with this information, so it's gone under the radar for a lot of people.
One word of warning though, Facebook does not handle pushstates, so if you want to support the Facebook crawler you still need to handle that separately.

Using a connector to crawl content using a sitemap.xml

We have a dspace repository of research publications that the gsa is indexing via a web crawl, ie start at the homepage and follow all the links.
I'm thinking that using a connector to submit urls for indexing from sitemap.xml file, might be more efficient. The gsa would then only need to index and recrawl the urls on the sitemap and could ignore the result of the site.
The suggestion from the gsa documentation is that this is not really a target for a connector, as the content can all be discovered by a web crawl.
What do you think?
Thanks,
Georgina.
This might be outdated (so I'm not sure if it still work), but there's an example of a python connector that will parse a sitemap.xml and send it as Content Feed or Metadata feed.
Here are 2 links to help you
https://github.com/google/gsa-admin-toolkit/blob/master/connectormanager/sitemap_connector.py
https://github.com/google/gsa-admin-toolkit/wiki/ConnectorManagerDocumentation
If anything, this will give you an idea of the logic to implement if you write your own Connector 3.x or Adaptor 4.x
You can generate sitemaps making from /bin directory "dspace generate-sitemaps". It will generate a sitemaps directory with link to all items from dspace.
An output example:
<html><head><title>URL List</title></head><body><ul><li>http://localhost:8080//handle/123456789/1</li>
<li>http://localhost:8080//handle/123456789/2</li>
<li>http://localhost:8080//handle/123456789/3</li>
<li>http://localhost:8080//handle/123456789/5</li>
</ul></body></html>
You could easily create a GSA "Feed" that lists the URLs that you want to crawl. However, since your "Follow" patterns must include the host name of your web site, the crawler is going to follow all the pages that are linked from the pages in your feed.
If you truly only want to index the items in your "Site Map" then you should probably look at writing an Adaptor (4.x). You would then be responsible for writing the logic to parse your sitemap.xml file to extract the URLs you want crawled.

Google crawl ajax / dynamically generated content - SEO

I've got a very unique situation that I don't believe any of the other topics here can relate.
I have a ecommerce module that is dynamically loaded / embedded into third party sites, no iframe straight JSON to web client into content. I have no access to these third part sites at all, other then my javascript file being loaded from their page and dynamically generating the content.
I'm aware of the #! method, but that's no good here, my JS does generate "urls" within the embedded platform, but they're fake and for the address bar only, and I don't believe google crawlers can reach this far.
So my question is, is there a meta that we can set to point outside the url to i.e. back to my server with static crawlable content. I.e. pointing the canonical to my server... but again I don't think that would work.
If you implement #! then you have to make sure the url your embedded in supports the fragment parameter versions, which you probably can't. It's server side stuff.
You probably can't influence the canonical tag of the page either. It again has to be done server side. Any meta tag you set via JavaScript will not be seen by a bot.
Disqus solved the problem by providing an API so the embedding websites could get there comments server side and render then in plain html. WordPress has a plugin to do this. Disqus are also one of the few systems that Google has worked out how to crawl their AJAX pages.
Some plugins request people to also include a plain link with the JavaScript. Be careful with this as you may break Google Guidelines if you do it wrong. But you may be able to integrate the plain link with your plugin so that it directs bots and users to a crawlable version of the content.
Look into Google's crawlable ajax standard (and why it's a bad idea) and canonical URLs.
Now you can actually do this. A complete guide and examples can be found here: https://github.com/kubrickology/Logical-escaped_fragment

Google Search optimisation for ajax calls

I have a page on my site which has a list of things which gets updated frequently. This list is created by calling the server via jsonp, getting json back and transforming it into html. Fast and slick.
Unfortunately, Google isn't able to index it. After reading up on how to get this done according to Google's AJAX crawling guide, I am bit confused and need some clarification and confirmation:
The ajax pages need to be implement the rules only, right?
I currently have a rest url like
[site]/base/junkets/browse.aspx?page=1&rows=18&sidx=ScoreAll&sord=desc&callback=jsonp1295964163067
this would need to become something like:
[site]/base/junkets/browse.aspx#page=1&rows=18&sidx=ScoreAll&sord=desc&callback=jsonp1295964163067
And when google calls it like this
[site]/base/junkets/browse.aspx#!page=1&rows=18&sidx=ScoreAll&sord=desc&callback=jsonp1295964163067
I would have to deliver the html snapshot.
Why replace the ? with # ?
Creating html snapshots seems very cumbersome. Would it suffice to just serve simple links? In my case I would be happy if google would only index the things pages.
It looks like you've misunderstood the AJAX crawling guide. The #! notation is to be used on links to the page your AJAX application lives within, not on the URL of the service your appliction makes calls to. For example, if I access your app by going to example.com/app/, then you'd make page crawlable by instead linking to example.com/app/#!page=1.
Now when Googlebot sees that URL in a link, instead of going to example.com/app/#!page=1 – which means issuing a request for example.com/app/ (recall that the hash is never sent to the server) – it will request example.com/app/?_escaped_fragment_=page=1. If _escaped_fragment_ is present in a request, you know to return the static HTML version of your content.
Why is all of this necessary? Googlebot does not execute script (nor does it know how to index your JSON objects), so it has no way of knowing what ends up in front of your users after your scripts run and content is loaded. So, your server has to do the heavy lifting of producing a HTML version of what your users ultimately see in the AJAXy version.
So what are your next steps?
First, either change the links pointing to your application to include #!page=1 (or whatever), or add <meta name="fragment" content="!"> to your app's HTML. (See item 3 of the AJAX crawling guide.)
When the user changes pages (if this is applicable), you should also update the hash to reflect the current page. You could simply set location.hash='#!page=n';, but I'd recommend using the excellent jQuery BBQ plugin to help you manage the page's hash. (This way, you can listen to changes to the hash if the user manually changes it in the address bar.) Caveat: the currently released version of BBQ (1.2.1) does not support AJAX crawlable URLs, but the most recent version in the Git master (1.3pre) does, so you'll need to grab it here. Then, just set the AJAX crawlable option:
$.param.fragment.ajaxCrawlable(true);
Second, you'll have to add some server-side logic to example.com/app/ to detect the presence of _escaped_fragment_ in the query string, and return a static HTML version of the page if it's there. This is where Google's guidance on creating HTML snapshots might be helpful. It sounds like you might want to pursue option 3. You could also modify your service to output HTML in addition to JSON.
I've more or less given up on this. There really seems no alternative to generating the html on the server and delivering it in the html bdoy if you want goolge to index your directory.
I even tried adding a section wraped a .net user control which implemented a simple html version of the directory. But google also managed to ignore ..
So in the end my directory has been de-ajaxified. :(

Google AdSense bot's algorithm and behavior

I am interesting in Google AdSense bot's algorithm and behavior with web site. I did not work with AdSense and i do not have account. So i need your help to understand:
1) Gbot from time to time downloads all pages from web site. Am i right?
2) Gbot do not understand dynamic content (loaded by ajax). So i must generate static content and return it within html page and this pages must show identical content for all users and for Gbot?
3) Because of (1) and (2) i cannot use only root path http://example.com with some "main" widget. I must generate unique pages for example http://example.com/thread?id=101 ?
4) Gbot downloads pages (1) for grabbing (indexing) keywords from them and then store (on it's servers) these information for example by key/value (where key is page path, value is tag cloud). Am i right?
5) When web site has been opened in browser by user. Integrated html AdSense's code loads some JavaScript. As i understand by "googling" this JavaScript do not index page, but makes call (with some parameter key==page_path) to Google's server and gets appropriate ad links. Then shows this ad links in it's frame. Is it right behavior? Maybe JavaScript makes some local indexing of page's content?
6) How Gbot and AdSense's JavaScript work with cookies? As i understand AdSense can use cookies for show appropriate ad links. If it is right, please give me some use cases;)
I know that "true" algorithm is known only by engineers from Google. But some of you had experience with AdSense and AdSense html/javascript. Please correct my vision of it;)
Thank you very much for any advice!!!
P.S. This question is very important for me. It is not some question for fun! So Please do not close it;)
1) Yes if Googlebot can access the pages and if it knows about the pages through a link, XMLSitemaps, Google +1, etc.
2) Googlebot will now make AJAX / XHR requests to understand AJAX content (http://googlewebmastercentral.blogspot.com/2011/11/get-post-and-safely-surfacing-more-of.html).
Yes, you should show the same content to Googlebot as you would Users, otherwise this would be consider cloaking, which is against their guidelines.
3) This question isn't clear. But basically it's preferable to have the URL change because Google will then know how to index the content separately. If you're using AJAX then you might want to consider permalinks like you suggested, or you can use HTML5 popstate.
4) Yes Google will index the words on the page. I'm not certain they store it as a key/value pair. I'm not even sure if they're still using Big Table (http://labs.google.com/papers/bigtable.html) ... but it's likely they use Big Table or a similar system to store the inverted index.
5) The Adsense code is embedded Javascript ... for new webpages that Google hasn't seen before, it tries to deliver the most relevant ads based on the information it's found on the web about the site or possibly through anchor text of links pointing to that page. However, to get a more accurate understanding of the content of the page, Google will send an adsense specific bot to crawl your page ... sometimes you'll see it come very fast, even as soon as you load the page for the first time. It uses a different user agent than the traditional Googlebot ... you can find all the User Agents from Google here (http://www.google.com/support/webmasters/bin/answer.py?answer=1061943)
6) Google's crawlers don't accept cookies and won't pass back cookies to your server. It has to do with the massively distributed nature of Google crawlers that makes maintain cookies or sessions extremely difficult.

Resources