URL Re-writes and Google Indexing - url-rewriting

I was asked to perform some URL re-writes for a new site with numerous dynamic pages and this has all worked fine.
However when I look at the URLs that Google has indexed, it has indexed the 'non-rewrite' url, so all the '?', '&' etc are being used.
What do you have to do to force Google to index your re-written URLs?
I just assumed it would do this automatically and never expected it to be an issue.
All help is gratefully appreciated.
Thanks.

Steps
1) Make sure that expired pages are no longer publicly accessible
2) Anything you do not wish Bots to crawl should be flagged with appropriate "nofollow" meta tags
3) Submit a new sitemap to your Google Web developer account
4) Make sure your Website throws a 404 error when a page isn't found. It is always a good idea to make a splash page for a 404 error which links back to your home page. (this is accomplished different ways across different server-side languages)
Google will automatically remove indexed pages if they no longer exist.. So be patient.

Related

How can I get the Google cache age of any URL or web page, Part II

I'm trying to get a Google cache of a LinkedIn page.
I've seen several threads (e.g.: How can I get the Google cache age of any URL or web page?) saying you can just append "http://webcache.googleusercontent.com/search?q=cache:" to the URL, and that seems to work for pages where Google is already displaying links to the cached version.
But the drop-down link has been deactivated for several pages I'm trying to access. And in those cases, the above solution just gets me 404'd.
Any ideas how to get around this?

Hashbang URLs make the website difficult to crawl by Google?

Our agency built a dynamic website that uses a lot of AJAX interactions and #! (hashbang) URLs: http://www.gunlawsbystate.com/
It's a long book which you can scroll through and the URL in the address bar changes dynamically. We have to support IE so please don't advise using pushState — hansbang is the only option for us for now.
There's a navigation in the left sidebar which contains links to all chapters in the book.
An example of a link:
http://www.gunlawsbystate.com/#!/federal-properety/national-parks-and-wildlife-refuges/
We are expecting google to crawl this:
http:// www.gunlawsbystate.com/?_escaped_fragment_=/federal-properety/national-parks-and-wildlife-refuges/
which is complete html snapshot of the section. (+ there are links to the subsections like www.gunlawsbystate.com/#!/federal-properety/national-parks-and-wildlife-refuges/ii-change-in-the-law/ => www.gunlawsbystate.com/?_escaped_fragment_=/federal-properety/national-parks-and-wildlife-refuges/ii-change-in-the-law/ ).
It all looks to be complete according to the Google's specifications ( developers.google.com/webmasters/ajax-crawling/docs/specification ).
The site is run for about 3 months for now. The homepage is getting re-indexed every 10-15 days.
The problem is that for some reason Google doesn't crawl hashbang URLs properly. It seems like Google just "doesn't like" those URLs.
www.google.ru/search?&q=site%3Agunlawsbystate.com :
Just 67 pages are indexed. Notice that most of the pages Google indexed have "normal" URLs (mostly wordpress blog posts, categories and tags) and just 5-10% of result pages are hashbang URLs, although there are more than 400 book sections with unique content which Google should really like if it crawles it properly.
Could someone give me an advise on this, why Google does not crawl our book pages properly? Any help will be appreciated.
P.S. I'm sorry for not-clickable links — stackoverflow doesn't let me post more than 2.
UPD. The sitemap has been submitted to google a while ago. Google Webmaster Tools says that 518 URLs submitted and just 62 URLs indexed. Also, on the 'Index Status' page of the Webmaster Tools I see that there are 1196 pages Ever crawled; 1071 pages are Not selected. It clearly points to the fact that for some reason google doesn't index the #! pages that it visits frequently.
You are missing a few things.
First you need a meta tag to tell google that the Hash URLS can be accessed via a different url.
<meta name="fragment" content="!">
Next you need to serve a mapped version of each of the urls to googlebot.
When google visits:
http://www.gunlawsbystate.com/#!/federal-regulation/airports-and-aircraft/ii-boarding-aircraft/
It will instead crawl:
http://www.gunlawsbystate.com/?_escaped_fragment_=federal-regulation/airports-and-aircraft/i-introduction/
For that to work you either need to use something like PHP or ASP to serve up the correct page. Asp.net routing would also work if you can get the piping correct. There are services which will actually create these "snapshot" versions for you and then your meta tag will point to their servers.
Since it is deprecated by Google and now Google is not able to access the content under hashbang URLs.
Based on research Google avoids Escaped fragment URLs now and suggesting to create separate pages rather than using HashBang.
So I think PushState is the other option which can be used in this case.

Is my AJAX content already crawlable?

I have build a site based on Ajax navigation.
I have build it that way, that whenever someone without javascript visits my site, the nav links, which usually load content via Ajax, are acting like normal links and the user can browse through the pages as usual.
Since, Google bot doesn't run javascript, it should theoretically be able to go through all links and corresponding sites as usual, right? Since they are valid links with the href tag pointed to the corresponding site.
Now I was wondering if thats sufficient or if I need to implant this method from Google too to make sure Google sees all my content?
Thanks for your insights and excuse my poor English!
If you can navigate your site by showing source (ctrl-u in chrome), google can also crawl your site. Yes, its that simple

How to clear Linkedin Share cache?

I have a new description in the page but when I share the page it is still using the old description that no longer exists, I am after something similar like the Facebook Lint.
Any ideas?
You can append dummy query string value to your url and make it look like a new url and LinkedIn fetches it again. I've tried it and it works.
For example:
https://www.codeproof.com/?refid=LinkedIn
where refid=LinkedIn is just a dummy value.
If your url already contains query string and then just append "&refid=LinkedIn" at the end of the url.
Unfortunately, appending a query string to the URL no longer works.
From the following StackOverflow post:
LinkedIn's content cache presently stores website information for approximately 7 days before the crawler will revisit the site.
There looks like there is no instant way to clear the cache, but to wait seven days, remove the media and re-add it.
Appending a query string to the URL no longer works, so you'll have to wait 7 days.
But if you really need to share your URL with the medias you want, you'll have to go with a custom API call.
From the LinkedIn developer docs :
The first time that LinkedIn's crawlers visit a webpage when asked to
share content via a URL, the data it finds (Open Graph values or our
own analysis) will be cached for a period of approximately 7 days.
This means that if you subsequently change the article's description,
upload a new image, fix a typo in the title, etc., you will not see
the change represented during any subsequent attempts to share the
page until the cache has expired and the crawler is forced to revisit
the page to retrieve fresh content.
If you make API calls that directly provide the content to be shared
rather than by a URL that requires analysis, LinkedIn will always use
the values you provide.
Step 1: Visit https://www.linkedin.com/post-inspector/
Step 2: Enter your URL and click on Inspect, You will see the updated preview image
Step 3: Now try sharing your URL on LinkedIn
I've just found a way to force linkedin to fetch a fresh version of the page. Just create a redirect to your destination page and share the redirect page.
For example:
If your page that you want to share is: http://stackoverflow.com
Create a redirect for a page: https://stackoverflow.com/share-li to go to http://stackoverflow.com
And then share the https://stackoverflow.com/share-li on linked in. This way linkedin will think it's a new page and it'll get a fresh page version.
It's easy to do if you're using wordpress, just install a redirection plugin like this one for example: https://wordpress.org/plugins/redirection/
For wordpress these steps work for me:
In the home page I've removed the featured image and add it as a simple image on the header of the page
I've created a redirect page in my blog like (mydomain.com/social) that redirects all requests to my blog (mydomain.com)
Share the blog again in the social networks and everything will be ok
It's done =D
Unfortunately, there is none as of now. We are investigating what it would take to expose a similar feature. Please stay tuned. We'll announce it on the developer site at http://developer.linkedin.com.

Google AdSense bot's algorithm and behavior

I am interesting in Google AdSense bot's algorithm and behavior with web site. I did not work with AdSense and i do not have account. So i need your help to understand:
1) Gbot from time to time downloads all pages from web site. Am i right?
2) Gbot do not understand dynamic content (loaded by ajax). So i must generate static content and return it within html page and this pages must show identical content for all users and for Gbot?
3) Because of (1) and (2) i cannot use only root path http://example.com with some "main" widget. I must generate unique pages for example http://example.com/thread?id=101 ?
4) Gbot downloads pages (1) for grabbing (indexing) keywords from them and then store (on it's servers) these information for example by key/value (where key is page path, value is tag cloud). Am i right?
5) When web site has been opened in browser by user. Integrated html AdSense's code loads some JavaScript. As i understand by "googling" this JavaScript do not index page, but makes call (with some parameter key==page_path) to Google's server and gets appropriate ad links. Then shows this ad links in it's frame. Is it right behavior? Maybe JavaScript makes some local indexing of page's content?
6) How Gbot and AdSense's JavaScript work with cookies? As i understand AdSense can use cookies for show appropriate ad links. If it is right, please give me some use cases;)
I know that "true" algorithm is known only by engineers from Google. But some of you had experience with AdSense and AdSense html/javascript. Please correct my vision of it;)
Thank you very much for any advice!!!
P.S. This question is very important for me. It is not some question for fun! So Please do not close it;)
1) Yes if Googlebot can access the pages and if it knows about the pages through a link, XMLSitemaps, Google +1, etc.
2) Googlebot will now make AJAX / XHR requests to understand AJAX content (http://googlewebmastercentral.blogspot.com/2011/11/get-post-and-safely-surfacing-more-of.html).
Yes, you should show the same content to Googlebot as you would Users, otherwise this would be consider cloaking, which is against their guidelines.
3) This question isn't clear. But basically it's preferable to have the URL change because Google will then know how to index the content separately. If you're using AJAX then you might want to consider permalinks like you suggested, or you can use HTML5 popstate.
4) Yes Google will index the words on the page. I'm not certain they store it as a key/value pair. I'm not even sure if they're still using Big Table (http://labs.google.com/papers/bigtable.html) ... but it's likely they use Big Table or a similar system to store the inverted index.
5) The Adsense code is embedded Javascript ... for new webpages that Google hasn't seen before, it tries to deliver the most relevant ads based on the information it's found on the web about the site or possibly through anchor text of links pointing to that page. However, to get a more accurate understanding of the content of the page, Google will send an adsense specific bot to crawl your page ... sometimes you'll see it come very fast, even as soon as you load the page for the first time. It uses a different user agent than the traditional Googlebot ... you can find all the User Agents from Google here (http://www.google.com/support/webmasters/bin/answer.py?answer=1061943)
6) Google's crawlers don't accept cookies and won't pass back cookies to your server. It has to do with the massively distributed nature of Google crawlers that makes maintain cookies or sessions extremely difficult.

Resources