I'm looking to implement the Google crawlable AJAX states as described here:
http://code.google.com/web/ajaxcrawling/docs/getting-started.html
Essentially this requires specifying your AJAX states with a #!state value at the end of the URL.
This should then be passed to the application server (PHP in my case) as part of the query string, e.g.
http://www.example.com/#!open would become http://www.example.com/?_escaped_fragment_=open
Unfortunately I'm having trouble figuring out how to implement this via mod_rewrite on Apache 2. Can anyone offer some help?
Cheers
James
RFC 2396, section 4, says:
When a URI reference is used to perform a retrieval action on the
identified resource, the optional fragment identifier, separated from
the URI by a crosshatch ("#") character, consists of additional
reference information to be interpreted by the user agent after the
retrieval action has been successfully completed. As such, it is not
part of a URI, but is often used in conjunction with a URI.
That is, the fragment won't be visible to the web server, so you'll have to look for some other method; mod_rewrite alone is a no-go.
Depending on what language you are familiar with, you can use HTMLUnit if you're a Java dev, or you could write a proxy and use it to fetch the rendered content with, e.g., Jaxer or a Firefox instance. I used Jaxer, and it is pretty easy to implement crawlable AJAX pages once you get used to the Jaxer API (which isn't complicated at all).
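For what it's worth, once a crawler makes the ?_escaped_fragment_= request described in the question, it arrives as an ordinary query-string parameter, so the application can branch on it directly without any rewriting. A rough sketch of that idea (Flask is used here purely for illustration; the equivalent check in PHP would be $_GET['_escaped_fragment_'], and render_snapshot is a made-up helper):
from flask import Flask, request

app = Flask(__name__)

@app.route('/')
def index():
    # Googlebot requests /?_escaped_fragment_=open for the pretty URL /#!open
    state = request.args.get('_escaped_fragment_')
    if state is not None:
        # Serve a static, crawlable rendering of that AJAX state.
        return render_snapshot(state)
    # Normal visitors get the regular JavaScript-driven page.
    return app.send_static_file('index.html')

def render_snapshot(state):
    # Made-up helper: look up or generate pre-rendered HTML for the state.
    return '<html><body>Snapshot for state: %s</body></html>' % state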
I've got a unique situation that I don't believe any of the other topics here cover.
I have an ecommerce module that is dynamically loaded / embedded into third-party sites: no iframe, just straight JSON to the web client rendered into content. I have no access to these third-party sites at all, other than my JavaScript file being loaded from their page and dynamically generating the content.
I'm aware of the #! method, but that's no good here. My JS does generate "urls" within the embedded platform, but they're fake and for the address bar only, and I don't believe Google's crawlers can reach this far.
So my question is: is there a meta tag we can set to point outside the URL, e.g. back to my server with static crawlable content? I.e. pointing the canonical to my server... but again, I don't think that would work.
If you implement #!, then you have to make sure the URL you're embedded in supports the _escaped_fragment_ versions, which you probably can't guarantee. That's server-side stuff.
You probably can't influence the canonical tag of the page either. That again has to be done server-side. Any meta tag you set via JavaScript will not be seen by a bot.
Disqus solved the problem by providing an API so the embedding websites could get their comments server-side and render them in plain HTML. WordPress has a plugin to do this. Disqus is also one of the few systems whose AJAX pages Google has worked out how to crawl.
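To make that approach concrete: the embedding site fetches the content from the vendor's API on its own server and prints plain HTML, so bots see real markup rather than an empty JavaScript container. A rough sketch, assuming a hypothetical JSON endpoint (the URL and field names are invented):
import json
import urllib.request

# Hypothetical vendor endpoint returning the embedded content as JSON.
API_URL = 'https://api.vendor.example.com/products?site=12345'

def render_products_html():
    # Fetched server-side, so the resulting HTML is crawlable.
    with urllib.request.urlopen(API_URL) as response:
        products = json.loads(response.read().decode('utf-8'))
    items = ''.join('<li>%s</li>' % p['name'] for p in products)
    return '<ul>%s</ul>' % items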
Some plugins ask people to also include a plain link alongside the JavaScript. Be careful with this, as you may break Google's guidelines if you do it wrong. But you may be able to integrate the plain link with your plugin so that it directs bots and users to a crawlable version of the content.
Look into Google's crawlable ajax standard (and why it's a bad idea) and canonical URLs.
Now you can actually do this. A complete guide and examples can be found here: https://github.com/kubrickology/Logical-escaped_fragment
I've got a web app which heavily uses AngularJS / AJAX and I'd like it to be crawlable by Google and other search engines. My understanding is that I need to do something special to make it work, as described here: https://developers.google.com/webmasters/ajax-crawling
Unfortunately, that looks quite nasty and I'd rather not introduce the hashbang URLs. What I'd like to do is serve a static page to Googlebot (based on the User-Agent), either directly or by sending it a 302 redirect. That way, the web app can stay the same, and the whole Googlebot workaround is nicely isolated until it is no longer necessary.
My worry is that Google may mistakenly assume that I'm trying to trick Googlebot, while my goal is to help it. What do you guys think about this approach, and what would you recommend?
I recently came upon this excellent post from yearofmoo, explaining in detail how to make your Angular app SEO friendly. In essence, when bots see a URI with a hashbang they know it's an AJAXed page and try to reach the same URI by replacing '#!' with '?_escaped_fragment_='. This alternative URI tells bots that they should expect to find a definitive static version of the page they were accessing.
Of course, to achieve this you'd have to introduce hashbangs into your URIs. I don't see why you're trying to avoid them. Isn't Gmail using hash fragments?
Yeah, unfortunately, if you want to be indexed you have to adhere to the scheme :( If you're running a Ruby app, there's a gem that implements the crawling scheme for any Rack app...
gem install google_ajax_crawler
A write-up of how to use it is at http://thecodeabode.blogspot.com.au/2013/03/backbonejs-and-seo-google-ajax-crawling.html, and the source code is at https://github.com/benkitzelman/google-ajax-crawler
Have a look at these links; they will give you a good direction:
Set up your own Prerender service using Prerender.io open source code:
https://prerender.io/
Use a different existing service such as BromBone, Seo.js or SEO4AJAX:
http://www.brombone.com/
http://getseojs.com/
http://www.seo4ajax.com/
Create your own service for rendering and serving snapshots to search engines (a rough sketch of the idea follows the links below). Read this article; it will give you the big picture:
http://scotch.io/tutorials/javascript/angularjs-seo-with-prerender-io
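As a rough illustration of that last option (the directory layout and bot list below are assumptions, not a drop-in implementation): when a request carries _escaped_fragment_ or comes from a known crawler, return a pre-rendered snapshot; everyone else gets the normal JavaScript application shell.
from flask import Flask, request, send_from_directory

app = Flask(__name__)
BOTS = ('googlebot', 'bingbot', 'facebookexternalhit')  # illustrative list

@app.route('/', defaults={'path': ''})
@app.route('/<path:path>')
def serve(path):
    agent = request.headers.get('User-Agent', '').lower()
    wants_snapshot = '_escaped_fragment_' in request.args or any(b in agent for b in BOTS)
    if wants_snapshot:
        # Serve pre-rendered HTML (e.g. produced offline by a headless browser).
        return send_from_directory('snapshots', (path or 'index') + '.html')
    # Regular users get the JavaScript application shell.
    return send_from_directory('app', 'index.html')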
As of May 2014, Googlebot executes JavaScript. Check Webmaster Tools to see how Google sees your site.
http://googlewebmastercentral.blogspot.no/2014/05/understanding-web-pages-better.html
Edit: Note that this does not mean other crawlers (Bing, Facebook, etc.) will execute JavaScript. You may still need to take additional steps to ensure that these crawlers can see your site.
As part of a product we are deploying, clients need to access a remote API on our servers to access content and data. Nonetheless, for various reasons and for some clients, a solution where the entire page lives on our servers is not desirable (reasons include control over design, but mostly SEO, and clients wanting this content to be available under "their domain")... A script that accesses the API server-side is not desirable due to other issues.
My idea follows (and I will point out its flaws so others can please suggest alternatives):
1) Make a simple script to be hosted on the client's server which catches all traffic under a certain URI path (a catch-all script, similar to any framework router), so /MyApp/*. This script would always return a single response: a "loader" of JavaScript and styling...
2) Through the JavaScript returned by the script above, extract the URL, take the part of the URI after the desired path /MyApp/[*], and send it in an external call using JSONP or regular CORS AJAX; the response is then styled appropriately and displayed.
With this, URLs such as /MyApp/abc and /MyApp/def would have the same HTML/JS in the browser source, but the JS would load different data from the AJAX call, therefore showing different content...
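A rough sketch of what steps 1 and 2 could look like together (Flask stands in for the catch-all script on the client's server, and the API URL is invented):
from flask import Flask

app = Flask(__name__)

LOADER = """<!doctype html>
<link rel="stylesheet" href="https://api.example.com/static/myapp.css">
<div id="myapp"></div>
<script>
  // Step 2: take everything after /MyApp/ and fetch it via CORS.
  var slug = location.pathname.split('/MyApp/')[1] || '';
  fetch('https://api.example.com/content/' + slug)
    .then(function (r) { return r.json(); })
    .then(function (data) {
      document.getElementById('myapp').innerHTML = data.html;
    });
</script>"""

@app.route('/MyApp/', defaults={'slug': ''})
@app.route('/MyApp/<path:slug>')
def myapp_loader(slug):
    # Step 1: every URL under /MyApp/ returns the same loader page.
    return LOADER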
This would seem like a good solution; the only drawback is that, from my understanding, Google and other search engines wouldn't ever be able to access the content behind abc and def. They would only access the "loader JavaScript and styling" (obviously enough, they aren't going to be running the JS)...
So this is better than #! in that it wouldn't mess with URLs, but it would still depend on JS, so it is not search engine friendly...
Due to server restrictions, I'd much rather have a simple "catch-all" page and have the API called from the client side than impose minimum requirements such as curl, etc... Plus I'd have access to the end-user IP address more easily this way (although I could make a more elaborate proxy, which would make installing it much harder on clients' servers)...
Is there a way of achieving this without connecting to the API from the server side?
The easiest method of doing this, IMO, is to have an AJAX controller (assuming an MVC design) handle all remote requests. Have each action in your controller return JSON, and then you have easy access to the data with a server-side call.
Otherwise you are using the #! solution (which you don't like, and rightly so..), or using JSONP (a hassle as well).
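A minimal sketch of such a controller action (the route and field names are invented; the equivalent in any MVC framework works the same way):
from flask import Flask, jsonify

app = Flask(__name__)

# Hypothetical "AJAX controller": every action returns JSON.
@app.route('/ajax/products/<int:product_id>')
def product(product_id):
    # In a real app this would come from a database or the remote API.
    return jsonify({'id': product_id, 'name': 'Example product', 'price': 9.99})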
Seaside by default points example.com/myapp to whatever application is registered at myapp. I'd like to have a core application that can also handle these links, or some other way of handling these links.
So far, I have a home application that is also registered as the default application, so http://mydomain.com will resolve to it, but if I generate a link, like http://mydomain.com/more-info, Seaside tries to resolve an application registered at more-info. What if I want my home application to handle the link? Or handle it in some other way?
I'm hosting Seaside with Apache, so I could use Apache's URL rewriting engine to rewrite http://mydomain.com/more-info to http://mydomain.com/home/more-info, which would be handled by my home app.
Is there a better way to do this? Also, if a link exists to an explanation of the Seaside request/response lifecycle, that'd be sweet.
What you are trying to do is not common practice in Seaside applications. If you want to generate a link from one page to another page in your application, you generally use a callback attached to an anchor:
html anchor callback: [ self call: moreInfoComponent ]; with: 'more info'
In such cases, you do not care what the url looks like and Seaside generates the url for you. Such generated urls never have a nested structure but use parameters.
More information on the Seaside request/response cycle can be found in the online book (chapters "Fundamentals" and "Sequencing Components").
However, if you indeed want to have such a nested url (to make urls bookmarkable), there are different approaches, depending on what you actually want to achieve. You can either take a look at the approach for handling expired sessions (in the book) or at the Seaside-REST package.
Btw, the mapping of urls to applications happens through (instances of) WADispatcher. If you inspect the result of the following expression, you can see the dispatcher tree of Seaside. It's entirely customizable by adding new applications, dispatchers, etc...
WAAdmin defaultServerManager adaptors first requestHandler
Hope this helps you on your way...
When I look at Amazon.com and I see their URL for pages, it does not have .htm, .html or .php at the end of the URL.
It is like:
http://www.amazon.com/books-used-books-textbooks/b/ref=topnav_storetab_b?ie=UTF8&node=283155
Why and how? What kind of extension is that?
Your browser doesn't care about the extension of the file, only the content type that the server reports. (Well, unless you use IE because at Microsoft they think they know more about what you're serving up than you do). If your server reports that the content being served up is Content-Type: text/html, then your browser is supposed to treat it like it's HTML no matter what the file name is.
Typically, it's implemented using a URL rewriting scheme of some description. The basic notion is that the web should be moving to addressing resources with proper URIs, not classic old URLs which leak implementation detail, and which are vulnerable to future changes as a result.
A thorough discussion of the topic can be found in Tim Berners-Lee's article Cool URIs Don't Change, which argues in favour of reducing the irrelevant cruft in URIs as a means of helping to avoid the problems that occur when implementations do change, and when resources do move to a different URL. The article itself contains good general advice on planning out a URI scheme, and is well worth a read.
More specifically than most of these answers:
Web browsers don't use the file extension to determine what kind of file is being served (unless you're using Internet Explorer). Instead, they use the Content-type HTTP header, which is sent down the wire before the content of the image, HTML page, download, or whatever. For example:
Content-type: text/html
denotes that the page you are viewing should be interpreted as HTML, and
Content-type: image/png
denotes that the page is a PNG image.
Web servers often use the file extension, if the file is served directly from disk, to determine what Content-type to assign, but web applications can also generate pages with any Content-type they like in response to a request. No matter the filename's structure or extension, as long as the actual content of the page matches the declared Content-type, the data renders as intended.
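A tiny sketch of that last point, using only Python's standard library (the port and markup are arbitrary): the handler serves HTML at an extensionless URL simply by declaring the right Content-type.
from http.server import BaseHTTPRequestHandler, HTTPServer

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        # No file on disk and no extension: the header alone tells the
        # browser how to interpret the body.
        self.send_response(200)
        self.send_header('Content-type', 'text/html')
        self.end_headers()
        self.wfile.write(b'<h1>Served from an extensionless URL</h1>')

HTTPServer(('', 8000), Handler).serve_forever()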
Websites that use Apache are probably using mod_rewrite, which enables them to rewrite URLs (and make them more user- and SEO-friendly).
You can read more here http://httpd.apache.org/docs/2.0/mod/mod_rewrite.html
and here http://www.sitepoint.com/article/apache-mod_rewrite-examples/
EDIT: There are rewriting modules for IIS as well.
Traditionally the file extension represents the file that is being served.
For example
http://someserver/somepath/image.jpg
Later, the same approach was used to let a script process parameters:
http://somerverser/somepath/script.php?param=1234&other=7890
In this case the file was a PHP script that processed the "request" and returned a dynamically created page.
Nowadays, applications are much more complex than that (for example Amazon, which you mentioned).
There is no longer a single script that handles the request (rather a much more complex app with several files/methods/functions/objects, etc.), and the URL is more like the entry point for a web application (it may have a script behind it, but that's another thing). So web apps like Amazon, and yes Stack Overflow, don't show a file in the URL; anything coming in is processed by the app on the server side.
Take this very question's URL, /questions/322747/websites-urls-without-file-extension: here "questions" represents the web app and 322747 the parameter.
I hope this little explanation helps you understand the other answers better.
Well, how about having an index.html file in the directory and then typing the path into the browser? I see that my Firefox and IE7 both put the trailing slash in automatically; I don't have to type it. This is more suited to people like me who do not think every single URL on earth should invoke PHP, Perl, CGI and 10,000 other applications just in order to send a few kilobytes of data.
A lot of people are using a more "RESTful" type of architecture... or at least REST-looking URLs.
This site (Stack Overflow) doesn't show a file extension... it's using ASP.NET MVC.
Depending on the settings of your server you can use (or not) any extension you want. You could even set extensions to be ".JamesRocks" but it won't be very helpful :)
Anyway, just in case you're new to web programming: all that gibberish at the end consists of arguments to a GET request, not the page's extension.
A number of posts have mentioned this, and I'll weigh in. It absolutely is a URL rewriting system, and a number of platforms have ways to implement this.
I've worked for a few large ecommerce sites, and URL rewriting is now a very important part of their web presence and offers a number of advantages.
I would recommend taking the technology you want to work with and researching samples of the URL rewriting mechanism for that platform. For .NET, for example, google 'asp.net url rewriting', or use an add-on framework like ASP.NET MVC, which provides this functionality out of the box.
In Django (a web application framework for Python), you design the URLs yourself, independent of any file name, or even any path on the server for that matter.
You just say something like "I want /news/<number>/ urls to be handled by this function"
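For example, a urls.py along these lines (the view name is invented; older Django versions express the same thing with url() and a regex):
# urls.py
from django.urls import path
from . import views

urlpatterns = [
    # "I want /news/<number>/ urls to be handled by this function"
    path('news/<int:article_id>/', views.news_detail),
]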