I was reading this best-practices guide from Yahoo:
http://developer.yahoo.com/performance/rules.html#no404
One rule that caught my eye is this:
HTTP requests are expensive so making an HTTP request and getting a
useless response (i.e. 404 Not Found) is totally unnecessary and will
slow down the user experience without any benefit.
Some sites have helpful 404s "Did you mean X?", which is great for the
user experience but also wastes server resources (like database, etc).
Particularly bad is when the link to an external JavaScript is wrong
and the result is a 404. First, this download will block parallel
downloads. Next the browser may try to parse the 404 response body as
if it were JavaScript code, trying to find something usable in it.
Is this really best practice? If a user enters a wrong URL on my website, what is this guide recommending? That I just leave the user with the default server error page? If the user enters a wrong URL, how can the server return anything other than a 404? I simply don't understand this advice and hope someone can explain the reasoning. Thank you.
The best practice is to serve different 404 documents in different situations.
Resources that are not intended to be requested directly by a user but rather embedded in other documents (e.g., images, stylesheets, script files) should not result in a detailed error document; the 404 response should simply have no body. With no body present, the browser won't try to parse it.
But for resources that are intended to be requested directly, a detailed 404 document with helpful information and guidance is better.
You could also use the presence of a Referer header field, rather than the author's intention, as an indicator of whether a detailed 404 document is useful.
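A minimal sketch of that distinction, assuming a Python/Flask app (the original posts name no framework, and the asset-extension list here is made up): embedded resources get an empty-body 404, while pages a user would request directly get a helpful document.
    from flask import Flask, request

    app = Flask(__name__)

    ASSET_EXTENSIONS = ('.js', '.css', '.png', '.jpg', '.gif', '.ico')

    @app.errorhandler(404)
    def not_found(error):
        # Embedded resources: a 404 with no body gives the browser nothing to parse.
        if request.path.endswith(ASSET_EXTENSIONS):
            return '', 404
        # Directly requested pages: a helpful document for the user.
        return ('<html><body><h1>Page not found</h1>'
                '<p>Did you mean the <a href="/">home page</a>?</p></body></html>', 404)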
With regard to Google's AJAX crawling spec, if the server returns one thing (namely, a JavaScript-heavy file) for a #! URL and something else (namely, an "HTML snapshot" of the page) to Googlebot when the #! is replaced with ?_escaped_fragment_=, that feels like cloaking to me. After all, how is Googlebot sure that the server is returning good-faith equivalents for both the #! and ?_escaped_fragment_= URLs? Yet this is what the AJAX crawling spec actually tells webmasters to do. Am I missing something? How is Googlebot sure that the server is returning the same content in both cases?
The crawler does not know. But then it never knows, even for sites that return plain ol' HTML - it is extremely easy to write code that cloaks the site based on the HTTP headers used by crawlers or their known IP addresses.
See this related question: How does Google Know you are Cloaking?
Most of it seems like conjecture, but it seems likely there are various checks in place, ranging from spoofing normal browser headers to an actual person looking at the page.
Continuing the conjecture, it certainly wouldn't be beyond the capabilities of programmers at Google to write a form of crawler that actually retrieved what the user sees - after all, they have their own browser that does just that. It would be prohibitively CPU-expensive to do that all the time, but probably makes sense for the occasional spot-check.
We've been trying to implement a site with an HTTP home page, but HTTPS everywhere else. In order to do this we hit the rather big snag that our login form, in a lightbox, would need to fetch an HTTPS form using Ajax, embed it in an HTTP page and then (possibly) handle the form errors, still within the lightbox.
In the end we gave up and just made the whole site HTTPS, but I'm sure I've seen a login-in-a-lightbox implementation on other sites, though I can't find any examples now that I want to.
Can anyone give any examples of sites that have achieved this functionality, or explain how/why this functionality can/can't be achieved?
The Same Origin Policy prevents this. The page is either 100% HTTPS or it's not. The Same Origin Policy sees this as a "different" site if the protocol is not the same.
A "lightbox" is not different than any other HTML - it's just laid out differently. The same rules apply.
One option would be to use an iframe. It's messy, but if having the whole shebang in HTTPS isn't an option, it can get the job done.
You might be able to put the login form into an iframe so that users can log in over HTTPS while it seems they are on an HTTP page, but I'm not sure why you would want to do this.
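A sketch of what both answers describe, assuming Flask and a made-up https://secure.example.com/login endpoint: the outer page stays on HTTP while the login form itself is loaded and submitted from an HTTPS origin inside the iframe.
    from flask import Flask

    app = Flask(__name__)

    @app.route('/')
    def home():
        # HTTP page whose lightbox contains an iframe loaded from an HTTPS origin,
        # so credentials are submitted over TLS even though the outer page is plain HTTP.
        return ('<html><body>'
                '<div class="lightbox">'
                '<iframe src="https://secure.example.com/login" width="400" height="300"></iframe>'
                '</div>'
                '</body></html>')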
So Google takes:
http://www.mysite.com/mypage/#!pageState
and converts it to:
http://www.mysite.com/mypage/?_escaped_fragment_=pageState
...So... Would it be fair game to redirect that with a 301 status to something like:
http://www.mysite.com/mypage/pagestate/
and then return an HTML snapshot?
My thought is that if you have an existing HTML structure, and you just want to add Ajax as a progressive enhancement, this would be a fair way to do it, provided Google just skipped over _escaped_fragment_ and indexed the redirected URL. Then your Ajax links are configured by JavaScript, and underneath them are the regular links that go to your regular site structure.
So then, when a user comes in on a static URL (i.e. http://www.mysite.com/mypage/pagestate/ ), the first link he clicks takes him to the Ajax interface if he has JavaScript, and from then on it's all Ajax.
On a side note, does anyone know if Yahoo/MSN are on board with this 'spec' (loosely used)? I can't seem to find anything that says for sure.
If you redirect the "?_escaped_fragment_" URL, it will likely result in the final URL being indexed (which might result in a suboptimal user experience, depending on how you have your site set up). There might be a reason to do it like that, but it's hard to say in general.
As far as I know, other search engines are not yet following the AJAX-crawling proposal.
You've pretty much got it. I recently did some tests and experimented with sites like Twitter (which uses #!) to see how they handle this. From what I can tell they handle it like you're describing.
If this is your primary URL
http://www.mysite.com/mypage/#!pageState
Google/Facebook will go to
http://www.mysite.com/mypage/?_escaped_fragment_=pageState
You can set up a server-side 301 redirect to a prettier URL, perhaps something like
http://www.mysite.com/mypage/pagestate/
On these HTML snapshot pages you can add a client-side redirect to send most people back to the dynamic version of the page. This ensures most people share the dynamic URL. For example, if you try to go to http://twitter.com/brettdewoody it'll redirect you to the dynamic (https://twitter.com/#!/brettdewoody) version of the page.
To answer your last question, both Google and Facebook use the _escaped_fragment_ method right now.
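A rough server-side sketch of this flow, assuming Flask and the made-up URLs from the question (/mypage/ and the pagestate path): the ?_escaped_fragment_= request gets a 301 to the static URL, which returns the HTML snapshot.
    from flask import Flask, redirect, request

    app = Flask(__name__)

    @app.route('/mypage/')
    def mypage():
        fragment = request.args.get('_escaped_fragment_')
        if fragment is not None:
            # The crawler asked for /mypage/?_escaped_fragment_=pageState;
            # send it to the static equivalent with a permanent redirect.
            return redirect('/mypage/' + fragment.lower() + '/', code=301)
        # Normal visitors get the JavaScript-driven page.
        return '<html><body><script src="/static/app.js"></script></body></html>'

    @app.route('/mypage/<state>/')
    def snapshot(state):
        # Static HTML snapshot that crawlers (and non-JS visitors) can index.
        return '<html><body><h1>' + state + '</h1><p>Server-rendered content for this state.</p></body></html>'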
Here's the situation:
I have a web application which responds to a request for a list of resources, let's say:
/items
This is initially requested directly by the web browser by navigating to that path. The browser uses its standard "Accept" header, which includes "text/html", and my application notices this and returns the HTML content for the item list.
Within the returned HTML is some JavaScript (jQuery), which then makes an Ajax request to retrieve the actual data:
/items
Only this time, the "Accept" header is explicitly set to "application/json". Again, my application notices this, JSON is correctly returned for the request, the data is inserted into the page, and everything is happy.
Here comes the problem: The user navigates to another page, and later presses the BACK button. They are then prompted to save a file. This turns out to be the JSON data of the item list.
So far I've confirmed this to happen in both Google Chrome and Firefox 3.5.
There are two possible types of answers here:
1. How can I fix the problem? Is there some magic combination of Cache-Control headers, or other voodoo, that causes the browser to do the right thing here?
2. If you think I am doing something horribly wrong here, how should I go about this? I'm seeking correctness, but also trying not to sacrifice flexibility.
If it helps, the application is a JAX-RS web application, using Restlet 2.0m4. I can provide sample request/response headers if it's helpful but I believe the issue is completely reproducible.
Is there some magic combination of Cache-Control headers, or other voodoo, that causes the browser to do the right thing here?
If you serve different responses to different Accept: headers, you must include the header:
Vary: Accept
in your response. The Vary header should also contain any other request headers that influence the response, so for example if you do gzip/deflate compression you'd have to include Accept-Encoding.
IE, unfortunately, handles many values of Vary poorly, breaking caching completely, which might or might not matter to you.
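To illustrate, a sketch in Python/Flask rather than the asker's JAX-RS/Restlet stack, with made-up item data: whichever representation is chosen from the Accept header, the response also carries Vary: Accept.
    import json
    from flask import Flask, Response, request

    app = Flask(__name__)

    ITEMS = [{'id': 1, 'name': 'widget'}, {'id': 2, 'name': 'gadget'}]

    @app.route('/items')
    def items():
        accepts_json = 'application/json' in request.headers.get('Accept', '')
        if accepts_json:
            resp = Response(json.dumps(ITEMS), content_type='application/json')
        else:
            resp = Response('<html><body><ul><li>widget</li><li>gadget</li></ul></body></html>',
                            content_type='text/html')
        # Tell caches that the representation depends on the Accept request header.
        resp.headers['Vary'] = 'Accept'
        return resp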
If you think I am doing something horribly wrong here, how should I go about this?
I don't think the idea of serving different content for different Accept types at the same URL is horribly wrong, but you are letting yourself in for more compatibility problems than you really need. Relying on the Accept header to select JSON isn't really a great idea in practice; you'd be best off just having a different URL, such as /items/json or /items?format=json.
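A sketch of that suggestion (again Flask rather than the asker's stack; the format parameter is just the one suggested above): the JSON lives at /items?format=json, a distinct URL from the HTML page.
    import json
    from flask import Flask, Response, request

    app = Flask(__name__)

    ITEMS = [{'id': 1, 'name': 'widget'}]

    @app.route('/items')
    def items():
        # The representation is chosen from the URL, not from the Accept header,
        # so the HTML page and the JSON data can never be confused by a cache.
        if request.args.get('format') == 'json':
            return Response(json.dumps(ITEMS), content_type='application/json')
        return '<html><body><ul><li>widget</li></ul></body></html>'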
I know this question is old, but just in case anyone else runs into this:
I was having this same problem with a Rails application using jQuery, and I fixed it by telling the browser not to cache the JSON response, using the solution given here to a different question:
jQuery $.getJSON works only once for each control. Doesn't reach the server again
The problem only seemed to occur with Chrome and Firefox. Safari was handling the back behavior okay without explicitly being told not to cache.
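The linked fix works on the client side (telling jQuery not to cache Ajax responses). A server-side equivalent, sketched here with Flask rather than the Rails app described above and with a made-up /items.json URL, is to mark the JSON response itself as non-cacheable:
    import json
    from flask import Flask, Response

    app = Flask(__name__)

    @app.route('/items.json')
    def items_json():
        resp = Response(json.dumps([{'id': 1, 'name': 'widget'}]),
                        content_type='application/json')
        # Mark the JSON as non-cacheable so the back button re-renders the HTML
        # page instead of showing (or downloading) the cached JSON.
        resp.headers['Cache-Control'] = 'no-cache, no-store, must-revalidate'
        resp.headers['Pragma'] = 'no-cache'
        resp.headers['Expires'] = '0'
        return resp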
Old question, but for anyone else seeing this, there is nothing wrong with the questioner's usage of the Accept header.
This is a confirmed bug in Chrome. (Previously also in Firefox but since fixed.)
http://code.google.com/p/chromium/issues/detail?id=94369
So I've been noticing some strange results in how Google peruses our site. One issue is that a URL such as this:
http://example.com/randomstring
is showing up on Google with all of the data of
http://example.com/
So in my mind there are two solutions. One is to add a 301 redirect whenever someone visits a sub-URL of the main one, sending them to the parent URL; the other is to just give a 404 with a nice message saying, "Maybe you meant parent-url".
Thoughts? I'm pretty sure I know where I want to send them, but what is the proper web-etiquette? 404 or 301?
The correct HTTP way would be a 404, as long as the request is made for something that doesn't exist.
A 301 is for something that has moved, which is not the case here.
However, 100% correct HTTP convention is rarely followed today. Depending on the context, it could be useful to redirect the user to the home page with a notification that the page wasn't found and that they were redirected. In that case, though, you should use a 303 See Other code.
You should never redirect without letting the user know that a redirect happened, though. That confuses the user into thinking that maybe something is wrong.
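If you do take that redirect route, here is a minimal sketch (assuming Flask and a hypothetical not_found query flag) of a 303 to the home page with a visible notice:
    from flask import Flask, redirect, request

    app = Flask(__name__)

    @app.route('/')
    def home():
        notice = ''
        if request.args.get('not_found'):
            notice = '<p>The page you asked for was not found, so we brought you to the home page.</p>'
        return '<html><body>' + notice + '<h1>Home</h1></body></html>'

    @app.errorhandler(404)
    def not_found(error):
        # 303 See Other: send the visitor home, with a flag so the page can
        # explain that a redirect happened rather than doing it silently.
        return redirect('/?not_found=1', code=303)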
The already-posted answers cover your question nicely, but I thought there might be some value in going to the source: RFC 2616.
10.3.2 301 Moved Permanently
The requested resource has been assigned a new permanent URI and any future references to this resource SHOULD use one of the returned URIs. Clients with link editing capabilities ought to automatically re-link references to the Request-URI to one or more of the new references returned by the server, where possible. This response is cacheable unless indicated otherwise.
The new permanent URI SHOULD be given by the Location field in the response. Unless the request method was HEAD, the entity of the response SHOULD contain a short hypertext note with a hyperlink to the new URI(s).
If the 301 status code is received in response to a request other than GET or HEAD, the user agent MUST NOT automatically redirect the request unless it can be confirmed by the user, since this might change the conditions under which the request was issued.
Note: When automatically redirecting a POST request after receiving a 301 status code, some existing HTTP/1.0 user agents will erroneously change it into a GET request.
10.4.5 404 Not Found
The server has not found anything matching the Request-URI. No indication is given of whether the condition is temporary or permanent. The 410 (Gone) status code SHOULD be used if the server knows, through some internally configurable mechanism, that an old resource is permanently unavailable and has no forwarding address. This status code is commonly used when the server does not wish to reveal exactly why the request has been refused, or when no other response is applicable.
Of course, with these things common usage tends to take precedence over the actual text of the RFC. If the entire world is doing it one way, pointing at a document doesn't help much.
I'd say a 404 is the right thing to do, as there never was a meaningful resource at the location, so nothing has "moved permanently" (which is the meaning of 301) and the client needs to know their URL was faulty and has not just changed in the meantime.
But I don't quite understand yet what the issue is. Is Google hitting your site with random URL requests? That would be odd. Or is it that your site is showing the same results for domain.com/randomstring as for domain.com/index.html? That you should change, methinks with a 404.
If you know what URL they should go to, that's exactly what 301 is for.
So are you saying that your site is doing redirects without your control?
The time to use a 301 (permanent redirect) is when the page originally existed but has moved somewhere else. It's a "change of address card", and a huge lifesaver when restructuring a site. If the page is just some wacky random URL, then returning a 404 tells spiders (and humans too, though people pay less attention to it) that this page never existed, so don't keep coming back and wasting my web server's time. Some people disagree with this because they never want their users to see a 404 page. I think these codes were developed for good reason and are used pretty well by search engines.
Passing either of these status codes does not prevent you from serving "friendly pages" (although a 301 will typically just redirect you if the browser allows).
The thing to remember is that Google doesn't like duplicate content, so you want to make sure that your site does not appear to be serving the same content at different URLs.
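Pulling the answers together, a sketch (Flask; the MOVED table and paths are made up) of the rule of thumb: 301 only for URLs that really had content and moved, a friendly 404 for random strings, and never the home page's content at a random URL.
    from flask import Flask, abort, redirect

    app = Flask(__name__)

    MOVED = {'/old-about': '/about'}  # pages that genuinely moved elsewhere

    @app.errorhandler(404)
    def not_found(error):
        # Friendly page for humans; the 404 status still tells spiders the URL never existed.
        return ('<h1>Not found</h1><p>Maybe you meant the <a href="/">home page</a>?</p>', 404)

    @app.route('/')
    def home():
        return '<h1>Home</h1>'

    @app.route('/<path:anything>')
    def catch_all(anything):
        path = '/' + anything
        if path in MOVED:
            # "Change of address card": the page existed and moved, so 301 to the new URL.
            return redirect(MOVED[path], code=301)
        # A wacky random string never existed here, so do not echo the home page; 404 it.
        abort(404)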