So I've been noticing some strange results in how Google crawls our site. One issue is that a URL such as this:
http://example.com/randomstring
is showing up on google with all of the data of
http://example.com/
So in my mind there are two solutions. One is to add a 301 redirect whenever someone visits a sub-URL of the main one, sending them to the parent URL; the other is to return a 404 with a nice message saying, "Maybe you meant parent-url".
Thoughts? I'm pretty sure I know where I want to send them, but what is the proper web-etiquette? 404 or 301?
The correct HTTP behavior would be a 404, since the request is for something that doesn't exist.
A 301 is for something that has moved, which is not the case here.
However, 100% correct HTTP convention is rarely followed today. Depending on the context, it could be useful to redirect the user to the home page with a notification that the page wasn't found and that they were redirected. In that case, though, you should use a 303 See Other code.
You should never redirect without letting the user know that a redirect happened, though; otherwise the user may be confused into thinking something is wrong.
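For what it's worth, here is a minimal sketch of the 404-with-a-hint approach in plain PHP (the path list and the message are made up for illustration; in a real app the routing layer would decide what exists):

<?php
// Minimal sketch: answer unknown paths with a 404 plus a helpful hint.
// $knownPaths is a stand-in for your real routing table.
$knownPaths = ['/', '/about', '/contact'];
$path = parse_url($_SERVER['REQUEST_URI'], PHP_URL_PATH);

if (!in_array($path, $knownPaths, true)) {
    http_response_code(404);   // the correct status for spiders and caches
    header('Content-Type: text/html; charset=utf-8');
    echo '<h1>Page not found</h1>';
    echo '<p>Maybe you meant the <a href="/">home page</a>?</p>';
    exit;
}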
The already posted answers cover your question nicely, but I thought there might be some value in going to the source: RFC 2616.
10.3.2 301 Moved Permanently

The requested resource has been assigned a new permanent URI and any future references to this resource SHOULD use one of the returned URIs. Clients with link editing capabilities ought to automatically re-link references to the Request-URI to one or more of the new references returned by the server, where possible. This response is cacheable unless indicated otherwise.

The new permanent URI SHOULD be given by the Location field in the response. Unless the request method was HEAD, the entity of the response SHOULD contain a short hypertext note with a hyperlink to the new URI(s).

If the 301 status code is received in response to a request other than GET or HEAD, the user agent MUST NOT automatically redirect the request unless it can be confirmed by the user, since this might change the conditions under which the request was issued.

Note: When automatically redirecting a POST request after receiving a 301 status code, some existing HTTP/1.0 user agents will erroneously change it into a GET request.

10.4.5 404 Not Found

The server has not found anything matching the Request-URI. No indication is given of whether the condition is temporary or permanent. The 410 (Gone) status code SHOULD be used if the server knows, through some internally configurable mechanism, that an old resource is permanently unavailable and has no forwarding address. This status code is commonly used when the server does not wish to reveal exactly why the request has been refused, or when no other response is applicable.
Of course, with these things common usage tends to take precedence over the actual text of the RFC. If the entire world is doing it one way, pointing at a document doesn't help much.
I'd say a 404 is the right thing to do, as there never was a meaningful resource at the location, so nothing has "moved permanently" (which is the meaning of 301) and the client needs to know their URL was faulty and has not just changed in the meantime.
But I don't quite understand yet what the issue is. Is Google hitting your site with random URL requests? That would be odd. Or is it that your site is showing the same results for domain.com/randomstring as for domain.com/index.html? That you should change, methinks, with a 404.
If you know what URL they should go to, that's exactly what 301 is for.
So are you saying that your site is doing redirects without your control?
When you want to use a 301 (permanent redirect) is when that page originally existed but has moved somewhere else. It's a "change of address card", and a huge lifesaver when restructuring a site. If the page is just some wacky random URL, then returning a 404 tells spiders (and humans too, though people pay less attention to it) that this page never existed, so don't keep coming back and wasting my web server's time. Some people disagree with this because they never want their users to see a 404 page. I think these codes were developed for good reason and are used pretty well by search engines.
Passing either of these status codes does not prevent you from serving "friendly pages" (although a 301 will typically just redirect you if the browser allows).
The thing to remember is that Google doesn't like duplicate content, so you want to make sure that your site does not appear to be serving the same content under different URLs.
I was reading this best-practices guide by Yahoo:
http://developer.yahoo.com/performance/rules.html#no404
One rule that caught my eye is this.
HTTP requests are expensive so making an HTTP request and getting a useless response (i.e. 404 Not Found) is totally unnecessary and will slow down the user experience without any benefit.

Some sites have helpful 404s "Did you mean X?", which is great for the user experience but also wastes server resources (like database, etc).

Particularly bad is when the link to an external JavaScript is wrong and the result is a 404. First, this download will block parallel downloads. Next, the browser may try to parse the 404 response body as if it were JavaScript code, trying to find something usable in it.
Is this really best practice? If a user enters the wrong URL on my website, what is this guide recommending? That I just leave the user with the default server error page? If the user enters the wrong URL, how can the server return anything other than a 404? I simply don't understand this advice and wonder if anyone can explain the reasoning. Thank you.
The best practice is that you serve different 404 documents in different situations.
Resources that are not intended to be directly requested by a user but rather embedded in other documents (i.e., images, stylesheets, script files, etc.) should not result in a detailed error document; the 404 response should simply have an empty body. With no body present, the browser won't try to parse it.
But for resources that are intended to be requested directly, a detailed 404 document with helpful information and guidance is better.
You could also use the presence of a Referer header field, rather than the author's intention, as an indicator of whether a detailed 404 document is useful.
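A rough sketch of that idea in PHP (the extension list is my own assumption, not a standard; adjust it to whatever your site embeds):

<?php
// Serve an empty-bodied 404 for resources that look embedded
// (scripts, styles, images), and a detailed page otherwise.
$path = parse_url($_SERVER['REQUEST_URI'], PHP_URL_PATH);
$ext  = strtolower(pathinfo($path, PATHINFO_EXTENSION));
$embedded = in_array($ext, ['js', 'css', 'png', 'jpg', 'gif', 'ico'], true);

http_response_code(404);
if ($embedded) {
    exit;   // empty body: nothing for the browser to mis-parse as JS/CSS
}
header('Content-Type: text/html; charset=utf-8');
echo '<h1>Not found</h1><p>The page you requested does not exist.</p>';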
I'm thinking of using a structure like this for accessing some hypothetical page:
/foo/ID/some-friendly-string
The key part here is "ID", which identifies the page, so everything that's not the ID is only relevant to SEO. I also want every URL that isn't "/foo/ID/some-friendly-string" to redirect to the original link, e.g.:
/foo/ID ---> /foo/ID/some-friendly-string
/foo/ID/some-friendly-string-blah ---> /foo/ID/some-friendly-string
But what if these links somehow get "polluted" around the internet and spiders start accessing them with "/foo/ID/some-friendly-string-blah-blah-pollution" URLs? I don't even know if this can happen, but if, say, some bad person decided to post thousands of such "different" links on some well-known forums, then Google would find thousands of "different" URLs 301-redirecting to the same page.
In such a case, would there be some sort of penalty, or is it all the same to Google as long as the endpoint is unique and no content is duplicated?
I might be a little paranoid about this, but it's just my nature to investigate exploitable situations :)
Thanks for your thoughts
Your approach of using a 301 redirect is correct.
301 redirects are very useful if people access your site through several different URLs.
For instance, your page for a given ID can be accessed in multiple ways, say:
http://yoursite.com/foo/ID
http://yoursite.com/foo/ID/some-friendly-string (preferred)
http://yoursite.com/foo/ID/some-friendly-string-blah
http://yoursite.com/some-friendly-string-blah-blah-pollution
It is a good idea to pick one of those URLs (you have decided on http://yoursite.com/foo/ID/some-friendly-string) as your preferred URL and use 301 redirects to send traffic from the other URLs to it.
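A sketch of that redirect logic in PHP (slug_for_id() is a hypothetical lookup that returns the canonical friendly string for an ID, and the $_GET parameters assume a rewrite rule has split the route for you):

<?php
// Redirect any variant of /foo/ID/whatever to the one canonical URL.
$id   = $_GET['id'];
$slug = isset($_GET['slug']) ? $_GET['slug'] : '';
$canonical = '/foo/' . $id . '/' . slug_for_id($id);

if ('/foo/' . $id . '/' . $slug !== $canonical) {
    header('Location: ' . $canonical, true, 301);   // permanent redirect
    exit;
}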
I would also recommend adding a canonical link to the HEAD section of the page, e.g.:
<link rel="canonical" href="http://yourwebsite.com/foo/ID/some-friendly-string"/>
You can get more details on 301 redirects in:
Google Webmaster Tools - Configuration > Change of Address
Google Webmaster Tools Documentation - 301 redirects
I hope that will help you out with your decisions on redirects.
EDIT
I forgot to mention a very good example, namely Stack Overflow. The URL of this question is
http://stackoverflow.com/questions/14318239/seo-301-redirect-limits, but you can access it with http://stackoverflow.com/questions/14318239/blahblah and you will get redirected to the original URL.
We've been trying to implement a site with an HTTP home page but HTTPS everywhere else. In doing so we hit the rather big snag that our login form, in a lightbox, would need to fetch an HTTPS form using AJAX, embed it in an HTTP page, and then (possibly) handle the form errors, still within the lightbox.
In the end we gave up and just made the whole site HTTPS, but I'm sure I've seen a login-in-a-lightbox implementation on other sites, though I can't find any examples now that I want to.
Can anyone give any examples of sites that have achieved this functionality, or explain how/why it can or can't be achieved?
The Same Origin Policy prevents this. The page is either 100% HTTPS or it's not. The Same Origin Policy sees this as a "different" site if the protocol is not the same.
A "lightbox" is not different than any other HTML - it's just laid out differently. The same rules apply.
One option would be to use an iFrame. It's messy, but if having the whole shebang in https isn't an option, it can get the job done.
You might be able to put the login form into an iframe so that users can log in through HTTPS while it seems they are on an HTTP page, but I'm not sure why you would want to do this.
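For what it's worth, a minimal sketch of the iframe idea (the file names and login URL are hypothetical):

<?php /* login-box.php: an HTTP page embedding a login form served over HTTPS. */ ?>
<iframe src="https://example.com/secure-login.php"
        width="300" height="200" style="border:0"></iframe>

One usability caveat: the user can't see the iframe's protocol in the address bar, so they have no easy way to verify the form really is served over HTTPS.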
Do I need to do anything extra to get Glimpse to cough up details on a request that returns the default 404 server error page on Cassini? By the default 404 page, I mean the "Server Error in '/' Application" page with a message of "The resource cannot be found" (as well as the HTTP 404 description, requested URL, and version info).
This project has some fairly complex routing, so I don't doubt I have something conflicting with what I am trying to do. I just want Glimpse to provide whatever details it can to point me in the right direction for fixing the problem.
I loaded up Glimpse via NuGet on an MVC3 project I am running through Visual Studio 2010's built-in hosting system (Cassini), and everything works fine on previously working action methods and their resulting views. Since then, I added another action method that is proving difficult to hit via the default URL structure (e.g., /controller/action?someparam=x). Since I thought the Glimpse route data would be quite handy for determining what is going wrong, I went looking for the eyeball in the corner of the default 404 page. Glimpse doesn't appear to be "attached" to this result.
UPDATE: Also doesn't work with RouteDebugger. Whatever I have wrong, it is high enough in the pipeline that nothing seems to be able to pin itself into the response.
UPDATE: The request URL wasn't working because I forgot I had this action set with [HttpPost]. That completely explains the 404, but not how to get any route information from the various utilities on the response sent back.
As far as Glimpse goes, one of the reasons it wasn't showing in the first place is that we only enable Glimpse on 200 Success results. Hence the eyeball wouldn't show up for a 404.
As for why it's not showing up now: have you gone to the /Glimpse/Config page and turned Glimpse on? Glimpse isn't enabled by default, so you have to explicitly turn it on.
Let me know how it goes.
What is the requirement for the browser to show the ubiquitous "this page has expired" message when the user hits the back button?
What are some user-friendly ways to prevent the user from using the back button in a webapp?
Well, by default, whenever you're dealing with a form POST and the user hits back and then refresh, they'll see the message indicating that the browser is resubmitting data. But if the page is set to expire immediately, they won't even have to hit refresh; they'll see the "page has expired" message as soon as they hit back.
To avoid both messages, there are a couple of things to try:
1) Use a form GET instead. It depends on what you're doing, but this isn't always a good solution, as there are still size restrictions on a GET request, and the information is passed along in the query string, which isn't the most secure of options.
-- or --
2) Perform a server-side redirect to a different page after the form POST.
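A minimal Post/Redirect/Get sketch of option 2 in PHP (process_form() and the target path are hypothetical):

<?php
if ($_SERVER['REQUEST_METHOD'] === 'POST') {
    process_form($_POST);                    // handle the submission
    header('Location: /thanks', true, 303);  // 303 See Other: back/refresh now re-issues a harmless GET
    exit;
}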
Looks like a similar question was answered here:
Redirect with a 303 after POST to avoid "Webpage has expired": Will it work if there are more bytes than a GET request can handle?
As a third option, one could prevent a user from going back in their browser at all. The only time I've felt a need to do this was to prevent them from doing something stupid, such as paying twice, although there are better server-side methods to handle that. If your site uses sessions, you can prevent them from paying twice by first disabling cache on the checkout page and setting it to expire immediately, and then utilizing a flag of some sort stored in the session which actually changes the behavior of the page if you go back to it.
You need to set the Pragma and Cache-Control options in the HTTP headers:
http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.9
However, from a usability point of view, this is a discouraged approach to the matter. I strongly encourage you to look for other options.
P.S.: as proposed by Steve, redirecting to a GET is the proper way (or track page movement with JS).
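If you do go down that road anyway, this is roughly what it looks like in PHP (header values per RFC 2616 section 14.9):

<?php
// Tell browsers and proxies not to cache this page at all,
// so "back" forces a fresh request instead of showing a stale copy.
header('Cache-Control: no-store, no-cache, must-revalidate');
header('Pragma: no-cache');   // HTTP/1.0 fallback
header('Expires: 0');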
Try using the following code in the Page_Load:
' Cache-Control: private - cacheable by the user's browser only, not by shared caches
Response.Cache.SetCacheability(HttpCacheability.Private)
Use one of the following before session_start() (the language is PHP):
session_cache_expire(60);                     // cache lifetime, in minutes
ini_set('session.cache_limiter', 'private');  // sends Cache-Control: private
I'm not sure if this is standard practice, but I typically solve this issue by not sending a Vary header for IE only. In Apache, you can put the following in httpd.conf:
BrowserMatch MSIE force-no-vary
According to the RFC:
The Vary field value indicates the set of request-header fields that fully determines, while the response is fresh, whether a cache is permitted to use the response to reply to a subsequent request without revalidation.
The practical effect is that when you go "back" to a POST, IE simply gets the page from the history cache. No request at all goes to the server side. I can see this clearly in HTTPWatch.
I would be interested to hear potential bad side-effects of this approach.