Prevent crawlers from following a POST form action

I have a simple form on my site:
<form method="POST" action="Home/Import"> ... </form>
I get tons of error reports because crawlers send HEAD requests to Home/Import.
Notice the form is POST.
Questions
Why do crawlers try to crawl those actions?
Is there anything I can do to prevent it? (I already have Home in robots.txt.)
What is a good way to deal with those invalid (but technically correct) HEAD requests?
Details:
I use the Post-Redirect-Get pattern, if that matters.
Platform: ASP.NET MVC 3.0 (C#) on IIS 7.5

1) A crawler typically makes HEAD requests to get the MIME type of the response.
2) The HEAD request shouldn't invoke the action handler for a POST. If I saw that I was getting a lot of HEAD requests to a resource I don't want the crawler to crawl, I would give it a link I do want it to crawl. Most crawlers read robots.txt.
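For the asker's ASP.NET MVC 3 setup, a minimal sketch of point 2 (assuming a hypothetical Import action on HomeController) is to restrict the action to POST, so HEAD and GET requests from crawlers never invoke the handler and typically get a 404 from the framework instead:

using System.Web.Mvc;

public class HomeController : Controller
{
    // [HttpPost] makes MVC select this action only for POST requests;
    // crawler HEAD/GET requests to Home/Import no longer reach it.
    [HttpPost]
    public ActionResult Import(FormCollection form)
    {
        // ... process the posted form ...
        return RedirectToAction("Index"); // Post-Redirect-Get, as in the question (target action is illustrative)
    }
}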

You can disable HEAD requests at the web server level. For Apache:
<LimitExcept GET POST>
deny from all
</LimitExcept>
You can also handle this at the robots.txt level by adding:
Disallow: /Home/Import
HEAD requests are used to get information about a page (last-modified time, size, etc.) without downloading the whole page; it is an efficiency thing. Your script should not be throwing errors because of HEAD requests; those errors are probably due to a lack of validation in your code. Your code could check whether the request's HTTP method is HEAD and do something different.
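On the asker's ASP.NET MVC stack, that last suggestion (branching on the HTTP method) could look roughly like this sketch; the Import action name comes from the question, the rest is illustrative:

using System.Web.Mvc;

public class HomeController : Controller
{
    public ActionResult Import()
    {
        // Answer HEAD (or any non-POST) requests with an empty response
        // instead of running the import logic and throwing errors.
        if (Request.HttpMethod != "POST")
            return new EmptyResult();

        // ... normal POST handling ...
        return RedirectToAction("Index");
    }
}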

Four years later, but still answering question #1: Google does indeed try to crawl POST forms, both by sending a plain GET to the URL and by sending actual POST requests. See their blog post on this. The why lies in the nature of the web: bad web developers hide their content links behind POST search forms. To reach that content, search engines have to improvise.
About #2: The reliability of robots.txt varies.
And about #3: the ultra-clean way would probably be HTTP status 405 Method Not Allowed, if HEAD requests in particular are your problem. Not sure browsers will like this, though.
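A hedged sketch of that 405 approach on the asker's ASP.NET MVC 3 stack, using a made-up action filter name:

using System.Net;
using System.Web.Mvc;

// Hypothetical filter: answers HEAD requests with 405 Method Not Allowed
// before the action body runs.
public class RejectHeadRequestsAttribute : ActionFilterAttribute
{
    public override void OnActionExecuting(ActionExecutingContext filterContext)
    {
        if (filterContext.HttpContext.Request.HttpMethod == "HEAD")
        {
            filterContext.Result =
                new HttpStatusCodeResult((int)HttpStatusCode.MethodNotAllowed);
        }
    }
}

// Usage (illustrative):
// [RejectHeadRequests]
// public ActionResult Import(...) { ... }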

Related

Do not use 404 pages on your website?

I was reading this guide of best practices by Yahoo:
http://developer.yahoo.com/performance/rules.html#no404
One rule that caught my eye is this.
HTTP requests are expensive so making an HTTP request and getting a useless response (i.e. 404 Not Found) is totally unnecessary and will slow down the user experience without any benefit.
Some sites have helpful 404s "Did you mean X?", which is great for the user experience but also wastes server resources (like database, etc).
Particularly bad is when the link to an external JavaScript is wrong and the result is a 404. First, this download will block parallel downloads. Next the browser may try to parse the 404 response body as if it were JavaScript code, trying to find something usable in it.
Is this really best practice? If a user enters the wrong URL on my website, what is this guide recommending? That I just leave the user with the default server error page? If the user enters a wrong URL, how can the server return anything other than a 404? I simply don't understand this advice and wonder if anyone can explain the reasoning. Thank you.
The best practice is that you serve different 404 documents in different situations.
Resources that are not intended to be requested directly by a user, but rather embedded into other documents (images, stylesheets, script files, etc.), should not result in a detailed error document; the response should simply be empty, i.e., have no body. With no body present, the browser won't try to parse it.
But for resources that are intended to be requested directly, a detailed 404 document with helpful information and guidance is better.
You could also use the presence of a Referer header field, instead of the author's intention, as an indicator of whether a detailed 404 document is useful.
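As a rough illustration on an ASP.NET MVC stack (the controller and action names are made up), a catch-all 404 action could branch on whether the request looks like an embedded resource:

using System.IO;
using System.Web.Mvc;

public class ErrorsController : Controller
{
    // Hypothetical catch-all 404 action, e.g. wired up via a catch-all route.
    public ActionResult NotFound()
    {
        Response.StatusCode = 404;
        Response.TrySkipIisCustomErrors = true; // keep IIS from swapping in its own page

        var extension = Path.GetExtension(Request.Url.AbsolutePath);
        var isEmbeddedResource = extension == ".js" || extension == ".css" ||
                                 extension == ".png" || extension == ".jpg";

        // Embedded resources: empty body, so the browser has nothing to parse.
        if (isEmbeddedResource)
            return new EmptyResult();

        // Directly requested pages: a helpful 404 view.
        // (One could also inspect Request.UrlReferrer here, as suggested above.)
        return View("NotFound");
    }
}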

SEO 301 redirect limits

I'm thinking of doing such structure for accessing some hypothetical page:
/foo/ID/some-friendly-string
The key part here is the "ID" that identifies the page, so everything that's not the ID is relevant only to SEO. I also want everything else that isn't "/foo/ID/some-friendly-string" to redirect to the original link, e.g.:
/foo/ID ---> /foo/ID/some-friendly-string
/foo/ID/some-friendly-string-blah ---> /foo/ID/some-friendly-string
But what if these links somehow get "polluted" somewhere on the internet and spiders start accessing them with URLs like "/foo/ID/some-friendly-string-blah-blah-pollution"? I don't even know if this can happen, but if, say, some bad person decided to post thousands of such "different" links on some well-known forums, then Google would find thousands of "different" URLs 301-redirecting to the same page.
In such a case, would there be some sort of penalty, or is it all the same to Google as long as the endpoint is unique and there is no duplicate content?
I might be a little paranoid about this, but it's just my nature to investigate exploitable situations :)
Thanks for your thoughts
Your approach of using a 301 redirect is correct.
301 redirects are very useful if people access your site through several different URLs.
For instance, your page for a given ID can be accessed in multiple ways, say:
http://yoursite.com/foo/ID
http://yoursite.com/foo/ID/some-friendly-string (preferred)
http://yoursite.com/foo/ID/some-friendly-string-blah
http://yoursite.com/some-friendly-string-blah-blah-pollution
It is a good idea to pick one of those URLs (you have decided on http://yoursite.com/foo/ID/some-friendly-string) as your preferred URL and use 301 redirects to send traffic from the other URLs to your preferred one.
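As a purely illustrative sketch in ASP.NET MVC (the lookup and names are hypothetical), the action behind /foo/{id}/{slug} can compare the incoming slug with the preferred one and issue the 301 itself:

using System.Web.Mvc;

public class FooController : Controller
{
    // Assumed route: foo/{id}/{slug}, with slug optional.
    public ActionResult Show(int id, string slug)
    {
        var page = LoadPage(id); // hypothetical lookup by ID
        if (page == null)
            return HttpNotFound();

        // A missing, truncated or "polluted" slug gets one permanent
        // redirect to the preferred URL, so all variants collapse
        // into a single endpoint.
        if (slug != page.Slug)
            return RedirectToActionPermanent("Show", new { id, slug = page.Slug });

        return View(page);
    }

    private PageModel LoadPage(int id) { return null; /* stub for illustration */ }
}

public class PageModel { public string Slug { get; set; } }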
I would also recommend adding a canonical link to the HEAD section of the page, e.g.:
<link rel="canonical" href="http://yourwebsite.com/foo/ID/some-friendly-string"/>
You can get more details on 301 redirects in:
Google Webmaster Tools - Configuration > Change of Address
Google Webmaster Tools Documentation - 301 redirects
I hope that will help you out with your decisions on redirects.
EDIT
I forgot to mention a very good example, namely Stack Overflow. The URL of this question is
http://stackoverflow.com/questions/14318239/seo-301-redirect-limits, but you can access it with http://stackoverflow.com/questions/14318239/blahblah and you will get redirected to the original URL.

Understanding Googlebot's AJAX crawling

I've been through Google's documentation and countless blog posts about this subject, and depending on the date and source, there seems to be some conflicting information. Please shine your wisdom upon this lowly peasant, and all will be good.
I'm building a website pro bono where a large part of the audience is from African countries with poor internet connectivity, and the client can't afford any decent infrastructure. So I've decided to serve everything as static HTML files, and if JavaScript is available, I load the page content directly into the DOM when a user clicks a nav link, to avoid the overhead of loading a whole page.
My client-side routes look like this:
//domain.tld/#!/page
My first question is: does Googlebot translate that into
//domain.tld/_escaped_fragment_/page or //domain.tld/?_escaped_fragment_=/page?
I've made a simple server-side router in PHP that builds the requested pages for Googlebot, and my plan was to redirect //d.tld/_escaped_fragment_/page to //d.tld/router/page.
But when using Google's "Fetch as Googlebot" (for the first time, I might add), it doesn't seem to recognize any links on the page. It just returns "Success" and shows me the HTML of the main page. (Update: when pointing Fetch as Googlebot at //d.tld/#!/page it just returns the content of the main page, without doing any _escaped_fragment_ magic.) This leads me to my second question:
Do I need to follow a particular syntax when using hashbang links for Googlebot to crawl them?
My links look like this in the HTML:
Page Headline
Update 1: So, when I ask Fetch as Googlebot to get //d.tld/#!/page, this shows up in the access log: "GET /_escaped_fragment_/page HTTP/1.1" 301 502 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)". But it doesn't seem to follow the 301 I set up, and shows the main page instead. Should I use a 302 instead? This is the rule I'm using: RedirectMatch 301 /_escaped_fragment_/(.*) /router/$1
Update 2: I've changed my plans and will account for Googlebot as part of my non-JavaScript fallback tactic. So now all the links point to the router (/router/page) and are then changed to /#!/page/ on load with JavaScript. I'm keeping the question open for a bit, in case someone has a brilliant solution that might help others.

Log in form in a lightbox

We've been trying to implement a site with an HTTP home page but HTTPS everywhere else. In doing so we hit the rather big snag that our login form, in a lightbox, would need to fetch an HTTPS form using AJAX, embed it in an HTTP page, and then (possibly) handle the form errors, still within the lightbox.
In the end we gave up and just made the whole site HTTPS, but I'm sure I've seen a login-in-a-lightbox implementation on other sites, though I can't find any examples now that I want to.
Can anyone give examples of sites that have achieved this functionality, or explain how/why it can or can't be achieved?
The Same Origin Policy prevents this. The page is either 100% HTTPS or it's not. The Same Origin Policy sees this as a "different" site if the protocol is not the same.
A "lightbox" is not different than any other HTML - it's just laid out differently. The same rules apply.
One option would be to use an iframe. It's messy, but if having the whole shebang in HTTPS isn't an option, it can get the job done.
You might be able to put the login form into an iframe so that users can log in through HTTPS while it seems they are on an HTTP page, but I'm not sure why you would want to do this.

Google Page Speed Recommendation for Leveraging Browser Caching

Well, I'm trying to optimize my application and am currently using Page Speed for this. One of the strongest recommendations was that I needed to leverage browser caching. The report sent me to this page:
http://code.google.com/intl/pt-BR/speed/page-speed/docs/caching.html#LeverageBrowserCaching
In this page there is this quote:
If the Last-Modified date is sufficiently far enough in the past, chances are the browser won't refetch it.
My point is: no matter what value I set for the Last-Modified header (I tried 10 years in the past), when I access and reload my application (always clearing the browser's recent history) I get status 200 for the first access and 304 for the remaining ones.
Is there any way I can get the behavior described in the Google documentation, i.e. the browser not trying to fetch the static resources from my site at all?
You might have better success using the Expires header (also listed on that Google doc link).
Also keep in mind that all of these caching-related headers are hints or suggestions for browsers to follow. Different browsers can behave differently.
The method of testing is a good example. In your case you mentioned getting status 304 for the remaining requests, but are you getting those by doing a manual browser refresh? Browsers will usually re-request the resource (conditionally) in that case, even if it is cached.
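If the application happens to be ASP.NET (as in the first question above; otherwise treat this as purely illustrative), a far-future Expires plus Cache-Control max-age can be set per response like this:

using System;
using System.Web;
using System.Web.Mvc;

public class AssetsController : Controller
{
    // Hypothetical action serving a static-ish resource with far-future caching.
    public ActionResult Logo()
    {
        // Expires / Cache-Control: max-age give the browser an explicit lifetime,
        // so it can skip the request entirely; Last-Modified mostly leads to
        // conditional requests and the 304s described above.
        Response.Cache.SetCacheability(HttpCacheability.Public);
        Response.Cache.SetExpires(DateTime.UtcNow.AddYears(1));
        Response.Cache.SetMaxAge(TimeSpan.FromDays(365));

        return File(Server.MapPath("~/Content/logo.png"), "image/png");
    }
}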
