Is it possible to list multiple user-agents in one line?

Is it possible in robots.txt to give one instruction to multiple bots without repeatedly having to mention it?
Example:
User-agent: googlebot yahoobot microsoftbot
Disallow: /boringstuff/

It's actually pretty hard to give a definitive answer to this, as there isn't a very well-defined standard for robots.txt, and a lot of the documentation out there is vague or contradictory.
The description of the format understood by Google's bots is quite comprehensive, and includes this slightly garbled sentence:
Muiltiple start-of-group lines directly after each other will follow the group-member records following the final start-of-group line.
Which seems to be groping at something shown in the following example:
user-agent: e
user-agent: f
disallow: /g
According to the explanation below it, this constitutes a single "group", disallowing the same URL for two different User Agents.
So the correct syntax for what you want (with regards to any bot working the same way as Google's) would then be:
User-agent: googlebot
User-agent: yahoobot
User-agent: microsoftbot
Disallow: /boringstuff/
However, as Jim Mischel points out, there is no point in a robots.txt file which some bots will interpret correctly, but others may choke on, so it may be best to go with the "lowest common denominator" of repeating the blocks, perhaps by dynamically generating the file with a simple "recipe" and update script.
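For what it's worth, the "recipe" approach is easy to script; here is a minimal sketch in Python (the bot names and paths are just placeholders, not a recommendation):

BOTS = ["googlebot", "yahoobot", "microsoftbot"]
DISALLOWED = ["/boringstuff/"]

# Regenerate robots.txt from one list of bots and one list of paths,
# repeating the whole block for each bot so that even naive parsers cope.
blocks = []
for bot in BOTS:
    lines = [f"User-agent: {bot}"] + [f"Disallow: {path}" for path in DISALLOWED]
    blocks.append("\n".join(lines))

with open("robots.txt", "w") as robots_file:
    robots_file.write("\n\n".join(blocks) + "\n")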

I think the original robots.txt specification defines it unambiguously: one User-agent line can only have one value.
A record (aka. a block, a group) consists of lines. Each line has the form
<field>:<optionalspace><value><optionalspace>
User-agent is a field. Its value:
The value of this field is the name of the robot the record is describing access policy for.
It’s singular ("name of the robot"), not plural ("the names of the robots").
The robot should be liberal in interpreting this field. A case insensitive substring match of the name without version information is recommended.
If several values were allowed, how could parsers possibly be liberal? Whatever the delimiting character might be (a comma, a space, a semicolon, …), it could be part of a robot name.
The record starts with one or more User-agent lines
Why should you use several User-agent lines if you could provide several values in one line?
In addition:
the specification doesn’t define a delimiting character to provide several values in one line
it doesn’t define/allow it for Disallow either
So instead of
User-agent: googlebot yahoobot microsoftbot
Disallow: /boringstuff/
you should use
User-agent: googlebot
User-agent: yahoobot
User-agent: microsoftbot
Disallow: /boringstuff/
or (probably safer, as you can't be sure that all relevant parsers support the less common convention of several User-agent lines per record)
User-agent: googlebot
Disallow: /boringstuff/
User-agent: yahoobot
Disallow: /boringstuff/
User-agent: microsoftbot
Disallow: /boringstuff/
(or, of course, User-agent: *)
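If you want to sanity-check the multi-line form, Python's standard-library parser (urllib.robotparser) treats consecutive User-agent lines as one record, in line with the original spec; a quick sketch:

from urllib.robotparser import RobotFileParser

rules = """\
User-agent: googlebot
User-agent: yahoobot
User-agent: microsoftbot
Disallow: /boringstuff/
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())
# The shared Disallow applies to every bot named in the group ...
print(rp.can_fetch("googlebot", "http://example.com/boringstuff/page.html"))     # False
print(rp.can_fetch("microsoftbot", "http://example.com/boringstuff/page.html"))  # False
# ... while a bot that is not listed remains unaffected.
print(rp.can_fetch("someotherbot", "http://example.com/boringstuff/page.html"))  # True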

According to the original robots.txt exclusion protocol:
User-agent
The value of this field is the name of the robot the record is describing access policy for.
If more than one User-agent field is present the record describes an identical access policy for more than one robot. At least one field needs to be present per record.
The robot should be liberal in interpreting this field. A case insensitive substring match of the name without version information is recommended.
If the value is '*', the record describes the default access policy for any robot that has not matched any of the other records. It is not allowed to have multiple such records in the "/robots.txt" file.
I have never seen multiple bots listed in a single line. And it's likely that my web crawler would not have correctly handled such a thing. But according to the spec above, it should be legal.
Note also that even if Google were to support multiple user agents in a single directive, or the multiple User-agent lines described in IMSoP's answer (interesting find, by the way ... I didn't know that one), not all other crawlers will. You need to decide whether to use the convenient syntax that very possibly only Google and Bing bots will support, or the more cumbersome but simpler syntax that all polite bots support.

You have to put each bot on a different line.
http://en.wikipedia.org/wiki/Robots_exclusion_standard

As mentioned in the accepted answer, the safest approach is to add a new entry for each bot.
This repo has a good robots.txt file for blocking a lot of bad bots: https://github.com/mitchellkrogza/nginx-ultimate-bad-bot-blocker/blob/master/robots.txt/robots.txt

Related

robots.txt -- blank lines required between user-agent blocks, or optional?

The authoritative documentation sources give seemingly conflicting descriptions.
A Standard for Robot Exclusion:
('record' refers to each user-agent block)
"The file consists of one or more records separated by one or more
blank lines (terminated by CR,CR/NL, or NL). Each record contains
lines of the form ...".
Google's robots.txt specification:
"... Note the optional use of white-space and empty lines to improve
readability."
So -- based on documentation that we have available to us -- is this empty line here mandatory?
User-agent: *
Disallow: /this-directory/

User-agent: DotBot
Disallow: /this-directory/
Disallow: /and-this-directory/
Or, is this OK?
User-agent: *
Disallow: /this-directory/
User-agent: DotBot
Disallow: /this-directory/
Disallow: /and-this-directory/
Google's Robots.txt Parser and Matcher Library has no special handling for blank lines. Python's urllib.robotparser treats a blank line as the end of a record, but blank lines are not strictly required, because the parser also recognizes a User-agent: line as the start of a new record. Therefore, both of your configurations would work fine with either parser.
This, however, is specific to those two prominent robots.txt parsers; you should still write the file in the most common and unambiguous way possible to cope with badly written custom parsers.
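A quick way to see this with urllib.robotparser (just a sketch; the rules are the ones from the question): both variants parse to the same access policy.

from urllib.robotparser import RobotFileParser

with_blank = """\
User-agent: *
Disallow: /this-directory/

User-agent: DotBot
Disallow: /this-directory/
Disallow: /and-this-directory/
"""
without_blank = with_blank.replace("\n\n", "\n")

for text in (with_blank, without_blank):
    rp = RobotFileParser()
    rp.parse(text.splitlines())
    # DotBot is blocked from the second directory; the generic group is not.
    print(rp.can_fetch("DotBot", "/and-this-directory/page"),
          rp.can_fetch("SomeOtherBot", "/and-this-directory/page"))
# Both iterations print: False True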
Best not to have any blank lines. I was having problems with my robots.txt file and everything worked fine AFTER I removed all blank lines.

Does Googlebot crawl URLs with GET parameters?

I gave up on URL rewriting because I don't feel it's necessary.
So a URL looks like "http://www.stackoverflow.com?question=35453."
But, I worry about one thing.
Does Googlebot crawl my pages?
Do I need to rewrite my URLs to something like "http://www.stackoverflow.com/question/35453"?
Does Googlebot crawl my pages?
Yes. But you can tell Google Webmaster Tools to ignore the parameter.
In general, if you want nicer URLs that are more meaningful to search engines and users, you should avoid parameters and use human-language words separated by dashes.
So IMHO, the best you could do is:
http://www.stackoverflow.com/question/does-googlebot-crawl-url-with-get-parameters.html.
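A minimal sketch (my own, purely illustrative) of turning a title into that kind of dash-separated slug in Python:

import re
import unicodedata

def slugify(title: str) -> str:
    # Drop accents, lowercase, and collapse anything non-alphanumeric into dashes.
    ascii_title = unicodedata.normalize("NFKD", title).encode("ascii", "ignore").decode("ascii")
    return re.sub(r"[^a-z0-9]+", "-", ascii_title.lower()).strip("-")

print(slugify("Does Googlebot crawl URLs with GET parameters?"))
# -> does-googlebot-crawl-urls-with-get-parameters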

If Content-disposition is not safe to use, what can we use instead?

I've read here that using Content-Disposition has security issues and is not part of the HTTP standard. If Content-Disposition can't be used, what can we use instead?
I've also searched the list of all response header fields, categorized by whether each is part of the standard or not, and I've not seen a response field that can be used to replace Content-Disposition.
Well, the information about not being a standard is incorrect - see https://greenbytes.de/tech/webdav/rfc6266.html and http://www.iana.org/assignments/message-headers/message-headers.xhtml (note that Wikipedia is entirely irrelevant with respect to this).
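Since the header is in fact standardized (RFC 6266), the usual approach is simply to keep using it, with a plain ASCII filename plus an RFC 5987 encoded filename* for non-ASCII names. A rough sketch with Python's wsgiref (my own example, not tied to any framework):

from wsgiref.simple_server import make_server

def app(environ, start_response):
    body = b"example,data\n"
    start_response("200 OK", [
        ("Content-Type", "text/csv"),
        # RFC 6266: "filename" as an ASCII fallback,
        # "filename*" (RFC 5987 encoding) for non-ASCII names.
        ("Content-Disposition",
         "attachment; filename=\"report.csv\"; filename*=UTF-8''r%C3%A9port.csv"),
    ])
    return [body]

if __name__ == "__main__":
    make_server("", 8000, app).serve_forever()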

Internationalizing Title/Meta Tags ok or bad practice?

Is there a problem if I have both English and Chinese versions of the same title/meta tags under the same exact url? I detect the language the user has set for the browser (through the http header "accept-language" field) and change the titles/meta tags based on the language set. I get a large percentage of my traffic from China and felt this was a better-localized user experience for those users BUT I have no idea how Google would view this. My gut feeling tells me that this is not good for SEO.
Baidu.com, a major Chinese search engine, does in fact pick up my translated tags; however, for other US-based sites it does not translate their English title/meta tags into Chinese. I would think Chinese users are less likely to click on those.
Creating subdomains and/or separate domains for other countries is not an option at this point. That being said, should I only have one language (English) for my title/meta tags to avoid any search engine issues?
Thanks for any advice / wisdom you can offer. Really hoping to get clarity on best practices.
Thanks all!
Yes, it probably is a problem. Search engines see mixed-language content. You are not describing how you “detect and change the titles/meta tags based on the user's browser language”, but you are probably doing it client-side and using the “browser language”, which is wrong whatever that means in detail (it does not specify the user's preferred language).
To get a more targeted answer, ask a more real question, with a URL.
If you want to get search traffic from search engines in both English and Chinese, you should have two URLs instead of one.
When Googlebot crawls a page, it does not even send an "Accept-Language" header, so you end up serving it your default language. With only one URL, there is no way to get your second language indexed, and you won't rank in search engines in multiple languages.
For best SEO, use separate top level domains, subdomains, or folders for different languages.
http://example.de/
http://example.es/
http://example.com/
http://de.example.com/
http://es.example.com/
http://www.example.com/
http://example.com/de/
http://example.com/es/
http://example.com/en/
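If you go with folders, here is a rough sketch (my own illustration, with placeholder language codes) of picking which folder to send a first-time visitor to based on Accept-Language, while each language keeps its own crawlable URL:

SUPPORTED = ("en", "zh")

def preferred_language(accept_language: str, default: str = "en") -> str:
    # Accept-Language looks like "zh-CN,zh;q=0.9,en;q=0.8".
    weighted = []
    for part in accept_language.split(","):
        lang, _, q = part.strip().partition(";q=")
        try:
            weight = float(q) if q else 1.0
        except ValueError:
            weight = 0.0
        weighted.append((weight, lang.split("-")[0].lower()))
    for _, lang in sorted(weighted, reverse=True):
        if lang in SUPPORTED:
            return lang
    return default

# Redirect to /zh/ or /en/ accordingly, but keep both folders linked and
# crawlable, since Googlebot does not send Accept-Language.
print(preferred_language("zh-CN,zh;q=0.9,en;q=0.8"))  # -> zh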
I think there is no problem when you use English and Chinese in the same meta tags.

Generic way to ask which browser they're using?

I would like a text input with the question "what browser are you using" above it. Then when a form is submitted, I'd like to compare their answer to their User-Agent HTTP header.
I am stumped on how to reliably make this work.
I could ask them to spell it out instead of using acronyms like IE or FF, but Internet Explorer uses "MSIE" as its identifier, doesn't it?
Another thought I had was to keep a pool of User-Agent strings, then present them with a select element that has theirs inserted randomly among 4 or so other random strings and ask them to select theirs. I fear non-tech-savvy users would bungle this enough times for it to be a problem, though. I suppose I could use some logic to make sure there's only one of each browser type among the options, but I'm leery about even that.
Why would you want to ask the user about their User-Agent?
Pulling the appropriate HTTP header, as you've mentioned, should be enough.
But if you need it that badly, I'd go for:
regular expressions - to check the HTTP User-Agent header and cut the unimportant info out of it
present a possible match based on the previous step
ask the user if the match is correct; if not, let them enter their own answer
then I'd try to match what they entered against some dictionary values, so that entering IE or MSIE gives the same result.
The above is admittedly vague :) and abstract; maybe you could explain why you want this? Maybe there is some other way?
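A sketch of that idea in Python (my own illustration; the alias table and patterns are deliberately incomplete):

import re

# Map free-text answers to a canonical token.
ALIASES = {
    "ie": "ie", "msie": "ie", "internet explorer": "ie",
    "ff": "firefox", "firefox": "firefox",
    "chrome": "chrome", "safari": "safari", "opera": "opera",
}

def browser_from_user_agent(ua: str) -> str:
    ua = ua.lower()
    # Order matters: Chrome UAs also contain "safari", and IE 11 only says "trident".
    for pattern, name in [(r"msie|trident", "ie"), (r"firefox", "firefox"),
                          (r"opr|opera", "opera"), (r"chrome", "chrome"),
                          (r"safari", "safari")]:
        if re.search(pattern, ua):
            return name
    return "unknown"

def browser_from_answer(answer: str) -> str:
    return ALIASES.get(answer.strip().lower(), "unknown")

ua = "Mozilla/5.0 (Windows NT 6.1; Trident/7.0; rv:11.0) like Gecko"
print(browser_from_user_agent(ua) == browser_from_answer("Internet Explorer"))  # True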
Remember: The client sends the HTTP header and potentially the user can put anything in User Agent. So if you want to catch people who "lie about" the browser they are using, you will only catch those who cannot modify the HTTP header before they send it.
You can neither 100% trust the user input nor the string that the browser sends in the HTTP headers...
The obvious question is why you want to ask the user what browser they are using?
But given that:
a) Normalise the user string: lower-case, remove spaces, remove numbers?
b) Build a map between the normalised strings and User-Agent strings.
When you do a lookup, if the normalised string or the User-Agent string is not in the map, pass it to a human to add to the map with an appropriate mapping.
Possibly you'll want to normalise the user-agent in some way as well?
