Can I use non-Latin characters in my robots.txt and sitemap.xml?

Can I use non-Latin characters in my robots.txt file and sitemap.xml like this?
robots.txt
User-agent: *
Disallow: /somefolder/
Sitemap: http://www.domainwithåäö.com/sitemap.xml
sitemap.xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url><loc>http://www.domainwithåäö.com/</loc></url>
<url><loc>http://www.domainwithåäö.com/subpage1</loc></url>
<url><loc>http://www.domainwithåäö.com/subpage2</loc></url>
</urlset>
Or should I do it like this?
robots.txt
User-agent: *
Disallow: /somefolder/
Sitemap: http://www.xn--domainwith-z5al6t.com/sitemap.xml
sitemap.xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url><loc>http://www.xn--domainwith-z5al6t.com/</loc></url>
<url><loc>http://www.xn--domainwith-z5al6t.com/subpage1</loc></url>
<url><loc>http://www.xn--domainwith-z5al6t.com/subpage2</loc></url>
</urlset>

On https://support.google.com/webmasters/answer/183668 Google writes: "Make sure that your URLs follow the RFC-3986 standard for URIs, the RFC-3987 standard for IRIs", so I guess the correct answer is that you have to follow these two standards.
My best guess is that it doesn't matter, because Google considers the two URLs identical. That might also be what's stated in the standards, but I'm not good at reading these, so I can't confirm or deny that.
Using the xn-- format works. I haven't tried using Unicode characters to see if that also works.

As your example contains a URI with characters NOT in the US-ASCII table, you will need to percent-encode them.
Example from Bing:
Your URL:
http://www.domain.com/папка/
To Disallow: /папка/
Without Percent encoding (Not Compatible):
Disallow: /папка/
With Percent encoding (Compatible):
Disallow: /%D0%BF%D0%B0%D0%BF%D0%BA%D0%B0/
This Bing blog post may be of help.
For the XML sitemap, non-ASCII characters can be used, but they must be escaped to match the character encoding declared for the file (UTF-8 in your example). See this guide by Google for a more detailed explanation with examples.
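For instance, combining the Punycode host from your example with the percent-encoded path from the Bing example above, a <loc> entry could look like this (just a sketch; your server must of course actually serve that percent-encoded path):
<url><loc>http://www.xn--domainwith-z5al6t.com/%D0%BF%D0%B0%D0%BF%D0%BA%D0%B0/</loc></url>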

The URLs must be ASCII-encoded as follows:
Domain name portion must be Punycode-encoded: https://www.punycoder.com/
Path portion must be URL-encoded (percent-encoded): https://www.urlencoder.io/
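As a rough illustration of both steps, here is a small Python sketch using only the standard library (the host and path are taken from the examples above; Python's built-in idna codec implements the older IDNA 2003 rules, which is fine for these characters, so treat this as a starting point rather than a complete URL normalizer):
from urllib.parse import quote

host = "www.domainwithåäö.com"
path = "/папка/subpage1"

# Punycode the host: www.xn--domainwith-z5al6t.com
ascii_host = host.encode("idna").decode("ascii")
# Percent-encode the path as UTF-8 bytes: /%D0%BF%D0%B0%D0%BF%D0%BA%D0%B0/subpage1
ascii_path = quote(path)

print("http://" + ascii_host + ascii_path)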

Related

robots.txt -- blank lines required between user-agent blocks, or optional?

Seemingly conflicting descriptions given in authoritative documentation sources.
A Standard for Robot Exclusion:
('record' refers to each user-agent block)
"The file consists of one or more records separated by one or more
blank lines (terminated by CR,CR/NL, or NL). Each record contains
lines of the form ...".
Google's robots.txt specification:
"... Note the optional use of white-space and empty lines to improve
readability."
So -- based on documentation that we have available to us -- is this empty line here mandatory?
User-agent: *
Disallow: /this-directory/

User-agent: DotBot
Disallow: /this-directory/
Disallow: /and-this-directory/
Or, is this OK?
User-agent: *
Disallow: /this-directory/
User-agent: DotBot
Disallow: /this-directory/
Disallow: /and-this-directory/
Google's Robots.txt Parser and Matcher Library does not have special handling for blank lines. Python's urllib.robotparser treats a blank line as the boundary between records, although blank lines are not strictly required, since the parser also recognizes a new User-agent: line as the start of a record. Therefore, both of your configurations would work fine with either parser.
This, however, is specific to these two prominent robots.txt parsers; you should still write the file in the most common and unambiguous way possible to cope with badly written custom parsers.
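For instance, here is a quick sanity check with Python's urllib.robotparser (only one of the two parsers mentioned above, so this proves nothing about third-party crawlers):
from urllib import robotparser

with_blank_line = """\
User-agent: *
Disallow: /this-directory/

User-agent: DotBot
Disallow: /this-directory/
Disallow: /and-this-directory/
"""
without_blank_line = with_blank_line.replace("\n\n", "\n")

for text in (with_blank_line, without_blank_line):
    rp = robotparser.RobotFileParser()
    rp.parse(text.splitlines())
    # Both variants block DotBot from /and-this-directory/ but allow other bots:
    # prints "False True" twice.
    print(rp.can_fetch("DotBot", "/and-this-directory/"),
          rp.can_fetch("SomeOtherBot", "/and-this-directory/"))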
Best not to have any blank lines. I was having problems with my robots.txt file and everything worked fine AFTER I removed all blank lines.

Why do we use UTF-8 encoding in XHTML?

<?xml version="1.0" encoding="utf-8"?>
As per W3C standards we have to use UTF-8 encoding. Why can't we use UTF-16 or any of the other encodings?
What's the difference between UTF-8 encoding and the rest of the encoding formats?
XHTML doesn't require UTF-8 encoding. As explained in this section of the specification, any character encoding can be given -- but the default is UTF-8 or UTF-16.
According to W3Schools, there are lots of character encodings available to help the browser understand the content:
UTF-8 - Character encoding for Unicode
ISO-8859-1 - Character encoding for the Latin alphabet.
There are several ways to specify which character encoding is used in the document. First, the web server can include the character encoding or "charset" in the Hypertext Transfer Protocol (HTTP) Content-Type header, which would typically look like this:
charset=ISO-8859-4
This method gives the HTTP server a convenient way to alter a document's encoding according to content negotiation; certain HTTP server software can do it, for example Apache with the module mod_charset_lite.
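For example, the full header sent by the server would look something like this:
Content-Type: text/html; charset=ISO-8859-4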

Is it possible to list multiple user-agents in one line?

Is it possible in robots.txt to give one instruction to multiple bots without repeatedly having to mention it?
Example:
User-agent: googlebot yahoobot microsoftbot
Disallow: /boringstuff/
It's actually pretty hard to give a definitive answer to this, as there isn't a very well-defined standard for robots.txt, and a lot of the documentation out there is vague or contradictory.
The description of the format understood by Google's bots is quite comprehensive, and includes this slightly garbled sentence:
Muiltiple start-of-group lines directly after each other will follow the group-member records following the final start-of-group line.
Which seems to be groping at something shown in the following example:
user-agent: e
user-agent: f
disallow: /g
According to the explanation below it, this constitutes a single "group", disallowing the same URL for two different User Agents.
So the correct syntax for what you want (with regards to any bot working the same way as Google's) would then be:
User-agent: googlebot
User-agent: yahoobot
User-agent: microsoftbot
Disallow: /boringstuff/
However, as Jim Mischel points out, there is no point in a robots.txt file which some bots will interpret correctly, but others may choke on, so it may be best to go with the "lowest common denominator" of repeating the blocks, perhaps by dynamically generating the file with a simple "recipe" and update script.
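A minimal sketch of such a generator in Python (the bot names and rule are taken from the question; the output path is just a placeholder for illustration):
# Repeat the same rules for each bot so that even naive robots.txt
# parsers, which may not support grouped User-agent lines, understand it.
bots = ["googlebot", "yahoobot", "microsoftbot"]
rules = ["Disallow: /boringstuff/"]

blocks = ["User-agent: {}\n{}".format(bot, "\n".join(rules)) for bot in bots]

with open("robots.txt", "w", encoding="ascii") as f:
    f.write("\n\n".join(blocks) + "\n")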
I think the original robots.txt specification defines it unambiguously: one User-agent line can only have one value.
A record (aka. a block, a group) consists of lines. Each line has the form
<field>:<optionalspace><value><optionalspace>
User-agent is a field. Its value:
The value of this field is the name of the robot the record is describing access policy for.
It’s singular ("name of the robot"), not plural ("the names of the robots").
The robot should be liberal in interpreting this field. A case insensitive substring match of the name without version information is recommended.
If several values were allowed, how could parsers possibly be liberal? Whatever the delimiting character might be (a comma, a space, a semicolon, …), it could be part of the robot name.
The record starts with one or more User-agent lines
Why should you use several User-agent lines if you could provide several values in one line?
In addition:
the specification doesn’t define a delimiting character to provide several values in one line
it doesn’t define/allow it for Disallow either
So instead of
User-agent: googlebot yahoobot microsoftbot
Disallow: /boringstuff/
you should use
User-agent: googlebot
User-agent: yahoobot
User-agent: microsoftbot
Disallow: /boringstuff/
or (probably safer, as you can’t be sure if all relevant parsers support the not so common way of having several User-agent lines for a record)
User-agent: googlebot
Disallow: /boringstuff/
User-agent: yahoobot
Disallow: /boringstuff/
User-agent: microsoftbot
Disallow: /boringstuff/
(or, of course, User-agent: *)
According to the original robots.txt exclusion protocol:
User-agent
The value of this field is the name of the robot the record is describing access policy for.
If more than one User-agent field is present the record describes an identical access policy for more than one robot. At least one field needs to be present per record.
The robot should be liberal in interpreting this field. A case insensitive substring match of the name without version information is recommended.
If the value is '*', the record describes the default access policy for any robot that has not matched any of the other records. It is not allowed to have multiple such records in the "/robots.txt" file.
I have never seen multiple bots listed in a single line. And it's likely that my web crawler would not have correctly handled such a thing. But according to the spec above, it should be legal.
Note also that even if Google were to support multiple user agents in a single directive, or the multiple User-agent lines as described in IMSoP's answer (interesting find, by the way ... I didn't know that one), not all other crawlers will. You need to decide if you want to use the convenient syntax that very possibly only Google and Bing bots will support, or use the more cumbersome but simpler syntax that all polite bots support.
You have to put each bot on a different line.
http://en.wikipedia.org/wiki/Robots_exclusion_standard
As mentioned in the accepted answer, the safest approach is to add a new entry for each bot.
This repo has a good robots.txt file for blocking a lot of bad bots: https://github.com/mitchellkrogza/nginx-ultimate-bad-bot-blocker/blob/master/robots.txt/robots.txt

Unicode in a Firefox extension

The application works when I use the following code:
xulschoolhello.greeting.label = Hello World?
But when I use Unicode, the application does not work:
xulschoolhello.greeting.label = سلام دنیا ?
Why does it not work?
I don't have a problem loading that string in my extension in a XUL file from chrome://. Make sure you are not overriding the encoding (UTF-8 by default). See this page for more information.
To make sure, change your XUL's first line to:
<?xml version="1.0" encoding='UTF-8' ?>
In case you are using this in a properties file, make sure you save the .properties file in UTF-8 format. From Property Files - XUL | MDN:
Non-ASCII Characters, UTF-8 and escaping
Gecko 1.8.x (or later) supports property files encoded in UTF-8. You can and should write non-ASCII characters directly without escape sequences, and save the file as UTF-8 without BOM. Double-check the save options of your text editor, because many don't do this by default. See Localizing extension descriptions for more details.
In some cases, it may be useful or needed to use escape sequences to express some characters. Property files support escape sequences of the form \uXXXX, where XXXX is a Unicode character code. For example, to put a space at the beginning or end of a string (which would normally be stripped by the properties file parser), use \u0020.
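As an illustration only (the code points below are my own transcription of the Persian greeting from the question, so double-check them for your exact text), the escaped form would look something like:
xulschoolhello.greeting.label = \u0633\u0644\u0627\u0645 \u062F\u0646\u06CC\u0627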

Is it possible to create INTERNATIONAL permalinks?

I was wondering how you deal with permalinks on international sites. By permalink I mean some link which is unique and human-readable.
E.g. for English phrases it's no problem, e.g. /product/some-title/
But what do you do if the product title is in, e.g., Chinese?
How do you deal with this problem?
I am implementing an international site and one requirement is to have human-readable URLs.
Thanks for every comment.
Characters outside a small subset of US-ASCII are not permitted unencoded in URLs according to this spec, so Chinese strings would be out immediately.
Where the product name can be localised, you can use URLs like <DOMAIN>/<LANGUAGE>/DIR/<PRODUCT_TRANSLATED>, e.g.:
http://www.example.com/en/products/cat/
http://www.example.com/fr/products/chat/
accompanied by a mod_rewrite rule to the effect of:
RewriteRule ^([a-z]+)/products/([a-z]+)? product_lookup.php?lang=$1&product=$2
For the first example above, this rule will call product_lookup.php?lang=en&product=cat. Inside this script is where you would access the internal translation engine (from the lang parameter, en in this case) to do the same translation you do on the user-facing side to translate, say, "Chat" on the French page, "Cat" on the English, etc.
Using an external translation API would be a good idea, but tricky to get a reliable one which works correctly in your business domain. Google have opened up a translation API, but it currently only supports a limited number of languages.
English <=> Arabic
English <=> Chinese
English <=> Russian
Take a look at Wikipedia.
They use national characters in URLs.
For example, the Russian home page URL is http://ru.wikipedia.org/wiki/Заглавная_страница. The browser transparently encodes all non-ASCII characters and replaces them with their codes when sending the URL to the server.
But on the web page all URLs are human-readable.
So you don't need to do anything special -- just put your product names into URLs as is.
The webserver should be able to decode them for your application automatically.
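To see roughly what the browser sends for such a URL, here is a small Python sketch (the path is the Russian Wikipedia example above):
from urllib.parse import quote, unquote

path = "/wiki/Заглавная_страница"
encoded = quote(path)  # UTF-8 bytes, percent-encoded: "/wiki/%D0%97%D0%B0..."
assert unquote(encoded) == path  # the server side can decode it back
print(encoded)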
I usually transliterate the non-ASCII characters. For example "täst" would become "taest". GNU iconv can do this for you (I'm sure there are other libraries):
$ echo täst | iconv -t 'ascii//translit'
taest
Alas, these transliterations are locale-dependent: in languages other than German, 'ä' could be transliterated as simply 'a', for example. But on the other hand, there should be a transliteration for every (commonly used) character set into ASCII.
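A rough Python alternative, if iconv is not available (note this simply strips the accents, so 'ä' becomes 'a' rather than 'ae' as in the iconv example):
import unicodedata

def ascii_fold(text):
    # Decompose accented characters, then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")

print(ascii_fold("täst"))  # -> tast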
How about some scheme like /productid/{product-id-number}/some-title/
where the site looks at the {product-id-number} and ignores the 'some-title' part entirely. You can put that into whatever language or encoding you like, because it's not being used.
If memory serves, you're only able to use English letters in URLs. There's a discussion to change that, but I'm fairly positive that it's not been implemented yet.
That said, you'd need to have a lookup table where you assign translations of products/titles to whatever word they'll be in the other language. For example:
foo.com/cat will need a translation look up for "cat" "gato" "neko" etc.
Then your HTTP module, which parses those human-readable segments into an exact URL, will know which page to serve based upon the translations.
Creating a lookup table for such a thing seems like overkill to me. I cannot create a lookup for all the different words in all languages. Maybe accessing a translation API would be a good idea.
So as far as I can see it's not possible to use foreign characters in the permalink directly, as the URL specs do not allow it.
What do you think of encoding the special characters? Are those URLs recognized by Google then?
