I recently came across an encoding issue specific to how Firefox encodes URLs typed directly into the address bar. It looks like the default Firefox character encoding for URLs is NOT UTF-8, unlike most other browsers. Additionally, Firefox appears to make some intelligent decisions about which character encoding to use based on the content of the URL.
For example, if you enter a URL directly into the address bar (I'm using Firefox 3.5.5) with a 'q' parameter, you will get the following results:
For the given query string parameter, this is how it's actually encoded in the http request:
1) ...q=Književni --> q=Knji%9Eevni (This appears to be iso-8859-1 encoded)
2) ...q=漢字 --> q=%E6%BC%A2%E5%AD%97 (This appears to be UTF-8 encoded)
3) ...q=Književni漢字 --> q=Knji%C5%BEevni%E6%BC%A2%E5%AD%97 (This appears to be UTF-8 encoded ... which is odd, because the first part of the value is the same as in example 1, where it was iso-8859-1 encoded.)
So, this really shouldn't be a big deal, right? Well, for me, not totally, but sort of. In the application I'm working on, we have a search box in our global navigation. When a user submits a search term in our search box, the 'q' parameter (like in our example, the parameter that holds the query string value) is submitted on the request and is UTF-8 encoded and all is well and good.
However, the URL that then appears in the address bar contains the decoded form of that URL, so the q parameter looks like "q=Književni". Now, as I mentioned before, if a user then presses the ENTER key to submit what is in the address bar, the "q=Književni" parameter is encoded as iso-8859-1 and gets sent to our server as "q=Knji%9Eevni". The problem is that we always expect a UTF-8 encoded URL, so when we receive this parameter our application does not know how to interpret it, which can cause some strange results.
As I mentioned before, this appears to ONLY be a Firefox issue, and it would be rare for a user to actually run into this scenario, so it is not too concerning for us. However, I happened to notice that Google actually handles this quite nicely. Typing in either of the following URLs (the two differently encoded forms of the query string parameter) returns nice results in Google:
http://www.google.com/search?q=Knji%C5%BEevni
http://www.google.com/search?q=Knji%9Eevni
So my question really is, how do you think they handle this scenario? Additionally, does anyone else see the same strange Firefox behavior?
Looks like Firefox is using latin-1 unless some characters can't be represented in that encoding, in which case it uses UTF-8.
If that is indeed the case, the way to get around this at the other end is to assume everything you receive is UTF-8, and validate it as UTF-8. If it fails validation as UTF-8 then assume it is latin-1 (iso-8859-1).
Due to the way UTF-8 is structured, it is highly unlikely that something that is not actually UTF-8 will pass when validated as UTF-8.
Still, the possibility exists and I don't think Firefox's behaviour is a good idea, though no doubt they have done it as a compromise - like for compatibility with servers that wouldn't know UTF-8 if they stepped in it.
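If your server side happens to be Java, here is a minimal sketch of that validate-then-fall-back approach (the class and method names are just placeholders, not anything from your application):

import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class QueryDecoder {

    // Decode raw parameter bytes as UTF-8 if they validate as UTF-8,
    // otherwise fall back to latin-1 (ISO-8859-1).
    public static String decode(byte[] raw) {
        try {
            // REPORT makes the decoder throw instead of silently
            // replacing malformed byte sequences.
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(raw))
                    .toString();
        } catch (CharacterCodingException e) {
            // Not valid UTF-8: assume the browser sent latin-1.
            return new String(raw, StandardCharsets.ISO_8859_1);
        }
    }
}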
There are several parts in a URL. The domain name is encoded according to the IDN (Internationalized Domain Name) rules (http://en.wikipedia.org/wiki/Internationalized_domain_name).
The part that you care about usually comes from a form, and the encoding of the source page determines the encoding (before the % escaping). The form element in HTML can also take an accept-charset attribute, which overrides the page setting.
So it is not the fault of Firefox; the encoding of the referring page/form is the determining factor, and that is the standard behavior.
Related
I'm using Apache JMeter 2.8 to carry out some performance testing on a web-based information system.
There are several accented letters used in different requests - like 'ä', 'ö', 'ü' or 'õ'.
When it comes to running the test scripts and executing requests, the 'ä' value in some parameter, for example, turns into 'ä'. ('ä' is how JMeter saves such a character into a *.jmx file.) Content encoding for these HTTP requests is set to UTF-8. When I look at the contents of the project, all characters are displayed correctly. When I run the test scripts, wrong values are used.
Added later:
I can successfully simulate GET requests with UTF-8 characters, but the accented characters in my POST requests still come out like 'ä'. What could be the reason why JMeter's GET request data is properly UTF-8 encoded, while POSTs use Windows-1252/ISO-8859-1/cp1252/"ANSI" instead?
Any ideas why this happens? Thanks in advance!
The characters displayed depend not only on the bytes of the input but also on what decoding the display uses to interpret them. For example, ä, when encoded as UTF-8, is the two bytes 0xC3 0xA4.
Now, what do the bytes 0xC3 0xA4 look like when displayed? That depends on what decoding is used. Here are some examples:
UTF-8: ä
Windows-1252/ISO-8859-1/cp1252/"ANSI": ä
UTF-16BE: 쎤
UTF-32: �
Mac OS Roman: √§
Windows-1251: Г¤
And so on.
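If you want to reproduce this yourself, here is a small Java snippet (purely illustrative, not JMeter-specific; it assumes the named charsets are available in your JRE) that decodes the same two bytes with different charsets:

import java.nio.charset.Charset;

public class DecodingDemo {
    public static void main(String[] args) {
        // The UTF-8 encoding of 'ä' is the two bytes 0xC3 0xA4.
        byte[] bytes = {(byte) 0xC3, (byte) 0xA4};

        // The same bytes rendered under different decodings.
        // Charset.forName throws if a charset is not available in this JRE.
        for (String name : new String[] {"UTF-8", "windows-1252", "UTF-16BE", "windows-1251"}) {
            System.out.println(name + ": " + new String(bytes, Charset.forName(name)));
        }
    }
}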
JMeter saves the characters correctly in the JMX file; make sure you open it with the right encoding (UTF-8).
In JMeter there is this property:
sampleresult.default.encoding=ISO-8859-1
which you can change if ISO-8859-1 is not the encoding you need. But I am not sure that this is the issue you are facing.
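For example, if your responses really are UTF-8, you could override it in user.properties (or jmeter.properties); this is just a guess at your setup, not necessarily the fix:

# Interpret sample results as UTF-8 instead of the ISO-8859-1 default
sampleresult.default.encoding=UTF-8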
Check the "Encode?"
Solution is:
set content encoding to UTF-8
Check Encode? in parameters table as your parameters are non ascii ones
if this does not work, it reveals an issue on tested application:
request.setCharacterEncoding("UTF-8") must be called before using parameters if it's a Java Application.
Same concepts exist for PHP and ASP.
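For the Java case, one common way to guarantee that the encoding is set before any parameter is read is a servlet filter; here is a minimal sketch (the filter class name is just a placeholder, and you would still need to register it in web.xml or with @WebFilter):

import java.io.IOException;
import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.FilterConfig;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;

public class Utf8EncodingFilter implements Filter {

    @Override
    public void init(FilterConfig filterConfig) {
    }

    @Override
    public void doFilter(ServletRequest request, ServletResponse response, FilterChain chain)
            throws IOException, ServletException {
        // Must run before request.getParameter(...) is called anywhere,
        // otherwise the container has already parsed the parameters
        // with its default encoding.
        if (request.getCharacterEncoding() == null) {
            request.setCharacterEncoding("UTF-8");
        }
        chain.doFilter(request, response);
    }

    @Override
    public void destroy() {
    }
}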
The issue got fixed by switching the HTTP request implementation field from HttpClient4 to HttpClient3.1 and leaving the HTTP Request Content encoding value empty :)
There might be a JMeter bug related to using HttpClient4.
Objective: to have multi-language characters in the user ID in Enovia v6.
I am using UTF-8 encoding in the Tcl script, and it seems to save multi-language characters properly in the database (after some conversion). But in the UI I literally see the raw saved value from the database.
When doing the same exercise through Power Web, the saved data somehow gets converted back into the proper multi-language characters and displays properly.
Am I missing something in the Tcl approach?
Pasting one example to help understand better.
Original Name: Kátai-Pál
Name saved in database as: Kátai-Pál
In the UI I see the name as: Kátai-Pál
In Tcl I use the syntax below:
set encoded [encoding convertto utf-8 "Kátai-Pál"]
Now the user name becomes: Kátai-Pál
In the UI I see the name as “Kátai-Pál”
The trick is to think in terms of characters, not bytes. They're different things. Encodings are ways of representing characters as byte sequences (internally, Tcl's really quite complicated, but you shouldn't ever have to care about that if you're not developing Tcl's implementation itself; suffice to say it's Unicode). Thus, when you use:
encoding convertto utf-8 "Kátai-Pál"
You're taking a sequence of characters and asking for the sequence of bytes (represented in Tcl as a string with one character per byte) that encodes those characters in the given encoding (UTF-8).
What you need to do is to get the database integration layer to understand what encoding the database is using so it can convert back into characters for you (you can only ever communicate using bytes; everything else is just a simplification). There are two ways that can happen: either the information is correctly shared (via metadata or defined convention), or both sides make assumptions which come unstuck occasionally. It sounds like the latter is what's happening, alas.
If you can't handle it any other way, you can take the bytes produced out of the database layer and convert into characters:
encoding convertfrom $theEncoding $theBytes
Working out what $theEncoding should be is in general very tricky, but it sounds like it's utf-8 for you. Once you've got characters, Tcl/Tk will be able to display them correctly; it knows how to transfer them correctly into the guts of the platform's GUI. (And in scripts that you actually write, you're best off replacing non-ASCII characters with their \uXXXX escapes, because platforms don't agree on what encoding is right to use for scripts. Alas.)
So, I'm trying to do some screen scraping off of a certain site using nokogiri, but the site owners failed to specify the proper encoding of the page in a <meta> tag. The upshot of this is that I'm trying to deal with strings that think they're utf-8, but really aren't.
(If you care, here are the files I was using to test this:
main file: http://dpaste.de/nif5/
ann.html: http://dpaste.de/YsLM/
ann2.html: http://dpaste.de/Lofi/
ann3.html: http://dpaste.de/R21j/
a-p.html: http://dpaste.de/O9dy/
output: http://dpaste.de/WdXc/
)
After doing a lot of searching around (this SO question was particularly useful), I found that calling encode('iso-8859-1', 'utf-8') on that test string "works", in that I get a proper © symbol. The issue now is that there are other characters in some of the strings I want that really do not survive being converted to the latin encoding (Shōta, for instance, turns into Sh�\x8Dta).
Now, I'm probably going to bother the appropriate webmasters and try and get them to fix their damn encodings, but in the meantime, I'd like to be able to use the bytes that I've got. I'm fairly certain that there is a way, but I just can't for the life of me figure out what it is.
Those pages appear to be correctly encoded as UTF-8. That's how my browser sees them, and when I view their source and tell the editor to decode them as UTF-8, they look fine. The only problem I see is that some copyright symbols seem to have been corrupted before (or as) they were added to the content. The o-macron and other non-ASCII letters come through just fine.
I don't know if you're aware of this, but the proper way to notify clients of a page's encoding is through a header. Pages may include that information in <meta> tags, but that's neither required nor expected; browsers typically ignore such tags if the header is present.
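For reference, that header is the Content-Type response header with a charset parameter, for example:

Content-Type: text/html; charset=utf-8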
Since your pages are XHTML, they could also embed the encoding information in an XML processing instruction, but again, they're not required to. But it also means you could have Nokogiri treat them as XML instead of HTML, in which case I would expect it to use UTF-8 by default. But I'm not familiar with Nokogiri, so I can't be sure. And anyway, the header is still the final authority.
So, the issue is that ANN only specifies the encoding via headers, and Nokogiri doesn't receive the headers from the open() function. So Nokogiri guesses that the page is latin-1 encoded and produces strings that we can't reliably reverse to recover the original characters.
You can specify the encoding to Nokogiri as the 3rd parameter to Nokogiri::HTML(), which solves the issue I was initially trying to solve. So, I'll accept this answer, even though the more specific question I asked (how to get those non-latin characters out of a latin string) is unanswerable.
I am enhancing a piece of software that implements a simple ASCII based protocol.
The protocol is simple... here is an example of roughly what the messages look like (not the real messages, though; I can't show you the actual protocol):
AUTH 1 1 200<CR><LF>
To which we get a response looking similar to
230 DEVICE 1 STATE AUTH 200 OUTPUT 1 NAME "Photo Black"<CR><LF>
The name "Photo Black" comes from a database sqlite database. I need to enhance it to support foreign languages. So I've been thinking that the field "Photo Black" needs to be "optionally" encoded as a UTF-8 string between the quotes. I'm wondering if there is a standard for this so that the client application can interpret the string in the quotes and straight away recognize it as either UTF-8 or plain ASCII. I'm not willing to rewrite the protocol, that would be too much work. Just slip in some kind of encoding for clients to recognize some Spanish or Swedish names.
I don't want the field to always be interpreted as UTF-8 either; long story there. You know how in C++ I can type 0xFF and the compiler knows it's a hex literal... is there an equivalent marker for UTF-8? Sorry if I'm jumping the gun, but I'm not that familiar with UTF-8 encoding or internationalization in general.
Do you have control over both the server and the client? If not, you can't change the protocol, so you won't be able to do it. And when you say you're "not willing to rewrite the protocol", you're going to have to do so at least to some extent. Whatever you do, you will be changing the protocol.
I'm not sure why you wouldn't want to always interpret the data as UTF-8 either - if it's currently only ASCII, then it would be completely backward compatible to always interpret it as UTF-8, as all ASCII is encoded the same way in UTF-8. Perhaps if you could give more information, we could provide more help.
You could introduce a prefix for UTF-8-encoded strings, e.g. U:
230 DEVICE 1 STATE AUTH 200 OUTPUT 1 NAME U"Photo UTF-8 stuff here Black"<CR><LF>
would that help?
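If it would, here is a sketch of how the client side could interpret such a field (Java purely for illustration; the field extraction is simplified, and the U prefix is just the convention suggested above):

import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class NameFieldDecoder {

    // Decode the quoted NAME value from a response line.
    // A leading U before the opening quote marks the bytes between the
    // quotes as UTF-8; otherwise they are treated as plain ASCII.
    public static String decodeName(byte[] quotedField) {
        boolean utf8 = quotedField.length > 0 && quotedField[0] == 'U';
        int start = utf8 ? 2 : 1;                 // skip U" or "
        int end = quotedField.length - 1;         // drop the trailing "
        byte[] value = Arrays.copyOfRange(quotedField, start, end);
        return new String(value, utf8 ? StandardCharsets.UTF_8 : StandardCharsets.US_ASCII);
    }
}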
Do you actually have an 8-bit data path? If something is going to mangle the top bit of every byte, then you'll need to consider options like Punycode instead of UTF-8.
Read up on the concept of ASCII-Compatible Encoding, or ACE. The Punycode form used by IDNA (internationalized domain names) is an example. So is/was UTF-7.
Here's the master speaking.
You really can't code-switch in and out of UTF-8. For a nightmare, look up ISO-2022, which attempted to support that sort of thing. Also keep in mind that UTF-8 includes ASCII, but not Latin-1.
Why don't you want the field to be "always interpreted as UTF-8"? You don't say.
If you do have the client interpret the protocol as UTF-8 encoded text, all of the existing output will still work correctly, since UTF-8 is a proper superset of ASCII.
Generally we use the UTF-8 encoding standard when sending requests, for every language. But for some languages this encoding does not work properly, and in those cases we use ISO-8859-1.
You can use any encoding you want. However, from your question it sounds like you're typically using UTF-8, but sometimes getting data from somewhere else that comes in with a different encoding (e.g., Internet Explorer tends to send data to the web server using ISO-8859-1).
If you're going to serve up UTF-8 encoded text, and you get non-UTF-8 encoded text from somewhere, you have to convert that to UTF-8 before you send it down the line. Probably a good practice is to automatically sanitize all data received from the web browser and re-encode it as UTF-8. Unfortunately the browser doesn't always tell you what encoding it's using; if it's not supplied you can probably assume it's UTF-8 or ISO-8859-1.
If you're using a server side language, you're going to want to look into how to convert encodings with that language. For example, PHP has iconv() function calls, and a very nice function mb_detect_encoding($text) which will do a pretty decent job of guessing what the encoding is for a given bit of data when you don't already know.
Something like this would be in order (presuming PHP serverside):
$text = iconv(mb_detect_encoding($text), 'UTF-8', $text);
Do this with all user input before you do anything else with it (eg, use array_map to automatically convert user inputs):
function convert_to_utf8($text) {
    // Guess the source encoding, then convert the text to UTF-8.
    return iconv(mb_detect_encoding($text), 'UTF-8', $text);
}
// Re-encode every incoming request parameter as UTF-8.
$_GET = array_map('convert_to_utf8', $_GET);
$_POST = array_map('convert_to_utf8', $_POST);
Best yet would be to determine if the browser is supplying an encoding, and use that as the first argument to iconv() instead of mb_detect_encoding.
This is a rather vague question.
If you mean to ask, "what is encoding in AJAX?" then the answer is that AJAX is not an encoding, it is a method of client-server communication.
If you meant to ask, "what encoding does AJAX use?" then the answer is that AJAX responses can use whatever encoding you want, but it should typically match the encoding of the HTML page that made the request.