what is encoding in Ajax? - ajax

We generally use the UTF-8 encoding standard when sending requests, whatever the language.
For some languages, though, this encoding does not work properly, and in those cases we use
ISO-8859-1.

You can use any encoding you want. However, from your question it sounds like you're typically using UTF-8, but sometimes you're getting data from somewhere that comes in with a different encoding (eg, Internet Explorer tends to send data to the web server using ISO-8859-1).
If you're going to serve up UTF-8 encoded text, and you get non-UTF-8 encoded text from somewhere, you have to convert that to UTF-8 before you send it down the line. Probably a good practice is to automatically sanitize all data received from the web browser and re-encode it as UTF-8. Unfortunately the browser doesn't always tell you what encoding it's using; if it's not supplied you can probably assume it's UTF-8 or ISO-8859-1.
If you're using a server side language, you're going to want to look into how to convert encodings with that language. For example, PHP has iconv() function calls, and a very nice function mb_detect_encoding($text) which will do a pretty decent job of guessing what the encoding is for a given bit of data when you don't already know.
Something like this would be in order (presuming PHP serverside):
$text = iconv(mb_detect_encoding($text), 'UTF-8', $text);
Do this with all user input before you do anything else with it (eg, use array_map to automatically convert user inputs):
function convert_to_utf8($text) {
    return iconv(mb_detect_encoding($text), 'UTF-8', $text);
}
$_GET = array_map('convert_to_utf8', $_GET);
$_POST = array_map('convert_to_utf8', $_POST);
Best yet would be to determine if the browser is supplying an encoding, and use that as the first argument to iconv() instead of mb_detect_encoding.

This is a rather vague question.
If you mean to ask, "what is encoding in AJAX?" then the answer is that AJAX is not an encoding, it is a method of client-server communication.
If you meant to ask, "what encoding does AJAX use?" then the answer is that AJAX responses can use whatever encoding you want, but it should typically match the encoding of the HTML page that made the request.

Related

Ruby 1.9 iso-8859-8-i encoding

I'm trying to create a piece of code that will download a page from the internet and do some manipulation on it. The page is encoded in iso-8859-1.
I can't find a way to handle this file. I need to search through the file in Hebrew and return the changed file to the user.
I tried to use string.encode, but I still get the wrong encoding.
When printing the response encoding, I get "encoding":{} as if it's undefined, and this is an example of what it returns:
\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd \ufffd\ufffd-\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd \ufffd\ufffd\ufffd\ufffd
It should be Hebrew letters.
When I try with final.body.encode('iso-8859-8-i'), I get the error code converter not found (ASCII-8BIT to iso-8859-8-i).
When Ruby or the OS has assigned the wrong encoding to your input, conversions will not work. That's because Ruby starts from the wrong assumption and tries to preserve the wrong characters when converting.
However, if you know from some other source what the correct encoding is, you can use force_encoding method to tell Ruby how to interpret the bytes it has loaded into a String. Note this alters the object in place.
E.g.
contents = final.body
contents.force_encoding( 'ISO-8859-8' )
puts contents
At this point (provided it works), you now can make conversions (to e.g. UTF-8), because Ruby has been correctly told what characters it is dealing with.
I could not find 'ISO-8859-8-I' on my version of Ruby. I am not sure yet how close 'ISO-8859-8' is to what you need (some Googling suggests that it may be OK for you, if the ...-I encoding is not available).
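A minimal sketch of the relabel-then-convert sequence described above, assuming the body really is ISO-8859-8. The helper name body_to_utf8 and the sample bytes are made up for illustration; the four bytes spell a short Hebrew word in ISO-8859-8.

```ruby
# Hypothetical helper: relabel the raw bytes with the encoding we know
# from elsewhere, then transcode to UTF-8. Works on a copy so the
# caller's string is left untouched.
def body_to_utf8(body, source_encoding = 'ISO-8859-8')
  body.dup.force_encoding(source_encoding).encode('UTF-8')
end

# Sample bytes tagged as binary (ASCII-8BIT), the way an HTTP body
# without a known charset may arrive.
raw = "\xF9\xEC\xE5\xED".b

begin
  raw.encode('UTF-8') # Ruby cannot know what the high bytes mean
rescue Encoding::UndefinedConversionError => e
  puts "direct conversion failed: #{e.message}"
end

text = body_to_utf8(raw)
puts text.encoding
puts text
```

Searching the result for Hebrew text then works as usual, since both the page and the search term are UTF-8 strings.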

JSON encoding issue with Ruby 1.9 and HTTParty

I've created a WebAPI that returns JSON.
The initial data is as follows (UTF-8 encoded):
#text="Rosenborg har ikke h\xC3\xB8rt hva Steffen"
Then with a .to_json on my object, here is what is sent by the API (I think it is ISO-8859-1 encoding) :
"text":"Rosenborg har ikke h\ufffd\ufffdrt hva Steffen"
I'm using HTTParty on the client side, and that's what I finally get :
"text":"Rosenborg har ikke h��rt hva"
Both WebAPI and client app are using Ruby 1.9.2 and Rails 3.
I'm a bit lost with this encoding issue... I tried to add the utf8 encoding header to my ruby files but it didn't change anything.
I guess that I'm missing an encoding / decoding part somewhere... anyone has an idea?
Thank you very much !!!
Vincent
In Ruby 1.9, encoding is explicit now. However, Rails may or may not be configured to send the responses in the encoding you expect. You'll have to set the global configuration setting:
Encoding.default_external = "utf-8"
I believe the encoding that Ruby uses by default for serialization is the platform default. On Windows in America that would be Windows-1252 (code page 1252). Other countries would have a different default.
Edit: Also see this url if the json is executed against MySQL: https://rails.lighthouseapp.com/projects/8994/tickets/5210-encoding-problem-in-json-format-response
Edit 2: Rails core and its suite of libraries (ActiveRecord, et al.) will respect the Encoding.default_external configuration setting, which determines the encoding of all the values it sends. Unfortunately, because encoding is a relatively new concept in Ruby, not every 3rd party library has been adjusted for proper encoding. The ones that have may require additional configuration settings of their own. This includes MySQL, and the RSolr library you were using.
In all versions of Ruby before the 1.9 series, a string was just an array of bytes. When you've been thinking like that for so long, it's hard to wrap your head around the concept of multiple string encodings. The thing that is even more confusing now is that unlike Java, C#, and other languages that use some form of UTF as the native string format, Ruby allows each string to be encoded differently. In retrospect, that might be a mistake, but at least now they are respecting encoding.
The String#force_encoding method is designed to treat the byte sequence as if it were in the new encoding, but it does not change any of the underlying data. So it is possible to end up with invalid byte sequences. There is another method called .encode() that transforms the bytes from one encoding to another and guarantees valid byte sequences. For more information read this:
http://blog.grayproductions.net/articles/ruby_19s_string
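To make the difference concrete, here is a small sketch using the word "hørt" from the question. The variable names are just for illustration.

```ruby
original = "h\u00F8rt" # "hørt" in UTF-8: bytes 68 C3 B8 72 74

# force_encoding only changes the label; the bytes stay the same,
# so the two UTF-8 bytes of "ø" now count as two Latin-1 characters.
relabelled = original.dup.force_encoding('ISO-8859-1')
puts relabelled.bytes == original.bytes # same bytes
puts original.length                    # 4 characters
puts relabelled.length                  # 5 characters

# encode rewrites the bytes so the same characters are expressed
# in the target encoding ("ø" becomes the single byte F8).
transcoded = original.encode('ISO-8859-1')
puts transcoded.bytes.map { |b| format('%02X', b) }.join(' ')

# Relabelling Latin-1 bytes as UTF-8 produces an invalid string.
broken = "h\xF8rt".dup.force_encoding('UTF-8')
puts broken.valid_encoding?
```

Checking valid_encoding? after a force_encoding call is a cheap way to catch a wrong guess early.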
Ok, I finally found out what the problem is...
I'm using RSolr to get my data from Solr, and by default encoding for all results is unfortunately 'US-ASCII' as mentioned here (and checked by myself) :
http://groups.google.com/group/rsolr/browse_thread/thread/2d4890fa7737e7ef#
So you need to force the encoding as follows:
my_string.force_encoding(Encoding::UTF_8)
There may be a nice encoding option to provide to RSolr!

ruby string encoding

So, I'm trying to do some screen scraping off of a certain site using nokogiri, but the site owners failed to specify the proper encoding of the page in a <meta> tag. The upshot of this is that I'm trying to deal with strings that think they're utf-8, but really aren't.
(If you care, here are the files I was using to test this:
main file: http://dpaste.de/nif5/
ann.html: http://dpaste.de/YsLM/
ann2.html: http://dpaste.de/Lofi/
ann3.html: http://dpaste.de/R21j/
a-p.html: http://dpaste.de/O9dy/
output: http://dpaste.de/WdXc/
)
After doing a lot of searching around (this SO question was particularly useful), I found that calling encode('iso-8859-1', 'utf-8') on that test string "works", in that I get a proper © symbol. The issue now is that there are other characters in some other strings I want that really do not work at being converted to latin encoding (Shōta, for instance, turns into Sh�\x8Dta).
Now, I'm probably going to bother the appropriate webmasters and try and get them to fix their damn encodings, but in the meantime, I'd like to be able to use the bytes that I've got. I'm fairly certain that there is a way, but I just can't for the life of me figure out what it is.
Those pages appear to be correctly encoded as UTF-8. That's how my browser sees them, and when I view their source and tell the editor to decode them as UTF-8, they look fine. The only problem I see is that some copyright symbols seem to have been corrupted before (or as) they were added to the content. The o-macron and other non-ASCII letters come through just fine.
I don't know if you're aware of this, but the proper way to notify clients of a page's encoding is through a header. Pages may include that information in <meta> tags, but that's neither required nor expected; browsers typically ignore such tags if the header is present.
Since your pages are XHTML, they could also embed the encoding information in an XML processing instruction, but again, they're not required to. But it also means you could have Nokogiri treat them as XML instead of HTML, in which case I would expect it to use UTF-8 by default. But I'm not familiar with Nokogiri, so I can't be sure. And anyway, the header is still the final authority.
So, the issue is that ANN only specifies encoding via headers, and Nokogiri doesn't receive the headers from the open() function. So, Nokogiri guesses that the page is latin-encoded, and produces strings that we really can't reverse to get back the original characters from.
You can specify the encoding to Nokogiri as the 3rd parameter to Nokogiri::HTML(), which solves the issue I was initially trying to solve. So, I'll accept this answer, even though the more specific question I asked (how to get those non-latin characters out of a latin string) is unanswerable.

How do you think Google is handling this encoding issue?

I recently came across an encoding issue specific to how Firefox encodes URLs entered directly into the address bar. It looks like Firefox's default character encoding for URLs is NOT UTF-8, unlike most browsers. Additionally, it looks like Firefox tries to make some intelligent decisions about which character encoding to use, based on the content of the URL.
For example, if you enter a URL directly into the address bar (I'm using Firefox 3.5.5) with a 'q' parameter, you will get the following results:
For the given query string parameter, this is how it's actually encoded in the http request:
1) ...q=Književni --> q=Knji%9Eevni (This appears to be iso-8859-1 encoded)
2) ...q=漢字 --> q=%E6%BC%A2%E5%AD%97 (This appears to be UTF-8 encoded)
3) ...q=Književni漢字 --> Knji%C5%BEevni%E6%BC%A2%E5%AD%97 (This appears to be UTF-8 encoded ... which is odd, because notice that the first part of the value is the same as 1, which was iso-8859-1 encoded).
So, this really shouldn't be a big deal, right? Well, for me, not totally, but sort of. In the application I'm working on, we have a search box in our global navigation. When a user submits a search term in our search box, the 'q' parameter (like in our example, the parameter that holds the query string value) is submitted on the request and is UTF-8 encoded and all is well and good.
However, the URL that then appears in the address bar contains the decoded form of that URL, so the q parameter looks like "q=Književni". Now, as I mentioned before, if a user then presses the ENTER key to submit what is in the address bar, the "q=Književni" parameter is now encoded to iso-8859-1 and gets sent to our server as "q=Knji%9Eevni". The problem with this is that we are always expecting a UTF-8 encoded URL ... so when we receive this parameter our application does not know how to interpret it and it can cause some strange results.
As I mentioned before, this appears to ONLY be a Firefox issue, and it would be rare that a user would actually run into this scenario, so it is not too concerning for us. However, I happened to notice that Google actually handles this quite nicely. Typing in the following URL using either of the differently encoded forms of the query string parameter will return nice results in Google:
http://www.google.com/search?q=Knji%C5%BEevni
http://www.google.com/search?q=Knji%9Eevni
So my question really is, how do you think they handle this scenario? Additionally, does anyone else see the same strange Firefox behavior?
It looks like Firefox uses latin-1 unless some character can't be represented in that encoding, in which case it uses UTF-8.
If that is indeed the case, the way to get around this at the other end is to assume everything you receive is UTF-8, and validate it as UTF-8. If it fails validation as UTF-8 then assume it is latin-1 (iso-8859-1).
Due to the way UTF-8 is structured, it is highly unlikely that something that is not actually UTF-8 will pass when validated as UTF-8.
Still, the possibility exists and I don't think Firefox's behaviour is a good idea, though no doubt they have done it as a compromise - like for compatibility with servers that wouldn't know UTF-8 if they stepped in it.
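The validate-then-fall-back approach can be sketched roughly like this. The method name decode_query_value is hypothetical. Note one assumption that differs from the text above: the fallback here is Windows-1252 rather than strict ISO-8859-1, because byte 0x9E is "ž" in Windows-1252 but a control character in ISO-8859-1.

```ruby
# Hypothetical helper: percent-decode a query value, keep it if the
# bytes are valid UTF-8, otherwise reinterpret them in a fallback
# single-byte encoding and convert to UTF-8.
def decode_query_value(raw, fallback = 'Windows-1252')
  # Work on a binary copy so each %XX becomes exactly one raw byte.
  bytes = raw.b.gsub(/%\h\h/) { |escape| escape[1, 2].hex.chr }

  as_utf8 = bytes.dup.force_encoding('UTF-8')
  return as_utf8 if as_utf8.valid_encoding?

  bytes.force_encoding(fallback).encode('UTF-8')
end

# The two forms of the same search term from the question:
from_utf8  = decode_query_value('Knji%C5%BEevni')
from_latin = decode_query_value('Knji%9Eevni')
puts from_utf8
puts from_latin
puts from_utf8 == from_latin
```

This is only a guess at what a server could do; nothing here is known about how Google actually handles it.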
There are several parts in a url. The domain name is encoded according to the IDN (International Domain Names) rules (http://en.wikipedia.org/wiki/Internationalized_domain_name).
The part that you care about comes (usually) from a form. And the encoding of the source page determines the encoding (before the % escaping). The form element in html can also take an encoding attribute (accept-charset) which overrides the page setting.
So it is not the fault of Firefox, the encoding of the referrer page/form is the determining factor. And that is the standard behavior.

Enhancing an ASCII protocol with multilingual fields

I am enhancing a piece of software that implements a simple ASCII based protocol.
The protocol is simple... here is an example of what the messages look a little bit like (not the same though, I can't show you the real protocol):
AUTH 1 1 200<CR><LF>
To which we get a response looking similar to
230 DEVICE 1 STATE AUTH 200 OUTPUT 1 NAME "Photo Black"<CR><LF>
The name "Photo Black" comes from a SQLite database. I need to enhance it to support foreign languages. So I've been thinking that the field "Photo Black" needs to be "optionally" encoded as a UTF-8 string between the quotes. I'm wondering if there is a standard for this so that the client application can interpret the string in the quotes and straight away recognize it as either UTF-8 or plain ASCII. I'm not willing to rewrite the protocol, that would be too much work. Just slip in some kind of encoding for clients to recognize some Spanish or Swedish names.
I don't want the field to be always interpreted as UTF-8 either, long story there. You know how in C++ I can type 0xFF and the compiler knows that this is a hex string... is there an equivalent for UTF-8? Sorry I may be jumping the gun but I'm not that familiar with UTF-8 encoding and internationalization in general.
Do you have control over both the server and the client? If not, you can't change the protocol so you won't be able to do it. When you say you're "not willing to rewrite the protocol" - you're going to have to do so at least to some extent. Whatever you do, you will be changing the protocol.
I'm not sure why you wouldn't want to always interpret the data as UTF-8 either - if it's currently only ASCII, then it would be completely backward compatible to always interpret it as UTF-8, as all ASCII is encoded the same way in UTF-8. Perhaps if you could give more information, we could provide more help.
You could introduce a prefix for UTF-8-encoded strings, e.g. U:
230 DEVICE 1 STATE AUTH 200 OUTPUT 1 NAME U"Photo UTF-8 stuff here Black"<CR><LF>
would that help?
Do you actually have an 8-bit data path? If something is going to mangle the top bit of every byte, then you'll need to consider options like Punycode instead of UTF-8.
Read up on the concept of ASCII Compatible Encoding, or ACE. IDN is an example. So is/was UTF-7.
Here's the master speaking.
You really can't code-switch in and out of UTF-8. For a nightmare, look up ISO-2022, which attempted to support that sort of thing. Also keep in mind that UTF-8 includes ASCII, but not Latin-1.
Why don't you want the field to be "always interpreted as UTF-8"? You don't say.
If you do have the client interpret the protocol as UTF-8 encoded text, all of the existing output will still work correctly, since UTF-8 is a proper superset of ASCII.
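A small sketch of why this is backward compatible, using the example response line from the question. The helper name_field and the Spanish sample name are made up for illustration and are not part of the real protocol.

```ruby
# An existing, pure-ASCII response line.
line = "230 DEVICE 1 STATE AUTH 200 OUTPUT 1 NAME \"Photo Black\"\r\n"

# ASCII text has exactly the same bytes whether it is labelled
# US-ASCII or UTF-8, so old messages need no change at all.
puts line.encode('US-ASCII').bytes == line.encode('UTF-8').bytes

# Hypothetical client-side helper: always treat the received bytes
# as UTF-8, reject anything that is not valid, and pull out NAME.
def name_field(raw_line)
  text = raw_line.dup.force_encoding('UTF-8')
  raise ArgumentError, 'line is not valid UTF-8' unless text.valid_encoding?

  text[/NAME "([^"]*)"/, 1]
end

# Bytes as they would come off the socket (binary), old and new style.
old_style = line.b
new_style = "230 DEVICE 1 STATE AUTH 200 OUTPUT 1 NAME \"Negro fotogr\u00E1fico\"\r\n".b

puts name_field(old_style)
puts name_field(new_style)
```

The only client-side change is decoding the quoted field as UTF-8; the command words, numbers, quotes, and line terminators are untouched because no UTF-8 multi-byte sequence contains a byte below 0x80.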
