Displaying partially Unicode-encoded data via AJAX/innerHTML

I am trying to get some data from the server via an AJAX call and then display the result using responseDiv.innerHTML. The data from the server comes partially encoded with Unicode escape sequences, like: za\u010Dat test. When I set the innerHTML of the response div, this is displayed as is. That is, the Unicode escape is not converted to the actual character in the browser.
The charset of the containing page is set to UTF-8. I have tried most other things, like converting the unicode representation to HTML entities, but that doesn't seem to work either.
I should also mention that the text coming from the server has HTML tags intermixed as well. The HTML tags are honored as they should be. For example, if the text from the server comes as <b>Bold this!</b>, the text is bolded.
Any help appreciated.
Vikram

Can you replace '\u010D' with 'č'?
AFAIK the HTML tags coming from the server should work if you are setting the innerHTML.
This works for me:
document.getElementById('info').innerHTML = "č <b>Bold this</b>";
BTW - you can use something like Fiddler or Firebug to ensure you are getting what you expect from the server.
Update: use regular expressions to find and replace the unicode characters with HTML entities:
$.get('data.txt', function(data) {
data = data.replace(/\\u([0-9a-fA-F]{4})/g, '&#x$1;');
document.getElementById('info').innerHTML = data;
});

Just convert the unicode literals to characters directly:
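// The backslashes are doubled so the string really contains the text \u0065,
// and the g flag replaces every escape rather than only the first one.
'H\\u0065\\u006Clo, world!'.replace(/\\u([0-9a-fA-F]{4})/g, function (match, hex) {
return String.fromCharCode(parseInt(hex, 16));
});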
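As a minimal sketch of the same idea wrapped in a reusable function: the name decodeUnicodeEscapes and the sample string are made up for illustration, and it assumes the server really sends the six literal characters \u010D rather than JSON that the client parses anyway.

```javascript
// Hypothetical helper: turns literal \uXXXX sequences into the characters they name.
function decodeUnicodeEscapes(text) {
  return text.replace(/\\u([0-9a-fA-F]{4})/g, function (match, hex) {
    return String.fromCharCode(parseInt(hex, 16));
  });
}

// In a JS string literal the backslash must itself be escaped
// to produce the literal text \u010D that the server sends.
const fromServer = 'za\\u010Dat <b>test</b>';
const decoded = decodeUnicodeEscapes(fromServer);
console.log(decoded);
```

The decoded string can then be assigned to innerHTML as before; the HTML tags in it are left untouched.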

Related

How can I save German characters from Classic ASP to SQL Database [duplicate]

So I was having an issue with converting French characters correctly. Basically, I have a form which sends data to an SQL Database. Then, on another page, data from this DB is retrieved and displayed to the user. But the data (strings) were being displayed with weird corrupt characters because the input in the form on the other page was in French. I overcame this problem by using the following function, which converts a string to the correct charset. HOWEVER, obviously the better solution is to convert it FIRST and then send it to the database. Now here's the code to convert a string retrieved from a DB to the appropriate charset:
Function ConvertFromUTF8(sIn)
Dim oIn: Set oIn = CreateObject("ADODB.Stream")
oIn.Open
oIn.CharSet = "Windows-1252"
oIn.WriteText sIn
oIn.Position = 0
oIn.CharSet = "UTF-8"
ConvertFromUTF8 = oIn.ReadText
oIn.Close
End Function
I got this function from here: Classic ASP - How to convert a UTF-8 string to UCS-2?
Now my question is, what function do I use to convert strings beforehand and then send them to the database, so that when I retrieve them they will be good-to-go?
Tried Paul's Method:
So there's page 1 and page 2. Page 1 contains a form which, when submitted, sends the string to the DB, which is then retrieved in page 2. I tried Paul's solution by removing the function ConvertFromUTF8 and leaving the output as it was before (it returned weird garbled characters). After that, I added the following line at the top of Page 1 as well as Page 2.
<%@LANGUAGE="VBSCRIPT" CODEPAGE="65001"%>
I also have the following on both of the pages:
Response.CodePage = 65001
Response.CharSet = "UTF-8"
But it didn't work :(
Edit: it works!, thank you so much everyone for your help!
All I needed to do was add "CodePage = 65001" on top of Page 3 (which I didn't even talk about), where the writing to the DB part was happening.
Paul's answer isn't wrong but it is not the only part to consider:
You will need to go through each of these steps to make sure that you are getting consistent results:
IMPORTANT: These steps have to be performed on each and every page in your web application or you will have problems (emphasized by Paul's comment).
Each page needs to be saved using UTF-8 encoding. Double-check this, as some IDEs will default to Windows-1252 (also often misnamed as "ANSI").
Each page will need the following line added as the very first line in the page. To make this easier, I put it along with some other values in an include file so I can include them in each page as I go.
Include File - page_encoding.asp
<%@Language="VBScript" CodePage = 65001 %>
<%
Response.CharSet = "UTF-8"
Response.CodePage = 65001
%>
Usage in the top of an ASP page (prefer to put in a config folder at the root of the web)
<!-- #include virtual="/config/page_encoding.asp" -->
Response.Charset = "UTF-8" is the equivalent of setting the ;charset in the HTTP content-type header.
Response.CodePage = 65001 tells ASP to process all dynamic strings as UTF-8.
Include files in the page will also have to be saved using UTF-8 encoding (double-check these also).
Follow these steps and your page will work. Your problem at the moment is that some pages are being interpreted as Windows-1252 while others are being treated as UTF-8, so you end up with a mismatch in encoding.
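To see what such a mismatch does to the bytes, here is a small sketch in JavaScript rather than VBScript. It assumes a Node.js runtime (Buffer is a Node API), and it uses Node's 'latin1' encoding as a stand-in for Windows-1252; the two agree for every byte outside the 0x80-0x9F range, which covers this example.

```javascript
// Assumes Node.js: Buffer is not available in browsers.
const original = '\u00E9'; // é

// Correct: the character is stored as UTF-8 (bytes C3 A9).
const utf8Bytes = Buffer.from(original, 'utf8');

// Mismatch: the same two bytes are read back as a single-byte charset,
// so each byte becomes its own character.
const misread = utf8Bytes.toString('latin1');

// Repair: write the mangled text out as single bytes, then read it as UTF-8.
// This mirrors what the ConvertFromUTF8 function does with ADODB.Stream.
const repaired = Buffer.from(misread, 'latin1').toString('utf8');

console.log(JSON.stringify(misread), JSON.stringify(repaired));
```

The repair step only papers over the problem after the fact; making every page agree on UTF-8, as described above, avoids the mismatch in the first place.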
Normally - and that word has a veryyyyy long stretch - you do not need to convert by hand; it is even discouraged. At the top of your ASP page you write:
<%@LANGUAGE="VBSCRIPT" CODEPAGE="65001"%>
That tells ASP to send and to receive (from the server's point of view) UTF-8. Furthermore, it instructs the interpreter to use 2-byte strings. So when writing to or reading from a database, everything happens automatically: whether your database uses 1-byte char or 2-byte nchar columns, the conversions are taken care of. And that's about it. You can test whether all goes well with this set:
áäÇçéčëíďńóöçÖöÚü
This set contains some 'European' but also some 'Unicode' chars... those Unicode will always fail if you use codepage 1252, so it's a nice test set.

How Do Validators Differentiate Between '&' and '&amp'?

Knowing that &amp; is the html entity value of & - how do validators like w3c know this? Even when I look at my source code it's already been parsed into the correct value.
Your question is based on a false premise -- as Co_42 noted, &amp; is not the "ASCII value" of '&'. It's an HTML character reference representing the character '&'. The ASCII value of '&' is 38 (or 0x26).
Your source code almost certainly consists of ASCII or Unicode text files. Those don't use HTML entities. If you have a string with an ampersand stored in the source code, it'll probably be stored with a bare "&". If there's a string literal somewhere containing actual HTML data, it may contain "&amp;".
When you use some sort of tool or function to convert strings to text ready to put into an HTML or XML document, any "&" will be (should be!) converted into "&amp;".
When a program that reads HTML documents encounters an ASCII "&", it can assume that that's the beginning of an HTML character reference. This is okay because all ampersands in the actual text should have been converted into "&amp;".
As a somewhat perverse example, if you open your source code in a word processor and save it as an HTML document, you'll find that in the actual file, "&" has been converted into "&amp;" (and "&amp;" has been converted into "&amp;amp;"). If you then open that document in a browser, you'll find that the ampersands are displayed the same way they are when you view your source code in a text editor. The encoding step that happened when you saved the HTML document corresponds to the decoding step that happens when the browser displays it.
If you put something like "Fish & chips" directly into an actual HTML document, your HTML document will be invalid. Complicating the matter is the fact that programs such as browsers tend to try to recover from errors in documents and display them anyway. As such, your browser may still display "Fish & chips" on the screen when you open your invalid document. However, a program such as the W3C validator, which is specifically meant to discover errors in HTML documents, will notify you that your document is invalid.
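The encoding step described above can be sketched as a small function. The name escapeHtml is hypothetical, and this covers only the four characters that matter most in text and double-quoted attribute values; it is an illustration, not a replacement for a library's escaping routine.

```javascript
// Hypothetical helper: prepares plain text for insertion into an HTML document.
function escapeHtml(text) {
  return text
    .replace(/&/g, '&amp;')   // must run first, or it would re-escape the entities added below
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;');
}

console.log(escapeHtml('Fish & chips'));
console.log(escapeHtml('&amp;'));
```

The second call shows why the scheme is unambiguous: text that already looks like a character reference gets its ampersand escaped again, so the reader of the HTML always recovers exactly what was written.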

Convert to unicode from UTF-8 [duplicate]

I try to convert a UTF8 string to a Java Unicode string.
String question = request.getParameter("searchWord");
byte[] bytes = question.getBytes();
question = new String(bytes, "UTF-8");
The input is Chinese characters, and when I compare the hex code of each character it is the same Chinese character. So I'm pretty sure that the charset is UTF-8.
Where do I go wrong?
There's no such thing as a "UTF-8 string" in Java. Everything is in Unicode.
When you call String.getBytes() without specifying an encoding, that uses the platform default encoding - that's almost always a bad idea.
You shouldn't have to do anything to get the right characters here - the request should be handling it all for you. If it's not doing so, then chances are it's lost data already.
Could you give an example of what's actually going wrong? Specify the Unicode values of the characters in the string you're receiving (e.g. by using toCharArray() and then converting each char to an int) and what you expected to receive.
EDIT: To diagnose this, use something like this:
public static void dumpString(String text) {
for (int i = 0; i < text.length(); i++) {
System.out.println(i + ": " + (int) text.charAt(i));
}
}
Note that that will give the decimal value of each Unicode character. If you have a handy hex library method around, you may want to use that to give you the hex value. The main point is that it will dump the Unicode characters in the string.
First make sure that the data is actually encoded as UTF-8.
There are some inconsistencies between browsers regarding the encoding used when sending HTML form data. The safest way to send UTF-8 encoded data from a web form is to put that form on a page that is served with the Content-Type: text/html; charset=utf-8 header or contains a <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /> meta tag.
Now to properly decode the data call request.setCharacterEncoding("UTF-8") in your servlet before the first call to request.getParameter().
The servlet container takes care of the encoding for you. If you use setCharacterEncoding() properly you can expect getParameter() to return normal Java strings.
You may also need a filter that takes care of the encoding of your requests. For example, the Spring framework provides one: org.springframework.web.filter.CharacterEncodingFilter
String question = request.getParameter("searchWord");
is all you have to do in your servlet code. At this point you do not have to deal with encodings, charsets etc. This is all handled by the servlet infrastructure. When you notice problems like �, ?, ü being displayed somewhere, there may be something wrong with the request the client sent. But without knowing something about the infrastructure or the logged HTTP traffic, it is hard to tell what is wrong.
possibly.
question = new String(bytes, "UNICODE");

C# MVC3 and non-latin characters

I have my database results (áéíóúàâêô...) and when I display any of these characters I get codes like:
&#225;
My controller is like this:
ViewBag.EstadosDeAlma = (from e in db.EstadosDeAlma select e.Title).ToList();
My cshtml page is like this:
var data = '@foreach (dynamic item in ViewBag.EstadosDeAlma){ @(item + " ") }';
In addition, if I use any rich text editor such as TinyMCE, all non-Latin characters come out like this too.
What should I do to avoid this problem?
What output encoding are you using on your web pages? I would suggest using UTF-8 since you want a lot of non-ascii characters to work.
I think you should HTML encode/decode the values before comparing them.
Since you are using jQuery you can take advantage of the encoding functions built-in into it. For example:
$('<div/>').html('& #225;gil').html()
gives you "ágil" (notice that I added an extra space between the & and the # so that stackoverflow does not encode it, you won't need it)
This other question has more information about this.
HTML-encoding lost when attribute read from input field
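If no DOM is at hand, the decoding can also be sketched by hand. The function name decodeNumericReferences is made up for this example, and it handles only numeric references such as &#225; or &#xE1;, not named ones like &aacute;; it also assumes the numbers are valid code points.

```javascript
// Hypothetical helper: decodes decimal (&#225;) and hexadecimal (&#xE1;)
// character references; named entities are left untouched.
function decodeNumericReferences(text) {
  return text.replace(/&#([xX][0-9a-fA-F]+|[0-9]+);/g, function (match, body) {
    const isHex = body[0] === 'x' || body[0] === 'X';
    const codePoint = isHex ? parseInt(body.slice(1), 16) : parseInt(body, 10);
    return String.fromCodePoint(codePoint);
  });
}

console.log(decodeNumericReferences('&#225;gil'));
```

In a browser, letting the parser do this (as in the jQuery snippet above) is usually the simpler route, since it also covers named entities.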

What's the correct way to get HTML from an AJAX Response in JSON Format?

I have an AJAX request that creates a 'post', and upon successful post, I want to get HTML to inject back into the DOM. Right now I'm returning a JSON array that details success/error, and when I have a success I also include the HTML for the post in the response. So, I parse the response as JSON, and set a key in the JSON array to a bunch of HTML Code.
Naturally, the HTML code is making the JSON array break -- what should I do to escape it (or is there a better way to do this?). I get an AJAX response with a JSON array like so:
[{response:"success"},{html:'<div class="this is going to break...
Thanks!
Contrary to what you're probably used to in JavaScript, ' can't begin a string in JSON. It's strictly a ". Single quotes work when you're passing JSON to JavaScript, much like <br> works when you want to put an XHTML line break.
So, use " to open the HTML string, and sanitize your quotes with \".
json.org has more info WRT what you should sanitize. Though the list of special characters isn't long, it's probably best to use a library like Anurag suggests in a comment.
Apart from escaping double quotes as mentioned by BranTheMan, newlines also break JSON strings. You need to replace newlines with \n.
Personally I've found this to be enough:
// Don't know what your serverside language is, example in javascript syntax:
print(encodeJSON({
response : "success",
html : htmlString.replace(/\n/g,'\\n').replace(/"/g,'\\"')
}));
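If the server side happens to be JavaScript, or any language with a JSON serializer, the escaping does not have to be done by hand at all. A minimal sketch using the standard JSON.stringify and JSON.parse; the HTML snippet is made up for illustration:

```javascript
// Made-up HTML containing both double quotes and newlines.
const htmlString = '<div class="post">\n  <p>She said "hi"</p>\n</div>';

// The serializer escapes quotes, newlines and backslashes on its own.
const body = JSON.stringify({ response: 'success', html: htmlString });

// On the client, parsing gives back exactly the original markup.
const parsed = JSON.parse(body);

console.log(body);
console.log(parsed.html === htmlString);
```

Note that the payload here is a single object with two keys rather than an array of one-key objects, which makes the client code a little simpler to write.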
