How can I save German characters from Classic ASP to SQL Database [duplicate] - vbscript

So I was having an issue with French characters not being converted correctly. Basically, I have a form which sends data to an SQL database. Then, on another page, data from this DB is retrieved and displayed to the user. But the data (strings) were being displayed with weird corrupt characters because the input in the form on the other page was in French. I overcame this problem by using the following function, which converts a string to the correct charset. HOWEVER, obviously the better solution is to convert it FIRST and then send it to the database. Now here's the code to convert a string retrieved from a DB to the appropriate charset:
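Function ConvertFromUTF8(sIn)
    Dim oIn: Set oIn = CreateObject("ADODB.Stream")
    oIn.Open
    ' Write the mis-decoded text back out as Windows-1252 bytes...
    oIn.CharSet = "Windows-1252"
    oIn.WriteText sIn
    ' ...then rewind and read those same bytes back as UTF-8.
    oIn.Position = 0
    oIn.CharSet = "UTF-8"
    ConvertFromUTF8 = oIn.ReadText
    oIn.Close
End Function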
I got this function from here: Classic ASP - How to convert a UTF-8 string to UCS-2?
Now my question is, what function do I use to convert strings beforehand and then send them to the database, so that when I retrieve them they will be good-to-go?
Tried Paul's Method:
So there's page 1 and page 2. Page 1 contains a form which, when submitted, sends the string to the DB, which is then retrieved in page 2. I tried Paul's solution by removing the function ConvertFromUTF8 and leaving everything as it was before (when it returned weird Mongolian-looking characters). After that, I added the following line at the top of Page 1 as well as Page 2.
<%@LANGUAGE="VBSCRIPT" CODEPAGE="65001"%>
I also have the following on both of the pages:
Response.CodePage = 65001
Response.CharSet = "UTF-8"
But it didn't work :(
Edit: it works!, thank you so much everyone for your help!
All I needed to do was add "CodePage = 65001" on top of Page 3 (which I didn't even talk about), where the writing to the DB part was happening.

Paul's answer isn't wrong but it is not the only part to consider:
You will need to go through each of these steps to make sure that you are getting consistent results:
IMPORTANT: These steps have to be performed on each and every page in your web application or you will have problems (emphasized by Paul's comment).
Each page needs to be saved using UTF-8 encoding. Double check this, as some IDEs will default to Windows-1252 (also often misnamed as "ANSI").
Each page will need the following line added as the very first line in the page. To make this easier, I put it along with some other values in an include file so I can include them in each page as I go.
Include File - page_encoding.asp
<%@Language="VBScript" CodePage = 65001 %>
<%
Response.CharSet = "UTF-8"
Response.CodePage = 65001
%>
Usage in the top of an ASP page (prefer to put in a config folder at the root of the web)
<!-- #include virtual="/config/page_encoding.asp" -->
Response.Charset = "UTF-8" is the equivalent of setting the ;charset in the HTTP content-type header.
Response.CodePage = 65001 tells ASP to process all dynamic strings as UTF-8.
Include files in the page will also have to be saved using UTF-8 encoding (double check these also).
Follow these steps and your page will work. Your problem at the moment is that some pages are being interpreted as Windows-1252 while others are being treated as UTF-8, so you end up with a mismatch in encoding.
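To illustrate why such a mismatch produces corrupt characters, here is a minimal JavaScript (Node.js) sketch. It is not part of the ASP fix. It assumes Node's Buffer API and uses latin1 as a stand-in for Windows-1252 (the two agree for the bytes involved here). The helper names misdecode and repair are hypothetical; repair mirrors what ConvertFromUTF8 in the question does.

```javascript
// Illustration only: what happens when UTF-8 bytes are read as a single-byte charset.
// Assumption: 'latin1' stands in for Windows-1252; they agree on bytes 0xA0-0xFF.
function misdecode(text) {
  // Encode as UTF-8, then (wrongly) decode each byte as one character.
  return Buffer.from(text, 'utf8').toString('latin1');
}

function repair(garbled) {
  // Reverse the mistake: recover the original bytes, then decode them as UTF-8.
  return Buffer.from(garbled, 'latin1').toString('utf8');
}

const original = 'fran\u00e7ais';      // "français"
const garbled = misdecode(original);   // ç is bytes C3 A7, shown as two characters
console.log(garbled);
console.log(repair(garbled));
```

Repairing after the fact only works while no bytes have been lost, which is why getting every page onto the same encoding is the more robust route.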

Normally (and that word covers a very long stretch) you do not need to convert by hand; in fact, it's discouraged. At the top of your ASP page you write:
<%@LANGUAGE="VBSCRIPT" CODEPAGE="65001"%>
That tells ASP to send and to receive (from a server point of view) UTF-8. Furthermore, it instructs the interpreter to use 2-byte strings. So when writing to or reading from a database, everything goes auto-magically: whether your database uses 1-byte char or 2-byte nchar columns, the conversions are taken care of. And actually that's about it. You can test if all goes well by testing with this set:
áäÇçéčëíďńóöçÖöÚü
This set contains some 'European' but also some 'Unicode' chars... those Unicode chars will always fail if you use codepage 1252, so it's a nice test set.
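As a rough way to see which characters of that test set are the ones expected to fail, here is a small JavaScript sketch. It assumes that any code point above 0xFF cannot be stored in a single-byte Western charset, which is exact for Latin-1 and only an approximation for Windows-1252. The function name outsideLatin1 is hypothetical.

```javascript
// Assumption: any code point above 0xFF is treated as "not representable".
// This is exact for Latin-1 and only an approximation for Windows-1252,
// which additionally maps a few characters (such as the euro sign) into 0x80-0x9F.
function outsideLatin1(text) {
  const seen = new Set();
  // Normalize first so accented letters are single precomposed code points.
  for (const ch of text.normalize('NFC')) {
    if (ch.codePointAt(0) > 0xff) seen.add(ch);
  }
  return [...seen];
}

const testSet = 'áäÇçéčëíďńóöçÖöÚü';
console.log(outsideLatin1(testSet)); // the three characters č, ď, ń
```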

Related

Server.URLEncode started to replace blank with plus ("+") instead of percent-20 ("%20")

Given this piece of code:
<%
Response.Write Server.URLEncode("a doc file.asp")
%>
It output this for a while (like JavaScript's encodeURI call):
a%20doc%20file.asp
Now, for an unknown reason, I get:
a+doc+file%2Easp
I'm not sure what I touched to make this happen (maybe the file content encoding, ANSI/UTF-8). Why did this happen, and how can I get back the first behavior of Server.URLEncode, i.e. using percent encoding?
Classic ASP hasn't been updated in nearly 20 years, so Server.URLEncode still uses the RFC-1866 standard, which specifies that spaces be encoded as + symbols (a hangover from the old application/x-www-form-urlencoded media type). You must be mistaken in thinking it was encoding spaces as %20 at some point, unless there's an IIS setting you can change that I'm unaware of.
More modern languages use the RFC-3986 standard for encoding URLs, which is why JavaScript's encodeURI function returns spaces encoded as %20.
Both + and %20 should be treated the same when a query string is decoded, thanks to backwards compatibility, but it's generally considered best to use %20 when encoding spaces in a URL as it's the more modern standard now, and some decoding functions (such as JavaScript's decodeURIComponent) won't recognise + symbols as spaces and will fail to properly decode URLs that use them over %20.
You can always use a custom function to encode spaces as %20:
function URL_encode(ByVal url)
url = Server.URLEncode(url)
url = replace(url,"+","%20")
URL_encode = url
end function
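The difference between the two styles can be checked from the JavaScript side. The sketch below uses only the built-in encodeURIComponent and decodeURIComponent; decodeFormStyle is a hypothetical helper name, shown as one possible way to read form-style input.

```javascript
// encodeURIComponent percent-encodes spaces.
const encoded = encodeURIComponent('a doc file.asp');
console.log(encoded); // a%20doc%20file.asp

// decodeURIComponent leaves "+" untouched, so form-style input needs a pre-pass.
const formStyle = 'a+doc+file%2Easp';
console.log(decodeURIComponent(formStyle)); // a+doc+file.asp

// Hypothetical helper: treat "+" as a space before decoding.
function decodeFormStyle(value) {
  return decodeURIComponent(value.replace(/\+/g, ' '));
}
console.log(decodeFormStyle(formStyle)); // a doc file.asp
```

Note that decodeFormStyle would also turn a literal plus sign into a space, so it is only appropriate for values known to be form-encoded.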

Not sure why the output of my PHP scripts contains random embedded spaces within character strings

I have written several PHP scripts to read the contents of a database and output those contents in an email message. Every once in a while, I will see a SPACE (0x20) character embedded in the output where there shouldn't be any. For example, in one script, I reference a PHP global variable containing exactly "n" non-space characters, and sometimes (not always), when that variable is dumped to an email message, the string will appear with an embedded blank (making the total length of the string "n+1"). Other times, an HTML tag (such as <BR>) will appear as < BR> (note the SPACE before the "B").
Because the behavior of the script is not consistent (some emails are affected, and others aren't), I can't seem to find the problem.
I am enclosing a link to the PHP script that is occasionally embedding a space into the BREAK tag. I have removed the lines that provide specific login information to the databases. Otherwise, everything else is intact. In the code file you can find at the link below, line 281 is the one that contained the BREAK command with the embedded SPACE (as described above). This has happened only once!
http://jem-software.com/temptest.txt
I guess the only other potentially relevant information is that this script file is taken from code entered into a JUMI code block contained within a Joomla! based website.
Edit 1:
Thank you, Riccardo, for your suggestions. Here is some more clarification:
I am not reading an email and parsing the results in order to insert into a database. Just the opposite, I am reading from a database and using the results to create an email. I will check the database to see what character set was used, and explicitly pass the character set to see if that makes a difference.
I don't use Joomla functions to access the database because the database I am referencing is external to the Joomla! environment. It is a pre-existing database created from PHP scripts written several years prior. When my old website was re-written using Joomla, I wanted to "port" the PHP database access code intact, so I installed the JUMI plugin to make this possible.
I will check out the character coding in the database and synchronize it to the character code of the email message.
I don't understand how an issue with character coding would result in the insertion of a SPACE into the hard-coded HTML tag - this tag did not come from any database, but was typed into the email as a literal string.
This is a strange issue, but here are my two cents:
The first is you're not using Joomla functions to access the db and the mail subsystem. While this could work, it's not really nice.
The second is, this smells like a character set / codepage issue.
Here are a few considerations on the character set issue:
I read your code quickly, and I didn't notice anything wrong. But Joomla uses UTF-8, and your queries don't specify it (mysql_set_charset() is missing!) which could be a first issue.
The second is that the emails you read will have different character sets, depending on the senders' settings. Make sure you handle the codepage issues properly: the following is a snippet of a function I use for parsing email:
$mime = imap_fetchmime($this->connection, $this->messageNumber, $partNumber);
return $this->decodeMailBody($data,$mime); // QUOTED_PRINTABLE
function decodeMailBody($string, $mime) {
    $str = quoted_printable_decode($string);
    //mime: Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset=utf-8
    //mime: Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset=windows-1252
    // Take whatever follows the last "charset=" as the charset name.
    $mimes = explode('charset=', $mime);
    $charset = strtolower(trim(end($mimes)));
    // Debug output; $charset must be set before this line.
    echo "<h3>mime: $mime; charset $charset</h3>";
    if ($charset == 'utf-8') {
        return $str;
    } else {
        // Convert any other charset to UTF-8.
        return iconv($charset, 'UTF-8', $str);
    }
}
Last, make sure you use utf-8 when you insert the mail into the db after parsing it.
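The charset extraction step above can also be sketched with a regular expression instead of splitting on 'charset='. The following JavaScript version is only an illustration of that one step, not a replacement for the PHP function; parseCharset is a hypothetical name, and returning null when no charset is present is an assumption of this sketch.

```javascript
// Hypothetical helper: pull the charset name out of a MIME header string.
function parseCharset(mime) {
  // Accept optional spaces around "=" and an optional opening quote.
  const match = /charset\s*=\s*"?([^\s;"]+)/i.exec(mime);
  return match ? match[1].toLowerCase() : null;
}

const header =
  'Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset=windows-1252';
console.log(parseCharset(header));                     // windows-1252
console.log(parseCharset('Content-Type: text/plain')); // null
```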

C# MVC3 and non-latin characters

I have my database results (áéíóúàâêô...) and when I display any of these characters I get codes like:
&#225;
My controller is like this:
ViewBag.EstadosDeAlma = (from e in db.EstadosDeAlma select e.Title).ToList();
My cshtml page is like this:
var data = '@foreach (dynamic item in ViewBag.EstadosDeAlma){ @(item + " ") }';
In addition, if I use any rich text editor such as TinyMCE, all non-Latin characters come out like this too.
What should I do to avoid this problem?
What output encoding are you using on your web pages? I would suggest using UTF-8 since you want a lot of non-ascii characters to work.
I think you should HTML encode/decode the values before comparing them.
Since you are using jQuery you can take advantage of the encoding functions built into it. For example:
$('<div/>').html('& #225;gil').html()
gives you "ágil" (notice that I added an extra space between the & and the # so that stackoverflow does not encode it, you won't need it)
This other question has more information about this.
HTML-encoding lost when attribute read from input field
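If no DOM is available, numeric character references such as the one in the question can also be decoded with plain string replacement. This is a minimal sketch under that assumption; decodeNumericEntities is a hypothetical name, and it handles only decimal and hexadecimal numeric references, not named entities such as &amp;aacute;.

```javascript
// Hypothetical helper: decode numeric character references without a DOM.
// Handles decimal (&#225;) and hexadecimal (&#xE1;) forms only, not named entities.
function decodeNumericEntities(text) {
  return text
    .replace(/&#x([0-9a-f]+);/gi, (_, hex) => String.fromCodePoint(parseInt(hex, 16)))
    .replace(/&#(\d+);/g, (_, dec) => String.fromCodePoint(parseInt(dec, 10)));
}

console.log(decodeNumericEntities('&#225;gil')); // ágil
```

String.fromCodePoint throws a RangeError for values above 0x10FFFF, so untrusted input would need a range check first.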

Displaying partially Unicode encoded data via AJAX/innerHTML

I am trying to get some data from the server via an AJAX call and then displaying the result using responseDiv.innerHTML. The data from the server comes partially encoded with Unicode elements, like: za\u010Dat test. By setting the innerHTML of the response div, this just displayed as is. That is, the Unicode is not converted to an actual representation in the browser.
The charset of the containing page is set to UTF-8. I have tried most other things, like converting the unicode representation to HTML entities, but that doesn't seem to work either.
I should also mention that the text coming from the server has HTML tags intermixed as well. The HTML tags are honored as they should be. For example, if the text from the server comes as <b>Bold this!</b>, the text is bolded.
Any help appreciated.
Vikram
Can you replace '\u010D' with '&#269;'?
AFAIK the HTML tags coming from the server should work if you are setting the innerHTML.
This works for me:
document.getElementById('info').innerHTML = "&#269; <b>Bold this</b>";
BTW - you can use something like Fiddler or Firebug to ensure you are getting what you expect from the server.
Update: use regular expressions to find and replace the unicode characters with HTML entities:
$.get('data.txt', function(data) {
data = data.replace(/\\u([0-9A-Fa-f]{4})/g, '&#x$1;');
document.getElementById('info').innerHTML = data;
});
Just convert the unicode literals to characters directly:
'H\\u0065\\u006Clo, world!'.replace(/\\u([0-9a-fA-F]{4})/g, function(match, hex) {
return String.fromCharCode(parseInt(hex, 16));
});

Quotation marks turn to question marks

So I have a ruby script that parses HTML pages and saves the extracted string into a DB...
but I'm getting weird characters (usually question marks) instead of plain text...
Eg : ‘SOME TEXT’ instead of 'Some Text'
I've tried HTML entities and CGI::unescape ... but to no avail...
did some googling and set $KCODE = 'u' & require 'jcode'
still not working...
any suggestions /pointers would be great
Thanks
PS : using mysql 5.1
Your script is storing the Unicode escape sequences for quotation marks (instead of ASCII quotation marks) in the database.
That's actually good. It shows that the DB itself is working fine, although for best results you should ensure that the table is set to use a UTF-8 collation such as 'utf8_general_ci' so that string sorting works properly.
The fact that the output is displayed as "‘" just means that your terminal (and/or web page) output encoding is incorrect.
If it's terminal output, make sure that the LANG environment variable (ENV['LANG'] in Ruby) is set to an appropriate UTF-8 locale (e.g. en_US.UTF-8) and that the terminal emulator itself is set the same way.
If it's HTML output, make sure that the page encoding is set to UTF-8 as well, i.e.:
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
Is the DB that you're storing data in capable of handling Unicode? These symptoms seem to imply that it's not. For Unicode support under MySQL, please see this link.
It seems likely that the quotation marks in question are not the standard ASCII quotation marks but the Unicode ones.
Ruby has an iconv implementation to convert between encoding types. See here for more information.
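To check whether a scraped string really contains Unicode quotation marks rather than ASCII ones, it helps to list its non-ASCII code points. The sketch below does this in JavaScript (Node.js) purely for illustration; nonAsciiCodePoints is a hypothetical name, and the sample string is taken from the question.

```javascript
// Hypothetical helper: list the non-ASCII characters in a string as code points.
function nonAsciiCodePoints(text) {
  return [...text]
    .filter((ch) => ch.codePointAt(0) > 0x7f)
    .map((ch) => 'U+' + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, '0'));
}

const sample = '\u2018SOME TEXT\u2019'; // curly quotes around the text
console.log(nonAsciiCodePoints(sample)); // [ 'U+2018', 'U+2019' ]

// Each curly quote takes three bytes in UTF-8, versus one byte for an ASCII quote.
console.log(Buffer.from('\u2018', 'utf8').length); // 3
```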

Resources