Convert UTF-16 to code-page and remove unicode text direction control characters? - winapi

Short version
Given: 1/16/2006 2∶30∶11 ᴘᴍ
How to get: 1/16/2006 2:30:11 PM
rather than: ?1/?16/?2006 ??2:30:11 ??
Background
I have an example Unicode (UTF-16) encoded string:
U+200e U+0031 U+002f U+200e U+0031 U+0036 U+002f U+200e U+0032 U+0030 U+0030 U+0036 U+0020 U+200f U+200e U+0032 U+2236 U+0033 U+0030 U+2236 U+0031 U+0031 U+0020 U+1d18 U+1d0d
[LTR] 1 / [LTR] 1 6 / [LTR] 2 0 0 6 [RTL] [LTR] 2 ∶ 3 0 ∶ 1 1 ᴘ ᴍ
In a slightly easier to read form is:
LTR1/LTR16/LTR2006 RTLLTR2∶30∶11 ᴘᴍ
The actual final text as you're supposed to see it is:
1/16/2006 2∶30∶11 ᴘᴍ
I currently use the Windows function WideCharToMultiByte to convert the UTF-16 to the local code-page:
WideCharToMultiByte(CP_ACP, 0, text, length, NULL, 0, NULL, NULL);
and when I do, the text comes out as:
?1/?16/?2006 ??2:30:11 ??
I don't control the presence of the Unicode text direction markers; it's a security thing. But obviously when I'm converting the Unicode to (for example) ISO-8859-1, those characters are irrelevant, make no sense, and I would hope can be dropped.
Is there a Windows function (e.g. FoldString, WideCharToMultiByte) that can be instructed to drop these non-mappable, non-printable characters?
That gets us close
If a function did that, dropped the non-printing characters that don't have a representation in the target code-page, we would get:
1/16/2006 2∶30∶11 ᴘᴍ
When converted to ISO-8859-1, it becomes:
1/16/2006 2?30?11 ??
That's because some of those characters don't map exactly into ISO-8859-1:
1/16/2006 2 [U+2236] 30 [U+2236] 11 [U+1D18] [U+1D0D]
1/16/2006 2 [RATIO] 30 [RATIO] 11 [SMALL CAPITAL P] [SMALL CAPITAL M]
But when you see them, it doesn't seem unreasonable that they could be best-fit mapped into:
Original: 1/16/2006 2∶30∶11 ᴘᴍ
Mapped: 1/16/2006 2:30:11 PM
Is there a function that can do that?
I'm happy to suffer with:
1/16/2006 2?30?11 ??
But I really need to fix:
?1/?16/?2006 ??2:30:11 ??
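For illustration only, here is a minimal Python sketch of the "minimum fix" described above (it is not a Windows API answer). It assumes that dropping every character in Unicode general category Cf (format controls, which includes U+200E and U+200F) is acceptable; that category is broader than just the direction marks, and the helper name drop_format_characters is hypothetical.

```python
import unicodedata

def drop_format_characters(text: str) -> str:
    """Remove characters in Unicode general category Cf (format controls)."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Cf")

# The example string from the question, written with escapes
raw = ("\u200e1/\u200e16/\u200e2006 \u200f\u200e2\u223630\u223611 "
       "\u1d18\u1d0d")

cleaned = drop_format_characters(raw)

# errors="replace" substitutes "?" for anything ISO-8859-1 cannot represent
encoded = cleaned.encode("iso-8859-1", errors="replace")
print(encoded.decode("iso-8859-1"))  # 1/16/2006 2?30?11 ??
```

The direction marks are gone, while RATIO and the small capitals still become question marks because nothing here performs a best-fit mapping.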
Unicode has the notion
Unicode already has the notion of what "fancy" character you can replace with what "normal" character.
U+00BA º → o (Masculine ordinal indicator) → (Small latin letter o, superscripted)
U+FF0F / → / (Fullwidth solidus) → (solidus, wide)
U+00BC ¼ → 1/4 (Vulgar fraction one quarter)
U+2033 ″ → ′′ (Double prime)
U+FE64: ﹤ → <
I know these are technically for a different purpose. But there is also the general notion of a mapping list (which again is for a different purpose).
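The mappings listed above are Unicode compatibility decompositions, and they can be inspected with a short Python sketch using NFKC normalization. This is only an illustration of that existing mapping data, not a best-fit table: the characters from the timestamp example have no compatibility mapping, so normalization leaves them untouched.

```python
import unicodedata

# The "fancy" characters listed above
samples = ["\u00ba", "\uff0f", "\u00bc", "\u2033", "\ufe64"]
for ch in samples:
    folded = unicodedata.normalize("NFKC", ch)
    print(f"U+{ord(ch):04X} -> {folded!r}")

# RATIO (U+2236) and the small capitals (U+1D18, U+1D0D) are unchanged
timestamp = "2\u223630\u223611 \u1d18\u1d0d"
unchanged = unicodedata.normalize("NFKC", timestamp) == timestamp
print(unchanged)
```

Note that the fraction folds to digits around U+2044 FRACTION SLASH rather than an ASCII solidus, so even these mappings do not always land in ISO-8859-1.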
Microsoft SQL Server, when being asked to insert a Unicode string into a non-unicode varchar column does an even better job:
Is there a mapping list for the purpose of unicode best-fit?
Because the reality is that it just makes a mess for users:

Related

How can I convert ASCII code to characters in Verilog language

I've been looking into this but searching seems to lead to nothing.
It might be too simple to be described, but here I am, scratching my head...
Any help would be appreciated.
Verilog knows about "strings".
A single ASCII character requires 8 bits. Thus to store 8 characters you need 64 bits:
wire [63:0] string8;
assign string8 = "12345678";
There are some gotchas:
There is no End-Of-String character (like the C null-character)
The most RHS character is in bits 7:0.
Thus string8[7:0] will hold 8'h38 ("8").
To walk through a string you have to use e.g.: string[ index +: 8];
As with all Verilog vector assignments: unused bits are set to zero thus
assign string8 = "ABCD"; // MS bit63:32 are zero
You can not use two dimensional arrays:
wire [7:0] string5 [0:4];
assign string5 = "Wrong";
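The bit layout described in these gotchas can be modeled with a small Python sketch. It is only a model of the packing, not Verilog, and the helper names pack_string and byte_at are made up for illustration.

```python
def pack_string(text: str, width_bits: int) -> int:
    """Model assigning a string literal to a width_bits-wide vector."""
    value = int.from_bytes(text.encode("ascii"), "big")
    # Bits that do not fit in the vector are dropped from the high end
    return value & ((1 << width_bits) - 1)

def byte_at(vector: int, bit_index: int) -> int:
    """Model the part-select vector[bit_index +: 8]."""
    return (vector >> bit_index) & 0xFF

string8 = pack_string("12345678", 64)
print(hex(byte_at(string8, 0)))   # 0x38, the rightmost character "8"
print(hex(byte_at(string8, 56)))  # 0x31, the leftmost character "1"

padded = pack_string("ABCD", 64)
print(hex(padded))                # 0x41424344, so bits 63:32 are zero
```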
You are probably misled by a misconception about characters. There is no such thing as a character in hardware. There are only sets of bits, or codes. The only thing that converts binary codes to characters is your terminal. It interprets codes in a certain way and forms letters for you to see. So all the printfs in C and $display in Verilog only send the codes to the terminal (or to a file).
The thing that converts characters to codes is your keyboard, which you also use to type in the program. The compiler then interprets your program. The Verilog compiler (like the C compiler) represents strings in double quotes as a sequence of bytes directly. Verilog, like C, uses 8-bit ASCII encoding for such character strings, meaning that the code for 'a' is decimal 97, 'b' is 98, and so on. Every character is 8 bits wide, and the quoted string forms a concatenation of ASCII code bytes.
So, answering your question, you can convert ASCII codes to characters by sending them to the terminal via $display (or a similar function), using the %s modifier.
So, an example:
module A;
reg[8*5-1:0] hello;
reg[8*3 - 1: 0] bye;
initial begin
hello = "hello"; // 5 bytes of characters
bye = {8'd98, 8'd121, 8'd101}; // 3 bytes 'b' 'y' 'e'
$display("hello=%s bye=%s", hello, bye);
end
endmodule

Working with character values beyond ChrW(65535) in VbScript

Does VBScript have any way to support converting hex to decimal to char beyond the range of ChrW(65535)?
For example, \u2122 is &h2122 (hex), and converts to decimal value 8482, which, using ChrW, prints ™
Response.Write ChrW(cLng("&h2122"))
Results in:
™
... All good so far.
There are characters beyond the range of ChrW e.g. the "G clef" character U+1D11E (utf-16) which maps to decimal 119,070, which is beyond the range of ChrW()
Is there a way to work with those higher characters in VBScript?
You could try using the surrogate pair encoded values for the utf16-le (or utf16-be) encoding of that character and give'em to two separate chrW calls and concat the result.
Dim tmChar
tmChar = ChrW(&HD834) & ChrW(&HDD1E)
' You will probably see just a square or a question mark or some place holder here, depending on system and font support
MSGBox("Wow: " & tmChar)
'Have also a thinking face emoji as an additional example
tmChar = ChrW(&HD83E) & ChrW(&HDD14)
MSGBox("Hummm... " & tmChar)
I think VBS strings store UTF-16 code units, not characters or code points.
Note that:
"G clef" is the unicode character corresponding to the code point number 119070 (1D11E in hexadecimal)
"G clef" is U+1D11E Unicode stop
"G clef" as surrogate pair units is 0xD834 0xDD1E
"G clef" encoded as UTF16-BE is 0xD8 0x34 0xDD 0x1E
"G clef" encoded as UTF16-LE is 0x34 0xD8 0x1E 0xDD
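The surrogate values in the list above follow from the standard UTF-16 arithmetic, sketched here in Python as a cross-check. The function name surrogate_pair is just an illustrative helper; the resulting two numbers are what would be passed to the two ChrW calls.

```python
def surrogate_pair(code_point: int) -> tuple:
    """Split a code point above U+FFFF into UTF-16 high/low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not in a supplementary plane")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits of the offset
    return high, low

high, low = surrogate_pair(0x1D11E)   # G clef
print(f"{high:04X} {low:04X}")        # D834 DD1E

# Python's own codecs agree on the byte order of each encoding
print(chr(0x1D11E).encode("utf-16-be").hex(" "))  # d8 34 dd 1e
print(chr(0x1D11E).encode("utf-16-le").hex(" "))  # 34 d8 1e dd
```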

GNU Prolog: Strange characters

I am new to using GNU Prolog.
Given the following facts:
theme(cafe).
role(manager).
role(boss).
role(coworker).
numberOfCharacters(theme(cafe), 3).
charactersRole(numberCharacters(theme(cafe), 3), role('boss'), role('manager'), role('çoworker')).
When I query:
charactersRole(numberCharacters(theme('cafe'), 3), role(X), role(Y), role(Z)).
It returns some of the values correctly, while one value contains ç in place of normal character 'c'.
X = boss
Y = manager
Z = 'çoworker'
Thanks! :)
role('çoworker')
You have a cedilla right here, which gets misrepresented as two characters, usually because something along the way is not Unicode-aware. This is not a Prolog issue.
ç are U+00C3 U+00A7 in Unicode
And ç is
U+00E7 LATIN SMALL LETTER C WITH CEDILLA
UTF-8: 0xC3 0xA7
That's what you get by outputting a 2-byte UTF-8 character to a Latin-1 terminal that is not UTF-8-aware.
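The effect is easy to reproduce outside Prolog. This small Python sketch (an illustration only) encodes the atom's text as UTF-8 and then misreads the bytes as Latin-1, the way such a terminal would:

```python
original = "\u00e7oworker"             # the atom text, starting with c-cedilla

utf8_bytes = original.encode("utf-8")
print(utf8_bytes[:2].hex(" "))         # c3 a7

# A Latin-1 consumer shows one character per byte
misread = utf8_bytes.decode("latin-1")
print(misread == "\u00c3\u00a7oworker")  # True
```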

Convert escaped unicode (\u008E) to accented character (Ž) in Ruby?

I am having a very difficult time with this:
# contained within:
"MA\u008EEIKIAI"
# should be
"MAŽEIKIAI"
# nature of string
$ p string3
"MA\u008EEIKIAI"
$ puts string3
MAEIKIAI
$ string3.inspect
"\"MA\\u008EEIKIAI\""
$ string3.bytes
#<Enumerator: "MA\u008EEIKIAI":bytes>
Any ideas on where to start?
Note: this is not a duplicate of my previous question.
\u008E means that the unicode character with the codepoint 8e (in hex) appears at that point in the string. This character is the control character “SINGLE SHIFT TWO” (see the code chart (pdf)). The character Ž is at the codepoint u017d. However it is at position 8e in the Windows CP-1252 encoding. Somehow you’ve got your encodings mixed up.
The easiest way to “fix” this is probably just to open the file containing the string (or the database record or whatever) and edit it to be correct. The real solution will depend on where the string in question came from and how many bad strings you have.
Assuming the string is in UTF-8 encoding, \u008E will consist of the two bytes c2 and 8e. Note that the second byte, 8e, is the same as the encoding of Ž in CP-1252. One way to convert the string would be something like this:
string3.force_encoding('BINARY') # treat the string just as bytes for now
string3.gsub!(/\xC2/n, '') # remove the C2 byte
string3.force_encoding('CP1252') # give the string the correct encoding
string3.encode('UTF-8') # convert to the desired encoding
Note that this isn’t a general solution to fix all issues like this. Not all CP-1252 characters, when mangled and expressed in UTF-8 this way, will be amenable to conversion like this. Some will be two bytes c2 xx where xx is the correct byte (like in this case), others will be c3 yy where yy is a different byte.
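For comparison, the same repair can be sketched in Python. It assumes, as above, that each code point below U+0100 in the string really stands for a CP-1252 byte; Python's cp1252 codec raises an error on the few byte values that encoding leaves undefined.

```python
mangled = "MA\u008eEIKIAI"

# In UTF-8 the control character U+008E is the two bytes c2 8e
print(mangled.encode("utf-8").hex(" "))  # 4d 41 c2 8e 45 49 4b 49 41 49

# Latin-1 maps code points 0-255 straight to bytes, so this yields the
# single byte 8e, which CP-1252 then decodes as U+017D
repaired = mangled.encode("latin-1").decode("cp1252")
print(repaired == "MA\u017dEIKIAI")      # True
```

Going through Latin-1 handles both the c2 xx and the c3 yy cases, since it works on code points rather than on the UTF-8 bytes.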
What about using Regexp & String#pack to convert the Unicode escape?
str = "MA\\u008EEIKIAI"
puts str #=> MA\u008EEIKIAI
str.gsub!(/\\u(.{4})/) do |match|
[$1.to_i(16)].pack('U')
end
puts str #=> MA EIKIAI

How to remove these kind of symbols (junk) from string?

Imagine I have String in C#: "I Don’t see ya.."
I want to remove (replace to nothing or etc.) these "’" symbols.
How do I do this?
That 'junk' looks a lot like someone interpreted UTF-8 data as ISO 8859-1 or Windows-1252, probably repeatedly.
’ is the sequence C3 A2, E2 82 AC, E2 84 A2.
UTF-8 C3 A2 = U+00E2 = â
UTF-8 E2 82 AC = U+20AC = €
UTF-8 E2 84 A2 = U+2122 = ™
We then do it again: in Windows 1252 this sequence is E2 80 99, so the character should have been U+2019, RIGHT SINGLE QUOTATION MARK (’)
You could make multiple passes with byte arrays, Encoding.UTF8 and Encoding.GetEncoding(1252) to correctly turn the junk back into what was originally entered. You will need to check your processing to find the two places that UTF-8 data was incorrectly interpreted as Windows-1252.
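As a rough illustration of one such pass (in Python rather than C#, with a hypothetical helper name), the round trip is: encode the junk as Windows-1252 to get the original bytes back, then decode those bytes as UTF-8. If the data was misinterpreted more than once, the pass would need repeating.

```python
def undo_one_misreading(text: str) -> str:
    """Reverse one 'UTF-8 bytes read as Windows-1252' mistake."""
    return text.encode("cp1252").decode("utf-8")

# "I Don" + a-circumflex + euro sign + trade mark sign + "t see ya.."
junk = "I Don\u00e2\u20ac\u2122t see ya.."

# The three junk characters as UTF-8 bytes, matching the list above
print(junk.encode("utf-8")[5:13].hex(" "))  # c3 a2 e2 82 ac e2 84 a2

fixed = undo_one_misreading(junk)
print(fixed == "I Don\u2019t see ya..")     # True
```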
"I Don’t see ya..".Replace( "’", string.Empty);
How did that junk get in there the first place? That's the real question.
By removing any non-latin character you'll be intentionally breaking some internationalization support.
Don't forget the poor guy whose name has a "â" in it.
This looks disturbingly familiar to a character encoding issue dealing with the Windows character set being stored in a database using the standard character encoding. I see someone voted Will down, but he has a point. You may be solving the immediate issue, but the combinations of characters are limitless if this is the issue.
If you really have to do this, regular expressions are probably the best solution.
I would strongly recommend that you think about why you have to do this, though - at least some of the characters you're listing as undesirable are perfectly valid and useful in other languages, and just filtering them out will most likely annoy at least some of your international users. As a Swede, I can't emphasize enough how much I hate systems that can't handle our å, ä and ö characters correctly.
Consider Regex.Replace(your_string, regex, "") - that's what I use.
Test each character in turn to see if it is a valid alphabetic or numeric character and if not then remove it from the string. The character test is very simple, just use...
char.IsLetterOrDigit;
Note that there are various others, such as...
char.IsSymbol;
char.IsControl;
Regex.Replace("The string", "[^a-zA-Z ]","");
That's how you'd do it in C#, although that regular expression ([^a-zA-Z ]) should work in most languages.
[Edited: forgot the space in the regex]
The ASCII / Integer code for these characters would be out of the normal alphabetic Ranges. Seek and replace with empty characters. String has a Replace method I believe.
Either use a blacklist of stuff you do not want, or preferably a white list (set). With a white list you iterate over the string and only copy the letters that are in your white list to the result string. You said remove, and the way you do that is having two pointers one you read from (R) and one you write to (W):
I Donââ‚
W R
If the comma is in your whitelist, then in this case you would read the comma and write it where à is, then advance both pointers. UTF-8 is a multi-byte encoding, so advancing the pointer may not just be adding one to the address.
With C, an easy way to get a white list is to use one of the predefined functions (or macros): isalnum, isalpha, isascii, isblank, iscntrl, isdigit, isgraph, islower, isprint, ispunct, isspace, isupper, isxdigit. In this case you end up with a white list function instead of a set, of course.
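The read/write pointer idea can be sketched in Python over a byte buffer. This is a minimal illustration, with a made-up function name and a whitelist predicate chosen for the example (ASCII letters, digits, space and full stop); it works byte by byte, so every byte of a multi-byte UTF-8 sequence is dropped.

```python
def keep_whitelisted(data: bytearray, allowed) -> bytearray:
    """Compact data in place, keeping only bytes accepted by allowed()."""
    write = 0                          # W: next position to write to
    for read in range(len(data)):      # R: position being examined
        if allowed(data[read]):
            data[write] = data[read]
            write += 1
    del data[write:]                   # cut off the leftover tail
    return data

def is_allowed(byte: int) -> bool:
    # Whitelist function: ASCII alphanumerics plus space and full stop
    return byte < 0x80 and (chr(byte).isalnum() or chr(byte) in " .")

buffer = bytearray("I Don\u00e2\u20ac\u2122t see ya..".encode("utf-8"))
print(keep_whitelisted(buffer, is_allowed).decode("ascii"))  # I Dont see ya..
```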
Usually when I see data like you have I look for memory corruption, or evidence to suggest that the encoding I expect is different than the one the data was entered with.
/Allan
I had the same problem with extraneous junk thrown in by adobe in an EXIF dump. I spent an hour looking for a straight answer and trying numerous half-baked suggestions which did not work here.
This thread more than most I have read was replete with deep, probing questions like 'how did it get there?', 'what if somebody has this character in their name?', 'are you sure you want to break internationalization?'.
There were some impressive displays of erudition positing how this junk could have gotten here and explaining the evolution of the various character encoding schemes. The person wanted to know how to remove it, not how it came to be or what the standards orgs are up to, interesting as this trivia may be.
I wrote a tiny program which gave me the right answer. Instead of paraphrasing the main concept, here is the entire, self-contained, working (at least on my system) program and the output I used to nuke the junk:
#!/usr/local/bin/perl -w
# This runs in a dos window and shows the char, integer and hex values
# for the weird chars. Install the HEX values in the REGEXP below until
# the final test line looks normal.
$str = 's: “Brian'; # Nuke the 3 weird chars in front of Brian.
@str = split(//, $str);
printf("len str '$str' = %d, scalar \@str = %d\n",
length $str, scalar @str);
$ii = -1;
foreach $c (@str) {
$ii++;
printf("$ii) char '$c', ord=%03d, hex='%s'\n",
ord($c), unpack("H*", $c));
}
# Take the hex characters shown above, plug them into the below regexp
# until the junk disappears!
($s2 = $str) =~ s/[\xE2\x80\x9C]//g; # << Insert HEX values HERE
print("S2=>$s2<\n"); # Final test
Result:
M:\new\6s-2014.1031-nef.halloween>nuke_junk.pl
len str 's: GÇ£Brian' = 11, scalar @str = 11
0) char 's', ord=115, hex='73'
1) char ':', ord=058, hex='3a'
2) char ' ', ord=032, hex='20'
3) char 'G', ord=226, hex='e2'
4) char 'Ç', ord=128, hex='80'
5) char '£', ord=156, hex='9c'
6) char 'B', ord=066, hex='42'
7) char 'r', ord=114, hex='72'
8) char 'i', ord=105, hex='69'
9) char 'a', ord=097, hex='61'
10) char 'n', ord=110, hex='6e'
S2=>s: Brian<
It's NORMAL!!!
One other actionable, working suggestion I ran across:
iconv -c -t ASCII < 6s-2014.1031-238246.halloween.exf.dif > exf.ascii.dif
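A roughly analogous one-liner exists in Python, shown here only as a sketch of the same "discard whatever is not ASCII" idea; it operates on an already decoded string rather than on a file, so it is not a drop-in replacement for the iconv command.

```python
text = "s: \u201cBrian"   # left double quotation mark in front of Brian

# errors="ignore" silently drops every character ASCII cannot represent
ascii_only = text.encode("ascii", errors="ignore").decode("ascii")
print(ascii_only)          # s: Brian
```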
If the string contains any junk data, this is a good way to remove it:
string InputString = "This is grate kingdom¢Ã‚¬â";
string replace = "’";
string OutputString= Regex.Replace(InputString, replace, "");
//OutputString having the following result
It works well for me. Thanks for looking at this.

Resources