I had a question about string normalization that was already answered, but the problem is that I cannot correctly normalize Korean characters that require three keystrokes.
With the input "ㅁㅜㄷ" (from keystrokes "ane"), it comes out "무ㄷ" instead of "묻".
With the input "ㅌㅐㅇ" (from keystrokes "xod"), it comes out "태ㅇ" instead of "탱".
This is Mr. Dean's answer, and while it worked on the example I gave at first, it doesn't work with the ones I cited above.
If you are using .NET, the following will work:
var s = "ㅌㅐㅇ";
s = s.Normalize(NormalizationForm.FormKC);
In native Win32, the corresponding call is NormalizeString:
const wchar_t *input = L"ㅌㅐㅇ";
wchar_t output[100];
NormalizeString(NormalizationKC, input, -1, output, 100);
NormalizeString is only available on Windows Vista and later. You need the "Microsoft Internationalized Domain Name (IDN) Mitigation APIs" installed if you want to use it on XP (why it's in the IDN download, I don't understand...).
Note that neither of these methods actually requires use of the IME - they work regardless of whether you've got the Korean IME installed or not.
This is the code I'm using in Delphi (on XP):
var buf: array [0..20] of WideChar;
    temporary: PWideChar;
const NORMALIZATIONKC = 5;
...
temporary := 'ㅌㅐㅇ';
NormalizeString(NORMALIZATIONKC, temporary, -1, buf, 20);
ShowMessage(buf);
Is this a bug? Is there something incorrect in my code?
Does the code run correctly on your computer? In what language? What Windows version are you using?
The jamo you're using (ㅌㅐㅇ) are in the block called Hangul Compatibility Jamo, which is present due to legacy code pages. If you take your target character and decompose it (using NFKD), you get jamo from the Hangul Jamo block (ᄐ ᅢ ᆼ, sans the spaces, which are just there to prevent the browser from normalizing them), and these can be re-composed just fine.
Unicode 5.2 states:
When Hangul compatibility jamo are transformed with a compatibility normalization form, NFKD or NFKC, the characters are converted to the corresponding conjoining jamo characters. (...) Table 12-11 illustrates how two Hangul compatibility jamo can be separated in display, even after transforming them with NFKD or NFKC.
This suggests that NFKC should combine them correctly by treating them as regular Jamo, but Windows doesn't appear to be doing that. However, using NFKD does appear to convert them to the normal Jamo, and you can then run NFKC on it to get the right character.
Since those characters appear to come from an external program (the IME), I would suggest you either do a manual pass to convert those compatibility Jamo, or start by doing NFKD, then NFKC (a sketch follows). Alternatively, you may be able to reconfigure the IME to output "normal" Jamo instead of compatibility Jamo.
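Here is a minimal Win32 C sketch of that two-pass approach, assuming Vista or later and linking with Normaliz.lib; the fixed buffer sizes are for brevity only, real code should query the required length first:
#include <windows.h>

int main(void) {
    /* ㅌ ㅐ ㅇ as Hangul Compatibility Jamo (U+314C U+3150 U+3147) */
    const wchar_t *input = L"\u314C\u3150\u3147";
    wchar_t decomposed[100], composed[100];

    /* Pass 1: NFKD converts the compatibility jamo to conjoining jamo */
    NormalizeString(NormalizationKD, input, -1, decomposed, 100);
    /* Pass 2: NFKC composes the conjoining jamo into the syllable 탱 (U+D0F1) */
    NormalizeString(NormalizationKC, decomposed, -1, composed, 100);
    return 0;
}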
I'm trying to write a short program (short enough that it has a simple main function). First, I list the dependency in the Cargo.toml file:
[dependencies]
passwords = {version = "3.1.3", features = ["crypto"]}
Then when I use the crate in main.rs:
extern crate passwords;
use passwords::hasher;
fn main() {
    let args: Vec<String> = std::env::args().collect();
    if args.len() < 2 {
        println!("Error! Needed second argument to demonstrate BCrypt Hash!");
        return;
    }
    let password = args.get(1).expect("Expected second argument to exist!").trim();
    let hash_res = hasher::bcrypt(10, "This_is_salt", password);
    match hash_res {
        Err(_) => println!("Failed to generate a hash!"),
        Ok(hash) => {
            // The hash bytes are arbitrary binary data, decoded lossily as UTF-8 here
            let str_hash = String::from_utf8_lossy(&hash);
            println!("Hash generated from password {} is {}", password, str_hash);
        }
    }
}
The issue arises when I run the following command:
$ target/debug/extern_crate.exe trooper1
And this becomes the output:
?sC�M����k��ed from password trooper1 is ���Ka .+:�
However, this input:
$ target/debug/extern_crate.exe trooper3
produces this:
Hash generated from password trooper3 is ��;��l�ʙ�Y1�>R��G�Ѡd
I'm pretty content with the second output, but is there something within UTF-8 that could cause the "Hash generat" portion of the output statement to be overwritten? And is there code I could use to prevent this?
Note: Code was developed in Visual Studio Code in Windows 10, and was compiled and run using an embedded Git Bash Terminal.
P.S.: I looked at similar questions such as "Rust println! problem - weird behavior inside the println macro" and "Why does my string not match when reading user input from stdin?", but those seem to be newline issues, and I don't think that's the problem here.
To complement the previous answer: the cause of what you ask about ("is there something within UTF-8 that could cause the 'Hash generat' portion of the output statement to be overwritten?") is this line:
let str_hash = String::from_utf8_lossy(&hash);
The reason is in the name: from_utf8_lossy is lossy. UTF-8 is a pretty prescriptive format. You can use this function to "decode" data that isn't actually UTF-8 (for whatever reason), but the way it does this decoding is:
replace any invalid UTF-8 sequences with U+FFFD REPLACEMENT CHARACTER, which looks like this: �
That is what the odd output is: byte sequences that cannot be decoded as UTF-8, replaced by the replacement character.
This happens because hash functions generally return random-looking binary data, meaning bytes across the full range (0 to 255) with no structure. UTF-8 is structured and absolutely does not allow such arbitrary data, so while it's possible for a hash to be valid UTF-8 (though that's not very useful), the odds are very, very low.
That's why hashes (and binary data in general) are usually displayed in alternative representations e.g. hex, base32 or base64.
You could convert the hash to hex before printing it to prevent this; a sketch follows.
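For instance, a minimal sketch, assuming hash is the byte array returned by hasher::bcrypt in the Ok arm above:
// Encode each hash byte as two lowercase hex digits instead of decoding it as UTF-8
let str_hash: String = hash.iter().map(|b| format!("{:02x}", b)).collect();
println!("Hash generated from password {} is {}", password, str_hash);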
Neither of the other answers so far has covered what caused the "Hash generated" part of the output to get overwritten.
Presumably you were running your program in a terminal. Terminals support various "terminal control codes" that tell the terminal things such as which formatting to use for the text it shows and where on the screen the text should be output. These codes are made out of characters, just like strings are, and Unicode and UTF-8 are capable of representing them; the only difference from "regular" text is that the codes start with a "control character" rather than a more normal sort of character, but control characters have UTF-8 encodings of their own. So if you print some randomly generated UTF-8, there's a chance that you'll print something that causes the terminal to do something weird.
There's more than one terminal control code that could produce this particular output, but the most likely possibility is that the hash contained the byte b'\x0D', which UTF-8 decodes as the Unicode character U+000D. This is the terminal control code "CR", which means "print subsequent output at the start of the current line, overwriting anything currently there". (I use this one fairly frequently for printing progress bars, getting the new version of the progress bar to overwrite the old version of the progress bar.) The output that you posted is consistent with accidentally outputting CR, because some random Unicode full of replacement characters ended up overwriting the start of the line you were outputting – and because the code in question is only one byte long (most terminal control codes are much longer), the odds that it might appear in randomly generated UTF-8 are fairly high.
The easiest way to prevent this sort of thing happening when outputting arbitrary UTF-8 in Rust is to use the Debug implementation for str/String rather than the Display implementation; it will output control codes in escaped form rather than outputting them literally (a small sketch follows). (As the other answers say, though, in the case of hashes, it's usual to print them as hex rather than trying to interpret them as UTF-8, as they're likely to contain many byte sequences that aren't valid UTF-8.)
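A minimal sketch of the difference, with a hard-coded carriage return standing in for the random hash byte:
fn main() {
    let s = "Hash generated\rOOPS";
    println!("{}", s);   // Display: the CR moves the cursor to column 0, so "OOPS" overwrites "Hash"
    println!("{:?}", s); // Debug: prints "Hash generated\rOOPS" with the CR escaped
}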
I am writing a terminal program for the Raspberry Pi using ncurses. I want to add a shadow around a box. I want to use mvaddch() to print extended characters such as char 223 (the upper half block character). What would be the syntax for the mvaddch() command? Or is there another way to accomplish this?
You're probably referring to something like code page 866. ncurses will assume your terminal shows characters consistent with the locale encoding, which probably is UTF-8. So (unless you want to convert the characters in your program) the way to go is using Unicode values.
The Unicode organization has tables which you can use to look up a particular code, e.g., ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/PC/CP866.TXT. For your example, the relevant row is
0xdf 0x2580 #UPPER HALF BLOCK
(because 0xdf is 223). You would use the Unicode value 0x2580 in a call to the function mvaddwstr, e.g.
wchar_t mydata[] = { 0x2580, 0 };
mvaddwstr(0,0, mydata);
(the similarly-named wadd_wch uses a data structure which is more complicated).
You would have to link with the ncursesw library and, of course, initialize your program's locale using setlocale, as mentioned in the ncurses manual page. A minimal example follows.
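Putting it together, a minimal C sketch; build with something like gcc demo.c -o demo -lncursesw, assuming a UTF-8 locale:
#define _XOPEN_SOURCE_EXTENDED 1
#include <locale.h>
#include <wchar.h>
#include <ncursesw/curses.h>

int main(void) {
    setlocale(LC_ALL, "");             /* pick up the terminal's (UTF-8) locale */
    initscr();
    wchar_t mydata[] = { 0x2580, 0 };  /* U+2580 UPPER HALF BLOCK */
    mvaddwstr(0, 0, mydata);
    refresh();
    getch();                           /* wait for a key before tearing down */
    endwin();
    return 0;
}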
Most WinAPI calls have both a Unicode and an ANSI variant.
For example:
function MessageBoxA(hWnd: HWND; lpText, lpCaption: LPCSTR; uType: UINT): Integer; stdcall; external user32;
function MessageBoxW(hWnd: HWND; lpText, lpCaption: LPCWSTR; uType: UINT): Integer; stdcall; external user32;
When should I use the ANSI function rather than the Unicode function?
Just as (rare) exceptions to the posted comments/answers...
One may choose to use the ANSI calls in cases where UTF-8 is expected and supported. For example, calling WriteConsoleA with UTF-8 strings in a console set to use a TrueType font and running under chcp 65001 (a sketch follows).
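A minimal C sketch of that scenario (assuming the console is already using a TrueType font):
#include <windows.h>
#include <string.h>

int main(void) {
    SetConsoleOutputCP(CP_UTF8);          /* the programmatic equivalent of chcp 65001 */
    const char *utf8 = "caf\xC3\xA9\n";   /* "café" encoded as UTF-8 */
    DWORD written;
    WriteConsoleA(GetStdHandle(STD_OUTPUT_HANDLE), utf8, (DWORD)strlen(utf8), &written, NULL);
    return 0;
}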
Another oddball exception is functions that are primarily implemented as ANSI, where the Unicode "W" variant simply converts to a narrow string in the active codepage and calls the "A" counterpart. For such a function, and when a narrow string is available, calling the "A" variant directly saves a redundant double conversion. Case in point is OutputDebugString, which fell into this category until Windows 10 (I just noticed https://msdn.microsoft.com/en-us/library/windows/desktop/aa363362.aspx which mentions that a call to WaitForDebugEventEx - only available since Windows 10 - enables true Unicode output for OutputDebugStringW).
Then there are APIs which, even though dealing with strings, are natively ANSI. For example GetProcAddress only exists in the ANSI variant which takes a LPCSTR argument, since names in the export tables are narrow strings.
That said, by and large most string-related APIs are natively Unicode, and one is encouraged to use the "W" variants. Not all the newer APIs even have an "A" variant any longer (e.g. CommandLineToArgvW). From the horse's mouth, https://msdn.microsoft.com/en-us/library/windows/desktop/ff381407.aspx:
Windows natively supports Unicode strings for UI elements, file names, and so forth. Unicode is the preferred character encoding, because it supports all character sets and languages. Windows represents Unicode characters using UTF-16 encoding, in which each character is encoded as a 16-bit value. UTF-16 characters are called wide characters, to distinguish them from 8-bit ANSI characters.
[...]
When Microsoft introduced Unicode support to Windows, it eased the transition by providing two parallel sets of APIs, one for ANSI strings and the other for Unicode strings.
[...]
Internally, the ANSI version translates the string to Unicode. The Windows headers also define a macro that resolves to the Unicode version when the preprocessor symbol UNICODE is defined or the ANSI version otherwise.
[...]
Most newer APIs in Windows have just a Unicode version, with no corresponding ANSI version.
[ NOTE ] The post was edited to add the last two paragraphs.
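To illustrate the parallel A/W functions and the UNICODE macro described in the quote, here is a minimal C sketch; all three calls show the same kind of message box:
#define UNICODE                /* must precede windows.h; makes MessageBox resolve to MessageBoxW */
#include <windows.h>

int main(void) {
    MessageBoxW(NULL, L"Wide (UTF-16) text", L"W variant", MB_OK);
    MessageBoxA(NULL, "Narrow (ANSI) text", "A variant", MB_OK);
    MessageBox(NULL, TEXT("Chosen at compile time"), TEXT("Macro"), MB_OK);
    return 0;
}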
The simplest rule to follow is this: only use the ANSI variants on systems that do not have the Unicode variant. That is, on Windows 95, 98, and ME, the versions of Windows that do not support Unicode.
These days, it is exceptionally unlikely that you will be targeting such versions, and so in all probability you should always just use the Unicode variants.
I was expecting this command
^FO15,240^BY3,2:1^BCN,100,Y,N,Y,^FD>:>842011118888^FS
to generate a
(420) 11118888
interpretation line, but instead it generates
~n42011118888
Does anyone have an idea how to generate the expected output?
TIA!
Joey
If the firmware is up to date, D mode can be used.
^BCo,h,f,g,e,m
^XA
^FO15,240
^BY3,2:1
^BCN,100,Y,N,Y,D
^FD(420)11118888^FS
^XZ
D = UCC/EAN Mode (x.11.x and newer firmware)
This allows dealing with UCC/EAN with and without chained application identifiers. The code starts in the appropriate subset followed by FNC1 to indicate a UCC/EAN 128 bar code. The printer automatically strips out parentheses and spaces for encoding, but prints them in the human-readable section. The printer automatically determines if a check digit is required, calculates it, and prints it. It also automatically sizes the human-readable line.
The ^BC command's "interpretation line" feature does not support auto-insertion of the parentheses. (I think it's safe to assume this is partly because it has no way of determining what your data identifier is by just looking at the data provided - it could be 420, could be 4, could be any other portion of the data starting from the first character.)
My recommendation is that you create a separate text field that handles the logic for the parentheses, and place it just above or below the barcode itself. This is the way I've always approached these in the past; I prefer this method because I have direct control over the font, font size, and formatting of the interpretation line. A sketch follows.
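Something along these lines, reusing your original barcode data; the text field's position (^FO15,360) and font (^A0N,30,30) are illustrative values you would adjust to your label:
^XA
^FX barcode as before, but with its interpretation line turned off (f = N)^FS
^FO15,240^BY3,2:1
^BCN,100,N,N,Y,
^FD>:>842011118888^FS
^FX hand-formatted human-readable line in a separate text field^FS
^FO15,360^A0N,30,30
^FD(420) 11118888^FS
^XZ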
I am trying to put Unicode characters (using a custom font) into a string which I then display using Quartz, but Xcode doesn't like the escape codes for some reason, and I'm really stuck.
CGContextShowTextAtPoint (context, 15, 15, "\u0066", 1);
It doesn't like this (Latin lowercase f) and says it is an "invalid universal character".
CGContextShowTextAtPoint (context, 15, 15, "\ue118", 1);
It doesn't complain about this but displays nothing. When I open the font in FontForge, it shows the glyph as there and valid. Also Font Book validated the font just fine. If I use the font in TextEdit and put in the Unicode character with the character viewer Unicode table, it appears just fine. Just Quartz won't display it.
Any ideas why this isn't working?
The "invalid universal character" error is due to the definition in C99: Essentially \uNNNN escapes are supposed to allow one programmer to call a variable føø and another programmer (who might not be able to type ø) to refer to it as f\u00F8\u00F8. To make parsing easier for everyone, you can't use a \u escape for a control character or a character that is in the "basic character set" (perhaps a lesson learned from Java's unicode escapes which can do crazy things like ending comments).
The second error is probably because "\ue118" is getting compiled to the UTF-8 sequence "\xee\x84\x98" (three chars). CGContextShowTextAtPoint() assumes that one char (byte) is one glyph, and CGContextSelectFont() only supports the encodings kCGEncodingMacRoman (which decodes those bytes to "ÓÑò") and kCGEncodingFontSpecific (what happens there is anyone's guess). The docs say not to use CGContextSetFont() (which does not specify the char-to-glyph mapping) in conjunction with CGContextShowText() or CGContextShowTextAtPoint().
If you know the glyph number, you can use CGContextShowGlyphs(), CGContextShowGlyphsAtPoint(), or CGContextShowGlyphsAtPositions(); a sketch follows.
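For example, a minimal C sketch; the glyph index 0x0123 is a hypothetical placeholder for the index you would look up in FontForge (or via CTFontGetGlyphsForCharacters):
#include <CoreGraphics/CoreGraphics.h>

void drawGlyph(CGContextRef context, CGFontRef font) {
    CGContextSetFont(context, font);     /* glyph IDs are specific to this font */
    CGContextSetFontSize(context, 24.0);
    CGGlyph glyphs[1] = { 0x0123 };      /* hypothetical glyph index */
    CGContextShowGlyphsAtPoint(context, 15, 15, glyphs, 1);
}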
I just changed the font to use standard alphanumeric characters in the end. Much simpler.