I have a string that I have retrieved from an HTML page using Boost's regex_search(). Unfortunately, the Japanese characters in the page are written as \u escape codes, and regex_search treats these as ordinary characters in the string.
So, my question is, how does one go about converting these codes to normal Unicode text? (UTF-8 obviously)
This is a fundamental issue with fstream having absolutely no regard for UTF-8. It looks like boost has its own implementation of fstream, but changing to it had no effect on my program, and I couldn't find any extra settings to configure boost's fstream to work with UTF-8 (although today is my first day ever working with boost, I could have missed it).
As a final note: I'm running this on linux, but I'd certainly appreciate a portable solution over a system-specific one.
Thanks all, I really appreciate the help :D
fstream is a narrow-character only stream (it's a typedef to basic_fstream<char>). std::wfstream would be the type you're looking for, although to be perfectly portable to, for example, Windows, you may have to introduce C++11 dependencies (Windows has no Unicode locales, but supports locale-independent Unicode conversions introduced by C++11. GCC on Linux doesn't support the new Unicode conversions, but has plenty of Unicode locales to choose from) or rely on boost.locale.
Your steps would be:
parse the string to obtain the hexadecimal values of the code points (a sketch of these first two steps follows this list)
store them as wide characters
write them to a std::wofstream (or convert to UTF-8 first, and then write to a std::ofstream)
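To sketch the first two steps (my own rough illustration, not part of the original answer; it assumes every escape is a backslash, a 'u', and exactly four hex digits, e.g. \u65e5):

#include <cstdlib>
#include <string>

// Expand \uXXXX escapes (exactly four hex digits assumed) into wide characters.
// Code points outside the BMP would need surrogate handling (or a 4-byte wchar_t),
// which this sketch omits. decode_u_escapes is a hypothetical helper name.
std::wstring decode_u_escapes(const std::string& input)
{
    std::wstring out;
    for (std::string::size_type i = 0; i < input.size(); )
    {
        if (input[i] == '\\' && i + 5 < input.size() && input[i + 1] == 'u')
        {
            unsigned long cp = std::strtoul(input.substr(i + 2, 4).c_str(), 0, 16);
            out += static_cast<wchar_t>(cp);
            i += 6;
        }
        else
        {
            out += static_cast<wchar_t>(static_cast<unsigned char>(input[i]));
            ++i;
        }
    }
    return out;
}

The resulting std::wstring can then be written out as shown next.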
To illustrate the last step:
#include <fstream>
#include <locale>
int main()
{
    std::locale::global(std::locale("en_US.utf8")); // any utf8 works
    std::wofstream f("test.txt");
    f.imbue(std::locale());
    f << wchar_t(0x65e5) << wchar_t(0x672c) << wchar_t(0x8a9e) << '\n';
}
produces a file (on Linux) that contains e6 97 a5 e6 9c ac e8 aa 9e 0a
I'm trying to write a short program (short enough that it has a simple main function). First, I list the dependency in the Cargo.toml file:
[dependencies]
passwords = {version = "3.1.3", features = ["crypto"]}
Then when I use the crate in main.rs:
extern crate passwords;
use passwords::hasher;
fn main() {
    let args: Vec<String> = std::env::args().collect();
    if args.len() < 2
    {
        println!("Error! Needed second argument to demonstrate BCrypt Hash!");
        return;
    }
    let password = args.get(1).expect("Expected second argument to exist!").trim();
    let hash_res = hasher::bcrypt(10, "This_is_salt", password);
    match hash_res
    {
        Err(_) => { println!("Failed to generate a hash!"); },
        Ok(hash) => {
            let str_hash = String::from_utf8_lossy(&hash);
            println!("Hash generated from password {} is {}", password, str_hash);
        }
    }
}
The issue arises when I run the following command:
$ target/debug/extern_crate.exe trooper1
And this becomes the output:
?sC�M����k��ed from password trooper1 is ���Ka .+:�
However, this input:
$ target/debug/extern_crate.exe trooper3
produces this:
Hash generated from password trooper3 is ��;��l�ʙ�Y1�>R��G�Ѡd
I'm pretty content with the second output, but is there something within UTF-8 that could cause the "Hash generat" portion of the output statement to be overwritten? And is there code I could use to prevent this?
Note: Code was developed in Visual Studio Code in Windows 10, and was compiled and run using an embedded Git Bash Terminal.
P.S.: I looked at similar questions such as Rust println! problem - weird behavior inside the println macro and Why does my string not match when reading user input from stdin?, but those seem to be newline-related issues, and I don't think that's the problem here.
To complement the previous answer, the answer to your question "is there something within UTF-8 that could cause the 'Hash generat' portion of the output statement to be overwritten?" is this line:
let str_hash = String::from_utf8_lossy(&hash);
The reason's in the name: from_utf8_lossy is lossy. UTF8 is a pretty prescriptive format. You can use this function to "decode" stuff which isn't actually UTF8 (for whatever reason), but the way it will do this decoding is:
replace any invalid UTF-8 sequences with U+FFFD REPLACEMENT CHARACTER, which looks like this: �
And that is where the odd output comes from: byte sequences which cannot be decoded as UTF8 are replaced by the "replacement character".
This happens because hash functions generally return random-looking binary data, meaning bytes across the full range (0 to 255) and with no structure. UTF8 is structured and absolutely does not allow such arbitrary data, so while it's possible that a hash will be valid UTF8 (though that's not very useful), the odds are very, very low.
That's why hashes (and binary data in general) are usually displayed in alternative representations e.g. hex, base32 or base64.
You could convert the hash to hex before printing it to prevent this.
Neither of the other answers so far has covered what caused the "Hash generated" part of the output to get overwritten.
Presumably you were running your program in a terminal. Terminals support various "terminal control codes" that give the terminal information such as which formatting they should use to output the text they're showing, and where the text should be output on the screen. These codes are made out of characters, just like strings are, and Unicode and UTF-8 are capable of representing the characters in question – the only difference from "regular" text is that the codes start with a "control character" rather than a more normal sort of character, but control characters have UTF-8 encodings of their own. So if you try to print some randomly generated UTF-8, there's a chance that you'll print something that causes the terminal to do something weird.
There's more than one terminal control code that could produce this particular output, but the most likely possibility is that the hash contained the byte b'\x0D', which UTF-8 decodes as the Unicode character U+000D. This is the terminal control code "CR", which means "print subsequent output at the start of the current line, overwriting anything currently there". (I use this one fairly frequently for printing progress bars, getting the new version of the progress bar to overwrite the old version of the progress bar.) The output that you posted is consistent with accidentally outputting CR, because some random Unicode full of replacement characters ended up overwriting the start of the line you were outputting – and because the code in question is only one byte long (most terminal control codes are much longer), the odds that it might appear in randomly generated UTF-8 are fairly high.
The easiest way to prevent this sort of thing happening when outputting arbitrary UTF-8 in Rust is to use the Debug implementation for str/String rather than the Display implementation – it will output control codes in escaped form rather than outputting them literally. (As the other answers say, though, in the case of hashes, it's usual to print them as hex rather than trying to interpret them as UTF-8, as they're likely to contain many byte sequences that aren't valid UTF-8.)
I am writing a terminal program for the Raspberry Pi using ncurses. I want to add a shadow around a box. I want to use mvaddch() to print extended characters such as char 223 (the upper-half-block character). What would be the syntax for the mvaddch() call? Or is there another way to accomplish this?
You're probably referring to something like code page 866. ncurses will assume your terminal shows characters consistent with the locale encoding, which probably is UTF-8. So (unless you want to convert the characters in your program) the way to go is using Unicode values.
The Unicode organization has tables which you can use to look up a particular code, e.g., ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/PC/CP866.TXT. For your example, the relevant row is
0xdf 0x2580 #UPPER HALF BLOCK
(because 0xdf is 223). You would use the Unicode 0x2580 in a call to the function mvaddwstr, e.g.
wchar_t mydata[] = { 0x2580, 0 };
mvaddwstr(0,0, mydata);
(the similarly-named wadd_wch uses a data structure which is more complicated).
You would have to link with the ncursesw library, and of course initialize your program's locale using setlocale as mentioned in the ncurses manual page.
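Putting those pieces together, a minimal sketch (my own illustration, not from the original answer; the header path and feature-test macro vary between systems) might be:

// build, for example: g++ shadow_demo.cpp -o shadow_demo -lncursesw
#define _XOPEN_SOURCE_EXTENDED 1   // ask curses.h to expose the wide-character API
#include <clocale>
#include <ncurses.h>               // some systems use <ncursesw/ncurses.h> instead

int main()
{
    setlocale(LC_ALL, "");                 // pick up the terminal's (UTF-8) locale
    initscr();

    wchar_t upper_half[] = { 0x2580, 0 };  // U+2580 UPPER HALF BLOCK
    mvaddwstr(0, 0, upper_half);           // print it at row 0, column 0

    refresh();
    getch();                               // wait for a key press before exiting
    endwin();
    return 0;
}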
I have an Excel file that includes a Latin character, shown as follows:
abcón
After saving it as a CSV file, the Latin character was lost:
abc??n
What causes this problem, and how can I solve it? Thanks.
It's likely that the ó you're using in the Excel file isn't supported in ASCII text. There are a couple of different symbols that look almost, if not entirely, identical. From the Insert->Symbol character map, 00F3 is supported and comes from the Latin Extended alphabet. However, 1F79 from the Greek Extended alphabet is not supported and, from my casual inspection, looks identical. Try replacing the character in question with the one from the character map.
Alternatively, you can use Alt codes: Alt+0243 should produce the character.
I created some .txt files on my Mac (didn't think that would matter at first, but...) so that I could read them in the application I am making in (unfortunately) Visual Studio on a different computer. They are basically files filled with records, with the number of entries per row at the top, e.g.:
2
int int
age name
9 Bob
34 Mary
12 Jim
...
In the code, which I originally just made (and tested successfully) on the Mac, I attempt to read this file and similar ones:
Table TableFromFile(string _filename){ //For a database system
    ifstream infile;
    infile.open(_filename.c_str());
    if(!infile){
        cerr << "File " << _filename << " could not be opened.";
        exit(1);
    }
    //Determine number attributes (columns) in table,
    //which is number on first line of input file
    std::string num;
    getline(infile, num);
    int numEntries = atoi(num.c_str());
    ...
    ...
In short, this causes a crash! As I looked into it, I found some interesting "Error reading characters of string" issues and found that numEntries is getting some crazy negative garbage value. This seems to be caused by the fact that "num", which should just be "2" as read from the first line, is actually coming out as "ÿþ2".
From a little research, it seems that these strange characters are formatting things...perhaps unicode/Mac specific? In any case, they are a problem, and I am wondering if there is a fast and easy way to make the text files I created on my Mac cooperate and behave in Windows just like they did in the Mac terminal. I tried connecting to a UNIX machine, putting a txt file there, running unix2dos on it, and put into back in VS, but to no avail...still those symbols at the start of the line! Should I just make my input files all over again in Windows? I am very surprised to learn that what you see is not always what you get when it comes to characters in a file across platforms...but a good lesson, I suppose.
As the commenter indicated, the bytes you're seeing are the byte order mark. See http://en.wikipedia.org/wiki/Byte_order_mark.
"ÿþ" is 0xFFFE, the UTF-16 "little endian" byte order mark. The "2" is your first actual character (for UTF-16, characters below 256 will be represented by bytes of the for 0xnn00;, where "nn" is the usual ASCII or UTF-8 code for that character, so something trying to read the bytes as ASCII or UTF-8 will do OK until it reaches the first null byte).
If you need to puzzle out the Unicode details of a text file the best tool I know of is the free SC Unipad editor (www.unipad.org). It is Windows-only but can read and write pretty much any encoding and will be able to tell you what there is to know about the file. It is very good at guessing the encoding.
Unipad will be able to open the file and let you save it in whatever encoding you want: ASCII, UTF-8, etc.
I am reading about character sets and encodings on Windows. I noticed that there are two compiler flags in the Visual Studio compiler (for C++) called MBCS and UNICODE. What is the difference between them? What I am not getting is how UTF-8 is conceptually different from an MBCS encoding. Also, I found the following quote in MSDN:
Unicode is a 16-bit character encoding
This negates whatever I read about Unicode. I thought Unicode can be encoded with different encodings such as UTF-8 and UTF-16. Can somebody shed some more light on this confusion?
I noticed that there are two compiler flags in the Visual Studio compiler (for C++) called MBCS and UNICODE. What is the difference between them?
Many functions in the Windows API come in two versions: One that takes char parameters (in a locale-specific code page) and one that takes wchar_t parameters (in UTF-16).
int MessageBoxA(HWND hWnd, const char* lpText, const char* lpCaption, unsigned int uType);
int MessageBoxW(HWND hWnd, const wchar_t* lpText, const wchar_t* lpCaption, unsigned int uType);
Each of these function pairs also has a macro without the suffix, whose definition depends on whether the UNICODE macro is defined.
#ifdef UNICODE
#define MessageBox MessageBoxW
#else
#define MessageBox MessageBoxA
#endif
In order to make this work, the TCHAR type is defined to abstract away the character type used by the API functions.
#ifdef UNICODE
typedef wchar_t TCHAR;
#else
typedef char TCHAR;
#endif
This, however, was a bad idea. You should always explicitly specify the character type.
What I am not getting is how UTF-8 is conceptually different from an MBCS encoding.
MBCS stands for "multi-byte character set". For the literal minded, it seems that UTF-8 would qualify.
But in Windows, "MBCS" only refers to character encodings that can be used with the "A" versions of the Windows API functions. This includes code pages 932 (Shift_JIS), 936 (GBK), 949 (KS_C_5601-1987), and 950 (Big5), but NOT UTF-8.
To use UTF-8, you have to convert the string to UTF-16 using MultiByteToWideChar, call the "W" version of the function, and call WideCharToMultiByte on the output. This is essentially what the "A" functions actually do, which makes me wonder why Windows doesn't just support UTF-8.
This inability to support the most common character encoding makes the "A" version of the Windows API useless. Therefore, you should always use the "W" functions.
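For illustration, here is a minimal sketch of the UTF-8-to-UTF-16 conversion described above (my own example, not part of the original answer; error handling is omitted):

#include <windows.h>
#include <string>

// Convert a UTF-8 std::string to UTF-16 so it can be passed to a "W" function.
std::wstring Utf8ToUtf16(const std::string& utf8)
{
    if (utf8.empty()) return std::wstring();
    int len = MultiByteToWideChar(CP_UTF8, 0, utf8.c_str(), -1, NULL, 0);
    std::wstring utf16(len, L'\0');
    MultiByteToWideChar(CP_UTF8, 0, utf8.c_str(), -1, &utf16[0], len);
    utf16.resize(len - 1);   // drop the terminating null the API wrote into the buffer
    return utf16;
}

int main()
{
    std::string text = "\xE6\x97\xA5\xE6\x9C\xAC\xE8\xAA\x9E";  // "日本語" in UTF-8
    MessageBoxW(NULL, Utf8ToUtf16(text).c_str(), L"UTF-8 demo", MB_OK);
    return 0;
}

Going the other way, after a "W" function hands you UTF-16 text, is the mirror image with WideCharToMultiByte.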
Update: As of Windows 10 build 1903 (May 2019 update), UTF-8 is now supported as an "ANSI" code page. Thus, my original (2010) recommendation to always use "W" functions no longer applies, unless you need to support old versions of Windows. See UTF-8 Everywhere for text-handling advice.
Unicode is a 16-bit character encoding
This negates whatever I read about Unicode.
MSDN is wrong. Unicode is a 21-bit coded character set that has several encodings, the most common being UTF-8, UTF-16, and UTF-32. (There are other Unicode encodings as well, such as GB18030, UTF-7, and UTF-EBCDIC.)
Whenever Microsoft refers to "Unicode", they really mean UTF-16 (or UCS-2). This is for historical reasons. Windows NT was an early adopter of Unicode, back when 16 bits was thought to be enough for everyone, and UTF-8 was only used on Plan 9. So UCS-2 was Unicode.
_MBCS and _UNICODE are macros that determine which version of the TCHAR.H routines to call. For example, if you use _tcsclen to count the length of a string, the preprocessor maps _tcsclen to a different function depending on the two macros _MBCS and _UNICODE:
_UNICODE & _MBCS Not Defined: strlen
_MBCS Defined: _mbslen
_UNICODE Defined: wcslen
To explain the difference between these string-length functions, consider the following example.
Suppose you have a machine running the Simplified Chinese edition of Windows, which uses GBK (code page 936); you compile a GBK-encoded source file and run this code:
printf("%d\n", _mbslen((const unsigned char*)"I爱你M"));
printf("%d\n", strlen("I爱你M"));
printf("%d\n", wcslen((const wchar_t*)"I爱你M"));
The result would be 4 6 3.
Here is the hexadecimal representation of I爱你M in GBK.
GBK: 49 B0 AE C4 E3 4D 00
_mbslen knows this string is encoded in GBK, so it interprets the string correctly and gets the right answer of 4 characters: 49 as I, B0 AE as 爱, C4 E3 as 你, and 4D as M.
strlen just counts bytes until it reaches 0x00, so it gets 6.
wcslen treats this byte array as UTF-16LE text and counts two bytes as one character, so it gets 3: 49 B0, AE C4, E3 4D.
As @xiaokaoy pointed out, the only valid terminator for wcslen is 00 00. Thus the result is not guaranteed to be 3 if the byte following the terminating 00 is not also 00.
MBCS means Multi-Byte Character Set and describes any character set where a character is encoded into (possibly) more than 1 byte.
The ANSI / ASCII character sets are not multi-byte.
UTF-8, however, is a multi-byte encoding. It encodes any Unicode character as a sequence of 1, 2, 3, or 4 octets (bytes).
However, UTF-8 is only one out of several possible concrete encodings of the Unicode character set. Notably, UTF-16 is another, and happens to be the encoding used by Windows / .NET (IIRC). Here's the difference between UTF-8 and UTF-16:
UTF-8 encodes any Unicode character as a sequence of 1, 2, 3, or 4 bytes.
UTF-16 encodes most Unicode characters as 2 bytes, and some as 4 bytes.
It is therefore not correct that Unicode is a 16-bit character encoding. Unicode is rather a 21-bit code space, encompassing code points from U+0000 up to U+10FFFF.
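As a small illustration (my own sketch, using C++11 string literals; not part of the original answer), the same three characters occupy different numbers of code units in the two encodings:

#include <cstdio>
#include <cstring>

int main()
{
    // é (U+00E9): 2 UTF-8 bytes, 1 UTF-16 code unit
    // 日 (U+65E5): 3 UTF-8 bytes, 1 UTF-16 code unit
    // 😀 (U+1F600): 4 UTF-8 bytes, 2 UTF-16 code units (a surrogate pair)
    const char utf8[] = "\xC3\xA9 \xE6\x97\xA5 \xF0\x9F\x98\x80";
    const char16_t utf16[] = u"\u00E9 \u65E5 \U0001F600";

    std::printf("UTF-8 bytes:       %zu\n", std::strlen(utf8));                    // 11
    std::printf("UTF-16 code units: %zu\n", sizeof(utf16) / sizeof(utf16[0]) - 1); // 6
    return 0;
}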
As a footnote to the other answers, MSDN has a document Generic-Text Mappings in TCHAR.H with handy tables summarizing how the preprocessor directives _UNICODE and _MBCS change the definition of different C/C++ types.
As to the phrasing "Unicode" and "Multi-Byte Character Set": people have already described what the effects are. I just want to emphasize that both of those are Microsoft-speak for some very specific things. (That is, they mean something less general and more particular-to-Windows than one might expect if coming from a non-Microsoft-specific understanding of text internationalization.) Those exact phrases show up and tend to get their own separate sections/subsections of Microsoft technical documents, e.g. in "Text and Strings in Visual C++".