When going from std::u16string to, let's say, std::u32string, std::wstring_convert doesn't work directly because it expects chars. So how does one use std::wstring_convert to convert between UTF-16 and UTF-32 using std::u16string as input?
For example:
inline std::u32string utf16_to_utf32(const std::u16string& s) {
    std::wstring_convert<std::codecvt_utf16<char32_t>, char32_t> conv;
    return conv.from_bytes(s); // cannot do this, expects 'char'
}
Is it ok to reinterpret_cast to char, as I've seen in a few examples?
If you do need to reinterpret_cast, I've seen some examples using the string size as opposed to the total byte size for the pointers. Is that an error or a requirement?
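For reference, the cast-based pattern I've seen looks roughly like this (just a sketch, function name made up; it assumes a little-endian platform and computes the end pointer from the total byte size, which is exactly the part I'm unsure about):
inline std::u32string utf16_to_utf32_cast(const std::u16string& s) {
    // Treat the char16_t buffer as raw bytes for from_bytes().
    const char* first = reinterpret_cast<const char*>(s.data());
    const char* last  = first + s.size() * sizeof(char16_t); // total byte size
    // Some examples I've seen use first + s.size() here instead.
    std::wstring_convert<std::codecvt_utf16<char32_t, 0x10ffff, std::little_endian>, char32_t> conv;
    return conv.from_bytes(first, last);
}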
I know codecvt is deprecated, but until the standard offers an alternative, it has to do.
If you do not want to reinterpret_cast, the only way I've found is to first convert to UTF-8, then reconvert to UTF-32.
For example:
// Convert to UTF-8.
std::u16string s;
std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> conv16;
std::string utf8_str = conv16.to_bytes(s);
// Convert to UTF-32.
std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> conv32;
std::u32string utf32_str = conv32.from_bytes(utf8_str);
Yes, this is sad, and it likely contributes to codecvt's deprecation.
Please check my code and result:
void test_MultibyteToWideChar()
{
    WCHAR wbuf[10] = {};
    int ret = MultiByteToWideChar(1253, MB_ERR_INVALID_CHARS,
                                  "\x98\x99\x9A\x9B\x9C\x9D\xFF", 8,
                                  wbuf, 10);
}
The code aims to convert some code page 1253 (Greek) ANSI bytes to Unicode.
Some of the input ANSI code points (0x98, 0x9A, etc.) do not have a character definition, and I thought MultiByteToWideChar should fail with ERROR_NO_UNICODE_TRANSLATION or, at least, produce U+FFFD (REPLACEMENT CHARACTER) in the output buffer.
However, MultiByteToWideChar silently succeeds and just copies the code points (0x98, 0x9A, etc.) to the output WCHAR buffer. I think that is bad behavior.
Any idea on this?
Compiled with VS2019 16.11 and run on Win10 1909.
==== Update ====
In my first image, MultiByteToWideChar reported failure by returning 0 (= FALSE). I checked again: GetLastError() is 1113, ERROR_NO_UNICODE_TRANSLATION. But I have to point out that this error is triggered by the final input byte \xFF, not by \x98. We can try to convert a single \x98 and see that MultiByteToWideChar returns success. In my opinion, that is a fake success.
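For example, the single-byte check looks roughly like this (a minimal sketch; the buffer size of 4 is arbitrary):
WCHAR one[4] = {};
int n = MultiByteToWideChar(1253, MB_ERR_INVALID_CHARS, "\x98", 1, one, 4);
// n is 1 (success) and one[0] is 0x0098 -- the byte is copied through unchanged.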
I'm getting into Rust programming to write a small program, and I'm a little bit lost with string conversions.
In my program, I have a vector as follows:
let mut name: Vec<winnt::WCHAR> = Vec::new();
WCHAR is the same as a u16 on my Windows machine.
I hand over the Vec<u16> to a C function (as a pointer), which fills it with data. I then need to convert the string contained in the vector into a &str. However, no matter what I try, I cannot get this conversion working.
The only thing I managed to get working is to convert it to a WideString:
let widestr = unsafe { WideCString::from_ptr_str(name.as_ptr()) };
But this seems to be a step into the wrong direction.
What is the best way to convert the Vec<u16> to a &str, under the assumption that the vector holds a valid, null-terminated string?
I then need to convert the string contained in the vector into a &str. However, no matter what I try, I cannot get this conversion working.
There's no way of making this a "free" conversion.
A &str is a Unicode string encoded with UTF-8. This is a byte-oriented encoding. If you have UTF-16 (or the different but common UCS-2 encoding), there's no way to read one as the other. That's equivalent to trying to read a JPEG image as a PDF. Both chunks of data might be a string, but the encoding is important.
The first question is "do you really need to do that?". Many times, you can take data from one function and shovel it back into another function, never looking at it. If you can get away with that, that might be the best answer.
If you do need to transform it, then you have to deal with the errors that can occur. An arbitrary array of 16-bit integers may not be valid UTF-16 or UCS-2. These encodings have edge cases that can easily produce invalid strings. Null-termination is another aspect - Unicode actually allows for embedded NUL characters, so a null-terminated string can't hold all possible Unicode characters!
Once you've ensured that the encoding is valid [1] and figured out how many entries in the input vector comprise the string, then you have to decode the input format and re-encode to the output format. This is likely to require some kind of new allocation, so you are most likely to end up with a String, which can then be used almost anywhere a &str can be used.
There is a built-in method to convert UTF-16 data to a String: String::from_utf16. Note that it returns a Result to allow for these error cases. There's also String::from_utf16_lossy, which replaces invalid encoded parts with the Unicode replacement character.
let name = [0x68, 0x65, 0x6c, 0x6c, 0x6f];
let a = String::from_utf16(&name);
let b = String::from_utf16_lossy(&name);
println!("{:?}", a);
println!("{:?}", b);
If you are starting from a pointer to a u16 or WCHAR, you will need to convert to a slice first by using slice::from_raw_parts. If you have a null-terminated string, you need to find the NUL yourself and slice the input appropriately.
[1]: This is actually a great way of using types; a &str is guaranteed to be UTF-8 encoded, so no further check needs to be made. Similarly, the WideCString is likely to perform a check once upon construction and then can skip the check on later uses.
This is my simple hack for this case. There is probably a bug; fix it for your own case:
let mut v = vec![0u16; MAX_PATH as usize];
// imaginary win32 function
win32_function(v.as_mut_ptr());
let mut path = String::new();
for val in v.iter() {
    // Keeps only the low byte of each UTF-16 unit, so this is only
    // correct for characters below U+0100 (ASCII/Latin-1).
    let c: u8 = (*val & 0xFF) as u8;
    if c == 0 {
        break;
    } else {
        path.push(c as char);
    }
}
I would like to create a function in OCaml that returns the char lambda (UTF-8 0x03bb), but I can't use Char.chr because it's not in the ASCII chart. Is there a way to do so? I am new to OCaml...
First note that you are mixing up scalar values (integers in the ranges 0..0xD7FF and 0xE000..0x10FFFF) and their encoding (the byte serialization of such an integer). Don't say UTF-8 0x03bb, as it doesn't make any sense; what you are talking about is the scalar value U+03BB, the integer that represents small lambda in Unicode.
Now, as you noticed, the OCaml char type can't represent such integers, as it is limited to 256 values. What you can do, however, is represent their UTF-8 encoding in OCaml strings, which are (or, more precisely, became) sequences of arbitrary bytes. For U+03BB, its UTF-8 serialization is the byte sequence 0xCE 0xBB, so you can write:
let lambda = "\xCE\xBB"
If you prefer to deal with scalar values directly, you can use a UTF-8 encoder like Uutf (disclaimer: I'm the author) and do, for example:
let lambda = 0x03BB
let lambda_utf_8 =
  let b = Buffer.create 5 in
  Uutf.Buffer.add_utf_8 b lambda; Buffer.contents b
For a short refresher on Unicode and a few biased tips on how to deal with Unicode in OCaml you can consult this minimal Unicode introduction.
UPDATE
Since OCaml 4.06, Unicode escapes are supported in string literals. The following stores the UTF-8 encoding of the lambda character in the lambda string:
let lambda = "\u{03BB}"
Just tagging along after the excellent answer by Daniel: if you just want that 0x03bb, you can do
#require "camomile"
open CamomileLibrary.UChar
let () =
  let c = of_int 955 in
  (* Just to do something with it *)
  print_endline (CamomileLibrary.UPervasives.escaped_uchar c)
Here's a way to actually "see" it on the terminal.
#require "camomile, zed"
open CamomileLibrary.UChar
let () =
  Zed_utf8.make 1 (of_int 955) |> print_endline
and I got the decimal value 955 from: http://www.fileformat.info/info/unicode/char/03bb/index.htm
I still have some trouble understanding UNICODE and ANSI in the Win32 API.
For example, I have this code:
SYSTEMTIME LocalTime = { 0 };
GetSystemTime (&LocalTime);
SetDlgItemText(hWnd, 1003, LocalTime);
That generates the error in the title.
Also, I should mention that it automatically adds a W after "SetDlgItemText". Some macro in VS, probably.
Could someone clarify this for me?
In C or C++ you can't just take an arbitrary structure and pass it to a function that expects a string. You have to convert that structure to a string first.
The Win32 functions GetDateFormat() and GetTimeFormat() can be used to convert a SYSTEMTIME to a string (the first one does the "date" part and the second one does the "time" part) according to the current system locale rules.
For example,
SYSTEMTIME LocalTime = { 0 };
GetSystemTime (&LocalTime);
wchar_t wchBuf[80];
GetDateFormat(LOCALE_USER_DEFAULT, DATE_SHORTDATE, &LocalTime, NULL, wchBuf, sizeof(wchBuf) / sizeof(wchBuf[0]));
SetDlgItemText(hWnd, 1003, wchBuf);
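If you also want the time part, a similar call to GetTimeFormat() can be appended; for example, a rough sketch (the control ID 1004 is made up for illustration):
wchar_t wchTime[80];
GetTimeFormat(LOCALE_USER_DEFAULT, 0, &LocalTime, NULL, wchTime, sizeof(wchTime) / sizeof(wchTime[0]));
SetDlgItemText(hWnd, 1004, wchTime);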
I'm trying to write strictly binary data to files (no encoding). The problem is, when I hex dump the files, I'm noticing rather weird behavior. Using either one of the methods below to construct a file results in the same behavior. I even tested with System::Text::Encoding::Default for the streams as well.
StreamWriter^ binWriter = gcnew StreamWriter(gcnew FileStream("test.bin",FileMode::Create));
(Also used this method)
FileStream^ tempBin = gcnew FileStream("test.bin",FileMode::Create);
BinaryWriter^ binWriter = gcnew BinaryWriter(tempBin);
binWriter->Write(0x80);
binWriter->Write(0x81);
.
.
binWriter->Write(0x8F);
binWriter->Write(0x90);
binWriter->Write(0x91);
.
.
binWriter->Write(0x9F);
Writing that sequence of bytes, I noticed the only bytes that weren't converted to 0x3F in the hex dump were 0x81, 0x8D, 0x90, 0x9D, ... and I have no idea why.
I also tried making character arrays, and a similar thing happens, i.e.,
array<wchar_t,1>^ OT_Random_Delta_Limits = {0x00,0x00,0x03,0x79,0x00,0x00,0x04,0x88};
binWriter->Write(OT_Random_Delta_Limits);
0x88 would be written as 0x3F.
If you want to stick to binary files then don't use StreamWriter. Just use a FileStream and Write/WriteByte. StreamWriters (and TextWriters in general) are expressly designed for text. Whether you want an encoding or not, one will be applied - because when you're calling StreamWriter.Write, that's writing a char, not a byte.
Don't create arrays of wchar_t values either - again, those are for characters, i.e. text.
BinaryWriter.Write should have worked for you unless it was promoting the values to char, in which case you'd have exactly the same problem.
By the way, without specifying any encoding, I'd expect you to get non-0x3F values, but instead the bytes representing the UTF-8 encoded values for those characters.
When you specified Encoding.Default, you'd have seen 0x3F for any Unicode values not in that encoding.
Anyway, the basic lesson is to stick to Stream when you want to deal with binary data rather than text.
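For instance, here is a minimal sketch (untested) of writing those same values through a raw FileStream, so no encoder is ever involved:
FileStream^ raw = gcnew FileStream("test.bin", FileMode::Create);
for (int b = 0x80; b <= 0x9F; ++b)
    raw->WriteByte((unsigned char)b); // each value written as-is, no character encoding
raw->Close();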
EDIT: Okay, it would be something like:
public static void ConvertHex(TextReader input, Stream output)
{
    while (true)
    {
        int firstNybble = input.Read();
        if (firstNybble == -1)
        {
            return;
        }
        int secondNybble = input.Read();
        if (secondNybble == -1)
        {
            throw new IOException("Reader finished half way through a byte");
        }
        int value = (ParseNybble(firstNybble) << 4) + ParseNybble(secondNybble);
        output.WriteByte((byte) value);
    }
}

// value would actually be a char, but as we've got an int in the above code,
// it just makes things a bit easier
private static int ParseNybble(int value)
{
    if (value >= '0' && value <= '9') return value - '0';
    if (value >= 'A' && value <= 'F') return value - 'A' + 10;
    if (value >= 'a' && value <= 'f') return value - 'a' + 10;
    throw new ArgumentException("Invalid nybble: " + (char) value);
}
This is very inefficient in terms of buffering etc, but should get you started.
A BinaryWriter initialized with a stream will use a default encoding of UTF-8 for any chars or strings that are written. I'm guessing that the
binWriter->Write(0x80);
binWriter->Write(0x81);
.
.
binWriter->Write(0x8F);
binWriter->Write(0x90);
binWriter->Write(0x91);
calls are binding to the Write(char) overload, so they're going through the character encoder. I'm not very familiar with C++/CLI, but it seems to me that these calls should be binding to Write(Int32), which shouldn't have this problem (maybe your code is really calling Write() with a char variable that's set to the values in your example; that would account for this behavior).
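To make the overload difference concrete, a rough sketch (untested; "probe.bin" is just an example filename) along these lines could be used to check which overload is being chosen:
BinaryWriter^ bw = gcnew BinaryWriter(gcnew FileStream("probe.bin", FileMode::Create));
bw->Write((int)0x80);     // Write(Int32): raw little-endian bytes 80 00 00 00
bw->Write((wchar_t)0x80); // Write(Char): goes through the writer's encoding (UTF-8 by default)
bw->Close();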
0x3F is commonly known as the ASCII character '?'; the characters that map to it are control characters with no printable representation. As Jon points out, use a binary stream rather than a text-oriented output mechanism for raw binary data.
EDIT -- actually your results look like the inverse of what I would expect. In the default code page 1252, the non-printable characters (i.e. the ones likely to map to '?') in that range are 0x81, 0x8D, 0x8F, 0x90 and 0x9D.