MultibyteToWideChar blindly copies invalid ANSI codepoints, WHY? - winapi

Please check my code and result:
void test_MultibyteToWideChar()
{
WCHAR wbuf[10] = {};
int ret = MultiByteToWideChar(1253, MB_ERR_INVALID_CHARS,
"\x98\x99\x9A\x9B\x9C\x9D\xFF", 8,
wbuf, 10);
}
The code aims to convert some codepage-1253(Greek) ANSI bytes to Unicode.
Some input ANSI codepoints(0x98, 0x9a etc) does not have character definition, and I thought MultiByteToWideChar should produce error ERROR_NO_UNICODE_TRANSLATION, or, at least, produce U+FFFD(REPLACEMENT CHARACTER) in output buffer.
However, MultiByteToWideChar silently succeeds, just copy the codepoints(0x98, 0x9a etc) to output WCHAR buffer. I think it is ill behavior.
Any idea on this?
Compile with VS2019 16.11 and run on Win10.1909.
==== Update ====
In my first image, MultibyteToWideChar reported fail by returning 0(=FALSE). I checked again, GetLastError() is 1113, ERROR_NO_UNICODE_TRANSLATION. But I have to tell you, that error is triggered by the final input byte \xFF, not by \x98. We can try to convert a single \x98 and see that MultibyteToWideChar returns success. Sure, in my opinion, it is a fake success.

Related

How do you use wstring_convert to convert between utf16 and utf32?

When you are going from std::u16string to, lets say std::u32string, std::wstring_convert doesn't work as it expects chars. So how does one use std::wstring_convert to convert between UTF-16 and UTF-32 using std::u16string as input?
For example :
inline std::u32string utf16_to_utf32(const std::u16string& s) {
std::wstring_convert<std::codecvt_utf16<char32_t>, char32_t> conv;
return conv.from_bytes(s); // cannot do this, expects 'char'
}
Is it ok to reinterpret_cast to char, as I've seen in a few examples?
If you do need to reinterpret_cast, I've seen some examples using the string size as opposed to the total byte size for the pointers. Is that an error or a requirement?
I know codecvt is deprecated, but until the standard offers an alternative, it has to do.
If you do not want to reinterpret_cast, the only way I've found is to first convert to utf-8, then reconvert to utf-32.
For ex,
// Convert to utf-8.
std::u16string s;
std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> conv;
std::string utf8_str = conv.to_bytes(s);
// Convert to utf-32.
std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> conv;
std::u32string utf32_str = conv.from_bytes(utf8_str);
Yes this is sad and likely contributes to codecvt deprecation.

How can I convert from a C-style WCHAR* to a Rust string? [duplicate]

I'm getting into Rust programming to realize a small program and I'm a little bit lost in string conversions.
In my program, I have a vector as follows:
let mut name: Vec<winnt::WCHAR> = Vec::new();
WCHAR is the same as a u16 on my Windows machine.
I hand over the Vec<u16> to a C function (as a pointer) which fills it with data. I then need to convert the string contained in the vector into a &str. However, no matter, what I try, I can not manage to get this conversion working.
The only thing I managed to get working is to convert it to a WideString:
widestr = unsafe { WideCString::from_ptr_str(name.as_ptr()) };
But this seems to be a step into the wrong direction.
What is the best way to convert the Vec<u16> to an &str under the assumption that the vector holds a valid and null-terminated string.
I then need to convert the string contained in the vector into a &str. However, no matter, what I try, I can not manage to get this conversion working.
There's no way of making this a "free" conversion.
A &str is a Unicode string encoded with UTF-8. This is a byte-oriented encoding. If you have UTF-16 (or the different but common UCS-2 encoding), there's no way to read one as the other. That's equivalent to trying to read a JPEG image as a PDF. Both chunks of data might be a string, but the encoding is important.
The first question is "do you really need to do that?". Many times, you can take data from one function and shovel it back into another function, never looking at it. If you can get away with that, that might be be best answer.
If you do need to transform it, then you have to deal with the errors that can occur. An arbitrary array of 16-bit integers may not be valid UTF-16 or UCS-2. These encodings have edge cases that can easily produce invalid strings. Null-termination is another aspect - Unicode actually allows for embedded NUL characters, so a null-terminated string can't hold all possible Unicode characters!
Once you've ensured that the encoding is valid 1 and figured out how many entries in the input vector comprise the string, then you have to decode the input format and re-encode to the output format. This is likely to require some kind of new allocation, so you are most likely to end up with a String, which can then be used most anywhere a &str can be used.
There is a built-in method to convert UTF-16 data to a String: String::from_utf16. Note that it returns a Result to allow for these error cases. There's also String::from_utf16_lossy, which replaces invalid encoded parts with the Unicode replacement character.
let name = [0x68, 0x65, 0x6c, 0x6c, 0x6f];
let a = String::from_utf16(&name);
let b = String::from_utf16_lossy(&name);
println!("{:?}", a);
println!("{:?}", b);
If you are starting from a pointer to a u16 or WCHAR, you will need to convert to a slice first by using slice::from_raw_parts. If you have a null-terminated string, you need to find the NUL yourself and slice the input appropriately.
1: This is actually a great way of using types; a &str is guaranteed to be UTF-8 encoded, so no further check needs to be made. Similarly, the WideCString is likely to perform a check once upon construction and then can skip the check on later uses.
This is my simple hack for this case. There must be a bug; fix for your own case:
let mut v = vec![0u16; MAX_PATH as usize];
// imaginary win32 function
win32_function(v.as_mut_ptr());
let mut path = String::new();
for val in v.iter() {
let c: u8 = (*val & 0xFF) as u8;
if c == 0 {
break;
} else {
path.push(c as char);
}
}

Why should we use OutputStream.write(byte[] b, int off, int len) instead of OutputStream.write(byte[] b)?

Sorry, everybody. It's a Java beginner question, but I think it will be helpful for a lot of java learners.
FileInputStream fis = new FileInputStream(file);
OutputStream os = socket.getOutputStream();
byte[] buffer = new byte[1024];
int len;
while((len=fis.read(buffer)) != -1){
os.write(buffer, 0, len);
}
The code above is part of FileSenderClient class which is for sending files from client to a server using java.io and java.net.Socket.
My question is that: in the above code, why should we use
os.write(buffer, 0, len)
instead of
os.write(buffer)
In another way to ask this question: what is the point of having a "len" parameter for "OutputStream.write()" method?
It seems both codes are working fine.
while((len=fis.read(buffer)) != -1){
os.write(buffer, 0, len);
}
Because you only want to write data that you actually read. Consider the case where the input consists of N buffers plus one byte. Without the len parameter you would write (N+1)*1024 bytes instead of N*1024+1 bytes. Consider also the case of reading from a socket, or indeed the general case of reading: the actual contract of InputStream.read() is that it transfers at least one byte, not that it fills the buffer. Often it can't, for one reason or another.
It seems both codes are working fine.
No they're not.
It actually does not work in the same way.
It is very likely you used a very small text file to test. But if you look carefully, you will still find there is a lot of extra spaces in the end of you file you received, and the size of the file you received is larger than the file you send.
The reason is that you have created a byte array in a size of 1024 but you don't have so many data to put (or read()) into that byte array. Therefore, the byte array is full with NULL in the end part. When it comes to writing to file, these NULLs are still written into the file and show as spaces " " in Windows Notepad...
If you use advanced text editors like Notepad++ or Sublime Text to view the file you received, you will see these NULL characters.

Buffer overflow or false positive?

getmodulefilenamew function produces false positive (buffer overflow) as it accepts second argument as buffer - of fixed size in our case.
But looking through its documentation: http://msdn.microsoft.com/en-us/library/ms683197%28v=vs.85%29.aspx
Quote: If the buffer is too small to hold the module name, the string is truncated to nSize characters including the terminating null character, the function returns nSize, and the function sets the last error to ERROR_INSUFFICIENT_BUFFER.
Can somebody as trusted third party person confirm or reject this issue as false positive. Thanks for your help!
===
HMODULE applicationModule = GetModuleHandleW(NULL);
WCHAR processName[MAX_PATH];
memset(processName, 0, sizeof(processName));
GetModuleFileNameW(applicationModule, processName, sizeof(processName));
===
The problem is line with GetModuleFileNameW function
Scan was provided by Veracode static analyzer.
Your problem is that you are passing an incorrect value for nSize. You are passing the number of bytes but you should be passing the number of characters, MAX_PATH. These values differ because a wide character has a size of 2 bytes.
So, yes there is an error in your code. If the module name is sufficiently long, Windows will attempt to write up to 520 characters to a buffer that only has room for 260.

Visual Studio C++ 2008 Manipulating Bytes?

I'm trying to write strictly binary data to files (no encoding). The problem is, when I hex dump the files, I'm noticing rather weird behavior. Using either one of the below methods to construct a file results in the same behavior. I even used the System::Text::Encoding::Default to test as well for the streams.
StreamWriter^ binWriter = gcnew StreamWriter(gcnew FileStream("test.bin",FileMode::Create));
(Also used this method)
FileStream^ tempBin = gcnew FileStream("test.bin",FileMode::Create);
BinaryWriter^ binWriter = gcnew BinaryWriter(tempBin);
binWriter->Write(0x80);
binWriter->Write(0x81);
.
.
binWriter->Write(0x8F);
binWriter->Write(0x90);
binWriter->Write(0x91);
.
.
binWriter->Write(0x9F);
Writing that sequence of bytes, I noticed the only bytes that weren't converted to 0x3F in the hex dump were 0x81,0x8D,0x90,0x9D, ... and I have no idea why.
I also tried making character arrays, and a similar situation happens. i.e.,
array<wchar_t,1>^ OT_Random_Delta_Limits = {0x00,0x00,0x03,0x79,0x00,0x00,0x04,0x88};
binWriter->Write(OT_Random_Delta_Limits);
0x88 would be written as 0x3F.
If you want to stick to binary files then don't use StreamWriter. Just use a FileStream and Write/WriteByte. StreamWriters (and TextWriters in generally) are expressly designed for text. Whether you want an encoding or not, one will be applied - because when you're calling StreamWriter.Write, that's writing a char, not a byte.
Don't create arrays of wchar_t values either - again, those are for characters, i.e. text.
BinaryWriter.Write should have worked for you unless it was promoting the values to char in which case you'd have exactly the same problem.
By the way, without specifying any encoding, I'd expect you to get non-0x3F values, but instead the bytes representing the UTF-8 encoded values for those characters.
When you specified Encoding.Default, you'd have seen 0x3F for any Unicode values not in that encoding.
Anyway, the basic lesson is to stick to Stream when you want to deal with binary data rather than text.
EDIT: Okay, it would be something like:
public static void ConvertHex(TextReader input, Stream output)
{
while (true)
{
int firstNybble = input.Read();
if (firstNybble == -1)
{
return;
}
int secondNybble = input.Read();
if (secondNybble == -1)
{
throw new IOException("Reader finished half way through a byte");
}
int value = (ParseNybble(firstNybble) << 4) + ParseNybble(secondNybble);
output.WriteByte((byte) value);
}
}
// value would actually be a char, but as we've got an int in the above code,
// it just makes things a bit easier
private static int ParseNybble(int value)
{
if (value >= '0' && value <= '9') return value - '0';
if (value >= 'A' && value <= 'F') return value - 'A' + 10;
if (value >= 'a' && value <= 'f') return value - 'a' + 10;
throw new ArgumentException("Invalid nybble: " + (char) value);
}
This is very inefficient in terms of buffering etc, but should get you started.
A BinaryWriter() class initialized with a stream will use a default encoding of UTF8 for any chars or strings that are written. I'm guessing that the
binWriter->Write(0x80);
binWriter->Write(0x81);
.
.
binWriter->Write(0x8F);
binWriter->Write(0x90);
binWriter->Write(0x91);
calls are binding to the Write( char) overload so they're going through the character encoder. I'm not very familiar with C++/CLI, but it seems to me that these calls should be binding to Write(Int32), which shouldn't have this problem (maybe your code is really calling Write() with a char variable that's set to the values in your example. That would account for this behavior).
0x3F is commonly known as the ASCII character '?'; the characters that are mapping to it are control characters with no printable representation. As Jon points out, use a binary stream rather than a text-oriented output mechanism for raw binary data.
EDIT -- actually your results look like the inverse of what I would expect. In the default code page 1252, the non-printable characters (i.e. ones likely to map to '?') in that range are 0x81, 0x8D, 0x8F, 0x90 and 0x9D

Resources