I'm trying to convert a UnicodeString to a UTF-8 encoded string in C++Builder. I'm using the UnicodeToUtf8() function to do that.
char * dest;
UnicodeString src;
UnicodeToUtf8(dest, 256, src.w_str(), src.Length());
but I get an access violation at runtime. What am I doing wrong?
Assuming you are using C++Builder 2009 or later (you did not say), and are using the RTL's System::UnicodeString class (and not some other third-party UnicodeString class), then there is a much simpler way to handle this situation. C++Builder also has a System::UTF8String class available (it has been available since C++Builder 6, but did not become a true RTL-implemented UTF-8 string type until C++Builder 2009). Simply assign your UnicodeString to a UTF8String and let the RTL handle the memory allocation and data conversion for you, e.g.:
UnicodeString src = ...;
UTF8String dest = src; // <-- automatic UTF16-to-UTF8 conversion
// use dest.c_str() and dest.Length() as needed...
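For illustration, a slightly fuller sketch of the same approach; some_narrow_api is hypothetical, standing in for any char*-based API you need to feed:
UnicodeString src = L"héllo wörld";
UTF8String utf8 = src;                         // UTF-16 -> UTF-8, RTL-managed
some_narrow_api(utf8.c_str(), utf8.Length());  // some_narrow_api is hypothetical
UnicodeString back = utf8;                     // and back: UTF-8 -> UTF-16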
This fixes the problem in the question, but the real way to do a UTF-16 to UTF-8 conversion is in Remy's answer above.
dest is a pointer to a random location in memory because you never initialize it. In debug builds it probably points to 0, but in release builds it could point anywhere. Either way, you are telling UnicodeToUtf8 that dest is a buffer with room for 256 characters, which it is not.
Try this:
char dest[256]; // room for 256 characters
UnicodeString src = L"Test this";
UnicodeToUtf8(dest, 256, src.c_str(), src.Length());
But in reality you can use the simpler overload:
char dest[256]; // room for 256 characters
UnicodeString src = L"Test this";
UnicodeToUtf8( dest, src, 256 );
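As an aside, the RTL documents UnicodeToUtf8 as returning the number of bytes written to the destination, including the null terminator, so the call can be checked; a minimal sketch, assuming that return convention:
char dest[256];
UnicodeString src = L"Test this";
// 4-argument overload; return value counts bytes written, including the null
unsigned written = UnicodeToUtf8(dest, sizeof(dest), src.c_str(), src.Length());
if (written == 0)
{
    // nothing was converted; handle the error as appropriate
}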
I'm replacing a DLL written in C++ with one written in Rust.
Currently the function in the DLL is called as follows:
BOOL calledFunction(wchar_t* pFileName)
I believe that in this context wchar_t is a 16-bit Unicode character, so I chose to expose the following function in my Rust DLL:
pub fn calledFunction(pFileName: *const u16)
What would be the best way to convert that raw pointer to something I could actually use to open the file from the Rust DLL?
Here is some example code:
use std::ffi::OsString;
use std::os::windows::prelude::*;
unsafe fn u16_ptr_to_string(ptr: *const u16) -> OsString {
    // count the u16 units up to (not including) the NUL terminator
    let len = (0..).take_while(|&i| *ptr.offset(i) != 0).count();
    // view that memory as a &[u16], then decode it as (possibly ill-formed) UTF-16
    let slice = std::slice::from_raw_parts(ptr, len);
    OsString::from_wide(slice)
}
// main example
fn main() {
    let buf = vec![97_u16, 98, 99, 100, 101, 102, 0];
    let ptr = buf.as_ptr(); // raw pointer
    let string = unsafe { u16_ptr_to_string(ptr) };
    println!("{:?}", string);
}
In u16_ptr_to_string, you do 3 things:
get the length of the string by counting the non-zero characters using offset (unsafe)
create a slice using from_raw_parts (unsafe)
transform this &[u16] into an OsString with from_wide
It may be better to use wchar_t and wcslen from the libc crate, plus another crate for the conversion; it is probably a bad idea to reimplement something that is already maintained in a crate.
You need to use OsString, which represents the native string format used by the operating system. In Windows these are specifically 16-bit character strings (usually UTF-16).
Quoting the doc:
OsString and OsStr are useful when you need to transfer strings to and from the operating system itself, or when capturing the output of external commands. Conversions between OsString, OsStr and Rust strings work similarly to those for CString and CStr.
You first need to convert the pointer into a slice, using unsafe code:
use std::slice;

// manifest a slice out of thin air! (illustrative address only)
let ptr = 0x1234 as *const u16;
let nb_elements = 10;
unsafe {
    let slice = slice::from_raw_parts(ptr, nb_elements);
}
This assumes you know the size of your string, meaning your function should also take the number of characters as argument.
The from_wide method should be the one needed to convert from a native format:
use std::ffi::OsString;
use std::os::windows::prelude::*;
// UTF-16 encoding for "Unicode".
let arr = [0x0055, 0x006E, 0x0069, 0x0063, 0x006F, 0x0064, 0x0065];
let string = OsString::from_wide(&arr[..]);
I'm trying to get the contents of a text field in my application, but I can't get it to work. I'm using SendMessage to send WM_GETTEXT, and I save the content to a char *.
I output the char * to a file, but I only get "D" back. This is what I have now:
LRESULT result;
char * output = (char*)malloc(1024);
result = SendMessage(hwnd,WM_GETTEXT,1024,(LPARAM)output);
ofstream file("test.txt");
file << *output;
file.close();
delete [] output;
Pointer concepts:
file << *output; will print only the first character of the string, because dereferencing a char * yields a single char.
file << output; will print the entire null-terminated string.
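Putting that together with the question's code, a corrected sketch of the write-out (keeping the original malloc, which should be paired with free, not delete[]):
char *output = (char*)malloc(1024);
SendMessage(hwnd, WM_GETTEXT, 1024, (LPARAM)output);
ofstream file("test.txt");
file << output;   // stream the whole null-terminated string, not one char
file.close();
free(output);     // free() matches malloc(); delete[] does not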
C# code:
public const uint WM_GETTEXT = 0xD;
const int bufferSize = 10000;

StringBuilder sb = new StringBuilder(bufferSize);
// SendMessageGetText is a P/Invoke declaration for SendMessage (not shown here)
SendMessageGetText(handle, WM_GETTEXT, new UIntPtr(bufferSize), sb);
Console.WriteLine(sb.ToString());
Works properly for me!
Sophia's answer is correct. However, the default for a Visual Studio project is now a Unicode project. You will only get the first letter if your project is Unicode and not MBCS.
Have you examined the buffer returned from WM_GETTEXT to verify it has the entire string?
If not, try declaring your output variable as TCHAR* (to be generic) or as a wchar_t* and see what results you get in the buffer.
p.s. It is bad form to allocate memory with malloc and release it with delete. You should use either malloc/free pairs or new/delete pairs. An even safer way to allocate a char buffer is to use std::string, or std::wstring for a wide string.
p.p.s. Try making sure your project settings are for a multibyte project, not a Unicode project. Then everything in Sophia's answer will work.
One more thing... just use the GetWindowText() API instead of the SendMessage approach. That's why it is there, so you don't have to go through the rigmarole of casting a pointer to an LPARAM or WPARAM. It's more typesafe and will give you a compile-time error (better than a runtime error) if your types don't match up, especially with Unicode/MBCS and wchar_t/char.
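For illustration, a minimal sketch of the GetWindowText route, explicitly using the wide-character API so the result does not depend on the project's character-set setting:
#include <windows.h>
#include <string>

std::wstring ReadWindowText(HWND hwnd)
{
    // Length in characters, not counting the terminating null
    int len = GetWindowTextLengthW(hwnd);
    std::wstring text(len + 1, L'\0');          // room for the terminator
    int copied = GetWindowTextW(hwnd, &text[0], len + 1);
    text.resize(copied);                        // trim to what was actually copied
    return text;
}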
I asked a question here involving C++ and C# communicating. The problem got solved, but it led to a new problem.
This returns a String (C#):
return Marshal.PtrToStringAnsi(decryptsn(InpData));
This expects a TCHAR* (C++):
lpAlpha2[0] = Company::Pins::Bank::Decryption::Decrypt::Decryption("123456");
I've googled how to solve this problem, but I'm not sure why the String has a caret (^) on it. Would it be best to change the return type from String to something else that C++ would accept, or would I need to do a conversion before assigning the value?
String has a ^ because that's the marker for a managed reference. Basically, it's used the same way as * in unmanaged land, except it can only point to an object type, not to other pointer types, or to void.
TCHAR is #defined (or perhaps typedef'd, I can't remember) to either char or wchar_t, based on the _UNICODE preprocessor definition. Therefore, I would use that and write the code both ways.
Either inline:
TCHAR* str;
String^ managedString;
#ifdef _UNICODE
str = (TCHAR*) Marshal::StringToHGlobalUni(managedString).ToPointer();
#else
str = (TCHAR*) Marshal::StringToHGlobalAnsi(managedString).ToPointer();
#endif
// use str...
Marshal::FreeHGlobal(IntPtr(str));
or as a pair of conversion methods, both of which assume that the output buffer has already been allocated and is large enough. Method overloading should make it pick the correct one, based on what TCHAR is defined as.
void ConvertManagedString(String^ managedString, char* outString)
{
    char* str;
    str = (char*) Marshal::StringToHGlobalAnsi(managedString).ToPointer();
    strcpy(outString, str);
    Marshal::FreeHGlobal(IntPtr(str));
}

void ConvertManagedString(String^ managedString, wchar_t* outString)
{
    wchar_t* str;
    str = (wchar_t*) Marshal::StringToHGlobalUni(managedString).ToPointer();
    wcscpy(outString, str);
    Marshal::FreeHGlobal(IntPtr(str));
}
The syntax String^ is C++/CLI talk for "(garbage collected) reference to a System.String".
You have a couple of options for converting a String into a C string, which is another way to express the TCHAR*. My preferred way in C++ would be to store the converted string in a C++ string type, either std::wstring or std::string, depending on whether you build the project as Unicode or MBCS.
In either case you can use something like this:
std::wstring tmp = msclr::interop::marshal_as<std::wstring>( /* Your .NET String */ );
or
std::string tmp = msclr::interop::marshal_as<std::string>(...);
Once you've converted the string into the correct wide or narrow string format, you can then access its C string representation using the c_str() function, like so:
callCFunction(tmp.c_str());
This assumes that callCFunction expects you to pass it a C-style char* or wchar_t* (which TCHAR* will resolve to, depending on your compilation settings).
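Note that marshal_as for std::string/std::wstring requires its own header. A minimal end-to-end sketch; callCFunction here is just a stand-in declaration for whatever native API you are calling:
#include <msclr/marshal_cppstd.h>  // marshal_as for std::string / std::wstring
#include <string>

void callCFunction(const wchar_t* s);  // placeholder for your native API

void PassToNative(System::String^ managed)
{
    std::wstring wide = msclr::interop::marshal_as<std::wstring>(managed);
    callCFunction(wide.c_str());
}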
That is a really rambling way to ask the question, but if you mean how to convert a String ^ to a char *, then you use the same marshaller you used before, only backwards:
char* unmanagedstring = (char *) Marshal::StringToHGlobalAnsi(managedstring).ToPointer();
Edit: don't forget to release the allocated memory when you're done, using Marshal::FreeHGlobal.
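For clarity, here is that pattern with the cleanup included; keeping the IntPtr around ensures the same value is later freed:
using namespace System;
using namespace System::Runtime::InteropServices;

IntPtr p = Marshal::StringToHGlobalAnsi(managedstring);
char* unmanagedstring = static_cast<char*>(p.ToPointer());
// ... use unmanagedstring ...
Marshal::FreeHGlobal(p);   // release the unmanaged copy when done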
How do I convert a char[256] to a wstring?
Update: here is my current code:
char testDest[256];
char *p = _com_util::ConvertBSTRToString(url->bstrVal);
for (int i = 0; i <= strlen(p); i++)
{
    testDest[i] = p[i];
}
// need to convert testDest to wstring so I can pass it to the function below...
writeToFile(testDestwstring);
If your input is a BSTR (as it seems to be), the data is already Unicode, and you can convert it directly to a wstring as follows. _bstr_t has implicit conversions to both char* and wchar_t*, which avoid the need for manual Win32 code-page conversion.
if (url->bstrVal)
{
    // true => make a new copy - can avoid this if the source is
    // no longer needed, by passing false here and skipping SysFreeString on the source
    const _bstr_t wrapper(url->bstrVal, true);
    std::wstring wstrVal((const wchar_t*)wrapper);
}
See here for more details on this area of Windows usage. It's easy to mess up the use of the Win32 API in this area - using the BSTR wrapper to do this avoids both data copy (if used judiciously) and code complexity.
MultiByteToWideChar will return a UTF-16 string; you need to specify the source code page.
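For illustration, a minimal sketch of that call, assuming the source is in the system ANSI code page (CP_ACP); substitute the code page that matches your data:
#include <windows.h>
#include <string>

std::wstring AnsiToWide(const char* src)
{
    // First call computes the required length in wchar_ts, including the null
    int len = MultiByteToWideChar(CP_ACP, 0, src, -1, nullptr, 0);
    if (len <= 0)
        return std::wstring();
    std::wstring out(len, L'\0');
    MultiByteToWideChar(CP_ACP, 0, src, -1, &out[0], len);
    out.resize(len - 1);   // drop the terminating null written by the API
    return out;
}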
I have a file containing UTF-16 strings that I would like to read into a Linux program. The strings were written raw from Windows' internal WCHAR format. (Does Windows always use UTF-16? e.g. in Japanese versions)
I believe that I can read them using raw reads and then convert with wcstombs_l. However, I cannot figure out what locale to use. Running "locale -a" on my up-to-date Ubuntu and Mac OS X machines yields zero locales with utf-16 in their names.
Is there a better way?
Update: the correct answer and others below helped point me to using libiconv. Here's a function I'm using to do the conversion. I currently have it inside a class that makes the conversions into a one-line piece of code.
#include <errno.h>
#include <iconv.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

// Function for converting wchar_t* to char*. (Really: UTF-16LE --> UTF-8)
// It will allocate the space needed for dest. The caller is
// responsible for freeing the memory.
static int iwcstombs_alloc(char **dest, const wchar_t *src)
{
    iconv_t cd;
    const char from[] = "UTF-16LE";
    const char to[] = "UTF-8";

    cd = iconv_open(to, from);
    if (cd == (iconv_t)-1)
    {
        printf("iconv_open(\"%s\", \"%s\") failed: %s\n",
               to, from, strerror(errno));
        return -1;
    }

    // How much space do we need?
    // Guess that we need the same amount of space as used by src.
    // TODO: There should be a while loop around this whole process
    //       that detects insufficient memory space and reallocates
    //       more space.
    int len = sizeof(wchar_t) * (wcslen(src) + 1);
    //printf("len = %d\n", len);

    // Allocate space
    int destLen = len * sizeof(char);
    *dest = (char *)malloc(destLen);
    if (*dest == NULL)
    {
        iconv_close(cd);
        return -1;
    }

    // Convert
    size_t inBufBytesLeft = len;
    char *inBuf = (char *)src;
    size_t outBufBytesLeft = destLen;
    char *outBuf = (char *)*dest;

    // iconv() returns (size_t)-1 on failure, so use size_t for the result
    size_t rc = iconv(cd, &inBuf, &inBufBytesLeft, &outBuf, &outBufBytesLeft);
    if (rc == (size_t)-1)
    {
        printf("iconv() failed: %s\n", strerror(errno));
        iconv_close(cd);
        free(*dest);
        *dest = NULL;
        return -1;
    }

    iconv_close(cd);
    return 0;
} // iwcstombs_alloc()
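For completeness, a usage sketch for the function above; src is assumed to be a NUL-terminated string already holding the UTF-16LE data:
char *utf8 = NULL;
if (iwcstombs_alloc(&utf8, src) == 0)
{
    printf("%s\n", utf8);
    free(utf8);   // the caller owns the buffer, as noted above
}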
The simplest way is to convert the file from UTF-16 to UTF-8, the native UNIX encoding, and then read it:
iconv -f UTF-16 -t UTF-8 file_in.txt -o file_out.txt
You can also use iconv(3) (see man 3 iconv) to convert strings in C. Most other languages have bindings to iconv as well.
Then you can use any UTF-8 locale, such as en_US.UTF-8, which is usually the default on most Linux distros.
(Does Windows always use UTF-16? e.g. in Japanese versions)
Yes, NT's WCHAR is always UTF-16LE.
(The ‘system codepage’, which for Japanese installs is indeed cp932/Shift-JIS, still exists in NT for the benefit of the many, many applications that aren't Unicode-native, FAT32 paths, and so on.)
However, wchar_t is not guaranteed to be 16 bits, and on Linux it isn't: UTF-32 (UCS-4) is used. So wcstombs_l is unlikely to be happy.
The Right Thing would be to use a library like iconv to read it into whichever format you are using internally, presumably wchar_t. You could try to hack it yourself by poking bytes in, but you'd probably get things like surrogate pairs wrong.
Runing "locale -a" on my up-to-date Ubuntu and Mac OS X machines yields zero locales with utf-16 in their names.
Indeed, Linux can't use UTF-16 as a locale default encoding because of all the embedded zero bytes.
You can read as binary, then do your own quick conversion:
http://unicode.org/faq/utf_bom.html#utf16-3
But it is probably safer to use a library (like libiconv) which handles invalid sequences properly.
I would strongly recommend using a Unicode encoding as your program's internal representation. Use either UTF-16 or UTF-8. If you use UTF-16 internally, then obviously no translation is required. If you use UTF-8, you can use a locale with .UTF-8 in it such as en_US.UTF-8.
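If you go the UTF-8 route, selecting a UTF-8 locale at startup is one line; a minimal sketch (the locale name must actually exist on the machine, per locale -a):
#include <locale.h>

int main(void)
{
    // Make mbstowcs/wcstombs and friends convert via UTF-8
    setlocale(LC_ALL, "en_US.UTF-8");
    return 0;
}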