What would be most efficient way to compare QString and char* - performance

What would be the most efficient way to compare a QString and a char*?
if( mystring == mycharstar ) {} will perform a malloc,
and
if( strcmp(mystring.toLocal8Bit().constData(), mycharstar) == 0 ) {}
will allocate a QByteArray.
I would like to avoid any allocation happening; what would you recommend?
What about
if( mystring == QLatin1String(mycharstar) )
Would that be any better?

There is no "efficient" way that only uses casts. This is because QString internally uses 16 bits to encode a single character while C strings use only 8 bits. That means any comparison based on raw memory will almost always return false.
That's why you have to encode the 16-bit characters of the QString into the same encoding as your C string, and that always needs at least one call to malloc().
See also: How to convert QString to std::string?

It could be if( mystring == QLatin1String(mycharstar) ), as suggested here.
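For illustration, here is a minimal sketch of that approach (assuming Qt 5 and that the char* holds Latin-1 data; the helper name is made up for this example). QLatin1String only wraps the existing buffer, so operator== compares in place without building a temporary QString or QByteArray:

#include <QString>

// Hypothetical helper: allocation-free comparison, valid only if
// mycharstar is Latin-1 encoded.
bool equalsLatin1(const QString &mystring, const char *mycharstar)
{
    // QLatin1String just wraps the char buffer (it computes strlen but
    // does not copy); the comparison walks both strings directly.
    return mystring == QLatin1String(mycharstar);
}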

Is ZeroMemory the windows equivalent of null terminating a buffer?

For example, by convention I null-terminate (zero out) a buffer the following way, example 1:
char buffer[1024] = {0};
And with the windows.h library we can call ZeroMemory, example 2:
char buffer[1024];
ZeroMemory(buffer, sizeof(buffer));
According to the documentation provided by Microsoft, ZeroMemory "fills a block of memory with zeros." I want to be accurate in my Windows application, so I thought: what better place to ask than Stack Overflow?
Are these two examples equivalent in logic?
Yes, the two examples are equivalent. The entire array is filled with zeros in both cases.
In the case of char buffer[1024] = {0};, you are explicitly setting only the first char element to 0, and then the compiler implicitly value-initializes the remaining 1023 char elements to 0 for you.
In C++11 and later, you can omit that first element value:
char buffer[1024] = {};
char buffer[1024]{};
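If you are curious what ZeroMemory boils down to, it is documented simply as filling the block with zeros; in practice it behaves like a memset to 0. A rough sketch (not the actual Windows header definition):

#include <cstring>

int main()
{
    char buffer[1024];
    // Same observable effect as ZeroMemory(buffer, sizeof(buffer)):
    std::memset(buffer, 0, sizeof(buffer));
}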

Compatibility of printf with utf-8 encoded strings

I'm trying to format some UTF-8 encoded strings in C code (char *) using the printf function. I need to specify a length in the format. Everything goes well when there are no multi-byte characters in the parameter string, but the result seems to be incorrect when there are some multi-byte chars in the data.
My glibc is kind of old (2.17), so I tried with some online compilers and the result is the same.
#include <stdio.h>
#include <stdlib.h>
#include <locale.h>

int main(void)
{
    setlocale( LC_CTYPE, "en_US.UTF-8" );
    setlocale( LC_COLLATE, "en_US.UTF-8" );

    printf( "'%-4.4s'\n", "elephant" );
    printf( "'%-4.4s'\n", "éléphant" );
    printf( "'%-20.20s'\n", "éléphant" );

    return 0;
}
The result of execution is:
'elep'
'él�'
'éléphant '
The first line is correct (4 chars in the output).
The second line is obviously wrong (at least from a human point of view).
The last line is also wrong: only 18 Unicode chars are written instead of 20.
It seems that printf counts chars before UTF-8 decoding (counting bytes instead of Unicode chars).
Is that a bug in glibc or a well-documented limitation of printf?
It's true that printf counts bytes, not multibyte characters. If it's a bug, the bug is in the C standard, not in glibc (the standard library implementation usually used in conjunction with gcc).
In fairness, counting characters wouldn't help you align Unicode output either, because Unicode characters are not all the same display width even with fixed-width fonts. (Many codepoints are width 0, for example.)
I'm not going to attempt to argue that this behaviour is "well-documented". Standard C's locale facilities have never been particularly adequate to the task, imho, and they have never been particularly well documented, in part because the underlying model attempts to encompass so many possible encodings without ever grounding itself in a concrete example that it is almost impossible to explain. (...Long rant deleted...)
You can use the wchar.h formatted output functions, which count in wide characters. (Which still isn't going to give you correct output alignment, but it will count precision the way you expect.)
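For example, here is a minimal sketch of that wide-character route (it assumes the en_US.UTF-8 locale is installed; mbstowcs decodes the UTF-8 literal into wchar_t first):

#include <stdio.h>
#include <stdlib.h>
#include <locale.h>
#include <wchar.h>

int main(void)
{
    setlocale(LC_ALL, "en_US.UTF-8");
    wchar_t buf[64];
    mbstowcs(buf, "éléphant", 64);   /* decode UTF-8 bytes into wide characters */
    wprintf(L"'%-4.4ls'\n", buf);    /* width/precision now count wchar_t units */
    return 0;
}

This prints 'élép' (four wide characters), though, as noted above, that still says nothing about the display width of those characters.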
Let me quote rici: It's true that printf counts bytes, not multibyte characters. If it's a bug, the bug is in the C standard, not in glibc (the standard library implementation usually used in conjunction with gcc).
However, don't conflate wchar_t and UTF-8. See Wikipedia to grasp the sense of the former. UTF-8, instead, can be dealt with almost as if it were good old ASCII. Just avoid truncating in the middle of a character.
In order to get alignment, you want to count characters, then pass the byte count to printf. That can be achieved by using the * precision and passing the count of bytes. For example, since the accented e takes two bytes:
printf("'%-4.*s'\n", 6, "éléphant");
A function to count bytes is easily coded based on the format of UTF-8 characters:
static int count_bytes(char const *utf8_string, int length)
{
    char const *s = utf8_string;
    for (;;)
    {
        int ch = *(unsigned char *)s++;
        if ((ch & 0xc0) == 0xc0)  // first byte of a multi-byte UTF-8
            while (((ch = *(unsigned char *)s) & 0xc0) == 0x80)
                ++s;
        if (ch == 0)
            break;
        if (--length <= 0)
            break;
    }
    return s - utf8_string;
}
At this point, however, one would end up with lines like this:
printf("'%-4.*s'\n", count_bytes("éléphant", 4), "éléphant");
Having to repeat the string twice quickly becomes a maintenance nightmare. At a minimum, one can define a macro to make sure the string is the same. Assuming the above function is saved in some utf8-util.h file, your program could be rewritten as follows:
#include <stdio.h>
#include <stdlib.h>
#include <locale.h>
#include "utf8-util.h"

#define INT_STR_PAIR(i, s) count_bytes(s, i), s

int main(void)
{
    setlocale( LC_CTYPE, "en_US.UTF-8" );
    setlocale( LC_COLLATE, "en_US.UTF-8" );

    printf( "'%-4.*s'\n", INT_STR_PAIR(4, "elephant"));
    printf( "'%-4.*s'\n", INT_STR_PAIR(4, "éléphant"));
    printf( "'%-4.*s'\n", INT_STR_PAIR(4, "é𐅫éphant"));
    printf( "'%-20.*s'\n", INT_STR_PAIR(20, "éléphant"));

    return 0;
}
The last-but-one test uses 𐅫, the Greek acrophonic thespian three hundred character (U+1016B). Given how the counting works, testing with consecutive non-ASCII characters makes sense. The ancient Greek character looks "wide" enough to see how much space it takes using a fixed-width font. The output may look like:
'elep'
'élép'
'é𐅫ép'
'éléphant '
(On my terminal, those 4-char strings are of equal length.)

array declaration vs pointer + new

I'm not quite sure why this is happening. I have a program:
std::ifstream file(path, std::ios::binary);
file.seekg(0, file.end);
int length = file.tellg();
file.seekg(0, file.beg);
char* buffer = new char[length];
file.read(buffer, length);
file.close();
It is running well, and reads data correctly. However, if I replace buffer's declaration with:
char buffer[length];
then I get a segmentation fault. The size of data is around a few megabytes. What is the difference?
The difference is that "a few megabytes" is too large to put on your process's "stack", where you are now putting the data.
(Furthermore, you are relying on a GCC extension; length, as a variable whose value is not known until runtime, cannot legally be used as the size of a "normal" array like this.)
Put your code back the way it was, and don't forget to delete[] the buffer when you are done using it.
Actually, this would be better:
std::vector<char> buffer(length);
file.read(&buffer[0], length);
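Putting it together, a minimal sketch of the vector-based version (error handling left out for brevity):

#include <fstream>
#include <string>
#include <vector>

std::vector<char> readFile(const std::string &path)
{
    std::ifstream file(path, std::ios::binary);
    file.seekg(0, file.end);
    std::streamsize length = file.tellg();
    file.seekg(0, file.beg);

    std::vector<char> buffer(length);   // heap-allocated, freed automatically
    file.read(buffer.data(), length);   // buffer.data() is C++11; &buffer[0] otherwise
    return buffer;
}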

How do you convert a 'System::String ^' to 'TCHAR'?

I asked a question here involving C++ and C# communicating. The problem got solved, but it led to a new problem.
This returns a String (C#):
return Marshal.PtrToStringAnsi(decryptsn(InpData));
This expects a TCHAR* (C++):
lpAlpha2[0] = Company::Pins::Bank::Decryption::Decrypt::Decryption("123456");
I've googled how to solve this problem, but I am not sure why the String has a caret (^) on it. Would it be best to change the return type from String to something else that C++ would accept? Or would I need to do a conversion before assigning the value?
String has a ^ because that's the marker for a managed reference. Basically, it's used the same way as * in unmanaged land, except it can only point to an object type, not to other pointer types, or to void.
TCHAR is #defined (or perhaps typedefed, I can't remember) to either char or wchar_t, based on the _UNICODE preprocessor definition. Therefore, I would use that and write the code twice.
Either inline:
TCHAR* str;
String^ managedString;
#ifdef _UNICODE
str = (TCHAR*) Marshal::StringToHGlobalUni(managedString).ToPointer();
#else
str = (TCHAR*) Marshal::StringToHGlobalAnsi(managedString).ToPointer();
#endif
// use str.
Marshal::FreeHGlobal(IntPtr(str));
or as a pair of conversion methods, both of which assume that the output buffer has already been allocated and is large enough. Method overloading should make it pick the correct one, based on what TCHAR is defined as.
void ConvertManagedString(String^ managedString, char* outString)
{
    char* str;
    str = (char*) Marshal::StringToHGlobalAnsi(managedString).ToPointer();
    strcpy(outString, str);
    Marshal::FreeHGlobal(IntPtr(str));
}

void ConvertManagedString(String^ managedString, wchar_t* outString)
{
    wchar_t* str;
    str = (wchar_t*) Marshal::StringToHGlobalUni(managedString).ToPointer();
    wcscpy(outString, str);
    Marshal::FreeHGlobal(IntPtr(str));
}
The syntax String^ is C++/CLI talk for "(garbage collected) reference to a System.String".
You have a couple of options for the conversion of a String into a C string, which is another way to express the TCHAR*. My preferred way in C++ would be to store the converted string in a C++ string type, either std::wstring or std::string, depending on whether you are building the project as a Unicode or MBCS project.
In either case you can use something like this:
std::wstring tmp = msclr::interop::marshal_as<std::wstring>( /* Your .NET String */ );
or
std::string tmp = msclr::interop::marshal_as<std::string>(...);
Once you've converted the string into the correct wide or narrow string format, you can then access its C string representation using the c_str() function, like so:
callCFunction(tmp.c_str());
This assumes that callCFunction expects you to pass it a C-style char* or wchar_t* (which TCHAR* will "degrade" to, depending on your compilation settings).
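If you want the right variant picked for you based on TCHAR, one possible sketch (this assumes C++/CLI with /clr and the msclr marshaling header; ToTString is just a name made up for this example):

#include <msclr/marshal_cppstd.h>
#include <tchar.h>
#include <string>

typedef std::basic_string<TCHAR> tstring;

// Hypothetical helper: yields std::wstring in a Unicode build,
// std::string otherwise, matching whatever TCHAR expands to.
tstring ToTString(System::String^ managed)
{
#ifdef _UNICODE
    return msclr::interop::marshal_as<std::wstring>(managed);
#else
    return msclr::interop::marshal_as<std::string>(managed);
#endif
}

// Usage: callCFunction(ToTString(managedString).c_str());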
That is a really rambling way to ask the question, but if you mean how to convert a String ^ to a char *, then you use the same marshaller you used before, only backwards:
char* unmanagedstring = (char *) Marshal::StringToHGlobalAnsi(managedstring).ToPointer();
Edit: don't forget to release the allocated memory with Marshal::FreeHGlobal when you're done.

How do I read Unicode-16 strings from a file using POSIX methods in Linux?

I have a file containing UNICODE-16 strings that I would like to read into a Linux program. The strings were written raw from Windows' internal WCHAR format. (Does Windows always use UTF-16? e.g. in Japanese versions)
I believe that I can read them using raw reads and then converting with wcstombs_l. However, I cannot figure out what locale to use. Running "locale -a" on my up-to-date Ubuntu and Mac OS X machines yields zero locales with utf-16 in their names.
Is there a better way?
Update: the correct answer and others below helped point me to using libiconv. Here's a function I'm using to do the conversion. I currently have it inside a class that makes the conversions into a one-line piece of code.
#include <errno.h>
#include <iconv.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

// Function for converting wchar_t* to char*. (Really: UTF-16LE --> UTF-8)
// It will allocate the space needed for dest. The caller is
// responsible for freeing the memory.
static int iwcstombs_alloc(char **dest, const wchar_t *src)
{
    iconv_t cd;
    const char from[] = "UTF-16LE";
    const char to[] = "UTF-8";

    cd = iconv_open(to, from);
    if (cd == (iconv_t)-1)
    {
        printf("iconv_open(\"%s\", \"%s\") failed: %s\n",
               to, from, strerror(errno));
        return -1;
    }

    // How much space do we need?
    // Guess that we need the same amount of space as used by src.
    // TODO: There should be a while loop around this whole process
    //       that detects insufficient memory space and reallocates
    //       more space.
    int len = sizeof(wchar_t) * (wcslen(src) + 1);

    // Allocate space
    int destLen = len * sizeof(char);
    *dest = (char *)malloc(destLen);
    if (*dest == NULL)
    {
        iconv_close(cd);
        return -1;
    }

    // Convert
    size_t inBufBytesLeft = len;
    char *inBuf = (char *)src;
    size_t outBufBytesLeft = destLen;
    char *outBuf = (char *)*dest;

    size_t rc = iconv(cd,
                      &inBuf,
                      &inBufBytesLeft,
                      &outBuf,
                      &outBufBytesLeft);
    if (rc == (size_t)-1)
    {
        printf("iconv() failed: %s\n", strerror(errno));
        iconv_close(cd);
        free(*dest);
        *dest = NULL;
        return -1;
    }

    iconv_close(cd);
    return 0;
} // iwcstombs_alloc()
The simplest way is to convert the file from UTF-16 to UTF-8, the native UNIX encoding, and then read it:
iconv -f utf16 -t utf8 file_in.txt -o file_out.txt
You can also use iconv(3) (see man 3 iconv) to convert the string in C. Most other languages have bindings to iconv as well.
Then you can use any UTF-8 locale, like en_US.UTF-8, which is usually the default on most Linux distros.
(Does Windows always use UTF-16? e.g. in Japanese versions)
Yes, NT's WCHAR is always UTF-16LE.
(The ‘system codepage’, which for Japanese installs is indeed cp932/Shift-JIS, still exists in NT for the benefit of the many, many applications that aren't Unicode-native, FAT32 paths, and so on.)
However, wchar_t is not guaranteed to be 16 bits, and on Linux it won't be; UTF-32 (UCS-4) is used there. So wcstombs_l is unlikely to be happy.
The Right Thing would be to use a library like iconv to read it into whichever format you are using internally - presumably wchar_t. You could try to hack it yourself by poking bytes in, but you'd probably get things like surrogates wrong.
Running "locale -a" on my up-to-date Ubuntu and Mac OS X machines yields zero locales with utf-16 in their names.
Indeed, Linux can't use UTF-16 as a locale default encoding thanks to all the \0s.
You can read as binary, then do your own quick conversion:
http://unicode.org/faq/utf_bom.html#utf16-3
But it is probably safer to use a library (like libiconv) which handles invalid sequences properly.
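For illustration, here is a minimal sketch of what such a hand-rolled conversion has to deal with: decoding one UTF-16LE unit sequence into a code point, including the surrogate-pair rule described in the FAQ above (not production code; it only rejects bad input rather than recovering from it):

#include <stddef.h>
#include <stdint.h>

/* Decode one UTF-16LE sequence starting at p (avail bytes available).
 * Returns the number of bytes consumed (2 or 4) and stores the code
 * point in *out, or 0 on a truncated or invalid sequence. */
static size_t decode_utf16le(const unsigned char *p, size_t avail, uint32_t *out)
{
    if (avail < 2)
        return 0;
    uint32_t w1 = p[0] | ((uint32_t)p[1] << 8);
    if (w1 < 0xD800 || w1 > 0xDFFF) {   /* ordinary BMP code point */
        *out = w1;
        return 2;
    }
    if (w1 > 0xDBFF || avail < 4)       /* lone low surrogate, or truncated pair */
        return 0;
    uint32_t w2 = p[2] | ((uint32_t)p[3] << 8);
    if (w2 < 0xDC00 || w2 > 0xDFFF)     /* high surrogate must be followed by low */
        return 0;
    *out = 0x10000 + (((w1 - 0xD800) << 10) | (w2 - 0xDC00));
    return 4;
}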
I would strongly recommend using a Unicode encoding as your program's internal representation. Use either UTF-16 or UTF-8. If you use UTF-16 internally, then obviously no translation is required. If you use UTF-8, you can use a locale with .UTF-8 in it such as en_US.UTF-8.
