Compatibility of printf with utf-8 encoded strings - gcc

I'm trying to format some UTF-8 encoded strings in C code (char *) using the printf function. I need to specify a precision in the format. Everything goes well when there are no multi-byte characters in the parameter string, but the result seems to be incorrect when the data contains multi-byte characters.
My glibc is kind of old (2.17), so I tried with some online compilers and the result is the same.
#include <stdio.h>
#include <stdlib.h>
#include <locale.h>

int main(void)
{
    setlocale( LC_CTYPE, "en_US.UTF-8" );
    setlocale( LC_COLLATE, "en_US.UTF-8" );

    printf( "'%-4.4s'\n", "elephant" );
    printf( "'%-4.4s'\n", "éléphant" );
    printf( "'%-20.20s'\n", "éléphant" );

    return 0;
}
The result of execution is:
'elep'
'él�'
'éléphant          '
The first line is correct (4 characters in the output).
The second line is obviously wrong (at least from a human point of view).
The last line is also wrong: only 18 Unicode characters are written instead of 20.
It seems that printf counts before UTF-8 decoding (counting bytes instead of Unicode characters).
Is that a bug in glibc or a well-documented limitation of printf?

It's true that printf counts bytes, not multibyte characters. If it's a bug, the bug is in the C standard, not in glibc (the standard library implementation usually used in conjunction with gcc).
In fairness, counting characters wouldn't help you align unicode output either, because unicode characters are not all the same display width even with fixed-width fonts. (Many codepoints are width 0, for example.)
I'm not going to attempt to argue that this behaviour is "well-documented". Standard C's locale facilities have never been particularly adequate to the task, imho, and they have never been particularly well documented, in part because the underlying model attempts to encompass so many possible encodings without ever grounding itself in a concrete example that it is almost impossible to explain. (...Long rant deleted...)
You can use the wchar.h formatted output functions,
which count in wide characters. (Which still isn't going to give you correct output alignment but it will count precision the way you expect.)
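For instance, here is a minimal sketch of that approach, assuming the en_US.UTF-8 locale is installed (and keeping in mind that a stream acquires wide orientation on the first wprintf call, so byte-oriented printf should not be mixed in on the same stream):
#include <locale.h>
#include <wchar.h>

int main(void)
{
    setlocale( LC_ALL, "en_US.UTF-8" );

    /* Precision for %ls in the wide-character printf family is counted
       in wide characters, not in bytes. */
    wprintf( L"'%-4.4ls'\n", L"elephant" );   /* 'elep' */
    wprintf( L"'%-4.4ls'\n", L"éléphant" );   /* 'élép' */
    wprintf( L"'%-20.20ls'\n", L"éléphant" );

    return 0;
}
Field width is counted in wide characters as well, so the 20-column case lines up as long as each character occupies a single terminal column.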

Let me quote rici: It's true that printf counts bytes, not multibyte characters. If it's a bug, the bug is in the C standard, not in glibc (the standard library implementation usually used in conjunction with gcc).
However, don't conflate wchar_t and UTF-8. See Wikipedia to grasp the sense of the former. UTF-8, instead, can be dealt with almost as if it were good old ASCII. Just avoid truncating in the middle of a multi-byte character.
In order to get alignment, you want to count characters, then pass the corresponding byte count to printf. That can be achieved by using the * precision and passing the count of bytes. For example, since each accented e takes two bytes, the first four characters of "éléphant" span six bytes:
printf("'%-4.*s'\n", 6, "éléphant");
A function to count bytes is easily coded based on the format of UTF-8 characters:
static int count_bytes(char const *utf8_string, int length)
{
    char const *s = utf8_string;
    while (length-- > 0)
    {
        int ch = *(unsigned char *)s;
        if (ch == 0)                                     // stop at the terminator without counting it
            break;
        ++s;
        if ((ch & 0xc0) == 0xc0)                         // lead byte of a multi-byte UTF-8 sequence
            while ((*(unsigned char *)s & 0xc0) == 0x80) // skip its continuation bytes
                ++s;
    }
    return (int)(s - utf8_string);
}
At this point however, one would end up with lines like so:
printf("'-4.*s'\n", count_bytes("éléphant", 4), "éléphant");
Having to repeat the string twice quickly becomes a maintenance nightmare. At a minimum, one can define a macro to make sure the string is the same. Assuming the above function is saved in some utf8-util.h file, your program could be rewritten as follows:
#include <stdio.h>
#include <stdlib.h>
#include <locale.h>
#include "utf8-util.h"

#define INT_STR_PAIR(i, s) count_bytes(s, i), s

int main(void)
{
    setlocale( LC_CTYPE, "en_US.UTF-8" );
    setlocale( LC_COLLATE, "en_US.UTF-8" );

    printf( "'%-4.*s'\n", INT_STR_PAIR(4, "elephant"));
    printf( "'%-4.*s'\n", INT_STR_PAIR(4, "éléphant"));
    printf( "'%-4.*s'\n", INT_STR_PAIR(4, "é𐅫éphant"));
    printf( "'%-20.*s'\n", INT_STR_PAIR(20, "éléphant"));

    return 0;
}
The last but one test uses 𐅫, the Greek acrophonic thespian three hundred (U+1016B) character. Given how the counting works, testing with consecutive non-ASCII characters makes sense. The ancient Greek character looks "wide" enough to see how much space it takes using a fixed-width font. The output may look like:
'elep'
'élép'
'é𐅫ép'
'éléphant          '
(On my terminal, those 4-char strings are of equal length.)

Related

What is the correct behavior of std::get_time() for "short" input

I'm trying to understand what should be the correct behavior of C++11
std::get_time() when the input data is "shorter" than expected by the format
string. For example, what the following program should print:
#include <ctime>
#include <iomanip>
#include <sstream>
#include <iostream>

int main (int argc, char* argv[])
{
    using namespace std;
    tm t {};
    istringstream is ("2016");
    is >> get_time (&t, "%Y %d");
    cout << "eof: " << is.eof () << endl
         << "fail: " << is.fail () << endl;
}
Note that get_time()
behavior is described in terms of
std::time_get<CharT,InputIt>::get().
Based on the latter (see paragraph 1c), I would expect both eofbit and failbit to be set, and so the program to print:
eof: 1
fail: 1
However, for all the major Standard C++ Library implementations (libstdc++
10.2.1, libc++ 11.0.0, and MSVC 16.8) it prints:
eof: 1
fail: 0
Interestingly, MSVC before 16.8 prints:
eof: 1
fail: 1
But the commit "std::get_time should not fail when format is longer than the stream"
suggests that this was fixed deliberately.
Could someone clarify whether (and why) the mentioned standard libraries behave correctly and, if that's the case, how one is supposed to detect that the format string was not fully used?
I cannot explain with 100% accuracy, but I can try to explain why the function behaves as observed.
I suspect that condition 1c (which would set both eofbit and failbit) is not applied in your case because case 1b takes precedence:
b) There was a parsing error (err != std::ios_base::goodbit)
According to the eofbit part of https://en.cppreference.com/w/cpp/io/ios_base/iostate , one of the situations in which the eofbit is set is when:
The std::get_time I/O manipulator and any of the std::time_get parsing functions: time_get::get, time_get::get_time, time_get::get_date etc., if the end of the stream is reached before the last character needed to parse the expected date/time value was processed.
The same source for the failbit says:
The time input manipulator std::get_time (technically, time_get::get it calls), if the input cannot be unambiguously parsed as a time value according to the given format string.
So my guess is that when the input is 2000, get tries to read it in using operator>>(std::string&), hits the eof condition and sets the eofbit. This satisfies condition 1b, so condition 1c cannot be applied.
If the function expects a year and the input is shorter than 4 digits, e.g. 200, or if it contains a space after the year, "2000 ", or contains more than 4 digits, e.g. 20001, the function sets failbit. However, if the input is a 4-digit number starting with zeros, e.g. 0005, the function yields eofbit == 1, failbit == 0. This is in accordance with the specification of the %Y format specifier:
parses full year as a 4 digit decimal number, leading zeroes permitted but not required
So I hope this explains why condition 1c is sometimes not taken into account. We can detect that the format string has not been fully used in the usual way, by testing the good() member function. I believe that telling the difference between the function setting failbit or not is of very little practical importance. I also believe the standard is imprecise here, but if we assume that the user is interested in the value of good(), this lack of precision is of no practical relevance.
It is also possible that the value of failbit in the case you consider is implementation-defined: an implementation could try and read exactly 4 characters to satisfy the %Y format specifier, in which case the eofbit would not be set. But this is only my guess.
EDIT
Look at this modification of your program:
int main (int argc, char* argv[])
{
    using namespace std;
    tm t {};
    istringstream is ("2016");
    // is >> get_time (&t, "%Y %d");
    std::string s;
    is >> s;
    cout << "eof: " << is.eof () << endl
         << "fail: " << is.fail () << endl;
}
I replaced get_time with reading into a std::string, but the behavior did not change! The string has been read to its end, so the stream state cannot be set to fail; however, the read hit end-of-file, so the eofbit has been set:
eof: 1
fail: 0
What I'm saying is that a similar phenomenon can take place inside get_time and then the stream's state is propagated up to the result of get_time.
Ok, it seems that all the mentioned implementations behave according to the
C++11 standard.
Here is my understanding of what happens in the above program.
std::get_time() does all the preparations and calls std::time_get<CharT,InputIt>::get().
Since the first format string character is '%', the get() function calls
do_get() at the first iteration of the parsing loop.
do_get() reads "2016" while processing the %Y specifier and fills the
respective field in the time object. Besides that, it sets eofbit according to
the standard, since "the end of the input stream is reached after reading a
character". This makes get() function to bail out from the loop after the
do_get() call due to 1b condition (see get() for details), with only eofbit set for the stream. Note
that the format part that follows %Y is fully ignored.
But if we, for example, change the input stream from "2016" to "2016 " (append the space character), then do_get() doesn't set eofbit, get() reads/matches the spaces in the stream and format after the do_get() call, and then bails out due to 1c condition with both eofbit and failbit set.
Generally, reading with std::get_time() seems to succeed (failbit is not set) when either the format string is fully matched against the stream (which may still have some data in it) or the end of the stream is reached after a conversion specifier was successfully applied (with the rest of the format string ignored).
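A small test harness makes the two cases easy to compare side by side (probe is just an illustrative helper; the comments record the behaviour described above, as reported for current libstdc++, libc++, and MSVC 16.8+):
#include <ctime>
#include <iomanip>
#include <sstream>
#include <iostream>
#include <string>

// Illustrative helper: parse `input` against "%Y %d" and report the stream state.
static void probe(const std::string& input)
{
    std::tm t {};
    std::istringstream is(input);
    is >> std::get_time(&t, "%Y %d");
    std::cout << '"' << input << "\" -> eof: " << is.eof()
              << "  fail: " << is.fail() << '\n';
}

int main()
{
    probe("2016");   // do_get() hits end-of-file inside %Y: eofbit set, failbit clear
    probe("2016 ");  // %Y succeeds, then get() runs out of input while matching " %d":
                     // condition 1c applies, so both eofbit and failbit are set
}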

Is ZeroMemory the windows equivalent of null terminating a buffer?

For example, by convention I zero out a buffer (set the whole buffer to zero) the following way, example 1:
char buffer[1024] = {0};
And with the windows.h library we can call ZeroMemory, example 2:
char buffer[1024];
ZeroMemory(buffer, sizeof(buffer));
According to the documentation provided by Microsoft, ZeroMemory fills a block of memory with zeros. I want to be accurate in my Windows application, so I thought: what better place to ask than Stack Overflow?
Are these two examples equivalent in logic?
Yes, the two codes are equivalent. The entire array is filled with zeros in both cases.
In the case of char buffer[1024] = {0};, you are explicitly setting only the first char element to 0, and then the compiler implicitly value-initializes the remaining 1023 char elements to 0 for you.
In C++11 and later, you can omit that first element value:
char buffer[1024] = {};
char buffer[1024]{};
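To make the comparison concrete, here is a small sketch putting the variants side by side (assuming a Windows build for ZeroMemory; memset is the portable equivalent, and ZeroMemory is typically defined in terms of it):
#include <windows.h>
#include <cstring>

int main()
{
    char a[1024] = {0};            // first element explicit, remaining 1023 value-initialized to 0
    char b[1024] = {};             // C++11: empty initializer, same all-zero result

    char c[1024];                  // uninitialized...
    ZeroMemory(c, sizeof(c));      // ...filled with zeros at run time

    char d[1024];                  // uninitialized...
    std::memset(d, 0, sizeof(d));  // ...portable equivalent of ZeroMemory

    return 0;
}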

Read a utf8 file to a std::string without BOM

I am trying to read UTF-8 content into a char*. My file does not have any BOM, so the code is straightforward (the file contains Unicode punctuation):
char* fileData = "\u2010\u2020";
I cannot see how a single unsigned char (0-255) can contain a character with a value up to 65535, so I must be missing something.
...
std::ifstream fs8("../test_utf8.txt");
if (fs8.is_open())
{
    unsigned line_count = 1;
    std::string line;
    while ( getline(fs8, line))
    {
        std::cout << ++line_count << '\t' << line << '\n';
    }
}
...
So how can I read a UTF-8 file into a char* (or even a std::string)?
Well, you ARE reading the file correctly into the std::string, and std::string does support UTF-8; it's probably your console* that cannot show non-ASCII characters.
Basically, when a character's code point doesn't fit in a single byte (roughly, anything above 127), you simply represent that character with several bytes.
How, and with how many bytes? That is what the encoding is all about.
UTF-32, for example, represents each character, ASCII and non-ASCII alike, as 4 bytes, hence the "32" (each byte is 8 bits, 4*8 = 32).
Without any additional information on what OS you are using, we can't give advice on how your program can display the file's lines.
*Or, more exactly, the standard output, which will probably be displayed as console text.
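To convince yourself that the UTF-8 bytes really are in the std::string regardless of what the console renders, a quick sketch like this (reusing the question's file name) dumps each line in hex next to the raw text:
#include <cstdio>
#include <fstream>
#include <string>

int main()
{
    std::ifstream fs8("../test_utf8.txt");  // same file as in the question
    std::string line;
    while (std::getline(fs8, line))
    {
        // The string holds the raw UTF-8 bytes; print each one in hex.
        for (unsigned char byte : line)
            std::printf("%02X ", byte);
        // Whether this renders correctly depends on the terminal, not on std::string.
        std::printf("| %s\n", line.c_str());
    }
    return 0;
}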

Stack around the variable 'xyz' was corrupted

I'm trying to get a simple piece of code I found on a website to work in VC++ 2010 on Windows Vista 64-bit:
#include "stdafx.h"
#include <windows.h>
int _tmain(int argc, _TCHAR* argv[])
{
DWORD dResult;
BOOL result;
char oldWallPaper[MAX_PATH];
result = SystemParametersInfo(SPI_GETDESKWALLPAPER, sizeof(oldWallPaper)-1, oldWallPaper, 0);
fprintf(stderr, "Current desktop background is %s\n", oldWallPaper);
return 0;
}
It does compile, but when I run it, I always get this error:
Run-Time Check Failure #2 - Stack around the variable 'oldWallPaper' was corrupted.
I'm not sure what is going wrong, but I noticed that the value of oldWallPaper looks something like "C\0:\0\0U\0s\0e\0r\0s[...]" -- I'm wondering where all the \0s come from.
A friend of mine compiled it on Windows XP 32-bit (also VC++ 2010) and is able to run it without problems.
Any clues/hints/opinions?
Thanks
The doc isn't very clear. The returned string is made of WCHARs, two bytes per character, not one, so you need to allocate twice as much space; otherwise you get a buffer overrun. Try:
BOOL result;
WCHAR oldWallPaper[MAX_PATH + 1];

result = SystemParametersInfo(SPI_GETDESKWALLPAPER,
                              MAX_PATH,  // buffer size in characters
                              oldWallPaper, 0);
See also:
http://msdn.microsoft.com/en-us/library/ms724947(VS.85).aspx
http://msdn.microsoft.com/en-us/library/ms235631(VS.80).aspx (string conversion)
Every Windows API function that takes strings has two versions:
SystemParametersInfoA() // ANSI
SystemParametersInfoW() // Unicode
The version ending in W is the wide-character (i.e. Unicode) version of the function. All the \0's you are seeing are because every character you're getting back is UTF-16 encoded - 16 bits (2 bytes) per character - and the second byte happens to be 0. So you need to store the result in a wchar_t array, and use wprintf instead of printf:
wchar_t oldWallPaper[MAX_PATH];
result = SystemParametersInfo(SPI_GETDESKWALLPAPER, MAX_PATH-1, oldWallPaper, 0);
wprintf( L"Current desktop background is %s\n", oldWallPaper );
So you can use the A version SystemParametersInfoA() if you are hell-bent on not using Unicode. For the record you should always try to use Unicode, however.
Usually SystemParametersInfo() is a macro that evaluates to the W version, if UNICODE is defined on your system.
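If you do go the ANSI route, a minimal sketch looks like this (for SPI_GETDESKWALLPAPER, uiParam is the buffer size in characters):
#include <windows.h>
#include <cstdio>

int main()
{
    char oldWallPaper[MAX_PATH + 1] = {};

    // Call the ANSI variant explicitly so the char buffer and the %s
    // format specifier match; uiParam is the buffer size in characters.
    BOOL ok = SystemParametersInfoA(SPI_GETDESKWALLPAPER,
                                    MAX_PATH,
                                    oldWallPaper,
                                    0);
    if (ok)
        std::printf("Current desktop background is %s\n", oldWallPaper);

    return 0;
}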

How do I read Unicode-16 strings from a file using POSIX methods in Linux?

I have a file containing UTF-16 strings that I would like to read into a Linux program. The strings were written raw from Windows' internal WCHAR format. (Does Windows always use UTF-16? E.g. in Japanese versions?)
I believe that I can read them using raw reads and then converting with wcstombs_l. However, I cannot figure out what locale to use. Running "locale -a" on my up-to-date Ubuntu and Mac OS X machines yields zero locales with utf-16 in their names.
Is there a better way?
Update: the correct answer and others below helped point me to using libiconv. Here's a function I'm using to do the conversion. I currently have it inside a class that makes the conversions into a one-line piece of code.
#include <errno.h>
#include <iconv.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

// Function for converting wchar_t* to char*. (Really: UTF-16LE --> UTF-8)
// It will allocate the space needed for dest. The caller is
// responsible for freeing the memory.
static int iwcstombs_alloc(char **dest, const wchar_t *src)
{
    iconv_t cd;
    const char from[] = "UTF-16LE";
    const char to[] = "UTF-8";

    cd = iconv_open(to, from);
    if (cd == (iconv_t)-1)
    {
        printf("iconv_open(\"%s\", \"%s\") failed: %s\n",
               to, from, strerror(errno));
        return -1;
    }

    // How much space do we need?
    // Guess that we need the same amount of space as used by src.
    // TODO: There should be a while loop around this whole process
    //       that detects insufficient memory space and reallocates
    //       more space.
    int len = sizeof(wchar_t) * (wcslen(src) + 1);
    //printf("len = %d\n", len);

    // Allocate space
    int destLen = len * sizeof(char);
    *dest = (char *)malloc(destLen);
    if (*dest == NULL)
    {
        iconv_close(cd);
        return -1;
    }

    // Convert
    size_t inBufBytesLeft = len;
    char *inBuf = (char *)src;
    size_t outBufBytesLeft = destLen;
    char *outBuf = (char *)*dest;

    size_t rc = iconv(cd,
                      &inBuf,
                      &inBufBytesLeft,
                      &outBuf,
                      &outBufBytesLeft);
    if (rc == (size_t)-1)
    {
        printf("iconv() failed: %s\n", strerror(errno));
        iconv_close(cd);
        free(*dest);
        *dest = NULL;
        return -1;
    }

    iconv_close(cd);
    return 0;
} // iwcstombs_alloc()
The simplest way is to convert the file from UTF-16 to UTF-8, the native UNIX encoding, and then read it:
iconv -f utf16 -t utf8 file_in.txt -o file_out.txt
You can also use iconv(3) (see man 3 iconv) to convert strings in C. Most other languages have bindings to iconv as well.
Then you can use any UTF-8 locale, like en_US.UTF-8, which is usually the default one on most Linux distros.
(Does Windows always use UTF-16? e.g. in Japanese versions)
Yes, NT's WCHAR is always UTF-16LE.
(The ‘system codepage’, which for Japanese installs is indeed cp932/Shift-JIS, still exists in NT for the benefit of the many, many applications that aren't Unicode-native, FAT32 paths, and so on.)
However, wchar_t is not guaranteed to be 16 bits, and on Linux it won't be; UTF-32 (UCS-4) is used. So wcstombs_l is unlikely to be happy.
The Right Thing would be to use a library like iconv to read it into whichever format you are using internally - presumably wchar_t. You could try to hack it yourself by poking bytes in, but you'd probably get things like surrogates wrong.
Runing "locale -a" on my up-to-date Ubuntu and Mac OS X machines yields zero locales with utf-16 in their names.
Indeed, Linux can't use UTF-16 as a locale default encoding thanks to all the \0s.
You can read as binary, then do your own quick conversion:
http://unicode.org/faq/utf_bom.html#utf16-3
But it is probably safer to use a library (like libiconv) which handles invalid sequences properly.
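For a sense of what that "quick conversion" involves, here is a rough sketch (the function name is just for illustration, and there is no error recovery for lone or reversed surrogates; a library handles malformed input far better):
#include <cstdint>
#include <cstddef>
#include <vector>

// Rough sketch: decode raw UTF-16LE bytes into Unicode code points,
// pairing surrogates by hand. No validation of malformed input.
static std::vector<std::uint32_t> decode_utf16le(const std::vector<std::uint8_t>& bytes)
{
    std::vector<std::uint32_t> out;
    for (std::size_t i = 0; i + 1 < bytes.size(); i += 2)
    {
        std::uint32_t unit = bytes[i] | (std::uint32_t(bytes[i + 1]) << 8);
        if (unit >= 0xD800 && unit <= 0xDBFF && i + 3 < bytes.size())
        {
            std::uint32_t low = bytes[i + 2] | (std::uint32_t(bytes[i + 3]) << 8);
            if (low >= 0xDC00 && low <= 0xDFFF)
            {
                unit = 0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00);
                i += 2;  // the low surrogate has been consumed as well
            }
        }
        out.push_back(unit);
    }
    return out;
}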
I would strongly recommend using a Unicode encoding as your program's internal representation. Use either UTF-16 or UTF-8. If you use UTF-16 internally, then obviously no translation is required. If you use UTF-8, you can use a locale with .UTF-8 in it such as en_US.UTF-8.
