Read a utf8 file to a std::string without BOM - c++11

I am trying to read UTF-8 content into a char*. My file does not have a BOM, so the code should be straightforward; the file contains Unicode punctuation, something like
char* fileData = "\u2010\u2020";
I cannot see how a single unsigned char (range 0-255) can hold a character whose value can be up to 65535, so I must be missing something.
...
std::ifstream fs8("../test_utf8.txt");
if (fs8.is_open())
{
    unsigned line_count = 1;
    std::string line;
    while (getline(fs8, line))
    {
        std::cout << ++line_count << '\t' << line << '\n';
    }
}
...
So how can I read a UTF-8 file into a char* (or even a std::string)?

Well, you ARE reading the file correctly into the std::string, and std::string does support UTF-8; it is probably just your console* that cannot display non-ASCII characters.
Basically, when a character's code point is too large to fit in a single char (in UTF-8, anything above 127), that character is simply represented by several chars (bytes).
How, and with how many bytes? That is exactly what an encoding defines.
UTF-32, for example, represents every character, ASCII and non-ASCII alike, with 4 bytes, hence the "32" (each byte is 8 bits, 4 * 8 = 32).
Without additional information about which OS you are using, we can't give advice on how your program can display the file's lines.
* Or, more exactly, the standard output, which will probably be implemented as console text.
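To see this concretely, here is a minimal sketch (assuming C++11 semantics for u8 literals and a UTF-8 terminal; the string literal stands in for the file contents from the question):
#include <cstdio>
#include <string>
int main()
{
    // The two punctuation characters the question expects, already UTF-8
    // encoded (3 bytes each). In C++11/14/17 a u8 literal is plain char.
    std::string data = u8"\u2010\u2020";

    std::printf("size() = %zu bytes\n", data.size());      // 6, not 2
    for (unsigned char c : data)
        std::printf("%02X ", static_cast<unsigned>(c));    // the raw UTF-8 bytes
    std::printf("\n");
}
Each code point above U+007F simply occupies more than one char in the std::string; nothing is lost, it is just not one element per character.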

Related

CryptStringToBinary bug? Or do I not understand something?

It seems the value returned by CryptStringToBinary() in the pdwSkip parameter is wrong.
The documentation says:
pdwSkip - a pointer to a DWORD value that receives the number of
characters skipped to reach the beginning of the actual base64 or
hexadecimal strings.
char buf[100] = {0};
DWORD bufSize = sizeof(buf);
DWORD skip = 0, flags = 0;
BOOL rv = CryptStringToBinary("\r\n\t c3Nzc3Nzcw==\r\n\t ", 0, CRYPT_STRING_BASE64,
                              (BYTE*)buf, &bufSize, &skip, &flags);
if (rv) {
    printf("skip=%u\n", skip);
}
The code prints: skip=0. I expected it to be 4, because "the actual base64 string" is "c3Nzc3Nzcw==" and there are 4 characters before it. I tested it on Windows 8.1 with the latest updates.
You are passing CRYPT_STRING_BASE64 in the third parameter; that flag means there is no header to skip.
If you pass CRYPT_STRING_BASE64HEADER instead, the function will interpret your string as PEM encoded data. PEM looks like this:
------ BEGIN STUFF --------------
AAAAAAAAABBBBBBBBBBBBBBBBBCCCCCCD
GGGGGGGGGGGaaaaaaaaaaaaasss666666
------ END STUFF---------
I'm not sure exactly what heuristic the function uses to detect the header (probably a sequence of dashes, followed by some ASCII, followed by more dashes, then EOL), but "\r\n\t " is definitely not a reasonable header for a PEM encoded crypto object; those are merely whitespace characters that are valid within Base64 data. The docs make a reference to "certificate beginning and ending headers"; that's a very specific thing, the PEM header/footer lines.
Not sure if the function is designed to quietly skip whitespace between Base64 characters, the docs are silent on that. That said, quietly skipping whitespace is pretty much a requirement for any PEM friendly Base64 decoder. PEM includes whitespace (the newlines) by design. But they definitely don't count that whitespace as a header. For one thing, whitespace in PEM occurs in the middle.
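If what you actually need is the offset of the Base64 payload inside a whitespace-prefixed string that has no PEM header, you can compute it yourself. A minimal sketch using only the C standard library (the variable names are illustrative):
#include <cstdio>
#include <cstring>
int main()
{
    const char *input = "\r\n\t c3Nzc3Nzcw==\r\n\t ";

    // Count the leading whitespace characters that precede the Base64 payload.
    size_t skip = std::strspn(input, " \t\r\n");

    std::printf("payload starts at offset %zu\n", skip);   // 4 for this input
    return 0;
}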
Once you add a header before the beginning of the actual Base64 data (and use CRYPT_STRING_BASE64HEADER), pdwSkip will receive the number of skipped characters.
Try this:
#include <Windows.h>
#include <wincrypt.h>
#include <stdio.h>
#include <vector>
#pragma comment(lib,"Crypt32.lib")
int main()
{
LPCSTR szPemPublicKey =
"\r\n\t "
"-----BEGIN CERTIFICATE -----"
"MIIBIjANBgkqhkiG9w0BAQEFAAOCAQ8AMIIBCgKCAQEAqRLhZXK29Xo5YdSoMdAe"
"MHwDYAmThPSJzbQaBhVLCY1DTQr0JRkvd+0xfdwih97bWUXVpxuOgYH9hofIzZGP"
"-----END CERTIFICATE -----";
BYTE derPrivateKey[2048];
char buf[1024] = { 0 };
DWORD bufSize = sizeof(buf);
DWORD skip = 0, flags = 0;
BOOL rv = CryptStringToBinary(szPemPublicKey, 0, CRYPT_STRING_BASE64HEADER,
derPrivateKey, &bufSize, &skip, &flags);
if (rv) {
printf("skip=%u\n", skip);
}
}

Compatibility of printf with utf-8 encoded strings

I'm trying to format some UTF-8 encoded strings (char *) in C code using the printf function. I need to specify a length in the format. Everything goes well when there are no multi-byte characters in the parameter string, but the result seems incorrect when the data contains multi-byte characters.
My glibc is kind of old (2.17), so I also tried some online compilers; the result is the same.
#include <stdio.h>
#include <stdlib.h>
#include <locale.h>
int main(void)
{
    setlocale( LC_CTYPE, "en_US.UTF-8" );
    setlocale( LC_COLLATE, "en_US.UTF-8" );
    printf( "'%-4.4s'\n", "elephant" );
    printf( "'%-4.4s'\n", "éléphant" );
    printf( "'%-20.20s'\n", "éléphant" );
    return 0;
}
The result of execution is:
'elep'
'él�'
'éléphant '
The first line is correct (4 chars in the output).
The second line is obviously wrong (at least from a human point of view).
The last line is also wrong: only 18 Unicode chars are written instead of 20.
It seems that printf counts before UTF-8 decoding (counting bytes instead of Unicode chars).
Is that a bug in glibc or a well-documented limitation of printf?
It's true that printf counts bytes, not multibyte characters. If it's a bug, the bug is in the C standard, not in glibc (the standard library implementation usually used in conjunction with gcc).
In fairness, counting characters wouldn't help you align unicode output either, because unicode characters are not all the same display width even with fixed-width fonts. (Many codepoints are width 0, for example.)
I'm not going to attempt to argue that this behaviour is "well-documented". Standard C's locale facilities have never been particularly adequate to the task, imho, and they have never been particularly well documented, in part because the underlying model attempts to encompass so many possible encodings without ever grounding itself in a concrete example that it is almost impossible to explain. (...Long rant deleted...)
You can use the wchar.h formatted output functions, which count in wide characters. (Which still isn't going to give you correct output alignment, but it will count precision the way you expect.)
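For instance, a minimal sketch (assuming a UTF-8 locale, and a program that writes only wide output to stdout, since a stream's orientation cannot be mixed):
#include <stdio.h>
#include <wchar.h>
#include <locale.h>
int main(void)
{
    setlocale(LC_ALL, "en_US.UTF-8");
    /* In the wide-character printf family, field width and precision are
       counted in wide characters (wchar_t), not in bytes. */
    wprintf(L"'%-4.4ls'\n",   L"elephant");
    wprintf(L"'%-4.4ls'\n",   L"éléphant");
    wprintf(L"'%-20.20ls'\n", L"éléphant");
    return 0;
}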
Let me quote rici: It's true that printf counts bytes, not multibyte characters. If it's a bug, the bug is in the C standard, not in glibc (the standard library implementation usually used in conjunction with gcc).
However, don't conflate wchar_t and UTF-8. See Wikipedia to get a sense of the former. UTF-8, on the other hand, can be dealt with almost as if it were good old ASCII. Just avoid truncating in the middle of a character.
In order to get alignment, you want to count characters, then pass the byte count to printf. That can be achieved by using the * precision and passing the count of bytes. For example, since an accented e takes two bytes in UTF-8:
printf("'%-4.*s'\n", 6, "éléphant");
A function to count bytes is easily coded based on the format of UTF-8 characters:
static int count_bytes(char const *utf8_string, int length)
{
    char const *s = utf8_string;
    for (;;)
    {
        int ch = *(unsigned char *)s++;
        if ((ch & 0xc0) == 0xc0) // first byte of a multi-byte UTF-8
            while (((ch = *(unsigned char*)s) & 0xc0) == 0x80)
                ++s;
        if (ch == 0)
            break;
        if (--length <= 0)
            break;
    }
    return s - utf8_string;
}
At this point however, one would end up with lines like so:
printf("'%-4.*s'\n", count_bytes("éléphant", 4), "éléphant");
Having to repeat the string twice quickly becomes a maintenance nightmare. At a minimum, one can define a macro to make sure the string is the same. Assuming the above function is saved in some utf8-util.h file, your program could be rewritten as follows:
#include <stdio.h>
#include <stdlib.h>
#include <locale.h>
#include "utf8-util.h"

#define INT_STR_PAIR(i, s) count_bytes(s, i), s

int main(void)
{
    setlocale( LC_CTYPE, "en_US.UTF-8" );
    setlocale( LC_COLLATE, "en_US.UTF-8" );
    printf( "'%-4.*s'\n", INT_STR_PAIR(4, "elephant"));
    printf( "'%-4.*s'\n", INT_STR_PAIR(4, "éléphant"));
    printf( "'%-4.*s'\n", INT_STR_PAIR(4, "é𐅫éphant"));
    printf( "'%-20.*s'\n", INT_STR_PAIR(20, "éléphant"));
    return 0;
}
The last but one test uses 𐅫, the Greek acrophonic thespian three hundred (U+1016B) character. Given how the counting works, testing with consecutive non-ASCII characters makes sense. The ancient Greek character looks "wide" enough to see how much space it takes using a fixed-width font. The output may look like:
'elep'
'élép'
'é𐅫ép'
'éléphant '
(On my terminal, those 4-char strings are of equal length.)

C++ - Can't overwrite file with a single byte in binary mode after reading it (Windows 10)

I have a problem with this little piece of code. I'm new to C++ and I need to understand how to write a byte (or multiple bytes) to a file in binary mode.
With this code I'm reading the first 11 bytes of foo.txt. Inside foo.txt there is the string "Hello World!!!". On the console we see "Hello World". Up to here everything is OK!
The problem is that I can't overwrite the file (or even a few bytes of it) after reading from it. Why? On Ubuntu the byte at position 11 gets overwritten; on Windows nothing happens at all!
The file is opened in read/write mode with the flags ios_base::in | ios_base::out. When I call file.write(), it should overwrite the data with the byte 0x3F (the ASCII character "?"). That doesn't happen. Why?
PLEASE TRY TO RUN THIS CODE! (change workingDirectory)
Thanks
#include <iostream>
#include <string>
#include <fstream>
using namespace std;
int main()
{
    string workingDirectory = "C:\\Users\\francesco\\Documents\\Test\\";
    string inputFile = workingDirectory + "foo.txt";

    fstream file;
    file.open(inputFile, ios_base::in | ios_base::out | ios_base::binary);

    char *buffer = new char[11];
    file.read(buffer, 11);

    cout << "Track:" << endl;
    for (int i = 0; i < 11; i++)
    {
        cout << buffer[i];
    };
    cout << endl;
    delete[] buffer;

    if (file.is_open())
    {
        cout << "The file is still open!" << endl;
        int num = 63; // 0x3F ASCII --> ?
        file.write((char*)&num, 1);
    }
    else
    {
        cout << "We shouldn't reach this part of code!" << endl;
    }

    file.close();
    return 0;
}
At the moment I have resolved the problem by closing and reopening the file (or by seeking again to byte 11). It would be interesting to know why, on Windows, we have to do this after a read operation.
Maybe you need to clear the error flag after your reading operations.
file.clear();
I don't know what the input for your program is, but my best guess is to add this:
if (file.rdstate() & std::ifstream::eofbit) {
    file.seekg(-1, file.end);
}
before file.clear(); maybe you hit the end of the file. If not, just check whether any error flags are set after reading the file:
if ((file.rdstate() & std::ifstream::failbit) != 0)
    std::cerr << "Error reading 'foo.txt'\n";
I found the problem: the library behaves differently on different OSes.
EDIT:
From C11 standard, ISO/IEC 9899:2011:
...input shall not be directly followed by output without an intervening call to a file positioning function, unless the input operation encounters end-of-file.
Read this, too, regarding the C++ library:
Once a file has been opened in read and write mode, the << operator can be used to insert information into the file, while the >> operator may be used to extract information from the file. These operations may be performed in any order, but a seekg or seekp operation is required when switching between insertions and extractions. The seek operation is used to activate the stream's data used for reading or those used for writing (and vice versa). The istream and ostream parts of fstream objects share the stream's data buffer and by performing the seek operation the stream either activates its istream or its ostream part. If the seek is omitted, reading after writing and writing after reading simply fails.
These are the cases:
- On Windows, it overwrites the first byte only if we don't previously read the first 11 bytes and open the file in read/write mode.
- On Windows, it CAN'T overwrite the byte at position 11 (no errors) if we previously read the first 11 bytes and opened the file in read/write mode.
- On Windows, after reading the first 11 bytes, we have to seek to position 11 in order to overwrite the byte at position 11.
- On Ubuntu, it overwrites the byte at position 11 even if we previously read the first 11 bytes and opened the file in read/write mode.
- On both OSes, if we open the file with the ios_base::out flag only, it overwrites the entire file; not if we also use ios_base::in.
At the moment I have resolved the problem by closing and reopening the file (and/or seeking again to byte 11).
It is interesting that we have to do this after a read operation on Windows only.
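A minimal sketch of the portable fix (same foo.txt as in the question, with a relative path for brevity): put a file-positioning call between the read and the write, as the standard text quoted above requires.
#include <fstream>
#include <iostream>
int main()
{
    std::fstream file("foo.txt",
                      std::ios_base::in | std::ios_base::out | std::ios_base::binary);

    char buffer[11];
    file.read(buffer, sizeof(buffer));              // input ...
    std::cout.write(buffer, file.gcount()) << '\n';

    file.seekp(11, std::ios_base::beg);             // ... intervening positioning call ...
    char byte = 0x3F;                               // '?'
    file.write(&byte, 1);                           // ... so the following output is well-defined
    return 0;
}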

using MultiByteToWideChar

The following code prints the desired output, but it prints garbage at the end of the string. There is something wrong with the last call to MultiByteToWideChar, but I can't figure out what. Please help?
#include "stdafx.h"
#include <Windows.h>
#include <iostream>
#include <string>
#include <tchar.h>
using namespace std;
int main( int, char *[] )
{
    TCHAR szPath[MAX_PATH];
    if (!GetModuleFileName(NULL, szPath, MAX_PATH))
    {
        cout << "Unable to get module path";
        exit(0);
    }

    char ansiStr[MAX_PATH];
    if (!WideCharToMultiByte(CP_ACP, WC_COMPOSITECHECK, szPath, -1,
                             ansiStr, MAX_PATH, NULL, NULL))
    {
        cout << "Unicode to ANSI failed\n";
        cout << GetLastError();
        exit(1);
    }

    string s(ansiStr);
    size_t pos = 0;
    while (1)
    {
        pos = s.find('\\', pos);
        if (pos == string::npos)
            break;
        s.insert(pos, 1, '\\');
        pos += 2;
    }

    if (!MultiByteToWideChar(CP_ACP, MB_PRECOMPOSED, s.c_str(), s.size(), szPath, MAX_PATH))
    {
        cout << "ANSI to Unicode failed";
        exit(2);
    }
    wprintf(L"%s", szPath);
}
MSDN has this to say about the cbMultiByte parameter:
If this parameter is -1, the function processes the entire input
string, including the terminating null character. Therefore, the
resulting Unicode string has a terminating null character, and the
length returned by the function includes this character.
If this parameter is set to a positive integer, the function processes
exactly the specified number of bytes. If the provided size does not
include a terminating null character, the resulting Unicode string is
not null-terminated, and the returned length does not include this
character.
...so if you want the output string to be null-terminated, you should either include the null terminator in the length you pass in, or null-terminate the output yourself based on the return value.
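A minimal sketch of both options, applied to a conversion like the last one in the question (the input string here is only a hypothetical stand-in for s):
#include <Windows.h>
#include <string>
#include <cstdio>
int main()
{
    std::string s = "C:\\\\path\\\\to\\\\module.exe";   // hypothetical input with doubled backslashes
    WCHAR wide[MAX_PATH];

    // Option 1: cbMultiByte = -1 processes the terminating null as well,
    // so the resulting wide string is already null-terminated.
    MultiByteToWideChar(CP_ACP, MB_PRECOMPOSED, s.c_str(), -1, wide, MAX_PATH);
    wprintf(L"%ls\n", wide);

    // Option 2: pass an explicit byte count and terminate manually,
    // using the return value (the number of wide characters written).
    int written = MultiByteToWideChar(CP_ACP, MB_PRECOMPOSED,
                                      s.c_str(), (int)s.size(), wide, MAX_PATH - 1);
    wide[written] = L'\0';
    wprintf(L"%ls\n", wide);
    return 0;
}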

How do I read Unicode-16 strings from a file using POSIX methods in Linux?

I have a file containing UNICODE-16 (i.e. UTF-16) strings that I would like to read into a Linux program. The strings were written raw from Windows' internal WCHAR format. (Does Windows always use UTF-16 internally, e.g. in Japanese versions?)
I believe that I can read them using raw reads and then converting with wcstombs_l. However, I cannot figure out what locale to use. Running "locale -a" on my up-to-date Ubuntu and Mac OS X machines yields zero locales with utf-16 in their names.
Is there a better way?
Update: the correct answer and others below helped point me to using libiconv. Here's a function I'm using to do the conversion. I currently have it inside a class that makes the conversions into a one-line piece of code.
// Function for converting wchar_t* to char*. (Really: UTF-16LE --> UTF-8)
// It will allocate the space needed for dest. The caller is
// responsible for freeing the memory.
// (Requires <iconv.h>, <wchar.h>, <stdio.h>, <stdlib.h>, <string.h>, <errno.h>.)
static int iwcstombs_alloc(char **dest, const wchar_t *src)
{
    iconv_t cd;
    const char from[] = "UTF-16LE";
    const char to[] = "UTF-8";

    cd = iconv_open(to, from);
    if (cd == (iconv_t)-1)
    {
        printf("iconv_open(\"%s\", \"%s\") failed: %s\n",
               to, from, strerror(errno));
        return(-1);
    }

    // How much space do we need?
    // Guess that we need the same amount of space as used by src.
    // TODO: There should be a while loop around this whole process
    //       that detects insufficient memory space and reallocates
    //       more space.
    int len = sizeof(wchar_t) * (wcslen(src) + 1);
    //printf("len = %d\n", len);

    // Allocate space
    int destLen = len * sizeof(char);
    *dest = (char *)malloc(destLen);
    if (*dest == NULL)
    {
        iconv_close(cd);
        return -1;
    }

    // Convert
    size_t inBufBytesLeft = len;
    char *inBuf = (char *)src;
    size_t outBufBytesLeft = destLen;
    char *outBuf = (char *)*dest;

    size_t rc = iconv(cd,
                      &inBuf,
                      &inBufBytesLeft,
                      &outBuf,
                      &outBufBytesLeft);
    if (rc == (size_t)-1)
    {
        printf("iconv() failed: %s\n", strerror(errno));
        iconv_close(cd);
        free(*dest);
        *dest = NULL;
        return -1;
    }

    iconv_close(cd);
    return 0;
} // iwcstombs_alloc()
The simplest way is to convert the file from UTF-16 to UTF-8, the native UNIX encoding, and then read it:
iconv -f utf16 -t utf8 file_in.txt -o file_out.txt
You can also use iconv(3) (see man 3 iconv) to convert strings in C code. Most other languages have bindings to iconv as well.
Then you can use any UTF-8 locale, like en_US.UTF-8, which is usually the default on most Linux distros.
(Does Windows always use UTF-16? e.g. in Japanese versions)
Yes, NT's WCHAR is always UTF-16LE.
(The ‘system codepage’, which for Japanese installs is indeed cp932/Shift-JIS, still exists in NT for the benefit of the many, many applications that aren't Unicode-native, FAT32 paths, and so on.)
However, wchar_t is not guaranteed to be 16 bits, and on Linux it won't be: UTF-32 (UCS-4) is used there. So wcstombs_l is unlikely to be happy.
The Right Thing would be to use a library like iconv to read it into whichever format you are using internally, presumably wchar_t. You could try to hack it yourself by poking bytes in, but you'd probably get things like the surrogates wrong.
Runing "locale -a" on my up-to-date Ubuntu and Mac OS X machines yields zero locales with utf-16 in their names.
Indeed, Linux can't use UTF-16 as a locale default encoding thanks to all the \0s.
You can read as binary, then do your own quick conversion:
http://unicode.org/faq/utf_bom.html#utf16-3
But it is probably safer to use a library (like libiconv) which handles invalid sequences properly.
I would strongly recommend using a Unicode encoding as your program's internal representation. Use either UTF-16 or UTF-8. If you use UTF-16 internally, then obviously no translation is required. If you use UTF-8, you can use a locale with .UTF-8 in it such as en_US.UTF-8.
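Putting the pieces together, here is a minimal sketch (the file name and buffer sizes are illustrative; it assumes the whole file fits in the buffers and that the data is UTF-16LE, as discussed above) that reads the raw bytes with POSIX calls and converts them to a UTF-8 std::string with iconv:
#include <fcntl.h>
#include <unistd.h>
#include <iconv.h>
#include <cstdio>
#include <string>
#include <vector>
int main()
{
    // Read the raw UTF-16LE bytes with POSIX calls.
    int fd = open("strings_utf16le.bin", O_RDONLY);
    if (fd < 0) { std::perror("open"); return 1; }
    std::vector<char> raw(65536);
    ssize_t n = read(fd, raw.data(), raw.size());
    close(fd);
    if (n < 0) { std::perror("read"); return 1; }

    // Convert the bytes to UTF-8; no UTF-16 locale is needed for this.
    iconv_t cd = iconv_open("UTF-8", "UTF-16LE");
    if (cd == (iconv_t)-1) { std::perror("iconv_open"); return 1; }

    std::vector<char> out(4 * raw.size());   // generous guess for the output size
    char *in_p = raw.data(), *out_p = out.data();
    size_t in_left = (size_t)n, out_left = out.size();
    if (iconv(cd, &in_p, &in_left, &out_p, &out_left) == (size_t)-1) {
        std::perror("iconv");
        iconv_close(cd);
        return 1;
    }
    iconv_close(cd);

    std::string utf8(out.data(), out.size() - out_left);
    std::printf("%s\n", utf8.c_str());
    return 0;
}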
