libXML2 cannot properly read its own UTF-8 XML output

I would like to parse a UTF-8 XML file with libXML2.
My code is written in C and I use v2.9.3 of libXML2.
My code follows:
#include <stdio.h>
#include <libxml/xmlreader.h>
#include <libxml/xmlwriter.h>

xmlTextReaderPtr reader;
xmlTextWriterPtr writer;

writer = xmlNewTextWriterFilename("test.xml", 0);
xmlTextWriterStartDocument(writer, NULL, "UTF-8", NULL);
xmlTextWriterStartElement(writer, BAD_CAST "node_with_é_character");
xmlTextWriterEndElement(writer);
xmlTextWriterEndDocument(writer);
xmlFreeTextWriter(writer);

reader = xmlReaderForFile("test.xml", "UTF-8", XML_PARSE_RECOVER);
int ret = 1;
while (ret == 1) {
    const xmlChar *nameT = xmlTextReaderConstName(reader);
    printf("\n ---> %s\n", nameT);
    ret = xmlTextReaderRead(reader);
}
xmlFreeTextReader(reader);
Output is:
---> (null)
---> node_with_é_character
The problem is that the trace shows "node_with_é_character" and not "node_with_é_character".
My command prompt is set to code page 1252 (chcp 1252).
I don't understand why libXML2 cannot store/read the "é" character.

As noted in the comments, you're under Windows, so I guess your source file is not saved as UTF-8, which means the C string literal "node_with_é_character" is not UTF-8 encoded in your executable.
I don't know the libxml2 interfaces well, but the official example is quite clear that it expects its input parameters in UTF-8. See http://xmlsoft.org/examples/testWriter.c:
/* Write a comment as child of EXAMPLE.
 * Please observe, that the input to the xmlTextWriter functions
 * HAS to be in UTF-8, even if the output XML is encoded
 * in iso-8859-1 */
tmp = ConvertInput("This is a comment with special chars: <\xE4\xF6\xFC>",
                   MY_ENCODING);
Saving your source file as UTF-8 should fix your issue.
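If you cannot re-save the source file, another way around it is to spell out the UTF-8 bytes of "é" (0xC3 0xA9) in the literal, so the element name stays valid UTF-8 regardless of the source file's code page. A minimal sketch using the same calls as the question's code (the wrapper name write_example is only for illustration):

#include <libxml/xmlwriter.h>

/* "\xC3\xA9" is the UTF-8 encoding of "é"; writing it explicitly keeps
   the literal valid UTF-8 even if the source file is saved as CP-1252. */
void write_example(void)
{
    xmlTextWriterPtr writer = xmlNewTextWriterFilename("test.xml", 0);
    xmlTextWriterStartDocument(writer, NULL, "UTF-8", NULL);
    xmlTextWriterStartElement(writer, BAD_CAST "node_with_\xC3\xA9_character");
    xmlTextWriterEndElement(writer);
    xmlTextWriterEndDocument(writer);
    xmlFreeTextWriter(writer);
}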

Related

What does "Stream did not contain valid UTF-8" mean?

I'm creating a simple HTTP server. I need to read the requested image and send it to the browser. I'm using this code:
use std::fs::File;
use std::io::Read;
use std::path::Path;

fn read_file(mut file_name: String) -> String {
    file_name = file_name.replace("/", "");
    if file_name.is_empty() {
        file_name = String::from("index.html");
    }

    let path = Path::new(&file_name);
    if !path.exists() {
        return String::from("Not Found!");
    }

    let mut file_content = String::new();
    let mut file = File::open(&file_name).expect("Unable to open file");
    let res = match file.read_to_string(&mut file_content) {
        Ok(content) => content,
        Err(why) => panic!("{}", why),
    };

    return file_content;
}
This works if the requested file is text-based, but when I try to read an image I get the following message:
stream did not contain valid UTF-8
What does it mean and how do I fix it?
The documentation for String describes it as:
A UTF-8 encoded, growable string.
The Wikipedia definition of UTF-8 will give you a great deal of background on what that is. The short version is that computers use a unit called a byte to represent data. Unfortunately, these blobs of data represented with bytes have no intrinsic meaning; that has to be provided from outside. UTF-8 is one way of interpreting a sequence of bytes, as are file formats like JPEG.
UTF-8, like most text encodings, has specific requirements and sequences of bytes that are valid and invalid. Whatever image you have tried to load contains a sequence of bytes that cannot be interpreted as a UTF-8 string; this is what the error message is telling you.
To fix it, you should not use a String to hold arbitrary collections of bytes. In Rust, that's better represented by a Vec:
use std::fs::File;
use std::io::Read;
use std::path::Path;

fn read_file(mut file_name: String) -> Vec<u8> {
    file_name = file_name.replace("/", "");
    if file_name.is_empty() {
        file_name = String::from("index.html");
    }

    let path = Path::new(&file_name);
    if !path.exists() {
        return String::from("Not Found!").into();
    }

    let mut file_content = Vec::new();
    let mut file = File::open(&file_name).expect("Unable to open file");
    file.read_to_end(&mut file_content).expect("Unable to read");
    file_content
}
To evangelize a bit, this is a great illustration of why Rust is a nice language. Because there is a type that represents "a set of bytes that is guaranteed to be a valid UTF-8 string", we can write safer programs, since we know this invariant will always hold. We don't have to keep re-checking throughout our program to "make sure" it's still a string.

C Program Strange Characters retrieved due to language setting on Windows

If the code below is compiled with UNICODE defined, the GetComputerNameEx API returns junk characters.
If it is compiled without UNICODE, the API returns a truncated hostname.
The issue is mostly seen with Asia-Pacific languages such as Chinese, Japanese, and Korean (i.e., non-English).
Can anyone shed some light on how this issue can be resolved?
#include <windows.h>
#include <stdio.h>
#include <stdlib.h>

#define INFO_SIZE 30

int main()
{
    int ret;
    TCHAR infoBuf[INFO_SIZE+1];
    DWORD bufSize = (INFO_SIZE+1);
    char *buf;
    buf = (char *) malloc(INFO_SIZE+1);

    if (!GetComputerNameEx((COMPUTER_NAME_FORMAT)1,
                           (LPTSTR)infoBuf, &bufSize))
    {
        printf("GetComputerNameEx failed (%d)\n", GetLastError());
        return -1;
    }

    ret = wcstombs(buf, infoBuf, (INFO_SIZE+1));
    buf[INFO_SIZE] = '\0';
    return 0;
}
In the languages you mentioned, most characters are represented by more than one byte. This is because these languages have alphabets of far more than 256 characters, so you may need more than 30 bytes to encode 30 characters.
The usual pattern for calling a function like wcstombs goes like this: first ask how many bytes are required, then allocate a buffer, then convert the string.
(edit: passing NULL as the destination actually relies on a POSIX extension, which is also implemented by the Windows CRT)
size_t size = wcstombs(NULL, infoBuf, 0);
if (size == (size_t) -1) {
    // some character can't be converted
}
char *buf = malloc(size + 1);  /* malloc rather than C++ new, since this is C */
size = wcstombs(buf, infoBuf, size + 1);
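For reference, here is a sketch of the whole program using that two-call pattern, compiled for wide characters. The ComputerNameDnsHostname format, the 256-element buffer, and the setlocale() call (so that wcstombs uses the user's code page instead of the default "C" locale) are my assumptions, not from the original post:

#include <windows.h>
#include <stdio.h>
#include <stdlib.h>
#include <locale.h>

int main(void)
{
    WCHAR wideBuf[256];
    DWORD wideSize = sizeof wideBuf / sizeof wideBuf[0];

    /* Use the current user locale so wcstombs can emit multibyte characters. */
    setlocale(LC_ALL, "");

    if (!GetComputerNameExW(ComputerNameDnsHostname, wideBuf, &wideSize)) {
        printf("GetComputerNameExW failed (%lu)\n", GetLastError());
        return -1;
    }

    /* First call: ask for the required byte count (POSIX/MSVC extension). */
    size_t size = wcstombs(NULL, wideBuf, 0);
    if (size == (size_t)-1) {
        printf("Hostname contains characters not representable in this code page\n");
        return -1;
    }

    /* Second call: convert into a buffer of the right size. */
    char *buf = malloc(size + 1);
    if (buf == NULL)
        return -1;
    wcstombs(buf, wideBuf, size + 1);

    printf("%s\n", buf);
    free(buf);
    return 0;
}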

Powerbuilder: ImportFile of UTF-8 (Converting UTF-8 to ANSI)

My PowerBuilder version is 6.5; I cannot use a higher version as this is what I am supporting.
My problem is that when I do dw_1.ImportFile(file), the first row and first column contains a funny string like this:

I did not understand it until I tried opening the file, saving it to a new text file, and importing that new file instead, which worked flawlessly without the funny string.
My conclusion is that this happens because the original file is UTF-8 (as shown in Notepad++) while the new file is ANSI. The file I am trying to import is automatically provided by a third party, and my users don't want the extra step of converting it.
How do I convert these files to ANSI in PowerBuilder? If there is no way, I might have to do a command-prompt conversion; any ideas?
The weird  characters are the (optional) UTF-8 BOM that tells editors that the file is UTF-8 encoded (otherwise it can be difficult to know unless a character above code 127 is encountered). You cannot just strip it off, because if your file contains any character above 127 (accents or any special character), you will still have garbage in your displayed data (for example: é -> é, € -> €, ...), with each special character turning into 2 to 4 garbage characters.
I recently needed to convert some UTF-8 encoded strings to "ANSI" Windows-1252 encoding. With PB10 or later, re-encoding between UTF-8 and ANSI is as simple as
b = blob(s, encodingutf8!)
s2 = string(b, encodingansi!)
But string() and blob() do not take an encoding argument before PB 10.
What you can do is read the file yourself, skip the BOM, ask Windows to convert the string encoding via MultiByteToWideChar() + WideCharToMultiByte(), and load the converted string into the DataWindow with ImportString().
Proof of concept to get the file contents (with this reading method, the file cannot be bigger than 2GB):
string ls_path, ls_file, ls_chunk, ls_ansi
ls_path = sle_path.text
int li_file

if not fileexists(ls_path) then return

li_file = FileOpen(ls_path, streammode!)
if li_file > 0 then
    FileSeek(li_file, 3, FromBeginning!) //skip the utf-8 BOM
    //read the file by blocks, FileRead is limited to 32kB
    do while FileRead(li_file, ls_chunk) > 0
        ls_file += ls_chunk //concatenate in loop works but is not so performant
    loop
    FileClose(li_file)
    ls_ansi = utf8_to_ansi(ls_file)
    dw_tab.importstring( text!, ls_ansi)
end if
utf8_to_ansi() is a global function; it was written for PB9, but it should work the same with PB6.5:
global type utf8_to_ansi from function_object
end type
type prototypes
function ulong MultiByteToWideChar(ulong CodePage, ulong dwflags, ref string lpmultibytestr, ulong cchmultibyte, ref blob lpwidecharstr, ulong cchwidechar) library "kernel32.dll"
function ulong WideCharToMultiByte(ulong CodePage, ulong dwFlags, ref blob lpWideCharStr, ulong cchWideChar, ref string lpMultiByteStr, ulong cbMultiByte, ref string lpDefaultChar, ref boolean lpUsedDefaultChar) library "kernel32.dll"
end prototypes
forward prototypes
global function string utf8_to_ansi (string as_utf8)
end prototypes
global function string utf8_to_ansi (string as_utf8);
//convert utf-8 -> ansi
//use a wide-char native string as pivot
constant ulong CP_ACP = 0
constant ulong CP_UTF8 = 65001
string ls_wide, ls_ansi, ls_null
blob lbl_wide
ulong ul_len
boolean lb_flag
setnull(ls_null)
lb_flag = false
//get utf-8 string length converted as wide-char
setnull(lbl_wide)
ul_len = multibytetowidechar(CP_UTF8, 0, as_utf8, -1, lbl_wide, 0)
//allocate buffer to let windows write into
ls_wide = space(ul_len * 2)
lbl_wide = blob(ls_wide)
//convert utf-8 -> wide char
ul_len = multibytetowidechar(CP_UTF8, 0, as_utf8, -1, lbl_wide, ul_len)
//get the final ansi string length
setnull(ls_ansi)
ul_len = widechartomultibyte(CP_ACP, 0, lbl_wide, -1, ls_ansi, 0, ls_null, lb_flag)
//allocate buffer to let windows write into
ls_ansi = space(ul_len)
//convert wide-char -> ansi
ul_len = widechartomultibyte(CP_ACP, 0, lbl_wide, -1, ls_ansi, ul_len, ls_null, lb_flag)
return ls_ansi
end function

How should I read a file line-by-line in C?

I'd like to read a file line-by-line. I have fgets() working okay, but I'm not sure what to do if a line is longer than the buffer size I've passed to fgets(). And furthermore, since fgets() doesn't seem to be Unicode-aware and I want to allow UTF-8 files, it might miss line endings and read the whole file, no?
Then I thought I'd use getline(). However, I'm on Mac OS X, and while getline() is declared in /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.8.sdk/usr/include/stdio.h, it's not in /usr/include/stdio.h, so gcc doesn't find it from the command line. And it's not particularly portable, obviously, and I'd like the library I'm developing to be generally useful.
So what's the best way to read a file line-by-line in C?
First of all, it's very unlikely that you need to worry about non-standard line terminators like U+2028. Normal text files are not expected to contain them, and the overwhelming majority of existing software that reads normal text files doesn't support them. You mention getline(), which is available in glibc but not in Mac OS X's libc, and it would surprise me if getline() supported such fancy line terminators. It's almost a certainty that you can get away with supporting just LF (U+000A) and maybe also CR+LF (U+000D U+000A). To do that, you don't need to care about UTF-8. That's the beauty of UTF-8's ASCII compatibility, and it is by design.
As for supporting lines that are longer than the buffer you pass to fgets(), you can do this with a little extra logic around fgets. In pseudocode:
while true {
    fgets(buffer, size, stream);
    dynamically_allocated_string = strdup(buffer);

    while the last char (before the terminating NUL) in the buffer is not '\n' {
        /* the current line is not finished. read more of it */
        fgets(buffer, size, stream);
        concatenate the contents of buffer to the dynamically allocated string
    }

    process the whole line, as found in the dynamically allocated string
}
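Here is a minimal C sketch of that loop under the same assumptions (LF-terminated lines, fgets); the helper name read_long_line and the 4096-byte chunk size are illustrative, not from the answer above:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Reads one whole line of arbitrary length; returns a malloc'd string
   (including the trailing '\n' if present) or NULL at end of file. */
char *read_long_line(FILE *stream) {
    char buffer[4096];
    char *line = NULL;
    size_t len = 0;

    while (fgets(buffer, sizeof buffer, stream)) {
        size_t chunk = strlen(buffer);
        char *tmp = realloc(line, len + chunk + 1);
        if (tmp == NULL) {
            free(line);
            return NULL;
        }
        line = tmp;
        memcpy(line + len, buffer, chunk + 1);   /* copy the chunk plus its NUL */
        len += chunk;
        if (len > 0 && line[len - 1] == '\n')
            break;                               /* the line is complete */
    }
    return line;   /* NULL if nothing was read before EOF */
}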
But again, I think you will find that there's really quite a lot of software out there that simply doesn't bother with that, from software that parses system config files like /etc/passwd to (some) scripting languages. Depending on your use case, it may very well be good enough to use a "big enough" buffer (e.g. 4096 bytes) and declare that you don't support lines longer than that. You can even call it a security feature (a line length limit is protection against resource exhaustion attacks from a crafted input file).
Based on this answer, here's what I've come up with:
#include <stdio.h>
#include <stdlib.h>

#define LINE_BUF_SIZE 1024

char * getline_from(FILE *fp) {
    char * line = malloc(LINE_BUF_SIZE), * linep = line;
    size_t lenmax = LINE_BUF_SIZE, len = lenmax;
    int c;

    if(line == NULL)
        return NULL;

    for(;;) {
        c = fgetc(fp);
        if(c == EOF)
            break;

        if(--len == 0) {
            /* Buffer is full: double it and keep writing where we left off. */
            len = lenmax;
            char * linen = realloc(linep, lenmax *= 2);
            if(linen == NULL) {
                free(linep);
                return NULL;
            }
            line = linen + (line - linep);
            linep = linen;
        }

        if((*line++ = c) == '\n')
            break;
    }

    if(c == EOF && line == linep) {
        /* Nothing was read before EOF: report end of input to the caller. */
        free(linep);
        return NULL;
    }

    *line = '\0';
    return linep;
}
To read stdin:
char *line;
while ((line = getline_from(stdin))) {
    // do stuff
    free(line);
}
To read some other file, I first open it with fopen():
FILE *fp;
fp = fopen(filename, "rb");
if (!fp) {
    fprintf(stderr, "Cannot open %s: ", filename);
    perror(NULL);
    exit(1);
}

char *line;
while ((line = getline_from(fp))) {
    // do stuff
    free(line);
}
fclose(fp);
This works very nicely for me. I'd love to see an alternative that uses fgets() as suggested by Paul Tomblin, but I don't have the energy to figure it out tonight.

"Extra content at the end of the document" error using libxml2 to read from file handle created with shm_open

I'm trying to write a unit test that checks some XML parsing code. The unit test creates a file descriptor on an in-memory XML doc using shm_open and then passes that to xmlReaderForFd(). But I'm getting an "Extra content at the end of the document" error on the subsequent xmlTextReaderRead(). The parsing code works fine on a file descriptor created from an actual file (I've done a byte-for-byte comparison with the shm_open-created one and it's the exact same set of bytes). Why is libxml2 choking on a file descriptor created with shm_open?
Here's my code:
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <libxml/xmlreader.h>

void unitTest() {
    int fd = shm_open("/temporary", O_RDWR | O_CREAT, S_IRUSR | S_IWUSR);
    char *pText = "<?xml version=\"1.0\"?><foo></foo>";
    write(fd, pText, strlen(pText) + 1);
    lseek(fd, 0, SEEK_SET);

    xmlTextReaderPtr pReader = xmlReaderForFd(
        fd,           // file descriptor
        "/temporary", // base uri
        NULL,         // encoding
        0);           // options

    int result = xmlTextReaderRead(pReader);
    // result is -1
    // Get this error at console:
    // /temporary:1: parser error : Extra content at the end of the document
    // <?xml version="1.0"?><foo></foo>
    // ^
}
I figured out the problem. I was writing out the NULL terminator and that's what was causing libxml2 to choke (although I could have sworn I already tried it without the NULL terminator, d'oh!) The fixed code should simply be:
write(fd, pText, strlen(pText));
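For reference, here is the whole test with that one-line change, plus the cleanup calls (xmlFreeTextReader, close, shm_unlink) that the original snippet omits; error handling is still left out for brevity:

#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <libxml/xmlreader.h>

void unitTest(void) {
    int fd = shm_open("/temporary", O_RDWR | O_CREAT, S_IRUSR | S_IWUSR);
    const char *pText = "<?xml version=\"1.0\"?><foo></foo>";

    write(fd, pText, strlen(pText));   /* no trailing NUL byte this time */
    lseek(fd, 0, SEEK_SET);

    xmlTextReaderPtr pReader = xmlReaderForFd(fd, "/temporary", NULL, 0);
    int result = xmlTextReaderRead(pReader);  /* now returns 1 */
    (void)result;

    xmlFreeTextReader(pReader);
    close(fd);
    shm_unlink("/temporary");
}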
Also, make sure you are reading the file as binary, not text. Text mode strips the CR from CR/LF pairs, so fewer bytes are read than the file size reported by stat, which leaves detritus at the end of the buffer.
Example (VS 2010):
#include <stdio.h>
#include <stdlib.h>
#include <assert.h>
#include <sys/types.h>
#include <sys/stat.h>

struct _stat st;
char *buf;

FILE *f = fopen("123.XML", "rb"); // right
//f = fopen("123.XML", "rt");     // WRONG!
_fstat(_fileno(f), &st);
buf = (char *)malloc(st.st_size);
int ret = fread(buf, st.st_size, 1, f);
assert(ret == 1);
// etc.

Resources