I'm using libxml2. All of its functions work with xmlChar*, and I found that xmlChar is an unsigned char.
So I have some questions about how to work with it.
1) For example, if I'm working with a UTF-16 or UTF-32 file, how does libxml2 process it and return xmlChar from its functions? Will I lose some characters?
2) If I want to do something with such a string, should I cast it to char* or wchar_t*, and how?
Will I lose some characters?
xmlChar is for handling UTF-8 encoding only.
So, to answer your questions:
No, you won't lose any characters when using UTF-16 or UTF-32. Just use iconv or any other library to convert your UTF-16 or UTF-32 data to UTF-8 before passing it to the API.
Do not just "cast" the string. Convert it to the other encoding if needed.
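For the second question, here is a minimal sketch of such a conversion, turning the UTF-8 xmlChar* returned by libxml2 into a std::wstring instead of casting it. It assumes a glibc-style iconv where "WCHAR_T" is accepted as a target encoding name; the helper name is made up for illustration.

#include <iconv.h>
#include <libxml/xmlstring.h>   // xmlChar, xmlStrlen
#include <stdexcept>
#include <string>

std::wstring xml_utf8_to_wide(const xmlChar* s)
{
    // "WCHAR_T" is a glibc extension; other platforms may need "UTF-32LE" or "UTF-16LE".
    iconv_t cd = iconv_open("WCHAR_T", "UTF-8");
    if (cd == (iconv_t)-1)
        throw std::runtime_error("iconv_open failed");

    size_t in_left = xmlStrlen(s);
    char* in_ptr = (char*)s;                        // the bytes stay UTF-8, iconv just wants char*

    std::wstring out(in_left, L'\0');               // a UTF-8 string never has more code points than bytes
    char* out_ptr = (char*)&out[0];
    size_t out_left = out.size() * sizeof(wchar_t);

    if (iconv(cd, &in_ptr, &in_left, &out_ptr, &out_left) == (size_t)-1) {
        iconv_close(cd);
        throw std::runtime_error("UTF-8 to wchar_t conversion failed");
    }
    iconv_close(cd);
    out.resize(out.size() - out_left / sizeof(wchar_t));   // trim to what was actually written
    return out;
}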
I have an external system written in Ruby which sends data over the wire encoded as ASCII_8BIT. How should I decode and encode it in Scala?
I couldn't find a library for decoding and encoding ASCII_8BIT strings in Scala.
As I understand it, ASCII_8BIT is something similar to Base64. However, there is more than one Base64 encoding. Which type of encoding should I use to be sure I cover all corner cases?
What is ASCII-8BIT?
ASCII-8BIT is Ruby's binary encoding (the name "BINARY" is accepted as an alias for "ASCII-8BIT" when specifying the name of an encoding). It is used both for binary data and for text whose real encoding you don't know.
Any sequence of bytes is a valid string in the ASCII-8BIT encoding, but unlike other 8-bit encodings, only the bytes in the ASCII range are considered printable characters (and of course only those that are printable in ASCII). The bytes in the 128-255 range are considered special characters that don't have a representation in other encodings. So trying to convert an ASCII-8BIT string to any other encoding will fail (or replace the non-ASCII characters with question marks, depending on the options you give to encode) unless it only contains ASCII characters.
What's its equivalent in the Scala/JVM world?
There is no strict equivalent. If you're dealing with binary data, you should be using binary streams that don't have an encoding and aren't treated as containing text.
If you're dealing with text, you'll either need to know (or somehow figure out) its encoding or just arbitrarily pick an 8-bit ASCII-superset encoding. That way non-ASCII characters may come out as the wrong character (if the text was actually encoded with a different encoding), but you won't get any errors because any byte is a valid character. You can then replace the non-ASCII characters with question marks if you want.
What does this have to do with Base64?
Nothing. Base64 is a way to represent binary data as ASCII text. It is not itself a character encoding. Knowing that a string has the character encoding ASCII or ASCII-8BIT or any other encoding, doesn't tell you whether it contains Base64 data or not.
But do note that a Base64 string will consist entirely of ASCII characters (and not just any ASCII characters, but only letters, numbers, +, / and =). So if your string contains any non-ASCII character or any character except the aforementioned, it's not Base64.
Therefore any Base64 string can be represented as ASCII. So if you have an ASCII-8BIT string containing Base64 data in Ruby, you should be able to convert it to ASCII without any problems. If you can't, it's not Base64.
I have a Unicode character code point stored as a string.
std::string code = "0663";
I need to decode it into UTF-8 and get it as a standard std::string using the ICU library.
I decided to use ICU to get a cross-platform bit-independent solution.
Untested:
Convert the string into an int32_t.
Treat the int32_t as a UChar32.
Create a UnicodeString with UnicodeString::setTo from the UChar32.
Create a string object with UnicodeString::toUTF8String from the UnicodeString.
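A sketch of those four steps (assuming the string holds the code point in hexadecimal, as "0663" suggests, and that the ICU headers are available):

#include <unicode/unistr.h>   // icu::UnicodeString, UChar32
#include <cstdint>
#include <cstdlib>
#include <string>

std::string codepoint_to_utf8(const std::string& code)
{
    // 1) Parse the string into an int32_t (the code point is given in hex).
    int32_t value = static_cast<int32_t>(std::strtol(code.c_str(), nullptr, 16));

    // 2) Treat it as a UChar32.
    UChar32 cp = value;

    // 3) Build a UnicodeString holding that single code point.
    icu::UnicodeString ustr;
    ustr.setTo(cp);

    // 4) Convert to UTF-8 into a std::string.
    std::string utf8;
    ustr.toUTF8String(utf8);
    return utf8;   // for "0663" this is the two bytes 0xD9 0xA3 (٣)
}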
I'm using v8 to run JavaScript in native (C++) code. To call a JavaScript function, I need to convert all the parameters to v8 data types.
For example, here is the code to convert a char* to a v8 data type:
char* value;
...
v8::String::New(value);
Now, I need to pass Unicode chars (wchar_t) to JavaScript.
First of all, does v8 support Unicode chars? If yes, how do I convert a wchar_t/std::wstring to a v8 data type?
I'm not sure if this was the case at the time this question was asked, but at the moment the V8 API has a number of functions which support UTF-8, UTF-16 and Latin-1 encoded text:
https://github.com/v8/v8/blob/master/include/v8.h
The relevant functions to create new string objects are:
String::NewFromUtf8 (UTF-8 encoded, obviously)
String::NewFromOneByte (Latin-1 encoded)
String::NewFromTwoByte (UTF-16 encoded)
Alternatively, you can avoid copying the string data and construct a V8 string object that refers to existing data (whose lifecycle you control):
String::NewExternalOneByte (Latin-1 encoded)
String::NewExternalTwoByte (UTF-16 encoded)
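As a hedged sketch of the two-byte variant: the following builds a v8::String from a std::wstring via NewFromTwoByte. It assumes a current V8 (where NewFromTwoByte returns a MaybeLocal) and a 16-bit wchar_t as on Windows; on Linux/macOS wchar_t is 32 bits, so the data would first have to be converted to UTF-16.

#include <v8.h>
#include <cstdint>
#include <string>

v8::Local<v8::String> FromWide(v8::Isolate* isolate, const std::wstring& s)
{
    static_assert(sizeof(wchar_t) == sizeof(uint16_t),
                  "this sketch assumes a 16-bit wchar_t");
    // wchar_t and uint16_t are distinct types, so an explicit cast is needed.
    return v8::String::NewFromTwoByte(
               isolate,
               reinterpret_cast<const uint16_t*>(s.c_str()),
               v8::NewStringType::kNormal,
               static_cast<int>(s.size()))
        .ToLocalChecked();
}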
Unicode just maps characters to numbers. What you need is a proper encoding, like UTF-8 or UTF-16.
V8 seems to support UTF-8 (v8::String::WriteUtf8) and a 16-bit type that is not described further (Write). I would give it a try and write some UTF-16 into it.
In Unicode applications, Windows stores UTF-16 in std::wstring. Maybe try something like
std::wstring yourString;
// wchar_t is 16 bits on Windows, so the buffer can be handed to the uint16_t overload
v8::String::New(reinterpret_cast<const uint16_t*>(yourString.c_str()));
No, it doesn't take wchar_t directly; the solution above is fine.
The following code did the trick
wchar_t path[1024] = L"gokulestás";
v8::String::New((uint16_t*)path, wcslen(path))
I want to convert a string from the CP1252 character set to UTF-8. For this I used the iconv library in my C++ application, which runs on Linux.
I used the iconv() API and converted my string.
There is a character è in my input. UTF-8 also supports this character, so when my conversion is over, my output should contain the same character è.
But when I look at the output, the character è has been converted to Ã¨, which I don't want.
One more point: if the converter finds any unknown character, it should automatically be replaced with the REPLACEMENT CHARACTER � (U+FFFD), which is not happening.
How can I achieve the above two points with the iconv library?
I used the APIs below to convert the string:
1) iconv_open("UTF-8", "CP1252")
2) iconv() - pass the required parameters
3) iconv_close(cd)
Can anybody help me sort out this issue, please?
Use this to handle characters that cannot be converted (the //IGNORE suffix makes iconv skip them instead of failing):
iconv_open("UTF-8//IGNORE","CP1252")
I have a Ruby script that generates an ANSI file.
I want to convert the file to UTF8.
What's the easiest way to do it?
If your data is within the ASCII range 0 to 0x7F, it's valid UTF-8, so you don't need to do anything.
Or, if there are characters above 0x7F, you could use Iconv:
require 'iconv'
text = Iconv.iconv('UTF-8', 'ascii', text).first  # Iconv.iconv returns an array of converted strings
The 8-bit Unicode Transformation Format (UTF-8) was designed to be backwards compatible with the American Standard Code for Information Interchange (ASCII). Therefore, by definition, any valid ASCII sequence is also a valid UTF-8 sequence. For more information, read the UTF FAQ and Unicode FAQ.
Any ASCII file is a valid UTF-8 file, going by your question's title, so no conversion is needed. (I don't know what the "UIF8" file mentioned in your question's text is; that differs from the title.)