Given a string in the form of a pointer to an array of bytes (chars), how can I detect the encoding of the string in C/C++ (I am using Visual Studio 2008)? I did a search, but most of the samples are in C#.
Thanks
Assuming you know the length of the input array, you can make the following guesses (a rough code sketch of these checks appears after the last step):
First, check to see if the first few bytes match any well-known byte order marks (BOMs) for Unicode. If they do, you're done!
Next, search for '\0' before the last byte. If you find one, you might be dealing with UTF-16 or UTF-32. If you find multiple consecutive '\0's, it's probably UTF-32.
If any byte is in the range 0x80 to 0xff, it's certainly not ASCII or UTF-7. If you are restricting your input to some variant of Unicode, you can assume it's UTF-8. Otherwise, you have to do some guessing to determine which multi-byte character set it is. That will not be fun.
At this point it is either: ASCII, UTF-7, Base64, or ranges of UTF-16 or UTF-32 that just happen to not use the top bit and do not have any null characters.
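A minimal sketch of those checks might look like the following. The BOM table and the returned labels are illustrative, not exhaustive:

```cpp
#include <cstddef>
#include <cstring>

// Heuristic sketch of the steps above -- a rough guess, not a reliable detector.
const char* GuessEncoding(const unsigned char* data, std::size_t len)
{
    // 1. Well-known byte order marks (check the longer BOMs first).
    if (len >= 4 && std::memcmp(data, "\xFF\xFE\x00\x00", 4) == 0) return "UTF-32LE (BOM)";
    if (len >= 4 && std::memcmp(data, "\x00\x00\xFE\xFF", 4) == 0) return "UTF-32BE (BOM)";
    if (len >= 3 && std::memcmp(data, "\xEF\xBB\xBF", 3) == 0)     return "UTF-8 (BOM)";
    if (len >= 2 && std::memcmp(data, "\xFF\xFE", 2) == 0)         return "UTF-16LE (BOM)";
    if (len >= 2 && std::memcmp(data, "\xFE\xFF", 2) == 0)         return "UTF-16BE (BOM)";

    // 2. Scan for embedded nulls and bytes with the top bit set.
    bool hasNull = false, hasConsecutiveNulls = false, hasHighBit = false;
    for (std::size_t i = 0; i < len; ++i) {
        if (data[i] == 0x00) {
            hasNull = true;
            if (i + 1 < len && data[i + 1] == 0x00) hasConsecutiveNulls = true;
        }
        if (data[i] >= 0x80) hasHighBit = true;
    }
    if (hasConsecutiveNulls) return "probably UTF-32 (no BOM)";
    if (hasNull)             return "probably UTF-16 (no BOM)";

    // 3. Top bit set rules out ASCII and UTF-7; assume UTF-8 if the input is Unicode.
    if (hasHighBit) return "probably UTF-8 (or some other multi-byte charset)";

    // 4. Pure 7-bit data: ASCII, UTF-7, Base64, or a lucky slice of something else.
    return "7-bit data (ASCII, UTF-7, ...)";
}
```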
It's not an easy problem to solve; it generally relies on heuristics to take a best guess at what the input encoding is, and those heuristics can be tripped up by relatively innocuous inputs. For examples, take a look at this Wikipedia article and The Notepad file encoding Redux.
If you're looking for a Windows-only solution with minimal dependencies, you can look at using a combination of IsTextUnicode and MLang's DetectInputCodePage to attempt character set detection.
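IsTextUnicode can be called directly (MLang's IMultiLanguage2::DetectInputCodepage is COM-based and needs a bit more ceremony). A minimal sketch, keeping in mind that IsTextUnicode is itself a heuristic that is easily fooled by short inputs:

```cpp
#include <windows.h>
// Link against Advapi32.lib.

// Ask Windows whether a buffer looks like UTF-16 text. Treat the result as a
// hint, not a verdict -- IsTextUnicode makes a statistical guess.
bool LooksLikeUtf16(const void* data, int sizeInBytes)
{
    INT tests = IS_TEXT_UNICODE_UNICODE_MASK | IS_TEXT_UNICODE_REVERSE_MASK;
    BOOL isUnicode = IsTextUnicode(data, sizeInBytes, &tests);
    // On return, `tests` contains the subset of the requested tests that passed.
    return isUnicode != FALSE;
}
```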
If you are looking for portability but don't mind taking on a fairly large dependency in the form of ICU, then you can make use of its character set detection routines to achieve the same thing in a portable manner.
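A minimal sketch of ICU's detection API, with error handling trimmed for brevity:

```cpp
#include <unicode/ucsdet.h>
#include <cstdio>

// Ask ICU's charset detector for its best guess about a byte buffer.
void DetectWithIcu(const char* data, int length)
{
    UErrorCode status = U_ZERO_ERROR;
    UCharsetDetector* detector = ucsdet_open(&status);
    ucsdet_setText(detector, data, length, &status);

    const UCharsetMatch* match = ucsdet_detect(detector, &status);
    if (match != NULL && U_SUCCESS(status)) {
        std::printf("Best guess: %s (confidence %d%%)\n",
                    ucsdet_getName(match, &status),
                    static_cast<int>(ucsdet_getConfidence(match, &status)));
    }
    ucsdet_close(detector);
}
```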
I have written a small C++ library for detecting text file encoding. It uses Qt, but it could just as easily be implemented using only the standard library.
It operates by measuring symbol occurrence statistics and comparing them to pre-computed reference values for different encodings and languages. As a result, it not only detects the encoding but also the language of the text. The downside is that pre-computed statistics must be provided for a target language in order for that language to be detected properly.
https://github.com/VioletGiraffe/text-encoding-detector
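The underlying idea, in very rough strokes, is sketched below. This is a toy illustration, not the library's actual API, and the reference histograms are assumed to come from pre-computed corpora:

```cpp
#include <array>
#include <cmath>
#include <cstddef>
#include <string>

// Toy illustration of statistics-based detection: build a byte-frequency
// histogram for the input and compare it, via cosine similarity, against
// reference histograms pre-computed from known language/encoding corpora.
using Histogram = std::array<double, 256>;

Histogram BuildHistogram(const std::string& text)
{
    Histogram h{};
    for (unsigned char c : text) h[c] += 1.0;
    if (!text.empty())
        for (double& v : h) v /= static_cast<double>(text.size());
    return h;
}

double CosineSimilarity(const Histogram& a, const Histogram& b)
{
    double dot = 0.0, na = 0.0, nb = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        dot += a[i] * b[i];
        na  += a[i] * a[i];
        nb  += b[i] * b[i];
    }
    return (na > 0 && nb > 0) ? dot / std::sqrt(na * nb) : 0.0;
}
// A detector built this way would pick the (language, encoding) pair whose
// reference histogram scores highest against the input.
```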
I intend to create a library that deals with strings, and the first thing that came to my mind is supporting all languages, including Asian languages such as Chinese and Japanese, as well as right-to-left languages such as Arabic, Persian, and so on.
So, I want to know whether UTF-8, represented by the data types char* and std::string, is enough to support all languages for reading and writing, or whether I should use UTF-16, represented by the data types wchar_t* and std::wstring.
In short, which data type is suitable for this task: one of these two, or something else?
There are a few confusions in your question, so I'll start with the answer you're probably looking for, and move out from there:
You should encode in UTF-8 unless you have a very good reason not to encode in UTF-8. There are several good reasons, but none of them have to do with what languages are supported.
UTF-8 and UTF-16 are just different ways to encode Unicode. You can also encode Unicode in UTF-32. You can even encode Unicode in GB18030, or one of several other encodings. As long as the encoding can handle all Unicode code points, it will cover the same number of languages, glyphs, scripts, characters, etc. (Nailing down precisely what is meant by a Unicode code point is itself a subtle topic that I don't want to get into here, but for these purposes, let's think of it as a "character.")
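For example, the single code point U+00E9 (é) is the same character in every encoding; only the byte layout differs. A small illustration (C++11 literals, purely for demonstration):

```cpp
// One code point, U+00E9 (LATIN SMALL LETTER E WITH ACUTE), three encodings:
const char     inUtf8 [] = "\xC3\xA9";   // UTF-8:  two bytes, C3 A9
const char16_t inUtf16[] = u"\u00E9";    // UTF-16: one 16-bit code unit, 00E9
const char32_t inUtf32[] = U"\u00E9";    // UTF-32: one 32-bit code unit, 000000E9
```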
You should generally use UTF-8 because it's extremely efficient for Latin-based scripts and it's the most widely supported encoding in the broader ecosystem. That said, for some problems, UTF-16 or UTF-32 can be more efficient. But without a specific reason, you should use UTF-8.
The data types char* and std::string do not represent UTF-8. They represent a sequence of char. That's all they represent. That sequence of char can be interpreted in many ways. It is fairly common to interpret it as UTF-8, but I wouldn't even say that's the most common interpretation (many systems treat it as extended ASCII, which is why non-English text often gets garbled as it moves between systems).
If you want to work in UTF-8, you often have to do more than use std::string. You need a UTF-8 handling library, most commonly std::locale for simple usage or ICU for more complex problems. UTF-8 characters can be between 1 and 4 char long, so you have to be very thoughtful when applying character processing. The most common mistake is forgetting that UTF-8 does not support random access. You can't just jump to the 32nd letter in a string; you have to process it from the start to find all the character breaks. If you start processing a UTF-8 string at a random point, you may jump into the middle of a character.
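To make the "no random access" point concrete, here is a rough sketch of finding code point boundaries by walking the string from the start. It assumes well-formed UTF-8 and does no validation:

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Walk a UTF-8 string and record the byte offset where each code point starts.
std::vector<std::size_t> CodePointStarts(const std::string& utf8)
{
    std::vector<std::size_t> starts;
    std::size_t i = 0;
    while (i < utf8.size()) {
        starts.push_back(i);
        unsigned char lead = static_cast<unsigned char>(utf8[i]);
        if      (lead < 0x80) i += 1;   // 0xxxxxxx: ASCII
        else if (lead < 0xE0) i += 2;   // 110xxxxx: 2-byte sequence
        else if (lead < 0xF0) i += 3;   // 1110xxxx: 3-byte sequence
        else                  i += 4;   // 11110xxx: 4-byte sequence
    }
    return starts;
}
// "The 32nd letter" is starts[31] -- and even that is a code point, not
// necessarily a user-perceived character (see combining characters below).
```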
Through combining characters, encoded "characters" can become arbitrarily long in any Unicode encoding. The visually single "character" 👩‍👩‍👧‍👦 is encoded as a sequence of 25 char values in UTF-8. (Of course it's encoded as 11 code units in UTF-16. No Unicode encoding saves you from having to think about combining characters.)
On the other hand, UTF-8 is so powerful because you can often ignore it for certain problems. The character A encodes in UTF-8 exactly as it does in ASCII (65), and UTF-8 guarantees that the byte 65 never appears inside a multi-byte sequence, so a byte with value 65 always means A. Searching for specific ASCII sequences therefore requires no special processing (the way it does in UTF-16).
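A tiny illustration of that guarantee (the example text is arbitrary):

```cpp
#include <cassert>
#include <string>

int main()
{
    // "price: 42€" -- the euro sign U+20AC is the three bytes E2 82 AC in UTF-8.
    std::string utf8 = "price: 42\xE2\x82\xAC";

    // Plain byte-wise search for an ASCII needle is safe on UTF-8 text...
    assert(utf8.find("42") != std::string::npos);

    // ...because no byte of a multi-byte sequence falls in the ASCII range.
    assert(utf8.find('A') == std::string::npos);
    return 0;
}
```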
As NathanOliver points out, using any Unicode encoding will only support the languages, glyphs, scripts, characters, etc. that Unicode supports. As a practical matter, that is the vast majority of the commonly used languages in the world. It is not every language (and it has failings in how it handles some languages that it does support), but it's by far the most comprehensive system we have today.
No, UTF-8 is not enough to support all languages (yet), because Unicode itself does not yet cover every script. The As Yet Unsupported Scripts list includes, for example:
Loma
Naxi Dongba (Moso)
I'm wondering why protobuf doesn't implement support for a commonly used alphanumeric type.
This would allow several characters to be packed into fewer bytes (even fewer if case-insensitive) very efficiently, without involving any sort of compression.
Is this something the protobuf developers are planning to implement in the future?
Thanks,
In today's global world, the number of cases where "alphanumeric" means just the 62 characters in the ranges 0-9, A-Z, and a-z is fairly minimal. If we consider only the basic multilingual plane, there are about 48k code points (which is to say, over 70% of the available range) that count as "alphanumeric" - and a fairly standard (although even this may be suboptimal in some locales) way of encoding them is UTF-8, which protobuf already uses for the string type.
I cannot see much advantage in using a dedicated wire type for this category of data, and any additional wire type would need support added in multiple libraries, because an unknown wire type renders the stream unreadable to down-level parsers: you cannot even skip over unwanted data if you don't know the wire type (the wire type defines the skip rules).
Of course, since you also have the bytes type available, you can feel free to do anything bespoke you want inside that.
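For instance, if you really want the density, you could pack a restricted alphabet yourself and ship the result in a bytes field. A sketch follows; the 6-bits-per-symbol scheme and the function names are my own invention, and both ends of the wire would have to agree on the scheme out of band:

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Map 0-9, A-Z, a-z (62 symbols, so 6 bits each with room to spare) into a
// bit stream. Protobuf knows nothing about this; it just sees opaque bytes.
static int SymbolIndex(char c)
{
    if (c >= '0' && c <= '9') return c - '0';          // 0..9
    if (c >= 'A' && c <= 'Z') return 10 + (c - 'A');   // 10..35
    if (c >= 'a' && c <= 'z') return 36 + (c - 'a');   // 36..61
    return -1;                                         // not alphanumeric
}

std::vector<std::uint8_t> PackAlphanumeric(const std::string& text)
{
    std::vector<std::uint8_t> out;
    std::uint32_t bitBuffer = 0;
    int bitCount = 0;
    for (char c : text) {
        int idx = SymbolIndex(c);
        if (idx < 0) continue;                 // sketch: silently drop other chars
        bitBuffer = (bitBuffer << 6) | static_cast<std::uint32_t>(idx);
        bitCount += 6;
        while (bitCount >= 8) {
            out.push_back(static_cast<std::uint8_t>(bitBuffer >> (bitCount - 8)));
            bitCount -= 8;
        }
    }
    if (bitCount > 0)                          // flush the final partial byte
        out.push_back(static_cast<std::uint8_t>(bitBuffer << (8 - bitCount)));
    return out;
}
```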
Given a Unicode character, we want to find out what languages include this character, and more importantly, understand whether or not each language is Left-To-Right.
For example, the character A might appear in both English and Spanish, which are both LTR languages.
I want this for my own text editor.
Can anyone help me in finding an API function or something that solves my problem?
Thanks in advance
Unicode-wise, LTR/RTL is a property of characters, not of the languages that use that character. This matters because embedded English in an Arabic text should be displayed left-to-right, even if for simplicity the document as a whole may be marked as Arabic. If you're using JCL, these properties can be obtained using the UnicodeIsLeftToRight and UnicodeIsRightToLeft functions. Note that characters may be neither left-to-right nor right-to-left, and also note that JCL uses a private copy of the Unicode character list that may be a subtly different version from what any specific version of Windows uses.
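If you're not in Delphi/JCL, the same per-character property is exposed by ICU. A sketch (the function and enum names are ICU's; the wrappers are mine):

```cpp
#include <unicode/uchar.h>

// The Unicode bidirectional category is a property of the character itself.
bool IsRightToLeftChar(UChar32 c)
{
    UCharDirection dir = u_charDirection(c);
    return dir == U_RIGHT_TO_LEFT || dir == U_RIGHT_TO_LEFT_ARABIC;
}

bool IsLeftToRightChar(UChar32 c)
{
    return u_charDirection(c) == U_LEFT_TO_RIGHT;
}
// Note: many characters (digits, punctuation, spaces) are neither -- they have
// neutral or weak directionality.
```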
Regarding the question in the title, you would need to carry out an extensive study of the use of characters in the languages of the world. There are a few thousand languages, though many of them have no regular writing system; on the other hand, some languages have several writing systems. Different variants of a language may have different repertoires of characters.
So it would be a major effort, though some data has been compiled e.g. in the CLDR repertoire – but the concept “characters used in a language” is far from clear. (Are the characters æ, è, and ö used in English? They sure appear in some forms of written English.)
So it would be unrealistic to expect to find a library routine for such purposes.
Apparently your real need was for deciding whether a character is a left-to-right character or a right-to-left character. But for completeness, I have provided an answer to what you actually asked and that might be relevant in some other contexts.
How do I get, in Ruby 1.8.7, the Unicode character that comes alphabetically right after a given character?
If you mean "next in the code page" then you can always hack around with the bytes and find out. You will probably end up falling into holes with no assigned characters if you go exploring the code page sequentially. This would mean "Unicode-abetically" if you can imagine such a term.
If you mean "alphabetically" then you're out of luck since that doesn't mean anything. The concept of alphabetic order varies considerably from one language to another and is sometimes even context-specific. Some languages don't even have a set order to their characters at all. This is the reason why some systems have a collation in addition to an encoding. The collation defines order, but often many letters are considered equivalent for the purposes of sorting, further complicating things.
Ruby 1.8.7 is also not aware of Unicode in general and pretends everything is an 8-bit string of one-byte characters. Ruby 1.9 can parse multi-byte UTF-8 into separate characters and might make this exercise a lot easier.
I wanted to know what the difference is between binary format and ASCII format. The thing is, I need to use PETSc to do some matrix manipulations, and all my matrices are stored in text files.
PETSc has a different set of rules for dealing with each of these formats. I don't know what these formats are, let alone which format my text files use.
Is there a way to convert one format to another?
This is an elementary question; a detailed answer will really help me in understanding this.
To answer your direct question, the difference between ASCII and binary is semantics.
ASCII is binary interpreted as text. Only a small subset of byte values (decimal 32-126) can be interpreted as intelligible printable characters; everything else is either a special character (such as a line feed or a system bell) or something else entirely. Larger byte values can represent letters in other alphabets, depending on the encoding.
You can interpret general binary data as ASCII format, but if it's not ASCII text it may not mean anything to you.
As a general rule of thumb, if you open your file in a text editor (such as Notepad, not Microsoft Word) and it seems to consist primarily of letters, numbers, and spaces, then your file can probably be safely interpreted as ASCII. If you open your file in your text editor and it looks like noise, it probably needs to be interpreted as raw binary.
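That rule of thumb is easy to automate. A rough sketch (the 95% threshold is an arbitrary choice):

```cpp
#include <cstddef>
#include <fstream>
#include <string>

// Rough "is this probably ASCII text?" check: mostly printable characters
// and common whitespace means text; anything else suggests raw binary.
bool LooksLikeAsciiText(const std::string& path)
{
    std::ifstream in(path.c_str(), std::ios::binary);
    std::size_t printable = 0, total = 0;
    char c;
    while (in.get(c)) {
        unsigned char b = static_cast<unsigned char>(c);
        ++total;
        if ((b >= 32 && b <= 126) || b == '\n' || b == '\r' || b == '\t')
            ++printable;
    }
    if (total == 0) return true;              // empty file: call it text
    return printable >= total * 95 / 100;     // ~95% printable bytes
}
```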
I am not very familiar with the program you're asking about, but were I in your situation I would consult its documentation to figure out what format the "binary" data stream is supposed to be in. There should be a detailed description, or an included utility for generating the binary data. If you generated the data yourself, it's probably in ASCII format.
If your matrices are in text files and your program only reads from binary files, you are probably out of luck.
Binary formats are just the raw bytes of whatever data structure it uses internally (or a serialization format).
You have little hope of turning text to binary without the help of the program itself.
Look for an import format if the program has one.