How do I get, in Ruby 1.8.7, the Unicode character that comes alphabetically right after a given character?
If you mean "next in the code page" then you can always hack around with the bytes and find out. You will probably end up falling into holes with no assigned characters if you go exploring the code page sequentially. This would mean "Unicode-abetically" if you can imagine such a term.
If you mean "alphabetically" then you're out of luck since that doesn't mean anything. The concept of alphabetic order varies considerably from one language to another and is sometimes even context-specific. Some languages don't even have a set order to their characters at all. This is the reason why some systems have a collation in addition to an encoding. The collation defines order, but often many letters are considered equivalent for the purposes of sorting, further complicating things.
Ruby 1.8.7 is also not aware of Unicode in general and pretends everything is an 8-bit ASCII string with one-byte characters. Ruby 1.9 can parse multi-byte UTF-8 into separate characters and might make this exercise a lot easier.
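For what it's worth, here is the byte-level hack spelled out as a rough sketch (in C++, since Ruby 1.8.7 itself gives you no Unicode help): decode the UTF-8 bytes to a code point, add one, and re-encode. As noted above, it will happily walk into unassigned code points; there is no validation here.

```cpp
// Sketch only: "next code point" by decode -> +1 -> re-encode. No validation.
#include <cstdint>
#include <string>
#include <iostream>

// Decode the first UTF-8 character of `s` into a code point (assumes well-formed input).
uint32_t decode_first(const std::string& s) {
    unsigned char b0 = s[0];
    if (b0 < 0x80) return b0;
    if (b0 < 0xE0) return ((b0 & 0x1F) << 6)  | (s[1] & 0x3F);
    if (b0 < 0xF0) return ((b0 & 0x0F) << 12) | ((s[1] & 0x3F) << 6) | (s[2] & 0x3F);
    return ((b0 & 0x07) << 18) | ((s[1] & 0x3F) << 12) | ((s[2] & 0x3F) << 6) | (s[3] & 0x3F);
}

// Encode a code point back to UTF-8.
std::string encode(uint32_t cp) {
    std::string out;
    if (cp < 0x80) {
        out += char(cp);
    } else if (cp < 0x800) {
        out += char(0xC0 | (cp >> 6));
        out += char(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += char(0xE0 | (cp >> 12));
        out += char(0x80 | ((cp >> 6) & 0x3F));
        out += char(0x80 | (cp & 0x3F));
    } else {
        out += char(0xF0 | (cp >> 18));
        out += char(0x80 | ((cp >> 12) & 0x3F));
        out += char(0x80 | ((cp >> 6) & 0x3F));
        out += char(0x80 | (cp & 0x3F));
    }
    return out;
}

int main() {
    // "\xC3\xA9" is "é" (U+00E9); its successor in the code page is "ê" (U+00EA).
    std::string next = encode(decode_first("\xC3\xA9") + 1);
    std::cout << next << "\n";   // prints "ê" on a UTF-8 terminal
}
```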
I intend to create a library that works with strings, and the first thing that came to my mind is supporting all languages, including Asian languages like Chinese and Japanese and right-to-left languages like Arabic, Persian, and so on.
So, I want to know whether "UTF-8", represented in the data types char* and std::string, is enough to support all languages for reading and writing, or whether I should use "UTF-16", represented in the data types wchar_t* and std::wstring?
In short, which data type should be used for this task: one of these two, or something else?
There are a few confusions in your question, so I'll start with the answer you're probably looking for, and move out from there:
You should encode in UTF-8 unless you have a very good reason not to encode in UTF-8. There are several good reasons, but none of them have to do with what languages are supported.
UTF-8 and UTF-16 are just different ways to encode Unicode. You can also encode Unicode in UTF-32. You can even encode Unicode in GB18030, or one of several other encodings. As long as the encoding can handle all Unicode code points, then it will cover the same number of languages, glyphs, scripts, characters, etc. (Nailing down precisely what is meant by a Unicode code point is itself a subtle topic that I don't want to get into here, but for these purposes, let's think of it as a "character.")
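To make that concrete, here is a small illustrative snippet (a sketch, not part of the question) showing the same character, the euro sign U+20AC, stored in three Unicode encodings. The bytes differ, but the character they represent is identical.

```cpp
// The euro sign U+20AC encoded three ways: same character, different byte layouts.
#include <cstdio>

int main() {
    const unsigned char utf8[]  = { 0xE2, 0x82, 0xAC };   // UTF-8: three bytes
    const char16_t      utf16[] = { 0x20AC };             // UTF-16: one 16-bit code unit
    const char32_t      utf32[] = { 0x000020AC };         // UTF-32: one 32-bit code unit

    std::printf("UTF-8:  %zu bytes\n", sizeof(utf8));     // 3
    std::printf("UTF-16: %zu bytes\n", sizeof(utf16));    // 2
    std::printf("UTF-32: %zu bytes\n", sizeof(utf32));    // 4
}
```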
You should generally use UTF-8 because it's extremely efficient when you're working with Latin-based scripts, and it's the most widely supported Unicode encoding in practice. That said, for some problems, UTF-16 or UTF-32 can be more efficient. But without a specific reason, you should use UTF-8.
The data types char* and std::string do not represent UTF-8. They represent a sequence of char. That's all they represent. That sequence of char can be interpreted in many ways. It is fairly common to interpret it as UTF-8, but I wouldn't even say that's the most common interpretation (many systems treat it as extended ASCII, which is why non-English text often gets garbled as it moves between systems).
If you want to work in UTF-8, you often have to do more than use std::string. You need a UTF-8 handling library, most commonly std::locale for simple usage or ICU for more complex problems. UTF-8 characters can be between 1 and 4 char long, so you have to be very thoughtful when applying character processing. The most common mistake is forgetting that UTF-8 does not support random access. You can't just jump to the 32nd letter in a string; you have to scan from the start to find all the character breaks. If you start processing a UTF-8 string at an arbitrary byte offset, you may land in the middle of a character.
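As a rough illustration of the no-random-access point (the function name here is just for the sketch), finding character boundaries means scanning for lead bytes and skipping continuation bytes of the form 10xxxxxx:

```cpp
// Sketch: why UTF-8 has no random access. To find the Nth character you must
// walk from the start, skipping continuation bytes. Assumes well-formed UTF-8.
#include <cstddef>
#include <iostream>
#include <string>
#include <vector>

// Return the byte offsets at which each UTF-8 character begins.
std::vector<std::size_t> char_starts(const std::string& s) {
    std::vector<std::size_t> starts;
    for (std::size_t i = 0; i < s.size(); ++i) {
        // A continuation byte has its top two bits set to 10; everything else starts a character.
        if ((static_cast<unsigned char>(s[i]) & 0xC0) != 0x80)
            starts.push_back(i);
    }
    return starts;
}

int main() {
    std::string text = "na\xC3\xAFve";            // "naïve": the ï takes two bytes
    auto starts = char_starts(text);
    std::cout << starts.size() << " characters in "
              << text.size() << " bytes\n";        // prints "5 characters in 6 bytes"
}
```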
Because of combining characters and joiner sequences, a single user-perceived character can have an arbitrarily long encoding. The visually single "character" 👩‍👩‍👧‍👦 (four emoji code points joined by zero-width joiners) is encoded as a sequence of 25 char values in UTF-8. (It takes 11 code units in UTF-16, so no Unicode encoding saves you from having to think about combining and joining characters.)
On the other hand, UTF-8 is so powerful because you can often ignore it for certain problems. The character A encodes in UTF-8 exactly as it does in ASCII (65), and UTF-8 guarantees that the byte 65 never occurs as part of a multi-byte sequence, so a byte of 65 is always an A. That means searching for specific ASCII sequences requires no special processing (the way it does in UTF-16).
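For example, a plain byte-wise std::string::find works fine on UTF-8 text when the needle is pure ASCII (this tiny sketch is mine, not from the question):

```cpp
// Because no UTF-8 multi-byte sequence ever contains a byte below 0x80,
// searching for a plain ASCII substring needs no decoding at all.
#include <iostream>
#include <string>

int main() {
    std::string text = "pri\xC3\xA8re de r\xC3\xA9pondre";   // "prière de répondre" in UTF-8
    auto pos = text.find("de");                                // ordinary byte-wise search
    std::cout << "found \"de\" at byte offset " << pos << "\n";  // 8
}
```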
As NathanOliver points out, using any Unicode encoding will only support the languages, glyphs, scripts, characters, etc. that Unicode supports. As a practical matter, that is the vast majority of the commonly used languages in the world. It is not every language (and it has failings in how it handles some languages that it does support), but it's by far the most comprehensive system we have today.
No, UTF-8 is not enough to support all languages (yet), because Unicode itself does not yet encode every script. From As Yet Unsupported Scripts, for example:
Loma
Naxi Dongba (Moso)
are currently not supported.
Given a Unicode character, we want to find out what languages include this character, and more importantly, understand whether or not each language is Left-To-Right.
For example, the character A might be both English and Spanish which are both LTR languages.
I want this for my own text editor.
Can anyone help me in finding an API function or something that solves my problem?
Thanks in advance
Unicode-wise, LTR/RTL is a property of characters, not of the languages that use that character. This matters because embedded English in an Arabic text should be displayed left-to-right, even if for simplicity the document as a whole may be marked as Arabic. If you're using JCL, these properties can be obtained using the UnicodeIsLeftToRight and UnicodeIsRightToLeft functions. Note that characters may be neither left-to-right nor right-to-left, and also note that JCL uses a private copy of the Unicode character list that may be a subtly different version from what any specific version of Windows uses.
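If you're not on Delphi/JCL, ICU exposes the same per-character property; here is a minimal C++ sketch using ICU's u_charDirection (the ICU calls are real, the helper wrapped around them is just illustrative):

```cpp
// Query the Unicode bidirectional category of a code point with ICU (link with -licuuc).
// U_RIGHT_TO_LEFT and U_RIGHT_TO_LEFT_ARABIC are the strong RTL classes;
// everything else is treated here as not right-to-left.
#include <unicode/uchar.h>
#include <cstdio>

bool is_right_to_left(UChar32 c) {
    UCharDirection d = u_charDirection(c);
    return d == U_RIGHT_TO_LEFT || d == U_RIGHT_TO_LEFT_ARABIC;
}

int main() {
    std::printf("U+0041 'A'         RTL? %d\n", is_right_to_left(0x0041));  // 0
    std::printf("U+05D0 Hebrew alef RTL? %d\n", is_right_to_left(0x05D0));  // 1
    std::printf("U+0627 Arabic alef RTL? %d\n", is_right_to_left(0x0627));  // 1
}
```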
Regarding the question in the title, you would need to carry out an extensive study of the use of characters in the languages of the world. There are a few thousands of languages, though many of them have no regular writing system; on the other hand, some languages have several writing systems. Different variants of a language may have different repertoires of characters.
So it would be a major effort, though some data has been compiled e.g. in the CLDR repertoire – but the concept “characters used in a language” is far from clear. (Are the characters æ, è, and ö used in English? They sure appear in some forms of written English.)
So it would be unrealistic to expect to find a library routine for such purposes.
Apparently your real need was for deciding whether a character is a left-to-right character or a right-to-left character. But for completeness, I have provided an answer to what you actually asked and that might be relevant in some other contexts.
Given a string in the form of a pointer to an array of bytes (chars), how can I detect the encoding of the string in C/C++ (I'm using Visual Studio 2008)? I did a search, but most of the samples are done in C#.
Thanks
Assuming you know the length of the input array, you can make the following guesses (a rough sketch of these checks follows the list):
First, check to see if the first few bytes match any well-known byte order marks (BOMs) for Unicode. If they do, you're done!
Next, search for '\0' before the last byte. If you find one, you might be dealing with UTF-16 or UTF-32. If you find multiple consecutive '\0's, it's probably UTF-32.
If any byte is in the range 0x80 to 0xFF, it's certainly not ASCII or UTF-7. If you are restricting your input to some variant of Unicode, you can assume it's UTF-8. Otherwise, you have to do some guessing to determine which multi-byte character set it is. That will not be fun.
At this point it is either: ASCII, UTF-7, Base64, or ranges of UTF-16 or UTF-32 that just happen to not use the top bit and do not have any null characters.
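Here is the rough sketch promised above, just to show the shape of those checks; the function name, thresholds, and return strings are illustrative only, not a robust detector:

```cpp
// Heuristic guess at the encoding of a byte buffer, in the order described above.
#include <cstddef>
#include <string>

std::string guess_encoding(const unsigned char* data, std::size_t len) {
    // 1. Byte order marks (check the 4-byte UTF-32LE BOM before the 2-byte UTF-16LE one).
    if (len >= 3 && data[0] == 0xEF && data[1] == 0xBB && data[2] == 0xBF) return "UTF-8 (BOM)";
    if (len >= 4 && data[0] == 0xFF && data[1] == 0xFE && data[2] == 0x00 && data[3] == 0x00) return "UTF-32LE (BOM)";
    if (len >= 4 && data[0] == 0x00 && data[1] == 0x00 && data[2] == 0xFE && data[3] == 0xFF) return "UTF-32BE (BOM)";
    if (len >= 2 && data[0] == 0xFF && data[1] == 0xFE) return "UTF-16LE (BOM)";
    if (len >= 2 && data[0] == 0xFE && data[1] == 0xFF) return "UTF-16BE (BOM)";

    // 2. Embedded NUL bytes suggest a wide encoding; runs of NULs suggest UTF-32.
    std::size_t nuls = 0, run = 0, max_run = 0;
    bool high = false;
    for (std::size_t i = 0; i < len; ++i) {
        if (data[i] == 0x00) { ++nuls; ++run; if (run > max_run) max_run = run; }
        else run = 0;
        if (data[i] >= 0x80) high = true;
    }
    if (nuls > 0) return (max_run >= 2) ? "probably UTF-32" : "probably UTF-16";

    // 3. High bytes rule out ASCII and UTF-7; if we expect Unicode, assume UTF-8.
    if (high) return "assume UTF-8 (or some legacy multi-byte encoding)";

    // 4. Otherwise: ASCII, UTF-7, Base64, or a lucky slice of UTF-16/UTF-32.
    return "ASCII-compatible (ambiguous)";
}
```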
It's not an easy problem to solve, and generally relies on heuristics to take a best guess at what the input encoding is, which can be tripped up by relatively innocuous inputs - for example, take a look at this Wikipedia article and The Notepad file encoding Redux for more details.
If you're looking for a Windows-only solution with minimal dependencies, you can look at using a combination of IsTextUnicode and MLang's DetectInputCodePage to attempt character set detection.
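As a minimal Windows-only sketch of the IsTextUnicode half (MLang's COM-based IMultiLanguage2::DetectInputCodepage is more involved, so it's omitted here):

```cpp
// Windows-only sketch: ask IsTextUnicode for a statistical opinion on whether a
// buffer looks like UTF-16. It is only a guess (famously fooled by short inputs),
// so pair it with MLang for legacy code page detection. Link with Advapi32.lib.
#include <windows.h>
#include <cstdio>

int main() {
    const char data[] = "A\0B\0C\0D\0";           // byte pattern of UTF-16LE "ABCD"
    INT flags = IS_TEXT_UNICODE_UNICODE_MASK;     // in: tests to run, out: tests that passed
    BOOL looksUnicode = IsTextUnicode(data, sizeof(data) - 1, &flags);
    std::printf("IsTextUnicode says: %s (flags=0x%X)\n",
                looksUnicode ? "probably UTF-16" : "probably not UTF-16", flags);
}
```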
If you are looking for portability, but don't mind taking on a fairly large dependency in the form of ICU, then you can make use of its character set detection routines to achieve the same thing in a portable manner.
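If you go the ICU route, its charset detector API is small; here is a pared-down sketch (the ICU calls are real, error handling is trimmed for brevity):

```cpp
// Detect the charset of a byte buffer with ICU's charset detector.
// Link with -licui18n -licuuc.
#include <unicode/ucsdet.h>
#include <cstdio>

int main() {
    const char data[] = "Z\xC3\xBCrich ist sch\xC3\xB6n";   // UTF-8 sample text
    UErrorCode status = U_ZERO_ERROR;

    UCharsetDetector* det = ucsdet_open(&status);
    ucsdet_setText(det, data, sizeof(data) - 1, &status);
    const UCharsetMatch* match = ucsdet_detect(det, &status);   // best single guess

    if (U_SUCCESS(status) && match != nullptr) {
        std::printf("detected: %s (confidence %d%%)\n",
                    ucsdet_getName(match, &status),
                    ucsdet_getConfidence(match, &status));
    }
    ucsdet_close(det);
}
```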
I have written a small C++ library for detecting text file encoding. It uses Qt, but it could just as easily be implemented with only the standard library.
It operates by measuring symbol occurrence statistics and comparing them to pre-computed reference values for different encodings and languages. As a result, it not only detects the encoding but also the language of the text. The downside is that pre-computed statistics must be provided for the target language in order to detect that language properly.
https://github.com/VioletGiraffe/text-encoding-detector
In a recent web application I built, I was pleasantly surprised when one of our users decided to use it to create something entirely in Japanese. However, the text was wrapped strangely and awkwardly. Apparently browsers don't cope with wrapping Japanese text very well, probably because it contains few spaces, as each character forms a whole word. However, that's not really a safe assumption to make as some words are constructed of several characters, and it is not safe to break some character groups into different lines.
Googling around hasn't really helped me understand the problem any better. It seems to me like one would need a dictionary of unbreakable patterns, and assume that everywhere else is safe to break. But I fear I don't know enough about Japanese to really know all the words, which I understand from some of my searching, are quite complicated.
How would you approach this problem? Are there any libraries or algorithms you are aware of that already exist that deal with this in a satisfactory way?
Japanese word wrap rules are called kinsoku shori and are surprisingly simple. They're actually mostly concerned with punctuation characters and do not try to keep words unbroken at all.
I just checked with a Japanese novel and indeed, both words in the syllabic kana script and those consisting of multiple Chinese ideograms are wrapped mid-word with impunity.
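To make the rule concrete, here is an illustrative C++ sketch of the break test kinsoku shori implies; the character sets are a tiny subset of the real rules and the helper name is made up:

```cpp
// Core kinsoku shori idea: a line may break between any two characters, except
// where that would put certain punctuation at the start of a line or leave an
// opening bracket dangling at the end of one. Words are otherwise broken freely.
#include <cstddef>
#include <iostream>
#include <string>
#include <unordered_set>

// Characters that must not start a line (closing punctuation, small kana, long vowel mark).
const std::unordered_set<char32_t> kNoLineStart = {
    U'。', U'、', U'，', U'．', U'」', U'』', U'）',
    U'ー', U'ぁ', U'っ', U'ゃ', U'ゅ', U'ょ'
};
// Characters that must not end a line (opening brackets and quotes).
const std::unordered_set<char32_t> kNoLineEnd = { U'「', U'『', U'（' };

// May we insert a line break between text[i-1] and text[i]?
bool can_break_before(const std::u32string& text, std::size_t i) {
    if (i == 0 || i >= text.size()) return false;
    if (kNoLineStart.count(text[i]))   return false;   // don't start the next line with 。 etc.
    if (kNoLineEnd.count(text[i - 1])) return false;   // don't leave 「 at the end of a line
    return true;                                       // otherwise break anywhere, even mid-word
}

int main() {
    std::u32string line = U"彼は「はい」と言った。";
    std::cout << can_break_before(line, 3)    // 0: would strand 「 at a line end
              << can_break_before(line, 6)    // 1: breaking after 」 is fine
              << can_break_before(line, 10)   // 0: 。 may not start a line
              << "\n";                        // prints 010
}
```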
The projects listed below are useful for handling Japanese word wrap (or word break, from another point of view).
budou (Python): https://github.com/google/budou
mikan (JS): https://github.com/trkbt10/mikan.js
mikan.sharp (C#): https://github.com/YoungjaeKim/mikan.sharp
mikan takes a regex-based approach, while budou uses natural language processing.
If a Ruby regular expression is matching against something that isn't a String, the to_str method is called on that object to get an actual String to match against. I want to avoid this behavior; I'd like to match regular expressions against objects that aren't Strings, but can be logically thought of as randomly accessible sequences of bytes, and all accesses to them are mediated through a byte_at() method (similar in spirit to Java's CharSequence.charAt() method).
For example, suppose I want to find the byte offset in an arbitrary file of an arbitrary regular expression; the expression might be multi-line, so I can't just read in a line at a time and look for a match in each line. If the file is very big, I can't fit it all in memory, so I can't just read it in as one big string. However, it would be simple enough to define a method that gets the nth byte of a file (with buffering and caching as needed for speed).
Eventually, I'd like to build a fully featured rope class, like in Ruby Quiz #137, and I'd like to be able to use regular expressions on them without the performance loss of converting them to strings.
I don't want to get up to my elbows in the innards of Ruby's regular expression implementation, so any insight would be appreciated.
You can't. This wasn't supported in Ruby 1.8.x, probably because it's such an edge case; and in 1.9 it wouldn't even make sense. Ruby 1.9 doesn't map its strings to bytes in any user-serviceable fashion; instead it uses character code points, so that it can support the multitude of encodings that it accepts. And 1.9's new optimized regex engine, Oniguruma, is also built around the same concept of encodings and code points. Bytes just don't enter into the picture at this level.
I have a suspicion that what you're asking for is a case of premature optimization. For any reasonable Ruby object, implementing to_str shouldn't be a huge performance hurdle. If it is, then Ruby's probably the wrong tool for you, as it abstracts and insulates you from your raw data in all sorts of ways.
Your example of looking for a byte sequence in a large binary file isn't an ideal use case for Ruby -- you'd be better off using grep or some other Unix tool. If you need the results in your Ruby program, run it as a system process using backticks and process the output.