Given a Unicode character, we want to find out what languages include this character, and more importantly, understand whether or not each language is Left-To-Right.
For example, the character A is used in both English and Spanish, which are both LTR languages.
I want this for my own text editor.
Can anyone help me in finding an API function or something that solves my problem?
Thanks in advance
Unicode-wise, LTR/RTL is a property of characters, not of the languages that use that character. This matters because embedded English in an Arabic text should be displayed left-to-right, even if for simplicity the document as a whole may be marked as Arabic. If you're using JCL, these properties can be obtained using the UnicodeIsLeftToRight and UnicodeIsRightToLeft functions. Note that characters may be neither left-to-right nor right-to-left, and also note that JCL uses a private copy of the Unicode character list that may be a subtly different version from what any specific version of Windows uses.
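To illustrate that direction is a per-character property, here is a minimal sketch in Python rather than JCL, using the standard `unicodedata` module. The helper name `direction` and the three labels it returns are assumptions made for this example. Like JCL, Python ships its own copy of the Unicode data (see `unicodedata.unidata_version`), so results may differ slightly from what a given Windows version reports.

```python
import unicodedata

# Bidi classes from the Unicode Character Database:
# "L" is strong left-to-right; "R" and "AL" are strong right-to-left.
STRONG_LTR = {"L"}
STRONG_RTL = {"R", "AL"}

def direction(ch: str) -> str:
    """Classify one character as 'LTR', 'RTL', or 'neutral' (hypothetical helper)."""
    bidi = unicodedata.bidirectional(ch)
    if bidi in STRONG_LTR:
        return "LTR"
    if bidi in STRONG_RTL:
        return "RTL"
    # Digits, spaces, punctuation, and unassigned code points end up here:
    # they are neither strongly left-to-right nor strongly right-to-left.
    return "neutral"
```

For example, `direction("A")` gives `'LTR'`, `direction("\u05D0")` (Hebrew alef) gives `'RTL'`, and `direction(" ")` gives `'neutral'`.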
Regarding the question in the title, you would need to carry out an extensive study of the use of characters in the languages of the world. There are a few thousand languages, though many of them have no regular writing system; on the other hand, some languages have several writing systems. Different variants of a language may have different repertoires of characters.
So it would be a major effort, though some data has been compiled e.g. in the CLDR repertoire – but the concept “characters used in a language” is far from clear. (Are the characters æ, è, and ö used in English? They sure appear in some forms of written English.)
So it would be unrealistic to expect to find a library routine for such purposes.
Apparently your real need was for deciding whether a character is a left-to-right character or a right-to-left character. But for completeness, I have provided an answer to what you actually asked and that might be relevant in some other contexts.
I am working with Stanford OpenIE, but I do not know whether it supports Chinese text. If it does support Chinese, how can I use Stanford OpenIE on Chinese text?
Any guidance will be appreciated.
Stanford's OpenIE system was developed for English. It's based off of universal dependencies, meaning that in theory it shouldn't be too hard to adapt to other languages; but, nonetheless, it's highly unlikely that it would work out of the box.
At minimum, the relation triple segmenter would have to be adapted for Chinese. For some of the more subtle functionality, the code to mark natural logic polarity and the code to score prepositional phrase deletions would have to be rewritten.
In Ruby 1.8.7, how can I get the Unicode character that comes alphabetically right after a given character?
If you mean "next in the code page" then you can always hack around with the bytes and find out. You will probably end up falling into holes with no assigned characters if you go exploring the code page sequentially. This would mean "Unicode-abetically" if you can imagine such a term.
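As a rough sketch of the "next in code point order" reading, the following Python (not Ruby 1.8.7) steps forward from a character and skips the holes with no assigned character. The function name `next_assigned` is made up for this example, and "assigned" here depends on the Unicode version bundled with the interpreter.

```python
import sys
import unicodedata

def next_assigned(ch):
    """Return the next assigned character after ch in code point order, or None."""
    for cp in range(ord(ch) + 1, sys.maxunicode + 1):
        candidate = chr(cp)
        # "Cn" marks unassigned code points, "Cs" marks surrogates;
        # both are skipped so the result is always a real character.
        if unicodedata.category(candidate) not in ("Cn", "Cs"):
            return candidate
    return None
```

Note that this is code point order, not alphabetical order in any language: `next_assigned("z")` is `"{"`, not a letter.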
If you mean "alphabetically" then you're out of luck since that doesn't mean anything. The concept of alphabetic order varies considerably from one language to another and is sometimes even context-specific. Some languages don't even have a set order to their characters at all. This is the reason why some systems have a collation in addition to an encoding. The collation defines order, but often many letters are considered equivalent for the purposes of sorting, further complicating things.
Ruby 1.8.7 is also not aware of Unicode in general and pretends everything is an 8-bit ASCII string with one-byte characters. Ruby 1.9 can parse multi-byte UTF-8 into separate characters and might make this exercise a lot easier.
Given a string in the form of a pointer to an array of bytes (chars), how can I detect the encoding of the string in C/C++ (I'm using Visual Studio 2008)? I did a search, but most of the samples are in C#.
Thanks
Assuming you know the length of the input array, you can make the following guesses:
First, check to see if the first few bytes match any well-known byte order mark (BOM) for Unicode. If they do, you're done!
Next, search for '\0' before the last byte. If you find one, you might be dealing with UTF-16 or UTF-32. If you find multiple consecutive '\0's, it's probably UTF-32.
If any character is from 0x80 to 0xff, it's certainly not ASCII or UTF-7. If you are restricting your input to some variant of Unicode, you can assume it's UTF-8. Otherwise, you have to do some guessing to determine which multi-byte character set it is. That will not be fun.
At this point it is either: ASCII, UTF-7, Base64, or ranges of UTF-16 or UTF-32 that just happen to not use the top bit and do not have any null characters.
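The guesses above can be sketched as follows. This is Python rather than C/C++, and only a heuristic outline: the function name `guess_encoding` and the returned labels are invented for this example, and the trial UTF-8 decode is an extra check that goes slightly beyond simply assuming UTF-8.

```python
import codecs

# UTF-32 BOMs are tested first because the UTF-32-LE BOM (FF FE 00 00)
# begins with the UTF-16-LE BOM (FF FE).
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

def guess_encoding(data: bytes) -> str:
    """Best-effort guess; the labels are informal, not registered charset names."""
    # Step 1: a byte order mark settles the question.
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    # Step 2: a NUL before the last byte suggests UTF-16 or UTF-32;
    # consecutive NULs point towards UTF-32.
    if b"\x00" in data[:-1]:
        return "utf-32?" if b"\x00\x00" in data else "utf-16?"
    # Step 3: any byte in 0x80..0xFF rules out ASCII and UTF-7.
    if any(byte >= 0x80 for byte in data):
        try:
            data.decode("utf-8")
            return "utf-8"
        except UnicodeDecodeError:
            return "unknown-8bit"  # some other multi-byte or legacy code page
    # Step 4: only 7-bit bytes remain (ASCII, UTF-7, Base64, ...).
    return "ascii"
```

The C or C++ version would follow the same order of checks on a `const unsigned char*` plus a length.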
It's not an easy problem to solve, and generally relies on heuristics to take a best guess at what the input encoding is, which can be tripped up by relatively innocuous inputs - for example, take a look at this Wikipedia article and The Notepad file encoding Redux for more details.
If you're looking for a Windows-only solution with minimal dependencies, you can look at using a combination of IsTextUnicode and MLang's DetectInputCodePage to attempt character set detection.
If you are looking for portability, but don't mind taking on a fairly large dependency in the form of ICU, then you can make use of its character set detection routines to achieve the same thing in a portable manner.
I have written a small C++ library for detecting text file encoding. It uses Qt, but it can be just as easily implemented using just the standard library.
It operates by measuring symbol occurrence statistics and comparing it to pre-computed reference values in different encodings and languages. As a result, it not only detects encoding but also the language of the text. The downside is that pre-computed statistics must be provided for the target language to detect this language properly.
https://github.com/VioletGiraffe/text-encoding-detector
I'm writing some Extended Backus–Naur Form grammars for document parsing. There are lots of excellent guides for the syntax of these definitions, but very little online about how to design and structure them.
Can anyone suggest good articles (or general tips) about how you like to approach writing these? There does seem to be an element of style, even if the final parse trees can be equivalent.
e.g. things like:
Deciding if you should explicitly tag newlines, or just treat it as whitespace?
Naming schemes for your nonterminals
Handling optional whitespace in long definitions
When to use bad syntax checks vs just letting those not match
Thanks,
You should work in the direction that you are most comfortable with - either bottom-up, top-down, or "sandwich" (do a little of both, meet somewhere in the middle).
Any "group" that can be derived and has a meaning of its own should get its own non-terminal. So for example, I would use a non-terminal for all newline-related whitespace, one for all other whitespace, and one for all whitespace (which is basically the union of the former two).
Naming conventions in grammars in general are that non-terminals are, or start with, a capital letter, and terminals start with non-capitals (but this of course depends on the language you're designing).
Regarding bad syntax checks, I'm not familiar with the concept. What I know of EBNF is that you just write everything your language accepts, and only that.
Generally, just look around at some EBNFs of different languages from different websites, get a feeling of how they look, and then do what feels right to you.
In a recent web application I built, I was pleasantly surprised when one of our users decided to use it to create something entirely in Japanese. However, the text was wrapped strangely and awkwardly. Apparently browsers don't cope with wrapping Japanese text very well, probably because it contains few spaces, as each character forms a whole word. However, that's not really a safe assumption to make as some words are constructed of several characters, and it is not safe to break some character groups into different lines.
Googling around hasn't really helped me understand the problem any better. It seems to me like one would need a dictionary of unbreakable patterns, and assume that everywhere else is safe to break. But I fear I don't know enough about Japanese to really know all the words, which I understand from some of my searching, are quite complicated.
How would you approach this problem? Are there any libraries or algorithms you are aware of that already exist that deal with this in a satisfactory way?
Japanese word wrap rules are called kinsoku shori and are surprisingly simple. They're actually mostly concerned with punctuation characters and do not try to keep words unbroken at all.
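A very simplified sketch of the idea in Python: break at a fixed width, but move the break earlier when the next line would start with closing punctuation or the current line would end with an opening bracket. The function names and the two character sets are illustrative assumptions, not a complete kinsoku rule set (for instance, small kana are left out).

```python
# Illustrative subsets only; real kinsoku tables are longer.
NO_LINE_START = set("、。，．！？）］｝」』】")
NO_LINE_END = set("（［｛「『【")

def break_allowed(text: str, i: int) -> bool:
    """True if a line break may be placed between text[i-1] and text[i]."""
    return text[i] not in NO_LINE_START and text[i - 1] not in NO_LINE_END

def wrap(text: str, width: int) -> list:
    """Split text into lines of at most `width` characters (hypothetical helper)."""
    lines = []
    start = 0
    while start < len(text):
        end = min(start + width, len(text))
        if end < len(text):
            # Back up until the break no longer violates a rule.
            cut = end
            while cut > start + 1 and not break_allowed(text, cut):
                cut -= 1
            # If no legal break exists, fall back to a hard break at `width`.
            if break_allowed(text, cut):
                end = cut
        lines.append(text[start:end])
        start = end
    return lines
```

Note that nothing here looks at word boundaries, which matches the point above: the rules concern punctuation, and words may be split freely.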
I just checked with a Japanese novel and indeed, both words in the syllabic kana script and those consisting of multiple Chinese ideograms are wrapped mid-word with impunity.
The projects listed below are useful for handling Japanese word wrap (or word breaking, from another point of view).
budou (Python): https://github.com/google/budou
mikan (JS): https://github.com/trkbt10/mikan.js
mikan.sharp (C#): https://github.com/YoungjaeKim/mikan.sharp
mikan takes a regex-based approach, while budou uses natural language processing.