Language detection for pinyin, translit etc? - internationalization

Real-world user-generated text in non-Latin alphabet languages is often not in canonical form but in translit, shlyokavitsa, arabizi, pinyin and so on. Language detection software is starting to handle it smartly, but usually it doesn't work, even though it's technically fairly trivial to incorporate it.
Is there a language detection system that is handling these informal Latinisations well? (Ideally a Python lib, but any language or service would be interesting.)
The Yandex, Microsoft and top Python lang id libs, like langid, have nothing on this front. Two that halfway work are known to me, both from Google:
- CLD, which is part of Chrome
- the Google Translate API
Besides only recognising translit for a few top languages, they are not ideal for a variety of reasons (accuracy, performance, price...)
This is a major issue for major languages like Hindi, Persian, Chinese, Arabic and Russian, and for all the other languages not written in the Latin alphabet but commonly Latinised (Romanised) online.
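To illustrate why folding romanised text into a detector is not conceptually hard, here is a minimal sketch of a character n-gram classifier that treats each romanisation as just another label. The label names, the toy training sentences and the cosine scoring are all assumptions made for illustration; a usable system would need far more training text per label and is not what any of the libraries mentioned above actually ship.

```python
from collections import Counter
import math

def char_ngrams(text, n=3):
    """Count character n-grams, padding the text with one space on each side."""
    padded = " " + " ".join(text.lower().split()) + " "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))

def train(samples, n=3):
    """Build one n-gram profile per label from (label, text) pairs."""
    profiles = {}
    for label, text in samples:
        profiles.setdefault(label, Counter()).update(char_ngrams(text, n))
    return profiles

def cosine(a, b):
    """Cosine similarity between two n-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def detect(text, profiles, n=3):
    """Return the label whose profile is most similar to the text."""
    grams = char_ngrams(text, n)
    return max(profiles, key=lambda label: cosine(grams, profiles[label]))

# Toy training data (hypothetical; real profiles need much more text).
TOY_SAMPLES = [
    ("ru-translit", "privet kak dela chto ty delaesh segodnya spasibo horosho"),
    ("hi-romanised", "namaste aap kaise hain main theek hoon dhanyavaad kya haal hai"),
    ("zh-pinyin", "ni hao ma wo hen hao xie xie ni zai zuo shen me"),
    ("en", "hello how are you what are you doing today thank you very much"),
]

profiles = train(TOY_SAMPLES)
print(detect("kak dela segodnya", profiles))
```

The hard part in practice is collecting representative informal text for each romanisation, since spellings are not standardised.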

Related

How can I use stanford-openIE for Chinese text

I am working on Stanford-openIE, but I do not know whether it supports Chinese text or not. If it supports the Chinese language, how can I use stanford-openIE for Chinese text?
Any guidance will be appreciated.
Stanford's OpenIE system was developed for English. It's based off of universal dependencies, meaning that in theory it shouldn't be too hard to adapt to other languages; but, nonetheless, it's highly unlikely that it would work out of the box.
At minimum, the relation triple segmenter would have to be adapted for Chinese. For some of the more subtle functionality, the code to mark natural logic polarity and the code to score prepositional phrase deletions would have to be rewritten.

Mutual intelligibility of programming languages

Borrowing the term from linguistics, which programming languages, if any, are mutually intelligible to some degree? To clarify, suppose we know programming language x, but we happen to need to read some code in language y. Is fluency or even basic knowledge of certain programming languages helpful in understanding the syntax of some other language we do not know?
As someone who knows around 20 different computer languages, I can say without any hesitation that it absolutely helps. And I would say it does not in any way restrict itself to a subset of languages, but it definitely varies in degrees between certain languages.
For example, knowing Java I picked up C# with barely any effort. The concepts and feel were similar enough that it was a trivial jump. However, picking up LISP, a functional programming language, was a much different process, one that required me to think differently to really grasp it. I would equate that with the difference between learning to write Spanish after knowing English, and then learning to write Chinese. The concept of a phonetic alphabet makes a big difference in the ease with which one might pick it up.
And, like how many languages evolved from Latin, many computer languages have evolved from common roots like C. So, like languages, you can see the common ancestry.
I use JavaScript and Ruby in my day-to-day life, but I can also look at some Objective-C and figure out what it's trying to do (even if I couldn't write it myself). Generally, the more languages you know, the easier it is to learn another.
Computer languages are organized into various kinds, much like actual languages. And if you've learned one kind, others of the same kind are easier. For example, if you only speak Portuguese, you'll probably understand more Spanish than a Chinese speaker would. And if you speak Chinese you'll be able to read some Japanese kanji, since kanji derive from Chinese characters.
Specifically, computer languages are divided into procedural languages (C, Fortran), object-oriented languages (C++, Ruby), and functional languages (Haskell, Clojure). Of course, some languages borrow elements from several of these (JavaScript), so there are shades of grey.
tldr: Yes, knowing one language can help you understand another.

How should I go about coding an IME for Windows? (7 and below)

I would like to program an IME for Windows for East Asian languages which are not well-supported natively yet. This would require candidate lists.
For now, I would like to try programming a basic model which remaps keyboard characters to output phonetic characters.
Not too sure where to start though, haven't managed to find many resources on coding your own IME.
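As a starting point for the "basic model", the remapping logic itself can be prototyped separately from any Windows integration. The sketch below is only that logic, written in Python for brevity: a greedy longest-match lookup from typed Latin keys to output characters. The `KEYMAP` table and the `remap` function are hypothetical names, and the handful of hiragana entries is just a stand-in for whatever script is being targeted. Hooking this into Windows (via IMM32 or the Text Services Framework) is a separate task that this does not address.

```python
# Hypothetical mapping table: typed key sequences -> output characters.
KEYMAP = {
    "a": "あ", "i": "い", "n": "ん",
    "ka": "か", "ki": "き", "sa": "さ", "shi": "し",
}
MAX_KEY = max(len(key) for key in KEYMAP)

def remap(keys):
    """Convert a typed string using greedy longest-match lookup in KEYMAP.

    Characters with no mapping are passed through unchanged.
    """
    out = []
    i = 0
    while i < len(keys):
        # Try the longest possible key sequence first, then shorter ones.
        for size in range(min(MAX_KEY, len(keys) - i), 0, -1):
            chunk = keys[i:i + size]
            if chunk in KEYMAP:
                out.append(KEYMAP[chunk])
                i += size
                break
        else:
            out.append(keys[i])
            i += 1
    return "".join(out)

print(remap("sashi"))
```

A candidate list would replace the single value per key with a list of options and let the user pick one, but the lookup stays the same.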

Is there a framework for writing phrase structure rules out there that is opensource?

I've worked with the Xerox toolchain so far, which is powerful, not open source, and a bit overkill for my current problem. Are there libraries that allow me to implement a phrase structure grammar? Preferably in Ruby or Lisp.
AFAIK, there's no open-source Lisp phrase structure parser available.
But since a parser is actually a black box, it's not so hard to make your application work with a parser written in any language, especially as they produce S-expressions as output. For example, with something like pfp you can just pipe your sentences as strings to it, then read and process the resulting trees. Or you can wrap a socket server around it and you'll get a distributed system :)
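To make the "read and process the resulting trees" step concrete, here is a minimal sketch of reading one bracketed parse tree into nested lists. It is written in Python rather than Ruby or Lisp purely for illustration (in Lisp the reader does this for you). The function names and the example tree are assumptions, and the sketch does not show how to invoke any particular parser.

```python
import re

def tokenize(sexpr):
    """Split a bracketed tree string into parentheses and atoms."""
    return re.findall(r"\(|\)|[^\s()]+", sexpr)

def parse(tokens):
    """Parse a token list into nested lists, e.g. ["S", ["NP", ...], ...]."""
    def read(pos):
        if tokens[pos] != "(":
            return tokens[pos], pos + 1
        node = []
        pos += 1
        while tokens[pos] != ")":
            child, pos = read(pos)
            node.append(child)
        return node, pos + 1

    tree, end = read(0)
    if end != len(tokens):
        raise ValueError("trailing tokens after the first tree")
    return tree

def leaves(tree):
    """Return the words of the sentence in order (the first item of a node is its label)."""
    if isinstance(tree, str):
        return [tree]
    words = []
    for child in tree[1:]:
        words.extend(leaves(child))
    return words

tree = parse(tokenize("(S (NP (DT the) (NN cat)) (VP (VBD sat)))"))
print(tree[0], leaves(tree))
```

Unbalanced input will raise an `IndexError` here; a production reader would want friendlier error handling.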
There's also cl-langutils, which may be helpful in some basic NLP tasks, like tokenization and, maybe, POS tagging. But overall, it's much less mature and feature-rich than the commonly used packages, like Stanford's or OpenNLP.

Adding Accents to Speech Generation

The first part of this question is now its own, here: Analyzing Text for Accents
Question: How could accents be added to generated speech?
What I've come up with:
I do not mean just accent marks, or inflection, or anything singular like that. I mean something like a full British accent, or a Scottish accent, or Russian, etc.
I would think that this could be done outside of the language as well. Ex: something in Russian could be generated with a British accent, or something in Mandarin could have a Russian accent.
I think the basic process would be this:
1. Analyze the text.
   - Compare with a database (or something like that) to determine what needs an accent, how strong it should be, etc.
2. Generate the speech in specified language.
   - Easy with normal text-to-speech processors.
3. Determine the specified accent based on the analyzed text.
   - This is the part in question.
   - I think an array of amplitudes and filters would work best for the next step.
4. Mesh speech and accent.
   - This would be the easy part.
   - It could probably be done by multiplying the speech by the accent, like many other DSP methods do.
This is really more of a general DSP question, but I'd like to come up with a programmatic algorithm to do this instead of a general idea.
This question isn't really "programming" per se: It's linguistics. The programming is comparatively easy. For the analysis, that's going to be really difficult, and in truth you're probably better off getting the user to specify the accent; Or are you going for an automated story reader?
However, a basic accent is doable with modern text-to-speech. Are you aware of the International Phonetic Alphabet? http://en.wikipedia.org/wiki/International_Phonetic_Alphabet
It basically lists all the sounds a human voice can make in speech. An accent is then just a mapping (a function) from sequences of those sounds to other sequences. For instance, to make an American accent sound British to an American person (though not sufficient to make it sound British to a British person), you can de-rhotacise it: drop the "r" sounds that are not followed by a vowel, as in "car" or "hard". (Lots of corner cases to work out just for this.)
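A minimal sketch of one such mapping, assuming the input is already a list of IPA symbols: it drops /ɹ/ whenever the next sound is not a vowel, which is one common description of a non-rhotic accent. The vowel inventory and the function name are assumptions for illustration, and the rule ignores real-world details such as vowel lengthening and linking "r" across word boundaries.

```python
# Rough vowel inventory used only to decide whether /ɹ/ precedes a vowel.
VOWELS = set("aeiouæɑɒɔəɛɜɪʊʌ")

def derhotacise(phonemes):
    """Drop /ɹ/ unless the following phoneme starts with a vowel."""
    out = []
    for i, phoneme in enumerate(phonemes):
        if phoneme == "ɹ":
            nxt = phonemes[i + 1] if i + 1 < len(phonemes) else ""
            if not nxt or nxt[0] not in VOWELS:
                continue  # "r" at the end of a word or before a consonant
        out.append(phoneme)
    return out

print(derhotacise(["h", "ɑ", "ɹ", "d"]))  # "hard"
print(derhotacise(["v", "ɛ", "ɹ", "i"]))  # "very"
```

Each additional feature of the target accent would be another function of the same shape, applied in sequence.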
Long and short: It's not easy, which is probably why no-one has done it. I'm sure a couple of linguistics professors out there would say it's impossible. But that's what linguistics professors do. But you'll basically need to read several thick textbooks on accents and pronunciation to make any headway with this problem. Good luck!
What is an accent?
An accent is not a sound filter; it's a pattern of acoustic realization of text in a language. You can't take a recording of American English, run it through an "array of amplitudes and filters", and have British English pop out. DSP is useful for implementing prosody, not accent.
Basically (and simplest to model), an accent consists of rules for phonetic realization of a sequence of phonemes. Perception of accent is further influenced by prosody and by which phonemes a speaker chooses when reading text.
Speech generation
The process of speech generation has two basic steps:
Text-to-phonemes: Convert written text to a sequence of phonemes (plus suprasegmentals like stress, and prosodic information like utterance boundaries). This is somewhat accent-dependent (e.g. the output for "laboratory" differs between American and British speakers).
Phoneme-to-speech: Given the sequence of phonemes, generate audio according to the dialect's rules for phonetic realizations of phonemes. (Typically you combine diphones and then acoustically adjust the prosody.) This is highly accent-dependent, and it is this step that imparts the main quality of the accent. A particular phoneme, even if shared between two accents, may have strikingly different acoustic realizations.
Normally these are paired. While you could have a British-accented speech generator that uses American pronunciations, that would sound odd.
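The two steps and their pairing can be sketched as two lookup stages. This is a toy model under loud assumptions: the `LEXICON` and `REALISATION` tables are hypothetical, the transcriptions are rough and omit stress, `OU` is a made-up symbol for the vowel of "go", and the second stage returns a string of phones as a stand-in for audio.

```python
# Stage 1 data: accent-dependent pronunciations (rough, stress omitted).
LEXICON = {
    "en-US": {"laboratory": ["l", "æ", "b", "r", "ə", "t", "ɔ", "r", "i"],
              "go": ["g", "OU"]},
    "en-GB": {"laboratory": ["l", "ə", "b", "ɒ", "r", "ə", "t", "r", "i"],
              "go": ["g", "OU"]},
}

# Stage 2 data: how each accent realises a shared phoneme.
REALISATION = {
    "en-US": {"OU": "oʊ", "r": "ɹ"},
    "en-GB": {"OU": "əʊ", "r": "ɹ"},
}

def text_to_phonemes(text, lexicon_accent):
    """Stage 1: look up each word's phonemes in the chosen accent's lexicon."""
    phonemes = []
    for word in text.lower().split():
        phonemes.extend(LEXICON[lexicon_accent][word])
    return phonemes

def phonemes_to_phones(phonemes, voice_accent):
    """Stage 2: realise each phoneme according to the chosen accent's table."""
    table = REALISATION[voice_accent]
    return [table.get(phoneme, phoneme) for phoneme in phonemes]

def speak(text, lexicon_accent, voice_accent=None):
    """Run both stages; by default the two accents are paired."""
    voice_accent = voice_accent or lexicon_accent
    return "".join(phonemes_to_phones(text_to_phonemes(text, lexicon_accent), voice_accent))

print(speak("go", "en-US"), speak("go", "en-GB"))
print(speak("laboratory", "en-US", voice_accent="en-GB"))  # mismatched pair
```

The last call is the "British voice with American pronunciations" case: nothing stops you from mixing the stages, it just gives an odd result.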
Generating speech with a given accent
Writing a text-to-speech program is an enormous amount of work (in particular, to implement one common scheme, you have to record a native speaker speaking each possible diphone in the language), so you'd be better off using an existing one.
In short, if you want a British accent, use a British English text-to-phoneme engine together with a British English phoneme-to-speech engine.
For common accents like American and British English, Standard Mandarin, Metropolitan French, etc., there will be several choices, including open-source ones that you will be able to modify (as below). For example, look at FreeTTS and eSpeak. For less common accents, existing engines unfortunately may not exist.
Speaking text with a foreign accent
English-with-a-foreign-accent is socially not very prestigious, so complete systems probably don't exist.
One strategy would be to combine an off-the-shelf text-to-phoneme engine for a native accent with a phoneme-to-speech engine for the foreign language. For example, a native Russian speaker that learned English in the U.S. would plausibly use American pronunciations of words like laboratory, and map its phonemes onto his native Russian phonemes, pronouncing them as in Russian. (I believe there is a website that does this for English and Japanese, but I don't have the link.)
The problem is that the result is too extreme. A real English learner would attempt to recognize and generate phonemes that do not exist in his native language, and would also alter his realization of his native phonemes to approximate the native pronunciation. How closely the result matches a native speaker of course varies, but using the pure foreign extreme sounds ridiculous (and mostly incomprehensible).
So to generate plausible American-English-with-a-Russian-accent (for instance), you'd have to write a text-to-phoneme engine. You could use existing American English and Russian text-to-phoneme engines as a starting point. If you're not willing to find and record a speaker with that accent, you could probably still get a decent approximation using DSP to combine the samples from those two engines. eSpeak uses formant synthesis rather than recorded samples, so with it, it might be easier to combine information from multiple languages.
Another thing to consider is that foreign speakers often modify the sequence of phonemes under influence by the phonotactics of their native language, typically by simplifying consonant clusters, inserting epenthetic vowels, or diphthongizing or breaking vowel sequences.
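One of these phonotactic adjustments, vowel epenthesis, can be sketched as a pass over the phoneme sequence. This is a deliberately crude assumption-laden version: it inserts the same vowel between any two adjacent consonants, whereas real speakers choose the vowel and the position according to their native language's rules. The vowel set and the function name are made up for the example.

```python
# Rough vowel set used only to tell vowels from consonants.
VOWELS = set("aeiouəɪʊɛɔæ")

def break_clusters(phonemes, vowel="ə"):
    """Insert an epenthetic vowel between every pair of adjacent consonants."""
    out = []
    for phoneme in phonemes:
        if out and out[-1] not in VOWELS and phoneme not in VOWELS:
            out.append(vowel)
        out.append(phoneme)
    return out

print("".join(break_clusters(["s", "t", "r", "i", "t"])))
```

Cluster simplification (deleting a consonant instead of inserting a vowel) would be a second pass with the same structure.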
There is some literature on this topic.

Resources