I am working on Stanford OpenIE, but I do not know whether it supports Chinese text. If it does, how can I use Stanford OpenIE for Chinese text?
Any guidance will be appreciated.
Stanford's OpenIE system was developed for English. It's based on Universal Dependencies, meaning that in theory it shouldn't be too hard to adapt to other languages; nonetheless, it's highly unlikely that it would work out of the box.
At a minimum, the relation triple segmenter would have to be adapted for Chinese. For some of the more subtle functionality, the code that marks natural logic polarity and the code that scores prepositional phrase deletions would have to be rewritten.
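For reference, here is roughly how the English pipeline is invoked (a minimal sketch using the standard CoreNLP annotator chain; per the caveats above, pointing it at Chinese would require far more than swapping in Chinese models):

```java
import edu.stanford.nlp.ie.util.RelationTriple;
import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.naturalli.NaturalLogicAnnotations;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.util.CoreMap;

import java.util.Collection;
import java.util.Properties;

public class OpenIEDemo {
    public static void main(String[] args) {
        // OpenIE sits at the end of this annotator chain; natlog is a prerequisite.
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize,ssplit,pos,lemma,depparse,natlog,openie");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

        Annotation doc = new Annotation("Obama was born in Hawaii.");
        pipeline.annotate(doc);

        // Each sentence carries a collection of extracted (subject, relation, object) triples.
        for (CoreMap sentence : doc.get(CoreAnnotations.SentencesAnnotation.class)) {
            Collection<RelationTriple> triples =
                sentence.get(NaturalLogicAnnotations.RelationTriplesAnnotation.class);
            for (RelationTriple t : triples) {
                System.out.println(t.subjectGloss() + " | " + t.relationGloss() + " | " + t.objectGloss());
            }
        }
    }
}
```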
Given a Unicode character, we want to find out what languages include this character, and more importantly, understand whether or not each language is Left-To-Right.
For example, the character A might be both English and Spanish which are both LTR languages.
I want this for my own text editor.
Can anyone help me find an API function or something else that solves my problem?
Thanks in advance
Unicode-wise, LTR/RTL is a property of characters, not of the languages that use that character. This matters because embedded English in an Arabic text should be displayed left-to-right, even if for simplicity the document as a whole may be marked as Arabic. If you're using JCL, these properties can be obtained using the UnicodeIsLeftToRight and UnicodeIsRightToLeft functions. Note that characters may be neither left-to-right nor right-to-left, and also note that JCL uses a private copy of the Unicode character list that may be a subtly different version from what any specific version of Windows uses.
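If you are in Java rather than Delphi, the same per-character Unicode property is available through Character.getDirectionality; a minimal sketch:

```java
public class DirectionalityDemo {
    public static void main(String[] args) {
        // Directionality is a per-character Unicode property, independent of language.
        int[] samples = {'A', 0x05D0 /* Hebrew alef */, 0x0627 /* Arabic alef */, '3', ' '};
        for (int cp : samples) {
            byte dir = Character.getDirectionality(cp);
            String label;
            switch (dir) {
                case Character.DIRECTIONALITY_LEFT_TO_RIGHT:        label = "LTR"; break;
                case Character.DIRECTIONALITY_RIGHT_TO_LEFT:        label = "RTL"; break;
                case Character.DIRECTIONALITY_RIGHT_TO_LEFT_ARABIC: label = "RTL (Arabic)"; break;
                default:                                            label = "neutral/other"; break;
            }
            System.out.printf("U+%04X -> %s%n", cp, label);
        }
    }
}
```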
Regarding the question in the title, you would need to carry out an extensive study of the use of characters in the languages of the world. There are a few thousands of languages, though many of them have no regular writing system; on the other hand, some languages have several writing systems. Different variants of a language may have different repertoires of characters.
So it would be a major effort, though some data has been compiled, e.g. in the CLDR repertoire. But the concept of “characters used in a language” is far from clear. (Are the characters æ, è, and ö used in English? They certainly appear in some forms of written English.)
So it would be unrealistic to expect to find a library routine for such purposes.
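That said, the compiled CLDR data mentioned above is at least programmatically accessible: ICU4J exposes per-locale "exemplar character" sets, which give a rough first approximation of a language's repertoire. A sketch, assuming ICU4J is on the classpath (and keeping in mind the caveats above about what "used in a language" even means):

```java
import com.ibm.icu.text.UnicodeSet;
import com.ibm.icu.util.LocaleData;
import com.ibm.icu.util.ULocale;

public class ExemplarDemo {
    public static void main(String[] args) {
        // CLDR exemplar characters: the letters conventionally used to write a language.
        int ch = 'ö';
        for (String tag : new String[] {"en", "de", "fi", "tr"}) {
            UnicodeSet exemplars = LocaleData.getExemplarSet(new ULocale(tag), UnicodeSet.IGNORE_SPACE);
            System.out.printf("'%c' in %s exemplar set: %b%n", ch, tag, exemplars.contains(ch));
        }
    }
}
```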
Apparently your real need was to decide whether a character is left-to-right or right-to-left. But for completeness, I have provided an answer to what you actually asked, which might be relevant in some other contexts.
I am trying to develop a Sinhala (my native language) to English translator.
I am still thinking about an approach.
If I parse a sentence of my language, can I then use that to generate the English sentence with the help of the Stanford parser? Or is there another method you can recommend?
I am also thinking of a bottom-up parser for my language, but I still have no idea how to implement one. Any suggestions for steps I can follow?
Thanks
Mathee
If you have a large enough bilingual corpus, then the best thing you can do is to train a model using an existing statistical machine translation system like Moses. Modern phrase-based SMT systems resemble a reverse parsing mechanism in the sense that they find the most probable combination of target-language phrases for translating a specific source-language sentence. There is more information in this Wikipedia article.
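Concretely, the classic noisy-channel formulation behind these systems picks the English sentence ê that maximizes:

```latex
\hat{e} \;=\; \operatorname*{arg\,max}_{e} \, P(e \mid f)
        \;=\; \operatorname*{arg\,max}_{e} \, P(f \mid e)\, P(e)
```

where f is the source-language (e.g. Sinhala) sentence, P(f | e) is a translation model estimated from the bilingual corpus, and P(e) is an English language model. (Phrase-based systems like Moses generalize this to a log-linear combination of such features.)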
This course on Coursera may help you implement a translator. From what I know based on that course, you can use a training set tagged by parts of speech (i.e. noun, verb, etc.) and use that training set to parse other sentences. I suggest looking into hidden Markov models.
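To make the HMM suggestion concrete, here is a toy Viterbi decoder for POS tagging. Every probability in it is made up for illustration; a real tagger would estimate them from the tagged training set:

```java
import java.util.Arrays;
import java.util.Map;

/** Toy Viterbi decoder for HMM part-of-speech tagging, with made-up probabilities. */
public class ViterbiPosDemo {
    static String[] tags = {"NOUN", "VERB"};
    static double[] start = {0.7, 0.3};                  // P(first tag)
    static double[][] trans = {{0.3, 0.7}, {0.8, 0.2}};  // P(tag_t | tag_{t-1})
    static Map<String, double[]> emit = Map.of(          // P(word | tag), tiny vocabulary
        "dogs", new double[]{0.6, 0.1},
        "bark", new double[]{0.2, 0.8});

    static String[] decode(String[] words) {
        int n = words.length, k = tags.length;
        double[][] best = new double[n][k];  // log-prob of the best path ending in tag j at step t
        int[][] back = new int[n][k];        // backpointers for path recovery
        for (int j = 0; j < k; j++)
            best[0][j] = Math.log(start[j]) + Math.log(emit.get(words[0])[j]);
        for (int t = 1; t < n; t++)
            for (int j = 0; j < k; j++) {
                best[t][j] = Double.NEGATIVE_INFINITY;
                for (int i = 0; i < k; i++) {
                    double s = best[t - 1][i] + Math.log(trans[i][j]) + Math.log(emit.get(words[t])[j]);
                    if (s > best[t][j]) { best[t][j] = s; back[t][j] = i; }
                }
            }
        // Follow backpointers from the best final state.
        String[] out = new String[n];
        int j = best[n - 1][0] >= best[n - 1][1] ? 0 : 1;
        for (int t = n - 1; t >= 0; t--) { out[t] = tags[j]; j = back[t][j]; }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(decode(new String[]{"dogs", "bark"})));  // [NOUN, VERB]
    }
}
```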
My Pyramids parser is an unconventional single-sentence parser for English. (It is capable of parsing other languages too, but a grammar must be specified.) The parser can not only parse English into parse trees, but can also convert back and forth between parse trees and word-level semantic graphs, which are graphs that describe the semantic relationships between all the words in a sentence. The correct word order is reconstructed based on the contents of the graph; all that needs to be provided, other than the words and their relationships, is the type of sentence (statement, question, command) and the linguistic category of each word (noun, determiner, verb, etc.). From there it is straightforward to join the tokens of the parse tree into a sentence.
The parser is a (very early) alpha pre-release, but it is functional and actively maintained. I am currently using it to translate back and forth between English and an internal semantic representation used by a conversational agent (a "chat bot", but capable of more in-depth language understanding). If you decide to use the parser, do let me know. I will be happy to provide any assistance you might need with installing, using, or improving it.
The first part of this question has been split off into its own question, here: Analyzing Text for Accents
Question: How could accents be added to generated speech?
What I've come up with:
I do not mean just accent marks, or inflection, or anything singular like that. I mean something like a full British accent, or a Scottish accent, or Russian, etc.
I would think that this could be done outside of the language as well. For example, something in Russian could be generated with a British accent, or something in Mandarin could have a Russian accent.
I think the basic process would be this:

1. Analyze the text.
2. Compare with a database (or something like that) to determine what needs an accent, how strong it should be, etc.
3. Generate the speech in the specified language. (Easy with normal text-to-speech processors.)
4. Determine the specified accent based on the analyzed text. (This is the part in question. I think an array of amplitudes and filters would work best for the next step.)
5. Mesh speech and accent. (This would be the easy part. It could probably be done by multiplying the speech by the accent, like many other DSP methods do.)
This is really more of a general DSP question, but I'd like to come up with a programmatic algorithm to do this instead of just a general idea.
This question isn't really "programming" per se: it's linguistics. The programming is comparatively easy. The analysis is going to be really difficult, and in truth you're probably better off getting the user to specify the accent; or are you going for an automated story reader?
However, a basic accent is doable with modern text-to-speech. Are you aware of the International Phonetic Alphabet? http://en.wikipedia.org/wiki/International_Phonetic_Alphabet
It basically lists all the sounds a human voice can possibly make. An accent is then just a mapping (a function) from the alphabet to itself. For instance, to make an American accent sound British to an American person (though not sufficient to make it sound British to a British person), you can de-rhoticize all the "r" sounds in the middle of a word. So, for instance, the alveolar trill would be replaced with the voiced uvular fricative. (There are lots of corner cases to work out just for this.)
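In the crudest code terms, such a mapping is just a substitution table over phoneme symbols. A toy sketch (the symbols and substitutions are purely illustrative, and as noted there are many corner cases; real rules are context-sensitive):

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

/** Sketch: an accent as a substitution map over phoneme symbols (grossly simplified). */
public class AccentMapDemo {
    // Hypothetical American -> British-ish substitutions. A real rule would, e.g.,
    // only drop the rhotic when it is not followed by a vowel.
    static final Map<String, String> SUBSTITUTIONS = Map.of(
        "ɹ", "",      // de-rhoticize: drop the r sound
        "ɚ", "ə",     // r-colored schwa -> plain schwa
        "æ", "ɑː",    // the "bath" vowel in many British accents
        "t̬", "t");    // no flapped t

    static List<String> apply(List<String> phonemes) {
        return phonemes.stream()
            .map(p -> SUBSTITUTIONS.getOrDefault(p, p))
            .filter(p -> !p.isEmpty())
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // "water", roughly transcribed: w ɔ t̬ ɚ -> w ɔ t ə
        System.out.println(apply(List.of("w", "ɔ", "t̬", "ɚ")));
    }
}
```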
Long and short: it's not easy, which is probably why no one has done it. I'm sure a couple of linguistics professors out there would say it's impossible. But that's what linguistics professors do. You'll basically need to read several thick textbooks on accents and pronunciation to make any headway with this problem. Good luck!
What is an accent?
An accent is not a sound filter; it's a pattern of acoustic realization of text in a language. You can't take a recording of American English, run it through "array of amplitudes and filters", and have British English pop out. What DSP is useful for is in implementing prosody, not accent.
Basically (and simplest to model), an accent consists of rules for phonetic realization of a sequence of phonemes. Perception of accent is further influenced by prosody and by which phonemes a speaker chooses when reading text.
Speech generation
The process of speech generation has two basic steps:
Text-to-phonemes: Convert written text to a sequence of phonemes (plus suprasegmentals like stress, and prosodic information like utterance boundaries). This is somewhat accent-dependent (e.g. the output for "laboratory" differs between American and British speakers).
Phoneme-to-speech: given the sequence of phonemes, generate audio according to the dialect's rules for the phonetic realization of phonemes. (Typically you combine diphones and then acoustically adjust the prosody.) This is highly accent-dependent, and it is this step that imparts the main quality of the accent. A particular phoneme, even if shared between two accents, may have strikingly different acoustic realizations.
Normally these are paired. While you could have a British-accented speech generator that uses American pronunciations, that would sound odd.
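To make that two-stage decomposition concrete, here is its shape as code. Every interface and name below is hypothetical, invented purely for illustration:

```java
import java.util.List;

// Hypothetical interfaces sketching the two accent-dependent stages.
interface TextToPhonemes {
    /** Accent-dependent: "laboratory" transcribes differently in en-US and en-GB. */
    List<String> transcribe(String text);
}

interface PhonemesToSpeech {
    /** Highly accent-dependent: the acoustic realization of each phoneme. */
    byte[] synthesize(List<String> phonemes);
}

class SpeechGenerator {
    private final TextToPhonemes frontEnd;
    private final PhonemesToSpeech backEnd;

    // Normally both stages come from the same accent; mixing them is possible but sounds odd.
    SpeechGenerator(TextToPhonemes frontEnd, PhonemesToSpeech backEnd) {
        this.frontEnd = frontEnd;
        this.backEnd = backEnd;
    }

    byte[] speak(String text) {
        return backEnd.synthesize(frontEnd.transcribe(text));
    }
}
```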
Generating speech with a given accent
Writing a text-to-speech program is an enormous amount of work (in particular, to implement one common scheme, you have to record a native speaker speaking each possible diphone in the language), so you'd be better off using an existing one.
In short, if you want a British accent, use a British English text-to-phoneme engine together with a British English phoneme-to-speech engine.
For common accents like American and British English, Standard Mandarin, Metropolitan French, etc., there will be several choices, including open-source ones that you will be able to modify (as below). For example, look at FreeTTS and eSpeak. For less common accents, existing engines unfortunately may not exist.
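For instance, a minimal FreeTTS invocation looks like this (the bundled kevin16 voice is General American, so a British accent would require installing a different voice):

```java
import com.sun.speech.freetts.Voice;
import com.sun.speech.freetts.VoiceManager;

public class FreeTtsDemo {
    public static void main(String[] args) {
        // "kevin16" is the 16 kHz General American voice that ships with FreeTTS.
        Voice voice = VoiceManager.getInstance().getVoice("kevin16");
        if (voice == null) {
            System.err.println("Voice not found; check that the FreeTTS voice jars are on the classpath.");
            return;
        }
        voice.allocate();            // load the voice data
        voice.speak("Hello world");  // text-to-phonemes and phoneme-to-speech in one call
        voice.deallocate();
    }
}
```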
Speaking text with a foreign accent
English-with-a-foreign-accent is socially not very prestigious, so complete systems probably don't exist.
One strategy would be to combine an off-the-shelf text-to-phoneme engine for a native accent with a phoneme-to-speech engine for the foreign language. For example, a native Russian speaker who learned English in the U.S. would plausibly use American pronunciations of words like laboratory, and map the phonemes onto his native Russian phonemes, pronouncing them as in Russian. (I believe there is a website that does this for English and Japanese, but I don't have the link.)
The problem is that the result is too extreme. A real English learner would attempt to recognize and generate phonemes that do not exist in his native language, and would also alter his realization of his native phonemes to approximate the native pronunciation. How closely the result matches a native speaker of course varies, but using the pure foreign extreme sounds ridiculous (and mostly incomprehensible).
So to generate plausible American-English-with-a-Russian-accent (for instance), you'd have to write a text-to-phoneme engine. You could use existing American English and Russian text-to-phoneme engines as a starting point. For the phoneme-to-speech side, you would ideally record an actual Russian-accented speaker of English; if you're not willing to find and record such a speaker, you could probably still get a decent approximation using DSP to combine the samples from those two engines. Since eSpeak uses formant synthesis rather than recorded samples, it might be easier to combine information from multiple languages there.
Another thing to consider is that foreign speakers often modify the sequence of phonemes under influence by the phonotactics of their native language, typically by simplifying consonant clusters, inserting epenthetic vowels, or diphthongizing or breaking vowel sequences.
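As a deliberately crude sketch of the cluster-simplification point (the rule below inserts a schwa between any two adjacent consonants; real phonotactic constraints are language-specific and far more selective):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Toy phonotactic repair: break up consonant clusters with an epenthetic vowel. */
public class EpenthesisDemo {
    static final Set<String> VOWELS = Set.of("a", "e", "i", "o", "u", "ə");

    static List<String> repair(List<String> phonemes) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < phonemes.size(); i++) {
            out.add(phonemes.get(i));
            boolean thisCons = !VOWELS.contains(phonemes.get(i));
            boolean nextCons = i + 1 < phonemes.size() && !VOWELS.contains(phonemes.get(i + 1));
            if (thisCons && nextCons) out.add("ə");  // epenthetic vowel between consonants
        }
        return out;
    }

    public static void main(String[] args) {
        // "street": s t r i t -> s ə t ə r i t (exaggerated)
        System.out.println(repair(List.of("s", "t", "r", "i", "t")));
    }
}
```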
There is some literature on this topic.
We're currently developing a full-text-search-enabled app, and Lucene.NET is our weapon of choice. The app is expected to be used by people from different countries, so Lucene.NET has to be able to search across Russian, English, and other texts equally well.
Are there any universal, culture-independent stemmers and analyzers to suit our needs? I understand that eventually we'd have to use culture-specific ones, but for now we want to get up and running with a potentially quick-and-dirty approach.
Given that the spelling, grammar and character sets of English and Russian are significantly different, any stemmer which tried to do both would either be massively large or poorly performant (most likely both).
It would probably be much better to use a stemmer for each language, and pick which one to use based on either UI clues (what language is being used to query) or by explicit selection.
Having said that, it's unlikely that any Russian text will match an English search term correctly or vice-versa.
This sounds like a case where a little more business analysis would help more than code.
There is no such thing as a language-independent stemmer. In fact, whether stemming improves retrieval performance varies per language. The best you can do is language guessing on the documents and queries, then dispatch to the appropriate analyzer/stemmer.
Language guessing on short queries is hard, though (as in state-of-the-art hard, not quick 'n' dirty hard). If your queries are short, you might want to use a simple whitespace analyzer on the queries and not stem anything.
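If you do go the dispatch route, the sketch below shows the idea in Java Lucene, whose analyzer classes closely mirror Lucene.NET's (constructor signatures vary across versions; older releases take an extra Version argument):

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.core.WhitespaceAnalyzer;
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.analysis.ru.RussianAnalyzer;

public class AnalyzerDispatch {
    /**
     * Dispatch to a language-specific analyzer when the language is known
     * (from UI clues, explicit user choice, or a language guesser), and fall
     * back to plain whitespace tokenization with no stemming otherwise.
     */
    static Analyzer analyzerFor(String languageCode) {
        switch (languageCode) {
            case "en": return new EnglishAnalyzer();    // English stemming
            case "ru": return new RussianAnalyzer();    // Russian (Snowball) stemming
            default:   return new WhitespaceAnalyzer(); // no stemming: safe fallback
        }
    }
}
```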