I am trying to develop a Sinhala (My native language) to English translator. Still I am thinking for an approach.
If I however parse a sentence of my language, then can I use that for generating english sentence with the help of stanford parser or any other parser. Or is there any other method you can recommend.
And I am thinking of a bottom up parser for my language, but still have no idea how to implement. Any suggestions for steps I can follow.
Thanks Mathee
This course on Coursera may help you implement a translator. From what I know based on that course, you can use a training set tagged by parts of speech (i.e. noun, verb, etc.) and use that training test to parse other sentence. I suggest looking into hidden Markov models.
My Pyramids parser is an unconventional single-sentence parser for English. (It is capable of parsing other languages too, but a grammar must be specified.) The parser can not only parse English into parse trees, but can convert back and forth between parse trees and word-level semantic graphs, which are graphs that describe the semantic relationships between all the words in a sentence. The correct word order is reconstructed based on the contents of the graph; all that needs to be provided, other than the words and their relationships, is the type of sentence (statement, question, command) and the linguistic category of each word (noun, determiner, verb, etc.). From there it is straight-forward to join the tokens of the parse tree into a sentence.
The parser is a (very early) alpha pre-release, but it is functional and actively maintained. I am currently using it to translate back and forth between English and an internal semantic representation used by a conversational agent (a "chat bot", but capable of more in-depth language understanding). If you decide to use the parser, do let me know. I will be happy to provide any assistance you might need with installing, using, or improving it.
Related
I'm not sure this is the right fit for stackoverflow but maybe you guys would suggest where to put this question otherwise but here it is anyway. Suppose I have a few sentences of a text like this:
John reads newspapers everyday. Right now he has just finished reading
one. He will read another one and might even read a small book
tomorrow.
This small extract contains the following grammar units:
present simple (reads)
present perfect (has finished)
future simple (will read)
modal verb may
Do you know of any software, algorithm or study that defines rules for identifying these grammar patterns?
Read this also if you are going to use Ruby than you can use TreeTop or find the equivalent parser in other programming language.
NTLK is a natural language parser for python, it works by tagging words. You can look at some examples here. It creates a parse-tree, which are very useful for these types of problems.
I haven't seen it distinguish between simple and perfect, but it could be modified to do so.
I am trying to develop a Sinhala (My native language) to English translator.
Still I am thinking for an approach.
If I however parse a sentence of my language, then can use that for generating english sentence with the help of stanford parser. Or is there any other method you can recommend.
And I am thinking of a bottom up parser for my language, but still have no idea how to implement. Any suggestions for steps I can follow.
Thanks
Mathee
If you have enough amounts of bilingual corpora then the best thing you can do is to train a model using an existing statistical machine translation system like Moses. Modern phrase-based SMT systems resemble a reverse parsing mechanism in the sense that they find the most probable combination of target-language phrases for translating a specific source-language sentence. More information in this Wikipedia article.
I've worked with the Xerox toolchain so far, which is powerful, not opensource, and a bit overkill for my current problem. Are there libraries that allow my to implement a phrase structure grammar? Preferably in ruby or lisp.
AFAIK, there's no open-source Lisp phrase structure parser available.
But since a parser is actually a black box, it's not so hard to make your application work with a parser written in any language, especially as they produce S-expressions as output. For example, with something like pfp you can just pipe your sentences as strings to it, then read and process the resulting trees. Or you can wrap a socket server around it and you'll get a distributed system :)
There's also cl-langutils, that may be helpful in some basic NLP tasks, like tokenization and, maybe, POS tagging. But overall, it's much less mature and feature rich, than the commonly used packages, like Stanford's or OpenNLP.
The first part of this question is now its own, here: Analyzing Text for Accents
Question: How could accents be added to generated speech?
What I've come up with:
I do not mean just accent marks, or inflection, or anything singular like that. I mean something like a full British accent, or a Scottish accent, or Russian, etc.
I would think that this could be done outside of the language as well. Ex: something in Russian could be generated with a British accent, or something in Mandarin could have a Russian accent.
I think the basic process would be this:
Analyze the text
Compare with a database (or something like that) to determine what needs an accent, how strong it should be, etc.
Generate the speech in specified language
Easy with normal text-to-speech processors.
Determine the specified accent based on the analyzed text.
This is the part in question.
I think an array of amplitudes and filters would work best for the next step.
Mesh speech and accent.
This would be the easy part.
It could probably be done by multiplying the speech by the accent, like many other DSP methods do.
This is really more of a general DSP question, but I'd like to come up with a programatic algorithm to do this instead of a general idea.
This question isn't really "programming" per se: It's linguistics. The programming is comparatively easy. For the analysis, that's going to be really difficult, and in truth you're probably better off getting the user to specify the accent; Or are you going for an automated story reader?
However, a basic accent is doable with modern text-to speech. Are you aware of the international phonetic alphabet? http://en.wikipedia.org/wiki/International_Phonetic_Alphabet
It basically lists all the sounds a human voice can possibly make. An accent is then just a mapping (A function) from the alphabet to itself. For instance, to make an American accent sound British to an American person (Though not sufficient to make it sound British to a British person), you can de-rhotacise all the "r" sounds in the middle of a word. So for instance the alveolar trill would be replaced with the voiced uvular fricative. (Lots of corner cases to work out just for this).
Long and short: It's not easy, which is probably why no-one has done it. I'm sure a couple of linguistics professors out their would say its impossible. But that's what linguistics professors do. But you'll basically need to read several thick textbooks on accents and pronunciation to make any headway with this problem. Good luck!
What is an accent?
An accent is not a sound filter; it's a pattern of acoustic realization of text in a language. You can't take a recording of American English, run it through "array of amplitudes and filters", and have British English pop out. What DSP is useful for is in implementing prosody, not accent.
Basically (and simplest to model), an accent consists of rules for phonetic realization of a sequence of phonemes. Perception of accent is further influenced by prosody and by which phonemes a speaker chooses when reading text.
Speech generation
The process of speech generation has two basic steps:
Text-to-phonemes: Convert written text to a sequence of phonemes (plus suprasegmentals like stress, and prosodic information like utterance boundaries). This is somewhat accent-dependent (e.g. the output for "laboratory" differs between American and British speakers).
Phoneme-to-speech: given the sequence of phonemes, generate audio according to the dialect's rules for phonetic realizations of phonemes. (Typically you then combine diphones and then adjust acoustically the prosody). This is highly accent-dependent, and it is this step that imparts the main quality of the accent. A particular phoneme, even if shared between two accents, may have strikingly different acoustic realizations.
Normally these are paired. While you could have a British-accented speech generator that uses American pronunciations, that would sound odd.
Generating speech with a given accent
Writing a text-to-speech program is an enormous amount of work (in particular, to implement one common scheme, you have to record a native speaker speaking each possible diphone in the language), so you'd be better off using an existing one.
In short, if you want a British accent, use a British English text-to-phoneme engine together with a British English phoneme-to-speech engine.
For common accents like American and British English, Standard Mandarin, Metropolitan French, etc., there will be several choices, including open-source ones that you will be able to modify (as below). For example, look at FreeTTS and eSpeak. For less common accents, existing engines unfortunately may not exist.
Speaking text with a foreign accent
English-with-a-foreign-accent is socially not very prestigious, so complete systems probably don't exist.
One strategy would be to combine an off-the-shelf text-to-phoneme engine for a native accent with a phoneme-to-speech engine for the foreign language. For example, a native Russian speaker that learned English in the U.S. would plausibly use American pronunciations of words like laboratory, and map its phonemes onto his native Russian phonemes, pronouncing them as in Russian. (I believe there is a website that does this for English and Japanese, but I don't have the link.)
The problem is that the result is too extreme. A real English learner would attempt to recognize and generate phonemes that do not exist in his native language, and would also alter his realization of his native phonemes to approximate the native pronunciation. How closely the result matches a native speaker of course varies, but using the pure foreign extreme sounds ridiculous (and mostly incomprehensible).
So to generate plausible American-English-with-a-Russian-accent (for instance), you'd have to write a text-to-phoneme engine. You could use existing American English and Russian text-to-phoneme engines as a starting point. If you're not willing to find and record such a speaker, you could probably still get a decent approximation using DSP to combine the samples from those two engines. For eSpeak, it uses formant synthesis rather than recorded samples, so it might be easier to combine information from multiple languages.
Another thing to consider is that foreign speakers often modify the sequence of phonemes under influence by the phonotactics of their native language, typically by simplifying consonant clusters, inserting epenthetic vowels, or diphthongizing or breaking vowel sequences.
There is some literature on this topic.
Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 2 years ago.
Improve this question
I need to calculate word frequencies of a given set of adjectives in a large set of customer support reviews. However I don't want to include those that are negated.
For example suppose my list of adjectives was: [helpful, knowledgeable, friendly]. I want to make sure "friendly" isn't counted in a sentence such as "The representative was not very friendly."
Do I need to do a full NLP parse of the text or is there an easier approach? I don't need super high accuracy.
I'm not at all familiar with NLP. I'm hoping for something that doesn't have such a steep learning curve and isn't so processor intensive.
Thanks
If all you want is adjective frequencies, then the problem is relatively simple, as opposed to some brutal, not-so-good machine learning solution.
Wat do?
Do POS tagging on your text. This annotates your text with part of speech tags, so you'll have 95% accuracy or more on that. You can tag your text using the Stanford Parser online to get a feel for it. The parser actually also gives you the grammatical structure, but you only care about the tagging.
You also want to make sure the sentences are broken up properly. For this you need a sentence breaker. That's included with software like the Stanford parser.
Then just break up the sentences, tag them, and count all things with the tag ADJ or whatever tag they use. If the tags don't make sense, look up the Penn Treebank tagset (Treebanks are used to train NLP tools, and the Penn Treebank tags are the common ones).
How?
Java or Python is the language of NLP tools. Python, use NLTK. It's easy, well documented and well understood.
For Java, you have GATE, LingPipe and the Stanford Parser among others. It's a complete pain in the ass to use the Stanford Parser, fortunately I've suffered so you do not have to if you choose to go that route. See my google page for some code (at the bottom of the page) examples with the Stanford Parser.
Das all?
Nah, you might want to stem the adjectives too- that's where you get the root form of a word:
cars -> car
I can't actually think of a situation where this is necessary with adjectives, but it might happen. When you look at your output it'll be apparent if you need to do this. A POS tagger/parser/etc will get you your stemmed words (also called lemmas).
More NLP Explanations
See this question.
It depends on the source of your data. If the sentences come from some kind of generator, you can probably split them automatically. Otherwise you will need NLP, yes.
Properly parsing natural language pretty much is an open issue. It works "largely" for English, in particular since English sentences tend to stick to the SVO order. German for example is quite nasty here, as different word orders convey different emphasis (and thus can convey different meanings, in particular when irony is used). Additionally, German tends to use subordinate clauses much more.
NLP clearly is the way to go. At least some basic parser will be needed. It really depends on your task, too: do you need to make sure every one is correct, or is a probabilistic approach good enough? Can "difficult" cases be discarded or fed to a human for review? etc.