How to get a verb connecting two nouns in an Apache OpenNLP parse tree? - opennlp

What would be a good strategy to find out a verb connecting two nouns in a parse tree, assuming it exists? For example, in this sentence:
The man called his wife before coming back home.
Given the inputs "man" and "wife", I'd like to get the verb "called".
OpenNLP gives me a parse tree:
(TOP (S (NP (DT The) (NN man)) (VP (VBD called) (NP (PRP$ his) (NN wife)) (PP (IN before) (S (VP (VBG coming) (ADVP (RB back)))))) (. home.)))
So I guess this is at least partly a tree navigation question. Maybe
first isolate all verbs, then test by recursing down, until both nouns are eventually found?
Or try to find the shortest path from one noun to the other and save the verb on the way?
My problem is that I don't know enough about parse tree structure to devise a good strategy. Or should I perhaps use some other (Java) tool?
Thanks!

The task you want to accomplish is really complex. The biggest problem I can see here is the need to get "a verb connecting two nouns". This is really generic, and as you may have already seen, parse trees can assume very different (and even "wrong") structures.
If you want a more generic approach to the problem, I would suggest looking into relation extraction. This technique aims at extracting binary (or n-ary) relations from a sentence.
Example tools I would suggest are:
RelEx
MaltParser
ReVerb
With the latter, for example, you can extract relations of the form subject-action-object from sentences. With regard to your question, this will work if the two nouns are respectively the subject and the object of the phrase.
If instead you really want to get the verb connecting two nouns, I think navigating the tree is the most direct solution, but as I pointed out before, it is really difficult to implement given the "imperfection" and non-standard structure of natural language phrases.
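To make the navigation idea concrete, here is a minimal sketch against OpenNLP's opennlp.tools.parser.Parse API: recurse to the deepest node whose covered text contains both nouns (roughly their lowest common ancestor), then return the first VB*-tagged token below it. The substring containment test is a deliberate simplification (it would also match "man" inside "mankind"); a real implementation would compare token nodes instead.

    import opennlp.tools.parser.Parse;

    public class VerbBetweenNouns {

        // Recurse first, so the deepest node covering both nouns wins
        // (roughly their lowest common ancestor); then search its
        // subtree for the first verb token.
        static String findConnectingVerb(Parse node, String noun1, String noun2) {
            for (Parse child : node.getChildren()) {
                String result = findConnectingVerb(child, noun1, noun2);
                if (result != null) {
                    return result;
                }
            }
            String covered = node.getCoveredText();
            if (covered.contains(noun1) && covered.contains(noun2)) {
                return firstVerb(node);
            }
            return null;
        }

        // Depth-first search for the first node whose Penn Treebank tag
        // starts with VB (VB, VBD, VBZ, VBG, ...).
        static String firstVerb(Parse node) {
            if (node.getType().startsWith("VB")) {
                return node.getCoveredText();
            }
            for (Parse child : node.getChildren()) {
                String verb = firstVerb(child);
                if (verb != null) {
                    return verb;
                }
            }
            return null;
        }

        public static void main(String[] args) {
            Parse tree = Parse.parseParse(
                "(TOP (S (NP (DT The) (NN man)) (VP (VBD called) "
              + "(NP (PRP$ his) (NN wife)) (PP (IN before) (S (VP (VBG coming) "
              + "(ADVP (RB back)))))) (. home.)))");
            System.out.println(findConnectingVerb(tree, "man", "wife")); // prints: called
        }
    }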

Related

Identify a list of items using Natural Language Processing

Is there a way for NLP parsers to identify a list?
For example, "a tiger, a lion and a gorilla" should be identified as a list
(I don't need it to be identified as a list of animals; just a list would be sufficient).
My ultimate aim is to link a common verb/word to all the items in the list.
For example, consider the sentence "He found a pen, a book and a flashlight". Here, the verb "found" should be linked to all three items.
Another example, "He was tested negative for cancer, anemia and diabetes". Here, the word "negative" should be linked to the three diseases.
Is this possible with any of the open-source NLP packages like OpenNLP or Stanford CoreNLP? Any other solution?
EDIT:
Like mentioned in one of the answers, my initial idea was to manually parse the list and find the items by looking at the placement of commas, etc.
But then I discovered Stanford NLP's OpenIE model. This seems to be doing a pretty good job. For example, "He has a pen and a book" gives the 2 relations (He;has;a pen) and (He;has;a book).
The problem with the model is that it doesn't work for incomplete sentences like, "has a pen and a book". (From what I understood, this is because OpenIE can only extract triples)
It also fails when negations are involved, e.g. "He has no pens".
Is there a solution to these problems? What are the best solutions available currently for information extraction?
I'm afraid the full answer could fill the better part of a PhD thesis :)
There are no generic tools to do what you need. You will need to write it yourself. If you look at this example, you can see that you can extract the list by starting from the token "and" or the comma and then traversing the graph around it to build the list. In this particular case you can look at the conj and appos relations that link the smaller noun phrases.
You could also look at POS tag patterns like (N*, ",", N*, CC, N*) - this is a hack, but it's probably your best approach if you want fast results and are willing to give up some recall (see the sketch just below).
As for your requirement to include modifiers such as negation -- this is a separate task that should come after you've identified the list.
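Here is a minimal sketch of the tag-pattern hack: encode each token as word/TAG, then run a regular expression implementing roughly (DT? N+)(, DT? N+)* CC (DT? N+). The tags follow the Penn Treebank convention; in practice the words and tags would come from a POS tagger such as OpenNLP's POSTaggerME, and the hard-coded example input is mine.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class ListChunker {

        // One noun-phrase "item": optional determiner, then one or more nouns.
        private static final String ITEM = "(?:\\S+/DT )?(?:\\S+/NNS? )+";

        // item (, item)* CC item: a crude approximation of the
        // (N*, ",", N*, CC, N*) pattern described above.
        private static final Pattern LIST = Pattern.compile(
            "(" + ITEM + ")((?:,/, " + ITEM + ")*)\\S+/CC (" + ITEM + ")");

        static List<String> extractListItems(String[] words, String[] tags) {
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < words.length; i++) {
                sb.append(words[i]).append('/').append(tags[i]).append(' ');
            }
            List<String> items = new ArrayList<>();
            Matcher m = LIST.matcher(sb);
            if (m.find()) {
                items.add(stripTags(m.group(1)));
                for (String mid : m.group(2).split(",/, ")) {
                    if (!mid.isEmpty()) {
                        items.add(stripTags(mid));
                    }
                }
                items.add(stripTags(m.group(3)));
            }
            return items;
        }

        // Turn "a/DT pen/NN " back into "a pen".
        private static String stripTags(String chunk) {
            return chunk.replaceAll("/\\S+", "").trim();
        }

        public static void main(String[] args) {
            String[] words = {"He", "found", "a", "pen", ",", "a", "book",
                              "and", "a", "flashlight"};
            String[] tags  = {"PRP", "VBD", "DT", "NN", ",", "DT", "NN",
                              "CC", "DT", "NN"};
            System.out.println(extractListItems(words, tags));
            // prints: [a pen, a book, a flashlight]
        }
    }

Once the items are recovered, linking the verb to each of them is just a matter of pairing the governing verb with every extracted item.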
What you are trying to do is called Information Extraction.
In your case, the task is to extract basic propositions about a set of entities (given as an enumeration) instead of just one entity (which is the usual scenario). For example, you want to extract the following three propositions from the sentence "He found a pen, a book and a flashlight":
find(X, pen)
find(X, book)
find(X, flashlight)
X stands for the entity referred to as He. As Mr. Savkov already pointed out, information extraction is quite a hard problem whose solution lies beyond a Stack Overflow answer.
There are many approaches to information extraction. As suggested by Mr. Savkov, a solution based on POS tags might be a good starting point. I suggest taking a look at this nice tutorial based on NLTK (especially section 2.2. "Tag Patterns") and this paper.

How was the concept of tree (abstract data type) first proposed?

I'm learning about the tree (abstract data type), and I don't just want to know the definition and operations, which can be found without any difficulty.
I want more, especially who proposed the tree and in what situation they worked out the tree structure. Besides using the tree, I'm interested in why some people can design such a useful data structure.
By the way, when I learned about the stack (another data structure), I found a short history on Wikipedia that tells how the stack was proposed. But when it comes to the tree, there is no such history.
So, could anybody tell me something about the history of the tree? Thanks!

Data Structure used for efficient text matching

I am a regular user of the Eclipse IDE. I found that it is really fast in finding the occurrences of a given variable name in a very large section of code. I would be interested in how to go about building such a mechanism: which data structures and algorithms are the fastest for doing so? Of course, Eclipse is just an example. Thank you in advance.
As usual, the answer is, "it depends".
There's the case of efficiently searching for a chunk of matching text in a longer string.
The most common algorithm for this is the Boyer-Moore string search algorithm, and I believe that most implementations use a simple array of characters.
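For illustration, here is a hand-rolled sketch of the Horspool simplification of Boyer-Moore, which keeps only the bad-character shift table. The table is indexed by the low byte of each char, which is fine for ASCII source text; this is a teaching sketch, not a drop-in for a real library routine.

    public class HorspoolSearch {

        static int indexOf(String text, String pattern) {
            int n = text.length(), m = pattern.length();
            if (m == 0) return 0;

            // Bad-character shift table: distance from the last occurrence
            // of each character (excluding the final position) to the end
            // of the pattern; characters not in the pattern shift by m.
            int[] shift = new int[256];
            java.util.Arrays.fill(shift, m);
            for (int i = 0; i < m - 1; i++) {
                shift[pattern.charAt(i) & 0xFF] = m - 1 - i;
            }

            int pos = 0;
            while (pos <= n - m) {
                // Compare right-to-left at the current alignment.
                int j = m - 1;
                while (j >= 0 && text.charAt(pos + j) == pattern.charAt(j)) {
                    j--;
                }
                if (j < 0) return pos; // full match
                // Shift by the text character under the pattern's last slot.
                pos += shift[text.charAt(pos + m - 1) & 0xFF];
            }
            return -1;
        }

        public static void main(String[] args) {
            System.out.println(indexOf("int count = counter + 1;", "counter")); // prints: 12
        }
    }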
However, in the case of finding a variable name in the Eclipse editor, that is probably not what's happening. More likely, Eclipse is creating an Abstract Syntax Tree (AST) from your source code, and searching the tree. See for instance, http://www.eclipse.org/articles/Article-JavaCodeManipulation_AST/
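To make the AST idea concrete, here is a minimal sketch using the Eclipse JDT parser (org.eclipse.jdt.core.dom, which must be on the classpath). Note that real Eclipse additionally resolves bindings so that same-named variables in different scopes are kept apart; this sketch only matches identifiers by name.

    import org.eclipse.jdt.core.dom.AST;
    import org.eclipse.jdt.core.dom.ASTParser;
    import org.eclipse.jdt.core.dom.ASTVisitor;
    import org.eclipse.jdt.core.dom.CompilationUnit;
    import org.eclipse.jdt.core.dom.SimpleName;

    public class FindOccurrences {
        public static void main(String[] args) {
            String source =
                "class A { int total; int f(int x) { return total + x; } }";

            // Build an AST for the source text.
            ASTParser parser = ASTParser.newParser(AST.JLS8);
            parser.setKind(ASTParser.K_COMPILATION_UNIT);
            parser.setSource(source.toCharArray());
            CompilationUnit unit = (CompilationUnit) parser.createAST(null);

            // Visit every identifier node and report those named "total".
            unit.accept(new ASTVisitor() {
                @Override
                public boolean visit(SimpleName node) {
                    if (node.getIdentifier().equals("total")) {
                        System.out.println("'total' at offset " + node.getStartPosition());
                    }
                    return true;
                }
            });
        }
    }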

Find basic words and estimate their difficulty

I'm looking for a possibly simple solution of the following problem:
Given input of a sentence like
"Absence makes the heart grow fonder."
Produce a list of basic words followed by their difficulty/complexity
[["absence", 0.5], ["make", 0.05], ["the", 0.01"], ["grow", 0.1"], ["fond", 0.5]]
Let's assume that:
all the words in the sentence are valid English words
popularity is an acceptable measure of difficulty/complexity
base word can be understood in any constructive way (see below)
difficulty/complexity is on scale from 0 - piece of cake to 1 - mind-boggling
difficulty bias is OK; better to mistakenly call an easy word tough than the other way around
working simple solution is preferred to flawless but complicated stuff
[edit] there is no interaction with user
[edit] we can handle any proper English input
[edit] a word is not more difficult than its base form (because, as smart beings, we can create unhappily if we know happy), unless it creates a new word (unlikely is not the same difficulty as like)
General ideas:
I considered using Google searches or sites like Wordcount to estimate a word's popularity, which could indicate its difficulty. However, both solutions give different results depending on the form of the entered word. Google gives 316m results for fond but 11m for fonder, whereas Wordcount gives them ranks of 6k and 54k.
Transforming words to their basic forms is not a must but solves the ambiguity problem (and makes it easy to create dictionary links); however, it's not a simple task and its value could be considered arguable. Obviously fond should be taken instead of fonder; however, investigating believe instead of unbelievable seems to be overkill ([edit] it might not be the best example, but there is a moment when modifying a basic word creates a new one, like -> likely), and words like doorkeeper shouldn't be cut in two.
Some ideas of what should be considered a basic word can be found here on Wikipedia, but maybe a simpler way of determining it would be to use a dictionary. For instance, according to dictionary.reference.com, unbelievable is a basic word, whereas fonder comes from fond; but then grow is not the same as growing.
Idea of a solution:
It seems to me that the best way to handle the problem would be to use a dictionary to find basic words, apply some of the Wikipedia rules, and then use Wordcount (maybe combined with the number of Google hits) to estimate difficulty.
Still, there might be (and probably is) a simpler and better way, or ready-to-use algorithms. I would appreciate any solution that deals with this problem and is easy to put into practice. Maybe I'm just trying to reinvent the wheel (or maybe you know my approach would work just fine and I'm wasting my time deliberating instead of coding what I have). I would, however, prefer to avoid implementing frequency-analysis algorithms or preparing a corpus of texts.
Some terminology:
The core part of the word is called a stem or a root. More on this distinction later. You can think of the root/stem as the part that carries the main meaning of the word and will appear in the dictionary.
(In English) most words are composed of one root/stem (exception: compounds like "windshield") and zero or more affixes: affixes that come after the root/stem are called suffixes, and affixes that precede the root/stem are called prefixes. Examples: "driver" = "drive" (root/stem) + suffix "-er"; "unkind" = "kind" (root/stem) + "un-" (prefix).
Suffixes/prefixes (= affixes) can be inflectional or derivational. For example, in English, third-person singular verbs have an s on the end: "I drive" but "He drive-s". Agreement suffixes of this kind don't change the category of the word: "drive" is a verb regardless of the inflectional "s". On the other hand, a suffix like "-er" is derivational: it takes a verb (e.g. "drive") and turns it into a noun (e.g. "driver").
The stem is the piece of the word without any inflectional affixes, whereas the root is the piece of the word without any derivational affixes. For instance, the plural noun "drivers" is decomposable into "drive" (root) + "-er" (derivational affix, makes a new stem "driver") + "-s" (plural).
The process of deriving the "base" form of the word is called "stemming".
So, armed with this terminology, it seems that for your task the most useful thing to do would be to stem each form you come across, i.e. remove all the inflectional affixes and keep the derivational ones, since derivational affixes can change how common the word is considered to be. Think about it this way: if I tell you a new word in English, you will always know how to make it plural or third-person singular, but you may not know some of the other words you can derive from it. English being an inflection-poor language, there aren't a lot of inflectional suffixes to worry about (and Google search is pretty good about stripping them off, so maybe you can use Google's stemming engine just by running your word forms through a Google search and reading the highlighted results):
Third singular verbal -s: "I drive"/"He drive-s"
Nominal plural -s: "One wug"/"Two wug-s". Note that there are some irregular forms here, such as "children", "oxen", "geese", etc. I think I wouldn't worry about these.
Verbal past-tense and participial forms. The regular ones are easy: they take -ed for both the past tense and the past participle ("I walk"/"I walk-ed"/"I had walk-ed"), but there are quite a few irregular ones (fall/fell/fallen, dive/dove/dived?, etc.). Maybe make a list of these?
Verbal -ing forms: "walk"/"walk-ing"
Adjectival comparative -er and superlative -est. There are a few irregular/suppletive ones ("good"/"better"/"best"), but these should not present a huge problem.
These are the main inflectional affixes in English; I may be forgetting a few that you could discover by picking up an introductory linguistics book. There are also going to be borderline cases, such as "un-", which is so promiscuous that we might consider it inflectional. For more information on these types, see Level 1 vs. Level 2 affixation, but I would treat these cases as derivational for your purposes and not stem them.
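As a toy illustration of stripping only the inflectional suffixes listed above, here is a crude rule-based sketch. It is my own simplification, not a real stemmer: it cannot tell inflectional -er ("fonder") from derivational -er ("driver"), it omits e-deletion and doubling rules ("hoped" comes out as "hop"), and irregular forms would need a lookup table.

    public class InflectionStripper {

        // Strips only the inflectional suffixes listed above: plural/3sg -s,
        // past -ed, progressive -ing, comparative -er, superlative -est.
        // The length guards keep very short words (e.g. "sing") intact.
        static String strip(String word) {
            String w = word.toLowerCase();
            if (w.endsWith("ies") && w.length() > 4) return w.substring(0, w.length() - 3) + "y"; // cities -> city
            if (w.endsWith("ing") && w.length() > 5) return w.substring(0, w.length() - 3);        // walking -> walk
            if (w.endsWith("est") && w.length() > 5) return w.substring(0, w.length() - 3);        // fondest -> fond
            if (w.endsWith("ed")  && w.length() > 4) return w.substring(0, w.length() - 2);        // walked -> walk
            if (w.endsWith("er")  && w.length() > 4) return w.substring(0, w.length() - 2);        // fonder -> fond
            if (w.endsWith("s") && !w.endsWith("ss")) return w.substring(0, w.length() - 1);       // drives -> drive
            return w;
        }

        public static void main(String[] args) {
            for (String w : new String[] {"fonder", "walking", "drives", "cities"}) {
                System.out.println(w + " -> " + strip(w));
            }
        }
    }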
As far as "grading" how common various stems are, besides Google you could use various freely available text corpora. The Wikipedia article linked above has a few links to free corpora, and you can find a bunch more by googling. From these corpora you can build a frequency count of each stem and use that to judge how common the form is.
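A minimal sketch of that last step might look like the following. The log scaling from raw counts to a 0-1 difficulty score is my own assumption: word frequencies are roughly Zipfian, so a linear scale would bunch almost everything near 1.

    import java.util.HashMap;
    import java.util.Map;

    public class DifficultyScorer {

        private final Map<String, Long> counts = new HashMap<>();
        private long maxCount = 1;

        // Count stems from a whitespace-tokenized, already-stemmed corpus line.
        void addCorpusLine(String line) {
            for (String stem : line.toLowerCase().split("\\s+")) {
                long c = counts.merge(stem, 1L, Long::sum);
                if (c > maxCount) maxCount = c;
            }
        }

        // Map frequency to [0, 1]: 0 for the most common stem, 1 for
        // stems never seen in the corpus.
        double difficulty(String stem) {
            long c = counts.getOrDefault(stem.toLowerCase(), 0L);
            if (c == 0) return 1.0;
            return 1.0 - Math.log(1 + c) / Math.log(1 + maxCount);
        }

        public static void main(String[] args) {
            DifficultyScorer scorer = new DifficultyScorer();
            scorer.addCorpusLine("the the the the heart grow fond the absence the");
            System.out.printf("the: %.2f, fond: %.2f, xylem: %.2f%n",
                scorer.difficulty("the"), scorer.difficulty("fond"),
                scorer.difficulty("xylem")); // the: 0.00, fond: 0.36, xylem: 1.00
        }
    }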
I'm afraid there is no simple solution to the task of finding "basic" forms. I'm basing that on my memory of my machine learning textbook, of which language analysis was a part. You need some database from which you can get them.
At the same time, please note that the number of words people use in everyday language is not that big. You can always ask a user for the base form of a word you have not seen before (unless this is your homework, which will be automatically checked).
Eventually, if you don't care about covering all words, you can create a simple database containing the different forms of the most common words, and then try to use grammatical rules for the less common ones (which would be a good approximation since, actually, the most common words in English are irregular, whereas the uncommon ones are regular, because the original forms of the latter have been forgotten).
Note, however, that I'm no specialist; I'm simply trying to help :-)

Does an algorithm exist to help detect the "primary topic" of an English sentence?

I'm trying to find out if there is a known algorithm that can detect the "key concept" of a sentence.
The use case is as follows:
User enters a sentence as a query (Does chicken taste like turkey?)
Our system identifies the concepts of the sentence (chicken, turkey)
And it runs a search of our corpus content
The area that we're lacking in is identifying what the core "topic" of the sentence is really about. The sentence "Does chicken taste like turkey" has a primary topic of "chicken", because the user is asking about the taste of chicken, while "turkey" is a helper topic of less importance.
So... I'm trying to find out if there is an algorithm that will help me identify the primary topic of a sentence... Let me know if you are aware of any!!!
I actually did a research project on this and won two competitions and am competing in nationals.
There are two steps to the method:
Parse the sentence with a Context-Free Grammar
In the resulting parse trees, find all nouns which are only subordinate to Noun-Phrase-like constituents
For example, "I ate pie" has 2 nouns: "I" and "pie". Looking at the parse tree, "pie" is inside of a Verb Phrase, so it cannot be a subject. "I", however, is only inside of NP-like constituents. being the only subject candidate, it is the subject. Find an early copy of this program on http://www.candlemind.com. Note that the vocabulary is limited to basic singular words, and there are no verb conjugations, so it has "man" but not "men", has "eat" but not "ate." Also, the CFG I used was hand-made an limited. I will be updating this program shortly.
Anyway, there are limitations to this program. My mentor pointed out in its currents state, it cannot recognize sentences with subjects that are "real" NPs (what grammar actually calls NPs). For example, "that the moon is flat is not a debate any longer." The subject is actually "that the moon is flat." However, the program would recognize "moon" as the subject. I will be fixing this shortly.
Anyway, this is good enough for most sentences...
My research paper can be found there too. Go to page 11 of it to read the methods.
Hope this helps.
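For the curious, here is a minimal sketch of step 2 of the method above, written against OpenNLP's Parse API: collect the nouns whose ancestors are all NP-like, here crudely taken to be NP, S, and TOP. The hand-built tree string and the small NP-like set are my own simplifications.

    import java.util.ArrayList;
    import java.util.List;

    import opennlp.tools.parser.Parse;

    public class SubjectFinder {

        // Keep nouns whose ancestors are all NP-like. A noun inside a VP
        // or PP is rejected, just as "pie" is rejected in "I ate pie".
        static List<String> subjectCandidates(Parse root) {
            List<String> out = new ArrayList<>();
            collect(root, out);
            return out;
        }

        private static void collect(Parse node, List<String> out) {
            if (node.getType().startsWith("NN")) {
                boolean npLikeOnly = true;
                for (Parse p = node.getParent(); p != null; p = p.getParent()) {
                    String t = p.getType();
                    if (!t.equals("NP") && !t.equals("S") && !t.equals("TOP")) {
                        npLikeOnly = false;
                        break;
                    }
                }
                if (npLikeOnly) {
                    out.add(node.getCoveredText());
                }
            }
            for (Parse child : node.getChildren()) {
                collect(child, out);
            }
        }

        public static void main(String[] args) {
            Parse tree = Parse.parseParse(
                "(TOP (S (NP (NN chicken)) (VP (VBZ tastes) "
              + "(PP (IN like) (NP (NN turkey))))))");
            System.out.println(subjectCandidates(tree)); // prints: [chicken]
        }
    }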
Most basic NLP parsing techniques will be able to extract the basic aspects of the sentence - i.e., that chicken and turkey are NPs and that they are linked by "like", etc. Getting from these to a "topic" or "concept" is more difficult.
Techniques such as Latent Semantic Analysis and its many derivatives transform this information into a vector (some have methods of retaining, in part, the hierarchy/relations between parts of speech) and then compare it to existing vectors, usually pre-classified by concept. See http://en.wikipedia.org/wiki/Latent_semantic_analysis to get started.
Edit: Here's an example LSA app you can play around with to see if you might want to pursue it further: http://lsi.research.telcordia.com/lsi/demos.html
For many longer sentences it's difficult to say what exactly the topic is, and there may be more than one.
One way to get an approximate answer is:
1.) First tag the sentence using OpenNLP, the Stanford Parser, or any other tagger.
2.) Then remove all the stop words from the sentence.
3.) Pick out the nouns (proper, singular, and plural).
Another way is:
1.) Chunk the sentence into phrases with any parser.
2.) Pick out all the noun phrases.
3.) Remove the noun phrases that don't have a noun as a child.
4.) Keep only adjectives and nouns; remove all other words from the remaining noun phrases.
This might give an approximate guess; a sketch of the first recipe follows below.
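Here is a minimal sketch of the first recipe using OpenNLP's POSTaggerME. en-pos-maxent.bin is the pre-trained English model distributed with OpenNLP (it must be downloaded separately), and the tiny stopword list is only a placeholder.

    import java.io.FileInputStream;
    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    import opennlp.tools.postag.POSModel;
    import opennlp.tools.postag.POSTaggerME;
    import opennlp.tools.tokenize.SimpleTokenizer;

    public class TopicNouns {

        // A toy stopword list; a real one would be much longer.
        private static final Set<String> STOPWORDS = new HashSet<>(
            Arrays.asList("does", "do", "is", "are", "the", "a", "an", "like"));

        public static void main(String[] args) throws Exception {
            // 1.) Tag the sentence.
            POSModel model = new POSModel(new FileInputStream("en-pos-maxent.bin"));
            POSTaggerME tagger = new POSTaggerME(model);

            String[] tokens =
                SimpleTokenizer.INSTANCE.tokenize("Does chicken taste like turkey?");
            String[] tags = tagger.tag(tokens);

            // 2.) Drop stopwords; 3.) keep nouns (NN, NNS, NNP, NNPS).
            List<String> topics = new ArrayList<>();
            for (int i = 0; i < tokens.length; i++) {
                if (tags[i].startsWith("NN")
                        && !STOPWORDS.contains(tokens[i].toLowerCase())) {
                    topics.add(tokens[i]);
                }
            }
            System.out.println(topics); // e.g. [chicken, turkey]
        }
    }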
"Key concept" is not a well-defined term in linguistics, but this may be a starting point: parse the sentence, find the subject in the parse tree or dependency structure that you get. (This doesn't always work; for example, the subject of "Is it raining?" is "it", while the key concept is likely "rain". Also, what's the key concept in "Are spaghetti and lasagna the same thing?")
This kind of problem (NLP + search) is more properly dealt with by methods such as LSA, but that's quite an advanced topic.
On the most basic level, a question in English is usually in the form of <verb> <subject> ... ? or <pronoun> <verb> <subject> ... ?. This is by no means a good algorithm, especially considering that the subject could span several words, but depending on how sophisticated a solution you need, it might be a useful starting point.
If you need precision, ignore this answer.
If you're willing to shell out money, http://www.connexor.com/ is supposed to be able to do this type of semantic analysis for a wide variety of languages, including English. I have never directly used their product, and so can't comment on how well it works.
There's an article about parsing noun phrases in this month's issue of the MIT Computational Linguistics journal: http://www.mitpressjournals.org/doi/pdf/10.1162/COLI_a_00076
Compound or complex sentences may have more than one key concept.
You can use Stanford NLP or MaltParser, which can give the dependency structure of a sentence, along with part-of-speech tags and grammatical roles such as subject, verb, and object.
I think that most of the time the object will be the key concept of the sentence.
You should look at Google's Cloud Natural Language API. It's their NLP service.
https://cloud.google.com/natural-language/
A simple solution is to tag your sentence with a part-of-speech tagger (e.g. from the NLTK library for Python) and then find matches with some predefined part-of-speech patterns in which it's clear where the main subject of the sentence is.
One option is to look into something like this as a first step:
http://www.abisource.com/projects/link-grammar/
But how you derive the topic from these links is another problem in itself. Since AbiWord uses link grammar to detect grammatical problems, though, you might be able to use it to determine the topic as well.
By "primary topic" you're referring to what is termed the subject of the sentence.
The subject can be identified by understanding a sentence through natural language processing.
The answer to this question is the same as that for How to determine subject, object and other words? - this is a currently unsolved problem.
