Rewriting sentences while retaining semantic meaning - algorithm

Is it possible to use WordNet to rewrite a sentence so that the semantic meaning of the sentence still stays the same (or mostly the same)?
Let's say I have this sentence:
Obama met with Putin last week.
Is it possible to use WordNet to rephrase the sentence into alternatives like:
Obama and Putin met the previous week.
Obama and Putin met each other a week ago.
If changing the sentence structure is not possible, can WordNet be used to replace only the relevant synonyms?
For example:
Obama met Putin the previous week.

If the question is whether it is possible to use WordNet to do sentence paraphrases: yes, but only with substantial grammatical/syntactic components on top of it. You would need a system that:
First gets the individual semantics of the tokens and parses the sentence for its syntax.
Then understands the overall semantics of the composite sentence (especially if it's metaphorical).
Then rehashes the sentence with some grammatical generator.
Up till now I only know of the ACE parser/generator that can do something like that, but it takes a LOT of hacking to make it work as a paraphrase generator. http://sweaglesw.org/linguistics/ace/
So to answer your questions,
Is it possible to use WordNet to rephrase the sentence into alternatives? Sadly, WordNet isn't a silver bullet. You will need more than semantics for a paraphrase task.
If changing the sentence structure is not possible, can WordNet be used to replace only the relevant synonyms? Yes, this is possible. BUT figuring out which synonym is replaceable is hard... and you would also need some morphology/syntax component.
First you will run into a problem of multiple senses per word:
from nltk.corpus import wordnet as wn
sent = "Obama met Putin the previous week"
for i in sent.split():
    possible_senses = wn.synsets(i)
    print i, len(possible_senses), possible_senses
[out]:
Obama 0 []
met 13 [Synset('meet.v.01'), Synset('meet.v.02'), Synset('converge.v.01'), Synset('meet.v.04'), Synset('meet.v.05'), Synset('meet.v.06'), Synset('meet.v.07'), Synset('meet.v.08'), Synset('meet.v.09'), Synset('meet.v.10'), Synset('meet.v.11'), Synset('suffer.v.10'), Synset('touch.v.05')]
Putin 1 [Synset('putin.n.01')]
the 0 []
previous 3 [Synset('previous.s.01'), Synset('former.s.03'), Synset('previous.s.03')]
week 3 [Synset('week.n.01'), Synset('workweek.n.01'), Synset('week.n.03')]
Then even if you know the sense (let's say the first sense), you get multiple words per sense, and not every word can be replaced in the sentence. Moreover, they are in lemma form, not a surface form (e.g. verbs are in their base form and nouns are in the singular):
from nltk.corpus import wordnet as wn
sent = "Obama met Putin the previous week"
for i in sent.split():
    possible_senses = wn.synsets(i)
    if possible_senses:
        print i, possible_senses[0].lemma_names
    else:
        print i
[out]:
Obama
met ['meet', 'run_into', 'encounter', 'run_across', 'come_across', 'see']
Putin ['Putin', 'Vladimir_Putin', 'Vladimir_Vladimirovich_Putin']
the
previous ['previous', 'old']
week ['week', 'hebdomad']
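To see both problems at once, here is a minimal sketch (not from the original answer) that uses NLTK's lesk() heuristic to pick a sense in context and then substitutes one of its lemmas; the choice of Lesk and of the first alternative lemma are assumptions for illustration only, and it assumes a recent NLTK where lemmas() and name() are methods:
from nltk.wsd import lesk

sent = "Obama met Putin the previous week"
tokens = sent.split()

rewritten = []
for token in tokens:
    # Crude word sense disambiguation with the Lesk heuristic.
    sense = lesk(tokens, token)
    if sense is None:
        rewritten.append(token)
        continue
    # Take any lemma of that sense other than the original word.
    alternatives = [l.name().replace('_', ' ') for l in sense.lemmas()
                    if l.name().lower() != token.lower()]
    rewritten.append(alternatives[0] if alternatives else token)

print(' '.join(rewritten))
# The output is still in lemma form ("meet", not "met"), so a morphology
# component is needed on top to generate the correct surface forms.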

One approach is grammatical analysis with NLTK (read more here), and after the analysis you can convert your sentence into active voice or passive voice.

Related

Writing My Prolog Code

I am writing my first Prolog code, and I am having some difficulties with it. I was wondering if anyone could help me out.
I am writing a program that needs to follow the following rules:
For verb phrases, noun phrases come before transitive verbs.
Subjects (nominative noun phrases) are followed by ga.
Direct objects (nominative noun phrases) are followed by o.
It must be able to form these sentences with the given words in the code:
Adamu ga waraimasu (adam laughs)
Iive ga nakimasu (eve cries)
Adamu ga Iivu O mimasi (adam watches Eve)
Iivu ga Adamu O tetsudaimasu (eve helps adam)
Here is my code. It is mostly complete, except I don't know if the rules are correct in the code:
japanese([adamu], [nounphrase], [adam], [entity]).
japanese([iivu], [nounphrase], [eve], [entity]).
japanese([waraimasu], [verb, intransitive], [laughs], [property]).
japanese([nakimasu], [verb, intransitive], [cries], [property]).
japanese([mimasu], [verb, transitive], [watches], [relation]).
japanese([tetsudaimasu], [verb, transitive], [helps], [relation]).

japanese(A, [verbphrase], B, [property]) :-
    japanese(A, [verb, intransitive], B, [property]).

japanese(A, [nounphrase, accusative], B, [entity]) :-
    japanese(C, [nounphrase], B, [entity]),
    append([ga], C, A).

japanese(A, [verbphrase], B, [property]) :-
    japanese(C, [verb, transitive], D, [relation]),
    japanese(E, [nounphrase, accusative], F, [entity]),
    append(C, E, A),
    append(D, F, B).

japanese(A, [sentence], B, [proposition]) :-
    japanese(C, [nounphrase], D, [entity]),
    japanese(E, [verbphrase], F, [property]),
    append(E, C, A),
    append(F, D, B).

Extracting food items from sentences

Given a sentence:
I had peanut butter and jelly sandwich and a cup of coffee for
breakfast
I want to be able to extract the following food items from it:
peanut butter and jelly sandwich
coffee
Till now, using POS tagging, I have been able to extract the individual food items, i.e.
peanut, butter, jelly, sandwich, coffee
But like I said, what I need is peanut butter and jelly sandwich instead of the individual items.
Is there some way of doing this without having a corpus or database of food items in the backend?
You can attempt it with a trained set that contains a corpus of food items, but the approach should work without one too.
Instead of doing simple POS tagging, do dependency parsing combined with POS tagging.
That way you would be able to find relations between multiple tokens of the phrase, and by parsing the dependency tree with restricted conditions like noun-noun dependencies you should be able to find the relevant chunks.
You can use spaCy for dependency parsing. Here is the output from displaCy:
https://demos.explosion.ai/displacy/?text=peanut%20butter%20and%20jelly%20sandwich%20is%20delicious&model=en&cpu=1&cph=1
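Not part of the original answer, but as a rough sketch of that idea (the model name en_core_web_sm is an assumption, and the exact phrase boundaries depend on the parse), you can walk the dependency tree and print the subtree each noun heads:
import spacy

# Any English spaCy model should behave similarly; "en_core_web_sm" is assumed here.
nlp = spacy.load("en_core_web_sm")
doc = nlp("peanut butter and jelly sandwich is delicious")

for token in doc:
    # A noun's dependency subtree (compounds, conjuncts, modifiers) spans the
    # whole phrase it heads, e.g. "peanut butter and jelly sandwich".
    if token.pos_ == "NOUN":
        print(token.text, "->", " ".join(t.text for t in token.subtree))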
You can use the freely available data at https://en.wikipedia.org/wiki/Lists_of_foods, or something better, as a training set to create a base set of food items (the hyperlinks in the crawled tree).
Based on the dependency parsing of your new data, you can keep enriching the base data. For example: if 'butter' exists in your corpus, and 'peanut butter' is a frequently encountered pair of tokens, then 'peanut' and 'peanut butter' also get added to the corpus.
The corpus can be maintained in a file which can be loaded into memory while processing, or in a database like Redis, Aerospike, etc.
Make sure you work with normalized text, i.e. lower-cased, special characters cleaned, words lemmatized/stemmed, both in the corpus and in the data being processed. That would increase your coverage and accuracy.
First extract all Noun phrases using NLTK's Chunking (code copied from here):
import nltk
import re
import pprint
from nltk import Tree
import pdb

patterns = """
    NP: {<JJ>*<NN*>+}
        {<JJ>*<NN*><CC>*<NN*>+}
        {<NP><CC><NP>}
        {<RB><JJ>*<NN*>+}
    """

NPChunker = nltk.RegexpParser(patterns)

def prepare_text(input):
    sentences = nltk.sent_tokenize(input)
    sentences = [nltk.word_tokenize(sent) for sent in sentences]
    sentences = [nltk.pos_tag(sent) for sent in sentences]
    sentences = [NPChunker.parse(sent) for sent in sentences]
    return sentences

def parsed_text_to_NP(sentences):
    nps = []
    for sent in sentences:
        tree = NPChunker.parse(sent)
        print(tree)
        for subtree in tree.subtrees():
            if subtree.label() == 'NP':
                t = subtree
                t = ' '.join(word for word, tag in t.leaves())
                nps.append(t)
    return nps

def sent_parse(input):
    sentences = prepare_text(input)
    nps = parsed_text_to_NP(sentences)
    return nps

if __name__ == '__main__':
    print(sent_parse('I ate peanut butter and beef burger and a cup of coffee for breakfast.'))
This will POS tag your sentences and use a regex parser to extract noun phrases.
1. Define and refine your noun phrase regex
You'll need to change the patterns regex to define and refine your noun phrases.
For example, {<NP><CC><NP>} tells the parser that an NP followed by a coordinator (CC) like "and" and another NP is itself an NP.
2. Change from the NLTK POS tagger to the Stanford POS tagger
Also, I noted that NLTK's POS tagger is not performing very well (e.g. it considers "had peanut" as a verb phrase). You can change the POS tagger to the Stanford Parser if you want.
3. Remove smaller noun phrases:
After you have extracted all the noun phrases for a sentence, you can remove the ones that are part of a bigger noun phrase. For example, in the example below, "beef burger" and "peanut butter" should be removed because they're part of a bigger noun phrase, "peanut butter and beef burger".
4. Remove noun phrases in which none of the words are in a food lexicon
You will get noun phrases like "school bus". If neither "school" nor "bus" is in a food lexicon that you can compile from Wikipedia or WordNet, then you remove the noun phrase. In this case, remove "cup" and "breakfast" because they're hopefully not in your food lexicon. (A small sketch of these two filtering steps follows the sample output below.)
The current code returns
['peanut butter and beef burger', 'peanut butter', 'beef burger', 'cup', 'coffee', 'breakfast']
for input
print(sent_parse('I ate peanut butter and beef burger and a cup of coffee for breakfast.'))
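Not from the original answer, but here is a toy sketch of filtering steps 3 and 4; the food lexicon below is made up for illustration, and the containment test is a plain substring check that ignores word boundaries:
def filter_phrases(noun_phrases, food_lexicon):
    # Step 3: drop noun phrases that are contained in a bigger noun phrase.
    maximal = [np for np in noun_phrases
               if not any(np != other and np in other for other in noun_phrases)]
    # Step 4: keep only phrases where at least one word is in the food lexicon.
    return [np for np in maximal
            if any(word.lower() in food_lexicon for word in np.split())]

nps = ['peanut butter and beef burger', 'peanut butter', 'beef burger',
       'cup', 'coffee', 'breakfast']
food_lexicon = {'peanut', 'butter', 'beef', 'burger', 'coffee'}  # toy lexicon
print(filter_phrases(nps, food_lexicon))
# ['peanut butter and beef burger', 'coffee']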
Too much for a comment, but not really an answer:
I think you would at least get closer if, when you got two foods without a proper separator, you combined them into one food. That would give peanut butter, jelly sandwich, coffee.
If you have correct English you could detect this case by count/non-count nouns. Correcting the original to "I had a peanut butter and jelly sandwich and a cup of coffee for breakfast": butter is non-count, so you can't have "a butter", but you can have "a sandwich". Thus the "a" must apply to sandwich, and despite the "and", "peanut butter" and "jelly sandwich" must be the same item, "peanut butter and jelly sandwich". Your mistaken sentence would parse the other way, though!
I would be very surprised if you could come up with general rules that cover every case, though. I would come at this sort of thing figuring that a few would leak through and need a database to catch them.
You could search for n-grams in your text where you vary the value of n. For example, if n=5 then you would extract "peanut butter and jelly sandwich" and "cup of coffee for breakfast", depending on where you start your search in the text for groups of five words. You won't need a corpus of text or a database to make the algorithm work.
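A minimal sketch of the n-gram idea (the sentence and the values of n are just for illustration):
def ngrams(tokens, n):
    # All contiguous windows of n words.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "I had peanut butter and jelly sandwich and a cup of coffee".split()
for n in (3, 4, 5):
    print(n, ngrams(tokens, n))
# n=5 includes "peanut butter and jelly sandwich" among its windows.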
A rule-based approach with a lexicon of all food items would work here.
You can use GATE for this and write JAPE rules with it.
In the above example your JAPE rule would have a condition to find all (NP CC NP) && NP in "FOOD LEXICON".
I can share detailed JAPE code in the event you plan to go this route.

Nokogiri How can I extract text from HTML with correct spacing?

I'm trying to extract the text of a document to index it for search. The code below mostly works, except that various words and punctuation run together. When it removes tags, I need it to replace them with spaces so I don't get this issue. I have been trying to figure out the most efficient way to do this, but I'm coming up empty so far.
doc = Nokogiri::HTML(html)
doc.xpath("//script").remove
doc.xpath("//style").remove
doc.xpath("//a").remove
text = doc.text.gsub(/\s+/,' ')
Here is some sample text I extracted from http://www.washingtontimes.com/blog/redskins-watch/2012/oct/18/redskins-linemen-respond-jason-pierre-paul-rg3-com/
Before the season it was New York Giants defensive end Osi Umenyiora
who made waves by saying he wouldn't call Robert Griffin III by “RG3”
until he did something. Until then, it was “Bob Griffin.”After
Griffin's 76-yard touchdown run in the Washington Redskins' victory
over the Minnesota Vikings, fellow Giants defensive end Jason
Pierre-Paul was the one who had some comments about Griffin.“Don’t
bring it to my side," Pierre-Paul told New York media. “Go the other
way. …“Yes, it'll be a very good matchup. Not on my side, though. Not
on my side. Or the other side.”Griffin, asked jokingly Wednesday about
running for office, said: “I’ve got a lot other guys to be running
away from right now, Pierre-Paul, Osi, all those guys.”But according
to a couple of Redskins linemen, Griffin shouldn't have much to worry
about Sunday if he gets into the open field.“If Robert gets into that
situation, I don't think there's many people that can run him down,”
right guard Chris Chester said. “I'm still going to go out there and
try to block and make sure no one touches Robert at all. But he's a
plenty good athlete to be able to outrun a lot of people in this
league.”Prompted with Pierre-Paul's comments, left tackle Trent
Williams responded: “What do you want me to say about that?”“Robert's
my guy. I don't know Pierre-Paul. I don't know why he would say
something like that,” he said. “Maybe he knows something I don't.”
You could try inserting a space before each p tag:
doc.search('p').each{|el| el.before ' '}
but a better approach probably is something like:
text = doc.search('div.story p').map{|p| p.text}.join(" ")
Other answers are discussing inserting whitespace into the document, but if (as the question asks) your requirement is to replace those nodes with whitespace, Nokogiri has a replace method. So to replace script tags with spaces do:
doc.xpath('//script').each do |node|
  node.replace(' ')
end
The question also asks about 'correct' spacing. Most browsers will not insert a space when they render around a <script> tag, so while useful for text extraction, this is not necessarily the 'correct' thing to do.

How can I extract the meaning of a paragraph?

I need to develop a method that extracts the meaning from a string for a record in a database. Here is an example of the string:
MyString = "Purse $75,000. (up To $14,250 Nysbfoa) For Maidens, Fillies And Mares Three Years Old And Upward. Three Year Olds, 118 Lbs.; Older, 123 Lbs. One And One Eighth Miles. (Inner turf)"
Given the string, I need to process it in such a way that I can create a race_record:
race_record[:purse] = 75000
race_record[:race_type] = "Maidens"
race_record[:sex] = "Fillies And Mares"
race_record[:age] = "Three Year Old And Upward"
race_record[:distance] = "One And One Eighth Miles"
race_record[:surface] = "inner turf"
I was planning to use Ruby and a series of regular expressions to extract the data. For example:
race_record[:purse] = MyString.scan(/(?<=Purse\s[$])(.*?)(?=\.)/)
race_record[:race_type] = MyString.sub(....)
etc.
My question isn't so much what the correct regular expressions are. Given the objective, is the approach I proposed the right way to go, or is there a better approach or even a gem that can do the heavy lifting?
You could use one regex to extract all the relevant parts into capturing groups at once:
regexp =
/Purse\s\$ # Leading text
([\d,]+) # Group 1
.*?For\s # Intervening text
(\w+) # Group 2
,\s # Intervening text
(\w+\sAnd\s\w+) # Group 3, etc. etc.
\s
([^.]*)
\.[^;]*;[^.]*\.\s
([^.]*)
\.\s\(
([^()]*)
\)/x
Then you can do
irb(main):025:0> match = regexp.match(mystring)
=> #<MatchData "Purse $75,000. (up To $14,250 Nysbfoa) For Maidens, Fillies And Mares Three Years Old And Upward. Three Year Olds, 118 Lbs.; Older, 123 Lbs. One And One Eighth Miles. (Inner turf)"
1:"75,000" 2:"Maidens" 3:"Fillies And Mares" 4:"Three Years Old And Upward"
5:"One And One Eighth Miles" 6:"Inner turf">
irb(main):026:0> match[1]
=> "75,000"
irb(main):027:0> match[2]
=> "Maidens"
...etc.
If your input is fairly structured, i.e. it has a specific and known grammar, you could build a 'parser' to parse the grammar.
In the old days, we'd do this with yacc and lex, two old Unix tools used to build compilers. Yacc and lex have Ruby implementations. While the original intent was to output lower-level code (such as machine assembly code when building a real compiler), there is nothing that prevents you from calling any Ruby code when a specific grammatical construct has been recognized by your parser.
NOTE: even though there is a yacc/lex Ruby gem out there, I wouldn't say it will 'DO THE HEAVY LIFTING'; learning yacc and lex has a small learning curve. Using something like yacc/lex would make your life easier in the long run, especially if you have a large grammar and must constantly adjust it.

Where can I learn more about the Google search "did you mean" algorithm? [duplicate]

Possible Duplicate:
How do you implement a “Did you mean”?
I am writing an application where I require functionality similar to Google's "did you mean?" feature used by their search engine:
Is there source code available for such a thing or where can I find articles that would help me to build my own?
You should check out Peter Norvig's article about implementing a spell checker in a few lines of Python:
How to Write a Spelling Corrector. It also has links to implementations in other languages (e.g. C#).
I attended a seminar by a Google engineer a year and a half ago, where they talked about their approach to this. The presenter was saying that (at least part of) their algorithm has little intelligence at all; but rather, utilises the huge amounts of data they have access to. They determined that if someone searches for "Brittany Speares", clicks on nothing, and then does another search for "Britney Spears", and clicks on something, we can have a fair guess about what they were searching for, and can suggest that in future.
Disclaimer: This may have just been part of their algorithm
Python has a module called difflib. It provides a functionality called get_close_matches. From the Python Documentation:
get_close_matches(word, possibilities[, n][, cutoff])
Return a list of the best "good enough" matches. word is a sequence for which close matches are desired (typically a string), and possibilities is a list of sequences against which to match word (typically a list of strings).
Optional argument n (default 3) is the maximum number of close matches to return; n must be greater than 0.
Optional argument cutoff (default 0.6) is a float in the range [0, 1]. Possibilities that don't score at least that similar to word are ignored.
The best (no more than n) matches among the possibilities are returned in a list, sorted by similarity score, most similar first.
>>> get_close_matches('appel', ['ape', 'apple', 'peach', 'puppy'])
['apple', 'ape']
>>> import keyword
>>> get_close_matches('wheel', keyword.kwlist)
['while']
>>> get_close_matches('apple', keyword.kwlist)
[]
>>> get_close_matches('accept', keyword.kwlist)
['except']
Could this library help you?
You can use http://developer.yahoo.com/search/web/V1/spellingSuggestion.html which would give a similar functionality.
You can check out the source code for Xapian which provides this functionality, as do a lot of other search libraries. http://xapian.org/
I am not sure if it serves your purpose, but a string edit distance algorithm with a dictionary might suffice for a small application.
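For illustration, a minimal sketch of that idea; the dictionary and the distance threshold below are made up:
def edit_distance(a, b):
    # Classic Levenshtein distance via dynamic programming.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def did_you_mean(query, dictionary, max_distance=2):
    # Return dictionary words within max_distance edits, closest first.
    scored = sorted((edit_distance(query, w), w) for w in dictionary)
    return [w for d, w in scored if d <= max_distance]

print(did_you_mean("wheal", ["wheel", "while", "whale", "weal", "apple"]))
# ['weal', 'wheel', 'whale']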
I'd take a look at this article on Google bombing. It shows that it just suggests answers based on previously entered results.
AFAIK the "did you mean?" feature doesn't check the spelling. It only gives you another query based on the content parsed by Google.
A great chapter to this topic can be found in the openly available Introduction to Information Retrieval.
You could use n-grams for the comparison: http://en.wikipedia.org/wiki/N-gram
Using the Python ngram module: http://packages.python.org/ngram/index.html
import ngram

G2 = ngram.NGram(["iis7 configure ftp 7.5",
                  "ubunto configre 8.5",
                  "mac configure ftp"])

print "String", "\t", "Similarity"
for i in G2.search("iis7 configurftp 7.5", threshold=0.1):
    print i[0], "\t", i[1]
You get:
>>>
String  Similarity
"iis7 configure ftp 7.5"    0.76
"mac configure ftp"    0.24
"ubunto configre 8.5"    0.19
Take a look at Levenshtein automata.

Resources