I'm trying to extract relation triples from Stanford CoreNLP, and it's working very well for single relation triples in a sentence but doesn't seem to work for multiple ideas in the same sentence.
For example: I drink water, and he eats a cake.
I would expect two triples, (I, drink, water) and (he, eats, cake), but only one shows up.
Here's what I'm currently working with:
with corenlp.CoreNLPClient(annotators="tokenize ssplit lemma pos ner depparse natlog openie".split()) as client:
    ann = client.annotate(text)
    sentence = ann.sentence[0].openieTriple
    for x in ann.sentence:
        print(x.openieTriple)
I assume I'm doing something wrong here. Changing max_entailments doesn't fix the problem.
You must do:
for x in ann.sentence:
    for triple in x.openieTriple:
        print(triple)
Discovered this today thanks to your question, so thanks!
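If you want plain tuples rather than printed protobuf messages, here is a minimal sketch of the same nested loop. It assumes each triple exposes subject, relation and object fields, as in the CoreNLP protobuf definition; collect_triples and the fake_ann stand-in are hypothetical names used only so the example runs without a server.

```python
from types import SimpleNamespace


def collect_triples(ann):
    """Return every OpenIE triple in the document as a (subject, relation, object) tuple."""
    triples = []
    for sentence in ann.sentence:
        # openieTriple is a repeated field, so one sentence can yield several triples.
        for triple in sentence.openieTriple:
            triples.append((triple.subject, triple.relation, triple.object))
    return triples


# Hypothetical stand-in for a CoreNLP annotation (assumption, not real server output).
fake_ann = SimpleNamespace(sentence=[
    SimpleNamespace(openieTriple=[
        SimpleNamespace(subject="I", relation="drink", object="water"),
        SimpleNamespace(subject="he", relation="eats", object="cake"),
    ])
])

print(collect_triples(fake_ann))
```

With a real client you would pass the ann object returned by client.annotate(text) instead of fake_ann.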
Kaggle problem: https://www.kaggle.com/c/tweet-sentiment-extraction
We have to upload an output file containing, for each tweet, its id and the supporting phrase in quotes:
<id>,"<word or phrase that supports the sentiment>"
The question is how the model can choose the extent of the phrase, that is, decide that the strong sentiment runs from word x to word y.
Can anyone please help?
The most common way this is done is by having your model predict a start index and an end index (of the sequence of tokens you want to extract).
Poking through the discussion threads, this was the architecture of the winning entry for that competition: https://www.kaggle.com/c/tweet-sentiment-extraction/discussion/159477
Notice in the first section "Heartkilla" they are predicting two things, y-start and y-end. Further down they mention they filter out predictions where y-start is greater than y-end.
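To make the start/end idea concrete, here is a minimal sketch of the decoding step only. It assumes the model has already produced one start score and one end score per token; the scores below are invented. Note that instead of filtering out invalid predictions afterwards, as the linked write-up describes, this sketch only ever considers pairs with start <= end. best_span is a hypothetical helper name.

```python
def best_span(start_scores, end_scores, max_len=None):
    """Pick (start, end) maximising start_scores[start] + end_scores[end] with start <= end."""
    best, best_score = None, float("-inf")
    for i, s in enumerate(start_scores):
        for j in range(i, len(end_scores)):
            # Optionally cap the span length.
            if max_len is not None and j - i + 1 > max_len:
                break
            score = s + end_scores[j]
            if score > best_score:
                best, best_score = (i, j), score
    return best


tokens = ["my", "day", "was", "so", "very", "good", "today"]
# Hypothetical per-token scores, standing in for the model's two output heads.
start_scores = [0.1, 0.0, 0.2, 2.5, 1.0, 0.8, 0.1]
end_scores = [0.0, 0.1, 0.1, 0.3, 0.5, 3.0, 0.4]

start, end = best_span(start_scores, end_scores)
print(" ".join(tokens[start:end + 1]))  # prints "so very good"
```

The length of the extracted phrase is never predicted directly; it falls out of the two predicted positions.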
I'm very new to doc2vec and have some questions about document vectors.
What I'm trying to get is a vector for a phrase like 'cat-like mammal'.
So far I have tried the code below with a pre-trained doc2vec model:
import gensim.models as g

model = "path/pre-trained doc2vec model.bin"
m = g.Doc2Vec.load(model)
oneword = 'cat'
phrase = 'cat like mammal'
oneword_vec = m[oneword]
phrase_vec = m[phrase]
With this code I could get a vector for the single word 'cat', but not for 'cat-like mammal'.
Is that because word2vec only provides vectors for single words like 'cat'? (If I'm wrong, please correct me.)
So I searched, found infer_vector(), and tried the code below:
phrase = phrase.lower().split(' ')
phrase_vec = m.infer_vector(phrase)
With this code I do get a vector, but I get a different value every time I call
phrase_vec = m.infer_vector(phrase)
I assume this is because infer_vector() has a 'steps' parameter.
When I set steps=0, I always get the same vector.
phrase_vec = m.infer_vector(phrase, steps=0)
However, I also read that a document vector is obtained by averaging the vectors of the words in the document.
For example, if the document consists of the three words 'cat-like mammal', you add the vectors of 'cat', 'like', and 'mammal' and average them to get the document vector. (If I'm wrong, please correct me.)
So here are my questions:
Is calling infer_vector() with 0 steps the right way to get a vector for a phrase?
If averaging word vectors is the right way to get a document vector, is there no need to use infer_vector()?
What is model.docvecs for?
Using 0 steps means no inference at all happens: the vector stays at its randomly-initialized position. So you definitely don't want that. That the vectors for the same text vary a little each time you run infer_vector() is normal: the algorithm is using randomness. The important thing is that they're similar-to-each-other, within a small tolerance. You are more likely to make them more similar (but still not identical) with a larger steps value.
You can see also an entry about this non-determinism in Doc2Vec training or inference in the gensim FAQ.
Averaging word-vectors together to get a doc-vector is one useful technique, that might be good as a simple baseline for many purposes. But it's not the same as what Doc2Vec.infer_vector() does - which involves iteratively adjusting a candidate vector to be better and better at predicting the text's words, just like Doc2Vec training. For your doc-vector to be comparable to other doc-vectors created during model training, you should use infer_vector().
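For comparison, the averaging baseline mentioned above can be sketched in a few lines. This is only an illustration with made-up 2-d vectors; average_vector and toy_vectors are hypothetical names, and in practice the vectors would come from your trained model. It is not what infer_vector() computes.

```python
def average_vector(words, word_vectors):
    """Average the vectors of the words we know; unknown words are skipped."""
    known = [word_vectors[w] for w in words if w in word_vectors]
    if not known:
        raise ValueError("no known words in input")
    dim = len(known[0])
    return [sum(vec[k] for vec in known) / len(known) for k in range(dim)]


# Toy 2-d vectors, invented for illustration (assumption, not real model output).
toy_vectors = {"cat": [1.0, 0.0], "like": [0.0, 1.0], "mammal": [1.0, 1.0]}

print(average_vector("cat like mammal".split(), toy_vectors))
```

Unlike inference, this is fully deterministic, which is one reason it makes a convenient baseline.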
The model.docvecs object holds all the doc-vectors that were learned during model training, for lookup (by the tags given as their names during training) or other operations, like finding the most_similar() N doc-vectors to a target tag/vector amongst those learned during training.
I have around 1000 pairs of sentences. Each pair consists of two sentences, one with a high CTR and one with a low CTR. I want to create a mechanism that automatically produces sentences optimized for high CTR. When iterating over the pairs, I can get a vector (using spaCy) for each sentence. I take the difference of the vectors (Sent1.vector - Sent2.vector) and then average the differences over all pairs using numpy mean. Once I have this "difference vector" in hand, I want to add it to any given text and get a new sentence. Any ideas how to achieve this? Gensim's most_similar only works on single words... Thanks
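For readers following along, the vector arithmetic described above might look like the sketch below. It uses plain lists and invented 3-d vectors rather than spaCy and numpy, and mean_difference and shift are hypothetical helper names. It covers only the arithmetic; turning the shifted vector back into a sentence is the open part of the question.

```python
def mean_difference(pairs):
    """Average (high - low) over all pairs of equal-length vectors."""
    dim = len(pairs[0][0])
    totals = [0.0] * dim
    for high, low in pairs:
        for k in range(dim):
            totals[k] += high[k] - low[k]
    return [t / len(pairs) for t in totals]


def shift(vector, direction, scale=1.0):
    """Move a sentence vector along the averaged difference direction."""
    return [v + scale * d for v, d in zip(vector, direction)]


# Toy (high-CTR, low-CTR) sentence vectors, invented for illustration.
pairs = [
    ([1.0, 2.0, 0.0], [0.0, 2.0, 1.0]),
    ([3.0, 1.0, 1.0], [1.0, 1.0, 0.0]),
]
direction = mean_difference(pairs)
print(direction)  # prints [1.5, 0.0, 0.0]
print(shift([0.0, 0.0, 0.0], direction))
```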
I have found a successful weighting scheme for adding word vectors, which seems to work for sentence comparison in my case:
>>> from scipy import spatial
>>> query1 = vectorize_query("human cat interaction")
>>> query2 = vectorize_query("people and cats talk")
>>> query3 = vectorize_query("monks predicted frost")
>>> query4 = vectorize_query("man found his feline in the woods")
>>> print(1 - spatial.distance.cosine(query1, query2))
0.7154500319
>>> print(1 - spatial.distance.cosine(query1, query3))
0.415183904078
>>> print(1 - spatial.distance.cosine(query1, query4))
0.690741014142
When I add extra information to the sentence, which acts as noise, the similarity decreases:
>>> query4 = vectorize_query("man found his feline in the dark woods while picking white mushrooms and watching unicorns")
>>> print(1 - spatial.distance.cosine(query1, query4))
0.618269123349
Is there any way to deal with such additional information when comparing texts using word vectors, given that I know some subset of the text can provide a better match?
UPD: edited the code above to make it clearer.
In my case vectorize_query does so-called smooth inverse frequency weighting: word vectors from a GloVe model (it could just as well be word2vec, etc.) are added with weights a/(a+w), where w should be the word frequency. In place of the frequency I use the word's inverse tf-idf score, i.e. w = 1/tfidf(word). The coefficient a is typically set to 1e-3 in this approach. Using the plain tf-idf score as the weight instead of that fraction gives almost the same result. I also played with normalization, etc.
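A minimal sketch of the a/(a+w) weighting step described above, to make the formula concrete. It uses a plain word frequency for w (the post substitutes 1/tfidf(word)), made-up 2-d vectors, and hypothetical names (sif_vector, toy_vectors, toy_freq); it is not the actual vectorize_query.

```python
def sif_vector(words, word_vectors, word_freq, a=1e-3):
    """Average of word vectors, each weighted by a / (a + w) where w is the word's frequency."""
    dim = len(next(iter(word_vectors.values())))
    total = [0.0] * dim
    count = 0
    for word in words:
        if word not in word_vectors:
            continue  # skip out-of-vocabulary words
        weight = a / (a + word_freq.get(word, 0.0))
        for k in range(dim):
            total[k] += weight * word_vectors[word][k]
        count += 1
    if count == 0:
        raise ValueError("no known words in input")
    return [t / count for t in total]


# Toy vectors and frequencies, invented for illustration.
toy_vectors = {"the": [1.0, 0.0], "cat": [0.0, 1.0]}
toy_freq = {"the": 0.05, "cat": 0.0001}

vec = sif_vector(["the", "cat"], toy_vectors, toy_freq)
print(vec)
```

The frequent word "the" ends up with a much smaller weight than the rare word "cat", which is the point of the scheme.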
I kept just a "vectorize sentence" call in my example so as not to overload the question, since I think the issue does not depend on how I combine word vectors with the weighting scheme. The problem is only that the comparison works best when the sentences have approximately the same number of meaningful words.
I am aware of another approach, in which the distance between a sentence and a text is computed as the sum or mean of minimal pairwise word distances, e.g.
"Obama speaks to the media in Illinois" <-> "The President greets the press in Chicago", where we have dist = d(Obama, president) + d(speaks, greets) + d(media, press) + d(Chicago, Illinois). But this approach does not take into account that an adjective can change the meaning of a noun significantly, etc., which is more or less incorporated in vector models. Adjectives like 'good', 'bad', 'nice', etc. become noise there: they match across the two texts and contribute zero or low distances, thus decreasing the distance between the sentence and the text.
I also experimented a bit with doc2vec-style models (as far as I recall, the gensim doc2vec implementation and skip-thoughts embeddings), but in my case (matching a short query against a much larger amount of text) the results were unsatisfactory.
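For reference, the sum of minimal pairwise word distances described above can be sketched as follows. The vectors are invented 2-d toys, cosine distance is my choice of d (the description does not fix one), and min_pairwise_distance is a hypothetical helper name.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def min_pairwise_distance(words_a, words_b, word_vectors):
    """Sum, over the words in words_a, of the distance to the closest word in words_b."""
    total = 0.0
    for a in words_a:
        total += min(cosine_distance(word_vectors[a], word_vectors[b]) for b in words_b)
    return total


# Toy vectors, invented so that related words point in similar directions.
toy_vectors = {
    "obama": [1.0, 0.1], "president": [0.9, 0.2],
    "media": [0.1, 1.0], "press": [0.2, 0.9],
}

print(min_pairwise_distance(["obama", "media"], ["president", "press"], toy_vectors))
```

Note that the measure is directional: every word in the first list looks for its best partner in the second, so extra words in the second list do not add to the total.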
If you want part of speech to drive the similarity (e.g. you are only interested in nouns and noun phrases and want to ignore adjectives), you might want to look at sense2vec, which incorporates word classes into the model (https://explosion.ai/blog/sense2vec-with-spacy). You can then weight each term by its word class while performing a dot product across all terms, effectively deboosting what you consider 'noise'.
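A rough sketch of that deboosting idea, under some assumptions: keys follow a "word|POS" pattern similar to sense2vec's, the vectors and weights are invented, and pos_weighted_vector is a hypothetical helper rather than part of any library.

```python
def pos_weighted_vector(tagged_words, word_vectors, pos_weights, default_weight=1.0):
    """Weighted average of word vectors; each word's weight depends on its word class."""
    dim = len(next(iter(word_vectors.values())))
    total = [0.0] * dim
    weight_sum = 0.0
    for key in tagged_words:
        if key not in word_vectors:
            continue  # skip out-of-vocabulary keys
        pos = key.rsplit("|", 1)[-1]  # "dark|ADJ" -> "ADJ"
        weight = pos_weights.get(pos, default_weight)
        for k in range(dim):
            total[k] += weight * word_vectors[key][k]
        weight_sum += weight
    if weight_sum == 0:
        raise ValueError("no weighted words found")
    return [t / weight_sum for t in total]


# Toy vectors and class weights, invented for illustration.
toy_vectors = {"feline|NOUN": [1.0, 0.0], "dark|ADJ": [0.0, 1.0], "woods|NOUN": [1.0, 0.0]}
pos_weights = {"NOUN": 1.0, "ADJ": 0.2}

vec = pos_weighted_vector(["feline|NOUN", "dark|ADJ", "woods|NOUN"], toy_vectors, pos_weights)
print(vec)
```

Lowering the ADJ weight pulls the sentence vector towards the nouns, so added adjectives move it less.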
It's not clear your original result, the similarity decreasing when a bunch of words are added, is 'bad' in general. A sentence that says a lot more is a very different sentence!
If that result is specifically bad for your purposes – you need a model that captures whether a sentence says "the same and then more", you'll need to find/invent some other tricks. In particular, you might need a non-symmetric 'contains-similar' measure – so that the longer sentence is still a good match for the shorter one, but not vice-versa.
Any shallow, non-grammar-sensitive embedding that's fed by word-vectors will likely have a hard time with single-word reversals-of-meaning, as for example the difference between:
After all considerations, including the relevant measures of economic, cultural, and foreign-policy progress, historians should conclude that Nixon was one of the very *worst* Presidents
After all considerations, including the relevant measures of economic, cultural, and foreign-policy progress, historians should conclude that Nixon was one of the very *best* Presidents
The words 'worst' and 'best' will already be quite-similar, as they serve the same functional role and appear in the same sorts of contexts, and may only contrast with each other a little in the full-dimensional space. And then their influence may be swamped in the influence of all the other words. Only more sophisticated analyses may highlight their role as reversing the overall import of the sentence.
While it's not yet an option in gensim, there are alternative ways to calculate the "Word Mover's Distance" that report the unmatched 'remainder' after all the easy pairwise-meaning-measuring is finished. While I don't know any prior analysis or code that'd flesh out this idea for your needs, or prove its value, I have a hunch such an analysis might help better discover cases of "says the same and more", or "says mostly the same but with reversal in a few words/aspects".
Let's suppose I want to build directed graphs with an algorithm that can read through a paragraph and build edges between nouns and their corresponding adjectives.
Example:
Input String
"Owls are solitary and nocturnal birds of prey."
Output should look something like this:
Owls = {adjectives:"solitary, nocturnal, birds"}
If the above is not possible, what would be the best way to get some adjectives that describe a noun?
A more general approach for what you're asking is to use a Dependency Parser which extracts various types of relationship between words in a sentence.
The input of the parser is a sentence, and its output is a dependency tree over the words, where each edge denotes a dependency relation between two words.
Consider the following example (taken from the wiki entry linked above). In the sentence, "syntactic" is an adjective describing "functions". The parse tree encodes this information by connecting the two words with an edge labeled ATTR (attribute).
You can find dependency parsers for many languages online.
A good starting point is Python's NLTK package.
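To show how the parser output could be turned into noun-to-adjective edges, here is a minimal sketch. The parse below is written by hand for illustration and is not the output of any particular parser; the labels (amod, conj, attr, ...) and the tuple layout are assumptions, and noun_adjective_edges is a hypothetical helper.

```python
from collections import defaultdict


def noun_adjective_edges(tokens):
    """tokens: list of (text, pos, dep, head_index). Returns {noun: [adjectives]}."""
    edges = defaultdict(list)
    for text, pos, dep, head in tokens:
        if pos != "ADJ":
            continue
        # A coordinated adjective ("solitary and nocturnal") hangs off the first
        # adjective via 'conj', so follow the chain up to the first adjective's relation.
        while dep == "conj":
            _, _, dep, head = tokens[head]
        head_text, head_pos, _, _ = tokens[head]
        if dep == "amod" and head_pos == "NOUN":
            edges[head_text].append(text)
    return dict(edges)


# Hand-written dependency parse of the example sentence (assumption).
parse = [
    ("Owls", "NOUN", "nsubj", 1),
    ("are", "AUX", "ROOT", 1),
    ("solitary", "ADJ", "amod", 5),
    ("and", "CCONJ", "cc", 2),
    ("nocturnal", "ADJ", "conj", 2),
    ("birds", "NOUN", "attr", 1),
    ("of", "ADP", "prep", 5),
    ("prey", "NOUN", "pobj", 6),
    (".", "PUNCT", "punct", 1),
]

print(noun_adjective_edges(parse))  # prints {'birds': ['solitary', 'nocturnal']}
```

In this hand-written parse the adjectives attach to "birds" rather than "Owls"; linking them to "Owls" would take one more step through the copula ("Owls are ... birds").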
If you are looking for all adjectives that could describe a noun your best starting place might be the Google NGram dataset. You can try the viewer here which shows that 'horned', 'barn', 'screech' are all common adjectives for owls.
Alternatively, if you are trying to tag specific sentences to find the adjectives related to a noun, you should try one of the part-of-speech taggers.