Why am I getting a string where I should get a dict when using pycorenlp.StanfordCoreNLP.annotate? - stanford-nlp

I'm running this example using pycorenlp, the Python wrapper for Stanford CoreNLP, but the annotate function returns a string instead of a dict. When I iterate over the result to get each sentence's sentiment value, I get the following error: "string indices must be integers".
How can I get past this? Could anyone help? Thanks in advance.
The code is below:
from pycorenlp import StanfordCoreNLP

nlp_wrapper = StanfordCoreNLP('http://localhost:9000')
doc = ("I like this chocolate. This chocolate is not good. The chocolate is delicious. "
       "Its a very tasty chocolate. This is so bad")
annot_doc = nlp_wrapper.annotate(doc,
    properties={
        'annotators': 'sentiment',
        'outputFormat': 'json',
        'timeout': 100000,
    })
for sentence in annot_doc["sentences"]:
    print(" ".join([word["word"] for word in sentence["tokens"]]) + " => "
          + str(sentence["sentimentValue"]) + " = " + sentence["sentiment"])

You should just use the official stanfordnlp package! (note: the name is going to be changed to stanza at some point)
Here are all the details, and you can get various output formats from the server including JSON.
https://stanfordnlp.github.io/stanfordnlp/corenlp_client.html
from stanfordnlp.server import CoreNLPClient

text = "I like this chocolate. This chocolate is not good."  # example input
with CoreNLPClient(annotators=['tokenize','ssplit','pos','lemma','ner', 'parse', 'depparse','coref'], timeout=30000, memory='16G') as client:
    # submit the request to the server
    ann = client.annotate(text)

It would help if you provided the error stack trace. The likely cause is that the annotator hits its timeout and returns an error message such as 'the text is too large..' as a plain string instead of the JSON result. I would also like to expand on Petr Matuska's comment. From your example, it is clear that your goal is to find the sentiment of each sentence along with its sentiment score.
The sentiment score is not included in the result when using CoreNLPClient. I faced a similar issue and found a workaround that fixed it. If the text is large, you must set the timeout to a much higher value (e.g., timeout = 500000). Also, the annotator returns a dictionary, which consumes a lot of memory. For a larger text corpus this becomes a serious problem, so it is up to us to choose a suitable data structure in the code. Alternatives include classes with __slots__, tuples, or named tuples for faster access.
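One way to make the failure visible is to check what annotate actually returned before indexing into it. The following is a minimal sketch, not part of pycorenlp: parse_annotation is a hypothetical helper, and it assumes that a failed request comes back as a plain, non-JSON string.

```python
import json


def parse_annotation(result):
    """Return the annotation as a dict, or raise with the server's message.

    Hypothetical helper: assumes an error comes back as a non-JSON string.
    """
    if isinstance(result, dict):
        return result
    try:
        parsed = json.loads(result)
    except (TypeError, ValueError):
        # Not JSON: most likely an error message such as a timeout notice.
        raise RuntimeError("CoreNLP returned a non-JSON response: %r" % (result,)) from None
    if not isinstance(parsed, dict):
        raise RuntimeError("Unexpected JSON payload: %r" % (parsed,))
    return parsed
```

With this in place, the loop in the question would iterate over parse_annotation(annot_doc)["sentences"], and a timeout would surface as a readable error instead of "string indices must be integers".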

Related

How to handle spelling mistakes (typos) in entity extraction in Rasa NLU?

I have a few intents in my training set (nlu_data.md file) with a sufficient number of training examples under each intent.
Following is an example,
##intent: SEARCH_HOTEL
- find good [hotel](place) for me in Mumbai
I have added multiple sentences like this.
At testing time, all sentences from the training file work fine. But if an input query contains a spelling mistake, e.g. hotol/hetel/hotele for the keyword hotel, Rasa NLU is unable to extract it as an entity.
I want to resolve this issue.
I am only allowed to change the training data, and I am not allowed to write a custom component for this.
To handle spelling mistakes like this in entities, you should add these examples to your training data. So something like this:
##intent: SEARCH_HOTEL
- find good [hotel](place) for me in Mumbai
- looking for a [hotol](place) in Chennai
- [hetel](place) in Berlin please
Once you've added enough examples, the model should be able to generalise from the sentence structure.
If you're not using it already, it also makes sense to use the character-level CountVectorsFeaturizer. It should already be in the default pipeline.
One thing I would highly suggest is to use look-up tables with fuzzywuzzy matching. If you have a limited number of entities (like country names), look-up tables are quite fast, and fuzzy matching catches typos when the entity exists in your look-up table (by searching for typo variations of those entities). There's a whole blog post about it from Rasa.
There's a working implementation of fuzzywuzzy as a custom component:
import json
import os

from fuzzywuzzy import process
from nltk.corpus import stopwords
from rasa.nlu.components import Component  # import path assumed for Rasa 1.x

# STOP_WORDS is just a set of stop words from NLTK
STOP_WORDS = set(stopwords.words('english'))


class FuzzyExtractor(Component):
    name = "FuzzyExtractor"
    provides = ["entities"]
    requires = ["tokens"]
    defaults = {}
    language_list = ["en"]
    threshold = 90

    def __init__(self, component_config=None, *args):
        super(FuzzyExtractor, self).__init__(component_config)

    def train(self, training_data, cfg, **kwargs):
        pass

    def process(self, message, **kwargs):
        entities = list(message.get('entities', []))
        # Get file path of lookup table in json format
        cur_path = os.path.dirname(__file__)
        if os.name == 'nt':
            partial_lookup_file_path = '..\\data\\lookup_master.json'
        else:
            partial_lookup_file_path = '../data/lookup_master.json'
        lookup_file_path = os.path.join(cur_path, partial_lookup_file_path)
        # The with block closes the file, so no explicit close() is needed
        with open(lookup_file_path, 'r') as file:
            lookup_data = json.load(file)['data']
        tokens = message.get('tokens')
        for token in tokens:
            if token.text not in STOP_WORDS:
                fuzzy_results = process.extract(
                    token.text,
                    lookup_data,
                    processor=lambda a: a['value']
                    if isinstance(a, dict) else a,
                    limit=10)
                for result, confidence in fuzzy_results:
                    if confidence >= self.threshold:
                        entities.append({
                            "start": token.offset,
                            "end": token.end,
                            "value": token.text,
                            "fuzzy_value": result["value"],
                            "confidence": confidence,
                            "entity": result["entity"]
                        })
        message.set("entities", entities, add_to_output=True)
I didn't implement it myself; it was implemented and validated here: Rasa forum
You then just add it to your NLU pipeline in the config.yml file.
It's a strange request that they ask you not to change the code or write custom components.
The approach you would have to take would be to use entity synonyms. A slight edit on a previous answer:
##intent: SEARCH_HOTEL
- find good [hotel](place) for me in Mumbai
- looking for a [hotol](place:hotel) in Chennai
- [hetel](place:hotel) in Berlin please
This way, even if the user enters a typo, the correct entity will be extracted. If you want this to be foolproof, I do not recommend hand-editing the intents. Use some kind of automated tool to generate the training data, e.g. Generate misspelled words (typos)
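The idea of generating the misspelled training examples automatically could be sketched as follows. This is only an illustration under simple assumptions: typo_variants and synonym_examples are hypothetical helpers, the sentence template is made up, and only two kinds of typos are covered (a dropped letter and two swapped neighbouring letters), so substitutions such as hotol would still have to be added by other means.

```python
def typo_variants(word):
    """Return single-edit misspellings: one letter dropped or two neighbours swapped."""
    variants = set()
    for i in range(len(word)):
        variants.add(word[:i] + word[i + 1:])  # drop the letter at position i
    for i in range(len(word) - 1):
        # swap the letters at positions i and i + 1
        variants.add(word[:i] + word[i + 1] + word[i] + word[i + 2:])
    variants.discard(word)  # swapping two identical letters gives the word itself
    return sorted(variants)


def synonym_examples(word, entity, template):
    """Fill the template with [typo](entity:word) annotations, one line per typo."""
    return [
        template.format("[{}]({}:{})".format(variant, entity, word))
        for variant in typo_variants(word)
    ]


examples = synonym_examples("hotel", "place", "- looking for a {} in Chennai")
for line in examples:
    print(line)
```

The printed lines can be pasted under the intent in the training file; each one maps a typo back to the canonical value through the entity synonym syntax shown above.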
First of all, add samples for the most common typos of your entities, as advised here.
Beyond this, you need a spellchecker.
I am not sure whether there is a single library that can be used in the pipeline, but if not, you need to create a custom component. Dealing with this through training data alone is not feasible, because you can't create samples for every typo.
Using fuzzywuzzy is one option, but it is generally slow and does not solve all the issues.
Universal Encoder is another solution.
There are probably more options for spell correction, but you will need to write code either way.
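As a rough picture of what such a spell-correction step might look like, here is a small sketch that uses Python's standard difflib module as a stand-in for a real spellchecker or for fuzzywuzzy. KNOWN_VALUES, correct_token, and correct_sentence are hypothetical names, the vocabulary is made up, and the 0.75 cutoff is an arbitrary choice that would need tuning on real data.

```python
import difflib

KNOWN_VALUES = ["hotel", "flight", "restaurant"]  # hypothetical entity vocabulary


def correct_token(token, vocabulary=KNOWN_VALUES, cutoff=0.75):
    """Replace a token with its closest known value, if one is similar enough."""
    matches = difflib.get_close_matches(token.lower(), vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else token


def correct_sentence(sentence):
    """Correct every whitespace-separated token before entity extraction."""
    return " ".join(correct_token(token) for token in sentence.split())


print(correct_sentence("find good hotol for me in Mumbai"))
```

Running the correction before the text reaches the NLU model keeps the training data small, but it does require code outside the training file, which is the trade-off described above.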

In gensim with pretrained model, wmdistance is working well, but n_similarity is not

I have calculated the distance between two sentences using gensim's wmdistance() function with a pre-trained model.
Now I want the similarity between them, so I tried the n_similarity() function, but a KeyError occurred:
KeyError: word not in vocabulary
The screenshot shows an example of the error.
Does anyone have an idea about this?
When you get an error that a word is not in the vocabulary, it means the word is not in that model.
Any attempt to look it up will generate a KeyError, to let you know you are trying to get a word-vector that isn't there.
You should filter your lists-of-tokens, before passing them to n_similarity(), to only include valid words.
Of course, that means you can't get a meaningful result about the word 'selfie'. It's unknown nonsense to the model, as if you asked for the word 'asruhfglaiwurfliuawiufsdfsdfs'.
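The filtering step could look roughly like this. It is a sketch that assumes kv is a gensim KeyedVectors object, which supports the `in` operator and provides n_similarity; known_tokens and safe_n_similarity are hypothetical helper names.

```python
def known_tokens(tokens, vocab):
    """Keep only the tokens that the model knows about."""
    return [token for token in tokens if token in vocab]


def safe_n_similarity(kv, tokens_a, tokens_b):
    """Call kv.n_similarity on the known words only.

    Returns None when either sentence has no known words left,
    because no meaningful similarity can be computed in that case.
    """
    a = known_tokens(tokens_a, kv)
    b = known_tokens(tokens_b, kv)
    if not a or not b:
        return None
    return kv.n_similarity(a, b)
```

Any container that supports `in` works for known_tokens, so a plain set of words can be used to try the filter without loading a model.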

gensim/models/ldaseqmodel.py:217: RuntimeWarning: divide by zero encountered in double_scalars

/Users/Barry/anaconda/lib/python2.7/site-packages/gensim/models/ldaseqmodel.py:217: RuntimeWarning: divide by zero encountered in double_scalars
convergence = np.fabs((bound - old_bound) / old_bound)
# Python 2.7 syntax, as in the question
from collections import Counter

from gensim import corpora
from gensim.models import ldaseqmodel

#dynamic topic model
def run_dtm(num_topics=18):
    # preprocessing() is the asker's own helper (not shown)
    docs, years, titles = preprocessing(datasetType=2)
    #resort document by years
    Z = zip(years, docs)
    Z = sorted(Z, reverse=False)
    years_new, docs_new = zip(*Z)
    #generate time slice (counts listed in chronological order)
    year_counts = Counter(years_new)
    time_slice = [year_counts[year] for year in sorted(year_counts)]
    for year in sorted(year_counts):
        print year,' --- ',year_counts[year]
    print '********* data set loaded ********'
    dictionary = corpora.Dictionary(docs_new)
    corpus = [dictionary.doc2bow(text) for text in docs_new]
    print '********* train lda seq model ********'
    ldaseq = ldaseqmodel.LdaSeqModel(corpus=corpus, id2word=dictionary, time_slice=time_slice, num_topics=num_topics)
    print '********* lda seq model done ********'
    ldaseq.print_topics(time=1)
Hey guys, I'm using the dynamic topic model in the gensim package for topic analysis, following this tutorial, https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/ldaseqmodel.ipynb, but I always get the same unexpected error. Can anyone give me some guidance? I'm really puzzled, even though I have tried several different datasets for generating the corpus and dictionary.
The warning is raised by NumPy on the np.fabs line. Which NumPy and gensim versions are you using?
Recent NumPy releases no longer support Python 2.7, and LdaSeqModel was added to gensim in 2016, so you might simply not have a compatible version available. If you are recoding a Python 3+ tutorial to a 2.7 variant, you obviously understand a bit about the version differences, so try running it in, say, a 3.6.8 environment (you will have to upgrade sometime anyway; 2020 is the end of 2.7 support from Python itself). That might already help: I've gone through the tutorial and did not encounter this with my own data.
That being said, I have encountered the same error before when running LdaMulticore, and it was caused by an empty corpus.
Instead of running your code entirely inside a function, can you go through it line by line (or look at your DEBUG level log) and check whether your output has the expected properties, for example that your corpus is not empty and does not contain empty documents?
If it does, fix the preprocessing steps and try again. That at least helped me, and it helped with the same ldamodel error on the mailing list.
PS: not commenting because I lack the reputation, feel free to edit this.
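The check for empty documents could be folded into the preprocessing, for instance like this. It is a sketch under the assumption that each document is a list of tokens and that years and docs are parallel lists; drop_empty_documents is a hypothetical helper. It also rebuilds the time slices afterwards, since removing documents changes the per-year counts.

```python
from collections import Counter


def drop_empty_documents(years, docs):
    """Remove empty documents and rebuild the chronological time slices."""
    kept = [(year, doc) for year, doc in zip(years, docs) if doc]
    kept.sort(key=lambda pair: pair[0])  # order by year only
    years_kept = [year for year, _ in kept]
    docs_kept = [doc for _, doc in kept]
    counts = Counter(years_kept)
    time_slice = [counts[year] for year in sorted(counts)]
    return years_kept, docs_kept, time_slice


years_new, docs_new, time_slice = drop_empty_documents(
    [2001, 2000, 2001, 2000],
    [["topic"], [], ["model", "time"], ["corpus"]],
)
print(time_slice)
```

The time slices must add up to the number of remaining documents, which is a cheap sanity check to run before training.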
This is an issue with the source code of ldaseqmodel.py itself.
With the latest gensim package (version 3.8.3) I get the same warning at line 293:
ldaseqmodel.py:293: RuntimeWarning: divide by zero encountered in double_scalars
convergence = np.fabs((bound - old_bound) / old_bound)
If you go through the code, you will see that it divides the difference between bound and old_bound by old_bound (which is also visible from the warning).
If you analyze further, you will see that at line 263 old_bound is initialized to zero, and this is the main reason you get the divide-by-zero warning.
For further information, I put a print statement at line 294:
print('bound = {}, old_bound = {}'.format(bound, old_bound))
The output I received was shown in a screenshot.
So, in short, you get this warning because of the source code of ldaseqmodel.py itself, not because of any empty document. However, if you do not remove the empty documents from your corpus, you will receive another warning. So I suggest removing any empty documents from your corpus and simply ignoring the division-by-zero warning above.
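If the warning is only noise, it can be silenced around the training call with the standard warnings module, since NumPy reports this condition as a RuntimeWarning by default. This is a sketch: run_quietly is a hypothetical helper, and noisy_step merely stands in for the training call so the example can run without gensim.

```python
import warnings


def run_quietly(func, *args, **kwargs):
    """Call func while hiding RuntimeWarning, such as NumPy's divide-by-zero notice."""
    with warnings.catch_warnings():
        warnings.simplefilter("ignore", category=RuntimeWarning)
        return func(*args, **kwargs)


def noisy_step():
    """Stand-in for a call that emits the warning from the question."""
    warnings.warn("divide by zero encountered in double_scalars", RuntimeWarning)
    return "done"


print(run_quietly(noisy_step))
```

In the question's code, the same wrapper could be applied to the ldaseqmodel.LdaSeqModel(...) call. Note that it hides every RuntimeWarning raised during that call, not just this one.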

Gensim FastText - KeyError: "word not in vocabulary"

I'm having trouble with the "most_similar" call on a FastText model. From my understanding, FastText should be able to return results for words that aren't in the vocabulary, but I'm getting a "not in vocabulary" error, even though the same call worked perfectly before saving and loading the model.
Here's the code from Jupyter.
import gensim
model = gensim.models.FastText(my_sentences, size=100, window=5, min_count=3, workers=4, sg=1)
model.wv.most_similar(positive=['iPhone 6'])
Returns
[('iPhone7', 0.942690372467041),
('iPhone7.', 0.9395840764045715),
('iPhone5s', 0.9379133582115173),
('iPhone6s', 0.9338586330413818),
('iPhone5S', 0.9335439801216125),
('iPhone5.', 0.9318809509277344),
('iPhone®', 0.9314558506011963),
('iPhone6', 0.9268479347229004),
('iPhone4s', 0.9223971366882324),
('iPhone5', 0.9212019443511963)]
So far so good, now I save the model.
model.wv.save_word2vec_format("example_fasttext.txt", binary=False)
Then load it up again:
from gensim.models import KeyedVectors
new_model = KeyedVectors.load_word2vec_format('example_fasttext.txt', binary=False, limit=50000)
Then I do the exact most_similar call from the model I just loaded:
new_model.most_similar(positive=['iPhone 6'])
But results now are:
KeyError: "word 'iPhone 6' not in vocabulary"
Any idea what I did wrong?
Your problem is probably the limit parameter of the load_word2vec_format method. With limit=50000 you load only the first 50000 vectors in the file, which are usually the most frequent words. If iPhone 6 does not appear often enough, it is not loaded.
Try with
new_model = KeyedVectors.load_word2vec_format('example_fasttext.txt', binary=False)
I'm having the same problem as you, and I think I am starting to understand what's going on.
Basically, when you save your model as a .txt or a .vec file, you only save the word vectors, not the n-grams (which are saved in the binary version of your model) that allow the model to generalize to, or approximate, out-of-vocabulary words.
I suggest you save your model with:
your_fasttext_model.save(file_path)
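To see why the n-grams matter, here is a rough illustration of how a word can be broken into character n-grams with boundary markers, in the spirit of FastText's subword scheme (3 to 6 characters by default). It is not gensim's implementation, which additionally hashes the n-grams into buckets; char_ngrams is a hypothetical helper.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return all character n-grams of the word wrapped in < and > markers."""
    wrapped = "<" + word + ">"
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(wrapped) - n + 1):
            grams.append(wrapped[start:start + n])
    return grams


shared = set(char_ngrams("iPhone6")) & set(char_ngrams("iPhone7"))
print(sorted(shared))
```

An unseen word gets its vector from n-grams like these, which it shares with words seen during training. A plain-text export contains one vector per whole word and none of the n-gram vectors, so that fallback is lost after reloading.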

Scraping all data from Reddit searches

I am using PRAW to scrape data off of reddit. I am using the .search method to search very specific people. I can easily print the title of the submission if the keyword is in the title, but if the keyword is in the text of the submission nothing pops up. Here is the code I have so far.
import praw
reddit = praw.Reddit(----------)
alls = reddit.subreddit("all")
for submission in alls.search("Yoa ming",sort = comment, limit = 5):
    print(submission.title)
When I run this code I get
Yoa Ming next to Elephant!
Obama's Yoa Ming impression
i used to yoa ming... until i took an arrow to the knee
Could someone make a rage face out of our dearest Yoa Ming? I think it would compliment his first one so well!!!
If you search for Yoa Ming on reddit, there are posts that don't contain "Yoa Ming" in the title but do contain it in the text, and those are the posts I want.
Thanks.
You might need to update the version of PRAW you are using. Using v6.3.1 yields the expected outcome and includes submissions that have the keyword in the body and not the title.
Also, the sort=comment parameter should be sort='comments'. Using an invalid value for sort will not throw an error but it will fall back to the default value, which may be why you are seeing different search results between your script and the website.
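Once the search returns body matches as well, the posts that mention the keyword only in their text can be picked out with a small filter. This sketch relies on the title and selftext attributes of PRAW submissions; mentioned_only_in_body is a hypothetical helper, and the sample objects below are made-up stand-ins so the example runs without Reddit credentials.

```python
from types import SimpleNamespace


def mentioned_only_in_body(submissions, keyword):
    """Keep submissions whose text contains the keyword but whose title does not."""
    needle = keyword.lower()
    return [
        submission for submission in submissions
        if needle in (submission.selftext or "").lower()
        and needle not in submission.title.lower()
    ]


# Made-up stand-ins for PRAW submissions
sample = [
    SimpleNamespace(title="Yoa Ming next to Elephant!", selftext=""),
    SimpleNamespace(title="Tallest players ever", selftext="Of course Yoa Ming is on the list."),
]
for submission in mentioned_only_in_body(sample, "Yoa Ming"):
    print(submission.title)
```

With a real client, the first argument would be the result of alls.search("Yoa ming", sort="comments", limit=5) from the question, using the corrected sort value.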
