Sentiment Analysis Stanford NLP in Google Colab

I'm trying to run sentiment analysis with Stanford NLP on Google Colab, but I don't know how to 'redirect' the server, as explained here:
How to Run standford corenlp server on Google Colab?
I don't know much about servers in general. This is what I would like to do in Colab; the code below worked on my computer:
1. I downloaded the Stanford NLP library;
2. The files containing the text I want to process with sentiment analysis are in a folder (what I call "directory_path" (the input) in the code below);
3. I want to generate an XML file (the output) with the answer returned by the server.
import os
from pycorenlp import StanfordCoreNLP

nlp = StanfordCoreNLP('http://localhost:9000')

# below is the directory with the files whose content (text) I want to process with Stanford NLP:
directory_path = "C:\\Users\\User\\Desktop\\ic\\Reviews\\reviews HetRec\\Hetrec_reviews\\Separated Reviews\\Separated Reviews"

for file in os.listdir(directory_path):
    current_file = os.path.join(directory_path, file)
    review_text = ""
    with open(current_file, encoding='utf8') as f:
        res = f.readlines()
        review_text += "".join(res)
    # review_text is a string with the content read from the file.
    res = nlp.annotate(review_text,
                       properties={
                           'annotators': 'sentiment',
                           'outputFormat': 'xml',
                           'timeout': 50000,
                       })
    # file_name is just a generic name for the xml file I want to generate:
    file_name = file[0:-3] + "xml"
    # write the output generated by the server to a file:
    with open(file_name, 'w', encoding="utf-8") as f:
        print(res, file=f)
Any advice would help! My objective is just to generate the output with this code on Google Colab, but the code doesn't work there, and I know nothing about servers and don't have much experience. I couldn't find working examples of sentiment analysis with Stanford NLP in Colab.
Note: as far as I know, the Stanza library doesn't support sentiment analysis. I don't mind which version of the library is used; an older version of Stanford NLP is fine too.
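One common way to get a CoreNLP server running inside a Colab notebook is to download CoreNLP into the Colab VM, start the server as a background process, and only then point pycorenlp at http://localhost:9000. Below is a minimal sketch of that setup; the download URL, the extracted folder name, and the memory setting are assumptions to adjust for the CoreNLP version actually used:

import subprocess, time
from pycorenlp import StanfordCoreNLP

# Download and unzip CoreNLP inside the Colab VM (URL and version are assumptions):
subprocess.run(['wget', '-q', 'https://nlp.stanford.edu/software/stanford-corenlp-latest.zip'])
subprocess.run(['unzip', '-q', '-o', 'stanford-corenlp-latest.zip'])

# Start the server in the background from the unzipped folder
# (check the real folder name after unzipping; it is version dependent):
server = subprocess.Popen(
    ['java', '-mx4g', '-cp', '*', 'edu.stanford.nlp.pipeline.StanfordCoreNLPServer',
     '-port', '9000', '-timeout', '50000'],
    cwd='stanford-corenlp-4.5.7')
time.sleep(30)  # give the server time to load its models

# From here on, the original code should work unchanged:
nlp = StanfordCoreNLP('http://localhost:9000')
print(nlp.annotate('This movie was great!',
                   properties={'annotators': 'sentiment',
                               'outputFormat': 'xml',
                               'timeout': 50000}))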

Related

Why am I getting a String where I should get a dict when using pycorenlp.StanfordCoreNLP.annotate?

I'm running this example using the pycorenlp Stanford CoreNLP Python wrapper, but the annotate function returns a string instead of a dict, so when I iterate over it to get each sentence's sentiment value I get the following error: "string indices must be integers".
What can I do to get past this? Could anyone help me? Thanks in advance.
The code is below:
from pycorenlp import StanfordCoreNLP

nlp_wrapper = StanfordCoreNLP('http://localhost:9000')

doc = "I like this chocolate. This chocolate is not good. The chocolate is delicious. Its a very tasty chocolate. This is so bad"

annot_doc = nlp_wrapper.annotate(doc,
                                 properties={
                                     'annotators': 'sentiment',
                                     'outputFormat': 'json',
                                     'timeout': 100000,
                                 })

for sentence in annot_doc["sentences"]:
    print(" ".join([word["word"] for word in sentence["tokens"]]) + " => "
          + str(sentence["sentimentValue"]) + " = " + sentence["sentiment"])
You should just use the official stanfordnlp package! (note: the name is going to be changed to stanza at some point)
Here are all the details, and you can get various output formats from the server including JSON.
https://stanfordnlp.github.io/stanfordnlp/corenlp_client.html
from stanfordnlp.server import CoreNLPClient

with CoreNLPClient(annotators=['tokenize', 'ssplit', 'pos', 'lemma', 'ner', 'parse', 'depparse', 'coref'],
                   timeout=30000, memory='16G') as client:
    # submit the request to the server
    ann = client.annotate(text)
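For the sentiment use case in the question, here is a sketch of the same client asked for JSON output so the result comes back as a plain dict (the output_format keyword is assumed from the linked client documentation; the 'sentiment' and 'sentimentValue' keys match the JSON shown in the question above):

from stanfordnlp.server import CoreNLPClient

text = "I like this chocolate. This chocolate is not good."

# output_format='json' (assumed option) makes annotate() return a dict
# instead of a serialized protobuf Document:
with CoreNLPClient(annotators=['tokenize', 'ssplit', 'pos', 'parse', 'sentiment'],
                   timeout=30000, memory='4G', output_format='json') as client:
    ann = client.annotate(text)
    for sentence in ann['sentences']:
        print(sentence['sentiment'], sentence['sentimentValue'])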
It would be great if you provided the error stack trace. The likely reason is that the annotator hits its timeout too early and returns an assertion message along the lines of 'the text is too large..', and the dtype of that return value is a plain string. Further, I would like to shed more light on Petr Matuska's comment. Looking at your example, it is clear that your goal is to find the sentiment of each sentence along with its sentiment score.
The sentiment score is not included in the result when using CoreNLPClient. I faced a similar issue, but I found a workaround that fixed it. If the text is large, you must set the timeout value much higher (e.g., timeout = 500000). Also, the annotator returns a dictionary, which consumes a lot of memory; for a larger text corpus this becomes a real problem. So it is up to us how we handle that data structure in the code. There are alternatives, such as using slots, tuples, or named tuples, for faster access.
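A defensive sketch of that workaround for the pycorenlp case: raise the timeout and check the return type before indexing, since annotate() hands back a plain error string (rather than a dict) when the server rejects or times out on the request. The handling below is an illustration, not the library's official error API:

import json
from pycorenlp import StanfordCoreNLP

nlp_wrapper = StanfordCoreNLP('http://localhost:9000')
doc = "I like this chocolate. This chocolate is not good."

annot_doc = nlp_wrapper.annotate(doc,
                                 properties={'annotators': 'sentiment',
                                             'outputFormat': 'json',
                                             'timeout': 500000})

if isinstance(annot_doc, str):
    # Either an error message from the server or JSON that was not decoded.
    try:
        annot_doc = json.loads(annot_doc)
    except ValueError:
        raise RuntimeError("CoreNLP server returned an error: " + annot_doc)

for sentence in annot_doc['sentences']:
    print(sentence['sentiment'], sentence['sentimentValue'])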

How to speedup Stanford NLP in Python?

import numpy as np
from nltk.tag import StanfordNERTagger
from nltk.tokenize import word_tokenize

# english.all.3class.distsim.crf.ser.gz
st = StanfordNERTagger('/media/sf_codebase/modules/stanford-ner-2018-10-16/classifiers/english.all.3class.distsim.crf.ser.gz',
                       '/media/sf_codebase/modules/stanford-ner-2018-10-16/stanford-ner.jar',
                       encoding='utf-8')
After the initialization above, the following code takes about 10 seconds to tag the text shown below. How can I speed this up?
%%time
text="My name is John Doe"
tokenized_text = word_tokenize(text)
classified_text = st.tag(tokenized_text)
print (classified_text)
Output
[('My', 'O'), ('name', 'O'), ('is', 'O'), ('John', 'PERSON'), ('Doe', 'PERSON')]
CPU times: user 4 ms, sys: 20 ms, total: 24 ms
Wall time: 10.9 s
Another solution within NLTK is to not use the old nltk.tag.StanfordNERTagger but instead to use the newer nltk.parse.CoreNLPParser. See, e.g., https://github.com/nltk/nltk/wiki/Stanford-CoreNLP-API-in-NLTK.
More generally the secret to good performance is indeed to use a server on the Java side, which you can repeatedly call without having to start new subprocesses for each sentence processed. You can either use the NERServer if you just need NER or the StanfordCoreNLPServer for all CoreNLP functionality. There are a number of Python interfaces to it, see: https://stanfordnlp.github.io/CoreNLP/other-languages.html#python
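A sketch following the NLTK wiki page linked above, assuming a StanfordCoreNLPServer is already running on localhost:9000:

from nltk.parse.corenlp import CoreNLPParser

# Talks to the running server instead of spawning a new JVM per call:
ner_tagger = CoreNLPParser(url='http://localhost:9000', tagtype='ner')
tokens = 'My name is John Doe'.split()
print(list(ner_tagger.tag(tokens)))
# expected along the lines of:
# [('My', 'O'), ('name', 'O'), ('is', 'O'), ('John', 'PERSON'), ('Doe', 'PERSON')]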
Found the answer.
Start the Stanford NER server in the background, from the folder where Stanford NER is unzipped:
java -Djava.ext.dirs=./lib -cp stanford-ner.jar edu.stanford.nlp.ie.NERServer -port 9199 -loadClassifier ./classifiers/english.all.3class.distsim.crf.ser.gz
Then connect to the server from Python using the sner library:
from sner import Ner
tagger = Ner(host='localhost',port=9199)
Then run the tagger.
%%time
classified_text=tagger.get_entities(text)
print (classified_text)
Output:
[('My', 'O'), ('name', 'O'), ('is', 'O'), ('John', 'PERSON'), ('Doe', 'PERSON')]
CPU times: user 4 ms, sys: 0 ns, total: 4 ms
Wall time: 18.2 ms
Almost 300 times better performance in terms of timing! Wow!
After attempting several options, I like Stanza. It is developed by Stanford, is very simple to implement, didn't require me to figure out how to start the server properly on my own, and it dramatically improved the speed of my program. It supports 18 different named entity classes.
I found Stanza by following the link provided in Christopher Manning's answer.
To download:
pip install stanza
then in Python:
import stanza
stanza.download('en') # download English model
nlp = stanza.Pipeline('en') # initialize English neural pipeline
doc = nlp("My name is John Doe.") # run annotation over a sentence or multiple sentences
If you only want a specific tool (NER), you can specify it with processors:
nlp = stanza.Pipeline('en',processors='tokenize,ner')
For an output similar to that produced by the OP:
classified_text = [(token.text,token.ner) for i, sentence in enumerate(doc.sentences) for token in sentence.tokens]
print(classified_text)
[('My', 'O'), ('name', 'O'), ('is', 'O'), ('John', 'B-PERSON'), ('Doe', 'E-PERSON')]
But to produce a list of only those words that are recognizable entities:
classified_text = [(ent.text,ent.type) for ent in doc.ents]
[('John Doe', 'PERSON')]
It has a couple of features that I really like:
- Instead of each word being classified as a separate person entity, it combines "John Doe" into one PERSON object.
- If you do want each separate word, you can extract those, and it identifies which part of the entity each word is ('B' for the first word, 'I' for intermediate words, and 'E' for the last word).

Can I get an entityMention from the result of a TokensRegex match in Stanford CoreNLP?

I want to add addresses (and possibly other rule-based entities) to an NER pipeline, and TokensRegex seems like a terribly useful DSL for doing so. Following https://stackoverflow.com/a/42604225, I've created this rules file:
ner = { type: "CLASS", value: "edu.stanford.nlp.ling.CoreAnnotations$NamedEntityTagAnnotation" }
{ pattern: ([{ner:"NUMBER"}] [{pos:"NN"}|{pos:"NNP"}] /ave(nue)?|st(reet)?|boulevard|blvd|r(oa)?d/), action: Annotate($0, ner, "address") }
Here's a Scala REPL session showing how I'm trying to set up an annotation pipeline.
# import edu.stanford.nlp.pipeline.{StanfordCoreNLP, CoreDocument}
# import edu.stanford.nlp.util.PropertiesUtils.asProperties
# val pipe = new StanfordCoreNLP(asProperties(
    "customAnnotatorClass.tokensregex", "edu.stanford.nlp.pipeline.TokensRegexAnnotator",
    "annotators", "tokenize,ssplit,pos,lemma,ner,tokensregex",
    "ner.combinationMode", "HIGH_RECALL",
    "tokensregex.rules", "addresses.tregx"))
pipe: StanfordCoreNLP = edu.stanford.nlp.pipeline.StanfordCoreNLP#2ce6a051
# val doc = new CoreDocument("Adam Smith lived at 123 noun street in Glasgow, Scotland")
doc: CoreDocument = Adam Smith lived at 123 noun street in Glasgow, Scotland
# pipe.annotate(doc)
# doc.sentences.get(0).nerTags
res5: java.util.List[String] = [PERSON, PERSON, O, O, address, address, address, O, CITY, O, COUNTRY]
# doc.entityMentions
res6: java.util.List[edu.stanford.nlp.pipeline.CoreEntityMention] = [Adam Smith, 123, Glasgow, Scotland]
As you can see, the address gets correctly tagged in the nerTags for the sentence, but it doesn't show up in the document's entityMentions. Is there a way to do this?
Also, is there a way, from the document, to discern two adjacent matches of the tokensregex from a single match (assuming I have a more complicated set of regexes; in the current example I only match exactly 3 tokens, so I could just count tokens)?
I tried approaching it using the regexner with a tokens regex described here https://stanfordnlp.github.io/CoreNLP/regexner.html, but I couldn't seem to get that working.
Since I'm working in scala I'll be happy to dive into the Java API to get this to work, rather than fiddle with properties and resource files, if that's necessary.
Yes, I've recently added some changes (in the GitHub version) to make this easier! Make sure to download the latest version from GitHub. Though we are aiming to release Stanford CoreNLP 3.9.2 fairly soon and it will have these changes.
If you read this page you can get an understanding of the full NER pipeline run by the NERCombinerAnnotator.
https://stanfordnlp.github.io/CoreNLP/ner.html
Furthermore, there is a lot written up about TokensRegex here:
https://stanfordnlp.github.io/CoreNLP/tokensregex.html
Basically what you want to do is run the ner annotator and use its TokensRegex sub-annotator. Imagine you have some named entity rules in a file called my_ner.rules.
You could run a command like this:
java -Xmx5g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,lemma,ner -ner.additional.tokensregex.rules my_ner.rules -outputFormat text -file example.txt
This will run a TokensRegex sub-annotator during the full named entity recognition process. Then, when the final entity mentions step runs, it will operate on the rule-extracted named entities and create entity mentions from them.
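If you prefer a client over the command line, the same property can presumably be passed through the stanfordnlp CoreNLPClient shown in an earlier answer. A hedged sketch: the property name comes from the command above, while the properties keyword and the 'entitymentions' JSON key are assumptions about the client and server output for your installed version:

from stanfordnlp.server import CoreNLPClient

props = {'ner.additional.tokensregex.rules': 'my_ner.rules'}

with CoreNLPClient(annotators=['tokenize', 'ssplit', 'pos', 'lemma', 'ner'],
                   properties=props, timeout=30000, memory='5G',
                   output_format='json') as client:
    ann = client.annotate('Adam Smith lived at 123 noun street in Glasgow, Scotland')
    # Entity mentions per sentence, which should now include the rule-based matches:
    for mention in ann['sentences'][0]['entitymentions']:
        print(mention['text'], mention['ner'])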

Stanford CoreNLP: output in CONLL format from Java

I want to parse some German text with Stanford CoreNLP and obtain a CONLL output, so that I can pass the latter to CorZu for coreference resolution.
How can I do that programmatically?
Here is my code so far (which only outputs dependency trees):
Annotation germanAnnotation = new Annotation("Gestern habe ich eine blonde Frau getroffen");
Properties germanProperties = StringUtils.argsToProperties("-props", "StanfordCoreNLP-german.properties");
StanfordCoreNLP pipeline = new StanfordCoreNLP(germanProperties);
pipeline.annotate(germanAnnotation);

StringBuilder trees = new StringBuilder("");
for (CoreMap sentence : germanAnnotation.get(CoreAnnotations.SentencesAnnotation.class)) {
    Tree sentenceTree = sentence.get(TreeCoreAnnotations.TreeAnnotation.class);
    trees.append(sentenceTree).append("\n");
}
With the following code I managed to save the parsing output in CONLL format.
OutputStream outputStream = new FileOutputStream(new File("./target/", OUTPUT_FILE_NAME));
CoNLLOutputter.conllPrint(germanAnnotation, outputStream, pipeline);
However, the HEAD field was 0 for all words. I am not sure whether the problem is in the parsing or only in the CoNLLOutputter. Honestly, I was too annoyed by CoreNLP to investigate further.
I decided to use ParZu instead, and I suggest you do the same. ParZu and CorZu are made to work together seamlessly, and they do.
In my case, I had text that was already tokenized and POS-tagged. This makes things easier, since you will not need:
- a POS tagger using the STTS tagset
- a tool for morphological analysis
Once you have ParZu and CorZu installed, you will only need to run corzu.sh (included in the CorZu download folder). If your text is tokenized and POS tagged, you can edit the script accordingly:
parzu_cmd="/YourPath/ParZu/parzu -i tagged"
Last note: make sure to convert your tagged text to the following format, with empty lines signifying sentence boundaries: word [tab] tag [newline]
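A small sketch of that conversion step (the tags are illustrative STTS tags and the file name is made up):

# Write one "word<TAB>tag" pair per line, with an empty line between sentences:
tagged_sentences = [
    [('Gestern', 'ADV'), ('habe', 'VAFIN'), ('ich', 'PPER')],
    [('Eine', 'ART'), ('Frau', 'NN')],
]

with open('tagged_input.txt', 'w', encoding='utf-8') as out:
    for sentence in tagged_sentences:
        for word, tag in sentence:
            out.write(word + '\t' + tag + '\n')
        out.write('\n')  # empty line marks the sentence boundary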

Separate YAML and plain text on the same document

While building a blog using Django, I realized that it would be extremely practical to store the text of an article and all the related information (title, author, etc.) together in a human-readable file format, and then load those files into the database using a simple script.
That said, YAML caught my attention for its readability and ease of use; the only downside of the YAML syntax is the indentation:
---
title: Title of the article
author: Somebody
# Other stuffs here ...
text: |
  This is the text of the article. I can write whatever I want
  but I need to be careful with the indentation...and this is a
  bit boring.
---
I believe that's not the best solution (especially if the files are going to be written by casual users). A format like this one would be much better:
---
title: Title of the article
author: Somebody
# Other stuffs here ...
---
Here there is the text of the article, it is not valid YAML but
just plain text. Here I could put **Markdown** or <html>...or whatever
I want...
Is there any solution, preferably using Python? Suggestions for other file formats are welcome as well!
Unfortunately this is not possible. What one would think could work is using | for a single scalar in the separate document:
import ruamel.yaml
yaml_str = """\
title: Title of the article
author: Somebody
---
|
Here there is the text of the article, it is not valid YAML but
just plain text. Here I could put **Markdown** or <html>...or whatever
I want...
"""
for d in ruamel.yaml.load_all(yaml_str):
    print(d)
    print('-----')
but it doesn't work, because | is the block indentation indicator. And although at the top level an indentation of 0 (zero) would easily work, ruamel.yaml (and PyYAML) don't allow it.
It is, however, easy to parse this yourself, which has the advantage over using the frontmatter package that you can use YAML 1.2 and are not restricted to YAML 1.1 (frontmatter uses PyYAML). Also note that I used the more appropriate end-of-document marker ... to separate the YAML from the Markdown:
import ruamel.yaml

combined_str = """\
title: Title of the article
author: Somebody
...
Here there is the text of the article, it is not valid YAML but
just plain text. Here I could put **Markdown** or <html>...or whatever
I want...
"""

with open('test.yaml', 'w') as fp:
    fp.write(combined_str)

data = None
lines = []
yaml_str = ""
with open('test.yaml') as fp:
    for line in fp:
        if data is not None:
            lines.append(line)
            continue
        if line == '...\n':
            data = ruamel.yaml.round_trip_load(yaml_str)
            continue
        yaml_str += line

print(data['author'])
print(lines[2])
which gives:
Somebody
I want...
(the round_trip_load allows dumping with preservation of comments, anchor names etc).
I found that Front Matter does exactly what I want. There is also a Python package.
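A minimal sketch using the python-frontmatter package (install with pip install python-frontmatter; the file name is made up):

import frontmatter

# article.md starts with a --- delimited YAML header followed by plain text/Markdown:
post = frontmatter.load('article.md')
print(post['title'])   # metadata parsed as YAML
print(post['author'])
print(post.content)    # the untouched article body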
