StanfordCoreNLPClient doesn't work as expected on sentiment analysis - stanford-nlp

Stanford CoreNLP version 3.9.1
I have a problem getting StanfordCoreNLPClient to work the same way as StanfordCoreNLP when doing sentiment analysis.
import java.util.Properties;

import edu.stanford.nlp.neural.rnn.RNNCoreAnnotations;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.CoreDocument;
import edu.stanford.nlp.pipeline.CoreSentence;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.pipeline.StanfordCoreNLPClient;

public class Test {
    public static void main(String[] args) {
        String text = "This server doesn't work!";
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize, ssplit, pos, lemma, ner, parse, sentiment");
        // If I uncomment this line, and comment out the next one, it works
        //StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
        StanfordCoreNLPClient pipeline = new StanfordCoreNLPClient(props, "http://localhost", 9000, 2);
        Annotation annotation = new Annotation(text);
        pipeline.annotate(annotation);
        CoreDocument document = new CoreDocument(annotation);
        CoreSentence sentence = document.sentences().get(0);
        // Outputs null when using StanfordCoreNLPClient
        System.out.println(RNNCoreAnnotations.getPredictions(sentence.sentimentTree()));
        // Throws a NullPointerException when using StanfordCoreNLPClient
        // (the reason, I assume, is that it uses the same method I called above)
        System.out.println(RNNCoreAnnotations.getPredictionsAsStringList(sentence.sentimentTree()));
    }
}
Output using StanfordCoreNLPClient pipeline = new StanfordCoreNLPClient(props, "http://localhost", 9000, 2):
null
Exception in thread "main" java.lang.NullPointerException
at edu.stanford.nlp.neural.rnn.RNNCoreAnnotations.getPredictionsAsStringList(RNNCoreAnnotations.java:68)
at tomkri.mastersentimentanalysis.preprocessing.Test.main(Test.java:35)
Output using StanfordCoreNLP pipeline = new StanfordCoreNLP(props):
Type = dense , numRows = 5 , numCols = 1
0.127
0.599
0.221
0.038
0.015
[0.12680336652661395, 0.5988695516384742, 0.22125584263055106, 0.03843574738131668, 0.014635491823044227]
Annotations other than sentiment work in both cases (at least those I have tried).
The server starts fine, and I am able to use it from my web browser. When using it there I also get sentiment scores (for each subtree in the parse) in JSON format.

My solution, in case anyone else needs it.
I tried to get the required annotation by making an HTTP request to the server with a JSON response:
HttpResponse<JsonNode> jsonResponse = Unirest.post("http://localhost:9000")
        .queryString("properties", "{\"annotators\":\"tokenize, ssplit, pos, lemma, ner, parse, sentiment\",\"outputFormat\":\"json\"}")
        .body(text)
        .asJson();
String sentTreeStr = jsonResponse.getBody().getObject()
        .getJSONArray("sentences").getJSONObject(0).getString("sentimentTree");
System.out.println(sentTreeStr); // prints out sentiment values for the tree and all subtrees
But not all annotation data is available. For example, you don't get the probability distribution over all the possible sentiment values, only the probability of the most likely sentiment (the one with the highest probability).
If you need that, this is a solution:
try {
    HttpResponse<InputStream> inStream = Unirest.post("http://localhost:9000")
            .queryString(
                    "properties",
                    "{\"annotators\":\"tokenize, ssplit, pos, lemma, ner, parse, sentiment\","
                    + "\"outputFormat\":\"serialized\","
                    + "\"serializer\": \"edu.stanford.nlp.pipeline.GenericAnnotationSerializer\"}"
            )
            .body(text)
            .asBinary();
    GenericAnnotationSerializer serializer = new GenericAnnotationSerializer();
    ObjectInputStream in = new ObjectInputStream(inStream.getBody());
    Pair<Annotation, InputStream> deserialized = serializer.read(in);
    Annotation annotation = deserialized.first();
    // And now we are back to a state as if we were not running CoreNLP as a server.
    CoreDocument doc = new CoreDocument(annotation);
    CoreSentence sentence = doc.sentences().get(0);
    // Prints out the same output as shown in the question
    System.out.println(RNNCoreAnnotations.getPredictions(sentence.sentimentTree()));
} catch (UnirestException | IOException | ClassNotFoundException ex) {
    Logger.getLogger(SentimentTargetExtractor.class.getName()).log(Level.SEVERE, null, ex);
}

Related

Lemmatization not working on words starting with capital letters

I am working on a project that uses StanfordNLP. One of the functions in the project is to extract all nouns from a piece of text and lemmatize each noun. I am extracting the nouns using the code below:
Properties props = new Properties();
props.setProperty("annotators", "tokenize, ssplit, pos, lemma, parse, natlog, openie");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
Annotation document = new Annotation(text);
pipeline.annotate(document);
List<CoreMap> sentences = document.get(SentencesAnnotation.class);
for (CoreMap sentence : sentences) {
    SemanticGraph dependencies = sentence.get(BasicDependenciesAnnotation.class);
    List<String> Nouns = Extractnouns(dependencies.typedDependencies(), sentence);
}
private List<String> Extractnouns(Collection<TypedDependency> tdl, CoreMap sentence) {
    List<String> concepts = new ArrayList<String>();
    for (TypedDependency td : tdl) {
        String govlemma = td.gov().lemma();
        String deplemma = td.dep().lemma();
        String deptag = td.dep().tag();
        String govtag = td.gov().tag();
        if (deptag != null && deptag.contains("NN")) {
            concepts.add(deplemma);
        }
        if (govtag != null && govtag.contains("NN")) {
            concepts.add(govlemma);
        }
    }
    return concepts;
}
It is working as expected, but for some words the lemmatization is not working. I observed that some of the nouns that appear as the first word in a sentence have this problem. Example: "Protons and electrons both carry an electrical charge." Here the word "Protons" is not converted to "proton" when the lemma is applied. The same happens with some other nouns too.
Could you please tell me a solution for this problem?
Unfortunately this is a part-of-speech tagging error. "Protons" gets labelled "NNP" rather than "NNS", so lemmatization isn't performed on it.
You could try running on a lower-cased version of the text; I note that in that case it does the right thing.
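A minimal sketch of that workaround, using the pipeline from the question and simply lower-casing the input before building the Annotation (whether lower-cased text is acceptable depends on the rest of your processing):
// Lower-case the input so sentence-initial plurals like "Protons" are less likely to be
// tagged NNP, which is what blocks lemmatization to "proton"
String text = "Protons and electrons both carry an electrical charge.";
Annotation document = new Annotation(text.toLowerCase());
pipeline.annotate(document);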

Reload CRF NER model of StanfordCoreNLP pipeline

I am making a web application (GUI) for building a CRF NER model instead of manually annotating CSV files. When the user collects a number of training files, he should be able to generate a new model and try it.
The issue I have is with reloading the model. When I assign a new value to pipeline, like
pipeline = new StanfordCoreNLP(props)
the model stays the same. I tried to clear the annotator pool with
StanfordCoreNLP.clearAnnotatorPool()
but nothing changes. Is this possible at all, or do I have to restart my whole application every time to get this to work?
EDIT (Clarification):
I have 2 methods in same class: nerString() and train(). Something like this:
class NerService {
  private var pipeline: StanfordCoreNLP = null
  loadPipelines()

  private def loadPipelines(): Unit = {
    val props = new Properties()
    props.setProperty("tokenize.class", "BosnianTokenizer")
    props.setProperty("ner.model", "conf/NER/classifiers/ner-ba-model.ser.gz") // NER CRF model
    props.setProperty("ner.useSUTime", "false")
    props.setProperty("annotators", "tokenize, ssplit, pos, lemma, ner")
    pipeline = new StanfordCoreNLP(props)
  }

  def nerString(tekst: String): List[TokenNER] = {
    val document = new Annotation(tekst)
    pipeline.annotate(document)
    ...
  }

  /////////////// train new NER model ///////////////////////
  private val trainProps = StringUtils.propFileToProperties("conf/NER/classifiers/ner-ba-training.prop")
  private val serializeTo = "conf/NER/classifiers/ner-ba-model.ser.gz" // save at location...
  private val inputDir = new File("conf/NER/classifiers/input")
  private val fileFilter = new WildcardFileFilter("*.tsv")
  private val dirFilter = TrueFileFilter.TRUE

  def train(): Unit = {
    val allFiles = FileUtils.listFiles(inputDir, fileFilter, dirFilter).asScala
    val trainFileList = allFiles.map(_.getPath).mkString(",")
    trainProps.setProperty("trainFileList", trainFileList)
    val flags = new SeqClassifierFlags(trainProps)
    val crf = new CRFClassifier[CoreLabel](flags)
    crf.train()
    crf.serializeClassifier(serializeTo)
    loadPipelines()
  }
}
loadPipelines() is used to re-assign the pipeline when a new NER model has been created.
How do I know that the model isn't updated? I have a text that I check manually, and I can see the difference with and without it.

TokensRegex: Tokens are null after retokenization

I'm experimenting with Stanford NLP's TokensRegex and am trying to find dimensions (e.g. 100x120) in a text. My plan is to first retokenize the input to further split these tokens (using the example provided in retokenize.rules.txt) and then to search for the new pattern.
After the retokenization, however, only null values are left in place of the original string:
The top level annotation
[Text=100x120 Tokens=[null-1, null-2, null-3] Sentences=[100x120]]
The retokenization seems to work fine (3 tokens in the result), but the values are lost. What can I do to keep the original values in the token list?
My retokenize.rules.txt file is (as in the demo):
tokens = { type: "CLASS", value:"edu.stanford.nlp.ling.CoreAnnotations$TokensAnnotation" }
options.matchedExpressionsAnnotationKey = tokens;
options.extractWithTokens = TRUE;
options.flatten = TRUE;
ENV.defaults["ruleType"] = "tokens"
ENV.defaultStringPatternFlags = 2
ENV.defaultResultAnnotationKey = tokens
{ pattern: ( /\d+(x|X)\d+/ ), result: Split($0[0], /x|X/, TRUE) }
The main method:
public static void main(String[] args) throws IOException {
    //...
    text = "100x120";
    Properties properties = new Properties();
    properties.setProperty("tokenize.language", "de");
    properties.setProperty("annotators", "tokenize,retokenize,ssplit,pos,lemma,ner");
    properties.setProperty("customAnnotatorClass.retokenize", "edu.stanford.nlp.pipeline.TokensRegexAnnotator");
    properties.setProperty("retokenize.rules", "retokenize.rules.txt");
    StanfordCoreNLP stanfordPipeline = new StanfordCoreNLP(properties);
    runPipeline(stanfordPipeline, text);
}
And the pipeline:
public static void runPipeline(StanfordCoreNLP pipeline, String text) {
    Annotation annotation = new Annotation(text);
    pipeline.annotate(annotation);
    out.println();
    out.println("The top level annotation");
    out.println(annotation.toShorterString());
    //...
}
Thanks for letting us know. The CoreAnnotations.ValueAnnotation is not being populated, and we'll update TokensRegex to populate the field.
Regardless, you should be able to use TokensRegex to retokenize as you have planned. Most of the pipeline does not depend on the ValueAnnotation and uses the CoreAnnotations.TextAnnotation instead. You can use the CoreAnnotations.TextAnnotation to get the text for the new tokens (each token is a CoreLabel, so you can also access it using token.word()).
See TokensRegexRetokenizeDemo for example code on how to get the different annotations out.
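For instance, a minimal sketch (using the annotation produced in runPipeline above) that reads the retokenized token text via TextAnnotation / word() rather than ValueAnnotation:
import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreLabel;

// Print each retokenized token; word() and TextAnnotation still carry the text
// even when ValueAnnotation is null
for (CoreLabel token : annotation.get(CoreAnnotations.TokensAnnotation.class)) {
    System.out.println(token.word() + " / " + token.get(CoreAnnotations.TextAnnotation.class));
}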

How to iterate through Elasticsearch source using Apache Spark?

I am trying to build a recommendation system by integrating Elasticsearch with Apache Spark. I am using Java. I am using the MovieLens dataset as example data, and I have already indexed it into Elasticsearch. So far, I have been able to read the input from the Elasticsearch index as follows:
SparkConf conf = new SparkConf().setAppName("Example App").setMaster("local");
conf.set("spark.serializer", org.apache.spark.serializer.KryoSerializer.class.getName());
conf.set("es.nodes", "localhost");
conf.set("es.port", "9200");
JavaSparkContext sc = new JavaSparkContext(conf);
JavaPairRDD<String, Map<String, Object>> esRDD = JavaEsSpark.esRDD(sc, "movielens/recommendation");
Using esRDD.collect(), I can see that I am retrieving the data from Elasticsearch correctly. Now I need to feed the user id, item id and preference from the Elasticsearch result to Spark's recommendation module. If I were using a csv file, I would be able to do it as follows:
String path = "resources/user_data.data";
JavaRDD<String> data = sc.textFile(path);
JavaRDD<Rating> ratings = data.map(
    new Function<String, Rating>() {
        public Rating call(String s) {
            String[] sarray = s.split(" ");
            return new Rating(Integer.parseInt(sarray[0]), Integer.parseInt(sarray[1]),
                    Double.parseDouble(sarray[2]));
        }
    }
);
What could be an equivalent mapping if I need to iterate through the Elasticsearch output stored in esRDD and create a similar map as above? If there is any example code that I could refer to, that would be of great help.
Apologies for not answering the Spark question directly, but in case you missed it, there is a description of doing recommendations on MovieLens data using Elasticsearch here: http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/_significant_terms_demo.html
You have not specified the format of the data in Elasticsearch, but let's assume it has fields userId, movieId and rating, so an example document looks something like {"userId":1,"movieId":1,"rating":4}.
Then you should be able to do (ignoring null checks etc.):
JavaRDD<Rating> ratings = esRDD.values().map(
    new Function<Map<String, Object>, Rating>() {
        public Rating call(Map<String, Object> m) {
            int userId = Integer.parseInt(m.get("userId").toString());
            int movieId = Integer.parseInt(m.get("movieId").toString());
            double rating = Double.parseDouble(m.get("rating").toString());
            return new Rating(userId, movieId, rating);
        }
    }
);
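If the goal is then to feed those ratings into Spark MLlib's ALS recommender, a minimal sketch of the next step could look like the following (the rank, iteration count and regularization values are arbitrary placeholders, not tuned settings):
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.mllib.recommendation.ALS;
import org.apache.spark.mllib.recommendation.MatrixFactorizationModel;
import org.apache.spark.mllib.recommendation.Rating;

// Train a collaborative-filtering model on the ratings built from the Elasticsearch RDD
int rank = 10;           // number of latent factors (placeholder)
int numIterations = 10;  // ALS iterations (placeholder)
double lambda = 0.01;    // regularization (placeholder)
MatrixFactorizationModel model = ALS.train(JavaRDD.toRDD(ratings), rank, numIterations, lambda);

// Example: predict the rating user 1 would give movie 1
double predicted = model.predict(1, 1);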

How to update multiple fields using the Java API and an Elasticsearch script

I am trying to update multiple values in an index using the Java API through an Elasticsearch script, but I am not able to update the fields.
Sample code:
1:
UpdateResponse response = request.setScript("ctx._source").setScriptParams(scriptParams).execute().actionGet();
2:
UpdateResponse response = request.setScript("ctx._source.").setScriptParams(scriptParams).execute().actionGet();
If I include the dot ("ctx._source."), I get an IllegalArgumentException; if I leave it out, I get no exception, but the values are not updated in the index.
Can anyone tell me how to resolve this?
First of all, your script (ctx._source) doesn't do anything, as one of the commenters already pointed out. If you want to update, say, field "a", then you would need a script like:
ctx._source.a = "foobar"
This would assign the string "foobar" to field "a". You can do more than simple assignment, though. Check out the docs for more details and examples:
http://www.elasticsearch.org/guide/reference/api/update/
Updating multiple fields with one script is also possible. You can use semicolons to separate different MVEL instructions. E.g.:
ctx._source.a = "foo"; ctx._source.b = "bar"
Elasticsearch has an update Java API. Look at the following code:
client.prepareUpdate("index", "typw", "1153")
        .addScriptParam("assignee", assign)
        .addScriptParam("newobject", responsearray)
        .setScript("ctx._source.assignee=assignee;ctx._source.responsearray=newobject")
        .execute().actionGet();
Here, the assign variable contains an object value and the responsearray variable contains a list of data.
You can do the same using the Spring Data Elasticsearch client with the following code. I am also listing the imports used in the code.
import org.elasticsearch.action.update.UpdateRequest;
import org.elasticsearch.index.query.QueryBuilder;
import org.springframework.data.elasticsearch.core.query.UpdateQuery;
import org.springframework.data.elasticsearch.core.query.UpdateQueryBuilder;

private UpdateQuery updateExistingDocument(String Id) {
    // Add UpdatedDateTime, CreatedDateTime, CreatedBy, UpdatedBy fields to existing documents in Elasticsearch
    UpdateRequest updateRequest = new UpdateRequest().doc("UpdatedDateTime", new Date(), "CreatedDateTime", new Date(), "CreatedBy", "admin", "UpdatedBy", "admin");
    // Create updateQuery
    UpdateQuery updateQuery = new UpdateQueryBuilder().withId(Id).withClass(ElasticSearchDocument.class).build();
    updateQuery.setUpdateRequest(updateRequest);
    // Execute update
    elasticsearchTemplate.update(updateQuery);
    return updateQuery;
}
XContentType contentType = org.elasticsearch.client.Requests.INDEX_CONTENT_TYPE;

public XContentBuilder getBuilder(User assign) {
    try {
        XContentBuilder builder = XContentFactory.contentBuilder(contentType);
        builder.startObject();
        Map<String, ?> assignMap = objectMap.convertValue(assign, Map.class);
        builder.field("assignee", assignMap);
        builder.endObject();
        return builder;
    } catch (IOException e) {
        log.error("custom field index", e);
    }
    return null;
}

IndexRequest indexRequest = new IndexRequest();
indexRequest.source(getBuilder(assign));
UpdateQuery updateQuery = new UpdateQueryBuilder()
        .withType(<IndexType>)
        .withIndexName(<IndexName>)
        .withId(String.valueOf(id))
        .withClass(<IndexClass>)
        .withIndexRequest(indexRequest)
        .build();
