Nouns with hyphens or slashes in CoreNLP - stanford-nlp

The code below works well for plain nouns, but in my text a single noun can contain hyphens or slashes. What should I do to accommodate that?
For instance, "c/h" is a valid abbreviation, as in "c/h was born in Paris".
But I get
(c,NN)
(/,HYPH)
(h,NN)
when I would like (NNP) or at least (NP).
Here's the code I've run:
import edu.stanford.nlp.ling._
import edu.stanford.nlp.pipeline._
import scala.collection.JavaConverters._
import java.util._

val text = "c/h was born in Paris"
val props = new Properties()
// set the list of annotators to run
props.setProperty("annotators", "tokenize,pos")
// build the pipeline
val pipeline = new StanfordCoreNLP(props)
// create a document object
val document = pipeline.processToCoreDocument(text)
// display the tokens
document.tokens().asScala.foreach { tok =>
  println(tok.word(), tok.tag())
}
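One possible workaround, not from the original thread: recent CoreNLP releases expose PTBTokenizer options that suppress splitting on hyphens and forward slashes, and older versions can fall back to whitespace-only tokenization so that "c/h" survives as a single token. A minimal Java sketch, assuming a CoreNLP version where these tokenizer options exist:

import edu.stanford.nlp.pipeline.CoreDocument;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import java.util.Properties;

public class KeepSlashedTokens {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.setProperty("annotators", "tokenize,pos");
    // Option 1: ask the PTB tokenizer not to split hyphenated/slashed words.
    // (These options exist in recent CoreNLP releases; check your version.)
    props.setProperty("tokenize.options", "splitHyphenated=false,splitForwardSlash=false");
    // Option 2 (coarser): tokenize on whitespace only, so "c/h" stays one token.
    // props.setProperty("tokenize.whitespace", "true");
    StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
    CoreDocument doc = pipeline.processToCoreDocument("c/h was born in Paris");
    doc.tokens().forEach(tok -> System.out.println(tok.word() + " " + tok.tag()));
  }
}

Note that even when "c/h" is kept as a single token, the tagger may still label it NN rather than NNP; the tag depends on the POS model.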

Related

Natural language logic in Stanford CoreNLP

How does one use the natural logic component of Stanford CoreNLP?
I am using CoreNLP 3.9.1 and I passed natlog as an annotator on the command line, but I don't see any natlog results in the output, i.e. OperatorAnnotation and PolarityAnnotation, as described at this link. Does that have anything to do with the outputFormat? I've tried xml and json, but neither includes any natural-logic output. The other annotations (tokenization, dependency parse) are there, though.
Here is my command:
./corenlp.sh -annotators tokenize,ssplit,pos,lemma,depparse,natlog -file natlog.test -outputFormat xml
Thanks in advance.
I don't think any of the output options show the natlog annotations. They are designed more for the case where you have a Java system and are working with the Annotations themselves in Java code. You should be able to see them by looking at the CoreLabel for each token.
This code snippet works for me:
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.util.CoreMap;
import edu.stanford.nlp.ling.CoreAnnotations.NamedEntityTagAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.PartOfSpeechAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.SentencesAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.TextAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.TokensAnnotation;
import edu.stanford.nlp.pipeline.Annotation;
// this is the polarity annotation!
import edu.stanford.nlp.naturalli.NaturalLogicAnnotations.PolarityDirectionAnnotation;
// not the one below!
// import edu.stanford.nlp.ling.CoreAnnotations.PolarityAnnotation;
import edu.stanford.nlp.util.PropertiesUtils;
import java.io.*;
import java.util.*;
public class test {

  public static void main(String[] args) throws FileNotFoundException, UnsupportedEncodingException {
    // code from: https://stanfordnlp.github.io/CoreNLP/api.html#generating-annotations
    StanfordCoreNLP pipeline = new StanfordCoreNLP(
        PropertiesUtils.asProperties(
            // **add natlog here**
            "annotators", "tokenize,ssplit,pos,lemma,parse,depparse,natlog",
            "ssplit.eolonly", "true",
            "tokenize.language", "en"));

    // read some text into the text variable
    String text = "Every dog sees some cat";
    Annotation document = new Annotation(text);

    // run all Annotators on this text
    pipeline.annotate(document);

    // these are all the sentences in this document
    // a CoreMap is essentially a Map that uses class objects as keys and has values with custom types
    List<CoreMap> sentences = document.get(SentencesAnnotation.class);
    for (CoreMap sentence : sentences) {
      // traversing the words in the current sentence
      // a CoreLabel is a CoreMap with additional token-specific methods
      for (CoreLabel token : sentence.get(TokensAnnotation.class)) {
        // this is the text of the token
        String word = token.get(TextAnnotation.class);
        // this is the POS tag of the token
        String pos = token.get(PartOfSpeechAnnotation.class);
        // this is the NER label of the token (null here, since ner is not in the annotator list)
        String ne = token.get(NamedEntityTagAnnotation.class);
        // this is the polarity label of the token
        String pol = token.get(PolarityDirectionAnnotation.class);
        System.out.print(word + " [" + pol + "] ");
      }
      System.out.println();
    }
  }
}
The output will be:
Every [up] dog [down] sees [up] some [up] cat [up]
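The question also asked about OperatorAnnotation. As a hedged sketch (assuming the NaturalLogicAnnotations.OperatorAnnotation key, which holds an OperatorSpec on tokens that trigger an operator, such as "Every" and "some"), one might extend the inner token loop above like this:

import edu.stanford.nlp.naturalli.NaturalLogicAnnotations.OperatorAnnotation;
import edu.stanford.nlp.naturalli.OperatorSpec;

// inside the token loop of the snippet above:
OperatorSpec op = token.get(OperatorAnnotation.class);
if (op != null) {
  // op.instance is the recognized operator (e.g. a quantifier)
  System.out.println(token.word() + " -> operator: " + op.instance);
}

Tokens that are not part of any operator simply have no OperatorAnnotation, hence the null check.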

Lemmatization not working on words starting with capital letters

I am working on a project that uses Stanford CoreNLP. One of the functions in the project is to extract all nouns from a piece of text and lemmatize each noun. I am extracting the nouns with the code below:
Properties props = new Properties();
props.setProperty("annotators", "tokenize, ssplit, pos, lemma, parse, natlog, openie");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
Annotation document = new Annotation(text);
pipeline.annotate(document);
List<CoreMap> sentences = document.get(SentencesAnnotation.class);
for (CoreMap sentence : sentences) {
  SemanticGraph dependencies = sentence.get(BasicDependenciesAnnotation.class);
  List<String> Nouns = Extractnouns(dependencies.typedDependencies(), sentence);
}

private List<String> Extractnouns(Collection<TypedDependency> tdl, CoreMap sentence) {
  List<String> concepts = new ArrayList<String>();
  for (TypedDependency td : tdl) {
    String govlemma = td.gov().lemma();
    String deplemma = td.dep().lemma();
    String deptag = td.dep().tag();
    String govtag = td.gov().tag();
    if (deptag != null && deptag.contains("NN")) {
      concepts.add(deplemma);
    }
    if (govtag != null && govtag.contains("NN")) {
      concepts.add(govlemma);
    }
  }
  return concepts;
}
It is working as expected, but for some words the lemmatization is not working. I observed that some of the nouns that come as the first word in a sentence have this problem. Example: "Protons and electrons both carry an electrical charge." Here the word "Protons" is not converted to "proton" by the lemmatizer. The same happens with some other nouns too.
Could you please tell me a solution for this problem?
Unfortunately this is a part-of-speech tagging error: "Protons" gets labelled "NNP" rather than "NNS", so lemmatization isn't applied to it.
You could try running on a lower-cased version of the text; in that case it does the right thing.
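A minimal sketch of that workaround: lower-case the text before annotation (or, alternatively, point the tagger at one of the caseless English models Stanford distributes, if they are on your classpath; the model path in the comment is an assumption and may differ by version):

import edu.stanford.nlp.pipeline.CoreDocument;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import java.util.Properties;

public class LowercaseLemmas {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.setProperty("annotators", "tokenize,ssplit,pos,lemma");
    // Alternative (if the caseless models are available):
    // props.setProperty("pos.model",
    //     "edu/stanford/nlp/models/pos-tagger/english-caseless-left3words-distsim.tagger");
    StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

    String text = "Protons and electrons both carry an electrical charge.";
    // lower-casing avoids the NNP mis-tag of sentence-initial "Protons"
    CoreDocument doc = new CoreDocument(text.toLowerCase());
    pipeline.annotate(doc);
    doc.tokens().forEach(tok ->
        System.out.println(tok.word() + " -> " + tok.lemma()));
  }
}

The trade-off is that you lose the original casing, so you may want to lemmatize a lower-cased copy while keeping the original tokens for display.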

StanfordCoreNLPClient doesn't work as expected on sentiment analysis

Stanford CoreNLP version 3.9.1
I have a problem getting StanfordCoreNLPClient to work the same way as StanfordCoreNLP when doing sentiment analysis.
public class Test {

  public static void main(String[] args) {
    String text = "This server doesn't work!";
    Properties props = new Properties();
    props.setProperty("annotators", "tokenize, ssplit, pos, lemma, ner, parse, sentiment");
    // If I uncomment this line, and comment out the next one, it works
    //StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
    StanfordCoreNLPClient pipeline = new StanfordCoreNLPClient(props, "http://localhost", 9000, 2);
    Annotation annotation = new Annotation(text);
    pipeline.annotate(annotation);
    CoreDocument document = new CoreDocument(annotation);
    CoreSentence sentence = document.sentences().get(0);
    // outputs null when using StanfordCoreNLPClient
    System.out.println(RNNCoreAnnotations.getPredictions(sentence.sentimentTree()));
    // throws a null pointer when using StanfordCoreNLPClient (the reason, of course,
    // is that it uses the same method I called above, I assume)
    System.out.println(RNNCoreAnnotations.getPredictionsAsStringList(sentence.sentimentTree()));
  }
}
Output using StanfordCoreNLPClient pipeline = new StanfordCoreNLPClient(props, "http://localhost", 9000, 2):
null
Exception in thread "main" java.lang.NullPointerException
at edu.stanford.nlp.neural.rnn.RNNCoreAnnotations.getPredictionsAsStringList(RNNCoreAnnotations.java:68)
at tomkri.mastersentimentanalysis.preprocessing.Test.main(Test.java:35)
Output using StanfordCoreNLP pipeline = new StanfordCoreNLP(props):
Type = dense , numRows = 5 , numCols = 1
0.127
0.599
0.221
0.038
0.015
[0.12680336652661395, 0.5988695516384742, 0.22125584263055106, 0.03843574738131668, 0.014635491823044227]
Annotations other than sentiment work in both cases (at least those I have tried).
The server starts fine, and I am able to use it from my web browser. When using it there, I also get sentiment scores (on each subtree of the parse) in JSON format.
My solution, in case anyone else needs it.
I tried to get the required annotation by making an HTTP request to the server with a JSON response:
HttpResponse<JsonNode> jsonResponse = Unirest.post("http://localhost:9000")
    .queryString("properties", "{\"annotators\":\"tokenize, ssplit, pos, lemma, ner, parse, sentiment\",\"outputFormat\":\"json\"}")
    .body(text)
    .asJson();
String sentTreeStr = jsonResponse.getBody().getObject()
    .getJSONArray("sentences").getJSONObject(0).getString("sentimentTree");
System.out.println(sentTreeStr); // prints sentiment values for the tree and all subtrees
But not all annotation data is available that way. For example, you don't get the probability distribution over all possible sentiment values, only the probability of the most likely sentiment.
If you need the full distribution, this is a solution:
HttpResponse<InputStream> inStream = Unirest.post("http://localhost:9000")
    .queryString(
        "properties",
        "{\"annotators\":\"tokenize, ssplit, pos, lemma, ner, parse, sentiment\","
        + "\"outputFormat\":\"serialized\","
        + "\"serializer\": \"edu.stanford.nlp.pipeline.GenericAnnotationSerializer\"}"
    )
    .body(text)
    .asBinary();
GenericAnnotationSerializer serializer = new GenericAnnotationSerializer();
try {
  ObjectInputStream in = new ObjectInputStream(inStream.getBody());
  Pair<Annotation, InputStream> deserialized = serializer.read(in);
  Annotation annotation = deserialized.first();
  // And now we are back to the same state as if we were not running CoreNLP as a server.
  CoreDocument doc = new CoreDocument(annotation);
  CoreSentence sentence = doc.sentences().get(0);
  // Prints the same output as shown in the question
  System.out.println(
      RNNCoreAnnotations.getPredictions(sentence.sentimentTree()));
} catch (IOException | ClassNotFoundException ex) {
  Logger.getLogger(SentimentTargetExtractor.class.getName()).log(Level.SEVERE, null, ex);
}

Check input text for words from a word collection

I have a collection of noun phrases, around 10,000 of them. I want to check every new input text against this collection and extract the sentences that contain any of these noun phrases. I don't want to loop over every word because it makes my code dead slow. I am using Java and Stanford CoreNLP.
A quick and easy way to do this is to use RegexNER to identify all examples of anything in your dictionary, and then check for non "O" NER tags in the sentence.
package edu.stanford.nlp.examples;

import edu.stanford.nlp.ling.*;
import edu.stanford.nlp.pipeline.*;
import edu.stanford.nlp.util.*;

import java.util.*;
import java.util.stream.Collectors;

public class FindSentencesWithPhrase {

  public static boolean checkForNamedEntity(CoreMap sentence) {
    for (CoreLabel token : sentence.get(CoreAnnotations.TokensAnnotation.class)) {
      if (token.ner() != null && !token.ner().equals("O")) {
        return true;
      }
    }
    return false;
  }

  public static void main(String[] args) {
    Properties props = new Properties();
    props.setProperty("annotators", "tokenize,ssplit,pos,lemma,regexner");
    props.setProperty("regexner.mapping", "phrases.rules");
    StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
    String exampleText = "This sentence contains the phrase \"ice cream\". " +
        "This sentence is not of interest. This sentence contains pizza.";
    Annotation ann = new Annotation(exampleText);
    pipeline.annotate(ann);
    for (CoreMap sentence : ann.get(CoreAnnotations.SentencesAnnotation.class)) {
      if (checkForNamedEntity(sentence)) {
        System.out.println("---");
        System.out.println(sentence.get(CoreAnnotations.TokensAnnotation.class)
            .stream().map(token -> token.word()).collect(Collectors.joining(" ")));
      }
    }
  }
}
The file "phrases.rules" should look like this:
ice cream PHRASE_OF_INTEREST MISC 1
pizza PHRASE_OF_INTEREST MISC 1

Reload CRF NER model of StanfordCoreNLP pipeline

I am making a web application (GUI) for building a CRF NER model instead of manually annotating CSV files. When the user has collected a number of training files, they should be able to generate a new model and try it.
The issue I have is with reloading the model. When I assign a new value to pipeline, like
pipeline = new StanfordCoreNLP(props)
the model stays the same. I tried to clear the annotator pool with
StanfordCoreNLP.clearAnnotatorPool()
but nothing changes. Is this possible at all, or do I have to restart my whole application every time to get this to work?
EDIT (Clarification):
I have two methods in the same class, nerString() and train(). Something like this:
class NerService {

  private var pipeline: StanfordCoreNLP = null
  loadPipelines()

  private def loadPipelines(): Unit = {
    val props = new Properties()
    props.setProperty("tokenize.class", "BosnianTokenizer")
    props.setProperty("ner.model", "conf/NER/classifiers/ner-ba-model.ser.gz") // NER CRF model
    props.setProperty("ner.useSUTime", "false")
    props.setProperty("annotators", "tokenize, ssplit, pos, lemma, ner")
    pipeline = new StanfordCoreNLP(props)
  }

  def nerString(tekst: String): List[TokenNER] = {
    val document = new Annotation(tekst)
    pipeline.annotate(document)
    ...
  }

  /////////////// train new NER model ///////////////////////
  private val trainProps = StringUtils.propFileToProperties("conf/NER/classifiers/ner-ba-training.prop")
  private val serializeTo = "conf/NER/classifiers/ner-ba-model.ser.gz" // save at location...
  private val inputDir = new File("conf/NER/classifiers/input")
  private val fileFilter = new WildcardFileFilter("*.tsv")
  private val dirFilter = TrueFileFilter.TRUE

  def train(): Unit = {
    val allFiles = FileUtils.listFiles(inputDir, fileFilter, dirFilter).asScala
    val trainFileList = allFiles.map(_.getPath).mkString(",")
    trainProps.setProperty("trainFileList", trainFileList)
    val flags = new SeqClassifierFlags(trainProps)
    val crf = new CRFClassifier[CoreLabel](flags)
    crf.train()
    crf.serializeClassifier(serializeTo)
    loadPipelines()
  }
}
loadPipelines() is used to re-assign the pipeline after a new NER model has been created.
How do I know that the model isn't being updated? I have a test text that I annotate manually, and I can see the difference with and without the new model.
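One likely culprit (an assumption, not confirmed in the thread): StanfordCoreNLP caches annotators in a global pool keyed by their properties, so a new pipeline whose ner.model points at the same path can get back the already-loaded CRF. A hedged workaround is to serialize each retrained model to a fresh, versioned path so the cache key changes; the hypothetical Java sketch below (all names are illustrative) shows the idea:

import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import java.util.Properties;

public class ReloadableNer {
  private StanfordCoreNLP pipeline;

  // Build a pipeline bound to one specific serialized model file.
  public synchronized void loadPipeline(String modelPath) {
    Properties props = new Properties();
    props.setProperty("annotators", "tokenize, ssplit, pos, lemma, ner");
    props.setProperty("ner.useSUTime", "false");
    props.setProperty("ner.model", modelPath);
    pipeline = new StanfordCoreNLP(props);
  }

  // After training, serialize to a new versioned path (e.g. with a timestamp)
  // so the annotator pool cannot hand back the old cached NER annotator.
  public synchronized void onModelRetrained() {
    String freshPath = "conf/NER/classifiers/ner-ba-model-" + System.currentTimeMillis() + ".ser.gz";
    // ... train the CRF and serializeClassifier(freshPath) here ...
    loadPipeline(freshPath);
  }
}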
