Is there a way to return all the matched expressions?
Consider the following sentence
John Snow killed Ramsay Bolton
where
John-NNP, Snow-NNP, killed-VBD, Ramsay-NNP, Bolton-NNP
and I am using the following tag combinations as rules:
NNP-NNP
NNP-VBD
VBD-NNP
and expected matched words from the above rules are:
John Snow, Snow killed, killed Ramsay, Ramsay Bolton
But using the code below, I am getting only these matched expressions:
[John Snow, killed Ramsay]
Is there a way in Stanford CoreNLP to get all the expected matching words from the sentence? This is the code and rule file I am using right now:
import com.factweavers.multiterm.SetNLPAnnotators;
import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.tokensregex.CoreMapExpressionExtractor;
import edu.stanford.nlp.ling.tokensregex.Env;
import edu.stanford.nlp.ling.tokensregex.NodePattern;
import edu.stanford.nlp.ling.tokensregex.TokenSequencePattern;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.util.CoreMap;

import java.util.List;
import java.util.regex.Pattern;

public class StanfordTest {
    public static void main(String[] args) {
        String rulesFile = "en.rules";
        Env env = TokenSequencePattern.getNewEnv();
        env.setDefaultStringMatchFlags(NodePattern.NORMALIZE);
        env.setDefaultStringPatternFlags(Pattern.CASE_INSENSITIVE);
        env.bind("collapseExtractionRules", false);
        CoreMapExpressionExtractor extractor = CoreMapExpressionExtractor.createExtractorFromFiles(env, rulesFile);
        String content = "John Snow killed Ramsay Bolton";
        Annotation document = new Annotation(content);
        SetNLPAnnotators snlpa = new SetNLPAnnotators();
        StanfordCoreNLP pipeline = snlpa.setAnnotators("tokenize, ssplit, pos, lemma, ner");
        pipeline.annotate(document);
        List<CoreMap> sentences = document.get(CoreAnnotations.SentencesAnnotation.class);
        sentences.parallelStream().forEach(sentence -> {
            System.out.println(extractor.extractExpressions(sentence));
        });
    }
}
en.rules
{
  ruleType: "tokens",
  pattern: ([{tag:/NNP/}][{tag:/VBD/}]),
  result: "result1"
}
{
  ruleType: "tokens",
  pattern: ([{tag:/VBD/}][{tag:/NNP/}]),
  result: "result2"
}
{
  ruleType: "tokens",
  pattern: ([{tag:/NNP/}][{tag:/NNP/}]),
  result: "result3"
}
I think you need to create different extractors for the different things you want.
The issue here is that when you have two part-of-speech tag rule sequences that overlap like this, the first one that gets matched absorbs the tokens, preventing the second pattern from matching.
So if (NNP, NNP) is the first rule, "John Snow" gets matched. But then "Snow" is no longer available to be matched for "Snow killed".
If you have a set of patterns that overlap like this, you should disentangle them and put them in separate extractors.
So you can have a (noun, verb) extractor, and a separate (noun, noun) extractor for instance.
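For instance, a minimal sketch of that split (the rule file names here are hypothetical; env and sentences are the ones from the question's code):

CoreMapExpressionExtractor nounNounExtractor =
    CoreMapExpressionExtractor.createExtractorFromFiles(env, "en-nnp-nnp.rules");
CoreMapExpressionExtractor nounVerbExtractor =
    CoreMapExpressionExtractor.createExtractorFromFiles(env, "en-nnp-vbd.rules");
CoreMapExpressionExtractor verbNounExtractor =
    CoreMapExpressionExtractor.createExtractorFromFiles(env, "en-vbd-nnp.rules");

for (CoreMap sentence : sentences) {
    // Each extractor matches independently, so a token consumed by one
    // pattern is still available to the patterns in the other extractors.
    System.out.println(nounNounExtractor.extractExpressions(sentence));
    System.out.println(nounVerbExtractor.extractExpressions(sentence));
    System.out.println(verbNounExtractor.extractExpressions(sentence));
}

Note that (NNP, VBD) and (VBD, NNP) can also overlap on the same verb ("Snow killed" / "killed Ramsay"), which is why this sketch gives each rule its own extractor.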
Related
I'm trying to run the code found here: https://stanfordnlp.github.io/CoreNLP/coref.html
public class CorefExample {
    public static void main(String[] args) throws Exception {
        Annotation document = new Annotation("Barack Obama was born in Hawaii. He is the president. Obama was elected in 2008.");
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner,parse,mention,coref");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
        pipeline.annotate(document);
        System.out.println("---");
        System.out.println("coref chains");
        for (CorefChain cc : document.get(CorefCoreAnnotations.CorefChainAnnotation.class).values()) {
            System.out.println("\t" + cc);
        }
        for (CoreMap sentence : document.get(CoreAnnotations.SentencesAnnotation.class)) {
            System.out.println("---");
            System.out.println("mentions");
            for (Mention m : sentence.get(CorefCoreAnnotations.CorefMentionsAnnotation.class)) {
                System.out.println("\t" + m);
            }
        }
    }
}
However, the three required imports aren't found for me:
import edu.stanford.nlp.coref.CorefCoreAnnotations;
import edu.stanford.nlp.coref.data.CorefChain;
import edu.stanford.nlp.coref.data.Mention;
I could use these imports instead:
import edu.stanford.nlp.dcoref.CorefCoreAnnotations;
import edu.stanford.nlp.dcoref.CorefChain;
import edu.stanford.nlp.dcoref.Mention;
But then an annotation is missing, specifically:
CorefCoreAnnotations.CorefMentionsAnnotation.class
Additionally, document.get(CorefCoreAnnotations.CorefChainAnnotation.class).values() returns null...
I think the problem is that I am using CoreNLP version 3.6.0. This tutorial is for 3.7.0 I believe. Is there a similar example that uses version 3.6.0? If not, what changes do I need to make? I have a large pipeline set up and I'm not sure how hard it would be to upgrade.
Thanks for any help!
Hi, I would recommend just upgrading to Stanford CoreNLP 3.7.0; it should not cause too many things to break.
One of the main changes is that we created a new package named edu.stanford.nlp.coref and put the code from edu.stanford.nlp.hcoref into it.
For the most part things should be the same though if you upgrade.
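So the fix is mostly a matter of switching package names in the imports, roughly like this (a sketch of the rename; Mention moved into the new package as well):

// CoreNLP 3.6.0 (old package)
import edu.stanford.nlp.hcoref.CorefCoreAnnotations;
import edu.stanford.nlp.hcoref.data.CorefChain;
import edu.stanford.nlp.hcoref.data.Mention;

// CoreNLP 3.7.0 (new package, as used in the tutorial)
import edu.stanford.nlp.coref.CorefCoreAnnotations;
import edu.stanford.nlp.coref.data.CorefChain;
import edu.stanford.nlp.coref.data.Mention;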
I am using the RegexNER annotator in CoreNLP and some of my named entities consist of multiple words. Excerpt from my mapping file:
RAF inhibitor DRUG_CLASS
Gilbert's syndrome DISEASE
The first one gets detected, but each word gets the annotation DRUG_CLASS individually, and there seems to be no way to link the words, such as an NER id that both words would share.
The second case does not get detected at all, and that's probably because the tokenizer treats the apostrophe after Gilbert as a separate token. Since RegexNER takes the tokenization as a dependency, I can't really get around it.
Any suggestions to resolve these cases?
If you use the entitymentions annotator, it will create entity mentions out of consecutive tokens with the same NER tags. The downside is that if two entities of the same type are side by side, they will be joined together. We are working on improving the NER system, so we may include a new model that finds the boundaries of distinct mentions in these cases; hopefully this will go into Stanford CoreNLP 3.8.0.
Here is some sample code for accessing the entity mentions:
package edu.stanford.nlp.examples;

import edu.stanford.nlp.pipeline.*;
import edu.stanford.nlp.ling.*;
import edu.stanford.nlp.util.*;

import java.util.*;

public class EntityMentionsExample {
    public static void main(String[] args) {
        Annotation document =
            new Annotation("John Smith visited Los Angeles on Tuesday.");
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner,entitymentions");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
        pipeline.annotate(document);
        for (CoreMap entityMention : document.get(CoreAnnotations.MentionsAnnotation.class)) {
            System.out.println(entityMention);
            System.out.println(entityMention.get(CoreAnnotations.TextAnnotation.class));
        }
    }
}
If you just have your rules tokenized the same way as the tokenizer output, it will work fine; so, for instance, the rule should be Gilbert 's syndrome.
So you could just run the tokenizer on all your text patterns and this problem will go away.
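For example, the mapping file excerpt from the question would become (keeping a tab between pattern and tag, and splitting Gilbert's the way the tokenizer does):

RAF inhibitor	DRUG_CLASS
Gilbert 's syndrome	DISEASE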
<a id="ctl00_cphBody_gvMessageList_ctl02_hlnkMessageSubject" href="Message.aspx?id=3428&member=">DDM IT QUIZ 2017 – Bhubaneswar Edition</a>
<a id="ctl00_cphBody_gvMessageList_ctl03_hlnkMessageSubject" href="Message.aspx?id=3427&member=">[Paybooks] Tax/investment declaration proof FY 2016-17</a>
<a id="ctl00_cphBody_gvMessageList_ctl04_hlnkMessageSubject" href="Message.aspx?id=3426&member=">Reimbursement clarification</a>
Desired output:
DDM IT QUIZ 2017 – Bhubaneswar Edition
[Paybooks] Tax/investment declaration proof FY 2016-17
Reimbursement clarification
How can I get the relative XPaths for these three elements, so that I can get the above-mentioned texts?
xpath = '//a/text()'
This will return a list of text nodes.
A complete answer would be:
to get all a elements of the same type, in this case having an id containing ctl00:
//a[contains(@id, 'ctl00')]
you can add more restrictions, for example an href restriction, to require a certain string in its value:
//a[contains(@id, 'ctl00')][contains(@href, 'Message')]
to get all a elements, it is enough to just use:
//a
In order to get text, you can use a method to get the text from your framework or add /text() to your xpath expression.
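For instance, a minimal Selenium sketch along those lines (assuming a WebDriver d that is already on the page; the id fragment hlnkMessageSubject is taken from the posted HTML):

List<WebElement> links = d.findElements(By.xpath("//a[contains(@id, 'hlnkMessageSubject')]"));
for (WebElement link : links) {
    // getText() returns the rendered text of each matched anchor
    System.out.println(link.getText());
}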
You can use a relative XPath on the span, as in the following example, along with mouse and keyboard operations.
Check this out and let me know!
import java.util.concurrent.TimeUnit;

import org.openqa.selenium.By;
import org.openqa.selenium.Keys;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.firefox.FirefoxDriver;
import org.openqa.selenium.interactions.Actions;

public class SnapD {
    public static void main(String args[]) {
        WebDriver d = new FirefoxDriver();
        d.get("https://www.snapdeal.com/");
        d.manage().window().maximize();
        d.manage().timeouts().implicitlyWait(20, TimeUnit.SECONDS);
        System.out.println("Hello Google...");
        System.out.println("Hello Snapdeal...");
        WebElement wb = d.findElement(By.xpath("//span[text()='Electronics']"));
        Actions act = new Actions(d);
        act.moveToElement(wb);
        act.perform();
        System.out.println("Mouse hovered");
        WebElement wb1 = d.findElement(By.xpath("//span[text()='DTH Services']"));
        act.contextClick(wb1).perform();
        act.sendKeys(Keys.ARROW_DOWN, Keys.ENTER).perform();
        act.sendKeys(Keys.chord(Keys.CONTROL, Keys.TAB)).perform();
    }
}
Is there a way to call the Stanford parser from the command line so that it parses one sentence at a time and, in case of trouble with a specific sentence, just moves on to the next one?
UPDATE:
I have been adapting the script posted by StanfordNLP Help below. However, I noticed that, with the latest version of corenlp (2015-04-20), there are problems with the CCprocessed dependencies: collapsing just appears not to take place (if I grep for prep_ in the output, I find nothing).
Collapsing works with the 2015-04-20 version and the PCFG parser, for example, so I assume the issue is model-specific.
If I use the very same Java class with corenlp 2015-01-29 (with depparse.model changed to parse.model, and the original dependencies part removed), collapsing works just fine. Maybe I am just using the parser in the wrong way; that's why I am re-posting here instead of starting a new post. Here is the updated code of the class:
import java.io.*;
import java.util.*;

import edu.stanford.nlp.io.*;
import edu.stanford.nlp.ling.*;
import edu.stanford.nlp.pipeline.*;
import edu.stanford.nlp.trees.*;
import edu.stanford.nlp.semgraph.*;
import edu.stanford.nlp.trees.TreeCoreAnnotations.*;
import edu.stanford.nlp.util.*;

public class StanfordSafeLineExample {
    public static void main(String[] args) throws IOException {
        // build pipeline
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize, ssplit, pos, lemma, depparse");
        props.setProperty("ssplit.eolonly", "true");
        props.setProperty("tokenize.whitespace", "false");
        props.setProperty("depparse.model", "edu/stanford/nlp/models/parser/nndep/english_SD.gz");
        props.setProperty("parse.originalDependencies", "true");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
        // open file
        BufferedReader br = new BufferedReader(new FileReader(args[0]));
        // go through each sentence
        for (String line = br.readLine(); line != null; line = br.readLine()) {
            try {
                Annotation annotation = new Annotation(line);
                pipeline.annotate(annotation);
                CoreMap sentence = annotation.get(CoreAnnotations.SentencesAnnotation.class).get(0);
                System.out.println("sentence: " + line);
                for (CoreLabel token : annotation.get(CoreAnnotations.TokensAnnotation.class)) {
                    Integer identifier = token.get(CoreAnnotations.IndexAnnotation.class);
                    String word = token.get(CoreAnnotations.TextAnnotation.class);
                    String pos = token.get(CoreAnnotations.PartOfSpeechAnnotation.class);
                    String lemma = token.get(CoreAnnotations.LemmaAnnotation.class);
                    System.out.println(identifier + "\t" + word + "\t" + pos + "\t" + lemma);
                }
                SemanticGraph tree = sentence.get(SemanticGraphCoreAnnotations.BasicDependenciesAnnotation.class);
                SemanticGraph tree2 = sentence.get(SemanticGraphCoreAnnotations.CollapsedCCProcessedDependenciesAnnotation.class);
                System.out.println("---BASIC");
                System.out.println(tree.toString(SemanticGraph.OutputFormat.READABLE));
                System.out.println("---CCPROCESSED---");
                System.out.println(tree2.toString(SemanticGraph.OutputFormat.READABLE) + "</s>");
            } catch (Exception e) {
                System.out.println("Error with this sentence: " + line);
                System.out.println("");
            }
        }
    }
}
There are many ways to handle this.
The way I'd do it is to run the Stanford CoreNLP pipeline.
Here is where you can get the appropriate jar:
http://nlp.stanford.edu/software/corenlp.shtml
After you cd into the directory stanford-corenlp-full-2015-04-20, you can issue this command:
java -cp "*" -Xmx2g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,parse -ssplit.eolonly -outputFormat text -file sample_sentences.txt
sample_sentences.txt would contain the sentences you want to parse, one sentence per line.
This will put the results in sample_sentences.txt.out, which you can extract with some light scripting.
If you change -outputFormat to json instead of text, you will get some JSON which you can easily load and extract the parses from.
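For example, the same command with only the output format changed:

java -cp "*" -Xmx2g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,parse -ssplit.eolonly -outputFormat json -file sample_sentences.txt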
If you have any issues with this approach, let me know and I can modify the answer to further assist you/clarify!
UPDATE:
I am not sure exactly how you are running things, but these options could be helpful.
If you use -fileList to run the pipeline on a list of files rather than on a single file, and then add the flag -continueOnAnnotateError, it should just skip a bad file, which is progress, though admittedly not quite skipping just the bad sentence.
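Something along these lines (a sketch; file_list.txt is a hypothetical file containing one input path per line):

java -cp "*" -Xmx2g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,parse -ssplit.eolonly -continueOnAnnotateError -outputFormat text -fileList file_list.txt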
I wrote some Java for doing exactly what you need, so I'll try to post that in the next 24 hours if you just want to use my whipped-together Java code; I'm still looking it over...
Here is some sample code for your needs:
import java.io.*;
import java.util.*;

import edu.stanford.nlp.io.*;
import edu.stanford.nlp.ling.*;
import edu.stanford.nlp.pipeline.*;
import edu.stanford.nlp.trees.*;
import edu.stanford.nlp.semgraph.*;
import edu.stanford.nlp.trees.TreeCoreAnnotations.*;
import edu.stanford.nlp.util.*;

public class StanfordSafeLineExample {
    public static void main(String[] args) throws IOException {
        // build pipeline
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize, ssplit, pos, depparse");
        props.setProperty("ssplit.eolonly", "true");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
        // open file
        BufferedReader br = new BufferedReader(new FileReader(args[0]));
        // go through each sentence
        for (String line = br.readLine(); line != null; line = br.readLine()) {
            try {
                Annotation annotation = new Annotation(line);
                pipeline.annotate(annotation);
                CoreMap sentence = annotation.get(CoreAnnotations.SentencesAnnotation.class).get(0);
                SemanticGraph tree = sentence.get(SemanticGraphCoreAnnotations.BasicDependenciesAnnotation.class);
                System.out.println("---");
                System.out.println("sentence: " + line);
                System.out.println(tree.toString(SemanticGraph.OutputFormat.READABLE));
            } catch (Exception e) {
                System.out.println("---");
                System.out.println("Error with this sentence: " + line);
            }
        }
    }
}
instructions:
Cut and paste this into StanfordSafeLineExample.java
put that file in the directory stanford-corenlp-full-2015-04-20
javac -cp "*:." StanfordSafeLineExample.java
add your sentences one sentence per line to a file called sample_sentences.txt
java -cp "*:." StanfordSafeLineExample sample_sentences.txt
I'm using Thymeleaf template engine with spring and I'd like to display text stored throught a multiline textarea.
In my database, multiline strings are stored with "\n", like this: "Test1\nTest2\n...."
With th:text I've got "Test1 Test2", with no line break.
How can I display the line breaks using Thymeleaf, while avoiding manually replacing "\n" with <br /> and avoiding th:utext (which opens the form to XSS injection)?
Thanks !
Two of your options:
Use th:utext - easy setup option, but harder to read and remember
Create a custom processor and dialect - more involved setup, but easier, more readable future use.
Option 1:
You can use th:utext if you escape the text using the expression utility method #strings.escapeXml( text ) to prevent XSS injection and unwanted formatting - http://www.thymeleaf.org/doc/tutorials/2.1/usingthymeleaf.html#strings
To make this platform independent, you can use T(java.lang.System).getProperty('line.separator') to grab the line separator.
Using the existing Thymeleaf expression utilities, this works:
<p th:utext="${#strings.replace( #strings.escapeXml( text ),T(java.lang.System).getProperty('line.separator'),'<br />')}" ></p>
Option 2:
The API for this is now different in Thymeleaf 3 (I wrote this tutorial for 2.1).
Hopefully you can combine the below logic with their official tutorial. One day maybe I'll have a minute to update this completely. But for now:
Here's the official Thymeleaf tutorial for creating your own dialect.
Once setup is complete, all you will need to do to produce escaped text output with preserved line breaks is:
<p fd:lstext="${ text }"></p>
The main piece doing the work is the processor. The following code will do the trick:
package com.foo.bar.thymeleaf.processors;

import java.util.Collections;
import java.util.List;

import org.thymeleaf.Arguments;
import org.thymeleaf.Configuration;
import org.thymeleaf.dom.Element;
import org.thymeleaf.dom.Node;
import org.thymeleaf.dom.Text;
import org.thymeleaf.processor.attr.AbstractChildrenModifierAttrProcessor;
import org.thymeleaf.standard.expression.IStandardExpression;
import org.thymeleaf.standard.expression.IStandardExpressionParser;
import org.thymeleaf.standard.expression.StandardExpressions;
import org.unbescape.html.HtmlEscape;

public class HtmlEscapedWithLineSeparatorsProcessor extends
        AbstractChildrenModifierAttrProcessor {

    public HtmlEscapedWithLineSeparatorsProcessor() {
        // only executes this processor for the attribute 'lstext'
        super("lstext");
    }

    protected String getText(final Arguments arguments, final Element element,
            final String attributeName) {
        final Configuration configuration = arguments.getConfiguration();
        final IStandardExpressionParser parser =
            StandardExpressions.getExpressionParser(configuration);
        final String attributeValue = element.getAttributeValue(attributeName);
        final IStandardExpression expression =
            parser.parseExpression(configuration, arguments, attributeValue);
        final String value = (String) expression.execute(configuration, arguments);
        // return the escaped text with the line separator replaced with <br />
        return HtmlEscape.escapeHtml4Xml(value).replace(System.getProperty("line.separator"), "<br />");
    }

    @Override
    protected final List<Node> getModifiedChildren(
            final Arguments arguments, final Element element, final String attributeName) {
        final String text = getText(arguments, element, attributeName);
        // Create a new text node signifying that the content is already escaped.
        final Text newNode = new Text(text == null ? "" : text, null, null, true);
        // Setting this avoids text inliners processing already generated text,
        // which in turn avoids code injection.
        newNode.setProcessable(false);
        return Collections.singletonList((Node) newNode);
    }

    @Override
    public int getPrecedence() {
        // A value of 10000 is higher than any attribute in the SpringStandard
        // dialect, so this attribute will execute after all other attributes
        // from that dialect if in the same tag.
        return 11400;
    }
}
Now that you have the processor, you need a custom dialect to add the processor to.
package com.foo.bar.thymeleaf.dialects;

import java.util.HashSet;
import java.util.Set;

import org.thymeleaf.dialect.AbstractDialect;
import org.thymeleaf.processor.IProcessor;

import com.foo.bar.thymeleaf.processors.HtmlEscapedWithLineSeparatorsProcessor;

public class FooDialect extends AbstractDialect {

    public FooDialect() {
        super();
    }

    // This is what all the dialect's attributes/tags will start with.
    // So like.. fd:lstext="Hi David!<br />This is so much easier..."
    public String getPrefix() {
        return "fd";
    }

    // The processors.
    @Override
    public Set<IProcessor> getProcessors() {
        final Set<IProcessor> processors = new HashSet<IProcessor>();
        processors.add(new HtmlEscapedWithLineSeparatorsProcessor());
        return processors;
    }
}
Now you need to add it to your xml or java configuration:
If you are writing a Spring MVC application, you just have to set it at the additionalDialects property of the Template Engine bean, so that it is added to the default SpringStandard dialect:
<bean id="templateEngine" class="org.thymeleaf.spring3.SpringTemplateEngine">
<property name="templateResolver" ref="templateResolver" />
<property name="additionalDialects">
<set>
<bean class="com.foo.bar.thymeleaf.dialects.FooDialect"/>
</set>
</property>
</bean>
Or if you are using Spring and would rather use JavaConfig, you can create a class annotated with @Configuration in your base package that contains the dialect as a managed bean:
package com.foo.bar;

import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

import com.foo.bar.thymeleaf.dialects.FooDialect;

@Configuration
public class TemplatingConfig {

    @Bean
    public FooDialect fooDialect() {
        return new FooDialect();
    }
}
Here are some further references on creating custom processors and dialects: http://www.thymeleaf.org/doc/articles/sayhelloextendingthymeleaf5minutes.html , http://www.thymeleaf.org/doc/articles/sayhelloagainextendingthymeleafevenmore5minutes.html and http://www.thymeleaf.org/doc/tutorials/2.1/extendingthymeleaf.html
Try putting style="white-space: pre-line" on the element.
For example:
<span style="white-space: pre-line" th:text="${text}"></span>
You might also be interested in white-space: pre-wrap, which also maintains consecutive spaces in addition to line breaks.
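For example, the same pattern with the wrapping variant:

<span style="white-space: pre-wrap" th:text="${text}"></span>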
Avoid using th:utext if possible as it has serious security implications. If care is not taken, th:utext can create the possibility of XSS attacks.
Maybe not what the OP had in mind, but this works and prevents code injection:
<p data-th-utext="${#strings.replace(#strings.escapeXml(text),'&#10;','<br>')}"></p>
(Using HTML5-style Thymeleaf.)
In my case escapeJava() returns Unicode escape values for Cyrillic symbols, so wrapping everything in unescapeJava() helped solve my problem.
<div class="text" th:utext="${#strings.unescapeJava(#strings.replace(#strings.escapeJava(comment.text),'\n','<br />'))}"></div>
If you are using jQuery with Thymeleaf, you can format your code using this:
$('#idyourdiv').val().replace(/\n\r?/g, '<br />')
Hope that answer can help you
Try this
<p th:utext="${#strings.replace(#strings.escapeJava(description),'\n','<br />')}" ></p>
You need to use th:utext and append the line breaks to the string.
My code is:
StringBuilder message = new StringBuilder();
message.append("some text");
message.append("<br>");
message.append("some text");
<span th:utext="${message}"></span>
Inspired by @DavidRoberts' answer, I made a Thymeleaf dialect that makes it easy to keep the line breaks, if the CSS white-space property isn't an option.
It also brings support for BBCode if you want it.
You can either import it as a dependency (it's very light, and very easy to set up thanks to the starter project) or just use it as inspiration to make your own.
Check it out here:
https://github.com/oxayotl/meikik-project