How to read rules from a file - stanford-nlp

I am trying to match sentences against rules.
I am able to compile multiple rules and match them against a list of CoreLabel tokens using the following approach:
TokenSequencePattern pattern1 = TokenSequencePattern.compile("([{tag:/NN.*/}])");
TokenSequencePattern pattern2 = TokenSequencePattern.compile("([{tag:/NN.*/}])");
List<TokenSequencePattern> tokenSequencePatterns = new ArrayList<>();
tokenSequencePatterns.add(pattern1);
tokenSequencePatterns.add(pattern2);
MultiPatternMatcher multiMatcher = TokenSequencePattern.getMultiPatternMatcher(tokenSequencePatterns);
List<SequenceMatchResult<CoreMap>> matched=multiMatcher.findNonOverlapping(tokens);
I have many rules inside a file. Is there any way to load the rule file?
I have seen a way to load the rules from a file using the following method:
CoreMapExpressionExtractor extractor = CoreMapExpressionExtractor.createExtractorFromFiles(TokenSequencePattern.getNewEnv(), "en.rules");
List<MatchedExpression> matched = extractor.extractExpressions((CoreMap)sentence);
But it accepts a CoreMap as its argument, whereas I need to match against a list of CoreLabel tokens.

Please see this comprehensive write up on TokensRegex:
https://stanfordnlp.github.io/CoreNLP/tokensregex.html
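Note also that a CoreLabel is itself a CoreMap, and an Annotation is a CoreMap too. A minimal sketch (assuming your tokens already carry the annotations your rules refer to, such as the POS tag) is to wrap your List<CoreLabel> in an Annotation and hand that to the extractor built from the rules file:
// Sketch: load the rules from "en.rules" and run them over an existing
// List<CoreLabel> by wrapping the tokens in an Annotation (an Annotation
// is a CoreMap, which is what extractExpressions expects).
CoreMapExpressionExtractor extractor = CoreMapExpressionExtractor.createExtractorFromFiles(TokenSequencePattern.getNewEnv(), "en.rules");
List<CoreLabel> tokens = ... // the same tokens you passed to findNonOverlapping
Annotation sentence = new Annotation("");
sentence.set(CoreAnnotations.TokensAnnotation.class, tokens);
List<MatchedExpression> matched = extractor.extractExpressions(sentence);
This way the rules live in en.rules and the extractor compiles them for you, instead of compiling each TokenSequencePattern by hand.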

Related

Disallow DTDs while evaluating XPath

I have the below code to evaluate an XPath expression.
String inputXml = "<?xml version=\"1.0\"?><!DOCTYPE document SYSTEM \"test.dtd\"><Request><Header><Version>1.0</Version></Header></Request>";
String xpath="/Request/Header/Version";
XPathFactory xpf = new net.sf.saxon.xpath.XPathFactoryImpl();
final InputSource is = new InputSource(new StringReader(inputXml));
String version = xpf.newXPath().evaluate(xpath, is);
xpf.newXPath().evaluate throws an error because test.dtd cannot be found. I want to disallow DTDs completely. I have been reading about setting the SAXParser feature "http://apache.org/xml/features/disallow-doctype-decl", but I am not sure how to apply it in this case, or whether there is another way to disallow/ignore DTDs.
I'm not quite sure what you want to achieve. If you want this to fail because there is a DTD referenced, then you already seem to be achieving that.
However, if you want to set a property on the XML parser, there are two ways you could achieve it:
(a) Supply a SAXSource rather than an InputSource; initialize the XMLReader in the SAXSource to the XML parser you want to use, and use the XMLReader's setFeature interface to configure it before you pass it to the XPath engine (a sketch follows after option (b)).
(b) Set the Saxon configuration feature http://saxon.sf.net/feature/parserFeature?uri=http://apache.org/xml/features/disallow-doctype-decl (that's a single string with no spaces or newlines) to the value true. You can do this using
((net.sf.saxon.xpath.XPathFactoryImpl) xpf).getConfiguration().setConfigurationProperty(featureName, true); // cast needed because xpf is declared as XPathFactory
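For option (a), a rough sketch (not verified; it assumes the underlying SAX parser honours the Apache disallow-doctype-decl feature, and it reuses the inputXml, xpath and xpf variables from the question) could look like this:
// Configure a SAX parser with the feature, then hand it to the XPath engine
// wrapped in a SAXSource instead of a bare InputSource.
SAXParserFactory spf = SAXParserFactory.newInstance();
spf.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
XMLReader reader = spf.newSAXParser().getXMLReader();
SAXSource source = new SAXSource(reader, new InputSource(new StringReader(inputXml)));
String version = xpf.newXPath().evaluate(xpath, source);
The last line assumes Saxon's XPath implementation accepts a Source as the context item, as the answer above implies; if it does not in your version, option (b) avoids the question entirely.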

How to add a namespace to existing xml file

I want to open this file and get all elements that start with us-gaap.
ftp://ftp.sec.gov/edgar/data/916789/0001558370-15-001143.txt
To get the elements I tried this:
str = '<html><body><us-gaap:foo>foo</us-gaap:foo></body></html>'
doc = Nokogiri::XML(str)
doc.xpath('//us-gaap:*')
Nokogiri::XML::XPath::SyntaxError: Undefined namespace prefix: //us-gaap:*
from /Users/ironsand/.rbenv/versions/2.2.2/lib/ruby/gems/2.2.0/gems/nokogiri-1.6.7.2/lib/nokogiri/xml/searchable.rb:165:in `evaluate'
doc.namespaces returns {}, so I think I have to add the us-gaap namespace.
There are some questions about "adding a namespace with Nokogiri", but they seem to be about how to create a new XML document, not how to add a namespace to an existing document.
How can I add a namespace to an existing document?
I know I can remove the namespaces with Nokogiri::XML::Document#remove_namespaces!, but I don't want to use it because it also removes necessary information.
You have asked an XY Problem. You think that the problem is that you need to add a missing namespace; the real problem is that the file you're trying to parse is not valid XML.
require 'nokogiri'
doc = Nokogiri.XML( IO.read('0001558370-15-001143.txt') )
doc.errors.length
#=> 5716
For example, the <ACCEPTANCE-DATETIME> 'element' opened on line 3 is never closed, and on line 16 there is a raw ampersand in the text:
STANDARD INDUSTRIAL CLASSIFICATION: ELECTRIC HOUSEWARES & FANS [3634]
which ought to be escaped as an entity.
However, the document has valid XML fragments within it! In particular, there is one XML document that defines the xmlns:us-gaap namespace, spanning lines 27243-49312. Let's extract just that, using only the knowledge that the root element defines the namespace we want, and the assumptions that no element with the same name is nested within the document, and that the root element does not have an unescaped > character in any attribute. (These assumptions are valid for this file, but may not be valid for every XML file.)
txt = IO.read('0001558370-15-001143.txt')
gaap_finder = %r{(<(\w+) [^>]+xmlns:us-gaap=.+?</\2>)}m
txt.scan(gaap_finder) do |xml, _|
  doc = Nokogiri.XML( xml )
  gaaps = doc.xpath('//us-gaap:*')
  p gaaps.length
  #=> 569
end
The code above handles the case where there may be more than one XML document in the txt file, though in this case there is only one.
Decoded, the gaap_finder regex says this:
%r{...}m — this is a regular expression (that allows slashes in it, unescaped) with "multiline mode", where a period will match newline characters
(...) — capture everything we find
< — start with a literal "less-than" symbol
(\w+) — find one or more word characters (the tag name), and save them
(a single space) — the word characters must be followed by a literal space character (important to avoid capturing the <xsd:xbrl ...> element in this file)
[^>]+ — followed by one or more characters that is NOT a "greater-than" symbol (to ensure that we stay in the same element that we started in)
xmlns:us-gaap= — followed by this literal namespace declaration
.+? — followed by anything (as little as possible)...
</\2> — ...up until you see a closing tag with the same name as what we captured for the name of the starting tag
Because of the way scan works when the regex has capturing groups, each result is a two-element array, where the first element is the entire captured XML and the second element is the name of the tag that we captured (which we "discard" by assigning it to the _ variable).
If you want to be less magic about your capturing, the text file format appears to always wrap each XML document in <XBRL>...</XBRL>. So, you could do this to process every XML file (there are seven, five of which do not happen to have any us-gaap namespaces):
txt = IO.read('0001558370-15-001143.txt')
xbrls = %r{(?<=<XBRL>).+?(?=</XBRL>)}m # find text inside <XBRL>…</XBRL>
txt.scan(xbrls) do |xml|
  doc = Nokogiri.XML( xml )
  if doc.namespaces["xmlns:us-gaap"]
    gaaps = doc.xpath('//us-gaap:*')
    p gaaps.length
  end
end
#=> 569
#=> 0 (for the XML Schema document that defines the namespace)
I couldn't figure out how to update an existing doc with a new namespace in place, but since Nokogiri recognizes namespaces declared on the root element, and since those namespaces are, syntactically, just attributes, you can add the namespace declaration as an attribute, serialize the doc to a string, and re-parse it:
str = '<html><body><us-gaap:foo>foo</us-gaap:foo></body></html>'
doc_without_ns = Nokogiri::XML(str)
doc_without_ns.root['xmlns:us-gaap'] = 'http://your/actual/ns/here'
doc = Nokogiri::XML(doc_without_ns.to_xml)
doc.xpath("//us-gaap:*")
# Returns [#<Nokogiri::XML::Element:0x3ff375583f9c name="foo" namespace=#<Nokogiri::XML::Namespace:0x3ff375583f24 prefix="us-gaap" href="http://your/actual/ns/here"> children=[#<Nokogiri::XML::Text:0x3ff375583768 "foo">]>]

Spring Mongo: creating a like query

I can match the start of a string (e.g. clo) against keywords in the mongo shell and it gives me the correct result: db.post.find({"keywords": /^clo/}).pretty(). When I tried to write the same query using Spring Mongo, it does not work properly: it behaves like %string% and matches anywhere in the string, while I am trying to match only at the start. My code is:
String pattern = "/^" + keyword + "/";
Criteria criteria2 = Criteria.where("keywords").is(keyword).regex(pattern);
Where I am missing ?
You can do it like this:
Query.query(Criteria.where("keywords").regex("^clo"))
Or use it as native query:
new BasicQuery("{'keywords' : '/^clo/'}")
The is() method performs an exact equality match, and regex() must be given the pattern without the / wrappers.
That is your issue.
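For completeness, a hedged sketch of executing the corrected criteria through a MongoTemplate (Post and mongoTemplate are placeholders, not from the question):
// Prefix match on "keywords": anchor the regex at the start of the string.
Query query = Query.query(Criteria.where("keywords").regex("^" + keyword));
List<Post> posts = mongoTemplate.find(query, Post.class);
// If keyword can contain regex metacharacters, escape it (e.g. with Pattern.quote) first.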

URL rewriting using Tuckey

My project (we have Spring 3) needs to rewrite URLs of the form
localhost:8888/testing/test.htm?param1=val1&paramN=valN
to
localhost:8888/nottestinganymore/test.htm?param1=val1&paramN=valN
My current rule looks like:
<from>^/testing/(.*/)?([a-z0-9]*.htm.*)$</from>
<to type="passthrough">/nottestinganymore/$2</to>
But my query parameters are being doubled, so I am getting param1=val1,val1 and paramN=valN,valN. Please help! This stuff is a huge pain.
To add: we have use-query-string=true on the project and I doubt I can change that.
The regular expression needs some tweaking. Tuckey uses the Java regular expression engine unless specified otherwise. Hence, the best way to deal with this is to write a small test case that will confirm whether your regular expression is correct. For example, a slightly tweaked version of your regular expression with a test case is below.
@Test
public void testRegularExpression()
{
    String regexp = "/testing/(.*)([a-z0-9]*.htm.*)$";
    String url = "localhost:8888/testing/test.htm?param1=val1&paramN=valN";
    Pattern pattern = Pattern.compile(regexp);
    Matcher matcher = pattern.matcher(url);
    if (matcher.find())
    {
        System.out.println("$1 : " + matcher.group(1));
        System.out.println("$2 : " + matcher.group(2));
    }
}
The above will print the following output:
$1 : test
$2 : .htm?param1=val1&paramN=valN
You can modify the expression now to see what "groups" you want to extract from URL and then form the target URL.
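Following the same test-first idea, a hedged tweak (untested against Tuckey itself) is to stop the second group before the query string, on the assumption that the doubled parameters come from the captured query string being forwarded on top of the original request parameters:
@Test
public void testTweakedExpression()
{
    // Capture only the page name; leave the query string out of the group.
    String regexp = "/testing/(.*/)?([a-z0-9]+\\.htm)";
    String url = "localhost:8888/testing/test.htm?param1=val1&paramN=valN";
    Matcher matcher = Pattern.compile(regexp).matcher(url);
    if (matcher.find())
    {
        System.out.println("$2 : " + matcher.group(2)); // prints: $2 : test.htm
    }
}
The corresponding <from> would then be something like ^/testing/(.*/)?([a-z0-9]+\.htm), with the same <to>; on a passthrough forward the container still sees the original request parameters, so they should no longer be doubled.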

How to parse a list of sentences?

I want to parse a list of sentences with the Stanford NLP parser.
My list is an ArrayList; how can I parse the whole list with LexicalizedParser?
From each sentence I want to get a parse of this form:
Tree parse = (Tree) lp1.apply(sentence);
Although one can dig into the documentation, I am going to provide code here on SO, especially since links move and/or die. This particular answer uses the whole pipeline; if you are not interested in the whole pipeline, an alternative approach is provided further below.
The example below shows the complete way of using the Stanford pipeline. If you are not interested in coreference resolution, remove dcoref from the 3rd line of code. The pipeline does the sentence splitting for you (the ssplit annotator) if you just feed it a body of text (the text variable). Have just one sentence? That is fine; you can feed it in as the text variable.
// creates a StanfordCoreNLP object, with POS tagging, lemmatization, NER, parsing, and coreference resolution
Properties props = new Properties();
props.put("annotators", "tokenize, ssplit, pos, lemma, ner, parse, dcoref");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
// read some text in the text variable
String text = ... // Add your text here!
// create an empty Annotation just with the given text
Annotation document = new Annotation(text);
// run all Annotators on this text
pipeline.annotate(document);
// these are all the sentences in this document
// a CoreMap is essentially a Map that uses class objects as keys and has values with custom types
List<CoreMap> sentences = document.get(SentencesAnnotation.class);
for (CoreMap sentence : sentences) {
    // traversing the words in the current sentence
    // a CoreLabel is a CoreMap with additional token-specific methods
    for (CoreLabel token : sentence.get(TokensAnnotation.class)) {
        // this is the text of the token
        String word = token.get(TextAnnotation.class);
        // this is the POS tag of the token
        String pos = token.get(PartOfSpeechAnnotation.class);
        // this is the NER label of the token
        String ne = token.get(NamedEntityTagAnnotation.class);
    }
    // this is the parse tree of the current sentence
    Tree tree = sentence.get(TreeAnnotation.class);
    // this is the Stanford dependency graph of the current sentence
    SemanticGraph dependencies = sentence.get(CollapsedCCProcessedDependenciesAnnotation.class);
}
// This is the coreference link graph
// Each chain stores a set of mentions that link to each other,
// along with a method for getting the most representative mention
// Both sentence and token offsets start at 1!
Map<Integer, CorefChain> graph = document.get(CorefChainAnnotation.class);
Actually, the Stanford NLP documentation provides a sample of how to parse sentences; you can find it in the parser documentation on the Stanford NLP site.
So as promised, if you don't want to access the full Stanford pipeline (although I believe that is the recommended approach), you can work with the LexicalizedParser class directly. In this case, you would download the latest version of Stanford Parser (whereas the other would use CoreNLP tools). Make sure that in addition to the parser jar, you have the model file for the appropriate parser you want to work with. Example code:
LexicalizedParser lp1 = LexicalizedParser.loadModel("englishPCFG.ser.gz");
String sentence = "It is a fine day today";
Tree parse = lp1.parse(sentence);
Note this works for version 3.3.1 of the parser.
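And to tie this back to the original question (an ArrayList of sentences), a minimal sketch is simply to loop over the list with the same parser instance:
// Parse every sentence in the list, reusing the lp1 parser loaded above.
List<String> sentenceList = new ArrayList<>(); // your ArrayList of sentences
List<Tree> parses = new ArrayList<>();
for (String s : sentenceList) {
    parses.add(lp1.parse(s));
}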
