Stanford CoreNLP: output in CONLL format from Java

I want to parse some German text with Stanford CoreNLP and obtain a CONLL output, so that I can pass the latter to CorZu for coreference resolution.
How can I do that programmatically?
Here is my code so far (which only outputs dependency trees):
Annotation germanAnnotation = new Annotation("Gestern habe ich eine blonde Frau getroffen");
Properties germanProperties = StringUtils.argsToProperties("-props", "StanfordCoreNLP-german.properties");
StanfordCoreNLP pipeline = new StanfordCoreNLP(germanProperties);
pipeline.annotate(germanAnnotation);
StringBuilder trees = new StringBuilder("");
for (CoreMap sentence : germanAnnotation.get(CoreAnnotations.SentencesAnnotation.class)) {
    Tree sentenceTree = sentence.get(TreeCoreAnnotations.TreeAnnotation.class);
    trees.append(sentenceTree).append("\n");
}

With the following code I managed to save the parsing output in CONLL format.
OutputStream outputStream = new FileOutputStream(new File("./target/", OUTPUT_FILE_NAME));
CoNLLOutputter.conllPrint(germanAnnotation, outputStream, pipeline);
However, the HEAD field was 0 for all words. I am not sure whether the problem is in the parsing or only in the CoNLLOutputter. Honestly, I was too annoyed by CoreNLP to investigate further.
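If you would rather stay with CoreNLP, my untested guess is that the HEAD column stays 0 simply because no dependency parse is produced, so CoNLLOutputter has nothing to fill it with. Something along these lines might work, assuming your CoreNLP release ships a German dependency parser model (I have not verified this):
Properties props = StringUtils.argsToProperties("-props", "StanfordCoreNLP-german.properties");
// override the annotator list so a dependency parse is produced;
// CoNLLOutputter fills the HEAD/DEPREL columns from that parse
props.setProperty("annotators", "tokenize,ssplit,pos,depparse");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

Annotation doc = new Annotation("Gestern habe ich eine blonde Frau getroffen");
pipeline.annotate(doc);

try (OutputStream out = new FileOutputStream(new File("./target/", OUTPUT_FILE_NAME))) {
    CoNLLOutputter.conllPrint(doc, out, pipeline);
}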
In the end I decided to use ParZu instead, and I suggest you do the same. ParZu and CorZu are made to work together seamlessly - and they do.
In my case, I had an already tokenized and POS tagged text. This makes things easier, since you will not need:
A POS-Tagger using the STTS tagset
A tool for morphological analysis
Once you have ParZu and CorZu installed, you will only need to run corzu.sh (included in the CorZu download folder). If your text is tokenized and POS tagged, you can edit the script accordingly:
parzu_cmd="/YourPath/ParZu/parzu -i tagged"
Last note: make sure to convert your tagged text to the following format, with empty lines signifying sentence boundaries: word [tab] tag [newline]
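For example, the sample sentence from above would look like this in that format (one token per line, with a tab between word and tag; the STTS tags shown here are only an illustration), followed by an empty line to mark the sentence boundary:
Gestern    ADV
habe       VAFIN
ich        PPER
eine       ART
blonde     ADJA
Frau       NN
getroffen  VVPP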

Related

Why am I getting a String where I should get a dict when using pycorenlp.StanfordCoreNLP.annotate?

I'm running this example using pycorenlp, the Stanford CoreNLP Python wrapper, but the annotate function returns a string instead of a dict, so when I iterate over it to get each sentence's sentiment value I get the following error: "string indices must be integers".
What can I do to get around this? Could anyone help me? Thanks in advance.
The code is below:
from pycorenlp import StanfordCoreNLP

nlp_wrapper = StanfordCoreNLP('http://localhost:9000')
doc = "I like this chocolate. This chocolate is not good. The chocolate is delicious. Its a very tasty chocolate. This is so bad"
annot_doc = nlp_wrapper.annotate(doc,
    properties={
        'annotators': 'sentiment',
        'outputFormat': 'json',
        'timeout': 100000,
    })
for sentence in annot_doc["sentences"]:
    print(" ".join([word["word"] for word in sentence["tokens"]]) + " => "
          + str(sentence["sentimentValue"]) + " = " + sentence["sentiment"])
You should just use the official stanfordnlp package! (note: the name is going to be changed to stanza at some point)
Here are all the details, and you can get various output formats from the server including JSON.
https://stanfordnlp.github.io/stanfordnlp/corenlp_client.html
from stanfordnlp.server import CoreNLPClient

with CoreNLPClient(annotators=['tokenize','ssplit','pos','lemma','ner','parse','depparse','coref'],
                   timeout=30000, memory='16G') as client:
    # submit the request to the server
    ann = client.annotate(text)
It would be great if you could provide the error stack trace. A likely reason is that the annotator hits the timeout and returns an assertion message such as 'the text is too large..' as a plain string instead of a dict. Further, to put more light on Petr Matuska's comment: by looking at your example it is clear that your goal is to find the sentiment for each sentence along with its sentiment score.
The sentiment score is not included in the result when using CoreNLPClient. I faced a similar issue, but a workaround fixed it. If the text is large, you must set the timeout value much higher (e.g. timeout = 500000). Also, the annotator returns its results in a dictionary and therefore consumes a lot of memory; for a larger text corpus this becomes a real problem, so it is up to you how you handle that data structure in your code. There are alternatives such as using slots, tuples or named tuples for faster access.

Can I get an entityMention from the result of a TokensRegex match in Stanford CoreNLP?

I want to add addresses (and possibly other rule-based entities) to an NER pipeline, and TokensRegex seems like a terribly useful DSL for doing so. Following https://stackoverflow.com/a/42604225, I've created this rules file:
ner = { type: "CLASS", value: "edu.stanford.nlp.ling.CoreAnnotations$NamedEntityTagAnnotation" }
{ pattern: ([{ner:"NUMBER"}] [{pos:"NN"}|{pos:"NNP"}] /ave(nue)?|st(reet)?|boulevard|blvd|r(oa)?d/), action: Annotate($0, ner, "address") }
Here's a Scala REPL session showing how I'm trying to set up an annotation pipeline.
# import edu.stanford.nlp.pipeline.{StanfordCoreNLP, CoreDocument}
# import edu.stanford.nlp.util.PropertiesUtils.asProperties
# val pipe = new StanfordCoreNLP(asProperties(
"customAnnotatorClass.tokensregex", "edu.stanford.nlp.pipeline.TokensRegexAnnotator",
"annotators", "tokenize,ssplit,pos,lemma,ner,tokensregex",
"ner.combinationMode", "HIGH_RECALL",
"tokensregex.rules", "addresses.tregx"))
pipe: StanfordCoreNLP = edu.stanford.nlp.pipeline.StanfordCoreNLP#2ce6a051
# val doc = new CoreDocument("Adam Smith lived at 123 noun street in Glasgow, Scotland")
doc: CoreDocument = Adam Smith lived at 123 noun street in Glasgow, Scotland
# pipe.annotate(doc)
# doc.sentences.get(0).nerTags
res5: java.util.List[String] = [PERSON, PERSON, O, O, address, address, address, O, CITY, O, COUNTRY]
# doc.entityMentions
res6: java.util.List[edu.stanford.nlp.pipeline.CoreEntityMention] = [Adam Smith, 123, Glasgow, Scotland]
As you can see, the address gets correctly tagged in the nerTags for the sentence, but it doesn't show up in the document's entityMentions. Is there a way to do this?
Also, is there a way from the document to tell apart two adjacent matches of the tokensregex from a single match (assuming I have a more complicated set of regexes; in the current example I only match exactly 3 tokens, so I could just count tokens)?
I tried approaching it using the regexner with a tokens regex described here https://stanfordnlp.github.io/CoreNLP/regexner.html, but I couldn't seem to get that working.
Since I'm working in scala I'll be happy to dive into the Java API to get this to work, rather than fiddle with properties and resource files, if that's necessary.
Yes, I've recently added some changes (in the GitHub version) to make this easier! Make sure to download the latest version from GitHub. Though we are aiming to release Stanford CoreNLP 3.9.2 fairly soon and it will have these changes.
If you read this page you can get an understanding of the full NER pipeline run by the NERCombinerAnnotator.
https://stanfordnlp.github.io/CoreNLP/ner.html
Furthermore there is a lot of write up on the TokensRegex here:
https://stanfordnlp.github.io/CoreNLP/tokensregex.html
Basically what you want to do is run the ner annotator and use its TokensRegex sub-annotator. Imagine you have some named entity rules in a file called my_ner.rules.
You could run a command like this:
java -Xmx5g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,lemma,ner -ner.additional.tokensregex.rules my_ner.rules -outputFormat text -file example.txt
This will run a TokensRegex sub-annotator during the full named entity recognition process. Then, when the final entity-mention step runs, it will operate on the rule-extracted named entities and create entity mentions from them.
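If you prefer to configure this from the Java API rather than the command line, the equivalent pipeline setup should look roughly like this (a sketch, assuming a build recent enough to support the ner.additional.tokensregex.rules property; my_ner.rules is the placeholder file name from above):
Properties props = new Properties();
props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner");
// hand the TokensRegex rules to the NER annotator's sub-annotator
props.setProperty("ner.additional.tokensregex.rules", "my_ner.rules");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

CoreDocument doc = new CoreDocument("Adam Smith lived at 123 noun street in Glasgow, Scotland");
pipeline.annotate(doc);
// the rule-extracted entities should now also appear as entity mentions
System.out.println(doc.entityMentions());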

Separate YAML and plain text on the same document

While building a blog using Django I realized that it would be extremely practical to store the text of an article and all the related information (title, author, etc...) together in a human-readable file format, and then load those files into the database using a simple script.
Now that said, YAML caught my attention for its readability and ease of use; the only downside of the YAML syntax is the indentation:
---
title: Title of the article
author: Somebody
# Other stuff here ...
text: |
  This is the text of the article. I can write whatever I want
  but I need to be careful with the indentation...and this is a
  bit boring.
---
I believe that's not the best solution (especially if the files are going to be written by casual users). A format like this one would be much better:
---
title: Title of the article
author: Somebody
# Other stuff here ...
---
Here there is the text of the article, it is not valid YAML but
just plain text. Here I could put **Markdown** or <html>...or whatever
I want...
Is there any solution? Preferably using python.
Other file formats propositions are welcome as well!
Unfortunately this is not possible. What one might think could work is using | for a single scalar in the separate document:
import ruamel.yaml
yaml_str = """\
title: Title of the article
author: Somebody
---
|
Here there is the text of the article, it is not valid YAML but
just plain text. Here I could put **Markdown** or <html>...or whatever
I want...
"""
for d in ruamel.yaml.load_all(yaml_str):
    print(d)
    print('-----')
but it doesn't because | is the block indentation indicator. And although at the top level an indentation of 0 (zero) would easily work, ruamel.yaml (and PyYAML) don't allow this.
It is however easy to parse this yourself, which has the advantage over using the frontmatter package that you can use YAML 1.2 and are not restricted to YAML 1.1 (frontmatter is, because it uses PyYAML). Also note that I used the more appropriate end-of-document marker ... to separate the YAML from the markdown:
import ruamel.yaml
combined_str = """\
title: Title of the article
author: Somebody
...
Here there is the text of the article, it is not valid YAML but
just plain text. Here I could put **Markdown** or <html>...or whatever
I want...
"""
with open('test.yaml', 'w') as fp:
    fp.write(combined_str)

data = None
lines = []
yaml_str = ""
with open('test.yaml') as fp:
    for line in fp:
        if data is not None:
            lines.append(line)
            continue
        if line == '...\n':
            data = ruamel.yaml.round_trip_load(yaml_str)
            continue
        yaml_str += line

print(data['author'])
print(lines[2])
which gives:
Somebody
I want...
(the round_trip_load allows dumping with preservation of comments, anchor names etc).
I found Front Matter does exactly what I want to do.
There is also a python package.

Resolving how square brackets vs parentheses are conceptually handled in the pipeline

When I parse a sentence that contains left and right square brackets, the parse is much, much slower than, and different from, when the sentence contains left and right parentheses and the default normalizeOtherBrackets value of true is kept (I would say 20 seconds vs 3 seconds for the ParserAnnotator to run). If this property is instead set to false, the parse times for brackets vs parentheses are pretty comparable; however, the parse trees are still very different. With a value of true for text with brackets the POS is -LRB-, whereas with false the POS is CD, but in each case the general substructure of the tree is the same.
In the case of my corpus, the brackets are overwhelmingly meant to "clarify the antecedent" as described on this site. However, the PRN phrase-level label exists for parentheses and not for square brackets, so the formation of the tree is inherently different even if they have close to the same function in the sentence.
So, why are the parse times so different, and what can be done to get a proper parse? Obviously, a simplistic approach would be to replace the brackets with parens, but that does not seem like a satisfying solution. Are there any settings that can provide me with some relief? Here is my code:
private void execute() {
    Properties props = new Properties();
    props.setProperty("annotators", "tokenize, ssplit, pos, lemma, ner");
    props.setProperty("tokenize.options", "normalizeOtherBrackets=false");
    StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

    // create an empty Annotation just with the given text
    Annotation document = new Annotation(text);

    // run all Annotators on this text
    pipeline.annotate(document);

    // these are all the sentences in this document
    // a CoreMap is essentially a Map that uses class objects as keys and has values with custom types
    List<CoreMap> sentences = document.get(SentencesAnnotation.class);

    long start = System.nanoTime();
    ParserAnnotator pa = new ParserAnnotator(DefaultPaths.DEFAULT_PARSER_MODEL, false, 60, new String[0]);
    pa.annotate(document);
    long end = System.nanoTime();
    System.out.println((end - start) / 1000000 + " ms");

    for (CoreMap sentence : sentences) {
        System.out.println(sentence);
        Tree cm = sentence.get(TreeAnnotation.class);
        cm.pennPrint();
        List<CoreLabel> cm2 = sentence.get(TokensAnnotation.class);
        for (CoreLabel c : cm2) {
            System.out.println(c);
        }
    }
}
Yes, this is a known (and very tricky) issue, and unfortunately there is no perfect solution for it.
The problem with such sentences is mainly caused by the fact that these parenthetical constructions don't appear in the training data of the parser and by the way the part-of-speech tagger and the parser interact.
If you run the tokenizer with the option normalizeOtherBrackets=true, then the square brackets will be replaced with -LSB- and -RSB-. Now in most (or maybe even all) of the training examples for the part-of-speech tagger, -LSB- has the POS tag -LRB-, and -RSB- has the tag -RRB-. Therefore the POS tagger, despite taking context into account, basically always assigns these tags to square brackets.
Now if you run the POS-tagger before the parser, then the parser operates in delexicalized mode which means it tries to parse the sentence based on the POS tags of the individual words instead of based on the words themselves. Assuming that your sentence will have a POS sequence similar to -LRB- NN -RRB- the parser tries to find production rules so that it can parse this sequence. However, the parser's training data doesn't contain such sequences and therefore there aren't any such production rules, and consequently the parser is unable to parse a sentence with this POS sequence.
When this happens, the parser goes into fallback mode which relaxes the constraints on the part-of-speech tags by assigning a very low probability to all the other possible POS tags of every word so that it can eventually parse the sentence according to the production rules of the parser. This step, however, requires a lot of computation and therefore it takes so much longer to parse such sentences.
On the other hand, if you set normalizeOtherBrackets to false, then the square brackets won't be translated to -LSB- and -RSB-, and the POS tagger will assign tags to the original [ and ]. However, our training data was preprocessed such that all square brackets were replaced, so [ and ] are treated as unknown words and the POS tagger assigns tags purely based on the context, which most likely results in a POS sequence that is part of the training data of our parser and tagger. This in turn means that the parser is able to parse the sentence without entering recovery mode, which is a lot faster.
If your data contains a lot of sentences with such constructions, I would recommend that you don't run the POS tagger before the parser (by removing pos from the list of annotators). This should give you faster but still reliable parses. Otherwise, if you have access to a treebank, you could also manually add a few trees with these constructions and train a new parsing model. I would not set normalizeOtherBrackets to false, because then your tokenized input will differ from the training data of the parser and the tagger, which will potentially give you worse results.
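In terms of the code in the question, that would mean something like the following sketch (whether lemma and ner can simply be listed after parse depends on whether your CoreNLP version lets the parser's own POS tags satisfy their requirements):
Properties props = new Properties();
// no pos annotator: the parser assigns POS tags itself instead of being
// forced to parse a -LRB- ... -RRB- tag sequence it has never seen in training
props.setProperty("annotators", "tokenize, ssplit, parse");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

Annotation document = new Annotation(text);
pipeline.annotate(document);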

Stanford NER: How do I create a new training set that I can use and test out?

From my understanding, to create a training file, you put your words in a text file. Then after each word, you add a space or tab along with the tag (such as PERS, LOC, etc.).
I also copied text from a sample properties file into WordPad. How do I get these into a gz file that I can input into the classifier and use?
Please guide me though. I'm a newbie and am fairly inept with technology.
Your training file (say training-data.tsv) should look like this:
I O
drove O
to O
Vancouver LOCATION
BC LOCATION
yesterday O
where O means "Outside", as in not a named entity, and the separator between the columns is a tab.
You don't put them in a ser.gz file. The ser.gz file is the classifier model that is created by the training process.
To train the classifier run:
java -cp ner.jar edu.stanford.nlp.ie.crf.CRFClassifier -prop my-classifier.properties
where my-classifier.properties would look like this:
trainFile = training-data.tsv
serializeTo = my-classification-model.ser.gz
map = word=0,answer=1
...
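Once training has produced my-classification-model.ser.gz, you can load it and tag new text from Java roughly like this (a sketch following the NERDemo pattern that ships with Stanford NER; the class name is just a placeholder):
import edu.stanford.nlp.ie.AbstractSequenceClassifier;
import edu.stanford.nlp.ie.crf.CRFClassifier;
import edu.stanford.nlp.ling.CoreLabel;

public class UseMyModel {
    public static void main(String[] args) throws Exception {
        // load the serialized model written by the training step
        AbstractSequenceClassifier<CoreLabel> classifier =
            CRFClassifier.getClassifierNoExceptions("my-classification-model.ser.gz");
        // prints each token with its predicted tag, e.g. Vancouver/LOCATION
        System.out.println(classifier.classifyToString("I drove to Vancouver BC yesterday"));
    }
}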
I'd advise you to take a look at the NLTK documentation to learn more about training a parser: http://nltk.googlecode.com/svn/trunk/doc/howto/tag.html. Now, it seems that you want to train the CRFClassifier (not the parser!); for that you may want to check this FAQ: http://nlp.stanford.edu/software/crf-faq.shtml#a
