Model evaluation in Stanford NER - stanford-nlp

I'm doing a project with the NER module from Stanford CoreNLP and I'm currently having some issues with the evaluation of the model.
I'm using the API to call the functionality from inside a Java program instead of using the command line, and so far I've managed to train the model from several training files (tab-separated format: two columns with token and annotation/answer) and to serialize it to a file, which was pretty easy.
Now I'm trying to evaluate the trained model on some test files (precision, recall, F1) and I'm a bit stuck there. First of all, what format should the test files be in? I'm assuming they should be the same as the training files (tab-separated), which would be the logical thing. I've looked through the Javadoc for information on how to use the classify methods and also had a look at NERDemo.java. I've managed to get the classifyToString method to work, but that doesn't really help me with the evaluation. I've found the classifyAndWriteAnswers(String testFile, DocumentReaderAndWriter<IN> readerWriter, boolean outputScores) method, which I assume would give me the precision and recall scores if I set outputScores to true.
However, I can't manage to get this to work. Which DocumentReaderAndWriter should I use as the second argument?
This is what I've got right now:
public static void evaluate(CRFClassifier classifier, File testFile) {
    try {
        classifier.classifyAndWriteAnswers(testFile.getPath(), new PlainTextDocumentReaderAndWriter(), true);
    } catch (IOException e) {
        e.printStackTrace();
    }
}
This is what I get:
Unchecked call to 'classifyAndWriteAnswers(String, DocumentReaderAndWriter<IN>, boolean)' as a member of raw type 'edu.stanford.nlp.ie.AbstractSequenceClassifier'
Also, do I pass the path to the test file as the first argument, or the contents of the file loaded into a String? Some help would be greatly appreciated.
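For completeness, here is the parameterized variant I've been experimenting with. It's only a sketch: it assumes CoreLabel is the right label type and that the classifier's own makeReaderAndWriter() returns a reader suited to the tab-separated column format used for training.

import java.io.File;
import java.io.IOException;

import edu.stanford.nlp.ie.crf.CRFClassifier;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.sequences.DocumentReaderAndWriter;

public class NerEvaluation {

    public static void evaluate(CRFClassifier<CoreLabel> classifier, File testFile) {
        try {
            // Reader/writer built from the classifier's own flags; with the default
            // settings this should read the same tab-separated columns as the training data.
            DocumentReaderAndWriter<CoreLabel> readerAndWriter = classifier.makeReaderAndWriter();
            // Assuming the first argument is the path to the test file (not its contents);
            // outputScores=true should print precision/recall/F1 per entity class.
            classifier.classifyAndWriteAnswers(testFile.getPath(), readerAndWriter, true);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}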

Related

Why does SonarQube mark try as a critical issue?

I'm currently facing an issue with a SonarQube analysis that runs over some Kotlin code I wrote.
I'm trying to implement a method that connects to the database and returns a result according to the query's outcome. I'm not sure how relevant this is, but I added the following Maven dependencies to my project:
Quarkus
Arrow
Ktorm
The code is the following:
@ApplicationScoped
class Repository(private val database: Database) {
    override fun get(name: String): Either<Error, Brand> =
        try {
            database.brands.find { it.name eq name }.rightIfNotNull {
                MissingBrandError("Missing brand")
            }
        } catch (e: Exception) {
            Either.Left(DatabaseError(e.message))
        }
}

open class Error(val message: String)
class MissingBrandError(message: String) : Error(message)
class DatabaseError(message: String? = null) : Error(message ?: "Some database error")
NOTE: Database object is of type org.ktorm.database.Database and brands is of type org.ktorm.entity.EntitySequence
The code works, and I also wrote unit tests for it that pass and give enough coverage (according to the code coverage analysis tool), but at some point in my pipeline SonarQube marks the try as a critical issue with the following message:
Possible null pointer dereference in (...)Repository(String) due to return value of called method
I looked online and found some related questions, but none of the provided answers worked for me. Among the many attempts, these are the ones I remember trying, without any success:
Not inlining any code (pretty much using Java style code)
Extracting the query result to a variable
Checking for nullability with if/else statements instead (both with the inlined try and without)
I'd also like to highlight that all I can see on Sonar is the generated report and the CLI output for the running build. I don't have access to any of its configuration, nor do I intend to change it (unless of course it comes down to that). The line I mentioned seems to be the only one affected by this problem according to Sonar's report, which is why this is the only class I provided.
I hope I provided enough info and that any of you can help me with this. Thanks in advance.

Empty output when reproducing Chinese coreference results on CoNLL-2012 using the CoreNLP neural system

Following the instructions on this page https://stanfordnlp.github.io/CoreNLP/coref.html#running-on-conll-2012, here's my code for trying to reproduce the Chinese coreference results on CoNLL-2012:
public class TestCoref {
    public static void main(String[] args) throws Exception {
        Properties props = StringUtils.argsToProperties(args);
        props.setProperty("props", "edu/stanford/nlp/coref/properties/neural-chinese-conll.properties");
        props.setProperty("coref.data", "path-to/data/conll-2012");
        props.setProperty("coref.conllOutputPath", "path-to-output/conll-results");
        props.setProperty("coref.scorer", "path-to/reference-coreference-scorers/v8.01/scorer.pl");
        CorefSystem coref = new CorefSystem(props);
        coref.runOnConll(props);
    }
}
As output, I got three files like these:
date-time.coref.predicted.txt
date-time.coref.gold.txt
date-time.predicted.txt
but all of them are EMPTY!
I got my CoNLL-2012 data as follows:
First I downloaded the train/dev/test-key data from this page http://conll.cemantix.org/2012/data.html, as well as ontonotes-release-5.0 from the LDC. Then I ran the skeleton2conll.sh script provided with the official CoNLL-2012 data, which produced the _conll files.
The model I used was downloaded from here: http://nlp.stanford.edu/software/stanford-chinese-corenlp-models-current.jar
While trying to find the problem, I noticed that there is an "annotate" function in the CorefSystem class which seems to do the real job, but it is not used at all: https://github.com/stanfordnlp/CoreNLP/blob/master/src/edu/stanford/nlp/coref/CorefSystem.java
I wonder whether there is a bug in the runOnConll function that keeps it from reading or annotating anything, or how else I could reproduce the coreference results.
PS:
I especially want to produce results on conversational data like "tc" and "bc" in CoNLL-2012. I find that with the coreference API I can only parse plain textual data. Is there any other way to use the neural coref system on conversational data (where different speakers should be indicated), apart from running on CoNLL-2012?
Thanks in advance for any help!
As a start, why don't you run this command from the command line:
java -Xmx10g -cp stanford-corenlp-3.9.1.jar:stanford-chinese-corenlp-models-3.9.1.jar:* edu.stanford.nlp.coref.CorefSystem -props edu/stanford/nlp/coref/properties/neural-chinese-conll.properties -coref.data <path-to-conll-data> -coref.conllOutputPath <where-to-save-system-output> -coref.scorer <path-to-scoring-script>

How to get the Coreference Resolution annotation in the Stanford CoreNLP toolkit?

I'm trying to use the Stanford CoreNLP toolkit to annotate a text. I tried the code provided here: http://stanfordnlp.github.io/CoreNLP/
and it works well. The problem is when I want to use the coreference resolution tool embedded in the CoreNLP toolkit: it does not work. I used the code published by the Stanford NLP group; the code is below:
public class CorefExample {
    public static void main(String[] args) throws Exception {
        Annotation document = new Annotation("Barack Obama was born in Hawaii. He is the president. Obama was elected in 2008.");
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner,parse,mention,coref");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
        pipeline.annotate(document);
        System.out.println("---");
        System.out.println("coref chains");
        for (CorefChain cc : document.get(CorefCoreAnnotations.CorefChainAnnotation.class).values()) {
            System.out.println("\t" + cc);
        }
        for (CoreMap sentence : document.get(CoreAnnotations.SentencesAnnotation.class)) {
            System.out.println("---");
            System.out.println("mentions");
            for (Mention m : sentence.get(CorefCoreAnnotations.CorefMentionsAnnotation.class)) {
                System.out.println("\t" + m);
            }
        }
    }
}
But when I run this code I get null; this line: sentence.get(CorefCoreAnnotations.CorefMentionsAnnotation.class)
always returns null, while I am sure that the toolkit has annotated coreference mentions.
I'm really mixed up. What is the solution? How can I get the coref annotation in my Java code?
If I run the sample code on the coref page with the latest stanford-corenlp-3.6.0.jar it runs to completion, so I am not seeing the null issue you are talking about.
Make sure to use the latest jar available on the website, version 3.6.0
Update:
If you cut and paste the code on this page:
http://stanfordnlp.github.io/CoreNLP/coref.html
and put it into a file called CorefExample.java, then do:
javac -cp "stanford-corenlp-full-2015-12-09/*" CorefExample.java
java -cp "stanford-corenlp-full-2015-12-09/*:." CorefExample
You should see the mentions printed out.
We've updated the distribution, so also make sure you've downloaded it recently.
If you're still having problems, we will have to figure out what is different between what I just described and your setup. I just cut and pasted the code and ran it as described above, and I see the mentions printed out (I even added a sentence with no mentions to the sample text) and I get a list with the mentions (or an empty list). So you shouldn't be getting null if you're using that exact code with the latest jar.
It would be helpful to know how you're running the code so we can see what the difference is.
I had the same problem with version 3.7.0 while using coreference resolution with the "dcoref" (deterministic approach) annotator instead of "coref". I am not sure which packages you were importing when you performed the coreference resolution and got the error (they are not shown in the code snippet above), but I suppose the ones mentioned in this example code.
In my case, I added the imports manually and didn't copy them from the example. Therefore, IntelliJ proposed that I use either
import edu.stanford.nlp.coref.CorefCoreAnnotations;
import edu.stanford.nlp.coref.data.CorefChain;
as in the example,
or:
import edu.stanford.nlp.dcoref.CorefChain;
import edu.stanford.nlp.dcoref.CorefCoreAnnotations;
I chose the second option, and this created the null error for me. Actually, not only was sentence.get(CorefCoreAnnotations.CorefMentionsAnnotation.class) null, but document.get(CorefCoreAnnotations.CorefChainAnnotation.class) was as well.
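To make the pairing explicit, here is a minimal sketch (the annotator list is copied from the question above, and it assumes a 3.7+ distribution where the statistical/neural coref classes live under edu.stanford.nlp.coref). The point is simply that the imported annotation classes must come from the same package family as the annotator configured in the pipeline.

import java.util.Properties;

import edu.stanford.nlp.coref.CorefCoreAnnotations;  // matches the "coref" annotator
import edu.stanford.nlp.coref.data.CorefChain;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;

public class CorefKeyCheck {
    public static void main(String[] args) {
        Properties props = new Properties();
        // "coref" stores its results under keys from edu.stanford.nlp.coref; the
        // deterministic "dcoref" annotator would instead require the
        // edu.stanford.nlp.dcoref imports. Mixing the two yields null lookups.
        props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner,parse,mention,coref");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

        Annotation document = new Annotation("Barack Obama was born in Hawaii. He is the president.");
        pipeline.annotate(document);

        for (CorefChain chain : document.get(CorefCoreAnnotations.CorefChainAnnotation.class).values()) {
            System.out.println(chain);
        }
    }
}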

Parquet-MR AvroParquetWriter - how to convert data to Parquet (with Specific Mapping)

I'm working on a tool for converting data from a homegrown format to Parquet and JSON (for use in different settings with Spark, Drill and MongoDB), using Avro with Specific Mapping as the stepping stone. I have to support conversion of new data on a regular basis and on client machines, which is why I'm writing my own standalone conversion tool with an (Avro|Parquet|JSON) switch instead of using Drill or Spark or other tools as converters, as I probably would if this were a one-time job. I'm basing the whole thing on Avro because this seems like the easiest way to get conversion to Parquet and JSON under one hood.
I used Specific Mapping to profit from static type checking, wrote an IDL, converted it to a schema.avsc, generated classes and set up a sample conversion with the specific constructor, but now I'm stuck configuring the writers. All the Avro-to-Parquet conversion examples I could find [0] use AvroParquetWriter with deprecated signatures (mostly: Path file, Schema schema) and Generic Mapping.
AvroParquetWriter has only one non-deprecated constructor, with this signature:
AvroParquetWriter(
    Path file,
    WriteSupport<T> writeSupport,
    CompressionCodecName compressionCodecName,
    int blockSize,
    int pageSize,
    boolean enableDictionary,
    boolean enableValidation,
    WriterVersion writerVersion,
    Configuration conf
)
Most of the parameters are not hard to figure out but WriteSupport<T> writeSupport throws me off. I can't find any further documentation or an example.
Staring at the source of AvroParquetWriter, I see GenericData model pop up a few times, but only one line mentioning SpecificData: GenericData model = SpecificData.get();
So I have a few questions:
1) Does AvroParquetWriter not support Avro Specific Mapping? Or does it, by means of that SpecificData.get() call? The comment "Utilities for generated Java classes and interfaces." over SpecificData.class seems to suggest that, but how exactly should I proceed?
2) What's going on in the AvroParquetWriter constructor? Is there an example or some documentation to be found somewhere?
3) More specifically: the WriteSupport signature asks for 'Schema avroSchema' and 'GenericData model'. What does GenericData model refer to? Maybe I'm not seeing the forest for the trees here...
To give an example of what I'm aiming for, my central piece of Avro conversion code currently looks like this:
DatumWriter<MyData> avroDatumWriter = new SpecificDatumWriter<>(MyData.class);
DataFileWriter<MyData> dataFileWriter = new DataFileWriter<>(avroDatumWriter);
dataFileWriter.create(schema, avroOutput);
The Parquet equivalent currently looks like this:
AvroParquetWriter<SpecificRecord> parquetWriter = new AvroParquetWriter<>(parquetOutput, schema);
but this is no more than a beginning and is modeled after the examples I found, using the deprecated constructor, so it will have to change anyway.
Thanks,
Thomas
[0] Hadoop: The Definitive Guide, O'Reilly; https://gist.github.com/hammer/76996fb8426a0ada233e; http://www.programcreek.com/java-api-example/index.php?api=parquet.avro.AvroParquetWriter
Try AvroParquetWriter.builder:
MyData obj = ... // should be avro Object
ParquetWriter<Object> pw = AvroParquetWriter.builder(file)
        .withSchema(obj.getSchema())
        .build();
pw.write(obj);
pw.close();
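If specific mapping is the goal, the builder can, as far as I know, also be given a data model so that the generated classes are written with SpecificData instead of the GenericData default. Here is a sketch under that assumption, with MyData standing in for the generated specific-record class and the org.apache.parquet package names of recent parquet-mr releases:

import java.io.IOException;

import org.apache.avro.specific.SpecificData;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;

public class SpecificParquetWriteSketch {

    static void writeOne(MyData obj, Path out) throws IOException {
        try (ParquetWriter<MyData> pw = AvroParquetWriter.<MyData>builder(out)
                .withSchema(MyData.getClassSchema())       // schema of the generated class
                .withDataModel(SpecificData.get())         // specific rather than generic mapping
                .withCompressionCodec(CompressionCodecName.SNAPPY)
                .build()) {
            pw.write(obj);
        }
    }
}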
Thanks.

How to generate MSTest results in static folders

Is there a way to control the names of the MSTest video recording files, or to name the folders after the test? It seems to generate a different GUID every time, which makes it very difficult to map a test to its corresponding video recording files.
The only solution I can see is to read the TRX file and map the GUID to the test name.
Any suggestions?
If you're not opposed to doing it by hand, it's pretty easy. I encountered the same problem and needed the videos to be somewhere predictable so I could email links to them. In the end my solution was just to code the functionality in by hand. It's a bit involved, but not too difficult.
First, you'll need to have Expression Encoder 4 installed.
Then you'll need to add these references to your project:
Microsoft.Expression.Encoder
Microsoft.Expression.Encoder.Api2
Microsoft.Expression.Encoder.Types
Microsoft.Expression.Encoder.Utilities
Next, you need to add the following using directives:
using Microsoft.Expression.Encoder.Profiles;
using Microsoft.Expression.Encoder.ScreenCapture;
Then you can use [TestInitialize] and [TestCleanup] to define the correct behavior. These methods run at the beginning and end of each test, respectively. Something like this:
// Shared screen-capture job for the test class (assumes Expression Encoder's ScreenCaptureJob type).
private ScreenCaptureJob screenCapJob = new ScreenCaptureJob();

[TestInitialize]
public void startVideoCapture()
{
    screenCapJob.CaptureRectangle = RectangleSelectionUtilities.GetScreenRect(0);
    screenCapJob.CaptureMouseCursor = true;
    screenCapJob.ShowFlashingBoundary = false;
    screenCapJob.OutputScreenCaptureFileName = "path you want to save to";
    screenCapJob.Start();
}

[TestCleanup]
public void stopVideoCapture()
{
    screenCapJob.Stop();
}
Obviously this code needs some error and edge case handling, but it should get you started.
You should also know that the free version of Expression Encoder 4 limits you to 10 minutes per video file, so you may want to make a timer that will start a new video for you when it hits 10 minutes.
