Reload CRF NER model of StanfordCoreNLP pipeline - stanford-nlp

I am making a web application (GUI) for building a CRF NER model instead of manually annotating CSV files. When the user has collected a number of training files, they should be able to generate a new model and try it out.
The issue I have is with reloading the model. When I assign a new value to the pipeline, like
pipeline = new StanfordCoreNLP(props)
the model stays the same. I tried to clear the annotator pool with
StanfordCoreNLP.clearAnnotatorPool()
but nothing changes. Is this possible at all, or do I have to restart my whole application every time to get this to work?
EDIT (Clarification):
I have two methods in the same class: nerString() and train(). Something like this:
class NerService {

  private var pipeline: StanfordCoreNLP = null

  loadPipelines()

  private def loadPipelines(): Unit = {
    val props = new Properties()
    props.setProperty("tokenize.class", "BosnianTokenizer")
    props.setProperty("ner.model", "conf/NER/classifiers/ner-ba-model.ser.gz") // NER CRF model
    props.setProperty("ner.useSUTime", "false")
    props.setProperty("annotators", "tokenize, ssplit, pos, lemma, ner")
    pipeline = new StanfordCoreNLP(props)
  }

  def nerString(tekst: String): List[TokenNER] = {
    val document = new Annotation(tekst)
    pipeline.annotate(document)
    ...
  }

  /////////////// train new NER model ///////////////////////
  private val trainProps = StringUtils.propFileToProperties("conf/NER/classifiers/ner-ba-training.prop")
  private val serializeTo = "conf/NER/classifiers/ner-ba-model.ser.gz" // save at location...
  private val inputDir = new File("conf/NER/classifiers/input")
  private val fileFilter = new WildcardFileFilter("*.tsv")
  private val dirFilter = TrueFileFilter.TRUE

  def train(): Unit = {
    val allFiles = FileUtils.listFiles(inputDir, fileFilter, dirFilter).asScala
    val trainFileList = allFiles.map(_.getPath).mkString(",")
    trainProps.setProperty("trainFileList", trainFileList)
    val flags = new SeqClassifierFlags(trainProps)
    val crf = new CRFClassifier[CoreLabel](flags)
    crf.train()
    crf.serializeClassifier(serializeTo)
    loadPipelines()
  }
}
loadPipelines() is used to re-assign the pipeline once the new NER model has been created.
How do I know that the model isn't updated? I have a piece of text that I include manually, and I can see the difference in the output with and without it.
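One thing worth checking: a minimal sketch of loadPipelines(), assuming the static annotator pool is what keeps the old classifier alive (StanfordCoreNLP caches annotators by their property signature, and ner.model points at the same path before and after training), and assuming clearAnnotatorPool() runs before the new pipeline is constructed:

private def loadPipelines(): Unit = {
  val props = new Properties()
  props.setProperty("tokenize.class", "BosnianTokenizer")
  props.setProperty("ner.model", "conf/NER/classifiers/ner-ba-model.ser.gz")
  props.setProperty("ner.useSUTime", "false")
  props.setProperty("annotators", "tokenize, ssplit, pos, lemma, ner")
  // Assumption: dropping the cached annotators here forces the "ner" annotator
  // to be re-created, which re-reads the serialized classifier from disk
  StanfordCoreNLP.clearAnnotatorPool()
  pipeline = new StanfordCoreNLP(props)
}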

Related

Springboot Mongo reactive repository unable to update nested list

I want to update a nested list, but I'm experiencing strange behavior where I have to call the method twice to get it done...
Here is my POJO:
@Document(collection = "company")
data class Company(
    val id: ObjectId,
    @Indexed(unique = true)
    val name: String,
    val customers: MutableList<Customer> = mutableListOf()
    //other fields
)
Below is my function from the custom repository that does the job, which I based on this tutorial:
override fun addCustomer(customer: Customer): Mono<Company> {
    val query = Query(Criteria.where("employees.keycloakId").`is`(customer.createdBy))
    val update = Update().addToSet("customers", customer)
    val upsertOption = FindAndModifyOptions.options().upsert(true)
    //if I uncomment the line below, this works...
    //mongoTemplate.findAndModify(query, update, upsertOption, Company::class.java).block()
    return mongoTemplate.findAndModify(query, update, upsertOption, Company::class.java)
}
In order to actually add this customer, I have to either uncomment the block() call above or call the method twice in the debugger while running integration tests, which is quite confusing to me.
Here is the failing test
@Test
fun addCustomer() {
    //given
    val company = fixture.company
    val initialCustomerSize = company.customers.size
    companyRepository.save(company).block()
    val customerToAdd = CustomerReference(id = ObjectId.get(),
        keycloakId = "dummy",
        username = "customerName",
        email = "email",
        createdBy = company.employees[0].keycloakId)
    //when, then
    StepVerifier.create(companyCustomRepositoryImpl.addCustomer(customerToAdd))
        .assertNext { updatedCompany -> assertThat(updatedCompany.customers).hasSize(initialCustomerSize + 1) }
        .verifyComplete()
}
java.lang.AssertionError:
Expected size:<3> but was:<2> in:
I found the issue.
By default, Mongo returns the entity in its state from before the update. To override that I had to add:
val upsertOption = FindAndModifyOptions.options()
    .returnNew(true)
    .upsert(true)

StanfordCoreNLPClient doesn't work as expected on sentiment analysis

Stanford CoreNLP version 3.9.1
I have a problem getting StanfordCoreNLPClient to work the same way as StanfordCoreNLP when doing sentiment analysis.
public class Test {
    public static void main(String[] args) {
        String text = "This server doesn't work!";
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize, ssplit, pos, lemma, ner, parse, sentiment");
        //If I uncomment this line, and comment out the next one, it works
        //StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
        StanfordCoreNLPClient pipeline = new StanfordCoreNLPClient(props, "http://localhost", 9000, 2);

        Annotation annotation = new Annotation(text);
        pipeline.annotate(annotation);
        CoreDocument document = new CoreDocument(annotation);
        CoreSentence sentence = document.sentences().get(0);

        //outputs null when using StanfordCoreNLPClient
        System.out.println(RNNCoreAnnotations.getPredictions(sentence.sentimentTree()));
        //throws a NullPointerException when using StanfordCoreNLPClient (the reason, of course, is that it uses the same method I called above, I assume)
        System.out.println(RNNCoreAnnotations.getPredictionsAsStringList(sentence.sentimentTree()));
    }
}
Output using StanfordCoreNLPClient pipeline = new StanfordCoreNLPClient(props, "http://localhost", 9000, 2):
null
Exception in thread "main" java.lang.NullPointerException
at edu.stanford.nlp.neural.rnn.RNNCoreAnnotations.getPredictionsAsStringList(RNNCoreAnnotations.java:68)
at tomkri.mastersentimentanalysis.preprocessing.Test.main(Test.java:35)
Output using StanfordCoreNLP pipeline = new StanfordCoreNLP(props):
Type = dense , numRows = 5 , numCols = 1
0.127
0.599
0.221
0.038
0.015
[0.12680336652661395, 0.5988695516384742, 0.22125584263055106, 0.03843574738131668, 0.014635491823044227]
Annotations other than sentiment work in both cases (at least those I have tried).
The server starts fine, and I am able to use it from my web browser. When using it there, I also get the sentiment scores (for each subtree of the parse) in JSON format.
My solution, in case anyone else needs it.
I tried to get the required annotation by making an HTTP request to the server and asking for a JSON response:
HttpResponse<JsonNode> jsonResponse = Unirest.post("http://localhost:9000")
        .queryString("properties", "{\"annotators\":\"tokenize, ssplit, pos, lemma, ner, parse, sentiment\",\"outputFormat\":\"json\"}")
        .body(text)
        .asJson();

String sentTreeStr = jsonResponse.getBody().getObject()
        .getJSONArray("sentences").getJSONObject(0).getString("sentimentTree");
System.out.println(sentTreeStr); //prints out sentiment values for the tree and all subtrees.
But not all annotation data is available. For example, you don't get the probability distribution over all possible sentiment values, only the probability of the most likely sentiment. If you need the full distribution, this is a solution:
GenericAnnotationSerializer serializer = new GenericAnnotationSerializer();
try {
    HttpResponse<InputStream> inStream = Unirest.post("http://localhost:9000")
            .queryString(
                    "properties",
                    "{\"annotators\":\"tokenize, ssplit, pos, lemma, ner, parse, sentiment\","
                            + "\"outputFormat\":\"serialized\","
                            + "\"serializer\": \"edu.stanford.nlp.pipeline.GenericAnnotationSerializer\"}"
            )
            .body(text)
            .asBinary();

    ObjectInputStream in = new ObjectInputStream(inStream.getBody());
    Pair<Annotation, InputStream> deserialized = serializer.read(in);
    Annotation annotation = deserialized.first();
    //And now we are back to a state as if we were not running CoreNLP as a server.
    CoreDocument doc = new CoreDocument(annotation);
    CoreSentence sentence = doc.sentences().get(0);
    //Prints out the same output as shown in the question
    System.out.println(
            RNNCoreAnnotations.getPredictions(sentence.sentimentTree()));
} catch (UnirestException | IOException | ClassNotFoundException ex) {
    Logger.getLogger(SentimentTargetExtractor.class.getName()).log(Level.SEVERE, null, ex);
}

Creating dynamic POST /users calls with Gatling in Scala

I am using Gatling to generate a large number of users to test performance issues on my product. I need to be able to dynamically create users with unique fields (like 'email'). So I'm generating a random number and using it, but it isn't being regenerated for each request, so the email is only unique on the first pass.
object Users {
  def r = new scala.util.Random
  def randNumb() = r.nextInt(Integer.MAX_VALUE)
  val numb = randNumb()

  val createUser = {
    exec(http("Create User")
      .post("/users")
      .body(StringBody(raw"""{"email": "qa_user_$numb@company.com" }""")))
  }
}

val runCreateUsers = scenario("Create Users").exec(Users.createUser)

setUp(
  runCreateUsers.inject(constantUsersPerSec(10) during(1 seconds))
).protocols(httpConf)
Where should I be defining my random numbers, and how can I pass them into createUser?
Your numb is a val on a singleton object, so it is evaluated only once, when Users is first initialized; every request then reuses the same value. Use a feeder instead, so each virtual user gets its own value:
object Users {
  val createUser = exec(http("Create User")
    .post("/users")
    .body(StringBody("""{"email": "qa_user_${numb}@Marqeta.com" }""")))
}

val numbers = Iterator.continually(Map("numb" -> scala.util.Random.nextInt(Int.MaxValue)))

val runCreateUsers = scenario("Create Users")
  .feed(numbers)
  .exec(Users.createUser)
...

SpEL not able to extract attribute value from Scala object

I have a simple Scala class called Case
case class Case(
  @(Id @field) var id: String,
  var state: CaseState = new OpenCaseState,
  var notes: List[CaseNote] = new ArrayList(),
  var assignedGroups: Set[String] = new HashSet(),
  var aclTemplateIds: Set[String] = new HashSet()
) extends Serializable { }
I created an instance of this class called a_case, setting id to 123. I am trying to get the value of the id attribute. I tried this:
var parser: ExpressionParser = new SpelExpressionParser
var context: EvaluationContext = new StandardEvaluationContext(a_case)
var extractedId = parser.parseExpression("'id'").getValue(context).asInstanceOf[String]
All I get is "id" in my extractedId variable. When I try to parse id without the single quotes, I get an exception saying the property id is not found on Case. Am I missing something here, or is this a Scala issue?
SpEL can do that for you if your id has a getter.
I'm not well versed in Scala, but:
BeanProperty
You can annotate vals and vars with the @BeanProperty annotation. This generates getters/setters that look like POJO getter/setter definitions. If you want the isFoo variant, use the BooleanBeanProperty annotation. The ugly foo$_eq becomes
setFoo("newfoo");
getFoo();
https://twitter.github.io/scala_school/java.html
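To make that concrete, here is a minimal sketch (an illustration, not code from the original posts) using a trimmed-down Case with a @BeanProperty-annotated id, so SpEL can resolve the property through the generated getId() and the expression is parsed without quotes:

import scala.beans.BeanProperty

import org.springframework.expression.spel.standard.SpelExpressionParser
import org.springframework.expression.spel.support.StandardEvaluationContext

// Trimmed-down Case for illustration; only the id field matters here
case class Case(@BeanProperty var id: String)

object SpelDemo extends App {
  val aCase = Case("123")

  val parser = new SpelExpressionParser
  val context = new StandardEvaluationContext(aCase)

  // "id" without quotes is a property reference, so SpEL calls getId();
  // "'id'" with quotes is just the string literal "id", which is what the question got back
  val extractedId = parser.parseExpression("id").getValue(context).asInstanceOf[String]
  println(extractedId) // prints 123
}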

TokensRegex: Tokens are null after retokenization

I'm experimenting with Stanford NLP's TokensRegex and am trying to find dimensions (e.g. 100x120) in a text. So my plan is to first retokenize the input to further split these tokens (using the example provided in retokenize.rules.txt) and then to search for the new pattern.
After the retokenization, however, only null values are left in place of the original string:
The top level annotation
[Text=100x120 Tokens=[null-1, null-2, null-3] Sentences=[100x120]]
The retokenization seems to work fine (3 tokens in the result), but the values are lost. What can I do to keep the original values in the token list?
My retokenize.rules.txt file is (as in the demo):
tokens = { type: "CLASS", value:"edu.stanford.nlp.ling.CoreAnnotations$TokensAnnotation" }
options.matchedExpressionsAnnotationKey = tokens;
options.extractWithTokens = TRUE;
options.flatten = TRUE;
ENV.defaults["ruleType"] = "tokens"
ENV.defaultStringPatternFlags = 2
ENV.defaultResultAnnotationKey = tokens
{ pattern: ( /\d+(x|X)\d+/ ), result: Split($0[0], /x|X/, TRUE) }
The main method:
public static void main(String[] args) throws IOException {
    //...
    text = "100x120";

    Properties properties = new Properties();
    properties.setProperty("tokenize.language", "de");
    properties.setProperty("annotators", "tokenize,retokenize,ssplit,pos,lemma,ner");
    properties.setProperty("customAnnotatorClass.retokenize", "edu.stanford.nlp.pipeline.TokensRegexAnnotator");
    properties.setProperty("retokenize.rules", "retokenize.rules.txt");

    StanfordCoreNLP stanfordPipeline = new StanfordCoreNLP(properties);
    runPipeline(stanfordPipeline, text);
}
And the pipeline:
public static void runPipeline(StanfordCoreNLP pipeline, String text) {
    Annotation annotation = new Annotation(text);
    pipeline.annotate(annotation);
    out.println();
    out.println("The top level annotation");
    out.println(annotation.toShorterString());
    //...
}
Thanks for letting us know. The CoreAnnotations.ValueAnnotation is not being populated, and we'll update TokensRegex to populate the field.
Regardless, you should be able to use TokensRegex to retokenize as you have planned. Most of the pipeline does not depend on the ValueAnnotation and uses the CoreAnnotations.TextAnnotation instead. You can use the CoreAnnotations.TextAnnotation to get the text for the new tokens (each token is a CoreLabel, so you can access it using token.word() as well).
See TokensRegexRetokenizeDemo for example code on how to get the different annotations out.
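For illustration, a minimal Scala sketch (not from the original answer) of reading the re-split tokens through word()/TextAnnotation, which stays populated even though toShorterString() prints null for the missing ValueAnnotation:

import scala.collection.JavaConverters._

import edu.stanford.nlp.ling.CoreAnnotations
import edu.stanford.nlp.pipeline.{Annotation, StanfordCoreNLP}

def printRetokenizedWords(pipeline: StanfordCoreNLP, text: String): Unit = {
  val annotation = new Annotation(text)
  pipeline.annotate(annotation)
  // Each token is a CoreLabel; word() is backed by TextAnnotation, which is
  // still populated after retokenization even though ValueAnnotation is not
  val tokens = annotation.get(classOf[CoreAnnotations.TokensAnnotation]).asScala
  tokens.foreach(token => println(token.word()))  // expected: "100", "x", "120" for the rules above
}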
