How to make the Stanford CoreNLP API fast? - performance

I use the Stanford CoreNLP API; my code is below:
public class StanfordCoreNLPTool {

    public static StanfordCoreNLPTool instance;
    private Annotation annotation = new Annotation();
    private Properties props = new Properties();
    private PrintWriter out = new PrintWriter(System.out);

    private StanfordCoreNLPTool() {}

    public void startPipeLine(String question) {
        props = new Properties();
        props.put("annotators",
                "tokenize, ssplit, pos, lemma, ner, parse, mention, dcoref, sentiment");
        annotation = new Annotation(question);
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
        // run all the selected annotators on this text
        pipeline.annotate(annotation);
        pipeline.prettyPrint(annotation, out);
    }

    public static StanfordCoreNLPTool getInstance() {
        if (instance == null) {
            synchronized (StanfordCoreNLPTool.class) {
                if (instance == null) {
                    instance = new StanfordCoreNLPTool();
                }
            }
        }
        return instance;
    }
}
It works fine, but it takes a lot of time. Consider that we are using it in a question answering system, so for every new input the pipeline annotation must be run.
As you know, each time some rules have to be fetched, some models loaded, and so on, in order to produce a sentence annotated with NLP tags such as POS, NER, etc.
First of all, I wanted to solve the problem with RMI and EJB, but that failed because, regardless of the Java architecture, for every new sentence the pipeline annotation is loaded from scratch.
Look at this log that is printed in my IntelliJ console:
Reading POS tagger model from
edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger
... done [6.1 sec].
Loading classifier from
edu/stanford/nlp/models/ner/english.all.3class.distsim.crf.ser.gz ...
done [8.0 sec].
Loading classifier from
edu/stanford/nlp/models/ner/english.muc.7class.distsim.crf.ser.gz ...
done [8.7 sec].
Loading classifier from
edu/stanford/nlp/models/ner/english.conll.4class.distsim.crf.ser.gz
... done [5.0 sec].
INFO: Read 25 rules [main] INFO
edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator parse
[main] INFO edu.stanford.nlp.parser.common.ParserGrammar - Loading
parser from serialized file
edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz ... done [4.1
sec].
Please help me find a solution to make the program faster.

The big problem here is that your startPipeLine() method combines two things that should be separated:
Constructing an annotation pipeline, including loading large annotator models
Annotating a particular piece of text (a question) and printing the result
For making things fast, these two things must be separated, as loading an annotation pipeline is slow but should be done only once, while annotating each piece of text is reasonably fast.
The rest is all minor details but FWIW:
All the fanciness of the double-checked locking that constructs the StanfordCoreNLPTool singleton does nothing if you create a new pipeline for every piece of text you annotate. You should construct the annotation pipeline only once. It might be reasonable to do that as a singleton, but it's probably sufficient to do the initialization in the constructor once you distinguish pipeline construction from text annotation.
The annotation variable can and should be private to the method that annotates one piece of text.
If after these changes you still want model loading to be faster, you could buy an SSD! (My 2012 MBP with an SSD loads models more than 5 times faster than you are reporting.)
If you want to further speed annotation, the main tool is to cut out annotators you don't need or to choose faster versions. E.g., if you're not using coreference, you could delete mention and dcoref.
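To illustrate the separation, here is a minimal sketch of the class from the question with the pipeline built once in the constructor and reused for every piece of text. It is a sketch, not a drop-in replacement; the annotator list and names are taken from the question:
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;

import java.io.PrintWriter;
import java.util.Properties;

public class StanfordCoreNLPTool {

    private static StanfordCoreNLPTool instance;
    // Built once; loading the annotator models is the slow part.
    private final StanfordCoreNLP pipeline;
    private final PrintWriter out = new PrintWriter(System.out, true);

    private StanfordCoreNLPTool() {
        Properties props = new Properties();
        props.put("annotators",
                "tokenize, ssplit, pos, lemma, ner, parse, mention, dcoref, sentiment");
        pipeline = new StanfordCoreNLP(props);
    }

    // Annotating one question reuses the already-loaded pipeline.
    public void startPipeLine(String question) {
        Annotation annotation = new Annotation(question);
        pipeline.annotate(annotation);
        pipeline.prettyPrint(annotation, out);
    }

    public static synchronized StanfordCoreNLPTool getInstance() {
        if (instance == null) {
            instance = new StanfordCoreNLPTool();
        }
        return instance;
    }
}
With this structure the model loading you see in the log happens only on the first getInstance() call; every later startPipeLine() call only pays the per-sentence annotation cost.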

Related

Spring Batch multiple readers for different DB's

I have an existing Spring Batch project that reads data from MySQL or ArangoDB (a NoSQL database), based on a feature-toggle decision made during startup, does some processing, and writes back to MySQL/ArangoDB.
The reader configuration for MySQL is something like this:
@Bean
@Primary
@StepScope
public HibernatePagingItemReader reader(
        @Value("#{jobParameters[oldMetadataDefinitionId]}") Long oldMetadataDefinitionId) {
    Map<String, Object> queryParameters = new HashMap<>();
    queryParameters.put(Constants.OLD_METADATA_DEFINITION_ID, oldMetadataDefinitionId);
    HibernatePagingItemReader<Long> reader = new HibernatePagingItemReader<>();
    reader.setUseStatelessSession(false);
    reader.setPageSize(250);
    reader.setParameterValues(queryParameters);
    reader.setSessionFactory(((HibernateEntityManagerFactory) entityManagerFactory.getObject()).getSessionFactory());
    return reader;
}
and I have another Arango reader like this:
@Bean
@StepScope
public ListItemReader arangoReader(
        @Value("#{jobParameters[oldMetadataDefinitionId]}") Long oldMetadataDefinitionId) {
    List<InstanceDTO> instanceList = new ArrayList<InstanceDTO>();
    PersistenceService arangoPersistence = arangoConfiguration.getPersistenceService();
    List<Long> instanceIds = arangoPersistence.getDefinitionInstanceIds(oldMetadataDefinitionId);
    instanceIds.forEach((instanceId) -> {
        InstanceDTO instanceDto = new InstanceDTO();
        instanceDto.setDefinitionID(oldMetadataDefinitionId);
        instanceDto.setInstanceID(instanceId);
        instanceList.add(instanceDto);
    });
    return new ListItemReader(instanceList);
}
and my step configuration is this:
@Bean
@SuppressWarnings("unchecked")
public Step InstanceMergeStep(ListItemReader arangoReader, ItemWriter<MetadataInstanceDTO> arangoWriter,
        ItemReader<Long> mysqlReader, ItemWriter<Long> mysqlWriter) {
    Step step = null;
    if (arangoUsage) {
        step = steps.get("arangoInstanceMergeStep")
                .<Long, Long>chunk(1)
                .reader(arangoReader)
                .writer(arangoWriter)
                .faultTolerant()
                .skip(Exception.class)
                .skipLimit(10)
                .taskExecutor(stepTaskExecutor())
                .build();
        ((TaskletStep) step).registerChunkListener(chunkListener);
    } else {
        step = steps.get("mysqlInstanceMergeStep")
                .<Long, Long>chunk(1)
                .reader(mysqlReader)
                .writer(mysqlWriter)
                .faultTolerant()
                .skip(Exception.class)
                .skipLimit(failedSkipLimit)
                .taskExecutor(stepTaskExecutor())
                .build();
        ((TaskletStep) step).registerChunkListener(chunkListener);
    }
    return step;
}
The MySQL reader has pagination support through HibernatePagingItemReader so that it will handle millions of items without any memory issue.
I want to implement the same pagination support for the Arango reader, fetching only 250 documents per iteration. How can I modify the Arango reader code to achieve this?
First of all, the documentation of ListItemReader says that it is "useful for testing", so don't use it in production. Also, return ItemReader from all your reader beans instead of the actual concrete types.
Having said that, neither the Spring Batch API nor Spring Data seems to support ArangoDB. The closest that I could find is this
(I have not worked with ArangoDB before).
So in my opinion, you have to write your own custom Arango reader that implements paging, possibly by extending the abstract class org.springframework.batch.item.database.AbstractPagingItemReader.
If it's not doable by extending the above class, you might have to implement everything from scratch. All of the paging readers in the Spring Batch API extend this abstract class, including HibernatePagingItemReader.
Also, remember that the Arango record set should have some kind of ordering to implement pagination, so we can distinguish between page 0, page 1, etc. (similar to an ORDER BY clause, the BETWEEN operator, and less-than/greater-than operators in SQL; a FETCH FIRST XXX ROWS or LIMIT clause kind of thing would be needed too).
Implementing it on your own is not a very tough task: you have to count the total possible items, order them, divide them into pages, and fetch only one page at a time.
Look at the API for implementations like HibernatePagingItemReader to get ideas.
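As a rough illustration (not tested against ArangoDB), here is a minimal sketch of such a reader extending AbstractPagingItemReader. It assumes a hypothetical overload of getDefinitionInstanceIds that accepts an offset and a limit and returns IDs in a stable order; InstanceDTO and PersistenceService are the types from your question:
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

import org.springframework.batch.item.database.AbstractPagingItemReader;

public class ArangoPagingItemReader extends AbstractPagingItemReader<InstanceDTO> {

    private final PersistenceService arangoPersistence;
    private final Long definitionId;

    public ArangoPagingItemReader(PersistenceService arangoPersistence, Long definitionId) {
        this.arangoPersistence = arangoPersistence;
        this.definitionId = definitionId;
        setName("arangoPagingItemReader");
        setPageSize(250); // fetch 250 documents per iteration
    }

    @Override
    protected void doReadPage() {
        if (results == null) {
            results = new CopyOnWriteArrayList<>();
        } else {
            results.clear();
        }
        // Hypothetical paged query: ordered IDs, skipping getPage() * getPageSize()
        // entries and returning at most getPageSize() of them.
        List<Long> instanceIds = arangoPersistence.getDefinitionInstanceIds(
                definitionId, getPage() * getPageSize(), getPageSize());
        for (Long instanceId : instanceIds) {
            InstanceDTO dto = new InstanceDTO();
            dto.setDefinitionID(definitionId);
            dto.setInstanceID(instanceId);
            results.add(dto);
        }
    }

    @Override
    protected void doJumpToPage(int itemIndex) {
        // Nothing to do: doReadPage() computes the offset from getPage().
    }
}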
Hope it helps !!

How can I detect named entities that have more than 1 word using CoreNLP's RegexNER?

I am using the RegexNER annotator in CoreNLP and some of my named entities consist of multiple words. Excerpt from my mapping file:
RAF inhibitor DRUG_CLASS
Gilbert's syndrome DISEASE
The first one gets detected, but each word gets the annotation DRUG_CLASS separately, and there seems to be no way to link the words, such as an NER ID that both words would share.
The second case does not get detected at all, and that's probably because the tokenizer treats the 's after Gilbert as a separate token. Since RegexNER depends on the tokenization, I can't really get around it.
Any suggestions to resolve these cases?
If you use the entitymentions annotator, it will create entity mentions out of consecutive tokens with the same NER tag. The downside is that if two entities of the same type are side by side, they will be joined together. We are working on improving the NER system, so we may include a new model that finds the boundaries of distinct mentions in these cases; hopefully this will go into Stanford CoreNLP 3.8.0.
Here is some sample code for accessing the entity mentions:
package edu.stanford.nlp.examples;

import edu.stanford.nlp.pipeline.*;
import edu.stanford.nlp.ling.*;
import edu.stanford.nlp.util.*;

import java.util.*;

public class EntityMentionsExample {

    public static void main(String[] args) {
        Annotation document =
                new Annotation("John Smith visited Los Angeles on Tuesday.");
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner,entitymentions");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
        pipeline.annotate(document);
        for (CoreMap entityMention : document.get(CoreAnnotations.MentionsAnnotation.class)) {
            System.out.println(entityMention);
            System.out.println(entityMention.get(CoreAnnotations.TextAnnotation.class));
        }
    }
}
If you just have your rules tokenized the same way the tokenizer tokenizes the text, it will work fine; for instance, the rule should be Gilbert 's syndrome.
So you could just run the tokenizer on all your text patterns, and this problem will go away.
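For example, the mapping-file excerpt from the question would then become something like this (tab-separated, with each pattern pre-tokenized):
RAF inhibitor	DRUG_CLASS
Gilbert 's syndrome	DISEASE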

Stanford ParserAnnotator doesn't generate annotations

I'm beginning to learn the Stanford CoreNLP Java API, and am trying to print the syntax tree of a sentence. The syntax tree is supposed to be generated by the ParserAnnotator. In my code (posted below), the ParserAnnotator runs without errors but doesn't generate anything. The error only shows up when the code tries to get the label of the tree's root node, and the tree is revealed to be null. The components that run before it generate their annotations without any problems.
There was one other person on SO who had a problem with the ParserAnnotator, but the issue was with memory. I've increased the memory that I allow Eclipse to use, but the behavior is the same. Running the code in the debugger also did not yield any errors.
Some background information: The sentence I used was "This is a random sentence." I recently upgraded from Windows 8.1 to Windows 10.
public static void main(String[] args) {
    String sentence = "This is a random sentence.";
    Annotation doc = initStanford(sentence);
    Tree syntaxTree = doc.get(TreeAnnotation.class);
    printTreePreorder(syntaxTree);
}

private static Annotation initStanford(String sentence) {
    StanfordCoreNLP pipeline = pipeline("tokenize, ssplit, parse");
    Annotation document = new Annotation(sentence);
    pipeline.annotate(document);
    return document;
}

private static StanfordCoreNLP pipeline(String components) {
    Properties props = new Properties();
    props.put("annotators", components);
    return new StanfordCoreNLP(props);
}

public static void printTreePreorder(Tree tree) {
    System.out.println(tree.label());
    for (int i = 0; i < tree.numChildren(); i++) {
        printTreePreorder(tree.getChild(i));
    }
}
You're trying to get the tree off of the document (Annotation), rather than the sentences (CoreMap). You can get the sentences with:
Tree tree = doc.get(SentencesAnnotation.class).get(0).get(TreeAnnotation.class);
I can also shamelessly plug the Simple CoreNLP API:
Tree tree = new Sentence("this is a sentence").parse()
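Applied to the code in the question, the fix is a small change in main. This is just a sketch: SentencesAnnotation and TreeAnnotation are the inner classes of edu.stanford.nlp.ling.CoreAnnotations and edu.stanford.nlp.trees.TreeCoreAnnotations, and initStanford/printTreePreorder are the methods you already have:
import edu.stanford.nlp.ling.CoreAnnotations.SentencesAnnotation;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.trees.Tree;
import edu.stanford.nlp.trees.TreeCoreAnnotations.TreeAnnotation;

public static void main(String[] args) {
    String sentence = "This is a random sentence.";
    Annotation doc = initStanford(sentence);
    // Read the tree from the first sentence, not from the document itself.
    Tree syntaxTree = doc.get(SentencesAnnotation.class)
                         .get(0)
                         .get(TreeAnnotation.class);
    printTreePreorder(syntaxTree);
}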

How to register a Renderer with CRaSH

After reading about the remote shell in the Spring Boot documentation I started playing around with it. I implemented a new Command that produces a Stream of one of my database entities called company.
This works fine, so now I want to output my stream of companies in the console. By default this is done by calling toString(). While this seems reasonable, there is also a way to get nicer results by using a Renderer.
Implementing one should be straightforward, as I can delegate most of the work to one of the already existing ones; I use MapRenderer.
class CompanyRenderer extends Renderer<Company> {

    private final mapRenderer = new MapRenderer()

    @Override Class<Company> getType() { Company }

    @Override LineRenderer renderer(Iterator<Company> stream) {
        def list = []
        stream.forEachRemaining({
            list.add([id: it.id, name: it.name])
        })
        return mapRenderer.renderer(list.iterator())
    }
}
As you can see, I just take some fields from my entity, put them into a Map, and then delegate to an instance of MapRenderer to do the real work.
TL;DR
The only problem is: how do I register my Renderer with CRaSH?
Links
Spring Boot documentation http://docs.spring.io/spring-boot/docs/current/reference/html/production-ready-remote-shell.html
CRaSH documentation (not helping) http://www.crashub.org/1.3/reference.html#_renderers

Save Spark Dataframe into Elasticsearch - Can’t handle type exception

I have designed a simple job to read data from MySQL and save it in Elasticsearch with Spark.
Here is the code:
JavaSparkContext sc = new JavaSparkContext(
        new SparkConf().setAppName("MySQLtoEs")
                .set("es.index.auto.create", "true")
                .set("es.nodes", "127.0.0.1:9200")
                .set("es.mapping.id", "id")
                .set("spark.serializer", KryoSerializer.class.getName()));
SQLContext sqlContext = new SQLContext(sc);

// Data source options
Map<String, String> options = new HashMap<>();
options.put("driver", MYSQL_DRIVER);
options.put("url", MYSQL_CONNECTION_URL);
options.put("dbtable", "OFFERS");
options.put("partitionColumn", "id");
options.put("lowerBound", "10001");
options.put("upperBound", "499999");
options.put("numPartitions", "10");

// Load MySQL query result as DataFrame
LOGGER.info("Loading DataFrame");
DataFrame jdbcDF = sqlContext.load("jdbc", options);
DataFrame df = jdbcDF.select("id", "title", "description",
        "merchantId", "price", "keywords", "brandId", "categoryId");
df.show();
LOGGER.info("df.count : " + df.count());
EsSparkSQL.saveToEs(df, "offers/product");
You can see the code is very straightforward. It reads the data into a DataFrame, selects some columns and then performs a count as a basic action on the Dataframe. Everything works fine up to this point.
Then it tries to save the data into Elasticsearch, but it fails because it cannot handle some type. You can see the error log here.
I'm not sure about why it can't handle that type. Does anyone know why this is occurring?
I'm using Apache Spark 1.5.0, Elasticsearch 1.4.4 and elasticsearch-hadoop 2.1.1.
EDIT:
I have updated the gist link with a sample dataset along with the source code.
I have also tried to use the elasticsearch-hadoop dev builds, as mentioned by @costin on the mailing list.
The answer for this one was tricky, but thanks to samklr, I managed to figure out what the problem was.
The solution isn't straightforward, though, and involves some “unnecessary” transformations.
First, let's talk about serialization.
There are two aspects of serialization to consider in Spark: serialization of data and serialization of functions. In this case, it's about data serialization and thus deserialization.
From Spark’s perspective, the only thing required is setting up serialization - Spark relies by default on Java serialization which is convenient but fairly inefficient. This is the reason why Hadoop itself introduced its own serialization mechanism and its own types - namely Writables. As such, InputFormat and OutputFormats are required to return Writables which, out of the box, Spark does not understand.
With the elasticsearch-spark connector one must enable a different serialization (Kryo) which handles the conversion automatically and also does this quite efficiently.
conf.set("spark.serializer","org.apache.spark.serializer.KryoSerializer")
Moreover, Kryo does not require that a class implement a particular interface to be serialized, which means POJOs can be used in RDDs without any further work beyond enabling Kryo serialization.
That said, @samklr pointed out to me that Kryo needs classes to be registered before they are used.
This is because Kryo writes a reference to the class of the object being serialized (one reference is written for every object written), which is just an integer identifier if the class has been registered but is the full classname otherwise. Spark registers Scala classes and many other framework classes (like Avro Generic or Thrift classes) on your behalf.
Registering classes with Kryo is straightforward. Create a class that implements KryoRegistrator and override the registerClasses() method:
public class MyKryoRegistrator implements KryoRegistrator, Serializable {

    @Override
    public void registerClasses(Kryo kryo) {
        // Product POJO associated to a product Row from the DataFrame
        kryo.register(Product.class);
    }
}
Finally, in your driver program, set the spark.kryo.registrator property to the fully qualified classname of your KryoRegistrator implementation:
conf.set("spark.kryo.registrator", "MyKryoRegistrator")
Secondly, even though the Kryo serializer was set and the class registered, with the changes made in Spark 1.5, Elasticsearch still couldn't deserialize the DataFrame, because the connector couldn't infer the SchemaType of the DataFrame.
So I had to convert the DataFrame to a JavaRDD:
JavaRDD<Product> products = df.javaRDD().map(new Function<Row, Product>() {
    public Product call(Row row) throws Exception {
        long id = row.getLong(0);
        String title = row.getString(1);
        String description = row.getString(2);
        int merchantId = row.getInt(3);
        double price = row.getDecimal(4).doubleValue();
        String keywords = row.getString(5);
        long brandId = row.getLong(6);
        int categoryId = row.getInt(7);
        return new Product(id, title, description, merchantId, price, keywords, brandId, categoryId);
    }
});
Now the data is ready to be written into Elasticsearch:
JavaEsSpark.saveToEs(products, "test/test");
References:
Elasticsearch's Apache Spark support documentation.
Hadoop: The Definitive Guide, 4th ed., Chapter 19 (Spark), Tom White.
User samklr.
