Multi-language CoreNLP - stanford-nlp

Is it possible to have more than one Stanford CoreNLP instance, each of them using a different language, in the same Java project?
In the CoreNLP documentation, it seems that the only way to change the language is to add a different Maven dependency: what if I want to use all of them together?

If you include a dependency for each language, you will get the model files for Chinese, German, and Spanish, and with them all the resources you need to run on those languages.
Within your code, you determine the language by the .properties file you use to build the StanfordCoreNLP pipeline object. So you are free to build different pipelines with different .properties files.
The appropriate .properties files for the various languages can be found in the corresponding model jars.
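For example, a minimal sketch of running two pipelines side by side in one JVM could look like this (the annotator list and the name of the German properties file are assumptions; check the models jar you depend on for the exact resource name):

import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import java.io.InputStream;
import java.util.Properties;

public class MultiLanguagePipelines {
    public static void main(String[] args) throws Exception {
        // English pipeline built from explicit properties (uses the default English models)
        Properties en = new Properties();
        en.setProperty("annotators", "tokenize,ssplit,pos");
        StanfordCoreNLP english = new StanfordCoreNLP(en);

        // German pipeline built from the .properties file shipped in the German models jar
        // (resource name assumed here; look inside the jar for the exact file name)
        Properties de = new Properties();
        try (InputStream in = MultiLanguagePipelines.class.getClassLoader()
                .getResourceAsStream("StanfordCoreNLP-german.properties")) {
            if (in == null) {
                throw new IllegalStateException("German .properties file not found on the classpath");
            }
            de.load(in);
        }
        StanfordCoreNLP german = new StanfordCoreNLP(de);

        // Each pipeline annotates with its own models, independently of the other
        Annotation enDoc = new Annotation("The quick brown fox jumps over the lazy dog.");
        english.annotate(enDoc);

        Annotation deDoc = new Annotation("Der schnelle braune Fuchs springt.");
        german.annotate(deDoc);
    }
}

Building each pipeline once and reusing it is the usual approach, since every StanfordCoreNLP instance loads its own models into memory.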

Related

How to get training data and models of Stanford CoreNLP?

I downloaded Stanford CoreNLP from the official website and GitHub.
In the guides, it is stated:
On the Stanford NLP machines, training data is available in
/u/nlp/data/depparser/nn/data
or
HERE
The list of models currently distributed is:
edu/stanford/nlp/models/parser/nndep/english_UD.gz (default, English,
Universal Dependencies)
It may sound like a silly question, but I cannot find such files and folders in any distribution.
Where can I find the source data and models officially distributed with Stanford CoreNLP?
We don't distribute most of the CoreNLP training data. Quite a lot of it is non-free, licensed data produced by other people (such as LDC https://www.ldc.upenn.edu/).
However, a huge number of free dependency treebanks are available through the Universal Dependencies project: https://universaldependencies.org/.
All the Stanford CoreNLP models are available in the "models" jar files. edu/stanford/nlp/models/parser/nndep/english_UD.gz is in stanford-corenlp-3.9.2-models.jar, which is in the zip file download http://nlp.stanford.edu/software/stanford-corenlp-full-2018-10-05.zip and can also be found on Maven here: http://central.maven.org/maven2/edu/stanford/nlp/stanford-parser/3.9.2/.
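If you want to verify that the models jar actually ends up on your classpath after adding it as a dependency, a small sketch like this checks whether the nndep model mentioned above is resolvable as a classpath resource:

import java.io.InputStream;

public class ModelCheck {
    public static void main(String[] args) throws Exception {
        // The path inside the models jar doubles as a classpath resource path
        String path = "edu/stanford/nlp/models/parser/nndep/english_UD.gz";
        try (InputStream in = ModelCheck.class.getClassLoader().getResourceAsStream(path)) {
            System.out.println(in != null
                    ? "Model found on classpath"
                    : "Model missing - is the models jar declared as a dependency?");
        }
    }
}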

Maven plugin to test i18n properties

Does anybody know of a Maven plugin that tests all of my language properties files? I want to verify that every language file in my project contains all the keys.
Use cases:
Figure out if someone added a key to the default file and forgot to add it to any of the other language files.
Figure out if someone dropped a key in one of the files and forgot to drop it in all the other files.
It is not that difficult to write my own small Maven plugin, but I would prefer an already existing one. I haven't found one so far.
Or: How do you test your files? Manually / automated / not at all?
Eric
You should give the i18n-maven-plugin a try. During the build (process-resources phase), all your Java classes and JSPs will be parsed to find all the i18n keys in your project (according to your pom).
The plugin will add any i18n keys that are missing in your bundles. There is also a strict mode that removes from your bundles all the i18n keys that are no longer found in your application, so you can be sure that 100% of the keys are both used in your app and translated in every language.
For a working, real-life example, feel free to check out this application:
svn checkout https://svn.codelutin.com/wao/tags/wao-4.0.4/
mvn clean process-resources -Di18n.verbose
Funny - I gave my project the same name a couple of years ago. https://github.com/hoereth/i18n-maven-plugin
This plugin serves me well on numerous projects. It turns the concept of properties files around 180 degrees: you maintain a well-structured XML file with your translations, and the plugin creates all the properties files for you at build time. No need for validation at that point. It can also create a Java class which holds all translation keys, thus enabling you to compile-check your translation calls.
Believe me - this takes away the pain of translating from a technical point of view. :)
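If you would rather not add a plugin at all, the consistency check described in the question can also be hand-rolled, for example as a small program or unit test run during the build. This is only a sketch; the bundle directory and the messages_*.properties naming pattern are assumptions:

import java.io.File;
import java.io.FileInputStream;
import java.util.HashSet;
import java.util.Properties;
import java.util.Set;

public class BundleKeyCheck {
    public static void main(String[] args) throws Exception {
        // Assumed layout: src/main/resources/messages.properties (default bundle)
        // plus messages_de.properties, messages_fr.properties, ...
        File dir = new File("src/main/resources");
        Set<String> defaultKeys = loadKeys(new File(dir, "messages.properties"));
        for (File f : dir.listFiles((d, name) -> name.startsWith("messages_") && name.endsWith(".properties"))) {
            Set<String> keys = loadKeys(f);
            Set<String> missing = new HashSet<>(defaultKeys);
            missing.removeAll(keys);      // keys in the default file that this language lacks
            Set<String> extra = new HashSet<>(keys);
            extra.removeAll(defaultKeys); // keys this language still has but the default file dropped
            if (!missing.isEmpty() || !extra.isEmpty()) {
                System.out.println(f.getName() + " missing=" + missing + " extra=" + extra);
            }
        }
    }

    private static Set<String> loadKeys(File file) throws Exception {
        Properties p = new Properties();
        try (FileInputStream in = new FileInputStream(file)) {
            p.load(in);
        }
        return p.stringPropertyNames();
    }
}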

Conflict between Stanford Parser & Stanford POS tagger

I am working on a project which requires me to add POS tags to an input string. I am also going to use grammatical dependency structure generated by the Stanford parser for later processing.
A couple of things to point out before I jump to my problem:
For POS tagging I am using http://nlp.stanford.edu/software/tagger.shtml (Version 3.3.1)
For grammatical dependency generation I am using http://nlp.stanford.edu/software/lex-parser.shtml#Download (version 3.3.1)
I included both of these jars in my classpath. (By "include" I mean I am using Maven to pull the Stanford Parser jar from the Maven repository and adding the POS tagger jar using the steps mentioned later.)
Now the problem is that whenever I try to get the POS tags for an input string, I get the following error:
Exception in thread "main" java.lang.NoSuchMethodError: edu.stanford.nlp.tagger.maxent.TaggerConfig.getTaggerDataInputStream(Ljava/lang/String;)Ljava/io/DataInputStream;
My intuition says that this is because the Stanford Parser jar also has a maxent package that contains the TaggerConfig class. Every time I ask for POS tags for a string, the program looks into the Stanford Parser jar instead of the Stanford POS tagger jar, hence the error.
I am using Maven and couldn't find the POS tagger jar on Maven Central, so I added it to my local Maven repository using the instructions at http://charlie.cu.cc/2012/06/how-add-external-libraries-maven/.
I would really appreciate it if anyone could point out a solution to this problem.
You are using two jar files. Go to the BuildPath and reverse the order of your imported jars. That should fix it.
The method Java is complaining about was in releases of the Stanford POS Tagger in the 2009-2011 period, but is not in any recent (or ancient) release.
So what this means is that you have another jar on your class path which contains an old version of the Stanford POS tagger hidden inside it, and its MaxentTagger has been invoked, not the one from the v3.3.1 jars (due to class path search order). You should find it and complain.
The most common case recently has been the CMU ark-tweet-nlp.jar. See: http://nlp.stanford.edu/software/corenlp-faq.shtml#nosuchmethoderror.
The overlapping classes of the Stanford releases are not a problem: provided you use the same version of the tagger and parser, they are identical.
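A quick way to find out which jar the conflicting class is actually being loaded from is to print its code source; this is a standard JVM diagnostic and not specific to the Stanford tools:

public class WhichJar {
    public static void main(String[] args) throws Exception {
        // Prints the jar (or directory) the class was loaded from,
        // which reveals the stale copy that shadows the 3.3.1 tagger.
        Class<?> c = Class.forName("edu.stanford.nlp.tagger.maxent.TaggerConfig");
        System.out.println(c.getProtectionDomain().getCodeSource().getLocation());
    }
}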

Is there an independent Maven artifact that can be used as a DocBook generation tool?

There is a Maven plugin, docbkx-maven-plugin, which can generate PDF, HTML, and other formats within a Maven project. The problem is that you need to do a lot of configuration for the XSL template and CSS, and if you want to highlight source code or support other languages such as Chinese, you need even more configuration. So, is there an independent Maven artifact that can be used as a DocBook generation tool and includes the following features:
the DTD files bundled in
code highlighting support
non-English support, i.e. it can generate PDFs with content in languages such as Chinese
a default XSL template for generating HTML and PDF

Antlr4 maven plugin cannot find grammar files in different directories

I'm using the antlr4-maven-plugin to build my Maven project, which uses ANTLR 4:
<groupId>org.antlr</groupId>
<artifactId>antlr4-maven-plugin</artifactId>
<version>4.0</version>
I started with one grammar file, got my pom.xml set up, and everything was building nicely.
Then I decided to split my grammar into logical parts, and therefore used several grammar files in different directories (so the generated code would be put into separate packages), but still all under the same root src/main/antlr4 directory.
I use the import statement in my "top" level grammar file to import the other required files.
But now Maven gives me the following error when trying to build:
[ERROR] Message{errorType=CANNOT_FIND_IMPORTED_GRAMMAR
Why can't ANTLR find the other files that I am importing?
Thanks,
Ryan.
I had the same problem and solved it with the following configuration:
<libDirectory>${basedir}/src/main/antlr4/yourGrammarDirectory</libDirectory>
Look here: How to import grammar in Antlr4 to build with maven
For the ANTLR 4.0 release, no testing was performed on imports across multiple directories.
Due to the limited benefits (IMO) provided by the current grammar import mechanism, this is currently a very low priority feature. Currently, sharing grammar files by using imports will not allow you to share code for the generated parsers or the parse trees they produce. I've been using ANTLR for years on dozens of products (including commercial releases), and not once have I found composite grammars to provide more benefits than trouble. (Note that I'm talking about the import statement here. Separating lexer and parser grammars into separate files in the same directory is frequently beneficial and my preferred way to work.)
