OpenNLP: Unable to locate the model file for Lemmatizer

Summary: Unable to find the model file used for Lemmatizer (english-lemmatizer.bin)
Details: OpenNLP Tools Models appears to be a comprehensive repository for the various models used by the different components of the Apache OpenNLP library. However, I am unable to find the model file en-lemmatizer.bin, which is used with the lemmatizer. The Apache OpenNLP Developer Manual provides the following code snippet for the Lemmatization step:
try (InputStream dictLemmatizer = new FileInputStream("english-lemmatizer.bin")) {
    // ... load the model from the stream
}
However, unlike other model files, I am just not able to find the location of this model file. Any pointers would be appreciated.

The book "Natural Language Processing with Java Cookbook' by Richard M. Reese provides a good answer. For some reason en-lemmatizer.bin is not available for direct download from the web, but it can be created using the following steps:
1. Download and untar apache-opennlp-1.9.0-bin.tar (https://opennlp.apache.org/download.html).
2. Go to the URL for the Lemmatizer Training File and save the text content as en-lemmatizer.dict.
3. Go to the bin directory (from step 1, after untarring) and execute the following command:
opennlp LemmatizerTrainerME -model en-lemmatizer.bin -lang en -data /path/to/en-lemmatizer.dict -encoding UTF-8
Note: Be prepared to handle the following error:
Computing event counts... Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
This usually just means the training JVM needs more heap, e.g. running the trainer with a larger -Xmx setting.
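If you prefer to stay in Java rather than shelling out to the opennlp script, the same training can be done through the OpenNLP API. A minimal sketch (my own, not from the book), assuming OpenNLP 1.9.x on the classpath and the en-lemmatizer.dict file from step 2:

import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

import opennlp.tools.lemmatizer.LemmaSample;
import opennlp.tools.lemmatizer.LemmaSampleStream;
import opennlp.tools.lemmatizer.LemmatizerFactory;
import opennlp.tools.lemmatizer.LemmatizerME;
import opennlp.tools.lemmatizer.LemmatizerModel;
import opennlp.tools.util.MarkableFileInputStreamFactory;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;
import opennlp.tools.util.TrainingParameters;

public class TrainLemmatizer {
    public static void main(String[] args) throws Exception {
        // Stream the training dictionary line by line as UTF-8
        ObjectStream<String> lines = new PlainTextByLineStream(
                new MarkableFileInputStreamFactory(new File("en-lemmatizer.dict")),
                StandardCharsets.UTF_8);
        try (ObjectStream<LemmaSample> samples = new LemmaSampleStream(lines)) {
            LemmatizerModel model = LemmatizerME.train(
                    "en", samples, TrainingParameters.defaultParams(),
                    new LemmatizerFactory());
            // Write out the same en-lemmatizer.bin the command line would produce
            try (OutputStream out = new BufferedOutputStream(
                    new FileOutputStream("en-lemmatizer.bin"))) {
                model.serialize(out);
            }
        }
    }
}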

You want en-lemmatizer.bin, not english-lemmatizer.txt.
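Once you have en-lemmatizer.bin, loading it and lemmatizing looks roughly like this (a sketch, assuming OpenNLP 1.9.x; the tokens, tags, and class name are illustrative):

import java.io.FileInputStream;
import java.io.InputStream;
import opennlp.tools.lemmatizer.LemmatizerME;
import opennlp.tools.lemmatizer.LemmatizerModel;

public class LemmatizerDemo {
    public static void main(String[] args) throws Exception {
        try (InputStream modelIn = new FileInputStream("en-lemmatizer.bin")) {
            LemmatizerModel model = new LemmatizerModel(modelIn);
            LemmatizerME lemmatizer = new LemmatizerME(model);

            // Tokens and their POS tags, normally produced by
            // TokenizerME and POSTaggerME respectively
            String[] tokens = { "The", "dogs", "were", "running" };
            String[] tags   = { "DT", "NNS", "VBD", "VBG" };

            String[] lemmas = lemmatizer.lemmatize(tokens, tags);
            for (int i = 0; i < tokens.length; i++) {
                System.out.println(tokens[i] + " -> " + lemmas[i]);
            }
        }
    }
}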

Related

How to add a custom information model XML file to the server and run it?

I'm currently working with open62541. I created an object XML file; now I want to add the file to the server and run the server. When the server runs, the objects contained in the XML file should show up in an OPC UA client application.
In three steps:
1. You need a nodeset.xml file.
2. Use a CMake command to generate source code from it.
3. Call a function in your executable.
1. I do not know what kind of "XML" file you have, so I will assume it is a valid nodeset.xml file. If you do not know how to create one, you can try reading this: https://opcua.rocks/custom-information-models/. Personally, I suggest using a GUI tool for that (e.g., the free OPC UA Modeler).
2. Then you should use the following custom CMake command provided by open62541:
# Generate types and namespace for DI
ua_generate_nodeset_and_datatypes(
    NAME "di"  # the name you want
    FILE_CSV "${UA_NODESET_DIR}/DI/OpcUaDiModel.csv"
    FILE_BSD "${UA_NODESET_DIR}/DI/Opc.Ua.Di.Types.bsd"
    NAMESPACE_MAP "2:http://opcfoundation.org/UA/DI/"
    FILE_NS "${UA_NODESET_DIR}/DI/Opc.Ua.Di.NodeSet2.xml"
)
After the build, you will find a bunch of ua_xxxx_generated.c and ua_xxxx_generated.h files under the build/src_generated folder.
3. Then in your program code, just include these headers and call:
namespace_xxx_nodeset_generated(server)
Please refer to https://github.com/open62541/open62541/tree/master/examples/nodeset and http://www.open62541.org/doc/master/nodeset_compiler.html; there are rich examples and code for that.

Stanford Core NLP train tagger using Java API

Does anyone know if it's possible to train a Stanford tagger using the Java API? I'm only finding examples of people doing it through the command line. That should imply that there exists an API method somewhere, but I can't find it.
You can put all of your training properties in a .properties file and then call MaxentTagger.main(new String[]{"-props", "/path/to/training.properties"}) (main takes a String[], so the arguments must be wrapped in an array). I don't see any easier way to do this in the Java API.
The only solution I came up with is to use MaxentTagger.main(...) and pass it a bunch of arguments formatted command-line style.
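For instance, a minimal sketch of the properties-file route (the paths are placeholders; the property keys in the comment are the usual MaxentTagger training options, shown here only as an example):

import edu.stanford.nlp.tagger.maxent.MaxentTagger;

public class TrainTagger {
    public static void main(String[] args) throws Exception {
        // training.properties is assumed to contain the usual training keys, e.g.:
        //   model = my-model.tagger
        //   arch = left3words
        //   trainFile = format=TSV,wordColumn=0,tagColumn=1,/path/to/train.tsv
        // MaxentTagger.main understands the same flags as the command line.
        MaxentTagger.main(new String[] { "-props", "/path/to/training.properties" });
    }
}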

CoreNLP: Load "out-of-box" and custom NER model on Windows OS

I am looking to load a custom build NER model as well as one of the "out-of-box" Stanford CoreNLP NER models on a Windows 10 computer. I would like to apply both models to my text.
I have accomplished this for a CentOS system and authored this question "Load Custom NER Model Stanford CoreNLP".
I understand that I can use -serverproperties with a properties file to load a custom NER model. When you do this, that is the only model loaded, so you have to specify which "out-of-box" NER models you would like to load in addition to your custom model. I have done this on my CentOS system but cannot accomplish it on my Windows computer.
The difficulty comes in specifying the filepath to the "out-of-box" NER models. I use this type of path for my custom model C:\path\to\custom_model.ser.gz but I do not have a file path to the "out-of-box" NER models as their paths are for a Linux OS.
How do I properly direct CoreNLP to the "out-of-box" NER models in my server.prop file?
The ner.model property can take a comma-separated list of multiple model paths. I honestly am not familiar with Windows, so I am not really sure what would happen if you supplied a DOS-style path in your list for ner.model.
But assuming that doesn't work, you could always make a jar and place your custom model in that jar with a Unix path, then place that jar into your CLASSPATH when running your application.
I was able to solve my own problem. This is what I used in the server.prop file:
ner.model = C:\\path\\to\\custom_model.ser.gz,edu/stanford/nlp/models/ner/english.all.3class.distsim.crf.ser.gz
The issue I was having was that I was putting a space after the comma separating the models. I would get the "Unable to load as url, path, or file" error because the space was being included in the file path. ~face to palm~
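The same comma-separated list works when building a pipeline directly from the Java API rather than through server.prop. A sketch (assuming the CoreNLP models jar is on the classpath; the custom model path is a placeholder):

import java.util.Properties;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;

public class TwoNerModels {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner");
        // Custom Windows path first, then the classpath-relative
        // "out-of-box" model. Note: no space after the comma.
        props.setProperty("ner.model",
                "C:\\path\\to\\custom_model.ser.gz,"
                + "edu/stanford/nlp/models/ner/english.all.3class.distsim.crf.ser.gz");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

        Annotation doc = new Annotation("Barack Obama was born in Hawaii.");
        pipeline.annotate(doc);
        System.out.println(doc);
    }
}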

Weka UI language configuration error while reading file

In an attempt to add machine learning to my project, I used WEKA. To train and test it, WEKA processes a collection of data in Russian. But while reading the data, it shows unreadable characters ('ЧÑ, о Ñ'). I understand that this is due to an encoding configuration error, but I can't find a solution. Any help is appreciated.
WEKA UI screenshot
I have Java 1.8 and WEKA 3.8.
my dataset is like: "Российский ситком (ситуационная комедия) «Интерны», совмещенная адаптация «Клиники» и «Доктора Хауса»"
my folder structure is:
kino1tr/
    good/
    bad/
    neutral/
I made a silly mistake. When loading the data, there is a charSet field to specify the encoding; setting it to UTF-8 resolves the issue.
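For anyone doing the same from code rather than the UI, the equivalent setting on WEKA's TextDirectoryLoader looks roughly like this (a sketch; the folder name matches the layout above):

import java.io.File;
import weka.core.Instances;
import weka.core.converters.TextDirectoryLoader;

public class LoadRussianCorpus {
    public static void main(String[] args) throws Exception {
        TextDirectoryLoader loader = new TextDirectoryLoader();
        // good/bad/neutral subfolders become the class labels
        loader.setDirectory(new File("kino1tr"));
        // The fix: read the files as UTF-8 instead of the platform default
        loader.setCharSet("UTF-8");
        Instances data = loader.getDataSet();
        System.out.println(data.numInstances() + " documents loaded");
    }
}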

How to read excel file tibco activities?

I have a requirement to read an Excel file using TIBCO palettes. Can anybody please shed some light on this? I am new to TIBCO BW. What steps should I follow?
I am assuming you are not referring to CSV files, for which you could use the File Read and Parse activities of BW.
If you want to parse or render a multi-worksheet workbook, you can try publicly available APIs such as Apache POI, or commercial APIs such as those from Aspose, to cut your own Java-based solution. Then you can use the Java Code or general Java activities to embed and use that code.
And then there's another ready-to-use option available from us: an Excel Plugin for TIBCO BusinessWorks, if you wish to leverage all built-in features of BW (XPath mapping, etc) when parsing or rendering your Excel.
Edit 1:
As per your comment, you can also try the following steps, if you are looking for a more homegrown solution.
Based on one of the (public/commercial) libraries above, you can write generic Java code to parse each cell of each row of each sheet of the workbook; a sketch follows below. The output should be an XML string. Then create an XSD to match your output. It is at your discretion which information of the cell you want to read from the workbook - you already are aware of the complexity of the API, I am sure.
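A minimal sketch of such a method using Apache POI (my own illustration, assuming poi-ooxml is on the classpath; XML escaping and error handling omitted for brevity):

import java.io.FileInputStream;
import org.apache.poi.ss.usermodel.Cell;
import org.apache.poi.ss.usermodel.DataFormatter;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.ss.usermodel.Workbook;
import org.apache.poi.ss.usermodel.WorkbookFactory;

public class WorkbookToXml {
    // Walks every cell of every row of every sheet and emits a flat XML string
    public static String toXml(String path) throws Exception {
        StringBuilder xml = new StringBuilder("<workbook>");
        try (Workbook wb = WorkbookFactory.create(new FileInputStream(path))) {
            DataFormatter fmt = new DataFormatter(); // renders cells as displayed text
            for (Sheet sheet : wb) {
                xml.append("<sheet name=\"").append(sheet.getSheetName()).append("\">");
                for (Row row : sheet) {
                    xml.append("<row num=\"").append(row.getRowNum()).append("\">");
                    for (Cell cell : row) {
                        xml.append("<cell col=\"").append(cell.getColumnIndex()).append("\">")
                           .append(fmt.formatCellValue(cell))
                           .append("</cell>");
                    }
                    xml.append("</row>");
                }
                xml.append("</sheet>");
            }
        }
        return xml.append("</workbook>").toString();
    }
}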
Create a BW (sub)process that calls your code from a Java activity, use Parse XML to parse your XML string result into you XSD structure. Configure the End activity to use your XSD and map (copy) your Parse XML result into the End activity.
Then wrap this subprocess into a Custom Activity (General Activities palette). Create a Custom Palette, and now you can re-use what you did in many other BW projects. The path to the custom palettes can be found in TIBCO Designer - Edit - Preferences - General - User Directories.
If you add Error Output schemas, you will also get typed error outputs from that custom activity.
HTH,
Hendrik
