Is there a difference between the command line tool and models trained using the API (programmatically)? - opennlp

In OpenNLP, I am training a named entity model. If I provide the ".train" file and train using the command line tool, it works perfectly. But when I use the API, run the text through the sentence detector and tokenizer, and pass the tokens to the name finder, it does not detect the types.

Did you try retrieving the type via the getType method on the Spans returned by the name finder?
http://opennlp.apache.org/documentation/1.5.2-incubating/manual/opennlp.html#tools.namefind.recognition.api
You can also refer to the command line tool's source code to check whether you are using the API correctly:
http://svn.apache.org/viewvc/opennlp/trunk/opennlp-tools/src/main/java/opennlp/tools/cmdline/namefind/TokenNameFinderTool.java?view=markup
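For comparison, here is a minimal sketch of the programmatic pipeline the question describes (sentence detection, tokenization, name finding); the model file names are placeholders for your own trained models:

import java.io.FileInputStream;
import java.util.Arrays;
import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.sentdetect.SentenceDetectorME;
import opennlp.tools.sentdetect.SentenceModel;
import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;
import opennlp.tools.util.Span;

public class NameFinderDemo {
    public static void main(String[] args) throws Exception {
        // Load the sentence, tokenizer, and name finder models (placeholder file names).
        SentenceDetectorME sentenceDetector = new SentenceDetectorME(
                new SentenceModel(new FileInputStream("en-sent.bin")));
        TokenizerME tokenizer = new TokenizerME(
                new TokenizerModel(new FileInputStream("en-token.bin")));
        NameFinderME nameFinder = new NameFinderME(
                new TokenNameFinderModel(new FileInputStream("my-custom-ner.bin")));

        String document = "Pierre Vinken will join the board as a nonexecutive director.";
        for (String sentence : sentenceDetector.sentDetect(document)) {
            String[] tokens = tokenizer.tokenize(sentence);
            for (Span span : nameFinder.find(tokens)) {
                // getType() returns the entity type the model was trained with.
                String entity = String.join(" ",
                        Arrays.copyOfRange(tokens, span.getStart(), span.getEnd()));
                System.out.println(span.getType() + ": " + entity);
            }
        }
        // Reset adaptive data once the whole document has been processed.
        nameFinder.clearAdaptiveData();
    }
}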

Related

Difference between TensorFloat and ImageFeatureValue

When using the Windows-Machine-Learning library, the input and output of the ONNX models is often either TensorFloat or ImageFeatureValue format.
My question: what is the difference between these? It seems like I am able to change the form of the input in the automatically created model.cs file after ONNX import (for body pose detection) from TensorFloat to ImageFeatureValue and the code still runs. This makes it easier to work with video frames, for example, since I can then create my input via ImageFeatureValue.CreateFromVideoFrame(frame).
Is there a reason why this might lead to problems, and what are the differences between these two when using video frames as input? I don't see it in the documentation. And why does the model.cs script create a TensorFloat rather than an ImageFeatureValue in the first place, if the input is a video frame?
Found the answer here.
If Windows ML does not support your model's color format or pixel range, then you can implement conversions and tensorization. You'll create an NCHW four-dimensional tensor of 32-bit floats for your input value. See the Custom Tensorization Sample for an example of how to do this.

CoreNLP API equivalent to command line?

For one of our projects, we are currently using the syntax analysis component from the command line. We want to move from this approach to using the CoreNLP server instead (for better performance).
Our command line options are as follows:
java -mx4g -cp "$scriptdir/*:" edu.stanford.nlp.parser.lexparser.LexicalizedParser -tokenized -escaper edu.stanford.nlp.process.PTBEscapingProcessor -sentences newline -tokenized -tagSeparator / -tokenizerFactory edu.stanford.nlp.process.WhitespaceTokenizer -tokenizerMethod newCoreLabelTokenizerFactory -outputFormat "wordsAndTags,typedDependenciesCollapsed"
I've tried a few things but I didn't manage to find the proper options when using the corenlp API (with Python).
For instance, how do I specify that the text is already tokenised?
I would really appreciate any help.
In general, the server calls into CoreNLP rather than the individual NLP components, so the documentation on CoreNLP may be useful. The body of the text being annotated is sent to the server as the POST body; the properties are passed in as URL params. For example, for your case, I believe the following curl command should do the trick (and should be easy to adapt to the language of your choice):
curl -X POST -d "it's split on whitespace" \
'http://localhost:9000/?annotators=tokenize,ssplit,pos,parse&tokenize.whitespace=true&ssplit.eolonly=true'
Note that we're just passing the following properties into the server:
annotators = tokenize,ssplit,pos,parse (specifies that we want the parser, and all its prerequisites).
tokenize.whitespace = true tells the server to use the whitespace tokenizer.
ssplit.eolonly = true will split sentences on and only on newlines.
Other potentially useful options are documented on the parser annotator page.
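For completeness, here is a minimal sketch of the same request from code, using only the Java standard library (it assumes a server is already running on localhost:9000, and is easy to port to Python's urllib or requests):

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Scanner;

public class CoreNLPServerDemo {
    public static void main(String[] args) throws Exception {
        // Same properties as the curl example above, passed as URL parameters.
        URL url = new URL("http://localhost:9000/?annotators=tokenize,ssplit,pos,parse"
                + "&tokenize.whitespace=true&ssplit.eolonly=true");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        // The text to annotate goes in the POST body.
        try (OutputStream out = conn.getOutputStream()) {
            out.write("it's split on whitespace".getBytes(StandardCharsets.UTF_8));
        }
        // Print the server's response.
        try (Scanner in = new Scanner(conn.getInputStream(), "UTF-8")) {
            while (in.hasNextLine()) {
                System.out.println(in.nextLine());
            }
        }
    }
}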

Can I choose a pos.model in Stanford parser?

I want to use gate-EN-twitter.model for POS tagging during parsing with the Stanford parser. Is there a command-line option for that, like -pos.model gate-EN-twitter.model? Or do I have to run the Stanford POS tagger with the GATE model first, and then use its output as input for the parser?
Thanks!
If I understand you correctly, you want to force the Stanford Parser to use the tags generated by this Twitter-specific POS tagger. That's definitely possible, though this tweet from Stanford NLP about this exact model should serve as a warning:
Tweet from Stanford NLP, 13 Apr 2014:
Using CoreNLP on social media? Try GATE Twitter model (iff not parsing…) -pos.model gate-EN-twitter.model https://gate.ac.uk/wiki/twitter-postagger.html #nlproc
(https://twitter.com/stanfordnlp/status/455409761492549632)
That being said, if you really want to try, we can't stop you :)
There is a parser FAQ entry on forcing in your own tags. See http://nlp.stanford.edu/software/parser-faq.shtml#f
Basically, you have two options (see the FAQ for full details):
If calling the parser from the command line, you can pre-tag your text file and then tell the parser, via command-line options, that the text is pre-tagged.
If parsing programmatically, the LexicalizedParser#parse method will accept any List<? extends HasTag> and treat the tags in that list as golden. Just pre-tag your list (using the CoreNLP pipeline or MaxentTagger) and pass that token list to the parser, as sketched below.
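A minimal sketch of the programmatic option (the sentence, tags, and default English model here are illustrative; in your case the tags would come from the GATE Twitter model):

import edu.stanford.nlp.ling.TaggedWord;
import edu.stanford.nlp.parser.lexparser.LexicalizedParser;
import edu.stanford.nlp.trees.Tree;
import java.util.Arrays;
import java.util.List;

public class PreTaggedParseDemo {
    public static void main(String[] args) {
        // Loads the default English PCFG model from the models jar.
        LexicalizedParser parser = LexicalizedParser.loadModel();
        // Pre-tagged tokens; TaggedWord implements HasTag, so these tags are treated as golden.
        List<TaggedWord> tagged = Arrays.asList(
                new TaggedWord("I", "PRP"),
                new TaggedWord("love", "VBP"),
                new TaggedWord("#nlproc", "NN"));
        Tree tree = parser.parse(tagged);
        tree.pennPrint();
    }
}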

Can I tell SPSS to run certain syntax lines using a syntax command?

So I was wondering whether it is possible to write something in the syntax that tells the program to run certain command lines. I'm not very good at explaining, so here's an example:
*Total sample frequency.
FREQUENCIES VARIABLES=Age Gender CigDay CO Min_last Day_abs Cigs_Monthly
/ORDER=ANALYSIS.
*6. Next, using the split-file function, perform the frequency analysis for each gender.
* Split file.
SORT CASES BY Gender.
SPLIT FILE LAYERED BY Gender.
*7 Run frequency again.
FREQUENCIES VARIABLES=Age Gender CigDay CO Min_last Day_abs Cigs_Monthly
/ORDER=ANALYSIS.
So, I was wondering whether it is possible to avoid copy/pasting the FREQUENCIES command, and instead simply include a command that tells SPSS to re-run syntax rows 37 to 38 (which is where the first FREQUENCIES command is written).
The short answer is: no. There is no command that will run a specific line of syntax. You can certainly do it manually, by selecting and running the lines you need.
But there are other options for tasks like this, when you need to re-run part of the code several times:
The Insert command. Save the code you need to run several times in an external syntax file and insert it when needed in your main syntax file.
The Define and End Define commands. Define the code you need to run several times as a macro and call it when needed in your main syntax file. (A sketch of both options follows after this answer.)
I suggest not using INCLUDE as it is obsolete, although it is still supported. INSERT provides better functionality.
If you set out to build a macro library for your frequently used commands, think about parameterizing them so that, for example, you can pass in the specific variables to use as arguments. See the Command Syntax Reference entry for DEFINE via the Help menu for full details, but be prepared to spend some time studying it.
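As a minimal sketch of the macro option, using the FREQUENCIES block from the question (the macro name !freqs is arbitrary):
DEFINE !freqs ()
FREQUENCIES VARIABLES=Age Gender CigDay CO Min_last Day_abs Cigs_Monthly
  /ORDER=ANALYSIS.
!ENDDEFINE.
!freqs.
SORT CASES BY Gender.
SPLIT FILE LAYERED BY Gender.
!freqs.
With the Insert approach, the FREQUENCIES block would instead live in its own syntax file and be pulled in with, for example, INSERT FILE='freqs.sps'.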

(Unit)test PDF generation

We implemented a Magento module, https://github.com/firegento/firegento-pdf/, and I plan to write tests for it.
The problem is: the extension generates PDFs.
Is there any framework, or whatever, to test PDFs? It would be totally fine if I could check for text in the PDF; I don't want to check the correct placement.
Any ideas?
This looks promising but feels oversized. http://webcheatsheet.com/php/reading_clean_text_from_pdf.php
I use PdfBox for a similar module. It is a Java-based command line utility that extracts text from a PDF and optionally converts it to HTML: http://pdfbox.apache.org/commandline/#extractText
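The extraction call itself looks something like this (the jar file name depends on the version you downloaded):
$ java -jar pdfbox-app-1.7.0.jar ExtractText invoice.pdf invoice.txt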
To use it within PHPUnit tests, I wrote a PHP interface for the relevant PdfBox methods: https://github.com/schmengler/PdfBox
Example
use SGH\PdfBox\PdfBox;

//$pdf = GENERATED_PDF; // the generated PDF content, as a string
$converter = new PdfBox();
$converter->setPathToPdfBox('/usr/bin/pdfbox-app-1.7.0.jar');
$text = $converter->textFromPdfStream($pdf);
Further reading: Unit Test Generated PDFs with PHPUnit and PDFBox
Maybe you could use Inkscape to convert the PDF into SVG and make assertions on some SVG nodes.
That would do if you only want to check the text or some simple formatting.
$ inkscape -f invoice.pdf --export-plain-svg=thepdf.svg
For the correct positions you need to be a bit fuzzy, though.
Nevertheless, the PDF source can be plain enough to be checked with a simple strpos().
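For example, a hypothetical PHPUnit assertion along these lines (the invoice number is made up, and this only works if the text in question is not compressed inside the source):
$svg = file_get_contents('thepdf.svg'); // or the raw PDF source, if it is uncompressed
$this->assertTrue(strpos($svg, 'Invoice #100000001') !== false);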
You have to resign yourself to testing "we sent the right commands to Magento". Testing the output PDF itself would make the tests fragile.
Mock the PDF-writing library, and test that your code calls the library correctly. This has the added benefit of speed, but requires a little more discipline: if a PDF output fails a manual inspection, you must fix that, test-first, to keep your mocks honest.
