For one of our projects, we are currently using the syntax analysis component from the command line. We want to move from this approach to using the CoreNLP server (for better performance).
Our command-line options are as follows:
java -mx4g -cp "$scriptdir/*:" edu.stanford.nlp.parser.lexparser.LexicalizedParser -tokenized -escaper edu.stanford.nlp.process.PTBEscapingProcessor -sentences newline -tokenized -tagSeparator / -tokenizerFactory edu.stanford.nlp.process.WhitespaceTokenizer -tokenizerMethod newCoreLabelTokenizerFactory -outputFormat "wordsAndTags,typedDependenciesCollapsed"
I've tried a few things, but I didn't manage to find the proper options when using the CoreNLP server API (from Python).
For instance, how do I specify that the text is already tokenised?
I would really appreciate any help.
In general, the server calls into CoreNLP rather than the individual NLP components, so the documentation on CoreNLP may be useful. The body of the text being annotated is sent to the server as the POST body; the properties are passed in as URL params. For example, for your case, I believe the following curl command should do the trick (and should be easy to adapt to the language of your choice):
curl -X POST -d "it's split on whitespace" \
'http://localhost:9000/?annotators=tokenize,ssplit,pos,parse&tokenize.whitespace=true&ssplit.eolonly=true'
Note that we're just passing the following properties into the server:
annotators = tokenize,ssplit,pos,parse (specifies that we want the parser, and all its prerequisites).
tokenize.whitespace = true will call the whitespace tokenizer.
ssplit.eolonly = true will split sentences on and only on newlines.
Other potentially useful options are documented on the parser annotator page.
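If you are calling the server from Python, a minimal sketch using the requests library (assuming the server is already running on localhost:9000) would mirror the curl call above: the text goes in the POST body and the properties go in the URL parameters.

import requests

# Same properties as in the curl example: whitespace-tokenized input,
# one sentence per line, the parser plus its prerequisites.
params = {
    'annotators': 'tokenize,ssplit,pos,parse',
    'tokenize.whitespace': 'true',
    'ssplit.eolonly': 'true',
}
response = requests.post('http://localhost:9000/',
                         params=params,
                         data="it's split on whitespace".encode('utf-8'))
print(response.text)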
Related
I am experiencing a problem using the Stanford pipeline (latest version of CoreNLP) to parse the BNC.
The problematic sentence excerpt is the following, and the problem is the dashes (if I remove them, it goes through).
"... they did it again and again — on and off for years."
The parser just gets stuck on this sentence, and it does not even throw an error. The sentence gets parsed correctly in the web interface.
I tried various tokenizer options, with no result.
Here is the command line I am using:
java [...] edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,depparse -tokenize.whitespace false -ssplit.eolonly true -parse.model edu/stanford/nlp/models/parser/nndep/english_SD.gz -file $inputfile
Does anybody have a suggestion on how to cope with this issue?
Thanks a lot in advance!
Gabriella
Running with Stanford CoreNLP v.3.5.2 on OS X 10.10.4, I couldn't reproduce this problem. The example string given was parsed just fine.
There could be a problem, but if so, it is subtle. You'd want to similarly give more information (Stanford NLP version, OS and version) and to put a text file that doesn't work somewhere it can be downloaded, to make sure the problem isn't something like line endings that get lost when pasting text into a web page.
We need to add terms to the named entity extraction tables/model in Stanford and can't figure out how. Use case: we need to build up a set of IED terms over time and want the Stanford pipeline to extract the terms when they are found in text files.
Looking to see if this is something someone has done before.
Please take a look at http://nlp.stanford.edu/software/regexner/ to see how to use it. It allows you to specify a file of mappings of phrases to entity types. When you want to update the mappings, you update the file and rerun the Stanford pipeline.
If you are interested in how to actually learn patterns for the terms over time, you can take a look at our pattern learning system: http://nlp.stanford.edu/software/patternslearning.shtml
Could you specify the tags you want to apply?
To use RegexNER, all you have to do is build a file with one entry per line of the form:
TEXT_PATTERN\tTAG
You would put all of the things you want in your custom dictionary into a file, say custom_dictionary.txt
I am assuming by IED you mean https://en.wikipedia.org/wiki/Improvised_explosive_device?
So your file might look like:
VBIED\tIED_TERM
sticky bombs\tIED_TERM
RCIED\tIED_TERM
New Country\tLOCATION
New Person\tPERSON
(Note: there should not be blank lines between entries; the file should have one entry per line.)
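The \t in the entries above stands for a literal tab character. If it helps, here is a small Python sketch (using the example file name and tags from above) that writes such a mapping file with real tabs:

entries = [
    ("VBIED", "IED_TERM"),
    ("sticky bombs", "IED_TERM"),
    ("RCIED", "IED_TERM"),
    ("New Country", "LOCATION"),
    ("New Person", "PERSON"),
]

# One tab-separated entry per line, with no blank lines in between.
with open("custom_dictionary.txt", "w") as f:
    for pattern, tag in entries:
        f.write(pattern + "\t" + tag + "\n")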
If you then run this command:
java -mx1g -cp '*' edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators 'tokenize,ssplit,pos,lemma,regexner,ner' -file sample_input.txt -regexner.mapping custom_dictionary.txt
you will tag sample_input.txt
Updating is merely a matter of updating custom_dictionary.txt
One thing to be on the lookout for: it matters whether you put "ner" or "regexner" first in your list of annotators.
If your highest priority is tagging with your specialized terms (for instance IED_TERM), I would run regexner first in the pipeline, since there are some tricky issues with how the taggers overwrite each other.
I am using Sphinx/reStructuredText to create HTML and PDF docs for a project and am including the output of the argparse help for the command line tools. At the moment I am pasting the output in manually but plan at some point to switch to autogenerated output.
The problem is that while the formatting is fine for named parameters (like -x or --xray) it doesn't work well on positional parameters. It looks like the absence of a leading '-' on the parameter name is confusing it. The output looks like normal text without the neat indentation etc.
So my question is, does there exist markup that would force formatting the positional parameters as if they had leading '-' characters? If not, could someone suggest where in the docs or code I should start looking to put something together myself?
You may use the sphinx-argparse extension.
It does not use the output of argparse help; instead, it introspects the argparse parser itself and provides proper Sphinx markup for both positional arguments and options, formatting them as a definition list with proper labels.
Sub-commands are also easily handled.
http://sphinx-argparse.readthedocs.org/en/latest/
It will also cover your stated need to:
plan at some point to switch to autogenerated output
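As a rough sketch (the module path, function name, and program name below are placeholders for your own project), you enable the extension in conf.py and point the argparse directive at the function that builds your parser.

In conf.py:

extensions = ['sphinxarg.ext']

In the .rst file for the tool:

.. argparse::
   :module: mytool.cli
   :func: build_parser
   :prog: mytool

The directive then renders positional arguments and options from the parser object itself, so the generated documentation stays in sync with the code.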
In OpenNLP, I am training a named entity model. If I provide the ".train" file and train using the command-line tool, it works perfectly. But when I use the API, run the text through the sentence detector, tokenize it, and send it to the name finder, it does not detect the types.
Did you try to get it using the getType method from the Span?
http://opennlp.apache.org/documentation/1.5.2-incubating/manual/opennlp.html#tools.namefind.recognition.api
Also, you can refer to the command line tool source code to check if you are using it right:
http://svn.apache.org/viewvc/opennlp/trunk/opennlp-tools/src/main/java/opennlp/tools/cmdline/namefind/TokenNameFinderTool.java?view=markup
We need to run some queries against a MongoDB database from Bash shell scripts. Using eval and Mongo's printjson() gives me text output, but it needs to be parsed. Using other scripting languages (Python, Ruby, Erlang, etc.) is not an option.
I looked at JSON.sh (a JSON parser written as a Bash script library: https://github.com/rcrowley/json.sh), and it appears to be close to a solution, except that it does not recognize data types that exist in BSON but not in JSON. Before I try to modify it to recognize those BSON data types, is anyone aware of an existing solution?
Thanks.
Update 10/11: Below, Stennie notes that I have received an answer in the MongoDB User group and provides a URL. The answer is very nice and complete, and begins, "MongoDB actually uses what we call Mongo Extended JSON which differs a bit from the vanilla JSON standard...", so I will have to modify the parser. Thanks to all.
Do you perhaps want to use tojson() rather than printjson() and loop through the result of tojson() to parse the fields?