I am experiencing a problem using the Stanford pipeline (the latest version of CoreNLP) to parse the BNC.
The problematic sentence excerpt is the following, and the problem is the dashes (if I remove them, it goes through).
"... they did it again and again — on and off for years."
The parser just gets stuck on this sentence, and it does not even throw an error. The sentence gets parsed correctly in the web interface.
I tried playing with the tokenizer options, with no result.
Here is the command line I am using:
java [...] edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,depparse -tokenize.whitespace false -ssplit.eolonly true -parse.model edu/stanford/nlp/models/parser/nndep/english_SD.gz -file $inputfile
Does anybody have a suggestion on how to cope with this issue?
Thanks a lot in advance!
Gabriella
Running with Stanford CoreNLP v3.5.2 on OS X 10.10.4, I couldn't reproduce this problem. The example string given was parsed just fine.
There could be a problem, but if so it is subtle. You'd want to give more information: the Stanford CoreNLP version, the OS and its version, and a link to a downloadable text file that doesn't work, to make sure the problem isn't something like line endings getting lost when text is pasted into a web page.
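If it helps, here is a minimal sketch (the file name is a placeholder) of how you could inspect the exact bytes and characters in the offending line, to rule out unusual dash characters or line endings introduced when the text was pasted:

# Minimal sketch: dump the raw bytes and the Unicode code points of the
# problematic sentence, so dash variants (hyphen, en dash, em dash) and
# line endings (\n vs \r\n) become visible. "problem_sentence.txt" is a placeholder.
import unicodedata

with open("problem_sentence.txt", "rb") as f:
    raw = f.read()

print(raw)  # raw bytes, including the literal line-ending characters

for ch in raw.decode("utf-8", errors="replace"):
    print(repr(ch), hex(ord(ch)), unicodedata.name(ch, "UNKNOWN"))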
For a few months I have been working with LibreOffice 7.4. I am trying to get a PDF with PDF version 1.5 (because LaTeX still does not accept 1.6 as input).
I am using a command like this, as described in this issue by Mike Kaganski:
soffice --convert-to pdf:writer_pdf_Export:{"SelectPdfVersion":{"type":"long","value":"15"}} %1
where %1 is a fully qualified Writer document (odt). This produces the correct PDF, but always with PDF spec 1.6. Changing the value in the command has no effect.
By the way, using the exact syntax as in the link, i.e. 'pdf:writer_pdf ...}}', produces not a *.pdf but a *.'pdf.
What am I doing wrong? Thanks for any help.
Thanks, K J. Though this is not an answer to my question, it pointed me in the right direction (and thus is even better than an answer). It's certainly true that downgrading any document to a lower version is, or might be, dangerous.
The problem is, however, that \pdfminorversion only works for pdfTeX and not LuaTeX.
According to the answers found here, for LuaTeX (and likely XeTeX, though I didn't check) you have to set, in the preamble of a project, either
\pdfvariable minorversion=6
or
\directlua{pdf.setminorversion(6)}
(the latter is not likely to work with XeTeX).
So I still don't know how to effectively (as opposed to theoretically; if someone knows an answer to this, they are still welcome) downgrade a PDF, but nonetheless my actual problem is solved.
For one of our projects, we are currently using the syntax analysis component from the command line. We want to move from this approach to using the CoreNLP server (for better performance).
Our command line options are as follows:
java -mx4g -cp "$scriptdir/*:" edu.stanford.nlp.parser.lexparser.LexicalizedParser -tokenized -escaper edu.stanford.nlp.process.PTBEscapingProcessor -sentences newline -tokenized -tagSeparator / -tokenizerFactory edu.stanford.nlp.process.WhitespaceTokenizer -tokenizerMethod newCoreLabelTokenizerFactory -outputFormat "wordsAndTags,typedDependenciesCollapsed"
I've tried a few things, but I didn't manage to find the proper options when using the CoreNLP API (with Python).
For instance, how do I specify that the text is already tokenised?
I would really appreciate any help.
In general, the server calls into CoreNLP rather than the individual NLP components, so the documentation on CoreNLP may be useful. The body of the text being annotated is sent to the server as the POST body; the properties are passed in as URL params. For example, for your case, I believe the following curl command should do the trick (and should be easy to adapt to the language of your choice):
curl -X POST -d "it's split on whitespace" \
'http://localhost:9000/?annotators=tokenize,ssplit,pos,parse&tokenize.whitespace=true&ssplit.eolonly=true'
Note that we're just passing the following properties into the server:
annotators = tokenize,ssplit,pos,parse (specifies that we want the parser, and all its prerequisites).
tokenize.whitespace = true will use the whitespace tokenizer.
ssplit.eolonly = true will split sentences on and only on newlines.
Other potentially useful options are documented on the parser annotator page.
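In case it is useful, here is a rough Python equivalent of that curl call using the requests library (this is just a sketch; localhost:9000 assumes you already have a server running there, and the properties simply mirror the URL params from the curl example above):

# Rough Python equivalent of the curl call above, using the requests library.
# Assumes a CoreNLP server is already running on localhost:9000.
import requests

params = {
    "annotators": "tokenize,ssplit,pos,parse",
    "tokenize.whitespace": "true",
    "ssplit.eolonly": "true",
}
resp = requests.post(
    "http://localhost:9000/",
    params=params,
    data="it's split on whitespace".encode("utf-8"),
)
print(resp.text)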
I used the sentence
He died in the day before yesterday.
to run CoreNLP NER.
On the server, I got a result like this.
Locally, using the same sentence, I got this result:
He(O) died(O) in(O) the(O) day(TIME) before(O) yesterday(O) .(O)
So how can I get the same result as on the server?
In order to increase the likelihood of getting a relevant answer, you may want to rephrase your question and provide a bit more information. And as a bonus, in the process of doing so, you may even find out the answer yourself ;)
For example, what URL are you using to get your server result? When I check here: http://nlp.stanford.edu:8080/ner/process, I can select multiple models for English. I am not sure which version their API is based on (I would guess the most recent stable version, but I don't know). The title of your post suggests you are using 3.8 locally, but it wouldn't hurt to specify the relevant piece of your pom.xml file, or the models you downloaded yourself.
What model are you using in your code? How are you calling it? (i.e. any other annotators in your pipeline that could be relevant for NER output)
Are you even calling it from code (if so, Java? Python?), or using it from the command line?
A lot of this is summarised in https://stackoverflow.com/help/how-to-ask and it's not that long to read through ;)
We need to add terms to the named entity extraction tables/model in Stanford CoreNLP and can't figure out how. Use case: we need to build up a set of IED terms over time and want the Stanford pipeline to extract those terms when they are found in text files.
Looking to see if this is something someone has done before.
Please take a look at http://nlp.stanford.edu/software/regexner/ to see how to use it. It allows you to specify a file of mappings of phrases to entity types. When you want to update the mappings, you update the file and rerun the Stanford pipeline.
If you are interested in how to actually learn patterns for the terms over time, you can take a look at our pattern learning system: http://nlp.stanford.edu/software/patternslearning.shtml
Could you specify the tags you want to apply?
To use RegexNER, all you have to do is build a file with one entry per line of the form:
TEXT_PATTERN\tTAG
You would put all of the things you want in your custom dictionary into a file, say custom_dictionary.txt
I am assuming that by IED you mean https://en.wikipedia.org/wiki/Improvised_explosive_device?
So your file might look like:
VBIED\tIED_TERM
sticky bombs\tIED_TERM
RCIED\tIED_TERM
New Country\tLOCATION
New Person\tPERSON
(Note: there should not be blank lines between entries; the file should contain exactly one entry per line.)
If you then run this command:
java -mx1g -cp '*' edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators 'tokenize,ssplit,pos,lemma,regexner,ner' -file sample_input.txt -regexner.mapping custom_dictionary.txt
you will tag sample_input.txt
Updating is merely a matter of updating custom_dictionary.txt
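If the set of terms grows over time, one low-tech option (just a sketch; the file name and tag follow the example above, and the term list is made up) is to regenerate custom_dictionary.txt from a plain list of terms and then rerun the command above:

# Sketch: regenerate the RegexNER mapping file from a growing list of terms,
# then rerun the CoreNLP command shown above. Terms, tag, and file name are
# simply the examples used in this answer.
terms = ["VBIED", "sticky bombs", "RCIED"]  # extend this list over time
with open("custom_dictionary.txt", "w", encoding="utf-8") as out:
    for term in terms:
        out.write(term + "\tIED_TERM\n")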
One thing to be on the lookout for: it matters whether you put "ner" first or "regexner" first in your list of annotators.
If your highest priority is tagging with your specialized terms (for instance IED_TERM), I would run regexner first in the pipeline, since there are some tricky issues with how the taggers overwrite each other.
I am going to use the Stanford POS tagger to tag sentences. I want to split documents into sentences and then sentences into tokens. As I am using Java for the first time, I just want to run the tagger from the command line.
When I run the tagger, it produces output but also gives an "untokenizable" warning.
What does this warning mean? Is the tokenization not implicitly done by the tagger?
I have tried to run the command you specified for splitting the text into sentences, but it doesn't work: the tagger gives an error that it could not open the path.
I also want to know how I can input a number of text files and get their output in corresponding files, so that the output is not jumbled together.
Yes, the Stanford POS tagger includes a high-quality, deterministic tokenizer, which is used unless you say the text is already tokenized. For formal English text, it is superior to most other tokenizers out there, though it isn't fully suitable for sms, tweets, etc.
An untokenizable warning means that there are byte/character sequences in the input that it can't process.
Normally what this actually means is this: The default character encoding of the tagger is utf-8 (Unicode), but your document is in some other encoding such as an 8 bit encoding like iso-8859-1 or Windows cp1252. You can convert the document or specify an input document encoding with the -encoding flag.
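For example, a quick way to convert such a file before tagging (file names are placeholders) is:

# Sketch: convert a Windows cp1252 file to UTF-8 so the tagger's default
# UTF-8 reader does not hit untokenizable byte sequences.
with open("input_cp1252.txt", encoding="cp1252") as src:
    text = src.read()
with open("input_utf8.txt", "w", encoding="utf-8") as dst:
    dst.write(text)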
But it could also mean that there is a rare character in the input that it doesn't know about. Normally in those cases, if it's just an occasional character, you can just ignore the messages. You can choose whether the characters are deleted or turned into 1 character tokens.
There isn't at present a facility for running it on a bunch of files with one command. You'll either need to run it separately on each file or write your own code for that.
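If it helps, here is one possible sketch of such code: a small Python loop that runs the tagger once per file and writes each result to its own output file. The jar and model paths, and the directory names, are placeholders; the flags follow the tagger's standard command-line usage.

# Sketch: run the tagger separately on each .txt file in a directory and
# write one tagged output file per input. Paths are placeholders.
import pathlib
import subprocess

out_dir = pathlib.Path("tagged")
out_dir.mkdir(exist_ok=True)

for txt in sorted(pathlib.Path("input").glob("*.txt")):
    out_file = out_dir / (txt.stem + ".tagged.txt")
    with open(out_file, "w", encoding="utf-8") as sink:
        subprocess.run(
            ["java", "-mx1g", "-cp", "stanford-postagger.jar",
             "edu.stanford.nlp.tagger.maxent.MaxentTagger",
             "-model", "models/english-left3words-distsim.tagger",
             "-textFile", str(txt)],
            stdout=sink, check=True)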