Adding metadata into Stanford CoreNLP input - stanford-nlp

I have a large corpus of sentences (~1.1M) to parse with Stanford CoreNLP, but the output contains more sentences than the input; the system is probably segmenting some sentences beyond the segmentation already given by the line breaks.
To control what happens, I would like to include "tags" in the input. These tags should be recognizable in the output and should not affect parsing.
Something like
<0001>
I saw a man with a telescope .
</0001>
or
#0001#
I saw a man with a telescope .
#/0001#
I have tried many formats, in all cases the "tag" has been parsed as if it were part of the text.
Is there some way to tell the parser "do not parse this, just keep it as is in the output"?
===A few hours later===
As I'm getting no answer, here is an example. I would like to process the sentence “Manon espérait secrètement y revoir un garçon qui l'avait marquée autrefois.”, which carries the tag 151_0_4. My idea was to write the tag between two runs of equals signs on a separate line, followed by a period, so that at worst the tag would be processed as a separate sentence:
=====151_0_4======.
Manon espérait secrètement y revoir un garçon qui l'avait marquée autrefois.
=====151_0_4======.
And here is what this produced:
(ROOT (SENT (NP (SYM =)) (NP (SYM =) (PP (SYM =) (NP (SYM =) (PP (SYM =) (NP (NUM 151_0_4) (SYM =) (SYM =) (NP (SYM =) (PP (SYM =) (NP (SYM =) (SYM =))))))))) (PUNCT .)))
As you can see, the tags are definitely treated as part of the sentence; there is no way to separate them from it.
The same thing happened with XML-like tags such as <x151_0_4>, or with tags using the hash character...

If your current data is strictly one sentence per line, then by far the easiest thing to do is to just leave it like that and to give the option -ssplit.eolonly=true.
There unfortunately isn't an option to pass certain kinds of metadata or delimiters through without attempting to parse or process them. You can, however, indicate that they should not be made part of other sentences by means of the ssplit.boundaryTokenRegex or ssplit.boundaryMultiTokenRegex properties. Your choices are then either to just delete them (see ssplit.tokenPatternsToDiscard) or to process them as weird sentences, which you'd then need to clean up.
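For what it's worth, here is a minimal sketch of the one-sentence-per-line route through the Java API. It relies only on the ssplit.eolonly property mentioned above; the class name, the sample second tag, and the idea of keeping the 151_0_4-style tags in a list parallel to the input lines (rather than in the text itself) are purely illustrative, not a built-in pass-through mechanism.

import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.util.CoreMap;
import java.util.Arrays;
import java.util.List;
import java.util.Properties;

public class OneSentencePerLine {
  public static void main(String[] args) {
    Properties props = new Properties();
    // Tokenize and split only; add "pos, parse" etc. once the alignment is confirmed.
    props.setProperty("annotators", "tokenize, ssplit");
    // Treat each input line as exactly one sentence; no further re-segmentation.
    // (Alternatively, keep delimiter lines out of real sentences with ssplit.boundaryTokenRegex.)
    props.setProperty("ssplit.eolonly", "true");
    StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

    // The tags never enter the text; they live in a list parallel to the input lines.
    List<String> tags = Arrays.asList("151_0_4", "151_0_5");
    String text = "Manon espérait secrètement y revoir un garçon qui l'avait marquée autrefois.\n"
        + "I saw a man with a telescope .";

    Annotation doc = new Annotation(text);
    pipeline.annotate(doc);

    List<CoreMap> sentences = doc.get(CoreAnnotations.SentencesAnnotation.class);
    for (int i = 0; i < sentences.size(); i++) {
      // With eolonly, sentence i corresponds to input line i, so the tag can be re-attached here.
      System.out.println(tags.get(i) + "\t"
          + sentences.get(i).get(CoreAnnotations.TextAnnotation.class));
    }
  }
}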

Related

Extracting information from a sentence using NLP

I want to extract information from sentences. I am a newbie in this field. I have sentences such as:
"Andrew query pizza king what is today's deal"
"Andrew order flower shop to send my wife roses"
Format : <Name> <command> <company name> <connecting word> <action>
With the help of the Stanford NLP parser, how can I extract the sentences into the format above? For example, after extracting, if I want to print the action of each sentence it should give {is today's deal, me send my wife roses}.
That's a hard task. If you have a very, very restricted set of sentences you can try to use the parser dependencies and model your problem with rules. However, I ran your sentence through the Stanford parser and got an obviously wrong result:
(ROOT
  (FRAG
    (NP
      (NP (NNP Andrew) (NN query) (NN pizza) (NN king))
      (SBAR
        (WHNP (WP what))
        (S
          (VP (VBZ is)
            (NP
              (NP (NN today) (POS 's))
              (NN deal))))))))
As you can see, it treats "Andrew query pizza king" as a noun phrase; it would do the same with "Andrew dog carrot soup what is today's deal". Obviously it misses the verb "query", the target "pizza king", etc.
Even if that worked, the syntax parser models only syntax, ignoring semantics. You should check Semantic Role Labeling, Named Entity Recognition, Relation Extraction, etc. For your specific task most probably you would have to define your own semantics, then use a statistical algorithm for analyzing the text and extracting the needed information.
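To illustrate the rules-over-dependencies route mentioned at the start of this answer, here is a rough sketch using CoreNLP's dependency parser. The class name, the paraphrased example sentence, and the choice of relations are mine and purely illustrative; commands phrased like yours would need far more robust rules and, as said, probably a proper semantic layer.

import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.semgraph.SemanticGraph;
import edu.stanford.nlp.semgraph.SemanticGraphCoreAnnotations;
import edu.stanford.nlp.semgraph.SemanticGraphEdge;
import edu.stanford.nlp.util.CoreMap;
import java.util.Properties;

public class RuleSketch {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.setProperty("annotators", "tokenize, ssplit, pos, depparse");
    StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

    // A grammatical paraphrase parses far better than the raw command-style input.
    Annotation doc = new Annotation("Andrew orders roses from the flower shop.");
    pipeline.annotate(doc);

    for (CoreMap sentence : doc.get(CoreAnnotations.SentencesAnnotation.class)) {
      SemanticGraph deps =
          sentence.get(SemanticGraphCoreAnnotations.BasicDependenciesAnnotation.class);
      for (SemanticGraphEdge edge : deps.edgeIterable()) {
        String rel = edge.getRelation().getShortName();
        // Crude rules: the subject of the main verb -> <Name>, its object -> part of <action>.
        if (rel.equals("nsubj") || rel.equals("dobj") || rel.equals("obj")) {
          System.out.println(rel + "(" + edge.getGovernor().word()
              + ", " + edge.getDependent().word() + ")");
        }
      }
    }
  }
}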
Here is a nice article about approaches to building chatbots: https://techinsight.com.vn/language/en/three-basic-nlp-problems-one-develops-chatbot-system-typical-approaches/

Named Entity Recognition for Arabic documents

I need your help, please. I am doing an NER project using NetBeans v8.0.2.
I need to get the person names and places out of any Arabic document file and categorize them as person name or place. I looked at all the Stanford tools: the POS tagger, the parser, and also Stanford NER. I tried them all, and the tagger works fine for me.
But I had problems with the parser, especially with this line of code:
LexicalizedParser lp = LexicalizedParser.loadModel(grammar, options);
from ParserDemo, and no output comes up. Do I need the parser first to tokenize the document and then use the POS tagger, or can I just use the POS tagger with some editing (like using an if statement to combine all NNP tokens together, and the same for places)?
So first of all, at the moment we do not have any Arabic NER models.
Secondly, I'll post some steps for running the Stanford parser on Arabic text.
Get the Stanford parser: http://nlp.stanford.edu/software/lex-parser.shtml
Compile ParserDemo.java; you need the jars in the stanford-parser-full-2015-04-20 directory to compile.
I ran this command at the command line while in the stanford-parser-full-2015-04-20 directory (do the analogous thing in NetBeans):
java -cp ".:*" ParserDemo edu/stanford/nlp/models/lexparser/arabicFactored.ser.gz data/arabic-onesent-utf8.txt
You should get a proper parse of the Arabic example sentence.
So when you run ParserDemo in NetBeans, make sure you provide "edu/stanford/nlp/models/lexparser/arabicFactored.ser.gz" as the first argument to ParserDemo, so it knows to load the Arabic model.
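If you would rather do the same thing from code inside NetBeans than from the command line, a sketch along the lines of ParserDemo would be the following. The class name is mine; the model and file paths are the same ones used above, and note that real Arabic input generally needs to be word-segmented first (see the Arabic parser FAQ linked at the end of this answer).

import edu.stanford.nlp.ling.HasWord;
import edu.stanford.nlp.parser.lexparser.LexicalizedParser;
import edu.stanford.nlp.process.DocumentPreprocessor;
import edu.stanford.nlp.trees.Tree;
import java.util.List;

public class ArabicParseSketch {
  public static void main(String[] args) {
    // Load the Arabic grammar from the models jar, as on the command line above.
    LexicalizedParser lp =
        LexicalizedParser.loadModel("edu/stanford/nlp/models/lexparser/arabicFactored.ser.gz");
    // Read the UTF-8 Arabic example file shipped with the parser distribution.
    DocumentPreprocessor sentences = new DocumentPreprocessor("data/arabic-onesent-utf8.txt");
    for (List<HasWord> sentence : sentences) {
      Tree parse = lp.apply(sentence);
      parse.pennPrint();   // prints the tree in Penn Treebank bracket style
    }
  }
}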
For this input:
و نشر العدل من خلال قضاء مستقل
I get this output:
(ROOT
  (S (CC و)
    (VP (VBD نشر)
      (NP (DTNN العدل))
      (PP (IN من)
        (NP (NN خلال)
          (NP (NN قضاء) (JJ مستقل)))))
    (PUNC .)))
I am happy to help you further, please let me know if you need any more info.
FYI here is some more info on the Arabic parser:
http://nlp.stanford.edu/software/parser-arabic-faq.shtml

How to get the text back from a Tree

Please, could anybody help me extract the text from a Tree?
e.g. : (NP (NP (DT the) (JJ main) (NN road)) (PP (IN of) (NP (NNP Rontau))))
The text : "the main road of Rontau"
I'm using the Stanford trees package.
You want the "yield" of a parse tree. You can access this as a list of Label instances using the method Tree#yield.
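For instance, a small sketch (the class name is mine; Tree.valueOf is used here just to build the tree from your bracketed string, whereas in practice the Tree would come straight from the parser):

import edu.stanford.nlp.ling.Label;
import edu.stanford.nlp.trees.Tree;
import java.util.ArrayList;

public class YieldSketch {
  public static void main(String[] args) {
    Tree tree = Tree.valueOf(
        "(NP (NP (DT the) (JJ main) (NN road)) (PP (IN of) (NP (NNP Rontau))))");
    // yield() returns the leaves (the words) of the tree in left-to-right order.
    ArrayList<Label> words = tree.yield();
    StringBuilder text = new StringBuilder();
    for (Label word : words) {
      if (text.length() > 0) text.append(' ');
      text.append(word.value());
    }
    System.out.println(text);   // the main road of Rontau
  }
}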

emacs lisp (elisp): modifying indent-for-tab-command in actionscript-mode

I'm using actionscript-mode in emacs, found here http://www.emacswiki.org/emacs/ActionScriptMode. I would like to modify it. The problem is in the behavior of the TAB command. I'm concerned with the case that the cursor is on the first column of a line that contains code already. Most code-editing modes in emacs handle this case by (1) performing an indentation calculation and adjustment, and (2) advancing the cursor to the first non-whitespace character. For some reason, the actionscript-mode does (1) but not (2). How can I modify it? I know a little emacs-lisp but not enough to follow the code. I'm not going to post the entire actionscript-mode.el, but there may be a clue in this section:
(defun actionscript-indent-line ()
  "Indent current line of As3 code. Delete any trailing
whitespace. Keep point at same relative point in the line."
  (interactive)
  (save-excursion
    (end-of-line)
    (delete-horizontal-space))
  (let ((old-pos (point)))
    (back-to-indentation)
    (let ((delta (- old-pos (point)))
          (col (max 0 (as3-calculate-indentation))))
      (indent-line-to col)
      (forward-char delta))))
The comment here says "keep point at same relative point in the line." Maybe I can just turn off that part.
I figured it out. The function records the cursor's offset from the indentation and restores it with forward-char, so if the cursor started at the beginning of the line it ends up back there. I need to modify a few things: I want to preserve the relative position when the cursor is in the middle of the line, past the first non-whitespace character, but when the cursor is positioned before the first non-whitespace character I want the command to finish with back-to-indentation.

Difference between Print and Display

PLT Scheme's documentation says:
The rationale for providing print is that display and write both have relatively standard output conventions, and this standardization restricts the ways that an environment can change the behavior of these procedures. No output conventions should be assumed for print, so that environments are free to modify the actual output generated by print in any way.
Could somebody please explain what that means for a noob, and how print and display are different?
The thing is that programs can expect certain output formats from write and display. In PLT, it is possible to change how they behave, but it is a little involved to do so. This is intentional, since such a change can have dramatic and unexpected results.
OTOH, changing how print behaves is deliberately easy -- just see the current-print documentation. The idea is that print is used for debugging, for presenting code to you in an interactive REPL -- not as a tool that you will rely on for output that needs to be formatted in a specific way. (BTW, see also the "~v" directive for format, printf, etc.)
You are free to override the print function. If you want to override standardized functions, for example write, you must obey the output conventions; otherwise code that uses them may break.
About display and write:
The Scheme Programming Language, 3rd edition, p. 178:
(display obj)
(display obj output-port)
returns unspecified
display is similar to write but prints strings and characters found within obj directly. Strings are printed without quotation marks or slashes, and characters are printed without #\ notation. For example, both (display "(a b c)") and (display '("a b" c)) would print (a b c). Because of this, display should not be used to print objects that are intended to be read with read. display is useful primarily for printing messages, with obj most often being a string.
If you don't want to replace print you might try out SRFI-28 instead:
http://srfi.schemers.org/srfi-28/srfi-28.html
Basically, print is meant for debugging, i.e. showing the data/value without any beautification or formatting, while display is meant for showing the data in your desired or more beautified format.
For example:
print(dataframe)
display(dataframe)
