CoreNLP tokenizer/sentence splitter misbehaves on HTML input - stanford-nlp

I'm using the Stanford CoreNLP pipeline from the command line to dependency-parse a large document, and it's very important that each line in the document receive its own dependency tree (otherwise downstream alignments break). This pair of lines is currently causing me grief:
<!-- copy from here --> <a href="http://strategis.gc.ca/epic/internet/inabc-eac.nsf/en/home"><img src="id-images/ad-220x80_01e.jpg" alt="Aboriginal Business Canada:
Opening New Doors for Your Business" width="220" height="80" border="0"></a> <!-- copy to here --> Small ABC Graphic Instructions 1.
This is the command I'm using:
java -cp "*" -Xmx1g -Xss515m edu.stanford.nlp.pipeline.StanfordCoreNLP \
-annotators tokenize,ssplit,pos,depparse \
-ssplit.eolonly true \
-ssplit.newlineIsSentenceBreak always \
-outputFormat conllu \
-file "input.txt"
And this is the resulting output:
1 <!-- copy from here --> _ _ JJ _ 9 amod _ _
2 <a href="http://strategis.gc.ca/epic/internet/inabc-eac.nsf/en/home"> _ _ JJ _ 9 amod _ _
3 <img src="id-images/ad-220x80_01e.jpg" alt="Aboriginal Business Canada:
Opening New Doors for Your Business" width="220" height="80" border="0"> _ _ NN _ 9 compound _ _
4 </a> _ _ NN _ 9 compound _ _
5 <!-- copy to here --> _ _ NN _ 9 compound _ _
6 Small _ _ JJ _ 9 amod _ _
7 ABC _ _ NNP _ 9 compound _ _
8 Graphic _ _ NNP _ 9 compound _ _
9 Instructions _ _ NNS _ 0 root _ _
10 1 _ _ CD _ 9 nummod _ _
11 . _ _ . _ 9 punct _ _
It looks like the newline character inside the quotation marks in the HTML tag is being interpreted as part of the token rather than as a sentence break. This is peculiar, since I'm using the -ssplit.newlineIsSentenceBreak always flag, which I expected to force the parser to split up the HTML code. However, even if I didn't need each line to get its own parse, this behavior is troubling because the resulting file is no longer in valid CoNLL-U format: line 3 has only two tab-separated columns (instead of the required 10) and line 4 has only 9.
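For reference, a valid CoNLL-U body line must have exactly 10 tab-separated columns, which makes the breakage easy to detect mechanically. A minimal Python sketch (the helper name is my own, not part of any tool):

```python
# Hypothetical helper: report (line number, column count) for every
# non-blank, non-comment line of CoNLL-U text that does not have the
# required 10 tab-separated columns.
def bad_conllu_lines(text):
    bad = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if not line or line.startswith("#"):
            continue  # blank lines and comments are allowed in CoNLL-U
        ncols = len(line.split("\t"))
        if ncols != 10:
            bad.append((lineno, ncols))
    return bad

sample = "1\tSmall\t_\t_\tJJ\t_\t9\tamod\t_\t_\n3\t<img ...\t_\t_"
print(bad_conllu_lines(sample))  # → [(2, 4)]
```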
One workaround I played with was turning each line of the original file into its own file and then feeding them all in with the -filelist parameter, but that created so much stdout output that it slowed things down and clogged the terminal. My attempts to redirect the output to /dev/null or turn on a "quiet mode" failed, but that's probably a question for another post.
I tried double-spacing the file, but that didn't help. Preprocessing the text with sed 's/"/\\"/g' does fix this problem, by destroying the pipeline's ability to recognize this as HTML code, but it introduces new problems, since the parser presumably wasn't trained on escaped quotation marks.
Obviously this is a weird sentence and I don't expect the output to be parsed sensibly, but I do need it to be formatted sensibly. Any tips?
Update
It was suggested that I try using the cleanxml annotator to get rid of the HTML tags altogether. This reduces the number of lines in the file, which may cause misalignment later, but since the HTML tags aren't getting parsed sensibly anyway, it seems independently advantageous to get rid of them. I'll update again later on whether this works for my purposes, but I'm open to other suggestions in the meantime.

There were two problems here:
The tokenizer parses an HTML/XML/SGML tag as a single token even when the quoted value of an attribute is split across lines. Usually this is a good thing: if this were regular text, keeping the whole img tag together would actually be sensible. But it is disastrous if you want to process text strictly one sentence per line, as in most machine translation corpora. In that case you want to treat each line as a sentence, even if the original sentence breaking was done wrongly, as here.
If a newline was captured in the value of an attribute, it was left in place, and on output this destroyed the integrity of at least the line-oriented CoNLL and CoNLL-U output formats.
I've added/changed code to address these problems:
There is a new tokenizer option, -tokenize.options tokenizePerLine, which prevents the tokenizer from looking across line boundaries when forming tokens or making tokenization decisions. (This option can be combined with all the other tokenize.options options in a comma-separated list.)
If a newline is captured in the value of an attribute, it is now mapped to U+00A0 (non-breaking space). This was already what happened to U+0020 (space) and is now also done for newlines. This fixes the CoNLL/CoNLL-U output and maintains the correct invariant for Stanford CoreNLP: tokens may occasionally contain non-breaking spaces, but never a space or newline.
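A sketch of what the second fix amounts to (my own illustrative helper, not CoreNLP's actual code): any whitespace captured inside a token is mapped to U+00A0, so line-oriented output formats survive.

```python
NBSP = "\u00a0"

def sanitize_token(token):
    # Map any newline (and plain space) captured inside a token to
    # U+00A0, preserving the invariant that a token never contains
    # a literal space or newline.
    return token.replace("\r\n", NBSP).replace("\n", NBSP).replace(" ", NBSP)

token = 'alt="Aboriginal Business Canada:\nOpening New Doors"'
clean = sanitize_token(token)
print("\n" in clean, " " in clean)  # → False False
```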
This problem is fixed in commit 0a17fe4c0fc4ccfb095f474bf113d1df0c6d17cb on the CoreNLP github. If you grab that version, or minimally update PTBLexer.class and PTBTokenizer.class, then you will have this new option and should be good. The following command should give you what you want:
java -cp "*" -Xmx1g edu.stanford.nlp.pipeline.StanfordCoreNLP \
-annotators tokenize,ssplit,pos,depparse -ssplit.eolonly true \
-tokenize.options tokenizePerLine -outputFormat conllu -file "input.txt"
p.s. I guess you were trying to fix things, but if you're using -ssplit.eolonly true then you shouldn't need, or see any difference from, -ssplit.newlineIsSentenceBreak always. Also, we should perhaps make turning on -ssplit.eolonly true automatically turn on -tokenize.options tokenizePerLine, but that is not presently the case.


How does PowerShell display the output string resulting from invoking an external terminal application (nssm.exe)

In PowerShell (5.1):
When calling an external command (in this case nssm.exe get logstash-service Application) directly, the output is displayed in PowerShell as I would have expected (ASCII string "M:\logstash-7.1.1\bin\logstash.bat"):
PS C:\> M:\nssm-2.24\win64\nssm.exe get logstash-service Application
M:\logstash-7.1.1\bin\logstash.bat
But the following command (which pipes the output into Out-Default) results in:
PS C:\> M:\nssm-2.24\win64\nssm.exe get logstash-service Application | Out-Default
M : \ l o g s t a s h - 7 . 1 . 1 \ b i n \ l o g s t a s h . b a t
(Please note all the "whitespace" separating the characters of the resulting output string.)
Also, the following attempt to capture the output (as an ASCII string) into the variable $outCmd results in:
PS C:\> $outCmd = M:\nssm-2.24\win64\nssm.exe get logstash-service Application
PS C:\> $outCmd
M : \ l o g s t a s h - 7 . 1 . 1 \ b i n \ l o g s t a s h . b a t
PS C:\>
Again, please note the separating whitespace between the characters.
Why is there a difference in the output between the first and the latter 2 commands?
Where are the "spaces" (or other kinds of whitespace chars) coming from in the output of the latter 2 commands?
What exactly needs to be done in order to capture the output of that external command as ASCII string "M:\logstash-7.1.1\bin\logstash.bat" (i.e. without the strange spaces in between)?
If the issue is related to ENCODING, please specify what exactly needs to be done/changed.
Yes, the problem is one of character encoding, and it often surfaces only when an external program's output is captured in a variable, sent through the pipeline, or redirected to a file.
Only in these cases does PowerShell get involved and decode the output into .NET strings before any further processing.
This decoding happens based on the encoding stored in the [Console]::OutputEncoding property, so for programs that do not themselves respect this encoding for their output you'll have to set this property to match the actual character encoding used.
Your symptom implies that nssm.exe outputs UTF-16LE-encoded ("Unicode") strings[1], so to capture them properly you'll have to do something like the following:
$orig = [Console]::OutputEncoding
[Console]::OutputEncoding = [System.Text.Encoding]::Unicode
# Store the output lines from nssm.exe in an array of strings.
$output = M:\nssm-2.24\win64\nssm.exe get logstash-service Application
[Console]::OutputEncoding = $orig
The underlying problem is that external programs are expected to use the current console's output code page for their output encoding, which defaults to the system's active legacy OEM code page, as reflected in [Console]::OutputEncoding (and reported by chcp), but some do not, in an attempt to:
either: overcome the limitations of the legacy single-byte OEM encodings in order to provide full Unicode support (as is the case here, although it is more common to do that with UTF-8 encoding, as the Node.js CLI, node.exe, does, for instance)
or: use the more widely used active ANSI legacy code page instead (as python does by default).
See this answer for additional information, which also links to two helper functions:
Invoke-WithEncoding, which wraps capturing output from an external program with a given encoding (see example below), and Debug-NativeInOutput, for diagnosing what encoding a given external program uses.
With function Invoke-WithEncoding from the linked answer defined, you could then call:
$output = Invoke-WithEncoding -Encoding Unicode {
M:\nssm-2.24\win64\nssm.exe get logstash-service Application
}
[1] The apparent spaces in the output are actually NUL characters (code point 0x0) that stem from the 0x0 high bytes of UTF-16LE code units in the 8-bit range (which includes all ASCII characters and most of Windows-1252). Because PowerShell, based on the single-byte OEM code page stored in [Console]::OutputEncoding (e.g., 437 on US-English systems), interprets each byte as a whole character, the 0x0 high bytes of the 2-byte (16-bit) Unicode code units are mistakenly retained as NUL characters, and in the console these characters render like spaces.
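The mis-decoding is easy to reproduce outside PowerShell. A small Python sketch (cp437 stands in for the OEM code page; nothing here is PowerShell-specific):

```python
# Reproduce the symptom: UTF-16LE bytes decoded with a single-byte
# OEM code page (cp437 here) keep the 0x0 high bytes as NUL characters.
text = r"M:\logstash-7.1.1\bin\logstash.bat"
raw = text.encode("utf-16-le")   # what a UTF-16LE-emitting program writes
wrong = raw.decode("cp437")      # what a cp437-based decode produces
print(wrong[0::2] == text)                 # → True (every other char is the text)
print(wrong[1::2] == "\x00" * len(text))   # → True (the rest are NULs)
```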

Removing diacritical marks from a Greek text in an automatic way

I have a decompiled stardict dictionary in the form of a tab file
κακός <tab> bad
where <tab> signifies a tabulation.
Unfortunately, the way the words are defined requires the query to include all diacritical marks. So if I want to search for ζῷον, I need to have all the iotas and circumflexes correct.
Thus I'd like to convert the whole file so that the keyword has the diacritic removed. So the line would become
κακος <tab> <h3>κακός</h3> <br/> bad
I know I could read the file line by line in bash, as described here [1]
while read line
do
command
done <file
But is there any way to automate the operation of converting the line? I heard about iconv [2] but didn't manage to achieve the desired conversion using it. Ideally I'd like to use a bash script.
Besides, is there an automatic way of transliterating Greek, e.g. using the method Perseus has?
/edit: Maybe we could use the Unicode codes? We can notice that U+1F0x, U+1F8x for x < 8, etc. are all variants of the letter α. This would reduce the amount of manual work. I'd accept a C++ solution as well.
[1] http://en.kioskea.net/faq/1757-how-to-read-a-file-line-by-line
[2] How to remove all of the diacritics from a file?
You can remove diacritics from a string relatively easily using Perl:
$_=NFKD($_);s/\p{InDiacriticals}//g;
for example:
$ echo 'ὦὢῶὼώὠὤ ᾪ' | perl -CS -MUnicode::Normalize -pne '$_=NFKD($_);s/\p{InDiacriticals}//g'
ωωωωωωω Ω
This works as follows:
The -CS enables UTF8 for Perl's stdin/stdout
The -MUnicode::Normalize loads a library for Unicode normalisation
-e executes the script from the command line; -n automatically loops over lines in the input; -p prints the output automatically
NFKD() translates the line into one of the Unicode normalisation forms; this means that accents and diacritics are decomposed into separate characters, which makes it easier to remove them in the next step
s/\p{InDiacriticals}//g removes all characters that Unicode denotes as diacritical marks
This should in fact work for removing diacritics etc for all scripts/languages that have good Unicode support, not just Greek.
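If Perl isn't at hand, the same approach can be sketched in Python with the standard unicodedata module (the Mn "Mark, nonspacing" category plays the role of \p{InDiacriticals}):

```python
import unicodedata

def strip_diacritics(s):
    # Decompose to NFD so each diacritic becomes a separate combining
    # character, then drop everything in the Mn (Mark, nonspacing) category.
    decomposed = unicodedata.normalize("NFD", s)
    return "".join(c for c in decomposed if unicodedata.category(c) != "Mn")

print(strip_diacritics("ὦὢῶὼώὠὤ ᾪ"))  # → ωωωωωωω Ω
```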
I'm not as familiar with Ancient Greek as I am with Modern Greek (which only really uses two diacritics).
However, I went through the vowels and found out which ones combine with diacritics. This gave me the following list:
ἆἂᾶὰάἀἄ
ἒὲέἐἔ
ἦἢῆὴήἠἤ
ἶἲῖὶίἰἴ
ὂὸόὀὄ
ὖὒῦὺύὐὔ
ὦὢῶὼώὠὤ
I saved this list in a file and passed it to this sed command:
cat test.txt | sed -e 's/[ἆἂᾶὰάἀἄ]/α/g;s/[ἒὲέἐἔ]/ε/g;s/[ἦἢῆὴήἠἤ]/η/g;s/[ἶἲῖὶίἰἴ]/ι/g;s/[ὂὸόὀὄ]/ο/g;s/[ὖὒῦὺύὐὔ]/υ/g;s/[ὦὢῶὼώὠὤ]/ω/g'
Credit to hungnv
It's a simple sed command. It takes each of the accented variants and replaces it with the unmarked character. The result of the above command is:
ααααααα
εεεεε
ηηηηηηη
ιιιιιιι
οοοοο
υυυυυυυ
ωωωωωωω
Regarding transliterating the Greek: the image from your post is intended to help the user type Greek on the site you took it from using similar glyphs, not always similar sounds. Those are poor transliterations; e.g. β is most often transliterated as v, ψ as ps, φ as ph, etc.

How to decrement (subtract) number in file with sed

I've got some source code like the following where I call a function in C:
void myFunction (
&((int) table[1, 0]),
&((int) table[2, 0]),
&((int) table[3, 0])
);
...the only problem is that the function has >300 parameters (it's an auto-generated wrapper for initialising and calling a whole module; it was given to me and I cannot change it). And as you can see, I began accessing the array with a 1 instead of a 0... Great times, modifying all 300 parameters by hand, i.e. decreasing the x-coordinate of the array 300 times.
The solution I am looking for is how I can get sed to do the work for me ;)
EDIT: Please note that the syntax above for accessing a two-dimensional array in C is wrong anyway! Of course it should be [1][0]... (so don't just copy-and-paste ;))
Basically, the command I came up with, was the following:
sed -r 's/(.*)(table\[)([0-9]+)(,)(.*)/echo "\1\2$((\3-1))\4\5"/ge' inputfile.c > outputfile.c
Well, this does not look very intuitive at first sight, and I was missing good explanations in nearly every example I found.
So I will try to give a detailed explanation of this:
sed
--> basic command
-r
--> most examples you find use -e; however, the -r parameter (GNU sed only) enables extended regular expressions and brings support for the + in a regex. It basically means "one or more matches".
's/input/output/ge'
--> this is the basic replacement syntax. It basically means "replace 'input' with 'output'". The g is a "global" flag, i.e. sed will replace all occurrences, not only the first one. You can add an additional e to execute the result in bash. This is what we want to do here to handle the calculation.
(.*)
--> this matches "everything" from the last match to the next match
(table\[)
--> the \ escapes the bracket. This part of the expression matches strings like table[
([0-9]+)
--> this one matches numbers of one or more digits
(,)
--> this simply matches the comma ,
(.*)
--> and again: the rest of the line
And now the interesting part:
echo "\1\2$((\3-1))\4\5"
the echo is a bash command
the \n (you can use any value from \1 to \9) is a kind of "variable" for the captures: \1 will contain the first match, \2 the second match, ... --> this helps you preserve parts of the input string
the $((1+1)) is a simple bash syntax to calculate the value of the term inside the double brackets (in the complete sed command above, the \3 will of course be automatically replaced by the 3rd match, i.e. the 1st part inside the brackets to access the table's cells)
please note that we use quotation marks around the echo content to also be able to process lines with characters like & which would otherwise not work
The already mentioned e of the /ge at the end will trigger the execution of the result in bash. E.g. the first two lines of the example source code in the question would produce the following bash statements:
echo "void myFunction ("
echo " &((int) table[$((1-1)), 0]),"
which is being executed and results in the following output:
void myFunction (
&((int) table[0, 0]),
...which is exactly what I wanted :)
BTW:
text > output.c
is simple bash syntax to output text (or in this case the sed-processed source code) to a file called output.c.
Good links about this topic are:
sed basics
regular expressions basics
Ahh, and one more thing: you can also use sed in Git Bash on Windows, if you are "forced" to use Windows at work like me ;)
PS: In the meantime I could have easily done this by hand but using sed was a lot more fun ;)
Here's another way you could do it, using Perl:
perl -pe 's/(table\[)(\d+)(,)/$1.($2-1).$3/e' file.c
This uses the e modifier to execute an expression in the replacement. The capture groups are concatenated together but the middle group has 1 subtracted from its value.
This will output to standard output so you can check that it does what you want. When you're happy, you can add the -i switch to overwrite the original file.
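For comparison, the same compute-in-the-replacement trick can be sketched in Python, where re.sub accepts a function as the replacement (the counterpart of Perl's /e modifier):

```python
import re

def decrement_indices(source):
    # Replace "table[N," with "table[N-1," using a replacement function,
    # so the new index can be computed rather than written literally.
    return re.sub(r"table\[(\d+),",
                  lambda m: "table[{},".format(int(m.group(1)) - 1),
                  source)

print(decrement_indices("&((int) table[1, 0]),"))  # → &((int) table[0, 0]),
```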

An error '\+' is an unrecognized escape in character string starting "\+" while creating an R package

I tried to create a package using some functions and scripts I created (using X11 on a Mac). While R CMD check was doing its work, it encountered a problem as follows:
temp = trim(unlist(strsplit(lp.add(ranefterms[[i]]),
+ "\+")))
Error: '\+' is an unrecognized escape in character string starting "\+"
The oddest thing, however, is that my function actually does NOT have "\+". Instead, it has "\\+" (see below). So I don't know why "\\+" is recognized as "\+".
for(i in 1:n)
temp = trim(unlist(strsplit(lp.add(ranefterms[[i]]), '\\+')))
To dig a little further, I looked at the packageName-Ex.R file in the Rcheck folder. As it turned out, all the "\\"s have been changed to "\" in the checking process (e.g., the double backslashes I need for functions such as strsplit() and grepl()).
I wonder what may have been the cause of it. Sorry that I can't come up with a reproducible example...
The offending code comes from the Examples section of one of your help files (which is why it ends up in packageName-Ex.R). To fix it, just escape each of the backslashes in the Examples sections of your *.Rd documentation files with a second backslash. (So, type \\ to get \ in the processed help file, and type \\\\ to get \\.)
Unless it's escaped, \ is interpreted as a special character that identifies sectioning and mark-up macros (i.e. commands like \author, \description, \bold, and \ldots). Quoting from Duncan Murdoch's Parsing Rd files (the official manual for this topic):
The backslash \ is used as an escape character: \, \%, { and }
remove the special meaning of the second character.
As an example of what this looks like in practice, here is part of $R_SOURCE_HOME/src/library/base/man/grep.Rd, which is processed to create the help file you see when you type ?grep or ?gsub.
## capitalizing
txt <- "a test of capitalizing"
gsub("(\\\\w)(\\\\w*)", "\\\\U\\\\1\\\\L\\\\2", txt, perl=TRUE)
gsub("\\\\b(\\\\w)", "\\\\U\\\\1", txt, perl=TRUE)
In the processed help file, it looks like this:
## capitalizing
txt <- "a test of capitalizing"
gsub("(\\w)(\\w*)", "\\U\\1\\L\\2", txt, perl=TRUE)
gsub("\\b(\\w)", "\\U\\1", txt, perl=TRUE)

C Shell Script Print Path name with Colon

I am trying to print the paths that are listed in the .cshrc file, which is easy with just echoing $path, but I need to add something to the end of each listed path, i.e.:
/opt/local/bin: /usr/ucb: /usr/bin:
I cannot edit or change the .cshrc file. I also tried to find something on concatenating in C shell, but from what I've read that doesn't really seem to "exist" in C shell. I am sorry if I sound arrogant in any way; I am new to C shell. If anyone has any pointers, advice is always great! Thank you!
echo "$PATH" | sed 's/:/: /g;s/ *$//'
's/' = substitute; '/:/: /' = targetPattern/replacementPattern; 'g' = do the replacement globally (on the current line); ';' = command separator; 's/ *$//' = substitute any trailing spaces at the end of the line ('$') with the replacement pattern '//' (nothing, empty, nada, zilch ;-)
Best to always echo any env var surrounded by double quotes unless you want any spaces in the variable to cause word splitting during command-line evaluation. Especially with PATH, as the space character is legal in a path.
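If you want to sanity-check the transformation outside the shell, here is the same substitution sketched in Python (the sample path is made up):

```python
import re

# Python counterpart of: echo "$PATH" | sed 's/:/: /g;s/ *$//'
path = "/opt/local/bin:/usr/ucb:/usr/bin"   # stand-in for $PATH
spaced = re.sub(r" *$", "", path.replace(":", ": "))
print(spaced)  # → /opt/local/bin: /usr/ucb: /usr/bin
```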
In general, concatenation works like this in CSH
set var1 = text1
set var2 = myText
echo "someText "$var1 " more stuff"$var2
# -----------^^^^^^^^^^^^^^^^--- deliberate, copy paste as is
I don't have a csh to print the output with, but copy and paste these lines and you'll see that spaces outside of "...." get reduced to 1 space, spaces inside of "...." stay in place, as many as you want, AND variables bumped up next to "text strings" do NOT get a space char inserted automatically; you have to put them in.
I don't see any arrogance in this question ;-)
But... before you spend 8+ years of your life using a 2nd-rate shell ( ;-( ), read everything at A great Unix Primer, especially Why csh is a less than perfect scripting language ;-)
P.S. Welcome to StackOverflow (S.O.)! Please remember to read the FAQs, http://tinyurl.com/2vycnvr , vote for good Q/A by using the gray triangles, http://i.imgur.com/kygEP.png , and accept the one answer that best solves your problem, if any, by pressing the checkmark sign, http://i.imgur.com/uqJeW.png
