How to keep punctuation in typedDependencies? - stanford-nlp

ChineseGrammaticalStructure gs = new ChineseGrammaticalStructure(t);
Collection<TypedDependency> tdl = gs.typedDependenciesCollapsed();
I have tried printing both gs and tdl: gs keeps the punctuation, while tdl loses it. How do I keep punctuation when converting a GrammaticalStructure to typed dependencies with the Stanford parser 3.9.1?

It seems the Stanford parser makes this hard; people who have the same question may want to try CoreNLP instead.

Related

Specific Part of Speech labels for Java Stanford NLP

What is the set of PoS labels produced by Stanford NLP (including PoS tags for punctuation tokens), and where are they described?
I know this question has been asked several times, such as in:
Java Stanford NLP: Part of Speech labels?
http://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html
http://www.mathcs.emory.edu/~choi/doc/clear-dependency-2012.pdf
but those answers list typical PoS labels which are not specific to Stanford NLP. For instance, none of them lists the -LRB- PoS label used by Stanford NLP for the ( punctuation.
Where can I find this list of PoS labels in the source code of the Stanford NLP?
Also, what are some token examples annotated with the SYM PoS label?
Also, how do I know whether a token is punctuation?
Here they define isPunctuation == true if its PoS is :|,|.|“|”|-LRB-|-RRB-|HYPH|NFP|SYM|PUNC. However, Stanford NLP does not use all of these PoS tags.
It is the Penn Treebank POS set, but many descriptions of this tag set seem to omit punctuation marks. Here is a complete list of tags:
https://www.eecis.udel.edu/~vijay/cis889/ie/pos-set.pdf
(But parentheses are tagged as -LRB- and -RRB-, not sure why they don't mention this in the documentation.)
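Expressed as code, the check is just set membership over the tag list. A minimal Ruby sketch (the tag list below is the Penn Treebank punctuation tags; whether to also count SYM, #, and $ as punctuation is a judgment call):
# Penn Treebank punctuation tags as used by the Stanford tools.
# Extend with SYM, '#', or '$' if you treat symbols as punctuation.
PUNCT_TAGS = %w[. , : -LRB- -RRB- `` ''].freeze

def punctuation_tag?(tag)
  PUNCT_TAGS.include?(tag)
end

punctuation_tag?('-LRB-')  #=> true
punctuation_tag?('NN')     #=> false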

How do I write regexes for German character classes like letters, vowels, and consonants?

For example, I set up these:
L = /[a-zA-ZßäüöÄÖÜ]/
V = /[äöüÄÖÜaeiouAEIOU]/
K = /[ßb-zB-Z&&[^#{V}]]/
So that /(#{K}#{V}{2})/ matches a consonant followed by two vowels.
Are there any better ways of dealing with them?
Could I put those constants in a module in a file somewhere in my Ruby installation folder, so I can include/require them in any new script I write on my computer? (I'm a newbie and I know I'm muddling this terminology; please correct me.)
Furthermore, could I get just the meta-characters \L, \V, and \K (or whatever isn't already set in Ruby) to stand for them in regexes, so I don't have to do that string interpolation thing all the time?
You're starting pretty well, but you should look through the Regexp documentation that ships with Ruby. There are tricks for writing patterns that build themselves up using string interpolation: you write the bricks and let Ruby build the walls and the house with normal String operations, then turn the resulting strings into true Regexp instances for use in your code.
For instance:
LOWER_CASE_CHARS = 'a-z'
UPPER_CASE_CHARS = 'A-Z'
CHARS = LOWER_CASE_CHARS + UPPER_CASE_CHARS
DIGITS = '0-9'
CHARS_REGEX = /[#{ CHARS }]/
DIGITS_REGEX = /[#{ DIGITS }]/
WORDS = "#{ CHARS }#{ DIGITS }_"
WORDS_REGEX = /[#{ WORDS }]/
You keep building from small atomic characters and character classes and soon you'll have big regular expressions. Try pasting those one by one into IRB and you'll quickly get the hang of it.
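As for reusing these constants across scripts: rather than putting anything in your Ruby installation folder, give them their own file and module and require it. A minimal sketch (the file and module names here are made up):
# char_classes.rb -- shared character-class building blocks
module CharClasses
  LOWER  = 'a-z'
  UPPER  = 'A-Z'
  CHARS  = LOWER + UPPER
  DIGITS = '0-9'

  CHARS_REGEX = /[#{CHARS}]/
  WORDS_REGEX = /[#{CHARS}#{DIGITS}_]/
end
Then, from any other script:
require_relative 'char_classes'

'foo_42' =~ CharClasses::WORDS_REGEX  #=> 0
(You can't define new backslash metacharacters like \L in Ruby, so named constants are as close as you'll get; note that \K already has a meaning in Ruby regexes, where it resets the start of the reported match.)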
A small improvement on what you have now would be to use the regex engine's Unicode support for categories or scripts.
If you mean L to be any letter, use \p{L}. Or use \p{Latin} if you want it to mean any letter in a Latin script (all German letters are).
I don't think there are built-ins for vowels and consonants.
Try \p{L} against your example string and watch it match.
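For example (a sketch; the vowel list is hand-rolled, since Unicode has no vowel category):
letters = /\p{L}/              # any Unicode letter, including ß, ä, ö, ü
'Straße'.scan(letters).join    #=> "Straße"

# Consonants as "Latin letters minus the vowels", using Ruby's
# character-class intersection (&&):
vowels     = /[aeiouäöüAEIOUÄÖÜ]/
consonants = /[\p{Latin}&&[^aeiouäöüAEIOUÄÖÜ]]/
'Bäume'.scan(consonants).join  #=> "Bm"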

How do I correctly deal with non-breaking spaces using Nokogiri?

I am using Nokogiri to parse an HTML page, but I am having odd problems with non-breaking spaces. I tried different encodings, replacing the whitespace, and a few other headache-inducing attempts.
Here is the HTML snippet in question:
<td>Amount 15,300&nbsp;at dollars</td>
Note the change in the representation after I use Nokogiri:
<td>Amount 15,300&#xA0;at dollars</td>
And outputting the inner_text:
Amount 15,300Â at dollars
This is my base Nokogiri grab; I did try a few alternatives, but failed miserably:
doc = Nokogiri::HTML(open(url))
And then I do a doc.search for the item in question.
Note that if I look at the doc, the line shows up with the &#xA0; on that line.
Clarification: I do not think I clearly stated the difficulty I am having. I can't get the inner_text to show up without the strange Â symbol.
Unless you really, really want to keep the &#xA0; notation, there shouldn't be a problem here.
A0 is the hex character code for a non-breaking space. As such, &#xA0; prints a non-breaking space, and is exactly equivalent to &#160;. &nbsp; does the same thing, too.
What Nokogiri is doing here is reading the text node, recognizing the entities, and converting them to their actual string representation internally. Then, when converting the text node back to an HTML-friendly version, it represents the non-breaking space by its hex character reference rather than taking the performance overhead of looking it up in an entity table, since they're equivalent anyway.
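You can watch the round trip directly (a sketch; exactly how the serializer re-escapes the character can vary with the Nokogiri/libxml2 version):
require 'nokogiri'

frag = Nokogiri::HTML::DocumentFragment.parse('<td>Amount 15,300&nbsp;at dollars</td>')

# The entity has been decoded to the real U+00A0 character internally:
frag.text.include?("\u00A0")  #=> true

# Re-serializing emits a character reference rather than the
# named entity you started with:
puts frag.to_html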
Assuming that Â was what you were seeing and wasn't just an issue pasting into Stack Overflow, this is a text-encoding issue: the output software (browser?) isn't in UTF-8 mode, so it doesn't know how to handle character code A0 and does the best it can. If this is a browser, adding <meta charset="utf-8"> to the head will solve this issue, and will make the rest of the output more Unicode-friendly.
If you really, really want &#xA0;, use gsub to replace them in your final output. Otherwise, don't worry about it.
I know this is old, but it took me an hour to find out how to solve this problem, and it is really easy once you know how. Just pass your string to this function and it will be "de-nbsp-fied".
def strip_html(str)
  nbsp = Nokogiri::HTML("&nbsp;").text
  str.gsub(nbsp, '')
end
You could also replace it with a space if you wished. May many of you find this answer!
As @sawa says, the main problem is what you see when writing to the console: it isn't correctly displaying the non-breaking space after Nokogiri converts the entity to the actual character.
The usual way to fix the problem is to preprocess the content:
require 'nokogiri'
html = '<td>Amount 15,300&nbsp;at dollars</td>'
doc = Nokogiri::HTML::DocumentFragment.parse(html.gsub(/&(?:#xa0|#160|nbsp);/i, ' '))
puts doc.to_html
Which outputs:
<td>Amount 15,300 at dollars</td>
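If you'd rather parse first and clean up afterwards, the same replacement can be applied to the parsed document's text nodes (a sketch):
doc = Nokogiri::HTML::DocumentFragment.parse(html)
doc.xpath('.//text()').each do |node|
  node.content = node.content.tr("\u00A0", ' ')  # nbsp -> plain space
end
puts doc.to_html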

Ruby from any encoding to ASCII

I mainly have to deal with the English alphabet and all the punctuation marks; I don't have to worry about European accents. So my only concern is when a user pastes something copied from the web that includes, for instance, an apostrophe that, when I puts it in the console (on Win7), outputs
"ItΓÇÖs" # whereas it actually is " It's "
So my main question is: is there an end-it-all conversion method I can use in Ruby that just properly replaces all the ,.;?!"'~` _- characters with their ASCII counterparts?
I really understand very little about encodings, so if you think this is the wrong question to ask, which may very well be the case, please advise what I should look for instead.
Thank you
I work in publishing, where we deal with this a lot. We have had success with stringex (https://github.com/rsl/stringex), which provides a to_ascii method that normalizes Unicode dashes, quotes, and so on.
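For example (a sketch; the exact transliterations depend on the stringex version):
require 'stringex'  # gem install stringex

"It’s — “quoted”".to_ascii
#=> %{It's - "quoted"} (or similar, depending on the version)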
And in Ruby 2.0:
"ItΓÇÖs".encode("ASCII", invalid: :replace, undef: :replace, replace: '')
=> "Its"
For programmatically handling multibyte encodings, iconv is your friend, and James Gray wrote a series of blog articles about how to take the problem apart and convert encodings.
The problem gets more complicated when dealing with pasted-in text, because some characters could be in one multibyte encoding and other characters in another. You might have to walk the string checking for multibyte characters, ask Ruby what the encoding is, and, if it's not what you expect, convert it to the expected or desired encoding, then move on to the next character. Gray's articles cover it all nicely and are good reading.
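For the specific Win7 example above, the garbling is what UTF-8 bytes look like when displayed through code page 437, so one way to undo it (assuming CP437 really is the console's code page) is:
mangled = 'ItΓÇÖs'  # UTF-8 bytes for "It’s" displayed as CP437

# Re-encode the mojibake back to its CP437 bytes, then relabel those
# bytes as the UTF-8 they originally were:
fixed = mangled.encode('CP437').force_encoding('UTF-8')
fixed  #=> "It’s"

# From there, downgrade to ASCII, substituting a plain apostrophe:
fixed.encode('ASCII', invalid: :replace, undef: :replace, replace: "'")
#=> "It's"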

Syntax highlighting of Template Toolkit within UltraEdit

Has anyone successfully created a "wordfile" that works? I've tried, but I can't get it to highlight [% and %].
I am much too late to answer this; I was just looking for help with my own wordfile problem and came across your unanswered question. After fighting with my wordfile for several hours, I thought this should be fairly easy to complete, so here is a wordfile I was able to get working to highlight your special "words":
/L19"Test" nocase Line Comment = ! String Chars = '" File Extensions = test
/Colors = 0,8421376,8421376,8421504,255,
/Colors Back = 16777215,16777215,16777215,16777215,16777215,
/Colors Auto Back = 1,1,1,1,1,
/Font Style = 0,0,0,0,0,
/Delimiters = ~!#^&*()_-+=|\/{}:;"'<> , .?
/C1 Colors = 8421376 Colors Back = 16777215 Colors Auto Back = 1 Font Style = 0
%]
[%
There is probably a lot in there you don't care about; the special things I came across:
It appears the lines for the different colors have to be sorted by their first character, and each new starting character should begin a new line.
Since you are trying to treat special characters as words, I removed them from the "Delimiters" line; leaving them in may be a problem.
When saving the .uew file, make sure the newline characters are correct for the OS you are working on. The file I started with was a Unix file, and I am working on a Windows box; once I saved my file converting the line endings to Windows, it worked perfectly.
Well, sorry for being so late to answer your question. I don't normally watch the UltraEdit tag, but do like the tool.
