Specific Part of Speech labels for Java Stanford NLP - stanford-nlp

What is the set of PoS labels produced by Stanford NLP (including the PoS labels for punctuation tokens), and what does each one mean?
I know this question has been asked several times, such as in:
Java Stanford NLP: Part of Speech labels?
http://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html
http://www.mathcs.emory.edu/~choi/doc/clear-dependency-2012.pdf
but those answers list typical PoS labels that are not specific to Stanford NLP. For instance, none of them lists the -LRB- PoS label that Stanford NLP uses for the ( punctuation token.
Where can I find this list of PoS labels in the Stanford NLP source code?
Also, what are some token examples annotated with the SYM PoS label?
Also, how can I tell whether a token is punctuation?
Here they define isPunctuation == true if the PoS is one of :|,|.|“|”|-LRB-|-RRB-|HYPH|NFP|SYM|PUNC. However, Stanford NLP does not produce all of these PoS tags.

It is the Penn Treebank POS set, but many descriptions of this tag set seem to omit punctuation marks. Here is a complete list of tags:
https://www.eecis.udel.edu/~vijay/cis889/ie/pos-set.pdf
(But parentheses are tagged as -LRB- and -RRB-, not sure why they don't mention this in the documentation.)
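To test this programmatically, one option (just a sketch, not an official Stanford API) is to run the POS tagger and compare each token's tag against the Penn Treebank punctuation tags; which tags you count as punctuation (SYM, # and $ are borderline) is your own call:

import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.util.CoreMap;

import java.util.Arrays;
import java.util.HashSet;
import java.util.Properties;
import java.util.Set;

public class PunctuationCheck {
    // Assumed punctuation tag set; -LRB-/-RRB- are the bracket tags, SYM/#/$ are borderline.
    private static final Set<String> PUNCT_TAGS = new HashSet<>(Arrays.asList(
            ",", ".", ":", "``", "''", "-LRB-", "-RRB-", "#", "$", "SYM"));

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize,ssplit,pos");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

        Annotation doc = new Annotation("John (our PI) said: it costs $5, i.e. 5 + 0.");
        pipeline.annotate(doc);

        for (CoreMap sentence : doc.get(CoreAnnotations.SentencesAnnotation.class)) {
            for (CoreLabel token : sentence.get(CoreAnnotations.TokensAnnotation.class)) {
                String pos = token.get(CoreAnnotations.PartOfSpeechAnnotation.class);
                System.out.printf("%-10s %-6s %s%n",
                        token.word(), pos, PUNCT_TAGS.contains(pos) ? "punctuation" : "");
            }
        }
    }
}

As for SYM, it is typically assigned to stray mathematical or technical symbols such as +, =, % or § that no more specific tag covers.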

Related

Is there an algorithm to filter out sentence fragments?

I have in a database thousands of sentences (highlights from kindle books) and some of them are sentence fragments (e.g. "You can have the nicest, most") which I am trying to filter out.
As per some definition I found, a sentence fragment is missing either its subject or its main verb.
I tried to find some kind of sentence-fragment detection algorithm, but without success.
In the example above I can even see a subject (You) and a verb (have), yet it still doesn't look like a full sentence to me.
I thought about restricting by length (e.g. excluding strings shorter than 30 characters), but I don't think that's a good idea.
Any suggestion on how you would do it?
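One heuristic (only a sketch, not a complete solution) is to run the Stanford constituency parser and keep only highlights whose top-level parse is an S containing both an NP and a VP. It rejects many headless or verbless fragments (the parser often labels them FRAG), though it will not catch truncations such as the example above that still look clausal, and it wrongly rejects imperatives like "Go home." which have no explicit subject:

import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.trees.Tree;
import edu.stanford.nlp.trees.TreeCoreAnnotations;
import edu.stanford.nlp.util.CoreMap;

import java.util.Properties;

public class FragmentFilter {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize,ssplit,pos,parse");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

        String[] highlights = {
                "You can have the nicest, most",
                "She finally finished the book."
        };
        for (String text : highlights) {
            Annotation doc = new Annotation(text);
            pipeline.annotate(doc);
            CoreMap sentence = doc.get(CoreAnnotations.SentencesAnnotation.class).get(0);
            Tree root = sentence.get(TreeCoreAnnotations.TreeAnnotation.class);
            System.out.println(text + " -> " + (looksComplete(root) ? "sentence" : "fragment"));
        }
    }

    // Heuristic: the top constituent under ROOT should be an S with both an NP and a VP child.
    private static boolean looksComplete(Tree root) {
        if (root.numChildren() == 0) return false;
        Tree top = root.firstChild();            // child of the ROOT node
        if (!top.label().value().equals("S")) return false;
        boolean hasNP = false, hasVP = false;
        for (Tree child : top.children()) {
            String label = child.label().value();
            if (label.equals("NP")) hasNP = true;
            if (label.equals("VP")) hasVP = true;
        }
        return hasNP && hasVP;
    }
}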

Struggling to understand user dictionary format in Elasticsearch Kuromoji tokenizer

I want to use the Elasticsearch Kuromoji plugin for Japanese. However, I'm struggling to understand the user_dictionary file format for the tokenizer. The Elastic docs https://www.elastic.co/guide/en/elasticsearch/plugins/current/analysis-kuromoji-tokenizer.html describe it as CSV of the following form:
The Kuromoji tokenizer uses the MeCab-IPADIC dictionary by default. A user_dictionary may be appended to the default dictionary. The dictionary should have the following CSV format:
<text>,<token 1> ... <token n>,<reading 1> ... <reading n>,<part-of-speech tag>
So there is not much in the documentation about that.
Looking at the sample entry the doc shows, it looks like this:
東京スカイツリー,東京 スカイツリー,トウキョウ スカイツリー,カスタム名詞
So, breaking it down, the first element is the dictionary text:
東京スカイツリー - Tokyo Sky Tree
東京 スカイツリー - Tokyo Sky Tree again - I assume the space here denotes token boundaries, but I wonder why "Tokyo" is a separate token while "sky tree" is not split into "sky" and "tree"?
トウキョウ スカイツリー - then we have the reading forms, again "Tokyo" and "sky tree" - why is it split that way? Can I specify more than one reading form of the text in this column (if there are any)?
The last field is the part of speech, which is the bit I don't understand. カスタム名詞 means "custom noun". I assume I can define the part of speech as verb, noun, etc., but what are the rules - should it follow some standard format of part-of-speech names? I have seen examples where it is specified simply as "noun" - 名詞 - but in this example it is a custom noun.
Does anyone have ideas or materials, especially about the part-of-speech field - for example, what the available values are? Additionally, what impact does this field have on the overall tokenizer behaviour?
Thanks
Did you try to define "Tokyo Sky Tree" like one of these?
"東京スカイツリー,東京スカイツリー,トウキョウスカイツリー,カスタム名詞"
"東京スカイツリー,東京 スカイツリー,トウキョウ スカイツリー,カスタム名詞"
I encountered another problem: Found duplicate term [東京スカイツリー] in user dictionary at line [1]
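That duplicate-term error means the same surface form 東京スカイツリー appears on more than one line of the dictionary file; keep only one entry per surface form. For completeness, the user dictionary file is wired into the index settings roughly like this (adapted from the Elastic docs; the index, tokenizer and analyzer names are placeholders, and userdict_ja.txt is assumed to live in the Elasticsearch config directory):

PUT kuromoji_sample
{
  "settings": {
    "index": {
      "analysis": {
        "tokenizer": {
          "kuromoji_user_dict": {
            "type": "kuromoji_tokenizer",
            "mode": "extended",
            "user_dictionary": "userdict_ja.txt"
          }
        },
        "analyzer": {
          "my_analyzer": {
            "type": "custom",
            "tokenizer": "kuromoji_user_dict"
          }
        }
      }
    }
  }
}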

How to replace the matched text with a word specifying by rules files?

I am currently working on a Stanford CoreNLP program that replaces matched text with a specified word, using a list of given rules. I have checked the TokensRegex expression documentation, and I know there are functions that can be used in the Action field:
Replace(List<CoreMap>, tokensregex, replacement)
Match(String, regex, replacement)
to do that. However, it is not clear to me how to use these functions in my rules files, and I couldn't find any examples on GitHub or elsewhere.
Here is an example of a replacement:
Input text: John Smith is a member of the NLP lab.
Matched pattern: "John Smith" is replaced with "Student A" in the text.
Resulting text: Student A is a member of the NLP lab.
Could anyone help me? I am new to Stanford CoreNLP and have a lot to learn.
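If the rules-file route stays unclear, one workaround is to do the replacement in Java rather than inside the rules file: run a TokensRegex pattern over the tokens and rewrite the original text by character offsets. A minimal sketch (the pattern "John Smith" and the replacement "Student A" are hard-coded here but would normally come from your rules):

import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.ling.tokensregex.TokenSequenceMatcher;
import edu.stanford.nlp.ling.tokensregex.TokenSequencePattern;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.util.CoreMap;

import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

public class TokensRegexReplace {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize,ssplit");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

        String text = "John Smith is a member of the NLP lab.";
        Annotation doc = new Annotation(text);
        pipeline.annotate(doc);

        TokenSequencePattern pattern = TokenSequencePattern.compile("John Smith");
        String replacement = "Student A";

        // Collect the character spans of all matches, then rewrite from the end
        // so that earlier offsets stay valid.
        List<int[]> spans = new ArrayList<>();
        for (CoreMap sentence : doc.get(CoreAnnotations.SentencesAnnotation.class)) {
            List<CoreLabel> tokens = sentence.get(CoreAnnotations.TokensAnnotation.class);
            TokenSequenceMatcher matcher = pattern.getMatcher(tokens);
            while (matcher.find()) {
                List<? extends CoreMap> matched = matcher.groupNodes();
                int begin = matched.get(0).get(CoreAnnotations.CharacterOffsetBeginAnnotation.class);
                int end = matched.get(matched.size() - 1)
                        .get(CoreAnnotations.CharacterOffsetEndAnnotation.class);
                spans.add(new int[]{begin, end});
            }
        }

        StringBuilder result = new StringBuilder(text);
        for (int i = spans.size() - 1; i >= 0; i--) {
            result.replace(spans.get(i)[0], spans.get(i)[1], replacement);
        }
        System.out.println(result); // Student A is a member of the NLP lab.
    }
}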

stanfordnlp - Training space-separated words as a single token for Stanford NER model generation

I have read the detailed description given here - http://nlp.stanford.edu/software/crf-faq.shtml#a - on training a model from a labelled input file according to a .prop file. But the article says:
You should make sure each line consists of solely content fields and tab characters. Spaces don't work. Extra tabs will cause problems.
My text corpus contains some space-separated words that together form a single token rather than separate words. For instance, "Wright State University" is one token even though Wright, State and University are individual words. I would like to generate the model with the above treated as a single token. The article says that the input file for model generation should be tab-separated, with the first column being the token and the second column the label. How can I achieve this?
Typically NER training data is in the form of natural language sentences where each token has an NER tag. You might have 10,000 sentences or more.
For instance: "He attended Wright State University."
should be represented as:
He O
attended O
Wright SCHOOL
State SCHOOL
University SCHOOL
. O
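The token and label columns in that file are tied together by the map property in the .prop file. A minimal sketch along the lines of the austen.prop example from the CRF FAQ (file names here are placeholders):

# train.tsv is the tab-separated file shown above: token <TAB> label, one token per line
trainFile = train.tsv
serializeTo = school-ner-model.ser.gz
map = word=0,answer=1

useClassFeature = true
useWord = true
useNGrams = true
noMidNGrams = true
maxNGramLeng = 6
usePrev = true
useNext = true
useSequences = true
usePrevSequences = true
maxLeft = 1
useTypeSeqs = true
useTypeSeqs2 = true
useTypeySequences = true
wordShape = chris2useLC
useDisjunctive = true

# Train with:
#   java -cp stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier -prop school.prop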
If you don't have sentences, and you simply have a list of strings that should be tagged a certain way, it makes more sense to use RegexNER.
You can find a thorough description of how to use RegexNER here:
http://nlp.stanford.edu/software/regexner.html
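A hedged sketch of the RegexNER route: the mapping file (called schools.txt here, an assumed name) holds one tab-separated entry per line, e.g. Wright State University<TAB>SCHOOL, and the pipeline picks it up via the regexner.mapping property:

import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.util.CoreMap;

import java.util.Properties;

public class RegexNerDemo {
    public static void main(String[] args) {
        // schools.txt (assumed file name) contains tab-separated entries such as:
        //   Wright State University<TAB>SCHOOL
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner,regexner");
        props.setProperty("regexner.mapping", "schools.txt");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

        Annotation doc = new Annotation("He attended Wright State University.");
        pipeline.annotate(doc);
        for (CoreMap sentence : doc.get(CoreAnnotations.SentencesAnnotation.class)) {
            for (CoreLabel token : sentence.get(CoreAnnotations.TokensAnnotation.class)) {
                System.out.println(token.word() + "\t" + token.ner());
            }
        }
    }
}

Each word of the phrase keeps its own line in the output, but all three receive the SCHOOL label, which is usually what downstream consumers need.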

How to improve detection of sentences in Sphinx?

With Sphinx it is possible to search for words within a single sentence. For example, say we have the following text:
Вася молодец, съел огурец, т.к. проголодался. Такие дела. (roughly: "Vasya did well, he ate a cucumber because [т.к., short for "так как"] he got hungry. That's how it is.")
If I search
молодец SENTENCE огурец
I find this text. If I search
молодец SENTENCE проголодался
I can't find this text, because the dot in the abbreviation т.к. is treated as the end of a sentence.
As far as I can see, the set of delimiters is hardcoded in Sphinx's sources.
My question is: how can I improve sentence detection? The best option for me would be to use Yandex's Tomita parser or another NLP library with smarter sentence detection.
1. Split the text into sentences with Yandex's Tomita parser; the output is text separated by "\n".
2. Delete every ".", "!" and "?", keeping only the last one in each sentence (see the sketch below).
3. Build the Sphinx index from this preprocessed data.
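A minimal Java sketch of step 2, assuming Tomita's output is one sentence per line as produced in step 1:

import java.util.ArrayList;
import java.util.List;

public class SentencePreprocessor {
    // Within each Tomita-produced sentence, drop every '.', '!' and '?' except the final
    // terminator, so Sphinx's SENTENCE operator is not confused by abbreviations such as "т.к.".
    public static String stripInnerTerminators(String tomitaOutput) {
        List<String> cleaned = new ArrayList<>();
        for (String line : tomitaOutput.split("\n")) {
            String s = line.trim();
            if (s.isEmpty()) continue;
            char last = s.charAt(s.length() - 1);
            boolean endsWithTerminator = last == '.' || last == '!' || last == '?';
            String body = endsWithTerminator ? s.substring(0, s.length() - 1) : s;
            body = body.replaceAll("[.!?]", "");
            cleaned.add(endsWithTerminator ? body + last : body);
        }
        return String.join("\n", cleaned);
    }

    public static void main(String[] args) {
        String tomitaOutput = "Вася молодец, съел огурец, т.к. проголодался.\nТакие дела.";
        System.out.println(stripInnerTerminators(tomitaOutput));
        // Вася молодец, съел огурец, тк проголодался.
        // Такие дела.
    }
}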
