Resolving how square brackets vs. parentheses are conceptually handled in the pipeline - stanford-nlp

When I parse a sentence that contains left and right square brackets, the parse is much, much slower than if the sentence contained left and right parentheses and the default normalizeOtherBrackets value of true is kept (roughly 20 seconds vs. 3 seconds for the ParserAnnotator to run), and the resulting tree is different as well. If this property is instead set to false, then the parse times for brackets vs. parentheses are pretty comparable, but the parse trees are still very different. With a value of true for text with brackets, the POS is -LRB-, whereas for false it is CD; in each case the general substructure of the tree is the same.
In the case of my corpus, the brackets are overwhelmingly meant to "clarify the antecedent" as described on this site. However, the PRN phrase-level label exists for parentheses but not for square brackets, so the formation of the tree is inherently different even though they serve nearly the same function in the sentence.
So, why are the parse times so different, and what can be done to get a proper parse? Obviously, a simplistic approach would be to replace brackets with parentheses, but that does not seem like a satisfying solution. Are there any settings that can provide me with some relief? Here is my code:
private void execute() {
    Properties props = new Properties();
    props.setProperty("annotators", "tokenize, ssplit, pos, lemma, ner");
    props.setProperty("tokenize.options", "normalizeOtherBrackets=false");
    StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
    // create an empty Annotation just with the given text
    Annotation document = new Annotation(text);
    // run all Annotators on this text
    pipeline.annotate(document);
    // these are all the sentences in this document
    // a CoreMap is essentially a Map that uses class objects as keys and has values with custom types
    List<CoreMap> sentences = document.get(SentencesAnnotation.class);
    long start = System.nanoTime();
    ParserAnnotator pa = new ParserAnnotator(DefaultPaths.DEFAULT_PARSER_MODEL, false, 60, new String[0]);
    pa.annotate(document);
    long end = System.nanoTime();
    System.out.println((end - start) / 1000000 + " ms");
    for (CoreMap sentence : sentences) {
        System.out.println(sentence);
        Tree cm = sentence.get(TreeAnnotation.class);
        cm.pennPrint();
        List<CoreLabel> cm2 = sentence.get(TokensAnnotation.class);
        for (CoreLabel c : cm2) {
            System.out.println(c);
        }
    }
}

Yes, this is a known (and very tricky) issue, and unfortunately there is no perfect solution for it.
The problem with such sentences is mainly caused by two things: these parenthetical constructions don't appear in the parser's training data, and the part-of-speech tagger and the parser interact in an unfortunate way.
If you run the tokenizer with the option normalizeOtherBrackets=true, then the square brackets will be replaced with -LSB- and -RSB-. Now, in most (or maybe even all) of the training examples for the part-of-speech tagger, -LSB- has the POS tag -LRB- and -RSB- has the tag -RRB-. Therefore the POS tagger, despite taking context into account, basically always assigns these tags to square brackets.
Now, if you run the POS tagger before the parser, the parser operates in delexicalized mode, which means it tries to parse the sentence based on the POS tags of the individual words instead of the words themselves. Assuming your sentence has a POS sequence similar to -LRB- NN -RRB-, the parser tries to find production rules that let it parse this sequence. However, the parser's training data doesn't contain such sequences, so there are no such production rules, and consequently the parser is unable to parse a sentence with this POS sequence.
When this happens, the parser goes into a fallback mode, which relaxes the constraints on the part-of-speech tags by assigning a very low probability to all the other possible POS tags of every word, so that it can eventually parse the sentence according to its production rules. This step, however, requires a lot of computation, and that is why such sentences take so much longer to parse.
On the other hand, if you set normalizeOtherBrackets to false, the tokenizer won't translate square brackets to -LSB- and -RSB-, and the POS tagger will assign tags to the original [ and ]. However, our training data was preprocessed such that all square brackets were replaced, and therefore [ and ] are treated as unknown words; the POS tagger then assigns tags purely based on context, which most likely results in a POS sequence that is part of the training data of our parser and tagger. This in turn means that the parser can parse the sentence without entering recovery mode, which is a lot faster.
If your data contains a lot of sentences with such constructions, I would recommend that you don't run the POS tagger before the parser (by removing pos from the list of annotators); a sketch of this setup follows below. This should give you faster but still reliable parses. Otherwise, if you have access to a treebank, you could also manually add a few trees with these constructions and train a new parsing model. I would not set normalizeOtherBrackets to false, because then your tokenized input will differ from the training data of the parser and the tagger, which will potentially give you worse results.
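For illustration, a minimal sketch of that setup, reusing the pieces from the question above (text and the imports are assumed from there). Since lemma and ner depend on POS tags, they are left out of this sketch:
Properties props = new Properties();
// leave "pos" out of the pipeline so the parser is not constrained by pre-assigned tags
props.setProperty("annotators", "tokenize, ssplit");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
Annotation document = new Annotation(text);
pipeline.annotate(document);
// the parser now does its own tagging as part of the search for the best tree
ParserAnnotator pa = new ParserAnnotator(DefaultPaths.DEFAULT_PARSER_MODEL, false, 60, new String[0]);
pa.annotate(document);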

Related

Best Data Structure For Text Processing

I have the following data structure (a dictionary of lists, with unique keys in the dictionary):
{'cat': ['feline', 'kitten'], 'brave': ['courageous', 'fearless'], 'little': ['mini']}
and a sentence like this:
A courageous feline was recently spotted in the neighborhood protecting her mini kitten
How would I efficiently process this text to convert the synonyms of the word cat to the word cat, such that the output is like this:
A fearless cat was recently spotted in the neighborhood protecting her little cat
The algorithm I want is something that can process the initial text and convert each synonym into its root word (a key in the dictionary); the lists of keywords and synonyms will also keep growing.
Hence, first, I want to ask whether the data structure I am using can perform efficiently, and whether there is a more efficient structure.
For now, I can only think of looping through each list inside the dictionary, searching for the synonym, and then mapping it back to its keyword.
edit: Refined the question
Your dictionary is organised the wrong way around. It lets you quickly find a target word, but that is not helpful when your input does not contain the target word itself, only some synonym of it.
So organise your dictionary in the opposite sense:
d = {
    'feline': 'cat',
    'kitten': 'cat'
}
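If your data already lives in the key-to-synonyms form from the question, a minimal sketch to invert it (assuming each synonym belongs to only one root word):
original = {'cat': ['feline', 'kitten'], 'brave': ['courageous', 'fearless'], 'little': ['mini']}
# flip the mapping: each synonym becomes a key pointing at its root word
d = {synonym: root for root, synonyms in original.items() for synonym in synonyms}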
To make the replacements, you could create a regular expression and call re.sub with a callback function that looks up the translation of the matched word:
import re
regex = re.compile(rf"\b(?:{ '|'.join(map(re.escape, d)) })\b")
s = "A feline was recently spotted in the neighborhood protecting her little kitten"
print(regex.sub(lambda match: d[match[0]], s))
The regular expression makes sure that the match is with a complete word, and not with a substring -- "cafeline" as input will not give a match for "feline".
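For the sample sentence above, this prints: A cat was recently spotted in the neighborhood protecting her little cat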

Limiting BART HuggingFace Model to complete sentences of maximum length

I'm implementing BART with HuggingFace; see the reference: https://huggingface.co/transformers/model_doc/bart.html
Here is the code from their documentation that works in creating a generated summary:
from transformers import BartModel, BartTokenizer, BartForConditionalGeneration
model = BartForConditionalGeneration.from_pretrained('facebook/bart-large-cnn')
tokenizer = BartTokenizer.from_pretrained('facebook/bart-large')
def baseBart(ARTICLE_TO_SUMMARIZE):
    inputs = tokenizer([ARTICLE_TO_SUMMARIZE], max_length=1024, return_tensors='pt')
    # Generate Summary
    summary_ids = model.generate(inputs['input_ids'], num_beams=4, max_length=25, early_stopping=True)
    return [tokenizer.decode(g, skip_special_tokens=True, clean_up_tokenization_spaces=False) for g in summary_ids][0]
I need to impose conciseness on my summaries, so I am setting max_length=25. In doing so, though, I'm getting incomplete sentences such as these two examples:
EX1: The opacity at the left lung base appears stable from prior exam.
There is elevation of the left hemidi
EX2: There is normal mineralization and alignment. No fracture or
osseous lesion is identified. The ankle mort
How do I make sure that the predicted summary contains only coherent sentences with complete thoughts while remaining concise? If possible, I'd prefer not to run a regex on the summarized output and cut off any text after the last period, but to actually have the BART model produce sentences within the maximum length.
I tried setting truncation=True in the model but that didn't work.

Gensim most_similar() with FastText word vectors returns useless/meaningless words

I'm using Gensim with FastText word vectors to return similar words.
This is my code:
import gensim
model = gensim.models.KeyedVectors.load_word2vec_format('cc.it.300.vec')
words = model.most_similar(positive=['sole'],topn=10)
print(words)
This will return:
[('sole.', 0.6860659122467041), ('sole.Ma', 0.6750558614730835), ('sole.Il', 0.6727924942970276), ('sole.E', 0.6680260896682739), ('sole.A', 0.6419174075126648), ('sole.È', 0.6401025652885437), ('splende', 0.6336565613746643), ('sole.La', 0.6049465537071228), ('sole.I', 0.5922051668167114), ('sole.Un', 0.5904430150985718)]
The problem is that "sole" ("sun" in English) returns a series of words with a dot in them (like sole., sole.Ma, etc.). Where is the problem? Why does most_similar return these meaningless words?
EDIT
I tried with the English word vectors, and the word "sun" returns this:
[('sunlight', 0.6970556974411011), ('sunshine', 0.6911839246749878), ('sun.', 0.6835992336273193), ('sun-', 0.6780728101730347), ('suns', 0.6730450391769409), ('moon', 0.6499731540679932), ('solar', 0.6437565088272095), ('rays', 0.6423950791358948), ('shade', 0.6366724371910095), ('sunrays', 0.6306195259094238)] 
Is it impossible to reproduce results like relatedwords.org?
Perhaps the bigger question is: why does the Facebook FastText cc.it.300.vec model include so many meaningless words? (I haven't noticed that before – is there any chance you've downloaded a peculiar model that has decorated words with extra analytical markup?)
To gain the unique benefits of FastText – including the ability to synthesize plausible (better-than-nothing) vectors for out-of-vocabulary words – you may not want to use the general load_word2vec_format() on the plain-text .vec file, but rather a Facebook-FastText specific load method on the .bin file. See:
https://radimrehurek.com/gensim/models/fasttext.html#gensim.models.fasttext.load_facebook_vectors
(I'm not sure that will help with these results, but if you choose to use FastText, you may be interested in using it "fully".)
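For completeness, a minimal sketch of that route, assuming you have downloaded the corresponding cc.it.300.bin file:
from gensim.models.fasttext import load_facebook_vectors

# load the native Facebook model, which keeps the subword (character n-gram) information
wv = load_facebook_vectors('cc.it.300.bin')

# with the .bin model, even out-of-vocabulary words get vectors synthesized from n-grams
print(wv.most_similar(positive=['sole'], topn=10))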
Finally, given the source of this training data – common-crawl text from the open web, which may contain lots of typos/junk – these might be legitimate word-like tokens, essentially typos of sole, that appear often enough in the training data to get word vectors. (And because they really are typo-synonyms of 'sole', they're not necessarily bad results for all purposes, just for your desired purpose of only seeing "real-ish" words.)
You might find it helpful to try using the restrict_vocab argument of most_similar(), to only receive results from the leading (most-frequent) part of all known word-vectors. For example, to only get results from among the top 50000 words:
words = model.most_similar(positive=['sole'], topn=10, restrict_vocab=50000)
Picking the right value for restrict_vocab might help in practice to leave out long-tail 'junk' words, while still providing the real/common similar words you seek.

How to set up training and feature template files for NER? - CRF++

For the problem of named entity recognition, after tokenizing the sentences, how do you set up the columns? It looks like one column in the documentation is the POS tag, but where do these come from? Am I supposed to tag the POS myself, or is there a tool to generate these?
What does the next column represent? A class like PERSON, LOCATION, etc.? And does it have to be in any particular format?
Is there any example of a completed training file and template for NER?
You can find example training and test data in the CRF++ repo here. The training data for noun phrase chunking looks like this:
Confidence NN B
in IN O
the DT B
pound NN I
is VBZ O
widely RB O
expected VBN O
... etc ...
The columns are arbitrary, in that they can be anything. CRF++ requires that every line have the same number of columns (or be blank, to separate sentences); not all CRF packages require that. You will have to provide the data values yourself; they are the data the classifier learns from.
While anything can go in the various columns, one convention you should know is IOB Format. To deal with potentially multi-token entities, you mark them as Inside/Outside/Beginning. It may be useful to give an example. Pretend we are training a classifier to detect names - for compactness I'll write this on one line:
John/B Smith/I ate/O an/O apple/O ./O
In columnar format it would look like this:
John B
Smith I
ate O
an O
apple O
. O
With these tags, B (beginning) means the word is the first in an entity, I means a word is inside an entity (it comes after a B tag), and O means the word is not an entity. If you have more than one type of entity, it's typical to use labels like B-PERSON or I-PLACE, as in the example below.
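For instance, a fragment with both PERSON and PLACE entities (a made-up illustration in the same columnar format) would look like this:
John B-PERSON
Smith I-PERSON
visited O
New B-PLACE
York I-PLACE
. O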
The reason for using IOB tags is so that the classifier can learn different transition probabilities for starting, continuing, and ending entities. So if you're learning company names, it'll learn that Inc./I-COMPANY usually transitions to an O label, because Inc. is usually the last part of a company name.
Templates are another matter; CRF++ uses its own special format, but again, there are examples in the source distribution you can look at. Also see this question.
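To give a flavor of that format, here is a minimal sketch of a template file (the column indices are assumptions and must match your own training data; %x[row,col] selects a feature at a row offset relative to the current token):
# unigram features over the word column (column 0)
U00:%x[-1,0]
U01:%x[0,0]
U02:%x[1,0]
# unigram feature over the POS column (column 1)
U03:%x[0,1]
# bigram over adjacent output labels
B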
To answer the comment on my answer, you can generate POS tags using any POS tagger. You don't even have to provide POS tags at all, though they're usually helpful. The other labels can be added by hand or automatically; for example, you can use a list of known nouns as a starting point. Here's an example using spaCy for a simple name detector:
import spacy

nlp = spacy.load('en')
names = ['John', 'Jane']  # extend this with the names you want to detect

text = nlp("John ate an apple.")
for word in text:
    person = 'O'  # default: not a person
    if str(word) in names:
        person = 'B-PERSON'
    print(str(word), word.pos_, person)

Stanford CoreNLP: output in CoNLL format from Java

I want to parse some German text with Stanford CoreNLP and obtain CoNLL output, so that I can pass the latter to CorZu for coreference resolution.
How can I do that programmatically?
Here is my code so far (which only outputs dependency trees):
Annotation germanAnnotation = new Annotation("Gestern habe ich eine blonde Frau getroffen");
Properties germanProperties = StringUtils.argsToProperties("-props", "StanfordCoreNLP-german.properties");
StanfordCoreNLP pipeline = new StanfordCoreNLP(germanProperties);
pipeline.annotate(germanAnnotation);

StringBuilder trees = new StringBuilder("");
for (CoreMap sentence : germanAnnotation.get(CoreAnnotations.SentencesAnnotation.class)) {
    Tree sentenceTree = sentence.get(TreeCoreAnnotations.TreeAnnotation.class);
    trees.append(sentenceTree).append("\n");
}
With the following code I managed to save the parsing output in CoNLL format.
OutputStream outputStream = new FileOutputStream(new File("./target/", OUTPUT_FILE_NAME));
CoNLLOutputter.conllPrint(germanAnnotation, outputStream, pipeline);
However, the HEAD field was 0 for all words. I am not sure whether the problem lies in the parsing or only in the CoNLLOutputter. Honestly, I was too annoyed by CoreNLP to investigate further.
I decided to use ParZu instead, and I suggest you do the same. ParZu and CorZu are made to work together seamlessly - and they do.
In my case, I had an already tokenized and POS-tagged text. This makes things easier, since you will not need:
A POS-Tagger using the STTS tagset
A tool for morphological analysis
Once you have ParZu and CorZu installed, you will only need to run corzu.sh (included in the CorZu download folder). If your text is tokenized and POS-tagged, you can edit the script accordingly:
parzu_cmd="/YourPath/ParZu/parzu -i tagged"
Last note: make sure to convert your tagged text to the following format, with empty lines signifying sentence boundaries: word [tab] tag [newline]
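For the example sentence above, a tokenized and tagged input would look like this (the tags are illustrative STTS tags):
Gestern	ADV
habe	VAFIN
ich	PPER
eine	ART
blonde	ADJA
Frau	NN
getroffen	VVPP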
