Print Probabilities from CoreNLP Pipeline - stanford-nlp

I'm aware that printProbs on a classifier can print the probability that a particular token is a particular NER type. However, how can I access the CRFClassifier used by the CoreNLP pipeline in the code below so that I can actually call printProbs?
// create an empty Annotation just with the given text
Annotation document = new Annotation(text);
// run all Annotators on this text
pipeline.annotate(document);

One possible approach would be to write your own custom NER annotator, closely modeled on the existing one that wraps the CRFClassifier, and add it to your pipeline instead of ner. Basically, start with a copy of that code; from within it you can access the CRFClassifier methods that contain the cliques and the probabilities.
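If you go that route, CoreNLP lets you register a custom annotator class through a customAnnotatorClass.* property, so the wiring could look roughly like the sketch below. The name probner and the class my.package.ProbNERAnnotator are placeholders for your own copy of the NER annotator (which also needs a (String, Properties) constructor).
import java.util.Properties;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;

// Hypothetical wiring: "probner" replaces the standard ner annotator, and
// my.package.ProbNERAnnotator is your modified copy that exposes the probabilities.
Properties props = new Properties();
props.setProperty("annotators", "tokenize,ssplit,pos,lemma,probner");
props.setProperty("customAnnotatorClass.probner", "my.package.ProbNERAnnotator");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);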

One could also use reflection (hahaha)
NERCombinerAnnotator nerAnnotator = (NERCombinerAnnotator) StanfordCoreNLP.getExistingAnnotator(Annotator.STANFORD_NER);
Field field = nerAnnotator.getClass().getDeclaredField("ner");
field.setAccessible(true);
NERClassifierCombiner classifier = (NERClassifierCombiner) field.get(nerAnnotator);
Field field2 = classifier.getClass().getSuperclass().getDeclaredField("baseClassifiers");
field2.setAccessible(true);
// one of these will be the CRFClassifier used
List<?> baseClassifiers = (List<?>) field2.get(classifier);
You should also be aware that this code can throw the usual reflection exceptions (e.g. NoSuchFieldException, IllegalAccessException), so handle or declare them.
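Once you have baseClassifiers, a rough sketch of how you might use it, continuing the snippets above: it assumes document is the annotated Annotation and that CRFClassifier.printProbsDocument(...) is the probability-printing method you are after; adjust to whichever method you actually need.
// Continues the snippets above; imports from edu.stanford.nlp.ie.crf, .ling and .util are assumed.
for (Object c : baseClassifiers) {
    if (c instanceof CRFClassifier) {
        @SuppressWarnings("unchecked")
        CRFClassifier<CoreLabel> crf = (CRFClassifier<CoreLabel>) c;
        for (CoreMap sentence : document.get(CoreAnnotations.SentencesAnnotation.class)) {
            List<CoreLabel> tokens = sentence.get(CoreAnnotations.TokensAnnotation.class);
            // assumed to print the probability of each NER label for each token
            crf.printProbsDocument(tokens);
        }
    }
}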

Related

How to get the output of the last but one layer of the Vision transformer using the hugging face implementation?

I am trying to use the Hugging Face implementation of the Vision Transformer to get the feature vector of the last-but-one dense layer.
In order to get information from the second-to-last layer, you need to set output_hidden_states=True. Here is an example from my context:
from transformers import BertConfig, TFBertModel

configBert = BertConfig.from_pretrained('bert-base-uncased', output_hidden_states=True, num_labels=NUM_LABELS)
modelBert = TFBertModel.from_pretrained('bert-base-uncased', config=configBert)

How to provide parameter input for interaction variable in H2OGradientBoostingEstimator?

I need to use the interaction-variable feature of multiclass classification in H2OGradientBoostingEstimator in H2O in Python. I am not sure which parameter to use or how to use it. Can anyone please help me out with this?
Currently, I am using the below code -
pros_gbm = H2OGradientBoostingEstimator(nfolds=0,seed=1234, keep_cross_validation_predictions = False, ntrees=10, max_depth=3, learn_rate=0.01, distribution='multinomial')
hist_gbm = pros_gbm.train(x=predictors, y=target, training_frame=hf_train, validation_frame = hf_test,verbose=True)
GBM inherently creates interactions. You can extract information about feature interactions using the .feature_interaction() extractor method (for an H2O Model). More information is provided in the user guide and the Python docs.
If you want to explicitly add a new column that is the interaction between two numerics, you could create that manually by multiplying the two (or more) columns together to get a new interaction column.
For categorical interactions, there's also the h2o.interaction() method in Python to create interaction columns in the data (prior to sending it to the GBM or any other algorithm).

How do I train an encoder-decoder model for a translation task using Hugging Face transformers?

I would like to train an encoder-decoder model, configured as below, for a translation task. Could someone guide me on how to set up a training pipeline for such a model? Any links or code snippets that help me understand would be appreciated.
from transformers import BertConfig, EncoderDecoderConfig, EncoderDecoderModel
# Initializing a BERT bert-base-uncased style configuration
config_encoder = BertConfig()
config_decoder = BertConfig()
config = EncoderDecoderConfig.from_encoder_decoder_configs(config_encoder, config_decoder)
# Initializing a Bert2Bert model from the bert-base-uncased style configurations
model = EncoderDecoderModel(config=config)
The encoder-decoder models are used in the same way as any other models in Transformers. The model accepts batches of tokenized text as vocabulary indices (i.e., you need a tokenizer that is suitable for your sequence-to-sequence task). When you feed the model the input (input_ids) and the desired output (decoder_input_ids and labels), you get the loss value, which you can optimize during training. Note that if the sentences in the batch have different lengths, you need to do masking too. This is a minimal example from the EncoderDecoderModel documentation:
from transformers import EncoderDecoderModel, BertTokenizer
import torch
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = EncoderDecoderModel.from_encoder_decoder_pretrained(
    'bert-base-uncased', 'bert-base-uncased')
input_ids = torch.tensor(
    tokenizer.encode("Hello, my dog is cute", add_special_tokens=True)).unsqueeze(0)
outputs = model(
    input_ids=input_ids, decoder_input_ids=input_ids, labels=input_ids,
    return_dict=True)
loss = outputs.loss
If you do not want to write the training loop yourself, you can use dataset processing (DataCollatorForSeq2Seq) and training (Seq2SeqTrainer) utilities from Transformers. You can follow the Seq2Seq example on GitHub.

Iterate through tokens and find the entity for a token

Problem
After running CoreNLP over some text, I want to reconstruct a sentence, adding the POS tag for each token and grouping the tokens that form an entity.
This could be easily done if there was a way to see which entity a Token belongs to.
Approach
One option I was considering is to go through sentence.tokens() and find the index in a list containing only the tokens from all the CoreEntityMentions for that sentence. Then I could see which CoreEntityMention each token belongs to, so I can group them.
Another option could be to look at the offsets of each token in the sentence and compare them to the offsets of each CoreEntityMention.
I think the question is similar to what was asked here, but since it was a while ago, maybe the API has changed since.
This is the setup:
Properties props = new Properties();
props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner");
pipeline = new StanfordCoreNLP(props);
String text = "Some text with entities goes here";
CoreDocument coreDoc = new CoreDocument(text);
// annotate the document
pipeline.annotate(coreDoc);
for (CoreSentence sentence : coreDoc.sentences()) {
// Code goes here
List<CoreEntityMention> em = sentence.entityMentions();
}
Each token in an entity mention carries an index telling you which entity mention in the document it corresponds to:
cl.get(CoreAnnotations.EntityMentionIndexAnnotation.class); // cl is a CoreLabel token
I'll make a note to add a convenience method for this in future versions.
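Putting that together, here is a small sketch of my own (not from the answer above) that groups tokens by mention and prints the POS tag for everything else. It assumes the index refers to the document-level list returned by coreDoc.entityMentions() and that tokens outside any mention return null.
// Continues the setup above; imports for CoreLabel, CoreAnnotations and CoreEntityMention are assumed.
List<CoreEntityMention> mentions = coreDoc.entityMentions(); // document-level mentions
for (CoreSentence sentence : coreDoc.sentences()) {
    for (CoreLabel token : sentence.tokens()) {
        Integer mentionIndex = token.get(CoreAnnotations.EntityMentionIndexAnnotation.class);
        String pos = token.get(CoreAnnotations.PartOfSpeechAnnotation.class);
        if (mentionIndex != null) {
            System.out.println(token.word() + "/" + pos + " -> " + mentions.get(mentionIndex).text());
        } else {
            System.out.println(token.word() + "/" + pos);
        }
    }
}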

Apache Storm Trident .each() function explanation

I want to use Apache Storm's TridentTopology in a project. I am finding it difficult to understand the .each() function from the storm.trident.Stream class. Below is the example code given in their tutorial for reference:
TridentTopology topology = new TridentTopology();
TridentState wordCounts =
    topology.newStream("spout1", spout)
        .each(new Fields("sentence"), new Split(), new Fields("word"))
        .groupBy(new Fields("word"))
        .persistentAggregate(new MemoryMapState.Factory(), new Count(), new Fields("count"))
        .parallelismHint(6);
I don't understand the signature of the .each() method. Below is what I have understood; please correct me if I am wrong, and please give me some more information.
.each()
The first parameter takes the fields, which are the keys correlated to the values emitted from the spout and returned from the spout's getOutputFields() method. I still don't know what that parameter is used for.
The second parameter is the class extending BaseFunction; it processes the tuple.
My understanding of the third parameter is similar to that of the first.
The first parameter is a projection on the input tuples. In your example, only the field named "sentence" is provided to Split. If your source emits tuples with the schema Fields("first", "sentence", "third"), you can only access "sentence" in Split. Furthermore, "sentence" will have index zero (and not one) in Split. Pay attention that this is not a projection on the output -- all fields will remain in the output tuples! It's just a limited view of the whole tuple within Split.
The last parameter is the schema of the Values given to emit() within Split. These field names are appended as new attributes to the output tuples. Thus, the output tuple's schema is the input tuple's schema (original, not projected by the first parameter) plus the fields of this last parameter.
See section "Function" in the documentation: https://storm.apache.org/releases/0.10.0/Trident-API-Overview.html
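For reference, a sketch of what the Split function used above might look like: it is just a BaseFunction that reads the projected "sentence" field and emits one value per word (package names follow the 0.10.0 docs linked above; newer releases use the org.apache.storm prefix).
import backtype.storm.tuple.Values;
import storm.trident.operation.BaseFunction;
import storm.trident.operation.TridentCollector;
import storm.trident.tuple.TridentTuple;

public class Split extends BaseFunction {
    @Override
    public void execute(TridentTuple tuple, TridentCollector collector) {
        // index 0 because of the input projection new Fields("sentence")
        String sentence = tuple.getString(0);
        for (String word : sentence.split(" ")) {
            // each emitted value is appended to the tuple as the new "word" field
            collector.emit(new Values(word));
        }
    }
}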