How to use a generative transformer model to extract words from the input sentence? - huggingface-transformers

I have a generative transformer model (T5). I am training it to perform an extractive task. But I find that it ends up generating words which are not present in the input.
My transformer input is 'answer for [INPUT_SENTENCE]: '. So the input sentence is passed as input to the transformer model.
Is there a way to force the decoder to only generate words which are present in the input sentence?
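One possible direction, offered as a sketch rather than a confirmed answer: the `generate` method in Transformers accepts a `prefix_allowed_tokens_fn` callback that returns the token ids permitted at each decoding step. The helper below (its name and parameters are made up for illustration) builds such a callback from the encoder input ids, so the decoder can only emit subword tokens that occur in the input, plus the end-of-sequence token.

```python
def make_allowed_tokens_fn(batch_input_ids, eos_token_id, pad_token_id=None):
    """Build a callback restricting generation to tokens seen in each input.

    batch_input_ids: list of lists of token ids, one list per input example.
    eos_token_id:    always allowed, so that generation can terminate.
    pad_token_id:    optional id to exclude from the allowed set.
    """
    allowed_per_example = []
    for ids in batch_input_ids:
        allowed = set(ids)
        if pad_token_id is not None:
            allowed.discard(pad_token_id)  # padding is not real input content
        allowed.add(eos_token_id)
        allowed_per_example.append(sorted(allowed))

    def allowed_tokens(batch_id, generated_ids):
        # generated_ids (the tokens decoded so far) is ignored in this sketch:
        # the allowed set depends only on which input example is being decoded.
        return allowed_per_example[batch_id]

    return allowed_tokens
```

Assuming `enc = tokenizer(texts, return_tensors="pt", padding=True)`, the callback could be passed as `model.generate(**enc, prefix_allowed_tokens_fn=make_allowed_tokens_fn(enc["input_ids"].tolist(), tokenizer.eos_token_id, tokenizer.pad_token_id))`. Note the limits: the restriction works on subword tokens, so it also admits the tokens of the 'answer for' prompt and does not guarantee that whole words appear in their original order.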

Related

How to initialize BertForSequenceClassification for different input rather than [CLS] token?

BertForSequenceClassification uses [CLS] token's representation to feed a linear classifier. I want to leverage another token (say [X] in the input sequence) rather than [CLS]. What's the most straightforward way to implement that in Transformers?
You can define the special tokens when creating the tokenizer.
This is an example of how to modify a special token of a pretrained tokenizer:
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased", cls_token="[X]")
Please check the BertTokenizer documentation to see which other special tokens you can modify (unk_token, sep_token, pad_token, cls_token, mask_token, etc.).

Entity recognition for a single word rather than a whole sentence: which NLP algorithm should I use?

When I pass the parameter "marry" to a function, it should return "this is name"; when I pass "MIT", it should return "this is institution name". The input is just a single word, not a whole sentence. Which NLP algorithm should I use? Thanks.
This looks like part-of-speech tagging and/or named entity recognition, but if you are processing English, single words without context are potentially ambiguous. Also, single words may not be informative. "new" on its own can be an adjective (POS), but "New York" is most likely a location (NER). Check some literature on both tasks and consider processing at least sentence-level features.

In Elasticsearch, how can I tokenize words separated by spaces and still match them when typed without spaces?

Here is what I want to achieve :
My field value : "one two three"
I want to be able to match this field by typing: one or onetwo or onetwothree or onethree or twothree or two or three
For that, the tokenizer needs to produce these tokens:
one
onetwo
onetwothree
onethree
two
twothree
three
Do you know how I can implement this analyzer?
The same problem arises in German, where separate words are joined into one. Elasticsearch handles this with a technique called "compound words". There is a dedicated token filter, the "compound word token filter", which tries to find sub-words from a given dictionary in a string. You only have to define the dictionary for your language. The full specification is at the link below.
https://www.elastic.co/guide/en/elasticsearch/reference/5.5/analysis-compound-word-tokenfilter.html
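Separately from the dictionary-based filter, the token list in the question can be illustrated directly: it is every in-order combination of the field's words, joined without spaces. The sketch below assumes the tokens are computed in application code before indexing; the function name is hypothetical and nothing here is an Elasticsearch API.

```python
from itertools import combinations


def ordered_concatenations(field_value):
    """Return every in-order, non-empty combination of words, joined together."""
    words = field_value.split()
    tokens = []
    for size in range(1, len(words) + 1):
        # combinations() preserves the original word order within each tuple
        for combo in combinations(words, size):
            tokens.append("".join(combo))
    return tokens


print(ordered_concatenations("one two three"))
```

A field with n words yields 2^n - 1 tokens, so this is only practical for short fields.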

stanfordnlp - Training space-separated words as a single token for Stanford NER model generation

I have read the detailed description given here (http://nlp.stanford.edu/software/crf-faq.shtml#a) on training the model from a labelled input file according to the .prop file. But the article says:
You should make sure each line consists of solely content fields and tab characters. Spaces don't work. Extra tabs will cause problems.
My text corpus has some space-separated words that together form a single token rather than separate words. For instance, "Wright State University" is a single token, even though Wright, State and University are entities individually. I would like to generate the model with the above treated as a single token. The article says that the input file for generating the model should contain tab-separated words, with the first column being the token and the second column the label. How can I achieve this?
Typically NER training data is in the form of natural language sentences where each token has an NER tag. You might have 10,000 sentences or more.
For instance: "He attended Wright State University."
should be represented as:
He O
attended O
Wright SCHOOL
State SCHOOL
University SCHOOL
. O
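A minimal sketch of producing that layout from labelled phrases, assuming your annotations are available as (text, label) pairs; the function name and the input shape are hypothetical. Each phrase is split on whitespace and every resulting token gets the phrase's label, separated by a tab as the FAQ requires.

```python
def to_training_lines(segments):
    """Expand (text, label) pairs into one 'token<TAB>label' line per token."""
    lines = []
    for text, label in segments:
        for token in text.split():
            # A multi-word phrase repeats its label on every token
            lines.append(f"{token}\t{label}")
    return lines


sentence = [
    ("He", "O"),
    ("attended", "O"),
    ("Wright State University", "SCHOOL"),
    (".", "O"),
]
print("\n".join(to_training_lines(sentence)))
```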
If you don't have sentences, and you simply have a list of strings that should be tagged a certain way, it makes more sense to use RegexNER.
You can find a thorough description of how to use RegexNER here:
http://nlp.stanford.edu/software/regexner.html

Programming idiom to parse a string in multiple-passes

I'm working on a Braille translation library, and I need to translate a string of text into braille. I plan to do this in multiple passes, but I need a way to keep track of which parts of the string have been translated and which have not, so I don't retranslate them.
I could always create a class which would track the ranges of positions in the string which had been processed, and then design my search/replace algorithm to ignore them on subsequent passes, but I'm wondering if there isn't a more elegant way to accomplish the same thing.
I would imagine that multi-pass string translation isn't all that uncommon, I'm just not sure what the options are for doing it.
A more usual approach would be to tokenize your input, then work on the tokens. For example, start by tokenizing the string into a token for each character. Then, in a first pass generate a straightforward braille mapping, token by token. In subsequent passes, you can replace more of the tokens - for example, by replacing sequences of input tokens with a single output token.
Because your tokens are objects or structs, rather than simple characters, you can attach additional information to each - such as the source token(s) you translated (or rather, transliterated) the current token from.
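The token-based approach might be sketched as follows. The two lookup tables are tiny illustrative placeholders, not a complete or authoritative Braille table, and all names are hypothetical. Each token keeps the source text it came from, so a later pass can match on sources and merge several tokens into one without retranslating anything.

```python
from dataclasses import dataclass


@dataclass
class Token:
    output: str  # current translated form
    source: str  # original text this token was produced from


# Illustrative placeholder tables only
LETTERS = {"t": "\u281e", "h": "\u2813", "e": "\u2811"}
CONTRACTIONS = {"the": "\u282e"}


def first_pass(text):
    """One token per character; unmapped characters pass through unchanged."""
    return [Token(LETTERS.get(ch, ch), ch) for ch in text]


def contraction_pass(tokens):
    """Replace runs of tokens whose joined source matches a contraction."""
    result = []
    i = 0
    while i < len(tokens):
        for src, cell in CONTRACTIONS.items():
            window = tokens[i:i + len(src)]
            if "".join(t.source for t in window) == src:
                result.append(Token(cell, src))  # merged token remembers its source
                i += len(src)
                break
        else:
            result.append(tokens[i])
            i += 1
    return result


def render(tokens):
    return "".join(t.output for t in tokens)
```

Calling `render(contraction_pass(first_pass("the hat")))` runs both passes; joining the `source` fields of the final tokens still reproduces the original string.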
Check out some basic compiler theory:
Lexical Analysis
Parsing/Syntax Analysis

Resources