How do you get a single embedding vector for each word (token) from RoBERTa? - word-embedding

As you may know, RoBERTa (BERT, etc.) has its own tokenizer, and sometimes you get pieces of a given word as tokens, e.g. embeddings » embed, #dings.
Given the nature of the task I am working on, I need a single representation for each word. How do I get it?
CLARIFICATION:
sentence: "embeddings are good" --> 3 words go in
output: [embed, #dings, are, good] --> 4 tokens come out
When I give a sentence to pre-trained RoBERTa, I get the encoded tokens, but in the end I need one representation per word. What's the solution? Summing the embed and #dings vectors point-wise?

I'm not sure if there is a standard practice, but from what I've seen, others simply take the average of the sub-token embeddings. Example: https://arxiv.org/abs/2006.01346, Section 2.3, line 4.
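For example, with the Hugging Face transformers library (a minimal sketch of my own, not taken from the paper; it assumes a fast tokenizer so that word_ids() is available to map sub-tokens back to words):

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModel.from_pretrained("roberta-base")

enc = tokenizer("embeddings are good", return_tensors="pt")
with torch.no_grad():
    hidden = model(**enc).last_hidden_state[0]           # (num_tokens, hidden_size)

word_ids = enc.word_ids()                                 # sub-token -> word index (None for special tokens)
word_vectors = []
for w in sorted({i for i in word_ids if i is not None}):
    positions = [p for p, i in enumerate(word_ids) if i == w]
    word_vectors.append(hidden[positions].mean(dim=0))    # average the sub-token vectors point-wise

# word_vectors now holds one vector each for "embeddings", "are" and "good"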

Related

Is there an algorithm to filter out sentence fragments?

I have in a database thousands of sentences (highlights from Kindle books), and some of them are sentence fragments (e.g. "You can have the nicest, most") which I am trying to filter out.
As per some definition I found, a sentence fragment is missing either its subject or its main verb.
I tried to find some kind of sentence fragment algorithm but without success.
But anyway in the above example, I can see the subject (You) and the verb (have) but it still doesn't look like a full sentence to me.
I thought about restricting on length (e.g. excluding strings shorter than 30 characters), but I don't think it's a good idea.
Any suggestion on how you would do it?
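Not part of the original thread, but one way to operationalize the "missing subject or main verb" definition is a dependency-parse check, e.g. with spaCy (the model name and dependency labels below are my assumptions, a sketch rather than a tested recipe):

import spacy

nlp = spacy.load("en_core_web_sm")

def looks_like_fragment(text: str) -> bool:
    doc = nlp(text)
    has_main_verb = any(t.dep_ == "ROOT" and t.pos_ in ("VERB", "AUX") for t in doc)
    has_subject = any(t.dep_ in ("nsubj", "nsubjpass", "expl") for t in doc)
    return not (has_main_verb and has_subject)

# Note: the example "You can have the nicest, most" has both a subject and a verb,
# so this check alone would not flag it; a truncated-ending heuristic (e.g. the
# string ends in a determiner, adjective or comma) would have to be layered on top.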

XLM-RoBERTa token - id relationship

I used the XLM-RoBERTa tokenizer in order to get the IDs for a bunch of sentences such as:
["loving is great", "This is another example"]
I see that the IDs returned are not always as many as the whitespace-separated tokens in my sentences: for example, the first sentence corresponds to [[0, 459, 6496, 83, 6782, 2]], with loving being 459 and 6496. After getting the matrix of word embeddings from the IDs, I was trying to identify only those word embeddings/vectors corresponding to some specific tokens: is there a way to do that? If the original tokens are sometimes assigned more than one ID and this cannot be predicted, I do not see how this is possible.
More generally, my task is to get word embeddings for specific tokens within a sentence: my goal is therefore to encode the whole sentence first, so that the embeddings of individual tokens are computed in their syntactic context, and then keep the vectors of only some specific tokens rather than those of all tokens in the sentence.
The mapping between tokens and IDs is unique; however, the text is segmented into subwords before you get the token (in this case subword) IDs.
You can find out what string the IDs belong to:
import transformers
tok = transformers.AutoTokenizer.from_pretrained("xlm-roberta-base")
tok.convert_ids_to_tokens([459, 6496])
You will get: ['▁lo', 'ving'], which shows how the first word was actually pre-processed.
The preprocessing splits the text on spaces and prepends the first token, and every token that was preceded by a space, with the ▁ sign. In the second step, it splits out-of-vocabulary tokens into subwords for which there are IDs in the vocabulary.
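Not part of the original answer, but one way to keep only the vectors for a specific word is to ask the (fast) tokenizer for character offsets and select the sub-token positions that overlap the word's character span; the sentence and target word below are just an illustration:

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("xlm-roberta-base")
sentence = "loving is great"
target = "loving"

enc = tok(sentence, return_offsets_mapping=True)
start = sentence.index(target)
end = start + len(target)

# keep sub-token positions whose character span overlaps the target word
positions = [p for p, (s, e) in enumerate(enc["offset_mapping"])
             if s < end and e > start and e > s]

print(positions)                                                              # expected: [1, 2]
print(tok.convert_ids_to_tokens([enc["input_ids"][p] for p in positions]))    # expected: ['▁lo', 'ving']
# index the model's hidden states with these positions to get just those vectors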

custom token filter for elasticsearch

I want to implement a custom token filter like this:
- single words are accepted if they match a specific (regex) pattern
- adjacent words are concatenated if one ends in a letter and the other begins with a digit (or vice versa)
This seems to map to:
step 1 - shingle - adjacent words joined together with a space
step 2 - if token matches pattern /pat1/, keep ... if token matches /pata patb/, replace the whitespace
step 3 - remove everything else.
Is there a way to achieve that? I have seen https://stackoverflow.com/questions/35742426/how-to-filter-tokens-based-on-a-regex-in-elasticsearch but don't feel like converting a complex pattern into one with lookahead.
The idea is to factor out potential order numbers from user input.
The data is assumed to be normalised, so an order number could be a regular ISBN 978<10_more_digits> or something like "ME4713P". Users might input "ME 4713P" or 978-<10_digits_and_some_dashes> instead.
Order numbers can be described as "contain both letters and digits, optional dashes" or "contain letters, a dash, more letters" or "contain digits, a dash, more digits"
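Not an answer from the original thread, but a sketch of how steps 1 and 2 might map onto the built-in shingle and pattern_replace token filters (the filter names, analyzer name and exact patterns are my assumptions); step 3, dropping everything that doesn't match, would still need something like the lookahead approach from the linked question or a predicate_token_filter script:

# settings body for index creation, e.g. via the Python client:
# Elasticsearch().indices.create(index="orders", settings=settings)
settings = {
    "analysis": {
        "filter": {
            "pairs": {                        # step 1: join adjacent words with a space
                "type": "shingle",
                "min_shingle_size": 2,
                "max_shingle_size": 2,
                "output_unigrams": True,
            },
            "join_letter_digit": {            # step 2: drop the space between letter/digit pairs
                "type": "pattern_replace",
                "pattern": "([A-Za-z]) ([0-9])",
                "replacement": "$1$2",
            },
            "join_digit_letter": {
                "type": "pattern_replace",
                "pattern": "([0-9]) ([A-Za-z])",
                "replacement": "$1$2",
            },
        },
        "analyzer": {
            "order_numbers": {
                "type": "custom",
                "tokenizer": "standard",
                "filter": ["pairs", "join_letter_digit", "join_digit_letter"],
            }
        },
    }
}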

Elasticsearch standard tokenizer behaviour and word boundaries

I am not sure why the standard tokenizer (used by the default standard analyzer) behaves like this in this scenario:
- If I use the word system.exe it generates the token system.exe. I understand . is not a word breaker.
- If I use the word system32.exe it generates the tokens system32 and exe. I don't understand this: why does it break the word when it finds a number followed by a . ?
- If I use the word system32tm.exe it generates the token system32tm.exe. As in the first example, it works as expected, not breaking the word into different tokens.
I have read http://unicode.org/reports/tr29/#Word_Boundaries but I still don't understand why a number + dot (.) is a word boundary.
As mentioned in the question, the standard tokenizer provides grammar-based tokenization based on the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29.
The rule in http://unicode.org/reports/tr29/#Word_Boundaries is to not break when you have letter + dot + letter; see WB6 in the spec. So the tm.exe part of system32tm.exe is kept together (as is system.exe), while system32.exe is split.
The spec says that it always splits, except for the listed exceptions. Exceptions WB6 and WB7 say that it never splits on letter, then punctuation, then letter. Rules WB11 and WB12 say that it never splits on number, then punctuation, then number. However, there is no such rule for number, then punctuation, then letter, so the default rule applies and system32.exe gets split.
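Not part of the original answer, but the same UAX #29 rules can be checked directly with PyICU (assuming it is installed), which implements Unicode word segmentation:

from icu import BreakIterator, Locale

def word_segments(text):
    bi = BreakIterator.createWordInstance(Locale("en_US"))
    bi.setText(text)
    start = bi.first()
    for end in bi:                  # iterate over the boundary positions
        yield text[start:end]
        start = end

for sample in ("system.exe", "system32.exe", "system32tm.exe"):
    print(sample, "->", [s for s in word_segments(sample) if s != "."])

# expected:
# system.exe -> ['system.exe']            letter . letter: WB6/WB7 keep it together
# system32.exe -> ['system32', 'exe']     number . letter: no rule applies, so it breaks
# system32tm.exe -> ['system32tm.exe']    letter . letter again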

Programming idiom to parse a string in multiple passes

I'm working on a Braille translation library, and I need to translate a string of text into braille. I plan to do this in multiple passes, but I need a way to keep track of which parts of the string have been translated and which have not, so I don't retranslate them.
I could always create a class which would track the ranges of positions in the string which had been processed, and then design my search/replace algorithm to ignore them on subsequent passes, but I'm wondering if there isn't a more elegant way to accomplish the same thing.
I would imagine that multi-pass string translation isn't all that uncommon; I'm just not sure what the options are for doing it.
A more usual approach would be to tokenize your input, then work on the tokens. For example, start by tokenizing the string into a token for each character. Then, in a first pass generate a straightforward braille mapping, token by token. In subsequent passes, you can replace more of the tokens - for example, by replacing sequences of input tokens with a single output token.
Because your tokens are objects or structs, rather than simple characters, you can attach additional information to each - such as the source token(s) you translated (or rather, transliterated) the current token from.
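As a minimal sketch of that idea (my own arrangement of the passes, with made-up placeholder output instead of real braille cells):

from dataclasses import dataclass
from typing import Dict, List

@dataclass
class Token:
    text: str                # current output form
    source: str              # original characters this token covers
    translated: bool = False

def tokenize(s: str) -> List[Token]:
    # pass 0: one token per input character
    return [Token(c, c) for c in s]

def contraction_pass(tokens: List[Token], contractions: Dict[str, str]) -> List[Token]:
    # replace runs of untranslated tokens whose sources spell a known contraction
    out, i = [], 0
    while i < len(tokens):
        for src, cell in contractions.items():
            window = tokens[i:i + len(src)]
            if "".join(t.source for t in window) == src and not any(t.translated for t in window):
                out.append(Token(cell, src, True))
                i += len(src)
                break
        else:
            out.append(tokens[i])
            i += 1
    return out

def character_pass(tokens: List[Token], table: Dict[str, str]) -> List[Token]:
    # final pass: map whatever is still untranslated, character by character
    return [t if t.translated else Token(table.get(t.source, t.source), t.source, True)
            for t in tokens]

# tokens = character_pass(contraction_pass(tokenize("thing"), {"th": "<TH>", "ing": "<ING>"}),
#                         {c: f"<{c}>" for c in "abcdefghijklmnopqrstuvwxyz"})
# each Token still remembers its .source, so later passes never retranslate it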
Check out some basic compiler theory:
Lexical Analysis
Parsing/Syntax Analysis
