XLM-RoBERTa token-ID relationship

I used the XLM-RoBERTa tokenizer in order to get the IDs for a bunch of sentences such as:
["loving is great", "This is another example"]
I see that the number of IDs returned does not always match the number of whitespace-separated tokens in my sentences: for example, the first sentence corresponds to [[0, 459, 6496, 83, 6782, 2]], with loving being 459 and 6496. After getting the matrix of word embeddings from the IDs, I was trying to identify only those word embeddings/vectors corresponding to some specific tokens: is there a way to do that? If the original tokens are sometimes assigned more than one ID and this cannot be predicted, I do not see how this is possible.
More in general, my task is to get word embeddings for some specific tokens within a sentence: my goal is therefore to use first the sentence so that word embeddings of single tokens can be calculated within the syntactic context, but then I would like to identify/keep the vectors of only some specific tokens and not those of all tokens in the sentence.

The mapping between tokens and IDs is unique; however, the text is segmented into subwords before you get the token (in this case, subword) IDs.
You can find out what string the IDs belong to:
import transformers
tok = transformers.AutoTokenizer.from_pretrained("xlm-roberta-base")
tok.convert_ids_to_tokens([459, 6496])
You will get: ['▁lo', 'ving'] which shows how the first word was actually pre-processed.
The preprocessing first splits the text on spaces and prepends the ▁ sign to every token that was preceded by a space. In the second step, it splits out-of-vocabulary tokens into subwords for which there are IDs in the vocabulary.
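If you need to keep only the vectors for some specific words, a fast tokenizer can tell you which subword positions belong to which whitespace-separated word. Here is a minimal sketch, assuming xlm-roberta-base loads its fast tokenizer and using the word_ids() helper of Hugging Face fast tokenizers:

import torch
import transformers

tok = transformers.AutoTokenizer.from_pretrained("xlm-roberta-base")
model = transformers.AutoModel.from_pretrained("xlm-roberta-base")

enc = tok("loving is great", return_tensors="pt")
# word_ids() maps each subword position to its word index,
# e.g. [None, 0, 0, 1, 2, None]; special tokens map to None.
word_ids = enc.word_ids()

with torch.no_grad():
    hidden = model(**enc).last_hidden_state[0]  # (seq_len, hidden_size)

# Keep only the subword vectors belonging to word 0 ("loving").
rows = [i for i, w in enumerate(word_ids) if w == 0]
loving_vectors = hidden[rows]  # one vector per subword of "loving"

Whether you then keep the individual subword vectors or pool them into one vector per word is up to you; see the next question for a common pooling choice.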

Related

How do you get single embedding vector for each word (token) from RoBERTa?

As you may know, RoBERTa (BERT, etc.) has its own tokenizer, and sometimes you get pieces of a given word as tokens, e.g. embeddings » embed, #dings
Given the nature of the task I am working on, I need a single representation for each word. How do I get it?
Clarification:
sentence: "embeddings are good" --> 3 words given
output: [embed, #dings, are, good] --> 4 tokens out
When I give a sentence to pre-trained RoBERTa, I get encoded tokens. In the end I need a representation for each word. What's the solution? Summing the embed and #dings vectors point-wise?
I'm not sure there is a standard practice, but what I have seen others do is simply take the average of the sub-token embeddings. Example: https://arxiv.org/abs/2006.01346, Section 2.3, line 4.
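A minimal sketch of that averaging, using the word_ids() mapping of a Hugging Face fast tokenizer (the model name is only an example):

import torch
import transformers

tok = transformers.AutoTokenizer.from_pretrained("roberta-base")
model = transformers.AutoModel.from_pretrained("roberta-base")

enc = tok("embeddings are good", return_tensors="pt")
with torch.no_grad():
    hidden = model(**enc).last_hidden_state[0]  # (seq_len, hidden_size)

# Average the subword vectors of each word into one word vector.
word_ids = enc.word_ids()  # subword position -> word index (None for specials)
word_vectors = []
for w in sorted(set(i for i in word_ids if i is not None)):
    rows = [i for i, x in enumerate(word_ids) if x == w]
    word_vectors.append(hidden[rows].mean(dim=0))
# len(word_vectors) == 3: one vector each for "embeddings", "are", "good"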

custom token filter for elasticsearch

I want to implement a custom token filter like this:
- single words are accepted if they match a specific (regex) pattern
- adjacent words are concatenated if one ends in a letter and the other begins with a digit (or vice versa)
This seems to map to:
step 1 - shingle - adjacent words joined together with a space
step 2 - if token matches pattern /pat1/, keep ... if token matches /pata patb/, replace the whitespace
step 3 - remove everything else.
Is there a way to achieve that? I have seen https://stackoverflow.com/questions/35742426/how-to-filter-tokens-based-on-a-regex-in-elasticsearch but don't feel like converting a complex pattern into one with lookahead.
The idea is to factor out potential order numbers from user input.
The data is assumed to be normalised, so an order number could be a regular ISBN 978<10_more_digits> or something like "ME4713P". Users might input "ME 4713P" or 978-<10_digits_and_some_dashes> instead.
Order numbers can be described as "contain both letters and digits, optional dashes", "contain letters, a dash, more letters", or "contain digits, a dash, more digits".
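One way to wire steps 1 and 2 together is a shingle filter followed by pattern_replace filters, sketched here with the Elasticsearch Python client. The cluster URL, index name, and exact patterns are assumptions, and step 3 would still need something like a predicate_token_filter to drop the non-matching tokens:

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # hypothetical cluster

settings = {
    "analysis": {
        "filter": {
            # step 1: join adjacent words with a space, keeping the originals
            "pairs": {
                "type": "shingle",
                "min_shingle_size": 2,
                "max_shingle_size": 2,
                "output_unigrams": True,
            },
            # step 2: drop the space where a letter meets a digit (and vice versa)
            "glue_letter_digit": {
                "type": "pattern_replace",
                "pattern": "([A-Za-z]) (\\d)",
                "replacement": "$1$2",
            },
            "glue_digit_letter": {
                "type": "pattern_replace",
                "pattern": "(\\d) ([A-Za-z])",
                "replacement": "$1$2",
            },
        },
        "analyzer": {
            "order_numbers": {
                "tokenizer": "whitespace",
                "filter": ["pairs", "glue_letter_digit", "glue_digit_letter"],
            }
        },
    }
}

es.indices.create(index="orders", settings=settings)
# "ME 4713P" now also yields the single token "ME4713P"
print(es.indices.analyze(index="orders", analyzer="order_numbers", text="ME 4713P"))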

In Elasticsearch, how can I tokenize words separated by space and be able to match by typing without space?

Here is what I want to achieve:
My field value : "one two three"
I want to be able to match this field by typing: one or onetwo or onetwothree or onethree or twothree or two or three
For that, the tokenizer needs to produce these tokens:
one
onetwo
onetwothree
onethree
two
twothree
three
Do you know how I can implement this analyzer?
The same problem exists in German, where different words are connected into one. For this purpose Elasticsearch uses a technique called "compound words": there is a specific compound word token filter that tries to find sub-words from a given dictionary in the string. You only have to define the dictionary for your language. The full specification is at the link below.
https://www.elastic.co/guide/en/elasticsearch/reference/5.5/analysis-compound-word-tokenfilter.html
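A sketch of that filter with the Python client, using a toy dictionary matched to the example field value (cluster URL, index name, and word list are assumptions):

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # hypothetical cluster

settings = {
    "analysis": {
        "filter": {
            "split_compounds": {
                "type": "dictionary_decompounder",
                "word_list": ["one", "two", "three"],
            }
        },
        "analyzer": {
            "compound_search": {
                # keep the typed input as one token, then find sub-words in it
                "tokenizer": "keyword",
                "filter": ["lowercase", "split_compounds"],
            }
        },
    }
}

es.indices.create(index="compound_demo", settings=settings)
# "onetwothree" is emitted as onetwothree, one, two, three
print(es.indices.analyze(index="compound_demo", analyzer="compound_search", text="onetwothree"))

Used as a search analyzer, a query like onetwo then decomposes into one and two, which match the field value "one two three".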

Ignore elements in cts:search

I have some XML documents with a structure like this:
<root>
<intro>...</intro>
...
<body>
<p>..................
some text CO<sub>2</sub>
.................. </p>
</body>
</root>
Now I want to search for the phrase CO2 and also get documents of the above type in the search results.
For this purpose, I am using this query:
cts:search(
  fn:collection("urn:iddn:collections:searchable"),
  cts:element-query(
    fn:QName("http://iddn.icis.com/ns/fields", "body"),
    cts:word-query(
      "CO2",
      ("case-insensitive", "diacritic-sensitive", "punctuation-insensitive",
       "whitespace-sensitive", "unstemmed", "unwildcarded", "lang=en"),
      1
    )
  ),
  ("unfiltered", "score-logtfidf"),
  0.0
)
But using this I am not able to get documents with CO<sub>2</sub>; I only get data with the plain phrase CO2.
If I change the search phrase to CO 2, then I get only documents with CO<sub>2</sub> and not those with CO2.
I want to get combined results for both CO<sub>2</sub> and CO2.
So can I ignore <sub> by any means, or is there any other way to handle this problem?
The issue here is tokenization. "CO2" is a single word token. CO<sub>2</sub>, even with phrase-through, is a phrase of two word tokens: "CO" and "2". Just as "blackbird" does not match "black bird", so too does "CO2" not match "CO 2". The phrase-through setting just means that we're willing to look for a phrase that crosses the <sub> element boundary.
You can't splice together CO<sub>2</sub> into one token, but you might be able to use customized tokenization overrides to break "CO2" into two tokens. Define a field and define overrides for the digits as 'symbol'. This will make each digit its own token and will break "CO2" into two tokens in the context of that field. You'd then need to replace the word-query with a field-word-query.
You probably don't want this to apply everywhere in a document, so you'd be best off adding markup around these kinds of chemical phrases in your documents. Fields in general and tokenization overrides in particular come at a performance cost. The contents of a field are indexed completely separately, so the index is bigger, and the tokenization overrides mean that we have to retokenize as well, both on ingest and at query time. This will slow things down a little (not a lot).
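A minimal sketch of how the query might look after that change, staying in XQuery; the field name "chem" and its override configuration are hypothetical and would have to be set up in the database first:

cts:search(
  fn:collection("urn:iddn:collections:searchable"),
  cts:field-word-query(
    "chem",
    "CO2",
    ("case-insensitive", "unstemmed", "unwildcarded", "lang=en"),
    1
  ),
  ("unfiltered", "score-logtfidf"),
  0.0
)

With the digits overridden as 'symbol', "CO2" tokenizes as "CO" + "2" on both the index and query side of the field, so it can also match the CO<sub>2</sub> form (given phrase-through on <sub>).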
It appears that you want to add a phrase-through configuration.
Example:
<p>to <b>be</b> or not to be</p>
A phrase-through on <b> would then be indexed as "to be or not to be"

How to use regular expression in fetching data from graphite?

I want to fetch data from different counters from Graphite in one single request, like:
summarize(site.testing_server_2.triggers_unknown.count,'1hour','sum')&format=json
summarize(site.testing_server_2.requests_failed.count,'1hour','sum')&format=json
summarize(site.testing_server_2.core_network_bad_soap.count,'1hour','sum')&format=json
and so on.. 20 more.
But I don't want to fetch
summarize(site.testing_server_2.module_xyz_abc.count,'1hour','sum')&format=json
in that request. How can I do that?
This is what I tried:
summarize(site.testing_server_2.*.count,'1hour','sum')&format=json&from=-24hour
It gets JSON data for 'module_xyz_abc' too, but I don't want that.
You can't use regular expressions per se, but you can use some similar (in concept and somewhat in format) matching techniques available within the Graphite Render URL API. There are a few ways you can "match" within a target's "bucket" (i.e. between the dots).
Target Matching
Asterisk * match
The asterisk can be used to match any (zero or more) characters. It can replace the entire bucket (site.*.test) or part of it (site.w*t.test). Here is an example:
site.testing_server_2.requests_*.count
This would match site.testing_server_2.requests_failed.count, site.testing_server_2.requests_success.count, site.testing_server_2.requests_blah123.count, and so forth.
Character range [a-z0-9] match
The character range match is used to match on a single character (site.w[0-9]t.test) in the target's bucket and is specified as a range or list. For example:
site.testing_server_[0-4].requests_failed.count
This would match on site.testing_server_0.requests_failed.count, site.testing_server_1.requests_failed.count, site.testing_server_2.requests_failed.count, and so forth.
Value list (group capture) {blah, test, ...} match
The value list match can be used to match anything in the list of values, in the specified portion of the target's bucket.
site.testing_server_2.{triggers_unknown,requests_failed,core_network_bad_soap}.count
This would match site.testing_server_2.triggers_unknown.count, site.testing_server_2.requests_failed.count, and site.testing_server_2.core_network_bad_soap.count. But nothing else, so site.testing_server_2.module_xyz_abc.count would not match.
Answer
Without knowing all of your bucket values it is difficult to be surgical with the approach (perhaps with a combination of the matching options), so I'll recommend just going with a value list match. This should allow you to get all of the values in one (somewhat long) request. For example (and keep in mind you'd need to include all of your values):
summarize(site.testing_server_2.{triggers_unknown,requests_failed,core_network_bad_soap}.count,'1hour','sum')&format=json&from=-24hour
For more, see Graphite Paths and Wildcards
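For completeness, the same request issued from Python with requests (the Graphite host is hypothetical, and the value list would need to include all of your counters):

import requests

GRAPHITE = "http://graphite.example.com/render"  # hypothetical host
counters = "{triggers_unknown,requests_failed,core_network_bad_soap}"
target = "summarize(site.testing_server_2." + counters + ".count,'1hour','sum')"

resp = requests.get(GRAPHITE, params={"target": target, "format": "json", "from": "-24hour"})
resp.raise_for_status()
for series in resp.json():
    print(series["target"], len(series["datapoints"]), "datapoints")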
