I have created word embeddings (Word2vec) using gensim. Now I want to evaluate my word embeddings. For that I used the "evaluate_word_pairs" method from gensim with WordSim353. I got the following result:
Can someone explain to me how to interpret the results?
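For reference, evaluate_word_pairs returns three things: a Pearson (correlation, p-value) pair, a Spearman (correlation, p-value) pair, and the percentage of word pairs skipped because a word was out of vocabulary. Higher correlations mean the model's cosine similarities agree better with the human similarity judgements in WordSim-353; a large OOV percentage means many pairs were not evaluated at all. A minimal sketch, assuming a trained model saved under a made-up path and gensim's bundled copy of the dataset:

# hedged sketch: "my_word2vec.model" is a hypothetical path
from gensim.models import Word2Vec
from gensim.test.utils import datapath

model = Word2Vec.load("my_word2vec.model")
pearson, spearman, oov_ratio = model.wv.evaluate_word_pairs(datapath("wordsim353.tsv"))

print("Pearson r:", pearson[0], "p-value:", pearson[1])
print("Spearman rho:", spearman[0], "p-value:", spearman[1])
print("OOV pairs (%):", oov_ratio)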
I have a database with thousands of sentences (highlights from Kindle books), and some of them are sentence fragments (e.g. "You can have the nicest, most") which I am trying to filter out.
As per a definition I found, a sentence fragment is missing either its subject or its main verb.
I tried to find some kind of sentence-fragment detection algorithm, but without success.
But anyway, in the above example I can see the subject (You) and the verb (have), yet it still doesn't look like a full sentence to me.
I thought about restricting on length (e.g. excluding strings shorter than 30 characters), but I don't think it's a good idea.
Any suggestion on how you would do it?
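If it helps, here is a minimal sketch of the "missing subject or main verb" definition above, using spaCy's dependency parse (the choice of spaCy and the exact dependency labels checked are assumptions, not a complete solution). As noted, it will not catch the quoted example, because that fragment does contain a subject and a verb:

# hedged sketch of the "no subject or no main verb" heuristic, using spaCy
import spacy

nlp = spacy.load("en_core_web_sm")

def looks_like_fragment(text):
    doc = nlp(text)
    has_subject = any(tok.dep_ in ("nsubj", "nsubjpass") for tok in doc)
    has_main_verb = any(tok.dep_ == "ROOT" and tok.pos_ in ("VERB", "AUX") for tok in doc)
    return not (has_subject and has_main_verb)

print(looks_like_fragment("You can have the nicest, most"))  # likely False: subject and verb are present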
- regex: regex features for intent classification
  examples: |
    - \bon road pric/i
    - \bonroad pric/i
I have tested the above regexes and they work fine, so I am sure there is no issue with the regex expressions themselves.
Example:
training-row-1] Please tell me on road price now.
training-row-2] Please tell me price now.
Based on the above regex patterns, the regex features that should get added are:
training-row-1] Please tell me on road price now. ==> TRUE (because the regex matches)
training-row-2] Please tell me price now. ==> FALSE (the regex doesn't match)
My question is: in RegexFeaturizer, does the regex match happen on the whole sentence or on each token?
It makes sense to have it on the whole sentence.
Is the featurization I have assumed above correct or not?
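As a sanity check of that expectation, plain Python re applied to the whole sentence of each row behaves exactly as described (here the trailing /i in the patterns is treated as a case-insensitivity flag, which is an assumption about the intended syntax):

import re

patterns = [r"\bon road pric", r"\bonroad pric"]
rows = ["Please tell me on road price now.", "Please tell me price now."]

for row in rows:
    matched = any(re.search(p, row, flags=re.IGNORECASE) for p in patterns)
    print(row, "==>", matched)
# Please tell me on road price now. ==> True
# Please tell me price now. ==> False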
I've found the following docstring in the code for the RegexFeaturizer.
"""
Given a sentence, returns a vector of {1,0} values indicating which
regexes did match. Furthermore, if the message is tokenized, the
function will mark all tokens with a dict relating the name of the
regex to whether it was matched.
"""
So I think it's taking the entire sentence as input. It's hard to see inside the feature space in Rasa, but I've confirmed that the correct entity is picked up across tokens when using the RegexEntityExtractor. This is easily verified by temporarily adding entity examples to your NLU data (make sure each appears at least twice in your intents) and running rasa interactive.
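To make the docstring concrete, here is a rough sketch (not Rasa's actual implementation) of the behaviour it describes: one {1,0} value per regex computed against the whole sentence, plus a per-token flag for tokens that fall inside a match.

import re

regexes = {"on_road_price": r"\bon road pric|\bonroad pric"}
sentence = "Please tell me on road price now."
tokens = [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", sentence)]

sentence_vector = {}
token_flags = {}
for name, pattern in regexes.items():
    match = re.search(pattern, sentence, flags=re.IGNORECASE)
    sentence_vector[name] = 1 if match else 0
    for tok, start, end in tokens:
        # a token is marked if its span overlaps the span of the regex match
        inside = bool(match) and start < match.end() and end > match.start()
        token_flags.setdefault(tok, {})[name] = inside

print(sentence_vector)  # {'on_road_price': 1}
print(token_flags)      # 'on', 'road', 'price' marked True, the rest False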
As you may know, RoBERTa (BERT, etc.) has its own tokenizer, and sometimes you get pieces of a given word as tokens, e.g. embeddings » embed, #dings
Given the nature of the task I am working on, I need a single representation for each word. How do I get it?
Clarification:
sentence: "embeddings are good" --> 3 word tokens given
output: [embed, #dings, are, good] --> 4 tokens come out
When I give a sentence to pre-trained RoBERTa, I get encoded tokens. In the end I need a representation for each word. What's the solution? Summing the embed + #dings vectors point-wise?
I'm not sure if there is a standard practice, but what I've seen others do is simply take the average of the sub-token embeddings. Example: https://arxiv.org/abs/2006.01346, Section 2.3, line 4.
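A sketch of that averaging, assuming the Hugging Face transformers library with a fast tokenizer (so that word_ids() is available to map sub-tokens back to the words they came from):

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModel.from_pretrained("roberta-base")

enc = tokenizer("embeddings are good", return_tensors="pt")
with torch.no_grad():
    hidden = model(**enc).last_hidden_state[0]  # (num_sub_tokens, hidden_size)

# group sub-token vectors by originating word, then average each group
per_word = {}
for i, word_id in enumerate(enc.word_ids()):
    if word_id is None:  # special tokens <s> and </s>
        continue
    per_word.setdefault(word_id, []).append(hidden[i])

word_vectors = [torch.stack(vs).mean(dim=0) for _, vs in sorted(per_word.items())]
print(len(word_vectors))  # 3, one vector per original word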
I'm using Kibana to find all logs containing an exact match of the string #deprecated.
For a reason I don't understand, it matches strings containing the word "deprecated" without the # sign.
I tried escaping the # according to the Lucene documentation, i.e. message:"\\#deprecated" - with no change in results.
How can I write a query that matches the text #deprecated exactly, and only that?
Why is this happening?
Your problem isn't an issue with query syntax, which is what escaping is for; it's with analysis. Your analyzer removes punctuation because it's parsing the field as full text. It removes #, in much the same way that it removes periods and commas.
So, after analysis (assuming the standard analyzer) of something like "Class is #deprecated", the generated token stream will contain the tokens "class", "is", "deprecated" (and with stop-word filtering enabled, "is" would be dropped as well). The indexed forms of "#deprecated" and "deprecated" are identical, so it is impossible to write a query that differentiates between them as the field is currently indexed.
To fix this you would have to change your analyzer. WhitespaceAnalyzer may be a good choice, and should fix this issue. However, be careful you aren't doing more harm than good. If you use WhitespaceAnalyzer, you will have to contend with other punctuation as well, and a search for "sentence" would not find "match at the end of this sentence.", because of the period. So, if you are searching full text, this will certainly cause far more problems than it solves.
If you want to know the full rules of standard analysis, by the way, it's an implementation of UAX #29 word boundaries.
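To see the difference concretely, you can run both analyzers over the same text with the _analyze API (shown here via plain HTTP with requests against a local node, which is an assumption about your setup):

import requests

for analyzer in ("standard", "whitespace"):
    resp = requests.get(
        "http://localhost:9200/_analyze",
        json={"analyzer": analyzer, "text": "Class is #deprecated"},
    )
    print(analyzer, [t["token"] for t in resp.json()["tokens"]])

# standard   -> ['class', 'is', 'deprecated']      (# stripped, lowercased)
# whitespace -> ['Class', 'is', '#deprecated']     (# kept, case preserved)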
Here is what I want to achieve:
My field value : "one two three"
I want to be able to match this field by typing: one or onetwo or onetwothree or onethree or twothree or two or three
For that, the tokenizer needs to produce these tokens:
one
onetwo
onetwothree
onethree
two
twothree
three
Do you know how I can implement this analyzer?
There is the same problem in the German language, where different words are joined into one. For this purpose Elasticsearch uses a technique called "compound words". There is also a specific token filter called the "compound word token filter". It tries to find sub-words from a given dictionary inside a string; you only have to define the dictionary for your language. The whole specification is at the link below.
https://www.elastic.co/guide/en/elasticsearch/reference/5.5/analysis-compound-word-tokenfilter.html
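For completeness, a minimal sketch of an index that uses the dictionary decompounder described at that link (the index name, the dictionary contents, and the use of plain HTTP via requests are all assumptions for illustration):

import requests

settings = {
    "settings": {
        "analysis": {
            "filter": {
                "my_decompounder": {
                    "type": "dictionary_decompounder",
                    "word_list": ["one", "two", "three"],
                }
            },
            "analyzer": {
                "my_analyzer": {
                    "tokenizer": "whitespace",
                    "filter": ["lowercase", "my_decompounder"],
                }
            },
        }
    }
}
requests.put("http://localhost:9200/my_index", json=settings)

# "onetwothree" keeps the original token and also emits the dictionary
# sub-words found inside it: one, two, three
resp = requests.get(
    "http://localhost:9200/my_index/_analyze",
    json={"analyzer": "my_analyzer", "text": "onetwothree"},
)
print([t["token"] for t in resp.json()["tokens"]])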