how to handle spelling mistake(typos) in entity extraction in Rasa NLU? - rasa-nlu

I have few intents in my training set(nlu_data.md file) with sufficient amount of training examples under each intent.
Following is an example,
##intent: SEARCH_HOTEL
- find good [hotel](place) for me in Mumbai
I have added multiple sentences like this.
At the time of testing, all sentences in training file are working fine. But if any input query is having spelling mistake e.g, hotol/hetel/hotele for hotel keyword then Rasa NLU is unable to extract it as an entity.
I want to resolve this issue.
I am allowed to change only training data, also restricted not to write any custom component for this.

To handle spelling mistakes like this in entities, you should add these examples to your training data. So something like this:
##intent: SEARCH_HOTEL
- find good [hotel](place) for me in Mumbai
- looking for a [hotol](place) in Chennai
- [hetel](place) in Berlin please
Once you've added enough examples, the model should be able to generalise from the sentence structure.
If you're not using it already, it also makes sense to use the character-level CountVectorFeaturizer. That should be in the default pipeline described on this page already

One thing I would highly suggest you to use is to use look-up tables with fuzzywuzzy matching. If you have limited number of entities (like country names) look-up tables are quite fast, and fuzzy matching catches typos when that entity exists in your look-up table (searching for typo variations of those entities). There's a whole blogpost about it here: on Rasa.
There's a working implementation of fuzzy wuzzy as a custom component:
class FuzzyExtractor(Component):
name = "FuzzyExtractor"
provides = ["entities"]
requires = ["tokens"]
defaults = {}
language_list ["en"]
threshold = 90
def __init__(self, component_config=None, *args):
super(FuzzyExtractor, self).__init__(component_config)
def train(self, training_data, cfg, **kwargs):
pass
def process(self, message, **kwargs):
entities = list(message.get('entities'))
# Get file path of lookup table in json format
cur_path = os.path.dirname(__file__)
if os.name == 'nt':
partial_lookup_file_path = '..\\data\\lookup_master.json'
else:
partial_lookup_file_path = '../data/lookup_master.json'
lookup_file_path = os.path.join(cur_path, partial_lookup_file_path)
with open(lookup_file_path, 'r') as file:
lookup_data = json.load(file)['data']
tokens = message.get('tokens')
for token in tokens:
# STOP_WORDS is just a dictionary of stop words from NLTK
if token.text not in STOP_WORDS:
fuzzy_results = process.extract(
token.text,
lookup_data,
processor=lambda a: a['value']
if isinstance(a, dict) else a,
limit=10)
for result, confidence in fuzzy_results:
if confidence >= self.threshold:
entities.append({
"start": token.offset,
"end": token.end,
"value": token.text,
"fuzzy_value": result["value"],
"confidence": confidence,
"entity": result["entity"]
})
file.close()
message.set("entities", entities, add_to_output=True)
But I didn't implement it, it was implemented and validated here: Rasa forum
Then you will just pass it to your NLU pipeline in config.yml file.

Its a strange request that they ask you not to change the code or do custom components.
The approach you would have to take would be to use entity synonyms. A slight edit on a previous answer:
##intent: SEARCH_HOTEL
- find good [hotel](place) for me in Mumbai
- looking for a [hotol](place:hotel) in Chennai
- [hetel](place:hotel) in Berlin please
This way even if the user enters a typo, the correct entity will be extracted. If you want this to be foolproof, I do not recommend hand-editing the intents. Use some kind of automated tool for generating the training data. E.g. Generate misspelled words (typos)

First of all, add samples for the most common typos for your entities as advised here
Beyond this, you need a spellchecker.
I am not sure whether there is a single library that can be used in the pipeline, but if not you need to create a custom component. Otherwise, dealing with only training data is not feasible. You can't create samples for each typo.
Using Fuzzywuzzy is one of the ways, generally, it is slow and it doesn't solve all the issues.
Universal Encoder is another solution.
There should be more options for spell correction, but you will need to write code in any way.

Related

What's difference RobertaModel, RobertaSequenceClassification (hugging face)

I try to use hugging face transformers api.
As I import library , I have some questions. If anyone who know the answer, please tell me your knowledge.
transformers library have several models that are trained. transformers provide not only bare model like 'BertModel, RobertaModel, ... but also convenient heads like 'ModelForMultipleChoice' , 'ModelForSequenceClassification', 'ModelForTokenClassification' , ModelForQuestionAnswering.
I wonder what's difference between bare model adding new linear transformation myself and modelforsequenceclassification.
what's different custom model (pretrained model with random intialized linear) and transformers modelforsequenceclassification.
is ModelforSequenceClassification trained from glue data?
I look forward to someone's reply Thanks.
I think it's easiest to understand if we have a look at the actual implementation, where I randomly chose RobertaModel and RobertaForSequenceClassification as an example. However, the conclusion is valid for all other models, too.
You can find the implementation for RobertaForSequenceClassification here, which looks roughly like this:
class RobertaForSequenceClassification(RobertaPreTrainedModel):
authorized_missing_keys = [r"position_ids"]
def __init__(self, config):
super().__init__(config)
self.num_labels = config.num_labels
self.roberta = RobertaModel(config, add_pooling_layer=False)
self.classifier = RobertaClassificationHead(config)
self.init_weights()
[...]
def forward([...]):
[...]
As we can see, there is no indication about the pretraining here, and it simply adds another linear layer on top (the implementation of the RobertaClassificationHead can be found a bit further down, namely here):
class RobertaClassificationHead(nn.Module):
"""Head for sentence-level classification tasks."""
def __init__(self, config):
super().__init__()
self.dense = nn.Linear(config.hidden_size, config.hidden_size)
self.dropout = nn.Dropout(config.hidden_dropout_prob)
self.out_proj = nn.Linear(config.hidden_size, config.num_labels)
def forward(self, features, **kwargs):
x = features[:, 0, :] # take <s> token (equiv. to [CLS])
x = self.dropout(x)
x = self.dense(x)
x = torch.tanh(x)
x = self.dropout(x)
x = self.out_proj(x)
return x
So, to answer your question: These models come without any pretrained additional layers on top, and you could easily implement them yourself*.
Now for the asterisk: While it could be easy to wrap this yourself, also note that it is an inherited class RobertaPreTrainedModel. This has several advantages, the most important one being a consistent design between different implementations (sequence classification model, sequence tagging model, etc.). Further, there are some neat functionalities that they are providing, like the forward call including extensive parameters (padding, masking, attention output, ...), which would cost quite some time to implement.
Last but not least, there are existing trained models based on these specific implementations, which you can search for on the Huggingface Model Hub. There, you might find models that are fine-tuned on a sequence classification task (e.g., this one), and then directly load its weights in a RobertaForSequenceClassification model. If you had your own implementation of a sequence classification model, loading and aligning these pre-trained weights would be incredibly more complicated.
I hope this answers your main concern, but feel free to elaborate (either as comment or new question) on any points that have not been addressed!

Gensim most_similar() with Fasttext word vectors return useless/meaningless words

I'm using Gensim with Fasttext Word vectors for return similar words.
This is my code:
import gensim
model = gensim.models.KeyedVectors.load_word2vec_format('cc.it.300.vec')
words = model.most_similar(positive=['sole'],topn=10)
print(words)
This will return:
[('sole.', 0.6860659122467041), ('sole.Ma', 0.6750558614730835), ('sole.Il', 0.6727924942970276), ('sole.E', 0.6680260896682739), ('sole.A', 0.6419174075126648), ('sole.È', 0.6401025652885437), ('splende', 0.6336565613746643), ('sole.La', 0.6049465537071228), ('sole.I', 0.5922051668167114), ('sole.Un', 0.5904430150985718)]
The problem is that "sole" ("sun", in english) return a series of words with a dot in it (like sole., sole.Ma, ecc...). Where is the problem? Why most_similar return this meaningless word?
EDIT
I tried with english word vector and the word "sun" return this:
[('sunlight', 0.6970556974411011), ('sunshine', 0.6911839246749878), ('sun.', 0.6835992336273193), ('sun-', 0.6780728101730347), ('suns', 0.6730450391769409), ('moon', 0.6499731540679932), ('solar', 0.6437565088272095), ('rays', 0.6423950791358948), ('shade', 0.6366724371910095), ('sunrays', 0.6306195259094238)] 
Is it impossible to reproduce results like relatedwords.org?
Perhaps the bigger question is: why does the Facebook FastText cc.it.300.vec model include so many meaningless words? (I haven't noticed that before – is there any chance you've downloaded a peculiar model that has decorated words with extra analytical markup?)
To gain the unique benefits of FastText – including the ability to synthesize plausible (better-than-nothing) vectors for out-of-vocabulary words – you may not want to use the general load_word2vec_format() on the plain-text .vec file, but rather a Facebook-FastText specific load method on the .bin file. See:
https://radimrehurek.com/gensim/models/fasttext.html#gensim.models.fasttext.load_facebook_vectors
(I'm not sure that will help with these results, but if choosing to use FastText, you may be interesting it using it "fully".)
Finally, given the source of this training – common-crawl text from the open web, which may contain lots of typos/junk – these might be legimate word-like tokens, essentially typos of sole, that appear often enough in the training data to get word-vectors. (And because they really are typo-synonyms for 'sole', they're not necessarily bad results for all purposes, just for your desired purpose of only seeing "real-ish" words.)
You might find it helpful to try using the restrict_vocab argument of most_similar(), to only receive results from the leading (most-frequent) part of all known word-vectors. For example, to only get results from among the top 50000 words:
words = model.most_similar(positive=['sole'], topn=10, restrict_vocab=50000)
Picking the right value for restrict_vocab might help in practice to leave out long-tail 'junk' words, while still providing the real/common similar words you seek.

rasa__nlu how to train entity to capture dynamic values

am new to rasa_nlu am bulting a bot where it takes values that are dynamic i want the dynamic value to be captured by an entity
examples:
with notes please { bring your college marksheet }
with notes please { come prepared
where the string inside braces are dynamic values i need to capture them and make use of them. suggest me a way.
Thank you in advance
The Rasa component CRFEntityExtractor is able to generalize to different entity values. Just make sure you add enough training data, so that the additional random field can pick up the pattern.
The following example learns up to pick person names (you have to add more than these three examples, but this should give an idea).
## intent: greet_with_name
- Hi, I'm [Donald](name)
- Hello, it's [Steve](name)
- Hi, Im [Sam](name)
- ...

Algorithm in Ruby to trigger a method based on presence of certain texts

I don't know if it can be called an algorithm but i think its close.
I will be pulling data from an API that will have certain words in the title, eg:
Great Software 2.0 Download Now
Buy Great Software for just $10
Great Software Torrent Download
So, i want to do different things based on the presence of certain words such as Download, Buy etc. For eg, if it has the word 'buy' in it, i would like to extract the word buy and the amount value that is present in the title and show it in another div, so in this case it would be "Buy for $10" or "Buy $10" etc. I can do if/else as well but I don't want to use if else because there could be more such conditions in the future. So what i am thinking about is using the send method. eg:
def buy(string)
'Buy for just' + string.scan(/\$\d+/).first
end
def whichkeyword(title)
send (title.scan(/(download|buy)/i)[0][0]).downcase.to_sym, title
end
whichkeyword('Buy this software for $10 now')
is there a better way to do this? Or is this even a good way to do it? Any help would be appreciated
First of all, use send if and only you are to call private method, use public_send otherwise.
In this particular case metaprogramming is an overkill. It requires too much redundant code, plus it requires the code to be changed for new items. I would go with building a hash like:
#hash = { 'buy' => { text: 'Buy for just %{placeholder}', re: /\$\d+/ } }
This hash might be places somewhere outside of the code, e. g. it might be stored in yml file near the code and loaded in advance. That way you might be able to change a behaviour without modifying the code, that is handy for instance in gem.
As we have a hash defined/loaded, I would call the method:
def format string
key = string[/#{Regexp.union(#hash.keys).source}/i].downcase
puts #hash[key][:text] % { placeholder: string[#hash[key][:re]] }
end
Yielding:
▶ format("Buy this software for $10 now")
#⇒ Buy for just $10
There are many advantages over declaring methods, e. g. now matches might contain spaces, you might easily add/remove matchers etc.
First of all, your algorithm can work, but has some troubles in it, like what if no keyword is applied.
I have two solutions for you:
NLP
If you want to do it much more dynamic, you can use NLP - Natural language Processing. NLP will find main words in you sentence and then you can find the good solution for each.
A good gem for that is Treat that you can use with stanford-core-nlp. After processing the data you can find the verbs and even synonyms in the sentence and figure out what to do.
sentence('Buy this software for $10 now').verbs # ['buy']
Simple Hash
This solution is less dynamic, but much more simple. Like you did with the scan, just use Constant to manage your keywords, and the output from them(I would do it with lambdas). you can also add default to the hash
KEYWORDS = Hash.new('Default Title').merge(
buy: -> { },
download: -> { }
)
KEYWORDS[sentence[/(#{KEYWORDS.keys.join('|')})/i].downcase]
I think this solution is good enough.
The only thing that looks strange is scan(/(download|buy)/i)[0][0].
As for me I don't very much like using [] syntax in Ruby.
I think using scan here is not necessary.
What about
def whichkeyword(title)
title =~ /(download|buy)/i
send $1.downcase.to_sym, title unless $1.nil?
end
UPDATE
def whichkeyword(title)
action = title[/(download|buy)/i]
public_send action.downcase.to_sym, title if action
end

Sorting by counting the intersection of two lists in MongoDB

We have a posting analyzing requirement, that is, for a specific post, we need to return a list of posts which are mostly related to it, the logic is comparing the count of common tags in the posts. For example:
postA = {"author":"abc",
"title":"blah blah",
"tags":["japan","japanese style","england"],
}
there are may be other posts with tags like:
postB:["japan", "england"]
postC:["japan"]
postD:["joke"]
so basically, postB gets 2 counts, postC gets 1 counts when comparing to the tags in the postA. postD gets 0 and will not be included in the result.
My understanding for now is to use map/reduce to produce the result, I understand the basic usage of map/reduce, but I can't figure out a solution for this specific purpose.
Any help? Or is there a better way like custom sorting function to work it out? I'm currently using the pymongodb as I'm python developer.
You should create an index on tags:
db.posts.ensure_index([('tags', 1)])
and search for posts that share at least one tag with postA:
posts = list(db.posts.find({_id: {$ne: postA['_id']}, 'tags': {'$in': postA['tags']}}))
and finally, sort by intersection in Python:
key = lambda post: len(tag for tag in post['tags'] if tag in postA['tags'])
posts.sort(key=key, reverse=True)
Note that if postA shares at least one tag with a large number of other posts this won't perform well, because you'll send so much data from Mongo to your application; unfortunately there's no way to sort and limit by the size of the intersection using Mongo itself.

Resources