How to make pocketsphinx ignore words that are not present in the dictionary - pocketsphinx

I have a list of keywords that need to be spotted, but some of them are not "real words" (abracadabra, for example), so they aren't in the dictionary.
My question is: how do I ignore them?
(pocketsphinx returns an ERROR and stops.) I read the manual for pocketsphinx_continuous but didn't find a suitable parameter.

Before using a word in a keyphrase, check whether it is in the dictionary with ps_lookup_word.
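
The answer refers to ps_lookup_word from the pocketsphinx C API. As a rough sketch of the same idea done ahead of time, the Ruby below filters a keyphrase list against the dictionary text before it is handed to the decoder. It assumes a CMUdict-style dictionary (one `word PH1 PH2 ...` entry per line, alternates written as `word(2)`); the helper names `dictionary_words` and `known_keyphrases` and the sample entries are hypothetical.

```ruby
# Collect the words listed in a CMUdict-style pronunciation dictionary.
# Assumption: each line is "word PH1 PH2 ..." and alternates look like "word(2)".
def dictionary_words(lines)
  lines.map { |line| line.split.first }
       .compact
       .map { |word| word.sub(/\(\d+\)\z/, "").downcase }
       .uniq
end

# Keep only the keyphrases whose every word is in the dictionary.
def known_keyphrases(phrases, dict_words)
  phrases.select do |phrase|
    phrase.downcase.split.all? { |word| dict_words.include?(word) }
  end
end

# Made-up sample entries; a real run would read the dictionary file instead.
SAMPLE_DICT = [
  "hello HH AH L OW",
  "hello(2) HH EH L OW",
  "world W ER L D"
].freeze

words = dictionary_words(SAMPLE_DICT)
puts known_keyphrases(["hello world", "abracadabra", "hello"], words).inspect
```

Phrases that contain an unknown word are dropped from the list rather than passed on to trigger the error.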

Related

Check whether a string contains any word from a given set of words, then replace each word found

Here is some context. It's not exactly what I'm trying to do, but the mechanics are similar, and I think solving this problem will show me how to do what I want.
Let's say I have an array with a set of bad words. I want to build a method that receives a string (someone's message), filters out the bad words according to that previously defined set, and replaces each bad word found with "[removed]".
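
One possible shape for such a method, as a minimal sketch: build a single case-insensitive pattern from the word list and substitute every whole-word match. The method name `censor` and the placeholder word list are assumptions made for illustration.

```ruby
# Placeholder word list, purely for illustration.
BAD_WORDS = %w[darn heck].freeze

# Replace every whole-word occurrence of a bad word with "[removed]".
# Regexp.union escapes each word, so punctuation in the list is matched literally.
def censor(message, bad_words = BAD_WORDS)
  pattern = /\b(?:#{Regexp.union(bad_words).source})\b/i
  message.gsub(pattern, "[removed]")
end

puts censor("Well, darn it. Heck!")
```

The `\b` anchors keep longer words such as "darned" untouched; drop them if substrings should be removed as well.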

In Elasticsearch, how can I tokenize words separated by spaces and still match when typing without spaces?

Here is what I want to achieve:
My field value: "one two three"
I want to be able to match this field by typing: one or onetwo or onetwothree or onethree or twothree or two or three
For that, the tokenizer needs to produce these tokens:
one
onetwo
onetwothree
onethree
two
twothree
three
Do you know how I can implement this analyzer?
The same problem arises in German, where separate words are joined into one. For this purpose, Elasticsearch uses a technique called "compound words", with a dedicated token filter called the "compound word token filter". It tries to find sub-words from a given dictionary in a string. You only have to define the dictionary for your language. The full specification is at the link below.
https://www.elastic.co/guide/en/elasticsearch/reference/5.5/analysis-compound-word-tokenfilter.html
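
To make the target token list from the question concrete: it is every non-empty, in-order selection of the field's words, joined without spaces. The Ruby sketch below only enumerates those tokens; it is not an Elasticsearch analyzer, and the method name `joined_subsequences` is hypothetical.

```ruby
# Enumerate every non-empty in-order selection of the words, joined together.
# Each bit of the mask decides whether the word at that position is included.
def joined_subsequences(text)
  words = text.split
  (1...(1 << words.size)).map do |mask|
    words.each_index.select { |i| mask[i] == 1 }.map { |i| words[i] }.join
  end
end

puts joined_subsequences("one two three").sort.inspect
```

A field with n words yields 2^n - 1 tokens, so this approach only seems practical for short fields.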

How does Google recognise two words written without a space?

I want to understand how Google handles a missing space between two words. For example, take two words, word1 and word2. If I type 'word1word2' in the search box, it either asks whether I meant 'word1 word2' or simply searches for 'word1 word2'. Is there any information on what data structure and algorithm they use? The answer to "How to split text without spaces into list of words?" suggests using a trie data structure.
In the candidate-generation step of the spell corrector, you allow the omission of a space as a possibility, just as you allow the omission of other letters. Perhaps look at the spelling correction lecture here: http://nlp-class.org/ [sorry, self-promotion] or Peter Norvig's intro: http://norvig.com/spell-correct.html
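
A minimal sketch of that one idea (treating a dropped space as just another edit) could look like the Ruby below. The dictionary contents and the method name `split_candidates` are assumptions; a real spell corrector would also rank the candidates by probability, which is left out here.

```ruby
require "set"

# Generate "missing space" candidates: try a space at every position and keep
# the splits where both halves are dictionary words.
def split_candidates(query, dictionary)
  (1...query.length)
    .map { |i| [query[0...i], query[i..-1]] }
    .select { |left, right| dictionary.include?(left) && dictionary.include?(right) }
    .map { |pair| pair.join(" ") }
end

# Made-up dictionary for illustration.
dictionary = Set.new(%w[spell spelling check checker])
puts split_candidates("spellcheck", dictionary).inspect
```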
I assume you must have a script (using Ajax, for example: http://net.tutsplus.com/tutorials/javascript-ajax/adding-a-jquery-auto-complete-to-your-google-custom-search-engine/).
Basically, you check the words against a dictionary. The space must not be a requirement for recognising a word, just a possibility. For example, a (really) simple algorithm would be: for "severalwords", check the first 3 letters; nothing? Then check the first 4, and so on.
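
The prefix-checking idea can be sketched as follows. This version starts from the shortest prefix, recurses on the remainder, and backtracks when the remainder cannot be split; the memo hash only avoids re-checking the same suffix. The method name `segment` and the tiny dictionary are hypothetical.

```ruby
require "set"

# Split text into dictionary words by trying ever longer prefixes.
# Returns the first segmentation found, or nil if there is none.
def segment(text, dictionary, memo = {})
  return [] if text.empty?
  return memo[text] if memo.key?(text)

  memo[text] = nil
  (1..text.length).each do |len|
    prefix = text[0, len]
    next unless dictionary.include?(prefix)

    rest = segment(text[len..-1], dictionary, memo)
    if rest
      memo[text] = [prefix] + rest
      break
    end
  end
  memo[text]
end

# Made-up dictionary for illustration.
dictionary = Set.new(%w[seven several word words])
puts segment("severalwords", dictionary).inspect
```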
Here are some explanations of the Google search engine:
https://developers.google.com/search-appliance/documentation/60/admin_searchexp/ce_improving_search
This may help too:
http://tm.durusau.net/?cat=1106

What Ruby Regex code can I use for obtaining "out of sight" from the input "outofsight"?

I'm building an application that returns results based on a movie input from a user. If the user messes up and forgets to space out the title of the movie, is there a way I can still take the input and return the correct data? For example, "outofsight" should still be interpreted as "out of sight".
There is no regex that can do this in a good and reliable way. You could try a search server like Solr.
Alternatively, you could offer auto-complete in the GUI (if you have one) on the user's input, and this way mitigate some of the common errors users make.
Example:
User wants to search for "outofsight"
Starts typing "out"
Sees "out of sight" as suggestion
Selects "out of sight" from suggestions
????
PROFIT!!!
There's no regex that can tell you where the word breaks were supposed to be. For example, if the input is "offlight", is it supposed to return "Off Light" or "Of Flight"?
This is impossible without a dictionary and some kind of fuzzy-search algorithm. For the latter, see "How can I do fuzzy substring matching in Ruby?".
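
The ambiguity is easy to reproduce once a dictionary is involved. The sketch below lists every way to split a string into dictionary words; the four-word dictionary and the method name `all_segmentations` are made up for illustration.

```ruby
require "set"

# Return every way of splitting text into dictionary words (exhaustive search).
def all_segmentations(text, dictionary)
  return [[]] if text.empty?

  (1..text.length).flat_map do |len|
    prefix = text[0, len]
    next [] unless dictionary.include?(prefix)

    all_segmentations(text[len..-1], dictionary).map { |rest| [prefix] + rest }
  end
end

dictionary = Set.new(%w[of off flight light])
all_segmentations("offlight", dictionary).each { |words| puts words.join(" ") }
```

Both "of flight" and "off light" come back, so something beyond the dictionary (the list of known titles, for instance) has to choose between them.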
You could take a string and put \s* in between each character.
So outofsight would be converted to:
o\s*u\s*t\s*o\s*f\s*s\s*i\s*g\s*h\s*t
... and match out of sight.
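
In Ruby, that pattern could be built from the user's input roughly like this. It is a sketch: the method name `spaced_pattern` and the title list are assumptions, and the anchors and case-insensitivity are additions that the answer does not mention.

```ruby
# Build a pattern that allows optional whitespace between every character.
def spaced_pattern(input)
  body = input.delete(" ").chars.map { |c| Regexp.escape(c) }.join('\s*')
  Regexp.new("\\A#{body}\\z", Regexp::IGNORECASE)
end

# Made-up title list for illustration.
titles = ["Out of Sight", "Out of Africa", "Insight"]
pattern = spaced_pattern("outofsight")
puts titles.select { |title| pattern.match?(title) }.inspect
```

This only forgives missing or extra spaces; a misspelled letter still fails to match.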
You can't do this with regular expressions, unless you want to store one or more patterns to match for each movie record. That would be silly.
A better approach for catching minor misspellings would be to calculate Levenshtein distances between what the user is typing and your movie titles. However, when your list of movies is large, this will become a rather slow operation, so you're better off using a dedicated search engine like Lucene/Solr that excels at this sort of thing.
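
For reference, a plain Levenshtein distance is short to write in Ruby. The sketch below uses the standard two-row dynamic-programming form and picks the closest title with `min_by`; the title list is made up, and comparing against every title is exactly the linear scan that becomes slow on a large catalogue.

```ruby
# Levenshtein distance: minimum number of single-character insertions,
# deletions, or substitutions needed to turn a into b.
def levenshtein(a, b)
  prev = (0..b.length).to_a
  a.each_char.with_index(1) do |char_a, i|
    curr = [i]
    b.each_char.with_index(1) do |char_b, j|
      cost = char_a == char_b ? 0 : 1
      curr << [prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost].min
    end
    prev = curr
  end
  prev.last
end

# Made-up title list for illustration.
titles = ["out of sight", "out of africa", "insight"]
puts titles.min_by { |title| levenshtein("outofsight", title) }
```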

Matching keywords with sentence database, how to avoid duplicated keywords in results?

I'm very new to programming and am a beginner in Ruby. I've done a lot of searching to try to find the answers I need, but nothing seems to match what I'm looking for.
I need to make a program for work that will:
1. Get keywords from the user
2. Match those keywords with the same keywords in a database of sentences, and then
3. Spit out randomized sentences that:
   - contain all the keywords 1 time
   - do NOT contain keywords not listed
   - do NOT duplicate keywords
Important to know: Sentences all have a mix of several keywords, NOT one per sentence
1 & 2 are OK, I've been able to do those. My problem is with part 3. I've tried long lists of "if include?" parameters, but it never ends up working and I know there must be a better way to do this.
My grasp of Ruby (and programming generally) is basic and I don't really know what it can and can't do, so any tips or hints in what functions would be useful would be very very much appreciated.
If the match is found, why don't you consecutively pop it out of your array/db? It will ensure no duplication, since that record would not be present to be matched later. No?
Consider this snippet:
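# Example sentence "database" and the keywords to look for
db = ["It is hot today", "It is going to rain", "Where are you, sonny?", "sentence contains is and are"]
keyw = %w(is am are)
de = []  # indexes of the sentences matched by the current keyword
keyw.each do |word|
  db.each_index do |index|
    if db[index].include?(word)
      puts "Matched #{word} with #{db[index]}"
      de << index
    end
  end
  # Delete matched sentences, highest index first, so the remaining indexes stay valid
  db.delete_at(de.pop) until de.empty?
end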
db is an example database and keyw contains the keywords.
Corresponding output:
Matched is with It is hot today
Matched is with It is going to rain
Matched is with sentence contains is and are
Matched are with Where are you, sonny?
No duplication. :)
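
Part 3 of the question can be read in more than one way. Under one reading (every picked sentence may contain only requested keywords, and no keyword may appear in two picked sentences), a greedy sketch could look like this. It assumes the full keyword vocabulary of the database is known, matches whole words only, and uses hypothetical names (`keywords_in`, `pick_sentences`); being greedy, it does not guarantee that every requested keyword ends up covered.

```ruby
# Which vocabulary keywords occur in the sentence as whole words?
def keywords_in(sentence, vocabulary)
  words = sentence.downcase.scan(/[a-z']+/)
  vocabulary.select { |keyword| words.include?(keyword) }
end

# Walk the sentences in random order and keep a sentence only if all of its
# keywords were requested and none of them has been used by an earlier pick.
def pick_sentences(sentences, wanted, vocabulary)
  remaining = wanted.dup
  picked = []
  sentences.shuffle.each do |sentence|
    found = keywords_in(sentence, vocabulary)
    next if found.empty?
    next unless (found - remaining).empty?

    picked << sentence
    remaining -= found
  end
  picked
end

db = ["It is hot today", "Where are you, sonny?", "sentence contains is and are", "I am here"]
puts pick_sentences(db, %w[is are], %w[is am are]).inspect
```

Unlike `include?` on the raw string, the whole-word check does not count "is" inside a word such as "this".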
