Check whether a string contains any word from a set of words and replace the words found - ruby

Here is some context. It's not exactly what I'm trying to do, but the mechanics are similar. I think solving this problem will help me get what I want.
I have an array with a set of bad words, let's say. I want to build a method that receives a string (someone's message) and filters out the bad words according to the previously defined set, replacing each bad word found with "[removed]".
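A minimal sketch of one way to do this, assuming whole-word, case-insensitive matching is what's wanted. The BAD_WORDS list and the censor method name are placeholders made up for illustration, not anything from the original code.

```ruby
# Hypothetical word list; replace with the real set of bad words.
BAD_WORDS = %w[darn heck].freeze

# Replace every whole-word, case-insensitive occurrence of a listed
# word with "[removed]". Regexp.union escapes each word, so entries
# containing regex metacharacters are matched literally.
def censor(message, bad_words = BAD_WORDS)
  pattern = /\b(?:#{Regexp.union(bad_words).source})\b/i
  message.gsub(pattern, "[removed]")
end

puts censor("Well, darn it. Heck no!")
```

The \b anchors keep the filter from touching longer words that merely contain a bad word (for example "heckle"). If substring matches should also be removed, drop the anchors.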

Related

Is there a way to remove a word from a KeyedVectors vocab?

I need to remove an invalid word from the vocab of a "gensim.models.keyedvectors.Word2VecKeyedVectors".
I tried to remove it using del model.vocab[word]. If I print model.vocab, the word has disappeared, but when I run model.most_similar with other words, the word that I deleted still appears among the similar words.
So how can I delete a word from model.vocab in a way that affects model.most_similar, so that it no longer returns it?
There's no existing method supporting the removal of individual words.
A quick-and-dirty workaround might be, when removing the vocab entry, to note the index of the existing vector (in the underlying large vector array) and change the string in the kv_model.index2entity list at that index to some placeholder value (say, '***DELETED***').
Then, after performing any most_similar(), discard any entries matching '***DELETED***'.
Refer to:
How to remove a word completely from a Word2Vec model in gensim?
Possible method 1: I solved it by editing the text model file itself.
Possible method 2: Refer to #zsozso's answer. (Though I didn't get it to work).

How to make pocketsphinx ignore words that are not present in the dictionary

I have a list of keywords that need to be spotted but some words are not "real words" (abracadabra, for example) and obviously they aren't in the dictionary.
My question is: how do I ignore them?
(pocketsphinx returns an ERROR and stops.) I read the manual for pocketsphinx_continuous but didn't find a suitable parameter.
Before using a word in a keyphrase, check whether it is in the dictionary with ps_lookup_word.

How do you check for a changing value within a string

I am doing some localization testing and I have to test for strings in both English and Japanese. The English string might be 'Waiting time is {0} minutes.' while the Japanese string might be '待ち時間は{0}分です。' where {0} is a number that can change over the course of a test. Both of these strings come from their respective property files. How can I check for the presence of the string as well as the number, which can change depending on the test that's running?
I should have added that I'm checking these strings on a web page, which is displayed in the relevant language depending on the location it is viewed from. I'm using watir to verify the text.
You can read elsewhere about various theories of the best way to do testing for proper language conversion.
One typical approach is to replace all hard-coded text matches in your code with constants, and then have a file that sets the constants, which can be updated based on the language in use. (I've seen that done by wrapping the require of that file in a case statement based on the language being tested.) Another approach is an array or hash for each value, indexed by a variable with a name like 'language', which lets the tests change the language on the fly. So validations would look something like this:
b.div(:id => "wait-time-message").text.should == WAIT_TIME_MESSAGE[language]
To match text where part is expected to change but falls within a predictable pattern, use a regular expression. I'd recommend a little reading about regular expressions in Ruby, especially Unicode regular expressions, as well as some experimenting with a tool like Rubular to test regexes.
In the case above a regex such as:
/Waiting time is \d+ minutes\./ or /待ち時間は\d+分です。/
would match the messages above and expect one or more digits in the middle. (Note that it would fail if no digits appear. If you want zero or more digits, you would need a * in place of the +.)
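As a rough sketch of the regex approach, the pattern can also be built from the property-file template itself rather than typed by hand. The TEMPLATES hash below is a hypothetical stand-in for whatever loads the property files, and template_to_regex and wait_minutes are names invented for this example.

```ruby
# Hypothetical stand-ins for the values loaded from the property files.
TEMPLATES = {
  en: "Waiting time is {0} minutes.",
  ja: "待ち時間は{0}分です。"
}.freeze

# Build a regex from a template: escape the literal parts and let the
# {0} placeholder match one or more digits, captured as group 1.
def template_to_regex(template)
  literal_parts = template.split("{0}", -1).map { |part| Regexp.escape(part) }
  Regexp.new(literal_parts.join('(\d+)'))
end

# Return the number shown in the message, or nil if the text does not match.
def wait_minutes(text, language)
  match = template_to_regex(TEMPLATES.fetch(language)).match(text)
  match && match[1].to_i
end
```

In a watir test, the text argument would be the text read from the page element, and the returned number can then be compared with the value the test expects.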
Don't check for the literal string. Check for some kind of intermediate form that can be used to render the final string.
Sometimes this is done by specifying a message and any placeholder data, like:
[ :waiting_time_in_minutes, 10 ]
Where that would render out as the appropriate localized text.
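A small sketch of that idea, assuming the localized templates live in a hash keyed by message name and language. MESSAGES and render are hypothetical names used only for illustration.

```ruby
# Hypothetical message catalog: message key => language => format string.
MESSAGES = {
  waiting_time_in_minutes: {
    en: "Waiting time is %d minutes.",
    ja: "待ち時間は%d分です。"
  }
}.freeze

# Render an intermediate form such as [:waiting_time_in_minutes, 10]
# into the final localized string.
def render(message, language)
  key, *args = message
  format(MESSAGES.fetch(key).fetch(language), *args)
end

puts render([:waiting_time_in_minutes, 10], :en)
```

A test can then assert on the intermediate form, or render the expected text for the language under test and compare it with what the page shows.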
An alternative is to treat one of the languages as a template, something that's more limited in flexibility but works most of the time. In that case you could use the English version as the string that's returned and use a helper to render it to the final page.

Parsing free format text in Cocoa

My Cocoa app needs to parse free format text entered via NSTextView. The result of the process should be a collection of keyword strings which can then be displayed for review to the user and optionally persisted using Core Data.
I looked at NSScanner but from the samples in Apple's documentation it looks like it's not capable of presenting a list of keyword strings from a given string. Its focus seems to be more on finding a particular occurrence of a given string within another string.
Are there alternatives?
EDIT: To make this clearer: all words in the entered text are potential keywords, so basically all words delimited by spaces should be considered. Let's assume that the user can specify a minimum required length for a string to be considered a keyword, to eliminate irrelevant words like "to", "of", "in" etc. Once the parsing is done, a list of parsed keywords should be presented (possibly using a table view). The user can then select or reject each keyword. Rejected keywords will be stored so the parsing can be made smarter as more texts are scanned.
You can absolutely use NSScanner to do this. All NSScanner does is go through a string character by character. It is up to you to decide what the keyword boundaries are and to interpret them using the scanner.
I suggest reading more about NSScanner in Apple's String Programming Guide.

How do you autocomplete names containing spaces?

I am working on implementing an autocompletion script in JavaScript. However, some of the names are two-word names with a space in the middle. What kind of algorithm can I use to deal with that? I am using a trie to store the names.
The only solution I could come up with was to say that two-word names cannot be used (either run them together or put a dash in the middle). Another idea was to create a list of these kinds of names and have a separate loop to check the input. The other, and possibly best, idea I have is to redesign it slightly and have categories for first and last names and then an extra name category. Is there a better solution out there?
Edit: I realized I wasn't very clear on what I was asking. My problem isn't adding two-word phrases to the trie, but returning them when someone is typing in a name. In the trie I split the first and last names so you can search by either. So if someone types in the first name and then a space, how would I tell whether they are typing in the rest of the first name or are now typing in the last name?
Why not have the trie also include the names with spaces?
Once you have a list of candidates, split each of them on the space and show the first token...
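To illustrate, here is a minimal trie sketch that stores whole names, spaces included, so a prefix ending in a space still narrows the candidates. It is written in Ruby rather than JavaScript, and the NameTrie class and its methods are invented for this example, not taken from the asker's code.

```ruby
# Minimal trie keyed by lowercase characters; a space is stored like
# any other character, so multi-word names need no special handling.
class NameTrie
  def initialize
    @root = {}
  end

  # Store the name; the original spelling is kept at the terminal node.
  def insert(name)
    node = @root
    name.downcase.each_char { |ch| node = (node[ch] ||= {}) }
    node[:end] = name
  end

  # Return all stored names starting with the prefix, sorted.
  def complete(prefix)
    node = @root
    prefix.downcase.each_char do |ch|
      node = node[ch]
      return [] unless node
    end
    collect(node).sort
  end

  private

  # Gather every name stored at or below the given node.
  def collect(node)
    node.flat_map do |key, child|
      key == :end ? [child] : collect(child)
    end
  end
end
```

With "Mary Ann Smith", "Mary Jones" and "Mark Twain" inserted, completing "mary " returns only the two names that continue after the space. To search by last name as well, the same name could also be inserted under a second key that starts with the last name.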
Is there a reason you are rolling your own autocomplete script, instead of using a currently existing one, such as YUI autocomplete? (i.e. are you doing it just for fun?, etc.)
If you have a way to parse the two-word names, then just include spaces in your trie. But if you cannot determine what is a two-word name and what is two separate words, and your trie cannot be large enough to hold all two-word sequences, then you have a problem.
One simple way to solve this is to default to allowing two-word pairs, but if you have too much branching after the space, throw away that entire branch. This way, when the first word is predictive for the second, you'll get autocompletion, but when it could be any of a huge number of things, your trie will end at the end of a single word.
If you are using a multiline editor, I guess the best choice for autocomplete items is a single word. So the first name, middle name and last name must be parsed and each added as a lookup item.
For a (one-line) textbox, you can include whitespace (and a first name + space + middle name + space + last name pattern) in the search criteria.

Resources