I am currently parsing a bunch of mails and want to get words and other interesting tokens out of them (even with spelling errors or combinations of letters and digits, like "zebra21" or "customer242"). But how can I know that "0013lCnUieIquYjSuIA" and "anr5Brru2lLngOiEAVk1BTjN" are not words and not relevant? How can I extract words and discard tokens that are encoding errors, parts of PGP signatures, or whatever else we get in mails and know we will never be interested in?
You need to decide on a good-enough criterion for a word and write a regular expression or a manual check to enforce it.
A few rules that can be extrapolated from your examples:
Words can start with a capital letter or be all capital letters, but if a word has more than, say, 2 uppercase letters and more than 2 lowercase letters inside it, it's not a word
If you have numbers inside the word, it's not a word
If it's longer than, say, 20 characters, it's not a word
There's no magic trick. You need to decide what you want the rules to be and make them happen.
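As an illustration, here is a minimal sketch in Python of how such hand-picked rules could be enforced. The thresholds (2 uppercase letters, 20 characters) are just the examples from above, and the digit rule is relaxed to allow a trailing run of digits so that tokens like "zebra21" survive; tune all of this to your data.

    import re

    def looks_like_word(token):
        # Rule: overly long tokens are probably not words.
        if len(token) > 20:
            return False
        # Rule: more than 2 uppercase AND more than 2 lowercase letters mixed
        # inside one token (e.g. "anr5Brru2lLngOiEAVk1BTjN") -> not a word.
        uppers = sum(1 for c in token if c.isupper())
        lowers = sum(1 for c in token if c.islower())
        if uppers > 2 and lowers > 2:
            return False
        # Allow plain letters plus an optional trailing run of digits
        # ("zebra21", "customer242"); reject everything else.
        return bool(re.fullmatch(r"[A-Za-z]+[0-9]*", token))

    tokens = ["zebra21", "customer242", "0013lCnUieIquYjSuIA", "anr5Brru2lLngOiEAVk1BTjN"]
    print([t for t in tokens if looks_like_word(t)])  # ['zebra21', 'customer242']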
An alternative way is to train some kind of Hidden Markov Model system to recognize things that sound like words, but I think this is overkill for what you want to do.
http://en.wikipedia.org/wiki/English_words_with_uncommon_properties
You can make rules that reject anything with these 'uncommon properties' to build a system that accepts most actual words.
Although I generally agree with shoosh's answer, his approach makes it easy to achieve high recall but also low precision, i.e. you would get almost all real words but also a lot of non-words. If your definition of a word is too restrictive, it's the other way around, but that's also not what you want, since then you would miss cases like 'zebra123'. So here are a few ideas about how to improve precision:
It may be worthwhile thinking about whether you can determine which parts of an email belong to the main text and which are footers like PGP signatures. I'm sure it's possible to find some simple heuristics that match most cases, e.g. cut off everything below a line which consists only of '-' characters.
Depending on your performance criteria, you may want to check whether a token is a real word or contains a real word by matching against a simple word list. It's easy to find quite exhaustive lists of English words on the web, and you could also compile one yourself by extracting words from a large and clean text corpus.
Using a lexical analyser, you could filter every token which is marked as unknown.
Some simple statistics may tell you how likely it is that something is a word. Tokens which occur with high frequency most probably are words. Tokens which appear only once or whose number is below a certain threshold very probably are not words. Common spelling errors should appear more than once and uncommon ones may be ignored.
Some of these suggestions clearly don't work for cases like 'zebra123'. Again, simply cutting off, or splitting on, in-word numbers may do the trick.
My general approach would be to first identify tokens which certainly are words (using the suggestions above), then identify tokens which certainly are not words (using a regular expression), and then look (with your eyes) at the few hundred or thousand remaining tokens to find common characteristics to handle these separately.
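To make that workflow concrete, here is a rough Python sketch combining a word list, splitting on in-word digits, and a frequency threshold. The file name words.txt and the threshold of 2 are placeholders for your own list and numbers.

    import re
    from collections import Counter

    # Placeholder: any reasonably exhaustive word list, one word per line.
    with open("words.txt") as f:
        known_words = {line.strip().lower() for line in f}

    def classify(tokens, min_count=2):
        """Split tokens into (certain words, certain non-words, needs manual review)."""
        counts = Counter(tokens)
        words, nonwords, review = [], [], []
        for token in set(tokens):
            # Split on in-word digits so 'zebra123' is judged by 'zebra'.
            parts = [p for p in re.split(r"[0-9]+", token.lower()) if p]
            if parts and all(p in known_words for p in parts):
                words.append(token)        # certainly a word (possibly with numbers)
            elif len(token) > 20 or counts[token] < min_count:
                nonwords.append(token)     # long or rare token, very probably noise
            else:
                review.append(token)       # look at these with your eyes
        return words, nonwords, review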
I met this problem in an interview. It is easy to implement a basic autocomplete system (https://www.futurice.com/blog/data-structures-for-fast-autocomplete/) to get a list of strings from a prefix string. Now we want to add some new features.
For example:
User input: lun pla, Output: lunch plan (multiple-word autocomplete)
User input: pla, Output: lunch plan
User input: unc, Output: lunch (autocomplete from part of a word)
How can I implement these features?
You can try the following (basic) approach, and I will later give suggestions for extensions:
load a dictionary of accepted words
build a BK-tree out of these words, using Damerau-Levenshtein distance as the underlying metric
split an input sequence on the whitespace character to get words
for each word, check whether it is an accepted word; if it isn't, find the nearest word (within an acceptable distance) in the BK-tree (a rough sketch follows)
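A minimal sketch of that approach in Python, using plain Levenshtein distance instead of Damerau-Levenshtein for brevity; the dictionary words and the distance limit are made up:

    def levenshtein(a, b):
        # Plain Levenshtein distance (no transpositions, for brevity).
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1,                 # deletion
                               cur[j - 1] + 1,              # insertion
                               prev[j - 1] + (ca != cb)))   # substitution
            prev = cur
        return prev[-1]

    class BKTree:
        def __init__(self, words):
            it = iter(words)
            self.root = (next(it), {})          # node = (word, {distance: child})
            for w in it:
                self.add(w)

        def add(self, word):
            node = self.root
            while True:
                d = levenshtein(word, node[0])
                if d == 0:
                    return
                if d in node[1]:
                    node = node[1][d]
                else:
                    node[1][d] = (word, {})
                    return

        def search(self, word, max_dist):
            # Return all dictionary words within max_dist of `word`.
            results, stack = [], [self.root]
            while stack:
                node_word, children = stack.pop()
                d = levenshtein(word, node_word)
                if d <= max_dist:
                    results.append((d, node_word))
                # Triangle inequality: only children whose edge distance lies in
                # [d - max_dist, d + max_dist] can contain matches.
                for dist, child in children.items():
                    if d - max_dist <= dist <= d + max_dist:
                        stack.append(child)
            return sorted(results)

    tree = BKTree(["lunch", "plan", "plane", "plant"])
    for token in "lnch plann".split():
        matches = tree.search(token, max_dist=2)
        print(token, "->", matches[0][1] if matches else token)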
Now for the improvements:
As you indicated, sometimes a match makes more sense when two words are grouped together
Use Google's word2phrase algorithm for this. You can find a C++ version here.
Use a more clever approach to finding word-boundaries. A stochastic method like HMM (Hidden Markov Model) might be useful (to avoid dates, times, abbreviations, etc being split)
Use a more intelligent error metric. You could take into account common misspellings, keyboard layout errors (there are very specific errors for people who are used to typing on QWERTY and are suddenly faced with AZERTY), etc.
Try to determine the word's part of speech (adjective, noun, verb, etc.). By doing this you can make much better completions.
I'm writing an algorithm to generate a random 6-character string (e.g. customer code XDEJQW). I want to ensure there are no offensive words or strings within it. I guess I have no choice but to have a database table of those bad words, right? It just seems icky that I'll have to have an add/edit page for someone to go to that contains some pretty awful words.
Thanks.
No need for a table; you can use either a string array or an enum for this purpose. The advantage is that you do not have to send a request to fetch the records of a bad-word table, which is better for performance. Basically, you can re-randomize the 6-character value until the result does not contain a bad word (see the sketch below).
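A minimal sketch of that idea in Python (the bad-word array here holds harmless placeholders):

    import random
    import string

    # Placeholder array; in practice this would hold the actual offensive words.
    BAD_WORDS = ["BAD", "WORSE", "AWFUL"]

    def random_code(length=6):
        while True:
            code = "".join(random.choices(string.ascii_uppercase, k=length))
            # Re-roll whenever any bad word appears anywhere inside the code.
            if not any(bad in code for bad in BAD_WORDS):
                return code

    print(random_code())  # e.g. 'XDEJQW'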
Depending on the purpose of the value, you can change the random process to make sure that no valid words are generated; if no valid words are generated, offensive strings won't be either. For example:
use only consonants
use only vowels
use 3 consecutive consonants and 3 consecutive vowels
etc..
The point is that, normally, words are made of syllables, and a syllable, to be pronounceable, needs a vowel, usually paired with one or two (maybe more) consonants before, after, or around it that act as a "modifier" of the sound: bi, ca, do, et, if, or, get, for, etc. If you can avoid these patterns, the probability of generating a word is low.
On the other hand, if you want to generate pronounceable passwords, you do exactly the opposite, alternating between consonants and vowels to produce syllables, e.g. "cidofe", but in that case you do have to validate against a list of "bad words".
But in either case, remember: if you are going to validate, don't just validate against full words; also try to filter out partial words, misspellings, or abbreviations, to avoid things like SUKMYDIK.
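For example, a consonant-only variant of the generator (Y left out because it can act as a vowel), still screened against a list of bad fragments as suggested; the fragment list is a placeholder:

    import random

    CONSONANTS = "BCDFGHJKLMNPQRSTVWXZ"    # no vowels, so no pronounceable syllables
    BAD_FRAGMENTS = ["KKK", "WTF"]         # placeholder partial words / abbreviations

    def consonant_code(length=6):
        while True:
            code = "".join(random.choice(CONSONANTS) for _ in range(length))
            # Even with consonants only, still screen for known bad fragments.
            if not any(frag in code for frag in BAD_FRAGMENTS):
                return code

    print(consonant_code())  # e.g. 'XKDWQT'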
I'm looking for a filter in elasticsearch that will let me break english compound words into their constituent parts, so for example for a term like eyewitness, eye witness and eyewitness as queries would both match eyewitness. I noticed the compound word filter, but this requires explicitly defining a word list, which I couldn't possibly come up with on my own.
First, you need to ask yourself if you really need to break the compound words. Consider a simpler approach like using "edge n-grams" to hit in the leading or trailing edges. It would have the side effect of loosely hitting on fragments like "ey", but maybe that would be acceptable for your situation.
If you do need to break the compounds, and want to explicitly index the word fragments, then you'll need to get a word list. You can download a list of English words; one example is here. The dictionary word list is used to know which fragments of the compound words are actually words themselves. This will add overhead to your indexing, so be sure to test it. An example showing the usage is here.
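Roughly what the index settings could look like, written here as a Python dict that you could send with a client such as elasticsearch-py; the analyzer and field names are made up, and analysis/english_words.txt stands for the downloaded word list placed under the Elasticsearch config directory:

    # e.g. es.indices.create(index="docs", body=settings) with the elasticsearch-py client
    settings = {
        "settings": {
            "analysis": {
                "filter": {
                    "english_decompounder": {
                        "type": "dictionary_decompounder",
                        # one English word per line, path relative to the ES config dir
                        "word_list_path": "analysis/english_words.txt",
                        "min_subword_size": 3
                    }
                },
                "analyzer": {
                    "decompound_analyzer": {
                        "tokenizer": "standard",
                        "filter": ["lowercase", "english_decompounder"]
                    }
                }
            }
        },
        "mappings": {
            "properties": {
                "body": {"type": "text", "analyzer": "decompound_analyzer"}
            }
        }
    }

With an analyzer like this, indexing "eyewitness" also indexes the fragments "eye" and "witness", so queries for either should match.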
If your text is German, consider https://github.com/jprante/elasticsearch-analysis-decompound
I was looking around some puzzles online to improve my knowledge on algorithms...
I came upon below question:
"You have a sentence with several words with spaces remove and words having their character order shuffled. You have a dictionary. Write an algorithm to produce the sentence back with spaces and words with normal character order."
I do not know a good way to solve this.
I am new to algorithms, but just looking at the problem, I think I would make the program do what an intelligent mind would do.
Here is something I can think of:
- First, manually find common short English words from the dictionary, like "is", "the", "if", etc., and put them in dataset-1.
- Then find the permutations of the words in dataset-1 (e.g. "si", "eht", "eth", "fi") and put them in dataset-2.
- Then find which character sequences in the input sentence match the words of dataset-2, put them in dataset-3, and insert spaces in the input sentence where they were found.
- For the rest of the words, I would generate permutations to find matching words in the dictionary.
I am a newbie to algorithms... is this a bad solution?
This seems like a perfectly fine solution.
In general there are two parameters for judging an algorithm:
correctness - does the algorithm provide the correct answer?
resources - the time or storage needed to provide an answer.
Usually there is a tradeoff between these two parameters.
So, for example, the size of your dictionary dictates which scrambled sentences you can reconstruct: a larger dictionary gives you a correct answer for more inputs, but the whole searching process takes longer and requires more storage.
The hard part of the problem you presented is the fact that you need to compute permutations, and there are a LOT of them.
So checking them all is expensive. A good approach would be to do what you suggested: create a small subset of commonly used words and check them first; that way the average case is better.
Note: just saying that you check the permutations/search is OK, but in the end you would need to specify the exact way of doing that. Currently, what you wrote is an idea for an algorithm, but it would not allow you to take a given input and mechanically work out the output.
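For instance, one mechanical way to avoid generating permutations at all is to compare letter multisets instead: sort the letters of every dictionary word once up front and look scrambled substrings up by their sorted form (a sketch, with a made-up mini-dictionary):

    from collections import defaultdict

    def build_index(dictionary_words):
        # Map sorted-letter signature -> dictionary words with exactly those letters.
        index = defaultdict(list)
        for word in dictionary_words:
            index["".join(sorted(word))].append(word)
        return index

    index = build_index(["is", "the", "if", "fig"])
    print(index["".join(sorted("eht"))])  # ['the'] - no permutations generated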
Actually, it might be wise to start by partitioning the dictionary by word length.
Then try to find the largest words that can be made using the letters available, instead of finding the smallest ones. Short words are more common and thus will be harder to narrow down, i.e. is it really "if", or is it "fig"?
Then for each word length w, you can proceed w characters at a time.
There are still a lot of possible combinations, though: simply because you found a valid word doesn't mean it's the right word. You'll have to go through all the substrings, of which there should be something like O(c^4 * d), where d is the number of words in the dictionary and c is the number of characters in the sentence. Practically speaking, if the dictionary is sorted by word length it'll be a fair bit less than that. Then you have to take the valid words and figure out an ordering that works, so that all characters are used. There might be multiple solutions.
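Putting the longest-words-first idea together with the sorted-letter lookup shown earlier, a rough backtracking sketch (the sample dictionary and input are made up):

    from collections import defaultdict

    def build_index(words):
        index = defaultdict(list)
        for w in words:
            index["".join(sorted(w))].append(w)
        return index

    def unscramble(s, index):
        # Return one list of words covering all of s, or None if none exists.
        if not s:
            return []
        for n in range(len(s), 0, -1):            # try the longest prefix first
            for word in index.get("".join(sorted(s[:n])), []):
                rest = unscramble(s[n:], index)
                if rest is not None:
                    return [word] + rest
        return None

    dictionary = ["this", "is", "a", "test"]
    # "this is a test" with spaces removed and each word's letters shuffled
    print(unscramble("histsiaestt", build_index(dictionary)))  # ['this', 'is', 'a', 'test']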
I am a rails newbie.
I am using the profanity_filter Ruby gem to filter foul words in my content application.
With profanity_filter, if there is a foul word, let's say "foulword", it returns "f******d".
If a user plays smart and types "foulwoord" or "foulwordd" or "foulllword" etc., it is not detected as a foul word.
Is there a way to make sure it detects these user-smart-foul-words?
Looking forward to your help!
Thank you!
How many foul words do you need to filter?
One approach would be to use something like Diff::LCS (from the diff-lcs gem) to check how many letters are different between the word being checked and each foul word. If you have a large number of foul words to check, this could be very slow. One thing you could do to make it much faster would be to include a dictionary of "good" words. Keep the "good" dictionary in a Set, and before checking each content word, first test whether it is in the dictionary. If so, you can move on. (If you want to make checking the dictionary very fast, keep it in a search trie.)
Further, if you check a word and find that it is OK, you could add it to the dictionary so you don't need to check the same word again. The danger here is that the dictionary may grow too large. If this is a problem, you could use something similar to a "least recently used" cache which, when the dictionary becomes too big, would discard "good" words which have not been seen recently.
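The question is about Ruby, but the idea is language-agnostic; here is a rough sketch in Python, with a hand-rolled edit distance playing the role that Diff::LCS would in Ruby (the word lists and the distance threshold of 1 are placeholders):

    def edit_distance(a, b):
        # Levenshtein distance: minimum number of single-letter edits between a and b.
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
            prev = cur
        return prev[-1]

    GOOD_WORDS = {"hello", "world"}      # placeholder "good" dictionary
    FOUL_WORDS = {"foulword"}            # placeholder foul-word list

    def looks_foul(word, max_distance=1):
        word = word.lower()
        if word in GOOD_WORDS:           # fast path: known-good words skip the check
            return False
        return any(edit_distance(word, foul) <= max_distance for foul in FOUL_WORDS)

    print(looks_foul("foulwoord"))  # True  - one letter inserted
    print(looks_foul("world"))      # False - in the good dictionary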
Another approach would be to generate variants on each foul word, and store them in a "bad" dictionary. If you generate each word which differs by 1 letter from a foul word, there would be about 200-500 for each foul word. You could also generate words which differ from a foul word only by changing the letter "o" to a zero, etc.
No matter what you do, you are never going to catch 100% of "bad" words without ever mistakenly flagging a "good" word. If you can get a filter which catches an acceptably high percentage of "bad" words, with an acceptably low rate of false positives, that will be "success".
If you are doing this for a web site, I suggest that rather than blocking content with "bad" words, you automatically flag it for moderator attention. If allowing obscene content to go up on the site even briefly is unacceptable, you could delay displaying flagged content until after a moderator has looked at it. This will avoid the Scunthorpe problem which @Blorgbeard mentioned in his comment.