Algorithm to estimate word's complexity - algorithm

I need to estimate a complexity of words for typists.
For example "suffer" is easy than "people" becouse "o" and "p" is harder than "e" and "r".
Any key pressed by little finger is more hard to hit than by index finger.
And move finger from basic position is more harder than do not move.
And use shift key also add hardness.
What approach can be imlemented in this case?

I would check out the Carpalx website. The site details how they rate different keyboard layouts for typists, and already has some open-source software that implements their algorithms for any given keyboard layout. (Make sure to check out the typing effort, model parameters and keyboard evaluation sections.)

A simple approach:
Score each letter based off the finger positions. Add a modifier or multiplier for shift. Maybe add a reduction for repeated letters?
Take the word, add up the scores, and you should have one approach.
Test, modify scores as needed, repeat until you have a meaningful distribution.

You could hold a 2-D representation of your keyboard in an array as a sort of a connected graph, with the key name and coordonates as nodes, and 2 coordonates to where your hands are hovering over (around F and J?), then for each input key you calculate the distances from the key in the graph to your 2 "hover keys", take the minimum, add in a shift (caps) penalty and output a (possibly weighted) score.

You may want to try a very basic approach by giving a value to each key that may be typed, and simply add the value of all keys of the word you want to evaluate...

Purely based on distance between keys, calculate the distance using a table of keys, and add up distances.

As a base, you can give each key a difficulty rating and then just add the ratings.
Then you probably want to spot difficult finger patterns. For example the combination "sle" is more difficult than "sfe", because the former is a left-right-left combination. It's more difficult for the brain to coordinate between left and right hand as they are connected to each half of the brain, than it is to coordinate fingers on the same hand. It's common to press the keys in the wrong order in such combinations.
How common a word is also has an effect on the difficulty. More common words are typed more often so the brain learns the patterns. Also when a word contains a common word it's easier to type, like "hand" as it contains "and". On the other hand, words that contain only a part of a more common word becomes harder, as the brain wants to follow the more common pattern.

Typing difficulty is very subjective - it's akin to learning to play a musical instrument, so a word that is very difficult for one person to type is a piece of cake for another. Take for instance someone who's never sat at a keyboard before and ask them to type the word "Microsoft"... they'll hunt and peck and it'll probably take them a few seconds to type it. Take your average programmer that types this word a few dozen times a day and they'll rattle it off in less than a second.
On the other hand, take the first person and have them type the word "Microscope" and they'll likely take a very similar amount of time as they did to type "Microsoft", but the programmer will get to the "s" and either finish the word as "Microsoft" before deleting and replacing the characters with the correct ones, or they'll noticeably slow as they hit the "s" when their fingers don't immediately know the pattern for "Microscope" - in fact, I just had to type it 3 times because my fingers automatically finish the pattern for me without me thinking about it.
As far as word complexity goes therefore, it's not quite as straightforward as calculating the distance from the home keys, it comes down to the background environment of the typist, their usual typing speed, their literacy with the keyboard and a host of other things.

Instead of guessing, measure it.
Come up with a list of 100 words and ask a handful of people to type them in. Measure the amount of time between each keystroke. For every pair of letters, accumulate the total time taken by the user to move from the first to the second and divide by the number of times that letter pair appears to get an average, which is an actual direct estimate of the difficulty of moving between those two keys.
Of course, there will be some pairs of letters that don't appear anywhere in your words (e.g. ZQ). But those letter pairs will probably be irrelevant to your work anyway, unless you need to score random sequences of letters.
You'll also need to somehow account for mistyped letters. You could either discard these outright, or use mistyped letters to add some sort of penalty to that letter pair (reflecting the fact that mistyping one of the letters indicates that this letter pair may be difficult to type).

Related

word2vec window size at sentence boundaries

I am using word2vec (and doc2vec) to get embeddings for sentences, but i want to completely ignore word order.
I am currently using gensim, but can use other packages if necessary.
As an example, my text looks like this:
[
['apple', 'banana','carrot','dates', 'elderberry', ..., 'zucchini'],
['aluminium', 'brass','copper', ..., 'zinc'],
...
]
I intentionally want 'apple' to be considered as close to 'zucchini' as it is to 'banana' so I have set the window size to a very large number, say 1000.
I am aware of 2 problems that may arise with this.
Problem 1:
The window might roll in at the start of a sentence creating the following training pairs:
('apple', ('banana')), ('apple', ('banana', 'carrot')), ('apple', ('banana', 'carrot', 'date')) before it eventually gets to the correct ('apple', ('banana','carrot', ..., 'zucchini')).
This would seem to have the effect of making 'apple' closer to 'banana' than 'zucchini',
since their are so many more pairs containing 'apple' and 'banana' than there are pairs containing 'apple' and 'zucchini'.
Problem 2:
I heard that pairs are sampled with inverse proportion to the distance from the target word to the context word- This also causes an issue making nearby words more seem more connected than I want them to be.
Is there a way around problems 1 and 2?
Should I be using cbow as opposed to sgns? Are there any other hyperparameters that I should be aware of?
What is the best way to go about removing/ignoring the order in this case?
Thank you
I'm not sure what you mean by "Problem 1" - there's no "roll" or "wraparound" in the usual interpretation of a word2vec-style algorithm's window parameter. So I wouldn't worry about this.
Regarding "Problem 2", this factor can be essentially made negligible by the choice of a giant window value – say for example, a value one million times larger than your largest sentence. Then, any difference in how the algorithm treats the nearest-word and the 2nd-nearest-word is vanishingly tiny.
(More specifically, the way the gensim implementation – which copies the original Google word2vec.c in this respect – achieves a sort of distance-based weighting is actually via random dynamic shrinking of the actual window used. That is, for each visit during training to each target word, the effective window truly used is some random number from 1 to the user-specified window. By effectively using smaller windows much of the time, the nearer words have more influence – just without the cost of performing other scaling on the whole window's words every time. But in your case, with a giant window value, it will be incredibly rare for the effective-window to ever be smaller than your actual sentences. Thus every word will be included, equally, almost every time.)
All these considerations would be the same using SG or CBOW mode.
I believe a million-times-larger window will be adequate for your needs, for if for some reason it wasn't, another way to essentially cancel-out any nearness effects could be to ensure your corpus's items individual word-orders are re-shuffled between each time they're accessed as training data. That ensures any nearness advantages will be mixed evenly across all words – especially if each sentence is trained on many times. (In a large-enough corpus, perhaps even just a 1-time shuffle of each sentence would be enough. Then, over all examples of co-occurring words, the word co-occurrences would be sampled in the right proportions even with small windows.)
Other tips:
If your training data starts in some arranged order that clumps words/topics together, it can be beneficial to shuffle them into a random order instead. (It's better if the full variety of the data is interleaved, rather than presented in runs of many similar examples.)
When your data isn't true natural-language data (with its usual distributions & ordering significance), it may be worth it to search further from the usual defaults to find optimal metaparameters. This goes for negative, sample, & especially ns_exponent. (One paper has suggested the optimal ns_exponent for training vectors for recommendation-systems is far different from the usual 0.75 default for natural-language modeling.)

Constant-time Spelling Correction on Ten Million Entities

I have a list of ~10M entities. I need to match an entity that a user types out with an entity from the list. Users often misspell the entities (ie. orang instead of orange). I need to correct 1-2 instances of letter replacement (aca instead of aba), letter insertion (aca instead of ac), and letter deletion (aca instead of acca). I want to do this in constant time with respect to the size of the entity list.
Making a dictionary of all possible spellings that are off by 1-2 letters would be constant time but requires an intractably large amount of memory. Edit distance is linear in time with respect to the size of the entity list. I'm thinking there is probably a clever algorithm to prune down the candidate matches to <100 (maybe via a clever hash of the letters in the entity). Then I could run edit distance on the small set of candidates.
Does anyone know of a technique that will work here?
In addition to the linked document in Matt's comment (suggesting to only generate/compare/search via deletions), you can try using a DAWG aka MADFA aka DAFSA to store all possible distance=2 words. For example, for Python there's pyDAWG. Not sure if the space savings will be sufficient to meet your needs, as that depends on language, but if you affixes are similar, it could be quite significant: Each substitution/deletion is just an extra arc, and each insertion is only one more node.

Algorithm to reform a sentence from sentence whose spaces are removed and alphabets of words are reordered?

I was looking around some puzzles online to improve my knowledge on algorithms...
I came upon below question:
"You have a sentence with several words with spaces remove and words having their character order shuffled. You have a dictionary. Write an algorithm to produce the sentence back with spaces and words with normal character order."
I do not know what is good way to solve this.
I am new to algorithms but just looking at problem I think I would make program do what an intellectual mind would do.
Here is something I can think of:
-First find out manually common short english words from dictionary like "is" "the" "if" etc and put in dataset-1.
-Then find out permutation of words in dataset1 (eg "si", "eht" or "eth" or "fi") and put in dataset-2
-then find out from input sentense what character sequence matches the words of dataset2 and put them in dataset-3 and insert space in input sentence instead of those found.
-for rest of the words i would perform permutations to find out word from dictionary.
I am newbie to algorithms...is it a bad solution?
this seems like a perfectly fine solution,
In general there are 2 parameters for judging an algorithm.
correctness - does the algorithm provide the correct answer.
resources - the time or storage size needed to provide an answer.
usually there is a tradeoff between these two parameters.
so for example the size of your dictionary dictates what scrambled sentences you may
reconstruct, giving you a correct answer for more inputs,
however the whole searching process would take longer and would require more storage.
The hard part of the problem you presented is the fact that you need to compute permutations, and there are a LOT of them.
so checking them all is expensive, a good approach would be to do what you suggested, create a small subset of commonly used words and check them first, that way the average case is better.
note: just saying that you check the permutation/search is ok, but in the end you would need to specify the exact way of doing that.
currently what you wrote is an idea for an algorithm but it would not allow you to take a given input and mechanically work out the output.
Actually, might be wise to start by partitioning the dictionary by word length.
Then try to find the largest words that can be made using the letters avaliable, instead of finding the smallest ones. Short words are more common and thus will be harder to narrow down. IE: is it really "If" or "fig".
Then for each word length w, you can proceed w characters at a time.
There are still a lot of possible combinations though, simply because you found a valid word, doesn't mean it's the right word. Once you've gone through all the substrings, of which there should be something like O(c^4*d) where d is the number of words in the dictionary and c is the number of characters in the sentence. Practically speaking if the dictionary is sorted by word length, it'll be a fair bit less than that. Then you have to take the valid words, and figure out an ordering that works, so that all characters are used. There might be multiple solutions.

Creating a suggested words algorithm

I'm designing a cool spell checker (I know I know, modern browsers already have this), anyway, I am wondering what kind of effort would it take to develop a fairly simple but decent suggest-word algorithm.
My idea is that I would first look through the misspelled word's characters and count the amount of characters it matches in each word in the dictionary (sounds resources intensive), and then pick the top 5 matches (so if the misspelled word matches the most characters with 7 words from the dictionary, it will randomly display 5 of those words as suggested spelling).
Obviously to get more advanced, we would look at "common words" and have a dictionary file that is numbered with 'frequency of that word used in English language' ranking. I think that's taking it a bit overboard maybe.
What do you think? Anyone have ideas for this?
First of all you will have to consider the complexity in finding the "nearer" words to the misspelled word. I see that you are using a dictionary, a hash table perhaps. But this may not be enough. The best and cooler solution here is to go for a TRIE datastructure. The complexity of finding these so called nearer words will take linear order timing and it is very easy to exhaust the tree.
A small example
Take the word "njce". This is a level 1 example where one word is clearly misspelled. The obvious suggestion expected would be nice. The first step is very obvious to see whether this word is present in the dictionary. Using the search function of a TRIE, this could be done O(1) time, similar to a dictionary. The cooler part is finding the suggestions. You would obviously have to exhaust all the words that start with 'a' to 'z' that has words like ajce bjce cjce upto zjce. Now to find the occurences of this type is again linear depending on the character count. You should not carried away by multiplying this number with 26 the length of words. Since TRIE immediately diminishes as the length grows. Coming back to the problem. Once that search is done for which no result was found, you go the next character. Now you would be searching for nace nbce ncce upto nzce. In fact you wont have explore all the combinations as the TRIE data structure by itself will not be having the intermediate characters. Perhaps it will have na ni ne no nu characters and the search space becomes insanely simple. So are the further occurrences. You could develop on this concept further based on second and third order matches. Hope this helped.
I'm not sure how much of the wheel you're trying to reinvent, so you may want to check out Lucene.
Apache Lucene Core™ (formerly named Lucene Java), our flagship sub-project, provides a Java-based indexing and search implementation, as well as spellchecking, hit highlighting and advanced analysis/tokenization capabilities.

Algorithm wanted: Find all words of a dictionary that are similar to words in a free text

We have a list of about 150,000 words, and when the user enters a free text, the system should present a list of words from the dictionary, that are very close to words in the free text.
For instance, the user enters: "I would like to buy legoe toys in Walmart". If the dictionary contains "Lego", "Car" and "Walmart", the system should present "Lego" and "Walmart" in the list. "Walmart" is obvious because it is identical to a word in the sentence, but "Lego" is similar enough to "Legoe" to be mentioned, too. However, nothing is similar to "Car", so that word is not shown.
Showing the list should be realtime, meaning that when the user has entered the sentence, the list of words must be present on the screen. Does anybody know a good algorithm for this?
The dictionary actually contains concepts which may include a space. For instance, "Lego spaceship". The perfect solution recognizes these multi-word concepts, too.
Any suggestions are appreciated.
Take a look at http://norvig.com/spell-correct.html for a simple algorithm. The article uses Python, but there are links to implementations in other languages at the end.
You will be doing quite a few lookups of words against a fixed dictionary. Therefore you need to prepare your dictionary. Logically, you can quickly eliminate candidates that are "just too different".
For instance, the words car and dissimilar may share a suffix, but they're obviously not misspellings of each other. Now why is that so obvious to us humans? For starters, the length is entirely different. That's an immediate disqualification (but with one exception - below). So, your dictionary should be sorted by word length. Match your input word with words of similar length. For short words that means +/- 1 character; longer words should have a higher margin (exactly how well can your demographic spell?)
Once you've restricted yourself to candidate words of similar length, you'd want to strip out words that are entirely dissimilar. With this I mean that they use entirely different letters. This is easiest to compare if you sort the letters in a word alphabetically. E.g. car becomes "acr"; rack becomes "ackr". You'll do this in preprocessing for your dictionary, and for each input word. The reason is that it's cheap to determine the (size of an) difference of two sorted sets. (Add a comment if you need explanation). car and rack have an difference of size 1, car and hat have a difference of size 2. This narrows down your set of candidates even further. Note that for longer words, you can bail out early when you've found too many differences. E.g. dissimilar and biography have a total difference of 13, but considering the length (8/9) you can probably bail out once you've found 5 differences.
This leaves you with a set of candidate words that use almost the same letters, and also are almost the same length. At this point you can start using more refined algorithms; you don't need to run 150.000 comparisons per input word anymore.
Now, for the length exception mentioned before: The problem is in "words" like greencar. It doesn't really match a word of length 8, and yet for humans it's quite obvious what was meant. In this case, you can't really break the input word at any random boundary and run an additional N-1 inexact matches against both halves. However, it is feasible to check for just a missing space. Just do a lookup for all possible prefixes. This is efficient because you'll be using the same part of the dictionary over and over, e.g. g gr, gre, gree, etc. For every prefix that you've found, check if the remaining suffixis also in the dictionery, e.g. reencar, eencar. If both halves of the input word are in the dictionary, but the word itself isn't, you can assume a missing space.
You would likely want to use an algorithm which calculates the Levenshtein distance.
However, since your data set is quite large, and you'll be comparing lots of words against it, a direct implementation of typical algorithms that do this won't be practical.
In order to find words in a reasonable amount of time, you will have to index your set of words in some way that facilitates fuzzy string matching.
One of these indexing methods would be to use a suffix tree. Another approach would be to use n-grams.
I would lean towards using a suffix tree since I find it easier to wrap my head around it and I find it more suited to the problem.
It might be of interest to look at a some algorithms such as the Levenshtein distance, which can calculate the amount of difference between 2 strings.
I'm not sure what language you are thinking of using but PHP has a function called levenshtein that performs this calculation and returns the distance. There's also a function called similar_text that does a similar thing. There's a code example here for the levenshtein function that checks a word against a dictionary of possible words and returns the closest words.
I hope this gives you a bit of insight into how a solution could work!

Resources