`FastText.wv.save_word2vec_format()` creates some entries with two words on one line - gensim

FastText.wv.save_word2vec_format() creates some entries with two words on one line. This is a problem because it breaks the KeyedVectors.load_word2vec_format() function, which expects one word followed by x floats, where x is the number of dimensions in the vector. I haven't been able to prove that this is a bug, so has anyone had this problem?
My solution has been to prune the resulting file by removing lines with two space-separated words. In the large data sets that I used, there were between three and ten occurrences per data set. I also double checked that no words in the vocabulary and no words in the data set contained a space.
In every occurrence, the two component words had their own entries as a single word. Is this intended for, maybe, particularly high co-occurring pairs?
Is this expected behavior? If so, why? And why is there no accounting for this in the loading function?

In general, if save_word2vec_format() seems to have prefixed a particular line of floats with a string that includes more whitespace than just an ending space, then it is almost certain that the matching key in the model includes such whitespace.
In particular, consider a written file with the usual special header-line (containing the counts of further vectors & dimensionality), which on its 10th line (counting from 1 as if using cat -n FILENAME) shows this symptom. In such a case, I'd expect ft_model.wv.index_to_key[8] to reveal a key with a matching internal whitespace.
Are you sure this isn't the case for your occurrences? (How had you checked "that no words in the vocabulary and no words in the data set contained a space"? Might you have missed some more exotic whitespace character that somehow became a plain space while saving?)
If the actual key contains whitespace, the issue is that something in the prior preprocessing/training left such internal-whitespace tokens in the model, and solving that may be the best way to resolve the issue.
If not, there's a deeper problem & it'd be helpful to figure a minimal way to trigger the problem. Examining the ft_model.wv.index_to_key values for the keys all around the problem key might help narrow things down.

Related

Algorithm to reform a sentence from sentence whose spaces are removed and alphabets of words are reordered?

I was looking around some puzzles online to improve my knowledge on algorithms...
I came upon below question:
"You have a sentence with several words with spaces remove and words having their character order shuffled. You have a dictionary. Write an algorithm to produce the sentence back with spaces and words with normal character order."
I do not know what is good way to solve this.
I am new to algorithms but just looking at problem I think I would make program do what an intellectual mind would do.
Here is something I can think of:
-First find out manually common short english words from dictionary like "is" "the" "if" etc and put in dataset-1.
-Then find out permutation of words in dataset1 (eg "si", "eht" or "eth" or "fi") and put in dataset-2
-then find out from input sentense what character sequence matches the words of dataset2 and put them in dataset-3 and insert space in input sentence instead of those found.
-for rest of the words i would perform permutations to find out word from dictionary.
I am newbie to algorithms...is it a bad solution?
this seems like a perfectly fine solution,
In general there are 2 parameters for judging an algorithm.
correctness - does the algorithm provide the correct answer.
resources - the time or storage size needed to provide an answer.
usually there is a tradeoff between these two parameters.
so for example the size of your dictionary dictates what scrambled sentences you may
reconstruct, giving you a correct answer for more inputs,
however the whole searching process would take longer and would require more storage.
The hard part of the problem you presented is the fact that you need to compute permutations, and there are a LOT of them.
so checking them all is expensive, a good approach would be to do what you suggested, create a small subset of commonly used words and check them first, that way the average case is better.
note: just saying that you check the permutation/search is ok, but in the end you would need to specify the exact way of doing that.
currently what you wrote is an idea for an algorithm but it would not allow you to take a given input and mechanically work out the output.
Actually, might be wise to start by partitioning the dictionary by word length.
Then try to find the largest words that can be made using the letters avaliable, instead of finding the smallest ones. Short words are more common and thus will be harder to narrow down. IE: is it really "If" or "fig".
Then for each word length w, you can proceed w characters at a time.
There are still a lot of possible combinations though, simply because you found a valid word, doesn't mean it's the right word. Once you've gone through all the substrings, of which there should be something like O(c^4*d) where d is the number of words in the dictionary and c is the number of characters in the sentence. Practically speaking if the dictionary is sorted by word length, it'll be a fair bit less than that. Then you have to take the valid words, and figure out an ordering that works, so that all characters are used. There might be multiple solutions.

String comparison algorithm, relevancy, how much "alike" 2 strings are

I have 2 sources of information for the same data (companies), which I can join together via a unique ID (contract number). The presence of the second, different source, is due to the fact that the 2 sources are updated manually, independently. So what I have is an ID and a company Name in 2 tables.
I need to come up with an algorithm that would compare the Name in the 2 tables for the same ID, and order all the companies by a variable which indicates how different the strings are (to highlight the most different ones, to be placed at the top of the list).
I looked at the simple Levenshtein distance calculation algorithm, but it's at the letter level, so I am still looking for something better.
The reason why Levenshtein doesn't really do the job is this: companies have a name, prefixed or postfixed by the organizational form (LTD, JSC, co. etc). So we may have a lot of JSC "Foo" which will differ a lot from Foo JSC., but what I am really looking for in the database is pairs of different strings like SomeLongCompanyName JSC and JSC OtherName.
Are there any Good ways to do this? (I don't really like the idea of using regex to separate words in each string, then find matches for every word in the other string by using the Levenshtein distance, so I am searching for other ideas)
How about:
1. Replace all punctuation by whitespace.
2. Break the string up into whitespace-delimited words.
3. Move all words of <= 4 characters to the end, sorted alphabetically.
4. Levenshtein.
Could you filter out (remove) those "common words" (similar to removing stop words for fulltext indexing) and then search on that? If not, could you sort the words alphabetically before comparing?
As an alternative or in addition to the Levenshtein distance, you could use Soundex. It's not terribly good, but it can be used to index the data (which is not possible when using Levenshtein).
Thank you both for ideas.
I used 4 indices which are levenshtein distances divided by the sum of the length of both words (relative distances) of the following:
Just the 2 strings
The string composed of the result after separating the word sequences, eliminating the non-word chars, ordering ascending and joining with space as separator.
The string which is contained between quotes (if no such string is present, the original string is taken)
The string composed of alphabetically ordered first characters of each word.
each of these in return is an integer value between 1 and 1000. The resulting value is the product of:
X1^E1 * X2^E2 * X3^E3 * X4^E4
Where X1..X4 are the indices, and E1..E4 are user-provided preferences of valuable (significant) is each index. To keep the result inside reasonable values of 1..1000, the vector (E1..E4) is normalized.
The results are impressive. The whole thing works much faster than I've expected (built it as a CLR assembly in C# for Microsoft SQL Server 2008). After picking E1..E4 correctly, the largest index (biggest difference) on non-null values in the whole database is 765. Right untill about 300 there is virtually no matching company name. Around 200 there are companies that have kind of similar names, and some are the same names but written in very different ways, with abbreviations, additional words, etc. When it comes down to 100 and less - practically all the records contain names that are the same but written with slight differences, and by 30, only the order or the punctuation may differ.
Totally works, result is better than I've expected.
I wrote a post on my blog, to share this library in case someone else needs it.

Seeking algo for text diff that detects and can group similar lines

I am in the process of writing a diff text tool to compare two similar source code files.
There are many such "diff" tools around, but mine shall be a little improved:
If it finds a set of lines are mismatched on both sides (ie. in both files), it shall not only highlight those lines but also highlight the individual changes in these lines (I call this inter-line comparison here).
An example of my somewhat working solution:
alt text http://files.tempel.org/tmp/diff_example.png
What it currently does is to take a set of mismatched lines and running their single chars thru the diff algo once more, producing the pink highlighting.
However, the second set of mismatches, containing "original 2", requires more work: Here, the first two right lines ("added line a/b") were added, while the third line is an altered version of the left side. I wish my software to detect this difference between a likely alteration and a probable new line.
When looking at this simple example, I can rather easily detect this case:
With an algo such as Levenshtein, I could find that of all right lines in the set of 3 to 5, line 5 matches left line 3 best, thus I could deduct that lines 3 and 4 on the right were added, and perform the inter-line comparison on left line 3 and right line 5.
So far, so good. But I am still stuck with how to turn this into a more general algorithm for this purpose.
In a more complex situation, a set of different lines could have added lines on both sides, with a few closely matching lines in between. This gets quite complicated:
I'd have to match not only the first line on the left to the best on the right, but vice versa as well, and so on with all other lines. Basically, I have to match every line on the left against every one on the right. At worst, this might create even crossings, so that it's not easily clear any more which lines were newly inserted and which were just altered (Note: I do not want to deal with possible moved lines in such a block, unless that would actually simplify the algorithm).
Sure, this is never going to be perfect, but I'm trying to get it better than it's now. Any suggestions that aren't too theoerical but rather practical (I'm not good understanding abstract algos) are appreciated.
Update
I must admit that I do not even understand how the LCS algo works. I simply feed it two arrays of strings and out comes a list of which sequences do not match. I am basically using the code from here: http://www.incava.org/projects/java/java-diff
Looking at the code I find one function equal() that is responsible for telling the algorithm whether two lines match or not. Based on what Pavel suggested, I wonder if that's the place where I'd make the changes. But how? This function only returns a boolean - not a relative value that could identify the quality of the match. And I can not simply used a fixed Levenshtein ration that would decide whether a similar line is still considered equal or not - I'll need something that's self-adopting to the entire set of lines in question.
So, what I'm basically saying is that I still do not understand where I'd apply the fuzzy value that relates to the relative similarity of lines that do not (exactly) match.
Levenshtein distance is based on the notion of an "edit script" that transforms one string into another. It's very closely related to the Needleman-Wunsch algorithm used for aligning DNA sequences by inserting gap characters, in which we search for the alignment that maximises a score in O(nm) time using dynamic programming. Exact matches between characters increase the score, while mismatches or inserted gap characters reduce the score. An example alignment of AACTTGCCA and AATGCGAT:
AACTTGCCA-
AA-T-GCGAT
(6 matches, 1 mismatch, 3 gap characters, 3 gap regions)
We can think of the top string being the "starting" sequence that we are transforming into the "final" sequence on the bottom. Each column containing a - gap character on the bottom is a deletion, each column with a - on the top is an insertion, and each column with different (non-gap) characters is a substitution. There are 2 deletions, 1 insertion and 1 substitution in the above alignment, so the Levenshtein distance is 4.
Here is another alignment of the same strings, with the same Levenshtein distance:
AACTTGCCA-
AA--TGCGAT
(6 matches, 1 mismatch, 3 gap characters, 2 gap regions)
But notice that although there are the same number of gaps, there is one less gap region. Because biological processes are more likely to create wide gaps than multiple separate gaps, biologists prefer this alignment -- and so will the users of your program. This is accomplished by also penalising the number of gap regions in the scores that we compute. An O(nm) algorithm to accomplish this for strings of lengths n and m was given by Gotoh in 1982 in a paper called "An improved algorithm for matching biological sequences". Unfortunately, I can't find any links to free full text of the paper -- but there are many useful tutorials that you can find by googling "sequence alignment" and "affine gap penalty".
In general, different choices of match, mismatch, gap and gap region weights will give different alignments, but any negative score for gap regions will prefer the bottom alignment above to the top one.
What does all this have to do with your problem? If you use Gotoh's algorithm on individual characters with a suitable gap penalty (arrived at with a few empirical tests), you should find a significant decrease in the the number of terrible-looking alignments like the example you gave.
Efficiency Considerations
Ideally, you could just do this on characters and ignore lines altogether, since the affine penalty will work to cluster changes into blocks spanning many lines wherever it can. But because of the higher running time, it may be more realistic to do a first pass on lines and then rerun the algorithm on characters, using as input all lines that are not identical. Under this scheme, any shared block of identical lines can be handled by compressing it into a single "character" with inflated matching weight, which helps to ensure no "crossings" appear.
With an algo such as Levenshtein, I could find that of all right lines in the set of 3 to 5, line 5 matches left line 3 best, thus I could deduct that lines 3 and 4 on the right were added, and perform the inter-line comparison on left line 3 and right line 5.
After you have determined it, use the same algorithm to determine what lines in these two chinks match each other. But you need to make slight modificaiton. When you used the algorithm to match equal lines, the lines could either match or not match, so that added either 0 or 1 to the cell of the table you used.
When comparing strings in one chunk some of them are "more equal" than others (ack. to Orwell). So they can add a real number from 0 to 1 to the cell when considering what sequence matches best so far.
To compute this metrics (from 0 to 1), you can apply to each pair of strings you encounter... right, the same algorithm again (actually, you already did this when you were doing the first pass of Levenstein algorithm). This will compute the length of LCS, whose ratio to the average length of two strings would be the the metric value.
Or, you can borrow the algorithm from one of diff tools. For instance, vimdiff can highlight the matches you require.
Here's one possible solution someone else just made me realize:
My original approach was like this:
Split the text up into separate lines and use LCS algo to determine where there are blocks of nonmatching lines.
Use some smart algo (which this question is about) to figure out which of these lines closely match, i.e. to tell that these lines were modified between revisions.
Compare those closely matching lines line-by-line using LCS again, while marking the non-matching lines as entirely new.
While this would allow for a better visual display of changes when comparing source code revisions, I now found that a much simpler approach is usually sufficient. It works like this:
Same as above.
Take the right and left block of nonmatching lines, concatenate those lines, and tokenize them (either into language-specific tokens/words, or just into single characters)
Apply the LCS algo on the two arrays of tokens.
Maybe those who replied to my original question assumed that I knew to do this all the time, but I had my focus so strongly on a per-line comparison that it did not occur to me to apply LCS on the set of lines by concatenating them, instead of processing them line-by-line.
So, while this approach will not provide as detailed change information as my original intent was, it still does improve the results over what I started yesterday with when I wrote this question.
I'll leave this question open for a while longer - maybe someone else, reading all this, can still provide a complete answer (Pavel and random_hacker offered some suggestions, but it's not a complete solution yet - anyway, thank you for the helpful comments).

How to recognize words in text with non-word tokens?

I am currently parsing a bunch of mails and want to get words and other interesting tokens out of mails (even with spelling errors or combination of characters and letters, like "zebra21" or "customer242"). But how can I know that "0013lCnUieIquYjSuIA" and "anr5Brru2lLngOiEAVk1BTjN" are not words and not relevant? How to extract words and discard tokens that are encoding errors or parts of pgp signature or whatever else we get in mails and know that we will never be interested in those?
You need to decide on a good enough criteria for a word and write a regular expression or a manual to enforce it.
A few rules that can be extrapolated from your examples:
words can start with a captial letter or be all capital letters but if you have more than say, 2 uppercase letters and more than 2 lowercase letters inside a word, it's not a word
If you have numbers inside the word, it's not a word
if it's longer than say, 20 characters
There's no magic trick. you need to decide what you want the rules to be and make them happen.
Al alternative way is to train some kind of Hidden Markov-Models system to recognize things that sound like words but I think this is an overkill for what you want to do.
http://en.wikipedia.org/wiki/English_words_with_uncommon_properties
you can make rules that reject anything with these 'uncommon properties' to build a system that accepts most actual words
Although I generally agree with shoosh's answer, his approach makes it easy to achieve high recall but also low precision, i.e. you would get almost all real words but also a lot non-words. If your definition of word is too restrictive, it's the other way around but that's also not what you want since then you would miss cases like 'zebra123'. So here are a few ideas about how to improve precision:
It may be worthwile thinking about if you could determine what parts of an email belong to the main text and which are footers like pgp signatures. I'm sure it's possible to find some simple heuristics that match most cases, e.g. cut of everything below a line which consists only of '-'-characters.
Depending on your performance criteria you may want to check if a word is a real word or contains a real word by matching against a simple word list. It's easy to find quite exhaustive lists of Englisch words on the web, and you could also compile one yourself by extracting words from a large and clean text corpus.
Using a lexical analyser, you could filter every token which is marked as unknown.
Some simple statistics may tell you how likely it is that something is a word. Tokens which occur with high frequency most probably are words. Tokens which appear only once or whose number is below a certain threshold very probably are not words. Common spelling errors should appear more than once and uncommon ones may be ignored.
Some if these suggestions clearly don't work for cases like 'zebra123'. Again, simply cutting off, or splitting on, in-word numbers may do the trick.
My general approach would be to first identify tokens which certainly are words (using the suggestions above), then identify tokens which certainly are not words (using a regular expression), and then look (with your eyes) at the few hundred or thousand remaining tokens to find common characteristics to handle these separately.

Algorithm wanted: Find all words of a dictionary that are similar to words in a free text

We have a list of about 150,000 words, and when the user enters a free text, the system should present a list of words from the dictionary, that are very close to words in the free text.
For instance, the user enters: "I would like to buy legoe toys in Walmart". If the dictionary contains "Lego", "Car" and "Walmart", the system should present "Lego" and "Walmart" in the list. "Walmart" is obvious because it is identical to a word in the sentence, but "Lego" is similar enough to "Legoe" to be mentioned, too. However, nothing is similar to "Car", so that word is not shown.
Showing the list should be realtime, meaning that when the user has entered the sentence, the list of words must be present on the screen. Does anybody know a good algorithm for this?
The dictionary actually contains concepts which may include a space. For instance, "Lego spaceship". The perfect solution recognizes these multi-word concepts, too.
Any suggestions are appreciated.
Take a look at http://norvig.com/spell-correct.html for a simple algorithm. The article uses Python, but there are links to implementations in other languages at the end.
You will be doing quite a few lookups of words against a fixed dictionary. Therefore you need to prepare your dictionary. Logically, you can quickly eliminate candidates that are "just too different".
For instance, the words car and dissimilar may share a suffix, but they're obviously not misspellings of each other. Now why is that so obvious to us humans? For starters, the length is entirely different. That's an immediate disqualification (but with one exception - below). So, your dictionary should be sorted by word length. Match your input word with words of similar length. For short words that means +/- 1 character; longer words should have a higher margin (exactly how well can your demographic spell?)
Once you've restricted yourself to candidate words of similar length, you'd want to strip out words that are entirely dissimilar. With this I mean that they use entirely different letters. This is easiest to compare if you sort the letters in a word alphabetically. E.g. car becomes "acr"; rack becomes "ackr". You'll do this in preprocessing for your dictionary, and for each input word. The reason is that it's cheap to determine the (size of an) difference of two sorted sets. (Add a comment if you need explanation). car and rack have an difference of size 1, car and hat have a difference of size 2. This narrows down your set of candidates even further. Note that for longer words, you can bail out early when you've found too many differences. E.g. dissimilar and biography have a total difference of 13, but considering the length (8/9) you can probably bail out once you've found 5 differences.
This leaves you with a set of candidate words that use almost the same letters, and also are almost the same length. At this point you can start using more refined algorithms; you don't need to run 150.000 comparisons per input word anymore.
Now, for the length exception mentioned before: The problem is in "words" like greencar. It doesn't really match a word of length 8, and yet for humans it's quite obvious what was meant. In this case, you can't really break the input word at any random boundary and run an additional N-1 inexact matches against both halves. However, it is feasible to check for just a missing space. Just do a lookup for all possible prefixes. This is efficient because you'll be using the same part of the dictionary over and over, e.g. g gr, gre, gree, etc. For every prefix that you've found, check if the remaining suffixis also in the dictionery, e.g. reencar, eencar. If both halves of the input word are in the dictionary, but the word itself isn't, you can assume a missing space.
You would likely want to use an algorithm which calculates the Levenshtein distance.
However, since your data set is quite large, and you'll be comparing lots of words against it, a direct implementation of typical algorithms that do this won't be practical.
In order to find words in a reasonable amount of time, you will have to index your set of words in some way that facilitates fuzzy string matching.
One of these indexing methods would be to use a suffix tree. Another approach would be to use n-grams.
I would lean towards using a suffix tree since I find it easier to wrap my head around it and I find it more suited to the problem.
It might be of interest to look at a some algorithms such as the Levenshtein distance, which can calculate the amount of difference between 2 strings.
I'm not sure what language you are thinking of using but PHP has a function called levenshtein that performs this calculation and returns the distance. There's also a function called similar_text that does a similar thing. There's a code example here for the levenshtein function that checks a word against a dictionary of possible words and returns the closest words.
I hope this gives you a bit of insight into how a solution could work!

Resources