Information retrieval: compare words - algorithm

I have about 3 million words coming from many research papers.
I want to filter those papers according to metadata.
The papers are about cars, books, and foods.
For example, I have a document with metadata Toyota.
I have another document with metadata Toiota.
Notice that Toiota is the same as Toyota.
What approaches are available to solve this problem?
What I have tried
I used a stemmer to take the root of each word:
I stem the first word to get its root,
I stem the second word to get its root,
then I compare the two roots.
My problem
Stemming only works on words that have a meaning, for example eating, eat, ate. But when the word doesn't have a meaning, like Toyota, its root is the exact same word.
Another problem
Stemming also doesn't work in this case:
"United States" doesn't equal "US", but logically they are the same.
Does anyone have a better approach?
I don't know which Stack Overflow tags fit my problem, so you are welcome to add tags.
Update 1
I want to search for this problem on Google, but I don't know the correct terms to use when searching. Could you help me, please?

If you want Toiota to mean the same as Toyota, there are a few options:
Hard code the translation
Auto "spell check" the query/document. If Toiota does not exist in your dictionary, then return the closest word, if it's close. See Norvig's spelling corrector.
Compare documents on character similarity rather than exact word matches: {t,o,y,o,t,a} has 83% overlap with {t,o,i,o,t,a}. Check out Jaro-Winkler distance too.
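A minimal Python sketch of that character-overlap idea (the words are just the example above; for Jaro-Winkler itself you would typically reach for a library such as jellyfish):

from collections import Counter
from difflib import SequenceMatcher

def char_overlap(a, b):
    # Share of characters (with multiplicity) that the two words have in common.
    shared = sum((Counter(a.lower()) & Counter(b.lower())).values())
    return shared / max(len(a), len(b))

print(char_overlap("Toyota", "Toiota"))                   # ~0.83
print(SequenceMatcher(None, "toyota", "toiota").ratio())  # ~0.83, a built-in alternative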
For US/United States you probably want a synonym file (of countries and their abbreviations), and add the synonyms to each document. Another approach would be to take names, auto-abbreviate them, and add that to your index. Example:
abbrev('United States') = {'united', 'states', 'us'} -- take the first letter of each word in multi-word names
abbrev('Canada') = {'canada', 'can'} -- take the first three letters of single-word names
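A rough sketch of that abbrev idea in Python (the rules are just the ones from the two examples above, not a general solution):

def abbrev(name):
    # Index terms for a name: the words themselves plus one generated abbreviation.
    words = name.lower().split()
    if len(words) > 1:
        extra = "".join(w[0] for w in words)  # first letter of each word -> 'us'
    else:
        extra = words[0][:3]                  # first three letters -> 'can'
    return set(words) | {extra}

print(abbrev("United States"))  # {'united', 'states', 'us'}
print(abbrev("Canada"))         # {'canada', 'can'}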

Related

Matching words and valid sub-words in Elasticsearch

I've been working with ElasticSearch within an existing code base for a few days, so I expect that the answer is easy once I know what I'm doing. I want to extend a search to yield the same results when I search with a compound word, like "eyewitness", or its component words separated by a whitespace, like "eye witness".
For example, I have a catalog of toy cars that includes both "firetruck" toys and "fire truck" toys. I would like to ensure that if someone searched on either of these terms, the results would include both the "firetruck" and the "fire truck" entries.
I attempted to do this at first with the "fuzziness" of a match, hoping that "fire truck" would be considered one transform away from "firetruck", but that does not work: ES fuzziness is per-word and will not add or remove whitespace characters as a valid transformation.
I know that I could do some brute-forcing before generating the query by trying to come up with additional search terms by breaking big words into smaller words and also joining smaller words into bigger words and checking all of them against a dictionary, but that falls apart pretty quickly when "fuzziness" and proper names are part of the task.
It seems like this is exactly the kind of thing that ES should do well, and that I simply don't have the right vocabulary yet for searching for the solution.
Thanks, everyone.
There are two things you could do:
You could split compound words into their parts, i.e. firetruck would be split into the two tokens fire and truck (see Elasticsearch's compound word token filters).
You could use n-grams, i.e. with 4-grams the original firetruck gets split into the tokens fire, iret, retr, etru, truc, ruck. At query time, the scoring function helps you end up with pretty decent results. Check out Elasticsearch's n-gram tokenizer.
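To illustrate the n-gram idea in plain Python (this is not the Elasticsearch analyzer itself; whitespace is stripped here only to show why the two spellings end up sharing tokens):

def char_ngrams(text, n=4):
    # Character n-grams over the text with whitespace removed.
    text = text.replace(" ", "").lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]

print(char_ngrams("firetruck"))   # ['fire', 'iret', 'retr', 'etru', 'truc', 'ruck']
print(char_ngrams("fire truck"))  # the same tokens, so the two forms match each other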
Always remember to do the same tokenization on both the analysis and the query side.
I would start with the n-grams, and if that is not good enough, go with the compounds and split them yourself - but that's a lot of work, depending on the vocabulary you have under consideration.
Hope the concepts and the links help, fricke

Predicting Missing Word in Sentence

How can I predict a word that's missing from a sentence?
I've seen many papers on predicting the next word in a sentence using an n-grams language model with frequency distributions from a set of training data. But instead I want to predict a missing word that's not necessarily at the end of the sentence. For example:
I took my ___ for a walk.
I can't seem to find any algorithms that take advantage of the words after the blank; I guess I could ignore them, but they must add some value. And of course, a bi/trigram model doesn't work for predicting the first two words.
What algorithm/pattern should I use? Or is there no advantage to using the words after the blank?
TensorFlow has a tutorial for this: https://www.tensorflow.org/versions/r0.9/tutorials/word2vec/index.html
Incidentally it does a bit more and generates word embeddings, but to get there they train a model to predict the (next/missing) word. They also show how to use only the previous words, but you can apply the same ideas and include the words that follow.
They also have a bunch of suggestions on how to improve the precision (skip-grams).
Somewhere at the bottom of the tutorial there are links to working source code.
The only thing to worry about is having sufficient training data.
So, when I've worked with bigrams/trigrams, an example query generally looked something like "Predict the missing word in 'Would you ____'". I'd then go through my training data and gather all the sets of three words matching that pattern, and count the things in the blanks. So, if my training data looked like:
would you not do that
would you kindly pull that lever
would you kindly push that button
could you kindly pull that lever
I would get two counts for "kindly" and one for "not", and I'd predict "kindly". All you have to do for your problem is consider the blank in a different place: "____ you kindly" would get two counts for "would" and one for "could", so you'd predict "would". As far as the computer is concerned, there's nothing special about the word order - you can describe whatever pattern you want, from your training data. Does that make sense?
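A minimal Python sketch of that counting approach, using the toy training data above (a real system would of course use a much larger corpus):

from collections import Counter

corpus = [
    "would you not do that",
    "would you kindly pull that lever",
    "would you kindly push that button",
    "could you kindly pull that lever",
]

def predict_blank(pattern, corpus):
    # pattern is a list of words with None marking the blank, e.g. [None, 'you', 'kindly'].
    counts = Counter()
    width = len(pattern)
    for sentence in corpus:
        words = sentence.split()
        for i in range(len(words) - width + 1):
            window = words[i:i + width]
            if all(p is None or p == w for p, w in zip(pattern, window)):
                counts[window[pattern.index(None)]] += 1
    return counts

print(predict_blank(["would", "you", None], corpus))   # Counter({'kindly': 2, 'not': 1})
print(predict_blank([None, "you", "kindly"], corpus))  # Counter({'would': 2, 'could': 1})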

Given a huge set of street names, what is the most efficient way to test whether a text contains one of the street names from the set?

I have an interesting problem that I need help with. I am currently working on a feature of my program and stumbled into this issue:
I have a huge list of street names in Indonesia (> 100k rows) stored in a database.
Each street name may have more than one word. For example: "Sudirman", "Gatot Subroto", or "Jalan Asia Afrika" are all legitimate street names.
I also have a bunch of texts (> 1 million rows) in databases, which I split into sentences. Now, the feature (function, to be exact) that I need to implement is to test whether there are street names inside the sentences or not, so it's just a true/false test.
I have tried to solve it by doing these steps:
a. Put the street names into a key/value hash
b. Split each sentence into words
c. Test whether each word is in the hash
This is fast, but it will not work with multi-word street names.
Another alternative I thought of is to do these steps:
a. Split each sentence into words
b. Query the database with a LIKE statement (i.e. SELECT #### FROM street_table WHERE name LIKE '%word%')
c. If the query returns a row, the sentence contains a street name
Now, this solution is going to be very IO intensive.
So my question is: what is the most efficient way to do this test, regardless of the programming language? I do this mainly in Python, but any language will do as long as I can grasp the concepts.
============EDIT 1 =================
Will this be periodic?
Yes, I will call this feature/function at an interval of 1 minute. Each call will take at least 100 rows of text and test them against the street name database.
A simple solution would be to create a dictionary/multimap from first-word-of-street-name => full-street-name(s). As you iterate over each word in your sentence, you look up potential street names and check whether you have a match (by looking at the next words).
This algorithm should be fairly easy to implement and should perform pretty well, too.
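A minimal Python sketch of that dictionary idea (the street list is just the toy example from the question; in practice it would be loaded from the database):

from collections import defaultdict

streets = ["Sudirman", "Gatot Subroto", "Jalan Asia Afrika"]

index = defaultdict(list)  # first word -> list of full street names, as word tuples
for name in streets:
    words = tuple(name.lower().split())
    index[words[0]].append(words)

def contains_street(sentence):
    tokens = sentence.lower().split()
    for i, tok in enumerate(tokens):
        for candidate in index.get(tok, ()):
            if tuple(tokens[i:i + len(candidate)]) == candidate:
                return True
    return False

print(contains_street("Kantor kami ada di Gatot Subroto nomor 12"))  # True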
Using NLP, you can determine the proper nouns in a sentence. Please refer to the link below.
http://nlp.stanford.edu/software/lex-parser.shtml
The Stanford parser is accurate in its calculations. Once you have the proper nouns, you can decide which approach to follow.
So you have a document and want to search whether it contains any of the street names from your list?
Turbo Boyer-Moore is a good starting point for doing that.
Here is more information on Turbo Boyer-Moore.
But I strongly believe you will have to do something about the organisation of your list of street names. There should be some bucketed access to it, i.e. so you can easily filter the street names.
Here is an example:
Street name: Asia-Pacific-street
You can access your list by:
A (getting a starting point for all that start with an A)
AS (getting a starting point for all that start with an AS)
and so on...
I believe you should have lots of buckets for that, at least 26 (first letter) * 26 (second letter)
more information about bucketing
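As a rough illustration of the bucket idea in Python (a sorted list plus a binary search over the prefix range; the street list is just a toy example):

import bisect

streets = sorted(s.lower() for s in ["Sudirman", "Gatot Subroto", "Jalan Asia Afrika"])

def names_with_prefix(prefix):
    # Return the slice of the sorted list whose entries start with `prefix`.
    prefix = prefix.lower()
    lo = bisect.bisect_left(streets, prefix)
    hi = bisect.bisect_left(streets, prefix + "\uffff")
    return streets[lo:hi]

print(names_with_prefix("ga"))  # ['gatot subroto']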
The Aho-Corasick algorithm could be pretty useful. One of its advantages is that its run time is independent of how many words you are searching for (it depends only on the length of the text you are searching through). It will be especially useful if your list of street names does not change frequently.
http://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_string_matching_algorithm
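A sketch of that approach, assuming the third-party pyahocorasick package (pip install pyahocorasick); the street list is again just the toy example:

import ahocorasick

streets = ["Sudirman", "Gatot Subroto", "Jalan Asia Afrika"]

automaton = ahocorasick.Automaton()
for name in streets:
    automaton.add_word(name.lower(), name)
automaton.make_automaton()  # build the automaton once, up front

def contains_street(sentence):
    # A single pass over the sentence finds all occurrences, no matter
    # how many street names were added; stop at the first hit.
    return next(automaton.iter(sentence.lower()), None) is not None

print(contains_street("Macet parah di Jalan Asia Afrika pagi ini"))  # True
print(contains_street("No street mentioned here"))                   # False

Note that this matches raw substrings, so you may want to check word boundaries around each hit on top of it.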

String comparison algorithm, relevancy, how much "alike" 2 strings are

I have 2 sources of information for the same data (companies), which I can join together via a unique ID (contract number). There is a second, different source because the 2 sources are updated manually and independently. So what I have is an ID and a company name in 2 tables.
I need to come up with an algorithm that would compare the Name in the 2 tables for the same ID, and order all the companies by a variable which indicates how different the strings are (to highlight the most different ones, to be placed at the top of the list).
I looked at the simple Levenshtein distance calculation algorithm, but it's at the letter level, so I am still looking for something better.
The reason why Levenshtein doesn't really do the job is this: companies have a name, prefixed or postfixed by the organizational form (LTD, JSC, co. etc). So we may have a lot of JSC "Foo" which will differ a lot from Foo JSC., but what I am really looking for in the database is pairs of different strings like SomeLongCompanyName JSC and JSC OtherName.
Are there any good ways to do this? (I don't really like the idea of using regex to separate the words in each string and then finding matches for every word in the other string using the Levenshtein distance, so I am searching for other ideas.)
How about:
1. Replace all punctuation by whitespace.
2. Break the string up into whitespace-delimited words.
3. Move all words of <= 4 characters to the end, sorted alphabetically.
4. Levenshtein.
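A minimal Python sketch of those four steps (with a plain dynamic-programming Levenshtein; any edit-distance library would do as well):

import re

def levenshtein(a, b):
    # Classic dynamic-programming edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def normalize(name):
    # Steps 1-3: punctuation to whitespace, split into words, short words moved to the end, sorted.
    words = re.sub(r"[^\w\s]", " ", name.lower()).split()
    long_words = [w for w in words if len(w) > 4]
    short_words = sorted(w for w in words if len(w) <= 4)
    return " ".join(long_words + short_words)

def name_distance(a, b):
    return levenshtein(normalize(a), normalize(b))

print(name_distance('JSC "Foo"', 'Foo JSC'))  # 0 - word order and punctuation no longer matter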
Could you filter out (remove) those "common words" (similar to removing stop words for fulltext indexing) and then search on that? If not, could you sort the words alphabetically before comparing?
As an alternative or in addition to the Levenshtein distance, you could use Soundex. It's not terribly good, but it can be used to index the data (which is not possible when using Levenshtein).
Thank you both for ideas.
I used 4 indices, each a Levenshtein distance divided by the sum of the lengths of both strings (a relative distance), computed on the following:
Just the 2 strings
The string obtained by splitting into word sequences, eliminating the non-word characters, ordering the words ascending and joining them with a space separator.
The string contained between quotes (if no such string is present, the original string is taken)
The string composed of the alphabetically ordered first characters of each word.
Each of these, in turn, is an integer value between 1 and 1000. The resulting value is the product of:
X1^E1 * X2^E2 * X3^E3 * X4^E4
where X1..X4 are the indices, and E1..E4 are user-provided preferences for how valuable (significant) each index is. To keep the result inside reasonable values of 1..1000, the vector (E1..E4) is normalized.
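If I read that combination right, normalizing (E1..E4) so they sum to 1 turns the product into a weighted geometric mean, which is what keeps the result within 1..1000; a small sketch under that assumption, with made-up scores and preferences:

def combine(indices, prefs):
    # Weighted geometric mean of the per-index scores (each assumed to be in 1..1000).
    total = sum(prefs)
    exponents = [e / total for e in prefs]  # normalize so the exponents sum to 1
    score = 1.0
    for x, e in zip(indices, exponents):
        score *= x ** e
    return score

print(combine([700, 300, 120, 50], [2, 1, 1, 1]))  # stays within the 1..1000 range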
The results are impressive. The whole thing works much faster than I expected (I built it as a CLR assembly in C# for Microsoft SQL Server 2008). After picking E1..E4 correctly, the largest index (biggest difference) over non-null values in the whole database is 765. Right until about 300 there are virtually no matching company names. Around 200 there are companies with somewhat similar names, and some are the same names written in very different ways, with abbreviations, additional words, etc. At 100 and below, practically all the records contain names that are the same but written with slight differences, and by 30, only the word order or the punctuation may differ.
It totally works; the result is better than I expected.
I wrote a post on my blog, to share this library in case someone else needs it.

How to find similarity in texts

I have a database where users upload articles.
I would like to make an algorithm where my web app will suggest similar texts according to the one the user reads.
I saw some examples like the Levenshtein distance, but those algorithms measure the distance between strings, not whole articles. Is there a way to extract the most significant keywords from a text? Surely, I understand that "most significant" is an ambiguous term.
How do other sites manage this?
thanks a lot
Is there a way to extract most significant keywords from text?
Yes. Basically, you extract all the words from the text, sort the words by frequency, eliminate the common words (a, an, the, etc.) by matching them against a common-word dictionary, and save the top 20 or more words, along with their frequencies, from each article.
The number of top words you save is related to both the length of the article and the subject matter of all the articles. Fewer words work for general-interest articles, while more words are necessary for special-interest articles, like answers to programming questions.
Articles that match more than half of the top words could be considered related. The degree of relatedness would depend on the number of matching top words and the frequencies of the matching words.
You could calculate a relatedness score by multiplying the frequencies of each matched word from the two articles and summing all the products. The higher the score, the more the articles are related.
You might try to correct the 'weight' of each word by the frequency with which it appears in all the articles. The best indicators of similarity would then be the words that appear only in the two compared articles and nowhere else. This would automatically disregard the common words (a, an, the, etc.) mentioned by @Gilbert Le Blanc.
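A minimal Python sketch combining both suggestions: top-word overlap scored by frequency products, damped by how common each word is across the whole collection (the stop-word list and articles are just toy placeholders):

from collections import Counter
import math

STOP_WORDS = {"a", "an", "the", "and", "of", "to", "in", "is", "it", "on"}

def top_words(text, n=20):
    words = [w for w in text.lower().split() if w.isalpha() and w not in STOP_WORDS]
    return dict(Counter(words).most_common(n))

def relatedness(text_a, text_b, all_texts, n=20):
    # Document frequency of each word across the whole collection.
    doc_freq = Counter()
    for t in all_texts:
        doc_freq.update(set(t.lower().split()))
    a, b = top_words(text_a, n), top_words(text_b, n)
    score = 0.0
    for word in a.keys() & b.keys():
        weight = math.log(len(all_texts) / doc_freq[word])  # 0 for words present everywhere
        score += a[word] * b[word] * weight
    return score

articles = ["the cat sat on the mat", "a cat and a dog", "stocks fell sharply today"]
print(relatedness(articles[0], articles[1], articles))  # the shared word 'cat' gives a positive score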
