I have a map() function like this in my beer design document:
function (doc, meta) {
if(doc.brewery_id)
emit([ doc.brewery_id, doc.abv], [doc.name, doc.abv, doc.type, doc.brewery_id, doc.style, doc.category]);
}
I need to get all docs matching 2 rules:
1. [brewery_id] starts with "21st"
2. [abv] is between 3 and 4
My filter is:
startkey=["21st", 3]
endkey=["21st\uefff", 4]
But the result is not correct: rule 1 works as expected, but rule 2 is ignored.
Please help me find out what's wrong.
Thanks!!!
Here's the result:
{"total_rows":5891,"rows":[
{"id":"21st_amendment_brewery_cafe-bitter_american","key":["21st_amendment_brewery_cafe",3.6],"value":["Bitter American",3.6,"beer","21st_amendment_brewery_cafe","Special Bitter or Best Bitter","British Ale"]},
{"id":"21st_amendment_brewery_cafe-563_stout","key":["21st_amendment_brewery_cafe",5],"value":["563 Stout",5,"beer","21st_amendment_brewery_cafe","American-Style Stout","North American Ale"]},
{"id":"21st_amendment_brewery_cafe-south_park_blonde","key":["21st_amendment_brewery_cafe",5],"value":["South Park Blonde",5,"beer","21st_amendment_brewery_cafe","Golden or Blonde Ale","North American Ale"]},
{"id":"21st_amendment_brewery_cafe-amendment_pale_ale","key":["21st_amendment_brewery_cafe",5.2],"value":["Amendment Pale Ale",5.2,"beer","21st_amendment_brewery_cafe","American-Style Pale Ale","North American Ale"]},
{"id":"21st_amendment_brewery_cafe-potrero_esb","key":["21st_amendment_brewery_cafe",5.2],"value":["Potrero ESB",5.2,"beer","21st_amendment_brewery_cafe","Special Bitter or Best Bitter","British Ale"]},
{"id":"21st_amendment_brewery_cafe-general_pippo_s_porter","key":["21st_amendment_brewery_cafe",5.5],"value":["General Pippo's Porter",5.5,"beer","21st_amendment_brewery_cafe","Porter","Irish Ale"]},
{"id":"21st_amendment_brewery_cafe-watermelon_wheat","key":["21st_amendment_brewery_cafe",5.5],"value":["Watermelon Wheat",5.5,"beer","21st_amendment_brewery_cafe","Belgian-Style Fruit Lambic","Belgian and French Ale"]},
{"id":"21st_amendment_brewery_cafe-north_star_red","key":["21st_amendment_brewery_cafe",5.8],"value":["North Star Red",5.8,"beer","21st_amendment_brewery_cafe","American-Style Amber/Red Ale","North American Ale"]},
{"id":"21st_amendment_brewery_cafe-oyster_point_oyster_stout","key":["21st_amendment_brewery_cafe",5.9],"value":["Oyster Point Oyster Stout",5.9,"beer","21st_amendment_brewery_cafe","American-Style Stout","North American Ale"]},
{"id":"21st_amendment_brewery_cafe-21a_ipa","key":["21st_amendment_brewery_cafe",7.2],"value":["21A IPA",7.2,"beer","21st_amendment_brewery_cafe","American-Style India Pale Ale","North American Ale"]}
]
}
The problem is that startkey/endkey define a single range over the whole compound key in its sorted order, so the second element (abv) is only compared at the range boundaries rather than filtered independently for every row. If you need to filter your results by 2 independent ranges you can use LINQ, but if you have a large number of documents it can be slow. To make it faster you can do two things:
After applying the LINQ "filter", cache the results in memcached or Couchbase.
If your data model allows it, create a separate view for one of the ranges, i.e. move one of the ranges out of the key and into the map function, like:
View for 21sts:
map: function (doc, meta) { if (doc.subtype === "21sts") emit(doc.abv, null); }
where the docs that have subtype == "21sts" are the docs you can get from a view with:
map: function (doc, meta) { emit(doc.brewery_id, null); }
and startkey="21st", endkey="21st\uefff".
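For example, here is a rough sketch of querying that 21sts view for the abv range alone over the views REST API; the host, bucket and design document names are placeholders, and your SDK of choice exposes the same startkey/endkey parameters:
// Sketch (Node 18+ fetch): adjust host, bucket and design doc names, and add
// authentication if your cluster requires it; 8092 is the default view port.
const url = 'http://localhost:8092/beer-sample/_design/beer/_view/21sts'
  + '?startkey=3&endkey=4&inclusive_end=true';
fetch(url)
  .then(res => res.json())
  .then(body => {
    for (const row of body.rows) {
      console.log(row.id, row.key); // key is the abv emitted by the 21sts view
    }
  });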
I am implementing an internal search that looks at various normalized fields to determine relevance for a user's search terms. The best_fields strategy seems to yield strange results sometimes because a "less important" field will generate the highest score and beat out other more important fields with weaker matches. I've included a boost, but cranking that value up seems like it will also skew results; as does moving to a most_fields strategy since not all pages will have all the fields.
What is the right way to go about tuning the below query & incorporating scores from each field?
Below is an example where the content field ends up winning the "max" evaluation for best_fields (because the search term is present more times) and scores higher than the second page, which I want to come first because the search term is a literal match for the keywords field. What's more, since more keywords are added to important pages, their matches seem to get further devalued because the field length is much longer than average.
Query Example
{
  "query": {
    "multi_match": {
      "query": "Hello World",
      "fields": ["keywords^3", "name^2", "content^1"]
    }
  }
}
Document/Results Example:
[{
  "name": "Howdy!",
  "keywords": "",
  "content": "Hello everybody, I'm in the world. hello there, i like saying hello"
},{
  "name": "Hey",
  "keywords": "Hello World, Hello, World",
  "content": "Lot's of text, Lot's of text, Lot's of text, Lot's of text, Lot's of text, Hello"
}]
What you have right now is a static boost, which is also a good way to tune search relevance, but for advanced use cases like yours I would advise looking at function_score as a way to fine-tune the score and relevance.
Please go through the function_score documentation; it's quite exhaustive and can easily serve your use case.
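For instance, one possible direction (a sketch, not a drop-in answer) is to wrap your existing multi_match in a function_score and add a weight function for literal keyword matches. The field names come from your example; the weight, score_mode and boost_mode values are just illustrative starting points to tune:
{
  "query": {
    "function_score": {
      "query": {
        "multi_match": {
          "query": "Hello World",
          "fields": ["keywords^3", "name^2", "content^1"]
        }
      },
      "functions": [
        {
          "filter": { "match_phrase": { "keywords": "Hello World" } },
          "weight": 10
        }
      ],
      "score_mode": "sum",
      "boost_mode": "multiply"
    }
  }
}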
I'm trying to retrieve all the tasks documents that have the string first in their name.
I currently have the following code, but it only works if I pass the exact name:
res, err := db.client.Query(
f.Map(
f.Paginate(f.MatchTerm(f.Index("tasks_by_name"), "My first task")),
f.Lambda("ref", f.Get(f.Var("ref"))),
),
)
I think I can use ContainsStr() somewhere, but I don't know how to use it in my query.
Also, is there a way to do it without using Filter()? I ask because it seems like it filters after the pagination, and that messes up the pages.
FaunaDB provides a lot of constructs; this makes it powerful, but it also means you have a lot to choose from. With great power comes a small learning curve :).
How to read the code samples
To be clear, I use the JavaScript flavor of FQL here and typically expose the FQL functions from the JavaScript driver as follows:
const faunadb = require('faunadb')
const q = faunadb.query
const {
Not,
Abort,
...
} = q
You do have to be careful exporting Map like that since it will conflict with JavaScript's map. In that case, you could just use q.Map.
Option 1: using ContainsStr() & Filter
Basic usage according to the docs
ContainsStr('Fauna', 'a')
Of course, this works on a single value, so in order to make it work you need Filter, and Filter only works on paginated sets. That means that we first need to get a paginated set. One way to get a paginated set of documents is:
q.Map(
Paginate(Documents(Collection('tasks'))),
Lambda(['ref'], Get(Var('ref')))
)
But we can do that more efficiently: one Get === one read, and we don't need the documents themselves since we'll be filtering a lot of them out. It's interesting to know that one index page is also one read, so we can define an index as follows:
{
name: "tasks_name_and_ref",
unique: false,
serialized: true,
source: "tasks",
terms: [],
values: [
{
field: ["data", "name"]
},
{
field: ["ref"]
}
]
}
And since we added name and ref to the values, the index will return pages of name and ref which we can then use to filter. We can, for example, map over such a page, and this will return us an array of booleans.
Map(
Paginate(Match(Index('tasks_name_and_ref'))),
Lambda(['name', 'ref'], ContainsStr(Var('name'), 'first'))
)
Since Filter also works on arrays, we can actually simply replace Map with Filter. We'll also add a LowerCase to ignore casing, and we have what we need:
Filter(
Paginate(Match(Index('tasks_name_and_ref'))),
Lambda(['name', 'ref'], ContainsStr(LowerCase(Var('name')), 'first'))
)
In my case, the result is:
{
"data": [
[
"Firstly, we'll have to go and refactor this!",
Ref(Collection("tasks"), "267120709035098631")
],
[
"go to a big rock-concert abroad, but let's not dive in headfirst",
Ref(Collection("tasks"), "267120846106001926")
],
[
"The first thing to do is dance!",
Ref(Collection("tasks"), "267120677201379847")
]
]
}
Filter and reduced page sizes
As you mentioned, this is not exactly what you want, since it also means that if you request pages of 500 in size, they might be filtered down and you might end up with a page of size 3, then one of 7. You might think: why can't I just get my filtered elements in pages? Well, that's deliberate, for performance reasons, since Filter basically checks each value. Imagine you have a massive collection and filter out 99.99 percent: you might have to loop over many elements to get to 500, and all of those cost reads. We want pricing to be predictable :).
Option 2: indexes!
Each time you want to do something more efficient, the answer lies in indexes. FaunaDB provides you with the raw power to implement different search strategies but you'll have to be a bit creative and I'm here to help you with that :).
Bindings
In index bindings, you can transform the attributes of your document. In our first attempt we will split the string into words (I'll implement multiple approaches since I'm not entirely sure which kind of matching you want).
We do not have a string split function, but since FQL is easily extended, we can write one ourselves, bind it to a variable in our host language (in this case JavaScript), or use one from this community-driven library: https://github.com/shiftx/faunadb-fql-lib
function StringSplit(string: ExprArg, delimiter = " "){
    return If(
        Not(IsString(string)),
        Abort("SplitString only accepts strings"),
        q.Map(
            FindStrRegex(string, Concat(["[^\\", delimiter, "]+"])),
            Lambda("res", LowerCase(Select(["data"], Var("res"))))
        )
    )
}
And use it in our binding.
CreateIndex({
name: 'tasks_by_words',
source: [
{
collection: Collection('tasks'),
fields: {
words: Query(Lambda('task', StringSplit(Select(['data', 'name'], Var('task')))))
}
}
],
terms: [
{
binding: 'words'
}
]
})
Hint: if you are not sure whether you have got it right, you can always put the binding in values instead of terms, and then you'll see in the Fauna dashboard whether your index actually contains values.
What did we do? We just wrote a binding that will transform the value into an array of values at the time a document is written. When you index the array of a document in FaunaDB, these values are indexed separately yet all point to the same document, which will be very useful for our search implementation.
We can now find tasks that contain the string 'first' as one of their words by using the following query:
q.Map(
Paginate(Match(Index('tasks_by_words'), 'first')),
Lambda('ref', Get(Var('ref')))
)
Which will give me the document with name:
"The first thing to do is dance!"
The other two documents didn't contain the exact words, so how do we do that?
Option 3: indexes and Ngram (exact contains matching)
To make exact contains matching efficient, you need to use a (still undocumented, since we'll make it easier in the future) function called 'NGram'. Dividing a string into ngrams is a search technique that is often used under the hood in other search engines. In FaunaDB we can easily apply it thanks to the power of indexes and bindings. The Fwitter example has an example in its source code that does autocompletion. That example won't work for your use case, but I do reference it for other users, since it's meant for autocompleting short strings, not for searching a short string inside a longer string like a task.
We'll adapt it for your use case, though. When it comes to searching, it's all a tradeoff between performance and storage, and in FaunaDB users can choose their own tradeoff. Note that in the previous approach we stored each word separately; with ngrams we'll split words even further to provide some form of fuzzy matching. The downside is that the index size might become very big if you make the wrong choice (this is equally true for search engines, hence why they let you define different algorithms).
What NGram essentially does is get substrings of a string of a certain length.
For example:
NGram('lalala', 3, 3)
Will return the overlapping substrings of length 3: 'lal', 'ala', 'lal', 'ala'.
If we know that we won't be searching for strings longer than a certain length, let's say length 10 (it's a tradeoff: increasing the size will increase storage requirements but allow you to query for longer strings), you can write the following ngram generator.
function GenerateNgrams(Phrase) {
return Distinct(
Union(
Let(
{
// Reduce this array if you want less ngrams per word.
indexes: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
indexesFiltered: Filter(
Var('indexes'),
// filter out the ones below 0
Lambda('l', GT(Var('l'), 0))
),
ngramsArray: q.Map(Var('indexesFiltered'), Lambda('l', NGram(LowerCase(Phrase), Var('l'), Var('l'))))
},
Var('ngramsArray')
)
)
)
}
You can then write your index as follows:
CreateIndex({
name: 'tasks_by_ngrams_exact',
// we actually want to sort to get the shortest word that matches first
source: [
{
// If your collections have the same property that you want to access you can pass a list to the collection
collection: [Collection('tasks')],
fields: {
wordparts: Query(Lambda('task', GenerateNgrams(Select(['data', 'name'], Var('task')))))
}
}
],
terms: [
{
binding: 'wordparts'
}
]
})
And you have an index-backed search where your pages are the size you requested.
q.Map(
Paginate(Match(Index('tasks_by_ngrams_exact'), 'first')),
Lambda('ref', Get(Var('ref')))
)
Option 4: indexes and Ngrams of size 3 or trigrams (Fuzzy matching)
If you want fuzzy searching, trigrams are often used; in this case our index will be easy, so we're not going to use an external function.
CreateIndex({
name: 'tasks_by_ngrams',
source: {
collection: Collection('tasks'),
fields: {
ngrams: Query(Lambda('task', Distinct(NGram(LowerCase(Select(['data', 'name'], Var('task'))), 3, 3))))
}
},
terms: [
{
binding: 'ngrams'
}
]
})
If we were to place the binding in values again to see what comes out, we would see the trigrams generated for each document.
In this approach, we use trigrams both on the indexing side and on the querying side. On the querying side, that means that the word 'first' which we search for will also be divided into trigrams: 'fir', 'irs', 'rst'.
For example, we can now do a fuzzy search as follows:
q.Map(
Paginate(Union(q.Map(NGram('first', 3, 3), Lambda('ngram', Match(Index('tasks_by_ngrams'), Var('ngram')))))),
Lambda('ref', Get(Var('ref')))
)
In this case we actually do 3 searches: we search for each of the trigrams and union the results, which returns all sentences that contain 'first'.
But if we had misspelled it and written 'frst', we would still match all three, since there is a matching trigram ('rst').
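For example (just the query from above with the misspelled term swapped in), this still matches all three tasks:
q.Map(
  Paginate(Union(q.Map(NGram('frst', 3, 3), Lambda('ngram', Match(Index('tasks_by_ngrams'), Var('ngram')))))),
  Lambda('ref', Get(Var('ref')))
)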
This is literally about comparing cakes. My friend is having a cupcake party with the goal of determining the best cupcakery in Manhattan. Actually, it's much more ambitious than that. Read on.
There are 27 bakeries, and 19 people attending (with maybe one or two no-shows). There will be 4 cupcakes from each bakery, if possible including the staples -- vanilla, chocolate, and red velvet -- and rounding out the 4 with wildcard flavors. There are 4 attributes on which to rate the cupcakes: flavor, moistness, presentation (prettiness), and general goodness. People will provide ratings on a 5-point scale for each attribute for each cupcake they sample. Finally, each cupcake can be cut into 4 or 5 pieces.
The question is: what is a procedure for coming up with a statistically meaningful ranking of the bakeries for each attribute, and for each flavor (treating "wildcard" as a flavor)? Specifically, we want to rank the bakeries 8 times: for each flavor we want to rank the bakeries by goodness (goodness being one of the attributes), and for each attribute we want to rank the bakeries across all flavors (ie, independent of flavor, ie, aggregating over all flavors). The grand prize goes to the top-ranked bakery for the goodness attribute.
Bonus points for generalizing this, of course.
This is happening in about 12 hours so I'll post as an answer what we ended up doing if no one answers in the meantime.
PS: Here's the post-party blog post about it: http://gracenotesnyc.com/2009/08/05/gracenotes-nycs-cupcake-cagematch-the-sweetest-battle-ever/
Here's what we ended up doing. I made a huge table to collect everyone's ratings at http://etherpad.com/sugarorgy (Revision 25, just in case it gets vandalized with me adding this public link to it) and then used the following Perl script to parse the data into a CSV file:
#!/usr/bin/env perl
# Grabs the cupcake data from etherpad and parses it into a CSV file.
use LWP::Simple qw(get);
$content = get("http://etherpad.com/ep/pad/export/sugarorgy/latest?format=txt");
$content =~ s/^.*BEGIN_MAGIC\s*//s;
$content =~ s/END_MAGIC.*$//s;
$bakery = "none";
for $line (split('\n', $content)) {
    next if $line =~ /sar kri and deb/;
    if ($line =~ s/bakery\s+(\w+)//) { $bakery = $1; }
    $line =~ s/\([^\)]*\)//g; # strip out stuff in parens.
    $line =~ s/^\s+(\w)(\w)/$1 $2/;
    $line =~ s/\-/\-1/g;
    $line =~ s/^\s+//;
    $line =~ s/\s+$//;
    $line =~ s/\s+/\,/g;
    print "$bakery,$line\n";
}
Then I did the averaging and whatnot in Mathematica:
data = Import["!~/svn/sugar.pl", "CSV"];
(* return a bakery's list of ratings for the given type of cupcake *)
tratings[bak_, t_] := Select[Drop[First@Select[data,
  #[[1]]==bak && #[[2]]==t && #[[3]]=="g" &], 3], # != -1 &]
(* return a bakery's list of ratings for the given cupcake attribute *)
aratings[bak_, a_] := Select[Flatten[Drop[#, 3]& /@
  Select[data, #[[1]]==bak && #[[3]]==a &]], # != -1 &]
(* overall rating for a bakery *)
oratings[bak_] := Join @@ (tratings[bak, #]& /@ {"V", "C", "R", "W"})
bakeries = Union@data[[All, 1]]
SortBy[{#, oratings@#, Round[Mean@oratings[#], .01]}& /@ bakeries, -#[[3]]&]
The results are at the bottom of http://etherpad.com/sugarorgy.
Perhaps reading about voting systems will be helpful. PS: don't take whatever is written on Wikipedia as "good fish". I have found factual errors in advanced topics there.
Break the problem up into sub-problems.
What's the value of a cupcake? A basic approach is "the average of the scores." A slightly more robust approach may be "the weighted average of the scores." But there may be complications beyond that... a cupcake with 3 goodness and 3 flavor may be 'better' than one with 5 flavor and 1 goodness, even if flavor and goodness have equal weight (IOW, a low score may have a disproportionate effect).
Make up some sample cupcake scores (specifics! Cover the normal scenarios and a couple weird ones), and estimate what you think a reasonable "overall" score would be if you had an ideal algorithm. Then, use that data to reverse engineer the algorithm.
For example, a cupcake with goodness 4, flavor 3, presentation 1 and moistness 4 might deserve a 4 overall, while one with goodness 4, flavor 2, presentation 5, and moistness 4 might only rate a 3.
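For example, here is a minimal sketch of one such reverse-engineered scoring function; the weights and the low-score penalty are arbitrary placeholders, meant to be tuned against the sample scores you made up above:
// Weighted average with a penalty when any single attribute is very low.
function cupcakeScore({ goodness, flavor, presentation, moistness }) {
  const weights = { goodness: 0.4, flavor: 0.3, presentation: 0.1, moistness: 0.2 };
  const ratings = { goodness, flavor, presentation, moistness };
  let score = 0;
  for (const [attr, w] of Object.entries(weights)) {
    score += w * ratings[attr];
  }
  // A single very low rating drags the overall score down disproportionately.
  if (Math.min(goodness, flavor, presentation, moistness) <= 1) {
    score -= 0.5;
  }
  // Tune the weights and penalty until the outputs match the overall scores you'd assign by hand.
  return score;
}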
Next, do the same thing for the bakery. Given a set of cupcakes with a range of scores, what would an appropriate rating be? Then, figure out the function that will give you that data.
The "goodness" ranking seems a bit odd, as it seems like it's a general rating, and so having it in there is already the overall score, so why calculate an overall score?
If you had time to work with this, I'd always suggest capturing the raw data, and using that as a basis to do more detailed analysis, but I don't think that's really relevant here.
Perhaps this is too general for you, but this type of problem can be approached using Conjoint Analysis. An R package for implementing this is bayesm.
If you can write SQL, you could make a little database and write some queries. It should not be that difficult.
e.g. select sum(score) / count(score) as finalscore, bakery, flavour from tables group by bakery, flavour
Imagine I have a situation where I need to index sentences. Let me explain it a little bit deeper.
For example I have these sentences:
The beautiful sky.
Beautiful sky dream.
Beautiful dream.
As far as I can imagine the index should look something like this:
(index diagram: http://img7.imageshack.us/img7/4029/indexarb.png)
But I would also like to be able to search by any of these words.
For example, if I search for "the", it should give me a connection to "beautiful".
If I search for "beautiful", it should give me connections to (previous) "The", and (next) "sky" and "dream". If I search for "sky", it should give a (previous) connection to "beautiful", and so on.
Any ideas? Maybe you already know an existing algorithm for this kind of problem?
Short Answer
Create a struct with two vectors of previous/forward links.
Then store the word structs in a hash table with the key as the word itself.
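Here is a rough sketch of that idea in JavaScript; the WordNode/addSentence names are made up, and a C++ version would use a struct with two vectors and an unordered_map, as described:
// Each word keeps lists of the nodes seen immediately before and after it.
class WordNode {
  constructor(term) {
    this.term = term;
    this.previous = [];
    this.next = [];
  }
}
const words = new Map(); // word -> WordNode
function addSentence(sentence) {
  const tokens = sentence.toLowerCase().replace(/[.!?]/g, '').split(/\s+/);
  let prev = null;
  for (const token of tokens) {
    if (!words.has(token)) words.set(token, new WordNode(token));
    const node = words.get(token);
    if (prev) {
      prev.next.push(node);
      node.previous.push(prev);
    }
    prev = node;
  }
}
addSentence('The beautiful sky.');
addSentence('Beautiful sky dream.');
addSentence('Beautiful dream.');
// words.get('beautiful').next -> nodes for "sky", "sky", "dream"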
Long Answer
This is a linguistic parsing problem that is not easily solved unless you don't mind gibberish.
I went to the park basketball court.
Would you park the car.
Your linking algorithm will create sentences like:
I went to the park the car.
Would you park basketball court.
I'm not quite sure of the SEO applications of this, but I would not welcome another gibberish spam site taking up a search result.
I imagine you would want some sort of Inverted index structure. You would have a Hashmap with the words as keys pointing to lists of pairs of the form (sentence_id, position). You would then store your sentences as arrays or linked lists. Your example would look like this:
sentence[0] = ['the','beautiful', 'sky'];
sentence[1] = ['beautiful','sky', 'dream'];
sentence[2] = ['beautiful', 'dream'];
inverted_index =
{
'the': {(0,0)},
'beautiful': {(0,1), (1,0), (2,0)},
'sky' : {(0,2),(1,1)},
'dream':{(1,2), (2,1)}
};
Using this structure lookups on words can be done in constant time. Having identified the word you want, finding the previous and subsequent word in a given sentence can also be done in constant time.
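For instance, if the index is stored with plain arrays (e.g. 'beautiful': [[0, 1], [1, 0], [2, 0]]), the neighbour lookup is just a sketch like this:
for (const [s, p] of inverted_index['beautiful']) {
  const sentenceWords = sentence[s];
  const prev = p > 0 ? sentenceWords[p - 1] : null;                      // previous word in that sentence, if any
  const next = p < sentenceWords.length - 1 ? sentenceWords[p + 1] : null; // next word, if any
  console.log(prev, next);
}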
Hope this helps.
You can try digging into Markov chains formed from the words of sentences. You'll also require a two-way chain (i.e. to find both next and previous words), i.e. store the probable words that appear just after a given word and just before it.
Of course, a Markov chain is a stochastic process to generate content; however, a similar approach may be used to store the information you need.
That looks like it could be stored in a very simple database with the following tables:
Words:
Id integer primary-key
Word varchar(20)
Following:
WordId1 integer foreign-key Words(Id) indexed
WordId2 integer foreign-key Words(Id) indexed
Then, whenever you parse a sentence, just insert the ones that aren't already there, as follows:
The beautiful sky.
Words (1,'the')
Words (2, 'beautiful')
Words (3, 'sky')
Following (1, 2)
Following (2, 3)
Beautiful sky dream.
Words (4, 'dream')
Following (3, 4)
Beautiful dream.
Following (2, 4)
Then you can query to your heart's content on which words follow or precede other words.
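For example, a sketch of such a query from Node using the better-sqlite3 package (assuming the Words/Following tables above live in a SQLite file; any SQL client would do):
const Database = require('better-sqlite3');
const db = new Database('words.db'); // hypothetical database file
// Words that immediately follow "beautiful" in any parsed sentence.
const followers = db.prepare(`
  SELECT w2.Word
  FROM Following f
  JOIN Words w1 ON w1.Id = f.WordId1
  JOIN Words w2 ON w2.Id = f.WordId2
  WHERE w1.Word = ?
`).all('beautiful');
console.log(followers); // e.g. [{ Word: 'sky' }, { Word: 'dream' }]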
This oughta get you close, in C#:
using System;
using System.Collections.Generic;
using System.Linq;

class Program
{
    public class Node
    {
        private string _term;
        // For each phrase this word appears in: the (previous, next) word nodes within that phrase.
        private Dictionary<string, KeyValuePair<Node, Node>> _related = new Dictionary<string, KeyValuePair<Node, Node>>();

        public Node(string term)
        {
            _term = term;
        }

        public void Add(string phrase, Node previous, string[] phraseRemainder, Dictionary<string, Node> existing)
        {
            Node next = null;
            if (phraseRemainder.Length > 0)
            {
                if (!existing.TryGetValue(phraseRemainder[0], out next))
                {
                    existing[phraseRemainder[0]] = next = new Node(phraseRemainder[0]);
                }
                next.Add(phrase, this, phraseRemainder.Skip(1).ToArray(), existing);
            }
            _related.Add(phrase, new KeyValuePair<Node, Node>(previous, next));
        }
    }

    static void Main(string[] args)
    {
        string[] sentences =
            new string[] {
                "The beautiful sky",
                "Beautiful sky dream",
                "beautiful dream"
            };
        Dictionary<string, Node> parsedSentences = new Dictionary<string, Node>();
        // Build the chain of nodes for each sentence, starting from its first word.
        foreach (string sentence in sentences)
        {
            string[] words = sentence.ToLowerInvariant().Split(' ');
            Node startNode;
            if (!parsedSentences.TryGetValue(words[0], out startNode))
            {
                parsedSentences[words[0]] = startNode = new Node(words[0]);
            }
            if (words.Length > 1)
                startNode.Add(sentence, null, words.Skip(1).ToArray(), parsedSentences);
        }
    }
}
I took the liberty of assuming you wanted to preserve the actual initial phrase. At the end of this, you'll have a list of words in the phrases, and in each one, a list of phrases that use that word, with references to the next and previous words in each phrase.
Using an associative array will allow you to quickly parse sentences in Perl. It is much faster than you would anticipate, and it can easily be dumped out into a tree-like structure for subsequent usage by a higher-level language.
Tree search algorithms (like BST, etc.)
I am doing a CSV Import tool for the project I'm working on.
The client needs to be able to enter the data in excel, export them as CSV and upload them to the database.
For example I have this CSV record:
1, John Doe, ACME Comapny (the typo is on purpose)
Of course, the companies are kept in a separate table and linked with a foreign key, so I need to discover the correct company ID before inserting.
I plan to do this by comparing the company names in the database with the company names in the CSV.
The comparison should return 0 if the strings are exactly the same, and return some value that gets bigger as the strings get more different, but strcmp doesn't cut it here because:
"Acme Company" and "Acme Comapny" should have a very small difference index, but
"Acme Company" and "Cmea Mpnyaco" should have a very big difference index
Or "Acme Company" and "Acme Comp." should also have a small difference index, even though the character count is different.
Also, "Acme Company" and "Company Acme" should return 0.
So if the client makes a typo while entering data, I could prompt them to choose the name they most probably wanted to insert.
Is there a known algorithm to do this, or maybe we can invent one? :)
You might want to check out the Levenshtein Distance algorithm as a starting point. It will rate the "distance" between two words.
This SO thread on implementing a Google-style "Do you mean...?" system may provide some ideas as well.
I don't know what language you're coding in, but if it's PHP, you should consider the following algorithms:
levenshtein(): Returns the minimal number of characters you have to replace, insert or delete to transform one string into another.
soundex(): Returns the four-character soundex key of a word, which should be the same as the key for any similar-sounding word.
metaphone(): Similar to soundex, and possibly more effective for you. It's more accurate than soundex() as it knows the basic rules of English pronunciation. The metaphone generated keys are of variable length.
similar_text(): Similar to levenshtein(), but it can return a percent value instead.
I've had some success with the Levenshtein distance algorithm; there is also Soundex.
What language are you implementing this in? We may be able to point to specific examples.
I have actually implemented a similar system. I used the Levenshtein distance (as other posters already suggested), with some modifications. The problem with unmodified edit distance (applied to whole strings) is that it is sensitive to word reordering, so "Acme Digital Incorporated World Company" will match poorly against "Digital Incorporated World Company Acme" and such reorderings were quite common in my data.
I modified it so that if the edit distance of whole strings was too big, the algorithm fell back to matching words against each other to find a good word-to-word match (quadratic cost, but there was a cutoff if there were too many words, so it worked OK).
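A rough sketch of that fallback idea (not the poster's actual code; the thresholds and the plain dynamic-programming levenshtein helper are just illustrative):
function levenshtein(a, b) {
  const dp = Array.from({ length: a.length + 1 }, (_, i) => [i, ...Array(b.length).fill(0)]);
  for (let j = 0; j <= b.length; j++) dp[0][j] = j;
  for (let i = 1; i <= a.length; i++) {
    for (let j = 1; j <= b.length; j++) {
      dp[i][j] = Math.min(
        dp[i - 1][j] + 1,                                  // deletion
        dp[i][j - 1] + 1,                                  // insertion
        dp[i - 1][j - 1] + (a[i - 1] === b[j - 1] ? 0 : 1) // substitution
      );
    }
  }
  return dp[a.length][b.length];
}
function fuzzyDistance(s1, s2, wholeStringCutoff = 10, maxWords = 20) {
  const whole = levenshtein(s1.toLowerCase(), s2.toLowerCase());
  if (whole <= wholeStringCutoff) return whole;
  const w1 = s1.toLowerCase().split(/\s+/);
  const w2 = s2.toLowerCase().split(/\s+/);
  if (w1.length > maxWords || w2.length > maxWords) return whole; // cutoff for the quadratic fallback
  // Fall back to word-to-word matching: each word of s1 pairs with its closest word in s2.
  let total = 0;
  for (const a of w1) total += Math.min(...w2.map(b => levenshtein(a, b)));
  return Math.min(whole, total);
}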
I've taken SoundEx, Levenshtein, PHP similarity, and double metaphone and packaged them up in C# in one set of extension methods on String.
Entire blog post here.
There are multiple algorithms to do just that, and most databases even include one by default. It is actually quite a common concern.
If it's just about English words, SQL Server for example includes SOUNDEX, which can be used to compare on the resulting sound of the word.
http://msdn.microsoft.com/en-us/library/aa259235%28SQL.80%29.aspx
I'm implementing it in PHP, and I am now writing a piece of code that will break up the 2 strings into words, compare each word from the first string with the words of the second string using levenshtein, and take the lowest possible values. I'll post it when I'm done.
Thanks a lot.
Update: Here's what I've come up with:
function myLevenshtein( $str1, $str2 )
{
    // prepare the words
    $words1 = explode( " ", preg_replace( "/\s+/", " ", trim($str1) ) );
    $words2 = explode( " ", preg_replace( "/\s+/", " ", trim($str2) ) );

    $found = array(); // array that keeps the best matched words so we don't check them again
    $score = 0;       // total score

    // In my case, strings that have a different number of words can be good matches too
    // For example, Acme Company and International Acme Company Ltd. are the same thing
    // I will just add the word count difference to the total score, and weigh it more later if needed
    $wordDiff = count( $words1 ) - count( $words2 );

    foreach( $words1 as $word1 )
    {
        $minlevWord = "";
        $minlev = 1000;
        $return = 0;

        foreach( $words2 as $word2 )
        {
            $return = 1;

            if( in_array( $word2, $found ) )
                continue;

            $lev = levenshtein( $word1, $word2 );

            if( $lev < $minlev )
            {
                $minlev = $lev;
                $minlevWord = $word2;
            }
        }

        if( !$return )
            break;

        $score += $minlev;
        array_push( $found, $minlevWord );
    }

    return $score + $wordDiff;
}