Word comparison algorithm - algorithm

I am doing a CSV Import tool for the project I'm working on.
The client needs to be able to enter the data in excel, export them as CSV and upload them to the database.
For example I have this CSV record:
1, John Doe, ACME Comapny (the typo is on purpose)
Of course, the companies are kept in a separate table and linked with a foreign key, so I need to discover the correct company ID before inserting.
I plan to do this by comparing the company names in the database with the company names in the CSV.
the comparison should return 0 if the strings are exactly the same, and return some value that gets bigger as the strings get more different, but strcmp doesn't cut it here because:
"Acme Company" and "Acme Comapny" should have a very small difference index, but
"Acme Company" and "Cmea Mpnyaco" should have a very big difference index
Or "Acme Company" and "Acme Comp." should also have a small difference index, even though the character count is different.
Also, "Acme Company" and "Company Acme" should return 0.
So if the client makes a type while entering data, i could prompt him to choose the name he most probably wanted to insert.
Is there a known algorithm to do this, or maybe we can invent one :)
?

You might want to check out the Levenshtein Distance algorithm as a starting point. It will rate the "distance" between two words.
This SO thread on implementing a Google-style "Do you mean...?" system may provide some ideas as well.

I don't know what language you're coding in, but if it's PHP, you should consider the following algorithms:
levenshtein(): Returns the minimal number of characters you have to replace, insert or delete to transform one string into another.
soundex(): Returns the four-character soundex key of a word, which should be the same as the key for any similar-sounding word.
metaphone(): Similar to soundex, and possibly more effective for you. It's more accurate than soundex() as it knows the basic rules of English pronunciation. The metaphone generated keys are of variable length.
similar_text(): Similar to levenshtein(), but it can return a percent value instead.

I've had some success with the Levenshtein Distance algorithm, there is also Soundex.
What language are you implementing this in? we may be able to point to specific examples

I have actually implemented a similar system. I used the Levenshtein distance (as other posters already suggested), with some modifications. The problem with unmodified edit distance (applied to whole strings) is that it is sensitive to word reordering, so "Acme Digital Incorporated World Company" will match poorly against "Digital Incorporated World Company Acme" and such reorderings were quite common in my data.
I modified it so that if the edit distance of whole strings was too big, the algorithm fell back to matching words against each other to find a good word-to-word match (quadratic cost, but there was a cutoff if there were too many words, so it worked OK).

I've taken SoundEx, Levenshtein, PHP similarity, and double metaphone and packaged them up in C# in one set of extension methods on String.
Entire blog post here.

There's multiple algorithms to do just that, and most databases even include one by default. It is actually a quite common concern.
If its just about English words, SQL Server for example includes SOUNDEX which can be used to compare on the resulting sound of the word.
http://msdn.microsoft.com/en-us/library/aa259235%28SQL.80%29.aspx

I'm implementing it in PHP, and I am now writing a piece of code that will break up 2 strings in words and compare each of the words from the first string with the words of the second string using levenshtein and accept the lowes possible values. Ill post it when I'm done.
Thanks a lot.
Update: Here's what I've come up with:
function myLevenshtein( $str1, $str2 )
{
// prepare the words
$words1 = explode( " ", preg_replace( "/\s+/", " ", trim($str1) ) );
$words2 = explode( " ", preg_replace( "/\s+/", " ", trim($str2) ) );
$found = array(); // array that keeps the best matched words so we don't check them again
$score = 0; // total score
// In my case, strings that have different amount of words can be good matches too
// For example, Acme Company and International Acme Company Ltd. are the same thing
// I will just add the wordcount differencre to the total score, and weigh it more later if needed
$wordDiff = count( $words1 ) - count( $words2 );
foreach( $words1 as $word1 )
{
$minlevWord = "";
$minlev = 1000;
$return = 0;
foreach( $words2 as $word2 )
{
$return = 1;
if( in_array( $word2, $found ) )
continue;
$lev = levenshtein( $word1, $word2 );
if( $lev < $minlev )
{
$minlev = $lev;
$minlevWord = $word2;
}
}
if( !$return )
break;
$score += $minlev;
array_push( $found, $minlevWord );
}
return $score + $wordDiff;
}

Related

Separate characters and numbers following specific rules

I am trying to distinguish flight numbers.
Example:
flightno = "FR556"
split_data = flightno.upcase.match(/([A-Za-z]+)(\d+)/)
first = split_data[1] # FR
second = split_data[1] # 556
I then go on to query the database to find an airline based on the FR in this example and apply some logic with the result which is Ryanair.
My problem is when the flight number might be:
flightno = "U21920"
split_data = flightno.upcase.match(/([A-Za-z]+)(\d+)/)
first = split_data[1] # U
second = split_data[1] # 21920
i basically want first to be U2 not just U. This is used to search the database of airlines by their IATA code in this case is U2
****EDIT**
In the interest of clarity i made some mistakes in terminology when asking my question. Due to the complexities of booking reference numbers, the input is taken from whatever the passenger provides. For an easyJet flight for example, the passenger may input EZY1920 or U21920 only the airline provides either so the passenger is ignorant really.
"EZY" = ICAO
"U2" = IATA
I take the input from the user and try to separate the ICAO or IATA from the flight number "1920" but there is no way of determining that without searching the database or separating the input which i feel is cumbersome from a user experience point of view.
Using a regex to separate characters from numbers works until the user inputs an IATA as part of their flight number (the passenger won't know the difference) and as you can see in the example above this confuses the regex.**
The trouble is i cant think of any other pattern with flight numbers. They always have at least two characters made up of just letters or a mixture of a letter and a number and can be 3 characters in length. The numbers part can be as short as 1 but can also be as long as 4 - always numbers.
****edit**
As has been mentioned in the comments, there is no fixed size however one thing that is always true (at least so far) is the first character will always be a letter regardless if it is ICAO or IATA.
After considering every bodies input so far i'm wondering if searching the database and returning airlines with an IATA or ICAO that matches the first two letters provided by the user (U2), (FR), (EZ) might be one way to go, however this is subject to obvious problems should an ICAO or IATA be released that matches another airline, for example "EZY" & "EZT". This is not future proof and i'm looking for better ruby or regex solutions.**
Appreciate your input.
EDIT
I have answered my own question below. While other answers provide a solution for handling some conditions they would fall down if the flight number began with a number so i worked out a crass but to date stable way to analyse the string for digits and then work out if it is an ICAO or IATA from that.
A solution I think of is that you match your given flight number against a complete list of ICAO/IATA codes: https://raw.githubusercontent.com/datasets/airport-codes/master/data/airport-codes.csv
Spending some time with google might give you a more appropriate list.
Then use the first three characters (if that is the maximum) of your flight number to find a match within the icao codes. If you find one, you will know where to seperate your string.
Here a minimal ugly example that should set you on a track. Feel free to update!
ICAOCODES = %w(FR DEU U21) # grab your data here
def retrieve_flight_information(flightnumber)
ICAOCODES.each do |icao|
co = flightnumber.match(icao).to_s
if co.length > 0
# airline
puts co
# flight number
puts flightnumber.gsub(co,'')
end
end
end
retrieve_flight_information("FR556")
#=> FR
#=> 556
retrieve_flight_information("U21214123")
#=> U21
#=> 214123
The biggest flaw lies in using .gsub() as it might mess up your flightnumber in case it looks like this: "FR21413FR2"
However you will find plenty of solutions to this problem on so.
As mentioned in the comments, a list of icao codes is not what you are looking for. But what is relevant here, is that you somehow need a list of strings that you can securely compare against.
I have a fairly crass solution that seems to be working in all scenarios i can throw at it to date. I wanted to make this available to anybody else that might find it useful?
The general rule of thumb for flight codes/numbers seems to be:
IATA: two characters made up of any combination letters and digits
ICAO: three characters made up of letters only (to date)
With that in mind we should be able to work out if we need to search the database by IATA or ICAO depending on the condition of the first three characters.
First we take the flight number and convert to uppercase
string = "U21920".upcase
Next we analyse the first three characters to check for any numbers.
first_three = string[0,3] # => U21
Is there a digit in first_three?
if first_three =~ /\d/ # => true
iata = first_three[0,2] # => If true lets get rid of the last character
# Now we go to the database searching IATA (U2)
search = Airline.where('iata LIKE ?', "#{iata}%") # => Starts with search, just in case
Otherwise if there isnt a digit found in the string
else
icao = string.match(/([A-Za-z]+)(\d+)/)
search = Airline.where('icao LIKE ?', "#{icao[1]}%")
This seems to work for the random flight numbers ive tested it with today from a few of the major airport live departure/arrival boards. Its an interesting problem because some airlines issue tickets with either an ICAO or IATA code as part of the flight number which means passengers won't know any different, not to mention, some airports provide flight information in their own format so assumign there isnt a change to the ICAO and IATA build then the above should work.
Here is an example script you can run
test.rb
puts "What is your flight number?"
string = gets.upcase
first_three = string[0,3]
puts "Taking first three from #{string} is #{first_three}"
if first_three =~ /\d/ # Calling String's =~ method.
puts "The String #{first_three} DOES have a number in it."
iata = first_three[0,2]
search = Airline.where('iata LIKE ?', "#{iata}%")
puts "Searching Airlines starting with IATA #{iata} = #{search.count}"
puts "Found #{search.first.name} from IATA #{iata}"
else
puts "The String #{first_three} does not have a number in it."
icao = string.match(/([A-Za-z]+)(\d+)/)
search = Airline.where('icao LIKE ?', "#{icao[1]}%")
puts "Searching Airlines starting with ICAO #{icao[1]} = #{search.count}"
puts "Found #{search.first.name} from IATA #{icao[1]}"
end
Airline
Airline(id: integer, name: string, iata: string, icao: string, created_at: datetime, updated_at: datetime )
stick this in your lib folder and run
rails runner lib/test.rb
Obviously you can remove all of the puts statements to get straight to the result. I'm using rails runner to include access to my Airline model when running the script.

How to Connect Logic with Objects

I have a system that contains x number of strings. These string are shown in a UI based on some logic. For example string number 1 should only show if the current time is past midday and string 3 only shows if a randomly generated number between 0-1 is less than 0.5.
How would be the best way to model this?
Should the logic just be in code and be linked to a string by some sort or ID?
Should the logic be some how stored with the strings?
NOTE The above is a theoretical example before people start questioning my logic.
It's usually better to keep resources (such as strings) separate from logic. So referring strings by IDs is a good idea.
It seems that you have a bunch of rules which you have to link to the display of strings. I'd keep all three as separate entities: rules, strings, and the linking between them.
An illustration in Python, necessarily simplified:
STRINGS = {
'morning': 'Good morning',
'afternoon': 'Good afternoon',
'luck': 'you must be lucky today',
}
# predicates
import datetime, random
def showMorning():
return datetime.datetime.now().hour < 12
def showAfternoon():
return datetime.datetime.now().hour >= 12
def showLuck():
return random.random() > 0.5
# interconnection
RULES = {
'morning': showMorning,
'afternoon': showAfternoon,
'luck': showLuck,
}
# usage
for string_id, predicate in RULES.items():
if predicate():
print STRINGS[string_id]

How to split a string into words. Ex: "stringintowords" -> "String Into Words"?

What is the right way to split a string into words ?
(string doesn't contain any spaces or punctuation marks)
For example: "stringintowords" -> "String Into Words"
Could you please advise what algorithm should be used here ?
! Update: For those who think this question is just for curiosity. This algorithm could be used to camеlcase domain names ("sportandfishing .com" -> "SportAndFishing .com") and this algo is currently used by aboutus dot org to do this conversion dynamically.
Let's assume that you have a function isWord(w), which checks if w is a word using a dictionary. Let's for simplicity also assume for now that you only want to know whether for some word w such a splitting is possible. This can be easily done with dynamic programming.
Let S[1..length(w)] be a table with Boolean entries. S[i] is true if the word w[1..i] can be split. Then set S[1] = isWord(w[1]) and for i=2 to length(w) calculate
S[i] = (isWord[w[1..i] or for any j in {2..i}: S[j-1] and isWord[j..i]).
This takes O(length(w)^2) time, if dictionary queries are constant time. To actually find the splitting, just store the winning split in each S[i] that is set to true. This can also be adapted to enumerate all solution by storing all such splits.
As mentioned by many people here, this is a standard, easy dynamic programming problem: the best solution is given by Falk Hüffner. Additional info though:
(a) you should consider implementing isWord with a trie, which will save you a lot of time if you use properly (that is by incrementally testing for words).
(b) typing "segmentation dynamic programming" yields a score of more detail answers, from university level lectures with pseudo-code algorithm, such as this lecture at Duke's (which even goes so far as to provide a simple probabilistic approach to deal with what to do when you have words that won't be contained in any dictionary).
There should be a fair bit in the academic literature on this. The key words you want to search for are word segmentation. This paper looks promising, for example.
In general, you'll probably want to learn about markov models and the viterbi algorithm. The latter is a dynamic programming algorithm that may allow you to find plausible segmentations for a string without exhaustively testing every possible segmentation. The essential insight here is that if you have n possible segmentations for the first m characters, and you only want to find the most likely segmentation, you don't need to evaluate every one of these against subsequent characters - you only need to continue evaluating the most likely one.
If you want to ensure that you get this right, you'll have to use a dictionary based approach and it'll be horrendously inefficient. You'll also have to expect to receive multiple results from your algorithm.
For example: windowsteamblog (of http://windowsteamblog.com/ fame)
windows team blog
window steam blog
Consider the sheer number of possible splittings for a given string. If you have n characters in the string, there are n-1 possible places to split. For example, for the string cat, you can split before the a and you can split before the t. This results in 4 possible splittings.
You could look at this problem as choosing where you need to split the string. You also need to choose how many splits there will be. So there are Sum(i = 0 to n - 1, n - 1 choose i) possible splittings. By the Binomial Coefficient Theorem, with x and y both being 1, this is equal to pow(2, n-1).
Granted, a lot of this computation rests on common subproblems, so Dynamic Programming might speed up your algorithm. Off the top of my head, computing a boolean matrix M such M[i,j] is true if and only if the substring of your given string from i to j is a word would help out quite a bit. You still have an exponential number of possible segmentations, but you would quickly be able to eliminate a segmentation if an early split did not form a word. A solution would then be a sequence of integers (i0, j0, i1, j1, ...) with the condition that j sub k = i sub (k + 1).
If your goal is correctly camel case URL's, I would sidestep the problem and go for something a little more direct: Get the homepage for the URL, remove any spaces and capitalization from the source HTML, and search for your string. If there is a match, find that section in the original HTML and return it. You'd need an array of NumSpaces that declares how much whitespace occurs in the original string like so:
Needle: isashort
Haystack: This is a short phrase
Preprocessed: thisisashortphrase
NumSpaces : 000011233333444444
And your answer would come from:
location = prepocessed.Search(Needle)
locationInOriginal = location + NumSpaces[location]
originalLength = Needle.length() + NumSpaces[location + needle.length()] - NumSpaces[location]
Haystack.substring(locationInOriginal, originalLength)
Of course, this would break if madduckets.com did not have "Mad Duckets" somewhere on the home page. Alas, that is the price you pay for avoiding an exponential problem.
This can be actually done (to a certain degree) without dictionary. Essentially, this is an unsupervised word segmentation problem. You need to collect a large list of domain names, apply an unsupervised segmentation learning algorithm (e.g. Morfessor) and apply the learned model for new domain names. I'm not sure how well it would work, though (but it would be interesting).
This is basically a variation of a knapsack problem, so what you need is a comprehensive list of words and any of the solutions covered in Wiki.
With fairly-sized dictionary this is going to be insanely resource-intensive and lengthy operation, and you cannot even be sure that this problem will be solved.
Create a list of possible words, sort it from long words to short words.
Check if each entry in the list against the first part of the string. If it equals, remove this and append it at your sentence with a space. Repeat this.
A simple Java solution which has O(n^2) running time.
public class Solution {
// should contain the list of all words, or you can use any other data structure (e.g. a Trie)
private HashSet<String> dictionary;
public String parse(String s) {
return parse(s, new HashMap<String, String>());
}
public String parse(String s, HashMap<String, String> map) {
if (map.containsKey(s)) {
return map.get(s);
}
if (dictionary.contains(s)) {
return s;
}
for (int left = 1; left < s.length(); left++) {
String leftSub = s.substring(0, left);
if (!dictionary.contains(leftSub)) {
continue;
}
String rightSub = s.substring(left);
String rightParsed = parse(rightSub, map);
if (rightParsed != null) {
String parsed = leftSub + " " + rightParsed;
map.put(s, parsed);
return parsed;
}
}
map.put(s, null);
return null;
}
}
I was looking at the problem and thought maybe I could share how I did it.
It's a little too hard to explain my algorithm in words so maybe I could share my optimized solution in pseudocode:
string mainword = "stringintowords";
array substrings = get_all_substrings(mainword);
/** this way, one does not check the dictionary to check for word validity
* on every substring; It would only be queried once and for all,
* eliminating multiple travels to the data storage
*/
string query = "select word from dictionary where word in " + substrings;
array validwords = execute(query).getArray();
validwords = validwords.sort(length, desc);
array segments = [];
while(mainword != ""){
for(x = 0; x < validwords.length; x++){
if(mainword.startswith(validwords[x])) {
segments.push(validwords[x]);
mainword = mainword.remove(v);
x = 0;
}
}
/**
* remove the first character if any of valid words do not match, then start again
* you may need to add the first character to the result if you want to
*/
mainword = mainword.substring(1);
}
string result = segments.join(" ");

codeigniter form validation with phone numbers

What's a good way to validate phone numbers being input in codeigniter?
It's my first time writing an app, and I don't really understand regex at all.
Is it easier to have three input fields for the phone number?
Here's a cool regex I found out on the web. It validates a number in almost any US format and converts it to (xxx) xxx-xxxx. I think it's great because then people can enter any 10 digit US phone number using whatever format they are used to using and you get a correctly formatted number out of it.
Here's the whole function you can drop into your MY_form_validation class. I wanted my form to allow empty fields so you'll have to alter it if you want to force a value.
function valid_phone_number_or_empty($value)
{
$value = trim($value);
if ($value == '') {
return TRUE;
}
else
{
if (preg_match('/^\(?[0-9]{3}\)?[-. ]?[0-9]{3}[-. ]?[0-9]{4}$/', $value))
{
return preg_replace('/^\(?([0-9]{3})\)?[-. ]?([0-9]{3})[-. ]?([0-9]{4})$/', '($1) $2-$3', $value);
}
else
{
return FALSE;
}
}
}
The best way is not to validate phone numbers at all, unless you're absolutley 100% positive that you're only dealing with phone numbers in the US or at least North America. As soon as you allow phone numbers from Europe I don't think there's a regex which covers all possibilities.
strip out the non-digits with this:
$justNumbers = preg_replace( '/\D/', $_POST[ 'phone_num' ] );
and see if there are enough characters!
$numRequired = 7; // make your constraints here
if( strlen( $justNumbers ) < $numRequired ){ /* ... naughty naughty ... */ }
this is loose, of course, but will work for international numbers as well (as all it's really doing is seeing if there are over 7 numbers in the input).
just change the number of required digits according to your specifications.
The regex solution as explained by Dan would be the way to do it - but you might want to re-think validating phone numbers at all. What if the user wants to add a note like he/she would do on paper? - I see many values entered in the phone number fields to be stuff like:
307-555-2323 (home)
or
307-555-3232 after 6pm
I think today we can assume the users knows what to enter.

Best algorithm to index sentences

Imagine I have a situation where I need to index sentences. Let me explain it a little bit deeper.
For example I have these sentences:
The beautiful sky.
Beautiful sky dream.
Beautiful dream.
As far as I can imagine the index should look something like this:
alt text http://img7.imageshack.us/img7/4029/indexarb.png
But also I would like to do search by any of these words.
For example, if I do search by "the" It should show give me connection to "beautiful".
if I do search by "beautiful" it should give me connections to (previous)"The", (next)"sky" and "dream". If I search by "sky" it should give (previous) connection to "beautiful" and etc...
Any Ideas ? Maybe you know already existing algorithm for this kind of problem ?
Short Answer
Create a struct with two vectors of previous/forward links.
Then store the word structs in a hash table with the key as the word itself.
Long Answer
This is a linguistic parsing problem that is not easily solved unless you don't mind gibberish.
I went to the park basketball court.
Would you park the car.
Your linking algorithm will create sentences like:
I went to the park the car.
Would you park basketball court.
I'm not quite sure of the SEO applications of this, but I would not welcome another gibberish spam site taking up a search result.
I imagine you would want some sort of Inverted index structure. You would have a Hashmap with the words as keys pointing to lists of pairs of the form (sentence_id, position). You would then store your sentences as arrays or linked lists. Your example would look like this:
sentence[0] = ['the','beautiful', 'sky'];
sentence[1] = ['beautiful','sky', 'dream'];
sentence[2] = ['beautiful', 'dream'];
inverted_index =
{
'the': {(0,0)},
'beautiful': {(0,1), (1,0), (2,0)},
'sky' : {(0,2),(1,1)},
'dream':{(1,2), (2,1)}
};
Using this structure lookups on words can be done in constant time. Having identified the word you want, finding the previous and subsequent word in a given sentence can also be done in constant time.
Hope this helps.
You can try and dig into Markov chains, formed from words of sentences. Also you'll require both-way chain (i.e. to find next and previous words), i.e. store probable words that appear just after the given or just before it.
Of course, Markov chain is a stochastic process to generate content, however similar approach may be used to store information you need.
That looks like it could be stored in a very simple database with the following tables:
Words:
Id integer primary-key
Word varchar(20)
Following:
WordId1 integer foreign-key Words(Id) indexed
WordId2 integer foreign-key Words(Id) indexed
Then, whenever you parse a sentence, just insert the ones that aren't already there, as follows:
The beautiful sky.
Words (1,'the')
Words (2, 'beautiful')
Words (3,, 'sky')
Following (1, 2)
Following (2, 3)
Beautiful sky dream.
Words (4, 'dream')
Following (3, 4)
Beautiful dream.
Following (2, 4)
Then you can query to your hearts content on what words follow or precede other words.
This oughta get you close, in C#:
class Program
{
public class Node
{
private string _term;
private Dictionary<string, KeyValuePair<Node, Node>> _related = new Dictionary<string, KeyValuePair<Node, Node>>();
public Node(string term)
{
_term = term;
}
public void Add(string phrase, Node previous, string [] phraseRemainder, Dictionary<string,Node> existing)
{
Node next= null;
if (phraseRemainder.Length > 0)
{
if (!existing.TryGetValue(phraseRemainder[0], out next))
{
existing[phraseRemainder[0]] = next = new Node(phraseRemainder[0]);
}
next.Add(phrase, this, phraseRemainder.Skip(1).ToArray(), existing);
}
_related.Add(phrase, new KeyValuePair<Node, Node>(previous, next));
}
}
static void Main(string[] args)
{
string [] sentences =
new string [] {
"The beautiful sky",
"Beautiful sky dream",
"beautiful dream"
};
Dictionary<string, Node> parsedSentences = new Dictionary<string,Node>();
foreach(string sentence in sentences)
{
string [] words = sentence.ToLowerInvariant().Split(' ');
Node startNode;
if (!parsedSentences.TryGetValue(words[0],out startNode))
{
parsedSentences[words[0]] = startNode = new Node(words[0]);
}
if (words.Length > 1)
startNode.Add(sentence,null,words.Skip(1).ToArray(),parsedSentences);
}
}
}
I took the liberty of assuming you wanted to preserve the actual initial phrase. At the end of this, you'll have a list of words in the phrases, and in each one, a list of phrases that use that word, with references to the next and previous words in each phrase.
Using an associative array will allow you to quickly parse sentences in Perl. It is much faster than you would anticipate and it can be effectively dumped out in a tree like structure for subsequent usage by a higher level language.
Tree Search Algorithms (like BST, ect)

Resources