Cake Comparison Algorithm

This is literally about comparing cakes. My friend is having a cupcake party with the goal of determining the best cupcakery in Manhattan. Actually, it's much more ambitious than that. Read on.
There are 27 bakeries, and 19 people attending (with maybe one or two no-shows). There will be 4 cupcakes from each bakery, if possible including the staples -- vanilla, chocolate, and red velvet -- and rounding out the 4 with wildcard flavors. There are 4 attributes on which to rate the cupcakes: flavor, moistness, presentation (prettiness), and general goodness. People will provide ratings on a 5-point scale for each attribute for each cupcake they sample. Finally, each cupcake can be cut into 4 or 5 pieces.
The question is: what is a procedure for coming up with a statistically meaningful ranking of the bakeries for each attribute, and for each flavor (treating "wildcard" as a flavor)? Specifically, we want to rank the bakeries 8 times: for each flavor we want to rank the bakeries by goodness (goodness being one of the attributes), and for each attribute we want to rank the bakeries across all flavors (i.e., independent of flavor, aggregating over all flavors). The grand prize goes to the top-ranked bakery for the goodness attribute.
Bonus points for generalizing this, of course.
This is happening in about 12 hours so I'll post as an answer what we ended up doing if no one answers in the meantime.
PS: Here's the post-party blog post about it: http://gracenotesnyc.com/2009/08/05/gracenotes-nycs-cupcake-cagematch-the-sweetest-battle-ever/

Here's what we ended up doing. I made a huge table to collect everyone's ratings at http://etherpad.com/sugarorgy (Revision 25, just in case it gets vandalized now that I'm posting this public link to it) and then used the following Perl script to parse the data into a CSV file:
#!/usr/bin/env perl
# Grabs the cupcake data from etherpad and parses it into a CSV file.
use LWP::Simple qw(get);

$content = get("http://etherpad.com/ep/pad/export/sugarorgy/latest?format=txt");
$content =~ s/^.*BEGIN_MAGIC\s*//s;  # keep only the table between the magic markers
$content =~ s/END_MAGIC.*$//s;
$bakery = "none";
for $line (split('\n', $content)) {
    next if $line =~ /sar kri and deb/;  # skip the header line
    if ($line =~ s/bakery\s+(\w+)//) { $bakery = $1; }
    $line =~ s/\([^\)]*\)//g;        # strip out stuff in parens
    $line =~ s/^\s+(\w)(\w)/$1 $2/;  # split the two-letter flavor/attribute code into two fields
    $line =~ s/\-/\-1/g;             # missing ratings ("-") become -1
    $line =~ s/^\s+//;               # trim leading and trailing whitespace
    $line =~ s/\s+$//;
    $line =~ s/\s+/\,/g;             # commas between fields
    print "$bakery,$line\n";
}
Then I did the averaging and whatnot in Mathematica:
data = Import["!~/svn/sugar.pl", "CSV"];
(* return a bakery's list of ratings for the given type of cupcake *)
tratings[bak_, t_] := Select[Drop[First@Select[data,
    #[[1]]==bak && #[[2]]==t && #[[3]]=="g" &], 3], #!=-1&]
(* return a bakery's list of ratings for the given cupcake attribute *)
aratings[bak_, a_] := Select[Flatten[Drop[#,3]& /@
    Select[data, #[[1]]==bak && #[[3]]==a&]], #!=-1&]
(* overall rating for a bakery *)
oratings[bak_] := Join @@ (tratings[bak, #]& /@ {"V", "C", "R", "W"})
bakeries = Union@data[[All, 1]]
SortBy[{#, oratings@#, Round[Mean@oratings[#], .01]}& /@ bakeries, -#[[3]]&]
The results are at the bottom of http://etherpad.com/sugarorgy.

Perhaps reading about voting systems will be helpful. PS: don't take whatever is written on Wikipedia as gospel; I have found factual errors in advanced topics there.

Break the problem up into sub-problems.
What's the value of a cupcake? A basic approach is "the average of the scores." A slightly more robust approach may be "the weighted average of the scores." But there may be complications beyond that... a cupcake with 3 goodness and 3 flavor may be 'better' than one with 5 flavor and 1 goodness, even if flavor and goodness have equal weight (IOW, a low score may have a disproportionate effect).
Make up some sample cupcake scores (specifics! Cover the normal scenarios and a couple weird ones), and estimate what you think a reasonable "overall" score would be if you had an ideal algorithm. Then, use that data to reverse engineer the algorithm.
For example, a cupcake with goodness 4, flavor 3, presentation 1 and moistness 4 might deserve a 4 overall, while one with goodness 4, flavor 2, presentation 5, and moistness 4 might only rate a 3.
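For instance, here is a minimal sketch in Python of the kind of scoring function you might reverse engineer (the weights and the low-score penalty are made-up placeholders, to be tuned against your hand-estimated examples):

# Hypothetical scoring sketch: a weighted average plus an extra penalty for
# very low attributes, so one bad score drags the total down disproportionately.
WEIGHTS = {"goodness": 0.4, "flavor": 0.3, "moistness": 0.2, "presentation": 0.1}

def cupcake_score(ratings):
    # ratings: dict mapping attribute name -> rating on the 1-5 scale
    base = sum(WEIGHTS[attr] * ratings[attr] for attr in WEIGHTS)
    penalty = sum(0.5 for r in ratings.values() if r <= 1)  # tune against your examples
    return base - penalty

print(cupcake_score({"goodness": 4, "flavor": 3, "presentation": 1, "moistness": 4}))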
Next, do the same thing for the bakery. Given a set of cupcakes with a range of scores, what would an appropriate rating be? Then, figure out the function that will give you that data.
The "goodness" ranking seems a bit odd, as it seems like it's a general rating, and so having it in there is already the overall score, so why calculate an overall score?
If you had time to work with this, I'd always suggest capturing the raw data, and using that as a basis to do more detailed analysis, but I don't think that's really relevant here.

Perhaps this is too general for you, but this type of problem can be approached using Conjoint Analysis. An R package for implementing this is bayesm.

If you can write SQL, you could make a little database and write some queries. It should not be that difficult.
e.g. select sum(score) / count(score) as finalscore, bakery, flavour from ratings group by bakery, flavour

Related

bowed string (e.g. violin) synthesis algorithm

Is there a well known algorithm for synthesising bowed string instruments (e.g. violins)?
I know that for plucked strings (e.g. guitars) there's the Karplus-Strong algorithm, which I have successfully implemented in the past.
Ideally I would like an algorithm describing a computer program for generating/synthesizing the digital signal.
For example, the Karplus-Strong algorithm can be summarized as follows (see the sketch after these steps):
Determine the period length of the frequency you want to synthesize and create a buffer of exactly that size
Fill the buffer with random numbers (white noise)
Iterate over the buffer, each time averaging each point with the next point and outputting the result to the output stream.
Repeat for the desired amount of time while applying some damping
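For concreteness, here is a minimal Python sketch of those four steps (the damping factor of 0.996 is an arbitrary choice, and a real implementation would stream the samples to an audio device rather than collect them in a list):

import random

def karplus_strong(freq, duration, sample_rate=44100):
    # delay line of one period, filled with white noise
    period = int(sample_rate / freq)
    buf = [random.uniform(-1.0, 1.0) for _ in range(period)]
    out = []
    for _ in range(int(duration * sample_rate)):
        out.append(buf[0])
        # average each point with the next one, apply damping, feed back
        buf.append(0.996 * 0.5 * (buf[0] + buf[1]))
        buf.pop(0)
    return out

samples = karplus_strong(440, 1.0)  # one second of A4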
I wonder if something similar exists for bowed strings.
Footnote:
Now, I know nothing about the physics of how strings produce sound, so I have no idea how one would derive such an algorithm. For the Karplus-Strong algorithm, I simply read it in the original paper and applied it "blindly". I would have never guessed that starting with white noise and continuously damping it would produce a sound so similar to a plucked string.
EDIT:
As usual, the close parade has started.
Before voting to close this question, please consider the following:
This question is not about physics. It's not about the mechanics of the string vibration or interaction with the bow and air to produce the sound.
This question is about the existence of a specific well known algorithm to synthesize the sound. It's strictly a question about programming.
Weirdly enough, I was able to find some material on this on the Stanford ChucK website.
The code is written in a language called ChucK, which is apparently specific to audio programming, so you will have to install ChucK to run this code snippet. But here is its implementation in ChucK:
// patch
Bowed bow => dac;

// scale
[0, 2, 4, 7, 8, 11] @=> int scale[];

// infinite time loop
while( true )
{
    // set
    Math.random2f( 0, 1 ) => bow.bowPressure;
    Math.random2f( 0, 1 ) => bow.bowPosition;
    Math.random2f( 0, 12 ) => bow.vibratoFreq;
    Math.random2f( 0, 1 ) => bow.vibratoGain;
    Math.random2f( 0, 1 ) => bow.volume;

    // print
    <<< "---", "" >>>;
    <<< "bow pressure:", bow.bowPressure() >>>;
    <<< "bow position:", bow.bowPosition() >>>;
    <<< "vibrato freq:", bow.vibratoFreq() >>>;
    <<< "vibrato gain:", bow.vibratoGain() >>>;
    <<< "volume:", bow.volume() >>>;

    // set freq
    scale[Math.random2(0,scale.size()-1)] + 57 => Std.mtof => bow.freq;

    // go
    .8 => bow.noteOn;

    // advance time
    Math.random2f(.8, 2)::second => now;
}
Edit: The above is just the implementation, the source file for it is here.
Not an algorithm, but there's an open source library (under a very liberal license) that implements synthesis algorithms in C++ for several instruments, including bowed strings.
The Synthesis ToolKit (STK)
Official homepage: https://ccrma.stanford.edu/software/stk/
Github link: https://github.com/thestk/stk
Files with code relevant to synthesis of bowed string instruments:
include/Bowed.h
src/Bowed.cpp
include/BowTable.h
The comments in the code make references to two papers:
Efficient Simulation of the Reed-Bore and Bow-String Mechanisms
by Julius Smith (1986) (PDF)
On the Fundamentals of Bowed String Dynamics by Michael McIntyre & Jim Woodhouse (1979) (PDF)
Julius Smith also has information about bowed string synthesis available on his (Stanford) website:
Bowed Strings section of the "Physical Audio Signal Processing" book
MUS420 Lecture
Digital Waveguide Modeling of Bowed Strings

How to Connect Logic with Objects

I have a system that contains some number of strings. These strings are shown in a UI based on some logic. For example, string number 1 should only show if the current time is past midday, and string 3 only shows if a randomly generated number between 0 and 1 is less than 0.5.
What would be the best way to model this?
Should the logic just be in code and be linked to a string by some sort of ID?
Should the logic somehow be stored with the strings?
NOTE: The above is a theoretical example, before people start questioning my logic.
It's usually better to keep resources (such as strings) separate from logic. So referring strings by IDs is a good idea.
It seems that you have a bunch of rules which you have to link to the display of strings. I'd keep all three as separate entities: rules, strings, and the linking between them.
An illustration in Python, necessarily simplified:
STRINGS = {
    'morning': 'Good morning',
    'afternoon': 'Good afternoon',
    'luck': 'you must be lucky today',
}

# predicates
import datetime, random

def showMorning():
    return datetime.datetime.now().hour < 12

def showAfternoon():
    return datetime.datetime.now().hour >= 12

def showLuck():
    return random.random() > 0.5

# interconnection
RULES = {
    'morning': showMorning,
    'afternoon': showAfternoon,
    'luck': showLuck,
}

# usage
for string_id, predicate in RULES.items():
    if predicate():
        print(STRINGS[string_id])

Algorithm to create unique random concatenation of items

I'm thinking about an algorithm that will create the X most unique concatenations of Y parts, where each part can be one of several items. For example, 3 parts:
part #1: 0,1,2
part #2: a,b,c
part #3: x,y,z
And one possible (random) result of 5 concatenations:
0ax
1by
2cz
0bz (note that '0by' would be "less unique" than '0bz' because 'by' has already appeared)
2ay (note that 'a' hasn't followed '2' yet, and 'y' hasn't followed 'a' yet)
A simple BAD result for the next concatenation:
1cy ('c' hasn't been after '1' and 'y' hasn't been after 'c', BUT the '1'-'y' first-last pair has already appeared)
Simple GOOD next results would be:
0cy ('c' hasn't been after '0', 'y' hasn't been after 'c', and the '0'-'y' first-last pair hasn't appeared)
1az
1cx
I know that this limits the possible results, but when all fully unique possibilities are exhausted, the algorithm should continue and try to keep as much uniqueness as possible (repeating as little as possible).
Consider a real example:
Boy/Girl/Martin
bought/stole/get
bottle/milk/water
And I want results like:
Boy get milk
Martin stole bottle
Girl bought water
Boy bought bottle (not water, because 'bought+water' has already appeared, and not milk, because of 'Boy+milk')
Maybe start with a tree of all combinations, but how do I select the most unique ones first?
Edit: From this sample data we can see that creating fully unique results for 4 parts with 3 possibilities each gives us only 3 results:
Martin stole a bottle
Boy bought an milk
He get hard water
But more results can be requested, so the 4th result should keep as much uniqueness as is still available, like Martin bought hard milk, not Martin stole a water.
Edit: A possible start toward a solution?
Imagine each part as a barrel which can be rotated: the last item becomes the first when it rotates down, and the first becomes the last when it rotates up. Now, set the barrels up like this:
Martin|stole |a |bottle
Boy |bought|an |milk
He |get |hard|water
Now, write down the sentences as we see them, then rotate the first barrel up once, the second twice, the third three times, and so on. We get these sentences (note that the third barrel did one full rotation):
Boy |get |a |milk
He |stole |an |water
Martin|bought|hard|bottle
And we get the next solutions. We can run the process once more to get even more:
He |bought|a |water
Martin|get |an |bottle
Boy |stole |hard|milk
The problem is that the first barrel stays connected to the last one, because they rotate in parallel.
I'm wondering if the results would be more unique if I rotated the last barrel one more time in the last round (but then I introduce other connections, like 'an'+'water'; still, those would be repeated only 2 times, not 3 times like now). I don't know whether "barrels" are a good way of thinking about this.
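For what it's worth, the barrel rotation described above is easy to express in code. This little Python sketch reproduces exactly the sentences listed:

# Rotate the i-th "barrel" by (i+1) steps per round, as described above.
parts = [["Martin", "Boy", "He"],
         ["stole", "bought", "get"],
         ["a", "an", "hard"],
         ["bottle", "milk", "water"]]

def rotate(col, k):
    k %= len(col)
    return col[k:] + col[:k]

for step in range(3):  # three rounds of rotation
    cols = [rotate(col, (i + 1) * step) for i, col in enumerate(parts)]
    for row in zip(*cols):
        print(" ".join(row))
    print()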
I think we should first find a definition of uniqueness.
For example, what causes uniqueness to drop? Using a word that was already used? Is repeating 2 words close to each other less unique than repeating a word with a gap of other words in between? This problem can be somewhat subjective.
But I think that over a long run of sequences, each word should be used a similar number of times (like selecting a word randomly and removing it from the set, and once all words have been used, refreshing all the options so they can be drawn again) - this is easy to do.
But even if we use each word a similar number of times, we should do something to avoid repeating connections between words. I think it is more unique to repeat words far from each other rather than next to each other.
Anytime you need a new concatenation, just generate a completely random one, calculate its fitness, and then either accept or reject that concatenation (probabilistically, that is).
const C = 1.0

function CreateGoodConcatenation()
{
    for (rejectionCount = 0; ; rejectionCount++)
    {
        candidate = CreateRandomConcatenation()
        fitness = CalculateFitness(candidate) // returns 0 < fitness <= 1
        r = GetRand(zero to one)
        adjusted_r = Math.pow(r, C * rejectionCount + 1) // bias toward acceptance as rejectionCount increases
        if (adjusted_r < fitness)
        {
            return candidate
        }
    }
}
CalculateFitness should never return zero. If it does, you might find yourself in an infinite loop.
As you increase C, less ideal concatenations are accepted more readily.
As you decrease C, you face increased iterations for each call to CreateGoodConcatenation (plus less entropy in the result)
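Here is one way the pseudocode above might look as runnable Python, with a made-up fitness function that penalizes adjacent pairs and first-last pairs that have already been used (all names and constants here are illustrative):

import random

PARTS = [["0", "1", "2"], ["a", "b", "c"], ["x", "y", "z"]]
seen_pairs = set()  # adjacent and first-last pairs used so far
C = 1.0

def pairs_of(cand):
    ps = [(i, cand[i], cand[i + 1]) for i in range(len(cand) - 1)]
    ps.append(("ends", cand[0], cand[-1]))
    return ps

def calculate_fitness(cand):
    # 1.0 if nothing repeats; shrinks with each repeated pair, never reaches zero
    repeats = sum(p in seen_pairs for p in pairs_of(cand))
    return 1.0 / (1 + repeats)

def create_good_concatenation():
    rejection_count = 0
    while True:
        candidate = [random.choice(opts) for opts in PARTS]
        adjusted_r = random.random() ** (C * rejection_count + 1)
        if adjusted_r < calculate_fitness(candidate):
            seen_pairs.update(pairs_of(candidate))
            return "".join(candidate)
        rejection_count += 1

for _ in range(5):
    print(create_good_concatenation())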

Ruby, Count syllables

I am using Ruby to calculate the Gunning Fog Index of some content that I have. I can successfully implement the algorithm described here:
Gunning Fog Index
I am using the below method to count the number of syllables in each word:
Tokenizer = /([aeiouy]{1,3})/

def count_syllables(word)
  len = 0
  if word[-3..-1] == 'ing' then
    len += 1
    word = word[0...-3]
  end
  got = word.scan(Tokenizer)
  len += got.size()
  if got.size() > 1 and got[-1] == ['e'] and
     word[-1].chr() == 'e' and
     word[-2].chr() != 'l' then
    len -= 1
  end
  return len
end
It sometimes picks up words with only 2 syllables as having 3 syllables. Can anyone give any advice or is aware of a better method?
text = "The word logorrhoea is often used pejoratively to describe prose that is highly abstract and contains little concrete language. Since abstract writing is hard to visualize, it often seems as though it makes no sense and all the words are excessive. Writers in academic fields that concern themselves mostly with the abstract, such as philosophy and especially postmodernism, often fail to include extensive concrete examples of their ideas, and so a superficial examination of their work might lead one to believe that it is all nonsense."
# used to get rid of any punctuation
text = text.gsub(/\W+/, ' ')  # gsub, not gsub!: gsub! returns nil when nothing changes
word_array = text.split(' ')
word_array.each do |word|
  puts word if count_syllables(word) > 2
end
"themselves" is being counted as 3 but it's only 2
The function I gave you before is based upon these simple rules outlined here:
Each vowel (a, e, i, o, u, y) in a word counts as one syllable, subject to the following sub-rules:
Ignore final -ES, -ED, -E (except for -LE)
Words of three letters or less count as one syllable
Consecutive vowels count as one syllable
Here's the code:
def new_count(word)
  word.downcase!
  return 1 if word.length <= 3
  word.sub!(/(?:[^laeiouy]es|ed|[^laeiouy]e)$/, '')
  word.sub!(/^y/, '')
  word.scan(/[aeiouy]{1,2}/).size
end
Obviously, this isn't perfect either, but all you'll ever get with something like this is a heuristic.
EDIT:
I changed the code slightly to handle a leading 'y' and fixed the regex to handle 'les' endings better (such as in "candles").
Here's a comparison using the text in the question:
# used to get rid of any punctuation
text = text.gsub(/\W+/, ' ')
words = text.split(' ')
words.each do |word|
  old = count_syllables(word.dup)
  new = new_count(word.dup)
  puts "#{word}: \t#{old}\t#{new}" if old != new
end
The output is:
logorrhoea: 3 4
used: 2 1
makes: 2 1
themselves: 3 2
So it appears to be an improvement.
One thing you ought to do is teach your algorithm about diphthongs. If I'm reading your code correctly, it would incorrectly flag "aid" as having two syllables.
You can also add "es" and the like to your special-case endings (you already have "ing") and just not count it as a syllable, but that might still result in some miscounts.
Finally, for best accuracy, you should convert your input to a spelling scheme or alphabet that has a definite relationship to the word's pronunciation. With your "themselves" example, the algorithm has no reliable way to know that the "e" in "-ves" is dropped. However, if you respelled it as "themselvz", or taught the algorithm the IPA and fed it [ðəmsɛlvz], it becomes very clear that the word is only pronounced with two syllables. That, of course, assumes you have control over the input, and is probably more work than just counting the syllables yourself.
To begin with, it seems you should decrement len for the suffixes that should be excluded:
len -= 1 if /(?:ing|es|ed)$/.match(word)
You could also check out Lingua::EN::Readability.
It can also calculate several readability measures, such as a Fog Index and a Flesch-Kincaid level.
PS. I think I know where you got the function from. DS.
There is also a rubygem called Odyssey that calculates Gunning Fog, along with some of the other popular ones (Flesch-Kincaid, SMOG, etc.)

Word comparison algorithm

I am doing a CSV Import tool for the project I'm working on.
The client needs to be able to enter the data in Excel, export it as CSV, and upload it to the database.
For example I have this CSV record:
1, John Doe, ACME Comapny (the typo is on purpose)
Of course, the companies are kept in a separate table and linked with a foreign key, so I need to discover the correct company ID before inserting.
I plan to do this by comparing the company names in the database with the company names in the CSV.
The comparison should return 0 if the strings are exactly the same, and return some value that gets bigger as the strings get more different, but strcmp doesn't cut it here because:
"Acme Company" and "Acme Comapny" should have a very small difference index, but
"Acme Company" and "Cmea Mpnyaco" should have a very big difference index
Or "Acme Company" and "Acme Comp." should also have a small difference index, even though the character count is different.
Also, "Acme Company" and "Company Acme" should return 0.
So if the client makes a typo while entering data, I could prompt him to choose the name he most probably wanted to insert.
Is there a known algorithm to do this, or maybe we can invent one? :)
You might want to check out the Levenshtein Distance algorithm as a starting point. It will rate the "distance" between two words.
This SO thread on implementing a Google-style "Do you mean...?" system may provide some ideas as well.
I don't know what language you're coding in, but if it's PHP, you should consider the following algorithms (a quick sketch of the first one follows the list):
levenshtein(): Returns the minimal number of characters you have to replace, insert or delete to transform one string into another.
soundex(): Returns the four-character soundex key of a word, which should be the same as the key for any similar-sounding word.
metaphone(): Similar to soundex, and possibly more effective for you. It's more accurate than soundex() as it knows the basic rules of English pronunciation. The metaphone generated keys are of variable length.
similar_text(): Similar to levenshtein(), but it can return a percent value instead.
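To make the distance idea concrete, here is a minimal pure-Python Levenshtein implementation (a standard dynamic-programming sketch; PHP's built-in levenshtein() does this natively):

def levenshtein(a, b):
    # classic dynamic programming over a (len(a)+1) x (len(b)+1) grid,
    # keeping only the previous row in memory
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

print(levenshtein("Acme Company", "Acme Comapny"))  # 2: a transposition costs two edits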
I've had some success with the Levenshtein Distance algorithm, there is also Soundex.
What language are you implementing this in? We may be able to point to specific examples.
I have actually implemented a similar system. I used the Levenshtein distance (as other posters already suggested), with some modifications. The problem with unmodified edit distance (applied to whole strings) is that it is sensitive to word reordering, so "Acme Digital Incorporated World Company" will match poorly against "Digital Incorporated World Company Acme" and such reorderings were quite common in my data.
I modified it so that if the edit distance of whole strings was too big, the algorithm fell back to matching words against each other to find a good word-to-word match (quadratic cost, but there was a cutoff if there were too many words, so it worked OK).
I've taken SoundEx, Levenshtein, PHP similarity, and double metaphone and packaged them up in C# in one set of extension methods on String.
Entire blog post here.
There's multiple algorithms to do just that, and most databases even include one by default. It is actually a quite common concern.
If it's just about English words, SQL Server for example includes SOUNDEX, which can be used to compare on the resulting sound of the word.
http://msdn.microsoft.com/en-us/library/aa259235%28SQL.80%29.aspx
I'm implementing it in PHP, and I am now writing a piece of code that will break up 2 strings into words, compare each of the words from the first string with the words of the second string using levenshtein, and accept the lowest possible values. I'll post it when I'm done.
Thanks a lot.
Update: Here's what I've come up with:
function myLevenshtein( $str1, $str2 )
{
    // prepare the words
    $words1 = explode( " ", preg_replace( "/\s+/", " ", trim($str1) ) );
    $words2 = explode( " ", preg_replace( "/\s+/", " ", trim($str2) ) );
    $found = array(); // array that keeps the best matched words so we don't check them again
    $score = 0;       // total score

    // In my case, strings that have a different number of words can be good matches too.
    // For example, Acme Company and International Acme Company Ltd. are the same thing.
    // I will just add the word count difference to the total score, and weigh it more later if needed.
    $wordDiff = count( $words1 ) - count( $words2 );

    foreach( $words1 as $word1 )
    {
        $minlevWord = "";
        $minlev = 1000;
        $return = 0;
        foreach( $words2 as $word2 )
        {
            $return = 1;
            if( in_array( $word2, $found ) )
                continue;
            $lev = levenshtein( $word1, $word2 );
            if( $lev < $minlev )
            {
                $minlev = $lev;
                $minlevWord = $word2;
            }
        }
        if( !$return )
            break;
        $score += $minlev;
        array_push( $found, $minlevWord );
    }
    return $score + $wordDiff;
}
