Competitive Programming: Generating test cases and validating program correctness

I have been doing sport programming for a while and am still improving day by day. But one thing I have always wondered is whether it would be possible to automate the test-case generation process and the cross-validation of my program. It would have to be somewhat of a brute-force approach, since some test cases are algorithm specific.
A Google search gives me a nice link on Quora: How do programming contest problem setters make test cases? It also mentions the popular testlib used by problem setters.
But isn't this a chicken-and-egg problem?
Assume I generated a million input test cases: what would I check them against? How would I generate the outputs? I am still in the process of validating the program... If my script could generate the correct outputs as well, then what's the point of writing the program in the first place? I could submit the script itself. Also, it's not possible to write a million outputs for generated test cases manually. Can anyone please clear up this confusion?
I hope I have described the problem clearly.

It's common to generate the answers with a slow but obviously correct solution (like an exhaustive search). It can't be used as the main solution because it's too slow for large test cases, but you can use it to check the output of your fast (but possibly incorrect) program.
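As a rough sketch of what that looks like in practice (Python; the maximum-subarray problem, solve_fast, and solve_brute are all just hypothetical stand-ins for your own problem and solutions):

# stress.py - minimal stress-testing sketch (hypothetical problem and solutions)
import random

def solve_brute(arr):
    # obviously correct but slow: best sum over all contiguous subarrays, O(n^2)
    return max(sum(arr[i:j]) for i in range(len(arr)) for j in range(i + 1, len(arr) + 1))

def solve_fast(arr):
    # the solution under test (Kadane's algorithm, used here only as an example)
    best = cur = arr[0]
    for x in arr[1:]:
        cur = max(x, cur + x)
        best = max(best, cur)
    return best

for case in range(10000):
    n = random.randint(1, 8)                        # keep cases small so brute force stays fast
    arr = [random.randint(-10, 10) for _ in range(n)]
    expected, got = solve_brute(arr), solve_fast(arr)
    if expected != got:
        print("MISMATCH on", arr, "expected", expected, "got", got)
        break
else:
    print("all random cases passed")

The loop stops at the first counterexample, which gives you a small failing input to debug against.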

Well, the thing is that it is not as broad as you think. Test generation in competitive programming is guided by the algorithm for the problem and its correctness proof.
So while you may think there are millions of test cases, if you analyze the different situations the program can be in, you are likely to cover them all. Maybe in a certain algorithm you sometimes process the even-index elements of an array and sometimes the odd-index elements. What do you do then? Divide it into two cases, even and odd, and consider the smallest case for each. This way you are basically visiting every control-flow path of the program.
In competitive programming, since we first determine the algorithm, then decide on proper input sizes, and only then worry about test cases and validation, it is often easy to think of the corner points: a test case with 1000000 elements, or one where the input is 0 or 1, and so on.
Another approach: most of the time we write a brute-force solution that is much slower than the real one. Then we just generate random medium-size test cases, run them against the slow program, and check the results with our checker solution, etc.
Correctness is also guided by mathematical proof (heuristics, induction, the pigeonhole principle, number theory, etc.). That way we are sure about the correctness of the solution.

I faced the same issue earlier this year and saw some of my colleagues also trying to figure out a way to deal with this. There are times when I just can't think of any more test cases, and that's when I decided to make a test case generator tool of my own. It's free and open-source, so anyone can use it.
You can easily generate a lot of test cases using this tool and validate the results against the output of the correct but slower approach (in terms of time and space complexity). You can either run the two programs in parallel and compare their outputs, or write a simple script that compares the output of both programs (the slower but correct one and the faster but unverified one).
I believe good coders won't be needing it anyway, but for middle-level (div2, div3) coders and newbies, it can prove very helpful.
You can access it from GitHub: Test Case Generator.
Both the Python source code and .exe files are present there with instructions.
If you want to make some changes of your own, you can work with the Python file.
If you just want it to directly generate some test cases, you should prefer the .exe file (inside the zip).
It'll help a lot if you're beginning your competitive programming journey.
Also, any suggestions or improvements are always welcome. You can also contribute to this project by adding a feature you think it is lacking, by adding new test case formats yourself, or by requesting them.

Related

How do programmers test their algorithm in TopCoder or other competitions?

Good programmers who write programs of moderate to high difficulty in competitions such as TopCoder or ACM ICPC have to ensure the correctness of their algorithm before submission.
Although they are provided with some sample test cases to check the output against, how does that guarantee the program will behave correctly? They can write some test cases of their own, but it won't always be possible to know the correct answer through manual calculation. How do they do it?
Update: It seems it is not really possible to analyze and guarantee the outcome of an algorithm given the tight constraints of a competitive environment. However, if there are any common, manual practices adopted while solving such problems, those should be enough to answer the question. Something like best practices.
In competitions, the top programmers have enough experience to read the question, and think of some test cases that should catch most of the possibilities for input.
It catches most of the bugs usually - but it is NOT 100% safe.
However, in real life critical applications (critical systems on air planes or nuclear reactors for example) there are methods to PROVE some piece of code does what it is supposed to do.
This is the field of formal verification - which is way too complex and time consuming to be done during a contest, but for some systems it is used because mistakes could not be tolerated.
Some additional information:
Formal verification basically consists of 2 parts:
Manual verification - here we use proof systems such as Hoare logic and manually prove that the program does what we want it to do.
Automatic model checking - modeling the problem as a state machine and using model-checking tools to verify that the module does what it is supposed to do (or does not do something "bad").
Specifying "what it should do" is usually done with temporal logic.
This is often used to verify the correctness of hardware models as well. For example, Intel uses it to ensure they won't repeat the Pentium floating-point (FDIV) bug.
Picture this: imagine you are a top programmer. That means you know a bunch of algorithms and wouldn't think twice about implementing them. You know how to modify an already known algorithm to suit the problem's needs. You are strong at estimating time and space complexity, and you expect that in the worst case your tailored algorithm will run within the time and memory constraints.
At this level you simply think, use a scratchpad for about five to ten minutes, and have a super clear algorithm before you start to code. Once you finish coding, you hit compile and there is usually no compilation error, because the code is so intuitive to you.
Then, based on the algorithm and data structures used, you expect that there might be one of the following issues:
a corner case
an overflow problem
A corner case is when you have coded for the general case, but when, say, N=1, the answer is different from the others, so you generally handle it as a special case.
An overflow is when intermediate values or results overflow a data type's limits.
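As a small illustration of both issues (a Python sketch; the problem and the 32-bit limit are hypothetical examples, since Python's own integers don't overflow):

# A toy illustration of a corner case and an overflow check (hypothetical problem and limits).

def largest_gap(values):
    # corner case: with N = 1 there are no adjacent pairs, so handle it specially
    if len(values) < 2:
        return 0
    ordered = sorted(values)
    return max(b - a for a, b in zip(ordered, ordered[1:]))

# overflow: 100000 * 100000 = 10**10, which no longer fits in a signed 32-bit int,
# so in a language like C++ or Java the product would need a 64-bit type
INT32_MAX = 2**31 - 1
product = 100000 * 100000
print(largest_gap([7]))            # 0, thanks to the special case
print(product > INT32_MAX)         # True: this intermediate value would overflow a 32-bit int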
You make note of any problems which arise at this point, and use this data during Challenge phase(as in TopCoder).
Once you have checked against these two, you hit Submit.
There's a time element to Top Coder, so it's not possible to test every combination within that constraint. They probably do the best they can and rely on experience for the rest, just as one does in real life. I don't know that it's ever possible to guarantee that a significant piece of code is error free forever.

Code generation by genetic algorithms

Evolutionary programming seems to be a great way to solve many optimization problems. The idea is very simple and the implementation poses no problems.
I was wondering: is there any way to evolve a program written as a Ruby/Python script (or in any other language)?
The idea is simple:
Create a population of programs
Perform genetic operations (roulette-wheel selection or any other selection), create new programs with inheritance from best programs, etc.
Loop point 2 until program that will satisfy our condition is found
But there are still few problems:
How will chromosomes be represented? For example, should one cell of chromosome be one line of code?
How will chromosomes be generated? If they are lines of code, how do we generate them so that they are syntactically correct, etc.?
Example of a program that could be generated:
Create script that takes N numbers as input and returns their mean as output.
If there were any attempts to create such algorithms I'll be glad to see any links/sources.
If you are sure you want to do this, you want genetic programming, rather than a genetic algorithm. GP allows you to evolve tree-structured programs. What you would do is give it a bunch of primitive operations (while($register), read($register), increment($register), decrement($register), divide($result $numerator $denominator), print, progn2 (this is GP-speak for "execute two commands sequentially")).
You could produce something like this:
progn2(
    progn2(
        read($1)
        while($1
            progn2(
                while($1
                    progn2( #add the input to the total
                        increment($2)
                        decrement($1)
                    )
                )
                progn2( #increment number of values entered, read again
                    increment($3)
                    read($1)
                )
            )
        )
    )
    progn2( #calculate result
        divide($1 $2 $3)
        print($1)
    )
)
You would use, as your fitness function, how close it is to the real solution. And therein lies the catch, that you have to calculate that traditionally anyway*. And then have something that translates that into code in (your language of choice). Note that, as you've got a potential infinite loop in there, you'll have to cut off execution after a while (there's no way around the halting problem), and it probably won't work. Shucks. Note also, that my provided code will attempt to divide by zero.
*There are ways around this, but generally not terribly far around it.
It can be done, but works very badly for most kinds of applications.
Genetic algorithms only work when the fitness function is continuous, i.e. you can determine which candidates in your current population are closer to the solution than others, because only then you'll get improvements from one generation to the next. I learned this the hard way when I had a genetic algorithm with one strongly-weighted non-continuous component in my fitness function. It dominated all others and because it was non-continuous, there was no gradual advancement towards greater fitness because candidates that were almost correct in that aspect were not considered more fit than ones that were completely incorrect.
Unfortunately, program correctness is utterly non-continuous. Is a program that stops with error X on line A better than one that stops with error Y on line B? Your program could be one character away from being correct, and still abort with an error, while one that returns a constant hardcoded result can at least pass one test.
And that's not even touching on the matter of the code itself being non-continuous under modifications...
Well, this is very possible, and @Jivlain correctly points out in his (nice) answer that genetic programming is what you are looking for (and not simple genetic algorithms).
Genetic programming is a field that has not reached a broad audience yet, partially because of some of the complications @MichaelBorgwardt indicates in his answer. But those are mere complications; it is far from true that this is impossible to do. Research on the topic has been going on for more than 20 years.
John Koza is one of the leading researchers on this (have a look at his 1992 work), and he demonstrated as early as 1996 how genetic programming can in some cases outperform naive GAs on some classic computational problems (such as evolving programs for Cellular Automata synchronization).
Here's a good Genetic Programming tutorial from Koza and Poli dated 2003.
For a recent reference you might wanna have a look at A field guide to genetic programming (2008).
Since this question was asked, the field of genetic programming has advanced a bit, and there have been some additional attempts to evolve code in configurations other than the tree structures of traditional genetic programming. Here are just a few of them:
PushGP - designed with the goal of evolving modular functions like human coders use, programs in this system store all variables and code on different stacks (one for each variable type). Programs are written by pushing and popping commands and data off of the stacks.
FINCH - a system that evolves Java byte-code. This has been used to great effect to evolve game-playing agents.
Various algorithms have started evolving C++ code, often with a step in which compiler errors are corrected. This has had mixed, but not altogether unpromising results. Here's an example.
Avida - a system in which agents evolve programs (mostly boolean logic tasks) using a very simple assembly code. Based off of the older (and less versatile) Tierra.
The language isn't an issue. Regardless of the language, you have to define some higher-level of mutation, otherwise it will take forever to learn.
For example, since any Ruby program can be defined in terms of a text string, you could just randomly generate text strings and optimize that. Better would be to generate only legal Ruby programs. However, it would still take forever.
If you were trying to build a sorting program and you had high level operations like "swap", "move", etc. then you would have a much higher chance of success.
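A toy sketch of that idea (Python; every name and parameter here is made up): evolve a fixed-length list of "swap position i with position j" operations that sorts a small list, with fitness defined as the number of adjacent pairs already in order.

import random

TARGET = [5, 2, 9, 1, 7]   # hypothetical input we want sorted
GENOME_LEN = 8             # each gene is an (i, j) swap of positions

def apply_genome(genome, data):
    data = list(data)
    for i, j in genome:
        data[i], data[j] = data[j], data[i]
    return data

def fitness(genome):
    result = apply_genome(genome, TARGET)
    return sum(1 for a, b in zip(result, result[1:]) if a <= b)

def random_gene():
    return (random.randrange(len(TARGET)), random.randrange(len(TARGET)))

def mutate(genome):
    genome = list(genome)
    genome[random.randrange(GENOME_LEN)] = random_gene()
    return genome

def crossover(a, b):
    cut = random.randrange(1, GENOME_LEN)
    return a[:cut] + b[cut:]

population = [[random_gene() for _ in range(GENOME_LEN)] for _ in range(50)]
for generation in range(500):
    population.sort(key=fitness, reverse=True)
    if fitness(population[0]) == len(TARGET) - 1:   # every adjacent pair in order: sorted
        break
    parents = population[:10]
    population = parents + [mutate(crossover(random.choice(parents), random.choice(parents)))
                            for _ in range(40)]

print(apply_genome(population[0], TARGET))

Because the genes are high-level, always-legal operations, every candidate is a valid "program", which is exactly why this toy version has a chance of converging while random text mutation does not.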
In theory, a bunch of monkeys banging on a typewriter for an infinite amount of time will output all the works of Shakespeare. In practice, it isn't a practical way to write literature. Just because genetic algorithms can solve optimization problems doesn't mean that it's easy or even necessarily a good way to do it.
The biggest selling point of genetic algorithms, as you say, is that they are dirt simple. They don't have the best performance or mathematical background, but even if you have no idea how to solve your problem, as long as you can define it as an optimization problem you will be able to turn it into a GA.
Programs aren't really suited for GAs precisely because code isn't good chromosome material. I have seen someone do something similar with (simpler) machine code instead of Python (although it was more of an ecosystem simulation than a GA per se), and you might have better luck if you codify your programs using automata / LISP or something like that.
On the other hand, given how alluring GA's are and how basically everyone who looks at them asks this same question, I'm pretty sure there are already people who tried this somewhere - I just have no idea if any of them succeeded.
Good luck with that.
Sure, you could write a "mutation" program that reads a program and randomly adds, deletes, or changes some number of characters. Then you could compile the result and see if the output is better than the original program. (However we define and measure "better".) Of course 99.9% of the time the result would be compile errors: syntax errors, undefined variables, etc. And surely most of the rest would be wildly incorrect.
Try some very simple problem. Say, start with a program that reads in two numbers, adds them together, and outputs the sum. Let's say that the goal is a program that reads in three numbers and calculates the sum. Just how long and complex such a program would be of course depends on the language. Let's say we have some very high level language that lets us read or write a number with just one line of code. Then the starting program is just 4 lines:
read x
read y
total=x+y
write total
The simplest program to meet the desired goal would be something like
read x
read y
read z
total=x+y+z
write total
So through a random mutation, we have to add "read z" and "+z", a total of 9 characters including the space and the new-line. Let's make it easy on our mutation program and say it always inserts exactly 9 random characters, that they're guaranteed to be in the right places, and that it chooses from a character set of just 26 letters plus 10 digits plus 14 special characters = 50 characters. What are the odds that it will pick the correct 9 characters? 1 in 50^9 = 1 in 2.0e15. (Okay, the program would work if instead of "read z" and "+z" it inserted "read w" and "+w", but then I'm making it easy by assuming it magically inserts exactly the right number of characters and always inserts them in the right places. So I think this estimate is still generous.)
1 in 2.0e15 is a pretty small probability. Even if the program runs a thousand times a second, and you can test the output that quickly, the chance is still just 1 in 2.0e12 per second, or 1 in 5.4e8 per hour, 1 in 2.3e7 per day. Keep it running for a year and the chance of success is still only 1 in 62,000.
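The arithmetic can be checked in a few lines (the 1000-runs-per-second rate is the assumption already made above):

odds = 50 ** 9                 # 1,953,125,000,000,000, i.e. about 2.0e15 possible 9-character insertions
runs_per_second = 1000         # assumed rate of mutating, compiling, and testing
for label, seconds in [("hour", 3600), ("day", 86400), ("year", 365 * 86400)]:
    print(label, "-> 1 in", round(odds / (runs_per_second * seconds)))
# prints roughly: hour -> 1 in 542,534,722; day -> 1 in 22,605,613; year -> 1 in 61,934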
Even a moderately competent programmer should be able to make such a change in, what, ten minutes?
Note that changes must come in at least "packets" that are correct. That is, if a mutation generates "reax z", that's only one character away from "read z", but it would still produce compile errors, and so would fail.
Likewise adding "read z" but changing the calculation to "total=x+y+w" is not going to work. Depending on the language, you'll either get errors for the undefined variable or at best it will have some default value, like zero, and give incorrect results.
You could, I suppose, theorize incremental solutions. Maybe one mutation adds the new read statement, then a future mutation updates the calculation. But without the calculation, the additional read is worthless. How will the program be evaluated to determine that the additional read is "a step in the right direction"? The only way I see to do that is to have an intelligent being read the code after each mutation and see if the change is making progress toward the desired goal. And if you have an intelligent designer who can do that, that must mean that he knows what the desired goal is and how to achieve it. At which point, it would be far more efficient to just make the desired change rather than waiting for it to happen randomly.
And this is an exceedingly trivial program in a very easy language. Most programs are, what, hundreds or thousands of lines, all of which must work together. The odds against any random process writing a working program are astronomical.
There might be ways to do something that resembles this in some very specialized application, where you are not really making random mutations, but rather making incremental modifications to the parameters of a solution. Like, we have a formula with some constants whose values we don't know. We know what the correct results are for some small set of inputs. So we make random changes to the constants, and if the result is closer to the right answer, change from there, if not, go back to the previous value. But even at that, I think it would rarely be productive to make random changes. It would likely be more helpful to try changing the constants according to a strict formula, like start with changing by 1000's, then 100's then 10's, etc.
I want to just give you a suggestion. I don't know how successful you'd be, but perhaps you could try to evolve a core war bot with genetic programming. Your fitness function is easy: just let the bots compete in a game. You could start with well known bots and perhaps a few random ones then wait and see what happens.

How to develop complex methods with TDD

A few weeks ago I started my first project with TDD. Up to now, I have only read one book about it.
My main concern: How to write tests for complex methods/classes. I wrote a class that calculates a binomial distribution. A method of this class takes n, k, and p as input and calculates the corresponding probability. (In fact it does a bit more; that's why I had to write it myself, but let's stick to this description of the class for the sake of argument.)
What I did to test this method was copy some tables with different n that I found on the web into my code, pick a random entry from the table, feed the corresponding values for n, k, and p into my function, and check whether the result was near the value in the table. I repeat this a number of times for every table.
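A minimal sketch of such a table-based check (pytest-style Python; binomial_pmf and mymodule are hypothetical names for the code under test, and the expected values are standard binomial probabilities):

import pytest
from mymodule import binomial_pmf   # hypothetical module/function under test

# (n, k, p, expected) rows copied from a published table
TABLE = [
    (4, 2, 0.5, 0.3750),
    (5, 0, 0.3, 0.16807),
    (10, 3, 0.2, 0.2013),
]

@pytest.mark.parametrize("n, k, p, expected", TABLE)
def test_binomial_pmf_matches_table(n, k, p, expected):
    assert binomial_pmf(n, k, p) == pytest.approx(expected, abs=1e-4)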
This all works well now, but after writing the test, I had to work for a few hours to actually code the functionality. From reading the book, I had the impression that I should not code for longer than a few minutes before the test shows green again. What did I do wrong here? Of course I have broken this task down into a lot of methods, but they are all private.
A related question: was it a bad idea to pick numbers randomly from the table? In case of an error, I display the random seed used by the run so that I can reproduce the bug.
I don't agree with people saying that it's ok to test private code, even if you make it into separate classes. You should test the entry points of your application (or your library, if it's a library you're coding). When you test private code, you limit your refactoring possibilities for later (because refactoring your private classes means refactoring your test code, which you should refrain from doing). If you end up re-using this private code elsewhere, then sure, create separate classes and test them, but until you do, assume that You Ain't Gonna Need It.
To answer your question, I think that yes, in some cases it's not a "2 minutes until you go green" situation. In those cases, I think it's ok for the tests to take a long time to go green. But most situations are "2 minutes until you go green" situations. In your case (I don't know squat about the binomial distribution), you wrote that you have three arguments: n, k, and p. If you keep k and p constant, is your function any simpler to implement? If yes, you should start by creating tests that always have constant k and p. When your tests pass, introduce a new value for k, and then for p.
"I had the impression that I should not code longer than a few minutes, until the test shows green again. What did I do wrong here?"
Westphal is correct up to a point.
Some functionality starts simple and can be tested simply and coded simply.
Some functionality does not start out simple. Simple is hard to achieve. Dijkstra (EWD) said that simplicity is not valued because it is so difficult to achieve.
If your function body is hard to write, it isn't simple. This means you have to work much harder to reduce it to something simple.
After you eventually achieve simplicity, you, too, can write a book showing how simple it is.
Until you achieve simplicity, it will take a long time to write things.
"Was it a bad idea to pick randomly numbers from the table?"
Yes. If you have sample data, run your test against all the sample data. Use a loop or something, and test everything you can possibly test.
Don't select one row -- randomly or otherwise, select all rows.
You should TDD using baby steps. Try thinking of tests that will require less code to be written. Then write the code. Then write another test, and so on.
Try to break your problem into smaller problems (you probably used some other methods to have your code completed). You could TDD these smaller methods.
--EDIT: based on the comments
Testing private methods is not necessarily a bad thing. They sometimes really do contain implementation details, but sometimes they might also act like an interface (in that case, you could follow the suggestion in the next paragraph).
One other option is to create other classes (implemented with interfaces that are injected) to take some of the responsibilities (maybe some of those smaller methods), and test them separately, and mock them when testing your main class.
Finally, I don't see spending more time coding as a really big problem. Some problems are really more complex to implement than to test, and require much thinking time.
You are correct about short quick refactors, I rarely go more than a few minutes between rebuild/test no matter how complicated the change. It takes a little practice.
The test you described is more of a system test than a unit test though. A unit test tries never to test more than a single method--in order to reduce complexity you should probably break your problem down into quite a few methods.
The system test should probably be done after you have built up your functionality with small unit tests on small straight-forward methods.
Even if the methods are just taking a part of the formula out of a longer method, you get the advantage of readability (the method name should be more readable than the formula part it replaces) and if the methods are final the JIT should inline them so you don't lose any speed.
On the other hand, if your formula isn't that big, maybe you just write it all in one method and test it like you did and take the downtime--rules are made to be broken.
It's difficult to answer your question without knowing a little bit more about the things you wanted to implement. It sounds like they were not easily partitionable into testable parts: either the functionality works as a whole or it doesn't. If that is the case, it's no wonder it took you hours to implement.
As to your second question: Yes, I think it's a bad idea to make the test fixture random. Why did you do this in the first place? Changing the fixture changes the test.
Avoid developing complex methods with TDD until you have developed simple methods as building blocks for the more complex methods. TDD would typically be used to create a quantity of simple functionality which could be combined to produce more complex behaviour. Complex methods/classes should always be able to be broken down into simpler parts, but it is not always obvious how and is often problem specific. The test you have written sounds like it might be more of an integration test to make sure all the components work together correctly, although the complexity of the problem you describe only borders on the edge of requiring a set of components to solve it. The situation you describe sounds like this:
class A {
    public doLotsOfStuff() // Call doTask1..n
    private doTask1()
    private doTask2()
    private doTask3()
}
You will find it quite hard to develop with TDD if you start by writing a test for the greatest unit of functionality (i.e. doLotsOfStuff()). By breaking the problem down into more manageable chunks and approaching it from the simplest functionality upward, you will also be able to create more discrete tests (much more useful than tests that check for everything!). Perhaps your potential solution could be reformulated like this:
class A {
    public doLotsOfStuff() // Call doTask1..n
    public doTask1()
    public doTask2()
    public doTask3()
}
Whilst your private methods may be implementation details, that is not a reason to avoid testing them in isolation. Just like many problems, a divide-and-conquer approach would prove effective here. The real question is: what size is a suitably testable and maintainable chunk of functionality? Only you can answer that, based on your knowledge of the problem and your own judgement in applying your abilities to the task.
I think the style of testing you have is totally appropriate for code that's primarily a computation. Rather than picking a random row from your known-results table, it'd be better to just hardcode the significant edge cases. This way your tests consistently verify the same thing, and when one breaks you know what it was.
Yes, TDD prescribes short spans from test to implementation, but what you've done is still well beyond the standards you'll find in industry. You can now rely on the code to calculate what it should, and can refactor/extend it with a degree of certainty that you aren't breaking it.
As you learn more testing techniques you may find different approaches that shorten the red/green cycle. In the meantime, don't feel bad about it. It's a means to an end, not an end in itself.

Practical tips debugging deep recursion?

I'm working on a board game algorithm where a large tree is traversed using recursion; however, it's not behaving as expected. How do I handle this, and what are your experiences with these situations?
To make things worse, it uses alpha-beta pruning, which means entire parts of the tree are never visited, and it simply stops recursing when certain conditions are met. I can't change the search depth to a lower number either, because while the algorithm is deterministic, the outcome varies with how deep it searches, and it may behave as expected at a lower search depth (and it does).
Now, I'm not gonna ask you "where is the problem in my code?" but I am looking for general tips, tools, visualizations, anything to debug code like this. Personally, I'm developing in C#, but any and all tools are welcome. Although I think that this may be most applicable to imperative languages.
Logging. Log extensively in your code. In my experience, logging is THE solution for these types of problems. When it's hard to figure out what your code is doing, logging it extensively is a very good approach, as it lets you output the internal state from within your code. It's not a perfect solution, but as far as I've seen, it works better than any other method.
One thing I have done in the past is to format the logs to reflect the recursion depth. So you might add a new indentation level for every recursive call, or some other delimiter. Then make a debug DLL that logs everything you need to know about each iteration. Between the two, you should be able to read the execution path and hopefully tell what's wrong.
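A minimal sketch of depth-indented logging in Python (the Node class and the toy tree are made up; the point is only the indentation that mirrors the recursion depth):

import logging
from dataclasses import dataclass, field

logging.basicConfig(level=logging.DEBUG, format="%(message)s")

@dataclass
class Node:                        # hypothetical game-tree node
    name: str
    value: int = 0
    children: list = field(default_factory=list)

def search(node, depth=0):
    indent = "    " * depth        # indentation mirrors the recursion depth
    logging.debug("%senter %s (value=%d)", indent, node.name, node.value)
    best = max([search(c, depth + 1) for c in node.children], default=node.value)
    logging.debug("%sleave %s -> %d", indent, node.name, best)
    return best

tree = Node("root", 0, [Node("a", 3, [Node("a1", 5)]), Node("b", 2)])
search(tree)

Reading the resulting log top to bottom reproduces the exact path the recursion took, with cutoffs and early returns showing up as missing subtrees.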
I would normally unit-test such algorithms with one or more predefined datasets that have well-defined outcomes. I would typically make several such tests in increasing order of complexity.
If you insist on debugging, it is sometimes useful to doctor the code with statements that check for a given value, so you can attach a breakpoint at that time and place in the code:
if (depth == X && item.id == 32) {
// Breakpoint here
}
Maybe you could convert the recursion into an iteration with an explicit stack for the parameters. Testing is easier this way because you can log values directly, inspect the stack, and you don't have to pass data/variables through each recursive call or keep them from falling out of scope.
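A minimal sketch of that transformation (Python; the node structure with value/children attributes is hypothetical), where the explicit stack can be logged or inspected at any point:

def search_recursive(node):
    best = node.value
    for child in node.children:
        best = max(best, search_recursive(child))
    return best

def search_iterative(root):
    best = root.value
    stack = [root]                      # explicit stack replaces the call stack
    while stack:
        node = stack.pop()
        best = max(best, node.value)
        stack.extend(node.children)     # parameters/state can be pushed as tuples too
        # print(len(stack))             # easy place to log or inspect the pending work
    return best

Both versions compute the same result (the maximum value in the tree); the iterative one just makes the pending work visible as ordinary data.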
I once had a similar problem when I was developing an AI algorithm to play a Tetris game. After trying many things and losing a LOT of hours reading my own logs, debugging, and stepping in and out of functions, what worked for me was to code a quick visualizer and test my code with FIXED input.
So, if time is not a problem and you really want to understand what is going on, get a fixed board state and SEE what your program is doing with the data, using a mix of debug logs/output and some tool of your own that shows information at each step.
Once you find a board state that gives you this problem, try to pin-point the function(s) where it starts and then you will be in a position to fix it.
I know what a pain this can be. At my job, we are currently working with a 3rd party application that basically behaves as a black box, so we have to devise some interesting debugging techniques to help us work around issues.
When I was taking a compiler theory course in college, we used a software library to visualize our trees; this might help you as well, as it could help you see what the tree looks like. In fact, you could build yourself a WinForms/WPF application to dump the contents of your tree into a TreeView control--it's messy, but it'll get the job done.
You might want to consider some kind of debug output, too. I know you mentioned that your tree is large, but perhaps debug statements or breakpoints at key points during execution would lend you a hand with the parts you're having trouble visualizing.
Bear in mind, too, that intelligent debugging using Visual Studio can work wonders. It's tough to see how state is changing across multiple breaks, but Visual Studio 2010 should actually help with this.
Unfortunately, it's not particularly easy to help you debug without further information. Have you identified the first depth at which it starts to break? Does it continue to break with higher search depths? You might want to evaluate your working cases and try to determine how it's different.
Since you say that the traversal is not working as expected, I assume you have some idea of where things may go wrong. Then inspect the code to verify that you have not overlooked something basic.
After that I suggest you set up some simple unit tests. If they pass, then keep adding tests until they fail. If they fail, then reduce the tests until they either pass or are as simple as they can be. That should help you pinpoint the problems.
If you want to debug as well, I suggest you employ conditional breakpoints. Visual Studio lets you modify breakpoints, so you can set conditions on when the breakpoint should be triggered. That can reduce the number of iterations you need to look at.
I would start by instrumenting the function(s). At each recursive call log the data structures and any other info that will be useful in helping you identify the problem.
Print out the dump along with the source code then get away from the computer and have a nice paper-based debugging session over a cup of coffee.
Start from the base case, where you have your if/else statements, and then try to channel your thinking by writing it down with pen and paper and printing the values to the console as the first few instances of the recursive function are generated.
The idea is to find the pattern in the values you print and match them with the values you wrote on paper for the first few steps of your recursive algorithm.

TDD and the Bayesian Spam Filter problem

It's well known that Bayesian classifiers are an effective way to filter spam. These can be fairly concise (ours is only a few hundred LoC), but all the core code needs to be written up front before you get any results at all.
However, the TDD approach mandates that only the minimum amount of code to pass a test can be written, so given the following method signature:
bool IsSpam(string text)
And the following string of text, which is clearly spam:
"Cheap generic viagra"
The minimum amount of code I could write is:
bool IsSpam(string text)
{
return text == "Cheap generic viagra";
}
Now maybe I add another test message, e.g.
"Online viagra pharmacy"
I could change the code to:
bool IsSpam(string text)
{
return text.Contains("viagra");
}
...and so on, and so on. Until at some point the code becomes a mess of string checks, regular expressions, etc. because we've evolved it instead of thinking about it and writing it in a different way from the start.
So how is TDD supposed to work with this type of situation where evolving the code from the simplest possible code to pass the test is not the right approach? (Particularly if it is known in advance that the best implementations cannot be trivially evolved).
Begin by writing tests for lower level parts of the spam filter algorithm.
First you need to have in your mind a rough design of how the algorithm should be. Then you isolate a core part of the algorithm and write tests for it. In the case of a spam filter that would maybe be calculating some simple probability using Bayes' theorem (I don't know about Bayesian classifiers, so I could be wrong). You build it bottom-up, step by step, until finally you have all the parts of the algorithm implemented and putting them together is simple.
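For example, a hedged sketch of a first low-level test (word_spam_probability and spamfilter are hypothetical names for a helper you might build first, assumed to derive P(ham) as 1 - P(spam)):

import pytest
from spamfilter import word_spam_probability   # hypothetical module and helper

def test_word_spam_probability_applies_bayes_theorem():
    # P(spam|word) = P(word|spam)P(spam) / (P(word|spam)P(spam) + P(word|ham)P(ham))
    p = word_spam_probability(p_word_given_spam=0.8,
                              p_word_given_ham=0.1,
                              p_spam=0.5)
    assert p == pytest.approx(0.8 * 0.5 / (0.8 * 0.5 + 0.1 * 0.5))   # about 0.889

def test_probability_stays_between_zero_and_one():
    p = word_spam_probability(p_word_given_spam=0.3,
                              p_word_given_ham=0.3,
                              p_spam=0.5)
    assert 0.0 <= p <= 1.0

Each test like this drives a small, well-defined piece of the algorithm, rather than the whole IsSpam behaviour at once.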
It requires lots of practice to know which tests to write in which order, so that you can do TDD in small enough steps. If you need to write much more than 10 lines of code to pass one new test, you are probably doing something wrong. Start from something smaller or mock some of the dependencies. It's safer to err on the small side, so that the steps are too small and your progress is slow, than to take steps that are too big and fail badly.
The "Cheap generic viagra" example that you have might be better suited for an acceptance test. It will probably even run very slowly, because you first need to initialize the spam filter with example data, so it won't be useful as a TDD test. TDD tests need to be FIRST (F = Fast, as in many hundreds or thousands of tests per second).
Here's my take: Test Driven Development means writing tests before coding. This does not mean that each unit of code for which you write a test needs to be trivial.
Furthermore you still need to plan your software to do its tasks in a sensible and effective way. Simply adding more and more strings doesn't seem to be the best design for this problem.
So in short, you write the code from the smallest piece of functionality possible (and test it) but you don't design your algorithm (in pseudo code or however you like to do it) that way.
Would be interesting to see if you and others agree.
For me, what you call the minimum amount of code to pass a test is the whole IsSpam() function. That is consistent with its size (you say it's only a few hundred LoC).
Alternatively, the incremental approach does not claim that you code first and think afterwards. You can design a solution, code it, and then refine the design with special cases or a better algorithm.
Anyway, refactoring does not consist simply of adding new stuff on top of the old. For me it is a more destructive approach, where you throw away the old code for a simple feature and replace it with new code for a refined and more elaborate feature.
You have your unit tests, right?
That means that you can now refactor the code or even rewrite it and use the unit tests to see if you broke something.
First make it work, then make it clean -- It's time for the second step :)
(1) You cannot say that a string "is spam" or "is not spam" in the same way as if you were saying whether a number is prime. This is not black or white.
(2) It is incorrect, and certainly not the aim of TDD, to write string-processing functions using just the very examples used for the tests. Examples should represent a class of values. TDD does not protect against stupid implementations, so you shouldn't pretend that you have no clue at all, and you shouldn't write return text == "Cheap generic viagra".
It seems to me, that with a Bayesian Spam Filter, you should be using existing methods. In particular you would be using Bayes' Theorem, and probably some other probability theory.
In that case, it seems the best approach is to decide on your algorithm, based on these methods, which should either be tried and tested or possibly experimental. Then your unit tests should be designed to test whether IsSpam correctly implements the algorithm you decide on, along with a basic test that the result is between 0 and 1.
The point is, that your unit tests aren't designed to test whether your algorithm is sensible. You should either know that already, or possibly your program is designed as an experiment, to see if it is sensible.
That's not to say the performance of the IsSpam function isn't important. But it doesn't have to be part of the unit testing. The data could come from feedback from alpha testing, new theoretical results, or your own experiments. In that case, a new algorithm may be needed, and new unit tests are needed.
See also this question about testing random number generators.
The problem here is not with test driven development but with your tests. If you start out developing code against a single test then all your test is doing is specifying a string checking function.
The main idea of TDD is to think about your tests before writing code. You can't exhaustively test a spam filter, but you could come up with a reasonable approximation using tens or hundreds of thousands of test documents. In the presence of that many tests, the naive Bayes algorithm is a simpler solution than a hundred-thousand-line switch statement.
In reality, you may not be able to pass 100% of your unit tests so you just have to try to pass as many as possible. You also have to make sure your tests are sufficiently realistic. If you think about it in this way, test driven development and machine learning have a lot in common.
The problem you are describing is theoretical, that by adding cruft in response to tests you will make a big, messy ball of mud. The thing you are missing is very important.
The cycle is: Red --> Green --> Refactor
You don't just bounce between red and green. As soon as you have tests passing (green) you refactor the production code and the tests. Then you write the next failing test (red).
If you are refactoring, then you are eliminating duplication and messiness and slop as it grows. You will quickly get to the point of extracting methods, building scoring and rating, and probably bringing in external tools. You will do it as soon as it is the simplest thing that will work.
Don't just bounce between red and green, or all your code will be muck. That refactoring step is not optional or discretionary. It is essential.
I don't think checking whether a particular string is spam is really a unit test; it's more of a customer test. There's an important difference, as it's not really a red/green type of thing. In actuality you should probably have a couple hundred test documents. Initially some will be classified as spam, and as you improve the product the classifications will more closely match what you want. So you should make a custom app to load a bunch of test documents, classify them, and then evaluate the scoring overall. When you're done with that customer test, the score will be very bad since you haven't implemented an algorithm yet. But you now have a means to measure progress going forward, and that is pretty valuable given the amount of learning/changes/experimentation you can expect.
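A minimal sketch of such a harness (Python; the corpus layout and the is_spam scoring function are hypothetical):

import pathlib
from spamfilter import is_spam          # hypothetical classifier under evaluation

def evaluate(corpus_dir):
    # expects corpus_dir/spam/*.txt and corpus_dir/ham/*.txt, labeled by directory
    correct = total = 0
    for label, expected in (("spam", True), ("ham", False)):
        for path in pathlib.Path(corpus_dir, label).glob("*.txt"):
            text = path.read_text(errors="ignore")
            correct += (is_spam(text) == expected)
            total += 1
    return correct / total if total else 0.0

if __name__ == "__main__":
    print("accuracy:", evaluate("testdata"))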
As you implement your algorithm (and even the customer test itself), you can still do TDD with real unit tests. The first test for the Bayesian filter component won't measure whether a particular string evaluates as spam, but whether the string is passed through the Bayesian filter component appropriately. Your next tests will then focus on how the Bayesian filter is implemented (structuring nodes correctly, applying training data, etc.).
You do need a vision of where the product is going, and your tests and implementation should be directed towards that vision. You also cannot just add customer tests blindly; you need to add tests with the overall product vision in mind. Any software development goal will have good tests and bad tests you can write.
