How to test short period PRNGs?

How to test short period PRNGs? - random

So in short my question would be this:
How one should test a short period PRNG?
Emphasis being on short period, like from 2^8 up to at most 2^32. These small generators may have more compact and faster code, small state, might making them suitable for certain procedural generation tasks. The importance is that their generated sequence should be "reasonably random". In this case it may even be viable to test every seed one plans to use, so technically just the output stream itself may be sufficient.
I am aware of other similar questions like this, but they rather target long period, probably even cryptographically secure PRNGs.
So far I see that for example the Overlapping permutations test from the Diehard harness may work, but others seem to rather target long periods.
I tried to do X:Y plots to see correlations such as here, this seems to work and visually provide a good feedback, but I would like to see more if possible (if needed?).

Related

How do programmers test their algorithm in TopCoder or other competitions?

Good programmers who write programs of moderate to higher difficulty in competitions of TopCoder or ACM ICPC, have to ensure the correctness of their algorithm before submission.
Although they are provided with some sample test cases to ensure the correct output, but how does it guarantees that program will behave correctly? They can write some test cases of their own but it won't be possible in all cases to know the correct answer through manual calculation. How do they do it?
Update: As it seems, it is not quite possible to analyze and guarantee the outcome of an algorithm given tight constraints of a competitive environment. However, if there are any manual, more common traits which are adopted while solving such problems - should be enough to answer the question. Something like best practices..

In competitions, the top programmers have enough experience to read the question, and think of some test cases that should catch most of the possibilities for input.
It catches most of the bugs usually - but it is NOT 100% safe.
However, in real life critical applications (critical systems on air planes or nuclear reactors for example) there are methods to PROVE some piece of code does what it is supposed to do.
This is the field of formal verification - which is way too complex and time consuming to be done during a contest, but for some systems it is used because mistakes could not be tolerated.
Some additional information:
Formal verification basically consists of 2 parts:
Manual verification - in here we use proving systems such as Hoare logic and manually prove the program does what we wants it to do.
Automatic model checking - modeling the problem as state machine, and use Model Checking tools to verify that the module does what it is supposed to do (or not doing something "bad").
Specifying "what it should do" is usually done with temporal logic.
This is often used to verify correctness of hardware models as well. For example Intel uses it to ensure they won't get the floating point bug again.

Picture this, imagine you are a top programmer.Meaning you know a bunch of algorithms and wouldn't think think twice while implementing them.You know how to modify an already known algorithm to suit the problem's needs.You are strong with estimating time and complexity and you expect that in the worst case your tailored algorithm would run within time and memory constraints.
At this level you simply think and use a scratchpad for about five to ten minutes and have a super clear algorithm before you start to code.Once you finish coding, you hit compile and there is usually no compilation error.Because the code is so intuitive to you.
Then based on the algorithm used and data structures used, you expect that there might be
one of the following issues.
a corner case
an overflow problem
A corner case is basically like you have coded for the general case, however when say N=1, the answer is different from others.So you generally write it as a special case.
An overflow is when intermediate values or results overflow a data type's limits.
You make note of any problems which arise at this point, and use this data during Challenge phase(as in TopCoder).
Once you have checked against these two, you hit Submit.

There's a time element to Top Coder, so it's not possible to test every combination within that constraint. They probably do the best they can and rely on experience for the rest, just as one does in real life. I don't know that it's ever possible to guarantee that a significant piece of code is error free forever.

Code generation by genetic algorithms

Evolutionary programming seems to be a great way to solve many optimization problems. The idea is very easy and the implementation does not make problems.
I was wondering if there is any way to evolutionarily create a program in ruby/python script (or any other language)?
The idea is simple:
Create a population of programs
Perform genetic operations (roulette-wheel selection or any other selection), create new programs with inheritance from best programs, etc.
Loop point 2 until program that will satisfy our condition is found
But there are still few problems:
How will chromosomes be represented? For example, should one cell of chromosome be one line of code?
How will chromosomes be generated? If they will be lines of code, how do we generate them to ensure that they are syntactically correct, etc.?
Example of a program that could be generated:
Create script that takes N numbers as input and returns their mean as output.
If there were any attempts to create such algorithms I'll be glad to see any links/sources.

If you are sure you want to do this, you want genetic programming, rather than a genetic algorithm. GP allows you to evolve tree-structured programs. What you would do would be to give it a bunch of primitive operations (while($register), read($register), increment($register), decrement($register), divide($result $numerator $denominator), print, progn2 (this is GP speak for "execute two commands sequentially")).
You could produce something like this:
progn2(
progn2(
read($1)
while($1
progn2(
while($1
progn2( #add the input to the total
increment($2)
decrement($1)
)
)
progn2( #increment number of values entered, read again
increment($3)
read($1)
)
)
)
)
progn2( #calculate result
divide($1 $2 $3)
print($1)
)
)
You would use, as your fitness function, how close it is to the real solution. And therein lies the catch, that you have to calculate that traditionally anyway*. And then have something that translates that into code in (your language of choice). Note that, as you've got a potential infinite loop in there, you'll have to cut off execution after a while (there's no way around the halting problem), and it probably won't work. Shucks. Note also, that my provided code will attempt to divide by zero.
*There are ways around this, but generally not terribly far around it.

It can be done, but works very badly for most kinds of applications.
Genetic algorithms only work when the fitness function is continuous, i.e. you can determine which candidates in your current population are closer to the solution than others, because only then you'll get improvements from one generation to the next. I learned this the hard way when I had a genetic algorithm with one strongly-weighted non-continuous component in my fitness function. It dominated all others and because it was non-continuous, there was no gradual advancement towards greater fitness because candidates that were almost correct in that aspect were not considered more fit than ones that were completely incorrect.
Unfortunately, program correctness is utterly non-continuous. Is a program that stops with error X on line A better than one that stops with error Y on line B? Your program could be one character away from being correct, and still abort with an error, while one that returns a constant hardcoded result can at least pass one test.
And that's not even touching on the matter of the code itself being non-continuous under modifications...

Well this is very possible and #Jivlain correctly points out in his (nice) answer that genetic Programming is what you are looking for (and not simple Genetic Algorithms).
Genetic Programming is a field that has not reached a broad audience yet, partially because of some of the complications #MichaelBorgwardt indicates in his answer. But those are mere complications, it is far from true that this is impossible to do. Research on the topic has been going on for more than 20 years.
Andre Koza is one of the leading researchers on this (have a look at his 1992 work) and he demonstrated as early as 1996 how genetic programming can in some cases outperform naive GAs on some classic computational problems (such as evolving programs for Cellular Automata synchronization).
Here's a good Genetic Programming tutorial from Koza and Poli dated 2003.
For a recent reference you might wanna have a look at A field guide to genetic programming (2008).

Since this question was asked, the field of genetic programming has advanced a bit, and there have been some additional attempts to evolve code in configurations other than the tree structures of traditional genetic programming. Here are just a few of them:
PushGP - designed with the goal of evolving modular functions like human coders use, programs in this system store all variables and code on different stacks (one for each variable type). Programs are written by pushing and popping commands and data off of the stacks.
FINCH - a system that evolves Java byte-code. This has been used to great effect to evolve game-playing agents.
Various algorithms have started evolving C++ code, often with a step in which compiler errors are corrected. This has had mixed, but not altogether unpromising results. Here's an example.
Avida - a system in which agents evolve programs (mostly boolean logic tasks) using a very simple assembly code. Based off of the older (and less versatile) Tierra.

The language isn't an issue. Regardless of the language, you have to define some higher-level of mutation, otherwise it will take forever to learn.
For example, since any Ruby language can be defined in terms of a text string, you could just randomly generate text strings and optimize that. Better would be to generate only legal Ruby programs. However, it would also take forever.
If you were trying to build a sorting program and you had high level operations like "swap", "move", etc. then you would have a much higher chance of success.
In theory, a bunch of monkeys banging on a typewriter for an infinite amount of time will output all the works of Shakespeare. In practice, it isn't a practical way to write literature. Just because genetic algorithms can solve optimization problems doesn't mean that it's easy or even necessarily a good way to do it.

The biggest selling point of genetic algorithms, as you say, is that they are dirt simple. They don't have the best performance or mathematical background, but even if you have no idea how to solve your problem, as long as you can define it as an optimization problem you will be able to turn it into a GA.
Programs aren't really suited for GA's precisely because code isn't good chromossome material. I have seen someone who did something similar with (simpler) machine code instead of Python (although it was more of an ecossystem simulation then a GA per se) and you might have better luck if you codify your programs using automata / LISP or something like that.
On the other hand, given how alluring GA's are and how basically everyone who looks at them asks this same question, I'm pretty sure there are already people who tried this somewhere - I just have no idea if any of them succeeded.

Good luck with that.
Sure, you could write a "mutation" program that reads a program and randomly adds, deletes, or changes some number of characters. Then you could compile the result and see if the output is better than the original program. (However we define and measure "better".) Of course 99.9% of the time the result would be compile errors: syntax errors, undefined variables, etc. And surely most of the rest would be wildly incorrect.
Try some very simple problem. Say, start with a program that reads in two numbers, adds them together, and outputs the sum. Let's say that the goal is a program that reads in three numbers and calculates the sum. Just how long and complex such a program would be of course depends on the language. Let's say we have some very high level language that lets us read or write a number with just one line of code. Then the starting program is just 4 lines:
read x
read y
total=x+y
write total
The simplest program to meet the desired goal would be something like
read x
read y
read z
total=x+y+z
write total
So through a random mutation, we have to add "read z" and "+z", a total of 9 characters including the space and the new-line. Let's make it easy on our mutation program and say it always inserts exactly 9 random characters, that they're guaranteed to be in the right places, and that it chooses from a character set of just 26 letters plus 10 digits plus 14 special characters = 50 characters. What are the odds that it will pick the correct 9 characters? 1 in 50^9 = 1 in 2.0e15. (Okay, the program would work if instead of "read z" and "+z" it inserted "read w" and "+w", but then I'm making it easy by assuming it magically inserts exactly the right number of characters and always inserts them in the right places. So I think this estimate is still generous.)
1 in 2.0e15 is a pretty small probability. Even if the program runs a thousand times a second, and you can test the output that quickly, the chance is still just 1 in 2.0e12 per second, or 1 in 5.4e8 per hour, 1 in 2.3e7 per day. Keep it running for a year and the chance of success is still only 1 in 62,000.
Even a moderately competent programmer should be able to make such a change in, what, ten minutes?
Note that changes must come in at least "packets" that are correct. That is, if a mutation generates "reax z", that's only one character away from "read z", but it would still produce compile errors, and so would fail.
Likewise adding "read z" but changing the calculation to "total=x+y+w" is not going to work. Depending on the language, you'll either get errors for the undefined variable or at best it will have some default value, like zero, and give incorrect results.
You could, I suppose, theorize incremental solutions. Maybe one mutation adds the new read statement, then a future mutation updates the calculation. But without the calculation, the additional read is worthless. How will the program be evaluated to determine that the additional read is "a step in the right direction"? The only way I see to do that is to have an intelligent being read the code after each mutation and see if the change is making progress toward the desired goal. And if you have an intelligent designer who can do that, that must mean that he knows what the desired goal is and how to achieve it. At which point, it would be far more efficient to just make the desired change rather than waiting for it to happen randomly.
And this is an exceedingly trivial program in a very easy language. Most programs are, what, hundreds or thousands of lines, all of which must work together. The odds against any random process writing a working program are astronomical.
There might be ways to do something that resembles this in some very specialized application, where you are not really making random mutations, but rather making incremental modifications to the parameters of a solution. Like, we have a formula with some constants whose values we don't know. We know what the correct results are for some small set of inputs. So we make random changes to the constants, and if the result is closer to the right answer, change from there, if not, go back to the previous value. But even at that, I think it would rarely be productive to make random changes. It would likely be more helpful to try changing the constants according to a strict formula, like start with changing by 1000's, then 100's then 10's, etc.

I want to just give you a suggestion. I don't know how successful you'd be, but perhaps you could try to evolve a core war bot with genetic programming. Your fitness function is easy: just let the bots compete in a game. You could start with well known bots and perhaps a few random ones then wait and see what happens.

Safe mixing of entropy sources

Let us assume we're generating very large (e.g. 128 or 256bit) numbers to serve as keys for a block cipher.
Let us further assume that we wear tinfoil hats (at least when outside).
Being so paranoid, we want to be sure of our available entropy, but we don't entirely trust any particular source. Maybe the government is rigging our coins. Maybe these dice are ever so subtly weighted. What if the hardware interrupts feeding into /dev/random are just a little too consistent? (Besides being paranoid, we're lazy enough that we don't want to generate it all by hand...)
So, let's mix them all up.
What are the secure method(s) for doing this? Presumably just concatenating a few bytes from each source isn't entirely secure -- if one of the sources is biased, it might, in theory, lend itself to such things as a related-key attack, for example.
Is running SHA-256 over the concatenated bytes sufficient?
(And yes, at some point soon I am going to pick up a copy of Cryptography Engineering. :))

Since you mention /dev/random -- on Linux at least, /dev/random is fed by an algorithm that does very much what you're describing. It takes several variously-trusted entropy sources and mixes them into an "entropy pool" using a polynomial function -- for each new byte of entropy that comes in, it's xor'd into the pool, and then the entire pool is stirred with the mixing function. When it's desired to get some randomness out of the pool, the entire pool is hashed with SHA-1 to get the output, then the pool is mixed again (and actually there's some more hashing, folding, and mutilating going on to make sure that reversing the process is about as hard as reversing SHA-1). At the same time, there's a bunch of accounting going on -- each time some entropy is added to the pool, an estimate of the number of bits of entropy it's worth is added to the account, and each time some bytes are extracted from the pool, that number is subtracted, and the random device will block (waiting on more external entropy) if the account would go below zero. Of course, if you use the "urandom" device, the blocking doesn't happen and the pool simply keeps getting hashed and mixed to produce more bytes, which turns it into a PRNG instead of an RNG.
Anyway... it's actually pretty interesting and pretty well commented -- you might want to study it. drivers/char/random.c in the linux-2.6 tree.

Using a hash function is a good approach - just make sure you underestimate the amount of entropy each source contributes, so that if you are right about one or more of them being less than totally random, you haven't weakened your key unduly.
This isn't dissimilar to the approach used in key stretching (though you have no need for multiple iterations here).

I've done this before, and my approach was just to XOR them, byte-by-byte, against each other.
Running them through some other algorithm, like SHA-256, is terribly inefficient, so it's not practical, and I think it would be not really useful and possibly harmful.
If you do happen to be incredibly paranoid, and have a tiny bit of money, it might be fun to buy a "true" (depending on how convinced you are by Quantum Mechanics) a Quantum Random Number Generator.
-- Edit:
FWIW, I think the method I describe above (or something similar) is effectively a One-Time Pad from the point of view of either sources, assuming one of them is random, and therefore unattackable assuming they are independant and out to get you. I'm happy to be corrected on this if someone takes issue with it, and I encourage anyone not taking issue with it to question it anyway, and find out for yourself.

If you have a source of randomness but you're not sure whether it is biased or not, then there are a lot of different algorithms. Depending on how much work you want to do, the entropy you waste from the original source differes.
The easiest algorithm is the (improved) van Neumann algorithm. You can find the details in this pdf:
http://security1.win.tue.nl/~bskoric/physsec/files/PhysSec_LectureNotes.pdf
at page 27.
I also recommend you to read this document if you're interested in how to produce uniformly randomness from a given souce, how true random number generators work, etc!

Writing shorter code/algorithms, is more efficient (performance)?

After coming across the code golf trivia around the site it is obvious people try to find ways to write code and algorithms as short as the possibly can in terms of characters, lines and total size, even if that means writing something like:
//Code by: job
//Topic: Code Golf - Collatz Conjecture
n=input()
while n>1:n=(n/2,n*3+1)[n%2];print n
So as a beginner I start to wonder whether size actually matters :D
It is obviously a very subjective question highly dependent on the actual code being used, but what is the rule of thumb in the real world.
In the case that size wont matter, how come then we don't focus more on performance rather than size?

I hope this does not become a flame war. Good code has many attributes, including:
Solving the use-case properly.
Readability.
Maintainability.
Performance.
Testability.
Low memory signature.
Good user interface.
Reusability.
The brevity of code is not that important in 21st century programming. It used to be more important when memory was really scarce. Please see this question, including my answer, for books referencing the attributes above.

A lot of good answers already about what's important versus what's not. In real life, (almost) nobody writes code like code golf, wtih shortened identifiers, minimal whitespace, and the fewest possible statements.
That said, "more code" does correlate with more bugs and complexity, and "less code" tends to correlate with better readability and performance. So all other things being equal, it's useful to strive for shorter code, but only in the sense of "these simple 30 lines of code do the same as that 100 complex lines of code".

Writing "code golf" solutions are often to do with showing how "clever" you are in getting the job done in the most succinct way even at the expense of readability. Quite often, however, more verbose code including, for example, memoization of function results, can be faster. Code size can matter for performance, smaller blocks of code can fit in the L1 CPU cache but this is an extreme case of optimization and a faster algorithm will most always be better. "Code Golf" code is not like production code - always write for clarity & readability of the solution rather than terseness if anyone, including yourself, ever intend to read that code again.

Whitespace has no effect on performance. So code like that is just silly (or perhaps the golf score was based on the character count?). The number of lines also has no effect, although the number of statements can have an effect. (exception: python, where whitespace is significant)
The effect is complex, however. It's not at all uncommon to discover that you have to add statements to a function in order to improve it's performance.
Still, without knowing anything else, bet that more statements is a larger object file, and a slower program. But more lines doesn't do anything other than make code more readable up to a point (after which adding more lines makes it less readable ;)

I don't believe that Code Golf has any practical significance. In practice, readable code is what counts. Which in itself is a conflicting requirement: readable code should be concise, but still easy to understood.
However, I would like to answer your question yet differently. Usually, there are fast and simple algorithms. However, if the speed is top priority, things can get complex real fast (and the resulting code will be longer). I don't believe that simplicity equals speed.

There are many aspects to performance. Performance can for example be measured by memory footprint, speed of execution, bandwith consumption, framerate, maintainability, supportability and so on. Performance usually means spending as little as possible of the most scarce resource.
When applied to networking, brevity IS performance. If your webserver serves a little javascript snippet on every page, it doesn't exactly hurt to keep the variable names short. Pull up www.google.com and view source!
Some times DRY is not helping performance. An example is that Microsoft has found that they don't want a to loop through an array unless it is bigger than 3 elements.
String.Format has signatures for one, two and three arguments, and then for array.
There are many ways of trading one aspect for another. This is usually called caching.
You can for example trade memory footprint for speed of execution. For example by doing lookup instead of execution. It is just a matter of replacing () with [] in most popular languages. If you plan it so that the spaceship in your game can only go in a fixed number of directions, you can save on trigonometric function calls.
Or you can use a proxy server with a cache for looking up things over a network. DNS servers do this all the time.
Finally, if development team availability is the most scarce resource, clarity of code is the best bet for maintainability performance, even if it doesn't run quite as fast or is quite as interesting or "elegant" in code.

Absolutely not. Code size and performance (however you measure it) are only very loosly connected. To make matter worse whats a neat trick on one chip/compiler/OS may very well be the worse thing you can do in another archictecture.
Its counter-intuitive but a clear well written simple as possible implmentation is often far more efficient than than a devious bag of tricks. Today's optimizing compilers like clear uncomplicated code just as much as humans and complex trickery can cause them to abandon thier best optimizing strategies.

Writing fewer lines of code tends to be better for a bunch of reasons. For example, the less code you have, the less chance for bugs. See for example Paul Graham's essay, "Succinctness is Power"
Notwithstanding that, the level reached by Code Golf is usually far beyond what makes sense. In Code Golf, people are trying to write code that is as short as possible, even if they know that it's less readable.
Efficiency is a much harder thing to decide. I'm guessing that less code is usually more efficient, but there are many cases where this isn't true.
So to answer the real question, why do we even have Code Golf competitions which aim at a low character count, if that's not a very important thing?
Two reasons:
Making code as short as possible means you have to be both clever, and know a language pretty well to find all kinds of tricks. This makes it a fun riddle.
Also, it's the easiest measure to use for a code competition. Efficiency, for example, is very hard to measure, especially using many different languages, especially since some solutions are more efficient in some cases, but less in others (big input vs small). Readability: that's a very personal thing, which often leads to heated debates.
In short, I don't think there is any way of doing Code Golf style competitions without using "shortness of code" as the criterion.

This is from "10 Commandments for Java Developers"
Keep in Mind - "Less is more" is not always better. - Code efficiency is a great thing, but > in many situations writing less lines of code does not improve the efficiency of that code.
This is (probably) true for all programming languages (though in assembly it could be different).

It makes a difference if you're talking about little academic-style algorithms or real software, which can be thousands of lines of code. I'm talking about the latter.
Here's an example where a reasonably well-written program was speeded up by a factor of 43x, and it's code size was reduced by 4x.
"Code golf" is just squeezing code, like cramming undergraduates into a phone booth. I'm talking about reducing code by rewriting it in a form that is declarative, like a domain-specific-language (DSL). Since it is declarative, it maps more directly onto its requirements, so it is not puffed up with code that exists only for implementation's sake. That link shows an example of doing that.
This link shows a way of reducing size of UI code in a similar way.
Good performance is achieved by avoiding doing things that don't really have to be done. Of course, when you write code, you're not intentionally making it do unnecessary work, but if you do aggressive performance tuning as in that example, you'd be amazed at what you can remove.

The point of code golf is to optimise for one thing (source length), at the potential expense of everything else (performance, comprehensibility, robustness). If you accidentally improve performance that's a fluke - if you could shave a character off by doubling the runtime, then you would.
You ask "how come then we don't focus more on performance rather than size", but the question is based on a false premise that programmers focus more on code size than on performance. They don't, "code golf" is a minority interest. It's challenging and fun, but it's not important. Look at the number of questions tagged "code-golf" against the number tagged "performance".
As other people point out, making code shorter often means making it simpler to understand, by removing duplication and opportunities for obscure errors. That's usually more important than running speed. But code golf is a completely different thing, where you remove whitespace, comments, descriptive names, etc. The purpose isn't to make the code more comprehensible.

TDD and the Bayesian Spam Filter problem

It's well known that Bayesian classifiers are an effective way to filter spam. These can be fairly concise (our one is only a few hundred LoC) but all core code needs to be written up-front before you get any results at all.
However, the TDD approach mandates that only the minimum amount of code to pass a test can be written, so given the following method signature:
bool IsSpam(string text)
And the following string of text, which is clearly spam:
"Cheap generic viagra"
The minimum amount of code I could write is:
bool IsSpam(string text)
{
return text == "Cheap generic viagra"
}
Now maybe I add another test message, e.g.
"Online viagra pharmacy"
I could change the code to:
bool IsSpam(string text)
{
return text.Contains("viagra");
}
...and so on, and so on. Until at some point the code becomes a mess of string checks, regular expressions, etc. because we've evolved it instead of thinking about it and writing it in a different way from the start.
So how is TDD supposed to work with this type of situation where evolving the code from the simplest possible code to pass the test is not the right approach? (Particularly if it is known in advance that the best implementations cannot be trivially evolved).

Begin by writing tests for lower level parts of the spam filter algorithm.
First you need to have in your mind a rough design of how the algorithm should be. Then you isolate a core part of the algorithm and write tests for it. In the case of a spam filter that would maybe be calculating some simple probability using Bayes' theorem (I don't know about Bayesian classifiers, so I could be wrong). You build it bottom-up, step by step, until finally you have all the parts of the algorithm implemented and putting them together is simple.
It requires lots of practice to know which tests to write in which order, so that you can do TDD in small enough steps. If you need to write much more than 10 lines of code to pass one new test, you probably are doing something wrong. Start from something smaller or mock some of the dependencies. It's safer err on the smaller side, so that the steps are too small and your progress is slow, than trying to make too big steps and failing badly.
The "Cheap generic viagra" example that you have might be better suited for an acceptance test. It will probably even run very slowly, because you first need to initialize the spam filter with example data, so it won't be useful as a TDD test. TDD tests need to be FIRST (F = Fast, as in many hundreds or thousands tests per second).

Here's my take: Test Driven Development means writing tests before coding. This does not mean that each unit of code for which you write a test needs to be trivial.
Furthermore you still need to plan your software to do its tasks in a sensible and effective way. Simply adding more and more strings doesn't seem to be the best design for this problem.
So in short, you write the code from the smallest piece of functionality possible (and test it) but you don't design your algorithm (in pseudo code or however you like to do it) that way.
Would be interesting to see if you and others agree.

For me, what you call minimum amount of code to pass a test is the whole IsSpam() function. This is consistent with its size (you say only a few hundred LoC).
Alternatively, incremental approach does not claim to code first and think afterwards. You can design a solution, code it and then refine the design with special cases or better algorithm.
Anyway, refactoring does not consist simply in adding new stuff over old one. For me this is a more destructive approach, where you throw away old code for a simple feature and replace it with new code for a refined and more elaborate feature.

You have your unit tests, right?
That means that you can now refactor the code or even rewrite it and use the unit tests to see if you broke something.
First make it work, then make it clean -- It's time for the second step :)

(1) You cannot say that a string "is spam" or "is not spam" in the same way as if you were saying whether a number is prime. This is not black or white.
(2) It is incorrect, and certainly not the aim of TDD, to write string processing functions using just the very examples used for the tests. Examples should represent a kind of values. TDD does not protect against stupid implementations, so you shouldn't pretend that you have no clue at all, so you shouldn't write return text == "Cheap generic viagra".

It seems to me, that with a Bayesian Spam Filter, you should be using existing methods. In particular you would be using Bayes' Theorem, and probably some other probability theory.
In that case, it seems the best approach is to decide on your algorithm, based on these methods, which should either be tried and tested, or possibly experimental. Then, your unit tests should be designed to test whether ispam correctly implements the algorithm you decide on, as well as a basic test that the result is between 0 and 1.
The point is, that your unit tests aren't designed to test whether your algorithm is sensible. You should either know that already, or possibly your program is designed as an experiment, to see if it is sensible.
That's not to say performance of the isspam function isn't important. But it doesn't have to be part of the unit testing. The data could be from feedback from alpha testing, new theoretical results, or your own experiments. In that case, a new algorithm may be needed, and new unit tests are needed.
See also this question about testing random number generators.

The problem here is not with test driven development but with your tests. If you start out developing code against a single test then all your test is doing is specifying a string checking function.
The main idea of TDD is to think about your tests before writing code. You can't exhaustively test a spam filter, but you could come up with a reasonable approximation by tens or hundreds of thousands of test documents. In the presence of that many tests the naive Bayes algorithm is a simpler solution than a hundred thousand line switch statement.
In reality, you may not be able to pass 100% of your unit tests so you just have to try to pass as many as possible. You also have to make sure your tests are sufficiently realistic. If you think about it in this way, test driven development and machine learning have a lot in common.

The problem you are describing is theoretical, that by adding cruft in response to tests you will make a big, messy ball of mud. The thing you are missing is very important.
The cycle is: Red --> Green --> Refactor
You don't just bounce between red and green. As soon as you have tests passing (green) you refactor the production code and the tests. Then you write the next failing test (red).
If you are refactoring, then you are eliminating duplication and messiness and slop as it grows. You will quickly get to the point of extracting methods, building scoring and rating, and probably bringing in external tools. You will do it as soon as it is the simplest thing that will work.
Don't just bounce between red and green, or all your code will be muck. That refactoring step is not optional or discretionary. It is essential.

I don't think checking if a particular string is spam is really a unit test, its more of a customer test. There's an important difference as its not really a red/greed type of thing. In actuality you should probably have a couple hundred test documents. Initially some will be classified as spam, and as you improve the product the classifications will more directly match what you want. So you should make a custom app to load a bunch of test documents, classify them, and then evaluate the scoring overall. When you're done with that customer test, the score will be very bad since you haven't implemented an algorithm. But you now have a means to measure progress going forward, and this is pretty valuable given the amount of learning/changes/experimentation you can expect going forward.
As you implement your algorithm, (and even the customer test in the firsthand) you can still do TDD with real unit tests. The first test for the bayesian filter compononent won't measure if a particular string evaluates as spam, but whether the string is passed through the bayesian filter component appropriately. Your next tests will then focus on how a bayesian filter is implemented (structuring nodes correctly, applying training data, etc).
You do need a vision as to where the product is going, and your tests and implemention should be directed towards that vision. You can not also just add customer tests in a blind manner, you need to add tests with the overall product vision in mind. Any software development goal will have good tests and bad tests you can write.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio