How to develop complex methods with TDD

A few weeks ago I started my first project with TDD. Up to now, I have only read one book about it.
My main concern: How to write tests for complex methods/classes. I wrote a class that calculates a binomial distribution. Thus, a method of this class takes n, k, and p as input, and calculates the corresponding probability. (In fact it does a bit more; that's why I had to write it myself, but let's stick to this description of the class, for ease of the argument.)
What I did to test this method: I copied some tables for different n that I found on the web into my code, randomly picked an entry from a table, fed the corresponding values for n, k, and p into my function, and checked whether the result was close to the value in the table. I repeated this a number of times for every table.
This all works well now, but after writing the test, I had to work for a few hours to really code the functionality. From reading the book, I had the impression that I should not code for longer than a few minutes before the test shows green again. What did I do wrong here? Of course I have broken this task down into a lot of methods, but they are all private.
A related question: Was it a bad idea to pick numbers randomly from the table? In case of an error, I display the random seed used by that run, so that I can reproduce the bug.

I don't agree with people saying that it's OK to test private code, even if you make it into separate classes. You should test the entry points to your application (or your library, if it's a library you're coding). When you test private code, you limit your refactoring possibilities for later (because refactoring your private classes means refactoring your test code, which you should refrain from doing). If you end up re-using this private code elsewhere, then sure, create separate classes and test them, but until you do, assume that You Ain't Gonna Need It.
To answer your question, I think that yes, in some cases it's not a "2 minutes until you go green" situation. In those cases, I think it's OK for the tests to take a long time to go green. But most situations are "2 minutes until you go green" situations. In your case (I don't know squat about binomial distributions), you wrote that you have three arguments, n, k and p. If you keep k and p constant, is your function any simpler to implement? If yes, you should start by creating tests that always use constant k and p. When your tests pass, introduce a new value for k, and then for p.
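To make that concrete, here is a minimal sketch (in Java/JUnit, with a hypothetical BinomialDistribution class and a probability(n, k, p) method, since the original code isn't shown) of a first test that pins k and p to easy values; with k = 0 and p = 0.5 the answer is just 0.5^n, which is trivial to implement and to check by hand:

import static org.junit.jupiter.api.Assertions.assertEquals;
import org.junit.jupiter.api.Test;

class BinomialDistributionTest {

    @Test
    void zeroSuccessesWithAFairCoin() {
        // BinomialDistribution does not exist yet; in TDD this test is written first.
        BinomialDistribution dist = new BinomialDistribution();
        assertEquals(0.5,    dist.probability(1, 0, 0.5), 1e-9);  // 0.5^1
        assertEquals(0.25,   dist.probability(2, 0, 0.5), 1e-9);  // 0.5^2
        assertEquals(0.0625, dist.probability(4, 0, 0.5), 1e-9);  // 0.5^4
    }
}

Once this passes, a new test can vary k, then p, growing the implementation one small step at a time.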

"I had the impression that I should not code longer than a few minutes, until the test shows green again. What did I do wrong here?"
Westphal is correct up to a point.
Some functionality starts simple and can be tested simply and coded simply.
Some functionality does not start out simple. Simple is hard to achieve. Dijkstra (EWD) said that simplicity is not valued because it is so difficult to achieve.
If your function body is hard to write, it isn't simple. This means you have to work much harder to reduce it to something simple.
After you eventually achieve simplicity, you, too, can write a book showing how simple it is.
Until you achieve simplicity, it will take a long time to write things.
"Was it a bad idea to pick randomly numbers from the table?"
Yes. If you have sample data, run your test against all the sample data. Use a loop or something, and test everything you can possibly test.
Don't select one row, randomly or otherwise; select all rows.
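For instance, a single test can sweep the whole table (a sketch assuming the same hypothetical BinomialDistribution class as in the question; the table values below are standard binomial probabilities):

import static org.junit.jupiter.api.Assertions.assertEquals;
import org.junit.jupiter.api.Test;

class BinomialTableTest {

    // Each row is n, k, p, expected probability, taken from a published table.
    private static final double[][] TABLE = {
        {5, 2, 0.5, 0.3125},
        {5, 0, 0.2, 0.32768},
        {3, 3, 0.5, 0.125},
    };

    @Test
    void matchesEveryRowOfTheReferenceTable() {
        BinomialDistribution dist = new BinomialDistribution();
        for (double[] row : TABLE) {
            assertEquals(row[3],
                    dist.probability((int) row[0], (int) row[1], row[2]),
                    1e-4,
                    "n=" + (int) row[0] + ", k=" + (int) row[1] + ", p=" + row[2]);
        }
    }
}

When a row fails, the message tells you exactly which combination broke, with no random seed needed to reproduce it.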

You should TDD using baby steps. Try thinking of tests that will require less code to be written. Then write the code. Then write another test, and so on.
Try to break your problem into smaller problems (you probably used some other methods to have your code completed). You could TDD these smaller methods.
--EDIT - based on the comments
Testing private methods is not necessarily a bad thing. Sometimes they really do contain implementation details, but sometimes they also act like an interface (in which case, you could follow my suggestion in the next paragraph).
One other option is to create other classes (implemented with interfaces that are injected) to take some of the responsibilities (maybe some of those smaller methods), and test them separately, and mock them when testing your main class.
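A hedged sketch of what that could look like, using Mockito and invented names (a FactorialCalculator collaborator injected into a BinomialCoefficient class, neither of which is in the original code):

import static org.junit.jupiter.api.Assertions.assertEquals;
import static org.mockito.Mockito.mock;
import static org.mockito.Mockito.when;
import org.junit.jupiter.api.Test;

// The extracted responsibility, now a first-class, testable interface.
interface FactorialCalculator {
    long factorial(int n);
}

class BinomialCoefficientTest {

    @Test
    void combinesFactorialsFromTheInjectedCalculator() {
        FactorialCalculator factorials = mock(FactorialCalculator.class);
        when(factorials.factorial(4)).thenReturn(24L);
        when(factorials.factorial(2)).thenReturn(2L);

        // BinomialCoefficient (not shown) only combines results; the hard part is mocked away.
        BinomialCoefficient coefficient = new BinomialCoefficient(factorials);
        assertEquals(6, coefficient.choose(4, 2));  // 4! / (2! * 2!) = 6
    }
}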
Finally, I don't see spending more time coding as a really big problem. Some problems are really more complex to implement than to test, and require much thinking time.

You are correct about short, quick refactors; I rarely go more than a few minutes between rebuild/test no matter how complicated the change. It takes a little practice.
The test you described is more of a system test than a unit test though. A unit test tries never to test more than a single method--in order to reduce complexity you should probably break your problem down into quite a few methods.
The system test should probably be done after you have built up your functionality with small unit tests on small, straightforward methods.
Even if the methods are just taking a part of the formula out of a longer method, you get the advantage of readability (the method name should be more readable than the formula part it replaces) and if the methods are final the JIT should inline them so you don't lose any speed.
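As a small illustration (a sketch with invented method names, not the poster's actual formula), the binomial probability can be split into named pieces; the private helpers are effectively final, so the JIT can inline them:

final class Binomial {

    double probability(int n, int k, double p) {
        return binomialCoefficient(n, k) * successPart(k, p) * failurePart(n, k, p);
    }

    // C(n, k) computed as a running product to avoid huge factorials.
    private double binomialCoefficient(int n, int k) {
        double result = 1.0;
        for (int i = 1; i <= k; i++) {
            result *= (n - k + i) / (double) i;
        }
        return result;
    }

    private double successPart(int k, double p) {
        return Math.pow(p, k);
    }

    private double failurePart(int n, int k, double p) {
        return Math.pow(1 - p, n - k);
    }
}

Each small method now has a name that reads like the part of the formula it replaces.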
On the other hand, if your formula isn't that big, maybe you just write it all in one method and test it like you did and take the downtime--rules are made to be broken.

It's difficult to answer your question without knowing a little bit more about the things you wanted to implement. It sounds like they were not easily partitionable into testable parts. Either the functionality works as a whole or it doesn't. If this is the case, it's no wonder it took you hours to implement it.
As to your second question: Yes, I think it's a bad idea to make the test fixture random. Why did you do this in the first place? Changing the fixture changes the test.

Avoid developing complex methods with TDD until you have developed simple methods as building blocks for the more complex methods. TDD would typically be used to create a quantity of simple functionality which can be combined to produce more complex behaviour. Complex methods/classes should always be able to be broken down into simpler parts, but it is not always obvious how, and it is often problem-specific. The test you have written sounds like it might be more of an integration test to make sure all the components work together correctly, although the complexity of the problem you describe only borders on requiring a set of components to solve it. The situation you describe sounds like this:
class A {
    public doLotsOfStuff() // Call doTask1..n
    private doTask1()
    private doTask2()
    private doTask3()
}
You will find it quite hard to develop with TDD if you start by writing a test for the greatest unit of functionality (i.e. doLotsOfStuff()). By breaking the problem down into more manageable chunks and approaching it from the end of simplest functionality you will also be able to create more discrete tests (much more useful than tests that check for everything!). Perhaps your potential solution could be reformulated like this:
class A {
    public doLotsOfStuff() // Call doTask1..n
    public doTask1()
    public doTask2()
    public doTask3()
}
Whilst your private methods may be implementation detail, that is not a reason to avoid testing them in isolation. Just like many problems, a divide-and-conquer approach will prove effective here. The real question is: what size is a suitably testable and maintainable chunk of functionality? Only you can answer that, based on your knowledge of the problem and your own judgement of applying your abilities to the task.

I think the style of testing you have is totally appropriate for code that's primarily a computation. Rather than pick a random row from your known-results table, it'd be better to just hardcode the significant edge cases. This way your tests consistently verify the same thing, and when one breaks you know what it was.
Yes, TDD prescribes short spans from test to implementation, but what you've done is still well beyond the standards you'll find in the industry. You can now rely on the code to calculate what it should, and can refactor/extend the code with a degree of certainty that you aren't breaking it.
As you learn more testing techniques you may find different approaches that shorten the red/green cycle. In the meantime, don't feel bad about it. It's a means to an end, not an end in itself.

Related

Questions on SOLID/TDD

For a number of years now, I have been interested in TDD, but one or two things just didn't click. I am pretty sure it is the usual thoughts most people have when trying. "The examples in the book are wonderful, but my code is a lot more complicated than that. I never have a procedure that does one thing; it will call three others, and they will call three others, and that will get data from the DB... bla bla bla."
A little while ago, I found some videos on SOLID (anyone who is stuck thinking "TDD would be awesome, but..." should find a few videos on SOLID, trust me). Each point on its own was slightly confusing, but by the end everything just fell into place, including how I thought about testing code, and TDD.
I, of course, have a lot of old code that isn't written like this, but I am okay with that, because I now have a better idea of how it should be. And whenever I work on anything, I can take it out and do it properly (even when that means cutting out the small part of a method that needs updating, giving it its own class, and calling that).
That said, I have a few more questions. I would like to know where I might be able to find answers for them, or whether there is a standard.
How much should be tested?
My assumption is all of it. A lot of my functions take input parameters and run a stored procedure. My guess on how to test that would be: with a given set of input parameters, is the correct stored procedure being called, and are the parameters being passed in correctly? Often this will be obvious (sometimes an input array of numbers will be transformed into a comma-separated string). If nothing else, such a test, even if it is not especially valuable, will serve as documentation.
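For illustration, such a test might look something like this (a hedged sketch in Java with Mockito; StoredProcedureRunner, EmployeeService, and the procedure name are all invented, not an existing API):

import static org.mockito.Mockito.mock;
import static org.mockito.Mockito.verify;
import org.junit.jupiter.api.Test;

// A thin seam around the database so the call can be observed in a unit test.
interface StoredProcedureRunner {
    void execute(String procedureName, String parameterValue);
}

class EmployeeServiceTest {

    @Test
    void flattensTheIdArrayIntoACommaSeparatedParameter() {
        StoredProcedureRunner runner = mock(StoredProcedureRunner.class);
        EmployeeService service = new EmployeeService(runner);  // class under test, not shown

        service.updateEmployees(new int[] {3, 7, 11});

        // The test documents both the procedure name and the parameter format.
        verify(runner).execute("usp_UpdateEmployees", "3,7,11");
    }
}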
How do I name things?
This is the old problem in development. Should the class be named like the method would be, UpdateEmployee, or should there be a whole lot of -er classes (EmployeeUpdater, EmployeeGetter, etc.)?
How is IOC generally handled?
This is still fine for now, I am creating interfaces, implementations, setting up IOC, etc.
I can see, though, that pretty soon I am going to have pages and pages of interface/class mappings in my IoC initialization method, or I imagine it splitting into sections, with one method that calls a few other methods, each registering classes (by namespace, or something). Is this how it generally works, or are there smarter ways of managing this?
I recommend reading Clean Code by Robert C Martin
In my view...
How much should be tested?
There is a big difference between how much and how well.
Ultimately it's a judgment call and/or a simple cost/benefit analysis.
Critical apps/code should be tested more thoroughly.
Working pure TDD means your code will be highly tested - easily > 90% coverage, but remember there is a difference between test quality and coverage. You may decide to test more edge cases.
You can get 100% coverage with one test case, but it's pragmatic to test a range of values, e.g. 0, 1, many, and boundaries.
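For example, a single parameterized test can cover exactly those representative values (a sketch in JUnit 5 against a hypothetical BulkDiscount class where the discount is assumed to start at 10 items):

import static org.junit.jupiter.api.Assertions.assertEquals;
import org.junit.jupiter.params.ParameterizedTest;
import org.junit.jupiter.params.provider.CsvSource;

class BulkDiscountTest {

    @ParameterizedTest
    @CsvSource({
        "0, 0.00",    // zero
        "1, 0.00",    // one
        "9, 0.00",    // just below the boundary
        "10, 0.05",   // the boundary itself
        "250, 0.05",  // many
    })
    void discountDependsOnQuantity(int quantity, double expectedRate) {
        assertEquals(expectedRate, new BulkDiscount().rateFor(quantity), 1e-9);
    }
}

Five cases give the same line coverage as one, but the boundary cases are where the bugs usually live.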
How do I name things?
For Java, as an example, look at the standard Java API documentation and see how they do it.
Referring to Clean Code, naming is difficult and deserves real care; refactor if a name no longer fits.
Example Classes from Java's API's
FileFilter
DesktopManager
Names should make it obvious what the class/method/variable does.
Refer to Kent Beck's Four Rules of Simple Design (Express intent)
How is IOC generally handled?
Maybe someone else can expand on this point more, but referring to Extreme Programming, don't use interfaces for the sake of it, but when you need them. If you only have one concrete instance, you probably don't need an interface. Refactor to add interfaces to follow known design patterns when you have a real need for them.
https://www.martinfowler.com/articles/designDead.html
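To the narrower question of pages of mappings: one common approach is to split the registrations into per-area modules and compose them at startup. Here is a sketch assuming Google Guice, with invented interface and class names (most containers, .NET ones included, offer something equivalent, often with convention- or namespace-based scanning):

import com.google.inject.AbstractModule;
import com.google.inject.Guice;
import com.google.inject.Injector;

interface EmployeeRepository {}
class SqlEmployeeRepository implements EmployeeRepository {}
interface Notifier {}
class EmailNotifier implements Notifier {}

// Each area of the application owns its own registrations.
class PersistenceModule extends AbstractModule {
    @Override
    protected void configure() {
        bind(EmployeeRepository.class).to(SqlEmployeeRepository.class);
    }
}

class MessagingModule extends AbstractModule {
    @Override
    protected void configure() {
        bind(Notifier.class).to(EmailNotifier.class);
    }
}

class Bootstrap {
    public static void main(String[] args) {
        // Adding a feature means touching one small module, not one giant method.
        Injector injector = Guice.createInjector(new PersistenceModule(), new MessagingModule());
        EmployeeRepository repository = injector.getInstance(EmployeeRepository.class);
    }
}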

What failure modes can TDD leave behind?

Please note I have not yet 'seen the light' on TDD nor truly got why it has all of the benefits evangelised by its main proponents. I'm not dismissing it - I just have my reservations which are probably born of ignorance. So by all means laugh at the questions below, so long as you can correct me :-)
Can using TDD leave yourself open to unintended side-effects of your implementation? The concept of "the least amount of code to satisfy a test" suggests thinking in the narrowest terms about a particular problem without necessarily contemplating the bigger picture.
I'm thinking of objects that hold or depend upon state (e.g. internal field values). If you have tests which instantiate an object in isolation, initialise that object and then call the method under test, how would you spot that a different method has left behind an invalid state that would adversely affect the behaviour of the first method? If I have understood matters correctly, then you shouldn't rely on order of test execution.
Other failures I can imagine cover the non-closure of streams, non-disposal of GDI+ objects and the like.
Is this even TDD's problem domain, or should integration and system testing catch such issues?
Thanks in anticipation....
Some of this is in the domain of TDD.
Dan North says there is no such thing as test-driven development; that what we're really doing is example-driven development, and the examples become regression tests only once the system under test has been implemented.
This means that as you are designing a piece of code, you consider example scenarios and set up tests for each of those cases. Those cases should include the possibility that data is not valid, without considering why the data might be invalid.
Something like closing a stream can and should absolutely be covered when practicing TDD.
We use constructs like functions not only to reduce duplication but to encapsulate functionality. We reduce side effects by maintaining that encapsulation. I'd argue that we consider the bigger picture from a design perspective, but when it comes to implementing a method, we should be able to narrow our focus to that scope -- that unit of functionality. When we start juggling externalities is when we are likely to introduce defects.
That's my take, anyway; others may see it differently.
TDD is not a replacement for being smart. The best programmers become even better with TDD. The worst programmers are still terrible.
The fact that you are asking these questions is a good sign: it means you're serious about doing programming well.
"The concept of 'the least amount of code to satisfy a test' suggests thinking in the narrowest terms about a particular problem without necessarily contemplating the bigger picture."
It's easy to take that attitude, just like "I don't need to test this; I'm sure it just works." Both are naive.
This is really about taking small steps, not about calling it quits early. You're still going after a great final result, but along the way you are careful to justify and verify each bit of code you write, with a test.
The immediate goal of TDD is pretty narrow: "how can I be sure that the code I'm writing does what I intend it to do?" If you have other questions you want to answer (like, "will this go over well in Ghana?" and "is my program fast enough?") then you'll need different approaches to answer them.
"I'm thinking of objects that hold or depend upon state."
"how would you spot that a different method has left behind an invalid state?"
Dependencies and state are troublesome. They make for subtle bugs that appear at the worst times. They make refactoring and future enhancement harder. And they make unit testing infeasible.
Luckily, TDD is great at helping you produce code that isolates your logic from dependencies and state. That's the second "D" in "TDD".
"The concept of 'the least amount of code to satisfy a test' suggests thinking in the narrowest terms about a particular problem without necessarily contemplating the bigger picture."
It suggests that, but that isn't what it means. What it means is powerful blinders for the moment. The bigger picture is there, but interferes with the immediate task at hand - so focus entirely on that immediate task, and then worry about what comes next. The big picture is present, is accounted for in TDD, but we suspend attention to it during the Red phase. So long as there is a failing test, our job is to get that test to pass. Once it, and all the other tests, are passing, then it's time to think about the big picture, to look at shortcomings, to anticipate new failure modes, new inputs - and write a test to express them. That puts us back into Red, and re-narrows our focus. Get the new test to pass, then set aside the blinders for the next step forward.
Yes, TDD gives us blinders. But it doesn't blind us.
Good questions.
Here's my two cents, based on my personal experience:
"Can using TDD leave yourself open to unintended side-effects of your implementation?"
Yes, it does. TDD is not a "fully-fledged" option on its own. It should be used along with other techniques, and you should definitely bear the big picture in mind (whether you are responsible for it or not).
"I'm thinking of objects that hold or depend upon state (e.g. internal field values). If you have tests which instantiate an object in isolation, initialise that object and then call the method under test, how would you spot that a different method has left behind an invalid state that would adversely affect the behaviour of the first method? If I have understood matters correctly, then you shouldn't rely on order of test execution."
Every test method should execute with no regard of what was executed before, or will be executed after. If that's not the case then something's wrong (from a TDD perspective on things).
Talking about your example, when you write a test you should know in reasonable detail what your inputs will be and what the expected outputs are. You start from a defined input, in a defined state, and you check for a desired output. You're not 100% guaranteed that the same method in another state will do its job without errors. However, the "unexpected" should be reduced to a minimum.
If you design the class you should definitely know if two methods can change some shared internal state and how; and more important, if this should really happen at all, or if there is a problem about low cohesion.
Anyway, a good design at the "TDD" level doesn't necessarily mean that your software is well built; you need more, as Uncle Bob explains well here:
http://blog.objectmentor.com/articles/2007/10/17/tdd-with-acceptance-tests-and-unit-tests
Martin Fowler wrote an interesting article about mocks vs. stubs which covers some of the topics you are talking about:
http://martinfowler.com/articles/mocksArentStubs.html#ClassicalAndMockistTesting

How do you decide which parts of the code shall be consolidated/refactored next?

Do you use any metrics to make a decision which parts of the code (classes, modules, libraries) shall be consolidated or refactored next?
I don't use any metrics which can be calculated automatically.
I use code smells and similar heuristics to detect bad code, and then I'll fix it as soon as I have noticed it. I don't have any checklist for finding problems - mostly it's a gut feeling that "this code looks messy", followed by reasoning about why it is messy and figuring out a solution. Simple refactorings like giving a more descriptive name to a variable or extracting a method take only a few seconds. More intensive refactorings, such as extracting a class, might take up to an hour or two (in which case I might leave a TODO comment and refactor it later).
One important heuristic that I use is the Single Responsibility Principle. It makes the classes nicely cohesive. In some cases I use the size of the class in lines of code as a cue to look more carefully at whether a class has multiple responsibilities. In my current project I've noticed that when writing Java, most of the classes will be less than 100 lines long, and often when the size approaches 200 lines, the class does many unrelated things and it is possible to split it up, so as to get more focused, cohesive classes.
Each time I need to add new functionality I search for already existing code that does something similar. Once I find such code I think of refactoring it to solve both the original task and the new one. Surely I don't decide to refactor each time - most often I reuse the code as it is.
I generally only refactor "on-demand", i.e. if I see a concrete, immediate problem with the code.
Often when I need to implement a new feature or fix a bug, I find that the current structure of the code makes this difficult, such as:
too many places to change because of copy&paste
unsuitable data structures
things hardcoded that need to change
methods/classes too big to understand
Then I will refactor.
I sometimes see code that seems problematic and which I'd like to change, but I resist the urge if the area is not currently being worked on.
I see refactoring as a balance between future-proofing the code, and doing things which do not really generate any immediate value. Therefore I would not normally refactor unless I see a concrete need.
I'd like to hear about experiences from people who refactor as a matter of routine. How do you stop yourself from polishing so much you lose time for important features?
We use cyclomatic complexity to identify the code that needs to be refactored next.
I use Source Monitor and routinely refactor methods when the complexity metric goes above around 8.0.

Adversarial/Naive Pairing with TDD: How effective is it?

A friend of mine was explaining how they do ping-pong pairing with TDD at his workplace and he said that they take an "adversarial" approach. That is, when the test writing person hands the keyboard over to the implementer, the implementer tries to do the bare simplest (and sometimes wrong thing) to make the test pass.
For example, if they're testing a GetName() method and the test checks for "Sally", the implementation of the GetName method would simply be:
public string GetName() {
    return "Sally";
}
Which would, of course, pass the test (naively).
He explains that this helps eliminate naive tests that check for specific canned values rather than testing the actual behavior or expected state of components. It also helps drive the creation of more tests and ultimately better design and fewer bugs.
It sounded good, but in a short session with him, it seemed like it took a lot longer to get through a single round of tests than otherwise and I didn't feel that a lot of extra value was gained.
Do you use this approach, and if so, have you seen it pay off?
It can be very effective.
It forces you to think more about what test you have to write to get the other programmer to write the correct functionality you require.
You build up the code piece by piece, passing the keyboard frequently.
It can be quite tiring and time-consuming, but I have found it's rare that I have had to come back and fix a bug in any code that has been written like this.
I've used this approach. It doesn't work with all pairs; some people are just naturally resistant and won't give it an honest chance. However, it helps you do TDD and XP properly. You want to try and add features to your codebase slowly. You don't want to write a huge monolithic test that will take lots of code to satisfy. You want a bunch of simple tests. You also want to make sure you're passing the keyboard back and forth between your pairs regularly so that both pairs are engaged. With adversarial pairing, you're doing both. Simple tests lead to simple implementations, the code is built slowly, and both people are involved throughout the whole process.
I like it some of the time, but I don't use that style the entire time; it acts as a nice change of pace.
I've found it a useful tool with beginners to introduce how the tests can drive the implementation though.
(First off, adversarial TDD should be fun. It should be an opportunity for teaching. It shouldn't be an opportunity for human dominance rituals. If there isn't space for a bit of humor then leave the team. Sorry. Life is too short to waste in a negative environment.)
The problem here is badly named tests. If the test looked like this:
Thing foo = new Thing("Sally");
assertEquals("Sally", foo.getName());
Then I bet it was named "testGetNameReturnsNameField". This is a bad name, but not immediately obviously so. The proper name for this test is "testGetNameReturnsSally". That is what it does. Any other name is lulling you into a false sense of security. So the test is badly named. The problem is not the code. The problem is not even the test. The problem is the name of the test.
If, instead, the tester had named the test "testGetNameReturnsSally", then it would have been immediately obvious that this is probably not testing what we want.
It is therefore the duty of the implementor to demonstrate the poor choice of the tester. It is also the duty of the implementor to write only what the tests demand of them.
So many bugs in production occur not because the code did less than expected, but because it did more. Yes, there were unit tests for all the expected cases, but there were no tests for all the special edge cases that the code handled because the programmer thought "I'd better just do this too, we'll probably need that" and then forgot about it. That is why TDD works better than test-after. That is why we throw code away after a spike. The code might do all the things you want, but it probably does some things you thought you needed, and then forgot about.
Force the test writer to test what they really want. Only write code to make tests pass and no more.
RandomStringUtils is your friend.
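That is, the test can generate the expected value so a canned return value cannot pass (a sketch assuming Apache Commons Lang's RandomStringUtils and the same hypothetical Thing class as above):

import static org.junit.jupiter.api.Assertions.assertEquals;
import org.apache.commons.lang3.RandomStringUtils;
import org.junit.jupiter.api.Test;

class ThingTest {

    @Test
    void getNameReturnsWhateverNameTheThingWasCreatedWith() {
        String name = RandomStringUtils.randomAlphabetic(8);
        Thing thing = new Thing(name);
        assertEquals(name, thing.getName());  // return "Sally"; can no longer pass
    }
}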
It is based on the team's personality. Every team has a personality that is the sum of its members. You have to be careful not to practice passive-aggressive implementations done with an air of superiority. Some developers are frustrated by implementations like
return "Sally";
This frustration will lead to an unsuccessful team. I was among the frustrated and did not see it pay off. I think a better approach is more verbal communication, making suggestions about how a test might be better written.

TDD and the Bayesian Spam Filter problem

It's well known that Bayesian classifiers are an effective way to filter spam. These can be fairly concise (ours is only a few hundred LoC) but all the core code needs to be written up front before you get any results at all.
However, the TDD approach mandates that only the minimum amount of code to pass a test can be written, so given the following method signature:
bool IsSpam(string text)
And the following string of text, which is clearly spam:
"Cheap generic viagra"
The minimum amount of code I could write is:
bool IsSpam(string text)
{
    return text == "Cheap generic viagra";
}
Now maybe I add another test message, e.g.
"Online viagra pharmacy"
I could change the code to:
bool IsSpam(string text)
{
    return text.Contains("viagra");
}
...and so on, and so on. Until at some point the code becomes a mess of string checks, regular expressions, etc. because we've evolved it instead of thinking about it and writing it in a different way from the start.
So how is TDD supposed to work with this type of situation where evolving the code from the simplest possible code to pass the test is not the right approach? (Particularly if it is known in advance that the best implementations cannot be trivially evolved).
Begin by writing tests for lower level parts of the spam filter algorithm.
First you need to have in your mind a rough design of how the algorithm should be. Then you isolate a core part of the algorithm and write tests for it. In the case of a spam filter that would maybe be calculating some simple probability using Bayes' theorem (I don't know about Bayesian classifiers, so I could be wrong). You build it bottom-up, step by step, until finally you have all the parts of the algorithm implemented and putting them together is simple.
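For example (a hedged sketch in Java with invented names, not the poster's code), the first unit test might target only the rule for combining two word probabilities, long before any tokenizing, training, or IsSpam() wiring exists:

import static org.junit.jupiter.api.Assertions.assertEquals;
import org.junit.jupiter.api.Test;

class SpamProbabilityTest {

    @Test
    void combinesTwoWordProbabilitiesWithBayesRule() {
        // p = (p1 * p2) / (p1 * p2 + (1 - p1) * (1 - p2))
        SpamProbability combiner = new SpamProbability();  // class under test, not shown
        assertEquals(0.972973, combiner.combine(0.9, 0.8), 1e-6);
    }
}

A test like this goes green in minutes, and the end-to-end IsSpam() behaviour is left to slower acceptance tests fed with real example messages.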
It requires lots of practice to know which tests to write in which order, so that you can do TDD in small enough steps. If you need to write much more than 10 lines of code to pass one new test, you are probably doing something wrong. Start from something smaller or mock some of the dependencies. It's safer to err on the smaller side, so that the steps are too small and your progress is slow, than to try to take steps that are too big and fail badly.
The "Cheap generic viagra" example that you have might be better suited for an acceptance test. It will probably even run very slowly, because you first need to initialize the spam filter with example data, so it won't be useful as a TDD test. TDD tests need to be FIRST (F = Fast, as in many hundreds or thousands tests per second).
Here's my take: Test Driven Development means writing tests before coding. This does not mean that each unit of code for which you write a test needs to be trivial.
Furthermore you still need to plan your software to do its tasks in a sensible and effective way. Simply adding more and more strings doesn't seem to be the best design for this problem.
So in short, you write the code from the smallest piece of functionality possible (and test it) but you don't design your algorithm (in pseudo code or however you like to do it) that way.
Would be interesting to see if you and others agree.
For me, what you call minimum amount of code to pass a test is the whole IsSpam() function. This is consistent with its size (you say only a few hundred LoC).
Alternatively, an incremental approach does not mean coding first and thinking afterwards. You can design a solution, code it, and then refine the design with special cases or a better algorithm.
Anyway, refactoring does not consist simply of adding new stuff on top of the old. For me it is a more destructive approach, where you throw away old code for a simple feature and replace it with new code for a refined and more elaborate feature.
You have your unit tests, right?
That means that you can now refactor the code or even rewrite it and use the unit tests to see if you broke something.
First make it work, then make it clean -- It's time for the second step :)
(1) You cannot say that a string "is spam" or "is not spam" in the same way you would say whether a number is prime. This is not black or white.
(2) It is incorrect, and certainly not the aim of TDD, to write string processing functions using just the very examples used for the tests. Examples should represent a class of values. TDD does not protect against stupid implementations, so you shouldn't pretend that you have no clue at all, and you shouldn't write return text == "Cheap generic viagra".
It seems to me, that with a Bayesian Spam Filter, you should be using existing methods. In particular you would be using Bayes' Theorem, and probably some other probability theory.
In that case, it seems the best approach is to decide on your algorithm, based on these methods, which should either be tried and tested, or possibly experimental. Then, your unit tests should be designed to test whether IsSpam correctly implements the algorithm you decide on, as well as a basic test that the result is between 0 and 1.
The point is, that your unit tests aren't designed to test whether your algorithm is sensible. You should either know that already, or possibly your program is designed as an experiment, to see if it is sensible.
That's not to say performance of the IsSpam function isn't important. But it doesn't have to be part of the unit testing. The data could come from feedback from alpha testing, new theoretical results, or your own experiments. In that case, a new algorithm may be needed, and new unit tests are needed.
See also this question about testing random number generators.
The problem here is not with test driven development but with your tests. If you start out developing code against a single test then all your test is doing is specifying a string checking function.
The main idea of TDD is to think about your tests before writing code. You can't exhaustively test a spam filter, but you could come up with a reasonable approximation with tens or hundreds of thousands of test documents. In the presence of that many tests, the naive Bayes algorithm is a simpler solution than a hundred-thousand-line switch statement.
In reality, you may not be able to pass 100% of your unit tests so you just have to try to pass as many as possible. You also have to make sure your tests are sufficiently realistic. If you think about it in this way, test driven development and machine learning have a lot in common.
The problem you are describing is theoretical, that by adding cruft in response to tests you will make a big, messy ball of mud. The thing you are missing is very important.
The cycle is: Red --> Green --> Refactor
You don't just bounce between red and green. As soon as you have tests passing (green) you refactor the production code and the tests. Then you write the next failing test (red).
If you are refactoring, then you are eliminating duplication and messiness and slop as it grows. You will quickly get to the point of extracting methods, building scoring and rating, and probably bringing in external tools. You will do it as soon as it is the simplest thing that will work.
Don't just bounce between red and green, or all your code will be muck. That refactoring step is not optional or discretionary. It is essential.
I don't think checking whether a particular string is spam is really a unit test; it's more of a customer test. There's an important difference, as it's not really a red/green type of thing. In actuality you should probably have a couple hundred test documents. Initially some will be classified as spam, and as you improve the product the classifications will more directly match what you want. So you should make a custom app to load a bunch of test documents, classify them, and then evaluate the scoring overall. When you're done with that customer test, the score will be very bad since you haven't implemented an algorithm yet. But you now have a means to measure progress going forward, and this is pretty valuable given the amount of learning/changes/experimentation you can expect going forward.
As you implement your algorithm (and even the customer test harness itself) you can still do TDD with real unit tests. The first test for the Bayesian filter component won't measure whether a particular string evaluates as spam, but whether the string is passed through the Bayesian filter component appropriately. Your next tests will then focus on how the Bayesian filter is implemented (structuring nodes correctly, applying training data, etc.).
You do need a vision of where the product is going, and your tests and implementation should be directed towards that vision. You also cannot just add customer tests blindly; you need to add tests with the overall product vision in mind. Any software development goal will have good tests and bad tests you can write.

Resources