naive bayesian spam filter question - algorithm

I am planning to implement spam filter using Naive Bayesian classification model.
Online I see a lot of info on Naive Bayesian classification, but the problem is its a lot of mathematical stuff, than clearly stating how its done. And the problem is I am more of a programmer than a mathematician (yes I had learnt Probability and Bayesian theorem back in school, but out of touch for a long long time, and I don't have luxury of learning it now (Have nearly 3 weeks to come-up with a working prototype)).
So if someone can explain or point me to location where its explained for programmers than a mathematician, it would be a great help.
PS: By the way I have to implement it in C, if you want to know. :(
Regards,
Microkernel

The book Programming Collective Intelligence has chapter that covers this and other methods. The chapter (#6) can be understood without reference to previous chapters, is written clearly, and discusses only the minimal mathematics necessary to get the job done.

You could try this website. It's got some source code.

I would highly recommend Andrew Moore's tutorials and I think you should start with this one.

You could also take a look at POPFile, an open source spam filter engine.

Have you looked at dspam?
http://dspam.irontec.com/faq.shtml#1.0
http://www.nuclearelephant.com/

Related

Material and Information to improve algorithmic knowledge

Lately I have been stuck on improving my algorithmic skills. And at this point I am finding myself out of good material for solving grid problems based on dfs and bsf. I somehow managed to do http://www.spoj.pl/problems/POUR1/ with a brute force logic but i recently go-ogled to find out that the problem can be done by bfs. But I can't figure out exactly how to go about it. Can someone please provide some text to read or some kind of explanation to the above mentioned problem so I can add this to my skill set. It would be extremely kind if you could even help me out for these techniques in problems like these http://www.codechef.com/problems/MMANT/ .please help as soon as possible I am really stuck in these kind of problems ant can't move on. It would also be really kind if u could provide a list of good questions about Binary Indexed Trees and segment trees and some more examples of their usage.
Thanks for the help!! :)
One resource I've found useful is The Algorithmist:
The Algorithmist is a resource dedicated to anything algorithms - from
the practical realm, to the theoretical realm. There are also links
and explanation to problemsets.
Also The Algorithm Design Manual by Steve Skiena is extremely useful, especially the second part.

Learning AI by practice ( Perceptrons, Neural networks and Bayesian AI)

I'm about to take a course in AI and I want to practice before. I'm using a book to learn the theory, but resources and concrete examples in any language to help with the practice would be amazing. Can anyone recommend me good sites or books with plenty of examples and tutorials ?
Thanks !
Edit: My course will deal with Perceptrons, Neural networks and Bayesian AI.
Really depends on what area you want to specialize on. There is the startup - resource : is
here. I learned about neural nets from their example. Can you elaborate what kind of AI it should be?
Ah and i forgot: this link is a very nice forum where you can look at problems other people have and learn from that.
Cheers.
My advice would be to learn it by trying to implement the various types of learners yourself. See if you can find yourself a dataset related to some interest you have (sports, games, health, etc.) and then try and create a learner to do some kind of classification (predicting a winner in a sports game, learning how to classify backgammon positions, detecting cancer based on patient data, etc.) using different methods. Start with Decision Trees if that's part of your future course work since they're relatively simple, then move on to neural networks.
Here is a set of sources, each one of which i recommend highly--for the quality of the explanation, for the quality of the code, and for the 'completeness' of the algorithm demo.
Least-Squares Regression
(Python)
k-means clustering (Python)
Multi-Layer Perceptron (Python)
Hopfield Network (Python)
Decision Tree (ID3 & C4.5)
In addition, the excellent textbook Elements of Statistical Learning by Hastie, et al. is actually free to download. The authors have an R package that accompanies this textbook which contains all of the code. This book includes detailed discussion of most (if not all) of the major classes of ML algorithms, with specific examples that refer to working code and 'real-world' data.
Personally I would recommend this M.Tim.Jones book on AI.
Has many many topics on AI and almost every type of AI discussion is followed by C example code. Very pragmatic book on AI indeed !!
Russel & Norvig have a good survey of the broad field.

Improve algorithmic thinking [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.
Closed 7 years ago.
Improve this question
I was thinking about ways to improve my ability to find algorithmic solutions to a problem.I have thought of solving math problems from various math sectors such as discrete mathematics or linear algebra.After "googling" a bit I have read an article that claimed the need of learning game programming in order to achieve this and it seems logical to me.
Do you have/had the same concerns as me or do you have any ideas on this?I am looking forward to hear them.
Thank you all, in advance.
P.S.1:I want to say that I already know about programming and how to program(although I am at amateur level:-) ) and I just want to improve at the specific issue, NOT to start learning it
P.S.2:I think that its a useful topic for future reference so I checked the community wiki box.
Solve problems on a daily basis. Watch traffic lights and ask yourself, "How can these be synced to optimize the flow of traffic? Or to optimize the flow of pedestrians? What is the best solution for both?". Look at elevators and ask yourself "Why should these elevators use different rules than the elevators in that other building I visited yesterday? How is it actually implemented? How can it be improved?".
Try to see a problem everywhere, even if it is solved already. Reflect on the solution. Ask yourself why your own superior solution probably isn't as good as the one you can see - what are you missing?
And so on. Every day. All of the time.
The idea is that almost everything can be viewed as an algorithm (a goal that has some kind of meaning to somebody, and a method with which to achieve it). Try to have that in mind next time you watch a gameshow on TV, or when you read the news coverage of the latest bank robbery. Ask yourself "What is the goal?", "Whose goal is it?" and "What is the method?".
It can easily be mistaken for critical thinking but is more about questioning your own solutions, rather than the solutions you try to understand and improve.
First of all, and most important: practice. Think of solutions to everything everytime. It doesn't have to be on your computer, programming. All algorithms will do great. Like this: when you used to trade cards, how did you compare your deck and your friend's to determine the best way for both of you to trade? How can you define how many trades can you do to do the maximum and yet don't get any repeated card?
Use problem databases and online judges like this site, http://uva.onlinejudge.org/index.php, that has hundreds of problems concerning general algorithms. And you don't need to be an expert programmer at all to solve any of them. What you need is a good ability with logic and math. There, you can find problems from the simplest ones to the most challenging. Most of them come from Programming Marathons.
You can, then, implement them in C, C++, Java or Pascal and submit them to the online judge. If you have a good algorithm, it will be accepted. Else, the judge will say your algorithm gave the wrong answer to the problem, or it took too long to solve.
Reading about algorithms helps, but don't waste too much time on it... Reading won't help as much as trying to solve the problems by yourself. Maybe you can read the problem, try to figure out a solution for yourself, compare with the solution proposed by the source and see what you missed. Don't try to memorize them. If you have the concept well learned, you can implement it anywhere. Understanding is the hardest part for most of them.
Polya's "How To Solve It" is a great book for thinking about how to solve mathematical problems and do proofs, and I'd recommend it for anyone who does problem solving.
But! It doesn't really address the excitement that happens when the real world provides input to your system, via channel noise, user wackiness, other programs grabbing resources, etc. For that it is worth looking at algorithms that get applied to real-world input (obligatory and deserved nod to Knuth's collection), and systems which are fairly robust in the face of same (TCP, kernel internals). Part of coming up with good algorithmic solutions is to know what already exists.
And alongside reading all that, of course practice practice practice.
You should check out Mathematics and Plausible Reasoning by G. Polya. It is a rare math book, which actually deals with the thought process involved in making mathematical discoveries. I think it is the same thought process that is involved in coming up with algorithms.
The saying "practice makes perfect" definitely applies. I'm tutoring a friend of mine in programming, and I remind him that "if you don't know how to ride a bike, you could read every book about it but it doesn't mean you'll be better than Lance Armstrong tomorrow - you have to practice".
In your case, how about trying the problems in Project Euler? http://projecteuler.net
There are a ton of problems there, and for each one you could practice at developing an algorithm. Once you get a good-enough implementation, you can access other people's solutions (for a particular problem) and see how others have done it. Don't think of it as math problems, but rather as problems in creating algorithms for solving math problems.
In university, I actually took a class in algorithm design and analysis, and there is definitely a lot of theory behind it. You may hear people talking about "big-O" complexity and stuff like that - there are quite a lot of different properties about algorithms themselves which can lead to greater understanding of what constitutes a "good" algorithm. You can study quite a bit in this regard as well for the long-term.
Check some online judges, TopCoder (algorithm tutorials). Take some algorithms book (CLRS, Skiena) and do harder exercises. Practice much.
I would suggest this path for you :
1.First learn elementary parts of a language.
2.Then learn about some basic maths.
3.Move to topcoder div2 easy problems.Usually if you cannot score 250 pts. in any given day,then it means you need a lot of practise,keep practising.
4.Now's the time to learn some tools of a programmer,take a good book like Algorithm Design Manual by Steven Skienna and learn about dynamic programming and greedy approach.
5.Now move to marathons,don't be discouraged if you cannot solve it quickly.Improvement will not happen overnight,you will have to patiently keep on working hard.
6.Continue step 5 from now on and you will be a better programmer.
Learning about game programming will probably lead you to good algorithms for game programming, but not necessarily to better algorithms in general.
It's a good start, but I think that the best way to learn and apply algorithmic knowledge is
Learn about good algorithms that currently exist for your area of interest
Expand your knowledge by viewing other areas; for example, what kinds of algorithms are
required when working on genetic analysis? What's the best approach for determining
run-off potential as it relates to flooding?
Read about problems in other domains and attempt to use the algorithms that you're
familiar with to see if they fit. If they don't try to break the problem down and see if
you can come up with your own algorithm.
A few more books worth reading (in no particular order):
Aha! Insight (Martin Gardner)
Any of the Programming Pearls books (Jon Bentley)
Concrete Mathematics (Graham, Knuth, and Patashnik)
A Mathematical Theory of Communication (Claude Shannon)
Of course, most of those are just samples -- other books and papers by the same authors are usually quite good as well (e.g. Shannon wrote a lot that's well worth reading, and far too few people give it the attention it deserves).
Read SICP / Structure and Interpretation of Computer Programs and work all the problems; then read The Art of Computer Programming (all volumes), working all the exercises as you go; then work through all the problems at Project Euler.
If you aren't damned good at algorithms after that, there is probably no hope for you. LOL!
P.S. SICP is available freely online, but you have to buy AoCP (get the international, not-for-release-in-north-america edition used for 30 USD). And I haven't done this yet myself (I'm trying when I have free time).
I can recommend the book "Introductory Logic and Sets for Computer Scientists" by Nimal Nissanke (Addison Wesley). The focus is on set theory, predicate logic etc. Basically the maths of solving problems in code if you will. Good stuff and not too difficult to work through.
Good luck...Kevin
Great
how about trying the problems in Project Euler? http://projecteuler.net
There are a ton of problems there, and for each one you could practice at developing an algorithm. Once you get a good-enough implementation, you can access other people's solutions (for a particular problem) and see how others have done it. Don't think of it as math problems, but rather as problems in creating algorithms for solving math problems
Ok, so to sum up the suggestions:
The most effective way to improve this ability is to solve problem as frequently as possible.Either real world problems(such as the elevators "algorithm" which is already suggested) or exercises from books like CLRS(great, I already own it :-)).But I didn't see comments about maths and I don't know what to suppose(if you agree or not).:-s
The links were great.I will definitely use them.I also think that it will be a good exercise to solve problems from national/international informatics contests or to read the way a mathematician proves a theorem.
Thank you all again.Feel free to suggest more, although I am already satisfied with the solutions mentioned.

Having a bit of trouble with self-learning from Cormen et al's algo book

I started reading Intro to Algorithms by Cormen et al like 3 weeks ago on my free time. I finished the second chapter and have been trying out the exercises for quite a while. I find them a bit difficult.
Is this normal? Should I finish all the exercises before moving on? Or is it alright if I solve the ones I can and move on to the next chapters, possibly coming back to the exercises I can't figure out right now?
If anyone out there has had experience with this book, can you tell me how it was for you? I'm a bit discouraged on not being able to solve quite a few of the exercises here.
That book was hard for me too. We used it at the university I attended and I often had to refer to other sources to get simpler explanations when I found CLRS a little over my head. Once I got the Wikipedia explanation straight in my head, and a code sample working (which CLRS often lacks), I found that I was able to go back to the text and make sense of it.
Don't worry about doing all of the exercises. Even the super-elite MIT students don't have to do them all. Do what you can do and move on. If you need a concept in the next chapter that you had glossed over, it will still be there for you to backtrack to.
MIT OpenCourseWare has also made available the old lectures for Introduction to Algorithms (SMA 5503).
Good for you for diving into CLRS by yourself. You're a braver man than me. I used the book for a grad algorithms course I took last semester, and I had a hard time just finishing the problem sets assigned for the course. Completing all of the exercises would be a truly Herculean effort.
I'd recommend tackling the chapters that interest you most and those that you don't find to difficult. The beginning of the book, if I remember correctly, is one of the harder parts, diving into the mathematical background of a lot of different areas of algorithms. Chapter 5 is especially difficult unless you know a fair bit of probability theory. Also, starred sections and problems are significantly more challenging than the surrounding material (like 21.4, which contains material our professor confessed to being unable to prove in class). Finally at the end of the book, there is just a survey of miscellaneous topics; you can just look at those that interest you, since there are entire books written about each of those topics if you want to learn more about them.
I hope this helps, and most importantly, don't get too discouraged! This is the seminal book on algorithms for a reason.
It's a difficult book, used by one of the pre-eminent technical universities in the world. It's no surprise that it's challenging. There are a LOT of exercises of varying difficulty. It's a noble goal to attempt all of them.
Aren't the course materials on-line? It'd be interesting to see if students taking the course for credit do all the exercises.
I wouldn't be discouraged. Keep plugging, even if you have to pass on some of the exercises. There's nothing saying that you have to master it in one pass, either. Go through, take in what you can, and re-do if necessary. You might find that the extra context helps.
The lectures are available on iTunes if you find that helps.
The important thing is to set a deadline and make steady progress. Good luck.
The problem with not doing all the problems is that when you are self-studying, you really don't have a good gage for how much you should be able to answer.
You can look at the course assignments online, I would recommend that for figuring out problem sets to get done.
I am learning Algorithms on my own from the CLRS book in 2020. Regardless of what people tell you about solutions manuals in general, it is advisable to get "good" solutions manuals if you are self studying with the book.
The two sets of solutions I recommend are (1) The official instructor manual and (2) solutions by Rutger's university students Michelle Bodnar and Andrew Lohr. When one of those solutions is unclear, I simply refer to the other one. If you get stuck at a problem, then give yourself a few minutes to solve it. If you don't get the answer, then use the solutions manuals. You can always test yourself on the problems from other text books or leetcode to see how much you can do on your own vs just following a solutions manual.
I won't post the solutions manuals here. I suggest that you search for them online. The Rutgers one is easily available and is legal. The official one is restricted to instructors only and is hard to get. You might be able to pay obscure online sellers/hackers to get the official one for you. Use a preloaded visa or master card gift card to make that purchase. Make sure that the card is accepted in the sellers country.
Chapter 2 was doable because I used Youtube to understand algorithms and time complexity when it was not clearly explained in the CLRS book, which is quite often. The solutions manuals also helped a little bit.
Chapter 3 is hard and I don't know if I will be able to get past this one. I might have to switch to another book, perhaps the one by Tamassia. I had studied elementary algebra, set theory, functions, probability, mathematical series and calculus a few years ago. But, I remember only a few of those things. So, it is difficult to understand Chapter 3 and move ahead.
In general, it is a comprehensive and rigorous book. However, it is bad because of these:
One-based array indexing (instead of the usual 0-based) - so everytime you translate the algorithm present in the book into code you have to either +1 or -1 and / or use < instead of <= or the other way around and so on.
Bad variable naming in pseudocode - instead of lo, hi or left, right you get p and q.
The fact that is is very rigorous may get you confused in the little details and usually you can miss the overall idea of an algorithm.
It is a famous book because many scientific papers refer to this book in their references.
Otherwise, it is ok.

Where can I learn more about "ant colony" optimizations?

I've been reading things here and there for a while now about using an "ant colony" model as a heuristic approach to optimizing various types of algorithms. However, I have yet to find an article or book that discusses ant colony optimizations in an introductory manner, or even in a lot of detail. Can anyone point me at some resources where I can learn more about this idea?
On the off chance that you know German (yes, sorry …), a friend and I have written an introduction with code about this subject which I myself find quite passable. The text and code uses the example of TSP to introduce the concept.
Even if you don't know German, take a look at the code and the formulas in the text, this might still serve.
link Wikipedia actually got me started. I read the article and got to coding. I was solving a wicked variation of the traveling salesman problem. It's an amazing meta-heuristic. Basically, any type of search problem that can be put into a graph (nodes & edges, symmetric or not) can be solved with an ACO.
Look out for the difference between global and local pheromone trails. Local pheromones discourage one generation of ants from traversing the same path. They keep the model from converging. Global pheromones are attractors and should snag at least one ant per generation. They encourage optimum paths over several generations.
The best suggestion I have, is simply to play with the algorithm. Setup a basic TSP solver and some basic colony visualization. Then have some fun. Working with ants, conceptually, is way cool. You program their basic behaviors and then set them loose. I even grow fond of them. :)
ACOs are a greedier form of genetic algorithms. Play with them. Alter their communicative behaviors and pack behavior. You'll rapidly begin to see network / graph programming in an entirely different way. That's their biggest benefit, not the recipe that most people see it as.
You just gotta play with it to really understand it. Books & research papers only give a general sky-high understanding. Like a bike, you just gotta start riding. :)
ACOs, by far, are my favorite abstraction for graph problems.
National Geographic wrote an interesting article awhile back talking about some of the theories.
The best resource for these topics is Google scholar. Ive been working on Ant Colony Optimization algorithms for a while, here are some good papers:
Ant Colony Optimization - A New Metaheuristic
Ant Colony Optimization - Artificial Ants as a Computational Intelligence Technique
Just search for "Ant Colony" on google scholar.
Also, search for papers published by Marco Dorigo.
I am surprised nobody has mentioned the bible of ACO:
Marco Dorigo & Thomas Stützle: Ant Colony Optimization
This book is written by the author of ACO and it is highly readable. You can take it to the beach and have fun reading it. But it is also the most complete resource of all, great as a reference when implementing the thing.
You can read some excerpts on Google Books
Another great source of wisdom is the ACO Homepage
See for example this article on scholarpedia.
There is also discussion here in the What is the most efficient way of finding a path through a small world graph? question.
At first glance this seems to be closely related to (or prehaps a special case of) the Metropolis algorithm. So that's another possible direction for searching.
Addition: This PDF file includes a reference to the original Metropolis paper from 1953.
Well, i found the Homepage of Eric Rollins and his different Implementations (Haskell, Scala, Erlang,...) of a ACO Algorithm helpfull.
And also the Book from Enrique Alba, titled "Parallel Metaheuristics: A New Class of Algorithms" where you can find a whole chapter of explanation about ACO Algorithms and their different usages.
Hth

Resources