Are there any "intelligent" or "learning" engines out there that are able to identify "evil" phrases in texts (maybe something like a learning spam filter, e.g. the one used in Thunderbird)?
For example, if I want to filter texts containing email addresses:
asdasd asd as d dgfdgfdgfdg sadasd(at)asfsdf.com
At first the tool wouldn't recognize this as an email address... but if the user "taught" the tool several times (clicked a "text contains an email address" button, for example) that text containing phrases like "xxxxx(at)xxxxx.xx" is suspicious, it would "learn" that it should mark such texts automatically in the future...
Question: Is there anything like this on the market? I found some libs (like SpamAssassin, etc.) but these are "specialized" in emails...
The general idea you are talking about is a Bayesian filter. Maybe that will help you in your searches.
Edit: A few other examples:
Python
Java
.NET
Ruby
Yeah, this seems to be a good start: http://nbayes.codeplex.com/ (a C# implementation of the Bayesian algorithm)
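To make the Bayesian idea concrete, here is a minimal sketch in Python using scikit-learn's MultinomialNB; the training texts and labels are invented for illustration, and character n-grams are just one way to let the model pick up patterns like "(at)":

# Minimal naive Bayes text-filter sketch (Python, scikit-learn).
# The training texts and labels below are invented for illustration.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

train_texts = [
    "contact me at john(at)example.com",   # obfuscated address
    "write to jane(at)mail.org for info",  # obfuscated address
    "the weather is lovely today",         # harmless
    "meeting notes from last week",        # harmless
]
train_labels = [1, 1, 0, 0]  # 1 = contains an email-like phrase

# Character n-grams let the model pick up substrings like "(at)".
vectorizer = CountVectorizer(analyzer="char_wb", ngram_range=(2, 4))
clf = MultinomialNB()
clf.fit(vectorizer.fit_transform(train_texts), train_labels)

# Each "this text contains an email address" click would append a new
# labelled sample and refit, so the filter keeps learning.
print(clf.predict(vectorizer.transform(["ping me: bob(at)foo.net"])))  # [1]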
I have a financial application and I wish to add the ability to take a user command or input in a textbox and then take the right action. For example, I want the user to be able to write "show the revenue in the last 10 days" and have it show the revenue to him/her. The point is that I wish it to really understand the meaning of the question, so the previous statement would bring the same results as "do I have any revenue in the last 10 days" or something like that - BI (something like the Wolfram|Alpha engine).
I wonder if there's any open-source library, algorithm book, or whatever that I can use to learn the subject. Regarding open-source libraries, I don't mind which language they're written in.
I've read about this subject and saw many engines and services (OpenNLP, Apache UIMA, CoreNLP, etc.) but could not figure out whether they're right for my needs.
Any answer or suggestion is welcome.
Many thanks!
The field you're talking about is usually called "natural language processing". It's hard, and an active field of research. There are various libraries which you could consider based on your preferred programming language and use case:
http://en.wikipedia.org/wiki/List_of_natural_language_processing_toolkits
I've used NLTK a little bit. This field is seriously difficult to get right, so you might want to try to restrict your application to some small set of verbs and nouns such that people are using a controlled vocabulary in the first instance, and then try to extend it beyond that.
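To illustrate the controlled-vocabulary idea, here is a tiny Python sketch; the verbs, metrics, and period pattern are all invented for this answer, and a real application would grow the vocabulary from there:

# Toy controlled-vocabulary command parser (all vocabulary invented).
import re

VERBS = {"show", "get", "list"}
METRICS = {"revenue", "expenses", "profit"}

def parse_command(text):
    """Return (metric, days) for e.g. 'show the revenue in the last 10 days'."""
    tokens = text.lower().split()
    verb = next((t for t in tokens if t in VERBS), None)
    metric = next((t for t in tokens if t in METRICS), None)
    period = re.search(r"last (\d+) days", text.lower())
    if verb and metric and period:
        return metric, int(period.group(1))
    raise ValueError("Command not understood: " + text)

print(parse_command("show the revenue in the last 10 days"))  # ('revenue', 10)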
I have users that have authenticated with a social media site. Now based on their last X (let's say 200) posts, I want to map how much that content matches up with a finite list of keywords.
What would be the best way to do this, capturing associated words/concepts (maybe that's too difficult), or just to get a score of how much, say, my tweet history maps to 'Walrus' or 'banana'?
Would naive Bayes work here to separate posts into 'match' and 'no match'?
In Python, I would say NLTK can easily do it. In Ruby, maybe a gem called lda-ruby will help you. The whole LDA concept is well explained here; look at the Sarah Palin email example. There's even an example of an app (not entirely in Ruby, but still) which did that -> github.com/echen/sarah-palin-lda
Or maybe I'm just saying silly things and this won't help you at all. I'm not an expert ;)
A simple naive Bayes classifier would work in this case; it is widely used to detect whether emails are spam or not, so for simple keyword matching it should work pretty well.
For this problem you could also apply a recommender system, where you look for the top recommended keyword for a user (or for a post).
There are a ton of ways of doing this. I would recommend you read Programming Collective Intelligence. It is explained using Python, but since you know Ruby there should be no problem understanding the code.
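As a baseline before reaching for LDA or Bayes, a plain keyword count already gives a score per keyword. A minimal Python sketch, with the keywords and post history invented for illustration:

# Score a post history against a fixed keyword list (invented data).
from collections import Counter

KEYWORDS = {"walrus", "banana"}

def keyword_scores(posts):
    """Count occurrences of each keyword across all posts."""
    counts = Counter()
    for post in posts:
        for word in post.lower().split():
            word = word.strip(".,!?'\"")
            if word in KEYWORDS:
                counts[word] += 1
    return counts

history = ["I saw a walrus today!", "Banana bread recipe", "walrus walrus"]
print(keyword_scores(history))  # Counter({'walrus': 3, 'banana': 1})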
I have tags on my website, and I input them one by one when I create a blog post. I love Gmail's new feature that asks you whether you want to include X in an email when you type Y's name, because you often include both of them in the same messages.
I'd like to do something similar on my website, but I don't know how to represent the tags' relatedness in an object or database... thoughts?
It all boils down to creating associations between certain characteristics of your posts and certain tags, and then, when you press the "publish" button, analysing the new post and proposing all tags matched with your post's characteristics.
This can be done in several ways, from a "totally hard-coded" association to some sort of "learning AI", and everything in between.
Hard-coded solutions
These are the simplest algorithms to implement. You should first decide which characteristics of your post are relevant for tagging (e.g. its length if you tag them "short" or "long", the presence of photos or videos if you tag them "multimedia-content", etc.). The most obvious, however, is to focus on which words are used in posts. For example you could build a mapping like this:
tag_hint_words = {'code-development': ['programming', 'language',
                                       'python', 'function',
                                       'object', 'method'],
                  'family': ['Theresa', 'kids',
                             'uncle Ben', 'holidays']}
Then you would check your post for the presence of the words in the lists (the code between [ and ]) and propose the corresponding tag (the word before :) as a possible candidate.
A common approach is to give "scores", or in other words to assign a number that indicates the probability that a given tag is the right one. For example: if your post contained the sentence...
After months of programming, we finally left for the summer holidays at uncle Ben's cottage. Theresa and the kids were ecstatic!
...despite the presence of the word "programming", the program should indicate "family" as the most likely tag to use, as there are many more words hinting at it.
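As a concrete version of that scoring idea, here is a small sketch that counts hint-word occurrences per tag; plain substring matching is chosen here just for brevity:

# Score each tag by counting its hint words in the post.
def tag_scores(post, hint_words):
    text = post.lower()
    return {tag: sum(text.count(w.lower()) for w in hints)
            for tag, hints in hint_words.items()}

post = ("After months of programming, we finally left for the summer "
        "holidays at uncle Ben's cottage. Theresa and the kids were ecstatic!")
print(tag_scores(post, tag_hint_words))
# {'code-development': 1, 'family': 4} -> 'family' wins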
Learning AIs
One of the obvious limitations of the above method is that, say, if one day you pick up Java besides Python, you would probably need to change your code and include words like "java" or "oracle" too. The same applies if you create new tags.
To circumvent this limitation (and have some fun!!) you could try to implement a learning algorithm. Learning algorithms are those that refine their outcome the more you use them (so they do indeed... learn!). Some algorithms require initial training (many spam filters and voice recognition programs need this initial "primer"). Some don't.
I am absolutely no expert on the subject, but two common approaches are the naive Bayes classifier and some flavour of neural network.
Although the WP pages might look scary, they are surprisingly easy to implement (at least in Python). Here's the recording of a lecture at PyCon 2009 on the subject "Easy AI with Python". I found it very informative and even somehow inspiring! :)
HTH!
You should have a look at this post:
Any suggestions for a db schema for storing related keywords?
If you're looking for a schema for storing related tags, it will help.
Relevancy searches where multiple agents play a part are usually done using collaborative filtering. You might want to give that a look.
Look up clustering (a machine-learning algorithm). Don't be intimidated by the math; it's a pretty straightforward algorithm. Check out Machine Learning for Hackers for simpler explanations of many machine-learning algorithms and methods.
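For a feel of how simple clustering can be, here is a toy k-means sketch in pure Python; the 1-D numeric points are invented for illustration:

# Bare-bones k-means on 1-D points (toy illustration, invented data).
import random

def kmeans(points, k, iters=20):
    centers = random.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:  # assign each point to its nearest center
            nearest = min(range(k), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # move each center to the mean of its cluster
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

data = [1.0, 1.2, 0.8, 9.0, 9.5, 10.1]
print(kmeans(data, 2))  # two centers, near 1.0 and 9.5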
Say you were to create a search engine that can accept a query statement in the form of a string. The statement can be used to retrieve different types of objects with a given set of characteristics, possibly linked to other objects. In plain English, or pseudo-code using an OOP approach, how would you go about parsing and processing statements like the following to get the series of desired objects?
get fruit with colour green
get variety of apples, pears from Andy
get strawberry with colour "deep red" and origin not Spain
get total of sales of melons between 2010-10-10 and 2010-12-30
get last deliverydate of bananas from "Pete" and state not sold
Hope the question is clear. If not I'll be more than happy to reformulate.
P.S: This isn't homework ;)
Your problem is well suited to a document-oriented store such as Lucene. For example you can design a schema such as
Type
Variety
Color
Origin
DateSold
etc.
Then you can write a Lucene query such as Type:Fruit AND Color:Green. You can also build nested queries such as (Variety:Strawberry AND Color:"Deep Red") AND NOT Origin:Spain.
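If you'd rather prototype this in Python before committing to Java Lucene, Whoosh (a pure-Python library with Lucene-style query syntax) can express the same schema and query. A sketch, with the field names and the two sample documents invented for illustration:

# Schema + query sketch using Whoosh, a pure-Python Lucene-like library.
# Field names and the two sample documents are invented for illustration.
import tempfile
from whoosh.fields import Schema, TEXT
from whoosh.index import create_in
from whoosh.qparser import QueryParser

schema = Schema(type=TEXT(stored=True), variety=TEXT(stored=True),
                color=TEXT(stored=True), origin=TEXT(stored=True))
ix = create_in(tempfile.mkdtemp(), schema)

writer = ix.writer()
writer.add_document(type="fruit", variety="strawberry",
                    color="deep red", origin="france")
writer.add_document(type="fruit", variety="strawberry",
                    color="deep red", origin="spain")
writer.commit()

with ix.searcher() as searcher:
    q = QueryParser("type", ix.schema).parse(
        'variety:strawberry AND color:"deep red" AND NOT origin:spain')
    for hit in searcher.search(q):
        print(hit.fields())  # only the French strawberry matches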
Apache Lucene is a Java library with ports available for most major languages. Apache Solr is a full-fledged search server built using the Lucene library and easily integrated into your platform of choice because it has a RESTful API.
BTW, Solr has something called faceting, which lets the user filter results using each of the criteria above. So the user types fruit into the search box and then gets results back like:
Type:
- Fruit (109)
- Nut (99)
Origin:
- Spain (32)
- France (39)
Color:
- Red (22)
- Deep Red (45)
Clicking on each of the facets filters the results with the intersection. So if you want a more user-friendly interaction model, faceting/filtering is much easier than getting users to type extensive Lucene queries.
Update: You might still need to do some lexical parsing if you wish to let users type natural-language queries and break them down, but given the tremendously difficult challenge, my suggestion would be to use the simple and powerful faceting approach.
Hope that helps.
It sounds like you're developing a mini language, since you're concerned with syntax and parsing. So, check out the many tools used to generate lexers and parsers. You can start here: http://en.wikipedia.org/wiki/Lexical_analysis
I agree with John.
a) Start with lexical analysis
b) Take statistics of searches and use them to index
c) Find relationships by analysing possibly related searches
This is just a wild guess though; I've never tried it before.
Can you suggest a lightweight fuzzy text-search library?
What I want to do is to allow users to find correct data for search terms with typos.
I could use a full-text search engine like Lucene, but I think it's overkill.
Edit:
To make the question clearer, here is the main scenario for that library:
I have a large list of strings. I want to be able to search in this list (something like MSVS's IntelliSense), but it should be possible to filter this list by a string which is not present in it but close enough to some string which is in the list.
Example:
Red
Green
Blue
When I type 'Gren' or 'Geen' in a text box, I want to see 'Green' in the result set.
The main language for the indexed data will be English.
I think that Lucene is too heavy for that task.
Update:
I found one product matching my requirements. It's ShuffleText.
Do you know any alternatives?
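For what it's worth, Python's standard library alone can handle the 'Gren' -> 'Green' case; a minimal sketch using difflib:

# Fuzzy filtering with only the Python standard library.
from difflib import get_close_matches

choices = ["Red", "Green", "Blue"]

def fuzzy_filter(query, cutoff=0.6):
    """Return choices whose similarity to the query exceeds the cutoff."""
    return get_close_matches(query, choices, n=5, cutoff=cutoff)

print(fuzzy_filter("Gren"))  # ['Green']
print(fuzzy_filter("Geen"))  # ['Green']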
Lucene is very scalable, which means it's good for little applications too. You can create an index in memory very quickly if that's all you need.
For fuzzy searching, you really need to decide what algorithm you'd like to use. For information retrieval, I have used an n-gram technique with Lucene successfully. But that's a special indexing technique, not a "library" in itself.
Without knowing more about your application, it won't be easy to recommend a suitable library. How much data are you searching? What format is the data? How often is the data updated?
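To sketch what an n-gram technique looks like, here is a toy character-trigram version using Jaccard overlap; this is an illustration of the idea, not Lucene's actual implementation:

# Toy character-trigram similarity (Jaccard overlap; not Lucene's internals).
def ngrams(s, n=3):
    s = " " + s.lower() + " "  # pad so word edges get their own grams
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def similarity(a, b, n=3):
    ga, gb = ngrams(a, n), ngrams(b, n)
    return len(ga & gb) / len(ga | gb)

print(similarity("Gren", "Green"))  # 0.5 -- highest among the choices
print(similarity("Gren", "Blue"))   # 0.0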
I'm not sure how well Lucene is suited to fuzzy searching; a custom library might be a better choice. For example, this search is done in Java and works pretty fast, but it is custom-made for such a task:
http://www.softcorporation.com/products/people/
Soundex is very 'English' in its encoding; Daitch-Mokotoff works better for many names, especially European (Germanic) and Jewish names. In my UK-centric world, it's what I use.
Wiki here.
You didn't specify your development platform, but if it's PHP then I suggest you look at the Zend Lucene library:
http://ifacethoughts.net/2008/02/07/zend-brings-lucene-to-php/
http://framework.zend.com/manual/en/zend.search.lucene.html
As it's LAMP-based, it's far lighter than Lucene on Java, and it can easily be extended for other file types, provided you can find a conversion library or command-line converter; there are lots of OSS solutions around to do this.
Try Walnutil, which is based on the Lucene API and integrated with SQL Server and Oracle databases. You can create any type of index and then use it. For simple searches you can use some methods from Walnutilsoft; for more complicated search cases you can use the Lucene API. See the web-based example that uses indexes created with the Walnutil tools. You can also see some code examples written in Java and C# which you can use for creating different types of search.
This tool is free.
http://www.walnutilsoft.com/
If you can choose to use a database, I recommend using PostgreSQL and its fuzzy string matching functions.
If you can use Ruby, I suggest looking into the amatch library.
#aku - links to working Soundex libraries are right there at the bottom of the page.
As for Levenshtein distance, the Wikipedia article on that also has implementations listed at the bottom.
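For completeness, the classic dynamic-programming version fits in a few lines of Python:

# Classic dynamic-programming Levenshtein distance.
def levenshtein(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

print(levenshtein("Gren", "Green"))  # 1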
A powerful, lightweight solution is Sphinx.
It's smaller than Lucene and it supports disambiguation.
It's written in C++, it's fast, battle-tested, has libraries for every environment, and it's used by large companies like craigslist.org.