Stemming - code examples or open source projects? - algorithm

Stemming is something that's needed in tagging systems. I use Delicious, and I don't have time to manage and prune my tags. I'm a bit more careful with my blog, but it isn't perfect. I write software for embedded systems that would be much more functional (helpful to the user) if it included stemming.
For instance:
Parse
Parser
Parsing
Should all mean the same thing to whatever system I'm putting them into.
Ideally there's a BSD licensed stemmer somewhere, but if not, where do I look to learn the common algorithms and techniques for this?
Aside from BSD stemmers, what other open source licensed stemmers are out there?
-Adam

Snowball stemmer (C & Java)
I've used its Python binding, PyStemmer.
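A minimal sketch of the PyStemmer API, as I remember using it (treat it as a starting point rather than a definitive example):

import Stemmer  # pip install PyStemmer

stemmer = Stemmer.Stemmer("english")
# "parse" and "parsing" both collapse to "pars"; "parser" is left alone
print(stemmer.stemWords(["parse", "parser", "parsing"]))
# -> ['pars', 'parser', 'pars']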

Check out the NLTK toolkit, written in Python. It has a very functional stemmer.
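For example, a quick sketch with NLTK's Porter stemmer (one of several stemmers it ships):

from nltk.stem import PorterStemmer  # pip install nltk

stemmer = PorterStemmer()
print([stemmer.stem(w) for w in ["parse", "parser", "parsing"]])
# -> ['pars', 'parser', 'pars']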

Another option for stemming would be WordNet, along with one of its APIs. Some basic information on stemming and lemmatization, including a description of the Porter stemming algorithm, can be found online in Introduction to Information Retrieval.
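To illustrate the difference from stemming, here is a sketch of WordNet lemmatization through NLTK; unlike a stemmer, it returns dictionary words rather than truncated stems (you need the WordNet data downloaded first):

import nltk
nltk.download("wordnet")  # one-time download of the WordNet data

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("parsing", pos="v"))  # -> parse
print(lemmatizer.lemmatize("parsers", pos="n"))  # -> parser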

Lucene has a stemmer built in, I believe (and IIRC it lets you use your own if you want).
EDIT: Just checked, and Lucene refers to the Snowball site, which is an open source stemming library as far as I can tell.

Related

Does LibShortText work with other languages too?

LibShortText is an open source tool for short-text classification and analysis.
http://www.csie.ntu.edu.tw/~cjlin/libshorttext/
I have tried to figure out whether it also works with languages other than English (e.g. German), but I didn't find a hint.
Does anyone know the answer? Thank you in advance.
I think so (though it may need some extra preprocessing). LIBSVM and LIBLINEAR are both language-agnostic, and since LibShortText is built on top of LIBLINEAR, it should work for all languages too.
According to this paper, it has internal pre-processing methods to extract features.
libshorttext.converter: For given short texts, LibShortText follows the bag-of-word model to generate features. Users apply procedures in this library to pre-process short texts by tokenization, stemming (optional), and stop-word removal (optional). The library also allows users to choose between unigram and bigram features.
However, it looks like its stemming and stop-word removal only support English. So if you want better features extracted for non-English text, you might want to use your own pre-processing methods, for example using NLTK.
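If you go that route, a rough sketch of German pre-processing with NLTK might look like this (the corpora downloads and the exact pipeline are my assumptions; LibShortText would then just see the pre-processed tokens):

import nltk
nltk.download("punkt")      # tokenizer models (name may vary by NLTK version)
nltk.download("stopwords")  # stop-word lists

from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer
from nltk.tokenize import word_tokenize

stemmer = SnowballStemmer("german")
stop = set(stopwords.words("german"))

def preprocess(text):
    # tokenize, drop German stop words, then stem what is left
    tokens = word_tokenize(text, language="german")
    return " ".join(stemmer.stem(t) for t in tokens if t.lower() not in stop)

print(preprocess("Die Katzen sitzen auf den Bäumen"))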

Artificial Intelligence and Chat Filters

Are there any chat filters that work depending on the context? I'm talking about using new technologies like Artificial Intelligence and Natural Language Processing to determine, for example, whether a word is rude or not, depending on the context.
The simplest way is to just use a regex for the handful of most offensive words.
There are services offered online that will allow you to query with a word to see if it's profane. Those are good options if you want to be really sure.
Unfortunately, there's no sure bet for any of these. One man's profanity is another's common talk (see "santorum": very, very vulgar, but most aren't offended). And each group has its own foul language: "bollocks" isn't so bad in America, but it's fairly bad in Britain.
You could make a clbuttic mistake (the result of naive substring filtering turning "classic" into "clbuttic"). And even if you eliminate everything, I can still write a horribly offensive story using the kindest language.
A short blacklist or one of the services is the way to go, depending on the level of filtering you want.
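A minimal sketch of the short-blacklist idea from above (the word list is a placeholder; the \b word boundaries are what keep you from mangling "classic"):

import re

BLACKLIST = ["badword1", "badword2"]  # hypothetical entries
pattern = re.compile(
    r"\b(" + "|".join(map(re.escape, BLACKLIST)) + r")\b",
    re.IGNORECASE,
)

def is_clean(message):
    # True if no blacklisted word appears as a whole word
    return pattern.search(message) is None

print(is_clean("a classic example"))  # True; no mid-word false positive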
This looks interesting: http://pottymouthfilter.com. It is unfortunately a commercial product, but at least someone is working on something along those lines.

How can I detect a user's input language using Ruby without using an online service?

I'm looking for a library or technique to detect the input language of blocks of text provided by users. Online lookups (like Google Translate) won't work for this task, as I'm writing an app which must run offline.
Thanks.
Here are two more n-gram-based gems you might want to try. They work offline.
https://github.com/echen/unsupervised-language-identification, optimized for separating English from other languages (has a live demo)
https://github.com/feedbackmine/language_detector, less specialized, will detect more languages. Some languages may need some extra training; I found it not precise enough for German text.
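To illustrate the character n-gram idea behind these gems (sketched in Python for brevity; the tiny training samples here are toy-sized, while real detectors build profiles from large corpora):

from collections import Counter

def trigrams(text):
    # count overlapping character trigrams, with padding spaces
    text = " " + text.lower() + " "
    return Counter(text[i:i + 3] for i in range(len(text) - 2))

# hypothetical, far-too-small training samples
profiles = {
    "english": trigrams("the quick brown fox jumps over the lazy dog"),
    "german": trigrams("der schnelle braune fuchs springt ueber den faulen hund"),
}

def detect(text):
    grams = trigrams(text)
    # score each language by the overlap of trigram counts
    return max(profiles, key=lambda lang: sum((grams & profiles[lang]).values()))

print(detect("der hund springt"))  # -> german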
For anyone interested, I've found http://rubygems.org/gems/kenwaln-whatlanguage, which is performing excellently.
I'm using CLD which I really like, succinct and easy to use. Give it a try.
A quick demo of WhatLanguage in Ruby:
http://www.youtube.com/watch?v=lNqZ2cqOReo&list=UUJ_3fstMOH-g4yBxtvgAWkw&index=0&feature=plcp

What simple syntax can be used for rich text?

I want an application with a simple text input, enriched with some marks to indicate formatting or semantic labeling. I want the syntax to be as easy as possible, and I want to include self-defined labels.
Example:
[bold]Stackoverflow[/bold] is a [tag]good[/tag] resource for programmers.
Tables would be needed too.
HTML/XML and LaTeX are powerful enough to allow this, but too complicated. Wiki syntax seems simple, but it uses a different symbol for each kind of markup, has unclear quoting, and every wiki seems to have its own syntax; for tables and similar things, wiki markup becomes very complicated.
Is there a language/syntax that matches my needs, or that can be slightly changed to do so? Or do I have to invent something myself? In that case, do you have suggestions?
Definitely do NOT invent your own. There are plenty of simple markup languages already, and users HATE learning new ones. Trust me on this!
I would suggest using one of the following:
Textile
Markdown
BBCode
Make your decision based on your userbase, as well as what tools and parsers are available in your chosen language. For my site, we went with Textile, but I've found that BBCode tends to be the language that most people already know. However, this will vary with different user demographics.
StackOverflow, along with several other sites, uses Markdown. I think it will give you the best balance between features and simplicity.
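If your stack happens to include Python (an assumption; every mainstream language has an equivalent), conversion is a one-liner with the python-markdown package:

import markdown  # pip install markdown

html = markdown.markdown("**Stackoverflow** is a *good* resource for programmers.")
print(html)
# -> <p><strong>Stackoverflow</strong> is a <em>good</em> resource for programmers.</p>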
Let me add reStructuredText to the list.
An additional benefit of using it is the availability of the reStructuredText-to-anything service, which makes it extremely easy to create HTML or PDF versions of the document.
As already pointed out, there are a lot of lightweight markup languages (many are listed in this Wikipedia article), so there should be no need to create your own.

Lightweight fuzzy search library

Can you suggest a lightweight fuzzy text search library?
What I want to do is to allow users to find correct data for search terms with typos.
I could use a full-text search engine like Lucene, but I think it's overkill.
Edit:
To make question more clear here is a main scenario for that library:
I have a large list of strings. I want to be able to search in this list (something like MSVS's IntelliSense), but it should also be possible to filter the list by a string that is not present in it, yet close enough to some string that is.
Example:
Red
Green
Blue
When I type 'Gren' or 'Geen' in a text box, I want to see 'Green' in the result set.
Main language for indexed data will be English.
I think that Lucene is too heavy for that task.
Update:
I found one product matching my requirements. It's ShuffleText.
Do you know any alternatives?
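For exactly this scenario, the standard library gets you surprisingly far if Python is an option (the question doesn't say, so consider this a dependency-free sketch rather than a recommendation):

import difflib

choices = ["Red", "Green", "Blue"]
# the similarity cutoff defaults to 0.6; tune it to taste
print(difflib.get_close_matches("Gren", choices))  # -> ['Green']
print(difflib.get_close_matches("Geen", choices))  # -> ['Green']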
Lucene is very scalable, which means it's good for little applications too. You can create an index in memory very quickly if that's all you need.
For fuzzy searching, you really need to decide what algorithm you'd like to use. With information retrieval, I use an n-gram technique with Lucene successfully. But that's a special indexing technique, not a "library" in itself.
Without knowing more about your application, it won't be easy to recommend a suitable library. How much data are you searching? What format is the data? How often is the data updated?
I'm not sure how well Lucene is suited for fuzzy searching; a custom library may be a better choice. For example, this search is done in Java and works pretty fast, but it is custom made for such a task:
http://www.softcorporation.com/products/people/
Soundex is very 'English' in its encoding; Daitch-Mokotoff works better for many names, especially European (Germanic) and Jewish names. In my UK-centric world, it's what I use.
Wiki here.
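For reference, a simplified American Soundex fits in a few lines; this sketch omits some edge cases of the official rules:

def soundex(word):
    # consonant -> digit table; vowels, h, w, y get no code
    codes = {"b": "1", "f": "1", "p": "1", "v": "1",
             "c": "2", "g": "2", "j": "2", "k": "2",
             "q": "2", "s": "2", "x": "2", "z": "2",
             "d": "3", "t": "3", "l": "4",
             "m": "5", "n": "5", "r": "6"}
    word = word.lower()
    result = word[0].upper()
    prev = codes.get(word[0], "")
    for ch in word[1:]:
        code = codes.get(ch, "")
        if code and code != prev:
            result += code
        if ch not in "hw":  # h and w do not break a run of equal codes
            prev = code
    return (result + "000")[:4]  # pad to letter + three digits

print(soundex("Robert"), soundex("Rupert"))  # R163 R163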
You didn't specify your development platform, but if it's PHP, then I suggest you look at the Zend Lucene library:
http://ifacethoughts.net/2008/02/07/zend-brings-lucene-to-php/
http://framework.zend.com/manual/en/zend.search.lucene.html
As it's LAMP-based, it's far lighter than Lucene on Java, and it can easily be extended for other file types, provided you can find a conversion library or command-line converter; there are lots of OSS solutions around to do this.
Try Walnutil, which is based on the Lucene API and integrates with SQL Server and Oracle databases. You can create any type of index and then use it. For simple searches you can use some of the WalnutilSoft methods; for more complicated cases you can use the Lucene API directly. See the web-based example that uses indexes created with the Walnutil tools; there are also some code examples in Java and C# that you can use for creating different types of search.
This tool is free.
http://www.walnutilsoft.com/
If you can choose to use a database, I recommend using PostgreSQL and its fuzzy string matching functions.
If you can use Ruby, I suggest looking into the amatch library.
#aku - links to working soundex libraries are right there at the bottom of the page.
As for Levenshtein distance, the Wikipedia article on that also has implementations listed at the bottom.
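In case an inline pointer helps, the classic dynamic-programming version is short enough to sketch here (a sketch, not a tuned implementation):

def levenshtein(a, b):
    # edit distance with two rolling rows, O(len(a) * len(b)) time
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

print(levenshtein("Gren", "Green"))  # -> 1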
A powerful, lightweight solution is Sphinx.
It's smaller than Lucene, and it supports disambiguation.
It's written in C++; it's fast, battle-tested, has libraries for every environment, and it's used by large companies like craigslist.org.
