Stemming Algorithm - stemming

I have a question about the Porter stemming algorithm. I researched it on the internet,
but I couldn't find what the difference between understemming and overstemming is.
And is the Porter algorithm understemming or overstemming?
Do you have an idea?
Thanks in advance.

Overstemming happens when the cut-off suffix is too long: too much of the word is removed, which leads to spurious matching of unrelated words under the same stem.
Understemming is the opposite: too little is removed, so related word forms fail to map to the same stem. For example, a stemmer that doesn't cut off anything at all inherently understems.
The Porter stemmer, I suspect, will make both types of errors from time to time for English. Note that implementations for other languages might behave very differently (speaking of Snowball, which has user-supplied algorithms for a number of languages). They may even differ in the linguistic definition of a stem.
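To make the two error types concrete, here is a small sketch using NLTK's PorterStemmer (the words are the classic textbook examples; the outputs in the comments are what the Porter rules typically produce, so treat them as illustrative rather than guaranteed):

    # Over- and understemming with NLTK's Porter stemmer (pip install nltk).
    from nltk.stem import PorterStemmer

    ps = PorterStemmer()

    # Overstemming: too much is removed, so distinct words collapse together.
    print(ps.stem("universal"), ps.stem("university"), ps.stem("universe"))
    # typically all three come out as "univers"

    # Understemming: too little is removed, so related forms stay apart.
    print(ps.stem("alumnus"), ps.stem("alumni"), ps.stem("alumnae"))
    # typically "alumnu", "alumni", "alumna" -- three different stems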

Related

How can I efficiently find all people mentioned in some text, while tolerating spelling mistakes?

I have a list of names of millions of famous people (from Wikidata), and I need to create a system that efficiently finds all people mentioned in a fairly short text: it can be just one word (e.g. "Einstein") or a few pages of text (e.g. a Wikipedia page).
I need the system to be fairly tolerant of spelling mistakes (e.g. Mikael Jackson instead of Michael Jackson) and short forms (e.g. M. Jackson). In case of ambiguity, it should return all possible people (e.g. "George Bush" should return both father and son, and possibly other homonyms).
This related question has a few interesting answers, including using the Aho-Corasick algorithm. There are libraries in many languages, including Python. However, it does not seem to support fuzzy search (i.e. tolerating misspellings).
I guess I could extend the vocabulary to include all the possible spellings of each name, but that would make the vocabulary too large, so I would rather avoid that if possible (moreover, I may want to extend this solution to more than just people at one point).
I took a quick look at Lucene/Elasticsearch, but it does not seem to support this use case (unless I missed it).
Any ideas?
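For what it's worth, the exact-match part with a Python Aho-Corasick library such as pyahocorasick looks roughly like this; note that it only finds patterns verbatim, which is exactly the limitation described above (the toy name list is made up for illustration):

    # Exact multi-pattern matching of names with pyahocorasick.
    # pip install pyahocorasick
    import ahocorasick

    names = ["Michael Jackson", "Albert Einstein", "George Bush"]  # toy list

    automaton = ahocorasick.Automaton()
    for idx, name in enumerate(names):
        # Short forms like "Einstein" or "M. Jackson" would need their own entries.
        automaton.add_word(name.lower(), (idx, name))
    automaton.make_automaton()

    text = "A page mentioning Einstein and Michael Jackson."
    for end_index, (idx, name) in automaton.iter(text.lower()):
        print(name, "found, ending at character", end_index)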
Elasticsearch has support for fuzzy matching; see the documentation here.
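To expand on that: the match query accepts a fuzziness parameter, which tolerates a bounded edit distance. A minimal sketch with the official Python client (the "people" index and "name" field are assumptions for illustration, and client keyword arguments vary slightly between versions):

    # Fuzzy name lookup with the Elasticsearch Python client.
    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")

    resp = es.search(
        index="people",
        query={
            "match": {
                "name": {
                    "query": "Mikael Jackson",  # misspelled on purpose
                    "fuzziness": "AUTO",        # edit-distance-based matching
                    "operator": "and",
                }
            }
        },
    )
    for hit in resp["hits"]["hits"]:
        print(hit["_score"], hit["_source"]["name"])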

What is a collaborative algorithm?

What is a collaborative algorithm? Is there a scientifically citable reference?
Details:
I found many articles about collaborative algorithms, but none of them (nor any other websites) gave a definition.
I am actually looking for a term to describe distributed algorithms where each instance has all information at the beginning and can complete the whole task on its own, but the instances help each other whenever they have solved a sub-problem, so the other instances do not have to redo the work (hence "collaboration"). I picked up this terminology in A Collaborative Approach for Multi-Threaded SAT Solving. Do you think the term "collaborative algorithm" is suitable for this? If not, do you know of a better term?
No, there are no scientifically citable references.
All parallel/distributed programming is "collaborative" in the sense that several threads/nodes are collaborating on the same big task.
"Distributed algorithms where ... instances help each other whenever they have solved a sub-problem" -- even some web application clusters fit your description: individual cluster nodes "solve sub-problems" and store the "solutions" in distributed in-RAM storage (such as memcached, Cassandra, or many others), thus helping each other.
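As a toy illustration of that pattern (not taken from the paper or from any particular system, just the general shape): each worker could solve every sub-problem on its own, but it skips anything another worker has already published.

    # Toy sketch of "instances helping each other" via a shared result store.
    from multiprocessing import Process, Manager

    def solve(subproblem):
        return subproblem * subproblem        # stand-in for real work

    def worker(subproblems, shared_results):
        for sp in subproblems:
            if sp in shared_results:
                continue                      # reuse someone else's work
            shared_results[sp] = solve(sp)    # publish our solution for the others

    if __name__ == "__main__":
        with Manager() as mgr:
            shared = mgr.dict()               # the "collaboration" channel
            procs = [Process(target=worker, args=(range(1000), shared))
                     for _ in range(4)]
            for p in procs:
                p.start()
            for p in procs:
                p.join()
            print(len(shared), "sub-problems solved in total")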
I think the term "collaborative algorithm" is not formal. Actually, the term "algorithm" itself is not really formal, as far as I remember; I guess an algorithm can be formalized as "a program which runs on a Turing machine". I think I've seen this definition somewhere.
So yes, all in all the term you coined makes sense, but you need to define it somehow yourself (either formally or informally).
Not sure what your background is, but in scientific papers different authors sometimes use the same terms/concepts to denote different things, and sometimes they use different terms to denote the same thing.
Also, even though computer science papers are scientific, not all terms in them are formally defined. So I wouldn't draw too many conclusions from these papers unless I were familiar with all of them to a decent extent, or unless some of them were considered really remarkable and widely accepted as a de facto standard in a particular subfield or field.

How can I learn to read formulas with greek symbols?

I suppose maybe it's because I don't know the keywords to google for, but I can't find any sources on how to read those formulas you see on Wikipedia, like this one for instance:
Erlang Distribution
I've searched in the math world and computer science world. It feels like it is assumed that we're supposed to understand it out of thin air. Beginner lessons seem scarce.
So far I know how sigma works. And that upside-down shape that is used as the Half-Life logo is called lambda. But what the heck is it trying to say? Why is there a semicolon in the function, etc.?
If there is a book on this stuff I'd buy it in an instant. It is probably very basic stuff but I never had experience in theoretical math or even know where to look.
Does anyone know what this subject is called, and what to google for?
Formulas with these symbols are usually statistics or probability notation.
Greek letters (e.g. θ, β) are commonly used to denote unknown parameters (population parameters).
You can find info here: Greek letters used in mathematics, science, and engineering.
And here: Notation in probability and statistics.
I think the colon in "alt.: θ = 1/λ > 0, scale (real)" in the infobox on Wikipedia is just saying that there is an alternative definition, in which you specify theta rather than lambda, and in that definition what is called theta is the reciprocal of lambda in the other definition.
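For reference, the Erlang density written out in both parameterizations (standard textbook form, not copied from the article; the symbols after the semicolon are the parameters, the one before it is the variable):

    f(x;\, k, \lambda) = \frac{\lambda^{k} x^{k-1} e^{-\lambda x}}{(k-1)!}, \qquad x \ge 0

    f(x;\, k, \theta) = \frac{x^{k-1} e^{-x/\theta}}{\theta^{k} (k-1)!}, \qquad \theta = \frac{1}{\lambda}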
I once complained to a much better mathematician than I was that I came unstuck with formulas with some of the weirder Greek letters in them, because I couldn't write them recognisably in my handwriting (which is bad enough for the Latin alphabet). He said a lot of the people he knew simply said "let x be funny-squiggle-thing" and rewrote with sensible letters. I really wish I'd thought of that.
In general, letters in weird alphabets behave pretty much like sensible letters, at least in the sort of thing you are pointing at. It's done as a sort of type-checking: usually all of the letters pinched from some particular foreign language are related in some way, e.g. all parameters. Unfortunately that doesn't hold exactly in the Wikipedia example you quote, where two of the Greek letters stand for functions; one is definitely the Gamma function. I suspect the other is http://en.wikipedia.org/wiki/Digamma_function, but I'm not really sure.
Check out the resources list here: http://en.wikipedia.org/wiki/Greek_alphabet
I would say your best bet is still searching in Google (or another search engine, whatever floats your boat) for the specific formula you are trying to learn. Sometimes a symbol may be used with a different meaning in a different formula.
Anyway, there is a good resource here that explains a lot of math symbols, not just the Greek ones.
Some links that may interest you: here and here.
First, find a Greek alphabet (upper and lower case) to refer to, so that you can at least call lambda by its name. No one starts out knowing automatically what the various Greek characters are, not even Greeks.
Second, read the actual article. Usually either the character is defined (as lambda happens to be in the Wikipedia page you referenced) or it's standard nomenclature (in which case you've done the right thing by looking for a basic article on the function in question; I do this all the time, so don't feel bad). Or, as a third option, it's a crappy paper. Happens sometimes. It's kind of a pain, though, since you can't just do a text search on the lambda character in a PDF.
(Someone educate me on that if I'm wrong....)
Third, try to pick out which unfamiliar symbols are variables (like lambda) and which are operators (like sigma and its helpers). It's the operators that can sometimes cause real trouble. A variable is just a name for something, but operators come freighted with more meaning, more rules, and more syntax. It's not always obvious which symbols are operators, either.
Finally, and specifically for computer science, a good introductory book (college freshman/sophomore level) on discrete math will hopefully treat most of the basic notations and operators to at least get your feet on the ground. Nowadays, you kids and your newfangled internet might be able to get something similar from Udacity, edX, Coursera, or Khan Academy.
Basically, it's a lot of hard work, especially on your own, but you're already doing most of the right things.

General approach to constraint solving w/optimization over large finite domains?

I have a constraint problem I've been working on, which has a couple of "fun" properties:
1. The domain is massive; basic constraints bring it down to around 2^40 to 2^30, but it's hard to bring it down further without...
2. Optimization for the solution. There is no single constrained solution; I'm looking for the best fit in the domain based on some complex predicates.
In searching for a way to handle this problem, I've brushed up on my Erlang, Haskell, and Prolog, but these languages don't already have the advanced predicates I'm looking for. I know that some of my optimizations could bring down the search space, and humans can peruse the domain fairly quickly and make really good guesses about optimal answers. (The domain is parameterized on a dozen variables; it's really easy to pick outliers as probable candidates for being close to the best in the domain.)
What I'm looking for in this question isn't a magical algorithm to handle this search, but an answer to the question: Since Prolog and Haskell aren't the right tools for this, which language or library might be a better answer? I have written this up in Haskell, but on a trivial restricted search of 6 million items, it couldn't even reach ten thousand comparisons per second, and perhaps that is because Haskell is not a good fit for expressing these kinds of problems.
If I remember correctly, Coq has nice support for computations with constraints. At least, if your domain can be described as a formal system, Coq will help you write it down as code and perform basic computations.
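If it helps as a comparison point, finite-domain constraints plus an objective function are exactly what CP solvers target. This is a minimal sketch with Google's OR-Tools CP-SAT (not something mentioned in the thread, and the variables and constraints are made up rather than taken from your problem):

    # Minimal finite-domain constraint optimization with OR-Tools CP-SAT.
    # pip install ortools
    from ortools.sat.python import cp_model

    model = cp_model.CpModel()

    # Hypothetical decision variables over a large finite domain.
    x = model.NewIntVar(0, 2**20, "x")
    y = model.NewIntVar(0, 2**20, "y")

    # Basic constraints that prune the domain.
    model.Add(x + 2 * y <= 1_000_000)
    model.Add(x - y >= 10)

    # "Best fit" rather than a single feasible point.
    model.Maximize(3 * x + y)

    solver = cp_model.CpSolver()
    status = solver.Solve(model)
    if status in (cp_model.OPTIMAL, cp_model.FEASIBLE):
        print("x =", solver.Value(x), "y =", solver.Value(y))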

Culture-independent stemmer/analyzer for Lucene.NET

We're currently developing a full-text-search-enabled app, and Lucene.NET is our weapon of choice. The app is expected to be used by people from different countries, so Lucene.NET has to be able to search across Russian, English, and other texts equally well.
Are there any universal, culture-independent stemmers and analyzers to suit our needs? I understand that eventually we'd have to use culture-specific ones, but we want to get up and running with this potentially quick-and-dirty approach first.
Given that the spelling, grammar, and character sets of English and Russian are significantly different, any stemmer which tried to handle both would either be massively large or perform poorly (most likely both).
It would probably be much better to use a stemmer for each language, and pick which one to use based on either UI clues (what language is being used to query) or by explicit selection.
Having said that, it's unlikely that any Russian text will match an English search term correctly or vice-versa.
This sounds like a case where a little more business analysis would help more than code.
There is no such thing as a language-independent stemmer. In fact, whether stemming improves retrieval performance varies per language. The best you can do is language guessing on the documents and queries, then dispatching to the appropriate analyzer/stemmer.
Language guessing on short queries is hard, though (as in state-of-the-art hard, not quick 'n' dirty hard). If your queries are short, you might want to use a simple whitespace analyzer on the queries and not stem anything.
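To illustrate the guess-then-dispatch idea (sketched in Python for brevity; langdetect and NLTK's Snowball stemmers are my choice of libraries, not anything tied to Lucene.NET, where you would dispatch to per-language analyzers instead):

    # Guess the language, then dispatch to a per-language stemmer.
    # pip install langdetect nltk
    from langdetect import detect
    from nltk.stem.snowball import SnowballStemmer

    STEMMERS = {
        "en": SnowballStemmer("english"),
        "ru": SnowballStemmer("russian"),
    }

    def stem_tokens(text):
        lang = detect(text)                  # unreliable on very short queries
        stemmer = STEMMERS.get(lang)
        tokens = text.lower().split()
        if stemmer is None:
            return tokens                    # fall back to plain whitespace tokens
        return [stemmer.stem(t) for t in tokens]

    print(stem_tokens("running runners ran"))
    print(stem_tokens("бегущие бегуны бежали"))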

Resources