Text search question about implementation - full-text-search

Can someone explain to me how text searching algorithms work? I understand it's a huge field, but I am trying to understand it at a high level so that I can look up academic papers on it.
For example, spelling mistakes are a tough problem to solve, and of course Google solves it. When I search for a term and misspell it on Google, it automatically suggests the correct spelling. How is indexing done for that? I can see that they index various entities using MapReduce. What do they, or anyone else, index and store? Maybe I am looking for a practical implementation of MapReduce, if I am thinking in the right direction at all.
Pav

I'm afraid this question really is too big, which probably explains why it has not seen an answer yet. As far as Google's spell-checker is concerned, Peter Norvig explains how it is done: How to Write a Spelling Corrector
The exact implementation in production use at Google surely looks quite a bit different and is far more complicated, but this might get you started.
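To make the idea concrete, here is a minimal sketch in the spirit of the edit-distance approach from Norvig's essay: generate every string one edit away from the input and pick the known word with the highest frequency. The tiny WORD_COUNTS table is a made-up stand-in for counts taken from a large corpus, and this is certainly not what Google runs in production.

```python
from collections import Counter

# Hypothetical toy frequency table; a real corrector would count words
# in a large text corpus instead.
WORD_COUNTS = Counter({"spelling": 50, "search": 40, "text": 30, "index": 20})
ALPHABET = "abcdefghijklmnopqrstuvwxyz"


def edits1(word):
    """All strings one delete, transpose, replace, or insert away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [
        left + right[1] + right[0] + right[2:]
        for left, right in splits
        if len(right) > 1
    ]
    replaces = [
        left + c + right[1:] for left, right in splits if right for c in ALPHABET
    ]
    inserts = [left + c + right for left, right in splits for c in ALPHABET]
    return set(deletes + transposes + replaces + inserts)


def known(words):
    """Keep only the words that appear in the frequency table."""
    return {w for w in words if w in WORD_COUNTS}


def correct(word):
    """Most frequent known word within one edit, else the word unchanged."""
    candidates = known([word]) or known(edits1(word)) or {word}
    return max(candidates, key=lambda w: WORD_COUNTS[w])
```

With this table, correct("speling") finds "spelling" through an insertion and correct("serach") finds "search" through a transposition. Words that are more than one edit away from anything known come back unchanged.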

Related

Recommendations for Fast Multipole Method implementation?

I'm interested in implementing the Fast Multipole Method to efficiently simulate a system of repulsive particles.
I've found a large collection of references discussing FMM, but none seem very approachable for non-mathematicians who want to fully understand the algorithm.
Can you recommend a ground-up reference that clearly explains the mathematics behind the process, and includes pseudocode exemplifying a proper implementation?
I am by no means an expert in FMM, but this Java implementation and introduction is the best source I've found so far for explaining it carefully and slowly. The paper is good at defining terms before using them, and the code at least is useful as a reference point. The math still gets hairy very quickly, but it is what it is :)
A pedestrian introduction to fast multipole methods is a close second. It doesn't explain the actual details of a working FMM implementation, but it's a good introduction to the basic ideas.
I like the short course on FMM. It begins with FMM in 1D, then uses the theory of complex variables to do FMM in 2D. And then there is the crazy 3D version, which uses the theory of spherical harmonics and which I guess can be very difficult for a non-mathematician. But if you need FMM only in 2D you should be fine.
Unfortunately, no pseudocode is given there.
But do you really need the accuracy of FMM? You might be fine with the Barnes-Hut algorithm.
After running into a similar issue to you, I ended up writing a fully-documented Python fast multipole method implementation, pybbfmm. I've also written a short, mathematics-free tutorial on how the method works. Together, I think they're substantially more accessible than any of the other presentations I could find.
(meta: Although this is effectively a linkpost, the OP is explicitly asking for a link. I've added what I think was missing from the last one - the name of the library - but I'm not sure how else to offer this answer except as a name and a link. Certainly it doesn't feel any more linkpost-y than the accepted answer. If this one gets deleted as well, I'll give up)

QA Algorithm for Q Processing

What algorithm or method should I use for the question processing stage of a Question Answering system?
I have been searching for possible algorithms for my Question Answering system. The only thing I think might work is parsing, but I asked about parsing in my last question, and from the answers there I think it may not be usable (I'm not sure).
My idea for using parsing is to cut the question into pieces, word by word, and then look each word up in a store of words that determines what kind of word (noun, adjective, verb, etc.) it is. My purpose in using parsing is to determine the topic of the question.
My other idea is a chatterbot. A chatterbot uses a stored list of words, where each word is assigned to other words, and it randomly chooses a reply from that list. Correct me if I'm mistaken.
Example: User's Statement: Hello > ChatterBot's Possible Replies: Hi,Hello,Hey
I'm not quite sure which method or algorithm to use in Question Answering. I have read the Wikipedia article, http://en.wikipedia.org/wiki/Question_answering, but I still do not understand which algorithm to use for question processing.
Thank you.
PS: I'm developing in JavaScript. Q = Question
You could use a naive Bayes classifier to look at the questions and determine their subject. You'd need a lot of training data and a fairly narrow domain.
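As a rough illustration of the naive Bayes suggestion, here is a small multinomial classifier with add-one smoothing, written in Python for brevity. The four training questions and the topic labels are invented for the example; a usable system would need far more data, as noted above.

```python
import math
from collections import Counter, defaultdict


def train(examples):
    """Build word and label counts from (question text, topic) pairs."""
    word_counts = defaultdict(Counter)  # topic -> word frequencies
    label_counts = Counter()            # topic -> number of questions
    vocab = set()
    for text, label in examples:
        words = text.lower().split()
        word_counts[label].update(words)
        label_counts[label] += 1
        vocab.update(words)
    return word_counts, label_counts, vocab


def classify(text, model):
    """Return the topic with the highest log posterior for the question."""
    word_counts, label_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in label_counts:
        # Log prior plus log likelihoods with add-one (Laplace) smoothing.
        score = math.log(label_counts[label] / total)
        denom = sum(word_counts[label].values()) + len(vocab)
        for word in text.lower().split():
            if word in vocab:  # words never seen in training are skipped
                score += math.log((word_counts[label][word] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy training set, for illustration only.
TRAINING = [
    ("what is the capital of france", "geography"),
    ("which river flows through egypt", "geography"),
    ("who wrote hamlet", "literature"),
    ("which author wrote the novel emma", "literature"),
]
MODEL = train(TRAINING)
```

Calling classify("who wrote the novel", MODEL) picks "literature" here, because those words occur more often in the literature questions. The word-independence assumption is crude, which is part of why a narrow domain helps.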
The sophisticated responses to this problem involve a lot of machine inference techniques, which are a bit beyond my skill level to explain well. My idea is to use a Markov network in which each word has an edge to one or two words next to it. A series of tests is applied to each word to indicate the likely membership of that word in one of its possible meanings. (For example, Mark is more likely a name if it's capitalized, but if the next word is 'a' it is probably being used as a verb.) From there the machine can attempt to determine the actual meaning of the sentence, which will rely, again, on unimaginably large amounts of training data.
Coursera's Probabilistic Graphical Models class (and probably their NLP class too) would probably be the best resource if you're interested in becoming skilled in this area. (PGM is the only reason I know anything about this!)
Here's a great book that you may want to read; it covers a lot of material related to NLP and question answering systems: http://www.amazon.com/Speech-Language-Processing-2nd-Edition/dp/0131873210
The book has a full section (V. Applications) that will help you a lot in developing a good system.
Note, though, that the book discusses theory and algorithms only (no code).
It's not only about parsing text; you'll need to understand the context to provide a better answer. In practice, you need to extract some keywords and ignore everything else.
You may also want to read up on keyword representations (bag of words) and weighting algorithms such as TF-IDF.
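To show what keyword extraction with TF-IDF can look like, here is a small sketch in Python. It treats each question as a bag of words and scores a word by its frequency in the question times the log of how rare it is across all questions. The three sample questions are made up, and this is the plain textbook weighting; libraries often use smoothed variants.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {word: score} dict per document (each a list of words)."""
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc))  # count each word once per document
    scores = []
    for doc in documents:
        counts = Counter(doc)
        scores.append({
            word: (count / len(doc)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return scores


def top_keywords(doc_scores, k=1):
    """The k highest-scoring words of a single document."""
    return sorted(doc_scores, key=doc_scores.get, reverse=True)[:k]


# Hypothetical sample questions, already split into words.
DOCS = [
    "what is the capital of france".split(),
    "what is the boiling point of water".split(),
    "who is the president of france".split(),
]
SCORES = tf_idf(DOCS)
```

Words such as "is", "the" and "of" appear in every question, so they score zero, while "capital" appears only in the first question and becomes its top keyword.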

Material and Information to improve algorithmic knowledge

Lately I have been stuck on improving my algorithmic skills, and at this point I find myself out of good material for solving grid problems based on DFS and BFS. I somehow managed to solve http://www.spoj.pl/problems/POUR1/ with brute-force logic, but I recently googled it and found out that the problem can be done with BFS. I can't figure out exactly how to go about it. Can someone please provide some text to read, or some kind of explanation of the above problem, so I can add this to my skill set? It would be extremely kind if you could also help me with these techniques in problems like http://www.codechef.com/problems/MMANT/. Please help as soon as possible; I am really stuck on these kinds of problems and can't move on. It would also be really kind if you could provide a list of good questions about binary indexed trees and segment trees, and some more examples of their usage.
Thanks for the help!! :)
One resource I've found useful is The Algorithmist:
The Algorithmist is a resource dedicated to anything algorithms - from
the practical realm, to the theoretical realm. There are also links
and explanation to problemsets.
Also, The Algorithm Design Manual by Steven Skiena is extremely useful, especially the second part.
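On the BFS part of the question, a water-pouring puzzle of the POUR1 kind can be modelled as a graph search: each state is the pair of amounts currently in the two jugs, and each move (fill, empty, pour) is an edge. The sketch below assumes the usual rules, two jugs of capacity a and b and a target amount that must end up in either jug. It is meant to show the technique, not to be a tuned submission for the judge.

```python
from collections import deque


def min_pour_steps(a, b, target):
    """Fewest fill/empty/pour moves to get target litres in a jug, or -1."""
    start = (0, 0)
    dist = {start: 0}  # state -> number of moves needed to reach it
    queue = deque([start])
    while queue:
        x, y = queue.popleft()
        if x == target or y == target:
            return dist[(x, y)]
        to_y = min(x, b - y)  # amount that fits when pouring jug 1 into jug 2
        to_x = min(y, a - x)  # amount that fits when pouring jug 2 into jug 1
        next_states = (
            (a, y), (x, b),          # fill either jug
            (0, y), (x, 0),          # empty either jug
            (x - to_y, y + to_y),    # pour jug 1 into jug 2
            (x + to_x, y - to_x),    # pour jug 2 into jug 1
        )
        for state in next_states:
            if state not in dist:
                dist[state] = dist[(x, y)] + 1
                queue.append(state)
    return -1
```

Because BFS visits states in order of distance from the start, the first state that contains the target amount is reached with the minimum number of moves. For example, min_pour_steps(5, 2, 3) gives 2 (fill the 5-litre jug, pour into the 2-litre jug), and min_pour_steps(2, 3, 4) gives -1 since neither jug can hold 4 litres.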

Making a spell check utility

This idea just popped into my head, so I don't have any code to show for it, but I was curious to know the answer. How is spell check implemented on most major word processors? I'm most curious to know what kind of data structures would be used in the creation of such a utility. Also, references to algorithms would be nice answers as well.
For a basic guide in Python, have a look here.
You might also want to look at this past question.
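On the data structure side, one common choice for storing a dictionary is a trie (prefix tree), which supports both exact lookups and listing words that share a prefix. I don't know what any particular word processor actually uses, so treat the following as a minimal sketch of that one option, with a made-up word list.

```python
class Trie:
    """Prefix tree over characters; each node may mark the end of a word."""

    def __init__(self):
        self.children = {}
        self.is_word = False

    def insert(self, word):
        node = self
        for ch in word:
            node = node.children.setdefault(ch, Trie())
        node.is_word = True

    def _walk(self, prefix):
        """Follow prefix from this node; None if the path does not exist."""
        node = self
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return None
        return node

    def contains(self, word):
        node = self._walk(word)
        return node is not None and node.is_word

    def with_prefix(self, prefix):
        """All stored words that start with prefix, in sorted order."""
        node = self._walk(prefix)
        if node is None:
            return []
        found, stack = [], [(node, prefix)]
        while stack:
            current, text = stack.pop()
            if current.is_word:
                found.append(text)
            for ch, child in current.children.items():
                stack.append((child, text + ch))
        return sorted(found)


# Hypothetical word list, for illustration only.
DICTIONARY = Trie()
for entry in ["spell", "spelling", "spend", "check"]:
    DICTIONARY.insert(entry)
```

A word is flagged as misspelled when contains() returns False, and with_prefix() can feed a simple suggestion list. A plain hash set is enough if only exact lookups are needed.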

How would I figure out if there are other algorithms similar to mine?

In another question, I asked something similar but I ended up just posting my algorithm there and invalidating several answers. I re-ask it here:
If I "invented" an algorithm, what's the best way for me to figure out if it's already been published about/patented?
You would need to do some searching. Starting with Google search, generically, will often be sufficient to reassure you that your algorithm is not novel. If that is not conclusive, then you need to search harder, perhaps looking at searching various patent sites (Google, USPTO, other places too). If you still don't find anything, then maybe your algorithm is novel.
Next questions: is it worth it to you to try and patent it, or get someone else to patent it for you (a company, for example)? Indeed, can you patent it or does your employer already own it? This will depend in part on how likely it is that everyone else will want to use the same algorithm. The chances are, they won't. If you patent it, they will ignore it until the patent expires.
If you do find a way to afford getting the patent filed - and issued (which is not automatic just because you filed) - then you face enforcing your patent. Will you be able to identify and prosecute those who abuse your patent? If not, was it worth chasing it? Maybe, maybe not; but probably not.
Finally, note that you cannot actually patent a pure algorithm. You would have to reduce it to practice. That isn't as hard as it seems, but just be aware that pure mathematical algorithms are inherently non-patentable.
In summary:
You will probably find someone else already thought of it.
If you decide to patent it because it is novel, you need money.
You need money to file for the patent.
You need money to pursue those who abuse your patent.
You would probably be better off just publishing it.
Most often you just have to do background research in the given area. This is why, when academics do research projects, they start off by learning about the history (background) of the area all the way up to the current methods or theories being used. It also helps to ask someone who knows the area and has worked in it for many years.
Well, if it's in a textbook like your algorithm seems to have been (Dijkstra), then it definitely already exists in the public domain and cannot be patented. How you use the algorithm in your application as a whole might be, but most abstract ideas or implementations thereof (such as "finding the shortest path between two nodes") cannot be patented.
Or, you could waste a bunch of money and submit a patent and see what happens :)
In all seriousness though, you might start by searching for existing patents, or read up on some articles like this one to get a better feel for the patent process.
Tracking down every algorithm for a particular problem would be quite daunting. A better process might be to track down the best solutions known for the problem and compare them with yours.
I would start with Wikipedia. I know people say "don't use wiki for research", but it's pretty good at computer science (all those geeks contributing), and it will tell you pretty quickly what the best widely-known algorithms are. If you've got something strictly better than the algorithms you can find in Wikipedia, then it might be worth looking further. If Wikipedia's got something strictly better than your algorithm, then you've invented a curiosity at best and probably won't get rich or famous from it.
Next, check the references at the bottom; they may lead you to papers (which will have more references that you can follow), or to academics' websites (which might have links). Also go to Citeseer and search for key words.
Unfortunately, there's no real replacement for having some basic knowledge. If you've invented (for instance) a graph-theoretic algorithm, but you don't know the language of graph-theory, then you'll struggle to find it because you won't know where to start looking. You might profitably spend your time reading an algorithms textbook -- that will give you an overview of good algorithms and how to speak about them.
If an algorithm or method cannot be found right away (Wikipedia/Google), I find it rewarding to scan academic and engineering sites (Web of Science, IEEE Xplore, ACM, etc.) for 'review' papers. If recent, they can give a solid overview of the field (e.g. graph search), mentioning books, papers and conferences. After that one can focus the search on particular methods.

Resources