Ad Hoc Reports with Hadoop

I want to let people enter simple text search terms, run a Pig job (if that's the best tool? it's what I know best), and output the results (the TSV file results?) so I can show them in a web interface.
Is there anything that approaches this problem? Is there anything known to link together the few disjointed pieces of the flow I'm going for?
Thanks

Why don't you index the docs into Lucene or Solr? Then you can do text search in real time. Hadoop is designed for batch-oriented processing, which doesn't seem like what you want in this case.
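As a minimal sketch of that approach, assuming a recent Lucene release (the API shifts between major versions) and an illustrative "body" field and index path:

    import java.nio.file.Paths;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.queryparser.classic.QueryParser;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;

    public class SimpleSearch {
        public static void main(String[] args) throws Exception {
            Directory dir = FSDirectory.open(Paths.get("/tmp/demo-index"));
            StandardAnalyzer analyzer = new StandardAnalyzer();

            // Index a couple of documents.
            try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer))) {
                for (String text : new String[] {"hadoop batch processing", "real-time text search"}) {
                    Document doc = new Document();
                    doc.add(new TextField("body", text, Field.Store.YES));
                    writer.addDocument(doc);
                }
            }

            // Query interactively; no batch job in the loop.
            try (DirectoryReader reader = DirectoryReader.open(dir)) {
                IndexSearcher searcher = new IndexSearcher(reader);
                TopDocs hits = searcher.search(new QueryParser("body", analyzer).parse("search"), 10);
                for (ScoreDoc sd : hits.scoreDocs) {
                    // storedFields() is the Lucene 9.x accessor; older versions use searcher.doc(sd.doc)
                    System.out.println(searcher.storedFields().document(sd.doc).get("body"));
                }
            }
        }
    }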

Well, it depends on your project's requirements: does it need low latency, and how complex are the ad hoc searches? I think HBase + Pig might be a compromise solution. HBase can be used for real-time search purposes (although its search functionality is not as powerful as an RDBMS's), and Pig for batch processing of large amounts of data.
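A rough sketch of the HBase half of that split, using the standard HBase Java client (the "documents" table and "content:body" column are made up for illustration):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseLookup {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            try (Connection conn = ConnectionFactory.createConnection(conf);
                 Table table = conn.getTable(TableName.valueOf("documents"))) {
                // Point lookups by row key are the real-time operation HBase is good at.
                Result row = table.get(new Get(Bytes.toBytes("doc-0001")));
                byte[] body = row.getValue(Bytes.toBytes("content"), Bytes.toBytes("body"));
                System.out.println(body == null ? "(not found)" : Bytes.toString(body));
            }
        }
    }

Note that HBase only gives you fast lookups and scans by row key; anything resembling free-text search still needs an index built elsewhere, e.g. by the Pig batch job.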

Related

How does a search engine, say Google's page-ranking algorithm, work across distributed/multiple machines?

I am new to distributed computing, but I was wondering how a page-ranking algorithm works across multiple machines. For example:
When do they decide that data should be replicated (if it's needed at all)?
If data is not copied, do they ask servers in other places to give them the result?
Or do they send "modules" (say, parts of a huge linked graph) to different servers, one module to one server and another module to another server, and then combine the results they receive?
When I search for something, how does it fetch pages from my country (you know, "search pages from <insert country> only")?
This is not homework, just a question I had. I welcome all ideas, even if they are very general or very detailed or do not answer all of my questions.
Right now I know next to nothing; my hope is to know something after going through the answers.
There are three pillars: MapReduce, the Google File System (GFS), and BigTable.
Here are some whitepapers on the architecture:
GoogleCluster
MapReduce, GFS, BigTable
Note: some of these are quite outdated; nowadays Google does live index updates, which wouldn't work with MapReduce.
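To make the MapReduce part concrete, here is the canonical Hadoop word-count job. It is a generic sketch of how work is split across mappers and recombined by reducers, not Google's actual code:

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {
        // Each mapper sees one chunk of the input, wherever that chunk lives in the cluster.
        public static class TokenMapper extends Mapper<Object, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();
            public void map(Object key, Text value, Context ctx)
                    throws IOException, InterruptedException {
                StringTokenizer it = new StringTokenizer(value.toString());
                while (it.hasMoreTokens()) {
                    word.set(it.nextToken());
                    ctx.write(word, ONE);
                }
            }
        }

        // The framework routes all values for one key to one reducer, combining partial results.
        public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            public void reduce(Text key, Iterable<IntWritable> values, Context ctx)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) sum += v.get();
                ctx.write(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenMapper.class);
            job.setReducerClass(SumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }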

Problem: Need to look up a sentence in a database of millions of sentences?

So, I'll be storing millions of sentences in a database, each with an author. I need to be able to efficiently search for a sentence and return the author. I'd also like to be able to misspell a word, or forget a word or two in the sentence, and have the application still match it (fuzzy-esque). Can anyone point me in the right direction? How does Google do this? I can search for lyrics on Google, for instance, and it will return the song with those lyrics. I'm looking to do the same thing.
Thanks all.
If fuzzy makes things too complicated, then I can deal with just an efficient sentence search.
If you're writing in Java, you can try Lucene.
Shouldn't it really be "document" and author instead of individual sentences?
For full-text search, check out the inverted index data structure.
This is how search engines do it; there are code samples around.
UPDATE: also, if you're working on a distributed system, check out Hadoop, an open-source alternative to Google's MapReduce.
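For a concrete feel, a minimal in-memory sketch of an inverted index; real engines add stemming, stop words, positional postings, and on-disk storage:

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Map;
    import java.util.Set;

    public class InvertedIndex {
        // term -> set of document ids that contain it
        private final Map<String, Set<Integer>> postings = new HashMap<>();
        private final List<String> docs = new ArrayList<>();

        public void add(String text) {
            int docId = docs.size();
            docs.add(text);
            for (String term : text.toLowerCase().split("\\W+")) {
                if (term.isEmpty()) continue;
                postings.computeIfAbsent(term, t -> new HashSet<>()).add(docId);
            }
        }

        // AND query: documents containing every query term.
        public Set<Integer> search(String query) {
            Set<Integer> result = null;
            for (String term : query.toLowerCase().split("\\W+")) {
                if (term.isEmpty()) continue;
                Set<Integer> hits = postings.getOrDefault(term, Set.of());
                if (result == null) result = new HashSet<>(hits);
                else result.retainAll(hits);
            }
            return result == null ? Set.of() : result;
        }
    }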
Full-text indexing on SQL Server or Oracle will most likely be what you're after, right out of the box. They can go fuzzy, use word roots, and do other clever stuff.
I can't comment on other DB engines, though a quick google shows most have something similar. For some reason I expect them to be more limited in their fuzziness.
Indeed, fuzzy matching is not a simple thing to do. Although some databases implement some kind of fuzzy search, your results may vary depending on the method used and your data. Here's a link that explains fuzzy searches in SQL Server:
http://msdn.microsoft.com/en-us/magazine/cc163731.aspx
As for the sentence search, most DB engines implement full-text search/indexing, which comes with trade-offs in terms of performance and storage, but you may want to look at it.
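As a rough illustration of what the built-in full-text route looks like from application code, assuming SQL Server's FREETEXT predicate and a hypothetical sentences(author, sentence) table with a full-text index on the sentence column:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;

    public class SentenceLookup {
        public static void main(String[] args) throws Exception {
            String url = "jdbc:sqlserver://localhost;databaseName=lyrics";  // hypothetical connection
            try (Connection conn = DriverManager.getConnection(url, "user", "password");
                 // FREETEXT matches on meaning rather than exact wording, which gets you
                 // part of the way toward the "forgot a word or two" requirement.
                 PreparedStatement ps = conn.prepareStatement(
                     "SELECT author FROM sentences WHERE FREETEXT(sentence, ?)")) {
                ps.setString(1, "the quick brown fox jumps over");
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) System.out.println(rs.getString("author"));
                }
            }
        }
    }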
How does Google do this?
Using inverted indexes. The details are proprietary, but you can bet your last dollar that there is a lot of replication, and that the indexes are stored in memory, so that they can handle the vast number of search requests they get per second.

Can anyone point me toward a content relevance algorithm?

A new project with some interesting requirements has arrived on my desk. I need to develop a searchable directory of businesses, with a focus on delivering relevant results based on arbitrary search queries. The businesses can be of any niche; there's no one area that is more represented than another.
When googling for things like "search algorithm" or "content relevance algorithm," all I get are references to Google's "Mystical Algorithm of the Old Gods" and SEO firms.
Does the relevance value of MySQL's full text Match() function have what it takes for the task? I've never used it, but I'm definitely going to do some testing. Also, since this will largely be a human edited directory, I can assume that we can add weighted factors like tagging and categories. What would be a good way to combine these factors with MySQL's Match() relevancy?
I'm also open to ideas that I've not discussed here.
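To make the question concrete, here's the kind of blend I'm imagining, with hypothetical businesses and tags tables, a FULLTEXT index on (name, description), and an arbitrary tag weight:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;

    public class DirectorySearch {
        public static void main(String[] args) throws Exception {
            String query = "organic bakery";
            try (Connection conn = DriverManager.getConnection(
                    "jdbc:mysql://localhost/directory", "user", "password");
                 // Blend full-text relevance with a boost for matching human-assigned tags.
                 // The 2.0 tag weight is arbitrary and would need tuning on real queries;
                 // MySQL lets HAVING reference the select alias, a common full-text idiom.
                 PreparedStatement ps = conn.prepareStatement(
                     "SELECT b.name, " +
                     "  MATCH(b.name, b.description) AGAINST (?) " +
                     "    + 2.0 * (SELECT COUNT(*) FROM tags t " +
                     "             WHERE t.business_id = b.id AND t.tag = ?) AS score " +
                     "FROM businesses b " +
                     "HAVING score > 0 ORDER BY score DESC LIMIT 20")) {
                ps.setString(1, query);
                ps.setString(2, query);
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        System.out.println(rs.getString("name") + "  " + rs.getDouble("score"));
                    }
                }
            }
        }
    }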
For examples of information-retrieval-based techniques, look up TF-IDF or BM25.
For machine-learning-based techniques, look up RankNet and its variants from MSR.
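The core of TF-IDF fits in a few lines; a toy sketch (BM25 refines this with term-frequency saturation and document-length normalization):

    import java.util.List;

    public class TfIdf {
        // Score one document against a query over a small corpus.
        // tf(t, d) = count of t in d; idf(t) = log(N / df(t)).
        static double score(String query, String doc, List<String> corpus) {
            double score = 0;
            for (String term : query.toLowerCase().split("\\W+")) {
                if (term.isEmpty()) continue;
                long tf = count(doc, term);
                long df = corpus.stream().filter(d -> count(d, term) > 0).count();
                if (tf > 0 && df > 0) {
                    score += tf * Math.log((double) corpus.size() / df);
                }
            }
            return score;
        }

        static long count(String doc, String term) {
            long n = 0;
            for (String t : doc.toLowerCase().split("\\W+")) if (t.equals(term)) n++;
            return n;
        }

        public static void main(String[] args) {
            List<String> corpus = List.of(
                "organic bakery and coffee shop",
                "auto repair shop",
                "organic grocery store");
            for (String d : corpus) {
                System.out.printf("%.3f  %s%n", score("organic bakery", d, corpus), d);
            }
        }
    }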
If you have hand-edited data, have a look at Oracle Text. In one of my previous projects we had some good results with it.
I was not directly involved in the database setup, but I know that the results were very welcome. (Before this they had just keyword-based search.)
Use a search engine like Solr to index the data. You can still use MySQL to hold the data, but for searches, use the search engine.

How to cluster evolving data streams

I want to incrementally cluster text documents, reading them as a data stream, but there seems to be a problem. Most term-weighting options are based on the vector space model with TF-IDF as the feature weight. However, in our case the IDF of an existing attribute changes with every new data point, so a previous clustering no longer remains valid, and popular algorithms like CluStream, CURE, and BIRCH, which assume fixed-dimensional static data, cannot be applied.
Can anyone redirect me to existing research related to this, or give suggestions? Thanks!
Have you looked at
TF-ICF: A New Term Weighting Scheme for Clustering Dynamic Data Streams
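The key idea there, if I recall the paper right, is that the inverse frequency comes from a large static reference corpus rather than from the stream itself, so a document's weights never change after it arrives. A minimal sketch of such a weighting, with the reference counts assumed to be precomputed:

    import java.util.HashMap;
    import java.util.Map;

    public class TfIcf {
        private final Map<String, Long> refCounts;  // term -> doc frequency in a static reference corpus
        private final long refSize;                 // number of docs in the reference corpus

        public TfIcf(Map<String, Long> refCounts, long refSize) {
            this.refCounts = refCounts;
            this.refSize = refSize;
        }

        // Weight a streamed document. Because ICF depends only on the reference corpus,
        // earlier documents' vectors stay valid as new documents arrive.
        public Map<String, Double> weigh(String doc) {
            Map<String, Long> tf = new HashMap<>();
            for (String t : doc.toLowerCase().split("\\W+")) {
                if (!t.isEmpty()) tf.merge(t, 1L, Long::sum);
            }
            Map<String, Double> weights = new HashMap<>();
            tf.forEach((term, f) -> {
                long cf = refCounts.getOrDefault(term, 0L);
                weights.put(term, f * Math.log((refSize + 1.0) / (cf + 1.0)));  // smoothed ICF
            });
            return weights;
        }
    }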
Here's an idea off the top of my head:
What's your input data like? I'm guessing it's at least similarly themed, so you could start with a base phrase dictionary and use that for IDF. Apache Lucene is a great indexing engine. Since you have a base dictionary, you can run k-means or whatever you'd like. As documents come in, you'll have to rebuild the dictionary at some frequency (which can be off-loaded to another thread/machine/etc.) and then re-cluster.
With the data indexed in a high-performance, flexible engine like Lucene, you could run queries even as new documents are being indexed. I bet if you do some research on different clustering algorithms you'll find some good ideas.
Some interesting paper/links:
http://en.wikipedia.org/wiki/Document_classification
http://www.scholarpedia.org/article/Text_categorization
http://en.wikipedia.org/wiki/Naive_Bayes_classifier
Without more information, I can't see why you couldn't re-cluster every once in a while. You might want to take a look at some of the recommender systems already out there.

Indexing algorithms to develop an app like Google Desktop Search?

I want to develop a Google Desktop Search-like application, and I want to know which indexing techniques/algorithms I should use so I can get very fast data retrieval.
In general, what you want is an Inverted Index. You can do the indexing yourself, but it's a lot of work to get right - you need to handle stemming, stop words, extending the posting list to include positions in the document so you can handle multi-word queries, and so forth. Then, you need to store the index, probably in a B-Tree on disk - or you can make life easier for yourself by using an existing database for the disk storage, such as BDB. You also need to write a query planner that interprets user queries, performs query expansion and converts them to a series of index scans. Wikipedia's article on Search Engine Indexing provides a good overview of all the challenges, too.
Or, you can leverage existing work and use ready-made full text indexing solutions like Apache Lucene and Compass (which is built on Lucene). These tools handle practically everything detailed above (and more), which just leaves you writing the tool to build and update the index by feeding all your documents into Lucene, and the UI to allow users to search it.
The Burrows-Wheeler transform, used to compress data in bzip2, can be used to build full-text indexes in which substring search time depends only on the length of the pattern, not on the size of the text.
http://en.wikipedia.org/wiki/Burrows-Wheeler_transform
I haven't seen a simple introduction online, but here is a lot of detail:
http://www.ddj.com/architect/184405504
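For a feel of the transform itself, here is a naive sketch; real implementations build it from a suffix array instead of materializing and sorting every rotation:

    import java.util.Arrays;

    public class Bwt {
        // Naive BWT: sort all rotations of the text, take the last column.
        // '\0' serves as the end-of-string sentinel that makes the transform invertible.
        static String transform(String s) {
            String text = s + '\0';
            int n = text.length();
            Integer[] rot = new Integer[n];
            for (int i = 0; i < n; i++) rot[i] = i;
            Arrays.sort(rot, (a, b) -> {
                for (int k = 0; k < n; k++) {
                    int c = text.charAt((a + k) % n) - text.charAt((b + k) % n);
                    if (c != 0) return c;
                }
                return 0;
            });
            StringBuilder last = new StringBuilder();
            for (int i : rot) last.append(text.charAt((i + n - 1) % n));
            return last.toString();
        }

        public static void main(String[] args) {
            // Runs of repeated characters in the output are what makes bzip2 compress well.
            System.out.println(transform("banana").replace('\0', '$'));  // prints "annb$aa"
        }
    }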
