How to cluster evolving data streams - algorithm

I want to incrementally cluster text documents, reading them as a data stream, but there seems to be a problem. Most term-weighting options are based on the vector space model, using TF-IDF as the weight of a feature. However, in our case the IDF of an existing attribute changes with every new data point, so the previous clustering no longer remains valid, and popular algorithms like CluStream, CURE, and BIRCH, which assume fixed-dimensional static data, cannot be applied.
Can anyone point me to existing research related to this, or give suggestions? Thanks!
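For concreteness, here is a tiny sketch of the drift (illustrative numbers only, using IDF = log(N / df)): the weight of a term that was already indexed keeps changing as more documents arrive, so every previously computed vector is stale.

import math

def idf(total_docs, doc_freq):
    return math.log(total_docs / doc_freq)

# "cluster" appears in 10 of the first 100 documents
print(idf(100, 10))    # ~2.30

# after 900 more documents arrive and 500 of them contain "cluster",
# the same term's IDF (and every existing TF-IDF vector) changes
print(idf(1000, 510))  # ~0.67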

Have you looked at
TF-ICF: A New Term Weighting Scheme for Clustering Dynamic Data Streams?

Here's an idea off the top of my head:
What's your input data like? I'm guessing it's at least similarly themed, so you could start with a base phrase dictionary and use that for IDF. Apache Lucene is a great indexing engine. Since you have a base dictionary, you can run k-means or whatever you'd like. As documents come in, you'll have to rebuild the dictionary at some frequency (which can be off-loaded to another thread/machine/etc.) and then re-cluster.
With the data indexed in a high-performance, flexible engine like Lucene, you could run queries even as new documents are being indexed. I bet if you do some research on different clustering algorithms you'll find some good ideas.
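A rough sketch of that workflow (my own illustration; scikit-learn and the seed texts are assumptions, not something from this answer): fit the vectorizer once on a seed corpus so the vocabulary and IDF stay fixed between rebuilds, update the clusters incrementally as documents arrive, and periodically rebuild everything.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import MiniBatchKMeans

seed_corpus = [
    "clustering evolving data streams",
    "lucene indexing engine",
    "incremental text clustering",
    "tf idf term weighting",
    "document stream processing",
    "kmeans on tfidf vectors",
]

vectorizer = TfidfVectorizer().fit(seed_corpus)   # base dictionary, frozen IDF
kmeans = MiniBatchKMeans(n_clusters=2, random_state=0)
kmeans.partial_fit(vectorizer.transform(seed_corpus))

seen_docs = list(seed_corpus)

def on_new_document(text):
    # Assign the incoming document to a cluster without refitting anything;
    # the fixed vocabulary keeps the dimensionality stable.
    seen_docs.append(text)
    x = vectorizer.transform([text])
    kmeans.partial_fit(x)
    return int(kmeans.predict(x)[0])

def periodic_rebuild():
    # Run occasionally (possibly on another thread/machine): new dictionary,
    # fresh IDF values, full re-cluster.
    global vectorizer, kmeans
    vectorizer = TfidfVectorizer().fit(seen_docs)
    kmeans = MiniBatchKMeans(n_clusters=2, random_state=0).fit(
        vectorizer.transform(seen_docs))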
Some interesting paper/links:
http://en.wikipedia.org/wiki/Document_classification
http://www.scholarpedia.org/article/Text_categorization
http://en.wikipedia.org/wiki/Naive_Bayes_classifier
Without more information, I can't see why you couldn't re-cluster every once in a while. You might want to take a look at some of the recommender systems already out there.

Related

How does CouchDB 1.6 inherently take advantage of map-reduce when it is a single-server database?

I am new to CouchDB. While going through the documentation for CouchDB 1.6, I came to know that it is a single-server DB, so I was wondering how it inherently takes advantage of map-reduce.
If I need to scale this DB, do I need to add more RAID hardware, or will it work on commodity hardware like HDFS does?
I came to know that CouchDB 2.0 is planning to bring a clustering feature, but I could not find proper documentation on this.
Can you please help me understand how exactly files get stored and accessed internally?
I really appreciate your help.
I think your question is something like this:
"MapReduce is … a parallel, distributed algorithm on a cluster." [shortened from MapReduce article on Wikipedia]
But CouchDB 1.x is not a clustered database.
So what does CouchDB mean by using the term "map reduce"?
This is a reasonable question.
The historical use of "MapReduce", as described by Google in this paper using that stylized term and implemented in Hadoop under the same styling, implies parallel processing over a dataset that may be too large for a single machine to handle.
But that's not how CouchDB 1.x works. View index "map" and "reduce" processing happens not just on a single machine, but even in a single thread! As dch (a longtime contributor to the core CouchDB project) explains in his answer at https://stackoverflow.com/a/12725497/179583:
The issue is that eventually, something has to operate in serial to build the B~tree in such a way that range queries across the indexed view are efficient. … It does seem totally wacko the first time you realise that the highly parallelisable map-reduce algorithm is being operated sequentially, wat!
So: what benefit does map/reduce bring to single-server CouchDB? Why were CouchDB 1.x view indexes built around it?
The benefit is that the two functions a developer can provide for each index, "map" and optionally "reduce", form very simple building blocks that are easy to reason about, at least once your indexes are designed.
What I mean is this:
With e.g. the SQL query language, you focus on what data you need, not on how much work it takes to find it. So you might have unexpected performance problems, which may or may not be solved by figuring out the right columns to add indexes on, etc.
With CouchDB, the so-called NoSQL approach is taken to an extreme. You have to think explicitly about how each document or set of documents "should be" found. You say: I want to be able to find all the "employee" documents whose "supervisor" field matches a certain identifier. So now you have to write a map function:
function (doc) {
if (doc.isEmployeeRecord) emit(doc.supervisor.identifier);
}
And then you have to query it like:
GET http://couchdb.local:5984/personnel/_design/my_indexes/_view/by_supervisor?key=SOME_UUID
In SQL you might simply say something like:
SELECT * FROM personnel WHERE supervisor = ?
So what's the advantage to the CouchDB way? Well, in the SQL case this query could be slow if you don't have an index on the supervisor column. In the CouchDB case, you can't really make an unoptimized query by accident — you always have to figure out a custom view first!
(The "reduce" function that you provide to a CouchDB view is usually used for aggregate purposes, like counting or averaging across multiple documents.)
If you think this is a dubious advantage, you are not alone. Personally I found designing my own indexes via a custom "map function" and sometimes a "reduce function" to be an interesting challenge, and it did pay off in knowing the scaling costs at least of queries (not so much for replications…).
So don't think of CouchDB views so much as "MapReduce" (in the stylized sense) but just as providing efficiently-accessible storage for the results of running [].map(…).reduce(…) across a set of data. Because the "map" function is applied to only one document at a time, the total set of data can be bigger than fits in memory at once. Because the "reduce" function is limited in the size of its output, it further encourages efficient processing of a large set of data into an efficiently-accessed index.
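To make that mental model concrete, here is a small sketch in Python (not CouchDB code, and the documents are invented): a view is just the stored, incrementally maintained result of a map step per document plus an optional reduce step per key.

from collections import defaultdict

docs = [
    {"isEmployeeRecord": True, "name": "ann", "supervisor": {"identifier": "u1"}},
    {"isEmployeeRecord": True, "name": "bob", "supervisor": {"identifier": "u1"}},
    {"isEmployeeRecord": False, "name": "memo-42"},
]

# "map": emit one (key, value) row per matching document
rows = []
for doc in docs:
    if doc.get("isEmployeeRecord"):
        rows.append((doc["supervisor"]["identifier"], doc["name"]))

# the stored view: key -> rows, so ?key=u1 becomes a lookup rather than a scan
by_supervisor = defaultdict(list)
for key, value in rows:
    by_supervisor[key].append(value)

# "reduce": aggregate per key (here a simple count)
counts = {key: len(values) for key, values in by_supervisor.items()}

print(by_supervisor["u1"])  # ['ann', 'bob']
print(counts)               # {'u1': 2}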
If you want to learn a bit more about how the indexes generated in CouchDB are stored, you might find these articles interesting:
The Power of B-trees
CouchDB's File Format is brilliantly simple and speed-efficient (at the cost of disk space).
Technical Details, View Indexes
You may have noticed, and I am sorry, that I do not actually have a clear/solid answer as to what the actual advantages and reasons were! I did not design or implement CouchDB; I was only an avid user for many years.
Maybe the bigger advantage is that, in systems like Couchbase and CouchDB 2.x, the "parallel friendliness" of the map/reduce idea may come into play more. So if you have designed an app to work with CouchDB 1.x, it may then scale in the newer versions without further intervention on your part.

What is a convenient way to do document clustering with elasticsearch?

I have stored a lot of news articles from RSS feeds from different sources in an Elasticsearch index. At the moment, when I do a search query, it returns a lot of similar news articles for one query, because the same news topic gets covered by many RSS sources.
Instead, I would like to return only one news article out of a group of articles on the same topic. So I somehow need to recognize which articles are about the same topic, cluster these documents, and return only the "best" article out of such a cluster.
What would be the most convenient way to approach that problem?
Can I somehow make use of the Elasticsearch more-like-this API? Or is the https://github.com/carrot2/elasticsearch-carrot2 plugin the way to go? Or is there simply no convenient way, so that I have to implement my own version of http://en.wikipedia.org/wiki/K-means_clustering or http://en.wikipedia.org/wiki/Non-negative_matrix_factorization to cluster my documents?
I don't think you'll be able to do the clustering adequately from within Elasticsearch. But you can definitely use the clustering results in your ES query.
If I were going to do it, I would use the data you have as input for a clustering algorithm, probably implemented in Apache Spark. I've written a few blog posts about using ES and Spark together (here's one: http://blog.qbox.io/deploy-elasticsearch-and-apache-spark-to-the-cloud). Exactly how to do that is probably outside the scope of a StackOverflow answer, but there are lots of ways to go about it. You certainly don't have to use Spark, of course (I just like it). Pick your favorite programming paradigm to implement clustering, or even use a third-party library. There are plenty out there.
Once I was happy with my clustering results, I would save the cluster metadata back to ES as a "parent" dataset, so every article would have a parent document representing the cluster to which the article belongs. This relationship could then be used (maybe with a top_children query, a has_parent query, or something similar) to return the results you want.
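As a rough sketch of that workflow (my own illustration using scikit-learn rather than Spark, which this answer explicitly allows; fetch_articles and tag_article_with_cluster are hypothetical placeholders for your ES read and update code):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

def cluster_articles(fetch_articles, tag_article_with_cluster, n_clusters=50):
    # fetch_articles() returns dicts with "id", "title", "body" (assumption).
    articles = fetch_articles()
    texts = [a["title"] + " " + a["body"] for a in articles]

    X = TfidfVectorizer(max_features=20000, stop_words="english").fit_transform(texts)
    labels = KMeans(n_clusters=n_clusters, random_state=0).fit_predict(X)

    # Store the cluster id with each article so an ES query can group on it
    # and return one "best" article per cluster.
    for article, label in zip(articles, labels):
        tag_article_with_cluster(article["id"], int(label))

The stored cluster id can then back the parent/child relationship described above, or simply be a field you group on at query time.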
ES is not particularly useful for clustering. Most clustering algorithms require pairwise distance computations, which is easiest if you can fit all your data into a huge matrix (and then factor it).
So it may well be easier (and faster) to work outside ES!
None of the approaches work half as well as advertised. See e.g. "Reading Tea Leaves". Everybody who constructs such an algorithm is happy to get anything out, and will tune and fiddle with parameters and rerun until the result looks nice. The technical term is cherry picking. Evaluation is incredibly sloppy, and if you look at the results closely, they aren't any better than choosing a random keyword (say, car) and doing a text search on that, which is much more meaningful than those "topics" discovered by topic models that nobody can decipher in practice. So good luck...
Chang, J., Gerrish, S., Wang, C., Boyd-Graber, J. L., & Blei, D. M. (2009). Reading tea leaves: How humans interpret topic models. In Advances in Neural Information Processing Systems (pp. 288-296).
Carrot2 (as mentioned in the question) is very good for clustering the results of a query - it only scales up to hundreds or thousands of documents, but that may be enough. If you need larger scale, then methods like locality-sensitive hashing avoid the need to calculate all the pairwise distances. Using ES's "more-like-this" could work as a quick-and-dirty alternative to hashing, but would probably need some post-processing.
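For a flavour of the locality-sensitive hashing idea, here is a minimal random-hyperplane sketch (my own illustration, assuming numpy and dense document vectors; the parameters are arbitrary): near-duplicate articles tend to fall into the same bucket, so only candidates within a bucket need an exact comparison.

import numpy as np
from collections import defaultdict

def lsh_buckets(vectors, n_planes=16, seed=0):
    # vectors: array of shape (n_docs, n_features), e.g. dense TF-IDF rows.
    # Documents pointing in a similar direction (high cosine similarity)
    # tend to get the same bit signature.
    rng = np.random.default_rng(seed)
    planes = rng.normal(size=(vectors.shape[1], n_planes))
    bits = (vectors @ planes) > 0          # one bit per hyperplane
    buckets = defaultdict(list)
    for i, row in enumerate(bits):
        buckets[tuple(row.tolist())].append(i)
    return buckets

Within each bucket you can then run exact similarity checks, or hand the small candidate set to something like Carrot2.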

Develop a distributed Full-Text search Index (AKA Inverted index)

I know how to develop a simple inverted index on a single machine. In short, it is a standard hash table kept in memory where:
- key: a word
- value: a list of word locations
As an example, the code is here: http://rosettacode.org/wiki/Inverted_Index#Java
Question:
Now I'm trying to make it distributed among n nodes and in turn:
Make this index horizontally scalable
Apply automatic sharding to this index.
I'm especially interested in automatic sharding. Any ideas or links are welcome!
Thanks.
Sharding by itself is quite a complex task which is not completely solved even in modern DBs. Typical problems in distributed DBs are the CAP theorem and some other low-level, quite challenging tasks like rebalancing your cluster data after adding a new blank node, or after a naturally occurring imbalance in the data.
The best data distribution implemented in a DB that I've seen was in Cassandra. However, full-text search is not yet implemented in Cassandra, so you might consider building your distributed index on top of it.
Some other already-implemented options are Elasticsearch and SolrCloud. One important detail missing from the example you gave is word stemming: with stemming you basically search for any form of a word, like "sing", "sings", "singer". Lucene and the two previous solutions have it implemented for the majority of languages.
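As a very bare-bones sketch of automatic sharding by term (my own illustration; the node names and routing scheme are invented): hash each word to one of the n nodes, so all postings for a term live on the same shard and a query only has to contact the shards owning its query terms.

import hashlib

NODES = ["node-0", "node-1", "node-2"]   # illustrative node names

def shard_for(term):
    # Every occurrence of a term is routed to the same node.
    digest = hashlib.md5(term.encode("utf-8")).hexdigest()
    return NODES[int(digest, 16) % len(NODES)]

def route_postings(doc_id, words):
    # Yields (node, posting) pairs for the indexer to send out.
    for position, word in enumerate(words):
        yield shard_for(word), (word, doc_id, position)

print(shard_for("sing"), shard_for("singer"))

Note that plain modulo hashing re-routes most terms when you add a node, which is exactly the rebalancing problem mentioned above; consistent hashing (as used by Cassandra's token ring) is the usual way to soften that.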

Algorithm to recognize keywords' categories in a One-search-box-for-all model query

I'm aiming at providing a one-search-box-for-everything model in a search engine project, like LinkedIn's.
I've tried to express my problem using an analogy.
Let's assume that each result is an article and has multiple dimensions like author, topic, conference (if that's a publication), hosted website, etc.
Some sample queries:
"information retrieval papers at IEEE by authorXYZ": three dimensions {topic, conf-name, authorname}
"ACM paper by authoABC on design patterns" : three dimensions {conf-name, author, topic}
"Multi-threaded programming at javaranch" : two dimensions {topic, website}
I have to identify those dimensions and the corresponding keywords in a big query before I can retrieve the final result from the database.
Points
I have access to all the possible values of all the dimensions. For example, I have all the conference names, author names, etc.
There's very little overlap of terms across dimensions.
My approach (naive)
Using Lucene, index all the keywords in each dimension with a dedicated field called "dimension" and another field with the actual value.
Ex:
1) {name:IEEE, dimension:conference}, etc.
2) {name:ooad, dimension:topic}, etc.
3) {name:xyz, dimension:author}, etc.
Search the index with the query as-is.
Iterate through the results up to some extent and recognize the first document with a new dimension.
Problems
Not sure when to stop recognizing the dimensions from the result set. For example, the query may contain only two dimensions but the results may match 3 dimensions.
If I want to include spell-checking as well, it becomes more complex and the results tend to be less accurate.
References to papers, articles, or pointing-out the right terminology that describes my problem domain, etc. would certainly help.
Any guidance is highly appreciated.
Solution 1: How about solving your problem using Named Entity Recognition (NER) from Natural Language Processing? NER can be done using simple regular expressions (in cases where the data is fairly static), or you can use a machine learning technique like Hidden Markov Models to actually figure out the named entities in your sequence data. The reason I stress HMMs over other supervised machine learning algorithms is that you have sequential data, with each state dependent on the previous or next state. NER would output the dimensions along with the corresponding names. After that your search becomes a vertical search problem, and you can just search for the identified words in different Solr/Lucene fields and set your boosts accordingly.
Now, coming to the implementation part: I assume you know Java, as you are working with Lucene, so Mahout is a good choice. Mahout has an HMM built in, and you can train and test the model on your data set. I am also assuming you have a large data set.
Solution 2: Try to model this problem as a property graph problem. Check out something like Neo4j. I suggest this as your problem falls under a schema-less domain: your schema is not fixed, and the problem can very well be modelled as a graph where each node is a set of key-value pairs.
Solution 3: Since you said you have all the possible values of the dimensions, why not, before anything else, simply convert your unstructured text into structured data using regular expressions? And since you do not have a fixed schema, store the data in a NoSQL key-value database. Most of them provide Lucene integrations for full-text search; then simply search on those databases.
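A small sketch of that conversion step (my own illustration; the dimension values below are invented examples): compile one pattern per dimension from the known values and turn a free-text query into structured fields.

import re

dimension_values = {
    "conference": ["IEEE", "ACM"],
    "author": ["authorXYZ", "authoABC"],
    "topic": ["information retrieval", "design patterns",
              "multi-threaded programming"],
    "website": ["javaranch"],
}

patterns = {
    dim: re.compile("|".join(re.escape(v) for v in values), re.IGNORECASE)
    for dim, values in dimension_values.items()
}

def to_structured(query):
    # Returns {dimension: matched value}, ready to store or query in a
    # key-value / document database.
    structured = {}
    for dim, pattern in patterns.items():
        match = pattern.search(query)
        if match:
            structured[dim] = match.group(0)
    return structured

print(to_structured("ACM paper by authoABC on design patterns"))
# {'conference': 'ACM', 'author': 'authoABC', 'topic': 'design patterns'}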
What you need to do is calculate the similarity between the query and the documents you are searching in. Measures like cosine similarity should serve your need. A hack you can use is to calculate TF-IDF for the documents and create an index using that score, from which you can choose the appropriate one. I would recommend you look into the vector space model to find a method that serves your need!
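A minimal sketch of that vector-space approach (my own illustration, assuming scikit-learn; the documents are made up): TF-IDF vectors plus cosine similarity between the query and every document. The BM25 scoring linked below is a stronger drop-in replacement for the raw cosine score.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "information retrieval papers at IEEE",
    "design patterns in java",
    "multi-threaded programming tutorial",
]

vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(documents)

def rank(query):
    # Score every document against the query and return them best-first.
    query_vector = vectorizer.transform([query])
    scores = cosine_similarity(query_vector, doc_vectors)[0]
    return sorted(zip(scores, documents), reverse=True)

print(rank("paper on design patterns")[0])  # the design-patterns document wins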
Give this algorithm a look as well:
http://en.wikipedia.org/wiki/Okapi_BM25

Can anyone point me toward a content relevance algorithm?

A new project with some interesting requirements has arrived on my desk. I need to develop a searchable directory of businesses, with a focus on delivering relevant results based on arbitrary search queries. The businesses can be of any niche; there's no one area that is more represented than another.
When googling for things like "search algorithm" or "content relevance algorithm," all I get are references to Google's "Mystical Algorithm of the Old Gods" and SEO firms.
Does the relevance value of MySQL's full-text Match() function have what it takes for the task? I've never used it, but I'm definitely going to do some testing. Also, since this will largely be a human-edited directory, I can assume that we can add weighted factors like tagging and categories. What would be a good way to combine these factors with MySQL's Match() relevancy?
I'm also open to ideas that I've not discussed here.
For examples of information-retrieval-based techniques, look up TF-IDF or BM25.
For machine-learning-based techniques, look up RankNet and its variants from MSR.
If you have hand-edited data, have a look at Oracle Text search. In one of my previous projects we had some good results with it.
I was not directly involved in the database setup, but I know that the results were very welcome. (Before this they had just keyword-based search.)
Use a search engine like Solr to index the data. You can still use MySQL to hold the data, but for searches use a search engine.
