Develop a distributed Full-Text search Index (AKA Inverted index) - full-text-search

I know how to develop a simple inverted index on a single machine. In short it is a standard hash table kept in-memory where:
- key - a word
- value - a List of word locations
As an example, the code is here: http://rosettacode.org/wiki/Inverted_Index#Java
Question:
Now I'm trying to make it distributed among n nodes and in turn:
Make this index horizontally scalable
Apply automatic sharding to this index.
I'm interested especially in automatic sharding. Any ideas or links are welcome!
Thanks.

Sharding by it self is quite a complex task which is not completely solved in the modern DBs. Typical problems in distributed DBs are a CAP theorem, and some other low-level and quite challenging tasks like rebalancing your cluster data after adding a new blank node or after naturally-occured imbalance in the data.
The best data distribution implemented in a DB I've seen was in Cassandra. However full text search is not yet implemented in Cassandra, so you might consider building your distributed index upon it.
Some other already implemented options are Elasticsearch and SolrCloud. In the example given one important detail is missing which is a word-stemming. With word stemming you basically search for any form of a word like "sing", "sings", "singer". Lucene and two previous solutions have it implemented for the majority of the languages.

Related

Elastic search or Trie for search/autocomplete?

My understanding how autocomplete/search for text/item works at high level in any scalable product like Amazon eCommerce/Google at high level was :-
Elastic Search(ES) based approach
Documents are stored in DB . Once persisted given to Elastic search, It creates the index and store the index/document(based on tokenizer) in memory or disk based
configuration.
Once user types say 3 characters, it search all index under ES(Can be configured to index even ngram) , Rank them based on weightage and return to user
But after reading couple of resources on google like Trie based search
Looks some of the scalable product also uses Trie data stucture to do the prefix based search.
My question Is Can trie based approach be good alternative to ES or ES internally uses Trie or am i missing completely here ?
ES autocompletion can be achieved in two ways:
using prefix queries
either using (edge-)ngrams
or using the completion suggester
The first option is the poor man's completion feature. I'm mentioning it because it can be useful in certain situation but you should avoid it if you have a substantial amount of documents.
The second option uses the conventional ES indexing features, i.e. it will tokenize the text, all (edge-)ngrams will be indexed and then you can search for any prefix/infix/suffix that have been indexed.
The third option uses a different approach and is optimized for speed. Basically, when indexing a field of type completion, ES will create a "finite state transducer" and store it in memory for ultra fast access.
A finite state transducer is close to a trie in terms of implementation. You can check this excellent article which shows how trie compares to finite state transducer
UPDATE (June 25th, 2019):
ES 7.2 introduced a new data type called search_as_you_type that allows this kind of behavior natively. Read more at: https://www.elastic.co/guide/en/elasticsearch/reference/7.2/search-as-you-type.html

How couchdb 1.6 inherently take advantage of Map reduce when it is Single Server Database

I am new to couch db, while going through documentation of Couch DB1.6, i came to know that it is single server DB, so I was wondering how map reduce inherently take advantage of it.
If i need to scale this DB then do I need to put more RAID hardware, of will it work on commodity hardware like HDFS?
I came to know that couch db 2.0 planning to bring clustering feature, but could not get proper documentation on this.
Can you please help me understanding how exactly internally file get stored and accessed.
Really appreciate your help.
I think your question is something like this:
"MapReduce is … a parallel, distributed algorithm on a cluster." [shortened from MapReduce article on Wikipedia]
But CouchDB 1.x is not a clustered database.
So what does CouchDB mean by using the term "map reduce"?
This is a reasonable question.
The historical use of "MapReduce" as described by Google in this paper using that stylized term, and implemented in Hadoop also using that same styling implies parallel processing over a dataset that may be too large for a single machine to handle.
But that's not how CouchDB 1.x works. View index "map" and "reduce" processing happens not just on single machine, but even on a single thread! As dch (a longtime contributor to the core CouchDB project) explains in his answer to https://stackoverflow.com/a/12725497/179583:
The issue is that eventually, something has to operate in serial to build the B~tree in such a way that range queries across the indexed view are efficient. … It does seem totally wacko the first time you realise that the highly parallelisable map-reduce algorithm is being operated sequentially, wat!
So: what benefit does map/reduce bring to single-server CouchDB? Why were CouchDB 1.x view indexes built around it?
The benefit is that the two functions that a developer can provide for each index "map", and optionally "reduce", form very simple building blocks that are easy to reason about, at least after your indexes are designed.
What I mean is this:
With e.g. the SQL query language, you focus on what data you need — not on how much work it takes to find it. So you might have unexpected performance problems, that may or may not be solved by figuring out the right columns to add indexes on, etc.
With CouchDB, the so-called NoSQL approach is taken to an extreme. You have to think explicitly about how you each document or set of documents "should be" found. You say, I want to be able to find all the "employee" documents whose "supervisor" field matches a certain identifier. So now you have to write a map function:
function (doc) {
if (doc.isEmployeeRecord) emit(doc.supervisor.identifier);
}
And then you have to query it like:
GET http://couchdb.local:5984/personnel/_design/my_indexes/_view/by_supervisor?key=SOME_UUID
In SQL you might simply say something like:
SELECT * FROM personnel WHERE supervisor == ?
So what's the advantage to the CouchDB way? Well, in the SQL case this query could be slow if you don't have an index on the supervisor column. In the CouchDB case, you can't really make an unoptimized query by accident — you always have to figure out a custom view first!
(The "reduce" function that you provide to a CouchDB view is usually used for aggregate functions purposes, like counting or averaging across multiple documents.)
If you think this is a dubious advantage, you are not alone. Personally I found designing my own indexes via a custom "map function" and sometimes a "reduce function" to be an interesting challenge, and it did pay off in knowing the scaling costs at least of queries (not so much for replications…).
So don't think of CouchDB view so much as being "MapReduce" (in the stylized sense) but just as providing efficiently-accessible storage for the results of running [].map(…).reduce(…) across a set of data. Because the "map" function is applied to only a document at once, the total set of data can be bigger than fits in memory at once. Because the "reduce" function is limited in its size, it further encourages efficient processing of a large set of data into an efficiently-accessed index.
If you want to learn a bit more about how the indexes generated in CouchDB are stored, you might find these articles interesting:
The Power of B-trees
CouchDB's File Format is brilliantly simple and speed-efficient (at the cost of disk space).
Technical Details, View Indexes
You may have noticed, and I am sorry, that I do not actually have a clear/solid answer of what the actual advantage and reasons were! I did not design or implement CouchDB, was only an avid user for many years.
Maybe the bigger advantage is that, in systems like Couchbase and CouchDB 2.x, the "parallel friendliness" of the map/reduce idea may come into play more. So then if you have designed an app to work in CouchDB 1.x it may then scale in the newer version without further intervention on your part.

"Fan-out" indexing strategy

I'm planning to use Elasticsearch for a social network kind of platform where users can post "updates", be friends with other users and follow their friends' feed. The basic and probably most frequent query will be "get posts shared with me by friends I follow". This query could be augmented by additional constraints (like tags or geosearch).
I've learned that social networks usually take a fan-out-on-write approach to disseminate "updates" to followers so queries are more localized. So I can see 2 potential indexing strategies:
Store all posts in a single index and search for posts (1) shared with the requester and (2) whose author is among the list of users followed by the requester (the "naive" approach).
Create one index per user, inject posts that are created by followed users and directly search among this index (the "fan-out" approach).
The second option is obviously much more efficient from a search perspective, although it presents sync challenges (like the need to delete posts when I stop following a friend, for example). But the thing I would be most concerned with is the multiplication of indices; in a (successful) social network, we can expect at least tens of thousands of users...
So my questions here are:
how does ES cope with a very high number of indices? can it incur performance issues?
any thoughts about a better indexing strategy for my particular use-case?
Thanks
Each elasticsearch index shard is a separate Lucene index, which means several open file descriptors and memory overhead. Generally, even after reducing number of shards per index from default 5, the resource consumption in index-per-user scenario may be too large.
It is hard to give any concrete numbers, but my guess is that if you stick to two shards per index, you would be able to handle no more than 3000 users per m3.medium machine, which is prohibitive in my opinion.
However, you don't necessarily need to have dedicated index for every user. You can use filtered aliases to use one index for multiple users. From application point of view, it would look like a per-user scenario, without incurring overhead mentioned above. See this video for details.
With that being said, I don't think elasticsearch is particularly good fit for fan-out-on-write strategy. It is, however, very good solution to employ in fan-out-on-read scenario (something similar to what you've outlined as (1)):
The biggest advantage of using elasticsearch is that you are able to perform relevance scoring, typically based on some temporal features, like browsing context. Using elasticsearch to just retrieve documents sorted by timestamp means that you don't utilize its potential. Meanwhile, solutions like Redis will give you far superior read performance for such task.
Fan-out-on-write scenario means a lot of writes on each update (especially, if you have users with many followers). Elasticsearch is not a database and is not optimized for such usage-pattern. It is, however, prepared for frequent reads.
Fan-out-on-write also means that you are producing a lot of 'extra' data by duplicating info about posts. To keep this data in RAM, you need to store only metadata, like id of document in separate document storage and tags. Again, there are other formats than JSON to store and search this kind of structured data effectively.
Choosing between the two scenarios is a question about your requirements, like average number of followers, number of 'hubs' that nearly everybody follows, whether the feed is naturally ordered (e.g. by time) etc. I think that deciding whether to use elasticsearch needs to be a consequence of this analysis.

elasticsearch - tips on how to organize my data

I'm trying elasticsearch by getting some data from facebook and twitter to.
The question is: how can I organize this data in index?
/objects/posts
/objects/twits
or
/posts/post
/twits/twit
I'm trying queries such as, get posts by author_id = X
You need to think about the long term when deciding how to structure your data in Elasticsearch. How much data are you planning on capturing? Are search requests going to look into both Facebook and Twitter data? Amount of requests, types of queries and so on.
Personally I would start of with the first approach, localhost:9200/social/twitter,facebook/ as this will reduce the need for another index when it isn't necessarily required. You can search across both of the types easily which has less overhead than searching across two indexes. There is quite an interesting article here about how to grow with intelligence.
Elasticsearch has many configurations, essentially its finding a balance which fits your data.
First one is the good approach. Because creating two indices will create two lucence instances which will effect the response time.

how to cluster evolving data streams

I want to incrementally cluster text documents reading them as data streams but there seems to be a problem. Most of the term weighting options are based on vector space model using TF-IDF as the weight of a feature. However, in our case IDF of an existing attribute changes with every new data point and hence previous clustering does not remain valid anymore and hence any popular algorithms like CluStream, CURE, BIRCH cannot be applied which assumes fixed dimensional static data.
Can anyone redirect me to any existing research related to this or give suggestions? Thanks !
Have you looked at
TF-ICF: A New Term Weighting Scheme for Clustering Dynamic Data Streams
Here's an idea off the top of my head:
What's your input data like? I'm guessing it's at least similarly themed, so you could start with a base phrase dictionary and use that for idf. Apache Lucene is a great indexing engine. Since you have a base dictionary, you can run kmeans or whatever you'd like. As documents come in, you'll have to rebuild the dictionary at some frequency (which can be off-loaded to another thread/machine/etc) and then re-cluster.
With the data indexed in a high-performance, flexible engine like Lucene, you could run queries even as new documents are being indexed. I bet if you do some research on different clustering algorithms you'll find some good ideas.
Some interesting paper/links:
http://en.wikipedia.org/wiki/Document_classification
http://www.scholarpedia.org/article/Text_categorization
http://en.wikipedia.org/wiki/Naive_Bayes_classifier
Without more information, I can't see why you couldn't re-cluster every once in a while. You might wanna take a look at some of the recommender systems already out there.

Resources