Can UUID v5 be Lucene Friendly - performance

Based on the results of some tests here, it is obvious that the worst case for Lucene UUIDs is UUID version 4, because of its random nature.
I wanted to know if anyone has any experience using UUID v5 with Lucene.
From what I have studied, UUID version 5 is not generated randomly; it is derived from two pieces of input information. Can those input pieces be manipulated in a way that makes UUID v5 suitable, performance-wise, for use in Lucene?
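For context, here is a minimal sketch of how a v5 UUID is derived per RFC 4122: SHA-1 over the namespace UUID's bytes concatenated with the name, with the version and variant bits forced afterwards. The class name is illustrative (the JDK itself only ships a v3/MD5 helper, UUID.nameUUIDFromBytes). One caveat: because the inputs pass through SHA-1, the output bits are effectively uniformly distributed no matter how the two inputs are structured, which matters when reasoning about index locality.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.UUID;

// Illustrative name-based UUID v5 generation (RFC 4122, SHA-1).
public class UuidV5 {
    public static UUID v5(UUID namespace, String name) {
        try {
            MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
            sha1.update(toBytes(namespace));                    // namespace first
            sha1.update(name.getBytes(StandardCharsets.UTF_8)); // then the name
            byte[] h = sha1.digest();                           // 20 bytes; 16 are used
            h[6] = (byte) ((h[6] & 0x0F) | 0x50);               // version 5
            h[8] = (byte) ((h[8] & 0x3F) | 0x80);               // RFC 4122 variant
            long msb = 0, lsb = 0;
            for (int i = 0; i < 8; i++)  msb = (msb << 8) | (h[i] & 0xFF);
            for (int i = 8; i < 16; i++) lsb = (lsb << 8) | (h[i] & 0xFF);
            return new UUID(msb, lsb);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    private static byte[] toBytes(UUID u) {
        byte[] b = new byte[16];
        long msb = u.getMostSignificantBits(), lsb = u.getLeastSignificantBits();
        for (int i = 0; i < 8; i++) {
            b[i] = (byte) (msb >>> (56 - 8 * i));
            b[8 + i] = (byte) (lsb >>> (56 - 8 * i));
        }
        return b;
    }
}
```

For example, with the RFC 4122 DNS namespace, UuidV5.v5(UUID.fromString("6ba7b810-9dad-11d1-80b4-00c04fd430c8"), "example.com") always yields the same UUID for the same inputs.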

Related

Generate Document id in Elasticsearch based on document

Is there any support provided by Elasticsearch to generate a hash ID based on document content?
If not, which hashing algorithm should I use for good distribution across shards and efficiency, with rare collisions, given that I expect around a couple of million events per day to be indexed into my Elasticsearch cluster?
In my research, I have found three candidates for my problem: murmur3, MD5, and SHA-256 (would 128 bits suffice?).
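Elasticsearch itself does not derive _id from document content; the usual approach is to compute the hash client-side and pass it as the document id. For illustration, a minimal sketch of that step (class and method names are hypothetical; MD5 is used here only because it is one of the candidates above):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Sketch: derive a deterministic _id from the document body so that
// re-indexing the same content overwrites rather than duplicates.
public class ContentId {
    public static String idFor(String documentJson) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(documentJson.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) hex.append(String.format("%02x", b & 0xFF));
            return hex.toString(); // 128-bit hex id, evenly distributed across shards
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }
}
```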

Elasticsearch or Trie for search/autocomplete?

My high-level understanding of how autocomplete/search for text/items works in a scalable product like Amazon's e-commerce site or Google was:
Elasticsearch (ES) based approach
Documents are stored in a DB. Once persisted, they are handed to Elasticsearch, which creates the index and stores the index/documents (based on the tokenizer) in memory or on disk, depending on configuration.
Once the user types, say, 3 characters, ES searches all indexed terms (it can be configured to index n-grams as well), ranks them by weight, and returns the results to the user.
But after reading a couple of resources on Google about trie-based search, it looks like some scalable products also use a trie data structure for prefix-based search.
My question is: can a trie-based approach be a good alternative to ES, does ES internally use a trie, or am I missing something completely here?
ES autocompletion can be achieved in three ways:
- using prefix queries
- using (edge-)ngrams
- using the completion suggester
The first option is the poor man's completion feature. I'm mentioning it because it can be useful in certain situations, but you should avoid it if you have a substantial amount of documents.
The second option uses the conventional ES indexing features, i.e. it will tokenize the text, all (edge-)ngrams will be indexed, and then you can search for any prefix/infix/suffix that has been indexed, as sketched below.
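To make the second option concrete, here is a standalone sketch of what edge-ngram emission amounts to; in ES itself you would configure an edge_ngram tokenizer or token filter rather than writing this by hand:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: every prefix of a token between minGram and maxGram characters
// becomes its own indexable term.
public class EdgeNgrams {
    public static List<String> edgeNgrams(String token, int minGram, int maxGram) {
        List<String> grams = new ArrayList<>();
        for (int len = minGram; len <= Math.min(maxGram, token.length()); len++) {
            grams.add(token.substring(0, len));
        }
        return grams;
    }

    public static void main(String[] args) {
        System.out.println(edgeNgrams("singer", 2, 10)); // [si, sin, sing, singe, singer]
    }
}
```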
The third option uses a different approach and is optimized for speed. Basically, when indexing a field of type completion, ES will create a "finite state transducer" and store it in memory for ultra-fast access.
A finite state transducer is close to a trie in terms of implementation. You can check this excellent article, which shows how tries compare to finite state transducers.
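For intuition, here is a minimal trie sketch of the prefix lookup both structures support; an FST additionally shares suffixes and can carry output weights, which is what makes it so compact:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Minimal trie for prefix-based completion (illustration only).
public class Trie {
    private static class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean isWord;
    }

    private final Node root = new Node();

    public void insert(String word) {
        Node n = root;
        for (char c : word.toCharArray()) {
            n = n.children.computeIfAbsent(c, k -> new Node());
        }
        n.isWord = true;
    }

    // Walk down to the prefix node, then collect every word below it.
    public List<String> complete(String prefix) {
        Node n = root;
        for (char c : prefix.toCharArray()) {
            n = n.children.get(c);
            if (n == null) return List.of();
        }
        List<String> out = new ArrayList<>();
        collect(n, new StringBuilder(prefix), out);
        return out;
    }

    private void collect(Node n, StringBuilder path, List<String> out) {
        if (n.isWord) out.add(path.toString());
        for (Map.Entry<Character, Node> e : n.children.entrySet()) {
            path.append(e.getKey());
            collect(e.getValue(), path, out);
            path.deleteCharAt(path.length() - 1);
        }
    }
}
```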
UPDATE (June 25th, 2019):
ES 7.2 introduced a new data type called search_as_you_type that allows this kind of behavior natively. Read more at: https://www.elastic.co/guide/en/elasticsearch/reference/7.2/search-as-you-type.html

What is the difference between PostingsFormat and PointsFormat in Lucene?

In Lucene 6 and 7 there is a format called PointsFormat, which uses a BKD-tree data structure, while PostingsFormat has used an FST data structure for indexing since Lucene 4.
1) I would like to know the difference between these two.
2) What is the advantage of moving from one version to another in Lucene?
3) Which data structure is more efficient for indexing in Lucene?
These formats are technically and logically different. PointsFormat is used for encoding/decoding PointValues (which are basically numeric values, introduced in recent versions of Lucene), while PostingsFormat is the format for encoding/decoding terms, postings, and proximity data.
Also, these format classes are abstract, and there are many concrete implementations in both Lucene 6 and 7.
Answering your remaining questions: moving to a newer version of Lucene is usually a plus for performance, as well as for new functionality.
Both of these structures are efficient; they are simply different because they store different types of data.
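To make the distinction concrete, here is a sketch against Lucene 6/7-era APIs: the text field below is served by the PostingsFormat (terms dictionary and postings), while the IntPoint field is served by the PointsFormat (BKD tree):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.IntPoint;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;

public class PointsVsPostings {
    public static void main(String[] args) throws Exception {
        Directory dir = new RAMDirectory(); // in-memory, fine for a demo
        try (IndexWriter writer = new IndexWriter(dir,
                new IndexWriterConfig(new StandardAnalyzer()))) {
            Document doc = new Document();
            doc.add(new TextField("body", "points versus postings", Field.Store.NO));
            doc.add(new IntPoint("price", 42));
            writer.addDocument(doc);
        }
        try (DirectoryReader reader = DirectoryReader.open(dir)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            // Term lookup -> postings (inverted index, FST term dictionary)
            int textHits = searcher.count(new TermQuery(new Term("body", "postings")));
            // Numeric range -> points (BKD tree)
            int rangeHits = searcher.count(IntPoint.newRangeQuery("price", 0, 100));
            System.out.println(textHits + " " + rangeHits); // 1 1
        }
    }
}
```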

Develop a distributed Full-Text search Index (AKA Inverted index)

I know how to develop a simple inverted index on a single machine. In short, it is a standard hash table kept in memory, where:
- key - a word
- value - a List of word locations
As an example, the code is here: http://rosettacode.org/wiki/Inverted_Index#Java
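For reference, a minimal sketch along the same lines, mapping each word to a list of (docId, position) pairs:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Single-machine inverted index: word -> list of {docId, position} locations.
public class InvertedIndex {
    private final Map<String, List<long[]>> index = new HashMap<>();

    public void add(long docId, String text) {
        String[] words = text.toLowerCase().split("\\W+");
        for (int pos = 0; pos < words.length; pos++) {
            if (words[pos].isEmpty()) continue;
            index.computeIfAbsent(words[pos], w -> new ArrayList<>())
                 .add(new long[] { docId, pos });
        }
    }

    public List<long[]> lookup(String word) {
        return index.getOrDefault(word.toLowerCase(), List.of());
    }
}
```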
Question:
Now I'm trying to make it distributed among n nodes and, in turn:
- Make this index horizontally scalable
- Apply automatic sharding to this index
I'm especially interested in automatic sharding. Any ideas or links are welcome!
Thanks.
Sharding by itself is quite a complex task that is not completely solved in modern DBs. Typical problems in distributed DBs include the CAP theorem and other low-level, quite challenging tasks, like rebalancing your cluster data after adding a new, empty node, or after a naturally occurring imbalance in the data.
The best data distribution implemented in a DB that I've seen was in Cassandra. However, full-text search is not yet implemented in Cassandra, so you might consider building your distributed index on top of it.
Other already-implemented options are Elasticsearch and SolrCloud. One important detail missing from your example is word stemming. With word stemming you can basically search for any form of a word, like "sing", "sings", "singer". Lucene and the two solutions above have it implemented for the majority of languages.
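If you do build the sharding layer yourself, consistent hashing is the standard building block for routing terms (or documents) to nodes while keeping rebalancing cheap; it is essentially the data-distribution scheme Cassandra popularized. A minimal sketch (class names and the virtual-node count are illustrative):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.SortedMap;
import java.util.TreeMap;

// Each node appears at several points ("virtual nodes") on a hash ring; a key
// is routed to the first node clockwise from its hash. Adding or removing a
// node only remaps the neighboring slice of keys.
public class ConsistentHashRing {
    private static final int VNODES = 64;
    private final TreeMap<Long, String> ring = new TreeMap<>();

    public void addNode(String node) {
        for (int v = 0; v < VNODES; v++) ring.put(hash(node + "#" + v), node);
    }

    public void removeNode(String node) {
        for (int v = 0; v < VNODES; v++) ring.remove(hash(node + "#" + v));
    }

    // Which node owns this term's posting list?
    public String nodeFor(String term) {
        SortedMap<Long, String> tail = ring.tailMap(hash(term));
        return tail.isEmpty() ? ring.firstEntry().getValue() : tail.get(tail.firstKey());
    }

    private static long hash(String key) {
        try {
            byte[] d = MessageDigest.getInstance("MD5")
                    .digest(key.getBytes(StandardCharsets.UTF_8));
            long h = 0;
            for (int i = 0; i < 8; i++) h = (h << 8) | (d[i] & 0xFF);
            return h;
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }
}
```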

how to cluster evolving data streams

I want to incrementally cluster text documents, reading them as data streams, but there seems to be a problem. Most term-weighting options are based on the vector space model, using TF-IDF as the feature weight. However, in our case the IDF of an existing attribute changes with every new data point, so the previous clustering no longer remains valid, and hence popular algorithms like CluStream, CURE, and BIRCH, which assume fixed-dimensional static data, cannot be applied.
Can anyone point me to any existing research related to this, or give suggestions? Thanks!
Have you looked at TF-ICF: A New Term Weighting Scheme for Clustering Dynamic Data Streams?
Here's an idea off the top of my head:
What's your input data like? I'm guessing it's at least similarly themed, so you could start with a base phrase dictionary and use that for IDF. Apache Lucene is a great indexing engine. Since you have a base dictionary, you can run k-means or whatever you'd like. As documents come in, you'll have to rebuild the dictionary at some frequency (which can be offloaded to another thread/machine/etc.) and then re-cluster.
With the data indexed in a high-performance, flexible engine like Lucene, you could run queries even as new documents are being indexed. I bet if you do some research on different clustering algorithms you'll find some good ideas.
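To pin down what the periodic rebuild buys you: only the corpus-wide document-frequency dictionary changes as the stream grows, so weighting an individual document against a dictionary snapshot stays a cheap lookup. A minimal sketch of that computation (names are illustrative):

```java
import java.util.HashMap;
import java.util.Map;

// weight(t, d) = tf(t, d) * log(N / df(t)), computed against a snapshot
// of the document-frequency dictionary.
public class TfIdf {
    public static Map<String, Double> weigh(Map<String, Integer> termCounts,
                                            Map<String, Integer> docFreq,
                                            int totalDocs) {
        Map<String, Double> weights = new HashMap<>();
        for (Map.Entry<String, Integer> e : termCounts.entrySet()) {
            int df = docFreq.getOrDefault(e.getKey(), 1); // unseen terms: treat df as 1
            weights.put(e.getKey(), e.getValue() * Math.log((double) totalDocs / df));
        }
        return weights;
    }
}
```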
Some interesting paper/links:
http://en.wikipedia.org/wiki/Document_classification
http://www.scholarpedia.org/article/Text_categorization
http://en.wikipedia.org/wiki/Naive_Bayes_classifier
Without more information, I can't see why you couldn't re-cluster every once in a while. You might want to take a look at some of the recommender systems already out there.
