Is there any support provided by Elasticsearch to generate hash id based on document content?
If not, which hashing algorithm should I use for good distribution across shards, efficiency, and rare collisions, given that I expect around a couple of million events per day to be indexed into my Elasticsearch cluster?
In my research I have found three candidates for my problem: murmur3, md5, and sha256 (would 128 bits suffice?).
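As far as I know, Elasticsearch does not derive `_id` from document content itself (the murmur3 support that exists, via the mapper-murmur3 plugin, computes field-level hashes for cardinality aggregations, not document IDs). A common pattern is to compute the ID client-side before indexing. A minimal sketch in Python, using SHA-256 truncated to 128 bits (the function name and event fields are illustrative):

```python
import hashlib
import json

def doc_id(doc: dict) -> str:
    """Derive a deterministic ID from document content.
    Canonicalize first so key order doesn't change the hash."""
    canonical = json.dumps(doc, sort_keys=True, separators=(",", ":"))
    # 128 bits (32 hex chars) of SHA-256 is ample collision resistance
    # at a few million events per day.
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:32]

event = {"user": "alice", "action": "login", "ts": "2019-06-25T10:00:00Z"}
print(doc_id(event))  # same content always yields the same ID
```

Identical events then map to the same `_id`, so re-indexing a duplicate becomes an overwrite instead of a new document.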
Based on the results of some tests here, it is obvious that the worst case for Lucene IDs is UUID version 4, because of its random nature.
I wanted to know if anyone has any experience using UUID v5 for Lucene?
As I have studied, UUID version 5 is not generated randomly; it is derived from two pieces of input information (a namespace and a name). Can those inputs be chosen in a way that makes UUID v5 suitable, performance-wise, for use in Lucene?
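For reference, a minimal sketch of deterministic v5 generation (the names are illustrative). One caveat worth noting: although v5 is deterministic, its bytes come from a SHA-1 hash, so consecutive names do not share byte prefixes the way sequential/Flake-style IDs do, and it is that common-prefix locality that is usually credited for fast Lucene ID lookups:

```python
import uuid

# UUID v5 is a SHA-1 hash of (namespace, name): deterministic, not random.
# The namespace here is the standard DNS namespace; any fixed UUID works.
a = uuid.uuid5(uuid.NAMESPACE_DNS, "events/2019-06-25/000001")
b = uuid.uuid5(uuid.NAMESPACE_DNS, "events/2019-06-25/000001")
c = uuid.uuid5(uuid.NAMESPACE_DNS, "events/2019-06-25/000002")

print(a == b)  # True: same inputs, same UUID
print(a == c)  # False: a one-character change flips the whole hash
```

So manipulating the name input gives you determinism and idempotent indexing, but not sequential ordering of the resulting IDs.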
My high-level understanding of how autocomplete/search for text/items works in a scalable product like Amazon's e-commerce site or Google was:
Elasticsearch (ES) based approach
Documents are stored in a DB. Once persisted, they are handed to Elasticsearch, which creates the index and stores the index/documents (based on the tokenizer) in memory or on disk, depending on configuration.
Once the user types, say, 3 characters, it searches all indices in ES (which can be configured to index even n-grams), ranks them based on weight, and returns them to the user.
But after reading a couple of resources on Google, like trie-based search,
it looks like some scalable products also use a trie data structure to do prefix-based search.
My question is: can a trie-based approach be a good alternative to ES, or does ES internally use a trie, or am I missing something completely?
ES autocompletion can be achieved in three ways:
using prefix queries
using (edge-)ngrams
or using the completion suggester
The first option is the poor man's completion feature. I'm mentioning it because it can be useful in certain situations, but you should avoid it if you have a substantial number of documents.
The second option uses the conventional ES indexing features, i.e. it tokenizes the text, indexes all (edge-)ngrams, and then you can search for any prefix/infix/suffix that has been indexed.
The third option uses a different approach and is optimized for speed. Basically, when indexing a field of type completion, ES will create a "finite state transducer" and store it in memory for ultra fast access.
A finite state transducer is close to a trie in terms of implementation. You can check this excellent article, which shows how a trie compares to a finite state transducer.
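For illustration, here is a sketch of what setting up the third option might look like (the index layout and names such as "suggest" and "product-suggest" are assumptions, not from the question), written as Python dicts mirroring the JSON request bodies:

```python
# Hypothetical mapping: a "completion" field backs the suggester's FST.
mapping = {
    "mappings": {
        "properties": {
            "suggest": {"type": "completion"},
            "title": {"type": "text"},
        }
    }
}

# Hypothetical suggest query for the prefix the user has typed so far.
query = {
    "suggest": {
        "product-suggest": {
            "prefix": "ipho",
            "completion": {"field": "suggest"},
        }
    }
}

print(mapping, query)
```

The dicts would be sent as JSON bodies when creating the index and querying it, respectively.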
UPDATE (June 25th, 2019):
ES 7.2 introduced a new data type called search_as_you_type that allows this kind of behavior natively. Read more at: https://www.elastic.co/guide/en/elasticsearch/reference/7.2/search-as-you-type.html
I'm creating a search-by-image system (similar to Google's reverse image search) for a cataloging system used internally at my company. We've already been using Elasticsearch with success for our regular search functionality, so I'm planning to hash all our images, create a separate index for them, and use it for searching. There are many items in the system, each item may have multiple images associated with it, and an item should be findable by reverse-image-searching any of its related images.
There are two possible schemas we've thought of:
Making a document for each image, containing only the hash of the image and the ID of the item it relates to. This would result in roughly 7M documents, but they would be small, since each contains only a single hash and an ID.
Making a document for each item and storing the hashes of all its associated images in an array on the document. This would result in roughly 100k documents, but each document would be fairly large; some items have hundreds of images associated with them.
Which of these schemas would be more performant?
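To make the two schemas concrete, here is what one document would look like under each option (field names and hash values are purely illustrative):

```python
# Option 1: one document per image (~7M small docs).
image_doc = {
    "image_hash": "a9f0e61a137d86aa9db53465e0801612",
    "item_id": "item-42",
}

# Option 2: one document per item (~100k larger docs); a single array
# field holds every image hash for the item.
item_doc = {
    "item_id": "item-42",
    "image_hashes": [
        "a9f0e61a137d86aa9db53465e0801612",
        "b3c2d1e0ff00aa11c4d5e6f708192a3b",
    ],
}

print(image_doc, item_doc)
```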
Having attended a recent Under the Hood talk by Alexander Reelsen, he would probably say "it depends" and "benchmark it".
As @Science_Fiction already hinted:
Are the images frequently updated? That could be a negative cost factor.
OTOH, the overhead of 7M documents shouldn't be neglected, whereas in your second scenario they would just be not_analyzed terms in a field.
If 1. is a low factor, I would probably start with your second approach.
I have many documents, over ten thousand (maybe more). I'd like to extract some keywords from each document, say 5 keywords per document, using Hadoop. Each document may talk about a unique topic. My current approach is to use Latent Dirichlet Allocation (LDA) as implemented in Mahout. However, as each document talks about a different topic, the number of extracted topics should equal the number of documents, which is very large. As LDA becomes very inefficient when the number of topics grows large, my approach is to randomly group the documents into small groups of 100 documents each and then use Mahout LDA to extract 100 topics from each group. This approach works but may not be very efficient, because each time I run Hadoop on a small set of documents. Does anyone have a better (more efficient) idea for this?
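The random-grouping step described above can be sketched as follows (the group size and seed are illustrative):

```python
import random

def random_groups(doc_ids, group_size=100, seed=0):
    """Shuffle document IDs and split them into fixed-size groups,
    one LDA run (e.g. 100 topics) per group."""
    ids = list(doc_ids)
    random.Random(seed).shuffle(ids)  # fixed seed keeps runs reproducible
    return [ids[i:i + group_size] for i in range(0, len(ids), group_size)]

groups = random_groups(range(1050))
print(len(groups), len(groups[0]))  # 11 groups; the first 10 hold 100 docs each
```

Each resulting group would then be submitted as its own (small) Hadoop/Mahout job, which is exactly the per-job overhead the question is worried about.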
I know how to develop a simple inverted index on a single machine. In short, it is a standard hash table kept in memory, where:
- key - a word
- value - a list of word locations
As an example, the code is here: http://rosettacode.org/wiki/Inverted_Index#Java
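For reference, the same idea as the linked Java example can be sketched in a few lines of Python (tokenization here is a naive lowercase split):

```python
from collections import defaultdict

def build_index(docs):
    """docs: {doc_id: text}. Returns {word: [(doc_id, position), ...]}."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.lower().split()):
            index[word].append((doc_id, pos))
    return index

idx = build_index({1: "it is what it is", 2: "what is it"})
print(idx["what"])  # [(1, 2), (2, 0)]
```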
Question:
Now I'm trying to make it distributed among n nodes and in turn:
Make this index horizontally scalable
Apply automatic sharding to this index.
I'm interested especially in automatic sharding. Any ideas or links are welcome!
Thanks.
Sharding by itself is quite a complex task that is not completely solved in modern DBs. Typical problems in distributed DBs include the CAP theorem and some other low-level, quite challenging tasks, like rebalancing your cluster's data after adding a new blank node or after a naturally occurring imbalance in the data.
The best data distribution implemented in a DB that I've seen was in Cassandra. However, full-text search is not yet implemented in Cassandra, so you might consider building your distributed index on top of it.
Some other already-implemented options are Elasticsearch and SolrCloud. One important detail missing from the example given is word stemming. With word stemming you basically search for any form of a word, like "sing", "sings", "singer". Lucene and the two previous solutions have it implemented for the majority of languages.
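One common starting point for automatically sharding a term-partitioned index is stable hashing of the term: every node computes the same shard number for a given word, so both indexing and lookups route consistently without coordination. A minimal sketch (the shard count is illustrative; note that plain modulo hashing reshuffles most terms when the shard count changes, which is one reason real systems prefer consistent hashing for the rebalancing problem mentioned above):

```python
import hashlib

NUM_SHARDS = 4  # illustrative

def shard_for(word: str, num_shards: int = NUM_SHARDS) -> int:
    """Route each term to a shard by hashing the term itself.
    A stable hash (not Python's built-in hash(), which is salted
    per process) keeps routing consistent across nodes."""
    h = int.from_bytes(hashlib.md5(word.encode("utf-8")).digest()[:8], "big")
    return h % num_shards

# Every node agrees on which shard holds the postings for "lucene":
print(shard_for("lucene"))
```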
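To illustrate what stemming buys you, here is a deliberately naive suffix-stripping sketch; real systems, including Lucene, use proper stemmers such as Porter/Snowball rather than anything this crude:

```python
def naive_stem(word: str) -> str:
    """Toy stemmer: strip one common English suffix, keeping a stem
    of at least 3 characters. For illustration only."""
    for suffix in ("ers", "er", "ing", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

print([naive_stem(w) for w in ("sing", "sings", "singer")])
# ['sing', 'sing', 'sing'] -> all three forms hit the same index entry
```

Indexing stems instead of surface forms is what lets a query for "sing" match documents containing "sings" or "singer".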