Elasticsearch - many small documents vs fewer large documents? - performance

I'm creating a search by image system(similar to Google's reverse image search) for a cataloging system used internally at my co.. We've already been using Elasticsearch with success for our regular search functionality, so I'm planning on hashing all our images, creating a separate index for them, and using it for searching. There are many items in the system and each item may have multiple images associated with it, and the item should be able to be find-able by reverse image searching any of its related images.
There are 2 possible schema we've thought of:
Making a document for each image, containing only the hash of the image and the item id it is related to. This would result in about ~7m documents, but they would be small since they only contain a single hash and an ID.
Making a document for each item, and storing the hashes of all the images associated with it in an array on the document. This would result in around ~100k documents, but each document would be fairly large, some items have hundreds of images associated with them.
Which of these schema would be more performant?

Having attended a recent Under the Hood talk by Alexander Reelsen, he would probably say "it depends" and "benchmark it".
As #Science_Fiction already hinted:
are the images frequently updated? That could come at a negative cost factor
OTOH, the overhead for 7m documents maybe shouldn't be neglected whereas in your second scenario they would just be not_analyzed terms in a field.
If 1. is a low factor I would probably start with your second approach first.

Related

Elastic Search: One index with custom type to differentiate document schemas VS multiple index, one per document type?

I am not experienced in ES (my background is more of relational databases) and I am trying to achieve the goal of having a search bar in my web application to search the entire content of it (or the content I will be willing to index in ES).
The architecture implemented is Jamstack with a gatsby application fetching content (sometimes at build time, sometimes at runtime) from a strapi application (headless cms). In the middle, I developed a microservice to write the documents created in the strapi application to the ES database. At this moment, there is only one index for all the documents, regardless the type.
My problem is, as the application grows and different types of documents are created (sometimes very different from one another, as example I can have an article (news) and a hospital) I am having hard time to correctly query the database as I have to define a lot of specific conditions when making the query (to cover all types of documents).
My solution to this is to keep only one index and break down the query in several ones and when the user hits the search button those queries are run and the results will be joined together before being presented OR break down the only index into several ones, one per document which leads me to another doubt, is it possible to query multiple indexes at once and define specific index fields in the query?
Which is the best approach? I hope I could make my self clear in this.
Thanks in advance.
According to the example you provided, where one type of document can be of type news and another type is hospital, it makes sense to create multiple indices(but you also need to tell, how many such different types you have). there are pros and cons with both the approach and once you know them, you can choose one based on your use-case.
Before I start listing out the pros/cons, the answer to your other question is that you can query multiple indices in a single search query using multi-search API.
Pros of having a single index
less management overhead of multiple indices(this is why I asked how many such indices you may have in your application).
More performant search queries as data are present in a single place.
Cons
You are indexing different types of documents, so you will have to include a complex filter to get the data that you need.
Relevance will not be good, as you have a mix of documents which impacts the IDF of similarity algo(BM25), and impacts the relevance.
Pros of having a different index
It's better to separate the data based on their properties, for better relevant results.
Your search queries will not be complex.
If you have really huge data, it makes sense to break the data, to have the optimal shard size and better performance.
cons
More management overhead.
if you need to search in all indices, you have to implement multi-search and wait for all indices search result, which might be costly.

Is ElasticSearch suited for retrieving a very large number of search records?

So, our production environment has an ES cluster that contains all our products inventory (ID and attributes) where each product is mapped to one document. Internally, one of our use cases is to create a logical grouping of these products based on text matching on a bunch of these product attributes.
Often times, it's possible that a product set could contain a very large number of products, say, 5 million. That is, the query to create a product set could match about 5 million documents.
Now, my question is, is ES capable of handling such large retrievals of documents, or is it recommended to use a backing store like Cassandra or HBase to fetch a huge number of documents? Note that I'm not concerned about realtime use cases - I'm okay with having an asynchronous execution of the product set creation, so latency isn't a major concern for me. From what I understand, ES provides the Scroll API to retrieve a large number of documents, but, I'm approaching the problem more from a school of thought perspective.
Is it fine to use ES to fetch very large documents, in the range of 5-10 million? Or should we use a parallel DB with big data capabilities to fetch the data and use ES only as the search store?
TL;DR no, it is not meant to retrieve large sets of documents, although you could work your way around with different approaches
notice that Scroll API might not be suitable for purposes other than re-indexing:
Scrolling is not intended for real time user requests, but rather for processing large amounts of data, e.g. in order to reindex the contents of one data stream or index into a new data stream or index with a different configuration.
Another way to achieve it would be the Search after parameter
search_after is not a solution to jump freely to a random page but rather to scroll many queries in parallel. It is very similar to the scroll API but unlike it, the search_after parameter is stateless, it is always resolved against the latest version of the searcher. For this reason the sort order may change during a walk depending on the updates and deletes of your index.
Rethink whether your use case really needs to exhaustively paginate over large sets of documents. Since ElasticSearch's strength doesn't lie on large result sets.
consult documentation:
Paginate search results
index.max_result_window
Track total hits
Scroll API
Search after parameter

Best way to store votes in elasticsearch for a reddit like system

I am building a site similar to reddit using elasticsearch and trying to decide where is the best place to store the up/down votes. I can think of couple options.
Store as part of the document.
In this case, any vote will trigger an update on the document. According to elasticsearch document, this is essentially a replace of the whole document. That seems to be a very expensive operation.
Store in another database.
Store votes in other database like SQL/MongoDB and update elasticsearch periodically. In this case, we have to tolerate some delay for the new votes to affect search result which is not so ideal and will also increase complexity and maintenance cost.
Store in another index in elasticsearch
This can separate the concern by index - one mostly RO, one RW. Is there an efficient way to merge the two indices so that I can order by votes at query time?
Any suggestions on those options or other better way to handle this?
There is a forth option - store votes in a separate document with a different type but in the same index as the original document. The votes type can be made a child of the article type. This setup will enable you to perform queries against articles and votes at the same time using has_child filters and queries. It will also require reindexing of only a small votes document every time a vote occurs instead of the large article document. On the negative side, the has_child and has_parent queries require loading of the parent/child map into memory, so this approach has a non-trivial memory footprint comparing to all other options that you have described.

MongoDB text index search slow for common words in large table

I am hosting a mongodb database for a service that supports full text searching on a collection with 6.8 million records.
Its text index includes ten fields with varying weights.
Most searches take less than a second. Some searches take two to three seconds. However, some searches take 15 - 60 seconds! The 15-60 second search cases are unacceptable for my application. I need to find a way to speed those up.
Searching takes 15-60 seconds when words that are very common in the index are used in the search query.
I seems that the text search feature does not support lazy parameters. My first thought was to cache a list of the 50 most common words in my text index and then ask mongodb to evaluate those last (lazy) and on top of the filtered results returned by the less common parameters. Hopefully people are still with me. For example, say I have a query "products chocolate", where products is common and chocolate is uncommon. I would like to be able to ask mongodb to evaluate "chocolate" first, and then filter those results with the "products" term. Does anyone know of a way to achieve this?
I can achieve the above scenario by omitting the most common words (i.e. "products") from the db query and then reapplying the common term filter on the application side after it has received records found by db. It is preferable for all query logic to happen on the database, but am open to application side processing for a speed payout.
There are still some holes in this design. If a user only searches common terms, I have no choice but to hit the database with all the terms. From preliminary reading, I gather that it is not recommended (or not supported) to have multiple text indexes (with different names) on the same collection. My plan is to create two identical tables, each with my 6.8M records, with different indexes - one for common words and one for uncommon words. This feels kludgy and clunky, but am willing to do this for a speed increase.
Does anyone have any insight and/or advice on how to speed up this system. I'd like as much processing to happen on the database as possible to keep it fast. I'm sure my little 6.8M record table is not the largest that mongodb has seen. Thanks!
Well I worked around these performance issues by allowing MongoDB full text search to search in OR based format. I'm prioritizing my results by fine tuning the weights on my indexed fields and just ordering by rank. I do get more results than desired, but that's not a huge problem because my weighted results that appear at the top will most likely be consumed before my user gets to less relevant results at the bottom.
If anyone is struggling with MongoDB text search performance using AND searching only, just switch back to OR and control your results using weights. It performs leaps better.
hth
This is the exact same issue as $all versus $in. $all only uses the index for the first keyword in the array. I believe your seeing the same issue here, reason why the OR a.k.a. IN works for you.

Best approaches to reduce the number of searches between the filenet object stores to find a document based on the time of the document creation?

For example, there are 5 object stores. I am thinking of inserting documents into them, but not in sequential order. Initially it might be sequential, but if i could insert by using some ranking method it would be easier to know which object store to search to find the document. The goal is to reduce the number of object store searches. This can only be achieved if the insertion uses some intelligent algorithm.
One method i found useful is using the current year MOD N (number of object stores) to determine where a document goes. Could we have some better approaches to this?
If you want fast access there are a couple of criteria:
The hash function has to be reproducible based on the data which is queried. This means, a lot depends on the queries you expect.
You usually want to distribute your object as much evenly accross stores as possible. If you want to go parallel, you want to access each document for a given query from different stores, so they will not block each other. Hence your hashing function should spread as much as possible to different stores for similar documents. If you expect documents related to the same query to be from the same year, do not use the year directly.
This assumes, you want to be able to have fast queries which can be paralised. If you instead have a system in which you first have to open a potentially expensive connection to the store, then most documents related to the same query should go in the same store and you should not take my advice above.
Your criteria for "what goes in a FileNet object store?" is basically "what documents logically belong together?".

Resources