ElasticSearch Search Queries Count - elasticsearch

We have a use case for counting Elasticsearch search queries/operations. Initially we decided to use the /_stats endpoint to aggregate results on a per-index basis. However, we would also like to explore filtering search operations so that we can distinguish them by origin/source. How can we do this efficiently? Any references to documentation or implementations would be highly appreciated.
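A minimal sketch of the per-index aggregation, assuming the JSON response of GET /_stats/search has already been fetched and parsed into a dict. The keys indices, total, search and query_total follow the index stats response format; the sample payload and index names are made up. For distinguishing by origin, one option to evaluate is search stats groups: a search request body can carry a "stats" list of group names, and the index stats API can report them under search.groups when called with the groups parameter. The group names "web" and "mobile" are hypothetical.

```python
from collections import Counter


def query_totals(stats):
    """Return {index_name: query_total} from a parsed /_stats/search response."""
    return {
        name: data["total"]["search"]["query_total"]
        for name, data in stats.get("indices", {}).items()
    }


def group_totals(stats):
    """Sum query_total per stats group across all indices."""
    totals = Counter()
    for data in stats.get("indices", {}).values():
        groups = data["total"]["search"].get("groups", {})
        for group, values in groups.items():
            totals[group] += values["query_total"]
    return dict(totals)


# Made-up sample shaped like a /_stats/search?groups=web,mobile response.
sample_stats = {
    "indices": {
        "news": {"total": {"search": {
            "query_total": 120,
            "groups": {"web": {"query_total": 80}, "mobile": {"query_total": 40}},
        }}},
        "hospitals": {"total": {"search": {
            "query_total": 30,
            "groups": {"web": {"query_total": 20}, "mobile": {"query_total": 10}},
        }}},
    }
}

per_index = query_totals(sample_stats)
per_group = group_totals(sample_stats)
```

The counters are cumulative, so a rate over time would come from polling periodically and taking the difference between two snapshots.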

Related

Elasticsearch: one index with a custom type to differentiate document schemas vs. multiple indices, one per document type?

I am not experienced in ES (my background is more in relational databases), and I am trying to add a search bar to my web application that searches its entire content (or whatever content I choose to index in ES).
The architecture is Jamstack, with a Gatsby application fetching content (sometimes at build time, sometimes at runtime) from a Strapi application (headless CMS). In the middle, I developed a microservice that writes the documents created in the Strapi application to the ES database. At the moment there is only one index for all documents, regardless of type.
My problem is that, as the application grows and different types of documents are created (sometimes very different from one another; for example, I can have a news article and a hospital), I am having a hard time querying the database correctly, because I have to define a lot of specific conditions in the query to cover all document types.
I see two possible solutions. The first is to keep a single index and break the query down into several queries; when the user hits the search button, those queries are run and the results are joined together before being presented. The second is to split the single index into several, one per document type, which leads me to another doubt: is it possible to query multiple indices at once and specify index-specific fields in the query?
Which is the best approach? I hope I have made myself clear.
Thanks in advance.
Given your example, where one document can be of type news and another of type hospital, it makes sense to create multiple indices (though you also need to say how many such types you have). Both approaches have pros and cons, and once you know them you can choose one based on your use case.
Before I list the pros and cons, the answer to your other question is yes: you can search multiple indices at once, either by listing several index names (or a wildcard pattern) in a single search request, or by sending one search per index together through the multi-search API.
Pros of having a single index
Less management overhead than with multiple indices (this is why I asked how many such indices you may have in your application).
Search queries can be more performant, as the data is in a single place.
Cons of having a single index
You are indexing different types of documents, so you will have to include a complex filter to get the data that you need.
Relevance will suffer, because a mix of document types skews the IDF component of the similarity algorithm (BM25).
Pros of having separate indices
Separating the data based on its properties gives more relevant results.
Your search queries will be less complex.
If you have a really large amount of data, it makes sense to split it so that you get an optimal shard size and better performance.
Cons of having separate indices
More management overhead.
If you need to search across all indices, you have to query each of them (for example with multi-search) and wait for the results from every index, which might be costly.
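As a rough illustration of the two ways to search several indices, the sketch below builds the request path for one search across a comma-separated list of indices, and the newline-delimited body that the multi-search (_msearch) endpoint expects, with one header line and one body line per search. The index names (news, hospitals) and field names are assumptions made for the example, and sending the request is left out.

```python
import json


def multi_index_path(indices):
    """Path for a single search request that targets several indices at once."""
    return "/" + ",".join(indices) + "/_search"


def msearch_payload(searches):
    """Build the newline-delimited _msearch body from (index, query) pairs."""
    lines = []
    for index, query in searches:
        lines.append(json.dumps({"index": index}))  # header line: target index
        lines.append(json.dumps({"query": query}))  # body line: the search itself
    return "\n".join(lines) + "\n"  # the body must end with a newline


text = "cardiology"
# Hypothetical per-index fields: each document type is searched on its own fields.
searches = [
    ("news", {"multi_match": {"query": text, "fields": ["title", "body"]}}),
    ("hospitals", {"multi_match": {"query": text, "fields": ["name", "specialties"]}}),
]

path = multi_index_path(["news", "hospitals"])
payload = msearch_payload(searches)
```

With the single-request form, the results come back as one list ranked together; with multi-search, each index returns its own result list, which the application then has to merge.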

Does elastic search use previous search frequencies?

Does Elasticsearch take into account how frequently a document has previously been searched? For example, take document A and document B. Both have similar scores in terms of edit distance and other metrics, but document A is searched very frequently and B is not. Will Elasticsearch score A higher than B? If not, how can this be achieved?
Elasticsearch does not change the score based on previous searches in its default scoring algorithm. In fact, this is really a question about Lucene scoring, since Elasticsearch uses Lucene for all of the actual search logic.
I think you may be looking at this from the wrong viewpoint. Users search with a query, and Elasticsearch recommends documents. You have no way of knowing if the document it recommended was valid or not just based on the search. I think your question should really be, "How can I tune Search relevance in an intelligent way based on user data?".
Now, there are a number of ways you can achieve this, but they require you to gather user data and build the model yourself. So unfortunately, there is no easy way.
However, I would recommend taking a look at https://www.elastic.co/app-search/, which offers a managed solution with a lot of custom relevance tuning and may save you a lot of time, depending on your use case.
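One possible do-it-yourself approach, sketched under stated assumptions: the application records how often each document is chosen in a numeric field (here a hypothetical click_count that you would have to maintain yourself), and the query wraps the text match in a function_score query with field_value_factor so that popular documents get a modest boost. The field names and the choice of log1p with boost_mode "sum" are illustrative, not a recommendation from the answer.

```python
def popularity_boosted_query(text, field="title", popularity_field="click_count"):
    """Build a search body that adds a popularity bonus to the text score."""
    return {
        "query": {
            "function_score": {
                "query": {"match": {field: text}},
                "field_value_factor": {
                    "field": popularity_field,
                    "modifier": "log1p",  # dampen large counts
                    "factor": 1.0,
                    "missing": 0,  # documents without the field get no bonus
                },
                # Add the bonus to the text score instead of multiplying,
                # so documents with zero clicks keep their original score.
                "boost_mode": "sum",
            }
        }
    }


body = popularity_boosted_query("heart surgery")
```

With this shape, documents A and B with similar text scores would be separated by their click counts, while a document with no recorded clicks still ranks on its text score alone.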

How important is it to use separate indices for percolator queries and their documents?

The Elasticsearch documentation on the percolate query recommends using separate indices for the queries and the documents being percolated:
Given the design of percolation, it often makes sense to use separate indices for the percolate queries and documents being percolated, as opposed to a single index as we do in examples. There are a few benefits to this approach:
Because percolate queries contain a different set of fields from the percolated documents, using two separate indices allows for fields to be stored in a denser, more efficient way.
Percolate queries do not scale in the same way as other queries, so percolation performance may benefit from using a different index configuration, like the number of primary shards.
At the bottom of the page here:
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-percolate-query.html
I understand this in theory, but I'd like to know more about how necessary this is for a large index (say, 1 million registered queries).
The tradeoff in my case is that creating a separate index for the document is quite a bit of extra work to maintain, mainly because both indices need to stay "in sync". This is difficult to guarantee without transactions, so I'm wondering if the effort is worth it for the scale I need.
In general I'm interested in any advice regarding the design of the index/mapping so that it can be queried efficiently. Thanks!

Searching for data in DynamoDB or using a search service

I would like to know the pros and cons of trying to search for data (basically full text search on a limited set of fields).
My data is currently in DynamoDB, and I realize that is not well suited to full-text search. Are there ways of doing a full-text search in DynamoDB? What are the pros and cons of doing that?
I can also use a Search cluster (like ElasticSearch). Any reasons that you would not go with a search cluster?
Are there other ways to do a full-text search? Other solutions?
DynamoDB is best suited for key-value insert and retrieval.
It does not support search functionality. If you try to do a scan with some condition, that will be O(n), and it will be very costly because it consumes a lot of read capacity.
Now, coming to the options:
If the use case is not full-text search but only key-value matching, you can try to come up with composite keys, but this has drawbacks such as:
a. You cannot change the schema afterwards, and it may require a huge effort if you need to search on a new field.
b. Designing this kind of key is tricky, since a few keys will always be hot, which may result in a hot partition.
The ideal solution is to use Elasticsearch or Solr indexing. You can have a Lambda function listening to the DynamoDB stream, transforming the data and putting it into Elasticsearch. But this has limitations such as:
a. An Elasticsearch cluster is costly.
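The stream-to-index step could look roughly like the sketch below, which only covers the transformation: it turns a DynamoDB stream event into the action and document lines used by the Elasticsearch bulk API. It assumes the stream is configured to include new images, that the table's key attribute is called id, and that the target index is called documents; all three names are hypothetical, and sending the lines to the cluster is left out.

```python
import json


def from_attr(value):
    """Convert one DynamoDB-typed attribute value into a plain Python value."""
    (kind, raw), = value.items()
    if kind == "S":
        return raw
    if kind == "N":
        try:
            return int(raw)
        except ValueError:
            return float(raw)
    if kind == "BOOL":
        return raw
    if kind == "NULL":
        return None
    if kind == "L":
        return [from_attr(item) for item in raw]
    if kind == "M":
        return {k: from_attr(v) for k, v in raw.items()}
    raise ValueError(f"unsupported attribute type: {kind}")


def to_bulk_lines(event, index="documents", key="id"):
    """Translate stream records into bulk API lines (index or delete actions)."""
    lines = []
    for record in event["Records"]:
        doc_id = str(from_attr(record["dynamodb"]["Keys"][key]))
        meta = {"_index": index, "_id": doc_id}
        if record["eventName"] == "REMOVE":
            lines.append(json.dumps({"delete": meta}))
        else:  # INSERT or MODIFY: (re)index the new image
            image = record["dynamodb"]["NewImage"]
            lines.append(json.dumps({"index": meta}))
            lines.append(json.dumps({k: from_attr(v) for k, v in image.items()}))
    return lines


# Made-up event with one insert and one removal.
sample_event = {"Records": [
    {"eventName": "INSERT", "dynamodb": {
        "Keys": {"id": {"S": "1"}},
        "NewImage": {"id": {"S": "1"}, "title": {"S": "Hello"}, "views": {"N": "3"}},
    }},
    {"eventName": "REMOVE", "dynamodb": {"Keys": {"id": {"S": "2"}}}},
]}

bulk_lines = to_bulk_lines(sample_event)
```

A Lambda handler would call to_bulk_lines(event), join the lines with newlines (plus a trailing newline) and post them to the cluster's _bulk endpoint.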

Filtering the results of a sorted query in Lucene.NET

I'm using Lucene.NET, which is currently up to date with Lucene 2.9. I'm trying to implement a kind of select distinct, but without the need to drill down into any groups. I know that Lucene 3.2 has a faceted search that may solve this, but I don't have the time to port it to 2.9 yet.
I figure in any event, when you perform a paged query with a sort operator, Lucene has to find all the documents that match the query, sort them, then take the top N results, where N is the page size. I'd like to build something that is also applied after the sorted query has completed, but takes the top N unique results and returns them. I'm thinking of using a HashSet and one of the indexed fields to determine uniqueness. I'd rather find a way to extend something in Lucene than try and do this once the results are already returned for performance reasons.
Custom filters seem to run before the main query is even applied and custom collectors run before sorting is applied, unless you are sorting by Lucene's document id. So what is the best approach to this problem? A point in the direction of the right component to extend will get you the answer on this one, an example implementation will most definitely get you the answer. Thanks in advance
I'd run the search without sorting and, in a custom collector, collect the results into a sorted list of size N based on "uniqueness".
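The selection logic of such a collector might look like the following. This is plain Python illustrating the idea only, not the Lucene.NET Collector API: each hit is assumed to arrive unsorted as a (unique_key, sort_value, doc_id) tuple, where unique_key is the value of the indexed field used for uniqueness and a lower sort_value ranks first. The function name and tuple layout are made up for the example.

```python
import heapq


def top_n_unique(hits, n):
    """Keep the best hit per unique key, then return the n best of those.

    hits: iterable of (unique_key, sort_value, doc_id) in any order.
    Returns a list of (sort_value, doc_id, unique_key), ascending by sort_value.
    """
    best = {}
    for key, sort_value, doc_id in hits:
        current = best.get(key)
        # Remember only the best-ranked document seen so far for each key.
        if current is None or sort_value < current[0]:
            best[key] = (sort_value, doc_id)
    return heapq.nsmallest(n, ((v, d, k) for k, (v, d) in best.items()))


hits = [("a", 5, 1), ("b", 3, 2), ("a", 1, 3), ("c", 4, 4)]
page = top_n_unique(hits, 2)
```

This version holds one entry per distinct key in memory; a real collector with many distinct keys might need a bounded structure instead.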

Resources