Find doc ids before query phase? - elasticsearch

As we do any search in elastic, Elastic performs it in two phases i.e Query and fetch phase as explained under section "Default search type: Query Then Fetch" at this resource
Here are the points
Send the query to each shard
Find all matching documents and calculate scores using local Term/Document Frequencies
Build a priority queue of results (sort, pagination with from/to, etc)
..
I have a question on point 1 of query phase. Per my understanding before query phase itself, elastic will find the relevant documents ids from inverted index based on the word in search query.
Then query will go specific shards only instead of going to each shard. Is that correct ?
So in query phase will elastic fetch those documents from shard based on document_id got grom inverted index, then calculate the scores for fetched document and return id's along with scrores to requesting node.
In fetch phase requesting node get all scores and decide what needs to sent to client then it actually fetches the document.

I have a question on point 1 of query phase. Per my understanding
before query phase itself, elastic will find the relevant documents
ids from inverted index based on the word in search query. Then query
will go specific shards only instead of going to each shard. Is that
correct ?
Here elastic will identify the shard based on document id before query phase. Inverted index does not come into the picture here. Once query goes to shard, the elastic refers the inverted index to find which term exists in which file/index.
Rest of the stuff is same as you pointed in mentioned resource

Related

Populate Elastic index by query and data from another index (reindex with script?) Is this possible?

Say I have 3 indexes:
requests index, each document has requestID
candidates index, each document has candidateID
matches index should have a single document that contains requestID and the top scoring candidateID
Looking to use the data in each 'requests' index to query 'candidates' and output the top hit from the query into matches into a combined document joins both.
Would like to do this all in elastic if possible, via ingest pipeline and scripts. I started looking and reindex api but not sure how would i parameterize it with data from another index.

How to get all the index patterns which never had any documents?

For Kibana server decommissioning purposes, I want to get a list of index patterns which never had any single document and had documents.
How to achieve this using Kibana only?
I tried this but it doesn't give the list based on the document count.
GET /_cat/indices
Also in individual level getting the count to check the documents are there is time consuming .
GET index-pattern*/_count
You can try this. V is for verbose and s stands for sort.
GET /_cat/indices?v&s=store.size:desc
From the docs :
These metrics are retrieved directly from Lucene, which {es} uses internally to power indexing and search. As a result, all document counts include hidden nested documents.

Elasticsearch data comparison

I have two different Elasticsearch clusters,
One cluster is Elastcisearch 6.x with the data, Second new Elasticsearch cluster 7.7.1 with pre-created indexes.
I reindexed data from Elastcisearch 6.x to Elastcisearch 7.7.1
Is there any way to get the doc from source and compare it with the target doc, in order to check that data is there and it is not affected somehow.
When you perform a reindex the data will be indexed based on destination index mapping, so if your mapping is same you should get the same result in search, the _source value will be unique on both indices but it doesn't mean your search result will be the same. If you really want to be sure everything is OK you should check the inverted index generated by both indices and compare them for fulltext search, this data can be really big and there is not an easy way to retrieve it, you can check this for getting term-document matrix .

ElasticSearch: given a document and a query, what is the relevance score?

Once a query is executed on ElasticSearch, a relevance _score is calculated for each retrieved document.
Given a specific document (e.g. by doc ID) and a specific query, I would like to see what is its _score?
One way is perhaps to query ES, retrieve all the hit documents, and look up the desired document out of all the retrieved documents to see its score.
I assume there should be a more efficient way to do this. Given a query and a document ID, what is its _score?
I'm using ElasticSearch 7.x
PS: I need this for a learning-to-rank scenario (to create my judgment list). I have in fact a complex query that was created from various should and must over different fields. My major requirement was to get the score value for each individual sub-query, which seems there is no solution for it. I want to understand which part of this complex query is more useful and which one is less. The only way I've come up with is to execute each sub-query separately to get the score but I do not want to actually execute that query just asking for what is the score of a specific document for that sub-query.
Scoring of the document is not only related to just the document and all other documents in the index, but it also depends on various factor like:
_score is calculated per shard basis not on an index basis by default, although you can change this behavior by using DFS Query Then Fetch param in your query. More info on this official blog.
Is there is any boost applied at index or query time(index time is deprecated from 5.X).
Any custom scoring function is used in addition to the default ES scoring algorithm(tf/idf in old versions) and BM25 in the latest versions.
Edit: Based on the comments from the other respected community members, rephrasing the below statement:
To answer your question, Using the _explain API, you can understand how Elasticsearch computes a score explanation for a query and a specific document. This can give useful feedback on whether a document matches or didn’t match a specific query.

Elastic search/lucene index on multiple words?

When I search say car engine(this is first time any user has searched for this keyword) in Elastic search/lucene , does search engine search the index for individual words in index table first and then find intersection. For example :- Say engine found the 10
documents for car and then it will search for engine say it got 5 documents. Now in 5 documents(minimal no of documents), it will search for car. It has found 2 documents.
Now search engine will rank it based on above results . Is this how multiple words are searched in index table at high level ?
For future searches against same keyword, does search engine make new entry for key car engine in index table ?
Yes, it does search for individual terms and takes the intersection or union of the results, according to your query. It uses something called an "inverted index" which it generates, as and when the documents to be searched are "indexed" into elasticsearch.
Indexing operations are different from searching. So, No, it wouldn't index user searches unless you tell it to (in your application).
The basic functioning of elasticsearch can be split into two parts:
Indexing. You create an index of documents by indexing all the documents that you want to search in. These documents could be anything from your MySQL store, or from Logstash etc, or could be made up of users' search queries that your application indexes into a relevant elastic index.
Searching. You search for the indexed documents using some keywords that could be user generated or application generated or a mixture, using ElasticSearch queries (DSL). If a result is found (according to your query) then elasticsearch returns the relevant records.
I'd encourage you to read this doc for a better understanding of how elastic searches docs:
https://www.elastic.co/blog/found-elasticsearch-from-the-bottom-up

Resources