ES 1.7.3
We have around 20M documents. Each document has a unique ID. When we do a count request (/index/type/_count) we get around 30K fewer documents than we indexed.
I checked the existence of each document by making requests on the ID field. Result: there is none missing.
Is there any reason why _count does not return the exact count?
PS: I read about estimates when doing aggregations. Is this perhaps related?
The Count API may return inaccurate results. You can use search_type=count instead. It works the same way as a search but returns only the count.
Use it like
GET /index/type/_search?search_type=count
Study more about search_type here.
You can also refer to this question
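To compare the two numbers side by side, a minimal sketch along these lines may help. The helper names and the sample responses in the usage note are made up for illustration; the response fields (`count`, `hits.total`, `_shards.failed`) follow the ES 1.x response format. A non-zero `failed` shard count would be one reason for a partial count.

```python
def count_url(base, index, doc_type):
    # URL for the Count API, as used in the question.
    return f"{base}/{index}/{doc_type}/_count"


def search_count_url(base, index, doc_type):
    # URL for a search that returns only the total (ES 1.x search_type=count).
    return f"{base}/{index}/{doc_type}/_search?search_type=count"


def extract_counts(count_resp, search_resp):
    # Pull the totals and the number of failed shards out of both responses.
    failed = count_resp["_shards"]["failed"] + search_resp["_shards"]["failed"]
    return {
        "count_api": count_resp["count"],
        "search_total": search_resp["hits"]["total"],
        "failed_shards": failed,
    }
```

Fetch both URLs with any HTTP client, parse the JSON bodies, and pass them to `extract_counts`. If `failed_shards` is greater than zero, the totals only cover the shards that answered.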
Related
In elasticsearch 7.9, I have an index with 1 shard and 1 replica. I use a simple datetime filter to get docs between a start time and an end time, but I often get the same result set in a different order. I do not want to use a Sort clause and compute scores. I just want to get results in the same order.
So is there any way to do this without using Sort?
This may be happening because you have 1 replica for your index, and the replica might differ slightly from the primary or have different values for your timestamp field. You can use the preference param to make sure your search results are always returned from the same shard copy.
Refer to the bouncing results issue blog in ES for more info.
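As a rough sketch of the preference approach, the request could be built like this. The function name, index name, field name and session string are placeholders, not anything from the question; `preference` is passed as a query-string parameter, and a custom value must not start with an underscore.

```python
from urllib.parse import urlencode


def build_search(index, field, start, end, session_id):
    # Custom preference strings must not begin with "_" (reserved values).
    if session_id.startswith("_"):
        raise ValueError("custom preference value must not start with '_'")
    # The same preference string routes repeated searches to the same shard copies.
    path = f"/{index}/_search?" + urlencode({"preference": session_id})
    # The range clause sits in filter context, so no scores are computed.
    body = {
        "query": {
            "bool": {
                "filter": [{"range": {field: {"gte": start, "lte": end}}}]
            }
        }
    }
    return path, body
```

Using the same `session_id` for every request from one user or job should keep hitting the same copy of the shard, which is what makes the order stable in practice when no sort is given.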
Once a query is executed on ElasticSearch, a relevance _score is calculated for each retrieved document.
Given a specific document (e.g. by doc ID) and a specific query, I would like to see what is its _score?
One way is perhaps to query ES, retrieve all the hit documents, and look up the desired document out of all the retrieved documents to see its score.
I assume there should be a more efficient way to do this. Given a query and a document ID, what is its _score?
I'm using ElasticSearch 7.x
PS: I need this for a learning-to-rank scenario (to create my judgment list). I have a complex query built from various should and must clauses over different fields. My main requirement is to get the score for each individual sub-query, and there seems to be no solution for that. I want to understand which parts of this complex query are more useful and which are less. The only way I've come up with is to execute each sub-query separately to get the score, but I do not want to actually execute the query; I just want to ask what the score of a specific document is for that sub-query.
Scoring of a document is not only related to the document itself and all other documents in the index; it also depends on various factors like:
_score is calculated on a per-shard basis, not on an index basis, by default, although you can change this behavior by using the DFS Query Then Fetch param in your query. More info in this official blog.
Whether any boost is applied at index or query time (index-time boost is deprecated from 5.X).
Whether any custom scoring function is used in addition to the default ES scoring algorithm (tf/idf in old versions, BM25 in the latest versions).
Edit: Based on the comments from the other respected community members, rephrasing the below statement:
To answer your question: using the _explain API, you can understand how Elasticsearch computes a score explanation for a query and a specific document. This can give useful feedback on whether a document matches or didn’t match a specific query.
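A minimal sketch of how this could be used for the per-clause question: build the `_explain` request for one document, then walk the nested `explanation` object in the response (each node carries `value`, `description` and `details`). The function names and the sample explanation in the usage note are illustrative assumptions, not real output.

```python
def explain_request(index, doc_id, query):
    # ES 7.x endpoint: GET /<index>/_explain/<id> with the query in the body.
    return f"/{index}/_explain/{doc_id}", {"query": query}


def flatten_explanation(node, depth=0):
    # Depth-first walk: one (depth, value, description) row per explanation node.
    rows = [(depth, node["value"], node["description"])]
    for child in node.get("details", []):
        rows.extend(flatten_explanation(child, depth + 1))
    return rows
```

Passing `response["explanation"]` to `flatten_explanation` gives one row per node, so the rows at depth 1 under a top-level "sum of:" node would show how much each should/must clause contributed for that document.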
I am pretty new to elasticsearch and already love it.
Right now I am interested in understanding how I can let elasticsearch make suggestions for similar keywords.
I have already read this article: https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-mlt-query.html.
The More Like This Query (MLT Query) finds documents that are "like" a given set of documents.
This is already more than I am looking for. I don't need similar documents but only related / similar keywords.
So let's say I have an index of documents about movies and I start a query about "godfather". Then elasticsearch should suggest related keywords - e.g. "al pacino" or "Marlon Brando" because they are likely to occur in the same documents.
Any ideas how this can be done?
Unfortunately, there is no built-in way to do that in Elastic. What you could do is write a program that queries Elastic and returns the matched documents, then reads their _source data (or retrieves it from your original data source, like a DB or file). You would then calculate TF-IDF for each term in the retrieved documents and combine the scores to get the top K terms out of all returned terms.
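The TF-IDF step could look roughly like the sketch below. It assumes the matched documents are already tokenized and that corpus-wide document frequencies are available from somewhere; the function name, the smoothing in the IDF formula and the sample data are illustrative choices, not a fixed recipe.

```python
import math
from collections import Counter


def top_related_terms(matched_docs, doc_freq, total_docs, query_terms, k=5):
    # matched_docs: list of token lists, one per document returned by the query.
    # doc_freq: term -> number of documents in the whole corpus containing it.
    term_freq = Counter()
    for tokens in matched_docs:
        term_freq.update(tokens)

    scores = {}
    for term, freq in term_freq.items():
        if term in query_terms:
            continue  # do not suggest the query itself
        # Smoothed IDF: terms that are rare in the corpus get a higher weight.
        idf = math.log(total_docs / (1 + doc_freq.get(term, 0)))
        scores[term] = freq * idf

    # Highest score first; ties are broken alphabetically.
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]
```

Terms that appear in almost every document (such as "the") end up with an IDF near zero, so they drop to the bottom even if they are frequent in the matched documents.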
I am adding 11378 documents to an index in ElasticSearch. But the number of documents it shows in the stats is only 225. However, the index_total under indexing has the correct number, that is 11378. When I search for a particular word, it returns 13 hits (docs). When I use a LIKE query in SQL Server, I get 178 records (docs). I do not understand what I am doing wrong. I added the documents to the index first with the PUT command, and then later with POST. Both HTTP methods led to the same stats.
Can someone please explain what is happening? Any links or points are appreciated.
Thanks.
When you have fewer docs than you expect, the most common issue is that the IDs are not generated correctly and you have collisions.
So you have two solutions:
make sure that the IDs you're generating are unique
let ES generate its own IDs
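To check the first point before indexing, a small sketch like this could be run over the source records. It assumes each record is a dict and that the ID lives in a field called `id`; both are assumptions for illustration.

```python
from collections import Counter


def find_colliding_ids(docs, id_field="id"):
    # IDs that occur more than once; each extra occurrence overwrites a document.
    counts = Counter(doc[id_field] for doc in docs)
    return {doc_id: n for doc_id, n in counts.items() if n > 1}


def expected_doc_count(docs, id_field="id"):
    # Number of documents that would remain if these IDs are used as _id.
    return len({doc[id_field] for doc in docs})
```

If `expected_doc_count` is close to the document count shown in the stats while `len(docs)` matches index_total, colliding IDs are the likely cause.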
How can I get all the rows returned from Solr instead of only 10 rows?
You can define how many rows you want (see Pagination in SolrNet), but you can't get all documents. Solr is not a database. It doesn't make much sense to get all documents in Solr, if you feel you need it you might be using the wrong tool for the job.
This is also explained in detail in the Solr FAQ.
As per the Solr Wiki, regarding the number of rows a query returns:
The default value is "10", which is used if the parameter is not specified. If you want to tell Solr to return all possible results from the query without an upper bound, specify rows to be 10000000 or some other ridiculously large value that is higher than the possible number of rows that are expected.
Refer to https://wiki.apache.org/solr/CommonQueryParameters
You can set rows=x in the query URL, where x is the desired number of docs.
You can also get the docs in groups of 10 by looping over the found docs, changing the start value and leaving rows=10.
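The loop over start values can be sketched as below. The generator only builds the `q`, `start` and `rows` parameters for each page; sending the request is left to whatever client is in use. The function name is made up, and `num_found` stands for the `numFound` value reported in the first response.

```python
def page_params(query, num_found, rows=10):
    # One parameter set per page: start advances by `rows` until num_found is covered.
    for start in range(0, num_found, rows):
        yield {"q": query, "start": start, "rows": rows}
```

For 25 matching docs this yields three pages, with start values 0, 10 and 20.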
Technically it is possible to get all results from a SOLR search. All you need to do is to specify the limit as -1.