Elasticsearch has far fewer docs than what I added

I am adding 11378 documents to an index in Elasticsearch, but the number of documents it shows in the stats is only 225. However, index_total under indexing has the correct number, that is 11378. When I search for a particular word, it returns 13 hits (docs), whereas a LIKE query in SQL Server returns 178 records (docs). I am not understanding what I am doing wrong. I added the documents to the index first with the PUT command, and later with POST. Both HTTP methods led to the same stats.
Can someone please explain what is happening? Any links or points are appreciated.
Thanks.

When you end up with fewer docs than you expect, the most common cause is that your document IDs are not generated correctly and you have collisions: indexing a document with an ID that already exists overwrites the previous document, so index_total keeps climbing while the document count does not (see the sketch below).
So you have two solutions:
make sure that the IDs you're generating are unique
let ES generate its own IDs
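As a minimal sketch of the difference (the index, type, and field names here are hypothetical; adapt the paths to your ES version):

# Explicit ID: the second PUT with the same ID overwrites the first document,
# so the doc count stays at 1 while index_total increments twice.
curl -XPUT 'localhost:9200/myindex/mytype/1' -H 'Content-Type: application/json' -d '{"title": "first"}'
curl -XPUT 'localhost:9200/myindex/mytype/1' -H 'Content-Type: application/json' -d '{"title": "second"}'

# Auto-generated ID: POST without an ID makes ES create a unique ID,
# so every request adds a new document.
curl -XPOST 'localhost:9200/myindex/mytype' -H 'Content-Type: application/json' -d '{"title": "third"}'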

Related

Check if document is part of Elasticsearch query?

Curious if there is some way to check whether a document ID is part of a large (million+ results) Elasticsearch query/filter.
Essentially I'll have a group of related document IDs and only want to return them if they are part of a larger query, ideally done on the database side. Theoretically it seemed possible, since ES has to cache stuff related to large scrolls.
It's an interesting use-case, but you need to understand that Elasticsearch (ES) doesn't return all the matching document IDs in the search result; by default it returns only 10 documents in the response, which can be changed with the size parameter.
If you increase the size param and have millions of matching docs in your query, ES query performance will be very bad, and it might even bring the entire cluster down if you fire such queries frequently (in the absence of a circuit breaker), so be cautious about it.
You are right that ES caches things, but if you try to cache a huge amount of data that is invalidated very frequently, you will not get the expected performance benefit, so benchmark it first.
You are already on the correct path with the scroll API for iterating over millions of search results; see the points below to improve further (a sketch of the membership check follows these points).
First get the count of search results; this is included in the default search response (as an exact value or a lower bound) and gives you an idea of how many results you have, based on which you can choose the size param for subsequent calls to see if your ID is present or not.
See if you effectively use the filter context in your query, which ES caches by default.
Benchmark some of your heavier scroll API calls against your data.
Refer to this thread to fine-tune your cluster and index configuration and optimize ES response times further.
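One way to avoid scrolling through millions of hits entirely (a sketch, with a hypothetical index name, query, and IDs) is to flip the check around: combine the original query with an ids filter, so ES only returns the candidate IDs that also match the query:

# Restrict the original query to the candidate IDs; every hit that comes
# back is an ID that is part of the larger query's result set.
curl -XGET 'localhost:9200/myindex/_search' -H 'Content-Type: application/json' -d '
{
  "query": {
    "bool": {
      "must":   { "match": { "field": "value" } },
      "filter": { "ids": { "values": ["id1", "id2", "id3"] } }
    }
  },
  "_source": false
}'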

Does updating a doc increase the "delete" count of the index?

I am facing a strange issue with the number of docs getting deleted in an Elasticsearch index. The data is never deleted, only inserted and/or updated. While I can see that the total number of docs is increasing, I have also been seeing non-zero values in the docs.deleted column. I am unable to understand where this number comes from.
I tried to find out whether updating a doc first deletes it and then re-indexes it, so that the delete count increases in this way, but I could not find any information on this.
The command I type to check the index is:
curl -XGET localhost:9200/_cat/indices
The output I get is:
yellow open e0399e012222b9fe70ec7949d1cc354f17369f20 zcq1wToKRpOICKE9-cDnvg 5 1 21219975 4302430 64.3gb 64.3gb
Note: It is a single node elasticsearch.
I would like to understand the reason behind the deleted docs.
You are correct: updates are the reason you see a non-zero count for deleted documents.
In Lucene there is no such thing as an update; documents in Lucene are immutable.
So how does Elasticsearch provide the update feature?
It does so by making use of the _source field, which is why _source must be enabled to use the update feature. When using the update API, Elasticsearch reads the _source to get all the fields and their existing values, replaces the values of only the fields sent in the update request, marks the existing document as deleted, and indexes a new document with the updated _source. The deleted documents are purged later, when segments are merged in the background.
What is the advantage of this if it's not an actual update?
It removes the overhead from the application of always assembling the complete document, even when only a small subset of fields needs updating. Rather than sending the full document, only the fields that need an update can be sent using the update API; the rest is taken care of by Elasticsearch.
It also saves an extra network round-trip, reduces the payload size, and reduces the chance of version conflicts.
You can read more about how update works here.
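As a minimal sketch of the update API (the index, type, ID, and field are hypothetical; the URL layout varies across ES versions):

# Partial update: only the changed field is sent; ES merges it into the
# existing _source, marks the old document as deleted, and indexes a new one.
curl -XPOST 'localhost:9200/myindex/mytype/1/_update' -H 'Content-Type: application/json' -d '
{
  "doc": { "status": "active" }
}'

# docs.deleted increases by one until a background segment merge purges it.
curl -XGET 'localhost:9200/_cat/indices?v'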

Can multiple add/delete of document to an index make it inconsistent?

For a use-case, I'll need to add and remove multiple documents to/from an Elasticsearch index. My understanding is that the tf-idf or BM25 scores are affected by frequencies that are calculated using the postings list (?). But if I add and remove many documents in a day, will that affect the document/word statistics?
I've already gone through a lot of APIs, but my untrained eyes could not tell whether this is the case, or whether there's a way for me to force Elasticsearch to update/recompute the index every day or so.
Any help would be appreciated.
Thanks.
"The IDF portion of the score can be affected by deletions and modifications" the rest should be fine... (Igor Motov)
Link to discussion:
https://discuss.elastic.co/t/can-multiple-add-delete-of-document-to-an-index-make-it-inconsistent/137030
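The reason is that deleted and updated documents linger in their segments until a merge purges them, and the IDF statistics keep counting them in the meantime. If that skew ever becomes a problem, one option (the index name is hypothetical; on very old versions the endpoint was called _optimize) is to periodically expunge the deletes:

# Merge segments, dropping only deleted documents, so IDF stops counting
# them. This is I/O-heavy, so run it during off-peak hours.
curl -XPOST 'localhost:9200/myindex/_forcemerge?only_expunge_deletes=true'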

Elasticsearch: count returning wrong value

ES 1.7.3
We have around 20M documents. Each document has a unique ID. When we do a count request (/index/type/_count) we get around 30K fewer documents than we indexed.
I checked the existence of each document by making requests on the ID field. Result: none are missing.
Is there any reason why _count does not return the exact count?
PS: I read about estimates when doing aggregations. Is this perhaps related?
The count API may return inaccurate results. You can use search_type=count instead; it works the same way as a regular search but returns only the count.
Use it like
GET /index/type/_search?search_type=count
Study more about search_type here.
You can also refer to this question.
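Note that search_type=count was deprecated in Elasticsearch 2.0 in favor of size=0, which also skips fetching documents and reports the total in hits.total:

# Equivalent on ES 2.x and later: return no hits, read hits.total from the response.
curl -XGET 'localhost:9200/index/type/_search?size=0'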

Get all the results from Solr without the default limit of 10

How can I get all the rows returned from Solr instead of only 10 rows?
You can define how many rows you want (see Pagination in SolrNet), but you can't get all documents; Solr is not a database. It doesn't make much sense to get all documents in Solr, and if you feel you need it, you might be using the wrong tool for the job.
This is also explained in detail in the Solr FAQ.
As per the Solr Wiki, regarding the number of rows a query returns:
The default value is "10", which is used if the parameter is not specified. If you want to tell Solr to return all possible results from the query without an upper bound, specify rows to be 10000000 or some other ridiculously large value that is higher than the possible number of rows that are expected.
Refer to https://wiki.apache.org/solr/CommonQueryParameters
You can set rows=x, where x is the desired number of docs, in the query URL.
You can also get the documents in groups of 10 by looping over the found docs, increasing the start value while leaving rows=10.
It has also been suggested that you can get all results from a Solr search by specifying the limit (rows) as -1, but recent Solr versions reject negative rows values, so prefer the approaches above.
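For genuinely large result sets, deep paging with a cursor is what the Solr reference guide recommends (the collection name and uniqueKey field here are hypothetical):

# Page through all results with cursorMark; the sort must include the
# uniqueKey field (id here). The first request uses cursorMark=*.
curl 'http://localhost:8983/solr/mycollection/select?q=*:*&rows=500&sort=id+asc&cursorMark=*'

# Each response contains a nextCursorMark; pass it into the next request
# (the token below is a made-up example) and stop when it no longer changes.
curl 'http://localhost:8983/solr/mycollection/select?q=*:*&rows=500&sort=id+asc&cursorMark=AoEnNTQ3NDk2'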
