Elasticsearch - Inconsistent results while using Native Script

ES Version 2.1.2
I am using a NativeScript to filter documents by term value. The script returns true/false based on a lookup of the term value from the document. The number of results returned by Elasticsearch grows steadily for the same query until it reaches the actual count. I am not sure whether this is because the fielddata is cached progressively. It always happens after ES is restarted. If the script query is replaced with a terms query, the results are accurate on the first search. I also notice that my custom script gets initialized multiple times, and the number varies for every search request. What is wrong here?
I am running ES on a single node with 3 shards and 0 replicas. The Native Script gets initialized 6 times for the first query. The number of initializations rises to around 18 the second time the same query is run and reaches 22 the third time. It stays at 22 for subsequent searches. As the number of initializations increases, I get more results for the same query, and once it reaches the final number, the actual number of search results is returned. I can't understand this inconsistency in the total count of search results for the same query.

Found the issue. The search timeout was set to 500ms, and during the first query (after bouncing ES), ES tries to load fielddata into memory from all active segments. Since the initial population of fielddata took longer than the timeout threshold, fielddata from some segments was not loaded into memory, so the search returned only partial results. During subsequent searches the fielddata gets loaded completely, and hence the actual count appears after a few searches.
Increasing the timeout to 5s gave ES enough time to load fielddata from all segments, so the actual total count is returned on the first search itself.
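
A search that hits its timeout still returns whatever hits were collected so far, and Elasticsearch marks such a response with "timed_out": true. Checking that flag is one way to spot a partial count like the one described above. The following is a minimal sketch, assuming the response has already been parsed into a Python dict; the helper name is_partial and the two sample responses are made up for illustration and are not taken from the original post.

```python
def is_partial(response):
    """Return True if a parsed search response may be missing hits."""
    shards = response.get("_shards", {})
    # "timed_out" is set when the search timeout expired before all
    # shards finished; "failed" counts shards that returned an error.
    return bool(response.get("timed_out")) or shards.get("failed", 0) > 0


# Hypothetical responses, trimmed to the fields inspected above.
cold_response = {
    "took": 500,
    "timed_out": True,
    "_shards": {"total": 3, "successful": 3, "failed": 0},
    "hits": {"total": 1200},
}
warm_response = {
    "took": 40,
    "timed_out": False,
    "_shards": {"total": 3, "successful": 3, "failed": 0},
    "hits": {"total": 4800},
}

for name, resp in (("cold", cold_response), ("warm", warm_response)):
    status = "partial" if is_partial(resp) else "complete"
    print(name, resp["hits"]["total"], status)
```

If the flag is set, the total should be treated as a lower bound and the query retried with a longer timeout.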

Related

Elastic app search Bulk Indexing Maximum is 100, how to index 1 million documents?

I am really struggling with Elastic App Search. The limit for a bulk index request is 100 documents according to the docs:
https://www.elastic.co/guide/en/app-search/current/limits.html
I tried creating all the promises and then calling promise.all(allPromises), but it fails to index everything. The response still returns 200 when this happens, so you have to loop over res.data (the array with all 100 documents) and check whether each entry has an error field.
Is there any way to index a lot of documents fast? Indexing 1 million documents with a loop that awaits between every batch of 100 is extremely slow.
Unfortunately the limit of 100 makes it a slow operation. We are indexing 1.1 million documents, and our solution was to slice that up into ten segments and run ten processes in parallel to reduce the time it takes.
Since reading from the index is very quick we have a separate job that validates the information in App Search matches our source data and flags anything out of order. So I don't check for errors on a full import or update so that part goes as fast as possible. We only see around a 1% failure rate on bulk imports.
I should note that part of the reason for the second piece is we have on occasion found the search index can get out of alignment with the documents and fields we have, so validation seemed like a good idea.
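
The approach described above (batches of 100, several workers in parallel, per-document error checks) can be sketched roughly as follows. This is an illustration under assumptions, not the answerer's actual code: send_batch is a hypothetical callable standing in for whatever client call performs the bulk request, and it is assumed to return one result entry per document with an "id" and an "errors" list.

```python
from concurrent.futures import ThreadPoolExecutor

BATCH_SIZE = 100  # bulk limit quoted in the question


def chunked(items, size=BATCH_SIZE):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def failed_ids(batch_result):
    """Collect the ids of documents whose result entry reports errors."""
    return [entry.get("id") for entry in batch_result if entry.get("errors")]


def index_all(documents, send_batch, workers=10):
    """Send all batches through `send_batch` in parallel.

    `send_batch` is a hypothetical callable (list of docs -> list of
    per-document results). Returns the ids that reported errors.
    """
    failures = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map keeps results in the same order as the batches.
        for result in pool.map(send_batch, chunked(documents)):
            failures.extend(failed_ids(result))
    return failures
```

Because a failed batch can still come back with HTTP 200, the returned list of ids is what tells you which documents need to be retried or validated later.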

Could removing Elasticsearch results limit cause performance problems?

If I were to bypass the limit of 10 results in ElasticSearch by adding a size parameter to my query as described here, could that cause performance problems to my ES cluster?
It depends on several factors:
- The number of requests ES receives per second.
- The size of the individual documents.
- How many of the requests are unique. If the same query is sent repeatedly, results can be returned from cache.
- The size of the query.
As the number of documents returned grows, the response size and response time grow as well.
This also hurts the application that displays or delivers the results; for example, a UI will be slow to parse and render a large result set.
Paginating the results is also the more future-proof option.
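
As a rough illustration of the pagination suggestion, a search body can carry the from and size parameters so that each request returns one page instead of the whole result set. This is a minimal sketch; the helper name page_body and the default page size are assumptions made for the example.

```python
def page_body(query, page, page_size=10):
    """Build a search request body for a zero-based page number."""
    if page < 0 or page_size < 1:
        raise ValueError("page must be >= 0 and page_size >= 1")
    # "from" is the offset of the first hit, "size" the number of hits.
    return {"query": query, "from": page * page_size, "size": page_size}


# Third page (zero-based index 2) with 25 hits per page.
body = page_body({"match_all": {}}, page=2, page_size=25)
print(body)
```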

Bigdesk charts explanation

I don't understand what Search time per second (Δ) means. Is it the delta of number of milliseconds that the search requests took in previous and current refresh interval? Also there is a Query and Fetch time below the chart, not sure what that represents.
Attached is a screenshot:
A query in Elasticsearch is actually a two-phase process:
Query Phase :
During the initial query phase, the query is broadcast to a shard copy (a primary or replica shard) of every shard in the index. Each shard executes the search locally and builds a priority queue of matching documents.
And
Fetch Phase :
The query phase identifies which documents satisfy the search request, but we still need to retrieve the documents themselves. This is the job of the fetch phase.
And that mail explains the Search time per second (Δ) part in detail:
Here is an example for "Search requests per second (Δ)":
- You do some "_search" request
- It hits 15 shards of some indices on that node, so the value of indices -> search -> "query_total" in the nodes stats API response increases by 15
- Bigdesk refresh value is 5000 (5 sec)
As a result the chart should display peak of 3 (15/5) in the Query
line. So if the value is ~1500 in your case then it means in average
an X number of shards is hit by search requests per second where
X=1500*refresh (does it make sense)?
You can see the chart is really only informative (it depends on
refresh interval and number of shards). But there is the cumulative
"query_total" value displayed as well in the web UI.
Similarly, the second chart "Search time per second (Δ)" displays the
average time (in millis) spent in query or fetch phase on the node.
Again this value includes all involved shards on that node.
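
The arithmetic in the quoted example (a counter that grows by 15 over a 5000 ms refresh interval gives a chart value of 3) can be written out as a small calculation. This only mirrors the reasoning in the mail; it is not Bigdesk's actual code, and the function name and counter values are made up for the example.

```python
def per_second_delta(previous_total, current_total, refresh_ms):
    """Change of a cumulative counter per second over one refresh interval."""
    return (current_total - previous_total) / (refresh_ms / 1000.0)


# "query_total" grows from 100 to 115 (15 shard-level queries)
# between two refreshes that are 5000 ms apart.
print(per_second_delta(100, 115, 5000))  # 3.0
```

The same calculation applied to the cumulative query and fetch time counters gives the values of the second chart.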
Search time per second (Δ) is based on two series, series1 and series2;
they are explained here.
It looks like the chart shows these metrics per unit of time.

ElasticSearch Query time - how to decrease the response time

I am executing some queries on elastic search.
Some of the queries take a long time the first time they are executed, and the response time drops when they are rerun.
However, the first execution takes nearly 16 seconds for some of the queries.
I have increased the vCPU from 1vCPU to 2vCPU (the Elasticsearch server is running as a VM) and I can see some decrease in the response time ("took" in Elasticsearch).
Can someone please summarize which factors (both hardware and software, e.g. query construction) affect the response time in Elasticsearch?
I am using Java to query ES.
The first query performs a full search; the following ones can use caches, which is why they are quicker.
Check how the fields you search on are indexed in Elasticsearch. Depending on your kind of search, your data may not be indexed appropriately, and correcting that will speed up the process.
You can also limit the number of matches returned if you don't need all results at once (managing pagination yourself).

Elastic Search inconsistent index count

I am indexing a large dataset (30 million rows), and following each re-index (using a JDBC river) I am seeing inconsistencies in the total size of the index.
I am using:
curl -XGET 'http://localhost:9200/index_name/_count'
and the results vary by as much as 100,000 results after each re-index.
I can't see any index errors in the log.
One possibility is that your refresh_interval setting is set to a high number. This option is used to reduce disk IO. Indexed results may only be available after this interval has expired.
You can also use the refresh API to force a refresh to take place. Like this:
curl -XPOST 'http://localhost:9200/index_name/_refresh'
See the elastic documentation for more details.
The _count API itself returns an exact count of the documents visible after the last refresh. If the figure comes from a cardinality aggregation instead, the algorithm used by Elasticsearch runs in bounded memory, which leads to approximate results. To increase the precision you can set
precision_threshold : THRESHOLD
on the aggregation. Even then, an error of up to around 5% remains possible.
