Elastic Search inconsistent index count - elasticsearch

I am indexing a large dataset (30 million rows), and following each re-index (using a JDBC river) I am seeing inconsistencies in the total size of the index.
I am using:
curl -XGET 'http://localhost:9200/index_name/_count'
and the results vary by as much as 100,000 results after each re-index.
I can't see any index errors in the log.

One possibility is that your refresh_interval setting is set to a high number. This option is used to reduce disk IO. Indexed results may only be available after this interval has expired.
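To check the current value, or lower it, the index settings API can be used (index_name as in your command above; the 1s value is only an example):
curl -XGET 'http://localhost:9200/index_name/_settings'
curl -XPUT 'http://localhost:9200/index_name/_settings' -d '{"index": {"refresh_interval": "1s"}}'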
You can also use the refresh API to force a refresh to take place. Like this:
curl -XPOST 'http://localhost:9200/index_name/_refresh'
See the Elasticsearch documentation for more details.

The algorithm Elasticsearch uses for approximate counting (the cardinality aggregation) runs in bounded memory, which leads to approximate results. To increase the precision you can set:
precision_threshold: THRESHOLD
where counts below the threshold are expected to be close to exact (the maximum supported value is 40000). Even so, there is still scope for roughly 5% error.
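As a hedged sketch, assuming you are counting distinct values with the cardinality aggregation (the field name my_unique_id is a placeholder for whatever unique key you count on):
curl -XGET 'http://localhost:9200/index_name/_search' -d '{
  "size": 0,
  "aggs": {
    "distinct_rows": {
      "cardinality": {
        "field": "my_unique_id",
        "precision_threshold": 40000
      }
    }
  }
}'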

Related

Could removing Elasticsearch results limit cause performance problems?

If I were to bypass the limit of 10 results in ElasticSearch by adding a size parameter to my query as described here, could that cause performance problems for my ES cluster?
It will depend on various parameters:
Number of requests ES receives every second/millisecond.
Size of each individual document.
How many of the total requests are unique; if the same query is hit multiple times, the results are returned from the cache.
Size of the query.
As the number of documents returned increases, the response size and response time will also increase.
This will hamper the performance of the application where these results are displayed or delivered; for example, the UI will become slow parsing and displaying all the results.
Going for pagination will also be more future-proof.
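A minimal sketch of paginating instead of raising size in one go (the index name and page size are illustrative):
curl -XGET 'http://localhost:9200/my_index/_search' -d '{
  "from": 0,
  "size": 50,
  "query": { "match_all": {} }
}'
Subsequent pages just bump from by the page size; for very deep paging, the scroll API is the usual recommendation.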

Reducing the size of elasticsearch index

I've just taken over a project that maintains some Elasticsearch indexes, and I'm new to the area. The indexes have grown massive in size. When I run:
GET /_cat/indices?v
I can see multiple multi-terabyte indexes. This is costing a lot in AWS fees. The last engineer who worked on this (no longer with the company) left some notes saying the n-gram configuration was causing the index to grow to a massive size. How do I find out information about how n-grams are set up on the index? When I run:
GET /my_index/_mapping
And get the mapping information, I don't see any mention of n-grams for any of the fields. How can I see this information? I can see in the indexer code that it's set up to have a min ngram of 2 and a max ngram of 12. Shouldn't this be in the mapping data returned from the call above?
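For reference, the setup in the indexer code looks roughly like this at index-creation time (analyzer and filter names changed here, so treat it as a sketch):
curl -XPUT 'http://localhost:9200/my_index' -d '{
  "settings": {
    "analysis": {
      "filter": {
        "ngram_filter": { "type": "ngram", "min_gram": 2, "max_gram": 12 }
      },
      "analyzer": {
        "ngram_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "ngram_filter"]
        }
      }
    }
  }
}'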
Also, what other analysis can I perform on the indexes to get better insight into their size, and what can be done to help reduce their footprint?
Thanks.

Elasticsearch - Inconsistent results while using Native Script

ES Version 2.1.2
I am using NativeScript to filter documents by term value. The script returns true/false based on the lookup of the term value from the document. The number of results returned by Elasticsearch grows steadily for the same query until it reaches the actual count. I am not sure if it is because the fielddata is cached progressively. This always happens when ES is restarted. If the script query is replaced with terms query, the results are accurate in the first search. I also notice my custom script gets initialized multiple times and the number varies for every search request. What is wrong here?
I am running ES on a single node with 3 shards and 0 replicas. The Native Script gets initialized 6 times for the first query. The number of initializations increases to around 18 the second time for the same query and reaches 22 the third time. It stays at 22 for subsequent searches. As the number of Native Script initializations increases, I get more results for the same query, and when it reaches the final count, the actual number of search results is returned. I can't understand this inconsistency in the total count of search results for the same query.
Found the issue. The search timeout was set to 500 ms, and during the first query (after bouncing ES), ES tries to load fielddata into memory from all active segments. Since the initial population of fielddata took more time than the timeout threshold, fielddata from some segments was not loaded into memory. During subsequent searches the fielddata gets loaded completely, which is why the actual count appears after a few searches.
Increasing the timeout to 5s gave enough time to load fielddata from all segments, thereby returning the actual total count on the first search itself.
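For reference, a sketch of how the request-level timeout can be raised in the search body (the match_all query is a stand-in for the actual script query):
curl -XGET 'http://localhost:9200/my_index/_search' -d '{
  "timeout": "5s",
  "query": { "match_all": {} }
}'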

ElasticSearch Query time - how to decrease the response time

I am executing some queries on Elasticsearch.
Some of the queries take a long time to execute the first time, and on rerun the response time drops.
However, first-time execution is nearing 16 seconds for some of the queries.
I have increased the vCPUs from 1 to 2 (the Elasticsearch server is running as a VM) and I can see some decrease in the response time ("took" in Elasticsearch).
Can someone please help and summarize what factors (both hardware and software, e.g. query construction) affect the response time in Elasticsearch?
I am using Java to query ES.
The first query performs a full search; subsequent ones can use some caching, which is why they are quicker.
Check how the fields you search on are indexed in Elasticsearch. Depending on your kind of search, your data may not be indexed appropriately; fixing that will speed up the process.
You can also limit the number of matches if you don't need all results at the same time (managing pagination yourself).
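A hedged sketch combining both suggestions, if you're on ES 2.x or later: keep clauses that do not need scoring in a bool filter (filter results can be cached) and cap the number of hits; the index and field names are placeholders:
curl -XGET 'http://localhost:9200/my_index/_search' -d '{
  "size": 20,
  "query": {
    "bool": {
      "must":   { "match": { "title": "search terms" } },
      "filter": { "term":  { "status": "active" } }
    }
  }
}'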

Performance issues using Elasticsearch as a time window storage

We are using Elasticsearch almost as a cache, storing documents found in a time window. We continuously insert a lot of documents of different sizes, and then we search in ES using text queries combined with a date filter so the current thread does not get documents it has already seen. Something like this:
"((word1 AND word 2) OR (word3 AND word4)) AND insertedDate > 1389000"
We keep the data in Elasticsearch for 30 minutes, using the TTL feature. Today we have at least 3 machines inserting new documents in bulk requests every minute per machine, and searching with queries like the one above practically continuously.
We are having a lot of trouble indexing and retrieving these documents; we are not getting good throughput in the volume of documents indexed and returned by ES. We can't even get 200 documents indexed per second.
We believe the problem lies in the simultaneous queries, inserts and TTL deletes. We don't need to keep old data in elastic, we just need a small time window of documents indexed in elastic at a given time.
What should we do to improve our performance?
Thanks in advance
Machine type:
An Amazon EC2 medium instance (3.7 GB of RAM)
Additional information:
The code used to build the index is something like this:
https://gist.github.com/dggc/6523411
Our elasticsearch.json configuration file:
https://gist.github.com/dggc/6523421
EDIT
Sorry about the long delay to give you guys some feedback. Things were kind of hectic here at our company, and I chose to wait for calmer times to give a more detailed account of how we solved our issue. We still have to do some benchmarks to measure the actual improvements, but the point is that we solved the issue :)
First of all, I believe the indexing performance issues were caused by a usage error on our part. As I said before, we used Elasticsearch as a sort of cache, to look for documents inside a 30-minute time window. We looked for documents in Elasticsearch whose content matched some query, and whose insert date was within some range. Elasticsearch would then return us the full document JSON (which had a whole lot of data besides the indexed content). Our configuration had Elasticsearch indexing the document JSON field by mistake (besides the content and insertDate fields), which we believe was the main cause of the indexing performance issues.
However, we also did a number of modifications, as suggested by the answers here, which we believe also improved the performance:
We now do not use the TTL feature, and instead use two "rolling indexes" under a common alias. When an index gets old, we create a new one, assign the alias to it, and delete the old one.
Our application does a huge number of queries per second. We believe this hits Elasticsearch hard and degrades the indexing performance (since we only use one node for Elasticsearch). We were using 10 shards for the node, which caused each query we fired at Elasticsearch to be translated into 10 queries, one for each shard. Since we can discard the data in Elasticsearch at any moment (so changing the number of shards is not a problem for us), we just changed the number of shards to 1, greatly reducing the number of queries on our Elasticsearch node (a sketch of this follows below).
We had 9 mappings in our index, and each query would be fired to a specific mapping. Of those 9 mappings, about 90% of the documents inserted went to two of those mappings. We created a separate rolling index for each of those mappings, and left the other 7 in the same index.
Not really a modification, but we installed SPM (Scalable Performance Monitoring) from Sematext, which allowed us to closely monitor Elasticsearch and track important metrics, such as the number of queries fired -> sematext.com/spm/index.html
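As a sketch of the shard change mentioned above, the rolling indexes are now created with a single shard (the index name is illustrative):
curl -XPUT 'http://localhost:9200/docs-000002' -d '{
  "settings": { "number_of_shards": 1, "number_of_replicas": 0 }
}'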
Our usage numbers are relatively small. We have about 100 documents/second arriving which have to be indexed, with peaks of 400 documents/second. As for searches, we have about 1500 searches per minute (15000 before changing the number of shards). Before those modifications, we were hitting those performance issues, but not anymore.
TTL to time-series based indexes
You should consider using time-series-based indexes rather than the TTL feature. Given that you only care about the most recent 30 minute window of documents, create a new index for every 30 minutes using a date/time based naming convention: ie. docs-201309120000, docs-201309120030, docs-201309120100, docs-201309120130, etc. (Note the 30 minute increments in the naming convention.)
Using Elasticsearch's index aliasing feature (http://www.elasticsearch.org/guide/reference/api/admin-indices-aliases/), you can alias docs to the most recently created index so that when you are bulk indexing, you always use the alias docs, but they'll get written to docs-201309120130, for example.
When querying, you would filter on a datetime field to ensure only the most recent 30 mins of documents are returned, and you'd need to query against the 2 most recently created indexes to ensure you get your full 30 minutes of documents - you could create another alias here to point to the two indexes, or just query against the two index names directly.
With this model, you don't have the overhead of TTL usage, and you can just delete the old, unused indexes from over an hour in the past.
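For example, a sketch of the alias switch using the naming scheme above (the timestamps are just examples):
curl -XPOST 'http://localhost:9200/_aliases' -d '{
  "actions": [
    { "remove": { "index": "docs-201309120100", "alias": "docs" } },
    { "add":    { "index": "docs-201309120130", "alias": "docs" } }
  ]
}'
curl -XDELETE 'http://localhost:9200/docs-201309120000'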
There are other ways to improve bulk indexing and querying speed as well, but I think removal of TTL is going to be the biggest win - plus, your indexes only have a limited amount of data to filter/query against, which should provide a nice speed boost.
Elasticsearch settings (eg. memory, etc.)
Here are some settings that I commonly adjust for servers running ES - http://pastebin.com/mNUGQCLY - note that it's only for a 1GB VPS, so you'll need to adjust.
Node roles
Looking into master vs data vs 'client' ES node types might help you as well - http://www.elasticsearch.org/guide/reference/modules/node/
Indexing settings
When doing bulk inserts, consider modifying the values of both index.refresh_interval and index.merge.policy.merge_factor - I see that you've modified refresh_interval to 5s, but consider setting it to -1 before the bulk indexing operation, and then back to your desired interval. Or, consider just doing a manual _refresh API hit after your bulk operation is done, particularly if you're only doing bulk inserts every minute - it's a controlled environment in that case.
With index.merge.policy.merge_factor, setting it to a higher value reduces the amount of segment merging ES does in the background; setting it back to its default after the bulk operation restores normal behaviour. A value of 30 is commonly recommended for bulk inserts, and the default is 10.
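A sketch of that sequence around a bulk load, using the 5s interval mentioned above (the index name is illustrative; index.merge.policy.merge_factor can be adjusted the same way on versions/merge policies that support it):
curl -XPUT 'http://localhost:9200/docs-201309120130/_settings' -d '{"index": {"refresh_interval": "-1"}}'
# ... run the bulk indexing ...
curl -XPUT 'http://localhost:9200/docs-201309120130/_settings' -d '{"index": {"refresh_interval": "5s"}}'
curl -XPOST 'http://localhost:9200/docs-201309120130/_refresh'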
Some other ways to improve Elasticsearch performance:
increase index refresh interval. Going from 1 second to 10 or 30 seconds can make a big difference in performance.
throttle merging if it's being overly aggressive. You can also reduce the number of concurrent merges by lowering index.merge.policy.max_merge_at_once and index.merge.policy.max_merge_at_once_explicit. Lowering index.merge.scheduler.max_thread_count can help as well.
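For example, a sketch of lowering those settings via the settings API (values are illustrative, not recommendations; depending on the ES version, some of these may only be settable at index creation or in elasticsearch.yml rather than dynamically):
curl -XPUT 'http://localhost:9200/my_index/_settings' -d '{
  "index": {
    "merge.policy.max_merge_at_once": 5,
    "merge.scheduler.max_thread_count": 1
  }
}'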
It's good to see you are using SPM. Its URL in your EDIT was not a hyperlink - it's at http://sematext.com/spm . The "Indexing" graphs will show how changing the merge-related settings affects performance.
I would fire up an additional ES instance and have it form a cluster with your current node. Then I would split the work between the two machines, use one for indexing and the other for querying. See how that works out for you. You might need to scale out even more for your specific usage patterns.
