Elasticsearch cache clear doesn't seem to do what I expected

I ran a search for the first time and it took 3-4 seconds.
I ran the same search a second time and it took less than 100 ms (as expected, since it used the cache).
Then I cleared the cache by calling "http://host:port/index/_cache/clear".
Next I ran the same search, expecting it to take 3-4 seconds again, but it took less than 100 ms.
So the clearing of the cache didn't work?
What exactly got cleared by that url?
How do I make ES do the raw search (i.e. no caching) every time?
I am doing this as part of some load testing.

Clearing the cache will empty:
Field data (used by facets, sorting, geo, etc.)
Filter cache
Parent/child cache
Bloom filters for posting lists
The effect you are seeing is probably due to the OS file system cache. Elasticsearch and Lucene lean heavily on the OS file system cache because Lucene segments are immutable. This means that small indices tend to be cached entirely in memory by your OS and are effectively served without touching disk.
As an aside, it doesn't really make sense to benchmark Elasticsearch in a "cacheless" state. It is designed and built to operate in a cached environment - much of the performance Elasticsearch is known for comes from its excellent use of caching.
To be completely accurate, your benchmark should really look at a system with a fully warmed JVM (to properly size the eden space, optimize JIT output, etc.) and use real, production-like data to simulate "real world" cache filling and eviction at both the ES and OS levels.
Synthetic tests such as "no-cache environment" make little sense.
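Here is a minimal sketch of the experiment described above, assuming Python with the requests library and a hypothetical host, index name, and query; it times the same search before and after hitting the clear-cache endpoint, which only clears Elasticsearch's own caches, not the OS page cache:

```python
import time
import requests

ES = "http://localhost:9200"      # hypothetical host:port
INDEX = "my_index"                # hypothetical index name
QUERY = {"query": {"match": {"title": "example"}}}  # hypothetical query

def timed_search():
    start = time.perf_counter()
    resp = requests.get(f"{ES}/{INDEX}/_search", json=QUERY)
    resp.raise_for_status()
    return (time.perf_counter() - start) * 1000  # milliseconds

print("cold search: %.1f ms" % timed_search())
print("warm search: %.1f ms" % timed_search())

# Clear only Elasticsearch-level caches (field data, filter cache, etc.).
# The OS file system cache still holds the segment files, so the next
# search will usually stay fast.
requests.post(f"{ES}/{INDEX}/_cache/clear").raise_for_status()

print("after clear: %.1f ms" % timed_search())
```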

I don't know if this is what you're experiencing, but the cache isn't cleared immediately when you call clear cache. It is scheduled to be deleted in the next 60 seconds.
source: https://www.elastic.co/guide/en/elasticsearch/reference/1.5/indices-clearcache.html

Related

Elasticsearch: When serving a read request, why not try to find the document in the memtable first to achieve real-time query?

In Elasticsearch official document Near real-time search, it says that
In Elasticsearch, this process of writing and opening a new segment is called a refresh. A refresh makes all operations performed on an index since the last refresh available for search.
By default, Elasticsearch periodically refreshes indices every second, ... This is why we say that Elasticsearch has near real-time search: document changes are not visible to search immediately, but will become visible within this timeframe.
I feel a little confused: when serving a read request, why not try to find the document in the memtable first and then in the on-disk segments? That way we would not need to wait for the refresh, which would make real-time queries possible.
Really good question, but to understand why Elasticsearch doesn't serve a search request from in-memory documents, we have to dig a little deeper and understand why segments are created in the first place and why they are immutable.
As you might be aware, segments are the actual physical files that store the data of the search index, and segments are immutable. This immutability provides a lot of benefits, such as:
Segments can be cached.
Segments can be used in multi-threaded environments without worrying about their state changing.
Now, as segments are cached and can be used in multi-threaded environments, it's much easier to use the file system cache to provide faster search. Of course, that means you sometimes won't have the newest copy of the data, but that's a better trade-off than iterating through a memtable which is still being modified, which can still show an old version of a document (so you would still only have near-real-time data), and which can't be cached because it isn't immutable. Every search thread would end up searching a dataset that is always in motion, and if you applied locking on the memtable while searching, it would reduce indexing speed.
By the way, this design comes from Lucene, and Elasticsearch uses Lucene as a library, so it isn't really Elasticsearch that controls this.
Bottom line: even if you searched the memtable without locking and without blocking updates while searching, you couldn't show truly real-time data, and it would considerably slow down both indexing and search.
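To see the near-real-time behaviour in practice, here is a small sketch (Python with requests; the host, index, and field names are hypothetical, and document API paths differ slightly between Elasticsearch versions): a freshly indexed document is typically not visible to search until a refresh opens a new segment, either when the refresh interval elapses or when you call _refresh explicitly.

```python
import requests

ES = "http://localhost:9200"   # hypothetical host:port
INDEX = "nrt_demo"             # hypothetical index name

def hit_count():
    resp = requests.get(f"{ES}/{INDEX}/_search",
                        json={"query": {"match": {"msg": "hello"}}})
    return len(resp.json()["hits"]["hits"])

# Index a document without forcing a refresh.
requests.post(f"{ES}/{INDEX}/_doc", json={"msg": "hello"})

print("immediately after indexing:", hit_count())   # usually 0 hits

# Force a refresh: the in-memory indexing buffer is written to a new
# segment and opened for search.
requests.post(f"{ES}/{INDEX}/_refresh")

print("after explicit _refresh:  ", hit_count())    # 1 hit
```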
Hope this helps.

MongoDB small collection Query very slow

I have a 33 MB collection with around 33k items in it. This has been working perfectly for the past month; the queries were responsive and there were no slow queries. The collection has all the required indexes, and normally the response is almost instant (1-2 ms).
Today I spotted that there was a major query queue and the requests were just not getting processed. The oplog was filling up and just not clearing. After some searching I found the post below, which suggests compacting and repairDatabase. I ran the repair and it fixed the problem. Ridiculously slow mongoDB query on small collection in simple but big database
My question is: what could have gone wrong with the collection, and how did repairDatabase fix the problem? Is there a way for me to ensure this does not happen again?
There are many things that could be an issue here, but ultimately if a repair/compact solved things for you it suggests storage-related issues. Here are a few suggestions to follow up on:
Disk performance: Ensure that your disks are performing properly and that you do not have bad sectors. If part of your disk is damaged it could have spiked access times and you may run into this again. You may want to test your RAM modules as well.
Fragmentation: Without knowing your write profile it's hard to say, but your collections and indexes could have fragmented all over your storage system. Running repair will have rebuilt them and brought them back into a more contiguous form, allowing your disk access times to be much faster, especially if you're using mechanical disks and are going to disk for a lot of data.
If this was the issue then you may want to adjust your paddingFactor to reduce how often this happens in the future, especially if your updates grow the size of your documents over time (assuming you're using MMAPv1 storage). A quick way to check for this kind of fragmentation is sketched after the last suggestion below.
Page faults: I'm assuming you may have brought the system down to do the repair, which may have reset your memory/working set. You might want to monitor for hard page faults that indicate that your queries are being bottlenecked by IO rather than being served by your in-memory working set. If this is consistently the case, your application behavior may change unexpectedly as data gets pushed in and out of memory, and you may need to add more RAM.
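As a rough check of the fragmentation theory, something like the following could help (a sketch using PyMongo; the connection string, database, and collection names are hypothetical, and the exact numbers depend on the storage engine):

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")   # hypothetical connection string
db = client["mydb"]                                 # hypothetical database name

stats = db.command("collStats", "mycollection")     # hypothetical collection name

# "size" is the total size of the documents, "storageSize" is the space
# allocated on disk. A storageSize much larger than size suggests
# fragmentation / unreclaimed space that compact or repair would recover.
print("data size:    %d bytes" % stats["size"])
print("storage size: %d bytes" % stats["storageSize"])
print("index size:   %d bytes" % stats["totalIndexSize"])

# On MMAPv1, compact rewrites and defragments the collection in place.
# Note that it blocks operations on the database while it runs.
# db.command("compact", "mycollection")
```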

How to compare performance on neo4j queries without cache?

I've been trying to compare query performance in neo4j.
In order to make the queries more efficient, I added an index, analysed the results using PROFILE, and tried doing the same while using USING INDEX.
On most queries, DB hits were much better with the second option (USING INDEX), and rows were the same or fewer, but the timing does not seem reliable: on several queries, adding USING INDEX was slower despite the better performance metrics (DB hits and rows), and times got much better when re-executing a query.
In order to stop the cache from interfering, I went to the properties file, changed cache_type in neo4j.properties to none and restarted Neo4j, but it still seems like the results of the same query come back faster each time (up to a certain point).
What will be the best way to test it?
Neo4j has (up to 2.2.x) a two-layered cache architecture. With cache_type=none you switch off just the object cache. To disable the page cache, you can use dbms.pagecache.memory=0. However, if all caches are disabled you basically measure the speed of your IO subsystem, since every query goes down to the bare metal and reads from disk.
I recommend a different approach: enable both caches and run the queries you want to compare multiple times to warm up the caches. Take measurements on a warmed cache, since this is much closer to a real production scenario.
On a side note: in Neo4j 2.3 the object cache will go away and we just have the page cache.
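A sketch of that "measure on a warmed cache" approach, assuming the official neo4j Python driver with a hypothetical Bolt URI, credentials, and query (the 2.x versions discussed here predate Bolt, but the measurement idea is the same): run the query several times, discard the first run(s), and compare the warmed timings.

```python
import time
from neo4j import GraphDatabase

# Hypothetical connection details and query.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "secret"))
QUERY = "MATCH (p:Person {name: $name})-[:KNOWS]->(f) RETURN count(f)"

def run_once(session):
    start = time.perf_counter()
    session.run(QUERY, name="Alice").consume()   # consume() drains the result
    return (time.perf_counter() - start) * 1000  # milliseconds

with driver.session() as session:
    timings = [run_once(session) for _ in range(10)]

print("cold run:       %.1f ms" % timings[0])
# Ignore the first few runs; they include cache warm-up.
warm = timings[3:]
print("warmed average: %.1f ms" % (sum(warm) / len(warm)))
driver.close()
```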

Elasticsearch - queries throttle cpu

Recently our cluster has seen extreme performance degradation. We had 3 nodes, 64 GB RAM, 4 CPUs (2 cores) each, for an index that is 250M records and 60 GB in size. Performance was acceptable for months.
Since then we've:
1. Added a fourth server, same configuration.
2. Split the index into two indexes, query them with an alias
3. Disabled paging (Windows Server 2012)
4. Added synonym analysis on one field
Our cluster can now survive for a few hours before it becomes basically useless. I have to restart Elasticsearch on each node to rectify the problem. We tried bumping each node to 8 CPUs (2 cores) with little to no gain.
One issue is that EVERY QUERY uses up 100% of the CPU of whatever node it hits. Every query is faceted on 3+ fields, which hasn't changed since our cluster was healthy. Unfortunately I'm not sure if this was happening before, but it certainly seems like an issue. We obviously need to be able to respond to more than one request every few seconds. When multiple requests come in at the same time, the performance doesn't seem to get worse for those particular responses. Again, over time the performance slows to a crawl; the CPU (all cores) stays maxed out indefinitely.
I'm using Elasticsearch 1.3.4 and the elasticsearch-analysis-phonetic 2.3.0 plugin on every box, and had been even when our performance wasn't so terrible.
Any ideas?
UPDATE:
It seems like the performance issue is due to index aliasing. When I pointed the site to a single index that ultimately stores about 80% of the data, the CPU wasn't being throttled. There were still a few 100% spikes, but they were much shorter. When I pointed it back to the alias (which points to two indexes in total), I could literally bring the cluster down by refreshing the page a dozen times quickly: CPU usage goes to 100% on every query and gets stuck there with many in a row.
Is there a known issue with Elasticsearch aliases? Am I using the alias incorrectly?
UPDATE 2:
Found the cause in the logs. Paging queries are TERRIBLE. Is this a known bug in Elasticsearch? If I run an empty query and then try to view the last page (from 100,000,000, e.g.), it brings the whole cluster down. That SINGLE QUERY. It gets through the first 1.5M results and then quits, all the while taking up 100% of the CPU for over a minute.
UPDATE 3:
So here's something else strange. Pointing to an old index on dev (same size, no aliases) and trying to reproduce the paging issue, the cluster doesn't get hit immediately. It sits at 1% CPU usage for the first 20 seconds after the query. The query returns with an error before the CPU usage ever goes up. About 2 minutes later, CPU usage spikes to 100% and the server basically crashes (it can't do anything else because the CPU is so overtaxed). On the production index this CPU load is instantaneous (it happens immediately after a query is made).
Without checking certain metrics it is very difficult to identify the cause of slow responses or any other issue. But from the data you have mentioned, it looks like there are too many cache evictions happening, thereby increasing the number of garbage collections on your nodes. Frequent garbage collection (mainly old-generation GC) consumes a lot of CPU, which in turn starts to affect the whole cluster.
As you have mentioned, it started giving issues only after you added another node. This surprises me. Is there any increase in the traffic?
Can you include the output of the _stats API taken at the time when your cluster slows down? It will have a lot of information from which I can make a better diagnosis. Also include a sample of the query.
I suggest you install bigdesk so that you can have a graphical view of your cluster health more easily.
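To check the cache-eviction / garbage-collection theory yourself, something like the following could help (a sketch in Python with requests, assuming a hypothetical node address; the exact field names in the stats response vary a bit between Elasticsearch versions):

```python
import requests

ES = "http://localhost:9200"   # hypothetical node address

stats = requests.get(f"{ES}/_nodes/stats/indices,jvm").json()

for node_id, node in stats["nodes"].items():
    name = node.get("name", node_id)
    fielddata = node["indices"]["fielddata"]
    gc = node["jvm"]["gc"]["collectors"]

    print(name)
    print("  fielddata size:      ", fielddata.get("memory_size_in_bytes"))
    print("  fielddata evictions: ", fielddata.get("evictions"))
    # A steadily climbing old-generation GC count/time points at memory pressure.
    for collector, data in gc.items():
        print("  gc %-6s count=%-6d time_ms=%d"
              % (collector, data["collection_count"], data["collection_time_in_millis"]))
```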

How to index records in Solr faster (and not impact ColdFusion web server)? Two JVM?

I have a 64-bit server, 8 GB RAM, dual quad-core CPUs. No resources are ever hitting 100% (except, I guess, the JVM -- right?).
I need to index several million records for Solr, but the machine is in production. I recognize having a second machine for indexing would be helpful.
Should I dedicate a second instance of the JVM, dedicated to Solr?
Right now, when I run an index, pages which are normally served in 200 milliseconds take about 1.5 seconds, sometimes more... even hitting the dreaded "Service is Unavailable" error.
I adjusted my JVM Heap as follows:
-Xmx1024m
-XX:MaxPermSize=256m
In case I'm chasing the wrong solution, allow me to broaden the landscape a bit. It seems that I can't affect the indexing speed of Solr. I had previously been indexing about 150,000 records per hour on a dev server virtualized on a workstation. In a production environment with much more hardware available, I'm indexing at the exact same speed.
Without data to prove it, I think that my JVM adjustments did not speed up the indexing, although they may have allowed the CF server to continue serving pages. I must say the indexing speed is terribly slow, but I do know that it's not a function of the data access layer. I rewrote it from pure ORM to objects backed by SQL stored procedures, thinking that was the slowdown (no effect).
Use a separate instance for indexing. The only trick is getting the running search instance to re-read the updated index, in which case you set up a master (the indexer) and a slave (the searcher) and do replication. This will both keep the searcher from being interrupted and let the indexer utilize its own JVM, including its own share of the resources.
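As a sketch of how that master/slave split can be driven over HTTP once the ReplicationHandler is configured on both cores (Python with requests; the host names and core name here are hypothetical):

```python
import requests

MASTER = "http://indexer-host:8983/solr/mycore"    # hypothetical indexing instance
SLAVE = "http://searcher-host:8983/solr/mycore"    # hypothetical search instance

# Ask the master which index version it is currently serving.
master_version = requests.get(
    f"{MASTER}/replication", params={"command": "indexversion", "wt": "json"}
).json()
print("master index version:", master_version.get("indexversion"))

# Tell the slave to pull the latest index from the master. With a
# pollInterval configured this happens automatically; fetchindex just
# forces it, e.g. right after a big batch of indexing finishes.
resp = requests.get(
    f"{SLAVE}/replication", params={"command": "fetchindex", "wt": "json"}
).json()
print("fetchindex status:", resp.get("status"))
```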
Have you tried these optimization tips?
http://bloggeraroundthecorner.blogspot.com/2009/08/tuning-coldfusion-solr-part-1.html
http://bloggeraroundthecorner.blogspot.com/2009/08/tuning-coldfusion-solr-part-2.html
http://bytestopshere.com/post.cfm/lessons-learned-moving-from-verity-to-solr-part-1
