Clearing cache for performance testing in MongoDB? - performance

I know when testing performance of queries in relational databases you typically clear the cache before you run the query to get a real idea of what performance will be like the first time the data is queried. Is there something similar in MongoDB?
I found some query plan cache clearing stuff, but nothing that would clear cached data. Is there such a thing in MongoDB? If so, can someone point me toward some documentation?

This solution assumes you are on a Unix-like system.
I haven't done a lot of DBA of MongoDB, but I believe that this video contains the answer for clearing cached data.
The relevant portion of the video coming at about 4:57 in the video and the command is:
sudo echo 1 > /proc/sys/vm/drop_caches
As the video explains, using 1 should be sufficient for most cases, but if you want to make absolutely sure, you can use 3. And if you (like the instructor from the MongoDB class) find that command to be "inelegant", you can use:
sudo sysctl -w vm.drop_caches=1
There are other videos in that lecture series which introduce and demonstrate "pre-heating data" to get the working set into memory before having your application access the machine so that users don't experience the poor performance of reading from disk.

Related

How do I setup esrally to use with elassandra and my own tests?

I'm wondering whether others have attempt benchmarking Elassandra (more specifically, I'm using express-cassandra) using esrally. I'm hoping to not spend to much more time on esrally if that's not a good solution to test Elassandra.
Reading the documentation, it looks like Rally is capable of starting from scratch: Download Elasticsearch, install the source, build it, run it, connect, create a full schema, then start testing with data filling the schema (possibly done with some random data), do queries, ...
I already have everything in place and the only thing I really want to see a few things such as:
Which of 10 different memory setup is faster.
Which type of searches work, whether option 1, 2 and 3 from my existing software create drastic slow downs or not...
Whether insertion while doing searches have a effects on the speed of my searches.
I'm not going to change many parameters other than the memory (-Xmx, -Xms, maybe some others... like cached row in a separate heap.) For sure, I want to run all the tests with the latest Elassandra and not consider rebuilding or anything of the sort.
From reading the documentation, there is no mention of Elassandra. I found total of TWO PAGES in Google about testing with esrally and Elassandra and that did not boost my confidence that it's doable...
I would imagine that I have to use the benchmark-only pipeline. That at least removes all the gathering of the source, building, etc. I guess it also reduces the number of parameters I get in the resulting benchmark, but I don't need all the details...
Have you had any experience with such a setup? (Elassandra + esrally)
Yes, esrally works with Elassandra by using the --benchmark-only option.
To automate the creation of elassandra clusters to benchmark, you could either use ecm or k8s helm chart.
For instance, using ccm :
ecm create bench_cluster -v 6.2.3.10 -n 3 -s -e
esrally --pipeline=benchmark-only --target hosts=127.0.0.1:9200,127.0.0.2:9200,127.0.0.3:9200
ecm remove bench_cluster
For testing specific scenarios, you can write custom tracks.

How do I get Sphinx stats?

I am new to Sphinx and want to make it index a 2 million row table (~1.5GB in size). I will use plain indexes.
At the moment, I don't know how much memory should I put in the mem_limit config. My idea is that I could simply keep the default, and then I could see how many results are being swapped (stay in disk) or expired (how frequently used results in memory go to disk).
I'm not sure yet exactly how Sphinx works, anyway, but this is my understanding for now. However, how can I see stats like these, just like we can see the STATS for Memcached?
Having some kind of stats would definitely help me know how to better tune Sphinx for my application.
In case it's relevant, I use MariaDB and PHP on CentOS.
In case its not clear mem_limit is ONLY used by the indexer program. For use during creating the index.
... frankly the setting isnt all that critical. Just set it as high as you have available memory.
It's not applicable to searchd which actually answers queries.
There is 'SHOW STATUS' command, but it doesnt really have anything about menory
http://sphinxsearch.com/docs/current.html#sphinxql-show-status
... memory usage (and there are no variables to control it!) can be got from general OS commands. On linux for example, possibly something like memstat

Tabular Model Not Caching Results

Can anyone, someone point me in the direction of how to troubleshoot why a Tabular model that I have built does not seem to want to cache query results?
It is my understanding that MDX queries to Tabular model will be cached, however with our model they never seem to be! And I can't figure out why..
My best guess is that it's memory pressure, and the system is clearing down the RAM, but even that is a guess..
Are there any counters, DMVs, or other perfmon stats etc that i can use to actually see what is going on and check?
Thanks.
Plenty of places to look, but I'd recommend starting with a Profiler/xEvent trace. Below is an example of 2 runs of the same MDX query.
The first run is on a cold-cache...
The second run is on a warm-cache and you can see that it is resolving the query from cache...
This is much easier to see if you can isolate the query on non-production server (e.g. test/dev environment). There are quite a few reasons why a particular query may not be taking advantage of the cache...but you need to first confirm that it is not using the cache.

How to deactivate safe mode in the mongo shell?

Short question is on the title: I work with my mongo Shell wich is in safe mode by default, and I want to gain better performance by deactivating this behaviour.
Long Question for those willing to know the context:
I am working on a huge set of data like
{
_id:ObjectId("azertyuiopqsdfghjkl"),
stringdate:"2008-03-08 06:36:00"
}
and some other fields and there are about 250M documents like that (whole database with the indexes weights 36Go). I want to convert the date in a real ISODATE field. I searched a bit how I could make an update query like
db.data.update({},{$set:{date:new Date("$stringdate")}},{multi:true})
but did not find how to make this work and resolved myself to make a script that take the documents one after the other and make an update to set a new field which takes the new Date(stringdate) as its value. The query use the _id so the default index is used.
Problem is that it takes a very long time. I already figured out that if only I had inserted empty dates object when I created the database I would now get better performances since there is the problem of data relocation when a new field is added. I also set an index on a relevant field to process the database chunk by chunk. Finally I ran several concurrent mongo clients on both the server and my workstation to ensure that the limitant factor is the database lock availability and not any other factor like cpu or network costs.
I monitored the whole thing with mongotop, mongostats and the web monitoring interfaces which confirmed that write lock is taken 70% of the time. I am a bit disappointed mongodb does not have a more precise granularity on its write lock, why not allowing concurrent write operations on the same collection as long as there is no risk of interference? Now that I think about it I should have sharded the collection on a dozen shards even while staying on the same server, because there would have been individual locks on each shard.
But since I can't do a thing right now to the current database structure, I searched how to improve performance to at least spend 90% of my time writing in mongo (from 70% currently), and I figured out that since I ran my script in the default mongo shell, every time I make an update, there is also a getLastError() which is called afterwards and I don't want it because there is a 99.99% chance of success and even in case of failure I can still make an aggregation request after the end of the big process to retrieve the single exceptions.
I don't think I would gain so much performance by deactivating the getLastError calls, but I think itis worth trying.
I took a look at the documentation and found confirmation of the default behavior, but not the procedure for changing it. Any suggestion?
I work with my mongo Shell wich is in safe mode by default, and I want to gain better performance by deactivating this behaviour.
You can use db.getLastError({w:0}) ( http://docs.mongodb.org/manual/reference/method/db.getLastError/ ) to do what you want but it won't help.
This is because for one:
make a script that take the documents one after the other and make an update to set a new field which takes the new Date(stringdate) as its value.
When using the shell in a non-interactive mode like within a loop it doesn't actually call getLastError(). As such downing your write concern to 0 will do nothing.
I already figured out that if only I had inserted empty dates object when I created the database I would now get better performances since there is the problem of data relocation when a new field is added.
I did tell people when they asked about this stuff to add those fields incase of movement but instead they listened to the guy who said "leave them out! They use space!".
I shouldn't feel smug but I do. That's an unfortunately side effect of being right when you were told you were wrong.
mongostats and the web monitoring interfaces which confirmed that write lock is taken 70% of the time
That's because of all the movement in your documents, kinda hard to fix that.
I am a bit disappointed mongodb does not have a more precise granularity on its write lock
The write lock doesn't actually denote the concurrency of MongoDB, this is another common misconception that stems from the transactional SQL technologies.
Write locks in MongoDB are mutexs for one.
Not only that but there are numerous rules which dictate that operations will subside to queued operations under certain circumstances, one being how many operations waiting, another being whether the data is in RAM or not, and more.
Unfortunately I believe you have got yourself stuck in between a rock and hard place and there is no easy way out. This does happen.

Cache like mechanism (which Data Structure)?

I am fetching some questions from the server (database) and showing it to client (user) in the browser. The client will answer the question and based on his/her answer the next set of questions will be fetched from the database. Now, I want to pre-fetch the next set of questions while the user read the present question so that the waiting time for user to see the next question will be shorter.
My questions is, how to store the pre-fetched questions i.e. which data structure should I use to store the pre-fetched questions in the memory so that I can get better performance? I want a "cache" type of thing. Also once the user hit any question from the cache the question won't be there any more.
PS: Each question has unique Id.
Thanks
Naveen
There are multiple options to go about it. One that makes a big difference, one that makes little.
Little difference would be to fetch questions and store it in user's session. It's basically depends on where your session is stored, could also be database, or a file. This only makes sense if your db tables are very denormalized and it requires lots of joins to get the answer. I doubt that's the case so this won't make much difference for the user no matter which data structure used.
Big difference would make prefetching them with AJAX using javascript straight into the browser. In this case a simple array would suffice. JS gives you flexibility to build any objects with any properties, anything would be good enough. So write a poller in JS which fetches the questions from server while user is looking at the question, return them using JSON for example. JSON will become a simple object. Since each user stores only a couple of questions prefetched in their browser particular data structure choice won't make a difference here either.
Try using LinkedHashMap as You will have LRU algorithm implemented quickly with good performance.
Read this link as well :
LinkedHashMap as cache
First a few questions to adapt to your context :
assuming you use Java ?
using Hibernate also ?
If you want to prefetch in the server, many caching solutions exists.
Taking into account your unique id (see PS), if this ID is database related and you are using Hibernate, the easiest solution would be to configure the Hibernate second-level cache for that entity. Then, your only code would be to run the query in advance....
If theses requisites do not fit, I used EhCache as the caching solutions.
Somehow easy to start using, and it has plenty of features available when you later need them.

Resources