I am new to Sphinx and want to make it index a 2 million row table (~1.5GB in size). I will use plain indexes.
At the moment I don't know how much memory I should put in the mem_limit config. My idea is that I could simply keep the default and then see how many results are being swapped (stay on disk) or expired (how often results kept in memory get pushed out to disk).
I'm not sure exactly how Sphinx works yet, but that is my understanding for now. How can I see stats like these, the same way we can see STATS for Memcached?
Having some kind of stats would definitely help me know how to better tune Sphinx for my application.
In case it's relevant, I use MariaDB and PHP on CentOS.
In case it's not clear, mem_limit is ONLY used by the indexer program, i.e. while creating the index.
... frankly the setting isn't all that critical. Just set it as high as you have memory available.
It's not applicable to searchd, which actually answers queries.
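For reference, a minimal sketch of where that setting lives in sphinx.conf (the 1024M value is only an illustration, not a recommendation for your table):

indexer
{
    # RAM the indexer may use while building plain indexes.
    # Raising it toward your free memory mostly just speeds up the build;
    # it has no effect on searchd's memory use at query time.
    mem_limit = 1024M
}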
There is a 'SHOW STATUS' command, but it doesn't really have anything about memory:
http://sphinxsearch.com/docs/current.html#sphinxql-show-status
... memory usage (and there are no variables to control it!) can be obtained from general OS commands. On Linux, for example, possibly something like memstat.
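For example (the host and port below are the usual SphinxQL defaults and may differ on your install), you can run SHOW STATUS through any MySQL client and check searchd's resident memory with plain process tools:

# SphinxQL speaks the MySQL protocol, on port 9306 by default
mysql -h 127.0.0.1 -P 9306 -e "SHOW STATUS"

# Resident and virtual memory of the searchd process
ps -o pid,rss,vsz,cmd -C searchd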
Can anyone point me in the direction of how to troubleshoot why a Tabular model that I have built does not seem to want to cache query results?
It is my understanding that MDX queries to a Tabular model will be cached; however, with our model they never seem to be, and I can't figure out why.
My best guess is that it's memory pressure and the system is clearing down the RAM, but even that is a guess.
Are there any counters, DMVs, or other perfmon stats etc. that I can use to actually see what is going on and check?
Thanks.
Plenty of places to look, but I'd recommend starting with a Profiler/xEvent trace. Below is an example of 2 runs of the same MDX query.
The first run is on a cold-cache...
The second run is on a warm-cache and you can see that it is resolving the query from cache...
This is much easier to see if you can isolate the query on a non-production server (e.g. a test/dev environment). There are quite a few reasons why a particular query may not be taking advantage of the cache...but you first need to confirm that it is not using the cache.
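If it helps with that comparison, the usual way to force a cold cache between runs is an XMLA ClearCache command, which you can execute from an XMLA query window in SSMS (the DatabaseID below is a placeholder for your Tabular database's actual ID):

<ClearCache xmlns="http://schemas.microsoft.com/analysisservices/2003/engine">
  <Object>
    <DatabaseID>YourTabularDatabase</DatabaseID>
  </Object>
</ClearCache>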
I know that when testing the performance of queries in relational databases, you typically clear the cache before running the query to get a real idea of what performance will be like the first time the data is queried. Is there something similar in MongoDB?
I found some query plan cache clearing stuff, but nothing that would clear cached data. Is there such a thing in MongoDB? If so, can someone point me toward some documentation?
This solution assumes you are on a Unix-like system.
I haven't done a lot of MongoDB DBA work, but I believe this video contains the answer for clearing cached data.
The relevant portion comes at about 4:57 in the video, and the command is:
sudo echo 1 > /proc/sys/vm/drop_caches
(Note that sudo applies to echo but not to the redirection, so in practice this form needs a root shell, e.g. sudo sh -c 'echo 1 > /proc/sys/vm/drop_caches'.)
As the video explains, using 1 should be sufficient for most cases, but if you want to make absolutely sure, you can use 3. And if you (like the instructor from the MongoDB class) find that command to be "inelegant", you can use:
sudo sysctl -w vm.drop_caches=1
There are other videos in that lecture series which introduce and demonstrate "pre-heating data" to get the working set into memory before having your application access the machine so that users don't experience the poor performance of reading from disk.
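As a rough illustration of that pre-heating idea (the collection name is made up, and the touch command is available from MongoDB 2.2 on), you can either walk the data once or ask the server to load it:

// Warm the working set by scanning the collection once (a no-op per document;
// the point is just to fault the pages into RAM)
db.mydata.find().forEach(function(doc) { });

// Or, on MongoDB 2.2+, ask the server to page data and indexes into memory
db.runCommand({ touch: "mydata", data: true, index: true })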
In a Rails 3.2.x app (Ruby 1.9.3), using (Re)tire to access an ES cluster, a rake task is going through approximately 1M rows to create a new index.
The task is using .to_json with specific attributes and methods listed to limit the resulting hash for each element.
Yet as the task runs, memory is eaten away, and the process usually ends up being killed by the system.
The task is already using find_in_batches. Smaller batch sizes (using find_each) don't help.
checking without index
Removing the index.import call does improve things (obviously): the task goes through the whole collection very fast without a problem. This points to either ES, Tire, or the JSON conversion (and the relations it might call upon).
reducing the scope of the task
Adding back index.import and passing a very limited hash (with string keys) for each item does make things slower, but not too much, and does not eat memory away. So JSON might not be the culprit here.
adding attributes and methods back
The culprit seems to be one of the methods used to grab one of the additional attributes. It's based on a relation of the model and another ... ending up with a lot of models being involved and sifted through.
As pointed out in Index the results of a method in ElasticSearch (Tire + ActiveRecord), adding includes does help a bit, but the task still ends up heavy.
going around
I also tried to work around part of the problem and replace the calls to Tire with the ES bulk API.
Generating JSON files and sending them with a Ruby HTTP lib can work. Yet the same memory problem arises, since the same requests are made to the DB.
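For context, the bulk API expects newline-delimited action/source pairs POSTed to _bulk; a rough sketch under ES 0.90-era assumptions (the index and type names are made up):

require 'net/http'
require 'json'

# docs is an array of hashes already shaped the way they should be indexed
def bulk_index(docs)
  body = docs.map { |d|
    [{ :index => { :_index => 'models_v2', :_type => 'model', :_id => d[:id] } }.to_json,
     d.to_json]
  }.flatten.join("\n") + "\n"
  Net::HTTP.new('localhost', 9200).post('/_bulk', body)
end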
What's left?
What I don't get is why, even with find_in_batches, Ruby keeps eating away memory. I would expect that after each batch of data, the memory related to that batch would be freed.
Next to try: GC.start calls, ActiveRecord caching deactivation around the task.
Yet, unless a solution limits memory use drastically (300 or 500 MB instead of 800+), the underlying issue remains: indexing a lot of instances of a model, including data related to some other models.
Am I missing something about the import and includes that would solve the issue?
Would splitting the task into smaller background jobs (Resque, Sidekiq) help? I would suppose so, as each batch would be isolated from the others and, once processed, would really free up its memory (?) (orchestrating those jobs would be another headache).
Are there good practices for indexing large quantities of data into ES?
I've been using Rails + Elasticsearch for a while and have done this kind of dance a few times.
A few things come to mind, in no particular order.
Did you try the recent elasticsearch gem (instead of tire)? I've updated my apps to use it and I like having more control over what is done.
I would also try to force a GC sweep after each ActiveRecord loop. You could also be extra careful with memory allocation by explicitly resetting all local variables each time.
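A minimal sketch of that per-batch GC idea (the model, association, and index names are placeholders for your own):

# Hypothetical names; the point is the per-batch import followed by an explicit sweep
index = Tire.index('models_v2')
Model.includes(:related_things).find_in_batches(:batch_size => 500) do |batch|
  index.import(batch)
  batch = nil        # drop the local reference...
  GC.start           # ...and force a sweep before loading the next batch
end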
You could use the fork & exec trick to fork a brand new process for each loop; it would be the most effective GC you can get. It's a little overhead the first time you write it, but the pay-off is great. Take good care to limit the amount of memory used in the outer part of the task. Using a process-based background job system would partly achieve the same goal, but you might still get memory bloat.
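A sketch of that fork variant, reusing the hypothetical names from above; each batch is indexed in a short-lived child process, so its memory goes back to the OS when the child exits:

index = Tire.index('models_v2')
Model.select(:id).find_in_batches(:batch_size => 1000) do |batch|
  ids = batch.map(&:id)
  pid = fork do
    ActiveRecord::Base.establish_connection   # the child needs its own DB connection
    index.import(Model.includes(:related_things).where(:id => ids))
    exit!(0)                                  # skip at_exit hooks, just terminate
  end
  Process.wait(pid)
end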
Can you limit the use of ActiveRecord? If you only need some basic associations, you could use a lower-level/simpler tool like Sequel (or similar) and work with Ruby hashes/arrays instead of full-fledged AR models.
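And a sketch of that lighter-weight route with Sequel (table and column names are made up), streaming plain hashes instead of AR models:

require 'sequel'
require 'json'

DB = Sequel.connect(ENV['DATABASE_URL'])   # reuse your existing connection settings

# Stream rows as plain hashes; no AR models, no association loading
DB[:models].select(:id, :title, :related_id).order(:id).paged_each(:rows_per_fetch => 500) do |row|
  # row is a Hash, e.g. {:id=>1, :title=>"...", :related_id=>7};
  # turn it into a bulk-API line here and buffer/POST it yourself
  puts row.to_json
end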
The short question is in the title: I work with the mongo shell, which is in safe mode by default, and I want to gain better performance by deactivating this behaviour.
Long question, for those who want the context:
I am working on a huge set of data like
{
    _id: ObjectId("azertyuiopqsdfghjkl"),
    stringdate: "2008-03-08 06:36:00"
}
plus some other fields; there are about 250M documents like that (the whole database with its indexes weighs 36 GB). I want to convert the date into a real ISODate field. I searched a bit for how I could make an update query like
db.data.update({},{$set:{date:new Date("$stringdate")}},{multi:true})
but did not find how to make this work, so I resigned myself to writing a script that takes the documents one after the other and issues an update setting a new field whose value is new Date(stringdate). The query uses the _id, so the default index is used.
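For illustration, the per-document version of such a script looks roughly like this (the $exists filter is only there so that re-processing moved documents is harmless):

// Walk the collection and set a real ISODate computed from stringdate
db.data.find({ date: { $exists: false } }, { stringdate: 1 }).forEach(function(doc) {
    db.data.update(
        { _id: doc._id },
        { $set: { date: new Date(doc.stringdate) } }
    );
});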
The problem is that it takes a very long time. I have already figured out that if only I had inserted empty date objects when I created the database, I would be getting better performance now, because of the data relocation that happens when a new field is added. I also set an index on a relevant field so I can process the database chunk by chunk. Finally, I ran several concurrent mongo clients on both the server and my workstation to make sure the limiting factor is database lock availability and not something else like CPU or network cost.
I monitored the whole thing with mongotop, mongostat and the web monitoring interfaces, which confirmed that the write lock is held 70% of the time. I am a bit disappointed that MongoDB does not have finer granularity on its write lock; why not allow concurrent write operations on the same collection as long as there is no risk of interference? Now that I think about it, I should have sharded the collection into a dozen shards, even while staying on the same server, because there would have been an individual lock on each shard.
But since I can't change the current database structure right now, I looked into how to improve performance so that I spend at least 90% of my time writing to mongo (up from 70% currently). I figured out that since I run my script in the default mongo shell, every time I issue an update there is also a getLastError() call afterwards, which I don't want: there is a 99.99% chance of success, and even in case of failure I can still run an aggregation query after the end of the big process to retrieve the few exceptions.
I don't think I will gain that much performance by deactivating the getLastError calls, but I think it is worth trying.
I took a look at the documentation and found confirmation of the default behavior, but not the procedure for changing it. Any suggestion?
I work with the mongo shell, which is in safe mode by default, and I want to gain better performance by deactivating this behaviour.
You can use db.getLastError({w:0}) ( http://docs.mongodb.org/manual/reference/method/db.getLastError/ ) to do what you want, but it won't help.
This is because, for one:
make a script that takes the documents one after the other and issues an update setting a new field whose value is new Date(stringdate)
When the shell is used in a non-interactive mode, such as within a loop, it doesn't actually call getLastError() after each write. As such, lowering your write concern to 0 will do nothing.
I have already figured out that if only I had inserted empty date objects when I created the database, I would be getting better performance now, because of the data relocation that happens when a new field is added.
I did tell people, when they asked about this stuff, to add those fields in case of movement, but instead they listened to the guy who said "leave them out! They use space!".
I shouldn't feel smug, but I do. That's an unfortunate side effect of being right when you were told you were wrong.
mongostat and the web monitoring interfaces, which confirmed that the write lock is held 70% of the time
That's because of all the movement in your documents; it's kinda hard to fix that.
I am a bit disappointed that MongoDB does not have finer granularity on its write lock
The write lock doesn't actually define MongoDB's concurrency; this is another common misconception that stems from transactional SQL technologies.
Write locks in MongoDB are mutexes, for one.
Not only that, but there are numerous rules which dictate that operations will yield to queued operations under certain circumstances: one being how many operations are waiting, another being whether the data is in RAM or not, and more.
Unfortunately, I believe you have got yourself stuck between a rock and a hard place, and there is no easy way out. This does happen.
I'm running MongoDB (2.2) on Linux, and I have a few questions.
I have a schema with many fields + sub-fields and one index over these fields.
1. How fast are updates/deletes done on the index? I have about 3 updates/deletes etc. a second.
2. Is there a rule, like after 10,000 updates you have to compact or rebuild the index?
3. Are changes to the fields immediately visible in the index? If not, is there a delay or a temporary table for these updates/deletes?
Thanks in advance - Brandon
1. Indexes are updated at the time of the insert/update/remove. As for performance, the best answer is to just test it.
2. Not that I know of. If you need to do regular compaction or repair you should have replication too (but you can have it on the same host if resources permit).
3. Yes (well, on the same DB connection; on another it might take a bit more time. But if you're running into that problem I'm not the right person to answer you anyway ;)
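If you want to see for yourself that the index is used and up to date right after a write, explain() in the shell shows it (collection and field names below are made up; on 2.2 an indexed query reports a BtreeCursor):

db.things.update({ sku: "a-1" }, { $set: { qty: 5 } });
db.things.find({ sku: "a-1" }).explain();
// with an index on { sku: 1 }, the 2.2-era output includes
// "cursor" : "BtreeCursor sku_1" and a low "nscanned"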
Having said that, I strongly suggest you take a look at some of the presentations at http://www.10gen.com/presentations - I'm sorry I can't point out the ones that were particularly interesting and usable; I suggest you browse and pick the ones that seem interesting to you.
Note that MongoDB does things VERY differently and has quite a few gotchas for the unprepared. It is, however, a great DB once you know how to use it.