Solr Caching Update on Writes

I've been looking at potential ways to speed up Solr queries for an application I'm working on. I've read about Solr caching (https://wiki.apache.org/solr/SolrCaching), and I think the filter and query caches may be of some help. The application's config does set up these caches, but apparently with default settings that were never experimented with, and our cache hit rate is relatively low.
One detail I've not been able to determine is how the caches deal with updates. If I update records in a way that would add or remove entries in the query or filter caches, do the caches update in a performant way? The application is fairly write-heavy, so how well the caches cope with updates will probably determine whether tuning them is worth the effort.

The short answer is that an update (add, edit, or delete) on your index followed by a commit operation opens a new searcher over the updated index and replaces the current one. Since caches are associated with a specific searcher, they are discarded when that searcher is replaced. If autowarming is enabled, the caches of the new searcher are primed with recent queries or with queries that you specify.
However, this is Solr that we're talking about and there are usually multiple ways to handle any situation. That is definitely the case here. The commit operation mentioned above is known as a hard commit and may or may not be happening depending on your Solr configuration and how your applications interact with it. There's another option known as a soft commit that I believe would be a good choice for your index. Here's the difference...
A hard commit means that the index is rebuilt and then persisted to disk. This ensures that changes are not lost, but is an expensive operation.
A soft commit means that the index is updated in memory and not persisted to disk. This is a far less expensive operation, but data could conceivably be lost if Solr is halted unexpectedly.
Going a step further, Solr has two nifty settings known as autoCommit and autoSoftCommit which I highly recommend. You should disable all hard commit operations in your application code if you enable auto commit. The autoCommit setting can specify a period of time to queue up document changes (maxTime) and/or the number of changes to allow in the queue (maxDocs). When either of these limits is reached, a hard commit is performed. The autoSoftCommit setting works the same way, but results in (you guessed it) a soft commit. Solr's documentation on UpdateHandlers is a good starting point to learn about this.
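For illustration only, these settings live in the updateHandler section of solrconfig.xml and might look something like this (the numbers are placeholders to experiment with, not recommendations):
<autoCommit>
  <maxTime>60000</maxTime>            <!-- hard commit at most every 60 seconds -->
  <maxDocs>10000</maxDocs>            <!-- or once 10,000 document changes are queued -->
  <openSearcher>false</openSearcher>  <!-- persist to disk without opening a new searcher -->
</autoCommit>
<autoSoftCommit>
  <maxTime>5000</maxTime>             <!-- soft commit (new searcher, fresh caches) every 5 seconds -->
</autoSoftCommit>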
These settings effectively make it possible to do batch updates instead of one at a time. In a write-heavy application such as yours, this is definitely a good idea. The optimal settings will depend upon the frequency of reads vs writes and, of course, the business requirements of the application. If near-real-time (NRT) search is a requirement, you may want autoSoftCommit set to a few seconds. If it's acceptable for search results to be a bit stale, then you should consider setting autoSoftCommit to a minute or even a few minutes. The autoCommit setting is usually set much higher as its primary function is data integrity and persistence.
I recommend a lot of testing in a non-production environment to decide upon reasonable caching and commit settings for your application. Given that your application is write-heavy, I would lean toward conservative cache settings and you may want to disable autowarming completely. You should also monitor cache statistics in production and reduce the size of caches with low hit rates. And, of course, keep in mind that your optimal settings will be a moving target, so you should review them periodically and make adjustments when needed.
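For reference, the caches themselves are also defined in solrconfig.xml; here is a sketch of conservative settings with autowarming disabled via autowarmCount="0" (the sizes are arbitrary starting points to test against your own hit rates):
<filterCache class="solr.FastLRUCache" size="512" initialSize="512" autowarmCount="0"/>
<queryResultCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="0"/>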
On a related note, the Seven Deadly Sins of Solr is a great read and relevant to the topic at hand. Best of luck and have fun with Solr!

Related

How to deactivate safe mode in the mongo shell?

The short question is in the title: I work with my mongo shell, which is in safe mode by default, and I want to gain better performance by deactivating this behaviour.
The long question, for those willing to know the context:
I am working on a huge set of data like
{
    _id: ObjectId("azertyuiopqsdfghjkl"),
    stringdate: "2008-03-08 06:36:00"
}
and some other fields; there are about 250M documents like that (the whole database with its indexes weighs 36 GB). I want to convert the date into a real ISODate field. I searched a bit for how I could make an update query like
db.data.update({},{$set:{date:new Date("$stringdate")}},{multi:true})
but did not find how to make this work, so I resigned myself to writing a script that takes the documents one after the other and issues an update to set a new field whose value is new Date(stringdate). The query uses the _id, so the default index is used.
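A simplified sketch of that per-document approach (not the exact script, just its shape):
db.data.find({ date: { $exists: false } }, { stringdate: 1 }).forEach(function (doc) {
    // one update per document, matched on _id so the default index is used;
    // the $exists filter skips documents already converted, so the script can resume
    db.data.update(
        { _id: doc._id },
        { $set: { date: new Date(doc.stringdate.replace(" ", "T")) } }
    );
});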
Problem is that it takes a very long time. I already figured out that if only I had inserted empty dates object when I created the database I would now get better performances since there is the problem of data relocation when a new field is added. I also set an index on a relevant field to process the database chunk by chunk. Finally I ran several concurrent mongo clients on both the server and my workstation to ensure that the limitant factor is the database lock availability and not any other factor like cpu or network costs.
I monitored the whole thing with mongotop, mongostat and the web monitoring interfaces, which confirmed that the write lock is taken 70% of the time. I am a bit disappointed that mongodb does not have a more precise granularity on its write lock; why not allow concurrent write operations on the same collection as long as there is no risk of interference? Now that I think about it, I should have sharded the collection into a dozen shards, even while staying on the same server, because there would have been individual locks on each shard.
But since I can't change the current database structure right now, I looked for ways to improve performance enough to spend at least 90% of my time writing to mongo (up from 70% currently). I figured out that since I run my script in the default mongo shell, every time I make an update there is also a getLastError() call afterwards, which I don't want: there is a 99.99% chance of success, and even in case of failure I can still run an aggregation request after the end of the big process to retrieve the few exceptions.
I don't think I would gain that much performance by deactivating the getLastError calls, but I think it is worth trying.
I took a look at the documentation and found confirmation of the default behavior, but not the procedure for changing it. Any suggestions?
I work with my mongo shell, which is in safe mode by default, and I want to gain better performance by deactivating this behaviour.
You can use db.getLastError({w:0}) (http://docs.mongodb.org/manual/reference/method/db.getLastError/) to do what you want, but it won't help.
Here's one reason why:
a script that takes the documents one after the other and issues an update to set a new field whose value is new Date(stringdate).
When the shell is used in a non-interactive mode, such as within a loop, it doesn't actually call getLastError(). As such, lowering your write concern to 0 will do nothing.
I already figured out that if only I had inserted empty date objects when I created the database, I would now get better performance, since there is the problem of data relocation when a new field is added.
I did tell people, when they asked about this stuff, to add those fields in case of movement, but instead they listened to the guy who said "leave them out! They use space!".
I shouldn't feel smug, but I do. That's an unfortunate side effect of being right when you were told you were wrong.
mongostat and the web monitoring interfaces, which confirmed that the write lock is taken 70% of the time
That's because of all the movement in your documents, kinda hard to fix that.
I am a bit disappointed mongodb does not have a more precise granularity on its write lock
The write lock doesn't actually dictate the concurrency of MongoDB; this is another common misconception that stems from transactional SQL technologies.
Write locks in MongoDB are mutexes, for one.
Not only that, but there are numerous rules that dictate when operations will yield to queued operations under certain circumstances: one being how many operations are waiting, another being whether the data is in RAM or not, and more.
Unfortunately, I believe you have got yourself stuck between a rock and a hard place, and there is no easy way out. This does happen.

OLAP Saiku Cache expires

I'm using Saiku and PHPAnalytics to run MDX queries on my cube.
It seems that if I run queries, it's all good and caching is fine. But if I come back after 2 hours and run those queries again, the cache is not used! Why? I need the cache to be kept for a long time! What should I do? I tried adding this to mondrian.properties:
mondrian.rolap.CachePool.costLimit = 2147483647
But it didn't help. What should I do?
The default in-memory cache of Mondrian stores things in a WeakHashMap. This means that it can be cleared at the discretion of the JVM's garbage collector. Most application servers are set up to do a periodic garbage-collection sweep (usually every hour or so). You can either tweak your JVM's configuration so that it doesn't do this:
-Dsun.rmi.dgc.client.gcInterval=3600000 -Dsun.rmi.dgc.server.gcInterval=3600000
You can also implement your own cache implementation of the SegmentCache SPI. If your implementation uses hard references, they will never be collected. This is trickier to do and will require you to do quite a bit of studying to get it right. You can start by taking a look at the default implementation and start from there.
The Mondrian cache should keep entries until the cache is deliberately flushed. That said, it uses an aging system to determine what should be kept: should it run out of memory to store the data, the oldest query gets pushed out of the cache and replaced.
I've not tried the PHPAnalytics stuff, but maybe they've put some call into the Saiku server to flush the cache on a regular basis; otherwise this shouldn't happen.

How to keep your distributed cache clean?

In an N-tier architecture, what would be the best patterns to use so that you can keep your cache clean?
I know it's easy to just set an absolute/sliding timeout, but is there a better mechanism available to allow you to mark your cache as dirty after you update the underlying persistence?
The difficulty I'm trying to wrap my head around is that caches are usually stored as key-value pairs, but a query is usually a fair bit more complex than that. So how can the gateway service tell the cache store that, for such and such a query, it needs to refetch from persistence?
I also can't afford to hand-code the cache update per query. I'm looking for a more systematic approach.
Is this just a pipe dream, or is there some way to do this elegantly?
Link/Guide/Post appreciated.
I have worked with AppFabric, and I think I tried to do what you are asking about. I was working on an auction site and I wanted to proactively invalidate items in the cache.
For example, we had listings (things for sale) and they would be present all over the cache (AppFabric). The data that represented a listing was in 10 different places. What I initially wanted was a way to say, "Ok, my listing has changed. Let me go find everywhere it exists in cache, and then update." (I think you say "mark as dirty" in your question)
I found doing this was incredibly difficult. There are tags in AppFabric that I tried to use, so I would mark a given object (or collection of objects) with a tag and that would let me query the cache and remove items. In other words, if an object had a LISTING tag, I would find it and invalidate it.
Eventually I settled on a two-pronged attack.
For 95% of the data I let it expire. It was a happy day when I decided this because everything got much easier to develop. I had to make some concessions in the UI etc., but it was well worth it.
For the last 5% of the data I resolved to only ever store it once. For example, a bid on a listing. Whenever a new bid came in, we'd pro-actively invalidate that object, and then everything that needed that information would be updated as well.

What should be stored in cache for web app?

I realize that this might be a vague question that begets a vague answer, but I'm in need of some real-world examples, thoughts, and/or best practices for caching data for a web app. All of the examples I've read are more technical in nature (how to add or remove cache data from the respective cache store), but I've not been able to find a higher-level strategy for caching.
For example, my web app has an inbox/mail feature for each user. What I've been doing to date is storing typical session data in the cache. In this example, when the user logs in I go to the database and retrieve the user's mail messages and store them in cache. I'm beginning to wonder if I should just maintain a copy of all users' messages in the cache, all the time, and just retrieve them from cache when needed, instead of loading from the database upon login. I have a bunch of other data that's loaded on login (product catalogs and related entities) and login is starting to slow down.
So I guess my question to the community, is what would you do/recommend as an approach in this scenario?
Thanks.
This might be better suited to https://softwareengineering.stackexchange.com/, but generally you want to cache:
Metadata/configuration data that does not change frequently. E.g. country/state lists, external resource addresses, logic/branching settings, product/price/tax definitions, etc.
Data that is costly to retrieve or generate and that does not need to frequently change. E.g. historical data sets for reports.
Data that is unique to the current user's session.
The last item above is where you need to be careful, as you can drastically increase your app's memory usage by adding a few megabytes to the data for every active session. It also implies different levels of caching -- application-wide, per user session, etc.
Generally you should NOT cache data that is under active change.
In larger systems you also need to think about where the cache(s) will sit. Is it possible to have one central cache server, or is it good enough for each server/process to handle its own caching?
Also: you should have some method to quickly reset/invalidate the cached data. For a smaller or less mission-critical app, this could be as simple as restarting the web server. For the large system that I work on, we use a 12 hour absolute expiration window for most cached data, but we have a way of forcing immediate expiration if we need it.
This is a really broad question, and the answer depends heavily on the specific application/system you are building. I don't know enough about your specific scenario to say if you should cache all the users' messages, but instinctively it seems like a bad idea since you would seem to be effectively caching your entire data set. This could lead to problems if new messages come in or get deleted. Would you then update them in the cache? Would that not simply duplicate the backing store?
Caching is only a performance optimization technique, and as with any optimization, measure first before making substantial changes, to avoid wasting time optimizing the wrong thing. Maybe you don't need much caching, and it would only complicate your app. Maybe the data you are thinking of caching can be retrieved in a faster way, or less of it can be retrieved at once.
Cache anything that causes duplicate database queries.
Client side file caching is important as well. Assuming files are marked with an id in your database, cache them on every network request to avoid many network requests for the same file. A resource to do this can be found here (https://developer.mozilla.org/en-US/docs/Web/API/IndexedDB_API). If you don't need to cache files, web storage, local storage and cookies are good for smaller pieces of data.
//if file is in cache
//refer to cache
//else
//make network request and push file to cache
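Fleshing that pseudocode out, here is a minimal sketch using the browser's Cache API instead of IndexedDB (same check-then-fetch idea; getFile and 'app-files' are made-up names):
async function getFile(url) {
    const cache = await caches.open('app-files');
    const cached = await cache.match(url);
    if (cached) {
        // file is in cache: serve it without touching the network
        return cached;
    }
    // else make the network request and push a copy of the response into the cache
    const response = await fetch(url);
    await cache.put(url, response.clone());
    return response;
}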

How safe is it to store sessions with Redis?

I'm currently using MySQL to store my sessions. It works great, but it is a bit slow.
I've been asked to use Redis, but I'm wondering if it is a good idea because I've heard that Redis delays write operations. I'm a bit afraid because sessions need to be real-time.
Has anyone experienced such problems?
Redis is perfect for storing sessions. All operations are performed in memory, and so reads and writes will be fast.
The second aspect is persistence of session state. Redis gives you a lot of flexibility in how you want to persist session state to your hard disk. You can go through http://redis.io/topics/persistence to learn more, but at a high level, here are your options:
If you cannot afford to lose any sessions, set appendfsync always in your configuration file. With this, Redis guarantees that every write operation is saved to disk. The disadvantage is that write operations will be slower.
If you are okay with losing about 1s worth of data, use appendfsync everysec. This will give great performance with reasonable data guarantees (sample redis.conf lines below).
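For reference, those two choices map onto a couple of lines in redis.conf (illustrative values, not a recommendation):
appendonly yes
# everysec fsyncs once per second; use "always" if you cannot afford to lose any sessions
appendfsync everysec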
This question is really about real-time sessions, and seems to have arisen partly from a misunderstanding of the phrase 'delayed write operations'. While the details were eventually teased out in the comments, I just wanted to make it super-duper clear...
You will have no problems implementing real-time sessions.
Redis is an in-memory key-value store with optional persistence to disk. 'Delayed write operations' refers to writes to disk, not to the database in general, which lives in memory. If you SET a key/value pair, you can GET it immediately (i.e. in real time). The policy you select with regard to persistence (how much you delay the writes) determines the upper bound on how much data could be lost in a crash.
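For example, from redis-cli (hypothetical key and value), the write is visible to the very next read:
SET session:abc123 '{"user":42}' EX 1800
GET session:abc123
The EX 1800 just gives the session a 30-minute expiry; the GET returns the value immediately because both commands operate on memory.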
Basically there are two main types available: async snapshots and an append-only command log. They're called RDB and AOF respectively. More on persistence modes on the official page.
The signal handling of the daemonized process syncs to disk when it receives a SIGTERM for instance, so the data will still be there after a reboot. I think the daemon or the OS has to crash before you'll see an integrity corruption, even with the default settings (RDB snapshots).
The AOF setting uses an Append Only File that logs the commands the server receives, and recreates the DB from scratch on cold start, from the saved file. The default disk-sync policy is to flush once every second (IIRC) but can be set to lock and write on every command.
Using both the snapshots and the incremental log seems to combine the long-term, don't-mind-if-I-miss-a-few-seconds-of-data approach with the more secure, but costlier, incremental log. Redis also supports replication out of the box, so the data can be copied to other nodes as well, it seems.
I'm using the default RDB setting myself and saving the snapshots to remote FTP. I haven't seen a failure that caused data loss yet. An acute hardware failure or a power outage would be the most likely cause, but I'm hosted on a VPS, so there's a slim chance of that happening :)
