Cache Greenplum query plan globally?

I'd like to save planner cost by using a plan cache, since the ORCA/legacy optimizer can take dozens of milliseconds.
As far as I know, Greenplum caches query plans at the session level: when a session ends, the plan is discarded, and other sessions cannot share the analyzed plan. What's more, we can't keep a session open forever, since the Greenplum system will not release resources until the TCP connection is disconnected.
Most major databases cache plans after the first execution and reuse them across connections.
So, is there any switch that turns on query plan caching across connections? Within a session I can also see that the client-side timing statistics don't match the "Total time" the planner reports.

Postgres can cache plans as well, but on a per-session basis: once the session ends, the cached plan is thrown away. This can be tricky to optimize/analyze, but it is generally of less importance unless the query you are executing is really complex and/or there are a lot of repeated queries.
The documentation explains this in detail pretty well. You can query pg_prepared_statements to see what is cached. Note that it is not available across sessions; it is visible only to the current session.
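As a minimal sketch of that session-local behaviour (using Python and psycopg2; the users table and connection string are placeholders):

import psycopg2

conn = psycopg2.connect("dbname=test")  # placeholder connection string
cur = conn.cursor()

# PREPARE caches a plan for this session only.
cur.execute("PREPARE get_user (int) AS SELECT * FROM users WHERE id = $1")
cur.execute("EXECUTE get_user (42)")
print(cur.fetchall())

# pg_prepared_statements lists what this session has cached;
# a second connection running the same SELECT would see an empty result.
cur.execute("SELECT name, statement FROM pg_prepared_statements")
print(cur.fetchall())

conn.close()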
When a user starts a session with Greenplum Database and issues a query, the system creates groups or 'gangs' of worker processes on each segment to do the work. After the work is done, the segment worker processes are destroyed, except for a cached number which is set by the gp_cached_segworkers_threshold parameter.
A lower setting conserves system resources on the segment hosts, but a higher setting may improve performance for power-users that want to issue many complex queries in a row.
Also see gp_max_local_distributed_cache.
Obviously, the more you cache, the less memory will be available for other connections and queries. Perhaps not a big deal if you are only hosting a few power users running concurrent queries... but you may need to adjust your gp_vmem_protect_limit accordingly.
For clarification:
Segment resources are released after the gp_vmem_idle_resource_timeout.
Only the master session will remain until the TCP connection is dropped.
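If you want to check these settings on your own cluster, here is a quick sketch (Greenplum speaks the Postgres protocol, so psycopg2 works; the connection string is a placeholder):

import psycopg2

conn = psycopg2.connect("dbname=gpadmin")  # placeholder connection string
cur = conn.cursor()

# Inspect the parameters discussed above.
for guc in ("gp_cached_segworkers_threshold",
            "gp_vmem_idle_resource_timeout",
            "gp_vmem_protect_limit",
            "gp_max_local_distributed_cache"):
    cur.execute("SHOW " + guc)
    print(guc, "=", cur.fetchone()[0])

conn.close()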

Related

Will Caching be useful when we need multiple items in one go

We are working on an e-commerce site where an admin can store configuration against a Product-Category-Manufacturer combination, or against a Product-Category combination.
We have some reports which can return 10,000 product transactions (with 100-1,000 unique Product-Category-Manufacturer combinations).
These reports also need to use that configuration.
One option could be to fetch the configurations from the same stored procedure for all unique Product-Category-Manufacturer combinations.
Another option could be to cache all these combinations in some out-of-process cache (like Redis). Once the transaction data is fetched from the stored procedure, the system would pull the data from the cache for all 1,000 Product-Category-Manufacturer combinations. But in this case we would have to hit the cache 1,000 times, and if some keys are not found in the cache, we would have to hit the database.
In fact, there can be combinations for which no data exists in the database. If we request those combinations, the system will not find them in the cache and will have to hit the database every time. To resolve this, we would have to maintain a set of all the Product-Category-Manufacturer combinations for which data is actually available.
Could anybody suggest whether a cache will be useful in this case?
We use caching mainly on two occasions:
To reduce latency: the cache is closer to the client, so it takes less time for the resource to reach the client.
To reduce network traffic: we often see that some resources are reusable but are always fetched from the original source, which is costly and creates unnecessary traffic. Adding a cache layer solves this.
So to answer your question, "Will caching be useful when we need multiple items in one go?", you have to think about the above two points: how much you are reusing (cache hit percentage), and the cost difference between a cache call and a call to the original source.
If your issue is getting 1,000 items at once, Redis has no problem providing that, and it will be much faster than the transactional DB. You can also keep a set of all the Product-Category-Manufacturer combinations that actually have data; that is better, since there will be no cache misses. However, think about the size of the Redis DB before you proceed.
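A minimal sketch of that batch lookup in Python with redis-py, assuming a hypothetical load_config_from_db helper; a sentinel value is cached for combinations with no data, so they stop hitting the database:

import json
import redis

r = redis.Redis()
MISSING = b"__none__"  # sentinel: "this combination has no data in the DB"

def get_configs(combo_keys):
    results = {}
    to_load = []
    # One round trip for all keys, however many there are.
    for key, raw in zip(combo_keys, r.mget(combo_keys)):
        if raw is None:
            to_load.append(key)            # genuine cache miss
        elif raw == MISSING:
            results[key] = None            # known to have no data
        else:
            results[key] = json.loads(raw)
    for key in to_load:
        config = load_config_from_db(key)  # hypothetical DB/stored-proc call
        r.setex(key, 3600, MISSING if config is None else json.dumps(config))
        results[key] = config
    return results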

RethinkDB changefeeds performance: architectural advice?

I am building an application with RethinkDB and I'm about to switch to using changefeeds. But I'm facing an architectural choice and I'd like to get some advice.
My application currently loads all user data from several tables on user login (sending all of it to the frontend), and then processes requests from the frontend, altering the database, and preparing and sending changed items to users. I'd like to switch that over to changefeeds. The way I see it, I have two choices:
Set up a single changefeed for each table. Filter by users logged in to a particular server, and distribute the changes to users manually. These changefeeds are never closed, i.e. they have the lifetime of my servers.
When a user logs in, set up an individual changefeed for that user, for that user's data only (using a getAll with a secondary index). Maintain as many changefeeds as there are currently logged in users. Close them when users log out.
Solution #1 has a big disadvantage: RethinkDB changefeeds do not have a concept of time (or version number), like for example Kafka does. This means that there is no way to a) load initial data, and b) get changes that happened since the initial load. There is a time window where changes can be lost: between initial data load (a) and the moment the changefeed is set up (b). I find this worrying.
Solution #2 seems better, because includeInitial can be used to get initial data, and then get subsequent changes without interruption. I'd have to deal with initial load performance (it's faster to load a single dump of all data than process thousands of updates), but it seems more "correct". But what about scaling? I'm planning to handle up to 1k users per server — is RethinkDB prepared to handle thousands of changefeeds, each being essentially a getAll query? The actual activity in these changefeeds will be very low, it's just the number that I'm worried about.
The RethinkDB manual is a bit terse about changefeed scaling, saying that:
Changefeeds perform well as they scale, although they create extra intracluster messages in proportion to the number of servers with open feed connections on each write.
Solution #2 creates many more feeds, but the number of servers with open feed connections is actually the same for both solutions. And "changefeeds perform well as they scale" isn't quite enough to go on :-)
I'd also be interested to know what are recommended practices for handling server restarts/upgrades and disconnections. The way I see it, if anything happens to RethinkDB, clients have to perform a full data load (using includeInitial) after reconnecting, because there is no way to know what changes have been lost during downtime. Is that what people do?
RethinkDB should be able to handle thousands of changefeeds just fine if it's on reasonable hardware. One thing some people do to lower network load in that case is put a proxy node on the same machine as their app server and connect to that, since the proxy node knows enough to deduplicate the changefeed messages coming in over the network, and because it takes a lot of CPU/memory load off of their main cluster.
Currently the only way to recover from a crash is to restart the changefeed using includeInitial. There are plans to add write timestamps in the future, but handling deletes is complicated in that case.
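For what it's worth, solution #2 might look roughly like this with the Python driver (the table name, index name, and send_to_frontend helper are all assumptions); include_initial closes the gap between the initial load and subsequent changes:

import rethinkdb as r

conn = r.connect("localhost", 28015)

def follow_user(user_id):
    # One changefeed per logged-in user, via a secondary index.
    feed = (r.table("user_data")
             .get_all(user_id, index="user_id")
             .changes(include_initial=True)
             .run(conn))
    for change in feed:
        if "old_val" not in change:
            # Initial result: part of the user's existing data.
            send_to_frontend(user_id, change["new_val"])
        else:
            # Subsequent change (insert/update/delete).
            send_to_frontend(user_id, change)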

Inactive sessions in Oracle

I would like to ask a question:
This is my environment :
Solaris Version 10; Sun OS Version 5.10
Oracle Version: 11g Enterprise x64 Edition.
When I run this query:
select c.owner, c.object_name, c.object_type, b.sid, b.serial#, b.status, b.osuser, b.machine
from v$locked_object a, v$session b, dba_objects c
where b.sid = a.session_id
and a.object_id = c.object_id;
Sometimes many of the sessions have a status of 'INACTIVE'.
What does this inactive mean?
Will this make my DB and application slow?
What are the effects of the ACTIVE and INACTIVE statuses?
What does this inactive mean?
Just before the Oracle executable performs a read to get the next "command" it should execute for its session, it sets its session's state to INACTIVE. After the read completes, it sets the state to ACTIVE, and the session remains in that state until it is done executing the requested work. Then it does the whole thing all over again.
Will this make my DB and application slow?
Not necessarily. See the answer to your final question.
What are the effects of the ACTIVE and INACTIVE statuses?
The consequences of a large number of sessions (ACTIVE or INACTIVE) are significant in two ways.
The first is if the number is monotonically increasing, which would lead one to investigate the possibility that the application is leaking connections. I'm confident that such a catastrophe is not the case; otherwise you would have mentioned it specifically.
The second, where the number fluctuates within the declared upper bound, is more likely. According to Andrew Holdsworth and other prominent members of Oracle's Real-World Performance (RWP) team, some architects allow too many connections in the application's connection pool, and they demonstrate what happens (response time and availability consequences) when it is too high. They also have a prescription for how to better define the connection pool's attributes and behavior.
The essence of their argument is that by allowing a large number of connections in the pool, you allow them to all be busy at the same time. Rather than having the application tier queue transactions, the database server may have to play a primary role in queuing for low level resources like disk, CPU, network, and even other things like enqueues.
Even if all the sessions are busy for only a short time and they're contending for various resources, the contention is wasteful and can repeat over and over and over again. It makes more sense to spend extra time devising a good user experience queueing model so that you don't waste resources on what is undoubtedly the most expensive (hardware and software licenses) tier in your architecture.
ACTIVE means the session is currently executing some SQL operation, whereas INACTIVE means the opposite. Check out the Oracle v$session documentation.
By nature, a high number of ACTIVE sessions will slow down the whole DBMS, including your application. To what extent is hard to say; here you have to look at I/O, CPU, and other loads.
Inactive sessions will have a low impact unless you exceed the maximum session number.
In simple words, an INACTIVE status in v$session means no SQL statement is being executed at the moment you check v$session.
On the other hand, if you see a lot of inactive sessions, first check the last activity time for each session.
What I suspect is that they are part of a connection pool and hence are being reused periodically. You shouldn't worry about it, since from an application connection perspective the connection pool will take care of it. You might see such INACTIVE sessions for quite a long time.
It does not at all suggest the DB is slow.
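One way to check that last activity time is v$session.last_call_et (roughly, seconds since the session's last call); a small sketch with cx_Oracle, using a placeholder connect string:

import cx_Oracle

conn = cx_Oracle.connect("user/password@host/service")  # placeholder DSN
cur = conn.cursor()

# Idle time (seconds) for each INACTIVE session, longest-idle first.
cur.execute("""
    select sid, serial#, osuser, machine, last_call_et
    from v$session
    where status = 'INACTIVE'
    order by last_call_et desc""")
for row in cur:
    print(row)

conn.close()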
Status INACTIVE means session is not executing any query now.
ACTIVE means it's executing query.

MongoDB preload documents into RAM for better performance

I want MongoDB to hold query results in RAM for a longer period of time (say 30 minutes, if memory is available). Is this possible? Or is there any way I can make sure that the data is pre-loaded into RAM before subsequent queries run against it?
In fact I am wondering about the performance of simple query results in MongoDB. I have a dedicated server with 10GB RAM, and my db.stats() output is as follows:
db.stats();
{
    "db" : "test",
    "collections" : 16,
    "objects" : 625690,
    "avgObjSize" : 68.90,
    "dataSize" : 43061996,
    "storageSize" : 1121402888,
    "numExtents" : 74,
    "indexes" : 25,
    "indexSize" : 28207200,
    "fileSize" : 469762048,
    "nsSizeMB" : 16,
    "ok" : 1
}
Now when I query a single document (as mentioned here) from a web service, it loads in 1.3 seconds. Subsequent calls of the same query respond in 400ms, and then after a few seconds it goes back to taking 1.3 seconds. It looks like MongoDB has dropped the previously queried document from memory, even though no other queries are asking for the data mapped into RAM.
Please explain this, and let me know of any way to make subsequent queries respond faster.
Your observed performance problem on an initial query is likely one of the following issues (in rough order of likelihood):
1) Your application / web service has some overhead to initialize on the first request (e.g. allocating memory, setting up connection pools, resolving DNS, ...).
2) Indexes or data you have requested are not yet in memory, so need to be loaded.
3) The Query Optimizer may take a bit longer to run on the first request, as it is comparing plan execution for your query pattern.
It would be very helpful to test the query via the mongo shell, and isolate whether the overhead is related to MongoDB or your web service (rather than timing both, as you have done).
Following are some notes related to MongoDB.
Caching
MongoDB doesn't have a "caching" time for documents in memory. It uses memory-mapped files for disk I/O and the documents in memory are based on your active queries (documents/indexes you've recently loaded) as well as the available memory. The operating system's virtual memory manager is in charge of caching, and typically will follow a Least-Recently Used (LRU) algorithm to decide which pages to swap out of memory.
Memory Usage
The expected behaviour is that over time MongoDB will grow to use all free memory to store your active working data set.
Looking at your provided db.stats() numbers (and assuming that is your only database), it looks like your database size is currently about 1GB, so you should be able to keep everything within your 10GB total RAM unless:
there are other processes competing for memory
you have restarted your mongod server and those documents/indexes haven't been requested yet
In MongoDB 2.2, there is a new touch command you can use to load indexes or documents into memory after a server restart. This should only be used on initial startup to "warm up" the server, as otherwise you could be unhelpfully forcing actual "active" data out of memory.
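A quick sketch of issuing touch from Python with pymongo (the mycoll collection name is a placeholder; the command itself requires MongoDB 2.2+):

from pymongo import MongoClient

client = MongoClient("localhost", 27017)
db = client.test

# Load both the documents and the indexes of "mycoll" into memory.
result = db.command({"touch": "mycoll", "data": True, "index": True})
print(result)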
On a linux system, for example, you can use the top command and should see that:
virtual bytes/VSIZE will tend to be the size of the entire database
if the server doesn't have other processes running, resident bytes/RSIZE will be the total memory of the machine (this includes file system cache contents)
mongod should not use swap (since the files are memory-mapped)
You can use the mongostat tool to get a quick view of your mongod activity .. or more usefully, use a service like MMS to monitor metrics over time.
Query Optimizer
The MongoDB Query Optimizer compares plan execution for a query pattern every ~1,000 write operations, and then caches the "winning" query plan until the next time the optimizer runs .. or you explicitly call an explain() on that query.
This should be a straightforward one to test: run your query in the mongo shell with .explain() and look at the ms timings, and also the number of index entries and documents scanned. The timing for an explain() isn't the actual time the queries will take to run, as it includes the cost of comparing the plans. The typical execution will be much faster .. and you can look for slow queries in your mongod log.
By default MongoDB will log all queries slower than 100ms, so this provides a good starting point to look for queries to optimize. You can adjust the slow ms value with the --slowms config option, or using the Database Profiler commands.
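For example, from pymongo you might time the query in isolation and then enable the profiler (the collection and filter are placeholders):

from pymongo import MongoClient

client = MongoClient("localhost", 27017)
db = client.test

# explain() shows timing plus how many index entries/documents were scanned.
plan = db.mycoll.find({"field": "value"}).explain()
print(plan)

# Profile level 1 logs only slow operations; here, anything over 100 ms.
db.command("profile", 1, slowms=100)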
Further reading in the MongoDB documentation:
Caching
Checking Server Memory Usage
Database Profiler
Explain
Monitoring & Diagnostics

How to avoid database query storms using cache-aside pattern

We are using a PostgreSQL database and AppFabric Server, running a moderately busy ASP.NET MVC e-commerce site.
Following the cache-aside pattern we request data from our cache, and if it is not available, we query the database.
This approach results in 'query storms', where the database receives multiple queries for the same data in a short space of time while a given object in the cache is being refreshed. The issue is exacerbated by longer-running queries, and obviously multiple requests for the same data can cause the query to run longer, forming an unpleasant feedback loop.
One solution to this problem is to use read-locking on the cache. However, this can itself cause performance issues in a web-farm situation (or even on a single busy web server), as web servers are blocked on reads for no reason whenever a database query happens to be taking place.
Another solution is to drop the cache-aside pattern and seed the cache independently. This is the approach we have taken to mitigate the immediate issues we are seeing with this problem, however it is not possible with all data.
Am I missing something here? And what other approaches have people taken to avoid this behaviour?
Depending on the number of servers you have and your current cache architecture it may be worthwhile to evaluate adding a server-level (or in-process) cache as well. In effect you use this as a fallback cache, and it's especially helpful where hitting the primary storage (database) is either very resource intensive or slow.
When I've used this, I've used the cache-aside pattern for the primary cache and a read-through design for the secondary, in which the secondary locks and ensures the database isn't over-saturated by the same request. With this architecture, a primary cache miss results in at most one query per entity per server (or process) to the database.
So the basic workflow is:
1) Try to retrieve from the primary / shared cache pool
* If successful, return
* If unsuccessful, continue
2) Check the in-process cache for the value
* If successful, return (optionally seeding the primary cache)
* If unsuccessful, continue
3) Take a lock keyed by the cache key (and double-check the in-process cache, in case it was added by another thread)
4) Retrieve the object from primary persistence (the db)
5) Seed the in-process cache and return
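A rough in-process sketch of steps 1-5 (primary_cache and load_from_db stand in for your shared cache client and database call; both are assumptions):

import threading

local_cache = {}
_locks = {}
_locks_guard = threading.Lock()

def _lock_for(key):
    # One lock per cache key, created on demand.
    with _locks_guard:
        return _locks.setdefault(key, threading.Lock())

def get(key):
    value = primary_cache.get(key)      # 1) primary / shared cache pool
    if value is not None:
        return value
    if key in local_cache:              # 2) in-process cache
        return local_cache[key]
    with _lock_for(key):                # 3) lock by cache key
        if key in local_cache:          #    double-check under the lock
            return local_cache[key]
        value = load_from_db(key)       # 4) primary persistence (db)
        local_cache[key] = value        # 5) seed in-process cache
        return value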
I've done this using injectable wrappers: my cache layers all implement the relevant IRepository interface, and StructureMap injects the correct stack of caches. This keeps the actual cache behaviors flexible, focused, and easy to maintain despite being fairly complex.
We've used AppFabric successfully with the seeding strategy you mention above. We actually do use both solutions:
Seed known data where possible (we have a limited set, so this is actually easy for us to figure out)
Within each cache access method, make sure to do look-aside as necessary, and populate cache on retrieval from data store.
The look-aside is necessary, as items may be evicted due to memory pressure, or simply because they were missed in the seeding operation. We have a "warming" service that pulses on an interval (an hour) and keeps the cache populated with the necessary data. We keep statistics on cache misses, and use them to tweak our warming strategy if we see frequent misses during the warming interval.
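Such a warming service can be as simple as a periodic loop; a sketch, where known_keys, load_from_db, and primary_cache are all placeholders for your own seeding list, database call, and cache client:

import threading
import time

def start_warmer(known_keys, interval_seconds=3600):
    def run():
        while True:
            # Re-seed every known key so evicted items come back
            # before a user has to miss on them.
            for key in known_keys:
                value = load_from_db(key)
                if value is not None:
                    primary_cache.put(key, value)
            time.sleep(interval_seconds)
    threading.Thread(target=run, daemon=True).start()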
