After years of using a MyISAM table with 3 indexes, around 500 columns, and millions of rows, I wonder how to "force" MongoDB to keep indexes in memory for fast read performance.
In general, it is a simply structured table and all queries are WHERE index1=.. or index2=... or index3=.. (myisam) and pretty simple in mongodb as well.
It's nice if mongodb is managing the index and ram on its own.
However, I am not sure whether it does, or how MongoDB can best speed up these index-only queries.
Thanks
It's nice if mongodb is managing the index and ram on its own.
With the memory-mapped storage engine, MongoDB does not manage the RAM at all. It uses memory-mapped files and basically "pretends" that everything is in RAM all of the time.
Instead, the operating system is responsible for deciding which pages are kept in RAM, typically on an LRU basis.
You may want to check the sizes of your indexes. If you cannot keep all of those indexes in RAM, then MongoDB will likely perform poorly.
However, I am not sure whether it does, or how MongoDB can best speed up these index-only queries.
MongoDB can use covered indexes to answer a query directly from the index, without reading the documents. However, you have to be very specific about the fields returned. If you include fields that are not part of the index, the query is no longer "index-only".
The default behavior is to return all fields, so you will need to look at the specific queries and make the appropriate changes to allow "index-only". Note that these queries must also exclude the _id field (unless _id is part of the index), which may cause issues down the line.
You don't need to "force" mongo to store indices in memory. An index is brought in memory when you use it and then stays in memory until the OS kicks it out.
MongoDB will automatically use a covered index when it can.
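To make the covered-query rule concrete, here is a minimal Python sketch. The helper names `covered_projection` and `can_be_covered` are hypothetical and not part of any MongoDB driver. They only encode the rule described above: the filter fields and the returned fields must all live in one index, and `_id` has to be excluded unless it is indexed. Other restrictions (array fields, for example) are ignored here.

```python
def covered_projection(index_fields):
    """Build a projection that returns only indexed fields and drops _id."""
    projection = {field: 1 for field in index_fields}
    if "_id" not in index_fields:
        # _id is returned by default, which would force a document read
        projection["_id"] = 0
    return projection


def can_be_covered(index_fields, filter_fields, projection):
    """Return True if the filter and the returned fields all live in the index."""
    indexed = set(index_fields)
    returned = {field for field, keep in projection.items() if keep}
    if not returned:
        # no inclusion projection means whole documents come back
        return False
    if "_id" not in projection:
        returned.add("_id")  # _id comes back unless explicitly excluded
    return set(filter_fields) <= indexed and returned <= indexed
```

With a driver such as pymongo, the resulting dictionary would be passed as the projection argument of `find`, and the query plan from `explain()` should confirm whether the query was really answered from the index.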
Related
I have a table with millions of rows (with 98% reads, maybe 1 - 2% writes) which has references to couple of other config tables (with maybe 20 entries each). What are the best practices for caching the tables in this case? I cannot cache the table with millions of rows. But at the same time, I also don't want to hit the DB for the config tables. Is there a work around for this? I'm using Spring boot, and the data is in postgres.
Thanks.
First of all, let me refer to this:
What are the best practices for caching the tables in this case
I don't think you should "cache tables" as you say. In the application you work with the data, and that is what should be cached. This means the object you cache should already be in a structure that includes these relations. Of course, you can use JOINs to fetch the whole object from the database, but once the object is cached that no longer matters: the translation from the relational model to the object model has already been done.
Now the question is too broad, because the actual answer depends on the technologies you use, the nature of the data, and so forth.
You should answer the following questions before you design the cache (the list is off the top of my head, but hopefully you'll get the idea):
What is the cache invalidation strategy? You say there are 2% writes. When the data gets updated, the data in the cache may become stale. Is that OK?
A kind of generalization of the previous question: If you have multiple instances (JVMs) of the same application, and one of them triggered the update to the DB data, what should happen to other apps' caches?
How long can stale/invalid data reside in the cache?
Do the use cases of your application access all the data from the tables with the same frequency, or is some data more "interesting" (for example, the oldest data is not read, but the latest data is always "hot")? With millions of rows, the JVM probably cannot hold all these objects in the heap at the same time, so there should be some "slice" of this data...
What are the performance implications of having the cache? How does it affect the GC behavior?
What technologies can be used in your case (maybe due to some regulations/licensing, some technologies are just not available, this is more a case in large organizations)
Based on these observations you can go with:
In-memory cache:
Spring integrates with various in-memory cache technologies, you can also use them without spring at all, to name a few:
Google Guava cache (for older spring cache implementations)
Caffeine (for newer spring cache implementations)
In memory map of key / value
In memory but in another process:
Redis
Infinispan
Now, these caches are slower than those listed in the previous category but can still be significantly faster than the DB.
Data Grids:
Hazelcast
Off-heap memory-based caches (this means that you store the data off-heap, so it's not eligible for garbage collection)
Postgres related solutions. For example, you can still go to db, but since you can opt for keeping the index in-memory the queries will be significantly faster.
Some ORM mapping specific caches (like hibernate has its cache as well).
Some kind of mix of all above.
Implement your own solution - well, this is something that probably you shouldn't do as the first attempt to address the issue, because caching can be tricky.
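As a rough illustration of the simplest option above (an in-memory key/value map with a bounded staleness window), here is a minimal sketch in Python rather than Java. The class name `TtlCache` and its parameters are made up for this example; in a Spring Boot application you would normally reach for one of the libraries listed above instead. It shows the two decisions discussed earlier: how long stale data may live (`ttl_seconds`) and how a write can evict an entry (`invalidate`).

```python
import time


class TtlCache:
    """Tiny in-process cache: entries expire ttl_seconds after they are loaded."""

    def __init__(self, loader, ttl_seconds, clock=time.monotonic):
        self._loader = loader    # called on a miss, e.g. a query on a config table
        self._ttl = ttl_seconds
        self._clock = clock      # injectable so expiry can be tested
        self._entries = {}       # key -> (expires_at, value)

    def get(self, key):
        now = self._clock()
        hit = self._entries.get(key)
        if hit is not None and hit[0] > now:
            return hit[1]
        # miss or expired entry: reload and remember when it expires
        value = self._loader(key)
        self._entries[key] = (now + self._ttl, value)
        return value

    def invalidate(self, key):
        """Drop one entry, e.g. right after the application writes that row."""
        self._entries.pop(key, None)
```

Note that this only covers a single process. With several application instances, each one holds its own copy, which is exactly the multi-JVM question raised in the list above.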
In the end, let me provide a link to some very interesting session given by Michael Plod about caching. I believe it will help you to find the solution that works for you best.
I have been using caches for a long time. We store data against some key and fetch it from the cache whenever required. I know that StackOverflow and many other sites rely heavily on caching. My question is: do they always use a key-value mechanism for caching, or do they form some SQL-like query within the cache? For instance, I want to view last week's report. This report's content will vary each day. Do I need to store a different report for each day (with the day as the key), or can I get this result by forming a query that aggregates results across different keys? Does any caching product (like Redis) provide this functionality?
Thanks In Advance
Cache is always done as a key-value hash table. This is how it stays so fast. If you're doing querying then you're not doing cache.
What you may be trying to ask is... you could have in your database a table that contains aggregated report data. And you could query against that pre-calculated table.
One of the reasons for cache (e.g. memcached ) being fast is its simplicity of data access and querying protocol.
The more functionality you add, the more you will have to trade off on efficiency. A full-fledged SQL engine in a "caching" database is not a good design. You can, however, use a data-structure-oriented database like Redis and design your cached data to suit your querying needs. For example: one set or one hash for each date.
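Here is a minimal sketch of the "one hash per date" idea in Python. A plain dictionary stands in for Redis so the example runs on its own, and the key format `report:<date>` and the function names are assumptions made for illustration. Against a real Redis server, `record` would correspond to an `HINCRBY` on the day's hash and the report would read the seven hashes with `HGETALL`.

```python
from collections import Counter
from datetime import date, timedelta


def day_key(day):
    """Cache key for one day's counters, e.g. 'report:2024-01-07'."""
    return f"report:{day.isoformat()}"


def record(cache, day, metric, amount=1):
    """Increment one metric in that day's bucket."""
    bucket = cache.setdefault(day_key(day), Counter())
    bucket[metric] += amount


def last_week_report(cache, today):
    """Sum the seven daily buckets ending at 'today'; missing days count as empty."""
    total = Counter()
    for offset in range(7):
        total.update(cache.get(day_key(today - timedelta(days=offset)), {}))
    return dict(total)
```

The aggregation happens in the application, not in the cache: the cache still only answers key lookups, which is why it stays fast.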
Going a step further, you can use databases like MongoDB or MemSQL, which are pretty fast and have rich querying support. So an aggregation report once in a while won't be an issue.
However, as a design decision, you will have to accept that their caching throughput will not be as high as that of memcached or Redis.
I'm thinking about using Couchbase as a cache layer. I'm aware of the many advantages provided by Couchbase, like the easy scalability. But what interests me more is the rich document model of couchbase, compared to the simple key-value one of memcached.
My RDBMS is SQL Server, and we use NHibernate. The queries and the database are already quite optimized and I think that caching is the best option for further scaling.
My project is to implement a simple relational model between entities (much simpler than the one in the RDBMS), to handle invalidation. When an entity is invalidated (removed from cache) by the application, all dependent entities could also be removed. The logic of defining the dependencies between entities would be handled at the application level by a dedicated component. There would be 10 or 12 different entities (I don't want to cache all my application domain).
My document model in Couchbase would look like this:
Key (the one generated by the application), keys' format depends on entity type
Hashed key (to have a uniform unique key across all entities)
Entity
Dependencies - list of hashed keys of the entities that must be removed when main entity is removed
So my questions are:
On invalidation, we would need to resolve a graph of dependencies (asynchronously). Is it fast to look for specific keys with around 500k entities?
Any feedback on the general idea?
Maintaining the dependencies between entities can be quite simplified, and might not be such a big issue.
Pierre
I use Couchbase 2.2 in production as a persistent cache layer and am really happy with it (running about 2M documents). My app is getting really fast gets (1 millisecond). Your idea is valid and I don't see anything wrong with using Couchbase as an entity storage for invalidation. It's a mature and very stable product.
You are correct in your entity design. You can have a main JSON document that holds a list of references to the child documents, so that before deleting the main document you delete all the children first.
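A minimal sketch of that invalidation step, in Python with a plain dictionary standing in for the Couchbase bucket. The document shape (`entity` plus a `dependencies` list of keys) follows the model described in the question; the function name `invalidate` is hypothetical. It walks the dependency graph breadth-first, tolerates cycles and already-evicted keys, and then deletes dependents before the main document.

```python
from collections import deque


def invalidate(cache, root_key):
    """Remove root_key and everything that transitively depends on it.

    Returns the removed keys in deletion order (dependents first, root last).
    """
    order = []
    seen = {root_key}
    queue = deque([root_key])
    while queue:
        key = queue.popleft()
        doc = cache.get(key)
        if doc is None:
            continue  # already evicted or never cached
        order.append(key)
        for dep in doc.get("dependencies", []):
            if dep not in seen:  # guards against cycles in the graph
                seen.add(dep)
                queue.append(dep)
    removed = list(reversed(order))
    for key in removed:
        del cache[key]
    return removed
```

Every step here is a lookup or delete by key, so the cost grows with the size of the dependency graph being removed and not with the total number of cached entities.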
Also, not sure if it's applicable in your case, but you can take advantage of Couchbase's ability to expire documents. When you insert a key/value (JSON doc) you can specify a TTL (time to live) if you know it upfront. This way you don't need to explicitly delete entities from Couchbase.
The delete operation itself is fast (you can run it as an asynchronous operation), and 500K documents is a really small size for a Couchbase cluster. You should see get operations under 1 millisecond.
But consider having minimum 3 Couchbase nodes in one cluster, so that you can take one node down at any given point of time without compromising data stored in the cluster. See Sizing a Couchbase Server 2.0 cluster
Some additional resources:
10 things developers should know about Couchbase
Top 10 things an Ops / Sys admin must know about Couchbase
App Development with Documents, their Schemas and Relationships
Couchbase Models
Here are my thoughts:
On invalidation, we would need to resolve a graph of dependencies (asynchronously). Is it fast to look for specific keys with around 500k entities?
Are you looking for keys in your RDBMS or in CB? If in CB, you will need to use a view/index; now, views are disk-based, but stored in sorted order so they are no slower than SQL indices. Accessing them in parallel will be faster than in series. It will be the slow point in your operation though if you use CB.
Continuing along with this thought, I have used CB successfully to store and navigate a hierarchical data structure with 500k+ nodes in it. CB performs well, but does take a few seconds to spit out the whole index if I need it (which I do if I need to do a mass-update operation).
Any feedback on the general idea?
The idea is sound. In fact, I'm seeing 10x the performance of SQL with hierarchical queries when I run them on my Couchbase cluster. I also found that a single Couchbase instance outperforms multiple instances when doing an index lookup - I do not know why that is (the 2-instance CB index is 5x faster than my SQL setup). To speed things up further, you can parallelize the queries to the CB index.
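Parallelizing the lookups can be sketched like this in Python. `fetch_one` is a placeholder for whatever call actually queries the index (nothing here is Couchbase-specific), and the worker count is an arbitrary assumption. Threads are a reasonable fit because each lookup mostly waits on the network.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_many(fetch_one, keys, max_workers=8):
    """Run fetch_one(key) for every key concurrently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch_one, keys))
```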
Anyone an idea?
The issue is: I am writing a high performance application. It has a SQL database which I use for persistence. In-memory objects get updated, then the changes are queued for a disk write (which is pretty much always an insert into a versioned table). The small risk during that time window is accepted - in case of a crash, program code will resync local state with external systems.
Now, quite often I need to run lookups on certain values, and it would be nice to have a standard interface. Basically a bag of objects, but with the ability to run queries efficiently against an in-memory index. For example I have a table of "instruments" which all have a unique code, and I need to look up this code... about 30,000 times per second as I get updates for every instrument.
Anyone an idea for a decent high performance library for this?
You should be able to use an in-memory SQLite database (:memory:) with System.Data.SQLite.
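The same idea, sketched with Python's built-in `sqlite3` module instead of System.Data.SQLite (a swap made only so the example is self-contained). The table layout and the function names are assumptions. The primary key on `code` gives SQLite an index for the lookup.

```python
import sqlite3


def open_store():
    """Create an in-memory database with an indexed instruments table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE instruments (code TEXT PRIMARY KEY, name TEXT, last_price REAL)"
    )
    return conn


def upsert(conn, code, name, last_price):
    """Insert a new instrument or overwrite the existing row with the same code."""
    conn.execute(
        "INSERT OR REPLACE INTO instruments VALUES (?, ?, ?)",
        (code, name, last_price),
    )


def lookup(conn, code):
    """Return (code, name, last_price) for the code, or None if it is unknown."""
    cursor = conn.execute(
        "SELECT code, name, last_price FROM instruments WHERE code = ?", (code,)
    )
    return cursor.fetchone()
```

If the only query is a lookup by unique code, a plain hash map keyed by code is likely to be faster still. The in-memory database pays off once you want several indexes or ad hoc queries through one interface.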
How do I put my whole PostgreSQL database into RAM for faster access? I have 8GB of memory and I want to dedicate 2 GB to the DB. I have read about the shared buffers setting, but it just caches the most accessed fragment of the database. I need a solution where the whole DB is put into RAM, any read happens from the RAM DB, and any write operation first writes into the RAM DB and then to the DB on the hard drive (something like the default fsync = on with shared buffers in the PostgreSQL configuration settings).
I have asked myself the same question for a while. One of the disadvantages of PostgreSQL is that it does not seem to support an in-memory storage engine the way MySQL does...
Anyway, I ran into an article a couple of weeks ago describing how this could be done, although it only seems to work on Linux. I really can't vouch for it because I have not tried it myself, but it does seem to make sense, since a PostgreSQL tablespace is indeed assigned a mounted directory.
However, even with this approach, I am not sure you could put your indexes into RAM as well; I do not think MySQL forces HASH index use with its in-memory tables for nothing...
I also wanted to do a similar thing to improve performance, because I am also working with huge data sets. I am using Python; it has dictionary data types, which are basically hash tables in the form of {key: value} pairs. Using these is very efficient and effective. Basically, to get my PostgreSQL table into RAM, I load it into such a Python dictionary, work with it, and persist it into the db once in a while; it's worth it if it is used well.
If you are not using Python, I am pretty sure there is a similar dictionary-mapping data structure in your language.
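A rough sketch of that load, work, persist cycle. To keep it runnable without a server it uses Python's built-in `sqlite3` as a stand-in for PostgreSQL; with PostgreSQL you would go through a driver and its own placeholder syntax. The table `items` with columns `id` and `value`, and both function names, are made up for the example.

```python
import sqlite3


def load_table(conn):
    """Read the whole table into a dict keyed by primary key."""
    return {row_id: value for row_id, value in conn.execute("SELECT id, value FROM items")}


def persist(conn, rows, dirty_keys):
    """Write back only the rows that changed since the last flush."""
    conn.executemany(
        "UPDATE items SET value = ? WHERE id = ?",
        [(rows[key], key) for key in dirty_keys],
    )
    conn.commit()
    dirty_keys.clear()
```

Anything changed in the dictionary since the last `persist` call is lost if the process crashes, so how often you flush is a trade-off between speed and durability.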
Hope this helps!
If you are pulling data by id, use memcached (http://www.danga.com/memcached/) together with PostgreSQL.
Set up an old-fashioned RAMdisk and tell pg to store its data there.
Be sure you back it up well though.
Perhaps something like a Tangosol Coherence cache if you're using Java.
With only an 8GB database, if you've already optimized all the SQL activity and you're ready to solve query problems with hardware, I suggest you're in trouble. This is just not a scalable solution in the long term. Are you sure there is nothing you can do to make substantial differences on the software and database design side?
I haven't tried this myself (yet) but:
There is a standard docker image available for postgres - https://hub.docker.com/_/postgres/
docker supports tmpfs mounts that are entirely in-memory https://docs.docker.com/storage/tmpfs/
Theoretically, it should be possible to combine the two.
If you do this, you might also want to tweak seq_page_cost and random_page_cost to reflect the relative storage costs. See https://www.postgresql.org/docs/current/runtime-config-query.html
The pre-existing advice for query optimization and increasing shared_buffers still stands though. The chances are that if you're having these problems on a database this small simply putting it into RAM probably isn't the right fix.
One solution is to use the Fujitsu version of PostgreSQL, which supports in-memory columnstore indexes...
https://www.postgresql.fastware.com/in-memory-columnar-index-brochure
But it costs a lot...
Or run MS SQL Server with the In-Memory tables feature. Even the free Express edition has it!