Is the cache part of the business or data layer in a simple LAMP stack?
It's a cross-cutting concern that may be applied to every piece of data in the Business layer, the Data layer, or any other layer that contains and works with data.
memcached is not part of a simple LAMP stack. A basic LAMP app takes its data directly from the database and templates it into the view. A simple application (and even many complicated ones) doesn't need any more than that.
You add memcached to an application because you have data that is too slow to compute live, on the fly. While memcache certainly counts as part of the data layer, once you rely on it you lose the consistency of a database server, which means you usually need application-specific rules for how long data is cached, based on the business logic of your app. So yes, it impinges on the business layer. And if what you're caching is pre-rendered views (e.g. HTML), then it's touching the presentation layer too.
This wide-ranging and not-easily-encapsulated nature is why you shouldn't introduce memcache into an application until you really need to. Don't assume it's a necessary foundation for performance; remember that your database also has table and query caches you may be able to leverage without giving up consistency or adding cache-expiry complexity.
Memcached sits between the database and the web server. It's a cache, but more importantly it's an explicit cache: nothing gets into it on its own. You have to "put" to it and "get" from it. The biggest advantage is that it can be close to 10 times faster than a database, and if you fetch data from memcached you don't need to make a SQL call, saving your database some cycles to do something more important.
So a book catalog website, with roughly 80% reads and 20% writes, is an ideal candidate.
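As a sketch of that explicit put/get pattern, here is a minimal cache-aside lookup using the spymemcached Java client; the key scheme and the loadBookFromDatabase helper are invented for illustration:

```java
import java.net.InetSocketAddress;
import net.spy.memcached.MemcachedClient;

public class BookCatalog {
    public static void main(String[] args) throws Exception {
        MemcachedClient cache =
                new MemcachedClient(new InetSocketAddress("localhost", 11211));

        String key = "book:42";              // hypothetical key scheme
        Object book = cache.get(key);        // explicit "get"
        if (book == null) {                  // cache miss: hit the DB once
            book = loadBookFromDatabase(42); // hypothetical SQL call
            cache.set(key, 3600, book);      // explicit "put", 1-hour expiry
        }
        System.out.println(book);
        cache.shutdown();
    }

    // Stand-in for the real database query.
    static Object loadBookFromDatabase(int id) {
        return "Book #" + id;
    }
}
```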
I have a table with millions of rows (about 98% reads, maybe 1-2% writes) which has references to a couple of other config tables (with maybe 20 entries each). What are the best practices for caching the tables in this case? I cannot cache the table with millions of rows, but at the same time I also don't want to hit the DB for the config tables. Is there a workaround for this? I'm using Spring Boot, and the data is in Postgres.
Thanks.
First of all, let me refer to this:
What are the best practices for caching the tables in this case
I don't think you should "cache tables", as you say. In the application you work with the data, and that is what should be cached. This means the object you cache should already be in a structure that includes these relations. Of course, in order to fetch the whole object from the database you can use JOINs, but once the object is cached it doesn't matter any more; the translation from the relational model to the object model has been done.
Now the question is too broad because the actual answer can vary on the technologies you use, nature of data, and so forth.
You should answer the following questions before you design the cache (the list is off the top of my head, but hopefully you'll get the idea):
What is the cache invalidation strategy? You say there are 2% writes; when the data gets updated, the data in the cache may become stale. Is that OK?
A kind of generalization of the previous question: if you have multiple instances (JVMs) of the same application and one of them triggers an update to the DB data, what should happen to the other apps' caches?
How long can stale/invalid data reside in the cache?
Do your application's use cases access all the data in the tables with the same frequency, or is some data more "interesting" (for example, the oldest data is never read, but the latest data is always "hot")? With millions of rows, the JVM probably can't hold all these objects in the heap at the same time, so there should be some "slice" of this data...
What are the performance implications of having the cache? How does it affect the GC behavior?
What technologies can be used in your case? (Maybe due to regulations/licensing, some technologies are just not available; this is more often the case in large organizations.)
Based on these observations you can go with:
In-memory cache:
Spring integrates with various in-memory cache technologies; you can also use them without Spring at all (see the sketch after this list). To name a few:
Google Guava cache (for older Spring cache implementations)
Caffeine (for newer Spring cache implementations)
A plain in-memory key/value map
In-memory but in another process:
Redis
Infinispan
Now, these caches are slower than those listed in the previous category, but still can be significantly faster than the DB.
Data Grids:
Hazelcast
Off-heap memory-based caches (this means you store the data off-heap, so it's not eligible for garbage collection)
Postgres-related solutions. For example, you can still go to the DB, but if you opt to keep the index in memory, the queries will be significantly faster.
Some ORM-specific caches (Hibernate, for example, has its own cache as well).
Some kind of mix of all of the above.
Implement your own solution - well, this is something you probably shouldn't do as a first attempt to address the issue, because caching can be tricky.
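For the in-memory cache option, here is a minimal sketch of a Spring Boot service caching the assembled config objects with Caffeine; ConfigEntry and the lookup logic are placeholders for your actual domain types:

```java
import java.time.Duration;
import com.github.benmanes.caffeine.cache.Caffeine;
import org.springframework.cache.annotation.Cacheable;
import org.springframework.cache.annotation.EnableCaching;
import org.springframework.cache.caffeine.CaffeineCacheManager;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.stereotype.Service;

@Configuration
@EnableCaching
class CacheConfig {
    @Bean
    CaffeineCacheManager cacheManager() {
        CaffeineCacheManager manager = new CaffeineCacheManager("configs");
        // A bounded size plus time-based expiry answers two of the questions
        // above: heap footprint and how long stale data may live.
        manager.setCaffeine(Caffeine.newBuilder()
                .maximumSize(10_000)
                .expireAfterWrite(Duration.ofMinutes(10)));
        return manager;
    }
}

@Service
class ConfigService {
    // The first call per id hits Postgres; subsequent calls are served
    // from the in-memory cache until the entry expires or is evicted.
    @Cacheable("configs")
    public ConfigEntry findConfig(long id) {
        return loadFromPostgres(id); // hypothetical repository call
    }

    private ConfigEntry loadFromPostgres(long id) {
        return new ConfigEntry(id, "value-" + id);
    }
}

record ConfigEntry(long id, String value) {}
```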
In the end, let me recommend a very interesting session given by Michael Plöd about caching. I believe it will help you find the solution that works best for you.
I was wondering if I could get an explanation of the differences between an in-memory cache (Redis, Memcached), an in-memory data grid (GemFire), and an in-memory database (VoltDB). I'm having a hard time distinguishing the key characteristics of the three.
Cache - by definition, any data stored in memory (RAM) for faster access is called a cache. Examples: Ehcache, Memcache. Typically you put an object into the cache with a String key and access it using that key. It is very straightforward; it is up to the application when to access the cache vs. the database, and no complex processing happens in the cache. If the cache spans multiple machines, it is called a distributed cache. For example, Netflix uses EVCache, which is built on top of Memcache, to store the movie recommendations you see on the home screen.
In-memory database - it has all the features of a cache plus some processing/querying capabilities. Redis falls under this category. Redis supports multiple data structures and you can query the data in Redis (examples: get the last 10 accessed items, get the most used item, etc.). It can span multiple machines, is usually very performant, and also supports persistence to disk if needed. For example, Twitter uses Redis to store timeline information.
I don't know about GemFire and VoltDB, but even memcached and Redis are very different. Memcached is really simple caching: a place to store variables in a very uncomplex fashion and then retrieve them, so you don't have to do a file or database lookup every time you need that data. The variable types are very simple. Redis, on the other hand, is actually an in-memory database with a very interesting selection of data types. It has a wonderful data type for sorted lists, which works great for applications such as leaderboards: you add your new record to the data, and it gets sorted automagically.
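To illustrate that sorted data type, here is a minimal leaderboard sketch using the Jedis Java client; the key name and player names are made up:

```java
import redis.clients.jedis.Jedis;

public class Leaderboard {
    public static void main(String[] args) {
        try (Jedis jedis = new Jedis("localhost", 6379)) {
            // ZADD keeps the set ordered by score automatically.
            jedis.zadd("leaderboard", 420, "alice");
            jedis.zadd("leaderboard", 310, "bob");
            jedis.zadd("leaderboard", 550, "carol");

            // Top 10, highest score first - no sorting in application code.
            for (String player : jedis.zrevrange("leaderboard", 0, 9)) {
                System.out.println(player);
            }
        }
    }
}
```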
So I wouldn't get too hung up on the categories. You really need to examine each tool differently to see what it can do for you, and the application you're building. It's kind of like trying to draw comparisons on nosql databases - they are all very different, and do different things well.
I would add that things in the "database" category tend to have more features to protect and replicate your data than a simple "cache". Cache is (usually) temporary, whereas database data should be persistent. Many cache solutions I've seen do not persist to disk, so if you lost power to your whole cluster, you'd lose everything in the cache.
But there are some cache solutions that have persistence and replication features too, so the line is blurry.
An in-memory cache is a common query store and therefore relieves the DB of read workloads. A common example of an in-memory cache is Redis. For instance, a web site might store the popular searches made by clients, thereby relieving the DB of some load.
An in-memory data grid adds query functionality on top of caching (for example, session data stored in RAM as temporary storage).
Memcache falls in the temp store caching category.
Our application (Java, Spring, Hibernate) uses Postgres to store data.
We are looking to add an analysis engine to the application. I want to explore using a NoSQL DB to run the analysis on. This is partly an attempt at learning NoSQL, and partly to free the main application activity from the performance penalty (as much as possible).
So I want the data changes to also sync to the NoSQL DB (in addition to Postgres). Any sync mechanism will affect the performance of the main data/transaction activity.
Is it a good idea to push the data changes to a message bus and free the main transaction as early as possible? Can anyone point me to frameworks/technologies/ideas that address this issue of the same data going to two different data stores?
The simplest solution would be sending data to a Postgres read replica and running your analytics queries on that. The performance impact is minimal and this would save a lot of time compared to alternative approaches.
Unless you really know what you are doing, I would avoid NoSQL for this kind of application. If your dataset is too big for a Postgres read replica, you might want to use Redshift, which is a columnar datastore optimized for the types of analytics queries typically performed.
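If you do decide to go the message-bus route from the question, one common pattern is to publish a change event only after the Postgres transaction commits, so the main transaction is never blocked by the second store. Here is a hedged sketch using Spring events and Kafka; OrderChangedEvent, the service, and the topic name are all invented for illustration. (For an off-the-shelf alternative, change-data-capture tools such as Debezium can stream Postgres changes to a bus without touching application code.)

```java
import org.springframework.context.ApplicationEventPublisher;
import org.springframework.kafka.core.KafkaTemplate;
import org.springframework.stereotype.Component;
import org.springframework.stereotype.Service;
import org.springframework.transaction.annotation.Transactional;
import org.springframework.transaction.event.TransactionPhase;
import org.springframework.transaction.event.TransactionalEventListener;

@Service
class OrderService {
    private final ApplicationEventPublisher events;

    OrderService(ApplicationEventPublisher events) {
        this.events = events;
    }

    @Transactional
    public void updateOrder(long orderId) {
        // ... persist the change to Postgres via Hibernate as usual ...
        events.publishEvent(new OrderChangedEvent(orderId));
    }
}

@Component
class OrderChangeRelay {
    private final KafkaTemplate<String, String> kafka;

    OrderChangeRelay(KafkaTemplate<String, String> kafka) {
        this.kafka = kafka;
    }

    // Fires only after the main transaction commits, so publishing to the
    // bus never slows down or rolls back the primary write.
    @TransactionalEventListener(phase = TransactionPhase.AFTER_COMMIT)
    public void onChange(OrderChangedEvent event) {
        kafka.send("order-changes", String.valueOf(event.orderId()));
    }
}

record OrderChangedEvent(long orderId) {}
```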
I am developing a web app in Meteor, with Mongo, that will be running on cloud. Each user must belong to a Company.
Each Company can only access its own data.
Each user can access their own data and some data shared with other users of the same company.
Imagine 1,000 companies and 100 users per company; performance and security could get very bad if I use one MongoDB database for the whole app.
So, because Mongo is "schema-less and database-less", I think I can define 1,000 DBs, let's say db_0001, db_0002, ..., with the same collection names, let's say tasks, messages, ..., so the app can be efficient and more secure (same code for every Company and isolation of data).
Also, on the hosting side (let's say, for example, with DigitalOcean), I think it's easier to distribute the DBs if they are already atomized.
Is this a good approach? Or should I not worry about it and let the hosting do this job?
Any thoughts are welcome.
You are currently only looking at one side of the coin. That's fine to start with.
Think about how you are going to display that data and what queries that translates to. Do thorough due diligence on all the potential queries. For example, how often would user/getbyid be called, and how often would you have to show a user their info and their relationships with other users? What other metadata would be required besides user info? Would you have to perform a join to get that data, or is it stored as an embedded document? Which fields are you going to search and sort by most? Which types of data are write-heavy and which are read-heavy?
Now let's get back to your database sharding approach. It's great that you are thinking ahead on this front rather than having to rewrite your component later. Data volume/storage does not worry me here. How many concurrent users the application will have and what the primary use cases are should be the first things to look at when thinking about scale.
Additionally, you need to understand the nature of the business and project its growth. Is it Instagram-type hyper growth, or is it more predictable? A big Mongo cluster can handle thousands of concurrent read/write requests (assuming your design and queries are optimized), so that does not bother me. If you want to keep it flexible, MongoDB has a sharding mechanism: you can shard on a key and it takes care of all the fancy stuff for ya (see the sketch below).
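As a hedged sketch of that sharding alternative to per-company databases, here is how you might shard a shared tasks collection on a companyId field with the MongoDB Java driver; the URI, database name, and field name are made up:

```java
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import org.bson.Document;

public class ShardSetup {
    public static void main(String[] args) {
        // Must point at a mongos router of a sharded cluster.
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            // One shared database and collection, partitioned by tenant,
            // instead of 1,000 separate databases.
            client.getDatabase("admin")
                  .runCommand(new Document("enableSharding", "app"));
            client.getDatabase("admin")
                  .runCommand(new Document("shardCollection", "app.tasks")
                          .append("key", new Document("companyId", 1)));
        }
    }
}
```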
MongoDB is eventually consistent (look up MongoDB and the CAP theorem) if you enable reads from secondaries; if you have a high-volume, business-critical app, you need to be careful because you could be reading out-of-date results.
As far as hosting is concerned, DO is fine, but always have a backup in another region to maintain geographic redundancy, so that if a region goes down (hello AWS!) you have something to fall back on.
Good luck on your project!
I have an instance of Laravel up and running with a load balancer in place. We've set up memcached (two server nodes) to handle session management. So far the site is running fine in our test environment. The site largely ties into a web-based API, so we only store a few values (other than user authentication data) in a user's session to work with the site.
After a short amount of usage by one or two users, there are about 3000 items in the cache. I don't have full access to the nodes, so I don't know exactly what the items are. However we don't appear to be maxing out the nodes with memory and the application functionality is good.
Is this to be expected? I understand that the cache management will clear out old records over time as they expire, so these could just be "remnant" data records, but this is my first time working with memcached so I want to verify that this is normal behavior.
It's quite normal for any caching solution to rack up a number of items. Especially for lots of small objects it's often more efficient for a cache to keep them beyond their expiry (but no longer serve them) and then clear them out in a big sweep periodically.
"Remnant records" pretty much describes it.
As long as your application performs as expected, I wouldn't worry. You should worry when you get a lot of cache misses for objects that were supposed to be in cache but kicked out before expiry due to lack of memory to store them all.
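One hedged way to watch for exactly that is to poll memcached's standard counters; a minimal sketch with the spymemcached Java client (the node address is assumed):

```java
import java.net.InetSocketAddress;
import java.net.SocketAddress;
import java.util.Map;
import net.spy.memcached.MemcachedClient;

public class CacheHealth {
    public static void main(String[] args) throws Exception {
        MemcachedClient client =
                new MemcachedClient(new InetSocketAddress("localhost", 11211));
        // Standard memcached counters: rising "evictions" alongside
        // "get_misses" suggests items are pushed out before they expire.
        Map<SocketAddress, Map<String, String>> stats = client.getStats();
        stats.forEach((node, s) -> System.out.println(node
                + " evictions=" + s.get("evictions")
                + " get_misses=" + s.get("get_misses")));
        client.shutdown();
    }
}
```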
Yes
It is normal to have lots of records in Memcache, but you need proper session management.
Store a small number of values per session (data that is required by most of the APIs, like the user access token).
Cache expiration
The biggest challenge when using Memcache is avoiding cache staleness while still writing clean code. Most developers store data to Memcache and delete or update data when it changes. This strategy can get messy very quickly – Memcache code becomes riddled throughout an application. Rails’ Sweepers can help with this problem, but other languages and frameworks don’t have similar alternatives.
One simple strategy to avoid code complexity is to write data to Memcache with an expiration; the data then expires automatically when its time is up. Most applications can benefit from time-based cache expiration for infrequently changing content such as static assets, headers, footers, blog posts, etc.
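A sketch of that time-based expiration, again with the spymemcached client; the key, TTL, and stored fragment are illustrative:

```java
import java.net.InetSocketAddress;
import net.spy.memcached.MemcachedClient;

public class ExpiringFragment {
    public static void main(String[] args) throws Exception {
        MemcachedClient cache =
                new MemcachedClient(new InetSocketAddress("localhost", 11211));
        // Write the rendered footer with a one-hour TTL; memcached expires
        // it on its own, so no invalidation code is scattered around the app.
        cache.set("view:footer", 3600, "<footer>...</footer>");
        cache.shutdown();
    }
}
```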
List management
A simple list stored in Memcache can be useful for maintaining denormalized relationships.
For example, an e-commerce website may want to store a small list of recent purchases. Rather than keeping a serialized list in Memcache and recalculating it when a new purchase is made, append and prepend can be used to store denormalized data, avoiding a database query (as sketched after the note below).
Note - Memcache only supports a max value size of 1 MB. Be careful about creating lists that may grow larger than the maximum allowed value size.
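A minimal sketch of that append pattern with the spymemcached client; the key and SKU values are invented:

```java
import java.net.InetSocketAddress;
import net.spy.memcached.MemcachedClient;

public class RecentPurchases {
    public static void main(String[] args) throws Exception {
        MemcachedClient cache =
                new MemcachedClient(new InetSocketAddress("localhost", 11211));
        // Seed the list once; 0 means "no expiration".
        cache.set("recent:purchases", 0, "sku-1001");
        // On each new purchase, append to the stored value instead of
        // re-reading, deserializing, and rewriting the whole list.
        cache.append(0, "recent:purchases", ",sku-1002").get();
        System.out.println(cache.get("recent:purchases")); // sku-1001,sku-1002
        cache.shutdown();
    }
}
```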
Also check these links:
https://cloud.google.com/appengine/docs/adminconsole/memcache
http://docs.oracle.com/cd/E17952_01/refman-5.6-en/ha-memcached-faq.html
http://symas.com/mdb/memcache/