I am developing a web app in Meteor, with MongoDB, that will run in the cloud. Each user must belong to a Company.
Each Company can only access its own data.
Each user can access their own data and some data shared with other users of the same company.
Imagine 1,000 companies with 100 users each: performance and security could suffer badly if I use a single MongoDB database for the whole app.
So, because Mongo is "Schema-less and Database-less", I think I can define 1,000 dbs, let's say db_0001, db_0002, ..., each with the same collection names, let's say tasks, messages, ..., so the app can be efficient and more secure (same code for every Company and isolation of data).
Also, on the hosting side (let's say, for example, with Digital Ocean), I think it's easier to distribute the dbs if they are already atomized.
Is this a good approach? Or should I not worry about it and let the hosting do this job?
Any thoughts are welcome.
You are currently only looking at one side of the coin. That's fine to start with.
Think about how you are going to display that data and what queries that translates to. Do thorough due diligence on all the potential queries. For example, how often would user/getbyid be called, and how often would you have to show a user their info and their relationships with other users? What other metadata would be required besides user info? Would you have to perform a join to get that data, or is it stored as an embedded document? Which fields are you going to search and sort by most? Which types of data are write-heavy and which are read-heavy?
Now let's get back to your database sharding approach. It's great that you are thinking ahead on this front rather than having to rewrite your component later. Data volume/storage does not worry me here. To think about scale, look first at how many concurrent users will be using the application and what the primary use cases are.
Additionally, you need to understand the nature of the business and the projected growth. Is it Instagram-style hypergrowth, or is it more predictable? A big Mongo cluster can handle thousands of concurrent read/write requests (assuming your design and queries are optimized), so that does not bother me. If you want to keep it flexible, MongoDB has a sharding mechanism: you shard on a key and it takes care of all the fancy stuff for you.
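To make the two layouts in this thread concrete, here is a minimal Python sketch, not Meteor or MongoDB driver code. The db_0001 naming comes from the question; the companyId field name and both helper functions are assumptions made up for illustration.

```python
from typing import Optional


def tenant_db_name(company_number: int) -> str:
    """Database-per-company layout: 1 -> 'db_0001' (naming from the question)."""
    return f"db_{company_number:04d}"


def tenant_filter(company_id: str, extra: Optional[dict] = None) -> dict:
    """Shared-collection layout: every query is forced to carry the tenant key."""
    query = dict(extra or {})
    # "companyId" is a hypothetical field name; it would also be a natural
    # candidate for the shard key. It overwrites anything the caller passed in.
    query["companyId"] = company_id
    return query


print(tenant_db_name(2))
print(tenant_filter("company-2", {"done": False}))
```

With the shared layout, isolation depends on every query going through something like tenant_filter, so it is worth keeping that in one place on the server side.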
MongoDB is only eventually consistent if you enable reads from secondaries (look up MongoDB and the CAP theorem). If you have a high-volume, business-critical app, be careful, because you can read out-of-date results.
As far as hosting is concerned, DO is fine, but always keep a backup in another region for geographic redundancy, so that if a region goes down (Hello AWS!) you have something to fall back on.
Good luck on your project!
I have an instance of Laravel up and running with a load balancer in place. We've set up memcached (two server nodes) to handle session management. So far the site is running fine in our test environment. The site largely ties into a web-based API, so we only store a few values (other than user authentication data) in a user's session to work with the site.
After a short amount of usage by one or two users, there are about 3000 items in the cache. I don't have full access to the nodes, so I don't know exactly what the items are. However we don't appear to be maxing out the nodes with memory and the application functionality is good.
Is this to be expected? I understand that the cache management will clear out old records over time as they expire, so these could just be "remnant" data records, but this is my first time working with memcached so I want to verify that this is normal behavior.
It's quite normal for any caching solution to rack up a number of items. Especially for lots of small objects it's often more efficient for a cache to keep them beyond their expiry (but no longer serve them) and then clear them out in a big sweep periodically.
"Remnant records" pretty much describes it.
As long as your application performs as expected, I wouldn't worry. You should worry when you get a lot of cache misses for objects that were supposed to be in the cache but were evicted before expiry because there wasn't enough memory to store them all.
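The "keep them beyond expiry but no longer serve them" behaviour can be sketched in a few lines of Python. This is a toy model to show why the item count stays high, not memcached's actual implementation; the class name and the injectable clock are assumptions for the example.

```python
import time


class LazyExpiryCache:
    """Toy cache: expired items are hidden on read, removed only by sweep()."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._items = {}  # key -> (value, expires_at)

    def set(self, key, value, ttl):
        self._items[key] = (value, self._clock() + ttl)

    def get(self, key):
        entry = self._items.get(key)
        if entry is None or entry[1] <= self._clock():
            return None  # missing or expired: never served
        return entry[0]

    def item_count(self):
        return len(self._items)  # still counts expired "remnant" entries

    def sweep(self):
        """Drop all expired entries in one pass; return how many were removed."""
        now = self._clock()
        expired = [k for k, (_, exp) in self._items.items() if exp <= now]
        for k in expired:
            del self._items[k]
        return len(expired)


# A fake clock makes the behaviour visible without sleeping.
now = [0.0]
cache = LazyExpiryCache(clock=lambda: now[0])
cache.set("session:a", "alice", ttl=10)
cache.set("session:b", "bob", ttl=100)
now[0] = 50.0
print(cache.get("session:a"), cache.item_count())
print(cache.sweep(), cache.item_count())
```

After the clock moves past the first TTL, the expired session is no longer served but still shows up in the item count until the sweep runs.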
Yes
It is normal to have lots of records in Memcache. But you need to have proper session management.
Store only a small number of values per session (data required by most of the API calls, such as the user's access token).
Cache expiration
The biggest challenge when using Memcache is avoiding cache staleness while still writing clean code. Most developers store data to Memcache and delete or update data when it changes. This strategy can get messy very quickly – Memcache code becomes riddled throughout an application. Rails’ Sweepers can help with this problem, but other languages and frameworks don’t have similar alternatives.
One simple strategy to avoid code complexity is to write data to Memcache with an expiration. Data with an expiration will automatically expire when the expiration is reached. Most applications can benefit from time-based cache expiration with infrequently changing content such as static assets, headers, footers, blog posts, etc.
List management
A simple list stored in Memcache can be useful for maintaining denormalized relationships.
For example, an e-commerce website may want to store a small table of recent purchases. Rather than keeping a serialized list in Memcache and recalculating it when a new purchase is made, append and prepend can be used to store denormalized data, avoiding a database query.
Note: Memcache only supports a maximum value size of 1 MB by default. Be careful when creating lists that may grow larger than the maximum allowed value size.
Also check these links:
https://cloud.google.com/appengine/docs/adminconsole/memcache
http://docs.oracle.com/cd/E17952_01/refman-5.6-en/ha-memcached-faq.html
http://symas.com/mdb/memcache/
We have a fantasy football application that uses memcached and the classic memcached-object-read-with-sql-server-fallback. This works fairly well, but recently I've been contemplating the overhead involved and whether or not this is the best approach.
Case in point: we need to generate a drop-down list of the user's teams, so we follow this pattern:
1) Get a list of the user's teams from memcached.
2) If not available, get the list from SQL Server and store it in memcached.
3) Do a multiget to get the team objects.
4) Fall back to loading the missing objects from SQL Server and store these in memcached.
This is all very well - each cached piece of data is relatively easily cached and invalidated, but there are two major downsides to this:
1) Because we are operating on objects we are incurring a rather large overhead - a single team occupies some hundred bytes in memcached and what we really just need for this case is a list of team names and ids - not all the other stuff in the team objects.
2) Due to the fallback to loading individual objects, the number of SQL queries generated on an empty cache or when the items expire can be massive:
1 x Memcached multiget (which misses and causes)
1 x SELECT ... FROM Team WHERE Id IN (...)
20 x Store in memcached
So that's 22 network requests (1 + 1 + 20) just for this one query, and also the IN query is slower than a specific join.
Obviously we could just do a simple
SELECT Id, Name FROM Teams WHERE UserId = XYZ
And cache that result, but this would mean that this data would need to be specifically invalidated whenever the user creates a new team. In this case it might seem relatively simple, but we have many of these types of queries, and many of them operate on axes that are not easily invalidated (like a list of ids and names of the teams that your friends have created in a specific game).
So, my question is: do any of you have ideas for resolving the mentioned drawbacks, or should I just accept that there is an overhead and that cache misses are bad, and live with it?
First, cache what you need, maybe just those two fields, not a complete record.
Second, cache what you need again: break the result set into records and cache them separately.
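A rough sketch of the first point in Python, with a plain dict standing in for memcached and a list standing in for the Teams table. The row data, key format and function names are all made up for the example; only the SELECT Id, Name query shape comes from the question.

```python
cache = {}        # stands in for memcached
db_queries = []   # records every simulated SQL round trip

TEAMS = [  # hypothetical rows: (id, user_id, name, everything_else)
    (1, 7, "Red Lions", "..."),
    (2, 7, "Blue Sharks", "..."),
    (3, 9, "Green Owls", "..."),
]


def query_team_names(user_id):
    """Simulates: SELECT Id, Name FROM Teams WHERE UserId = ?"""
    db_queries.append(user_id)
    return [(tid, name) for tid, uid, name, _ in TEAMS if uid == user_id]


def team_names_for_user(user_id):
    """Cache-aside read of the small (id, name) projection only."""
    key = f"user:{user_id}:team_names"
    if key not in cache:
        cache[key] = query_team_names(user_id)  # one query, one store
    return cache[key]


def invalidate_team_names(user_id):
    """Call this when the user creates, renames or deletes a team."""
    cache.pop(f"user:{user_id}:team_names", None)


print(team_names_for_user(7))
print(team_names_for_user(7), len(db_queries))
```

On a cold cache this costs one query and one store instead of one store per team object; the price is the explicit invalidation the question already mentions.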
About caching:
You generally use caching to offload the slower disk-based storage, in this case MySQL. The memory cache scales up rather easily; MySQL scales less easily.
Given that, even if you double the CPU/network/memory usage of the cache by splitting the data and putting it all together again, it will still offload the db. Adding another Node.js instance or another memcached server is easy.
Back to your question:
You say it's a user's team; you could fetch it when the user logs in, and keep it updated in the cache while the user changes it throughout the session.
I presume the team members' names do not change. If so, you can load all team members by id and name and store those in the cache, or even locally in Node.js, using the same fallback strategy as you do now. Only steps 1, 2 and 4 will be left then.
Personally, I usually try to split the SQL results into smaller ready-made pieces and cache those, and keep the cache updated as long as possible, ultimately trying to use MySQL only as storage and never read from it.
Usually you will run some logic on the rows returned from MySQL anyway; there's no need to keep repeating that.
I have been reading some Redis docs and trying the tutorial at http://try.redis-db.com/. So far, I can't see any difference between Redis and caching technologies like Velocity or the Enterprise Library Caching Framework
You're effectively just adding objects to an in-memory data store using a unique key. There do not seem to be any relational semantics...
What am I missing?
No, Redis is much more than a cache.
Like a cache, Redis stores key=value pairs. But unlike a cache, Redis lets you operate on the values. There are 5 data types in Redis: Strings, Sets, Hashes, Lists and Sorted Sets. Each data type exposes various operations.
The best way to understand Redis is to model an application without thinking about how you are going to store it in a database.
Let's say we want to build StackOverflow.com. To keep it simple, we need Questions, Answers, Tags and Users.
Modeling Questions, Users and Answers
Each object can be modeled as a Map. For example, a Question is a map with fields {id, title, date_asked, votes, asked_by, status}. Similarly, an Answer is a map with fields {id, question_id, answer_text, answered_by, votes, status}. Similarly, we can model a user object.
Each of these objects can be directly stored in Redis as a Hash. To generate unique ids, you can use the atomic increment command. Something like this -
$ HINCRBY unique_ids question 1
(integer) 1
$ HMSET question:1 title "Is Redis just a cache?" asked_by 12 votes 0
OK
$ HINCRBY unique_ids answer 1
(integer) 1
$ HMSET answer:1 question_id 1 answer_text "No, its a lot more" answered_by 15 votes 1
OK
Handling Up Votes
Now, every time someone upvotes a question or an answer, you just need to do this:
$ HINCRBY question:1 votes 1
(integer) 1
$ HINCRBY question:1 votes 1
(integer) 2
List of Questions for Homepage
Next, we want to store the most recent questions to display on the home page. If you were writing a .NET or Java program, you would store the questions in a List. Turns out, that is the best way to store this in Redis as well.
Every time someone asks a question, we add its id to the list.
$ lpush questions question:1
(integer) 1
$ lpush questions question:2
(integer) 2
Now, when you want to render your homepage, you ask Redis for the most recent 25 questions.
$ lrange questions 0 24
1) "question:100"
2) "question:99"
3) "question:98"
4) "question:97"
5) "question:96"
...
25) "question:76"
Now that you have the ids, retrieve items from Redis using pipelining and show them to the user.
Questions by Tags, Sorted by Votes
Next, we want to retrieve questions for each tag. But SO allows you to see top voted questions, new questions or unanswered questions under each tag.
To model this, we use Redis' Sorted Set feature. A Sorted Set allows you to associate a score with each element. You can then retrieve elements based on their scores.
Let's go ahead and do this for the Redis tag:
$ zadd questions_by_votes_tagged:redis 2 question:1
(integer) 1
$ zadd questions_by_votes_tagged:redis 10 question:2
(integer) 1
$ zadd questions_by_votes_tagged:redis 5 question:613
(integer) 1
$ zrange questions_by_votes_tagged:redis 0 5
1) "question:1"
2) "question:613"
3) "question:2"
$ zrevrange questions_by_votes_tagged:redis 0 5
1) "question:2"
2) "question:613"
3) "question:1"
What did we do over here? We added questions to a sorted set, and associated a score (number of votes) to each question. Each time a question gets upvoted, we will increment its score. And when a user clicks "Questions tagged Redis, sorted by votes", we just do a zrevrange and get back the top questions.
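If it helps to see the idea outside the Redis prompt, here is a small in-process Python stand-in for one sorted set. It is only a sketch of the semantics described above, not a Redis client; the TagIndex class is made up, and its method names simply mirror the Redis commands.

```python
class TagIndex:
    """In-process stand-in for a single Redis sorted set (member -> score)."""

    def __init__(self):
        self._scores = {}

    def zadd(self, score, member):
        self._scores[member] = score

    def zincrby(self, amount, member):
        """Add amount to the member's score (e.g. on each upvote)."""
        self._scores[member] = self._scores.get(member, 0) + amount
        return self._scores[member]

    def zrevrange(self, start, stop):
        """Members from highest to lowest score; stop is inclusive, as in Redis.
        Negative indexes are not handled in this sketch."""
        ordered = sorted(
            self._scores, key=lambda m: (self._scores[m], m), reverse=True
        )
        return ordered[start:stop + 1]


by_votes = TagIndex()
by_votes.zadd(2, "question:1")
by_votes.zadd(10, "question:2")
by_votes.zadd(5, "question:613")
print(by_votes.zrevrange(0, 5))

# Nine more upvotes move question:1 to the top.
by_votes.zincrby(9, "question:1")
print(by_votes.zrevrange(0, 0))
```

The difference with Redis is that the server keeps the set ordered for you as scores change, so the read is cheap even when the set is large.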
Realtime Questions without refreshing page
And finally, a bonus feature. If you keep the questions page opened, SO will notify you when a new question is added. How can Redis help over here?
Redis has a pub-sub model. You can create channels, for example "channel_questions_tagged_redis". You then subscribe users to a particular channel. When a new question is added, you would publish a message to that channel. All users would then get the message. You will have to use a web technology like web sockets or comet to actually deliver the message to the browser, but Redis helps you with all the plumbing on the server side.
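The channel idea can be sketched in plain Python as follows. This is an in-process toy, not Redis and not a Redis client; the Broker class is an assumption for illustration, and only the channel name is taken from the text above.

```python
from collections import defaultdict


class Broker:
    """Toy publish/subscribe hub: channel name -> list of callbacks."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, channel, callback):
        self._subscribers[channel].append(callback)

    def publish(self, channel, message):
        """Deliver to every subscriber; return how many received it."""
        receivers = self._subscribers.get(channel, [])
        for callback in receivers:
            callback(message)
        return len(receivers)


broker = Broker()
alice_inbox, bob_inbox = [], []
broker.subscribe("channel_questions_tagged_redis", alice_inbox.append)
broker.subscribe("channel_questions_tagged_redis", bob_inbox.append)

delivered = broker.publish("channel_questions_tagged_redis", "question:101")
print(delivered, alice_inbox, bob_inbox)
```

In a real setup each callback would be the piece of your server that pushes the message down a web socket to one browser.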
Persistence, Reliability etc.
Unlike a Cache, Redis persists data on the hard disk. You can have a master-slave setup to provide better reliability. To learn more, go through Persistence and Replication topics over here - http://redis.io/documentation
Not just a cache.
In-memory key-value storage
Supports multiple data types (strings, hashes, lists, sets, sorted sets, bitmaps, and hyperloglogs)
Can persist cached data to physical storage (if needed)
Supports the pub-sub model
Provides replication for high availability (master/slave)
Redis has unique abilities like very fast Lua scripts, which run inside the server at a speed comparable to native commands. Scripts also execute atomically, which is what sophisticated data manipulation needs in order to build advanced objects like Locks and Semaphores.
There is a Redis-based in-memory data grid called Redisson, which makes it easy to build distributed applications in Java, thanks to its distributed Lock, Semaphore, ReadWriteLock, CountDownLatch and ConcurrentMap objects, among many others.
It works well in the cloud and supports AWS Elasticache, AWS Elasticache Cluster and Azure Redis Cache.
Actually, there is no dependency between relational data representation (or any type of data representation) and database role (cache, permanent persistence, etc.).
Redis is good as a cache, it's true, but it's much more than just a cache. It's a high-speed, fully in-memory database. It does persist data on disk. It's not relational; it's key-value storage.
We use it in production. Redis helps us build software that handles thousands of requests per second and keeps customer business data during its whole natural lifecycle.
Redis is a cache that is best suited to distributed environments and microservice architectures.
It is fast and reliable, provides atomicity and consistency, and has a range of data types such as sets, hashes, lists, etc.
I have been using it for the last year, and it really comes as a saviour when you need to provide a production-ready solution very fast, and for any performance-related issue, as you can always use it to cache data.
Redis supports data structures such as strings, hashes, lists, sets, sorted sets with range queries, bitmaps, hyperloglogs, geospatial indexes with radius queries and streams. Redis has built-in replication, Lua scripting, LRU eviction, transactions and different levels of on-disk persistence, and provides high availability via Redis Sentinel and automatic partitioning with Redis Cluster.
Implementation with Python:
https://beyondexperiment.com/vijayravichandran06/redis-data-structure-with-python/
Usages of Redis:
Cache with multiple data structures, like: string, set, zset, list, hash and bitmap (which could be used in many aggregation use cases)
KV DB. Data in Redis memory can be stored on disk: RDB takes snapshots and AOF keeps an edit log.
Message Queue. But one message can only be consumed by one consumer
Pubsub
Distributed lock. Relies on the setnx command: only the first thread to execute it successfully will hold the lock. https://redis.io/commands/setnx
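For the lock item in the list above, the "first one to set the key wins" rule can be sketched like this in Python. FakeStore is a made-up in-process stand-in that imitates SETNX, GET and DEL; it is not a Redis client, and the lock name and owner strings are placeholders.

```python
class FakeStore:
    """In-process stand-in for the three Redis commands used below."""

    def __init__(self):
        self._data = {}

    def setnx(self, key, value):
        """Set key only if it does not exist; return 1 if set, else 0."""
        if key in self._data:
            return 0
        self._data[key] = value
        return 1

    def get(self, key):
        return self._data.get(key)

    def delete(self, key):
        return 1 if self._data.pop(key, None) is not None else 0


def acquire(store, lock_name, owner):
    """Only the first caller to create the key holds the lock."""
    return store.setnx(lock_name, owner) == 1


def release(store, lock_name, owner):
    """Only the current holder may release. Against a real server this
    check-then-delete would have to be made atomic."""
    if store.get(lock_name) == owner:
        store.delete(lock_name)
        return True
    return False


store = FakeStore()
print(acquire(store, "lock:report", "worker-1"))
print(acquire(store, "lock:report", "worker-2"))
print(release(store, "lock:report", "worker-1"))
print(acquire(store, "lock:report", "worker-2"))
```

Note that a bare SETNX lock never expires, so a crashed holder blocks everyone; in practice one would set the key with an expiry (SET with the NX and PX options) so the lock frees itself.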
It is not just a key-value cache; it is a key-dataStructure cache.
Redis is not only a cache but also a data store. Whatever is written to the cache can also be written to disk. That allows us to take backups and to restart our cache nodes: if we restart them, they are prepopulated from the backup. We can even restart the entire cluster. In Memcached, by contrast, when a node fails or restarts, all keys stored on that node are lost.
Redis is also used as a message queue.
As an addition, Redis has capabilities beyond caching. Based on the latest Redis documentation (https://redis.io/docs/modules/), Redis has some external modules that support different kinds of tasks, such as:
Redis Search, full-text search capability
Redis Graph, graph database on top of Redis
Redis Time Series, module that adds a time series data structure to Redis.
Redis AI,
Neural Network for Redis, neural networks module for Redis
etc.
Personally, I have used Redis in production as a message queue, with Celery, for a Django REST Framework application, in addition to caching.
It's a key-value datastore, mainly deployed in a private subnet in conjunction with cloud databases to provide microsecond latency. It provides that with either a lazy-loading or a write-through strategy, depending on the specific use case.
It is way more complex than memcached and operates in cluster-enabled or cluster-disabled mode.
It supports shards and multi-AZ deployment, which make data highly available.
It supports encryption of data at rest and in transit,
and is extremely useful for use cases such as streaming applications, messaging, real-time analytics, and applications where the data's value depreciates very quickly over time.
Hence it's not just a cache; it brings a lot more features with it, which makes it all the more useful.
Besides being a cache server, Redis is specifically a data structure server.
Being a cache in the form of a data structure server means a lot, because data structures are the fundamentals of programs and applications. If you use a SQL database as your storage technology and need to construct a list, a hash map, a ranking set or things like that, it's kind of a pain in the neck. Redis provides this functionality directly and very simply, which greatly simplifies development.
On the other hand, a data structure server does not have to be in the form of a cache. There are projects that are compatible with Redis but have persistent storage engines.
In addition to the answers so far, and to summarize:
Redis is a very fast non-relational database that stores a mapping of keys to several different types of values (strings, hashes, lists, sets, sorted sets, bitmaps, and hyperloglogs). This is explained in detail in Sripathi Krishnan's answer.
Redis keeps data in memory and supports persistent storage on disk
Replication to scale read performance
Client-side sharding to scale write performance
If you want more detailed, in-depth information about Redis, you can look at the books Redis in Action and Redis Essentials.
Have you ever noticed how Facebook says “3 friends and 33 others liked this”? I was wondering what the best approach to do this is. I don’t think going through the friends list and the list of users who “liked this” and comparing them is efficient at all! Do they keep track of this in the database? That would make the database huge.
What do you guys think?
Thanks!
I would guess they outer join their friends table with their likes table to count both regular likes and friend likes at the same time.
With the proper indexes, it wouldn't be a slow query at all. Huge databases aren't necessarily slow, so there's really no reason to not store all of this information in a database. The trick is to make sure the indexes and partitions (if any) are set up well.
Facebook uses Cassandra, a NoSQL database for at least some things. Here's a more detailed discussion of what some of the bigger social media sites do to solve these problems:
http://www.25hoursaday.com/weblog/2009/09/10/BuildingScalableDatabasesDenormalizationTheNoSQLMovementAndDigg.aspx
Lots of interesting reading in there if you follow the links from it to the Digg blog post, etc.
Yes they definitely keep it in their database as they definitely have more than 1 server that needs to access the data.
As for scalability, I'm sure they use a lot of caching.
Here is an example:
If you have 1 million rows to go through, an O(log n) index needs only about 20 operations (in the worst case) to find what you need, since log2 of 1 million is roughly 20.
For 2 million, you need only 21 operations (in the worst case) to find what you need.
Every time you double the number of users to go through, you need only 1 more operation (in the worst case) with an O(log n) index.
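The arithmetic is easy to check with a few lines of Python. This assumes the simplest model, a binary search over n sorted keys, whose worst case is floor(log2 n) + 1 comparisons; the function name is made up for the example.

```python
def worst_case_lookups(n_rows: int) -> int:
    """Worst-case comparisons for a binary search over n_rows sorted keys.

    floor(log2(n)) + 1, computed exactly with integer arithmetic.
    """
    return n_rows.bit_length()


for n in (1_000_000, 2_000_000, 4_000_000):
    print(n, worst_case_lookups(n))
```

Real database indexes are usually B-trees with a much higher fan-out than 2, so the number of page reads is smaller still, but the "doubling adds one step" intuition is the same.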
They also have a distributed architecture or a clustered database.
Facebook must be using a trigger (which automatically gets executed as soon as an event occurs).
For example, suppose a trigger is created to store the count and names of people who liked the status. It will then get executed implicitly (automatically) every time someone likes your status.
This makes the operation very easy, and Facebook doesn't have to manually update the database or store a huge database for this. Also, this approach is faster.
In designing social networking software (mothsorchid.com) I found the only way to address this is to pre-cache streams of notifications. One doesn't query the database at page load to count how many friends and others 'liked this'. Instead, when someone 'likes' something, that is recorded on the object, and when retrieving the object one can compare it with the current user's friend list. If someone updates their profile, makes a comment, etc., it sends notification objects to friends, which are pre-cached in their feeds. This cuts down tremendously on database work at the expense of disk space, but disk space is cheap.
As to how Facebook does this, they use Cassandra DBMS, which is probably a little different to what you have in mind.
Keep in mind that Facebook strongly utilizes memcached, so they're retaining a lot of data in memory and only refreshing it when absolutely necessary. See this blog post for some scalability discussion around this:
http://www.facebook.com/note.php?note_id=39391378919
Each entry that somebody can like probably contains a list of everybody who does like it (all of this is of course in a database). When you view that entry, they match it against your friends list to see which of them is your friend. Voila.
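As a minimal sketch of that matching step in Python: if the likers and the viewer's friends are both available as sets of user ids, the split is a single set intersection. The function name and the ids are invented for the example, and this says nothing about how Facebook actually stores the data.

```python
def like_summary(likers: set, friends: set) -> str:
    """Split the likers of one entry into the viewer's friends and everyone else."""
    friend_likes = len(likers & friends)  # set intersection
    other_likes = len(likers) - friend_likes
    return f"{friend_likes} friends and {other_likes} others liked this"


likers = set(range(36))           # 36 hypothetical user ids who liked the entry
friends = {0, 1, 2, 100, 101}     # the viewer's friends; three of them liked it
print(like_summary(likers, friends))
```

The intersection costs time proportional to the smaller set, so matching a friend list of a few hundred ids against a list of likers is cheap once both are in memory.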
A lot of this is explained by Facebook's Director of Engineering in this QCon presentation:
http://www.infoq.com/presentations/Facebook-Software-Stack
A great presentation to watch.