I have a newsfeed with comments, and I'm storing the comments in MongoDB. The newsfeed could grow very large in the future, so I need high speed.
comments: [
{user_id: 34, user_name: "John", text: "..."}
]
As you can see, I'm storing info about the user as well, because Mongo's docs say "when you need speed, use embeds".
But a user can change his name at any time.
In that case the user's name under each of his comments in the newsfeed would be wrong.
Should I use references (DBRef) to the "User" collection by _id instead of embeds? How much slower would that be? Is the slowdown big enough to worry about?
I'm just wondering how the big social networks do this. When I change my user name there, it instantly updates in all my posts in the newsfeed.
Storing DBRefs won't gain you any benefit vs. storing simple user ids. It's basically the same id, only with a collection name.
If you want quick efficient reads - embed.
When a user changes his name, you can record that fact and then run a nightly job that updates his cached name in all comments.
If you want instantaneous name updates - you should reference. But in this case you're paying with more complex code and more queries to the database.
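The embed-plus-nightly-job idea can be sketched in a few lines. This is only an illustration: plain Python lists and dicts stand in for MongoDB collections, and the names `pending_renames`, `rename_user` and `apply_pending_renames` are hypothetical. With a real driver, the loop would presumably become one multi-document update per renamed user.

```python
# In-memory stand-in for the comments stored in MongoDB (assumption: no real driver).
comments = [
    {"user_id": 34, "user_name": "John", "text": "first"},
    {"user_id": 35, "user_name": "Ann", "text": "second"},
    {"user_id": 34, "user_name": "John", "text": "third"},
]

# Renames recorded during the day: user_id -> new name.
pending_renames = {}


def rename_user(user_id, new_name):
    """Record the rename cheaply; the embedded copies are fixed later."""
    pending_renames[user_id] = new_name


def apply_pending_renames():
    """Nightly job: refresh the cached name in every affected comment."""
    updated = 0
    for comment in comments:
        new_name = pending_renames.get(comment["user_id"])
        if new_name is not None and comment["user_name"] != new_name:
            comment["user_name"] = new_name
            updated += 1
    pending_renames.clear()
    return updated


rename_user(34, "Johnny")
updated_count = apply_pending_renames()
```

Until the job runs, readers see the old name, which is the trade-off the embed approach accepts in exchange for single-document reads.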
I just ran into an interesting situation about relationships and databases. I am writing a ruby app and for my database I am using postgresql. I have a parent object "user" and a related object "thingies" where a user can have one or more thingies. What would be the advantage of using a separate table vs just embedding data within a field in the parent table?
Example from ActiveRecord:
using a related table:
def change
  create_table :users do |t|
    t.text :name
  end
  create_table :thingies do |t|
    # belongs_to :user needs a user_id foreign key column on thingies
    t.references :user
    t.integer :thingie
    t.text :description
  end
end
class User < ActiveRecord::Base
  has_many :thingies
end
class Thingie < ActiveRecord::Base
  belongs_to :user
end
using an embedded data structure (multidimensional array) method:
def change
  create_table :users do |t|
    t.text :name
    # example contents: [[thingie, description], [thingie, description]]
    t.text :thingies, array: true
  end
end
class User < ActiveRecord::Base
end
Relevant Information
I am using Heroku with Heroku Postgres as my database. I am on their free plan, which limits me to 10,000 rows. That makes me lean toward the multidimensional array approach, but I don't really know.
Embedding a data structure in a field can work for simple cases but it prevents you from taking advantage of relational databases. Relational databases are designed to find, update, delete and protect your data. With an embedded field containing its own wad-o-data (array, JSON, xml etc), you wind up writing all the code to do this yourself.
There are cases where the embedded field might be more suitable, but for this question I will use an example that highlights the advantages of the related-table approach.
Imagine a User and Post example for a blog.
For an embedded post solution, you would have a table something like this (a PostgreSQL-flavored sketch, not tested DDL):
create table Users (
  id serial primary key,
  name varchar(200),
  post text[][]
);
With related tables, you would do something like:
create table Users (
  id serial primary key,
  name varchar(200)
);
create table Posts (
  id serial primary key,
  user_id int references Users (id),
  content text
);
Object Relational Mapping (ORM) tools: With the embedded post, you will be writing the code manually to add posts to a user, navigate through existing posts, validate them, delete them etc. With the separate table design, you can leverage the ActiveRecord (or whatever object relational system you are using) tools for this which should keep your code much simpler.
Flexibility: Imagine you want to add a date field to the post. You can do it with an embedded field, but you will have to write code to parse your array, validate the fields, update the existing embedded posts etc. With the separate table, this is much simpler. In addition, let's say you want to add an Editor to your system who approves all the posts. With the relational example this is easy. As an example, to find all posts edited by 'Bob' with ActiveRecord (assuming Editor has_many :posts), you would just need:
Editor.find_by(name: 'Bob').posts
For the embedded side, you would have to write code to walk through every user in the database, parse every one of their posts and look for 'Bob' in the editor field.
Performance: Imagine that you have 10,000 users with an average of 100 posts each. Now you want to find all posts made on a certain date. With the embedded field, you must loop through every record, parse the entire array of all posts, extract the dates and check them against the one you want. This will chew up both CPU and disk I/O. With a separate table, you can index the date field and pull out the exact records you need without parsing every post from every user.
Standards: Using a vendor specific data structure means that moving your application to another database could be a pain. Postgres appears to have a rich set of data types, but they are not the same as MySQL, Oracle, SQL Server etc. If you stick with standard data types, you will have a much easier time swapping backends.
These are the main issues I see off the top of my head. I have made this mistake and paid the price for it, so unless there is a super-compelling reason to do otherwise, I would use the separate table.
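The performance point about indexing the date can be made concrete with a small sketch. It uses Python's built-in sqlite3 module rather than Postgres so that it runs anywhere; the `posted_on` column, the index name and the sample rows are assumptions added for illustration.

```python
import sqlite3

blog_db = sqlite3.connect(":memory:")
blog_db.executescript(
    """
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE posts (
        id INTEGER PRIMARY KEY,
        user_id INTEGER NOT NULL REFERENCES users(id),
        posted_on TEXT NOT NULL,  -- ISO date such as '2014-03-01' (assumed column)
        content TEXT NOT NULL
    );
    -- The index lets the database find posts by date without scanning every row.
    CREATE INDEX idx_posts_posted_on ON posts (posted_on);
    """
)
blog_db.executemany(
    "INSERT INTO users (id, name) VALUES (?, ?)", [(1, "John"), (2, "Ann")]
)
blog_db.executemany(
    "INSERT INTO posts (user_id, posted_on, content) VALUES (?, ?, ?)",
    [
        (1, "2014-03-01", "hello"),
        (2, "2014-03-01", "hi there"),
        (1, "2014-03-02", "another day"),
    ],
)


def posts_on(date):
    """Return (author name, content) for every post made on the given date."""
    rows = blog_db.execute(
        "SELECT users.name, posts.content FROM posts "
        "JOIN users ON users.id = posts.user_id "
        "WHERE posts.posted_on = ? ORDER BY posts.id",
        (date,),
    )
    return rows.fetchall()


march_first = posts_on("2014-03-01")
```

With the embedded array, the same question would require reading and parsing every user's row in application code.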
What if users John and Ann have the same thingies? The records will be duplicated, and if you decide to change the name of a thingie you will have to change two or more records. If the thingie is stored in a separate table, you only have to change one record. FYI https://en.wikipedia.org/wiki/Database_normalization
Benefits of one to many:
Easier ORM (Object Relational Mapping) integration. You can use an ORM either way, but with the embedded approach you have to define your tables with native SQL. Having distinct tables is easier and you can make use of auto-generated mappings.
Your space limitation of 10,000 rows will go further with the one to many relationship in the case that 2 or more people can have the same "thingies."
Handle users and thingies separately. In some cases, you might only care about people or thingies, not their relationship with each other. Some examples: updating a username or thingy description, or getting a list of all thingies (or all users). Selecting from the single table can make this harder to work with.
Maintenance and manipulation is easier. In the case that a user or a thingy is updated (name change, email address update, etc), you only need to update 1 record in their table instead of writing update statements "where user_id=?".
Enforceable database constraints. What if a thingy is not owned by anyone? Is the user column now nullable? It would have to be in the single table case, so you could not enforce a simple "not nullable" username, for example.
There are a lot of reasons of course. If you are using a relational database, you should make use of the one to many relationship by keeping your objects (users and thingies) in separate tables. Considering your limit on the number of records and that your dataset is small (under 10,000 rows), you shouldn't feel the downside of normalized data.
The short truth is that there are benefits of both. You could, for example, get faster read times from the single table approach because you don't need complicated joins.
Here is a good reference with the pros/cons of both (normalized is the multiple table approach and denormalized is the single table approach).
http://www.ovaistariq.net/199/databases-normalization-or-denormalization-which-is-the-better-technique/
Besides the benefits others mentioned, there is also one thing about standards. If you are working on this app alone, then that's not a problem, but if someone else wants to change something, then the nightmare starts.
It may take that person a lot of time just to understand how it works, and modifying something like this will take even more time. That way, a simple improvement can become really time consuming. And at some point, you will be working with other people. So always code as if the person who ends up maintaining your code is a brutal psychopath who knows where you live.
I am developing a social networking site like Facebook. I am confused about how to structure the notification table. Should there be a separate table for each user, or one huge table for all users, where records are added and deleted frequently?
I have the same problem as you and found this while researching, where the table structure given is:
id
user_id (int)
activity_type (tinyint)
source_id (int)
parent_id (int)
parent_type (tinyint)
time (datetime but a smaller type like int would be better)
where:
activity_type tells me the type of activity, source_id tells me the record that the activity is related to. So if the activity type means "added favorite" then I know that the source_id refers to the ID of a favorite record.
The parent_id/parent_type are useful for my app - they tell me what the activity is related to. If a book was favorited, then parent_id/parent_type would tell me that the activity relates to a book (type) with a given primary key (id)
I index on (user_id, time) and query for activities that are user_id IN (...friends...) AND time > some-cutoff-point. Ditching the id and choosing a different clustered index might be a good idea - I haven't experimented with that.
Pretty basic stuff, but it works, it's simple, and it is easy to work with as your needs change. Also, if you aren't using MySQL you might be able to do better index-wise.
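As a rough sketch of the table, the (user_id, time) index and the friends query described above, here is a version using Python's built-in sqlite3 module instead of MySQL. The sample rows, the index name and the `feed_for` helper are made up for illustration.

```python
import sqlite3

activity_db = sqlite3.connect(":memory:")
activity_db.executescript(
    """
    CREATE TABLE activities (
        id INTEGER PRIMARY KEY,
        user_id INTEGER NOT NULL,
        activity_type INTEGER NOT NULL,
        source_id INTEGER NOT NULL,
        parent_id INTEGER,
        parent_type INTEGER,
        time INTEGER NOT NULL  -- Unix timestamp stored as an int
    );
    CREATE INDEX idx_activities_user_time ON activities (user_id, time);
    """
)
activity_db.executemany(
    "INSERT INTO activities "
    "(user_id, activity_type, source_id, parent_id, parent_type, time) "
    "VALUES (?, ?, ?, ?, ?, ?)",
    [
        (1, 1, 10, 100, 1, 1000),
        (2, 1, 11, 100, 1, 2000),
        (3, 2, 12, 101, 1, 3000),
        (2, 2, 13, 102, 1, 500),
    ],
)


def feed_for(friend_ids, cutoff):
    """Activities by the given friends that are newer than the cutoff, newest first."""
    if not friend_ids:
        return []
    # One "?" per friend ID; the values themselves are always bound as parameters.
    placeholders = ",".join("?" for _ in friend_ids)
    sql = (
        "SELECT user_id, source_id, time FROM activities "
        f"WHERE user_id IN ({placeholders}) AND time > ? "
        "ORDER BY time DESC"
    )
    return activity_db.execute(sql, [*friend_ids, cutoff]).fetchall()


recent = feed_for([2, 3], 900)
```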
That answer also suggested using Redis for faster access to the most recent activities.
With Redis in the mix, it might work like this:
Create your MySQL activity record
For each friend of the user who created the activity, push the ID onto their activity list in Redis.
Trim each list to the last X items
Redis is fast and offers a way to pipeline commands across one connection - so pushing an activity out to 1000 friends takes milliseconds.
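The fan-out steps above can be sketched without a Redis server. In this illustration a bounded deque per user stands in for a Redis list, and the friend graph, `FEED_LENGTH` and `publish_activity` are hypothetical names. With a real Redis client, the loop body would be an LPUSH followed by an LTRIM, batched in a pipeline.

```python
from collections import defaultdict, deque

FEED_LENGTH = 3  # the "last X items" kept per user (assumed value)

# user_id -> newest-first activity IDs; stands in for one Redis list per user.
feeds = defaultdict(lambda: deque(maxlen=FEED_LENGTH))

friends = {1: [2, 3], 2: [1], 3: [1]}  # hypothetical friend graph


def publish_activity(author_id, activity_id):
    """Push the ID of an already-stored activity onto every friend's feed."""
    for friend_id in friends.get(author_id, []):
        # appendleft on a bounded deque mirrors LPUSH followed by LTRIM:
        # the newest ID goes to the front and the oldest falls off the end.
        feeds[friend_id].appendleft(activity_id)


for activity_id in (101, 102, 103, 104):
    publish_activity(1, activity_id)
```

Only IDs are stored in the feeds; the full activity rows still live in the database and are fetched by ID when a feed is rendered.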
For a more detailed explanation of what I am talking about, see Redis' Twitter example: http://code.google.com/p/redis/wiki/TwitterAlikeExample
I hope this helps you too.
I am in the midst of designing an application following the MVC paradigm. I'm using the SQLAlchemy expression language (not the ORM), and Pyramid, if anyone is curious.
So, for a user class that represents a user on the system, I have several accessor methods for various pieces of data like the avatar_url, name, about, etc. I have a method called getuser which looks up a user in the db (by name or id), retrieves the user's row, and encapsulates it with the user class.
However, should I have to make this lookup every time I create a user object? What if a user is viewing her control panel, wants to change avatars, and sends an XHR? Isn't it a waste to create a user object and look up the user's row when she won't even be using the data retrieved, but simply wants to change a subset of the columns? I doubt this lookup is negligible despite indexing, because of waiting for I/O, correct?
More generally, isn't it inefficient to have to query a database and load all a model class's data to make any change (even small ones)?
I'm thinking I should just create a separate form class (since every change is made via some form), and have specific form classes inherit from it, where these setter methods will be implemented. What do you think?
EX: Class: Form <- Class: Change_password_form <- function: change_usr_pass
I'd really appreciate some advice on creating a proper design; thanks.
SQLAlchemy ORM has some facilities which would simplify your task. It looks like you're having to re-invent quite some wheels already present in the ORM layer: "I have a method called getuser which looks up a user in the db(by name or id), retrieves the users row, and encapsulates it with the user class" - this is what ORM does.
With ORM, you have a Session, which, apart from other things, serves as a cache for ORM objects, so you can avoid loading the same model more than once per transaction. You'll find that you need to load User object to authenticate the request anyway, so not querying the table at all is probably not an option.
You can also configure some attributes to be lazily loaded, so that rarely needed or bulky properties are only loaded when you access them.
You can also configure relationships to be eagerly loaded in a single query, which may save you from doing hundreds of small separate queries. I mean, in your current design, how many queries would the below code initiate:
for user in get_all_users():
    print(user.get_avatar_uri())
    print(user.get_name())
    print(user.get_about())
From your description it sounds like it may require 1 + (num_users*3) queries. With SQLAlchemy ORM you could load everything in a single query.
The conclusion is: fetching a single object from a database by its primary key is a reasonably cheap operation, you should not worry about that unless you're building something the size of facebook. What you should worry about is making hundreds of small separate queries where one larger query would suffice. This is the area where SQLAlchemy ORM is very-very good.
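To make the query-count difference tangible, here is a small sketch that counts queries for both strategies. It uses Python's built-in sqlite3 module, not SQLAlchemy, purely so that it is self-contained; the table layout, the sample users and the `run` helper are assumptions for illustration.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, avatar_uri TEXT, about TEXT)"
)
db.executemany(
    "INSERT INTO users (name, avatar_uri, about) VALUES (?, ?, ?)",
    [
        ("molly", "/a/molly.png", "hi"),
        ("bob", "/a/bob.png", "hello"),
        ("ann", "/a/ann.png", "hey"),
    ],
)

counter = {"queries": 0}


def run(sql, params=()):
    """Execute a query and count it, so the two strategies can be compared."""
    counter["queries"] += 1
    return db.execute(sql, params).fetchall()


def per_attribute_queries():
    """One query for the ID list, then one query per attribute per user."""
    result = []
    for (user_id,) in run("SELECT id FROM users ORDER BY id"):
        # Column names come from a fixed tuple, never from user input.
        row = tuple(
            run(f"SELECT {column} FROM users WHERE id = ?", (user_id,))[0][0]
            for column in ("avatar_uri", "name", "about")
        )
        result.append(row)
    return result


def single_query():
    """Everything in one round trip."""
    return run("SELECT avatar_uri, name, about FROM users ORDER BY id")


counter["queries"] = 0
slow_rows = per_attribute_queries()
slow_queries = counter["queries"]

counter["queries"] = 0
fast_rows = single_query()
fast_queries = counter["queries"]
```

Both strategies return the same rows; only the number of round trips differs, and that number grows with the user count in the first case.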
Now, regarding "isn't it a waste to have to create a user object, and look up the users row when they wont even be using the data retrieved; but simply want to make a change to subset of the columns" - I understand you're thinking about something like
class ChangePasswordForm(...):
    def _change_password(self, user_id, new_password):
        session.execute("UPDATE users ...", user_id, new_password)
    def save(self, request):
        self._change_password(request['user_id'], request['password'])
versus
class ChangePasswordForm(...):
    def save(self, request):
        user = getuser(request['user_id'])
        user.change_password(request['password'])
The former example will issue just one query; the latter will have to issue a SELECT, build a User object, and then issue an UPDATE. The former may seem to be "twice as efficient", but in a real application the difference may be negligible. Moreover, you will often need to fetch the object from the database anyway, either to do validation (the new password cannot be the same as the old password), permission checks (is user Molly allowed to edit the description of Photo #12343?) or logging.
If you think that the difference of doing the extra query is going to be important (millions of users constantly editing their profile pictures) then you probably need to do some profiling and see where the bottlenecks are.
Read up on the SOLID principles, paying particular attention to the S (single responsibility), as it answers your question.
Create a single class to perform user existence check, and inject it into any class that requires that functionality.
Also, you need to create a data persistence class to store the user's data, so that the database doesn't have to be queried every time.
I've been doing a fair amount of work with CouchDB in my spare time recently and really enjoy using it. I find it to be much more flexible than a relational database, but it's not without its disadvantages.
One big disadvantage is the lack of dynamic queries / view generation... So you have to do a fair amount of work in planning and justifying your views, as you can't put that logic into your application code as you might do with SQL.
For example, I wrote a login scheme based on a JSON document template that looked a little bit like this:
{
  "_id": "blah",
  "type": "user",
  "name": "Bob",
  "email": "bob#theaquarium.com",
  "password": "blah"
}
To prevent the creation of duplicate accounts, I wrote a very basic view to generate a list of user names to lookup as keys:
emit(doc.name, null)
This seemed reasonably efficient to me. I think it's way better than dragging out an entire list of documents (or even just a reduced number of fields for each document). So I did exactly the same thing to generate a list of email addresses:
emit(doc.email, null)
Can you see where I'm going with this question?
In a relational database (with SQL) one would simply make two queries against the same table. Would this technique (of equating a view to the product of an SQL query) be in some way analogous?
Then there's the performance / efficiency issue... Should those two views really be just one? Or is using a CouchDB view with keys and no associated value an effective practice? Considering the example above, both of those views would have uses outside of a login scheme... If I ever need to generate a list of user names, I can retrieve them without additional overhead.
What do you think?
First, you certainly can put the view logic into your application code - all you need is an appropriate build or deploy system that extracts the views from the application and adds them to a design document. What is missing is the ability to generate new queries on the fly.
Your emit(doc.field,null) approach certainly isn't surprising or unusual. In fact, it is the usual pattern for "find document by field" queries, where the document is extracted using include_docs=true. There is also no need to mix the two views into one. The only performance-related decision is whether the two views should be placed in the same design document: all views in a design document are updated when any of them is accessed.
Of course, your approach does not actually guarantee that the e-mails are unique, even if your application tries really hard. Imagine the following circumstances with two client applications A and B:
A: queries view, determines that `test#email.com` does not exist.
B: queries view, determines that `test#email.com` does not exist.
A: creates account with `test#email.com`
B: creates account with `test#email.com`
This is a rare occurrence, but nonetheless possible. A better approach is to keep documents that use the email address as the key, because access to single documents is transactional (it's impossible to create two documents with the same key). Typical example:
{
  _id: "test#email.com",
  type: "email",
  user: "000000001"
}
{
  _id: "000000001",
  type: "user",
  email: "test#email.com",
  firstname: "Test",
  ...
}
EDIT: a reservation pattern only works if two clients attempting to create an account for a given e-mail will reliably try to access the same document. If you randomly generate a new identifier, then client A will create and reserve document XXXX while client B will create and reserve document YYYY, and you will end up with two different documents that have the same e-mail.
Again, the only way to perform a transactional "check if it exists, create if it does not" operation is to have all clients alter a single document.
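The reservation pattern can be illustrated with a tiny in-memory stand-in for the document store. Everything here is hypothetical (the `DocumentStore` class, the `Conflict` exception, the `register` helper and the example address); the only property borrowed from the discussion above is that creating a document whose _id already exists is rejected.

```python
class Conflict(Exception):
    """Raised when a document with the same _id already exists."""


class DocumentStore:
    """In-memory stand-in for a store that rejects duplicate document IDs."""

    def __init__(self):
        self._docs = {}

    def create(self, doc):
        if doc["_id"] in self._docs:
            raise Conflict(doc["_id"])
        self._docs[doc["_id"]] = doc

    def get(self, doc_id):
        return self._docs.get(doc_id)


def register(store, user_id, email):
    """Reserve the e-mail first; only the client that wins creates the user."""
    try:
        # Every client targets the same _id, so at most one create can succeed.
        store.create({"_id": email, "type": "email", "user": user_id})
    except Conflict:
        return False
    store.create({"_id": user_id, "type": "user", "email": email})
    return True


store = DocumentStore()
first = register(store, "000000001", "test@example.com")
second = register(store, "000000002", "test@example.com")
```

The second registration fails at the reservation step, so no user document is ever written for it.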
I'm about to embark on a project where a user will be able to create their own custom fields. MY QUESTION - what's the best approach for something like this?
Use case: we have medical records with attributes like first_name, last_name etc... However we also want a user to be able to log into their account and create custom fields. For instance they may want to create a field called 'second_phone' etc... They will then map their CRM to their fields within this app so they can import their data.
I'm thinking of creating tables like 'field_sets (has_many fields)', 'fields', 'field_values' etc...
This seems like it would be somewhat common hence why I thought I would first ask for opinions and/or existing examples.
This is where some modern schemaless databases can help you. My favourite is MongoDB. In short: you take whatever data you have and stuff a document with it. No hard thinking required.
If, however, you are in relational land, EAV (entity-attribute-value) is one of the classic approaches.
I have also seen people do these things:
Predefine some "optional" fields in the schema and use them if necessary.
Serialize this optional data to a string (using JSON, for example) and write it to a text blob.
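For the relational EAV route, a minimal sketch along the lines of the 'fields' and 'field_values' tables from the question might look like this. It uses Python's built-in sqlite3 module; the column names, the `account_id` scoping and the sample data are assumptions, not a recommended schema.

```python
import sqlite3

eav_db = sqlite3.connect(":memory:")
eav_db.executescript(
    """
    CREATE TABLE records (id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT);
    -- One row per custom field an account has defined.
    CREATE TABLE fields (
        id INTEGER PRIMARY KEY,
        account_id INTEGER NOT NULL,
        name TEXT NOT NULL,
        UNIQUE (account_id, name)
    );
    -- One row per (record, field) pair that actually has a value.
    CREATE TABLE field_values (
        record_id INTEGER NOT NULL REFERENCES records(id),
        field_id INTEGER NOT NULL REFERENCES fields(id),
        value TEXT,
        PRIMARY KEY (record_id, field_id)
    );
    """
)
eav_db.execute(
    "INSERT INTO records (id, first_name, last_name) VALUES (1, 'Test', 'Patient')"
)
eav_db.execute(
    "INSERT INTO fields (id, account_id, name) VALUES (1, 7, 'second_phone')"
)
eav_db.execute(
    "INSERT INTO field_values (record_id, field_id, value) VALUES (1, 1, '555-0100')"
)


def custom_fields(record_id):
    """Return the custom field names and values for one record as a dict."""
    rows = eav_db.execute(
        "SELECT fields.name, field_values.value FROM field_values "
        "JOIN fields ON fields.id = field_values.field_id "
        "WHERE field_values.record_id = ?",
        (record_id,),
    )
    return dict(rows.fetchall())


custom = custom_fields(1)
```

All values are stored as text here, which is the usual EAV compromise: type checking and validation move into application code.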