How are applications like Twitter implemented? - algorithm

Suppose A follows 100 people;
then retrieving their tweets will need 100 join statements,
which I think would be horrible for the database.
Or are there other ways?

Why would you need 100 joins?
You would have a simple table "Follows" with your ID and the other person's ID in it...
Then you retrieve the "Tweets" by joining something like this:
SELECT TOP 100
    tweet.*
FROM
    tweet
    INNER JOIN Followers ON Followers.FollowerID = tweet.AuthorID
WHERE
    Followers.MasterID = yourID
Now you just need decent caching, and make sure you use a non-locking query, and you have all the information... (well, maybe add some user data into the mix)
Edit:
tweet
ID - tweetid
AuthorID - ID of the poster
Followers
MasterID - (Basically your ID)
FollowerID - (ID of the person following you)
The Followers table has a composite primary key based on MasterID and FollowerID.
It should have two indexes: one on (MasterID, FollowerID) and one on (FollowerID, MasterID).
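A minimal sketch of this schema and timeline query, using SQLite for illustration (SQLite writes LIMIT 100 instead of TOP 100; the Body column is a made-up stand-in for the tweet content):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tweet (
        ID       INTEGER PRIMARY KEY,   -- tweet id
        AuthorID INTEGER NOT NULL,      -- id of the poster
        Body     TEXT                   -- hypothetical content column
    );
    CREATE TABLE Followers (
        MasterID   INTEGER NOT NULL,
        FollowerID INTEGER NOT NULL,
        PRIMARY KEY (MasterID, FollowerID)   -- composite key, first index
    );
    -- second index, covering the reverse lookup
    CREATE INDEX idx_follower_master ON Followers (FollowerID, MasterID);
""")

# User 1's follow list contains authors 2 and 3
conn.executemany("INSERT INTO Followers VALUES (?, ?)", [(1, 2), (1, 3)])
conn.executemany("INSERT INTO tweet VALUES (?, ?, ?)",
                 [(10, 2, "hello"), (11, 3, "world"), (12, 4, "not followed")])

# The timeline query from the answer
rows = conn.execute("""
    SELECT tweet.*
    FROM tweet
    INNER JOIN Followers ON Followers.FollowerID = tweet.AuthorID
    WHERE Followers.MasterID = 1
    LIMIT 100
""").fetchall()
print(rows)  # only tweets 10 and 11 come back
```

One indexed join over the Follows table replaces the imagined "100 joins": the number of people followed only changes how many rows match, not how many joins run.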

The real trick is to minimize your database usage (e.g., cache, cache, cache) and to understand usage patterns. In the specific case of Twitter, they use a bunch of different techniques, from queuing and an insane amount of in-memory caching to some really clever data-flow optimizations. Give Scaling Twitter: Making Twitter 10000 Percent Faster and the other associated articles a read. As for your question about how you implement "following": the answer is to denormalize the data (precalculate and maintain join tables instead of performing joins on the fly) or to not use a database at all. <-- Make sure to read this!

best practice for very simple relation on a nosql table

I am using a DynamoDB table with a GraphQL API to store posts. I want a user to be able to mark certain posts as favorites.
I have thought about creating a relation table of user to post, but I also thought of just adding an array of userIds to the post object, holding the userIds of everyone who has favorited that post.
My understanding is that a UUID is 16 bytes, so even if, say, 10,000 users favorite the object, that array will be 160 KB. Not insignificant, but manageable to pass that much data with the object each time it is loaded.
Just wondering what the best practice is for this scenario. I'm pretty new to NoSQL.
With DynamoDB you have to think about access patterns first:
To get the favorite posts of a user, store a postsIds array in the user table
To get the users who like a post, store a likerIds array in the post table
To get a bidirectional link, do both of the above
Please also keep in mind that:
You can select fields when getting a document (only select the fields you are interested in)
I don't see a scenario where you would load 10k usernames and display them
The above solution looks pretty good for common scenarios.
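To put numbers on the question's 160 KB estimate, here is a back-of-envelope sketch, assuming the likers are stored as an array attribute on the post item. Keep in mind that DynamoDB caps a single item at 400 KB, which is the real constraint on this approach:

```python
# Raw UUIDs are 16 bytes; the canonical string form ("xxxxxxxx-xxxx-...")
# is 36 bytes, and that is how ids are usually stored in practice.
UUID_BINARY_BYTES = 16
UUID_STRING_BYTES = 36

def array_size_kb(num_likers, bytes_per_id):
    """Approximate size of the likerIds array alone, in KB."""
    return num_likers * bytes_per_id / 1024

print(array_size_kb(10_000, UUID_BINARY_BYTES))  # 156.25 KB: fits, but already large
print(array_size_kb(10_000, UUID_STRING_BYTES))  # ~352 KB: close to the 400 KB item limit
```

So the array works for common scenarios, but string-form UUIDs at 10k likers already approach the item size ceiling, which is one reason to consider the range-key layout below.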
More advanced solution:
There could be a more powerful way to do that using range keys. For instance:
Hash key: postID | Range key: likerID | title       | ...
post1            |                    | MyFancyPost |
post1            | user1              |             |
post1            | user2              |             |
This structure is more powerful and can store a lot of connections without having any "big" field in the post model:
you can easily paginate and count the list of likers
it can handle many more likers for a single post
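Here is a plain-Python sketch of that hash/range layout, simulating the table as a list of items; in DynamoDB the lookup would be a single Query on the hash key. The empty-likerID convention for the post's metadata row follows the table above (real designs often use a fixed sentinel sort-key value instead):

```python
# One "metadata" item per post plus one item per liker,
# all sharing the same hash key (postID).
table = [
    {"postID": "post1", "likerID": "", "title": "MyFancyPost"},
    {"postID": "post1", "likerID": "user1"},
    {"postID": "post1", "likerID": "user2"},
]

def query(hash_key):
    """Equivalent of a DynamoDB Query on the hash key: returns every
    item in the partition, metadata and likers together."""
    return [item for item in table if item["postID"] == hash_key]

items = query("post1")
meta = next(i for i in items if i["likerID"] == "")
likers = [i["likerID"] for i in items if i["likerID"]]
print(meta["title"], likers)
```

Counting or paginating the likers is just slicing that list (in DynamoDB: Count / Limit plus LastEvaluatedKey), and no single item ever grows with the number of likers.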

Are Doctrine relations affecting application performance?

I am working on a Symfony project with a new team, and they decided to stop using Doctrine relations as much as they can because of performance issues.
For instance, I have to store the id of my "relation" instead of using a ManyToOne relation.
But I am wondering: is it a real problem?
The thing is, it changes the way of coding to retrieve information and so on.
The performance issue most likely comes from the fact that the queries are not optimised.
If you let Doctrine (the ORM used by Symfony to handle queries) build the queries itself (by using findBy(), findAll(), findOneBy(), etc.), it will first fetch what you asked for, then issue more queries as it needs data from other tables.
Let's take the most common example, a library.
Entities
Book
Author
Shelf
Relations
One Book has one Author, but one Author can have many Books (Book <= ManyToOne => Author)
One Book is stored in one Shelf (Book <= OneToOne => Shelf)
Now if you query a Book, Doctrine will also fetch Shelf as it's a OneToOne relation.
But it won't fetch Author. In your object, you will only have access to book.author.id, as this information is in the Book itself.
Thus, if in your Twig view, you do something like {{ book.author.name }}, as the information wasn't fetched in the initial query, Doctrine will add an extra query to fetch data about the author of the book.
Thus, to prevent this, you have to customize your query so it gets the required data in one go, like this:
public function getBookFullData(Book $book) {
    $qb = $this->createQueryBuilder('book');
    $qb->addSelect('shelf')
        ->addSelect('author')
        ->join('book.shelf', 'shelf')
        ->join('book.author', 'author')
        ->where('book.id = :id')
        ->setParameter('id', $book->getId());
    return $qb->getQuery()->getOneOrNullResult();
}
With this custom query, you can get all the data of one book in one go, thus, Doctrine won't have to do an extra query.
So, while the example is rather simple, I'm sure you can understand that in big projects, giving free rein to Doctrine will just increase the number of extra queries.
One of my projects, before optimisation, reached 1500 queries per page load...
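The N+1 pattern described above is easy to reproduce outside Doctrine. Here is a sketch in Python with SQLite (table and column names are made up for the demo), counting queries both ways:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE author (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE book (id INTEGER PRIMARY KEY, title TEXT, author_id INTEGER);
    INSERT INTO author VALUES (1, 'Hugo'), (2, 'Zola');
    INSERT INTO book VALUES (1, 'A', 1), (2, 'B', 1), (3, 'C', 2);
""")

# Lazy style: 1 query for the books, then 1 query per book for its author
queries = 1
books = conn.execute("SELECT id, title, author_id FROM book").fetchall()
for _, _, author_id in books:
    conn.execute("SELECT name FROM author WHERE id = ?", (author_id,)).fetchone()
    queries += 1
print(queries)   # 4 queries for 3 books: N + 1

# Eager style: one join fetches books and authors in one go
rows = conn.execute("""
    SELECT book.title, author.name
    FROM book JOIN author ON author.id = book.author_id
""").fetchall()
print(len(rows))  # 3 rows from a single query
```

With 3 books the difference is 4 queries versus 1; with a page rendering hundreds of entities, that gap is exactly how you end up at 1500 queries per page.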
On the other hand, it's not good to ignore relations in a database.
In fact, a database is faster with foreign keys and indexes than without.
If you want your app to be as fast as possible, you have to use relations to optimise your database query speed, and optimise Doctrine queries to avoid an awful number of extra queries.
Lastly, I will say that order matters.
Using ORDER BY to fetch parents before children will also greatly reduce the number of queries Doctrine might do on its own.
[SIDE NOTE]
You can also change the fetch method on your entity annotation to "optimise" Doctrine pre-made queries.
fetch="EXTRA_LAZY"
fetch="LAZY"
fetch="EAGER"
But this is a blunt tool, and often doesn't really provide what you need.
Thus, custom queries are the best choice.

Search/retrieve by a large OR query clause with Solr or Elasticsearch

I have a search database of car models: "Nissan GTR", "Hyundai Elantra", "Honda Accord", etc...
Now I also have a user list and the types of cars they like:
user1 likes: carId:1234, carId:5678, etc...
Given user1, I would like to return all the cars he likes; it can be 0 to even hundreds.
What's the best way to model this in Solr, or potentially another "nosql" system that can help with this problem?
I'm using Solr, but I have the opportunity to use another system if I can and if it makes sense.
EDIT:
The Solr solution is too slow for joins (maybe we can try nested documents). And the current MySQL solution, which uses join tables, has over 2 billion rows.
So, you just want to store a mapping between User -> Cars, and retrieve the cars based on the user... sounds very simple:
Your docs are Users: they contain an id (indexed), etc.
One of the fields is 'carsliked', multivalued, which contains the set of car ids the user likes
You have the details about each car in a different collection, for example
Given a user id, you retrieve the 'carsliked' field, and get the car details with a cross-collection join
You could also use nested objects to store each liked car (with all the info about it) inside each user, but that is a bit more complex. As a plus, you don't need the join at query time.
Solr would also allow you many more things, for example: given a car, which users like it? Elasticsearch will work exactly the same way (and probably many other tools, given how simple your use case seems).
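A plain-Python sketch of the shape of that data and the two-step lookup (in Solr this would be a stored multivalued field plus a cross-collection join; the field and collection names here are illustrative):

```python
# "users" collection: each doc has a multivalued carsliked field
users = {
    "user1": {"carsliked": ["1234", "5678"]},
}
# "cars" collection: car details keyed by car id
cars = {
    "1234": {"model": "Nissan GTR"},
    "5678": {"model": "Honda Accord"},
}

def liked_cars(user_id):
    """Step 1: read the user's carsliked ids.
    Step 2: resolve each id against the cars collection
    (the cross-collection join)."""
    ids = users[user_id]["carsliked"]
    return [cars[cid]["model"] for cid in ids if cid in cars]

print(liked_cars("user1"))
```

The reverse question ("given a car, which users like it?") is then just a search on the carsliked field, which is why the multivalued-field layout is attractive here.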

typed data set; parent/child select and update with ONE trip to the database (for each op)?

Is it possible, using an ADO.NET typed DataSet containing two tables in a parent/child relationship, to populate the DataSet with ONE trip to the d/b (the query could return one or two tables; if one, then the result set has columns from both tables, right?), and to update the d/b with ONE trip (a call to a generated stored proc, I guess)?
By "is it possible", I mean: is it possible to have Visual Studio (2012) automagically generate the classes and SQL code to make this happen?
Or am I kind of on my own? It's looking an awful lot like VS really wants to generate one d/b server round trip for each table involved.
I guess the update stored proc would have to take table-typed parameters for both parent and child, and perform inserts/updates/deletes appropriately.
Yes, one round trip per table is the way to go.
(It's certainly possible to use a join query to populate a DataTable, but VS will then be reluctant to generate UPDATE etc. SQL. This may or may not be a problem, depending on what you intend to do with the dataset.)
But if you have two tables in a dataset, let's say Customers and Orders, then you would typically use two queries, and two trips to the db:
SELECT * FROM customers WHERE customers.customerid = @customerid
and
SELECT * FROM orders WHERE orders.customerid = @customerid
Somewhat more counter-intuitive is the situation where you want all customers and orders for one country:
SELECT * FROM customers WHERE customers.countryid = @countryid
and
SELECT orders.* FROM orders INNER JOIN customers ON customers.customerid = orders.customerid WHERE customers.countryid = @countryid
Note how the join query returns data from only one table, but uses the join to identify which rows to return.
Then, once you have the data in your dataset, you can navigate it using the GetParentRow and GetChildRows methods. This is how ADO.NET manages hierarchical data.
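The one-query-per-table pattern for the "country" case can be sketched with SQLite standing in for the real d/b (table names follow the answer; the data is made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customerid INTEGER PRIMARY KEY, countryid INTEGER);
    CREATE TABLE orders (orderid INTEGER PRIMARY KEY, customerid INTEGER);
    INSERT INTO customers VALUES (1, 10), (2, 10), (3, 20);
    INSERT INTO orders VALUES (100, 1), (101, 2), (102, 3);
""")

countryid = 10

# Trip 1: parent rows only
customers = conn.execute(
    "SELECT * FROM customers WHERE countryid = ?", (countryid,)).fetchall()

# Trip 2: child rows only; the join is used purely to pick which rows to return
orders = conn.execute("""
    SELECT orders.* FROM orders
    INNER JOIN customers ON customers.customerid = orders.customerid
    WHERE customers.countryid = ?
""", (countryid,)).fetchall()

print(len(customers), len(orders))  # 2 customers, 2 orders, no duplicated parent data
```

Each result set maps cleanly onto one DataTable, which is exactly what the generated update logic needs; a single joined result would instead repeat each customer once per order.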
You do need this one-table-at-a-time approach, because, assuming you have foreign key constraints in your db, you need to insert and update in reverse order from delete.
EDIT: Yes, this does mean that in some circumstances, depending on the data you want and the structure of your primary keys, you could end up with a humongous set of JOINs that still only pull data from the table at the end of the hierarchy. This might seem wrong in terms of traditional SQL, but actually it's fine. The time you lose in the multiple, more complex queries is saved by the reduced amount of data you have to pull back across the wire, compared with one big join query that would return multiple copies of the parent data.

Database table structure for notifications like table for a social networking site

I am developing a social networking site like Facebook. I am confused about how to structure the notifications table. Should it be separate for each user, or one huge table for everyone, where records are added and deleted frequently?
I had the same problem as you and found this upon researching, where the table structure given is:
id
user_id (int)
activity_type (tinyint)
source_id (int)
parent_id (int)
parent_type (tinyint)
time (datetime but a smaller type like int would be better)
where:
activity_type tells me the type of activity, source_id tells me the record that the activity is related to. So if the activity type means "added favorite" then I know that the source_id refers to the ID of a favorite record.
The parent_id/parent_type are useful for my app - they tell me what the activity is related to. If a book was favorited, then parent_id/parent_type would tell me that the activity relates to a book (type) with a given primary key (id)
I index on (user_id, time) and query for activities that are user_id IN (...friends...) AND time > some-cutoff-point. Ditching the id and choosing a different clustered index might be a good idea - I haven't experimented with that.
Pretty basic stuff, but it works, it's simple, and it is easy to work with as your needs change. Also, if you aren't using MySQL you might be able to do better index-wise.
It also suggested there to use Redis for faster access to the most recent activities.
With Redis in the mix, it might work like this:
Create your MySQL activity record
For each friend of the user who created the activity, push the ID onto their activity list in Redis.
Trim each list to the last X items
Redis is fast and offers a way to pipeline commands across one connection - so pushing an activity out to 1000 friends takes milliseconds.
For a more detailed explanation of what I am talking about, see Redis' Twitter example: http://code.google.com/p/redis/wiki/TwitterAlikeExample
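The fan-out steps can be sketched in plain Python, modelling each friend's Redis list as a capped deque; with real Redis these would be pipelined LPUSH/LTRIM calls, and the cap value here is arbitrary:

```python
from collections import deque

MAX_ITEMS = 5  # "trim each list to the last X items"

# One capped activity list per user, standing in for the Redis lists
feeds = {}

def fan_out(activity_id, friend_ids):
    """Push the new activity id onto every friend's feed and trim it."""
    for fid in friend_ids:
        feed = feeds.setdefault(fid, deque(maxlen=MAX_ITEMS))
        # appendleft on a bounded deque behaves like LPUSH followed by LTRIM
        feed.appendleft(activity_id)

friends_of_alice = ["bob", "carol"]
for i in range(7):
    fan_out(f"activity:{i}", friends_of_alice)

print(list(feeds["bob"]))  # newest first, capped at MAX_ITEMS
```

Reading a friend's feed is then an O(1) list fetch of precomputed ids, with the full activity rows still living in MySQL.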
I hope this helps you too.
