What is the design & architecture behind facebook's status update mechanism? - algorithm

I'm planning on creating a social network and I don't think I quite understand how the status update module of facebook is designed. Hoping I can find some help here. At algorithmic and datastructure level, what is the most efficient way to create a status update mechanism in a social network?
A full table scan for all friends and then sorting their updates is very naive and costly. Do we use some sort of mechanism based on hashing or something else? Please let me know.
P.S: I'm not talking about their EdgeRank algorithm but the basic status update. How do they find and fetch them from the database?
Thanks in advance for the help!

Here is a great presentation that answers your question. The specific answer comes up at around minute 55:40, but I suggest that you watch the entire presentation to understand how the solution fits into the entire architecture.
In short:
A particular server ("leaf") stores all feed items for a particular user. So data for each of your friends is stored entirely at a specific destination.
When you want to view your news feed, one of the aggregator servers sends request to all the leaf servers for your friends and ranks the results. The aggregator knows which servers to send requests to based on the userid of each friend.
This is terribly simplified, of course. This only works because all of it is memcached, the system is designed to minimize latency, some ranking is done at the leaf server that contains the friend's feed items, etc.
You really don't want to be hitting the database for any of this to work at a reasonable speed. FB use MySql mostly as a key-value store; JOINing tables is just impossible at their scale. Then they put memcache servers in front of the databases and application servers.
Having said that, don't worry about scaling problems until you have them (unless, of course, you are worrying about them for the fun of it.) On day one, scaling is the least of your problems.

Related

MongoDB / Mongoose Schema for Threaded Messages (Efficiently)

I'm somewhat new to noSQL databases (I'm fairly good with relational databases though), and I'm wondering what the most efficient way to handle an inbox system with threaded messages would be.
Each 'message' will have a single sender and recipient. The number of received / sent messages will vary widely between users. This system should scale well to over 1k+ users.
I've read up on fan out on write / read but I'm not sure how well this would work for threaded messages.
Since I'm new to MongoDB / NoSQL in general, I'm not really used to structuring data efficiently this way.
I'm guessing there's going to be nested objects in any sort of efficient way of handling this...but I can't settle on a design that seems both efficient and convenient for threaded conversations between 2 users.
I thought of storing data with an array of the 2 users, combined with an array of 'message' objects. But then there's the issue of the order of the 2 user's usernames. (ex. [UserA, UserB] and [UserB, UserA] are both possible and would be problematic, so that seemed like a bad idea).
I thought of doing the whole fan out on read / write thing, but that doesn't seem efficient for threaded messages (since if grabbing messages by recipient is convenient, grabbing messages by sender won't be and vice versa).
I'm leaning towards favoring grabbing messages by recipient (since the inbox loads multiple messages, and sending only involves one [albeit with a longer look-up time]). But I'd really like to grab a threaded conversation in one go, as well as the list of users that a user has threaded conversations with (for the list of threads).
If someone could give me an efficient schema for threaded conversations I'd be very grateful. I've been researching this and trying to settle on a design for hours, and I'm exhausted. I keep finding flaws in my designs and scrapping them and I'd really just like some input from someone more experienced with NoSQL databases / MongoDB so I can avoid making a huge design flaw and/or writing logic that could've been handled with a better database design.
Thanks in advance for any and all help.
On this particular topic you are in luck, there is a great post discussing the various approaches to the schema here (it's a slight twist on what you are looking at, but not much different):
http://blog.mongodb.org/post/65612078649/schema-design-for-social-inboxes-in-mongodb
Then, this topic was also covered in detail at MongoDB World 2014 in three parts by Darren Wood and Asya Kamsky:
Part 1 Outline and Video
Part 2 Outline and Video
Part 3 Outline and Video
Also at MongoDB World the guys at Dropbox talked about the lessons they learned when building their Mailbox:
http://www.mongodb.com/presentations/mongodb-mailbox
And then, to round it off, there is a full reference architecture with code called Socialite on Github written by the aforementioned Darren Wood:
https://github.com/10gen-labs/socialite

Neo4j and Cluster Analysys

I'm developing a web application that will heavily depend on its ability to make suggestions on items basing on users with similar preferences. A friend of mine told me that what I'm looking for - mathematically - is some Cluster Analysis algorithm. On the other hand, here on SO, I was told that Neo4j (or some other Graph DB) was the kind DB that I should have approached for this task (the preferences one).
I started studying both this tools, and I'm having some doubts.
For Cluster Analysis purposes it looks to me that a standard SQL DB would still be the perfect choice, while Neo4j would be better suited for a Neural Network kind of approach (although still perfectly fit for the task).
Am I missing something? Am I trying to use the wrong tools combination?
I would love to hear some ideas on the subject.
Thanks for sharing
this depends on your data. neo4j is capable to provide even complex recommendations in real-time for one particular node - let's say you want to recommend to a user some product and this can be handle within a graph db in real-time
whereas using some clustering system is the best way to do recommendations for all users at once (and than maybe save it somewhere so you wouldn't need to calculate it again).
the computational difference:
neo4j has has no initialization cost and can give you one recommendations in an acceptable time
clustering needs more time for initialization (e.g. not in seconds but most likely in minutes/hours) and is better to calculate the recommendations for the whole dataset. in fact, taking strictly the time for one calculations for a specific user this clustering can do it faster than neo4j but the big restriction is the initial initialization - thus not good for real-time application
the practical difference:
if you have mostly static data and is ok for you to do recommendations once in a time than do clustering with SQL
if you got dynamical data where the data are being updated with each interaction and is necessary for you to always provide the newest recommendation, than use neo4j
I am currently working on various topics related to recommendation and clustering with neo4j.
I'm not exactly sure what you're looking for, but depending on how you implement you data on the graph, you can easily work out clustering algorithms based on counting links to various type of nodes.
If you plan correctly you nodes and relationships, you can then identify group of nodes that share most common links to a set of category.
let me introduce Reco4J (http://www.reco4j.org), is is an open source framework that provide recommendation based on graph database source. It uses neo4j as graph database management system.
Have a look at it and contact us if you are interested in support.
It is in a really early release but we are working hard to provide extended documentation and new interesting features.
Cheers,
Alessandro

Synchronize contact lists with a central server

We have some data that we are trying to synchronize between N machines and a centralized server, and I'm looking for a way to do this that is relatively efficient and robust.
Looking around, it appears that this is called a "set reconciliation problem". It's good to have a label for it, but searching on that turns up a lot of fairly academic work, which is at times a bit difficult to gauge in terms of its usefulness for our data, which is best described as contact lists in terms of its properties: objects (people) with multiple fields that do get updated, but not that often.
Our system involves a central server and machines connected to it. The central server, ideally, is the 'good' copy. A feature that's nice to have also, is the ability to force the machines to resend by tweaking something on the server.
So far, my thinking is along the lines of a UUID for each object and something like a version or timestamp (per object and or per collection of objects?) to use to tell which data to attempt to synchronize... but my thinking is still a bit fuzzy, and I thought asking would probably lead to a better solution than trying to invent this on my own.
It is not easy, and the perfect solution is academical. So you are on the good track.
You can craft a sync algorithm for your own problem, relaxing some of the requirements of the general solution.
I delivered a presentation on these topics at the last JsDay in Italy.
Here are my slides: http://www.slideshare.net/matteocollina/operational-transformation-12962149
Let me know if they help you, or if you need some assistance.

How search engine, say Google's page ranking algorithm work across distributed/multiple machines?

I am new to distributed computing but was wondering how page ranking algorithm works across multiple machines. Like
When do they decide data should be replicated (if needed at all),
If data is not copied, do they ask serves at other places to give them the result?
Or do they send "modules" to different serves (say part of a HUGE-HUGE - linked-graph) to one server, another module to another server and the combine the results they received?
I search something -- how does it fetches pages from my country (you know, search pages from <insert country> only)
This is not homework. Just a question I had. I welcome all ideas, even if they are very general or very detailed or do not answer all of my questions.
Right now, I know next to nothing, my hope is to know something after going through the answers.
There're three whales: MapReduce, Google File System, BigTable
Here are some whitepapers of the architecture
GoogleCluster
MapReduce, GFS, BigTable
Note: some of these are quite outdated, nowadays they are doing live updates, which wouldn't work with mapreduce.

How to ensure correctness of data gathered via crowdsourcing?

I have a site where users are entering data of some products they buy.
How do I ensure correctness of data entered via crowdsourcing (enabling users to vote/edit products) minimizing amount of work that needs to be done by administrator? I'm looking for some how-tos, best practices, etc.
What sort of data are you collecting ?
You're talking about crowd-sourcing, and thus (I assume) aggregating of data across this crowd. As they're talking about products they buy, I suspect you're going to be athering product attributes and prices.
Some possible approaches. If you users are entering non-numerical data (e.g. colours), just record the most common entries, or the mode (the most commonly entered).
If they're entering numeric data, discard outliers. i.e. bin the lowest and highest results, and average the rest (you could do this for prices, say. This is the approach that electronic exchanges use for resolving closing prices out of many trades).
Depending on your application, you may want to have a historical bias towards the most recent entries.
But this all depends on your application, and how much storage and crunching of data you're prepared to do.
Make sure you keep a log of IP addresses with every action made, malicious users or bots would trample on session data or cookies. Doing this ensures that a single entity cannot skew any results or do anything drastic by appearing to be multiple users.
As a high level data can be gathered from the 'crowd' with an associated correctness value. Looking at SO, an answer or response from someone with 1000+ rep, has more wieght that a casual user. Look for validations and triangulation, if it's a single voice in the crowd that you're listening too, then it's probably not worth that much. If other voices join then you know you're onto something, again in SO terms we all get a chance to upvote questions.
I've recently seen some really good iPhone apps which rely in crowd sourcing for their data, and then validate it by asking other users if it's correct.

Resources