badoo.com user search - how can this be done? - performance

Badoo.com has 56,000,000 user profiles. Profiles can be searched by sex, age, hair color, zodiac sign, education and so on, plus distance from my hometown, online status and date of registration. So far, this seems doable even if it's quite a query on huge tables (56M members...), and it can be cached in a general way.
The interesting part is that they also have an individual "exclude list" (with every profile you look at, you can say that you don't want to meet this person). Plus, your friends don't show up either.
The second interesting part is the OR parts of the query. You can search for someone who's a woman, 25-35, blonde OR brunette, non-smoker, hetero OR bisexual, Virgo OR Gemini OR Cancer, living within a 50 km radius of Paris, who is not your friend and not on your exclude list, and who's online now. Many ORs, a heavy query, sort options, no way of caching or pre-calculating all this, yet the search returns 11,298 results in milliseconds.
How do they do such a thing with 56 million records and 250K people using it at the same time? Full-text search indexes? Relational databases? Key-value stores?
Does anyone have an idea about the concept or architecture?

They are most likely built using inverted-index technology like Lucene or Sphinx. If you are looking to build a solution, my recommendation would be Apache Solr (a search server built on Lucene). It is very popular, has an active OSS community, and is used by sites such as Netflix, CNET, etc.
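For a rough feel of how such a search could be expressed against Solr, here is a sketch - not Badoo's actual schema; the field names, the 'profiles' core and the excluded ids below are all made up. The OR groups, the radius filter and the exclude list map naturally onto filter queries:

import requests

params = {
    "q": "*:*",
    "fq": [
        "gender:female",
        "age:[25 TO 35]",
        "hair:(blonde OR brunette)",
        "smoker:false",
        "orientation:(hetero OR bisexual)",
        "zodiac:(virgo OR gemini OR cancer)",
        "{!geofilt sfield=location pt=48.8566,2.3522 d=50}",  # 50 km around Paris
        "online:true",
        "*:* -id:(4711 4712 4713)",  # exclude list plus friends, negated by id
    ],
    "rows": 20,
    "wt": "json",
}

# Each fq is cached independently in Solr's filter cache, which is one reason
# queries like this can stay fast over tens of millions of documents.
resp = requests.get("http://localhost:8983/solr/profiles/select", params=params)
print(resp.json()["response"]["numFound"])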

I'd recommend taking a look at the Badoo Dev Blog. It's in Russian, but Google Translate helps a lot.
In short, they are using sharded MySQL and memcached. Here is a list of Badoo's evolution.
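As a toy illustration of that pattern (the shard count, DSNs and cache interface below are invented, not Badoo's code): route each user to a MySQL shard by id, and keep hot profile data in memcached so most reads never hit MySQL at all.

N_SHARDS = 64
SHARD_DSNS = [f"mysql://db{i}.internal/profiles" for i in range(N_SHARDS)]

def shard_for(user_id: int) -> str:
    """Deterministically map a user id to the DSN of its MySQL shard."""
    return SHARD_DSNS[user_id % N_SHARDS]

def get_profile(user_id: int, cache, db_fetch):
    """Cache-aside read: try memcached first, fall back to the user's shard.

    `cache` is assumed to be a memcached-style client with get/set(key, value, time);
    `db_fetch(dsn, user_id)` is a function that queries the given shard.
    """
    key = f"profile:{user_id}"
    profile = cache.get(key)
    if profile is None:
        profile = db_fetch(shard_for(user_id), user_id)
        cache.set(key, profile, 300)  # keep it hot for 5 minutes
    return profile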

Related

BigData solution for LIKE queries

I am looking for suggestions on using a distributed system to process this data. I have data from organization-wide computers (laptops, desktops, tablets, etc.). The sample table contains data for all the files present on each computer in this organization. The idea is to find files with certain keywords (3000+) within FileName or FilePath, i.e. wildcard pattern matching.
+-------------+----------+----------+----------+----------+
| MachineName | FileName | FilePath | FileType | FileSize |
+-------------+----------+----------+----------+----------+
The current solution is running on a beefy SQL Server but still takes hours to run through 80 million records due to wildcard SQL queries, i.e. FILENAME LIKE '%abc%' OR FILEPATH LIKE '%abc%', and the list goes on.
We have thought about full-text indexes in SQL Server, but this activity is performed once a month and then the data is discarded, so investing in populating a full-text index does not seem worth it in terms of time & resources.
The requirement is to get this activity completed in a much shorter time, hence we are exploring options.
Should it be Elasticsearch or Solr or some other cloud-based solution? Please advise on a high-level solution.
For this use case, Elasticsearch is a good choice. It provides everything you would need: because every field is indexed, it is commonly used as a real-time full-text search engine.
On the other hand, Solr is a good choice too. From your requirements, I think Elasticsearch offers much more than you would need. Solr is a bit older, which means the documentation is excellent. It specializes in text search, which is not a problem in your case. It is scalable and optimized for high traffic, so it should be suitable for your problem.
I think both Elasticsearch and Solr will fulfill what you need; the choice is up to you - whichever appeals to you more :) In my opinion, if you can, it's best to try both of them and then choose.
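As a rough sketch of what the wildcard search could look like in Elasticsearch - assuming the rows from the table above were indexed into a hypothetical index called "files" with the same field names. Leading wildcards are still expensive; for a 3000+ keyword list, an ngram-analyzed field would usually be the better fit, but the shape of the request stays the same:

import requests

query = {
    "query": {
        "bool": {
            "should": [
                {"wildcard": {"FileName": "*abc*"}},
                {"wildcard": {"FilePath": "*abc*"}},
            ],
            "minimum_should_match": 1,
        }
    },
    "size": 100,
}

# Equivalent of: FILENAME LIKE '%abc%' OR FILEPATH LIKE '%abc%'
resp = requests.post("http://localhost:9200/files/_search", json=query)
print(resp.json()["hits"]["total"])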

Text Search Engine for Mailing List Archive Cataloging and Search

I'm working with a mailing list archive and am tasked with setting up basic search, boolean search, and ultimately some sort of more intelligent tag-based searching.
I see both commercial products and some open-source projects (like Lucene.NET).
Has anyone else done any similar kind of work?
I'm working on a Win2k3 server now, so the immediate thought was to use ASP Classic or ASP.NET. However, if there were another platform that was orders of magnitude better for the purpose, then I'd consider that as well. I'm not going to throw something out just because of that ;)
Since you are setting up mail search, you will need two things: a search engine and a database.
There are many search engines that offer what you need.
Sphinx
Solr (Lucene and Solr are merged now)
PostgreSQL (built-in search)
They provide advanced search tools like keywords, field-restricted search, boolean queries, phrase search and more. Here is another SO post looking into various text search engines: Comparison of full text search engine - Lucene, Sphinx, Postgresql, MySQL?
Sphinx and Solr are both pretty fast at search. Sphinx can search the full database and also supports partial indexing. Solr uses index-based search and is scalable with almost linear performance.
The second most important choice is the database where you store your mails. The mails will be in some format (schema), like fields in a table; it would be plain crazy not to use any format - it is not file search, right? Some search engines require particular databases to work: Sphinx works with SQL databases only, while Solr can be integrated with NoSQL databases.
If you are not worried about scaling issues (thousands of users, GBs of data, needing real-time performance), then you are fine with SQL databases. Otherwise you will have to use a NoSQL database with Solr.
SQL databases (like PostgreSQL) are the simplest to work with, do what you need, and require minimal setup/effort. Connectors will allow you to send a query (mail search) from the browser to your database.
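For example, here is a minimal sketch of PostgreSQL's built-in search over a hypothetical "mails" table with subject and body columns (psycopg2 is assumed as the connector):

import psycopg2

conn = psycopg2.connect("dbname=mailarchive")
cur = conn.cursor()

# A GIN index over the tsvector lets the query below use an index instead of
# scanning every row (in practice you would often store the tsvector in its own column).
cur.execute("""
    CREATE INDEX IF NOT EXISTS mails_fts_idx
    ON mails USING gin (to_tsvector('english', subject || ' ' || body));
""")
conn.commit()

# Boolean query: messages mentioning "lucene" but not "sphinx".
cur.execute("""
    SELECT id, subject
    FROM mails
    WHERE to_tsvector('english', subject || ' ' || body)
          @@ to_tsquery('english', 'lucene & !sphinx')
    LIMIT 20;
""")
print(cur.fetchall())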
Also, you said you use Win2k3; you will have to switch to a Linux distribution to take advantage of these search engines. Win2k3 is slow and does not offer performance comparable to Linux distros.
First, you should think about what you need.
What do you want to search in your e-mail archive? Just full-text search in the e-mails' plain data? You will not get matches in mails that are base64-encoded then, for example. Do you need ‘fielded’ search? E.g.: search only in ‘subject’, ‘from’, ‘to’, ‘body’, ‘attachments’?
How do you want to provide access to search in the mails? Via a web page? On a command line? In some windows program?
If you haven't yet, you should examine what your data looks like. Maybe ‘mbox’ format (one file with the mails' plain text concatenated), ‘maildir’ (a directory with many files, each containing one mail), or something else?
Setting up a search engine means to think about how data needs to be prepared:
E-mails can contain different kinds of data inside. You will have to deal with base64-encoded data, character encodings such as UTF-8, and attachments.
Usenet posts may even be split across multiple e-mail messages.
If you want to search different ‘fields’ (‘Subject’, ‘date’, ‘body’), they need to be extracted.
Data needs to be prepared by linguistic means. You will need to find out which language the mails are in (if there are several) and process the data, e.g. so that a search for mouse matches mice and, perhaps, rats; or cursor and pointing device, depending on the topic of your mailing list.
Also think about:
Will there be updates to the data in future?
Are there deletes (including messages being relabeled later)?
Then compare the products (commercial or open source) that you favour: how much of this do they already provide, and what will you have to write yourself? Be aware that providing a search experience is more than downloading a search engine and dropping in a ton of data.
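To make the preparation points above concrete, here is a minimal sketch using Python's standard mailbox/email modules over an mbox archive (the file name is an assumption); they take care of base64 and quoted-printable transfer encodings:

import mailbox

def extract_fields(msg):
    """Pull the fields worth indexing out of one message."""
    body_parts = []
    for part in msg.walk():
        if part.get_content_type() == "text/plain":
            payload = part.get_payload(decode=True)  # undoes base64 / quoted-printable
            if payload:
                charset = part.get_content_charset() or "utf-8"
                body_parts.append(payload.decode(charset, errors="replace"))
    return {
        "subject": msg.get("Subject", ""),
        "from": msg.get("From", ""),
        "date": msg.get("Date", ""),
        "body": "\n".join(body_parts),
    }

for message in mailbox.mbox("archive.mbox"):
    doc = extract_fields(message)
    # hand `doc` to Sphinx / Solr / PostgreSQL here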

How does Facebook do it?

Have you ever noticed how Facebook says “3 friends and 33 others liked this”? I was wondering what the best approach to do this is. I don’t think going through the friends list and the list of users who “liked this” and comparing them is efficient at all! Do they keep track of this in the database? That would make the database very large.
What do you guys think?
Thanks!
I would guess they outer join their friends table with their likes table to count both regular likes and friend likes at the same time.
With the proper indexes, it wouldn't be a slow query at all. Huge databases aren't necessarily slow, so there's really no reason to not store all of this information in a database. The trick is to make sure the indexes and partitions (if any) are set up well.
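A toy version of that query, with a made-up schema (sqlite3 is used here only so the sketch runs stand-alone):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE likes   (post_id INTEGER, user_id INTEGER);
    CREATE TABLE friends (user_id INTEGER, friend_id INTEGER);
    CREATE INDEX idx_likes_post   ON likes (post_id);
    CREATE INDEX idx_friends_user ON friends (user_id);
""")

# One indexed query answers "N friends and M others liked this".
row = conn.execute("""
    SELECT COUNT(f.friend_id)            AS friend_likes,
           COUNT(*) - COUNT(f.friend_id) AS other_likes
    FROM likes l
    LEFT JOIN friends f
           ON f.user_id = ?            -- the viewer
          AND f.friend_id = l.user_id  -- the liker is one of the viewer's friends
    WHERE l.post_id = ?
""", (42, 1001)).fetchone()

print(f"{row[0]} friends and {row[1]} others liked this")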
Facebook uses Cassandra, a NoSQL database, for at least some things. Here's a more detailed discussion of what some of the bigger social media sites do to solve these problems:
http://www.25hoursaday.com/weblog/2009/09/10/BuildingScalableDatabasesDenormalizationTheNoSQLMovementAndDigg.aspx
Lots of interesting reading in there if you follow the links from it to the Digg blog post, etc.
Yes, they definitely keep it in their database, as they have more than one server that needs to access the data.
As for scalability, I'm sure they use a lot of caching.
Here is an example:
If you have 1 million rows to go through, an index needs only O(log n) ≈ 20 operations (in the worst case) to find what you need.
For 2 million rows, you need only 21 operations (in the worst case) to find what you need.
Every time you double the number of rows to go through, you need only 1 more operation (in the worst case) with an O(log n) index.
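The numbers are easy to check (a trivial sketch; a balanced index needs about log2(n) key comparisons in the worst case, so doubling the table adds only one step):

import math

for rows in (1_000_000, 2_000_000, 4_000_000, 1_000_000_000):
    print(rows, math.ceil(math.log2(rows)))
# 1000000 -> 20, 2000000 -> 21, 4000000 -> 22, 1000000000 -> 30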
They also have a distributed architecture or a clustered database.
Facebook must be using a trigger (which automatically gets executed as soon as an event occurs).
For example, suppose a trigger is created to store the count and names of people who liked the status; it will then get executed every time someone likes your status, and implicitly (automatically) at that.
This makes the operation much easier, and Facebook doesn't have to manually update the database or store a huge database for this. Also, this approach is faster.
In designing social networking software (mothsorchid.com), I found the only way to address this is to pre-cache streams of notifications. One doesn't query the database at page load to count how many friends and others 'liked this'; when someone 'likes' something, that is recorded on the object, and when retrieving the object one can compare with the current user's friend list. If someone updates their profile/makes a comment/etc., it sends notification objects to friends, which are pre-cached in their feeds. This cuts down tremendously on database work at the expense of disk space, but disk space is cheap.
As to how Facebook does this, they use Cassandra DBMS, which is probably a little different to what you have in mind.
Keep in mind that Facebook strongly utilizes memcached, so they're retaining a lot of data in memory and only refreshing it when absolutely necessary. See this blog post for some scalability discussion around this:
http://www.facebook.com/note.php?note_id=39391378919
Each entry that somebody can like probably contains a list of everybody who does like it (all of this is of course in a database). When you view that entry, they match it against your friends list to see which of them are your friends. Voilà.
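As a tiny illustration with made-up ids, that read-time comparison is just a set intersection:

likers     = {101, 102, 205, 333, 404, 512}   # everybody who liked the entry
my_friends = {102, 333, 999}                  # the viewing user's friend list

friend_likers = likers & my_friends
print(f"{len(friend_likers)} friends and {len(likers) - len(friend_likers)} others liked this")
# -> "2 friends and 4 others liked this"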
A lot of this is explained by Facebook's Director of Engineering in this QCon presentation:
http://www.infoq.com/presentations/Facebook-Software-Stack
A great presentation to watch.

What's your solution for free-text search and sort?

AFAIK, MySQL performs really badly at this,
what's your solution?
BTW, what's SO's solution?
EDIT
Please note that free-text search itself is pretty fast in MySQL,
but that's not the case when the result also needs to be sorted on an attribute!
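For illustration, this is the kind of query I mean (the articles table and the pymysql driver here are just placeholders): the MATCH ... AGAINST part is fast on its own, but the ORDER BY on another column forces MySQL to sort the whole matching set.

import pymysql

conn = pymysql.connect(host="localhost", user="app", password="secret", database="blog")
with conn.cursor() as cur:
    cur.execute("""
        SELECT id, title
        FROM articles
        WHERE MATCH(title, body) AGAINST (%s IN NATURAL LANGUAGE MODE)
        ORDER BY created_at DESC  -- this sort is the slow part
        LIMIT 20
    """, ("full text search",))
    print(cur.fetchall())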
Apache SOLR (Lucene) is pretty capable.
I think Stack Overflow uses SQL Server in the background with the built-in full-text search capabilities offered by the database. Oracle offers Oracle interMedia (Oracle 9i), later called Oracle Text, which is very well integrated and efficient. PostgreSQL offers a standard built-in module called tsearch2. I'm not sure about MySQL, but looking at the other 3 databases I've mentioned, full-text is something that is certainly complex and takes time to mature as a feature.
I recommend Sphinx Search: it needs to be configured and requires some modifications to your code, but it's really worth it.
On a forum with 1+ million messages, a full-text search takes just a few milliseconds.
SO uses the full-text search capabilities of Microsoft SQL Server; it's been mentioned several times in the podcast and on the blog (e.g. https://blog.stackoverflow.com/2008/11/sql-2008-full-text-search-problems/). In that blog entry, Jeff mentions possibly moving to Lucene.net in the future.
I'm currently evaluating Haystack and Solr for searching in a couple of projects.

Can OLAP be done in BigTable?

In the past I used to build WebAnalytics using OLAP cubes running on MySQL.
Now, an OLAP cube the way I used it is simply a large table (OK, it was stored a bit smarter than that) where each row is basically a measurement or an aggregated set of measurements. Each measurement has a bunch of dimensions (e.g. which pagename, useragent, IP, etc.) and a bunch of values (e.g. how many pageviews, how many visitors, etc.).
The queries that you run on a table like this are usually of the form (meta-SQL):
SELECT hour, SUM(hits), SUM(bytes)
FROM MyCube
WHERE date='20090914' AND pagename='Homepage' AND browser!='googlebot'
GROUP BY hour
So you get the totals for each hour of the selected day with the mentioned filters.
One snag was that these cubes usually meant a full table scan (for various reasons), and this meant a practical limitation on the size (in MiB) you could make these things.
I'm currently learning the ins and outs of Hadoop and the likes.
Running the above query as a MapReduce job on a BigTable looks easy enough:
Simply make 'hour' the key, filter in the map, and reduce by summing the values.
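A rough sketch of that map/reduce in plain Python (the record field names are assumptions mirroring the meta-SQL above; the shape would be the same in a Hadoop streaming or mrjob job):

from collections import defaultdict

def map_record(record):
    """Filter in the map phase; emit (hour, (hits, bytes)) for matching rows."""
    if (record["date"] == "20090914"
            and record["pagename"] == "Homepage"
            and record["browser"] != "googlebot"):
        yield record["hour"], (record["hits"], record["bytes"])

def reduce_hours(pairs):
    """Sum the values per hour key."""
    totals = defaultdict(lambda: [0, 0])
    for hour, (hits, nbytes) in pairs:
        totals[hour][0] += hits
        totals[hour][1] += nbytes
    return dict(totals)

# In the real job the records would come out of BigTable / HBase scans.
records = [
    {"date": "20090914", "pagename": "Homepage", "browser": "Firefox",
     "hour": 13, "hits": 12, "bytes": 34567},
]
print(reduce_hours(pair for r in records for pair in map_record(r)))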
Can you run a query like I showed above (or at least with the same output) on a BigTable kind of system in 'real time' (i.e. via a user interface where the user gets their answer ASAP) instead of batch mode?
If not; what is the appropriate technology to do something like this in the realm of BigTable/Hadoop/HBase/Hive and the likes?
It has even (kind of) been done.
Last.fm's aggregation/summary engine: http://github.com/zohmg/zohmg
A Google search turned up a Google Code project, "mroll", but it doesn't have anything except contact info (no code, nothing). Still, you might want to reach out to that guy and see what's up. http://code.google.com/p/mroll/
We managed to create low-latency OLAP in HBase by pre-aggregating a SQL query and mapping it into appropriate HBase qualifiers. For more detail, visit the site below.
http://soumyajitswain.blogspot.in/2012/10/hbase-low-latency-olap.html
My answer relates to HBase, but applies equally to BigTable.
Urban Airship open-sourced datacube, which I think is close to what you want. See their presentation here.
Adobe also has a couple of presentations (here and here) on how they do "low-latency OLAP" with HBase.
Andrei Dragomir made an interesting talk about how Adobe performs OLAP functionality with M/R and HBase.
Video: http://www.youtube.com/watch?v=5U3EnfiKs44
Slides: http://hstack.org/hbasecon-low-latency-olap-with-hbase/
If you are looking for a table-scan approach, have you considered Google BigQuery? BigQuery does automatic scale-out on the back end, which gives interactive response times. There is a good session by Jordan Tigani from the 2012 Google I/O event that explains some of the internals.
http://www.youtube.com/watch?v=QI8623HlYd4
It's not MapReduce, but it is geared towards high-speed table scans like what you described.
