Efficient algorithm that takes a Twitter user and finds top users by order of how many of his followers they follow

The title is very wordy. So I'll explain with an example.
We have a database of 10,000 Twitter users, each following up to 2000 users. The algorithm takes as input one random, never-before-seen user (including the people that follow him) and returns the Twitter users from the database, ordered by how many of his followers they follow.
We have:
User A follows 1,2,3,4
User B follows 3,4,5,6
User C follows 4,8,9
We enter user X who has users 3,4,5 following him.
The algorithm should return:
B: 3 matches (3,4,5)
A: 2 matches (3,4)
C: 1 match (4)

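Store the data as a sparse integer matrix A of size 10^5x10^5, with A[i,j] = 1 when user j follows user i. Then, given a user i, compute A[i,] * A (vector-matrix multiplication): entry u of the result counts how many followers of i are followed by u. Then sort by that count.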

Assuming you have a table structure similar to this:
Table Users
    Id (PK, uniqueidentifier, not null)
    Username (nvarchar(50), not null)
Table UserFollowers
    UserId (FK, uniqueidentifier, not null)
    FollowerId (uniqueidentifier, not null)
You can use the following query to count, for each user, how many followers of the queried user they follow:
SELECT Users_Inner.Username, COUNT(Users_Inner.Id) AS [Total Common Parents]
FROM Users INNER JOIN
UserFollowers ON Users.Id = UserFollowers.FollowerId INNER JOIN
UserFollowers AS UserFollowers_Inner ON UserFollowers.FollowerId = UserFollowers_Inner.UserId INNER JOIN
Users AS Users_Inner ON UserFollowers_Inner.FollowerId = Users_Inner.Id
WHERE (UserFollowers.UserId = 'BD34A1FF-FCF5-4D35-B8A3-EFFB1587A874')
GROUP BY Users_Inner.Username

Would something like this work?
for f in followers(x)
    for ff in followers(f)
        count[ff]++ // assume it is initially 0
sort the ff-s by their counts
Unlike the matrix solution, the complexity of this is proportional to the number of people involved rather than the number of users on Twitter.
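As a rough sketch of the loop above in Python, assuming the database fits in memory as a dict that maps each user to the set of ids they follow (the function name and data layout are illustrative, not from the question). It first inverts that map so that followers(f) becomes a dictionary lookup, then counts and sorts:

```python
from collections import Counter


def rank_by_shared_followers(followers_of_x, follows):
    """Rank database users by how many of X's followers they follow.

    followers_of_x: iterable of user ids that follow X.
    follows: dict mapping each database user to the set of ids they follow
             (an assumed in-memory layout).
    """
    # Invert the map once: followed id -> database users who follow it.
    followed_by = {}
    for user, followed in follows.items():
        for target in followed:
            followed_by.setdefault(target, set()).add(user)

    counts = Counter()
    for f in set(followers_of_x):
        for ff in followed_by.get(f, ()):
            counts[ff] += 1

    # Highest count first; ties broken by user id for a stable order.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


# The example from the question: X is followed by 3, 4 and 5.
follows = {"A": {1, 2, 3, 4}, "B": {3, 4, 5, 6}, "C": {4, 8, 9}}
ranking = rank_by_shared_followers([3, 4, 5], follows)
print(ranking)
```

The inversion costs one pass over the database and can be reused across queries; each query then only touches the followers of X and the users who follow them.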


Find Maximum Columns in a grouped row. [using PIG]

I have to find the maximum number of posts created by one person in a given data set, where each post comes with user id, display name, age, comments count, view count, date, score and title.
To get the maximum number of posts, I think we can group by user id. After grouping, I need to find the id whose group has the most entries. I don't understand how to solve the latter part. Please help.
As far as I understand your question, the following should help.
Try this code:
a = load '<path>' using PigStorage(',') as (userId,displayName,age,commentsCount,viewCount,date,score,title);
b = group a by userId;
c = foreach b generate group,COUNT(a.title);
dump c;
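The script above yields one count per user; the remaining step is to pick the largest count. A minimal Python sketch of the same group, count, and take-the-maximum logic, assuming each post is a dict with a userId key (the sample rows and the function name are made up for illustration):

```python
from collections import Counter


def users_with_most_posts(posts):
    """Return (user ids with the highest post count, that count)."""
    # Equivalent of GROUP BY userId followed by COUNT.
    counts = Counter(post["userId"] for post in posts)
    if not counts:
        return [], 0
    top = max(counts.values())
    # Several users can share the maximum, so return all of them.
    return sorted(u for u, n in counts.items() if n == top), top


# Hypothetical sample data: only the userId column matters here.
posts = [
    {"userId": 1, "title": "a"},
    {"userId": 2, "title": "b"},
    {"userId": 1, "title": "c"},
    {"userId": 3, "title": "d"},
]
top_users, top_count = users_with_most_posts(posts)
print(top_users, top_count)
```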

Slow Neo4j query despite indices

Here I'm trying to find all Twitter users who are followed by and who follow any members of some group G:
MATCH (x:User)-[:FOLLOWS]->(t:User)-[:FOLLOWS]->(y:User)
WHERE (x.screen_name IN {{G_SCREEN_NAMES}} OR x.id IN {{G_IDS}})
AND (y.screen_name IN {{G_SCREEN_NAMES}} OR y.id IN {{G_IDS}})
But for the group G I sometimes have their screen names and sometimes have their ids, thus the OR clause above. Unfortunately this query is long running and doesn't appear to ever return.
I have indices and constraints on both id and screen_name:
ON :User(screen_name) ONLINE (for uniqueness constraint)
ON :User(id) ONLINE (for uniqueness constraint)
ON (user:User) ASSERT user.screen_name IS UNIQUE
ON (user:User) ASSERT user.id IS UNIQUE
If I get rid of the OR clause (for instance if I happen to have all screen_names or all ids for group G) then the query runs quite fast.
I'm using neo4j-community-2.1.3 on a Mac. My graph has 286039 nodes, all of which have the User label.
Any ideas to improve this? Otherwise I'll have to chop this up into 4 queries to get all possible combinations of members. That is even more problematic because I want to keep track of how commonly a user appears in a G-->user-->G relationship, and I'll need to do a lot of extra bookkeeping if the counts are spread among 4 different queries.
I created an issue related to this: https://github.com/neo4j/neo4j/issues/2834
I ended up using
MATCH (x:User) WHERE x.screen_name IN ["apple","banana","coconut"]
WITH collect(id(x)) as x_ids
MATCH (x:User) WHERE x.id in [12345,98765]
WITH x_ids+collect(id(x)) as x_ids
MATCH (y:User) WHERE y.screen_name IN ["apple","banana","coconut"]
WITH x_ids,collect(id(y)) as y_ids
MATCH (y:User) WHERE y.id in [12345,98765]
WITH x_ids,y_ids+collect(id(y)) as y_ids
MATCH (x:User)-[:FOLLOWS]->(t:User)-[:FOLLOWS]->(y:User)
WHERE id(x) in x_ids AND id(y) in y_ids
RETURN count(*) as c, t.screen_name,t.id
LIMIT 1000
But this basically represents a hack to get around a place where neo4j isn't using the indices that it could be.
I guess the query does not make use of the indexes because of the OR condition. You can verify this by prefixing the query with PROFILE and running it in neo4j-shell.
If the profile shows no index usage, you might split the query into two parts. The first one fetches the combined list of node ids; instead of the OR, we do a UNION of two queries (each using an index lookup):
MATCH (x:User) WHERE x.screen_name in {G_SCREEN_NAMES} RETURN id(x) as ids UNION
MATCH (x:User) WHERE x.id in {G_IDS} RETURN id(x) as ids
On the client side, use the list of node ids as parameter for the next query:
MATCH (x:User)-[:FOLLOWS]->(t)-[:FOLLOWS]->(y)
WHERE id(x) in {ids} AND id(y) in {ids}
I've intentionally removed the labels for t and y, on the assumption that you can only follow User nodes and no other kind of node. This removes an unnecessary label check.
How about this query?
MATCH (x:User)
WHERE x.screen_name IN {{G_SCREEN_NAMES}} OR x.id IN {{G_IDS}}
MATCH (x)-[:FOLLOWS]->(t:User)
MATCH (t)-[:FOLLOWS]->(y:User)
WHERE y.screen_name IN {{G_SCREEN_NAMES}} OR y.id IN {{G_IDS}}

Relational algebra for one-to-many relations

Suppose I have the following relations:
Academic(academicID (PK), forename, surname, room)
Contact (contactID (PK), forename, surname, phone, academicNO (FK))
I am using Java & I want to understand the use of the notation.
Π( relation, attr1, ... attrn ) means project the n attributes out of the relation.
σ( relation, condition) means select the rows which match the condition.
⊗(relation1,attr1,relation2,attr2) means join the two relations on the named attributes.
relation1 – relation2 is the difference between two relations.
relation1 ÷ relation2 divides one relation by another.
Examples I have seen use three tables. I want to know the logic when only two tables are involved (academic and contact) as opposed to three (academic, contact, owns).
I am using this structure:
LessNumVac = Π( σ( job, vacancies < 2 ), type )
AllTypes = Π( job, type )
AllTypes – LessNumVac
How do I construct the algebra for:
List the names of all contacts owned by academic "John"
List the names of all contacts who are owned by academic "John".
For that, you would join the Academic and Contact relations, filter for John, and project the name attributes. For efficiency, select John before joining:
π_{forename, surname}(Contact ⋈_{academicNO = academicID} (π_{academicID}(σ_{forename = "John"}(Academic))))
You have to extend your set of operations with natural join ⋈, left outer join ⟕ and/or right outer join ⟖ to express joins.
There is a great Wikipedia article about Relational Algebra. You should definitely read that one!
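To make the select-before-join expression concrete, here is a small Python sketch that models relations as lists of dicts. The helper names (select, project, join) and the sample rows are invented for illustration; only the attribute names come from the question.

```python
def select(relation, predicate):
    """σ: keep the rows for which the predicate holds."""
    return [row for row in relation if predicate(row)]


def project(relation, *attrs):
    """π: keep only the named attributes, dropping duplicate rows."""
    seen, out = set(), []
    for row in relation:
        key = tuple(row[a] for a in attrs)
        if key not in seen:
            seen.add(key)
            out.append(dict(zip(attrs, key)))
    return out


def join(left, left_attr, right, right_attr):
    """⋈: pair rows whose join attributes are equal.

    Attributes with the same name would collide in the merged dict, which is
    one more reason to project Academic down to academicID before joining.
    """
    return [{**l, **r} for l in left for r in right
            if l[left_attr] == r[right_attr]]


# Hypothetical sample rows.
academic = [
    {"academicID": 1, "forename": "John", "surname": "Smith", "room": "A1"},
    {"academicID": 2, "forename": "Mary", "surname": "Jones", "room": "B2"},
]
contact = [
    {"contactID": 10, "forename": "Ann", "surname": "Lee", "phone": "555-0100", "academicNO": 1},
    {"contactID": 11, "forename": "Bob", "surname": "Ray", "phone": "555-0101", "academicNO": 2},
    {"contactID": 12, "forename": "Cy", "surname": "Poe", "phone": "555-0102", "academicNO": 1},
]

johns = project(select(academic, lambda r: r["forename"] == "John"), "academicID")
result = project(join(contact, "academicNO", johns, "academicID"), "forename", "surname")
print(result)
```

With only two tables, the foreign key academicNO plays the role that a separate "owns" relation plays in the three-table examples, so a single join is enough.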

Pig: Pulling individual fields out after a GROUP

In PigLatin, I want to pull the other fields out of a record I want to select because of an aggregate, such as MAX.
I'm having trouble explaining the problem, so here is an example. Let's say I want to grab the name of the oldest person at a household:
Relation A has four columns: (name, address, zipcode, age)
B = GROUP A BY (address, zipcode); -- group by the address
-- generate the address, the person's age, but how do I grab that person's name?
C = FOREACH B GENERATE FLATTEN(group), MAX(A.age), ??? Name ???;
How do I generate the name of the person with the MAX age?
The problem with your logic is that there can be more than one person with the MAX(age); then you have to GROUP BY (name, address, age). But to give you a quick answer, I will write something that gets only one of the people with the max age. (I am not sure it's the optimal way, though.)
Be careful with frail's accepted answer: it has undesirable behavior if the number in the LIMIT command is higher than 1. In that case the output would be a cross product between all ages and names, due to the last two FLATTEN calls. So if the value in the LIMIT is N, there would be N^2 output rows instead of the intended N.
It is much safer to use the following GENERATE line, which gives exactly the same result as the accepted answer when 'LIMIT 1' is used:
GENERATE FLATTEN(group) AS (address, zipcode), FLATTEN(DB.(age, name)) AS (age, name);
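The N^2 versus N difference can be modelled outside Pig. The following Python sketch only imitates the semantics described above (flattening two bags separately behaves like a cross product, flattening one bag of tuples keeps each pair together); the sample household is hypothetical.

```python
from itertools import product

# Hypothetical top-2 bag for one household, i.e. what LIMIT 2 would keep.
top = [(71, "Ann"), (68, "Bob")]

# Splitting the bag into an "age" bag and a "name" bag, as two separate
# FLATTEN calls effectively do.
ages = [age for age, _ in top]
names = [name for _, name in top]

# Two independent flattens: every age is paired with every name (N^2 rows).
separate = list(product(ages, names))

# One flatten over (age, name) tuples: each pair stays intact (N rows).
together = list(top)

print(len(separate), separate)
print(len(together), together)
```

Here the separately flattened version produces rows such as (71, "Bob") that never existed in the input.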

SQL to Relational Algebra

How do I go about writing the relational algebra for this SQL query?
Select patient.name,
From patient, medicine, prescription
Where prescription.frequency = "3perday"
AND prescription.end-date="08-06-2010"
AND canceled = "Y"
cancelled (Y/N))
I will just point out the operators you should use.
Projection (π)
π(a1,...,an): The result is defined as the set that is obtained when all tuples in R are restricted to the set {a1,...,an}.
For example π(name) on your patient table would be the same as SELECT name FROM patient
Selection (σ)
σ(condition): Selects all those tuples in R for which condition holds.
For example σ(frequency = "1perweek") on your prescription table would be the same as SELECT * FROM prescription WHERE frequency = "1perweek"
Cross product (X)
R X S: The result is the cross product between R and S.
For example patient X prescription would be SELECT * FROM patient,prescription
You can combine these operators to solve your exercise. Try posting your attempt if you have any issues.
Note: I did not include the natural join as there are no joins. The cross product should be enough for this exercise.
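As a hedged illustration of how the three operators combine, the Python sketch below builds the cross product of three tables, applies the selection, and projects the patient name. The table layouts are not given in the question, so every column other than those named in the query, and all sample rows, are assumptions.

```python
from itertools import product


def cross(*named_relations):
    """R X S X ...: every combination of rows, keys qualified as 'table.attr'."""
    names = [name for name, _ in named_relations]
    relations = [rows for _, rows in named_relations]
    return [
        {f"{n}.{k}": v for n, row in zip(names, combo) for k, v in row.items()}
        for combo in product(*relations)
    ]


# Hypothetical rows; the real schemas are unknown.
patient = [{"name": "Ann"}, {"name": "Bob"}]
medicine = [{"name": "aspirin"}]
prescription = [
    {"frequency": "3perday", "end_date": "08-06-2010", "canceled": "Y"},
    {"frequency": "1perweek", "end_date": "08-06-2010", "canceled": "N"},
]

combined = cross(("patient", patient), ("medicine", medicine),
                 ("prescription", prescription))

# σ: the three conditions from the WHERE clause.
matching = [
    row for row in combined
    if row["prescription.frequency"] == "3perday"
    and row["prescription.end_date"] == "08-06-2010"
    and row["prescription.canceled"] == "Y"
]

# π: keep only the patient name, as a set of distinct values.
names = sorted({row["patient.name"] for row in matching})
print(len(combined), names)
```

Because nothing ties a prescription to a patient, every patient is paired with the one matching prescription and all of them appear in the result.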
An example would be something like the following. This is only if you accidentally left out the joins between patient, medicine, and prescription. If not, you will be looking for cross product (which seems like a bad idea in this case...) as mentioned by Lombo. I gave example joins that may fit your tables marked as "???". If you could include the layout of your tables that would be helpful.
I also assume that canceled comes from prescription since it is not prefixed.
Edit: If you need it in standard RA form, it's pretty easy to get from a diagram.
Diagram: http://img532.imageshack.us/img532/8589/diagram1b.jpg
