Cypher recommendation query performance - performance

I am working with rNeo4j for a recommendation application and I am having some issues writing an efficient query. The goal of the query is to recommend an item to a user, with the stipulation that they have not used the item before.
I want to return the item's name, the nodes on the path (for a visualization of the recommendation), and some additional measures to be able to make the recommendation as relevant as possible. Currently I'm returning the number of users that have used the item before, the length of the path to the recommendation, and a sum of the qCount relationship property.
Current query:
MATCH (subject:User {id: {idQ}), (rec:Item),
p = shortestPath((subject)-[*]-(rec))
WHERE NOT (subject)-[:ACCESSED]->(rec)
MATCH (users:User)-[:ACCESSED]->(rec)
RETURN rec.Name as Item,
count(users) as popularity,
length(p) as pathLength,
reduce(weight = 0, q IN relationships(p)| weight + toInt(q.qCount)) as Strength,
nodes(p) as path
ORDER BY pathLength, Strength DESCENDING, popularity DESCENDING
LIMIT {resultLimit}
The query appears to be working correctly, but it takes too long for the desired application (around 8 seconds). Does anyone have some suggestions for how to improve my query's performance?
I am new to cypher so I apologize if it is something obvious to a more advanced user.

One thing to consider is specifying an upper bound on the variable length path pattern like this: p = shortestPath((subject)-[*2..5]->(rec)) This limits the number of relationships in the pattern to a maximum of 5. Without setting a maximum performance can be poor, as paths of all lengths are considered.
Another thing to consider: by summing the relationship property qCount across all nodes in the path and then sorting by this sum you are looking for the shortest weighted path. Neo4j includes some graph algorithms (such as Dijkstra) for finding these paths efficiently, however they are not exposed via Cypher. See this page for more info.

Related

How do I find the largest cluster in this simple dataset?

I have data on users and their interests. Some users have more interests than others. Data looks like below.
How do I find the largest cluster of users with the most interests in common? Formally, I am trying to maximize (number of users in cluster * number of shared interests in cluster)
In the data below, the largest cluster is:
CORRECT ANSWER
Users: [1,2,3]
Interests: [2,3]
Cluster-value: 3 users x 2 shared interests = 6
DATA
User 1: {3,2}
User 2: {3,2,4}
User 3: {2,3,8}
User 4: {7}
User 5: {7}
User 6: {9}
How do I find the largest cluster of users with the most interests in common?
Here would be a hypothetical data generation process:
import random
# Generate 300 random (user, interest) tupples
def generate_data():
data = []
while len(data) < 300:
data_pt = {"user": random.randint(1,100), "interest":random.randint(50)}
if data_pt not in data:
data.append(data_pt)
return data
def largest_cluster(data):
return None
UPDATE: As somebody pointed out, the data is too parse. In the real case, there would be more users than interests. So I have updated the data generating process.
This looks to me like the kind of combinatorial optimization problem which would fall into the NP-Hard complexity class, which would of course mean that it's intractable to find an exact solution for instances with more than ~30 users.
Dynamic Programming would be the tool you'd want to employ if you were to find a usable algorithm for a problem with an exponential search space like this (here the solution space is all 2^n subsets of users), but I don't see DP helping us here because of the lack of overlapping sub-problems. That is, for DP to help, we have to be able to use and combine solutions to smaller sub-problems into an overall solution in polynomial time, and I don't see how we can do that for this problem.
Imagine you have a solution for a size=k problem, using a limited subset of the users {u1, u2,...uk} and you want to use that solution to find the new solution when you add another user u(k+1). The issue is the solution set in the incrementally larger instance might not overlap at all with the previous solution (it may be an entirely different group of users/interests), so we can't effectively combine solutions to subproblems to get the overall solution. And if instead of trying to just use the single optimal solution for the size k problem to reason about the size k+1 problem you instead stored all possible user combinations from the smaller instance along with their scores, you could of course quite easily do set intersections across these groups' interests with the new user's interests to find the new optimal solution. However, the problem with this approach is of course that the information you have to store would double with iteration, yielding an exponential time algorithm not better than the brute force solution. You run into similar problems if you try to base your DP off incrementally adding interests rather than users.
So if you know you only have a few users, you can use the brute force approach: generating all user combinations, taking a set intersection of each combination's interests, scoring and saving the max score. The best way to approach larger instances would probably be with approximate solutions through search algorithms (unless there is a DP solution I don't see). You could iteratively add/subtracts/swap users to improve the score and climb towards towards an optimum, or use a branch-and-bound algorithm which systematically explores all user combinations but stops exploring any user-subset branches with null interest intersection (as adding additional users to that subset will still produce a null intersection). You might have a lot of user groups with null interest intersections, so this latter approach could be quite quick practically speaking by its pruning off large parts of the search space, and if you ran it without a depth limit it would find the exact solution eventually.
Branch-and-bound would work something like this:
def getLargestCluster((user, interest)[]):
userInterestDict := { user -> {set of user's interests} } # build a dict
# generate and score user clusters
users := userInterestDict.keys() # save list of users to iterate over
bestCluster, bestInterests, bestClusterScore := {}, {}, 0
generateClusterScores()
return [bestCluster, bestInterests bestClusterScore]
# (define locally in getLargestCluster or pass needed values
def generateClusterScores(i = 0, userCluster = {}, clusterInterests = {}):
curScore := userCluster.size * clusterInterests.size
if curScore > bestScore:
bestScore, bestCluster, bestInterests := curScore, curCluster, clusterInterests
if i = users.length: return
curUser := users[i]
curInterests := userInterestDict[curUser]
newClusterInterests := userCluster.size = 0 ? curInterests : setIntersection(clusterInterests, curInterests)
# generate rest subsets with and without curUser (copy userCluster if pass by reference)
generateClusterScores(i+1, userCluster, clusterInterests)
if !newClusterInterests.isEmpty(): # bound the search here
generateClusterScores(i+1, userCluster.add(curUser), newClusterInterests)
You might be able to do a more sophisticated bounding (like if you can calculate that the current cluster score couldn't eclipse your current best score, even if all the remaining users were added to the cluster and the interest intersection stayed the same), but checking for an empty interest intersection is simple enough. This works fine for 100 users, 50 interests though, up to around 800 data points. You could also make it more efficient by iterating over the minimum of |interests| and |users| (to generate fewer recursive calls/combinations) and just mirror the logic for the case where interests is lower. Also, you get more interesting clusters with fewer users/interests

OrientDB Suggestion Algorithm

My database is currently storing information related to user searches. For example we have a User and Term vertex. User has a property called employeeId while Term has a property called "q" which will contain the search query. We draw an edge called 'searched' between them. What I would like to be able to do is suggest other searches to the user which might be related. I have a query which I feel sort of does the trick by doing a traverse from a term and ordering by depth and count. I'm sure there might be better way out there to do the query, or find better results.
User --searched--> Term
In this snippet below I would send in a term that got searched for, and use that in a like search to find terms which also contain that word.
#orient_query = "SELECT $depth, q, in().size() AS count FROM (TRAVERSE * FROM (select from Term where q.toLowerCase() LIKE 'aluminum') STRATEGY BREADTH_FIRST) WHERE #class = 'Term' AND $depth <> 0 ORDER BY $depth ASC, count DESC"
I guess my main concern is that there is a much better way out there to do this and find better results. My original thought process was that we needed a combination of depth in the traversal vs counts of how many times that search came up.
Edit: What I'm also seeing that other searches which contain the word "aluminum" like "aluminum alloy" are showing up really far down the results since their depth is higher.

System Design of Google Trends?

I am trying to figure out system design behind Google Trends (or any other such large scale trend feature like Twitter).
Challenges:
Need to process large amount of data to calculate trend.
Filtering support - by time, region, category etc.
Need a way to store for archiving/offline processing. Filtering support might require multi dimension storage.
This is what my assumption is (I have zero practial experience of MapReduce/NoSQL technologies)
Each search item from user will maintain set of attributes that will be stored and eventually processed.
As well as maintaining list of searches by time stamp, region of search, category etc.
Example:
Searching for Kurt Cobain term:
Kurt-> (Time stamp, Region of search origin, category ,etc.)
Cobain-> (Time stamp, Region of search origin, category ,etc.)
Question:
How do they efficiently calculate frequency of search term ?
In other words, given a large data set, how do they find top 10 frequent items in distributed scale-able manner ?
Well... finding out the top K terms is not really a big problem. One of the key ideas in this fields have been the idea of "stream processing", i.e., to perform the operation in a single pass of the data and sacrificing some accuracy to get a probabilistic answer. Thus, assume you get a stream of data like the following:
A B K A C A B B C D F G A B F H I B A C F I U X A C
What you want is the top K items. Naively, one would maintain a counter for each item, and at the end sort by the count of each item. This takes O(U) space and O(max(U*log(U), N)) time, where U is the number of unique items and N is the number of items in the list.
In case U is small, this is not really a big problem. But once you are in the domain of search logs with billions or trillions of unique searches, the space consumption starts to become a problem.
So, people came up with the idea of "count-sketches" (you can read up more here: count min sketch page on wikipedia). Here you maintain a hash table A of length n and create two hashes for each item:
h1(x) = 0 ... n-1 with uniform probability
h2(x) = 0/1 each with probability 0.5
You then do A[h1[x]] += h2[x]. The key observation is that since each value randomly hashes to +/-1, E[ A[h1[x]] * h2[x] ] = count(x), where E is the expected value of the expression, and count is the number of times x appeared in the stream.
Of course, the problem with this approach is that each estimate still has a large variance, but that can be dealt with by maintaining a large set of hash counters and taking the average or the minimum count from each set.
With this sketch data structure, you are able to get an approximate frequency of each item. Now, you simply maintain a list of 10 items with the largest frequency estimates till now, and at the end you will have your list.
How exactly a particular private company does it is likely not publicly available, and how to evaluate the effectiveness of such a system is at the discretion of the designer (be it you or Google or whoever)
But many of the tools and research is out there to get you started. Check out some of the Big Data tools, including many of the top-level Apache projects, like Storm, which allows for the processing of streaming data in real-time
Also check out some of the Big Data and Web Science conferences like KDD or WSDM, as well as papers put out by Google Research
How to design such a system is challenging with no correct answer, but the tools and research are available to get you started

Use of indexes for multi-word queries in full-text search (e.g. web search)

I understand that a fundamental aspect of full-text search is the use of inverted indexes. So, with an inverted index a one-word query becomes trivial to answer. Assuming the index is structured like this:
some-word -> [doc385, doc211, doc39977, ...] (sorted by rank, descending)
To answer the query for that word the solution is just to find the correct entry in the index (which takes O(log n) time) and present some given number of documents (e.g. the first 10) from the list specified in the index.
But what about queries which return documents that match, say, two words? The most straightforward implementation would be the following:
set A to be the set of documents which have word 1 (by searching the index).
set B to be the set of documents which have word 2 (ditto).
compute the intersection of A and B.
Now, step three probably takes O(n log n) time to perform. For very large A and Bs that could make the query slow to answer. But search engines like Google always return their answer in a few milliseconds. So that can't be the full answer.
One obvious optimization is that since a search engine like Google doesn't return all the matching documents anyway, we don't have to compute the whole intersection. We can start with the smallest set (e.g. B) and find enough entries which also belong to the other set (e.g. A).
But can't we still have the following worst case? If we have set A be the set of documents matching a common word, and set B be the set of documents matching another common word, there might still be cases where A ∩ B is very small (i.e. the combination is rare). That means that the search engine has to linearly go through a all elements x member of B, checking if they are also elements of A, to find the few that match both conditions.
Linear isn't fast. And you can have way more than two words to search for, so just employing parallelism surely isn't the whole solution. So, how are these cases optimized? Do large-scale full-text search engines use some kind of compound indexes? Bloom filters? Any ideas?
As you said some-word -> [doc385, doc211, doc39977, ...] (sorted by rank, descending), I think the search engine may not do this, the doc list should be sorted by doc ID, each doc has a rank according to the word.
When a query comes, it contains several keywords. For each word, you can find a doc list. For all keywords, you can do merge operations, and compute the relevance of doc to query. Finally return the top ranked relevance doc to user.
And the query process can be distributed to gain better performance.
Even without ranking, I wonder how the intersection of two sets is computed so fast by google.
Obviously the worst-case scenario for computing the intersection for some words A, B, C is when their indexes are very big and the intersection very small. A typical case would be a search for some very common ("popular" in DB terms) words in different languages.
Let's try "concrete" and 位置 ("site", "location") in chinese and 極端な ("extreme") in japanese.
Google search for 位置 returns "About 1,500,000,000 results (0.28 seconds) "
Google search for "concrete" returns "About 2,020,000,000 results (0.46 seconds) "
Google search for "極端な" About 7,590,000 results (0.25 seconds)
It is extremly improbable that all three terms would ever appear in the same document, but let's google them:
Google search for "concrete 位置 極端な" returns "About 174,000 results (0.13 seconds)"
Adding a russian word "игра" (game)
Search игра: About 212,000,000 results (0.37 seconds)
Search for all of them: " игра concrete 位置 極端な " returns About 12,600 results (0.33 seconds)
Of course the returned search results are nonsense and they do not contain all the search terms.
But looking at the query time for the composed ones, I wonder if there is some intersection computed on the word indexes at all. Even if everything is in RAM and heavily sharded, computing the intersection of two sets with 1,500,000,000 and 2,020,000,000 entries is O(n) and can hardly be done in <0.5 sec, since the data is on different machines and they have to communicate.
There must be some join computation, but at least for popular words, this is surely not done on the whole word index. Adding the fact that the results are fuzzy, it seems evident that Google uses some optimization of kind "give back some high-ranked results, and stop after 0,5 sec".
How this is implemented, I don't know. Any ideas?
Most systems somehow implement TF-IDF in one way or another. TF-IDF is a product of functions term frequency and inverse document frequency.
The IDF function relates the document frequency to the total number of documents in a collection. The common intuition for this function says that it should give a higher value for terms that appear in few documents and lower value for terms that appear in all documents making them irrelevant.
You mention Google, but Google optimises search with PageRank (links in/out) as well as term frequency and proximity. Google distributes the data and uses Map/Reduce to parallelise operations - to compute PageRank+TF-IDF.
There's a great explanation of the theory behind this in Information Retrieval: Implementing Search Engines chapter 2. Another idea to investigate further is also to look how Solr implements this.
Google does not need to actually find all results, only the top ones.
The index can be sorted by grade first and only then by id. Since the same ID always has the same grade this does not hurt sets intersection time.
So google starts intersection until it finds 10 results , and then does a statistical estimation to tell you how many more results it found.
A worst case is almost impossible.
If all words are "common" then intersection will give the first 10 results very fast. If there is a rare word, then intersection is fast because complexity is O(N long M) where N is the smallest group.
You need to remember that google keeps it's indexes in memory and uses parallel computing.For example U can split the problem into two searches each searching only half of the web, and then marge result and take the best. Google has millions of computes

Algorithm to calculate a page importance based on its views / comments

I need an algorithm that allows me to determine an appropriate <priority> field for my website's sitemap based on the page's views and comments count.
For those of you unfamiliar with sitemaps, the priority field is used to signal the importance of a page relative to the others on the same website. It must be a decimal number between 0 and 1.
The algorithm will accept two parameters, viewCount and commentCount, and will return the priority value. For example:
GetPriority(100000, 100000); // Damn, a lot of views/comments! The returned value will be very close to 1, for example 0.995
GetPriority(3, 2); // Ok not many users are interested in this page, so for example it will return 0.082
You mentioned doing this in an SQL query, so I'll give samples in that.
If you have a table/view Pages, something like this
Pages
-----
page_id:int
views:int - indexed
comments:int - indexed
Then you can order them by writing
SELECT * FROM Pages
ORDER BY
(0.3+LOG10(10+views)/LOG10(10+(SELECT MAX(views) FROM Pages))) +
(0.7+LOG10(10+comments)/LOG10(10+(SELECT MAX(comments) FROM Pages)))
I've deliberately chosen unequal weighting between views and comments. A problem that can arise with keeping an equal weighting with views/comments is that the ranking becomes a self-fulfilling prophecy - a page is returned at the top of the list, so it's visited more often, and thus gets more points, so it's shown at the stop of the list, and it's visited more often, and it gets more points.... Putting more weight on on the comments reflects that these take real effort and show real interest.
The above formula will give you ranking based on all-time statistics. So an article that amassed the same number of views/comments in the last week as another article amassed in the last year will be given the same priority. It may make sense to repeat the formula, each time specifying a range of dates, and favoring pages with higher activity, e.g.
0.3*(score for views/comments today) - live data
0.3*(score for views/comments in the last week)
0.25*(score for views/comments in the last month)
0.15*(score for all views/comments, all time)
This will ensure that "hot" pages are given higher priority than similarly scored pages that haven't seen much action lately. All values apart from today's scores can be persisted in tables by scheduled stored procedures so that the database isn't having to aggregate many many comments/view stats. Only today's stats are computed "live". Taking it one step further, the ranking formula itself can be computed and stored for historical data by a stored procedure run daily.
EDIT: To get a strict range from 0.1 to 1.0, you would motify the formula like this. But I stress - this will only add overhead and is unecessary - the absolute values of priority are not important - only their relative values to other urls. The search engine uses these to answer the question, is URL A more important/relevant than URL B? It does this by comparing their priorities - which one is greatest - not their absolute values.
// unnormalized - x is some page id
un(x) = 0.3*log(views(x)+10)/log(10+maxViews()) +
0.7*log(comments(x)+10)/log(10+maxComments())
// the original formula (now in pseudo code)
The maximum will be 1.0, the minimum will start at 1.0 and move downwards as more views/comments are made.
we define un(0) as the minimum value, i.e. (where views(x) and comments(x) are both 0 in the above formula)
To get a normalized formula from 0.1 to 1.0, you then compute n(x), the normalized priority for page x
(1.0-un(x)) * (un(0)-0.1)
n(x) = un(x) - ------------------------- when un(0) != 1.0
1.0-un(0)
= 0.1 otherwise.
Priority = W1 * views / maxViewsOfAllArticles + W2 * comments / maxCommentsOfAllArticles
with W1+W2=1
Although IMHO, just use 0.5*log_10(10+views)/log_10(10+maxViews) + 0.5*log_10(10+comments)/log_10(10+maxComments)
What you're looking for here is not an algorithm, but a formula.
Unfortunately, you haven't really specified the details of what you want, so there's no way we can provide the formula to you.
Instead, let's try to walk through the problem together.
You've got two incoming parameters, the viewCount and the commentCount. You want to return a single number, Priority. So far, so good.
You say that Priority should range between 0 and 1, but this isn't really important. If we were to come up with a formula we liked, but resulted in values between 0 and N, we could just divide the results by N-- so this constraint isn't really relevant.
Now, the first thing we need to decide is the relative weight of Comments vs Views.
If page A has 100 comments and 10 views, and page B has 10 comments and 100 views, which should have a higher priority? Or, should it be the same priority? You need to decide what's right for your definition of Priority.
If you decide, for example, that comments are 5 times more valuable than views, then we can begin with a formula like
Priority = 5 * Comments + Views
Obviously, this can be generalized to
Priority = A * Comments + B * Views
Where A and B are relative weights.
But, sometimes we want our weights to be exponential instead of linear, like
Priority = Comment ^ A + Views ^ B
which will give a very different curve than the earlier formula.
Similarly,
Priority = Comment ^ A * Views ^ B
will give higher value to a page with 20 comments and 20 views than one with 1 comment and 40 views, if the weights are equal.
So, to summarize:
You really ought to make a spreadsheet with sample values for Views and Comments, and then play around with various formulas until you get one that has the distribution that you are hoping for.
We can't do it for you, because we don't know how you want to value things.
I know it has been a while since this was asked, but I encountered a similar problem and had a different solution.
When you want to have a way to rank something, and there are multiple factors that you're using to perform that ranking, you're doing something called multi-criteria decision analysis. (MCDA). See: http://en.wikipedia.org/wiki/Multi-criteria_decision_analysis
There are several ways to handle this. In your case, your criteria have different "units". One is in units of comments, the other is in units of views. Futhermore, you may want to give different weight to these criteria based on whatever business rules you come up with.
In that case, the best solution is something called a weighted product model. See: http://en.wikipedia.org/wiki/Weighted_product_model
The gist is that you take each of your criteria and turn it into a percentage (as was previously suggested), then you take that percentage and raise it to the power of X, where X is a number between 0 and 1. This number represents your weight. Your total weights should add up to one.
Lastly, you multiple each of the results together to come up with a rank. If the rank is greater than 1, than the numerator page has a higher rank than the denominator page.
Each page would be compared against every other page by doing something like:
p1C = page 1 comments
p1V = page 1 view
p2C = page 2 comments
p2V = page 2 views
wC = comment weight
wV = view weight
rank = (p1C/p2C)^(wC) * (p1V/p2V)^(wV)
The end result is a sorted list of pages according to their rank.
I've implemented this in C# by performing a sort on a collection of objects implementing IComparable.
What several posters have essentially advocated without conceptual clarification is that you use linear regression to determine a weighting function of webpage view and comment counts to establish priority.
This technique is pretty easy to implement for your problem, and the basic concept is described well in this Wikipedia article on linear regression models.
A quick summary of how to apply it to your problem is:
Determine the parameters of the line which best fits the view and comment count data for all your site's webpages, i.e., use linear regression.
Use the line parameters to derive your priority function for the view/count parameters.
Code examples for basic linear regression should not be hard to track down if you don't want to implement it from scratch from basic math formulas (use the web, Numerical Recipes, etc.). Also, any general math software package like Matlab, R, etc., comes with linear regression functions.
The most naive approach would be the following:
Let v[i] the views of page i, c[i] the number of comments for page i, then define the relative view weight for page i to be
r_v(i) = v[i]/(sum_j v[j])
where sum_j v[j] is the total of the v[.] over all pages. Similarly define the relative comment weight for page i to be
r_c(i) = c[i]/(sum_j c[j]).
Now you want some constant parameter p: 0 < p < 1 which indicates the importance of views over comments: p = 0 means only comments are significant, p = 1 means only views are significant, and p = 0.5 gives equal weight.
Then set the priority to be
p*r_v(i) + (1-p)*r_c(i)
This might be over-simplistic but its probably the best starting point.

Resources