I am trying to understand the performance of neo4j in real-time recommendation systems.
The following is a cypher query (taken from their sandbox) which computes top 100 most similar users (in cosine-distance) to the query user "Cynthia Freeman":
MATCH
(p1:User {name: "Cynthia Freeman"})-[x:RATED]->(m:Movie)<-[y:RATED]-(p2:User)
WITH
COUNT(m) AS numberMovies,
SUM(x.rating * y.rating) AS xyDotProduct,
SQRT(REDUCE(xDot = 0.0, a IN COLLECT(x.rating) | xDot + a^2)) AS xLength,
SQRT(REDUCE(yDot = 0.0, b IN COLLECT(y.rating) | yDot + b^2)) AS yLength,
p1, p2
WHERE
numberMovies > 10
RETURN
p1.name, p2.name, xyDotProduct / (xLength * yLength) AS sim
ORDER BY sim DESC
LIMIT 100;
If my understanding is correct, there's no magic behind the LIMIT clause, as the distance computation still needs to be done vs. all other users, so resolving this query in real-time seems a bit of a stretch, unless neo4j is doing something behind the scenes.
In another example, they pre-compute this [:SIMILARITY] relationship between user nodes and store it in the graph, thus querying the top-N most similar users becomes an ordering of nodes. This will intuitively make the graph dense, so there are no storage advantages over simply using a similarity matrix.
Am I missing something fundamental about the way graph databases (and neo4j in particular) work? How can this scale to real-time applications, where there can be tens of thousands of users, and even more products that they interact with?
If you want to do real-time recommendations using some sort of cosine distance metric on tens of thousands of nodes or more, it is probably best to store the precomputed values as relationships.
As for making the graph dense, you can limit the SIMILAR relationship to the top K most similar nodes and also define a similarity cutoff threshold, which makes the graph as sparse as you like. You only need to store relevant results. For example, in a graph of 10 thousand nodes, if every node is connected only to its top 10 most similar nodes, the graph is not dense at all. If you also collapse reciprocal relationships (one from A to B and another from B back to A) into a single relationship, you can reduce the count further. So although 10k*10k relationships are possible (divided by two if you treat the relationships as undirected), you won't store anywhere near those 100 million; you will store 100k at most.
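To make the edge-count argument concrete, here is a minimal Python sketch of top-K pruning with a cutoff. It is not GDS code: the function name `top_k_edges` and the toy similarity values are made up for illustration.

```python
from typing import Dict, Set, Tuple

def top_k_edges(sim: Dict[str, Dict[str, float]], k: int, cutoff: float) -> Set[Tuple[str, str]]:
    """Keep, per node, the k most similar neighbours with similarity >= cutoff.

    Reciprocal pairs (a -> b and b -> a) collapse into one undirected edge.
    """
    edges = set()
    for node, neighbours in sim.items():
        ranked = sorted(neighbours.items(), key=lambda kv: kv[1], reverse=True)
        for other, score in ranked[:k]:
            if score >= cutoff:
                edges.add(tuple(sorted((node, other))))
    return edges

# Hypothetical toy similarities (symmetric), not taken from the sandbox data.
toy = {
    "a": {"b": 0.9, "c": 0.4, "d": 0.1},
    "b": {"a": 0.9, "c": 0.8, "d": 0.2},
    "c": {"a": 0.4, "b": 0.8, "d": 0.7},
    "d": {"a": 0.1, "b": 0.2, "c": 0.7},
}
edges = top_k_edges(toy, k=1, cutoff=0.5)

# Upper bound on stored relationships: nodes * k.
max_edges = 10_000 * 10
```

With k=1 the four toy nodes end up with three undirected edges, because a and b pick each other. The bound for 10k nodes and K=10 is the 100k mentioned above.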
The Graph Data Science library supports two algorithms for calculating cosine distance:
The first naive version calculates the distance between all pairs and can be tuned with topK and similarityCutoff parameters.
Just recently, the optimized implementation of the kNN algorithm was added in the GDS 1.4 pre-release. It uses the implementation described in this article: https://dl.acm.org/doi/abs/10.1145/1963405.1963487
However, for real-time calculation of similarity between 10k+ nodes, it might still take more than the roughly 100 ms you would want as a maximum for a real-time response, so going with the pre-computed similarity relationships makes sense.
Aside from @TomažBratanič's great suggestions, your existing query can be made more efficient. It performs the mathematical calculations for every p1/p2 pair, even for pairs that are later filtered out because the number of shared movies does not exceed 10. Instead, you should filter out unwanted p1/p2 pairs before you do the calculations.
For example:
MATCH
(p1:User {name: "Cynthia Freeman"})-[x:RATED]->(m:Movie)<-[y:RATED]-(p2:User)
WITH
COLLECT({xr: x.rating, yr: y.rating}) AS data,
p1, p2
WHERE
SIZE(data) > 10
WITH
REDUCE(s = 0, d IN data | s + d.xr * d.yr) AS xyDotProduct,
SQRT(REDUCE(xDot = 0.0, a IN data | xDot + a.xr^2)) AS xLength,
SQRT(REDUCE(yDot = 0.0, b IN data | yDot + b.yr^2)) AS yLength,
p1, p2
RETURN
p1.name, p2.name, xyDotProduct / (xLength * yLength) AS sim
ORDER BY sim DESC
LIMIT 100;
Consider a group of events A,B,C and D. These events are related to each other and this can be defined through a set of rules like -
A is followed by B or D and never C.
If B occurs twice in 5 minutes, C is triggered.
.. and so on.
The dataset I have has more than 10,000 rows, where each record consists of the geographic coordinates, a timestamp and the event that occurred at that time. I want to cluster these data points based on the rules mentioned above, but I'm not sure how to do it. The thresholds that prevent all the events from being grouped together could be based on time intervals or spatial distance.
How can these rules be represented and be used as a deciding factor during clustering?
So far I've tried clustering based on the spatial and temporal factors using algorithms like ST-DBSCAN and Clustream but I'd really like to find a way to group the data points based on the event sequences according to the rules.
I am trying to figure out system design behind Google Trends (or any other such large scale trend feature like Twitter).
Challenges:
Need to process large amount of data to calculate trend.
Filtering support - by time, region, category etc.
Need a way to store for archiving/offline processing. Filtering support might require multi dimension storage.
This is what my assumption is (I have zero practical experience with MapReduce/NoSQL technologies)
Each search item from user will maintain set of attributes that will be stored and eventually processed.
As well as maintaining list of searches by time stamp, region of search, category etc.
Example:
Searching for Kurt Cobain term:
Kurt-> (Time stamp, Region of search origin, category ,etc.)
Cobain-> (Time stamp, Region of search origin, category ,etc.)
Question:
How do they efficiently calculate frequency of search term ?
In other words, given a large data set, how do they find top 10 frequent items in distributed scale-able manner ?
Well... finding out the top K terms is not really a big problem. One of the key ideas in this field has been "stream processing", i.e., performing the operation in a single pass over the data and sacrificing some accuracy to get a probabilistic answer. Thus, assume you get a stream of data like the following:
A B K A C A B B C D F G A B F H I B A C F I U X A C
What you want is the top K items. Naively, one would maintain a counter for each item, and at the end sort by the count of each item. This takes O(U) space and O(max(U*log(U), N)) time, where U is the number of unique items and N is the number of items in the list.
In case U is small, this is not really a big problem. But once you are in the domain of search logs with billions or trillions of unique searches, the space consumption starts to become a problem.
So, people came up with the idea of "count sketches" (you can read up more on the count-min sketch page on Wikipedia). Here you maintain a hash table A of length n and create two hashes for each item:
h1(x) = 0 ... n-1 with uniform probability
h2(x) = +1/-1 each with probability 0.5
You then do A[h1(x)] += h2(x). The key observation is that since each value randomly hashes to +/-1, E[ A[h1(x)] * h2(x) ] = count(x), where E is the expected value of the expression, and count is the number of times x appeared in the stream.
Of course, the problem with this approach is that each estimate still has a large variance, but that can be dealt with by maintaining several independent hash tables and taking the average or median of their estimates (the count-min variant, which only ever increments its counters, takes the minimum instead).
With this sketch data structure, you are able to get an approximate frequency of each item. Now, you simply maintain a list of 10 items with the largest frequency estimates till now, and at the end you will have your list.
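The sketch and the running top-K list described above could look roughly like this in Python. This is a minimal illustration, not how any particular search engine does it. The class name `CountSketch`, the table sizes, and the use of salted SHA-256 digests as the hash family are all assumptions made for the example.

```python
import hashlib
from statistics import median

class CountSketch:
    """Count sketch: `rows` independent hash tables of `width` signed counters."""

    def __init__(self, rows: int = 5, width: int = 256):
        self.rows = rows
        self.width = width
        self.table = [[0] * width for _ in range(rows)]

    def _hashes(self, item: str, row: int):
        # Derive h1 (bucket) and h2 (sign) from one salted digest per row.
        digest = hashlib.sha256(f"{row}:{item}".encode()).digest()
        bucket = int.from_bytes(digest[:8], "big") % self.width
        sign = 1 if digest[8] & 1 else -1
        return bucket, sign

    def add(self, item: str) -> None:
        for row in range(self.rows):
            bucket, sign = self._hashes(item, row)
            self.table[row][bucket] += sign

    def estimate(self, item: str) -> float:
        # Each row votes A[h1(x)] * h2(x); the median damps collisions.
        votes = []
        for row in range(self.rows):
            bucket, sign = self._hashes(item, row)
            votes.append(self.table[row][bucket] * sign)
        return median(votes)

def top_k(stream, k: int, sketch: CountSketch):
    """Single pass: update the sketch and keep the k best estimates seen so far."""
    best = {}
    for item in stream:
        sketch.add(item)
        best[item] = sketch.estimate(item)
        if len(best) > k:
            del best[min(best, key=best.get)]  # evict the weakest candidate
    return sorted(best, key=best.get, reverse=True)

stream = "A B K A C A B B C D F G A B F H I B A C F I U X A C".split()
top3 = top_k(stream, 3, CountSketch())
```

With exact counting, the three most frequent items in that stream are A (6), B (5) and C (4). The sketch should recover them unless hash collisions distort the estimates, which becomes more likely as the table gets small relative to the number of unique items.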
How exactly a particular private company does it is likely not publicly available, and how to evaluate the effectiveness of such a system is at the discretion of the designer (be it you or Google or whoever)
But many of the tools and much of the research are out there to get you started. Check out some of the Big Data tools, including many of the top-level Apache projects, like Storm, which allows for the processing of streaming data in real-time
Also check out some of the Big Data and Web Science conferences like KDD or WSDM, as well as papers put out by Google Research
How to design such a system is challenging with no correct answer, but the tools and research are available to get you started
My topic is similarity and clustering of (a bunch of) text(s). In a nutshell: I want to cluster collected texts together so that they end up in meaningful clusters. My approach so far is described below; my problem is in the clustering. The current software is written in PHP.
1) Similarity:
I treat every document as a "bag-of-words" and convert words into vectors. I use
filtering (only "real" words)
tokenization (split sentences into words)
stemming (reduce words to their base form; Porter's stemmer)
pruning (cut off words with too high or too low frequency)
as methods for dimensionality reduction. After that, I'm using cosine similarity (as suggested / described on various sites on the web and here).
The result then is a similarity matrix like this:
    A   B   C   D   E
A   0  30  51  75  80
B   X   0  21  55  70
C   X   X   0  25  10
D   X   X   X   0  15
E   X   X   X   X   0
A…E are my texts and the number is the similarity in percent; the higher, the more similar the texts are. Because sim(A,B) == sim(B,A) only half of the matrix is filled in. So the similarity of Text A to Text D is 75%.
I now want to generate an a priori unknown(!) number of clusters from this matrix. The clusters should group similar items together (up to a certain stop criterion).
I tried a basic implementation myself, which was basically like this (60% as a fixed similarity threshold)
foreach article
    get similar entries where sim > 60
    foreach similar entry
        check if one of the entries already has a cluster number
    if no: assign new cluster number to all similar entries
    if yes: use that number
It worked (somehow), but wasn't good at all and the results were often monster-clusters.
So, I want to redo this and already had a look into all kinds of clustering algorithms, but I'm still not sure which one will work best. I think it should be an agglomerative algorithm, because every pair of texts can be seen as a cluster in the beginning. But the open questions are still what the stop criterion is and whether the algorithm should divide and / or merge existing clusters.
Sorry if some of the stuff seems basic, but I am relatively new in this field. Thanks for the help.
Since you're both new to the field, have an unknown number of clusters and are already using cosine distance I would recommend the FLAME clustering algorithm.
It's intuitive, easy to implement, and has implementations in a large number of languages (not PHP though, largely because very few people use PHP for data science).
Not to mention, it's actually good enough to be used in research by a large number of people. If nothing else you can get an idea of what exactly the shortcomings are in this clustering algorithm that you want to address in moving onto another one.
Just try some. There are so many clustering algorithms out there, nobody will know all of them. Plus, it also depends a lot on your data set and the clustering structure that is there.
In the end, there also may be just this one monster cluster with respect to cosine distance and BofW features.
Maybe you can transform your similarity matrix into a dissimilarity matrix, for example by mapping x to 1/x (or to 100 - x, which avoids dividing by zero where the similarity is 0); your problem then becomes clustering a dissimilarity matrix. I think hierarchical clustering may work. These may help you: hierarchical clustering and Clustering a dissimilarity matrix
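As a rough illustration of the hierarchical idea, the following Python sketch runs average-linkage agglomerative clustering directly on the similarities from the question (merging the most similar pair is the same as merging the least dissimilar pair under 100 - x). The function name `average_linkage` and the 60% stop threshold are assumptions for the example, and average linkage is only one of several possible linkage rules.

```python
from itertools import combinations

def average_linkage(sim, labels, threshold):
    """Merge the two most similar clusters until no pair reaches `threshold`.

    `sim[(a, b)]` holds the pairwise similarity in percent.
    """
    def pair_sim(a, b):
        return sim[(a, b)] if (a, b) in sim else sim[(b, a)]

    def cluster_sim(c1, c2):
        # Average linkage: mean similarity over all cross-cluster pairs.
        return sum(pair_sim(a, b) for a in c1 for b in c2) / (len(c1) * len(c2))

    clusters = [[label] for label in labels]
    while len(clusters) > 1:
        (i, j), best = max(
            (((i, j), cluster_sim(clusters[i], clusters[j]))
             for i, j in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1],
        )
        if best < threshold:
            break  # stop criterion: nothing left that is similar enough
        clusters[i] = clusters[i] + clusters[j]  # i < j, so index i stays valid
        del clusters[j]
    return clusters

# Upper triangle of the matrix from the question.
sim = {
    ("A", "B"): 30, ("A", "C"): 51, ("A", "D"): 75, ("A", "E"): 80,
    ("B", "C"): 21, ("B", "D"): 55, ("B", "E"): 70,
    ("C", "D"): 25, ("C", "E"): 10,
    ("D", "E"): 15,
}
result = average_linkage(sim, "ABCDE", threshold=60)
print(result)  # [['A', 'E'], ['B'], ['C'], ['D']]
```

The threshold acts as the stop criterion, so the number of clusters does not have to be known in advance. Because a merged cluster is compared by its average similarity, A and E joining does not automatically drag in B and D, which is the chaining effect that tends to produce monster clusters with a plain "sim > 60" rule.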
I'm searching for a usable metric for SURF. Like how good one image matches another on a scale let's say 0 to 1, where 0 means no similarities and 1 means the same image.
SURF provides the following data:
interest points (and their descriptors) in query image (set Q)
interest points (and their descriptors) in target image (set T)
using nearest neighbor algorithm pairs can be created from the two sets from above
I was trying something so far but nothing seemed to work too well:
metric using the size of the different sets: d = N / min(size(Q), size(T)) where N is the number of matched interest points. This gives a pretty low rating for pretty similar images, e.g. 0.32 even when 70 interest points were matched from about 600 in Q and 200 in T. I think 70 is a really good result. I was thinking about using some logarithmic scaling so only really low numbers would get low results, but can't seem to find the right equation. With d = log(9*d0+1) I get a result of 0.59, which is pretty good, but it still kind of destroys the power of SURF.
metric using the distances within pairs: I did something like finding the K best matches and adding up their distances. The smaller the distance, the more similar the two images are. The problem with this is that I don't know the maximum and minimum values of an interest point descriptor element, from which the distance is calculated, so I can only rank results relative to each other (which of many inputs is the best). As I said, I would like the metric to lie exactly between 0 and 1. I need this to compare SURF to other image metrics.
The biggest problem with these two is that each excludes the other. One does not take into account the number of matches, the other the distance between matches. I'm lost.
EDIT: For the first one, an equation of log(x*10^k)/k where k is 3 or 4 gives a nice result most of the time. The min is not good: it can make d bigger than 1 in some rare cases, but without it the small results are back.
You can easily create a metric that is the weighted sum of both metrics. Use machine learning techniques to learn the appropriate weights.
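A weighted sum of the two metrics could be sketched as follows in Python. Everything here is illustrative: the function name `surf_score` and the default weight are made up, and the code assumes the descriptors are L2-normalised so that the Euclidean distance between two of them lies in [0, 2]. In practice the weight would be learned from labelled image pairs.

```python
def surf_score(n_matches, size_q, size_t, match_distances, w_count=0.5):
    """Weighted sum of a match-ratio term and a match-quality term, both in [0, 1]."""
    if n_matches == 0 or not match_distances:
        return 0.0
    # Term 1: share of interest points that found a partner, capped at 1.
    ratio = min(n_matches / min(size_q, size_t), 1.0)
    # Term 2: assumes L2-normalised descriptors, so distances lie in [0, 2].
    mean_dist = sum(match_distances) / len(match_distances)
    quality = 1.0 - min(mean_dist, 2.0) / 2.0
    return w_count * ratio + (1.0 - w_count) * quality

# 70 matches between 600 and 200 interest points, hypothetical distances of 0.5.
score = surf_score(70, 600, 200, [0.5] * 70)
```

Both terms are bounded, so the combined score stays in [0, 1] for any weight between 0 and 1.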
What you're describing is closely related to the field of Content-Based Image Retrieval, which is a very rich and diverse field. Googling that will get you lots of hits. While SURF is an excellent general-purpose low- to mid-level feature detector, it is far from sufficient. SURF and SIFT (which SURF was derived from) are great at duplicate or near-duplicate detection but not that great at capturing perceptual similarity.
The best performing CBIR systems usually utilize an ensemble of features optimally combined via some training set. Some interesting detectors to try include GIST (fast and cheap detector best used for detecting man-made vs. natural environments) and Object Bank (a histogram-based detector itself made of 100's of object detector outputs).
What is an algorithm to compare multiple sets of numbers against a target set to determine which ones are the most "similar"?
One use of this algorithm would be to compare today's hourly weather forecast against historical weather recordings to find a day that had similar weather.
The similarity of two sets is a bit subjective, so the algorithm really just needs to differentiate between good matches and bad matches. We have a lot of historical data, so I would like to narrow down the number of days the users need to look through by automatically throwing out sets that aren't close and putting the "best" matches at the top of the list.
Edit:
Ideally the result of the algorithm would be comparable to results using different data sets. For example, using the mean square error as suggested by Niles produces pretty good results, but the numbers generated when comparing temperature cannot be compared to numbers generated with other data such as Wind Speed or Precipitation, because the scale of the data is different. Some of the non-weather data is very large, so the mean square error algorithm generates numbers in the hundreds of thousands, compared to the tens or hundreds generated when using temperature.
I think the mean square error metric might work for applications such as weather comparisons. It's easy to calculate and gives numbers that make sense.
Since you want to compare measurements over time, you can just leave out missing values from the calculation.
For values that are not time-bound or even unsorted, multi-dimensional scatter data it's a bit more difficult. Choosing a good distance metric becomes part of the art of analysing such data.
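For the time-ordered case, a mean square error that skips missing values might look like this in Python. It is a minimal sketch: the function name, the use of `None` for a missing reading, and the sample days are assumptions for illustration.

```python
from typing import Optional, Sequence

def mean_squared_error(a: Sequence[Optional[float]], b: Sequence[Optional[float]]) -> float:
    """MSE over the positions where both series have a value (None = missing)."""
    pairs = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not pairs:
        raise ValueError("no overlapping measurements")
    return sum((x - y) ** 2 for x, y in pairs) / len(pairs)

# Hypothetical hourly readings; lower MSE means a closer match.
forecast = [10.0, 12.0, None, 15.0]
history = {
    "day_1": [11.0, 12.0, 14.0, 13.0],
    "day_2": [20.0, None, 25.0, 30.0],
}
ranked = sorted(history, key=lambda day: mean_squared_error(forecast, history[day]))
```

Dividing by the number of usable pairs rather than the series length keeps days with a few gaps comparable to complete days.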
Use the Pearson correlation coefficient. I figured out how to calculate it in an SQL query which can be found here: http://vanheusden.com/misc/pearson.php
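For readers who prefer code to SQL, here is the same coefficient as a small Python function. This is a sketch written for this discussion rather than a port of the linked query.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long series, in [-1, 1]."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two series of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)

same_shape = pearson([1, 2, 3], [2000, 4000, 6000])   # close to 1.0
opposite = pearson([1, 2, 3], [30, 20, 10])           # close to -1.0
```

The result does not depend on the scale or offset of either series, which addresses the comparability problem across temperature, wind and other data. The flip side is that two days with the same shape but very different levels also score close to 1.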
In finance they use Beta to measure the correlation of 2 series of numbers. E.g., Beta could answer the question "Over the last year, how much would the price of IBM go up on a day that the price of the S&P 500 index went up 5%?" It deals with the percentage of the move, so the 2 series can have different scales.
In my example, the Beta is Covariance(IBM, S&P 500) / Variance(S&P 500).
Wikipedia has pages explaining Covariance, Variance, and Beta: http://en.wikipedia.org/wiki/Beta_(finance)
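The Beta formula above translates directly into Python. The function name and the return values below are hypothetical; the inputs are meant to be percentage moves (returns), not raw prices.

```python
def beta(asset_returns, market_returns):
    """Beta = Covariance(asset, market) / Variance(market), using sample statistics."""
    n = len(asset_returns)
    if n != len(market_returns) or n < 2:
        raise ValueError("need two series of equal length >= 2")
    mean_a = sum(asset_returns) / n
    mean_m = sum(market_returns) / n
    cov = sum((a - mean_a) * (m - mean_m)
              for a, m in zip(asset_returns, market_returns)) / (n - 1)
    var = sum((m - mean_m) ** 2 for m in market_returns) / (n - 1)
    return cov / var

# Hypothetical daily returns: the asset always moves twice as far as the market.
market = [0.01, -0.02, 0.03, 0.00]
asset = [0.02, -0.04, 0.06, 0.00]
b = beta(asset, market)
```

Here `b` comes out at about 2, meaning the asset moved roughly twice as much as the market on the same days.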
Look at statistical sites. I think you are looking for correlation.
As an example, I'll assume you're measuring temp, wind, and precip. We'll call these items "features". So valid values might be:
Temp: -50 to 100F (I'm in Minnesota, USA)
Wind: 0 to 120 Miles/hr (not sure if this is realistic but bear with me)
Precip: 0 to 100
Start by normalizing your data. Temp has a range of 150 units, Wind 120 units, and Precip 100 units. Multiply your wind units by 1.25 and Precip by 1.5 to make them roughly the same "scale" as your temp. You can get fancy here and make rules that weigh one feature as more valuable than others. In this example, wind might have a huge range but usually stays in a smaller range so you want to weigh it less to prevent it from skewing your results.
Now, imagine each measurement as a point in multi-dimensional space. This example measures 3d space (temp, wind, precip). The nice thing is, if we add more features, we simply increase the dimensionality of our space but the math stays the same. Anyway, we want to find the historical points that are closest to our current point. The easiest way to do that is Euclidean distance. So measure the distance from our current point to each historical point and keep the closest matches:
for each historicalpoint
    distance = sqrt(
        pow(currentpoint.temp - historicalpoint.temp, 2) +
        pow(currentpoint.wind - historicalpoint.wind, 2) +
        pow(currentpoint.precip - historicalpoint.precip, 2))
    if distance is smaller than the largest distance in our match collection
        add historicalpoint to our match collection
        remove the match with the largest distance from our match collection
next
This is a brute-force approach. If you have the time, you could get a lot fancier. Multi-dimensional data can be represented as trees like kd-trees or r-trees. If you have a lot of data, comparing your current observation with every historical observation would be too slow. Trees speed up your search. You might want to take a look at Data Clustering and Nearest Neighbor Search.
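The brute-force search described above can be written compactly in Python. This is a sketch under stated assumptions: the feature ranges are the ones from the example, every feature is rescaled to [0, 1] (equivalent, up to a constant factor, to scaling wind and precip up to the temperature range), and the function names and sample days are invented for illustration.

```python
import heapq
from math import sqrt

# Assumed feature ranges from the example above: (min, max) per feature.
RANGES = {"temp": (-50.0, 100.0), "wind": (0.0, 120.0), "precip": (0.0, 100.0)}

def normalise(point):
    """Rescale every feature to [0, 1] so no single feature dominates."""
    return {k: (point[k] - lo) / (hi - lo) for k, (lo, hi) in RANGES.items()}

def distance(p, q):
    """Euclidean distance between two observations in normalised feature space."""
    a, b = normalise(p), normalise(q)
    return sqrt(sum((a[k] - b[k]) ** 2 for k in RANGES))

def closest(current, historical, n=3):
    """Return the n historical points nearest to `current` (brute force)."""
    return heapq.nsmallest(n, historical, key=lambda h: distance(current, h))

today = {"temp": 70, "wind": 10, "precip": 0}
history = [
    {"day": "d1", "temp": 72, "wind": 12, "precip": 0},
    {"day": "d2", "temp": 30, "wind": 40, "precip": 60},
    {"day": "d3", "temp": 68, "wind": 8, "precip": 5},
]
best = [h["day"] for h in closest(today, history, 2)]
```

`heapq.nsmallest` plays the role of the bounded match collection, so the full list of distances never has to be sorted.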
Cheers.
Talk to a statistician.
Seriously.
They do this type of thing for a living.
You write that the "similarity of two sets is a bit subjective", but it's not subjective at all-- it's a matter of determining the appropriate criteria for similarity for your problem domain.
This is one of those situation where you are much better off speaking to a professional than asking a bunch of programmers.
First of all, ask yourself if these are sets, or ordered collections.
I assume that these are ordered collections with duplicates. The most obvious algorithm is to select a tolerance within which numbers are considered the same, and count the number of slots where the numbers are the same under that measure.
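As a sketch, the slot-counting approach is a one-liner in Python; the function name and the sample values are made up for illustration.

```python
def tolerance_matches(a, b, tolerance):
    """Count positions where the two ordered collections agree within `tolerance`."""
    return sum(1 for x, y in zip(a, b) if abs(x - y) <= tolerance)

# Hypothetical hourly temperatures; three of the four slots agree within 1 degree.
matches = tolerance_matches([10, 12, 15, 20], [11, 12, 19, 20.5], tolerance=1)
```

Dividing the count by the number of slots gives a score between 0 and 1 that can be compared across data sets, provided the tolerance is chosen per measure.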
I do have a solution implemented for this in my application, but I'm looking to see if there is something that is better or more "correct". For each historical day I do the following:
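function calculate_score(historical_set, forecast_set)
{
    // Shape: do the two series rise and fall together? Range -1 to 1.
    double c = correlation(historical_set, forecast_set);
    double avg_history = average(historical_set);
    double avg_forecast = average(forecast_set);
    // Level: relative difference of the averages (undefined if avg_forecast is 0).
    double penalty = abs(avg_history - avg_forecast) / abs(avg_forecast);
    return c - penalty;
}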
I then sort all the results from high to low.
Since the correlation is a value from -1 to 1 that says whether the numbers fall or rise together, I then "penalize" that with the percentage difference between the averages of the two sets of numbers.
A couple of times, you've mentioned that you don't know the distribution of the data, which is of course true. I mean, tomorrow there could be a day that is 150 degree F, with 2000km/hr winds, but it seems pretty unlikely.
I would argue that you have a very good idea of the distribution, since you have a long historical record. Given that, you can put everything in terms of quantiles of the historical distribution, and do something with absolute or squared difference of the quantiles on all measures. This is another normalization method, but one that accounts for the non-linearities in the data.
Normalization in any style should make all variables comparable.
As an example, let's say that it's a windy, hot day: that might have a temp quantile of .75, and a wind quantile of .75. The .76 quantile for heat might be 1 degree away, and the one for wind might be 3kmh away.
This focus on the empirical distribution is easy to understand as well, and could be more robust than normal estimation (like Mean-square-error).
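A minimal Python sketch of the quantile idea, assuming the historical record for each measure is available as a plain list. The function names and the sample numbers are hypothetical.

```python
from bisect import bisect_right

def empirical_quantile(history_sorted, value):
    """Fraction of historical observations <= value (empirical CDF)."""
    return bisect_right(history_sorted, value) / len(history_sorted)

def quantile_distance(day_a, day_b, history):
    """Sum of absolute quantile differences over all measures."""
    total = 0.0
    for measure, values in history.items():
        ordered = sorted(values)
        total += abs(empirical_quantile(ordered, day_a[measure])
                     - empirical_quantile(ordered, day_b[measure]))
    return total

# Hypothetical historical record per measure.
history = {"temp": [50, 60, 70, 80], "wind": [0, 5, 10, 40]}
day_a = {"temp": 70, "wind": 10}
day_b = {"temp": 80, "wind": 40}
d = quantile_distance(day_a, day_b, history)
```

Both measures contribute 0.25 here even though the raw gaps are 10 degrees and 30 km/h, which is the sense in which quantiles put all measures on one scale.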
Are the two data sets ordered, or not?
If ordered, are the indices the same? equally spaced?
If the indices are common (temperatures measured on the same days but at different locations, for example), you can regress the first data set against the second,
and then test that the slope is equal to 1, and that the intercept is 0.
http://stattrek.com/AP-Statistics-4/Test-Slope.aspx?Tutorial=AP
Otherwise, you can do two regressions, of the y-values against their indices. http://en.wikipedia.org/wiki/Correlation. You'd still want to compare slopes and intercepts.
====
If unordered, I think you want to look at the cumulative distribution functions
http://en.wikipedia.org/wiki/Cumulative_distribution_function
One relevant test is Kolmogorov-Smirnov:
http://en.wikipedia.org/wiki/Kolmogorov-Smirnov_test
You could also look at
Student's t-test,
http://en.wikipedia.org/wiki/Student%27s_t-test
or a Wilcoxon signed-rank test http://en.wikipedia.org/wiki/Wilcoxon_signed-rank_test
to test equality of means between the two samples.
And you could test for equality of variances with a Levene test http://www.itl.nist.gov/div898/handbook/eda/section3/eda35a.htm
Note: it is possible for dissimilar sets of data to have the same mean and variance -- depending on how rigorous you want to be (and how much data you have), you could consider testing for equality of higher moments, as well.
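For the unordered case, the two-sample Kolmogorov-Smirnov statistic is simple enough to compute by hand. The Python sketch below returns only the statistic (the largest gap between the two empirical CDFs), not a p-value; the function name is an assumption.

```python
from bisect import bisect_right

def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: max gap between empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    gap = 0.0
    # Both CDFs are step functions, so checking every observed value is enough.
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap

identical = ks_statistic([1, 2, 3], [1, 2, 3])        # 0.0
disjoint = ks_statistic([1, 2, 3], [10, 11, 12])      # 1.0
shifted = ks_statistic([1, 2, 3, 4], [3, 4, 5, 6])    # 0.5
```

The statistic is always between 0 and 1 regardless of the units of the data, so it can be used directly for ranking candidate days.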
Maybe you can see your set of numbers as a vector (each number of the set being a component of the vector).
Then you can simply use dot product to compute the similarity of 2 given vectors (i.e. set of numbers).
You might need to normalize your vectors.
More : Cosine similarity
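A short Python sketch of the normalised dot product (cosine similarity); the function name is chosen for illustration.

```python
from math import sqrt

def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their lengths."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same number of components")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)

parallel = cosine_similarity([1, 2, 3], [2, 4, 6])    # close to 1.0
orthogonal = cosine_similarity([1, 0], [0, 1])        # 0.0
```

Because of the normalisation, only the direction of the vectors matters, so a day whose values are all twice as large still scores close to 1.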