Time-Aware Social Graph DS/Queries

Classic social networks can be represented as a graph or adjacency matrix.
With such a representation one can easily compute:
the shortest path between two participants
reachability from A -> B
general statistics (reciprocity, average connectivity, etc.)
and so on.
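For the static case these queries are one-liners in any general-purpose graph library; a minimal sketch (networkx is assumed here purely for illustration):

# Static (time-unaware) case, sketched with networkx purely for illustration.
import networkx as nx

G = nx.Graph()
G.add_edges_from([("A", "B"), ("B", "C"), ("C", "D")])

print(nx.shortest_path(G, "A", "D"))  # shortest path between two participants
print(nx.has_path(G, "A", "D"))       # reachability from A -> D
print(nx.density(G))                  # one of many general statistics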
Is there an ideal data structure (or a modification to a graph/matrix) that enables easy computation of the above while being time-aware?
For example,
Input
t = 0...100
A <-> B (while t = 0...10)
B <-> C (while t = 5...100)
C <-> A (while t = 50...100)
Sample Queries
Is A associated with B at any time? (yes)
Is A associated with B while B is associated with C? (yes, t = 5...10)
Is C ever reachable from A? (yes, e.g. at t = 5)
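To make the sample concrete, here is a naive sketch (plain interval-labelled edges, not the persistent structure discussed in the answer below) that reproduces the three queries:

# Naive sketch: each edge carries the interval during which it exists.
from collections import defaultdict

edges = defaultdict(list)  # node -> list of (neighbor, t_start, t_end), inclusive

def add_edge(u, v, t0, t1):
    edges[u].append((v, t0, t1))
    edges[v].append((u, t0, t1))

add_edge("A", "B", 0, 10)
add_edge("B", "C", 5, 100)
add_edge("C", "A", 50, 100)

def associated(u, v, t0=0, t1=100):
    # Is there a direct u-v edge whose interval overlaps [t0, t1]?
    return any(n == v and s <= t1 and t0 <= e for n, s, e in edges[u])

def reachable_at(u, v, t):
    # BFS restricted to edges that are valid at time t.
    seen, frontier = {u}, [u]
    while frontier:
        x = frontier.pop()
        if x == v:
            return True
        for n, s, e in edges[x]:
            if s <= t <= e and n not in seen:
                seen.add(n)
                frontier.append(n)
    return False

print(associated("A", "B"))                                         # yes
print(associated("A", "B", 5, 10) and associated("B", "C", 5, 10))  # yes, t = 5...10
print(any(reachable_at("A", "C", t) for t in range(101)))           # yes, e.g. at t = 5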

What you're looking for is an explicitly persistent data structure. There's a fair body of literature on this, but it's not that well known. Chris Okasaki wrote a pretty substantial book on the topic. Have a look at my answer to this question.
Given a full implementation of something like Driscoll et al.'s node-splitting structure, there are a few different ways to set up your queries. If you want to know what was true during a particular time range, you only examine nodes containing data about that range. If you want to know during which time range something was true, you start searching and progressively tighten your bounds as you explore each new node. Just remember that your results might not always be contiguous: consider two people who start dating, break up, and get back together.
I would guess that there's probably at least one publication worth of unexplored territory in how to do interesting queries over persistent graphs, if not much more.

Related

Need a better explanation of Communication Cost Model for MapReduce than in MMDS

I was going through the MMDS book, which has an online MOOC of the same name. I'm having trouble understanding the Communication Cost Model and the join-operation calculations in Topic 2.5, and I'm surprised by how poorly organized the book is, as the MOOC covers the same topic within "Advanced Topics/Computation Complexity of MapReduce" at the end of the course.
There's an exercise question (the worked example did not help at all) that goes like:
We wish to take the join R(A,B) |><| S(B,C) |><| T(A,C) as a single MapReduce process, in a way that minimizes the communication cost. We shall use 512 Reduce tasks, and the sizes of relations R, S, and T are 2^20 = 1,048,576, 2^17 = 131,072, and 2^14 = 16,384, respectively. Compute the number of buckets into which each of the attributes A, B, and C are to be hashed. Then, determine the number of times each tuple of R, S, and T is replicated by the Map function.
Could you walk me through it? I don't know how the book jumps from the simple R + S + T cost to Lagrangean multipliers without deliberating on the intermediate steps.
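For reference, here is a sketch of the Lagrangean-multiplier step, assuming MMDS's cost model in which each tuple of R(A,B) must be sent to all c buckets of C, each tuple of S(B,C) to all a buckets of A, and each tuple of T(A,C) to all b buckets of B, so the cost r*c + s*a + t*b is minimized subject to a*b*c = k:

# Sketch of the Lagrangean-multiplier calculation for the 3-way join (MMDS cost model).
r, s, t = 2**20, 2**17, 2**14   # |R|, |S|, |T|
k = 512                         # number of Reduce tasks, k = a*b*c

# Setting the partial derivatives of r*c + s*a + t*b - lam*(a*b*c - k) to zero
# gives s*a = t*b = r*c = lam*k; multiplying the three equations yields:
lam = (r * s * t / k**2) ** (1 / 3)
a, b, c = lam * k / s, lam * k / t, lam * k / r

print(round(a), round(b), round(c))   # 8 64 1 -> buckets for attributes A, B, C
print(round(a * b * c))               # 512, matches the number of reducers
# Each R tuple is replicated c times, each S tuple a times, each T tuple b times.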

Frequent itemset best algorithm and library

I started working on machine learning projects a few days ago and I have the following situation:
I have a database of itineraries (an itinerary is a set of destinations that were selected together as part of a trip) and I want to identify whether a destination is going to be selected as part of a trip given the other selected destinations. Here is an example, given that A, B, C, D are destinations:
A, B -> C
A, D, C -> B
I think this is a recommender system problem and I studied techniques to approach a solution.
I tried using WEKA's Apriori and FPGrowth, but I have not been able to generate a result: I have 91 items and 12,000 transactions (so an ARFF file with 91 columns and 12,000 rows of TRUE and FALSE values), and the program never finishes, without ever consuming more than 5 GB of RAM (I waited 30 hours for the algorithm to run on a last-gen Core i7 PC with 12 GB of RAM). Also, I don't see any option to select only the rules that have a TRUE value as the implication (I need this because I want to see whether someone will travel to X given the fact that other people travel to Y).
So, are there any other techniques or approaches that can be used to achieve the result I am expecting? I want as output a file with the "rules", i.e. the sets of items that "imply" another set of items, together with the probability of that "recommendation".
Example:
A, B -> C ; 90%
verbose: "People who travel to Rome and Florence travel to Milan with a probability (or other measure) of 90%"
Thanks!
Something about your implementation of the Apriori algorithm doesn't seem right. Try using another implementation of the Apriori algorithm, or check the current one. For the stated purpose of generating association rules between the destinations, the Apriori or the faster FP-Growth algorithm are just fine. Maybe this helps for a general understanding: R - association rules - apriori
Actually, the implementation in Weka is quite inefficient. You could check the SPMF data mining library in Java, which offers efficient implementations of algorithms for pattern mining. It actually has more than 100 algorithms, including Apriori, FPGrowth and many others. I would recommend using FPGrowth, which is very fast and memory efficient, but you could also check the other algorithms. By the way, I am the founder of the library.
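As a library-independent sanity check, the kind of output you describe can be brute-forced for tiny rule sizes (illustrative only; at 91 items and 12,000 transactions you still want Apriori/FP-Growth pruning):

# Brute-force confidence of rules with a single-item consequent (illustration only).
from itertools import combinations

transactions = [{"A", "B", "C"}, {"A", "B", "C", "D"}, {"A", "B"}, {"B", "C"}]

def support(itemset):
    return sum(itemset <= t for t in transactions) / len(transactions)

items = sorted(set().union(*transactions))
min_conf = 0.6
for size in (1, 2):
    for antecedent in combinations(items, size):
        for consequent in items:
            if consequent in antecedent or support(set(antecedent)) == 0:
                continue
            conf = support(set(antecedent) | {consequent}) / support(set(antecedent))
            if conf >= min_conf:
                print("%s -> %s ; %.0f%%" % (", ".join(antecedent), consequent, conf * 100))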

Multiple depot vehicle scheduling

I have been playing around with algorithms and ILP for the single depot vehicle scheduling problem (SDVSP) and now want to extend my knowledge towards the multiple depot vehicle scheduling problem (MDVSP), as I would like to use this knowledge in a project of mine.
As for the question: I've found and implemented several algorithms for the MDVSP. However, one question I am very curious about is how to go about determining the number of depots needed (and, to an extent, their locations). Sadly, I haven't been able to find any resources that do not assume/require that the depots are fixed. Thus my question would be: how would I approach an MDVSP in which I can also determine the number and locations of the depots?
(Edit) To clarify:
Assume we are given a set of trips T1, T2, ..., Tn, as usual in an SDVSP or MDVSP. Multiple trips can be driven in succession before returning to a depot. Leaving and returning to depots usually only happens at the start and end of a day. But as an extension to the normal problems, we can now determine the number and locations of our depots, as opposed to having fixed depots.
The objective is to find a solution in which all trips are driven at minimal cost. The cost consists of the amount of deadhead (the distance the car has to travel between trips, and from and to the depots), a fixed cost K per car, and a fixed cost C per depot.
I hope this clears up the question somewhat.
The standard approach involves adding |V| binary variables to the ILP, one for each node, where x_i = 1 if v_i is a depot and 0 otherwise.
However, the way the question is currently articulated, all x_i values will come out to be zero, since there is no "advantage" to making a node a depot and the total cost = (other cost factors) + sum_i (x_i) * FIXED_COST_PER_DEPOT.
Perhaps the question needs to be updated with some other constraint about the range of the car. For example, a car can only go so and so miles before returning to a depot.
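To illustrate, here is a heavily simplified PuLP sketch of those depot-opening variables (the trip/vehicle-flow constraints of a real MDVSP are omitted; all names and costs are hypothetical):

# Simplified facility-location-style sketch: open depots (x_i) and assign vehicle
# blocks to an open depot. Real MDVSP flow constraints are intentionally left out.
from pulp import LpProblem, LpVariable, LpMinimize, lpSum, LpBinary

nodes = ["v1", "v2", "v3"]           # candidate depot locations (hypothetical)
blocks = ["block1", "block2"]        # chains of trips that need a start/end depot
deadhead = {("block1", "v1"): 4, ("block1", "v2"): 9, ("block1", "v3"): 7,
            ("block2", "v1"): 8, ("block2", "v2"): 3, ("block2", "v3"): 6}
C = 10                               # fixed cost per opened depot

prob = LpProblem("depot_selection", LpMinimize)
x = {i: LpVariable("open_%s" % i, cat=LpBinary) for i in nodes}
y = {(j, i): LpVariable("assign_%s_%s" % (j, i), cat=LpBinary)
     for j in blocks for i in nodes}

prob += (lpSum(deadhead[j, i] * y[j, i] for j in blocks for i in nodes)
         + lpSum(C * x[i] for i in nodes))        # deadhead + depot fixed costs

for j in blocks:
    prob += lpSum(y[j, i] for i in nodes) == 1    # every block gets exactly one depot
    for i in nodes:
        prob += y[j, i] <= x[i]                   # ...and only an opened one

prob.solve()
print([i for i in nodes if x[i].value() == 1])    # chosen depot locations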

System Design of Google Trends?

I am trying to figure out the system design behind Google Trends (or any other large-scale trend feature, like Twitter's).
Challenges:
Need to process a large amount of data to calculate trends.
Filtering support - by time, region, category, etc.
Need a way to store the data for archiving/offline processing. Filtering support might require multi-dimensional storage.
Here is my assumption (I have zero practical experience with MapReduce/NoSQL technologies):
Each search item from a user will carry a set of attributes that will be stored and eventually processed,
as well as being added to lists of searches by time stamp, region of search, category, etc.
Example:
Searching for the term Kurt Cobain:
Kurt-> (Time stamp, Region of search origin, category ,etc.)
Cobain-> (Time stamp, Region of search origin, category ,etc.)
Question:
How do they efficiently calculate the frequency of a search term?
In other words, given a large data set, how do they find the top 10 frequent items in a distributed, scalable manner?
Well... finding the top K terms is not really a big problem. One of the key ideas in this field has been "stream processing": perform the operation in a single pass over the data, sacrificing some accuracy to get a probabilistic answer. Thus, assume you get a stream of data like the following:
A B K A C A B B C D F G A B F H I B A C F I U X A C
What you want is the top K items. Naively, one would maintain a counter for each item, and at the end sort by the count of each item. This takes O(U) space and O(max(U*log(U), N)) time, where U is the number of unique items and N is the number of items in the list.
In case U is small, this is not really a big problem. But once you are in the domain of search logs with billions or trillions of unique searches, the space consumption starts to become a problem.
So people came up with the idea of "count sketches" (you can read up more here: the count-min sketch page on Wikipedia). Here you maintain a hash table A of length n and create two hashes for each item:
h1(x) = 0 ... n-1, with uniform probability
h2(x) = +1 or -1, each with probability 0.5
You then do A[h1(x)] += h2(x). The key observation is that since each value randomly hashes to +/-1, E[A[h1(x)] * h2(x)] = count(x), where E is the expected value of the expression, and count(x) is the number of times x appeared in the stream.
Of course, the problem with this approach is that each estimate still has a large variance, but that can be dealt with by maintaining several independent sets of hash counters and taking the median of their estimates (or the minimum, in the non-negative count-min variant).
With this sketch data structure, you are able to get an approximate frequency of each item. Now, you simply maintain a list of 10 items with the largest frequency estimates till now, and at the end you will have your list.
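A compact, illustrative implementation of that counter structure (several rows with a median estimate, using Python's built-in hash with per-row seeds rather than proper pairwise-independent hash functions):

# Minimal count sketch: d rows of n counters, each row with its own +/-1 sign hash.
import random
from statistics import median

class CountSketch:
    def __init__(self, n=1024, d=5, seed=42):
        rng = random.Random(seed)
        self.n, self.d = n, d
        self.tables = [[0] * n for _ in range(d)]
        self.seeds = [(rng.random(), rng.random()) for _ in range(d)]

    def _h1(self, x, row):                    # bucket hash: 0 .. n-1
        return hash((self.seeds[row][0], x)) % self.n

    def _h2(self, x, row):                    # sign hash: +1 or -1
        return 1 if hash((self.seeds[row][1], x)) % 2 else -1

    def add(self, x):
        for row in range(self.d):
            self.tables[row][self._h1(x, row)] += self._h2(x, row)

    def estimate(self, x):
        # Median over rows keeps the variance of A[h1(x)] * h2(x) under control.
        return median(self.tables[row][self._h1(x, row)] * self._h2(x, row)
                      for row in range(self.d))

stream = "A B K A C A B B C D F G A B F H I B A C F I U X A C".split()
cs = CountSketch()
for item in stream:
    cs.add(item)
print(sorted(set(stream), key=cs.estimate, reverse=True)[:10])  # approximate top 10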
How exactly a particular private company does it is likely not publicly available, and how to evaluate the effectiveness of such a system is at the discretion of the designer (be it you or Google or whoever).
But many of the tools and much of the research are out there to get you started. Check out some of the Big Data tools, including many of the top-level Apache projects, like Storm, which allows for the processing of streaming data in real time.
Also check out some of the Big Data and Web Science conferences, like KDD or WSDM, as well as papers put out by Google Research.
How to design such a system is a challenging problem with no single correct answer, but the tools and research are available to get you started.

Coming up with factors for a weighted algorithm?

I'm trying to come up with a weighted algorithm for an application. In the application, there is a limited amount of space available for different elements. Once all the space is occupied, the algorithm should choose the best element(s) to remove in order to make space for new elements.
There are different attributes which should affect this decision. For example:
T: Time since last accessed. (It's best to replace something that hasn't been accessed in a while.)
N: Number of times accessed. (It's best to replace something which hasn't been accessed many times.)
R: Number of elements which need to be removed in order to make space for the new element. (It's best to replace the fewest elements. Ideally this should also take the T and N attributes of each element being replaced into consideration.)
I have 2 problems:
Figuring out how much weight to give each of these attributes.
Figuring out how to calculate the weight for an element.
(1) I realize that coming up with the weight for something like this is very subjective, but I was hoping that there's a standard method or something that can help me in deciding how much weight to give each attribute. For example, I was thinking that one method might be to come up with a set of two sample elements and then manually compare the two and decide which one should ultimately be chosen. Here's an example:
Element A: N = 5, T = 2 hours ago.
Element B: N = 4, T = 10 minutes ago.
In this example, I would probably want A to be the element chosen for replacement since, although it was accessed one more time, it hasn't been accessed for a long time compared with B. This method seems like it would take a lot of time, and would involve making a lot of tough, subjective decisions. Additionally, it may not be trivial to come up with the resulting weights at the end.
Another method I came up with was to just arbitrarily choose weights for the different attributes and then use the application for a while. If I notice anything obviously wrong with the algorithm, I could then go in and slightly modify the weights. This is basically a "guess and check" method.
Neither of these methods seems that great, and I'm hoping there's a better solution.
(2) Once I do figure out the weight, I'm not sure which way is best to calculate the weight. Should I just add everything? (In these examples, I'm assuming that whichever element has the highest replacementWeight should be the one that's going to be replaced.)
replacementWeight = .4*T - .1*N - 2*R
or multiply everything?
replacementWeight = (T) * (.5*N) * (.1*R)
What about not using constants for the weights? For example, sure "Time" (T) may be important, but once a specific amount of time has passed, it starts not making that much of a difference. Essentially I would lump it all in an "a lot of time has passed" bin. (e.g. even though 8 hours and 7 hours have an hour difference between the two, this difference might not be as significant as the difference between 1 minute and 5 minutes since these two are much more recent.) (Or another example: replacing (R) 1 or 2 elements is fine, but when I start needing to replace 5 or 6, that should be heavily weighted down... therefore it shouldn't be linear.)
replacementWeight = 1/T + sqrt(N) - R*R
Obviously (1) and (2) are closely related, which is why I'm hoping that there's a better way to come up with this sort of algorithm.
What you are describing is the classic problem of choosing a cache replacement policy. Which policy is best for you depends on your data, but the following usually works well:
First, always store a new object in the cache, evicting the R worst one(s). There is no way to know a priori if an object should be stored or not. If the object is not useful, it will fall out of the cache again soon.
The popular squid cache implements the following cache replacement algorithms:
Least Recently Used (LRU):
replacementKey = -T
Least Frequently Used with Dynamic Aging (LFUDA):
replacementKey = N + C
Greedy-Dual-Size-Frequency (GDSF):
replacementKey = (N/R) + C
C refers to a cache age factor here. C is basically the replacementKey of the item that was evicted last (or zero).
NOTE: The replacementKey is calculated when an object is inserted or accessed, and stored alongside the object. The object with the smallest replacementKey is evicted.
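For concreteness, here is a sketch of how such a key could drive eviction, using the GDSF-style formula with your R standing in for object size (illustrative Python, not Squid's actual implementation; LRU or LFUDA would just swap the key formula):

# Illustrative eviction loop with a GDSF-style key: replacementKey = N / R + C.
# The entry with the smallest key is evicted, and C is bumped to that key.
class Cache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.age = 0.0                            # C, the cache age factor
        self.entries = {}                         # name -> {"N": ..., "R": ..., "key": ...}

    def access(self, name, R=1):
        e = self.entries.setdefault(name, {"N": 0, "R": R, "key": 0.0})
        e["N"] += 1
        e["key"] = e["N"] / e["R"] + self.age     # recomputed only on insert/access
        while len(self.entries) > self.capacity:
            victim = min(self.entries, key=lambda k: self.entries[k]["key"])
            self.age = self.entries[victim]["key"]   # dynamic aging
            del self.entries[victim]

cache = Cache(capacity=3)
for name in ["a", "b", "a", "c", "a", "d", "e"]:
    cache.access(name)
print(sorted(cache.entries))                      # the frequently accessed "a" survives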
LRU is simple and often good enough. The bigger your cache, the better it performs.
LFUDA and GDSF both make tradeoffs. LFUDA prefers to keep large objects even if they are less popular, under the assumption that one hit to a large object makes up for many hits to smaller objects. GDSF basically makes the opposite tradeoff, keeping many smaller objects over fewer large objects. From what you write, the latter might be a good fit.
If none of these meet your needs, you can calculate optimal coefficients for T, N and R (and compare different formulas for combining them) by minimizing regret, the difference in performance between your formula and the optimal algorithm, using, for example, linear regression.
This is a completely subjective issue -- as you yourself point out. And a distinct possibility is that if your test cases consist of pairs (A, B) where you prefer A to B, then you might find that you prefer A to B and B to C, but also C to A -- i.e. it's not an ordering.
If you are not careful, your function might not exist!
If you can define a scalar function of your input variables, with various parameters for coefficients and exponents, you might be able to estimate said parameters by using regression, but you will need an awful lot of data if you have many parameters.
This is the classical statistician's approach of first reviewing the data to IDENTIFY a model, and then using that model to ESTIMATE a particular realisation of the model. There are large books on this subject.
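If you do go the regression route, the "estimate the parameters from data" step could look roughly like this (scikit-learn assumed; the scores here are made-up hand labels, higher meaning "more willing to replace"):

# Illustrative only: fit linear weights for (T, N, R) from hand-labelled scores.
import numpy as np
from sklearn.linear_model import LinearRegression

# Columns: T (hours since last access), N (access count), R (elements to remove).
X = np.array([[2.0, 5, 1],
              [0.2, 4, 1],
              [8.0, 1, 3],
              [1.0, 9, 2]])
y = np.array([0.7, 0.3, 0.9, 0.2])       # hypothetical hand-assigned preferences

model = LinearRegression().fit(X, y)
print(model.coef_, model.intercept_)      # learned weights for T, N, R
print(model.predict([[5.0, 2, 1]]))       # replacementWeight for a new candidate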

Resources