This is a problem I've just run into, or rather it's a simplification that captures the core problem.
Imagine I have a spreadsheet containing a number of columns, each of them labeled, and a number of rows.
I want to determine when the value in one column can be inferred from the value in another. For example, we might find that every time a '1' appears in column a, a '5' always appears in column d, but whenever a '2' appears in column a, a '3' always appears in column d. We observe that the value in column a reliably predicts the value in column d.
The goal is to identify all such relationships between columns.
The naive solution is to start with a list of all pairs of columns, (a, b), (a, c), (a, d)... (b, c), (b, d)... and so on. We call these the "eligible" list.
For each of these pairs, we keep track of the value of the first in the pair, and the corresponding value in the second. If we see the same value for the first of a pair but a different value for the second, then this pair is no longer eligible.
Whatever is left at the end of this process is the set of valid relationships.
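For what it's worth, the naive pass described above could be sketched in Python roughly like this. It is only a sketch under assumptions: the name `predicting_pairs` is made up, each row is assumed to be a dict mapping a column label to a hashable value, and pairs are treated as ordered (both (a, d) and (d, a) are tracked), since a predicting d does not imply d predicting a.

```python
from itertools import permutations


def predicting_pairs(columns, rows):
    """Return the ordered pairs (a, b) where the value in a always fixes the value in b."""
    # Hypothetical layout: one dict per eligible pair, mapping an a-value
    # to the b-value that was first seen alongside it.
    eligible = {pair: {} for pair in permutations(columns, 2)}
    for row in rows:
        for a, b in list(eligible):  # copy the keys so pairs can be dropped mid-loop
            seen = eligible[(a, b)]
            # setdefault records the first b-value for this a-value;
            # a later mismatch means a does not predict b.
            if seen.setdefault(row[a], row[b]) != row[b]:
                del eligible[(a, b)]
    return set(eligible)


rows = [
    {"a": 1, "b": 7, "d": 5},
    {"a": 2, "b": 7, "d": 3},
    {"a": 1, "b": 8, "d": 5},
]
result = predicting_pairs(["a", "b", "d"], rows)
```

The per-pair dictionaries are exactly the storage that grows with the square of the number of columns.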
Unfortunately, this rapidly becomes impractical as the number of columns increases, as the amount of data we must store is on the order of the number of columns squared.
Can anyone think of an efficient way to do this?
I don't think you can improve on O(n^2) for n columns: consider the case where no relationship exists between any pair. The only way to discover this is to test all pairs, which is O(n^2).
I suspect you might be best to build up the relation, rather than whittle it down.
You might well have to store n^2 pieces of information, where you have n columns. For example, if a column never repeats (i.e. its value is different on each row) then that column predicts all others. If every column is like that, then every column predicts every other. You could use a two-dimensional table, pred say, indexed by column numbers, with pred(a,b) true if a predicts b. pred(a,b) could have any of 3 values: true, false and unknown.
The predicts relation is transitive, that is, if a predicts b and b predicts c then a predicts c. If the number of rows is large, so that checking whether one column predicts another is expensive, then it might be worth using transitivity to fill out what you can: if you have just computed that pred(a,b) is true and you have already computed pred(b,x) for every x, then you can set pred(a,y) true for every y for which pred(b,y) is true.
To fill out pred(a,.) you could build a temporary array of pairs (value, row-index) from a, and then sort by value; this gives you easy access to the sets of indices where a is constant. If each of these sets is a singleton, then pred(a,b) is true for every b; otherwise, to check if a predicts b (if it's not already known) you need to check that b is constant on each index set (with more than one member) where a is constant.
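The index-set check above might look something like the following in Python. Treat it as a rough sketch: the names `repeated_index_sets` and `fill_pred_row` are invented here, the table is assumed to be a dict of equal-length lists keyed by column label, and column values are assumed to be sortable. It leaves out the "unknown" state and the transitivity shortcut.

```python
from itertools import groupby


def repeated_index_sets(values):
    """Row-index sets (with more than one member) on which the column is constant."""
    # Sorting row indices by value puts equal values next to each other.
    order = sorted(range(len(values)), key=lambda i: values[i])
    runs = (list(run) for _, run in groupby(order, key=lambda i: values[i]))
    return [run for run in runs if len(run) > 1]


def fill_pred_row(table, a):
    """Return {b: bool} saying whether column a predicts each other column b."""
    groups = repeated_index_sets(table[a])
    # If a never repeats, groups is empty and all() is True for every b.
    return {
        b: all(len({col[i] for i in group}) == 1 for group in groups)
        for b, col in table.items()
        if b != a
    }


table = {"a": [1, 2, 1], "b": [7, 7, 8], "d": [5, 3, 5], "u": [10, 20, 30]}
row_a = fill_pred_row(table, "a")
row_u = fill_pred_row(table, "u")
```

Here `row_a` covers the case where a repeats, and `row_u` the case of a column that never repeats.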
An optimisation might be that if pred(a,b) is true, and also pred(b,a) is true then for every c, pred(a,c) if and only if pred(b,c); thus in this case if you have already filled out pred(b,.) you can fill out all pred(a,.) by copying.
This question is more about logic than programming. My dataset has several data points (you can think of the dataset as an array and the data points as its elements). Each data point is defined by two properties. For example, if x is one of the X data points in the dataset, a and b are the properties or characteristics of x. The larger the value of a (which ranges from 0 to 1; think of it as a probability), the better the chance that x is selected. Conversely, the larger the value of b (think of it as any number larger than 1), the smaller the chance that x is selected. Among the X data points, I need to select the one that has the maximum a value and the minimum b value. Note that a single data point may not satisfy both conditions at the same time. For example, x may have the largest a value but not the smallest b value, and vice versa. Hence, I want to combine a and b into another meaningful weight value that helps me pick the right data point from the dataset.
Is there any mathematical solution to my problem?
Let's say I have N lists which are known. Each list has items, which may repeat (so it is not a set).
eg:
{A,A,B,C}, {A,B,C}, {B,B,B,C,C}
I need some algorithm (Some machine-learning one maybe?) which answers the following question:
Given a new and unknown partial list of items, for example {A,B}, what is the probability that C will appear in the list, based on what I know from the previous lists? If possible, I would like a more fine-grained probability: given some partial list L, what is the probability that C will appear in the list once, the probability that it will appear twice, etc. Order doesn't matter. The probability of C appearing twice in {A,B} should equal that of it appearing twice in {B,A}.
Any algorithms which can do this?
This is just pure mathematics, no actual "algorithms": simply estimate all the probabilities from your dataset (literally count the occurrences). In particular, a very simple data structure is enough to achieve your goal. Represent each "list" as a bag of letters, thus:
{A,A,B,C} -> {A:2, B:1, C:1}
{A,B} -> {A:1, B:1}
etc. and create basic reverse indexing of some sort, for example keep indexes for each letter separately, sorted by their counts.
Now, when a query comes, like {A,B} + C, all you do is search for the data that contains at least 1 A and 1 B (using your indexes), and then estimate the probability by computing the fraction of retrieved results containing C (or exactly one C) vs. all retrieved results (this is a valid probability estimate assuming that your data is a bunch of independent samples from some underlying data-generating distribution).
Alternatively, if your alphabet is very small you can actually precompute all the values P(C|{A,B}) etc. for all combinations of letters.
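As a rough illustration of the counting estimate, here is a small Python sketch. The function name `estimate` and its `count` parameter are assumptions of mine, and it scans all lists linearly instead of using the reverse index mentioned above.

```python
from collections import Counter


def estimate(lists, partial, item, count=None):
    """Estimate P(item appears | list contains `partial`) by plain counting.

    With count=None the event is "item appears at least once";
    otherwise it is "item appears exactly `count` times".
    Returns None when no known list contains the partial list.
    """
    need = Counter(partial)
    bags = [Counter(lst) for lst in lists]
    # Keep only the lists that contain every queried item often enough.
    matching = [bag for bag in bags if all(bag[k] >= v for k, v in need.items())]
    if not matching:
        return None
    if count is None:
        hits = sum(1 for bag in matching if bag[item] >= 1)
    else:
        hits = sum(1 for bag in matching if bag[item] == count)
    return hits / len(matching)


known = [["A", "A", "B", "C"], ["A", "B", "C"], ["B", "B", "B", "C", "C"]]
p_any = estimate(known, ["A", "B"], "C")
p_twice = estimate(known, ["B"], "C", count=2)
```

Because the partial list is turned into a bag first, {A,B} and {B,A} give the same answer.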
Given a
we first define two real-valued functions and as follows:
and we also define a value m(X) for each matrix X as follows:
Now, given an m×n matrix G, we have many regions of G, denoted G_1, ..., G_r. Here, a region of G is a submatrix formed from some randomly chosen rows and columns of G. Our problem is to compute m(G_1), ..., m(G_r) with as few operations as possible. Are there any methods, such as building a hash table or sorting, to get the results faster? Thanks!
========================
For example, if G={{1,2,3},{4,5,6},{7,8,9}}, then
G_1 could be {{1,2},{7,8}}
G_2 could be {{1,3},{4,6},{7,9}}
G_3 could be {{5,6},{8,9}}
=======================
Currently, for each G_i we need m×n comparisons to compute m(G_i). Thus, for m(G_1),...,m(G_r) there would be r×m×n comparisons. However, I notice that G_i and G_j may overlap, so there should be some other approach that is more efficient. Any attention would be highly appreciated!
Depending on how many times the min/max type data is needed, you could consider a matrix that holds min/max information in between the matrix values, i.e. in the interstices between values. Thus, for your example G={{1,2,3},{4,5,6},{7,8,9}} we would define a relationship matrix R sized ((mxn),(mxn),(mxn)) and having values from the set C = {-1 = less than, 0 = equals and 1 = greater than}.
R would have nine relationship pairs (n,1), (n,2) to (n,9), where each value would be a member of C. Note that (n,n) is defined and will equal 0. Thus, R(4,,) = (1,1,1,0,-1,-1,-1,-1,-1). Now consider any of your subsets G_1, ...: knowing the positional relationships of a subset's members will give you offsets into R, which will resolve to indexes into each R(N,,), which will return the desired relationship information directly without comparisons.
You, of course, will have to decide if the overhead in space and calculations to build R exceeds the cost of just computing what you need each time it's needed. Certain optimizations are available, including the realization that the R matrix is reflected along the major diagonal and that you could declare "equals" to be called, say, less than (meaning C has only two values). Depending on the original matrix G, other optimizations can be had if it is known that a row or column is sorted.
And since some computers (mainframes, supercomputers, etc) store data into RAM in column-major order, store your dataset so that it fills in with the rows and columns transposed thus allowing column-to-column type operations (vector calculations) to actually favor the columns. Check your architecture.
I'm trying to create an algorithm to solve the following problem:
Input is an unsorted list of sets containing pairs (key, value) of ints. The first of each pair is positive and unique within the set.
I want to find an algorithm to split the input sets so the sets can be ordered such that for each key the value is nondecreasing in the set order.
There is a trivial solution, which is to split the sets into each individual value and sort them, but I'd like something more efficient in terms of the number of sets which are split.
Are there any similar problems you have encountered and/or techniques you can suggest?
Does the optimal (minimum number of splits) solution sound like it is possible in polynomial time?
Edit: In the example the "<=" operator indicates a constraint on the sets as a whole whereby for each key value (100, 101, 102) the corresponding values are equal to or greater than the values in previous sets (or omitted from the set). I.e., extracting the values for each key using the order from the output sets gives:
Key 100 {0, 1}
Key 101 {2, 3}
Key 102 {10, 15}
A*
I propose using A* to find an optimal solution. Build the order of split sets incrementally from left to right, minimizing the number of sets required to achieve this.
A* visits states based on some heuristic estimate of the total cost. I propose that a state is described by the totality of all the pairs already included in the order as we have it so far. If all values for every key are different, then you can represent this information rather concisely by simply storing the last value for each key. Otherwise you'll have to somehow take care of equal values, so you know which ones were already included and which ones were not. For every state you maintain some representation of the best order leading to it, but that may get updated along the way while the state remains the same.
The heuristic should be an estimate of the total cost of the path from the beginning through the current state to the goal. It may be too low, but must never be too high. In our case, the heuristic should count the number of (possibly split) sets included in the order so far, and add to that the number of (unsplit) sets still waiting for insertion. As the remaining sets may need splitting, this might be too low, but as you can never have fewer sets than those still waiting for insertion, it is a suitable heuristic.
Now you have some priority queue of states, ordered by the value of this heuristic. You extract minimal items from it, and know that the moment you extract a state from the queue, the cost up to that state cannot decrease any more, so the path up to that state is optimal. Now you examine what other states can be reached from this: which other pairs can be next in the order of split sets? For each remaining set which has pairs that are ready to be included, you create a new subsequent state, taking all the pairs from the set which are ready. The cost so far increases by one. If you manage to take a whole set, without splitting, then the estimate for the remaining cost decreases by one.
For this new state, you check whether it is already present in your priority queue. If it is, and its previous cost was higher than the one just computed, then you update its cost, and the optimal path leading to it. Make sure the priority key changes its position accordingly (“decrease key”). If the state wasn't present in the queue before, then add it to the queue.
Dijkstra
Come to think of it, this is the same as running Dijkstra's algorithm with the number of splits as cost. And as each edge has either cost zero or cost one, you can implement this even easier, without any priority queue at all. Instead, you can use two sets, called S₀ and S₁, where all elements from S₀ require the same number of splits, and all elements from S₁ require one more split. Roughly sketched in pseudocode:
S₀ = ∅ (empty set)
S₁ = ∅
add initial state (no pairs added yet, all sets remain to be added) to S₀
while True
    while (S₀ ≠ ∅)
        x = take and remove any element from S₀
        if x is the target state (all pairs included in the order) then
            return the path information associated with it
        for (r: those sets which remain to be added in state x)
            if we can take r as a whole then
                let y be the state obtained by taking r as the next set in the order
                if y is in S₁, remove it
                add y to S₀
            else if we can add only some elements from r then
                let y be the state obtained by taking as many elements from r as possible
                if y is not in S₀, add it to S₁
    S₀ = S₁
    S₁ = ∅
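The two-level search in the pseudocode can be tried out on any graph whose edges cost zero or one. Below is a generic Python sketch, not tied to the set-splitting problem itself: `zero_one_search`, `is_goal` and `neighbors` are hypothetical names, `neighbors(state)` is assumed to yield `(next_state, cost)` pairs with cost 0 or 1, and states are assumed hashable. Unlike the pseudocode, it keeps a `settled` set so that a state is expanded only once, it stops when nothing is left to explore, and it returns only the cost, not the path.

```python
def zero_one_search(start, is_goal, neighbors):
    """Return the minimum total cost from start to a goal state, or None if unreachable."""
    current = [start]   # plays the role of S0: states reachable at the current cost
    following = []      # plays the role of S1: states needing one more unit of cost
    settled = set()     # states already expanded at their minimal cost
    cost = 0
    while current:
        while current:
            x = current.pop()
            if x in settled:
                continue
            settled.add(x)
            if is_goal(x):
                return cost
            for y, step in neighbors(x):
                if y in settled:
                    continue
                # Zero-cost edges stay on this level, unit-cost edges go to the next.
                (current if step == 0 else following).append(y)
        current, following = following, []
        cost += 1
    return None


# Toy graph (made up for illustration): edge lists of (target, cost).
graph = {
    "a": [("b", 1), ("c", 0)],
    "b": [("d", 0)],
    "c": [("d", 1), ("e", 1)],
    "d": [("f", 1)],
    "e": [],
    "f": [],
    "z": [],
}
cost_to_f = zero_one_search("a", lambda s: s == "f", lambda s: graph[s])
```

For the set-ordering problem, `neighbors` would yield cost 0 when a remaining set can be taken whole and cost 1 when it has to be split.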
Let's say I have a number of 'known' sets:
1 {a, b, c, d, e}
2 {b, c, d, e}
3 {a, c, d}
4 {c, d}
I'd like a function which takes a set as an input (for example {a, c, d, e}) and finds the known set that has the highest number of elements and contains no items that are not in the input. In other words, the known subset of the input with the greatest cardinality. The answer doesn't have to be a proper subset. The answer in this case would be {a, c, d}.
EDIT: the above example was wrong, now fixed.
I'm trying to find the absolute most efficient way of doing this.
(In the below, I am assuming that the cost of comparing two sets is O(1) for the sake of simplicity. That operation is outside my control so there's no point thinking about it. In truth it would be a function of the cardinality of the two sets being compared.)
Candidate 1:
Generate all subsets of the input, then iterate over the known sets and return the largest one that is a subset. The downside is that the complexity will be something like O(2^n × m), where n is the cardinality of the input set (which has 2^n subsets) and m is the number of 'known' sets.
Candidate 1a (thanks #bratbrat):
Iterate over all 'known' sets and calculate the cardinality of the intersection, and take the one with the highest value. This would be O(n) where n is the number of known sets.
Candidate 2:
Create an inverse table and calculate the euclidean distance between the input and the known sets. This could be quite quick. I'm not clear how I could limit this to include only subsets without a subsequent O(n) filter.
Candidate 3:
Iterate over all known sets and compare against the input. The complexity would be O(n) where n is the number of known sets.
I have at my disposal the set functions built into Python and Redis.
None of these seems particularly great. Ideas? The number of sets may get large (around 100,000 at a guess).
There's no possible way to do this in less than O(n) time... just reading the input is O(n).
A couple ideas:
Sort the sets by size (biggest first), and search for the first set which is a subset of the input set. Once you find one, you don't have to examine the rest.
If the number of possible items which could be in the sets is limited, you could represent them by bit-vectors. Then you could calculate a lookup table to tell you whether a given set is a subset of the input set. (Walk down the bits for each input set under consideration, word by word, indexing each word into the appropriate table. If you find an entry telling you that it's not a subset, again, you can move on directly to the next input set.) Whether this would actually buy you performance, depends on the implementation language. I imagine it would be most effective in a language with primitive integral types, like C or Java.
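The first idea (sort by size, stop at the first hit) is short enough to sketch with Python's built-in sets. The names `presort` and `largest_known_subset` are made up for illustration, and the sort is assumed to be done once up front and reused across queries.

```python
def presort(known_sets):
    """Sort the known sets once, biggest first."""
    return sorted((frozenset(s) for s in known_sets), key=len, reverse=True)


def largest_known_subset(sorted_sets, query):
    """Return the first (hence largest) known set that is a subset of query, or None."""
    query = frozenset(query)
    for candidate in sorted_sets:
        # The cheap length check skips sets that are too big to fit inside the query.
        if len(candidate) <= len(query) and candidate <= query:
            return candidate
    return None


known = presort([
    {"a", "b", "c", "d", "e"},
    {"b", "c", "d", "e"},
    {"a", "c", "d"},
    {"c", "d"},
])
answer = largest_known_subset(known, {"a", "c", "d", "e"})
```

This is still O(n) in the worst case, but it stops as soon as a match is found.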
Take the union of the known sets. This becomes a dictionary of known elements.
Sort the known elements by their value (they're integers, right?). This defines a given integer's position in a bit string.
Use the above to define bit strings for each of the known sets. This is a one time operation - the results should be stored to avoid recomputation.
For an input set, run it through the same transform to obtain its bit string.
To get the largest subset, run through the list of known bit strings, taking the intersection (logical and) with the input bit string. Count the '1' elements. Remember the largest one.
http://packages.python.org/bitstring
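A minimal sketch of the steps above, using plain Python integers as the bit strings instead of the bitstring package. The names `build_index` and `best_match` are hypothetical. One deliberate difference: the steps as written pick the largest intersection, whereas this sketch adds a subset test (`mask & q == mask`), on the assumption that only true subsets of the input are wanted.

```python
def build_index(known_sets):
    """One-time step: assign each known element a bit position and encode every set."""
    universe = sorted(set().union(*known_sets))
    position = {item: i for i, item in enumerate(universe)}
    masks = [sum(1 << position[x] for x in s) for s in known_sets]
    return position, masks


def best_match(known_sets, position, masks, query):
    """Return the largest known set whose bits are all present in the query's bit string."""
    # Elements that occur in no known set cannot matter, so they are dropped.
    q = sum(1 << position[x] for x in set(query) if x in position)
    best, best_size = None, -1
    for s, mask in zip(known_sets, masks):
        if mask & q == mask:                 # every bit of the known set is in the query
            size = bin(mask).count("1")      # count the '1' elements
            if size > best_size:
                best, best_size = s, size
    return best


known = [{1, 2, 3, 4, 5}, {2, 3, 4, 5}, {1, 3, 4}, {3, 4}]
position, masks = build_index(known)
match = best_match(known, position, masks, {1, 3, 4, 5})
```

The `position` table and `masks` list are the stored results of the one-time transform, so only the query needs encoding per lookup.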
As mentioned in the comments, this can be parallelized by subdividing the known sets and giving each thread its own subset to work on. Each thread serves up its best match and then the parent thread picks the best from the threads.
How many searches are you making? In case you are searching multiple input sets you should be able to pre-process all the known sets (perhaps as a tree structure) and your search time for each query would be in the order of your query set size.
E.g.: Create a trie structure with all the known sets. Make sure to sort each set before inserting it. For the query, follow the links that are in the set.
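A possible shape for that trie, sketched in Python with nested dicts. The names `build_trie`, `largest_subset` and the `END` marker are invented for this example, and elements are assumed to be sortable. Note that the walk may branch when it skips query elements, so the query time is not guaranteed to be linear in the query size.

```python
END = object()  # marker key: a known set ends at this node


def build_trie(known_sets):
    """Insert every known set into a trie, elements in sorted order."""
    root = {}
    for s in known_sets:
        node = root
        for item in sorted(s):
            node = node.setdefault(item, {})
        node[END] = frozenset(s)
    return root


def largest_subset(root, query):
    """Follow only links whose element is in the query; keep the largest set found."""
    items = sorted(query)
    best = None

    def walk(node, start):
        nonlocal best
        if END in node and (best is None or len(node[END]) > len(best)):
            best = node[END]
        # Query elements may be skipped, but order must be kept, so continue from `start`.
        for i in range(start, len(items)):
            child = node.get(items[i])
            if child is not None:
                walk(child, i + 1)

    walk(root, 0)
    return best


trie = build_trie([
    {"a", "b", "c", "d", "e"},
    {"b", "c", "d", "e"},
    {"a", "c", "d"},
    {"c", "d"},
])
found = largest_subset(trie, {"a", "c", "d", "e"})
```

Because only links present in the query are followed, known sets containing any element outside the query are never visited.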