I have a collection of objects with properties. I want to find the simplest set of criteria that will specify exactly one of these objects (I do not care which one).
For example, given {a=1, b=1, c=1}, {a=1, b=2, c=1}, {a=1, b=1, c=2}, specifying b==2 (or c==2) will give me a unique object.
Likewise, given {a=1, b=1, c=1}, {a=1, b=2, c=2}, {a=1, b=2, c=1}, specifying b==2 && c==2 (or b==1 && c==1, or b==2 && c==1) will give me a unique object.
This sounds like a known problem, with a known solution, but I haven't been able to find the correct formulation of the problem to allow me to Google it.
It is indeed a known problem in AI: feature selection. There are many algorithms for doing this; just Google "feature selection" "artificial intelligence".
The main issue is that when the sample set is large, you need to use some sort of heuristic in order to reach a solution within a reasonable time.
Feature Selection in Data Mining
The main idea of feature selection is to choose a subset of input
variables by eliminating features with little or no predictive
information.
The freedom of choosing the target is sort of unusual. If the target is specified, then this is essentially the set cover problem. Here are two corresponding instances side by side.
A={1,2,3} B={2,4} C={3,4} D={4,5}
0: {a=0, b=0, c=0, d=0} # separate 0 from the others
1: {a=1, b=0, c=0, d=0}
2: {a=1, b=1, c=0, d=0}
3: {a=1, b=0, c=1, d=0}
4: {a=0, b=1, c=1, d=1}
5: {a=0, b=0, c=0, d=1}
While set cover is NP-hard, your problem has an m^(log n + O(1)) * poly(n) algorithm, where m is the number of attributes and n is the number of items (the optimal set of criteria has size at most log n), which makes it rather unlikely that an NP-hardness proof is forthcoming. I'm reminded of the situation with the Junta problem (basically the theoretical formulation of feature selection).
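For small inputs you can even solve it exactly by brute force over attribute subsets of increasing size, which is what the log n bound above suggests. Here is a rough Python sketch (the function name and data layout are mine, not part of the question):

from itertools import combinations
from collections import Counter

def smallest_criteria(objects):
    # Brute force over attribute subsets of increasing size (the optimum has
    # size at most log n, so this stays quasi-polynomial): return the first
    # subset of attributes whose values single out some object.
    attrs = sorted(objects[0])
    for size in range(1, len(attrs) + 1):
        for subset in combinations(attrs, size):
            counts = Counter(tuple(o[a] for a in subset) for o in objects)
            for values, c in counts.items():
                if c == 1:
                    return dict(zip(subset, values))
    return None  # only happens if two objects are identical

objects = [{'a': 1, 'b': 1, 'c': 1},
           {'a': 1, 'b': 2, 'c': 1},
           {'a': 1, 'b': 1, 'c': 2}]
print(smallest_criteria(objects))  # {'b': 2} for this data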
I don't know how easily this could be translated into an algorithm, but using SQL, which is already set based, it could go like this:
construct a table with all possible combinations of columns from the input table
select every combination whose number of distinct values equals the number of records in the input table
SQL Script
;WITH q (a, b, c) AS (
SELECT '1', '1', '1'
UNION ALL SELECT '1', '2', '2'
UNION ALL SELECT '1', '2', '1'
UNION ALL SELECT '1', '1', '2'
)
SELECT col
FROM (
SELECT val = a, col = 'a' FROM q
UNION ALL SELECT b, 'b' FROM q
UNION ALL SELECT c, 'c' FROM q
UNION ALL SELECT a+b, 'a+b' FROM q
UNION ALL SELECT a+c, 'a+c' FROM q
UNION ALL SELECT b+c, 'b+c' FROM q
UNION ALL SELECT a+b+c, 'a+b+c' FROM q
) f
GROUP BY col
HAVING COUNT(DISTINCT val) = (SELECT COUNT(*) FROM q)
Your problem can be defined as follows:
1 1 1 -> A
1 2 1 -> B
1 1 2 -> C
.
.
where 1 1 1 is called the feature vector and A is the object class. You can then use decision trees (with pruning) to find a set of rules to classify the objects. So, if your objective is to automatically decide the set of criteria that identifies object A, you can observe the path in the decision tree which leads to A.
If you have access to MATLAB, it is really easy to obtain a decision tree for your data.
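If you'd rather stay outside MATLAB, the same idea can be sketched with scikit-learn (assuming it is available; the feature names and labels below are just illustrative):

from sklearn.tree import DecisionTreeClassifier, export_text

X = [[1, 1, 1], [1, 2, 1], [1, 1, 2]]  # feature vectors (a, b, c)
y = ['A', 'B', 'C']                    # one class label per object
clf = DecisionTreeClassifier().fit(X, y)
# the printed rules show which tests isolate each object, e.g. the path to A
print(export_text(clf, feature_names=['a', 'b', 'c']))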
Related
I have an undirected and potentially disconnected graph represented as a table of edges.
I need to return the list of all edges reachable from a given initial set of edges.
This is a common task which can be found on many web sites; the recursive query with a cycle clause appears in many tutorials.
What particularly occupies my mind is:
In what aspect is the cycle clause better, in comparison with detecting cycles "manually"?
Example:
1
1-----2
| /|
| / |
3| /5 |2
| / |
|/ |
3-----4
4
with graph (a, id, b) as (
select 1, 1, 2 from dual union all
select 2, 2, 4 from dual union all
select 1, 3, 3 from dual union all
select 3, 4, 4 from dual union all
select 2, 5, 3 from dual union all
select null, null, null from dual where 0=1
) --select * from graph;
, input (id) as (
select column_value from table(sys.ku$_objnumset(2,4))
) --select * from input;
, s (l, path, dup, seen, a, id, b) as ( -- solution using set of seen edges
select 0, '/' || g.id, 1
, cast(collect(to_number(g.id)) over () as sys.ku$_objnumset)
, g.a, g.id, g.b
from graph g
where g.id in (select i.id from input i)
union all
select s.l + 1, s.path || '/' || g.id, row_number() over (partition by g.id order by null)
, s.seen multiset union distinct cast(collect(to_number(g.id)) over () as sys.ku$_objnumset)
, g.a, g.id, g.b
from s
join graph g on s.id != g.id
and g.id not member of (select s.seen from dual)
and (s.a in (g.a, g.b) or s.b in (g.a, g.b))
where s.dup = 1
)
, c (l, path, a, id, b) as ( -- solution using cycle clause
select 0, '/' || g.id
, g.a, g.id, g.b
from graph g
where g.id in (select i.id from input i)
union all
select c.l + 1, c.path || '/' || g.id
, g.a, g.id, g.b
from c
join graph g on c.id != g.id
and (c.a in (g.a, g.b) or c.b in (g.a, g.b))
)
cycle id set is_cycle to 1 default 0
--select * from s; --6 rows
--select distinct id from s order by id; --5 rows
select * from c order by l; --214 rows (!)
--select distinct id from c where is_cycle = 0 order by id; --5 rows
There are 2 different solutions represented by CTEs s and c.
In both solutions an edge is expanded from another edge if they have a common vertex.
Solution s (seen-set based) works like a flood.
It is based on mass collection of all edges on a particular recursion level thanks to the collect() over () clause.
Input edges are on the 0th level, their neighbors on the 1st level, etc.
Each edge belongs to just one level.
Some edge can occur multiple times on a given level due to expansion from many edges on the parent level (for instance edge 5 in the sample graph), but these duplicates are eliminated on the next level using the dup column.
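To make the flood idea concrete outside SQL, here is a rough Python sketch of what CTE s computes: level-by-level expansion with one global set of already-seen edge ids, so every edge is visited only once (this is only an illustration, not the Oracle query):

def reachable_edges(edges, start_ids):
    # edges: {edge_id: (vertex_a, vertex_b)}; start_ids: initial edge ids
    seen = set(start_ids)
    level = list(start_ids)          # edges on the current recursion level
    while level:
        verts = {v for e in level for v in edges[e]}
        nxt = [e for e, (a, b) in edges.items()
               if e not in seen and (a in verts or b in verts)]
        seen.update(nxt)             # duplicates never re-enter later levels
        level = nxt
    return seen

edges = {1: (1, 2), 2: (2, 4), 3: (1, 3), 4: (3, 4), 5: (2, 3)}
print(sorted(reachable_edges(edges, {2, 4})))  # [1, 2, 3, 4, 5]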
Solution c (cycle clause-based) is based on built-in cycle detection.
The substantial difference from solution s is in the way rows on the next recursion level are expanded.
Every row in the recursive part is aware only of the information of a single ancestor row from the previous recursion level.
Hence there are many repetitions, since the graph traversal practically generates all distinct walks.
For instance, if the initial edges are {2,4}, neither is aware of the other, so edge 2 expands to edge 4 and edge 4 expands to edge 2. Similarly on further levels, where this effect is multiplied.
The cycle clause eliminates only duplicates within the ancestor chain of a given row, without respect to siblings.
Various sources on the web recommend postprocessing such a huge result set using distinct or an analytic function (see here).
In my experience this does not eliminate the explosion of possibilities. For a real graph with 65 edges, which is still small, query c didn't finish, while query s finished in hundreds of milliseconds.
I am interested in knowing why the cycle-based solution is so favoured in tutorials and literature.
I prefer using standard ways and don't strive to build my own cycle-detecting solution, as I've been taught here; however, the s solution works much better for me, which makes me a little bit confused. (Note the explain plan looks less expensive for the s solution. I also tried an Oracle proprietary connect by-based solution, which was slow too - I omit it here for brevity.)
My question is:
Do you see any substantial drawbacks of the s solution, or have any idea how to improve the c solution to avoid traversing unnecessary combinations?
My experience working with SET functions such as COLLECT, DISTINCT, UNION, MEMBER OF etc. is that they become quite slow as the number of elements grows. You won't notice that until you start testing with really big collections, so without going into details I would say that straight away, just based on previous experience.
But that does not mean it will not work. These functions are really easy to use and the code looks more readable; they are just slow compared with alternative methods.
Suppose I have 2 tables t1, t2 (or matrices) that are both square (e.g. both are 3x3).
In Solver I add the following constraint :
t1 >= t2
Then how does solver compare these values?
- the value at 1x1 in t1 >= the value at 1x1 in t2, 1x2 in t1 >= 1x2 in t2, ...
- any value in t1 must be >= the largest value in t2
- ...
If it is not the first, how can I obtain that behaviour? Do I have to enter every value comparison by hand (since that will take quite some time)?
It makes the comparison element-wise. You can confirm by getting the "Answer Report".
Here are the matrices in C5:G9 and J5:N9.
You add the constraint stating C5:G9 <= (or >=) J5:N9
As you can see from the formula column, it makes the comparisons element-wise (C5<J5, D5<K5, ..., G9<N9).
I have a list of elements, each one identified with a type. I need to reorder the list to maximize the minimum distance between elements of the same type.
The set is small (10 to 30 items), so performance is not really important.
There's no limit on the number of items per type or the number of types, and the data can be considered random.
For example, if I have a list of:
5 items of A
3 items of B
2 items of C
2 items of D
1 item of E
1 item of F
I would like to produce something like:
A, B, C, A, D, F, B, A, E, C, A, D, B, A
A has at least 2 items between occurrences
B has at least 4 items between occurrences
C has 6 items between occurrences
D has 6 items between occurrences
Is there an algorithm to achieve this?
-Update-
After exchanging some comments, I came to a definition of a secondary goal:
main goal: maximize the minimum distance between elements of the same type, considering only the type(s) with the least distance.
secondary goal: maximize the minimum distance between elements of every type. I.e.: if an arrangement increases the minimum distance of a certain type without decreasing that of another, then choose it.
-Update 2-
About the answers.
There were a lot of useful answers, although none is a solution for both goals, especially the second one, which is tricky.
Some thoughts about the answers:
PengOne: Sounds good, although it doesn't provide a concrete implementation and doesn't always lead to the best result according to the second goal.
Evgeny Kluev: Provides a concrete implementation for the main goal, but it doesn't always lead to the best result according to the secondary goal.
tobias_k: I liked the random approach; it doesn't always lead to the best result, but it's a good approximation and cost effective.
I tried a combination of Evgeny Kluev, backtracking, and the tobias_k formula, but it needed too much time to get the result.
Finally, at least for my problem, I considered tobias_k's to be the most adequate algorithm, for its simplicity and good results in a timely fashion. Probably it could be improved using simulated annealing.
First, you don't have a well-defined optimization problem yet. If you want to maximize the minimum distance between two items of the same type, that's well defined. If you want to maximize the minimum distance between two A's and between two B's and ... and between two Z's, then that's not well defined. How would you compare two solutions:
A's are at least 4 apart, B's at least 4 apart, and C's at least 2 apart
A's at least 3 apart, B's at least 3 apart, and C's at least 4 apart
You need a well-defined measure of "good" (or, more accurately, "better"). I'll assume for now that the measure is: maximize the minimum distance between any two of the same item.
Here's an algorithm that achieves a minimum distance of ceiling(N/n(A)) where N is the total number of items and n(A) is the number of items of instance A, assuming that A is the most numerous.
Order the item types A1, A2, ... , Ak where n(Ai) >= n(A{i+1}).
Initialize the list L to be empty.
For j from k to 1, distribute the items of type Aj as uniformly as possible in L.
Example: Given the distribution in the question, the algorithm produces:
F
E, F
D, E, D, F
D, C, E, D, C, F
B, D, C, E, B, D, C, F, B
A, B, D, A, C, E, A, B, D, A, C, F, A, B
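A hedged Python sketch of this procedure (names are mine; the exact interleaving can differ from the worked example above, but the most frequent type still ends up with the claimed spacing):

from collections import Counter

def spread_uniformly(items):
    counts = Counter(items)
    # A1, ..., Ak with n(Ai) >= n(A(i+1)); we then go from Ak down to A1
    ordered = sorted(counts, key=counts.get, reverse=True)
    result = []
    for t in reversed(ordered):
        n = counts[t]
        new_len = len(result) + n
        # evenly spaced target positions in the list after insertion
        if n == 1:
            targets = [0]
        else:
            targets = [round(j * (new_len - 1) / (n - 1)) for j in range(n)]
        for pos in targets:   # increasing order keeps earlier inserts in place
            result.insert(pos, t)
    return result

items = list("AAAAA" "BBB" "CC" "DD" "E" "F")
print(spread_uniformly(items))
# -> ['A', 'B', 'C', 'A', 'D', 'E', 'A', 'B', 'F', 'D', 'A', 'C', 'B', 'A']
#    (every two A's have at least 2 items between them)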
This sounded like an interesting problem, so I just gave it a try. Here's my super-simplistic randomized approach, done in Python:
import random

def optimize(items, quality_function, stop=1000):
    # Randomly swap two positions; keep the swap if quality improves.
    # Stop after `stop` consecutive swaps without improvement.
    no_improvement = 0
    best = 0
    while no_improvement < stop:
        i = random.randint(0, len(items) - 1)
        j = random.randint(0, len(items) - 1)
        copy = items[::]
        copy[i], copy[j] = copy[j], copy[i]
        q = quality_function(copy)
        if q > best:
            items, best = copy, q
            no_improvement = 0
        else:
            no_improvement += 1
    return items
As already discussed in the comments, the really tricky part is the quality function, passed as a parameter to the optimizer. After some trying I came up with one that almost always yields optimal results. Thanks to pmoleri for pointing out how to make this a whole lot more efficient.
def quality_maxmindist(items):
    # Sum the reciprocals of the gaps between consecutive occurrences of each
    # type; the smaller this sum, the better spread out the items are.
    s = 0
    for item in set(items):
        indcs = [i for i in range(len(items)) if items[i] == item]
        if len(indcs) > 1:
            s += sum(1. / (indcs[i + 1] - indcs[i]) for i in range(len(indcs) - 1))
    return 1. / s if s else float('inf')  # no type repeats at all: perfect score
And here is a random result:
>>> print optimize(items, quality_maxmindist)
['A', 'B', 'C', 'A', 'D', 'E', 'A', 'B', 'F', 'C', 'A', 'D', 'B', 'A']
Note that, passing another quality function, the same optimizer could be used for different list-rearrangement tasks, e.g. as a (rather silly) randomized sorter.
Here is an algorithm that only maximizes the minimum distance between elements of the same type and does nothing beyond that. The following list is used as an example:
AAAAA BBBBB CCCC DDDD EEEE FFF GG
Sort the element sets by the number of elements of each type, in descending order. Actually only the largest sets (A & B) need to be placed at the head of the list, along with those element sets that have one element less (C & D & E); other sets may remain unsorted.
Reserve the R last positions in the array for one element from each of the largest sets, and divide the remaining array evenly between the S-1 remaining elements of the largest sets. This gives the optimal distance: K = (N - R) / (S - 1). Represent the target array as a 2D matrix with K columns and L = N / K full rows (and possibly one partial row with N % K elements). For the example sets we have R = 2, S = 5, N = 27, K = 6, L = 4.
If the matrix has S - 1 full rows, fill the first R columns of this matrix with elements of the largest sets (A & B); otherwise sequentially fill all columns, starting from the last one.
For our example this gives:
AB....
AB....
AB....
AB....
AB.
If we try to fill the remaining columns with other sets in the same order, there is a problem:
ABCDE.
ABCDE.
ABCDE.
ABCE..
ABD
The last 'E' is only 5 positions apart from the first 'E'.
Instead, sequentially fill the remaining columns, starting from the last one.
For our example this gives:
ABFEDC
ABFEDC
ABFEDC
ABGEDC
ABG
Returning to linear array we have:
ABFEDCABFEDCABFEDCABGEDCABG
Here is an attempt to use simulated annealing for this problem (C sources): http://ideone.com/OGkkc.
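For reference, a Python sketch of the column-filling layout described above; it only handles the 'nice' case from the example (the matrix ends up with S - 1 full rows and the partial row is wide enough for the R reserved columns), and the function name is mine:

from collections import Counter

def arrange(items):
    counts = Counter(items)
    order = sorted(counts, key=counts.get, reverse=True)  # most frequent first
    S = counts[order[0]]                                  # size of the largest set(s)
    largest = [t for t in order if counts[t] == S]
    R, N = len(largest), len(items)
    K = (N - R) // (S - 1)                                # columns = target distance
    full, partial = divmod(N, K)                          # full rows, partial-row width
    heights = [full + (1 if c < partial else 0) for c in range(K)]
    grid = [None] * K                                     # stored column by column

    # reserve the first R columns for the largest sets
    for c, t in enumerate(largest):
        grid[c] = [t] * heights[c]

    # fill the remaining columns with the other sets, starting from the last column
    rest = iter([t for t in order if t not in largest for _ in range(counts[t])])
    for c in range(K - 1, R - 1, -1):
        grid[c] = [next(rest) for _ in range(heights[c])]

    # read the matrix back row by row
    nrows = full + (1 if partial else 0)
    return [grid[c][r] for r in range(nrows) for c in range(K) if r < heights[c]]

print("".join(arrange(list("AAAAA" "BBBBB" "CCCC" "DDDD" "EEEE" "FFF" "GG"))))
# -> ABFEDCABFEDCABFEDCABGEDCABG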
I believe you could see your problem as a bunch of particles that physically repel each other. You could iterate to a 'stable' situation.
Basic pseudo-code:
force( x, y ) = 0 if x.type==y.type
1/distance(x,y) otherwise
nextposition( x, force ) = coined?(x) => same
else => x + force
notconverged(row,newrow) = // simplistically
row!=newrow
row=[a,b,a,b,b,b,a,e];
newrow=nextposition(row);
while( notconverged(row,newrow) )
    row=newrow;
    newrow=nextposition(row);
I don't know if it converges, but it's an idea :)
I'm sure there may be a more efficient solution, but here is one possibility for you:
First, note that it is very easy to find an ordering which produces a minimum-distance-between-items-of-same-type of 1. Just use any random ordering, and the MDBIOST will be at least 1, if not more.
So, start off with the assumption that the MDBIOST will be 2. Do a recursive search of the space of possible orderings, based on the assumption that MDBIOST will be 2. There are a number of conditions you can use to prune branches from this search. Terminate the search if you find an ordering which works.
If you found one that works, try again, under the assumption that MDBIOST will be 3. Then 4... and so on, until the search fails.
UPDATE: It would actually be better to start with a high number, because that will constrain the possible choices more. Then gradually reduce the number, until you find an ordering which works.
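A rough Python sketch of this search, assuming 'distance' means the difference between positions; the names and the upper bound used as a starting point are mine:

def order_with_gap(counts, k):
    # counts: {type: copies}; returns an ordering in which equal types are at
    # least k positions apart, or None if the backtracking search fails
    n = sum(counts.values())
    seq, last = [], {}

    def place(pos):
        if pos == n:
            return True
        for t in counts:
            if counts[t] and (t not in last or pos - last[t] >= k):
                counts[t] -= 1
                prev = last.get(t)
                seq.append(t)
                last[t] = pos
                if place(pos + 1):
                    return True
                seq.pop()
                counts[t] += 1
                if prev is None:
                    del last[t]
                else:
                    last[t] = prev
        return False

    return seq if place(0) else None

counts = {'A': 5, 'B': 3, 'C': 2, 'D': 2, 'E': 1, 'F': 1}
n, top = sum(counts.values()), max(counts.values())
# start from an upper bound on the achievable spacing and work downwards
# (assumes at least one type has more than one copy)
for k in range((n - 1) // (top - 1), 0, -1):
    ordering = order_with_gap(dict(counts), k)
    if ordering:
        print(k, ordering)
        break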
Here's another approach.
If every item must be kept at least k places from every other item of the same type, then write down items from left to right, keeping track of the number of items left of each type. At each point put down an item with the largest number left that you can legally put down.
This will work for N items if there are no more than ceil(N / k) items of the same type, as it will preserve this property: after putting down k items we have k fewer items, and we have put down at least one of each type that started with ceil(N / k) items of that type.
Given a clutch of mixed items you could work out the largest k you can support and then lay out the items to solve for this k.
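A minimal Python sketch of this greedy layout plus the feasibility check for k (names are mine):

from collections import Counter
from math import ceil

def greedy_layout(items, k):
    counts, last, out = Counter(items), {}, []
    for pos in range(len(items)):
        # types that still have copies left and whose previous copy is >= k back
        legal = [t for t in counts
                 if counts[t] > 0 and (t not in last or pos - last[t] >= k)]
        if not legal:
            return None                   # k was too ambitious
        t = max(legal, key=counts.get)    # most copies left
        counts[t] -= 1
        last[t] = pos
        out.append(t)
    return out

items = list("AAAAA" "BBB" "CC" "DD" "E" "F")
N, top = len(items), max(Counter(items).values())
# largest k satisfying the condition above: no type exceeds ceil(N / k)
k = max(d for d in range(1, N + 1) if top <= ceil(N / d))
print(k, greedy_layout(items, k))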
I have these two tables:
TableA
ID Opt1 Opt2 Type
1 A Z 10
2 B Y 20
3 C Z 30
4 C K 40
and
TableB
ID Opt1 Type
1 Z 57
2 Z 99
3 X 3000
4 Z 3000
What would be a good algorithm to find arbitrary relations between these two tables? In this example, I'd like it to find the apparent relation between records containing Opt1 = C in TableA and Type = 3000 in TableB.
I could think of Apriori in some way, but it doesn't seem too practical. What do you guys say?
Thanks.
It sounds like a relational data mining problem. I would suggest trying Ross Quinlan's FOIL: http://www.rulequest.com/Personal/
In pseudocode, a naive implementation might look like:
for each column c1 in table1
    for each column c2 in table2
        if approximately_isomorphic(c1, c2) then
            emit (c1, c2)

approximately_isomorphic(c1, c2)
    hmap = hash()
    for i = 1 to min(|c1|, |c2|) do
        hmap[c1[i]] = c2[i]
    if |hmap| - unique_count(c1) < error_margin then return true
    else return false
The idea is this: do a pairwise comparison of the elements of each column with each other column. For each pair of columns, construct a hash map linking corresponding elements of the two columns. If the hash map contains the same number of linkings as unique elements of the first column, then you have a perfect isomorphism; if you have a few more, you have a near isomorphism; if you have many more, up to the number of elements in the first column, you have what probably doesn't represent any correlation.
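One way to read this in Python is to count the distinct links (value pairs) between the two columns; the function name and default margin below are mine, and this reading reproduces the tallies in the example that follows:

def approx_isomorphic(c1, c2, error_margin=2):
    links = set()                        # distinct (c1 value, c2 value) pairs
    for a, b in zip(c1, c2):
        links.add((a, b))
    return len(links) - len(set(c1)) < error_margin

# columns from the question, compared across the two tables
tableA_opt1 = ['A', 'B', 'C', 'C']
tableB_type = [57, 99, 3000, 3000]
print(approx_isomorphic(tableA_opt1, tableB_type))  # True: C lines up with 3000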
Example on your input:
ID & anything   : perfect isomorphism since all values of ID are unique
Opt1 & ID       : 4 mappings and 3 unique values; not a perfect isomorphism, but not too far away
Opt1 & Opt1     : ditto above
Opt1 & Type     : 3 mappings & 3 unique values, perfect isomorphism
Opt2 & ID       : 4 mappings & 3 unique values, not a perfect isomorphism, but not too far away
Opt2 & Opt2     : ditto above
Opt2 & Type     : ditto above
Type & anything : perfect isomorphism since all values of Type are unique
For best results, you might do this procedure both ways - that is, comparing table1 to table2 and then comparing table2 to table1 - to look for bijective mappings. Otherwise, you can be thrown off by trivial cases... all values in the first are different (perfect isomorphism) or all values in the second are the same (perfect isomorphism). Note also that this technique provides a way of ranking, or measuring, how similar or dissimilar columns are.
Is this going in the right direction? By the way, this is O(ijk) where table1 has i columns, table 2 has j columns and each column has k elements. In theory, the best you could do for a method would be O(ik + jk), if you can find correlations without doing pairwise comparisons.
I have a table of items with [ID, ATTR1, ATTR2, ATTR3]. I'd like to select about half of the items, but try to get a random result set that is NOT clustered. In other words, there's a fairly even spread of ATTR1 values, ATTR2 values, and ATTR3 values. This does NOT necessarily represent the data as a whole, in other words, the total table may be generally concentrated on certain attribute values, but I'd like to select a subset with more variety. The attributes are not inter-related, so there's not really a correlation between ATTR1 and ATTR2.
As an example, imagine ATTR1 = "State". I'd like each line item in my subset to be from a different state, even if in the whole set, most of my data is concentrated on a few states. And for this to simultaneously be true of the other 2 attributes, too. (I realize that some tables might not make this possible, but there's enough data that it's unlikely to have no solution)
Any ideas for an efficient algorithm? Thanks! I don't really even know how to search for this :)
(by the way, it's OK if this requires pre-calculation or -indexing on the whole set, so long as I can draw out random varied subsets quickly)
Interesting problem. Since you want about half of the list, how about this:
Create a list of half the items chosen entirely at random. Compute histograms of the ATTR1, ATTR2, ATTR3 values over the chosen items.
:loop
Now randomly pick an item that's in the current list and an item that isn't.
If the new item increases the 'entropy' of the number of unique attributes in the histograms, keep it and update the histograms to reflect the change you just made.
Repeat N/2 times, or more depending on how much you want to force it to move towards covering every value rather than being random. You could also use 'simulated annealing' and gradually change the probability to accepting the swap - starting with 'sometimes allow a swap even if it makes it worse' down to 'only swap if it increases variety'.
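A small Python sketch of this swap loop, using the count of distinct values per attribute as a crude stand-in for the histogram 'entropy' (all names are mine, and it assumes rows with three attributes):

import random

def variety(subset):
    # how many distinct values each of the three attributes contributes
    return sum(len(set(row[i] for row in subset)) for i in range(3))

def varied_half(rows, iterations=10000):
    rows = rows[:]
    random.shuffle(rows)
    half = len(rows) // 2
    chosen, rest = rows[:half], rows[half:]
    for _ in range(iterations):
        i, j = random.randrange(len(chosen)), random.randrange(len(rest))
        before = variety(chosen)
        chosen[i], rest[j] = rest[j], chosen[i]      # try a swap
        if variety(chosen) <= before:
            chosen[i], rest[j] = rest[j], chosen[i]  # undo: no improvement
    return chosen

# toy data heavily concentrated on one 'state'
rows = [(state, n, flag) for state in 'AAAABBC' for n in (1, 2) for flag in 'xy']
print(varied_half(rows))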
I don't know (and I hope someone who does will answer). Here's what comes to mind: make up a distribution for MCMC putting the most weight on the subsets with 'variety'.
Assuming the items in your table are indexed by some form of id, I would, in a loop, iterate through half of the items in your table and use a random number generator to get the ids.
IMHO finding variety is difficult but generating it is easy.
So we can generate a variety of combinations and then search the table for records with those combinations.
If the table is sorted then searching also becomes easy.
Sample python code:
d = {}
d[('a',0,'A')]=0
d[('a',1,'A')]=1
d[('a',0,'A')]=2
d[('b',1,'B')]=3
d[('b',0,'C')]=4
d[('c',1,'C')]=5
d[('c',0,'D')]=6
d[('a',0,'A')]=7
print d
attr1 = ['a','b','c']
attr2 = [0,1]
attr3 = ['A','B','C','D']
# no of items in
# attr2 < attr1 < attr3
# ;) reason for strange nesting of loops
for z in attr3:
    for x in attr1:
        for y in attr2:
            k = (x,y,z)
            if d.has_key(k):
                print '%s->%s'%(k,d[k])
            else:
                print k
Output:
('a', 0, 'A')->7
('a', 1, 'A')->1
('b', 0, 'A')
('b', 1, 'A')
('c', 0, 'A')
('c', 1, 'A')
('a', 0, 'B')
('a', 1, 'B')
('b', 0, 'B')
('b', 1, 'B')->3
('c', 0, 'B')
('c', 1, 'B')
('a', 0, 'C')
('a', 1, 'C')
('b', 0, 'C')->4
('b', 1, 'C')
('c', 0, 'C')
('c', 1, 'C')->5
('a', 0, 'D')
('a', 1, 'D')
('b', 0, 'D')
('b', 1, 'D')
('c', 0, 'D')->6
('c', 1, 'D')
But assuming your table is very big (otherwise why would you need an algorithm ;)) and the data is fairly uniformly distributed, there will be more hits in an actual scenario. In this dummy case there are too many misses, which makes the algorithm look inefficient.
Let's assume that ATTR1, ATTR2, and ATTR3 are independent random variables (over a uniform random item). (If ATTR1, ATTR2, and ATTR3 are only approximately independent, then this sample should be approximately uniform in each attribute.) To sample one item (VAL1, VAL2, VAL3) whose attributes are uniformly distributed, choose VAL1 uniformly at random from the set of values for ATTR1, choose VAL2 uniformly at random from the set of values for ATTR2 over items with ATTR1 = VAL1, choose VAL3 uniformly at random from the set of values for ATTR3 over items with ATTR1 = VAL1 and ATTR2 = VAL2.
To get a sample of distinct items, apply the above procedure repeatedly, deleting each item after it is chosen. Probably the best way to implement this would be a tree. For example, if we have
ID ATTR1 ATTR2 ATTR3
1 a c e
2 a c f
3 a d e
4 a d f
5 b c e
6 b c f
7 b d e
8 b d f
9 a c e
then the tree is, in JavaScript object notation,
{"a": {"c": {"e": [1, 9], "f": [2]},
"d": {"e": [3], "f": [4]}},
"b": {"c": {"e": [5], "f": [6]},
"d": {"e": [7], "f": [8]}}}
Deletion is accomplished recursively. If we sample id 4, then we delete it from its list at the leaf level. This list empties, so we delete the entry "f": [] from tree["a"]["d"]. If we now delete 3, then we delete 3 from its list, which empties, so we delete the entry "e": [] from tree["a"]["d"], which empties tree["a"]["d"], so we delete it in turn. In a good implementation, each item should take time O(# of attributes).
EDIT: For repeated use, reinsert the items into the tree after the whole sample is collected. This doesn't affect the asymptotic running time.
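A compact Python sketch of this tree-based sampler (function names are mine; the rows below are the example table above):

import random

def build_tree(rows):
    # nested dict ATTR1 -> ATTR2 -> ATTR3 -> [ids], as in the structure above
    tree = {}
    for _id, a1, a2, a3 in rows:
        tree.setdefault(a1, {}).setdefault(a2, {}).setdefault(a3, []).append(_id)
    return tree

def sample_one(tree):
    # pick one id, uniform per attribute level, deleting it (and any branches
    # it empties) on the way back up
    v1 = random.choice(list(tree))
    v2 = random.choice(list(tree[v1]))
    v3 = random.choice(list(tree[v1][v2]))
    ids = tree[v1][v2][v3]
    picked = ids.pop(random.randrange(len(ids)))
    if not ids:
        del tree[v1][v2][v3]
        if not tree[v1][v2]:
            del tree[v1][v2]
            if not tree[v1]:
                del tree[v1]
    return picked

rows = [(1, 'a', 'c', 'e'), (2, 'a', 'c', 'f'), (3, 'a', 'd', 'e'),
        (4, 'a', 'd', 'f'), (5, 'b', 'c', 'e'), (6, 'b', 'c', 'f'),
        (7, 'b', 'd', 'e'), (8, 'b', 'd', 'f'), (9, 'a', 'c', 'e')]
tree = build_tree(rows)
print([sample_one(tree) for _ in range(4)])  # four distinct ids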
Idea #2.
Compute histograms for each attribute on the original table.
For each item compute its uniqueness score = p(ATTR1) x p(ATTR2) x p(ATTR3) (multiply the probabilities of each attribute value it has).
Sort by uniqueness.
Choose a probability distribution curve for your random numbers, ranging from picking only values in the first half of the set (a step function) to picking values evenly over the entire set (a flat line). Maybe a 1/x curve might work well for you in this case.
Pick values from the sorted list using your chosen probability curve.
This allows you to bias it towards more unique values or towards more evenness just by adjusting the probability curve you use to generate the random numbers.
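A hedged Python sketch of this idea, using a simple power-law skew in place of the 1/x curve (all names and the bias knob are mine):

import random
from collections import Counter

def pick_varied(rows, k, bias=2.0):
    n = len(rows)
    k = min(k, n)
    # per-attribute histograms over the whole table
    hists = [Counter(r[i] for r in rows) for i in range(len(rows[0]))]
    def uniqueness(r):
        p = 1.0
        for i, v in enumerate(r):
            p *= hists[i][v] / n          # p(ATTR_i = v)
        return p
    ranked = sorted(rows, key=uniqueness)     # most unusual rows first
    chosen, used = [], set()
    while len(chosen) < k:
        i = int(n * random.random() ** bias)  # skewed toward the front
        if i not in used:
            used.add(i)
            chosen.append(ranked[i])
    return chosen

rows = [('NY', 1, 'x'), ('NY', 1, 'y'), ('NY', 2, 'x'),
        ('CA', 1, 'x'), ('TX', 3, 'z'), ('NY', 1, 'x')]
print(pick_varied(rows, 3))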
Taking your example, assign every possible 'State' a numeric value (say, between 1 and 9). Do the same for the other attributes.
Now, assuming you don't have more than 10 possible values for each attribute, multiply the value for ATTR3 by 100, ATTR2 by 1,000, ATTR1 by 10,000. Add up the results and you will end up with what can resemble a vague hash of the item. Something like
10,000 * |ATTR1| + 1000 * |ATTR2| + 100 * |ATTR3|
The advantage here is that you know that values between 10,000 and 19,999 have the same 'State' value; in other words, the first digit represents ATTR1. The same goes for ATTR2 and the other attributes.
You can sort all values and using something like bucket-sort pick one for each type, checking that the digit you're considering hasn't been picked already.
An example: if your end values are
A: 15,700 = 10,000 * 1 + 1,000 * 5 + 100 * 7
B: 13,400 = 10,000 * 1 + 1,000 * 3 + 100 * 4
C: 13,200 = ...
D: 12,300
E: 11,400
F: 10,900
you know that all your values have the same ATTR1; 2 have the same ATTR2 (that being B and C); and 2 have the same ATTR3 (B, E).
This, of course, assumes I understood correctly what you want to do. It's Saturday night, after all.
PS: yes, I could have used '10' as the first multiplier, but the example would have been messier; and yes, it's clearly a naive example and there are lots of possible optimizations here, which are left as an exercise to the reader.
It's a very interesting problem, for which I can see a number of applications. Notably for testing software: you get many 'main-flow' transactions, but only one is necessary to test that it works and you would prefer when selecting to get an extremely varied sample.
I don't think you really need a histogram structure, or at least only a binary one (absent/present).
{ ATTR1: [val1, val2], ATTR2: [i,j,k], ATTR3: [1,2,3] }
This is used in fact to generate a list of predicates:
Predicates = [ lambda x: x.attr1 == val1, lambda x: x.attr1 == val2,
lambda x: x.attr2 == i, ...]
This list will contain say N elements.
Now you wish to select K elements from this list. If K is less than N it's fine; otherwise we will duplicate the list i times, so that K <= N*i, with i minimal of course, so i = ceil(K/N) (note that this also works when K <= N, with i == 1).
i = ceil(K/N)
Predz = Predicates * i # python's wonderful
And finally, pick up a predicate there, and look for an element that satisfies it... that's where randomness actually hits and I am less than adequate here.
A few remarks:
if K > N you may be willing to actually select each predicate i-1 times and then select randomly from the list of predicates only to top off your selection, thus ensuring the over-representation of even the least common elements.
the attributes are completely uncorrelated this way; you may be willing to select patterns, as you could never get the tuple (1,2,3) by selecting on the third element being 3, so perhaps a refinement would be to group some related attributes together, though it would probably increase the number of predicates generated.
for efficiency reasons, you should have the table organized by the predicate category if you wish to have an efficient select.
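One simple way to finish the selection step left open above (this is my interpretation, not the author's): cycle through the duplicated predicate list and, for each predicate, pick a random element that satisfies it and hasn't been taken yet.

import random

def select_varied(elements, predicates, k):
    chosen = []
    preds = predicates[:]
    random.shuffle(preds)
    while len(chosen) < k and preds:
        pred = preds.pop()
        candidates = [e for e in elements if pred(e) and e not in chosen]
        if candidates:
            chosen.append(random.choice(candidates))
    return chosen

rows = [{'attr1': v1, 'attr2': v2, 'attr3': v3}
        for v1 in ('val1', 'val2') for v2 in 'ijk' for v3 in (1, 2, 3)]
predicates = ([lambda x, v=v: x['attr1'] == v for v in ('val1', 'val2')]
              + [lambda x, v=v: x['attr2'] == v for v in 'ijk']
              + [lambda x, v=v: x['attr3'] == v for v in (1, 2, 3)])
sample = select_varied(rows, predicates * 2, 10)  # i = 2 copies of each predicate
print(len(sample), sample[:2])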