Data table with all combinations - data.table

I am dealing with a problem. I have a one-row data table with approximately 60 columns. The values in each column are sampled, so that each cell draws from either two (YES, NO) or five (A, B, C, D, E) possible values. My question is how to add rows in such a way that the table will contain all possible combinations. Is expand.grid together with mapply a good choice, or is there a better way to do this?

Related

Relational Algebra and Cardinality?

I'm very confused when it comes to the topic of cardinality in Relational Algebra. I understand that cardinality essentially refers to the uniqueness of a table or data set. So I'll walk through a problem I attempted to solve and maybe someone can help me out, or give me better resources than the ones I've found.
I've got a table R2, with attributes D, E, and F, where D is a Primary Key, and E and F are Foreign Keys relating to the Primary Keys of the following table: R3, with attributes G, H, and I, where G and H are PKs. R2 has cardinality N2 = 100, R3 has cardinality N3 = 200. So what would the min and max cardinality be of a table created by joining R2 to R3 with the condition that E = G and F = H?
My answer is that the minimum is 1, and max is 200, or N3. My thought process is that E and F are FKs, so they can have many repeating values so long as they come from G and H, but since G and H are PKs, at least one value for E and F would be unique, and D is a PK as well, so at least one value is unique there too. So I assume those unique values mean the cardinality must be at least 1, and at most, it can have the same cardinality as R3, which is 200. But honestly, my own reasoning doesn't even make sense to me...
The whole idea seems really abstract to me. Attribute I is the only non FK/PK in the problem, so how does that affect the cardinality? Sorry for the long winded question, I'm just very confused by the whole idea of this and would love any help in general regarding the subject.
You are not equijoining FK-to-CK (foreign key to candidate key). You are equijoining on EF subtuples matching GH subtuples. Although every E value has a matching G and every F value has a matching H, there does not have to be a single EF-GH match. G & H are unique, so GH is unique, so each EF can match at most one GH; hence the result can have anywhere from 0 to 100 rows (at most one output row per R2 row).
(If you want to make sound analyses you need to find the minimum and maximum results for the various kinds of joins on column sets referencing (i.e. having to appear elsewhere as) others. You can handle more cases by dealing with superkeys (unique column sets) rather than CKs (candidate keys, i.e. superkeys containing no smaller superkeys). You mean CK when you say "PK" (primary key); there can be at most one PK per table. In the absence of duplicates or nulls, SQL UNIQUE declares a superkey and FOREIGN KEY declares a foreign superkey.)
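To make the 0-to-100 range concrete, here is a tiny illustration (hypothetical Python with made-up data, not from the question): every E value appears among the G values and every F value among the H values, yet no (E, F) pair equals a (G, H) pair, so the join is empty.

# Hypothetical tiny instances of R2(D, E, F) and R3(G, H, I).
R2 = [(1, 'g1', 'h2'), (2, 'g2', 'h1')]          # (D, E, F)
R3 = [('g1', 'h1', 'x'), ('g2', 'h2', 'y')]      # (G, H, I)

joined = [(d, e, f, g, h, i)
          for (d, e, f) in R2
          for (g, h, i) in R3
          if e == g and f == h]
print(len(joined))   # 0 -> the minimum

# If instead every (E, F) pair matches some (G, H) pair, each R2 row matches
# exactly one R3 row (GH is unique), so the maximum is len(R2), i.e. 100.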

Indexing Strategy in Oracle

I have a table with 2 million rows.
The NDV (number of distinct values) in the columns is as follows:
A - 3
B - 60
D - 150
E - 600,000
The most frequently updated columns are A & B (NDV = 3 for both).
Assuming every query will have either column D or column E in the WHERE clause, which of the following will be the best set of indexes for SELECT statements:
D
D,E,A
E,A
A,E
Not really enough information to give a definitive assessment, but some things to consider:
You're unlikely to get a skip scan benefit, so if you want snappy response from predicates with leading E or leading D, that will be 2 indexes (one leading with D, and one leading with E).
If A/B are updated frequently (although that's a generic term), you might choose to leave them out of the index definition in order to reduce index maintenance overhead.
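As a rough illustration of the leading-column point (a Python analogy with made-up data and names, not Oracle internals):

import bisect
import random

# A composite index on (D, E, A) behaves roughly like a list of tuples
# sorted by its leading column(s); data here is made up.
random.seed(0)
rows = [(random.randrange(150), random.randrange(600_000), random.randrange(3))
        for _ in range(10_000)]                      # (D, E, A) values

index_dea = sorted((d, e, a, rid) for rid, (d, e, a) in enumerate(rows))

# A predicate on the leading column D can seek straight into the index:
lo = bisect.bisect_left(index_dea, (42,))
hi = bisect.bisect_right(index_dea, (42, float("inf")))
d_matches = index_dea[lo:hi]

# A predicate on E alone cannot seek into an index led by D; without a skip
# scan it effectively reads the whole index, which is why a second index
# leading with E is suggested above.
e_matches = [t for t in index_dea if t[1] == 123_456]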

A greedy solution for a matrix rearrangement

I am working on something which I feel is an NP-hard problem, so I am not looking for the optimal solution, just a better heuristic. An integer input matrix (matrix A in the following example) is given as input, and I have to produce an integer output matrix (matrix B in the following example) whose number of rows is smaller than the input matrix's and which obeys the following two conditions:
1) Each column of the output matrix should contain the integers in the same order as they appear in the input matrix. (In the example below, the first column of matrix A and of matrix B has the same integers 1, 3 in the same order.)
2) The same integer must not appear twice in the same row. (In the example below, the first row of matrix B contains the integers 1, 3 and 2, which are all different from each other.)
Note that the input matrix always obeys the 2nd condition.
What would a greedy algorithm to solve this problem look like?
Example:
In this example the output matrix 'Matrix B' contains all the integers as they appear in the input matrix 'Matrix A', but the output matrix has 5 rows and the input matrix has 6 rows. So, the output 'Matrix B' is a valid solution of the input 'Matrix A'.
I would produce the output one row at a time. When working out what to put in the row I would consider the next number from each input column, starting from the input column which has the most numbers yet to be placed, and considering the columns in decreasing order of numbers yet to be placed. Where a column can put its number into the current output row when its turn comes up, it should do so.
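A rough sketch of this row-at-a-time greedy (my own illustrative Python, assuming the input is given as a list of columns holding the numbers still to be placed):

# Illustrative sketch only: columns is a list of lists of integers.
def greedy_rows(columns):
    remaining = [list(col) for col in columns]   # numbers yet to be placed, per column
    n_cols = len(columns)
    output = []                                  # rows of B; None marks an empty cell
    while any(remaining):
        row = [None] * n_cols
        used = set()
        # consider columns in decreasing order of numbers yet to be placed
        order = sorted(range(n_cols), key=lambda c: len(remaining[c]), reverse=True)
        for c in order:
            if remaining[c] and remaining[c][0] not in used:
                row[c] = remaining[c].pop(0)     # place this column's next number
                used.add(row[c])
        output.append(row)
    return output

print(greedy_rows([[1, 3], [3, 2], [2, 1]]))     # -> [[1, 3, 2], [3, 2, 1]]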
You could extend this to a branch and bound solution to find the exact best answer. Recursively try all possible rows at each stage, except when you can see that the current row cannot possibly improve on the best answer so far. You know that if you have a column with k entries yet to be placed, in the best possible case you will need at least k more rows.
In practice I would expect that this will be too expensive to be practical, so you will need to ignore some possible rows which you cannot rule out, and so cannot guarantee to find the best answer. You could try using a heuristic search such as Limited Discrepancy search.
Another non-exact speedup is to multiply the estimate for the number of rows that the best possible answer derived from a partial solution will require by some factor F > 1. This will allow you to rule out some solutions earlier than branch and bound. The answer you find can be no more than F times more expensive than the best possible answer, because you only discard possibilities that cannot improve on the current answer by more than a factor of F.
A greedy solution to this problem would involve placing the numbers column by column, top down, as they appear.
Pseudocode:
For each column c in A:
    r = 0        // row index of next element in A
    nextRow = 0  // row index of next element to be placed in B
    while r < A.NumRows():
        while r < A.NumRows() && A[r, c] is null:
            r++                        // skip empty cells in A
        if r < A.NumRows():            // we found a non-null entry in A
            while nextRow < A.NumRows() && !CheckConstraints(A[r, c], B, nextRow):
                nextRow++              // try the next output row in B
            if nextRow >= A.NumRows():
                return unsolvable      // couldn't find a valid position in B
            B[nextRow, c] = A[r, c]    // successfully found a position in B
            nextRow++                  // the next value must go in a later row of B
            r++                        // advance to the next entry in A
If there are no conflicts you end up "packing" B as tightly as possible. Otherwise you greedily search for the next non-conflicting row position in B. If none can be found, the problem is unsolvable.
The helper function CheckConstraints(v, B, row) looks backwards across the columns already filled in row 'row' of B and returns true only if v has not already been placed in that row.
If the problem statement is relaxed so that the output row count in B only has to be <= the row count in A, then whenever we are unable to pack B any tighter we can simply return A as a solution.
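For completeness, a runnable version of the same column-by-column idea (an illustrative Python sketch, assuming A is given as a list of rows with None marking empty cells; not the answerer's exact code):

# Illustrative sketch only.
def pack_columns(A):
    n_rows = len(A)
    n_cols = len(A[0]) if A else 0
    B = [[None] * n_cols for _ in range(n_rows)]

    def check_constraints(value, row_idx):
        # value must not already appear in row row_idx of B
        return value not in [x for x in B[row_idx] if x is not None]

    for c in range(n_cols):
        next_row = 0
        for r in range(n_rows):
            if A[r][c] is None:
                continue
            while next_row < n_rows and not check_constraints(A[r][c], next_row):
                next_row += 1
            if next_row >= n_rows:
                return None                      # unsolvable under this greedy order
            B[next_row][c] = A[r][c]
            next_row += 1

    while B and all(x is None for x in B[-1]):   # drop trailing all-empty rows
        B.pop()
    return B

print(pack_columns([[1, None, 2], [3, 2, None], [None, 1, 1]]))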

Is there a search algorithm for huge two-dimensional arrays?

This is not a real-life question, it is just theory-crafting.
I have a big array which consists of elements like [1,140,245,123443], all integers or floats with low selectivity; the number of unique values is ten times less than the size of the array. B*tree indexing is not good in this case.
I also tried to implement bitmap indexing, but in Ruby, binary operations are not so fast.
Are there any good algorithms for searching two-dimensional arrays of fixed size vectors?
And the main question is: how do I convert the vector into a value, where the conversion function has to be monotonic, so I can apply range queries such as:
(v[0]<10, v[2]>100, v[3]=32, 0.67*10^-8<v[4]<1.2154241410*10^-6)
The only idea I have is to create separate sorted indexes for each component of the vector, binary search each, and merge, but that is a bad idea because in the worst-case scenario it will require O(N*N) operations...
Assuming that each "column" is vaguely evenly distributed in a known range, you could keep track of a series of buckets for each column, and, for each bucket, the list of rows that fall into it. The number of buckets for each column can be the same or different; it's totally arbitrary. More buckets is faster, but takes slightly more memory.
my table:
range: {1 to 10} {1 to 4bn} {-2bn to 2bn}
row1: {7  3427438335  420645075}
row2: {5  3862506151  -1555396554}
row3: {1  2793453667  -1743457796}
buckets for column 1:
bucket {1 to 3} : row3
bucket {4 to 6} : row2
bucket {7 to 10} : row1
buckets for column 2:
bucket {1 to 2bn} :
bucket {2bn to 4bn} : row1, row2, row3
buckets for column 3:
bucket {-2bn to -1bn} : row2, row3
bucket {-1bn to 0} :
bucket {0 to 1bn} : row1
bucket {1bn to 2bn} :
Then, given a series of criteria: {v[0] <= 5, v[1] > 3*10^9}, we pull out the buckets that match those criteria:
column 1:
v[0] <= 5 matches buckets {1 to 3} and {4 to 6}, which is rows 2 and 3.
column 2:
v[1] > 3*10^9 matches bucket {2bn to 4bn}, which is rows 1, 2 and 3.
column 3:
no criterion, so it matches all buckets, which is rows 1, 2 and 3.
Now we know that the row(s) we're looking for meet all three criteria, so we list all the rows that are in the buckets that matched all the criteria, in this case, rows 2 and 3. At this point, the number of rows remaining will be small even for massive amounts of data, depending on the granularity of your buckets. You simply check each of the rows that is left at this point to see if they match. In this sample we see that row 2 matches, but row 3 doesn't.
This algorithm is technically O(n), but in practice, if you have large numbers of small buckets, this algorithm can be very fast.
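Here is a rough Python sketch of this bucket scheme (class name, bucket edges and data are illustrative only):

from bisect import bisect_right
from collections import defaultdict

# Illustrative sketch: each column gets a sorted list of bucket edges and a
# map from bucket number to the row ids that fall into that bucket.
class BucketIndex:
    def __init__(self, rows, edges):
        self.rows = rows
        self.edges = edges                           # edges[c]: bucket boundaries for column c
        self.buckets = [defaultdict(list) for _ in edges]
        for rid, row in enumerate(rows):
            for c, value in enumerate(row):
                self.buckets[c][bisect_right(edges[c], value)].append(rid)

    def query(self, ranges):
        # ranges: {column: (lo, hi)}, None meaning unbounded on that side
        candidates = set(range(len(self.rows)))
        for c, (lo, hi) in ranges.items():
            b_lo = 0 if lo is None else bisect_right(self.edges[c], lo)
            b_hi = len(self.edges[c]) if hi is None else bisect_right(self.edges[c], hi)
            ids = set()
            for b in range(b_lo, b_hi + 1):          # every bucket overlapping [lo, hi]
                ids.update(self.buckets[c].get(b, []))
            candidates &= ids
        # exact check on the few remaining candidate rows
        return sorted(rid for rid in candidates
                      if all((lo is None or self.rows[rid][c] >= lo) and
                             (hi is None or self.rows[rid][c] <= hi)
                             for c, (lo, hi) in ranges.items()))

rows = [(7, 3427438335, 420645075),
        (5, 3862506151, -1555396554),
        (1, 2793453667, -1743457796)]
edges = [[3, 6], [2_000_000_000], [-1_000_000_000, 0, 1_000_000_000]]
idx = BucketIndex(rows, edges)
print(idx.query({0: (None, 5), 1: (3_000_000_000, None)}))   # -> [1], i.e. row2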
Using an index :)
The basic idea is to turn the 2-dimensional array into a 1-dimensional sorted array (while keeping the original positions) and apply binary search on the latter.
This method works for any n-dimensional array and is used widely by databases, which can be seen as n-dimensional arrays with variable lengths.
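Under one reading of this, a minimal Python sketch (illustrative data): sort the rows as tuples while remembering their original positions, then binary-search on the leading component:

import bisect

data = [[7, 90, 5], [1, 140, 2], [3, 250, 9]]
# 1-dimensional sorted array of (row-as-tuple, original position)
index = sorted((tuple(row), pos) for pos, row in enumerate(data))

# all rows whose first component lies in [1, 3]
lo = bisect.bisect_left(index, ((1,),))
hi = bisect.bisect_right(index, ((3, float("inf")),))
print([pos for _, pos in index[lo:hi]])   # -> [1, 2]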

Complexity of: One matrix is row/col permutation of another matrix

Given two m x n matrices A and B whose elements belong to a set S.
Problem: Can the rows and columns of A be permuted to give B?
What is the complexity of algorithms to solve this problem?
Determinants partially help (when m=n): a necessary condition is that det(A) = +/- det(B).
Also allow A to contain "don't cares" that match any element of B.
Also, if S is finite allow permutations of elements of A.
This is not homework - it is related to the solved 17x17 puzzle.
See the example below of permuting rows and columns of a matrix:
Observe the start matrix and the end matrix. All elements in a row or column are retained; it's just that their order has changed. Also, the change in relative positions is uniform across rows and columns.
E.g. look at 1 in the start matrix and the end matrix. Its row has the elements 12, 3 and 14 along with it, and its column has 5, 9 and 2 along with it. This is maintained across the transformations.
Based on this fact I am putting forward this basic algo to decide, for a given matrix A, whether its rows and columns can be permuted to give matrix B (a code sketch follows the steps below):
1. For each row in A, sort all elements in the row. Do the same for B.
2. Sort all rows of A (and B) based on their elements, i.e. if row1 is {5,7,16,18} and row2 is {2,4,13,15}, then put row2 above row1.
3. Compare the resultant matrices A' and B'.
4. If both are equal, then do (1) and (2) again, but for columns of the ORIGINAL matrices A & B instead of rows.
5. Now compare the resultant matrices A'' and B''.
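For concreteness, a small transcription of those steps in Python (it implements the check exactly as stated above, nothing more):

# Illustrative transcription of the steps above; matrices are lists of rows.
def canonical_by_rows(M):
    # step 1: sort the elements inside each row; step 2: sort the rows
    return sorted(sorted(row) for row in M)

def canonical_by_cols(M):
    cols = list(zip(*M))                       # columns of the ORIGINAL matrix
    return sorted(sorted(col) for col in cols)

def rows_cols_permutation_check(A, B):
    return (canonical_by_rows(A) == canonical_by_rows(B) and   # steps 1-3
            canonical_by_cols(A) == canonical_by_cols(B))      # steps 4-5

A = [[1, 2],
     [3, 4]]
B = [[4, 3],   # A with both its rows and its columns swapped
     [2, 1]]
print(rows_cols_permutation_check(A, B))   # True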
