Data mining for integers with exact fitting - algorithm

I deal with RFID cards a lot. There are as many output formats and encodings for the same type of card as there are readers.
I frequently get requests to translate one reader's output into another's (if that is possible at all), which means I have to stare at the numbers and work out what the transformations are.
The most common transforms are:
added constant
reversed binary sequence
cutting a few bits away
rotation
combinations of these methods
I usually have something like a 30% success rate, but I always get frustrated when I cannot find the translation after a few hours. It's probably very simple, but I just cannot figure it out. That is why I am looking for some kind of algorithm/library/software that would check these rules automatically on two sets of numbers and try to find the transformation with the smallest Kolmogorov complexity.
Since I have zero knowledge of data mining, I would be thankful for any pointers.

This seems like a genetic programming problem.
The 'genes' are the individual bit transformations that can occur. The fitness function is how many bits are correctly transformed for growing input sets. A genetic programming library can shuffle genes around trying to find better fitness, "breeding" the individuals that have high fitness levels in an attempt to create a fitter individual.
Check out pyEvolve.

I don't know how long the numbers are, but let's assume they are 64-bit. The number of different non-trivial atomic transformations is then as follows:
Added constant 2**64 - 1
Reversal 1
Remove bits 63
Rotation 63
If you have combinations also, you have 4 + 12 + 24 + 24 = 64 different ways to order a subset of the transformations (without taking the parameters of the transformation into account). So what I would do is to
Have an outer loop that iterates over the 64 ways to combine the transformations
Then have an inner loop that iterates over the at most 63 * 63 parameter values for "remove bits" and rotation; the total number of iterations is now about 64**3 == (2**6)**3 == 2**18, which is okay
Apply the hypothetical transformation (one out of 2**18), and then calculate the differences between the first data set and the second data set transformed; if the difference is constant you have found the additive constant for the "added constant" transformation and are done
This should be very fast on a modern PC, i.e. you should be able to find the solution in a couple of seconds. If the data sets are large (> 100) you can use a sample first and then validate the result on the whole data set only when the subset works out correctly.
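The search described above can be sketched in Python roughly as follows. This is only an illustration under several assumptions that are not in the answer: "remove bits" is taken to mean dropping low-order bits, rotation is a left rotation within the current width, each transformation is used at most once, and the constant is always added last (so only the other three steps are reordered). The function names and the 16-bit sample values are hypothetical.

```python
from itertools import permutations, product

def reverse_bits(x, width):
    # Reverse the binary representation of x, padded to `width` bits.
    return int(format(x, f"0{width}b")[::-1], 2)

def rotate_left(x, r, width):
    # Rotate x left by r positions inside a `width`-bit word.
    r %= width
    mask = (1 << width) - 1
    return ((x << r) | (x >> (width - r))) & mask

def apply_steps(x, steps, width):
    # Apply a list of (name, parameter) steps; "drop" shrinks the word.
    for name, param in steps:
        if name == "reverse":
            x = reverse_bits(x, width)
        elif name == "rotate":
            x = rotate_left(x, param, width)
        elif name == "drop":  # assumption: cut away the low-order bits
            x >>= param
            width -= param
    return x

def find_transform(inputs, outputs, width):
    names = ["reverse", "drop", "rotate"]
    # Outer loops: every ordered subset of the steps, shortest first.
    for size in range(len(names) + 1):
        for order in permutations(names, size):
            ranges = [[None] if n == "reverse" else range(1, width)
                      for n in order]
            # Inner loop: every parameter combination for this ordering.
            for params in product(*ranges):
                steps = list(zip(order, params))
                diffs = {o - apply_steps(i, steps, width)
                         for i, o in zip(inputs, outputs)}
                if len(diffs) == 1:  # constant difference: candidate found
                    return steps, diffs.pop()
    return None

# Hypothetical 16-bit samples: reverse, rotate left by 3, then add 1000.
samples = [0x1234, 0xBEEF, 0x0F0F, 0x8001, 0x7ACE, 0x0042]
targets = [rotate_left(reverse_bits(x, 16), 3, 16) + 1000 for x in samples]
found = find_transform(samples, targets, 16)
print(found)
```

The search returns the first recipe that explains all samples, which may be an equivalent recipe rather than the one used to build the data. With very few samples a spurious match is possible, so any hit should be validated on the full data set.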

I wrote a small proof of concept. Here is what I have done.
I generated ten random binary strings with 64 digits as card content examples produced by a reference reader.
0010110111011011100000010001100011111001010100111101110111000100
0000000110001111101110001011110100000100111101100100110010100000
1111000010100111011000111000100111111001000010100101011100011001
0010011011100011001000010111100010110001001000010101001110000000
1111000100101100010011101011010011100111000000001111110010101110
0101011101000101110111000010100110000111001010000010000001110111
0101110010011010011110011110111100111001110010100111101001111101
0101110100100101110000101000001000011100010100010010110000010001
0111101011011010111001011011110101011100111011010111100110100101
0000101001000110111101000100111011110000000011010110001110101011
Then I generated a random mapping table to simulate the different output of another reader for the same ten cards. It has the format i -> j meaning bit i from the reference content occurs as bit j on the other reader.
4 -> 0 4 -> 1 49 -> 2 32 -> 3 51 -> 4 52 -> 5 10 -> 6 47 -> 7
16 -> 8 32 -> 9 14 -> 10 24 -> 11 13 -> 12 1 -> 13 8 -> 14 47 -> 15
12 -> 16 56 -> 17 55 -> 18 22 -> 19 6 -> 20 33 -> 21 22 -> 22 45 -> 23
37 -> 24 39 -> 25 46 -> 26 47 -> 27 25 -> 28 15 -> 29 43 -> 30 13 -> 31
33 -> 32 31 -> 33 16 -> 34 49 -> 35 0 -> 36 30 -> 37 28 -> 38 31 -> 39
45 -> 40 28 -> 41 17 -> 42 18 -> 43 40 -> 44 18 -> 45 23 -> 46 54 -> 47
11 -> 48 54 -> 49 41 -> 50 39 -> 51 28 -> 52 31 -> 53 1 -> 54 34 -> 55
45 -> 56 4 -> 57 59 -> 58 11 -> 59 6 -> 60 26 -> 61 21 -> 62 0 -> 63
52 -> 64 1 -> 65 55 -> 66 46 -> 67 49 -> 68 23 -> 69 47 -> 70 45 -> 71
28 -> 72 23 -> 73 41 -> 74 41 -> 75 16 -> 76 4 -> 77 4 -> 78 18 -> 79
For example, bits 0 and 1 of the other reader's output equal bit 4 of the reference reader's output. The simulated output is 80 bits wide; some bits are duplicated and some others may be missing.
11111101111000111110010001110110101100100100001010111001010100001011111011111110
00100100101110101100000110100111011100111101110000101100100001001001100110111001
00111010011111100011011001100101110110110111011101011111001000010111110011000001
00111011011000110110100001011100000100100101011101011001000011000010111011000001
00111110010111001101011011000001100110000010000000010011000001111100100000000000
00010000110011000000100011000101011000110110000000011110001011100100000010001000
11101100001101101000000001101000010101110111111111111111011101001101110011110111
11000111100111010001001010010111001001000010000000100010011000001100001000111110
11101101101101111110110110010000111100111111111010101110110111101110111111111111
11110001111010010110110100011001101101101111010101001001110010100010101110001111
Now we want to find the mapping between both data sets. For this we just look at the correlation between the bits. That means for each combination of a bit index i (0 to 63) produced by the reference reader and a bit index j (0 to 79) produced by the other reader, we just count how many examples have matching bits at these positions.
             111111111122222222223333333333444444444455555555556666
   0123456789012345678901234567890123456789012345678901234567890123
0 3545.65456386363765535465634568436588433683666585575745656555647
1 3545.65456386363765535465634568436588433683666585575745656555647
2 4474534385457494458444376567873567875326554355654.48656783626534
3 64743565476336564544465525456333.7853368114353456646256765446574
4 669655438567727425624459656765356787734855435365684.656765446734
5 4656752763479454654464558367475525457744794735656666.72365644734
6 8676354543.33636254244956545235365655566334533456446476543266354
7 33638674585643657453154636365462565864334656463.5555545874333445
8 2434756747255656.54646334345655545255742576757474462652565642556
9 64743565476336564544465525456333.7853368114353456646256765446574
10 33636452963663.5549533285656964656786215665466763937547874535425
11 685853256165765447646475.365475725457744774555634666874343666536
12 6636334723613.36674666716343455565433764334555434462474343666376
13 6.5.554543655634494466758363455745457766554373434486654325686758
14 44745543.5477296438442396567853745677326776555854828656765444514
15 33638674585643657453154636365462565864334656463.5555545874333445
16 556564367438.363545575467478584636546655885646747757963476735843
17 42725365674574746364463545696733676535445565374768466547.3604552
18 5383647278564385547315483636744478786235445466585737347.74335445
19 9767443632944725363355.47434146456546675443644547355585434377465
20 445455.349453454656648352765655565453564538377274464216765644576
21 758562545656656356533766543656447.766455443466567757565874537665
22 9767443632944725363355.47434146456546675443644547355585434377465
23 534362745636656374775744565658663634465186766.565553545674735465
24 5747445834546725763577627276364634124.73667646345373761254753665
25 667637456565545625446457456563358585536.334351636648456547466754
26 6454552583477476436662576545655745657346774755.36626676547466534
27 33638674585643657453154636365462565864334656463.5555545874333445
28 5343667256564363347755463.54568454764255645466565555329656557465
29 445437476563365.634442554345613563455546356733654424474545262334
30 5343663854566547723553645436346434346653685.26767333783456353443
31 6636334723613.36674666716343455565433764334555434462474343666376
32 758562545656656356533766543656447.766455443466567757565874537665
33 5747445474366565567775467474764.34346655867486723555545436775647
34 2434756747255656.54646334345655545255742576757474462652565642556
35 4474534385457494458444376567873567875326554355654.48656783626534
36 .676334543855634254466956545255567635586534555638446476545468574
37 552586543456454356575564583236.434566453663666565373547436577467
38 2454556387255496658644174565.53765675326556375652846436765644536
39 5747445474366565567775467474764.34346655867486723555545436775647
40 534362745636656374775744565658663634465186766.565553545674735465
41 2454556387255496658644174565.53765675326556375652846436765644536
42 59496454345447435.5557647452566656566655443284343595545434777669
43 445453618545549445.644376765875745675324756377652846438763646336
44 5545645474388363547775467676586814346653.87668745555745456755645
45 445453618545549445.644376765875745675324756377652846438763646336
46 55856652965861853473334.5656744656788237665464765739547856355625
47 645455616565347425864457494365756587514653437565464623.745468356
48 55658654763.8163545555485656566636568455885666767557745658555845
49 645455616565347425864457494365756587514653437565464623.745468356
50 35458636743883657455534674565666143686338.5846765555963456553625
51 667637456565545625446457456563358585536.334351636648456547466754
52 2454556387255496658644174565.53765675326556375652846436765644536
53 5747445474366565567775467474764.34346655867486723555545436775647
54 6.5.554543655634494466758363455745457766554373434486654325686758
55 6474554365655474256444574745655387.75148332353656848458765448554
56 534362745636656374775744565658663634465186766.565553545674735465
57 3545.65456386363765535465634568436588433683666585575745656555647
58 68385745436536364746647565414377434575665545736342644563074.6558
59 55658654763.8163545555485656566636568455885666767557745658555845
60 445455.349453454656648352765655565453564538377274464216765644576
61 46563565654574544566863565.7673743433766758355434666634365844754
62 665653852745563267466.534565475567433784536377256484434565846796
63 .676334543855634254466956545255567635586534555638446476545468574
64 4656752763479454654464558367475525457744794735656666.72365644734
65 6.5.554543655634494466758363455745457766554373434486654325686758
66 5383647278564385547315483636744478786235445466585737347.74335445
67 6454552583477476436662576545655745657346774755.36626676547466534
68 4474534385457494458444376567873567875326554355654.48656783626534
69 55856652965861853473334.5656744656788237665464765739547856355625
70 33638674585643657453154636365462565864334656463.5555545874333445
71 534362745636656374775744565658663634465186766.565553545674735465
72 2454556387255496658644174565.53765675326556375652846436765644536
73 55856652965861853473334.5656744656788237665464765739547856355625
74 35458636743883657455534674565666143686338.5846765555963456553625
75 35458636743883657455534674565666143686338.5846765555963456553625
76 2434756747255656.54646334345655545255742576757474462652565642556
77 3545.65456386363765535465634568436588433683666585575745656555647
78 3545.65456386363765535465634568436588433683666585575745656555647
79 445453618545549445.644376765875745675324756377652846438763646336
Above are the results, with a dot representing ten matches. As you can see, this recovers the mapping for all bits except bits 13, 54, and 65, where two possible matches are found.
77 out of 80 bits with only ten samples is quite good. Admittedly this will not work as well if the bit patterns contain structure and are not just random bits, or if you have to take into account bits computed from several other bits. But if you have access to large enough sample sets you can uncover all possible mappings.
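The counting step can be sketched in Python as below. This is a minimal illustration, not the code used to produce the table above: the function names, the five 8-bit sample strings and the 10-bit mapping are all made up for the example.

```python
def match_counts(reference, other):
    # counts[j][i] = number of samples where bit i of the reference
    # output equals bit j of the other reader's output.
    n_ref, n_other = len(reference[0]), len(other[0])
    return [[sum(r[i] == o[j] for r, o in zip(reference, other))
             for i in range(n_ref)]
            for j in range(n_other)]

def candidate_sources(reference, other):
    # For each output bit j, list every reference bit i that matched in
    # all samples (a "dot" in the table).
    n_samples = len(reference)
    return [[i for i, c in enumerate(row) if c == n_samples]
            for row in match_counts(reference, other)]

# Hypothetical data: five 8-bit reference outputs ...
reference = ["10110010", "01101100", "11100001", "00011011", "01010111"]
# ... and a made-up mapping, where mapping[j] = i means that output
# bit j is a copy of reference bit i.
mapping = [4, 4, 0, 7, 2, 5, 1, 3, 6, 0]
other = ["".join(r[i] for i in mapping) for r in reference]

candidates = candidate_sources(reference, other)
for j, sources in enumerate(candidates):
    print(j, sources)
```

Each output bit lists its true source among the candidates; more samples shrink the lists toward a single entry, as the answer notes.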

Related

How to subset rows from one dataframe based on matching values from a second smaller data frame in R

I want to select a control group from one data frame based on matching the age from a second data frame. As an example I have subject.df
subject.df
id age
1 1 55
2 2 62
3 3 73
4 4 54
5 5 66
I'd like to subset control.df based off of matching the age directly on a 1 to 1 matching from the subject.df dataframe.
control.df
id age
6 6 66
7 7 71
8 8 80
9 9 51
10 10 55
11 11 56
12 12 77
13 13 62
14 14 64
15 15 73
16 16 67
17 17 54
18 18 75
19 19 77
20 20 78
21 21 53
22 22 64
23 23 83
24 24 61
25 25 77
I'm fairly new to R. In the past I've used Matlab and in this instance would use a for loop to iterate over the control.df dataframe, but I've been told that R doesn't always like for loops and that it can be computationally difficult in R.
In the end I'll be doing this on a much larger data set where the subject group is around 250 and the control group is more than 40K so I know that 1:1 matching is possible.
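The 1:1 matching logic itself can be sketched without an explicit nested loop over the control group. The snippet below is in Python rather than R and is only meant to show the idea: group the controls by age once, then let each subject take the first unused control of the same age. The function name and the tuple layout are assumptions; the ids and ages are the ones from the two data frames above.

```python
def match_controls(subjects, controls):
    # Group the control ids by age, preserving their original order.
    pool = {}
    for control_id, age in controls:
        pool.setdefault(age, []).append(control_id)

    # Each subject takes the first unused control with the same age.
    matches = {}
    for subject_id, age in subjects:
        available = pool.get(age)
        if available:
            matches[subject_id] = available.pop(0)
    return matches

subjects = [(1, 55), (2, 62), (3, 73), (4, 54), (5, 66)]
controls = [(6, 66), (7, 71), (8, 80), (9, 51), (10, 55), (11, 56),
            (12, 77), (13, 62), (14, 64), (15, 73), (16, 67), (17, 54),
            (18, 75), (19, 77), (20, 78), (21, 53), (22, 64), (23, 83),
            (24, 61), (25, 77)]

matches = match_controls(subjects, controls)
print(matches)  # {1: 10, 2: 13, 3: 15, 4: 17, 5: 6}
```

Because a matched control is removed from the pool, no control is used twice, and a subject without an available control of the same age is simply left out of the result.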

How does the "successive passes in opposite direction" improvement work for bubble sort?

According to Data Structures Using C by Tenenbaum, one of the improvements of bubble sort is to have successive passes go in opposite directions so that the small elements move quickly to the front, which reduces the required number of passes [pg 336].
I worked out two examples, one that supports this statement and one that contradicts it.
Supports: 25 48 37 12 57 86 33 92
iterations using usual Bubble sort :
25 48 37 12 57 86 33 92
25 37 12 48 57 33 86 92
25 12 37 48 33 57 86 92
12 25 37 33 48 57 86 92
12 25 33 37 48 57 86 92
iterations using improvement:
25 48 37 12 57 86 33 92
25 37 12 48 57 33 86 92
12 25 37 33 48 57 86 92
12 25 33 37 48 57 86 92
against: 3 4 1 2 5
iterations using usual Bubble sort:
3 4 1 2 5
3 1 2 4 5
1 2 3 4 5
iterations using improvement:
3 4 1 2 5
3 1 2 4 5
1 3 2 4 5
1 2 3 4 5
So is the statement that this improvement will always help incorrect? Or am I doing something wrong here?

The example you gave above shows that this algorithm isn't a strict improvement over a standard bubble sort.
The advantage of this approach (sometimes called "cocktail sort," by the way) is that in cases where there are a lot of small elements at the end of the array, it rapidly pulls them to the front compared against normal bubble sort. For example, consider this array:
2 3 4 5 6 7 8 9 10 11 12 ... 10,000,000 1
With a normal bubble sort, it would take 9,999,999 passes over this array to sort it because the element 1, which is way out of place, only gets swapped one step forward on each iteration. On the other hand, with a cocktail sort, this would take just two passes - one initial pass and then a reverse pass.
While the above example is definitely contrived, in a randomly-shuffled array, there are likely going to be some smaller elements toward the end of the array and the number of passes of bubblesort is going to have to be large to move them back. Going in both directions helps speed this up.
That said, bubblesort is a pretty poor choice of a sorting algorithm, so hopefully this is just a theoretical discussion. :-)
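To compare the two variants on the examples in this thread, here is a small Python sketch that counts passes. It assumes a "pass" means a sweep that performed at least one swap, which is how the iterations are listed in the question; the function names are made up for the example.

```python
def bubble_passes(items):
    # Plain bubble sort: every pass runs left to right.
    a = list(items)
    passes = 0
    while True:
        swapped = False
        for i in range(len(a) - 1):
            if a[i] > a[i + 1]:
                a[i], a[i + 1] = a[i + 1], a[i]
                swapped = True
        if not swapped:
            return a, passes
        passes += 1

def cocktail_passes(items):
    # Cocktail sort: passes alternate between left-to-right and
    # right-to-left.
    a = list(items)
    passes = 0
    forward = True
    while True:
        swapped = False
        order = range(len(a) - 1) if forward else range(len(a) - 2, -1, -1)
        for i in order:
            if a[i] > a[i + 1]:
                a[i], a[i + 1] = a[i + 1], a[i]
                swapped = True
        if not swapped:
            return a, passes
        passes += 1
        forward = not forward

for data in ([25, 48, 37, 12, 57, 86, 33, 92],
             [3, 4, 1, 2, 5],
             list(range(2, 11)) + [1]):
    print(data, bubble_passes(data)[1], cocktail_passes(data)[1])
```

The first list needs 4 passes with plain bubble sort and 3 with alternating passes, the second needs 2 and 3, and the scaled-down "1 at the end" list needs 9 and 2.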

Selecting the "P" in Prune and Search Algorithm

Note: the diagram above shows a partition into groups of 5 (the columns). The horizontal box denotes the median values of each partition. The 'P' item indicates the median of medians.
Most of the papers I have seen use this picture when selecting their "P", and it always has an odd number of elements. But what if the number of elements you have is even?
Example:
1 2 3 4 5 6 7 8 9 10
11 12 13 14 15 16 17 18 19 20
21 22 23 24 25 26 27 28 29 30
31 32 33 34 35 36 37 38 39 40
41 42 43 44 45 46 47 48 49 50
51 52 53 54 55 56 57 58 59 60
How do you get your "P" in an even set of elements?

This explanation gives the detail I think you're looking for:
https://www.cs.duke.edu/courses/summer10/cps130/files/Edelsbrunner_Median.pdf
The median of the set plays a special role in this algorithm, and it
is defined as the i-smallest item where i = (n+1)/2 if n is odd and i =
n/2 or (n+2)/2 if n is even.
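As a rough illustration of that definition, the Python sketch below picks "P" for the 60 numbers in the question. It is simplified on purpose: it sorts the list of group medians instead of recursing, the function names are hypothetical, and the `upper` flag is an assumed way to choose between the two allowed ranks when the count is even.

```python
def median_rank(n, upper=False):
    # 1-based rank of the median: (n+1)/2 for odd n, and either n/2
    # (lower) or (n+2)/2 (upper) for even n.
    if n % 2 == 1:
        return (n + 1) // 2
    return (n + 2) // 2 if upper else n // 2

def pick_p(values, group_size=5, upper=False):
    # Split into groups, take each group's median, then take the
    # median of those medians.
    groups = [sorted(values[i:i + group_size])
              for i in range(0, len(values), group_size)]
    medians = sorted(g[median_rank(len(g), upper) - 1] for g in groups)
    return medians[median_rank(len(medians), upper) - 1]

values = list(range(1, 61))
print(pick_p(values))              # 28 (lower median of the 12 medians)
print(pick_p(values, upper=True))  # 33 (upper median of the 12 medians)
```

Either choice is a valid "P"; what matters is that the same rule is applied consistently.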

vectorized indexing of matrices with other matrices (in octave)

Suppose we have a 2D (5x5) matrix:
test =
39 13 90 5 71
60 78 38 4 11
87 92 46 45 35
40 96 61 17 1
90 50 46 89 63
And a second 2D (5x2) matrix:
tidx =
1 3
2 4
2 3
2 4
4 5
And now we want to use tidx as an index into test, so that we get the following output:
out =
39 90
78 4
92 46
96 17
89 63
One way to do this is with a for loop...
for i=1:size(test,1)
out(i,:) = test(i,tidx(i,:));
end
Question:
Is there a way to vectorize this so the same output is generated without a for loop?

Here is one way:
test(repmat([1:rows(test)]',1,columns(tidx)) + (tidx-1)*rows(test))
What you describe is an index problem. When you place a matrix all in one dimension, you get
test(:) =
39
60
87
40
90
13
78
92
96
50
90
38
46
61
46
5
4
45
17
89
71
11
35
1
63
This can be indexed using a single number. Here is how you figure out how to transform tidx into the correct format.
First, I use the above reference to figure out the index numbers which are:
outinx =
1 11
7 17
8 13
9 19
20 25
Then I start trying to figure out the pattern. This calculation gives a clue:
(tidx-1)*rows(test) =
0 10
5 15
5 10
5 15
15 20
This will move the index count to the correct column of test. Now I just need the correct row.
outinx-(tidx-1)*rows(test) =
1 1
2 2
3 3
4 4
5 5
This pattern is created by the for loop. I created that matrix with:
[1:rows(test)]' * ones(1,columns(tidx))
*EDIT: This does the same thing with a built in function.
repmat([1:rows(test)]',1,columns(tidx))
I then add the 2 together and use them as the index for test.
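The index arithmetic can be checked outside Octave as well. The following pure-Python sketch mimics Octave's column-major storage and 1-based indexing with plain lists; the helper names are made up for the example, and the data is the test and tidx matrices from the question.

```python
test = [[39, 13, 90, 5, 71],
        [60, 78, 38, 4, 11],
        [87, 92, 46, 45, 35],
        [40, 96, 61, 17, 1],
        [90, 50, 46, 89, 63]]
tidx = [[1, 3], [2, 4], [2, 3], [2, 4], [4, 5]]

def column_major(matrix):
    # Flatten column by column, like test(:) in Octave.
    return [matrix[r][c]
            for c in range(len(matrix[0]))
            for r in range(len(matrix))]

def linear_indices(col_idx, nrows):
    # 1-based linear index = row + (column - 1) * number_of_rows.
    return [[(r + 1) + (col - 1) * nrows for col in row]
            for r, row in enumerate(col_idx)]

flat = column_major(test)
outinx = linear_indices(tidx, len(test))
out = [[flat[k - 1] for k in row] for row in outinx]

print(outinx)  # [[1, 11], [7, 17], [8, 13], [9, 19], [20, 25]]
print(out)     # [[39, 90], [78, 4], [92, 46], [96, 17], [89, 63]]
```

The row term corresponds to the repmat(...) matrix and the column term to (tidx-1)*rows(test).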

Finding a set of permutations, with a constraint

I have a set of N^2 numbers and N bins. Each bin is supposed to have N numbers from the set assigned to it. The problem I am facing is finding a set of distributions that map the numbers to the bins, satisfying the constraint, that each pair of numbers can share the same bin only once.
A distribution can nicely be represented by an NxN matrix, in which each row represents a bin. Then the problem is finding a set of permutations of the matrix' elements, in which each pair of numbers shares the same row only once. It's irrelevant which row it is, only that two numbers were both assigned to the same one.
Example set of 3 permutations satisfying the constraint for N=8:
0 1 2 3 4 5 6 7
8 9 10 11 12 13 14 15
16 17 18 19 20 21 22 23
24 25 26 27 28 29 30 31
32 33 34 35 36 37 38 39
40 41 42 43 44 45 46 47
48 49 50 51 52 53 54 55
56 57 58 59 60 61 62 63
0 8 16 24 32 40 48 56
1 9 17 25 33 41 49 57
2 10 18 26 34 42 50 58
3 11 19 27 35 43 51 59
4 12 20 28 36 44 52 60
5 13 21 29 37 45 53 61
6 14 22 30 38 46 54 62
7 15 23 31 39 47 55 63
0 9 18 27 36 45 54 63
1 10 19 28 37 46 55 56
2 11 20 29 38 47 48 57
3 12 21 30 39 40 49 58
4 13 22 31 32 41 50 59
5 14 23 24 33 42 51 60
6 15 16 25 34 43 52 61
7 8 17 26 35 44 53 62
A permutation that doesn't belong in the above set:
0 10 20 30 32 42 52 62
1 11 21 31 33 43 53 63
2 12 22 24 34 44 54 56
3 13 23 25 35 45 55 57
4 14 16 26 36 46 48 58
5 15 17 27 37 47 49 59
6 8 18 28 38 40 50 60
7 9 19 29 39 41 51 61
It doesn't belong because of multiple collisions with the second permutation; for example, both pair the numbers 0 and 32 in one row.
Enumerating three is easy: take one arbitrary permutation, its transposition, and a matrix whose rows are made of the previous matrix's diagonals.
I can't find a way to produce a set consisting of more though. It seems to be either a very complex problem, or a simple problem with an unobvious solution. Either way I'd be thankful if somebody had any ideas how to solve it in reasonable time for the N=8 case, or identified the proper, academic name of the problem, so I could google for it.
In case you were wondering what is it useful for, I'm looking for a scheduling algorithm for a crossbar switch with 8 buffers, which serves traffic to 64 destinations. This part of the scheduling algorithm is input traffic agnostic, and switches cyclically between a number of hardwired destination-buffer mappings. The goal is to have each pair of destination addresses compete for the same buffer only once in the cycling period, and to maximize that period's length. In other words, so that each pair of addresses was competing for the same buffer as seldom as possible.
EDIT:
Here's some code I have.
CODE
It's greedy, it usually terminates after finding the third permutation. But there should exist a set of at least N permutations satisfying the problem.
The alternative would be that choosing permutation I involves looking ahead at permutations (I+1..N), to check whether permutation I is part of the solution with the maximal number of permutations. That would require enumerating all permutations at each step, which is prohibitively expensive.

What you want is a combinatorial block design. Using the nomenclature on the linked page, you want designs of size (n^2, n, 1). Such a design has n(n+1) blocks, which are rows in your nomenclature, so it gives you n+1 permutations. This is the maximum theoretically possible by a counting argument (see the explanation in the article for the derivation of b from v, k, and lambda). Such designs exist for n = p^k for some prime p and integer k, using an affine plane. It is conjectured that the only affine planes that exist are of this size. Therefore, if you can select n, maybe this answer will suffice.
However, if instead of the maximum theoretically possible number of permutations, you just want to find a large number (the most you can for a given n^2), I am not sure what the study of these objects is called.
Make a 64 x 64 x 8 array: bool forbidden[i][j][k] which indicates whether the pair (i,j) has appeared in row k. Each time you use the pair (i, j) in the row k, you will set the associated value in this array to one. Note that you will only use the half of this array for which i < j.
To construct a new permutation, start by trying the member 0, and verify that at least seven of the forbidden[0][j][0] entries are unset. If there are not seven left, increment and try again. Repeat to fill out the rest of the row. Repeat this whole process to fill the entire NxN permutation.
There are probably optimizations you should be able to come up with as you implement this, but this should do pretty well.
Possibly you could reformulate your problem into graph theory. For example, you start with the complete graph with N×N vertices. At each step, you partition the graph into N N-cliques, and then remove all edges used.
For this N=8 case, K64 has 64×63/2 = 2016 edges, and sixty-four lots of K8 have 1792 edges, so your problem may not be impossible :-)
Right, the greedy style doesn't work because you run out of numbers.
It's easy to see that there can't be more than 63 permutations before you violate the constraint. On the 64th, you'll have to pair at least one of the numbers with another it has already been paired with. The pigeonhole principle.
In fact, if you use the table of forbidden pairs I suggested earlier, you find that there are a maximum of only N+1 = 9 permutations possible before you run out. The table has N^2 x (N^2-1)/2 = 2016 non-redundant constraints, and each new permutation will create N x (N choose 2) = 8 x 28 = 224 new pairings. So all the pairings will be used up after 2016/224 = 9 permutations. It seems like realizing that there are so few permutations is the key to solving the problem.
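Written out for general N, the counting argument is a short calculation: the total number of pairs divided by the number of pairs one permutation uses up. The block below only restates the argument in LaTeX and plugs in N = 8.

```latex
\[
\frac{\binom{N^2}{2}}{N\binom{N}{2}}
= \frac{N^2\,(N^2-1)/2}{N \cdot N\,(N-1)/2}
= \frac{N^2-1}{N-1}
= N+1,
\qquad
N = 8:\quad \frac{2016}{224} = 9 .
\]
```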
You can generate a list of N permutations numbered n = 0 ... N-1 as
A_ij = (i * N + j + j * n * N) mod N^2
which generates a new permutation by shifting the columns in each permutation. The top row of the nth permutation are the diagonals of the n-1th permutation. EDIT: Oops... this only appears to work when N is prime.
This misses one last permutation, which you can get by transposing the matrix:
A_ij = j * N + i
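The two formulas can be tried out with a short Python sketch that builds the N shifted permutations plus the transposed one and counts how many pairs end up sharing a row more than once. The function names are made up for the example; the check confirms the remark in the EDIT that the construction works for prime N and breaks for N = 8.

```python
from itertools import combinations

def shifted_permutation(N, n):
    # A_ij = (i * N + j + j * n * N) mod N^2
    return [[(i * N + j + j * n * N) % (N * N) for j in range(N)]
            for i in range(N)]

def transposed_permutation(N):
    # A_ij = j * N + i
    return [[j * N + i for j in range(N)] for i in range(N)]

def build(N):
    # N shifted permutations (n = 0 .. N-1) plus the transposed one.
    perms = [shifted_permutation(N, n) for n in range(N)]
    perms.append(transposed_permutation(N))
    return perms

def count_repeated_pairs(perms):
    # Count how often a pair of numbers shares a row again after its
    # first appearance.
    seen = set()
    repeats = 0
    for perm in perms:
        for row in perm:
            for pair in combinations(sorted(row), 2):
                if pair in seen:
                    repeats += 1
                seen.add(pair)
    return repeats

for N in (5, 7, 8):
    print(N, len(build(N)), count_repeated_pairs(build(N)))
```

For N = 5 and N = 7 no pair repeats, so all N+1 permutations satisfy the constraint. For N = 8 the count is greater than zero, so a different construction (for example one based on the finite field with 8 elements) would be needed there.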
