Minutiae-based fingerprint matching algorithm

The problem
I need to match two fingerprints and give a score of resemblance.
I have posted a similar question before, but I think I've made enough progress to warrant a new question.
The input
For each image, I have a list of minutiae (important points). I want to match the fingerprints by matching these two lists.
When represented graphically, they look like this:
A minutia consists of a triplet (i, j, theta) where:
i is the row in a matrix
j is the column in a matrix
theta is a direction. I don't use that parameter yet in my matching algorithm.
What I have done so far
For each list, find the "dense regions" or "clusters". Some areas have more points than others, and I have written an algorithm to find them. I can explain further if you want.
Shifting the second list in order to account for the difference in finger position between both images. I neglect differences in finger rotation. The shift is done by aligning the barycenters of the centers of the clusters. (It is more reliable than the barycenter of all minutiae)
I tried building a matrix for each list (post-shift) so that every minutia increments its corresponding element and its close neighbours, like below.
1 1 1 1 1 1 1
1 2 2 2 2 2 1
1 2 3 3 3 2 1
1 2 3 4 3 2 1
1 2 3 3 3 2 1
1 2 2 2 2 2 1
1 1 1 1 1 1 1
By subtracting the two matrices and adding up the absolute values of all elements in the resulting matrix, I hoped to get low numbers for close fingerprints.
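For illustration, here is a minimal Python sketch of this matrix idea (the grid size and the (i, j, theta) minutiae lists are assumed inputs; the weighting reproduces the 7x7 pattern above):

import numpy as np

def minutiae_matrix(minutiae, rows, cols, radius=3):
    """Increment each minutia's cell and its close neighbours, with weights
    that decay with distance from the centre (4 at the centre, 1 at the edge
    of a 7x7 window, as in the pattern above)."""
    m = np.zeros((rows, cols))
    for i, j, _theta in minutiae:
        for di in range(-radius, radius + 1):
            for dj in range(-radius, radius + 1):
                r, c = i + di, j + dj
                if 0 <= r < rows and 0 <= c < cols:
                    m[r, c] += radius + 1 - max(abs(di), abs(dj))
    return m

def difference_score(minutiae_a, minutiae_b, rows, cols):
    a = minutiae_matrix(minutiae_a, rows, cols)
    b = minutiae_matrix(minutiae_b, rows, cols)
    return np.abs(a - b).sum()   # lower should mean closer fingerprints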
Results
I tested a few fingerprints and found that the number of clusters is very stable. Matching fingerprints very often have the same number of clusters, and different fingers give different numbers. So that will definitely be a factor in the overall resemblance score.
The sum of the differences didn't work at all, however: there was no correlation between resemblance and the sum.
Thoughts
I may need to use the directions of the points, but I don't know how yet.
I could use the standard deviation of the points, or of the clusters.
I could repeat the process for different types of minutiae. Right now my algorithm detects ridge endings and ridge bifurcations but maybe I should process these separately.
Question: How can I improve my algorithm?
Edit
I've come a long way since posting this question, so here's my update.
I dropped the bifurcations altogether, because my thinning algorithm messes those up too often. I did however end up using the angles quite a lot.
My initial cluster-counting idea does hold up pretty well on the small scale tests I ran (different combinations of my fingers and those of a handful of volunteers).
I give a score based on the following tests (10 tests, so 10% per success. It's a bit naïve but I'll find a better way to turn these 10 results into a score, as each test has its specificities):
Cluster-thingy (all the following don't use clusters, but minutiae. This is the only cluster-related approach I took)
Mean i position
Mean angle
i variance
j variance
Angle variance
i kurtosis
j kurtosis
Angle kurtosis
j skewness
A statistical approach indeed.
Same-finger comparisons pretty much always score between 80 and 100%. Different-finger comparisons score between 0 and 60% (rarely as high as 60%). I don't have exact numbers here, so I won't pretend this is a statistically significant success, but it seems like a good first shot.
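As a sketch of how those per-fingerprint statistics could be computed (using scipy for kurtosis and skewness; the cluster test is omitted here, and the per-test tolerances are placeholders that would have to be tuned):

import numpy as np
from scipy.stats import kurtosis, skew

def feature_vector(minutiae):
    """minutiae: list of (i, j, theta) triplets for one fingerprint."""
    i = np.array([m[0] for m in minutiae], dtype=float)
    j = np.array([m[1] for m in minutiae], dtype=float)
    theta = np.array([m[2] for m in minutiae], dtype=float)
    # note: a circular mean/variance would be more appropriate for angles
    return {
        "mean_i": i.mean(), "mean_theta": theta.mean(),
        "var_i": i.var(), "var_j": j.var(), "var_theta": theta.var(),
        "kurt_i": kurtosis(i), "kurt_j": kurtosis(j), "kurt_theta": kurtosis(theta),
        "skew_j": skew(j),
    }

def score(f1, f2, tolerances):
    """One success per statistic whose values agree within its tolerance."""
    hits = sum(1 for k, tol in tolerances.items() if abs(f1[k] - f2[k]) <= tol)
    return 100.0 * hits / len(tolerances)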

Your clustering approach is interesting, but one thing I'm curious about is how well you've tested it. For a new matching algorithm to be useful with respect to all the research and methods that already exist, you need to have a reasonably low EER (equal error rate). Have you tested your method on any of the standard databases? I have doubts about the ability of cluster counts and locations alone to identify individuals at larger scales.
1) Fingerprint matching is a well studied problem and there are many good papers that can help you implement this. For a nice place to start, check out this paper, "Fingerprint Minutiae Matching Based on the Local and Global Structures" by Jiang & Yau. It's a classic paper, a short read (only 4 pages), and can be implemented fairly reasonably. They also define a scoring metric that can be used to quantify the degree to which two fingerprint images match. Again, this should only be a starting point because these days there are many algorithms that perform better.
2) If you want your algorithm to be robust, it should consider transformations of the fingerprint between images. Scanned fingerprints and certainly latent prints may not be consistent from image to image.
Also, calculating the direction of the minutiae points provides a method for handling fingerprint rotations. By measuring the angles between minutiae point directions, which will remain the same or close to the same across multiple images regardless of global rotation (though small inconsistencies may occur because skin is not rigid and may stretch slightly), you can find the best set of corresponding minutia pairs or triplets and use them as the basis for rotational alignment.
3) I recommend that you distinguish between ridge line endings and bifurcations. The more features you can isolate, the more accurately you can determine whether or not the fingerprints match. You might also consider the number of ridge lines that occur between each pair of minutiae.
The features used by Jiang & Yau (illustrated in the figure in their paper) are:
d: Euclidean distance between minutiae
θ: Angle measure between minutiae directions
φ: Global minutiae angle
n: Number of ridge lines between minutiae i and j
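The ridge count n needs the ridge image itself, but the geometric features can be sketched roughly like this (the exact conventions for θ and φ should be checked against the paper; minutiae are taken as (x, y, theta) here and the helper names are mine):

import math

def angle_diff(a, b):
    """Smallest signed difference between two angles, in radians."""
    return math.atan2(math.sin(a - b), math.cos(a - b))

def pairwise_features(m_i, m_j):
    (xi, yi, ti), (xj, yj, tj) = m_i, m_j
    d = math.hypot(xj - xi, yj - yi)                      # Euclidean distance
    theta = angle_diff(ti, tj)                            # angle between the two directions
    phi = angle_diff(ti, math.atan2(yj - yi, xj - xi))    # direction of i vs. the connecting line
    return d, theta, phi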
If you haven't read the Handbook of Fingerprint Recognition, I recommend it.

Related

How to shuffle eight items to approximate maximum entropy?

I need to analyze 8 chemical samples repeatedly over 5 days (each sample is analyzed exactly once every day). I'd like to generate pseudo-random sample sequences for each day which achieve the following:
avoid bias in the daily sequence position (e.g., avoid some samples being processed mostly in the morning)
avoid repeating sample pairs over different days (e.g. 12345678 on day 1 and 87654321 on day 2)
generally randomize the distance between two given samples from one day to the other
I may have poorly phrased the conditions above, but the general idea is to minimize systematic effects like sample cross-contamination and/or analytical drift over each day. I could just shuffle each sequence randomly, but because the number of sequences generated is small (N=5 versus 40,320 possible combinations), I'm unlikely to approach something like maximum entropy.
Any ideas? I suspect this is a common problem in analytical science which has been solved, but I don't know where to look.
Just thinking out loud:
The base metric you could use is the Levenshtein distance, or some slight modification of it, maybe
myDist(w1, w2) = min(levD(w1, w2), levD(w1.reversed(), w2))
Since you want to avoid small distances between any pair of days, the overall metric can be the sum over all pairs of days:
Similarity = myDist(day1, day2)
+ myDist(day1, day3)
+ myDist(day1, day4)
+ myDist(day1, day5)
+ myDist(day2, day3)
+ myDist(day2, day4)
+ myDist(day2, day5)
+ myDist(day3, day4)
+ myDist(day3, day5)
+ myDist(day4, day5)
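A small Python sketch of that metric (the myDist/levD names above become my_dist/lev_dist here; pure Python, no external library):

from itertools import combinations

def lev_dist(a, b):
    """Plain Levenshtein distance between two sequences."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def my_dist(w1, w2):
    return min(lev_dist(w1, w2), lev_dist(w1[::-1], w2))

def similarity(days):
    """Sum of my_dist over all pairs of days; larger means the schedules differ more."""
    return sum(my_dist(a, b) for a, b in combinations(days, 2))

# e.g. similarity(["12345678", "87654321", "15263748", "41827365", "56781234"])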
What is still missing is a heuristic for creating the sample orders.
Your problem reminds me of a shortest-path problem, but with the added difficulty that each selected node influences the weights of the whole graph, so it is much harder.
Maybe a table of all myDist distances between each pair of the 8! permutations can be precomputed (the distance is symmetric, so only a triangular matrix without the diagonal is needed, requiring roughly 1 GB of memory). This may speed things up a lot.
Maybe take the maximum from this matrix and treat every pair with a value below some threshold as equally worthless, to reduce the search space.
Build a starting set:
Use 12345678 as the fixed day 1, since the order of the first day does not matter. Never change it.
Then repeat until n days are chosen:
add the permutation most distant from the current one;
if there are multiple equally distant candidates, use the one that is also most distant from the previous days.
Now iteratively improve the solution, maybe with some ruin-and-recreate approach. Always keep a backup of the best solution found so far; you can then run as many iterations as you want (and have time for):
choose the (one or two) day(s) with the smallest distance sums to the other days;
maybe brute-force an optimal (in terms of overall distance) permutation for these days;
repeat.
If the optimization gets stuck (only the same two days keep being chosen, or the distance stops shrinking), randomly reset one or two days to random orders.
Alternatively, totally random starting sets (apart from day 1) could be used.
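A rough sketch of the greedy construction, reusing my_dist from the sketch above (it brute-forces all 8! candidates at each step, which is slow in pure Python; sampling a few thousand permutations instead would be a pragmatic shortcut):

from itertools import permutations

def greedy_start(n_days=5, items="12345678"):
    candidates = ["".join(p) for p in permutations(items)]   # 8! = 40320 orders
    days = [items]                                            # day 1 is fixed
    while len(days) < n_days:
        # most distant from the current (last) day; ties broken by the
        # summed distance to all previous days
        best = max(candidates,
                   key=lambda o: (my_dist(days[-1], o),
                                  sum(my_dist(d, o) for d in days)))
        days.append(best)
    return days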

A good randomizer for puzzle-15

I have implemented a 15-puzzle for people to compete online. My current randomizer works by starting from the solved configuration and moving tiles around for 100 moves (an arbitrary number).
Everything is fine; however, once in a while the tiles end up shuffled into a configuration that is too easy and takes only a few moves to solve, so the game is unfair: some people get a much easier board and reach better scores much faster.
What would be a good way to randomize the initial configuration so it is not "too easy"?
You can generate a completely random configuration (that is solvable) and then use some solver to determine the optimal sequence of moves. If the sequence is long enough for you, good, otherwise generate a new configuration and repeat.
Update & details
There is an article on Wikipedia about the 15-puzzle and when it is (and isn't) solvable. In short, if the empty square is in the lower-right corner, then the puzzle is solvable if and only if the number of inversions (an inversion is a swap of two elements in the sequence, not necessarily adjacent elements) with respect to the goal permutation is even.
You can then easily generate a solvable start state by doing an even number of inversions, which may lead to a not-so-easy-to-solve state far quicker than by doing regular moves, and it is guaranteed that it will remain solvable.
In fact, you don't even need a full search algorithm as mentioned above; an admissible heuristic is enough. Such a heuristic never overestimates the number of moves needed to solve the puzzle, i.e. you are guaranteed that the puzzle will not take fewer moves than the heuristic tells you.
A good heuristic is the sum of manhattan distances of each number to its goal position.
Summary
In short, a possible (very simple) algorithm for generating starting positions might look like this:
1: current_state <- goal_state
2: swap two arbitrary (randomly selected) pieces
3: swap two arbitrary (randomly selected) pieces again (to ensure solvability)
4: h <- heuristic(current_state)
5: if h > desired threshold
6:     return current_state
7: else
8:     go to 2
To be absolutely certain about how difficult a state is, you need to find the optimal solution using some solver. Heuristics will give you only an estimate.
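A minimal Python sketch of the summary above (the threshold is an arbitrary placeholder; for a 15-puzzle the Manhattan sum of a random position is usually in the 30s):

import random

def manhattan(state, n=4):
    """Sum of Manhattan distances of each tile from its goal cell (0 = gap)."""
    dist = 0
    for idx, tile in enumerate(state):
        if tile == 0:
            continue
        goal = tile - 1
        dist += abs(idx // n - goal // n) + abs(idx % n - goal % n)
    return dist

def random_start(threshold=30, n=4):
    """Goal state plus repeated pairs of random tile swaps (the gap stays in the
    lower-right corner, so an even number of swaps keeps the state solvable)."""
    state = list(range(1, n * n)) + [0]
    while manhattan(state, n) <= threshold:
        for _ in range(2):                               # two swaps = even permutation
            a, b = random.sample(range(n * n - 1), 2)    # never touch the gap cell
            state[a], state[b] = state[b], state[a]
    return state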
I would do this:
1. start from the solved configuration (just like you did)
2. make a valid move in a random direction
So you must keep track of where the gap is, generate a random direction (N, E, S, W) and do the move. I think you have done this part too.
3. compute the randomness of your placement
Compute some coefficient that depends on the order of the array, so that ordered (solved) configurations have low values and random ones have high values. The exact equation for the coefficient, however, is a matter of trial and error. Here are some ideas for what to use:
correlation coefficient
sum of average difference of value and its neighbors
1 2 4
3 6 5
9 8 7
coeff(6)= (|6-3|+|6-5|+|6-2|+|6-8|)/4
coeff=coeff(1)+coeff(2)+...coeff(15)
absolute distance from the ordered array
You can combine several approaches together. You can also split this into separate rows and columns and then combine the sub-coefficients.
4. loop step 2 until the coefficient from step 3 is high enough (threshold)
The threshold can also be used to change the difficulty.
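A sketch of the neighbour-difference coefficient from the example above (whether to include the gap, and how to weight it, is left open):

def neighbour_coeff(board):
    """board: n x n list of lists of tile values. For every cell, average the
    absolute differences to its 4-neighbours, then sum over all cells."""
    n = len(board)
    total = 0.0
    for r in range(n):
        for c in range(n):
            diffs = [abs(board[r][c] - board[r + dr][c + dc])
                     for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1))
                     if 0 <= r + dr < n and 0 <= c + dc < n]
            total += sum(diffs) / len(diffs)
    return total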

Snake cube puzzle correctness

TrialPay posted a programming question about a snake cube puzzle on their blog.
Recently, one of our engineers introduced us to the snake cube. A snake cube is a puzzle composed of a chain of cubelets, connected by an elastic band running through each cubelet. Each cubelet can rotate 360° about the elastic band, allowing various structures to be built depending upon the way in which the chain is initially constructed, with the ultimate goal of arranging the cubes in such a way to create a cube.
Example:
This particular arrangement contains 17 groups of cubelets, composed of 8 groups of two cubelets and 9 groups of three cubelets. This arrangement can be expressed in a variety of ways, but for the purposes of this exercise, let '0' denote pieces whose rotation does not change the orientation of the puzzle, or may be considered a "straight" piece, while '1' will denote pieces whose rotation changes the puzzle configuration, or "bend" the snake. Using that schema, the snake puzzle above could be described as 001110110111010111101010100.
Challenge:
Your challenge is to write a program, in any language of your choosing, that takes the cube dimensions (X, Y, Z) and a binary string as input, and outputs '1' (without quotes) if it is possible to solve the puzzle, i.e. construct a proper XYZ cube given the cubelet orientation, and '0' if the current arrangement cannot be solved.
I posted a semi-detailed explanation of the solution, but how do I determine if the program solves the problem? I thought about getting more test cases, but I ran into some problems:
The snake cube example from TrialPay's blog has the same combination as the picture on Wikipedia's Snake Cube page and www.mathematische-basteleien.de.
It's very tedious to manually convert an image into a string.
I tried to make a program that would churn out a lot of combinations:
# We should start at the binary representation of 16777216 (00100...), because
# lesser numbers have more than 2 consecutive leading 0s (000111...)
i = 16777216
solved = []
while i < 2**27:
    s = bin(i)[2:]
    # Left-pad with 0s to 27 characters
    if len(s) < 27:
        s = '0' * (27 - len(s)) + s
    print(s)
    # Skip strings that contain more than 2 consecutive 0s
    if s.find("000") == -1:
        if snake_cube_solution(3, 3, 3, s) == 1:
            solved.append(s)
    i += 1
But it just takes forever to finish executing. Is there a better way to verify the program?
Thanks in advance!
TL;DR: This isn't a programming problem, but a mathematical one. You may be better served at math.stackexchange.com.
Since the cube size and snake length are passed as input, the space of inputs a checker program would need to verify is essentially infinite. Even though checking the solution's answer for a single input is reasonable, brute-forcing this check across the entire input space is clearly not.
If your solution fails on certain cases, your checker program can help you find these. However it can't establish your program's correctness: if your solution is actually correct the checker will simply run forever and leave you wondering.
Unfortunately (or not, depending on your tastes), what you are looking for is not a program but a mathematical proof.
Proving algorithm correctness is itself an entire field of study, and you can spend a long time in it. That said, proof by induction is often applicable (especially for recursive algorithms).
Other times, navigating between state configurations can be restated as optimizing a utility function. Proving things about the space being optimized (such as that it has only one extremum) can then translate into a proof of program correctness.
Your state configurations in this second approach could be snake orientations, or they might be some deeper structure. For example, the general strategy for solving a Rubik's cube isn't usually stated in terms of literal cube states, but in terms of a group of relevant symmetries. This is how I personally expect your solution will eventually play out.
EDIT: Years later, I feel I should point out that for a given, fixed cube size and snake length, of course the search space is actually finite. You can write a program to brute-force check all combinations. If you were clever, you could even argue that the times to check a set of cases can be treated as a set of independent random variables. From this you could build a reasonable progress bar to estimate how (very) long your wait would be.
I think your assertion that there can not be three consecutive 0's is false. Consider this arrangement:
000
100
101
100
100
101
100
100
100
One of the problems I'm having with this puzzle is the notation. A 1 indicates that the cubelet can change the puzzle's orientation, but about which axis? In my example above, assume that the Y axis is vertical and the X axis is horizontal. A 1 on the left indicates the ability to rotate about the cubelet's Y axis, and a 1 on the right indicates the ability to rotate about the cubelet's X axis.
I think it's possible to construct an arrangement similar to that above, but with three 000 groups. But I don't have the notation for it. Clearly, the example above could be modified so that the first three lines are:
001
000
101
With the first segment's 1 indicating rotation about the Y axis.
I wrote a Java application for the same problem not long ago.
I used the backtracking algorithm for this.
You just have to do a recursive search through the whole cube, checking which directions are possible. If you have found one, you can stop and print the solution (I chose to print out all solutions).
For the 3x3x3 cubes my program solved them in under a second, for the bigger ones it takes about five seconds up to 15 minutes.
I'm sorry I couldn't find any code right now.
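A rough Python sketch of the same backtracking idea looks like this (it assumes s[i] says whether the chain goes straight ('0') or bends ('1') at cubelet i, which may not match your exact notation; the names are mine, not the original Java):

def can_fold_into_cube(X, Y, Z, s):
    """Return 1 if the snake described by s can fold into an X*Y*Z cube, else 0."""
    n = X * Y * Z
    if len(s) != n:
        return 0
    dirs = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]
    occupied = set()

    def dfs(i, pos, prev_dir):
        # cubelet i has just been placed at pos, reached via prev_dir
        if i == n - 1:
            return True
        for d in dirs:
            if prev_dir is not None:
                parallel = d == prev_dir or d == tuple(-c for c in prev_dir)
                if s[i] == '0' and d != prev_dir:
                    continue                 # straight piece: keep the same direction
                if s[i] == '1' and parallel:
                    continue                 # bend piece: must turn 90 degrees
            nxt = (pos[0] + d[0], pos[1] + d[1], pos[2] + d[2])
            if not (0 <= nxt[0] < X and 0 <= nxt[1] < Y and 0 <= nxt[2] < Z):
                continue
            if nxt in occupied:
                continue
            occupied.add(nxt)
            if dfs(i + 1, nxt, d):
                return True
            occupied.remove(nxt)
        return False

    for x in range(X):                       # try every start cell; symmetry could
        for y in range(Y):                   # prune most of these
            for z in range(Z):
                occupied.clear()
                occupied.add((x, y, z))
                if dfs(0, (x, y, z), None):
                    return 1
    return 0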

Challenge: Take a 48x48 image, find contiguous areas that result in the cheapest Lego solution to create that image! [closed]

Background
Lego produces the X-Large Gray Baseplate, which is a large building plate that is 48 studs wide and 48 studs tall, resulting in a total area of 2304 studs. Being a Lego fanatic, I've modeled a few mosaic-style designs that can be put onto these baseplates and then perhaps hung on walls or in a display (see: Android, Dream Theater, The Galactic Empire, Pokemon).
The Challenge
My challenge is now to get the lowest cost to purchase these designs. Purchasing 2304 individual 1x1 plates can get expensive. Using BrickLink, essentially an eBay for Lego, I can find data to determine what the cheapest parts are for given colors. For example, a 1x4 plate at $0.10 (or $0.025 per stud) would be cheaper than a 6x6 plate at $2.16 (or $0.06 per stud). We can also determine a list of all possible plates that can be used to assemble an image:
1x1
1x2
1x3
1x4
1x6
1x8
1x10
1x12
2x2 corner!
2x2
2x3
2x4
2x6
2x8
2x10
2x12
2x16
4x4 corner!
4x4
4x6
4x8
4x10
4x12
6x6
6x8
6x10
6x12
6x14
6x16
6x24
8x8
8x11
8x16
16x16
The Problem
For this problem, let's assume that we have a list of all plates, their color(s), and a "weight" or cost for each plate. For the sake of simplicity, we can even remove the corner pieces, but that would be an interesting challenge to tackle. How would you find the cheapest components to create the 48x48 image? How would you find the solution that uses the fewest components (not necessarily the cheapest)? If we were to add corner pieces as allowable pieces, how would you account for them?
We can assume we have some master list that is obtained by querying BrickLink, getting the average price for a given brick in a given color, and adding that as an element in the list. So, there would be no black 16x16 plate simply because it is not made or for sale. The 16x16 Bright Green plate, however, would have a value of $3.74, going by the current available average price.
I hope that my write-up of the problem is succinct enough. It's something I've been thinking about for a few days now, and I'm curious as to what you guys think. I tagged it as "interview-questions" because it's challenging, not because I got it through an interview (though I think it'd be a fun question!).
EDIT
Here's a link to the 2x2 corner piece and to the 4x4 corner piece. The answer doesn't necessarily need to take color into account, but it should be expandable to cover that scenario. The scenario would be that not all plates are available in all colors, so imagine that we've got an array of elements that identify a plate, its color, and the average cost of that plate (an example is below). Thanks to Benjamin for providing a bounty!
1x1|white|.07
1x1|yellow|.04
[...]
1x2|white|.05
1x2|yellow|.04
[...]
This list would NOT have the entry:
8x8|yellow|imaginarydollaramount
This is because an 8x8 yellow plate does not exist. The list itself is trivial and should only be thought about as providing references for the solution; it does not impact the solution itself.
EDIT2
Changed some wording for clarity.
Karl's approach is basically sound, but could use some more details. It will find the optimal cost solution, but will be too slow for certain inputs. Large open areas especially will have too many possibilities to search through naively.
Anyways, I made a quick implementation in C++ here: http://pastebin.com/S6FpuBMc
It solves filling in the empty space (periods), with 4 different kinds of bricks:
0: 1x1 cost = 1000
1: 1x2 cost = 150
2: 2x1 cost = 150
3: 1x3 cost = 250
4: 3x1 cost = 250
5: 3x3 cost = 1
.......... 1112222221
...#####.. 111#####11
..#....#.. 11#2222#13
..####.#.. 11####1#13
..#....#.. 22#1221#13
.......... 1221122555
..##..#... --> 11##11#555
..#.#.#... 11#1#1#555
..#..##... 11#11##221
.......... 1122112211
......#..# 122221#11#
...####.#. 555####1#0
...#..##.. 555#22##22
...####... 555####444 total cost = 7352
So, the algorithm fills in a given area. It is recursive (DFS):
FindBestCostToFillInRemainingArea()
{
    - find next empty square
    - if no empty square, return 0
    - for each piece type available
        - if it's legal to place the piece with upper-left corner on the empty square
            - place the piece
            - total cost = cost to place this piece + FindBestCostToFillInRemainingArea()
            - remove the piece
    - return the cheapest "total cost" found
}
Once we figure out the cheapest way to fill a sub-area, we'll cache the result. To very efficiently identify a sub-area, we'll use a 64-bit integer using Zobrist hashing. Warning: hash collisions may cause incorrect results. Once our routine returns, we can reconstruct the optimal solution based on our cached values.
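For illustration, here is a Python sketch of the same recursion that memoizes on the exact occupancy bitmask instead of a Zobrist hash (no collisions, but only practical for small regions; the piece list mirrors the example costs above and is single-colour only):

from functools import lru_cache

# Pieces as (height, width, cost); mirrors the example above.
PIECES = [(1, 1, 1000), (1, 2, 150), (2, 1, 150), (1, 3, 250), (3, 1, 250), (3, 3, 1)]

def cheapest_fill(grid):
    """grid: list of strings, '.' = cell to fill, '#' = already occupied."""
    H, W = len(grid), len(grid[0])
    cells = [(r, c) for r in range(H) for c in range(W) if grid[r][c] == '.']
    index = {cell: k for k, cell in enumerate(cells)}      # bit position per cell
    full = (1 << len(cells)) - 1

    @lru_cache(maxsize=None)
    def best(filled):
        if filled == full:
            return 0
        # next empty cell in scan order (the order affects node count, as noted below)
        k = next(k for k in range(len(cells)) if not filled & (1 << k))
        r0, c0 = cells[k]
        result = float('inf')
        for h, w, cost in PIECES:
            mask, ok = 0, True
            for r in range(r0, r0 + h):
                for c in range(c0, c0 + w):
                    b = index.get((r, c))
                    if b is None or filled & (1 << b):
                        ok = False
                        break
                    mask |= 1 << b
                if not ok:
                    break
            if ok:
                result = min(result, cost + best(filled | mask))
        return result

    return best(0)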
Optimizing:
In the example, 41936 nodes (recursive calls) are explored (searching for empty square top-to-bottom). However, if we search for empty squares left-to-right, ~900,000 nodes are explored.
For large open areas: I'd suggest finding the most cost-efficient piece and filling in a lot of the open area with that piece as a pre-process step. Another technique is to divide your image into a few regions, and optimize each region separately.
Good luck! I'll be unavailable until March 26th, so hopefully I didn't miss anything!
Steps
Step 1: Iterate through all solutions.
Step 2: Find the cheapest solution.
Create a piece inventory
For an array of possible pieces (including 1x1 pieces of each color), make at least n duplicates of each piece, where n = max over colors of (number of studs of that color on the board / studs of the piece). That way, at most n copies of that piece are enough to cover all of any one color's area on the board.
Now we have a huge collection of possible pieces, bounded because it is guaranteed that a subset of this collection will completely fill the board.
Then it becomes a subset problem, which is NP-Complete.
Solving the subset problem
For each unused piece in the set
    For each possible rotation (e.g. only 1 for a square, 2 for a rectangular piece, 4 for an elbow piece)
        For each possible position in the *remaining* open places on the board matching the color and rotation of the piece
            - Put down the piece
            - Mark the piece as used in the set
            - Recursively descend on the board (with some pieces already placed)
Optimizations
Obviously, being an O(2^n) algorithm, pruning the search tree early is of utmost importance, and optimizations must be applied early to avoid excessive running times. n is a very large number; just consider a 48x48 board -- you have 48x48xc possibilities (where c = number of colors) for the single 1x1 pieces alone.
Therefore, 99% of the search tree must be pruned from the first few hundred plies in order for this algorithm to complete in any time. For example, keep a tally of the lowest cost solution found so far, and just stop searching all lower plies and backtrack whenever the current cost plus (the number of empty board positions x lowest average cost for each color) > current lowest cost solution.
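A tiny sketch of that bound check (all names here are hypothetical):

def should_prune(current_cost, empty_by_colour, cheapest_per_stud, best_so_far):
    """Optimistic lower bound: every remaining stud of each colour filled at that
    colour's cheapest per-stud price. If even that cannot beat the best solution
    found so far, abandon this branch."""
    bound = current_cost + sum(count * cheapest_per_stud[colour]
                               for colour, count in empty_by_colour.items())
    return bound >= best_so_far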
For example, further optimize by always favoring the largest pieces (or the lowest average-cost pieces) first, so as to reduce the baseline lowest cost solution as quickly as possible and to prune as many future cases as possible.
Finding the cheapest
Calculate cost of each solution, find the cheapest!
Comments
This algorithm is generic. It does not assume a piece is of the same color (you can have multi-colored pieces!). It does not assume that a large piece is cheaper than the sum of smaller pieces. It doesn't really assume anything.
If some assumptions can be made, then this information can be used to further prune the search tree as early as possible. For example, when using only single-colored pieces, you can prune large sections of the board (with the wrong colors) and prune large number of pieces in the set (of the wrong color).
Suggestion
Do not try to do 48x48 at once. Try it on something small, say, 8x8, with a reasonably small set of pieces. Then increase number of pieces and board size progressively. I really have no idea how long the program will take -- but would love for somebody to tell me!
First, use flood fill to break the problem up into filling contiguous regions of Lego bricks. Then for each of those you can use a DFS with memoization if you wish. The flood fill is trivial, so I will not describe it further.
Make sure to follow a right hand rule while expanding the search tree to not repeat states.
My solution will be:
Sort all the pieces by stud cost.
For each piece in the sorted list, try to place as many as you can in the plate:
Raster-scan a 2D image of your design, looking for regions of uniform color that match the shape of the current piece and have free studs for every stud the piece would use.
If the color of the region found does not exist for that particular piece, ignore it and continue searching.
If the color exists: tag the studs used by that piece and increment a counter for that kind of piece and that color.
Step 2 will be done once for square pieces, twice for rectangular pieces (once vertical and once horizontal) and 4 times for corner pieces.
Repeat step 2 until the plate is full or no more piece types are available.
Once you reach the end, you will have the number of pieces of each kind and each color that you need, at minimum cost.
If the cost per stud can vary by color, then the original sorted list must include not only the type of piece but also the color.
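A rough sketch of this greedy pass (orientations are listed as separate entries in the piece list rather than rotated on the fly; the data layout is my own assumption, and 1x1 pieces of every colour are assumed to be present so the plate can always be finished):

def greedy_fill(design, pieces):
    """design: 2D list of colour names, one per stud.
    pieces: list of (height, width, colour, cost_per_stud), sorted by cost_per_stud.
    Returns a tally {(height, width, colour): count}."""
    H, W = len(design), len(design[0])
    used = [[False] * W for _ in range(H)]
    tally = {}
    for h, w, colour, _cost in pieces:
        for r in range(H - h + 1):
            for c in range(W - w + 1):
                region = [(r + dr, c + dc) for dr in range(h) for dc in range(w)]
                if all(not used[x][y] and design[x][y] == colour for x, y in region):
                    for x, y in region:
                        used[x][y] = True
                    tally[(h, w, colour)] = tally.get((h, w, colour), 0) + 1
    return tally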

Averaging a set of points on a Google Map into a smaller set

I'm displaying a small Google map on a web page using the Google Maps Static API.
I have a set of 15 co-ordinates, which I'd like to represent as points on the map.
Due to the map being fairly small (184 x 90 pixels) and the upper limit of 2000 characters on a Google Maps URL, I can't represent every point on the map.
So instead I'd like to generate a small list of co-ordinates that represents an average of the big list.
So instead of having 15 sets, I'd end up with 5 sets whose positions approximate the positions of the 15. Say there are 3 points that are closer to each other than to any other point on the map; those points would be collapsed into 1 point.
So I guess I'm looking for an algorithm that can do this.
Not asking anyone to spell out every step, but perhaps point me in the direction of a mathematical principle or general-purpose function for this kind of thing?
I'm sure a similar function is used in, say, graphics software, when pixellating an image.
(If I solve this I'll be sure to post my results.)
I recommend K-means clustering when you need to cluster N objects into a known number K < N of clusters, which seems to be your case. Note that one cluster may end up with a single outlier point and another with say 5 points very close to each other: that's OK, it will look closer to your original set than if you forced exactly 3 points into every cluster!-)
If you are searching for such functions/classes, have a look at MarkerClusterer and MarkerManager utility classes. MarkerClusterer closely matches the described functionality, as seen in this demo.
In general, I think the area you need to search around in is "Vector Quantization". I've got an old book titled Vector Quantization and Signal Compression by Allen Gersho and Robert M. Gray which provides a bunch of examples.
From memory, the Lloyd Iteration was a good algorithm for this sort of thing. It can take the input set and reduce it to a fixed sized set of points. Basically, uniformly or randomly distribute your points around the space. Map each of your inputs to the nearest quantized point. Then compute the error (e.g. sum of distances or Root-Mean-Squared). Then, for each output point, set it to the center of the set that maps to it. This will move the point and possibly even change the set that maps to it. Perform this iteratively until no changes are detected from one iteration to the next.
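A bare-bones sketch of that iteration, treating lat/lng pairs as planar points (reasonable at city scale; k would be 5 in your case):

import random

def lloyd(points, k, iterations=100):
    """Plain Lloyd / k-means iteration: points is a list of (lat, lng) pairs,
    k the number of output markers. Returns the k cluster centres."""
    centres = random.sample(points, k)
    for _ in range(iterations):
        # assignment step: map each point to its nearest centre
        buckets = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda j: (p[0] - centres[j][0]) ** 2 + (p[1] - centres[j][1]) ** 2)
            buckets[i].append(p)
        # update step: move each centre to the mean of its bucket
        new_centres = []
        for i, bucket in enumerate(buckets):
            if bucket:
                new_centres.append((sum(p[0] for p in bucket) / len(bucket),
                                    sum(p[1] for p in bucket) / len(bucket)))
            else:
                new_centres.append(centres[i])      # empty cluster: keep the old centre
        if new_centres == centres:
            break                                   # converged
        centres = new_centres
    return centres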
Hope this helps.

Resources