Plotting of density estimates in MATLAB - estimation

Given the biometric match scores, I am required to plot the estimated densities of the genuine and impostor matching scores. Following are the graphs I got for the genuine and impostor scores respectively.
My question is: how do I combine the two graphs and plot them against all the scores given in the data set in order to compare the two densities?
Can I do:
plot(data_set, pdf_genuine)
hold on;
plot(data_set, pdf_impostor)
hold off;
The dimensions of the entire data set are 517x516
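One way that works is to evaluate both density estimates on a single shared grid of scores and draw them on the same axes. Below is a minimal sketch using ksdensity; genuine_scores and impostor_scores are assumed to be numeric column vectors extracted from your data set (the names are placeholders):

% Evaluate both kernel density estimates on one common score grid
% so the two curves are directly comparable.
grid_pts = linspace(min([genuine_scores; impostor_scores]), ...
                    max([genuine_scores; impostor_scores]), 200);
pdf_genuine  = ksdensity(genuine_scores,  grid_pts);
pdf_impostor = ksdensity(impostor_scores, grid_pts);

plot(grid_pts, pdf_genuine, 'b');    % genuine density in blue
hold on;
plot(grid_pts, pdf_impostor, 'r');   % impostor density in red
hold off;
legend('Genuine', 'Impostor');
xlabel('Match score');
ylabel('Estimated density');

So yes, your plot / hold on / plot pattern is fine, as long as both pdfs are evaluated over the same x-values.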

Related

k nearest neighbour search for cells in place of matrices?

I want to find the Euclidean distance between some data sets. For each class I have about 50 learning or training samples and 5 test samples. The problem is that the data for each sample is stored in cells and is in an m * n format rather than the preferred 1 * n format.
I see that MATLAB has a built-in function for k nearest neighbours using Euclidean distance: knnsearch. But the problem is that it expects the samples to be in the form of rows and columns. It gives an error when data in cells is provided.
Is there any way to do this classification in the format my data is in? Or is it preferable that I convert it to a 1-D format (maybe by using some sort of dimensionality reduction)?
Suppose there are 5 classes/models that are each represented by cells of size 32 * N (below is a screenshot from MATLAB). Each cell represents the features of a lot of training images, so cell 1 is a representation of person 1 and so on.
Similarly, I have test images that need to be identified. The features of each test image are also in cells, so I have 10 unknown images with their features extracted and stored in cells, each cell being the representation of a single unknown image. Below again is a screenshot from MATLAB showing their dimensions.
As can be seen, they are all of varying dimensions. I know this is not the ideal format for KNN, since here each image is represented by a multi-dimensional matrix rather than a single row of variables. How do I compute Euclidean distances on such data?
Thank you
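One pragmatic option is to pool each variable-size 32 * N cell down to a fixed-length row before calling knnsearch. The sketch below uses the mean over columns as the pooling step; that choice is an assumption (it throws away the per-column information), and the names trainCells, testCells and trainLabels are placeholders for your own variables:

% Pool each 32-by-N feature matrix into a single 1-by-32 descriptor
% (mean over columns), stacking the results into plain numeric matrices.
pool = @(C) cell2mat(cellfun(@(m) mean(m, 2).', C(:), 'UniformOutput', false));

Xtrain = pool(trainCells);        % numTrainSamples-by-32
Xtest  = pool(testCells);         % numTestSamples-by-32

idx = knnsearch(Xtrain, Xtest);   % index of nearest training row per test row
predicted = trainLabels(idx);     % class label of that nearest neighbour

If mean-pooling loses too much information, alternatives in the same spirit are max-pooling per row or concatenating a few summary statistics; the key point is that every sample must end up as one fixed-length row.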

How to determine the number of bins and the edge length based on the density of each bin. (Bins most likely are not uniform.)

I have been trying to figure out how to write a function that bins a sample of data based on its density (number of occurrences / length of edge), but there are not a lot of examples out there.
The output would give a vector of edges where both:
the number of bins is determined by how many are required to separate data of different densities by a threshold (maybe 40%?);
and the length of the edges is determined by whether adjacent data groups have similar density (similar densities are grouped together, but if a neighbouring bin is 40% more or less in density, it requires another bin).
So to illustrate my point, below is a simple example:
I have data values that range from 1 to 10, and I have 10 observations: x = [1,2,3,4,5,5,5,6,6,7];
x would result in a range with edges [1,5,6,7,8], so there are four bins, simply because the bins represent different density clusters.
Just to mention, my actual data is continuous. Any help is appreciated.
I thought of a preliminary algorithm for large data samples:
Sort data in ascending order.
Group the data so that each group has at least 10 elements.
Calculate and compare densities to group similar ones together (a rough sketch follows after this question).
I got stuck on the third step, where I am not sure how to group the data effectively. My obstacle is the case where the density increases slowly but steadily, e.g. densities of 1,2,3,4,5,6,7,8,9,10.
Where do I make the break and say that one group has a different density from another?
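Here is a rough sketch of those three steps in MATLAB, assuming x is a numeric vector; the chunk size of 10 and the 40% merge threshold are both free parameters, and the merging is greedy left-to-right rather than optimal:

% Sort, cut into raw chunks of ~10 observations, then merge adjacent
% chunks whose densities (count / edge length) differ by less than 40%.
x = sort(x(:));
starts = 1:10:numel(x);              % start index of each raw chunk
edges = x(starts);
edges(end+1) = x(end);               % provisional edges from chunk bounds
edges = unique(edges);               % repeated values can duplicate edges

counts  = histcounts(x, edges);
density = counts ./ diff(edges).';   % occurrences per unit length

i = 1;
while i < numel(density)
    if abs(density(i+1) - density(i)) < 0.4 * density(i)
        edges(i+1)   = [];                       % merge: drop shared edge
        counts(i)    = counts(i) + counts(i+1);
        counts(i+1)  = [];
        density(i)   = counts(i) / (edges(i+1) - edges(i));
        density(i+1) = [];
    else
        i = i + 1;                               % densities differ: keep edge
    end
end
% edges now holds the adaptive bin edges; density the per-bin densities.

For the slowly increasing case (densities 1,2,...,10) a greedy pairwise test like this keeps merging neighbours that are individually within 40% of each other, so if you want a break there you need to compare each chunk against the running density of the merged group rather than only against its immediate neighbour.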

Alignment algorithm: dynamic programming with gaps

I wrote a dynamic programming alignment algorithm. I wish to align two lists of peaks that have been extracted from two different signals. Each list of peaks is a data set with two columns, i.e. two features: the time of the peak and the area of the peak. Since the peaks come from two different signals, the lists contain no exact matches. However, both lists have some peaks in common (roughly two thirds), that is to say peaks that are close in both time and area.
In my first DP algorithm, I rely on a distance calculation that takes time and area into account. I iterate over the peaks in the shorter list and calculate their distances to some peaks in the other data set. I fill in a score matrix with these distances and go backward to recover the optimal path (the one with minimum distance). This works perfectly IF I want to assign ALL peaks in the shorter list to peaks in the longer list. However, it does not work if gaps are allowed, that is to say if some elements of the shorter data set have no match in the larger data set.
Which refinement of DP could handle this type of problem? What other algorithms are at hand for these problems?
Thanks!
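The classical refinement is the Needleman-Wunsch recurrence used for sequence alignment: in addition to the diagonal "match" move, every cell of the score matrix can also be reached by a horizontal or vertical "gap" move that skips one peak at a fixed penalty, so a peak with no good counterpart is simply left unmatched. A minimal sketch, assuming a and b are n-by-2 and m-by-2 [time, area] matrices, peakDist is your existing two-feature distance, and gapPenalty is a constant you tune against typical peak distances:

% Gapped DP alignment of two peak lists (Needleman-Wunsch style).
n = size(a, 1);
m = size(b, 1);
D = zeros(n + 1, m + 1);
D(2:end, 1) = (1:n).' * gapPenalty;   % skipping leading peaks of a
D(1, 2:end) = (1:m)   * gapPenalty;   % skipping leading peaks of b
for i = 1:n
    for j = 1:m
        match = D(i, j)   + peakDist(a(i, :), b(j, :)); % pair a(i) with b(j)
        gapA  = D(i, j+1) + gapPenalty;                 % leave a(i) unmatched
        gapB  = D(i+1, j) + gapPenalty;                 % leave b(j) unmatched
        D(i+1, j+1) = min([match, gapA, gapB]);
    end
end
% D(n+1, m+1) is the optimal total cost; backtracking through the three
% moves tells you which peaks are paired and which fall into gaps.

The gap penalty is what makes unmatched peaks possible: a pairing is only made when its distance beats the cost of two gaps. If the matching should also be free to start and end anywhere (local rather than global alignment), the Smith-Waterman variant of the same recurrence applies.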

Histogram peak identification and Gaussian fitting with minimal accumulated height difference in C++

I already asked a similar question some time ago in the following thread: previous thread. Unfortunately I couldn't entirely solve that issue so far and only worked around it. Since it is difficult to include all the new information in the previous thread, I am posting a refined and extended question with distinct context here and linking it to the old thread.
I am currently implementing an algorithm from a paper which extracts certain regions of a 3D data set by dynamically identifying value ranges in the data set's histogram.
In a simplified way the method can be described as follows:
Find the highest peak in the histogram.
Fit a Gaussian to the peak.
Using the value range defined by the Gaussian's mean (µ) ± deviation (σ), certain regions of the histogram are identified, and the voxels (= 3D pixels) of these regions are removed from the original histogram.
As a result of the previous step a new highest peak should be revealed, based on which steps 1-3 can be repeated. The steps are repeated until the data set's histogram is empty.
My questions relate to steps 1 and 2 of the above description, which the paper describes as follows: "The highest peak is identified and a Gaussian curve is fitted to its shape. The Gaussian is described by its midpoint µ, height h and deviation σ. The fitting process minimizes the accumulated height difference between the histogram and the middle part of the Gaussian. The error summation range is µ ± σ."
In the following I will ask my questions and add my reflections on them:
How should I identify the bins of the total histogram that make up the highest peak? To find its apex I simply run through the histogram and store the index of the bin with the highest frequency. But how far should the extent of the peak reach to the left and right of the highest bin? At the moment I simply walk to the left and right of the highest bin for as long as the next bin is smaller than the previous one. However, this usually gives a very small range, since creases (mini peaks) occur in the histogram. I have already thought about smoothing the histogram, but I would have to do that after each iteration, since the subtraction of voxels (step 3 in the description above) can cause the histogram to contain creases again. I am also worried that the repeated smoothing would distort the results.
Therefore I would like to ask whether there is a more efficient way to detect the extent of a peak than my current approach. There were suggestions about mixture models and deconvolution in the previous thread, but are those methods really reasonable if the shape of the histogram constantly changes after each iteration?
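One lightweight middle ground is to smooth a throwaway copy of the histogram only for the purpose of locating the peak and its extent, while the voxel subtraction of step 3 keeps operating on the raw histogram, so the smoothing never accumulates across iterations. A sketch of that idea (in MATLAB for brevity; the window size of 5 bins is an arbitrary choice):

% Find peak apex and extent on a smoothed copy; the raw histogram h
% stays untouched for the later voxel-subtraction step.
hs = movmean(h, 5);                         % smoothed throwaway copy
[~, apex] = max(hs);                        % apex of the highest peak
left = apex;                                % walk left while still falling
while left > 1 && hs(left - 1) < hs(left)
    left = left - 1;
end
right = apex;                               % walk right while still falling
while right < numel(hs) && hs(right + 1) < hs(right)
    right = right + 1;
end
% [left, right] is the peak's bin range; use it to index the raw h.

Because the creases are flattened in the copy, the walk no longer stops at the first mini peak, and since a fresh copy is re-smoothed each iteration, distortion does not build up.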
How can I fit a Gaussian curve to the identified peak so that the accumulated height difference between the histogram and the middle part of the Gaussian is minimized?
Following question one of the previous thread, I fitted the curve to a given range of histogram bins by computing their mean and deviation (I hope this is correct?!). But how do I minimize the accumulated height difference between the histogram and the middle part of the Gaussian from this point?
Thank you for your help!
Regards Marc
Add histogram values to the left and right until the goodness of the fit begins to decrease.
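One way to implement that criterion directly is a small numerical search over the three Gaussian parameters, scoring each candidate by the summed height difference over µ ± σ, as the paper states it. A sketch (in MATLAB for brevity; fminsearch, the initial guesses and the minimum-bin guard are my choices, not the paper's):

% Fit [mu, height, sigma] by minimizing the accumulated height
% difference between the histogram and the Gaussian over mu +/- sigma.
% h = bin counts, bins = bin centers (vectors of equal length).
function pFit = fitPeakGaussian(h, bins)
    [~, apex] = max(h);
    p0 = [bins(apex), h(apex), 1];            % rough starting point
    pFit = fminsearch(@(p) fitCost(p, h, bins), p0);
end

function c = fitCost(p, h, bins)
    in = bins >= p(1) - p(3) & bins <= p(1) + p(3);  % middle part only
    if nnz(in) < 3                            % guard: sigma must not collapse
        c = Inf;
        return
    end
    g = p(2) * exp(-(bins(in) - p(1)).^2 / (2 * p(3)^2));
    c = sum(abs(h(in) - g));                  % accumulated height difference
end

Note one pitfall of the stated criterion: shrinking σ also shrinks the summation range and therefore the error, so a naive optimizer tends to collapse the fit. The guard above limits that, and combining it with the suggestion from the answer (grow the bin window outward and stop once the fit quality starts to drop) gives a more robust stopping rule.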

What would be the complexity of this color-quantization algorithm?

I started toying with this idea some years ago when I wrote my university papers. The idea is this: the perfect color quantization algorithm would take an arbitrary true-color picture and reduce the number of colors to the minimum possible, while ensuring that the new image is completely indistinguishable from the original to the naked eye.
Basically the setting is simple - you have a set of points in the RGB cube (from 0 to 255 integer values on each axis). You have to replace each of these points with another point in such a way that:
The total number of points after the operation is as small as possible;
The distance from an original point to its replacement point is no larger than some predefined constants R, G and B on each of the red, green and blue axes respectively (these are derived from the sensitivity of the human eye and are in general configurable by the user).
I know that there are many color quantization algorithms out there that work with different efficiencies, but they are mostly targeted at reducing colors to a certain number, not "the minimum possible without violating these constraints".
Also, I would like the algorithm to produce the truly absolute minimum possible, not just something that is "pretty close to minimum".
Is this possible without a time-consuming full search of all combinations (infeasible for any real picture)? My instincts tell me that this is an NP-complete problem or worse, but I cannot prove it.
Bonus setting: Change the limit from constants R, G, B to a function F(Rsource, Gsource, Bsource, Rtarget, Gtarget, Btarget) which returns TRUE if the mapping would be OK, and FALSE if it is out of range.
Given your definitions, the structure of the picture (i.e. how the pixels are organized) does not matter at all; the only thing that matters is the subset of RGB triplets that appear at least once in the picture as a pixel value. Let that subset be S. You then want to find another subset of RGB triplets E (the encoding) such that for every s in S there exists a counterpart e in E with diff(s,e) <= threshold, where threshold is the limit you impose on the acceptable difference and diff(...) reduces the triplet distance to a single number.
Additionally, you want E to be minimal in size, i.e. for any E' with |E'| < |E| there is at least one s in S with no counterpart in E' satisfying the difference constraint.
This particular problem cannot be given an asymptotic complexity assessment because it has only a finite set of instances. It could (theoretically) be solved in constant time by precalculating the minimum set E for every subset S. There is a huge number of subsets S, but still only a finite number, so the problem cannot, for example, be classified as an NP-complete optimization problem. The actual run time of your algorithm for this particular problem hence depends entirely on the amount of preprocessing you are willing to tolerate. To get an asymptotic complexity assessment you first need to generalize the problem so that the set of problem instances is strictly infinite.
Optimal quantization is an NP-hard problem (Son H. Nguyen, Andrzej Skowron — Quantization Of Real Value Attributes, 1995).
A predefined maximum distance doesn't make things easier when you have clusters of points which are larger than your sphere but whose pairwise distances are less than the sphere radius; then you have a lot of combinations, as each choice of placement of a sphere may displace all the other spheres. Unfortunately this is going to happen quite often on real images with gradients (it's not unusual for the entire histogram to be one huge cluster).
You can modify many quantization algorithms to keep increasing the number of clusters until a certain quality is satisfied; e.g. in Median Cut and Linde–Buzo–Gray you can simply stop subdividing space when you reach your quality limit. It won't be guaranteed to be the global minimum (that is NP-hard), but with LBG you'll at least know you're at a local minimum.
Here's an idea of how I'd go about this; unfortunately it will probably need a lot of memory and be very slow:
You create a 256x256x256 cubic data structure that contains a counter and a "neighbors" list of colors. For every unique color that you find in your image you increase the counter of each cell which is within the radius of a sphere around that color. The radius of the sphere is the maximum acceptable distance that you have defined originally. You also add the color to the neighbors list of each cell.
Once you have added all unique colors you loop through the cube and find the cell with the maximum counter value. Add this color to your result list. Now loop through your cube again and remove this color and all colors that are in the neighbors list of that color from all cells and decrease each cell's counter whenever you remove a color. Then repeat searching for the maximum counter and removing until no more colors are in the cube.
Alternatively, one could also add the same color multiple times if it occurs more often in the image. I am not sure whether that would improve the visual result.
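The counter-cube idea above is, in essence, a greedy set cover over the unique colors: repeatedly pick the color whose sphere covers the most still-uncovered colors. A compact sketch of that greedy core (in MATLAB; a single Euclidean radius stands in for the per-axis R, G, B constants, and the pairwise distance matrix limits it to images with a modest number of unique colors):

% Greedy sphere-cover quantization over the unique colors of img.
% img is an H-by-W-by-3 uint8 image; radius is the max tolerated distance.
cols = unique(reshape(double(img), [], 3), 'rows');  % unique RGB triplets S
covered = false(size(cols, 1), 1);
palette = zeros(0, 3);                               % resulting encoding E
while ~all(covered)
    rem = cols(~covered, :);
    % pairwise squared distances among the remaining colors
    d2 = sum((permute(rem, [1 3 2]) - permute(rem, [3 1 2])).^2, 3);
    within = d2 <= radius^2;               % which colors each sphere covers
    [~, best] = max(sum(within, 2));       % sphere covering the most colors
    palette(end + 1, :) = rem(best, :);    % add its center to the palette
    idx = find(~covered);
    covered(idx(within(best, :))) = true;  % mark everything it covers
end

As noted in the answers above, this greedy choice is not guaranteed to reach the true minimum palette (the exact problem is NP-hard), but it matches the counter-and-remove loop described here and is straightforward to weight by pixel frequency if desired.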
