The range of similarity in OpenAI text embeddings is not [0, 1]

I am using OpenAI Ada to create text embeddings. When I calculate the similarity between these embeddings, my answers fall approximately in the range [0.7, 0.9]. In the rest of my project I need these values to be in [0.0, 1.0], so I want to find a way to normalize them. Could I just normalize them based on the range [0.7, 0.9], or would that be too risky, since there is no real proof that this is actually the range?
I understand that just keeping them as they are might be the best way, but I would really benefit from having the very distant sentences map to 0 and the really close ones to 1.
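If you do decide to rescale, a minimal sketch of the min-max approach is below (Python, NumPy assumed). Note that cosine similarity is formally bounded by [-1, 1], so (s + 1) / 2 is the only rescaling that is guaranteed safe; the 0.7 and 0.9 bounds below are just the empirically observed range from the question, and values outside them are clipped.

import numpy as np

def rescale_similarity(sim, lo=0.7, hi=0.9):
    # Min-max rescale a cosine similarity into [0, 1].
    # lo/hi are assumed empirical bounds (the ~[0.7, 0.9] observed above);
    # anything outside that assumed range is clipped.
    return float(np.clip((sim - lo) / (hi - lo), 0.0, 1.0))

print(rescale_similarity(0.72))  # near 0 -> very distant sentences
print(rescale_similarity(0.89))  # near 1 -> really close sentences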

Related

Implementing the mutation rate in a genetic algorithm

Given an array such as
[1, 4, 6, 1, 10, 3, 24, 1]
and wanting to implement a mutation rate of, let's say, 0.2, would I:
always mutate 20% of my array entries, or
mutate 0-20% of the entries, or
iterate over the array and mutate each entry 20% of the time?
I am unclear from the literature how this is handled, or whether there is even an agreed-upon standard.
Note: I am a coder meddling with GAs, so bear with me in my lack of depth of GA knowledge.
Thanks
I was unsure about that too when I started to learn about genetic algorithms. I decided it's best to give each gene an x% chance of being mutated (completely changed). In your case I would iterate over the array and, whenever Math.random() is smaller than 0.2, set the current number to a new random one.
If you find that you don't get enough diversity you can also add one or two completely random individuals (I like to call them 'foreigners' since they don't have any common ancestors).
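A minimal sketch of the per-gene approach described above, in Python rather than the Math.random() phrasing (the 1-24 value range is just an assumption taken from the example array):

import random

def mutate(genome, rate=0.2, low=1, high=24):
    # Give each gene an independent `rate` chance of being replaced
    # by a new random value in [low, high].
    return [random.randint(low, high) if random.random() < rate else gene
            for gene in genome]

print(mutate([1, 4, 6, 1, 10, 3, 24, 1]))

On average this mutates 20% of the entries, but any particular individual may have more or fewer genes changed.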

transform a matrix to the reduced row echelon form in Ruby

I'm working on an open-source project written in Ruby, and I have hit an area where an algorithm requires the use of linear algebra. I'm looking for a gem to transform a matrix to reduced row echelon form.
Basically following this (very detailed) series of steps:
http://www.math.odu.edu/~bogacki/cgi-bin/lat.cgi?c=rref
to convert
require 'matrix'
Matrix[[12, 0, -1, 0], [26, 0, 0, -2], [0, 2, -2, -1]]
to
Matrix[[1,0,0,-1/13],[0,1,0,-37/26],[0,0,1,-12/13]]
Can this be accomplished with the standard Ruby libraries in a few steps? Or does a linear algebra gem exist?
Does this help - http://rubyforge.org/projects/linalg/ ?
Basic description reads - Linalg is a fast, LAPACK-based library for real and complex matrices. Current functionality includes: singular value decomposition, eigenvectors and eigenvalues of a general matrix, least squares, LU, QR, Schur, Cholesky, stand-alone LAPACK bindings.
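If no gem fits, the linked steps (Gauss-Jordan elimination) are short enough to code directly. A sketch in Python for illustration only, using exact fractions so the output matches the rational entries above; porting it to Ruby's built-in Rational and Matrix classes is straightforward:

from fractions import Fraction

def rref(rows):
    # Gauss-Jordan elimination to reduced row echelon form, with exact fractions.
    m = [[Fraction(x) for x in row] for row in rows]
    n_rows, n_cols = len(m), len(m[0])
    pivot_row = 0
    for col in range(n_cols):
        # Find a row with a non-zero entry in this column to serve as the pivot.
        pivot = next((r for r in range(pivot_row, n_rows) if m[r][col] != 0), None)
        if pivot is None:
            continue
        m[pivot_row], m[pivot] = m[pivot], m[pivot_row]
        # Scale the pivot row so its leading entry becomes 1.
        div = m[pivot_row][col]
        m[pivot_row] = [x / div for x in m[pivot_row]]
        # Eliminate this column from every other row.
        for r in range(n_rows):
            if r != pivot_row and m[r][col] != 0:
                factor = m[r][col]
                m[r] = [a - factor * b for a, b in zip(m[r], m[pivot_row])]
        pivot_row += 1
        if pivot_row == n_rows:
            break
    return m

result = rref([[12, 0, -1, 0], [26, 0, 0, -2], [0, 2, -2, -1]])
print([[str(x) for x in row] for row in result])
# -> [['1', '0', '0', '-1/13'], ['0', '1', '0', '-37/26'], ['0', '0', '1', '-12/13']]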

Similarity algorithm (mathematics) of sampled signals

Let's say I have sampled some signals and constructed a vector of samples for each. What is the most efficient way to calculate the (dis)similarity of those vectors? Note that the offset of the sampling must not count; for instance, the sample vectors of sine and cosine signals should be considered similar, since as sequences they are exactly the same.
There is a simple way of doing this by "rolling" the elements of the other vector, calculating the Euclidean distance at each roll point and finally choosing the best match (smallest distance). This solution works fine, since my only goal is to find the most similar sample vector for an input signal from a vector pool.
However, the solution above is also very inefficient as the dimension of the vectors grows. Compared to "non-sequential vector matching", the sequential approach has N times more vector distance calculations to do for an N-dimensional vector.
Is there any higher/better mathematics/algorithms to compare two sequences with differing offsets?
Use case for this would be in sequence similarity visualization with SOM.
EDIT: How about comparing each vector's integral and entropy? Both of them are "sequence-safe" (= time-invariant?) and very fast to calculate, but I doubt they alone are enough to distinguish all possible signals from each other. Is there something else that could be used in addition to these?
EDIT2: Victor Zamanian's reply isn't directly the answer, but it gave me an idea that might be. The solution might be to sample the original signals by calculating their Fourier transform coefficients and inserting those into the sample vectors. The first element (X_0) is the mean or "level" of the signal, and the following ones (X_n) can be used directly to compare similarity with some other sample vector. The smaller n is, the more effect it should have on the similarity calculation, since the more coefficients are calculated with the FT, the more accurate the FT'd representation of the signal will be. This brings up a bonus question:
Let's say we have FT-6 sampled vectors (values just fell out of the sky)
X = {4, 15, 10, 8, 11, 7}
Y = {4, 16, 9, 15, 62, 7}
The similarity value of these vectors could MAYBE be calculated like this: |16-15| + (|10-9| / 2) + (|8-15| / 3) + (|11-62| / 4) + (|7-7| / 5)
Those weights, the divisors 2, 3, 4 and 5, are the bonus question. Are there some coefficients, or some other way, to know how much each FT coefficient should affect the similarity in relation to the other coefficients?
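As an illustration only, the ad-hoc weighting above can be written out directly (Python; the 1/n weights are exactly the guess from the question, not an established choice):

def weighted_coefficient_distance(x, y):
    # Compare two vectors of Fourier coefficients, skipping the first
    # (mean/"level") term and down-weighting the n-th difference by 1/n,
    # exactly as in the formula above.
    return sum(abs(a - b) / n
               for n, (a, b) in enumerate(zip(x[1:], y[1:]), start=1))

X = [4, 15, 10, 8, 11, 7]
Y = [4, 16, 9, 15, 62, 7]
print(weighted_coefficient_distance(X, Y))  # 1 + 1/2 + 7/3 + 51/4 + 0/5 ~= 16.58

Note this is a distance: 0 means the coefficient vectors are identical.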
If I understand your question correctly, maybe you would be interested in some type of cross-correlation implementation? I'm not sure if it's the most efficient thing to do or fits the purpose, but I thought I would mention it since it seems relevant.
Edit: Maybe a Fast Fourier Transform (FFT) could be an option? Fourier transforms are great for distinguishing signals from each other, and I believe they are helpful for finding similar signals too. E.g. a sine and a cosine wave would be identical in the real plane and just have different imaginary parts (phase). FFTs can be computed in O(N log N).
Google "translation invariant signal classificiation" and you'll find things like these.

Primal weights of non-linear SVM

I am experimenting with different kinds of non-linear kernels and am trying to interpret the learned models, which led me to the following question: Is there a generic method for getting the primal weights of a non-linear Support Vector Machine, similar to how this is possible for linear SVMs (see the related question)?
Say you have three features a, b, c and the generated model of an all-subsets/polynomial kernel. Is there a way to extract the primal weights of those subsets, e.g., a * b and a^2?
I've tried extending the method for linear kernels, where you generate output for the following samples:
a, b, c
[0, 0, 0]
[1, 0, 0]
[0, 1, 0]
[0, 0, 1]
If I use the same approach for the all-subsets kernel, I can generate some more samples:
a, b, c
[1, 1, 0]
[1, 0, 1]
...
Next, to calculate the primal weight for a * b, I analyse the predictions as follows: the prediction for [1, 1, 0] minus the sum of the predictions for [1, 0, 0], [0, 1, 0] and [0, 0, 0].
The problem I see with this is that it requires a prohibitive number of samples, doesn't address subsets such as a^2, and doesn't generalise to other non-linear kernels.
No. I don't claim to be the end-all-be-all expert on this, but I've done a lot of reading and research on SVM and I do not think what you are saying is possible. Sure, in the case of the 2nd degree polynomial kernel you can enumerate the feature space induced by the kernel, if the number of attributes is very small. For higher-order polynomial kernels and larger numbers of attributes this quickly becomes intractable.
The power of the non-linear SVM is that it is able to induce feature spaces without having to do computation in that space, and in fact without actually knowing what that feature space is. Some kernels can even induce an infinite-dimensional feature space.
If you look back at your question, you can see part of the issue - you are looking for the primal weights. However, the kernel is something that is introduced in the dual form, where the data shows up as a dot product. Mathematically reversing this process would involve breaking the kernel function apart - knowing the mapping function from input space to feature space. Kernel functions are powerful precisely because we do not need to know this mapping. Of course it can be done for linear kernels, because there is no mapping function used.
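To make the degree-2 case above concrete: with only a handful of attributes the feature map can be written out explicitly and the primal weights read off as w = sum_i alpha_i * y_i * phi(x_i), instead of probing the model with synthetic samples. A sketch assuming a homogeneous polynomial kernel K(x, z) = (x . z)^2, whose explicit features for a, b, c are a^2, b^2, c^2, sqrt(2)ab, sqrt(2)ac, sqrt(2)bc; the support vectors and dual coefficients below are placeholders:

import numpy as np

def phi_deg2(x):
    # Explicit feature map of the homogeneous degree-2 kernel (x . z)^2
    # for three attributes a, b, c.
    a, b, c = x
    s = np.sqrt(2)
    return np.array([a * a, b * b, c * c, s * a * b, s * a * c, s * b * c])

# Placeholder dual solution: support vectors and their alpha_i * y_i coefficients.
support_vectors = np.array([[1.0, 0.5, -1.0],
                            [0.2, -1.0, 0.3]])
alpha_times_y = np.array([0.8, -0.4])

# Primal weights in the explicit feature space: w = sum_i alpha_i y_i phi(x_i).
# The last three entries are the weights on the sqrt(2)*ab, sqrt(2)*ac, sqrt(2)*bc features.
w = sum(coef * phi_deg2(sv) for coef, sv in zip(alpha_times_y, support_vectors))
print(dict(zip(["a^2", "b^2", "c^2", "ab", "ac", "bc"], np.round(w, 3))))

For an inhomogeneous kernel like (x . z + 1)^2 the map also gains linear and constant terms, and for higher degrees or many attributes the explicit space blows up exactly as described above.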
Extracting the weights of the explicit features is generally not computationally feasible, but a decent next-best-thing is the pre-image: generating a sample z such that its features correspond to the weights you're after.
This can be described formally as finding z such that phi(z) = w, with the weights implicitly defined as a combination of training samples, as is usual with the kernel trick: w=sum_i(alpha_i * phi(x_i)). phi is the feature map.
In general an exact pre-image does not exist, and most methods find the pre-image that minimizes the squared-error.
A good description of the classical Gauss-Newton iteration for pre-images of Gaussian kernels, as well as another more general method based on KPCA, is in:
James T. Kwok, Ivor W. Tsang, "The Pre-Image Problem in Kernel Methods", ICML 2003
Direct link: http://machinelearning.wustl.edu/mlpapers/paper_files/icml2003_KwokT03a.pdf

Can K-means be used to help in pixel-value based separation of an image?

I'm trying to separate a grey-level image based on pixel value: suppose pixels from 0 to 60 go in one bin, 60-120 in another, 120-180 in the next, and so on up to 255. The ranges are roughly equispaced in this case.
However, by using K-means clustering would it be possible to get more realistic estimates of what my pixel-value ranges should be? The aim is to group similar pixels together and not waste bins where there is a lower concentration of pixels.
EDITS (to include obtained results):
k-means with number of clusters = 5
Of course K-Means can be used for color quantization. It's very handy for that.
Let's see an example in Mathematica:
We start with a greyscale (150x150) image:
Let's see how many grey levels there are when representing the image in 8 bits:
ac = ImageData[ImageTake[i, All, All], "Byte"];
First@Dimensions@Tally@Flatten@ac
-> 234
OK. Let's reduce those 234 levels. Our first try will be to let the algorithm alone determine how many clusters there are with the default configuration:
ic = ClusteringComponents[Image@ac];
First@Dimensions@Tally@Flatten@ic
-> 3
It selects 3 clusters, and the corresponding image is:
Now, whether that is OK or you need more clusters is up to you.
Let's suppose you decide that a more fine-grained color separation is needed. Let's ask for 6 clusters instead of 3:
ic2 = ClusteringComponents[Image@ac, 6];
Image@ic2 // ImageAdjust
Result:
and here are the pixel ranges used in each bin:
Table[{Min@#, Max@#} &@(Take[orig, {#[[1]]}, {#[[2]]}] & /@
   Position[clus, n]), {n, 1, 6}]
-> {{0, 11}, {12, 30}, {31, 52}, {53, 85}, {86, 134}, {135, 241}}
and the number of pixels in each bin:
Table[Count[Flatten@clus, i], {i, 6}]
-> {8906, 4400, 4261, 2850, 1363, 720}
So, the answer is YES, and it is straightforward.
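The same kind of quantization can be reproduced outside Mathematica. A rough equivalent sketch with scikit-learn and Pillow, clustering the raw intensities into 6 bins as in the example above (the file name "photo.png" is a placeholder):

import numpy as np
from PIL import Image
from sklearn.cluster import KMeans

# Load a greyscale image as a 2-D array of 0-255 intensities.
pixels = np.asarray(Image.open("photo.png").convert("L"), dtype=float)

km = KMeans(n_clusters=6, n_init=10, random_state=0).fit(pixels.reshape(-1, 1))
labels = km.labels_.reshape(pixels.shape)

# Pixel range and population of each bin, analogous to the two tables above.
for c in range(6):
    members = pixels[labels == c]
    print(c, int(members.min()), int(members.max()), members.size)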
Edit
Perhaps this will help you understand what you are doing wrong in your new example.
If I clusterize your color image, and use the cluster number to represent brightness, I get:
That's because the clusters are not being numbered in an ascending brightness order.
But if I calculate the mean brightness value for each cluster, and use it to represent the cluster value, I get:
In my previous example, that was not needed, but that was just luck :D (i.e. clusters were found in ascending brightness order)
k-means could be applied to your problem. If it were me, I would first try a basic approach borrowed from decision trees (although "simpler" depends upon your precise clustering algorithm!).
Assume one bin exists and begin stuffing the pixel intensities into it. When the bin is "full enough", compute the mean and standard deviation of the bin (or node). If the standard deviation is greater than some threshold, split the node in half. Continue this process until all intensities are processed, and you will have a more efficient histogram.
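A rough sketch of that splitting idea in Python (the standard-deviation threshold, the minimum bin population standing in for "full enough", and the synthetic pixels are all placeholders):

import numpy as np

def split_bins(values, lo=0, hi=256, std_threshold=20.0, min_count=50):
    # Recursively split an intensity range in half while the pixels falling
    # in it are too spread out (standard deviation above the threshold).
    in_bin = values[(values >= lo) & (values < hi)]
    if in_bin.size < min_count or in_bin.std() <= std_threshold or hi - lo <= 1:
        return [(lo, hi)]
    mid = (lo + hi) // 2
    return (split_bins(values, lo, mid, std_threshold, min_count)
            + split_bins(values, mid, hi, std_threshold, min_count))

# Synthetic intensities concentrated in two clumps (around 40 and 200).
rng = np.random.default_rng(0)
pixels = np.clip(np.concatenate([rng.normal(40, 5, 5000),
                                 rng.normal(200, 10, 3000)]), 0, 255)
print(split_bins(pixels))  # -> [(0, 128), (128, 256)]: one bin per clump here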
This method can be improved with additional details of course:
You might consider using kurtosis as a splitting criteria.
Skewness might be used to determine where the split occurs
You might cross all the way into decision-tree land and borrow the Gini index to guide splitting (some split techniques rely on more "exotic" statistics, like the t-test).
Lastly, you might perform a final consolidation pass to collapse any sparsely populated nodes.
Of course, if you've applied all of the above "improvements", then you've basically implemented one variation of a k-means clustering algorithm ;-)
Note: I disagree with the comment above - the problem you describe does not appear closely related to histogram equalization.
