Details of the "New Yale" sparse matrix format? - data-structures

There's some Netlib code written in Fortran which performs transposes and multiplication on sparse matrices. The library works with Bank-Smith (sort of), "old Yale", and "new Yale" formats.
Unfortunately, I haven't been able to find much detail on "new Yale." I implemented what I think matches the description given in the paper, and I can get and set entries appropriately.
But the results are not correct, leading me to wonder if I've implemented something which matches the description in the paper but is not what the Fortran code expects.
So a couple of questions:
Should row lengths include diagonal entries? e.g., if you have M=[1,1;0,1], it seems that it should look like this:
IJA = [3,4,4,1]
A = [1,1,X,1] // where X=NULL
It seems that if diagonal entries are included in row lengths, you'd get something like this:
IJA = [3,5,6,1]
A = [1,1,X,1]
That doesn't make much sense because IJA[2]=6 should be the size of the IJA/A arrays, but it is what the paper seems to say.
Should the matrices use 1-based indexing?
It is Fortran code after all. Perhaps instead my IJA and A should look like this:
IJA = [4,5,5,2]
A = [1,1,X,1] // still X=NULL
Is there anything else I'm missing?
Yes, that's vague, but I throw that out there in case someone who has messed with this code before would like to volunteer any additional information. Anyone else can feel free to ignore this last question.
I know these questions may seem rather trivial, but I thought perhaps some Fortran folks could provide me with some insight. I'm not used to thinking in a one-based system, and though I've converted the code to C using f2c, it's still written like Fortran.

I can't see how you deduced those vectors from that paper. First the Old Yale format:
M = [7,16;0,-12]
Then, A contains all non-zero values of M in row-form:
A = [7,16,-12]
and IA stores the position in A of the first elements of each row, and JA stores the column indices of all the values in A:
IA = [1,3,4]
JA = [1,2,2]
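As a cross-check of the vectors above, here is a minimal sketch that builds the "old Yale" arrays with 1-based indices. The function name `old_yale` and the dense list-of-rows input are assumptions made for illustration; this is not code from the Netlib library.

```python
def old_yale(matrix):
    """Build 1-based "old Yale" arrays (A, IA, JA) from a dense list of rows."""
    a, ia, ja = [], [1], []
    for row in matrix:
        for col, value in enumerate(row, start=1):
            if value != 0:
                a.append(value)   # non-zero values, row by row
                ja.append(col)    # 1-based column index of each stored value
        ia.append(len(a) + 1)     # 1-based position in A where the next row starts
    return a, ia, ja


print(old_yale([[7, 16], [0, -12]]))
```

With this layout the number of stored entries in row I is IA(I+1) - IA(I).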
New format: A has diagonal values first, a zero and then the remaining non-zero elements (I have put | to clarify the separation between diagonal and non-diagonal):
A = [7,-12,0 | 16]
IA and JA are combined in IJA, but as far as I can tell from the paper you need to take into account the new ordering of A (I have put | to clarify the separation between IA and JA):
IJA = [1,2,3 | 2]
So, applied to your case M = [1,1;0,1], I get
A = [1,1,0 | 1]
IJA = [1,2,3 | 2]
The first element of the first row is the first in A and the first element of the second row is the second in A; then I put 3 since they say the length of a row is determined by IA(I+1)-IA(I), so I make sure the difference is 1. Then the column indices of the non-zero non-diagonal elements follow, and that is 2.

So, first of all, the reference given in the SMMP paper is possibly not the correct one. I checked it out (the ref) from the library last night. It appears to give the "old Yale" format. It does mention, on pp. 49-50, that the diagonal can be separated out from the rest of the matrix -- but doesn't so much as mention an IJA vector.
I was able to find the format described in the 1992 edition of Numerical Recipes in C on pp. 78-79.
Of course, there is no guarantee that this is the format accepted by the SMMP library from Netlib.
NR seems to have IA giving positions relative to IJA, not relative to JA. The last position in the IA portion gives not the size of the IJA and A vectors, but size+1, because the vectors are indexed starting at 1 (per Fortran standard).
Row lengths do not include non-zero diagonal entries.
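To make the layout concrete, here is a minimal sketch of the row-indexed storage as I understand it from Numerical Recipes, producing 1-based index values. The function name `nr_row_indexed` and the `filler` parameter are assumptions for illustration, and nothing here confirms that SMMP expects the same layout.

```python
def nr_row_indexed(matrix, filler=None):
    """Build (IJA, A) for a square dense matrix, using 1-based index values."""
    n = len(matrix)
    # A starts with the n diagonal entries, then one unused slot.
    sa = [matrix[i][i] for i in range(n)] + [filler]
    # The first n+1 slots of IJA are row pointers, filled in below.
    ija = [0] * (n + 1)
    for i, row in enumerate(matrix):
        ija[i] = len(sa) + 1              # 1-based index of row i's first off-diagonal
        for j, value in enumerate(row):
            if j != i and value != 0:
                sa.append(value)          # off-diagonal value
                ija.append(j + 1)         # its 1-based column index
    ija[n] = len(sa) + 1                  # one past the last used 1-based index
    return ija, sa


print(nr_row_indexed([[1, 1], [0, 1]]))
```

For M=[1,1;0,1] this gives IJA = [4,5,5,2] and A = [1,1,X,1], so the number of off-diagonal entries in row I is IJA(I+1) - IJA(I).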

Related

Pattern Recognition Algorithm/Technique

Background
I apologize for the music-based question, but the details don't really mean all that much. I'm sequentially going through a midi file and I'm looking for an efficient way to find a pattern in the data to find something called a tuplet. See image below:
The tuplets have the numbers (3 or 6) over top of them. I need to know at which position they begin in the data file. The numbers below the notes are the values you would see sequentially in the data file. Just in case you can't decipher the data below, here it is:
1, 2, 2.3333, 2.6666, 3, 3.5, 3.6666, 3.83333, 4, 4.1666, 4.3333, 4.5, 4.6666, 4.8333,
5, 6.3333, 6.6666, 7.1666, 7.3333, 7.5, 7.6666, 7.8333, 8, 8.1666, 8.333, 8.5, 8.6666.
The first tuplet begins at position 2 and the difference between the position of notes is 0.3333 (repeating)
The second tuplet begins at position 3.5 and the difference between the position of notes is 0.1666 (repeating)
The main issue is that, unlike in the image, position 7 will not appear in the data file, because the file only lists note locations. The icon that you see at that location is called a rest, which is not notated in the data file.
Question
How can I find an efficient method to find the start of each tuplet? Is there some sort of recursive method?
I don't think you need any recursion for this.
The normal note values can only be represented by fractions of the beat of the type a / 2^b. The tuplets can be arbitrary fractions, but mostly I've seen something like triplets, quintuplets or (in your case) sextuplets.
So the simplest way would be to compute the length of every note (maybe the time difference between two MIDI events? Or the length is stored explicitly in MIDI? I'm not that familiar with the format) and compute the rational representation of this length.
Every group of notes with a denominator that is not a power of two belongs to such a tuplet. To group the notes together, I would recommend the following approach (assuming that all notes of a tuplet have the same value):
Factorize the denominator into a power of two a and the rest b (e.g. a * b = 4 * 5)
Initialize an empty tuplet of size b
For every note compute the distance to the beginning of the tuplet and store the note at the corresponding position, inserting rests if necessary. The length of the tuplet can be computed by taking the minimum length l of all notes in the tuplet, so greedily adding them until the end of these notes exceeds a distance of l * b from the beginning of the tuplet
This way, you base the tuplet on the minimum note length and add all notes that fit into it.
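The first half of this idea (spotting note lengths whose denominator is not a power of two) can be sketched as follows. It assumes the note length is simply the gap to the next onset and that the rounded decimals from the question are close enough to be snapped to a small fraction; `tuplet_starts` and `max_denominator` are names made up for this sketch.

```python
from fractions import Fraction


def is_power_of_two(n):
    return n > 0 and n & (n - 1) == 0


def tuplet_starts(onsets, max_denominator=16):
    """Return (onset, interval) for the first note of each run of non-dyadic intervals."""
    starts = []
    previous = None
    for here, nxt in zip(onsets, onsets[1:]):
        # Snap the rounded decimal gap to the nearest small fraction of a beat.
        interval = Fraction(nxt - here).limit_denominator(max_denominator)
        if is_power_of_two(interval.denominator):
            previous = None                    # ordinary note value, run ends
        elif interval != previous:
            starts.append((here, interval))    # a new run of tuplet-sized gaps begins
            previous = interval
    return starts


onsets = [1, 2, 2.3333, 2.6666, 3, 3.5, 3.6666, 3.8333, 4]
print(tuplet_starts(onsets))
```

On this excerpt it reports runs starting at positions 2 (gap 1/3) and 3.5 (gap 1/6). It does not handle rests inside a tuplet; for those, the grouping step described above, which places notes relative to the start of the tuplet, is still needed.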

How to save output of for loop operation in matlab

I have a matrix A which has a size of 54x100. For some specific condition I perform an operation on each row of A. I need to save the output of this for loop. I've tried the following but it did not work.
S=zeros(54,100);
for i=1:54;
Ri=A(i,:);
answer=mean(reshape(Ri,5,20),1);
S(i)=answer;
end
Firstly, judging by your question I'd recommend some basic Matlab tutorials like this or just detailed documentation like this.
To actually help you with your issue though; you can do this:
%% Make up A (since I don't know what it actually is)
n = 54; m = 100;
A = randn(n,m); % N x m matrix of random numbers
%% Loop over each row of A
S = cell(n,1);
for j = 1:n;
Rj = A(j,:); % j'th row
answer = mean(reshape(Rj,5,20),1); % some operation
S{j} = answer; % store the answer in cell S
end
The problem was that your answer was not a single number (1x1 matrix) but a vector and so you got a dimension mismatch error. Above I'm putting the answers into a cell object of size n. The result of your operation on j'th row can then be retrieved by calling S{j}.
Also:
Do not use i as an iterator since it also represents the imaginary unit.
Do not hard-code values but reference the existing ones. For example here I referenced n in the for-loop declaration as opposed to just writing for j = 1:54 because otherwise, if I got struck by a fancy to use my code for a 53x100 array it would not work anymore.
When you post your code I recommend adding a minimal working example - a piece of code which people can just copy and paste into their Matlab (or whatever interpreter of whatever language) and run to reproduce your problem. Here you have not included anything which tells the code what A is, for example.
This is quite a good read in general and should help you in the future

How to fuzzily search for a dictionary word?

I have read a lot of threads here discussing edit-distance based fuzzy-searches, which tools like Elasticsearch/Lucene provide out of the box, but my problem is a bit different. Suppose I have a dictionary of words, {'cat', 'cot', 'catalyst'}, and a character similarity relation f(x, y)
f(x, y) = 1, if characters x and y are similar
= 0, otherwise
(These "similarities" can be specified by the programmer)
such that, say,
f('t', 'l') = 1
f('a', 'o') = 1
f('f', 't') = 1
but,
f('a', 'z') = 0
etc.
Now if we have a query 'cofatyst', the algorithm should report the following matches:
('cot', 0)
('cat', 0)
('catalyst', 0)
where the number is the 0-based starting index of the match found. I have tried the Aho-Corasick algorithm, and while it works great for exact matching and when each character has relatively few "similar" characters, its performance drops exponentially as we increase the number of similar characters for a character. Can anyone point me to a better way of doing this? Fuzziness is an absolute necessity, and it must take into account character similarities (i.e., not blindly depend on just edit-distances).
One thing to note is that in the wild, the dictionary is going to be really large.
I might try to use the cosine similarity using the position of each character as a feature and mapping the product between features using a match function based on your character relations.
Not very specific advice, I know, but I hope it helps you.
edited: Expanded answer.
With the cosine similarity, you will compute how similar two vectors are. In your case the normalisation might not make sense. So, what I would do is something very simple (I might be oversimplifying the problem): First, see the matrix of CxC as a dependency matrix with the probability that two characters are related (e.g., P('t' | 'l') = 1). This will also allow you to have partial dependencies to differentiate between perfect and partial matches. After this I would compute, for each position, the probability that the letters from the two words are not the same (using the complement of P(t_i, t_j)), and then you can just aggregate the results using a sum.
It will count the number of terms that are different for a specific pair of words, and it allows you to define partial dependencies. Furthermore, the implementation is very simple and should scale well. This is why I am not sure if I misunderstood your question.
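A minimal sketch of that position-wise score, using the 0/1 similarity relation from the question and treating it as symmetric (an assumption). The names `SIMILAR`, `similarity`, `mismatch_score` and `fuzzy_matches` are made up for this sketch, and the scan is brute force, so it only illustrates the scoring and not how to make it fast on a large dictionary.

```python
# Pairs of characters declared similar in the question, treated as symmetric here.
SIMILAR = {frozenset("tl"), frozenset("ao"), frozenset("ft")}


def similarity(x, y):
    """1.0 for identical or similar characters, 0.0 otherwise."""
    return 1.0 if x == y or frozenset((x, y)) in SIMILAR else 0.0


def mismatch_score(query, word, offset):
    """Sum of (1 - similarity) over the positions of word aligned at offset."""
    return sum(1.0 - similarity(query[offset + k], ch) for k, ch in enumerate(word))


def fuzzy_matches(query, dictionary, max_score=0.0):
    """Return (word, offset) for every alignment whose score is at most max_score."""
    hits = []
    for word in dictionary:
        for offset in range(len(query) - len(word) + 1):
            if mismatch_score(query, word, offset) <= max_score:
                hits.append((word, offset))
    return hits


print(fuzzy_matches("cofatyst", ["cot", "cat", "catalyst"]))
```

Replacing the 0/1 values with probabilities between 0 and 1 and raising `max_score` would give the partial matches mentioned above.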
I am using the Fuse JavaScript library for a project of mine. It is a JavaScript file which works on a JSON dataset. It is quite fast. Have a look at it.
It implements the full Bitap algorithm, leveraging a modified version of Google's Diff, Match & Patch tool (according to its site).
The code is simple enough that the algorithm's implementation is easy to follow.

I need help optimizing this compression algorithm I came up with on my own

I tried coming up with a compression algorithm. I know a little bit about compression theory and so am aware that this scheme that I have come up with could very well never achieve compression at all.
Currently it works only for a string with no consecutive repeating letters/digits/symbols. Once properly established I hope to extrapolate it to binary data etc. But first the algorithm:
Assuming there are only 4 letters: a,b,c,d; we create a matrix/array corresponding to the letters. Whenever a letter is encountered, the corresponding entry is increased so that the entry of the last letter encountered is always the largest. If the entry was originally zero, we set it to 2+(the largest element in the matrix). If it was not originally zero, we increase it by 2+(the largest element in the matrix, which becomes the second largest after the update). An example to clarify:
Array = [a,b,c,d]
Initial state = [0,0,0,0]
Letter = a
New state = [2,0,0,0]
Letter = b
New state = [2,4,0,0]
.
.c
.d
.
New state = [2,4,6,8]
Letter = a
New state = [12,4,6,8]
//Explanation for the above state: 12 because Largest - Second Largest - 2 = Old value
Letter = d
New state = [12,4,6,22]
and so on...
Decompression is just this logic in reverse.
A rudimentary implementation of compression (in python):
(This function is very rudimentary so not the best kind of code...I know. I can optimize it once I get the core algorithm correct.)
def compress(text):
    # One counter per printable ASCII character (codes 32..126).
    matrix = [0] * 95
    for ch in text:
        idx = ord(ch) - 32
        largest = max(matrix)  # largest counter before this update
        if matrix[idx] == 0:
            matrix[idx] = largest + 2
        else:
            matrix[idx] = largest + matrix[idx] + 2
    return matrix
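For completeness, the reverse step could look roughly like this. It is a sketch under the stated restriction (no consecutive repeated characters) and relies on the relation Largest - Second Largest - 2 = Old value from the example; the `alphabet` parameter is an assumption added so the 4-letter example can be run.

```python
def decompress(matrix, alphabet=None):
    """Undo the counter updates one at a time, newest letter first."""
    if alphabet is None:
        alphabet = [chr(32 + i) for i in range(95)]  # printable ASCII, as in compress
    state = list(matrix)
    letters = []
    while any(state):
        # The most recently seen letter always holds the largest counter.
        idx = max(range(len(state)), key=state.__getitem__)
        largest = state[idx]
        state[idx] = 0
        second = max(state)               # largest counter before that update
        old = largest - second - 2        # value the counter had before the update
        if old < 0:
            raise ValueError("state was not produced by compress")
        state[idx] = old
        letters.append(alphabet[idx])
    return "".join(reversed(letters))


print(decompress([12, 4, 6, 22], alphabet="abcd"))
```

For the state [12,4,6,22] from the example this walks back through d, a, d, c, b, a and returns "abcdad".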
The returned matrix is then used for decompression. Now comes the tricky part:
I can't really call this compression at all because each number in the matrix generated by the function is of the order of 10**200 for a string of length 50000. So storing the matrix actually takes more space than storing the original string. I know...totally useless. But I had hoped prior to doing all this that I can use the mathematical properties of a matrix to effectively represent it in some kind of mathematical shorthand. I have tried many possibilities and failed. Some things that I tried:
Rank of the matrix. Failed because not unique.
Denote using the mod function. Failed because either the quotient or the remainder
Store each integer as a generator using pickle.
Store the matrix as a bitmap file but then the integers are too large to be able to store as color codes.
Let me reiterate that the algorithm could be optimized, e.g. instead of adding 2 we could add 1 and proceed. But that doesn't really result in any compression. Same for the code. Minor optimizations later...first I want to improve the main algorithm.
Furthermore, it is very likely that this product of a mediocre and idle mind like myself could never be able to achieve compression after all. In which case, I would then like your help and ideas on what this could probably be useful in.
TL;DR: Check coded parts which depict a compression algorithm. The compressed result is longer than the original string. Can this be fixed? If yes, how?
PS: I have the entire code on my PC. Will create a repo on github and upload in some time.
Compression is essentially a predictive process. Look for patterns in the input and use them to encode the more likely next character(s) more efficiently than the less likely. I can't see anything in your algorithm that tries to build a predictive model.

Best fit for the intersection of multiple lines

I'm trying to solve the following problem:
I'm analyzing an image and I obtain from this analysis a set of segments
I want to know the intersection of these lines (best fit)
I'm using for this opencv's function cvSolve. For reasonably good input everything works fine.
The problem that I have comes from the fact that when I have just a single bad segment as input the result is different from the one expected.
Details:
The upper left image shows the "lonely" purple lines influencing the result (all lines are used as input).
The upper right image shows how a single purple line (one removed) can influence the result.
The lower left image shows what we want - the intersection of lines as expected (both purple lines eliminated).
The lower right image shows how the other purple line (the other is removed) can influence the result.
As you can see, just two lines make the result completely different from the expected one. Any ideas on how to avoid this are appreciated.
Thanks,
Iulian
The algorithm you are using finds, as described in the link, the least square error solution to the problem. This means that if there are more intersection points, the result will be an average (for a reasonable definition of average) of the real solutions.
I would try an iterative solution: if the error of the first solution is too large, remove from the set of segments the one farthest from the solution, and iterate until the error is acceptably small. This should remove one of the many intersection points, and converge on the one with most lines nearby.
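A minimal sketch of that loop in plain Python, solving the 2x2 normal equations directly instead of calling cvSolve. The function names and the `tolerance` parameter (maximum accepted point-to-line distance, in the same units as the input) are assumptions for illustration.

```python
import math


def line_to_normal_form(p, q):
    """Return (nx, ny, c) with unit normal (nx, ny) so that nx*x + ny*y = c on the line."""
    (x1, y1), (x2, y2) = p, q
    nx, ny = -(y2 - y1), (x2 - x1)
    length = math.hypot(nx, ny)
    nx, ny = nx / length, ny / length
    return nx, ny, nx * x1 + ny * y1


def least_squares_point(lines):
    """Point minimising the sum of squared distances to the lines (normal equations)."""
    sxx = sxy = syy = bx = by = 0.0
    for nx, ny, c in lines:
        sxx += nx * nx
        sxy += nx * ny
        syy += ny * ny
        bx += nx * c
        by += ny * c
    det = sxx * syy - sxy * sxy   # zero only if all lines are parallel
    return (bx * syy - by * sxy) / det, (sxx * by - sxy * bx) / det


def robust_intersection(segments, tolerance=1.0):
    """Drop the farthest line until every remaining line passes within tolerance."""
    lines = [line_to_normal_form(p, q) for p, q in segments]
    while True:
        x, y = least_squares_point(lines)
        distances = [abs(nx * x + ny * y - c) for nx, ny, c in lines]
        worst = max(range(len(lines)), key=distances.__getitem__)
        if distances[worst] <= tolerance or len(lines) <= 2:
            return (x, y), len(lines)
        del lines[worst]


segments = [((0, 10), (5, 10)), ((10, 0), (10, 5)), ((0, 0), (5, 5)),
            ((0, 20), (5, 15)), ((0, 30), (5, 30))]   # the last segment is an outlier
print(robust_intersection(segments))
```

Here the stray line y=30 is dropped in the first pass and the remaining four lines meet at (10, 10). The greedy removal can pick the wrong line when an outlier drags the first estimate far enough that a good line ends up farthest away.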
A general answer to this kind of problem is the RANSAC algorithm (question dealing with this); however, it has a few disadvantages: for example, you need to estimate things like "the expected number of outliers" beforehand. Another problem I see with your sample is that removing the two green lines also results in a pretty good fit, so that might be a more general problem.
You can solve this using SVD. Suppose line1 = (x1,y1)-(x2,y2) and line2 = (x2,y2)-(x3,y3);
let Ax = b where;
A = [-(y2-y1) (x2-x1);
-(y3-y2) (x3-x2);
.................
.................] -->(nx2)
x = transpose[s t] -->(2x1)
b = [-(y2-y1)x1 + (x2-x1)y1 ;
-(y3-y2)x2 + (x3-x2)y2 ;
........................
........................] --> (nx1)
Example; Matlab Code
line1=[0,10;5,10]
line2=[10,0;10,5]
line3=[0,0;5,5]
A=[-(line1(2,2)-line1(1,2)),(line1(2,1)-line1(1,1));
-(line2(2,2)-line2(1,2)),(line2(2,1)-line2(1,1));
-(line3(2,2)-line3(1,2)),(line3(2,1)-line3(1,1))];
b=[(line1(1,1)*A(1,1))+ (line1(1,2)*A(1,2));
(line2(1,1)*A(2,1))+ (line2(1,2)*A(2,2));
(line3(1,1)*A(3,1))+ (line3(1,2)*A(3,2))];
[U D V] = svd(A)
bprime = U'*b
y=[bprime(1)/D(1,1);bprime(2)/D(2,2)]
x=V*y
