I need to write code to calculate the cumulative product of a matrix.
For example, if
A = ( 1 2 3 | 4 3 2 )
then
cum.prod(A) = ( 1 2 6 | 4 24 144 )
Is there any good algorithm for doing this?
I'll use R, C, Matlab or Octave.
A <- matrix(c(1,2,3,4,3,2),byrow=TRUE,nrow=2)
I'm guessing you want entry (i,j) to be the product of all A[k,l] with k <= i and l <= j ... ?
B <- A
nr <- nrow(B)
nc <- ncol(B)
for (i in 1:nr) B[i, ] <- cumprod(B[i, ])  # cumulative product along each row
for (j in 1:nc) B[, j] <- cumprod(B[, j])  # then down each column
This works for your example, and since the row pass and the column pass are separate loops, it also handles matrices with more rows than columns.
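Equivalently, the two passes can be collapsed into nested apply calls; this is just a sketch of the same row-and-column definition, not from the original answer:
B <- t(apply(apply(A, 2, cumprod), 1, cumprod))  # cumprod down each column, then along each row
B                                                # 1 2 6 / 4 24 144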
I've been given a task in which the user enters some unit relations and we have to sort them from largest to smallest.
What is the best algorithm to do that?
I put some input/output pairs to clarify the problem:
Input:
km = 1000 m
m = 100 cm
cm = 10 mm
Output:
1km = 1000m = 100000cm = 1000000mm
Input:
km = 100000 cm
km = 1000000 mm
m = 1000 mm
Output:
1km = 1000m = 100000cm = 1000000mm
Input:
B = 8 b
MiB = 1024 KiB
KiB = 1024 B
Mib = 1048576 b
Mib = 1024 Kib
Output:
1MiB = 8Mib = 1024KiB = 8192Kib = 1048576B = 8388608b
Input:
B = 8 b
MiB = 1048576 B
MiB = 1024 KiB
MiB = 8192 Kib
MiB = 8 Mib
Output:
1MiB = 8Mib = 1024KiB = 8192Kib = 1048576B = 8388608b
How can I generate the output from the given input?
My attempt at a graph-based solution. Example 3 is the most interesting, so I'll take that one (multiple steps and multiple sinks); a code sketch follows the steps.
1. Transform B = n A into an edge A -> B labelled n, with n > 1. If the result is not a connected DAG, the input is inconsistent.
2. Reduce towards a bipartite graph by making chains I -> J -> K skip to I -> K, multiplying the label of I -> J by that of J -> K. Any conflicting labels along the way mean the input is inconsistent.
3. The idea of this step is to leave only one single greatest value. Take a left vertex P with degree greater than 1 and right vertices { Q, R }, where P -> Q is labelled n1 and P -> R is labelled n2, with 1 < n1 < n2 (WLOG). Transform them into P -> R (unchanged) and Q -> R labelled n2 / n1, which brings Q (Mib in this case) from the right set to the left.
4. Is the graph bipartite with a single right node? If not, go to step 2.
5. Sort the edges.
6. X -> Z with n1, ..., Y -> Z with n2 becomes 1 Z = n1 X = ... = n2 Y.
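Rather than contracting edges explicitly, a rough R sketch of the same propagation idea can assign every unit a value relative to one reference and then rescale so the biggest unit is 1. The relation table and all names below are mine, not from the answer, and it assumes a consistent, connected input:
rel <- data.frame(big = c("km", "m", "cm"), small = c("m", "cm", "mm"),
                  n   = c(1000, 100, 10))           # each row encodes "1 big = n small"
units <- unique(c(rel$big, rel$small))
val <- setNames(rep(NA_real_, length(units)), units)
val[units[1]] <- 1                                  # arbitrary reference unit
repeat {                                            # relax every edge until nothing changes
  changed <- FALSE
  for (i in seq_len(nrow(rel))) {
    b <- rel$big[i]; s <- rel$small[i]
    if (!is.na(val[b]) && is.na(val[s])) { val[s] <- val[b] * rel$n[i]; changed <- TRUE }
    if (!is.na(val[s]) && is.na(val[b])) { val[b] <- val[s] / rel$n[i]; changed <- TRUE }
  }
  if (!changed) break
}
sort(val / min(val))                                # km = 1, m = 1000, cm = 1e+05, mm = 1e+06
A full implementation would also verify consistency: whenever both endpoints of a relation are already known, their ratio has to match the relation's label.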
You can use the following algorithm:
1. detect all existing units: `n` units
2. create an `n x n` matrix `M` whose rows and columns both correspond to
   the units, in the same order. Set all elements of the main diagonal
   to `1`.
3. put each value specified in the input into the corresponding row and
   column.
4. put zero at the transposed position of every entry set in step 3.
5. put `-1` for all other elements
Now, based on `M` you can easily find the biggest unit:
6. candidate_max <-- the columns with only one non-zero positive element
   not_max <-- []
   while len(candidate_max) > 1:
       a. take a pair <i, l> from candidate_max and find a column h such
          that both (i, h) and (l, h) are known, i.e., strictly positive.
          If M[i, h] > M[l, h]:
              remove_item <-- l
          Else:
              remove_item <-- i
          candidate_max.remove(remove_item)
          not_max.append(remove_item)
       b. if no such pair exists, take i from candidate_max and l from
          not_max with the same property.
          If M[i, h] < M[l, h]:
              candidate_max.remove(i)
              not_max.append(i)
   biggest_unit <-- the only element of candidate_max
Once you have found the biggest unit, you can order the others by their values in the corresponding row of `biggest_unit`.
7. while there is a `-1` value in row `biggest_unit`, at some column `j`,
   i.e., at `(biggest_unit, j)`:
   a. find a non-identity, strictly positive element in column `j` at
      row `k`, or in row `j` at column `k`, i.e., at `(k, j)` or `(j, k)`,
      such that `(biggest_unit, k)` is strictly positive and non-identity.
      Then calculate the missing value from the two known equivalences.
   b. if there is no such row, continue the loop with another `-1` entry.
8. sort the units by their value in the `biggest_unit` row, in ascending
   order.
The time complexity of the algorithm is Theta(n^2), where n is the number of units (if you implement the loop in step 6 wisely!).
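Before the worked examples, here is a minimal R sketch of the matrix part of this scheme. For brevity it fills the `-1` entries with a single Floyd-Warshall-style pass (so it is O(n^3), not the Theta(n^2) bookkeeping described above), and the helper and its names are mine:
units <- c("km", "m", "cm", "mm")
n <- length(units)
M <- matrix(-1, n, n, dimnames = list(units, units))  # -1 marks "unknown"
diag(M) <- 1
set_rel <- function(M, from, to, k) {                 # encode "1 from = k to"
  M[from, to] <- k; M[to, from] <- 0; M
}
M <- set_rel(M, "km", "m", 1000)
M <- set_rel(M, "m", "cm", 100)
M <- set_rel(M, "cm", "mm", 10)
for (k in seq_len(n)) for (i in seq_len(n)) for (j in seq_len(n))
  if (M[i, j] < 0 && M[i, k] > 1 && M[k, j] > 1)
    M[i, j] <- M[i, k] * M[k, j]                      # combine two known equivalences
biggest <- which.max(apply(M, 1, max))                # the row holding the largest factor
sort(M[biggest, ])                                    # km = 1, m = 1000, cm = 1e+05, mm = 1e+06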
Example
Input 1
km = 1000 m
m = 100 cm
cm = 10 mm
Solution:
     km     m    cm    mm
km    1  1000    -1    -1
m     0     1   100    -1
cm   -1     0     1    10
mm   -1    -1     0     1

M = [  1  1000    -1    -1
       0     1   100    -1
      -1     0     1    10
      -1    -1     0     1 ]
===> 6. `biggest_unit` <-- km (column 1)
7.1 Find the first `-1` in the first row, at column 3: (1,3).
    Find a strictly positive value in row 2 such that (1,2) is strictly
    positive and non-identity. So, the missing value of `(1,3)` must be
    `1000 * 100 = 100000`.
7.2 Find the second `-1` in the first row, at column 4: (1,4).
    Find a strictly positive value in row 3 such that (1,3) is strictly
    positive and non-identity. So, the missing value of `(1,4)` must be
    `100000 * 10 = 1000000`.
The loop is finished here and we have:
M = [  1  1000  100000  1000000
       0     1     100       -1
      -1     0       1       10
      -1    -1       0        1 ]
Now you can sort the elements of the first row in ascending order.
Input 2
km = 100000 cm
km = 1000000 mm
m = 1000 mm
Solution:
     km     m      cm       mm
km    1    -1  100000  1000000
m    -1     1      -1     1000
cm    0    -1       1       -1
mm    0     0      -1        1

M = [  1    -1  100000  1000000
      -1     1      -1     1000
       0    -1       1       -1
       0     0      -1        1 ]
===>
6.1 candidate_max = [1, 2]
6.2 Compare them on column 4 and remove 2:
    biggest_unit <-- km (column 1)
Going forward with step 7:
    Find the first `-1` in the first row, at column 2: (1,2).
    Find a strictly positive, non-identity value in row 2, at (2,4), such
    that (1,4) is also known.
    So, the missing value of `(1,2)` must be `1000000 / 1000 = 1000`.
In sum, we have:
M = [  1  1000  100000  1000000
      -1     1      -1     1000
       0    -1       1       -1
       0     0      -1        1 ]
Now you can sort the elements of the first row in ascending order (step 8).
There are two vectors:
a = 1:5;
b = 1:2;
In order to find all combinations of these two vectors, I am using the following piece of code:
[A,B] = meshgrid(a,b);
C = cat(2,A',B');
D = reshape(C,[],2);
The result includes all the combinations:
D =
1 1
2 1
3 1
4 1
5 1
1 2
2 2
3 2
4 2
5 2
Now the questions:
1. I want to reduce the number of operations to improve performance for larger vectors. Is there a single MATLAB function that does this?
2. When there are more than two vectors, meshgrid cannot be used and would have to be replaced with for loops. What is a better solution?
For greater than 2 dimensions, use ndgrid:
>> a = 1:2; b = 1:3; c = 1:2;
>> [A,B,C] = ndgrid(a,b,c);
>> D = [A(:) B(:) C(:)]
D =
1 1 1
2 1 1
1 2 1
2 2 1
1 3 1
2 3 1
1 1 2
2 1 2
1 2 2
2 2 2
1 3 2
2 3 2
Note that ndgrid expects (rows,cols,...) rather than (x,y).
This can be generalized to N dimensions (see here and here):
params = {a,b,c};
vecs = cell(numel(params),1);
[vecs{:}] = ndgrid(params{:});
D = reshape(cat(numel(vecs)+1,vecs{:}),[],numel(vecs));
Also, as described in Robert P.'s answer and here too, kron can be useful for replicating values (indexes) in this way.
If you have the neural network toolbox, also have a look at combvec, as demonstrated here.
One way would be to combine repmat and the Kronecker tensor product like this:
[repmat(a,size(b)); kron(b,ones(size(a)))]'
ans =
1 1
2 1
3 1
4 1
5 1
1 2
2 2
3 2
4 2
5 2
This can be scaled to more dimensions this way:
a = 1:3;
b = 1:3;
c = 1:3;
x = [repmat(a,1,numel(b)*numel(c)); ...
repmat(kron(b,ones(1,numel(a))),1,numel(c)); ...
kron(c,ones(1,numel(a)*numel(b)))]'
There is a logic to it! First: simply repeat the first vector. Second: take the Kronecker product with the length of the first vector and repeat it. Third: take the Kronecker product with the length of (first x second) and repeat (in this case there is no fourth vector, so no repeat).
I was wondering what the best way is to avoid row-wise processing in R, so that most of the work is done in internal C routines. For example, I have a data frame a:
  chromosome_name start_position end_position strand
1              15       35574797     35575181      1
2              15       35590448     35591641     -1
3              15       35688422     35688645      1
4              13       75402690     75404217      1
5              15       35692892     35693969      1
What I want is: depending on whether strand is positive or negative, set startOFgene to start_position or end_position. One way to avoid a for loop would be to split the data.frame into +1 strand and -1 strand subsets and perform the selection on each. What other ways are there to speed this up? That method does not scale if there is other complicated per-row processing.
Maybe this is fast enough...
transform(a, startOFgene = ifelse(strand == 1, start_position, end_position))
  chromosome_name start_position end_position strand startOFgene
1              15       35574797     35575181      1    35574797
2              15       35590448     35591641     -1    35591641
3              15       35688422     35688645      1    35688422
4              13       75402690     75404217      1    75402690
5              15       35692892     35693969      1    35692892
First, since all your columns are integer/numeric, you could use a matrix instead of a data.frame. Many operations on a matrix are a lot faster than the same operation on a data.frame, even though they're not very different in this case. Then you can use logical subsetting to create the startOFgene column.
# Create some large-ish data
M <- do.call(rbind,replicate(1e3,as.matrix(a),simplify=FALSE))
M <- do.call(rbind,replicate(1e3,M,simplify=FALSE))
A <- as.data.frame(M)
# Create startOFgene column in a matrix
m <- function() {
M <- cbind(M, startOFgene=M[,"start_position"])
negStrand <- sign(M[,"strand"]) < 0
M[negStrand,"startOFgene"] <- M[negStrand,"end_position"]
}
# Create startOFgene column in a data.frame
d <- function() {
A$startOFgene <- A$start_position
negStrand <- sign(A$strand) < 0
A$startOFgene[negStrand] <- A$end_position[negStrand]
}
library(rbenchmark)
benchmark(m(), d(), replications=10)[,1:6]
# test replications elapsed relative user.self sys.self
# 2 d() 10 18.804 1.000 16.501 2.224
# 1 m() 10 19.713 1.048 16.457 3.152
I'm a beginner with R, so I'm having trouble thinking of things the "R way"...
I have this function:
upOneRow <- function(table, column) {
for (i in 1:(nrow(table) - 1)) {
table[i, column] = table[i + 1, column]
}
return(table)
}
It seems simple enough, and shouldn't take that long to run, but on a dataframe with ~300k rows, the time it takes to run is unreasonable. What is the right way to approach this?
Instead of the loop you could try something like this:
n <- nrow(table)
table[(1:(n-1)), column] <- table[(2:n), column];
To vectorize is the key.
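A quick check of the vectorized shift on a toy column (made-up data, just to show the effect):
tbl <- data.frame(x = 1:5)
n <- nrow(tbl)
tbl[1:(n - 1), "x"] <- tbl[2:n, "x"]   # shift the column up by one row
tbl$x                                  # 2 3 4 5 5 -- last value repeats, like the loop version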
Simple answer: Columns in a data.frame are also vectors which can be indexed with [,]
my.table <- data.frame(x = 1:5, y = 5:1)
> my.table
x y
1 1 5
2 2 4
3 3 3
4 4 2
5 5 1
my.table$y <-c(my.table[-1,"y"],NA) #move up one spot and pad with NA
> my.table
x y
1 1 4
2 2 3
3 3 2
4 4 1
5 5 NA
Now, your function repeats the last data point at the end. If this is really what you want, pad with tail(x, 1) instead of NA (resetting y first, since we just overwrote it):
my.table$y <- 5:1                                        # reset the example column
my.table$y <- c(my.table[-1, "y"], tail(my.table$y, 1))  # pad with tail(x,1)
> my.table
x y
1 1 4
2 2 3
3 3 2
4 4 1
5 5 1
If I understand you right, you're trying to "move up" one column of a data frame, with the first element going to the bottom. That can be achieved as:
col <- table[, column]
table[, column] <- col[c(2:nrow(table), 1)]
For example, suppose users rank chocolate, ice cream, donut, ..., in order of preference.
If user 1 chooses
A B C D E F G H I J
and user 2 chooses
J A B C I G F E D H
what are some good ways to calculate a score from 0 to 100 that tells how close their choices are? It has to make sense: if most answers are the same and only 1 or 2 differ, the score cannot be extremely low. And if most answers are merely shifted by one position, we cannot count them as "all different" and give a 0 score for differences of only one position.
Assign each letter item an integer value starting at 1:
A=1, B=2, C=3, D=4, E=5, F=6 (stopping at F for simplicity)
Then consider the position each item is placed in and use that as a multiplier:
if an item is in first place its multiplier is 1; if it's in 6th place the multiplier is 6.
Figure out the maximum score you could have (reached when everything is in consecutive order):
item   a  b  c  d  e  f
order  1  2  3  4  5  6
value  1  2  3  4  5  6
score  1  4  9 16 25 36   Sum = 91, Score = 100% (MAX)

item   a  b  d  c  e  f
order  1  2  3  4  5  6
value  1  2  4  3  5  6
score  1  4 12 12 25 36   Sum = 90, Score = 99%

=======================

order  1  2  3  4  5  6
item   f  d  b  c  e  a
value  6  4  2  3  5  1
score  6  8  6 12 25  6   Sum = 63, Score = 69%

order  1  2  3  4  5  6
item   d  f  b  c  e  a
value  4  6  2  3  5  1
score  4 12  6 12 25  6   Sum = 65, Score = 71%
Obviously this is a very crude scheme that I just came up with, and it may not work for everything. Examples 3 and 4 differ by a single swap, yet their scores differ by 2% (versus examples 1 and 2, which differ by 1%). It's just a thought; I'm no algorithm expert. You could probably take the final number and transform it further for a better numerical comparison.
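For what it's worth, a quick R version of this scheme, assuming the six-item example above (the function name is mine):
ranking_score <- function(ref, perm) {
  value <- match(perm, ref)                  # each item's value from the reference order
  100 * sum(value * seq_along(perm)) / sum(seq_along(ref)^2)
}
ranking_score(letters[1:6], c("a","b","d","c","e","f"))  # ~99, example 2
ranking_score(letters[1:6], c("f","d","b","c","e","a"))  # ~69, example 3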
You could:
1. calculate the edit distance between the sequences;
2. subtract the edit distance from the sequence length;
3. divide that by the sequence length;
4. multiply by one hundred.

Score = 100 * (SequenceLength - Levenshtein(Sequence1, Sequence2)) / SequenceLength

Edit distance is basically the number of operations required to transform one sequence into the other. A suitable algorithm is the Levenshtein distance algorithm.
Examples:
Weights
insert: 1
delete: 1
substitute: 1
Seq 1: ABCDEFGHIJ
Seq 2: JABCIGFEDH
Score = 100 * (10-7) / 10 = 30
Seq 1: ABCDEFGHIJ
Seq 2: ABDCFGHIEJ
Score = 100 * (10-3) / 10 = 70
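In R this is essentially a one-liner, since base R ships adist() for generalized edit distance; with the unit weights above it should reproduce these scores:
seq1 <- "ABCDEFGHIJ"
seq2 <- "JABCIGFEDH"
100 * (nchar(seq1) - adist(seq1, seq2)) / nchar(seq1)   # 30, as in the first example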
The most straightforward way to calculate this is the Levenshtein distance, which is the number of changes that must be made to transform one string into another.
A disadvantage of Levenshtein distance for your task is that it doesn't measure closeness between the products themselves, i.e., you will not know how close A and J are to each other. For example, user 1 may like donuts and user 2 may like buns, and you may know that most people who like the first also like the second. From this information you can infer that user 1's choices are close to user 2's, even though they don't share the same elements.
If this is your case, you will have to use one of two approaches: statistical methods to infer correlation between choices, or recommendation engines.