R for loop to create a class variable taking forever - performance

My question comprises two parts. I have a matrix with IDs and several columns (representing time) of values from 0-180. I'd like to summarize these into sub-groups, then compare across the columns. For example, how many IDs switch from 0-10 in column 5 to 11+ in column 6?
Now, my first thought was a SAS-style format command. This would let me group integers into different blocks (0-10, 11-20, 21-30, etc.). But it seems that this doesn't exist in R.
My solution has been to loop through all values of this matrix (dual for loops) and check whether each value falls within certain ranges (a string of if statements), then enter the corresponding class into a new matrix that keeps track only of the classes. Example:
# search through columns
for (j in 2:(dim(Tab2)[2])) {
  # search through lines
  for (i in 1:dim(Tab2)[1]) {
    if (is.na(Tab2[i,j])) {
      tempGliss[i,j] <- "NA"
    } else if (Tab2[i,j]==0) {
      tempGliss[i,j] <- "Zero"
    } else if (Tab2[i,j]>0 & Tab2[i,j]<=7) {
      tempGliss[i,j] <- "1-7"
    } else if (Tab2[i,j]>=7 & Tab2[i,j]<=14) {
      tempGliss[i,j] <- "7-14"
    } else if (Tab2[i,j]>=15 & Tab2[i,j]<=30) {
      tempGliss[i,j] <- "15-30"
    } else if (Tab2[i,j]>=31 & Tab2[i,j]<=60) {
      tempGliss[i,j] <- "31-60"
    } else if (Tab2[i,j]>=61 & Tab2[i,j]<=90) {
      tempGliss[i,j] <- "61-90"
    } else if (Tab2[i,j]>=91 & Tab2[i,j]<=120) {
      tempGliss[i,j] <- "91-120"
    } else if (Tab2[i,j]>=121 & Tab2[i,j]<=150) {
      tempGliss[i,j] <- "121-150"
    } else if (Tab2[i,j]>=151 & Tab2[i,j]<=180) {
      tempGliss[i,j] <- "151-180"
    } else if (Tab2[i,j]>180) {
      tempGliss[i,j] <- ">180"
    }
  }
}
Here Tab2 is my original matrix, and tempGliss is the new matrix of classes I'm creating. This takes a VERY LONG TIME! It doesn't help that my file is quite large. Is there any way I can speed this up? Are there alternatives to the for loops or the if statements?

Maybe you can use cut
Tab2 <- data.frame(a = 1:9, b = c(0, 7, 14, 30, 60, 90, 120, 150, 155)
,c = c(0, 1, 7, 15, 31, 61, 91, 121, 155))
repla <- c("Zero", "1-7", "7-14", "15-30", "31-60", "61-90", "91-120", "121-150", "151-180", ">180")
for (j in 2:(dim(Tab2)[2])) {
  dum <- cut(Tab2[,j], c(-Inf, 0, 7, 14, 30, 60, 90, 120, 150, 180, Inf))
  levels(dum) <- repla
  Tab2[,j] <- dum
}
> Tab2
a b c
1 1 Zero Zero
2 2 1-7 1-7
3 3 7-14 1-7
4 4 15-30 15-30
5 5 31-60 31-60
6 6 61-90 61-90
7 7 91-120 91-120
8 8 121-150 121-150
9 9 151-180 151-180
I haven't looked at it too closely, but you may need to adjust the bands slightly.
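To get at the second part of the question (how many IDs move from one band to another between two time columns), you can cross-tabulate the binned columns. A minimal sketch using the toy Tab2 from above, where columns b and c stand in for two time points:
# Rows = band in column b, columns = band in column c;
# off-diagonal cells count IDs that changed band between the two columns.
with(Tab2, table(b, c))
# Total number of IDs whose band changed from b to c
sum(as.character(Tab2$b) != as.character(Tab2$c), na.rm = TRUE)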

Related

Divide n into x random parts

What I need to achieve is basically x dice rolls = n sum but backwards.
So let's create an example:
The die has to be rolled 5 times (min. sum 5, max. sum 30), which means:
x = 5
Let's say in this case the sum that was rolled is 23 which means:
n = 23
So what I need is to get any of the possible single dice-roll combinations (e.g. 6, 4, 5, 3, 5).
What I could make up in my mind so far is:
Create 5 random numbers.
Add them up and get the sum.
Now divide every single random number by the sum and multiply by the wanted number 23.
The result is 5 random numbers that equal the wanted number 23.
The problem is that this returns arbitrary values (decimals, values below 1 and above 6) depending on the random numbers. I cannot find a way to adjust the formula so that it only returns integers >= 1 and <= 6.
If you don't need to scale it up, by far the easiest way is to re-randomize until you get the right sum. It takes milliseconds on any modern CPU. Not pretty, though.
#!/usr/local/bin/lua
math.randomseed(os.time())
function divs(n, x)
  local a = {}
  repeat
    local s = 0
    for i = 1, x do
      a[i] = math.random(6)
      s = s + a[i]
    end
  until s == n
  return a
end
a = divs(23, 5)
for k, v in pairs(a) do print(k, v) end
This was an interesting problem. Here's my take:
EDIT: I missed the fact that you needed them to be dice rolls. Here's a new take. As a bonus, you can specify the number of sides of the dice in an optional parameter.
local function getDiceRolls(n, num_rolls, num_sides)
  num_sides = num_sides or 6
  assert(n >= num_rolls, "n must be at least num_rolls")
  assert(n <= num_rolls * num_sides, "n is too big for the number of dice and sides")
  local rolls = {}
  for i = 1, num_rolls do rolls[i] = 1 end
  -- distribute the remaining n - num_rolls points over randomly chosen dice,
  -- skipping any die that is already at num_sides
  for i = num_rolls + 1, n do
    local index = math.random(1, num_rolls)
    while rolls[index] == num_sides do
      index = (index % num_rolls) + 1
    end
    rolls[index] = rolls[index] + 1
  end
  return rolls
end
-- tests:
print(unpack(getDiceRolls(21, 4))) -- 6 4 6 5
print(unpack(getDiceRolls(21, 4))) -- 5 5 6 5
print(unpack(getDiceRolls(13, 3))) -- 4 3 6
print(unpack(getDiceRolls(13, 3))) -- 5 5 3
print(unpack(getDiceRolls(30, 3, 20))) -- 9 10 11
print(unpack(getDiceRolls(7, 7))) -- 1 1 1 1 1 1 1
print(unpack(getDiceRolls(7, 8))) -- error
print(unpack(getDiceRolls(13, 2))) -- error
If the number of rolls does not change wildly, but the sum does, then it would be worth creating a lookup table of combinations keyed by their sum. You would generate every combination, compute its sum, and add the combination to the list associated with that sum. The lookup table would look like this:
T = {[12] = {{1,2,3,4,2}, {2,5,3,1,1}, {2,2,2,3,3}, ...}, [13] = ...}
Then, when you want a random combo for n = 23, you look up key 23 in the table; the list there holds every combo with that sum, so you just pick one of them at random. Same for any other number.
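For illustration, here is a rough sketch of that lookup-table idea (shown in R rather than Lua; the object names are mine):
combos <- expand.grid(rep(list(1:6), 5))   # all 6^5 = 7776 combinations of 5 dice
by_sum <- split(combos, rowSums(combos))   # lookup table keyed by sum
hits   <- by_sum[["23"]]                   # every combination that sums to 23
hits[sample(nrow(hits), 1), ]              # pick one of them at random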

Avoiding row-wise processing of data.frame in R

I was wondering what the best way is to avoid row-wise processing in R; most row-wise work can instead be handed off to internal C routines. For example, I have a data frame a:
chromosome_name start_position end_position strand
1 15 35574797 35575181 1
2 15 35590448 35591641 -1
3 15 35688422 35688645 1
4 13 75402690 75404217 1
5 15 35692892 35693969 1
What I want is a new column, startOFgene, equal to start_position when strand is positive and end_position when it is negative. One way to avoid a for loop would be to split the data.frame into +1-strand and -1-strand subsets and perform the selection on each. What other approaches are there for speeding this up? Splitting does not scale well when there is other, more complicated per-row processing.
Maybe this is fast enough...
transform(a, startOFgene = ifelse(strand == 1, start_position, end_position))
chromosome_name start_position end_position strand startOFgene
1 15 35574797 35575181 1 35574797
2 15 35590448 35591641 -1 35591641
3 15 35688422 35688645 1 35688422
4 13 75402690 75404217 1 75402690
5 15 35692892 35693969 1 35692892
First, since all your columns are integer/numeric, you could use a matrix instead of a data.frame. Many operations on a matrix are a lot faster than the same operation on a data.frame, even though they're not very different in this case. Then you can use logical subsetting to create the startOFgene column.
# Create some large-ish data
M <- do.call(rbind,replicate(1e3,as.matrix(a),simplify=FALSE))
M <- do.call(rbind,replicate(1e3,M,simplify=FALSE))
A <- as.data.frame(M)
# Create startOFgene column in a matrix
m <- function() {
  M <- cbind(M, startOFgene = M[,"start_position"])
  negStrand <- sign(M[,"strand"]) < 0
  M[negStrand, "startOFgene"] <- M[negStrand, "end_position"]
}
# Create startOFgene column in a data.frame
d <- function() {
  A$startOFgene <- A$start_position
  negStrand <- sign(A$strand) < 0
  A$startOFgene[negStrand] <- A$end_position[negStrand]
}
library(rbenchmark)
benchmark(m(), d(), replications=10)[,1:6]
# test replications elapsed relative user.self sys.self
# 2 d() 10 18.804 1.000 16.501 2.224
# 1 m() 10 19.713 1.048 16.457 3.152

Separate matrix based on cumulative sum of column criteria (MATLAB)

I have a dataset something like the following:
a = [1 11; 2 16; 3 9; 4 13; 5 8; 6 14];
I am looking to separate it into several matrices by the following criterion:
Starting with the first row, build groups of rows whose second-column values sum to between 19 and 25.
So the output would be something like this:
a1 = [1 11; 3 9]
a2 = [2 16; 5 8]
a3 = [6 14]
Where a1 sums to 20, a2 sums to 24, and a3 does not meet the criterion but holds what is left over.
Could this be done with, and output from, a FOR loop?
Edit: criteria for how to combine: I am looking to start at the beginning (first row) and add the next row to it. If the sum would then exceed 25, that row is skipped until the next iteration. Each iteration should output a separate matrix (a1, a2, a3).
I think I have some useful pseudo code for you.
For one, I would not modify the matrix by removing rows; rather, I would keep a list of used rows.
I would use the summing like this:
num_lines = size(a, 1);
used = false(1, num_lines);
groups = {};                        % each cell will hold the row indices of one output matrix
for i = 1:num_lines
    if used(i), continue, end
    curr_use = i;                   % rows in the current group
    for j = i+1:num_lines
        if used(j), continue, end
        if sum(a(curr_use, 2)) + a(j, 2) > 25, continue, end   % adding row j would push the sum past 25
        curr_use = [curr_use j];
    end
    used(curr_use) = true;
    groups{end+1} = curr_use;       % e.g. a1 = a(groups{1}, :)
end

Linear time complexity ranking algorithm when the orders are precomputed

I am trying to write an efficient ranking algorithm in C++ but I will present my case in R as it is far easier to understand this way.
> samples_x <- c(4, 10, 9, 2, NA, 3, 7, 1, NA, 8)
> samples_y <- c(5, 7, 9, NA, 1, 4, NA, 8, 2, 10)
> orders_x <- order(samples_x)
> orders_y <- order(samples_y)
> cbind(samples_x, orders_x, samples_y, orders_y)
samples_x orders_x samples_y orders_y
[1,] 4 8 5 5
[2,] 10 4 7 9
[3,] 9 6 9 6
[4,] 2 1 NA 1
[5,] NA 7 1 2
[6,] 3 10 4 8
[7,] 7 3 NA 3
[8,] 1 2 8 10
[9,] NA 5 2 4
[10,] 8 9 10 7
Suppose the above is already precomputed. A simple ranking of each sample set can then be done in linear time (the result is much like that of the rank function):
> ranks_x <- rep(0, length(samples_x))
> for (i in 1:length(samples_x)) ranks_x[orders_x[i]] <- i
For the project I am working on, it would be useful to emulate the following behaviour in linear time:
> cc <- complete.cases(samples_x, samples_y)
> ranks_x <- rank(samples_x[cc])
> ranks_y <- rank(samples_y[cc])
The complete.cases function, when given n sets of the same length, returns a logical vector marking the positions at which none of the sets contain NAs. The order function returns the permutation of indices corresponding to the sorted sample set. The rank function returns the ranks of the sample set.
How to do this? Let me know if I have provided sufficient information as to the problem in question.
More specifically, I am trying to build a correlation matrix based on Spearman's rank correlation coefficient in a way that handles NAs properly. The presence of NAs requires that the rankings be recalculated for every pairwise combination of sample sets (roughly s^2 * n log n for s sets of length n); I am trying to avoid that by calculating the orders once per sample set (s * n log n) and doing only linear work for each pairwise comparison. Is this even doable?
Thanks in advance.
It looks like, when you work out the rank correlation of two arrays, you want to delete from both arrays elements in positions where either has NA.
You have
for (i in 1:length(samples_x)) ranks_x[orders_x[i]] <- i
Could you change this to something like
wp <- 0
for (i in 1:length(samples_x)) {
  idx <- orders_x[i]
  if (is.na(samples_x[idx]) || is.na(samples_y[idx])) {
    ranks_x[idx] <- NA
  } else {
    wp <- wp + 1            # next rank among the complete cases
    ranks_x[idx] <- wp
  }
}
Then you could either go along later and compress out the NAs, or hope the correlation subroutine just ignores them.
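A quick sanity check on the sample data above (assuming no ties among the complete cases): compressing out the NAs should reproduce rank() on the complete cases.
cc <- complete.cases(samples_x, samples_y)
all(ranks_x[!is.na(ranks_x)] == rank(samples_x[cc]))   # TRUE when there are no ties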

Linked list algorithm to find pairs adding up to 10

Can you suggest an algorithm that finds all pairs of nodes in a linked list that add up to 10?
I came up with the following.
Algorithm: Compare each node, starting with the second node, against every node from the head up to the node just before it, and report all pairs that add up to 10.
I think this algorithm should work; however, it is certainly not the most efficient one, having a complexity of O(n^2).
Can anyone hint at a more efficient solution (perhaps one that takes linear time)? Additional or temporary nodes can be used by such a solution.
If their range is limited (say between -100 and 100), it's easy.
Create an array quant[-100..100], then just cycle through your linked list, executing:
quant[value] = quant[value] + 1
Then the following loop will do the trick (note that, as written, it emits each pair twice, once as (i, j) and once as (j, i); loop i only up to 5 if you want each unordered pair reported once):
for i = -100 to 100:
    j = 10 - i
    for k = 1 to quant[i] * quant[j]
        output i, " ", j
Even if their range isn't limited, you can have a more efficient method than what you proposed, by sorting the values first and then just keeping counts rather than individual values (same as the above solution).
This is achieved by running two pointers, one at the start of the list and one at the end. When the numbers at those pointers add up to 10, output them and move the end pointer down and the start pointer up.
When they're greater than 10, move the end pointer down. When they're less, move the start pointer up.
This relies on the sorted nature. Less than 10 means you need to make the sum higher (move the start pointer up). Greater than 10 means you need to make the sum less (move the end pointer down). Since there are no duplicates in the list (because of the counts), being equal to 10 means you move both pointers.
Stop when the pointers pass each other.
There's one more tricky bit and that's when the pointers are equal and the value sums to 10 (this can only happen when the value is 5, obviously).
You don't output the number of pairs based on the product of the counts; rather, it's based on the product of the counts each reduced by one. That's because a value 5 with a count of 1 can't actually form a sum of 10 (since there's only one 5).
So, for the list:
2 3 1 3 5 7 10 -1 11
you get:
Index   a   b   c   d   e   f   g   h
Value  -1   1   2   3   5   7  10  11
Count   1   1   1   2   1   1   1   1
You start pointer p1 at a and p2 at h. Since -1 + 11 = 10, you output those two numbers (as above, you do it N times, where N is the product of the counts). That's one copy of (-1,11). Then you move p1 to b and p2 to g.
1 + 10 > 10 so leave p1 at b, move p2 down to f.
1 + 7 < 10 so move p1 to c, leave p2 at f.
2 + 7 < 10 so move p1 to d, leave p2 at f.
3 + 7 = 10, output two copies of (3,7) since the count of d is 2, move p1 to e, p2 to e.
5 + 5 = 10 but p1 = p2 so the product is 0 times 0 or 0. Output nothing, move p1 to f, p2 to d.
Loop ends since p1 > p2.
Hence the overall output was:
(-1,11)
( 3, 7)
( 3, 7)
which is correct.
Here's some test code. You'll notice that I've forced 7 (the midpoint) to a specific value for testing. Obviously, you wouldn't do this.
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#define SZSRC 30
#define SZSORTED 20
#define SUM 14
int main (void) {
    int i, s, e, prod;
    int srcData[SZSRC];
    int sortedVal[SZSORTED];
    int sortedCnt[SZSORTED];

    // Make some random data.
    srand (time (0));
    for (i = 0; i < SZSRC; i++) {
        srcData[i] = rand() % SZSORTED;
        printf ("srcData[%2d] = %5d\n", i, srcData[i]);
    }

    // Convert to value/size array.
    for (i = 0; i < SZSORTED; i++) {
        sortedVal[i] = i;
        sortedCnt[i] = 0;
    }
    for (i = 0; i < SZSRC; i++)
        sortedCnt[srcData[i]]++;

    // Force 7+7 to specific count for testing.
    sortedCnt[7] = 2;
    for (i = 0; i < SZSORTED; i++)
        if (sortedCnt[i] != 0)
            printf ("Sorted [%3d], count = %3d\n", i, sortedCnt[i]);

    // Start and end pointers.
    s = 0;
    e = SZSORTED - 1;

    // Loop until they overlap.
    while (s <= e) {
        // Equal to desired value?
        if (sortedVal[s] + sortedVal[e] == SUM) {
            // Get product (note special case at midpoint).
            prod = (s == e)
                ? (sortedCnt[s] - 1) * (sortedCnt[e] - 1)
                : sortedCnt[s] * sortedCnt[e];
            // Output the right count.
            for (i = 0; i < prod; i++)
                printf ("(%3d,%3d)\n", sortedVal[s], sortedVal[e]);
            // Move both pointers and continue.
            s++;
            e--;
            continue;
        }
        // Less than desired, move start pointer.
        if (sortedVal[s] + sortedVal[e] < SUM) {
            s++;
            continue;
        }
        // Greater than desired, move end pointer.
        e--;
    }
    return 0;
}
You'll see that the code above is all O(n) since I'm not sorting in this version, just intelligently using the values as indexes.
If the minimum is below zero (or very high to the point where it would waste too much memory), you can just use a minVal to adjust the indexes (another O(n) scan to find the minimum value and then just use i-minVal instead of i for array indexes).
And even if the range from low to high is too expensive on memory, you can use a sparse array. You'll have to sort it, O(n log n), and search it when updating counts, also O(n log n), but that's still better than the original O(n^2). The reason the binary search is O(n log n) overall is that a single search is O(log n) but you have to do it for each value.
And here's the output from a test run, which shows you the various stages of calculation.
srcData[ 0] = 13
srcData[ 1] = 16
srcData[ 2] = 9
srcData[ 3] = 14
srcData[ 4] = 0
srcData[ 5] = 8
srcData[ 6] = 9
srcData[ 7] = 8
srcData[ 8] = 5
srcData[ 9] = 9
srcData[10] = 12
srcData[11] = 18
srcData[12] = 3
srcData[13] = 14
srcData[14] = 7
srcData[15] = 16
srcData[16] = 12
srcData[17] = 8
srcData[18] = 17
srcData[19] = 11
srcData[20] = 13
srcData[21] = 3
srcData[22] = 16
srcData[23] = 9
srcData[24] = 10
srcData[25] = 3
srcData[26] = 16
srcData[27] = 9
srcData[28] = 13
srcData[29] = 5
Sorted [ 0], count = 1
Sorted [ 3], count = 3
Sorted [ 5], count = 2
Sorted [ 7], count = 2
Sorted [ 8], count = 3
Sorted [ 9], count = 5
Sorted [ 10], count = 1
Sorted [ 11], count = 1
Sorted [ 12], count = 2
Sorted [ 13], count = 3
Sorted [ 14], count = 2
Sorted [ 16], count = 4
Sorted [ 17], count = 1
Sorted [ 18], count = 1
( 0, 14)
( 0, 14)
( 3, 11)
( 3, 11)
( 3, 11)
( 5, 9)
( 5, 9)
( 5, 9)
( 5, 9)
( 5, 9)
( 5, 9)
( 5, 9)
( 5, 9)
( 5, 9)
( 5, 9)
( 7, 7)
Create a hash set (a HashSet in Java); you could use a sparse array instead if your numbers are well-bounded, i.e. you know they fall into +/- 100.
For each node n, first check whether 10 - n is in the set. If so, you have found a pair. Either way, add n to the set and continue.
So for example you have
1 - 6 - 3 - 4 - 9
1 - is 9 in the set? Nope
6 - 4? No.
3 - 7? No.
4 - 6? Yup! Print (6,4)
9 - 1? Yup! Print (9,1)
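A minimal sketch of this single-pass idea (shown here in R, with an environment standing in for the hash set; the function name is my own):
find_pairs <- function(values, target = 10) {
  seen  <- new.env(hash = TRUE)
  pairs <- list()
  for (n in values) {
    complement <- target - n
    if (exists(as.character(complement), envir = seen, inherits = FALSE)) {
      pairs[[length(pairs) + 1]] <- c(complement, n)   # found a pair
    }
    assign(as.character(n), TRUE, envir = seen)        # remember n for later nodes
  }
  pairs
}
find_pairs(c(1, 6, 3, 4, 9))   # (6,4) and (1,9), as in the walk-through above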
This is a special case of the subset sum problem (which in general is NP-complete), but the two-element case is easy.
If you were to sort the list first, it would cut down the number of pairs that need to be evaluated.
