Error in Hallmark GSEA result - Not all stats values are finite numbers

I am running a GSEA differential-expression analysis with the Hallmark gene set, and in the last step it reports that not all stats values are finite numbers. The fgsea analysis also gives an error:
for (i in 1:length(markers)) {
  input <- as.numeric(markers[[i]]["avg_logFC"])
  names(input) <- markers[[i]]["suggested_symbol"]
  fgsea_result[[i]] <- fgsea(reactome, input, nproc=8)
}
Error in preparePathwaysAndStats(pathways, stats, minSize, maxSize, gseaParam, :
  Not all stats values are finite numbers
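This error usually means the stats vector handed to fgsea() contains NA, NaN, or Inf values (for example, genes with a missing log fold change), so dropping non-finite entries before the call is the usual fix. A minimal sketch of that guard, reusing the poster's markers, reactome, and fgsea_result objects (my addition, not from the original post):

library(fgsea)
fgsea_result <- list()
for (i in seq_along(markers)) {
  input <- as.numeric(markers[[i]][["avg_logFC"]])          # [[ ]] extracts the column as a plain vector
  names(input) <- markers[[i]][["suggested_symbol"]]
  input <- input[is.finite(input) & !is.na(names(input))]   # drop NA/NaN/Inf stats and unnamed genes
  fgsea_result[[i]] <- fgsea(pathways = reactome, stats = input, nproc = 8)
}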

Related

Microarray DEG analysis scatterplot

I have found that my selected gene (probe ID 201667_at) is differentially expressed between WDLPS and DDLPS tumour tissue samples after performing microarray DEG analysis.
Instead of just a p-value in a table format:
Probe ID     201667_at
logFC        10.8205874181535
AveExpr      10.6925705768407
t            82.8808890739766
P.Value      3.10189446528995e-88
adj.P.Val    3.10189446528995e-88
B            191.589248589131
I have decided to present the data as a scatter plot/plot MDS (with an error bar) using expression values of the specific gene between the two tumour types (40 vs 52 samples) to show that it is differentially expressed. So 92 dots/points in total.
Does anyone know how I might do this, given that I used the following commands for the microarray differential expression analysis?
library("arrayQualityMetrics")
library(GEOquery)
library(oligo)
library(Biobase)
library(affy)
library("splitstackshape")
library("tidyr")
library("dplyr")
celFiles <- list.celfiles()
affyRaw <- read.celfiles(celFiles)
eset <- oligo::rma(affyRaw)
library(limma)
pData(eset)
Groups <- c("DDLPS", "DDLPS", "WDLPS", "WDLPS")
design <- model.matrix(~factor(Groups))
colnames(design) <- c("DDLPS", "DDLPSvsWDLPS")
fit <- lmFit(eset, design)
fit <- eBayes(fit)
options(digits = 2)
res <- topTable(fit, number = Inf, adjust.method = "none", coef = 1)
write.table(res, "diff_exp.txt", sep = "\t")
require(hgu133a.db)
probes <- rownames(res)   # probe IDs to annotate
annotLookup <- select(hgu133a.db, keys = probes,
                      columns = c('PROBEID', 'ENSEMBL', 'SYMBOL'))
Thank you.
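One way to do this (my own sketch, not from the thread): pull the normalized expression values for probe 201667_at out of eset, attach the group labels, and draw one point per sample with a mean and standard-error bar. It assumes Groups has one entry per array (92 in total, matching the column order of eset) and uses ggplot2.

library(Biobase)
library(ggplot2)

expr_probe <- exprs(eset)["201667_at", ]               # 92 normalized expression values
plot_df <- data.frame(expression = expr_probe,
                      group = factor(Groups))          # one tumour-type label per sample

ggplot(plot_df, aes(x = group, y = expression)) +
  geom_jitter(width = 0.1) +                           # one point per sample (92 in total)
  stat_summary(fun = mean, geom = "point", size = 3) + # group means
  stat_summary(fun.data = mean_se, geom = "errorbar", width = 0.2) +  # mean +/- SE
  labs(x = "Tumour type", y = "log2 expression (201667_at)")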

Minimization of Nonlinear equations

I am relatively new to R, so I apologize if my question isn't expressed well, or if there is excessive detail. What I'm doing here is modelling a naturally occurring gas containing the isotopes C12 and C13, produced at a linear rate (P) in respective fractions (F12 and F13) that sum to 1. The two isotopic gases are then consumed at rates k12 for C12 and k13 for C13. I then want to solve for P and k12 using a minimization function.
The equations are:
Eqn 1: conc.12 = ((F12*P)/k12) - (((F12*P)/k12) - c12zero) * exp(-k12*(t - t0))
Eqn 2: conc.13 = ((F13*P)/k13) - (((F13*P)/k13) - c13zero) * exp(-k13*(t - t0))
Eqn 3: Sum of squared errors = sum(((conc.12 - c12meas)/0.07)^2) + sum(((conc.13 - c13meas)/0.07)^2)
conc.12 and conc.13 are the estimated concentrations of two isotopes at time t
c12meas and c13meas are the measured concentrations of two isotopes at time t
t0 is the initial time point
F12 and F13 are fractions that sum to 1
k12 and k13 are exponential decay coefficients for the two isotopes, with k13 = k12/1.06
P is a linear production rate of both 12CH4 and 13CH4
The data for a toy data set with known approximate parameters follow:
Time c12meas c13meas
1 109.7000 19.35660
2 118.9150 18.74356
3 127.6693 18.15943
4 135.9858 17.60285
5 143.8865 17.07253
6 151.3922 16.56722
7 158.5226 16.08575
8 165.2964 15.62698
9 171.7316 15.18986
10 177.8450 14.77336
11 183.6528 14.37650
12 189.1701 13.99837
13 194.4116 13.63807
14 199.3911 13.29476
15 204.1215 12.96765
16 208.6154 12.65597
17 212.8847 12.35899
18 216.9404 12.07602
19 220.7934 11.80639
20 224.4537 11.54949
Note that the rows are in reality of equal length; any misalignment above comes from pasting them into the web portal.
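To make the code below easier to run, here is the same toy data set entered as an R data frame (my addition; the original post only shows the table above):

dat <- data.frame(
  Time    = 1:20,
  c12meas = c(109.7000, 118.9150, 127.6693, 135.9858, 143.8865, 151.3922,
              158.5226, 165.2964, 171.7316, 177.8450, 183.6528, 189.1701,
              194.4116, 199.3911, 204.1215, 208.6154, 212.8847, 216.9404,
              220.7934, 224.4537),
  c13meas = c(19.35660, 18.74356, 18.15943, 17.60285, 17.07253, 16.56722,
              16.08575, 15.62698, 15.18986, 14.77336, 14.37650, 13.99837,
              13.63807, 13.29476, 12.96765, 12.65597, 12.35899, 12.07602,
              11.80639, 11.54949)
)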
I first tried to solve these equations with optim with the following code:
error.func <- function (k12, P) {
  t <- Time
  t0 <- Time[1]
  c12zero = c12meas[1]
  c13zero = c13meas[1]
  k13 = k12/1.06
  F12 = 0.98
  F13 = 1 - F12
  ratio.12 <- (F12*P)/k12
  exp.12 <- exp(-k12*(t-t0))
  conc.12 <- ratio.12 - ((ratio.12-c12zero)*exp.12)
  ratio.13 <- (F13*P)/k13
  exp.13 <- exp(-k13*(t-t0))
  conc.13 <- ratio.13 - ((ratio.13-c13zero)*exp.13)
  error <- sum(((conc.12-c12meas)/0.07)^2) +
    sum(((conc.13-c13meas)/0.07)^2)
  return (error)
}
fit.model <- optim(k12=.05, P = 15, error.func)
This is the error code in R:
"Error in optim(k12 = 0.05, P = 15, error.func) :
cannot coerce type 'closure' to vector of type 'double'
In addition: Warning message:
In optim(k12 = 0.05, P = 15, error.func) :
one-dimensional optimization by Nelder-Mead is unreliable:
use "Brent" or optimize() directly"
My interpretation of this is that the optim function can't solve multiple equations at the same time, so I then tried the solnp function.
isotopes2 <- function(x) {
  t = Time
  t0 <- Time[1]
  c12zero = c12meas[1]
  c13zero = c13meas[1]
  k13 = x[1]/1.06
  F12 = 0.98
  F13 = 1 - F12
  ratio.12 <- (F12*x[2])/x[1]
  exp.12 <- exp(-x[1]*(t-t0))
  conc.12 <- ratio.12 - ((ratio.12-c12zero)*exp.12)
  ratio.13 <- (F13*x[2])/k13
  exp.13 <- exp(-k13*(t-t0))
  conc.13 <- ratio.13 - ((ratio.13-c13zero)*exp.13)
}
error.func <- function (x) {
  t <- Time
  t0 <- Time[1]
  c12zero = c12meas[1]
  c13zero = c13meas[1]
  k13 = x[1]/1.06
  F12 = 0.98
  F13 = 1 - F12
  ratio.12 <- (F12*x[2])/x[1]
  exp.12 <- exp(-x[1]*(t-t0))
  conc.12 <- ratio.12 - ((ratio.12-c12zero)*exp.12)
  ratio.13 <- (F13*x[2])/k13
  exp.13 <- exp(-k13*(t-t0))
  conc.13 <- ratio.13 - ((ratio.13-c13zero)*exp.13)
  error <- sum(((conc.12-c12meas)/0.07)^2) +
    sum(((conc.13-c13meas)/0.07)^2)
  return (error)
}
x0 <- c(0.05, 15)
constraint = c(0)
fit <- solnp(x0, fun = isotopes2, eqfun = error.func, eqB = 0)
I received the following error message:
"Error:
solnp-->error: objective function returns value of length greater than 1!
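For reference, a minimal sketch of how optim() is usually called for this kind of least-squares fit (my addition, not from the thread). optim() expects a single vector of starting values as its first argument (par) and an objective function of that vector, rather than separate named parameters, which is what triggers the "cannot coerce type 'closure'" message above. The sketch uses the dat data frame built from the toy data earlier.

error.func2 <- function(x, Time, c12meas, c13meas) {
  k12 <- x[1]
  P   <- x[2]
  t0  <- Time[1]
  c12zero <- c12meas[1]
  c13zero <- c13meas[1]
  k13 <- k12 / 1.06
  F12 <- 0.98
  F13 <- 1 - F12
  conc.12 <- (F12 * P) / k12 - ((F12 * P) / k12 - c12zero) * exp(-k12 * (Time - t0))
  conc.13 <- (F13 * P) / k13 - ((F13 * P) / k13 - c13zero) * exp(-k13 * (Time - t0))
  sum(((conc.12 - c12meas) / 0.07)^2) + sum(((conc.13 - c13meas) / 0.07)^2)
}
fit.model <- optim(par = c(k12 = 0.05, P = 15), fn = error.func2,
                   Time = dat$Time, c12meas = dat$c12meas, c13meas = dat$c13meas)
fit.model$par   # fitted k12 and P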

Optimize haskell function with huge numbers of power function call

The following function finds the number n for which 1^3 + 2^3 + ... + (n-1)^3 + n^3 = m. Is there any chance this function can be optimized for speed?
findNb :: Integer -> Integer
findNb m = findNb' 1 0
  where
    findNb' n m' =
      if m' == m then n - 1
      else if m' < m then findNb' (n + 1) (m' + n^3)
      else -1
I know there is a faster solution using a math formula.
The reason I'm asking is that a similar implementation in JavaScript / C# seems far faster than in Haskell. I'm just curious whether it can be optimized. Thanks.
EDIT 1: more evidence on the run time
Haskell Version:
With main = print (findNb2 152000000000000000000000):
Compile with -O2 and profiling: ghc -o testo2.exe -O2 -prof -fprof-auto -rtsopts pileofcube.hs. Here is total time from profiling report:
total time = 0.19 secs (190 milliseconds) (190 ticks @ 1000 us, 1 processor)
Compile with -O2 but no profiling: ghc -o testo22.exe -O2 pileofcube.hs. Run it with Measure-Command {./testo22.exe} in powershell. The result is:
Milliseconds : 157
JavaScript Version:
Code:
function findNb(m) {
  let n = 0;
  let sum = 0;
  while (sum < m) {
    n++;
    sum += Math.pow(n, 3);
  }
  return sum === m ? n : -1;
}
var d1 = new Date();
findNb(152000000000000000000000);
console.log(new Date() - d1);
Result: 45 milliseconds running in Chrome on the same machine
EDIT2: Add C# Version
As @Berji and @Bakuriu commented, the comparison with the JavaScript version above is not fair, since it uses double-precision floating-point numbers under the hood and cannot even give the correct answer. So I implemented it in C#; here are the code and result:
static void Main(string[] args)
{
    BigInteger m = BigInteger.Parse("152000000000000000000000");
    var s = new Stopwatch();
    s.Start();
    long n = 0;
    BigInteger sum = 0;
    while (sum < m)
    {
        n++;
        sum += BigInteger.Pow(n, 3);
    }
    Console.WriteLine(sum == m ? n : -1);
    s.Stop();
    Console.WriteLine($"Elapsed Time: {s.ElapsedMilliseconds} milliseconds.");
}
Result: Elapsed Time: 457 milliseconds.
Conclusion
The Haskell version is faster than the C# one...
I was wrong at the start because I didn't realize JavaScript uses double-precision floating-point numbers under the hood; my JavaScript knowledge is poor.
At this point it seems the question doesn't make sense anymore...
Haskell too can use Double to get the wrong answer in less time:
% time ./so
./so 0.03s user 0.00s system 95% cpu 0.038 total
And Javascript too can get the correct result via npm-installing big-integer and using bigInt everywhere instead of Double:
% node so.js
^C
node so.js 35.62s user 0.30s system 93% cpu 38.259 total
... or maybe it isn't as trivial as that.
EDIT: I realized afterward that this is not what the author of the question wanted. I'll keep it here in case someone wants to know the formula in question, but otherwise please disregard it.
There is indeed a formula that lets you compute this in constant time (rather than in n iterations). Since I couldn't remember the exact formula from school, I did a bit of searching, and here it is: https://proofwiki.org/wiki/Sum_of_Sequence_of_Cubes.
In Haskell code, that would translate to
findNb n = n ^ 2 * (n + 1) ^ 2 `div` 4
which I believe should be much faster.
Not sure if this wording of that algorithm is faster, but try this?
findNb :: Integer -> Integer
findNb m = fromIntegral . length $ takeWhile (<= m) $ scanl1 (+) [n^3 | n <- [1..]]
(This has different semantics in the undefined case, though.)

Matthews Correlation Coefficient yielding values outside of [-1,1]

I'm using the formula found on Wikipedia for calculating the Matthews Correlation Coefficient. It works fairly well most of the time, but my tool's implementation is running into problems and I'm not seeing where it goes wrong.
MCC = (TP*TN - FP*FN) / sqrt((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))
where TP, TN, FP, and FN are the non-negative, integer counts of the appropriate fields. This should only return values in [-1, 1].
My implementation is as follows:
double ret;
if ((TruePositives + FalsePositives) == 0 || (TruePositives + FalseNegatives) == 0 ||
    (TrueNegatives + FalsePositives) == 0 || (TrueNegatives + FalseNegatives) == 0)
    // To avoid dividing by zero
    ret = (double)(TruePositives * TrueNegatives -
                   FalsePositives * FalseNegatives);
else {
    double num = (double)(TruePositives * TrueNegatives -
                          FalsePositives * FalseNegatives);
    double denom = (TruePositives + FalsePositives) *
                   (TruePositives + FalseNegatives) *
                   (TrueNegatives + FalsePositives) *
                   (TrueNegatives + FalseNegatives);
    denom = Math.Sqrt(denom);
    ret = num / denom;
}
return ret;
When I use this, as I said it works properly most of the time, but for instance if TP=280, TN = 273, FP = 67, and FN = 20, then we get:
MCC = ((280*273) - (67*20)) / sqrt(347*300*340*293) = 75100 / 42196.06 = approximately 1.78
Is this normal behavior of the Matthews Correlation Coefficient? I'm a programmer by trade, so statistics aren't part of my formal training. I've also looked at related questions and answers, and none of them discuss this behavior. Is it a bug in my code or in the formula itself?
The code is clear and looks correct. (But one's eyes can always deceive.)
One concern is whether the output is guaranteed to lie between -1 and 1. Assuming all inputs are nonnegative, though, we can round the numerator up and the denominator down, thereby overestimating the result, by zeroing out all the "False*" terms, producing
TP*TN / Sqrt(TP*TN*TP*TN) = 1.
The lower limit is obtained similarly by zeroing out all the "True*" terms. Therefore, working code cannot produce a value larger than 1 in size unless it is presented with invalid input.
I therefore recommend placing a guard (such as an Assert statement) to assure the inputs are nonnegative. (Clearly it matters not in the preceding argument whether they are integral.) Place another assertion to check that the output is in the interval [-1,1]. Together, these will detect either or both of (a) invalid inputs or (b) an error in the calculation.
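As a quick illustration of that bound (my addition, not part of the original answer), plugging the reported counts straight into the formula in double precision gives a value well inside [-1, 1], which points to an error in the calculation rather than in the formula:

mcc <- function(TP, TN, FP, FN) {
  num   <- TP * TN - FP * FN
  denom <- sqrt((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))
  if (denom == 0) num else num / denom   # mirror the poster's divide-by-zero guard
}
mcc(TP = 280, TN = 273, FP = 67, FN = 20)   # roughly 0.74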

r: for loop operation with nested indices runs super slow

I have an operation I'd like to run for each row of a data frame, changing one column. I'm an apply/ddply/sqldf man, but I'll use loops when they make sense, and I think this is one of those times. This case is tricky because the column to change depends on information that varies by row; depending on the value in one cell, I should change only one of ten other cells in that row. With 75 columns and 20000 rows, the operation takes 10 minutes, when every other operation in my script takes 0-5 seconds, ten seconds max. I've stripped my problem down to the very simple test case below.
n <- 20000
t.df <- data.frame(matrix(1:5000, ncol=10, nrow=n) )
system.time(
  for (i in 1:nrow(t.df)) {
    t.df[i,(t.df[i,1]%%10 + 1)] <- 99
  }
)
This takes 70 seconds with ten columns, and 360 when ncol=50. That's crazy. Are loops the wrong approach? Is there a better, more efficient way to do this?
I already tried initializing the nested term (t.df[i,1]%%10 + 1) as a list outside the for loop. It saves about 30 seconds (out of 10 minutes) but makes the example code above more complicated. So it helps, but it's not the solution.
My current best idea came while preparing this test case. For me, only 10 of the columns are relevant (and 75-11 columns are irrelevant). Since the run times depend so much on the number of columns, I can just run the above operation on a data frame that excludes irrelevant columns. That will get me down to just over a minute. But is "for loop with nested indices" even the best way to think about my problem?
It seems the real bottleneck is having the data in the form of a data.frame. I assume that in your real problem you have a compelling reason to use a data.frame. Any way to convert your data in such a way that it can remain in a matrix?
By the way, great question and a very good example.
Here's an illustration of how much faster loops are on matrices than on data.frames:
> n <- 20000
> t.df <- (matrix(1:5000, ncol=10, nrow=n) )
> system.time(
+ for (i in 1:nrow(t.df)) {
+ t.df[i,(t.df[i,1]%%10 + 1)] <- 99
+ }
+ )
user system elapsed
0.084 0.001 0.084
>
> n <- 20000
> t.df <- data.frame(matrix(1:5000, ncol=10, nrow=n) )
> system.time(
+ for (i in 1:nrow(t.df)) {
+ t.df[i,(t.df[i,1]%%10 + 1)] <- 99
+ }
+ )
user system elapsed
31.543 57.664 89.224
Using row and col seems less complicated to me:
t.df[col(t.df) == (row(t.df) %% 10) + 1] <- 99
I think Tommy's is still faster, but using row and col might be easier to understand.
@JD Long is right that if t.df can be represented as a matrix, things will be much faster.
...And then you can actually vectorize the whole thing so that it is lightning fast:
n <- 20000
t.df <- data.frame(matrix(1:5000, ncol=10, nrow=n) )
system.time({
m <- as.matrix(t.df)
m[cbind(seq_len(nrow(m)), m[,1]%%10L + 1L)] <- 99
t2.df <- as.data.frame(m)
}) # 0.00 secs
Unfortunately, the matrix indexing I use here does not seem to work on a data.frame.
EDIT
A variant where I create a logical matrix to index works on data.frame, and is almost as fast:
n <- 20000
t.df <- data.frame(matrix(1:5000, ncol=10, nrow=n) )
system.time({
t2.df <- t.df
# Create a logical matrix with TRUE wherever the replacement should happen
m <- array(FALSE, dim=dim(t2.df))
m[cbind(seq_len(nrow(t2.df)), t2.df[,1]%%10L + 1L)] <- TRUE
t2.df[m] <- 99
}) # 0.01 secs
UPDATE: Added the matrix version of Tommy's solution to the benchmarking exercise.
You can vectorize it. Here is my solution and a comparison with the loop
n <- 20000
t.df <- (matrix(1:5000, ncol=10, nrow=n))
f_ramnath <- function(x){
idx <- x[,1] %% 10 + 1
x[cbind(1:NROW(x), idx)] <- 99
return(x)
}
f_long <- function(t.df){
for (i in 1:nrow(t.df)) {
t.df[i,(t.df[i,1]%%10 + 1)] <- 99
}
return(t.df)
}
f_joran <- function(t.df){
t.df[col(t.df) == (row(t.df) %% 10) + 1] <- 99
return(t.df)
}
f_tommy <- function(t.df){
t2.df <- t.df
# Create a logical matrix with TRUE wherever the replacement should happen
m <- array(FALSE, dim=dim(t2.df))
m[cbind(seq_len(nrow(t2.df)), t2.df[,1]%%10L + 1L)] <- TRUE
t2.df[m] <- 99
return(t2.df)
}
f_tommy_mat <- function(m){
m[cbind(seq_len(nrow(m)), m[,1]%%10L + 1L)] <- 99
}
To compare the performance of the different approaches, we can use rbenchmark.
library(rbenchmark)
benchmark(f_long(t.df), f_ramnath(t.df), f_joran(t.df), f_tommy(t.df),
          f_tommy_mat(t.df), replications = 20, order = 'relative',
          columns = c('test', 'elapsed', 'relative'))
test elapsed relative
5 f_tommy_mat(t.df) 0.135 1.000000
2 f_ramnath(t.df) 0.172 1.274074
4 f_tommy(t.df) 0.311 2.303704
3 f_joran(t.df) 0.705 5.222222
1 f_long(t.df) 2.411 17.859259
Another option for when you do need mixed column types (and so you can't use matrix) is := in data.table. Example from ?":=" :
require(data.table)
m = matrix(1,nrow=100000,ncol=100)
DF = as.data.frame(m)
DT = as.data.table(m)
system.time(for (i in 1:1000) DF[i,1] <- i)
# 591 seconds
system.time(for (i in 1:1000) DT[i,V1:=i])
# 1.16 seconds ( 509 times faster )
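Applying the same idea to the original row-wise update (a sketch of my own, not from the thread), data.table::set() can do the per-row assignment in place, avoiding the data.frame copying overhead while still allowing mixed column types:

library(data.table)
n  <- 20000
DT <- as.data.table(matrix(1:5000, ncol = 10, nrow = n))
system.time(
  for (i in seq_len(nrow(DT))) {
    j <- DT[[1]][i] %% 10L + 1L          # column to change, decided by the value in column 1
    set(DT, i = i, j = j, value = 99L)   # assign in place, no intermediate copies
  }
)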

Resources