Microarray DEG analysis scatterplot - expression

I have found that my selected gene, probe ID 201667_at, is differentially expressed between WDLPS and DDLPS tumour tissue samples after performing microarray DEG analysis.
Instead of just a p-value in a table format:
Probe ID    201667_at
logFC       10.8205874181535
AveExpr     10.6925705768407
t           82.8808890739766
P.Value     3.10189446528995e-88
adj.P.Val   3.10189446528995e-88
B           191.589248589131
I have decided to present the data as a scatter plot (with error bars) of the expression values of this gene in the two tumour types (40 vs 52 samples), to show that it is differentially expressed: 92 points in total.
Does anyone know how I might do this, given that I used these commands for the microarray differential expression analysis?
library("arrayQualityMetrics")
library(GEOquery)
library(oligo)
library(Biobase)
library(affy)
library("splitstackshape")
library("tidyr")
library("dplyr")
celFiles <- list.celfiles()
affyRaw <- oligo::rma(affyraw)
eset <- oligo::rma(affyRaw)
library(limma)
pData(eset)
Groups <- c("DDLPS", "DDLPS", "WDLPS", "WDLPS")
design <- model.matrix(~factor(Groups))
colnames(design) <- c("DDLPS", "DDLPSvsWDLPS")
fit <- lmFit(eset, design)
fit <- eBayes(fit)
option (digits =2)
res <- topTable (fit, number = Inf, adjust.method = "none", coef = 1)
write.table(res, "diff_exp.txt", sep= "\t)
require(hgu133a.db)
annotLookup <- select(hgu133a.db, keys = probes,
columns = c('PROBEID', 'ENSEMBL', 'SYMBOL'))
Thank you!
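A minimal sketch of one way to draw such a plot in base R, assuming eset holds all 92 normalised samples and Groups contains one label per sample in array order (both are assumptions, since only four labels appear above):
expr <- exprs(eset)["201667_at", ]
grp <- factor(Groups, levels = c("WDLPS", "DDLPS"))
# jittered points, one per sample
stripchart(expr ~ grp, vertical = TRUE, method = "jitter", pch = 16,
           xlab = "Tumour type", ylab = "RMA log2 expression (201667_at)")
# overlay group means with simple mean +/- SD error bars
means <- tapply(expr, grp, mean)
sds <- tapply(expr, grp, sd)
points(1:2, means, pch = 19, col = "red")
arrows(1:2, means - sds, 1:2, means + sds, angle = 90, code = 3, length = 0.1)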

Related

Line search fails when training linear SVM with caret

I am trying to train a linear SVM, tuning the parameters with 10-fold CV, for binary text classification.
None of the solutions provided in other threads work: I have already removed all NA, NaN and Inf values and balanced my dataset by downsampling, but the model still returns NAs and fails in the line search. I am stuck, so I need the community's help.
The data has 2099 observations of 926 variables and consists mostly of 0s, 1s, 2s and 3s.
This is my code (the definition of dat_SentimentAnalysis itself is omitted here):
library(caret)
set.seed(335)
trainIndex <- createDataPartition(dat_SentimentAnalysis$Usefulness, p = .75,
                                  list = FALSE,
                                  times = 1)
train <- dat_SentimentAnalysis[trainIndex, ]
test <- dat_SentimentAnalysis[-trainIndex, ]
# check the class distribution
table(train$Usefulness)
# downsample the training set
train <- downSample(train, as.factor(train$Usefulness))
# check the distribution again
table(train$Usefulness)
train <- na.omit(train) # no NA values detected
# separate features and outcome
x_train <- train[2:926]
y_train <- as.factor(train$Usefulness)
x_test <- test[2:926]
y_test <- as.factor(test$Usefulness)
sum(is.na(x_train))
sum(is.na(y_train))
# tune hyperparameters for the SVM
fitControl <- trainControl(method = "repeatedcv",
                           number = 10,
                           repeats = 3,
                           search = "grid",
                           classProbs = TRUE,
                           savePredictions = TRUE)
model <- caret::train(x = x_train,
                      y = y_train,
                      method = "svmLinear",
                      trControl = fitControl,
                      # the argument name is tuneGrid, not tunegrid
                      tuneGrid = data.frame(C = c(0.25, 0.5, 1, 5, 8, 12, 100)))
Does anybody have an idea of what could be wrong? When I do not tune, I at least get a (poorly performing) SVM with around 52% accuracy, so maybe something in the tuning setup is wrong?
Thank you very much for your help!
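One hedged guess worth checking (an assumption, since the levels of Usefulness are not shown above): with classProbs = TRUE, caret requires factor levels that are syntactically valid R variable names, so numeric labels such as "0"/"1" can break resampling. Relabelling the outcome before training is a quick test:
# hypothetical fix: give the classes valid R names, e.g. "X0"/"X1"
levels(y_train) <- make.names(levels(y_train))
levels(y_test) <- make.names(levels(y_test))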

How to set my inequality constraint into my R function?

I am working on a project analysing different portfolio constructions over a universe of various assets. I work with 22 assets and recalibrate my portfolio every 90 days; this is why a weight-penalty constraint (see code) is applied, as the allocation changes every period.
I am currently implementing a construction based on independent components. My objective is to minimize the modified value at risk (MVaR) computed from those components (see code below).
My function runs correctly and everything seems to be OK: my functions "MVaR.IC.port" and "MVaR.cm" work well. However, I can only use this model when short selling is allowed. I would now like to operate long only, i.e. with weight vectors w containing only elements >= 0. Concretely, I want the result of the expression "w <- t(w.IC)%*%a$A" in my code to be >= 0.
Do you know how to help me? Thank you in advance.
The rows of w.out.MVaR.IC.cm.22 below are the results that must be positive. I also constrain the weights to sum to 1 (the investor allocates 100% of his wealth).
Thomas
PS: train and test represent my rolling windows. I calibrate my models on 'train' (in sample) and apply them to 'test' (out of sample) in order to analyse their performance.
########################################
######### MVaR on IC with CM ##########
########################################
# k, n, alpha, MVaR(), myfastICA() and portfolioReturns.new are defined earlier
lower <- rep(-5, k)
upper <- rep(5, k)
# Set up the objective function with penalty constraints
MVaR.IC.cm.port <- function(S, weights, alpha, MixingMatrix)
{
  obj <- MVaR(S, weights, alpha)
  # map the IC weights back to asset space
  w.ICA <- t(weights) %*% MixingMatrix
  # penalise deviations from a fully invested portfolio (weights sum to 1)
  weight.penalty <- abs(1000 * (1 - sum(w.ICA)))
  # penalise individual weights outside [-1, 1]
  down.weight.penalty <- 1000 * sum(w.ICA[w.ICA > 1])
  up.weight.penalty <- 1000 * abs(sum(w.ICA[w.ICA < -1]))
  return(obj + weight.penalty + down.weight.penalty + up.weight.penalty)
}
# Out-of-sample portfolio return computation over 10 rolling windows
ret.out.MVaR.IC.cm.22 <- c()
w.out.MVaR.IC.cm.22 <- matrix(ncol = n, nrow = 10)
for (i in 0:9) {
  train <- as.matrix(portfolioReturns.new[((1 + i*90):(8*90 + i*90)), ])
  test <- as.matrix(portfolioReturns.new[(1 + 8*90 + i*90):(9*90 + i*90), ])
  a <- myfastICA(train, k, alg.typ = "parallel", fun = "logcosh", alpha = 1,
                 method = "R", row.norm = FALSE, maxit = 2000,
                 tol = 0.0000000001, verbose = TRUE)
  x <- DEoptim(MVaR.IC.cm.port, lower, upper,
               control = list(NP = (10*k), F = 0.8, CR = 0.9, trace = 50),
               S = a$S, alpha = alpha, MixingMatrix = a$A)
  w.IC <- matrix(x$optim$bestmem, ncol = 1)
  w <- t(w.IC) %*% a$A  # asset-space weights; these should end up >= 0
  for (j in 1:ncol(train)) {
    w.out.MVaR.IC.cm.22[(i + 1), j] <- w[j]
  }
  ret.out.MVaR.IC.cm.22 <- rbind(ret.out.MVaR.IC.cm.22, test %*% t(w))
}
w.out.MVaR.IC.cm.22
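A hedged sketch of one way to get long-only weights within this penalty framework (assuming the same MVaR(), a$A and DEoptim setup as above; the function name here is made up to avoid clashing with the original): add a further penalty term that punishes any negative asset-space weight, so DEoptim is steered towards w >= 0.
# sketch: same objective, plus a penalty on short positions
MVaR.IC.cm.port.long <- function(S, weights, alpha, MixingMatrix)
{
  obj <- MVaR(S, weights, alpha)
  w.ICA <- t(weights) %*% MixingMatrix
  weight.penalty <- abs(1000 * (1 - sum(w.ICA)))
  long.only.penalty <- 1000 * sum(abs(w.ICA[w.ICA < 0]))  # punish w < 0
  return(obj + weight.penalty + long.only.penalty)
}
Passing this function to DEoptim in place of MVaR.IC.cm.port should drive negative weights towards zero; exact zeros are not guaranteed with a penalty approach, so tiny negative residuals may still need to be clipped.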

Rcpp instead of mapply in model validation for many subsets

Let's say that we have specified N train datasets (80:20 splits) and we want to retrieve, for each train dataset, a two-element list holding the p-values and coefficients from a glm model. The code reproducing this example is as follows:
library(parallel)
library(caret)
# prepare dataset
data(iris)
iris <- iris[!iris$Species == "setosa", ]
rownames(iris) <- NULL  # reset row names so they match the fold indices
# create validation folds
set.seed(12345)
folds <- createDataPartition(y = iris$Species, times = 100, p = 0.8, list = FALSE)
# glm model expression
model.expr.tr <- expression(glm(formula = Species ~ Sepal.Length,
                                data = dtr,
                                family = binomial(link = "logit")))
# glm elements that will be validated
val_part <- list(coefs = expression(summary(eval(model.expr.tr))$coefficients[, 1]),
                 pvals = expression(summary(eval(model.expr.tr))$coefficients[, 4]))
# lapply with mapply for validation results
val_results <- lapply(val_part, function(x){
  mapply(function(i){
    dtr <- iris[folds[, i], ]  # train subset for fold i
    eval(x)
  },
  i = 1:100)
})
As you are aware, the slowest part is refitting the model and running summary() for every train dataset, especially if we choose more than 100 of them. In your opinion, is there any way to speed up this process? Of course I am aware of the parLapply / mcmapply options, but what about some kind of Rcpp speed-up in this case? Any suggestions?
Thanks.
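One easy win that does not involve Rcpp (a sketch against the example above, not a benchmark): the lapply-over-val_part arrangement evaluates model.expr.tr once per validated element, so each glm is fitted twice. Fitting once per fold and pulling both statistics from a single summary() halves the work:
# fit each fold once; extract coefficients and p-values from one summary()
val_results <- lapply(seq_len(ncol(folds)), function(i) {
  dtr <- iris[folds[, i], ]
  sm <- summary(glm(Species ~ Sepal.Length, data = dtr,
                    family = binomial(link = "logit")))
  list(coefs = sm$coefficients[, 1], pvals = sm$coefficients[, 4])
})
The result is then one list per fold rather than one matrix per statistic, so downstream code would need a small reshape.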

What is an efficient way to convert sets to a column index in R?

Overview
Given a large (nrows > 5,000,000) data frame A with string row names, and a list B of disjoint sets (n = 20,000+) where each set consists of row names from A, what is the best way to create a vector that labels each row of A with the index of the set in B that it belongs to?
Illustration
Below is an example illustrating this problem:
# Input
A <- data.frame(d = rep("A", 5e6), row.names = as.character(sample(1:5e6)))
B <- list(c("4655297", "3177816", "3328423"), c("2911946", "2829484"), ...) # Size 20,000+
The desired result would be:
# An index of NA represents that the row is not part of any set in B.
> A[,"index", drop = F]
d index
4655297 A 1
3328423 A 1
2911946 A 2
2829484 A 2
3871770 A NA
2702914 A NA
2581677 A NA
4106410 A NA
3755846 A NA
3177816 A 1
Naive Attempt
Something like this can be achieved using the following method.
n <- 0
A$index <- NA
lapply(B, function(x){
  n <<- n + 1
  A[x, "index"] <<- n
})
Problem
However, this is unreasonably slow (several hours), because it indexes into A once per set, and it is not very R-esque or elegant.
How can the desired result be generated in a quick and efficient manner?
Here is a suggestion using base that isn't too bad when compared to your current method.
Sample data:
A <- data.frame(d = rep("A", 5e6),
                set = sample(c(NA, 1:20000), 5e6, replace = TRUE),
                row.names = as.character(sample(1:5e6)))
B <- split(rownames(A), A$set)
Base method:
system.time({
  A$index <- NA
  # one vectorised assignment: every row name in B, paired with its set's index
  A[unlist(B), "index"] <- rep(seq_along(B), times = lengths(B))
})
#  user  system elapsed
# 15.30    0.19   15.50
Check:
identical(A$set, A$index)
# TRUE
For anything faster, I suppose data.table will come handy.
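Following up on that remark, a hypothetical (untimed) data.table sketch: build a rowname-to-set-index lookup table once, then join it onto A's row names.
library(data.table)
lookup <- data.table(rn = unlist(B),
                     index = rep(seq_along(B), times = lengths(B)))
# left join: rows of A that appear in no set get NA
A$index <- lookup[data.table(rn = rownames(A)), on = "rn", x.index]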

John Tukey "median median" (or "resistant line") statistical test for R and linear regression

I'm searching for the John Tukey algorithm which computes a "resistant line" or "median-median line" for a linear regression in R.
A student on a mailing list explained this algorithm in these terms:
"The way it's calculated is to divide the data into three groups, find the x-median and y-median values (called the summary point) for each group, and then use those three summary points to determine the line. The outer two summary points determine the slope, and an average of all of them determines the intercept."
An article about John Tukey's median-of-medians, for the curious: http://www.johndcook.com/blog/2009/06/23/tukey-median-ninther/
Do you have an idea of where I could find this algorithm or an R function for it? In which package?
Thanks a lot!
There's a description of how to calculate the median-median line here. An R implementation of that is
median_median_line <- function(x, y, data)
{
  if(!missing(data))
  {
    x <- eval(substitute(x), data)
    y <- eval(substitute(y), data)
  }
  stopifnot(length(x) == length(y))
  # Step 1: three contiguous groups, sized according to length(x) mod 3
  one_third_length <- floor(length(x) / 3)
  groups <- rep(1:3, times = switch((length(x) %% 3) + 1,
    rep(one_third_length, 3),
    c(one_third_length, one_third_length + 1, one_third_length),
    c(one_third_length + 1, one_third_length, one_third_length + 1)
  ))
  # Step 2: order the points by x, keeping the (x, y) pairing intact
  ord <- order(x)
  x <- x[ord]
  y <- y[ord]
  # Step 3: summary point (x-median, y-median) of each group
  median_x <- tapply(x, groups, median)
  median_y <- tapply(y, groups, median)
  # Step 4: slope and intercept from the outer summary points
  slope <- (median_y[3] - median_y[1]) / (median_x[3] - median_x[1])
  intercept <- median_y[1] - slope * median_x[1]
  # Step 5: shift the intercept a third of the way towards the middle point
  middle_prediction <- intercept + slope * median_x[2]
  intercept <- intercept + (median_y[2] - middle_prediction) / 3
  c(intercept = unname(intercept), slope = unname(slope))
}
To test it, here's an example:
dfr <- data.frame(
time = c(.16, .24, .25, .30, .30, .32, .36, .36, .50, .50, .57, .61, .61, .68, .72, .72, .83, .88, .89),
distance = c(12.1, 29.8, 32.7, 42.8, 44.2, 55.8, 63.5, 65.1, 124.6, 129.7, 150.2, 182.2, 189.4, 220.4, 250.4, 261.0, 334.5, 375.5, 399.1))
median_median_line(time, distance, dfr)
#intercept slope
# -113.6 520.0
Note the slightly odd way of specifying the groups. The instructions are quite picky about how you define group sizes, so the more obvious method of cut(x, quantile(x, seq.int(0, 1, 1/3))) doesn't work.
I'm a little late to the party, but have you tried line() from the stats package?
From the helpfile:
Value
An object of class "tukeyline".
References
Tukey, J. W. (1977). Exploratory Data Analysis, Reading Massachusetts: Addison-Wesley.
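For instance, a usage sketch against the dfr example above (line() lives in the stats package, which is attached by default):
fit <- line(dfr$time, dfr$distance)
coef(fit)
# compare with the median_median_line() result above
Note that coef() and residuals() both work on the returned "tukeyline" object.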
As a member of the R Core team, I have now dug into the source code and also studied its history.
Conclusion: the C source code, added in 1996-1997 when R was still called alpha (around version 0.14alpha), already computed the quantiles not quite correctly... for some sample sizes.
More about this on the R mailing lists (not posted yet).
