I'm trying to derive a measure of tumour heterogeneity in scRNA-seq data. For a given individual's scRNA-seq gene expression matrix, I would like to calculate the Pearson correlations within and between clusters (compare the average cell-to-cell correlation within clusters to the average cell-to-cell correlation between clusters).
The idea is that if an individual had substantial transcriptional heterogeneity, the intercluster correlation would be negative and the within-cluster correlation would be positive. If the individual's tumours had a uniform transcriptional state, they would have a near-zero intercluster correlation. I am using the cor() function from the R 'stats' package, with the log-normalized gene expression matrix as input:
c2.dat <- data.frame(c2@assays$RNA@data) # gene expression matrix for a subject's cells in "cluster 2"
c2.cor <- cor(c2.dat, method = "pearson") #correlation analysis on log-normalized gene expression matrix
I am stuck though once I have a correlation matrix. How do I calculate the average cell-to-cell correlation within this cluster?
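My current guess is that I just need to average the off-diagonal entries of c2.cor (the diagonal is always 1, so it should be excluded), something like the sketch below (within.avg is just a placeholder name), but I am not sure this is the right approach:
# c2.cor is a cell-by-cell correlation matrix (cells are the columns of c2.dat),
# so averaging its upper triangle gives the mean cell-to-cell correlation
within.avg <- mean(c2.cor[upper.tri(c2.cor)], na.rm = TRUE)
within.avg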
Thank you :)
I want to do scattered interpolation in Matlab, but scatteredInterpolant does not do quite what I want.
scatteredInterpolant allows me to provide a set of input sampling positions and the corresponding sample values. Then I can query the interpolated values by supplying a set of positions:
F = scatteredInterpolant(xpos, ypos, samplevals)
interpvals = F(xgrid, ygrid)
This is sort of the opposite of what I want. I already have a fixed set of sample positions, xpos/ypos, and a fixed output grid, xgrid/ygrid, and I want to vary the sample values. The use case is that I have many quantities sampled at the same positions, all of which should be interpolated to the same output grid.
I have an idea how to do this for nearest neighbor and linear interpolation, but not for more general cases, in particular for natural neighbor interpolation.
This is what I want, in mock code:
G = myScatteredInterpolant(xpos, ypos, xgrid, ygrid, interp_method)
interpvals = G(samplevals)
In terms of what this means, I suppose G holds a (presumably sparse) matrix of weights, W, and G(samplevals) basically computes W * samplevals, where the weights in W depend on the input and output grids as well as the interpolation method (nearest neighbor, linear, natural neighbor). Calculating W is probably much more expensive than evaluating the product W * samplevals, which is why I want W to be computed once and reused.
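For the linear case, for instance, my rough idea (an untested sketch, assuming no duplicate sample positions) would be to compute the barycentric weights once with delaunayTriangulation/pointLocation and store them in a sparse W:
DT = delaunayTriangulation(xpos(:), ypos(:));
[ti, bc] = pointLocation(DT, xgrid(:), ygrid(:));  % enclosing triangle + barycentric coordinates
inside = ~isnan(ti);                               % query points outside the hull get NaN
rows = repmat(find(inside), 1, 3);
cols = DT.ConnectivityList(ti(inside), :);
vals = bc(inside, :);
W = sparse(rows(:), cols(:), vals(:), numel(xgrid), numel(xpos));
interpvals = reshape(W * samplevals(:), size(xgrid));  % outside-hull points simply end up as 0 here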
Is there any code in Matlab, or in a similar language that I could adapt, that does this? Can it somehow be extracted from scatteredInterpolant in reasonable processing time?
I am computing the ROC AUC for my classifier on the training data. Predicting the probability of class assignment for each sample in Train_features:
probs = classifier.predict_proba(Train_features)
Choosing the class for which the AUC has to be determined.
preds = probs[:,1]
Calculating the false positive rate, the true positive rate, and the candidate decision thresholds:
fpr, tpr, threshold = metrics.roc_curve(Train_labels, preds)
roc_auc = metrics.auc(fpr, tpr)
print(max(threshold))
Output: 1.97834
Why is the maximum threshold value greater than 1?
The previous answer did not really address your question of why the threshold is > 1, and in fact is misleading when it says the threshold does not have any interpretation.
The range of thresholds should technically be [0, 1], because they are probability thresholds. But scikit-learn prepends one extra threshold, equal to the largest score plus 1, so that the curve starts at the point where no instances are predicted positive. So if in your example max(threshold) = 1.97834, the very next number in the threshold array should be 0.97834.
See this sklearn GitHub issue thread for an explanation. It's a little funny because somebody thought this was a bug, but it's just how the creators of sklearn decided to define the thresholds.
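To see this concretely, here is a small toy example (the labels and scores are made up, not from your data):
import numpy as np
from sklearn.metrics import roc_curve

y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])
fpr, tpr, thresholds = roc_curve(y_true, y_score)
print(thresholds)
# [1.8  0.8  0.4  0.35 0.1] -- the first (largest) entry is max(y_score) + 1
# (recent scikit-learn versions use np.inf as this sentinel instead)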
Finally, because it is a probability threshold, it does have a very useful interpretation. A common choice of optimal cutoff is the threshold at which sensitivity + specificity is maximized (Youden's J statistic). In scikit-learn this can be computed like so:
import numpy as np
from sklearn.metrics import roc_curve

fpr_p, tpr_p, thresh = roc_curve(true_labels, pred)
# maximize sensitivity + specificity, i.e. tpr + (1 - fpr), equivalently tpr - fpr
th_optimal = thresh[np.argmax(tpr_p - fpr_p)]
The threshold value does not have any kind of interpretation; what really matters is the shape of the ROC curve. Your classifier performs well if there are thresholds (no matter their values) for which the ROC curve lies above the diagonal, i.e. better than random guessing. A perfect classifier (which rarely happens in practice) yields a curve that passes through the point (0, 1), while the worst possible classifier yields a curve that passes through (1, 0). A good indicator of the performance of your classifier is the area under the ROC curve, known as the AUC, which is bounded between 0 and 1: 0 for the worst performance and 1 for perfect performance.
I am trying to solve the nonlinear ODE dT/dt = ((1-alpha)*Q - sigm*T^4)/R with ode45, in two different ways:
Code#1:
tspan1 = t0:0.05:TT;
[t1,y1] = ode45(@(t1,T) ((1-alpha)*Q-sigm*(T.^4))/R, tspan1, t0);
h1=(TT-t0)/(size(y1,1)-1);
Tspan1=t0:h1:TT;
figure(55);plot(Tspan1,y1,'b');
Code#2:
tspan=[t0 TT];
[t,y] = ode45(@(t,T) ((1-alpha)*Q-sigm*(T.^4))/R, tspan, t0);
h=(TT-t0)/(size(y,1)-1);
Tspan=t0:h:TT;
figure(5);plot(Tspan,y,'b');
wherein:
R=2.912;
Q = 342;
alpha=0.3;
sigm=5.67*(10^(-8));
TT=20;
t0=0;
Why are the results different?
The second result is not equally spaced. It is, in a sense, a minimal set of points that represents the solution curve: if the curve is nearly linear there will be only a few points, while in regions of high curvature you get a dense sampling. You can and should plot against the returned time array, as it contains the times that the solution points correspond to:
figure(55);plot(t1,y1,'b');
figure(5);plot(t,y,'b');
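If you really do want equally spaced samples of the adaptively computed solution, one further option (a sketch only; the figure number and number of query points are arbitrary) is to capture the solution structure and evaluate it with deval:
sol = ode45(@(t,T) ((1-alpha)*Q-sigm*(T.^4))/R, [t0 TT], t0);
teq = linspace(t0, TT, 200);   % equally spaced query times
yeq = deval(sol, teq);         % evaluate the continuous extension of the solution
figure(6); plot(teq, yeq, 'b');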
I am implementing a Gaussian distribution model on some data. If sigma (the covariance matrix) is singular, it is not invertible, and calculating the probability density fails. I think adding an identity matrix to sigma would make it invertible, but then the model would no longer fit the data well.
Is there a way to make the sigma matrix invertible while keeping the model fitting the data?
I have a set of data: (x1, x2)_1, (x1, x2)_2, ..., (x1, x2)_i, where x1 and x2 are continuous real numbers and some (x1, x2) pairs can appear several times. I assume those data follow a Gaussian distribution, so I can calculate the mean vector as (mean(x1), mean(x2)) and then calculate the covariance matrix as usual. In some cases the covariance matrix may be singular. I think adding some small random shifts to it could make it nonsingular, but I don't know how to do this correctly so that the model can still fit the data well.
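For clarity, this is the kind of fix I had in mind (a small Python/NumPy sketch with made-up numbers; the epsilon is arbitrary):
import numpy as np

X = np.array([[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]])  # toy data where x2 = 2*x1
cov = np.cov(X, rowvar=False)                       # singular: determinant is 0
cov_reg = cov + 1e-6 * np.eye(2)                    # now invertible, but does it still fit the data?
print(np.linalg.det(cov), np.linalg.det(cov_reg))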
You only need to model one dimension of the data with a 1D Gaussian distribution in this case.
If you have two-dimensional data {(x1,x2)_i} whose covariance matrix is singular, this means that the data lies along a straight line. The {x2} data is a deterministic function of the {x1} data, so you only need to model the {x1} data randomly. The {x2} data follows immediately from {x1} and is no longer random once you know {x1}.
Here is my reasoning:
The covariance matrix would look something like this, since all covariance matrices are symmetric:
| a b |
| b c |
Where a = var(x1), c = var(x2), b = cov(x1,x2).
Now if this matrix is singular, one column has to be a scalar multiple of the other (since they are linearly dependent). Let's say the first column is k times the second. Then:
b = k*c
a = k*b = k*k*c
So the covariance matrix really looks like:
| k*k*c k*c |
| k*c c |
Here, apart from the scale factor k, there is only one parameter c = var(x2) that determines the distribution, so the data is inherently one-dimensional; modelling it with the single variable x1 is enough. Another way of seeing this is to check that the Pearson correlation coefficient for this distribution is b/(sqrt(a)*sqrt(c)) = ±1, i.e. perfect (anti-)correlation.
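As a quick illustration (a rough NumPy sketch with synthetic data; the slope and intercept are made up):
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = 2.0 * x1 + 1.0                   # x2 is a deterministic function of x1
X = np.column_stack([x1, x2])

cov = np.cov(X, rowvar=False)
print(np.linalg.matrix_rank(cov))     # 1, so the 2D Gaussian is degenerate

# Model x1 alone with a 1D Gaussian and recover the linear relation for x2
mu1, var1 = x1.mean(), x1.var(ddof=1)
slope, intercept = np.polyfit(x1, x2, 1)   # x2 = slope*x1 + intercept
print(mu1, var1, slope, intercept)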
I have fact data with a set of parameters and a value that corresponds to those parameters.
For example:
Street    Color    Shape     Value
----------------------------------
Versky    Blue     Ball      10
Soll      Green    Square    5
...
Now I need to create a function which gets a set of parameters [Holl, Red, Circle] and returns the predicted 'Value'.
If my parameters were numbers, I could use the 'Classifying with k-Nearest Neighbors' algorithm, but they are not.
Which machine-learning algorithm can I use to solve this task?
Note that nearest neighbor classification finds the nearest neighbors according to some distance metric. While Euclidean or similar metrics are indeed widely used, any distance metric will do.
You can use a variation of Hamming distance:
Let x[i] be the i'th feature of vector x
Let the number of features be n
d(x,y) = Sum { (x[i] == y[i] ? 0 : 1) | i from 1 to n }
The above is a distance metric, basically a variation of the Hamming distance in which each feature gets its own alphabet.
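A minimal sketch of how this could be used for the prediction task (the toy rows and the choice of k are made up; the values of the k nearest rows are simply averaged):
def hamming_distance(x, y):
    # number of features on which the two rows disagree
    return sum(1 for xi, yi in zip(x, y) if xi != yi)

def knn_predict(train, query, k):
    # train is a list of (features, value) pairs; average the values of the k closest rows
    nearest = sorted(train, key=lambda row: hamming_distance(row[0], query))[:k]
    return sum(value for _, value in nearest) / k

train = [(("Versky", "Blue", "Ball"), 10),
         (("Soll", "Green", "Square"), 5)]
print(knn_predict(train, ("Holl", "Red", "Circle"), k=2))   # -> 7.5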