Clustsig with modified method.distance - correlation

I am attempting to perform a simprof test using a Pearson correlation as the distance method. I am aware that it is designed for the typical distance methods such as Euclidean or Bray-Curtis, but it supposedly allows any function that returns a dist object.
My issue lies with the creation of that function. My original data consists of 35 rows and 2146 columns, and I wish to correlate the columns. A small subset of that data is shown below.
I need a function that takes the absolute value of the Pearson correlation coefficient to be used as the method.distance function. I can calculate the coefficients individually, as shown below, but I have no idea how to combine all of that into a single function. My attempt is the dist3 function below, but I know that as.dist needs the matrix of correlation coefficients, which you can only get from CorrelationSmall$r. I'm assuming the calls need to be nested, but I'm at a loss. I apologize if I'm asking something ridiculous; I have combed the forums and don't know who else to ask. Many thanks!
library(clustsig)
library(Hmisc)
library(readr)  # read_csv comes from readr
NetworkAnalysisSmall <- read_csv("C:/Users/WilhelmLab/Desktop/Lena/NetworkAnalysisSmall.csv")
NetworkAnalysisSmallMatrix <- as.matrix(NetworkAnalysisSmall)
#subset of NetworkAnalysisSmall
a<-c(0.0000000000000,0.0000000000000,0.0000000000000,0.0000000000000,0.0000000000000,0.0000000000000,0.0000000000000,0.0000000000000,0.0000000000000,0.0000000000000,0.0000000000000,0.0000000000000,0.0000000001505,0.0000000000685,0.0000000009909,0.0000000001543,0.0000000000000,0.0000000000000,0.0000000000000)
b<-c(0.0000000000000,0.0000000000000,0.0000000000000,0.0000000000000,0.0000000000000,0.0000000000000,0.0000000000000,0.0000000000000,0.0000000000000,0.0000000000000,0.0000000000000,0.0000000000000,0.0000000000000,0.0000000000000,0.0000000002228,0.0000000000000,0.0000000001375,0.0000000000000,0.0000000000000)
c<-c(0.0000000000000,0.0000000000000,0.0000000000000,0.0000000000000,0.0000000000546,0.0000000000000,0.0000000000000,0.0000000002293,0.0000000000000,0.0000000000000,0.0000000000000,0.0000000000000,0.0000000000000,0.0000000000540,0.0000000002085,0.0000000000000,0.0000000000000,0.0000000000000,0.0000000000000)
subset<-data.frame(a,b,c)
CorrelationSmall<-rcorr(as.matrix(NetworkAnalysisSmall),type=c("pearson"))
CCsmall<-CorrelationSmall$r
CCsmallAbs<-abs(CCsmall)
dist3 <- function(x) {
  as.dist(rcorr(as.matrix(x), type = "pearson"))
}
NetworkSimprof <- simprof(NetworkAnalysisSmall, num.expected = 1000, num.simulated = 1000, method.cluster = "ward", method.distance = dist3, method.transform = "log", alpha = 0.05, sample.orientation = "column")
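For what it's worth, the nesting the question is after can be sketched language-agnostically. Below is a minimal Python/NumPy analogue; in R the equivalent nesting would be along the lines of as.dist(1 - abs(rcorr(as.matrix(x), type = "pearson")$r)). Note that a common convention is to use 1 - |r| as the distance, so strongly correlated columns end up close together; adjust if you really want |r| itself. The data shape here is a stand-in, not the real file.

```python
import numpy as np
from scipy.spatial.distance import squareform

def abs_corr_dist(x):
    """Distance between columns based on |Pearson r|, here 1 - |r|."""
    r = np.corrcoef(x, rowvar=False)    # Pearson correlation of the columns
    d = 1.0 - np.abs(r)                 # strong +/- correlation -> small distance
    np.fill_diagonal(d, 0.0)            # remove floating-point residue on the diagonal
    return squareform(d, checks=False)  # condensed vector, like R's dist object

rng = np.random.default_rng(0)
x = rng.random((35, 5))                 # stand-in for the 35 x 2146 data
d = abs_corr_dist(x)                    # one value per pair of columns
```

The whole point is that the correlation call, the abs, and the distance conversion all live inside the one function that the clustering routine receives.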

Related

Outcome difference: using list & for-loop vs. single parameter input

This is my first question, so please let me know if I'm not giving enough details or asking a question that is not relevant on this platform!
I want to compute the same formula over a grid running from 0 to 4.0209, so I'm using a for-loop over an array defined with numpy.
To be certain the for-loop is right, I've computed a selection of values by plugging specific values for the radius into the formula.
Now, the outcomes for the same radius input are slightly different. Am I interpreting my grid wrongly, or is there an error in my script?
It's probably something pretty straightforward, but maybe some of you can find a minute to help me out.
Here I use a selection of values for my radius parameter.
Here I use a for-loop to compute over a distance
Here are the differences in the outcomes:
Outcomes computed with for-loop:
9443.086753902220000000
1935.510475232510000000
57.174050755727700000
1.688894026484580000
0.020682674424032700
Outcomes computed with selected radii:
9444.748178731630000000
1938.918526458330000000
57.476599453309800000
1.703815523775800000
0.020957378277984600
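A first thing to check is whether the grid points actually coincide with the hand-picked radii; with floating-point grids they usually don't, exactly, and any mismatch feeds straight into the formula. A minimal diagnostic sketch (the step size and radii below are assumptions, since the grid definition isn't shown in the question):

```python
import numpy as np

grid = np.arange(0.0, 4.0209, 0.1)              # assumed step size
selected = np.array([0.5, 1.0, 2.0, 3.0, 4.0])  # hypothetical hand-picked radii

for r in selected:
    nearest = grid[np.argmin(np.abs(grid - r))]
    # A nonzero difference here produces slightly different outcomes
    # for what looks like the "same" radius.
    print(r, nearest, nearest - r)

# Floating point is the usual suspect for such mismatches:
print(0.1 + 0.2 == 0.3)            # False
print(np.isclose(0.1 + 0.2, 0.3))  # True
```

If the differences printed above are tiny but your outcome differences are not, the discrepancy is more likely in the formula inside the loop than in the grid itself.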

determine optimal cut-off value for data (in matlab)

I realize this is an unspecific question (because I don't know a lot about the topic; please help me in this regard). That said, here's the task I'd like to achieve:
Find a statistically sound algorithm to determine an optimal cut-off value for binarizing a vector, i.e. filtering out minimal values (getting rid of them). Here's MATLAB code to visualize the problem:
randomdata = rand(1,100);
figure; plot(randomdata); % plot random data between 0 and 1
cutoff = 0.5;             % cut-off value to visualize
line(get(gca,'xlim'), [cutoff cutoff], 'Color', 'red');
Thanks
You could try using MATLAB's percentile function, prctile:
cutoff = prctile(randomdata,10);
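The same idea in Python with NumPy, where np.percentile plays the role of prctile (the 10th percentile below mirrors the answer above and is just as much of a free choice there as here):

```python
import numpy as np

rng = np.random.default_rng(0)
randomdata = rng.random(100)            # analogue of rand(1,100)

cutoff = np.percentile(randomdata, 10)  # analogue of prctile(randomdata, 10)
kept = randomdata[randomdata > cutoff]  # drop roughly the lowest 10% of values
```

A percentile-based cut-off adapts to the data's own distribution, which is usually more defensible than a fixed threshold like 0.5.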

How to fit and calculate the average for data sets in xmgrace?

I have a data set that I would like to fit to the function Y = a + (1-a)exp(-x/T) to get the T value.
I want to do this using Xmgrace but I do not know how.
Thanks for your suggestions.
In the xmgrace window, click Data → Transformations → Non-linear curve fitting.
On formula section, type in
a0+(1-a0)*exp(-x/a1)
You have 2 parameters, a0 and a1. In the parameters section, select 2. Make an initial guess and set the range, tolerance and number of iterations; normally the default values suffice for tolerance and iterations.
Hit Apply. Keep hitting it until a good fit is obtained.
Note - A good guess of initial parameters will help you get a good fit faster.
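The same fit can be cross-checked outside xmgrace, for instance with scipy's curve_fit; a sketch on synthetic data with assumed true values a = 0.2 and T = 1.5 (p0 is the initial guess, playing the same role as the guess in the xmgrace dialog):

```python
import numpy as np
from scipy.optimize import curve_fit

# The model from the question: y = a + (1 - a) * exp(-x / T)
def model(x, a, T):
    return a + (1.0 - a) * np.exp(-x / T)

# Synthetic noisy data generated from assumed true values a = 0.2, T = 1.5
x = np.linspace(0.0, 10.0, 200)
y = model(x, 0.2, 1.5) + np.random.default_rng(1).normal(0.0, 0.01, x.size)

popt, _ = curve_fit(model, x, y, p0=[0.5, 1.0])  # p0 = initial guess for (a, T)
a_fit, T_fit = popt
```

As with xmgrace, a reasonable initial guess helps the optimizer converge to the right minimum.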

Find probability of the given data set: with what probability can I say it is bad?

I have an issue where there is a data set with good and bad categories, and within those categories there are a few elements that can be good or bad.
You can see the Venn diagram I attached to get a view of the data set I have. I'd be really glad if you could help me out.
I am really new to probability and math, yet I have a project where, in the middle of it, I have to find a way to say whether a given data set is bad or good depending on the data.
What probability theory can I use, and how? Please give an example using my data set. Thank you.
E.g. if I get a data set in which elements A, D and E are present, with what probability can I say it is bad?
A function which gives a good / bad result is called a classification function. For any data set, there are many ways to construct a classification function. See, for example, "Pattern Recognition and Neural Networks" by Brian Ripley.
One way which is easy to understand is the so-called quadratic discriminant. It is easy to describe: (1) construct a Gaussian density for each category (good, bad, etc). (2) output the category for which a new input has the greatest probability.
For (1), just compute the mean and covariance matrix for the data in each category. That gives you p(x | category).
For (2), choose the category such that p(category | x) is greatest. Note p(category | x) = p(x | category) p(category) / sum_i p(x | category_i) p(category_i), where p(category) is just (number of data in the category) / (total number of data). If you work with logarithms, you can simplify the calculations somewhat.
Such a function can be constructed in a very few lines of a programming language which has matrix operations, such as Octave or R.
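That recipe does indeed fit in a few lines; here is a sketch in Python with made-up two-dimensional toy data (the class names, cluster centers and seed are all illustrative, not taken from the question):

```python
import numpy as np

def fit_qda(X_by_class):
    """Step (1): per-category mean, covariance and prior p(category)."""
    n_total = sum(len(X) for X in X_by_class.values())
    params = {}
    for label, X in X_by_class.items():
        X = np.asarray(X, dtype=float)
        params[label] = (X.mean(axis=0), np.cov(X, rowvar=False), len(X) / n_total)
    return params

def log_gauss(x, mean, cov):
    """Log of the Gaussian density p(x | category)."""
    diff = x - mean
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (len(mean) * np.log(2 * np.pi) + logdet
                   + diff @ np.linalg.solve(cov, diff))

def classify(x, params):
    """Step (2): the category maximizing log p(x|category) + log p(category)."""
    x = np.asarray(x, dtype=float)
    scores = {label: log_gauss(x, m, c) + np.log(prior)
              for label, (m, c, prior) in params.items()}
    return max(scores, key=scores.get)

# Toy data: "good" points clustered near (0, 0), "bad" points near (3, 3)
rng = np.random.default_rng(0)
params = fit_qda({"good": rng.normal(0, 1, (50, 2)),
                  "bad": rng.normal(3, 1, (50, 2))})
label = classify([0.1, -0.2], params)
```

The denominator in Bayes' rule is skipped here because it is the same for every category; comparing log p(x | category) + log p(category) is enough to pick the winner.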

Random distribution of data

How do I distribute a small amount of data in a random order in a much larger volume of data?
For example, I have several thousand lines of 'real' data, and I want to insert a dozen or two lines of control data in a random order throughout the 'real' data.
Now, I am not trying to ask how to use random number generators; I am asking a statistical question. I know how to generate random numbers, but my question is how I ensure that the data is inserted in a random order while at the same time being fairly evenly scattered through the file.
If I just rely on generating random numbers there is a possibility (albeit a very small one) that all my control data, or at least clumps of it, will be inserted within a fairly narrow selection of 'real' data. What is the best way to stop this from happening?
To phrase it another way, I want to insert control data throughout my real data without there being a way for a third party to calculate which rows are control and which are real.
Update: I have made this a 'community wiki' so if anyone wants to edit my question so it makes more sense then go right ahead.
Update: Let me try an example (I do not want to make this language- or platform-dependent, as it is not a coding question; it is a statistical question).
I have 3000 rows of 'real' data (this amount will change from run to run, depending on the amount of data the user has).
I have 20 rows of 'control' data (again, this will change depending on the number of control rows the user wants to use, anything from zero upwards).
I now want to insert these 20 'control' rows roughly after every 150 rows of 'real' data (3000/20 = 150). However, I do not want it to be exactly that regular, as I do not want the control rows to be identifiable simply by their location in the output data.
Therefore I do not mind some of the 'control' rows being clumped together or for there to be some sections with very few or no 'control' rows at all, but generally I want the 'control' rows fairly evenly distributed throughout the data.
There's always a possibility that they end up close to each other if you do it truly at random :)
What I would do is this:
You have N rows of real data and x rows of control data.
For the index of the row at which you should insert the i-th control row, I'd use: N/(x+1) * i + r, where r is some random number, different for each of the control rows and small compared to N/x. Choose any way of determining r; it can be a gaussian or even a flat distribution. Here i is the index of the control row, so 1 <= i <= x.
This way you can be sure to avoid a condensation of your control rows in one single place. You can also be sure that they won't be at regular distances from each other.
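A minimal sketch of that N/(x+1) * i + r rule (the jitter width and seed are arbitrary choices, with the jitter kept small compared to N/x as the answer requires):

```python
import random

def control_positions(N, x, jitter, seed=None):
    """Indices at which to insert x control rows into N real rows,
    following the rule N/(x+1) * i + r for i = 1..x."""
    rnd = random.Random(seed)
    spacing = N / (x + 1)
    positions = []
    for i in range(1, x + 1):
        r = rnd.uniform(-jitter, jitter)   # flat jitter, small compared to N/x
        positions.append(int(spacing * i + r))
    return positions

pos = control_positions(3000, 20, jitter=40, seed=1)
```

Because the jitter is smaller than half the spacing, consecutive positions can never swap order or pile up in one spot, which is exactly the guarantee described above.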
Here's my thought. Why don't you just loop through the existing rows and "flip a coin" for each row to decide whether you will insert random data there.
for (int i = 0; i < numberOfExistingRows; i++)
{
    double r = random();   // uniform value in [0, 1)
    if (r > 0.5)
    {
        InsertRandomData();
    }
}
This should give you a nice random distribution throughout the data.
Going with the 3000 real data rows and 20 control rows for the following example (I'm better with an example than with English):
If you were to spread the 20 control rows as evenly as possible between the 3000 real data rows you'd insert one at each 150th real data row.
So pick that number, 150, for the next insertion index.
a) Generate a random number between 0 and 150 and subtract it from the insertion index
b) Insert the control row there.
c) Increase insertion index by 150
d) Repeat at step a)
Of course this is a very crude algorithm and it needs a few improvements :)
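A direct transcription of steps a) to d) above (the 3000 and 20 come from the example; the seed is arbitrary):

```python
import random

def insertion_points(n_real, n_control, seed=None):
    rnd = random.Random(seed)
    step = n_real // n_control            # 150 for 3000 rows and 20 controls
    points, index = [], step
    for _ in range(n_control):
        p = index - rnd.randrange(step)   # a) back off by a random 0..step-1
        points.append(p)                  # b) insert the control row here
        index += step                     # c) advance the insertion index
    return points                         # d) the loop repeats a)-c)

pts = insertion_points(3000, 20, seed=7)
```

One quirk worth noting: each control row always lands somewhere in "its" 150-row window, so the positions stay ordered and exactly one control row falls per window, which is more regular than the questioner may want.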
If the real data is large or much larger than the control data, just generate interarrival intervals for your control data.
So pick a random interval, copy out that many lines of real data, insert control data, repeat until finished. How to pick that random interval?
I'd recommend using a gaussian deviate with the mean set to the real data size divided by the control data size, the former of which could be estimated if necessary rather than measured or assumed known. Set the standard deviation of this gaussian based on how much "spread" you're willing to tolerate: a smaller stddev (a more leptokurtic distribution) means tighter adherence to uniform spacing, while a larger stddev (a more platykurtic distribution) means looser adherence.
Now what about the first and last sections of the file? That is: what about an insertion of control data at the very beginning or very end? One thing you can do is to come up with special-case estimates for these... but a nice trick is as follows: start your "index" into the real data at minus half the gaussian mean and generate your first deviate. Don't output any real data until your "index" into the real data is legit.
A symmetric trick at the end of the data should also work quite well (simply keep generating deviates until you reach an "index" at least half the gaussian mean beyond the end of the real data; if the index just before this was off the end, insert the control data at the end).
You want to look at more than just statistics: it's helpful in developing an algorithm for this sort of thing to look at rudimentary queueing theory. See Wikipedia or the Turing Omnibus, which has a nice, short chapter on the subject titled "Simulation".
Also: in some circumstance non-gaussian distributions, particularly the Poisson distribution, give better, more natural results for this sort of thing. The algorithm outline above still applies using half the mean of whatever distribution seems right.
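A sketch of the gaussian-interarrival idea in Python (the spread fraction and seed are assumed knobs; swap in Poisson-distributed gaps if that fits the data better, as suggested above):

```python
import numpy as np

def insert_control(real_rows, control_rows, spread=0.3, seed=None):
    """Merge control rows into real rows using gaussian interarrival gaps.

    Mean gap = len(real_rows) / len(control_rows); `spread` scales the
    standard deviation as a fraction of the mean gap (assumed parameter).
    """
    rng = np.random.default_rng(seed)
    mean_gap = len(real_rows) / len(control_rows)
    out, i = [], 0
    pos = -mean_gap / 2.0   # start half a mean gap "before" the data so a
                            # control row can plausibly land near the start
    for ctrl in control_rows:
        pos += rng.normal(mean_gap, spread * mean_gap)
        cut = min(max(int(round(float(pos))), i), len(real_rows))
        out.extend(real_rows[i:cut])   # copy real rows up to the gap
        out.append(ctrl)               # then drop in one control row
        i = cut
    out.extend(real_rows[i:])          # remaining real rows
    return out

rows = insert_control(list(range(3000)), [f"c{k}" for k in range(20)], seed=42)
```

The clamping to the current index keeps the output well-formed even on the rare draw where a gap comes out near zero or negative.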