Correlation with covariate in R

I have two files. File1 contains gene expression data: 300 samples (rows) and ~50,000 genes (columns). File2 contains MRI data for the same 300 samples, with ~100 columns. I want to correlate the GE data with the MRI data while controlling for a covariate, disease status (File3, status 0 or 1).
I have tried the ppcor package; for two variables it worked:
pcor.test(file1$gene1, file2$var1, file3$status, method="pearson")
but I want to run it for all variables, so I merged everything into one data frame, with the last column being status:
sapply(1:(ncol(test) - 1), function(x) sapply(1:(ncol(test) - 1), function(y) {
  if (x == y) 1
  else pcor.test(test[, x], test[, y], test[, ncol(test)])$estimate
}))
but had this error:
Error in solve.default(cvx) :
system is computationally singular: reciprocal condition number = 6.36939e-18
Am I doing this correctly? Is this a good method?
Thank you for any suggestions and help.
Georg
File1
gene1 gene2 gene3 ... gene50,000
Sample1 12 300 70 4000
Sample2 25 100 53 4500
Sample3 70 30 71 2000
...
Sample300 18 200 97 1765
File2
var1 var2 var3 ... var100
Sample1 5 1 170 200
Sample2 7 3 153 100
Sample3 7 18 130 34
...
Sample300 18 54 197 71
File3-STATUS
status
Sample1 1
Sample2 1
Sample3 0
...
Sample300 1
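A minimal sketch of an alternative loop (untested; it assumes File1, File2 and File3 are already loaded as data frames named file1, file2 and file3, with the samples in the same row order) that pairs each gene column only with each MRI column, rather than correlating every column of the merged matrix with every other column:
library(ppcor)
# Partial correlation of every gene (columns of file1) with every MRI variable
# (columns of file2), controlling for disease status (file3$status).
res <- sapply(seq_len(ncol(file2)), function(j) {
  sapply(seq_len(ncol(file1)), function(i) {
    pcor.test(file1[, i], file2[, j], file3$status, method = "pearson")$estimate
  })
})
rownames(res) <- colnames(file1)  # genes
colnames(res) <- colnames(file2)  # MRI variables
The "computationally singular" error typically points to duplicated or collinear columns in the merged matrix; restricting the pairs this way at least avoids correlating a variable with itself or with the status column, although with ~50k genes the double loop will be slow.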

Related

Matching column numbers from two different txt files

I have two text files of different sizes. The first one below, example1.txt, has only one column of numbers:
101
102
103
104
111
120
120
125
131
131
131
131
131
131
And the second text file, example2.txt, has two columns:
101 3
102 3
103 3
104 4
104 4
111 5
120 1
120 1
125 2
126 2
127 2
128 2
129 2
130 2
130 2
130 2
131 10
131 10
131 10
131 10
131 10
131 10
132 10
The first column in example1.txt is a subset of column one in example2.txt. The second-column numbers in example2.txt are the values associated with the first column.
What I want is to get the associated second column for example1.txt, following example2.txt. I have tried but couldn't figure it out yet. Any suggestions or solutions in bash or awk would be appreciated.
Therefore the result would be:
101 3
102 3
103 3
104 4
111 5
120 1
120 1
125 2
131 10
131 10
131 10
131 10
131 10
131 10
UPDATE:
I have been trying to do the column matching like this:
awk -F'|' 'NR==FNR{c[$1]++;next};c[$1] > 0' example1.txt example2.txt > output.txt
In both files, the first column is in ascending order, but the frequency of the same numbers may differ. For example, 104 appears once in example1.txt but twice in example2.txt. The important thing is that the associated second-column value should be the same for example1.txt too. Just see the expected output above.
$ awk 'NR==FNR{a[$1]++; next} ($1 in a) && b[$1]++ < a[$1]' f1 f2
101 3
102 3
103 3
104 4
111 5
120 1
120 1
125 2
131 10
131 10
131 10
131 10
131 10
131 10
This solution doesn't make use of the fact that the first column is in ascending order. Perhaps some optimization can be done based on that.
($1 in a) && b[$1]++ < a[$1] is the main difference from your solution. This checks if the field exists as well as that the count doesn't exceed that of the first file.
Also, not sure why you set the field separator as | because there is no such character in the sample given.
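To run it against the files named in the question and save the result (no -F'|' is needed, since the files are whitespace-separated), you could use something like:
awk 'NR==FNR{a[$1]++; next} ($1 in a) && b[$1]++ < a[$1]' example1.txt example2.txt > output.txt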

Get the summation of values at column level in a text file using shell

My query is simple: get the column-level sums of values in a text file using the shell, i.e. add a new record to the text file which contains the sums.
For example here below:
name usersToday usersTommorrow
Data1 92 181
DataTwo 5 7
Something 61 73
Something_with_long_name 0 0
the desired output is
name usersToday usersTommorrow
Data1 92 181
DataTwo 5 7
Something 61 73
Something_with_long_name 0 0
Total 158 262
Please note that a new data column will be appended to the text file every day.
So on day 2, after the summation commands have run, the file will look like:
name usersToday day2
Data1 92 181
DataTwo 5 7
Something 61 73
Something_with_long_name 0 0
Total 158 262
On day 3, after the new data is appended, the file will look like:
name usersToday day2 day3
Data1 92 181 52
DataTwo 5 7 53
Something 61 73 25
Something_with_long_name 0 0 26
Total 158 262
so the summation for day3 needs to be computed and added to the Total row as well.
Considering that your actual Input_file will be the same as the samples shown, could you please try the following:
awk 'FNR>1 && NF{first+=$2;second+=$3} 1; END{print "Total "first,second}' Input_file
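If the number of day columns keeps growing, a more generic variant (a sketch, assuming the header is the first line and that any existing Total row should be recomputed rather than summed) can loop over all fields instead of hard-coding $2 and $3:
awk 'NR==1{print; next} $1=="Total"{next} NF{if(NF>max)max=NF; for(i=2;i<=NF;i++)s[i]+=$i; print} END{printf "Total"; for(i=2;i<=max;i++)printf " %s",s[i]; print ""}' Input_file
Redirect the output to a temporary file and move it back over Input_file to update the file in place.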

Max of all columns based on distinct first column

How do I change the code if I have more than two columns? Let's say the data is like this:
ifile.dat
1 10 15
3 34 20
1 4 22
3 32 33
5 3 46
2 2 98
4 20 100
3 13 23
4 50 65
1 40 76
2 20 22
How do I achieve this?
ofile.dat
1 40 76
2 20 98
3 34 33
4 50 100
5 3 46
I mean the max of each column, grouped by the first column. Thanks.
Here is what I have tried (on a sample file with 13 columns), but the highest value is not coming up this way.
cat input.txt | sort -k1,1 -k2,2nr -k3,3nr -k4,4nr -k5,5nr -k6,6nr -k7,7nr -k8,8nr -k9,9nr -k10,10nr -nrk11,11 -nrk12,12 -nrk13,13 | sort -k1,1 -u
Instead of relying on sort, you could switch over to something more robust like awk:
awk 'BEGIN{PROCINFO["sorted_in"] = "#val_num_asc"} {for(i=2;i<=NF;++i) if (a[$1][i]<$i){a[$1][i]=$i}} END{n=asorti(a, asorted); for(col1 in asorted){print col1, a[col1][2], a[col1][3]}}' input.txt
That's a mouthful. It breaks down like:
Before processing the file set the PROCINFO setting of sorted_in to #val_num_asc since we will be sorting the contents of an array by its index which will be numeric: (BEGIN{PROCINFO["sorted_in"] = "#val_num_asc"})
Loop through each field in the record for(i=2;i<=NF;++i)
Test to see if the current value is greater than the stored value in the array, which has as its index the first field's value. If it is greater, then store it in place of whatever is currently held in the array for that index at that field's position: (if (a[$1][i]<$i){a[$1][i]=$i})
When done processing all of the records, sort the array into new array asorted: (END{n=asorti(a, asorted);)
Iterate through the array and print each element (for(col1 in asorted){print col1, a[col1][2], a[col1][3]})
There may be a more elegant way to do this in awk, but this will do the trick.
Example:
:~$ cat test
1 10 15
3 34 20
1 4 22
3 32 33
5 3 46
2 2 98
4 20 100
3 13 23
4 50 65
1 40 76
2 20 22
:~$ awk 'BEGIN{PROCINFO["sorted_in"] = "#val_num_asc"} {for(i=2;i<=NF;++i) if (a[$1][i]<$i){a[$1][i]=$i}} END{n=asorti(a, asorted); for(col1 in asorted){print col1, a[col1][2], a[col1][3]}}' test
1 40 76
2 20 98
3 34 33
4 50 100
5 3 46
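For the 13-column case mentioned in the question, a variant that loops over every field instead of hard-coding columns 2 and 3 might look like this (a sketch using standard awk SUBSEP arrays, with an external sort to order the output by the first column):
awk '{keys[$1]=1; if(NF>n)n=NF; for(i=2;i<=NF;i++) if(!(($1,i) in m) || $i>m[$1,i]) m[$1,i]=$i} END{for(k in keys){line=k; for(i=2;i<=n;i++) line=line OFS m[k,i]; print line}}' ifile.dat | sort -n -k1,1
It assumes every row has the same number of fields; on the sample above it produces the same five lines as ofile.dat.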

Remove rows that have a specific numeric value in a field

I have a very bulky file, about 1M lines, like this:
4001 168991 11191 74554 60123 37667 125750 28474
8 145 25 101 83 51 124 43
2985 136287 4424 62832 50788 26847 89132 19184
3 129 14 101 88 61 83 32 1 14 10 12 7 13 4
6136 158525 14054 100072 134506 78254 146543 41638
1 40 4 14 19 10 35 4
2981 112734 7708 54280 50701 33795 75774 19046
7762 339477 26805 148550 155464 119060 254938 59592
1 22 2 12 10 6 17 2
6 136 16 118 184 85 112 56 1 28 1 5 18 25 40 2
1 26 2 19 28 6 18 3
4071 122584 14031 69911 75930 52394 89733 30088
1 9 1 3 4 3 11 2 14 314 32 206 253 105 284 66
I want to remove rows that have a value less than 100 in the second column.
How can I do this with sed?
I would use awk to do this. Example:
awk ' $2 >= 100 ' file.txt
This will only display the rows from file.txt whose second column ($2) is greater than or equal to 100.
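If you want to keep the result, redirect it to a new file (and, if desired, move it back over the original), for example:
awk ' $2 >= 100 ' file.txt > filtered.txt && mv filtered.txt file.txt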
Use the following approach:
sed -E '/^\w+\s+([0-9]{1,2}|[0][0-9]+)\b/d' /tmp/test.txt
(replace /tmp/test.txt with your current file path)
([0-9]{1,2}|[0][0-9]+) - will match either a number from 0 to 99 OR digits with a leading zero (e.g. 012, 00982)
d - delete the pattern space;
-E (--regexp-extended) - use extended regular expressions rather than basic regular expressions
To remove matched lines in place use -i option:
sed -i -E '/^\w+\s+([0-9]{1,2}|[0][0-9]+)\b/d' /tmp/test.txt

Difference of column values from operations done on 2 different files

I have the following 2 files, file1.txt and file2.txt, with the data given below.
Data in file1.txt
125
125
295
295
355
355
355
Data in file2.txt
125
125
295
355
I performed the operations below on the files and got the following output.
Operation1-
sort file1.txt | uniq -c
2 125
2 295
3 355
Operation2-
sort file2.txt | uniq -c
2 125
1 295
1 355
Now, I want the following output using the results of Operation1 and Operation2.
I want to compare the results of Operation1 and Operation2 and get output that shows the difference of the column 1 values from both files, while showing column 2 as it is, as given below:
0 125
1 295
2 355
Redirect the output of Operation1 and Operation2 to files, say file1 and file2, then write this:
paste file1 file2 | awk '{print $1-$3,$2}'
and you will have this output:
0 125
1 295
2 355
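If you'd rather not write the intermediate files at all, bash process substitution can feed both uniq -c results straight into paste (a sketch, assuming a bash shell and that both operations yield the same values in the same order, as in the example):
paste <(sort file1.txt | uniq -c) <(sort file2.txt | uniq -c) | awk '{print $1-$3, $2}'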
