I have this huge table with ~200k lines and several columns (tab separated). I'd like to pick lines according to the value of a particular column, $4, so that their values are spaced by at least 100, but also considering the value in column $3, i.e.:
id tag xxx position score
id_1 aaaaaaa bbbbb 3190 1
id_2 aaaaaaa bbbbb 3199 1
id_3 aaaaaaa bbbbb 3300 1
id_4 aaaaaaa bbbbb 3350 1
id_5 aaaaaaa ccccc 100 1
id_6 aaaaaaa ccccc 500 1
id_7 aaaaaaa ccccc 550 1
id_8 aaaaaaa ccccc 599 1
To get something like this:
id tag block position score
id_1 aaaaaaa bbbbb 3190 1
id_3 aaaaaaa bbbbb 3300 1
id_5 aaaaaaa ccccc 100 1
id_6 aaaaaaa ccccc 500 1
Some time ago #hek2mgl helped me to filter a huge table according to the distance between values using this code:
awk 'NR<3; NR==2{pv=$4} NR>2 && ($4-pv>=100){print;pv=$4}' file
However, this code doesn't consider $3, which I now need to take into account to avoid creating a new file for each block. Would this be possible? It's a bit complicated, since the values in $4 are not consecutive if they don't belong to the same block ($3).
Thanks
awk to the rescue!
Just qualify the previous values with $3.
$ awk 'NR<3; NR==2{pv[$3]=$4} NR>2 && ($4-pv[$3]>=100){print;pv[$3]=$4}' file
id tag xxx position score
id_1 aaaaaaa bbbbb 3190 1
id_3 aaaaaaa bbbbb 3300 1
id_5 aaaaaaa ccccc 100 1
id_6 aaaaaaa ccccc 500 1
i.e. change pv to pv[$3]. You can pipe the output to column -t to get a better format, or change print to printf.
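For example, the same command piped through column -t should produce something like:
$ awk 'NR<3; NR==2{pv[$3]=$4} NR>2 && ($4-pv[$3]>=100){print;pv[$3]=$4}' file | column -t
id    tag      xxx    position  score
id_1  aaaaaaa  bbbbb  3190      1
id_3  aaaaaaa  bbbbb  3300      1
id_5  aaaaaaa  ccccc  100       1
id_6  aaaaaaa  ccccc  500       1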
If you want a fixed column size, why not use a simple printf?
spc=10
while read -r a b c d e; do
    printf "%-${spc}s %-${spc}s %-${spc}s %-${spc}s %-${spc}s\n" "$a" "$b" "$c" "$d" "$e"
done < file
spc defines the minimum width of each column (printf's %-Ns left-justifies the value in a field N characters wide).
Outputs :
id tag xxx position score
id_1 aaaaaaa bbbbb 3190 1
id_2 aaaaaaa bbbbb 3199 1
id_3 aaaaaaa bbbbb 3300 1
id_4 aaaaaaa bbbbb 3350 1
id_5 aaaaaaa ccccc 100 1
id_6 aaaaaaa ccccc 500 1
id_7 aaaaaaa ccccc 550 1
id_8 aaaaaaa ccccc 599 1
I have two files. If the column "chromosome" matches between the two files and the position of File1 is between the Start_position and the End_position of File2, I would like to associate the two cell_frac values. If the Gene (chromosome + position) is not present in File2, I would like both cell_frac values to be equal to 0.
File1:
Hugo_Symbol Chromosome Position
Gene1 1 111111
Gene2 1 222222
Gene3 2 333333
Gene4 2 333337
File2:
Chromosome Start_Position End_Position cell_frac_A1 cell_frac_A2
1 222220 222230 0.12 0.01
2 333330 333340 0.03 0.25
3 444440 444450 0.01 0.01
Desired output:
Hugo_Symbol Chromosome Position cell_frac_A1 cell_frac_A2
Gene1 1 111111 0 0
Gene2 1 222222 0.12 0.01
Gene3 2 333333 0.03 0.25
Gene4 2 333337 0.03 0.25
Edit: Here is the beginning of the code I used for now (not correct output):
awk '
NR==FNR{ range[$1,$2,$3,$4,$5]; next }
FNR==1
{
for(x in range) {
split(x, check, SUBSEP);
if($2==check[1] && $3>=check[2] && $3<=check[3]) { print $1"\t"$2"\t"$3"\t"check[4]"\t"check[5]}
}
}
' File2 File1
However, I did not manage to associate a 0 (with "else") when the gene was not present, and I get the wrong number of lines. Can you give me more hints?
Thanks a lot.
One awk-only idea ...
NOTE: see my other answer for assumptions/understandings and my version of file1
awk ' # process file2
FNR==NR { c=$1 # save chromosome value
$1="" # clear field #1
file2[c]=$0 # use chromosome as array index; save line in array
next
}
# process file1
{ startpos=endpos=-9999999999 # default values in case
a1=a2="" # no match in file2
if ($2 in file2) { # if chromosome also in file2
split(file2[$2],arr) # split file2 data into array arr[]
startpos =arr[1]
endpos =arr[2]
a1 =arr[3]
a2 =arr[4]
}
# if not the header row and file1/position outside of file2/range then set a1=a2=0
if (FNR>1 && ($3 < startpos || $3 > endpos)) a1=a2=0
print $0,a1,a2
}
' file2 file1
This generates:
Hugo_Symbol Chromosome Position cell_frac_A1 cell_frac_A2
Gene2 1 222222 0.12 0.01
Gene3 2 333333 0.03 0.25
Gene1 1 111111 0 0
Gene4 2 333337 0.03 0.25
Gene5 4 444567 0 0
Changing the last line to ' file2 file1 | column -t generates:
Hugo_Symbol Chromosome Position cell_frac_A1 cell_frac_A2
Gene2 1 222222 0.12 0.01
Gene3 2 333333 0.03 0.25
Gene1 1 111111 0 0
Gene4 2 333337 0.03 0.25
Gene5 4 444567 0 0
Presorting file1 by Chromosome and Position by changing last line to ' file2 <(head -1 file1; tail -n +2 file1 | sort -k2,3) | column -t generates:
Hugo_Symbol Chromosome Position cell_frac_A1 cell_frac_A2
Gene1 1 111111 0 0
Gene2 1 222222 0.12 0.01
Gene3 2 333333 0.03 0.25
Gene4 2 333337 0.03 0.25
Gene5 4 444567 0 0
One big issue (same as with my other answer) ... the actual code may become unwieldy when dealing with 519 total columns, especially if there's a need to intersperse a lot of columns; otherwise OP may be able to use some for loops to more easily print ranges of columns, for example:
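A rough sketch of that for-loop idea (the column ranges here are placeholders, not OP's real layout):
awk '{ out=""
       for (i=1; i<=10; i++)    out = out $i OFS     # first block of columns
       for (i=250; i<=260; i++) out = out $i OFS     # second block of columns
       sub(OFS "$", "", out)                         # drop the trailing separator
       print out
     }' file1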
A job for SQL instead of awk, perhaps?
tr -s ' ' '|' <File1 >file1.csv
tr -s ' ' '|' <File2 >file2.csv
(
echo 'Hugo_Symbol|Chromosome|Position|cell_frac_A1|cell_frac_A2'
sqlite3 <<'EOD'
.import file1.csv t1
.import file2.csv t2
select distinct
t1.hugo_symbol,
t1.chromosome,
t1.position,
case
when t1.position between t2.start_position and t2.end_position
then t2.cell_frac_a1
else 0
end,
case
when t1.position between t2.start_position and t2.end_position
then t2.cell_frac_a2
else 0
end
from t1 join t2 on t1.chromosome=t2.chromosome;
EOD
rm file[12].csv
) | tr '|' '\t'
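The tr calls just turn the whitespace-separated tables into pipe-separated files, which sqlite3's .import can read with its default list-mode separator; for the sample data, file1.csv starts like this:
$ tr -s ' ' '|' <File1
Hugo_Symbol|Chromosome|Position
Gene1|1|111111
Gene2|1|222222
Gene3|2|333333
Gene4|2|333337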
Assumptions and/or understandings based on sample data and OP comments ...
file1 is not sorted by chromosome
file2 is sorted by chromosome
common headers in both files are spelled the same (eg, no mismatch like file1:Chromosome vs file2:Chromosom)
if a chromosome exists in file1 but does not exist in file2 then we keep the line from file1 and the columns from file2 are left blank
both files are relatively small (file1: 5MB, 900 lines; file2: few KB, 50 lines)
NOTE: the number of columns (file1: 500 columns; file2: 20 columns) could be problematic from the point of view of cumbersome coding ... more on that later ...
Sample inputs:
$ cat file1 # scrambled chromosome order; added chromosome=4 line
Hugo_Symbol Chromosome Position
Gene1 1 111111
Gene3 2 333333
Gene2 1 222222
Gene4 2 333337
Gene5 4 444567 # has no match in file2
$ cat file2
Chromosome Start_Position End_Position cell_frac_A1 cell_frac_A2
1 222220 222230 0.12 0.01
2 333330 333340 0.03 0.25
3 444440 444450 0.01 0.01
The first issue is to sort file1 by Chromosome and Position while keeping the header line in place:
$ (head -1 file1; tail -n +2 file1 | sort -k2,3)
Hugo_Symbol Chromosome Position
Gene1 1 111111
Gene2 1 222222
Gene3 2 333333
Gene4 2 333337
Gene5 4 444567
We can now join the 2 files based on the Chromosome column:
$ join -1 2 -2 1 -a 1 --nocheck-order <(head -1 file1; tail -n +2 file1 | sort -k2,3) file2
Chromosome Hugo_Symbol Position Start_Position End_Position cell_frac_A1 cell_frac_A2
1 Gene1 111111 222220 222230 0.12 0.01
1 Gene2 222222 222220 222230 0.12 0.01
2 Gene3 333333 333330 333340 0.03 0.25
2 Gene4 333337 333330 333340 0.03 0.25
4 Gene5 444567
Where:
-1 2 -2 1 - join on Chromosome columns: -1 2 == file #1 column #2; -2 1 == file #2 column #1
-a 1 - keep columns from file #1 (sorted file1)
--nocheck-order - disable verifying input is sorted by join column; optional; may be needed if a locale thinks 1 should be sorted before Chromosome
NOTE: for the sample inputs/outputs we don't need a special output format so we can skip the -o option, but this may be needed depending on OP's output requirements for 519 total columns (but it may also become unwieldy)
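For reference, a sketch of explicit output formatting with -o (field numbers assume the 5-column sample files; -e 0 fills the fields that are missing when a chromosome has no match in file2):
$ join -1 2 -2 1 -a 1 -e 0 -o 1.1,0,1.3,2.4,2.5 --nocheck-order <(head -1 file1; tail -n +2 file1 | sort -k2,3) file2
Note this only fills the no-match case with 0; the Position range check still needs the awk step shown next.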
From here OP can use bash or awk to do comparisons (is column #3 between columns #4/#5); one awk idea:
$ join -1 2 -2 1 -a 1 --nocheck-order <(head -1 file1; tail -n +2 file1 | sort -k2,3) file2 | awk 'FNR>1{if ($3<$4 || $3>$5) $6=$7=0} {print $2,$1,$3,$6,$7}'
Hugo_Symbol Chromosome Position cell_frac_A1 cell_frac_A2
Gene1 1 111111 0 0 # Position outside of range
Gene2 1 222222 0.12 0.01
Gene3 2 333333 0.03 0.25
Gene4 2 333337 0.03 0.25
Gene5 4 444567 0 0 # no match in file2; if there were other columns from file2 they would be empty
And to match OP's sample output (appears to be a fixed width requirement) we can pass this to column:
$ join -1 2 -2 1 -a 1 --nocheck-order <(head -1 file1; tail -n +2 file1 | sort -k2,3) file2 | awk 'FNR>1{if ($3<$4 || $3>$5) $6=$7=0} {print $2,$1,$3,$6,$7}' | column -t
Hugo_Symbol Chromosome Position cell_frac_A1 cell_frac_A2
Gene1 1 111111 0 0
Gene2 1 222222 0.12 0.01
Gene3 2 333333 0.03 0.25
Gene4 2 333337 0.03 0.25
Gene5 4 444567 0 0
NOTE: Keep in mind this may be untenable with OP's 519 total columns, especially if interspersed columns contain blanks/white-space (ie, column -t may not parse the input properly)
Issues (in addition to incorrect assumptions and previous NOTES):
for relatively small files the performance of the join | awk | column should be sufficient
for larger files all of this code can be rolled into a single awk solution though memory usage could be an issue on a small machine (eg, one awk idea would be to load file2 into memory via arrays so memory would need to be large enough to hold all of file2 ... probably not an issue unless file2 gets to be 100's/1000's of MBytes in size)
for 519 total columns the awk/print will get unwieldy, especially if there's a need to move/intersperse a lot of columns
Using the $ regex I can match the last position of each line, but if I have the following:
12345
23456
34567
I need to add a space so it becomes
1234 5
2345 6
3456 7
Thanks!
$ sed 's/.$/ &/' file
1234 5
2345 6
3456 7
gawk -v FIELDWIDTHS='4 1' '{$1=$1}1' file
1234 5
2345 6
3456 7
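If gawk isn't available, a portable sketch using substr in plain awk gives the same result (assuming every line has at least two characters):
$ awk '{print substr($0, 1, length($0)-1), substr($0, length($0))}' file
1234 5
2345 6
3456 7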
I want to extract and combine a certain column from a bunch of text files into a single file as shown.
File1_example.txt
A 123 1
B 234 2
C 345 3
D 456 4
File2_example.txt
A 123 5
B 234 6
C 345 7
D 456 8
File3_example.txt
A 123 9
B 234 10
C 345 11
D 456 12
...
..
.
File100_example.txt
A 123 55
B 234 66
C 345 77
D 456 88
How can I loop through my files of interest and paste these columns together so that the final result is like below without having to type out 1000 unique file names?
1 5 9 ... 55
2 6 10 ... 66
3 7 11 ... 77
4 8 12 ... 88
Try this:
paste File[0-9]*_example.txt | awk '{i=3;while($i){printf("%s ",$i);i+=3}printf("\n")}'
Example:
File1_example.txt:
A 123 1
B 234 2
C 345 3
D 456 4
File2_example.txt:
A 123 5
B 234 6
C 345 7
D 456 8
Run command as:
$ paste File[0-9]*_example.txt | awk '{i=3;while($i){printf("%s ",$i);i+=3}printf("\n")}'
Output:
1 5
2 6
3 7
4 8
I tested the code below with the first 3 files:
cat File*_example.txt | awk '{a[$1$2]= a[$1$2] $3 " "} END{for(x in a){print a[x]}}' | sort
1 5 9
2 6 10
3 7 11
4 8 12
1) use an awk array, a[$1$2] = a[$1$2] $3 " ": the index is column 1 plus column 2, and the value appends every column 3.
2) END{for(x in a){print a[x]}} traverses array a and prints all values.
3) use sort to sort the output.
When cat-ing the files you need to ensure the file order is preserved; one way is to explicitly specify them:
cat File{1..100}_example.txt | awk '{print $NF}' | pr -100ts' '
Extract the last column with awk and align the result with pr (one output column per input file).
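Alternatively, a small awk-only sketch that builds each output row directly, assuming every file has the same number of rows:
awk '{ a[FNR] = (NR==FNR ? $NF : a[FNR] " " $NF) }     # append this file's last column to row FNR
     END { for (i=1; i<=FNR; i++) print a[i] }         # FNR here is the row count of the last file
    ' File{1..100}_example.txt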
I have two files of the form
file1:
#fileheader1
0 123
1 456
2 789
3 999
4 112
5 131
6 415
etc.
file2:
#fileheader2
0 442
1 232
2 542
3 559
4 888
5 231
6 322
etc.
How can I take the second column of each, divide each by a value, subtract one from the other, and then output a new third file with the new values?
I want the output file to have the form
#outputheader
0 123/c-442/k
1 456/c-232/k
2 789/c-542/k
etc.
where c and k are numbers I can plug into the script
I have seen this question: subtract columns from different files with awk
But I don't know how to use awk to do this by myself. Does anyone know how to do it, or could you explain what is going on in the linked question so I can try to modify it?
I'd write:
awk -v c=10 -v k=20 '                            # pass values to awk variables
    /^#/ {next}                                  # skip headers
    FNR==NR {val[$1]=$2; next}                   # store values from file1
    $1 in val {print $1, (val[$1]/c - $2/k)}     # perform the calc and print
' file1 file2
output
0 -9.8
1 34
2 51.8
3 71.95
4 -33.2
5 1.55
6 25.4
etc. 0
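And since the question asks for a third output file, the same idea can simply be redirected; a sketch of one way to do it, with a BEGIN block for the requested header:
awk -v c=10 -v k=20 '
    BEGIN {print "#outputheader"}                # header for the output file
    /^#/ {next}                                  # skip input headers
    FNR==NR {val[$1]=$2; next}                   # store values from file1
    $1 in val {print $1, (val[$1]/c - $2/k)}     # perform the calc and print
' file1 file2 > file3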
I have two 2D-array files to read with bash.
What I want to do is extract the elements inside both files.
These two files have different dimensions (rows x columns), such as:
file1.txt (nx7)
NO DESC ID TYPE W S GRADE
1 AAA 20 AD 100 100 E2
2 BBB C0 U 200 200 D
3 CCC 9G R 135 135 U1
4 DDD 9H Z 246 246 T1
5 EEE 9J R 789 789 U1
.
.
.
file2.txt (mx3)
DESC W S
AAA 100 100
CCC 135 135
EEE 789 789
.
.
.
Here is what I want to do:
Extract the element in the DESC column of file2.txt, then find the corresponding element in file1.txt.
Extract the W,S elements in that row of file2.txt, then find the corresponding W,S elements in the matching row of file1.txt.
If [W1==W2 && S1==S2]; then echo "${DESC[colindex]} ok"; else echo "${DESC[colindex]} NG"
How can I read this kind of file as a 2D array with bash or is there any convenient way to do that?
bash does not support 2D arrays. You can simulate them by generating 1D array variables like array1, array2, and so on.
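A minimal illustration of that indirection trick (row1 here is a hypothetical name, not part of the script below):
row1=(1 AAA 20 AD 100 100 E2)   # one ordinary 1D array per "row"
i=1; j=1
v="row$i[$j]"                   # build the name "row1[1]" at runtime
echo "${!v}"                    # indirect expansion prints: AAA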
Assuming DESC is a key (i.e. has no duplicate values) and does not contain any spaces:
#!/bin/bash
# read data from file1
idx=0
while read -a data$idx; do
let idx++
done <file1.txt
# process data from file2
while read desc w2 s2; do
for ((i=0; i<idx; i++)); do
v="data$i[1]"
[ "$desc" = "${!v}" ] && {
w1="data$i[4]"
s1="data$i[5]"
if [ "$w2" = "${!w1}" -a "$s2" = "${!s1}" ]; then
echo "$desc ok"
else
echo "$desc NG"
fi
break
}
done
done <file2.txt
For brevity, optimizations such as taking advantage of sort order are left out.
If the files actually contain the header NO DESC ID TYPE ... then use tail -n +2 to discard it before processing.
A more elegant solution is also possible, which avoids reading the entire file in memory. This should only be relevant for really large files though.
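One way to sketch that idea, keeping only the smaller file2.txt in memory and using awk instead of pure bash:
awk 'FNR==1 {next}                                      # skip both header lines
     NR==FNR {w[$1]=$2; s[$1]=$3; next}                 # file2: DESC -> W and S
     $2 in w {print $2, (w[$2]==$5 && s[$2]==$6 ? "ok" : "NG")}
    ' file2.txt file1.txt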
If the row order does not need to be preserved (the rows can be sorted), maybe this is enough:
join -2 2 -o 1.1,1.2,1.3,2.5,2.6 <(tail -n +2 file2.txt|sort) <(tail -n +2 file1.txt|sort) |\
sed 's/^\([^ ]*\) \([^ ]*\) \([^ ]*\) \2 \3/\1 OK/' |\
sed '/ OK$/!s/\([^ ]*\) .*/\1 NG/'
For file1.txt
NO DESC ID TYPE W S GRADE
1 AAA 20 AD 100 100 E2
2 BBB C0 U 200 200 D
3 CCC 9G R 135 135 U1
4 DDD 9H Z 246 246 T1
5 EEE 9J R 789 789 U1
and file2.txt
DESC W S
AAA 000 100
CCC 135 135
EEE 789 000
FCK xxx 135
produces:
AAA NG
CCC OK
EEE NG
Explanation:
skip the header line in both files - tail -n +2
sort both files
join the needed columns from both files into one table; only the lines with a common DESC field appear in the result, like this:
AAA 000 100 100 100
CCC 135 135 135 135
EEE 789 000 789 789
in lines where columns 2 and 4 and columns 3 and 5 have the same values, substitute everything after the 1st column with OK
in the remaining lines, substitute everything after the 1st column with NG
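For instance, the backreference check applied to single already-joined lines:
$ echo 'AAA 100 100 100 100' | sed 's/^\([^ ]*\) \([^ ]*\) \([^ ]*\) \2 \3/\1 OK/'
AAA OK
$ echo 'AAA 000 100 100 100' | sed 's/^\([^ ]*\) \([^ ]*\) \([^ ]*\) \2 \3/\1 OK/'
AAA 000 100 100 100
The second line is left unchanged, so the final sed turns it into AAA NG.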