Combine rows with the same name in each column using bash - bash

I have a file like the following (but with 52 columns and 4,000 rows):
1NA2 1NB2 2RA2 2RB2
Vibrionaceae 0.22 0.25 0.36 1.02
Bacillaceae 2.0 1.76 0.55 0.23
Enterobacteriaceae 0.55 0.52 2.40 1.23
Vibrionaceae 0.22 0.25 0.36 1.02
Bacillaceae 2.0 1.76 0.55 0.23
Enterobacteriaceae 0.55 0.52 2.40 1.23
And I want it to look like this:
1NA2 1NB2 2RA2 2RB2
Vibrionaceae 0.44 0.50 0.72 2.04
Bacillaceae 4.0 3.52 1.10 0.46
Enterobacteriaceae 1.10 1.04 4.80 2.46
Edit: I'm sorry, I don't want to delete the remaining rows and columns. Every row name is repeated several times, so I want it to appear only once, with the total in every column.
I have tried the following:
awk '{a[$1]+=$2}END{for(i in a) print i,a[i]}' file
but it only does it for the first column, and I want it to work for all 52 columns.

With GNU awk and a 2D array:
awk 'NR==1
NR>1{
for(i=2; i<=NF; i++){
a[$1][i]+=$i
}
}
END{
for(i in a){
printf("%-19s", i)
for(j=2; j<=NF; j++){
printf("%.2f ", a[i][j])
}
print ""
}
}' file
or as one-liner:
awk 'NR==1; NR>1{for(i=2; i<=NF; i++){a[$1][i]+=$i}} END{for(i in a){printf("%-19s", i); for(j=2; j<=NF; j++){printf("%.2f ", a[i][j])} print ""}}' file
Output:
1NA2 1NB2 2RA2 2RB2
Bacillaceae 4.00 3.52 1.10 0.46
Vibrionaceae 0.44 0.50 0.72 2.04
Enterobacteriaceae 1.10 1.04 4.80 2.46
NR is the line number
NF is the number of fields in a row
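Arrays of arrays (a[$1][i]) are a GNU awk extension. As a minimal sketch, assuming a POSIX awk and whitespace-separated input, the same sums can be kept under a combined (name, column) key. The function name sum_by_name and the helper arrays seen and order are introduced here for illustration only; order makes the rows print in first-seen order rather than in awk's arbitrary "for (i in a)" order.

```shell
# Sum every numeric column per row name; the header line is passed through.
# sum_by_name is a hypothetical helper name, not part of the original answer.
sum_by_name() {
    awk '
    NR == 1 { print; next }                  # header: print unchanged
    {
        if (!($1 in seen)) {                 # remember first-seen order of names
            seen[$1] = 1
            order[++n] = $1
        }
        if (NF > nf) nf = NF                 # widest data row seen so far
        for (i = 2; i <= NF; i++)
            sum[$1, i] += $i                 # (name, column) key emulates a 2D array
    }
    END {
        for (k = 1; k <= n; k++) {
            line = order[k]
            for (i = 2; i <= nf; i++)
                line = line sprintf(" %.2f", sum[order[k], i])
            print line
        }
    }' "$@"
}
```

Called as sum_by_name file, it prints the header followed by one line per distinct name.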

Related

AWK select columns in file2 based on partial header match in file1

I have a file ("File1") with ~40-80k columns and ~10k rows. The column headers in File1 are comprised of a unique identifier (e.g. "4b_1.04:") followed by a description (e.g. "Colname_3"). File2 contains a list of unique identifiers (i.e. not an exact match). Is there a way to extract columns from File1 using a list of column headers in File2 based on a partial match?
For example:
"File1"
patient_ID,response,0_4: Number of Variants,0_6: Number of CDS Variants,3_2.83: Colname_1,3_8.5102: Colname_2,4b_1.04: Colname_3,4_1.0: Colname_4,4_7.7101: Colname_5
ID_237.vcf,Benefit,13008,4343,0.65,1.23,0.17,2.57,4.22
ID_841.vcf,Benefit,15127,2468,0.9,0.68,2.39,1.8,1.6
ID_767.vcf,Benefit,5190,3261,0.73,1.16,1.99,0.79,1.17
ID_263.vcf,Benefit,16888,9548,0.61,1.66,0.73,2.42,1.55
ID_179.vcf,Benefit,3545,842,0.22,0.67,0.48,3.9,3.95
ID_408.vcf,Benefit,1427,4583,0.92,0.76,0.17,0.8,1.27
ID_850.vcf,Benefit,13835,4682,0.8,1.21,0.05,1.74,4.61
ID_856.vcf,Benefit,8939,8435,0.31,0.99,2.5,1.36,0.74
ID_328.vcf,Benefit,14220,8481,0.23,0.22,0.79,0.14,1.08
ID_704.vcf,Benefit,18145,914,0.66,1.69,0.17,0.4,3.13
ID_828.vcf,No_Benefit,4798,8163,0.74,0.89,1.04,1.68,1.29
ID_16.vcf,No_Benefit,6472,528,0.47,1.5,1.74,0.19,3.54
ID_380.vcf,No_Benefit,9827,8359,0.86,1.59,2.41,0.11,3.71
ID_559.vcf,No_Benefit,10247,9150,0.68,0.78,1.02,0.69,1.31
ID_466.vcf,No_Benefit,11092,4078,0.16,0.03,0.4,1.51,2.86
ID_925.vcf,No_Benefit,4809,2908,0.01,1.49,2.32,2.35,4.58
ID_573.vcf,No_Benefit,4341,4307,0.87,0.14,2.63,1.35,3.54
ID_497.vcf,No_Benefit,18279,663,0.1,1.06,2.96,1.98,4.22
ID_830.vcf,No_Benefit,18505,456,0.31,0.25,1.96,3.01,4.6
ID_665.vcf,No_Benefit,15072,2962,0.43,1.35,0.76,0.68,1.47
"File2"
patient_ID
response
0_4:
0_6:
4b_1.04:
3_2.83:
3_8.5102:
NB. The identifiers in File2 are in a different order to the column headers in File1, and the delimiter in File1 is a tab, not a comma (do tabs get converted to spaces when copy-pasting from SO?).
My attempt:
awk 'NR==FNR{T[$1]=NR; next} FNR==1 {MX=NR-1; for (i=1; i<=NF; i++) if ($i in T) C[T[$i]] = i } {for (j=1; j<=MX; j++) printf "%s%s", $C[j], (j==MX)?RS:"\t" }' File2 <(tr -s "," "\t" < File1)
Unfortunately, this prints the 'partial' header - I want the full header - and appears to struggle with File2 being in a different order to File1.
Expected outcome (awk 'BEGIN{FS=","; OFS="\t"}{print $1, $2, $3, $4, $7, $5, $6}' File1):
patient_ID response 0_4: Number of Variants 0_6: Number of CDS Variants 4b_1.04: Colname_3 3_2.83: Colname_1 3_8.5102: Colname_2
ID_237.vcf Benefit 13008 4343 0.17 0.65 1.23
ID_841.vcf Benefit 15127 2468 2.39 0.9 0.68
ID_767.vcf Benefit 5190 3261 1.99 0.73 1.16
ID_263.vcf Benefit 16888 9548 0.73 0.61 1.66
ID_179.vcf Benefit 3545 842 0.48 0.22 0.67
ID_408.vcf Benefit 1427 4583 0.17 0.92 0.76
ID_850.vcf Benefit 13835 4682 0.05 0.8 1.21
ID_856.vcf Benefit 8939 8435 2.5 0.31 0.99
ID_328.vcf Benefit 14220 8481 0.79 0.23 0.22
ID_704.vcf Benefit 18145 914 0.17 0.66 1.69
ID_828.vcf No_Benefit 4798 8163 1.04 0.74 0.89
ID_16.vcf No_Benefit 6472 528 1.74 0.47 1.5
ID_380.vcf No_Benefit 9827 8359 2.41 0.86 1.59
ID_559.vcf No_Benefit 10247 9150 1.02 0.68 0.78
ID_466.vcf No_Benefit 11092 4078 0.4 0.16 0.03
ID_925.vcf No_Benefit 4809 2908 2.32 0.01 1.49
ID_573.vcf No_Benefit 4341 4307 2.63 0.87 0.14
ID_497.vcf No_Benefit 18279 663 2.96 0.1 1.06
ID_830.vcf No_Benefit 18505 456 1.96 0.31 0.25
ID_665.vcf No_Benefit 15072 2962 0.76 0.43 1.35
Would you please try the following:
awk -F"\t" '
NR==FNR { # handle File2
partial[FNR] = $1 # create a list of desired header (partial)
len = FNR # array length of "partial"
next
}
FNR==1 { # handle header line of File1
ofs = line = ""
for (j = 1; j <= len; j++) {
for (i = 1; i <= NF; i++) {
if (index($i, partial[j]) == 1) { # test the partial match
header[++n] = i # if match, store the position
line = line ofs $i
ofs = "\t"
}
}
}
print line # print the desired header (full)
}
FNR>1 { # handle body lines of File1
ofs = line = ""
for (i = 1; i <= n; i++) { # positions of desired columns
line = line ofs $header[i]
ofs = "\t"
}
print line
}
' File2 File1
Output:
patient_ID response 0_4: Number of Variants 0_6: Number of CDS Variants 4b_1.04: Colname_3 3_2.83: Colname_1 3_8.5102: Colname_2
ID_237.vcf Benefit 13008 4343 0.17 0.65 1.23
ID_841.vcf Benefit 15127 2468 2.39 0.9 0.68
ID_767.vcf Benefit 5190 3261 1.99 0.73 1.16
ID_263.vcf Benefit 16888 9548 0.73 0.61 1.66
ID_179.vcf Benefit 3545 842 0.48 0.22 0.67
ID_408.vcf Benefit 1427 4583 0.17 0.92 0.76
ID_850.vcf Benefit 13835 4682 0.05 0.8 1.21
ID_856.vcf Benefit 8939 8435 2.5 0.31 0.99
ID_328.vcf Benefit 14220 8481 0.79 0.23 0.22
ID_704.vcf Benefit 18145 914 0.17 0.66 1.69
ID_828.vcf No_Benefit 4798 8163 1.04 0.74 0.89
ID_16.vcf No_Benefit 6472 528 1.74 0.47 1.5
ID_380.vcf No_Benefit 9827 8359 2.41 0.86 1.59
ID_559.vcf No_Benefit 10247 9150 1.02 0.68 0.78
ID_466.vcf No_Benefit 11092 4078 0.4 0.16 0.03
ID_925.vcf No_Benefit 4809 2908 2.32 0.01 1.49
ID_573.vcf No_Benefit 4341 4307 2.63 0.87 0.14
ID_497.vcf No_Benefit 18279 663 2.96 0.1 1.06
ID_830.vcf No_Benefit 18505 456 1.96 0.31 0.25
ID_665.vcf No_Benefit 15072 2962 0.76 0.43 1.35
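The partial match in the script above rests on index($i, partial[j]) == 1, which is true only when the identifier sits at the very start of the header. A small sketch for trying that test in isolation follows; the function name prefix_match is hypothetical, and the values are assumed to contain no backslashes (awk -v would interpret them as escapes).

```shell
# Succeeds (exit status 0) when string $2 is a prefix of string $1.
# prefix_match is an illustrative name, not part of the original answer.
prefix_match() {
    awk -v h="$1" -v p="$2" 'BEGIN { exit !(index(h, p) == 1) }'
}

prefix_match "4b_1.04: Colname_3" "4b_1.04:" && echo "prefix"
prefix_match "4b_1.04: Colname_3" "1.04:"    || echo "not a prefix"
```

Anchoring the match at position 1 keeps an identifier such as 1.04: from matching in the middle of 4b_1.04: Colname_3.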

bash/awk: remove duplicate columns after merging of several files

I am using the following function in my bash script to merge many files (each containing multi-column data) into one big summary chart with all the data fused:
table_fuse () {
paste -d'\t' "${rescore}"/*.csv >> "${rescore}"/results_2PROTS_CNE_strategy3.csv | column -t -s$'\t'
}
Taking two files as an example, this routine would produce the following concatenated chart as the result of the merging:
# file 1. # file 2
Lig dG(10V1) dG(rmsd) Lig dG(10V2) dG(rmsd)
lig1 -6.78 0.32 lig1 -7.04 0.20
lig2 -5.56 0.14 lig2 -5.79 0.45
lig3 -7.30 0.78 lig3 -7.28 0.71
lig4 -7.98 0.44 lig4 -7.87 0.42
lig5 -6.78 0.28 lig5 -6.75 0.31
lig6 -6.24 0.24 lig6 -6.24 0.24
lig7 -7.44 0.40 lig7 -7.42 0.39
lig8 -4.62 0.41 lig8 -5.19 0.11
lig9 -7.26 0.16 lig9 -7.30 0.13
Since both files share the same first column (Lig), how can I remove (or replace with " ") every repeat of this column in the fused file, keeping only the Lig column from the first CSV?
EDIT: As per OP's comments to cover [Ll]ig or [Ll]ig0123 or [Ll]ig(abcd) formats in file adding following solution here.
awk '{first=$1;gsub(/[Ll]ig([0-9]+)?(\([a-zA-Z]+\))?/,"");print first,$0}' Input_file
With awk you could try following, considering that you want to remove only lig(digits) duplicate values here.
awk '{first=$1;gsub(/[Ll]ig([0-9]+)?/,"");print first,$0}' Input_file
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
{
first=$1 ##Setting first column value to first here.
gsub(/[Ll]ig([0-9]+)?/,"") ##Globally substituting L/lig digits(optional) with NULL in whole line.
print first,$0 ##printing first and current line here.
}
' Input_file ##mentioning Input_file name here.
It's not hard to replace repeats of phrases. What exactly works for your case depends on the precise input file format; but something like
sed 's/^\([^ ]*\)\( .* \)\1 /\1\2 /' file
would get rid of any repeat of the token in the first column.
Perhaps a better solution is to use a more sophisticated merge tool, though. A simple Awk or Python script could take care of removing the first token from every file except the first while merging.
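As a rough sketch of such an Awk script, assuming every file lists the same rows in the same order and uses whitespace-separated fields, the first token can be dropped from every file except the first while the rows are glued together with tabs. The function name merge_drop_key is hypothetical.

```shell
# Merge files side by side, keeping the first column only from the first file.
# merge_drop_key is an illustrative name; rows are assumed to line up by position.
merge_drop_key() {
    awk '
    FNR == 1 { fileno++ }                    # count which input file we are in
    {
        row = $0
        if (fileno > 1)
            sub(/^[^ \t]+[ \t]+/, "", row)   # drop the first token and its separator
        line[FNR] = (fileno == 1) ? row : line[FNR] "\t" row
        if (FNR > max) max = FNR
    }
    END { for (r = 1; r <= max; r++) print line[r] }
    ' "$@"
}
```

It would be called in place of paste, for example merge_drop_key "${rescore}"/*.csv. All rows are held in memory until the end, which should be fine for summary tables of this size.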
join would appear to be a solution, at least for the sample data provided by OP ...
Sample input data:
$ cat file1
Lig dG(10V1) dG(rmsd)
lig1 -6.78 0.32
lig2 -5.56 0.14
lig3 -7.30 0.78
lig4 -7.98 0.44
lig5 -6.78 0.28
lig6 -6.24 0.24
lig7 -7.44 0.40
lig8 -4.62 0.41
lig9 -7.26 0.16
$ cat file2
Lig dG(10V2) dG(rmsd)
lig1 -7.04 0.20
lig2 -5.79 0.45
lig3 -7.28 0.71
lig4 -7.87 0.42
lig5 -6.75 0.31
lig6 -6.24 0.24
lig7 -7.42 0.39
lig8 -5.19 0.11
lig9 -7.30 0.13
We can join these two files on the first column (aka field) like so:
$ join -j1 file1 file2
Lig dG(10V1) dG(rmsd) dG(10V2) dG(rmsd)
lig1 -6.78 0.32 -7.04 0.20
lig2 -5.56 0.14 -5.79 0.45
lig3 -7.30 0.78 -7.28 0.71
lig4 -7.98 0.44 -7.87 0.42
lig5 -6.78 0.28 -6.75 0.31
lig6 -6.24 0.24 -6.24 0.24
lig7 -7.44 0.40 -7.42 0.39
lig8 -4.62 0.41 -5.19 0.11
lig9 -7.26 0.16 -7.30 0.13
For more than 2 files some sort of repetitive/looping method would be needed to repeatedly join a new file into the mix.
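One possible shape for that loop is sketched below, assuming every file is already sorted on its first field in the C locale (true for the sample data, where the header Lig sorts before lig1 ... lig9). The function name join_all and the temporary-file handling are illustrative choices, not part of the answer above.

```shell
# Join any number of files on their first field, one file at a time.
# join_all is a hypothetical helper; inputs must be sorted on field 1 (C locale).
join_all() {
    acc=$(mktemp)                            # running result of the joins so far
    cp "$1" "$acc"
    shift
    for f in "$@"; do
        next=$(mktemp)
        LC_ALL=C join "$acc" "$f" > "$next"  # add the columns of the next file
        mv "$next" "$acc"
    done
    cat "$acc"
    rm -f "$acc"
}
```

For example, join_all file1 file2 file3 prints each Lig value once, followed by the columns of all three files. Rows missing from any one file are dropped, which is join's default behaviour.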

Why does if condition print zero value for non-zero condition?

I am unable to get the required output from bash code.
I have a text file:
1 0.00 0.00
2 0.00 0.00
3 0.00 0.08
4 0.00 0.00
5 0.04 0.00
6 0.00 0.00
7 -3.00 0.00
8 0.00 0.00
The required output should be only non-zero values:
0.08
0.04
-3.0
This is my code:
z=0.00
while read line
do
line_o="$line"
openx_de=`echo $line_o|awk -F' ' '{print $2,$3}'`
IFS=' ' read -ra od <<< "$openx_de"
for i in "${od[@]}";do
if [ $i != $z ]
then
echo "openx_default_value is $i"
fi
done
done < /openx.txt
but it also gets the zero values.
To get only the nonzero values from columns 2 and 3, try:
$ awk '$2+0!=0{print $2} $3+0!=0{print $3}' openx.txt
0.08
0.04
-3.00
How it works:
$2+0 != 0 {print $2} tests to see if the second column is nonzero. If it is nonzero, then the print statement is executed.
We want to do a numeric comparison between the second column, $2, and zero. To tell awk to treat $2 as a number, we first add zero to it and then we do the comparison.
The same is done for column 3.
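The difference between comparing text and comparing numbers can be seen in a small sketch. It assumes the value is held as a plain string, as it is in the shell loop from the question; the variable names s and cmp are illustrative.

```shell
# Compare "0.00" with zero first as text, then as a number.
cmp=$(awk 'BEGIN {
    s = "0.00"
    print ((s == "0")   ? "string: equal"  : "string: different")
    print ((s + 0 == 0) ? "numeric: equal" : "numeric: different")
}')
printf '%s\n' "$cmp"
```

As text, "0.00" and "0" are different strings; after adding zero, both sides are the number 0.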
Using column names
Consider this input file:
$ cat openx2.txt
n first second
1 0.00 0.00
2 0.00 0.00
3 0.00 0.08
4 0.00 0.00
5 0.04 0.00
6 0.00 0.00
7 -3.00 0.00
8 0.00 0.00
To print the column name with each value found, try:
$ awk 'NR==1{two=$2; three=$3; next} $2+0!=0{print two,$2} $3+0!=0{print three,$3}' openx2.txt
second 0.08
first 0.04
first -3.00
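Another approach is to blank the first column and delete every literal 0.00 (note that this would also remove 0.00 from inside a value such as 10.00):
awk '{$1=""}{gsub(/0\.00/,"")}NF{$1=$1;print}' file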
0.08
0.04
-3.00

iostat & steal time

I am trying to catch some data from iostat output:
# iostat -m
avg-cpu: %user %nice %system %iowait %steal %idle
9.92 0.00 14.17 0.01 0.00 75.90
Device: tps MB_read/s MB_wrtn/s MB_read MB_wrtn
sda 6.08 0.00 0.04 2533 261072
dm-0 1.12 0.00 0.00 1290 30622
dm-1 0.00 0.00 0.00 1 0
dm-2 1.22 0.00 0.00 0 33735
dm-3 7.22 0.00 0.03 1213 196713
How can I match the "0.00" value?
Numbers aren't separated by a tab or a constant number of spaces.
Also the value can be 3 digits 0.00 or 4 digits 45.00, etc.
Any idea how to match it using bash?
Try this, using awk:
iostat | awk 'NR==3 { print $5 }'
NR==3 selects the third line of the output, and print $5 prints its fifth column. Verify that the right line and column are being selected by changing the numbers: for example, on the values line of your output, print $4 should yield 0.01.
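Because the line number of the values can differ between iostat versions, a sketch that avoids hard-coding it is to find the avg-cpu header, locate %steal by name, and read the same position on the following line. The function name steal_from_iostat is hypothetical, and the layout is assumed to match the output shown in the question.

```shell
# Print the %steal value from iostat-style text read on standard input.
# steal_from_iostat is an illustrative name; the layout is assumed from the question.
steal_from_iostat() {
    awk '
    /^avg-cpu/ {
        for (i = 2; i <= NF; i++)
            if ($i == "%steal") col = i - 1  # values line has no "avg-cpu:" label
        getline                              # move on to the values line
        if (col) print $col
        exit
    }'
}
```

It would be used as iostat -m | steal_from_iostat. Splitting on runs of whitespace is awk's default, so the varying number of spaces between the numbers does not matter.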

How to get a substring in awk

This is one line of the input file:
FOO BAR 0.40 0.20 0.40 0.50 0.60 0.80 0.50 0.50 0.50 -43.00 100010101101110101000111010
I have an awk command that should check whether a certain position of the bitstring in column 13 is a "1" or a "0".
Something like:
awk -v values="${values}" '{if (substr($13,1,1)==1) printf values,$1,$3,$4,$5,$6,$7,$8,$9,$10,$11,$12,$13}' foo.txt > bar.txt
The values variable works, but in the above example I just want to check whether the first bit is equal to "1".
EDIT
Ok, so I guess I wasn't very clear in my question. The "$13" in the substr method is in fact the bitstring. So this awk wants to pass all the lines in foo.txt that have a "1" at position "1" of the bitstring at column "$13". Hope this clarifies things.
EDIT 2
Ok, let me break it down real easy. The code above is an example, and the input line is one of MANY lines, so not all lines have a 1 at position 8. I've double-checked that a certain position has both occurrences, so that in any case I should get some output. The thing is that it doesn't find any "1"s at the positions that I choose in any line, but when I say that it has to find a "0", it returns all lines.
$ cat file
FOO BAR 0.40 0.20 0.40 0.50 0.60 0.80 0.50 0.50 0.50 -43.00 100010101101110101000111010
FOO BAR 0.40 0.20 0.40 0.50 0.60 0.80 1.50 1.50 1.50 -42.00 100010111101110101000111010
$ awk 'substr($13,8,1)==1{ print "1->"$0 } substr($13,8,1)==0{ print "0->"$0 }' file
0->FOO BAR 0.40 0.20 0.40 0.50 0.60 0.80 0.50 0.50 0.50 -43.00 100010101101110101000111010
1->FOO BAR 0.40 0.20 0.40 0.50 0.60 0.80 1.50 1.50 1.50 -42.00 100010111101110101000111010
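To make the position and the wanted bit adjustable, a sketch along the same lines passes both in with -v. The function name filter_bit is hypothetical; the bitstring is assumed to be field 13, as in the sample lines.

```shell
# Print only the lines whose bitstring (field 13) has bit WANT at position POS.
# usage: filter_bit POS WANT [file ...]   (filter_bit is an illustrative name)
filter_bit() {
    pos=$1
    want=$2
    shift 2
    # (want "") forces a string comparison against the one-character substring
    awk -v pos="$pos" -v want="$want" 'substr($13, pos, 1) == (want "")' "$@"
}
```

With the two sample lines above, filter_bit 8 1 file prints only the second line, and filter_bit 8 0 file only the first.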

Resources