Apply multiple subtract operations between two columns in a text file in bash

I would like to subtract two pairs of columns in a tab-delimited text file and append the results as two new columns, in bash using awk.
I would like to subtract column 1 (h1) from column 3 (h3) and name the newly added column "count1".
I would like to subtract column 2 (h2) from column 4 (h4) and name the newly added column "count2".
I don't want to create a new text file; I want to edit the existing one in place.
My text file:
h1 h2 h3 h4 h5
343 100 856 216 536
283 96 858 220 539
346 111 858 220 539
283 89 860 220 540
280 89 862 220 541
76 32 860 220 540
352 105 856 220 538
57 16 860 220 540
144 31 858 220 539
222 63 860 220 540
305 81 858 220 539
My command at the moment looks like this:
awk '{$6 = $3 - $1}1' file.txt
awk '{$6 = $4 - $2}1' file.txt
But I don't know how to name the newly added columns, and maybe there is a smarter way to run both operations in a single awk command?

Pretty simple in awk. Use NR==1 to modify the first line.
awk -F '\t' -v OFS='\t' '
NR==1 {print $0,"count1","count2"}
NR!=1 {print $0,$3-$1,$4-$2}' file.txt > tmp && mv tmp file.txt
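If your awk is GNU awk 4.1 or later, you can skip the tmp-and-mv step with its inplace extension (a GNU-specific sketch; other awks need the temporary file shown above):
gawk -i inplace -F '\t' -v OFS='\t' '
NR==1 {print $0,"count1","count2"; next}
      {print $0,$3-$1,$4-$2}' file.txt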

Related

how to make a table from the columns of other tables in bash?

Hello, I have 50 tables in TSV format, all with the same column names in the same order:
e.g.
cat sample1.tsv | head -4
name coverage ID bases reads length
vir1 0.535 3rf 1252 53 11424
vir2 0.124 2ds 7534 152 63221
vir3 0.643 6tf 3341 73 21142
I want to build a table from the "reads" column (5th column) of the 50 tables. The name column has the same values in the same order across all 50 tables.
Desired output:
cat reads_table.tsv | head -4
names sample1 sample2 sample3 sample4 sample5 sample50
vir1 53 742 42 242 42 342
vir2 152 212 512 21 74 41
vir3 73 13 172 42 142 123
I was thinking of doing this by saving the reads column (the 5th column in all tables) to an array and using the paste utility to join the columns, saving them to a new file called "reads_table.tsv", but I don't know how to do this in bash.
This is what I tried first:
for i in *.tsv
do
reads=$(awk '{print $5}' $i)
sed -i 's/$/\t$reads/' $i >> reads_table.tsv
done
Created some input files to match OP's expected output:
$ head sample*.tsv
==> sample1.tsv <==
name coverage ID bases reads length
vir1 0.535 3rf 1252 53 11424
vir2 0.124 2ds 7534 152 63221
vir3 0.643 6tf 3341 73 21142
==> sample2.tsv <==
name coverage ID bases reads length
vir1 0.535 3rf 1252 742 11424
vir2 0.124 2ds 7534 212 63221
vir3 0.643 6tf 3341 13 21142
==> sample3.tsv <==
name coverage ID bases reads length
vir1 0.535 3rf 1252 42 11424
vir2 0.124 2ds 7534 512 63221
vir3 0.643 6tf 3341 172 21142
==> sample4.tsv <==
name coverage ID bases reads length
vir1 0.535 3rf 1252 242 11424
vir2 0.124 2ds 7534 21 63221
vir3 0.643 6tf 3341 42 21142
==> sample5.tsv <==
name coverage ID bases reads length
vir1 0.535 3rf 1252 42 11424
vir2 0.124 2ds 7534 74 63221
vir3 0.643 6tf 3341 142 21142
==> sample50.tsv <==
name coverage ID bases reads length
vir1 0.535 3rf 1252 342 11424
vir2 0.124 2ds 7534 41 63221
vir3 0.643 6tf 3341 123 21142
One awk idea:
awk '
BEGIN { FS=OFS="\t" }
FNR==NR { lines[FNR]=$1 } # save 1st column ("name") of 1st file
FNR==1 { split(FILENAME,a,".") # 1st row of each file: split FILENAME
lines[FNR]=lines[FNR] OFS a[1] # save FILENAME (sans ".tsv")
next
}
{ lines[FNR]=lines[FNR] OFS $5 } # rest of rows in file: append the 5th column to our output lines
END { for (i=1;i<=FNR;i++) # loop through rows and ...
print lines[i] # print the associated line to stdout
}
' $(find . -name "sample*.tsv" -printf "%f\n" | sort -V ) > reads_table.tsv
NOTES:
the find/sort is required to ensure the files are fed to awk in version-sort order (e.g., sample3.tsv comes before sample21.tsv)
the -printf "%f\n" removes the leading ./ from the filename (otherwise we could remove it in the awk script)
the -V option tells sort to perform a version sort
This generates:
name sample1 sample2 sample3 sample4 sample5 sample50
vir1 53 742 42 242 42 342
vir2 152 212 512 21 74 41
vir3 73 13 172 42 142 123
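For comparison, here is a rough paste-based sketch along the lines the OP suggested; it assumes every sample*.tsv lists the same names in the same order, and relies on GNU ls -v for the version sorting:
cut -f1 sample1.tsv > reads_table.tsv    # start with the name column
for f in $(ls -v sample*.tsv); do
    s=$(basename "$f" .tsv)              # column header becomes the sample name
    paste reads_table.tsv <(cut -f5 "$f" | sed "1s/.*/$s/") > tmp && mv tmp reads_table.tsv
done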

Awk, printing certain columns based on how rows of different files match

I am pretty sure awk is what I need to use.
I have one file with information I need, and I need to take two pieces of information from it to look up two numbers in a second file.
So if the first file has m7 in its fifth column and 3 in its third column, I want to search the second file for a row that has 3 in its first column and m7 in its fourth column. Then I want to print certain columns from these files, as listed below.
Given the following two files of input
file1
1 dog 3 8 m7 n15
50 cat 5 8 m15 m22
20 fish 6 3 n12 m7
file2
3 695 842 m7 word
5 847 881 m15 not
8 910 920 n15 important
8 695 842 m22 word
6 312 430 n12 not
I want to produce the output
pre3 695 842 21
pre5 847 881 50
pre6 312 430 20
pre8 910 920 1
pre8 695 842 50
EDIT:
I need to also produce output of the form
pre3 695 842 pre8 910 920 1
pre5 847 881 pre8 695 842 50
pre6 312 430 pre3 695 842 20
The answer below works for the previous version of the question, but I'm confused by some of its syntax, so I'm not sure how to adjust it to produce this output.
This command:
awk 'NR==FNR{ar[$5,$3]=$1+ar[$5,$3]; ar[$6,$4]=$1+ar[$6,$4]}
NR>FNR && ar[$4,$1] {print "pre"$1,$2,$3,ar[$4,$1]}' file1 file2
outputs "pre" plus the content of the second file's first, second, and third columns, followed by the first file's first column, for all lines in which the first file's fifth and third (or sixth and fourth) columns match the second file's fourth and first columns:
pre3 695 842 21
pre5 847 881 50
pre8 910 920 1
pre8 695 842 50
pre6 312 430 20
(for lines with more than one match the values of ar[$4,$1] are summed up)
Note that the output is not necessarily sorted! To sort it, add sort:
awk 'NR==FNR{ar[$5,$3]=$1+ar[$5,$3]; ar[$6,$4]=$1+ar[$6,$4]}
NR>FNR && ar[$4,$1]{print "pre"$1,$2,$3,ar[$4,$1]}' file1 file2 | sort
What does the code do?
NR==FNR{...} works on the first input file only
NR>FNR{...} works on the 2nd, 3rd,... input file
ar[$5,$3] creates an array entry whose key is the content of the 5th and 3rd columns of the current line/record (joined internally by awk's SUBSEP character, not by the field separator)
You could use the command below:
awk 'NR==FNR {a[$3 FS $5]=1;next } a[$1 FS $4]' f1.txt f2.txt
If you want to print only specific fields from the matching lines in the second file, use something like:
awk 'NR==FNR {a[$3 FS $5]=1;next } a[$1 FS $4] { print "pre"$1" "$2" "$3}' f1.txt f2.txt
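For the output requested in the EDIT (both matching file2 rows on one line), a hedged sketch is to read file2 first, index its rows by the (ID, key) pair, and then look up both pairs from each file1 line:
awk 'NR==FNR {b[$4,$1]="pre"$1 OFS $2 OFS $3; next}
     {print b[$5,$3], b[$6,$4], $1}' file2 file1
With the sample data this prints pre3 695 842 pre8 910 920 1 and so on; unlike the summing answer above, it assumes each (ID, key) pair occurs only once in file2.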

awk script to return results recursively [duplicate]

This question already has an answer here:
awk script along with for loop
(1 answer)
Closed 7 years ago.
I have a data set as below (t.txt):
827 819
830 826
828 752
752 694
828 728
821 701
724 708
826 842
719 713
764 783
812 820
829 696
697 849
840 803
752 774
I have second file as below (t1.txt):
752
728
856
693
713
792
812
706
737
751
745
For each value in the second file, I am trying to extract the matching column 2 elements from the first data set using a for loop.
I have tried:
for i in `cat t1.txt`
do
awk -F " " '$1=i {print $2}' t.txt > t0.txt
done
Desired output is:
694
820
774
Unfortunately I am getting a blank file.
I have tried to do it manually, like: awk -F " " '$1==752 {print $2}' t.txt > t0.txt
Results obtained are
694
774
How can I do it for the entire t1 file in one go?
Simplest way: using join
$ join -o 1.2 <(sort t.txt) <(sort t1.txt)
694
774
820
join requires the files to be lexically sorted on the comparison field (by default, field one). The -o 1.2 option instructs join to output the 2nd field from the 1st file.
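If you need the results in t.txt's original order rather than sorted, one sketch (assuming GNU tools) is to number the lines, sort them for join, then restore the order afterwards:
$ join -1 2 -o 1.1,1.3 <(nl -ba t.txt | sort -k2,2) <(sort t1.txt) | sort -n | awk '{print $2}'
694
820
774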
With awk
$ awk 'NR==FNR {key[$1]; next} $1 in key {print $2}' t1.txt t.txt
694
820
774
That remembers the keys in t1.txt, then loops over t.txt (when the cumulative record number NR is no longer equal to the per-file record number FNR); if the first field occurred in t1.txt, it prints the second field.
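As an aside, the original loop produced a blank file for two reasons: the shell variable i is never passed into awk (so $1=i assigns an empty string to $1 rather than comparing anything), and > truncates t0.txt on every iteration. A corrected, though much slower, version would pass the value in with -v:
while read -r i; do
    awk -v key="$i" '$1 == key {print $2}' t.txt
done < t1.txt > t0.txt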

awk script along with for loop

I have a data set t.txt:
827 819
830 826
828 752
752 694
828 728
821 701
724 708
826 842
719 713
764 783
812 820
829 696
697 849
840 803
752 774
I also have a second file t1.txt:
752
728
856
693
713
792
812
706
737
751
745
For each value in the second file, I am trying to extract the corresponding column 2 elements from the data set, one after another.
I have used: awk -F " " '$1==752 {print $2}' t.txt >> t2.txt
How can I use a for loop for the above instruction and collect everything in one text file, instead of doing it one value at a time?
The output for 752 will be 694. This 694 should be written to a separate text file. For 812, it should give me 820. Both 694 and 820 should be written to the same text file. It should continue until the end of the input file.
I was trying:
for i in `cat t1.txt` | awk -F " " '$1==$i {print $2}' t.txt >> t2.txt
which throws a syntax error.
Answer for 3rd Version of This Question
$ awk 'FNR==NR{a[$1]=1;next;} $1 in a {print $2;}' t1.txt t.txt
694
820
774
Answer for 2nd Version of This Question
For every line in t1.txt, this checks whether the same number appears in column 1 of t.txt. If it does, the number in column 2 of the same line is printed:
$ awk 'FNR==NR{a[$1]=$2;next} $1 in a {print a[$1]}' t.txt t1.txt
694
820
To save the output in file t2.txt, use:
awk 'FNR==NR{a[$1]=$2;next} $1 in a {print a[$1]}' t.txt t1.txt >t2.txt
How it works
FNR==NR{a[$1]=$2;next}
This reads through t.txt and creates an array a of its values.
$1 in a {print a[$1]}
For each number in file t1.txt, this checks to see if the number appears in array a and, if so, prints out the corresponding value.
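One caveat with a[$1]=$2: when a key repeats in column 1 of t.txt (752 maps to both 694 and 774 in the data shown here), only the last value survives. To collect every match instead, append to the entry (a sketch):
awk 'FNR==NR {a[$1] = ($1 in a ? a[$1] ORS : "") $2; next}
     $1 in a {print a[$1]}' t.txt t1.txt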

calculate percentage between columns in bash?

I have a long tab-delimited file with many columns. I would like to calculate a percentage between two columns (3rd and 4th, each relative to the 2nd) and print this percentage next to the corresponding numbers, in this format: (46.00%).
input:
file1 323 434 45 767 254235 275 2345 467
file1 294 584 43 7457 254565 345 235445 4635
file1 224 524 4343 12457 2542165 345 124445 41257
Desired output:
file1 323 434(134.37%) 45(13.93%) 767 254235 275 2345 467
file1 294 584(198.64%) 43(14.63%) 7457 254565 345 235445 4635
file1 224 524(233.93%) 4343(1938.84%) 12457 2542165 345 124445 41257
I tried:
cat test_file.txt | awk '{printf "%s (%.2f%)\n",$0,($4/$2)*100}' OFS="\t" | awk '{printf "%s (%.2f%)\n",$0,($3/$2)*100}' | awk '{print $1,$2,$3,$11,$4,$10,$5,$6,$7,$8,$9}' - | sed 's/ (/(/g' | sed 's/ /\t/g' >out.txt
It works, but I want a more concise version of this.
I would say:
$ awk '{$3=sprintf("%d(%.2f%%)", $3, ($3/$2)*100); $4=sprintf("%d(%.2f%%)", $4, ($4/$2)*100)}1' file
file1 323 434(134.37%) 45(13.93%) 767 254235 275 2345 467
file1 294 584(198.64%) 43(14.63%) 7457 254565 345 235445 4635
file1 224 524(233.93%) 4343(1938.84%) 12457 2542165 345 124445 41257
With a function to avoid duplicities:
awk 'function print_nice (num1, num2) {
return sprintf("%d(%.2f%%)", num1, (num1/num2)*100)
}
{$3=print_nice($3,$2); $4=print_nice($4,$2)}1' file
This uses sprintf to apply a specific format and store the result back in the field. The calculations are the obvious ones.
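One caveat: assigning to $3 and $4 makes awk rebuild the record with OFS, which defaults to a single space, so the original tab delimiters are lost. Since the input is tab-delimited, a variant that preserves the tabs might look like this:
awk -F '\t' -v OFS='\t' '{$3=sprintf("%d(%.2f%%)", $3, ($3/$2)*100)
                          $4=sprintf("%d(%.2f%%)", $4, ($4/$2)*100)}1' file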
