Loop through numeric text files in bash and add numbers row-wise

I have a set of text files in a folder, like so:
a.txt
1
2
3
4
5
b.txt
1000
1001
1002
1003
1004
.. and so on (assume a fixed number of rows, but an unknown number of text files). What I am looking for is a results file which contains the row-wise sum across all the files:
result.txt
1001
1003
1005
1007
1009
How do I go about achieving this in bash, without using Python etc.?

Using awk
Try:
$ awk '{a[FNR]+=$0} END{for(i=1;i<=FNR;i++)print a[i]}' *.txt
1001
1003
1005
1007
1009
How it works:
a[FNR]+=$0
For every line read, we add the value of that line, $0, to the partial sum a[FNR], where a is an array and FNR is the line number in the current file.
END{for(i=1;i<=FNR;i++)print a[i]}
After all the files have been read in, this prints out the sum for each line number.
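If the files could end up with different lengths (not the case in the question, so this is just a defensive variant of the same idea), track the longest file seen instead of relying on FNR in the END block:
$ awk '{a[FNR]+=$0; if (FNR>n) n=FNR} END{for (i=1; i<=n; i++) print a[i]}' *.txt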
Using paste and bc
$ paste -d+ *.txt | bc
1001
1003
1005
1007
1009
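To see what bc receives, here is the intermediate output of paste -d+ for the two example files: it joins corresponding lines with a + sign, and bc then evaluates each line as an arithmetic expression.
$ paste -d+ *.txt
1+1000
2+1001
3+1002
4+1003
5+1004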

Related

Handling ties when ranking in bash

Let's say I have a list of numbers that are already sorted as below
100
222
343
423
423
500
What I want is to create a rank field such that the same values are assigned the same rank:
100 1
222 2
343 3
423 4
423 4
500 5
I have been using the following piece of code to mimic a rank field
awk '{print $0, NR}' file
That gives me the output below, but it's technically a row number.
100 1
222 2
343 3
423 4
423 5
500 6
How do I go about this? I am an absolute beginner in bash, so I would really appreciate it if you could add a little explanation for learning's sake.
That's a job for awk:
$ awk '{if($0!=p)++r;print $0,r;p=$0}' file
Output:
100 1
222 2
343 3
423 4
423 4
500 5
Explained:
$ awk '{ # using awk
if($0!=p) # if the value does not equal the previous value
++r # increase the rank
print $0,r # output value and rank
p=$0 # store value for next round
}' file
Could you please try the following.
awk 'prev==$0{--count} {print $0,++count;prev=$1}' Input_file
Explanation: Adding a detailed explanation for the above code.
awk '                    ##Starting awk code from here.
prev==$0{                ##If variable prev equals the current line, then do the following.
  --count                ##Decrease the count variable by 1.
}
{
  print $0,++count       ##Print the current line and the count variable after incrementing it.
  prev=$1                ##Set prev to the 1st field of the current line.
}
' Input_file             ##Mention the Input_file name here.
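Run on the sample (sorted) file, this produces the same ranking as the first answer:
100 1
222 2
343 3
423 4
423 4
500 5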
another awk
$ awk '{print $1, a[$1]=a[$1]?a[$1]:++c}' file
100 1
222 2
343 3
423 4
423 4
500 5
where the file does not need to be sorted; for example, after adding a new 423 at the end of the file:
$ awk '{print $1, a[$1]=a[$1]?a[$1]:++c}' file
100 1
222 2
343 3
423 4
423 4
500 5
423 4
Increment the rank counter c for each new value observed; otherwise use the value already registered in a for that key. Since c is initialized to zero, the value is pre-incremented. This will use the same rank value for the same key regardless of its position.

How to merge two tab-separated files and predefine formatting of missing values?

I am trying to merge two unsorted tab-separated files by a column of partially overlapping identifiers (gene#), with the option of predefining missing values and keeping the order of the first table.
When using paste on my two example tables, missing values end up as empty space.
cat file1
c3 100 300 gene4
c1 300 400 gene1
c13 600 700 gene2
cat file2
gene1 4.2 0.001
gene4 1.05 0.5
paste file1 file2
c3 100 300 gene4 gene1 4.2 0.001
c1 300 400 gene1 gene4 1.05 0.5
c13 600 700 gene2
As you see, the result not surprisingly shows empty space on non-matched lines. Is there a way to keep the order of file1 and fill lines like the third as follows:
c3 100 300 gene4 gene4 1.05 0.5
c1 300 400 gene1 gene1 4.2 0.001
c13 600 700 gene2 NA 1 1
I assume one way could be to build an awk conditional construct. It would be great if you could point me in the right direction.
With awk please try the following:
awk 'FNR==NR {a[$1]=$1; b[$1]=$2; c[$1]=$3; next}
{if (!a[$4]) {a[$4]="N/A"; b[$4]=1; c[$4]=1}
printf "%s %s %s %s\n", $0, a[$4], b[$4], c[$4]}
' file2 file1
which yields:
c3 100 300 gene4 gene4 1.05 0.5
c1 300 400 gene1 gene1 4.2 0.001
c13 600 700 gene2 N/A 1 1
[Explanations]
In the 1st line, FNR==NR { command; next} is an idiom to execute the command only when reading the 1st file in the argument list ("file2" in this case). Then it creates maps (aka associative arrays) to associate values in "file2" to genes
as:
gene1 => gene1 (with array a)
gene1 => 4.2 (with array b)
gene1 => 0.001 (with array c)
gene4 => gene4 (with array a)
gene4 => 1.05 (with array b)
gene4 => 0.5 (with array c)
It is not necessary that "file2" is sorted.
The following lines are executed only when reading the 2nd file ("file1") because these lines are skipped when reading the 1st file due to the next statement.
The line {if (!a[$4]) .. is a fallback to assign variables to default values when the associative array a[gene] is undefined (meaning the gene is not found in "file2").
The final line prints the contents of "file1" followed by the associated values via the gene.
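A small defensive tweak (my addition, not part of the original answer): testing membership with the in operator avoids treating a legitimately empty or zero stored value as "missing", and the lookup itself does not create a new array entry:
awk 'FNR==NR {a[$1]=$1; b[$1]=$2; c[$1]=$3; next}
 {if (!($4 in a)) {a[$4]="N/A"; b[$4]=1; c[$4]=1}
 printf "%s %s %s %s\n", $0, a[$4], b[$4], c[$4]}
' file2 file1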
You can use join:
join -e NA -o '1.1 1.2 1.3 1.4 1.5 2.1 2.2 2.3' -a 1 -1 5 -2 1 <(nl -w1 -s ' ' file1 | sort -k 5) <(sort -k 1 file2) | sed 's/NA\sNA$/1 1/' | sort -n | cut -d ' ' -f 2-
-e NA — replace all missing values with NA
-o ... — output format (field is specified using <file>.<field>)
-a 1 — Keep every line from the left file
-1 5, -2 1 — Fields used to join the files
file1, file2 — The files
nl -w1 -s ' ' file1 — file1 with numbered lines
<(sort -k X fileN) — File N ready to be joined on column X
s/NA\sNA$/1 1/ — Replace every NA NA at the end of a line with 1 1
| sort -n | cut -d ' ' -f 2- — sort numerically and remove the first column
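To see why the nl line numbers and the trailing sort -n | cut are needed: for the example data, the stream after join and sed (but before the final sort and cut) looks roughly like this, with the rows in join-key order and the original file1 line numbers in front:
2 c1 300 400 gene1 gene1 4.2 0.001
3 c13 600 700 gene2 NA 1 1
1 c3 100 300 gene4 gene4 1.05 0.5
sort -n restores the original file1 order using those numbers, and cut -d ' ' -f 2- removes them again.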
The example above uses spaces on output. To use tabs, append | tr ' ' '\t':
join -e NA -o '1.1 1.2 1.3 1.4 2.1 2.2 2.3' -a 1 -1 4 -2 1 file1 file2 | sed 's/NA\sNA$/1 1/' | tr ' ' '\t'
The broken lines in the paste output have a TAB as the last character. Fix this with
paste file1 file2 | sed 's/\t$/\tNA\t1\t1/g'

Subtract corresponding lines

I have two files, file1.csv
3 1009
7 1012
2 1013
8 1014
and file2.csv
5 1009
3 1010
1 1013
In the shell, I want to subtract the count in the first column in the second file from that in the first file, based on the identifier in the second column. If an identifier is missing in the second column, the count is assumed to be 0.
The result would be
-2 1009
-3 1010
7 1012
1 1013
8 1014
The files are huge (several GB). The second columns are sorted.
How would I do this efficiently in the shell?
Assuming that both files are sorted on second column:
$ join -j2 -a1 -a2 -oauto -e0 file1 file2 | awk '{print $2 - $3, $1}'
-2 1009
-3 1010
7 1012
1 1013
8 1014
join will join sorted files.
-j2 will join on the second column.
-a1 will print records from file1 even if there is no corresponding row in file2.
-a2 Same as -a1 but applied for file2.
-oauto is in this case the same as -o1.2,1.1,2.1 which will print the joined column, and then the remaining columns from file1 and file2.
-e0 will insert 0 instead of an empty column. This works with -a1 and -a2.
The output from join is three columns like:
1009 3 5
1010 0 3
1012 7 0
1013 2 1
1014 8 0
This is piped to awk, which subtracts column three from column two and reformats the output.
$ awk 'NR==FNR { a[$2]=$1; next }
{ a[$2]-=$1 }
END { for(i in a) print a[i],i }' file1 file2
7 1012
1 1013
8 1014
-2 1009
-3 1010
It reads the first file into memory, so you should have enough memory available. If you don't have the memory, I would maybe sort -k2 the files first, then sort -m (merge) them and continue with that output:
$ sort -m -k2 -k3 <(sed 's/$/ 1/' file1|sort -k2) <(sed 's/$/ 2/' file2|sort -k2) # | awk ...
3 1009 1
5 1009 2 # previous $2 = current $2 -> subtract
3 1010 2 # previous $2 =/= current and current $3=2 print -$3
7 1012 1
2 1013 1 # previous $2 =/= current and current $3=1 print prev $2
1 1013 2
8 1014 1
(I'm out of time for now, maybe I'll finish it later)
EDIT by Ed Morton
Hope you don't mind me adding what I was working on rather than posting my own extremely similar answer, feel free to modify or delete it:
$ cat tst.awk
{ split(prev,p) }
$2 == p[2] {
print p[1] - $1, p[2]
prev = ""
next
}
p[2] != "" {
print (p[3] == 1 ? p[1] : 0-p[1]), p[2]
}
{ prev = $0 }
END {
split(prev,p)
print (p[3] == 1 ? p[1] : 0-p[1]), p[2]
}
$ sort -m -k2 <(sed 's/$/ 1/' file1) <(sed 's/$/ 2/' file2) | awk -f tst.awk
-2 1009
-3 1010
7 1012
1 1013
8 1014
Since the files are sorted¹, you can merge them line-by-line with the join utility in coreutils:
$ join -j2 -o auto -e 0 -a 1 -a 2 41144043-a 41144043-b
1009 3 5
1010 0 3
1012 7 0
1013 2 1
1014 8 0
All those options are required:
-j2 says to join based on the second column of each file
-o auto says to make every row have the same format, beginning with the join key
-e 0 says that missing values should be substituted with zero
-a 1 and -a 2 include rows that are absent from one file or another
the filenames (I've used names based on the question number here)
Now we have a stream of output in that format, we can do the subtraction on each line. I used this GNU sed command to transform the above output into a dc program:
sed -e 's/.*/c& -n[ ]np/'
This takes the three values on each line and rearranges them into a dc command for the subtraction, which dc then executes. For example, the first line becomes (with spaces added for clarity)
c 1009 3 5 -n [ ]n p
which subtracts 5 from 3, prints it, then prints a space, then prints 1009 and a newline, giving
-2 1009
as required.
We can then pipe all these lines into dc, giving us the output file that we want:
$ join -o auto -j2 -e 0 -a 1 -a 2 41144043-a 41144043-b \
> | sed -e 's/.*/c& -n[ ]np/' \
> | dc
-2 1009
-3 1010
7 1012
1 1013
8 1014
¹ The sorting needs to be consistent with LC_COLLATE locale setting. That's unlikely to be an issue if the fields are always numeric.
TL;DR
The full command is:
join -o auto -j2 -e 0 -a 1 -a 2 "$file1" "$file2" | sed -e 's/.*/c& -n[ ]np/' | dc
It works a line at a time, and starts only the three processes you see, so should be reasonably efficient in both memory and CPU.
Assuming this is a "csv" with blank separation; if the separator is really ",", use the argument -F ','
awk 'FNR==NR {Inits[$2]=$1; ids[$2]++; next}
{Discounts[$2]=$1; ids[$2]++}
END { for (id in ids) print Inits[ id] - Discounts[ id] " " id}
' file1.csv file2.csv
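One caveat (my addition): because the END block iterates with for (id in ids), the output order is unspecified. Piping through sort restores identifier order, for example:
awk 'FNR==NR {Inits[$2]=$1; ids[$2]++; next}
 {Discounts[$2]=$1; ids[$2]++}
 END { for (id in ids) print Inits[id] - Discounts[id] " " id}
' file1.csv file2.csv | sort -k2,2n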
For the memory issue (this could be done in one series of pipes, but I prefer to use a temporary file):
awk 'FNR==NR{print;next}{print -1 * $1 " " $2}' file1 file2 \
| sort -k2 \
> file.tmp
awk 'Last != $2 {
if (NR != 1) print Result " "Last
Last = $2; Result = $1
}
Last == $2 { Result+= $1; next}
END { print Result " " $2}
' file.tmp
rm file.tmp

Divide column values of different files by a constant then output one minus the other

I have two files of the form
file1:
#fileheader1
0 123
1 456
2 789
3 999
4 112
5 131
6 415
etc.
file2:
#fileheader2
0 442
1 232
2 542
3 559
4 888
5 231
6 322
etc.
How can I take the second column of each, divide it by a value, then subtract one from the other, and output a new third file with the new values?
I want the output file to have the form
#outputheader
0 123/c-442/k
1 456/c-232/k
2 789/c-542/k
etc.
where c and k are numbers I can plug into the script
I have seen this question: subtract columns from different files with awk
But I don't know how to use awk to do this by myself. Does anyone know how to do this, or could you explain what is going on in the linked question so I can try to modify it?
I'd write:
awk -v c=10 -v k=20 ' ;# pass values to awk variables
/^#/ {next} ;# skip headers
FNR==NR {val[$1]=$2; next} ;# store values from file1
$1 in val {print $1, (val[$1]/c - $2/k)} ;# perform the calc and print
' file1 file2
output
0 -9.8
1 34
2 51.8
3 71.95
4 -33.2
5 1.55
6 25.4
etc. 0
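If you also want the #outputheader line in the result (my addition, not part of the original answer), a BEGIN block can print it before any input is read; file3 below is just an arbitrary output name:
awk -v c=10 -v k=20 '
BEGIN {print "#outputheader"} ;# write the output header first
/^#/ {next} ;# skip the input headers
FNR==NR {val[$1]=$2; next} ;# store values from file1
$1 in val {print $1, (val[$1]/c - $2/k)} ;# perform the calc and print
' file1 file2 > file3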

bash - check for word in specific column, check value in other column of this line, cut and paste the line to new text file

My text files contain ~20k lines and look like this:
file_A:
ATOM 624 SC1 SER 288 54.730 23.870 56.950 1.00 0.00
ATOM 3199 NC3 POP 487 50.780 27.750 27.500 1.00 3.18
ATOM 3910 C2B POP 541 96.340 99.070 39.500 1.00 7.00
ATOM 4125 W PW 559 55.550 64.300 16.880 1.00 0.00
Now I need to check for POP in column 4 (line 2 and 3) and check if the values in the last column (10) exceed a specific threshold (e.g. 5.00). These lines - in this case just line 3 - need to be removed from file_A and copied to a new file_B. Meaning:
file_A:
ATOM 624 SC1 SER 288 54.730 23.870 56.950 1.00 0.00
ATOM 3199 NC3 POP 487 50.780 27.750 27.500 1.00 3.18
ATOM 4125 W PW 559 55.550 64.300 16.880 1.00 0.00
file_B:
ATOM 3910 C2B POP 541 96.340 99.070 39.500 1.00 7.00
I'm not sure whether to use sed, grep or awk, or how to couple them :/
So far I could just delete the lines and create a new file without them...
awk '!/POP/' file_A > file_B
EDIT:
Does the following work for removing more than one different word?
for (( i= ; i<$numberoflipids ; i++ ))
do
awk '$4~/"${nol[$i]}"/&&$NF>"$pr"{print >"patch_rmlipids.pdb";next}{print > "tmp"}' bilayer_CG_ordered.pdb && mv tmp patch.pdb
done
where $nol is an array containing the words to be removed, $pr is the given threshold, and the .pdb files are the files used
awk
awk '$4~/POP/&&$NF>5{print >"fileb";next}{print > "tmp"}' filea && mv tmp filea
$4~/POP/&&$NF>5       - checks whether the fourth field contains POP and the last field is more than five
{print >"fileb";next} - if so, writes the line to fileb and skips further statements
{print > "tmp"}       - only executed if the first part fails; writes the line to a tmp file
filea && mv tmp filea - the file used; if the awk command succeeds, overwrite it with tmp
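Regarding the EDIT: shell variables such as ${nol[$i]} and $pr are not expanded inside a single-quoted awk program, so that loop will not behave as intended. A minimal sketch of one way to do it (my addition; the example array contents are made up, and it assumes the same file names as in the EDIT) passes the words and the threshold into awk with -v and handles everything in a single run:
nol=(POP CHL)                    # example words; use your own array here
pr=5                             # example threshold
re=$(IFS='|'; echo "${nol[*]}")  # join the array into a regex like POP|CHL
awk -v re="$re" -v pr="$pr" '
$4 ~ re && $NF > pr {print > "patch_rmlipids.pdb"; next}
{print > "tmp"}
' bilayer_CG_ordered.pdb && mv tmp patch.pdb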
