Add leading zeroes to awk variable - bash

I have the following awk command within a "for" loop in bash:
awk -v pdb="$pdb" 'BEGIN {file = 1; filename = pdb"_" file ".pdb"}
/ENDMDL/ {getline; file ++; filename = pdb"_" file ".pdb"}
{print $0 > filename}' < ${pdb}.pdb
This reads a series of files with the name $pdb.pdb and splits them into files called $pdb_1.pdb, $pdb_2.pdb, ..., $pdb_21.pdb, etc. However, I would like to produce files with names like $pdb_01.pdb, $pdb_02.pdb, ..., $pdb_21.pdb, i.e., to add padding zeros to the "file" variable.
I have tried without success using printf in different ways. Help would be much appreciated.

Here's how to create leading zeros with awk:
# echo 1 | awk '{ printf("%02d\n", $1) }'
01
# echo 21 | awk '{ printf("%02d\n", $1) }'
21
Replace the 2 in %02d with the total number of digits you need (including the leading zeros).
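For example, with a width of five (21 here is just an arbitrary test value):
# echo 21 | awk '{ printf("%05d\n", $1) }'
00021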

Replace file in the filename construction with sprintf("%02d", file).
Or even the whole assignment with filename = sprintf("%s_%02d.pdb", pdb, file);.
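Applied to the command from the question, that would look roughly like this (a sketch only; the ENDMDL/getline handling is kept exactly as in the original):
awk -v pdb="$pdb" 'BEGIN {file = 1; filename = sprintf("%s_%02d.pdb", pdb, file)}
/ENDMDL/ {getline; file ++; filename = sprintf("%s_%02d.pdb", pdb, file)}
{print $0 > filename}' < ${pdb}.pdb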

This does it without resorting to printf, which is expensive. The first parameter is the string to pad, the second is the total length after padding.
echo 722 8 | awk '{ for(c = 0; c < $2; c++) s = s"0"; s = s$1; print substr(s, 1 + length(s) - $2); }'
If you know in advance the length of the result string, you can use a simplified version (say 8 is your limit):
echo 722 | awk '{ s = "00000000"$1; print substr(s, 1 + length(s) - 8); }'
The result in both cases is 00000722.

Here is a function that left or right-pads values with zeroes depending on the parameters: zeropad(value, count, direction)
function zeropad(s,c,d) {
    if(d!="r")
        d="l"    # l is the default and fallback value
    return sprintf("%" (d=="l"? "0" c:"") "d" (d=="r"?"%0" c-length(s) "d":""), s,"")
}
{ # test main
    print zeropad($1,$2,$3)
}
Some tests:
$ cat test
2 3 l
2 4 r
2 5
a 6 r
The test:
$ awk -f program.awk test
002
2000
00002
000000
It's not fully battle-tested, so strange parameters may yield strange results.
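For the splitting problem at the top, the same function could be used directly in the filename assignment, for example (a sketch, assuming zeropad is defined in the same awk program):
filename = pdb "_" zeropad(file, 2) ".pdb"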

awk to get first column if a specific number in the line is greater than a digit

I have a data file (file.txt) that contains the lines below:
123 pro=tegs, ETA=12:00, team=xyz,user1=tom,dom=dby.com
345 pro=rbs, team=abc,user1=chan,dom=sbc.int,ETA=23:00
456 team=efg, pro=bvy,ETA=22:00,dom=sss.co.uk,user2=lis
I'm expecting to get the first column ($1) only if the ETA= number is greater than 15; for the sample above, only the first column of the 2nd and 3rd lines is expected:
345
456
I tried cat file.txt | awk -F [,TPF=]' '{print $1}' but it prints the whole line which has ETA at the end.
Using awk
$ awk -F"[=, ]" '{for (i=1;i<NF;i++) if ($i=="ETA") if ($(i+1) > 15) print $1}' input_file
345
456
With your shown samples, please try the following GNU awk code. It uses GNU awk's match function with the regex (^[0-9]+).*ETA=([0-9]+):[0-9]+, which creates 2 capturing groups and saves their values into the array arr. Then, if the 2nd element of arr is greater than 15, the 1st element of arr is printed, as per the requirement.
awk '
match($0,/(^[0-9]+).*\<ETA=([0-9]+):[0-9]+/,arr) && arr[2]+0>15{
print arr[1]
}
' Input_file
I would harness GNU AWK for this task in the following way. Let file.txt content be
123 pro=tegs, ETA=12:00, team=xyz,user1=tom,dom=dby.com
345 pro=rbs, team=abc,user1=chan,dom=sbc.int,ETA=23:00
456 team=efg, pro=bvy,ETA=02:00,dom=sss.co.uk,user2=lis
then
awk 'substr($0,index($0,"ETA=")+4,2)+0>15{print $1}' file.txt
gives output
345
Explanation: I use the string functions index to find where ETA= is, then substr to get the 2 characters after ETA=; 4 is used because ETA= is 4 characters long and index gives the start position. I use +0 to convert to an integer and then compare it with 15. Disclaimer: this solution assumes every row has ETA= followed by exactly 2 digits.
(tested in GNU Awk 5.0.1)
Whenever input contains tag=value pairs as yours does, it's best to first create an array of those mappings (v[] below) and then you can just access the values by their tags (names):
$ cat tst.awk
BEGIN {
    FS = "[, =]+"
    OFS = ","
}
{
    delete v
    for ( i=2; i<NF; i+=2 ) {
        v[$i] = $(i+1)
    }
}
v["ETA"]+0 > 15 {
    print $1
}
$ awk -f tst.awk file
345
456
With that approach you can trivially enhance the script in future to access whatever values you like by their names, test them in whatever combinations you like, output them in whatever order you like, etc. For example:
$ cat tst.awk
BEGIN {
    FS = "[, =]+"
    OFS = ","
}
{
    delete v
    for ( i=2; i<NF; i+=2 ) {
        v[$i] = $(i+1)
    }
}
(v["pro"] ~ /b/) && (v["ETA"]+0 > 15) {
    print $1, v["team"], v["dom"]
}
$ awk -f tst.awk file
345,abc,sbc.int
456,efg,sss.co.uk
Think about how you'd enhance any other solution to do the above or anything remotely similar.
It's unclear why you think your attempt would do anything of the sort. Your attempt uses a completely different field separator and does not compare anything against the number 15.
You'll also want to get rid of the useless use of cat.
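For example (a minimal illustration, not specific to this data), instead of
cat file.txt | awk '{print $1}'
pass the file to awk directly:
awk '{print $1}' file.txt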
When you specify a column separator with -F, that changes what the first column $1 actually means; it is then everything before the first occurrence of the separator. You probably want to separately split the line to obtain the first, space-separated column.
awk -F 'ETA=' '$2+0 > 15 { split($0, n, /[ \t]+/); print n[1] }' file.txt
The value in $2 will be the data after the first separator (and up until the next one); adding +0 forces a numeric comparison, which simply ignores any non-numeric text after the number at the beginning of the field. So, for example, on the first line we are actually comparing 12:00, team=xyz,user1=tom,dom=dby.com against 15, but numerically that is just 12 against 15 (which is obviously false).
When the condition is true, we split the original line $0 into the array n on sequences of whitespace, and then print the first element of this array.
Using awk you could match ETA= followed by 1 or more digits. Then get the match without the ETA= part and check if the number is greater than 15 and print the first field.
awk 'match($0, /ETA=[0-9]+/) {
if(substr($0, RSTART+4, RLENGTH-4)+0 > 15) print $1
}' file
Output
345
456
If the first field should start with a number:
awk '/^[0-9]/ && match($0, /ETA=[0-9]+/) {
if(substr($0, RSTART+4, RLENGTH-4)+0 > 15) print $1
}' file

Computing the size of array in text file in bash

I have a text file that sometimes-not always- will have an array with a unique name like this
unique_array=(1,2,3,4,5,6)
I would like to find the size of the array (6 in the above example) when it exists, and skip it or return -1 if it doesn't exist.
grepping the file will tell me if the array exists but not how to find its size.
The array can fill multiple lines like
unique_array=(1,2,3,
4,5,6,
7,8,9,10)
Some of the elements in the array can be negative as in
unique_array=(1,2,-3,
4,5,6,
7,8,-9,10)
awk -v RS=\) -F, '/unique_array=\(/ {print /[0-9]/?NF:0}' file.txt
-v RS=\) - delimit records by ) instead of newlines
-F, - delimit fields by , instead of whitespace
/unique_array=\(/ - look for a record containing the unique identifier
/[0-9]/?NF:0 - if the record contains a digit, the number of fields (i.e. commas+1), otherwise 0
There is a bad bug in the code above: commas preceding the array may be erroneously counted. A fix is to truncate the prefix:
awk -v RS=\) -F, 'sub(/.*unique_array=\(/,"") {print /[0-9]/?NF:0}' file.txt
Your specifications are woefully incomplete, but guessing a bit as to what you are actually looking for, try this at least as a starting point.
awk '/^unique_array=\(/ { in_array = 1; sub(/.*unique_array=\(/, ""); n = 0 }
in_array && /\)/ { sub(/\).*/, ""); quit = 1 }
in_array { sub(/,[[:space:]]*$/, "")
           n += split($0, arr, ",")
           if (quit) { print n; in_array = quit = n = 0 } }' file
We keep a state variable in_array which tells us whether we are currently in a region which contains the array. It gets set to 1 when we see the beginning of the array; at that point we also strip everything up to and including the opening unique_array=( and reset the counter n. When we see the closing parenthesis, we remove it and everything after it, and set a second variable quit to trigger the finishing logic in the last condition. The last condition performs two tasks: it adds the items from the current line to the count in n (after dropping any trailing comma), and then checks whether quit is true; if it is, we are at the end of the array, so we print the number of elements and reset the state variables.
This will simply print nothing if the array was not found. You could embellish the script to set a different exit code or print -1 if you like, but these details seem like unnecessary complications for a simple script.
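If you do want the -1 behaviour anyway, a minimal sketch built on the script above (found is an extra flag variable introduced here) is to record that a count was printed and test for that in an END block:
awk '/^unique_array=\(/ { in_array = 1; sub(/.*unique_array=\(/, ""); n = 0 }
in_array && /\)/ { sub(/\).*/, ""); quit = 1 }
in_array { sub(/,[[:space:]]*$/, "")
           n += split($0, arr, ",")
           if (quit) { print n; found = 1; in_array = quit = n = 0 } }
END { if (!found) print -1 }' file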
I think what you probably want is this, using GNU awk for multi-char RS and RT and word boundaries:
$ awk -v RS='\\<unique_array=[(][^)]*[)]' 'RT{exit} END{print (RT ? gsub(/,/,"",RT)+1 : -1)}' file
With your shown samples, please try the following awk.
awk -v RS= '
{
  while(match($0,/\<unique_array=[(][^)]*\)/)){
    line=substr($0,RSTART,RLENGTH)
    gsub(/[[:space:]]*\n[[:space:]]*|(^|\n)unique_array=\(|(\)$|\)\n)/,"",line)
    print gsub(/,/,"&",line)+1
    $0=substr($0,RSTART+RLENGTH)
  }
}
' Input_file
Using sed and declare -a. The test file is like this:
$ cat f
saa
dfsaf
sdgdsag unique_array=(1,2,3,
4,5,6,
7,8,9,10) sdfgadfg
sdgs
sdgs
sfsaf(sdg)
Testing:
$ declare -a "$(sed -n '/unique_array=(/,/)/s/,/ /gp' f | \
sed 's/.*\(unique_array\)/\1/;s/).*/)/;
s/`.*`//g')"
$ echo ${unique_array[@]}
1 2 3 4 5 6 7 8 9 10
And then you can do whatever you want with ${unique_array[@]}, e.g. ${#unique_array[@]} for its size.
With GNU grep or similar that support -z and -o options:
grep -zo 'unique_array=([^)]*)' file.txt | tr -dc =, | wc -c
-z - (effectively) treat file as a single line
-o - only output the match
tr -dc =, - strip everything except = and ,
wc -c - count the result
Note: both one- and zero-element arrays will be treated as being size 1. Will return 0 rather than -1 if not found.
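If the -1 convention is needed here as well, one possibility is a small shell wrapper around the same pipeline (a sketch; it simply checks for the opening marker first):
if grep -q 'unique_array=(' file.txt; then
    grep -zo 'unique_array=([^)]*)' file.txt | tr -dc =, | wc -c
else
    echo -1
fi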
Here's an awk solution that works with gawk, mawk 1/2, and nawk:
TEST INPUT
saa
dfsaf
sdgdsag unique_array=(1,2,3,
4,5,6,
7,8,9,10) sdfgadfg
sdgs
sdgs
sfsaf(sdg)
CODE
{m,n,g}awk '
BEGIN { __ = "-1:_ERR_NOT_FOUND_"
RS = "^$" (_ = OFS = "")
FS = "(^|[ \t-\r]?)unique[_]array[=][(]"
___ = "[)].*$|[^0-9,.+-]"
} $!NF = NR < NF ? $(gsub(___,_)*_) : __'
OUTPUT
1,2,3,4,5,6,7,8,9,10

Merging two files column and row-wise in bash

I would like to merge two files, column and row-wise but am having difficulty doing so with bash. Here is what I would like to do.
File1:
1 2 3
4 5 6
7 8 9
File2:
2 3 4
5 6 7
8 9 1
Expected output file:
1/2 2/3 3/4
4/5 5/6 6/7
7/8 8/9 9/1
This is just an example. The actual files are two 1000x1000 data matrices.
Any thoughts on how to do this? Thanks!
Or use paste + awk
paste file1 file2 | awk '{ n=NF/2; for(i=1; i<=n; i++) printf "%s/%s ", $i, $(i+n); printf "\n"; }'
Note that this script adds a trailing space after the last value. This can be avoided with a more complicated awk script or by piping the output through an additional command, e.g.
paste file1 file2 | awk '{ n=NF/2; for(i=1; i<=n; i++) printf "%s/%s ", $i, $(i+n); printf "\n"; }' | sed 's/ $//'
awk solution without additional sed. Thanks to Jonathan Leffler. (I knew it is possible but was too lazy to think about this.)
awk '{ n=NF/2; pad=""; for(i=1; i<=n; i++) { printf "%s%s/%s", pad, $i, $(i+n); pad=" "; } printf "\n"; }'
paste + perl version that works with an arbitrary number of columns without having to hold an entire file in memory:
paste file1.txt file2.txt | perl -MList::MoreUtils=pairwise -lane '
my @a = @F[0 .. (@F/2 - 1)]; # The values from file1
my @b = @F[(@F/2) .. $#F]; # The values from file2
print join(" ", pairwise { "$a/$b" } @a, @b); # Merge them together again'
It uses the non-standard but useful List::MoreUtils module; install through your OS package manager or favorite CPAN client.
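For example (an assumption-laden sketch; the package name varies by distribution, the one below is the usual Debian/Ubuntu name):
sudo apt-get install liblist-moreutils-perl   # Debian/Ubuntu package
cpan List::MoreUtils                          # or via the cpan client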
Assumptions:
no blank lines in files
both files have the same number of rows
both files have the same number of fields
no idea how many rows and/or fields we'll have to deal with
One awk solution:
awk '
# first file (FNR==NR):
FNR==NR { for ( i=1 ; i<=NF ; i++)              # loop through fields
              { line[FNR,i]=$(i) }              # store field in array; array index = row number (FNR) + field number (i)
          next                                  # skip to next line in file
        }
# second file:
        { pfx=""                                # init printf prefix as empty string
          for ( i=1 ; i<=NF ; i++)              # loop through fields
              { printf "%s%s/%s",               # print our results:
                       pfx, line[FNR,i], $(i)   # prefix, corresponding field from file #1, "/", current field
                pfx=" "                         # prefix for rest of fields in this line is a space
              }
          printf "\n"                           # append linefeed on end of current line
        }
' file1 file2
NOTES:
remove comments to declutter code
memory usage will climb as the size of the matrix increases (probably not an issue given the smallish fields and the OP's comment about a 1000 x 1000 matrix)
The above generates:
1/2 2/3 3/4
4/5 5/6 6/7
7/8 8/9 9/1

awk to calculate average of field in multiple text files and merge into one

I am trying to calculate the average of $2 in multiple text files in a directory and merge the output into one tab-delimited output file. The output file has two fields, in which $1 is the file name prefix extracted into pref, and $2 is the calculated average, rounded to one decimal. There is also a header in the output: Sample in $1 and Percent in $2. The below seems close, but I am missing a few things (adding the header to the output, merging into one tab-delimited file, and the rounding) that I do not know how to do yet, and I am not getting the desired output. Thank you :).
123_base.txt
AASS 99.81
ABAT 100.00
ABCA10 0.0
456_base.txt
ABL2 97.81
ABO 100.00
ACACA 99.82
desired output (tab-delimited)
Sample Percent
123 66.6
456 99.2
Bash
for f in /home/cmccabe/Desktop/20x/percent/*.txt ; do
bname=$(basename $f)
pref=${bname%%_base_*.txt}
awk -v OFS='\t' '{ sum += $2 } END { if (NR > 0) print sum / NR }' $f /home/cmccabe/Desktop/NGS/bed/bedtools/IDP_total_target_length_by_panel/IDP_unix_trim_total_target_length.bed > /home/cmccabe/Desktop/20x/coverage/${pref}_average.txt
done
This one uses GNU awk, which provides handy BEGINFILE and ENDFILE events:
gawk '
BEGIN {print "Sample\tPercent"}
BEGINFILE {sample = FILENAME; sub(/_.*/,"",sample); sum = n = 0}
{sum += $2; n++}
ENDFILE {printf "%s\t%.1f\n", sample, sum/n}
' 123_base.txt 456_base.txt
If you're giving a pattern with the directory attached, I'd get the sample name like this:
match(FILENAME, /^.*\/([^_]+)/, m); sample = m[1]
and then, yes this is OK: gawk '...' /path/to/*_base.txt
And to guard against division by zero, inspired by James Brown's answer:
ENDFILE {printf "%s\t%.1f\n", sample, n==0 ? 0 : sum/n}
with perl
$ perl -ane '
BEGIN{ print "Sample\tPercent\n" }
$c++; $sum += $F[1];
if(eof)
{
($pref) = $ARGV=~/(.*)_base/;
printf "%s\t%.1f\n", $pref, $sum/$c;
$c = 0; $sum = 0;
}' 123_base.txt 456_base.txt
Sample Percent
123 66.6
456 99.2
print header using BEGIN block
-a option would split input line on spaces and save to @F array
For each line, increment counter and add to sum variable
If end of file eof is detected, print in required format
$ARGV contains current filename being read
If full path of filename is passed but only filename should be used to get pref, then use this line instead
($pref) = $ARGV=~/.*\/\K(.*)_base/;
In GNU awk (it uses BEGINFILE/ENDFILE). Notice printf "%3.3s" to truncate the filename after the 3rd char:
$ cat ave.awk
BEGIN {print "Sample", "Percent"} # header
BEGINFILE {s=c=0} # at the start of every file reset
{s+=$2; c++} # sum and count hits
ENDFILE{if(c>0) printf "%3.3s%s%.1f\n", FILENAME, OFS, s/c}
# above output if more than 0 lines
Run it:
$ touch empty_base.txt # test for division by zero
$ awk -f ave.awk 123_base.txt 456_base.txt empty_base.txt
Sample Percent
123 66.6
456 99.2
another awk
$ awk -v OFS='\t' '{f=FILENAME;sub(/_.*/,"",f);
a[f]+=$2; c[f]++}
END{print "Sample","Percent";
for(k in a) print k, sprintf("%.1f",a[k]/c[k])}' {123,456}_base.txt
Sample Percent
456 99.2
123 66.6

Bash: extract columns with cut and filter one column further

I have a tab-separated file and want to extract a few columns with cut.
Two example lines:
(...)
0 0 1 0 AB=1,2,3;CD=4,5,6;EF=7,8,9 0 0
1 1 0 0 AB=2,1,3;CD=1,1,2;EF=5,3,4 0 1
(...)
What I want to achieve is to select columns 2,3,5 and 7, however from column 5 only CD=4,5,6.
So my expected result is
0 1 CD=4,5,6; 0
1 0 CD=1,1,2; 1
How can I use cut for this problem and run grep on one of the extracted columns? Any other one-liner is of course also fine.
here is another awk
$ awk -F'\t|;' -v OFS='\t' '{print $2,$3,$6,$NF}' file
0 1 CD=4,5,6 0
1 0 CD=1,1,2 1
or with cut/paste
$ paste <(cut -f2,3 file) <(cut -d';' -f2 file) <(cut -f7 file)
0 1 CD=4,5,6 0
1 0 CD=1,1,2 1
Easier done with awk. Split the 5th field using ; as the separator, and then print the second subfield.
awk 'BEGIN {FS="\t"; OFS="\t"}
{split($5, a, ";"); print $2, $3, a[2]";", $7 }' inputfile > outputfile
If you want to print whichever subfield begins with CD=, use a loop:
awk 'BEGIN {FS="\t"; OFS="\t"}
     {n = split($5, a, ";");
      for (i = 1; i <= n; i++) {
          if (a[i] ~ /^CD=/) subfield = a[i];
      }
      print $2, $3, subfield";", $7}' < inputfile > outputfile
I think awk is the best tool for this kind of task and the other two answers give you good short solutions.
I want to point out that you can use awk's built-in splitting facility to gain more flexibility when parsing input. Here is an example script that uses implicit splitting:
parse.awk
# Remember second, third and seventh columns
{
    a = $2
    b = $3
    d = $7
}

# Split the fifth column on ";". After this the positional variables
# (e.g. $1, $2, ..., $NF) contain the fields from the previous
# fifth column
{
    oldFS = FS
    FS = ";"
    $0 = $5
}

# For example, to test if the second element starts with "CD", do
# something like this
$2 ~ /^CD/ {
    c = $2
}

# Print the selected elements
{
    print a, b, c, d
}

# Restore FS
{
    FS = oldFS
}
Run it like this:
awk -f parse.awk FS='\t' OFS='\t' infile
Output:
0 1 CD=4,5,6 0
1 0 CD=1,1,2 1
