Manipulation of data with BASH

I have a file full of lines like the following:
8408292236 48.04 46n20.13 12e43.22 1.00 3.1 MCSTI 0.4897 108 0.10 0.10 20 41 84EV01978
The first two digits in the first column represent the year and I need to substitute them with the full year while retaining the full line. I tried various methods with awk and sed but couldn't get them working.
My latest attempt is the following:
while read line
a=20
b=19
do
awk '{ if ("$1 | cut -c1-2" == "0"); then {print $a$1}; }' > test.txt
done < catalog.txt
The final output should be:
198408292236 48.04 46n20.13 12e43.22 1.00 3.1 MCSTI 0.4897 108 0.10 0.10 20 41 84EV01978
Any ideas on how I can do this? Any help would be greatly appreciated.

You have a conceptual problem: how would you know whether 15 refers to 2015 or 1915? Otherwise it's quite easy:
#!/bin/bash
for num in 8408292236 8508292236 0408292236 1508292236
do
    prefix=${num:0:2}             # get the first two digits
    if [ $prefix -ge 20 ]; then   # assuming 20-99 refer to the 1900s and 00-19 to the 2000s
        echo "19$num"
    else
        echo "20$num"
    fi
done
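Applied to the file from the question rather than a list of sample numbers, the same idea might look like this (a sketch; catalog.txt, test.txt and the cutoff of 20 are carried over from the question and the loop above):
while read -r line; do
    prefix=${line:0:2}              # first two digits of the line
    if [ "$prefix" -ge 20 ]; then   # 20-99 are taken to be 19xx
        echo "19$line"
    else                            # 00-19 are taken to be 20xx
        echo "20$line"
    fi
done < catalog.txt > test.txt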

This will prefix 19 if the first field starts with 16 or more, and will prefix 20 otherwise. I think this is what you need.
awk ' { if ($1 > 1600000000 ) print "19" $0 ; else print "20" $0 ; }' catalog.txt

A solution using just bash:
while read A; do
    PREFIX=20
    if [ ${A:0:2} -gt 15 ] ; then
        PREFIX=19
    fi
    echo "${PREFIX}${A}"
done
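To run it against the file from the question, feed the file into the loop and capture the output, e.g. done < catalog.txt > test.txt (file names taken from the question).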

sed -e 's/^[0-3]/20&/' -e 's/^[4-9]/19&/'
This will prepend 20 to each line starting with 0 to 3 and 19 to each line starting with 4 to 9, so we consider 40-99 to correspond to the 1900s and 00-39 to the 2000s.
Adapt to your need.
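Applied to the file from the question (file names as in the question), that would be, for example:
sed -e 's/^[0-3]/20&/' -e 's/^[4-9]/19&/' catalog.txt > test.txt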

Use awk ternary operator.
$ cat catalog.txt
8408292236 48.04 46n20.13 12e43.22 1.00 3.1 MCSTI 0.4897 108 0.10 0.10 20 41 84EV01978
0408292236 48.04 46n20.13 12e43.22 1.00 3.1 MCSTI 0.4897 108 0.10 0.10 20 41 84EV01978
$ awk '$1=substr($1,1,2)<15 ? "20"$1 : "19"$1' catalog.txt
198408292236 48.04 46n20.13 12e43.22 1.00 3.1 MCSTI 0.4897 108 0.10 0.10 20 41 84EV01978
200408292236 48.04 46n20.13 12e43.22 1.00 3.1 MCSTI 0.4897 108 0.10 0.10 20 41 84EV01978
Explanation
Let’s use the ternary operator.
expr ? action1 : action2
It's pretty straightforward: if expr is true then action1 is performed/evaluated, if not, action2.
Field one must be prefixed with 19 or 20 depending on the value of its first two characters, substr($1,1,2).
NOTE: This only works if your data does not contain years before 1915 or after 2014.
At this point, we just need to change the first field $1 to our needs: $1=substr($1,1,2)<15 ? "20"$1 : "19"$1
Check: http://klashxx.github.io/ternary-operator/
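A minimal stand-alone demonstration of the operator, independent of the catalog data:
$ echo "7 23" | awk '{print ($1 < 15 ? "low" : "high"), ($2 < 15 ? "low" : "high")}'
low high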

Simply in bash:
while read i; do echo "$((10#${i:0:2}<20?20:19))$i"; done <file.txt
shorter in perl:
perl -pe'print substr($_,0,2)<20?20:19' file.txt
even shorter in sed:
sed 's/^[01]/20&/;t;s/^/19/' file.txt
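In the sed version, the t command branches to the end of the script whenever the preceding s succeeded, so 19 is only prepended to lines that did not already receive 20.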

Related

Passing variable grep command

Given text files (say foo*.txt) with data as follows:
1 g = 0.54 0.00
2 g = 0.32 0.00
3 g = 0.45 0.00
...
5000 g = 0.5 0.00
Basically, I want to extract the 10 lines before and after the matching line (including the matching line itself). The matching line is 59 characters long and contains strings, spaces and numbers.
I have a script as follows:
#!/usr/bin/bash
for file in foo*.txt;
do
var=$(command_to_extract_var) # 59 characters containing strings, spaces and numbers
# to get this var, I use grep and head
grep -C 10 "$var" "$file"
done > bar.csv
Running the script with bash -x script_name.sh gives the following:
+ for file in 'foo*.txt'
++ grep 'match_pattern' foo1.txt
++ awk '{print $6}'
++ head -n1
++ grep '[0-9]'
+ basis=150
++ grep 'match_pattern' foo1.txt
++ tail -n1
++ awk '{print $3}'
+ number=25
++ grep '[0-9] f = ' foo.txt
++ tail -n150
This is followed by a number of lines (even up to 1000) like
001 h = 0.000000000000000E+00 e = 3.543218084205956E+00
Finally,
File name too long
+ final=
+ grep -C 10 '' foo1.txt
The output I expect is (one column from each file):
0.54 0.62 0.36 ... 0.45
0.32 3.25 0.89 ... 0.25
0.45 0.96 0.14 ... 0.14
... .... .... ... 0.96
0.25 0.00 7.23 ... 0.77
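For reference, grep -C N prints N lines of context before and after every matching line, which covers the "10 lines before and after" requirement on its own; since the 59-character pattern is a literal string rather than a regular expression, adding -F may also help, e.g. grep -C 10 -F "$var" "$file" (a sketch reusing the variable names from the script above).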

While loop in bash getting duplicate result

$ cat grades.dat
santosh 65 65 65 65
john 85 92 78 94 88
andrea 89 90 75 90 86
jasper 84 88 80 92 84
santosh 99 99 99 99 99
Script:
#!/usr/bin/bash
filename="$1"
while read line
do
a=`grep -w "santosh" $1 | awk '{print$1}' |wc -l`
echo "total is count of the file is $a";
done <"$filename"
Output:
total is count of the file is 2
total is count of the file is 2
total is count of the file is 2
total is count of the file is 2
total is count of the file is 2
The real output should be just:
total is count of the file is 2
Please let me know where I am going wrong in the above script.
Whilst others have shown you better ways to solve your problem, the answer to your question is in the following line:
a=`grep -w "santosh" $1 | awk '{print$1}' |wc -l`
The while loop stores each name in the variable "line", but it is never used. Instead, your loop always looks for "santosh", which appears twice, and because you run the same query for each of the 5 lines in the file, you get 5 lines of exactly the same output.
You could alter your current script like so:
a=$(grep -w "$line" "$filename" | awk '{print$1}' | wc -l)
The above is not meant to be a solution as others have pointed out, but it does solve your issue.
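If the intent is a per-name count that uses the value read on each iteration, a sketch along those lines (variable names are mine; the file is passed as $1 as in the question) could be:
#!/usr/bin/bash
filename="$1"
while read -r name rest; do
    count=$(grep -cw "$name" "$filename")   # number of lines on which this name appears
    echo "count for $name is $count"
done < "$filename"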

how to format simultaneously a string and floating number in awk?

I have a column as follows:
ifile.txt
1.25
2.78
?
?
5.6
3.4
I would like to round the floating-point numbers to integers while leaving the strings as they are.
ofile.txt
1
3
?
?
6
3
Walter A, F. Knorr and Janez Kuhar suggested nice scripts that do this, answering my question, which needed a command like:
awk '{printf "%d%s\n", $1}' ifile.txt
Since then I have found that I actually have a number of columns, but the other columns don't need any formatting, so I would have to use the above command in a form something like:
awk '{printf "%5s %d%s %5s %5s %5s\n", $1, $2, $3, $4, $5}' ifile.txt
for example:
ifile.txt
1 1.25 23.2 34 3.4
2 2.78 22.0 23 1.2
3 ? ? ? 4.3
4 ? ? ? 6.5
5 5.6 45.0 5 2.4
6 3.4 43.0 23 5.6
I used the following command, again as suggested by F. Knorr in an answer:
awk '$2~/^[0-9]+\.?[0-9]*$/{$2=int($2+0.5)}1' ifile.txt > ofile.txt
ofile.txt
1 1 23.2 34 3.4
2 3 22.0 23 1.2
3 ? ? ? 4.3
4 ? ? ? 6.5
5 6 45.0 5 2.4
6 3 43.0 23 5.6
It works fine, but I still need to format the output, like:
ofile.txt
1 1 23.2 34 3.4
2 3 22.0 23 1.2
3 ? ? ? 4.3
4 ? ? ? 6.5
5 6 45.0 5 2.4
6 3 43.0 23 5.6
You could first check whether the column contains a number (via regex) and then handle the printing accordingly:
awk '$1~/^[0-9]+\.?[0-9]*$/{printf "%i\n",$1+0.5; next}1' test.txt
Update: If it is the n-th column that needs to be formatted as described above (and no other formatting in other columns), then replace all $1 by $n in the following command:
awk '$1~/^[0-9]+\.?[0-9]*$/{$1=int($1+0.5)}1' test.txt
Just adding a half can be done with:
awk ' $1 ~ /^[0-9]+$|^[0-9]+.[0-9]+$/ { printf("%d\n", $1 + 0.5); next }
{ print $1 } ' file
or slightly shorter:
awk ' $1 ~ /^[0-9]+$|^[0-9]+.[0-9]+$/ { printf("%d\n", $1 + 0.5); next } 1' file
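If the columns should also stay aligned after a field has been rewritten (the extra formatting asked about above), one option is to pipe the result through column -t, e.g. (a sketch reusing the earlier command):
awk '$2~/^[0-9]+\.?[0-9]*$/{$2=int($2+0.5)}1' ifile.txt | column -t > ofile.txt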

bash - check for word in specific column, check value in other column of this line, cut and paste the line to new text file

My text files contain ~20k lines and look like this:
file_A:
ATOM 624 SC1 SER 288 54.730 23.870 56.950 1.00 0.00
ATOM 3199 NC3 POP 487 50.780 27.750 27.500 1.00 3.18
ATOM 3910 C2B POP 541 96.340 99.070 39.500 1.00 7.00
ATOM 4125 W PW 559 55.550 64.300 16.880 1.00 0.00
Now I need to check for POP in column 4 (line 2 and 3) and check if the values in the last column (10) exceed a specific threshold (e.g. 5.00). These lines - in this case just line 3 - need to be removed from file_A and copied to a new file_B. Meaning:
file_A:
ATOM 624 SC1 SER 288 54.730 23.870 56.950 1.00 0.00
ATOM 3199 NC3 POP 487 50.780 27.750 27.500 1.00 3.18
ATOM 4125 W PW 559 55.550 64.300 16.880 1.00 0.00
file_B:
ATOM 3910 C2B POP 541 96.340 99.070 39.500 1.00 7.00
I'm not sure whether to use sed, grep or awk, or some combination of them.
So far I could only delete the lines and create a new file without them:
awk '!/POP/' file_A > file_B
EDIT:
Does the following work for removing more than one different word?
for (( i= ; i<$numberoflipids ; i++ ))
do
awk '$4~/"${nol[$i]}"/&&$NF>"$pr"{print >"patch_rmlipids.pdb";next}{print > "tmp"}' bilayer_CG_ordered.pdb && mv tmp patch.pdb
done
where $nol is an array containing the words to be removed, $pr is the given threshold, and the .pdb files are the files being used.
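Note that shell variables such as ${nol[$i]} and $pr are not expanded inside a single-quoted awk program, so the loop as written cannot work. One way to pass them in is awk's -v option; a sketch under the assumption that the removals should accumulate across iterations into patch.pdb:
: > patch_rmlipids.pdb                      # start with an empty file for the removed lines
cp bilayer_CG_ordered.pdb patch.pdb         # work on a copy of the ordered file
for (( i=0; i<numberoflipids; i++ )); do
    awk -v word="${nol[$i]}" -v pr="$pr" \
        '$4 ~ word && $NF > pr { print >> "patch_rmlipids.pdb"; next } { print > "tmp" }' \
        patch.pdb && mv tmp patch.pdb
done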
awk
awk '$4~/POP/&&$NF>5{print >"fileb";next}{print > "tmp"}' filea && mv tmp filea
$4~/POP/&&$NF>5       - checks whether the fourth field contains POP and the last field is more than five
{print >"fileb";next} - if so, writes the line to fileb and skips further statements
{print > "tmp"}       - only executed if the first part fails; writes the line to tmp
filea && mv tmp filea - the input file; if the awk command succeeds, overwrite it with tmp

shellscript and awk extraction to calculate averages

I have a shell script that contains a loop. This loop calls another script, and the output of each iteration is appended to a file (outOfLoop.tr). When the loop is finished, an awk command should calculate the averages of specific columns and append the results to another file (fin.tr). At the end, fin.tr is printed.
I managed to get the first part, appending the results from the loop into the outOfLoop.tr file, and my awk commands seem to work, but I'm not getting the final expected output in terms of format. I think I'm missing something. Here is my try:
#!/bin/bash
rm outOfLoop.tr
rm fin.tr
x=1
lmax=4
while [ $x -le $lmax ]
do
calling another script >> outOfLoop.tr
x=$(( $x + 1 ))
done
cat outOfLoop.tr
#/////////////////
#//I'm getting the above part correctly and the output is :
27 194 119 59 178
27 180 100 30 187
27 175 120 59 130
27 189 125 80 145
#////////////////////
#back again to the script
echo "noRun\t A\t B\t C\t D\t E"
echo "----------------------\n"
#// print the total number of runs from the loop
echo "$lmax\t">>fin.tr
#// extract the first column from the output which is 27
awk '{print $1}' outOfLoop.tr >>fin.tr
echo "\t">>fin.tr
#Sum the column---calculate average
awk '{s+=$5;max+=0.5}END{print s/max}' outOfLoop.tr >>fin.tr
echo "\t">>fin.tr
awk '{s+=$4;max+=0.5}END{print s/max}' outOfLoop.tr >>fin.tr
echo "\t">>fin.tr
awk '{s+=$3;max+=0.5}END{print s/max}' outOfLoop.tr >>fin.tr
echo "\t">>fin.tr
awk '{s+=$2;max+=0.5}END{print s/max}' outOfLoop.tr >> fin.tr
echo "-------------------------------------------\n"
cat fin.tr
rm outOfLoop.tr
I want the format to be like :
noRun A B C D E
----------------------------------------------------------
4 27 average average average average
I have incremented max inside the awk command by 0.5 because there is a blank line between each line of results in the outOfLoop file.
$ cat file
27 194 119 59 178
27 180 100 30 187
27 175 120 59 130
27 189 125 80 145
$ cat tst.awk
NF {
    for (i=1;i<=NF;i++) {
        sum[i] += $i
    }
    noRun++
}
END {
    fmt="%-10s%-10s%-10s%-10s%-10s%-10s\n"
    printf fmt,"noRun","A","B","C","D","E"
    printf "----------------------------------------------------------\n"
    printf fmt,noRun,$1,sum[2]/noRun,sum[3]/noRun,sum[4]/noRun,sum[5]/noRun
}
$ awk -f tst.awk file
noRun A B C D E
----------------------------------------------------------
4 27 184.5 116 57 160
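Note the NF pattern on the main block: it skips the blank lines between results, which is why the max+=0.5 workaround from the question is not needed here.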
