Check for word in specific column, check value in other column of this line, cut and paste the line to new text file

My text files contain ~20k lines and look like this:
file_A:
ATOM 624 SC1 SER 288 54.730 23.870 56.950 1.00 0.00
ATOM 3199 NC3 POP 487 50.780 27.750 27.500 1.00 3.18
ATOM 3910 C2B POP 541 96.340 99.070 39.500 1.00 7.00
ATOM 4125 W PW 559 55.550 64.300 16.880 1.00 0.00
Now I need to check for POP in column 4 (lines 2 and 3) and check whether the values in the last column (10) exceed a specific threshold (e.g. 5.00). These lines - in this case just line 3 - need to be removed from file_A and copied to a new file_B. Meaning:
file_A:
ATOM 624 SC1 SER 288 54.730 23.870 56.950 1.00 0.00
ATOM 3199 NC3 POP 487 50.780 27.750 27.500 1.00 3.18
ATOM 4125 W PW 559 55.550 64.300 16.880 1.00 0.00
file_B:
ATOM 3910 C2B POP 541 96.340 99.070 39.500 1.00 7.00
I'm not sure whether to use sed, grep or awk, or some combination of them :/
So far I could only delete the lines and create a new file without them...
awk '!/POP/' file_A > file_B
EDIT:
Does the following work for removing more than one different word?
for (( i=0 ; i<$numberoflipids ; i++ ))
do
awk '$4~/"${nol[$i]}"/&&$NF>"$pr"{print >"patch_rmlipids.pdb";next}{print > "tmp"}' bilayer_CG_ordered.pdb && mv tmp patch.pdb
done
where $nol is an array containing the words to be removed, $pr is the given threshold, and the .pdb files are the files used

awk
awk '$4~/POP/&&$NF>5{print >"fileb";next}{print > "tmp"}' filea && mv tmp filea
$4~/POP/&&$NF>5 - checks whether the fourth field contains POP and the last field is greater than five
{print >"fileb";next} - if both hold, writes the line to fileb and skips the remaining statements
{print > "tmp"} - only executed if the first part fails; writes the line to tmp
filea && mv tmp filea - the input file; if the awk command succeeds, overwrite it with tmp
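Regarding the EDIT: not as written. Shell variables like "${nol[$i]}" and "$pr" are not expanded inside a single-quoted awk program, and each pass of the loop overwrites the previous pass's output. A sketch of a single-pass alternative, assuming (names taken from the EDIT) that nol is a bash array of the words to remove and pr holds the threshold:
# join the array into one alternation such as "POP|CHL" (done in a subshell)
pattern=$(IFS='|'; echo "${nol[*]}")
# pass shell values into awk with -v; $4 ~ pat matches any of the words
awk -v pat="$pattern" -v pr="$pr" \
    '$4 ~ pat && $NF > pr {print > "patch_rmlipids.pdb"; next} {print}' \
    bilayer_CG_ordered.pdb > patch.pdb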

Related

loop through numeric text files in bash and add numbers row wise

I have a set of text files in a folder, like so:
a.txt
1
2
3
4
5
b.txt
1000
1001
1002
1003
1004
.. and so on (assume a fixed number of rows, but an unknown number of text files). What I am looking for is a results file which is the summation across all rows:
result.txt
1001
1003
1005
1007
1009
How do I go about achieving this in bash, without using Python etc.?
Using awk
Try:
$ awk '{a[FNR]+=$0} END{for(i=1;i<=FNR;i++)print a[i]}' *.txt
1001
1003
1005
1007
1009
How it works:
a[FNR]+=$0
For every line read, we add the value of that line, $0, to the partial sum a[FNR], where a is an array and FNR is the line number in the current file.
END{for(i=1;i<=FNR;i++)print a[i]}
After all the files have been read in, this prints out the sum for each line number.
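This relies on the fixed number of rows stated in the question. If the files could differ in length, FNR in the END block only reflects the last file read, so trailing rows from longer files would be dropped; a sketch (my variant, not part of the original answer) that tracks the maximum row number instead:
$ awk '{a[FNR]+=$0; if (FNR>max) max=FNR} END{for(i=1;i<=max;i++)print a[i]}' *.txt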
Using paste and bc
$ paste -d+ *.txt | bc
1001
1003
1005
1007
1009
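How it works: paste -d+ joins corresponding lines of the files with a + separator, so every row becomes an arithmetic expression that bc then evaluates. With the sample files the intermediate stream looks like:
$ paste -d+ a.txt b.txt
1+1000
2+1001
3+1002
4+1003
5+1004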

Passing variable grep command

A given text file (say foo*.txt) contains data as follows.
1 g = 0.54 0.00
2 g = 0.32 0.00
3 g = 0.45 0.00
...
5000 g = 0.5 0.00
Basically, I want to extract the 10 lines before and after the matching line (including the matching line itself). The matching line is 59 characters long, containing strings, spaces and numbers.
I have a script as follows:
#!/usr/bin/bash
for file in foo*.txt;
do
var=$(command_to_extract_var) # 59 characters containing strings, spaces and numbers
# to get this var, I use grep and head
grep -C 10 "$var" "$file"
done > bar.csv
Running the script with bash -x script_name.sh gives the following:
+ for file in 'foo*.txt'
++ grep 'match_pattern' foo1.txt
++ awk '{print $6}'
++ head -n1
++ grep '[0-9]'
+ basis=150
++ grep 'match_pattern' foo1.txt
++ tail -n1
++ awk '{print $3}'
+ number=25
++ grep '[0-9] f = ' foo.txt
++ tail -n150
This is followed by a number of lines (even up to 1000) like
001 h = 0.000000000000000E+00 e = 3.543218084205956E+00
Finally,
File name too long
+ final=
+ grep -C 10 '' foo1.txt
The output I expect is (one column from each file):
0.54 0.62 0.36 ... 0.45
0.32 3.25 0.89 ... 0.25
0.45 0.96 0.14 ... 0.14
... .... .... ... 0.96
0.25 0.00 7.23 ... 0.77
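From the trace, the extraction pipeline spilled hundreds of lines while $var ended up empty (final=), so grep was ultimately called with an empty pattern, and the oversized intermediate text is the likely source of the "File name too long" error. A sketch that pins the extraction to one literal line, assuming the match_pattern from the trace and the fixed 59-character line length from the question:
#!/usr/bin/bash
for file in foo*.txt; do
    # -m1 stops grep after the first matching line; cut keeps the 59-char window
    var=$(grep -m1 'match_pattern' "$file" | cut -c1-59)
    # -F treats $var as a literal string rather than a regex
    [ -n "$var" ] && grep -F -C 10 -- "$var" "$file"
done > bar.csv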

Manipulation of data with BASH

I have a file full of lines like the following:
8408292236 48.04 46n20.13 12e43.22 1.00 3.1 MCSTI 0.4897 108 0.10 0.10 20 41 84EV01978
The first two digits in the first column represent the year and I need to substitute them with the full year while retaining the full line. I tried various methods with awk and sed but couldn't get them working.
My latest attempt is the following:
while read line
a=20
b=19
do
awk '{ if ("$1 | cut -c1-2" == "0"); then {print $a$1}; }' > test.txt
done < catalog.txt
The final output should be:
198408292236 48.04 46n20.13 12e43.22 1.00 3.1 MCSTI 0.4897 108 0.10 0.10 20 41 84EV01978
Any ideas on how I can do this? Any help would be greatly appreciated.
You have a conceptual problem: how would you know whether 15 refers to 2015 or 1915? Otherwise it's quite easy:
#!/bin/bash
for num in 8408292236 8508292236 0408292236 1508292236
do
prefix=${num:0:2} # get the first two digits
if [ $prefix -ge 20 ]; then # assuming that 20-99 refers to the 1900s
echo "19$num"
else
echo "20$num"
fi
done
This will prefix 19 if the first field starts with 16 or more, and will prefix 20 otherwise. I think this is what you need.
awk ' { if ($1 > 1600000000 ) print "19" $0 ; else print "20" $0 ; }' catalog.txt
A solution using just bash:
while read A; do
PREFIX=20
if [ ${A:0:2} -gt 15 ] ; then
PREFIX=19
fi
echo "${PREFIX}${A}"
done
sed -e 's/^[0-3]/20&/' -e 's/^[4-9]/19&/'
This will prepend 20 to each line starting with 0 to 3, and 19 to each line starting with 4 to 9, so we consider 40-99 to correspond to the 1900s and 00-39 to the 2000s.
Adapt to your need.
Use the awk ternary operator.
$ cat catalog.txt
8408292236 48.04 46n20.13 12e43.22 1.00 3.1 MCSTI 0.4897 108 0.10 0.10 20 41 84EV01978
0408292236 48.04 46n20.13 12e43.22 1.00 3.1 MCSTI 0.4897 108 0.10 0.10 20 41 84EV01978
$ awk '$1=substr($1,1,2)<15 ? "20"$1 : "19"$1' catalog.txt
198408292236 48.04 46n20.13 12e43.22 1.00 3.1 MCSTI 0.4897 108 0.10 0.10 20 41 84EV01978
200408292236 48.04 46n20.13 12e43.22 1.00 3.1 MCSTI 0.4897 108 0.10 0.10 20 41 84EV01978
Explanation
Let’s use the ternary operator.
expr ? action1 : action2
It's pretty straightforward: if expr is true then action1 is performed/evaluated; if not, action2.
Field one must be prefixed with 19 or 20 depending on the value of its first two characters, substr($1,1,2).
NOTE: This only works if your data does not contain years below 1915.
At this point, we just need to change the first field $1 accordingly: $1=substr($1,1,2)<15 ? "20"$1 : "19"$1
Check: http://klashxx.github.io/ternary-operator/
Simply in bash:
while read i; do echo "$((10#${i:0:2}<20?20:19))$i"; done <file.txt
(the 10# forces base 10, so prefixes like 08 are not misread as invalid octal)
shorter in perl:
perl -pe'print substr($_,0,2)<20?20:19' file.txt
even shorter in sed:
sed 's/^[01]/20&/;t;s/^/19/' file.txt

Divide column values of different files by a constant then output one minus the other

I have two files of the form
file1:
#fileheader1
0 123
1 456
2 789
3 999
4 112
5 131
6 415
etc.
file2:
#fileheader2
0 442
1 232
2 542
3 559
4 888
5 231
6 322
etc.
How can I take the second column of each, divide it by a value, subtract one from the other, and then output a new third file with the new values?
I want the output file to have the form
#outputheader
0 123/c-442/k
1 456/c-232/k
2 789/c-542/k
etc.
where c and k are numbers I can plug into the script
I have seen this question: subtract columns from different files with awk
But I don't know how to use awk to do this by myself. Does anyone know how to do this, or could you explain what is going on in the linked question so I can try to modify it?
I'd write:
awk -v c=10 -v k=20 ' ;# pass values to awk variables
/^#/ {next} ;# skip headers
FNR==NR {val[$1]=$2; next} ;# store values from file1
$1 in val {print $1, (val[$1]/c - $2/k)} ;# perform the calc and print
' file1 file2
output
0 -9.8
1 34
2 51.8
3 71.95
4 -33.2
5 1.55
6 25.4
etc. 0
(the etc. lines are not skipped by the /^#/ filter, so etc. is also treated as a key with empty values, which is where this last row comes from)

The simplest way to join 2 files using bash so that keys from both appear in the result

I have 2 input files
file1
A 0.01
B 0.09
D 0.05
F 0.08
file2
A 0.03
C 0.01
D 0.04
E 0.09
The output I want is
A 0.01 0.03
B 0.09 NULL
C NULL 0.01
D 0.05 0.04
E NULL 0.09
F 0.08 NULL
The best that I can do is
join -t' ' -a 1 -a 2 -1 1 -2 1 -o 1.1,1.2,2.2 file1 file2
which doesn't give me what I want
You can write:
join -t $'\t' -a 1 -a 2 -1 1 -2 1 -e NULL -o 0,1.2,2.2 file1 file2
where I've made these changes:
In the output format, I changed 1.1 ("first column of file #1") to 0 ("join field"), so that values from file #2 can show up in the first field when necessary. (Specifically, so that C and E will.)
I added the -e option to specify a value (NULL) for missing/empty fields.
I used $'\t', which Bash converts to a tab, instead of typing an actual tab. I find this easier to use than a tab in the middle of the command. But if you disagree, and the actual tab is working for you, then by all means, you can keep using it. :-)
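For reference, assuming both sample files are tab-delimited and sorted on the key (as join requires), the corrected command reproduces the requested table exactly:
$ join -t $'\t' -a 1 -a 2 -1 1 -2 1 -e NULL -o 0,1.2,2.2 file1 file2
A	0.01	0.03
B	0.09	NULL
C	NULL	0.01
D	0.05	0.04
E	NULL	0.09
F	0.08	NULL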
