Searching patterns within txt file with post-processing

I have a long text file consisting of lines like these:
ATOM 5010 HD13 LEU 301 0.158 20.865 10.630 1.00 0.00 PROA
ATOM 5011 CD2 LEU 301 1.684 22.404 12.349 1.00 1.00 PROA
ATOM 5012 HD21 LEU 301 2.233 22.501 13.310 1.00 0.00 PROA
ATOM 5013 HD22 LEU 301 1.584 23.412 11.894 1.00 0.00 PROA
ATOM 5014 HD23 LEU 301 2.267 21.744 11.672 1.00 0.00 PROA
ATOM 5015 C LEU 301 -0.687 23.995 15.639 1.00 0.00 PROA
ATOM 5016 O LEU 301 -1.791 24.341 15.139 1.00 0.00 PROA
ATOM 5017 NT LEU 301 -0.211 24.391 16.849 1.00 1.00 PROA
ATOM 5018 HT1 LEU 301 0.679 24.065 17.168 1.00 0.00 PROA
ATOM 5019 HT2 LEU 301 -0.752 25.007 17.422 1.00 0.00 PROA
ATOM 5020 SOD SOD 302 1.519 2.284 1.361 1.00 0.00 HETA
From this file I need to copy the line whose third column is SOD:
ATOM 5020 SOD SOD 302 1.519 2.284 1.361 1.00 0.00 HETA
and paste it into a separate text file, sod.txt (which should contain only that one line, identical to the original).
A solution using awk or sed would be welcome.

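You can use sed's w (write) command. The address below restricts the match to lines whose third whitespace-separated field is SOD:
sed -n '/^[^[:space:]]\{1,\}[[:space:]]\{1,\}[^[:space:]]\{1,\}[[:space:]]\{1,\}SOD[[:space:]]/w sod.txt' file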

In awk:
$ awk '$3=="SOD"' file # > new_file # uncomment to write to a new file
ATOM 5020 SOD SOD 302 1.519 2.284 1.361 1.00 0.00 HETA

You can try this
awk '{if ($3 == "SOD") print $0;}' input.txt >sod.txt
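To check that sod.txt really ends up with a single line identical to the original, a minimal sketch along the following lines may help. The helper name extract_by_third_field, the file name input.txt and the scratch directory are assumptions made only for this illustration.

```shell
#!/usr/bin/env bash
# extract_by_third_field NAME FILE: print the records of FILE whose third
# whitespace-separated field is exactly NAME. (Hypothetical helper name.)
extract_by_third_field() {
    awk -v name="$1" '$3 == name' "$2"
}

# Scratch directory so that no existing file is overwritten (assumption).
workdir=$(mktemp -d)

# Three records copied from the question.
cat > "$workdir/input.txt" <<'EOF'
ATOM 5018 HT1 LEU 301 0.679 24.065 17.168 1.00 0.00 PROA
ATOM 5019 HT2 LEU 301 -0.752 25.007 17.422 1.00 0.00 PROA
ATOM 5020 SOD SOD 302 1.519 2.284 1.361 1.00 0.00 HETA
EOF

extract_by_third_field SOD "$workdir/input.txt" > "$workdir/sod.txt"

# Report how many lines were written, then show them.
echo "lines written: $(( $(wc -l < "$workdir/sod.txt") ))"
cat "$workdir/sod.txt"
```

Passing the residue name with -v keeps the awk program free of shell quoting, so the same helper can be reused for other names.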

Related

How do I return a varying number as a variable in a string found in another file that otherwise stays constant (BASH)?

I have a file that contains text like this (only a portion of it here) and want to find the ATOM # associated with the O5' line (in this case "2"). I would then like to store this number as a variable for future use. Note that the data below is stored in another file titled "xyz.file" for example. The number of spaces between "ATOM" and the column the number of interest is found in may vary as the number of interest's value changes.
ATOM 1 HO5' G5 1 7.415 -9.123 -8.109 1.00 0.00
ATOM 2 O5' G5 1 7.997 -8.960 -8.863 1.00 0.00
ATOM 3 C5' G5 1 9.136 -9.784 -8.729 1.00 0.00
ATOM 4 H5' G5 1 9.679 -9.808 -9.673 1.00 0.00
ATOM 5 H5'' G5 1 8.814 -10.797 -8.484 1.00 0.00
ATOM 6 C4' G5 1 10.067 -9.272 -7.628 1.00 0.00
ATOM 7 H4' G5 1 10.847 -10.015 -7.448 1.00 0.00
ATOM 8 O4' G5 1 10.700 -8.053 -7.990 1.00 0.00
ATOM 9 C1' G5 1 10.866 -7.262 -6.821 1.00 0.00
ATOM 10 H1' G5 1 11.907 -6.970 -6.696 1.00 0.00
ATOM 11 N9 G5 1 10.027 -6.048 -6.896 1.00 0.00

An awk one-liner:
n=$(awk '$3 == "O5'\''" {print $2; exit}' file)
echo "$n"
prints
2
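For later use in a script it can be convenient to wrap the lookup in a function and to check whether anything was found. The sketch below is one possible way to do that; the function name atom_number and the temporary sample file are hypothetical and not part of the original question.

```shell
#!/usr/bin/env bash
# atom_number NAME FILE: print the atom serial (field 2) of the first record
# whose atom name (field 3) equals NAME. (Hypothetical helper name.)
atom_number() {
    awk -v name="$1" '$3 == name { print $2; exit }' "$2"
}

# Sample records copied from the question, written to a temporary file.
sample=$(mktemp)
cat > "$sample" <<'EOF'
ATOM 1 HO5' G5 1 7.415 -9.123 -8.109 1.00 0.00
ATOM 2 O5' G5 1 7.997 -8.960 -8.863 1.00 0.00
ATOM 3 C5' G5 1 9.136 -9.784 -8.729 1.00 0.00
EOF

n=$(atom_number "O5'" "$sample")
if [ -n "$n" ]; then
    echo "O5' is atom $n"
else
    echo "O5' not found" >&2
fi
rm -f "$sample"
```

Because the atom name is passed with -v, the single quote in O5' no longer has to be escaped inside the awk program.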

delete rows after specific character | awk

I am writing a Bash script and need to remove all lines between the TER records, including the TER lines themselves.
Input file:
ATOM 186 O3' U 6 7.297 6.145 -5.250 1.00 0.00 O
ATOM 187 HO3' U 6 7.342 5.410 -5.865 1.00 0.00 H
TER
ATOM 1 HO5' A 1 3.429 -7.861 3.641 1.00 0.00 H
ATOM 2 O5' A 1 4.232 -7.360 3.480 1.00 0.00 O
ATOM 3 C5' A 1 5.480 -8.064 3.350 1.00 0.00 C
ATOM 4 H5' A 1 5.429 -8.766 2.518 1.00 0.00 H
TER
Expected output:
ATOM 186 O3' U 6 7.297 6.145 -5.250 1.00 0.00 O
ATOM 187 HO3' U 6 7.342 5.410 -5.865 1.00 0.00 H
I found
sed '/TER/,$d' ${myArray[j]}.txt >> ${MyArray[j]}.txt ### ${MyArray[j]} file name through an array
But this does not work. I think awk would work in a Bash script. Any help is appreciated, thanks.

You can just use sed like this:
sed -i.bak '/^TER/,/^TER/d' "${myArray[j]}.txt"
cat "${myArray[j]}.txt"
ATOM 186 O3' U 6 7.297 6.145 -5.250 1.00 0.00 O
ATOM 187 HO3' U 6 7.342 5.410 -5.865 1.00 0.00 H

Delete the range between the two TER lines with sed:
sed '/TER/,/TER/d'
For example:
echo "ATOM 186 O3' U 6 7.297 6.145 -5.250 1.00 0.00 O
ATOM 187 HO3' U 6 7.342 5.410 -5.865 1.00 0.00 H
TER
ATOM 1 HO5' A 1 3.429 -7.861 3.641 1.00 0.00 H
ATOM 2 O5' A 1 4.232 -7.360 3.480 1.00 0.00 O
ATOM 3 C5' A 1 5.480 -8.064 3.350 1.00 0.00 C
ATOM 4 H5' A 1 5.429 -8.766 2.518 1.00 0.00 H
TER" | sed '/TER/,/TER/d'
Output:
ATOM 186 O3' U 6 7.297 6.145 -5.250 1.00 0.00 O
ATOM 187 HO3' U 6 7.342 5.410 -5.865 1.00 0.00 H
General form:
sed '/Start Pattern/,/End Pattern/d'

It can be done like this
sed '/TER/,$d' "${myArray[j]}.txt" > tmp.txt # note: a single ">", not ">>"
mv tmp.txt "${myArray[j]}.txt"

awk also provides a simple solution using a flag to control printing. Below, the skip variable is used as that flag: while it is 1, lines are skipped, and on the transition from 1 back to 0 the script exits.
awk -v skip=0 '$1=="TER"{skip=skip?0:1; if (!skip)exit} !skip' file
Above, $1=="TER" matches lines (records) whose first field is TER (this disambiguates between "TER" and "TERMINAL", etc.). Within the rule, the ternary skip=skip?0:1 sets skip=1 the first time "TER" is encountered and back to 0 on the next, at which point the script exits. The !skip pattern at the end prints a record only while skip is 0.
Example Use/Output
Using your data in file, you would get:
$ awk -v skip=0 '$1=="TER"{skip=skip?0:1; if (!skip)exit} !skip' file
ATOM 186 O3' U 6 7.297 6.145 -5.250 1.00 0.00 O
ATOM 187 HO3' U 6 7.342 5.410 -5.865 1.00 0.00 H
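If records that follow the second TER should be kept rather than discarded, a toggle without exit is one option. The following is a minimal sketch under that assumption; the function name drop_ter_blocks and the placeholder records are invented for illustration and are not real coordinates.

```shell
#!/usr/bin/env bash
# drop_ter_blocks: read records on stdin, drop everything from one TER record
# through the next TER record (both inclusive), and print the rest.
# (Hypothetical helper name.)
drop_ter_blocks() {
    # Each TER flips the flag and is itself skipped; other records are
    # printed only while the flag is off.
    awk '$1 == "TER" { skip = !skip; next } !skip'
}

# Placeholder records showing which parts survive.
printf '%s\n' 'ATOM before-1' 'ATOM before-2' 'TER' 'ATOM inside-1' 'TER' 'ATOM after-1' |
    drop_ter_blocks
```

With this input the two "before" records and the "after" record are printed, while the TER lines and the record between them are dropped.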

remove space from specific column by bash

I am new to bash and would really appreciate your help.
I have a file like this:
ATOM 1 N LYS P1852 10.932 0.523 -24.701 1.00 0.00
ATOM 2 HN1 LYS P1852 11.571 0.864 -25.419 1.00 0.00
ATOM 3 HN2 LYS P1852 10.431 1.305 -24.278 1.00 0.00
ATOM 4 HN3 LYS P1852 10.154 0.023 -25.132 1.00 0.00
ATOM 5 CA LYS P1852 11.556 -0.319 -23.640 1.00 0.00
and I need to remove the space at a specific position (say, position 30) on every line. The output should look as follows:
ATOM 1 N LYS P1852 10.932 0.523 -24.701 1.00 0.00
ATOM 2 HN1 LYS P1852 11.571 0.864 -25.419 1.00 0.00
ATOM 3 HN2 LYS P1852 10.431 1.305 -24.278 1.00 0.00
ATOM 4 HN3 LYS P1852 10.154 0.023 -25.132 1.00 0.00
ATOM 5 CA LYS P1852 11.556 -0.319 -23.640 1.00 0.00
I have tried sed and other commands, but nothing has worked so far.
Thank you.

You can use cut:
cut --complement -c 30 input.txt
From the manual:
-c, --characters=LIST
select only these characters
--complement
complement the set of selected bytes, characters or fields
--complement is GNU cut specific, if that is not available:
cut -c -29,31- input.txt
The commands above remove whatever character is at position 30. If you only want to remove it when it is a space:
sed -E 's/^(.{29}) /\1/' input.txt
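When the column number has to be a parameter, the same idea can be written with awk's substr. This is only a sketch: the function name delete_space_at is hypothetical, and the short test strings stand in for the real fixed-width records.

```shell
#!/usr/bin/env bash
# delete_space_at COL: read lines on stdin; when the character in column COL
# (1-based) is a space, delete it; otherwise leave the line unchanged.
# (Hypothetical helper name.)
delete_space_at() {
    awk -v col="$1" '{
        if (substr($0, col, 1) == " ")
            $0 = substr($0, 1, col - 1) substr($0, col + 1)
        print
    }'
}

# Column 5 is a space in the first line but not in the second.
printf '%s\n' 'abcd efgh' 'abcdefgh' | delete_space_at 5
```

For the file in the question this would be called as delete_space_at 30 < input.txt.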

Grep not parsing the whole file

I want to use grep to pick lines not containing "WAT" in a file containing 425409 lines with a file size of 26.8 MB, UTF8 encoding.
The file looks like this
>ATOM 1 N ALA 1 9.979 -15.619 28.204 1.00 0.00
>ATOM 2 H1 ALA 1 9.594 -15.053 28.938 1.00 0.00
>ATOM 3 H2 ALA 1 9.558 -15.358 27.323 1.00 0.00
>ATOM 12 O ALA 1 7.428 -16.246 28.335 1.00 0.00
>ATOM 13 N HID 2 7.563 -18.429 28.562 1.00 0.00
>ATOM 14 H HID 2 6.557 -18.369 28.638 1.00 0.00
>ATOM 15 CA HID 2 8.082 -19.800 28.535 1.00 0.00
>ATOM 24 HE1 HID 2 8.603 -23.670 33.041 1.00 0.00
>ATOM 25 NE2 HID 2 8.012 -23.749 30.962 1.00 0.00
>ATOM 29 O HID 2 5.854 -20.687 28.537 1.00 0.00
>ATOM 30 N GLN 3 7.209 -21.407 26.887 1.00 0.00
>ATOM 31 H GLN 3 8.168 -21.419 26.566 1.00 0.00
>ATOM 32 CA GLN 3 6.271 -22.274 26.157 1.00 0.00
**16443 lines**
>ATOM 16425 C116 PA 1089 -34.635 6.968 -0.185 1.00 0.00
>ATOM 16426 H16R PA 1089 -35.669 7.267 -0.368 1.00 0.00
>ATOM 16427 H16S PA 1089 -34.579 5.878 -0.218 1.00 0.00
>ATOM 16428 H16T PA 1089 -34.016 7.366 -0.990 1.00 0.00
>ATOM 16429 C115 PA 1089 -34.144 7.493 1.177 1.00 0.00
>ATOM 16430 H15R PA 1089 -33.101 7.198 1.305 1.00 0.00
>ATOM 16431 H15S PA 1089 -34.179 8.585 1.197 1.00 0.00
>ATOM 16432 C114 PA 1089 -34.971 6.910 2.342 1.00 0.00
>ATOM 16433 H14R PA 1089 -35.147 5.847 2.166 1.00 0.00
**132284 lines**
>ATOM 60981 O WAT 7952 -46.056 -5.515 -56.245 1.00 0.00
>ATOM 60982 H1 WAT 7952 -45.185 -5.238 -56.602 1.00 0.00
>ATOM 60983 H2 WAT 7952 -46.081 -6.445 -56.561 1.00 0.00
>TER
>ATOM 60984 O WAT 7953 -51.005 -3.205 -46.712 1.00 0.00
>ATOM 60985 H1 WAT 7953 -51.172 -3.159 -47.682 1.00 0.00
>ATOM 60986 H2 WAT 7953 -51.051 -4.177 -46.579 1.00 0.00
>TER
>ATOM 60987 O WAT 7954 -49.804 -0.759 -49.284 1.00 0.00
>ATOM 60988 H1 WAT 7954 -48.962 -0.677 -49.785 1.00 0.00
>ATOM 60989 H2 WAT 7954 -49.868 0.138 -48.903 1.00 0.00
**many lines until the end**
>TER
>END
I have used grep -v 'WAT' file.txt but it only returned me the first 16179 lines not containing "WAT" and I can see that there are more lines not containing "WAT". For instance, the following line (and many others) does not appear in the output:
> ATOM 16425 C116 PA 1089 -34.635 6.968 -0.185 1.00 0.00
To figure out what was happening I tried grep ' ' file.txt. This command should return every line in the file, but it too returned only the first 16179 lines.
I've also tried to use tail -408977 file.txt | grep ' ' and it returned me all lines recalled by tail. Then I've tried tail -408978 file.txt | grep ' ' and the output was totally empty, zero lines.
I am working on a "normal" 64 bit system, Kubuntu.
Thanks a lot for the help!

When I try I get
$: grep WAT file.txt
Binary file file.txt matches
grep is treating it as a binary file. Add -a:
-a, --text equivalent to --binary-files=text
$: grep -a WAT file.txt|head -3
ATOM 29305 O WAT 4060 -75.787 -79.125 25.925 1.00 0.00 O
ATOM 29306 H1 WAT 4060 -76.191 -78.230 25.936 1.00 0.00 H
ATOM 29307 H2 WAT 4060 -76.556 -79.670 25.684 1.00 0.00 H
Your file has 2 NUL bytes at the end of each of lines 16426, 16428, 16430, and 16432.
$: tr '\0' '#' <file.txt | grep -n '#'
16426:ATOM 16421 KA CAL 1085 -20.614 -22.960 18.641 1.00 0.00 ##
16428:ATOM 16422 KA CAL 1086 20.249 21.546 19.443 1.00 0.00 ##
16430:ATOM 16423 KA CAL 1087 22.695 -19.700 19.624 1.00 0.00 ##
16432:ATOM 16424 KA CAL 1088 -22.147 19.317 17.966 1.00 0.00 ##
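As an alternative to -a, the NUL bytes can be deleted before grep sees the data, which also keeps them out of the output. The sketch below assumes that dropping the NUL bytes is acceptable for the file in question; the function name lines_without and the tiny sample file are made up for illustration.

```shell
#!/usr/bin/env bash
# lines_without PATTERN FILE: delete NUL bytes from FILE, then print the lines
# that do not contain PATTERN. (Hypothetical helper name.)
lines_without() {
    tr -d '\000' < "$2" | grep -v -- "$1"
}

sample=$(mktemp)
# The second record ends with two NUL bytes, mimicking the damage described above.
printf 'ATOM 1 N ALA 1\nATOM 2 KA CAL 1085\000\000\nATOM 3 O WAT 7952\nEND\n' > "$sample"

lines_without WAT "$sample"
rm -f "$sample"
```

To repair the file itself rather than just filter it, the output of tr -d '\000' can be redirected to a new file.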

Delete duplicate lines through pattern in bash

I have to post-process some text files that contain runs of duplicated "TER" lines, e.g.
ATOM 47047 H1 WAT 11303 -32.626 -35.728 -30.283 1.00 0.00
ATOM 47048 H2 WAT 11303 -33.975 -35.757 -30.969 1.00 0.00
TER
TER
TER
TER
TER
ATOM 47052 O WAT 11305 -38.279 -35.930 -33.162 1.00 0.00
ATOM 47053 H1 WAT 11305 -37.860 -35.087 -33.334 1.00 0.00
ATOM 47054 H2 WAT 11305 -39.198 -35.793 -33.391 1.00 0.00
TER
TER
ATOM 47055 O WAT 11306 -35.943 -38.199 -31.778 1.00 0.00
ATOM 47056 H1 WAT 11306 -35.823 -38.794 -31.039 1.00 0.00
ATOM 47057 H2 WAT 11306 -35.083 -38.162 -32.198 1.00 0.00
TER
ATOM 47058 O WAT 11307 -33.604 -37.645 -33.202 1.00 0.00
ATOM 47059 H1 WAT 11307 -34.130 -37.121 -33.805 1.00 0.00
ATOM 47060 H2 WAT 11307 -33.261 -37.012 -32.571 1.00 0.00
TER
TER
TER
ATOM 47061 O WAT 11308 -40.428 -29.625 -32.046 1.00 0.00
ATOM 47062 H1 WAT 11308 -40.966 -28.900 -32.365 1.00 0.00
ATOM 47063 H2 WAT 11308 -40.175 -30.102 -32.837 1.00 0.00
TER
In this file I would like to collapse every run of repeated TER lines, keeping only the first TER of each run, e.g.
ATOM 47047 H1 WAT 11303 -32.626 -35.728 -30.283 1.00 0.00
ATOM 47048 H2 WAT 11303 -33.975 -35.757 -30.969 1.00 0.00
TER
ATOM 47052 O WAT 11305 -38.279 -35.930 -33.162 1.00 0.00
ATOM 47053 H1 WAT 11305 -37.860 -35.087 -33.334 1.00 0.00
ATOM 47054 H2 WAT 11305 -39.198 -35.793 -33.391 1.00 0.00
TER
ATOM 47055 O WAT 11306 -35.943 -38.199 -31.778 1.00 0.00
ATOM 47056 H1 WAT 11306 -35.823 -38.794 -31.039 1.00 0.00
ATOM 47057 H2 WAT 11306 -35.083 -38.162 -32.198 1.00 0.00
TER
ATOM 47058 O WAT 11307 -33.604 -37.645 -33.202 1.00 0.00
ATOM 47059 H1 WAT 11307 -34.130 -37.121 -33.805 1.00 0.00
ATOM 47060 H2 WAT 11307 -33.261 -37.012 -32.571 1.00 0.00
TER
ATOM 47061 O WAT 11308 -40.428 -29.625 -32.046 1.00 0.00
ATOM 47062 H1 WAT 11308 -40.966 -28.900 -32.365 1.00 0.00
ATOM 47063 H2 WAT 11308 -40.175 -30.102 -32.837 1.00 0.00
TER
I will be grateful for any solutions with bash commands like sed, grep or awk.

First check with
uniq -d file
that TER is the only line repeated on adjacent lines; if so,
uniq file
collapses each run of duplicated TER lines into a single one.

Short sed solution:
sed '$!N;/TER\nTER/!P;D;' file
$!N - appends the next line to the pattern space (so each pair of adjacent lines is examined), except on the last line $
/TER\nTER/!P;D; - P prints the first line of the pattern space unless both lines are TER; D then deletes that first line and restarts the cycle with the remaining one
The output:
ATOM 47047 H1 WAT 11303 -32.626 -35.728 -30.283 1.00 0.00
ATOM 47048 H2 WAT 11303 -33.975 -35.757 -30.969 1.00 0.00
TER
ATOM 47052 O WAT 11305 -38.279 -35.930 -33.162 1.00 0.00
ATOM 47053 H1 WAT 11305 -37.860 -35.087 -33.334 1.00 0.00
ATOM 47054 H2 WAT 11305 -39.198 -35.793 -33.391 1.00 0.00
TER
ATOM 47055 O WAT 11306 -35.943 -38.199 -31.778 1.00 0.00
ATOM 47056 H1 WAT 11306 -35.823 -38.794 -31.039 1.00 0.00
ATOM 47057 H2 WAT 11306 -35.083 -38.162 -32.198 1.00 0.00
TER
ATOM 47058 O WAT 11307 -33.604 -37.645 -33.202 1.00 0.00
ATOM 47059 H1 WAT 11307 -34.130 -37.121 -33.805 1.00 0.00
ATOM 47060 H2 WAT 11307 -33.261 -37.012 -32.571 1.00 0.00
TER
ATOM 47061 O WAT 11308 -40.428 -29.625 -32.046 1.00 0.00
ATOM 47062 H1 WAT 11308 -40.966 -28.900 -32.365 1.00 0.00
ATOM 47063 H2 WAT 11308 -40.175 -30.102 -32.837 1.00 0.00
TER

try sed:
sed '/^TER/{N;/\nTER\s*$/D}' urfile

Keep track of the previous line and don't print the line when it's TER, if the previous one was also "TER":
awk '!/^TER$/ || prev != "TER" { print } { prev = $0 }' file
You can skip the explicit { print } block too, since that's the default action:
awk '!/^TER$/ || prev != "TER"; { prev = $0 }' file

Here is another awk version:
awk '/^TER/{ c++; if ( c == 1 ){ print }}/^ATOM/{ print; c = 0 }' file

awk '!/^TER/{c=1}c; /^TER/{c=0}' file
c is a flag that decides whether the current line is printed.
On a non-"TER" line, the flag is set and the line is printed.
On the first "TER" of a run, c is still set, so the line is printed, and then the flag is cleared.
The following consecutive "TER" lines are not printed because c stays cleared.

$ awk '$1=="TER" && p=="TER"{next} {print; p=$1}' file
ATOM 47047 H1 WAT 11303 -32.626 -35.728 -30.283 1.00 0.00
ATOM 47048 H2 WAT 11303 -33.975 -35.757 -30.969 1.00 0.00
TER
ATOM 47052 O WAT 11305 -38.279 -35.930 -33.162 1.00 0.00
ATOM 47053 H1 WAT 11305 -37.860 -35.087 -33.334 1.00 0.00
ATOM 47054 H2 WAT 11305 -39.198 -35.793 -33.391 1.00 0.00
TER
ATOM 47055 O WAT 11306 -35.943 -38.199 -31.778 1.00 0.00
ATOM 47056 H1 WAT 11306 -35.823 -38.794 -31.039 1.00 0.00
ATOM 47057 H2 WAT 11306 -35.083 -38.162 -32.198 1.00 0.00
TER
ATOM 47058 O WAT 11307 -33.604 -37.645 -33.202 1.00 0.00
ATOM 47059 H1 WAT 11307 -34.130 -37.121 -33.805 1.00 0.00
ATOM 47060 H2 WAT 11307 -33.261 -37.012 -32.571 1.00 0.00
TER
ATOM 47061 O WAT 11308 -40.428 -29.625 -32.046 1.00 0.00
ATOM 47062 H1 WAT 11308 -40.966 -28.900 -32.365 1.00 0.00
ATOM 47063 H2 WAT 11308 -40.175 -30.102 -32.837 1.00 0.00
TER
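Since several files need post-processing, the awk filter above can be wrapped in a small function that rewrites each file through a temporary copy. This is a sketch under assumptions: the function name squeeze_ter is hypothetical, and the files to process are taken from the script's command-line arguments.

```shell
#!/usr/bin/env bash
# squeeze_ter FILE: collapse each run of consecutive TER records in FILE into
# a single TER, rewriting FILE via a temporary copy. (Hypothetical helper name.)
squeeze_ter() {
    local file=$1 tmp
    tmp=$(mktemp) || return 1
    # Skip a TER record when the previous record was also TER; the result is
    # copied back only if awk succeeded, which keeps the file's permissions.
    awk '$1 == "TER" && prev == "TER" { next } { print; prev = $1 }' "$file" > "$tmp" &&
        cat "$tmp" > "$file"
    local status=$?
    rm -f "$tmp"
    return "$status"
}

# Process every file named on the command line; does nothing without arguments.
for f in "$@"; do
    squeeze_ter "$f" || echo "failed: $f" >&2
done
```

Saved under a name of your choice, the script could be run with the text files as arguments; keeping a backup of the originals first is advisable because the files are overwritten.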
