Grep not parsing the whole file

Grep not parsing the whole file - bash

I want to use grep to pick lines not containing "WAT" in a file containing 425409 lines with a file size of 26.8 MB, UTF8 encoding.
The file looks like this
>ATOM 1 N ALA 1 9.979 -15.619 28.204 1.00 0.00
>ATOM 2 H1 ALA 1 9.594 -15.053 28.938 1.00 0.00
>ATOM 3 H2 ALA 1 9.558 -15.358 27.323 1.00 0.00
>ATOM 12 O ALA 1 7.428 -16.246 28.335 1.00 0.00
>ATOM 13 N HID 2 7.563 -18.429 28.562 1.00 0.00
>ATOM 14 H HID 2 6.557 -18.369 28.638 1.00 0.00
>ATOM 15 CA HID 2 8.082 -19.800 28.535 1.00 0.00
>ATOM 24 HE1 HID 2 8.603 -23.670 33.041 1.00 0.00
>ATOM 25 NE2 HID 2 8.012 -23.749 30.962 1.00 0.00
>ATOM 29 O HID 2 5.854 -20.687 28.537 1.00 0.00
>ATOM 30 N GLN 3 7.209 -21.407 26.887 1.00 0.00
>ATOM 31 H GLN 3 8.168 -21.419 26.566 1.00 0.00
>ATOM 32 CA GLN 3 6.271 -22.274 26.157 1.00 0.00
**16443 lines**
>ATOM 16425 C116 PA 1089 -34.635 6.968 -0.185 1.00 0.00
>ATOM 16426 H16R PA 1089 -35.669 7.267 -0.368 1.00 0.00
>ATOM 16427 H16S PA 1089 -34.579 5.878 -0.218 1.00 0.00
>ATOM 16428 H16T PA 1089 -34.016 7.366 -0.990 1.00 0.00
>ATOM 16429 C115 PA 1089 -34.144 7.493 1.177 1.00 0.00
>ATOM 16430 H15R PA 1089 -33.101 7.198 1.305 1.00 0.00
>ATOM 16431 H15S PA 1089 -34.179 8.585 1.197 1.00 0.00
>ATOM 16432 C114 PA 1089 -34.971 6.910 2.342 1.00 0.00
>ATOM 16433 H14R PA 1089 -35.147 5.847 2.166 1.00 0.00
**132284 lines**
>ATOM 60981 O WAT 7952 -46.056 -5.515 -56.245 1.00 0.00
>ATOM 60982 H1 WAT 7952 -45.185 -5.238 -56.602 1.00 0.00
>ATOM 60983 H2 WAT 7952 -46.081 -6.445 -56.561 1.00 0.00
>TER
>ATOM 60984 O WAT 7953 -51.005 -3.205 -46.712 1.00 0.00
>ATOM 60985 H1 WAT 7953 -51.172 -3.159 -47.682 1.00 0.00
>ATOM 60986 H2 WAT 7953 -51.051 -4.177 -46.579 1.00 0.00
>TER
>ATOM 60987 O WAT 7954 -49.804 -0.759 -49.284 1.00 0.00
>ATOM 60988 H1 WAT 7954 -48.962 -0.677 -49.785 1.00 0.00
>ATOM 60989 H2 WAT 7954 -49.868 0.138 -48.903 1.00 0.00
**many lines until the end**
>TER
>END
I have used grep -v 'WAT' file.txt but it only returned me the first 16179 lines not containing "WAT" and I can see that there are more lines not containing "WAT". For instance, the following line (and many others) does not appear in the output:
> ATOM 16425 C116 PA 1089 -34.635 6.968 -0.185 1.00 0.00
In order to try to figure out what was happening I've tried grep ' ' file.txt. This command should return every line in the file, but it only returned he first 16179 lines too.
I've also tried to use tail -408977 file.txt | grep ' ' and it returned me all lines recalled by tail. Then I've tried tail -408978 file.txt | grep ' ' and the output was totally empty, zero lines.
I am working on a "normal" 64 bit system, Kubuntu.
Thanks a lot for the help!

When I try I get
$: grep WAT file.txt
Binary file file.txt matches
grep is assuming it's a binary file. add -a
-a, --text equivalent to --binary-files=text
$: grep -a WAT file.txt|head -3
ATOM 29305 O WAT 4060 -75.787 -79.125 25.925 1.00 0.00 O
ATOM 29306 H1 WAT 4060 -76.191 -78.230 25.936 1.00 0.00 H
ATOM 29307 H2 WAT 4060 -76.556 -79.670 25.684 1.00 0.00 H
Your file has 2 NULLs each at the end of lines 16426, 16428, 16430, and 16432.
$: tr "\0" # <file.txt|grep -n #
16426:ATOM 16421 KA CAL 1085 -20.614 -22.960 18.641 1.00 0.00 ##
16428:ATOM 16422 KA CAL 1086 20.249 21.546 19.443 1.00 0.00 ##
16430:ATOM 16423 KA CAL 1087 22.695 -19.700 19.624 1.00 0.00 ##
16432:ATOM 16424 KA CAL 1088 -22.147 19.317 17.966 1.00 0.00 ##

Related

How do I return a varying number as a variable in a string found in another file that otherwise stays constant (BASH)?

I have a file that contains text like this (only a portion of it here) and want to find the ATOM # associated with the O5' line (in this case "2"). I would then like to store this number as a variable for future use. Note that the data below is stored in another file titled "xyz.file" for example. The number of spaces between "ATOM" and the column the number of interest is found in may vary as the number of interest's value changes.
ATOM 1 HO5' G5 1 7.415 -9.123 -8.109 1.00 0.00
ATOM 2 O5' G5 1 7.997 -8.960 -8.863 1.00 0.00
ATOM 3 C5' G5 1 9.136 -9.784 -8.729 1.00 0.00
ATOM 4 H5' G5 1 9.679 -9.808 -9.673 1.00 0.00
ATOM 5 H5'' G5 1 8.814 -10.797 -8.484 1.00 0.00
ATOM 6 C4' G5 1 10.067 -9.272 -7.628 1.00 0.00
ATOM 7 H4' G5 1 10.847 -10.015 -7.448 1.00 0.00
ATOM 8 O4' G5 1 10.700 -8.053 -7.990 1.00 0.00
ATOM 9 C1' G5 1 10.866 -7.262 -6.821 1.00 0.00
ATOM 10 H1' G5 1 11.907 -6.970 -6.696 1.00 0.00
ATOM 11 N9 G5 1 10.027 -6.048 -6.896 1.00 0.00

An awk one-liner:
n=$(awk '$3 == "O5'\''" {print $2; quit}' file)
echo $n
prints
2

How to replace a multiple columns with others using bash?

I have a text file that contains data arranged in columns, and I need to replace some columns with others, and to be specific, xyz coordinates. What I'm looking for is described in the image below.(replace the red rectangle number 1 with the green rectangle number 2).
HETATM 1 C LIG 1 -0.517 1.592 -0.048 1.00 0.00 0.212 A
HETATM 2 C LIG 1 0.017 -0.536 0.534 1.00 0.00 0.149 A
HETATM 3 C LIG 1 1.133 0.155 0.029 1.00 0.00 0.212 A
HETATM 4 N LIG 1 -1.027 0.379 0.499 1.00 0.00 -0.337 N
HETATM 5 N LIG 1 0.789 1.466 -0.324 1.00 0.00 -0.219 NA
HETATM 6 C LIG 1 -2.429 0.112 0.889 1.00 0.00 0.221 C
HETATM 7 C LIG 1 -3.179 -0.453 -0.210 1.00 0.00 -0.097 C
HETATM 8 C LIG 1 -3.805 -0.925 -1.124 1.00 0.00 0.014 C
HETATM 9 N LIG 1 2.482 -0.388 -0.118 1.00 0.00 -0.095 N
HETATM 10 O LIG 1 2.619 -1.549 0.253 1.00 0.00 -0.530 OA
HETATM 11 O LIG 1 3.362 0.305 -0.578 1.00 0.00 -0.530 OA
ATOM 1 C LIG 1 -13.469 13.704 72.248 -0.37 -0.04 +0.212 75.145
ATOM 2 C LIG 1 -14.243 15.824 72.493 -0.41 -0.03 +0.149 75.145
ATOM 3 C LIG 1 -15.124 15.039 71.727 -0.40 -0.04 +0.212 75.145
ATOM 4 N LIG 1 -13.200 14.974 72.836 -0.28 +0.06 -0.337 75.145
ATOM 5 N LIG 1 -14.635 13.735 71.586 -0.32 +0.05 -0.219 75.145
ATOM 6 C LIG 1 -11.994 15.348 73.608 -0.46 -0.02 +0.221 75.145
ATOM 7 C LIG 1 -12.341 15.781 74.943 -0.66 +0.01 -0.097 75.145
ATOM 8 C LIG 1 -12.628 16.141 76.055 -0.66 -0.00 +0.014 75.145
ATOM 9 N LIG 1 -16.387 15.490 71.145 -0.60 +0.01 -0.095 75.145
ATOM 10 O LIG 1 -17.127 14.595 70.751 -0.10 +0.02 -0.530 75.145
ATOM 11 O LIG 1 -16.631 16.674 71.082 -0.58 -0.08 -0.530 75.145

Assuming that files have the same length, you could merge them with paste. and then extract columns in the desired order:
paste file1.txt file2.txt|awk '{print $1, $2, $3, $4, $5, $18, $19, $20, $9, $10, $11, $12}'

It's not clear if you are trying to align the rows contextually, but if you are literally just wanting to replace columns 6, 7, and 8 with the columns from the same row in the other file, you can just do something like:
$ cat file1
HETATM 1 C LIG 1 -0.517 1.592 -0.048 1.00 0.00 0.212 A
HETATM 2 C LIG 1 0.017 -0.536 0.534 1.00 0.00 0.149 A
HETATM 3 C LIG 1 1.133 0.155 0.029 1.00 0.00 0.212 A
HETATM 4 N LIG 1 -1.027 0.379 0.499 1.00 0.00 -0.337 N
HETATM 5 N LIG 1 0.789 1.466 -0.324 1.00 0.00 -0.219 NA
HETATM 6 C LIG 1 -2.429 0.112 0.889 1.00 0.00 0.221 C
HETATM 7 C LIG 1 -3.179 -0.453 -0.210 1.00 0.00 -0.097 C
HETATM 8 C LIG 1 -3.805 -0.925 -1.124 1.00 0.00 0.014 C
HETATM 9 N LIG 1 2.482 -0.388 -0.118 1.00 0.00 -0.095 N
HETATM 10 O LIG 1 2.619 -1.549 0.253 1.00 0.00 -0.530 OA
HETATM 11 O LIG 1 3.362 0.305 -0.578 1.00 0.00 -0.530 OA
$ cat file2
ATOM 1 C LIG 1 -13.469 13.704 72.248 -0.37 -0.04 +0.212 75.145
ATOM 2 C LIG 1 -14.243 15.824 72.493 -0.41 -0.03 +0.149 75.145
ATOM 3 C LIG 1 -15.124 15.039 71.727 -0.40 -0.04 +0.212 75.145
ATOM 4 N LIG 1 -13.200 14.974 72.836 -0.28 +0.06 -0.337 75.145
ATOM 5 N LIG 1 -14.635 13.735 71.586 -0.32 +0.05 -0.219 75.145
ATOM 6 C LIG 1 -11.994 15.348 73.608 -0.46 -0.02 +0.221 75.145
ATOM 7 C LIG 1 -12.341 15.781 74.943 -0.66 +0.01 -0.097 75.145
ATOM 8 C LIG 1 -12.628 16.141 76.055 -0.66 -0.00 +0.014 75.145
ATOM 9 N LIG 1 -16.387 15.490 71.145 -0.60 +0.01 -0.095 75.145
ATOM 10 O LIG 1 -17.127 14.595 70.751 -0.10 +0.02 -0.530 75.145
ATOM 11 O LIG 1 -16.631 16.674 71.082 -0.58 -0.08 -0.530 75.145
$ awk '{getline s < "file2"; split(s, a); $6 = a[6]; $7 = a[7]; $8 = a[8]}1' file1
HETATM 1 C LIG 1 -13.469 13.704 72.248 1.00 0.00 0.212 A
HETATM 2 C LIG 1 -14.243 15.824 72.493 1.00 0.00 0.149 A
HETATM 3 C LIG 1 -15.124 15.039 71.727 1.00 0.00 0.212 A
HETATM 4 N LIG 1 -13.200 14.974 72.836 1.00 0.00 -0.337 N
HETATM 5 N LIG 1 -14.635 13.735 71.586 1.00 0.00 -0.219 NA
HETATM 6 C LIG 1 -11.994 15.348 73.608 1.00 0.00 0.221 C
HETATM 7 C LIG 1 -12.341 15.781 74.943 1.00 0.00 -0.097 C
HETATM 8 C LIG 1 -12.628 16.141 76.055 1.00 0.00 0.014 C
HETATM 9 N LIG 1 -16.387 15.490 71.145 1.00 0.00 -0.095 N
HETATM 10 O LIG 1 -17.127 14.595 70.751 1.00 0.00 -0.530 OA
HETATM 11 O LIG 1 -16.631 16.674 71.082 1.00 0.00 -0.530 OA

delete rows after specific character | awk

I am writing a Bash script and,
I need to remove all lines in between TER, including 'TER's
Input File :
ATOM 186 O3' U 6 7.297 6.145 -5.250 1.00 0.00 O
ATOM 187 HO3' U 6 7.342 5.410 -5.865 1.00 0.00 H
TER
ATOM 1 HO5' A 1 3.429 -7.861 3.641 1.00 0.00 H
ATOM 2 O5' A 1 4.232 -7.360 3.480 1.00 0.00 O
ATOM 3 C5' A 1 5.480 -8.064 3.350 1.00 0.00 C
ATOM 4 H5' A 1 5.429 -8.766 2.518 1.00 0.00 H
TER
Expected output:
ATOM 186 O3' U 6 7.297 6.145 -5.250 1.00 0.00 O
ATOM 187 HO3' U 6 7.342 5.410 -5.865 1.00 0.00 H
I found
sed '/TER/,$d' ${myArray[j]}.txt >> ${MyArray[j]}.txt ### ${MyArray[j]} file name through an array
But this does not work, I think awk will work with Bash Script. help Thanks

You can just use sed like this:
sed -i.bak '/^TER/,/^TER/d' "${myArray[j]}.txt"
cat "${myArray[j]}.txt"
ATOM 186 O3' U 6 7.297 6.145 -5.250 1.00 0.00 O
ATOM 187 HO3' U 6 7.342 5.410 -5.865 1.00 0.00 H

sed '/TER/,/TER/d'
echo
"ATOM 186 O3' U 6 7.297 6.145 -5.250 1.00 0.00 O
ATOM 187 HO3' U 6 7.342 5.410 -5.865 1.00 0.00 H
TER
ATOM 1 HO5' A 1 3.429 -7.861 3.641 1.00 0.00 H
ATOM 2 O5' A 1 4.232 -7.360 3.480 1.00 0.00 O
ATOM 3 C5' A 1 5.480 -8.064 3.350 1.00 0.00 C
ATOM 4 H5' A 1 5.429 -8.766 2.518 1.00 0.00 H
TER" |sed '/TER/,/TER/d'
######################################################################################
ATOM 186 O3' U 6 7.297 6.145 -5.250 1.00 0.00 O
ATOM 187 HO3' U 6 7.342 5.410 -5.865 1.00 0.00 H
sed '/Start Pattern/,/End Pattern/d'

It can be done like this
sed '/TER/,$d' ${myArray[j]}.txt > tmp.txt #note only one " > "
mv tmp.txt ${myArray[j]}.txt

awk also provides a simple solution using a flag to control printing. Below the skip variable is used as a flag. If 1 the lines are skipped, on the transition from 1 to 0, the script exits.
awk -v skip=0 '$1=="TER"{skip=skip?1:0; if (!skip)exit}1' file
Above $1=="TER" is used to match lines (records) where the first field is TER (this disambiguates between "TER" and "TERMINAL", etc...) Within the rule, the ternary skip=skip?1:0 sets skip=1 the first time "TER" is encountered and to 0 on the next. If skip==0 the script exits. The 1 at the end is just shorthand for print.
Example Use/Output
Using your data in file, you would get:
$ awk -v skip=0 '$1=="TER"{skip=skip?1:0; if (!skip)exit}1' file
ATOM 186 O3' U 6 7.297 6.145 -5.250 1.00 0.00 O
ATOM 187 HO3' U 6 7.342 5.410 -5.865 1.00 0.00 H

remove space from specific column by bash

I am new of bash command and I really appreciate your help.
I have a file like this
ATOM 1 N LYS P1852 10.932 0.523 -24.701 1.00 0.00
ATOM 2 HN1 LYS P1852 11.571 0.864 -25.419 1.00 0.00
ATOM 3 HN2 LYS P1852 10.431 1.305 -24.278 1.00 0.00
ATOM 4 HN3 LYS P1852 10.154 0.023 -25.132 1.00 0.00
ATOM 5 CA LYS P1852 11.556 -0.319 -23.640 1.00 0.00
and I need to remove space from specific position (position 30 let say) for all the lines. The output has to be as follow:
ATOM 1 N LYS P1852 10.932 0.523 -24.701 1.00 0.00
ATOM 2 HN1 LYS P1852 11.571 0.864 -25.419 1.00 0.00
ATOM 3 HN2 LYS P1852 10.431 1.305 -24.278 1.00 0.00
ATOM 4 HN3 LYS P1852 10.154 0.023 -25.132 1.00 0.00
ATOM 5 CA LYS P1852 11.556 -0.319 -23.640 1.00 0.00
I was trying sed and other commands but no solution until now has worked.
Thanks you

You can use cut:
cut --complement -c 30 input.txt
From the manual:
-c, --characters=LIST
select only these characters
--complement
complement the set of selected bytes, characters or fields
--complement is GNU cut specific, if that is not available:
cut -c -29,31- input.txt
Above commands remove any character at position 30. If you only want to remove space:
sed -E 's/^(.{29}) /\1/' input.txt

Searching patterns within txt file with post-processing

I have a long txt files consisted of
ATOM 5010 HD13 LEU 301 0.158 20.865 10.630 1.00 0.00 PROA
ATOM 5011 CD2 LEU 301 1.684 22.404 12.349 1.00 1.00 PROA
ATOM 5012 HD21 LEU 301 2.233 22.501 13.310 1.00 0.00 PROA
ATOM 5013 HD22 LEU 301 1.584 23.412 11.894 1.00 0.00 PROA
ATOM 5014 HD23 LEU 301 2.267 21.744 11.672 1.00 0.00 PROA
ATOM 5015 C LEU 301 -0.687 23.995 15.639 1.00 0.00 PROA
ATOM 5016 O LEU 301 -1.791 24.341 15.139 1.00 0.00 PROA
ATOM 5017 NT LEU 301 -0.211 24.391 16.849 1.00 1.00 PROA
ATOM 5018 HT1 LEU 301 0.679 24.065 17.168 1.00 0.00 PROA
ATOM 5019 HT2 LEU 301 -0.752 25.007 17.422 1.00 0.00 PROA
ATOM 5020 SOD SOD 302 1.519 2.284 1.361 1.00 0.00 HETA
From this file I need to copy the string where the third column = SOD
ATOM 5020 SOD SOD 302 1.519 2.284 1.361 1.00 0.00 HETA
And past it to the separate txt file sod.txt (it should consist only one line which is equl to the original)
I could use a solution via combination of awk and sed commands!

You can use the sed write(w command):
sed '/\([^ \t]*\)\{2\}SOD/!d; w outputfile' file

In awk:
$ awk '$3=="SOD"' file # > new_file # uncomment to write to a new file
ATOM 5020 SOD SOD 302 1.519 2.284 1.361 1.00 0.00 HETA

You can try this
awk '{if ($3 == "SOD") print $0;}' input.txt >sod.txt

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio

Grep not parsing the whole file - bash

Related

How do I return a varying number as a variable in a string found in another file that otherwise stays constant (BASH)?

How to replace a multiple columns with others using bash?

delete rows after specific character | awk

remove space from specific column by bash

Searching patterns within txt file with post-processing

Categories

Resources