Delete duplicate lines through pattern in bash

I have to post-process some txt files that contain runs of duplicated "TER" lines, e.g.
ATOM 47047 H1 WAT 11303 -32.626 -35.728 -30.283 1.00 0.00
ATOM 47048 H2 WAT 11303 -33.975 -35.757 -30.969 1.00 0.00
TER
TER
TER
TER
TER
ATOM 47052 O WAT 11305 -38.279 -35.930 -33.162 1.00 0.00
ATOM 47053 H1 WAT 11305 -37.860 -35.087 -33.334 1.00 0.00
ATOM 47054 H2 WAT 11305 -39.198 -35.793 -33.391 1.00 0.00
TER
TER
ATOM 47055 O WAT 11306 -35.943 -38.199 -31.778 1.00 0.00
ATOM 47056 H1 WAT 11306 -35.823 -38.794 -31.039 1.00 0.00
ATOM 47057 H2 WAT 11306 -35.083 -38.162 -32.198 1.00 0.00
TER
ATOM 47058 O WAT 11307 -33.604 -37.645 -33.202 1.00 0.00
ATOM 47059 H1 WAT 11307 -34.130 -37.121 -33.805 1.00 0.00
ATOM 47060 H2 WAT 11307 -33.261 -37.012 -32.571 1.00 0.00
TER
TER
TER
ATOM 47061 O WAT 11308 -40.428 -29.625 -32.046 1.00 0.00
ATOM 47062 H1 WAT 11308 -40.966 -28.900 -32.365 1.00 0.00
ATOM 47063 H2 WAT 11308 -40.175 -30.102 -32.837 1.00 0.00
TER
In this log I would like to remove the TER lines that are repeated more than once in a row, keeping only the first TER of each run. E.g.
ATOM 47047 H1 WAT 11303 -32.626 -35.728 -30.283 1.00 0.00
ATOM 47048 H2 WAT 11303 -33.975 -35.757 -30.969 1.00 0.00
TER
ATOM 47052 O WAT 11305 -38.279 -35.930 -33.162 1.00 0.00
ATOM 47053 H1 WAT 11305 -37.860 -35.087 -33.334 1.00 0.00
ATOM 47054 H2 WAT 11305 -39.198 -35.793 -33.391 1.00 0.00
TER
ATOM 47055 O WAT 11306 -35.943 -38.199 -31.778 1.00 0.00
ATOM 47056 H1 WAT 11306 -35.823 -38.794 -31.039 1.00 0.00
ATOM 47057 H2 WAT 11306 -35.083 -38.162 -32.198 1.00 0.00
TER
ATOM 47058 O WAT 11307 -33.604 -37.645 -33.202 1.00 0.00
ATOM 47059 H1 WAT 11307 -34.130 -37.121 -33.805 1.00 0.00
ATOM 47060 H2 WAT 11307 -33.261 -37.012 -32.571 1.00 0.00
TER
ATOM 47061 O WAT 11308 -40.428 -29.625 -32.046 1.00 0.00
ATOM 47062 H1 WAT 11308 -40.966 -28.900 -32.365 1.00 0.00
ATOM 47063 H2 WAT 11308 -40.175 -30.102 -32.837 1.00 0.00
TER
I would be grateful for any solution using bash commands like sed, grep or awk.

Check with
uniq -d
whether the TER lines are the only consecutive duplicates in the file; if they are, plain
uniq
removes the duplicated TER lines, keeping only the first of each run.
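A minimal sketch of that check, assuming the data is in file (as in the other answers) and dedup.txt is just a placeholder output name:
uniq -d file              # list lines that occur more than once in a row (should show only TER)
uniq file > dedup.txt     # collapse each run of identical consecutive lines to a single line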

Short sed solution:
sed '$!N;/TER\nTER/!P;D;' file
$!N - on every line except the last ($), append the next line to the pattern space, so each pair of consecutive lines can be examined
/TER\nTER/!P;D - P prints the first line of the pattern space unless both lines are TER; D then deletes that first line and restarts the cycle with whatever remains
The output:
ATOM 47047 H1 WAT 11303 -32.626 -35.728 -30.283 1.00 0.00
ATOM 47048 H2 WAT 11303 -33.975 -35.757 -30.969 1.00 0.00
TER
ATOM 47052 O WAT 11305 -38.279 -35.930 -33.162 1.00 0.00
ATOM 47053 H1 WAT 11305 -37.860 -35.087 -33.334 1.00 0.00
ATOM 47054 H2 WAT 11305 -39.198 -35.793 -33.391 1.00 0.00
TER
ATOM 47055 O WAT 11306 -35.943 -38.199 -31.778 1.00 0.00
ATOM 47056 H1 WAT 11306 -35.823 -38.794 -31.039 1.00 0.00
ATOM 47057 H2 WAT 11306 -35.083 -38.162 -32.198 1.00 0.00
TER
ATOM 47058 O WAT 11307 -33.604 -37.645 -33.202 1.00 0.00
ATOM 47059 H1 WAT 11307 -34.130 -37.121 -33.805 1.00 0.00
ATOM 47060 H2 WAT 11307 -33.261 -37.012 -32.571 1.00 0.00
TER
ATOM 47061 O WAT 11308 -40.428 -29.625 -32.046 1.00 0.00
ATOM 47062 H1 WAT 11308 -40.966 -28.900 -32.365 1.00 0.00
ATOM 47063 H2 WAT 11308 -40.175 -30.102 -32.837 1.00 0.00
TER

try sed:
sed '/^TER/{N;/\nTER\s*$/D}' urfile
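Note that \s is a GNU sed extension; a sketch of the same command with a POSIX character class, in case a non-GNU sed is in use:
sed '/^TER/{N;/\nTER[[:space:]]*$/D;}' urfile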

Keep track of the previous line and don't print the line when it's TER, if the previous one was also "TER":
awk '!/^TER$/ || prev != "TER" { print } { prev = $0 }' file
You can skip the explicit { print } block too, since that's the default action:
awk '!/^TER$/ || prev != "TER"; { prev = $0 }' file
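If it is acceptable to collapse every run of identical consecutive lines (not only TER), the same idea shrinks to a uniq-like one-liner; a sketch under that assumption:
awk '$0 != prev; { prev = $0 }' file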

Here is another awk version:
awk '/^TER/{ c++; if ( c == 1 ){ print }}/^ATOM/{ print; c = 0 }' file

awk '!/^TER/{c=1}c; /^TER/{c=0}' file
c is used as a flag that decides whether a line is printed.
For every non-TER line, the flag is set and the line is printed.
When the first TER of a run is met, c is still set, so that TER is printed; c is then cleared.
The following consecutive TER lines are not printed because c stays cleared.

$ awk '$1=="TER" && p=="TER"{next} {print; p=$1}' file
ATOM 47047 H1 WAT 11303 -32.626 -35.728 -30.283 1.00 0.00
ATOM 47048 H2 WAT 11303 -33.975 -35.757 -30.969 1.00 0.00
TER
ATOM 47052 O WAT 11305 -38.279 -35.930 -33.162 1.00 0.00
ATOM 47053 H1 WAT 11305 -37.860 -35.087 -33.334 1.00 0.00
ATOM 47054 H2 WAT 11305 -39.198 -35.793 -33.391 1.00 0.00
TER
ATOM 47055 O WAT 11306 -35.943 -38.199 -31.778 1.00 0.00
ATOM 47056 H1 WAT 11306 -35.823 -38.794 -31.039 1.00 0.00
ATOM 47057 H2 WAT 11306 -35.083 -38.162 -32.198 1.00 0.00
TER
ATOM 47058 O WAT 11307 -33.604 -37.645 -33.202 1.00 0.00
ATOM 47059 H1 WAT 11307 -34.130 -37.121 -33.805 1.00 0.00
ATOM 47060 H2 WAT 11307 -33.261 -37.012 -32.571 1.00 0.00
TER
ATOM 47061 O WAT 11308 -40.428 -29.625 -32.046 1.00 0.00
ATOM 47062 H1 WAT 11308 -40.966 -28.900 -32.365 1.00 0.00
ATOM 47063 H2 WAT 11308 -40.175 -30.102 -32.837 1.00 0.00
TER


delete rows after specific character | awk

I am writing a Bash script and I need to remove all the lines between the TER lines, including the TER lines themselves.
Input file:
ATOM 186 O3' U 6 7.297 6.145 -5.250 1.00 0.00 O
ATOM 187 HO3' U 6 7.342 5.410 -5.865 1.00 0.00 H
TER
ATOM 1 HO5' A 1 3.429 -7.861 3.641 1.00 0.00 H
ATOM 2 O5' A 1 4.232 -7.360 3.480 1.00 0.00 O
ATOM 3 C5' A 1 5.480 -8.064 3.350 1.00 0.00 C
ATOM 4 H5' A 1 5.429 -8.766 2.518 1.00 0.00 H
TER
Expected output:
ATOM 186 O3' U 6 7.297 6.145 -5.250 1.00 0.00 O
ATOM 187 HO3' U 6 7.342 5.410 -5.865 1.00 0.00 H
I found
sed '/TER/,$d' ${myArray[j]}.txt >> ${MyArray[j]}.txt ### ${MyArray[j]} file name through an array
But this does not work. I think awk might work better in a Bash script. Thanks for any help.
You can just use sed like this:
sed -i.bak '/^TER/,/^TER/d' "${myArray[j]}.txt"
cat "${myArray[j]}.txt"
ATOM 186 O3' U 6 7.297 6.145 -5.250 1.00 0.00 O
ATOM 187 HO3' U 6 7.342 5.410 -5.865 1.00 0.00 H
sed '/TER/,/TER/d'
echo "ATOM 186 O3' U 6 7.297 6.145 -5.250 1.00 0.00 O
ATOM 187 HO3' U 6 7.342 5.410 -5.865 1.00 0.00 H
TER
ATOM 1 HO5' A 1 3.429 -7.861 3.641 1.00 0.00 H
ATOM 2 O5' A 1 4.232 -7.360 3.480 1.00 0.00 O
ATOM 3 C5' A 1 5.480 -8.064 3.350 1.00 0.00 C
ATOM 4 H5' A 1 5.429 -8.766 2.518 1.00 0.00 H
TER" |sed '/TER/,/TER/d'
The output:
ATOM 186 O3' U 6 7.297 6.145 -5.250 1.00 0.00 O
ATOM 187 HO3' U 6 7.342 5.410 -5.865 1.00 0.00 H
sed '/Start Pattern/,/End Pattern/d'
It can also be done like this (note the single >, not >>, and the temporary file so the input is not overwritten while it is still being read):
sed '/TER/,$d' "${myArray[j]}.txt" > tmp.txt
mv tmp.txt "${myArray[j]}.txt"
awk also provides a simple solution using a variable as a flag to control when to stop. Below, skip is initialized to 0 and tested on every record whose first field is TER; when the test finds skip still 0, the script exits, so only the lines before the first TER are printed.
awk -v skip=0 '$1=="TER"{skip=skip?1:0; if (!skip)exit}1' file
Above, $1=="TER" matches records whose first field is exactly TER (this disambiguates between "TER" and "TERMINAL", etc.). Within the rule, the ternary skip=skip?1:0 leaves skip at 0, since skip was initialized to 0, so if (!skip) exit stops the script at the first TER before that line is printed. The 1 at the end is just awk shorthand for print.
Example Use/Output
Using your data in file, you would get:
$ awk -v skip=0 '$1=="TER"{skip=skip?1:0; if (!skip)exit}1' file
ATOM 186 O3' U 6 7.297 6.145 -5.250 1.00 0.00 O
ATOM 187 HO3' U 6 7.342 5.410 -5.865 1.00 0.00 H
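Since skip never actually changes before the exit, an equivalent and shorter sketch (same file name, same assumptions) simply stops at the first TER record:
awk '$1=="TER"{exit}1' file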

remove space from specific column by bash

I am new to bash commands and I would really appreciate your help.
I have a file like this:
ATOM 1 N LYS P1852 10.932 0.523 -24.701 1.00 0.00
ATOM 2 HN1 LYS P1852 11.571 0.864 -25.419 1.00 0.00
ATOM 3 HN2 LYS P1852 10.431 1.305 -24.278 1.00 0.00
ATOM 4 HN3 LYS P1852 10.154 0.023 -25.132 1.00 0.00
ATOM 5 CA LYS P1852 11.556 -0.319 -23.640 1.00 0.00
and I need to remove a space at a specific position (say position 30) on every line. The output has to be as follows:
ATOM 1 N LYS P1852 10.932 0.523 -24.701 1.00 0.00
ATOM 2 HN1 LYS P1852 11.571 0.864 -25.419 1.00 0.00
ATOM 3 HN2 LYS P1852 10.431 1.305 -24.278 1.00 0.00
ATOM 4 HN3 LYS P1852 10.154 0.023 -25.132 1.00 0.00
ATOM 5 CA LYS P1852 11.556 -0.319 -23.640 1.00 0.00
I have tried sed and other commands, but no solution has worked so far.
Thank you
You can use cut:
cut --complement -c 30 input.txt
From the manual:
-c, --characters=LIST
select only these characters
--complement
complement the set of selected bytes, characters or fields
--complement is specific to GNU cut; if that is not available:
cut -c -29,31- input.txt
The commands above remove whichever character is at position 30. If you only want to remove it when it is a space:
sed -E 's/^(.{29}) /\1/' input.txt
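For completeness, a comparable awk sketch that drops whatever character sits at position 30 (substr is standard awk; input.txt as above):
awk '{ print substr($0, 1, 29) substr($0, 31) }' input.txt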

Grep not parsing the whole file

I want to use grep to pick the lines not containing "WAT" from a file of 425409 lines, 26.8 MB in size, UTF-8 encoded.
The file looks like this
ATOM 1 N ALA 1 9.979 -15.619 28.204 1.00 0.00
ATOM 2 H1 ALA 1 9.594 -15.053 28.938 1.00 0.00
ATOM 3 H2 ALA 1 9.558 -15.358 27.323 1.00 0.00
ATOM 12 O ALA 1 7.428 -16.246 28.335 1.00 0.00
ATOM 13 N HID 2 7.563 -18.429 28.562 1.00 0.00
ATOM 14 H HID 2 6.557 -18.369 28.638 1.00 0.00
ATOM 15 CA HID 2 8.082 -19.800 28.535 1.00 0.00
ATOM 24 HE1 HID 2 8.603 -23.670 33.041 1.00 0.00
ATOM 25 NE2 HID 2 8.012 -23.749 30.962 1.00 0.00
ATOM 29 O HID 2 5.854 -20.687 28.537 1.00 0.00
ATOM 30 N GLN 3 7.209 -21.407 26.887 1.00 0.00
ATOM 31 H GLN 3 8.168 -21.419 26.566 1.00 0.00
ATOM 32 CA GLN 3 6.271 -22.274 26.157 1.00 0.00
**16443 lines**
ATOM 16425 C116 PA 1089 -34.635 6.968 -0.185 1.00 0.00
ATOM 16426 H16R PA 1089 -35.669 7.267 -0.368 1.00 0.00
ATOM 16427 H16S PA 1089 -34.579 5.878 -0.218 1.00 0.00
ATOM 16428 H16T PA 1089 -34.016 7.366 -0.990 1.00 0.00
ATOM 16429 C115 PA 1089 -34.144 7.493 1.177 1.00 0.00
ATOM 16430 H15R PA 1089 -33.101 7.198 1.305 1.00 0.00
ATOM 16431 H15S PA 1089 -34.179 8.585 1.197 1.00 0.00
ATOM 16432 C114 PA 1089 -34.971 6.910 2.342 1.00 0.00
ATOM 16433 H14R PA 1089 -35.147 5.847 2.166 1.00 0.00
**132284 lines**
ATOM 60981 O WAT 7952 -46.056 -5.515 -56.245 1.00 0.00
ATOM 60982 H1 WAT 7952 -45.185 -5.238 -56.602 1.00 0.00
ATOM 60983 H2 WAT 7952 -46.081 -6.445 -56.561 1.00 0.00
TER
ATOM 60984 O WAT 7953 -51.005 -3.205 -46.712 1.00 0.00
ATOM 60985 H1 WAT 7953 -51.172 -3.159 -47.682 1.00 0.00
ATOM 60986 H2 WAT 7953 -51.051 -4.177 -46.579 1.00 0.00
TER
ATOM 60987 O WAT 7954 -49.804 -0.759 -49.284 1.00 0.00
ATOM 60988 H1 WAT 7954 -48.962 -0.677 -49.785 1.00 0.00
ATOM 60989 H2 WAT 7954 -49.868 0.138 -48.903 1.00 0.00
**many lines until the end**
TER
END
I have used grep -v 'WAT' file.txt but it only returned the first 16179 lines not containing "WAT", and I can see that there are more lines not containing "WAT". For instance, the following line (and many others) does not appear in the output:
ATOM 16425 C116 PA 1089 -34.635 6.968 -0.185 1.00 0.00
In order to figure out what was happening I tried grep ' ' file.txt. This command should return every line in the file, but it only returned the first 16179 lines too.
I have also tried tail -408977 file.txt | grep ' ' and it returned all the lines selected by tail. Then I tried tail -408978 file.txt | grep ' ' and the output was totally empty, zero lines.
I am working on a "normal" 64 bit system, Kubuntu.
Thanks a lot for the help!
When I try I get
$: grep WAT file.txt
Binary file file.txt matches
grep is assuming it is a binary file; add -a:
-a, --text equivalent to --binary-files=text
$: grep -a WAT file.txt|head -3
ATOM 29305 O WAT 4060 -75.787 -79.125 25.925 1.00 0.00 O
ATOM 29306 H1 WAT 4060 -76.191 -78.230 25.936 1.00 0.00 H
ATOM 29307 H2 WAT 4060 -76.556 -79.670 25.684 1.00 0.00 H
Your file has 2 NULLs each at the end of lines 16426, 16428, 16430, and 16432.
$: tr "\0" '#' < file.txt | grep -n '#'
16426:ATOM 16421 KA CAL 1085 -20.614 -22.960 18.641 1.00 0.00 ##
16428:ATOM 16422 KA CAL 1086 20.249 21.546 19.443 1.00 0.00 ##
16430:ATOM 16423 KA CAL 1087 22.695 -19.700 19.624 1.00 0.00 ##
16432:ATOM 16424 KA CAL 1088 -22.147 19.317 17.966 1.00 0.00 ##
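If you would rather strip the stray NUL bytes out of the file than work around them with -a, a sketch using tr (file_clean.txt is just a placeholder name):
tr -d '\0' < file.txt > file_clean.txt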

Searching patterns within txt file with post-processing

I have a long txt file consisting of lines like
ATOM 5010 HD13 LEU 301 0.158 20.865 10.630 1.00 0.00 PROA
ATOM 5011 CD2 LEU 301 1.684 22.404 12.349 1.00 1.00 PROA
ATOM 5012 HD21 LEU 301 2.233 22.501 13.310 1.00 0.00 PROA
ATOM 5013 HD22 LEU 301 1.584 23.412 11.894 1.00 0.00 PROA
ATOM 5014 HD23 LEU 301 2.267 21.744 11.672 1.00 0.00 PROA
ATOM 5015 C LEU 301 -0.687 23.995 15.639 1.00 0.00 PROA
ATOM 5016 O LEU 301 -1.791 24.341 15.139 1.00 0.00 PROA
ATOM 5017 NT LEU 301 -0.211 24.391 16.849 1.00 1.00 PROA
ATOM 5018 HT1 LEU 301 0.679 24.065 17.168 1.00 0.00 PROA
ATOM 5019 HT2 LEU 301 -0.752 25.007 17.422 1.00 0.00 PROA
ATOM 5020 SOD SOD 302 1.519 2.284 1.361 1.00 0.00 HETA
From this file I need to copy the lines where the third column equals SOD
ATOM 5020 SOD SOD 302 1.519 2.284 1.361 1.00 0.00 HETA
and paste them into a separate txt file, sod.txt (here it should contain only one line, identical to the original).
A solution using a combination of awk and sed commands would work for me!
You can use the sed write command (w) to keep only the lines whose third whitespace-separated field is SOD and write them to the output file in one pass:
sed '/^\([^ \t]\{1,\}[ \t]\{1,\}\)\{2\}SOD[ \t]/!d; w sod.txt' file
In awk:
$ awk '$3=="SOD"' file # > new_file # uncomment to write to a new file
ATOM 5020 SOD SOD 302 1.519 2.284 1.361 1.00 0.00 HETA
You can try this
awk '{if ($3 == "SOD") print $0;}' input.txt >sod.txt
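awk can also write the matching lines to sod.txt itself, without a shell redirection; a small sketch of that variant:
awk '$3 == "SOD" { print > "sod.txt" }' input.txt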

Vowpal Wabbit varinfo and ngrams: non-existent combinations

I'm trying to use vw to find words or phrases that predict if someone will open an email. The target is 1 if they opened the email and 0 otherwise. My data looks like this:
1 |A this is a test
0 |A this test is only temporary
1 |A i bought a new polo shirt
1 |A that was a great online sale
I put it into a file called 'test1.txt' and run the following command to generate ngrams of 2 and also output variable information:
C:\~\vw>perl vw-varinfo.pl -V --ngram 2 test1.txt >> out.txt
When I look at the output there are bigrams that I don't see in the original data. Is this a bug, or am I misunderstanding something?
Output:
FeatureName HashVal MinVal MaxVal Weight RelScore
A^a 239656 0.00 1.00 +0.1664 100.00%
A^is 7514 0.00 1.00 +0.0772 46.38%
A^test 12331 0.00 1.00 +0.0772 46.38%
A^this 169573 0.00 1.00 +0.0772 46.38%
A^bought 245782 0.00 1.00 +0.0650 39.06%
A^i 245469 0.00 1.00 +0.0650 39.06%
A^new 51974 0.00 1.00 +0.0650 39.06%
A^polo 48680 0.00 1.00 +0.0650 39.06%
A^shirt 73882 0.00 1.00 +0.0650 39.06%
A^great 220692 0.00 1.00 +0.0610 36.64%
A^online 147727 0.00 1.00 +0.0610 36.64%
A^sale 242707 0.00 1.00 +0.0610 36.64%
A^that 206586 0.00 1.00 +0.0610 36.64%
A^was 223274 0.00 1.00 +0.0610 36.64%
A^a^bought 216990 0.00 0.00 +0.0000 0.00%
A^bought^great 7122 0.00 0.00 +0.0000 0.00%
A^great^i 190625 0.00 0.00 +0.0000 0.00%
A^i^is 76227 0.00 0.00 +0.0000 0.00%
A^is^new 140536 0.00 0.00 +0.0000 0.00%
A^new^online 69117 0.00 0.00 +0.0000 0.00%
A^online^only 173498 0.00 0.00 +0.0000 0.00%
A^only^polo 51059 0.00 0.00 +0.0000 0.00%
A^polo^sale 131483 0.00 0.00 +0.0000 0.00%
A^sale^shirt 191329 0.00 0.00 +0.0000 0.00%
A^shirt^temporary 81555 0.00 0.00 +0.0000 0.00%
A^temporary^test 90632 0.00 0.00 +0.0000 0.00%
A^test^that 13689 0.00 0.00 +0.0000 0.00%
A^that^this 127863 0.00 0.00 +0.0000 0.00%
A^this^was 22011 0.00 0.00 +0.0000 0.00%
Constant 116060 0.00 0.00 +0.1465 0.00%
A^only 62951 0.00 1.00 -0.0490 -29.47%
A^temporary 44641 0.00 1.00 -0.0490 -29.47%
For instance, ^bought^great never actually occurs in any of the original input rows. Am I doing something wrong?
It is a bug in vw-varinfo.
This can be verified by running vw alone with --invert_hash:
$ vw --ngram 2 test1.txt --invert_hash train.ih
$ grep 'bought^great' train.ih
# no output
The quick partial work-around is to treat all features with a weight of 0.0 as highly suspect, and probably bogus. Unfortunately, there are some features that are missing too because vw-varinfo knows nothing about --ngram.
I really need to rewrite vw-varinfo. vw has changed a lot since vw-varinfo was written, and vw-varinfo was written sub-optimally, repeating a lot of the cross-feature logic that already exists in vw itself. The new implementation I have in mind should be significantly more efficient and less vulnerable to these kinds of bugs.
This project was put on hold due to more urgent work. I hope to find some time to correct this sometime this year.
Unrelated tip: since you're doing binary classification, you should use labels in {-1, 1} rather than in {0,1} and use --loss_function logistic for best results.
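A sketch of that suggested setup, assuming the labels in test1.txt are rewritten to {-1,1} first (test1_logistic.txt is just a placeholder name):
sed 's/^0 /-1 /' test1.txt > test1_logistic.txt   # relabel 0 -> -1, leave 1 as is
vw --ngram 2 --loss_function logistic test1_logistic.txt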
