Bash - remove all lines beginning with 'P' - bash

I have a text file that's about 300KB in size. I want to remove all lines from this file that begin with the letter "P". This is what I've been using:
> cat file.txt | egrep -v P*
That doesn't output anything to the console. I can use cat on the file without any other commands and it prints out fine. My final intention is to run:
> cat file.txt | egrep -v P* > new.txt
No error appears, it just doesn't print anything out and if I run the 2nd command, new.txt is empty.
I should say I'm running Windows 7 with Cygwin installed.

Explanation
Use ^ to anchor your pattern to the beginning of the line;
delete lines matching the pattern using sed and the d command.
Solution #1
cat file.txt | sed '/^P/d'
Better solution
Use sed-only:
sed '/^P/d' file.txt > new.txt

With awk:
awk '!/^P/' file.txt
Explanation
The condition starts with an ! (negation), which negates the pattern that follows;
/^P/ means "match all lines starting with a capital P";
so the negated pattern means "skip lines starting with a capital P".
Finally, it relies on awk's behavior when the { … } action block is missing, which is to print every record that satisfies the condition.
To rephrase: it skips lines starting with a capital P and prints everything else.
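As a small sketch of the point above, the pattern-only form and the form with the action spelled out should behave the same. The sample input here is made up for illustration:

```shell
# Hypothetical sample input (not from the question).
input=$(printf '%s\n' 'Pear' 'Apple' 'Plum' 'pear')

# Pattern only: awk prints every record for which the condition is true.
short=$(printf '%s\n' "$input" | awk '!/^P/')

# The same filter with the implicit action written out.
explicit=$(printf '%s\n' "$input" | awk '!/^P/ { print $0 }')

printf '%s\n' "$short"
```

Note that the match is case-sensitive, so the lowercase "pear" line is kept.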
Note
sed is line-oriented and awk is column-oriented. For your case you should use the first one; see Edouard Lopez's response.

Use sed with in-place editing (this is the GNU sed syntax, which will also work for your Cygwin):
sed -i '/^P/d' file.txt
BSD (Mac) sed
sed -i '' '/^P/d' file.txt
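If you would rather not depend on which sed flavor is installed, a temporary file avoids -i altogether. This is only a sketch; the directory, file name and contents are assumptions for the demo:

```shell
# Hypothetical working file, created just for this demo.
workdir=$(mktemp -d)
file="$workdir/file.txt"
printf '%s\n' 'Pear' 'Apple' 'Plum' > "$file"

# Write the filtered output to a temporary file in the same directory,
# then replace the original only if sed succeeded.
tmp=$(mktemp "$workdir/tmp.XXXXXX")
sed '/^P/d' "$file" > "$tmp" && mv "$tmp" "$file"

cat "$file"
```

One thing to keep in mind: after mv, the file has the permissions of the temporary file, so you may need to restore them with chmod.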

Use start of line mark and quotes:
cat file.txt | egrep -v '^P.*'
P* means "P zero or more times", which matches every line (even a line with no P at all), so together with -v it gives you no lines
^P.* means start of line, then P, and any char zero or more times
Quoting is needed to prevent shell expansion.
This can be shortened to
egrep -v ^P file.txt
because .* is not needed; without the * there is nothing left for the shell to expand, so quoting is not needed either, and egrep can read the data from the file itself.
As we don't use extended regular expressions, grep will also work fine:
grep -v ^P file.txt
Finally
grep -v ^P file.txt > new.txt
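To see the difference between the two patterns for yourself, here is a minimal sketch on a throwaway file (the file and its contents are invented for the demo):

```shell
# Create a throwaway sample file (hypothetical contents).
tmpdir=$(mktemp -d)
printf '%s\n' 'Pear' 'Apple' 'Plum' 'grape Pie' > "$tmpdir/sample.txt"

# 'P*' matches zero or more P's, so it matches every line (an empty match
# counts), and -v leaves nothing. grep exits non-zero when it selects no
# lines, hence the "|| true".
unanchored=$(grep -v 'P*' "$tmpdir/sample.txt" || true)

# '^P' only matches lines whose first character is P.
anchored=$(grep -v '^P' "$tmpdir/sample.txt")

printf 'unanchored: [%s]\n' "$unanchored"
printf 'anchored:\n%s\n' "$anchored"

rm -rf "$tmpdir"
```

The line "grape Pie" survives because its P is not at the start of the line.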

This works:
cat file.txt | egrep -v -e '^P'
-e indicates expression.

Related

bash / sed : editing of the file

I use sed to remove all lines starting with "HETATM" from the input file, and cat to combine another file with the output received from sed:
sed -i '/^HETATM/ d' file1.pdb
cat file2.pdb file1.pdb > file3.pdb
Is there a way to do it in one line, e.g. using only sed?
If you want to consider awk then it can be done in a single command:
awk 'FNR == NR {print; next} !/^HETATM/' file2.pdb file1.pdb > file3.pdb
With a cat + grep combination, please try the following code. In short: cat concatenates its inputs when multiple files are passed to it, and grep -v removes all lines starting with HETATM from file1.pdb before they are sent as input to cat; the output of cat is written to a new file named file3.pdb.
cat file2.pdb <(grep -v '^HETATM' file1.pdb) > file3.pdb
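Process substitution, <(...), is a bash feature. If you need something that also runs in plain sh, a command group should give the same result without modifying file1.pdb. A sketch, with invented file contents:

```shell
# Hypothetical input files, created just for this demo.
workdir=$(mktemp -d)
printf '%s\n' 'HEADER from file2' > "$workdir/file2.pdb"
printf '%s\n' 'ATOM 1' 'HETATM 2' 'ATOM 3' > "$workdir/file1.pdb"

# Group the two commands so that their combined output goes through a
# single redirection: first file2.pdb as is, then file1.pdb without the
# lines starting with HETATM.
{ cat "$workdir/file2.pdb"; sed '/^HETATM/d' "$workdir/file1.pdb"; } > "$workdir/file3.pdb"

cat "$workdir/file3.pdb"
```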
I'm not sure what you mean by "remove all lines starting from 'HETATM'", but if you mean that any line that appears in the file after a line that starts with "HETATM" will not be outputted, then your sed expression won't do it - it will just remove all lines starting with the pattern while leaving all following lines that do not start with the pattern.
There are ways to get the effect I believe you wanted, possibly even with sed - but I don't know sed all that well. In perl I'd use the range operator with a guaranteed non-matching end expression (not sure what will be guaranteed for your input, I used "XXX" in this example):
perl -ne 'unless (/^HETATM/../XXX/) { print; }' file1.pdb
mawk '(FNR == NR) < NF' FS='^HETATM' f1 f2

Delete all "\n" occurrences with sed

I would like to delete all "\n" (quotes, new line, quotes) in a text file.
I have tried:
sed 's/"\n"//g' < in > out
and also sed '/"\n"/d' < in > out, but none of those sed commands worked.
What am I doing wrong?
This works with GNU sed on Linux: I don't have a Mac to test with.
sed '
# this reads the whole file into pattern space
:a; N; $ bb; ba; :b
# *now* make the replacement
s/"\n"//g
' <<END
one
two"
"three
four"
five
"six
END
one
twothree
four"
five
"six
This perl command accomplishes the same thing:
perl -0777 -pe 's/"\n"//g'
This awk-oneliner works here, you can give it a try:
awk -F'"\n"' -v RS='\0' -v ORS="" '{$1=$1;print}' file
a small test: tested with gawk
kent$ cat f
foo"
"bar"
"bla"
new line should be kept
this too
kent$ awk -F'"\n"' -v RS='\0' -v ORS="" '{$1=$1;print}' f
foo bar bla"
new line should be kept
this too
If you don't want to have the space between foo and bar blah .., add -v OFS="" to awk
Try this -- you need to escape the backslash to make it literal.
sed 's/"\\n"//g' < in > out
Verified on OSX.
The accepted answer was marked as such because of the Perl command it contains.
The sed command doesn't actually work on OSX, because it uses features specific to GNU sed, whereas OSX uses BSD sed.
An equivalent answer requires only a few tweaks - note that this will work with both BSD and GNU sed:
Using multiple -e options:
sed -e ':a' -e '$!{N;ba' -e '}; s/"\n"//g' < in > out
Or, using an ANSI C-quoted string in Bash:
sed $':a\n$!{N;ba\n}; s/"\\n"//g' < in > out
Or, using a multi-line string literal:
sed ':a
$!{N;ba
}; s/"\n"//g' < in > out
BSD sed requires labels (e.g., :a) and branching commands (e.g., b) to be terminated with an actual newline (whereas in GNU sed a ; suffices), or, alternatively, for the script to be broken into multiple -e options, with each part ending where a newline is required.
For a detailed discussion of the differences between GNU and BSD sed, see https://stackoverflow.com/a/24276470/45375
$':a\n$!{N;ba\n}' is a common sed idiom for reading all input lines into the so-called pattern space (buffer on which (subsequent) commands operate):
:a is a label that can be branched to
$! matches every line but the last
{N;ba\n} keeps building the buffer by appending the next input line (N) to it, then branching back to label :a to repeat the cycle.
Once the last line is reached, no branching is performed, and the buffer at that point contains all input lines, at which point the desired substitution (s/"\n"//g) is performed on the entire buffer.
As for why the OP's approach didn't work:
sed reads files line by line by default, so by default it can only operate on one line at a time.
In order to be able to replace newline chars. - i.e., to operate across multiple lines - you must explicitly read multiple/all lines first, as above.
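The difference is easy to check on a few lines. In this sketch (sample input invented), the first command runs line by line and changes nothing, while the second uses the multiple -e form from above to read everything first:

```shell
# Hypothetical sample input.
input=$(printf '%s\n' 'one' 'two"' '"three' 'four')

# Line by line: each line is processed without its trailing newline, so
# the pattern "\n" can never match and the input comes out unchanged.
per_line=$(printf '%s\n' "$input" | sed 's/"\n"//g')

# Read all lines into the pattern space first, then substitute.
slurped=$(printf '%s\n' "$input" | sed -e ':a' -e '$!{N;ba' -e '}; s/"\n"//g')

printf '%s\n' "$slurped"
```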
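Instead of sed you could also use tr; I tested it and it worked for me. Note, however, that tr -d deletes a set of single characters, not a sequence: the command below removes every double quote, every backslash and every letter n from the input, so it only gives the intended result if the text to remove is the literal characters "\n" and none of those characters appear anywhere else in the file.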
tr -d '"\\n"' < input.txt > output.txt

sed emulate "tr | grep"

Given the following file
$ cat a.txt
FOO='hhh';BAR='eee';BAZ='ooo'
I can easily parse out one item with tr and grep
$ tr ';' '\n' < a.txt | grep BAR
BAR='eee'
However if I try this using sed it just prints everything
$ sed 's/;/\n/g; /BAR/!d' a.txt
FOO='hhh'
BAR='eee'
BAZ='ooo'
With awk you could do this:
awk '/BAR/' RS=\; file
But in the case of BAZ this would produce an extra newline, because there is no ; after the last word. If you want to remove that newline as well you would need to do something like:
awk '/BAZ/{sub(/\n/,x); print}' RS=\; file
or with GNU awk or mawk you could use:
awk '/BAZ/' RS='[;\n]'
If your grep has the -o option then you could also try this:
grep -o '[^;]*BAZ[^;]*' file
sed can do it just as you want:
sed -n 's/.*\(BAR[^;]*\).*/\1/gp' <<< "FOO='hhh';BAR='eee';BAZ='ooo'"
The point here is that you must suppress sed's default output (the whole line) and print only the result of the substitution you performed.
Noteworthy points:
sed -n suppresses the default output;
s/.../.../g applies the substitution to every match on the line, not just the first; here the leading and trailing .* already consume the whole line, so the g flag has no extra effect;
s/.1./.2./p prints the pattern space after a successful substitution, i.e. the line with .1. replaced by .2.;
the job of tr is done by [^;]* in the expression \(BAR[^;]*\), which treats ; as the delimiter;
the grep job is represented by the matching of the line itself.
awk 'BEGIN {RS=";"} /BAR/' a.txt
The following grep solution might work for you:
grep -o 'BAR=[^;]*' a.txt
$ sed 's/;/\n/g;/^BAR/!D;P;d' a.txt
BAR='eee'
replace all ; with \n
delete until BAR line is at the top
print BAR line
delete pattern space
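As for why the one-liner from the question prints everything: after the substitution, the pattern space is still one buffer that contains BAR, so /BAR/!d never deletes it. A sketch that contrasts this with piping into a second sed (a backslash followed by a real newline is used in the replacement here, because \n in the replacement is not supported by every sed):

```shell
line="FOO='hhh';BAR='eee';BAZ='ooo'"

# One sed: the pattern space holds all three items (now separated by
# newlines) and contains BAR, so "!d" never fires.
one_pass=$(printf '%s\n' "$line" | sed 's/;/\
/g; /BAR/!d')

# Two seds: the second one receives three separate lines and filters them.
two_pass=$(printf '%s\n' "$line" | sed 's/;/\
/g' | sed '/BAR/!d')

printf 'one pass:\n%s\n' "$one_pass"
printf 'two passes:\n%s\n' "$two_pass"
```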

Display all fields except the last

I have a file as shown below
1.2.3.4.ask
sanma.nam.sam
c.d.b.test
I want to remove the last field from each line; the delimiter is . and the number of fields is not constant.
Can anybody help me with an awk or sed solution? I can't use perl here.
Both these sed and awk solutions work independent of the number of fields.
Using sed:
$ sed -r 's/(.*)\..*/\1/' file
1.2.3.4
sanma.nam
c.d.b
Note: -r is the flag for extended regexp; it could be -E, so check with man sed. If your version of sed doesn't have a flag for this then just escape the parentheses:
sed 's/\(.*\)\..*/\1/' file
1.2.3.4
sanma.nam
c.d.b
The sed solution does a greedy match up to the last . and captures everything before it; it then replaces the whole line with only the captured part (n-1 fields). Use the -i option if you want the changes to be stored back to the files.
Using awk:
$ awk 'BEGIN{FS=OFS="."}{NF--; print}' file
1.2.3.4
sanma.nam
c.d.b
The awk solution just simply prints n-1 fields, to store the changes back to the file use redirection:
$ awk 'BEGIN{FS=OFS="."}{NF--; print}' file > tmp && mv tmp file
Reverse, cut, reverse back.
rev file | cut -d. -f2- | rev >newfile
Or, replace from last dot to end with nothing:
sed 's/\.[^.]*$//' file >newfile
The regex [^.] matches one character which is not dot (or newline). You need to exclude the dot because the repetition operator * is "greedy"; it will select the leftmost, longest possible match.
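A quick sketch of what goes wrong without [^.], using one of the sample lines from the question. The last variant is a shell-only alternative that I'm adding as an aside; it uses a glob pattern, not a regex:

```shell
name='sanma.nam.sam'

# Without excluding dots, \..*$ matches from the FIRST dot to the end.
from_first_dot=$(printf '%s\n' "$name" | sed 's/\..*$//')

# With [^.]* the match cannot contain a dot, so it starts at the LAST dot.
from_last_dot=$(printf '%s\n' "$name" | sed 's/\.[^.]*$//')

# Shell parameter expansion: strip the shortest suffix matching ".*".
shell_only=${name%.*}

printf '%s\n' "$from_first_dot" "$from_last_dot" "$shell_only"
```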
With cut on the reversed string
cat yourFile | rev | cut -d "." -f 2- | rev
If you want to keep the "." use below:
awk '{gsub(/[^\.]*$/,"");print}' your_file

How to ignore all lines before a match occurs in bash?

I would like to ignore all lines which occur before a match in bash (also ignoring the matched line). An example of input could be
R1-01.sql
R1-02.sql
R1-03.sql
R1-04.sql
R2-01.sql
R2-02.sql
R2-03.sql
and if I match R2-01.sql in this already sorted input I would like to get
R2-02.sql
R2-03.sql
Many ways possible. For example: assuming that your input is in list.txt
PATTERN="R2-01.sql"
sed "0,/$PATTERN/d" <list.txt
Because 0,/pattern/ works only in GNU sed (e.g. it doesn't work on OS X), here is a workaround. ;)
PATTERN="R2-01.sql"
(echo "dummy-line-to-the-start" ; cat - ) < list.txt | sed "1,/$PATTERN/d"
This adds one dummy line to the start, so the real pattern must be on line 2 or later and 1,/pattern/ will work: it deletes everything from line 1 (the dummy one) up to the pattern.
Or you can print lines after the pattern and delete the 1st, like:
sed -n '/pattern/,$p' < list.txt | sed '1d'
with awk, e.g.:
awk '/pattern/,0{if (!/pattern/)print}' < list.txt
or, my favorite, use the following perl command:
perl -ne 'print unless 1../pattern/' < list.txt
This also works when the pattern is on the 1st line: only that line is deleted.
another solution is reverse-delete-reverse
tail -r < list.txt | sed '/pattern/,$d' | tail -r
If you have the tac command, use it instead of tail -r. The interesting thing is that /pattern/,$d works when the pattern is on the last line, but 1,/pattern/d doesn't when it is on the first.
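The first-line edge case can be reproduced with a short list where the pattern sits on line 1 (a sketch; the list is a trimmed version of the question's input). The awk filter below is one flag-based way around it that does not depend on GNU sed:

```shell
# The pattern is on the very first line here.
list=$(printf '%s\n' 'R2-01.sql' 'R2-02.sql' 'R2-03.sql')

# 1,/re/ : the end pattern is only checked from line 2 on, so the match on
# line 1 is missed and, with no later match, everything is deleted.
range=$(printf '%s\n' "$list" | sed '1,/R2-01\.sql/d')

# Flag-based filter: print a line only if an EARLIER line has matched.
flag=$(printf '%s\n' "$list" | awk 'found { print } /R2-01\.sql/ { found = 1 }')

printf 'sed range: [%s]\n' "$range"
printf 'awk flag:\n%s\n' "$flag"
```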
The question headline ("How to ignore all lines before a match occurs in bash?") and your example don't quite match up.
Print all lines from "R2-01.sql" in sed:
sed -n '/R2-01.sql/,$p' input_file.txt
Where:
-n suppresses printing the pattern space to stdout
/ starts and ends the pattern to match (regular expression)
, separates the start of the range from the end
$ addresses the last line in the input
p echoes the pattern space in that range to stdout
input_file.txt is the input file
Print all lines after "R2-01.sql" in sed:
sed '1,/R2-01.sql/d' input_file.txt
1 addresses the first line of the input
, separates the start of the range from the end
/ starts and ends the pattern to match (regular expression)
d deletes the pattern space in that range
input_file.txt is the input file
Everything not deleted is echoed to stdout.
This is a little hacky, but it's easy to remember for quickly getting the output you need:
$ grep -A99999 $match $file
Obviously you need to pick a value for -A that's large enough to match all contents; if you use a too-small value the output will be silently truncated.
To ensure you get all output you can do:
$ grep -A$(wc -l < "$file") "$match" "$file"
Of course at that point you might be better off with the sed solutions, since they don't require two reads of the file.
And if you don't want the matching line itself, you can simply pipe this command into tail -n +2 to skip the first line of output.
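Put together, the grep approach could look like the sketch below. The file name and contents are assumptions for the demo. Reading the file through < makes wc print only the count, and the arithmetic expansion strips any padding some wc implementations add:

```shell
# Hypothetical input file, created just for this demo.
workdir=$(mktemp -d)
file="$workdir/list.txt"
printf '%s\n' 'R1-04.sql' 'R2-01.sql' 'R2-02.sql' 'R2-03.sql' > "$file"
match='R2-01.sql'

# Number of lines in the file: an upper bound for the context we need.
lines=$(( $(wc -l < "$file") ))

# -A N prints N lines of trailing context after each match;
# tail -n +2 then drops the matching line itself.
after=$(grep -A "$lines" -e "$match" "$file" | tail -n +2)

printf '%s\n' "$after"
```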
awk -v pattern=R2-01.sql '
print_it {print}
$0 ~ pattern {print_it = 1}
'
You can do it with this, but I think jomo666's answer is better.
sed -nr '/R2-01.sql/,${/R2-01/d;p}' <<END
R1-01.sql
R1-02.sql
R1-03.sql
R1-04.sql
R2-01.sql
R2-02.sql
R2-03.sql
END
Perl is another option:
perl -ne 'if ($f){print} elsif (/R2-01\.sql/){$f++}' sql
To pass in the regex as an argument, use -s to enable a simple argument parser
perl -sne 'if ($f){print} elsif (/$r/){$f++}' -- -r=R2-01\\.sql file
This can be accomplished with grep, by printing a large enough context following the $match. This example will output the first matching line followed by 999,999 lines of "context".
grep -A999999 $match $file
For added safety (in case $match begins with a hyphen, say) you should use -e to force $match to be used as an expression. Use double quotes, not single quotes, so that the shell still expands the variable:
grep -A999999 -e "$match" "$file"
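One more thing worth checking when the match is a file name: as a regular expression, the dot in R2-01.sql matches any character. If you want a literal match, grep's -F option treats the pattern as a fixed string. A sketch with made-up file contents:

```shell
# Hypothetical input: one near-miss, one exact name, one name starting
# with a hyphen.
workdir=$(mktemp -d)
file="$workdir/list.txt"
printf '%s\n' 'R2-01xsql' 'R2-01.sql' '-last.sql' > "$file"
match='R2-01.sql'

# As a regex, the dot matches any character, so R2-01xsql is counted too.
regex_hits=$(grep -c -e "$match" "$file")

# With -F the pattern is a fixed string and only the exact text matches.
fixed_hits=$(grep -c -F -e "$match" "$file")

# Thanks to -e, a pattern with a leading hyphen is not taken as an option.
hyphen_hits=$(grep -c -e '-last' "$file")

printf 'regex: %s, fixed: %s, hyphen: %s\n' "$regex_hits" "$fixed_hits" "$hyphen_hits"
```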

Resources