explain part of sed expression - *\1$/p - bash

This code prints the numbers whose first and last digits are the same. Could somebody explain in plain English how it works?
seq 1000 | sed -nr -e '/^([0-9])([0-9])*\1$/p'
outputs:
11
22
33 etc
I know it looks for a digit at the start (^) and then another digit, but I am unclear how this works with the \1$ to get the answer.

Actually, what this matches is any digit:
([0-9])
followed by any number of digits
([0-9])*
followed by the first digit again
\1
\1 is a backreference to the first parenthesized group.
Note that the digits in the middle are unconstrained:
$ seq 8000 | sed -nr -e '/^([0-9])([0-9])*\1$/p' | tail
7907
7917
7927
7937
7947
7957
7967
7977
7987
7997

It looks for a digit at the start, followed by zero or more digits (notice the star after the second parenthesis), and finally checks for \1 at the end, which stands for the exact same value that the first parenthesis matched.

\1 is the "first matched term".
$ is the "end of line".
So \1$ means "match the same term (i.e. the digit 0-9) found at the start of the string again at the end of the string".
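A quick way to see the backreference at work is to feed the pattern a few hand-picked numbers. This is only a sketch: the sample values are made up for illustration, and -E is used in place of -r (GNU sed accepts both spellings).

```shell
# Sample numbers chosen for illustration (not from the original post).
matches=$(printf '%s\n' 5 11 121 123 7907 | sed -nE '/^([0-9])([0-9])*\1$/p')
printf '%s\n' "$matches"
# prints:
# 11
# 121
# 7907
```

Note that a single digit such as 5 is not printed: the pattern needs one digit for the group and a second one for \1.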

It starts by matching the start of the line. The first pair of parentheses is a group (which can be referenced later) that matches one digit, 0-9. It is followed by another group, also matching one digit, which can be repeated 0 or more times. After that comes a reference to the first group (the \1) and finally a match for the end of the line.
So, basically, it just says that the last digit must be the same as the first digit, with any number of digits between them.
There is no need to group the middle digits since they are never referenced, so it could be rewritten as:
sed -nr -e '/^([0-9])[0-9]*\1$/p'
If you instead wanted the last digit to be the same as the first and the second-to-last to be the same as the second, so that you match 1221 and 245642 but not 2424, you could use:
sed -nr -e '/^([0-9])([0-9])[0-9]*\2\1$/p'
Try it with seq 100000
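For a faster check than scanning seq 100000, the mirrored pattern can be run against a few known values. The test values below are hypothetical, and -E stands in for -r as an assumption about the sed in use.

```shell
# Hypothetical test values: two mirrored numbers and two that are not.
mirrored=$(printf '%s\n' 1221 245642 2424 121 |
  sed -nE '/^([0-9])([0-9])[0-9]*\2\1$/p')
printf '%s\n' "$mirrored"
# prints:
# 1221
# 245642
```

121 is rejected because the pattern needs at least four digits: two for the groups and two more for \2\1.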

Related

How to take first numbers and ignore the rest using sed

I have this line of code. The sed is taking 10 and 5. How can I extract 10 and ignore 5?
$ NUMBER=$(echo "The food is ready in 10mins and transport is coming in 5mins." | sed 's/[^0-9]*//g') ; echo $NUMBER
You're removing everything that isn't a digit, so the digits of both numbers are left and run together as 105.
Instead, use a capture group to capture the first number, and use that in the replacement.
sed 's/^[^0-9]*\([0-9]*\).*/\1/'
In the regular expression:
^ matches the beginning of the line
[^0-9]* matches all non-digits at the beginning of the line
\( and \) surround a capture group
[0-9]* matches digits. These are captured in the group
.* matches the rest of the line.
In the replacement:
\1 copies the part of the line that was matched by the capture group. This is the first run of digits on the line.
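The difference between the two substitutions can be seen side by side in a small sketch. The variable names are illustrative only; the sentence is the one from the question.

```shell
text="The food is ready in 10mins and transport is coming in 5mins."
# Deleting every non-digit glues both numbers together.
all_digits=$(printf '%s\n' "$text" | sed 's/[^0-9]*//g')
# Capturing the first run of digits and dropping the rest keeps only 10.
first_number=$(printf '%s\n' "$text" | sed 's/^[^0-9]*\([0-9]*\).*/\1/')
printf '%s %s\n' "$all_digits" "$first_number"   # prints: 105 10
```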

Grep pattern matching at most n times using Perl flag

I am trying to get a specific pattern from a text file. The pattern should start with chr followed by a digit appearing at most 2 times, or a letter X or Y appearing exactly 1 time, and then an underscore appearing also one time. Input example:
chr5_ 16560869
chrX 46042911
chr12_ 131428407
chr22_ 13191864
chr5 165608
chrX_ 96055593
I am running this code on the console: grep -P "^chr(\d{,2}|X{1}|Y{1})_" input_file.txt, which only gives me back the lines that start with chrX_ or chrY_, but not chr2_ (I wrote 2 but could be any digit/s).
The thing is that if I run grep -P "^chr(\d{1,2}|X{1}|Y{1})_" input_file.txt (note that I changed from {,2} to {1,2}) then I get back what I expected. I can't figure out why the first option is not working. I thought in regexp you could specify that a pattern was matched at most n times with the syntax {,N}.
Thanks in advance!
Note that you chose the PCRE regex engine with your grep due to the -P option.
The \d{,2} does not match zero to two digits, it matches a digit and then a {,2} string. See the regex demo.
See the PCRE documentation:
An opening curly bracket that appears in a position where a quantifier is not allowed, or one that does not match the syntax of a quantifier, is taken as a literal character. For example, {,6} is not a quantifier, but a literal string of four characters.
Also, see the limiting quantifier definition there:
The general repetition quantifier specifies a minimum and maximum number of permitted matches, by giving the two numbers in curly brackets (braces), separated by a comma. The numbers must be less than 65536, and the first must be less than or equal to the second.
Note that POSIX regex flavor is not that strict, it allows omitting the minimum threshold value from the limiting quantifier (then it is assumed to be 0):
grep -oE '[0-9]{,2}_' <<< "12_ 21"
## => 12_
grep -oP '[0-9]{,2}_' <<< "21_ 1{,2}_"
## => 1{,2}_
See the online demo.
Note
I'd advise always specifying the minimum of 0 explicitly, since the behavior varies from engine to engine. In the TRE regex flavor, which base R uses as its default regex engine, omitting the zero leads to a bug.
-P, --perl-regexp
Interpret PATTERNS as Perl-compatible regular expressions (PCREs).
It would seem that {,n} is not supported in Perl-compatible regular expressions.
Using grep ERE instead
$ grep -E 'chr([0-9]{,2}|[XY])_' input_file
chr5_ 16560869
chr12_ 131428407
chr22_ 13191864
chrX_ 96055593
The minimum value is missing from the {,2}.
Give this a try:
grep -P "^chr(\d{1,2}|X{1}|Y{1})_" input_file.txt
My first guess was to use egrep instead.
This alternative seems to work too:
egrep "^chr([[:digit:]]{1,2}|X{1}|Y{1})_" input_file.txt
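As a sketch of the advice to always write both bounds, the sample input from the question can be piped through an ERE that spells out {1,2}. This assumes a grep that supports -E; the variable names are illustrative.

```shell
# Sample input copied from the question.
input='chr5_ 16560869
chrX 46042911
chr12_ 131428407
chr22_ 13191864
chr5 165608
chrX_ 96055593'
# With explicit bounds, {1,2} is a quantifier in both ERE and PCRE.
hits=$(printf '%s\n' "$input" | grep -E '^chr([0-9]{1,2}|[XY])_')
printf '%s\n' "$hits"
# prints:
# chr5_ 16560869
# chr12_ 131428407
# chr22_ 13191864
# chrX_ 96055593
```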

Can someone explain this sed command?

So the text is the following:
1a fost odata
2un balaur
care fura
mere de aur
and after using this command:
sed 's/\([a-z]*\)\(.*\)\( [a-z]*\)/\1 ... \2/' filename
the result is this:
... 1a fost
... 2un balaur
care ...
mere ... de
I know that \1 is for the first [a-z]* subexpression and so on, but I just can't figure this out.. also, what's the difference between the first subexpression and the last one? why is there a space before [a-z]?
The first [a-z]* matches the first sequence of letters on the line. The * quantifier matches 0 or more repetitions, so this can also match an empty string.
On the first line it matches the empty string before 1a. On the second line it matches the empty string before 2un. On the third line it matches care, and on the fourth line it matches mere. These matches will go into capture group 1.
.* matches zero or more of any characters, so this will skip over everything in the middle of the line. These matches go into capture group 2.
" [a-z]*" (note the leading space) matches a space followed by zero or more letters. The space is needed to make .* stop matching when it gets to the last space on the line. These matches go into capture group 3.
The replacement is capture groups 1 and 2 with ... between them. This is the letters at the beginning of the line, ..., then everything after that except the last word.
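One way to check this reading is to print all three groups in brackets instead of dropping the third. The sketch below reuses the input lines from the question; only the replacement text is changed for illustration.

```shell
# Input lines copied from the question; the replacement shows every group.
captured=$(printf '%s\n' '1a fost odata' '2un balaur' 'care fura' 'mere de aur' |
  sed 's/\([a-z]*\)\(.*\)\( [a-z]*\)/[\1][\2][\3]/')
printf '%s\n' "$captured"
# prints:
# [][1a fost][ odata]
# [][2un][ balaur]
# [care][][ fura]
# [mere][ de][ aur]
```

The empty first brackets on the first two lines are the empty matches before the leading digit, and the third group always holds the last space and word.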

grep: keep lines by number in specific column

I know how to do it with awk. For example, to keep the lines that contain the number 3 in the second column: $ awk '$2 == 3'
But how to do the same with only grep?
What about for first column?
Grep is not great for this, awk is better. But assuming your columns are separated by spaces, then you want
grep -E '^[^ ]+ +3( |$)'
Explanation: find something that has a start of line, followed by one or more non-space characters (first column), then one or more space characters (column separator), then the number 3, then either a space (because there's another column) or end of line (if there's no other column).
(Updated to fix syntax after testing.)
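To compare the grep pattern with the awk version, both can be run over the same rows. The rows here are made up and assume single spaces between columns.

```shell
# Made-up, space-separated sample rows.
rows='a 3 x
b 13 y
c 3
d 31 z'
with_grep=$(printf '%s\n' "$rows" | grep -E '^[^ ]+ +3( |$)')
with_awk=$(printf '%s\n' "$rows" | awk '$2 == 3')
printf '%s\n' "$with_grep"
# prints:
# a 3 x
# c 3
```

The trailing ( |$) is what keeps 31 out, and the explicit column separator before the 3 is what keeps 13 out.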
Here is the longer explanation for my mysterious command grep -P '^[^\t]*\t3\t' your_file from the comments:
I assumed that the column delimiter is a tab. grep without -P would need some awkward workarounds to match a tab directly (see e.g. here). The -P option makes it possible to just write \t. If your delimiter is, for example, ; then you could replace the \t with ; and you don't need the -P option.
Having said that, let's explain the idea behind the regular expression. You said you want to match a 3 in the second column:
^ means: at the beginning of the line
[^\t]* means: zero or more (*) occurrences of something that is not a tab ([^\t]; here the ^ means "not")
followed by tab
followed by 3
followed by tab
Now we have effectively expressed the idea that we need a 3 as the content of the second column (\t3\t) and we are not interested in the precise content of the first column. The ^[^\t]*\t is only necessary to express the idea "what follows is in the second column".
If you want to match something in the fourth column, you could use this to "skip" the first three columns and match a 4 at the start of the fourth column:
^([^\t]*\t){3}4 (note the parentheses and the {3}).
As you can see, there are many details to get right, and awk is much more elegant and easier.
You can read up on this in the grep documentation, and then you will need to study regular expressions, e.g. start here.
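If -P is not available, one possible workaround is to put a literal tab into a shell variable and build an ERE from it. This is a sketch under that assumption: the rows are made up, and the pattern ends in (tab|$) instead of a bare tab so that a row whose last column is 3 also matches.

```shell
# Made-up, tab-separated sample rows; the last row has no third column.
tab=$(printf '\t')
rows=$(printf 'a\t3\tx\nb\t13\ty\nc\t3\n')
# Same idea as ^[^\t]*\t3\t, but with a literal tab and an end-of-line alternative.
second_col=$(printf '%s\n' "$rows" | grep -E "^[^${tab}]*${tab}3(${tab}|\$)")
printf '%s\n' "$second_col"
```

This prints the rows starting with a and c, the same rows that awk -F'\t' '$2 == 3' would keep.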

grep one liner - extract two different lines from same file

I have a file containing many lines like the following.
== domain 1 score: 280.5 bits; conditional E-value: 2.1e-87
TSEEETTCTTTGSG---BTTSSB-HHHHHHHHHHHHHHHHHHSSS---B-HHHHHHHSTTTSTGCGBB-HHHHHHHHHHHTEBEBTTTS---SSCSESECTTGCGSCEBEESEEEEEESSBHHHHHHHHHHHSSEEEEEECTSHHHHTEESSEESCTSCETSS-EEEEEEEEEEEETTEEEEEEE-SBTTTSTBTTEEEEESSSSSGGGTTSSEEEE CS
PF00112.18 2 pesvDwrekkgavtpvkdqgsCGsCWafsavgalegrlaiktkkklvslSeqelvdCskeenegCnGGlmenafeyikknggivtekdypYkakekgkCkkkkkkekvakikgygkvkenseealkkalakngPvsvaidaseedfqlYksGvyketecsktelnhavlivGygvengkkyWivkNsWgtdwgekGYiriargknnecgieseavyp 218
p+svD+r+k+ +vtpvk+qg+CGsCWafs+vgaleg+l+ kt +kl++lS q+lvdC + en+gC GG+m+naf+y++kn+gi++e+ ypY ++e ++C ++ + + ak++gy++++e +e+alk+a+a++gPvsvaidas ++fq+Y++Gvy++++c++++lnhavl+vGyg ++g+k Wi+kNsWg++wg+kGYi +ar+knn cgi++ a++p
1AU0:A 2 PDSVDYRKKG-YVTPVKNQGQCGSCWAFSSVGALEGQLKKKT-GKLLNLSPQNLVDCVS-ENDGCGGGYMTNAFQYVQKNRGIDSEDAYPYVGQE-ESCMYNPTGKA-AKCRGYREIPEGNEKALKRAVARVGPVSVAIDASLTSFQFYSKGVYYDESCNSDNLNHAVLAVGYGIQKGNKHWIIKNSWGENWGNKGYILMARNKNNACGIANLASFP 213
I just want to extract the line that starts with PF and the associated line after it, which starts with a digit.
In this case, the line that starts with PF is 'PF00112.18' and the line that starts with a digit is '1AU0:A'. These IDs will change for the next domain, but the PF prefix is constant and its associated ID always starts with a digit.
Here is what I've tried with grep; there must be a mistake in this one-liner. Any help will be greatly appreciated.
grep '^ PF \| \d' inFile.txt
Expected output:
PF00112.18 2 pesvDwrekkgavtpvkdqgsCGsCWafsavgalegrlaiktkkklvslSeqelvdCskeenegCnGGlmenafeyikknggivtekdypYkakekgkCkkkkkkekvakikgygkvkenseealkkalakngPvsvaidaseedfqlYksGvyketecsktelnhavlivGygvengkkyWivkNsWgtdwgekGYiriargknnecgieseavyp 218
1AU0:A 2 PDSVDYRKKG-YVTPVKNQGQCGSCWAFSSVGALEGQLKKKT-GKLLNLSPQNLVDCVS-ENDGCGGGYMTNAFQYVQKNRGIDSEDAYPYVGQE-ESCMYNPTGKA-AKCRGYREIPEGNEKALKRAVARVGPVSVAIDASLTSFQFYSKGVYYDESCNSDNLNHAVLAVGYGIQKGNKHWIIKNSWGENWGNKGYILMARNKNNACGIANLASFP 213
You can use the following grep expression:
grep '^[[:space:]]\+PF\|^[[:space:]]\+[[:digit:]]' input.txt
The first pattern, ^[[:space:]]\+PF, searches for a line that starts with one or more whitespace characters followed by PF. The second pattern also searches for one or more whitespace characters at the start of the line, but followed by a digit.
This can be simplified to:
grep '^[[:space:]]\+\(PF\|[[:digit:]]\)' input.txt
since both patterns start with one or more spaces at the start of the line.
Finally, let me suggest using egrep (or grep -E) instead of grep, because extended POSIX regexes will save us some escaping:
egrep '^[[:space:]]+(PF|[[:digit:]])' input.txt
egrep "^[ \t]*(PF|[0-9]).*$" tmp_file
[ \t] is meant to match a single space or tab. Be aware that POSIX bracket expressions do not treat \t as a tab escape, so with egrep this class actually matches a space, a backslash, or the letter t; [[:space:]] is the safer spelling.
So ^[ \t]* anchors at the start of the line, and the asterisk consumes all of the leading white space.
(PF|[0-9]) then requires the first non-blank text to be either PF or a digit. The beauty of egrep is that you can specify multiple alternatives enclosed in parentheses and separated by a pipe.
.*$ grabs everything from there until the end of the line.
So (PF|[0-9]).*$ will grab everything that starts with PF or a digit, up to the end of the line. It will not work without accounting for the leading white space first.
So we get :
egrep "^[ \t]*(PF|[0-9]).*$" tmp_file
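The filter can be tried on a shortened stand-in for the alignment block. In this sketch the sequences are truncated and the indentation is an assumption, since the original spacing is not visible in the question; [[:space:]] and grep -E are used in place of [ \t] and egrep.

```shell
# Shortened stand-in for the block in the question; indentation is assumed.
sample='  == domain 1  score: 280.5 bits
       TSEEETTCTTTGSG CS
  PF00112.18   2 pesvDwrekkg 218
                 p+svD+r+k+
      1AU0:A   2 PDSVDYRKKG  213'
# Keep lines whose first non-blank text is PF or a digit.
picked=$(printf '%s\n' "$sample" | grep -E '^[[:space:]]*(PF|[[:digit:]])')
printf '%s\n' "$picked"
```

Only the PF00112.18 and 1AU0:A lines are printed; the header, the CS line and the consensus line start with other characters after the leading blanks.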
