Consider this line:
--dump-config=h264_sss-l2-2-ghb-16-500.ini --stats-file=h264_sss-l2-2-ghb-16-500.stat configs/example/cmp.py --l2cache -b h264_sss
and the string "l2-2-ghb-16". To change that string with sed, I ran this command:
sed 's/l2-.*-.*-.*-/l2-2-ghb-8-m-/g'
But the whole line then changed to
--dump-config=h264_sss-l2-2-ghb-8-m-b h264_sss
What is the problem?
The .* portion matches the longest possible stretch of characters it can while still letting the whole pattern match. So the first .* doesn't match just 2 as you hoped; it runs on through 2-ghb-16-500.ini --stats-file=h264_sss-l2-2-ghb-16 and beyond, and the match only ends at the last dash on the line (the one in -b). To make it work, replace each dot with [^-] (any non-dash character). So:
sed 's/l2-[^-]*-[^-]*-[^-]*-/l2-2-ghb-8-m-/g'
That regex is greedy inasmuch as .* will match the maximum number of characters.
That means it will attempt to stretch the match from what you think is the first pattern all the way to what you think is the second.
While you may think there are two matched patterns, the fact that this stretching is happening means that there's only one, and it's longer than you think.
A quick fix is to ensure that it doesn't match beyond the next - character with something like:
sed 's/l2-[^-]*-[^-]*-[^-]*-/l2-2-ghb-8-m-/g'
as per the following transcript:
pax> echo '--dump-config=h264_sss-l2-2-ghb-16-500.ini --stats-file=h264_sss-l2-2-ghb-16-500.stat configs/example/cmp.py --l2cache -b h264_sss' | sed 's/l2-[^-]*-[^-]*-[^-]*-/l2-2-ghb-8-m-/g'
--dump-config=h264_sss-l2-2-ghb-8-m-500.ini --stats-file=h264_sss-l2-2-ghb-8-m-500.stat configs/example/cmp.py --l2cache -b h264_sss
(command and output are slightly modified, lined up so you can easily see the transformations).
This works because, while .* says the largest field of any characters, [^-]* says the largest field of any characters except -.
sed looks for the longest possible match. So -.*- will match a string as large as possible.
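To see the difference between the two patterns on something shorter than the original command line, here is a small sketch. The sample line and the NEW replacement text are made up for illustration; only the two regexes come from the answers above.

```shell
# Made-up sample line: two "l2-" groups plus a trailing "-z" option.
line='a-l2-1-x-9-end l2-7-y-3-tail -z'

# Greedy: each .* grabs as much as it can, so the single match runs
# from the first "l2-" to the last dash on the line.
greedy=$(printf '%s\n' "$line" | sed 's/l2-.*-.*-.*-/l2-NEW-/g')

# Bounded: [^-]* cannot cross a dash, so each group is matched separately.
bounded=$(printf '%s\n' "$line" | sed 's/l2-[^-]*-[^-]*-[^-]*-/l2-NEW-/g')

printf 'greedy:  %s\nbounded: %s\n' "$greedy" "$bounded"
# greedy:  a-l2-NEW-z
# bounded: a-l2-NEW-end l2-NEW-tail -z
```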
I am trying to get a specific pattern from a text file. The pattern should start with chr followed by a digit appearing at most 2 times, or a letter X or Y appearing exactly 1 time, and then an underscore appearing also one time. Input example:
chr5_ 16560869
chrX 46042911
chr12_ 131428407
chr22_ 13191864
chr5 165608
chrX_ 96055593
I am running this code on the console: grep -P "^chr(\d{,2}|X{1}|Y{1})_" input_file.txt, which only gives me back the lines that start with chrX_ or chrY_, but not chr2_ (I wrote 2 but could be any digit/s).
The thing is that if I run grep -P "^chr(\d{1,2}|X{1}|Y{1})_" input_file.txt (note that I changed from {,2} to {1,2}) then I get back what I expected. I can't figure out why the first option is not working. I thought in regexp you could specify that a pattern was matched at most n times with the syntax {,N}.
Thanks in advance!
Note that you chose the PCRE regex engine with your grep due to the -P option.
The \d{,2} does not match zero to two digits, it matches a digit and then a {,2} string. See the regex demo.
See the PCRE documentation:
An opening curly bracket that appears in a position where a quantifier is not allowed, or one that does not match the syntax of a quantifier, is taken as a literal character. For example, {,6} is not a quantifier, but a literal string of four characters.
Also, see the limiting quantifier definition there:
The general repetition quantifier specifies a minimum and maximum number of permitted matches, by giving the two numbers in curly brackets (braces), separated by a comma. The numbers must be less than 65536, and the first must be less than or equal to the second.
Note that the POSIX ERE flavor, at least as implemented in GNU grep, is not that strict: it allows omitting the minimum threshold value from the limiting quantifier (it is then assumed to be 0):
grep -oE '[0-9]{,2}_' <<< "12_ 21"
## => 12_
grep -oP '[0-9]{,2}_' <<< "21_ 1{,2}_"
## => 1{,2}_
See the online demo.
Note
I'd advise always specifying the 0 minimum value, since the behavior varies from engine to engine. In the TRE regex flavor used as the default regex engine in base R, omitting the zero leads to a bug.
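As a rough illustration of that advice, the sketch below spells the lower bound out as {0,2} and runs it over the sample input from the question with grep -E. The variable names are made up; use {1,2} instead if at least one digit must be present.

```shell
# Sample lines copied from the question.
input='chr5_ 16560869
chrX 46042911
chr12_ 131428407
chr22_ 13191864
chr5 165608
chrX_ 96055593'

# Explicit {0,2}: no reliance on how a given engine treats {,2}.
matched=$(printf '%s\n' "$input" | grep -E '^chr([0-9]{0,2}|[XY])_')

printf '%s\n' "$matched"
# chr5_ 16560869
# chr12_ 131428407
# chr22_ 13191864
# chrX_ 96055593
```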
-P, --perl-regexp
Interpret PATTERNS as Perl-compatible regular expressions (PCREs).
It would seem {,n} is not supported in Perl-compatible regex.
Using grep ERE instead
$ grep -E 'chr([0-9]{,2}|[XY])_' input_file
chr5_ 16560869
chr12_ 131428407
chr22_ 13191864
chrX_ 96055593
There is a missing digit in the {,2}.
Try:
grep -P "^chr(\d{1,2}|X{1}|Y{1})_" input_file.txt
My first guess was to use egrep instead.
This one seems OK too:
egrep "^chr([[:digit:]]{1,2}|X{1}|Y{1})_" input_file.txt
I'm trying to parse a curl response in order to retrieve an img src, identified with the alt tag captcha.
So to test my sed expression I tried the following:
echo 'alt="captcha" src="http://example.com/foo.html" /></p>' | sed -n 's/.*alt="captcha" src="\([^"]*\)/\1/p'
However, this echoes
http://example.com/foo.html" /></p>
How can I simply return
http://example.com/foo.html
?
I am new to sed so I would like to know where I'm going wrong.
This answer explains sed's behavior, but 123 - who also gave the right answer to the sed problem succinctly in a comment - points to a potentially better alternative, if you have GNU grep: grep -oP 'alt="captcha" src="\K[^"]*'. GNU grep's -P option supports PCREs, which are more powerful regular expressions than those available in sed.
The issue is not related to greediness, but to the fact that your regex only matches part of the line:
To extract a substring in sed, your regex must match the entire line. Otherwise, any parts not matched by your regex are simply passed through, as happened with the trailing " /></p> in your case; here's a fix:
$ echo 'alt="captcha" src="http://example.com/foo.html" /></p>' |
sed -n 's/.*alt="captcha" src="\([^"]*\).*/\1/p'
http://example.com/foo.html
Note the trailing .* I've added, which ensures that the remainder of the line is matched as well.
Without it, what is left of the input line after the match is simply appended to the result of your substitution; i.e., the " /></p> part. More correctly: the remaining part of the line is simply not replaced.
Therefore, generally, you'd use an approach such as the following (pseudo notation):
sed 's/^...<capture-group>...$/\1/p'
Again, the regex must match the whole line for this to work.
Due to sed's greedy matching, you need neither ^ nor $, though you may choose to add them for clarity of intent.
Caveat: If your capture group has no ambiguity, .* is fine to match the remainder of the line, but .* to match everything before the capture group will not work in all cases - see below.
A simple example to demonstrate the problem:
$ sed -n 's/[^"]*"\([^"]*\)/>>\1<</p' <<<'before"foo"after' # WRONG
>>foo<<"after
Note how \1 does contain the substring of interest captured by \([^"]*\), as intended - the string foo between "..." - but, because the regex stopped matching just before the closing ", the remainder of the line - "after - is still output.
Fixed version, with .* appended to ensure that the whole line matches:
$ sed -n 's/[^"]*"\([^"]*\).*/>>\1<</p' <<<'before"foo"after'
>>foo<<
Also note how [^"]*" is used to match the beginning of the line up to the capture group; .* would not work here, due to sed's greedy matching:
$ sed -n 's/.*"\([^"]*\).*/>>\1<</p' <<<'before"foo"after' # WRONG
>>after<<
.*" greedily matches everything up to the last ", and so the capture group then captures after, which is the run of non-" characters after the closing ".
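Putting those pieces together for the original captcha case, a minimal sketch might look like this. The function name extract_captcha_src and the sample HTML line are made up for illustration; the regex is the fixed one from above with the closing " spelled out.

```shell
# Hypothetical helper: prints the src value that follows alt="captcha".
# The regex covers the whole line, so nothing outside \1 leaks through,
# and -n with the p flag suppresses lines that do not match at all.
extract_captcha_src() {
  sed -n 's/.*alt="captcha" src="\([^"]*\)".*/\1/p'
}

# Made-up input: one matching line and one line that should be ignored.
html='<p><img alt="captcha" src="http://example.com/foo.html" /></p>
<p><img alt="logo" src="http://example.com/logo.png" /></p>'

src=$(printf '%s\n' "$html" | extract_captcha_src)
printf '%s\n' "$src"
# http://example.com/foo.html
```

The leading .* is safe here only because the literal alt="captcha" src=" that follows it pins down where the capture group starts.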
Use sed grouping. It's always my go-to!
Sed regex:
echo 'alt="captcha" src="http://example.com/foo.html" /></p>' | sed 's/\(^alt.*src=\"\)\(.*\)\(\".*p>\)/\2/g'
Output
http://example.com/foo.html
I'm trying to search for a couple of strings in a path that includes about 80 txt files.
I'm trying to search for !!, ??, ;, capitalization, and parentheses.
I'm also trying to search for if there are more than 4 words capitalized, but I just didn't know how to do that
Here is what I did:
grep -lr '!!\|??\|;\|(.*(' path
Can someone help me with it?
Here is a sample input:
file1.txt:
ryan went over there !!
file2.txt:
am I going there??
file3.txt:
how about I GO TO THE PARK TODAY and not TOMORROW
file4.txt:
This is (not) (valid)
file5.txt:
to go; or not to go
the output should be something like this:
path/file1.txt
path/file2.txt
path/file3.txt
path/file5.txt
Try this regex:
grep -Er '\?\?|!!|\(.+\).+\(.+\)|([A-Z]+\b.){4,}|;' /path/to/files/*.txt
Output:
./1.txt:ryan went over there !!
./2.txt:am I going there??
./3.txt:how about I GO TO THE PARK TODAY and not TOMORROW
./4.txt:This is (not) (valid)
./5.txt:to go; or not to go
grep -Elr will output:
./1.txt
./2.txt
./3.txt
./4.txt
./5.txt
The regex searches for:
??
!!
() used at least twice on a line
Four or more consecutive all-uppercase words on a line
;
grep -lr '!!\|??\|;\|(.*(' path
is what you want. (.*( will match a line containing (at least) two open parentheses with arbitrary text in between.
For readability, you might try
grep -lr -e '!!' -e '??' -e ';' -e '(.*(' path
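A quick way to try the multi-pattern form without touching real files is to build a scratch directory first. This is only a sketch: the directory comes from mktemp and the file contents are made up, loosely following the sample input in the question.

```shell
# Scratch directory with three small made-up files.
dir=$(mktemp -d)
printf 'ryan went over there !!\n' > "$dir/file1.txt"
printf 'nothing special here\n'    > "$dir/file2.txt"
printf 'to go; or not to go\n'     > "$dir/file3.txt"

# -l prints only the names of files with at least one matching line;
# each -e adds one basic-regex alternative.
matches=$(grep -lr -e '!!' -e '??' -e ';' -e '(.*(' "$dir" | sort)
printf '%s\n' "$matches"

# Clean up the scratch directory.
rm -rf "$dir"
```

Only file1.txt and file3.txt should be listed, since file2.txt contains none of the patterns.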
Your notation is off. In modern grep, you need to backslash the braces, just like you backslash the vertical bar for alternation. More conveniently, you might want to switch to grep -E for backslashless syntax; but then you will need \( to match a literal opening parenthesis.
But either way, inside the braces there can be at most two numbers: the lower and the upper bound for the number of repetitions.
However, in this case, because there is no limiting context, \({2} will match the first two of an arbitrarily large number of opening parentheses. In other words, \({2,4} will not fail to match if there are more than four parens (though the actual match will end after four, as you can see e.g. with grep -o). If you need to limit the run to no more than four, you will need to supply some sort of trailing context, such as ($|[^(]).
To find a line containing more than one but less than five nonadjacent opening parens, try something like
^[^(]*(\([^(]*){2,4}$
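Here is a small sketch of that anchored pattern with grep -E. The four test lines are made up; the regex is the one above.

```shell
# Made-up lines with 0, 1, 2 and 5 opening parentheses.
lines='no parens here
only (one
two (a) then (b)
five ( ( ( ( ('

# Anchoring at both ends means every "(" on the line is counted,
# so a line with five of them cannot sneak through as "at least two".
hits=$(printf '%s\n' "$lines" | grep -E '^[^(]*(\([^(]*){2,4}$')

printf '%s\n' "$hits"
# two (a) then (b)
```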
Maybe a silly question, but I have a text file and need to display everything up to the first pattern match, which is a '/'. (No line contains any blank spaces.)
Example.txt:
somename/for/example/
something/as/another/example
thisfile/dir/dir/example
Preferred output:
somename
something
thisfile
I know this grep code will display everything after a matching pattern:
grep -o '/[^\n]*' '/my/file.txt'
So is there any way to do the complete opposite, maybe rm everything after matching pattern or invert to display my preferred output?
Thanks.
If you're calling an external command like grep, you can get the results you require with the sed command, i.e.
echo "something/as/another/example" | sed 's:/.*::'
something
Instead of focusing on what you want to keep, think about what you want to remove, in this case everything after the first '/' char. This is what this sed command does.
The leading s means substitute, and :/.*: is the pattern to match, with /.* meaning match the first / character and all characters after it. The second half of the sed command is the replacement; with ::, this means replace with nothing.
The traditional idiom for sed is s/str/rep/, using / characters to delimit the search from the replacement, but you can use any character you want (other than a backslash or newline) after the initial s (substitute) command.
The backslash form is only needed when a custom delimiter introduces an address regex rather than an s command, e.g. \:/usr/bin:d, since there sed expects / unless told otherwise.
IHTH.
You can use a much simpler regexp:
/[^/]*/
The forward slash after the caret is what you're matching up to.
Assuming filename as "file.txt"
cut -d "/" -f 1 file.txt
Here, we are cutting the input line with "/" as the delimiter (-d "/"). Then we select the first field (-f 1).
You just need to include starting anchor ^ and also the / in a negated character class.
grep -o '^[^/]*' file
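For comparison, here are the sed, cut and grep approaches from the answers side by side on one of the sample lines. The last line uses the shell's own ${var%%pattern} expansion, which is not from any of the answers; it is added here on the assumption that the value is already in a shell variable. The variable names are made up.

```shell
path='somename/for/example/'

via_sed=$(printf '%s\n' "$path" | sed 's:/.*::')        # delete from the first / onward
via_cut=$(printf '%s\n' "$path" | cut -d / -f 1)        # first /-delimited field
via_grep=$(printf '%s\n' "$path" | grep -o '^[^/]*')    # leading run of non-/ characters
via_shell=${path%%/*}                                   # strip the longest suffix matching /*

printf '%s\n' "$via_sed" "$via_cut" "$via_grep" "$via_shell"
# somename
# somename
# somename
# somename
```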
I am using sed to find and replace items, e.g.:
sed -i 's/fish/bear/g' ./file.txt
I want to limit this to only change items which do not have a letter or number before or after, e.g.:
The fish ate the worm. would change, because only spaces are before and after.
The lionfish ate the worm. would not change, because there is a letter before fish.
How can I find and replace some items, but not if at least one letter or number appears immediately before or after?
Use word boundary escapes:
sed -i 's/\<fish\>/bear/g' inputfile
Some versions of sed may not support this.
Use a negative character class before and after fish, like so: \(^\|[^[:alnum:]]\)fish\($\|[^[:alnum:]]\). This says:
Start of line or anything that's not alphanumeric
Followed by fish
Followed by end of line or anything that's not alphanumeric
This guarantees that the characters immediately preceding and immediately following fish are not alphanumeric.
sed 's/\(^\|[^[:alnum:]]\)fish\($\|[^[:alnum:]]\)/\1bear\2/g'
Check the character in front of and behind the string. If it's at the beginning or end, there won't be a character to check so check that too.
sed -i 's/\(^\|[^[:alnum:]]\)fish\($\|[^[:alnum:]]\)/\1bear\2/g'
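To check this against the two sentences from the question, here is a minimal sketch. It assumes a sed that accepts -E (GNU and BSD sed both do) and writes the same pattern in extended syntax so that the alternation needs no backslashes. The function name replace_fish is made up.

```shell
# Hypothetical helper: replaces "fish" only where it is not touching
# a letter or digit on either side. \1 and \2 put back whatever
# non-alphanumeric character (if any) was matched around the word.
replace_fish() {
  sed -E 's/(^|[^[:alnum:]])fish($|[^[:alnum:]])/\1bear\2/g'
}

changed=$(printf '%s\n' 'The fish ate the worm.' | replace_fish)
unchanged=$(printf '%s\n' 'The lionfish ate the worm.' | replace_fish)

printf '%s\n' "$changed" "$unchanged"
# The bear ate the worm.
# The lionfish ate the worm.
```

One thing to watch for: the pattern consumes the character on each side of fish, so two occurrences separated by a single character (as in "fish fish") may not both be replaced in one pass.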