Can someone explain this sed command? - bash

So the text is the following:
1a fost odata
2un balaur
care fura
mere de aur
and after using this command:
sed 's/\([a-z]*\)\(.*\)\( [a-z]*\)/\1 ... \2/' filename
the result is this:
... 1a fost
... 2un
care ...
mere ... de
I know that \1 is for the first [a-z]* subexpression and so on, but I just can't figure this out. Also, what's the difference between the first subexpression and the last one? Why is there a space before [a-z]?

The first [a-z]* matches the sequence of lowercase letters at the start of the line. The * quantifier matches 0 or more repetitions, so this can also match an empty string.
On the first line it matches the empty string before 1a. On the second line it matches the empty string before 2un. On the third line it matches care, and on the fourth line it matches mere. These matches will go into capture group 1.
.* matches zero or more of any characters, so this will skip over everything in the middle of the line. These matches go into capture group 2.
 [a-z]* (note the leading space) matches a space followed by zero or more letters. The space is needed to make .* stop matching when it gets to the last space on the line. These matches go into capture group 3, which the replacement never uses.
The replacement is capture groups 1 and 2 with ... between them. This is the letters at the beginning of the line, ..., then everything after that except the last word.
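One way to see this for yourself is to print every capture group in brackets instead of dropping group 3. This is only an illustrative sketch: the bracketed replacement and the variable names input and out are my own additions, not part of the original command.

```shell
# Sample text from the question (variable names are illustrative).
input='1a fost odata
2un balaur
care fura
mere de aur'

# Same pattern as in the question, but the replacement shows all three groups.
out=$(printf '%s\n' "$input" | sed 's/\([a-z]*\)\(.*\)\( [a-z]*\)/[\1][\2][\3]/')
printf '%s\n' "$out"
# [][1a fost][ odata]
# [][2un][ balaur]
# [care][][ fura]
# [mere][ de][ aur]
```

An empty [] in the first position is the empty match in front of a digit, and the third bracket always holds the last space plus the last word.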

Related

Regular expressions, capture groups, and the dollar sign

I'm reading a book about bash, and it introduced regular expressions (I'm pretty new to them) with an example:
rename -n 's/(.*)(.*)/new$1$2/' *
'file1' would be renamed to 'newfile1'
'file2' would be renamed to 'newfile2'
'file3' would be renamed to 'newfile3'
There wasn't really a breakdown provided with this example, unfortunately. I kind of get what capture groups are and that .* is greedy and will match all characters but I'm uncertain as to why two capture groups are needed. Also, I get that $ represents the end of the line but am unsure of what $1$2 is actually doing here. Appreciate any insight provided.
Attempted to research capture groups and the $ for some similar examples with explanations but came up short.
You are correct. (.*)(.*) makes no sense. The second .* will always match the empty string.
For example, matching against file,
the first .* will match the 4 character string starting at position 0 (file), and
the second .* will match the 0 character string starting at position 4 (empty string).
You could simplify the command to any of the following:
rename -n 's/(.*)/new$1/' *
rename -n 's/.*/new$&/' *
rename -n 's/^/new/' *
rename -n '$_ = "new$_"' *
rename -n '$_ = "new" . $_' *
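The claim that the second group is always empty can be checked without renaming anything. The sketch below uses sed -E instead of the rename command from the question (my substitution, purely for illustration), since (...) grouping behaves the same way here. The variable names shown and renamed are hypothetical.

```shell
# The first greedy .* takes the whole name, so the second group is empty.
shown=$(printf '%s\n' file1 | sed -E 's/(.*)(.*)/[\1][\2]/')
printf '%s\n' "$shown"     # [file1][]

# Prefixing "new" needs no capture group at all: just anchor at the start.
renamed=$(printf '%s\n' file1 | sed 's/^/new/')
printf '%s\n' "$renamed"   # newfile1
```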
I don't know that rename command. The regular expression looks like sed syntax. If that is the case (as in many other regex forms), it has 3 parts:
s for substitute
everything between the first two slashes (.*)(.*) to specify what to match
everything between the 2nd and 3rd slash new$1$2 is the replacement
$ only means end of line in the first part (the pattern). In the second part (the replacement), $ followed by a number refers to a capture group: $1 is the first group, $2 the second, and so on. In many regex flavors $0 is the whole matched text, although in Perl the whole match is $& instead.
You are right that .* is greedy and it's pointless to have that repeated. Maybe there was a \. in between and that was an attempt to capture file name and extension. There are better ways to parse file names, like basename. So you could simplify the command to rename -n 's/(.*)/new$1/' *

How to take first numbers and ignore the rest using sed

I have this line of code. The sed is taking 10 and 5. How can I extract 10 and ignore 5?
$ NUMBER=$(echo "The food is ready in 10mins and transport is coming in 5mins." | sed 's/[^0-9]*//g') ; echo $NUMBER
You're just removing everything that isn't a digit, so both numbers are left and run together as 105.
Instead, use a capture group to capture the first number, and use that in the replacement.
sed 's/^[^0-9]*\([0-9]*\).*/\1/'
In the regular expression:
^ matches the beginning of the line
[^0-9]* matches all non-digits at the beginning of the line
\( and \) surround a capture group
[0-9]* matches digits. These are captured in the group
.* matches the rest of the line.
In the replacement:
\1 copies the part of the line that was matched by the capture group. These are the first set of digits on the line.
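Put together with the sentence from the question, the difference between the two commands looks like this. It is a small sketch; the variable names line and ALL are illustrative additions.

```shell
line='The food is ready in 10mins and transport is coming in 5mins.'

# Original attempt: deletes every non-digit, so the two numbers are glued together.
ALL=$(printf '%s\n' "$line" | sed 's/[^0-9]*//g')
echo "$ALL"      # 105

# Capture-group version: keeps only the first run of digits.
NUMBER=$(printf '%s\n' "$line" | sed 's/^[^0-9]*\([0-9]*\).*/\1/')
echo "$NUMBER"   # 10
```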

Finding a common pattern from input lines using single command

I have the lines below. I want only the part of each line that ends in .script and follows the last /, using a single command.
Input lines
hello/world/command_altr.program_for_input.script
hello/world/script/deleted_the_input.program_for_output.script
/com/bash/hastag/welcome/program -u util/basic/level/learning
Output expected :
command_altr.program_for_input.script
deleted_the_input.program_for_output.script
If I properly understand your problem, this can be solved with sed.
sed -n -r -e 's/^.*\/([^/]+\.script).*$/\1/p'
It extracts the given pattern and prints it. Lines without the pattern are discarded.
^.*\/ matches from the beginning of the line, through any characters, up to a /. Because .* is greedy, this is the last / that still lets the rest of the pattern match.
([^/]+\.script) matches one or more characters other than /, followed by a literal dot and the string "script". The parentheses make sed remember the matched text.
.*$ matches any remaining characters up to the end of the line.
/\1/ replaces the whole line with the remembered text.
The p flag prints the line if the substitution was made.
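Run against the three input lines from the question, the command behaves as described. In this sketch -E is used in place of -r (both select extended regular expressions in GNU sed), and the variable name out is illustrative.

```shell
out=$(printf '%s\n' \
  'hello/world/command_altr.program_for_input.script' \
  'hello/world/script/deleted_the_input.program_for_output.script' \
  '/com/bash/hastag/welcome/program -u util/basic/level/learning' |
  sed -n -E 's/^.*\/([^/]+\.script).*$/\1/p')
printf '%s\n' "$out"
# command_altr.program_for_input.script
# deleted_the_input.program_for_output.script
```

The third line contains no .script, so the substitution fails and -n keeps it from being printed.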

Ignoring lines with blank or space after character using sed

I am trying to use sed to extract some assignments being made in a text file. My text file looks like ...
color1=blue
color2=orange
name1.first=Ahmed
name2.first=Sam
name3.first=
name4.first=
name5.first=
name6.first=
Currently, I am using sed to print all the strings after the name#.first's ...
sed 's/name.*.first=//' file
But of course, this also prints all of the lines with no assignment ...
Ahmed
Sam
# I'm just putting this comment here to illustrate the extra carriage returns above; please ignore it
Is there any way I can get sed to ignore the lines with blank or whitespace only assignments and store this to an array? The number of assigned name#.first's is not known, nor are the number of assignments of each type in general.
This is a slight variation on sputnick's answer:
sed -n '/^name[0-9]\.first=\(.\+\)/ s//\1/p'
The first part (/^name[0-9]\.first=\(.\+\)/) selects the lines you want to pass to the s/// command. The empty pattern in the s command re-uses the previous regular expression and the replacement portion (\1) replaces the entire match with the contents of the first parenthesized part of the regex. Use the -n and p flags to control which lines are printed.
sed -n 's/^name[0-9]\.\w\+=\(\w\+\)/\1/p' file
Output
Ahmed
Sam
Explanations
the -n switch suppresses the default behavior of sed: printing all lines
s/// is the skeleton for a substitution
^ matches the beginning of a line
name is a literal string
[0-9] is a single digit
\.\w\+ is a literal dot (without the backslash it would mean any character) followed by at least one (\+) word character [a-zA-Z0-9_]
\( \) is a capturing group and \1 refers to the captured text
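Both commands above rely on GNU extensions (\+ and \w), and the .\+ version would also accept a value made only of spaces. A possible variant that sticks to POSIX syntax and skips whitespace-only values is sketched below. The pattern is my own adaptation rather than either answerer's, and the sample data (including the whitespace-only name4 line) and variable names are made up for illustration.

```shell
input='color1=blue
color2=orange
name1.first=Ahmed
name2.first=Sam
name3.first=
name4.first=   '

# Require at least one non-whitespace character after the "=".
out=$(printf '%s\n' "$input" |
  sed -n 's/^name[0-9][0-9]*\.first=[[:space:]]*\([^[:space:]].*\)$/\1/p')
printf '%s\n' "$out"
# Ahmed
# Sam
```

[0-9][0-9]* also allows multi-digit numbers such as name12, which the single [0-9] in the answers would not match.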

explain part of sed expression - *\1$/p

This code outputs lines where the first and last digits are the same. Could somebody explain in English how this works?
seq 1000 | sed -nr -e '/^([0-9])([0-9])*\1$/p'
outputs:
11
22
33 etc
I know it looks for a number at the start ^ and then another number but I am unclear how this works with the \1$ to get the answer?
Actually, what this matches is any digit:
([0-9])
followed by any number of digits
([0-9])*
followed by the first digit again
\1
\1 is a backreference to the first parenthesized group.
Note that the digits in the middle are unconstrained:
$ seq 8000 | sed -nr -e '/^([0-9])([0-9])*\1$/p' | tail
7907
7917
7927
7937
7947
7957
7967
7977
7987
7997
It looks for a digit at the start, followed by zero or more digits (notice the star after the second parenthesis), and finally checks for \1 at the end, which stands for exactly the same text that the first parenthesis matched.
\1 is the "first matched term".
$ is the "end of line".
So \1$ means "match the same term (i.e. the digit 0-9) found at the start of the string again at the end of the string".
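A shorter range makes the behaviour easy to check by eye. This sketch uses seq 130 (an arbitrary limit chosen for illustration) and -E in place of -r; the variable name out is illustrative.

```shell
# Numbers from 1 to 130 whose first and last digits are equal,
# joined onto one line for readability.
out=$(seq 130 | sed -n -E '/^([0-9])([0-9])*\1$/p' | paste -s -d ' ' -)
printf '%s\n' "$out"
# 11 22 33 44 55 66 77 88 99 101 111 121
```

Single-digit numbers are absent because the pattern needs one digit for the group and a second one for \1.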
It starts by matching the start of the line, then the parenthesis is a group (which can be referenced later) containing one digit 0-9. That group is followed by another group, also with one digit, and this second group can be repeated 0 or more times. After that there is a reference to the first group (the \1) and finally a match for the end of the line.
So, basically it just says last digit must be same as first digit and there can be any number of digits between them.
There is no need to group the middle digits since they are never referenced, so it could be rewritten as this:
sed -nr -e '/^([0-9])[0-9]*\1$/p'
If you instead wanted the last digit to be the same as the first and the second-to-last to be the same as the second, so that you match 1221 and 245642 but not 2424, then you could use
sed -nr -e '/^([0-9])([0-9])[0-9]*\2\1$/p'
Try it with seq 100000
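For a quick check that does not require scanning a long list, the three numbers mentioned above can be fed in directly. This is a small sketch using -E in place of -r; the extra input 121 and the variable name out are my additions.

```shell
# 1221 and 245642 end with their first two digits reversed; 2424 and 121 do not.
out=$(printf '%s\n' 1221 245642 2424 121 |
  sed -n -E '/^([0-9])([0-9])[0-9]*\2\1$/p')
printf '%s\n' "$out"
# 1221
# 245642
```

121 is rejected because the pattern needs at least four digits: two for the groups and two more for \2\1.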

Resources