Unix Shell Script to extract the number from the String - bash

How to extract the bold number from the string below using unix shell script?
17: H.0(-2073):File ID (40008)in xyz file not equal to the file ID(**40004**)in file header.
Thanks :)

echo '17: H.0(-2073):File ID (40008)in xyz file not equal to the file ID(40004)in file header.' | sed -e 's/.*(\([0-9]*\)).*/\1/'
The second part of this line runs sed with the s (substitute) command. The part between the first two slashes (/) is a regular expression that matches the following:
Everything (.*), greedily, i.e. up to the last occurrence of a run of digits in parentheses ( ([0-9]*) ), and then everything again (.*) until the end of the line. The text matched between \( and \) (i.e. 40004 in this case) is remembered for use in the second part of the s command.
The part between the second and third / is what replaces the text matched by the regular expression. Here it is \1, a reference to the substring matched between the first \( and \), which is 40004 in our case.
So the part after | replaces the whole input string with the string 40004 extracted from it. Regular expressions are a powerful but often write-only technique, so I hope this explanation brings a bit more clarity.
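A minimal sketch of the same idea wrapped in a function. The name last_paren_number is hypothetical, and two details differ from the command above as assumptions of this sketch: the pattern requires at least one digit, and -n with the p flag makes sed print nothing when no parenthesized number is found (the original command would echo the line unchanged).

```shell
# last_paren_number is a hypothetical helper name, not part of the original answer.
last_paren_number() {
    # Greedy .* pushes the match to the LAST "(digits)" on the line.
    # -n plus the p flag prints only when the substitution actually happened.
    printf '%s\n' "$1" | sed -n 's/.*(\([0-9][0-9]*\)).*/\1/p'
}

line='17: H.0(-2073):File ID (40008)in xyz file not equal to the file ID(40004)in file header.'
last_paren_number "$line"              # prints 40004
last_paren_number 'no numbers here'    # prints nothing
```

(-2073) is skipped because the minus sign is not a digit, and (40008) is skipped because the greedy .* prefers the later match.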

Related

Regular expressions, capture groups, and the dollar sign

I'm reading a book about bash, and it introduced regular expressions (I'm pretty new to them) with an example:
rename -n 's/(.*)(.*)/new$1$2/' *
'file1' would be renamed to 'newfile1'
'file2' would be renamed to 'newfile2'
'file3' would be renamed to 'newfile3'
There wasn't really a breakdown provided with this example, unfortunately. I kind of get what capture groups are and that .* is greedy and will match all characters but I'm uncertain as to why two capture groups are needed. Also, I get that $ represents the end of the line but am unsure of what $1$2 is actually doing here. Appreciate any insight provided.
Attempted to research capture groups and the $ for some similar examples with explanations but came up short.
You are correct. (.*)(.*) makes no sense. The second .* will always match the empty string.
For example, matching against file,
the first .* will match the 4 character string starting at position 0 (file), and
the second .* will match the 0 character string starting at position 4 (empty string).
You could simplify the pattern to
rename -n 's/(.*)/new$1/' *
rename -n 's/.*/new$&/' *
rename -n 's/^/new/' *
rename -n '$_ = "new$_"' *
rename -n '$_ = "new" . $_' *
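The claim that the second .* always matches the empty string can be checked without renaming anything. This is only a sketch: it uses sed -E as a stand-in for rename's Perl regex (so the groups are written \1 and \2 instead of $1 and $2), and the helper names show_groups and add_prefix are made up for illustration.

```shell
# show_groups and add_prefix are hypothetical helper names.
show_groups() {
    # Wrap each capture group in brackets so an empty group is visible.
    printf '%s\n' "$1" | sed -E 's/(.*)(.*)/[\1][\2]/'
}

add_prefix() {
    # Same effect as the simplified s/^/new/ pattern above.
    printf '%s\n' "$1" | sed 's/^/new/'
}

show_groups file1    # prints [file1][]
add_prefix file1     # prints newfile1
```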
I don't know that rename command. The regular expression looks like sed syntax. If that is the case (as in many other regex forms), it has 3 parts:
s for substitute
everything between the first two slashes (.*)(.*) to specify what to match
everything between the 2nd and 3rd slash new$1$2 is the replacement
$ only means end of line in the first part (the pattern). In the second part (the replacement), $ followed by a number refers to a capture group: $1 is the first group, $2 the second, and so on, with $0 being the whole matched text in some tools (Perl uses $& for that).
You are right that .* is greedy and it's pointless to have that repeated. Maybe there was a \. in between and that was an attempt to capture file name and extension. There are better ways to parse file names, like basename. So you could simplify the command to rename -n 's/(.*)/new$1/' *
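If the book's pattern really did have a \. between the two groups (that is a guess, nothing in the question confirms it), both groups would capture something useful. A small sketch using sed -E, where \1 and \2 play the role of $1 and $2; show_name_ext is a hypothetical helper name.

```shell
# show_name_ext is a hypothetical helper name.
show_name_ext() {
    # The greedy first group runs up to the LAST dot; the second gets the rest.
    printf '%s\n' "$1" | sed -E 's/(.*)\.(.*)/name=\1 ext=\2/'
}

show_name_ext photo.jpg         # prints name=photo ext=jpg
show_name_ext archive.tar.gz    # prints name=archive.tar ext=gz
```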

Extract text between two special characters

Trying to extract the text between the special characters "\ and \" through sed
Ex: "\hell##$\"},
expected output : hell##$
You can do it quite easily with using a capture-group and backreference with basic regular-expressions:
sed 's/^["][\]\([^\]*\).*$/\1/'
Explanation
This is a normal substitution, sed 's/find/replace/', where
find is ^["][\] (a double quote followed by a \ at the start of the line), then the capture group \(...\) containing [^\]* (zero or more characters that are not a \), then .*$ (the remainder of the string);
replace is \1 (the first backreference), which contains the text captured between \(...\).
(note: if your "\ doesn't begin the string, remove the first '^' anchor)
Example
$ echo '"\hell##$\"},' | sed 's/^["][\]\([^\]*\).*$/\1/'
hell##$
Look things over and let me know if you have questions.
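For a single string already held in a shell variable, the same extraction can be sketched with POSIX parameter expansion instead of sed. This is a swapped-in technique, not the method of the answer above, and between_markers is a hypothetical name. It assumes both markers are present; if they are missing, the input comes back unchanged.

```shell
# between_markers is a hypothetical helper name.
between_markers() {
    s=$1
    rest=${s#*\"\\}        # drop everything up to and including the first "\
    inner=${rest%%\\\"*}   # drop everything from the first \" onward
    printf '%s\n' "$inner"
}

between_markers '"\hell##$\"},'    # prints hell##$
```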
This might work for you (GNU sed):
sed -nE '/"\\[^\\]*\\+([^\\"][^\\]*\\+)*"/{s/"\\/\n/;s/.*\n//;s/\\"/\n/;P;D}' file
The solution comes in two parts:
Firstly, a regexp to determine whether a pair of the two-character delimiters exists. This can be tricky: a negated character class alone is insufficient, because edge cases can easily defeat a simplistic approach.
Secondly, once a pair of delimiters does exist, the text between them must be extracted piecemeal.

Bash: remove semicolons from a line in a CSV-file

I have a CSV file with a few hundred lines, and many (not all) of these lines contain data (Klas/Lesgroep:;;T2B1) which I want to extract.
e.g. ;;;;;;Klas/Lesgroep:;;T2B1;;;;;;;;;;
I want to delete the semicolons in front of Klas/Lesgroep, but the number of semicolons is variable. How can I delete these semicolons in Bash?
I'm not a native English speaker, so I hope this is clear.
To remove any nonempty run of ; chars. that come directly before literal Klas/Lesgroep:
With GNU or BSD/macOS sed:
$ sed -E 's|;+(Klas/Lesgroep)|\1|' <<< ";;;;;;Klas/Lesgroep:;;T2B1;;;;;;;;;;"
Klas/Lesgroep:;;T2B1;;;;;;;;;;
The s function performs string substitution (replacement):
The 1st argument is a regex (regular expression) that specifies what part of the line to match,
and the 2nd argument specifies what to replace the matching part with.
Note how I've chosen | as the regex/argument delimiter instead of the customary /, because that allows unescaped use of / chars. inside the regex.
;+ matches one or more directly adjacent ; chars.
(Klas/Lesgroep) matches literal Klas/Lesgroep and by enclosing it in (...) - making it a capture group - the match is remembered and can be referenced as \1 - the 1st capture group in the regex - in the replacement argument to s.
The net effect is that all ; chars. directly preceding Klas/Lesgroep are removed.
POSIX-compliant form:
$ sed 's|;\{1,\}\(Klas/Lesgroep\)|\1|' <<< ";;;;;;Klas/Lesgroep:;;T2B1;;;;;;;;;;"
Klas/Lesgroep:;;T2B1;;;;;;;;;;
POSIX requires the less powerful and antiquated BRE syntax, where duplication symbol + must be emulated as \{1,\}, and, generally, metacharacters (, ), {, } must be \-escaped.
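A short sketch of the POSIX form applied to more than one line, to show that lines without Klas/Lesgroep pass through untouched. The function name strip_before_klas and the second sample line are made up for illustration.

```shell
# strip_before_klas is a hypothetical helper name; it filters stdin to stdout.
strip_before_klas() {
    sed 's|;\{1,\}\(Klas/Lesgroep\)|\1|'
}

printf '%s\n' ';;;;;;Klas/Lesgroep:;;T2B1;;;;' 'a;b;;c' | strip_before_klas
# prints:
# Klas/Lesgroep:;;T2B1;;;;
# a;b;;c
```

To process a real file, redirect it into the function, e.g. strip_before_klas < file.csv.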
With sed you can select lines that contain at least one semicolon followed by Klas/Lesgroep and, on those lines only, replace the leading ; characters with nothing:
$ sed '/;;*Klas\/Lesgroep/s/^;*//g' <<< ";;;;;;Klas/Lesgroep:;;T2B1;;;;;;;;;;"
Klas/Lesgroep:;;T2B1;;;;;;;;;;
To remove all ; characters from a file, we can use the sed command. sed is a stream editor used for modifying text.
$ sed 's/find/replace/g' file
The g flag (global replacement) tells sed to replace all occurrences of the string in the line, not just the first.
So to remove ; just find it and replace it with nothing.
sed 's/;//g' file.csv
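Note that the global form also deletes the semicolons between Klas/Lesgroep: and T2B1. The sketch below contrasts it with a variant that only trims the runs at the two ends of the line; whether the trailing run should go as well is an assumption, and both function names are hypothetical.

```shell
# strip_all and trim_edges are hypothetical helper names; both filter stdin.
strip_all()  { sed 's/;//g'; }               # removes every semicolon
trim_edges() { sed 's/^;*//; s/;*$//'; }     # removes only leading and trailing runs

line=';;;;;;Klas/Lesgroep:;;T2B1;;;;;;;;;;'
printf '%s\n' "$line" | strip_all     # prints Klas/Lesgroep:T2B1
printf '%s\n' "$line" | trim_edges    # prints Klas/Lesgroep:;;T2B1
```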

SED usage and Regular expression

Can anyone explain the following command?
What happens if the same code is executed using grep?
sed 's/^.*-S[[:space:]]*\([^[:space:]]*\).*$/\1/'
As a sed command, it replaces a line containing something like -S paradox (possibly with text on either side of it) with just paradox. (There must be a space after paradox, or the end of the line, for exactly the string paradox to be printed; if there are non-space characters immediately after the word, then those are included in the output too, up to the first space.) For example, the input line someprog -x painter -S paradox file2 file93 yields paradox.
If you apply the expression to grep, then the ^ and the $ lose their special meanings, and it looks for a line such as:
schemas/^semicolon-S(comma)$/(comma)/gratitude
In the grep context, the \( and \) do remember a pattern (and in the example line, that pattern corresponds to (comma) — to confuse you, and me). The \1 then refers to the previously remembered string, the second (comma) in the example. You could drop all the parentheses in the sample line and it would be selected. If your version of grep supports the -o option to output only the text that matches, you can see more clearly which parts of the sample line match the regex.
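Both behaviours can be tried out directly. A sketch: s_option_value is a hypothetical name for the sed command, and the same script text is then handed to grep as a plain pattern, where the s/, the ^, the $ and the trailing / are matched literally.

```shell
# s_option_value is a hypothetical helper name; it filters stdin.
s_option_value() {
    sed 's/^.*-S[[:space:]]*\([^[:space:]]*\).*$/\1/'
}

cmdline='someprog -x painter -S paradox file2 file93'
printf '%s\n' "$cmdline" | s_option_value    # prints paradox

# The same text used as a grep pattern (BRE): it only matches lines that
# literally contain "s/^", a "$/" and a repeat of the captured text.
script='s/^.*-S[[:space:]]*\([^[:space:]]*\).*$/\1/'
sample='schemas/^semicolon-S(comma)$/(comma)/gratitude'
printf '%s\n' "$sample"  | grep -q -e "$script" && echo 'sample line matches'
printf '%s\n' "$cmdline" | grep -q -e "$script" || echo 'command line does not match'
```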

Sed - Replace immediate next string/word coming after a particular pattern

I'm a newbie to shell scripting and any help is much appreciated.
I have a pattern like this rmd_ver=1.0.10
I want to search the pattern rmd_ver= and replace the numeric part 1.0.10 with a new value in all the matches. Hope my question is clear.
To replace any value till the end of the line:
sed -i 's/\(rmd_ver=\)\(.*\)/\1R/' file
sed -i 's/p/r/' file    replace p with r in file
\(                      start first group
rmd_ver=                search pattern
\)                      end first group
\(                      start second group
.*                      any characters
\)                      end second group
\1                      back reference to the first group
R                       replacement text
To replace the exact pattern in any place of the line and possibly several times in one line:
sed -i 's/\(rmd_ver=\)\(1\.0\.10\)/\1R/g' file
\.                      escape special . into literal .
g                       to replace multiple occurrences in one line
If you are too lazy to repeat the pattern in the replacement (s/rmd_ver=1\.0\.10/rmd_ver=2.0.0/), store it in a group:
sed -e 's/\(rmd_ver=\)1\.0\.10/\12.0.0/'
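A sketch of the "replace everything after rmd_ver=" variant with the new value passed in as an argument. The function name set_rmd_ver and the sample lines are hypothetical. It assumes the new value contains no /, & or \ characters, since those are special in a sed replacement.

```shell
# set_rmd_ver is a hypothetical helper name; it filters stdin to stdout.
# Assumption: the new value ($1) contains no '/', '&' or '\' characters.
set_rmd_ver() {
    new=$1
    # Double quotes let the shell expand ${new}; \( \) and \1 reach sed unchanged.
    sed "s/\(rmd_ver=\).*/\1${new}/"
}

printf '%s\n' 'name=demo' 'rmd_ver=1.0.10' | set_rmd_ver 2.0.0
# prints:
# name=demo
# rmd_ver=2.0.0
```

Writing to stdout instead of using -i keeps the sketch free of side effects; redirect to a new file once the output looks right.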
From your description I think you just need the substitute command, with syntax s/from_regex/to_result/. To match a number like 1.0.10 you can match a repeat of digits or dots, e.g. [0-9.]. That is a rather simple regex in that it will allow a dot at the start and the end, but let's start with that. Then your sed command becomes
sed 's/rmd_ver=[0-9.]\+/rmd_ver=42/' filename
The + is a repeat operator (one or more of the preceding item). Since sed uses BRE (basic regular expression) syntax by default, it has to be escaped as \+, which is a GNU extension rather than POSIX.
If you want to avoid matching dots on the ends, as in 1.2.3., you will have to change the regex to [0-9][0-9.]*[0-9] to make sure that neither the first nor the last character is a dot. That form needs at least two characters, so if you also want to match a single digit, you have to add an alternative (e.g. a\|b matches a or b) to match that:
sed 's/rmd_ver=\([0-9][0-9.]*[0-9]\|[0-9]\)/rmd_ver=42/' filename
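A stricter alternative, sketched with extended regular expressions (sed -E, supported by GNU and BSD sed): digits, optionally followed by groups of a dot plus digits. This pattern is not the one from the answer above; it is an assumption about what a version number looks like, and replace_version is a hypothetical name.

```shell
# replace_version is a hypothetical helper name; it filters stdin.
replace_version() {
    # [0-9]+(\.[0-9]+)* : one or more digits, then zero or more ".digits" groups.
    sed -E 's/rmd_ver=[0-9]+(\.[0-9]+)*/rmd_ver=42/'
}

printf '%s\n' 'rmd_ver=1.0.10' | replace_version    # prints rmd_ver=42
printf '%s\n' 'rmd_ver=7'      | replace_version    # prints rmd_ver=42
printf '%s\n' 'rmd_ver=1.2.3.' | replace_version    # prints rmd_ver=42.
```

In the last case the trailing dot is left in place because it is not followed by a digit.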
sed 's/\(rmd_ver=\).*[[:digit:]]$/\1NEW_VAL/g'
Replace NEW_VAL with the new value you want.

Resources