What is going wrong with this sed command - bash

I have to replace the first occurrence of the word 'the' with 'this' in each line of the input text; the search must be case-sensitive.
The following is my command for the task, but it goes wrong:
sed 's/\Wthe\W/this/'
The problem I found was similar to this simulated case :
Input-text : as the word
Output-text(correct) : as that word
Output-text : asthatword (what the command is producing).

\W is a Perl-style escape that is not part of POSIX BRE or ERE, so standard sed does not support it (GNU sed accepts it as an extension).
sed -E 's/(^|[[:space:]])the([[:space:]]|$)/\1this\2/'
The -E option selects ERE syntax, which the unescaped ( ) and | require. In ^|[[:space:]], ^ matches the beginning of the line and [[:space:]] matches any whitespace character. Putting this inside parentheses creates a capture group which can be referred to later with \1 (since this is the first such group).
[[:space:]]|$ does the same, but with $ indicating end-of-line.
That said -- if you're targeting only GNU sed, and not POSIX sed, you might instead consider:
sed 's/\<the\>/this/'
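A small sketch of the capture-group approach, assuming a sed that accepts -E (GNU and BSD sed both do). The function name replace_first_the is made up for illustration, and [^[:alnum:]_] is used here as a portable stand-in for \W rather than the [[:space:]] class from the answer.

```shell
# Hypothetical helper: replace the first whole-word "the" on each line.
# [^[:alnum:]_] stands in for \W; ^ and $ cover words at the line edges.
replace_first_the() {
    sed -E 's/(^|[^[:alnum:]_])the([^[:alnum:]_]|$)/\1this\2/'
}

printf '%s\n' 'as the word' 'the end' 'bathe the other' | replace_first_the
# as this word
# this end
# bathe this other
```

Because there is no g flag, only the first whole-word match on each line is replaced, and "bathe" is left alone since its "the" is preceded by a word character.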

You are replacing the non-word characters (blanks in this case) as well. An easy way around this is
sed 's/\Wthe\W/ that /'
or
sed 's/\(\W\)the\(\W\)/\1that\2/'
to keep the original non-word-characters.

I'm assuming that you're using \W to make this a whole word search and replace. Try using \b to set word boundaries instead:
sed 's/\bthe\b/this/'

You will need to capture the non-word characters surrounding the word "the" and then use back-references in the replacement:
s='as the word'
sed 's/\(\W\)the\(\W\)/\1this\2/' <<< "$s"
as this word
or, better, use word boundaries:
sed 's/\bthe\b/this/' <<< "$s"
as this word
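One reason the word-boundary form may be preferable: \W has to consume a character, so it cannot match a "the" at the very start or end of a line. The sketch below assumes GNU sed (both \W and \b are GNU extensions); the function names are hypothetical.

```shell
# GNU sed assumed. Hypothetical wrappers around the two commands above.
with_nonword() { sed 's/\(\W\)the\(\W\)/\1this\2/'; }
with_boundary() { sed 's/\bthe\b/this/'; }

s='the cat and the dog'
printf '%s\n' "$s" | with_nonword    # the cat and this dog
printf '%s\n' "$s" | with_boundary   # this cat and the dog
```

The \W version skips the leading "the" because no character precedes it, so it replaces the second occurrence instead of the first.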

Related

Replace a specific character at any word's begin and end in bash

I need to remove the hyphen '-' character only when it matches the pattern 'space-[A-Z]' or '[A-Z]-space'. (Assuming all letters are uppercase, and space could be a space, or newline)
sample.txt
I AM EMPTY-HANDED AND I- WA-
-ANT SOME COO- COOKIES
I want the output to be
I AM EMPTY-HANDED AND I WA
ANT SOME COO COOKIES
I've looked around for answers using sed and awk and perl, but I could only find answers relating to removing all characters between two patterns or specific strings, but not a specific character between [A-Z] and space.
Thanks heaps!!
If perl is your option, would you try the following:
perl -pe 's/(^|(?<=\s))-(?=[A-Z])//g; s/(?<=[A-Z])-((?=\s)|$)//g' sample.txt
(?<=\s) is a zero-width lookbehind assertion which matches leading
whitespace without including it in the matched substring.
(?=[A-Z]) is a zero-width lookahead assertion which matches trailing
character between A and Z without including it in the matched substring.
As a result, only the dash characters which match the pattern above are
removed from the original text.
The second statement s/..//g is the flipped version of the first one.
Could you please try following.
awk '{for(i=1;i<=NF;i++){if($i ~ /^-[a-zA-Z]+$|^[a-zA-Z]+-$/){sub(/-/,"",$i)}}} 1' Input_file
Adding a non-one liner form of solution:
awk '
{
for(i=1;i<=NF;i++){
if($i ~ /^-[a-zA-Z]+$|^[a-zA-Z]+-$/){
sub(/-/,"",$i)
}
}
}
1
' Input_file
Output will be as follows.
I AM EMPTY-HANDED AND I WA
ANT SOME COO COOKIES
If you can provide Extended Regular Expressions to sed (generally with the -E or -r option), then you can shorten your sed expression to:
sed -E 's/(^|\s)-(\w)/\1\2/g;s/(\w)-(\s|$)/\1\2/g' file
Where the basic form is sed -E 's/find1/replace1/g;s/find2/replace2/g' file which can also be written as separate expressions sed -E -e 's/find1/replace1/g' -e 's/find2/replace2/g' (your choice).
The details of s/find1/replace1/g are:
find1 is
(^|\s) locate and capture at the beginning or whitespace,
followed by the '-' hyphen,
then capture the next \w (word-character); and
replace1 is simply \1\2 reinsert both captures with the first two backreferences.
The next substitution expression is similar, except now you are looking for the hyphen followed by a whitespace or at the end. So you have:
find2 being
a capture of \w (word-character),
followed by the hyphen,
followed by a capture of either a following space or the end (\s|$), then
replace2 is the same as before, just reinsert the captured characters using backreferences.
In each case the g indicates a global replace of all occurrences.
(note: the \w word-character also includes the '_' (underscore), so while unlikely you would have a hyphen and underscore together, if you do, you need to use the [A-Za-z] list instead of \w)
Example Use/Output
In your case, the output is:
$ sed -E 's/(^|\s)-(\w)/\1\2/g;s/(\w)-(\s|$)/\1\2/g' file
I AM EMPTY-HANDED AND I WA
ANT SOME COO COOKIES
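Following the note about the underscore, here is a sketch of the same two substitutions with [A-Z] in place of \w and the POSIX class [[:space:]] in place of \s. It assumes a sed that accepts -E; the function name is hypothetical, and LC_ALL=C is set so that [A-Z] means ASCII uppercase only.

```shell
# Hypothetical helper: drop a hyphen only at the start or end of an
# uppercase word. LC_ALL=C keeps [A-Z] limited to ASCII uppercase letters.
strip_edge_hyphens() {
    LC_ALL=C sed -E 's/(^|[[:space:]])-([A-Z])/\1\2/g; s/([A-Z])-([[:space:]]|$)/\1\2/g'
}

printf '%s\n' 'I AM EMPTY-HANDED AND I- WA-' '-ANT SOME COO- COOKIES' | strip_edge_hyphens
# I AM EMPTY-HANDED AND I WA
# ANT SOME COO COOKIES
```

With this variant a hyphen next to an underscore, as in A_- B, is left untouched.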
remove the hyphen '-' character only when it matches the pattern 'space-[A-Z]' or '[A-Z]-space'. Assuming all letters are uppercase, and space could be a space, or newline
It's:
sed 's/\( \|^\)-\([A-Z]\)/\1\2/g; s/\([A-Z]\)-\( \|$\)/\1\2/g'
s - substitute
/
\( \|^\) - space or beginning of the line
- - hyphen...
\([A-Z]\) - a single upper case character
/
\1\2 - The \1 is replaced by the first \(...\) thing. So it is replaced by a space or nothing. \2 is replaced by the single upper case character found. Effectively - is removed.
/
g apply the regex globally
; - separate two s commands
s
Same as above. The $ means end of the line.
awk '{gsub(/ -/," ");gsub(/^-|-$/,"");gsub(/- /," ")}1' file
I AM EMPTY-HANDED AND I WA
ANT SOME COO COOKIES

Extract text between two special characters

Trying to extract the text between the special characters "\ and \" through sed
Ex: "\hell##$\"},
expected output : hell##$
You can do it quite easily with using a capture-group and backreference with basic regular-expressions:
sed 's/^["][\]\([^\]*\).*$/\1/'
Explanation
Normal substitution sed 's/find/replace/', where
find is ^["][\] a double-quote and \ before beginning the capture \(...\) which contains [^\]* (zero or more characters not a \), the closing of the capture \) and then .*$ the remainder of the string;
replace is \1 (the first backreference) containing the text captured between \(...\).
(note: if your "\ doesn't begin the string, remove the first '^' anchor)
Example
$ echo '"\hell##$\"},' | sed 's/^["][\]\([^\]*\).*$/\1/'
hell##$
Look things over and let me know if you have questions.
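If the "\ marker is not at the start of the line, dropping the ^ anchor alone would leave the preceding text in place. A possible variation, sketched here with a hypothetical function name, swallows everything before the opening marker and prints only lines that actually contain both markers:

```shell
# Hypothetical helper: print the text between "\ and \" wherever the pair
# appears on the line; lines without both markers are suppressed by -n.
extract_between() {
    sed -n 's/.*"\\\([^\\]*\)\\".*/\1/p'
}

printf '%s\n' 'key: "\hell##$\"},' 'no markers here' | extract_between
# hell##$
```

This assumes the captured text itself contains no backslash, the same assumption the [^\]* in the original command makes.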
This might work for you (GNU sed):
sed -nE '/"\\[^\\]*\\+([^\\"][^\\]*\\+)*"/{s/"\\/\n/;s/.*\n//;s/\\"/\n/;P;D}' file
The solution comes in two parts:
Firstly, a regexp to determine whether a pair of the two delimiters exists. This can be tricky because a negated class alone is insufficient: edge cases can easily defeat a simplistic approach.
Secondly, once such a pair does exist, the text between them must be extracted piecemeal.

sed substitute whitespace for dash only between specific character patterns

I have lines like these:
ORIGINAL
sometext1 sometext2 word:A12 B34 C56 sometext3 sometext4
sometext5 sometext6 word:A123 B45 C67 sometext7 sometext8
sometext9 sometext10 anotherword:(someword1 someword2 someword3) sometext11 sometext12
EDITED
asdjfkklj lkdsjfic kdiw:A12 B34 C56 lksjdfioe sldkjflkjd
lknal niewoc kdiw:A123 B45 C678 oknes lkwid
cnqule nkdal anotherword:(kdlklks inlqok mncvmnx) unqieo lksdnf
Desired output:
asdjfkklj lkdsjfic kdiw:A12-B34-C56 lksjdfioe sldkjflkjd
lknal niewoc kdiw:A123-B45-C678 oknes lkwid
cnqule nkdal anotherword:(kdlklks-inlqok-mncvmnx) unqieo lksdnf
EDITED: Would this be more explicit? But frankly this is much more difficult to read and answer than writing sometext#. I do not know people's preference.
I only want to replace the whitespaces with dashes after A alphabet letter followed by some digits AND replace the whitespaces with dashes between the words between the two parentheses. And not any other whitespaces in the line. Would appreciate an explanation of the syntax too.
Thanks!
This might work for you (GNU sed):
sed -r ':a;s/(A[0-9]+(-[A-Z][0-9]+)*) ([A-Z][0-9]+)/\1-\3/;ta;s/(\(\S+(-\S+)*) (\S+( \S+)*\))/\1-\3/;ta' file
Iteratively replace the space(s) in the required strings using a regexp and back references.
This code works well:
darby#Debian:~/Scrivania$ cat test.txt | sed -r 's#\s+([A-Z][0-9]+)#-\1#g' | sed ':l s/\(([^ )]*\)[ ]/\1-/;tl'
asdjfkklj lkdsjfic kdiw:A12-B34-C56 lksjdfioe sldkjflkjd
lknal niewoc kdiw:A123-B45-C678 oknes lkwid
cnqule nkdal anotherword:(kdlklks-inlqok-mncvmnx) unqieo lksdnf
Explanation of my regex
In the first regex
Options
-r Enable extended regular expressions
Pattern
\s+ One or more space characters
([A-Z][0-9]+) Submatch an uppercase letter and one or more digits
Replace
- Dash character
\1 Previous submatch
Note
The g after delimiters ///g is for global substitution.
In the second regex
Pattern
:l label branched to by t or b
tl jump to label if any substitution has been made on the pattern space since the most recent reading of input line or execution of command 't'. If label is not specified, then jump to the end of the script. This is a conditional branch
\(([^ )]*\) capture a literal opening parenthesis and everything after it, stopping at the first space or closing parenthesis
[ ] one space character
Replace
\1 Previous submatch
- Add a dash
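The label-and-branch loop described above can be tried on its own. In this sketch the label and the t command are passed as separate -e expressions, which avoids relying on ; after a label; the function name is hypothetical.

```shell
# Hypothetical helper: turn spaces inside (...) into dashes, one space per
# pass. "t l" jumps back to ":l" for as long as a substitution succeeded.
dash_inside_parens() {
    sed -e ':l' -e 's/\(([^ )]*\) /\1-/' -e 'tl'
}

printf '%s\n' 'cnqule nkdal anotherword:(kdlklks inlqok mncvmnx) unqieo lksdnf' | dash_inside_parens
# cnqule nkdal anotherword:(kdlklks-inlqok-mncvmnx) unqieo lksdnf
```

Each pass extends the captured prefix by one word, so the loop ends once the text after the prefix is the closing parenthesis rather than a space.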
You need to capture the first alphanumeric group using () and then the second group. Then you can simply replace them using the backreferences \1 and \2:
using sed twice
sed -E 's/(\b[A-Za-z][0-9]+) ([A-Z])/\1-\2/g' | sed -E 's/(\b[A-Za-z][0-9]+) ([A-Z])/\1-\2/g'
or using perl (with the lookahead (?=...), the regex doesn't consume the 2nd group)
perl -pe 's/(\b[A-Za-z][0-9]+) (?=[A-Z])/$1-/g'
\b word boundary
[A-Za-z] 1 letter
[0-9]+ 1 or more digits
sed doesn't support lookahead and lookbehind functionality
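The reason a single pass with g is not enough is that each match consumes the letter that would start the next match. The sketch below, assuming GNU sed (for \b) and using hypothetical function names, compares one pass with a branch loop that could replace the second sed process:

```shell
# GNU sed assumed (\b). LC_ALL=C keeps [A-Z] limited to ASCII uppercase.
one_pass() {
    LC_ALL=C sed -E 's/(\b[A-Za-z][0-9]+) ([A-Z])/\1-\2/g'
}
# Same substitution without g, repeated until nothing changes.
looped() {
    LC_ALL=C sed -E -e ':a' -e 's/(\b[A-Za-z][0-9]+) ([A-Z])/\1-\2/' -e 'ta'
}

printf '%s\n' 'kdiw:A12 B34 C56 oknes' | one_pass   # kdiw:A12-B34 C56 oknes
printf '%s\n' 'kdiw:A12 B34 C56 oknes' | looped     # kdiw:A12-B34-C56 oknes
```

In the single pass, the B of B34 is already part of the first match, so "B34 C" is never examined; the loop restarts the search from the beginning of the line after every replacement.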

Bash: remove semicolons from a line in a CSV-file

I have a CSV file with a few hundred lines, and a lot (not all) of these lines contain data (Klas/Lesgroep:;;T2B1) which I want to extract.
i.e. ;;;;;;Klas/Lesgroep:;;T2B1;;;;;;;;;;
I want to delete the semicolons which are in front of Klas/Lesgroep but the number of semicolons is variable. How can I delete these semicolons in Bash ?
I'm not a native speaking Englishman so I hope it's clear to you
To remove any nonempty run of ; chars. that come directly before literal Klas/Lesgroep:
With GNU or BSD/macOS sed:
$ sed -E 's|;+(Klas/Lesgroep)|\1|' <<< ";;;;;;Klas/Lesgroep:;;T2B1;;;;;;;;;;"
Klas/Lesgroep:;;T2B1;;;;;;;;;;
The s function performs string substitution (replacement):
The 1st argument is a regex (regular expression) that specifies what part of the line to match,
and the 2nd argument specifies what to replace the matching part with.
Note how I've chosen | as the regex/argument delimiter instead of the customary /, because that allows unescaped use of / chars. inside the regex.
;+ matches one or more directly adjacent ; chars.
(Klas/Lesgroep) matches literal Klas/Lesgroep and by enclosing it in (...) - making it a capture group - the match is remembered and can be referenced as \1 - the 1st capture group in the regex - in the replacement argument to s.
The net effect is that all ; chars. directly preceding Klas/Lesgroep are removed.
POSIX-compliant form:
$ sed 's|;\{1,\}\(Klas/Lesgroep\)|\1|' <<< ";;;;;;Klas/Lesgroep:;;T2B1;;;;;;;;;;"
Klas/Lesgroep:;;T2B1;;;;;;;;;;
POSIX requires the less powerful and antiquated BRE syntax, where duplication symbol + must be emulated as \{1,\}, and, generally, metacharacters (, ), {, } must be \-escaped.
With sed you can search for lines containing at least one semicolon followed by Klas/Lesgroep and, if found, replace the leading ; characters with nothing:
$ sed '/;;*Klas\/Lesgroep/s/^;*//g' <<< ";;;;;;Klas/Lesgroep:;;T2B1;;;;;;;;;;"
Klas/Lesgroep:;;T2B1;;;;;;;;;;
To remove all ";" from a file, we can use the sed command, which is used for modifying files.
$ sed 's/find/replace/g' file
The substitute flag /g (global replacement) specifies the sed command to replace all the occurrences of the string in the line.
So to remove ";" just find and replace it with nothing.
sed 's/;//g' file.csv

Ignoring lines with blank or space after character using sed

I am trying to use sed to extract some assignments being made in a text file. My text file looks like ...
color1=blue
color2=orange
name1.first=Ahmed
name2.first=Sam
name3.first=
name4.first=
name5.first=
name6.first=
Currently, I am using sed to print all the strings after the name#.first's ...
sed 's/name.*.first=//' file
But of course, this also prints all of the lines with no assignment ...
Ahmed
Sam
# I'm just putting this comment here to illustrate the extra carriage returns above; please ignore it
Is there any way I can get sed to ignore the lines with blank or whitespace only assignments and store this to an array? The number of assigned name#.first's is not known, nor are the number of assignments of each type in general.
This is a slight variation on sputnick's answer:
sed -n '/^name[0-9]\.first=\(.\+\)/ s//\1/p'
The first part (/^name[0-9]\.first=\(.\+\)/) selects the lines you want to pass to the s/// command. The empty pattern in the s command re-uses the previous regular expression and the replacement portion (\1) replaces the entire match with the contents of the first parenthesized part of the regex. Use the -n and p flags to control which lines are printed.
sed -n 's/^name[0-9]\.\w\+=\(\w\+\)/\1/p' file
Output
Ahmed
Sam
Explanations
the -n switch suppresses the default behavior of sed: printing all lines
s/// is the skeleton for a substitution
^ matches the beginning of a line
name literal string
[0-9] a single digit
\.\w\+ a literal dot (without the backslash, . means any character) followed by at least one (\+) word character [a-zA-Z0-9_]
( ) is a capturing group and \1 is the captured group
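Neither command above explicitly rejects a value that consists only of whitespace, and [0-9] matches a single digit, so name10 would be skipped. A possible variation using only POSIX BRE features is sketched below; the function name and the extra sample lines are made up for illustration.

```shell
# Hypothetical helper: print the value of every name<N>.first assignment
# that contains at least one non-whitespace character.
extract_first_names() {
    sed -n 's/^name[0-9][0-9]*\.first=[[:space:]]*\([^[:space:]].*\)$/\1/p'
}

printf '%s\n' 'color1=blue' 'name1.first=Ahmed' 'name2.first=Sam' \
    'name3.first=' 'name4.first=   ' 'name10.first=Lee' | extract_first_names
# Ahmed
# Sam
# Lee
```

In bash 4 or later, the result could then be read into an array with something like mapfile -t names < <(extract_first_names < file), where file stands for the input file from the question.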

Resources