sed substitute whitespace for dash only between specific character patterns - bash

I have lines like these:
ORIGINAL
sometext1 sometext2 word:A12 B34 C56 sometext3 sometext4
sometext5 sometext6 word:A123 B45 C67 sometext7 sometext8
sometext9 sometext10 anotherword:(someword1 someword2 someword3) sometext11 sometext12
EDITED
asdjfkklj lkdsjfic kdiw:A12 B34 C56 lksjdfioe sldkjflkjd
lknal niewoc kdiw:A123 B45 C678 oknes lkwid
cnqule nkdal anotherword:(kdlklks inlqok mncvmnx) unqieo lksdnf
Desired output:
asdjfkklj lkdsjfic kdiw:A12-B34-C56 lksjdfioe sldkjflkjd
lknal niewoc kdiw:A123-B45-C678 oknes lkwid
cnqule nkdal anotherword:(kdlklks-inlqok-mncvmnx) unqieo lksdnf
EDITED: Would this be more explicit? But frankly this is much more difficult to read and answer than writing sometext#. I do not know people's preference.
I only want to replace the spaces with dashes between tokens made of a letter followed by some digits, and the spaces between the words inside the two parentheses. No other whitespace in the line should change. I would appreciate an explanation of the syntax too.
Thanks!

This might work for you (GNU sed):
sed -r ':a;s/(A[0-9]+(-[A-Z][0-9]+)*) ([A-Z][0-9]+)/\1-\3/;ta;s/(\(\S+(-\S+)*) (\S+( \S+)*\))/\1-\3/;ta' file
Iteratively replace the space(s) in the required strings using a regexp and back references.
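To make the loop easier to follow, the same script can be laid out one command per line with comments. This is only a sketch of the command above, assuming GNU sed (for \S and for comments and indentation inside the script); the sample and result variables are made up for illustration.

```shell
# Two of the sample lines from the question.
sample='lknal niewoc kdiw:A123 B45 C678 oknes lkwid
cnqule nkdal anotherword:(kdlklks inlqok mncvmnx) unqieo lksdnf'

result=$(printf '%s\n' "$sample" | sed -E '
  # Loop start.
  :a
  # Join the next letter+digits token onto a chain that starts with A<digits>.
  s/(A[0-9]+(-[A-Z][0-9]+)*) ([A-Z][0-9]+)/\1-\3/
  # If that substitution succeeded, go back and look for another space.
  ta
  # Inside parentheses, turn one space between words into a dash.
  s/(\(\S+(-\S+)*) (\S+( \S+)*\))/\1-\3/
  # Repeat until neither substitution matches any more.
  ta
')
printf '%s\n' "$result"
```

Each pass replaces a single space, which is why the t command has to branch back to the label until nothing changes.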

This code works well:
darby#Debian:~/Scrivania$ cat test.txt | sed -r 's#\s+([A-Z][0-9]+)#-\1#g' | sed ':l s/\(([^ )]*\)[ ]/\1-/;tl'
asdjfkklj lkdsjfic kdiw:A12-B34-C56 lksjdfioe sldkjflkjd
lknal niewoc kdiw:A123-B45-C678 oknes lkwid
cnqule nkdal anotherword:(kdlklks-inlqok-mncvmnx) unqieo lksdnf
Explanation of the regexes
In the first sed command
Options
-r Enable extended regular expressions
Pattern
\s+ One or more whitespace characters
([A-Z][0-9]+) Capture an uppercase letter followed by one or more digits
Replacement
- A dash character
\1 The text captured by the first group
Note
The g flag after the final delimiter (s#...#...#g here) makes the substitution global, so every match on the line is replaced.
In the second sed command
Pattern
:l Defines a label named l that the t or b commands can branch to
tl Jumps to label l if any substitution has succeeded since the last input line was read or the last t command was executed. If no label is given, t jumps to the end of the script. This is a conditional branch.
\(([^ )]*\) Captures an opening parenthesis and everything after it up to the first space or closing parenthesis
[ ] A single space character
Replacement
\1 The text captured by the first group
- A dash
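One thing worth checking before relying on the first substitution is that it is not limited to the kdiw: field: any whitespace followed by an uppercase letter and digits is rewritten. A small sketch, assuming GNU sed and a made-up input line in which X9 is a hypothetical token outside the target field:

```shell
# Hypothetical input: X9 sits outside the kdiw: field but still matches.
out=$(printf '%s\n' 'see X9 kdiw:A12 B34' | sed -r 's#\s+([A-Z][0-9]+)#-\1#g')
printf '%s\n' "$out"   # see-X9 kdiw:A12-B34
```

If such tokens can occur elsewhere in the real data, a pattern anchored on the field itself may be safer.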

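You need to capture the first alphanumeric group and the second group using (). Then you can replace them using the backreferences \1 and \2.
Using sed twice (a single pass with g cannot start a new match at B34, because its first letter was already consumed by the previous match):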
sed -E 's/(\b[A-Za-z][0-9]+) ([A-Z])/\1-\2/g' | sed -E 's/(\b[A-Za-z][0-9]+) ([A-Z])/\1-\2/g'
or using perl (with the lookahead (?=...), the regex doesn't consume the second group):
perl -pe 's/(\b[A-Za-z][0-9]+) (?=[A-Z])/\1-/g'
\b word boundary
[A-Za-z] 1 letter
[0-9]+ 1 or more digits
sed doesn't support lookahead and lookbehind functionality
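Running sed twice only covers chains of three tokens. A possible alternative, sketched here under the assumption of GNU sed (for \b), is to drop the g flag and loop with a label until no substitution succeeds. The variable names are illustrative only.

```shell
line='kdiw:A12 B34 C56 oknes'

# One pass with g: the match for "A12 B" consumes the B, so B34 cannot start the next match.
one_pass=$(printf '%s\n' "$line" | sed -E 's/(\b[A-Za-z][0-9]+) ([A-Z])/\1-\2/g')

# Loop: repeat a single substitution until it no longer matches.
looped=$(printf '%s\n' "$line" | sed -E -e ':a' -e 's/(\b[A-Za-z][0-9]+) ([A-Z])/\1-\2/' -e 'ta')

printf '%s\n' "$one_pass"   # kdiw:A12-B34 C56 oknes
printf '%s\n' "$looped"     # kdiw:A12-B34-C56 oknes
```

The loop handles chains of any length, at the cost of rescanning the line once per replaced space.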

Related

Replace a specific character at any word's begin and end in bash

I need to remove the hyphen '-' character only when it matches the pattern 'space-[A-Z]' or '[A-Z]-space'. (Assuming all letters are uppercase, and space could be a space, or newline)
sample.txt
I AM EMPTY-HANDED AND I- WA-
-ANT SOME COO- COOKIES
I want the output to be
I AM EMPTY-HANDED AND I WA
ANT SOME COO COOKIES
I've looked around for answers using sed and awk and perl, but I could only find answers relating to removing all characters between two patterns or specific strings, but not a specific character between [A-Z] and space.
Thanks heaps!!
If Perl is an option, try the following:
perl -pe 's/(^|(?<=\s))-(?=[A-Z])//g; s/(?<=[A-Z])-((?=\s)|$)//g' sample.txt
(?<=\s) is a zero-width lookbehind assertion which matches leading
whitespace without including it in the matched substring.
(?=[A-Z]) is a zero-width lookahead assertion which matches trailing
character between A and Z without including it in the matched substring.
As a result, only the dash characters which match the pattern above are
removed from the original text.
The second statement s/..//g is the flipped version of the first one.
Could you please try the following?
awk '{for(i=1;i<=NF;i++){if($i ~ /^-[a-zA-Z]+$|^[a-zA-Z]+-$/){sub(/-/,"",$i)}}} 1' Input_file
Adding a multi-line form of the solution:
awk '
{
  for(i=1;i<=NF;i++){
    if($i ~ /^-[a-zA-Z]+$|^[a-zA-Z]+-$/){
      sub(/-/,"",$i)
    }
  }
}
1
' Input_file
Output will be as follows.
I AM EMPTY-HANDED AND I WA
ANT SOME COO COOKIES
If you can provide Extended Regular Expressions to sed (generally with the -E or -r option), then you can shorten your sed expression to:
sed -E 's/(^|\s)-(\w)/\1\2/g;s/(\w)-(\s|$)/\1\2/g' file
Where the basic form is sed -E 's/find1/replace1/g;s/find2/replace2/g' file which can also be written as separate expressions sed -E -e 's/find1/replace1/g' -e 's/find2/replace2/g' (your choice).
The details of s/find1/replace1/g are:
find1 is
(^|\s) locate and capture either the beginning of the line or a whitespace character,
followed by the '-' hyphen,
then capture the next \w (word-character); and
replace1 is simply \1\2 reinsert both captures with the first two backreferences.
The next substitution expression is similar, except now you are looking for the hyphen followed by a whitespace or at the end. So you have:
find2 being
a capture of \w (word-character),
followed by the hyphen,
followed by a capture of either a following space or the end (\s|$), then
replace2 is the same as before, just reinsert the captured characters using backreferences.
In each case the g indicates a global replace of all occurrences.
(note: the \w word-character also includes the '_' (underscore), so while unlikely you would have a hyphen and underscore together, if you do, you need to use the [A-Za-z] list instead of \w)
Example Use/Output
In your case, the output is:
$ sed -E 's/(^|\s)-(\w)/\1\2/g;s/(\w)-(\s|$)/\1\2/g' file
I AM EMPTY-HANDED AND I WA
ANT SOME COO COOKIES
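The note about \w and the underscore can be checked with a quick comparison. This is a sketch assuming GNU sed; the input line FOO_- BAR- BAZ is made up to exercise the edge case.

```shell
line='FOO_- BAR- BAZ'

# \w also matches "_", so the hyphen after the underscore is removed.
with_w=$(printf '%s\n' "$line" | sed -E 's/(\w)-(\s|$)/\1\2/g')

# A letter-only class leaves the hyphen after the underscore alone.
with_letters=$(printf '%s\n' "$line" | sed -E 's/([A-Za-z])-(\s|$)/\1\2/g')

printf '%s\n' "$with_w"        # FOO_ BAR BAZ
printf '%s\n' "$with_letters"  # FOO_- BAR BAZ
```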
remove the hyphen '-' character only when it matches the pattern 'space-[A-Z]' or '[A-Z]-space'. Assuming all letters are uppercase, and space could be a space, or newline
It's:
sed 's/\( \|^\)-\([A-Z]\)/\1\2/g; s/\([A-Z]\)-\( \|$\)/\1\2/g'
s - substitute
/
\( \|^\) - space or beginning of the line
- - hyphen...
\([A-Z]\) - a single upper case character
/
\1\2 - The \1 is replaced by the first \(...\) thing. So it is replaced by a space or nothing. \2 is replaced by the single upper case character found. Effectively - is removed.
/
g apply the regex globally
; - separate two s commands
s
Same as above. The $ means end of the line.
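The \| alternation in a basic regular expression is a GNU extension. If portability matters, one possible sketch is to split each alternation into its own s command, so that only plain BRE syntax is needed. The sample and cleaned variables are illustrative.

```shell
sample='I AM EMPTY-HANDED AND I- WA-
-ANT SOME COO- COOKIES'

# Four simple substitutions instead of two with \| alternation.
cleaned=$(printf '%s\n' "$sample" | sed \
  -e 's/^-\([A-Z]\)/\1/' \
  -e 's/ -\([A-Z]\)/ \1/g' \
  -e 's/\([A-Z]\)-$/\1/' \
  -e 's/\([A-Z]\)- /\1 /g')

printf '%s\n' "$cleaned"
```

This only treats a literal space as the separator, as in the command above; tabs would need their own handling.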
awk '{sub(/ -/,"");sub(/^-|-$/,"");sub(/- /," ")}1' file
I AM EMPTY-HANDED AND I WA
ANT SOME COO COOKIES

Extract text between two special characters

Trying to extract the text between the special characters "\ and \" through sed
Ex: "\hell##$\"},
expected output : hell##$
You can do it quite easily with using a capture-group and backreference with basic regular-expressions:
sed 's/^["][\]\([^\]*\).*$/\1/'
Explanation
Normal substitution sed 's/find/replace/', where
find is ^["][\] a double-quote and \ before beginning the capture \(...\) which contains [^\]* (zero or more characters not a \), the closing of the capture \) and then .*$ the remainder of the string;
replace is \1 (the first backreference) containing the text captured between \(...\).
(note: if your "\ doesn't begin the string, remove the first '^' anchor)
Example
$ echo '"\hell##$\"},' | sed 's/^["][\]\([^\]*\).*$/\1/'
hell##$
Look things over and let me know if you have questions.
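Regarding the note about removing the ^ anchor: if the "\ sequence is preceded by other text, dropping the anchor alone leaves that prefix in the output. A sketch of one way to discard it as well, assuming the prefix contains no double quote (the key: prefix is a made-up example):

```shell
input='key: "\hell##$\"},'

# Anchor removed: the prefix before "\ is kept.
unanchored=$(printf '%s\n' "$input" | sed 's/["][\]\([^\]*\).*$/\1/')

# Prefix matched explicitly with [^"]* so it is dropped too.
prefixed=$(printf '%s\n' "$input" | sed 's/^[^"]*["][\]\([^\]*\).*$/\1/')

printf '%s\n' "$unanchored"  # key: hell##$
printf '%s\n' "$prefixed"    # hell##$
```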
This might work for you (GNU sed):
sed -nE '/"\\[^\\]*\\+([^\\"][^\\]*\\+)*"/{s/"\\/\n/;s/.*\n//;s/\\"/\n/;P;D}' file
The solution comes in two parts:
Firstly, a regexp to determine whether a pair of two characters exists. This can be tricky as a negated class is insufficient because edge cases can easily defeat a simplistic approach.
Secondly, once a pair of characters does exist, the text between them must be extracted piecemeal.

Replace All first 4 spaces with a tab

I am doing some documentation work, and I have a tree structure like this:
A
    BB
    C C
        DD
How can I replace every occurrence of 2 spaces at the head of the line with '-', like:
A
--BB
--C C
----DD
I have tried sed 's/  /-/g', but this replaces all occurrences of 2 spaces; I also tried sed 's/^  /-/g', but this only replaces the first occurrence of 2 spaces. How can I do this?
The regular expression for four spaces at beginning of line is /^    / where I put the slashes just to demarcate the expression (they are not part of the actual regular expression, but they are used as delimiters by sed).
sed 's/^    /\t/' file
In recent sed versions, you can add an -i option to modify file in-place (that is, sed will replace the file with the modified file); on *BSD (including OSX), you need -i '' with an empty option argument.
The \t escape code for tab is also not universally supported; if that is a problem, your shell probably allows you to type a literal tab by prefixing it with ctrl-V.
(Your question title says "tab" but your question asks about dashes. To replace with two dashes, replace \t in the replacement part of the script with --, obviously.)
If you are trying to generalize to "any groups of two spaces at beginning of line should be replaced by a dash", this is not impossible to do in sed, but I would recommend Perl instead:
perl -pe 's%^((?:  )+)% "-" x (length($1) / 2)%e' file
This captures the match into $1; the inner parenthesized expression matches two spaces and the + quantifier says to match that as many times as possible. The /e flag allows us to use Perl code in the replacement; this piece of code repeats the character "-" as many times as the captured expression was repeated, which is conveniently equal to half its length.
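For comparison, the "not impossible in sed" route could look like the following loop. It is a sketch rather than the answerer's method: each pass turns one leading pair of spaces into a dash, keeping the dashes already produced, and repeats until no leading pair is left. The input is rebuilt with printf so the indentation is explicit (4, 4 and 8 leading spaces are assumed).

```shell
# Rebuild the tree: %4s and %8s with an empty argument print 4 and 8 spaces.
tree=$(printf 'A\n%4sBB\n%4sC C\n%8sDD\n' '' '' '')

# \(-*\) keeps the dashes made so far; " \{2\}" is exactly two spaces.
dashed=$(printf '%s\n' "$tree" | sed -e ':a' -e 's/^\(-*\) \{2\}/\1-/' -e 'ta')

printf '%s\n' "$dashed"
```

The space inside C C is untouched because the pattern is anchored to the start of the line.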

What is going wrong with this sed command

I have to replace the first occurrence of the word 'the' with 'this' in each line of input text, using a case-sensitive search and replace.
The following is my command for the task, but it is going wrong:
sed 's/\Wthe\W/this/'
The problem I found was similar to this simulated case :
Input-text : as the word
Output-text(correct) : as that word
Output-text : asthatword (what the command is producing).
\W is PCRE, not BRE or ERE. It is thus not supported in standard sed.
sed -E 's/(^|[[:space:]])the([[:space:]]|$)/\1this\2/'
In ^|[[:space:]], ^ matches the beginning of the line; [[:space:]] matches any character in the whitespace class. Putting this inside parentheses creates a matching group which can be referred to later with \1 (since this is the first such group). The unescaped ( ) and | are ERE syntax, hence the -E option.
[[:space:]]|$ does the same, but with $ indicating end-of-line.
That said -- if you're targeting only GNU sed, and not POSIX sed, you might instead consider:
sed 's/\<the\>/this/'
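The two approaches differ when the word is followed by punctuation. A small comparison, assuming GNU sed for the \< and \> anchors; the input lines are made up for illustration.

```shell
input='as the word
the end
other then
see the, then'

# Whitespace-delimited match: "the," is not followed by whitespace, so it is skipped.
space_delimited=$(printf '%s\n' "$input" | sed -E 's/(^|[[:space:]])the([[:space:]]|$)/\1this\2/')

# Word-boundary match: "the," counts as the whole word "the".
word_bounded=$(printf '%s\n' "$input" | sed 's/\<the\>/this/')

printf '%s\n' "$space_delimited"
printf '%s\n' "$word_bounded"
```

In both cases "other" and "then" are left alone, since "the" is only part of a longer word there.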
You are replacing the non-word characters (blanks in this case) as well. An easy way around this is
sed 's/\Wthe\W/ that /'
or
sed 's/\(\W\)the\(\W\)/\1that\2/'
to keep the original non-word-characters.
I'm assuming that you're using \W to make this a whole word search and replace. Try using \b to set word boundaries instead:
sed 's/\bthe\b/this/'
You will need to capture the non-word characters surrounding the word the and then use back-references in the replacement:
s='as the word'
sed 's/\(\W\)the\(\W\)/\1this\2/' <<< "$s"
as this word
or, better, use word boundaries:
sed 's/\bthe\b/this/' <<< "$s"
as this word

Ignoring lines with blank or space after character using sed

I am trying to use sed to extract some assignments being made in a text file. My text file looks like ...
color1=blue
color2=orange
name1.first=Ahmed
name2.first=Sam
name3.first=
name4.first=
name5.first=
name6.first=
Currently, I am using sed to print all the strings after the name#.first's ...
sed 's/name.*.first=//' file
But of course, this also prints all of the lines with no assignment ...
Ahmed
Sam
# I'm just putting this comment here to illustrate the extra carriage returns above; please ignore it
Is there any way I can get sed to ignore the lines with blank or whitespace only assignments and store this to an array? The number of assigned name#.first's is not known, nor are the number of assignments of each type in general.
This is a slight variation on sputnick's answer:
sed -n '/^name[0-9]\.first=\(.\+\)/ s//\1/p'
The first part (/^name[0-9]\.first=\(.\+\)/) selects the lines you want to pass to the s/// command. The empty pattern in the s command re-uses the previous regular expression and the replacement portion (\1) replaces the entire match with the contents of the first parenthesized part of the regex. Use the -n and p flags to control which lines are printed.
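Since \(.\+\) also matches a value made only of spaces, a stricter variant may be needed for the whitespace-only case in the question. The sketch below requires at least one non-whitespace character after the = and uses only POSIX BRE syntax; the sample data, including the whitespace-only name4 line, is made up for illustration.

```shell
assignments=$(printf '%s\n' 'color1=blue' 'name1.first=Ahmed' 'name2.first=Sam' 'name3.first=' 'name4.first=   ')

# Skip leading whitespace, then require a non-whitespace character to start the value.
names=$(printf '%s\n' "$assignments" |
  sed -n 's/^name[0-9][0-9]*\.first=[[:space:]]*\([^[:space:]].*\)$/\1/p')

printf '%s\n' "$names"
```

In bash 4 or later, something like mapfile -t name_array <<< "$names" would be one way to load the resulting lines into an array.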
sed -n 's/^name[0-9]\.\w\+=\(\w\+\)/\1/p' file
Output
Ahmed
Sam
Explanations
the -n switch suppresses the default behavior of sed: printing all lines
s/// is the skeleton for a substitution
^ matches the beginning of a line
name literal string
[0-9] a single digit
\.\w\+ a literal dot (without the backslash it would mean any character) followed by at least one word character [a-zA-Z0-9_]: \+
\( \) is a capturing group and \1 is the captured group

Resources