Why does one less space in the regex make my sed go weird? - bash

Here is an example of some regex I am trying to figure out. The goal is to strip out extra spaces and make it only one space between words via sed. The sample given has three spaces between sdf and sdk:
test@ubuntu:~/addr_book_script$ echo "est sdf   sdk" | sed 's/  */ /g'
est sdf sdk
test@ubuntu:~/addr_book_script$ echo "est sdf   sdk" | sed 's/ */ /g'
e s t s d f s d k
You will notice that the two sed statements differ only in the number of spaces before the *. The first statement had two spaces and behaved exactly as I wanted.
The second statement had one space before the * and it stuck a space between each letter and word.
I know the * means any number of occurrences of whatever-it-is-that-I-am-looking-for. What I don't understand is why the one space sed replace behaves the way it does.
Thanks

sed 's/ */ /g'
The regex " *" (a space followed by *) matches 0 or more occurrences of a space.
At the start of the string a zero-length match is found and replaced by a single space.
After the first letter another zero-length match is found and replaced by a single space, and so forth.
After est, one or more spaces are found and replaced by a single space.
And so forth.
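The empty-match behaviour described above can be reproduced with a short sketch. The helper name squeeze_spaces is made up for illustration, and [ ] is only a more visible way of writing a literal space.

```shell
# Hypothetical helper: collapse every run of spaces into a single space.
# The pattern is "one space, then zero or more spaces", so it can never
# match the empty string.
squeeze_spaces() {
  sed 's/[ ][ ]*/ /g'
}

echo "est sdf   sdk" | squeeze_spaces     # est sdf sdk

# A starred atom on its own also matches the empty string at every
# position, so the replacement is inserted between all characters.
echo "abc" | sed 's/x*/-/g'               # -a-b-c-
```

tr -s ' ' is another common way to squeeze repeated spaces if sed is not required.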
Another example:
~ >>> echo "est sdf sdk" | sed 's/a*/ /g'
e s t s d f s d k
The replacements occur because a* also matches zero characters.

" *" (space-star) in regex means 0 or more occurrences of a space, so it replaces every run of 0 or more spaces with a space
"  *" (space-space-star) forces there to be at least one space
" +" (space-plus) would accomplish the same thing in some regular expression flavors, but not BRE

Related

How to remove all whitespace in front of and behind 3 consecutive periods

I'm trying to remove all white space before and after 3 consecutive periods and replace the whole thing with the actual ellipsis symbol.
I've tried the following code:
sed 's/[[:space:]]*\.\.\.[[:space:]]*/…/g'
It replaces the 3 periods with the ellipsis symbol, but the spaces before and after remain.
Sample Input.
hello ... world
Desired output
hello…world
The -E option enables ERE (extended regular expressions) in sed. Note that POSIX character classes such as [[:space:]] are also valid in basic regular expressions, so -E is not strictly required for this pattern:
sed -E 's/[[:space:]]*\.\.\.[[:space:]]*/.../g' Input_file
Without -E try:
sed 's/ *\.\.\. */.../g' Input_file
Here is another sed command:
echo "hello ... world" | sed -E 's/ +(\.\.\.) +/\1/g'
hello...world
With 4 dots it does nothing:
echo "hello .... world" | sed -E 's/ +(\.\.\.) +/\1/g'
hello .... world
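As a quick check of the plain BRE form, here is a minimal sketch. The helper name tighten_ellipsis is hypothetical, and it assumes a sed that accepts POSIX bracket classes in basic regular expressions, as GNU and BSD sed do.

```shell
# Hypothetical helper: replace "...", together with any whitespace around
# it, with a single ellipsis character. No -E is needed for this pattern.
tighten_ellipsis() {
  sed 's/[[:space:]]*\.\.\.[[:space:]]*/…/g'
}

echo "hello ... world" | tighten_ellipsis    # hello…world
```

If spaces still remain on real input, they may be characters that [[:space:]] does not cover in the current locale (for example no-break spaces); this sketch does not handle that case.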
In bash, just use parameter substitution with an extended glob pattern (this needs shopt -s extglob):
foo="hello ... world"
foo="${foo//+( )...+( )/...}"
Now echo "$foo" outputs:
hello...world
The syntax for Bash pattern substitution on variables is as follows:
${var-name/search/replace}
A single / replaces only the first occurrence from the left, while a double // replaces every occurrence.
One of ?*+@! followed by (pattern-list) matches a specified number of occurrences of the patterns in pattern-list as follows:
? Zero or one occurrence
* Zero or more occurrences
+ One or more occurrences
@ Exactly one occurrence
! Anything that *doesn't* match one of the patterns
The pattern list can be any combination of literal strings or character classes, separated by the pipe character |. These are glob patterns, not regular expressions, so a dot is a literal dot.
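A minimal sketch of the substitution above as a complete script. It assumes bash, and the sample string is only an illustration.

```shell
#!/usr/bin/env bash
# Extended patterns such as +( ) are only recognised when extglob is on.
shopt -s extglob

foo="hello   ...   world"
# +( ) means one or more spaces; the dots are literal in a glob pattern.
foo="${foo//+( )...+( )/…}"
echo "$foo"    # hello…world
```

Without shopt -s extglob the +( ) is not treated as a repetition operator, so the string is left unchanged.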

Replacing one space with two spaces in Unix

I am trying to replace every single space with two spaces in Unix. We are just reading from standard input and writing to standard output. I also have to avoid using awk and perl. For example, if I read in something like San Diego it should print San  Diego. If there are already multiple spaces, it should just leave them alone.
How about bash only? First test file:
$ cat file
1
2 3
4 5
San Diego NO
Then:
$ cat file |
while IFS= read -r line
do
  while [[ "$line" =~ (^|.+[^ ])\ ([^ ].*) ]]
  do
    line="${BASH_REMATCH[1]}  ${BASH_REMATCH[2]}"
  done
  echo "$line"
done
1
2 3
4 5
San Diego NO
You have to be a bit careful here not to forget spaces at the beginning or end.
I present three solutions for educational purposes:
sed 's/\(^\|[^ ]\) \($\|[^ ]\)/\1  \2/g' # solution 1
sed 's/\( \+\)/ \1/g;s/ \(  \+\)/\1/g' # solution 2
sed 's/ \( \+\)/\1/g;s/\( \+\)/ \1/g' # solution 3
All three solutions make use of subexpressions:
9.3.6 BREs Matching Multiple Characters
A subexpression can be defined within a BRE by enclosing it between
the character pairs \( and \). Such a subexpression shall match
whatever it would have matched without the \( and \), except that
anchoring within subexpressions is optional behavior; see BRE
Expression Anchoring. Subexpressions can be arbitrarily nested.
The back-reference expression '\n' shall match the same (possibly
empty) string of characters as was matched by a subexpression enclosed
between "\(" and "\)" preceding the '\n'. The character n shall be a
digit from 1 through 9, specifying the nth subexpression (the one that
begins with the nth \( from the beginning of the pattern and ends
with the corresponding paired \) ). The expression is invalid if
less than n subexpressions precede the \n. For example, the
expression "\(.*\)\1$" matches a line consisting of two adjacent
appearances of the same string, and the expression "\(a\)*\1" fails to
match a. When the referenced subexpression matched more than one
string, the back-referenced expression shall refer to the last matched
string. If the subexpression referenced by the back-reference matches
more than one string because of an asterisk (*) or an interval
expression (see item (5)), the back-reference shall match the last
(rightmost) of these strings.
Solution 1: sed 's/\(^\|[^ ]\) \($\|[^ ]\)/\1  \2/g'
Here there are two subexpressions. The first subexpression \(^\|[^ ]\) matches the beginning of the line (^) or (\|) a non-space character ([^ ]). The second subexpression \($\|[^ ]\) is similar but with the end-of-line ($).
Solution 2: sed 's/\( \+\)/ \1/g;s/ \(  \+\)/\1/g'
This replaces one or more spaces by the same number of spaces plus an extra one. Afterwards we correct the runs of 3 spaces or more by removing a single space from those.
Solution 3: sed 's/ \( \+\)/\1/g;s/\( \+\)/ \1/g'
This does the same thing as solution 2 but inverts the logic. First remove a space from all sequences that have more than one space, and afterwards add a space. This one-liner is just one character shorter than solution 2.
Example: based on solution 1
The following commands are nothing more than echo "string" | sed ..., wrapped in a printf statement to make the spaces visible.
# default string
$ printf "|%s|" " foo bar car "
| foo bar car |
# spaces replaced
$ printf "|%s|" "$(echo " foo bar car " | sed 's/\(^\|[^ ]\) \($\|[^ ]\)/\1  \2/g')"
|  foo  bar  car  |
# 3 spaces in front and back
$ printf "|%s|" "$(echo "   foo bar car   " | sed 's/\(^\|[^ ]\) \($\|[^ ]\)/\1  \2/g')"
|   foo  bar  car   |
Note: if you want to replace any kind of blank (spaces and tabs) by the same blank doubled, you could use either of:
sed 's/\(^\|[^[:blank:]]\)\([[:blank:]]\)\($\|[^[:blank:]]\)/\1\2\2\3/g'
sed 's/\(^\|[[:graph:]]\)\([[:blank:]]\)\($\|[[:graph:]]\)/\1\2\2\3/g'
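The shrink-then-grow idea of solution 3 can be tried out with a small sketch. The helper name double_single_spaces is hypothetical, and \{1,\} is used in place of \+ on the assumption that the script should not depend on the GNU extension.

```shell
# Hypothetical helper based on the "solution 3" idea:
#   1. shrink every run of two or more spaces by one space,
#   2. then grow every run of spaces by one space.
# Net effect: single spaces are doubled, longer runs keep their length.
double_single_spaces() {
  sed 's/ \( \{1,\}\)/\1/g; s/\( \{1,\}\)/ \1/g'
}

# Runs of 1, 1, 2, 3 and 1 spaces become runs of 2, 2, 2, 3 and 2.
printf '|%s|\n' "$(echo ' a b  c   d ' | double_single_spaces)"
# |  a  b  c   d  |
```

Because each run of spaces is matched as a whole, this form also handles one-letter words such as "a b c", where a pattern that consumes the neighbouring non-space characters can skip every second gap.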
Something along the lines of
cat input.txt | sed 's,\([[:alnum:]]\) \([[:alnum:]]\),\1  \2,g'
should work for that purpose.
Replace only a single space between two non-whitespace characters with two spaces:
`sed 's/\([^ ]\) \([^ ]\)/\1  \2/g' file`
1) [^ ] - any character that is not a space
2) \1  \2 - the first parenthesized expression, two spaces, then the second parenthesized expression
3) sed with s///g replaces every match of the regex between the first pair of slashes with the text between the second pair

Reverse four length of letters with sed in unix

How can I reverse every four-character word with sed?
For example:
the year was 1815.
Reverse to:
the raey was 5181.
This is my attempt:
cat filename | sed 's/\([a-z]*\) *\([a-z]*\)/\2, \1/'
But it does not work as I intended.
I am not sure it is possible to do this with GNU sed for all cases. If _ doesn't occur immediately before or after the four-letter words, you can use
sed -E 's/\b([a-z0-9])([a-z0-9])([a-z0-9])([a-z0-9])\b/\4\3\2\1/gi'
\b is a word boundary, where a word character is any letter, digit or underscore. So \b ensures that only whole words are matched, not parts of words.
$ echo 'the year was 1815.' | sed -E 's/\b([a-z0-9])([a-z0-9])([a-z0-9])([a-z0-9])\b/\4\3\2\1/gi'
the raey was 5181.
$ echo 'two time five three six good' | sed -E 's/\b([a-z0-9])([a-z0-9])([a-z0-9])([a-z0-9])\b/\4\3\2\1/gi'
two emit evif three six doog
$ # but won't work if there are underscores around the words
$ echo '_good food' | sed -E 's/\b([a-z0-9])([a-z0-9])([a-z0-9])([a-z0-9])\b/\4\3\2\1/gi'
_good doof
A tool with lookaround support would work for all cases:
$ echo '_good food' | perl -pe 's/(?<![a-z0-9])([a-z0-9])([a-z0-9])([a-z0-9])([a-z0-9])(?![a-z0-9])/$4$3$2$1/gi'
_doog doof
(?<![a-z0-9]) and (?![a-z0-9]) are negative lookbehind and negative lookahead respectively.
This can be shortened to
perl -pe 's/(?<![a-z0-9])[a-z0-9]{4}(?![a-z0-9])/reverse $&/gie'
which uses the e modifier to place Perl code in the substitution section. This form makes it easy to change the length of the words to be reversed.
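Possibly the shortest sed solution, which also works when the four-character word contains underscores (GNU sed):
sed -r 's/\<(\w)(\w)(\w)(\w)\>/\4\3\2\1/g'
Here \w is used rather than . so that a match cannot span a space, as it would in "a bc".
The following awk may also help. It was tested in GNU awk, and only with the provided sample input: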
echo "the year was 1815." |
awk '
function reverse(val){
num=split(val, array,"");
i=array[num]=="."?num-1:num;
for(;i>q;i--){
var=var?var array[i]:array[i]
};
printf (array[num]=="."?var".":var);
var=""
}
{
for(j=1;j<=NF;j++){
printf("%s%s",j==NF||j==2?reverse($j):$j,j==NF?RS:FS)
}}'
This might work for you (GNU sed):
sed -r '/\<\w{4}\>/!b;s//\n&\n/g;s/^[^\n]/\n&/;:a;/\n\n/!s/(.*\n)([^\n])(.*\n)/\2\1\3/;ta;s/^([^\n]*)(.*)\n\n/\2\1/;ta;s/\n//' file
If there are no strings of the length required to reverse, bail out.
Prepend and append newlines to all required strings.
Insert a newline at the start of the pattern space (PS). The PS is divided into two parts, the first line will contain the current word being reversed. The remainder will contain the original line.
Each character of the word to be reversed is inserted at the front of the first line and removed from the original line. When all the characters in the word have been processed, the original word will have gone and only the bordering newlines will exist. These double newlines are then replaced by the word in the first line and the process is repeated until all words have been processed. Finally the newline introduced to separate the working line and the original is removed and the PS is printed.
N.B. This method may be used to reverse strings of varying length, i.e. by changing the first regexp, strings of any length can be reversed. Strings between two lengths may also be reversed, e.g. /\<\w{2,4}\>/ will change all words between 2 and 4 characters long.
It's a recurring problem, so there is a utility for it called rev (part of util-linux, not a bash builtin).
echo "$(echo the | rev) $(echo year | rev) $(echo was | rev) $(echo 1815 | rev)"
OR
echo "the year was 1815." | rev | tr ' ' '\n' | tac | tr '\n' ' '
Note that both commands reverse every word, not only the four-character ones.
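To reverse only the four-character words without sed, here is a rough pure-bash sketch. The function name reverse_four and its definition of a word (a run of letters, digits or underscores) are assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reverse only the words that are exactly four
# characters long. A "word" is a run of letters, digits or underscores.
reverse_four() {
  local rest=$1 out='' word reversed i
  local re='^([^[:alnum:]_]*)([[:alnum:]_]+)(.*)$'
  # Peel off one (separator, word) pair per iteration.
  while [[ $rest =~ $re ]]; do
    out+=${BASH_REMATCH[1]}
    word=${BASH_REMATCH[2]}
    rest=${BASH_REMATCH[3]}
    if (( ${#word} == 4 )); then
      reversed=''
      for (( i = ${#word} - 1; i >= 0; i-- )); do
        reversed+=${word:i:1}
      done
      word=$reversed
    fi
    out+=$word
  done
  # Whatever is left contains no word characters (e.g. the final period).
  printf '%s\n' "$out$rest"
}

reverse_four 'the year was 1815.'    # the raey was 5181.
```

Since the underscore counts as a word character here, _good is treated as a five-character word and left alone, which matches the \b behaviour of the sed answer.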

Using sed to search large number of files for specific string and replace it

What I am trying to do is search a large number of source files for a particular pattern and put another expression in front of this pattern. The files I am looking in all have the same extension, *.F90.
My first step is to use grep to find all lines of those files containing allocate but not allocated, so I have:
grep -I " allocate *(" *.F90 | grep -v allocated
The first problem that I have is that the bracket might be preceded by zero or more spaces. I can have
allocate(
or allocate (
or allocate  (
This is why I need the "*" in the grep command.
The general rule however (besides the spaces) says that the allocate is followed by "(" and then comes the thing that is being allocated. So I have:
allocate ( array_name ( ....
again the spaces are optional
So what I would like to do is find this string, and put in front of it the following:
If( allocated(array_name) ) deallocate(array_name)
and immediately after this, on the next line, I would like to have the original string allocate(array( … .
Please note that the array_name is an alphanumeric string which, after the substitution, appears in more than one place. It is the name of the array being allocated.
I would be very grateful if someone could give me a hint on how to do this. I am stuck and have no idea how to do it.
I assume you mean you want to replace allocate ( array_name ) with If( allocated(array_name) ) deallocate(array_name) allocate ( array_name ).
In GNU or BSD sed you can do the following:
sed -i.bk -e '/allocated/b' \
-e 's/allocate *( *\([A-Za-z0-9_]*\) *)/If( allocated(\1) ) deallocate(\1) &/' \
*.F90
This searches and replaces matching lines in *.F90 and skips lines containing allocated (b with no label branches to the end of the script, whereas t would only branch after a successful substitution). The original files are kept as *.F90.bk.
As @Anders Johansson mentioned, there can be other cases where the argument to allocate is something other than alphanumerics and underscores. You can search for those before you search and replace:
for i in *.F90; do
echo "$i"
sed -n '/.*allocate *( *\([^ )]*\) *).*/{h; s//\1/; /^[A-Za-z0-9_]*$/t
x; p;}' "$i"
done
(Note the newline after t: BSD sed interprets everything after t as a label.) Press ctrl+v ctrl+j in bash to input a newline on the command line.
/a\(b\)c/ find line with matching string
h *h*old the match abc into hold space
s//\1/ *s*ubstitute last match abc with first group b
/^[a-z]*$/t if b matches ^[a-z]*$, then branch to end of script
x e*x*change hold space abc and pattern space b
p *p*rint pattern space b
cat old_file.txt | sed 's/allocate *( *\([a-zA-Z0-9_]*\)/If( allocated(\1) ) deallocate(\1)\
allocate(\1/' > new_file.txt
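The two-line form asked for in the question can be sketched as a filter. This is only an illustration under several assumptions: the helper name guard_allocate is made up, allocate is written in lowercase at the start of the statement (after optional indentation), and the array name consists of letters, digits and underscores.

```shell
# Hypothetical filter: put an "if (allocated(x)) deallocate(x)" line in
# front of every "allocate(x ..." statement read from standard input.
guard_allocate() {
  # Lines that already mention "allocated" are passed through untouched.
  # \1 is the indentation, \2 the array name; the backslash-newline inside
  # the replacement emits a real line break in both GNU and BSD sed.
  sed -e '/allocated/b' \
      -e 's/^\( *\)allocate *( *\([A-Za-z0-9_]*\)/\1if (allocated(\2)) deallocate(\2)\
\1allocate(\2/'
}

printf '%s\n' '  allocate ( arr(10) )' | guard_allocate
#   if (allocated(arr)) deallocate(arr)
#   allocate(arr(10) )
```

Anchoring the pattern at the start of the line is a deliberate choice in this sketch: it keeps existing deallocate(...) statements from matching, at the cost of missing an allocate that follows other code on the same line.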

insert a string at specific position in a file by SED awk

I have a string which I need to insert at a specific position in a file:
The file contains multiple semicolons (;) and I need to insert the string just before the last ";"
Is this possible with sed?
Please post an explanation with the command, as I am new to shell scripting.
before :
adad;sfs;sdfsf;fsdfs
string = jjjjj
after
adad;sfs;sdfsf jjjjj;fsdfs
Thanks in advance
This might work for you:
echo 'adad;sfs;sdfsf;fsdfs'| sed 's/\(.*\);/\1 jjjjj;/'
adad;sfs;sdfsf jjjjj;fsdfs
The \(.*\) is greedy and swallows the whole line; the ; then makes the regexp backtrack to the last ;. The \(.*\) creates the back reference \1. Put together, the RHS of the s command inserts jjjjj before the last ;.
sed 's/\([^;]*\)\(;[^;]*;$\)/\1jjjjj\2/' filename
(substitute jjjjj with what you need to insert).
Example:
$ echo 'adad;sfs;sdfsf;fsdfs;' | sed 's/\([^;]*\)\(;[^;]*;$\)/\1jjjjj\2/'
adad;sfs;sdfsfjjjjj;fsdfs;
Explanation:
sed finds the following pattern: \([^;]*\)\(;[^;]*;$\). Escaped round brackets (\(, \)) form numbered groups so we can refer to them later as \1 and \2.
[^;]* is "everything but ;", repeated any number of times.
$ means end of the line.
Then it changes it to \1jjjjj\2.
\1 and \2 are groups matched in first and second round brackets.
For now, the shorter solution using sed : =)
sed -r 's#;([^;]+);$#; jjjjj;\1#' <<< 'adad;sfs;sdfsf;fsdfs;'
The -r option stands for extended regexp.
# is the delimiter; the usual / separator can be replaced by any other character.
We match a run of non-; characters followed by the final ;, where $ means end of the line.
That run of non-; characters is captured with ().
Finally, we substitute the matching part by adding "; jjjjj;" and concatenating it with the captured part.
Edit: BRE version without -r (note that \+ is a GNU extension; a strictly POSIX sed would need \{1,\} instead):
echo 'adad;sfs;sdfsf;fsdfs;' | sed 's#;\([^;]\+\);$#; jjjjj;\1#'
echo 'adad;sfs;sdfsf;fsdfs;' | sed -r 's/(.*);(.*);/\1 jjjj;\2;/'
You don't need the negation of ; because sed is greedy by default and will match as many characters as it can.
sed -e 's/\(;[^;]*\)$/ jjjj\1/'
Inserts jjjj before the part where a semicolon is followed by any number of non-semicolons ([^;]*) at the end of the line $. \1 is called a backreference and contains the characters matched between \( and \).
UPDATE: Since the sample input has no longer a ";" at the end.
Something like this may work for you:
echo "adad;sfs;sdfsf;fsdfs"| awk 'BEGIN{FS=OFS=";"} {$(NF-1)=$(NF-1) " jjjjj"; print}'
OUTPUT:
adad;sfs;sdfsf jjjjj;fsdfs
Explanation: awk starts by setting FS (field separator) and OFS (output field separator) to a semicolon. NF in awk stands for the number of fields, so $(NF-1) is the second-to-last field. In this awk command, $(NF-1)=$(NF-1) " jjjjj" just appends jjjjj to the second-to-last field.
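When the text to insert comes from a variable and may contain characters that are special to sed (such as /, & or backslashes), plain bash parameter expansion avoids the quoting problem altogether. A minimal sketch, assuming bash; the function name insert_before_last is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical filter: insert " TEXT" before the last ';' of every line
# read from standard input. Lines without a ';' are passed through.
insert_before_last() {
  local text=$1 line
  while IFS= read -r line; do
    case $line in
      *';'*)
        # ${line%;*}  = everything before the last ';'
        # ${line##*;} = everything after the last ';'
        printf '%s %s;%s\n' "${line%;*}" "$text" "${line##*;}"
        ;;
      *)
        printf '%s\n' "$line"
        ;;
    esac
  done
}

echo 'adad;sfs;sdfsf;fsdfs' | insert_before_last jjjjj
# adad;sfs;sdfsf jjjjj;fsdfs
```

For large files sed or awk will be much faster than a shell read loop; this form is mainly useful when the inserted text is arbitrary.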
