Regular expression to capture alphanumeric string only in shell - shell

I'm trying to write a regex that captures only the alphanumeric values shown below, but it's also capturing the purely numeric values. What is the correct way to get the desired output?
code
grep -Eo '(\[[[:alnum:]]\)\w+' file > output
$ cat file
2022-04-29 08:45:11,754 [14] [Y23467] [546] This is a single line
2022-04-29 08:45:11,764 [15] [fpes] [547] This is a single line
2022-04-29 08:46:12,454 [143] [mwalkc] [548] This is a single line
2022-04-29 08:49:12,554 [143] [skhat2] [549] This is a single line
2022-04-29 09:40:13,852 [5] [narl12] [550] This is a single line
2022-04-29 09:45:14,754 [1426] [Y23467] [550] This is a single line
current output -
[14
[Y23467
[546
[15
[fpes
[547
[143
[mwalkc
[548
[143
[skhat2
[549
[5
[narl12
[550
[1426
[Y23467
[550
expected output -
Y23467
fpes
mwalkc
skhat2
narl12
Y23467

1st solution: With your shown samples, please try the following awk code. It uses the gsub function to remove [ and ] from the 4th field, then prints that field.
awk '{gsub(/\[|\]/,"",$4);print $4}' Input_file
2nd solution: With GNU grep (it needs -P for \K), please try the following solution.
grep -oP '^[0-9]{4}(-[0-9]{2}){2} [0-9]{2}(:[0-9]{2}){2},[0-9]{1,3} \[[0-9]+\] \[\K[^]]*' Input_file
Explanation: a detailed breakdown of the regex used with GNU grep above.
^[0-9]{4}(-[0-9]{2}){2} ##From the start of the line, match 4 digits followed by two groups of a dash and 2 digits (the date).
 [0-9]{2}(:[0-9]{2}){2} ##Match a space, then 2 digits followed by two groups of a colon and 2 digits (the time).
,[0-9]{1,3} ##Match a comma followed by 1 to 3 digits.
 \[[0-9]+\] \[\K ##Match a space, then [, one or more digits and ], then a space and [,
##then use \K to discard everything matched so far.
[^]]* ##Match everything up to (not including) the first ] to get the actual value.

[[:alnum:]] and \w match any alphanumeric (or word) character, so purely numeric values such as 14 match as well.
If the value may contain digits but must contain at least one letter a-z, and grep -P (Perl-compatible regex) is available:
grep -oP '\[\K\d*[A-Za-z][\dA-Za-z]*(?=])' file
Explanation
\[ Match [
\K Forget what is matched so far
\d*[A-Za-z] Match optional digits and at least a single char a-zA-Z
[\dA-Za-z]* Match optional chars a-zA-Z and digits
(?=]) Assert ] to the right
Output
Y23467
fpes
mwalkc
skhat2
narl12
Y23467
If there can be only 1 occurrence, you might also use sed with a capture group \(...\) and use the group in the replacement using \1
sed 's/.*\[\([[:digit:]]*[[:alpha:]][[:alnum:]]*\)].*/\1/' file

There are several parts to your problem. First I'll try to help you with your regex (but it will probably unlock more problems); next I'll show you an alternative.
The Regex
The thing to understand about [[:alnum:]] is that it matches any single alphanumeric character, whether letter or digit. So it will match every character of "123" just as it matches every character of "abc". It judges each character individually and cannot express "only sections that are not purely numeric", which is what you want.
However, by chaining two greps together, we can filter out the tokens that contain only digits. The second grep must not use -o, because -v combined with -o prints nothing:
grep -Eo '(\[[[:alnum:]]\)\w+' file | grep -Ev '^\[[[:digit:]]+$' > output
To refine this further, there look to be a couple of bugs in your regex. First, you have included \[ inside the group in parentheses ( ... ). Changing (\[ to \[( moves the [ outside the group, but be aware that grep -o always prints the whole match rather than a group, so the [ still has to be removed in a separate step (or with \K under grep -P).
Next, your combination of [[:alnum:]] with \w+ probably doesn't do what you expect. It looks for a single alphanumeric character, followed by one or more "word" characters (which is all the alphanumerics, plus the underscore). You probably want ([[:alnum:]]+) instead of ([[:alnum:]])\w+
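The two-step idea (match every bracketed token first, then keep only the tokens that are not purely numeric) could be sketched as a small shell function. This is only an illustration: the function name extract_ids is made up here, and it assumes a grep that supports -o (GNU and BSD grep do).

```shell
# Hypothetical helper (not from the question): print the bracketed values
# of FILE that contain at least one letter.
extract_ids() {
  grep -Eo '\[[[:alnum:]]+\]' "$1" |  # every bracketed token, e.g. [14] [Y23467]
    tr -d '[]' |                      # drop the brackets
    grep '[[:alpha:]]'                # keep only tokens that contain a letter
}
```

It would be called as extract_ids file > output.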
Alternative
Why not use cut instead? cut -d' ' -f4 will take the 4th field (with "space" as the delimiter between fields)
$ cut -d' ' -f 4 file
[Y23467]
[fpes]
[mwalkc]
[skhat2]
[narl12]
[Y23467]
If you also want to remove the square brackets, try
$ cut -d' ' -f 4 file | grep -Eo '\w+'
Y23467
fpes
mwalkc
skhat2
narl12
Y23467

Using sed
$ sed 's/\([^[]*\[\)\{2\}\([^]]*\).*/\2/' input_file
Y23467
fpes
mwalkc
skhat2
narl12
Y23467

Using FPAT with GNU awk:
awk -v FPAT='[[[:alnum:]]*]' '{gsub(/^\[|\]$/, "",$(NF-1));print $(NF-1)}' file
Y23467
fpes
mwalkc
skhat2
narl12
Y23467
Setting FPAT to '[[[:alnum:]]*]' makes each field a run of zero or more characters that are either [ or alphanumeric, followed by a ] char; on this input that yields the bracketed values such as [Y23467].
With the gsub() function we remove the initial [ and final ] chars.
We print the field before the last field, i.e. $(NF-1), without the [ and ] characters.
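FPAT is specific to GNU awk. Where only a POSIX awk is available, a similar result might be obtained by splitting on the brackets themselves. The sketch below assumes the sample layout, where the wanted value is the second bracketed item on each line; the function name second_bracket is invented for the example.

```shell
# Portable sketch: use "[" or "]" as the field separator instead of FPAT.
# With the sample layout, $2 is the first bracketed value and $4 the second.
second_bracket() {
  awk -F'[][]' '{ print $4 }' "$1"
}
```

Unlike the regex-based answers, this does not check that the value contains a letter; it relies purely on position.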

Related

sed replace string with pipe and stars

I have the following string:
|**barak**.version|2001.0132012031539|
in file text.txt.
I would like to replace it with the following:
|**barak**.version|2001.01.2012031541|
So I run:
sed -i "s/\|\*\*$module\*\*.version\|2001.0132012031539/|**$module**.version|$version/" text.txt
but the result is a duplicate instead of replacing:
|**barak**.version|2001.01.2012031541|**barak**.version|2001.0132012031539|
What am I doing wrong?
Here is the value for module and version:
$ echo $module
barak
$ echo $version
2001.01.2012031541
Assumptions:
lines of interest start and end with a pipe (|) and have one more pipe somewhere in the middle of the data
search is based solely on the value of ${module} existing between the 1st/2nd pipes in the data
we don't know what else may be between the 1st/2nd pipes
the version number is the only thing between the 2nd/3rd pipes
we don't know the version number that we'll be replacing
Sample data:
$ module='barak'
$ version='2001.01.2012031541'
$ cat text.txt
**barak**.version|2001.0132012031539| <<<=== leave this one alone
|**apple**.version|2001.0132012031539|
|**barak**.version|2001.0132012031539| <<<=== replace this one
|**chuck**.version|2001.0132012031539|
|**barak**.peanuts|2001.0132012031539| <<<=== replace this one
One sed solution with -Extended regex support enabled and making use of a capture group:
$ sed -E "s/^(\|[^|]*${module}[^|]*).*/\1|${version}|/" text.txt
Where:
\| - an escaped pipe is a literal pipe in extended regex (an unescaped | would mean alternation); the pipes inside the bracket expressions [^|] and in the replacement are literal without escaping
^(\|[^|]*${module}[^|]*) - first capture group that starts at the beginning of the line, starts with a pipe, then some number of non-pipe characters, then the search pattern (${module}), then more non-pipe characters (continues up to next pipe character)
.* - matches rest of the line (which we're going to discard)
\1|${version}| - replace line with our first capture group, then a pipe, then the new replacement value (${version}), then the final pipe
The above generates:
**barak**.version|2001.0132012031539|
|**apple**.version|2001.0132012031539|
|**barak**.version|2001.01.2012031541| <<<=== replaced
|**chuck**.version|2001.0132012031539|
|**barak**.peanuts|2001.01.2012031541| <<<=== replaced
An awk alternative using GNU awk:
awk -v mod="$module" -v vers="$version" -F \| '{ OFS=FS;split($2,map,".");inmod=substr(map[1],3,length(map[1])-4);if (inmod==mod) { $3=vers } }1' file
Pass two variables mod and vers to awk using $module and $version. Set the field delimiter to |. Split the second field into the array map using the split function with . as the delimiter. Then strip the leading and trailing "**" from the first element of the array with the substr function to expose the module name as inmod. Compare this to the mod variable and, if there is a match, change the 3rd delimited field to the variable vers. The trailing 1 is shorthand for printing every line.
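A variation on the same awk idea, offered as a sketch rather than a drop-in replacement: it compares the module name with index(), a plain substring test, so characters in the module name are never interpreted as regex syntax. The function name update_version and its argument order are assumptions made for this example; it follows the assumption listed earlier that lines of interest start with a pipe.

```shell
# Hypothetical helper: update_version MODULE VERSION FILE
update_version() {
  awk -v mod="$1" -v vers="$2" 'BEGIN { FS = OFS = "|" }
    # $1 is empty when the line starts with a pipe; $2 holds the module text.
    $1 == "" && index($2, "**" mod "**") { $3 = vers }
    { print }' "$3"
}
```

It would be called as update_version "$module" "$version" text.txt, writing the result to standard output.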
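An unescaped pipe is only special when you're using extended regular expressions (sed -E). In basic regex it is a literal character, while GNU sed treats the escaped form \| as alternation. Your pattern therefore starts with an empty alternative, which matches at the very start of the line, so the replacement is inserted there instead of replacing anything.
There's no reason why you need extended regex here; stick with basic regex and leave the pipes unescaped: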
sed "
# for lines matching module.version
/|\*\*$module\*\*.version|/ {
# replace the version
s/|2001.0132012031539|/|$version|/
}
" text.txt
or as an unreadable one-liner
sed "/|\*\*$module\*\*.version|/ s/|2001.0132012031539|/|$version|/" text.txt
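A small experiment may make the difference between \| and | in basic regex easier to see. The first command assumes GNU sed, where \| is alternation; other sed implementations may treat it differently, so its output is not asserted here. The second command uses unescaped pipes, which are literal in basic regex. The variable names line and fixed are only for this illustration.

```shell
line='|**barak**.version|2001.0132012031539|'

# Escaped pipes: GNU sed reads \| as alternation, so this pattern begins with
# an empty alternative that can match at the very start of the line.
printf '%s\n' "$line" | sed 's/\|\*\*barak\*\*.version\|2001/X/'

# Unescaped pipes are literal in basic regex, so only the real text matches.
fixed=$(printf '%s\n' "$line" |
  sed 's/|\*\*barak\*\*\.version|2001\.0132012031539/|**barak**.version|2001.01.2012031541/')
printf '%s\n' "$fixed"
```

With GNU sed, the first command is expected to put the X in front of the untouched line, which mirrors the duplicated output described in the question.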

How to convert a line into camel case?

This picks all the text on a single line after a pattern match and converts it to camel case, using non-alphanumeric characters as separators and removing the spaces at the beginning and end of the resulting string. Two questions: (1) it doesn't replace correctly when there are 2 consecutive non-alphanumeric chars, e.g. "2, " in the example below; (2) is there a way to do everything with a single sed command instead of using grep, cut, sed and tr?
$ echo " hello
world
title: this is-the_test string with number 2, to-test CAMEL String
end! " | grep -o 'title:.*' | cut -f2 -d: | sed -r 's/([^[:alnum:]])([0-9a-zA-Z])/\U\2/g' | tr -d ' '
ThisIsTheTestStringWithNumber2,ToTestCAMELString
To answer your first question, change [^[:alnum:]] to [^[:alnum:]]+ to match one or more non-alnum chars.
You may combine all the commands into a GNU sed solution like
sed -En '/.*title: *(.*[[:alnum:]]).*/{s//\1/;s/([^[:alnum:]]+|^)([0-9a-zA-Z])/\U\2/gp}'
See the online demo
Details
-En - POSIX ERE syntax is on (E) and default line output is suppressed with n
/.*title: *(.*[[:alnum:]]).*/ - matches a line having title: capturing all after it up to the last alnum char into Group 1 and matching the rest of the line
{s//\1/;s/([^[:alnum:]]+|^)([0-9a-zA-Z])/\U\2/gp} - if the line is matched,
s//\1/ - remove all but Group 1 pattern (received above)
s/([^[:alnum:]]+|^)([0-9a-zA-Z])/\U\2/ - match and capture start of string or 1+ non-alnum chars into Group 1 (with ([^[:alnum:]]+|^)) and then capture an alnum char into Group 2 (with ([0-9a-zA-Z])) and replace with uppercased Group 2 contents (with \U\2).
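The \U case conversion is a GNU sed extension. Where it is not available, the same steps (keep the text after title:, split on runs of non-alphanumerics, upper-case the first character of each word) could be sketched in awk. The function name camel_title is hypothetical, and the sketch assumes ASCII letters and digits only.

```shell
# Hypothetical helper: read lines on stdin and print the camel-cased text
# that follows "title:" on any line containing it.
camel_title() {
  awk '/title:/ {
    sub(/^.*title:/, "")                 # keep only the text after "title:"
    n = split($0, w, /[^A-Za-z0-9]+/)    # words are runs of letters and digits
    out = ""
    for (i = 1; i <= n; i++)
      if (w[i] != "")                    # skip the empty piece before a leading separator
        out = out toupper(substr(w[i], 1, 1)) substr(w[i], 2)
    print out
  }'
}
```

It is meant to sit at the end of a pipeline, in place of the grep, cut, sed and tr chain.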

Reverse four length of letters with sed in unix

How can I reverse a four length of letters with sed?
For example:
the year was 1815.
Reverse to:
the raey was 5181.
This is my attempt:
cat filename | sed 's/\([a-z]*\) *\([a-z]*\)/\2, \1/'
But it does not work as I intended.
I'm not sure it is possible to do this with GNU sed for all cases. If _ doesn't occur immediately before/after four-letter words, you can use
sed -E 's/\b([a-z0-9])([a-z0-9])([a-z0-9])([a-z0-9])\b/\4\3\2\1/gi'
\b is word boundary, word definition being any alphabet or digit or underscore character. So \b will ensure to match only whole words not part of words
$ echo 'the year was 1815.' | sed -E 's/\b([a-z0-9])([a-z0-9])([a-z0-9])([a-z0-9])\b/\4\3\2\1/gi'
the raey was 5181.
$ echo 'two time five three six good' | sed -E 's/\b([a-z0-9])([a-z0-9])([a-z0-9])([a-z0-9])\b/\4\3\2\1/gi'
two emit evif three six doog
$ # but won't work if there are underscores around the words
$ echo '_good food' | sed -E 's/\b([a-z0-9])([a-z0-9])([a-z0-9])([a-z0-9])\b/\4\3\2\1/gi'
_good doof
A tool with lookaround support would work for all cases
$ echo '_good food' | perl -pe 's/(?<![a-z0-9])([a-z0-9])([a-z0-9])([a-z0-9])([a-z0-9])(?![a-z0-9])/$4$3$2$1/gi'
_doog doof
(?<![a-z0-9]) and (?![a-z0-9]) are negative lookbehind and negative lookahead respectively
Can be shortened to
perl -pe 's/(?<![a-z0-9])[a-z0-9]{4}(?![a-z0-9])/reverse $&/gie'
which uses the e modifier to place Perl code in substitution section. This form is suitable to easily change length of words to be reversed
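Where neither Perl nor GNU sed is at hand, the same rule (reverse every run of exactly four letters or digits, treating _ as a non-word character) might be sketched in POSIX awk. The function name rev4 is made up for this example, and only ASCII letters and digits are considered.

```shell
# Hypothetical helper: reverse every run of exactly four letters/digits on stdin.
rev4() {
  awk '{
    rest = $0; out = ""
    # Walk through the line one alphanumeric run at a time.
    while (match(rest, /[A-Za-z0-9]+/)) {
      word = substr(rest, RSTART, RLENGTH)
      if (RLENGTH == 4)
        word = substr(word, 4, 1) substr(word, 3, 1) substr(word, 2, 1) substr(word, 1, 1)
      out = out substr(rest, 1, RSTART - 1) word
      rest = substr(rest, RSTART + RLENGTH)
    }
    print out rest
  }'
}

echo 'the year was 1815.' | rev4
echo '_good food' | rev4
```

To reverse words of another length, the RLENGTH test and the substr chain would need to change together, so the Perl version with reverse remains the more flexible option.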
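Possibly the shortest sed solution (GNU sed), which also works when the four-character word contains underscores. Be aware that . matches any character, including spaces and punctuation.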
sed -r 's/\<(.)(.)(.)(.)\>/\4\3\2\1/g'
The following awk may also help. It was tested in GNU awk with the provided sample input only, and it reverses just the 2nd and the last field.
echo "the year was 1815." |
awk '
function reverse(val){
num=split(val, array,"");
i=array[num]=="."?num-1:num;
for(;i>q;i--){
var=var?var array[i]:array[i]
};
printf (array[num]=="."?var".":var);
var=""
}
{
for(j=1;j<=NF;j++){
printf("%s%s",j==NF||j==2?reverse($j):$j,j==NF?RS:FS)
}}'
This might work for you (GNU sed):
sed -r '/\<\w{4}\>/!b;s//\n&\n/g;s/^[^\n]/\n&/;:a;/\n\n/!s/(.*\n)([^\n])(.*\n)/\2\1\3/;ta;s/^([^\n]*)(.*)\n\n/\2\1/;ta;s/\n//' file
If there are no strings of the length required to reverse, bail out.
Prepend and append newlines to all required strings.
Insert a newline at the start of the pattern space (PS). The PS is divided into two parts, the first line will contain the current word being reversed. The remainder will contain the original line.
Each character of the word to be reversed is inserted at the front of the first line and removed from the original line. When all the characters in the word have been processed, the original word will have gone and only the bordering newlines will exist. These double newlines are then replaced by the word in the first line and the process is repeated until all words have been processed. Finally the newline introduced to separate the working line and the original is removed and the PS is printed.
N.B. This method may be used to reverse strings of any length by changing the first regexp. Strings between two lengths may also be reversed, e.g. /\<\w{2,4}\>/ will reverse all words between 2 and 4 characters long.
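It's a recurring problem, so there is a standard utility called "rev" (part of util-linux) that reverses each line of its input. Note that both commands below reverse every word, not only the four-character ones.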
echo "$(echo the | rev) $(echo year | rev) $(echo was | rev) $(echo 1815 | rev)".
OR
echo "the year was 1815." | rev | tr ' ' '\n' | tac | tr '\n' ' '

Get numbers from line and separate them in bash

I have this line:
io=9839.1MB, bw=4012.3KB/s, iops=250, runt=2511369msec
and I need to extract the numbers, to get an output like this:
9839.1 4012.3 250 2511369
I tried with sed 's/[^0-9.]*//g' but numbers are not separated from each other.
How can I do that?
Remove everything but digits, dots and blanks:
echo 'io=9839.1MB, bw=4012.3KB/s, iops=250, runt=2511369msec' | tr -cd '0-9. '
Output:
9839.1 4012.3 250 2511369
Your attempt with sed is almost right; you just need to add a single space to the bracket expression so that spaces are excluded from the replacement:
sed 's/[^0-9. ]*//g' file
9839.1 4012.3 250 2511369
You can also use grep with -E (extended regular expression syntax) and -o to print only the matched parts, i.e. the runs of digits and dots, one per line.
grep -o -E '[0-9.]+' file
9839.1
4012.3
250
2511369
The Fish
// a group containing a digit and the literal character "." at least once
const regex = /([\d.]+)/g
// your string
const haystack = `io=9839.1MB, bw=4012.3KB/s, iops=250, runt=2511369msec`
console.log(haystack.match(regex))
How to fish:
Here is the fantastic regex101 featuring a wysiwyg regexp editor including a large knowledge base.
https://regex101.com/r/a2scnd/1
With shell parameter expansion:
$ var='io=9839.1MB, bw=4012.3KB/s, iops=250, runt=2511369msec'
$ echo "${var//[![:digit:]. ]}"
9839.1 4012.3 250 2511369
Specifically, this uses the ${parameter//pattern/string} with an empty string and a negated bracket expression for pattern: [![:digit:]. ] – everything other than digits, periods and spaces.
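The approaches above keep the spaces that already separate the fields, so they depend on the input containing those spaces. A sketch that does not rely on them turns every run of other characters into a single space and lets the shell split the result. The function name extract_numbers is invented for this example; it assumes a POSIX shell with default IFS, and a stray dot in the input (for instance a full stop) would show up as its own token.

```shell
# Hypothetical helper: print the numbers found in the argument, space-separated.
extract_numbers() {
  # Replace each run of characters other than digits and dots with one space,
  # then rely on word splitting to drop the leading and trailing spaces.
  set -- $(printf '%s\n' "$1" | tr -cs '0-9.' ' ')
  echo "$*"
}

extract_numbers 'io=9839.1MB, bw=4012.3KB/s, iops=250, runt=2511369msec'
```

Because the values end up as positional parameters inside the function, they could just as easily be assigned to separate variables there.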

insert a string at specific position in a file by SED awk

I have a string which I need to insert at a specific position in a file.
The file contains multiple semicolons (;) and I need to insert the string just before the last ";".
Is this possible with sed?
Please post an explanation with the command, as I am new to shell scripting.
before :
adad;sfs;sdfsf;fsdfs
string = jjjjj
after
adad;sfs;sdfsf jjjjj;fsdfs
Thanks in advance
This might work for you:
echo 'adad;sfs;sdfsf;fsdfs'| sed 's/\(.*\);/\1 jjjjj;/'
adad;sfs;sdfsf jjjjj;fsdfs
The \(.*\) is greedy and swallows the whole line; the ; makes the regexp backtrack to the last ;. The \(.*\) makes a back reference \1. Put together in the RHS of the s command, this inserts jjjjj before the last ;.
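When the line is already in a shell variable, the same "split at the last ;" idea can be sketched with parameter expansion alone, which also avoids having to escape the inserted string for sed. The function name insert_before_last_semicolon and the variables head and tail are made up for this example.

```shell
# Hypothetical helper: insert_before_last_semicolon LINE STRING
insert_before_last_semicolon() {
  case $1 in
    *\;*) ;;                           # at least one ";" present
    *) printf '%s\n' "$1"; return ;;   # no ";" at all: print the line unchanged
  esac
  head=${1%;*}    # remove the shortest suffix matching ";*": text before the last ";"
  tail=${1##*;}   # remove the longest prefix matching "*;": text after the last ";"
  printf '%s %s;%s\n' "$head" "$2" "$tail"
}

insert_before_last_semicolon 'adad;sfs;sdfsf;fsdfs' 'jjjjj'
```

For a whole file this would have to run once per line in a while read loop, so the sed and awk answers are likely the better fit for large inputs.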
sed 's/\([^;]*\)\(;[^;]*;$\)/\1jjjjj\2/' filename
(substitute jjjjj with what you need to insert).
Example:
$ echo 'adad;sfs;sdfsf;fsdfs;' | sed 's/\([^;]*\)\(;[^;]*;$\)/\1jjjjj\2/'
adad;sfs;sdfsfjjjjj;fsdfs;
Explanation:
sed finds the following pattern: \([^;]*\)\(;[^;]*;$\). Escaped round brackets (\(, \)) form numbered groups so we can refer to them later as \1 and \2.
[^;]* is "everything but ;", repeated any number of times.
$ means end of the line.
Then it changes it to \1jjjjj\2.
\1 and \2 are groups matched in first and second round brackets.
For now, the shorter solution using sed : =)
sed -r 's#;([^;]+);$#; jjjjj;\1#' <<< 'adad;sfs;sdfsf;fsdfs;'
-r enables extended regexps
# is the delimiter; the usual / separator can be replaced by any other character
we match a ;, then one or more characters that are not ;, then the final ;, where $ means end of the line
the run of non-; characters is captured with ()
finally, we replace the matched part with "; jjjjj;" followed by the captured part
Edit: POSIX version (more portable), using basic regex. Note that \+ is a GNU extension; the POSIX spelling is \{1,\}:
echo 'adad;sfs;sdfsf;fsdfs;' | sed 's#;\([^;]\{1,\}\);$#; jjjjj;\1#'
echo 'adad;sfs;sdfsf;fsdfs;' | sed -r 's/(.*);(.*);/\1 jjjj;\2;/'
You don't need the negation of ; because sed is greedy by default and will pick as many characters as it can.
sed -e 's/\(;[^;]*\)$/ jjjj\1/'
Inserts jjjj before the part where a semicolon is followed by any number of non-semicolons ([^;]*) at the end of the line $. \1 is called a backreference and contains the characters matched between \( and \).
UPDATE: Since the sample input no longer has a ";" at the end.
Something like this may work for you:
echo "adad;sfs;sdfsf;fsdfs"| awk 'BEGIN{FS=OFS=";"} {$(NF-1)=$(NF-1) " jjjjj"; print}'
OUTPUT:
adad;sfs;sdfsf jjjjj;fsdfs
Explanation: awk starts by setting FS (field separator) and OFS (output field separator) to a semicolon ;. NF in awk stands for number of fields, so $(NF-1) means the second-to-last field. The statement $(NF-1)=$(NF-1) " jjjjj" just appends " jjjjj" to that field.

Resources