How to take first numbers and ignore the rest using sed - bash

I have this line of code. The sed is taking 10 and 5. How can I extract 10 and ignore 5?
$ NUMBER=$(echo "The food is ready in 10mins and transport is coming in 5mins." | sed 's/[^0-9]*//g') ; echo $NUMBER

You're just removing everything that isn't a digit, so the digits of both numbers are left (the result is 105).
Instead, use a capture group to capture the first number, and use that in the replacement.
sed 's/^[^0-9]*\([0-9]*\).*/\1/'
In the regular expression:
^ matches the beginning of the line
[^0-9]* matches all non-digits at the beginning of the line
\( and \) surround a capture group
[0-9]* matches digits. These are captured in the group
.* matches the rest of the line.
In the replacement:
\1 copies the part of the line that was matched by the capture group. These are the first set of digits on the line.
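As a quick check, here is a minimal sketch that wraps the command above in a shell function and runs it on the sentence from the question. The function name first_number is made up for illustration; it is not part of the original answer.

```shell
# first_number: hypothetical helper that prints the first run of digits in its argument.
first_number() {
  printf '%s\n' "$1" | sed 's/^[^0-9]*\([0-9]*\).*/\1/'
}

NUMBER=$(first_number "The food is ready in 10mins and transport is coming in 5mins.")
echo "$NUMBER"   # prints 10
```

If a line contains no digits at all, the capture group matches the empty string and an empty line is printed.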


Can someone explain this sed command?

So the text is the following:
1a fost odata
2un balaur
care fura
mere de aur
and after using this command:
sed 's/\([a-z]*\)\(.*\)\( [a-z]*\)/\1 ... \2/' filename
the result is this:
... 1a fost
... 2un balaur
care ...
mere ... de
I know that \1 is for the first [a-z]* subexpression and so on, but I just can't figure this out.. also, what's the difference between the first subexpression and the last one? why is there a space before [a-z]?
The first [a-z]* matches the first sequence of letters on the line. The * quantifier matches 0 or more repetitions, so this can also match an empty string.
On the first line it matches the empty string before 1a. On the second line it matches the empty string before 2un. On the third line it matches care, and on the fourth line it matches mere. These matches will go into capture group 1.
.* matches zero or more of any characters, so this will skip over everything in the middle of the line. These matches go into capture group 2.
The last subexpression,  [a-z]* with a leading space, matches a space followed by zero or more letters. The space is needed to make .* stop matching when it gets to the last space on the line. These matches go into capture group 3.
The replacement is capture groups 1 and 2 with ... between them. This is the letters at the beginning of the line, ..., then everything after that except the last word.
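To see what each group captures, one option is to change the replacement so that it prints all three groups in brackets. This is only a sketch for inspection; the function name show_groups is an assumption made for this example.

```shell
# show_groups: hypothetical helper that prints the three capture groups of the
# regex from the question, each wrapped in brackets so empty matches are visible.
show_groups() {
  printf '%s\n' "$1" | sed 's/\([a-z]*\)\(.*\)\( [a-z]*\)/1=[\1] 2=[\2] 3=[\3]/'
}

show_groups '1a fost odata'   # 1=[] 2=[1a fost] 3=[ odata]
show_groups 'care fura'       # 1=[care] 2=[] 3=[ fura]
show_groups 'mere de aur'     # 1=[mere] 2=[ de] 3=[ aur]
```

Group 1 is empty whenever the line starts with a non-letter, which is why the first two output lines in the question begin with the dots.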

sed substitute whitespace for dash only between specific character patterns

I have lines like these:
ORIGINAL
sometext1 sometext2 word:A12 B34 C56 sometext3 sometext4
sometext5 sometext6 word:A123 B45 C67 sometext7 sometext8
sometext9 sometext10 anotherword:(someword1 someword2 someword3) sometext11 sometext12
EDITED
asdjfkklj lkdsjfic kdiw:A12 B34 C56 lksjdfioe sldkjflkjd
lknal niewoc kdiw:A123 B45 C678 oknes lkwid
cnqule nkdal anotherword:(kdlklks inlqok mncvmnx) unqieo lksdnf
Desired output:
asdjfkklj lkdsjfic kdiw:A12-B34-C56 lksjdfioe sldkjflkjd
lknal niewoc kdiw:A123-B45-C678 oknes lkwid
cnqule nkdal anotherword:(kdlklks-inlqok-mncvmnx) unqieo lksdnf
EDITED: Would this be more explicit? But frankly this is much more difficult to read and answer than writing sometext#. I do not know people's preference.
I only want to replace the whitespace with dashes after a letter followed by some digits, and between the words inside the two parentheses, but not any other whitespace in the line. I would appreciate an explanation of the syntax too.
Thanks!
This might work for you (GNU sed):
sed -r ':a;s/(A[0-9]+(-[A-Z][0-9]+)*) ([A-Z][0-9]+)/\1-\3/;ta;s/(\(\S+(-\S+)*) (\S+( \S+)*\))/\1-\3/;ta' file
Iteratively replace the space(s) in the required strings using a regexp and back references.
This code works well:
darby#Debian:~/Scrivania$ cat test.txt | sed -r 's#\s+([A-Z][0-9]+)#-\1#g' | sed ':l s/\(([^ )]*\)[ ]/\1-/;tl'
asdjfkklj lkdsjfic kdiw:A12-B34-C56 lksjdfioe sldkjflkjd
lknal niewoc kdiw:A123-B45-C678 oknes lkwid
cnqule nkdal anotherword:(kdlklks-inlqok-mncvmnx) unqieo lksdnf
Explain my regex
In the first regex
Options
-r Enable extended regular expressions
Pattern
\s+ One or more whitespace characters
([A-Z][0-9]+) Capture an uppercase letter followed by one or more digits
Replace
- Dash character
\1 Previous submatch
Note
The g flag after the final delimiter (s///g) makes the substitution global, so it is applied to every match on the line.
In the second regex
Pattern
:l defines a label named l that the t or b commands can branch to
tl jumps to label l if any substitution has been made on the pattern space since the most recent reading of an input line or execution of a t command. If no label is specified, it jumps to the end of the script. This is a conditional branch
\(([^ )]*\) captures an opening parenthesis and the characters after it, up to the first space or closing parenthesis
[ ] one space character
Replace
\1 Previous submatch
- Add a dash
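The label-and-branch loop described above can be tried in isolation. The following is a small sketch, not the answer's exact command: the function name dash_parens is invented here, and the script is split into separate -e pieces because some non-GNU sed versions want a label to end its own script fragment.

```shell
# dash_parens: hypothetical helper that replaces spaces inside (...) with dashes.
# Each pass replaces one space; "t l" jumps back to ":l" while a substitution succeeded.
dash_parens() {
  printf '%s\n' "$1" | sed -e ':l' -e 's/\(([^ )]*\) /\1-/' -e 't l'
}

dash_parens 'x y (a b c) z w'   # prints: x y (a-b-c) z w
```

The loop stops on its own once the text after the opening parenthesis reaches the closing parenthesis without meeting a space, because the substitution then fails and t does not branch.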
You need to capture the first alphanumeric group and the start of the second group using (). Then you can simply replace using the backreferences \1 and \2:
using sed twice
sed -E 's/(\b[A-Za-z][0-9]+) ([A-Z])/\1-\2/g' | sed -E 's/(\b[A-Za-z][0-9]+) ([A-Z])/\1-\2/g'
or using perl (with the lookahead (?=...), the regex doesn't consume the second group, so a single pass is enough)
perl -pe 's/(\b[A-Za-z][0-9]+) (?=[A-Z])/\1-/g'
\b word boundary
[A-Za-z] 1 letter
[0-9]+ 1 or more digits
sed doesn't support lookahead or lookbehind
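Why sed has to run twice can be seen with a short experiment. In a single pass with g, the uppercase letter matched by the second group is consumed, so it cannot start the next match. The sketch below compares that with a loop built from a label and t, which is one possible alternative to piping through sed twice. The function names one_pass and looped are made up for this example, and \b is left out because it is a GNU extension.

```shell
# one_pass: a single global substitution; misses every second pair.
one_pass() {
  printf '%s\n' "$1" | sed -E 's/([A-Za-z][0-9]+) ([A-Z])/\1-\2/g'
}

# looped: repeats a single substitution until nothing is left to replace.
looped() {
  printf '%s\n' "$1" | sed -E -e ':a' -e 's/([A-Za-z][0-9]+) ([A-Z])/\1-\2/' -e 't a'
}

one_pass 'kdiw:A12 B34 C56 x'   # prints: kdiw:A12-B34 C56 x
looped 'kdiw:A12 B34 C56 x'     # prints: kdiw:A12-B34-C56 x
```

The loop also copes with longer runs such as A1 B2 C3 D4, where two fixed passes would not be guaranteed to be enough in every case.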

How to keep only three letters in a variable in bash

I'm accepting user input, $1, asking for a date. People can't use the help page, so I'm forced to dumb it down when passing it through grep.
My input is Day-Mon-Year - where the day doesn't have a preceding 0 and the month is only 3 letters long.
I have everything done, except for the 3 letter 'cut-down.'
## stripping leading zero, turning words to lower-case & then capitalizing only the first letter ##
fdate=$(echo $1 | sed 's/^0//g' | tr '[:upper:]' '[:lower:]' | sed -e "s/\b\(.\)/\u\1/g")
Can anyone help me take "August," for example, and cut it down to Aug, in this single variable? Or perhaps another way? I'm open to anything.
Thanks in advance!
You can do this in bash, without external commands:
a='0heLLo wOrld'
a=${a#0} # Remove one leading 0. To remove multiple zeros, run 'shopt -s extglob' and use ${a##+(0)}
a="${a:0:3}" # Take 3 first characters
a=${a,,} # Lowercase
a=${a^} # Uppercase first
printf "%s\n" "$a" # Hel
Alternatively, it can be done in one sed command (\u and \L are GNU sed extensions):
% sed 's/^0//;s/\(.\)\(..\).*/\u\1\L\2/' <<< "0heLLo wOrld"
Hel
Breakdown
s/^0//; # Remove leading 0. Change to 's/^0*//' to remove multiple zeros
s/
\(.\)\(..\) # Capture first character in \1 and next two in \2
.* # Match rest of string
/\u\1\L\2/ # Uppercase \1 and lowercase \2
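Applied to the Day-Mon-Year input from the question, the parameter expansions might be combined as in the sketch below. It assumes bash 4 or later (for ,, and ^), assumes the three fields are separated by hyphens, and uses a made-up function name, normalize_date.

```shell
#!/usr/bin/env bash
# normalize_date: hypothetical helper, e.g. 05-AUGUST-2014 -> 5-Aug-2014.
normalize_date() {
  local day month year
  IFS=- read -r day month year <<< "$1"   # split on hyphens (assumed separator)
  day=${day#0}          # drop one leading zero from the day
  month=${month:0:3}    # keep the first three letters of the month
  month=${month,,}      # lowercase everything
  month=${month^}       # uppercase the first letter
  printf '%s-%s-%s\n' "$day" "$month" "$year"
}

normalize_date '05-AUGUST-2014'   # prints 5-Aug-2014
```

Splitting the date first keeps the three-letter cut from touching the day or the year, which a plain ${a:0:3} on the whole string would do.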

grep one liner - extract two different lines from same file

I have a file containing many lines like the following.
== domain 1 score: 280.5 bits; conditional E-value: 2.1e-87
TSEEETTCTTTGSG---BTTSSB-HHHHHHHHHHHHHHHHHHSSS---B-HHHHHHHSTTTSTGCGBB-HHHHHHHHHHHTEBEBTTTS---SSCSESECTTGCGSCEBEESEEEEEESSBHHHHHHHHHHHSSEEEEEECTSHHHHTEESSEESCTSCETSS-EEEEEEEEEEEETTEEEEEEE-SBTTTSTBTTEEEEESSSSSGGGTTSSEEEE CS
PF00112.18 2 pesvDwrekkgavtpvkdqgsCGsCWafsavgalegrlaiktkkklvslSeqelvdCskeenegCnGGlmenafeyikknggivtekdypYkakekgkCkkkkkkekvakikgygkvkenseealkkalakngPvsvaidaseedfqlYksGvyketecsktelnhavlivGygvengkkyWivkNsWgtdwgekGYiriargknnecgieseavyp 218
p+svD+r+k+ +vtpvk+qg+CGsCWafs+vgaleg+l+ kt +kl++lS q+lvdC + en+gC GG+m+naf+y++kn+gi++e+ ypY ++e ++C ++ + + ak++gy++++e +e+alk+a+a++gPvsvaidas ++fq+Y++Gvy++++c++++lnhavl+vGyg ++g+k Wi+kNsWg++wg+kGYi +ar+knn cgi++ a++p
1AU0:A 2 PDSVDYRKKG-YVTPVKNQGQCGSCWAFSSVGALEGQLKKKT-GKLLNLSPQNLVDCVS-ENDGCGGGYMTNAFQYVQKNRGIDSEDAYPYVGQE-ESCMYNPTGKA-AKCRGYREIPEGNEKALKRAVARVGPVSVAIDASLTSFQFYSKGVYYDESCNSDNLNHAVLAVGYGIQKGNKHWIIKNSWGENWGNKGYILMARNKNNACGIANLASFP 213
I just want to extract the line that starts with PF and the associated line after it, which starts with a digit.
Here in this case, line that starts with PF is 'PF00112.18' and line that starts with digit is '1AU0:A'. These ids will change for next domain, but PF is constant and its associated id starts with digit.
Here is what I've tried with grep; I suspect there is a mistake in this one-liner. Any help will be greatly appreciated.
grep '^ PF \| \d' inFile.txt
Expected output:
PF00112.18 2 pesvDwrekkgavtpvkdqgsCGsCWafsavgalegrlaiktkkklvslSeqelvdCskeenegCnGGlmenafeyikknggivtekdypYkakekgkCkkkkkkekvakikgygkvkenseealkkalakngPvsvaidaseedfqlYksGvyketecsktelnhavlivGygvengkkyWivkNsWgtdwgekGYiriargknnecgieseavyp 218
1AU0:A 2 PDSVDYRKKG-YVTPVKNQGQCGSCWAFSSVGALEGQLKKKT-GKLLNLSPQNLVDCVS-ENDGCGGGYMTNAFQYVQKNRGIDSEDAYPYVGQE-ESCMYNPTGKA-AKCRGYREIPEGNEKALKRAVARVGPVSVAIDASLTSFQFYSKGVYYDESCNSDNLNHAVLAVGYGIQKGNKHWIIKNSWGENWGNKGYILMARNKNNACGIANLASFP 213
You can use the following grep expression:
grep '^[[:space:]]\+PF\|^[[:space:]]\+[[:digit:]]' input.txt
The first pattern ^[[:space:]]\+PF searches for a line which contains one or more spaces at the start, followed by the term PF. The second pattern also searches for one or more spaces at the start of the line, but followed by a digit.
This can be simplified to:
grep '^[[:space:]]\+\(PF\|[[:digit:]]\)' input.txt
since both patterns start with one or more spaces at the start of the line.
Finally, I suggest using egrep (equivalent to grep -E) instead of grep, because extended POSIX regexes will save us some escaping:
egrep '^[[:space:]]+(PF|[[:digit:]])' input.txt
egrep "^[ \t]*(PF|[0-9]).*$" tmp_file
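[ \t] is meant to match a space or a tab. Note that inside a POSIX bracket expression the backslash is literal, so grep actually reads it as space, backslash, or the letter t; [[:space:]] or [[:blank:]] is the portable way to include tabs.
So ^[ \t]* matches any leading white space at the start of the line. The asterisk means zero or more such characters.
(PF|[0-9]).*$ will grab the lines that start with either PF or a digit. The beauty of egrep is that you can specify multiple alternatives enclosed in parentheses and separated by a pipe.
.*$ matches everything from there until the end of the line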
so (PF|[0-9]).*$ will grab everything that starts with PF or digits until the end of the line. It will not work without compensating for the leading white space first.
So we get :
egrep "^[ \t]*(PF|[0-9]).*$" tmp_file
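To try the idea without the full data file, here is a small sketch that feeds shortened stand-in lines through grep -E. The sample lines are abbreviated versions of the ones in the question, and the function name pick_lines is invented for this example.

```shell
# pick_lines: hypothetical helper that keeps lines whose first non-blank
# text is "PF" or a digit. Reads stdin, or the files given as arguments.
pick_lines() {
  grep -E '^[[:space:]]*(PF|[[:digit:]])' "$@"
}

printf '%s\n' \
  '== domain 1 score: 280.5 bits' \
  '  TSEEETTCTTTGSG CS' \
  '  PF00112.18 2 pesvDwrekkg 218' \
  '  p+svD+r+k+' \
  '  1AU0:A 2 PDSVDYRKKG 213' | pick_lines
# prints only the PF00112.18 line and the 1AU0:A line
```

Using [[:space:]]* rather than [[:space:]]+ also keeps matching lines that have no leading white space at all.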

explain part of sed expression - *\1$/p

This code outputs lines where only the first and last digits are the same - could somebody explain in english how this works:
seq 1000 | sed -nr -e '/^([0-9])([0-9])*\1$/p'
outputs:
11
22
33 etc
I know it looks for a number at the start ^ and then another number but I am unclear how this works with the \1$ to get the answer?
Actually, what this matches is any digit:
([0-9])
followed by any number of digits
([0-9])*
followed by the first digit again
\1
\1 is a backreference to the first parenthesized group.
Note that the digits in the middle are unconstrained:
$ seq 8000 | sed -nr -e '/^([0-9])([0-9])*\1$/p' | tail
7907
7917
7927
7937
7947
7957
7967
7977
7987
7997
It looks for a number at the start, followed by zero or more numbers (notice the star after the second parenthesis), and lastly checking for \1 at the end - which represents the exact same value as in the first parenthesis.
\1 is the "first matched term".
$ is the "end of line".
So \1$ means "match the same term (i.e. the digit 0-9) found at the start of the string again at the end of the string".
It starts by matching the start of the line; then the parenthesis is a group (which can be referenced later) containing one digit 0-9. The group is followed by another group, also with one digit, and this group can be repeated 0 or more times. After that there is a reference to the first group (the \1) and finally a match for the end of the line.
So, basically it just says last digit must be same as first digit and there can be any number of digits between them.
There is no need to group the middle digits, since they are not referenced, so it could be rewritten like this:
sed -nr -e '/^([0-9])[0-9]*\1$/p'
If you instead wanted the last digit to be the same as the first digit and the second-to-last to be the same as the second, so that you would match 1221 and 245642 but not 2424, then you could use
sed -nr -e '/^([0-9])([0-9])[0-9]*\2\1$/p'
Try it with seq 100000
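Both variants can be checked on a small range. The sketch below assumes a sed that accepts backreferences with -E (GNU sed does); the function names same_ends and mirrored_ends are made up for illustration.

```shell
# same_ends: hypothetical helper, keeps numbers whose first and last digit match.
same_ends() {
  sed -n -E '/^([0-9])[0-9]*\1$/p'
}

# mirrored_ends: hypothetical helper, keeps numbers whose last two digits
# mirror the first two (ab...ba).
mirrored_ends() {
  sed -n -E '/^([0-9])([0-9])[0-9]*\2\1$/p'
}

seq 1 130 | same_ends | tail -n 3   # prints 101, 111, 121
seq 1000 1300 | mirrored_ends       # prints 1001, 1111, 1221
```

Single-digit numbers never match, because the pattern needs one digit for the group and a second one for the backreference.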
