How to overcome greedy match everything when looking for a particular string later? - bash

echo "A number is about to show up 1 and now I want to parse 365 guys and some extra junk" | sed -E 's/.*([0-9]+) guys.*/\1/g'
The above command currently outputs just 5. Essentially, I'd like to parse the number of "guys" from an arbitrary sentence that may or may not contain other numbers before it (I'd also like to handle just echo "365 guys"). My .* is matching the 36 and keeping it out of \1. How can I write a sed command (or any other regex/perl/awk) to accomplish what I want?

Use the "frugal" quantifier *? in Perl:
perl -pe 's/.*?([0-9]+) guys.*/$1/'

With GNU grep:
$ grep -Po '\b[0-9]+(?= guys\b)' <<<"365 guys or 366 guys, but not foo12 guys."
365
366
-P activates support for PCREs, which enables advanced regex features.
-o specifies that only the matching parts of input lines should be printed.
\b matches only on a word boundary, including at the start of a line;
this prevents matching numbers that aren't stand-alone numbers but part of other words, such as in foo365 guys, and words that start with guys, such as guysanddolls.
(?= guys) is a look-ahead assertion that matches the enclosed subexpression without including it in the matched string returned.
As demonstrated, this may match multiple patterns on a given line, with each number extracted printed on its own output line.
If that is undesired, grep cannot be used, because -o invariably returns all of a line's matches; see the perl command below for a solution.
Inspired by Sobrique's comment on choroba's answer, here is the perl equivalent of the above grep command:
$ perl -lne 'print for m/\b(\d+) guys\b/g' <<<"365 guys or 366 guys, but not foo12 guys."
365
366
Simply omit the g to only match at most 1 number per line.

Since your number is preceded by a blank, you can make it a part of the regex:
echo "A number is about to show up 1 and now I want to parse 365 guys and some extra junk" | sed -E 's/.* ([0-9]+) guys.*/\1/g'
# => 365
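The space-prefixed pattern above no longer matches when the number opens the line, as in the "365 guys" case from the question. One possible workaround, sketched here with a hypothetical helper name (extract_guys), makes the leading "anything followed by a blank" part optional. It assumes a sed that accepts -E, as GNU and BSD sed do.

```shell
# Hypothetical helper: print the number directly before " guys" on each input line.
# "^(.* )?" lets the number sit at the start of the line or right after a blank;
# -n together with the p flag prints only lines where the substitution matched.
extract_guys() {
    sed -n -E 's/^(.* )?([0-9]+) guys.*/\2/p'
}

echo "A number is about to show up 1 and now I want to parse 365 guys and some extra junk" | extract_guys
echo "365 guys" | extract_guys
```

Both calls print 365. A line such as "foo12 guys" prints nothing, because 12 is neither at the start of the line nor preceded by a blank.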

In Bash:
$ s="A number is about to show up 1 and now I want to parse 365 guys and some extra junk"
$ [[ $s =~ ([0-9]+)\ +guys.*$ ]] && echo ${BASH_REMATCH[1]}
365
Or, with awk:
$ echo "$s" | awk '/guys/{for (i=1;i<=NF;i++) if ($i=="guys" && $(i-1)+0==$(i-1)) print $(i-1)}'
365

With standard sed regexes, you can turn the greedy match to your advantage by reversing both the string and the pattern:
echo ... | rev | sed -E 's/.*syug ([0-9]+).*/\1/g' | rev
Obviously this is a hack, but desperate times...

@Andrew Cassidy: try:
echo "A number is about to show up 1 and now I want to parse 365 guys and some extra junk" |
awk '/guys/{print VAL;exit} {VAL=$0}' RS=" "

This might work for you (GNU sed):
sed -r 's/.*\b([0-9]+) guys.*/\1/' file
or perhaps:
sed -r 's/.*\<([0-9]+) guys.*/\1/' file
Make the numeric part of the pattern match a word boundary.

Related

SED command to check whether a DATE is palindromic

I have a file called dates.txt with dates in MM/DD/YYYY format:
02/02/2020
08/25/1998
03/02/2030
12/02/2021
06/19/1960
01/10/2010
03/07/2100
I need a single-line sed command that prints only the palindromic dates. For example, 02/02/2020 is palindromic while 08/25/2020 is not. The expected output is:
02/02/2020
03/02/2030
12/02/2021
So far I have removed the / from the dates and reordered the parts. How can I check whether that output reads the same from the start and from the end?
sed -E "s|([0-9]{2})/([0-9]{2})/([0-9]{4})|\3\2\1|" dates.txt
Here is what I get:
20200202
19982508
20300203
20210212
19601906
20101001
21000703
You can backreference in the pattern match:
sed -n '/\([0-9]\)\([0-9]\)\/\([0-9]\)\([0-9]\)\/\4\3\2\1/p'
Using extended regex and dots looks just nice:
sed -rn '/(.)(.)\/(.)(.)\/\4\3\2\1/p'
sed -rn '\#(.)(.)/(.)(.)/\4\3\2\1#p' # means the same
You may delete any line that does not match the d1d2/M1M2/M2M1d2d1 pattern. To check that, match and capture each day and month digits separately:
sed -E '/^([0-9])([0-9])\/([0-9])([0-9])\/\4\3\2\1$/!d' file > outfile
Or, with GNU sed:
sed -i -E '/^([0-9])([0-9])\/([0-9])([0-9])\/\4\3\2\1$/!d' file
The ^ stands for start of string position and $ means the end of string.
The !d at the end tells sed to "drop" the lines that do not follow this pattern.
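To see the filter at work on the dates from the question, here is a small sketch that wraps it in a function reading from stdin. The name palindromic_dates is made up for this example, and back-references inside -E patterns are an extension that GNU and BSD sed accept rather than something POSIX guarantees.

```shell
# Hypothetical wrapper around the sed filter above; reads one date per line on stdin.
# Groups 1-4 capture the four leading digits; \4\3\2\1 requires the year to be
# exactly those digits in reverse order. Non-matching lines are deleted.
palindromic_dates() {
    sed -E '/^([0-9])([0-9])\/([0-9])([0-9])\/\4\3\2\1$/!d'
}

printf '%s\n' 02/02/2020 08/25/1998 03/02/2030 12/02/2021 06/19/1960 |
    palindromic_dates
```

This prints 02/02/2020, 03/02/2030 and 12/02/2021, one per line.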
Alternatively, when you have more complex cases, you may read the file line by line, swap the digits in days and months and concatenate them, and compare the value with the year part. You may perform more operations there if need be:
while IFS= read -r line; do
p1="$(sed -En 's,([0-9])([0-9])/([0-9])([0-9])/.*,\4\3\2\1,p' <<< "$line")";
p2="${line##*/}";
if [[ "$p1" == "$p2" ]]; then
echo "$line"
fi
done < file > outfile
The sed -En 's,([0-9])([0-9])/([0-9])([0-9])/.*,\4\3\2\1,p' part gets the first four digits and reorders them. The "${line##*/}" uses parameter expansion to remove as many chars as possible from the start till the last / (including it).
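For dates that do not fit the fixed two-digit layout, a more general check is to drop the slashes and compare the remaining digits with their reverse. The sketch below does that in POSIX shell. The name is_palindromic_date is invented for this example, and the sample dates come from the question.

```shell
# Hypothetical helper: succeeds when the digits of "$1" read the same both ways.
is_palindromic_date() {
    digits=$(printf '%s' "$1" | tr -d '/')      # 02/02/2020 -> 02022020
    rest=$digits
    reversed=
    while [ -n "$rest" ]; do
        remainder=${rest#?}                     # everything after the first character
        reversed=${rest%"$remainder"}$reversed  # prepend that first character
        rest=$remainder
    done
    [ "$digits" = "$reversed" ]
}

for d in 02/02/2020 08/25/1998 12/02/2021; do
    if is_palindromic_date "$d"; then
        echo "$d"
    fi
done
```

The loop prints 02/02/2020 and 12/02/2021. The reversal uses only parameter expansion, so the one external process per date is tr.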

Extract all characters after a match - shell script

I need to extract all characters after a pattern match.
For example,
NAME=John
Age=16
I need to extract all characters after "=". Output should be like
John
16
I can't use Perl or Jython for this because of some restrictions.
I tried grep, but this is as far as I got:
echo "NAME=John" |grep -o -P '=.{0,}'
You were pretty close:
grep -oP '(?<=\w=)\w+' file
does the job.
Explanation
It looks for any word that follows word= and prints it.
-o stands for "Print only the matched (non-empty) parts of a matching line, with each such part on a separate output line".
-P stands for "Interpret PATTERN as a Perl regular expression".
(?<=\w=)\w+ means: match only \w+ that follows a word character and =. (?<=...) is a look-behind assertion, so the preceding text is checked but not included in the match. More info in "Regex tutorial - Lookahead" and in the explanation by sudo_O.
Test
$ cat file
NAME=John
Age=16
$ grep -oP '(?<=\w=)\w+' file
John
16
One sed solution
sed -ne 's/.*=//gp' <filename>
another awk solution
awk -F= '$0=$2' <filename>
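Explanation:
In sed, we remove everything from the beginning of the line up to the last = and print the rest.
In awk, we split the line into fields separated by =; $0=$2 then replaces the whole line with the second field, and the line is printed because the assignment evaluates as true. Note that a value that is empty or 0 evaluates as false, so such a line is not printed.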
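If neither grep -P nor awk is an option, plain parameter expansion can do the same job. This is a minimal sketch under the assumption that the input arrives on stdin, one KEY=value pair per line. The function name values_after_equals is hypothetical.

```shell
# Hypothetical helper: print everything after the first "=" of each input line.
values_after_equals() {
    while IFS= read -r line; do
        printf '%s\n' "${line#*=}"   # strip the shortest prefix that ends in "="
    done
}

printf 'NAME=John\nAge=16\n' | values_after_equals
```

This prints John and 16. Unlike the sed command, it cuts at the first = rather than the last, so a=b=c yields b=c. Unlike the awk one-liner, it also prints a value of 0. Lines without any = pass through unchanged.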

How can I strip first X characters from string using sed?

I am writing shell script for embedded Linux in a small industrial box. I have a variable containing the text pid: 1234 and I want to strip first X characters from the line, so only 1234 stays. I have more variables I need to "clean", so I need to cut away X first characters and ${string:5} doesn't work for some reason in my system.
The only thing the box seems to have is sed.
I am trying to make the following to work:
result=$(echo "$pid" | sed 's/^.\{4\}//g')
Any ideas?
The following should work:
var="pid: 1234"
var=${var:5}
Are you sure bash is the shell executing your script?
Even the POSIX-compliant
var=${var#?????}
would be preferable to using an external process, although this requires you to hard-code the 5 in the form of a fixed-length pattern.
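If the count has to stay variable, the same POSIX expansion can be applied one character at a time. This is a minimal sketch with strip_first_n as a hypothetical helper name. When the prefix always ends in a known delimiter, an expansion such as ${var#*: } avoids counting altogether.

```shell
# Hypothetical helper: print "$1" without its first "$2" characters,
# using only POSIX parameter expansion (no external process).
strip_first_n() {
    str=$1
    n=$2
    while [ "$n" -gt 0 ] && [ -n "$str" ]; do
        str=${str#?}    # drop one leading character
        n=$((n - 1))
    done
    printf '%s\n' "$str"
}

var="pid: 1234"
strip_first_n "$var" 5    # prints 1234
echo "${var#*: }"         # also prints 1234: removes everything up to ": "
```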
Here's a concise method to cut the first X characters using cut(1). This example removes the first 4 characters by cutting a substring starting with 5th character.
echo "$pid" | cut -c 5-
Use the -r option ("use extended regular expressions in the script") to sed in order to use the {n} syntax:
$ echo 'pid: 1234'| sed -r 's/^.{5}//'
1234
Cut first two characters from string:
$ string="1234567890"; echo "${string:2}"
34567890
Pipe it through awk '{print substr($0,42)}', where 42 is one more than the number of characters to drop. For example:
$ echo abcde| awk '{print substr($0,2)}'
bcde
$
Chances are, you'll have cut as well. If so:
[me@home]$ echo "pid: 1234" | cut -d" " -f2
1234
Well, there have been solutions here with sed, awk, cut and bash syntax. I just want to throw in another POSIX-conforming variant:
$ echo "pid: 1234" | tail -c +6
1234
-c tells tail at which byte offset to start, counting from the end of the input data; if the number starts with a + sign, however, the offset is counted from the beginning of the input and the output runs to the end.
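Because tail -c +K starts output at byte K, dropping N leading bytes means passing N + 1. Here is a small sketch of that arithmetic, with drop_bytes as a hypothetical name. It counts bytes, not characters, so multibyte text could be split in the middle of a character.

```shell
# Hypothetical helper: drop the first "$1" bytes of stdin.
# tail -c +K begins output at byte K, hence the "+ 1".
drop_bytes() {
    tail -c "+$(( $1 + 1 ))"
}

echo "pid: 1234" | drop_bytes 5    # prints 1234
```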
Another way, using cut instead of sed.
result=`echo $pid | cut -c 5-`
I found the answer in pure sed supplied by this question (admittedly, posted after this question was posted). This does exactly what you asked, solely in sed:
result=`echo "$pid" | sed '/./ { s/pid:\ //g; }'`
The dot in sed '/./' is whatever you want to match. Your question is exactly what I was attempting to do, except that in my case I wanted to match a specific line in a file and then uncomment it. In my case it was:
# Uncomment a line (edit the file in-place):
sed -i '/#\ COMMENTED_LINE_TO_MATCH/ { s/#\ //g; }' /path/to/target/file
The -i after sed is to edit the file in place (remove this switch if you want to test your matching expression prior to editing the file).
(I posted this because I wanted to do this entirely with sed, as this question asked, and none of the previous answers solved that problem.)
Rather than removing n characters from the start, perhaps you could just extract the digits directly. Like so...
$ echo "pid: 1234" | grep -Po "\d+"
This may be a more robust solution, and seems more intuitive.
This will do the job too:
echo "$pid"|awk '{print $2}'

Bash - Extract numbers from String

I got a string which looks like this:
"abcderwer 123123 10,200 asdfasdf iopjjop"
Now I want to extract numbers following the scheme xx,xxx, where x is a digit from 0 to 9. E.g. 10,200. It has to have five digits and has to contain a ",".
How can I do that?
Thank you
You can use grep:
$ echo "abcderwer 123123 10,200 asdfasdf iopjjop" | egrep -o '[0-9]{2},[0-9]{3}'
10,200
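Note that the pattern above is unanchored, so it would also pick 34,567 out of 1234,5678. If the number has to stand alone as a whitespace-separated word, one option is to split the input into tokens first and keep only whole-line matches. Here is a sketch, with extract_xx_xxx as a hypothetical name:

```shell
# Hypothetical helper: print whitespace-separated tokens of the form dd,ddd.
extract_xx_xxx() {
    tr -s '[:space:]' '\n' |                # one token per line
        grep -E -x '[0-9]{2},[0-9]{3}'      # -x: the whole line must match
}

echo "abcderwer 123123 10,200 asdfasdf 1234,5678" | extract_xx_xxx    # prints 10,200
```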
In pure Bash:
pattern='([[:digit:]]{2},[[:digit:]]{3})'
[[ $string =~ $pattern ]]
echo "${BASH_REMATCH[1]}"
Simple pattern matching (glob patterns) is built into the shell. Assuming you have the strings in $* (that is, they are command-line arguments to your script, or you have used set on a string you have obtained otherwise), try this:
for token; do
case $token in
[0-9][0-9],[0-9][0-9][0-9] ) echo "$token" ;;
esac
done
Check out pattern matching and regular expressions.
Links:
Bash regular expressions
Patterns and pattern matching
SO question
and as mentioned above, one way to utilize pattern matching is with grep.
Other uses: echo supports patterns (globbing) and find supports regular expressions.
A slightly non-typical solution:
< input tr -cd '0-9, ' | tr ' ' '\n' | grep '^..,...$'
(The first tr removes everything except commas, spaces, and digits. The
second tr replaces spaces with newlines, putting each "number" on a separate
line, and the grep discards everything except those that match your criterion.)
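The following example, using your input string, extracts the number with sed. The pattern spells out two digits, a comma, and three digits; a looser class such as [0-9,]\{6\} would also accept 123123.
$ echo abcderwer 123123 10,200 asdfasdf iopjjop | sed -ne 's/^.*\([0-9]\{2\},[0-9]\{3\}\).*$/\1/p'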
10,200

Listing all words containing more than 1 capitalized letter

I want to search for all of the acronyms placed within a document so I can correct their formatting. I think I can assume that all acronyms are words containing at least 2 capital letters in them (e.g.: "EU"), as I've never seen a one-word acronym or acronym only containing 1 capital letter, but sometimes they have a small "o" for "of" in them or another small letter. How can I print out a list showing all of the possible matches once?
This might work for you:
tr -s '[:space:]' '\n' <input.txt | sed '/\<[[:upper:]]\{2,\}\>/!d' | sort -u
The -o option of grep can help you:
grep -o '\b[[:alpha:]]*[[:upper:]][[:alpha:]]*[[:upper:]][[:alpha:]]*'
Almost only Bash:
for word in $(cat file.txt) ; do
if [[ $word =~ [[:upper:]].*[[:upper:]] ]] ; then # at least 2 capital letters
echo "${word//[^[:alpha:]]/}" # remove non-alphabetic characters
fi
done
Will this work for you:
sed 's/[[:space:]]\+/\n/g' $your_file | sort -u | egrep '[[:upper:]].*[[:upper:]]'
Translation:
Replace all runs of whitespace in $your_file with newlines. This will put each word on its own line.
Sort the file and remove duplicates.
Find all lines that contain two uppercase letters separated by zero or more characters.
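Here is a variant of the same idea, sketched with a hypothetical function name (list_acronyms). It splits on every non-letter rather than on whitespace, so punctuation attached to a word, as in "(EU),", does not create a separate entry, and it filters before sorting. It assumes plain ASCII text on stdin.

```shell
# Hypothetical helper: list each word on stdin that has at least two capitals.
list_acronyms() {
    tr -cs '[:alpha:]' '\n' |                   # runs of non-letters become one line break
        grep -E '[[:upper:]].*[[:upper:]]' |    # keep words with two or more capitals
        LC_ALL=C sort -u                        # sort and de-duplicate
}

echo "The EU and the DoD met; NATO and the (EU) agreed." | list_acronyms
```

This prints DoD, EU and NATO, one per line. Mixed-case forms such as DoD are kept, which matches the small "o" for "of" mentioned in the question.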
One way using perl.
Example:
Content of infile:
One T
Two T
THREE
Four
Five SIX
Running the perl command:
perl -ne 'printf qq[%s\n], $1 while /\b([[:upper:]]{2,})\b/g' infile
Result:
THREE
SIX
