Using sed to extract multiple occurrences in a string to array? - bash

How to get all matches into a Bash array using sed?
My input is
<node><system_id>app1261.works.com</system_id><name>app1261</name></node><node><system_id>app1361.works.com</system_id><name>app1361</name></node><node><system_id>app1461.works.com</system_id><name>app1461</name></node>
Output expected stored in array
app1261.works.com
app1361.works.com
app1461.works.com
I am trying the command below but it always returns the last match instead of all matches.
sed -ne '{s/.*<system_id>\(.*\)<\/system_id>.*/\1/p;q;}' <<< "$xml"
Can anyone help me with this?

If you have GNU grep, use this lookaround-based solution:
grep -Po '(?<=<system_id>)[^<>]+(?=</system_id>)' <<< "$xml"
app1261.works.com
app1361.works.com
app1461.works.com
Otherwise, a grep + sed solution:
grep -Eo '<system_id>([^<>]+)</system_id>' <<< "$xml" |
sed -E 's~</?system_id>~~g'
app1261.works.com
app1361.works.com
app1461.works.com
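Since the question asks for a Bash array, here is a minimal sketch of capturing that output with mapfile. It assumes bash 4 or newer (mapfile is not available in older versions) and reuses the grep + sed pipeline; the array name ids is only illustrative.

```bash
xml='<node><system_id>app1261.works.com</system_id><name>app1261</name></node><node><system_id>app1361.works.com</system_id><name>app1361</name></node><node><system_id>app1461.works.com</system_id><name>app1461</name></node>'

# mapfile -t stores each output line as one array element (bash 4+).
# Process substitution keeps mapfile in the current shell, so ids survives.
mapfile -t ids < <(
  grep -Eo '<system_id>[^<>]+</system_id>' <<< "$xml" |
    sed -E 's~</?system_id>~~g'
)

printf '%s\n' "${ids[@]}"     # one system_id per line
echo "count: ${#ids[@]}"      # count: 3
```

Piping into mapfile (cmd | mapfile -t ids) would fill the array in a subshell and lose it, which is why the sketch uses < <(...) instead.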

Related

extract string between '$$' characters - $$extractabc$$

I am working on a shell script and am new to it. I want to extract the string between double $$ characters, for example:
input:
$$extractabc$$
output
extractabc
I tried grep and sed but could not get them to work. Any suggestions are welcome!
You could do
awk -F"$" '{print $3}' file.txt
This assumes the file contains input:$$extractabc$$ output:extractabc. awk splits each line into fields using $ as the delimiter: the first field is input:, the second is empty, and the third is extractabc.
You could use sed to get the same result:
sed -e 's/.*\$\$\(.*\)\$\$.*/\1/' file.txt
sed looks for the text between the two $$ markers and prints only that. The pattern has the shape .*$$(.*)$$.*, with each $ escaped so that it is taken literally. It is greedy, but stay with me:
.* matches any characters, zero or more times, before the first $$
then the string must contain $$
after that $$ come any characters, zero or more times
then the string must contain another $$
followed by any remaining characters
the text between the two $$ is captured by \(.*\) and is available as \1
sed replaces the whole line with \1 and prints it
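To make the "greedy" remark concrete, here is a small sketch using a made-up input line that contains two $$...$$ pairs. The variable names and the sample string are assumptions for illustration only.

```bash
# Hypothetical input with two delimited values
line='a$$one$$b$$two$$c'

# The leading .* is greedy, so it swallows the first pair and the
# capture group ends up holding the LAST delimited value.
greedy=$(sed -e 's/.*\$\$\(.*\)\$\$.*/\1/' <<< "$line")
echo "$greedy"    # two

# Using [^$]* instead of .* stops at the first $, so the capture
# group holds the FIRST delimited value.
first=$(sed -e 's/[^$]*\$\$\([^$]*\)\$\$.*/\1/' <<< "$line")
echo "$first"     # one
```

For the single-pair input in the question both forms give extractabc; the difference only shows up when a line has several pairs.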
Using grep PCRE (where available) and look-around:
$ echo '$$extractabc$$' | grep -oP "(?<=\\$\\$).*(?=\\$\\$)"
extractabc
echo '$$extractabc$$' | awk '{gsub(/\$\$/,"")}1'
extractabc
Here is another variation:
echo '$$extractabc$$' | awk -F'[$][$]' 'NF==3 {print $2}'
It checks that the line contains exactly two $$ delimiters and only then prints what is between them. Single quotes matter here: inside double quotes the shell expands $$ to its own process ID.
It also works for input like blabla$$some_data$$moreblabla
How about removing every $ from the input?
$ echo '$$extractabc$$' | sed 's/\$//g'
extractabc
Same with tr
$ echo '$$extractabc$$' | tr -d '$'
extractabc
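If the value is already in a shell variable, plain parameter expansion can do the same job without starting another process. This is a sketch under the assumption that the string contains at least one $$...$$ pair; s, tmp and inner are illustrative names.

```bash
s='blabla$$some_data$$moreblabla'

# Remove the shortest prefix ending in the first $$.
# The quotes around $$ stop the shell from expanding it to the PID.
tmp=${s#*'$$'}

# Remove the longest suffix starting at the next $$.
inner=${tmp%%'$$'*}

echo "$inner"    # some_data
```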

bash to extract and store after second underscore

I am trying to use bash to extract everything before the second _ and store that in a variable pref. I am using a loop so the below is not completely accurate, but the file structure/format is.
I can extract everything before the first _ using pref=${bname%%_*}, but can't seem to change it to stop at the second _. Thank you :).
file to extract from
00-0000_Last-First_base_counts_FBN1.txt
desired output
00-0000_Last-First
bash
pref=${bname%%_}; pref=${bname%%_*.txt}
Using cut with _ as delimiter get 1st and 2nd fields:
s='00-0000_Last-First_base_counts_FBN1.txt'
cut -d_ -f1-2 <<< "$s"
00-0000_Last-First
To store in a variable:
pref=$(cut -d_ -f1-2 <<< "$s")
GNU sed and grep
$ sed -r 's/([^_]+_[^_]*).*/\1/' <<<"00-0000_Last-First_base_counts_FBN1.txt"
00-0000_Last-First
$ sed 's/_[^_]*//2g' <<< "00-0000_Last-First_base_counts_FBN1.txt"
00-0000_Last-First
$ grep -o "^[^_]*_\?[^_]*" <<<"00-0000_Last-First_base_counts_FBN1.txt"
00-0000_Last-First
To store in variable
somevar="00-0000_Last-First_base_counts_FBN1.txt";
pref=$(sed 's/_[^_]*//2g' <<< "$somevar")
As well as with sed, awk and cut, you can achieve this with expr:
$ filename=00-0000_Last-First_base_counts_FBN1.txt
$ echo $(expr match "$filename" '^\([^_]*_[^_]*\)')
00-0000_Last-First
This echoes the capture group enclosed within the \( and \) of the regular expression.
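Since the question started from ${bname%%_*}, here is a sketch that stays with parameter expansion and needs no external command. It takes two steps: first find what follows the second _, then strip that from the end. The variable rest is an illustrative helper, not something from the question.

```bash
bname='00-0000_Last-First_base_counts_FBN1.txt'

# Shortest prefix matching *_*_ ends at the second underscore,
# so rest holds everything after it.
rest=${bname#*_*_}

# Strip "_" plus that remainder from the end of the original name.
pref=${bname%"_$rest"}

echo "$pref"    # 00-0000_Last-First
```

If the name has fewer than two underscores, neither expansion matches and pref is simply the unchanged name.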

Need string extraction between tags

I have the string <tr><td>-Xms36g</td></tr>
I need to extract Xms36g from it. I got it working with
grep -oE '[Xms0-9g]' | xargs | sed 's| ||g'
But I would like to know whether there is any other way to achieve this.
Thank you.
Using grep with PCRE (-P)
grep -Po -- '-\K[^<]+'
- matches a literal -, and \K discards what has been matched so far
[^<]+ gets the portion up to the next <, i.e. the desired text
With sed:
sed -E 's/^[^-]*-([^<]+)<.*/\1/'
^[^-]*- matches the substring up to and including the first -
The only capture group, ([^<]+), gets the portion up to the next <
<.* matches the rest
The replacement uses only the captured group
Example:
% grep -Po -- '-\K[^<]+' <<<'<tr><td>-Xms36g</td></tr>'
Xms36g
% sed -E 's/^[^-]*-([^<]+)<.*/\1/' <<<'<tr><td>-Xms36g</td></tr>'
Xms36g
Parsing HTML with regular expressions is frowned upon. If you have xmllint, which ships with libxml2-utils on Debian-based systems, you can use this:
xmllint --html --xpath '//text()' file
You can also pipe to standard input. In this case you need to use - for the filename:
foo | xmllint --html --xpath '//text()' -
There are seemingly endless ways you could do this. Here's an awk example:
awk -F'-|<' '{print $4}'
Another variation:
awk -F'[-<]' '$0=$4 {print}'
Using sed:
sed -E 's/.*-([^/<>]*).*/\1/'
Using cut:
cut -b 10-15
Using echo:
echo "${str:9:6}"
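The fixed offsets in the cut and substring versions only work for this exact string. When the value is in a variable, bash's own regex matching avoids both the external process and the hard-coded positions. A minimal sketch, assuming the wanted text sits between the first - and the following <; the names re and opt are illustrative.

```bash
str='<tr><td>-Xms36g</td></tr>'

# Keep the regex in a variable so the < needs no special quoting
# inside [[ ... ]].
re='-([^<]+)<'

if [[ $str =~ $re ]]; then
  opt=${BASH_REMATCH[1]}   # first capture group
  echo "$opt"              # Xms36g
fi
```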

Extract tokens from log files in unix

I have a directory containing log files.
We are interested in a particular log line which goes like 'xxxxxxxxx|platform=SUN|.......|orderId=ABCDEG|........'
We have to extract all similar lines from the log files in this directory,and print out the token 'ABCDEG'.
Duplication is acceptable.
How do we achieve this with a single unix command operation?
sed -rn '/platform=.*orderId=/s/.*orderId=([^|]+).*/\1/p' *
For every line containing platform= followed by orderId= (/platform=.*orderId=/), this keeps the sequence of non-| characters (([^|]+)) after orderId=. The -n option together with the p flag prints only those lines.
awk -F'|' '$2=="platform=SUN"{sub(/orderId=/,"", $4); print $4}' logFile*
output
ABCDEG
IHTH
grep -rP '\|platform=SUN\|.*(?<=\|orderId=)' | sed 's/.*platform=SUN.*orderId=//' | sed 's/|.*//'
$ str='xxxxxxxxx|platform=SUN|.......|orderId=ABCDEG|........'
$ grep -Po 'platform=SUN.*orderId=\K[^|]*' <<< "$str"
ABCDEG
This requires Perl compatible regular expressions (-P); -o retains just the match. \K is variable length look-behind: "match the stuff to the left of it, but don't include it in the matched string".
From the logs directory you could run the following command:
sed -n '/platform=SUN/p' * | sed 's#.*orderId=\([^|]*\)|.*$#\1#'
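To try this kind of command without touching real logs, here is a self-contained sketch that builds a throwaway directory with made-up log lines and extracts the tokens in one sed call. The file names, the sample lines and the second token HIJKLM are all invented for the demonstration.

```bash
# Throwaway directory with hypothetical log files
logdir=$(mktemp -d)
printf '%s\n' \
  'aaa|platform=SUN|x|orderId=ABCDEG|y|z' \
  'bbb|platform=HP|x|orderId=SKIPME|y' > "$logdir/one.log"
printf '%s\n' \
  'ccc|platform=SUN|x|orderId=HIJKLM|y' > "$logdir/two.log"

# -n plus the p flag prints only lines where the substitution happened;
# [^|]* stops at the first | after the token, however many fields follow.
tokens=$(sed -n '/platform=SUN/s/.*orderId=\([^|]*\).*/\1/p' "$logdir"/*)

printf '%s\n' "$tokens"
# ABCDEG
# HIJKLM

rm -r "$logdir"
```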

Text Manipulation using sed or AWK

I get the following result in my script when I run it against my services. The result differs depending on the service but the text pattern showing below is similar. The result of my script is assigned to var1. I need to extract data from this variable
$var1=HOST1*prod*gem.dot*serviceList : svc1 HOST1*prod*kem.dot*serviceList : svc3, svc4 HOST1*prod*fen.dot*serviceList : svc5, svc6
I need to strip the name of the service list from $var1. So the end result should be printed on separate line as follow:
svc1
svc2
svc3
svc4
svc5
svc6
Can you please help with this?
Regards
Using sed and grep:
sed 's/[^ ]* :\|,//g' <<< "$var1" | grep -o '[^ ]*'
sed deletes each run of non-space characters that precedes " :", along with every comma (the \| alternation needs GNU sed). grep then prints the remaining service names one per line.
Using gnu grep and gnu sed:
grep -oP ': *\K\w+(, \w+)?' <<< "$var1" | sed 's/, /\n/'
svc1
svc3
svc4
svc5
svc6
grep is the perfect tool for the job.
From man grep:
-o, --only-matching
Print only the matched (non-empty) parts of a matching line, with each such part on a separate output line.
Sounds perfect!
As far as I'm aware this will work on any grep:
echo "$var1" | grep -o 'svc[0-9]\+'
Matches "svc" followed by one or more digits. You can also enable the "highly experimental" Perl regexp mode with -P, which means you can use the \d digit character class and don't have to escape the + any more:
grep -Po 'svc\d+' <<<"$var1"
In bash you can use <<< (a Here String) which supplies "$var1" to grep on the standard input.
By the way, if your data was originally on separate lines, like:
HOST1*prod*gem.dot*serviceList : svc1
HOST1*prod*kem.dot*serviceList : svc3, svc4
HOST1*prod*fen.dot*serviceList : svc5, svc6
This would be a good job for awk:
awk -F': ' '{n=split($2,a,", "); for (i=1; i<=n; i++) print a[i]}'
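A quick way to check the awk approach is to feed it the three sample lines from a variable. The sketch below loops by index rather than with for (i in a), because awk does not guarantee the order of a for-in loop; the variable names data and services are illustrative.

```bash
data='HOST1*prod*gem.dot*serviceList : svc1
HOST1*prod*kem.dot*serviceList : svc3, svc4
HOST1*prod*fen.dot*serviceList : svc5, svc6'

# Split each line on ": ", then split the service list on ", " and
# print the pieces in their original order.
services=$(awk -F': ' '{
  n = split($2, a, ", ")
  for (i = 1; i <= n; i++) print a[i]
}' <<< "$data")

printf '%s\n' "$services"
# svc1
# svc3
# svc4
# svc5
# svc6
```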

Resources