Need string extraction between tags - bash

I have the string <tr><td>-Xms36g</td></tr>
I need to extract Xms36g from it, and I managed to do so with
grep -oE '[Xms0-9g]' | xargs | sed 's| ||g'
But I would like to know whether there is any other way to achieve this.
Thank you.

Using grep with PCRE (-P)
grep -Po -- '-\K[^<]+'
- matches a literal - and \K discards everything matched so far, so the - is not part of the output
[^<]+ gets the portion up to the next <, i.e. our desired portion
With sed:
sed -E 's/^[^-]*-([^<]+)<.*/\1/'
^[^-]*- matches the substring up to and including the -
The only captured group, ([^<]+), gets the portion up to the next <
<.* matches the rest
In the replacement we have used the captured group only
Example:
% grep -Po -- '-\K[^<]+' <<<'<tr><td>-Xms36g</td></tr>'
Xms36g
% sed -E 's/^[^-]*-([^<]+)<.*/\1/' <<<'<tr><td>-Xms36g</td></tr>'
Xms36g

Parsing HTML with regular expressions is frowned upon. If you have xmllint, which is shipped with libxml2-utils, you can use this:
xmllint --html --xpath '//text()' file
You can also pipe to standard input. In this case you need to use - for the filename:
foo | xmllint --html --xpath '//text()' -

There are seemingly endless ways you could do this. Here's an awk example:
awk -F'-|<' '{print $4}'
Another variation:
awk -F'[-<]' '$0=$4 {print}'
Using sed:
sed -E 's/.*-([^/<>]*).*/\1/'
Using cut:
cut -b 10-15
Using echo:
echo "${str:9:6}"
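The cut and substring examples depend on fixed character offsets, so they only hold for this exact string. As a rough sketch (the variable names str, rest, and value are made up for illustration), shell parameter expansion can key on the - and < delimiters instead, without starting any external process:

```shell
str='<tr><td>-Xms36g</td></tr>'

rest=${str#*-}      # strip the shortest prefix ending in '-': leaves Xms36g</td></tr>
value=${rest%%<*}   # strip the longest suffix starting at '<': leaves Xms36g

printf '%s\n' "$value"
```

This assumes the first - in the string is the one directly in front of the value.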

Related

How to grep only matching string from this result?

I am just simply trying to grab the commit ID, but not quite sure what I'm missing:
➜ ~ curl https://github.com/microsoft/vscode/releases -s | grep -oE 'microsoft/vscode/commit/(.*?)/hovercard'
microsoft/vscode/commit/ccbaa2d27e38e5afa3e5c21c1c7bef4657064247/hovercard
The only thing I need back from this is ccbaa2d27e38e5afa3e5c21c1c7bef4657064247.
This works just fine on regex101.com and in ruby/python. What am I missing?
If supported, you can use grep -oP
echo "microsoft/vscode/commit/ccbaa2d27e38e5afa3e5c21c1c7bef4657064247/hovercard" | grep -oP "microsoft/vscode/commit/\K.*?(?=/hovercard)"
Output
ccbaa2d27e38e5afa3e5c21c1c7bef4657064247
Another option is to use sed with a capture group
echo "microsoft/vscode/commit/ccbaa2d27e38e5afa3e5c21c1c7bef4657064247/hovercard" | sed -E 's/microsoft\/vscode\/commit\/([^\/]+)\/hovercard/\1/'
Output
ccbaa2d27e38e5afa3e5c21c1c7bef4657064247
The point is that grep does not support extracting capturing group submatches. If you install pcregrep you could do that with
curl https://github.com/microsoft/vscode/releases -s | \
pcregrep -o1 'microsoft/vscode/commit/(.*?)/hovercard' | head -1
The | head -1 part is to fetch the first occurrence only.
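If installing pcregrep is not an option, bash itself exposes capture groups through the BASH_REMATCH array after a [[ =~ ]] match. A minimal sketch, assuming the line has already been read into a variable (line, re, and commit are illustrative names):

```shell
line='microsoft/vscode/commit/ccbaa2d27e38e5afa3e5c21c1c7bef4657064247/hovercard'
re='microsoft/vscode/commit/([^/]+)/hovercard'   # ERE; group 1 is the commit ID

commit=
if [[ $line =~ $re ]]; then
  commit=${BASH_REMATCH[1]}   # index 0 is the whole match, index 1 the first group
fi
printf '%s\n' "$commit"
```

[^/]+ is used here instead of the lazy .*? because POSIX extended regular expressions have no lazy quantifiers.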
I would suggest using awk here:
awk 'match($0,/microsoft\/vscode\/commit\/[^\/]*\/hovercard/){print substr($0,RSTART+24,RLENGTH-34);exit}'
The regex will match a line containing
microsoft\/vscode\/commit\/ - microsoft/vscode/commit/ fixed string
[^\/]* - zero or more chars other than /
\/hovercard - a /hovercard string.
substr($0,RSTART+24,RLENGTH-34) prints the part of the line that starts at index RSTART+24 (24 is the length of microsoft/vscode/commit/) and is RLENGTH-34 characters long: RLENGTH is the length of the whole match, and 34 is the length of microsoft/vscode/commit/ (24) plus the length of /hovercard (10), so only the commit ID remains.
The exit command will fetch you the first occurrence. Remove it if you need all occurrences.
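The 24 and 34 in that command are easy to get wrong when the pattern changes. A sketch of the same idea that lets awk compute the lengths itself (the sample line and the prefix and suffix variable names are assumptions for illustration):

```shell
line='<a href="/microsoft/vscode/commit/ccbaa2d27e38e5afa3e5c21c1c7bef4657064247/hovercard">'

commit=$(printf '%s\n' "$line" | awk '
  match($0, /microsoft\/vscode\/commit\/[^\/]*\/hovercard/) {
    prefix = length("microsoft/vscode/commit/")   # 24
    suffix = length("/hovercard")                 # 10
    # Skip the prefix and stop before the suffix, so only the ID is left
    print substr($0, RSTART + prefix, RLENGTH - prefix - suffix)
    exit
  }')
printf '%s\n' "$commit"
```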
You can use sed:
curl -s https://github.com/microsoft/vscode/releases |
sed -En 's=.*microsoft/vscode/commit/([^/]+)/hovercard.*=\1=p' |
head -n 1
head -n 1 prints only the first match (there are 10).
Your task cannot be achieved with Mac's grep. grep -o prints all matching text (instead of the default behaviour of printing whole matching lines), including microsoft/ etc. A grep that implements Perl regular expressions (like GNU grep on Linux) could make use of lookahead/lookbehind (grep -Po '(?<=microsoft/vscode/commit/)[^/]+(?=/hovercard)'), but that's just not available in Mac's grep.
On macOS you don't have the GNU utilities available by default. You can just pipe your output to a simple awk like this:
curl https://github.com/microsoft/vscode/releases -s |
grep -oE 'microsoft/vscode/commit/[^/]+/hovercard' |
awk -F/ '{print $(NF-1)}'
ccbaa2d27e38e5afa3e5c21c1c7bef4657064247
3a6960b964327f0e3882ce18fcebd07ed191b316
f4af3cbf5a99787542e2a30fe1fd37cd644cc31f
b3318bc0524af3d74034b8bb8a64df0ccf35549a
6cba118ac49a1b88332f312a8f67186f7f3c1643
c13f1abb110fc756f9b3a6f16670df9cd9d4cf63
ee8c7def80afc00dd6e593ef12f37756d8f504ea
7f6ab5485bbc008386c4386d08766667e155244e
83bd43bc519d15e50c4272c6cf5c1479df196a4d
e7d7e9a9348e6a8cc8c03f877d39cb72e5dfb1ff

Using sed to extract multiple occurrences in a string to array?

How to get all matches into a Bash array using sed?
My input is
<node><system_id>app1261.works.com</system_id><name>app1261</name></node><node><system_id>app1361.works.com</system_id><name>app1361</name></node><node><system_id>app1461.works.com</system_id><name>app1461</name></node>
Output expected stored in array
app1261.works.com
app1361.works.com
app1461.works.com
I am trying the command below but it always returns the last match instead of all matches.
sed -ne '{s/.*<system_id>\(.*\)<\/system_id>.*/\1/p;q;}' <<< "$xml"
Can anyone help me with this?
If you have GNU grep, then use this lookaround-based solution:
grep -Po '(?<=<system_id>)[^<>]+(?=</system_id>)' <<< "$xml"
app1261.works.com
app1361.works.com
app1461.works.com
Otherwise, a grep + sed solution:
grep -Eo '<system_id>([^<>]+)</system_id>' <<< "$xml" |
sed -E 's~</?system_id>~~g'
app1261.works.com
app1361.works.com
app1461.works.com
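Neither command stores the result yet. A sketch of filling a Bash array from the grep + sed pipeline with a read loop, which also works on Bash versions that lack mapfile (the array name ids is an assumption):

```shell
xml='<node><system_id>app1261.works.com</system_id><name>app1261</name></node><node><system_id>app1361.works.com</system_id><name>app1361</name></node><node><system_id>app1461.works.com</system_id><name>app1461</name></node>'

ids=()
# Process substitution keeps the loop in the current shell, so ids survives it
while IFS= read -r id; do
  ids+=("$id")
done < <(grep -Eo '<system_id>[^<>]+</system_id>' <<< "$xml" | sed -E 's~</?system_id>~~g')

# Print the element count, then one element per line
printf '%s\n' "${#ids[@]}" "${ids[@]}"
```

On Bash 4 or later, mapfile -t ids < <(...) with the same pipeline does this in one line.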

Grep multiple strings from text file

Okay so I have a textfile containing multiple strings, example of this -
Hello123
Halo123
Gracias
Thank you
...
I want grep to use these strings to find lines with matching strings/keywords from other files within a directory
example of text files being grepped -
123-example-Halo123
321-example-Gracias-com-no
321-example-match
so in this instance the output should be
123-example-Halo123
321-example-Gracias-com-no
With GNU grep:
grep -f file1 file2
-f FILE: Obtain patterns from FILE, one per line.
Output:
123-example-Halo123
321-example-Gracias-com-no
You should probably look at the manpage for grep to get a better understanding of what options are supported by the grep utility. However, there are a number of ways to achieve what you're trying to accomplish. Here's one approach:
grep -e "Hello123" -e "Halo123" -e "Gracias" -e "Thank you" list_of_files_to_search
However, since your search strings are already in a separate file, you would probably want to use this approach:
grep -f patternFile list_of_files_to_search
I can think of two possible solutions for your question:
Use multiple regular expressions - a regular expression for each word you want to find, for example:
grep -e Hello123 -e Halo123 file_to_search.txt
Use a single regular expression with an "or" operator. Using Perl regular expressions, it will look like the following:
grep -P "Hello123|Halo123" file_to_search.txt
EDIT:
As you mentioned in your comment, you want to use a list of words to find from a file and search in a full directory.
You can manipulate the words-to-find file to look like -e flags concatenation:
cat words_to_find.txt | sed 's/^/-e "/;s/$/"/' | tr '\n' ' '
This will return something like -e "Hello123" -e "Halo123" -e "Gracias" -e "Thank you", which you can then pass to grep using xargs:
cat words_to_find.txt | sed 's/^/-e "/;s/$/"/' | tr '\n' ' ' | xargs grep dir_to_search/*
As you can see, the last command also searches in all of the files in the directory.
SECOND EDIT: as PesaThe mentioned, the following command does this in a much simpler and more elegant way:
grep -f words_to_find.txt dir_to_search/*
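One caveat with -f: each line of the pattern file is treated as a regular expression, so a keyword containing . or * can match more than intended. A sketch that adds -F to match the keywords literally, using a throwaway directory whose file names are made up for the demonstration:

```shell
tmp=$(mktemp -d)

# Keyword list, one per line
printf '%s\n' 'Hello123' 'Halo123' 'Gracias' 'Thank you' > "$tmp/words_to_find.txt"

# A directory with one sample file to search
mkdir "$tmp/dir_to_search"
printf '%s\n' '123-example-Halo123' '321-example-Gracias-com-no' '321-example-match' \
  > "$tmp/dir_to_search/a.txt"

# -F: fixed strings, -f: read patterns from file, -h: omit file name prefixes
matches=$(grep -hFf "$tmp/words_to_find.txt" "$tmp/dir_to_search"/*)
printf '%s\n' "$matches"

rm -rf "$tmp"
```

Watch out for empty lines in the keyword file: an empty pattern matches every line.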

Extract field from xml file

xml file:
<head>
<head2>
<dict type="abc" file="/path/to/file1"></dict>
<dict type="xyz" file="/path/to/file2"></dict>
</head2>
</head>
I need to extract the list of files from this. So the output would be
/path/to/file1
/path/to/file2
So far, I've managed to do the following.
grep "<dict*file=" /path/to/xml.file | awk '{print $3}' | awk -F= '{print $NF}'
Quick and dirty, based on your sample rather than on everything XML allows:
# sed a bit secure
sed -e '/<head>/,/<\/head>/!d' -e '/.*[[:blank:]]file="\([^"]*\)".*/!d' -e 's//\1/' YourFile
# sed in brute force
sed -n 's/.*[[:blank:]]file="\([^"]*\)".*/\1/p' YourFile
# awk quick unsecure using your sample
awk -F 'file="|">' '/<head>/{h=1} /\/head>/{h=0} h && /[[:blank:]]file/ { print $2 }' YourFile
That said, I don't recommend this kind of extraction on XML unless you really know the format and content of your source and no more appropriate tool is available: extra fields, escaped quotes, or string content that looks like a tag are a big cause of failures and unexpected results.
now to use your own script
#grep "<dict*file=" /path/to/xml.file | awk '{print $3}' | awk -F= '{print $NF}'
awk '! /<dict.*file=/ {next} {$0=$3;FS="\"";$0=$0;print $2;FS=OFS}' YourFile
There is no need for a grep in front of awk; use a leading pattern filter, /<dict.*file=/, instead.
The second awk, which only exists to use a different separator (FS), can be folded into the same script by changing FS. Because a new FS only takes effect at the next evaluation (the next line by default), you can force a re-evaluation of the current content with $0=$0.
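The $0=$0 trick is easier to see on a tiny input. A sketch with a made-up record, a:b c:d, that prints the first field before and after forcing the re-split (the names out and first are illustrative):

```shell
out=$(printf '%s\n' 'a:b c:d' | awk '{
  first = $1      # referencing a field: the record is split on whitespace
  FS = ":"        # applies from the next record on, not to this one
  print $1        # still the whitespace-split field
  $0 = $0         # re-assigning the record re-splits it with the new FS
  print $1
}')
printf '%s\n' "$out"
```

The first line printed is a:b and the second is a, which shows that only the re-assignment made the new separator apply to the current record.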
Use an xmllint solution with --xpath set to //head/head2/dict/@file
xmllint --xpath "//head/head2/dict/@file" input-xml | awk 'BEGIN{FS="file="}{printf "%s\n%s\n", gensub(/"/,"","g",$2), gensub(/"/,"","g",$3)}'
/path/to/file1
/path/to/file2
Unfortunately, I couldn't provide a pure xmllint solution. I expected that
xmllint --xpath "string(//head/head2/dict/@file)" input-xml
would return the file attributes from both nodes, but it returned only the first instance.
So I coupled my logic with GNU Awk to extract the required values. Running
xmllint --xpath "//head/head2/dict/@file" input-xml
returns values as
file="/path/to/file1" file="/path/to/file2"
On the above output, setting the field delimiter to file= and removing the double quotes with the gensub() function solved the requirement.
Also PE [perl everywhere :) ] solution:
perl -MXML::LibXML -E 'say $_->to_literal for XML::LibXML->load_xml(location=>q{file.xml})->findnodes(q{/head/head2/dict/@file})'
it prints
/path/to/file1
/path/to/file2
For the above you need to have installed the XML::LibXML module.
With xmlstarlet it would be:
xmlstarlet sel -t -v "//head/head2/dict/@file" -nl input.xml
This command:
awk -F'[=" ">]' '{print $12}' file
produces:
/path/to/file1
/path/to/file2

Text Manipulation using sed or AWK

I get the following result in my script when I run it against my services. The result differs depending on the service but the text pattern showing below is similar. The result of my script is assigned to var1. I need to extract data from this variable
$var1=HOST1*prod*gem.dot*serviceList : svc1 HOST1*prod*kem.dot*serviceList : svc3, svc4 HOST1*prod*fen.dot*serviceList : svc5, svc6
I need to strip the name of the service list from $var1. So the end result should be printed on separate line as follow:
svc1
svc2
svc3
svc4
svc5
svc6
Can you please help with this?
Regards
Using sed and grep:
sed 's/[^ ]* :\|,//g' <<< "$var1" | grep -o '[^ ]*'
sed deletes every run of non-whitespace characters that precedes a colon (together with the colon) and every comma. grep then just outputs the remaining services one per line.
Using gnu grep and gnu sed:
grep -oP ': *\K\w+(, \w+)?' <<< "$var1" | sed 's/, /\n/'
svc1
svc3
svc4
svc5
svc6
grep is the perfect tool for the job.
From man grep:
-o, --only-matching
Print only the matched (non-empty) parts of a matching line, with each such part on a separate output line.
Sounds perfect!
As far as I'm aware this will work on any grep:
echo "$var1" | grep -o 'svc[0-9]\+'
Matches "svc" followed by one or more digits. You can also enable the "highly experimental" Perl regexp mode with -P, which means you can use the \d digit character class and don't have to escape the + any more:
grep -Po 'svc\d+' <<<"$var1"
In bash you can use <<< (a Here String) which supplies "$var1" to grep on the standard input.
By the way, if your data was originally on separate lines, like:
HOST1*prod*gem.dot*serviceList : svc1
HOST1*prod*kem.dot*serviceList : svc3, svc4
HOST1*prod*fen.dot*serviceList : svc5, svc6
This would be a good job for awk:
awk -F': ' '{n=split($2,a,", "); for (i=1; i<=n; i++) print a[i]}'
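To try this against the sample lines, the data can be piped in from a variable. In this sketch the loop runs from 1 to the count returned by split, since awk does not guarantee the order of for (i in a); the variable names data and services are illustrative:

```shell
data='HOST1*prod*gem.dot*serviceList : svc1
HOST1*prod*kem.dot*serviceList : svc3, svc4
HOST1*prod*fen.dot*serviceList : svc5, svc6'

services=$(printf '%s\n' "$data" | awk -F': ' '{
  n = split($2, a, ", ")              # n is the number of services on this line
  for (i = 1; i <= n; i++) print a[i] # index order keeps the original order
}')
printf '%s\n' "$services"
```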
