Regex - Pattern Matching in Shell - shell

I am trying to match a pattern and extract the values that come after it. I used the regex below, but it didn't help: nothing was extracted, and I got a blank value when I echoed the result.
Could someone tell me what mistake I made?
Sample input:
class="remove_link_style">Site Issue - Please check</a></td><td>
Working</td><td>
<ahref="/0051043899"class="remove_link_style">
Pattern used: text=$(echo "class="remove_link_style">Site Issue - Please check</a></td><td>Working</td><td><ahref="/0051043899"class="remove_link_style">" | grep -o --perl-regexp "(?class="remove_link_style")[a-zA-Z0-9_]+"")
I also wanted to extract the string that comes after class="remove_link_style" but before </a></td><td>

You will find plenty of advice against parsing XML or HTML with tools like grep/sed/awk. With that in mind, I would advise using a proper parser such as xmllint (http://xmlsoft.org/xmllint.html) or xmlstarlet (http://xmlstar.sourceforge.net/doc/xmlstarlet.txt). But if you just want to extract the contents quickly, you can combine grep and cut as below.
echo 'class="remove_link_style">GB|Trekkinn-UK|Manualcrawlrequest|1</a></td><td>WorkInProgress</td><td><ahref="/0051043899"class="remove_link_style">' | grep -Eo '(style"|<td)>[^<>]+' | cut -f2 -d">"
This prints out:
GB|Trekkinn-UK|Manualcrawlrequest|1
WorkInProgress
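If only the link text is wanted (the part after class="remove_link_style"> and before </a>), a sed substitution anchored on both ends is one option. This is a minimal sketch, not a general HTML parser: the variable names html, marker and link_text are made up for illustration, and it assumes the whole fragment sits on one line with a single link in it.

```shell
# Sample fragment from the question, kept on one line
html='class="remove_link_style">Site Issue - Please check</a></td><td>Working</td><td><ahref="/0051043899"class="remove_link_style">'

# Text that must precede the value we want (contains no regex metacharacters)
marker='class="remove_link_style">'

# Capture everything between the marker and the closing </a>;
# -n plus the p flag prints only lines where the substitution succeeded
link_text=$(printf '%s\n' "$html" | sed -n "s/.*${marker}\([^<]*\)<\/a>.*/\1/p")

echo "$link_text"
```

Because the leading .* is greedy, a line holding several links would yield only the last one that is followed by text and </a>.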
EDIT : As per OP's ask, store the output into an array.
If you need the output to be stored in an array, set IFS to a newline first, so that elements containing white space are not split into separate words.
IFS=$'\n'
result=($(echo 'class="remove_link_style">Site Issue - Please check</a></td><td>Working</td><td><ahref="/0051043899"class="remove_link_style">' | grep -Eo '(style"|<td)>[^<>]+' | cut -f2 -d">"))
unset IFS
for i in "${result[@]}"; do echo "$i"; done
Site Issue - Please check
Working
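An alternative that avoids changing IFS at all is mapfile, which reads one array element per input line. The sketch below assumes bash 4 or later; the pattern '>[^<>]+' is a simplification chosen for this example (any text that directly follows a > and contains no angle brackets), and the names html and result are only illustrative.

```shell
#!/usr/bin/env bash
html='class="remove_link_style">Site Issue - Please check</a></td><td>Working</td><td><ahref="/0051043899"class="remove_link_style">'

# grep -o emits one match per line; cut drops the leading ">".
# mapfile -t stores each line as one element and strips the trailing newline.
mapfile -t result < <(printf '%s\n' "$html" | grep -Eo '>[^<>]+' | cut -d'>' -f2)

# Quote the expansion so elements with spaces stay intact
for item in "${result[@]}"; do
  printf '%s\n' "$item"
done
```

Since each line becomes exactly one element, spaces inside a value never trigger word splitting and there is no IFS to restore afterwards.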

Related

Shell script - remove all before and after

Find the next link if the Link header contains rel=next..
Getting the link header can result in different strings.. I need to find the next link.
e.g.
Link: <http://mygithub.com/api/v3/organizations/20/repos?page=1>; rel=prev, <http://mygithub.com/api/v3/organizations/20/repos?page=3>; rel=next, <http://mygithub.com/api/v3/organizations/20/repos?page=4>; rel=last, <http://mygithub.com/api/v3/organizations/20/repos?page=1>;
would be http://mygithub.com/api/v3/organizations/20/repos?page=3
Link: <http://mygithub.com/api/v3/organizations/4/repos?page=2>; rel="next", <http://mygithub.com/api/v3/organizations/4/repos?page=2>; rel="last"
would be http://mygithub.com/api/v3/organizations/4/repos?page=2
Played with sed and parameter expansion - not that experienced so got stuck :)
Please be aware that parsing HTML with non-html tools is fraught with peril; you will see that this works, and assume you can always get away with it. You'll spend hours trying to get the next level of complexity to work, when you should be studying how to use html-aware tools. Don't say we didn't warn you (-; That said:
printf "<http://mygithub.com/api/v3/organizations/20/repos?page=1>; rel=prev, <http://mygithub.com/api/v3/organizations/20/repos?page=3>; rel=next, <http://mygithub.com/api/v3/organizations/20/repos?page=4>; rel=last, <http://mygithub.com/api/v3/organizations/20/repos?page=1>;\n" \
| awk -F" " '{
for(i=1;i<=NF;i++){
if ($i == "rel=next,") {
gsub(/[<>]/,"",$(i-1)); sub(/;$/,"",$(i-1))
print $(i-1)
}
}
}'
produces required output:
http://mygithub.com/api/v3/organizations/20/repos?page=3
To save the output of a script section into a variable, you wrap the code for command-substitution, in this case
nextReposLink=$( printf .... | awk '....' )
#-------------^^--------------------------^
The ^ pointed items are modern syntax for command-substitution. The code inside of $( ... ) is executed and its standard output is passed as an argument to the invoking command line. (The original syntax for command substitution is/was `cmds` and works the same in the simple case var=`cmds`.) You can nest modern cmd-substitution easily, whereas the old version requires a lot of escape character fiddling. Avoid the old form if you can.
Note that almost any s/str/rep/ that sed can do, awk can do as well, using the sub(/regex/, "repl", target) or gsub(sameArgs) functions, where target is a variable or field such as $0. In this particular case the < and > need no escaping inside the awk regex; in GNU awk, \< and \> are word-boundary operators rather than literal characters.
Be sure to always double-quote variables when you use them, i.e. echo "$nextReposLink".
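The awk version compares a field against the literal string rel=next, so the quoted form rel="next" from the second example in the question would slip through. Here is a rough sketch that accepts both forms and stores the result via command substitution. The function name next_link is made up for this example, and it assumes the URLs themselves contain no commas.

```shell
# Hypothetical helper: print the URL whose rel value is next (quoted or not)
next_link() {
  # Put each "<url>; rel=..." pair on its own line, then keep the URL of
  # the pair whose rel is next, with or without surrounding double quotes
  printf '%s\n' "$1" |
    tr ',' '\n' |
    sed -n 's/^[^<]*<\([^>]*\)>; *rel="\{0,1\}next"\{0,1\}.*/\1/p'
}

header='Link: <http://mygithub.com/api/v3/organizations/4/repos?page=2>; rel="next", <http://mygithub.com/api/v3/organizations/4/repos?page=2>; rel="last"'

# Command substitution captures the function's standard output
nextReposLink=$(next_link "$header")
echo "$nextReposLink"
```

If the header has no next link, the variable simply ends up empty, which makes an easy loop-termination test when paging through results.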
IHTH
Well - I put one of your URL strings in a text file and was able to pull out the first URL with two cuts.
[root@oelinux2 ~]# cat test
Link: <http://mygithub.com/api/v3/organizations/20/repos?page=1>; rel=prev, <http://mygithub.com/api/v3/organizations/20/repos?page=3>; rel=next, <http://mygithub.com/api/v3/organizations/20/repos?page=4>; rel=last, <http://mygithub.com/api/v3/organizations/20/repos?page=1>;
Then, using cut:
cat test | cut -d "<" -f2 | cut -d ">" -f1
[root@oelinux2 ~]# cat test | cut -d "<" -f2 | cut -d ">" -f1
http://mygithub.com/api/v3/organizations/20/repos?page=1
That's one option - if you are just looking to get the first URL in the string. Basically - that's just grabbing what's between the two delimiters "<" and ">"
With Cut:
-d is the 'delimiter'
-f is the field you want to get.
If you wanted to get a later URL in that string, you could change the fields (-f #) and see what you get :)
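For instance, the second URL sits in field 3 when splitting on "<", because field 1 is the "Link: " text before the first bracket. A small sketch using the sample line from the question (the variable names line and second_url are only for illustration):

```shell
line='Link: <http://mygithub.com/api/v3/organizations/20/repos?page=1>; rel=prev, <http://mygithub.com/api/v3/organizations/20/repos?page=3>; rel=next, <http://mygithub.com/api/v3/organizations/20/repos?page=4>; rel=last, <http://mygithub.com/api/v3/organizations/20/repos?page=1>;'

# Field 3 (split on "<") starts with the second URL;
# the second cut keeps everything before the closing ">"
second_url=$(printf '%s\n' "$line" | cut -d '<' -f3 | cut -d '>' -f1)

echo "$second_url"
```

This picks a URL by position only, so it works when the next link is always second and breaks as soon as the server reorders the entries.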

Grep $value `grep $value2 `<command>`` - Nested grep?

I'm a complete noob at awk/sed so forgive me if I'm missing something obvious here.
Basically I'm trying to do a nested grep, i.e. something akin to:
grep $value `exim -Mvh $(`exim -bpru | grep $eximID | more`)`
Breakdown:
grep $value IN COMMAND
--> exim -Mvh (print exim mail headers) FROM RESULTS OF
---> exim -bpru | grep $eximID | more
$value is the string I'm looking for
$eximID is the string I'm looking for within exim -bpru (list all exim thingies)
No idea if what I'm trying to accomplish would be easier with awk/sed hence the question really.
I tried to make that as legible as possible but nested nesting is hard yo
Edit
Tada! My script is now working thanks to you guys! Here it is, unfinished, but working:
#!/usr/bin/bash
echo "Enter the email address you want to search for + compare sender info via exim IDs."
read searchTarget
echo "Enter the target domain the email is coming from."
read searchDomain
#domainList is the array of exim IDs needed
domainList=($(exim -bpru | grep "$searchDomain" | awk '{ print $3 }'))
for i in "${domainList[@]}"
do
echo "$(exim -Mvh "$i" | grep "$searchTarget")"
#echo "$(grep $searchTarget $(exim -Mvh $i))"
done
grep $value `exim -Mvh $(`exim -bpru | grep $eximID | more`)`
This isn't right. The backticks (`command`) and $(command) do the same thing; it's just an alternative syntax. The advantage of $() is that it nests much more cleanly, so it's a good habit to always use it.
So, let's fix this, we now end up with:
grep "$value" "$(exim -Mvh "$(exim -bpru | grep "$eximID")")" | more
I relocated the more command, for what I think will be obvious reasons: more just paginates data for the user, so feeding the output of more to something else almost never makes sense.
I've also quoted the variables. This is a good habit too, because otherwise things will break when your variable contains certain characters (most commonly a space).
I can't test whether this gives you the output you want. If it doesn't, update your question with a few lines of example data and the expected output.
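One thing to watch with the nested form: grep treats its second argument as a file name, so text produced by a command substitution in that position is looked up as a file rather than searched. Piping the text into grep searches it directly. The sketch below uses two made-up stand-in functions, list_ids and show_headers, in place of the real exim calls, and their output is invented purely for illustration.

```shell
# Hypothetical stand-ins: list_ids plays the role of the command that
# yields a message ID, show_headers the command that prints its headers
list_ids() { printf '%s\n' 'id-001'; }
show_headers() { printf 'Message %s\nFrom: alice@example.com\nTo: bob@example.com\n' "$1"; }

value='bob@example.com'

# The inner $(...) supplies the ID as an argument;
# the pipe hands the header text to grep on standard input
match=$(show_headers "$(list_ids)" | grep "$value")

echo "$match"
```

Command substitution is the right tool when the output should become an argument (the ID), and a pipe is the right tool when the output is the data to be searched (the headers).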
If you're going to do it with back-quotes (not recommended; it is hard work), then you have to write:
grep $value `exim -Mvh $(\`exim -bpru | grep $eximID\`)`
(where I've removed the more since when used like that it behaves like cat and there's no point in using cat at the end of the commands like that either).
It would be more sane to use the $(…) notation throughout:
grep $value $(exim -Mvh $( $(exim -bpru | grep $eximID)))
And it seems more plausible that you don't need quite that many sets of indirection and this is what you're really after:
grep $value $(exim -Mvh $(exim -bpru | grep $eximID))
You should look at:
Why didn't back quotes in a shell script help me cd to a directory?
What is the benefit of using $(…) instead of back ticks in shell scripts?
Why does \$ reduce to $ inside backquotes [though not inside $(…)]?
and no doubt there are other related questions too.

grep pipe searching for one word, not line

For some reason I cannot get this to output just the version of this line. I suspect it has something to do with how grep interprets the dash.
This command:
admin@DEV:~/TEMP$ sendemail
Yields the following:
sendemail-1.56 by Brandon Zehm
More output below omitted
The first line is of interest. I'm trying to store the version to variable.
TESTVAR=$(sendemail | grep '\s1.56\s')
Does anyone see what I am doing wrong? Thanks
TESTVAR is just empty. Even without TESTVAR, the output is empty.
I just tried the following too, thinking this might work.
sendemail | grep '\<1.56\>'
I just tried it again while editing, and I think I have another issue. Perhaps I'm not handling the output correctly. It's outputting the entire line, but I can see that grep is finding 1.56 because it highlights it in the line.
$ TESTVAR=$(echo 'sendemail-1.56 by Brandon Zehm' | grep -Eo '1.56')
$ echo $TESTVAR
1.56
The point is grep -Eo '1.56'
From the grep man page:
-E, --extended-regexp
Interpret PATTERN as an extended regular expression (ERE, see below). (-E is specified by POSIX.)
-o, --only-matching
Print only the matched (non-empty) parts of a matching line, with each such part on a separate output line.
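The same -o approach also works when the version is not known in advance. A minimal sketch, assuming the version always has the form digits, dot, digits; the variable first_line stands in for the first line of the real sendemail output.

```shell
first_line='sendemail-1.56 by Brandon Zehm'

# The escaped dot matches a literal "." only; an unescaped "." would match
# any character. -o prints just the matched text, not the whole line.
version=$(printf '%s\n' "$first_line" | grep -Eo '[0-9]+\.[0-9]+')

echo "$version"
```

If a line contained several number pairs, grep -o would print each on its own line, so the variable would then hold more than one value.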
Your regular expression doesn't match the form of the version. You have specified that the version is surrounded by spaces, yet in front of it you have a dash.
Replace the first \s with the capitalized form \S, or explicit set of characters and it should work.
I'm wondering: In your example you seem to know the version (since you grep for it), so you could just assign the version string to the variable. I assume that you want to obtain any (unknown) version string there. The regular expression for this in sed could be (using POSIX character classes):
sendemail |sed -n -r '1 s/sendemail-([[:digit:]]+\.[[:digit:]]+).*/\1/ p'
The -n suppresses the normal default output of every line; -r enables extended regular expressions; the leading 1 tells sed to only work on line 1 (I assume the version appears in the first line). I anchored the version number to the telltale string sendemail- so that potential other numbers elsewhere in that line are not matched. If the program name changes or the hyphen goes away in future versions, this wouldn't match any longer though.
Both the grep solution above and this one have the disadvantage of reading the whole output, which (as emails go these days) may be long. In addition, grep would find every other line in the program's output that contains the pattern (if it's indeed emails, somebody might discuss this problem in them, with examples!). If the version is indeed on the first line, piping through head -n 1 first would be efficient and prudent.
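Putting those two points together, here is a sketch that looks only at the first line and anchors on the program name. The function fake_sendemail is a made-up stand-in for the real command, and its second output line is invented to show that later mentions of a version are ignored. Plain POSIX basic regular expressions are used, so no -r or -E flag is needed.

```shell
# Hypothetical stand-in for the real program's output
fake_sendemail() {
  printf 'sendemail-1.56 by Brandon Zehm\nExample text mentioning sendemail-9.99\n'
}

# head -n 1 keeps only the first line; sed then extracts digits.digits
# that directly follow the "sendemail-" prefix at the start of the line
version=$(fake_sendemail | head -n 1 |
  sed -n 's/^sendemail-\([0-9][0-9]*\.[0-9][0-9]*\).*/\1/p')

echo "$version"
```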
jayadevan@jayadevan-Vostro-2520:~$ echo $sendmail
sendemail-1.56 by Brandon Zehm
jayadevan@jayadevan-Vostro-2520:~$ echo $sendmail | cut -f2 -d "-" | cut -f1 -d" "
1.56

How to parse a config file using sed

I've never used sed apart from the few hours trying to solve this. I have a config file with parameters like:
test.us.param=value
test.eu.param=value
prod.us.param=value
prod.eu.param=value
I need to parse these and output this if REGIONID is US:
test.param=value
prod.param=value
Any help on how to do this (with sed or otherwise) would be great.
This works for me:
sed -n 's/\.us\././p'
i.e. if the ".us." can be replaced by a dot, print the result.
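To drive this from the REGIONID variable mentioned in the question, the region can be lower-cased and spliced into the sed expression. This is a sketch under the assumption that REGIONID holds just letters (US, EU and so on), so nothing in it needs regex escaping; the names region, config and filtered are illustrative.

```shell
REGIONID=US

# Config keys use lower case, so convert US -> us
region=$(printf '%s' "$REGIONID" | tr '[:upper:]' '[:lower:]')

config='test.us.param=value
test.eu.param=value
prod.us.param=value
prod.eu.param=value'

# Double quotes let the shell expand ${region}; sed still sees \. as a literal dot
filtered=$(printf '%s\n' "$config" | sed -n "s/\.${region}\././p")

echo "$filtered"
```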
If there are hundreds and hundreds of lines, it might be more efficient to first search for lines containing .us. and then do the string replacement. AWK is another good choice, or you can pipe grep into sed:
cat INPUT_FILE | grep "\.us\." | sed 's/\.us\./\./g'
Of course if '.us.' can be in the value this isn't sufficient.
You could also do this with the address syntax (technically you can fold the second sed into the first statement as well; I just can't remember the syntax):
sed -n '/\(prod\|test\)\.us\.[^=]*=/p' FILE | sed 's/\.us\./\./g'
We should probably do something cleaner. If the format is always environment.region.param, we can restrict the substitution to the text before the equals sign:
sed -n 's/^\([^.=]*\)\.us\.\([^=]*\)=/\1.\2=/p'
This only matches lines that start with any number of characters other than '.' or '=', followed by '.us.', then any number of characters up to the '=' sign. This way we won't modify a '.us.' that appears inside a value.
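To see why anchoring on the key matters, consider values that themselves contain .us. (the host names below are invented for the example). A sketch assuming the environment.region.param=value layout, with an environment name that has no dots in it:

```shell
# Both values contain ".us.", but only the first KEY has the us region
config='test.us.url=http://www.us.example.com
prod.eu.url=http://www.us.example.com'

# ^ pins the match to the start of the line and [^=]* stops at the first "=",
# so the value part is never touched and the eu line is not printed at all
keys_only=$(printf '%s\n' "$config" |
  sed -n 's/^\([^.=]*\)\.us\.\([^=]*\)=/\1.\2=/p')

echo "$keys_only"
```

An unanchored s/\.us\././ would also print the prod.eu line, with the .us. inside its URL rewritten.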

Trimming pathnames beyond a keyword (awk, sed, ?)

I want to trim a pathname beyond a certain point after finding a keyword. I'm drawing a blank this morning.
/home/quikq/1.0/dev/Java/tsupdater/src/tsupdater.java
I want to find the keyword Java, save the pathname beyond that (tsupdater), then cut everything off after the Java portion.
I don't know if this is what you want, but you can split the pathname into two with:
echo "/home/quikq/1.0/dev/Java/tsupdater/src/tsupdater.java" | sed 'h;s/.*Java//p;g;s/Java.*/Java/'
Which outputs:
/tsupdater/src/tsupdater.java
/home/quikq/1.0/dev/Java
If you would like to save the second part into a file part2.txt and print the first part, you could do:
echo "/home/quikq/1.0/dev/Java/tsupdater/src/tsupdater.java" | sed -e 'h;s/.*Java//;w part2.txt' -e 'g;s/Java.*/Java/'
If you're writing a shell script:
myvar="/home/quikq/1.0/dev/Java/tsupdater/src/tsupdater.java"
part1="${myvar%Java*}Java"
part2="${myvar#*Java/}"
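Parameter expansion can also pull out just the directory name that follows the keyword, without starting any external process. A small sketch along the same lines; the variable names rest, project and base are only illustrative, and it assumes the keyword appears as a complete path component (/Java/).

```shell
myvar="/home/quikq/1.0/dev/Java/tsupdater/src/tsupdater.java"

# Strip the shortest prefix ending in "Java/" -> tsupdater/src/tsupdater.java
rest="${myvar#*Java/}"

# Strip the longest suffix starting at the first "/" -> tsupdater
project="${rest%%/*}"

# Strip everything from "/Java/" onward, then add the keyword back
base="${myvar%%/Java/*}/Java"

echo "$project"
echo "$base"
```

A single # or % removes the shortest match and a doubled ## or %% removes the longest, which is what decides where the cut lands when the pattern could match in several places.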
Hope this helps =)
Take the one you need:
kent$ echo "/home/quikq/1.0/dev/Java/tsupdater/src/tsupdater.java"|sed -r 's#(.*Java/[^/]*).*#\1#g'
/home/quikq/1.0/dev/Java/tsupdater
kent$ echo "/home/quikq/1.0/dev/Java/tsupdater/src/tsupdater.java"|sed -r 's#(.*Java).*#\1#g'
/home/quikq/1.0/dev/Java
kent$ echo "/home/quikq/1.0/dev/Java/tsupdater/src/tsupdater.java"|sed -r 's#.*Java/([^/]*).*#\1#g'
tsupdater
I'm not entirely sure what you want as output (please specify more clearly), but this command:
echo "/home/quikq/1.0/dev/Java/tsupdater/src/tsupdater.java" | sed 's/.*Java//'
results in:
/tsupdater/src/tsupdater.java
If you want the preceding part then this command:
echo "/home/quikq/1.0/dev/Java/tsupdater/src/tsupdater.java" | sed 's/Java.*//'
results in:
/home/quikq/1.0/dev/
Like I said, I was having a weird morning, but it dawned on me.
echo /home/quikq/1.0/dev/Java/TSUpdater/src/TSUpdater.java | sed 's/Java.*//g'
Yields
/home/quikq/1.0/dev/
Lots of great tips here for chopping it up different ways though. Thanks a bunch!
