I have a feed.xml file that looks something like this. What I want to do is grab test.html from this feed (basically, the topmost item's URL). Any thoughts on how to do this?
<rss>
<item>
<title>ABC</title>
<url>
test.html
</url>
</item>
<item>
<title>CDE</title>
<url>
test1.html
</url>
</item>
</rss>
Thanks!
If the structure is fixed and you know that the URL ends with .html, you can simply do:
grep '\.html' <yourfile> | head -n1
(Note the escaped dot; an unescaped . would match any character.)
If you don't know the suffix (or the string "html" can also appear earlier in the file), you can do:
grep -A1 "<url>" <yourfile> | head -n2 | tail -n1
EDIT
If the structure is not fixed (e.g., everything is on one line with no newlines), then this
grep -o "<url>[^<]*</url>" <yourfile> | head -n1 | cut -d'>' -f2 | cut -d'<' -f1
or this
grep -o "<url>[^<]*</url>" <yourfile> | head -n1 | sed -E 's#<url>(.*)</url>#\1#'
may work.
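For example, on a one-line copy of the feed (hypothetical file name feed-oneline.xml; sample data taken from the question), the grep -o pipeline behaves like this:

```shell
# Recreate a one-line version of the feed (sample data from the question)
printf '<rss><item><title>ABC</title><url>test.html</url></item><item><title>CDE</title><url>test1.html</url></item></rss>\n' > feed-oneline.xml

# Grab the first <url> element, then strip the surrounding tags
grep -o "<url>[^<]*</url>" feed-oneline.xml | head -n1 | cut -d'>' -f2 | cut -d'<' -f1
# prints: test.html
```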
This might work for you:
sed '/<url>/,/<\/url>/{//d;s/ *//;q};d' file.xml
This awk script should work:
awk '/<url>/ && url==0 {url=1;next;} {if(url==1) {print;url=2;}}' file
EDIT:
Following grep command might also work:
grep -m 1 "^ *<url>" -A1 file | grep -v "<url>"
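As a quick sanity check against the multi-line feed from the question (sketch; the file is recreated inline here):

```shell
# Recreate the multi-line feed from the question
cat > feed.xml <<'EOF'
<rss>
<item>
<title>ABC</title>
<url>
test.html
</url>
</item>
<item>
<title>CDE</title>
<url>
test1.html
</url>
</item>
</rss>
EOF

# First <url> match plus the line after it, minus the <url> line itself
grep -m 1 "^ *<url>" -A1 feed.xml | grep -v "<url>"
# prints: test.html
```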
Instead of using line-based tools, I'd suggest using an xsl transform to get the data you want out of the document without making assumptions about the way it's formatted.
If you save this to get-url.xsl:
<?xml version="1.0"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:xs="http://www.w3.org/2001/XMLSchema">
<xsl:output method="text"/>
<xsl:template match="/">
<xsl:value-of select="normalize-space(rss/item/url)"/>
</xsl:template>
</xsl:stylesheet>
Then you can get the value of url from feed.xml like this:
$ xsltproc get-url.xsl feed.xml; echo
test.html
$
The extra echo is just there to give you a newline after the end of the output, to make it friendly for an interactive shell. Just remove it if you're assigning the result to a shell variable with $().
I am using an awk command to fetch values from XML elements:
$(awk -F "[><]" '/'$tag_name'/{print $3}' $FILE_NAME | sort | uniq)
Here:
FILE_NAME: the XML file.
tag_name: the name of the XML element whose value we need.
Sample XML
<item>
<tag1>test</tag1>
<tag2><![CDATA[cdata_test]]></tag2>
</item>
One of the tags in the XML contains CDATA, and for that tag the script is not working as expected: when I try to print the value, it prints blank.
Instead of using a generic tool such as awk, which is not aware of XML specifics, I suggest you use xmlstarlet to select the nodes you want. For instance:
xmlstarlet select -t -v '//tag1' -n input.xml
will give as result:
test
Issuing:
xmlstarlet select -t -v '//tag2' -n input.xml
gives as output:
cdata_test
If you don't need the newline at the end of the returned string, just remove the -n from the options of the xmlstarlet command.
Keep it simple.
As xmlstarlet is not installed on my machine, I used sed prior to my awk command as follows, and that works for me:
$(sed -e 's/<!\[CDATA\[//g; s/\]\]>//g' ${FILE_NAME} | awk -F "[><]" '/'$tag_name'/{print $3}' | sort | uniq)
(The [ and ] in the CDATA markers are regex metacharacters, so they must be escaped in the sed patterns.) If anybody has another solution, that is also welcome.
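As a sketch using the sample from the question (tag2 stands in for $tag_name; note the [ and ] in the CDATA markers are escaped in the sed patterns, since they are regex metacharacters):

```shell
# Recreate the sample XML from the question
cat > sample.xml <<'EOF'
<item>
<tag1>test</tag1>
<tag2><![CDATA[cdata_test]]></tag2>
</item>
EOF

# Strip the CDATA wrapper first, then let awk split fields on < and >
sed -e 's/<!\[CDATA\[//g; s/\]\]>//g' sample.xml | awk -F '[><]' '/tag2/{print $3}'
# prints: cdata_test
```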
I am a beginner in scripting and I am working on bash scripting for my work.
For this task I tried the sed command, which didn't work.
For your problem, the following would work:
#!/bin/bash
curl -s http://example.com/ | grep -P "\s*\<h1\>.*\<\/h1\>" |sed -n 's:.*<h1>\(.*\)</h1>.*:\1:p'
curl -s http://example.com/ | grep -P "\s*\<p\>.*\<\/p\>" |sed -n 's:.*<p>\(.*\)</p>.*:\1:p'
The first line scrapes the page via curl; grep picks out the <h1>..</h1> part (assuming there's only one, as in your example), and sed extracts the first capturing group ( (.*) ) via \1.
The second line does the same, but for the <p> tag.
I could cram these two lines into one grep, but they'll work fine as they are!
Edit:
If the <p> tag ends on a different line, the above won't work; you may have to use pcregrep:
curl -s http://example.com/ | pcregrep -M "\s*\<p\>(\n|.)*\<\/p\>"
You can use the following one liner :
curl -s http://example.com/ | sed -n '2,$p' > /tmp/tempfile && xmllint --xpath '/html/head/title/text()' /tmp/tempfile && echo ; xmllint --xpath '/html/body/div/p/text()' /tmp/tempfile
This uses xmllint's xpath command to extract the text within <title> and <p> tags.
Is there a way to use xmllint --format file without the header (the XML declaration line)?
<?xml version="1.0"?>
<Tag>
<Sub>A</Sub>
</Tag>
I know you can use --c14n, but that does not seem to mix well with --format, as xmllint --format --c14n file will just produce:
<Tag><Sub>A</Sub></Tag>
Desired Result
<Tag>
<Sub>A</Sub>
</Tag>
You can use sed to remove the first line. Not saying it's the best but it would get you going:
xmllint --format <file> | sed 1d
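For instance (a sketch: the sample file below is already formatted, so it stands in for the output of xmllint --format):

```shell
# Recreate the sample document, declaration included
printf '<?xml version="1.0"?>\n<Tag>\n  <Sub>A</Sub>\n</Tag>\n' > file.xml

# Drop the first line (the XML declaration)
sed 1d file.xml
# prints the document without the <?xml ...?> line
```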
You would preferably try to avoid one million calls to xmllint (and to sed or tail).
I'm not sure whether xmllint supports in-place editing, but if it does, then something like this might be possible:
xargs < list_of_files_to_change.txt xmllint --inplace --format
xargs < list_of_files_to_change.txt sed -i 1d
I'm having trouble extracting only a matching string, OPER^, from a log4j file.
I can get this value from two different sources inside my log file:
2012-01-26 03:06:45,428 INFO [NP_OSS] OSSBSSGWIMPL6000|**OPR20120126120537008893**|GenServiceDeactivationResponse :: processRequestGenServiceDeactivationResponse() ::
or:
2012-01-26 03:06:45,411 INFO [NP_OSS] MESSAGE_DATA = <?xml version="1.0" encoding="UTF-8" standalone="yes"?><ns2:ServiceDeactivationResponse xmlns:ns2="urn:ngn:foo"><MessageHeader><MessageTimeStamp>20120126031123</MessageTimeStamp>**<OperatorTrxID>OPR20120126120537008893</OperatorTrxID>**</MessageHeader></ns2:ServiceDeactivationResponse>
I need to extract only the OPR* value.
I'm guessing it's much easier to extract it from the first one, since it doesn't involve parsing XML.
Thanks a lot in advance for your help!
Maybe I didn't understand the OP's question well, but why can't a simple grep command do the job? Something like:
grep -Po 'OPR\d+'
The output for both lines is the same:
OPR20120126120537008893
$ echo $line | grep OPR | sed -e "s/^.*OPR\([0-9]*\).*$/\1/"
Edit:
After reading your comment:
$ echo $line | grep OPR | sed -e "s/^.*\(OPR[0-9]*\).*$/\1/" | head -1
awk: Setting up Field Separators
awk -v FS="[<>]" '{print $13}' logfile
perl: Using positive lookbehind and lookahead
perl -pne 's/.*(?<=\<OperatorTrxID\>)([A-Z0-9]+)(?=\<\/OperatorTrxID\>).*/$1/' logfile
Test:
[jaypal:~/Temp] cat logfile
2012-01-26 03:06:45,411 INFO [NP_OSS] MESSAGE_DATA = <?xml version="1.0" encoding="UTF-8" standalone="yes"?><ns2:ServiceDeactivationResponse xmlns:ns2="urn:ngn:foo"><MessageHeader><MessageTimeStamp>20120126031123</MessageTimeStamp><OperatorTrxID>OPR20120126120537008893</OperatorTrxID></MessageHeader></ns2:ServiceDeactivationResponse>
[jaypal:~/Temp] awk -v FS="[<>]" '{print $13}' logfile
OPR20120126120537008893
[jaypal:~/Temp] perl -pne 's/.*(?<=\<OperatorTrxID\>)([A-Z0-9]+)(?=\<\/OperatorTrxID\>).*/$1/' logfile
OPR20120126120537008893
Say I have a bunch of XML files which contain no newlines, but basically contain a long list of records, delimited by </record><record>
If the delimiter were </record>\n<record> I would be able to do something like cat *.xml | grep xyz | wc -l to count instances of records of interest, because cat would emit the records one per line.
Is there a way to write SOMETHING *.xml | grep xyz | wc -l where SOMETHING can stream out the records one per line? I tried using awk for this but couldn't find a way to avoid streaming the whole file into memory.
Hopefully the question is clear enough :)
This is a little ugly, but it works:
sed 's|</record>|</record>\
|g' *.xml | grep xyz | wc -l
(Yes, I know I could make it a little bit shorter, but only at the cost of clarity.)
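A quick sanity check on some one-line sample data (hypothetical records; the literal newline in the sed replacement keeps it portable to BSD sed):

```shell
# Three records on a single line, two of which contain xyz
printf '<records><record>abc</record><record>xyz</record><record>xyz2</record></records>\n' > t.xml

sed 's|</record>|</record>\
|g' t.xml | grep xyz | wc -l
# prints 2 (the count of matching records)
```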
If your record body contains no characters such as <, /, or >, then you may try one of these:
grep -E -o 'SEARCH_STRING[^<]*</record>' *.xml| wc -l
or
grep -E -o 'SEARCH_STRING[^/]*/record>' *.xml| wc -l
or
grep -E -o 'SEARCH_STRING[^>]*>' *.xml| wc -l
Here is a different approach using xsltproc, grep, and wc. Warning: I am new to XSL so I can be dangerous :-). Here is my count_records.xsl file:
<?xml version="1.0"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="text" /> <!-- Output text, not XML -->
<xsl:template match="record"> <!-- Match each "record" node -->
<xsl:value-of select="text()"/> <!-- Output: contents of node record -->
<xsl:text>&#10;</xsl:text> <!-- Output: a new line (comments are not allowed inside xsl:text, so the newline is written as a character reference) -->
</xsl:template>
</xsl:stylesheet>
On my Mac, I found a command-line tool called xsltproc, which reads instructions from an XSL file and processes XML files accordingly. So the command would be:
xsltproc count_records.xsl *.xml | grep SEARCH_STRING | wc -l
The xsltproc command displays the text of each record node, one line at a time.
The grep command filters for the text you are interested in.
Finally, the wc command produces the count.
You may also try xmlstarlet for gigabyte-sized files:
# cf. http://niftybits.wordpress.com/2008/03/27/working-with-huge-xml-files-tools-of-the-trade/
xmlstarlet sel -T -t -v "count(//record[contains(normalize-space(text()),'xyz')])" -n *.xml |
awk '{n+=$1} END {print n}'
xmlstarlet sel -T -t -v "count(//record[contains(normalize-space(text()),'xyz')])" -n *.xml |
paste -s -d '+' /dev/stdin | bc
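The summing step at the end works on any stream of per-file counts; a sketch with made-up counts standing in for the xmlstarlet output:

```shell
# Pretend three files produced counts 2, 3, and 0; sum them as above
printf '2\n3\n0\n' | awk '{n+=$1} END {print n}'
# prints: 5
```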