Bash XHTML parsing using xpath

I'm writing a small script to learn how to parse an XHTML web page. The following command:
cat q?s=goog.xhtml | xpath '//span[@id="yfs_l10_goog"]'
returns:
Found 2 nodes:
-- NODE --
<span id="yfs_l10_goog">624.50</span>-- NODE --
<span id="yfs_l10_goog">624.50</span>
How do I:
write my command so that it extracts only the value 624.50?
extract it only once?
Source page I'm parsing: http://finance.yahoo.com/q?s=goog

Edit 2:
Give this a try:
xpath -q -e '//span[@id="yfs_l10_goog"][1]/text()'
Edit:
Pipe your output through:
sed -n '/span/{s/<span[^<]*>\([^<]*\)<.*/\1/;p;q}'
Original answer:
Using xmlstarlet:
echo -e '<foo><span id="yfs_l10_goog">624.50</span>\n<bar>xyz</bar><span id="yfs_l10_goog">555.50</span>\n<span id="yfs_l10_goog">123.50</span></foo>' |
xmlstarlet sel -t -v "//span[@id='yfs_l10_goog']"
Result of query:
624.50
Result of echo:
<foo><span id="yfs_l10_goog">624.50</span>
<bar>xyz</bar><span id="yfs_l10_goog">555.50</span>
<span id="yfs_l10_goog">123.50</span></foo>
Result of xml fo:
<?xml version="1.0"?>
<foo>
<span id="yfs_l10_goog">624.50</span>
<bar>xyz</bar>
<span id="yfs_l10_goog">555.50</span>
<span id="yfs_l10_goog">123.50</span>
</foo>
Other queries:
$ echo -e '...' | xmlstarlet sel -t -v "//span[@id='yfs_l10_goog'][1]"
624.50
$ echo -e '...' | xmlstarlet sel -t -v "//span[@id='yfs_l10_goog'][3]"
123.50
$ echo -e '...' | xmlstarlet sel -t -v "//span[@id='yfs_l10_goog'][last()]"
123.50
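For comparison, here is a minimal sketch of the same selection in Python, using only the standard library's xml.etree.ElementTree (which supports a limited XPath subset). It is not part of the original answer: the function name span_values is made up for illustration, and the sample input is the document from the echo command above. A real XHTML page usually declares a default namespace, which this sketch does not handle.

```python
import xml.etree.ElementTree as ET

# Same sample document as in the echo command above.
SAMPLE = (
    '<foo><span id="yfs_l10_goog">624.50</span>\n'
    '<bar>xyz</bar><span id="yfs_l10_goog">555.50</span>\n'
    '<span id="yfs_l10_goog">123.50</span></foo>'
)

def span_values(xml_text, span_id):
    """Return the text of every <span> whose id attribute equals span_id."""
    root = ET.fromstring(xml_text)
    return [span.text for span in root.findall(f".//span[@id='{span_id}']")]

values = span_values(SAMPLE, "yfs_l10_goog")
print(values[0])   # first match, comparable to [1] in the XPath queries
print(values[-1])  # last match, comparable to [last()]
```

Indexing the returned list plays the role of the positional predicates, so the value is extracted once and without the surrounding markup.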

Related

How to scrape Wikipedia GPS latitude/longitude?

I have been wondering how to scrape information from Wikipedia. For example, I have a list of world cities and want to obtain their approximate latitude and longitude. Take Miami as an example. When I run curl https://en.wikipedia.org/wiki/Miami | grep -E '(latitude|longitude)', somewhere in the HTML there is markup like the following.
<span class="latitude">25°46′31″N</span> <span class="longitude">80°12′31″W</span>
I know I can extract it with a regular expression, but my regex skills are very poor. Can someone help me with this?
With xidel and xpath:
$ xidel -se '
concat(
(//span[@class="latitude"]/text())[1],
" ",
(//span[@class="longitude"]/text())[1]
)
' 'https://en.wikipedia.org/wiki/Miami'
Output
25°46′31″N 80°12′31″W
Or
saxon-lint --html --xpath '<XPATH EXP>' <URL>
If you prefer better-known tools:
curl -s 'https://en.wikipedia.org/wiki/Miami' > Miami.html
xmlstarlet format -H Miami.html 2>/dev/null | sponge Miami.html
xmlstarlet sel -t -v '<XPATH EXP>' Miami.html
You didn't ask, but regular expressions are not the right tool for parsing HTML.
You can't parse HTML with RegEx. Please use an HTML-parser like xidel instead:
$ xidel -s "https://en.wikipedia.org/wiki/Miami" -e '
(//span[@class="geo-dms"])[1],
(//span[@class="geo-dec"])[1],
(//span[@class="geo"])[1],
replace((//span[@class="geo"])[1],";",())
'
25°46′31″N 80°12′31″W
25.775163°N 80.208615°W
25.775163; -80.208615
25.775163 -80.208615
Take your pick.
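If none of these tools is available, the parser-based approach can also be sketched with Python's built-in html.parser. This is only an illustration under assumptions: the class and function names (GeoSpanParser, extract_coordinates) are hypothetical, the input is the markup snippet quoted in the question rather than the live page, and nested tags inside the target spans are not handled.

```python
from html.parser import HTMLParser

# Markup snippet quoted in the question.
SAMPLE = ('<span class="latitude">25°46′31″N</span> '
          '<span class="longitude">80°12′31″W</span>')

class GeoSpanParser(HTMLParser):
    """Collect the text of the first latitude and longitude spans."""

    def __init__(self):
        super().__init__()
        self.found = {}       # class name -> collected text
        self._current = None  # class name of the span being read, if any

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        classes = (dict(attrs).get("class") or "").split()
        for name in ("latitude", "longitude"):
            # Only the first span of each kind is recorded.
            if name in classes and name not in self.found:
                self._current = name
                self.found[name] = ""

    def handle_data(self, data):
        if self._current:
            self.found[self._current] += data

    def handle_endtag(self, tag):
        if tag == "span":
            self._current = None

def extract_coordinates(html_text):
    parser = GeoSpanParser()
    parser.feed(html_text)
    return parser.found

coords = extract_coordinates(SAMPLE)
print(coords["latitude"], coords["longitude"])
```

Keeping only the first span of each class mirrors the [1] used in the xidel queries.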

XMLStarlet doesn't select xpath query correctly

I have the following XML
<?xml version='1.0' encoding='UTF-8'?>
<ListBucketResult xmlns='http://doc.s3.amazonaws.com/2006-03-01'>
<Name>chromedriver</Name>
<Prefix></Prefix>
<Marker></Marker>
<IsTruncated>false</IsTruncated>
<Contents>
<Key>2.0/chromedriver_linux32.zip</Key>
<Generation>1380149859530000</Generation>
<MetaGeneration>4</MetaGeneration>
<LastModified>2013-09-25T22:57:39.349Z</LastModified>
<ETag>"c0d96102715c4916b872f91f5bf9b12c"</ETag>
<Size>7262134</Size>
</Contents>
<Contents>
<Key>2.0/chromedriver_linux64.zip</Key>
<Generation>1380149860664000</Generation>
<MetaGeneration>4</MetaGeneration>
<LastModified>2013-09-25T22:57:40.449Z</LastModified>
<ETag>"858ebaf47e13dce7600191ed59974c09"</ETag>
<Size>7433593</Size>
</Contents>
...
</ListBucketResult>
I tried to select only the Key nodes with this command:
xmlstarlet sel -T -t -m '/ListBucketResult/Contents/Key' -v '.' -n file.xml
I tried several commands, but none of them returned any value.
I also ran el to see the structure:
xmlstarlet el file.xml
ListBucketResult
ListBucketResult/Name
ListBucketResult/Prefix
ListBucketResult/Marker
ListBucketResult/IsTruncated
ListBucketResult/Contents
ListBucketResult/Contents/Key
ListBucketResult/Contents/Generation
ListBucketResult/Contents/MetaGeneration
ListBucketResult/Contents/LastModified
ListBucketResult/Contents/ETag
ListBucketResult/Contents/Size
I don't know what is incorrect.
Your XML elements are bound to the namespace http://doc.s3.amazonaws.com/2006-03-01, but your XPath is not referencing any namespaces (not using a namespace-prefix). So, it is attempting to reference elements in the "no namespace" and finding nothing.
You need to declare that namespace with a namespace-prefix using the -N switch, and use the namespace-prefix in your XPath:
xmlstarlet sel -N s3="http://doc.s3.amazonaws.com/2006-03-01" -T -t -m '/s3:ListBucketResult/s3:Contents/s3:Key' -v '.' -n file.xml
Reference:
http://xmlstar.sourceforge.net/doc/UG/ch05s01.html
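The same namespace behaviour can be reproduced with Python's standard xml.etree.ElementTree, as a rough sketch. The prefix s3 is arbitrary (as it is with -N), the function name list_keys is made up for illustration, and the sample is a shortened version of the XML in the question.

```python
import xml.etree.ElementTree as ET

# Prefix-to-URI mapping; the prefix itself is an arbitrary local choice.
S3_NS = {"s3": "http://doc.s3.amazonaws.com/2006-03-01"}

# Shortened version of the document from the question.
SAMPLE = (
    "<ListBucketResult xmlns='http://doc.s3.amazonaws.com/2006-03-01'>"
    "<Name>chromedriver</Name>"
    "<Contents><Key>2.0/chromedriver_linux32.zip</Key></Contents>"
    "<Contents><Key>2.0/chromedriver_linux64.zip</Key></Contents>"
    "</ListBucketResult>"
)

def list_keys(xml_text):
    """Return the text of every Key element, using namespace-qualified paths."""
    root = ET.fromstring(xml_text)
    return [key.text for key in root.findall("s3:Contents/s3:Key", S3_NS)]

root = ET.fromstring(SAMPLE)
# Without a prefix the path looks in "no namespace" and matches nothing.
print(root.findall("Contents/Key"))
# With the prefix bound to the document's namespace, both keys are found.
print(list_keys(SAMPLE))
```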

How to get attribute values of multiple nodes in xpath with just xmllint?

I want to query the names of all the persons in the test.xml below.
<body>
<person name="abc"></person>
<person name="def"></person>
<person name="ghi"></person>
</body>
basic query
This has the problem of including "name", which I don't want.
$ xmllint --xpath '//body/person/@name' test.xml
name="abc"
name="def"
name="ghi"
string function
Using the string function, I only get one result.
$ xmllint --xpath 'string(//body/person/@name)' test.xml
abc
sed and grep
This works but looks needlessly complicated to me.
xmllint --xpath '//body/person/@name' test.xml | grep -o '"\([^"]*\)"' | sed 's|"||g'
abc
def
ghi
Question
Is it possible to get multiple values without the attribute name and without using another tool like grep?
I don't know about xmllint, but xmlstarlet can do it:
xmlstarlet sel -t -v 'body/person/@name' test.xml
Output:
abc
def
ghi
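As another option outside xmllint, a short Python sketch using the standard xml.etree.ElementTree returns the bare attribute values directly. The function name person_names is hypothetical, and the sample is the test.xml content from the question.

```python
import xml.etree.ElementTree as ET

# Content of test.xml from the question.
SAMPLE = """<body>
<person name="abc"></person>
<person name="def"></person>
<person name="ghi"></person>
</body>"""

def person_names(xml_text):
    """Return the name attribute of every <person> child of the root element."""
    root = ET.fromstring(xml_text)
    return [person.get("name") for person in root.findall("person")]

for name in person_names(SAMPLE):
    print(name)
```

Element.get returns only the attribute value, so no post-processing with grep or sed is needed.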

XmlStarlet Querying XML

I have this XML document. Could you help me extract the values of all the item elements using XMLStarlet in a shell script?
<transfer-matrix.xml>
<transfers>
<rows>
<item>
<item>Hungary</item>
<item>Kharkov-KIPT-LCG2</item>
<item>9882899680</item>
<item>4</item>
<item>1</item>
</item>
<item>
<item>Spain</item>
<item>Kharkov-KIPT-LCG2</item>
<item>32945102817</item>
<item>12</item>
<item>2</item>
</item>
<item>
<item>Finland</item>
<item>Kharkov-KIPT-LCG2</item>
<item>10737418240</item>
<item>4</item>
<item>0</item>
</item>
<item>...</item>
<item>...</item>
<item>...</item>
</rows>
<key>...</key>
</transfers>
<params>...</params>
</transfer-matrix.xml>
I'm trying to extract the items like this:
outcome=`xml sel -T -t -m /transfer-matrix.xml/transfers/rows/item -s D:N:- "@item" -v "concat(@item,'|',item,'|',item,'|',item,'|',item,'|',item)" -n /usr/share/dashboard/xml/transfers-country.xml`
My output is:
|Hungary|Hungary|Hungary|Hungary|Hungary |Spain|Spain|Spain|Spain|Spain |Finland|Finland|Finland|Finland|Finland
I need format like this
|Hungary|Kharkov-KIPT-LCG2|9882899680|4|1
|Spain|Kharkov-KIPT-LCG2|32945102817|12|2
|Finland|Kharkov-KIPT-LCG2|10737418240|4|0
I would be grateful for the help
A bare item always resolves to the first child item, so you need to specify which element you want by position and add a newline character at the end, like this:
OUTPUT=$(xmlstarlet sel -T -t -m /transfer-matrix.xml/transfers/rows/item -s D:N:- "@item" -v "concat(@item,'|',item[1],'|',item[2],'|',item[3],'|',item[4],'|',item[5],'\n')" transfers-country.xml)
And then you can get the desired result via echo -e:
$ echo -e "$OUTPUT"
|Hungary|Kharkov-KIPT-LCG2|9882899680|4|1
|Spain|Kharkov-KIPT-LCG2|32945102817|12|2
|Finland|Kharkov-KIPT-LCG2|10737418240|4|0
Edit: As npostavs points out, it would be much better to use the -n flag instead:
$ xmlstarlet sel -T -t -m /transfer-matrix.xml/transfers/rows/item -s D:N:- "@item" -v "concat(@item,'|',item[1],'|',item[2],'|',item[3],'|',item[4],'|',item[5])" -n transfers-country.xml
|Hungary|Kharkov-KIPT-LCG2|9882899680|4|1
|Spain|Kharkov-KIPT-LCG2|32945102817|12|2
|Finland|Kharkov-KIPT-LCG2|10737418240|4|0
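The row-by-row joining can also be sketched in Python with the standard xml.etree.ElementTree, which avoids the positional indexes altogether. This is an illustration only: format_rows is a made-up name, the sample is a trimmed copy of the XML in the question, and the leading "|" is kept purely to match the requested output format.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the document from the question.
SAMPLE = """<transfer-matrix.xml>
<transfers>
<rows>
<item><item>Hungary</item><item>Kharkov-KIPT-LCG2</item><item>9882899680</item><item>4</item><item>1</item></item>
<item><item>Spain</item><item>Kharkov-KIPT-LCG2</item><item>32945102817</item><item>12</item><item>2</item></item>
<item><item>Finland</item><item>Kharkov-KIPT-LCG2</item><item>10737418240</item><item>4</item><item>0</item></item>
</rows>
</transfers>
</transfer-matrix.xml>"""

def format_rows(xml_text):
    """Return one '|'-prefixed, '|'-separated line per outer <item> row."""
    root = ET.fromstring(xml_text)
    lines = []
    for row in root.findall("transfers/rows/item"):
        # Each inner <item> is one cell of the row, taken in document order.
        cells = [cell.text or "" for cell in row.findall("item")]
        lines.append("|" + "|".join(cells))
    return lines

print("\n".join(format_rows(SAMPLE)))
```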

how can a BPEL variable be put into a shell variable

A BPEL process creates an XML document, structured according to a certain XSD file, and I want to parse that BPEL variable with xmllint or xmlstarlet from a Unix shell command line. Is that possible at all?
How can I put the BPEL variable into a shell variable so that I can parse it with xmllint, for instance?
INPUT:
<?xml version="1.0"?>
<ns:ItemList xmlns:ns="http:///blabla">
<GenericItem>
<ns2:LocalItem xmlns:ns2="http:///blabla">
<ItemSource> </ItemSource>
<ConcItemSource>
<name></name>
<requirements/>
<strategy/>
</ConcItemSource>
<dataFormat/>
<directory></directory>
<file/>
</ns2:LocalItem>
</GenericItem>
<GenericItem>
<ns2:LocalItem xmlns:ns2="http:///blabla">
<ItemSource>
</ItemSource>
<ConcItemSource>
<name></name>
<requirements/>
<strategy/>
</ConcItemSource>
<dataFormat/>
<directory></directory>
<file/>
</ns2:LocalItem>
</GenericItem>
</ns:ItemList>
Using xmlstarlet :
$ cat bpel.xml
<?xml version="1.0"?>
<ns:ItemList xmlns:ns="http:///blabla">
<GenericItem>
<ns2:LocalItem xmlns:ns2="http:///blabla">
<ItemSource> </ItemSource>
<ConcItemSource>
<name></name>
<requirements/>
<strategy/>
</ConcItemSource>
<dataFormat/>
<directory>d1</directory>
<file/>
</ns2:LocalItem>
</GenericItem>
<GenericItem>
<ns2:LocalItem xmlns:ns2="http:///blabla">
<ItemSource>
</ItemSource>
<ConcItemSource>
<name></name>
<requirements/>
<strategy/>
</ConcItemSource>
<dataFormat/>
<directory>d2</directory>
<file/>
</ns2:LocalItem>
</GenericItem>
</ns:ItemList>
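Command line (the parentheses matter: //directory[1] would match the first directory child of every parent, here both d1 and d2):
$ dir1=$(xmlstarlet sel -t -v '(//directory)[1]/text()' bpel.xml)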
$ echo "$dir1"
d1
Using a for loop:
$ count=$(xmlstarlet sel -t -v 'count(//directory)' bpel.xml)
$ for ((i=1; i<=count; i++)); do
xmlstarlet sel -t -v "(//directory)[$i]/text()" bpel.xml >> newfile
done
But you can simply do:
$ xmlstarlet sel -t -v "//directory/text()" bpel.xml >> newfile
Reading the XML from stdin:
command_producing_xml | xmlstarlet sel -t -v "//directory/text()" -
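A comparable sketch in Python reads the XML from any text stream, such as standard input, and returns the directory values as a list. The function name directories_from_stream is hypothetical, and the sample is a reduced version of bpel.xml that keeps only the elements needed here.

```python
import io
import xml.etree.ElementTree as ET

# Reduced version of bpel.xml: only the elements relevant to the query.
SAMPLE = (
    '<ns:ItemList xmlns:ns="http:///blabla">'
    '<GenericItem><ns2:LocalItem xmlns:ns2="http:///blabla">'
    '<directory>d1</directory></ns2:LocalItem></GenericItem>'
    '<GenericItem><ns2:LocalItem xmlns:ns2="http:///blabla">'
    '<directory>d2</directory></ns2:LocalItem></GenericItem>'
    '</ns:ItemList>'
)

def directories_from_stream(stream):
    """Parse XML read from a text stream and return all <directory> texts."""
    root = ET.fromstring(stream.read())
    # <directory> is in no namespace, so the plain tag name matches.
    return [d.text for d in root.iter("directory")]

# io.StringIO stands in for sys.stdin here so the example is self-contained.
dirs = directories_from_stream(io.StringIO(SAMPLE))
print(dirs[0])  # first directory, like (//directory)[1]
print(dirs)
```

In a script fed by a pipe, passing sys.stdin instead of the StringIO object would play the role of the trailing "-" in the xmlstarlet command.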
