How to select the text behind an element? - bash

I have the following xmllint example selecting an element:
$ curl -s http://lists.opencsw.org/pipermail/users/2015-January/date.html |
xmllint --html --xpath '/html/body/p/b[contains(., "Messages:")]' -
<b>Messages:</b>
After the bold element comes the number of messages I am interested in. It is shown when I use the parent axis:
$ curl -s http://lists.opencsw.org/pipermail/users/2015-January/date.html |
xmllint --html --xpath '/html/body/p/b[contains(., "Messages:")]/parent::*' -
<p><b>Starting:</b> <i>Thu Jan 1 23:17:09 CET 2015</i><br><b>Ending:</b> <i>Sat Jan 31 14:51:07 CET 2015</i><br><b>Messages:</b> 28</p>
I thought that the following-sibling axis might give me exactly this number, but it does not do so:
$ curl -s http://lists.opencsw.org/pipermail/users/2015-January/date.html |
xmllint --html --xpath '/html/body/p/b[contains(., "Messages:")]/following-sibling::*' -
XPath set is empty

The text node you are after is indeed a following sibling, but it's a text node, not an element node. An expression like
following-sibling::*
only looks for following siblings that are elements. To match text nodes, use text():
$ curl -s http://lists.opencsw.org/pipermail/users/2015-January/date.html |
xmllint --html --xpath '/html/body/p/b[contains(., "Messages:")]/following-sibling::text()' -
Note the trailing -, which tells xmllint to read from standard input; without it, xmllint only prints its usage message. If you first save the result from curl to example.html, you can use
$ xmllint example.html --html --xpath '/html/body/p/b[contains(., "Messages:")]/following-sibling::text()'
The result is _28. That's not really an underscore; it stands for the leading whitespace I want to point out. To remove the leading whitespace, use
$ xmllint example.html --html --xpath 'normalize-space(/html/body/p/b[contains(., "Messages:")]/following-sibling::text())'
And no, using regex is not really an option.
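To try this without hitting the network, here is a self-contained sketch: it recreates a minimal version of the page locally (the fragment below is reduced from the <p> element shown above, so the real page carries more markup) and captures the message count in a shell variable.

```shell
# Recreate a reduced version of the relevant fragment locally.
# The structure is taken from the <p> element printed by the parent-axis query.
cat > example.html <<'EOF'
<html><body>
<p><b>Starting:</b> <i>Thu Jan  1 23:17:09 CET 2015</i><br>
<b>Ending:</b> <i>Sat Jan 31 14:51:07 CET 2015</i><br>
<b>Messages:</b> 28</p>
</body></html>
EOF

# normalize-space() strips the whitespace around the text node.
count=$(xmllint --html --xpath \
  'normalize-space(/html/body/p/b[contains(., "Messages:")]/following-sibling::text())' \
  example.html)
echo "$count"   # prints: 28
```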

Is there any way to redirect pipe output as a file?

I'm using xmllint to select a node, and my test file 1.xml looks like this:
<resources>
<item>
<label>LABEL</label>
<value>VALUE</value>
<description>DESCRIPTION</description>
</item>
<item>
<label>LABEL</label>
<value>VALUE</value>
<description>DESCRIPTION</description>
</item>
</resources>
$ xmllint --xpath '/resources/item/value' 1.xml
<value>VALUE</value><value>VALUE</value>
The command above works well.
But when I try to combine it with a pipe |, an error occurs:
$ cat 1.xml | xmllint --xpath '/resources/item/value'
Usage : xmllint [options] XMLfiles ...
...(help info)
I suppose the reason is that the pipe passes cat's output as a stream, while xmllint only accepts a file path as its argument. Is there any way to solve this problem, or maybe some alternative?
Of course, if my guess is wrong, pointing out the real reason would also be very helpful to me.
Sorry, my English is poor. Please excuse any grammar or typing errors; I'm trying my best to improve.
I have never used xmllint, but from its man page, I can see:
The xmllint program parses one or more XML files, specified on the command line as XML-FILE (or the standard input if the filename provided is -).
Therefore, the following should work:
xmllint --xpath '/resources/item/value' 1.xml
or, if you insist that the input should come via stdin,
xmllint --xpath '/resources/item/value' - <1.xml
As an alternative, you can pass commands to xmllint's interactive shell via the --shell option:
echo -e 'cat /resources/item/value\nbye' | xmllint --shell 1.xml
Or
(echo 'cat /resources/item/value'; echo 'bye') | xmllint --shell 1.xml
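For longer command sequences, a here-document reads better than chained echo calls. A self-contained sketch (it recreates 1.xml first, so everything here comes straight from the question):

```shell
# Recreate the sample file from the question.
cat > 1.xml <<'EOF'
<resources>
<item>
<label>LABEL</label>
<value>VALUE</value>
<description>DESCRIPTION</description>
</item>
<item>
<label>LABEL</label>
<value>VALUE</value>
<description>DESCRIPTION</description>
</item>
</resources>
EOF

# Feed several shell commands to xmllint --shell via a here-document,
# instead of chaining echo calls.
xmllint --shell 1.xml <<'EOF'
cat /resources/item/value
bye
EOF
```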

Get title of an RSS feed with bash

How can I get the title of an RSS feed with Bash? Say I want to get the most recent article from MacRumors. Their RSS feed link is http://feeds.macrumors.com/MacRumors-All. How can I get the most recent article title with Bash?
An alternative to xmllint is xmlstarlet and so:
curl -s http://feeds.macrumors.com/MacRumors-All | xmlstarlet sel -t -m "/rss/channel/item[1]" -v "title"
Use the xmlstarlet sel command to select the XPath we are looking for, then use -v to display the value of a specific element.
You can combine curl and an XPath expression (here, using xmllint), and rely on the fact that the feed is in reverse chronological order:
curl http://feeds.macrumors.com/MacRumors-All | xmllint --xpath '/rss/channel/item[1]/title/text()' -
See How to execute XPath one-liners from shell? for other ways to evaluate XPath.
In particular, if you have an older xmllint without --xpath, you may be able to use the technique suggested by this wrapper:
echo 'cat /rss/channel/item[1]/title/text()' | xmllint --shell <(curl http://feeds.macrumors.com/MacRumors-All)
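If you want to experiment without hitting the network, the same XPath works on a local file. A minimal sketch; the fragment below only assumes the standard RSS 2.0 layout, not MacRumors' actual feed:

```shell
# A minimal RSS 2.0 fragment; real feeds carry more elements per item.
cat > feed.xml <<'EOF'
<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item><title>Newest article</title></item>
    <item><title>Older article</title></item>
  </channel>
</rss>
EOF

# item[1] is the first item in document order, i.e. the most recent
# article in a reverse-chronological feed.
xmllint --xpath '/rss/channel/item[1]/title/text()' feed.xml
```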

Bash wget filter specific word

I want to filter a specific word from a website using wget.
The word I want to filter out is hPa, together with its value.
See: https://www.foreca.de/Deutschland/Berlin/Berlin
I can't find useful information on how to filter out a specific string.
This is what I've tried so far:
#!/bin/bash
LAST=$(wget -l1 https://www.foreca.de/Deutschland/Berlin/Berlin -O - | sed -e 'hPa')
echo $LAST
Thanks for helping me out.
A fully-fledged solution using XPath:
Command:
$ saxon-lint --html --xpath '//div[contains(text(), "hPa")]/text()' \
'https://www.foreca.de/Deutschland/Berlin/Berlin'
Output:
1026 hPa
Notes:
Don't parse HTML with regex; use a proper XML/HTML parser as we do here. Check: Using regular expressions with HTML tags
Check https://github.com/sputnick-dev/saxon-lint (my own project)
If what I wrote bores you and you just want a quick and dirty command, even if it's evil, then use curl -s https://www.foreca.de/Deutschland/Berlin/Berlin | grep -oP '\d+\s+hPa'
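If saxon-lint is not installed, plain xmllint can evaluate the same kind of XPath. A self-contained sketch on a reduced HTML fragment; the div structure here is only an assumption about the real page:

```shell
# A reduced stand-in for the weather page; the real markup differs.
cat > page.html <<'EOF'
<html><body>
<div class="row"><div>1026 hPa</div></div>
</body></html>
EOF

# Select the text of any div whose text contains "hPa".
xmllint --html --xpath '//div[contains(text(), "hPa")]/text()' page.html
```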

Get last page number from HTML

I have a page pagination that look like this in HTML:
<span class="nav">Go to <b>1</b>, 2, 3, 4, 5 Next</span>
What I want to get is the last page number (so in this example "5"). How can I do it in Bash? Thank you for your help.
As a solution that only considers numbers given as the text associated with links inside of <span class="nav"> (assuming in.html as your input file):
xmllint --html --xmlout - <in.html \
| xmlstarlet sel -t -m '//span[@class="nav"]//a' -v 'text()' -n \
| egrep '^[[:digit:]]+$' \
| sort -n \
| tail -n 1
This uses xmllint (included with modern Linux distributions) to convert your HTML to XML, and XMLStarlet (not always included, but generally packaged for common distributions) to search that XML.
This assumes that the HTML always conforms to your input:
sed 's/page-/\n/g' | sort -n | tail -1 | sed 's/.html.*//'
(sed 's/page-/\n/g' puts a newline just before each page number. sort -n sorts numerically; lines that do not start with a page number are sorted to the top. tail -1 selects the line with the highest page number, and sed 's/.html.*//' strips off all the non-page-number content.)
If the only numbers in the text are the page numbers, then you can do it like the following:
egrep '[0-9]+' -o | sort -r -n | head -1
It will match the numbers in the text, then sort them and take the first (highest) one. You can modify the regexp if you want to be more specific. A better approach would definitely be possible in Python using BeautifulSoup4, where you can traverse the DOM as with jQuery.
EDIT: added -n to the command (+1 @CharlesDuffy)
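Applied to the example span from the question, the whole pipeline fits on one line (using grep -E, the modern spelling of egrep):

```shell
# Extract all numbers, sort numerically in reverse, keep the highest.
printf '%s\n' '<span class="nav">Go to <b>1</b>, 2, 3, 4, 5 Next</span>' \
  | grep -Eo '[0-9]+' | sort -rn | head -n 1
# prints: 5
```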

How do you use the --pattern option of xmllint?

I'm trying to see how libxml implements XPath support, so it made sense to me to test using xmllint. However, the obvious option, --pattern, is somewhat obscure, and I ended up using something like the following:
test.xml: <foo><bar/><bar/></foo>
> xmllint --shell test.xml
/ > dir /foo
ELEMENT foo
/ > dir /foo/*
ELEMENT bar
ELEMENT bar
This seems to work, and that's great, but I'm still curious. What is xmllint's --pattern option for, and how does it work?
Provide an example for full credit. =)
The seemingly undocumented --xpath option is more useful.
% cat data.xml
<project>
<name>
bob
</name>
<version>
1.1.1
</version>
</project>
% xmllint --xpath '/project/version/text()' data.xml | xargs -i echo -n "{}"
1.1.1
% xmllint --xpath '/project/name/text()' data.xml | xargs -i echo -n "{}"
bob
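The xargs calls above only trim the whitespace around the text nodes; XPath's own normalize-space() does the same without a second process. A self-contained sketch using the same data.xml:

```shell
# Recreate the sample file from the answer above.
cat > data.xml <<'EOF'
<project>
<name>
bob
</name>
<version>
1.1.1
</version>
</project>
EOF

# normalize-space() trims the leading/trailing whitespace from the value.
xmllint --xpath 'normalize-space(/project/version)' data.xml
```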
The hint is in the words "which can be used with the reader interface to the parser": xmllint only uses the reader interface when passed the --stream option:
$ xmllint --stream --pattern /foo/bar test.xml
Node /foo/bar[1] matches pattern /foo/bar
Node /foo/bar matches pattern /foo/bar
From the xmllint(1) man page:
--pattern PATTERNVALUE
Used to exercise the pattern recognition engine, which can be
used with the reader interface to the parser. It allows to
select some nodes in the document based on an XPath (subset)
expression. Used for debugging.
It only understands a subset of XPath and its intention is to aid debugging. The library that does understand XPath fully is libxslt(3) and its command-line tool xsltproc(1).
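As an illustration of the libxslt route, here is a sketch (the file names are made up) that evaluates a full XPath expression with xsltproc via a one-template stylesheet over the test.xml from the question:

```shell
# Recreate the sample file from the question.
printf '<foo><bar/><bar/></foo>' > test.xml

# A minimal stylesheet whose single template evaluates an XPath
# expression; xsltproc ships with libxslt.
cat > count.xsl <<'EOF'
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>
  <xsl:template match="/">
    <xsl:value-of select="count(/foo/bar)"/>
  </xsl:template>
</xsl:stylesheet>
EOF

xsltproc count.xsl test.xml
```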
The "pattern" module in libxml "allows to compile and test pattern expressions for nodes either in a tree or based on a parser state"; its documentation lives here: http://xmlsoft.org/html/libxml-pattern.html
Ari.
If you simply want the text value of a number of XML nodes, then you could use something like this (if --xpath is not available in your version of xmllint):
./foo.xml:
<hello>
<world>its alive!!</world>
<world>and works!!</world>
</hello>
$ xmllint --stream --pattern /hello/world --debug ./foo.xml | grep -A 1 "matches pattern" | grep "#text" | sed 's/.* [0-9] //'
its alive!!
and works!!
