How can I get the title of an RSS feed with Bash? Say I want to get the most recent article from MacRumors. Their RSS feed link is http://feeds.macrumors.com/MacRumors-All. How can I get the most recent article title with Bash?
An alternative to xmllint is xmlstarlet:
curl -s http://feeds.macrumors.com/MacRumors-All | xmlstarlet sel -t -m "/rss/channel/item[1]" -v "title"
The xmlstarlet sel command uses -m to match the XPath we are looking for (here, the first item) and -v to print the value of a specific element (here, title).
You can combine curl and an XPath expression (here, using xmllint), and rely on the fact that the feed is in reverse chronological order:
curl -s http://feeds.macrumors.com/MacRumors-All | xmllint --xpath '/rss/channel/item[1]/title/text()' -
See How to execute XPath one-liners from shell? for other ways to evaluate XPath.
In particular, if you have an older xmllint without --xpath, you may be able to use the technique suggested by this wrapper:
echo 'cat /rss/channel/item[1]/title/text()' | xmllint --shell <(curl http://feeds.macrumors.com/MacRumors-All)
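To try the XPath without hitting the network, here is a minimal sketch against an inline sample feed. The feed contents and the variable names (feed, latest_title) are made up for illustration. It assumes an xmllint that supports --xpath, and it uses string() so that the result comes back as plain text.

```shell
#!/bin/bash
# Invented sample feed; a real RSS feed has the same /rss/channel/item layout.
feed='<?xml version="1.0"?>
<rss version="2.0"><channel><title>Sample feed</title>
<item><title>Newest article</title></item>
<item><title>Older article</title></item>
</channel></rss>'

# "-" tells xmllint to read the document from stdin.
# item[1] is the first item, which is the newest one if the feed is in reverse chronological order.
latest_title=$(printf '%s\n' "$feed" | xmllint --xpath 'string(/rss/channel/item[1]/title)' -)

echo "$latest_title"
```

Replacing the printf with curl -s and the feed URL should give the same kind of result for a live feed.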
I want to extract a specific value from a website using wget.
The string I want to extract is hPa, together with its value.
See: https://www.foreca.de/Deutschland/Berlin/Berlin
I can't find useful information on how to filter out a specific string.
This is what I've tried so far:
#!/bin/bash
LAST=$(wget -l1 https://www.foreca.de/Deutschland/Berlin/Berlin -O - | sed -e 'hPa')
echo $LAST
Thanks for helping me out.
A fully fledged solution using XPath:
Command:
$ saxon-lint --html --xpath '//div[contains(text(), "hPa")]/text()' \
'https://www.foreca.de/Deutschland/Berlin/Berlin'
Output:
1026 hPa
Notes:
Don't parse HTML with regex; use a proper XML/HTML parser, as we do here. Check: Using regular expressions with HTML tags
Check https://github.com/sputnick-dev/saxon-lint (my own project)
If what I wrote bores you and you just want a quick and dirty command, even if it's evil, then use curl -s https://www.foreca.de/Deutschland/Berlin/Berlin | grep -oP '\d+\s+hPa'
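If saxon-lint is not at hand, the same idea can be sketched with xmllint's HTML parser. Note that this swaps in a different tool than the answer above uses. The HTML snippet is invented sample data, not the real page, and the sketch assumes an xmllint with --xpath.

```shell
#!/bin/bash
# Invented sample markup standing in for the downloaded page.
page='<html><body><div>Luftdruck</div><div>1026 hPa</div></body></html>'

# --html selects the HTML parser; string() returns the text of the first matching div.
# Parser warnings go to stderr and are discarded here.
pressure=$(printf '%s\n' "$page" |
  xmllint --html --xpath 'string(//div[contains(text(), "hPa")])' - 2>/dev/null)

echo "$pressure"
```

For the live page, the printf could be replaced by curl -s with the URL, though the real markup may need a more specific XPath.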
I have the following xmllint example selecting an element:
$ curl -s http://lists.opencsw.org/pipermail/users/2015-January/date.html |
xmllint --html --xpath '/html/body/p/b[contains(., "Messages:")]' -
<b>Messages:</b>
After the bold element comes the number of messages I am interested in. It is shown when I use the parent axis:
$ curl -s http://lists.opencsw.org/pipermail/users/2015-January/date.html |
xmllint --html --xpath '/html/body/p/b[contains(., "Messages:")]/parent::*' -
<p><b>Starting:</b> <i>Thu Jan 1 23:17:09 CET 2015</i><br><b>Ending:</b> <i>Sat Jan 31 14:51:07 CET 2015</i><br><b>Messages:</b> 28</p>
I thought that the following-sibling axis might give me exactly this number, but it does not do so:
$ curl -s http://lists.opencsw.org/pipermail/users/2015-January/date.html |
xmllint --html --xpath '/html/body/p/b[contains(., "Messages:")]/following-sibling::*' -
XPath set is empty
The text node you are after is indeed a following sibling, but it's a text node, not an element node. An expression like
following-sibling::*
only looks for following siblings that are elements. To match text nodes, use text():
$ curl -s http://lists.opencsw.org/pipermail/users/2015-January/date.html |
xmllint --html --xpath '/html/body/p/b[contains(., "Messages:")]/following-sibling::text()' -
The commands above do not work on my computer (bash on Mac OS X), but I trust they work for you. If I first save the output of curl to a file and then run
$ xmllint example.html --html --xpath '/html/body/p/b[contains(., "Messages:")]/following-sibling::text()'
The result is _28. That's not really an underscore; I used it to mark a whitespace character. To remove the leading whitespace, use
$ xmllint example.html --html --xpath 'normalize-space(/html/body/p/b[contains(., "Messages:")]/following-sibling::text())'
And no, using regex is not really an option.
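To see the difference between the two node tests side by side, here is a small sketch on an inline snippet modelled on the page above. The snippet, the helper function query and the variable names are invented for illustration, and an xmllint with --xpath is assumed.

```shell
#!/bin/bash
# Invented sample modelled on the pipermail page.
page='<html><body><p><b>Starting:</b> <i>Thu Jan 1</i><br><b>Messages:</b> 28</p></body></html>'

# XPath for the bold "Messages:" element.
b='/html/body/p/b[contains(., "Messages:")]'

# Hypothetical helper: evaluate one XPath expression against the sample.
query() { printf '%s\n' "$page" | xmllint --html --xpath "$1" - 2>/dev/null; }

elements=$(query "count($b/following-sibling::*)")        # element siblings only
texts=$(query "count($b/following-sibling::text())")      # text-node siblings
messages=$(query "normalize-space($b/following-sibling::text())")

echo "element siblings: $elements, text siblings: $texts, messages: $messages"
```

Wrapping the queries in count() avoids the "XPath set is empty" error and makes it visible that the sibling exists, just not as an element.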
I'm trying to construct an XPath argument for use in the program xmllint (used within a Bash shell script) that will return a list of available tags within a tag (while not listing subtags).
Here's the sort of XML I have:
<functionInformation>
    <class>
        setup
    </class>
    <description>
        This is a natural language description of this function.
    </description>
    <prerequisiteFunctions>
        myFunction1
        myFunction2
    </prerequisiteFunctions>
    <prerequisitePrograms>
        myProgram1
        myProgram2
    </prerequisitePrograms>
</functionInformation>
This XML is stored in the Bash variable functionInformation.
The output that I would like to have when using xmllint on this XML is the following:
class
description
prerequisiteFunctions
prerequisitePrograms
I should note that I would like the tags returned in a non-recursive way (I want only the direct child tags, not all available tags or subtags).
I can access information in tags using xmllint in a way such as the following:
descriptionFunctionInformation="$(echo "${functionInformation}"\
| xmllint --xpath '/functionInformation/description/text()' -\
| xargs -i echo -n "{}")"
Could you point me in the right direction on how I may build an XPath (or something similar) to return the information I need?
You can use xmlstarlet:
xmlstarlet sel -t -m '/*/*' -v 'name()' -n < xmlfile
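If you would rather stay with xmllint, one possible approach is to count the direct children and ask for each name in turn, since name() on a node-set only reports the first node. This is a sketch under assumptions: the shortened XML, the array names and the loop are illustrative, and an xmllint with --xpath is assumed.

```shell
#!/bin/bash
# Shortened stand-in for the XML held in the functionInformation variable.
functionInformation='<functionInformation>
<class>setup</class>
<description>A description.</description>
<prerequisiteFunctions>myFunction1</prerequisiteFunctions>
<prerequisitePrograms>myProgram1</prerequisitePrograms>
</functionInformation>'

# Number of direct children of the root element (no deeper levels).
count=$(printf '%s\n' "$functionInformation" | xmllint --xpath 'count(/*/*)' -)

names=()
for ((i = 1; i <= count; i++)); do
  # name() of the i-th direct child.
  names+=("$(printf '%s\n' "$functionInformation" | xmllint --xpath "name(/*/*[$i])" -)")
done

printf '%s\n' "${names[@]}"
```

This runs xmllint once per child, so it is only sensible for small documents; xmlstarlet does the same in a single pass.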
I'd just like to get the album names. Here's an example page:
http://picasaweb.google.com/sunnchoi
But when I wget it and grep for a title pattern, I get 100 results. I understand that I have to emulate clicking the 'Show More Albums' link. How do I do that (using bash utils/perl)?
Try the Picasa Web Albums API.
They have examples in Python, Java and other languages. Here's how to request a list of albums (this example uses Python).
If you have xmlstarlet available, you can directly parse the corresponding RSS URL of the given website:
xmlstarlet sel --net -T -t -m '//item' -v 'title' -n \
'http://picasaweb.google.com/data/feed/base/user/sunnchoi?alt=rss&kind=album&hl=en_US&access=public' |
nl
I'm trying to see how libxml implements XPath support, so it made sense to me to test using xmllint. However, the obvious option, --pattern, is somewhat obscure, and I ended up using something like the following:
test.xml: <foo><bar/><bar/></foo>
> xmllint --shell test.xml
/ > dir /foo
ELEMENT foo
/ > dir /foo/*
ELEMENT bar
ELEMENT bar
This seems to work, and that's great, but I'm still curious. What is xmllint's --pattern option for, and how does it work?
Provide an example for full credit. =)
The --xpath option, which appears to be undocumented, seems to be more useful.
% cat data.xml
<project>
    <name>
        bob
    </name>
    <version>
        1.1.1
    </version>
</project>
% xmllint --xpath '/project/version/text()' data.xml | xargs -i echo -n "{}"
1.1.1
% xmllint --xpath '/project/name/text()' data.xml | xargs -i echo -n "{}"
bob
The hint is in the words "which can be used with the reader interface to the parser": xmllint only uses the reader interface when passed the --stream option:
$ xmllint --stream --pattern /foo/bar test.xml
Node /foo/bar[1] matches pattern /foo/bar
Node /foo/bar matches pattern /foo/bar
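A self-contained way to reproduce this is sketched below. It writes the question's test.xml content to a temporary file and counts the "matches pattern" lines instead of relying on their exact wording. The temporary file and variable names are illustrative, and it assumes an xmllint built with pattern support.

```shell
#!/bin/bash
# Recreate the two-element test document in a temporary file.
xml_file=$(mktemp)
printf '%s\n' '<foo><bar/><bar/></foo>' > "$xml_file"

# --pattern only takes effect together with --stream (the reader interface).
matches=$(xmllint --stream --pattern /foo/bar "$xml_file" | grep -c 'matches pattern')

rm -f "$xml_file"
echo "$matches"
```

Running the same command without --stream should print the document unchanged, which is what makes the option look like a no-op at first.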
From the xmllint(1) man page:
--pattern PATTERNVALUE
Used to exercise the pattern recognition engine, which can be
used with the reader interface to the parser. It allows to
select some nodes in the document based on an XPath (subset)
expression. Used for debugging.
The pattern engine only understands a subset of XPath and is intended as a debugging aid. Full XPath 1.0 evaluation lives in libxml2's XPath module, which libxslt(3) and its command-line tool xsltproc(1) build on.
The "pattern" module in libxml "allows to compile and test pattern expressions for nodes either in a tree or based on a parser state" and its documentation lives here: http://xmlsoft.org/html/libxml-pattern.html
Ari.
If you simply want the text value of a number of xml nodes then you could use something like this (if --xpath is not available on your version of xmllint):
./foo.xml:
<hello>
<world>its alive!!</world>
<world>and works!!</world>
</hello>
$ xmllint --stream --pattern /hello/world --debug ./foo.xml | grep -A 1 "matches pattern" | grep "#text" | sed 's/.* [0-9] //'
its alive!!
and works!!