Hello, I have this XML:
<item>
<title> Something for title»</title>
<link>some url</link>
<description><![CDATA[<div class="feed-description"><div class="feed-image"><img src="pictureUrl.jpg" /></div>text for desc</div>]]></description>
<pubDate>Thu, 11 Jun 2015 16:50:16 +0300</pubDate>
</item>
I try to get the img src with the path //description//div[@class='feed-description']//div[@class='feed-image']//img/@src, but it doesn't work.
Is there any solution?
A CDATA section escapes its contents. In other words, CDATA prevents its contents from being parsed as markup when the rest of the document is parsed. So the <div>s in there are not seen as XML elements, only as flat text. The <description> element has no element children ... only a single text child. As such, XPath can't select any <div> descendant of <description> because none exists in the parsed XML tree.
What to do?
If your XPath environment supports XPath 3.0, you could use parse-xml() to turn the flat text into a tree, then use XPath to select //div[@class='feed-description']//div[@class='feed-image']//img/@src from the resulting tree.
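If no XPath 3.0 processor is at hand, the same two-step idea (read the CDATA text, parse it a second time, query the new tree) can be sketched in a host language. The following is a minimal Python sketch using the standard library's xml.etree.ElementTree. It assumes the embedded markup is itself well-formed XML, as it is in the sample; the variable names are only illustrative.

```python
import xml.etree.ElementTree as ET

ITEM_XML = """<item>
<title>Something for title</title>
<link>some url</link>
<description><![CDATA[<div class="feed-description"><div class="feed-image"><img src="pictureUrl.jpg" /></div>text for desc</div>]]></description>
<pubDate>Thu, 11 Jun 2015 16:50:16 +0300</pubDate>
</item>"""

item = ET.fromstring(ITEM_XML)
description = item.find("description")

# The CDATA content arrives as plain text: <description> has no child elements.
child_count = len(list(description))

# Second pass: parse the text itself. This only works if it is well-formed XML.
fragment = ET.fromstring(description.text)
img = fragment.find(".//div[@class='feed-image']/img")
img_src = img.get("src") if img is not None else None

print(child_count, img_src)
```

If the embedded markup were real-world HTML rather than well-formed XML, the second parse would need an HTML parser instead.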
Otherwise, your best workaround may be to use primitive string-processing functions like substring-before(), substring-after(), or matches(). (The latter uses regular expressions and requires XPath 2.0.) Of course, many people will tell you not to use regular expressions to analyze markup like XML and HTML. For good reason: in the general case, it's very difficult to do it right (with regexes or with plain string searches). But for very restricted cases where the input is highly predictable, and in the absence of better tools, it can be the best tool for a less-than-ideal job.
For example, for the data shown in your question, you could use
substring-before(substring-after(//description, 'img src="'), '"')
In this case, the inner call substring-after(//description, 'img src="') returns pictureUrl.jpg" /></div>text for desc</div>, of which the substring before " is pictureUrl.jpg.
This isn't really robust; for example, it'll fail if there's a space between src and =. But if the exact formatting is predictable, you'll be OK.
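To make the fragility concrete, here is a rough Python equivalent of the substring-after / substring-before pair. The function name and the marker parameter are made up for illustration; like the XPath functions, it returns an empty string when either delimiter is missing.

```python
def img_src_by_string_search(description_text, marker='img src="'):
    """Return the text between the marker and the next double quote, or "" if absent."""
    # Equivalent of substring-after(): everything after the first marker.
    _, found, rest = description_text.partition(marker)
    if not found:
        return ""
    # Equivalent of substring-before(): everything before the next quote.
    value, quote, _ = rest.partition('"')
    return value if quote else ""


SAMPLE = ('<div class="feed-description"><div class="feed-image">'
          '<img src="pictureUrl.jpg" /></div>text for desc</div>')

print(img_src_by_string_search(SAMPLE))
print(repr(img_src_by_string_search('<img src = "pictureUrl.jpg" />')))
```

The second call shows the failure mode mentioned above: one extra space around = and the marker no longer matches.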
Related
There is a page like the following:
<html>
<head></head>
<body>
<p> 5-8 </p>
<p></br>5-8</br></p>
<p> </br>5-8 </br></p>
</body>
</html>
The goal is to extract the text in each p; the breaks and whitespace are not wanted.
How to achieve that?
Thanks in advance! Best Wishes!
Update 1:
Another post suggested using normalize-space(). I tried that, and it does remove the spaces. However, only one node is left. How can I get the text of all 30 nodes without the unwanted spaces? Thanks in advance and best wishes!
It's not possible to achieve what you want entirely in XPath 1.0, but in XPath 2.0 or later it is possible.
You don't say what XPath interpreter you have available but you mention Chrome's XPath Helper which relies on Chrome's built in XPath interpreter which supports XPath 1.0 (as is the norm for web browsers).
But it's possible you are just using Chrome to examine the data, and have another, more modern XPath interpreter such as e.g. Saxon. If so, an XPath 2.0 solution will work for you, though you won't be able to use it in Chrome, obviously.
I've tidied up your XML example:
<html>
<head></head>
<body>
<p> 5-8 </p>
<p><br/>5-8<br/></p>
<p> <br/>5-8 <br/></p>
</body>
</html>
NB those are non-breaking spaces there.
In XPath 2.0:
for $paragraph in //p
return normalize-space(
translate($paragraph, codepoints-to-string(160), ' ')
)
NB this uses the translate function to convert non-breaking spaces (the char with Unicode codepoint 160) to a space, and then uses normalize-space to trim leading and trailing whitespace (I'm not sure what you would want to do if there were whitespace in the middle of the para, instead of just at the start or end; this will convert any such sequence of whitespace to a single space character). You might think normalize-space would be enough, but in fact a non-breaking-space doesn't fall into normalize-space's category of "white space" so it would not be trimmed.
In XPath 1.0 it is not exactly possible to do what you want. You could use an XPath expression that would return each p element to your host language, and then iterate over those p elements, executing a second XPath expression for each one, with that p as the context. Essentially this means moving the for ... in ... return iterator from XPath into your host language. To select the paragraphs:
//p
... and then for each one:
normalize-space(
translate(., ' ', ' ')
)
NB in that expression, the first string literal is a non-breaking-space character, and the second is a space. XPath 1.0 doesn't have the codepoints-to-string function or I'd have used that, for clarity.
The . which is the first parameter to the translate function represents the context node (the current node). When you execute this XPath expression in your host language you need to pass one of the p elements as the context node. You don't say what host language you're using, but in JavaScript, for instance, you could use the document.evaluate function to execute the first XPath, receiving an iterator of p elements. Then for each element, you'd call document.evaluate again with the second XPath, passing that element as the context node argument, which ensures that the p element is the context node for the XPath (i.e. the . in the expression).
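The same "select the paragraphs, then process each one as the context node" loop can be sketched in Python with the standard library. This is only an illustration under a few assumptions: the page is well-formed XML, the non-breaking spaces sit where the sample below places them, and normalize_space is a hand-written helper that mimics XPath's definition of whitespace (space, tab, CR, LF), since Python's own str.split() would also treat U+00A0 as whitespace.

```python
import re
import xml.etree.ElementTree as ET

NBSP = "\u00a0"
XPATH_WHITESPACE = " \t\r\n"


def normalize_space(text):
    """Mimic XPath normalize-space(): collapse and trim space, tab, CR and LF only."""
    collapsed = re.sub(r"[ \t\r\n]+", " ", text)
    return collapsed.strip(XPATH_WHITESPACE)


def paragraph_texts(xml_text):
    root = ET.fromstring(xml_text)
    results = []
    for p in root.iter("p"):            # first step: select every p
        value = "".join(p.itertext())   # string value of the current p
        # second step: translate NBSP to a space, then normalize
        results.append(normalize_space(value.replace(NBSP, " ")))
    return results


# Assumed placement of the non-breaking spaces in the sample page.
PAGE = ("<html><head></head><body>"
        "<p>\u00a05-8\u00a0</p>"
        "<p><br/>5-8<br/></p>"
        "<p>\u00a0<br/>5-8\u00a0<br/></p>"
        "</body></html>")

print(paragraph_texts(PAGE))
```

Calling normalize_space directly on a string padded with U+00A0 leaves it unchanged, which is the reason the translate step is needed first.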
To automatically replace keywords with links based on a list of keyword-link pairs, I need to get text inside paragraphs (p) and list items (li) that is not already linked, not inside a script, and not manually excluded. This is to be used in Drupal's Alinks module.
I modified the existing xpath selector as follows and would like to get feedback on it, if it is efficient or might be improved:
//*[p or li]//text()[not(ancestor::a) and not(ancestor::script) and not(ancestor::*[@data-alink-ignore])]
The xpath is meant to work with any html5 content, also with self closing tags (not well-formed xml) -- that's the way the module was designed, and it works quite well.
In order to select text nodes that are descendants of p or li elements but not descendants of a or script elements (or of an element carrying data-alink-ignore), you can use this XPath 1.0:
//*[self::p|self::li]
//text()[
not(ancestor::a|ancestor::script|ancestor::*[@data-alink-ignore])
]
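For readers who want to see what this filter selects, here is a rough Python sketch of the same logic as an explicit tree walk. It is not how the Alinks module works; it uses the standard library's xml.etree.ElementTree, which has no ancestor axis, so the "excluded ancestor" state is carried down the recursion instead. Two stated assumptions: the input is well-formed XML (HTML5 with unclosed tags would need an HTML parser), and whitespace-only text nodes are skipped, which the XPath itself does not do.

```python
import xml.etree.ElementTree as ET

EXCLUDED_TAGS = {"a", "script"}
CONTAINER_TAGS = {"p", "li"}


def linkable_text(root):
    """Collect text inside p/li that has no a, script or data-alink-ignore ancestor."""
    found = []

    def walk(node, inside, excluded):
        excluded = (excluded or node.tag in EXCLUDED_TAGS
                    or "data-alink-ignore" in node.attrib)
        inside = inside or node.tag in CONTAINER_TAGS
        if inside and not excluded and node.text and node.text.strip():
            found.append(node.text)
        for child in node:
            walk(child, inside, excluded)
            # child.tail is text that follows the child, so it belongs to `node`.
            if inside and not excluded and child.tail and child.tail.strip():
                found.append(child.tail)

    walk(root, False, False)
    return found


SAMPLE = ET.fromstring(
    '<div><p>Buy <a href="#">linked</a> apples '
    '<span data-alink-ignore="1">skip</span> now</p>'
    "<ul><li>pears</li></ul><script>var x;</script><div>outside</div></div>"
)
print(linkable_text(SAMPLE))
```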
Your XPath expression is invalid. You are missing a / before text(). So a valid expression would be
//*[p or li]/text()[not(ancestor::a) and not(ancestor::script) and not(ancestor::*[@data-alink-ignore])]
But without an XML source file it is impossible to tell if this expression would match your desired node.
I am new to XPath expressions and need help with an issue.
Consider the following Document :
<tbody><tr>
<td>By <strong>Bec</strong></td>
<td><strong>Great Support</strong></td>
</tr></tbody>
In this document, I have to find the text inside the strong tags separately.
Following is my xpath expression:
//tbody//td//strong/text();
It evaluates output as expected:
Bec
Great Support
How can I write XPath expressions to distinguish between the results, i.e. Bec and Great Support?
It's rather unclear what you're trying to do, but the following should succeed in selecting them separately:
//tbody/tr/td[1]/strong
and
//tbody/tr/td[2]/strong
Note that the text() you had at the end is most likely not needed in this case.
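As a quick check of the positional predicates, here is a small Python sketch using the standard library's xml.etree.ElementTree, which supports [1]-style position predicates in its limited XPath subset. The table wrapper element and the variable names are assumptions added for the example.

```python
import xml.etree.ElementTree as ET

# A <table> wrapper is added here only so the fragment has a single root element.
TABLE = ET.fromstring(
    "<table><tbody><tr>"
    "<td>By <strong>Bec</strong></td>"
    "<td><strong>Great Support</strong></td>"
    "</tr></tbody></table>"
)

# Each path addresses one cell, so the two values come back separately.
author = TABLE.find("./tbody/tr/td[1]/strong").text
title = TABLE.find("./tbody/tr/td[2]/strong").text

print(author, "|", title)
```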
Not sure I understand 100%, but if you're trying to get the text of the first and the second strong tags, you can use position (1 based index)
//tbody/tr/td[position()=1]/strong/text() //first text
//tbody/tr/td[position()=2]/strong/text() //second text
This solution only applies to the current sample though, where your strong tags are inside either the first or second td tag.
Not sure this is what you're looking for... anyway, assuming you're asking to retrieve a node based on its text you can look up for text content by doing something like:
//tbody//td//strong/text()[.="Bec"]
PS
In [.=""] the dot is an abbreviation for self::node(), not for text() as I first wrote (thanks JLRishe for pointing out the mistake).
I'm currently trying to extract the blurb, or summary from any given Wikipedia page, using XPath. Now, there are many places online where this has already been done: http://jimblackler.net/blog/?p=13, How to use XPath or xgrep to find information in Wikipedia?.
But, when I try to use similar XPath expressions, on a variety of pages, the returned results are strange. For the sake of this question, let's assume I'm trying to retrieve the very first paragraph in the printable Wikipedia page on Boston: http://en.wikipedia.org/w/index.php?title=Boston&printable=yes.
When I try to use this expression /html/body/div[@id='content']/div[@id='bodyContent']//p, only the last four words of the paragraph, "in the United States.", are returned.
Actually, the expression used above could be simplified to //div/p, but the results are the same.
Strangely, the links I linked to previously seem to use similar methods and return great results; originally, I imagined this was due to Wikipedia changing the formatting of their pages in recent years, but honestly, I can't seem to find what's wrong with both the expressions.
Does anyone have any idea about this?
When I try to use this expression
/html/body/div[@id='content']/div[@id='bodyContent']//p, only the
last four words of the paragraph, "in the United States.", are
returned.
There are a few problems here:
1. The XML document is in a default namespace. Writing XPath expressions to select nodes in a document that is in a default namespace is the most frequently asked question about XPath -- search for "XPath and default namespace". In short, any unprefixed name will most probably cause nothing to be selected. One must register the default namespace and associate a specific prefix with this namespace. Then any element name in the XPath expression must be written with this prefix. So, the expression above will become:
/x:html/x:body/x:div[@id='content']/x:div[@id='bodyContent']//x:p
where the "x:" prefix is associated with the "http://www.w3.org/1999/xhtml" namespace.
2. Even the above expression doesn't select (only) the wanted node. In order to select only the first x:p from the above, the XPath expression must be specified as (note the brackets):
(/x:html/x:body/x:div[@id='content']/x:div[@id='bodyContent']//x:p)[1]
3. As you want the text of the paragraph, an easy way to do this is to use the standard XPath function string():
string((/x:html/x:body/x:div[@id='content']/x:div[@id='bodyContent']//x:p)[1])
When this XPath expression is evaluated, I get the text of the paragraph, for example in the XPath Visualizer I wrote some years ago.
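The three points above (bind a prefix to the default namespace, take only the first match, use the element's string value) can be tried out with a minimal Python sketch using the standard library's xml.etree.ElementTree. The tiny PAGE document is a made-up stand-in for the real Wikipedia page, and the prefix x is arbitrary.

```python
import xml.etree.ElementTree as ET

# The prefix "x" is arbitrary; only the namespace URI has to match the document.
NS = {"x": "http://www.w3.org/1999/xhtml"}

# Made-up stand-in for the Wikipedia page, in the XHTML default namespace.
PAGE = (
    '<html xmlns="http://www.w3.org/1999/xhtml"><body>'
    '<div id="content"><div id="bodyContent">'
    "<p><b>Boston</b> is a city in the United States.</p>"
    "<p>Second paragraph.</p>"
    "</div></div></body></html>"
)

root = ET.fromstring(PAGE)  # root is the html element
PATH = "./x:body/x:div[@id='content']/x:div[@id='bodyContent']//x:p"

# Point 1: unprefixed names select nothing in a default-namespace document.
unprefixed = root.find("./body/div")

# Point 2: find() returns only the first match in document order.
first_p = root.find(PATH, NS)

# Point 3: joining all descendant text is the equivalent of string().
first_text = "".join(first_p.itertext())

print(unprefixed, "|", first_text)
```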
I have an HTML file with this content:
<a class="bf" title="Link to book" href="/book/229920/">book name</a>
Please help me construct an XPath expression to get the link text (book name).
I tried /a, but the expression evaluates without results.
If the context is the entire document you should probably use // instead of /. Also you may (not sure about that) need to get down one more level to retrieve the text.
I think it should look like this
//a/text()
EDIT: As Tomalak pointed out it's text() not text
Have you tried
//a
?
More specific is better:
//a[@class='bf' and starts-with(@href, '/book/')]
Note that this selects the <a> element. In your host environment it's easy to extract the text value of that node via standard DOM methods (like the .textContent property).
To select the actual text node, see the other answers in this thread.
It depends also on the rest of your document. If you use // in the beginning all the matching nodes will be returned, which might be too many results in case you have other links in your document.
Apart from that a possible xpath expression is //a/text().
The /a you tried only returns the a element itself, and only if it is the root element. To get the link text you need to append the /text() part.
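Putting the answers together, a small Python sketch with the standard library's xml.etree.ElementTree might look like the following. ElementTree's XPath subset has no starts-with(), so that part of the filter is done in Python; the body wrapper and the second link are assumptions added to show the filtering.

```python
import xml.etree.ElementTree as ET

# The <body> wrapper and the second link are assumptions for the example;
# the question shows only the first <a> element.
DOC = ET.fromstring(
    '<body><a class="bf" title="Link to book" href="/book/229920/">book name</a>'
    '<a class="other" href="/about/">about</a></body>'
)

# Same filter as //a[@class='bf' and starts-with(@href, '/book/')].
book_links = [
    a for a in DOC.iter("a")
    if a.get("class") == "bf" and a.get("href", "").startswith("/book/")
]

# .text plays the role of the trailing /text() step.
link_texts = [a.text for a in book_links]
print(link_texts)
```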