I'm trying to come up with a complex XPath expression, but I can't figure out how to write it. Imagine you have some HTML like this:
<span>
something1
<br>
something2
<br>
something3
</span>
Sometimes the second <br> and the "something3" after it are not present. I would like an XPath expression that, for every span node, returns only its content up to the first <br>, so that I end up with just "something1". I don't know whether this is possible; if not, does anyone know a way to get that result after parsing all the <span> nodes?
I have to say that I'm using HtmlParser, which is a Java library which parses HTML and supports xPath expressions.
Thanks,
Masiar
I'm a bit confused by your description of the problem, but it sounds something like
//span/br[1]/preceding-sibling::text()
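For comparison, here is a minimal sketch of the same idea outside HtmlParser, using Python's standard xml.etree.ElementTree. It is not the XPath above: it relies on Element.text, which holds the text before an element's first child. It also assumes the markup is well-formed XML (<br/> rather than <br>), which real HTML often is not.

```python
import xml.etree.ElementTree as ET

# Assumption: a well-formed variant of the sample markup (<br/> instead of <br>),
# because ElementTree only parses XML, not arbitrary HTML.
markup = "<span>\nsomething1\n<br/>\nsomething2\n<br/>\nsomething3\n</span>"

span = ET.fromstring(markup)

# Element.text is the text that precedes the first child element,
# i.e. everything before the first <br/>. It is None if there is no such text.
first_chunk = (span.text or "").strip()
print(first_chunk)
```

The same lookup gives the same result when the second <br/> and "something3" are missing, since only the text before the first child is read.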
Can anyone please help me here?
I want to run two XPath expressions together and store the result, but I am not sure whether that is possible.
The first XPath fetches the city and the second fetches the state:
//div[(text()='city')]/following-sibling::div
//div[contains(text(),'state')]/following-sibling::div
As the XPath expressions show, the names of the city and the state are in the div that follows the "city" and "state" label divs. I want to run both expressions and capture the output as a string.
As a side note, both XPath expressions work fine for me individually.
<div>
<div>City</div>
<div>London</div>
</div>
<!-- In between: some other elements like p, section, other divs -->
<div>
<div>state</div>
<div>England</div>
</div>
It sounds like you want to convert the results of the two XPath expressions to strings, and concatenate those strings. The expression below concatenates them (with a single space between) using the XPath concat function.
concat(
//div[(text()='city')]/following-sibling::div,
' ',
//div[contains(text(),'state')]/following-sibling::div
)
One other thing: note that in your example XML the text of the first div is "City" rather than "city". Make sure the strings in your XPath expression match the text exactly, because the expression 'City'='city' evaluates to false.
I have an xpath string //*[normalize-space() = "some sub text"]/text()/.. which works fine if the text I am finding is in a node which does not have multiple text sub nodes, but if it does then it won't work, so I am trying to combine it with contains() as follows: //*[contains(normalize-space(), "some sub text")]/text()/.. which does work, but it always returns the body and html tags as well as the p tag which contains the text. How can I change it so it only returns the p tag?
It depends exactly what you want to match.
The most likely scenario is that you want to match some text if it appears anywhere in the normalized string value of the element, possibly split across multiple text nodes at different levels: for example any of the following:
<p>some text</p>
<p>There was some text</p>
<p>There was <b>some text</b></p>
<p>There <b>was</b> some text</p>
<p>There was <b>some</b> <!--italic--> <i>text</i></p>
<p>There was <b>some</b> text</p>
If that's the case, then use //p[contains(normalize-space(.), "some text")].
As you point out, using //* with this predicate will also match ancestor elements of the relevant element. The simplest way to fix this is by using //p to say what element you are looking for. If you don't know what element you are looking for, then in XPath 3.0 you could use
innermost(//*[contains(normalize-space(.), "some text")])
but if you have the misfortune not to be using XPath 3.0, then you could do (//*[contains(normalize-space(.), "some text")])[last()], though this doesn't do quite the same thing if there are multiple paragraphs with the required content.
If you don't want to match all of the above, but want to be more selective, then you need to explain your requirements more clearly.
Either way, use of text() in a path expression is generally a code smell, except in the rare cases where you want to select text in an element only if it is not wrapped in other tags.
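To make the "ancestors match too" point concrete, here is a rough Python sketch of what innermost() does for this predicate, written against the standard xml.etree.ElementTree. The helper names normalized_text and innermost_matches are made up for illustration, and the sample document is an assumption; it is not how an XPath 3.0 processor implements the function.

```python
import xml.etree.ElementTree as ET

def normalized_text(elem):
    # Approximates normalize-space(.): all descendant text joined,
    # with runs of whitespace collapsed and the ends trimmed.
    return " ".join("".join(elem.itertext()).split())

def innermost_matches(root, needle):
    # Keep elements whose normalized text contains needle, but drop any
    # element that has a matching child. If a deeper descendant matches,
    # the child containing it matches too, so checking children is enough.
    result = []
    for elem in root.iter():
        if needle not in normalized_text(elem):
            continue
        if any(needle in normalized_text(child) for child in elem):
            continue
        result.append(elem)
    return result

# Hypothetical sample document.
html = ET.fromstring(
    "<html><body>"
    "<p>some text</p>"
    "<p>There was <b>some</b> <i>text</i></p>"
    "<p>nothing here</p>"
    "</body></html>"
)
hits = innermost_matches(html, "some text")
print([e.tag for e in hits])
```

Here html and body also contain "some text" in their string value, which is why a bare //* predicate returns them; the filter keeps only the two p elements.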
I am new to XPath expressions and need help with an issue.
Consider the following document:
<tbody><tr>
<td>By <strong>Bec</strong></td>
<td><strong>Great Support</strong></td>
</tr></tbody>
In this document I have to find the text inside the strong tags separately.
Following is my xpath expression:
//tbody//td//strong/text();
It evaluates output as expected:
Bec
Great Support
How can I write XPath expressions that distinguish between the results, i.e. Bec and Great Support?
It's rather unclear what you're trying to do, but the following should succeed in selecting them separately:
//tbody/tr/td[1]/strong
and
//tbody/tr/td[2]/strong
Note that the text() you had at the end is most likely not needed in this case.
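As a quick check of the two expressions, here is a small sketch using Python's standard xml.etree.ElementTree, which supports positional predicates like td[1]. The enclosing <table> element is my addition so the fragment has a single root; ElementTree paths also need a leading "." when they start with //.

```python
import xml.etree.ElementTree as ET

# Sample from the question, wrapped in a <table> root (an assumption made
# so that .//tbody can be found as a descendant of the parsed root).
doc = ET.fromstring(
    "<table><tbody><tr>"
    "<td>By <strong>Bec</strong></td>"
    "<td><strong>Great Support</strong></td>"
    "</tr></tbody></table>"
)

# Same paths as above, selecting the strong element in each cell.
first = doc.find(".//tbody/tr/td[1]/strong")
second = doc.find(".//tbody/tr/td[2]/strong")

# .text gives the text content, so no trailing text() step is needed.
print(first.text)
print(second.text)
```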
Not sure I understand 100%, but if you're trying to get the text of the first and the second strong tags, you can use position (1-based index). Note that td is a child of tr, not of tbody, so the tr step is needed:
//tbody/tr/td[position()=1]/strong/text()   (first text)
//tbody/tr/td[position()=2]/strong/text()   (second text)
This solution only applies to the current sample though, where your strong tags are inside either the first or second td tag.
Not sure this is what you're looking for... anyway, assuming you're asking how to retrieve a node based on its text, you can match on the text content by doing something like:
//tbody//td//strong/text()[.="Bec"]
PS
In [.=""] the dot is an alias for self::node(), not text() (thanks JLRishe for pointing out the mistake).
I have set of strings with nested [quote] tags in following format:
[quote name="John"]Some text. [quote name="Piter"]Inner quote.[/quote][/quote]
As you can see, it is not like ordinary BBCode, so I can't find a suitable regexp for gsub in Ruby to convert them to strings like this:
<blockquote>
<p>Some text.
<blockquote>
<p>Inner quote.</p>
<small>Piter</small>
</blockquote>
</p>
<small>John</small>
</blockquote>
Can anybody please help me with such a regexp?
I'm pretty sure that regexes fundamentally can't cope with nesting. What you could do is make it do a minimal match (e.g. only the inner quote levels), replace them, and then repeat as long as you have more matches. Once you've replaced a level it will just be HTML so will not match the regex any more.
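A minimal sketch of that repeat-until-no-match idea, in Python rather than Ruby for illustration (the same loop should carry over to a gsub call). The function name quotes_to_html is made up, the output is emitted on one line rather than indented, and the pattern assumes quoted text never contains a literal "[" other than the quote tags themselves.

```python
import re

# Matches only an innermost quote: its body may not contain '[' at all,
# so a quote that still wraps another [quote] cannot match yet.
INNER_QUOTE = re.compile(r'\[quote name="([^"]*)"\]([^\[]*)\[/quote\]')

def quotes_to_html(text):
    # Each pass converts the current innermost level to HTML. The converted
    # text no longer contains '[', so the enclosing quote matches next pass.
    while True:
        text, count = INNER_QUOTE.subn(
            r"<blockquote><p>\2</p><small>\1</small></blockquote>", text
        )
        if count == 0:
            return text

sample = '[quote name="John"]Some text. [quote name="Piter"]Inner quote.[/quote][/quote]'
print(quotes_to_html(sample))
```

The loop ends after one extra pass in which nothing is replaced, so the number of passes is the nesting depth plus one.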
I have a string like this.
<p class='link'>try</p>bla bla</p>
I want to get only <p class='link'>try</p>
I have tried this.
/<p class='link'>[^<\/p>]+<\/p>/
But it doesn't work.
How can I can do this?
Thanks,
If that is your string, and you want the text between those p tags, then this should work...
/<p\sclass='link'>(.*?)<\/p>/
The reason yours is not working is that you put <\/p> inside a negated character class. It does not match that sequence literally; it rejects each of the characters <, /, p and > individually.
Of course, it is mandatory that I mention there are better tools for parsing HTML fragments (such as an HTML parser).
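Here is a small sketch of the difference, using Python's re module for illustration (the patterns are the ones from this thread with the surrounding / delimiters dropped). The second input string, with "help" as the content, is a made-up example chosen because it contains a p.

```python
import re

# The question's pattern: [^<\/p>] excludes the single characters <, /, p, >.
CHAR_CLASS = re.compile(r"<p class='link'>[^<\/p>]+<\/p>")
# The lazy version: (.*?) stops at the first following </p>.
LAZY = re.compile(r"<p\sclass='link'>(.*?)<\/p>")

sample = "<p class='link'>try</p>bla bla</p>"
with_p = "<p class='link'>help</p>bla bla</p>"  # hypothetical content containing 'p'

# "try" has none of the excluded characters, so the class version matches here.
print(CHAR_CLASS.search(sample))
# "help" contains 'p', so the class stops after "hel" and the match fails.
print(CHAR_CLASS.search(with_p))
# The lazy version matches the first paragraph in both cases.
print(LAZY.search(sample).group(0))
print(LAZY.search(with_p).group(1))
```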
'/<p[^>]+>([^<]+)<\/p>/'
will get you "try"
It looks like you used this block: [^<\/p>]+ intending to match anything except for </p>. Unfortunately, that's not what it does. A [^...] block matches any single character that is not listed inside, so [^<\/p>]+ matches a run of characters other than <, /, p and >. The run stops at the first such character, so the pattern fails whenever the paragraph content itself contains, for example, a p.
Alex's solution, using a non-greedy quantifier, is how I tend to approach this sort of problem.
I tried to make one less specific to any particular tag.
(<[^/]+?\s+[^>]*>[^>]*>)
this returns:
<p class='link'>try</p>