I'm trying to use xpath on this page to capture the text "past few days" from:
<li class="last">
Last visited
<span>
past few days
</span>
</li>
I've tried several variants of the xpath expression '//li[@class="last"]/span/text()', as part of:
from lxml import html
import requests
page = requests.get(url)
tree = html.fromstring(page.text)
visit = tree.xpath('//li[@class="last"]/span/text()')
All return nothing.
What is the correct syntax for capturing "past few days"?
Thanks
That page has a default namespace (xmlns="http://www.w3.org/1999/xhtml"). You'll either have to register that namespace and use a prefix in your xpath or use local-name() (and namespace-uri() if there is a possibility of elements from different namespaces having the same local names).
Example of local-name()...
//*[local-name()="li"][@class="last"]/*[local-name()="span"]/text()
Disclaimer: I don't use scrapy or python. This answer is purely xpath and might not apply 100%.
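For Python readers, the "register the namespace and use a prefix" option might look like the sketch below. It assumes the page really is well-formed XHTML, and it uses the standard library's xml.etree.ElementTree rather than lxml so that it runs without extra packages. The prefix "x" and the sample markup are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the downloaded page (well-formed XHTML assumed).
XHTML = """<html xmlns="http://www.w3.org/1999/xhtml">
<body><ul>
<li class="last">
Last visited
<span>
past few days
</span>
</li>
</ul></body></html>"""

# Map an arbitrary prefix to the page's default namespace.
ns = {"x": "http://www.w3.org/1999/xhtml"}
root = ET.fromstring(XHTML)

# Without a prefix the path finds nothing, because every element is namespaced.
unprefixed = root.find('.//li[@class="last"]/span')

# With the prefix the span is found; strip() drops the surrounding newlines.
span = root.find('.//x:li[@class="last"]/x:span', ns)
visit = span.text.strip()
print(visit)
```

ElementTree only implements a subset of XPath (no text() step, no local-name()), which is why the text is read from the element's .text attribute here.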
There is a page like the following:
<html>
<head></head>
<body>
<p> 5-8 </p>
<p></br>5-8</br></p>
<p> </br>5-8 </br></p>
</body>
</html>
The goal is to extract the text in each p; the breaks and whitespace are not wanted.
How can I achieve that?
Thanks in advance! Best Wishes!
--First update
Another post suggested using normalize-space(). I tried that, and it does remove the spaces. However, only one node is left. How can I get the text of all 30 nodes without the unwanted spaces? Thanks in advance and best wishes!
It's not possible to achieve what you want entirely in XPath 1.0, but in XPath 2.0 or later it is possible.
You don't say what XPath interpreter you have available, but you mention Chrome's XPath Helper, which relies on Chrome's built-in XPath interpreter, which supports XPath 1.0 (as is the norm for web browsers).
But it's possible you are just using Chrome to examine the data, and have another, more modern XPath interpreter such as Saxon. If so, an XPath 2.0 solution will work for you, though you won't be able to use it in Chrome, obviously.
I've tidied up your XML example:
<html>
<head></head>
<body>
<p> 5-8 </p>
<p><br/>5-8<br/></p>
<p> <br/>5-8 <br/></p>
</body>
</html>
NB those are non-breaking spaces there.
In XPath 2.0:
for $paragraph in //p
return normalize-space(
translate($paragraph, codepoints-to-string(160), ' ')
)
NB this uses the translate function to convert non-breaking spaces (the char with Unicode codepoint 160) to a space, and then uses normalize-space to trim leading and trailing whitespace (I'm not sure what you would want to do if there were whitespace in the middle of the para, instead of just at the start or end; this will convert any such sequence of whitespace to a single space character). You might think normalize-space would be enough, but in fact a non-breaking-space doesn't fall into normalize-space's category of "white space" so it would not be trimmed.
In XPath 1.0 it is not exactly possible to do what you want. You could use an XPath expression that would return each p element to your host language, and then iterate over those p elements, executing a second XPath expression for each one, with that p as the context. Essentially this means moving the for ... in ... return iterator from XPath into your host language. To select the paragraphs:
//p
... and then for each one:
normalize-space(
translate(., ' ', ' ')
)
NB in that expression, the first string literal is a non-breaking-space character, and the second is a space. XPath 1.0 doesn't have the codepoints-to-string function or I'd have used that, for clarity.
The . which is the first parameter to the translate function represents the context node (the current node). When you execute this XPath expression in your host language you need to pass one of the p elements as the context node. You don't say what host language you're using, but in JavaScript, for instance, you could use the document.evaluate function to execute the first XPath, receiving an iterator of p elements. Then, for each element, you'd call document.evaluate again with the second XPath, passing that element as the context node argument, which ensures that the p element is the context node for the XPath (i.e. the . in the expression).
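The same "iterate in the host language" idea can be sketched in Python. This is only an illustration under assumptions: it uses the standard library's xml.etree.ElementTree, which has no normalize-space() or translate(), so the per-paragraph step is reimplemented with string methods instead of a second XPath expression. The helper name clean_paragraph and the sample document are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Tidied sample with explicit non-breaking spaces (U+00A0).
DOC = (
    "<html><head></head><body>"
    "<p>\u00a05-8\u00a0</p>"
    "<p><br/>5-8<br/></p>"
    "<p>\u00a0<br/>5-8\u00a0<br/></p>"
    "</body></html>"
)

def clean_paragraph(p):
    # String value of the element: all descendant text, in document order.
    text = "".join(p.itertext())
    # Equivalent of translate(., nbsp, ' ').
    text = text.replace("\u00a0", " ")
    # Equivalent of normalize-space(): collapse runs of XPath whitespace
    # (space, tab, CR, LF) to one space and trim both ends.
    return re.sub(r"[ \t\r\n]+", " ", text).strip(" ")

root = ET.fromstring(DOC)
# First step: select the paragraphs (the //p expression).
paragraphs = root.findall(".//p")
# Second step: normalise each paragraph with that paragraph as the context.
cleaned = [clean_paragraph(p) for p in paragraphs]
print(cleaned)
```

With a full XPath 1.0 engine available in the host language, the body of clean_paragraph could instead evaluate the normalize-space(translate(...)) expression against each p.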
I have an xpath string //*[normalize-space() = "some sub text"]/text()/.. which works fine if the text I am finding is in a node which does not have multiple text sub nodes, but if it does then it won't work, so I am trying to combine it with contains() as follows: //*[contains(normalize-space(), "some sub text")]/text()/.. which does work, but it always returns the body and html tags as well as the p tag which contains the text. How can I change it so it only returns the p tag?
It depends exactly what you want to match.
The most likely scenario is that you want to match some text if it appears anywhere in the normalized string value of the element, possibly split across multiple text nodes at different levels: for example any of the following:
<p>some text</p>
<p>There was some text</p>
<p>There was <b>some text</b></p>
<p>There <b>was</b> some text</p>
<p>There was <b>some</b> <!--italic--> <i>text</i></p>
<p>There was <b>some</b> text</p>
If that's the case, then use //p[contains(normalize-space(.), "some text")].
As you point out, using //* with this predicate will also match ancestor elements of the relevant element. The simplest way to fix this is by using //p to say what element you are looking for. If you don't know what element you are looking for, then in XPath 3.0 you could use
innermost(//*[contains(normalize-space(.), "some text")])
but if you have the misfortune not to be using XPath 3.0, then you could do (//*[contains(normalize-space(.), "some text")])[last()], though this doesn't do quite the same thing if there are multiple paragraphs with the required content.
If you don't want to match all of the above, but want to be more selective, then you need to explain your requirements more clearly.
Either way, use of text() in a path expression is generally a code smell, except in the rare cases where you want to select text in an element only if it is not wrapped in other tags.
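For readers without XPath 3.0, the effect of innermost() can be approximated in a host language. The sketch below is an assumption-laden illustration in Python using the standard library's xml.etree.ElementTree. The helper names norm and innermost_matches are hypothetical, and str.split() treats a slightly wider set of characters as whitespace than normalize-space() does.

```python
import xml.etree.ElementTree as ET

DOC = """<html><body>
<p>There was <b>some</b> <i>text</i></p>
<p>nothing here</p>
<div><p>some
   text</p></div>
</body></html>"""

def norm(el):
    # Approximation of normalize-space(.) on the element's string value.
    return " ".join("".join(el.itertext()).split())

def innermost_matches(root, needle):
    # Every element whose normalised string value contains the needle;
    # this includes ancestors such as body and html.
    hits = [el for el in root.iter() if needle in norm(el)]
    hit_ids = {id(el) for el in hits}
    # Keep only hits that have no matching descendant.
    return [
        el for el in hits
        if not any(id(d) in hit_ids for d in el.iter() if d is not el)
    ]

root = ET.fromstring(DOC)
matches = innermost_matches(root, "some text")
tags = [el.tag for el in matches]
texts = [norm(el) for el in matches]
print(tags, texts)
```

Unlike the [last()] workaround, this keeps every innermost match, so both paragraphs are returned when more than one contains the text.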
I need to wrap all instances of %{ ... } with <span code='notranslate'>...</span> UNLESS the %{ ... } appears within an HTML tag. For example, this:
"Or %{register_text} for a new account by <a href='%{path}'>clicking here</a>."
needs to become this
"Or <span code='notranslate'>%{register_text}</span> for a new account by <a href='%{path}'>clicking here</a>."
my current regex doesn't take into account the HTML tag situation:
x.gsub(/[?<!]%\{([a-zA-Z0-9_\-]*)\}[?>!]/i) {|s| "<span class='notranslate'>#{s}</span>"}
so I am wondering how to do this in Ruby with regex.
Any takers?
I am not sure about the input space, so this is the best that I can come up with. I also clean up the regex a bit along the way.
/%\{[\w-]+\}(?![^<>]*>)/
For well-formed HTML, it will only match tokens that are outside tags. If the HTML is malformed, I don't think I'm up to the task of writing the regex.
I also assume that there is no embedded JavaScript in the page, since > and < in JavaScript are not escaped.
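To see how a negative lookahead of the form (?![^<>]*>) behaves, here is a small sketch. The question is about Ruby; this version uses Python's re module purely for illustration, and the function name wrap_placeholders is hypothetical. It makes the same assumptions as above: well-formed HTML and no inline scripts.

```python
import re

# A placeholder followed by "anything without angle brackets, then >"
# is inside a tag, so the negative lookahead rejects it.
PLACEHOLDER = re.compile(r"%\{[\w-]+\}(?![^<>]*>)", re.ASCII)

def wrap_placeholders(text):
    # Wrap each placeholder found outside a tag in a notranslate span.
    return PLACEHOLDER.sub(
        lambda m: f"<span class='notranslate'>{m.group(0)}</span>", text
    )

source = "Or %{register_text} for a new account by <a href='%{path}'>clicking here</a>."
wrapped = wrap_placeholders(source)
print(wrapped)
```

The %{path} token is left alone because the text after it reaches a > before any <, which is what being inside a tag looks like to this pattern.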
I'm trying to come up with a complex xPath expression but I can't figure out how to do that. Imagine you have some HTML like this:
<span>
something1
<br>
something2
<br>
something3
</span>
Imagine that sometimes the second <br> and the subsequent "something3" are not present. I would like to create an xPath expression that takes all the span nodes and its content up to the first <br> so that I end up parsing just "something1". I don't know if this is possible, if not does anyone know a way to get that after having parsed all the <span> nodes?
I have to say that I'm using HtmlParser, which is a Java library which parses HTML and supports xPath expressions.
Thanks,
Masiar
I'm a bit confused by your description of the problem, but it sounds something like
//span/br[1]/preceding-sibling::text()
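If the parser at hand only exposes a tree API, the same selection can be done by hand. The question uses a Java library; the sketch below is in Python with the standard library's xml.etree.ElementTree as an illustration only, and it assumes the markup has been made well-formed (<br/> rather than <br>). The helper name text_before_first_br is hypothetical.

```python
import xml.etree.ElementTree as ET

DOC = """<div>
<span>
something1
<br/>
something2
<br/>
something3
</span>
<span>
something1
<br/>
something2
</span>
</div>"""

def text_before_first_br(span):
    # Collect the span's own text nodes that come before its first <br>:
    # the leading text, plus the tail of any child that precedes the <br>.
    parts = [span.text or ""]
    for child in span:
        if child.tag == "br":
            return "".join(parts).strip()
        parts.append(child.tail or "")
    return None  # this span has no <br> child

root = ET.fromstring(DOC)
results = [text_before_first_br(span) for span in root.iter("span")]
print(results)
```

The XPath expression returns the individual text nodes, whereas this helper joins and trims them into one string per span.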
For some reason I have to have one HTML tag per line. So if the following is the input:
<p><div class="class1 <%= "class3" %>class2">div content</div></p>
Output should be:
<p>
<div class="class1 <%= "class3" %>class2">div content
</div>
</p>
The regular expression should be able to recognize the difference between the erb script tag and HTML tag. Indentation is not needed.
How can this be done through regular expression?
You can replace (?=<[\w/]) with \n. This is a lookahead that matches the position before a < sign that is followed by a letter or a slash. (Another option is (?=<(?!%)).)
This works for your posted code, but fails in quite a few scenarios, notably < in attributes, or < in server-side scripts and JavaScript blocks. If you need anything more complex, you may need a stronger solution, like an erb parser.
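As a quick check of the lookahead replacement on the posted input, here is a sketch in Python (the question does not name a language, so this choice is an assumption, and the function name one_tag_per_line is hypothetical). It needs Python 3.7 or later for substitution of zero-width matches, and it strips the newline that ends up before the very first tag.

```python
import re

# Zero-width match at every "<" that starts an HTML tag (opening or closing),
# but not at "<%", which starts an erb tag.
TAG_START = re.compile(r"(?=<[\w/])")

def one_tag_per_line(markup):
    # Insert a newline before each tag start, then drop the leading one.
    return TAG_START.sub("\n", markup).lstrip("\n")

source = '<p><div class="class1 <%= "class3" %>class2">div content</div></p>'
result = one_tag_per_line(source)
print(result)
```

The limitations mentioned above still apply: a literal < inside an attribute value or a script block would also trigger a line break.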
Replace "(?<!%)>\s*<(?!%)" with ">\n<", and replace "(?<!(\s|^))</" with "\n</".
This makes sure that % is not found either before or after >whitespace<.
Then always break on </.
I think Kobi's answer is better :)