I'm a bit confused about the effects of lxml's etree.fromstring() and html.fromstring(). Here's a simple string:
test = """
<p> This is my head <h1> this is my middle </h1> and this is my tail.</p>
"""
We create two roots with this string:
from lxml import html, etree
root_e = etree.fromstring(test)
root_h = html.fromstring(test)
Let's see what the trees look like:
tree_e = etree.ElementTree(root_e)
for elem in root_e.iter():
    print(tree_e.getpath(elem))
Output is:
/p
/p/h1
which is what I would expect. However with:
tree_h = etree.ElementTree(root_h)
for elem in root_h.iter():
    print(tree_h.getpath(elem))
the output is now:
/html/div
/div/p
/div/h1
which I didn't expect. And strange consequences follow. Various xpath expressions work the same in both trees, but others don't. For example
root_h.xpath('/html/div')[0].text_content()
outputs the whole string text (with a newline attached), although test has neither html nor div in it. On the other hand,
root_h.xpath('/html/div')[0].text
outputs nothing.
So why the differences, and when should you use one or the other?
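One way to see where the differences come from is to serialize both trees. The sketch below reuses the test string from above and assumes nothing beyond lxml itself. etree.fromstring() uses the XML parser and keeps the markup as written, while html.fromstring() uses the HTML parser, which repairs the markup (an h1 cannot be nested inside a p) and wraps the resulting top-level nodes in a div. The exact repaired output may vary with the libxml2 version, so no output is shown here.

```python
from lxml import etree, html

test = """
<p> This is my head <h1> this is my middle </h1> and this is my tail.</p>
"""

root_e = etree.fromstring(test)  # XML parser: the markup is kept as written
root_h = html.fromstring(test)   # HTML parser: the markup is repaired

# Serializing both roots makes the structural difference visible.
print(etree.tostring(root_e, encoding="unicode"))
print(etree.tostring(root_h, encoding="unicode"))

# .text is only the text directly before the first child element,
# while text_content() collects the text of all descendants.
print(repr(root_h.text))
print(repr(root_h.text_content()))
```

As a rule of thumb, etree.fromstring() suits well-formed XML where the structure must be preserved, and html.fromstring() suits real-world HTML where a forgiving parser is needed.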
Here is XPath one:
/Document/Attributes/BlobContent/Property[@Name="FileName"]/parent::*/Reference/@Link
and XPath two:
Document/Attributes/BlobContent/Property[@Name="FileName"]/parent::*/Property[@Name="FileName"]/@Value
Both return the right result!
I would like to avoid the plain union [one | two], as that only returned a flat list of alternating results.
I tried:
/Document/Attributes/BlobContent/Property[@Name="FileName"]/parent::*/Reference/@Link | */Property[@Name="FileName"]/@Value
but that returns only the latter.
So how do I correctly return two child-node attributes from a matched parent?
For anyone interested: I didn't find a pure XPath solution. However, this Python code worked for me:
import xml.etree.ElementTree as ET

tree = ET.parse(file_xml)
root = tree.getroot()
blobs = root.findall("*/Attributes[1]/BlobContent")
for blob in blobs:
    try:
        filename = blob.find('Property[@Name="FileName"]').attrib["Value"]
        exportname = blob.find('Reference[@Type="RELATIVEFILE"]').attrib["Link"]
        print(filename + "," + exportname)
    except (AttributeError, KeyError):
        # no FileName Property (or no matching Reference) in this blob
        pass
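A single XPath 1.0 expression returns a flat node-set, so pairing two attributes usually means selecting the parent first and then evaluating relative expressions on each match. Here is a minimal sketch of that idea with lxml. The sample document is hypothetical: only the element and attribute names are taken from the question, and the values and the Document root are assumptions.

```python
from lxml import etree

# Hypothetical sample data; only the element/attribute names come from the question.
xml_doc = """
<Document>
  <Attributes>
    <BlobContent>
      <Property Name="FileName" Value="report.pdf"/>
      <Reference Type="RELATIVEFILE" Link="blobs/0001.bin"/>
    </BlobContent>
    <BlobContent>
      <Property Name="Other" Value="ignored"/>
      <Reference Type="RELATIVEFILE" Link="blobs/0002.bin"/>
    </BlobContent>
    <BlobContent>
      <Property Name="FileName" Value="notes.txt"/>
      <Reference Type="RELATIVEFILE" Link="blobs/0003.bin"/>
    </BlobContent>
  </Attributes>
</Document>
"""
root = etree.fromstring(xml_doc)

pairs = []
# Select only the parents that actually have a FileName property ...
for blob in root.xpath('/Document/Attributes/BlobContent[Property[@Name="FileName"]]'):
    # ... then read both attributes relative to that parent.
    filename = blob.xpath('string(Property[@Name="FileName"]/@Value)')
    link = blob.xpath('string(Reference/@Link)')
    pairs.append((filename, link))

for filename, link in pairs:
    print(filename + "," + link)
```

Because the predicate filters the parents up front, no try/except is needed for blobs without a FileName property.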
While trying to help another user out with some question, I ran into the following problem myself:
The object is to find the country of origin of a list of wines on the page. So we start with:
import requests
from lxml import etree
url = "https://www.winepeople.com.au/wines/Dry-Red/_/N-1z13zte"
res = requests.get(url)
content = res.content
tree = etree.fromstring(content, parser=etree.HTMLParser())
tree_struct = etree.ElementTree(tree)
Next, for reasons I'll get into in a separate question, I'm trying to compare the xpath of two elements with certain attributes. So:
wine = tree.xpath("//div[contains(@class, 'row wine-attributes')]")
country = tree.xpath("//div/text()[contains(., 'Australia')]")
So far, so good. What are we dealing with here?
type(wine),type(country)
>> (list, list)
They are both lists. Let's check the type of the first element in each list:
type(wine[0]),type(country[0])
>> (lxml.etree._Element, lxml.etree._ElementUnicodeResult)
And this is where the problem starts. Because, as mentioned, I need to find the xpath of the first elements of the wine and country lists. And when I run:
tree_struct.getpath(wine[0])
The output is, as expected:
'/html/body/div[13]/div/div/div[2]/div[6]/div[1]/div/div/div[2]/div[2]'
But with the other:
tree_struct.getpath(country[0])
The output is:
TypeError: Argument 'element' has incorrect type (expected
lxml.etree._Element, got lxml.etree._ElementUnicodeResult)
I couldn't find much information about _ElementUnicodeResult, so what is it? And, more importantly, how do I fix the code so that I get an xpath for that node?
You're selecting a text() node instead of an element node. This is why you end up with a lxml.etree._ElementUnicodeResult type instead of a lxml.etree._Element type.
Try changing your xpath to the following in order to select the div element instead of the text() child node of div...
country = tree.xpath("//div[contains(., 'Australia')]")
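An alternative that keeps the original text() query is to ask the text result for its parent element: lxml's string results carry a getparent() method, and the parent is a regular element that getpath() accepts. A small sketch, using a hypothetical HTML snippet in place of the live page:

```python
from lxml import etree

# Hypothetical stand-in for the downloaded page.
html_doc = """
<html><body>
  <div class="row wine-attributes">
    <div>Australia</div>
  </div>
</body></html>
"""
tree = etree.fromstring(html_doc, parser=etree.HTMLParser())
tree_struct = etree.ElementTree(tree)

country = tree.xpath("//div/text()[contains(., 'Australia')]")
text_node = country[0]           # an _ElementUnicodeResult, i.e. a string
parent = text_node.getparent()   # the <div> element that holds the text
path = tree_struct.getpath(parent)
print(path)

# Note: matching on the element's string value also matches every
# ancestor div, because their string value includes the nested text.
broad = tree.xpath("//div[contains(., 'Australia')]")
print(len(broad))
```

In this snippet the broad query returns both the outer and the inner div, so the getparent() route may be the more precise one when only the innermost element is wanted.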
I am using Sphinx to generate HTML documentation for my project. Under Inline Markup, the Sphinx documentation discusses :menuselection: for marking a sequence of menu selections using markup like:
:menuselection:`Start --> Programs`
This results in the following HTML:
<span class="menuselection">Start ‣ Programs</span>
i.e. the --> gets converted to the small triangle, which I've determined is U+2023, TRIANGULAR BULLET.
That's all well and good, but I'd like to use a different character instead of the triangle. I have searched the Sphinx package and the theme package (sphinx-bootstrap-theme) somewhat exhaustively for 'menuselection', the triangle character, and a few other things, but haven't turned up anything that does the substitution from --> to ‣ (nothing obvious to me, anyway). But something must be converting it between my .rst source and the html.
My question is: what, specifically, is doing the conversion (Sphinx core? HTML writer? Theme JS?)?
The conversion is done in the sphinx.roles.menusel_role() function. You can create your own version of this function with a different separator character and register it to be used.
Add the following to your project's conf.py:
from docutils import nodes, utils
from docutils.parsers.rst import roles
from sphinx.roles import _amp_re

def patched_menusel_role(typ, rawtext, text, lineno, inliner, options={}, content=[]):
    text = utils.unescape(text)
    if typ == 'menuselection':
        text = text.replace('-->', u'\N{RIGHTWARDS ARROW}')  # Here is the patch
    spans = _amp_re.split(text)

    node = nodes.emphasis(rawtext=rawtext)
    for i, span in enumerate(spans):
        span = span.replace('&&', '&')
        if i == 0:
            if len(span) > 0:
                textnode = nodes.Text(span)
                node += textnode
            continue
        accel_node = nodes.inline()
        letter_node = nodes.Text(span[0])
        accel_node += letter_node
        accel_node['classes'].append('accelerator')
        node += accel_node
        textnode = nodes.Text(span[1:])
        node += textnode

    node['classes'].append(typ)
    return [node], []

# Use 'patched_menusel_role' function for processing the 'menuselection' role
roles.register_local_role("menuselection", patched_menusel_role)
When building the HTML, make sure to run make clean first so that the updated conf.py is re-parsed with the patch.
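The heart of the patch is a plain string replacement, which can be tried outside of Sphinx. The helper name replace_separator below is hypothetical and only illustrates the substitution step; it is not part of Sphinx or docutils.

```python
import unicodedata


def replace_separator(text, separator="\N{RIGHTWARDS ARROW}"):
    """Replace the '-->' menu separator with the given character."""
    return text.replace("-->", separator)


# The default Sphinx separator and the replacement used in the patch.
for char in ("\N{TRIANGULAR BULLET}", "\N{RIGHTWARDS ARROW}"):
    print("U+%04X %s" % (ord(char), unicodedata.name(char)))

print(replace_separator("Start --> Programs"))
```

Any other character can be passed as the second argument, which mirrors changing the \N{...} name in the patched role.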
I'm using XPath with Scrapy to scrape data off of a movie website BoxOfficeMojo.com.
As a general question: I'm wondering how to select certain child nodes of one parent node all in one Xpath string.
Depending on the movie page I'm scraping, the data I need is sometimes located at different child nodes, for example depending on whether or not there is a link. I will be going through about 14000 movies, so this process needs to be automated.
Using this as an example. I will need actor/s, director/s and producer/s.
This is the XPath to the director. Note: the %s corresponds to a determined index where that information is found; in the Action Jackson example the director is found at [1] and the actors at [2].
//div[@class="mp_box_content"]/table/tr[%s]/td[2]/font/text()
However, should a link to a page on the director exist, this would be the XPath:
//div[@class="mp_box_content"]/table/tr[%s]/td[2]/font/a/text()
Actors are a bit more tricky, as there are <br> elements included for subsequent actors listed, and the names may be children of an /a or children of the parent /font, so:
//div[@class="mp_box_content"]/table/tr[%s]/td[2]/font//a/text()
gets almost all of the actors (except those with font/br).
Now, the main problem here, I believe, is that there are multiple //div[@class="mp_box_content"] elements: everything I have works EXCEPT that I also end up getting some digits from other mp_box_content divs. I have also added numerous try:/except: statements in order to get everything (actors, directors and producers who both have and do not have links associated with them). For example, the following is my Scrapy code for actors:
actors = hxs.select('//div[@class="mp_box_content"]/table/tr[%s]/td[2]/font//a/text()' % (locActor,)).extract()
try:
    second = hxs.select('//div[@class="mp_box_content"]/table/tr[%s]/td[2]/font/text()' % (locActor,)).extract()
    for n in second:
        actors.append(n)
except:
    actors = hxs.select('//div[@class="mp_box_content"]/table/tr[%s]/td[2]/font/text()' % (locActor,)).extract()
This is an attempt to cover for the facts that: the first actor may not have a link associated with him/her and subsequent actors do, the first actor may have a link associated with him/her but the rest may not.
I appreciate the time taken to read this and any attempts to help me find/address this problem! Please let me know if any more information is needed.
I am assuming you are only interested in textual content, not the links to actors' pages etc.
Here is a proposal using lxml.html (and a bit of lxml.etree) directly.
First, I recommend you select td[2] cells by the text content of td[1], with expressions like .//tr[starts-with(td[1], "Director")]/td[2] to account for "Director" or "Directors".
Second, testing various expressions with or without <font>, with or without <a> etc., makes code difficult to read and maintain, and since you're interested only in the text content, you might as well use string(.//tr[starts-with(td[1], "Actor")]/td[2]) to get the text, or use lxml.html.tostring(e, method="text", encoding=unicode) on selected elements.
And for the <br> issue with multiple names, what I generally do is modify the lxml tree containing the targeted content to add a special formatting character to the <br> elements' .text or .tail, for example a \n, with one of lxml's iter() functions. This can be useful on other HTML block elements, like <hr> for example.
You may see better what I mean with some spider code:
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
import lxml.etree
import lxml.html

MARKER = "|"

def br2nl(tree):
    for element in tree:
        for elem in element.iter("br"):
            elem.text = MARKER

def extract_category_lines(tree):
    if tree is not None and len(tree):
        # modify the tree by adding a MARKER after <br> elements
        br2nl(tree)
        # use lxml's .tostring() to get a unicode string
        # and split lines on the marker we added above
        # so we get lists of actors, producers, directors...
        return lxml.html.tostring(
            tree[0], method="text", encoding=unicode).split(MARKER)
class BoxOfficeMojoSpider(BaseSpider):
    name = "boxofficemojo"
    start_urls = [
        "http://www.boxofficemojo.com/movies/?id=actionjackson.htm",
        "http://www.boxofficemojo.com/movies/?id=cloudatlas.htm",
    ]
    # locate 2nd cell by text content of first cell
    XPATH_CATEGORY_CELL = lxml.etree.XPath('.//tr[starts-with(td[1], $category)]/td[2]')

    def parse(self, response):
        root = lxml.html.fromstring(response.body)
        # locate the "The Players" table
        players = root.xpath('//div[@class="mp_box"][div[@class="mp_box_tab"]="The Players"]/div[@class="mp_box_content"]/table')
        # we have only one table in "players" so the for loop is not really necessary
        for players_table in players:
            directors_cells = self.XPATH_CATEGORY_CELL(players_table,
                category="Director")
            actors_cells = self.XPATH_CATEGORY_CELL(players_table,
                category="Actor")
            producers_cells = self.XPATH_CATEGORY_CELL(players_table,
                category="Producer")
            # look up the "Writer" row here (the lookup previously repeated "Producer")
            writers_cells = self.XPATH_CATEGORY_CELL(players_table,
                category="Writer")
            composers_cells = self.XPATH_CATEGORY_CELL(players_table,
                category="Composer")
            directors = extract_category_lines(directors_cells)
            actors = extract_category_lines(actors_cells)
            producers = extract_category_lines(producers_cells)
            writers = extract_category_lines(writers_cells)
            composers = extract_category_lines(composers_cells)
            print "Directors:", directors
            print "Actors:", actors
            print "Producers:", producers
            print "Writers:", writers
            print "Composers:", composers
            # here you should of course populate scrapy items
The code can be simplified for sure, but I hope you get the idea.
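For readers who want to try the <br>-marker idea without running a spider, here is a condensed Python 3 sketch that works on an inline table. The HTML snippet is a hypothetical stand-in for the "The Players" table, and the helper name names_for is made up for this illustration. It writes the marker into each <br> element's .tail rather than its .text, which is one of the two options mentioned above.

```python
import lxml.html

MARKER = "|"

# Hypothetical stand-in for the "The Players" table.
snippet = """
<table>
  <tr><td>Director:</td><td><font><a>Craig R. Baxley</a></font></td></tr>
  <tr><td>Actors:</td><td><font><a>Carl Weathers</a><br>Craig T. Nelson<br><a>Sharon Stone</a></font></td></tr>
</table>
"""
table = lxml.html.fromstring(snippet)


def names_for(table, category):
    """Return the names listed in the row whose first cell starts with category."""
    cells = table.xpath('.//tr[starts-with(td[1], $category)]/td[2]',
                        category=category)
    if not cells:
        return []
    cell = cells[0]
    # Put a marker right after every <br> so names can be split apart later.
    for br in cell.iter("br"):
        br.tail = MARKER + (br.tail or "")
    text = cell.text_content()
    return [name.strip() for name in text.split(MARKER) if name.strip()]


print(names_for(table, "Director"))
print(names_for(table, "Actor"))
```

The same function handles names with and without links, since text_content() ignores the surrounding <font> and <a> markup.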
You can do similar things with HtmlXPathSelector of course (with the string() XPath function for example), but without modifying the tree for <br> (how to do that with hxs?) it works only for non-multiple names in your case:
>>> hxs.select('string(//div[@class="mp_box"][div[@class="mp_box_tab"]="The Players"]/div[@class="mp_box_content"]/table//tr[contains(td, "Director")]/td[2])').extract()
[u'Craig R. Baxley']
>>> hxs.select('string(//div[@class="mp_box"][div[@class="mp_box_tab"]="The Players"]/div[@class="mp_box_content"]/table//tr[contains(td, "Actor")]/td[2])').extract()
[u'Carl WeathersCraig T. NelsonSharon Stone']
How can I get H1,H2,H3 contents in one single xpath expression?
I know I could do this.
//html/body/h1/text()
//html/body/h2/text()
//html/body/h3/text()
and so on.
Use:
/html/body/*[self::h1 or self::h2 or self::h3]/text()
The following expression is incorrect:
//html/body/*[local-name() = "h1"
or local-name() = "h2"
or local-name() = "h3"]/text()
because it may select text nodes that are children of elements in other namespaces, such as unwanted:h1, different:h2 or someWeirdNamespace:h3.
Another recommendation: Always avoid using // when the structure of the XML document is statically known. Using // most often results in significant inefficiencies because it causes the complete document (sub)tree rooted in the context node to be traversed.
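The difference between the two expressions can be checked with lxml. The document below is a made-up example that includes one h1 in a foreign namespace (the prefix x and the URI urn:example are assumptions for illustration):

```python
from lxml import etree

# Made-up document with one heading in a foreign namespace.
doc = etree.fromstring(
    '<html><body>'
    '<h1>One</h1><p>skipped</p><h2>Two</h2><h3>Three</h3>'
    '<x:h1 xmlns:x="urn:example">Other namespace</x:h1>'
    '</body></html>'
)

# self:: compares the full (namespace-aware) name.
by_name = doc.xpath("/html/body/*[self::h1 or self::h2 or self::h3]/text()")

# local-name() ignores the namespace, so x:h1 is matched as well.
by_local_name = doc.xpath(
    '/html/body/*[local-name() = "h1" or local-name() = "h2"'
    ' or local-name() = "h3"]/text()'
)

print(by_name)
print(by_local_name)
```

Only the local-name() variant picks up the text of the namespaced x:h1 element.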