Difference in trees generated by html.fromstring() and etree.fromstring()

Difference in trees generated by html.fromstring() and etree.fromstring() - xpath

A bit confused about the effects of these two methods. Here's a simple string:
test = """
<p> This is my head <h1> this is my middle </h1> and this is my tail.</p>
"""
We create two roots with this string:
from lxml import html, etree
root_e = etree.fromstring(test)
root_h = html.fromstring(test)
Let's see what the trees look like:
tree_e = etree.ElementTree(root_e)
for elem in root_e.iter():
print(tree_e.getpath(elem))
Output is:
/p
/p/h1
which is what I would expect. However with:
tree_h = etree.ElementTree(root_h)
for elem in root_h.iter():
print(tree_h.getpath(elem))
the output is now:
/html/div
/div/p
/div/h1
which I didn't expect. And strange consequences follow. Various xpath expressions work the same in both trees, but others don't. For example
root_h.xpath('/html/div')[0].text_content()
outputs the whole string text (with a newline attached), although test has neither html nor div in it. On the other hand,
root_h.xpath('/html/div')[0].text
does nothing.
So why the differences, and when should you use one or the other?

Related

Xpath uninion parameter from parent

here xpath one
/Document/Attributes/BlobContent/Property[#Name="FileName"]/parent::*/Reference/#Link
and xpath two
Document/Attributes/BlobContent/Property[#Name="FileName"]/parent::*/Property[#Name="FileName"]/#Value
both bring back the right result !
I would like to avoid the complete chaining [one | two] as that brought back only a list of alternating results.
tried with
/Document/Attributes/BlobContent/Property[#Name="FileName"]/parent::*/Reference/#Link | */Property[#Name="FileName"]/#Value
but that brings back only the later one.
So how would I correctly bring back two child node attributes from a found parent ?

For anyone interested I didn't find any XPATH solution. However that python code did work for me
import xml.etree.ElementTree as ET
tree = ET.parse(file_xml)
root = tree.getroot()
blobs = root.findall("*/Attributes[1]/BlobContent")
for blob in blobs:
try:
filename = blob.find('Property[#Name="FileName"]').attrib["Value"]
exportname = blob.find('Reference[#Type="RELATIVEFILE"]').attrib["Link"]
print(filename + "," + exportname)
except:
#no filename Property
pass

Problems with '._ElementUnicodeResult'

While trying to help another user out with some question, I ran into the following problem myself:
The object is to find the country of origin of a list of wines on the page. So we start with:
import requests
from lxml import etree
url = "https://www.winepeople.com.au/wines/Dry-Red/_/N-1z13zte"
res = requests.get(url)
content = res.content
res = requests.get(url)
tree = etree.fromstring(content, parser=etree.HTMLParser())
tree_struct = etree.ElementTree(tree)
Next, for reasons I'll get into in a separate question, I'm trying to compare the xpath of two elements with certain attributes. So:
wine = tree.xpath("//div[contains(#class, 'row wine-attributes')]")
country = tree.xpath("//div/text()[contains(., 'Australia')]")
So far, so good. What are we dealing with here?
type(wine),type(country)
>> (list, list)
They are both lists. Let's check the type of the first element in each list:
type(wine[0]),type(country[0])
>> (lxml.etree._Element, lxml.etree._ElementUnicodeResult)
And this is where the problem starts. Because, as mentioned, I need to find the xpath of the first elements of the wine and country lists. And when I run:
tree_struct.getpath(wine[0])
The output is, as expected:
'/html/body/div[13]/div/div/div[2]/div[6]/div[1]/div/div/div[2]/div[2]'
But with the other:
tree_struct.getpath(country[0])
The output is:
TypeError: Argument 'element' has incorrect type (expected
lxml.etree._Element, got lxml.etree._ElementUnicodeResult)
I couldn't find much information about _ElementUnicodeResult), so what is it? And, more importantly, how do I fix the code so that I get an xpath for that node?

You're selecting a text() node instead of an element node. This is why you end up with a lxml.etree._ElementUnicodeResult type instead of a lxml.etree._Element type.
Try changing your xpath to the following in order to select the div element instead of the text() child node of div...
country = tree.xpath("//div[contains(., 'Australia')]")

How can I configure the separator character used for :menuselection:?

I am using Sphinx to generate HTML documentation for my project. Under Inline Markup, the Sphinx documentation discusses :menuselection: for marking a sequence of menu selections using markup like:
:menuselection:`Start --> Programs`
This results in the following HTML:
<span class="menuselection">Start ‣ Programs</span>
i.e. the --> gets converted to the small triangle, which I've determined is U+2023, TRIANGULAR BULLET.
That's all well and good, but I'd like to use a different character instead of the triangle. I have searched the Sphinx package and the theme package (sphinx-bootstrap-theme) somewhat exhaustively for 'menuselection', the triangle character, and a few other things, but haven't turned up anything that does the substitution from --> to ‣ (nothing obvious to me, anyway). But something must be converting it between my .rst source and the html.
My question is: what, specifically is doing the conversion (sphinx core? HTML writer? Theme JS?)?

The conversion is done in the sphinx.roles.menusel_role() function. You can create your own version of this function with a different separator character and register it to be used.
Add the following to your project's conf.py:
from docutils import nodes, utils
from docutils.parsers.rst import roles
from sphinx.roles import _amp_re
def patched_menusel_role(typ, rawtext, text, lineno, inliner, options={}, content=[]):
text = utils.unescape(text)
if typ == 'menuselection':
text = text.replace('-->', u'\N{RIGHTWARDS ARROW}') # Here is the patch
spans = _amp_re.split(text)
node = nodes.emphasis(rawtext=rawtext)
for i, span in enumerate(spans):
span = span.replace('&&', '&')
if i == 0:
if len(span) > 0:
textnode = nodes.Text(span)
node += textnode
continue
accel_node = nodes.inline()
letter_node = nodes.Text(span[0])
accel_node += letter_node
accel_node['classes'].append('accelerator')
node += accel_node
textnode = nodes.Text(span[1:])
node += textnode
node['classes'].append(typ)
return [node], []
# Use 'patched_menusel_role' function for processing the 'menuselection' role
roles.register_local_role("menuselection", patched_menusel_role)
When building html, make sure to make clean first so that the updated conf.py is re-parsed with the patch.

XPath: Select Certain Child Nodes

I'm using XPath with Scrapy to scrape data off of a movie website BoxOfficeMojo.com.
As a general question: I'm wondering how to select certain child nodes of one parent node all in one Xpath string.
Depending on the movie web page from which I'm scraping data, sometimes the data I need is located at different children nodes, such as whether or not there is a link or not. I will be going through about 14000 movies, so this process needs to be automated.
Using this as an example. I will need actor/s, director/s and producer/s.
This is the Xpath to the director: Note: The %s corresponds to a determined index where that information is found - in the action Jackson example director is found at [1] and actors at [2].
//div[#class="mp_box_content"]/table/tr[%s]/td[2]/font/text()
However, would a link exist to a page on the director, this would be the Xpath:
//div[#class="mp_box_content"]/table/tr[%s]/td[2]/font/a/text()
Actors are a bit more tricky, as there <br> included for subsequent actors listed, which may be the children of an /a or children of the parent /font, so:
//div[#class="mp_box_content"]/table/tr[%s]/td[2]/font//a/text()
Gets all most all of the actors (except those with font/br).
Now, the main problem here, I believe, is that there are multiple //div[#class="mp_box_content"] - everything I have works EXCEPT that I also end up getting some digits from other mp_box_content. Also I have added numerous try:, except: statements in order to get everything (actors, directors, producers who both have and do not have links associated with them). For example, the following is my Scrapy code for actors:
actors = hxs.select('//div[#class="mp_box_content"]/table/tr[%s]/td[2]/font//a/text()' % (locActor,)).extract()
try:
second = hxs.select('//div[#class="mp_box_content"]/table/tr[%s]/td[2]/font/text()' % (locActor,)).extract()
for n in second:
actors.append(n)
except:
actors = hxs.select('//div[#class="mp_box_content"]/table/tr[%s]/td[2]/font/text()' % (locActor,)).extract()
This is an attempt to cover for the facts that: the first actor may not have a link associated with him/her and subsequent actors do, the first actor may have a link associated with him/her but the rest may not.
I appreciate the time taken to read this and any attempts to help me find/address this problem! Please let me know if any more information is needed.

I am assuming you are only interested in textual content, not the links to actors' pages etc.
Here is a proposition using lxml.html (and a bit of lxml.etree) directly
First, I recommend you select td[2] cells by the text content of td[1], with expressions like .//tr[starts-with(td[1], "Director")]/td[2] to account for "Director", or "Directors"
Second, testing various expressions with or without <font>, with or without <a> etc., makes code difficult to read and maintain, and since you're interested only in the text content, you might as well use string(.//tr[starts-with(td[1], "Actor")]/td[2]) to get the text, or use lxml.html.tostring(e, method="text", encoding=unicode) on selected elements
And for the <br> issue for multiple names, the way I do is generally modify the lxml tree containing the targetted content to add a special formatting character to <br> elements' .text or .tail, for example a \n, with one of lxml's iter() functions. This can be useful on other HTML block elements, like <hr> for example.
You may see better what I mean with some spider code:
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
import lxml.etree
import lxml.html
MARKER = "|"
def br2nl(tree):
for element in tree:
for elem in element.iter("br"):
elem.text = MARKER
def extract_category_lines(tree):
if tree is not None and len(tree):
# modify the tree by adding a MARKER after <br> elements
br2nl(tree)
# use lxml's .tostring() to get a unicode string
# and split lines on the marker we added above
# so we get lists of actors, producers, directors...
return lxml.html.tostring(
tree[0], method="text", encoding=unicode).split(MARKER)
class BoxOfficeMojoSpider(BaseSpider):
name = "boxofficemojo"
start_urls = [
"http://www.boxofficemojo.com/movies/?id=actionjackson.htm",
"http://www.boxofficemojo.com/movies/?id=cloudatlas.htm",
]
# locate 2nd cell by text content of first cell
XPATH_CATEGORY_CELL = lxml.etree.XPath('.//tr[starts-with(td[1], $category)]/td[2]')
def parse(self, response):
root = lxml.html.fromstring(response.body)
# locate the "The Players" table
players = root.xpath('//div[#class="mp_box"][div[#class="mp_box_tab"]="The Players"]/div[#class="mp_box_content"]/table')
# we have only one table in "players" so the for loop is not really necessary
for players_table in players:
directors_cells = self.XPATH_CATEGORY_CELL(players_table,
category="Director")
actors_cells = self.XPATH_CATEGORY_CELL(players_table,
category="Actor")
producers_cells = self.XPATH_CATEGORY_CELL(players_table,
category="Producer")
writers_cells = self.XPATH_CATEGORY_CELL(players_table,
category="Producer")
composers_cells = self.XPATH_CATEGORY_CELL(players_table,
category="Composer")
directors = extract_category_lines(directors_cells)
actors = extract_category_lines(actors_cells)
producers = extract_category_lines(producers_cells)
writers = extract_category_lines(writers_cells)
composers = extract_category_lines(composers_cells)
print "Directors:", directors
print "Actors:", actors
print "Producers:", producers
print "Writers:", writers
print "Composers:", composers
# here you should of course populate scrapy items
The code can be simplified for sure, but I hope you get the idea.
You can do similar things with HtmlXPathSelector of course (with the string() XPath function for example), but without modifying the tree for <br> (how to do that with hxs?) it works only for non-multiple names in your case:
>>> hxs.select('string(//div[#class="mp_box"][div[#class="mp_box_tab"]="The Players"]/div[#class="mp_box_content"]/table//tr[contains(td, "Director")]/td[2])').extract()
[u'Craig R. Baxley']
>>> hxs.select('string(//div[#class="mp_box"][div[#class="mp_box_tab"]="The Players"]/div[#class="mp_box_content"]/table//tr[contains(td, "Actor")]/td[2])').extract()
[u'Carl WeathersCraig T. NelsonSharon Stone']

How to get H1,H2,H3,... using a single xpath expression

How can I get H1,H2,H3 contents in one single xpath expression?
I know I could do this.
//html/body/h1/text()
//html/body/h2/text()
//html/body/h3/text()
and so on.

Use:
/html/body/*[self::h1 or self::h2 or self::h3]/text()
The following expression is incorrect:
//html/body/*[local-name() = "h1"
or local-name() = "h2"
or local-name() = "h3"]/text()
because it may select text nodes that are children of unwanted:h1, different:h2, someWeirdNamespace:h3.
Another recommendation: Always avoid using // when the structure of the XML document is statically known. Using // most often results in significant inefficiencies because it causes the complete document (sub)tree roted in the context node to be traversed.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio

Difference in trees generated by html.fromstring() and etree.fromstring() - xpath

Related

Xpath uninion parameter from parent

Problems with '._ElementUnicodeResult'

How can I configure the separator character used for :menuselection:?

XPath: Select Certain Child Nodes

How to get H1,H2,H3,... using a single xpath expression

Categories

Resources