Here is XPath one:
/Document/Attributes/BlobContent/Property[@Name="FileName"]/parent::*/Reference/@Link
and XPath two:
Document/Attributes/BlobContent/Property[@Name="FileName"]/parent::*/Property[@Name="FileName"]/@Value
Both bring back the right result!
I would like to avoid the complete union [one | two], as that brought back only a flat list of alternating results (an XPath 1.0 union returns a single node-set in document order, so the pairs get interleaved).
I tried
/Document/Attributes/BlobContent/Property[@Name="FileName"]/parent::*/Reference/@Link | */Property[@Name="FileName"]/@Value
but that brings back only the latter one.
So how would I correctly bring back two child node attributes from a found parent?
For anyone interested: I didn't find a pure XPath solution. However, this Python code did work for me:
import xml.etree.ElementTree as ET

tree = ET.parse(file_xml)
root = tree.getroot()
blobs = root.findall("*/Attributes[1]/BlobContent")
for blob in blobs:
    try:
        filename = blob.find('Property[@Name="FileName"]').attrib["Value"]
        exportname = blob.find('Reference[@Type="RELATIVEFILE"]').attrib["Link"]
        print(filename + "," + exportname)
    except AttributeError:
        # no FileName Property on this BlobContent
        pass
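If lxml is available, its fuller XPath 1.0 support allows a similar approach without the try/except: select each BlobContent parent once and read both attributes relative to it, which keeps the pairs together. A minimal sketch, assuming the same document structure as above:

from lxml import etree

tree = etree.parse(file_xml)
# select each matching parent once, then read both attributes relative to it
for blob in tree.xpath('/Document/Attributes/BlobContent[Property[@Name="FileName"]]'):
    filename = blob.xpath('string(Property[@Name="FileName"]/@Value)')
    link = blob.xpath('string(Reference/@Link)')
    print(filename + "," + link)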
I have many elements like this in my test.xml file:
<House name="bla"><Room id="bla" name="black"></Room></House>
How do I print all Rooms with name="black"? I am using CSS selectors, but the selector only seems to match the House and Room element names themselves.
I started by trying to print all name attributes, whether on House or Room:
nodes = doc.css("name"). But it gives null as the output, so I am not able to proceed.
In CSS you have a syntax for matching elements by an attribute key-value pair:
nodes = doc.css("[name='black']")
For future reference, you can also chain attribute selectors:
nodes = doc.css(".my-class[name='black'][foo='bar']")
Or omit the value and match any element where the attribute is present:
nodes = doc.css("[name]")
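To try this out quickly, here is a minimal sketch in Python with BeautifulSoup, whose select() accepts the same CSS attribute selectors (the doc.css calls above look like Nokogiri, where the selectors behave the same way):

from bs4 import BeautifulSoup

xml = '<House name="bla"><Room id="bla" name="black"></Room></House>'
doc = BeautifulSoup(xml, "xml")  # the "xml" parser needs lxml installed
for node in doc.select('[name="black"]'):
    print(node.name, node.attrs)
# Room {'id': 'bla', 'name': 'black'}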
I'm using scrapy, and I'm trying to look for a span that contains a specific text. I have:
response.selector.xpath('//*[@class="ParamText"]/span/node()')
which returns:
[<Selector xpath='//*[@class="ParamText"]/span/text()' data=u' MILES STODOLINK'>,
 <Selector xpath='//*[@class="ParamText"]/span/text()' data=u'C'>,
 <Selector xpath='//*[@class="ParamText"]/span/text()' data=u' MILES STODOLINK'>]
However when I run:
>>> response.selector.xpath('//*[@class="ParamText"]/span[contains(text(),"STODOLINK")]')
Out[11]: []
Why does the contains function not work?
contains() cannot evaluate multiple nodes at once:
/span[contains(text(),"STODOLINK")]
So if there are multiple text nodes within the span and "STODOLINK" isn't located in the first text node child of the span, then contains() in the above expression won't work. You should instead apply the contains() check to the individual text nodes, as follows:
//*[@class="ParamText"]/span[text()[contains(.,"STODOLINK")]]
Or, if "STODOLINK" isn't necessarily located directly within the span (it can be nested within another element inside the span), then you can simply use . instead of text():
//*[@class="ParamText"]/span[contains(.,"STODOLINK")]
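To make the difference concrete, here is a small sketch using a hypothetical fragment in which the span holds several text nodes:

import scrapy

example = '<div class="ParamText"><span>C<b>.</b> MILES STODOLINK</span></div>'
sel = scrapy.Selector(text=example)

# contains(text(), ...) only tests the first text node ("C"), so nothing matches
print(sel.xpath('//span[contains(text(), "STODOLINK")]').extract())  # []

# testing each text node individually finds the match
print(sel.xpath('//span[text()[contains(., "STODOLINK")]]').extract())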
In my terminal (assuming my example is identical to your file though) your code works:
Input
import scrapy
example='<div class="ParamText"><span>STODOLINK</span></div>'
scrapy.Selector(text=example).xpath('//*[@class="ParamText"]/span[contains(text(),"STODOLINK")]').extract()
Output:
['<span>STODOLINK</span>']
Can you clarify what might be different?
I use Scrapy with BeautifulSoup 4. IMO, Soup is easy to read and understand. This is an option if you don't have to use HtmlXPathSelector. Below is an example that finds all links; you can replace 'a' with 'span'. Hope this helps!
import scrapy
from bs4 import BeautifulSoup
from myproject.items import Item  # adjust to wherever your Item subclass lives

class LinkSpider(scrapy.Spider):  # parse() must live on a Spider
    name = "links"

    def parse(self, response):
        soup = BeautifulSoup(response.body, 'html.parser')
        print 'Current url: %s' % response.url
        item = Item()
        for link in soup.find_all('a'):
            if link.get('href') is not None:
                url = response.urljoin(link.get('href'))
                item['url'] = url
                yield scrapy.Request(url, callback=self.parse)
                yield item
I'm using XPath with Scrapy to scrape data from the movie website BoxOfficeMojo.com.
As a general question: I'm wondering how to select certain child nodes of one parent node, all in one XPath string.
Depending on the movie web page I'm scraping, the data I need is sometimes located at different child nodes, for example depending on whether or not there is a link. I will be going through about 14,000 movies, so this process needs to be automated.
Using this page as an example, I will need actor/s, director/s and producer/s.
This is the XPath to the director. Note: the %s corresponds to a determined index where that information is found; in the Action Jackson example the director is found at [1] and the actors at [2].
//div[@class="mp_box_content"]/table/tr[%s]/td[2]/font/text()
However, if a link exists to a page on the director, this would be the XPath:
//div[@class="mp_box_content"]/table/tr[%s]/td[2]/font/a/text()
Actors are a bit more tricky, as there are <br> elements included between subsequent actors listed, which may be children of an /a or children of the parent /font, so:
//div[@class="mp_box_content"]/table/tr[%s]/td[2]/font//a/text()
gets almost all of the actors (except those under font/br).
Now, the main problem here, I believe, is that there are multiple //div[@class="mp_box_content"] elements - everything I have works EXCEPT that I also end up getting some digits from the other mp_box_content tables. Also, I have added numerous try:/except: statements in order to get everything (actors, directors, and producers who both have and do not have links associated with them). For example, the following is my Scrapy code for actors:
actors = hxs.select('//div[@class="mp_box_content"]/table/tr[%s]/td[2]/font//a/text()' % (locActor,)).extract()
try:
    second = hxs.select('//div[@class="mp_box_content"]/table/tr[%s]/td[2]/font/text()' % (locActor,)).extract()
    for n in second:
        actors.append(n)
except:
    actors = hxs.select('//div[@class="mp_box_content"]/table/tr[%s]/td[2]/font/text()' % (locActor,)).extract()
This is an attempt to cover the cases where the first actor has no link but subsequent actors do, or the first actor has a link but the rest do not.
I appreciate the time taken to read this and any attempts to help me find/address this problem! Please let me know if any more information is needed.
I am assuming you are only interested in textual content, not the links to actors' pages etc.
Here is a proposition using lxml.html (and a bit of lxml.etree) directly
First, I recommend you select the td[2] cells by the text content of td[1], with expressions like .//tr[starts-with(td[1], "Director")]/td[2] to account for both "Director" and "Directors".
Second, testing various expressions with or without <font>, with or without <a>, etc. makes the code difficult to read and maintain. Since you're interested only in the text content, you might as well use string(.//tr[starts-with(td[1], "Actor")]/td[2]) to get the text, or use lxml.html.tostring(e, method="text", encoding=unicode) on the selected elements.
And for the <br> issue with multiple names, what I generally do is modify the lxml tree containing the targeted content to add a special formatting character to the <br> elements' .text or .tail (for example a \n) with one of lxml's iter() functions. This can be useful for other HTML block elements, like <hr> for example.
You may see better what I mean with some spider code:
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector

import lxml.etree
import lxml.html

MARKER = "|"

def br2nl(tree):
    for element in tree:
        for elem in element.iter("br"):
            elem.text = MARKER

def extract_category_lines(tree):
    if tree is not None and len(tree):
        # modify the tree by adding a MARKER after <br> elements
        br2nl(tree)
        # use lxml's .tostring() to get a unicode string
        # and split lines on the marker we added above
        # so we get lists of actors, producers, directors...
        return lxml.html.tostring(
            tree[0], method="text", encoding=unicode).split(MARKER)

class BoxOfficeMojoSpider(BaseSpider):
    name = "boxofficemojo"
    start_urls = [
        "http://www.boxofficemojo.com/movies/?id=actionjackson.htm",
        "http://www.boxofficemojo.com/movies/?id=cloudatlas.htm",
    ]

    # locate 2nd cell by text content of first cell
    XPATH_CATEGORY_CELL = lxml.etree.XPath('.//tr[starts-with(td[1], $category)]/td[2]')

    def parse(self, response):
        root = lxml.html.fromstring(response.body)

        # locate the "The Players" table
        players = root.xpath('//div[@class="mp_box"][div[@class="mp_box_tab"]="The Players"]/div[@class="mp_box_content"]/table')

        # we have only one table in "players" so the for loop is not really necessary
        for players_table in players:
            directors_cells = self.XPATH_CATEGORY_CELL(players_table,
                                                       category="Director")
            actors_cells = self.XPATH_CATEGORY_CELL(players_table,
                                                    category="Actor")
            producers_cells = self.XPATH_CATEGORY_CELL(players_table,
                                                       category="Producer")
            writers_cells = self.XPATH_CATEGORY_CELL(players_table,
                                                     category="Writer")
            composers_cells = self.XPATH_CATEGORY_CELL(players_table,
                                                       category="Composer")

            directors = extract_category_lines(directors_cells)
            actors = extract_category_lines(actors_cells)
            producers = extract_category_lines(producers_cells)
            writers = extract_category_lines(writers_cells)
            composers = extract_category_lines(composers_cells)

            print "Directors:", directors
            print "Actors:", actors
            print "Producers:", producers
            print "Writers:", writers
            print "Composers:", composers
            # here you should of course populate scrapy items
The code can be simplified for sure, but I hope you get the idea.
You can do similar things with HtmlXPathSelector of course (with the string() XPath function, for example), but without modifying the tree for <br> (how would you do that with hxs?) it works only for non-multiple names in your case:
>>> hxs.select('string(//div[@class="mp_box"][div[@class="mp_box_tab"]="The Players"]/div[@class="mp_box_content"]/table//tr[contains(td, "Director")]/td[2])').extract()
[u'Craig R. Baxley']
>>> hxs.select('string(//div[@class="mp_box"][div[@class="mp_box_tab"]="The Players"]/div[@class="mp_box_content"]/table//tr[contains(td, "Actor")]/td[2])').extract()
[u'Carl WeathersCraig T. NelsonSharon Stone']
I'm pretty confused about this one. Given the following XML:
<sch:eventList>
    <sch:event>
        <sch:eventName>Event One</sch:eventName>
        <sch:locationName>Location One</sch:locationName>
    </sch:event>
    <sch:event>
        <sch:eventName>Event Two</sch:eventName>
        <sch:locationName>Location Two</sch:locationName>
    </sch:event>
</sch:eventList>
When using JDOM with the following code:
XPath eventNameExpression = XPath.newInstance("//sch:eventName");
XPath eventLocationExpression = XPath.newInstance("//sch:locationName");
XPath eventExpression = XPath.newInstance("//sch:event");
List<Element> elements = eventExpression.selectNodes(requestElement);
for (Element e : elements) {
    System.out.println(eventNameExpression.valueOf(e));
    System.out.println(eventLocationExpression.valueOf(e));
}
The console shows this:
Event One
Location One
Event One
Location One
What am I missing?
Don't use '//': it always starts searching at the root node. Use a relative path instead, e.g. './sch:eventName', which is evaluated relative to the current node.
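The same pitfall exists in any XPath engine. A minimal sketch with Python's lxml (namespaces omitted for brevity) that reproduces and fixes it:

from lxml import etree

doc = etree.fromstring(
    "<eventList>"
    "<event><eventName>Event One</eventName></event>"
    "<event><eventName>Event Two</eventName></event>"
    "</eventList>")

for event in doc.findall("event"):
    # '//' searches from the document root regardless of the context node
    print(event.xpath("string(//eventName)"))  # "Event One" both times
    # './' is evaluated relative to the current <event>
    print(event.xpath("string(./eventName)"))  # "Event One", then "Event Two"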