I'm using XPath with Scrapy to scrape data off of a movie website BoxOfficeMojo.com.
As a general question: I'm wondering how to select certain child nodes of one parent node, all in one XPath string.
Depending on the movie page I'm scraping, the data I need is sometimes located at different child nodes, for example depending on whether or not a link is present. I will be going through about 14,000 movies, so this process needs to be automated.
Using this page as an example, I will need the actor(s), director(s) and producer(s).
This is the XPath to the director. Note: the %s corresponds to a determined index where that information is found; in the Action Jackson example, the director is found at [1] and the actors at [2].
//div[@class="mp_box_content"]/table/tr[%s]/td[2]/font/text()
However, if a link to a page about the director exists, this would be the XPath:
//div[@class="mp_box_content"]/table/tr[%s]/td[2]/font/a/text()
Actors are a bit trickier, as there are <br> elements included for subsequent actors listed, which may be children of an /a or children of the parent /font, so:
//div[@class="mp_box_content"]/table/tr[%s]/td[2]/font//a/text()
gets almost all of the actors (except those under font/br).
Now, the main problem here, I believe, is that there are multiple //div[@class="mp_box_content"] elements; everything I have works, except that I also end up picking up stray digits from the other mp_box_content divs. I have also added numerous try/except blocks in order to cover all cases (actors, directors and producers who do or do not have links associated with them). For example, the following is my Scrapy code for actors:
actors = hxs.select('//div[@class="mp_box_content"]/table/tr[%s]/td[2]/font//a/text()' % (locActor,)).extract()
try:
    second = hxs.select('//div[@class="mp_box_content"]/table/tr[%s]/td[2]/font/text()' % (locActor,)).extract()
    for n in second:
        actors.append(n)
except:
    actors = hxs.select('//div[@class="mp_box_content"]/table/tr[%s]/td[2]/font/text()' % (locActor,)).extract()
This is an attempt to cover two cases: the first actor has no link but subsequent actors do, or the first actor has a link but the rest do not.
I appreciate the time taken to read this and any attempts to help me find/address this problem! Please let me know if any more information is needed.
I am assuming you are only interested in textual content, not the links to actors' pages etc.
Here is a proposal using lxml.html (and a bit of lxml.etree) directly.
First, I recommend you select the td[2] cells by the text content of td[1], with expressions like .//tr[starts-with(td[1], "Director")]/td[2] to account for both "Director" and "Directors".
Second, testing various expressions with or without <font>, with or without <a>, etc., makes the code difficult to read and maintain. Since you're interested only in the text content, you might as well use string(.//tr[starts-with(td[1], "Actor")]/td[2]) to get the text, or use lxml.html.tostring(e, method="text", encoding=unicode) on the selected elements.
As for the <br> issue with multiple names, what I generally do is modify the lxml tree containing the targeted content, adding a special formatting character to the <br> elements' .text or .tail (for example a \n) with one of lxml's iter() functions. This can be useful with other HTML block elements too, <hr> for example.
You may see better what I mean with some spider code:
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
import lxml.etree
import lxml.html

MARKER = "|"

def br2nl(tree):
    for element in tree:
        for elem in element.iter("br"):
            elem.text = MARKER

def extract_category_lines(tree):
    if tree is not None and len(tree):
        # modify the tree by adding a MARKER after <br> elements
        br2nl(tree)
        # use lxml's .tostring() to get a unicode string
        # and split lines on the marker we added above
        # so we get lists of actors, producers, directors...
        return lxml.html.tostring(
            tree[0], method="text", encoding=unicode).split(MARKER)
class BoxOfficeMojoSpider(BaseSpider):
    name = "boxofficemojo"
    start_urls = [
        "http://www.boxofficemojo.com/movies/?id=actionjackson.htm",
        "http://www.boxofficemojo.com/movies/?id=cloudatlas.htm",
    ]
    # locate 2nd cell by text content of first cell
    XPATH_CATEGORY_CELL = lxml.etree.XPath(
        './/tr[starts-with(td[1], $category)]/td[2]')

    def parse(self, response):
        root = lxml.html.fromstring(response.body)
        # locate the "The Players" table
        players = root.xpath('//div[@class="mp_box"][div[@class="mp_box_tab"]="The Players"]/div[@class="mp_box_content"]/table')
        # we have only one table in "players" so the for loop is not really necessary
        for players_table in players:
            directors_cells = self.XPATH_CATEGORY_CELL(players_table,
                                                       category="Director")
            actors_cells = self.XPATH_CATEGORY_CELL(players_table,
                                                    category="Actor")
            producers_cells = self.XPATH_CATEGORY_CELL(players_table,
                                                       category="Producer")
            writers_cells = self.XPATH_CATEGORY_CELL(players_table,
                                                     category="Writer")
            composers_cells = self.XPATH_CATEGORY_CELL(players_table,
                                                       category="Composer")
            directors = extract_category_lines(directors_cells)
            actors = extract_category_lines(actors_cells)
            producers = extract_category_lines(producers_cells)
            writers = extract_category_lines(writers_cells)
            composers = extract_category_lines(composers_cells)

            print "Directors:", directors
            print "Actors:", actors
            print "Producers:", producers
            print "Writers:", writers
            print "Composers:", composers
            # here you should of course populate scrapy items
The code can be simplified for sure, but I hope you get the idea.
You can do similar things with HtmlXPathSelector of course (with the string() XPath function, for example), but without modifying the tree for the <br> elements (how would you do that with hxs?), it works only for single names in your case:
>>> hxs.select('string(//div[@class="mp_box"][div[@class="mp_box_tab"]="The Players"]/div[@class="mp_box_content"]/table//tr[contains(td, "Director")]/td[2])').extract()
[u'Craig R. Baxley']
>>> hxs.select('string(//div[@class="mp_box"][div[@class="mp_box_tab"]="The Players"]/div[@class="mp_box_content"]/table//tr[contains(td, "Actor")]/td[2])').extract()
[u'Carl WeathersCraig T. NelsonSharon Stone']
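That said, one possible workaround to get multiple names with HtmlXPathSelector alone is to select every descendant text node with //text() instead of wrapping the cell in string(). Assuming each name is its own text node (directly under <font> or inside an <a>, with <br> tags in between, and <br> itself producing no text node), each actor should come back as a separate list item. A sketch, untested against the live pages:
>>> names = hxs.select('//div[@class="mp_box"][div[@class="mp_box_tab"]="The Players"]/div[@class="mp_box_content"]/table//tr[contains(td, "Actor")]/td[2]//text()').extract()
>>> [name.strip() for name in names if name.strip()]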
Related
I have read the documentation but I cannot find a way to retrieve the data I need, which is the images of each card in a set. If anyone can help or give me any pointers on how to retrieve this, I would greatly appreciate it.
Here is the link to the documentation on GitHub:
https://github.com/PokemonTCG/pokemon-tcg-sdk-ruby/blob/master/README.md
Below is my code, with my Pokemon TCG API key redacted:
Pokemon.configure do |config|
  config.api_key = "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"
end
sets = Pokemon::Set.where(q: 'legalities.standard:legal').first
p sets
I am expecting to retrieve specific data from the Pokemon TCG API, which from the above should be within these Ruby classes.
p sets.images.symbol will print out the link to:
"https://images.pokemontcg.io/swshp/symbol.png"
p sets.images.logo will print out the link to:
"https://images.pokemontcg.io/swshp/logo.png"
When you are unsure what is contained within the class you're getting (in this case a Pokemon::Set), you can list all the methods of the resulting variable with sets.methods.sort.each{|m| p m}; there you'll find the images method and can dig deeper to symbol and logo.
In your example you are searching for sets. In the case of cards:
cards = Pokemon::Card.where(page: 5, pageSize: 100)
cards.each{|card| p card.images.large}
This will print out all the links to the large card images.
Update
Clarification based on comments
How to print out methods and how to know which class you are referring to.
Let's start with your code snippet:
sets = Pokemon::Set.where(q: 'legalities.standard:legal').first
puts sets.class
Output:
Pokemon::Set
It would be clearer if you named your variable set, because you're selecting the .first element of an array.
Back to my cards example:
cards = Pokemon::Card.where(page: 5, pageSize: 100)
puts cards.class
Output:
Array
As I did not select any item from the array, the cards naming makes perfect sense, and this array will contain the cards:
card = cards.first
puts card.class
Output:
Pokemon::Card
Printing methods
So, now that we are well aware of which class each variable contains and what we are referring to, we are ready to print out the methods for both:
sets.methods.sort.each{|method| p method}
cards[0].methods.sort.each{|method| p method}
Output:
…
:hash
:hp
:hp=
:id
:id=
:images <== Here!
:images=
:inspect
:instance_eval
:instance_exec
:instance_of?
…
Once again, mind the [0] selection of the first item: when we want to list methods, we want them for a Pokemon::Set or a Pokemon::Card, not for the Array.
A bit confused about the effects of these two methods. Here's a simple string:
test = """
<p> This is my head <h1> this is my middle </h1> and this is my tail.</p>
"""
We create two roots with this string:
from lxml import html, etree
root_e = etree.fromstring(test)
root_h = html.fromstring(test)
Let's see what the trees look like:
tree_e = etree.ElementTree(root_e)
for elem in root_e.iter():
print(tree_e.getpath(elem))
Output is:
/p
/p/h1
which is what I would expect. However with:
tree_h = etree.ElementTree(root_h)
for elem in root_h.iter():
print(tree_h.getpath(elem))
the output is now:
/html/div
/div/p
/div/h1
which I didn't expect. And strange consequences follow: some XPath expressions work the same in both trees, but others don't. For example,
root_h.xpath('/html/div')[0].text_content()
outputs the whole string text (with a newline attached), although test has neither html nor div in it. On the other hand,
root_h.xpath('/html/div')[0].text
returns nothing.
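One way to inspect what lxml.html actually built here, just as a debugging aid using lxml's own serializer, is to print the parsed tree back out and look at the wrapper elements added during parsing:
print(html.tostring(root_h, pretty_print=True).decode())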
So why the differences, and when should you use one or the other?
While trying to help another user out with some question, I ran into the following problem myself:
The object is to find the country of origin of a list of wines on the page. So we start with:
import requests
from lxml import etree
url = "https://www.winepeople.com.au/wines/Dry-Red/_/N-1z13zte"
res = requests.get(url)
content = res.content
tree = etree.fromstring(content, parser=etree.HTMLParser())
tree_struct = etree.ElementTree(tree)
Next, for reasons I'll get into in a separate question, I'm trying to compare the xpath of two elements with certain attributes. So:
wine = tree.xpath("//div[contains(#class, 'row wine-attributes')]")
country = tree.xpath("//div/text()[contains(., 'Australia')]")
So far, so good. What are we dealing with here?
type(wine),type(country)
>> (list, list)
They are both lists. Let's check the type of the first element in each list:
type(wine[0]),type(country[0])
>> (lxml.etree._Element, lxml.etree._ElementUnicodeResult)
And this is where the problem starts. Because, as mentioned, I need to find the xpath of the first elements of the wine and country lists. And when I run:
tree_struct.getpath(wine[0])
The output is, as expected:
'/html/body/div[13]/div/div/div[2]/div[6]/div[1]/div/div/div[2]/div[2]'
But with the other:
tree_struct.getpath(country[0])
The output is:
TypeError: Argument 'element' has incorrect type (expected
lxml.etree._Element, got lxml.etree._ElementUnicodeResult)
I couldn't find much information about _ElementUnicodeResult, so what is it? And, more importantly, how do I fix the code so that I get an XPath for that node?
You're selecting a text() node instead of an element node. This is why you end up with a lxml.etree._ElementUnicodeResult type instead of a lxml.etree._Element type.
Try changing your xpath to the following in order to select the div element instead of the text() child node of div...
country = tree.xpath("//div[contains(., 'Australia')]")
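Alternatively, if you want to keep selecting the text() node, note that lxml's XPath string results are "smart strings": an _ElementUnicodeResult remembers where it came from, and its getparent() method returns the element that carries the text, which getpath() does accept. A minimal sketch based on the code above:
country = tree.xpath("//div/text()[contains(., 'Australia')]")
# smart string: getparent() returns the element the text node belongs to
# (for tail text it returns the preceding element instead)
country_div = country[0].getparent()
print(tree_struct.getpath(country_div))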
I'm working on a web-scraping solution that grabs totally different webpages and lets the user define rules/scripts in order to extract information from the page.
I started scraping from a single domain and built a parser based on Nokogiri.
Basically everything works fine.
I could now add a ruby class each time somebody wants to add a webpage with a different layout/style.
Instead, I thought about an approach where the user specifies the elements where the content is stored, using XPath, and this is stored as a sort of recipe for the webpage.
Example: the user wants to scrape a table structure, extracting the rows as an array of hashes (column name => cell content).
I was thinking about writing a ruby function for extraction of this generic table information once:
# extracts a table's rows as an array of hashes (column_name => cell content)
# html - the html file as a string
# xpath_table - an xpath expression specifying the html table that holds the data to be extracted
def basic_table(html, xpath_table)
  xpath_headers = "#{xpath_table}/thead/tr/th"
  html_doc = Nokogiri::HTML(html)

  row_headers = html_doc.xpath(xpath_headers)
  row_headers = row_headers.map do |column|
    column.inner_text
  end

  row_contents = Array.new
  # double quotes are needed here so #{xpath_table} is interpolated
  table_rows = html_doc.xpath("#{xpath_table}/tbody/tr")
  table_rows.each do |table_row|
    cells = table_row.xpath('td')
    cells = cells.map do |cell|
      cell.inner_text
    end
    row_content_hash = Hash.new
    cells.each_with_index do |cell_string, column_index|
      row_content_hash[row_headers[column_index]] = cell_string
    end
    row_contents << row_content_hash
  end
  return row_contents
end
The user could now specify a website-recipe-file like this:
<basic_table xpath='//div[@id="grid"]/table[@id="displayGrid"]'/>
The function basic_table is referenced here, so that by parsing the website recipe file I would know I can use basic_table to extract the content of the table referenced by the XPath.
This way the user can specify simple recipe scripts and only has to dive into writing actual code when a new way of extracting information is needed.
The code would not change every time a new webpage needs to be parsed.
Whenever the structure of a webpage changes only the recipe-script would need to be changed.
I was wondering how others would approach this. Rules/rule engines come to mind, but I'm not sure if that is really the solution to my problem.
Somehow I have the feeling that I don't want to "invent" my own solution to this.
Does anybody have a suggestion?
J.
I need to update multiple targets when a link is clicked.
This example builds a list of links.
When the link is clicked, the callback needs to populate two different parts of the .html file.
The actual application uses bokeh for plotting.
When the user clicks a link, 'linkDetails1' and 'linkDetails2' will hold the script and div returned by calls to bokeh's components() function.
Obviously this naive approach does not work.
How can I make a list of links that when clicked on will populate two separate places in the .html file?
################################
#views/default/test.html:
{{extend 'layout.html'}}
{{=linkDetails1}}
{{=linkDetails2}}
{{=links}}
################################
# controllers/default.py:
def test():
    """
    example action using the internationalization operator T and flash
    rendered by views/default/index.html or views/generic.html
    if you need a simple wiki simply replace the two lines below with:
    return auth.wiki()
    """
    d = dict()
    links = []
    for ii in range(5):
        link = A("click on link %d" % ii, callback=URL('linkHandler/%d' % ii))
        links.append(["Item %d" % ii, link])
    table = TABLE()
    table.append([TR(*rows) for rows in links])
    d["links"] = table
    d["linkDetails1"] = "linkDetails1"
    d["linkDetails2"] = "linkDetails2"
    return d

def linkHandler():
    import os
    d = dict()
    # request.url will be linkHandler/N
    ii = int(os.path.split(request.url)[1])
    # want to put some information into linkDetails1, some into linkDetails2
    # this does not work:
    d["linkDetails1"] = "linkHandler %d" % ii
    d["linkDetails2"] = "linkHandler %d" % ii
    return d
I must admit that I'm not 100% clear on what you're trying to do here, but if you need to update e.g. 2 div elements in your page in response to a single click, there are a couple of ways to accomplish that.
The easiest, and arguably most web2py-ish, way is to contain your targets in an outer div that is the target of the update.
Another alternative, which is very powerful, is to use something like Taconite [1], which lets you update multiple parts of the DOM in a single response.
[1] http://www.malsup.com/jquery/taconite/
In this case, it doesn't look like you need the Ajax call to return content to two separate parts of the DOM. Instead, both elements returned (the script and the div elements) can simply be placed inside a single parent div.
# views/default/test.html:
{{extend 'layout.html'}}
<div id="link_details">
{{=linkDetails1}}
{{=linkDetails2}}
</div>
{{=links}}
# controllers/default.py
def test():
    ...
    for ii in range(5):
        link = A("click on link %d" % ii,
                 callback=URL('default', 'linkHandler', args=ii),
                 target="link_details")
    ...
If you provide a "target" argument to A(), the result of the Ajax call will go into the DOM element with that ID.
def linkHandler():
    ...
    content = CAT(SCRIPT(...), DIV(...))
    return content
In linkHandler, instead of returning a dictionary (which requires a view in order to generate HTML), you can simply return a web2py HTML helper, which will automatically be serialized to HTML and then inserted into the target div. The CAT() helper simply concatenates other elements (in this case, your script and associated div).
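For completeness, here is a minimal sketch of what linkHandler could return under this approach; the SCRIPT and DIV payloads below are hypothetical stand-ins for the script/div pair that bokeh's components() would produce:
def linkHandler():
    # with callback=URL('default', 'linkHandler', args=ii),
    # the clicked link's index arrives as the first URL arg
    ii = int(request.args(0))
    # hypothetical placeholder content; in the real app these would be
    # the script and div strings returned by bokeh's components()
    content = CAT(SCRIPT("console.log('link %d clicked');" % ii),
                  DIV("details for link %d" % ii))
    return content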