Scrapy xpath selector doesn't retrieve the element

From this url: https://www.basketball-reference.com/boxscores/202110190LAL.html, I want to extract the text from this xpath:
//div[@id='div_four_factors']/table/tbody/tr[1]/td[1]
But the element I get is None.
In Scrapy shell I use this:
>>> text = response.xpath("//div[@id='div_four_factors']/table/tbody/tr[1]/td[1]/text()").get()
>>> print(text)
None
I have tried to write the right xpath for the element I want to retrieve, but I get no result.

That is because the table, and it looks like all the tables on that page, is loaded using javascript after the page has already loaded. So the xpath doesn't exist in the response html you are parsing.
You can see this if you open the page in a web browser, right click and select "View page source" or something like that. Alternatively you could just print(response.text), but it won't be formatted and will be difficult to read.
However, it does look like a copy of the table's html is commented out adjacent to where it is located when rendered. Which means you can do this:
In [1]: import re
In [2]: pat = re.compile(r'<!--(.*?)-->', flags=re.DOTALL)
In [3]: text = response.xpath("//div[@id='all_four_factors']//comment()").get()
In [4]: selector = scrapy.Selector(text=pat.findall(text)[0])
In [5]: result = selector.xpath('//tbody/tr[1]/td[1]')
In [6]: result
Out[6]: [<Selector xpath='//tbody/tr[1]/td[1]' data='<td class="right " data-stat="pace">1...'>]
In [7]: result[0].xpath('./text()').get()
Out[7]: '112.8'
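The same pattern can be packaged into a small helper for use inside a spider callback. A minimal sketch, assuming scrapy is importable and that the wrapper div id (here 'all_four_factors') is passed in by the caller; the function name is just an illustration:
import re
import scrapy

COMMENT_RE = re.compile(r'<!--(.*?)-->', flags=re.DOTALL)

def table_from_comment(response, wrapper_id):
    # The rendered table ships inside an HTML comment in the wrapper div,
    # so grab the comment node and re-parse its contents as a new selector.
    comment = response.xpath("//div[@id='%s']//comment()" % wrapper_id).get()
    return scrapy.Selector(text=COMMENT_RE.findall(comment)[0])

# e.g. table_from_comment(response, 'all_four_factors')
#          .xpath('//tbody/tr[1]/td[1]/text()').get()  ->  '112.8'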

Related

xpath text() returns "None" when the tag is @href

I'm trying to extract text contained within HTML tags in order to build a python defaultdict.
To accomplish this I need to clean out all xpath and/or HTML data and get just the text, which I can accomplish with /text(), unless it's an href.
How I scrape the items:
for item in response.xpath(
        "//*[self::h3 or self::p or self::strong or self::a[@href]]"):
How it looks if I print the above, without extraction attempts:
<Selector xpath='//*[self::h3 or self::p or self::a[@href]]' data='<h3> Some text here ...'>
<Selector xpath='//*[self::h3 or self::p or self::a[@href]]' data='<a href="https://some.url.com...'>
I want to extract "Some text here" and "https://some.url.com"
How I try to extract the text:
item = item.xpath("./text()").get()
print(item)
The result:
Some text here
None
"None" is where I would expect to see: https://some.url.com, after trying various methods suggested online, I cannot get this to work.
Try using this line to extract either the text or the @href:
item = item.xpath("./text() | ./@href").get()
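Applied to the loop from the question, that looks like this (a sketch reusing the question's selector):
for item in response.xpath(
        "//*[self::h3 or self::p or self::strong or self::a[@href]]"):
    # text() covers h3/p/strong; @href covers anchors, where the useful
    # value is the attribute rather than a text node.
    print(item.xpath("./text() | ./@href").get())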

extract text from nested div using xpath

I would like to get the text inside the h2 tag
<p>Mi. 5. Dezember 2018</p>
<h2>Slam: Jägerschlacht</h2>
<p>Einlass 19:30 Uhr // Beginn 20:30 Uhr</p>
<p>Tickets: 4€</p>
out of this page with xpath. The problem is I can't find the right xpath through all the divs. All I get when I use this python code
from lxml import html
import requests
page = requests.get("https://www.gruener-jaeger-stpauli.de/")
tree = html.fromstring(page.content)
text = tree.xpath("/html/body/div/div/div/div/div/div/div[1]/div/div[2]/div/div/div[1]/div/a[1]/h2")
print(text)
is [<Element h2 at 0x25ae6341a98>]
It is better to use a handwritten XPath instead of a generated path.
Try it like this to get the first h2 element (selecting all its text-node children using /text()):
//a[contains(@class, 'event_box_gj')][1]/h2/text()
or drop the [1] to get all of them.
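With the lxml setup from the question, that would look roughly like this (a sketch; the class name event_box_gj is taken from the page markup the answer refers to):
from lxml import html
import requests

page = requests.get("https://www.gruener-jaeger-stpauli.de/")
tree = html.fromstring(page.content)

# first matching headline
first = tree.xpath("//a[contains(@class, 'event_box_gj')][1]/h2/text()")
# all headlines
titles = tree.xpath("//a[contains(@class, 'event_box_gj')]/h2/text()")
print(first, titles)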

Unsure why xpath query coming back empty

I'm new to scrapy and am experimenting with the shell, attempting to retrieve products from the following URL: https://www.newbalance.co.nz/men/shoes/running/trail/?prefn1=sizeRefinement&prefv1=men_shoes_7
The following is my code - I'm not sure why the final query is coming back empty:
$ scrapy shell
fetch("https://www.newbalance.co.nz/men/shoes/running/trail/?prefn1=sizeRefinement&prefv1=men_shoes_7")
div_product_lists = response.xpath('//div[@id="product-lists"]')
ul_product_list_main = div_product_lists.xpath('//ul[@id="product-list-main"]')
for li_tile in ul_product_list_main.xpath('//li[@class="tile"]'):
...     print li_tile.xpath('//div[@class="product"]').extract()
...
[]
[]
If I inspect the page using a property inspector, I can see data for the div (with class product), so I'm not sure why this is coming back empty. Any help would be appreciated.
The problem here is that xpath doesn't understand that class="product product-tile " means "This element has 2 classes, product and product-tile".
In an xpath selector, the class attribute is just a string, like any other.
Knowing that, you can search for the entire class string:
>>> li_tile.xpath('.//div[@class="product product-tile "]')
[<Selector xpath='.//div[@class="product product-tile "]' data='<div class="product product-tile " tabin'>]
If you want to find all the elements that have the "product" class, the easiest way to do so is using a css selector:
>>> li_tile.css('div.product')
[<Selector xpath="descendant-or-self::div[#class and contains(concat(' ', normalize-space(#class), ' '), ' product ')]" data='<div class="product product-tile " tabin'>]
As you can see by looking at the resulting Selector, this is a bit more complicated to achieve using only xpath.
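If you do want to stay in pure XPath, you can reuse the class test that Scrapy generates for that CSS selector (a sketch based on the generated expression shown above):
li_tile.xpath(
    ".//div[contains(concat(' ', normalize-space(@class), ' '), ' product ')]")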
The data you want to extract is more readily available in another div having class product-top-spacer.
For instance, you can get all divs having class="product-top-spacer" with the following:
ts = response.xpath('//div[@class="product-top-spacer"]')
and check the first extracted div's item name and price:
ts[0].xpath('descendant::p[@class="product-name"]/a/text()').extract()[0]
>> 'Leadville v3'
ts[0].xpath('descendant::div[@class="product-pricing"]/text()').extract()[0].strip()
>> '$260.00'
All items can be viewed by iterating over ts:
for t in ts:
    itname = t.xpath('descendant::p[@class="product-name"]/a/text()').extract()[0]
    itprice = t.xpath('descendant::div[@class="product-pricing"]/text()').extract()[0].strip()
    itprice = ' '.join(itprice.split())  # some cleaning
    print(itname + ", " + itprice)

XPath - Nested path scraping

I'm trying to perform HTML scraping of a webpage. I'd like to fetch the three alternate texts (the alt attributes) from the three img elements.
I'm using the following code to extract the whole img element of slide-1.
from lxml import html
import requests
page = requests.get('sample.html')
tree = html.fromstring(page.content)
text_val = tree.xpath('//a[class="cover-wrapper"][id = "slide-1"]/text()')
print text_val
I'm not getting the alternate text values displayed. But it is an empty list.
This is one possible XPath:
//div[@id='slide-1']/a[@class='cover-wrapper']/img/@alt
Explanation:
//div[@id='slide-1'] : this part finds the target <div> element by comparing the id attribute value. Notice the @attribute_name syntax used to reference an attribute in XPath. Omitting the @ symbol would change the meaning of the selector to reference a child element with the same name, instead of an attribute.
/a[@class='cover-wrapper'] : from each <div> element found by the previous part of the XPath, find the child <a> element whose class attribute equals 'cover-wrapper'.
/img/@alt : then from each such <a> element, find the child <img> element and return its alt attribute.
You might want to change the id filter to starts-with(@id, 'slide-') if you mean to return all 3 alt attributes.
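Plugged into the lxml code from the question, that becomes (a sketch; 'sample.html' is the question's placeholder for the real page):
from lxml import html
import requests

page = requests.get('sample.html')  # placeholder URL from the question
tree = html.fromstring(page.content)

# alt text for slide-1 only
alts = tree.xpath("//div[@id='slide-1']/a[@class='cover-wrapper']/img/@alt")
# or all three slides at once, per the starts-with variant
all_alts = tree.xpath("//div[starts-with(@id, 'slide-')]/a[@class='cover-wrapper']/img/@alt")
print(alts, all_alts)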
Try this:
//a[@class="cover-wrapper"]/img/@alt
So, I am first selecting the a node having class cover-wrapper, then the img node, and then the alt attribute of img.
To find the whole image element:
//a[@class="cover-wrapper"]
I think you want:
//div[@class="showcase-wrapper"][@id="slide-1"]/a/img/@alt

scrapy xpath solution for xml with type=html and html entities

I am scraping an atom feed (xml). One of the tags says:
<content type="html">
<p&gt Some text and stuff </p&gt
</content>
Also i see the same html entities for img and a tags.
Is there a generic xpath to find the img tag or the p tag like this:
//content/p or //content/img/#src
But obviously this does not work with these html entities. Or maybe an other solution with scrapy?
I think you need to extract the content text elements and, for each one, parse the HTML content using lxml.html:
import lxml.etree
import lxml.html

xmlfeed = lxml.etree.fromstring(xmlfeedstring)
for content in xmlfeed.xpath('//content[@type="html"]/text()'):
    htmlcontent = lxml.html.fragment_fromstring(content)
    paragraphs = htmlcontent.xpath('//p')
    image_urls = htmlcontent.xpath('//img/@src')
See Parsing HTML fragments from lxml documentation.
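Since the question asks about a Scrapy-only solution: the same two-step parse also works with Scrapy's Selector directly. A minimal sketch, assuming xmlfeedstring holds the raw feed; namespace handling for real Atom feeds is omitted (remove_namespaces() may be needed):
import scrapy

# Parse the feed itself as XML ...
feed = scrapy.Selector(text=xmlfeedstring, type="xml")
for content in feed.xpath('//content[@type="html"]/text()').getall():
    # ... then re-parse each payload, now unescaped by the XML parser, as HTML.
    fragment = scrapy.Selector(text=content, type="html")
    paragraphs = fragment.xpath('//p/text()').getall()
    image_urls = fragment.xpath('//img/@src').getall()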
