scrapy xpath solution for xml with type=html and html entities

I am scraping an Atom feed (XML). One of the tags says:
<content type="html">
&lt;p&gt; Some text and stuff &lt;/p&gt;
</content>
I also see the same HTML entities for img and a tags.
Is there a generic XPath to find the img tag or the p tag, like this:
//content/p or //content/img/@src
But obviously this does not work with these HTML entities. Or is there another solution with Scrapy?

I think you need to extract the content text elements and, for each one, parse the HTML content using lxml.html:
import lxml.etree
import lxml.html

xmlfeed = lxml.etree.fromstring(xmlfeedstring)
for content in xmlfeed.xpath('//content[@type="html"]/text()'):
    # create_parent=True tolerates fragments with several top-level elements
    htmlcontent = lxml.html.fragment_fromstring(content, create_parent=True)
    paragraphs = htmlcontent.xpath('//p')
    image_urls = htmlcontent.xpath('//img/@src')
See Parsing HTML fragments from lxml documentation.
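To see why the entities stop being a problem after XML parsing, here is a minimal, self-contained sketch using only the standard library and a hypothetical feed string (a real Atom feed will also carry namespaces): the XML parser unescapes `&lt;` and `&gt;` while reading the document, so `text()` already yields a plain HTML string that is ready for an HTML parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical minimal feed standing in for the real Atom document
feed = (
    '<feed><entry><content type="html">'
    '&lt;p&gt;Some text and stuff&lt;/p&gt; &lt;img src="pic.png"/&gt;'
    '</content></entry></feed>'
)

root = ET.fromstring(feed)
for content in root.iter('content'):
    if content.get('type') == 'html':
        # The XML parser has already turned &lt;/&gt; back into < and >
        print(content.text)
# -> <p>Some text and stuff</p> <img src="pic.png"/>
```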


Scrapy xpath selector doesn't retrieve the element

From this URL: https://www.basketball-reference.com/boxscores/202110190LAL.html, I want to extract the text at this XPath:
//div[@id='div_four_factors']/table/tbody/tr[1]/td[1]
But the element I get is None.
In Scrapy shell I use this:
>>> text = response.xpath("//div[@id='div_four_factors']/table/tbody/tr[1]/td[1]/text()").get()
>>> print(text)
>>> None
I have tried to write the right XPath for the element I want to retrieve, but I get no result.
That table (and, it seems, every table on that page) is loaded by JavaScript after the page itself has loaded, so the XPath doesn't exist in the response HTML you are parsing.
You can see this if you open the page in a web browser, right-click, and select "View page source" or similar. Alternatively, you could just print(response.text), but it won't be formatted and will be difficult to read.
However, it does look like a copy of the table's HTML is commented out adjacent to where it is located when rendered, which means you can do this:
In [1]: import re
In [2]: pat = re.compile(r'<!--(.*?)-->', flags=re.DOTALL)
In [3]: text = response.xpath("//div[@id='all_four_factors']//comment()").get()
In [4]: selector = scrapy.Selector(text=pat.findall(text)[0])
In [5]: result = selector.xpath('//tbody/tr[1]/td[1]')
In [6]: result
Out[6]: [<Selector xpath='//tbody/tr[1]/td[1]' data='<td class="right " data-stat="pace">1...'>]
In [7]: result[0].xpath('./text()').get()
Out[7]: '112.8'
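The session above runs against the live response; the same idea can be shown offline with only the standard library, using a hypothetical trimmed-down copy of the commented-out markup (html.parser stands in for Scrapy's selector here):

```python
import re
from html.parser import HTMLParser

# Hypothetical, trimmed-down version of the commented-out table
SAMPLE = """<div id="all_four_factors"><!--
<table><tbody><tr><td class="right" data-stat="pace">112.8</td></tr></tbody></table>
--></div>"""

class FirstCell(HTMLParser):
    """Remember the text of the first <td> encountered."""
    def __init__(self):
        super().__init__()
        self.in_td = False
        self.value = None
    def handle_starttag(self, tag, attrs):
        if tag == "td" and self.value is None:
            self.in_td = True
    def handle_data(self, data):
        if self.in_td and self.value is None:
            self.value = data

# Pull the comment body out of the surrounding HTML, then parse it as HTML
comment = re.search(r"<!--(.*?)-->", SAMPLE, re.DOTALL).group(1)
parser = FirstCell()
parser.feed(comment)
print(parser.value)  # 112.8
```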

xpath text() returns "None" when the tag is @href

I'm trying to extract text contained within HTML tags in order to build a Python defaultdict.
To accomplish this I need to strip out all the markup and get just the text, which I can do with /text(), unless it's an href.
How I scrape the items:
for item in response.xpath(
        "//*[self::h3 or self::p or self::strong or self::a[@href]]"):
How it looks if I print the above, without extraction attempts:
<Selector xpath='//*[self::h3 or self::p or self::a[@href]]' data='<h3> Some text here ...'>
<Selector xpath='//*[self::h3 or self::p or self::a[@href]]' data='<a href="https://some.url.com...'>
I want to extract "Some text here" and "https://some.url.com"
How I try to extract the text:
item = item.xpath("./text()").get()
print(item)
The result:
Some text here
None
"None" is where I would expect to see https://some.url.com. After trying various methods suggested online, I cannot get this to work.
Try this line to extract either the text or the @href:
item = item.xpath("./text() | ./@href").get()
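As a quick self-contained check of that union expression (hypothetical markup; lxml stands in for Scrapy's response here, since both evaluate the same XPath): per the XPath spec, an element's attributes come before its children in document order, so for the anchor the union yields the @href first, while the h3 yields its text.

```python
from lxml import html

# Hypothetical fragment mimicking the page structure in the question
doc = html.fromstring(
    "<div><h3> Some text here </h3>"
    "<a href='https://some.url.com'>link</a></div>"
)

for item in doc.xpath("//*[self::h3 or self::a[@href]]"):
    # text() matches the h3's text; @href matches the anchor's attribute
    print(item.xpath("./text() | ./@href")[0])
# -> " Some text here " then "https://some.url.com"
```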

extract text from nested div using xpath

I would like to get the text inside the h2 tag
<p>Mi. 5. Dezember 2018</p>
<h2>Slam: Jägerschlacht</h2>
<p>Einlass 19:30 Uhr // Beginn 20:30 Uhr</p>
<p>Tickets: 4€</p>
out of this page with XPath. The problem is I can't find the right XPath through all the divs. All I get when I use this Python code
from lxml import html
import requests
page = requests.get("https://www.gruener-jaeger-stpauli.de/")
tree = html.fromstring(page.content)
text = tree.xpath("/html/body/div/div/div/div/div/div/div[1]/div/div[2]/div/div/div[1]/div/a[1]/h2")
print (text)
is [<Element h2 at 0x25ae6341a98>]
It is better to use a handwritten XPath instead of a generated path.
Try it like this to get the first h2 element (selecting all its text-node children using /text()):
//a[contains(@class, 'event_box_gj')][1]/h2/text()
or drop the [1] to get all of them.
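Since the question already uses lxml, the expression can be checked self-contained against a hypothetical trimmed-down copy of the page's markup:

```python
from lxml import html

# Hypothetical snippet standing in for the live page
snippet = """<div>
<a class="event_box_gj" href="/event/1">
  <p>Mi. 5. Dezember 2018</p>
  <h2>Slam: J&#228;gerschlacht</h2>
</a>
</div>"""

tree = html.fromstring(snippet)
print(tree.xpath("//a[contains(@class, 'event_box_gj')][1]/h2/text()"))
# -> ['Slam: J\xe4gerschlacht']
```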

How do I use Nokogiri to scrape text from an image tag?

I need to get text from a list of image tags that are formatted like this:
<img src="/images/TextImage.ashx?text=Richmond" style="border-width:0px;" class="">
When I enter the XPath into Nokogiri, I get:
[#<Nokogiri::XML::Element:0x80513954 name="img" attributes=[#<Nokogiri::XML::Attr:0x805138dc name="src" value="/images/TextImage.ashx?text=Richmond">, #<Nokogiri::XML::Attr:0x805138b4 name="style" value="border-width:0px;">]>]
Is there any way that I can tell Nokogiri to return "Richmond"? I'm looking for a method that will return the text after a certain string. If there is not a way to get only "Richmond", how do I get it to return the value?
You can extract the src attribute with an xpath expression like
src = doc.at_xpath '//img/@src'
After that, you’ll need to extract the name from the attribute, probably with a regex.
For example (this may need to be more involved, depending on what formats are possible in the src attribute in your HTML page):
/\?text=(.*)/ =~ src
puts $1

select a word in a text blob in ruby based on a pattern

I have a text blob and I would like to select URL's based on whether they have .png or .jpg. I would like to select the entire word based on a pattern.
For example in this blob:
width='17'></a> <a href='http://click.e.groupon.com/? qs=94bee0ddf93da5b3903921bfbe17116f859915d3a978c042430abbcd51be55d8df40eceba3b1c44e' style=\"text-decoration: none;\">\n<img alt='Facebook' border='0' height='18' src='http://s3.grouponcdn.com/email/images/gw-email/facebook.jpg' style='display: i
I'd like to select the image:
http://s3.grouponcdn.com/email/images/gw-email/facebook.jpg
Can I use Nokogiri on an HTML text blob?
Using Nokogiri and XPath:
frag = Nokogiri::HTML.fragment(str) # Don't construct an entire HTML document
images = frag.xpath('.//img/@src').map(&:text).grep(/\.(png|jpg|jpeg)\z/)
The XPath says:
.// — anywhere in this fragment
img — find all the <img> elements
/@src — now find the src attribute of each
Then we:
map(&:text) – convert all the Nokogiri::XML::Attr to the value of the attribute.
grep - find only those strings in the array that end with the appropriate text.
Yes, you can use Nokogiri, and you should!
Here's a simple snippet:
require "nokogiri"
str = "....your blob"
html_doc = Nokogiri::HTML(str)
html_doc.css("a").collect{|e| e.attributes["href"].value}.select{|e| e.index(".png") || e.index(".jpeg") }
If you only want to find URLs ending in .jpg or .png, a pattern like this should do it:
https?:\/\/.*?\.(?:jpg|png)
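One caveat: `.` also matches spaces, so on a blob that contains an earlier URL the lazy match can start too soon; restricting the body to non-whitespace with `\S` avoids that. A quick Python check, using a blob abbreviated from the one in the question:

```python
import re

# Abbreviated version of the question's blob: an earlier non-image URL,
# then the image URL we actually want
blob = ("<a href='http://click.e.groupon.com/?qs=94bee0dd'> "
        "<img src='http://s3.grouponcdn.com/email/images/gw-email/facebook.jpg' "
        "style='display: i")

print(re.findall(r"https?://\S*?\.(?:jpg|png)", blob))
# -> ['http://s3.grouponcdn.com/email/images/gw-email/facebook.jpg']
```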
