How to capture a number of divs/tables at once - xpath

I am using Scrapy with the following url:
http://www.marzetti.com/products/marzetti/detail.php?bc=35&cid=2&pid=1101&i=pl
I need to capture in the same scrapy item, the following:
/html/body/div/div[2]/table/tbody/tr[3]/td[4]/table/tbody/tr/td/div[2]/table/tbody/tr[2]/td[2]/div[4]
/html/body/div/div[2]/table/tbody/tr[3]/td[4]/table/tbody/tr/td/div[2]/table/tbody/tr[2]/td[2]/div[5]
So here's my code snippet:
hxs = HtmlXPathSelector(response)
sites = hxs.select('/html/body/div/div[2]/table/tr[3]/td[4]/table/tr')
items = []
for site in sites:
    ..............
    item['description'] = site.select('td/div[2]/table/tr[2]/td[2]/div[4] or div[5]//text()').extract()
However, this returns a Boolean answer such as 'description = True', whereas what I need is the actual text within the two divs.
Any suggestions welcome. Thanks.
-TM

Use the standard XPath union operator | :
(td/div[2]/table/tr[2]/td[2]/div[4] | div[5])//text()
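Applied to the snippet above, that looks something like the following (a sketch only: the surrounding loop comes from the question, the item class is a hypothetical placeholder, and the union spells out both sibling divs explicitly):

hxs = HtmlXPathSelector(response)
sites = hxs.select('/html/body/div/div[2]/table/tr[3]/td[4]/table/tr')
items = []
for site in sites:
    item = ProductItem()  # hypothetical item class; use your own
    # Union the two sibling divs, then take all of their text nodes
    item['description'] = site.select(
        '(td/div[2]/table/tr[2]/td[2]/div[4] | td/div[2]/table/tr[2]/td[2]/div[5])//text()'
    ).extract()
    items.append(item)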

Related

How to get the value of an <a> tag that has no class or id in Html Agility Pack?

I am trying to get the text value of this <a> tag:
<a href="item?id=22513425">67 comments</a>
So I'm trying to get '67' from this; however, there are no defining classes or ids.
i've managed to get this far:
IEnumerable<HtmlNode> commentsNode = htmlDoc.DocumentNode.Descendants(0).Where(n => n.HasClass("subtext"));
var storyComments = commentsNode.Select(n => n.SelectSingleNode("//a[3]")).ToList();
This only gives me "comments", annoyingly enough.
I can't use the href id, as there are many of these items, so I can't hardcode the href.
How can I extract the number as well?
Just use the @href attribute and a dedicated string function:
substring-before(//a[@href="item?id=22513425"],"comments")
returns 67.
EDIT: Since you can't hardcode all the content of @href, maybe you can use starts-with (an XPath 1.0 solution).
Shortest form (the text has to contain "comments"):
substring-before(//a[starts-with(@href,"item?") and text()[contains(.,"comments")]],"c")
More restrictive (the text has to end with "comments"):
substring-before(//a[starts-with(@href,"item?")][substring(//a, string-length(//a) - string-length('comments')+1) = 'comments'],"c")
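Both expressions are plain XPath 1.0, so they can be checked outside Html Agility Pack. A quick sketch with Python's lxml (the one-line document below is a stand-in for the real page):

from lxml import html

# Stand-in document; the href value comes from the answer above
doc = html.fromstring('<a href="item?id=22513425">67 comments</a>')

# A string-valued XPath expression is returned directly as a Python string
result = doc.xpath('substring-before(//a[starts-with(@href,"item?") and text()[contains(.,"comments")]],"c")')
print(result)  # '67 ' - note the trailing space before "comments"; strip() if needed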
I am using the ScrapySharp nuget package, which adds the CssSelect extension used in my sample below. (It's possible HtmlAgilityPack offers the same functionality built in; I am just used to ScrapySharp from years ago.)
var doc = new HtmlDocument();
doc.Load(@"C:\desktop\anchor.html"); //I created an html file with your <a> element as the body
var anchor = doc.DocumentNode.CssSelect("a").FirstOrDefault();
if (anchor == null) return;
var digits = anchor.InnerText.ToCharArray().Where(c => Char.IsDigit(c));
Console.WriteLine($"anchor text: {anchor.InnerText} - digits only: {new string(digits.ToArray())}");
Output:
anchor text: 67 comments - digits only: 67

Scrape Instagram Web Hashtag Posts

I'm trying to scrape the number of posts to a given hashtag (#castles) and populate a Google Sheet cell using ImportXML.
I tried copying the Xpath from Chrome and paste it to the ImportXML parameter in the cell like this:
=ImportXML("https://www.instagram.com/explore/tags/castels/", "//*[@id="react-root"]/section/main/header/div[2]/div/div[2]/span/span")
I saw there is a problem with the quotation marks so I also tried:
=ImportXML("https://www.instagram.com/explore/tags/castels/", "//*[@id='react-root']/section/main/header/div[2]/div/div[2]/span/span")
Nevertheless, both return an error.
What am I doing wrong?
P.S. I am aware of the XPath to the meta tag description, "//meta[@name='description']/@content"; however, I would like to scrape the exact number of posts and not an abbreviated number.
Try this -
function hashCount() {
  var url = 'https://www.instagram.com/explore/tags/cats/'; // UrlFetchApp needs a full URL including the protocol
  var response = UrlFetchApp.fetch(url, {muteHttpExceptions: true}).getContentText();
  var regex = /(edge_hashtag_to_media":{"count":)(\d+)(,"page_info":)/gm;
  var count = regex.exec(response)[2];
  Logger.log(count);
}
I've added muteHttpExceptions: true which was not added in my comment above. Hope this helps.
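For comparison, the same regex idea sketched in Python with requests (this assumes Instagram still embeds the count in the page source under edge_hashtag_to_media; its internals change often, so treat this as illustrative only):

import re
import requests

# Fetch the raw page HTML; the JSON blob with the count is embedded in it
response = requests.get('https://www.instagram.com/explore/tags/cats/')

# Same pattern as the Apps Script version above
match = re.search(r'edge_hashtag_to_media":\{"count":(\d+),"page_info":', response.text)
if match:
    print(match.group(1))  # total number of posts for the hashtag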

Xpath is correct but no result after scraping

I am trying to crawl all the names of the cities from the following page:
https://www.zomato.com/directory
I have tried the following XPath approaches.
#1st approach:
def parse(self,response):
    cities_name = response.xpath('//div//h2//a/text()').extract_first()
    items['cities_name'] = cities_name
    yield items

#2nd approach:
def parse(self,response):
    for city in response.xpath("//div[@class='col-l-5 col-s-8 item pt0 pb5 ml0']"):
        l = ItemLoader(item = CountryItem(),selector = city)
        l.add_xpath("cities_name",".//h2//a/text()")
        yield l.load_item()
        yield city
Actual result: Crawl 0 pages and scrape 0 items
Expected: Adelaide, Ballarat etc
First thing to note:
Your XPath is a bit too specific. CSS classes in HTML don't always have a reliable order: class1 class2 could end up as class2 class1, or even have some broken syntax involved, like trailing spaces ("class1 class2 ").
When you match your XPath directly against [@class="class1 class2"], there's a high chance it will fail. Instead, you should use the contains() function.
Second:
You have a tiny error in your cities_name XPath. In the HTML body it's a > h2 > text, but in your code it's reversed: h2 > a > text.
So that being said I managed to get it working with these css and xpath selectors:
$ parsel "https://www.zomato.com/directory"
> p.mb10>a>h2::text +first
Adelaide
> p.mb10>a>h2::text +len
736
> -xpath
switched to xpath
> //p[contains(@class,"mb10")]/a/h2/text() +first
Adelaide
> //p[contains(@class,"mb10")]/a/h2/text() +len
736
parselcli - https://github.com/Granitosaurus/parsel-cli
Your XPath is wrong:
def parse(self,response):
    for city_node in response.xpath("//h2"):
        l = ItemLoader(item = CountryItem(), selector = city_node)
        l.add_xpath("city_name", ".//a/text()")
        yield l.load_item()
The main reason you are not getting any results from that page is that its HTML elements are not well-formed. You can get the results using the html5lib parser; I tried different parsers, and that one did the trick. The following is how you can do it (I used a CSS selector, though).
import scrapy
from bs4 import BeautifulSoup

class ZomatoSpider(scrapy.Spider):
    name = "zomato"
    start_urls = ['https://www.zomato.com/directory']

    def parse(self, response):
        soup = BeautifulSoup(response.text, 'html5lib')
        for item in soup.select(".row h2 > a"):
            yield {"name": item.text}

Scrapy Xpath with text() contains

I'm using scrapy, and I'm trying to look for a span that contains a specific text. I have:
response.selector.xpath('//*[@class="ParamText"]/span/node()')
which returns:
[<Selector xpath='//*[@class="ParamText"]/span/text()' data=u' MILES STODOLINK'>,
<Selector xpath='//*[@class="ParamText"]/span/text()' data=u'C'>,
<Selector xpath='//*[@class="ParamText"]/span/text()' data=u' MILES STODOLINK'>]
However when I run:
>>> response.selector.xpath('//*[@class="ParamText"]/span[contains(text(),"STODOLINK")]')
Out[11]: []
Why does the contains function not work?
contains() cannot evaluate multiple nodes at once:
/span[contains(text(),"STODOLINK")]
So, if there are multiple text nodes within the span and "STODOLINK" isn't located in the first text node child of the span, then contains() in the above expression won't work. You should apply the contains() check to individual text nodes instead, as follows:
//*[@class="ParamText"]/span[text()[contains(.,"STODOLINK")]]
Or, if "STODOLINK" isn't necessarily located directly within the span (it can be nested within another element in the span), then you can simply use . instead of text():
//*[@class="ParamText"]/span[contains(.,"STODOLINK")]
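To see the difference concretely, here is a small sketch where the span's text is split into two text nodes by a child element, so "STODOLINK" is not in the first one (the markup is made up for illustration):

import scrapy

# "STODOLINK" lives in the *second* text node of the span
example = '<div class="ParamText"><span>MILES <b>-</b> STODOLINK</span></div>'
sel = scrapy.Selector(text=example)

# contains(text(), ...) only tests the first text node -> no match
print(sel.xpath('//*[@class="ParamText"]/span[contains(text(),"STODOLINK")]').extract())   # []

# testing each text node individually -> matches
print(sel.xpath('//*[@class="ParamText"]/span[text()[contains(.,"STODOLINK")]]').extract())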
In my terminal (assuming my example is identical to your file though) your code works:
Input
import scrapy
example='<div class="ParamText"><span>STODOLINK</span></div>'
scrapy.Selector(text=example).xpath('//*[@class="ParamText"]/span[contains(text(),"STODOLINK")]').extract()
Output:
['<span>STODOLINK</span>']
Can you clarify what might be different?
I use Scrapy with BeautifulSoup 4.0. IMO, Soup is easy to read and understand. This is an option if you don't have to use HtmlXPathSelector. Below is an example of finding all links; you can replace 'a' with 'span'. Hope this helps!
import scrapy
from bs4 import BeautifulSoup
from myproject.items import Item  # replace with your project's actual items module

def parse(self, response):
    soup = BeautifulSoup(response.body,'html.parser')
    print 'Current url: %s' % response.url
    item = Item()
    for link in soup.find_all('a'):
        if link.get('href') is not None:
            url = response.urljoin(link.get('href'))
            item['url'] = url
            yield scrapy.Request(url,callback=self.parse)
            yield item

Using mechanize to check for div with similar but different names

Currently I'm doing the following:
if firstTemp == true
  total = doc.xpath("//div[@class='pricing condense']").text
else
  total = doc.xpath("//div[@class='pricing ']").text
end
I'm wondering: is there any way I can get Mechanize to automatically fetch divs whose class contains the string "pricing"?
Is doc a Mechanize::Page? Usually the convention is page for those and doc for a Nokogiri::HTML::Document. Anyway, for either one, try:
doc.search('div.pricing')
For just the first one, use at instead of search:
doc.at('div.pricing')
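If you'd rather match on the raw class string, the equivalent XPath is //div[contains(@class,"pricing")], which Nokogiri also accepts via doc.xpath. Since the expression is library-agnostic, here is a quick sketch of it using Python's lxml (the two divs mimic the classes from the question):

from lxml import html

# Both class variants from the question
doc = html.fromstring(
    '<html><body>'
    '<div class="pricing condense">$10</div>'
    '<div class="pricing ">$20</div>'
    '</body></html>')

# contains(@class, ...) matches any div whose class attribute includes "pricing"
for div in doc.xpath('//div[contains(@class,"pricing")]'):
    print(div.text)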
