Scrapy failed to find XPath that Nokogiri found - Ruby

I have recently started working on a website that needs to crawl products from several stores/sites...
I am fairly new to Python and Scrapy, in which the original code was written, so when testing crawlers and XPaths I use Scrapy and also open another console to test with Nokogiri (a Ruby gem).
On one particular site I failed to extract some content using Scrapy, but found that I can get that content from the same URL, using the same XPath, with Nokogiri.
Here is the code snippet used in both cases:
Scrapy
yield Request(product_url,
              headers={'User-Agent': 'curl/7.22.0 (x86_64-pc-linux-gnu) libcurl/7.22.0 OpenSSL/1.0.1 zlib/1.2.3.4 libidn/1.23 librtmp/2.3'},
              callback=self.parse_item)

def parse_item(self, response):
    script = response.xpath('//script[contains(text(),"var ProductViewJSON")]')
    yield {
        'url': response.url,
        'script length': len(script),
        'script': script,
    }
It produces the following result:
{"url": "http://www.pullandbear.com/eg/en/man/accessories/pack-of-3-assorted-bracelets-c29537p100036212.html", "script length": 0, "script": []},
Nokogiri
require 'nokogiri'
require 'open-uri'
html_data = open('http://www.pullandbear.com/eg/en/man/accessories/pack-of-3-assorted-bracelets-c29537p100036212.html', 'User-Agent' => 'curl/7.22.0 (x86_64-pc-linux-gnu) libcurl/7.22.0 OpenSSL/1.0.1 zlib/1.2.3.4 libidn/1.23 librtmp/2.3').read
nokogiri_object = Nokogiri::HTML(html_data)
script = nokogiri_object.xpath('//script[contains(text(),"var ProductViewJSON")]')
script.length # produces 1
Can anybody help me explain this? Please note that this Scrapy code used to run; I was only recently told that it had stopped, and the main problem was the need to add the headers.
I hope I was clear enough, thanks for your interest :)
Edit
I've tried parsing the URL from the Scrapy shell, using the same User-Agent as the spider's request and as Nokogiri's, and it worked: it found the element matching the XPath. But it still doesn't work from within the spider...

The cause of this is the User-Agent you are using.
I tried the site with a plain scrapy shell (using the default User-Agent) and got the following response:
>>> response.body
'<HTML><HEAD>\n<TITLE>Access Denied</TITLE>\n</HEAD><BODY>\n<H1>Access Denied</H1>\n \nYou don\'t have permission to access "http://www.pullandbear.com/eg/en/man/accessories/pack-of-3-assorted-bracelets-c29537p100036212.html" on this server.<P>\nReference #18.3f496768.1453197808.1ef09a53\n</BODY>\n</HTML>\n'
As you can see, the server returns an Access Denied page for User-Agents that are not a browser -- just like your cURL agent.
So change the User-Agent in your Request (or set it once through Scrapy's settings) and you should be ready to gather your information.
If I start the shell with the following User-Agent:
'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.111 Safari/537.36'
and execute your XPath, I get the following result:
>>> response.xpath('//script[contains(text(),"var ProductViewJSON")]')
[<Selector xpath='//script[contains(text(),"var ProductViewJSON")]' data=u'<script type="text/javascript">\r\n\tvar Pr'>]
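For reference, one way to reproduce this yourself (assuming a reasonably recent Scrapy) is to override the user agent when starting the shell; the -s flag sets a Scrapy setting for that session, and putting the same USER_AGENT value into the project's settings.py applies it to every request the spider makes:
$ scrapy shell -s USER_AGENT='Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.111 Safari/537.36' 'http://www.pullandbear.com/eg/en/man/accessories/pack-of-3-assorted-bracelets-c29537p100036212.html'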

Related

XPath expression returns empty output

My xidel command is the following:
xidel "https://www.iec-iab.be/nl/contactgegevens/c360afae-29a4-dd11-96ed-005056bd424d" -e '//div[#class="consulentdetail"]'
This should extract all the data in the divs with class consulentdetail.
Nothing special, I thought, but it won't print anything.
Can anyone help me find my mistake?
//EDIT: When I use the same expression in Firefox it finds the desired tags
The site you are connecting to obviously checks the user agent string and delivers different pages depending on the user agent string it is sent.
If you instruct xidel to send a user agent string impersonating, e.g., Firefox on Windows 10, your query starts to work:
> ./xidel --silent --user-agent="Mozilla/5.0 (Windows NT 10.0; WOW64; rv:49.0) Gecko/20100101 Firefox/49.0" "http://www.iec-iab.be/nl/contactgegevens/c360afae-29a4-dd11-96ed-005056bd424d" -e '//div[@class="consulentdetail"]'
Lidnummer11484 2 N 73
TitelAccountant, Belastingconsulent
TaalNederlands
Accountant sinds4/04/2005
Belastingconsulent sinds4/04/2005
AdresStationsstraat 2419550 HERZELE
Telefoon+32 (53) 41.97.02
Fax+32 (53) 41.97.03
AdresStationsstraat 2419550 HERZELE
Telefoon+32 (53) 41.97.02
Fax+32 (53) 41.97.03
GSM+32 (474) 29.00.67
Websitehttp://abbeloosschinkels.be
E-mail
<!--
document.write("");document.write(decrypt(unescCtrlCh("5yÿÃ^à(pñ_!13!­[îøû!13!5ãév¦Ãçj|°W"),"Iate1milrve%ster"));document.write("");
-->
As a rule of thumb, when doing Web scraping and getting weird results:
Check the page in a browser with Javascript disabled.
Send a user agent string simulating a Web browser.

Ruby Curb not following redirects

I'm using Curb to get various URLs, and if the response is 200, I get what I need. However, if the response is a redirect, Curb doesn't seem to follow the redirects, even though I ask it to - e.g:
easy = Curl::Easy.new
easy.follow_location = true
easy.max_redirects = 3
easy.url = "http://stats.berr.gov.uk/ed/vat/VATStatsTables2a2d2007.xls"
easy.perform
=> Curl::Err::GotNothingError: Curl::Err::GotNothingError
from /Users/stuart/.rvm/gems/ruby-2.0.0-p0@datakitten/gems/curb-0.8.4/lib/curl/easy.rb:60:in `perform'
However, if I do curl -L http://stats.berr.gov.uk/ed/vat/VATStatsTables2a2d2007.xls on the command line, I get the expected response. What am I doing wrong?
It sounds as if this server returns an empty reply[1] if you do not provide a user agent.
To solve your problem, just set one:
...
easy.useragent = "curb"
easy.perform
[1]: curl -A '' -L http://stats.berr.gov.uk/... gives (52) Empty reply from server.
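Putting the two pieces together, a minimal sketch of the full request could look like this (the final 200 after the redirect is an assumption based on the curl -L behaviour above):
require 'curb'

easy = Curl::Easy.new
easy.follow_location = true
easy.max_redirects = 3
easy.useragent = "curb"   # any non-empty user agent string avoids the empty reply
easy.url = "http://stats.berr.gov.uk/ed/vat/VATStatsTables2a2d2007.xls"
easy.perform
puts easy.response_code   # expected: 200 once the redirect has been followed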

Using cURL to replicate browser request

I'm trying to use cURL to get data from a URL of this form:
http://example.com/site-explorer/get_overview_text_data.php?data_type=refdomains_stats&hash=19a53c6b9aab3917d8bed5554000c7cb
which needs a cookie, so I first store it in a file:
curl -c cookie-jar http://example.com/site-explorer/overview/subdomains/example.com
Trying curl with these values:
curl -b cookie-jar -A "Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)" --referer "http://example.com/site-explorer/overview/subdomains/example.com" http://example.com/site-explorer/get_overview_text_data.php?data_type=refdomains_stats&hash=19a53c6b9aab3917d8bed5554000c7cb
There is one problem which leaps out at me: You aren't quoting the URL, which means that characters such as & and ? will be interpreted by the shell instead of getting passed to curl. If you're using a totally static URL, enclose it in single quotes, as in 'http://blah.com/blah/blah...'.
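For example, the same request with the URL protected from the shell and everything else unchanged:
curl -b cookie-jar -A "Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)" --referer "http://example.com/site-explorer/overview/subdomains/example.com" 'http://example.com/site-explorer/get_overview_text_data.php?data_type=refdomains_stats&hash=19a53c6b9aab3917d8bed5554000c7cb'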

How to scrape a _private_ google group?

I'd like to scrape the discussion list of a private Google group. It's a multi-page list and I might have to do this again later, so scripting sounds like the way to go.
Since this is a private group, I need to log in to my Google account first.
Unfortunately I can't manage to log in using wget or Ruby's Net::HTTP. Surprisingly, Google Groups is not accessible with the ClientLogin interface, so all the code samples are useless.
My Ruby script is embedded at the end of the post. The response to the authentication query is a 200 OK, but there are no cookies in the response headers and the body contains the message "Your browser's cookie functionality is turned off. Please turn it on."
I got the same output with wget. See the bash script at the end of this message.
I don't know how to work around this. Am I missing something? Any ideas?
Thanks in advance.
John
Here is the ruby script:
# a ruby script
require 'net/https'
http = Net::HTTP.new('www.google.com', 443)
http.use_ssl = true
path = '/accounts/ServiceLoginAuth'
email='john@gmail.com'
password='topsecret'
# form inputs from the login page
data = "Email=#{email}&Passwd=#{password}&dsh=7379491738180116079&GALX=irvvmW0Z-zI"
headers = { 'Content-Type' => 'application/x-www-form-urlencoded',
'user-agent' => "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/533.2 (KHTML, like Gecko) Chrome/6.0"}
# Post the request and print out the response to retrieve our authentication token
resp, data = http.post(path, data, headers)
puts resp
resp.each {|h, v| puts h+'='+v}
#warning: peer certificate won't be verified in this SSL session
Here is the bash script:
# A bash script for wget
CMD=""
CMD="$CMD --keep-session-cookies --save-cookies cookies.tmp"
CMD="$CMD --no-check-certificate"
CMD="$CMD --post-data='Email=john#gmail.com&Passwd=topsecret&dsh=-8408553335275857936&GALX=irvvmW0Z-zI'"
CMD="$CMD --user-agent='Mozilla'"
CMD="$CMD https://www.google.com/accounts/ServiceLoginAuth"
echo $CMD
wget $CMD
wget --load-cookies="cookies.tmp" http://groups.google.com/group/mygroup/topics?tsc=2
Have you tried Mechanize for Ruby?
The Mechanize library is used for automating interaction with websites; you could log in to Google and browse your private Google group, saving what you need.
Here is an example where Mechanize is used for Gmail scraping.
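A rough sketch of what that could look like is below; the login URL and the Email/Passwd field names are taken from the question's script, but Google has changed its login flow since, so treat this only as a starting point:
require 'mechanize'

agent = Mechanize.new
agent.user_agent_alias = 'Windows Mozilla'   # send a browser-like user agent

# Fetch the login page, fill in the form and submit it (field names assumed)
page = agent.get('https://www.google.com/accounts/ServiceLogin')
form = page.forms.first
form['Email']  = 'john@gmail.com'
form['Passwd'] = 'topsecret'
agent.submit(form, form.buttons.first)

# The agent keeps the session cookies, so the private group should now be reachable
topics = agent.get('http://groups.google.com/group/mygroup/topics?tsc=2')
puts topics.title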
I did this previously by logging in manually with Firefox and then used Chickenfoot to automate browsing and scraping.
Found this PHP Solution to scraping private Google Groups.

Writing a simple webserver in Ruby

I want to create an extremely simple web server for development purposes in Ruby (no, I don't want to use ready-made solutions).
Here is the code:
#!/usr/bin/ruby
require 'socket'
server = TCPServer.new('127.0.0.1', 8080)
while connection = server.accept
  headers = []
  length = 0
  while line = connection.gets
    headers << line
    if line =~ /^Content-Length:\s+(\d+)/i
      length = $1.to_i
    end
    break if line == "\r\n"
  end
  body = connection.readpartial(length)
  IO.popen(ARGV[0], 'r+') do |script|
    script.print(headers.join + body)
    script.close_write
    connection.print script.read
  end
  connection.close
end
The idea is to run this script from the command line, providing another script which will receive the request on its standard input and give back the complete response on its standard output.
So far so good, but this turns out to be really fragile, as it breaks on the second request with the error:
/usr/bin/serve:24:in `write': Broken pipe (Errno::EPIPE)
from /usr/bin/serve:24:in `print'
from /usr/bin/serve:24
from /usr/bin/serve:23:in `popen'
from /usr/bin/serve:23
Any idea how to improve the above code to be sufficient for easy use?
Versions: Ubuntu 9.10 (2.6.31-20-generic), Ruby 1.8.7 (2009-06-12 patchlevel 174) [i486-linux]
The problem appears to be in the child script, since the parent script in your question runs on my box (Debian Squeeze, Ruby 1.8.7 patchlevel 249):
I created the dummy child script bar.rb:
#!/usr/bin/ruby1.8
s = $stdin.read
$stderr.puts s
print s
I then ran your script, passing it the path to the dummy script:
$ /tmp/foo.rb /tmp/bar.rb
Then I hit it with wget:
$ wget localhost:8080/index
And saw the dummy script's output:
GET /index HTTP/1.0^M
User-Agent: Wget/1.12 (linux-gnu)^M
Accept: */*^M
Host: localhost:8080^M
Connection: Keep-Alive^M
^M
I also saw that wget received what it sent:
$ cat index
GET /index HTTP/1.0
User-Agent: Wget/1.12 (linux-gnu)
Accept: */*
Host: localhost:8080
Connection: Keep-Alive
It worked the same no matter how many times I hit it with wget.
The Ruby Web Servers Booklet describes most web server implementation strategies.
With Ruby's WEBrick library you have an easy way to build a web server:
http://www.ruby-doc.org/stdlib/libdoc/webrick/rdoc/
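As a minimal sketch (Ruby 1.8-era syntax; the path and response body are placeholders), a WEBrick server that answers every request with a plain-text body could look like this:
require 'webrick'

server = WEBrick::HTTPServer.new(:Port => 8080)

# Answer every request under / with a fixed plain-text body
server.mount_proc('/') do |request, response|
  response['Content-Type'] = 'text/plain'
  response.body = "Hello from WEBrick\n"
end

trap('INT') { server.shutdown }   # shut down cleanly on Ctrl-C
server.start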

Resources