How to pass string in Xpath expression in Scrapy 1.2.0 - xpath

I am not able to pass Xpath Expression as a string variable in my Scrapy code. Code below:
def start_requests(self):
    urls = [
        'http://www.example.com'
    ]
    for url in urls:
        yield scrapy.Request(url=url, callback=self.parse)
def parse(self, response):
    strvar = "'//title'"
    print(strvar)
    print(response.xpath(strvar))
    print(response.xpath('//title'))
The two response.xpath(...) calls above evaluate to different XPath expressions:
Selector xpath="'//title'" ....
Selector xpath='//title' ....
I can't figure out where I am going wrong.

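You don't need the inner quotes. With them, the expression is an XPath string literal rather than a path, so it does not select the title element. Replace: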
strvar = "'//title'"
with just:
strvar = "//title"

Related

How to search a XML file using the value in ARGV[1]

I am trying to search a file using a value from the ARGV array, but doc.at is not working. I set the variable keyword to ARGV[1], and it prints to the console when given a value, but when I try to puts the variable text it comes up blank.
require 'nokogiri'
input = ARGV[0]
keyword = ARGV[1]
case input
when input = "list"
  doc = File.open("emails.xml") { |f| Nokogiri::XML(f) }
  text = doc.at('record:contains("{keyword}")')
  puts text
  puts keyword
else
  puts "no"
end
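Your string interpolation is wrong: single-quoted strings are never interpolated, and the placeholder needs a leading #.
Change it to:
doc.at("record:contains('#{keyword}')")
That is, start the string with a double quote (") and interpolate with #{}.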
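A minimal sketch of the difference, using plain strings so it runs without Nokogiri or emails.xml. The helper names (record_selector, literal_selector, command_known?) are made up for illustration. It also uses when "list", since when only needs the value to compare against; the assignment in when input = "list" overwrites input as a side effect.

```ruby
# Hypothetical helper: builds the selector string that would be passed to doc.at.
def record_selector(keyword)
  # Double quotes enable #{...} interpolation.
  "record:contains('#{keyword}')"
end

# Single quotes keep the text literally; nothing is substituted.
def literal_selector
  'record:contains("#{keyword}")'
end

# `when` takes the value to compare against, not an assignment.
def command_known?(input)
  case input
  when "list" then true
  else false
  end
end

puts record_selector("alice")   # prints record:contains('alice')
puts literal_selector           # prints record:contains("#{keyword}")
```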

Extract url params in ruby

I would like to extract parameters from a URL. I have the following path pattern:
pattern = "/foo/:foo_id/bar/:bar_id"
And an example URL:
url = "/foo/1/bar/2"
I would like to get {foo_id: 1, bar_id: 2}. I tried to convert the pattern into something like this:
"\/foo\/(?<foo_id>.*)\/bar\/(?<bar_id>.*)"
I failed on the first step, when I tried to escape the slashes in the pattern:
formatted = pattern.gsub("/", "\/")
Do you know how to fix this gsub? Or maybe you know a better way to do this.
EDIT:
It is plain Ruby. I am not using RoR.
You only need to escape slashes in a Regexp literal, e.g. /foo\/bar/. When defining a Regexp from a string it's not necessary: Regexp.new("foo/bar") produces the same Regexp as /foo\/bar/.
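To see why the gsub from the question appears to do nothing: in a double-quoted Ruby string, "\/" is the same one-character string as "/". A small sketch, with hypothetical helper names:

```ruby
# "\/" in double quotes is just "/", so this replaces "/" with "/".
def naive_escape(pattern)
  pattern.gsub("/", "\/")
end

# The block form inserts a real backslash before each slash. This is only
# needed when the text is going to be pasted into a /.../ literal.
def escape_slashes(pattern)
  pattern.gsub("/") { "\\/" }
end

# Regexp.new takes the slash as-is; no escaping is required.
def regexp_from_string
  Regexp.new("foo/bar")
end

puts naive_escape("/foo/bar")              # prints /foo/bar
puts escape_slashes("/foo/bar")            # prints \/foo\/bar
puts regexp_from_string.match?("foo/bar")  # prints true
```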
As to your larger problem, here's how I'd solve it, which I'm guessing is pretty much how you'd been planning to solve it:
PATTERN_PART_MATCH = /:(\w+)/
PATTERN_PART_REPLACE = '(?<\1>.+?)'
def pattern_to_regexp(pattern)
  expr = Regexp.escape(pattern) # just in case
               .gsub(PATTERN_PART_MATCH, PATTERN_PART_REPLACE)
  # Anchor both ends; unanchored, the final lazy .+? stops after one character
  Regexp.new("\\A#{expr}\\z")
end
pattern = "/foo/:foo_id/bar/:bar_id"
expr = pattern_to_regexp(pattern)
# => /\A\/foo\/(?<foo_id>.+?)\/bar\/(?<bar_id>.+?)\z/
str = "/foo/1/bar/2"
expr.match(str)
# => #<MatchData "/foo/1/bar/2" foo_id:"1" bar_id:"2">
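The question asked for a hash like {foo_id: 1, bar_id: 2} rather than a MatchData. One way to get there, as a sketch: the names segment_regexp and extract_params are hypothetical, and turning all-digit captures into Integers is an assumption about what the asker wants.

```ruby
# Hypothetical helper: turns "/foo/:foo_id" into an anchored Regexp where
# each :name becomes a named group matching exactly one path segment.
def segment_regexp(pattern)
  body = Regexp.escape(pattern).gsub(/:(\w+)/) { "(?<#{$1}>[^/]+)" }
  Regexp.new("\\A#{body}\\z")
end

# Returns a Hash with symbol keys, or nil when the url does not match.
# Captures made up only of digits are converted to Integer.
def extract_params(pattern, url)
  match = segment_regexp(pattern).match(url)
  return nil unless match

  pairs = match.names.zip(match.captures).map do |name, value|
    [name.to_sym, value.match?(/\A\d+\z/) ? value.to_i : value]
  end
  pairs.to_h
end

params = extract_params("/foo/:foo_id/bar/:bar_id", "/foo/1/bar/2")
puts params[:foo_id]   # prints 1
puts params[:bar_id]   # prints 2
```

Using [^/]+ instead of .+? keeps each parameter inside its own segment, so "/foo/1/x/bar/2" does not match at all.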
Try this:
regex = /\/foo\/(?<foo_id>.*)\/bar\/(?<bar_id>.*)/i
matches = "/foo/1/bar/2".match(regex)
Hash[matches.names.zip(matches[1..-1])]
IRB output:
2.3.1 :032 > regex = /\/foo\/(?<foo_id>.*)\/bar\/(?<bar_id>.*)/i
=> /\/foo\/(?<foo_id>.*)\/bar\/(?<bar_id>.*)/i
2.3.1 :033 > matches = "/foo/1/bar/2".match(regex)
=> #<MatchData "/foo/1/bar/2" foo_id:"1" bar_id:"2">
2.3.1 :034 > Hash[matches.names.zip(matches[1..-1])]
=> {"foo_id"=>"1", "bar_id"=>"2"}
I'd advise reading this article on how Rack parses query params. The above works for the example you gave, but it does not extend to other params.
http://codefol.io/posts/How-Does-Rack-Parse-Query-Params-With-parse-nested-query
This might help you; the foo id and bar id will be dynamic.
require 'json'
# url to scan
url = "/foo/1/bar/2"
# scan the ids from the url (\d+ so that multi-digit ids stay whole)
id = url.scan(/\d+/)
# gsub replaces the whole url with a hash literal written as a string
url_with_id = url.gsub(url, "{foo_id: #{id[0]}, bar_id: #{id[1]}}")
# => "{foo_id: 1, bar_id: 2}"
If you want to change the string to a hash:
url_hash = eval(url_with_id)
# => {:foo_id=>1, :bar_id=>2}
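Since eval runs whatever text ends up in the string, it may be safer to build the hash directly when the url comes from outside. A short sketch of that variant; ids_from_url is a hypothetical name, and it assumes the ids are the only runs of digits in the url:

```ruby
# Hypothetical helper: pairs the given keys with the numeric segments of
# the url, in order, without going through a string and eval.
def ids_from_url(url, keys)
  ids = url.scan(/\d+/).map(&:to_i)
  keys.zip(ids).to_h
end

result = ids_from_url("/foo/1/bar/2", [:foo_id, :bar_id])
puts result[:foo_id]   # prints 1
puts result[:bar_id]   # prints 2
```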

Not extracting the full link using index

I'm trying to extract the first href link from a website. Just the full link alone.
I am expecting to get http://www.iana.org/domains/example as the output but instead I am getting just http://www.iana.org/domains/ex
require 'net/http'
source = Net::HTTP.get('www.example.org', '/index.html')
def findhref(page) #returns rest of the html after href
  return page[page.index('href')..-1]
end
def findlink(page)
  text = findhref(page)
  firstquote = text.index('"') #first position of quote
  secondquote = text[firstquote+1..-1].index('"') #2nd quote
  puts text #for debugging
  puts firstquote+1 #for debugging
  puts secondquote #for debugging
  return text[firstquote+1..secondquote]
end
print findlink(source)
I would suggest using Nokogiri for HTML parsing. The solution to your problem would be as simple as:
require 'nokogiri'
require 'open-uri'
doc = Nokogiri::HTML(URI.open('http://www.example.org/index.html'))
first_anchor = doc.css('a').first
first_href = first_anchor['href']
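As for why the original code cuts the link short: secondquote is counted from the character after the first quote (it is an index into a substring), but it is then used as an absolute position in text, so the slice ends firstquote+1 characters too early. A sketch of the same index-based idea with both positions taken from the same string; find_link and the sample markup are illustrative only:

```ruby
# Returns the text between the first pair of double quotes that follows
# the first "href" in page, or nil when none is found.
def find_link(page)
  href_at = page.index('href')
  return nil unless href_at

  first_quote = page.index('"', href_at)
  return nil unless first_quote

  second_quote = page.index('"', first_quote + 1)
  return nil unless second_quote

  # Both offsets are positions in the same string; the exclusive range
  # (...) leaves out the closing quote.
  page[(first_quote + 1)...second_quote]
end

sample = '<p><a href="http://www.iana.org/domains/example">More</a></p>'
puts find_link(sample)   # prints http://www.iana.org/domains/example
```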

Embed Ruby in xpath/Nokogiri

Probably a pretty easy question:
I'm using Mechanize, Nokogiri, and XPath to parse through some HTML, like this:
category = a.page.at("//li//a[text()='Test']")
Now, I want the term that I'm searching for in text()= to be dynamic, i.e. I want to create a local variable:
term = 'Test'
and embed that local Ruby variable in the XPath, if that makes sense.
Any ideas how?
My intuition was to treat this like string concatenation, but that doesn't work out:
term = 'Test'
category = a.page.at("//li//a[text()=" + term + "]")
When you use category = a.page.at("//li//a[text()=" + term + "]"), the string passed to the method is //li//a[text()=Test], where Test is not in quotes. To put double quotes around the value inside a double-quoted Ruby string, escape them with the escape character \, or use single quotes instead:
term = 'Test'
category = a.page.at("//li//a[text()=\"#{term}\"]")
or
category = a.page.at("//li//a[text()='" + term + "']")
or
category = a.page.at("//li//a[text()='#{term}']")
For example:
>> a="In quotes" #=> "In quotes"
>> puts "This string is \"#{a}\"" #=> This string is "In quotes"
>> puts "This string is '#{a}'" #=> This string is 'In quotes'
>> puts "This string is '"+a+"'" #=> This string is 'In quotes'
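One caveat with all three forms: they break if term itself contains a quote character, because XPath 1.0 string literals have no escape syntax. A common workaround is to pick the other quote type, or to fall back to concat() when both appear. The sketch below assumes that approach; xpath_literal and link_query are hypothetical helper names, not part of Nokogiri or Mechanize.

```ruby
# Hypothetical helper: renders a Ruby string as an XPath 1.0 string literal.
def xpath_literal(value)
  return "'#{value}'" unless value.include?("'")
  return "\"#{value}\"" unless value.include?('"')

  # Both quote types present: split on ' and glue the pieces with concat().
  parts = value.split("'", -1).map { |part| "'#{part}'" }
  "concat(#{parts.join(%q{, "'", })})"
end

# Builds the query from the question with the term safely quoted.
def link_query(term)
  "//li//a[text()=#{xpath_literal(term)}]"
end

puts link_query("Test")    # prints //li//a[text()='Test']
puts link_query("Tom's")   # prints //li//a[text()="Tom's"]
```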
A little-used feature that might be relevant to your question is Nokogiri's ability to call a ruby callback while evaluating an XPath expression.
You can read more about this feature at http://nokogiri.org under the method docs for Node#xpath (http://nokogiri.org/Nokogiri/XML/Node.html#method-i-xpath), but here's an example addressing your question:
#! /usr/bin/env ruby
require 'nokogiri'
xml = <<-EOXML
<root>
  <a n='1'>foo</a>
  <a n='2'>bar</a>
  <a n='3'>baz</a>
</root>
EOXML
doc = Nokogiri::XML xml
dynamic_query = Class.new do
  def text_matching node_set, string
    node_set.select { |node| node.inner_text == string }
  end
end
puts doc.at_xpath("//a[text_matching(., 'bar')]", dynamic_query.new)
# => <a n="2">bar</a>
puts doc.at_xpath("//a[text_matching(., 'foo')]", dynamic_query.new)
# => <a n="1">foo</a>
HTH.

Getting portion of href attribute using hpricot

I think I need a combo of hpricot and regex here. I need to search for 'a' tags with an 'href' attribute that starts with 'abc/', and return the text following that until the next forward slash '/'.
So, given:
One
Two
I need to get back:
'12345'
and
'67890'
Can anyone lend a hand? I've been struggling with this.
You don't need a regex, but you can use one. Here are two examples, one with a regex and the other without, using Nokogiri, which should be compatible with Hpricot for your use, and CSS accessors:
require 'nokogiri'
html = %q[
One
Two
]
doc = Nokogiri::HTML(html)
doc.css('a[href]').map{ |h| h['href'][/(\d+)/, 1] } # => ["12345", "67890"]
doc.css('a[href]').map{ |h| h['href'].split('/')[2] } # => ["12345", "67890"]
or use regex:
s = 'One'
s =~ /abc\/([^\/]*)/
return $1
What about splitting the string by /?
(I don't know Hpricot, but according to the docs):
doc.search("a[@href]").each do |a|
  return a.somemethodtogettheattribute("href").split("/")[2] # 2, because the string starts with '/'
end
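Whichever parser supplies the href values, the extraction itself can be done on the plain string. A small sketch: segment_after_abc is a hypothetical name, and the href values are made up, since the original markup did not survive in the question.

```ruby
# Returns the path segment that follows "abc/" in href, or nil when the
# href does not start with "abc/" (an optional leading "/" is allowed).
def segment_after_abc(href)
  href[%r{\A/?abc/([^/]+)}, 1]
end

# Hypothetical href values for illustration.
hrefs = ["/abc/12345/xyz", "abc/67890/more", "/other/111/x"]
p hrefs.map { |h| segment_after_abc(h) }.compact
# prints ["12345", "67890"]
```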
