How to parse html source code with ruby/nokogiri? - ruby

I've successfully used ruby (1.8) and nokogiri's css parsing to pull out front facing data from web pages.
However I now need to pull out some data from a series of pages where the data is in the "meta" tags in the source code of the page.
One of the lines I need is the following:
<meta name="geo.position" content="35.667459;139.706256" />
I've tried using XPath but haven't been able to get it right.
Any help as to what syntax is needed would be much appreciated.
Thanks

This is a good case for a CSS attribute selector. For example:
doc.css('meta[name="geo.position"]').each do |meta_tag|
puts meta_tag['content'] # => 35.667459;139.706256
end
The equivalent XPath expression is almost identical:
doc.xpath('//meta[@name = "geo.position"]').each do |meta_tag|
puts meta_tag['content'] # => 35.667459;139.706256
end

require 'nokogiri'
doc = Nokogiri::HTML('<meta name="geo.position" content="35.667459;139.706256" />')
doc.at('//meta[@name="geo.position"]')['content'] # => "35.667459;139.706256"
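Once the content attribute has been extracted, the value is still a single string with the two coordinates separated by a semicolon. A minimal sketch of splitting it into numbers follows; the helper name parse_geo_position and the returned hash keys are assumptions made for illustration, not part of Nokogiri.

```ruby
# Hypothetical helper: turns a "lat;lon" string, as found in the
# geo.position meta tag above, into a hash of floats.
def parse_geo_position(content)
  parts = content.split(';')
  raise ArgumentError, "expected 'lat;lon', got #{content.inspect}" unless parts.size == 2

  # Float() raises ArgumentError on malformed numbers instead of returning 0.0
  latitude, longitude = parts.map { |part| Float(part.strip) }
  { latitude: latitude, longitude: longitude }
end

position = parse_geo_position('35.667459;139.706256')
puts position[:latitude]
puts position[:longitude]
```

Using Float() rather than to_f is a deliberate choice here: a page with a malformed tag fails loudly instead of silently yielding 0.0.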

Related

How to access attribute value when child is text node using Nori

I tried this:
xml_parser = Nori.new
xml_parser.parse "<FareReference ResBookDesigCode='Q'>Value</FareReference>"
And the result is:
{"FareReference"=>"Value"}
I wanted to retrieve the ResBookDesigCode value also.
Nokogiri is my recommended tool, since Nori doesn't appear to be actively supported.
require 'nokogiri'
doc = Nokogiri::XML("<FareReference ResBookDesigCode='Q'>Value</FareReference>")
doc now contains the DOM for the XML.
We can access the content for the FareReference node easily, along with its parameters:
doc.at('FareReference').text # => "Value"
doc.at('FareReference')['ResBookDesigCode'] # => "Q"
at returns the first node matching the selector, or nil if nothing matches. The documentation and tutorials describe the sibling methods.

How to scrape pages which have lazy loading

Here is the code I used to parse the web page. I ran it in the Rails console, but I am not getting any output. The site I want to scrape uses lazy loading.
require 'nokogiri'
require 'open-uri'
page = 1
while true
url = "http://www.justdial.com/functions"+"/ajxsearch.php?national_search=0&act=pagination&city=Delhi+%2F+NCR&search=Pandits"+"&where=Delhi+Cantt&catid=1195&psearch=&prid=&page=#{page}"
doc = Nokogiri::HTML(open(url))
doc = Nokogiri::HTML(doc.at_css('#ajax').text)
d = doc.css(".rslwrp")
d.each do |t|
puts t.css(".jrcw").text
puts t.css("span.jcn").text
puts t.css(".jaid").text
puts t.css(".estd").text
page+=1
end
end
You have 2 options here:
Switch from pure HTTP scraping to a tool that supports JavaScript evaluation, such as Capybara (with a proper driver selected). This can be slow, since you're running a headless browser under the hood, and you'll have to set some timeouts or find another way to make sure the blocks of text you're interested in are loaded before you start scraping.
The second option is to use the Web Developer console to figure out how those blocks of text are loaded (which AJAX calls, their parameters, etc.) and implement those calls in your scraper. This is a more advanced approach, but more performant, since you won't do the extra work that option 1 requires.
Have a nice day!
UPDATE:
Your code above doesn't work because the response is HTML wrapped in a JSON object, while you're trying to parse it as raw HTML. It looks like this:
{
"error": 0,
"msg": "request successful",
"paidDocIds": "some ids here",
"itemStartIndex": 20,
"lastPageNum": 50,
"markup": "LOTS AND LOTS AND LOTS OF MARKUP"
}
What you need to do is unwrap the JSON and then parse the result as HTML:
require 'json'
json = JSON.parse(open(url).read) # make sure you check http errors here
html = json['markup'] # can this field be empty? check for the json['error'] field
doc = Nokogiri::HTML(html) # parse as you like
I'd also advise against using open-uri, since your code may become vulnerable if you use dynamic URLs because of the way open-uri works (read the linked article for the details). Use a better, more feature-rich library such as HTTParty or RestClient instead.
UPDATE 2: Minimal working script for me:
require 'json'
require 'open-uri'
require 'nokogiri'
url = 'http://www.justdial.com/functions/ajxsearch.php?national_search=0&act=pagination&city=Delhi+%2F+NCR&search=Pandits&where=Delhi+Cantt&catid=1195&psearch=&prid=&page=2'
json = JSON.parse(open(url).read) # make sure you check http errors here
html = json['markup'] # can this field be empty? check for the json['error'] field
doc = Nokogiri::HTML(html) # parse as you like
puts doc.at_css('#newphoto10').attr('title')
# => Dr Raaj Batra Lal Kitab Expert in East Patel Nagar, Delhi
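The checks hinted at in the comments above (the error field, a possibly empty markup field) could be sketched roughly as follows. This assumes that "error": 0 signals success, as in the sample response shown earlier; the helper name extract_markup and the sample body are hypothetical, and the real response may carry other fields.

```ruby
require 'json'

# Hypothetical sample body; a real response contains far more markup.
SAMPLE_BODY = <<~'BODY'
  {
    "error": 0,
    "msg": "request successful",
    "lastPageNum": 50,
    "markup": "<div class=\"rslwrp\"><span class=\"jcn\">Example listing</span></div>"
  }
BODY

# Unwraps the JSON envelope and returns the HTML markup as a string.
# Raises if the service reports an error or the markup is missing.
def extract_markup(body)
  payload = JSON.parse(body)
  # Assumption: a zero "error" value means the request succeeded
  raise "request failed: #{payload['msg']}" unless payload['error'].to_i.zero?

  markup = payload['markup'].to_s
  raise 'response contained no markup' if markup.empty?

  markup
end

html = extract_markup(SAMPLE_BODY)
puts html # pass this string to Nokogiri::HTML as in the answer above
```

Since the response also reports lastPageNum, a paginating loop could stop once that page is reached instead of running forever.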

Parsing Liquid in a Jekyll generator before converting to JSON

Best to start by saying that I am very new to Ruby and Liquid. I have searched around looking for some resource on this issue, but as yet haven't been able to find anything of real use.
I have a Jekyll site, which utilises the HTML5 History API. I have a Jekyll generator plugin which creates a single JSON file which holds all the post and page content, ready for use with HTML5 PushState and PopState. This part is functioning properly and is tested.
My problem comes when I have a post/page on the site which has Liquid tags in it. I am guessing I need to parse these Liquid tags to get the template output before I create my JSON object for each post/page. Here is what I have for pages as an example:
# Iterate over all pages
site.pages.each do |page|
# Encode the page HTML content to JSON
link = page.url
@content = Liquid::Template.parse(page.content)
hash[link] = { "body_class" => page.data['body_class'], "content" => converter.convert(@content.render), "title" => '<h1>' + page.data["content_title"] + '</h1>' }
end
Now, this at the minute is basically removing all Liquid tags from the generated JSON file, leaving nothing in its place.
Here is my full generator file on Github which is based very heavily on nice work by Jezen Thomas.
The output JSON file is also in that repo with the site, or can be accessed quickly here. The blog.html content is the last item in the JSON file and shows the empty h1 and div tags which should have content.

Ruby: How to generate HTML with dynamic values?

I am writing a Ruby script that will generate a large flat HTML menu for my website. I could generate this menu on the fly each time a page loads, but I think doing so is a waste of resources, especially as it will almost never need to change.
I want to effectively do the following (in semi-pseudocode):
part_of_my_menu = eval %{
<script type="text/javascript">
var mapper = new Array();
<% parent_categories.each_with_index do |parent_category,i| -%>
mapper["#{parent_category.name}"] = <%= i -%>;
<% end -%>
</script>
}
and then be able to write the part_of_my_menu string variable to a HTML file (this I can do).
I know this is not how eval works in Ruby but does anyone know how to achieve this same "wrapper" functionality?
(fyi - the code I want to wrap with my "eval" function is much longer than this, I've only posted a very small snippet to illustrate what I am trying to achieve)
Thanks!
ERB is part of the standard library, so you could do something like this:
require 'erb'
tmpl = %q{<script type="text/javascript">...</script>}
erb = ERB.new(tmpl)
parent_categories = [ ... ]
part_of_my_menu = erb.result(binding) # pass the binding so the template can see parent_categories
The ERB documentation contains some good examples of how to use it.
You don't need a hand-rolled eval construction; you can use the existing standard libraries and your existing knowledge.
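For a fuller picture, here is a minimal sketch of the menu snippet from the question rendered with ERB. The Category struct, the sample names and the render_menu helper are hypothetical stand-ins for whatever parent_categories holds in the real script; the '-' trim mode is what makes the -%> markers from the question swallow the trailing newline.

```ruby
require 'erb'

# Hypothetical stand-in for the real category objects
Category = Struct.new(:name)

MENU_TEMPLATE = <<~TEMPLATE
  <script type="text/javascript">
  var mapper = new Array();
  <% parent_categories.each_with_index do |parent_category, i| -%>
  mapper["<%= parent_category.name %>"] = <%= i %>;
  <% end -%>
  </script>
TEMPLATE

# Renders the template; the method's binding exposes the
# parent_categories argument to the ERB code.
def render_menu(parent_categories)
  ERB.new(MENU_TEMPLATE, trim_mode: '-').result(binding)
end

part_of_my_menu = render_menu([Category.new('Books'), Category.new('Music')])
puts part_of_my_menu
```

The resulting string can then be written to the HTML file with File.write. Note that this sketch does not escape quotes inside category names, which a real menu generator would need to handle.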
You might be interested in the dom gem that I have developed. You can generate HTML strings like this:
require "dom"
["foo".dom(:span, class: "bold"), "bar"].dom(:div).dom(:body).dom(:html)
# => "<html><body><div><span class=\"bold\">foo</span>bar</div></body></html>"

Nokogiri: Running into error "undefined method ‘text’ for nil:NilClass"

I'm new to programming, so excuse my inexperience. I'm using Nokogiri to scrape a police crime log. Here is the code:
require 'rubygems'
require 'nokogiri'
require 'open-uri'
url = "http://www.sfsu.edu/~upd/crimelog/index.html"
doc = Nokogiri::HTML(open(url))
puts doc.at_css("title").text
doc.css(".brief").each do |brief|
puts brief.at_css("h3").text
end
I used the selector gadget bookmarklet to find the CSS selector for the log (.brief). When I pass "h3" through brief.at_css I get all of the h3 tags with the content inside.
However, if I add the .text method to remove the tags, I get a NoMethodError.
Is there any reason why this is happening? What am I missing? Thanks!
To clarify if you look at the structure of the HTML source you will see that the very first occurrence of <div class="brief"> does not have a child h3 tag (it actually only has a child <p> tag).
The Nokogiri Docs say that
at_css(*rules)
Search this node for the first occurrence of CSS rules. Equivalent to css(rules).first See Node#css for more information.
The docs state that at_css(*rules) is equivalent to css(rules).first. When there is a match (your .brief contains an h3), a Nokogiri::XML::Element object is returned, which responds to text. If your .brief does not contain an h3, nil is returned, which of course does not respond to text.
So if we call css(rules) (not at_css as you have), we get a Nokogiri::XML::NodeSet object back, which defines the text method as follows (notice the alias):
# Get the inner text of all contained Node objects
def inner_text
collect{|j| j.inner_text}.join('')
end
alias :text :inner_text
Because the class is Enumerable, it iterates over its children, calling their inner_text method, and joins the results together.
Therefore you can either perform a nil? check or, as @floatless correctly stated, just use the css method.
You just need to replace at_css with css and everything should be okay.
