Scraping data based on the text of other neighboring elements?

I have HTML like this:
<div id="left">
<div id="leftNav">
<div id="leftNavContainer">
<div id="refinements">
<h2>Department</h2>
<ul id="ref_2975312011">
<li>
<a href="#">
<span class="expand">Pet Supplies</span>
</a>
</li>
<li>
<strong>Dogs</strong>
</li>
<li>
<a>
<span class="refinementLink">Carriers & Travel Products</span>
<span class="narrowValue"> (5,570)</span>
</a>
</li>
(etc...)
Which I'm scraping like this:
html = file
data = Nokogiri::HTML(open(html))
categories = data.css('#ref_2975312011')
#categories_hash = {}
categories.css('li').drop(2).each do | categories |
categories_title = categories.css('.refinementLink').text
categories_count = categories.css('.narrowValue').text[/[\d,]+/].delete(",").to_i
#categories_hash[:categories] ||= {}
#categories_hash[:categories]["Dogs"] ||= {}
#categories_hash[:categories]["Dogs"][categories_title] = categories_count
end
Now I want to do the same thing without hard-coding #ref_2975312011 and "Dogs".
I was thinking I could tell Nokogiri the following:
Scrape the li elements (starting from the third one) that come right
after the li element containing the text Pet Supplies wrapped in a link and a span tag.
Any ideas on how to accomplish that?

The Pet Supplies li would be:
puts doc.at('li:has(a span[text()="Pet Supplies"])')
The following sibling li's would be (skipping the first one):
puts doc.search('li:has(a span[text()="Pet Supplies"]) ~ li:gt(1)')

Related

Regex to remove p tags within li tags and td tags

I have this html content:
<p>This is a paragraph:</p>
<ul>
<li>
<p>point 1</p>
</li>
<li>
<p>point 2</p>
<ul>
<li>
<p>point 3</p>
</li>
<li>
<p>point 4</p>
</li>
</ul>
</li>
<li>
<p>point 5</p>
</li>
</ul>
<ul>
<li>
<p><strong>sub-head : </strong>This is a para followed by heading, This is a para followed by heading, This is a para followed by heading, This is a para followed by heading</p>
</li>
<li>
<p><strong>sub-head 2: </strong></p>
<p>This is a para followed by heading, This is a para followed by heading, This is a para followed by heading, This is a para followed by heading</p>
</li>
</ul>
I want to remove all the <p> and </p> tags between <li> and </li>, regardless of their position between <li> and </li>. Similarly, I need to remove p tags between td tags inside a table.
This is my controller code so far:
nogo={"<li>\n<p>" =>'<li>', "</p>\n</li>" => '</li>', "<td>\n<p>" => '<td>', "</p>\n</td>" => '</td>',
'<p> </p>' => '','<ul>' => "\n<ul>",'</ul>' => "</ul>\n", '</ol>' => "</ol>\n" ,
'<table>' => "\n<table width='100%' border='0' cellspacing='0' cellpadding='0' class='table table-curved'>",
'&lt;' => '<', '&gt;'=>'>','<br>' => '','<p></p>' => '', ' rel="nofollow"' => ''}
c=params[:content]
bundle_out=Sanitize.fragment(c,Sanitize::Config.merge(Sanitize::Config::BASIC,
:elements=> Sanitize::Config::BASIC[:elements]+['table', 'tbody', 'tr', 'td', 'h1', 'h2', 'h3'],
:attributes=>{'a' => ['href']}) )#.split(" ").join(" ")
re = Regexp.new(nogo.keys.map { |x| Regexp.escape(x) }.join('|'))
#bundle_out=bundle_out.gsub(re, nogo)
I'm passing the above HTML content to this code through params[:content], which I've assigned to the variable c.
The following output is not what I expected. Some closing and opening p tags still remain between the li tags:
<p>This is a paragraph:</p>
<ul>
<li>point 1</li>
<li>point 2</p>
<ul>
<li>point 3</li>
<li>point 4</li>
</ul>
</li>
<li>point 5</li>
</ul>
<ul>
<li><strong>sub-head : </strong>This is a para followed by heading, This is a para followed by heading, This is a para followed by heading, This is a para followed by heading</li>
<li><strong>sub-head 2: </strong></p>
<p>This is a para followed by heading, This is a para followed by heading, This is a para followed by heading, This is a para followed by heading</li>
</ul>
My aim is simple: I just want to remove all the p tags inside li and td tags, which I haven't been able to do correctly. Any help is appreciated.
I would like to use regex to do this, even though I know regex is not the correct way to parse HTML content.

I wouldn't recommend regex: regular expressions are a dead end unless the HTML is trivial and you create it yourself. And if you are the one creating it, modifying it after generation is the wrong way to go about producing content.
Use a parser. Nokogiri is the de facto standard for Ruby and, with some knowledge of CSS or XPath, you can quickly learn to search or modify HTML and XML:
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<html>
<body>
<ul>
<li>
<p>foo</p>
</li>
<li>
<span>
<p>bar</p>
</span>
</li>
</ul>
</body>
</html>
EOT
doc.search('li p').each do |p_tag|
p_tag.remove
end
puts doc.to_html
Running that results in the following (note that remove deletes each p element together with its contents):
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
<body>
<ul>
<li>
</li>
<li>
<span>
</span>
</li>
</ul>
</body>
</html>
The tutorials on the Nokogiri site are your starting point. Stack Overflow is also a good resource as there are many different easily-searchable questions about all aspects of using the gem.

How can I extract URLs from HTML content with a Ruby regexp?

This is an example since it is not easy to explain:
<li id="l_f6a1ok3n4d4p" class="online"> <div class="link"> random strings - 4 <a style="float:left; display:block; padding-top:3px;" href="http://www.webtrackerplus.com/?page=flowplayerregister&a_aid=&a_bid=&chan=flow"><img border="0" src="/resources/img/fdf.gif"></a> <!-- a class="none" href="#">random strings - 4 site2.com - # - </a --> </div> <div class="params"> <span>Submited: </span>7 June 2015 | <span>Host: </span>site2.com </div> <div class="report"> <a title="" href="javascript:report(3191274,%203,%202164691,%201)" class="alert"></a> <a title="" href="javascript:report(3191274,%203,%202164691,%200)" class="work"></a> <b>100% said work</b> </div> <div class="clear"></div> </li> <li id="l_zsgn82c4b96d" class="online"> <div class="link"> <a href="javascript:show('zsgn82c4b96d','random%20strings%204',%20'site1.com');%20" onclick="visited('zsgn82c4b96d');" style
In the above content I want to extract from
javascript:show('f6a1ok3n4d4p','random%20strings%204',%20'site2.com')
the strings "f6a1ok3n4d4p" and "site2.com", then combine them into
http://site2.com/f6a1ok3n4d4p
and same for
javascript:show('zsgn82c4b96d','random%20strings%204',%20'site1.com')
to become
http://site1.com/zsgn82c4b96d
I need it to be done with Ruby regex.

You can proceed like this:
require 'uri'
str = "javascript:show('f6a1ok3n4d4p','random%20strings%204',%20'site2.com')"
# regex scan to get values within javascript:show
vals = str.scan(/javascript:show\((.*)\)/)[0][0].split(',')
# => ["'f6a1ok3n4d4p'", "'random%20strings%204'", "%20'site2.com'"]
# joining resultant Array elements to generate url
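# URI.decode was removed in Ruby 3.0, so decode_www_form_component is used instead
url = "http://" + URI.decode_www_form_component(vals.last).tr("'", '').strip + "/" + vals.first.tr("'", '')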
# => "http://site2.com/f6a1ok3n4d4p"
Obviously my answer is not foolproof. You can improve it by handling the case where scan returns [].
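One way to add that check is to use String#match and return nil when nothing matches. The helper name show_link_to_url is made up for this sketch, and the pattern assumes the argument layout shown in the question (id, label, host).

```ruby
require 'uri'

# Hypothetical helper: builds the URL from a javascript:show(...) link,
# or returns nil when the string contains no such call.
def show_link_to_url(str)
  m = str.match(/javascript:show\('([^']+)',\s*'[^']*',\s*(?:%20)*'([^']+)'\)/)
  return nil unless m

  id = m[1]
  host = URI.decode_www_form_component(m[2]).strip
  "http://#{host}/#{id}"
end

puts show_link_to_url("javascript:show('f6a1ok3n4d4p','random%20strings%204',%20'site2.com')")
# => http://site2.com/f6a1ok3n4d4p

p show_link_to_url("javascript:report(3191274,%203,%202164691,%201)")
# => nil
```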

This should do the trick, though the regexp isn't particularly flexible.
js_link_regex = /href=\"javascript:show\('([^']+)','[^']+',%20'([^']+)'\)/
link = <<eos
<li id="l_f6a1ok3n4d4p" class="online"> <div class="link"> random strings - 4 <a style="float:left; display:block; padding-top:3px;" href="http://www.webtrackerplus.com/?page=flowplayerregister&a_aid=&a_bid=&chan=flow"><img border="0" src="/resources/img/fdf.gif"></a> <!-- a class="none" href="#">random strings - 4 site2.com - # - </a --> </div> <div class="params"> <span>Submited: </span>7 June 2015 | <span>Host: </span>site2.com </div> <div class="report"> <a title="" href="javascript:report(3191274,%203,%202164691,%201)" class="alert"></a> <a title="" href="javascript:report(3191274,%203,%202164691,%200)" class="work"></a> <b>100% said work</b> </div> <div class="clear"></div> </li> <li id="l_zsgn82c4b96d" class="online"> <div class="link"> <a href="javascript:show('zsgn82c4b96d','random%20strings%204',%20'site1.com');%20" onclick="visited('zsgn82c4b96d');" style
eos
matches = link.scan(js_link_regex)
matches.each do |match|
puts "http://#{match[1]}/#{match[0]}"
end

To just match your case,
str = "javascript:show('f6a1ok3n4d4p','random%20strings%204',%20'site2.com')"
parts = str.scan(/'([\w.]+)'/).flatten # => ["f6a1ok3n4d4p", "site2.com"]
puts "http://#{parts[1]}/#{parts[0]}" # => http://site2.com/f6a1ok3n4d4p

Ruby Nokogiri Parsing Multiple Elements within Lists

<div class='prdlist'>
<ul>
<li class='first'>
<a href="some url 1">
<div class="text">
<br>product number 1
</div>
</a>
</li>
<li class='second'>
<a href="some url 2">
<div class="text">
<br>product number 2
</div>
</a>
</li>
</ul>
</div>
Using the above example,
I would like to parse the values inside each list item, item by item. Something like:
html.xpath("//*[@class='prdlist']/ul/li").each do |each|
url = each.xpath/css (parse the href from each list)
name = each.xpath/css (parse the text from each list)
end
arr << [url,name]
which would eventually output:
arr = [["some url 1","product number1"],["some url2","product number2"]]
I am currently using regex and xpath("//*[@href]/@href") to get all URLs, and something similar to get all product names, and then using .zip to put the arrays together... but I've come across an HTML page where I would like to do it list item by list item.
Thanks for the help!

And there you have it.
arr = []
html.css("div.prdlist li").each do |me|
url = me.css("a").map{|link| link['href']}[0]
name = me.text.delete("\n").split.join(" ")
arr << [url,name]
end

Xpath: Select parent link matching three child attributes

I need to select a link using XPath that matches the following three criteria:
parent @class = 'testItem'
child @class = 'icon icon_checked'
text = 'test text goes here!'
I'm unsure where to put the text condition in the XPath expression. I've tried many permutations of the following:
//a[@class="testItem" and child::span[@class="icon icon_checked"] and li[text()="test text goes here!"]]
My issue is that the text part is not in its own span.
Here's the raw example:
<li>
<a class="testItem2" data-code="2" href="javascript:void(0);">
<span class="icon icon_checked"></span>
test text goes here2!
</a>
</li>
<li>
<a class="testItem" data-code="2" href="javascript:void(0);">
<span class="icon icon_checked"></span>
test text goes here!
</a>
</li>

Thanks for the help. I've found the answer.
I can simply change the last part of my XPath from li[text()="test text goes here!"] to .[text()="test text goes here!"].
My final working XPath is:
//a[@class='testItem' and child::span[@class='icon icon_checked'] and .[text()='test text goes here!']]

Accessing a div element in an array of li elements

I am trying to access a div in an li array
<ul>
<li class="views-row views-row-1 views-row-odd views-row-first">
<div class="news-item">
</li>
<li class="views-row views-row-2 views-row-even">
<li class="views-row views-row-3 views-row-odd">
<div class="news-item">
<div class="image">
<div class="details with-image">
<h2>
<p class="standfirst">The best two-seat </p>
<div class="meta">
<div class="pub-date">26 April 2012</div>
<div class="topic-bar clearfix">
<div class="topic car_review">review</div>
</div>
</div>
</div>
</div>
</li>
I am trying to access the <div class="topic car_review"> element and get its text.
The reason I am specifically using that text is that my script takes different steps depending on what the text is.
The code I am using is:
@topic = @browser.li(:class => /views-row-#{x}/).div(:class,'news-item').div(:class,'details').div(:class,'meta').div(:class,/topic /).text
The script was working fine before, but it has suddenly stopped working and can no longer find div(:class,'news-item').
The error message I get is:
unable to locate element, using {:class=>"news-item", :tag_name=>"div"} (Watir::Exception::UnknownObjectException)
I tried div(:class => /news-/), but it still cannot find that element.
I am really stuck!

I assume that when you call li(:class => /views-row-#{x}/), the x means you are iterating over all rows? If so, your script will fail on row 2, since that row does not contain the news-item div (resulting in the error that you see).
If there is only one of these 'topic car_review' div tags, you can just do:
@topic = @browser.div(:class, 'topic car_review').text
Update - Iterating over each LI:
If you need to iterate over each LI, then you could do:
@browser.lis.each do |li|
topic_div = li.div(:class, 'topic car_review')
next unless topic_div.exists? # skip rows without a topic div, such as row 2
@topic = topic_div.text
end
