How to scrape the text of <li> and children - ruby

I am trying to scrape the text of <li> tags and of the elements nested within them.
The HTML looks like:
<div class="insurancesAccepted">
<h4>What insurance does he accept?*</h4>
<ul class="noBottomMargin">
<li class="first"><span>Aetna</span></li>
<li>
<a title="See accepted plans" class="insurancePlanToggle arrowUp">AvMed</a>
<ul style="display: block;" class="insurancePlanList">
<li class="last first">Open Access</li>
</ul>
</li>
<li>
<a title="See accepted plans" class="insurancePlanToggle arrowUp">Blue Cross Blue Shield</a>
<ul style="display: block;" class="insurancePlanList">
<li class="last first">Blue Card PPO</li>
</ul>
</li>
<li>
<a title="See accepted plans" class="insurancePlanToggle arrowUp">Cigna</a>
<ul style="display: block;" class="insurancePlanList">
<li class="first">Cigna HMO</li>
<li>Cigna PPO</li>
<li class="last">Great West Healthcare-Cigna PPO</li>
</ul>
</li>
<li class="last">
<a title="See accepted plans" class="insurancePlanToggle arrowUp">Empire Blue Cross Blue Shield</a>
<ul style="display: block;" class="insurancePlanList">
<li class="last first">Empire Blue Cross Blue Shield HMO</li>
</ul>
</li>
</ul>
</div>
The main issue is when I am trying to get content from:
doc.css('.insurancesAccepted li').text.strip
It returns the text of every <li> at once. I want "AvMed" and "Open Access" scraped together, with the parent-child relationship preserved, so that I can insert them into my MySQL table with a reference.

The problem is that doc.css('.insurancesAccepted li') matches all nested list items, not only direct descendants. To match only a direct descendant one should use a parent > child CSS rule. To accomplish your task you need to carefully assemble the result of the iteration:
doc = Nokogiri::HTML(html)
result = doc.css('div.insurancesAccepted > ul > li').each do |li|
  chapter = li.css('span').text.strip
  section = li.css('a').text.strip
  subsections = li.css('ul > li').map(&:text).map(&:strip)
  puts "#{chapter} ⇒ [ #{section} ⇒ [ #{subsections.join(', ')} ] ]"
  puts '=' * 40
end
This results in:
# Aetna ⇒ [ ⇒ [ ] ]
# ========================================
# ⇒ [ AvMed ⇒ [ Open Access ] ]
# ========================================
# ⇒ [ Blue Cross Blue Shield ⇒ [ Blue Card PPO ] ]
# ========================================
# ⇒ [ Cigna ⇒ [ Cigna HMO, Cigna PPO, Great West Healthcare-Cigna PPO ] ]
# ========================================
# ⇒ [ Empire Blue Cross Blue Shield ⇒ [ Empire Blue Cross Blue Shield HMO ] ]
# ========================================

Related

Xpath, how to get access to inner elements?

<div class="vehicle-item__main-content">
<div class="vehicle-item_summary-container">
<ul class="vehicle-item__attributes">
<li class="vehicle-item__attribute-item">
<i class="icon icon-specs-transmission-gray"></i>
"Manual"
</li>
<li class="vehicle-item__attribute-item">
<i class="icon icon-specs-passenger-gray">
"4 People"
</li>
I have a web scraper and I would like to capture the following texts: 'Manual' and '4 People'. The website has many more class="vehicle-item__attribute-item" elements, which I don't need. How can I get access to the text? Maybe with the help of the i element's class (class="icon icon-specs-transmission-gray")?
transmission = driver.find_elements_by_xpath('//li[@class="vehicle-item__attribute-item"]')
transmissionlist = []
for trans in transmission:
    print(trans.text)
    transmissionlist.append(trans.text)
With this I am getting all 100+ items from the website, but I only need the above 2 car properties.
Instead of
'//li[@class="vehicle-item__attribute-item"]'
try
'//li[i[contains(@class, "icon-specs-transmission-gray")]]'
'//li[i[contains(@class, "icon-specs-passenger-gray")]]'
transmission = driver.find_element_by_xpath('//li[i[contains(@class, "icon-specs-transmission-gray")]]').text
passengers = driver.find_element_by_xpath('//li[i[contains(@class, "icon-specs-passenger-gray")]]').text

Regex to remove p tags within li tags and td tags

I have this html content:
<p>This is a paragraph:</p>
<ul>
<li>
<p>point 1</p>
</li>
<li>
<p>point 2</p>
<ul>
<li>
<p>point 3</p>
</li>
<li>
<p>point 4</p>
</li>
</ul>
</li>
<li>
<p>point 5</p>
</li>
</ul>
<ul>
<li>
<p><strong>sub-head : </strong>This is a para followed by heading, This is a para followed by heading, This is a para followed by heading, This is a para followed by heading</p>
</li>
<li>
<p><strong>sub-head 2: </strong></p>
<p>This is a para followed by heading, This is a para followed by heading, This is a para followed by heading, This is a para followed by heading</p>
</li>
</ul>
I want to remove all the <p> and </p> tags between <li> and </li>, irrespective of their position between <li> and </li>. Similarly, I need to remove p tags between td tags inside a table.
This is my controller code so far:
nogo = {
  "<li>\n<p>" => '<li>', "</p>\n</li>" => '</li>', "<td>\n<p>" => '<td>', "</p>\n</td>" => '</td>',
  '<p> </p>' => '', '<ul>' => "\n<ul>", '</ul>' => "</ul>\n", '</ol>' => "</ol>\n",
  '<table>' => "\n<table width='100%' border='0' cellspacing='0' cellpadding='0' class='table table-curved'>",
  '&lt;' => '<', '&gt;' => '>', '<br>' => '', '<p></p>' => '', ' rel="nofollow"' => ''
}
c = params[:content]
bundle_out = Sanitize.fragment(c, Sanitize::Config.merge(Sanitize::Config::BASIC,
  :elements => Sanitize::Config::BASIC[:elements] + ['table', 'tbody', 'tr', 'td', 'h1', 'h2', 'h3'],
  :attributes => {'a' => ['href']})) #.split(" ").join(" ")
re = Regexp.new(nogo.keys.map { |x| Regexp.escape(x) }.join('|'))
@bundle_out = bundle_out.gsub(re, nogo)
I'm passing the above HTML content to this code through params[:content], which I've assigned to the variable c.
The following is the output, which is not as expected. Some closing and opening p tags still remain between the li tags:
<p>This is a paragraph:</p>
<ul>
<li>point 1</li>
<li>point 2</p>
<ul>
<li>point 3</li>
<li>point 4</li>
</ul>
</li>
<li>point 5</li>
</ul>
<ul>
<li><strong>sub-head : </strong>This is a para followed by heading, This is a para followed by heading, This is a para followed by heading, This is a para followed by heading</li>
<li><strong>sub-head 2: </strong></p>
<p>This is a para followed by heading, This is a para followed by heading, This is a para followed by heading, This is a para followed by heading</li>
</ul>
My aim is simple: I just want to remove all the p tags inside li and td tags, which I'm not able to do correctly. Any help is appreciated.
I would like to use regex to do this, although I know that regex is not the correct way to parse HTML content.
I wouldn't recommend using regular expressions, because they're a dead end unless the HTML is trivial and you create it. And, if you are the one creating it, then modifying it after generating it is the wrong way to go about generating content.
Use a parser. Nokogiri is the de-facto standard for Ruby, and, with some knowledge of CSS or XPath, you can quickly learn to search, or modify, HTML and XML:
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<html>
<body>
<ul>
<li>
<p>foo</p>
</li>
<li>
<span>
<p>bar</p>
</span>
</li>
</ul>
</body>
</html>
EOT
doc.search('li p').each do |p_tag|
p_tag.remove
end
puts doc.to_html
Running that results in:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
<body>
<ul>
<li>
</li>
<li>
<span>
</span>
</li>
</ul>
</body>
</html>
The tutorials on the Nokogiri site are your starting point. Stack Overflow is also a good resource as there are many different easily-searchable questions about all aspects of using the gem.

How can I extract URLs from HTML content with a Ruby regexp?

This is an example since it is not easy to explain:
<li id="l_f6a1ok3n4d4p" class="online"> <div class="link"> random strings - 4 <a style="float:left; display:block; padding-top:3px;" href="http://www.webtrackerplus.com/?page=flowplayerregister&a_aid=&a_bid=&chan=flow"><img border="0" src="/resources/img/fdf.gif"></a> <!-- a class="none" href="#">random strings - 4 site2.com - # - </a --> </div> <div class="params"> <span>Submited: </span>7 June 2015 | <span>Host: </span>site2.com </div> <div class="report"> <a title="" href="javascript:report(3191274,%203,%202164691,%201)" class="alert"></a> <a title="" href="javascript:report(3191274,%203,%202164691,%200)" class="work"></a> <b>100% said work</b> </div> <div class="clear"></div> </li> <li id="l_zsgn82c4b96d" class="online"> <div class="link"> <a href="javascript:show('zsgn82c4b96d','random%20strings%204',%20'site1.com');%20" onclick="visited('zsgn82c4b96d');" style
In the above content I want to extract from
javascript:show('f6a1ok3n4d4p','random%20strings%204',%20'site2.com')
the string "f6a1ok3n4d4p" and "site2.com" then make it as
http://site2.com/f6a1ok3n4d4p
and same for
javascript:show('zsgn82c4b96d','random%20strings%204',%20'site1.com')
to become
http://site1.com/zsgn82c4b96d
I need it to be done with Ruby regex.
You can proceed like this:
require 'uri'
str = "javascript:show('f6a1ok3n4d4p','random%20strings%204',%20'site2.com')"
# regex scan to get values within javascript:show
vals = str.scan(/javascript:show\((.*)\)/)[0][0].split(',')
# => ["'f6a1ok3n4d4p'", "'random%20strings%204'", "%20'site2.com'"]
# joining resultant Array elements to generate url
url = "http://" + URI.decode_www_form_component(vals.last).tr("'", '').strip + "/" + vals.first.tr("'", '')
# => "http://site2.com/f6a1ok3n4d4p"
Obviously, my answer is not foolproof. You can improve it by checking for the case where scan returns [].
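One way to handle the no-match case is to use Regexp#match, which returns nil instead of raising when nothing is found. The sketch below is an illustration only: the constant JS_SHOW and the method show_link_to_url are names made up here, and the pattern assumes the three-argument show(...) form from the question.

```ruby
# Assumed shape: javascript:show('<id>','<title>',%20'<host>')
JS_SHOW = /javascript:show\('([^']+)',\s*'[^']*',\s*(?:%20)?'([^']+)'\)/

# Hypothetical helper: returns the rebuilt URL, or nil when the
# string does not contain a matching javascript:show(...) call.
def show_link_to_url(str)
  match = JS_SHOW.match(str)
  return nil unless match

  id, host = match.captures
  "http://#{host}/#{id}"
end

puts show_link_to_url("javascript:show('f6a1ok3n4d4p','random%20strings%204',%20'site2.com')")
```

Callers can then skip nil results instead of rescuing a NoMethodError on an empty scan.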
This should do the trick, though the regexp isn't particularly flexible.
js_link_regex = /href=\"javascript:show\('([^']+)','[^']+',%20'([^']+)'\)/
link = <<eos
<li id="l_f6a1ok3n4d4p" class="online"> <div class="link"> random strings - 4 <a style="float:left; display:block; padding-top:3px;" href="http://www.webtrackerplus.com/?page=flowplayerregister&a_aid=&a_bid=&chan=flow"><img border="0" src="/resources/img/fdf.gif"></a> <!-- a class="none" href="#">random strings - 4 site2.com - # - </a --> </div> <div class="params"> <span>Submited: </span>7 June 2015 | <span>Host: </span>site2.com </div> <div class="report"> <a title="" href="javascript:report(3191274,%203,%202164691,%201)" class="alert"></a> <a title="" href="javascript:report(3191274,%203,%202164691,%200)" class="work"></a> <b>100% said work</b> </div> <div class="clear"></div> </li> <li id="l_zsgn82c4b96d" class="online"> <div class="link"> <a href="javascript:show('zsgn82c4b96d','random%20strings%204',%20'site1.com');%20" onclick="visited('zsgn82c4b96d');" style
eos
matches = link.scan(js_link_regex)
matches.each do |match|
puts "http://#{match[1]}/#{match[0]}"
end
To just match your case,
str = "javascript:show('f6a1ok3n4d4p','random%20strings%204',%20'site2.com')"
parts = str.scan(/'([\w|\.]+)'/).flatten # => ["f6a1ok3n4d4p", "site2.com"]
puts "http://#{parts[1]}/#{parts[0]}" # => http://site2.com/f6a1ok3n4d4p

Array/loop behaviour

I have a dataset of three shops (Winkel1-3) and I would like to extract the addresses. What I've built extracts the names and then the addresses instead of the combination of both. I'm sure I've built a flawed loop, but I can't figure out what to change.
My dataset:
<ul id="itemsList">
<li class="citem ">
<a alt="Winkel 1" href="/Zuid-Holland/Delft/Winkel1">Winkel1</a>
Buitenwatersloot 51,2613TB
</li>
<li class="citem ">
<a alt="Winkel 2" href="/Zuid-Holland/Delft/Winkel2">Winkel 2</a>
Laan van Van der Gaag 75,2627BX
</li>
<li class="citem ">
<a alt="Winkel 3" href="/Zuid-Holland/Delft/Winkel3">Winkel 3</a>
Achterom 89,2611PM
</li>
</ul>
My scraper:
class DmozSpider(Spider):
    name = "dmoz"
    allowed_domains = ["mydomain.nl"]
    start_urls = [
        "http://www.mydomaintestdata.nl/Zuid-Holland/Delft"
    ]

    def parse(self, response):
        sel = Selector(response)
        sites = sel.xpath('//ul[@id="itemsList"]/li')
        loop = sel.xpath('/html')
        for site in loop:
            adres = (sites.xpath('.//a/text()').extract(),
                     sites.xpath('text()').extract())
            print adres
This returns two arrays:
[Winkel1, Winkel2, Winkel3],['Buitenwatersloot 51,2613TB','Laan van Van der Gaag 75,2627BX','Achterom 89,2611PM']
What I would like:
[Winkel1,'Buitenwatersloot 51,2613TB'],[Winkel2, 'Laan van Van der Gaag 75,2627BX'],[Winkel3, 'Achterom 89,2611PM']
Iterate over the li elements and get the link and text for each li in the loop:
sites = sel.xpath('//ul[@id="itemsList"]/li')
for site in sites:
    print site.xpath('./a/text()').extract(), site.xpath('text()').extract()

Scraping data based on the text of other neighboring elements?

I have a code like this:
<div id="left">
<div id="leftNav">
<div id="leftNavContainer">
<div id="refinements">
<h2>Department</h2>
<ul id="ref_2975312011">
<li>
<a href="#">
<span class="expand">Pet Supplies</span>
</a>
</li>
<li>
<strong>Dogs</strong>
</li>
<li>
<a>
<span class="refinementLink">Carriers & Travel Products</span>
<span class="narrowValue"> (5,570)</span>
</a>
</li>
(etc...)
Which I'm scraping like this:
html = file
data = Nokogiri::HTML(open(html))
categories = data.css('#ref_2975312011')
@categories_hash = {}
categories.css('li').drop(2).each do | categories |
  categories_title = categories.css('.refinementLink').text
  categories_count = categories.css('.narrowValue').text[/[\d,]+/].delete(",").to_i
  @categories_hash[:categories] ||= {}
  @categories_hash[:categories]["Dogs"] ||= {}
  @categories_hash[:categories]["Dogs"][categories_title] = categories_count
end
Now I want to do the same, but without using #ref_2975312011 and "Dogs".
So I was thinking I could tell Nokogiri the following:
Scrape the li elements (starting from the third one) that come right
after the li element in which the text Pet Supplies is enclosed by a link and a span tag.
Any ideas on how to accomplish that?
The Pet Supplies li would be:
puts doc.at('li:has(a span[text()="Pet Supplies"])')
The following sibling li's would be (skipping the first one):
puts doc.search('li:has(a span[text()="Pet Supplies"]) ~ li:gt(1)')
