Parsing links that set JavaScript cookies with Ruby, Nokogiri & Mechanize

Hi everyone.
I need to scrape a webpage where every link sets cookies through JavaScript. The normal search already works: every product is shown, scraped, and imported into a MySQL database.
This is the code I use to scrape every product and its fields from a search result:
require 'rubygems'
require 'logger'
require 'mechanize'
require 'mysql2'
# current Mechanize releases expose the class as Mechanize (formerly WWW::Mechanize)
agent = Mechanize.new { |a| a.log = Logger.new(STDERR) }
#agent.set_proxy('a-proxy', '8080')
agent.read_timeout = 60
# parse a cookie string and store the resulting cookies in the agent's jar
def add_cookie(agent, uri, cookie)
  uri = URI.parse(uri)
  Mechanize::Cookie.parse(uri, cookie) do |c|
    agent.cookie_jar.add(uri, c)
  end
end
# get main page
page = agent.get "http://www.site.com.mx"
# get login form
form = page.forms.first
form.correo_ingresar = "user"
form.password = "password"
# submit login form
page = agent.submit form
# parse cookies
myarray = page.body.scan(/SetCookie\(\"(.+)\", \"(.+)\"\)/)
# set session cookies
myarray.each do |item|
  add_cookie(agent, 'http://www.site.com.mx', "#{item[0]}=#{item[1]}; path=/; domain=www.site.com.mx")
end
# show 1000 search results per page
add_cookie(agent, 'http://www.site.com.mx', "tampag=1000; path=/; domain=www.site.com.mx")
# order results
add_cookie(agent, 'http://www.site.com.mx', "orden_articulos=existencias asc; path=/; domain=www.site.com.mx")
# section results
add_cookie(agent, 'http://www.site.com.mx', "codigoseccion_buscar=14; path=/; domain=www.site.com.mx")
# get main page
page = agent.get "http://www.site.com.mx/tienda/index.php"
search_form = page.forms.first
search_result = agent.submit search_form
doc = Nokogiri::HTML(search_result.body)
rows = doc.css("table.articulos tr")
# build one hash per table row
details = rows.collect do |row|
  detail = {}
  [
    [:sku, 'td[3]/text()'],
    [:desc, 'td[4]/text()'],
    [:qty, 'td[5]/text()'],
    [:qty2, 'td[5]/p/b/text()'],
    [:price, 'td[6]/text()']
  ].each do |name, xpath|
    detail[name] = row.at_xpath(xpath).to_s.strip
  end
  detail
end
# walk through paginator links
# uniq (not uniq!) is used because uniq! returns nil when there are no duplicates
links = doc.css("a.paginar").map { |l| "http://www.site.com.mx#{l['href']}" }.uniq
links.each do |l|
  page = agent.get l
  doc = Nokogiri::HTML(page.body)
  rows = doc.css("table.articulos tr")
  rows.each do |row|
    detail = {}
    [
      [:sku, 'td[3]/text()'],
      [:desc, 'td[4]/text()'],
      [:qty, 'td[5]/text()'],
      [:qty2, 'td[5]/p/b/text()'],
      [:price, 'td[6]/text()']
    ].each do |name, xpath|
      detail[name] = row.at_xpath(xpath).to_s.strip
    end
    details << detail
  end
end
# update db
client = Mysql2::Client.new(:host => "localhost", :username => "myusername", :password => "mypassword", :database => "mydatabase")
details.each do |d|
  if d[:sku] != ""
    price = d[:price].split
    if price[1] == "D"
      currency = 144
    else
      currency = 168
    end
    cost = price[0].gsub(",", "").to_f
    if d[:qty] == ""
      qty = d[:qty2]
    else
      qty = d[:qty]
    end
    # escape scraped text so quotes in product names cannot break the SQL
    sku = client.escape(d[:sku])
    desc = client.escape(d[:desc])
    qty = client.escape(qty)
    results = client.query("SELECT * FROM jos_vm_product WHERE product_sku = '#{sku}' LIMIT 1;")
    if results.count == 1
      product = results.first
      client.query("UPDATE jos_vm_product SET product_sku = '#{sku}', product_name = '#{desc}', product_desc = '#{desc}', product_in_stock = '#{qty}' WHERE product_id = #{product['product_id']};")
      client.query("UPDATE jos_vm_product_price SET product_price = '#{cost}', product_currency = '#{currency}' WHERE product_id = '#{product['product_id']}';")
    else
      client.query("INSERT INTO jos_vm_product(product_sku, product_name, product_desc, product_in_stock) VALUES('#{sku}', '#{desc}', '#{desc}', '#{qty}');")
      last_id = client.last_id
      client.query("INSERT INTO jos_vm_product_price(product_id, product_price, product_currency) VALUES('#{last_id}', '#{cost}', #{currency});")
    end
  end
end
Now I don't want to search; I want to scrape from the categories list.
Link to the main page: http://www.site.com.mx/tienda/articulos.php?opcion=lineas&seccion_mostrar=11
This shows a table like the one below (every entry is a link).
The top name, ACCESORIOS, is a link to the category ACCESORIOS; the bold names listed below it are the subcategories, and the names below each bold name are brands. If I click on ACCESORIOS it shows every brand and every subcategory mixed together, and so on.
ACCESORIOS
Accesorios Multimedia(6)
ACTECK DE MEXICO (5), MANHATTAN (1)
Accesorios P/impres. Punto De Venta(1)
EPSON CORPORATION (1)
Accesorios Para Cableados De Patch Panels(1)
INTELLINET NETWORK SOLUTIONS (1)
Accesorios Para Camaras Digitales(1)
MANHATTAN (1)
Accesorios Para Computadoras De Escritorio(32)
ACTECK DE MEXICO (2), GENERICA (1), MANHATTAN (28), TARGUS (1)
Accesorios Para Computadoras Portatiles(60)
ACTECK DE MEXICO (3), GENIUS (2), HP COMERCIAL (2), HP IMPRESION (1), MANHATTAN (17), PERFECT CHOICES (32), SOLIDEX (1), TARGUS (1), TECH ZONE (1)
Accesorios Para Ipod(3)
ACTECK DE MEXICO (1), PERFECT CHOICES (2)
Accesorios Para Mesas(3)
MANHATTAN (2), PERFECT CHOICES (1)
Accesorios Para Redes(13)
INTELLINET NETWORK SOLUTIONS (5), MANHATTAN (8)
Accesoriso Para Celulares(14)
BLACKBERRY (14)
Adaptador Bluetooth(6)
ACTECK DE MEXICO (1), MANHATTAN (2), PERFECT CHOICES (3)
Adaptadores Para Mouse Y Teclado(3)
MANHATTAN (2), PERFECT CHOICES (1)
Audifono/diademas Y Microfonos(49)
ACTECK DE MEXICO (14), BTO (1), GENIUS (3), LOGITECH (2), MANHATTAN (11), PERFECT CHOICES (18)
Here is the HTML for the table. Every link sets cookies in its onClick handler, which is why I have been having a hard time scraping it.
<table width="95%" cellspacing="0" cellpadding="3" border="0">
<tbody>
<tr>
<td valign="top" align="left" style="font-family: verdana; font-size: 12px" colspan="2"><a onClick="fijar_filtro('codigoseccion_buscar','11')" href="javascript:void(0)" class="busquedas"><b>ACCESORIOS</b></a></td>
</tr>
<tr>
<td width="20" valign="top" align="left"></td>
<td valign="top" align="left" style="font-family: verdana; font-size: 12px"><a onClick="SetCookie('codigomarca_buscar','');fijar_filtro('codigolinea_buscar','338')" href="javascript:void(0)" class="busquedas"><b>Accesorios Multimedia</b>(6)</a><br>
<a onClick="SetCookie('codigolinea_buscar','338');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','602');" href="javascript:void(0)" class="busquedas">ACTECK DE MEXICO (5)</a>, <a onClick="SetCookie('codigolinea_buscar','338');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','585');" href="javascript:void(0)" class="busquedas">MANHATTAN (1)</a><br>
<br>
<a onClick="SetCookie('codigomarca_buscar','');fijar_filtro('codigolinea_buscar','540')" href="javascript:void(0)" class="busquedas"><b>Accesorios P/impres. Punto De Venta</b>(1)</a><br>
<a onClick="SetCookie('codigolinea_buscar','540');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','106');" href="javascript:void(0)" class="busquedas">EPSON CORPORATION (1)</a><br>
<br>
<a onClick="SetCookie('codigomarca_buscar','');fijar_filtro('codigolinea_buscar','542')" href="javascript:void(0)" class="busquedas"><b>Accesorios Para Cableados De Patch Panels</b>(1)</a><br>
<a onClick="SetCookie('codigolinea_buscar','542');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','635');" href="javascript:void(0)" class="busquedas">INTELLINET NETWORK SOLUTIONS (1)</a><br>
<br>
<a onClick="SetCookie('codigomarca_buscar','');fijar_filtro('codigolinea_buscar','361')" href="javascript:void(0)" class="busquedas"><b>Accesorios Para Camaras Digitales</b>(1)</a><br>
<a onClick="SetCookie('codigolinea_buscar','361');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','585');" href="javascript:void(0)" class="busquedas">MANHATTAN (1)</a><br>
<br>
<a onClick="SetCookie('codigomarca_buscar','');fijar_filtro('codigolinea_buscar','277')" href="javascript:void(0)" class="busquedas"><b>Accesorios Para Computadoras De Escritorio</b>(32)</a><br>
<a onClick="SetCookie('codigolinea_buscar','277');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','602');" href="javascript:void(0)" class="busquedas">ACTECK DE MEXICO (2)</a>, <a onClick="SetCookie('codigolinea_buscar','277');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','530');" href="javascript:void(0)" class="busquedas">GENERICA (1)</a>, <a onClick="SetCookie('codigolinea_buscar','277');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','585');" href="javascript:void(0)" class="busquedas">MANHATTAN (28)</a>, <a onClick="SetCookie('codigolinea_buscar','277');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','586');" href="javascript:void(0)" class="busquedas">TARGUS (1)</a><br>
<br>
<a onClick="SetCookie('codigomarca_buscar','');fijar_filtro('codigolinea_buscar','357')" href="javascript:void(0)" class="busquedas"><b>Accesorios Para Computadoras Portatiles</b>(60)</a><br>
<a onClick="SetCookie('codigolinea_buscar','357');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','602');" href="javascript:void(0)" class="busquedas">ACTECK DE MEXICO (3)</a>, <a onClick="SetCookie('codigolinea_buscar','357');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','167');" href="javascript:void(0)" class="busquedas">GENIUS (2)</a>, <a onClick="SetCookie('codigolinea_buscar','357');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','694');" href="javascript:void(0)" class="busquedas">HP COMERCIAL (2)</a>, <a onClick="SetCookie('codigolinea_buscar','357');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','107');" href="javascript:void(0)" class="busquedas">HP IMPRESION (1)</a>, <a onClick="SetCookie('codigolinea_buscar','357');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','585');" href="javascript:void(0)" class="busquedas">MANHATTAN (17)</a>, <a onClick="SetCookie('codigolinea_buscar','357');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','532');" href="javascript:void(0)" class="busquedas">PERFECT CHOICES (32)</a>, <a onClick="SetCookie('codigolinea_buscar','357');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','212');" href="javascript:void(0)" class="busquedas">SOLIDEX (1)</a>, <a onClick="SetCookie('codigolinea_buscar','357');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','586');" href="javascript:void(0)" class="busquedas">TARGUS (1)</a>, <a onClick="SetCookie('codigolinea_buscar','357');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','691');" href="javascript:void(0)" class="busquedas">TECH ZONE (1)</a><br>
<br>
<a onClick="SetCookie('codigomarca_buscar','');fijar_filtro('codigolinea_buscar','1302')" href="javascript:void(0)" class="busquedas"><b>Accesorios Para Ipod</b>(3)</a><br>
<a onClick="SetCookie('codigolinea_buscar','1302');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','602');" href="javascript:void(0)" class="busquedas">ACTECK DE MEXICO (1)</a>, <a onClick="SetCookie('codigolinea_buscar','1302');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','532');" href="javascript:void(0)" class="busquedas">PERFECT CHOICES (2)</a><br>
<br>
<a onClick="SetCookie('codigomarca_buscar','');fijar_filtro('codigolinea_buscar','1175')" href="javascript:void(0)" class="busquedas"><b>Accesorios Para Mesas</b>(3)</a><br>
<a onClick="SetCookie('codigolinea_buscar','1175');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','585');" href="javascript:void(0)" class="busquedas">MANHATTAN (2)</a>, <a onClick="SetCookie('codigolinea_buscar','1175');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','532');" href="javascript:void(0)" class="busquedas">PERFECT CHOICES (1)</a><br>
<br>
<a onClick="SetCookie('codigomarca_buscar','');fijar_filtro('codigolinea_buscar','292')" href="javascript:void(0)" class="busquedas"><b>Accesorios Para Redes</b>(13)</a><br>
<a onClick="SetCookie('codigolinea_buscar','292');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','635');" href="javascript:void(0)" class="busquedas">INTELLINET NETWORK SOLUTIONS (5)</a>, <a onClick="SetCookie('codigolinea_buscar','292');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','585');" href="javascript:void(0)" class="busquedas">MANHATTAN (8)</a><br>
<br>
<a onClick="SetCookie('codigomarca_buscar','');fijar_filtro('codigolinea_buscar','1378')" href="javascript:void(0)" class="busquedas"><b>Accesoriso Para Celulares</b>(14)</a><br>
<a onClick="SetCookie('codigolinea_buscar','1378');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','714');" href="javascript:void(0)" class="busquedas">BLACKBERRY (14)</a><br>
<br>
<a onClick="SetCookie('codigomarca_buscar','');fijar_filtro('codigolinea_buscar','1313')" href="javascript:void(0)" class="busquedas"><b>Adaptador Bluetooth</b>(6)</a><br>
<a onClick="SetCookie('codigolinea_buscar','1313');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','602');" href="javascript:void(0)" class="busquedas">ACTECK DE MEXICO (1)</a>, <a onClick="SetCookie('codigolinea_buscar','1313');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','585');" href="javascript:void(0)" class="busquedas">MANHATTAN (2)</a>, <a onClick="SetCookie('codigolinea_buscar','1313');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','532');" href="javascript:void(0)" class="busquedas">PERFECT CHOICES (3)</a><br>
<br>
<a onClick="SetCookie('codigomarca_buscar','');fijar_filtro('codigolinea_buscar','555')" href="javascript:void(0)" class="busquedas"><b>Adaptadores Para Mouse Y Teclado</b>(3)</a><br>
<a onClick="SetCookie('codigolinea_buscar','555');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','585');" href="javascript:void(0)" class="busquedas">MANHATTAN (2)</a>, <a onClick="SetCookie('codigolinea_buscar','555');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','532');" href="javascript:void(0)" class="busquedas">PERFECT CHOICES (1)</a><br>
</td>
</tr>
</tbody>
</table>
So the question is: what do I add to my code to be able to follow every link, given that each one sets cookies through JavaScript?
Cookies used:
Name, Value range
codigoseccion_buscar, 11-30
codigomarca_buscar, 100-736
codigolinea_buscar, 15-1385

I managed to scrape the contents of one of those links by adding cookies in my Ruby code:
# set cookies
add_cookie(agent, 'http://www.site.com.mx', "codigoseccion_buscar=11; path=/; domain=www.site.com.mx")
add_cookie(agent, 'http://www.site.com.mx', "codigolinea_buscar=; path=/; domain=www.site.com.mx")
add_cookie(agent, 'http://www.site.com.mx', "codigomarca_buscar=; path=/; domain=www.site.com.mx")
add_cookie(agent, 'http://www.site.com.mx', "textobuscar=; path=/; domain=www.site.com.mx")
The weird thing was that adding only one of those cookies did not work. I had to add all of them, even though some have no value, because every link sets a cookie, so setting them all clears any previously saved value.
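A minimal sketch of that observation, assuming the four cookie names above are the complete set of filters the site reads (a guess from the snippet, not something the site confirms): build all four cookie strings every time and leave blank any filter that should not apply. The helper name filter_cookie_strings is made up for illustration.

```ruby
# Filter cookies taken from the snippet above; assumed to be the full set.
FILTER_COOKIES = %w[
  codigoseccion_buscar
  codigolinea_buscar
  codigomarca_buscar
  textobuscar
].freeze

# Hypothetical helper: returns one cookie string per filter.
# Filters missing from `overrides` are sent with an empty value so that
# any previously stored value is cleared.
def filter_cookie_strings(overrides = {}, domain = 'www.site.com.mx')
  FILTER_COOKIES.map do |name|
    "#{name}=#{overrides.fetch(name, '')}; path=/; domain=#{domain}"
  end
end

cookie_strings = filter_cookie_strings('codigoseccion_buscar' => '11')
puts cookie_strings
```

Each resulting string has the same shape as the ones passed to add_cookie above, so it could be handed to add_cookie(agent, 'http://www.site.com.mx', string) in a loop.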
Now I need to scrape those cookie values from the links, use them as variables, and loop over them. Can anybody help me?
<a onClick="SetCookie('codigomarca_buscar','');fijar_filtro('codigolinea_buscar','542')" href="javascript:void(0)" class="busquedas"><b>Accesorios Para Cableados De Patch Panels</b>(1)</a><br>
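One possible way to get those values, sketched with plain Ruby string handling: read the onClick text of a link and collect every name/value pair it mentions. The sketch treats fijar_filtro('name','value') the same as SetCookie('name','value'); that fijar_filtro also stores a cookie is an assumption based on how the calls are used in the table, and cookies_from_onclick is a made-up helper name.

```ruby
# Matches SetCookie('name','value') and fijar_filtro('name','value').
# Treating fijar_filtro as a cookie setter is an assumption.
COOKIE_CALL = /(?:SetCookie|fijar_filtro)\('([^']*)'\s*,\s*'([^']*)'\)/

# Hypothetical helper: returns a hash of cookie name => value for one link.
def cookies_from_onclick(onclick)
  onclick.scan(COOKIE_CALL).to_h
end

# onClick text copied from the subcategory link above.
onclick = "SetCookie('codigomarca_buscar','');fijar_filtro('codigolinea_buscar','542')"
filters = cookies_from_onclick(onclick)
p filters

# onClick text copied from a brand link in the table.
brand_onclick = "SetCookie('codigolinea_buscar','338');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','602');"
brand_filters = cookies_from_onclick(brand_onclick)
p brand_filters
```

In the real script the onClick strings would come from the parsed page, most likely via something like doc.css('a.busquedas') and each link's onclick attribute; each resulting hash can then be turned into cookies before requesting the product listing.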

Related

How to scrape the text of <li> and children

I am trying to scrape the content of <li> tags and within them.
The HTML looks like:
<div class="insurancesAccepted">
<h4>What insurance does he accept?*</h4>
<ul class="noBottomMargin">
<li class="first"><span>Aetna</span></li>
<li>
<a title="See accepted plans" class="insurancePlanToggle arrowUp">AvMed</a>
<ul style="display: block;" class="insurancePlanList">
<li class="last first">Open Access</li>
</ul>
</li>
<li>
<a title="See accepted plans" class="insurancePlanToggle arrowUp">Blue Cross Blue Shield</a>
<ul style="display: block;" class="insurancePlanList">
<li class="last first">Blue Card PPO</li>
</ul>
</li>
<li>
<a title="See accepted plans" class="insurancePlanToggle arrowUp">Cigna</a>
<ul style="display: block;" class="insurancePlanList">
<li class="first">Cigna HMO</li>
<li>Cigna PPO</li>
<li class="last">Great West Healthcare-Cigna PPO</li>
</ul>
</li>
<li class="last">
<a title="See accepted plans" class="insurancePlanToggle arrowUp">Empire Blue Cross Blue Shield</a>
<ul style="display: block;" class="insurancePlanList">
<li class="last first">Empire Blue Cross Blue Shield HMO</li>
</ul>
</li>
</ul>
</div>
The main issue is when I am trying to get content from:
doc.css('.insurancesAccepted li').text.strip
It displays all <li> text at once. I want "AvMed" and "Open Access" scraped at the same time with a relationship parameter so that I can insert it into my MySQL table with reference.
The problem is that doc.css('.insurancesAccepted li') matches all nested list items, not only direct descendants. To match only a direct descendant one should use a parent > child CSS rule. To accomplish your task you need to carefully assemble the result of the iteration:
doc = Nokogiri::HTML(html)
result = doc.css('div.insurancesAccepted > ul > li').each do |li|
  chapter = li.css('span').text.strip
  section = li.css('a').text.strip
  subsections = li.css('ul > li').map(&:text).map(&:strip)
  puts "#{chapter} ⇒ [ #{section} ⇒ [ #{subsections.join(', ')} ] ]"
  puts '=' * 40
end
Resulted in:
# Aetna ⇒ [ ⇒ [ ] ]
# ========================================
# ⇒ [ AvMed ⇒ [ Open Access ] ]
# ========================================
# ⇒ [ Blue Cross Blue Shield ⇒ [ Blue Card PPO ] ]
# ========================================
# ⇒ [ Cigna ⇒ [ Cigna HMO, Cigna PPO, Great West Healthcare-Cigna PPO ] ]
# ========================================
# ⇒ [ Empire Blue Cross Blue Shield ⇒ [ Empire Blue Cross Blue Shield HMO ] ]
# ========================================
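To go from that printed output to rows that can be inserted into MySQL with a reference to the parent, one option is to collect the data as insurer => plans and then flatten it into pairs. This is only a sketch: the hash below is typed in by hand from the HTML above, and the names accepted and rows are illustrative.

```ruby
# Sample data copied by hand from the HTML above; in the real script this
# hash would be filled inside the loop (insurer name => list of plan names).
accepted = {
  'Aetna' => [],
  'AvMed' => ['Open Access'],
  'Cigna' => ['Cigna HMO', 'Cigna PPO', 'Great West Healthcare-Cigna PPO']
}

# One [insurer, plan] pair per plan; insurers without plans get a nil plan.
rows = accepted.flat_map do |insurer, plans|
  plans.empty? ? [[insurer, nil]] : plans.map { |plan| [insurer, plan] }
end

rows.each do |insurer, plan|
  puts "#{insurer} | #{plan || '(no plans listed)'}"
end
```

Each pair keeps the plan next to the insurer it belongs to, so the insurer can be inserted first and its id reused for the plan rows.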

web scraping data to dashing dashboard

I am having trouble scraping a site to get some information. I am very green with Ruby but trying to learn as much as I can. The data will be sent to a Dashing.io dashboard using a Google Maps widget ( https://github.com/andmcgregor/dashing-map ) to show the current location of our delivery driver. This will be on a screen in our sales office so the staff know exactly where our driver is at a glance.
I have the following code:
require 'mechanize'
require 'open-uri'
require 'restclient'
agent = Mechanize.new
agent.get('http://ilogix.capitaltransport.com.au/iLogixLogin.asp')
agent.page.forms[0]["txtUsername"] = "someusername"
agent.page.forms[0]["txtPassword"] = "somepassword"
agent.page.forms[0].submit
page = agent.get('http://ilogix.capitaltransport.com.au/Home/iLogix_AlertInformation.asp?DisplayType=Comms')
url = puts page.parser.xpath('//table/tr/td/script')
which gets me here
irb(main):014:0* url = puts page.parser.xpath('//table/tr/td/script')
<script type="text/javascript">
//Set Table Headers
var myHeaders = ["Fleet","ID", "Name","Type","Status","Last Update","Location","Information"];
//Set Grid Object
var obj = new AW.UI.Grid;
obj.setId("myGrid"); // necessary for CSS rules
//Virtual mode
obj.setVirtualMode(true); // Enable virtual mode
//Set Header Text
obj.setHeaderText(myHeaders);
//Set Number of Columns and their width
//Set the Number of columns
obj.setColumnCount(8);
//Set Column Widths
obj.setColumnWidth(130, 0); // set width of the column-1 Vehicle Details
obj.setColumnWidth(45, 1); // set width of the column-3 Fleet
obj.setColumnWidth(100, 2); // set width of the column-4 ID
obj.setColumnWidth(100, 3); // set width of the column-5 DisplayName
obj.setColumnWidth(100, 4); // set width of the column-6 Status
obj.setColumnWidth(130, 5); // set width of the column-7 Last Comms
obj.setColumnWidth(220, 6); // set width of the column-8 Location
obj.setColumnWidth(250, 7); // set width of the column-9 Bad GPS
obj.setCellText("Clutch and Brake (WA)",0,0);
obj.setCellText("7518",1 ,0);
obj.setCellText("7518",2 ,0);
obj.setCellText("Tray (1 T)",3,0);
obj.setCellText("",4,0);
obj.setCellText("26 Jun 2015 15:51:46",5,0);
obj.setCellText("414-424 Beechboro Rd, Morley 6102",6,0);
obj.setCellText(":: since last Communication Activity",7,0);
//number of rows and Height
obj.setRowCount(1);
obj.setRowHeight(20);
//Write Object to the Screen
document.write(obj);
</script>
=> nil
irb(main):015:0>
What I would like is to pull just the location string, and I just can't figure it out. After that I think I will need to convert it to latitude and longitude for use with the widget, but I think http://www.rubygeocoder.com/ will do that.
Here is the HTML. If someone can help or point me in the right direction I would really appreciate it, because I am truly stuck. Thanks.
<html class=" aw-all aw-quirks aw-vista aw-png2" xmlns="http://www.w3.org/1999/xhtml">
<head></head>
<body class="Body">
<table cellspacing="0" cellpadding="0" border="0" align="center" style="width:100%;height:100%">
<tbody>
<tr style="height:100%">
<td>
<script type="text/javascript"></script>
<span id="myGrid" class="aw-system-control aw-grid-control aw-selectors-hidden " onselectstart="return false" oncontextmenu="AW(this,event)" hidefocus="true" tabindex="-1">
<span id="myGrid-box" class="aw-grid-box " onactivate="AW(this,event)">
<textarea id="myGrid-box-focus" class="aw-control-focus " onpaste="AW(this,event)" oncopy="AW(this,event)" oncut="AW(this,event)" onbeforecopy="AW(this,event)" onselectstart="AW(this,event)" onbeforedeactivate="AW(this,event)" tabindex="0"></textarea>
<span id="myGrid-scroll" class="aw-scroll-bars aw-scrollbars-none " onmousewheel="AW(this,event)" onresize="AW(this,event)" style="visibility: inherit;">
<span id="myGrid-scroll-box" class="aw-bars-box " onscroll="AW(this,event)"></span>
<span id="myGrid-scroll-content" class="aw-bars-content " style="right: 17px; bottom: 17px;">
<span id="myGrid-view" class="aw-hpanel-template ">
<span id="myGrid-view-box" class="aw-hpanel-box ">
<span id="myGrid-view-box-top" class="aw-hpanel-top " style="height:20px;visibility:inherit;"></span>
<span id="myGrid-view-box-middle" class="aw-hpanel-middle " style="top:20px;bottom:0px;">
<span id="myGrid-rows" class="aw-templates-list aw-grid-view ">
<span id="myGrid-rows-start" class="aw-view-top " style="height:0px;"></span>
<span id="myGrid-row-0" class="aw-templates-list aw-text-normal aw-grid-row aw-row-0 aw-rows-normal aw-alternate-even ">
<span id="myGrid-row-0-start" class="aw-row-start " style="width:0px;"></span>
<span id="myGrid-cell-0-0" class="aw-item-template aw-templates-cell aw-grid-cell aw-column-0 aw-cells-normal " title=""></span>
<span id="myGrid-cell-1-0" class="aw-item-template aw-templates-cell aw-grid-cell aw-column-1 aw-cells-normal " title=""></span>
<span id="myGrid-cell-2-0" class="aw-item-template aw-templates-cell aw-grid-cell aw-column-2 aw-cells-normal " title=""></span>
<span id="myGrid-cell-3-0" class="aw-item-template aw-templates-cell aw-grid-cell aw-column-3 aw-cells-normal " title=""></span>
<span id="myGrid-cell-4-0" class="aw-item-template aw-templates-cell aw-grid-cell aw-column-4 aw-cells-selected " title=""></span>
<span id="myGrid-cell-5-0" class="aw-item-template aw-templates-cell aw-grid-cell aw-column-5 aw-cells-normal " title=""></span>
<span id="myGrid-cell-6-0" class="aw-item-template aw-templates-cell aw-grid-cell aw-column-6 aw-cells-normal " title=""></span>
<span id="myGrid-cell-7-0" class="aw-item-template aw-templates-cell aw-grid-cell aw-column-7 aw-cells-normal " title=""></span>
<span id="myGrid-cell-8-0" class="aw-item-template aw-templates-cell aw-grid-cell aw-column-8 aw-cells-normal " title=""></span>
<span id="myGrid-row-0-end" class="aw-item-template aw-grid-cell aw-column-space "></span>
</span>
<span id="myGrid-rows-end" class="aw-item-template aw-row-selector aw-selector-space "></span>
</span>
</span>
<span id="myGrid-view-box-bottom" class="aw-hpanel-bottom " style="height:0px;display:none;"></span>
</span>
</span>
</span>
</span>
<span id="myGrid-box-sample" class="aw-row-sample aw-grid-row "></span>
</span>
</span>
</td>
</tr>
</tbody>
</table>
</body>
</html>
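Since the grid cells in the HTML are empty and the values only appear in the inline script, one possible approach is to read the location out of the script text with a regular expression. The sketch below assumes the location is always written by a setCellText call for column 6 (the "Location" header is the seventh entry in myHeaders) and that cell text never contains an escaped double quote; CELL_CALL and cells are illustrative names.

```ruby
# A shortened copy of the script text shown in the irb session above.
script = <<~JS
  obj.setCellText("Clutch and Brake (WA)",0,0);
  obj.setCellText("7518",1 ,0);
  obj.setCellText("26 Jun 2015 15:51:46",5,0);
  obj.setCellText("414-424 Beechboro Rd, Morley 6102",6,0);
JS

# Captures the cell text, column index, and row index of each call.
CELL_CALL = /obj\.setCellText\("([^"]*)"\s*,\s*(\d+)\s*,\s*(\d+)\s*\)/

# Map [column, row] => text so a cell can be looked up by position.
cells = script.scan(CELL_CALL).to_h do |text, col, row|
  [[col.to_i, row.to_i], text]
end

location = cells[[6, 0]]
puts location
```

In the original script the text would come from the script node already being selected, for example by calling .text on the result of page.parser.xpath('//table/tr/td/script') instead of passing it to puts.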

How to parse text within span tags using Nokogiri

I want to build an application displaying artists from a popular venue and want to extract only the artist's name.
Here is my code:
data.css('.headliner').each do |artist|
  puts artist
end
It's currently returning:
<span class="headliner"><span class="prepend"><i>Rescheduled Date</i></span><br>London Grammar</span>
<span class="headliner">Hozier</span>
<span class="headliner"><span class="prepend"><i>KFOG presents</i></span><br>Ben Howard<br><span class="append"><i>with special guest</i><br></span></span>
<span class="headliner">Dr. Dog</span>
Some elements have more than one span tag and I'm having trouble getting the data I want. All I want returned is the artist's name such as 'London Grammar', 'Hozier', 'Ben Howard', and 'Dr. Dog'.
Currently, when I run artist.text it returns "Rescheduled DateLondon Grammar" and so on.
<table class="concert_calendar" cellspacing="0" width="720" style="margin-top:35px;">
<tbody><tr><td class="noborder"><img src="images/title_date2.gif" alt="Date"></td>
<td class="noborder" colspan="2"><img src="images/title_show2.gif" alt="Show"></td>
<td class="noborder"><img src="images/title_time2.gif" alt="Time"></td>
<td class="noborder"><img src="images/title_tickets2.gif" alt="Tickets"></td></tr>
<tr><td colspan="5" class="noborder"><hr size="1" color="#550818" noshade="" style="margin:0px; padding:0px;"></td></tr>
<tr><td style="width:100px;" class="">Saturday,<br>February 7</td>
<td style="width:115px;" valign="top" class=""><img src="http://www.apeconcerts.com/concertimages/LondonGrammar_100.jpg" alt="London Grammar"></td>
<td valign="top" style="width:345px; padding-right:10px;" class="">
<a href="popartist.php?cID=4600&KeepThis=true&TB_iframe=true&height=600&width=475" style="text-decoration:none;" class="thickbox">
<span class="headliner"><span class="prepend"><i>Rescheduled Date</i></span><br>London Grammar</span></a>
<div><span class="warmup">Until The Ribbon Breaks</span><br>
<span class="warmup"></span></div></td>
<td style="width:80px;">show<br>8:00PM</td>
<td style="width:80px;">
<img src="images/cal_soldout.gif" alt="SOLD OUT - Thank you!"> </td></tr>
<tr><td style="width:100px;">Tuesday,<br>February 10</td>
<td style="width:115px;" valign="top"><img src="http://www.apeconcerts.com/concertimages/Hozier_1001.jpg" alt="Hozier"></td>
<td valign="top" style="width:345px; padding-right:10px;" class="">
<a href="popartist.php?cID=4733&KeepThis=true&TB_iframe=true&height=600&width=475" style="text-decoration:none;" class="thickbox">
<span class="headliner">Hozier</span></a>
<div class=""><span class="warmup">Ásgeir</span><br>
<span class="warmup"></span></div></td>
<td style="width:80px;">show<br>8:00PM</td>
<td style="width:80px;">
<img src="images/cal_soldout.gif" alt="SOLD OUT - Thank you!"> </td></tr>
All I want returned is the artist's name such as 'London Grammar',
'Hozier', 'Ben Howard', and 'Dr. Dog'
Here's one way:
require 'nokogiri'
html = %q{
<span class="headliner"><span class="prepend"><i>Rescheduled Date</i></span><br>London Grammar</span>
<span class="headliner">Hozier</span>
<span class="headliner"><span class="prepend"><i>KFOG presents</i></span><br>Ben Howard<br><span class="append"><i>with special guest</i><br></span></span>
<span class="headliner">Dr. Dog</span>
}
html_doc = Nokogiri::HTML(html)
headliners = html_doc.css('.headliner')
headliners.each do |headliner|
  headliner.css('i').each do |i|
    i.content = ''
  end
  puts headliner.text
end
--output:--
London Grammar
Hozier
Ben Howard
Dr. Dog
If all you're trying to do is remove the <i> tag's content, then just remove the tags entirely:
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<span class="headliner"><span class="prepend"><i>Rescheduled Date</i></span><br>London Grammar</span>
<span class="headliner">Hozier</span>
<span class="headliner"><span class="prepend"><i>KFOG presents</i></span><br>Ben Howard<br><span class="append"><i>with special guest</i><br></span></span>
<span class="headliner">Dr. Dog</span>
EOT
doc.search('.headliner i').map(&:remove)
puts doc.to_html
# >> <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
# >> <html><body>
# >> <span class="headliner"><span class="prepend"></span><br>London Grammar</span>
# >> <span class="headliner">Hozier</span>
# >> <span class="headliner"><span class="prepend"></span><br>Ben Howard<br><span class="append"><br></span></span>
# >> <span class="headliner">Dr. Dog</span>
# >> </body></html>
At that point it's really easy to iterate over the .headliner tags and output their content:
puts doc.search('.headliner').map(&:text)
# >> London Grammar
# >> Hozier
# >> Ben Howard
# >> Dr. Dog
I'd probably do it a little different for a big page consisting of a lot of tags matching .headliner but this is sufficient for normal pages.
See "How to avoid joining all text from Nodes when scraping" also.

Array/loop behaviour

I have a dataset of three shops (Winkel1-3) and I would like to extract the addresses. What I've built extracts all the names and then all the addresses instead of pairing each name with its address. I'm sure I've built a flawed loop, but I can't figure out what to change.
My dataset:
<ul id="itemsList">
<li class="citem ">
<a alt="Winkel 1" href="/Zuid-Holland/Delft/Winkel1">Winkel1</a>
Buitenwatersloot 51,2613TB
</li>
<li class="citem ">
<a alt="Winkel 2" href="/Zuid-Holland/Delft/Winkel2">Winkel 2</a>
Laan van Van der Gaag 75,2627BX
</li>
<li class="citem ">
<a alt="Winkel 3" href="/Zuid-Holland/Delft/Winkel3">Winkel 3</a>
Achterom 89,2611PM
</li>
</ul>
My scraper:
class DmozSpider(Spider):
    name = "dmoz"
    allowed_domains = ["mydomain.nl"]
    start_urls = [
        "http://www.mydomaintestdata.nl/Zuid-Holland/Delft"
    ]
    def parse(self, response):
        sel = Selector(response)
        sites = sel.xpath('//ul[@id="itemsList"]/li')
        loop = sel.xpath('/html')
        for site in loop:
            adres = sites.xpath('.//a/text()').extract(), sites.xpath('text()').extract()
            print adres
This returns two arrays:
[Winkel1, Winkel2, Winkel3],['Buitenwatersloot 51,2613TB','Laan van Van der Gaag 75,2627BX','Achterom 89,2611PM']
What I would like:
[Winkel1,'Buitenwatersloot 51,2613TB'],[Winkel2, 'Laan van Van der Gaag 75,2627BX'],[Winkel3, 'Achterom 89,2611PM']
Iterate over the li elements and get the link text and the remaining text for each li inside the loop:
sites = sel.xpath('//ul[@id="itemsList"]/li')
for site in sites:
    print site.xpath('./a/text()').extract(), site.xpath('text()').extract()

How can I create a custom xpath query?

This is my HTML file data:
<article class='course-box'>
<div class='row-fluid'>
<div class='span2'>
<div class='course-cover' style='width: 100%'>
<img alt='' src='https://d2d6mu5qcvgbk5.cloudfront.net/courses/cover_photos/c4f5fd2efb200e71d09014970cf0b8c86e1e7013.png?1375831955'>
</div>
</div>
<div class='span10'>
<h2 class='coursetitle'>
<a href='https://novoed.com/hc'>Hippocrates Challenge</a>
</h2>
<figure class='pricetag'>
Free
</figure>
<div class='timeline independent-text'>
<div class='timeline inline-block'>
Starting Spring 2014
</div>
</div>
By Jill Helms
<div class='university' style='margin-top:0px; font-style:normal;'>
Stanford University
</div>
</div>
</div>
<div class='hovered row-fluid' onclick="location.href='https://novoed.com/hc'">
<div class='span2'>
<div class='course-cover'>
<img alt='' src='https://d2d6mu5qcvgbk5.cloudfront.net/courses/cover_photos/c4f5fd2efb200e71d09014970cf0b8c86e1e7013.png?1375831955' style='width: 100%'>
</div>
</div>
<div class='span10'>
<h2 class='coursetitle' style='margin-top: 10px'>
<a href='https://novoed.com/hc'>
Hippocrates Challenge
</a>
</h2>
<p class='description' style='width: 70%'>
Hippocrates Challenge 2014 is a course designed for anyone with an interest in medicine. The course focuses on teaching anatomy in an interactive way, students will learn about diagnosis and treatment planning while...
</p>
<div style='margin-right: 10px'>
<a class='btn action-btn novoed-primary' href='https://novoed.com/users/sign_up?class=hc'>
Sign Up
</a>
</div>
</div>
</div>
From the code above I need to fetch the values of the following tag classes:
coursetitle
coursetitle href link
pricetag
timeline inline-block
university
description
instructor name
The coursetitle appears in two places, but I need it only once. Also, the instructor name is not wrapped in any specific tag that I could fetch.
My XPath queries are:
novoedData = HtmlXPathSelector(response)
courseTitle = novoedData.xpath('//div[re:test(@class, "row-fluid")]/div[re:test(@class, "span10")]/h2[re:test(@class, "coursetitle")]/a/text()').extract()
courseDetailLink = novoedData.xpath('//div[re:test(@class, "row-fluid")]/div[re:test(@class, "span10")]/h2[re:test(@class, "coursetitle")]/a/@href').extract()
courseInstructorName = novoedData.xpath('//div[re:test(@class, "row-fluid")]/div[re:test(@class, "span10")]/text()').extract()
coursePriceType = novoedData.xpath('//div[re:test(@class, "row-fluid")]/div[re:test(@class, "span10")]/figure[re:test(@class, "pricetag")]/text()').extract()
courseShortSummary = novoedData.xpath('//div[re:test(@class, "hovered row-fluid")]/div[re:test(@class, "span10")]/p[re:test(@class, "description")]/text()').extract()
courseUniversity = novoedData.xpath('//div[re:test(@class, "row-fluid")]/div[re:test(@class, "span10")]/div[re:test(@class, "university")]/text()').extract()
But the number of values in each list variable is different:
len(courseTitle) = 40 (two times because of repetition)
len(courseDetailLink) = 40 (two times because of repetition)
len(courseInstructorName) = 160 (some unwanted character is coming because no specific tag for this value)
len(coursePriceType) = 20 (correct count no repetition)
len(courseShortSummary)= 20 (correct count no repetition)
len(courseUniversity) = 20 (correct count no repetition)
Kindly modify my XPath queries to solve my problem. Thanks in advance.
You don't need re:test; simply do:
>>> s = sel.xpath('//div[@class="row-fluid"]/div[@class="span10"]')[0]
>>> len(s)
1
>>> s.xpath('h2[@class="coursetitle"]/a/@href').extract()
[u'https://novoed.com/hc']
Also note that once s points at the right place, you can continue from it with relative queries.
