everyone.
I need to parse a webpage which has java cookies set for every link. I can parse the normal search and every product is shown and imported to a mysql database.
I was able to scrape from a search result every product with its elements with this code:
This is what I have:
require 'rubygems'
require 'logger'
require 'mechanize'
require 'mysql2'
agent = WWW::Mechanize.new{|a| a.log = Logger.new(STDERR) }
#agent.set_proxy('a-proxy', '8080')
agent.read_timeout = 60
def add_cookie(agent, uri, cookie)
uri = URI.parse(uri)
Mechanize::Cookie.parse(uri, cookie) do |cookie|
agent.cookie_jar.add(uri, cookie)
end
end
# get main page
page = agent.get "http://www.site.com.mx"
# get login form
form = page.forms.first
form.correo_ingresar = "user"
form.password = "password"
# submit login form
page = agent.submit form
# parse cookies
myarray = page.body.scan(/SetCookie\(\"(.+)\", \"(.+)\"\)/)
# set session cookies
myarray.each do |item|
add_cookie(agent, 'http://www.site.com.mx', "#{item[0]}=#{item[1]}; path=/; domain=www.site.com.mx")
end
# show 1000 search results per page
add_cookie(agent, 'http://www.site.com.mx', "tampag=1000; path=/; domain=www.site.com.mx")
# order results
add_cookie(agent, 'http://www.site.com.mx', "orden_articulos=existencias asc; path=/; domain=www.site.com.mx")
# section results
add_cookie (agent, 'http://www.site.com.mx', "codigoseccion_buscar=14; path=/; domain=www.site.com.mx")
# get main page
page = agent.get "http://www.site.com.mx/tienda/index.php"
search_form = page.forms.first
search_result = agent.submit search_form
doc = Nokogiri::HTML(search_result.body)
rows = doc.css("table.articulos tr")
i = 0
details = rows.collect do |row|
detail = {}
[
[:sku, 'td[3]/text()'],
[:desc, 'td[4]/text()'],
[:qty, 'td[5]/text()'],
[:qty2, 'td[5]/p/b/text()'],
[:price, 'td[6]/text()']
].collect do |name, xpath|
detail[name] = row.at_xpath(xpath).to_s.strip
end
i = i + 1
detail
end
# walk through paginator links
links = doc.css("a.paginar").map {|l| "http://www.site.com.mx#{l['href']}"}.uniq!
links.each do |l|
page = agent.get l
doc = Nokogiri::HTML(page.body)
rows = doc.css("table.articulos tr")
rows.each do |row|
detail = {}
[
[:sku, 'td[3]/text()'],
[:desc, 'td[4]/text()'],
[:qty, 'td[5]/text()'],
[:qty2, 'td[5]/p/b/text()'],
[:price, 'td[6]/text()']
].collect do |name, xpath|
detail[name] = row.at_xpath(xpath).to_s.strip
end
details << detail
end
end
# update db
client = Mysql2::Client.new(:host => "localhost", :username => "myusername", :password => "mypassword", :database => "mydatabase")
details.each do |d|
if d[:sku] != ""
price = d[:price].split
if price[1] == "D"
currency = 144
else
currency = 168
end
cost = price[0].gsub(",", "").to_f
if d[:qty] == ""
qty = d[:qty2]
else
qty = d[:qty]
end
results = client.query("SELECT * FROM jos_vm_product WHERE product_sku = '#{d[:sku]}' LIMIT 1;")
if results.count == 1
product = results.first
client.query("UPDATE jos_vm_product SET product_sku = '#{d[:sku]}', product_name = '#{d[:desc]}', product_desc = '#{d[:desc]}', product_in_stock = '#{qty}' WHERE product_id =
#{product['product_id']};")
client.query("UPDATE jos_vm_product_price SET product_price = '#{cost}', product_currency = '#{currency}' WHERE product_id = '#{product['product_id']}';")
else
client.query("INSERT INTO jos_vm_product(product_sku, product_name, product_desc, product_in_stock) VALUES('#{d[:sku]}', '#{d[:desc]}', '#{d[:desc]}', '#{qty}');")
last_id = client.last_id
client.query("INSERT INTO jos_vm_product_price(product_id, product_price, product_currency) VALUES('#{last_id}', '#{cost}', #{currency});")
end
end
end
Now I dont want to search I want to parse from the Categories list:
link to main page:http://www.site.com.mx/tienda/articulos.php?opcion=lineas&seccion_mostrar=11
this shows a table like this (everything contains links)
The top name: ACCESORIOS is a link to the category ACCESORIOS, and the bold names listed bellow is the subcategories, and the ones bellow the bold names are brands. If I click on ACCESORIOS it will show every brand and every subcategory mixed up, and so on.
ACCESORIOS
Accesorios Multimedia(6)
ACTECK DE MEXICO (5), MANHATTAN (1)
Accesorios P/impres. Punto De Venta(1)
EPSON CORPORATION (1)
Accesorios Para Cableados De Patch Panels(1)
INTELLINET NETWORK SOLUTIONS (1)
Accesorios Para Camaras Digitales(1)
MANHATTAN (1)
Accesorios Para Computadoras De Escritorio(32)
ACTECK DE MEXICO (2), GENERICA (1), MANHATTAN (28), TARGUS (1)
Accesorios Para Computadoras Portatiles(60)
ACTECK DE MEXICO (3), GENIUS (2), HP COMERCIAL (2), HP IMPRESION (1), MANHATTAN (17), PERFECT CHOICES (32), SOLIDEX (1), TARGUS (1), TECH ZONE (1)
Accesorios Para Ipod(3)
ACTECK DE MEXICO (1), PERFECT CHOICES (2)
Accesorios Para Mesas(3)
MANHATTAN (2), PERFECT CHOICES (1)
Accesorios Para Redes(13)
INTELLINET NETWORK SOLUTIONS (5), MANHATTAN (8)
Accesoriso Para Celulares(14)
BLACKBERRY (14)
Adaptador Bluetooth(6)
ACTECK DE MEXICO (1), MANHATTAN (2), PERFECT CHOICES (3)
Adaptadores Para Mouse Y Teclado(3)
MANHATTAN (2), PERFECT CHOICES (1)
Audifono/diademas Y Microfonos(49)
ACTECK DE MEXICO (14), BTO (1), GENIUS (3), LOGITECH (2), MANHATTAN (11), PERFECT CHOICES (18)
Here is the code for the Table that has cookies for each link, that is why I have been having a hard time scraping this.
<table width="95%" cellspacing="0" cellpadding="3" border="0">
<tbody>
<tr>
<td valign="top" align="left" style="font-family: verdana; font-size: 12px" colspan="2"><a onClick="fijar_filtro('codigoseccion_buscar','11')" href="javascript:void(0)" class="busquedas"><b>ACCESORIOS</b></a></td>
</tr>
<tr>
<td width="20" valign="top" align="left"></td>
<td valign="top" align="left" style="font-family: verdana; font-size: 12px"><a onClick="SetCookie('codigomarca_buscar','');fijar_filtro('codigolinea_buscar','338')" href="javascript:void(0)" class="busquedas"><b>Accesorios Multimedia</b>(6)</a><br>
<a onClick="SetCookie('codigolinea_buscar','338');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','602');" href="javascript:void(0)" class="busquedas">ACTECK DE MEXICO (5)</a>, <a onClick="SetCookie('codigolinea_buscar','338');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','585');" href="javascript:void(0)" class="busquedas">MANHATTAN (1)</a><br>
<br>
<a onClick="SetCookie('codigomarca_buscar','');fijar_filtro('codigolinea_buscar','540')" href="javascript:void(0)" class="busquedas"><b>Accesorios P/impres. Punto De Venta</b>(1)</a><br>
<a onClick="SetCookie('codigolinea_buscar','540');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','106');" href="javascript:void(0)" class="busquedas">EPSON CORPORATION (1)</a><br>
<br>
<a onClick="SetCookie('codigomarca_buscar','');fijar_filtro('codigolinea_buscar','542')" href="javascript:void(0)" class="busquedas"><b>Accesorios Para Cableados De Patch Panels</b>(1)</a><br>
<a onClick="SetCookie('codigolinea_buscar','542');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','635');" href="javascript:void(0)" class="busquedas">INTELLINET NETWORK SOLUTIONS (1)</a><br>
<br>
<a onClick="SetCookie('codigomarca_buscar','');fijar_filtro('codigolinea_buscar','361')" href="javascript:void(0)" class="busquedas"><b>Accesorios Para Camaras Digitales</b>(1)</a><br>
<a onClick="SetCookie('codigolinea_buscar','361');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','585');" href="javascript:void(0)" class="busquedas">MANHATTAN (1)</a><br>
<br>
<a onClick="SetCookie('codigomarca_buscar','');fijar_filtro('codigolinea_buscar','277')" href="javascript:void(0)" class="busquedas"><b>Accesorios Para Computadoras De Escritorio</b>(32)</a><br>
<a onClick="SetCookie('codigolinea_buscar','277');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','602');" href="javascript:void(0)" class="busquedas">ACTECK DE MEXICO (2)</a>, <a onClick="SetCookie('codigolinea_buscar','277');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','530');" href="javascript:void(0)" class="busquedas">GENERICA (1)</a>, <a onClick="SetCookie('codigolinea_buscar','277');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','585');" href="javascript:void(0)" class="busquedas">MANHATTAN (28)</a>, <a onClick="SetCookie('codigolinea_buscar','277');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','586');" href="javascript:void(0)" class="busquedas">TARGUS (1)</a><br>
<br>
<a onClick="SetCookie('codigomarca_buscar','');fijar_filtro('codigolinea_buscar','357')" href="javascript:void(0)" class="busquedas"><b>Accesorios Para Computadoras Portatiles</b>(60)</a><br>
<a onClick="SetCookie('codigolinea_buscar','357');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','602');" href="javascript:void(0)" class="busquedas">ACTECK DE MEXICO (3)</a>, <a onClick="SetCookie('codigolinea_buscar','357');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','167');" href="javascript:void(0)" class="busquedas">GENIUS (2)</a>, <a onClick="SetCookie('codigolinea_buscar','357');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','694');" href="javascript:void(0)" class="busquedas">HP COMERCIAL (2)</a>, <a onClick="SetCookie('codigolinea_buscar','357');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','107');" href="javascript:void(0)" class="busquedas">HP IMPRESION (1)</a>, <a onClick="SetCookie('codigolinea_buscar','357');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','585');" href="javascript:void(0)" class="busquedas">MANHATTAN (17)</a>, <a onClick="SetCookie('codigolinea_buscar','357');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','532');" href="javascript:void(0)" class="busquedas">PERFECT CHOICES (32)</a>, <a onClick="SetCookie('codigolinea_buscar','357');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','212');" href="javascript:void(0)" class="busquedas">SOLIDEX (1)</a>, <a onClick="SetCookie('codigolinea_buscar','357');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','586');" href="javascript:void(0)" class="busquedas">TARGUS (1)</a>, <a onClick="SetCookie('codigolinea_buscar','357');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','691');" href="javascript:void(0)" class="busquedas">TECH ZONE (1)</a><br>
<br>
<a onClick="SetCookie('codigomarca_buscar','');fijar_filtro('codigolinea_buscar','1302')" href="javascript:void(0)" class="busquedas"><b>Accesorios Para Ipod</b>(3)</a><br>
<a onClick="SetCookie('codigolinea_buscar','1302');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','602');" href="javascript:void(0)" class="busquedas">ACTECK DE MEXICO (1)</a>, <a onClick="SetCookie('codigolinea_buscar','1302');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','532');" href="javascript:void(0)" class="busquedas">PERFECT CHOICES (2)</a><br>
<br>
<a onClick="SetCookie('codigomarca_buscar','');fijar_filtro('codigolinea_buscar','1175')" href="javascript:void(0)" class="busquedas"><b>Accesorios Para Mesas</b>(3)</a><br>
<a onClick="SetCookie('codigolinea_buscar','1175');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','585');" href="javascript:void(0)" class="busquedas">MANHATTAN (2)</a>, <a onClick="SetCookie('codigolinea_buscar','1175');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','532');" href="javascript:void(0)" class="busquedas">PERFECT CHOICES (1)</a><br>
<br>
<a onClick="SetCookie('codigomarca_buscar','');fijar_filtro('codigolinea_buscar','292')" href="javascript:void(0)" class="busquedas"><b>Accesorios Para Redes</b>(13)</a><br>
<a onClick="SetCookie('codigolinea_buscar','292');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','635');" href="javascript:void(0)" class="busquedas">INTELLINET NETWORK SOLUTIONS (5)</a>, <a onClick="SetCookie('codigolinea_buscar','292');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','585');" href="javascript:void(0)" class="busquedas">MANHATTAN (8)</a><br>
<br>
<a onClick="SetCookie('codigomarca_buscar','');fijar_filtro('codigolinea_buscar','1378')" href="javascript:void(0)" class="busquedas"><b>Accesoriso Para Celulares</b>(14)</a><br>
<a onClick="SetCookie('codigolinea_buscar','1378');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','714');" href="javascript:void(0)" class="busquedas">BLACKBERRY (14)</a><br>
<br>
<a onClick="SetCookie('codigomarca_buscar','');fijar_filtro('codigolinea_buscar','1313')" href="javascript:void(0)" class="busquedas"><b>Adaptador Bluetooth</b>(6)</a><br>
<a onClick="SetCookie('codigolinea_buscar','1313');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','602');" href="javascript:void(0)" class="busquedas">ACTECK DE MEXICO (1)</a>, <a onClick="SetCookie('codigolinea_buscar','1313');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','585');" href="javascript:void(0)" class="busquedas">MANHATTAN (2)</a>, <a onClick="SetCookie('codigolinea_buscar','1313');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','532');" href="javascript:void(0)" class="busquedas">PERFECT CHOICES (3)</a><br>
<br>
<a onClick="SetCookie('codigomarca_buscar','');fijar_filtro('codigolinea_buscar','555')" href="javascript:void(0)" class="busquedas"><b>Adaptadores Para Mouse Y Teclado</b>(3)</a><br>
<a onClick="SetCookie('codigolinea_buscar','555');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','585');" href="javascript:void(0)" class="busquedas">MANHATTAN (2)</a>, <a onClick="SetCookie('codigolinea_buscar','555');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','532');" href="javascript:void(0)" class="busquedas">PERFECT CHOICES (1)</a><br>
</td>
</tr>
</tbody>
</table>
so the question is what do I add to my code to be able to access every link? if it uses java cookies.
Cookies used:
Name , Value Ranges
codigoseccion_buscar, 11-30
codigomarca_buscar, 100-736
codigolinea_buscar, 15-1385
I managed to scrape one of those links contents by adding cookies to my Ruby code:
# set cookies
add_cookie(agent, 'http://www.site.com.mx', "codigoseccion_buscar=11; path=/; domain=www.site.com.mx")
add_cookie(agent, 'http://www.site.com.mx', "codigolinea_buscar=; path=/; domain=www.site.com.mx")
add_cookie(agent, 'http://www.site.com.mx', "codigomarca_buscar=; path=/; domain=www.site.com.mx")
add_cookie(agent, 'http://www.site.com.mx', "textobuscar=; path=/; domain=www.site.com.mx")
weird thing was that if I only added one of those cookies it would not work. so I had to add all , even tho they dont have any values, because every link has a cookie, so that way it would delete or clear saved cookie.
now I need to scrape those cookies use it as variable and do a loop or something, anybody can help me?
<a onClick="SetCookie('codigomarca_buscar','');fijar_filtro('codigolinea_buscar','542')" href="javascript:void(0)" class="busquedas"><b>Accesorios Para Cableados De Patch Panels</b>(1)</a><br>