Concept for recipe-based parsing of webpages needed - ruby

I'm working on a web-scraping solution that grabs totally different webpages and lets the user define rules/scripts in order to extract information from the page.
I started by scraping a single domain and built a parser based on Nokogiri.
Basically everything works fine.
I could now add a ruby class each time somebody wants to add a webpage with a different layout/style.
Instead I thought about using an approach where the user specifies elements where content is stored using xpath and storing this as a sort of recipe for this webpage.
Example: The user wants to scrape a table-structure extracting the rows using a hash (column-name => cell-content)
I was thinking about writing a ruby function for extraction of this generic table information once:
require 'nokogiri'

# Extracts a table's rows as an array of hashes (column_name => cell content).
# html - the HTML document as a string
# xpath_table - XPath of the HTML table that holds the data to be extracted
def basic_table(html, xpath_table)
  html_doc = Nokogiri::HTML(html)
  row_headers = html_doc.xpath("#{xpath_table}/thead/tr/th").map(&:inner_text)

  row_contents = []
  # Double quotes are required here so that xpath_table is interpolated.
  html_doc.xpath("#{xpath_table}/tbody/tr").each do |table_row|
    cells = table_row.xpath('td').map(&:inner_text)
    row_content_hash = {}
    cells.each_with_index do |cell_string, column_index|
      row_content_hash[row_headers[column_index]] = cell_string
    end
    row_contents << row_content_hash
  end
  row_contents
end
The user could now specify a website-recipe-file like this:
<basic_table xpath='//div[@id="grid"]/table[@id="displayGrid"]' />
The function basic_table is referenced here, so that by parsing the website-recipe-file I would know that I can use the function basic_table to extract the content from the table referenced by the xPath.
This way the user can specify simple recipe-scripts and only has to dive into writing actual code if he needs a new way of extracting information.
The code would not change every time a new webpage needs to be parsed.
Whenever the structure of a webpage changes only the recipe-script would need to be changed.
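One way the recipe idea could be wired up is a small registry that maps the name used in the recipe to a Ruby callable. The sketch below is only an illustration under assumptions: it stores the recipe as YAML rather than the XML-like tag shown above, and the names EXTRACTORS, register_extractor, run_recipe and the label key are all hypothetical.

```ruby
require 'yaml'

# Hypothetical registry: recipe step name => callable taking (html, xpath).
EXTRACTORS = {}

def register_extractor(name, &block)
  EXTRACTORS[name.to_s] = block
end

# Runs every step of a recipe against one HTML string and returns
# a hash of step label => extracted data.
def run_recipe(recipe_yaml, html)
  steps = YAML.safe_load(recipe_yaml)
  steps.each_with_object({}) do |step, results|
    name = step.fetch('extractor')
    extractor = EXTRACTORS.fetch(name) do
      raise ArgumentError, "unknown extractor: #{name}"
    end
    results[step.fetch('label', name)] = extractor.call(html, step.fetch('xpath'))
  end
end

# Example recipe; this YAML layout is an assumption, not an existing format.
RECIPE = <<~YAML
  - extractor: basic_table
    label: grid
    xpath: '//div[@id="grid"]/table[@id="displayGrid"]'
YAML
```

With this in place, basic_table could be registered once via register_extractor('basic_table') { |html, xpath| basic_table(html, xpath) }, and a new page layout would only need a new recipe file passed to run_recipe. A new kind of extraction would still need a new registered block.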
I was thinking that someone might be able to tell me how he would approach this. Rules/rule engines pop into my mind, but I'm not sure if that really is the solution to my problem.
Somehow I have the feeling that I don't want to "invent" my own solution to handle this problem.
Does anybody have a suggestion?
J.

Related

Posting data on website using Mechanize Nokogiri Selenium

I need to post data on a website through a program.
To achieve this I am using Mechanize Nokogiri and Selenium.
Here's my code :
def aeiexport
# first Mechanize is submitting the form to identify yourself on the website
agent = Mechanize.new
agent.get("https://www.glou.com")
form_login_AEI = agent.page.forms.first
form_login_AEI.util_vlogin = "42"
form_login_AEI.util_vpassword = "666"
# this is suppose to submit the form I think
page_compet_list = agent.submit(form_login_AEI, form_login_AEI.buttons.first)
#to be able to scrap the page you end up on after submitting form
body = page_compet_list.body
html_body = Nokogiri::HTML(body)
#tds give back an array of td
tds = html_body.css('.L1').xpath("//table/tbody/tr[position()>1]/td")
# Checking my array of td with some condition
tds.each do |td|
link = td.children.first # Select the first children
if link.html = "2015 32 92 0076 012"
# Only consider the html part of the link, if matched follow the previous link
previous_td = td.previous
previous_url = previous_td.children.first.href
#following the link contained in previous_url
page_selected_compet = agent.get(previous_url)
# to be able to scrap the page I end up on
body = page_selected_compet.body
html_body = Nokogiri::HTML(body)
joueur_access = html_body.search('#tabs0head2 a')
# clicking on the link
joueur_access.click
rechercher_par_numéro_de_licence = html_body.css('.L1').xpath("//table/tbody/tr/td[1]/a[1]")
pure_link_rechercher_par_numéro_de_licence = rechercher_par_numéro_de_licence['href']
#following pure_link_rechercher_par_numéro_de_licence
page_submit_licence = agent.get(pure_link_rechercher_par_numéro_de_licence)
body_submit_licence = page_submit_licence.body
html_body = Nokogiri::HTML(body_submit_licence)
#posting my data in the right field
form.field_with(:name => 'lic_cno[0]') == "9511681"
1) What do you think about this code so far? Do you see any errors in it?
2) This part is the one I am really not sure about : I have posted my data in the right field but now I need to submit it. The problem is that the button I need to click is like this:
<input type="button" class="button" onclick="dispatchAndSubmit(document.JoueurRechercheForm, 'rechercher');" value="Rechercher">
It triggers a JavaScript function on click. I am trying to use Selenium to trigger the click event. I then end up on another page, where I need to click a few more times. I tried this:
driver.find_element(:value=> 'Rechercher').click
driver.find_element(:name=> 'sel').click
driver.find_element(:value=> 'Sélectionner').click
driver.find_element(:value=> 'Inscrire').click
But so far I have not succeeded in posting the data.
Could you please tell me whether Selenium will enable me to do what I need to do? If so, how can I do it?
At a glance, your code could use less indentation and more white space/empty lines to separate the internal logic of aeiexport (which should be renamed to aei_export, since Ruby uses snake case for method names). You can find more recommendations on how to style Ruby code in a Ruby style guide.
Besides the style of your code, an error I found at the beginning of your method is using an undefined variable page when defining form_login_AEI.
For your second question, I'm not familiar with Selenium; however since it does use a real web browser it can handle JavaScript. Watir is another possible solution.
An alternative would be to view the page source (e.g. in Firebug) and understand what the JavaScript on the page does. Then use Mechanize to follow the link manually.

Nokogiri: Filling in a default value for empty table cells

I'm trying to scrape the cell values from an HTML table. Randomly, some of these cells are empty, and I can't guess which ones with any reliability.
Is there a way to fill a default value in for Nokogiri when it comes across an empty cell?
Thanks for any advice you can provide. Here's my code:
def scrape_stats
  stats = []
  (2002..2012).each do |year|
    url = "website/#{year}"
    doc = Nokogiri::HTML(open(url))
    rows = doc.at_css("body tbody").text.split(" ")
    (rows.count / 25).times do |i| # there are 25 columns per row
      stats << rows.shift(25)
    end
  end
  stats
end
It sounds like you want something like:
doc.search('td:empty').each{|n| n.content = 'default value'}
This would basically involve using the Nokogiri::XML::Node#add_child method (or the shorter version, Nokogiri::XML::Node#<<) to add a new child node containing the text you want to add to the empty cell.
See this question for an example:
How to add child nodes in NodeSet using Nokogiri
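Another option is to fill the defaults after extraction instead of editing the document. The sketch below assumes the cells are read row by row, for example with doc.css('tbody tr').map { |tr| tr.css('td').map(&:text) }, which keeps an empty cell as an empty string (splitting the whole table text on spaces loses it). The helper name fill_blank_cells and its default: and width: parameters are hypothetical.

```ruby
# Replace blank cell strings with a default value. When width is given,
# short rows are padded so that every row has the same number of columns.
def fill_blank_cells(rows, default: 'N/A', width: nil)
  rows.map do |cells|
    padded = width ? cells + [nil] * [width - cells.length, 0].max : cells
    padded.map { |cell| cell.nil? || cell.strip.empty? ? default : cell.strip }
  end
end
```

For the table in the question, width would be 25, so every row keeps its 25 columns even when some cells are empty.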

Using Loop Conditions in Excel-Ruby Watir Scripts

I'm new to Ruby and Watir. I'm testing the login functionality of an application with multiple values (invalid username and password, valid username and password) that come from an Excel sheet.
require 'watir-webdriver'
require 'win32ole'
require 'roo'
b = Watir::Browser.new
b.goto 'http://tech/mellingcarsweb/Admin/Login.aspx'
xl = WIN32OLE.new('excel.application')
wrkbook= xl.Workbooks.Open("C:\\Excel\\cars.xlsx")
wrksheet= wrkbook.Worksheets(1)
wrksheet.select
$username= wrksheet.Range("A1").Value
$password= wrksheet.Range("B1").Value
b.text_field(:id, "MainContent_txtUsername").set($username)
b.text_field(:id, "MainContent_txtPassword").set($password)
b.button(:id, "MainContent_btnLogin").click
b.alert.ok
$username1= wrksheet.Range("A2").Value
$password1= wrksheet.Range("B2").Value
b.text_field(:id, "MainContent_txtUsername").set($username1)
b.text_field(:id, "MainContent_txtPassword").set($password1)
b.button(:id, "MainContent_btnLogin").click
puts "Authorised Entry"
This code works fine. What I need is to run it multiple times using a loop.
I don't know how to loop over the rows in Excel. I have seen a lot of examples, but none of them were clear to me. Could anyone explain the proper way to loop over Excel data in Ruby?
Thanks
Given that you are testing valid and invalid logins, I am not sure that it makes the most sense to write a loop. The problem with a loop is that each iteration needs to know what the expected result is. You would likely need to include that information in your spreadsheet. You might be better off creating specific individual tests for each valid and invalid scenario (i.e. one assertion per test).
To answer your question, you can convert the excel range to a 2D array and then iterate through each row of data. To get the 2D array:
require 'win32ole'
#Open the spreadsheet
xl = WIN32OLE.new('excel.application')
wrkbook= xl.Workbooks.Open('C:\Users\Someone\Desktop\Logins.xlsx')
wrksheet= wrkbook.Worksheets(1)
#Convert the range of cells to a 2D array
login_data = wrksheet.range("A1:B2").value
#Output of the 2D array
p login_data
#=> [["jsmith", "password1"], ["jdoe", "password2"]]
Each element of the 2D array is a row of the excel data. You can iterate through it using the each method.
#Start your browser
b = Watir::Browser.new
#Loop through the login data
login_data.each do |data|
  #Determine the user id and password for the row of data
  user = data[0]
  password = data[1]
  #Go to the login page
  b.goto 'http://tech/mellingcarsweb/Admin/Login.aspx'
  #Input the fields and submit
  b.text_field(:id, "MainContent_txtUsername").set(user)
  b.text_field(:id, "MainContent_txtPassword").set(password)
  b.button(:id, "MainContent_btnLogin").click
  #Do something to validate the result
end
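If the loop is kept, the point about expected results could be handled by adding a third column to the sheet. This is only a sketch under that assumption: the original spreadsheet has two columns, and check_logins, the expected column and the 'success'/'failure' labels are hypothetical names. The block passed in is where the Watir steps would go.

```ruby
# Each row is [username, password, expected outcome], e.g. read from "A1:C2".
# The third column is an assumption; the sheet in the question has only two.
def check_logins(login_data)
  login_data.map do |user, password, expected|
    # The block performs the login and returns the observed outcome.
    actual = yield(user, password)
    { user: user, expected: expected, actual: actual,
      passed: actual.to_s == expected.to_s }
  end
end
```

Each result hash records whether the observed outcome matched the spreadsheet, so a failing row can be reported without stopping the loop.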
Google for ruby loops. This looks like a good introduction: http://www.tutorialspoint.com/ruby/ruby_loops.htm

Sinatra can't convert Symbol into Integer when making MongoDB query

This is a sort of followup to my other MongoDB question about the torrent indexer.
I'm making an open source torrent indexer (like a mini TPB, in essence), and offer both SQLite and MongoDB for backend, currently.
However, I'm having trouble with the MongoDB part of it. In Sinatra, I get "can't convert Symbol into Integer" when trying to upload a torrent or search for one.
In uploading, one needs to tag the torrent — and it fails here. The code for adding tags is as follows:
def add_tag(tag)
  if $sqlite
    unless tag_exists? tag
      $db.execute("insert into #{$tag_table} values ( ? )", tag)
    end
    id = $db.execute("select oid from #{$tag_table} where tag = ?", tag)
    return id[0]
  elsif $mongo
    unless tag_exists? tag
      $tag.insert({:tag => tag})
    end
    return $tag.find({:tag => tag})[:_id] #this is the line it presumably crashes on
  end
end
It reaches line 105 (noted above), and then fails. What's going on? Also, as an FYI this might turn into a few other questions as solutions come in.
Thanks!
EDIT
So instead of returning the tag result with [:_id], I changed the block inside the elsif to:
id = $tag.find({:tag => tag})
puts id.inspect
return id
and still get an error. You can see a demo at http://torrent.hypeno.de and the source at http://github.com/tekknolagi/indexer/
Given that you are doing an insert(), the easiest way to get the id is:
id = $tag.insert({:tag => tag})
id will be a BSON::ObjectId, so you can use appropriate methods depending on the return value you want:
return id # BSON::ObjectId('5017cace1d5710170b000001')
return id.to_s # "5017cace1d5710170b000001"
In your original question you are trying to use the Collection.find() method. This returns a Mongo::Cursor, but you are trying to reference the cursor as a document. You need to iterate over the cursor using each or next, e.g.:
cursor = $tag.find({:tag => tag})
return cursor.next['_id']
If you want a single document, you should be using Collection.find_one().
For example, you can find and return the _id using:
return $tag.find_one({:tag => tag})['_id']
I think the problem here is [:_id]. I don't know much about Mongo, but $tag.find({:tag => tag}) is probably returning an array, and passing a symbol to the [] array operator is not defined.
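The error message itself can be reproduced in plain Ruby without any database driver. The sketch below uses an ordinary Array as a stand-in for a multi-document result; that stand-in is an assumption made for illustration and says nothing certain about how the Mongo driver implements its result object. The method names are hypothetical.

```ruby
# Indexing a list of documents with a Symbol: Array#[] expects an Integer
# index, so this raises TypeError.
def id_from_results_wrong(results)
  results[:_id]
end

# Take one document out of the list first, then read its _id.
def id_from_results(results)
  doc = results.first
  doc && doc['_id']
end
```

Calling id_from_results_wrong([{ '_id' => 'abc123' }]) raises a TypeError about converting a Symbol into an Integer, while id_from_results returns the id of the first document, or nil when the list is empty.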

Write Hash to CSV file and then read the values back to form a hash

I am banging my head trying to resolve an issue I am experiencing with one of my latest projects. Here is the scenario:
I am making a call to the GoToWebinar API to fetch upcoming webinars. Everything works fine, and the webinars are fetched as an array of hashes like this:
[
{
"webinarKey":5303085652037254656,
"subject":"Test+Webinar+One",
"description":"Test+Webinar+One+Description",
"times":[{"startTime":"2011-04-26T17:00:00Z","endTime":"2011-04-26T18:00:00Z"}]
},
{
"webinarKey":9068582024170238208,
"name":"Test+Webinar+Two",
"description":"Test Webinar Two Description",
"times":[{"startTime":"2011-04-26T17:00:00Z","endTime":"2011-04-26T18:00:00Z"}]
}
]
I have created a rake task which we are going to run once a day to populate the CSV file with this hash and then the CSV file is read in the controller action to populate the views.
Here is my code to populate the CSV file :
g = GoToWebinar::API.new()
@all_webinars = g.get_upcoming_webinars
CSV.open("#{Rails.root.to_s}/public/upcoming_webinars.csv", "wb") do |csv|
  @all_webinars.each do |webinar|
    webinar.to_a.each {|elem| csv << elem}
  end
end
I need help figuring out how to save the information from these hashes to the CSV file so that the order is preserved, and how to read it back from the CSV file in the controller action so that it populates the hashes in exactly the same way.
Use the keys of the hash as the headers for your CSV file (they are the same for every webinar), then write one row of values per webinar:
g = GoToWebinar::API.new()
@all_webinars = g.get_upcoming_webinars
# @all_webinars is an array of hashes, so take the headers from the first one
headers = @all_webinars.first.keys
CSV.open("#{Rails.root.to_s}/public/upcoming_webinars.csv", "wb", headers: headers, write_headers: true) do |csv|
  @all_webinars.each do |webinar|
    csv << webinar.values_at(*headers)
  end
end
You are going to want to make sure, however, that any data inside the hash values is flattened. That hash inside of an array for times needs to be dealt with (perhaps just remove times and have a startTime and endTime key in the hash).
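A minimal sketch of the flatten, write and read-back round trip, assuming each webinar is a Ruby hash with string keys and that only the first entry in times matters. The method names flatten_webinar, webinars_to_csv and csv_to_webinars are hypothetical, and the sketch works on a CSV string rather than a file to stay self-contained.

```ruby
require 'csv'

# Flatten one webinar hash: pull the first time slot up into top-level keys.
def flatten_webinar(webinar)
  slot = (webinar['times'] || []).first || {}
  {
    'webinarKey'  => webinar['webinarKey'],
    'subject'     => webinar['subject'],
    'description' => webinar['description'],
    'startTime'   => slot['startTime'],
    'endTime'     => slot['endTime']
  }
end

# Write a header row followed by one row of values per webinar.
def webinars_to_csv(webinars)
  rows = webinars.map { |webinar| flatten_webinar(webinar) }
  headers = rows.first ? rows.first.keys : []
  CSV.generate do |csv|
    csv << headers
    rows.each { |row| csv << row.values_at(*headers) }
  end
end

# Read the CSV text back into an array of hashes keyed by the header row.
def csv_to_webinars(csv_text)
  CSV.parse(csv_text, headers: true).map(&:to_h)
end
```

Column order is preserved because the header row fixes it, but every value comes back as a String, so webinarKey would need to_i if an Integer is required in the controller.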
From what I have learnt from all the examples and the work done to accomplish this, I think the best way to implement this type of functionality is to create a rake task that populates the database with the information, and then use the saved information to populate the views.

Resources