Using mechanize to check for div with similar but different names - ruby

Currently I'm doing the following:
if( firstTemp == true )
  total = doc.xpath("//div[@class='pricing condense']").text
else
  total = doc.xpath("//div[@class='pricing ']").text
end
Is there any way to get mechanize to automatically fetch divs whose class contains the string "pricing"?

Is doc a Mechanize::Page? Usually the convention is page for those and doc for Nokogiri::HTML::Document. Anyway, for either one, try:
doc.search('div.pricing')
For just the first one, use at instead of search:
doc.at('div.pricing')
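If you would rather stay with XPath, the same idea can be expressed with contains(). The following is only a sketch: it uses Ruby's bundled REXML library as a stand-in for the page, and the HTML fragment and prices are made up for illustration. The same XPath string should also be usable with doc.xpath in Nokogiri.

```ruby
require 'rexml/document'

# Hypothetical fragment standing in for the fetched page.
html = <<~HTML
  <body>
    <div class="pricing condense">10.00</div>
    <div class="pricing ">20.00</div>
    <div class="footer">n/a</div>
  </body>
HTML

doc = REXML::Document.new(html)

# contains() matches any div whose class attribute includes the substring "pricing".
pricing_divs = REXML::XPath.match(doc, "//div[contains(@class, 'pricing')]")
pricing_texts = pricing_divs.map(&:text)

puts pricing_texts.inspect
```

Note that contains() is a plain substring test, so it would also match a class such as "nonpricing"; the CSS form div.pricing matches whole class names only.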

Related

Concept for recipe-based parsing of webpages needed

I'm working on a web-scraping solution that grabs totally different webpages and lets the user define rules/scripts in order to extract information from the page.
I started scraping from a single domain and built a parser based on Nokogiri.
Basically everything works fine.
I could now add a ruby class each time somebody wants to add a webpage with a different layout/style.
Instead I thought about using an approach where the user specifies elements where content is stored using xpath and storing this as a sort of recipe for this webpage.
Example: The user wants to scrape a table-structure extracting the rows using a hash (column-name => cell-content)
I was thinking about writing a ruby function for extraction of this generic table information once:
require 'nokogiri'

# Extracts a table's rows as an array of hashes (column_name => cell content).
# html        - the HTML file as a string
# xpath_table - XPath of the HTML table that holds the data to be extracted
def basic_table(html, xpath_table)
  html_doc = Nokogiri::HTML(html)
  # Column names come from the header cells.
  row_headers = html_doc.xpath("#{xpath_table}/thead/tr/th").map(&:inner_text)
  row_contents = []
  # Double quotes are required here so that #{xpath_table} is interpolated.
  html_doc.xpath("#{xpath_table}/tbody/tr").each do |table_row|
    cells = table_row.xpath('td').map(&:inner_text)
    row_content_hash = {}
    cells.each_with_index do |cell_string, column_index|
      row_content_hash[row_headers[column_index]] = cell_string
    end
    # Append the hash itself (not wrapped in an array) to get an array of hashes.
    row_contents << row_content_hash
  end
  row_contents
end
The user could now specify a website-recipe-file like this:
<basic_table xpath='//div[@id="grid"]/table[@id="displayGrid"]'
The function basic_table is referenced here, so that by parsing the website-recipe-file I would know that I can use the function basic_table to extract the content from the table referenced by the xPath.
This way the user can specify simple recipe scripts and only has to dive into writing actual code when a new way of extracting information is needed.
The code would not change every time a new webpage needs to be parsed.
Whenever the structure of a webpage changes only the recipe-script would need to be changed.
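One way to picture the recipe idea is a small registry that maps the names allowed in a recipe to Ruby callables. This is only a sketch under assumptions: EXTRACTORS, RECIPE and run_recipe are hypothetical names, the recipe is an in-memory array rather than a parsed file, and the basic_table entry is a stub standing in for the real function above.

```ruby
# Registry of extractor functions a recipe is allowed to reference (a whitelist,
# so recipe files cannot call arbitrary Ruby methods).
EXTRACTORS = {
  # Stub standing in for the real basic_table(html, xpath_table).
  'basic_table' => ->(html, xpath) { "basic_table on #{xpath} (#{html.length} chars)" }
}.freeze

# A recipe is a list of steps; in practice it could be loaded from a YAML or XML file.
RECIPE = [
  { 'extractor' => 'basic_table', 'xpath' => '//div[@id="grid"]/table[@id="displayGrid"]' }
].freeze

# Runs every step of a recipe against the page and collects the results.
def run_recipe(recipe, html, extractors = EXTRACTORS)
  recipe.map do |step|
    extractor = extractors.fetch(step['extractor']) do
      raise ArgumentError, "unknown extractor: #{step['extractor']}"
    end
    extractor.call(html, step['xpath'])
  end
end

results = run_recipe(RECIPE, '<html></html>')
puts results.inspect
```

With this shape, supporting a new page layout means adding a recipe entry, and only a genuinely new kind of extraction requires a new entry in the registry.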
I was thinking that someone might be able to tell me how he would approach this. Rules/rule engines pop into my mind, but I'm not sure if that really is the solution to my problem.
Somehow I have the feeling that I don't want to "invent" my own solution to handle this problem.
Does anybody have a suggestion?
J.

Sinatra can't convert Symbol into Integer when making MongoDB query

This is a sort of followup to my other MongoDB question about the torrent indexer.
I'm making an open source torrent indexer (like a mini TPB, in essence), and offer both SQLite and MongoDB for backend, currently.
However, I'm having trouble with the MongoDB part of it. In Sinatra, I get a "can't convert Symbol into Integer" error when trying to upload a torrent or search for one.
In uploading, one needs to tag the torrent — and it fails here. The code for adding tags is as follows:
def add_tag(tag)
  if $sqlite
    unless tag_exists? tag
      $db.execute("insert into #{$tag_table} values ( ? )", tag)
    end
    id = $db.execute("select oid from #{$tag_table} where tag = ?", tag)
    return id[0]
  elsif $mongo
    unless tag_exists? tag
      $tag.insert({:tag => tag})
    end
    return $tag.find({:tag => tag})[:_id] #this is the line it presumably crashes on
  end
end
It reaches line 105 (noted above), and then fails. What's going on? Also, as an FYI this might turn into a few other questions as solutions come in.
Thanks!
EDIT
So instead of returning the tag result with [:_id], I changed the block inside the elsif to:
id = $tag.find({:tag => tag})
puts id.inspect
return id
and still get an error. You can see a demo at http://torrent.hypeno.de and the source at http://github.com/tekknolagi/indexer/
Given that you are doing an insert(), the easiest way to get the id is:
id = $tag.insert({:tag => tag})
id will be a BSON::ObjectId, so you can use appropriate methods depending on the return value you want:
return id # BSON::ObjectId('5017cace1d5710170b000001')
return id.to_s # "5017cace1d5710170b000001"
In your original question you are trying to use the Collection.find() method. This returns a Mongo::Cursor, but you are trying to reference the cursor as a document. You need to iterate over the cursor using each or next, e.g.:
cursor = $tag.find({:tag => tag})
return cursor.next['_id']
If you want a single document, you should be using Collection.find_one().
For example, you can find and return the _id using:
return $tag.find_one({:tag => tag})['_id']
I think the problem here is [:_id]. I don't know much about Mongo, but $tag.find({:tag => tag}) is probably returning an array, and passing a symbol to the array [] operator is not defined.
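The error message in the title is what Ruby raises when an Array is indexed with a Symbol. The sketch below reproduces that with a plain Array of hashes; it is a stand-in, not a real Mongo::Cursor, and the document contents are made up. The exact wording of the message differs between Ruby versions.

```ruby
# A plain Array stands in for "a collection of documents" (not a real Mongo::Cursor).
documents = [{ '_id' => 'abc123', 'tag' => 'linux' }]

error_message = nil
begin
  documents[:_id] # Symbol index on an Array raises TypeError
rescue TypeError => e
  error_message = e.message
end
puts error_message

# Indexing a single document (a Hash) by key works as expected.
first_id = documents.first['_id']
puts first_id
```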

Selenium Webdriver + Ruby regex: Can I use regex with find_element?

I am trying to click an element that changes per each order like so
edit_div_123
edit_div_124
edit_div_xxx
xxx = any three numbers
I have tried using regex like so:
@driver.find_element(:css, "#edit_order_#{\d*} > div.submit > button[name=\"commit\"]").click
@driver.find_element(:xpath, "//*[(@id = "edit_order_#{\d*}")]//button").click
Is this possible? Any other ways of doing this?
You cannot use Regexp, like the other answers have indicated.
Instead, you can use a nifty CSS Selector trick:
@driver.find_element(:css, "[id^=\"edit_order_\"] > div.submit > button[name=\"commit\"]").click
Using:
^= matches elements whose attribute value begins with your criteria.
*= matches when the criteria appears anywhere within the attribute's value.
$= matches elements whose attribute value ends with your criteria.
~= matches a single criterion when the actual value is a space-separated list of values.
Take a look at http://net.tutsplus.com/tutorials/html-css-techniques/the-30-css-selectors-you-must-memorize/ for some more info on other neat CSS tricks you should add to your utility belt!
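To make the four operators concrete, here is a rough plain-Ruby analogy using String methods. It is an illustration only: MATCHERS and attribute_matches? are hypothetical names, the browser (not Ruby) evaluates the real selector, and edge cases such as empty criteria are ignored.

```ruby
# Plain-Ruby equivalents of the four CSS attribute operators.
MATCHERS = {
  '^=' => ->(value, criteria) { value.start_with?(criteria) },   # begins with
  '*=' => ->(value, criteria) { value.include?(criteria) },      # contains anywhere
  '$=' => ->(value, criteria) { value.end_with?(criteria) },     # ends with
  '~=' => ->(value, criteria) { value.split.include?(criteria) } # whole word in a space-separated list
}.freeze

def attribute_matches?(operator, value, criteria)
  MATCHERS.fetch(operator).call(value, criteria)
end

prefix_match = attribute_matches?('^=', 'edit_order_123', 'edit_order_')
word_match   = attribute_matches?('~=', 'pricing condense', 'pricing')
partial_word = attribute_matches?('~=', 'pricing condense', 'pric')

puts [prefix_match, word_match, partial_word].inspect
```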
You have not provided any HTML fragment that you are working with, so my answer is based only on the limited information in your question.
I don't think the WebDriver APIs support regex for locating elements. However, you can achieve what you want using just plain XPath as follows:
//*[starts-with(@id, 'edit_div_')]//button
Explanation: The XPath above searches for all <button> nodes under any element whose id attribute starts with the string edit_div_
In short, you can use the starts-with() XPath function to match elements whose id has the format edit_div_ followed by any number of characters
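The expression can be tried outside the browser. The sketch below runs it with Ruby's bundled REXML library against a made-up fragment; the markup and button labels are assumptions for illustration, and WebDriver itself evaluates XPath in the browser rather than through REXML.

```ruby
require 'rexml/document'

# Hypothetical markup: two order forms and one unrelated element.
html = <<~HTML
  <body>
    <div id="edit_div_123"><button name="commit">Save 123</button></div>
    <div id="edit_div_124"><button name="commit">Save 124</button></div>
    <div id="other_div"><button name="cancel">Cancel</button></div>
  </body>
HTML

doc = REXML::Document.new(html)

# Buttons under any element whose id starts with "edit_div_".
buttons = REXML::XPath.match(doc, "//*[starts-with(@id, 'edit_div_')]//button")
button_labels = buttons.map(&:text)

puts button_labels.inspect
```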
No, you cannot.
But you should do something like this:
function hasClass(element, className) {
  var re = new RegExp('(?:^|\\s+)' + className + '(?:\\s+|$)');
  return re.test(element.className);
}
This worked for me
@driver.find_element(:xpath, "//a[contains(@href, 'person')]").click

OR operators and Ruby where clause

This is probably really easy, but I'm having trouble finding documentation online about it.
I have two ActiveRecord queries in Ruby that I want to join together with an OR operator:
@pro = Project.where(:manager_user_id => current_user.id )
@proa = Project.where(:account_manager => current_user.id)
I'm new to Ruby, but I tried this myself using ||:
@pro = Project.where(:manager_user_id => current_user.id || :account_manager => current_user.id)
This didn't work. So: 1. I'd like to know how to actually do this in Ruby, and 2. I'd appreciate a heads-up on the boolean syntax in a Ruby statement like this altogether,
e.g. AND, OR, XOR...
You can't use the Hash syntax in this case.
Project.where("manager_user_id = ? OR account_manager = ?", current_user.id, current_user.id)
You should also take a look at the API documentation for the conditions you can pass to the where method, and follow its conventions.
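On the second part of the question: Ruby's boolean operators work on Ruby values and are evaluated before where ever sees its argument, which is why they cannot express an SQL OR inside the hash. A small plain-Ruby sketch (current_user_id is a made-up value):

```ruby
current_user_id = 7

# || returns its first truthy operand, so the second value never reaches the caller.
chosen = current_user_id || :account_manager

# Ruby's boolean operators on true/false values.
both   = true && false # AND
either = true || false # OR
one_of = true ^ true   # XOR

puts chosen.inspect
puts [both, either, one_of].inspect
```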
This should work:
@projects = Project.where("manager_user_id = '#{current_user.id}' or account_manager_id = '#{current_user.id}'")
This should be safe since I'm assuming current_user's id value comes from your own app and not from an external source such as form submissions. If you are using form-submitted data that you intend to use in your queries, you should use placeholders so that Rails creates properly escaped SQL.
# with placeholders
@projects = Project.where(["manager_user_id = ? or account_manager_id = ?", some_value_from_form1, some_value_from_form_2])
When you pass an array to the where method (the example with placeholders), Rails treats the first element as a template for the SQL. Each placeholder (?) in the template is replaced at runtime, in order, by the remaining elements of the array.
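To see why interpolation is only safe for trusted values, the plain-Ruby sketch below builds the same condition string without any database. interpolated_condition is a hypothetical helper and the hostile input is made up; it only shows how the SQL text changes, not how Rails escapes values.

```ruby
# Builds the condition the same way the interpolated example does.
def interpolated_condition(user_id)
  "manager_user_id = '#{user_id}' or account_manager_id = '#{user_id}'"
end

trusted = interpolated_condition(42)
# A value from a form could contain quotes that change the meaning of the SQL.
hostile = interpolated_condition("0' OR '1'='1")

puts trusted
puts hostile
```

In the second string the quote characters from the input end the literal early and add an extra OR clause, which is exactly what placeholders are meant to prevent.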
Metawhere can do OR operations, plus a lot of other nifty things.

Pulling Images from rss/atom feeds using magpie rss

I'm using PHP and Magpie and would like a general way of detecting images in a feed item. I know some websites place images within the enclosure tag, others like this images[rss], and some simply add it to the description. Does anyone have a general function for detecting whether an RSS item has an image and extracting the image URL after it's been parsed by Magpie?
I think regular expressions would be needed to extract from the description, but I'm a noob at those. Please help if you can.
I spent ages searching for a way of displaying images in RSS via Magpie myself, and in the end I had to examine the code to figure out how to get it to work.
Like you say, the reason Magpie doesn't pick up images in the element is because they are specified using the 'enclosure' tag, which is an empty tag where the information is in the attributes, e.g.
<enclosure url="http://www.mysite.com/myphoto.jpg" length="14478" type="image/jpeg" />
As a hack to get it to work quickly for me I added the following lines of code into rss_parse.inc:
function feed_start_element($p, $element, &$attrs) {
    ...
    if ( $el == 'channel' )
    {
        $this->inchannel = true;
    }
    ...
    // START EDIT - add this elseif condition to the if ($el=xxx) statement.
    // Checks if element is enclosure tag, and if so store the attribute values
    elseif ($el == 'enclosure' ) {
        if ( isset($attrs['url']) ) {
            $this->current_item['enclosure_url'] = $attrs['url'];
            $this->current_item['enclosure_type'] = $attrs['type'];
            $this->current_item['enclosure_length'] = $attrs['length'];
        }
    }
    // END EDIT
    ...
}
The url to the image is in $myRSSitem['enclosure_url'] and the size is in $myRSSitem['enclosure_length'].
Note that enclosure tags can refer to many types of media, so first check if the type is actually an image by checking $myRSSitem['enclosure_type'].
Maybe someone else has a better suggestion, and I'm sure this could be done more elegantly to pick up attributes from other empty tags, but I needed a very quick fix (deadline pressures). I hope this might help someone else in difficulty!

Resources