Extract data from URL with Ruby

I'm new to Ruby and I'm trying to return a list of ASINs and corresponding prices. I got pretty close to what I need, but I could use help with two questions:
How can I get rid of the [[" and \n"]] wrapped around each ASIN (see the result below)?
Is there a simpler way to extract the ASIN from the URL than this regex?
Thanks so much for your help!
Here is what I get in the Terminal from the current code:
[["B00EJDIG8M\n"]] - $7.00
[["B00KJ07SEM\n"]] - $26.99
[["B000FAR33M\n"]] - $119.00
[["B00LLMKPVK\n"]] - $22.99
[["B007NXPAQG\n"]] - $9.47
[["B004W5WAMU\n"]] - $22.43
[["B00LFUNGU0\n"]] - $17.99
[["B0052G14E8\n"]] - $54.99
[["B002MPLYEW\n"]] - $212.99
[["B00009W3G7\n"]] - $6.61
[["B000NCTOUM\n"]] - $3.04
[["B009SANIDO\n"]] - $12.29
[["B0052G51AQ\n"]] - $67.99
[["B003XEUEPQ\n"]] - $26.74
[["B00CYH9HRO\n"]] - $25.75
[["B00KV0SKQK\n"]] - $21.99
[["B009PCI2JU\n"]] - $56.66
[["B00LLM6ZFK\n"]] - $24.99
[["B004RQDY60\n"]] - $18.40
[["B000JLNBW4\n"]] - $49.14
Here is the code:
require 'rubygems'
require 'nokogiri'
require 'open-uri'

PAGE_URL = "http://www.amazon.com/Best-Sellers-Appliances/zgbs/appliances/ref=zg_bs_nav_0"

page = Nokogiri::HTML(open(PAGE_URL))
page.css(".zg_itemWrapper").each do |item|
  price = item.at_css(".zg_price .price").text
  asin = item.at_css(".zg_title a")[:href].scan(/http:\/\/(?:www\.|)amazon\.com\/(?:gp\/product|[^\/]+\/dp|dp)\/([^\/]+)/)
  puts "#{asin} - #{price}"
end

Rather than cleaning up your Nokogiri search, the easiest thing to do at this point is just clean up your current asin values during interpolation. For example:
puts "#{asin.flatten.pop.chomp} - #{price}"

Regarding question 2, I realized I don't really need a regex and found a way to get the same result with a much shorter line of code,
replacing
asin = item.at_css(".zg_title a")[:href].scan(/http:\/\/(?:www\.|)amazon\.com\/(?:gp\/product|[^\/]+\/dp|dp)\/([^\/]+)/)
with
asin = item.at_css(".zg_title a")[:href].split("/")[5].chomp

Related

Ruby - URL to Markdown

TOTAL rookie here.
I'm working on customizing a script made by Brett Terpstra - http://brettterpstra.com/2013/11/01/save-pocket-favorites-to-nvalt-with-ifttt-and-hazel/
Mine is a different use: I'd like to save my pinboard bookmarks with a specific tag to a file in dropbox in Markdown.
I feed it a text file such as:
Title: Yesterday is over.
URL: http://www.jonacuff.com/blog/want-to-change-the-world-get-doing/
Tags: 2md, 2wcx, 2pdf
Date: June 20, 2013 at 06:20PM
Image: notused
Excerpt: You can't start the next chapter of your life if you keep re-reading the last one.
And it outputs the markdown file.
Everything works great except when the 'excerpt' (see above) is more than one line. Sometimes it's a couple of paragraphs. When that happens, it stops working. When I hit enter from the command line, it's still waiting for more input.
Here's an example of a file that it doesn't work on:
Title: Talking ’bout my Generation.
URL: http://blog.greglaurie.com/?p=8881
Tags: 2md, 2wcx, 2pdf
Date: June 28, 2013 at 09:46PM
Image: notused
Excerpt: Contrast two men from the 19th century: Max Jukes and Jonathan Edwards.
Max Jukes lived in New York. He did not believe in Christ or in raising his children in the way of the Lord. He refused to take his children to church, even when they asked to go. Of his 1,026 descendants:
•300 were sent to prison for an average term of 13 years
•190 were prostitutes
•680 were admitted alcoholics
His family, thus far, has cost the state in excess of $420,000 and has made no contribution to society.
Jonathan Edwards also lived in New York, at the same time as Jukes. He was known to have studied 13 hours a day and, in spite of his busy schedule of writing, teaching, and pastoring, he made it a habit to come home and spend an hour each day with his children. He also saw to it that his children were in church every Sunday. Of his 929 descendants:
•430 were ministers
•86 became university professors
•13 became university presidents
•75 authored good books
•7 were elected to the United States Congress
•1 was Vice President of the United States
Edwards’ family never cost the state one cent.
We tend to think that our decisions only affect ourselves, but they have ramifications for generations to come.
Here's a screenshot of what it looks like after I run the command: https://www.dropbox.com/s/i9zg483k7nkdp6f/Screenshot%202013-11-22%2016.39.17.png
I'm hoping it's something easy. Any ideas?
#!/usr/bin/env ruby
# Works with IFTTT recipe https://ifttt.com/recipes/125999
#
# Set Hazel to watch the folder you specify in the recipe.
# Make sure nvALT is set to store its notes as individual files.
# Edit the $target_folder variable below to point to your nvALT
# notes folder.

require 'date'
require 'open-uri'
require 'net/http'
require 'fileutils'
require 'cgi'

$target_folder = "~/Dropbox/messx/urls2md"

def url_to_markdown(url)
  res = Net::HTTP.post_form(URI.parse("http://heckyesmarkdown.com/go/"), {'u' => url, 'read' => '1'})
  if res.code.to_i == 200
    res.body
  else
    false
  end
end

file = ARGV[0]

begin
  input = IO.read(file).force_encoding('utf-8')
  headers = {}
  input.each_line {|line|
    key, value = line.split(/: /)
    headers[key] = value.strip || ""
  }
  outfile = File.join(File.expand_path($target_folder), headers['Title'].gsub(/["!*?'|]/,'') + ".txt")
  date = Time.now.strftime("%Y-%m-%d %H:%M")
  date_added = Date.parse(headers['Date']).strftime("%Y-%m-%d %H:%M")
  content = "Title: #{headers['Title']}\nDate: #{date}\nDate Added: #{date_added}\nSource: #{headers['URL']}\n"
  tags = false
  if headers['Tags'].length > 0
    tag_arr = headers['Tags'].split(", ")
    tag_arr.map! {|tag|
      %Q{"#{tag.strip}"}
    }
    tags = tag_arr.join(" ")
    content += "Keywords: #{tags}\n"
  end
  markdown = url_to_markdown(headers['URL']).force_encoding('utf-8')
  if markdown
    content += headers['Image'].length > 0 ? "\n\n> #{headers['Excerpt']}\n\n---#{markdown}\n" : "\n\n" + markdown
  else
    content += headers['Image'].length > 0 ? "\n\n![](#{headers['Image']})\n\n#{headers['Excerpt']}\n" : "\n\n" + headers['Excerpt']
  end
  File.open(outfile, 'w') {|f|
    f.puts content
  }
  if tags && File.exists?("/usr/local/bin/openmeta")
    %x{/usr/local/bin/openmeta -a #{tags} -p "#{outfile}"}
  end
  # FileUtils.rm(file)
rescue Exception => e
  puts e
end
How about this? Modify your input.each_line area accordingly:
headers = {}
key = nil
input.each_line do |line|
  match = /^(?<key>\w+)\s*:\s*(?<value>.*)/.match(line)
  if match
    key = match[:key].strip
    headers[key] = match[:value].strip
  else
    headers[key] += line
  end
end
First, splitting on just ":" is dangerous, since a colon can also appear in the content. Instead, a regex like /^\w+:.*/ (a simplified form of the one in the code) only matches a "Word: Content" line. Since the lines after "Excerpt:" aren't prefixed with a key, you need to hang on to the last-seen key and just append the line when no key is present. You may need to add a newline in that append, depending on what you're doing with the header information, but it seems to work.
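For a quick self-contained check, feed the loop a made-up note with a two-line excerpt (the sample text is hypothetical):

input = "Title: Test\nExcerpt: First line.\nSecond line.\n"
headers = {}
key = nil
input.each_line do |line|
  match = /^(?<key>\w+)\s*:\s*(?<value>.*)/.match(line)
  if match
    key = match[:key].strip
    headers[key] = match[:value].strip
  else
    headers[key] += line
  end
end
headers['Excerpt']  # => "First line.Second line.\n"

Note the two excerpt lines run together; that's where you'd add the newline mentioned above.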

How do I parse Google image URLs using Ruby and Nokogiri?

I'm trying to make an array of all the image files on a Google images webpage.
I want a regular expression to pull everything after "imgurl=" and ending before "&amp;", as seen in this HTML:
<img height="124" width="124" src="https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcRLy5inpSdHxWuE7z3QSZw35JwN3upbBaLr11LR25noTKbSMn9-qrySSg"><br><cite title="trendytree.com">trendytree.com</cite><br>Silent Night Chapel <b>20031</b><br>400 × 400 - 58k - jpg</td>
I feel like I can do this with a regex, but I can't find a way to search my parsed document using one, and I haven't found any other solutions.
str = '<img height="124" width="124" src="https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcRLy5inpSdHxWuE7z3QSZw35JwN3upbBaLr11LR25noTKbSMn9-qrySSg"><br><cite title="trendytree.com">trendytree.com</cite><br>Silent Night Chapel <b>20031</b><br>400 × 400 - 58k - jpg</td>'
str.split('imgurl=')[1].split('&amp')[0]
#=> "http://www.trendytree.com/old-world- christmas/images/20031chapel20031-silent-night-chapel.jpg"
Is that what you're looking for?
The problem with using a regex is that you assume too much knowledge about the order of parameters in the URL. If the order changes, or the & disappears, the regex won't work.
Instead, parse the URL, then split the values out:
# encoding: UTF-8
require 'nokogiri'
require 'cgi'
require 'uri'

doc = Nokogiri::HTML.parse('<img height="124" width="124" src="https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcRLy5inpSdHxWuE7z3QSZw35JwN3upbBaLr11LR25noTKbSMn9-qrySSg"><br><cite title="trendytree.com">trendytree.com</cite><br>Silent Night Chapel <b>20031</b><br>400 × 400 - 58k - jpg</td>')

doc.search('a').each do |a|
  query_params = CGI::parse(URI(a['href']).query)
  puts query_params['imgurl']
end
Which outputs:
http://www.trendytree.com/old-world-christmas/images/20031chapel20031-silent-night-chapel.jpg
Both URI and CGI are used because URI's decode_www_form raises an exception when trying to decode the query.
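Note that CGI::parse wraps every value in an array, and puts prints each element of that array on its own line, which is why the output above shows a bare URL. A tiny standalone illustration (the example.com URL is made up):

require 'cgi'
require 'uri'

query = URI("http://example.com/imgres?imgurl=http://example.com/a.jpg&h=400").query
CGI.parse(query)
# => {"imgurl"=>["http://example.com/a.jpg"], "h"=>["400"]}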
I've also been known to decode the query string into a hash using something like:
Hash[URI(a['href']).query.split('&').map{ |p| p.split('=') }]
That will return:
{"imgurl"=>
"http://www.trendytree.com/old-world-christmas/images/20031chapel20031-silent-night-chapel.jpg",
"imgrefurl"=>
"http://www.trendytree.com/old-world-christmas/silent-night-chapel-20031-christmas-ornament-old-world-christmas.html",
"usg"=>"__YJdf3xc4ydSfLQa9tYnAzavKHYQ",
"h"=>"400",
"w"=>"400",
"sz"=>"58",
"hl"=>"en",
"start"=>"19",
"zoom"=>"1",
"tbnid"=>"ajDcsGGs0tgE9M:",
"tbnh"=>"124",
"tbnw"=>"124",
"ei"=>"qagfUbXmHKfv0QHI3oG4CQ",
"itbs"=>"1",
"sa"=>"X",
"ved"=>"0CE4QrQMwEg"}
To get all the img URLs you want, do:

require 'nokogiri'
require 'open-uri'

# get all links
url = 'some-google-images-url'
links = Nokogiri::HTML(open(url)).css('a')

# get the regex match or nil for each link's href
img_urls = links.map { |a| a['href'][/imgurl=(.*?)&/, 1] }

# get rid of the nils
img_urls.compact
The regex you want is /imgurl=(.*?)&/ because you want a non-greedy match between imgurl= and &, otherwise the greedy .* would take everything to the last & in the string.
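Seeing both on a made-up href (example.com values, not real Google parameters):

href = "/imgres?imgurl=http://example.com/a.jpg&imgrefurl=http://example.com&h=400"
href[/imgurl=(.*?)&/, 1]  # => "http://example.com/a.jpg"
href[/imgurl=(.*)&/, 1]   # => "http://example.com/a.jpg&imgrefurl=http://example.com"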

YAML/Ruby: Get the first item whose <field> is <value>?

I have this YAML:
- company:
    - id: toyota
    - fullname: トヨタ自動車株式会社
- company:
    - id: konami
    - fullname: Konami Corporation
And I want to get the fullname of the company whose id is konami.
Using Ruby 1.9.2, what is the simplest/usual way to get it?
Note: In the rest of my code, I have been using require "yaml" so I would prefer to use the same library.
This works too and does not use an explicit loop:
y = YAML.load_file('japanese_companies.yml')
result = y.select{ |x| x['company'].first['id'] == 'konami' }
result.first['company'].last['fullname'] # => "Konami Corporation"
Or if you have other attributes and you can't be sure fullname is the last one:
result.first['company'].select{ |x| x['fullname'] }.first['fullname']
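Since the question asks for the first matching company, Enumerable#find is also a natural fit; it stops scanning at the first hit. A sketch against the same YAML shape (and the same assumption that id comes first):

y = YAML.load_file('japanese_companies.yml')
company = y.find { |x| x['company'].first['id'] == 'konami' }
# merge the single-key hashes so attribute order no longer matters
props = company['company'].inject({}) { |acc, h| acc.merge(h) }
props['fullname']  # => "Konami Corporation"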
I agree with Ray Toal: if you change your YAML it becomes much easier. E.g.:
toyota:
  fullname: トヨタ自動車株式会社
konami:
  fullname: Konami Corporation
With the above yaml, fetching the fullname of konami becomes much easier:
y = YAML.load_file('test.yml')
y.fetch('konami')['fullname']
Your YAML is a little unconventional but we can compensate.
A brute force approach is (I'm not sure if this can be done without parsing the YAML):
require 'yaml'

YAML.parse_file(ARGV[0]).transform.each do |company|
  properties = {}
  company['company'].each { |h| properties = properties.merge(h) }
  puts properties['fullname'] if properties['id'] == 'konami'
end
Pass your YAML file in as the first argument to this script.
Feel free to adapt into a method that takes the YAML as a string and returns the desired fullname. (A return is useful because it directly answers the OP's question of obtaining the first such company.)
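Such an adaptation might look like this (a sketch; the method name is mine):

require 'yaml'

def fullname_for(yaml_string, wanted_id)
  YAML.load(yaml_string).each do |company|
    properties = company['company'].inject({}) { |acc, h| acc.merge(h) }
    return properties['fullname'] if properties['id'] == wanted_id
  end
  nil
end

It returns as soon as the first matching company is seen, which is exactly the "first item" behavior the question asks for.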

Parse Apache Formatted URLs in Ruby

How can I take in a Apache Common Log file and list all of the URLs in it in a neat histogram like:
/favicon.ico ##
/manual/mod/mod_autoindex.html #
/ruby/faq/Windows/ ##
/ruby/faq/Windows/index.html #
/ruby/faq/Windows/RubyonRails #
/ruby/rubymain.html #
/robots.txt ########
Sample of test file:
65.54.188.137 - - [03/Sep/2006:03:50:20 -0400] "GET /~longa/geomed/ppa/doc/localg/localg.htm HTTP/1.0" 200 24834
65.54.188.137 - - [03/Sep/2006:03:50:32 -0400] "GET /~longa/geomed/modules/sv/scen1.html HTTP/1.0" 200 1919
65.54.188.137 - - [03/Sep/2006:03:53:51 -0400] "GET /~longa/xlispstat/code/statistics/introstat/axis/code/axisDens.lsp HTTP/1.0" 200 15962
65.54.188.137 - - [03/Sep/2006:04:03:03 -0400] "GET /~longa/geomed/modules/cluster/lab/nm.pop HTTP/1.0" 200 66302
65.54.188.137 - - [03/Sep/2006:04:11:15 -0400] "GET /~longa/geomed/data/france/names.txt HTTP/1.0" 200 20706
74.129.13.176 - - [03/Sep/2006:04:14:35 -0400] "GET /~jbyoder/ambiguouslyyours/ambig.rss HTTP/1.1" 304 -
This is what I have right now (but I'm not sure how to make the histogram):
APACHE_LINE = /\A(?<ip_address>\S+) \S+ \S+ \[(?<time>[^\]]+)\] "(?<method>GET|POST) (?<url>\S+) \S+?" (?<status>\d+) (?<bytes>\S+)/

def get_url(file)
  hits = Hash.new { |h, k| h[k] = 0 }
  File.read(file).each_line do |line|
    parts = APACHE_LINE.match(line)
    next unless parts
    p parts[:ip_address], parts[:status], parts[:method], parts[:url]
    hits[parts[:url]] += 1
    # stuck here: how do I print hits as a histogram like the one above?
  end
end
Here is the full question: http://pastebin.com/GRPS6cTZ. Pseudocode is fine.
You can create a hash mapping each path to the number of hits. For convenience, I suggest using a Hash that sets the value to 0 when you ask for a path it hasn't seen before. For example:
hits = Hash.new{ |h,k| h[k]=0 }
...
hits["/favicon.ico"] += 1
hits["/ruby/faq/Windows/"] += 1
hits["/favicon.ico"] += 1
p hits
#=> {"/favicon.ico"=>2, "/ruby/faq/Windows/"=>1}
In case the log file is really huge, instead of slurping the whole thing into memory, process the lines one at a time. (Look through the methods of the File class.)
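For example, File.foreach streams a file line by line (the filename is illustrative):

File.foreach('access.log') do |log_line|
  # only the current line is held in memory
end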
Because Apache log file formats don't have standard delimiters, I'd suggest using a regular expression to take each line and separate it into the chunks you want. Assuming you're using Ruby 1.9, I'm going to use named captures for clean access to the fields later on. For example:
apache_line = /\A(?<ip_address>\S+) \S+ \S+ \[(?<time>[^\]]+)\] "(?<method>GET|POST) (?<url>\S+) \S+?" (?<status>\d+) (?<bytes>\S+)/
...
parts = apache_line.match(log_line)
p parts[:ip_address], parts[:status], parts[:method], parts[:url]
You might want to choose to filter these based on the status code. For example, do you want to include in your graph all the 404 hits where someone mistyped? If you're not slurping all the lines into memory, you won't be using Array#select but instead skipping over them during your loop.
After you have gathered all your hits, it's time to write out the results. Some helpful tips:
Hash#keys can give you all the keys of the hash (the paths) at once. You probably want to write out all the paths with the same amount of whitespace, so you need to figure out which is the longest. Perhaps you want to map the paths to their lengths and then get the max element, or perhaps you want to use max_by to find the longest path and then take its length.
Although geeky, using sprintf or String#% is a great way to lay out formatted reports. For example:
puts "%-15s %s" % ["Hello","####"]
#=> "Hello ####"
Just like you needed to find the longest name for good formatting, you might want to find the URL with the most hits, so that you can scale the longest bar of hash marks to that value. Hash#values will give you an array of all the values. Alternatively, perhaps you have a requirement that one # must always represent 100 hits, or something.
Note that String#* lets you create a string by repetition:
p '#'*10
#=> "##########"
If you have specific questions with your code, ask more questions!
Since this is homework, I won't give you the exact answer, but Simone Carletti has implemented a Ruby class to parse Apache log files. You might start there and look at how he does things.

Simplest way to display each hour of the day in Ruby

I have a calendar screen where I want to display the hours of the day like this:
12:00am
1:00am
2:00am
..
4:00pm
5:00pm
etc.
Being a total Ruby noob, I was wondering if anyone could help me figure out the simplest way to display this.
#!/usr/bin/env ruby
# without using actual `Date` objects ...
p ["12:00am"] + (1..11).map { |h| "#{h}:00am" } +
  ["12:00pm"] + (1..11).map { |h| "#{h}:00pm" }
["12:00am", "1:00am", "2:00am", "3:00am", "4:00am", "5:00am", "6:00am",
"7:00am", "8:00am", "9:00am", "10:00am", "11:00am", "12:00pm", "1:00pm",
"2:00pm", "3:00pm", "4:00pm", "5:00pm", "6:00pm", "7:00pm", "8:00pm",
"9:00pm", "10:00pm", "11:00pm"]
Or using actual DateTime objects and %I:%M%p as the format:
#!/usr/bin/env ruby
require "date"

for hour in 0..23 do
  d = DateTime.new(2010, 1, 1, hour, 0, 0)
  p d.strftime("%I:%M%p")
end
Which would print:
"12:00AM"
"01:00AM"
"02:00AM"
"03:00AM"
"04:00AM"
"05:00AM"
"06:00AM"
"07:00AM"
"08:00AM"
"09:00AM"
"10:00AM"
"11:00AM"
"12:00PM"
"01:00PM"
"02:00PM"
"03:00PM"
"04:00PM"
"05:00PM"
"06:00PM"
"07:00PM"
"08:00PM"
"09:00PM"
"10:00PM"
"11:00PM"
You could generate these like this:
array = ['12:00am'] + (1..11).map {|h| "#{h}:00am"} + ['12:00pm'] + (1..11).map {|h| "#{h}:00pm"}
or simply write out the array (this is more efficient):
array = ["12:00am", "1:00am", "2:00am", "3:00am", "4:00am", "5:00am", "6:00am", "7:00am", "8:00am", "9:00am", "10:00am", "11:00am", "12:00pm", "1:00pm", "2:00pm", "3:00pm", "4:00pm", "5:00pm", "6:00pm", "7:00pm", "8:00pm", "9:00pm", "10:00pm", "11:00pm"]
You can then print these however you want, e.g.:

array.each do |el|
  puts el
end
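If you want exactly the lowercase, no-leading-zero style from the question while still using strftime, Ruby's %-I (unpadded hour) and %P (lowercase meridian) flags can produce it; a one-liner sketch with Time:

(0..23).map { |h| Time.new(2010, 1, 1, h).strftime("%-I:00%P") }
# => ["12:00am", "1:00am", "2:00am", ..., "10:00pm", "11:00pm"]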
