xml to querystring - ruby

What is the best option to generate a query string (URL params) from an XML document in Ruby?
xml_string = "<abc><session>1234</session><description>some_description</description></abc>"
query_string = # I want "?abc=session......." here

require 'active_support/all' # Hash.from_xml and Hash#to_query are ActiveSupport extensions
xml_string = "<abc><session>1234</session><description>some_description</description></abc>"
result = "?" + Hash.from_xml(xml_string).to_query
# => roughly "?abc%5Bdescription%5D=some_description&abc%5Bsession%5D=1234" (to_query sorts keys and escapes brackets)

XmlSimple is a nice gem.
require 'rubygems'
require 'cgi'
require 'xmlsimple'

def xml_to_url_params(xml_data, root)
  elements = []
  xml_data[root].each do |item|
    item.each do |name, value|
      # XmlSimple wraps element text in arrays by default
      elements << "#{CGI.escape(name)}=#{CGI.escape(value.first.to_s)}"
    end
  end
  elements.join("&")
end

data = XmlSimple.xml_in(xml_string, 'KeepRoot' => true) # keep <abc> as the top-level key
url_params = xml_to_url_params(data, "abc")
PS: I haven't tested this code, so there may be bugs ;)
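With the sample document above, and assuming XmlSimple's default array-wrapping of text values, this should give something like:
url_params # => "session=1234&description=some_description"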

Related

Converting Ruby Hash into string with escapes

I have a Hash which needs to be converted into a String with escaped characters.
{name: "fakename"}
and it should end up like this:
'name:\'fakename\''
I don't know what this type of string is called. Maybe there is an already existing method which I simply don't know about...
At the end I would do something like this:
name = {name: "fakename"}
metadata = {}
metadata['foo'] = 'bar'
"#{name} AND #{metadata}"
which ends up in that:
'name:\'fakename\' AND metadata[\'foo\']:\'bar\''
Context: this query format is required to search the Stripe API: https://stripe.com/docs/api/customers/search
If possible I would use Stripe's gem.
In case you can't use it, this piece of code extracted from the gem should help you encode the query parameters.
require 'cgi'

# Copied from here: https://github.com/stripe/stripe-ruby/blob/a06b1477e7c28f299222de454fa387e53bfd2c66/lib/stripe/util.rb
class Util
  def self.flatten_params(params, parent_key = nil)
    result = []
    # do not sort the final output because arrays (and arrays of hashes
    # especially) can be order sensitive, but do sort incoming parameters
    params.each do |key, value|
      calculated_key = parent_key ? "#{parent_key}[#{key}]" : key.to_s
      if value.is_a?(Hash)
        result += flatten_params(value, calculated_key)
      elsif value.is_a?(Array)
        result += flatten_params_array(value, calculated_key)
      else
        result << [calculated_key, value]
      end
    end
    result
  end

  def self.flatten_params_array(value, calculated_key)
    result = []
    value.each_with_index do |elem, i|
      if elem.is_a?(Hash)
        result += flatten_params(elem, "#{calculated_key}[#{i}]")
      elsif elem.is_a?(Array)
        result += flatten_params_array(elem, calculated_key)
      else
        result << ["#{calculated_key}[#{i}]", elem]
      end
    end
    result
  end

  def self.url_encode(key)
    CGI.escape(key.to_s).
      # Don't use strict form encoding by changing the square bracket control
      # characters back to their literals. This is fine by the server, and
      # makes these parameter strings easier to read.
      gsub("%5B", "[").gsub("%5D", "]")
  end
end

params = { name: 'fakename', metadata: { foo: 'bar' } }
Util.flatten_params(params).map { |k, v| "#{Util.url_encode(k)}=#{Util.url_encode(v)}" }.join("&")
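With those params this should produce name=fakename&metadata[foo]=bar.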
I use it now with that string, which works and is quite straightforward:
email = "test@test.com"
key = "foo"
value = "bar"
["email:\'#{email}\'", "metadata[\'#{key}\']:\'#{value}\'"].join(" AND ")
# => "email:'test@test.com' AND metadata['foo']:'bar'"
which is accepted by the Stripe API.
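For completeness, a minimal sketch of handing such a query to the official stripe gem (assuming a stripe-ruby version that ships Stripe::Customer.search; the API key is a placeholder):
require 'stripe'

Stripe.api_key = 'sk_test_...' # placeholder key
query = ["email:'test@test.com'", "metadata['foo']:'bar'"].join(' AND ')
customers = Stripe::Customer.search({ query: query })
customers.data.each { |c| puts c.id }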

Nokogiri - Returning a Filtered Array

Is it possible to just return the actual image source links, rather than the entire nokogiri array object?
def self.images(url)
  doc = Nokogiri::HTML(open(url))
  images = doc.css('img[src$="jpg"], img[src$="png"]').select do |image|
    image['src'] =~ %r{^http://(\d+|media)}
  end
  images
end
Try using Array#map to convert the array of elements to an array containing all the src attributes.
def self.images(url)
  doc = Nokogiri::HTML(open(url))
  doc.css('img[src$="jpg"], img[src$="png"]').select do |image|
    image['src'] =~ %r{^http://(\d+|media)}
  end.map { |i| i['src'] }
end
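As an alternative sketch (not from the original answer), you could map to the src strings first and then filter them with Enumerable#grep:
def self.images(url)
  doc = Nokogiri::HTML(open(url))
  doc.css('img[src$="jpg"], img[src$="png"]')
     .map { |image| image['src'] }  # extract the src attribute strings
     .grep(%r{^http://(\d+|media)}) # keep only the ones matching the pattern
end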

Nokogiri: Slop access a node named name

I'm trying to parse an XML document that looks like this:
<lesson>
<name>toto</name>
<version>42</version>
</lesson>
Using Nokogiri::Slop.
I can access lesson.version easily, but I cannot access lesson.name, as name in this case refers to the name of the node itself (lesson).
Is there any way to access the child?
As a variant you could try this one:
doc.lesson.elements.select{|el| el.name == "name"}
Why? Just because of these benchmarks:
require 'nokogiri'
require 'benchmark'

str = '<lesson>
<name>toto</name>
<version>42</version>
</lesson>'
doc = Nokogiri::Slop(str)
n = 50000

Benchmark.bm do |x|
  x.report("select") { n.times do; doc.lesson.elements.select { |el| el.name == "name" }; end }
  x.report("search") { n.times do; doc.lesson.search('name'); end }
end
Which gives us the result:
#=> user system total real
#=> select 1.466000 0.047000 1.513000 ( 1.528153)
#=> search 2.637000 0.125000 2.762000 ( 2.777278)
You can use search and give the node an XPath or CSS selector:
doc.lesson.search('name').first
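To get the text content, e.g.:
doc.lesson.search('name').first.text # => "toto"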
Here's a bit of a hack using metaprogramming.
require 'nokogiri'

doc = Nokogiri::Slop <<-HTML
<lesson>
<name>toto</name>
<version>42</version>
</lesson>
HTML

name_val = doc.lesson.instance_eval do
  self.class.send :undef_method, :name
  self.name
end.text
p name_val # => "toto"
p doc.lesson.version.text # => "42"
Nokogiri::XML::Node#name is a method defined to return the name of a Nokogiri::XML::Node. Removing that method from the class inside the #instance_eval block lets the call to name fall through to Slop's method_missing, which resolves it to the child element instead.

Data scraping with Nokogiri

I am able to scrape http://www.example.com/view-books/0/new-releases using Nokogiri but how do I scrape all the pages? This one has five pages, but without knowing the last page how do I proceed?
This is the program that I wrote:
require 'rubygems'
require 'nokogiri'
require 'open-uri'
require 'csv'
urls = Array['http://www.example.com/view-books/0/new-releases?layout=grid&_pop=flyout',
             'http://www.example.com/view-books/1/bestsellers',
             'http://www.example.com/books/pre-order?query=book&cid=1&layout=list&ref=4b116001-01a6-4f53-8da7-945b74fdb253'
]
@titles = Array.new
@prices = Array.new
@descriptions = Array.new
@page = Array.new

urls.each do |url|
  doc = Nokogiri::HTML(open(url))
  puts doc.at_css("title").text
  doc.css('.fk-inf-scroll-item').each do |item|
    @prices << item.at_css(".final-price").text
    @titles << item.at_css(".fk-srch-title-text").text
    @descriptions << item.at_css(".fk-item-specs-section").text
    @page << item.at_css(".fk-inf-pageno").text rescue nil
  end
  (0..@prices.length - 1).each do |index|
    puts "title: #{@titles[index]}"
    puts "price: #{@prices[index]}"
    puts "description: #{@descriptions[index]}"
    # puts "pageno. : #{@page[index]}"
    puts ""
  end
end

CSV.open("result.csv", "wb") do |row|
  row << ["title", "price", "description", "pageno"]
  (0..@prices.length - 1).each do |index|
    row << [@titles[index], @prices[index], @descriptions[index], @page[index]]
  end
end
As you can see I have hardcoded the URLs. How do you suggest I scrape the entire books category? I was trying Anemone but couldn't get it to work.
If you inspect what actually happens when you load more results, you will realise that the site is using a JSON endpoint with an offset to fetch the info.
So, you can get the five pages like this :
http://www.flipkart.com/view-books/0/new-releases?response-type=json&inf-start=0
http://www.flipkart.com/view-books/0/new-releases?response-type=json&inf-start=20
http://www.flipkart.com/view-books/0/new-releases?response-type=json&inf-start=40
http://www.flipkart.com/view-books/0/new-releases?response-type=json&inf-start=60
http://www.flipkart.com/view-books/0/new-releases?response-type=json&inf-start=80
Basically you keep incrementing inf-start and fetching results until you get a result set smaller than 20, which should be your last page.
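A rough sketch of that loop (untested; the 'items' key is an assumption, inspect the real JSON payload to see where the records actually live):
require 'open-uri'
require 'json'

base = 'http://www.flipkart.com/view-books/0/new-releases?response-type=json'
offset = 0
loop do
  data = JSON.parse(open("#{base}&inf-start=#{offset}").read)
  items = data['items'] || []   # assumed key; check the real payload
  break if items.empty?
  # ... process items ...
  break if items.size < 20      # a short page means we hit the last one
  offset += 20
end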
Here's an untested sample of code that does what yours does, only written a bit more concisely:
require 'nokogiri'
require 'open-uri'
require 'csv'

urls = %w[
  http://www.flipkart.com/view-books/0/new-releases?layout=grid&_pop=flyout
  http://www.flipkart.com/view-books/1/bestsellers
  http://www.flipkart.com/books/pre-order?query=book&cid=1&layout=list&ref=4b116001-01a6-4f53-8da7-945b74fdb253
]

CSV.open('result.csv', 'wb') do |row|
  row << ['title', 'price', 'description', 'pageno']
  urls.each do |url|
    doc = Nokogiri::HTML(open(url))
    puts doc.at_css('title').text
    doc.css('.fk-inf-scroll-item').each do |item|
      page = {
        titles: item.at_css('.fk-srch-title-text').text,
        prices: item.at_css('.final-price').text,
        descriptions: item.at_css('.fk-item-specs-section').text,
        pageno: (item.at_css('.fk-inf-pageno').text rescue nil),
      }
      page.each do |k, v|
        puts '%s: %s' % [k.to_s, v]
      end
      row << page.values
    end
  end
end
There are some useful pieces of data you can use to help you figure out how many records you need to retrieve:
var config = {container: "#search_results", page_size: 20, counterSelector: ".fk-item-count", totalResults: 88, "startParamName" : "inf-start", "startFrom": 20};
To access the values use something like:
doc.at('script[type="text/javascript+fk-onload"]').text =~ /page_size: (\d+).+totalResults: (\d+).+"startFrom": (\d+)/
page_size, total_results, start_from = $1, $2, $3
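From those captures (note they are strings, so convert them first) you can work out how many requests are needed:
pages = (total_results.to_f / page_size.to_i).ceil # with the sample config: (88 / 20.0).ceil => 5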

ruby hashes with files list from a directory of mp3

I have thousands of mp3s named like this: record-20091030.mp3, record-20091130.mp3, etc.
I want to parse them and obtain a Ruby hash year -> month -> [days] (a hash of hashes of arrays).
What's wrong with this code?
#!/usr/bin/env ruby
files = Dir.glob("mp3/*.mp3")
@result = Hash.new
files.each do |file|
  date = file.match(/\d{8}/).to_s
  year = date[0,4]
  month = date[4,2]
  day = date[6,2]
  @result[year.to_i] = Hash.new
  @result[year.to_i][month.to_i] = Array.new
  @result[year.to_i][month.to_i] << day
end
puts @result
You're overwriting the stored values (with Hash.new and Array.new) on every iteration of the loop; you should only do this when the hash/array is nil, e.g.:
@result[year.to_i] ||= Hash.new
@result[year.to_i][month.to_i] ||= Array.new
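Applied to the original loop, a corrected version (same logic, just with conditional assignment) looks like:
@result = Hash.new
files.each do |file|
  date = file.match(/\d{8}/).to_s
  year, month, day = date[0, 4], date[4, 2], date[6, 2]
  @result[year.to_i] ||= Hash.new           # create the year hash only once
  @result[year.to_i][month.to_i] ||= Array.new # create the month array only once
  @result[year.to_i][month.to_i] << day
end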
I've tried to make some fixes.
#!/usr/bin/env ruby
files = Dir.glob("mp3/*.mp3")
@result = Hash.new { |h, k| h[k] = Hash.new(&h.default_proc) }
files.each do |file|
  date = file[-12..-4]
  year, month, day = date.scan(/(.{4})(.{2})(.{2})/).first.map(&:to_i)
  @result[year][month][day] = file
end
@result.each_pair { |name, val| puts "#{name} #{val}" }
# => 2009 {10=>{30=>"mp3/record-20091030.mp3"},
#          11=>{30=>"mp3/record-20091130.mp3"}}
#    2010 {1=>{23=>"mp3/record-20100123.mp3"}}
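The default_proc trick is what makes the nesting work: the hash autovivifies another hash with the same default on first access to a missing key, so intermediate levels never have to be created explicitly:
h = Hash.new { |hash, key| hash[key] = Hash.new(&hash.default_proc) }
h[2009][10][30] = "mp3/record-20091030.mp3"
h # => {2009=>{10=>{30=>"mp3/record-20091030.mp3"}}}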
