How to post (HTTP POST) the content of a PDF using Ruby?

I am trying to post the raw content of a PDF in Ruby using the following block:
require 'pdf/reader'
require 'curb'
reader = PDF::Reader.new('folder/file.pdf')
raw_string = ''
reader.pages.each do |page|
raw_string = raw_string + page.raw_content.to_s
end
c = Curl::Easy.new('http://0.0.0.0:4567/pdf_upload')
c.http_post(Curl::PostField.content('param1', 'value1'),Curl::PostField.content('param2', 'value2'), c.http_post(Curl::PostField.content('body', raw_string)))
Inside the API implementation, params[:body] is always empty, even though puts raw_string confirms that the variable holds the content.
Also, is there a better way to post PDF content?

Regarding how you're building raw_string...
Instead of:
reader.pages.each do |page|
raw_string = raw_string + page.raw_content.to_s
end
You should be able to do something like one of these:
raw_string = reader.pages.map(&:raw_content).join
raw_string = reader.pages.map{ |p| p.raw_content.to_s }.join
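I'd also recommend spreading your last line across several lines for clarity and readability. Doing so exposes a likely cause of the empty params[:body]: the third argument is a nested c.http_post call, which sends a request of its own instead of adding a 'body' field to the outer one. Passing the field directly should avoid that:
c.http_post(
  Curl::PostField.content('param1', 'value1'),
  Curl::PostField.content('param2', 'value2'),
  Curl::PostField.content('body', raw_string)
)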
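On the second question (a better way to post PDF content): page.raw_content gives each page's raw content stream rather than the file itself, so if the API needs the actual document, one option is to upload the file's bytes as a multipart form field. The sketch below uses Ruby's standard Net::HTTP in place of curb. The helper name build_pdf_upload, the field name 'file', and the placeholder bytes are assumptions for illustration, and the request is only built here, not sent.

```ruby
require 'net/http'
require 'stringio'
require 'uri'

# Build (but do not send) a multipart/form-data POST carrying a PDF.
# build_pdf_upload and the 'file' field name are hypothetical.
def build_pdf_upload(url, pdf_bytes, params = {})
  request = Net::HTTP::Post.new(URI(url))
  fields = params.map { |key, value| [key.to_s, value.to_s] }
  # A file part is [name, IO, options]; StringIO stands in for File.open(path, 'rb').
  fields << ['file', StringIO.new(pdf_bytes),
             { filename: 'file.pdf', content_type: 'application/pdf' }]
  request.set_form(fields, 'multipart/form-data')
  request
end

# Placeholder bytes; in practice use File.binread('folder/file.pdf').
pdf_bytes = "%PDF-1.4\nplaceholder".b
request = build_pdf_upload('http://0.0.0.0:4567/pdf_upload', pdf_bytes,
                           { 'param1' => 'value1', 'param2' => 'value2' })
puts request.content_type

# To send it:
# uri = URI('http://0.0.0.0:4567/pdf_upload')
# Net::HTTP.start(uri.host, uri.port) { |http| http.request(request) }
```

Sending the file as binary avoids concatenating page streams into one string, which loses everything in the PDF outside those streams.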

Related

How to parse CSON to Ruby object?

I am trying to read CSON (CoffeeScript Object Notation) into Ruby.
I am looking for something similar to data = JSON.parse(file) that one would use for JSON files.
file = File.read(filename)
data = CSON.parse(file) # does not exist - would like to have
I looked into invoking CoffeeScript and JavaScript from Ruby, but it feels overly complicated and like reinventing the wheel. Also, code in the data file should not be executed.
How can I read CSON into Ruby objects in a simple way?
This is what I came up with. It is sufficient for the data I am processing. The main work is done by the YAML parser Psych (https://github.com/ruby/psych). Arrays, hashes, and some of the multi-line text require special treatment.
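require 'yaml'
module CSON
  module_function # allows calls such as CSON.load_file(path)
  def load_file(fname)
    load_string File.read(fname)
  end
  # strip one level (two characters) of leading indentation from each line
  def remove_indent(data)
    data.each_line.map { |line| line.sub(/^\s\s/, "") }.join
  end
  # turn a multi-line CSON array into a single-line YAML flow sequence
  def parse_array(data)
    data = data.gsub(/\n/, ",")
    data.gsub!(/([\[\{]),/, '\1')
    data.gsub!(/,([\]\}])/, '\1')
    YAML.load data
  end
  def load_string(data)
    data = data.dup # avoid mutating the caller's string
    hashed = {}
    data.gsub!(/^(\w+):\s+(\[.*?\])/mu) do # find arrays
      key = Regexp.last_match[1]
      value = parse_array Regexp.last_match[2]
      hashed[key] = value
      ""
    end
    data.gsub!(/(\w+):\s+\'\'\'\s*\n(.*?)\'\'\'/mu) do # find heredocs
      hashed[Regexp.last_match[1]] = remove_indent Regexp.last_match[2]
      ""
    end
    # YAML.load returns false for empty input, so fall back to an empty hash
    hashed.merge(YAML.load(data) || {})
  end
end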
This solution is likely to fail when applied to more complicated .cson files. I would be happy to see if someone has a more elegant answer!
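As a small illustration of why a YAML parser can carry most of the load: flat CSON without braces or multi-line constructs is already close to valid YAML. The sample data below is made up, and YAML.safe_load is used because it builds only plain data types, which fits the requirement that nothing in the data file gets executed.

```ruby
require 'yaml'

# Hypothetical CSON snippet: flat keys, quoted strings, a one-line array.
cson = <<~CSON
  name: 'demo-package'
  version: '1.0.0'
  keywords: ['syntax', 'ruby']
CSON

data = YAML.safe_load(cson)
puts data['name']
puts data['keywords'].inspect
```

Anything beyond this subset (multi-line arrays, triple-quoted strings) is where the preprocessing in the module above comes in.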

How to get RSS feed in xml format for ruby script

I am using the following Ruby script from a Dashing widget. It retrieves an RSS feed, parses it, and sends the parsed titles and descriptions to the widget.
require 'net/http'
require 'uri'
require 'nokogiri'
require 'htmlentities'
news_feeds = {
"seattle-times" => "http://seattletimes.com/rss/home.xml",
}
Decoder = HTMLEntities.new
class News
  def initialize(widget_id, feed)
    @widget_id = widget_id
    # pick apart feed into domain and path
    uri = URI.parse(feed)
    @path = uri.path
    @http = Net::HTTP.new(uri.host)
  end
  def widget_id()
    @widget_id
  end
  def latest_headlines()
    response = @http.request(Net::HTTP::Get.new(@path))
    doc = Nokogiri::XML(response.body)
    news_headlines = [];
    doc.xpath('//channel/item').each do |news_item|
      title = clean_html( news_item.xpath('title').text )
      summary = clean_html( news_item.xpath('description').text )
      news_headlines.push({ title: title, description: summary })
    end
    news_headlines
  end
  def clean_html( html )
    html = html.gsub(/<\/?[^>]*>/, "")
    html = Decoder.decode( html )
    return html
  end
end
@News = []
news_feeds.each do |widget_id, feed|
  begin
    @News.push(News.new(widget_id, feed))
  rescue Exception => e
    puts e.to_s
  end
end
SCHEDULER.every '60m', :first_in => 0 do |job|
  @News.each do |news|
    headlines = news.latest_headlines()
    send_event(news.widget_id, { :headlines => headlines })
  end
end
The example rss feed works correctly because the URL is for an xml file. However I want to use this for a different rss feed that does not provide an actual xml file. This rss feed I want is at http://www.ttc.ca/RSS/Service_Alerts/index.rss
This doesn't seem to display anything on the widget. Instead of using "http://www.ttc.ca/RSS/Service_Alerts/index.rss", I also tried "http://www.ttc.ca/RSS/Service_Alerts/index.rss?format=xml" and "view-source:http://www.ttc.ca/RSS/Service_Alerts/index.rss" but with no luck. Does anyone know how I can get the actual xml data related to this rss feed so that I can use it with this ruby script?
You're right, that link does not provide the same kind of XML as the example, so the script, which is written specifically for the example feed's structure, won't parse it. The feed you're trying to parse provides RDF/XML, which you can parse with the rdf-rdfxml gem (required as rdf/rdfxml).
Something like:
require 'nokogiri'
require 'rdf/rdfxml'
rss_feed = 'http://www.ttc.ca/RSS/Service_Alerts/index.rss'
RDF::RDFXML::Reader.open(rss_feed) do |reader|
# use reader to iterate over elements within the document
end
From here you can try learning how to use RDFXML to extract the content you want. I'd begin by inspecting the reader object for methods I could use:
puts reader.methods.sort - Object.methods
That will print the reader's own methods. Look for one you might be able to use for your purposes, such as reader.each_entry.
To further dig down you can inspect what each entry looks like:
reader.each_entry do |entry|
puts "----here's an entry----"
puts entry.inspect
end
or see what methods you can call on the entry:
reader.each_entry do |entry|
puts "----here's an entry's methods----"
puts entry.methods.sort - Object.methods
break
end
I was able to crudely find some titles and descriptions using this hack job:
RDF::RDFXML::Reader.open('http://www.ttc.ca/RSS/Service_Alerts/index.rss') do |reader|
reader.each_object do |object|
puts object.to_s if object.is_a? RDF::Literal
end
end
# returns:
# TTC Service Alerts
# http://www.ttc.ca/Service_Advisories/index.jsp
# TTC Service Alerts.
# TTC.ca
# http://www.ttc.ca
# http://www.ttc.ca/images/ttc-main-logo.gif
# Service Advisory
# http://www.ttc.ca/Service_Advisories/all_service_alerts.jsp#Service+Advisory
# 196 York University Rocket route diverting northbound via Sentinel, Finch due to a collision that has closed the York U Bus way.
# - Affecting: Bus Routes: 196 York University Rocket
# 2013-12-17T13:49:03.800-05:00
# Service Advisory (2)
# http://www.ttc.ca/Service_Advisories/all_service_alerts.jsp#Service+Advisory+(2)
# 107B Keele North route diverting northbound via Keele, Lepage due to a collision that has closed the York U Bus way.
# - Affecting: Bus Routes: 107 Keele North
# 2013-12-17T13:51:08.347-05:00
But I couldn't quickly find a way to know which one was a title, and which a description :/
Finally, if you still can't find how to extract what you want, start a new question with this info.
Good luck!
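One more avenue, offered as a sketch rather than a confirmed fix: if the feed is RSS 1.0 (the RDF-based flavour), its item elements are siblings of channel under the rdf:RDF root rather than children of channel, so the widget's //channel/item query would match nothing. The example below uses REXML on a made-up sample document (not the real TTC feed, whose exact layout is an assumption) and picks out each item's title and description by element name, which also tells titles and descriptions apart. The helper names child_text and headlines_from are hypothetical.

```ruby
require 'rexml/document'

# Made-up RSS 1.0 sample; the real feed's layout is an assumption.
SAMPLE_FEED = <<~XML
  <?xml version="1.0" encoding="UTF-8"?>
  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
           xmlns="http://purl.org/rss/1.0/">
    <channel rdf:about="http://example.com/feed">
      <title>Sample Alerts</title>
    </channel>
    <item rdf:about="http://example.com/alerts/1">
      <title>Service Advisory</title>
      <description>Route 1 is diverting.</description>
    </item>
    <item rdf:about="http://example.com/alerts/2">
      <title>Service Advisory (2)</title>
      <description>Route 2 is delayed.</description>
    </item>
  </rdf:RDF>
XML

# Text of the first direct child element with the given local name.
def child_text(element, name)
  child = element.elements.to_a.find { |e| e.name == name }
  child ? child.text.to_s.strip : ''
end

# Collect title/description pairs from the <item> children of the root.
def headlines_from(xml)
  doc = REXML::Document.new(xml)
  items = doc.root.elements.to_a.select { |e| e.name == 'item' }
  items.map do |item|
    { title: child_text(item, 'title'),
      description: child_text(item, 'description') }
  end
end

headlines = headlines_from(SAMPLE_FEED)
headlines.each { |h| puts "#{h[:title]}: #{h[:description]}" }
```

The resulting array has the same shape as the one latest_headlines builds, so it could be passed to send_event unchanged.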

Why is only the first link fetched?

I'm trying to fetch news from Hacker News and write a link's title and URL to an HTML file. However, only the first link is getting written and others are not. What am I doing wrong?
require 'httparty'
def fetch(source)
response = HTTParty.get(source)
response["items"].each do |item|
return '' + item["title"] + ''
end
end
links = fetch('http://api.ihackernews.com/page')
File.open("/tmp/news.html", "w") do |f|
f.puts links
end
You shouldn't use the return keyword here. It ends the method during the first iteration, so only the first link is returned. Use map instead:
require 'httparty'
def fetch(source)
response = HTTParty.get(source)
# convert response['items'] array to array of strings
response["items"].map do |item|
'' + item["title"] + ''
end
end
links = fetch('http://api.ihackernews.com/page')
links.length # => 30
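The difference can be reproduced without any network access. In the sketch below, ITEMS is made-up stand-in data for response["items"], and both method names are hypothetical.

```ruby
# Made-up stand-in for response["items"].
ITEMS = [
  { 'title' => 'First story',  'url' => 'http://example.com/1' },
  { 'title' => 'Second story', 'url' => 'http://example.com/2' },
  { 'title' => 'Third story',  'url' => 'http://example.com/3' }
].freeze

# return exits the whole method during the first iteration.
def first_only(items)
  items.each do |item|
    return item['title']
  end
end

# map collects the block's value for every element.
def all_titles(items)
  items.map { |item| item['title'] }
end

puts first_only(ITEMS).inspect  # a single string
puts all_titles(ITEMS).inspect  # an array with one title per item
```

Passing the array from all_titles to f.puts writes one title per line.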

Output several times

I have the following code:
require 'rubygems'
require 'nokogiri'
require 'open-uri'
time = Time.new
url = "http://mobile.bahn.de/bin/mobil/bhftafel.exe/dox?input=Richard-Strauss-Stra%DFe%2C+M%FCnchen%23625127&date=" +
time.strftime("%d%m%Y") +
"&time=" +
time.strftime("%H") +
"%3A" +
time.strftime("%M") +
"&productsFilter=1111111111000000&REQTrain_name=&maxJourneys=10&start=Suchen&boardType=Abfahrt&ao=yes"
doc = Nokogiri::HTML(open(url))
doc.xpath('//div//p').remove
doc.encoding = 'UTF-8'
doc = doc.xpath('//div').each do |node|
text = node.text.gsub(/\n([ \t]*\n)+/,"\n",).gsub(/^\s+|\s+$/,'').gsub("Startseite", '').gsub("Impressum", '')
puts text unless text.empty?
end
I have two problems:
The output is printed three times instead of once.
The German umlauts, such as ä and ü.
The original HTML is long and not indented, so it is very hard to debug.
But I think you need to replace:
doc = doc.xpath('//div').each do |node|
With:
doc = doc.xpath('//body/div').each do |node|
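The first expression matches every <div> in the document, so it returns the //body/div elements and then, separately, the <div>s nested inside them. Since the text of an outer <div> already contains the text of its nested <div>s, the same content is printed more than once.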
I had no problems with umlaut characters, using puts, but did have problems writing them to a file. What is your exact problem? It might be best if you create a new question on Stack Overflow for the umlauts issue.
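To see the difference between the two expressions in isolation, here is a minimal sketch on a made-up page fragment. It uses REXML instead of Nokogiri so that it runs without extra gems; the XPath expressions are the same ones discussed above.

```ruby
require 'rexml/document'
require 'rexml/xpath'

# Made-up page fragment: one top-level <div> wrapping two nested ones.
PAGE = <<~HTML
  <html>
    <body>
      <div>
        <div>Departure 1</div>
        <div>Departure 2</div>
      </div>
    </body>
  </html>
HTML

doc = REXML::Document.new(PAGE)
all_divs = REXML::XPath.match(doc, '//div')       # outer and nested <div>s
top_divs = REXML::XPath.match(doc, '//body/div')  # only direct children of <body>

puts all_divs.size
puts top_divs.size
```

With Nokogiri, node.text on the outer <div> includes the text of the nested ones, so iterating over every //div match prints the nested content again.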

Optimizing Ruby RSS

I'm writing a very simple Ruby script to parse tweets out of a Twitter RSS feed. Here's the code I have:
require 'rss'
@rss = RSS::Parser.parse('statuses.xml', false)
outputfile = open("output.txt", "w")
@rss.items.each do |i|
pubdate = i.published.to_s
if pubdate.include? '2011-05'
tweet = i.title.to_s
tweet = tweet.gsub(/<title>SlyFlourish: /, "")
tweet = tweet.gsub(/<\/title>/, "\n\n")
outputfile << tweet
end
end
I think I'm missing something about dealing with the objects coming out of the RSS parser. Can someone tell me how I can better pull out the title and date entries from the object returned by the parser?
Is there a reason you chose RSS? Parsing XML is expensive.
I'd consider using JSON instead.
There's also a twitter Ruby gem that makes this really easy:
require "twitter"
Twitter.user_timeline("gavin_morrice").each do |tweet|
puts tweet.text
puts tweet.created_at
end
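If you prefer to stay with the rss library, the parsed objects already expose the title text and the date, so no tag stripping should be needed. The sketch below assumes an Atom feed (suggested by the question's use of published); the sample document is made up, and for an RSS 2.0 feed the accessors would be item.title and item.pubDate instead.

```ruby
require 'rss'

# Made-up Atom document standing in for statuses.xml.
SAMPLE_ATOM = <<~XML
  <?xml version="1.0" encoding="UTF-8"?>
  <feed xmlns="http://www.w3.org/2005/Atom">
    <id>tag:example.com,2011:timeline</id>
    <title>Sample timeline</title>
    <updated>2011-06-15T09:00:00Z</updated>
    <entry>
      <id>tag:example.com,2011:1</id>
      <title>SlyFlourish: Prepping for game night</title>
      <updated>2011-05-14T10:00:00Z</updated>
      <published>2011-05-14T10:00:00Z</published>
    </entry>
    <entry>
      <id>tag:example.com,2011:2</id>
      <title>SlyFlourish: A post from June</title>
      <updated>2011-06-15T09:00:00Z</updated>
      <published>2011-06-15T09:00:00Z</published>
    </entry>
  </feed>
XML

feed = RSS::Parser.parse(SAMPLE_ATOM, false) # false: skip validation

may_tweets = feed.entries.filter_map do |entry|
  published = entry.published.content # a Time object
  next unless published.year == 2011 && published.month == 5
  # title.content is plain text, so only the account prefix is removed
  entry.title.content.sub(/\ASlyFlourish: /, '')
end

puts may_tweets
```

Comparing the year and month on the Time object avoids matching on the date's string form.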
