Ruby: send the result of scraping through email

With Ruby, my app:
checks if the page status is 200
parses the PDF files if so
sends the result of the scraping via email
Having tested every part of the code, everything works fine except one thing: the mail that is sent doesn't contain the result of my scraping.
What is the issue? Is it related to the variable @monscrape, which may not be recognised in the final part of the code?
My code:
require 'open-uri'
require "net/http"
require 'rubygems'
require 'pdf/reader'
require 'mail'
options = { :address => "smtp.gmail.com",
:port => 587,
:domain => 'gmail.com',
:user_name => 'mail@gmail.com',
:password => 'pwd',
:authentication => 'plain',
:enable_starttls_auto => true
}
lien= "http://www.example.com"
url = URI.parse(lien)
req = Net::HTTP.new(url.host, url.port)
res = req.request_head(url.path)
if res.code == "200"
io = open('http://www.example.com')
reader = PDF::Reader.new(io)
reader.pages.each do |page|
res = page.text
@monscrape = res.scan(/text[\s\S]*text/)
end
Mail.defaults do
delivery_method :smtp, options
end
Mail.deliver do
to 'mail@hotmail.com'
from 'Author <mail@gmail.com>'
subject 'testing sendmail'
html_part do
content_type 'text/html; charset=UTF-8'
body '<h1>Please find below the scrape <%= @monscrape %></h1>'
end
end
else
puts "the link doesn't work"
end

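The problem is that the Mail.deliver block is evaluated using instance_eval, so self inside the block is the new message object rather than your own object. Instance variables (@variables) from the surrounding code are therefore not visible inside the Mail block.
So @monscrape will always be nil inside the Mail.deliver block.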
One solution is to use a local (non-instance) variable instead:
monscrape = "test"
Mail.deliver do
...
body "<h1>Please find below the scrape #{monscrape}</h1>"
...
end
Also note that Mail does not support ERB(!), so you cannot use something like <%= monscrape %> in the body. Treat the body as a normal string and use string interpolation, which requires double quotes (") rather than single quotes (').
See further discussion and options here:
Why can't the Mail block see my variable?
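The scoping behaviour described in this answer can be reproduced without the mail gem. The following is only a minimal sketch, not Mail's actual implementation: FakeMessage and deliver_like_mail are hypothetical stand-ins that mimic nothing more than the instance_eval call.

```ruby
# Hypothetical stand-in for a mail message; it only stores the body text.
class FakeMessage
  attr_reader :body_text

  def body(text)
    @body_text = text
  end
end

# Assumption: mimics how a DSL block is run with instance_eval.
def deliver_like_mail(&block)
  message = FakeMessage.new
  message.instance_eval(&block) # self inside the block is now `message`
  message
end

@monscrape = "scraped text" # instance variable of the calling object
monscrape = @monscrape      # local copy, captured by the block's closure

with_ivar  = deliver_like_mail { body "scrape: #{@monscrape.inspect}" }
with_local = deliver_like_mail { body "scrape: #{monscrape.inspect}" }

puts with_ivar.body_text  # scrape: nil
puts with_local.body_text # scrape: "scraped text"
```

The block still sees the local variable because blocks are closures; only the meaning of self (and therefore of @variables) changes.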

You can't use
res = req.request_head(url.path)
when url.path returns "". request_head expects a path of at least "/". That implies you need to fix up the URL being passed so it at least has the root path "/".
url = URI.parse('http://www.example.com')
url.path # => ""
req.request_head(url.path)
*** ArgumentError Exception: HTTP request path is empty
vs.
url = URI.parse('http://www.example.com/')
url.path # => "/"
req.request_head(url.path)
#<Net::HTTPOK 200 OK readbody=true>
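One way to guard against the empty path is to normalise it before calling request_head. This is a small sketch; safe_path is a hypothetical helper name, not part of Net::HTTP or URI.

```ruby
require 'uri'

# Hypothetical helper: returns a path that is safe to pass to request_head.
def safe_path(url)
  path = URI.parse(url).path
  path.empty? ? "/" : path
end

puts safe_path('http://www.example.com')          # /
puts safe_path('http://www.example.com/')         # /
puts safe_path('http://www.example.com/file.pdf') # /file.pdf
```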
The second problem is you're trying to read something as PDF that isn't a PDF file. Example.com returns HTML, which is text. You can't use:
io = open('http://www.example.com')
reader = PDF::Reader.new(io)
Trying to do so raises "PDF does not contain EOF marker".
It's really important that you understand what type of object or resource a site returns when you request a URL. You can't treat a response as whatever type you like and expect the code to accept it without errors.
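A cheap sanity check before handing the stream to PDF::Reader is to look at the first bytes, since PDF files begin with the signature "%PDF-". The sketch below is an illustration only: looks_like_pdf? is a hypothetical helper, and it uses StringIO in place of a real download. (On Ruby 3.0 and later the download itself would use URI.open rather than a bare open.)

```ruby
require 'stringio'

# Hypothetical helper: true if the stream starts with the "%PDF-" signature.
def looks_like_pdf?(io)
  header = io.read(5)
  io.rewind # leave the stream at the start for the real parser
  header == "%PDF-"
end

puts looks_like_pdf?(StringIO.new("%PDF-1.4\n..."))         # true
puts looks_like_pdf?(StringIO.new("<!doctype html><html>")) # false
```

Checking the Content-Type header of the response is another option, but servers do not always set it correctly.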

Related

Only OpenURI succeeds at Reddit API request

I’m making requests to the Reddit API. First, I set a subreddit top URL:
reddit_url = URI.parse('https://www.reddit.com/r/pixelart/top.json')
All of these correctly get the contents:
Net::HTTP.get(reddit_url, 'User-Agent' => 'My agent')
Open3.capture2('/usr/bin/curl', '--user-agent', 'My agent', reddit_url.to_s)[0]
URI.open(reddit_url, 'User-Agent' => 'My agent').read
But then I try it with a URL for a specific post:
reddit_url = URI.parse('https://reddit.com/r/PixelArt/comments/lkaiqf/another_watercolour_pixelart_tree.json')
And both Net::HTTP and Open3/curl fail, getting only empty strings. URI.open continues to work, as does opening the URL in a web browser.
Why doesn’t the second request work with two of the solutions? And why does it work with URI.open, when that’s supposed to be “an easy-to-use wrapper for Net::HTTP”? What does it do differently, and how can I replicate it with Net::HTTP and curl?
Working with your example, and focussing on Net::HTTP for simplicity, the first example doesn't work as written:
require 'net/http'
reddit_url = URI.parse('https://www.reddit.com/r/pixelart/top.json')
Net::HTTP.get(reddit_url, 'User-Agent' => 'My agent')
# => TypeError - no implicit conversion of URI::HTTPS into String
Instead I used this as my starting point:
require 'net/http'
reddit_url = URI.parse('https://www.reddit.com/r/pixelart/top.json')
http = Net::HTTP.new(reddit_url.host, reddit_url.port)
http.use_ssl = true
result = http.get(reddit_url.request_uri, 'User-Agent' => 'My agent')
puts result
# => #<Net::HTTPOK:0x00007fc3ea8e7320>
puts result.body.size
# => 167,394
With that working we can try the second URL. Interestingly, I get different results depending on whether I re-use the initial connection or make a new one:
require 'net/http'
reddit_url = URI.parse('https://www.reddit.com/r/pixelart/top.json')
reddit_url_two = URI.parse('https://reddit.com/r/PixelArt/comments/lkaiqf/another_watercolour_pixelart_tree.json')
http = Net::HTTP.new(reddit_url.host, reddit_url.port)
http.use_ssl = true
result = http.get(reddit_url.request_uri, 'User-Agent' => 'My agent')
puts result
# => #<Net::HTTPOK:0x00007f931a143390>
puts result.body.size
# => 174,615
http_two = Net::HTTP.new(reddit_url_two.host, reddit_url_two.port)
http_two.use_ssl = true
result_two = http_two.get(reddit_url_two.request_uri, 'User-Agent' => 'My agent')
puts result_two
# => #<Net::HTTPMovedPermanently:0x00007f931a148818>
puts result_two.body.size
# => 0
result_reusing_connection = http.get(reddit_url_two.request_uri, 'User-Agent' => 'My agent')
puts result_reusing_connection
# => #<Net::HTTPOK:0x00007f931a0fb3b0>
puts result_reusing_connection.body.size
# => 141,575
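So I suspect you're sometimes getting a 301 redirect, and that's causing the confusion. Note that the second URL uses the host reddit.com without the www prefix: the fresh connection to that host gets the redirect, while the reused connection is still talking to www.reddit.com and gets a 200. Net::HTTP does not follow redirects on its own, whereas URI.open does, and curl needs --location (-L) to do so. There's another question and answer here for how to follow redirects.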
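Following the redirect by hand with Net::HTTP could look roughly like this. It is a sketch under assumptions: fetch_following_redirects is a hypothetical helper name, the limit of 5 hops is arbitrary, and error handling is left out.

```ruby
require 'net/http'
require 'uri'

# Hypothetical helper: issues a GET and follows up to `limit` redirects.
def fetch_following_redirects(url, headers = {}, limit = 5)
  raise ArgumentError, 'too many redirects' if limit.zero?

  uri = URI.parse(url)
  response = Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |http|
    http.get(uri.request_uri, headers)
  end

  case response
  when Net::HTTPRedirection
    # The Location header may be relative, so resolve it against the current URL.
    next_url = URI.join(url, response['location']).to_s
    fetch_following_redirects(next_url, headers, limit - 1)
  else
    response
  end
end

# Usage (performs real network requests, so it is left commented out):
# result = fetch_following_redirects(
#   'https://reddit.com/r/PixelArt/comments/lkaiqf/another_watercolour_pixelart_tree.json',
#   'User-Agent' => 'My agent'
# )
# puts result.body.size
```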

Undefined method 'host' in rspec

I have the following methods in a Ruby script:
def parse_endpoint(endpoint)
return URI.parse(endpoint)
end
def verify_url(endpoint, fname)
url = "#{endpoint}#{fname}"
req = Net::HTTP.new(url.host, url.port)
res = req.request_head(url.path)
if res.code == "200"
true
else
puts "#{fname} is an invalid file"
false
end
end
Testing the url manually like so works fine (returns true since the url is indeed valid):
endpoint = parse_endpoint('http://mywebsite.com/mySubdirectory/')
verify_url(endpoint, "myFile.json")
However, when I try to do the following in rspec
describe 'my functionality' do
let(:endpoint) { parse_endpoint("http://mywebsite.com/mySubdirectory/") }
it 'should verify valid url' do
expect(verify_url(endpoint, "myFile.json")).to eq(true)
end
end
it gives me this error
NoMethodError:
undefined method `host' for "http://mywebsite.com/mySubdirectory/myFile.json":String
What am I doing wrong?
url is a String object, and you are trying to call a method named host, which String does not have:
url = "#{endpoint}#{fname}"
req = Net::HTTP.new(url.host, url.port)
EDIT: you probably need a URI object. I think this is what you want:
2.2.1 :004 > require 'uri'
=> true
2.2.1 :001 > url = 'http://mywebsite.com/mySubdirectory/'
=> "http://mywebsite.com/mySubdirectory/"
2.2.1 :005 > parsed_url = URI.parse url
=> #<URI::HTTP http://mywebsite.com/mySubdirectory/>
2.2.1 :006 > parsed_url.host
=> "mywebsite.com"
So just add url = URI.parse(url) before calling url.host.
Testing the url manually like so works fine (returns true since the url is indeed valid):
endpoint = parse_endpoint('http://mywebsite.com/mySubdirectory/')
verify_url(endpoint, "myFile.json")
It seems you missed something when you tested the code above (maybe you tested an old version), because it can't work as it is written now.
Look at these lines of code:
url = "#{endpoint}#{fname}"
req = Net::HTTP.new(url.host, url.port)
You're building a String named url from two other variables, endpoint and fname. So far, so good.
But then you call the method host on url, and String has no such method (it does exist on the endpoint variable, which is a URI). That's why you get this error.
You may want to use this code instead:
def verify_url(endpoint, fname)
url = endpoint.merge(fname)
res = Net::HTTP.start(url.host, url.port) do |http|
http.head(url.path)
end
# printing from a query method is a bad idea,
# so just return the boolean value instead
res.code == "200"
end
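The difference between interpolating and merging can be checked without any network access. The sketch below reuses the placeholder host from the question; the variable names are only for illustration.

```ruby
require 'uri'

endpoint = URI.parse('http://mywebsite.com/mySubdirectory/')

as_string = "#{endpoint}myFile.json"      # interpolation yields a plain String
as_uri    = endpoint.merge('myFile.json') # merge yields a URI object

puts as_string.respond_to?(:host) # false
puts as_uri.host                  # mywebsite.com
puts as_uri.path                  # /mySubdirectory/myFile.json

# Without the trailing slash, the last path segment is replaced.
no_slash = URI.parse('http://mywebsite.com/mySubdirectory')
puts no_slash.merge('myFile.json') # http://mywebsite.com/myFile.json
```

So the trailing slash on the endpoint matters when relying on merge.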

Hash/string gets escaped

This is my hyperresource client:
require 'rubygems'
require 'hyperresource'
require 'json'
api = HyperResource.new(root: 'http://127.0.0.1:9393/todos',
headers: {'Accept' => 'application/vnd.127.0.0.1:9393/todos.v1+hal+json'})
string = '{"todo":{"title":"test"}}'
hash = JSON.parse(string)
api.post(hash)
puts hash
The hash output is: {"todo"=>{"title"=>"test"}}
At my Sinatra with Roar API I have this post function:
post "/todos" do
params.to_json
puts params
@todo = Todo.new(params[:todo])
if @todo.save
@todo.extend(TodoRepresenter)
@todo.to_json
else
puts 'FAIL'
end
end
The puts of params here prints: {"{\"todo\":{\"title\":\"test\"}}"=>nil}
I found out these are 'escaped strings', but I don't know where it goes wrong.
EDIT:
I checked my API with curl and the Postman extension, and both work fine. I guess it's just hyperresource.
You are posting JSON, ergo you either need to register a Sinatra middleware that will automatically parse incoming JSON requests, or you need to do it yourself.
require 'rubygems'
require 'hyperresource'
require 'json'
api = HyperResource.new(root: 'http://127.0.0.1:9393/todos',
headers: {'Accept' => 'application/vnd.127.0.0.1:9393/todos.v1+hal+json'})
string = '{"todo":{"title":"test"}}'
hash = JSON.parse(string)
api.post({:data => hash})
puts hash
---
post "/todos" do
payload = JSON.parse(params[:data])
puts payload.inspect
# JSON.parse returns string keys by default, so use "todo" rather than :todo
@todo = Todo.new(payload["todo"])
if @todo.save
@todo.extend(TodoRepresenter)
@todo.to_json
else
puts 'FAIL'
end
end
Should do what you need.
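Parsing the JSON yourself comes down to a JSON.parse call on the raw text. The sketch below assumes the client sends the JSON as the raw request body (in Sinatra that text can be read with request.body.read); raw_body is a stand-in for it. It also shows why symbol keys do not work on the parsed result unless you ask for them.

```ruby
require 'json'

raw_body = '{"todo":{"title":"test"}}' # stand-in for the raw request body

payload = JSON.parse(raw_body)
puts payload["todo"]["title"] # test
puts payload[:todo].inspect   # nil (keys are strings by default)

symbolized = JSON.parse(raw_body, symbolize_names: true)
puts symbolized[:todo][:title] # test
```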

Ruby RestClient converts XML to Hash

I need to send a POST request as an XML string but I get odd results. The code:
require 'rest_client'
response = RestClient.post "http://127.0.0.1:2000", "<tag1>text</tag1>", :content_type => "text/xml"
I expect to receive "<tag1>text</tag1>" as the parameter on the request server. Instead, I get "tag1"=>"text". It converts the XML to a hash. Why is that? Any way around this?
Try this:
response = RestClient.post "http://127.0.0.1:2000",
"<tag1>text</tag1>",
{:accept => :xml, :content_type => :xml}
I think you just needed to specify :accept to let it know you wanted to receive the response in XML format. Assuming it's your own server, you can debug on the server and check the request format, which is probably html.
Hope that helps.
Instead of using RestClient, use Ruby's built-in OpenURI for GET requests, or something like Net::HTTP or the incredibly powerful Typhoeus:
uri = URI('http://www.example.com/search.cgi')
res = Net::HTTP.post_form(uri, 'q' => 'ruby', 'max' => '50')
In Typhoeus, you'd use:
res = Typhoeus::Request.post(
'http://localhost:3000/posts',
:params => {
:title => 'test post',
:content => 'this is my test'
}
)
Your resulting page, if it's in XML, will be easy to parse using Nokogiri:
doc = Nokogiri::XML(res.body)
At that point you'll have a fully parsed DOM, ready to be searched, using Nokogiri's search methods, such as search and at, or any of their related methods.
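If the goal is to post the XML string itself as the request body, Net::HTTP can do that by setting the body and Content-Type on a Post request. This is a sketch, not the only way to do it: build_xml_post is a hypothetical helper, and the URL is the one from the question.

```ruby
require 'net/http'
require 'uri'

# Hypothetical helper: builds (but does not send) a POST carrying raw XML.
def build_xml_post(url, xml)
  uri = URI.parse(url)
  request = Net::HTTP::Post.new(uri.request_uri) # request_uri is "/" for an empty path
  request['Content-Type'] = 'text/xml'
  request.body = xml
  request
end

request = build_xml_post('http://127.0.0.1:2000', '<tag1>text</tag1>')
puts request.path # /
puts request.body # <tag1>text</tag1>

# Sending it needs a server listening on that port, so it is commented out:
# response = Net::HTTP.start('127.0.0.1', 2000) { |http| http.request(request) }
```

Whether the server then exposes the body as a string or converts it to a hash is decided on the server side.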

Ruby HTTP get with params

How can I send HTTP GET request with parameters via ruby?
I have tried a lot of examples but all of those failed.
I know this post is old, but for the sake of those brought here by Google, there is an easier way to encode your parameters in a URL-safe manner. I'm not sure why I haven't seen this elsewhere, as the method is documented on the Net::HTTP page. I have also seen the method described by Arsen7 as the accepted answer on several other questions.
Mentioned in the Net::HTTP documentation is URI.encode_www_form(params):
# Let's say we have a path and params that look like this:
path = "/search"
params = {:q => "answer"}
# Example 1: Replacing the #path_with_params method from Arsen7
def path_with_params(path, params)
encoded_params = URI.encode_www_form(params)
[path, encoded_params].join("?")
end
# Example 2: A shortcut for the entire example by Arsen7
uri = URI.parse("http://localhost.com" + path)
uri.query = URI.encode_www_form(params)
response = Net::HTTP.get_response(uri)
Which example you choose is very much dependent on your use case. In my current project I am using a method similar to the one recommended by Arsen7 along with the simpler #path_with_params method and without the block format.
# Simplified example implementation without response
# decoding or error handling.
require "net/http"
require "uri"
class Connection
VERB_MAP = {
:get => Net::HTTP::Get,
:post => Net::HTTP::Post,
:put => Net::HTTP::Put,
:delete => Net::HTTP::Delete
}
API_ENDPOINT = "http://dev.random.com"
attr_reader :http
def initialize(endpoint = API_ENDPOINT)
uri = URI.parse(endpoint)
@http = Net::HTTP.new(uri.host, uri.port)
end
def request(method, path, params)
case method
when :get
full_path = path_with_params(path, params)
request = VERB_MAP[method].new(full_path)
else
request = VERB_MAP[method].new(path)
request.set_form_data(params)
end
http.request(request)
end
private
def path_with_params(path, params)
encoded_params = URI.encode_www_form(params)
[path, encoded_params].join("?")
end
end
con = Connection.new
con.request(:post, "/account", {:email => "test@test.com"})
# => #<Net::HTTPCreated 201 Created readbody=true>
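To see what URI.encode_www_form produces, it can be run on its own without making a request. The parameter values below are made up for illustration.

```ruby
require 'uri'

simple  = URI.encode_www_form(q: "answer")
escaped = URI.encode_www_form(name: "John&Sons", id: 1)
multi   = URI.encode_www_form(q: "ruby http", tag: ["a", "b"])

puts simple  # q=answer
puts escaped # name=John%26Sons&id=1
puts multi   # q=ruby+http&tag=a&tag=b

# Attaching the encoded string to a URI object:
uri = URI.parse("http://localhost.com/search")
uri.query = simple
puts uri # http://localhost.com/search?q=answer
```

Array values are expanded into repeated keys, and spaces are encoded as "+".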
I assume that you understand the examples on the Net::HTTP documentation page but you do not know how to pass parameters to the GET request.
You just append the parameters to the requested address, in exactly the same way you type such address in the browser:
require 'net/http'
res = Net::HTTP.start('localhost', 3000) do |http|
http.get('/users?id=1')
end
puts res.body
If you need some generic way to build the parameters string from a hash, you may create a helper like this:
require 'cgi'
def path_with_params(page, params)
return page if params.empty?
page + "?" + params.map {|k,v| CGI.escape(k.to_s)+'='+CGI.escape(v.to_s) }.join("&")
end
path_with_params("/users", :id => 1, :name => "John&Sons")
# => "/users?id=1&name=John%26Sons"

Resources