For context, I'm someone with zero experience in Ruby - I just asked my Senior Dev to copy-paste me some of his Ruby code so I could try to work with some APIs that he ended up putting off because he was too busy.
So I'm using an API wrapper called zoho_hub, which wraps the Zoho APIs (https://github.com/rikas/zoho_hub/blob/master/README.md).
My IDE is VSCode.
I execute the entire script, and this is all I get:
[Done] exited with code=0 in 1.26 seconds
The API is supposed to return a paginated list of records, but nothing is output in VSCode, even though no error is reported. The last two lines of my code are:
ZohoHub.connection.get 'Leads'
p "testing"
I use the dummy string "testing" to make sure the script runs all the way to the end, and it does get printed.
This has been baffling me for hours now - is my response actually being output somewhere that I just can't see?
Ruby does not print anything unless you tell it to. For debugging there is a pretty-printing method called pp, which is handy for inspecting structured data.
In this case, if you want to output the records that your get method returns, you would do:
pp ZohoHub.connection.get('Leads')
To get the next page, look at the gem's source code and you will see that the get method takes an additional Hash parameter:
def get(path, params = {})
Then, reading the Zoho API documentation for this call, you will see that a specific page is requested using the page parameter.
Therefore we can finally piece it together:
pp ZohoHub.connection.get('Leads', page: NNN)
Where NNN is the number of the page you want to request.
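If you need to walk through every page, here is a rough sketch. It assumes the parsed response is a Hash shaped like Zoho CRM v2 responses, i.e. with a data array and an info hash containing a more_records flag; the exact keys (and whether they are strings or symbols) depend on your zoho_hub version, so verify with pp first.

page = 1
loop do
  response = ZohoHub.connection.get('Leads', page: page)
  pp response
  # Stop when Zoho reports there are no further pages.
  info = response[:info] || response['info'] || {}
  break unless info[:more_records] || info['more_records']
  page += 1
end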
I have a collection of Person documents, stored in a legacy MongoDB server (2.4) and accessed with the mongoid gem via the Ruby MongoDB driver.
If I perform a
Person.where(email: 'some.existing.email@server.tld').first
I get a result (let's assume I store the id in a variable called "the_very_same_id_obtained_above")
If I perform a
Person.find(the_very_same_id_obtained_above)
I get a
Mongoid::Errors::DocumentNotFound
exception
If I use the JavaScript syntax to perform the query, the result is found:
Person.where("this._id == #{the_very_same_id_obtained_above}").first # this works!
I'm currently trying to migrate the data to a newer version. I'm currently mongorestore-ing onto Amazon DocumentDB (MongoDB 3.6 compatible) to run tests, and the issue remains.
One thing I noticed is that those object ids are peculiar:
5ce24b1169902e72c9739ff6 (this one works anyway)
59de48f53137ec054b000004 (this one requires the trick)
The run of zeroes toward the end of the id seems to be highly correlated with the problem (I have no idea why).
That's the default:
# Raise an error when performing a #find and the document is not found.
# (default: true)
raise_not_found_error: true
Source: https://docs.mongodb.com/mongoid/current/tutorials/mongoid-configuration/#anatomy-of-a-mongoid-config
If this doesn't answer your question, it's very likely the find method is overridden somewhere in your code!
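As a minimal illustration of that setting (assuming the default configuration above and the same Person model), you can either rescue the error or flip the flag in mongoid.yml so that find returns nil instead of raising:

# With the default raise_not_found_error: true
begin
  person = Person.find(the_very_same_id_obtained_above)
rescue Mongoid::Errors::DocumentNotFound
  person = nil
end

# With raise_not_found_error: false in mongoid.yml, #find simply
# returns nil when no document matches the given id:
person = Person.find(the_very_same_id_obtained_above)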
I'm starting to use whoisrb and I'm noticing domains from some registrars return nil contact information.
For example:
domain_name = ARGV[0]
r = Whois.whois(domain_name)
t = r.registrant_contact
if t.nil?
  puts 'Registrant Contact is empty.'
end
This will print "Registrant Contact is empty." Trying to access the contact attributes results in an error like undefined method 'id' for nil:NilClass (NoMethodError).
If I check the raw record that's being returned (puts r), I can see it's getting the thick record, so the contact information is there in the unparsed response.
The two registrars I've noticed this for, so far, are onlinenic.com and namesilo.com. If you try to run whois for those two domains, you'll see what I mean.
I'm checking the ICANN-compliant sample here:
https://www.icann.org/resources/pages/approved-with-specs-2013-09-17-en#whois
against onlinenic.com and namesilo.com, and I don't see any substantial differences (maybe I'm missing something, though).
Any ideas why it's having trouble parsing these, or pointers on what I could check to fix it? Thanks.
This happens when the registrar has no parser associated with it, or the parser doesn't have the definitions required to parse the contacts.
In other words, unless a parser exists, it's possible that the registrar details are in the response but the library can't find them.
In that case, the solution is to add or update the parser corresponding to the specific registrar/registry.
Since this behavior is confusing to anyone not familiar with the internals of the library, also note that the upcoming release 4 will raise an error in this case (instead of silently returning nil), so it will be clear when the value is genuinely nil versus simply unknown.
r = Whois.whois(domain_name)
The r here is a Whois::Record object, and you can find its available methods in the Whois::Record documentation. registrant_contact is not one of them. You probably have to parse it out yourself.
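Until the parsers cover those registrars, one hedged fallback is to grab the raw text and pull the fields out yourself. The regex below is only a guess at the ICANN-style "Registrant Name:" label; adjust it to whatever the raw output of those registrars actually contains.

r = Whois.whois(domain_name)
raw = r.content  # the unparsed WHOIS response (all parts joined)

# Guessing at an ICANN-style label; tweak per registrar.
name = raw[/^\s*Registrant Name:\s*(.+)$/i, 1]
puts(name || 'Registrant Contact is empty.')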
Curl has an option that lets me specify which IP a domain should resolve to,
e.g. curl --resolve example.com:443:1.2.3.4 https://example.com/foo
to make sure that a very specific server is hit
(e.g. when multiple servers behind a load balancer serve the same vhost, and multiple applications run on the same port with different vhosts).
How do I set this value when using Ethon? https://github.com/typhoeus/ethon
This is how I'd expect it to work:
Ethon::Easy.new(url: "https://example.com/foo", :resolve => "example.com:443:1.2.3.4")
but I'm getting an invalid value exception (I have tried several different formats):
Ethon::Errors::InvalidValue: The value: example.com:443:1.2.3.4 is invalid for option: resolve.
I took a look at the code but couldn't figure out how I'd have to provide the value, and the documentation on this is a bit sparse.
Thanks in advance for any reply that might point me in the right direction
Thanks to i0rek on GitHub I got the answer I was looking for:
# slist_append returns the new head of the curl string list,
# so capture the return value instead of discarding it.
resolve = nil
resolve = Ethon::Curl.slist_append(resolve, "example.com:443:1.2.3.4")
e = Ethon::Easy.new(url: "https://example.com/foo", :resolve => resolve)
#=> #<Ethon::Easy:0x007faca2574f30 @url="https://example.com/foo", ...>
Further information can be found here:
https://github.com/typhoeus/ethon/issues/95#event-199961240
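For completeness, a small usage sketch for the example above: once the slist has been attached, the request runs like any other Ethon::Easy call.

e.perform            # performs the request against 1.2.3.4 instead of using DNS
puts e.response_code
puts e.response_body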
I have a Mechanize-based Ruby script to scrape a website. I am hoping to speed it up by caching the downloaded HTML pages locally, to make the whole "tweak output -> run -> tweak output" cycle quicker. I would prefer not to have to install an external cache on the machine just for this script. The ideal solution would plug into Mechanize and transparently cache fetched pages, images and so on.
Anyone know of a library that will do this? Or another way of achieving the same outcome (script runs much quicker second time round)?
A good way of doing this type of thing is to use the (AWESOME) VCR gem.
Here's an example of how you would do it:
require 'vcr'
require 'mechanize'
# Set up VCR's configs. The cassette library directory is where
# all of your "recordings" are saved as YAML files.
VCR.configure do |c|
  c.cassette_library_dir = 'vcr_cassettes'
  c.hook_into :webmock
end
# Make a request...
# The first time you do this it will actually make the call out
# Subsequent calls will read the cassette file instead of hitting the network
VCR.use_cassette('google_homepage') do
  a = Mechanize.new
  a.get('http://google.com/')
end
As you can see... VCR records the communication as a YAML file on the first run:
mario$ find tester -mindepth 1 -maxdepth 3
tester/vcr_cassettes
tester/vcr_cassettes/google_homepage.yml
If you want to have VCR create new versions of the cassettes, just delete the corresponding file.
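As an aside, VCR's record modes can save you from deleting files by hand; for example, :new_episodes records any request that isn't already in the cassette, while the default :once only records when the cassette file doesn't exist yet:

VCR.use_cassette('google_homepage', record: :new_episodes) do
  Mechanize.new.get('http://google.com/')
end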
I'm not sure that caching the pages is going to help that much. What will help more is to have a record of previously visited URLs so you don't revisit them repeatedly. The page caching is moot because you should have already grabbed the important information when you saw the page the first time, so all you need to do is check whether you've seen it already. If you have, grab the summary information you care about and manipulate it as necessary.
I used to write analytical spiders using Perl's Mechanize, which Ruby's Mechanize is based on. Storing the previously visited URLs in some sort of cache, like a hash, was useful, but because apps crash or hosts go down mid-session, all the previous results would be gone. A real disk-based database was essential at that point.
I like Postgres, but even SQLite is a good choice. Whatever you use, get the important information on the drive where it can survive a restart or crash.
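As a rough sketch of that visited-URL record using the sqlite3 gem (the table and column names here are just placeholders):

require 'sqlite3'
require 'time'

# Disk-backed "have I seen this URL?" store; it survives crashes and restarts.
db = SQLite3::Database.new('visited.db')
db.execute('CREATE TABLE IF NOT EXISTS visited (url TEXT PRIMARY KEY, seen_at TEXT)')

def visited?(db, url)
  !db.get_first_value('SELECT 1 FROM visited WHERE url = ?', [url]).nil?
end

def mark_visited(db, url)
  db.execute('INSERT OR IGNORE INTO visited (url, seen_at) VALUES (?, ?)',
             [url, Time.now.utc.iso8601])
end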
Something else I'd recommend is using a YAML file for your app's configuration. Put every parameter that is likely to be changed during the app's run in there. Then write the app so it periodically checks that file's modification time and reloads it if there's been a change (see the sketch after this paragraph). That way you can adjust its run-time behavior on the fly. I had to write a spider to analyze a Fortune 50 corporation's multiple websites several years ago. The app ran for three weeks spidering many different sites tied to that corporation, and because I could tweak the regex used to control which pages the app processed, I could fine-tune it without shutting the app down.
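A minimal sketch of that mtime-based reload, assuming a hypothetical spider_config.yml (the filename and the processing step are placeholders):

require 'yaml'

CONFIG_FILE = 'spider_config.yml'
config = YAML.load_file(CONFIG_FILE)
config_mtime = File.mtime(CONFIG_FILE)

loop do
  # ... fetch and process the next page using the current config ...

  # Reload the configuration if the file changed since we last read it.
  if File.mtime(CONFIG_FILE) > config_mtime
    config = YAML.load_file(CONFIG_FILE)
    config_mtime = File.mtime(CONFIG_FILE)
  end
end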
If you store some information about the page after the first request, you can rebuild the page later without having to re-request it from the server.
# 1) store the page information
# uri: a URI instance
# response: a hash of response headers
# body: a string
# code: the HTTP response code
page = agent.get(url)
uri, response, body, code = [page.uri, page.response, page.body, page.code]
# 2) rebuild the page, given the stored information
page = Mechanize::Page.new(uri, response, body, code, agent)
I've used this technique in spiders/scrapers so that the code can be tweaked without having to re-request all the pages, e.g.:
# agent: a Mechanize instance
# storage: must respond to [] and []=, and must accept and return arbitrary ruby objects.
# for in-memory storage, you could use a Hash.
# or, you could write something that is backed by a filesystem, mongodb, riak, redis, s3, etc...
# logger: a Logger instance
class Foobar < Struct.new(:agent, :storage, :logger)
  def get_cached(uri)
    cache_key = "_cache/#{uri}"
    if args = storage[cache_key]
      logger.debug("getting (cached) #{uri}")
      uri, response, body, code = args
      page = Mechanize::Page.new(uri, response, body, code, agent)
      agent.send(:add_to_history, page)
      page
    else
      logger.debug("getting (UNCACHED) #{uri}")
      page = agent.get(uri)
      storage[cache_key] = [page.uri, page.response, page.body, page.code]
      page
    end
  end
end
Which you could use like this:
require 'logger'
require 'pp'
require 'rubygems'
require 'mechanize'
storage = {}
foo = Foobar.new(Mechanize.new, storage, Logger.new(STDOUT))
foo.get_cached("http://ifconfig.me/ua")
foo.get_cached("http://ifconfig.me/ua")
foo.get_cached("http://ifconfig.me/ua")
foo.get_cached("http://ifconfig.me/encoding")
foo.get_cached("http://ifconfig.me/encoding")
pp storage
Which prints the following information:
D, [2013-10-19T14:13:32.019291 #18107] DEBUG -- : getting (UNCACHED) http://ifconfig.me/ua
D, [2013-10-19T14:13:36.375649 #18107] DEBUG -- : getting (cached) http://ifconfig.me/ua
D, [2013-10-19T14:13:36.376822 #18107] DEBUG -- : getting (cached) http://ifconfig.me/ua
D, [2013-10-19T14:13:36.376910 #18107] DEBUG -- : getting (UNCACHED) http://ifconfig.me/encoding
D, [2013-10-19T14:13:52.830416 #18107] DEBUG -- : getting (cached) http://ifconfig.me/encoding
{"_cache/http://ifconfig.me/ua"=>
[#<URI::HTTP:0x007fe4ac94d098 URL:http://ifconfig.me/ua>,
{"date"=>"Sat, 19 Oct 2013 19:13:33 GMT",
"server"=>"Apache",
"vary"=>"Accept-Encoding",
"content-encoding"=>"gzip",
"content-length"=>"87",
"connection"=>"close",
"content-type"=>"text/plain"},
"Mechanize/2.7.2 Ruby/2.0.0p247 (http://github.com/sparklemotion/mechanize/)\n",
"200"],
"_cache/http://ifconfig.me/encoding"=>
[#<URI::HTTP:0x007fe4ac99d2a0 URL:http://ifconfig.me/encoding>,
{"date"=>"Sat, 19 Oct 2013 19:13:48 GMT",
"server"=>"Apache",
"vary"=>"Accept-Encoding",
"content-encoding"=>"gzip",
"content-length"=>"42",
"connection"=>"close",
"content-type"=>"text/plain"},
"gzip,deflate,identity\n",
"200"]}
How about writing pages out to files, each page in an individual file, and separating the tweak and run cycles?
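For what it's worth, a minimal sketch of that idea, keying each file off a digest of the URL (the cache directory name and helper are just placeholders):

require 'digest'
require 'fileutils'
require 'mechanize'

CACHE_DIR = 'page_cache'
FileUtils.mkdir_p(CACHE_DIR)

# Returns the page body, fetching it only if we have no file for this URL yet.
def fetch_body(agent, url)
  path = File.join(CACHE_DIR, Digest::MD5.hexdigest(url) + '.html')
  return File.read(path) if File.exist?(path)

  body = agent.get(url).body
  File.write(path, body)
  body
end

agent = Mechanize.new
html = fetch_body(agent, 'http://example.com/')
# The tweak/run cycle then works on the cached body (e.g. parse it with Nokogiri)
# without touching the network again.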