Size of data scraped using ruby mechanize - ruby

agent = Mechanize.new
url = "---------------------------"
page = agent.get(url)
Now, I want to know how many KB (kilobytes) of data my internet service provider had to transfer for me to scrape that page.
More specifically, what's the size, in KB, of the variable "page"?

page.content.bytesize / 1024.0 # size of the decompressed body, in KB

It's really two separate things: the size of the unzipped response body and the number of bytes that were actually transferred. You can get the first by inspecting page.body; for the second you would need to measure response and request headers as well as account for things like gzip and redirects, not to mention DNS lookups, etc.
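A rough sketch of both measurements with Mechanize, assuming the server sends a Content-Length header at all (that header reflects only the possibly gzip-compressed body, not headers, redirects or DNS traffic):

require 'mechanize'

agent = Mechanize.new
page  = agent.get(url) # url as in the question

# Size of the decompressed body held in the variable:
body_kb = page.body.bytesize / 1024.0

# Rough on-the-wire size of the body, taken from the response headers:
content_length = page.response['content-length'] # may be nil, e.g. for chunked responses
wire_kb = content_length && content_length.to_i / 1024.0

puts format('body: %.1f KB, content-length: %s', body_kb, wire_kb ? format('%.1f KB', wire_kb) : 'unknown')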

Related

varnish cache real (body) size vs content-length

Sometimes, when an object is not in the cache, Varnish will send an object that has a real size smaller than the size declared in the Content-Length header. For example - only part of the picture.
Is it possible to construct such a rule...?
if (beresp.http.Content-Length != real_object_body_size) { return(retry); }
I wrote a script that tests the same request against Varnish and against the backend and compares the downloaded size with the Content-Length header. The backend, unlike Varnish, sometimes ends with a timeout, but the size is always fine. The problem is rare but annoying because the objects are set to a long user cache time.
After a few days I can say that the problem was a combination of occasional backend problems and Varnish's ability to send a chunked transfer when the object is not in the cache.
Thank you @Thijs Feryn for pointing this out. I knew about that property, but until I read it here I didn't connect it to my problem at all.
It seems that "set beresp.do_stream = false;" solved the problem.

How to send Base64 image to Google Cloud Vision API label detection in Ruby?

Hi I'm building a program in Ruby to generate alt attributes for images on a webpage. I'm scraping the page for the images then sending their src, in other words a URL, to google-cloud-vision for label detection and other Cloud Vision methods. It takes about 2-6 seconds per image. I'm wondering if there's any way to reduce response time. I first used TinyPNG to compress the images. Cloud Vision was a tad faster but the time it took to compress more than outweighed the improvement. How can I improve response time? I'll list some ideas.
1) Since we're sending a URL to Google Cloud, Google has to fetch the image from that img_src before it can even analyze it. Is it faster to send a base64-encoded image? What's the fastest form in which to send (or really, for Google to receive) an image?
cloud_vision = Google::Cloud::Vision.new project: PROJECT_ID
@vision = cloud_vision.image(@file_name)
@vision.labels # or @vision.web, etc.
2) My current code for label detection is above. First question: is it faster to send a JSON request rather than call the Ruby (label or web) methods on a Google Cloud project? If so, should I limit responses? Labels with less than a 0.6 confidence score don't seem of much help. Would that speed up image recognition/processing time?
Open to any suggestions on how to speed up response time from Cloud Vision.
TL;DR - You can take advantage of the batching support in the annotate API for Cloud Vision.
Longer version
Google Cloud Vision API supports batching multiple requests in a single call to the images:annotate API. There are also these limits which are enforced for Cloud Vision:
Maximum of 16 images per request
Maximum 4 MB per image
Maximum of 8 MB total request size.
You could reduce the number of requests by batching 16 at a time (assuming you do not exceed any of the image size restrictions within the request):
#!/usr/bin/env ruby
require "google/cloud/vision"

image_paths = [
  # ...
  "./wakeupcat.jpg",
  "./cat_meme_1.jpg",
  "./cat_meme_2.jpg",
  # ...
]

vision = Google::Cloud::Vision.new

length = image_paths.length
start = 0
request_count = 0

while start < length do
  last = [start + 15, length - 1].min
  current_image_paths = image_paths[start..last]
  printf "Sending %d images in the request. start: %d last: %d\n", current_image_paths.length, start, last
  result = vision.annotate *current_image_paths, labels: 1
  printf "Result: %s\n", result
  start += 16
  request_count += 1
end

printf "Made %d requests\n", request_count
So you're using Ruby to scrape some images off a page and then send the image to Google, yeah?
Why you might not want to base64 encode the image:
Headless scraping becomes more network intensive: you have to download the image before you can process it.
Now you also have to worry about adding in the base64 encoding step.
Potential storage concerns if you aren't just holding the image in memory (and if you do hold it in memory, debugging becomes somewhat more challenging).
Why you might want to base64 encode the image:
The image is not publicly accessible
You have to store the image anyway
Once you have weighed the choices, if you still want to get the image into base64 here is how you do it:
require 'base64'
Base64.strict_encode64(image_binary) # strict_encode64 produces one line, without the newlines encode64 inserts
It really is that easy.
But how do I get that image in binary?
require 'curb'
# This line is an example and is not intended to be valid
img_binary = Curl::Easy.perform("http://www.imgur.com/sample_image.png").body_str
How do I send that to Google?
Google has a pretty solid write-up of this process here: Make a Vision API Request in JSON
If you can't click it (or are too lazy to) I have provided a zero-context copy-and-paste of what a request body should look like to their API here:
request_body_json = {
  "requests":[
    {
      "image":{
        "content":"/9j/7QBEUGhvdG9...image contents...eYxxxzj/Coa6Bax//Z"
      },
      "features":[
        {
          "type":"LABEL_DETECTION",
          "maxResults":1
        }
      ]
    }
  ]
}
So now we know what a request should look like in the body. If you're already sending the img_src in a POST request, then it's as easy as this:
require 'base64'
require 'curb'

requests = []
array_of_image_urls.each do |image|
  img_binary = Curl::Easy.perform(image).body_str
  image_in_base64 = Base64.strict_encode64(img_binary)
  requests << {
    "image"        => { "content" => image_in_base64 },
    "imageContext" => "<OPTIONAL: SEE REFERENCE LINK>",
    "features"     => [{ "type" => "LABEL_DETECTION", "maxResults" => 1 }]
  }
end
# Now just POST requests.to_json with your Authorization and such (You did read the reference, right?)
Play around with the hash formatting and values as required. This is the general idea which is the best I can give you when your question is SUPER vague.
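If you want that POST spelled out, here is a minimal net/http sketch against the REST endpoint from Google's reference above; API_KEY is a placeholder for your own key, and requests is the array built in the loop:

require 'json'
require 'net/http'

uri  = URI("https://vision.googleapis.com/v1/images:annotate?key=#{API_KEY}")
http = Net::HTTP.new(uri.host, uri.port)
http.use_ssl = true

post = Net::HTTP::Post.new(uri.request_uri, 'Content-Type' => 'application/json')
post.body = { "requests" => requests }.to_json

response = http.request(post)
puts JSON.parse(response.body)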

Google Places API does not return gym info

I could not get the Google Places API to return gym info. Below is an example API request.
https://maps.googleapis.com/maps/api/place/search/json?location=33.347075,-111.96318&radius=100&types=gym&sensor=false&key=AIzaSyBg8HI6sH1Rxyhn1Mno_hhgDawuF1KAfq0
I know the lat/lon in the link is valid.
If I remove the "types=gym" (see the link below), it returns some place info, but none of type gym.
https://maps.googleapis.com/maps/api/place/search/json?location=33.347075,-111.96318&radius=100&sensor=false&key=AIzaSyBg8HI6sH1Rxyhn1Mno_hhgDawuF1KAfq0
Is there a limitation on the API?
Also, could I have the API return a URI that takes me directly to the location?
You just need to increase your search radius a bit - you're looking for results within a 100m circle. Try this:
https://maps.googleapis.com/maps/api/place/search/json?location=33.347075,-111.96318&radius=200&types=gym&sensor=false&key=AIzaSyBg8HI6sH1Rxyhn1Mno_hhgDawuF1KAfq0
Increasing the radius to just 200m returns a result; at 1000m you get four results.
You can then pass the reference value to a Places Details search to get the url value, as follows:
https://maps.googleapis.com/maps/api/place/details/json?reference=CnRnAAAA99xxsFT0V-FNigzMi7GEnmkqWRYCOZG-lrQH0fpw9iI_JUp5WHrYOCcTGpeyzVdHrtk3rE2zrHleBxRw4i67K0sT_fhsSQufaAHN80Oi4OvxR-amG_W4plz5Mr8a-512584oHpfUpV87jMqyF2R8cRIQpqTgOCgZtZF0hYR4R_ZVRRoUEIS-oN1fcyVQcN5nj7DxaNK-e8o&sensor=false&key=AIzaSyBg8HI6sH1Rxyhn1Mno_hhgDawuF1KAfq0
The url links to the Google Maps Place page: http://maps.google.com/maps/place?cid=2681829493569576902
See the docs here: http://code.google.com/apis/maps/documentation/places/#PlaceDetailsRequests
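A quick Ruby sketch of that two-step flow (nearby search, then a details request for the url field), using the same legacy endpoints as the question; API_KEY is a placeholder for your own key:

require 'json'
require 'net/http'

API_KEY = 'YOUR_KEY' # placeholder

# Nearby search with a wider radius
search_uri = URI('https://maps.googleapis.com/maps/api/place/search/json?' \
                 "location=33.347075,-111.96318&radius=1000&types=gym&sensor=false&key=#{API_KEY}")
places = JSON.parse(Net::HTTP.get(search_uri))['results']

places.each do |place|
  # Details request to get the Google Maps url for each place
  details_uri = URI('https://maps.googleapis.com/maps/api/place/details/json?' \
                    "reference=#{place['reference']}&sensor=false&key=#{API_KEY}")
  details = JSON.parse(Net::HTTP.get(details_uri))['result']
  puts "#{place['name']}: #{details['url']}"
end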
Also, there is a limit on the number of requests for this particular API.
2500 IIRC, but you can read it in the docs.
Are there 20 results coming back from the "types=gym" call? If so, then you are hitting the limitation of the API and it just so happens that none of the 20 returned is a gym:
The Places API returns up to 20 establishment results. Additionally, political results may be returned which serve to identify the area of the request.

Best way to concurrently check urls (for status i.e. 200,301,404) for multiple urls in database

Here's what I'm trying to accomplish. Let's say I have 100,000 urls stored in a database and I want to check each of these for http status and store that status. I want to be able to do this concurrently in a fairly small amount of time.
I was wondering what the best way(s) to do this would be. I thought about using some sort of queue with workers/consumers or some sort of evented model, but I don't really have enough experience to know what would work best in this scenario.
Ideas?
Take a look at the very capable Typhoeus and Hydra combo. The two make it very easy to concurrently process multiple URLs.
The "Times" example should get you up and running quickly. In the on_complete block put your code to write your statuses to the DB. You could use a thread to build and maintain the queued requests at a healthy level, or queue a set number, let them all run to completion, then loop for another group. It's up to you.
Paul Dix, the original author, talked about his design goals on his blog.
This is some sample code I wrote to download archived mail lists so I could do local searches. I deliberately removed the URL to keep from subjecting the site to DOS attacks if people start running the code:
#!/usr/bin/env ruby
require 'nokogiri'
require 'addressable/uri'
require 'typhoeus'

BASE_URL = ''

url  = Addressable::URI.parse(BASE_URL)
resp = Typhoeus::Request.get(url.to_s)
doc  = Nokogiri::HTML(resp.body)

hydra = Typhoeus::Hydra.new(:max_concurrency => 10)

doc.css('a').map { |n| n['href'] }.select { |href| href[/\.gz$/] }.each do |gzip|
  gzip_url = url.join(gzip)
  request  = Typhoeus::Request.new(gzip_url.to_s)

  request.on_complete do |resp|
    gzip_filename = resp.request.url.split('/').last
    puts "writing #{gzip_filename}"
    File.open("gz/#{gzip_filename}", 'w') do |fo|
      fo.write resp.body
    end
  end

  puts "queuing #{ gzip }"
  hydra.queue(request)
end

hydra.run
Running the code on my several-year-old MacBook Pro pulled in 76 files totaling 11MB in just under 20 seconds, over wireless to DSL. If you're only doing HEAD requests your throughput will be better. You'll want to play with the concurrency setting, because there is a point where more concurrent sessions only slow you down and needlessly use resources.
I give it an 8 out of 10; it's got a great beat and I can dance to it.
EDIT:
When checking the remote URLs you can use a HEAD request, or a GET with an If-Modified-Since header. Either can give you responses you can use to determine the freshness of your URLs.
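Adapting the sample above to the actual status-check problem is mostly a matter of switching to HEAD requests. A rough sketch, using the same Typhoeus/Hydra API as above and assuming urls comes out of your database:

#!/usr/bin/env ruby
require 'typhoeus'

urls = [] # load your URLs from the database here

hydra = Typhoeus::Hydra.new(:max_concurrency => 20)

urls.each do |url|
  request = Typhoeus::Request.new(url, :method => :head)
  request.on_complete do |resp|
    # resp.code is 200, 301, 404, etc.; 0 generally means a connection or timeout failure
    puts "#{url} => #{resp.code}"
    # write the status back to the database row here
  end
  hydra.queue(request)
end

hydra.run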
I haven't done anything multithreaded in Ruby, only in Java, but it seems pretty straightforward: http://www.tutorialspoint.com/ruby/ruby_multithreading.htm
From what you described, you don't need any queue and workers (well, I'm sure you can do it that way too, but I doubt you'll get much benefit). Just partition your URLs between several threads, and let each thread process its chunk and update the database with the results. E.g., create 100 threads, and give each thread a range of 1000 database rows to process.
You could even just create 100 separate processes and give them rows as arguments, if you'd rather deal with processes than threads.
To get the URL status, I think you do an HTTP HEAD request, which I guess is http://apidock.com/ruby/Net/HTTP/request_head in ruby.
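A minimal sketch of that partition-and-thread idea, assuming urls is already loaded from the database (the thread count is arbitrary):

require 'net/http'

urls = [] # load from the database
thread_count = 100

slice_size = [(urls.size / thread_count.to_f).ceil, 1].max
threads = urls.each_slice(slice_size).map do |chunk|
  Thread.new(chunk) do |slice|
    slice.each do |url|
      uri = URI(url)
      response = Net::HTTP.start(uri.host, uri.port, :use_ssl => uri.scheme == 'https') do |http|
        http.head(uri.request_uri)
      end
      puts "#{url} => #{response.code}"
      # update the corresponding database row here
    end
  end
end

threads.each(&:join)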
The work_queue gem is the easiest way to perform tasks asynchronously and concurrently in your application.
require 'work_queue'
require 'net/http'

wq = WorkQueue.new 10

urls.each do |url|
  wq.enqueue_b do
    response = Net::HTTP.get_response(URI(url))
    puts response.code
  end
end

wq.join

How do I overcome the XHR FF POST size limit?

My XHR POST request is cut off. When I try to reload my page, information is missing. Firebug shows the following message:
Firebug request size limit has been reached by Firebug.
My question is: what are my options?
Would it work if I declare the content length in the header?
I added a line to my Apache config file and restarted it: LimitRequestBody 0
I increased the max transfer file size in the MySQL config file.
Or is it a browser issue?
The only solution I could think of was to cut the data into pieces and transmit the array one piece at a time, but I don't like this idea. The content length is 91691 according to Firebug.
Any suggestions?
You just need to modify a Firebug setting. In the browser's address bar go to about:config, then look for the option extensions.firebug.netDisplayedPostBodyLimit.
You should increase its value in order to see non-truncated requests. Set it to 65535, for example.
Here you can find many other Firebug options you may want to change: http://getfirebug.com/wiki/index.php/Firebug_Preferences
