How to send Base64 image to Google Cloud Vision API label detection in Ruby? - ruby

Hi, I'm building a program in Ruby to generate alt attributes for images on a webpage. I scrape the page for images, then send each image's src (in other words, a URL) to google-cloud-vision for label detection and other Cloud Vision methods. It takes about 2-6 seconds per image, and I'm wondering if there's any way to reduce the response time. I first used TinyPNG to compress the images; Cloud Vision was a tad faster, but the time the compression took more than outweighed the improvement. How can I improve response time? I'll list some ideas.
1) Since we're sending a URL to Google Cloud, it takes time for Google Cloud to receive a response, that is from the img_src, before it can even analyze the image. Is it faster to send a base64 encoded image? What's the fastest form in which to send (or really, for Google to receive) an image?
cloud_vision = Google::Cloud::Vision.new project: PROJECT_ID
vision = cloud_vision.image(file_name)
vision.labels # or vision.web, etc.
2) My current code for label detection. First question: is it faster to send a JSON request rather than calling the Ruby (label or web) methods on a Google Cloud project? If so, should I limit the responses? Labels with a confidence score below 0.6 don't seem of much help. Would that speed up image recognition/processing time?
Open to any suggestions on how to speed up response time from Cloud Vision.

TL;DR - You can take advantage of the batching support in the annotation API for Cloud Vision.
Longer version
Google Cloud Vision API supports batching multiple requests in a single call to the images:annotate API. The following limits are also enforced for Cloud Vision:
Maximum of 16 images per request
Maximum 4 MB per image
Maximum of 8 MB total request size.
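As a rough illustration, a pre-flight check against those documented limits might look like the sketch below. The constant names and the batch_ok? helper are made up for the example; only the three limit values come from the docs:

```ruby
# Hypothetical pre-flight check mirroring the documented Vision batching limits.
MAX_IMAGES_PER_REQUEST = 16
MAX_IMAGE_BYTES        = 4 * 1024 * 1024   # 4 MB per image
MAX_REQUEST_BYTES      = 8 * 1024 * 1024   # 8 MB total request size

def batch_ok?(image_blobs)
  image_blobs.length <= MAX_IMAGES_PER_REQUEST &&
    image_blobs.all? { |blob| blob.bytesize <= MAX_IMAGE_BYTES } &&
    image_blobs.sum(&:bytesize) <= MAX_REQUEST_BYTES
end
```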
You could reduce the number of requests by batching 16 at a time (assuming you do not exceed any of the image size restrictions within the request):
#!/usr/bin/env ruby
require "google/cloud/vision"

image_paths = [
  ...
  "./wakeupcat.jpg",
  "./cat_meme_1.jpg",
  "./cat_meme_2.jpg",
  ...
]

vision = Google::Cloud::Vision.new

length = image_paths.length
start = 0
request_count = 0
while start < length do
  last = [start + 15, length - 1].min
  current_image_paths = image_paths[start..last]
  printf "Sending %d images in the request. start: %d last: %d\n", current_image_paths.length, start, last
  result = vision.annotate *current_image_paths, labels: 1
  printf "Result: %s\n", result
  start += 16
  request_count += 1
end
printf "Made %d requests\n", request_count

So you're using Ruby to scrape some images off a page and then send the image to Google, yeah?
Why you might not want to base64 encode the image:
Headless scraping becomes more network intensive: you have to download the image in order to process it.
You also have to account for the base64 encoding step itself.
There are potential storage concerns if you aren't just holding the image in memory (and if you are, debugging becomes somewhat more challenging).
Why you might want to base64 encode the image:
The image is not publicly accessible
You have to store the image anyway
Once you have weighed the choices, if you still want to get the image into base64, here is how you do it:
require 'base64'
Base64.strict_encode64(image_binary)
It really is that easy.
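One caveat worth noting: Base64.encode64 inserts a newline every 60 characters of output (and one at the end), which some APIs reject, while Base64.strict_encode64 omits them:

```ruby
require 'base64'

data = "hello world"
Base64.strict_encode64(data) # => "aGVsbG8gd29ybGQ="
Base64.encode64(data)        # => "aGVsbG8gd29ybGQ=\n" (note the trailing newline)
```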
But how do I get that image in binary?
require 'curb'
# This line is an example and is not intended to be valid
img_binary = Curl::Easy.perform("http://www.imgur.com/sample_image.png").body_str
How do I send that to Google?
Google has a pretty solid write-up of this process here: Make a Vision API Request in JSON
If you can't click it (or are too lazy to) I have provided a zero-context copy-and-paste of what a request body should look like to their API here:
request_body_json = {
  "requests": [
    {
      "image": {
        "content": "/9j/7QBEUGhvdG9...image contents...eYxxxzj/Coa6Bax//Z"
      },
      "features": [
        {
          "type": "LABEL_DETECTION",
          "maxResults": 1
        }
      ]
    }
  ]
}
So now we know what a request should look like in the body. If you're already sending the img_src in a POST request, then it's as easy as this:
require 'base64'
require 'curb'

requests = []
array_of_image_urls.each do |image_url|
  img_binary = Curl::Easy.perform(image_url).body_str
  image_in_base64 = Base64.strict_encode64(img_binary)
  requests << {
    "image" => { "content" => image_in_base64 },
    "imageContext" => "<OPTIONAL: SEE REFERENCE LINK>",
    "features" => [{ "type" => "LABEL_DETECTION", "maxResults" => 1 }]
  }
end
# Now just POST requests.to_json with your Authorization and such (You did read the reference right?)
Play around with the hash formatting and values as required. This is the general idea, which is the best I can give you when your question is SUPER vague.

Related

How to get all the videos of a YouTube channel with the Yt gem?

I want to use the Yt gem to get all the videos of a channel. I configured the gem with my YouTube Data API key.
Unfortunately, when I use it, it returns a maximum of ~1000 videos, even for channels that have more than 1000 videos. Yt::Channel#video_count returns the correct number of videos.
channel = Yt::Channel.new id: "UCGwuxdEeCf0TIA2RbPOj-8g"
channel.video_count # => 1845
channel.videos.map(&:id).size # => 949
The YouTube API can't be set to return more than 50 items per request, so I guess Yt automatically performs several requests, going through each next page of results, to be able to return more than 50 results.
For some reason, though, it does not go through all the result pages. I don't see a way in Yt to control how it goes through the pages of results. In particular, I could not find a way to force it to fetch a single page of results, read the returned nextPageToken, and perform a new request with that value.
Any idea?
Looking into the gem's /spec folder, you can see a test for your case.
describe 'when the channel has more than 500 videos' do
  let(:id) { 'UC0v-tlzsn0QZwJnkiaUSJVQ' }
  specify 'the estimated and actual number of videos can be retrieved' do
    # @note: in principle, the following counters should match, but
    # in reality +video_count+ and +size+ are only approximations.
    expect(channel.video_count).to be > 500
    expect(channel.videos.size).to be > 500
  end
end
I did some tests, and what I noticed is that video_count is the number displayed on YouTube next to the channel's name. This value is not accurate; I'm not really sure what it represents.
If you do channel.videos.size, the number is not accurate either, because the videos collection can contain some empty(?) records.
If you do channel.videos.map(&:id).size, the returned value should be correct. By correct I mean it should equal the number of videos listed at:
https://www.youtube.com/channel/:channel_id/videos
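If you end up paginating by hand, the traversal logic is just a loop over nextPageToken. Here is a sketch in which fetch_page is a hypothetical callable standing in for the actual playlistItems.list (or search) HTTP call; only the token-chaining pattern is the point:

```ruby
# fetch_page is a stand-in for a real API call: given a page token (or nil for
# the first page), it returns { ids: [...], next_page_token: "..." or nil }.
def all_video_ids(fetch_page)
  ids = []
  token = nil
  loop do
    page = fetch_page.call(token)
    ids.concat(page[:ids])
    token = page[:next_page_token]
    break if token.nil?   # no nextPageToken means we reached the last page
  end
  ids
end
```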

PlaylistItems: list does not return videoId when using part:id without snippet

I'm trying to manage the cost of API requests, and so I want to generate a delta of the videos that were added to a playlist since the last API request.
I would like to make the zero-cost request of just fetching the videoIds before getting additional details about the videos in the playlist:
GET https://www.googleapis.com/youtube/v3/playlistItems?part=id&playlistId=PLlTLHnxSVuIyeEZPBIQF_krewJkY2JSwi&key={YOUR_API_KEY}
The response is like below
"items": [
{
"kind": "youtube#playlistItem",
"etag": "\"5g01s4-wS2b4VpScndqCYc5Y-8k/2wturocJM7aMkvG4Zrmv45tbyWY\"",
"id": "UExsVExIbnhTVnVJeWVFWlBCSVFGX2tyZXdKa1kySlN3aS4xMjU2MjFGMDJBNEUzQzcw"
},
The playlistItem id cannot be used in the videos list call to get additional info about the video; instead, part: "snippet", which has a cost associated with it, has to be added to the playlistItems request. Is this a bug or intentional? Also, is there a way to map the playlistItem id to a videoId/resourceId?
Firstly, all calls have a cost, no matter what they are; just how much depends on your request.
Yes, this is by design. They want to limit the number of calls to the system as much as possible. This makes for better streamlining of call requests, as well as reducing strain on the site.
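On the mapping question: once you request part=snippet, the videoId is available under snippet.resourceId.videoId. A sketch of extracting it from a parsed response follows; the sample hash is illustrative and hand-built, not a real API payload:

```ruby
# Illustrative hash shaped like a playlistItems.list response with part=snippet.
response = {
  "items" => [
    { "snippet" => { "resourceId" => { "kind" => "youtube#video", "videoId" => "abc123" } } }
  ]
}

# Hash#dig returns nil instead of raising if an intermediate key is missing.
video_ids = response["items"].map { |item| item.dig("snippet", "resourceId", "videoId") }
```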

Can I reduce my amount of requests in Google Maps JavaScript API v3?

I call 2 locations. From an XML file I get the longitude and latitude of a location: first the closest cafe, then the closest school.
$.get('https://maps.googleapis.com/maps/api/place/nearbysearch/xml?location=' + home_latitude + ',' + home_longtitude + '&rankby=distance&types=cafe&sensor=false&key=X', function(xml) {
  verander($(xml).find("result:first").find("geometry:first").find("location:first").find("lat").text(),
           $(xml).find("result:first").find("geometry:first").find("location:first").find("lng").text());
});
$.get('https://maps.googleapis.com/maps/api/place/nearbysearch/xml?location=' + home_latitude + ',' + home_longtitude + '&rankby=distance&types=school&sensor=false&key=X', function(xml) {
  verander($(xml).find("result:first").find("geometry:first").find("location:first").find("lat").text(),
           $(xml).find("result:first").find("geometry:first").find("location:first").find("lng").text());
});
But as you can see, I call the function verander(latitude, longitude) twice.
function verander(google_lat, google_lng)
{
  var bryantPark = new google.maps.LatLng(google_lat, google_lng);
  var panoramaOptions =
  {
    position: bryantPark,
    pov:
    {
      heading: 185,
      pitch: 0,
      zoom: 1,
    },
    panControl: false,
    streetViewControl: false,
    mapTypeControl: false,
    overviewMapControl: false,
    linksControl: false,
    addressControl: false,
    zoomControl: false,
  }
  map = new google.maps.StreetViewPanorama(document.getElementById("map_canvas"), panoramaOptions);
  map.setVisible(true);
}
Would it be possible to push these 2 locations into only one request (perhaps via an array)? I know it sounds silly, but I really want to know if there isn't a backdoor to reduce these Google Maps requests.
FTR, this is what counts as a request for Google:
What constitutes a 'map load' in the context of the usage limits that apply to the Maps API? A single map load occurs when:
a. a map is displayed using the Maps JavaScript API (V2 or V3) when loaded by a web page or application;
b. a Street View panorama is displayed using the Maps JavaScript API (V2 or V3) by a web page or application that has not also displayed a map;
c. a SWF that loads the Maps API for Flash is loaded by a web page or application;
d. a single request is made for a map image from the Static Maps API.
e. a single request is made for a panorama image from the Street View Image API.
So I'm afraid it isn't possible, but hey, suggestions are always welcome!
You're calling the Places API twice and loading Street View twice. So that's four calls, but I think they only count those two Street Views once if you're loading them on one page. Also, your Places calls are client-side, so they won't count toward your limits.
But to answer your question: there's no loophole to get around the double load, since you want to show the user two Street Views.
What I would do is not load anything until the client asks. Instead, have a couple of call-to-action buttons like <button onclick="loadStreetView('cafe')">Click here to see Nearby Cafe</button>, and when clicked they call the nearby search and load the Street View. Since it only happens on client request, your page loads will never increment the usage counts, such as when your site gets crawled by search engines.
More on those usage limits
The Google Places API has different usages then the maps. https://developers.google.com/places/policies#usage_limits
Users with an API key are allowed 1 000 requests per 24 hour period
Users who have verified their identity through the APIs console are allowed 100 000 requests per 24 hour period. A credit card is required for verification, by enabling billing in the console. We ask for your credit card purely to validate your identity. Your card will not be charged for use of the Places API.
100,000 requests a day if you verify yourself. That's pretty decent.
As for Google Maps, https://developers.google.com/maps/faq#usagelimits
You get 25,000 map loads per day, and it says:
In order to accommodate sites that experience short term spikes in usage, the usage limits will only take effect for a given site once that site has exceeded the limits for more than 90 consecutive days.
So if you go over a bit now and then, it seems like they won't mind.
P.S. You have an extra comma after zoom: 1 and after zoomControl: false; they shouldn't be there and will cause errors in some browsers, like IE. You're also missing a semicolon after var panoramaOptions = { ... }, before map = new.

Size of data scraped using ruby mechanize

agent = Mechanize.new
url = "---------------------------"
page = agent.get(url)
Now, I want to know how many kilobytes of data were transferred over my internet connection to scrape that data.
More specifically, what's the size in KB of the variable page?
page.content.bytesize / 1024.0
It's really two separate things: the size of the unzipped response body, and the number of bytes that were actually transferred. You can get the first by inspecting page.body; for the second you would need to measure the response and request headers as well, and account for things like gzip and redirects, not to mention DNS lookups, etc.
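The body-size half is simple arithmetic on the string. A tiny self-contained sketch (the literal string stands in for page.body, which is an assumption for the example):

```ruby
# The uncompressed body size in KB is just bytesize / 1024.0.
body = "a" * 5120             # stand-in for page.body
kb = body.bytesize / 1024.0   # => 5.0
```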

Best way to concurrently check urls (for status i.e. 200,301,404) for multiple urls in database

Here's what I'm trying to accomplish. Let's say I have 100,000 urls stored in a database and I want to check each of these for http status and store that status. I want to be able to do this concurrently in a fairly small amount of time.
I was wondering what the best way(s) to do this would be. I thought about using some sort of queue with workers/consumers or some sort of evented model, but I don't really have enough experience to know what would work best in this scenario.
Ideas?
Take a look at the very capable Typhoeus and Hydra combo. The two make it very easy to concurrently process multiple URLs.
The "Times" example should get you up and running quickly. In the on_complete block put your code to write your statuses to the DB. You could use a thread to build and maintain the queued requests at a healthy level, or queue a set number, let them all run to completion, then loop for another group. It's up to you.
Paul Dix, the original author, talked about his design goals on his blog.
This is some sample code I wrote to download archived mail lists so I could do local searches. I deliberately removed the URL to keep from subjecting the site to DOS attacks if people start running the code:
#!/usr/bin/env ruby
require 'nokogiri'
require 'addressable/uri'
require 'typhoeus'

BASE_URL = ''

url = Addressable::URI.parse(BASE_URL)
resp = Typhoeus::Request.get(url.to_s)
doc = Nokogiri::HTML(resp.body)

hydra = Typhoeus::Hydra.new(:max_concurrency => 10)
doc.css('a').map { |n| n['href'] }.select { |href| href[/\.gz$/] }.each do |gzip|
  gzip_url = url.join(gzip)
  request = Typhoeus::Request.new(gzip_url.to_s)
  request.on_complete do |resp|
    gzip_filename = resp.request.url.split('/').last
    puts "writing #{gzip_filename}"
    File.open("gz/#{gzip_filename}", 'w') do |fo|
      fo.write resp.body
    end
  end
  puts "queuing #{ gzip }"
  hydra.queue(request)
end
hydra.run
Running the code on my several-year-old MacBook Pro pulled in 76 files totaling 11MB in just under 20 seconds, over wireless to DSL. If you're only doing HEAD requests, your throughput will be better. You'll want to experiment with the concurrency setting, because there is a point where having more concurrent sessions only slows you down and needlessly uses resources.
I give it a 8 out of 10; It's got a great beat and I can dance to it.
EDIT:
When checking the remote URLs you can use a HEAD request, or a GET with If-Modified-Since. Both give you responses you can use to determine the freshness of your URLs.
I haven't done anything multithreaded in Ruby, only in Java, but it seems pretty straightforward: http://www.tutorialspoint.com/ruby/ruby_multithreading.htm
From what you described, you don't need any queue and workers (well, I'm sure you can do it that way too, but I doubt you'll get much benefit). Just partition your urls between several threads, and let each thread do each chunk and update the database with the results. E.g., create 100 threads, and give each thread a range of 1000 database rows to process.
You could even just create 100 separate processes and give them rows as arguments, if you'd rather deal with processes than threads.
To get the URL status, I think you'd do an HTTP HEAD request, which I guess is http://apidock.com/ruby/Net/HTTP/request_head in Ruby.
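The partition-into-threads idea can be sketched as follows. To keep the example self-contained and runnable offline, check_status is a placeholder standing in for a real Net::HTTP head call; the URLs are made up:

```ruby
# Placeholder for a real HTTP HEAD status check (e.g. via Net::HTTP#head).
def check_status(url)
  url.end_with?("/missing") ? 404 : 200
end

urls = (1..10).map { |i| "https://example.com/page#{i}" } + ["https://example.com/missing"]

# Partition the URLs into chunks and give each chunk to its own thread.
threads = urls.each_slice(4).map do |chunk|
  Thread.new { chunk.map { |u| [u, check_status(u)] } }
end

# Thread#value joins the thread and returns its block's result.
results = threads.flat_map(&:value).to_h
```

In the real version each thread would issue the HEAD requests and write its chunk's statuses back to the database.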
The work_queue gem is the easiest way to perform tasks asynchronously and concurrently in your application.
require 'work_queue'
require 'net/http'

wq = WorkQueue.new 10
urls.each do |url|
  wq.enqueue_b do
    response = Net::HTTP.get_response(URI(url))
    puts response.code
  end
end
wq.join
