Google custom search API and Ruby - ruby

I wanted to write a Google search scraper/parser to pull employees from Google's index of linkedin.com. LinkedIn closed their API, so I first wrote a Mechanize/Nokogiri scraper, which got me captcha'd, so I rewrote the script using the Google search API gem.
The problem is that I can't figure out where to begin to make it bring back more than the first page of results, and the official docs couldn't even be described as 'sparse'.
This is the code that returns page 1 only:
require 'rubygems'
require 'google/api_client'
require 'json'
require 'pp'
puts "What organisation's employees shall we get today?"
organisation = gets.chomp
puts "Harvesting Google Search Results - This may take some time"
apikey = "1234"
cxid = "5678"
client = Google::APIClient.new(:key => apikey, :authorization => nil, :application_name => "linkedout", :application_version => "beta_0.5")
search = client.discovered_api('customsearch')
response = client.execute(
:api_method => search.cse.list,
:parameters => {
'q' => 'current ' + organisation + ' site:linkedin.com',
'maxResults' => 100,
'key' => apikey,
'cx' => cxid
}
)
status, headers, body = response
jsonresponse = response.body
employees = []
@tags = JSON.parse(jsonresponse)['items']
@tags.each do |tag|
x = tag['title']
x.gsub!(/ \| LinkedIn/, "")
x.downcase!
x.gsub!(/ profiles/, "")
employees << x
end
employees = employees.uniq
puts employees
Any help would be very gratefully received - I'm still learning this stuff.
Edit:
Here is a snippet of the JSON google's API returns:
"items": [
{
"kind": "customsearch#result",
"title": "Tina Minor - Recruiter, The Walt Disney Company | LinkedIn",
"htmlTitle": "Tina Minor - Recruiter, The \u003cb\u003eWalt Disney\u003c/b\u003e Company | LinkedIn",
"link": "https://www.linkedin.com/pub/tina-minor-recruiter-the-walt-disney- company/5/849/5a6",
"displayLink": "www.linkedin.com",
"snippet": "View Tina Minor - Recruiter, The Walt Disney Company's professional profile on \n... Current. The Walt Disney Company. Previous. True Religion Brand Jeans, ...",
"htmlSnippet": "View Tina Minor - Recruiter, The \u003cb\u003eWalt Disney\u003c/b\u003e Company's professional profile on \u003cbr\u003e\n... \u003cb\u003eCurrent\u003c/b\u003e. The \u003cb\u003eWalt Disney\u003c/b\u003e Company. Previous. True Religion Brand Jeans, ...",
"formattedUrl": "https://www.linkedin.com/pub/tina-minor-recruiter-the- walt- disney.../5a6",
"htmlFormattedUrl": "https://www.linkedin.com/pub/tina-minor-recruiter- the- \u003cb\u003ewalt\u003c/b\u003e- \u003cb\u003edisney\u003c/b\u003e.../5a6",
"pagemap": {
"cse_image": [
{
"src": "https://media.licdn.com/mpr/mpr/shrink_200_200/p/8/005/09b/3f2/1eb6f83.jpg"
}
],
"person": [
{
"location": "Greater Los Angeles Area",
"role": "Recruiter, Talent Acquisition at The Walt Disney Company"
}
],
"cse_thumbnail": [
{
"width": "160",
"height": "160",
"src": "https://encrypted-tbn1.gstatic.com/images? q=tbn:ANd9GcTbmlbDVBOMKTtOA_D88aFaPuZ9MjABABwumzBPk0F2x2P2-0puaIRlktce"
}
],
"metatags": [
{
"globaltrackingurl": "//www.linkedin.com/mob/tracking",
"globaltrackingappname": "profile",
"globaltrackingappid": "webTracking",
"lnkd-track-json-lib": "https://static.licdn.com/scds/concat/common/js? h=2jds9coeh4w78ed9wblscv68v-ebbt2vixcc5qz0otts5io08xv&fc=2",
"treeid": "SnQhTqcr1RNgnKS8RSsAAA==",
"appname": "profile",
"pageimpressionid": "29ca4803-0233-4934-955a-1959a37dfbbf",
"pagekey": "nprofile_v2_public_fs",
"analyticsurl": "/analytics/noauthtracker",
"msapplication-tileimage": "https://static.licdn.com/scds/common/u/images/logos/linkedin/logo-in-win8-tile- 144_v1.png",
"msapplication-tilecolor": "#0077B5",
"application-name": "LinkedIn",
"remote-nav-init-marker": "true"
}
],
"hcard": [
{
"fn": "Tina Minor - Recruiter, The Walt Disney Company",
"title": "Recruiter, Talent Acquisition at The Walt Disney Company"
}
]
}
}
...

According to the Search request metadata documentation, a nextPage value should be returned alongside items when there are additional results. However, the same documentation notes that "This API returns up to the first 100 results only", so it looks like you are already getting the maximum number of results.
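If the response does contain a nextPage entry, following it could look roughly like the sketch below. It is only an illustration, not the gem's API: it calls the JSON endpoint directly with Net::HTTP, assumes the `start` request parameter and the `queries.nextPage[0].startIndex` response field, and the names `collect_titles` and `fetch` are made up for this example. The `fetch` argument is there so the paging loop can be tried without network access.

```ruby
require 'json'
require 'net/http'
require 'uri'

# Assumed endpoint of the Custom Search JSON API.
SEARCH_ENDPOINT = 'https://www.googleapis.com/customsearch/v1'

# Collects result titles page by page. `fetch` is a hypothetical hook that
# takes a URI and returns the raw JSON body; it defaults to a real HTTP GET.
def collect_titles(query, key:, cx:, fetch: ->(uri) { Net::HTTP.get(uri) })
  titles = []
  start = 1
  while start
    uri = URI(SEARCH_ENDPOINT)
    uri.query = URI.encode_www_form(q: query, key: key, cx: cx, start: start)
    page = JSON.parse(fetch.call(uri))
    titles.concat((page['items'] || []).map { |item| item['title'] })
    # When queries.nextPage is absent there is nothing more to fetch.
    next_page = page.dig('queries', 'nextPage', 0)
    start = next_page && next_page['startIndex']
  end
  titles
end
```

With real credentials this would be called as `collect_titles('current Acme site:linkedin.com', key: apikey, cx: cxid)`; the loop stops on its own once the API no longer reports a next page.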

Related

I'm having issues with using an API

I have the following code:
class EpisodeIndex::API
def initialize
@url = "https://www.officeapi.dev/api/episodes?limit=400"
end
def get_episode_data
uri = URI.parse(@url)
response = Net::HTTP.get(uri)
data = JSON.parse(response)
data["data"].each do |episode|
get_episode_title(episode["title"])
end
end
def get_episode_title(title)
uri = URI.parse(title)
response = Net::HTTP.get(title)
data = JSON.parse(response)
binding.pry
end
EpisodeIndex::API.new.get_episode_data
end
and I'm getting this error in return.
`get_response': undefined method `hostname' for "Pilot":String (NoMethodError)
jocelynpeters@Jocelyns-Air office_cli %
I have no idea how to fix it. Be kind, please. I'm very new to programming.
Thanks!
The data returned by the API looks like this:
{
"data":
[
{
"_id":"5e94d646f733a1332868e1dc",
"title":"Pilot",
"description":"A documentary crew gives a firsthand introduction to the staff of the Scranton branch of the Dunder Mifflin Paper Company, managed by Michael Scott.",
"writer": {
"_id":"5e95242f9511994a07f9a319",
"name":"Greg Daniels",
"role":"Writer/Director",
"__v":0
},
"director": {
"_id":"5e9523649511994a07f9a313",
"name":"Ken Kwapis",
"role":"Director",
"__v":0
},
"airDate":"2005-03-24T06:00:00.000Z",
"__v":0
},
# ...
That means the result set already includes the title of each episode, and it doesn't include any URL from which you could load further information, which is what your get_episode_title method currently tries to do. The error occurs because Net::HTTP.get is handed the plain string "Pilot" rather than a URI. You can therefore simplify your code to:
module EpisodeIndex
require "json"
require "net/http"
class API
def initialize
@url = "https://www.officeapi.dev/api/episodes?limit=400"
end
def titles
uri = URI.parse(@url)
response = Net::HTTP.get(uri)
data = JSON.parse(response)
data["data"].map do |episode|
episode["title"]
end
end
end
end
EpisodeIndex::API.new.titles
#=> ["Pilot", "Diversity Day", "Health Care", "The Alliance", "Basketball", "Hot Girl", "The Dundies", "Sexual Harassment", "Office Olympics", "The Fire", "Halloween", "The Fight", "The Client", "Performance Review", "E-Mail Surveillance", "Christmas Party", "Booze Cruise", "The Injury", "The Secret", "The Carpet", "Boys and Girls", "Valentine's Day", "Dwight's Speech", "Take Your Daughter to Work Day", "Michael's Birthday", "Drug Testing", "Conflict Resolution", "Casino Night"]

DRY Strategy for looping over unknown levels of nested objects

My scenario is based on Gmail API.
I've learned that email messages can have their message parts deeply or shallowly nested based upon varying factors, but mostly the presence of attachments.
I'm using the Google API Ruby Client gem, so I'm not working with JSON, I'm getting objects with all the same information, but I think the JSON representation makes it easier to understand my issue.
A simple message JSON response looks like this (one parts array with 2 hashes inside it):
{
"id": "175b418b1ff69896",
"snippet": "COVID-19: Resources to help your business manage through uncertainty 20 Liters 500 PEOPLE FOUND YOU ON GOOGLE Here are the top search queries used to find you: 20 liters used by 146 people volunteer",
"payload": {
"parts": [
{
"mimeType": "text/plain",
"body": {
"data": "Hey, you found the body of the email! I want this!"
}
},
{
"mimeType": "text/html",
"body": {
"data": "<div>I actually don't want this</div>"
}
}
]
}
}
The value I want is not that hard to get:
response.payload.parts.each do |part|
@body_data = part.body.data if part.mime_type == 'text/plain'
end
But the JSON response of a more complex email message with attachments looks something like this (now parts nests itself 3 levels deep):
{
"id": "175aee26de8209d2",
"snippet": "snippet text...",
"payload": {
"parts": [
{
"mimeType": "multipart/related",
"parts": [
{
"mimeType": "multipart/alternative",
"parts": [
{
"mimeType": "text/plain",
"body": {
"data": "hey, you found me! This is what I want!!"
}
},
{
"mimeType": "text/html",
"body": {
"data": "<div>I actually don't want this one.</div>"
}
}
]
},
{
"mimeType": "image/jpeg"
},
{
"mimeType": "image/png"
},
{
"mimeType": "image/png"
},
{
"mimeType": "image/jpeg"
},
{
"mimeType": "image/png"
},
{
"mimeType": "image/png"
}
]
},
{
"mimeType": "application/pdf"
}
]
}
}
And looking at a few other messages, the object can vary from 1 to 5 levels (maybe more) of parts.
I need to loop over an unknown number of parts, then loop over an unknown number of nested parts, and then repeat this until I reach the bottom, hopefully finding the thing I want.
Here's my best attempt:
def trim_response(response)
# remove headers I don't care about
response.payload.headers.keep_if { |header| @valuable_headers.include? header.name }
# remove parts I don't care about
response.payload.parts.each do |part|
# parts can be nested within parts, within parts, within...
if part.mime_type == @valuable_mime_part && part.body.present?
@body_data = part.body.data
break
elsif part.parts.present?
# there are more layers down
find_body(part)
end
end
end
def find_body(part)
part.parts.each do |sub_part|
if sub_part.mime_type == @valuable_mime_part && sub_part.body.present?
@body_data = sub_part.body.data
break
elsif sub_part.parts.present?
# there are more layers down
######### THIS FEELS BAD!!! ###########
find_body(sub_part)
end
end
end
Yep, there's a method calling itself. I know, that's why I'm here.
This does work, I've tested it on a few dozen messages, but... there has to be a better, DRY-er way to do this.
How do I recursively loop and then move down a level and loop again in a DRY fashion when I don't know how deep the nesting goes?
There's no need to go through all this pain. Just keep descending into the first element of parts until you reach a value that has no parts key anymore. At that point the parts variable holds the innermost first part.
Code:
response = {"id" => "175aee26de8209d2","snippet" => "snippet text...","payload" => {"parts" => [{"mimeType" => "multipart/related","parts" => [{"mimeType" => "multipart/alternative","parts" => [{"mimeType" => "text/plain","body" => {"data" => "hey, you found me! This is what I want!!"}},{"mimeType" => "text/html","body" => {"data" => "<div>I actually don't want this one.</div>"}}]},{"mimeType" => "image/jpeg"}]},{"mimeType" => "application/pdf"}]}}
parts = response["payload"]
# Follow the first entry of each nested "parts" array until none is left.
parts = (parts["parts"].first || parts["parts"]) while parts["parts"]
data = parts["body"]["data"]
puts data
Output:
hey, you found me! This is what I want!!
You can compute the desired result using recursion.
def find_it(h, top_key, k1, k2, k3)
return nil unless h.key?(top_key)
recurse(h[top_key], k1, k2, k3)
end
def recurse(h, k1, k2, k3)
return nil unless h.key?(k1)
h[k1].each do |g|
v = g.dig(k2,k3) || recurse(g, k1 , k2, k3)
return v unless v.nil?
end
nil
end
See Hash#dig.
Let h1 and h2 equal the two hashes given in the example (see note 1 below). Then:
find_it(h1, :payload, :parts, :body, :data)
#=> "Hey, you found the body of the email! I want this!"
find_it(h2, :payload, :parts, :body, :data)
#=> "hey, you found me! This is what I want!!"
1. The hash h[:payload][:parts].last #=> { "mimeType": "application/pdf" } appears to contain hidden characters that are causing a problem. I therefore removed that hash from h2.
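Since the question works with the gem's objects rather than hashes, the same recursive idea can be folded into a single method. This is a minimal sketch under assumptions: `Part` and `Body` are stand-in Structs invented for this example, and the real message parts are only assumed to respond to `mime_type`, `body` and `parts` as in the question's code. A method calling itself is ordinary recursion; handling every level with one method is what removes the duplication.

```ruby
# Stand-ins for the API client's message-part objects (hypothetical).
Part = Struct.new(:mime_type, :body, :parts, keyword_init: true)
Body = Struct.new(:data, keyword_init: true)

# Depth-first search: returns the body data of the first part with the
# wanted MIME type, or nil if no such part exists at any depth.
def find_body_data(part, mime_type = 'text/plain')
  return part.body.data if part.mime_type == mime_type && part.body&.data

  (part.parts || []).each do |sub_part|
    found = find_body_data(sub_part, mime_type)
    return found if found
  end
  nil
end
```

Under those assumptions, calling `find_body_data(response.payload)` would replace both trim_response's loop and find_body, because the payload itself is treated as just another part.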

How can I iterate over an array of hashes and form new one

I make a call to the Companies House API, and the response I get contains an array of hashes.
companies = {
"total_results" => 2,
"items" => [{
"title" => "First company",
"date_of_creation" => "2016-11-09",
"company_type" => "ltd",
"company_number" => "10471071323",
"company_status" => "active"
},
{
"title" => "Second company",
"date_of_creation" => "2016-11-09",
"company_type" => "ltd",
"company_number" => "1047107132",
"company_status" => "active"
}]
}
How can I iterate over companies and get a result similar to:
[{
title: "First company",
company_number: "10471071323"
},
{
title: "Second company",
company_number: "1047107132"
}]
You can use map which will iterate through the elements in an array and return a new array:
companies["items"].map do |c|
{
title: c['title'],
company_number: c['company_number']
}
end
=> [
{:title=>"First company", :company_number=>"10471071323"},
{:title=>"Second company", :company_number=>"1047107132"}
]
companies["items"].map { |company| company.slice('title', 'company_number').symbolize_keys }
This should do the trick.
If you're not using Rails (or, more specifically, ActiveSupport), then symbolize_keys won't be available. In this case, you'd have to go for a more standard-Ruby approach:
companies["items"].map do |company|
{ title: company["title"], company_number: company["company_number"] }
end
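On Ruby 2.5 or later, a similarly compact version is possible without ActiveSupport, because Hash#slice and Hash#transform_keys are part of core Ruby from that version on. A small sketch using the data from the question (the variable name `summaries` is just for illustration):

```ruby
companies = {
  "total_results" => 2,
  "items" => [
    { "title" => "First company", "date_of_creation" => "2016-11-09",
      "company_type" => "ltd", "company_number" => "10471071323",
      "company_status" => "active" },
    { "title" => "Second company", "date_of_creation" => "2016-11-09",
      "company_type" => "ltd", "company_number" => "1047107132",
      "company_status" => "active" }
  ]
}

# Keep only the wanted keys, then turn the string keys into symbols.
summaries = companies["items"].map do |company|
  company.slice("title", "company_number").transform_keys(&:to_sym)
end
```

Note that the iteration is over `companies["items"]`, since `companies` itself is a hash with the array nested under "items".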
The answers are totally correct, but you should be aware that what you’re looking at from Companies House is not just an array of hashes: it’s a valid JsonApi response.
You might find your job easier if you use a gem that is aware of the JsonApi spec, or if you simply approach the response as that kind of data.
Have a look at the Ruby implementations listed at https://jsonapi.org/implementations/
or at ActiveModelSerializer for ways to not only reshape your hashes but also deserialise this very structured data into Ruby objects.
But like I say, if all you’re looking for is a quick way to reshape the data as you describe, the above answers are perfect.

How can I get the original charge and refund ids of an automatic payout

Stripe Connect accounts can be configured to coalesce payouts into a regular payout schedule, e.g. monthly payouts in our case. For these monthly payouts we need to explain to the account owners which of the transactions on our platform (bookings and refunds in our case) produced the overall amount they receive. We store the Stripe charge id (resp. refund id) in the booking (resp. refund) objects in our database. Thus the question boils down to:
Given a stripe account id, how can you get the list of stripe charge and refund ids that contributed to the last payout?
I've had an extensive exchange with Stripe's support team and there are several puzzle pieces necessary to get there:
Payouts are scoped by accounts
If you query stripe for a list of payouts, you will only receive the payout objects that you, the platform owner, get from stripe. To get the payout objects of a specific account you can use the normal authentication for the platform, but send the stripe account id as a header. So the code snippet to get the last payout looks like this (I'll use ruby snippets as examples for the rest of the answer):
Stripe::Payout.list({limit: 1}, {stripe_account: 'acct_0000001234567890aBcDeFgH'})
=> #<Stripe::ListObject:0x0123456789ab> JSON: {
"object": "list",
"data": [
{"id":"po_1000001234567890aBcDeFgH",
"object":"payout",
"amount":53102,
"arrival_date":1504000000,
"balance_transaction":"txn_2000001234567890aBcDeFgH",
"created":1504000000,
"currency":"eur",
"description":"STRIPE TRANSFER",
"destination":"ba_3000001234567890aBcDeFgH",
"failure_balance_transaction":null,
"failure_code":null,
"failure_message":null,
"livemode":true,"metadata":{},
"method":"standard",
"source_type":"card",
"statement_descriptor":"[…]",
"status":"paid",
"type":"bank_account"
}
],
"has_more": true,
"url": "/v1/payouts"
}
Having the payout id, we can query the list of balance transactions, scoped to a payout:
Stripe::BalanceTransaction.all({
payout: 'po_1000001234567890aBcDeFgH',
limit: 2,
}, {
stripe_account: 'acct_0000001234567890aBcDeFgH'
})
Objects viewed as an account are stripped of most information, compared to those viewed as a platform owner
Even though you now have the payout id, the object is still scoped to the account and you cannot retrieve it as platform owner. But viewed as an account, the payout only shows pseudo charge and refund objects like these (notice the second transaction has a py_7000001234567890aBcDeFgH object as a source instead of a regular ch_ charge object):
Stripe::BalanceTransaction.all({
payout: 'po_1000001234567890aBcDeFgH',
limit: 2,
}, {
stripe_account: 'acct_0000001234567890aBcDeFgH'
})
=> {
:object => "list",
:data => [
{
:id => "txn_4000001234567890aBcDeFgH",
:object => "balance_transaction",
:amount => -53102,
:available_on => 1504000000,
:created => 1504000000,
:currency => "eur",
:description => "STRIPE TRANSFER",
:fee => 0,
:fee_details => [],
:net => -53102,
:source => "po_5000001234567890aBcDeFgH",
:status => "available",
:type => "payout"
},
{
:id => "txn_6000001234567890aBcDeFgH",
:object => "balance_transaction",
:amount => 513,
:available_on => 1504000000,
:created => 1504000000,
:currency => "eur",
:description => nil,
:fee => 0,
:fee_details => [],
:net => 513,
:source => "py_7000001234567890aBcDeFgH",
:status => "available",
:type => "payment"
}
],
:has_more => true,
:url => "/v1/balance/history"
}
You can let stripe automatically expand objects in the response
As an additional parameter, you can give stripe paths of objects which you want stripe to expand in their response. Thus we can walk from the pseudo objects back to the original charge objects via the transfers:
Stripe::BalanceTransaction.all({
payout: 'po_1000001234567890aBcDeFgH',
limit: 2,
expand:['data.source.source_transfer',]
}, {
stripe_account: 'acct_0000001234567890aBcDeFgH'
}).data.second.source.source_transfer.source_transaction
=> "ch_8000001234567890aBcDeFgH"
And if you want to process the whole list, you need to distinguish the entries by their source.object attribute:
Stripe::BalanceTransaction.all({
payout: 'po_1000001234567890aBcDeFgH',
limit: 2,
expand:['data.source.source_transfer',]
}, {
stripe_account: 'acct_0000001234567890aBcDeFgH'
}).data.map do |bt|
if bt.source.object == 'charge'
['charge', bt.source.source_transfer.source_transaction]
else
[bt.source.object]
end
end
=> [["payout"], ["charge", "ch_8000001234567890aBcDeFgH"]]
Refunds have no connecting object path back to the original ids
Unfortunately, there is currently no way to get the original re_ objects from the pseudo pyr_ that are returned by the BalanceTransaction list call for refund transactions. The best alternative I've found is to go via the data.source.charge.source_transfer.source_transaction path to get the charge id of the charge on which the refund was issued and use that in combination with the created attribute of the pyr_ to match our database refund object. I'm not sure, though, how stable that method really is. The code to extract that data:
Stripe::BalanceTransaction.all({
payout: 'po_1000001234567890aBcDeFgH',
limit: 100, # max page size, the code to iterate over all pages is TBD
expand: [
'data.source.source_transfer', # For charges
'data.source.charge.source_transfer', # For refunds
]
}, {
stripe_account: 'acct_0000001234567890aBcDeFgH'
}).data.map do |bt|
res = case bt.source.object
when 'charge'
{
charge_id: bt.source.source_transfer.source_transaction
}
when 'refund'
{
charge_id: bt.source.charge.source_transfer.source_transaction
}
else
{}
end
res.merge(type: bt.source.object, amount: bt.amount, created: bt.created)
end
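The matching step described above (charge id plus the created timestamp) is not shown in the snippet, so here is one way it could be sketched. Everything about the local side is an assumption: the record shape `{ id:, charge_id:, created_at: }` with epoch seconds, the `match_refunds` name and the 300-second tolerance are all hypothetical and would need adjusting to your own database.

```ruby
# payout_rows: hashes as produced by the map above
#   ({ charge_id:, type:, amount:, created: }).
# local_refunds: hypothetical local records ({ id:, charge_id:, created_at: }).
def match_refunds(payout_rows, local_refunds, tolerance: 300)
  payout_rows.select { |row| row[:type] == 'refund' }.map do |row|
    # Candidates are local refunds issued on the same charge.
    candidates = local_refunds.select { |r| r[:charge_id] == row[:charge_id] }
    # Pick the one created closest in time to the pseudo refund object.
    best = candidates.min_by { |r| (r[:created_at] - row[:created]).abs }
    close = best && (best[:created_at] - row[:created]).abs <= tolerance
    row.merge(local_refund_id: close ? best[:id] : nil)
  end
end
```

Rows that cannot be matched within the tolerance get a nil `local_refund_id`, which makes it easy to flag them for manual review instead of guessing.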
It is now possible to get the refund ids via a "transfer reversal" object:
Stripe::BalanceTransaction.list({
payout: 'po_1000001234567890aBcDeFgH',
expand: [
'data.source.source_transfer', # For charges
'data.source.transfer_reversal', # For refunds
]
}, {
stripe_account: 'acct_0000001234567890aBcDeFgH'
}).auto_paging_each do |balance_transaction|
case balance_transaction.type
when 'payment'
charge_id = balance_transaction.source.source_transfer.source_transaction
when 'payment_refund'
# The expanded transfer reversal links back to the original refund.
refund_id = balance_transaction.source.transfer_reversal.source_refund
end
end

Saving Point to a Google Fitness API (fitness.body.write)

I'm trying to save a point with a float value into fitness.body.
Reading values is not a problem, but saving a new point fails with 403: No permission to modify data for this source.
I'm using the DataSetId derived:com.google.weight:com.google.android.gms:merge_weight to find the point and read its value, and raw:com.google.weight:com.google.android.apps.fitness:user_input to insert data.
Here is a workflow using Ruby and google-api-ruby-client:
require 'google/api_client'
require 'google/api_client/client_secrets'
require 'google/api_client/auth/installed_app'
require 'pry'
# Initialize the client.
client = Google::APIClient.new(
:application_name => 'Example Ruby application',
:application_version => '1.0.0'
)
fitness = client.discovered_api('fitness')
# Load client secrets from your client_secrets.json.
client_secrets = Google::APIClient::ClientSecrets.load
flow = Google::APIClient::InstalledAppFlow.new(
:client_id => client_secrets.client_id,
:client_secret => client_secrets.client_secret,
:scope => ['https://www.googleapis.com/auth/fitness.body.write',
'https://www.googleapis.com/auth/fitness.activity.write',
'https://www.googleapis.com/auth/fitness.location.write']
)
client.authorization = flow.authorize
Forming my new data Point:
dataSourceId = 'raw:com.google.weight:com.google.android.apps.fitness:user_input'
startTime = (Time.now-1).to_i # 1 Second ago
endTime = (Time.now).to_i
metadata = {
dataSourceId: dataSourceId,
minStartTimeNs: "#{startTime}000000000", # Faking nanoseconds with trailing zeros
maxEndTimeNs: "#{endTime}000000000",
point: [
{
endTimeNanos: "#{endTime}000000000",
startTimeNanos: "#{startTime}000000000",
value: [
{ fpVal: 80 }
]
}
]
}
Attempting to save the point:
result = client.execute(
:api_method => fitness.users.data_sources.datasets.patch,
:body_object => metadata,
:parameters => {
'userId' => "me",
'dataSourceId' => dataSourceId,
'datasetId' => "#{Time.now.to_i-1}000000000-#{(Time.now).to_i}000000000"
}
)
And, as I indicated previously, I'm getting a 403: No permission to modify data for this source.
#<Google::APIClient::Schema::Fitness::V1::Dataset:0x3fe78c258f60 DATA:{"error"=>{"errors"=>[{"domain"=>"global", "reason"=>"forbidden", "message"=>"No permission to modify data for this source."}], "code"=>403, "message"=>"No permission to modify data for this source."}}>
I believe I selected all required permissions in the scope. I tried submitting the point to both accessible dataset IDs for fitness.body.
Please let me know if I'm doing anything wrong here.
Thank you!
I ran into the same situation. It turns out you cannot insert data points directly into the data source "raw:com.google.weight:com.google.android.apps.fitness:user_input"; judging by its name, this data source is reserved. The workaround is to create your own data source, making sure its dataType.name is "com.google.weight", like this:
{
"dataStreamName": "xxxx.body.weight",
"dataType": {
"field": [
{
"name": "weight",
"format": "floatPoint"
}
],
"name": "com.google.weight"
},
"dataQualityStandard": [],
"application": {
"version": "1",
"name": "Foo Example App",
"detailsUrl": "http://example.com"
},
"device": {
"model": "xxxmodel",
"version": "1",
"type": "scale",
"uid": "xxx#yyy",
"manufacturer": "xxxxManufacturer"
},
"type": "derived"
}
Once the data source has been created successfully, you can use its data stream ID to insert your own data points. The inserted data points will then also be included in the data source "derived:com.google.weight:com.google.android.gms:merge_weight" when you query it with the suffix "dataPointChanges".
Try adding an Authorization header:
result = client.execute(
:api_method => fitness.users.data_sources.datasets.patch,
:headers => {'Authorization' => 'Bearer YOUR_AUTH_TOKEN'},
:body_object => metadata,
:parameters => {
'userId' => "me",
'dataSourceId' => dataSourceId,
'datasetId' => "#{Time.now.to_i-1}000000000-#{(Time.now).to_i}000000000"
}
)
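Separately from the permission problem, the question builds the nanosecond fields and the datasetId from several independent Time.now calls, so they can drift apart. The sketch below derives everything from one pair of Time values. It is only an illustration: `to_nanos` and `weight_dataset` are names made up here, the field names are copied from the question's request body, and the "startNs-endNs" datasetId format is taken from the question's own code.

```ruby
# Converts a Time to integer nanoseconds since the epoch.
def to_nanos(time)
  (time.to_r * 1_000_000_000).to_i
end

# Builds the datasetId and request body from a single start/end pair so
# that both always describe the same time range.
def weight_dataset(data_source_id, start_time, end_time, kilograms)
  start_ns = to_nanos(start_time).to_s
  end_ns = to_nanos(end_time).to_s
  {
    dataset_id: "#{start_ns}-#{end_ns}",
    body: {
      dataSourceId: data_source_id,
      minStartTimeNs: start_ns,
      maxEndTimeNs: end_ns,
      point: [
        {
          startTimeNanos: start_ns,
          endTimeNanos: end_ns,
          value: [{ fpVal: kilograms.to_f }]
        }
      ]
    }
  }
end
```

The returned `:dataset_id` would go into the 'datasetId' parameter and `:body` into :body_object, with the ID of a data source you created yourself as `data_source_id`.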

Resources