Check headers before downloading with Net::HTTP::Pipeline - ruby

I am trying to parse a list of image URLs and get some basic information before I actually commit to a download:
Is the image there? (Solved with response.code.)
Do I have the image already? (I want to look at type and size.)
My script will check a large list every day (about 1300 rows), and each row has 30-40 image URLs. My @photo_urls variable allows me to keep track of what I have downloaded already. I would really like to be able to use that later as a hash (instead of an array, as in my example code) to iterate through and do the actual downloading.
Right now my problem (besides being a Ruby newbie) is that Net::HTTP::Pipeline only accepts an array of Net::HTTPRequest objects. The documentation for net-http-pipeline indicates that response objects come back in the same order as the corresponding request objects that went in. The problem is that I have no way to correlate a request with its response other than that order, and I don't know how to get the relative ordinal position inside a block. I assume I could just use a counter variable, but how would I access a hash by ordinal position?
Net::HTTP.start uri.host do |http|
  # Init HTTP requests hash
  requests = {}
  photo_urls.each do |photo_url|
    # make sure we don't process the same image again.
    hashed = Digest::SHA1.hexdigest(photo_url)
    next if @photo_urls.include? hashed
    @photo_urls << hashed
    # change user agent and store in hash
    my_uri = URI.parse(photo_url)
    request = Net::HTTP::Head.new(my_uri.path)
    request.initialize_http_header({"User-Agent" => "My Downloader"})
    requests[hashed] = request
  end
  # process requests (send array of values - ie. requests) in a pipeline.
  http.pipeline requests.values do |response|
    if response.code == "200"
      # anyway to reference the hash here so I can decide whether
      # I want to do anything later?
    end
  end
end
Finally, if there is an easier way of doing this, please feel free to offer any suggestions.
Thanks!

Make requests an array instead of a hash and shift the requests off the front as the responses come in:
Net::HTTP.start uri.host do |http|
  # Init HTTP requests array
  requests = []
  photo_urls.each do |photo_url|
    # make sure we don't process the same image again.
    hashed = Digest::SHA1.hexdigest(photo_url)
    next if @photo_urls.include? hashed
    @photo_urls << hashed
    # change user agent and store in array
    my_uri = URI.parse(photo_url)
    request = Net::HTTP::Head.new(my_uri.path)
    request.initialize_http_header({"User-Agent" => "My Downloader"})
    requests << request
  end
  # process a copy of the requests in a pipeline, so the original
  # array is still available for matching up with responses.
  http.pipeline requests.dup do |response|
    # responses arrive in request order, so the first remaining
    # request is the one this response belongs to
    request = requests.shift
    if response.code == "200"
      # Do whatever checking with request
    end
  end
end
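If you would rather keep the hash, the same ordering guarantee can be used with a counter, since Ruby hashes (1.9 and later) preserve insertion order and requests.keys lines up with requests.values. The following is only a sketch: fake_pipeline and FakeResponse are hypothetical stand-ins for http.pipeline and its responses so the example runs without a network connection, and the example.com URLs are made up.

```ruby
require 'net/http'
require 'digest/sha1'

# Hypothetical stand-in for a response object; only #code is modelled.
FakeResponse = Struct.new(:code)

# Hypothetical stand-in for http.pipeline: yields one response per
# request, in the same order the requests were given.
def fake_pipeline(requests)
  requests.each do |request|
    yield FakeResponse.new(request.path.include?("missing") ? "404" : "200")
  end
end

# Made-up URLs for illustration only.
photo_urls = [
  "http://example.com/photos/a.jpg",
  "http://example.com/photos/missing.jpg",
  "http://example.com/photos/b.jpg"
]

# Build the requests hash keyed by the SHA1 of each URL.
requests = {}
photo_urls.each do |photo_url|
  hashed = Digest::SHA1.hexdigest(photo_url)
  requests[hashed] = Net::HTTP::Head.new(URI.parse(photo_url).path)
end

# keys[i] corresponds to values[i], so a counter recovers the key.
keys = requests.keys
available = {}
index = 0
fake_pipeline(requests.values) do |response|
  hashed = keys[index]
  index += 1
  available[hashed] = (response.code == "200")
end

available.each { |hashed, ok| puts "#{hashed}: #{ok ? 'download' : 'skip'}" }
```

With the real library, the block passed to http.pipeline would take the place of the fake_pipeline block, and available could then drive the actual downloads.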

Related

rake db:seed not working to seed from API in Ruby CLI app - will seed manually written data - Ruby/ActiveRecord

I’m trying to make improvements to a school project (super beginner) that uses data seeded from an API to build a CLI app with Ruby and ActiveRecord, no Rails. I have had to kind of "cheat" the data: I take a hash of object IDs, append each ID to the end of another URL (creating an array of these links), then iterate over the links, make a GET request for each, and collect the results in a final hash that I iterate over to seed my database.
I was able to do it successfully once, but I wanted to expand the data set, so I cleared the db and went to re-seed, and it no longer works. It hangs for quite a bit, then seems to complete, but the data isn't there. The only change I made in my code was to the URL, but even when I change it back it no longer works. However, it does seed anything I've manually written. The URL works fine in my browser. I tried rake db:migrate:reset but that didn't seem to work for me.
I apologize if my code is a bit messy, I'm just trying to get to the bottom of this issue and it is my first time working with APIs / creating a project like this. I appreciate any help. Thanks!
def object_id_joiner
  response = RestClient.get("https://collectionapi.metmuseum.org/public/collection/v1/search?departmentId=11&15&19&21&6q=*")
  metData = JSON.parse(response)
  url = "https://collectionapi.metmuseum.org/public/collection/v1/objects/"
  urlArray = []
  metData["objectIDs"].each do |e|
    urlArray.push(url.to_s + e.to_s)
  end
  # urlArray.slice!(0,2)
  urlArray
end
object_id_joiner
def finalHash
  finalHash = []
  object_id_joiner.each do |e|
    response = RestClient.get(e)
    data = JSON.parse(response)
    finalHash.push(data)
  end
  finalHash
end
finalHash
finalHash.each do |artist_hash|
  if artist_hash["artistDisplayName"] == nil
    next
  end
  if (!artist_hash["artistDisplayName"])
    art1 = Artist.create(artist_name:artist_hash["artistDisplayName"])
  else
    next
  end
  if (!artist_hash["objectID"])
    Artwork.create(title: artist_hash["title"],image: artist_hash["primaryImage"], department: artist_hash["department"], artist: art1, object_id: artist_hash["objectID"])
  else
    next
  end
end
As mentioned in comments you had some rogue ! in your code.
Here is a simpler version of your last loop.
finalHash.each do |artist_hash|
  next if artist_hash["artistDisplayName"].nil?
  # Now you don't need conditional for artistDisplayName
  art1 = Artist.create(artist_name: artist_hash["artistDisplayName"])
  # Now create artwork if you HAVE objectID
  if artist_hash["objectID"]
    Artwork.create(title: artist_hash["title"], image: artist_hash["primaryImage"], department: artist_hash["department"], artist: art1, object_id: artist_hash["objectID"])
  end
end
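To see why the rogue ! meant nothing was seeded, here is a small sketch that uses plain hashes instead of ActiveRecord models. The three records are invented for illustration; only the key names come from the question. After the nil check, !artist_hash["artistDisplayName"] can only be false, so the original loop always fell through to next.

```ruby
# Invented sample records; only the key names come from the question.
records = [
  { "artistDisplayName" => "Artist A", "objectID" => 1,   "title" => "Work 1" },
  { "artistDisplayName" => nil,        "objectID" => 2,   "title" => "Work 2" },
  { "artistDisplayName" => "Artist B", "objectID" => nil, "title" => "Work 3" }
]

# Original logic: skip nil names, then require the name to be falsy.
# Both conditions can never hold together, so nothing passes.
seeded_by_original = records.select do |r|
  !r["artistDisplayName"].nil? && !r["artistDisplayName"]
end

# Fixed logic: an artist for every record that has a name ...
seeded_by_fix = records.reject { |r| r["artistDisplayName"].nil? }
# ... and an artwork only when the record also has an objectID.
artworks = seeded_by_fix.select { |r| r["objectID"] }

puts "original guard passes: #{seeded_by_original.length}"
puts "fixed guard passes:    #{seeded_by_fix.length}"
puts "artworks created:      #{artworks.map { |r| r['title'] }.inspect}"
```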

Using Sinatra to Parse JSON data from url

I'm using Sinatra to complete a task.
I need to:
Parse the data of a JSON object from a URL,
Single out one of the attributes of the JSON data and store it as a variable,
Run some arithmetic on the variable,
Return the result as a new variable,
then post this to a new URL as a new JSON object.
I have seen bits and pieces of information all over including information on parsing JSON data in ruby and information on open-uri but I believe it would be very valuable having someone break this down step by step as most similar solutions given to this are either outdated or steeply complex.
Thanks in advance.
Here's a simple guide. I've done the same task recently.
Let's use this JSON (put it in a file called 'simple.json'):
{
"name": "obscurite",
"favorites": {
"icecream": [
"chocolate",
"pistachio"
],
"cars": [
"ferrari",
"porsche",
"lamborghini"
]
},
"location": "NYC",
"age": 100}
Parse the data of a JSON object from a url.
Step 1 is to add support for JSON parsing:
require 'json'
Step 2 is to load in the JSON data from our new .json file:
json_file = File.read('simple.json')
json_data = JSON.parse(json_file)
Single out one of attributes of the json data and store it as a variable
Our data is in the form of a Hash on the outside (curly braces with key: value pairs). One of the values is also a hash ('favorites'). The values of that inner hash ('icecream' and 'cars') are lists (Arrays in Ruby). So what we have is a hash containing another hash, whose values are arrays.
Let's pull out my location:
puts json_data['location'] # NYC
That was easy. It was just a top level key/value. Let's go deeper and pull out my favorite icecream(s):
puts json_data['favorites']['icecream'] # chocolate pistachio
Now only my second favorite car:
puts json_data['favorites']['cars'][1] # porsche
Run some arithmetic on the variable
Step 3. Let's get my age and cut it down by 50 years. Being 100 is tough!
new_age = json_data['age'] / 2
puts new_age
Return the result as a new variable
Step 4. Let's put the new age back into the json
json_data['age'] = new_age
puts json_data['age'] # 50
then post this to a new url as a new json object.
Step 5. Add the ability for your program to do an HTTP POST. Add this up at top:
require 'net/http'
and then you can post anywhere you want. I found a fake web service you could use, if you just want to make sure the request got there.
# use this guy's fake web service page as a test. handy!
uri = URI.parse("http://jsonplaceholder.typicode.com/posts")
header = {'Content-Type' => 'application/json'}
http = Net::HTTP.new(uri.host, uri.port)
request = Net::HTTP::Post.new(uri.request_uri, header)
request.body = json_data.to_json
response = http.request(request)
# Did we get something back?
puts response.body
On linux or mac you can open a localhost port and listen as a test:
nc -4 -k -l -v localhost 1234
To POST to this port change the uri to:
uri = URI.parse("http://localhost:1234")
Hope this helps. Let me know if you get stuck and I'll try to lend a hand. I'm not a ruby expert, but wanted to help a fellow explorer. Good luck.
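Putting the steps together, a minimal sketch might look like the following. It parses a shortened copy of the JSON above from a string rather than from a file, and it only builds the POST request without sending it, so it runs without a server. The localhost:1234 address is the same test listener mentioned above, and the commented-out line shows one way the request could be sent.

```ruby
require 'json'
require 'net/http'

# Shortened copy of the sample JSON, inlined so no file is needed.
json_text = '{"name": "obscurite", "location": "NYC", "age": 100}'

# Steps 1-2: parse the JSON into a Ruby Hash.
json_data = JSON.parse(json_text)

# Steps 3-4: do the arithmetic and store the result back.
new_age = json_data['age'] / 2
json_data['age'] = new_age

# Step 5: build a POST request carrying the updated JSON.
uri = URI.parse("http://localhost:1234/")
header = { 'Content-Type' => 'application/json' }
request = Net::HTTP::Post.new(uri.request_uri, header)
request.body = json_data.to_json

puts request.body

# To actually send it (needs something listening on that port):
# response = Net::HTTP.start(uri.host, uri.port) { |http| http.request(request) }
```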

Ruby split and parse batched HTTP Response (multipart/mixed)

I'm using the GMail API that allows me to get a batched response of multiple Gmail objects.
This comes back in the form of a multipart/mixed HTTP response with a set of separate HTTP responses separated by a boundary as defined in the header.
Each of the HTTP sub-responses has a body in JSON format.
i.e.
result.response.response_headers = {...
"content-type"=>"multipart/mixed; boundary=batch_abcdefg"...
}
result.response.body = "----batch_abcdefg
<the response header>
{some JSON}
--batch_abcdefg
<another response header>
{some JSON}
--batch_abcdefg--"
Is there a library or an easy way to convert those responses from the string into a set of separate HTTP responses or JSON objects?
Thanks to Tholle above...
def parse_batch_response(response, json=true)
  # Not the same delimiter in the response as we specify ourselves in the request,
  # so we have to extract it.
  # This should give us exactly what we need.
  delimiter = response.split("\r\n")[0].strip
  parts = response.split(delimiter)
  # The first part will always be an empty string. Just remove it.
  parts.shift
  # The last part will be the "--". Just remove it.
  parts.pop
  if json
    # collects the response body as json
    results = parts.map { |part| JSON.parse(part.match(/{.+}/m).to_s) }
  else
    # returns the separate responses as strings so you can do something with them
    # e.g. you need the response codes
    results = parts
  end
  results
end
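As a quick check of the splitting approach, here is a sketch that runs the same steps against a hand-made body. SAMPLE_BODY is invented for illustration (it is not real Gmail output) and assumes the parts are separated by "\r\n" and that each part contains exactly one JSON object; split_batch_body is just a compact restatement of the JSON branch above.

```ruby
require 'json'

# Compact restatement of the JSON branch of parse_batch_response.
def split_batch_body(body)
  # The first line of the body is the delimiter actually used.
  delimiter = body.split("\r\n").first.strip
  parts = body.split(delimiter)
  parts.shift # leading empty string before the first delimiter
  parts.pop   # trailing "--" after the closing delimiter
  # Grab everything from the first "{" to the last "}" in each part.
  parts.map { |part| JSON.parse(part[/\{.+\}/m]) }
end

# Invented sample body, not real API output.
SAMPLE_BODY = [
  "--batch_abcdefg",
  "Content-Type: application/http",
  "",
  "HTTP/1.1 200 OK",
  "Content-Type: application/json",
  "",
  '{"id": "m1", "snippet": "first"}',
  "--batch_abcdefg",
  "Content-Type: application/http",
  "",
  "HTTP/1.1 200 OK",
  "Content-Type: application/json",
  "",
  '{"id": "m2", "snippet": "second"}',
  "--batch_abcdefg--"
].join("\r\n")

messages = split_batch_body(SAMPLE_BODY)
messages.each { |m| puts "#{m['id']}: #{m['snippet']}" }
```

Note that the regular expression would misbehave on a part with no JSON body at all (for example an error response with an empty body), so real code would probably want to guard against a nil match.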

Ruby Sinatra and JSON objects from toodledo API 2.0

I have a small problem with receiving JSON objects. I'm using Ruby 1.9.3 and my goal is to receive my tasks from an API via RestClient and print them more or less pretty onto the page.
I created a route /test:
get '/test' do
  json_ip_url = "http://api.toodledo.com/2/tasks/get.php?key=198196ae24792467eec09ac2191*****;modafter=1234567890;fields=folder,star,priority"
  ip_details = RestClient.get(json_ip_url)
  test = JSON.pretty_generate(ip_details) # => throws exception
end
The JSON#pretty_generate line throws an error, "only generation of JSON objects or arrays allowed". What am I doing wrong here?
Update:
I am now able to output via pretty_generate, but what do I have to do to get at its elements? Here is the JSON data; it looks to me like an Array with Objects inside it:
[{"num":"18","total":"18"},{"id":"11980343","title":"Add some items to your todo list","modified":1391670256,"completed":0,"folder":"0","star":"0"},{"id":"11980345","title":"Visit the Settings section and configure your account","modified":1391670256,"completed":0,"folder":"0","star":"0"},{"id":"11980347","title":"Watch our tutorial videos in the Help section","modified":1391670256,"completed":0,"folder":"0","star":"1"},{"id":"12607789","title":"test","modified":1392285802,"completed":0,"folder":"0","star":"0"},{"id":"12636039","title":"My Task","modified":1392308705,"completed":0,"folder":"0","star":"0"},{"id":"12636041","title":"Another","modified":1392308705,"completed":0,"folder":"0","star":"1"},{"id":"12636143","title":"My Task","modified":1392308789,"completed":0,"folder":"0","star":"0"},{"id":"12636145","title":"Another","modified":1392308789,"completed":0,"folder":"0","star":"1"},{"id":"12636449","title":"My Task","modified":1392308950,"completed":0,"folder":"0","star":"0"},{"id":"12636451","title":"Another","modified":1392308950,"completed":0,"folder":"0","star":"1"},{"id":"12636621","title":"My Task","modified":1392309061,"completed":0,"folder":"0","star":"0"},{"id":"12636623","title":"Another","modified":1392309061,"completed":0,"folder":"0","star":"1"},{"id":"12636665","title":"My Task","modified":1392309085,"completed":0,"folder":"0","star":"0"},{"id":"12636667","title":"Another","modified":1392309085,"completed":0,"folder":"0","star":"1"},{"id":"12636733","title":"My Task","modified":1392309137,"completed":0,"folder":"0","star":"0"},{"id":"12636735","title":"Another","modified":1392309137,"completed":0,"folder":"0","star":"1"},{"id":"12637135","title":"My Task","modified":1392309501,"completed":0,"folder":"0","star":"0"},{"id":"12637137","title":"Another","modified":1392309501,"completed":0,"folder":"0","star":"1"}]
The Code I used for pretty_generate:
get '/save' do
  jdata = params[:data]
  response = RestClient.get 'http://api.toodledo.com/2/tasks/get.php?key=da21e24e2a00ba9d45008974aed00***;modafter=1234567890;fields=folder,star,priority', {:accept => :json}
  test = JSON.parse(response)
  test.to_json
  output = JSON.pretty_generate(test)
  puts output
end
RestClient#get returns the raw response as a string (and not a hash or array) when called without a block, so ip_details isn't a structure that JSON#pretty_generate knows how to handle. You need to use JSON#parse to turn the response into a hash or array first.
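For the follow-up about getting at the elements: once parsed, the data is an ordinary Ruby Array of Hashes. A small sketch, using a trimmed copy of the data from the question as a string in place of the RestClient response (the "num" and "total" values are adjusted to match the trimmed list):

```ruby
require 'json'

# Trimmed copy of the response from the question, standing in for
# the string that RestClient.get would return.
raw = '[{"num":"2","total":"2"},' \
      '{"id":"11980343","title":"Add some items to your todo list","completed":0,"star":"0"},' \
      '{"id":"11980347","title":"Watch our tutorial videos in the Help section","completed":0,"star":"1"}]'

# Parse first; pretty_generate needs a Hash or Array, not a String.
data = JSON.parse(raw)
puts JSON.pretty_generate(data)

# The first element is a summary; the rest are the tasks.
summary, *tasks = data
puts "total tasks: #{summary['total']}"

# Each task is a Hash, so fields are read by key.
tasks.each { |task| puts "#{task['id']}: #{task['title']}" }

# Example: titles of starred tasks only.
starred_titles = tasks.select { |t| t["star"] == "1" }.map { |t| t["title"] }
```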

Ruby script for posting comments

I have been trying to write a script that lets me post comments from the command line. (The sole reason I want to do this is that it's vacation time here and I want to kill time.)
I often visit and post on this site, so I am starting with this site only.
For example, to comment on this post I used the following script:
require "uri"
require 'net/http'
def comment()
response = Net::HTTP.post_form(URI.parse("http://www.geeksforgeeks.org/wp-comments-post.php"),{'author'=>"pikachu",'email'=>"saurabh8c#gmail.com",'url'=>"geekinessthecoolway.blogspot.com",'submit'=>"Have Your Say",'comment_post_ID'=>"18215",'comment_parent'=>"0",'akismet_comment_nonce'=>"70e83407c8",'bb2_screener_'=>"1330701851 117.199.148.101",'comment'=>"How can we generalize this for a n-ary tree?"})
return response.body
end
puts comment()
Obviously the values were not hardcoded originally, but for the sake of clarity and to keep the post focused I am hardcoding them here.
Besides the regular fields that appear on the form, I found the values for the hidden fields with Wireshark when I posted a comment the normal way. I can't figure out what I am missing. Maybe some JS event?
Edit:
As a few people suggested using mechanize, I switched to Python. Now my updated code looks like:
import sys
import mechanize
uri = "http://www.geeksforgeeks.org/"
request = mechanize.Request(mechanize.urljoin(uri, "archives/18215"))
response = mechanize.urlopen(request)
forms = mechanize.ParseResponse(response, backwards_compat=False)
response.close()
form=forms[0]
print form
control = form.find_control("comment")
#control=form.find_control("bb2_screener")
print control.disabled
# ...or readonly
print control.readonly
# readonly and disabled attributes can be assigned to
#control.disabled = False
form.set_all_readonly(False)
form["author"]="Bulbasaur"
form["email"]="ashKetchup#gmail.com"
form["url"]="9gag.com"
form["comment"]="Y u no put a captcha?"
form["submit"]="Have Your Say"
form["comment_post_ID"]="18215"
form["comment_parent"]="0"
form["akismet_comment_nonce"]="d48e588090"
#form["bb2_screener_"]="1330787192 117.199.144.174"
request2 = form.click()
print request2
try:
    response2 = mechanize.urlopen(request2)
except mechanize.HTTPError, response2:
    pass
# headers
for name, value in response2.info().items():
    if name != "date":
        print "%s: %s" % (name.title(), value)
print response2.read() # body
response2.close()
Now the server returns me this. On going through the HTML of the original page, I found there is one more field, bb2_screener, that I need to fill in if I want to look like a browser to the server. But the problem is the field is not written inside the tag, so mechanize won't treat it as a field.
Assuming you have all the params correct, you're still missing the session information that the site stores in a cookie. Consider using something like mechanize, that'll deal with the cookies for you. It's also more natural in that you tell it which fields to fill in with which data. If that still doesn't work, you can always use a jackhammer like selenium, but then technically you're using a browser.
