EventMachine read and write files in chunks - ruby

I'm using EventMachine and EM-Synchrony in a REST API server. When I receive a POST request with a large binary file in the body, I receive it in chunks and write those chunks to a Tempfile, without blocking the reactor.
Then, at some point, I need to read this file in chunks and write those chunks to a definitive file. This works, but it blocks the reactor as expected, and I can't find a way to make it work without blocking.
I call this function at some point, passing it the tempfile and the new file name:
def self.save(tmp_file, new_file)
  tmp = File.open(tmp_file, "rb")
  newf = File.open(new_file, "wb")
  md5 = Digest::MD5.new
  each_chunk(tmp, CHUNKSIZE) do |chunk|
    newf << chunk
    md5.update chunk
  end
  md5.hexdigest
end

def self.each_chunk(file, chunk_size=1024)
  yield file.read(chunk_size) until file.eof?
end
I've been reading other similar questions here on Stack Overflow and trying to use EM#next_tick, which is perhaps the solution (I don't have much EM experience), but I can't get it to work; perhaps I'm placing it in the wrong places.
I've also tried EM#defer, but I need the function to wait for the read/write process to complete before it returns the MD5, because in my main file, after calling this function, I do a database update with the return value.
If someone can help me with this I would be grateful.
EDIT 1
I need the save function to return only after the file read/write has completed, because in the caller I'm waiting for the final MD5 value, something like this:
def copy_and_update(...)
  checksum = SomeModule.save(temp_file, new_file)
  do_database_update({:checksum => checksum}) # only with the final md5 value
end

You need to inject something in there to break it up:
def self.each_chunk(file, chunk_size=1024)
  chunk_handler = lambda {
    unless (file.eof?)
      yield file.read(chunk_size)
      EM.next_tick(&chunk_handler)
    end
  }
  EM.next_tick(&chunk_handler)
end
It's kind of messy to do it this way, but such is asynchronous programming.
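For the EDIT (getting the md5 back to the caller), an alternative sketch: keep the blocking copy in a plain method and hand it to EM.defer, which runs it on a thread pool and delivers the result to a callback on the reactor thread; since you use EM-Synchrony, that callback can resume the calling fiber so save appears synchronous. The copy helper itself needs no EventMachine at all. Names and the chunk size below are assumptions, not your actual code:

```ruby
require "digest"

CHUNK_SIZE = 64 * 1024 # assumed; use whatever CHUNKSIZE you already have

# Blocking chunked copy that also computes the MD5 of the copied data.
# Safe to run on EM.defer's thread pool because it never touches the reactor.
def chunked_copy(src_path, dst_path, chunk_size = CHUNK_SIZE)
  md5 = Digest::MD5.new
  File.open(src_path, "rb") do |src|
    File.open(dst_path, "wb") do |dst|
      until src.eof?
        chunk = src.read(chunk_size)
        dst.write(chunk)
        md5.update(chunk)
      end
    end
  end
  md5.hexdigest
end
```

Inside the reactor, something along the lines of `fiber = Fiber.current; EM.defer(proc { chunked_copy(tmp, dest) }, proc { |sum| fiber.resume(sum) }); Fiber.yield` in save would hand the digest straight back to copy_and_update without blocking the reactor. This is a sketch under those assumptions, not tested against your app.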

Related

What's the correct way to read an array from a csv field in Ruby?

I am trying to save some class objects to a CSV file, and everything mostly works: I can save to and read back from the CSV file. There is only a 'minor' problem with an attribute that is an Array of Strings.
When I save it to the file it appears like this: "[""Dan Brown""]"
CSV.open('documents.csv', "w") do |csv|
  csv << %w[ISBN Titre Auteurs Type Disponibilité]
  @docs.each { |doc|
    csv << [doc.isbn, doc.titre, doc.auteurs, doc.type, doc.empruntable ? "Disponible" : "Emprunté"]
  }
end
And when I try to extract the data from the file I end up with something like this: ["[\"Dan Brown\"]"].
table = CSV.parse(File.read("documents.csv"), headers: true)
table.each do |row|
  doc = Document.new(row['Titre'], row['ISBN'], row['Type'])
  doc.auteurs << row['Auteurs'] # This is the array where there is a 'problem'
  if row['Disponibilité'] == "Disponible"
    doc.empruntable = true
  else
    doc.empruntable = false
  end
  @docs.push(doc) # this is the array where I save my objects
end
I have tried many things to solve this, without any luck. I would be thankful if you could help me find a solution.
Since a CSV file, by its nature, contains only strings in its fields, not arrays or other data types, the CSV class applies the to_s method to your objects to turn them into strings before putting them into the CSV.
When you later read them back, you get just that: the string representation of what once was your array. The only one who knows that 'Auteurs' should end up as an array of strings is the application, i.e. you.
Hence, on reading the CSV, after extracting the auteurs string, you need to convert it back to an Array manually, because there is no automatic "inverse method" to reverse the to_s.
A cheap but dangerous way to do it is to use eval, which indeed would reconstruct your array. However, you need to be sure that nobody has had a chance to fiddle with the CSV data, because eval allows sneaking in arbitrary code.
A safer way would be to either write your own conversion function to and from String representation, or use a format such as YAML or JSON for representing the Array as String, instead of using to_s.
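For example, JSON-encoding just the array column on write and decoding it on read round-trips cleanly. A minimal sketch (the sample data is made up; the column name matches the question):

```ruby
require "csv"
require "json"

# Write: encode the Auteurs array as a JSON string in its CSV field.
csv_text = CSV.generate(write_headers: true, headers: %w[ISBN Titre Auteurs]) do |csv|
  csv << ["123", "Inferno", ["Dan Brown", "Someone Else"].to_json]
end

# Read: decode the field back into a real Array of Strings.
rows = CSV.parse(csv_text, headers: true)
auteurs = JSON.parse(rows.first["Auteurs"])
# auteurs is now a real Array, not a stringified one
```

The same idea works with YAML (`to_yaml` / `YAML.safe_load`) if you prefer it over JSON.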

rake db:seed not working to seed from API in Ruby CLI app - will seed manually written data - Ruby/ActiveRecord

I'm trying to make improvements to a project for school (super beginner) using seeded data from an API to make a CLI app with Ruby and ActiveRecord, no Rails. I have had to kind of "cheat" the data by taking it (a hash of object IDs), appending each ID to the end of another URL (creating an array of these links), then iterating over each one, making a GET request, and putting the results into a final hash, which I iterate over and seed into my database.
I was able to do it successfully once, but when I wanted to expand the data set, I cleared the db, went to re-seed, and it no longer works. It hangs for quite a bit, then seems to complete, but the data isn't there. The only change I made in my code was to the URL, but even when I change it back it no longer works. However, it does seed anything I've manually written. The URL works fine in my browser. I tried rake db:migrate:reset but that didn't seem to work for me.
I apologize if my code is a bit messy; I'm just trying to get to the bottom of this issue, and it's my first time working with APIs / creating a project like this. I appreciate any help. Thanks!
def object_id_joiner
  response = RestClient.get("https://collectionapi.metmuseum.org/public/collection/v1/search?departmentId=11&15&19&21&6q=*")
  metData = JSON.parse(response)
  url = "https://collectionapi.metmuseum.org/public/collection/v1/objects/"
  urlArray = []
  metData["objectIDs"].each do |e|
    urlArray.push(url.to_s + e.to_s)
  end
  # urlArray.slice!(0,2)
  urlArray
end
object_id_joiner
def finalHash
  finalHash = []
  object_id_joiner.each do |e|
    response = RestClient.get(e)
    data = JSON.parse(response)
    finalHash.push(data)
  end
  finalHash
end
finalHash
finalHash.each do |artist_hash|
  if artist_hash["artistDisplayName"] == nil
    next
  end
  if (!artist_hash["artistDisplayName"])
    art1 = Artist.create(artist_name: artist_hash["artistDisplayName"])
  else
    next
  end
  if (!artist_hash["objectID"])
    Artwork.create(title: artist_hash["title"], image: artist_hash["primaryImage"], department: artist_hash["department"], artist: art1, object_id: artist_hash["objectID"])
  else
    next
  end
end
As mentioned in the comments, you had some rogue ! operators in your code.
Here is a simpler version of your last loop:
finalHash.each do |artist_hash|
  next if artist_hash["artistDisplayName"] == nil
  # Now you don't need a conditional for artistDisplayName
  art1 = Artist.create(artist_name: artist_hash["artistDisplayName"])
  # Now create the artwork only if you HAVE an objectID
  if (artist_hash["objectID"])
    Artwork.create(title: artist_hash["title"], image: artist_hash["primaryImage"], department: artist_hash["department"], artist: art1, object_id: artist_hash["objectID"])
  end
end
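To see why the rogue ! made seeding fail silently: in Ruby, ! turns a present (truthy) value into false, so the create branch only ran when the field was missing, and every real record hit the else/next. A tiny illustration with a made-up hash:

```ruby
artist_hash = {"artistDisplayName" => "Vermeer", "objectID" => 42}

# With the rogue bang, a PRESENT name is falsy, so the create branch is skipped:
with_bang = !artist_hash["artistDisplayName"] ? :created : :skipped

# Without it, a present name is truthy and the record would be created:
without_bang = artist_hash["artistDisplayName"] ? :created : :skipped
```

So the loop looked like it ran to completion while never creating a single Artist or Artwork.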

How to use get_object in ruby for AWS?

I am very new to Ruby. I am able to connect to AWS S3 from Ruby, using the following code:
filePath = '/TMEventLogs/stable/DeviceWiFi/20160803/1.0/20160803063600-2f9aa901-2ce7-4932-aafd-f7286cdb9871.csv'
s3.get_object({bucket: "analyticspoc", key: "TMEventLogs/stable/DeviceWiFi/20160803/1.0/"}, target: filePath) do |chunk|
  puts "1"
end
In the above code, s3 is the client and "analyticspoc" is the root bucket. The full path to my CSV file is: All Buckets /analyticspoc/TMEventLogs/stable/DeviceWiFi/20160803/1.0/20160803063600-2f9aa901-2ce7-4932-aafd-f7286cdb9871.csv.
With the above code I was getting the error: Error getting objects: [Aws::S3::Errors::NoSuchKey] - The specified key does not exist. I want to read the contents of the file. How do I do that? Please tell me what the mistake in the above code is.
Got the answer. You can use list_objects to fetch the array of file names in chunks (1,000 at a time), whereas get_object is used for accessing the content of a single file, as follows:
BUCKET = "analyticspoc"
path = "TMEventLogs/stable/DeviceWiFi/20160803/1.0/"
s3.list_objects(bucket: BUCKET, prefix: path).each do |response|
  contents = response.contents
end
file_name = "TMEventLogs/stable/DeviceWiFi/20160803/1.0/012121212121"
response = s3.get_object(bucket: BUCKET, key: file_name)
As far as I can tell, you're passing the arguments incorrectly. It should be a single options hash, according to the documentation for get_object:
s3.get_object(
  bucket: "analyticspoc",
  key: "TMEventLogs/stable/DeviceWiFi/20160803/1.0/",
  target: filePath
) do |chunk|
  puts "1"
end
I believe it was trying to use your hash as a string key, which is obviously not going to work.
In Ruby, the curly braces { } are only necessary in method calls if additional arguments follow that need to be in another hash, or if the arguments are non-hash in nature. This makes the syntax a lot less ugly in most cases, where options are deliberately last (and sometimes also first, by virtue of being the only argument).

Check headers before downloading with Net::HTTP::Pipeline

I am trying to parse a list of image URLs and get some basic information before I actually commit to downloading:
Is the image there? (solved with response.code)
Do I have the image already? (want to look at type and size)
My script will check a large list every day (about 1300 rows), and each row has 30-40 image URLs. My @photo_urls variable lets me keep track of what I have already downloaded. I would really like to be able to use it later as a hash (instead of an array, as in my example code) to iterate through and do the actual downloading.
Right now my problem (besides being a Ruby newbie) is that Net::HTTP::Pipeline only accepts an array of Net::HTTPRequest objects. The documentation for net-http-pipeline indicates that response objects come back in the same order as the corresponding request objects that went in. The problem is that I have no way to correlate a request to its response other than that order, and I don't know how to get the relative ordinal position inside a block. I assume I could just keep a counter variable, but how would I access a hash by ordinal position?
Net::HTTP.start uri.host do |http|
  # Init HTTP requests hash
  requests = {}
  photo_urls.each do |photo_url|
    # make sure we don't process the same image again.
    hashed = Digest::SHA1.hexdigest(photo_url)
    next if @photo_urls.include? hashed
    @photo_urls << hashed
    # change user agent and store in hash
    my_uri = URI.parse(photo_url)
    request = Net::HTTP::Head.new(my_uri.path)
    request.initialize_http_header({"User-Agent" => "My Downloader"})
    requests[hashed] = request
  end
  # process requests (send array of values - ie. requests) in a pipeline.
  http.pipeline requests.values do |response|
    if response.code == "200"
      # any way to reference the hash here so I can decide whether
      # I want to do anything later?
    end
  end
end
Finally, if there is an easier way of doing this, please feel free to offer any suggestions.
Thanks!
Make requests an array instead of a hash and pop off the requests as the responses come in:
Net::HTTP.start uri.host do |http|
  # Init HTTP requests array
  requests = []
  photo_urls.each do |photo_url|
    # make sure we don't process the same image again.
    hashed = Digest::SHA1.hexdigest(photo_url)
    next if @photo_urls.include? hashed
    @photo_urls << hashed
    # change user agent and store in the array
    my_uri = URI.parse(photo_url)
    request = Net::HTTP::Head.new(my_uri.path)
    request.initialize_http_header({"User-Agent" => "My Downloader"})
    requests << request
  end
  # process requests in a pipeline; responses come back in request order.
  http.pipeline requests.dup do |response|
    request = requests.shift
    if response.code == "200"
      # Do whatever checking with request
    end
  end
end
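The correlation trick is easy to see without any networking: pipelined responses arrive in the same order as the requests, so shifting the front of the request queue as each response comes in pairs them up. A toy simulation of just that bookkeeping (no real HTTP involved):

```ruby
# Simulated pipeline: responses arrive in the same order as requests.
requests  = ["req-a", "req-b", "req-c"]
responses = requests.map { |r| r.sub("req", "resp") }

pairs = []
queue = requests.dup           # the pipeline consumes a copy, as with requests.dup above
responses.each do |response|
  request = queue.shift        # front of the queue is this response's request
  pairs << [request, response]
end
# pairs => [["req-a", "resp-a"], ["req-b", "resp-b"], ["req-c", "resp-c"]]
```

If you still want the SHA1-keyed hash afterwards, you can shift from an array of [hashed, request] pairs instead, so each response block also knows its digest.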

Write Hash to CSV file and then read the values back to form a hash

I am banging my head trying to resolve an issue with one of my latest projects. Here is the scenario:
I am making a call to the GoToWebinar API to fetch upcoming webinars. Everything works fine, and the webinars are fetched in the form of a hash, like this:
[
  {
    "webinarKey":5303085652037254656,
    "subject":"Test+Webinar+One",
    "description":"Test+Webinar+One+Description",
    "times":[{"startTime":"2011-04-26T17:00:00Z","endTime":"2011-04-26T18:00:00Z"}]
  },
  {
    "webinarKey":9068582024170238208,
    "name":"Test+Webinar+Two",
    "description":"Test Webinar Two Description",
    "times":[{"startTime":"2011-04-26T17:00:00Z","endTime":"2011-04-26T18:00:00Z"}]
  }
]
I have created a rake task, which we are going to run once a day, to populate the CSV file with this hash; the CSV file is then read in the controller action to populate the views.
Here is my code to populate the CSV file:
g = GoToWebinar::API.new()
@all_webinars = g.get_upcoming_webinars
CSV.open("#{Rails.root.to_s}/public/upcoming_webinars.csv", "wb") do |csv|
  @all_webinars.each do |webinar|
    webinar.to_a.each { |elem| csv << elem }
  end
end
I need some help figuring out a way to save the information (received as hashes) to the CSV file such that the order is preserved, and a way to read the information back from the CSV file so that it populates the hash in the controller action in the very same way.
You want to use the keys of the hash (since they are constant) as the headers for your CSV file. Then push each element on as you are doing.
g = GoToWebinar::API.new()
@all_webinars = g.get_upcoming_webinars
headers = @all_webinars.first.keys
CSV.open("#{Rails.root.to_s}/public/upcoming_webinars.csv", "wb", headers: headers, write_headers: true) do |csv|
  @all_webinars.each do |webinar|
    csv << webinar.values_at(*headers)
  end
end
You are going to want to make sure, however, that any data inside the hash values is flattened. That hash inside of an array for times needs to be dealt with (perhaps just remove times and have a startTime and endTime key in the hash).
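Putting both points together, here is a round-trip sketch: headers come from the first hash, each row is values_at over those headers, and the nested times array is JSON-encoded so it survives the trip. The sample data matches the question; note that CSV reads everything back as Strings, so numeric fields need converting:

```ruby
require "csv"
require "json"

webinars = [
  {"webinarKey" => 5303085652037254656, "subject" => "Test+Webinar+One",
   "times" => [{"startTime" => "2011-04-26T17:00:00Z", "endTime" => "2011-04-26T18:00:00Z"}]}
]

# Write: header row from the hash keys, nested array flattened to JSON.
headers = webinars.first.keys
csv_text = CSV.generate(write_headers: true, headers: headers) do |csv|
  webinars.each do |w|
    csv << headers.map { |h| h == "times" ? w[h].to_json : w[h] }
  end
end

# Read: rebuild each hash, decoding the JSON column and re-typing the key.
restored = CSV.parse(csv_text, headers: true).map do |row|
  h = row.to_h
  h["times"] = JSON.parse(h["times"])
  h["webinarKey"] = h["webinarKey"].to_i  # CSV fields come back as Strings
  h
end
```

Writing to a file instead of a string just means swapping CSV.generate for the CSV.open call from the answer above.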
From what I have learnt from all the examples and the work done to accomplish this, I think the best way to implement this type of functionality is to create a rake task that populates the database with the information, and then use the saved information to populate the views.
