Efficient read of EXIF meta-data from remote images using Ruby

I have a few thousand high-res JPEG photos stored on a travel blog website, and I'm hoping to write some Ruby code that will extract a few key EXIF meta-data values from the images without downloading the entire contents of each image file (they are large, and I have a LOT of them).
I'm using the 'exifr' gem to read the EXIF data, and it is designed to work with any type of IO object, not just local files. However, the Net::HTTPResponse object isn't really an IO object, although it does allow for incremental reading if you pass the read_body method a block. I've read conflicting reports, however, about whether this incremental reading really lets you download only a portion of a file, or whether it just lets you read the contents in chunks for efficiency (i.e. the entire contents are downloaded anyway).
So, is what I'm trying to do possible? Should I be looking at alternatives to Net::HTTP, or is there some way for me to get at the low-level TCP socket (which should be an IO object) to pass to the 'exifr' code to read just enough of the image to get the EXIF data? Other solutions?
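One avenue worth testing before anything fancier: if the server honours HTTP Range requests, you can fetch just the first chunk of each file and hand it to exifr as an in-memory IO. A rough sketch follows; the exif_from_url helper and the 64 KB cutoff are assumptions of mine, not part of exifr, and servers that ignore Range will simply send the whole body back (check for a 206 status if you care).
require 'net/http'
require 'uri'
require 'stringio'
require 'exifr/jpeg'

# Hypothetical helper: ask the server for only the first `limit` bytes of the
# image and parse EXIF out of that partial body. If the EXIF segment sits
# beyond `limit` (rare, per the survey below), exifr will most likely raise
# and you can fall back to a full download.
def exif_from_url(url, limit = 65_536)
  uri = URI.parse(url)
  Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |http|
    request = Net::HTTP::Get.new(uri.request_uri)
    request['Range'] = "bytes=0-#{limit - 1}"
    response = http.request(request)
    EXIFR::JPEG.new(StringIO.new(response.body))
  end
end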

I generated a quick table of where, in my pile of photos, the EXIF data is stored:
$ find . -type f -exec grep -a -bo Exif {} \; > /tmp/exif
$ sort /tmp/exif | uniq -c | sort -n
1 12306:Exif
1 3271386:Exif
1 8210:Exif
1 8234:Exif
1 9234:Exif
2 10258:Exif
24 449:Exif
30 24:Exif
8975 6:Exif
$
The clear majority are just a few bytes into the file; a handful are scattered elsewhere, but the worst is only three megabytes into the file. (Give or take.)
I wrote a little test script that appears to do what is necessary for a single URL. (Tested by looking for the string AA in chunks of a huge binary file I had available.) This certainly isn't the prettiest program I've written, but it might be an adequate start to a solution. Note that if the Exif text spans two chunks, you're going to retrieve the entire file. That's unfortunate. I hope it doesn't happen often. The 66000 is there because the JPEG APP1 block is limited in size to 64 kilobytes, and grabbing a bit more is probably better than grabbing a bit less.
#!/usr/bin/ruby
require 'net/http'
require 'uri'

url = URI.parse("http://....")

catch(:done) do
  looking = true
  extra_size = 0
  File.open("/tmp/output", "w") do |f|
    Net::HTTP.start(url.host, url.port) do |http|
      request = Net::HTTP::Get.new(url.request_uri)
      http.request(request) do |resp|
        resp.read_body do |chunk|
          f.write chunk
          if looking
            # Keep reading until a chunk contains the Exif marker.
            looking = false if chunk.match(/Exif/)
          elsif extra_size < 66000
            # Read roughly one APP1 segment's worth (64 KB) past the marker.
            extra_size += chunk.length
          else
            # Enough data captured; abandon the rest of the download.
            throw :done
          end
        end
      end
    end
  end
end
puts "done"

Related

Compressing large string in ruby

I have a web application (Ruby on Rails) that sends some YAML as the value of a hidden input field.
Now I want to reduce the size of the text that is sent to the browser. What is the most efficient form of lossless compression that would send across minimal data? I'm OK with incurring the additional cost of compression and decompression on the server side.
You could use the zlib implementation in the Ruby core to deflate and inflate the data:
require "zlib"
data = "some long yaml string" * 100
compressed_data = Zlib::Deflate.deflate(data)
#=> "x\x9C+\xCE\xCFMU\xC8\xC9\xCFKW\xA8L\xCC\xCDQ(.)\xCA\xCCK/\x1E\x15\x1C\x15\x1C\x15\x1C\x15\x1C\x15\x1C\x15\x1C\x15\x1C\x15D\x15\x04\x00\xB3G%\xA6"
You should base64-encode the compressed data to make it printable:
require 'base64'
encoded_data = Base64.encode64 compressed_data
#=> "eJwrzs9NVcjJz0tXqEzMzVEoLinKzEsvHhUcFRwVHBUcFRwVHBUcFUQVBACz\nRyWm\n"
Later, on the client-side, you might use pako (a zlib port to javascript) to get your data back. This answer probably helps you with implementing the JS part.
To give you an idea on how effective this is, here are the sizes of the example strings:
data.size # 2100
compressed_data.size # 48
encoded_data.size # 66
Same thing goes vice-versa when compressing on the client and inflating on the server.
Zlib::Inflate.inflate(Base64.decode64(encoded_data))
#=> "some long yaml stringsome long yaml str ... (shortened, as the string is long :)
Disclaimer:
The ruby zlib implementation should be compatible with the pako implementation. But I have not tried it.
The numbers about string sizes are a little cheated. Zlib is really effective here, because the string repeats a lot. Real life data usually does not repeat as much.
If you are working on a Rails application, you can also use the ActiveSupport::Gzip wrapper that allows compression/decompression of strings with gzip.
compressed_log = ActiveSupport::Gzip.compress('large string')
=> "\x1F\x8B\b\x00yq5c\x00\x03..."
original_log = ActiveSupport::Gzip.decompress(compressed_log)
=> "large string"
Behind the scenes, the compress method uses the Zlib::GzipWriter class which writes gzipped files. Similarly, the decompress method uses Zlib::GzipReader class which reads a gzipped file.
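If the end goal is still the hidden-field case from the question, the gzip and Base64 steps combine naturally. A minimal round-trip sketch, assuming ActiveSupport is available (the YAML content here is made up):
require 'yaml'
require 'base64'
require 'active_support'
require 'active_support/gzip'

yaml = { 'trip' => 'Peru', 'photos' => 812 }.to_yaml

# Compress, then Base64-encode so the value is safe to embed in a hidden field.
field_value = Base64.strict_encode64(ActiveSupport::Gzip.compress(yaml))

# When the form comes back, reverse both steps.
restored = ActiveSupport::Gzip.decompress(Base64.strict_decode64(field_value))
restored == yaml #=> true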

dealing with large CSV files (20G) in ruby

I am working on a little problem and would like some advice on how to solve it:
Given a CSV file with an unknown number of columns and rows, output a list of columns with values and the number of times each value was repeated, without using any library.
If the file is small this shouldn't be a problem, but when it is a few gigs, I get NoMemoryError: failed to allocate memory. Is there a way to create a hash and read from the disk instead of loading the file into memory? You can do that in Perl with tied hashes.
EDIT: Will IO#foreach load the file into memory? How about File.open(filename).each?
Read the file one line at a time, discarding each line as you go:
open("big.csv") do |csv|
csv.each_line do |line|
values = line.split(",")
# process the values
end
end
Using this method, you should never run out of memory.
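To connect this back to the stated task (per-column value counts), the "process the values" step could look roughly like this. It's a sketch that assumes plain comma-separated fields with no quoted commas:
# column index => { value => number of occurrences }
counts = Hash.new { |h, col| h[col] = Hash.new(0) }

open("big.csv") do |csv|
  csv.each_line do |line|
    line.chomp.split(",").each_with_index do |value, col|
      counts[col][value] += 1
    end
  end
end

counts.each do |col, values|
  puts "Column #{col}:"
  values.each { |value, count| puts "  #{value}: #{count}" }
end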
Do you read the whole file at once? Reading it on a per-line basis, i.e. using ruby -pe, ruby -ne or $stdin.each should reduce the memory usage by garbage collecting lines which were processed.
data = {}
$stdin.each do |line|
  # Process line, store results in the data hash.
end
Save it as script.rb and pipe the huge CSV file into this script's standard input:
ruby script.rb < data.csv
If you don't feel like reading from the standard input, we'll need a small change:
data = {}
File.open("data.csv").each do |line|
  # Process line, store results in the data hash.
end
For future reference, in such cases you want to use CSV.foreach('big_file.csv', headers: true) { |row| ... }.
This reads the file row by row from the IO object with a minimal memory footprint (it should stay below 1 MB regardless of file size).
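A sketch of the same per-column tally using CSV.foreach, which also copes with quoted fields and lets you key the counts by header name ('big_file.csv' is a placeholder):
require 'csv'

# header name => { value => number of occurrences }
counts = Hash.new { |h, header| h[header] = Hash.new(0) }

CSV.foreach('big_file.csv', headers: true) do |row|
  row.each { |header, value| counts[header][value] += 1 }
end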

Ruby File.read vs. File.gets

If I want to append the contents of a src file into the end of a dest file in Ruby, is it better to use:
while line = src.gets do
or
while buffer = src.read( 1024 )
I have seen both used and was wondering when should I use each method and why?
One is for reading "lines", one is for reading n bytes.
While byte buffering might be faster, a lot of that may disappear into the OS which likely does buffering anyway. IMO it has more to do with the context of the read--do you want lines, or are you just shuffling chunks of data around?
That said, a performance test in your specific environment may be helpful when deciding.
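If you do want to measure it, a throwaway comparison is quick to put together; a sketch ('sample.dat' is a placeholder for a suitably large file of your own):
require 'benchmark'

Benchmark.bm(12) do |x|
  x.report("gets:") do
    File.open("sample.dat", "rb") { |f| nil while f.gets }
  end
  x.report("read 1024:") do
    File.open("sample.dat", "rb") { |f| nil while f.read(1024) }
  end
end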
You have a number of options when reading a file that are tailored to different situations.
Read in the file line-by-line, but only store one line at a time:
while (line = file.gets) do
  # ...
end
Read in all lines of a file at once:
file.readlines.each do |line|
  # ...
end
Read the file in as a series of blocks:
while (data = file.read(block_size))
  # ...
end
Read in the whole file at once:
data = file.read
It really depends on what kind of data you're working with. Generally read is better suited towards binary files, or those where you want it as one big string. gets and readlines are similar, but readlines is more convenient if you're confident the file will fit in memory. Don't do this on multi-gigabyte log files or you'll be in for a world of hurt as your system starts swapping. Use gets for situations like that.
gets will read until the end of the line based on a separator
read will read n bytes at a time
It all depends on what you are trying to read.
It may be more efficient to use read if your src file has unpredictable line lengths.
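For the concrete append use case in the question, a chunked copy sidesteps line lengths entirely. A sketch (the filenames are placeholders), first with IO.copy_stream and then with a hand-rolled read loop:
# Stream src onto the end of dest without holding either file in memory.
File.open("dest.txt", "ab") do |dest|
  File.open("src.txt", "rb") do |src|
    IO.copy_stream(src, dest)
  end
end

# Roughly equivalent hand-rolled version using fixed-size reads.
File.open("dest.txt", "ab") do |dest|
  File.open("src.txt", "rb") do |src|
    while (buffer = src.read(64 * 1024))
      dest.write(buffer)
    end
  end
end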

Ruby open returning a string instead of a file?

When trying to open() remote images, some return as StringIO and others return as File...how do I force the File?
data = open("http://graph.facebook.com/61700024/picture?type=square")
=> #<StringIO:0x007fd09b013948>
data = open("http://28.media.tumblr.com/avatar_7ef57cb42cb0_64.png")
=> #<StringIO:0x007fd098bf9490>
data = open("http://25.media.tumblr.com/avatar_279ec8ee3427_64.png")
=> #<File:/var/folders/_z/bb18gdw52ns0x5r8z9f2ncj40000gn/T/open-uri20120229-9190-mn52fu>
I'm using Paperclip to save remote images (which are stored in S3), so basically wanting to do:
user = User.new
user.avatar = open(url)
user.save
Open-URI has a 10 KB limit for StringIO objects; anything above that gets stored as a temp file.
One way to get past this is to change the constant that Open-URI uses as the StringIO limit. You can do this by setting the constant to 0:
OpenURI::Buffer.send :remove_const, 'StringMax' if OpenURI::Buffer.const_defined?('StringMax')
OpenURI::Buffer.const_set 'StringMax', 0
Add that to your initialiser and you should be good to go.
While steiger's solution is a simple, all-around fix, some of us might be put off by the "nasty hack" feel of it and the way it changes behaviour globally, including for other gems that might benefit from or depend on this feature of OpenURI. Of course, you could also use the above approach and then, when you're done, reset the constant back to its original value; because of the GIL you might get away with that sort of nastiness as well (though be sure to stay away from JRuby and threads then!).
Alternatively you could do something like this, which basically ensures that if you get a stream it's piped to a temp file:
require 'tempfile'
require 'mime/types' # from the mime-types gem

def write_stream_to_a_temp_file(stream)
  ext = begin
    "." + MIME::Types[stream.meta["content-type"]].first.extensions.first
  rescue # In case the meta data is not available.
    # It seems sometimes the content-type is binary/octet-stream;
    # in that case we should grab the original extension name.
    File.extname(stream.base_uri.path)
  end
  file = Tempfile.new(["temp", ext])
  begin
    file.binmode
    file.write stream.read
  ensure
    file.flush rescue nil
    file.close rescue nil
  end
  file
end
# and when you want to enforce that data must be a temp file then just...
data = write_stream_to_a_temp_file data unless data.is_a? Tempfile

How do I wrap ruby IO with a sliding window filter

I'm using an opaque API in some ruby code which takes a File/IO as a parameter. I want to be able to pass it an IO object that only gives access to a given range of data in the real IO object.
For example, I have a 8GB file, and I want to give the api an IO object that has a 1GB range within the middle of my real file.
real_file = File.new('my-big-file')
offset = 1 * 2**30 # start 1 GB into it
length = 1 * 2**30 # end 1 GB after start
filter = IOFilter.new(real_file, offset, length)
# The api only sees the 1GB of data in the middle
opaque_api(filter)
The filter_io project looks like it would be the easiest to adapt to do this, but doesn't seem to support this use case directly.
I think you would have to write it yourself, as it seems like a rather specific thing: you would have to implement all (or at least the subset you need) of IO's methods using a chunk of the opened file as the data source. An example of the "speciality" would be writing to such a stream: you would have to take care not to cross the boundary of the given segment, i.e. constantly keep track of your current position in the big file. Doesn't seem like a trivial job, and I don't see any shortcuts that could help you there.
Perhaps you can find some OS-based solution, e.g. making a loopback device out of the part of the large file (see man losetup and particularly -o and --sizelimit options, for example).
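To give a feel for what "write it yourself" involves, here is a rough read-only sketch of the IOFilter from the question. It only implements #read, #seek and #pos, so it will only satisfy an API that sticks to those calls, and write support would need the boundary bookkeeping described above:
class IOFilter
  def initialize(io, offset, length)
    @io = io
    @offset = offset
    @length = length
    @pos = 0 # position relative to the start of the window
  end

  # Read up to `bytes` bytes, never crossing the end of the window.
  def read(bytes = nil)
    remaining = @length - @pos
    return (bytes ? nil : "") if remaining <= 0
    bytes = remaining if bytes.nil? || bytes > remaining
    @io.seek(@offset + @pos)
    data = @io.read(bytes)
    @pos += data.bytesize if data
    data
  end

  def seek(amount, whence = IO::SEEK_SET)
    @pos = (whence == IO::SEEK_CUR ? @pos + amount : amount)
    0
  end

  def pos
    @pos
  end
end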
Variant 2:
If you are ok with keeping the contents of the window in memory all the time, you may wrap StringIO like this (just a sketch, not tested):
require 'stringio'

def sliding_io(filename, offset, length)
  File.open(filename, 'r+') do |f|
    # read the window into a buffer
    f.seek(offset)
    buf = f.read(length)
    # wrap the buffer in a StringIO and pass it to the given block
    StringIO.open(buf) do |buf_io|
      yield(buf_io)
    end
    # write the altered buffer back to the big file
    f.seek(offset)
    f.write(buf[0, length])
  end
end
And use it as you would use block variant of IO#open.
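Hypothetical usage, matching the example from the question (opaque_api is the placeholder from above):
sliding_io('my-big-file', 1 * 2**30, 1 * 2**30) do |window|
  opaque_api(window)
end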
I believe the IO object has the functionality you are looking for. I've used it before for MD5 hash summing similarly sized files.
require 'digest'

incr_digest = Digest::MD5.new
File.open(filename, 'rb') do |io|
  while chunk = io.read(50000)
    incr_digest << chunk
  end
end
This was the block I used, where I was passing the chunk to the MD5 Digest object.
http://www.ruby-doc.org/core/classes/IO.html#M000918
