Compressing a large string in Ruby

I have a web application (Ruby on Rails) that sends some YAML as the value of a hidden input field.
Now I want to reduce the size of the text that is sent to the browser. What is the most efficient form of lossless compression that would send the minimal amount of data? I'm OK with incurring the additional cost of compression and decompression on the server side.

You could use the zlib implementation in Ruby's standard library to deflate/inflate the data:
require "zlib"
data = "some long yaml string" * 100
compressed_data = Zlib::Deflate.deflate(data)
#=> "x\x9C+\xCE\xCFMU\xC8\xC9\xCFKW\xA8L\xCC\xCDQ(.)\xCA\xCCK/\x1E\x15\x1C\x15\x1C\x15\x1C\x15\x1C\x15\x1C\x15\x1C\x15\x1C\x15D\x15\x04\x00\xB3G%\xA6"
You should base64-encode the compressed data to make it printable:
require 'base64'
encoded_data = Base64.encode64 compressed_data
#=> "eJwrzs9NVcjJz0tXqEzMzVEoLinKzEsvHhUcFRwVHBUcFRwVHBUcFUQVBACz\nRyWm\n"
Later, on the client side, you might use pako (a zlib port to JavaScript) to get your data back. This answer probably helps you with implementing the JS part.
To give you an idea of how effective this is, here are the sizes of the example strings:
data.size # 2100
compressed_data.size # 48
encoded_data.size # 66
The same thing works in reverse when compressing on the client and inflating on the server:
Zlib::Inflate.inflate(Base64.decode64(encoded_data))
#=> "some long yaml stringsome long yaml str ... (shortened, as the string is long :)
Disclaimer:
The Ruby zlib implementation should be compatible with the pako implementation, but I have not tried it.
The numbers about string sizes are a little flattering. Zlib is extremely effective here because the string repeats a lot; real-life data usually does not repeat as much.

If you are working on a Rails application, you can also use the ActiveSupport::Gzip wrapper that allows compression/decompression of strings with gzip.
compressed_log = ActiveSupport::Gzip.compress('large string')
=> "\x1F\x8B\b\x00yq5c\x00\x03..."
original_log = ActiveSupport::Gzip.decompress(compressed_log)
=> "large string"
Behind the scenes, the compress method uses the Zlib::GzipWriter class, which writes gzipped data. Similarly, the decompress method uses the Zlib::GzipReader class, which reads gzipped data.
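For the hidden-field use case from the original question, note that gzipped output is binary, so you would still Base64-encode it before embedding it in HTML. A rough sketch of that round trip; the :payload field name and the sample YAML are made up for illustration, and in a Rails app ActiveSupport is already loaded:
require 'active_support/gzip' # only needed outside Rails
require 'base64'
require 'yaml'
yaml = { 'some' => 'data' }.to_yaml
# compress, then make the binary result safe to put in an HTML attribute
encoded = Base64.strict_encode64(ActiveSupport::Gzip.compress(yaml))
# e.g. hidden_field_tag :payload, encoded
# later, when the form comes back to the server
restored = ActiveSupport::Gzip.decompress(Base64.strict_decode64(encoded))
restored == yaml #=> true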

Related

How to read a large file into a string

I'm trying to save and load the states of Matrices (using Matrix) during the execution of my program with the functions dump and load from Marshal. I can serialize the matrix and get a ~275 KB file, but when I try to load it back as a string to deserialize it into an object, Ruby gives me only the beginning of it.
# when I want to save
mat_dump = Marshal.dump(@mat) # serialize object - OK
File.open('mat_save', 'w') { |f| f.write(mat_dump) } # write String to file - OK
# somewhere else in the code
mat_dump = File.read('mat_save') # read String from file - only reads like 5%
@mat = Marshal.load(mat_dump) # deserialize object - "ArgumentError: marshal data too short"
I tried changing the arguments to load but haven't found anything yet that doesn't raise an error.
How can I load the entire file into memory? If I could read the file chunk by chunk, loop to accumulate it into a String and then deserialize, that would work too. The file is basically one big line, so reading it line by line doesn't help; the problem stays the same.
I saw some questions about the topic:
"Ruby serialize array and deserialize back"
"What's a reasonable way to read an entire text file as a single string?"
"How to read whole file in Ruby?"
but none of them seem to have the answers I'm looking for.
Marshal is a binary format, so you need to read and write in binary mode. The easiest way is to use IO.binread/write.
...
IO.binwrite('mat_save', mat_dump)
...
mat_dump = IO.binread('mat_save')
@mat = Marshal.load(mat_dump)
Remember that marshaling is Ruby-version dependent; it is only compatible with other Ruby versions under specific circumstances. Keep this in mind:
In normal use, marshaling can only load data written with the same major version number and an equal or lower minor version number.
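If you prefer File.open over IO.binread/binwrite, the same fix applies: open the file in binary mode ('wb'/'rb'), since Marshal data is binary, not text. A small self-contained round trip, using a throwaway Matrix in place of the original @mat:
require 'matrix'
mat = Matrix[[1, 2], [3, 4]]
# 'wb'/'rb' is the important part: binary mode, no newline translation
File.open('mat_save', 'wb') { |f| f.write(Marshal.dump(mat)) }
restored = File.open('mat_save', 'rb') { |f| Marshal.load(f) }
restored == mat #=> true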

Read and write JSON and maintain the formatting in Ruby

I have a nicely structured (human-made) JSON file in which I would like to programmatically add and update values.
The issue is that the current structure of the JSON file is very easy for me and my colleagues to read, and we would like it to keep the same (or very similar) indentation, line spacing, key order, etc.
Is there a way to do this with Ruby?
Ruby's JSON supports pretty_generate, which is a "pretty" generator, but in no way will it attempt to remember how you've structured a particular JSON data file, nor should it.
require 'json'
foo = {'a' => 1, 'b' => %w[2 3]}
puts JSON.generate(foo)
{"a":1,"b":["2","3"]}
puts JSON.pretty_generate(foo)
{
  "a": 1,
  "b": [
    "2",
    "3"
  ]
}
JSON is a data serialization format and, along with YAML and XML, it's designed to move data accurately. Preserving arbitrary line spacing or leading whitespace adds no value to a serializer.
Remember, adding "pretty" to the output increases the size of the data being moved, without improving the quality:
puts JSON.generate(foo).size
21
puts JSON.pretty_generate(foo).size
43
Making just that little hash "pretty" doubled the size, which, over time, reduces throughput to browsers or across networks between servers. I'd recommend only bothering with the "pretty" output when initially debugging your code, then abandoning it once you're happy with the data movement, in favor of speed and efficiency. The data will be the same.
If you're worried about being able to modify some of the data, write a simple reader and/or JSON generator that works from a standard Ruby data object, then let JSON serialize it, and write the output to a file.
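A rough sketch of that round trip, assuming a hypothetical config.json with a top-level "version" key (the file name and key are made up for illustration):
require 'json'
path = 'config.json' # hypothetical file name
data = JSON.parse(File.read(path)) # JSON text -> plain Ruby hash
data['version'] = '2.0'            # modify whatever you need
# rewrite the file in JSON's own "pretty" layout; hand-made formatting is not preserved
File.write(path, JSON.pretty_generate(data) + "\n")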

Efficient read of EXIF meta-data from remote images using Ruby

I have a few thousand high-res JPEG photos stored on a travel blog website, and I'm hoping to write some Ruby code that will extract a few key EXIF meta-data values from the images without downloading the entire contents of each image file (they are large, and I have a LOT of them).
I'm using the 'exifr' gem to read the EXIF data, and it is designed to work with any type of IO object, not just local files. However, the Net::HTTPResponse object isn't really an IO object, although it does allow for incremental reading if you pass the read_body method a block. I've read conflicting reports about whether this incremental reading really lets you download only a portion of a file, or whether it just lets you read the contents in chunks for efficiency (i.e. the entire contents are downloaded anyway).
So, is what I'm trying to do possible? Should I be looking at alternatives to Net::HTTP, or is there some way for me to get at the low-level TCP socket (which should be an IO object) to pass to the 'exifr' code to read just enough of the image to get the EXIF data? Other solutions?
I generated a quick table of where, in my pile of photos, the EXIF data is stored:
$ find . -type f -exec grep -a -bo Exif {} \; > /tmp/exif
$ sort /tmp/exif | uniq -c | sort -n
1 12306:Exif
1 3271386:Exif
1 8210:Exif
1 8234:Exif
1 9234:Exif
2 10258:Exif
24 449:Exif
30 24:Exif
8975 6:Exif
$
The clear majority are just a few bytes into the file; a handful are scattered elsewhere, but the worst is only three megabytes into the file. (Give or take.)
I wrote a little test script that appears to do what is necessary for a single URL. (I tested it by looking for the string AA in chunks of a huge binary file I had available.) This certainly isn't the prettiest program I've written, but it might be an adequate start to a solution. Note that if the Exif text spans a chunk boundary, you're going to retrieve the entire file; that's unfortunate, and I hope it doesn't happen often. The 66000 is there because the JPEG APP1 block is limited to 64 kilobytes in size, and grabbing a bit more is probably better than grabbing a bit less.
#!/usr/bin/ruby
require 'net/http'
require 'uri'
url = URI.parse("http://....")
begin
  looking = true
  extra_size = 0
  File.open("/tmp/output", "w") do |f|
    Net::HTTP.start(url.host, url.port) do |http|
      request = Net::HTTP::Get.new url.request_uri
      http.request request do |resp|
        resp.read_body do |chunk|
          f.write chunk
          if looking
            # keep streaming until a chunk contains the Exif marker
            looking = false if chunk.match(/Exif/)
          elsif extra_size < 66000
            # then take roughly one APP1 block's worth of extra data
            extra_size += chunk.length
          else
            # abort the download; the bare rescue below catches the uncaught throw
            throw "done"
          end
        end
      end
    end
  end
rescue
  puts "done"
  exit(0)
end
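To hand those bytes to exifr without going through a temp file, you could buffer them in memory and wrap them in a StringIO. A rough sketch, assuming EXIFR::JPEG accepts an IO-like object (check the gem's docs for your version) and that the JPEG header, including the EXIF segment, fits within the byte limit; the helper name, URL and limit are made up:
require 'net/http'
require 'uri'
require 'stringio'
require 'exifr/jpeg'
# hypothetical helper: download roughly the first `limit` bytes, then stop
def partial_fetch(url, limit = 128 * 1024)
  uri = URI.parse(url)
  buffer = "".dup
  catch(:enough) do
    Net::HTTP.start(uri.host, uri.port, :use_ssl => uri.scheme == 'https') do |http|
      http.request(Net::HTTP::Get.new(uri.request_uri)) do |resp|
        resp.read_body do |chunk|
          buffer << chunk
          throw :enough if buffer.bytesize >= limit # abort the body read early
        end
      end
    end
  end
  buffer
end
exif = EXIFR::JPEG.new(StringIO.new(partial_fetch('http://example.com/photo.jpg')))
puts exif.date_time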

StringScanner scanning IO instead of a string

I've got a parser written using ruby's standard StringScanner. It would be nice if I could use it on streaming files. Is there an equivalent to StringScanner that doesn't require me to load the whole string into memory?
You might have to rework your parser a bit, but you can feed lines from a file to a scanner like this:
require 'strscan'
File.open('filepath.txt', 'r') do |file|
  scanner = StringScanner.new(file.readline)
  until file.eof?
    scanner.scan(/whatever/)
    scanner << file.readline # append the next line to the scan buffer
  end
end
StringScanner was intended for exactly that: loading one big string and moving back and forth over it with an internal pointer. If you turn it into a stream, those references get lost and you cannot use unscan, check_until, pre_match or post_match (well, you can, but then you need to buffer all of the previous input).
If you are concerned about the buffer size, just load the data in chunks and use a simple regexp, or a gem called Parser.
The simplest way is to read a fixed amount of data at a time.
# iterate over fixed-length records
open("fixed-record-file") do |f|
  while record = f.read(1024)
    # parse the record here using a regexp or a parser
  end
end
[Updated]
Even with this loop you can use StringScanner; you just need to update the string with each new chunk of data. From the documentation of StringScanner#string=:
string=(str)
Changes the string being scanned to str and resets the scanner.
Returns str.
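A small sketch of that idea, reusing the fixed-size read loop above; the /\w+/ token pattern is just a placeholder, and note that string= resets the scan position, so a token split across two chunks would be lost:
require 'strscan'
scanner = StringScanner.new("")
open("fixed-record-file") do |f|
  while chunk = f.read(1024)
    scanner.string = chunk # swap in the new chunk and reset the pointer
    while token = scanner.scan_until(/\w+/) # placeholder pattern
      # handle token here
    end
  end
end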
There is StringIO.
Sorry, I misread your question. Take a look at this; it seems to have streaming options.

Processing large XML file with libxml-ruby chunk by chunk

I'd like to read a large XML file that contains over a million small bibliographic records (like <article>...</article>) using libxml in Ruby. I have tried the Reader class in combination with the expand method to read record by record, but I am not sure this is the right approach, since my code eats up memory. Hence, I'm looking for a recipe for conveniently processing record by record with constant memory usage. Below is my main loop:
require 'xml' # libxml-ruby
File.open('dblp.xml') do |io|
  dblp = XML::Reader.io(io, :options => XML::Reader::SUBST_ENTITIES)
  pubFactory = PubFactory.new
  i = 0
  while dblp.read do
    case dblp.name
    when 'article', 'inproceedings', 'book'
      pub = pubFactory.create(dblp.expand)
      i += 1
      puts pub
      pub = nil
      $stderr.puts i if i % 10000 == 0
      dblp.next
    when 'proceedings', 'incollection', 'phdthesis', 'mastersthesis'
      # ignore for now
      dblp.next
    else
      # nothing
    end
  end
end
The key here is that dblp.expand reads an entire subtree (like an <article> record) and passes it as an argument to a factory for further processing. Is this the right approach?
Within the factory method I then use high-level, XPath-like expressions to extract the content of elements, as below. Again, is this viable?
def first(root, node)
  x = root.find(node).first
  x ? x.content : nil
end
pub.pages = first(node, 'pages') # node contains the expanded node from dblp.expand
When processing big XML files, you should use a stream parser to avoid loading everything into memory. There are two common approaches:
Push parsers like SAX, where you react to encountered tags as you get them (see tadman's answer).
Pull parsers, where you control a "cursor" in the XML file that you can move with simple primitives like go up/go down, etc.
I think push parsers are nice if you only want to retrieve some fields, but they are generally messy to use for complex data extraction and are often implemented with case ... when ... constructs.
Pull parsers are, in my opinion, a good middle ground between a tree-based model and a push parser. You can find a nice article about pull parsers with REXML in Dr. Dobb's Journal.
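To illustrate the pull style with REXML (a separate sketch, not the asker's libxml code; the file name and the element check are illustrative):
require 'rexml/parsers/pullparser'
parser = REXML::Parsers::PullParser.new(File.new('dblp.xml'))
while parser.has_next?
  event = parser.pull
  if event.start_element? && event[0] == 'article'
    # event[0] is the element name, event[1] its attribute hash
    puts "article: #{event[1].inspect}"
  end
end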
When processing XML, two common options are tree-based and event-based parsing. The tree-based approach typically reads in the entire XML document and can consume a large amount of memory. The event-based approach uses very little memory but doesn't do anything unless you write your own handler logic.
The event-based model is employed by the SAX-style parser, and derivative implementations.
Example with REXML: http://www.iro.umontreal.ca/~lapalme/ForestInsteadOfTheTrees/HTML/ch08s01.html
REXML: http://ruby-doc.org/stdlib/libdoc/rexml/rdoc/index.html
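For reference, a minimal event-based (SAX-style) sketch using REXML's StreamListener; the element name and the counting logic are illustrative, not taken from the linked examples:
require 'rexml/document'
require 'rexml/streamlistener'
class ArticleCounter
  include REXML::StreamListener
  attr_reader :count
  def initialize
    @count = 0
  end
  # called for every opening tag; attrs is a hash of that tag's attributes
  def tag_start(name, attrs)
    @count += 1 if name == 'article'
  end
end
listener = ArticleCounter.new
File.open('dblp.xml') { |io| REXML::Document.parse_stream(io, listener) }
puts listener.count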
I had the same problem, but I think I solved it by calling Node#remove! on the expanded node. In your case, I think you should do something like this:
my_node = dblp.expand
# do what you have to do with my_node
dblp.next
my_node.remove!
Not really sure why this works, but if you look at the source for LibXML::XML::Reader#expand, there's a comment about freeing the node. I am guessing that Reader#expand associates the node to the Reader, and you have to call Node#remove! to free it.
Memory usage wasn't great, even with this hack, but at least it didn't keep on growing.
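Applied to the loop from the question, that pattern would look roughly like this (untested; it just drops the remove! call into the asker's structure):
require 'xml' # libxml-ruby
File.open('dblp.xml') do |io|
  dblp = XML::Reader.io(io, :options => XML::Reader::SUBST_ENTITIES)
  while dblp.read
    next unless %w[article inproceedings book].include?(dblp.name)
    my_node = dblp.expand # the expanded subtree is still owned by the Reader
    # ... do what you have to do with my_node ...
    dblp.next             # move the Reader past the subtree
    my_node.remove!       # then unlink it so its memory can be reclaimed
  end
end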
