Read and write JSON and maintain the formatting in Ruby

I have a nicely structured (human-made) JSON file to which I would like to programmatically add and update values.
The issue is that the current structure of the JSON file is very easy for me and my colleagues to read, and we would like it to keep the same (or very similar) indentation, line spacing, key order, etc.
Is there a way to do this with Ruby?

Ruby's JSON supports pretty_generate, which is a "pretty" generator, but in no way will it attempt to remember how you've structured a particular JSON data file, nor should it.
foo = {'a' => 1, 'b' => %w[2 3]}
puts JSON.generate(foo)
{"a":1,"b":["2","3"]}
puts JSON.pretty_generate(foo)
{
  "a": 1,
  "b": [
    "2",
    "3"
  ]
}
JSON is a data serialization format and, along with YAML and XML, it's designed to move data accurately. Maintaining arbitrary line spacing or leading whitespace while doing that adds no value to a serializer.
Remember, adding "pretty" to the output increases the size of the data being moved, without improving the quality:
puts JSON.generate(foo).size
21
puts JSON.pretty_generate(foo).size
43
Making just that little hash "pretty" doubled the size, which, over time, reduces throughput to browsers or across networks between servers. I'd recommend only bothering with the "pretty" output when initially debugging your code, then abandoning it once you're happy with the data movement, in favor of speed and efficiency. The data will be the same.
If you're worried about being able to modify some of the data, write a simple reader and/or JSON generator that works from a standard Ruby data object, then let JSON serialize it, and write the output to a file.
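For example, a minimal read-modify-write sketch along those lines (the file name and key here are hypothetical, purely for illustration):
require 'json'

# Hypothetical file and key, for illustration only
config = JSON.parse(File.read('settings.json'))
config['timeout'] = 30

# pretty_generate writes consistent two-space indentation,
# but it won't reproduce hand-crafted spacing or comments
File.write('settings.json', JSON.pretty_generate(config) + "\n")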

Related

How to read a large file into a string

I'm trying to save and load the states of Matrices (using Matrix) during the execution of my program with the functions dump and load from Marshal. I can serialize the matrix and get a ~275 KB file, but when I try to load it back as a string to deserialize it into an object, Ruby gives me only the beginning of it.
# when I want to save
mat_dump = Marshal.dump(@mat) # serialize object - OK
File.open('mat_save', 'w') {|f| f.write(mat_dump)} # write String to file - OK
# somewhere else in the code
mat_dump = File.read('mat_save') # read String from file - only reads like 5%
@mat = Marshal.load(mat_dump) # deserialize object - "ArgumentError: marshal data too short"
I tried to change the arguments for load but didn't find anything yet that doesn't cause an error.
How can I load the entire file into memory? Reading the file chunk by chunk, appending the chunks to a String and then deserializing would work too. The file is basically one big line, so reading it line by line doesn't help; the problem stays the same.
I saw some questions about the topic:
"Ruby serialize array and deserialize back"
"What's a reasonable way to read an entire text file as a single string?"
"How to read whole file in Ruby?"
but none of them seem to have the answers I'm looking for.
Marshal is a binary format, so you need to read and write in binary mode. The easiest way is to use IO.binread/write.
...
IO.binwrite('mat_save', mat_dump)
...
mat_dump = IO.binread('mat_save')
@mat = Marshal.load(mat_dump)
Remember that marshaling is Ruby-version dependent; it's only compatible with other Ruby versions under specific circumstances:
In normal use, marshaling can only load data written with the same major version number and an equal or lower minor version number.
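Putting it together, a minimal round trip might look like this (assuming the matrix lives in an instance variable named @mat, as in the question):
require 'matrix'

@mat = Matrix[[1, 2], [3, 4]]

# write the marshaled bytes in binary mode
IO.binwrite('mat_save', Marshal.dump(@mat))

# read them back in binary mode and rebuild the object
@mat = Marshal.load(IO.binread('mat_save'))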

Can we store multiple objects in a file?

I am already familiar with How can I save an object to a file?
But what if we have to store multiple objects (say, hashes) in a file?
I tried appending YAML.dump(hash) to a file from various locations in my code, but the difficult part is reading it back. As a YAML dump can extend over many lines, do I have to parse the file myself? That would only complicate the code. Is there a better way to achieve this?
PS: The same issue persists with Marshal.dump, so I prefer YAML as it's more human-readable.
YAML.dump creates a single YAML document. If you have several YAML documents together in a file, you have a YAML stream, so when you appended the results of several calls to YAML.dump together you had a stream.
If you try reading this back using YAML.load you will only get the first document. To get all the documents back you can use YAML.load_stream, which will give you an array with an entry for each of the documents.
An example:
f = File.open('data.yml', 'w')
YAML.dump({:foo => 'bar'}, f)
YAML.dump({:baz => 'qux'}, f)
f.close
After this data.yml will look like this, containing two separate documents:
---
:foo: bar
---
:baz: qux
You can now read it back like this:
all_docs = YAML.load_stream(File.open('data.yml'))
Which will give you an array like [{:foo=>"bar"}, {:baz=>"qux"}].
If you don’t want to load all the documents into an array in one go you can pass a block to load_stream and handle each document as it is parsed:
YAML.load_stream(File.open('data.yml')) do |doc|
# handle the doc here
end
You could save multiple objects by using a delimiter (something to mark that one object is finished and the next one begins). You could then process the file in two steps:
read the file, splitting it around each delimiter
use YAML to restore the hashes from each chunk
This would be a bit cumbersome, though, and there is a much simpler solution. Let's say you have three hashes to save:
student = { first_name: "John"}
restaurant = { location: "21 Jump Street" }
order = { main_dish: "Happy Meal" }
You can simply put them in an array and then dump them:
objects = [student, restaurant, order]
dump = YAML.dump(objects)
You can restore your objects easily:
saved_objects = YAML.load(dump)
saved_student = saved_objects[0]
Depending on how your objects relate to each other, you may prefer to use a Hash to save them instead of an array (so that you can refer to them by name instead of relying on their order).
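A minimal sketch of that hash-based variant (the file name is arbitrary; on recent Psych versions the symbol keys have to be explicitly permitted when loading):
require 'yaml'

student = { first_name: "John" }
restaurant = { location: "21 Jump Street" }
order = { main_dish: "Happy Meal" }

# name each object instead of relying on array order
objects = { student: student, restaurant: restaurant, order: order }
File.write('objects.yml', YAML.dump(objects))

saved = YAML.safe_load(File.read('objects.yml'), permitted_classes: [Symbol])
saved[:student][:first_name] #=> "John"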

Compressing large string in ruby

I have a web application (Ruby on Rails) that sends some YAML as the value of a hidden input field.
Now I want to reduce the size of the text that is sent to the browser. What is the most efficient form of lossless compression that would send across minimal data? I'm OK with incurring the additional cost of compression and decompression on the server side.
You could use the zlib implementation in Ruby's standard library to deflate/inflate data:
require "zlib"
data = "some long yaml string" * 100
compressed_data = Zlib::Deflate.deflate(data)
#=> "x\x9C+\xCE\xCFMU\xC8\xC9\xCFKW\xA8L\xCC\xCDQ(.)\xCA\xCCK/\x1E\x15\x1C\x15\x1C\x15\x1C\x15\x1C\x15\x1C\x15\x1C\x15\x1C\x15D\x15\x04\x00\xB3G%\xA6"
You should base64-encode the compressed data to make it printable:
require 'base64'
encoded_data = Base64.encode64 compressed_data
#=> "eJwrzs9NVcjJz0tXqEzMzVEoLinKzEsvHhUcFRwVHBUcFRwVHBUcFUQVBACz\nRyWm\n"
Later, on the client side, you might use pako (a zlib port to JavaScript) to get your data back. This answer probably helps you with implementing the JS part.
To give you an idea of how effective this is, here are the sizes of the example strings:
data.size # 2100
compressed_data.size # 48
encoded_data.size # 66
The same works in reverse when compressing on the client and inflating on the server.
Zlib::Inflate.inflate(Base64.decode64(encoded_data))
#=> "some long yaml stringsome long yaml str ... (shortened, as the string is long :)
Disclaimer:
The Ruby zlib implementation should be compatible with the pako implementation, but I have not tried it.
The string-size numbers above are a little contrived: Zlib is extremely effective here because the string repeats a lot, and real-life data usually does not repeat as much.
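To see the effect, compare the highly repetitive string with random data of the same length (SecureRandom is only used here to fabricate incompressible input):
require 'zlib'
require 'securerandom'

repetitive = "some long yaml string" * 100      # 2100 bytes of pure repetition
random = SecureRandom.random_bytes(2100)        # 2100 random bytes

Zlib::Deflate.deflate(repetitive).size  # tiny, the input repeats constantly
Zlib::Deflate.deflate(random).size      # about the same as the input; random bytes don't compress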
If you are working on a Rails application, you can also use the ActiveSupport::Gzip wrapper that allows compression/decompression of strings with gzip.
compressed_log = ActiveSupport::Gzip.compress('large string')
=> "\x1F\x8B\b\x00yq5c\x00\x03..."
original_log = ActiveSupport::Gzip.decompress(compressed_log)
=> "large string"
Behind the scenes, the compress method uses the Zlib::GzipWriter class, which writes gzipped files. Similarly, the decompress method uses the Zlib::GzipReader class, which reads gzipped files.
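Outside Rails, roughly the same thing can be done directly with those classes and a StringIO; a minimal sketch:
require 'zlib'
require 'stringio'

def gzip(string)
  io = StringIO.new("".b)               # binary buffer for the compressed bytes
  writer = Zlib::GzipWriter.new(io)
  writer.write(string)
  writer.finish                         # flush the gzip trailer without closing the StringIO
  io.string
end

def gunzip(data)
  Zlib::GzipReader.new(StringIO.new(data)).read
end

gunzip(gzip('large string')) #=> "large string"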

Weird JSON parsing issues with Ruby

I'm downloading content from a webpage that seems to be in JSON. It is a large file with the following format:
"address1":"123 Street","address2":"Apt 1","city":"City","state":"ST","zip":"xxxxx","country":"US"
There are about 1000 of these entries, where each entry is contained within brackets. When I download the page using RestClient.get (open-uri for some reason was throwing an HTTP 500 error), the data is in the following format:
\"address1\":\"123 Street\",\"address2\":\"Apt 1\",\"city\":\"City\",\"state\":\"ST\",\"zip\":\"xxxxx\",\"country\":\"US\"
When I then use the JSON class
parsed = JSON.parse(data_out)
it completely scrambles both the order of the entries within the data structure and the order of the objects within each entry, for example:
"address1"=>"123 Street", "city"=>"City", "country"=>"US", "address2"=>"Apt 1"
If instead I use
data_j=data_out.to_json
then I get:
\\\"address\\\1":\\\"123 Street\\\",\\\"address2\\\":\\\"Apt 1\\\",\\\"city\\\":\\\"City\\\",\\\"state\\\":\\\"ST\\\",\\\"zip\\\":\\\"xxxxx\\\",\\\"country\\\":\\\"US\\\"
Further, only using the JSON class seems to allow me to select the entries I want:
parsed[1]["address1"]
=> "123 Street"
data_j[1]["address1"]
TypeError: can't convert String into Integer
from (irb):17:in `[]'
from (irb):17
from :0
Any idea what's going on? I guess since the JSON commands are working I can use them, but it is disconcerting that it's scrambling the entries and the order of the objects.
Although the data appears ordered in string form, it represents an unordered dataset. The line:
parsed = JSON.parse(data_out)
which you use is the correct way to convert the string form into something usable in Ruby. I cannot see the full structure from your example, so I don't know whether the top level is an array or an id-based hash. I suspect the latter, since you say it becomes unordered when you view it from Ruby. Therefore, if you knew which part of the address you were interested in, you might have code like this:
# Writes all the cities
parsed.each do |id, data|
  puts data["city"]
end
If the outer structure is an array, you'd do this:
# Writes all the cities
parsed.each do |data|
  puts data["city"]
end

Processing large XML file with libxml-ruby chunk by chunk

I'd like to read a large XML file that contains over a million small bibliographic records (like <article>...</article>) using libxml in Ruby. I have tried the Reader class in combination with the expand method to read record by record, but I am not sure this is the right approach, since my code eats up memory. Hence, I'm looking for a recipe for conveniently processing the file record by record with constant memory usage. Below is my main loop:
File.open('dblp.xml') do |io|
  dblp = XML::Reader.io(io, :options => XML::Reader::SUBST_ENTITIES)
  pubFactory = PubFactory.new
  i = 0
  while dblp.read do
    case dblp.name
    when 'article', 'inproceedings', 'book'
      pub = pubFactory.create(dblp.expand)
      i += 1
      puts pub
      pub = nil
      $stderr.puts i if i % 10000 == 0
      dblp.next
    when 'proceedings', 'incollection', 'phdthesis', 'mastersthesis'
      # ignore for now
      dblp.next
    else
      # nothing
    end
  end
end
The key here is that dblp.expand reads an entire subtree (like an <article> record) and passes it as an argument to a factory for further processing. Is this the right approach?
Within the factory method I then use high-level XPath-like expressions to extract the content of elements, as below. Again, is this viable?
def first(root, node)
  x = root.find(node).first
  x ? x.content : nil
end

pub.pages = first(node, 'pages') # node contains expanded node from dblp.expand
When processing big XML files, you should use a stream parser to avoid loading everything in memory. There are two common approaches:
Push parsers like SAX, where you react to encountered tags as you get them (see tadman's answer).
Pull parsers, where you control a "cursor" in the XML file that you can move with simple primitives like go up/go down etc.
I think that push parsers are nice to use if you want to retrieve just some fields, but they are generally messy to use for complex data extraction and are often implemented with case ... when ... constructs.
Pull parsers are, in my opinion, a good middle ground between a tree-based model and a push parser. You can find a nice article about pull parsers with REXML in Dr. Dobb's Journal.
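For illustration, a minimal pull-parsing loop with REXML's PullParser; the 'title' element name is only an assumption borrowed from the dblp example above:
require 'rexml/parsers/pullparser'

parser = REXML::Parsers::PullParser.new(File.new('dblp.xml'))
in_title = false
while parser.has_next?
  event = parser.pull
  if event.start_element? && event[0] == 'title'
    in_title = true
  elsif event.text? && in_title
    puts event[0]                       # text content of the <title> element
  elsif event.end_element? && event[0] == 'title'
    in_title = false
  end
end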
When processing XML, two common options are tree-based and event-based parsing. The tree-based approach typically reads the entire XML document into memory and can consume a large amount of memory. The event-based approach uses no additional memory but doesn't do anything unless you write your own handler logic.
The event-based model is employed by the SAX-style parser, and derivative implementations.
Example with REXML: http://www.iro.umontreal.ca/~lapalme/ForestInsteadOfTheTrees/HTML/ch08s01.html
REXML: http://ruby-doc.org/stdlib/libdoc/rexml/rdoc/index.html
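And for the SAX-style (push) approach, a minimal REXML StreamListener sketch; again, the 'title' element name is just an assumption based on the dblp example:
require 'rexml/document'
require 'rexml/streamlistener'

# Prints the text of every <title> element as it is encountered
class TitleListener
  include REXML::StreamListener

  def initialize
    @in_title = false
  end

  def tag_start(name, attrs)
    @in_title = true if name == 'title'
  end

  def text(data)
    puts data if @in_title
  end

  def tag_end(name)
    @in_title = false if name == 'title'
  end
end

File.open('dblp.xml') do |io|
  REXML::Document.parse_stream(io, TitleListener.new)
end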
I had the same problem, but I think I solved it by calling Node#remove! on the expanded node. In your case, I think you should do something like
my_node = dblp.expand
# do what you have to do with my_node
dblp.next
my_node.remove!
Not really sure why this works, but if you look at the source for LibXML::XML::Reader#expand, there's a comment about freeing the node. I am guessing that Reader#expand associates the node with the Reader, and you have to call Node#remove! to free it.
Memory usage wasn't great, even with this hack, but at least it didn't keep on growing.
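Applied to the loop from the question, it would look roughly like this (PubFactory and the element names come from the question; the remove! call is the only addition):
File.open('dblp.xml') do |io|
  dblp = XML::Reader.io(io, :options => XML::Reader::SUBST_ENTITIES)
  pubFactory = PubFactory.new
  while dblp.read
    case dblp.name
    when 'article', 'inproceedings', 'book'
      node = dblp.expand
      puts pubFactory.create(node)
      dblp.next
      node.remove!          # free the expanded subtree
    when 'proceedings', 'incollection', 'phdthesis', 'mastersthesis'
      dblp.next
    end
  end
end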
