How to "observe" a stream in Ruby's CSV module?

I am writing a class that takes a CSV file, transforms it, and then writes the new data out.
module Transformer
class Base
def initialize(file)
@file = file
end
def original_data(&block)
opts = { headers: true }
CSV.open(file, 'rb', opts, &block)
end
def transformer
# complex manipulations here like modifying columns, picking only certain
# columns to put into new_data, etc but simplified to `+10` to keep
# example concise
-> { |row| new_data << row['some_header'] + 10 }
end
def transformed_data
self.original_data(self.transformer)
end
def write_new_data
CSV.open('new_file.csv', 'wb', opts) do |new_data|
transformed_data
end
end
end
end
What I'd like to be able to do is:
Look at the transformed data without writing it out (so I can test that it transforms the data correctly, and I don't need to write it to file right away: maybe I want to do more manipulation before writing it out)
Don't slurp all the file at once, so it works no matter the size of the original data
Have this as a base class with an empty transformer so that instances only need to implement their own transformers but the behavior for reading and writing is given by the base class.
But obviously the above doesn't work because I don't really have a reference to new_data in transformer.
How could I achieve this elegantly?

I can recommend a few approaches, depending on your needs and personal taste.
I have intentionally distilled the code to just its bare minimum (without your wrapping class), for clarity.
1. Simple read-modify-write loop
Since you do not want to slurp the file, use CSV.foreach, which reads one row at a time. For example, for a quick debugging session, do:
CSV.foreach "source.csv", headers: true do |row|
row["name"] = row["name"].upcase
row["new column"] = "new value"
p row
end
And if you wish to write to file during that same iteration:
require 'csv'
csv_options = { headers: true }
# Open the target file for writing
CSV.open("target.csv", "wb") do |target|
# Add a header
target << %w[new header column names]
# Iterate over the source CSV rows
CSV.foreach "source.csv", **csv_options do |row|
# Mutate and add columns
row["name"] = row["name"].upcase
row["new column"] = "new value"
# Push the new row to the target file
target << row
end
end
2. Using CSV::Converters
There is built-in functionality that might be helpful, CSV::Converters (see the :converters option in the CSV.new documentation).
require 'csv'
# Register a converter in the options hash
csv_options = { headers: true, converters: [:stripper] }
# Define a converter
CSV::Converters[:stripper] = lambda do |value, field|
value ? value.to_s.strip : value
end
CSV.open("target.csv", "wb") do |target|
# same as above
CSV.foreach "source.csv", **csv_options do |row|
# same as above - input data will already be converted
# you can do additional things here if needed
end
end
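As a minimal sketch of what a field converter does, the snippet below parses an in-memory string instead of source.csv, so it can be run as-is. The sample data is made up, and the lambda is passed directly in the :converters option rather than registered under a name; both forms should behave the same way.

```ruby
require 'csv'

# A field converter receives each parsed field; non-string values pass through
stripper = ->(value) { value.is_a?(String) ? value.strip : value }

# Hypothetical sample data with stray whitespace around the fields
source = "name,city\n  alice , Paris \n"

# Converters run while parsing, so rows arrive already cleaned up
rows = CSV.parse(source, headers: true, converters: [stripper]).map(&:to_h)

p rows
```

Field converters are applied to data fields only; header names go through the separate :header_converters option.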
3. Separate input and output from your converter classes
Based on your comment, and since you want to minimize I/O and iterations, perhaps extracting the read/write operations from the responsibility of the transformers might be of interest. Something like this.
require 'csv'
class NameCapitalizer
def self.call(row)
row["name"] = row["name"].upcase
end
end
class EmailRemover
def self.call(row)
row.delete 'email'
end
end
csv_options = { headers: true }
converters = [NameCapitalizer, EmailRemover]
CSV.open("target.csv", "wb") do |target|
CSV.foreach "source.csv", **csv_options do |row|
converters.each { |c| c.call row }
target << row
end
end
Note that the above code still does not handle the header row, which matters if the transformations add, remove, or rename columns. You will probably have to keep a transformed row (after all transformations have run) and prepend its #headers to the output CSV.
There are probably plenty other ways to do it, but the CSV class in Ruby does not have the cleanest interface, so I try to keep code that deals with it as simple as I can.
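Putting the pieces back into the base class from the question, one possible shape is sketched below. It is an assumption-laden illustration, not the only way to do it: the method names transformed_rows, transform and write are made up, the class takes any IO-like object rather than a path so it can be exercised with StringIO, and the `+ 10` stands in for the real transformation. Rows are produced lazily through an Enumerator, so they can be inspected without writing anything and without slurping the input.

```ruby
require 'csv'
require 'stringio'

module Transformer
  class Base
    def initialize(input)
      @input = input # any IO-like object: File, StringIO, ...
    end

    # Yields one transformed row at a time; returns an Enumerator without a block
    def transformed_rows
      return enum_for(:transformed_rows) unless block_given?

      CSV.new(@input, headers: true).each { |row| yield transform(row) }
    end

    # Subclasses override this; the base class passes rows through unchanged
    def transform(row)
      row
    end

    # Streams the transformed rows to output, taking headers from the first row
    def write(output)
      csv = CSV.new(output)
      header_written = false
      transformed_rows do |row|
        csv << row.headers unless header_written
        header_written = true
        csv << row
      end
      output
    end
  end

  # Hypothetical subclass standing in for the "+10" example in the question
  class AddTen < Base
    def transform(row)
      row['some_header'] = Integer(row['some_header']) + 10
      row
    end
  end
end

sample = "some_header,other\n1,a\n2,b\n"

# Look at the transformed data without writing it anywhere
rows = Transformer::AddTen.new(StringIO.new(sample)).transformed_rows.map(&:to_h)

# Or stream it straight to an output IO
written = Transformer::AddTen.new(StringIO.new(sample)).write(StringIO.new).string
```

Because the input is consumed as it is read, each pass over the data needs a fresh (or rewound) IO, which is why the example builds two instances.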

Related

Naming different files. Handler? IO? Stream? Processor? Controller?

I'm having some trouble naming some files that I wrote. I don't really know the difference between a stream, I/O, a handler, a processor (is this a real concept?), and a controller. These are what my files look like in Ruby:
Starting from the rakefile:
desc "Calculate chocolate totals from a CSV of orders"
task :redeem_orders, [:orders_csv_path, :redemptions_csv_path] do |t, args|
args.with_defaults(:orders_csv_path => "./public/input/orders.csv", :redemptions_csv_path => "./public/output/redemptions.csv")
DataController.transfer(
input_path: args[:orders_csv_path],
output_path: args[:redemptions_csv_path],
formatter: ChocolateTotalsFormatter,
converter: ChocolateTotalsConverter
)
end
Then the controller (which in my mind delegates between different classes with the data obtained from the rakefile):
class DataController
def self.transfer(input_path:, output_path:, formatter:, converter:)
data_processor = DataProcessor.new(
input_path: input_path,
output_path: output_path,
formatter: formatter
)
export_data = converter.convert(data_processor.import)
data_processor.export(export_data)
end
end
The processor (which performs imports and exports according to the various files that were passed into this file):
class DataProcessor
attr_reader :input_path,
:output_path,
:formatter,
:input_file_processor,
:output_file_processor
def initialize(input_path:, output_path:, formatter:)
@input_path = input_path
@output_path = output_path
@formatter = formatter
@input_file_processor = FileProcessorFactory.create(File.extname(input_path))
@output_file_processor = FileProcessorFactory.create(File.extname(output_path))
end
def import
formatter.format_input(input_file_processor.read(input_path: input_path))
end
def export(export_data)
output_file_processor.write(
output_path: output_path,
data: formatter.format_output(export_data)
)
end
end
the converter referenced in the controller looks like this (it converts data that was passed in to a different format... I'm more confident about this naming):
class ChocolateTotalsConverter
def self.convert(data)
data.map do |row|
ChocolateTotalsCalculator.new(row).calculate
end
end
end
And the FileProcessorFactory in the above code snippet creates a file like this one that actually does the reading and the writing to CSV:
require 'csv'
class CSVProcessor
include FileTypeProcessor
def self.read(input_path:, with_headers: true, return_headers: false)
CSV.read(input_path, headers: with_headers, return_headers: return_headers, converters: :numeric)
end
def self.write(output_path:, data:, write_headers: false)
CSV.open(output_path, "w", write_headers: write_headers) do |csv|
data.each do |row|
csv << row
end
end
end
end
I'm having trouble with naming. Does it look like I named things correctly? What should be named something like DataIO vs DataProcessor? What should a file named DataStream be doing? What about something that's a converter?
Ruby isn't a kingdom of nouns. Some programmers hear "everything is an object" and think "I am processing data, therefore I need a DataProcessor object!" But in Ruby, "everything is an object". There's only one novel "thing" in your example: a chocolate order (maybe redemptions, too). So you only need one custom class: ChocolateOrder. The other "things" we already have objects for: CSV represents the CSV file, Array (or Set or Hash) can represent the collection of chocolate orders.
Processing a CSV row into an order, converting an order into workable data, and totaling those data into a result aren't "things". They're actions! In Ruby, actions are methods, blocks, procs, lambdas, or top-level functions*. In your case I see a method like ChocolateOrder#payment for getting just the price to add up, then maybe some blocks for the rest of the processing.
In pseudocode I imagine something like this:
# input
orders = CSV.foreach(input_file).map do |row|
# get important stuff out of the row
Order.new(x, y, z)
end
# processing
redemptions = orders.map { |order| order.get_redemption }
# output
CSV.open(output_file, "wb") do |csv|
redemptions.each do |redemption|
# convert redemption to an array of strings
csv << redemption_ary
end
end
If your rows are really simple, I would even consider just setting headers: true on the CSV so each row comes back as a Hash-like CSV::Row, and leaving orders as that.
* Procs, lambdas, and top-level functions are objects too. But that's beside the point.
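A runnable version of the pseudocode might look like the sketch below. Everything specific in it is an assumption for illustration: the cash and price columns, the ChocolateOrder#bars method, and the use of StringIO in place of the real input and output files.

```ruby
require 'csv'
require 'stringio'

# The one custom "thing": an order. The fields and #bars are hypothetical.
ChocolateOrder = Struct.new(:cash, :price) do
  # How many bars the customer can afford
  def bars
    cash / price
  end
end

input = StringIO.new("cash,price\n12,2\n9,3\n")
output = StringIO.new

# input: turn each CSV row into an order
orders = CSV.new(input, headers: true, converters: :numeric).map do |row|
  ChocolateOrder.new(row['cash'], row['price'])
end

# processing and output: one line per order
csv = CSV.new(output)
orders.each { |order| csv << [order.bars] }

puts output.string
```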
This seems like quite a 'java' way of thinking - in Ruby I haven't seen patterns like this used very often. I'd say that you might only really need the DataProcessor class. CSVProcessor and ChocolateTotalsConverter have only class methods, which might be more idiomatic if they were instance methods of DataProcessor instead. I'd start there and see how you feel about it.

How can I process huge JSON files as streams in Ruby, without consuming all memory?

I'm having trouble processing a huge JSON file in Ruby. What I'm looking for is a way to process it entry-by-entry without keeping too much data in memory.
I thought that the yajl-ruby gem would do the work, but it consumes all my memory. I've also looked at the Yajl::FFI and JSON::Stream gems, but there it is clearly stated:
For larger documents we can use an IO object to stream it into the
parser. We still need room for the parsed object, but the document
itself is never fully read into memory.
Here's what I've done with Yajl:
file_stream = File.open(file, "r")
json = Yajl::Parser.parse(file_stream)
json.each do |entry|
entry.do_something
end
file_stream.close
The memory usage keeps getting higher until the process is killed.
I don't see why Yajl keeps processed entries in memory. Can I somehow free them, or did I just misunderstand the capabilities of the Yajl parser?
If it cannot be done using Yajl: is there a way to do this in Ruby via any library?
Problem
json = Yajl::Parser.parse(file_stream)
When you invoke Yajl::Parser like this, the entire stream is loaded into memory to create your data structure. Don't do that.
Solution
Yajl provides Parser#parse_chunk, Parser#on_parse_complete, and other related methods that enable you to trigger parsing events on a stream without requiring that the whole IO stream be parsed at once. The README contains an example of how to use chunking instead.
The example given in the README is:
Or lets say you didn't have access to the IO object that contained JSON data, but instead only had access to chunks of it at a time. No problem!
(Assume we're in an EventMachine::Connection instance)
def post_init
@parser = Yajl::Parser.new(:symbolize_keys => true)
end
def object_parsed(obj)
puts "Sometimes one pays most for the things one gets for nothing. - Albert Einstein"
puts obj.inspect
end
def connection_completed
# once a full JSON object has been parsed from the stream
# object_parsed will be called, and passed the constructed object
@parser.on_parse_complete = method(:object_parsed)
end
def receive_data(data)
# continue passing chunks
@parser << data
end
Or if you don't need to stream it, it'll just return the built object from the parse when it's done. NOTE: if there are going to be multiple JSON strings in the input, you must specify a block or callback as this is how yajl-ruby will hand you (the caller) each object as it's parsed off the input.
obj = Yajl::Parser.parse(str_or_io)
One way or another, you have to parse only a subset of your JSON data at a time. Otherwise, you are simply instantiating a giant Hash in memory, which is exactly the behavior you describe.
Without knowing what your data looks like and how your JSON objects are composed, it isn't possible to give a more detailed explanation than that; as a result, your mileage may vary. However, this should at least get you pointed in the right direction.
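If you control how the data is produced, one way to make "a subset at a time" trivial is to store one JSON document per line (often called JSON Lines) and parse line by line with the standard library. This is a different technique from Yajl's chunked parsing, and it only applies under that assumption about the file layout. The helper name each_json_line and the sample records are made up for this sketch.

```ruby
require 'json'
require 'stringio'

# Yields one parsed entry per non-empty line; only that entry is held in memory
def each_json_line(io)
  return enum_for(:each_json_line, io) unless block_given?

  io.each_line do |line|
    line = line.strip
    next if line.empty?

    yield JSON.parse(line)
  end
end

# StringIO stands in for File.open(path); any IO with #each_line works
io = StringIO.new(%({"name":"fred","dead":true}\n{"name":"erik","dead":false}\n))
names = each_json_line(io).map { |entry| entry["name"] }

p names
```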
Both @CodeGnome's and @A. Rager's answers helped me understand the solution.
I ended up creating the gem json-streamer that offers a generic approach and spares the need to manually define callbacks for every scenario.
Your solutions seem to be json-stream and yajl-ffi. There's an example on both that're pretty similar (they're from the same guy):
def post_init
@parser = Yajl::FFI::Parser.new
@parser.start_document { puts "start document" }
@parser.end_document { puts "end document" }
@parser.start_object { puts "start object" }
@parser.end_object { puts "end object" }
@parser.start_array { puts "start array" }
@parser.end_array { puts "end array" }
@parser.key {|k| puts "key: #{k}" }
@parser.value {|v| puts "value: #{v}" }
end
def receive_data(data)
begin
@parser << data
rescue Yajl::FFI::ParserError => e
close_connection
end
end
There, he sets up the callbacks for possible data events that the stream parser can experience.
Given a json document that looks like:
{
1: {
name: "fred",
color: "red",
dead: true,
},
2: {
name: "tony",
color: "six",
dead: true,
},
...
n: {
name: "erik",
color: "black",
dead: false,
},
}
One could stream parse it with yajl-ffi something like this:
def parse_dudes file_io, chunk_size
parser = Yajl::FFI::Parser.new
object_nesting_level = 0
current_row = {}
current_key = nil
parser.start_object { object_nesting_level += 1 }
parser.end_object do
if object_nesting_level.eql? 2
yield current_row #here, we yield the fully collected record to the passed block
current_row = {}
end
object_nesting_level -= 1
end
parser.key do |k|
if object_nesting_level.eql? 2
current_key = k
elsif object_nesting_level.eql? 1
current_row["id"] = k
end
end
parser.value { |v| current_row[current_key] = v }
file_io.each(chunk_size) { |chunk| parser << chunk }
end
File.open('dudes.json') do |f|
parse_dudes f, 1024 do |dude|
pp dude
end
end

How to read data from a different file without using YAML or JSON

I'm experimenting with a Ruby script that will add data to a Neo4j database using REST API. (Here's the tutorial with all the code if interested.)
The script works if I include the hash data structure in the initialize method but I would like to move the data into a different file so I can make changes to it separately using a different script.
I'm relatively new to Ruby. If I copy the following data structure into a separate file, is there a simple way to read it from my existing script when I call @data? I've heard one could do something with YAML or JSON (I'm not familiar with how either works). What's the easiest way to read a file, and how could I go about coding that?
# I want to copy this data into a different file and read it with my script when I call @data.
{
nodes:[
{:label=>"Person", :title=>"title_here", :name=>"name_here"}
]
}
And here is part of my code, it should be enough for the purposes of this question.
class RGraph
def initialize
@url = 'http://localhost:7474/db/data/cypher'
# If I put this hash structure into a different file, how do I make @data read that file?
@data = {
nodes:[
{:label=>"Person", :title=>"title_here", :name=>"name_here"}
]
}
end
#more code here... not relevant to question
def create_nodes
# Scan file, find each node and create it in Neo4j
@data.each do |key,value|
if key == :nodes
@data[key].each do |node| # Cycle through each node
next unless node.has_key?(:label) # Make sure this node has a label
#WE have sufficient data to create a node
label = node[:label]
attr = Hash.new
node.each do |k,v| # Hunt for additional attributes
next if k == :label # Don't create an attribute for "label"
attr[k] = v
end
create_node(label,attr)
end
end
end
end
rGraph = RGraph.new
rGraph.create_nodes
end
Given that OP said in comments "I'm not against using either of those", let's do it in YAML (which preserves the Ruby object structure best). Save it:
@data = {
nodes:[
{:label=>"Person", :title=>"title_here", :name=>"name_here"}
]
}
require 'yaml'
File.write('config.yaml', YAML.dump(@data))
This will create config.yaml:
---
:nodes:
- :label: Person
  :title: title_here
  :name: name_here
If you read it in, you get exactly what you saved:
require 'yaml'
@data = YAML.load(File.read('config.yaml'))
puts @data.inspect
# => {:nodes=>[{:label=>"Person", :title=>"title_here", :name=>"name_here"}]}
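Since JSON was also mentioned, here is the equivalent round trip as a rough sketch using only the standard library. JSON has no symbol type, so the keys are written as strings and turned back into symbols with symbolize_names: true on the way in. The file name config.json and the temporary directory are assumptions made so the example cleans up after itself.

```ruby
require 'json'
require 'tmpdir'

data = {
  nodes: [
    { label: "Person", title: "title_here", name: "name_here" }
  ]
}

loaded = Dir.mktmpdir do |dir|
  path = File.join(dir, "config.json") # hypothetical file name

  # Write the structure out in a human-editable form
  File.write(path, JSON.pretty_generate(data))

  # Read it back; symbolize_names restores the symbol keys
  JSON.parse(File.read(path), symbolize_names: true)
end

p loaded
```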

Using Ruby to parse and write Puppet node definitions

I am writing a helper API in Ruby to automatically create and manipulate node definitions. My code is working; it can read and write the node defs successfully, however, it is a bit clunky.
Ruby is not my main language, so I'm sure there is a cleaner, and more rubyesque solution. I would appreciate some advice or suggestions.
Each host has its own file in manifests/nodes containing just the node definition. e.g.
node 'testnode' {
class {'firstclass': }
class {'secondclass': enabled => false }
}
Each class is an element that is either enabled (the default) or disabled. In the Ruby code, I store these in an instance variable hash, @elements.
The read method looks like this:
def read()
data = File.readlines(@filepath)
for line in data do
if line.include? 'class'
element = line[/.*\{'([^\']*)':/, 1]
if @elements.include? element.to_sym
if not line.include? 'enabled => false'
@elements[element.to_sym] = true
else
@elements[element.to_sym] = false
end
end
end
end
end
And the write method looks like this:
def write()
data = "node #{@hostname} {\n"
for element in @elements do
if element[1]
line = " class {'#{element[0]}': }\n"
else
line = " class {'#{element[0]}': enabled => false}\n"
end
data += line
end
data += "}\n"
file = File.open(@filepath, 'w')
file.write(data)
file.close()
end
One thing to add is that these systems will be isolated from the internet. So I'd prefer to avoid large number of dependency libraries as I'll need to install / maintain them manually.
If your goal is to define your nodes programmatically, there is a much more straightforward way than reading and writing manifests. One of the built-in features of Puppet is "External Node Classifiers" (ENC). The basic idea is that something external to Puppet defines what a node should look like.
In the simplest form, the ENC can be a ruby/python/whatever script that writes out yaml with the list of classes and enabled parameters. Reading and writing yaml from ruby is as simple as it gets.
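As a rough sketch of the YAML-producing side of such a script: the helper name enc_document is made up, the elements hash mirrors the one from the question, and the output layout (a top-level classes key mapping class names to parameter hashes) is my understanding of the ENC format, so check it against the Puppet documentation for your version before relying on it.

```ruby
require 'yaml'

# Hypothetical helper: builds the YAML document an ENC script would print
def enc_document(elements)
  classes = elements.to_h do |name, enabled|
    # Enabled classes get no parameters; disabled ones get enabled: false
    [name.to_s, enabled ? {} : { 'enabled' => false }]
  end
  YAML.dump('classes' => classes)
end

elements = { firstclass: true, secondclass: false }
yaml = enc_document(elements)

# A real ENC script would look the node up by name (passed as its first
# argument) and print the document to standard output.
puts yaml
```

This needs nothing beyond the standard library, which fits the constraint that the systems are isolated from the internet.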
Ruby has some pretty good methods to iterate over data structures. See below for an example of how to rubify your code a little bit. I am by no means an expert on the subject, and have not tested the code. :)
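def read
# Read line by line; File.readlines returns an Array, which has no each_line
File.foreach(@filepath) do |line|
element = line[/.*\{'([^\']*)':/, 1]
next unless element # skip lines that do not declare a class
key = element.to_sym
# A known class is enabled unless the line explicitly disables it
@elements[key] = !line.include?('enabled => false') if @elements.include?(key)
end
end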
def write
File.open(@filepath, 'w') do |file|
file.puts "node #{@hostname} {"
@elements.each do |element|
if element[1]
file.puts " class {'#{element[0]}': }"
else
file.puts " class {'#{element[0]}': enabled => false }"
end
end
file.puts '}'
end
end
Hope this points you in the right direction.

How do I test reading a file?

I'm writing a test for one of my classes which has the following constructor:
def initialize(filepath)
@transactions = []
File.open(filepath).each do |line|
next if $. == 1
elements = line.split(/\t/).map { |e| e.strip }
transaction = Transaction.new(elements[0], Integer(1))
@transactions << transaction
end
end
I'd like to test this by using a fake file, not a fixture. So I wrote the following spec:
it "should read a file and create transactions" do
filepath = "path/to/file"
mock_file = double(File)
expect(File).to receive(:open).with(filepath).and_return(mock_file)
expect(mock_file).to receive(:each).with(no_args()).and_yield("phrase\tvalue\n").and_yield("yo\t2\n")
filereader = FileReader.new(filepath)
filereader.transactions.should_not be_nil
end
Unfortunately this fails because I'm relying on $. to equal 1 and increment on every line and for some reason that doesn't happen during the test. How can I ensure that it does?
Global variables make code hard to test. You could use each_with_index:
File.open(filepath) do |file|
file.each_with_index do |line, index|
next if index == 0 # zero based
# ...
end
end
But it looks like you're parsing a CSV file with a header line. Therefore I'd use Ruby's CSV library:
require 'csv'
CSV.foreach(filepath, col_sep: "\t", headers: true, converters: :numeric) do |row|
@transactions << Transaction.new(row['phrase'], row['value'])
end
You can (and should) use IO#each_line together with Enumerable#each_with_index which will look like:
File.open(filepath).each_line.each_with_index do |line, i|
next if i.zero? # skip the header line; each_with_index is zero-based
# …
end
Or you can drop the first line, and work with others:
File.open(filepath).each_line.drop(1).each do |line|
# …
end
If you don't want to mess around with mocking File for each test, you can try FakeFS, which implements an in-memory file system based on StringIO that cleans up automatically after your tests.
This way your tests don't need to change if your implementation changes.
require 'fakefs/spec_helpers'
describe "FileReader" do
include FakeFS::SpecHelpers
def stub_file file, content
FileUtils.mkdir_p File.dirname(file)
File.open( file, 'w' ){|f| f.write( content ); }
end
it "should read a file and create transactions" do
file_path = "path/to/file"
stub_file file_path, "phrase\tvalue\nyo\t2\n"
filereader = FileReader.new(file_path)
expect( filereader.transactions ).to_not be_nil
end
end
Be warned: this is an implementation of most of the file access in Ruby, passing it back onto the original method where possible. If you are doing anything advanced with files you may start running into bugs in the FakeFS implementation. I got stuck with some binary file byte read/write operations which weren't implemented in FakeFS quite how Ruby implemented them.
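Another option that avoids both mocking and FakeFS is to let the constructor accept any IO-like object, so a test can hand it a StringIO. This changes the constructor's signature, so treat it as a design suggestion rather than a drop-in fix; the Transaction struct below is a stand-in for the real class.

```ruby
require 'csv'
require 'stringio'

# Stand-in for the real Transaction class (an assumption for this sketch)
Transaction = Struct.new(:phrase, :value)

class FileReader
  attr_reader :transactions

  # Accepts any IO-like object, so tests need neither a fixture nor a mock
  def initialize(io)
    csv = CSV.new(io, col_sep: "\t", headers: true, converters: :numeric)
    @transactions = csv.map do |row|
      Transaction.new(row['phrase'], row['value'])
    end
  end
end

# In a test, a StringIO plays the role of the file
reader = FileReader.new(StringIO.new("phrase\tvalue\nyo\t2\n"))
p reader.transactions
```

Production code would then pass an open file instead, for example File.open(filepath) { |f| FileReader.new(f) }.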
