RSpec failing under unknown (seemingly specific) circumstances - Ruby

I'm relatively new to Ruby, coming to it from languages like Python and C#, so please forgive me if I'm missing an obvious bug.
The Issue
I was writing code for a tool to help myself and other developers generate .editorconfig files, as a way to learn Ruby. I use RSpec as my testing framework and wrote the following tests:
module EditorConfigGenerator
  RSpec.describe EditorConfigGenerator::FileGenerator do
    configs_sensible_defaults = [EditorConfigGenerator::EditorConfig.new({root: true, indent_style: 'space', indent_size: 2,
                                                                          end_of_line: 'lf', charset: 'utf-8',
                                                                          trim_trailing_whitespace: true, insert_final_newline: true})]
    file_generator = EditorConfigGenerator::FileGenerator.new(configs_sensible_defaults)

    context "A collection of one EditorConfig object is provided" do
      it "displays a preview of the file output" do
        # Newlines are automatically inserted into Ruby multi-line strings
        output = "root = true
[*]
indent_style = space
indent_size = 2
end_of_line = lf
charset = utf-8
trim_trailing_whitespace = true
insert_final_newline = true
"
        # String.delete(' ') to remove any potential formatting discrepancies
        # which don't affect the test of the content.
        expect(file_generator.preview_output.delete(' ')).to eq output.delete(' ')
      end

      it "creates a .editorconfig file in the correct location" do
        dot_editorconfig = file_generator.generate_config_file('test_output/.editorconfig')
        expect(dot_editorconfig.class).to eq File
        expect(File.exist? "test_output/.editorconfig").to eq true
      end
    end

    context "A collection of multiple EditorConfig objects is provided" do
      it "displays a preview of the file output" do
        configs = configs_sensible_defaults.push(EditorConfigGenerator::EditorConfig.new({file_type: '*.{js,py}', indent_size: 4}))
        file_generator = EditorConfigGenerator::FileGenerator.new(configs)
        output = "root = true
[*]
indent_style = space
indent_size = 2
end_of_line = lf
charset = utf-8
trim_trailing_whitespace = true
insert_final_newline = true
[*.{js,py}]
indent_size = 4"
        expect(file_generator.preview_output.delete(' ')).to eq output.delete(' ')
      end
    end
  end
end
When I ran these tests the first few times, they seemed to work fine. I then enabled the following recommended lines in my spec_helper.rb file:
config.profile_examples = 10
config.order = :random
Kernel.srand config.seed
I ran the tests a few more times, all of them passing until - seemingly randomly - I obtained the following failure output with seed 31166:
1) EditorConfigGenerator::FileGenerator A collection of one EditorConfig object is provided displays a preview of the file output
Failure/Error: expect(file_generator.preview_output.delete(' ')).to eq output.delete(' ')
expected: "root=true\n[*]\nindent_style=space\nindent_size=2\nend_of_line=lf\ncharset=utf-8\ntrim_trailing_whitespace=true\ninsert_final_newline=true\n"
got: "root=true\n[*]\nindent_style=space\nindent_size=2\nend_of_line=lf\ncharset=utf-8\ntrim_trailing_whitespace=true\ninsert_final_newline=true\n[*.{js,py}]\nindent_size=4"
(compared using ==)
Diff:
@@ -6,4 +6,6 @@
charset=utf-8
trim_trailing_whitespace=true
insert_final_newline=true
+[*.{js,py}]
+indent_size=4
# ./spec/file_generator_spec.rb:23:in `block (3 levels) in <module:EditorConfigGenerator>'
With other seeds I tried, it worked, but with seed 31166 it failed.
The Fix
The output made me suspect it was contextual, so I looked for a bug in my implementation. I didn't find one, so I thought it might be an issue with the way I defined the shared variable configs_sensible_defaults in the spec.
I decided to change the code to use before(:each) to assign an instance variable before each test (i.e. resetting the data properly) and it seemed to fix it.
Here's the fixed spec/file_generator_spec.rb
module EditorConfigGenerator
  RSpec.describe EditorConfigGenerator::FileGenerator do
    before(:each) do
      @configs_sensible_defaults = [EditorConfigGenerator::EditorConfig.new({root: true, indent_style: 'space', indent_size: 2, end_of_line: 'lf', charset: 'utf-8', trim_trailing_whitespace: true, insert_final_newline: true})]
      @file_generator = EditorConfigGenerator::FileGenerator.new(@configs_sensible_defaults)
    end

    context "A collection of one EditorConfig object is provided" do
      it "displays a preview of the file output" do
        # Newlines are automatically inserted into Ruby multi-line strings
        output = "root = true
[*]
indent_style = space
indent_size = 2
end_of_line = lf
charset = utf-8
trim_trailing_whitespace = true
insert_final_newline = true
"
        # String.delete(' ') to remove any potential formatting discrepancies
        # which don't affect the test of the content.
        expect(@file_generator.preview_output.delete(' ')).to eq output.delete(' ')
      end

      it "creates a .editorconfig file in the correct location" do
        dot_editorconfig = @file_generator.generate_config_file('test_output/.editorconfig')
        expect(dot_editorconfig.class).to eq File
        expect(File.exist? "test_output/.editorconfig").to eq true
      end
    end

    context "A collection of multiple EditorConfig objects is provided" do
      it "displays a preview of the file output" do
        configs = @configs_sensible_defaults.clone.push(EditorConfigGenerator::EditorConfig.new({file_type: '*.{js,py}', indent_size: 4}))
        @file_generator = EditorConfigGenerator::FileGenerator.new(configs)
        output = "root = true
[*]
indent_style = space
indent_size = 2
end_of_line = lf
charset = utf-8
trim_trailing_whitespace = true
insert_final_newline = true
[*.{js,py}]
indent_size = 4"
        expect(@file_generator.preview_output.delete(' ')).to eq output.delete(' ')
      end
    end
  end
end
Here is a diff output of both files (view on Github):
 RSpec.describe EditorConfigGenerator::FileGenerator do
-  configs_sensible_defaults = [EditorConfigGenerator::EditorConfig.new({root: true, indent_style: 'space', indent_size: 2,
-                                                                        end_of_line: 'lf', charset: 'utf-8',
-                                                                        trim_trailing_whitespace: true, insert_final_newline: true})]
-  file_generator = EditorConfigGenerator::FileGenerator.new(configs_sensible_defaults)
+  before(:each) do
+    @configs_sensible_defaults = [EditorConfigGenerator::EditorConfig.new({root: true, indent_style: 'space', indent_size: 2, end_of_line: 'lf', charset: 'utf-8', trim_trailing_whitespace: true, insert_final_newline: true})]
+    @file_generator = EditorConfigGenerator::FileGenerator.new(@configs_sensible_defaults)
+  end
   context "A collection of one EditorConfig object is provided" do
     it "displays a preview of the file output" do
       ...
       "
       # String.delete(' ') to remove any potential formatting discrepancies
       # which don't affect the test of the content.
-      expect(file_generator.preview_output.delete(' ')).to eq output.delete(' ')
+      expect(@file_generator.preview_output.delete(' ')).to eq output.delete(' ')
     end
     it "creates a .editorconfig file in the correct location" do
-      dot_editorconfig = file_generator.generate_config_file('test_output/.editorconfig')
+      dot_editorconfig = @file_generator.generate_config_file('test_output/.editorconfig')
       expect(dot_editorconfig.class).to eq File
       expect(File.exist? "test_output/.editorconfig").to eq true
     end
   end
   context "A collection of multiple EditorConfig objects is provided" do
Finally, for context, here are the relevant file_generator.rb and editor_config.rb files
module EditorConfigGenerator
  class FileGenerator
    def initialize(configs)
      @configs = configs
    end

    def preview_output
      output = ""
      @configs.each do |config|
        if output.include? "root="
          output << config.to_s_without_root
          next
        end
        output << config.to_s
      end
      return output.rstrip if @configs.size > 1
      output
    end

    def generate_config_file(location='.editorconfig')
      File.delete(location) if File.exist? location
      file = File.new(location, "w")
      file.print(preview_output)
      file.close
      file
    end
  end
end
module EditorConfigGenerator
  class EditorConfig
    attr_reader :root, :indent_style, :indent_size,
                :end_of_line, :charset, :trim_trailing_whitespace,
                :insert_final_newline, :file_type

    def initialize(options)
      @root = nil
      @file_type = :*
      transform_options(options)
      set_options(options)
    end

    def set_options(options)
      @root = options[:root] unless options[:root].nil?
      @indent_style = options[:indent_style] unless options[:indent_style].nil?
      @indent_size = options[:indent_size] unless options[:indent_size].nil?
      @end_of_line = options[:end_of_line] unless options[:end_of_line].nil?
      @charset = options[:charset] unless options[:charset].nil?
      @trim_trailing_whitespace = options[:trim_trailing_whitespace] unless options[:trim_trailing_whitespace].nil?
      @insert_final_newline = options[:insert_final_newline] unless options[:insert_final_newline].nil?
      @file_type = options[:file_type] unless options[:file_type].nil?
    end

    def transform_options(options)
      options[:root] = true if options[:root] == 'y'
      options[:root] = false if options[:root] == 'n'
      options[:trim_trailing_whitespace] = true if options[:trim_trailing_whitespace] == 'y'
      options[:trim_trailing_whitespace] = false if options[:trim_trailing_whitespace] == 'n'
      options[:insert_final_newline] = true if options[:insert_final_newline] == 'y'
      options[:insert_final_newline] = false if options[:insert_final_newline] == 'n'
    end

    def to_s
      config_string = ""
      config_string << "root = #{@root.to_s}\n" unless @root.nil?
      config_string << "[#{@file_type}]\n"
      config_string << "indent_style = #{@indent_style}\n" unless @indent_style.nil?
      config_string << "indent_size = #{@indent_size}\n" unless @indent_size.nil?
      config_string << "end_of_line = #{@end_of_line}\n" unless @end_of_line.nil?
      config_string << "charset = #{@charset}\n" unless @charset.nil?
      config_string << "trim_trailing_whitespace = #{@trim_trailing_whitespace.to_s}\n" unless @trim_trailing_whitespace.nil?
      config_string << "insert_final_newline = #{@insert_final_newline.to_s}\n" unless @insert_final_newline.nil?
      config_string
    end

    def to_s_without_root
      lines = to_s.lines
      lines.delete_at 0
      lines.join
    end

    def to_s_without_line(line_identifier)
      lines = to_s.lines
      index = lines.index(line_identifier)
      lines.delete_at index
      lines.join
    end
  end
end
The Question
Could someone explain exactly why the change fixed the issue? Did the change fix the issue?
I believe it has to do with the collection in configs_sensible_defaults being mutated and not reset correctly between examples, which would explain why a certain seed triggers a failure (presumably under that seed the test in the second context runs before the one in the first).

The reason it was working sometimes and not other times is that one of your tests (in the second context) reassigned file_generator. The test in the first context was using the file_generator from the shared outer scope. As long as that test ran first, it worked as expected; the test in the second context then ran, reassigned file_generator, and also passed.
Since the tests ran in random order, everything passed whenever they happened to run in the order they appear in the file. When the second context's test ran first, it overwrote file_generator; that test still passed, but the first test then ran against the overwritten value.
You can use a before block to configure things for each test, as you did, but you should only configure common, shared values in a before block. Anything that's going to be different for each context or each test should be initialized closer to the scope in which it is used.
In your examples above, I wouldn't use any before block, at least not in the outer scope. Both @configs_sensible_defaults and @file_generator are different in each context and should not be shared at all. If you want to group a bunch of tests together in a context with the same defaults and generator, you can put a before block inside the context block to initialize things correctly for each context.
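To illustrate that suggestion, here's a minimal sketch of per-context setup (the trimmed-down option hashes are placeholders, not the full defaults from the spec above):
RSpec.describe EditorConfigGenerator::FileGenerator do
  context "A collection of one EditorConfig object is provided" do
    # Fresh, context-specific setup for every example in this context.
    before(:each) do
      @configs = [EditorConfigGenerator::EditorConfig.new({root: true, indent_size: 2})]
      @file_generator = EditorConfigGenerator::FileGenerator.new(@configs)
    end
    # examples for the single-config case go here...
  end

  context "A collection of multiple EditorConfig objects is provided" do
    # Independent setup; nothing is shared or mutated across contexts.
    before(:each) do
      @configs = [
        EditorConfigGenerator::EditorConfig.new({root: true, indent_size: 2}),
        EditorConfigGenerator::EditorConfig.new({file_type: '*.{js,py}', indent_size: 4})
      ]
      @file_generator = EditorConfigGenerator::FileGenerator.new(@configs)
    end
    # examples for the multi-config case go here...
  end
end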

Related

How to read multiple XML files then output to multiple CSV files with the same XML filenames

I am trying to parse multiple XML files and then output them to CSV files listing the proper rows and columns.
I was able to do this one file at a time by hard-coding the filename and writing to a specific output file name:
File.open('H:/output/xmloutput.csv','w')
I would like to write to multiple files, naming each one after its source XML file, without hard-coding anything. I've tried doing it multiple ways but have had no luck so far.
Sample XML:
<?xml version="1.0" encoding="UTF-8"?>
<record:root>
<record:Dataload_Request>
<record:name>Bob Chuck</record:name>
<record:Address_Data>
<record:Street_Address>123 Main St</record:Street_Address>
<record:Postal_Code>12345</record:Postal_Code>
</record:Address_Data>
<record:Age>45</record:Age>
</record:Dataload_Request>
</record:root>
Here is what I've tried:
require 'nokogiri'
require 'set'
files = ''
input_folder = "H:/input"
output_folder = "H:/output"
if input_folder[input_folder.length-1,1] == '/'
input_folder = input_folder[0,input_folder.length-1]
end
if output_folder[output_folder.length-1,1] != '/'
output_folder = output_folder + '/'
end
files = Dir[input_folder + '/*.xml'].sort_by{ |f| File.mtime(f)}
file = File.read(input_folder + '/' + files)
doc = Nokogiri::XML(file)
record = {} # hashes
keys = Set.new
records = [] # array
csv = ""
doc.traverse do |node|
value = node.text.gsub(/\n +/, '')
if node.name != "text" # skip these nodes: if class isnt text then skip
if value.length > 0 # skip empty nodes
key = node.name.gsub(/wd:/,'').to_sym
if key == :Dataload_Request && !record.empty?
records << record
record = {}
elsif key[/^root$|^document$/]
# neglect these keys
else
key = node.name.gsub(/wd:/,'').to_sym
# in case our value is html instead of text
record[key] = Nokogiri::HTML.parse(value).text
# add to our key set only if not already in the set
keys << key
end
end
end
end
# build our csv
File.open('H:/output/.*csv', 'w') do |file|
file.puts %Q{"#{keys.to_a.join('","')}"}
records.each do |record|
keys.each do |key|
file.write %Q{"#{record[key]}",}
end
file.write "\n"
end
print ''
print 'output files ready!'
print ''
end
I have been getting 'read memory': no implicit conversion of Array into String (TypeError) and other errors.
Here's a quick peer-review of your code, something like you'd get in a corporate environment...
Instead of writing:
input_folder = "H:/input"
input_folder[input_folder.length-1,1] == '/' # => false
Consider doing it using the -1 offset from the end of the string to access the character:
input_folder[-1] # => "t"
That simplifies your logic making it more readable because it's lacking unnecessary visual noise:
input_folder[-1] == '/' # => false
See [] and []= in the String documentation.
This looks like a bug to me:
files = Dir[input_folder + '/*.xml'].sort_by{ |f| File.mtime(f)}
file = File.read(input_folder + '/' + files)
files is an array of filenames. input_folder + '/' + files is appending an array to a string:
foo = ['1', '2'] # => ["1", "2"]
'/parent/' + foo # =>
# ~> -:9:in `+': no implicit conversion of Array into String (TypeError)
# ~> from -:9:in `<main>'
How you want to deal with that is left as an exercise for the programmer.
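For what it's worth, one straightforward way to deal with it (just a sketch, assuming you want to process every file) is to iterate over the array rather than concatenating it onto the path:
require 'nokogiri'

input_folder = 'H:/input' # as in the question

# Dir[] already returns paths that include the folder, so each entry can be read directly.
Dir[input_folder + '/*.xml'].sort_by { |f| File.mtime(f) }.each do |path|
  doc = Nokogiri::XML(File.read(path))
  # ... extract records from `doc` and write a CSV named after `path` ...
end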
doc.traverse do |node|
is icky because it sidesteps the power of Nokogiri being able to search for a particular tag using accessors. Very rarely do we need to iterate over a document tag by tag, usually only when we're peeking at its structure and layout. traverse is slower so use it as a very last resort.
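As a rough illustration of that point (not code from the question), a targeted query lets Nokogiri do the walking; remove_namespaces! is used here purely to keep the sketch short:
require 'nokogiri'

doc = Nokogiri::XML(File.read('sample.xml')) # filename assumed
doc.remove_namespaces! # drop the record: prefix so plain XPath works

# Visit only the elements you care about instead of every node.
doc.xpath('//Dataload_Request').each do |request|
  name = request.at_xpath('name')&.text
  age  = request.at_xpath('Age')&.text
  puts [name, age].inspect
end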
length is nice but isn't needed when checking whether a string has content:
value = 'foo'
value.length > 0 # => true
value > '' # => true
value = ''
value.length > 0 # => false
value > '' # => false
Programmers coming from Java like to use the accessors but I like being lazy, probably because of my C and Perl backgrounds.
Be careful with sub and gsub as they don't do what you're thinking they do. Both expect a regular expression, but will take a string, which they escape before beginning their scan.
You're passing in a regular expression, which is OK in this case, but it could cause unexpected problems if you don't remember all the rules for pattern matching and that gsub scans until the end of the string:
foo = 'wd:barwd:' # => "wd:barwd:"
key = foo.gsub(/wd:/,'') # => "bar"
In general I recommend people think a couple times before using regular expressions. I've seen some gaping holes opened up in logic written by fairly advanced programmers because they didn't know what the engine was going to do. They're wonderfully powerful, but need to be used surgically, not as a universal solution.
The same thing happens with a string, because gsub doesn't know when to quit:
key = foo.gsub('wd:','') # => "bar"
So, if you're looking to change just the first instance use sub:
key = foo.sub('wd:','') # => "barwd:"
I'd do it a little differently though.
foo = 'wd:bar'
I can check to see what the first three characters are:
foo[0,3] # => "wd:"
Or I can replace them with something else using string indexing:
foo[0,3] = ''
foo # => "bar"
There's more but I think that's enough for now.
You should use Ruby's CSV class. Also, you don't need to do any string matching or regex stuff. Use Nokogiri to target elements. If you know the node names in the XML will be consistent it should be pretty simple. I'm not exactly sure if this is the output you want, but this should get you in the right direction:
require 'nokogiri'
require 'csv'
def xml_to_csv(filename)
  xml_str = File.read(filename)
  xml_str.gsub!('record:', '') # remove the record: namespace
  doc = Nokogiri::XML xml_str
  csv_filename = filename.gsub('.xml', '.csv')
  CSV.open(csv_filename, 'wb') do |row|
    row << ['name', 'street_address', 'postal_code', 'age']
    row << [
      doc.xpath('//name').text,
      doc.xpath('//Street_Address').text,
      doc.xpath('//Postal_Code').text,
      doc.xpath('//Age').text,
    ]
  end
end
# iterate over all xml files
Dir.glob('*.xml').each { |filename| xml_to_csv(filename) }

Check command line argument in Ruby code is present

I want to execute Ruby code if command line argument is present:
ruby sanity_checks.rb --trx-request -e staging_psp -g all
Ruby code:
def execute
command_options = CommandOptionsProcessor.parse_command_options
request_type = command_options[:env]
tested_env = command_options[:trx-request] // check here if flag is present
tested_gateways = gateway_names(env: tested_env, gateway_list: command_options[:gateways])
error_logs = []
if(request_type.nil?)
tested_gateways.each do |gateway|
........
end
end
raise error_logs.join("\n") if error_logs.any?
end
How can I get the argument --trx-request and check whether it is present?
EDIT:
Command parser:
class CommandOptionsProcessor
def self.parse_command_options
command_options = {}
opt_parser = OptionParser.new do |opt|
opt.banner = 'Usage: ruby sanity_checks.rb [OPTIONS]'
opt.separator "Options:"
opt.on('-e TEST_ENV', '--env TEST_ENV','Set tested environment (underscored).') do |setting|
command_options[:env] = setting
end
opt.on('-g X,Y,Z', '--gateways X,Y,Z', Array, 'Set tested gateways (comma separated, no spaces).') do |setting|
command_options[:gateways] = setting
end
end
opt_parser.parse!(ARGV)
command_options
end
end
Can you advise?
You will need to add a boolean switch option to your OptionParser, like so:
opt.on('-t','--[no-]trx-request','Signifies a TRX Request') do |v|
# note we used an underscore rather than a hyphen to make this symbol
# easier to access
command_options[:trx_request] = v
end
Then in your execute method you can access this as
command_options[:trx_request]
If you need to have a default value you can add one in the parse_command_options method by setting it outside the OptionParser as command_options[:trx_request] = false
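Putting those pieces together, parse_command_options might look roughly like this (the default and the help text are illustrative):
require 'optparse'

class CommandOptionsProcessor
  def self.parse_command_options
    # Default: assume no --trx-request unless the flag is passed.
    command_options = { trx_request: false }

    opt_parser = OptionParser.new do |opt|
      opt.banner = 'Usage: ruby sanity_checks.rb [OPTIONS]'
      opt.on('-e TEST_ENV', '--env TEST_ENV', 'Set tested environment (underscored).') do |setting|
        command_options[:env] = setting
      end
      opt.on('-g X,Y,Z', '--gateways X,Y,Z', Array, 'Set tested gateways (comma separated, no spaces).') do |setting|
        command_options[:gateways] = setting
      end
      opt.on('-t', '--[no-]trx-request', 'Signifies a TRX Request') do |v|
        command_options[:trx_request] = v
      end
    end

    opt_parser.parse!(ARGV)
    command_options
  end
end

# command_options[:trx_request] is then true when --trx-request was given.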

Code not actually asserting in RSpec?

I'm new to Ruby, and in various open source projects I've noticed a number of "statements" in some RSpec descriptions that appear not to accomplish what was intended, as if the author wanted to make an assertion but didn't. Are these coding errors, or is there some RSpec or Ruby magic I'm missing? (Likelihood of weirdly overloaded operators?)
The examples, with #??? added to the suspect lines:
(rubinius/spec/ruby/core/array/permutation_spec.rb)
it "returns no permutations when the given length has no permutations" do
@numbers.permutation(9).entries.size == 0 #???
@numbers.permutation(9) { |n| @yielded << n }
@yielded.should == []
end
(discourse/spec/models/topic_link_spec.rb)
it 'works' do
# ensure other_topic has a post
post
url = "http://#{test_uri.host}/t/#{other_topic.slug}/#{other_topic.id}"
topic.posts.create(user: user, raw: 'initial post')
linked_post = topic.posts.create(user: user, raw: "Link to another topic: #{url}")
TopicLink.extract_from(linked_post)
link = topic.topic_links.first
expect(link).to be_present
expect(link).to be_internal
expect(link.url).to eq(url)
expect(link.domain).to eq(test_uri.host)
link.link_topic_id == other_topic.id #???
expect(link).not_to be_reflection
...
(chef/spec/unit/chef_fs/parallelizer.rb)
context "With :ordered => false (unordered output)" do
it "An empty input produces an empty output" do
parallelize([], :ordered => false) do
sleep 10
end.to_a == [] #???
expect(elapsed_time).to be < 0.1
end
(bosh/spec/external/aws_bootstrap_spec.rb)
it "configures ELBs" do
load_balancer = elb.load_balancers.detect { |lb| lb.name == "cfrouter" }
expect(load_balancer).not_to be_nil
expect(load_balancer.subnets.sort {|s1, s2| s1.id <=> s2.id }).to eq([cf_elb1_subnet, cf_elb2_subnet].sort {|s1, s2| s1.id <=> s2.id })
expect(load_balancer.security_groups.map(&:name)).to eq(["web"])
config = Bosh::AwsCliPlugin::AwsConfig.new(aws_configuration_template)
hosted_zone = route53.hosted_zones.detect { |zone| zone.name == "#{config.vpc_generated_domain}." }
record_set = hosted_zone.resource_record_sets["\\052.#{config.vpc_generated_domain}.", 'CNAME'] # E.g. "*.midway.cf-app.com."
expect(record_set).not_to be_nil
record_set.resource_records.first[:value] == load_balancer.dns_name #???
expect(record_set.ttl).to eq(60)
end
I don't think there is any special behavior. I think you've found errors in the test code.
This doesn't work because there's no assertion, only a comparison:
@numbers.permutation(9).entries.size == 0
It would need to be written as:
@numbers.permutation(9).entries.size.should == 0
Or using the newer RSpec syntax:
expect(@numbers.permutation(9).entries.size).to eq(0)

Ruby Set include function returning true even when item is not in set

I am writing a program that searches for tweets and images and combines the two. I have built two arrays that hold MD5 hashes of the tweets used and the URIs of the images used, which I check against before using results from a new search, so I don't use the same thing again.
Here is the code I use to check if the tweet contains characters I don't want or isn't in the set of MD5 hashes
unless (/#/.match(tweet[0]) or /http/.match(tweet[0]) or /^#/.match(tweet[0]) or md5list.include?(Digest::MD5.hexdigest(tweet[0])))
where md5list is the set which gets populated like this
md5list << "#{Digest::MD5.hexdigest(tweet[0])}"
but md5list.include?(Digest::MD5.hexdigest(tweet[0])) seems to always return true, even when the arrays are empty
Can anyone spot where I'm messing up here?
Thanks
Edit:
The set contains a number of MD5 hashes of strings of text. I want to search this set for a hash of a random string I have, and only execute code if it isn't already present in the set.
To do this I've used, essentially, unless (set.include?(Digest::MD5.hexdigest("test")))
which should return true if the set does include it, and false if it does not. I have tested this in irb and it seems to work:
irb(main):009:0> s = Set.new
=> #<Set: {}>
irb(main):010:0> s << Digest::MD5.hexdigest("test")
=> #<Set: {"9cdfb439c7876e703e307864c9167a15"}>
irb(main):011:0> s.include?("test")
=> false
irb(main):012:0> s.include?(Digest::MD5.hexdigest("test"))
=> true
irb(main):013:0> s.include?(Digest::MD5.hexdigest("test2"))
=> false
but in my implementation it seems to always return true.
EDIT
Some, uh, more stuff (here is the full code; I'll try not to post huge chunks: https://github.com/rolandshoemaker/bleak-tweets/blob/master/bleak-tweet.rb)
This is the function that is failing. It should search for an image, and only if the MD5 hash of the URI isn't already in the imagemd5 set will it retrieve the image, do some stuff, then add the MD5 hash of the URI to the set so that the same image won't be used again.
def imageSearch(tag, tweet, imagemd5)
Google::Search::Image.new(:query => tag).each do |image|
unless (imagemd5.include?(Digest::MD5.hexdigest(image.uri)))
filename = String.new
open(image.uri) { |f|
File.open("current", "wb") do |file|
file.puts f.read
end
img = Magick::Image::read("current").first
img.resize_to_fit!(600, 600)
drawable = Magick::Draw.new
drawable.pointsize = 18.0
#drawable.gravity = Magick::SouthEastGravity
drawable.font_weight = Magick::BoldWeight
tm = drawable.get_type_metrics(img, tweet)
drawable.fill = 'black'
#drawable.opacity(1)
xy1 = [0, (((img.rows)*6)/10)]
xy2 = [(((img.columns)*8)/10), (((img.rows)*9)/10)]
drawable.rectangle(xy1[0],xy1[1],xy2[0],xy2[1])
drawable.draw(img)
position = xy1[1]+10
wraptext(tweet, ((xy2[0]-xy1[0])-10)/10).each do |row|
drawable.annotate(img,(xy2[0]-xy1[0])-10,(xy2[1]-xy1[1])-10,10,position += 15,row) {self.fill='white'}
end
filename = "testy." << img.format
img.write(filename)
}
puts imagemd5.include?(Digest::MD5.hexdigest(image.uri)).inspect
imagemd5 << "#{Digest::MD5.hexdigest(image.uri)}"
puts imagemd5.include?(Digest::MD5.hexdigest(image.uri)).inspect
tumblrPost(tag, filename)
File.delete(filename)
File.delete("current")
break
end
end
end
This outputs an image and prints the following to the console (with an example tweet):
Damn this swollen ankle. Smh #injured #painful
false
true
The problem is that, in this case, the image the program used was one that had already been used, and yet imagemd5.include?(Digest::MD5.hexdigest(image.uri)) is returning false where it should be true.
From what you're saying, it seems like you're questioning why:
["#{foo}"].include?(foo)
is always true where foo is the expression MD5.hexdigest(tweet[0]). But the above expression will always be true as long as foo returns a string, which MD5.hexdigest does.
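A quick irb-style illustration of that point:
require 'digest'

foo = Digest::MD5.hexdigest('some tweet text')
["#{foo}"].include?(foo)         # => true: "#{foo}" is the very same string as foo
['something else'].include?(foo) # => false: inclusion only holds for values actually added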

How do I force one field in Ruby's CSV output to be wrapped with double-quotes?

I'm generating some CSV output using Ruby's built-in CSV. Everything works fine, but the customer wants the name field in the output to have wrapping double-quotes so the output looks like the input file. For instance, the input looks something like this:
1,1.1.1.1,"Firstname Lastname",more,fields
2,2.2.2.2,"Firstname Lastname, Jr.",more,fields
CSV's output, which is correct, looks like:
1,1.1.1.1,Firstname Lastname,more,fields
2,2.2.2.2,"Firstname Lastname, Jr.",more,fields
I know CSV is doing the right thing by not double-quoting the third field just because it has embedded blanks, and wrapping the field with double-quotes when it has the embedded comma. What I'd like to do, to help the customer feel warm and fuzzy, is tell CSV to always double-quote the third field.
I tried wrapping the field in double-quotes in my to_a method, which creates a "Firstname Lastname" field being passed to CSV, but CSV laughed at my puny-human attempt and output """Firstname Lastname""". That is the correct thing to do because it's escaping the double-quotes, so that didn't work.
Then I tried setting CSV's :force_quotes => true in the open method, which output double-quotes wrapping all fields as expected, but the customer didn't like that, which I expected also. So, that didn't work either.
I've looked through the Table and Row docs and nothing appeared to give me access to the "generate a String field" method, or a way to set a "for field n always use quoting" flag.
I'm about to dive into the source to see if there are some super-secret tweaks, or if there's a way to monkey-patch CSV and bend it to my will, but wondered if anyone had some special knowledge or had run into this before.
And, yes, I know I could roll my own CSV output, but I prefer not to reinvent well-tested wheels. I'm also aware of FasterCSV; that's now part of Ruby 1.9.2, which I'm using, so explicitly using FasterCSV buys me nothing special. Also, I'm not using Rails and have no intention of rewriting this in Rails, so unless you have a cute way of implementing it using a small subset of Rails, don't bother. I'll downvote any recommendations to use any of those ways just because you didn't bother to read this far.
Well, there's a way to do it but it wasn't as clean as I'd hoped the CSV code could allow.
I had to subclass CSV, then override the CSV#<< method and add another method, forced_quote_fields=, to make it possible to define the fields I want to force quoting on, plus pull two lambdas out of other methods. At least it works for what I want:
require 'csv'

class MyCSV < CSV
  def <<(row)
    # make sure headers have been assigned
    if header_row? and [Array, String].include? @use_headers.class
      parse_headers # won't read data for Array or String
      self << @headers if @write_headers
    end

    # handle CSV::Row objects and Hashes
    row = case row
          when self.class::Row then row.fields
          when Hash then @headers.map { |header| row[header] }
          else row
          end

    @headers = row if header_row?
    @lineno += 1

    @do_quote ||= lambda do |field|
      field = String(field)
      encoded_quote = @quote_char.encode(field.encoding)
      encoded_quote +
        field.gsub(encoded_quote, encoded_quote * 2) +
        encoded_quote
    end

    @quotable_chars ||= encode_str("\r\n", @col_sep, @quote_char)
    @forced_quote_fields ||= []

    @my_quote_lambda ||= lambda do |field, index|
      if field.nil? # represent +nil+ fields as empty unquoted fields
        ""
      else
        field = String(field) # Stringify fields
        # represent empty fields as empty quoted fields
        if (
          field.empty? or
          field.count(@quotable_chars).nonzero? or
          @forced_quote_fields.include?(index)
        )
          @do_quote.call(field)
        else
          field # unquoted field
        end
      end
    end

    output = row.map.with_index(&@my_quote_lambda).join(@col_sep) + @row_sep # quote and separate

    if (
      @io.is_a?(StringIO) and
      output.encoding != raw_encoding and
      (compatible_encoding = Encoding.compatible?(@io.string, output))
    )
      @io = StringIO.new(@io.string.force_encoding(compatible_encoding))
      @io.seek(0, IO::SEEK_END)
    end

    @io << output

    self # for chaining
  end
  alias_method :add_row, :<<
  alias_method :puts, :<<

  def forced_quote_fields=(indexes=[])
    @forced_quote_fields = indexes
  end
end
That's the code. Calling it:
data = [
%w[1 2 3],
[ 2, 'two too', 3 ],
[ 3, 'two, too', 3 ]
]
quote_fields = [1]
puts "Ruby version: #{ RUBY_VERSION }"
puts "Quoting fields: #{ quote_fields.join(', ') }", "\n"
csv = MyCSV.generate do |_csv|
_csv.forced_quote_fields = quote_fields
data.each do |d|
_csv << d
end
end
puts csv
results in:
# >> Ruby version: 1.9.2
# >> Quoting fields: 1
# >>
# >> 1,"2",3
# >> 2,"two too",3
# >> 3,"two, too",3
This post is old, but I can't believe no one thought of this.
Why not do:
csv = CSV.generate :quote_char => "\0" do |csv|
where \0 is a null character, then just add quotes to each field where they are needed:
csv << [product.upc, "\"" + product.name + "\"" # ...
Then at the end you can do a
csv.gsub!(/\0/, '')
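Stitched together, the idea looks roughly like this (the sample rows are made up to mirror the question):
require 'csv'

rows = [
  [1, '1.1.1.1', 'Firstname Lastname', 'more', 'fields'],
  [2, '2.2.2.2', 'Firstname Lastname, Jr.', 'more', 'fields']
]

# Use a NUL byte as the quote character so CSV itself never emits visible quotes,
# hand-wrap the field you want quoted, then strip the NULs at the end.
csv = CSV.generate(quote_char: "\0") do |out|
  rows.each do |id, ip, name, *rest|
    out << [id, ip, "\"#{name}\"", *rest]
  end
end
csv.gsub!(/\0/, '')

puts csv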
I doubt this will help the customer feel warm and fuzzy after all this time, but this seems to work:
require 'csv'

# prepare a lambda which converts the field with index 2
quote_col2 = lambda do |field, fieldinfo|
  # fieldinfo has a line-, header- and index-method
  if fieldinfo.index == 2 && !field.start_with?('"') then
    '"' + field + '"'
  else
    field
  end
end

# specify the above lambda as one of the converters
csv = CSV.read("test1.csv", :converters => [quote_col2])
p csv
# => [["aaa", "bbb", "\"ccc\"", "ddd"], ["fff", "ggg", "\"hhh\"", "iii"]]
File.open("test1.txt", "w") { |out| csv.each { |line| out.puts line.join(",") } }
CSV has a force_quotes option that will force it to quote all fields (it may not have been there when you posted this originally). I realize this isn't exactly what you were proposing, but it's less monkey patching.
2.1.0 :008 > puts CSV.generate_line [1,'1.1.1.1','Firstname Lastname','more','fields']
1,1.1.1.1,Firstname Lastname,more,fields
2.1.0 :009 > puts CSV.generate_line [1,'1.1.1.1','Firstname Lastname','more','fields'], force_quotes: true
"1","1.1.1.1","Firstname Lastname","more","fields"
The drawback is that the first integer value ends up listed as a string, which changes things when you import into Excel.
It's been a long time, but since the CSV library has been patched, this might help someone if they're now facing this issue:
require 'csv'
# puts CSV::VERSION # this should be 3.1.9+

headers = ['id', 'ip', 'name', 'foo', 'bar']
data = [
  [1, '1.1.1.1', 'Firstname Lastname', 'more', 'fields'],
  [2, '2.2.2.2', 'Firstname Lastname, Jr.', 'more', 'fields']
]

quoter = Proc.new do |field, field_meta|
  # the index starts at zero, that's why the third field would be 2
  # (the line check skips the header row):
  field = '"' + field + '"' if field_meta.index == 2 && field_meta.line > 1
  field = '"' + field + '"' if field.is_a?(String) && field.include?(',')
  # ^ CSV format needs to escape fields containing comma(s): ,
  field
end

file = CSV.generate(headers: true, quote_char: '', write_converters: quoter) do |csv|
  csv << headers
  data.each { |row| csv << row }
end

puts file
the output would be:
id,ip,name,foo,bar
1,1.1.1.1,"Firstname Lastname",more,fields
2,2.2.2.2,"Firstname Lastname, Jr.",more,fields
It doesn't look like there's any way to do this with the existing CSV implementation short of monkey-patching/rewriting it.
However, assuming you have full control over the source data, you could do this:
Append a custom string including a comma (i.e. one that would never be naturally found in the data) to the end of the field in question for each row; maybe something like "FORCE_COMMAS,".
Generate the CSV output.
Now that you have CSV output with quotes on every row for your field, remove the custom string: csv.gsub!(/FORCE_COMMAS,/, "")
Customer feels warm and fuzzy.
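For example, a rough sketch of that idea (FORCE_COMMAS, is just the marker suggested above):
require 'csv'

rows = [
  [1, '1.1.1.1', 'Firstname Lastname', 'more', 'fields'],
  [2, '2.2.2.2', 'Firstname Lastname, Jr.', 'more', 'fields']
]

csv = CSV.generate do |out|
  rows.each do |id, ip, name, *rest|
    # Tack the marker onto the name so CSV is forced to quote that field...
    out << [id, ip, name + 'FORCE_COMMAS,', *rest]
  end
end

# ...then strip the marker back out, leaving the quotes behind.
csv.gsub!(/FORCE_COMMAS,/, '')
puts csv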
CSV has changed a bit in Ruby 2.1, as mentioned by @jwadsack, however here's a working version of @the-tin-man's MyCSV. Slightly modified: you set the forced_quote_fields via options.
MyCSV.generate(forced_quote_fields: [1]) do |_csv|...
The modified code:
require 'csv'

class MyCSV < CSV
  def <<(row)
    # make sure headers have been assigned
    if header_row? and [Array, String].include? @use_headers.class
      parse_headers # won't read data for Array or String
      self << @headers if @write_headers
    end

    # handle CSV::Row objects and Hashes
    row = case row
          when self.class::Row then row.fields
          when Hash then @headers.map { |header| row[header] }
          else row
          end

    @headers = row if header_row?
    @lineno += 1

    output = row.map.with_index(&@quote).join(@col_sep) + @row_sep # quote and separate
    if @io.is_a?(StringIO) and
       output.encoding != (encoding = raw_encoding)
      if @force_encoding
        output = output.encode(encoding)
      elsif (compatible_encoding = Encoding.compatible?(@io.string, output))
        @io.set_encoding(compatible_encoding)
        @io.seek(0, IO::SEEK_END)
      end
    end

    @io << output

    self # for chaining
  end

  def init_separators(options)
    # store the selected separators
    @col_sep = options.delete(:col_sep).to_s.encode(@encoding)
    @row_sep = options.delete(:row_sep) # encode after resolving :auto
    @quote_char = options.delete(:quote_char).to_s.encode(@encoding)
    @forced_quote_fields = options.delete(:forced_quote_fields) || []

    if @quote_char.length != 1
      raise ArgumentError, ":quote_char has to be a single character String"
    end

    #
    # automatically discover row separator when requested
    # (not fully encoding safe)
    #
    if @row_sep == :auto
      if [ARGF, STDIN, STDOUT, STDERR].include?(@io) or
         (defined?(Zlib) and @io.class == Zlib::GzipWriter)
        @row_sep = $INPUT_RECORD_SEPARATOR
      else
        begin
          #
          # remember where we were (pos() will raise an exception if @io is pipe
          # or not opened for reading)
          #
          saved_pos = @io.pos
          while @row_sep == :auto
            #
            # if we run out of data, it's probably a single line
            # (ensure will set default value)
            #
            break unless sample = @io.gets(nil, 1024)
            # extend sample if we're unsure of the line ending
            if sample.end_with? encode_str("\r")
              sample << (@io.gets(nil, 1) || "")
            end

            # try to find a standard separator
            if sample =~ encode_re("\r\n?|\n")
              @row_sep = $&
              break
            end
          end

          # tricky seek() clone to work around GzipReader's lack of seek()
          @io.rewind
          # reset back to the remembered position
          while saved_pos > 1024 # avoid loading a lot of data into memory
            @io.read(1024)
            saved_pos -= 1024
          end
          @io.read(saved_pos) if saved_pos.nonzero?
        rescue IOError # not opened for reading
          # do nothing: ensure will set default
        rescue NoMethodError # Zlib::GzipWriter doesn't have some IO methods
          # do nothing: ensure will set default
        rescue SystemCallError # pipe
          # do nothing: ensure will set default
        ensure
          #
          # set default if we failed to detect
          # (stream not opened for reading, a pipe, or a single line of data)
          #
          @row_sep = $INPUT_RECORD_SEPARATOR if @row_sep == :auto
        end
      end
    end
    @row_sep = @row_sep.to_s.encode(@encoding)

    # establish quoting rules
    @force_quotes = options.delete(:force_quotes)
    do_quote = lambda do |field|
      field = String(field)
      encoded_quote = @quote_char.encode(field.encoding)
      encoded_quote +
        field.gsub(encoded_quote, encoded_quote * 2) +
        encoded_quote
    end
    quotable_chars = encode_str("\r\n", @col_sep, @quote_char)
    @quote = if @force_quotes
      do_quote
    else
      lambda do |field, index|
        if field.nil? # represent +nil+ fields as empty unquoted fields
          ""
        else
          field = String(field) # Stringify fields
          # represent empty fields as empty quoted fields
          if field.empty? or
             field.count(quotable_chars).nonzero? or
             @forced_quote_fields.include?(index)
            do_quote.call(field)
          else
            field # unquoted field
          end
        end
      end
    end
  end
end
