I know that File supposedly takes encoding: 'bom|utf-8' but afaict there's no equivalent for streams. My server is getting a zip file containing one csv that has the bom. It seems silly to save the csv as a file vs just using CSV.new(Zip::InputStream::open(zip_file).get_next_entry.get_input_stream), but afaict none of those can detect and strip the byte order mark (bom) and CSV fails trying to parse the header if the bom is there.
I see that CSV.new takes encoding as an option but, in Ruby 2.3.0 at least, it doesn't recognize bom (ArgumentError: unknown encoding name - bom).
Looks like handling the BOM is implemented in IO - maybe you can wrap your zip stream in an IO object?
https://ruby-doc.org/core-2.3.1/IO.html#method-c-new-label-Open+Mode
Since you can rewind streams, the answer is to get the first chars, see if they're the bom, if they are, consume them; otherwise, rewind the stream.
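# Maps the first byte of a BOM to the number of bytes that follow it:
# UTF-8 (EF BB BF), UTF-16BE (FE FF), UTF-16LE (FF FE).
BYTE_ORDER_MARKS_LENGTHS =
  {"\xEF".bytes.first => 2, "\xFE".bytes.first => 1, "\xFF".bytes.first => 1}

# checks if input_stream starts with a byte order mark and if so skips over it
def skip_bom(input_stream)
  first_byte = input_stream.read(1)
  # read returns nil at end of stream, so guard against an empty input
  entry = first_byte && BYTE_ORDER_MARKS_LENGTHS[first_byte.bytes.first]
  if entry
    input_stream.read(entry)
  else
    input_stream.rewind
  end
end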
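For the common UTF-8-only case, the same read-then-rewind idea can be sketched a little more strictly by comparing all three BOM bytes before handing the stream to CSV. This is a minimal sketch, not the rubyzip API: skip_utf8_bom and parse_csv_stream are hypothetical helper names, StringIO stands in for the zip entry's input stream, and the stream is assumed to be at its start and to support rewind.

```ruby
require 'csv'
require 'stringio'

UTF8_BOM = "\xEF\xBB\xBF".b.freeze

# Hypothetical helper: consumes a leading UTF-8 BOM if present, otherwise
# rewinds so the first bytes are not lost. Assumes the stream is at its start.
def skip_utf8_bom(io)
  head = io.read(3)
  io.rewind unless head && head.b == UTF8_BOM
  io
end

# Hypothetical helper: parses CSV with a header row from a rewindable stream.
def parse_csv_stream(io)
  CSV.new(skip_utf8_bom(io), headers: true).read
end

# StringIO stands in for the zip entry's input stream here.
table = parse_csv_stream(StringIO.new("\xEF\xBB\xBFname,age\nAda,36\n"))
puts table.headers.inspect
```

Comparing the full three-byte sequence avoids skipping bytes from a file that merely starts with 0xEF.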
My situation was similar, but I also needed to remove extra double-quotes:
Zip::File.open(zipfolder) do |zipfile|
  zipfile.each do |zip_entry|
    zip_entry.get_input_stream.each_line do |line|
      # drop a leading UTF-8 BOM (only present on the first line) and all double quotes
      line_without_bom_or_quotes = line.force_encoding('UTF-8').sub(/\A\uFEFF/, '').gsub('"', '')
      row = CSV.parse_line(line_without_bom_or_quotes)
      puts "DETAIL: #{row.inspect}"
    end
  end
end
I have a 30MB XML file that contains some gibberish in the beginning, and so typically I have to remove that in order for Nokogiri to be able to parse the XML document properly.
Here's what I currently have:
contents = File.open(file_path).read
if contents[0..123].include? 'authenticate_response'
  fixed_contents = File.open(file_path).read[123..-1]
  File.open(file_path, 'w') { |f| f.write(fixed_contents) }
end
However, this actually causes the ruby script to open up the large XML file twice. Once to read the first 123 characters, and another time to read everything but the first 123 characters.
To solve the first issue, I was able to accomplish this:
contents = File.open(file_path).read(123)
However, now I need to remove these characters from the file without reading the entire file. How can I "trim" the beginning of this file without having to open the entire thing in memory?
You can open the file once, then read and check the "garbage" and finally pass the opened file directly to Nokogiri for parsing. That way, you only need to read the file once and don't need to write it at all.
File.open(file_path) do |xml_file|
  if xml_file.read(123).include? 'authenticate_response'
    # header found, nothing to do
  else
    # no header found. We rewind and let nokogiri parse the whole file
    xml_file.rewind
  end
  xml = Nokogiri::XML.parse(xml_file)
  # Now do whatever you want with the parsed XML document
end
Please refer to the documentation of IO#read, IO#rewind and Nokogiri::XML::Document.parse for details about those methods.
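If the prefix really does have to be removed from the file on disk, one option is to stream everything after the prefix into a new file and rename it over the original, so the 30MB body is never held in memory. The following is only a sketch under assumptions: trim_prefix! is a hypothetical helper name, and the ".tmp" suffix for the scratch file is an arbitrary choice. It relies on IO.copy_stream accepting a source offset as its fourth argument.

```ruby
# Hypothetical helper: drops the first prefix_length bytes of the file at path.
# The remainder is streamed to a scratch file, which then replaces the original.
def trim_prefix!(path, prefix_length)
  tmp_path = "#{path}.tmp" # assumed scratch name, next to the original file
  File.open(tmp_path, 'wb') do |out|
    # copy_length = nil means "copy until end of file"
    IO.copy_stream(path, out, nil, prefix_length)
  end
  File.rename(tmp_path, path)
end
```

Keeping the scratch file in the same directory means the rename stays on one file system.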
I'm more of a sysadmin (Chef) than a Ruby guy, so this may be a five-minute fix.
I am working on a task where I write a Ruby script that pulls JSON data from multiple files, parses it, and writes the desired fields to a single .csv file. Basically pulling metadata about AWS accounts and putting it in an accountant-friendly format.
I got a lot of help from another Stack Overflow question (json.parse help) on how to solve the problem for a single file.
My issue is that I am trying to pull the same data from multiple JSON files in an array. I can get it to loop through each file with the code below.
require 'csv'
require "json"
delim_file = CSV.open("delimited_test.csv", "w")
aws_account_list = %w(example example2)
aws_account_list.each do |account|
  json_file = File.read(account.to_s + "_aws.json")
  parsed_json = JSON.parse(json_file)
  delim_file = CSV.open("delimited_test.csv", "w")
  # This next line could be a problem if you ran this code multiple times
  delim_file << ["EbsOptimized", "PrivateDnsName", "KeyName", "AvailabilityZone", "OwnerId"]
  parsed_json['Reservations'].each do |inner_json|
    inner_json['Instances'].each do |instance_json|
      delim_file << [[instance_json['EbsOptimized'].to_s, instance_json['PrivateDnsName'], instance_json['KeyName'], instance_json['Placement']['AvailabilityZone'], inner_json['OwnerId']],[]]
    end
    delim_file.close
  end
end
However, whenever I do it, it overwrites every time to the same single row in the .csv file. I have tried adding a \n string to the end of the array, converting the array to a string with hashes and doing a \n, but all that does is add a line to the same row that it overwrites.
How would I go about writing it so that it reads each JSON file, then appends each file's metadata to a new row? This looks like a simple case of writing the right loop, but I can't figure it out.
You declared your file like this:
delim_file = CSV.open("delimited_test.csv", "w")
To fix your issue, all you have to do is change "w" to "a":
delim_file = CSV.open("delimited_test.csv", "a")
See the docs for IO.new for a description of the available file modes. In short, w creates an empty file at the filename, truncating any existing file, and writes to that. a creates the file only if it doesn't exist, and appends otherwise. Because you currently have w, the file is truncated every time CSV.open runs (here, once per account in the loop), so earlier rows are lost. With a, new rows are appended to what's already there.
You need to open the file in append mode:
delim_file = CSV.open("delimited_test.csv", "a")
'a' Write-only, starts at end of file if file exists, otherwise creates a new file for writing.
'a+' Read-write, starts at end of file if file exists, otherwise creates a new file for reading and writing.
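An alternative to append mode is to open the CSV once, outside the loop, so the header is written a single time and every account adds its own rows. This is a rough sketch rather than a drop-in replacement: instance_rows and write_accounts_csv are hypothetical helper names, and the JSON layout (Reservations, Instances, Placement) is assumed to match the fields used in the question.

```ruby
require 'csv'
require 'json'

HEADER = %w[EbsOptimized PrivateDnsName KeyName AvailabilityZone OwnerId].freeze

# Hypothetical helper: flattens one parsed JSON document into CSV rows,
# one row per instance.
def instance_rows(parsed_json)
  parsed_json['Reservations'].flat_map do |reservation|
    reservation['Instances'].map do |instance|
      [instance['EbsOptimized'].to_s,
       instance['PrivateDnsName'],
       instance['KeyName'],
       instance['Placement']['AvailabilityZone'],
       reservation['OwnerId']]
    end
  end
end

# Hypothetical helper: opens the CSV once, writes the header once, then
# appends the rows of every JSON file. The block form closes the file.
def write_accounts_csv(csv_path, json_paths)
  CSV.open(csv_path, 'w') do |csv|
    csv << HEADER
    json_paths.each do |json_path|
      instance_rows(JSON.parse(File.read(json_path))).each { |row| csv << row }
    end
  end
end
```

With the question's naming scheme, the call would look like write_accounts_csv("delimited_test.csv", %w(example example2).map { |a| "#{a}_aws.json" }).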
I'm still learning Ruby, so I'm sure I'm doing something wrong here, but using Ruby 1.9.3 on Windows, I'm having a problem writing a file with random ASCII garbage to be a specific size. I need to be able to write these files for a test on an application I'm QAing. On Mac and on *nix, the file size is written correctly every time. But on Windows, it generates files of random size, generally between 1,024 bytes and 1,031 bytes.
I'm sure the problem is that one of the characters rstr is generating counts as two characters but... it seems like this shouldn't happen.
Here is my code:
num = 10
k = 1
for i in 1..num
  fname = "f#{i}.txt"
  f = File.new(fname, "w")
  for k in 1..size
    rstr = "#{(1..1024).map{rand(255).chr}.join}"
    f.write rstr
    print " #{rstr.size} " # this returns 1024 every time.
    rstr = ""
  end
  f.close
end
Also tried:
opts = {}
opts[:encoding] = "UTF-8"
fname = "f#{i}.txt"
f = File.new(fname, "w", opts)
By default, files on Windows are opened in text mode, meaning that line endings and other details are adjusted (for example, each "\n" byte is written as "\r\n").
If you want the files to be written byte-for-byte exactly as you want, you need to open them in binary mode:
File.open("foo", "wb") do |f|
  # ...
end
The b is ignored on POSIX operating systems, so your scripts are now cross-platform compatible.
Note: I used File.open with block syntax (File.new does not take a block) so the file is properly closed and the file handle released once the block has executed. You no longer need to worry about closing the file ;-)
Hope this helps.
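To make the binary-mode advice concrete, here is a small sketch that writes a fixed number of random bytes and reports the resulting size. write_random_file is a hypothetical helper name, and Random#bytes is used instead of the question's rand(255).chr loop simply because it is shorter.

```ruby
# Hypothetical helper: writes exactly size_in_bytes random bytes to path
# and returns the size of the file on disk.
def write_random_file(path, size_in_bytes)
  data = Random.new.bytes(size_in_bytes)
  # 'wb' disables newline translation, so no byte is expanded on Windows
  File.open(path, 'wb') { |f| f.write(data) }
  File.size(path)
end
```

Because the file is opened with "wb", the size on disk should match the requested size on every platform.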
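Strictly speaking, ASCII only covers the values 0 to 127, and rand(255) returns values from 0 to 254.
Calling chr on a value above 127 returns a single-byte binary (ASCII-8BIT) string, not a multibyte character, which is why rstr.size is 1024 every time.
The extra bytes on Windows come from text mode: every "\n" byte (value 10) in the random data is written out as "\r\n".
Adding # coding: utf-8 at the top of your file only sets the encoding of the source file, so it will not change the file size; opening the file in binary mode will.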
I am trying to do a simple find/replace on all text files in a directory, modifying any instance of [RAVEN_START: by inserting a string (in this case 'raven was here') before the line.
Here is the entire ruby program:
#!/usr/bin/env ruby
require 'rubygems'
require 'fileutils' #for FileUtils.mv('your file', 'new location')
class RavenParser
  rawDir = Dir.glob("*.txt")
  count = 0
  rawDir.each do |ravFile|
    #we have selected every text file, so now we have to search through the file
    #and make the needed changes.
    rav = File.open(ravFile, "r+") do |modRav|
      #Now we've opened the file, and we need to do the operations.
      if modRav
        lines = File.open(modRav).readlines
        lines.each { |line|
          if line.match /\[RAVEN_START:.*\]/
            line.gsub!(/\[RAVEN_START:/, 'raven was here '+line)
            count = count + 1
          end
        }
        printf("Total Changed: %d\n",count)
      else
        printf("No txt files found. \n")
      end
    end
    #end of file replacing instructions.
  end
  # S
end
The program runs and compiles fine, but when I open up the text file, there has been no change to any of the text within the file. count increments properly (that is, it is equal to the number of instances of [RAVEN_START: across all the files), but the actual substitution is failing to take place (or at least not saving the changes).
Is my syntax on the gsub! incorrect? Am I doing something else wrong?
You're reading the data, updating it, and then neglecting to write it back to the file. You need something like:
# And save the modified lines (readlines keeps each trailing newline, so a plain join is enough).
File.open(modRav, 'w') { |f| f.write lines.join }
immediately before or after this:
printf("Total Changed: %d\n",count)
As DMG notes below, just overwriting the file isn't properly paranoid, as you could be interrupted in the middle of the write and lose data. If you want to be paranoid (which all of us should be because they really are out to get us), then you want to write to a temporary file and then do an atomic rename to replace the original file with the new one. A rename generally only works when you stay within a single file system, and there is no guarantee that the OS's temp directory (which Tempfile uses by default) is on the same file system as modRav, so File.rename might not even be an option with a Tempfile unless precautions are taken. But the Tempfile constructor takes a tmpdir parameter, so we're saved:
require 'tempfile'
modRavDir = File.dirname(File.realpath(modRav))
tmp = Tempfile.new(File.basename(modRav), modRavDir)
tmp.write(lines.join)
tmp.close
File.rename(tmp.path, modRav)
You might want to stick that in a separate method (safe_save(modRav, lines) perhaps) to avoid further cluttering your block.
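A possible shape for that method is sketched below. The safe_save name comes from the suggestion above; the body is an assumption about how it could look, taking a path string rather than an open File. Note that Tempfile creates its file with restrictive permissions (0600), so the replaced file may end up with different permissions than the original.

```ruby
require 'tempfile'

# Sketch of the suggested safe_save: writes lines to a temp file in the
# same directory as path, then renames it over the original file.
def safe_save(path, lines)
  dir = File.dirname(File.realpath(path))
  tmp = Tempfile.new(File.basename(path), dir)
  tmp.write(lines.join) # lines from readlines already end in "\n"
  tmp.close
  File.rename(tmp.path, path)
end
```

Creating the temp file next to the target keeps the rename on a single file system.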
There is no gsub! in the post (except the title and question). I would actually recommend not using gsub!, but rather use the result of gsub -- avoiding mutability can help reduce a number of subtle bugs.
The line read from the file stream into a String is a copy and modifying it will not affect the contents of the file. (The general approach is to read a line, process the line, and write the line. Or do it all at once: read all lines, process all lines, write all processed lines. In either case, nothing is being written back to the file in the code in the post ;-)
Happy coding.
You're not using gsub!, you're using gsub. gsub! and gsub are different methods: the first does the replacement on the object itself, while the second does the replacement and returns the result as a new string.
Change this
line.gsub(/\[RAVEN_START:/, 'raven was here '+line)
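to this:
line.gsub!(/\[RAVEN_START:/, 'raven was here '+line)
or this (bear in mind that reassigning line inside each does not change the lines array, so collect the results with map if you need them later):
line = line.gsub(/\[RAVEN_START:/, 'raven was here '+line)
See String#gsub for more info.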
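Since the stated goal in the question is to insert a string before each matching line, a non-mutating version can skip gsub entirely and build a new array with map. This is a sketch only: MARKER and tag_raven_lines are hypothetical names, and the note text is taken from the question.

```ruby
MARKER = /\[RAVEN_START:/

# Hypothetical helper: returns a new array in which every line containing
# the marker is prefixed with note; other lines are passed through as-is.
def tag_raven_lines(lines, note = 'raven was here ')
  lines.map { |line| line =~ MARKER ? note + line : line }
end
```

The returned array still has to be written back to the file, as the other answers point out.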
I need an easy way to take a tar file and convert it into a string (and vice versa). Is there a way to do this in Ruby? My best attempt was this:
file = File.open("path-to-file.tar.gz")
contents = ""
file.each {|line|
  contents << line
}
I thought that would be enough to convert it to a string, but then when I try to write it back out like this...
newFile = File.open("test.tar.gz", "w")
newFile.write(contents)
It isn't the same file. Doing ls -l shows the files are of different sizes, although they are pretty close (and opening the file reveals most of the contents intact). Is there a small mistake I'm making or an entirely different (but workable) way to accomplish this?
First, you should open the file as a binary file. Then you can read the entire file in, in one command.
file = File.open("path-to-file.tar.gz", "rb")
contents = file.read
That will get you the entire file in a string.
After that, you probably want to file.close. If you don’t do that, file won’t be closed until it is garbage-collected, so it would be a slight waste of system resources while it is open.
If you need binary mode, you'll need to do it the hard way:
s = File.open(filename, 'rb') { |f| f.read }
If not, shorter and sweeter is:
s = IO.read(filename)
To avoid leaving the file open, it is best to pass a block to File.open. This way, the file will be closed after the block executes.
contents = File.open('path-to-file.tar.gz', 'rb') { |f| f.read }
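The whole round trip (file to string and back) can be sketched with File.binread and File.binwrite, which open the file in binary mode and close it for you. copy_via_string is a hypothetical helper name used only for illustration.

```ruby
# Hypothetical helper: reads src into a binary string, writes that string
# to dst, and returns the string.
def copy_via_string(src, dst)
  contents = File.binread(src) # ASCII-8BIT string, no newline translation
  File.binwrite(dst, contents)
  contents
end
```

Both files should then have the same size, which was the check that failed in the question.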
How about some open/close safety:
string = File.open('file.txt', 'rb') { |file| file.read }
Ruby has binary reading:
data = IO.binread("path/filename")
or, on Ruby versions older than 1.9.2:
data = IO.read("path/file")
On OS X these are the same for me... could this maybe be an extra "\r" on Windows?
In any case you may be better off with:
contents = File.read("e.tgz", mode: "rb")
File.open("ee.tgz", "wb") { |f| f.write(contents) }
You can probably encode the tar file in Base64. Base 64 will give you a pure ASCII representation of the file that you can store in a plain text file. Then you can retrieve the tar file by decoding the text back.
You can do something like:
require 'base64'
file_contents = Base64.encode64(tar_file_data)
Have a look at the Base64 Rubydocs to get a better idea.
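Here is a minimal sketch of the encode and decode directions together. tar_to_text and text_to_tar are hypothetical helper names; strict_encode64 is used instead of encode64 only so the text has no embedded line breaks.

```ruby
require 'base64'

# Hypothetical helper: returns the file at tar_path as a Base64 (ASCII) string.
def tar_to_text(tar_path)
  Base64.strict_encode64(File.binread(tar_path))
end

# Hypothetical helper: decodes Base64 text and writes the bytes to tar_path.
def text_to_tar(text, tar_path)
  File.binwrite(tar_path, Base64.strict_decode64(text))
end
```

The encoded text is roughly a third larger than the original file, which is the usual cost of Base64.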
Ruby 1.9+ has IO.binread (see @bardzo's answer) and also supports passing the encoding as an option to IO.read:
Ruby 1.9
data = File.read(name, {:encoding => 'BINARY'})
Ruby 2+
data = File.read(name, encoding: 'BINARY')
(Note in both cases that 'BINARY' is an alias for 'ASCII-8BIT'.)
If you can encode the tar file with Base64 (and store it in a plain text file), you can use
File.open("my_tar.txt").each {|line| puts line}
or
File.new("name_file.txt", "r").each {|line| puts line}
to print each (text) line in the cmd.