Zlib::GzipReader doesn't read whole file - ruby

I have this block of Ruby code. I need to read a big json.gz file that cannot be loaded into RAM at once (so reading everything at once with GzipReader is not an option). To achieve this, I use GzipReader and then read lazily in batches. Everything works, but for some reason not all data from the file makes it into the block. Only 5500125 rows are processed by this code, but the file has approximately 6600000 rows. If I use File.open('authors.jsonl.gz') instead of Zlib, all rows are processed, but they are not unzipped.
I spent almost all day looking through the documentation and haven't found anything :( I also tried to unzip each row as it is processed, but all those attempts failed as well. Is there a way to unzip the file and then read it in chunks (all of its content, not just part of it), or at least to read it line by line and unzip each line on its own?
Thank you guys :)
require 'zlib'
require 'json'

# batch_size and array_of_authors are defined earlier in my script
Zlib::GzipReader.wrap(File.open('authors.jsonl.gz')) do |file|
  file.lazy.each_slice(batch_size) do |lines|
    lines.each do |line|
      parsed_line = JSON.parse(line.gsub('\u0000', ''))
      array_of_authors << { id: parsed_line['id'],
                            name: parsed_line['name'],
                            username: parsed_line['username'],
                            description: parsed_line['description'],
                            followers_count: parsed_line.dig('public_metrics', 'followers_count'),
                            following_count: parsed_line.dig('public_metrics', 'following_count'),
                            tweet_count: parsed_line.dig('public_metrics', 'tweet_count'),
                            listed_count: parsed_line.dig('public_metrics', 'listed_count') }
    end
  end
end
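One likely cause of the truncated read (not confirmed by the question itself) is that authors.jsonl.gz contains several concatenated gzip members; Zlib::GzipReader stops at the end of the first member unless you explicitly continue from its unused bytes. A minimal sketch of streaming every member line by line (the process_line helper is hypothetical):
require 'zlib'

File.open('authors.jsonl.gz') do |io|
  loop do
    gz = Zlib::GzipReader.new(io)
    gz.each_line { |line| process_line(line) }
    unused = gz.unused            # raw bytes buffered past the end of this gzip member, if any
    gz.finish                     # close the reader without closing io
    break if unused.nil?          # no further member follows
    io.pos -= unused.bytesize     # step back to the start of the next member
  end
end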

Related

How to replace the first few bytes of a file in Ruby without opening the whole file?

I have a 30MB XML file that contains some gibberish in the beginning, and so typically I have to remove that in order for Nokogiri to be able to parse the XML document properly.
Here's what I currently have:
contents = File.open(file_path).read
if contents[0..123].include? 'authenticate_response'
  fixed_contents = File.open(file_path).read[123..-1]
  File.open(file_path, 'w') { |f| f.write(fixed_contents) }
end
However, this actually causes the ruby script to open up the large XML file twice. Once to read the first 123 characters, and another time to read everything but the first 123 characters.
To solve the first issue, I was able to accomplish this:
contents = File.open(file_path).read(123)
However, now I need to remove these characters from the file without reading the entire file. How can I "trim" the beginning of this file without having to open the entire thing in memory?
You can open the file once, then read and check the "garbage", and finally pass the opened file directly to Nokogiri for parsing. That way, you only need to read the file once and don't need to write it at all.
File.open(file_path) do |xml_file|
  if xml_file.read(123).include? 'authenticate_response'
    # header found; the read already consumed it, so nothing more to do
  else
    # no header found. We rewind and let Nokogiri parse the whole file
    xml_file.rewind
  end
  xml = Nokogiri::XML.parse(xml_file)
  # Now do whatever you want with the parsed XML document
end
Please refer to the documentation of IO#read, IO#rewind and Nokogiri::XML::Document.parse for details about those methods.

Append new lines to a csv from json.parse

More of a sysadmin (Chef) than a Ruby guy, so this may be a five-minute fix.
I am working on a task where I write a Ruby script that pulls JSON data from multiple files, parses it, and writes the desired fields to a single .csv file. Basically, pulling metadata about AWS accounts and putting it in an accountant-friendly format.
I got a lot of help from another Stack Overflow question (json.parse help) on how to solve the problem for a single file.
My issue is that I am trying to pull the same data from multiple JSON files in an array. I can get it to loop through each file with the code below.
require 'csv'
require "json"
delim_file = CSV.open("delimited_test.csv", "w")
aws_account_list = %w(example example2)
aws_account_list.each do |account|
  json_file = File.read(account.to_s + "_aws.json")
  parsed_json = JSON.parse(json_file)
  delim_file = CSV.open("delimited_test.csv", "w")
  # This next line could be a problem if you ran this code multiple times
  delim_file << ["EbsOptimized", "PrivateDnsName", "KeyName", "AvailabilityZone", "OwnerId"]
  parsed_json['Reservations'].each do |inner_json|
    inner_json['Instances'].each do |instance_json|
      delim_file << [[instance_json['EbsOptimized'].to_s, instance_json['PrivateDnsName'], instance_json['KeyName'], instance_json['Placement']['AvailabilityZone'], inner_json['OwnerId']],[]]
    end
    delim_file.close
  end
end
However, whenever I do it, it overwrites the same single row in the .csv file every time. I have tried adding a \n string to the end of the array, and converting the array to a string with hashes and adding a \n, but all that does is add a line to the same row that then gets overwritten.
How would I go about making it read each JSON file and then append each file's metadata to a new row? This looks like a simple case of writing the right loop, but I can't figure it out.
You declared your file like this:
delim_file = CSV.open("delimited_test.csv", "w")
To fix your issue, all you have to do is change "w" to "a":
delim_file = CSV.open("delimited_test.csv", "a")
See the docs for IO.new for a description of the available file modes. In short, w creates an empty file at the given filename, overwriting any existing file, and writes to that. a creates the file only if it doesn't exist and always appends to the end. Because you currently have w, the file gets overwritten each time you run the script. With a, the script appends to what's already there.
You need to open the file in append mode; use
delim_file = CSV.open("delimited_test.csv", "a")
'a'  Write-only, starts at end of file if file exists, otherwise creates a new file for writing.
'a+' Read-write, starts at end of file if file exists, otherwise creates a new file for reading and writing.
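For completeness, a sketch that is not part of either answer: rather than reopening the CSV in append mode inside the loop, you can open it once in write mode before looping over the accounts, which also writes the header only once (file and account names follow the question):
require 'csv'
require 'json'

CSV.open('delimited_test.csv', 'w') do |delim_file|
  # header row is written exactly once
  delim_file << %w[EbsOptimized PrivateDnsName KeyName AvailabilityZone OwnerId]
  %w[example example2].each do |account|
    parsed_json = JSON.parse(File.read("#{account}_aws.json"))
    parsed_json['Reservations'].each do |inner_json|
      inner_json['Instances'].each do |instance_json|
        delim_file << [instance_json['EbsOptimized'].to_s,
                       instance_json['PrivateDnsName'],
                       instance_json['KeyName'],
                       instance_json['Placement']['AvailabilityZone'],
                       inner_json['OwnerId']]
      end
    end
  end
end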

How to assert a CSV file in Ruby

Is there a nice way to assert the contents of a CSV file in Ruby?
I understand how to use the CSV libraries and how to read in the CSV file, but that results in a long list of assertions such as:
`assert_equal("0", #csv_array[0].field('impressions'))
assert_equal("7", #csv_array[0].field('clicks'))
assert_equal("330", #csv_array[0].field('currency.GBP.commissions'))
assert_equal("6", #csv_array[0].field('currency.GBP.conversions'))
assert_equal("3300", #csv_array[0].field('currency.GBP.ordervalue'))`
Is there some sort of file comparator so I could write:
assert_equal(expected.csv, actual.csv)
or something along those lines?
How about this:
expected_csv = "impressions,clicks,currency.GBP.comiisions,currency.GBP.conversions,currency.GBP.ordervalue
0,7,330,6,3300"
actual_csv = File.open('actual.csv').read
assert_equal(expected_csv, actual_csv)
That should work if the entire contents of the CSV file are only two lines. Otherwise you will have to manipulate actual_csv to get the parts you want to test. You could do that like so:
IO.readlines('actual.csv')[3]
That will get you the fourth line (readlines returns a zero-indexed array of lines). You can then concatenate it with a header line, or compare it to a string without the header.
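A small sketch of that concatenation idea, with the row index purely illustrative:
lines = IO.readlines('actual.csv')
actual_csv = lines[0] + lines[3]   # header line plus the row under test
assert_equal(expected_csv, actual_csv)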
If you have to test output like this, you might find approval testing an interesting approach. Basically, the output is saved the first time your test runs. You can then check the output manually and approve it if it is correct. On subsequent runs, there will be an error whenever the output differs.
I created a quick and dirty method for doing this which I may clean up and turn into a gem at some point. https://gist.github.com/bpardee/513b4a15e5ebdc596e0b
For instance, the following code:
file = 'test.csv'
File.open(file, 'w') do |fout|
  fout.puts "foo,bar,zulu\n1,2,3\n4,5,6"
end
assert_csv(file) do |csv|
  csv << %w(foo bar warrior)
  csv << [1, 3, 5]
  csv << [4, 5, 6]
end
Would result in:
Missing columns: ["zulu"]
Unexpected columns: ["warrior"]
The following mismatches were found in line 2:
bar actual=3 expected=2
I don't recommend this for big CSV files, since everything is loaded into memory.

How to edit each line of a file in Ruby, without using a temp file

Is there a way to edit each line in a file, without involving 2 files? Say, the original file has,
test01
test02
test03
I want to edit it like
test01,a
test02,a
test03,a
Tried something as shown in the code block, but it replaces some of the characters.
Writing to a temporary file and then replacing the original file works; however, I need to edit the file quite often and would therefore prefer to do it within the file itself. Any pointers are appreciated.
Thank you!
File.open('mytest.csv', 'r+') do |file|
  file.each_line do |line|
    file.seek(-line.length, IO::SEEK_CUR)
    file.puts 'a'
  end
end
f = open 'mytest.csv', 'r+'
r = f.readlines.map { |e| e.strip << ',a' }
f.rewind
f.puts r
f.close # you can leave out this line if it's the last one that runs
Here is a one-liner variation; note that in this case two descriptors are left open until the program exits.
open(F='mytest.csv','r+').puts open(F,'r').readlines.map{|e|e.strip<<',a'}
Writing to a file doesn't insert; it always overwrites. This makes it awkward to modify text in-place, because you have to rewrite the entire rest of the contents of the file every time you add something new.
If the file is small enough to fit in memory, you can read it in, modify it, and write it back out. Otherwise, you really are better off with the temporary file.
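A minimal sketch of that read-modify-write approach for the example above, assuming the file fits comfortably in memory:
lines = File.readlines('mytest.csv', chomp: true)
File.open('mytest.csv', 'r+') do |f|
  f.puts lines.map { |line| "#{line},a" }
  f.flush
  f.truncate(f.pos)   # drop leftover bytes in case the new content is shorter than the old
end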

Why won't gsub! change my files?

I am trying to do a simple find/replace on all text files in a directory, modifying any instance of [RAVEN_START: by inserting a string (in this case 'raven was here') before the line.
Here is the entire ruby program:
#!/usr/bin/env ruby
require 'rubygems'
require 'fileutils' # for FileUtils.mv('your file', 'new location')

class RavenParser
  rawDir = Dir.glob("*.txt")
  count = 0
  rawDir.each do |ravFile|
    # we have selected every text file, so now we have to search through the file
    # and make the needed changes.
    rav = File.open(ravFile, "r+") do |modRav|
      # Now we've opened the file, and we need to do the operations.
      if modRav
        lines = File.open(modRav).readlines
        lines.each { |line|
          if line.match /\[RAVEN_START:.*\]/
            line.gsub(/\[RAVEN_START:/, 'raven was here ' + line)
            count = count + 1
          end
        }
        printf("Total Changed: %d\n", count)
      else
        printf("No txt files found. \n")
      end
    end
    # end of file replacing instructions.
  end
  # S
end
The program runs fine, but when I open up the text file, there has been no change to any of the text within the file. count increments properly (that is, it is equal to the number of instances of [RAVEN_START: across all the files), but the actual substitution is failing to take place (or at least the changes are not being saved).
Is my syntax on the gsub! incorrect? Am I doing something else wrong?
You're reading the data, updating it, and then neglecting to write it back to the file. You need something like:
# And save the modified lines.
File.open(modRav, 'w') { |f| f.puts lines }
immediately before or after this:
printf("Total Changed: %d\n",count)
As DMG notes below, just overwriting the file isn't properly paranoid, as you could be interrupted in the middle of the write and lose data. If you want to be paranoid (which all of us should be, because they really are out to get us), then you want to write to a temporary file and then do an atomic rename to replace the original file with the new one. A rename generally only works when you stay within a single file system, and there is no guarantee that the OS's temp directory (which Tempfile uses by default) will be on the same file system as modRav, so File.rename might not even be an option with a Tempfile unless precautions are taken. But the Tempfile constructor takes a tmpdir parameter, so we're saved:
require 'tempfile'

modRavDir = File.dirname(File.realpath(modRav))
tmp = Tempfile.new(File.basename(modRav), modRavDir)
tmp.write(lines.join)
tmp.close
File.rename(tmp.path, modRav)
You might want to stick that in a separate method (safe_save(modRav, lines) perhaps) to avoid further cluttering your block.
There is no gsub! in the post (except the title and question). I would actually recommend not using gsub!, but rather using the result of gsub -- avoiding mutability can help reduce a number of subtle bugs.
The line read from the file stream into a String is a copy and modifying it will not affect the contents of the file. (The general approach is to read a line, process the line, and write the line. Or do it all at once: read all lines, process all lines, write all processed lines. In either case, nothing is being written back to the file in the code in the post ;-)
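A minimal sketch of that read-all, process-all, write-all flow for a single file, with the path purely illustrative and the marker text prepended to matching lines:
path = 'example.txt'
lines = File.readlines(path)
processed = lines.map do |line|
  line =~ /\[RAVEN_START:/ ? "raven was here #{line}" : line
end
File.write(path, processed.join)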
Happy coding.
You're not using gsub!, you're using gsub. gsub! and gsub are different methods: gsub! does the replacement on the object itself, while gsub does the replacement and returns the result as a new string.
Change this
line.gsub(/\[RAVEN_START:/, 'raven was here '+line)
to this:
line.gsub!(/\[RAVEN_START:/, 'raven was here '+line)
or this:
line = line.gsub(/\[RAVEN_START:/, 'raven was here '+line)
See String#gsub for more info
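A quick standalone illustration of the difference:
s = "abc [RAVEN_START: foo]"
t = s.gsub(/\[RAVEN_START:/, 'raven was here [RAVEN_START:')
# s is unchanged here; t holds the replaced text
s.gsub!(/\[RAVEN_START:/, 'raven was here [RAVEN_START:')
# s itself is modified now; gsub! returns nil when nothing was replaced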
