Download a large CSV file from an SFTP server in chunks (Ruby)

I want to download a CSV file from an SFTP server and process it line by line.
If I use download! or sftp.file.open, the whole file is buffered in memory, which is what I want to avoid.
Here is my source code:
sftp = Net::SFTP.start(@sftp_details['server_ip'], @sftp_details['server_username'], :password => decoded_pswd)
if sftp
  begin
    sftp.dir.foreach(@sftp_details['server_folder_path']) do |entry|
      print_memory_usage do
        print_time_spent do
          if entry.file? && entry.name.end_with?("csv")
            batch_size_cnt = 0
            sftp.file.open("#{@sftp_details['server_folder_path']}/#{entry.name}") do |file|
              header = file.gets
              header = header.force_encoding(header.encoding).encode('UTF-8', invalid: :replace, undef: :replace, replace: '')
              csv_data = ''
              while line = file.gets
                batch_size_cnt += 1
                csv_data.concat(line.force_encoding(line.encoding).encode('UTF-8', invalid: :replace, undef: :replace, replace: ''))
                if batch_size_cnt == 1000 || file.eof?
                  CSV.parse(csv_data, headers: header, write_headers: true) do |row|
                    row.delete(nil)
                    entities << row.to_hash
                  end
                  csv_data, batch_size_cnt = '', 0
                  courses.delete_if(&:blank?)
                  # DO PROCESSING PART
                  entities = []
                end
              end if header
            end
            sftp.rename("#{@sftp_details['server_folder_path']}/#{entry.name}", "#{@sftp_details['processed_file_path']}/#{entry.name}")
          end
        end
      end
    end
  end # closes begin
end # closes if sftp
Can someone please help? Thanks

You need to add some kind of buffer to be able to read chunks and then write them all together. I think it would be wise to split downloading and parsing in your script. Focus on one thing at a time:
Your original line:
...
sftp.file.open("#{@sftp_details['server_folder_path']}/#{entry.name}") do |file|
...
If you check the source of the download! method (don't forget the bang!), you can use 'stringio'. Here is a stub which you can easily adjust. Usually the default read size of 32,000 bytes is sufficient; you can change it if you want (see the example).
Replace it with one of the following (works only with single files).
The StringIO usage:
...
io = StringIO.new
sftp.download!("#{@sftp_details['server_folder_path']}/#{entry.name}", io, :read_size => 16000)
Or you can just download the file to disk:
...
file = File.open("/your_local_path/#{entry.name}", 'wb')
sftp.download!("#{@sftp_details['server_folder_path']}/#{entry.name}", file, :read_size => 16000)
....
From the docs, you can use the :read_size option:
:read_size - the maximum number of bytes to read at a time from the
source. Increasing this value might improve throughput. It defaults to
32,000 bytes.
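Putting the two pieces together, here is a minimal sketch of the download-then-parse approach (my own illustration, not code from the question or answer): it streams each remote CSV to a local temp file in chunks with download!, then reads the local copy back row by row with CSV.foreach, so the whole file never sits in memory. The process_batch helper, the batch size and the variable names are hypothetical placeholders.
require 'net/sftp'
require 'csv'
require 'tmpdir'

BATCH_SIZE = 1000 # hypothetical batch size

Net::SFTP.start(server_ip, server_username, :password => decoded_pswd) do |sftp|
  sftp.dir.foreach(remote_folder) do |entry|
    next unless entry.file? && entry.name.end_with?('csv')

    local_path = File.join(Dir.tmpdir, entry.name)

    # Step 1: stream the remote file to disk in 16 kB chunks (no full in-memory buffering)
    sftp.download!("#{remote_folder}/#{entry.name}", local_path, :read_size => 16_000)

    # Step 2: parse the local copy row by row and process in batches
    batch = []
    CSV.foreach(local_path, headers: true) do |row|
      batch << row.to_h
      if batch.size >= BATCH_SIZE
        process_batch(batch) # hypothetical processing hook
        batch.clear
      end
    end
    process_batch(batch) unless batch.empty?
  end
end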

Related

How to read multiple XML files then output to multiple CSV files with the same XML filenames

I am trying to parse multiple XML files and then output them into CSV files with the proper rows and columns.
I was able to do so by processing one file at a time, defining the filename and specifically outputting to a defined output file name:
File.open('H:/output/xmloutput.csv','w')
I would like to write to multiple files and give each one the same name as its source XML file without hard-coding it. I have tried doing it multiple ways but have had no luck so far.
Sample XML:
<?xml version="1.0" encoding="UTF-8"?>
<record:root>
  <record:Dataload_Request>
    <record:name>Bob Chuck</record:name>
    <record:Address_Data>
      <record:Street_Address>123 Main St</record:Street_Address>
      <record:Postal_Code>12345</record:Postal_Code>
    </record:Address_Data>
    <record:Age>45</record:Age>
  </record:Dataload_Request>
</record:root>
Here is what I've tried:
require 'nokogiri'
require 'set'

files = ''
input_folder = "H:/input"
output_folder = "H:/output"

if input_folder[input_folder.length-1,1] == '/'
  input_folder = input_folder[0,input_folder.length-1]
end
if output_folder[output_folder.length-1,1] != '/'
  output_folder = output_folder + '/'
end

files = Dir[input_folder + '/*.xml'].sort_by{ |f| File.mtime(f) }
file = File.read(input_folder + '/' + files)
doc = Nokogiri::XML(file)

record = {} # hashes
keys = Set.new
records = [] # array
csv = ""

doc.traverse do |node|
  value = node.text.gsub(/\n +/, '')
  if node.name != "text" # skip these nodes: if class isnt text then skip
    if value.length > 0 # skip empty nodes
      key = node.name.gsub(/wd:/,'').to_sym
      if key == :Dataload_Request && !record.empty?
        records << record
        record = {}
      elsif key[/^root$|^document$/]
        # neglect these keys
      else
        key = node.name.gsub(/wd:/,'').to_sym
        # in case our value is html instead of text
        record[key] = Nokogiri::HTML.parse(value).text
        # add to our key set only if not already in the set
        keys << key
      end
    end
  end
end

# build our csv
File.open('H:/output/.*csv', 'w') do |file|
  file.puts %Q{"#{keys.to_a.join('","')}"}
  records.each do |record|
    keys.each do |key|
      file.write %Q{"#{record[key]}",}
    end
    file.write "\n"
  end
  print ''
  print 'output files ready!'
  print ''
end
I have been getting 'read memory': no implicit conversion of Array into String (TypeError) and other errors.
Here's a quick peer-review of your code, something like you'd get in a corporate environment...
Instead of writing:
input_folder = "H:/input"
input_folder[input_folder.length-1,1] == '/' # => false
Consider doing it using the -1 offset from the end of the string to access the character:
input_folder[-1] # => "t"
That simplifies your logic and makes it more readable because it lacks unnecessary visual noise:
input_folder[-1] == '/' # => false
See [] and []= in the String documentation.
This looks like a bug to me:
files = Dir[input_folder + '/*.xml'].sort_by{ |f| File.mtime(f)}
file = File.read(input_folder + '/' + files)
files is an array of filenames. input_folder + '/' + files is appending an array to a string:
foo = ['1', '2'] # => ["1", "2"]
'/parent/' + foo # =>
# ~> -:9:in `+': no implicit conversion of Array into String (TypeError)
# ~> from -:9:in `<main>'
How you want to deal with that is left as an exercise for the programmer.
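For illustration only, a minimal sketch of one way to deal with it (assuming each XML file should be processed on its own and produce a CSV named after it, as the question asks) is to loop over the array and read one file at a time:
require 'nokogiri'

files = Dir[input_folder + '/*.xml'].sort_by { |f| File.mtime(f) }
files.each do |path|
  doc = Nokogiri::XML(File.read(path)) # one filename at a time, so no Array-to-String error
  # ... build the records and keys for this document as before ...

  # name the output after the input, e.g. H:/output/foo.csv
  csv_path = File.join(output_folder, File.basename(path, '.xml') + '.csv')
  # ... write this document's rows to csv_path ...
end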
doc.traverse do |node|
is icky because it sidesteps Nokogiri's ability to search for particular tags using its accessors. Very rarely do we need to iterate over a document tag by tag, usually only when we're peeking at its structure and layout. traverse is slower, so use it only as a very last resort.
length is nice but isn't needed when checking whether a string has content:
value = 'foo'
value.length > 0 # => true
value > '' # => true
value = ''
value.length > 0 # => false
value > '' # => false
Programmers coming from Java like to use the accessors but I like being lazy, probably because of my C and Perl backgrounds.
Be careful with sub and gsub, as they don't do what you're thinking they do. Both expect a regular expression, but will take a string, which they escape before beginning their scan.
You're passing in a regular expression, which is OK in this case, but it could cause unexpected problems if you don't remember all the rules for pattern matching, and gsub scans until the end of the string:
foo = 'wd:barwd:' # => "wd:barwd:"
key = foo.gsub(/wd:/,'') # => "bar"
In general I recommend people think a couple times before using regular expressions. I've seen some gaping holes opened up in logic written by fairly advanced programmers because they didn't know what the engine was going to do. They're wonderfully powerful, but need to be used surgically, not as a universal solution.
The same thing happens with a string, because gsub doesn't know when to quit:
key = foo.gsub('wd:','') # => "bar"
So, if you're looking to change just the first instance use sub:
key = foo.sub('wd:','') # => "barwd:"
I'd do it a little differently though.
foo = 'wd:bar'
I can check to see what the first three characters are:
foo[0,3] # => "wd:"
Or I can replace them with something else using string indexing:
foo[0,3] = ''
foo # => "bar"
There's more but I think that's enough for now.
You should use Ruby's CSV class. Also, you don't need to do any string matching or regex stuff. Use Nokogiri to target elements. If you know the node names in the XML will be consistent, it should be pretty simple. I'm not exactly sure if this is the output you want, but it should get you going in the right direction:
require 'nokogiri'
require 'csv'

def xml_to_csv(filename)
  xml_str = File.read(filename)
  xml_str.gsub!('record:', '') # remove the record: namespace
  doc = Nokogiri::XML xml_str
  csv_filename = filename.gsub('.xml', '.csv')
  CSV.open(csv_filename, 'wb') do |row|
    row << ['name', 'street_address', 'postal_code', 'age']
    row << [
      doc.xpath('//name').text,
      doc.xpath('//Street_Address').text,
      doc.xpath('//Postal_Code').text,
      doc.xpath('//Age').text,
    ]
  end
end
# iterate over all xml files
Dir.glob('*.xml').each { |filename| xml_to_csv(filename) }

Errno::EINVAL Invalid argument # io_fread

I'm attempting to download a ~2 GB file and write it to a local file, but I'm running into the Errno::EINVAL error above.
Here's the applicable code:
require 'open-uri'
require 'ruby-progressbar' # for ProgressBar.create

File.open(local_file, "wb") do |tempfile|
  puts "Downloading the backup..."
  pbar = nil
  open(backup_url,
       :read_timeout => nil,
       :content_length_proc => lambda do |content_length|
         if content_length&.positive?
           pbar = ProgressBar.create(:total => content_length)
         end
       end,
       :progress_proc => ->(size) { pbar&.progress = size }) do |retrieved|
    begin
      tempfile.binmode
      tempfile << retrieved.read
      tempfile.close
    rescue Exception => e
      binding.pry
    end
  end
end
Read your file in chunks.
The line causing the issue is here:
tempfile << retrieved.read
This reads the entire contents into memory before writing it to the tempfile. If the content is small, this isn't a big deal, but if this content is quite large (how large depends on the system, configuration, OS and available resources), this can cause an Errno::EINVAL error, like Invalid argument # io_fread and Invalid argument # io_write.
To work around this, read the content in chunks and write each chunk to the tempfile. Something like this:
tempfile.write( retrieved.read( 1024 ) ) until retrieved.eof?
This will get chunks of 1024 bytes and write each chunk to the tempfile until retrieved reaches the end of the file (i.e. .eof?).
If retrieved.read doesn't take a size parameter, you can wrap the content in a StringIO instead (note that StringIO.new expects a string, so this variant still reads the whole body into memory first):
retrieved_io = StringIO.new( retrieved.read )
tempfile.write( retrieved_io.read( 1024 ) ) until retrieved_io.eof?
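As an alternative sketch (my own suggestion, not part of the original answer), Ruby's built-in IO.copy_stream copies from one IO to another in buffered chunks, so you don't have to write the read-until-eof loop yourself. URI.open is the open-uri entry point in recent Rubies; on older versions use Kernel#open as in the question:
require 'open-uri'

# backup_url and local_file are assumed to be defined as in the question.
File.open(local_file, 'wb') do |tempfile|
  URI.open(backup_url, :read_timeout => nil) do |retrieved|
    IO.copy_stream(retrieved, tempfile) # copies in chunks, never holds the full 2 GB in memory
  end
end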

Working with large CSV files in Ruby

I want to parse two CSV files of the MaxMind GeoIP2 database, do some joining based on a column, and merge the result into one output file.
I used the standard Ruby CSV library, but it is very slow. I think it tries to load the whole file into memory.
block_file = File.read(block_path)
block_csv = CSV.parse(block_file, :headers => true)
location_file = File.read(location_path)
location_csv = CSV.parse(location_file, :headers => true)

CSV.open(output_path, "wb",
         :write_headers => true,
         :headers => ["geoname_id", "Y", "Z"]) do |csv|
  block_csv.each do |block_row|
    puts "#{block_row['geoname_id']}"
    location_csv.each do |location_row|
      if (block_row['geoname_id'] === location_row['geoname_id'])
        puts " match :"
        csv << [block_row['geoname_id'], block_row['Y'], block_row['Z']]
        break location_row
      end
    end
  end
end
Is there another Ruby library that supports processing in chunks?
block_csv is 800 MB and location_csv is 100 MB.
Just use CSV.open(block_path, 'r', :headers => true).each do |line| instead of File.read and CSV.parse. It will parse the file line by line.
In your current version, you explicitly tell it to read the whole file with File.read and then to parse it as one big string with CSV.parse, so it does exactly what you told it to do.
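For completeness, a sketch of the whole join done in a streaming fashion (my own assumptions: geoname_id is the join key, the smaller location file fits in memory as a lookup hash, and 'Y'/'Z' stand in for the real column names from the question):
require 'csv'

# Build a lookup hash from the smaller (100 MB) location file once.
locations = {}
CSV.open(location_path, 'r', :headers => true).each do |row|
  locations[row['geoname_id']] = row
end

# Stream the large (800 MB) block file row by row and write matches as we go.
CSV.open(output_path, 'wb', :write_headers => true,
         :headers => ['geoname_id', 'Y', 'Z']) do |out|
  CSV.open(block_path, 'r', :headers => true).each do |block_row|
    next unless locations.key?(block_row['geoname_id'])
    out << [block_row['geoname_id'], block_row['Y'], block_row['Z']]
  end
end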

How to write columns header to a csv file with Ruby?

I am having trouble writing column headers to a CSV file with Ruby. Below is my snippet of code.
calc = numerator/denominator.to_f
data_out = "#{numerator}, #{denominator}, #{calc}"
File.open('cdhu3_X.csv', 'a+') do |hdr|
  hdr << ["numerator","denominator","calculation\n"] #< column header
  hdr << "#{data_out}\n"
end
The code adds the column headers every time it appends data, and I only need them once at the top of the file. I have searched here and other places but can't find a clear answer to how it's done.
Any help would be greatly appreciated.
I would recommend using the CSV library instead:
require 'csv'
CSV.open('test.csv', 'w',
         :write_headers => true,
         :headers => ["numerator", "denominator", "calculation"] #< column header
        ) do |hdr|
  1.upto(12){ |numerator|
    1.upto(12){ |denominator|
      data_out = [numerator, denominator, numerator/denominator.to_f]
      hdr << data_out
    }
  }
end
If you can't use the w option and you really need the a+ (e.g., the data isn't available all at once), then you could try the following trick:
require 'csv'
column_header = ["numerator", "denominator", "calculation"]
1.upto(12){ |numerator|
  1.upto(12){ |denominator|
    CSV.open('test.csv', 'a+',
             :write_headers => true,
             :headers => column_header
            ) do |hdr|
      column_header = nil # no header after the first insertion
      data_out = [numerator, denominator, numerator/denominator.to_f]
      hdr << data_out
    end
  }
}
The cleanest way to do this is to open the file once, in mode 'w', write the headers, and then write the data.
If there's some technical reason you can't do this (e.g., the data isn't available all at once), then you can use the IO#tell method on the file to return the current file position. When you open the file for appending, the position is set to the end of the file, so if the current file position is zero, then the file was newly created and has no headers:
File.open('cdhu3_X.csv', 'a+') do |hdr|
  if hdr.tell() == 0 # file is empty, so write header
    hdr << "numerator, denominator, calculation\n"
  end
  hdr << "#{data_out}\n"
end
The best way to handle a CSV file is to use Ruby's CSV module.
I had the same problem; after reading the CSV code, I came across this solution, which I find the most efficient.
headers = ['col1', 'col2', 'col3']
CSV.open(file_path, 'a+', force_quotes: true) do |csv|
  csv << headers if csv.count.eql? 0 # csv.count gives the number of lines in the file; if it is zero, insert the headers
end
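A small usage sketch building on that trick, appending the data rows in the same block (file_path and the row values are placeholders, not from the original answer):
require 'csv'

headers = ['col1', 'col2', 'col3']
CSV.open(file_path, 'a+', force_quotes: true) do |csv|
  csv << headers if csv.count.eql? 0 # header only once, while the file is still empty
  csv << ['a', 'b', 'c']             # then append the actual data rows
end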
This works for me:
headers = ["Reference Number", "Vendor Line Code"]
CSV.open(file_path, "wb") do |csv|
  csv << headers
  @vendor.vendor_items.each do |vi|
    row_data = [vi.reference_number, vi.line_code]
    csv << row_data
  end
end

Removing whitespaces in a CSV file

I have a string with extra whitespace:
First,Last,Email ,Mobile Phone ,Company,Title ,Street,City,State,Zip,Country, Birthday,Gender ,Contact Type
I want to parse this line and remove the extra whitespace.
My code looks like:
namespace :db do
  task :populate_contacts_csv => :environment do
    require 'csv'
    csv_text = File.read('file_upload_example.csv')
    csv = CSV.parse(csv_text, :headers => true)
    csv.each do |row|
      puts "First Name: #{row['First']} \nLast Name: #{row['Last']} \nEmail: #{row['Email']}"
    end
  end
end
@prices = CSV.parse(IO.read('prices.csv'), :headers => true,
  :header_converters => lambda { |f| f.strip },
  :converters => lambda { |f| f ? f.strip : nil })
The nil test is added to the field converter but not the header converter, assuming the headers are never nil while the data might be, and nil doesn't have a strip method. I'm really surprised that, AFAIK, :strip is not a pre-defined converter!
You can strip your hash first:
csv.each do |unstriped_row|
  row = {}
  unstriped_row.each { |k, v| row[k.strip] = v.strip }
  puts "First Name: #{row['First']} \nLast Name: #{row['Last']} \nEmail: #{row['Email']}"
end
Edited to strip hash keys too
CSV supports "converters" for the headers and fields, which let you get inside the data before it's passed to your each loop.
Writing a sample CSV file:
csv = "First,Last,Email ,Mobile Phone ,Company,Title ,Street,City,State,Zip,Country, Birthday,Gender ,Contact Type
first,last,email ,mobile phone ,company,title ,street,city,state,zip,country, birthday,gender ,contact type
"
File.write('file_upload_example.csv', csv)
Here's how I'd do it:
require 'csv'

csv = CSV.open('file_upload_example.csv', :headers => true)
[:convert, :header_convert].each { |c| csv.send(c) { |f| f.strip } }

csv.each do |row|
  puts "First Name: #{row['First']} \nLast Name: #{row['Last']} \nEmail: #{row['Email']}"
end
Which outputs:
First Name: 'first'
Last Name: 'last'
Email: 'email'
The converters simply strip leading and trailing whitespace from each header and each field as they're read from the file.
Also, as a programming design choice, don't read your file into memory using:
csv_text = File.read('file_upload_example.csv')
Then parse it:
csv = CSV.parse(csv_text, :headers => true)
Then loop over it:
csv.each do |row|
Ruby's IO system supports "enumerating" over a file, line by line. Once my code does CSV.open, the file is readable and each reads one line at a time. The entire file doesn't need to be in memory at once, which isn't scalable (though on new machines it's becoming a lot more reasonable), and, if you test, you'll find that reading a file using each is extremely fast, probably as fast as reading it, parsing it and then iterating over the parsed result.
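If you prefer not to manage the CSV handle yourself, the same idea works with CSV.foreach, which also reads one row at a time. This is my own variant, reusing lambda converters to strip the headers and fields:
require 'csv'

strip = ->(field) { field.respond_to?(:strip) ? field.strip : field }

CSV.foreach('file_upload_example.csv',
            headers: true,
            header_converters: strip,
            converters: strip) do |row|
  puts "First Name: #{row['First']} \nLast Name: #{row['Last']} \nEmail: #{row['Email']}"
end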
