Parse remote file with FasterCSV - ruby

I'm trying to parse the first 5 lines of a remote CSV file. However, when I do, it raises an Errno::ENOENT exception that says:
No such file or directory - [file contents] (with [file contents] being a dump of the CSV contents).
Here's my code:
def preview
  @csv = []
  open('http://example.com/spreadsheet.csv') do |file|
    CSV.foreach(file.read, :headers => true) do |row|
      n += 1
      @csv << row
      if n == 5
        return @csv
      end
    end
  end
end
The above code is built from what I've seen others use on Stack Overflow, but I can't get it to work.
If I remove the read call from the file object, it raises a TypeError exception, saying:
can't convert StringIO into String
Is there something I'm missing?

CSV.foreach expects a filename, not the file's contents. Try CSV.parse(file.read, :headers => true).each instead.
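A minimal sketch of that suggestion, reusing the URL and five-row cutoff from the question (CSV.parse with :headers returns a CSV::Table, and first(5) comes from Enumerable). Note it buffers the whole response body in memory:

require 'open-uri'
require 'csv'

def preview
  open('http://example.com/spreadsheet.csv') do |file|
    # Parse the full body, then keep only the first five rows
    CSV.parse(file.read, :headers => true).first(5)
  end
end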

You could manually pass each line to CSV for parsing:
require 'open-uri'
require 'csv'

def preview(file_url)
  @csv = []
  open(file_url).each_with_index do |line, i|
    next if i == 0 # Ignore headers
    @csv << CSV.parse(line)
    if i == 5
      return @csv
    end
  end
end
puts preview('http://www.ferc.gov/docs-filing/eqr/soft-tools/sample-csv/contract.txt')
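One caveat with per-line parsing, offered as a hedged aside: a quoted field containing an embedded newline spans two physical lines, so feeding individual lines to CSV.parse can break on such input. A sketch that avoids this by handing the IO straight to the reader (CSV.new accepts an IO object, and CSV includes Enumerable, so first(5) stops after five rows):

require 'open-uri'
require 'csv'

def preview(file_url)
  open(file_url) do |io|
    # The CSV reader pulls from the IO only as far as it needs
    CSV.new(io, :headers => true).first(5)
  end
end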

Related

'Failed to allocate memory' error with large array

I am trying to import a large text file (approximately 2 million rows of numbers, about 260 MB) into an array, make edits to the array, and then write the results to a new text file, like this:
file_data = File.readlines("massive_file.txt")
file_data = file_data.map!(&:strip)
file_data.each do |s|
  s.gsub!(/,.*\z/, "")
end
File.open("smaller_file.txt", 'w') do |f|
  f.write(file_data.map(&:strip).uniq.join("\n"))
end
However, I have received the error failed to allocate memory (NoMemoryError). How can I allocate more memory to complete the task? Or, ideally, is there another method I can use where I can avoid having to re-allocate memory?
You can read the file line by line:
require 'set'
require 'digest/md5'

file_data = File.new('massive_file.txt', 'r')
file_output = File.new('smaller_file.txt', 'w')
unique_lines_set = Set.new

while (line = file_data.gets)
  line.strip!
  line.gsub!(/,.*\z/, "")
  # Check if the line is unique
  line_hash = Digest::MD5.hexdigest(line)
  unless unique_lines_set.include?(line_hash)
    # It is unique, so add its hash to the set
    unique_lines_set.add(line_hash)
    # Write the line to the output file
    file_output.puts(line)
  end
end

file_data.close
file_output.close
You can try reading and writing one line at once:
new_file = File.open('smaller_file.txt', 'w')
File.open('massive_file.txt', 'r') do |file|
  file.each_line do |line|
    new_file.puts line.strip.gsub(/,.*\z/, "")
  end
end
new_file.close
The only remaining task is filtering out the duplicated lines.
Alternatively, you can read the file in chunks, which should be faster than reading it line by line:
FILENAME = "massive_file.txt"
MEGABYTE = 1024 * 1024

class File
  def each_chunk(chunk_size = MEGABYTE) # or n*MEGABYTE
    yield read(chunk_size) until eof?
  end
end

filedata = ""
open(FILENAME, "rb") do |f|
  f.each_chunk do |chunk|
    # Caveat: /,.*\z/ anchors to the end of the whole chunk, and a chunk
    # boundary can split a line, so this needs line-aware handling to match
    # the per-line versions above; accumulating into filedata also keeps
    # everything in memory again.
    chunk.gsub!(/,.*\z/, "")
    filedata += chunk
  end
end
ref: https://stackoverflow.com/a/1682400/3035830

Missing parts after parsing and processing a very large XML file in Ruby

I have to parse and modify a 22.2 MB XML file (a WordPress export).
The problem is that after parsing, the last part of the file is always missing, but I can't figure out why.
I've tried using the saxerator gem, but it doesn't seem to solve my problem.
Here I'm just trying to get all the <item> elements from the input file and write them to an output file:
class SaxImport
  def initialize(input_file, output_file)
    f = File.read(input_file, File.size(input_file))
    xml_data = Saxerator.parser(f) do |config|
      config.output_type = :xml
    end
    category_fr_list = {}
    items = []
    output = File.open(output_file, "w")
    xml_data.for_tag(:item).reverse_each do |item|
      output << item.to_xml
    end
    output.close
  end
end

import_en = SaxImport.new 'weekly.xml', 'weekly.processed.xml'
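A hedged sketch of one thing worth trying, not a confirmed fix for the truncation: Saxerator's README shows its parser accepting an IO, so the file can be streamed instead of pre-read into a string with File.read, and writing items with plain each avoids buffering every parsed item the way reverse_each must:

require 'saxerator'

class SaxImport
  def initialize(input_file, output_file)
    # Hand Saxerator the IO so it can stream rather than hold the whole file
    xml_data = Saxerator.parser(File.open(input_file)) do |config|
      config.output_type = :xml
    end
    File.open(output_file, "w") do |output|
      # each emits items as they are parsed, in document order
      xml_data.for_tag(:item).each do |item|
        output << item.to_xml
      end
    end
  end
end

SaxImport.new 'weekly.xml', 'weekly.processed.xml'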

read file and send data to yml file

I'm reading multiple files and trying to write their data into a YAML file, but I don't know why I get nothing in my YAML file.
Do you have an idea where I made a mistake?
a = array.size
i = 0
array.each do |f|
  while i < a
    puts array[i]
    output = File.new('/home/zyriuse/documents/Ruby-On-Rails/script/Api_BK/licence.yml', 'w')
    File.readlines(f).each do |line|
      output.puts line
      output.puts line.to_yaml
      #output.puts YAML::dump(line)
    end
    i += 1
  end
end
There are two problems:
You initialize i to zero too early: when you process the first file f, you process just that first file as many times as there are files in the array, and for all following files i is already >= a, so nothing happens with them.
You call File.new on every iteration over f, so each iteration wipes out the output of the previous one.
This might work better...
output = File.new('licence.yml', 'w')
array.each do |f|
  puts f
  File.readlines(f).each do |line|
    output.puts line
    output.puts line.to_yaml
  end
end
output.close # flush buffered writes, or the file can appear empty
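If the goal is one YAML document holding every line from every file, a sketch that collects everything and serializes it with a single to_yaml call may be closer to the intent (the array of filenames and the licence.yml path come from the question; mapping filename to lines is an assumed shape):

require 'yaml'

# Map each filename to its lines (this shape is an assumption)
data = array.each_with_object({}) do |f, h|
  h[f] = File.readlines(f).map(&:chomp)
end

File.write('licence.yml', data.to_yaml)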

Parsing data to a csv file on a new line - ruby

I am trying to export data that I 'get' into a new CSV file. Currently, my code below puts everything onto a single line until it fills up, and then continues onto the next line.
I would like each imported record to start on the following line, creating a list of transactions.
def export_data
  File.open('coffee_orders.csv', 'a+') do |csv|
    puts @item_quantity = [Time.now, @item_name, @amount]
    csv << @item_quantity
  end
end
Basing it on your starting code, I'd do something like:
def export_data
  File.open('coffee_orders.csv', 'a') do |csv|
    csv << [Time.now, @item_name, @amount].join(', ')
  end
end
Or:
def export_data
  File.open('coffee_orders.csv', 'a') do |csv|
    csv << '%s, %s, %s' % [Time.now, @item_name, @amount].map(&:to_s)
  end
end
Notice, it's not necessary to use 'a+' to append to a file. Use 'a' instead, unless you also need "read" mode while the file is open. Here's what the IO.new documentation says:
"a" Write-only, starts at end of file if file exists,
otherwise creates a new file for writing.
"a+" Read-write, starts at end of file if file exists,
otherwise creates a new file for reading and
writing.
The way I'd write it for myself would be something like:
CSV_FILENAME = 'coffee_orders.csv'

def export_data
  csv_has_content = File.size?(CSV_FILENAME)
  CSV.open(CSV_FILENAME, 'a') do |csv|
    csv << %w[Time Item Amount] unless csv_has_content
    csv << [Time.now, @item_name, @amount]
  end
end
This uses Ruby's CSV class to handle all the ins-and-outs. It checks to see if the file already exists, and if it has no content it writes the header before writing the content.
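A hedged usage sketch for reading the file back, to confirm each order lands on its own row (CSV.foreach with :headers is standard library; the header names match the ones written above):

require 'csv'

CSV.foreach(CSV_FILENAME, :headers => true) do |row|
  puts "#{row['Time']}: #{row['Item']} x #{row['Amount']}"
end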
Try this. It will add a new line after each transaction, so the next append starts on a fresh line.
def export_data
  File.open('coffee_orders.csv', 'a+') do |csv|
    csv.puts @item_quantity = [Time.now, @item_name, @amount]
  end
end
Although, looking at the extension, you would probably want to keep it in CSV format:
def export_data
  File.open('coffee_orders.csv', 'a+') do |csv|
    @item_quantity = [Time.now, @item_name, @amount]
    csv.puts @item_quantity.join(',')
  end
end
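One caveat on the hand-rolled join, as a hedged aside: if @item_name can ever contain a comma or a quote, the joined line won't be valid CSV. A sketch using the standard library's CSV.generate_line, which quotes fields as needed and appends the newline itself:

require 'csv'

def export_data
  File.open('coffee_orders.csv', 'a') do |f|
    # generate_line handles quoting and ends the line with "\n"
    f.write CSV.generate_line([Time.now, @item_name, @amount])
  end
end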

trying to find the 1st instance of a string in a CSV using fastercsv

I'm trying to open a CSV file, look up a string, and then return the 2nd column of the CSV file, but only the first instance of it. I've gotten as far as the following, but unfortunately, it returns every instance. I'm a bit flummoxed.
Can the gods of Ruby help? Thanks much in advance.
M
For the purpose of this example, let's say names.csv is a file with the following:
foo, happy
foo, sad
bar, tired
foo, hungry
foo, bad
#!/usr/local/bin/ruby -w
require 'rubygems'
require 'fastercsv'
require 'pp'

FasterCSV.open('newfile.csv', 'w') do |output|
  FasterCSV.foreach('names.csv') do |lookup|
    index_PL = lookup.index('foo')
    if index_PL
      output << lookup[2]
    end
  end
end
OK, so if I want to return all instances of foo, but in a CSV, then how does that work?
What I'd like as an outcome is happy, sad, hungry, bad. I thought it would be:
FasterCSV.open('newfile.csv', 'w') do |output|
  FasterCSV.foreach('names.csv') do |lookup|
    index_PL = lookup.index('foo')
    if index_PL
      build_str << "," << lookup[2]
    end
    output << build_str
  end
end
but it does not seem to work.
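A hedged sketch of how that follow-up could work: build_str is never initialized, and output << build_str runs once per input row, so collecting the matches in an array and writing them as a single row after the loop avoids both problems (the column index 2 mirrors the thread's code):

FasterCSV.open('newfile.csv', 'w') do |output|
  matches = []
  FasterCSV.foreach('names.csv') do |lookup|
    matches << lookup[2] if lookup.index('foo')
  end
  output << matches # one row: happy, sad, hungry, bad
end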
Replace foreach with open (to get an Enumerable) and find:
FasterCSV.open('newfile.csv', 'w') do |output|
  output << FasterCSV.open('names.csv').find { |r| r.index('foo') }[2]
end
The index call will return nil if it doesn't find anything; that means that the find will give you the first row that has 'foo' and you can pull out the column at index 2 from the result.
If you're not certain that names.csv will have what you're looking for then a bit of error checking would be advisable:
FasterCSV.open('newfile.csv', 'w') do |output|
  foos_row = FasterCSV.open('names.csv').find { |r| r.index('foo') }
  if foos_row
    output << foos_row[2]
  else
    # complain or something
  end
end
Or, if you want to silently ignore the lack of 'foo' and use an empty string instead, you could do something like this:
FasterCSV.open('newfile.csv', 'w') do |output|
  output << (FasterCSV.open('names.csv').find { |r| r.index('foo') } || ['', '', ''])[2]
end
I'd probably go with the "complain if it isn't found" version though.
